# Spark in the Cloud (GCP)

## Connect to GCS

Instead of downloading all our parquet files locally, let's read them directly from our data lake on GCS.

First, upload the local partitions to GCS:

```bash
cd <project_root>
gsutil -m cp -r data/taxi_ingest_data/parts gs://<data_lake>/data/parts/
```

The goal is to be able to run:

```py
df_green = spark.read.parquet('gs://<data_lake>/data/parts/green/*/*')
```

1. Get the Cloud Storage connector for Hadoop, which lets PySpark read from GCS:

    ```bash
    # starting from data/
    mkdir lib
    cd lib
    gsutil cp gs://hadoop-lib/gcs/gcs-connector-hadoop3-latest.jar gcs-connector-hadoop3-latest.jar
    ```
2. Install the connector.
    1. If you are not on a GCE VM, obtain a service account key JSON file and set `creds_loc` to its path. The service account must have access to GCS.
    1. Otherwise, ensure the GCE VM's service account has sufficient access to GCS.
    1. Configure Spark via `SparkConf` and `SparkContext`.

Reference docs: https://cloud.google.com/dataproc/docs/concepts/connectors/cloud-storage#getting_the_connector
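The two cases in the install steps above (with or without a key file) can be sketched as one small helper that collects the settings in a plain dict. This is only an illustration: `gcs_spark_settings` is a hypothetical name, not part of PySpark or the connector. It assumes the property names used in the cells below, and relies on Spark copying any property prefixed with `spark.hadoop.` into the Hadoop configuration.

```python
from typing import Dict, Optional

# Connector class names, as used in the notebook cells below.
GCS_FS_IMPL = "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem"
GCS_ABSTRACT_FS_IMPL = "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS"


def gcs_spark_settings(jar_path: str, creds_loc: Optional[str] = None) -> Dict[str, str]:
    """Hypothetical helper: gather the Spark properties for reading gs:// paths."""
    settings = {
        "spark.jars": jar_path,
        # spark.hadoop.* properties are forwarded to the Hadoop configuration.
        "spark.hadoop.fs.gs.impl": GCS_FS_IMPL,
        "spark.hadoop.fs.AbstractFileSystem.gs.impl": GCS_ABSTRACT_FS_IMPL,
    }
    if creds_loc is not None:
        # Only needed outside a GCE VM, where no VM service account is available.
        settings["spark.hadoop.google.cloud.auth.service.account.enable"] = "true"
        settings["spark.hadoop.google.cloud.auth.service.account.json.keyfile"] = creds_loc
    return settings


# On a GCE VM: no key file, so only the jar and filesystem classes are set.
settings = gcs_spark_settings("../data/lib/gcs-connector-hadoop3-latest.jar")
```

A dict like this could be applied with `SparkConf().setAll(list(settings.items()))`. The cells below instead set the same values step by step on `SparkConf` and the Hadoop configuration.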

In [1]:
import pyspark
from pyspark.sql import SparkSession
from pyspark.conf import SparkConf
from pyspark.context import SparkContext

In [2]:
# spark = SparkSession.builder.master("local[*]").appName("test").getOrCreate()


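# Outside a GCE VM, set creds_loc to the service account key JSON, then
# uncomment the two .set(...) lines below, adding a trailing `\` to the
# 'spark.jars' line and to the first uncommented line.
# creds_loc = 'path/to/key.json'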

conf = SparkConf() \
    .setMaster('local[*]') \
    .setAppName('test_cloud') \
    .set('spark.jars', '../data/lib/gcs-connector-hadoop3-latest.jar')
    # .set('spark.hadoop.google.cloud.auth.service.account.enable', 'true')
    # .set('spark.hadoop.google.cloud.auth.service.account.json.keyfile', creds_loc)

In [3]:
sc = SparkContext(conf=conf)

hadoop_conf = sc._jsc.hadoopConfiguration()

hadoop_conf.set("fs.AbstractFileSystem.gs.impl",  "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
hadoop_conf.set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
# hadoop_conf.set("fs.gs.auth.service.account.json.keyfile", creds_loc)
# hadoop_conf.set("fs.gs.auth.service.account.enable", "true")

23/02/24 17:23:00 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


In [4]:
spark = SparkSession.builder \
    .config(conf=sc.getConf()) \
    .getOrCreate()

In [15]:
import os
from dotenv import load_dotenv

In [17]:
load_dotenv()

DATA_LAKE = os.getenv('DATA_LAKE')

In [18]:
df_green = spark.read.parquet(f'gs://{DATA_LAKE}/data/parts/green/2020/*')
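If `DATA_LAKE` is missing from the environment, `os.getenv` returns `None` and the f-string above silently becomes `gs://None/...`. A minimal sketch of a guard, assuming the `data/parts/green/<year>/` layout used in this notebook (`green_parquet_glob` is a hypothetical helper name):

```python
from typing import Optional


def green_parquet_glob(bucket: Optional[str], year: str = "*") -> str:
    """Hypothetical helper: build the gs:// glob for the green taxi partitions."""
    if not bucket:
        # Fail early instead of letting Spark look for a bucket named "None".
        raise ValueError("DATA_LAKE is not set; check your .env file")
    return f"gs://{bucket}/data/parts/green/{year}/*"


# "my-bucket" is a placeholder; in the notebook this would be DATA_LAKE.
path = green_parquet_glob("my-bucket", "2020")
```

The resulting string can be passed to `spark.read.parquet` in place of the inline f-string.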

In [19]:
df_green.printSchema()

root
 |-- VendorID: integer (nullable = true)
 |-- lpep_pickup_datetime: timestamp (nullable = true)
 |-- lpep_dropoff_datetime: timestamp (nullable = true)
 |-- store_and_fwd_flag: string (nullable = true)
 |-- RatecodeID: integer (nullable = true)
 |-- PULocationID: integer (nullable = true)
 |-- DOLocationID: integer (nullable = true)
 |-- passenger_count: integer (nullable = true)
 |-- trip_distance: float (nullable = true)
 |-- fare_amount: float (nullable = true)
 |-- extra: float (nullable = true)
 |-- mta_tax: float (nullable = true)
 |-- tip_amount: float (nullable = true)
 |-- tolls_amount: float (nullable = true)
 |-- ehail_fee: float (nullable = true)
 |-- improvement_surcharge: float (nullable = true)
 |-- total_amount: float (nullable = true)
 |-- payment_type: integer (nullable = true)
 |-- trip_type: integer (nullable = true)
 |-- congestion_surcharge: float (nullable = true)

