# Connecting local Spark to GCS

## Import libraries

In [1]:
import pyspark
from pyspark.sql import SparkSession
from pyspark.conf import SparkConf
from pyspark.context import SparkContext

## Configure Spark

### SparkConf - configuration object

Using the `builder` attribute alone is not enough in this case: we need to provide some configuration. We create a **`SparkConf()`** configuration object that runs Spark in local mode, sets the application name, and specifies the locations of both the GCS connector jar file and the Google credentials file.

In [2]:
# Path to the service account key file used to authenticate against GCS.
credentials_location = "/home/sgrodriguez/.google/credentials/google_credentials.json"

conf = (
    SparkConf()
    .setMaster("local[*]")  # local mode, using all available cores
    .setAppName("connect local spark to gcs")
    .set("spark.jars", "../../lib/gcs-connector-hadoop3-2.2.20.jar")  # GCS connector jar
    .set("spark.hadoop.google.cloud.auth.service.account.enable", "true")
    .set("spark.hadoop.google.cloud.auth.service.account.json.keyfile", credentials_location)
)
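Both locations in this configuration are plain file paths, and a mistyped one may only surface later, for example when Spark first tries to load the connector class. A minimal sketch of a pre-flight check follows; the helper name `missing_paths` is hypothetical and not part of PySpark.

```python
from pathlib import Path


def missing_paths(*paths):
    """Return the given paths that do not exist on the local filesystem."""
    return [str(p) for p in paths if not Path(p).expanduser().exists()]


# The two locations used in the configuration cell; the relative jar path
# is resolved against the current working directory.
missing = missing_paths(
    "../../lib/gcs-connector-hadoop3-2.2.20.jar",
    "/home/sgrodriguez/.google/credentials/google_credentials.json",
)
if missing:
    print("Not found:", missing)
```

Running this before creating the context makes it easier to tell a path problem apart from an authentication problem.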

### SparkContext

In previous notebooks, running `SparkSession.builder` implicitly created a context, which represents a connection to a Spark cluster. This time we need to create and configure the context explicitly.

In [3]:
sc = SparkContext(conf=conf)

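# Access the Hadoop configuration of the underlying Java SparkContext.
hadoop_conf = sc._jsc.hadoopConfiguration()

# Register the GCS connector classes as the handlers for gs:// paths.
hadoop_conf.set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
hadoop_conf.set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
# Authenticate with the service account key file.
hadoop_conf.set("fs.gs.auth.service.account.json.keyfile", credentials_location)
hadoop_conf.set("fs.gs.auth.service.account.enable", "true")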

24/02/24 11:06:53 WARN Utils: Your hostname, GRAD0365UBUNTU resolves to a loopback address: 127.0.1.1; using 192.168.68.103 instead (on interface wlp0s20f3)
24/02/24 11:06:53 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
24/02/24 11:06:53 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
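The four Hadoop properties above can also be kept in one dictionary and applied in a loop, which makes them easier to reuse across notebooks. This is only a sketch: `gcs_hadoop_settings` and `apply_settings` are hypothetical helper names, and the only assumption about `hadoop_conf` is that it exposes the `set(key, value)` method used in the cell above.

```python
def gcs_hadoop_settings(keyfile):
    """Return the four Hadoop properties from the cell above as a dict."""
    return {
        "fs.AbstractFileSystem.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS",
        "fs.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem",
        "fs.gs.auth.service.account.json.keyfile": keyfile,
        "fs.gs.auth.service.account.enable": "true",
    }


def apply_settings(hadoop_conf, settings):
    """Call hadoop_conf.set(key, value) for every entry in settings."""
    for key, value in settings.items():
        hadoop_conf.set(key, value)


# In the notebook (assumes `sc` and `credentials_location` from earlier cells):
# apply_settings(sc._jsc.hadoopConfiguration(), gcs_hadoop_settings(credentials_location))
```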


### SparkSession

We can now instantiate a SparkSession.

In [4]:
spark = SparkSession.builder \
    .config(conf=sc.getConf()) \
    .getOrCreate()

## Check connection to GCS

Let's check if our connection to GCS is successful by reading some of the Parquet files we have in our bucket.

In [5]:
df_green = spark.read.parquet("gs://dataeng_zoomcamp_bucket_warm-rock-411419/spark/pq/green/*/*")

In [6]:
df_green.show(5)

+--------+--------------------+---------------------+------------------+----------+------------+------------+---------------+-------------+-----------+-----+-------+----------+------------+---------+---------------------+------------+------------+---------+--------------------+
|VendorID|lpep_pickup_datetime|lpep_dropoff_datetime|store_and_fwd_flag|RatecodeID|PULocationID|DOLocationID|passenger_count|trip_distance|fare_amount|extra|mta_tax|tip_amount|tolls_amount|ehail_fee|improvement_surcharge|total_amount|payment_type|trip_type|congestion_surcharge|
+--------+--------------------+---------------------+------------------+----------+------------+------------+---------------+-------------+-----------+-----+-------+----------+------------+---------+---------------------+------------+------------+---------+--------------------+
|       2| 2020-01-12 18:15:04|  2020-01-12 18:19:52|                 N|         1|          41|          41|              1|         0.78|        5.5|  0.0|    0.

In [7]:
df_green.count()

2304517
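
The read-and-count check can be wrapped in a small function so that a failed connection reports which path was being read. This is a rough sketch rather than part of the original notebook: `count_rows` is a hypothetical helper, and it relies only on the `spark.read.parquet(...)` and `count()` calls already used above.

```python
def count_rows(spark, path):
    """Read the Parquet files at `path` and return the number of rows.

    Any failure is re-raised with the path included in the message.
    """
    try:
        return spark.read.parquet(path).count()
    except Exception as exc:
        raise RuntimeError(f"Could not read {path!r}: {exc}") from exc


# In the notebook (assumes the `spark` session created earlier):
# count_rows(spark, "gs://dataeng_zoomcamp_bucket_warm-rock-411419/spark/pq/green/*/*")
```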