## 1. Introduction

This notebook shows how to connect a Jupyter notebook to a Spark cluster, read a local CSV file, and store it in Hadoop (HDFS) as partitioned Parquet files.

## 2. Connection to Spark Cluster

To connect to the Spark cluster, create a SparkSession object with the following parameters:

+ **appName:** application name displayed in the [Spark Master Web UI](http://localhost:8080/);
+ **master:** Spark Master URL, the same one used by the Spark Workers;
+ **spark.executor.memory:** must be less than or equal to the SPARK_WORKER_MEMORY value in the docker compose config.

In [1]:
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.\
        builder.\
        appName("pyspark-notebook").\
        master("spark://spark-master:7077").\
        config("spark.executor.memory", "512m").\
        getOrCreate()

24/05/31 17:16:45 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
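
The constraint on `spark.executor.memory` listed earlier can be checked before the session is created. The following is a minimal sketch rather than part of the original setup: the helper names `memory_to_bytes` and `executor_fits_worker` are hypothetical, the worker value `"1g"` is only an assumed example of a `SPARK_WORKER_MEMORY` setting, and the parser handles only whole numbers followed by a `k`, `m`, `g`, or `t` suffix.

```python
import re

# Binary multipliers for the JVM-style size suffixes used in Spark memory settings.
_UNITS = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3, "t": 1024 ** 4}


def memory_to_bytes(value: str) -> int:
    """Convert a size string such as '512m' or '1g' to a number of bytes."""
    match = re.fullmatch(r"(\d+)([kmgt])b?", value.strip().lower())
    if match is None:
        raise ValueError(f"Unrecognized memory size: {value!r}")
    number, unit = match.groups()
    return int(number) * _UNITS[unit]


def executor_fits_worker(executor_memory: str, worker_memory: str) -> bool:
    """Return True when the executor memory does not exceed the worker memory."""
    return memory_to_bytes(executor_memory) <= memory_to_bytes(worker_memory)


# "1g" is an assumed SPARK_WORKER_MEMORY value; use the one from your compose file.
print(executor_fits_worker("512m", "1g"))
```

If the check returns `False`, either lower `spark.executor.memory` or raise the worker memory; otherwise the application may wait indefinitely for resources.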


## 3. Load and Store Data
We will now load data from a local CSV file and store it in Hadoop, partitioned by a column.
Afterward, you can use the Hadoop UI to explore the saved Parquet files.
Open the Hadoop UI at 'http://localhost:9870' (Utilities -> Browse the file system).

In [2]:
import pandas
from pyspark.sql.types import *
from pyspark.sql import functions as F
import os
import time
# Unix timestamp (in seconds) used as a unique suffix for the output path
epochNow = int(time.time())

In [1]:
# Create a Spark DataFrame from a local CSV file
brewDF = spark.read.csv("./data/breweries.csv", header=True, inferSchema=True)

NameError: name 'spark' is not defined

In [2]:
# Show first 5 rows
brewDF.show(5)

NameError: name 'brewDF' is not defined

In [None]:
csvName = "breweries"
# Write the DataFrame to HDFS as Parquet files.
# partitionBy("city") creates one subdirectory per distinct "city" value.
# No "header" option is needed: it applies to CSV output, not Parquet.
brewDF.write \
        .partitionBy("city") \
        .mode("overwrite") \
        .parquet("hdfs://namenode:9000/gold_zone/{}_{}.parquet".format(csvName, epochNow))
print("Breweries DataFrame stored in Hadoop.")

[Stage 12:>                                                         (0 + 1) / 1]
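
To make it easier to find the files in the Hadoop UI, the sketch below reproduces how the output path is built and what a partition directory looks like. It is an illustration under assumptions: the helper names `parquet_path` and `partition_dir` are hypothetical, the timestamp and the city value `"Portland"` are made-up examples, and the `column=value` directory naming shown here holds for simple values without special characters.

```python
def parquet_path(name: str, epoch: int) -> str:
    """Build the HDFS output path using the same pattern as the write cell."""
    return "hdfs://namenode:9000/gold_zone/{}_{}.parquet".format(name, epoch)


def partition_dir(base: str, column: str, value: str) -> str:
    """Return the Hive-style partition directory created under the base path."""
    return "{}/{}={}".format(base, column, value)


# 1717175805 is an arbitrary example timestamp, "Portland" an example city value.
base = parquet_path("breweries", 1717175805)
print(base)
print(partition_dir(base, "city", "Portland"))
```

Because the `city` values are encoded in the directory names, the column is not stored inside the Parquet files themselves; Spark restores it from the path when the dataset is read back.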

In [None]:
# Read from HDFS to confirm it was successfully stored
df_load = spark.read.parquet("hdfs://namenode:9000/gold_zone/{}_{}.parquet".format(csvName,epochNow))
print("Breweries DataFrame read from Hadoop:")
df_load.show()
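
Displaying the rows confirms that the files are readable, but not that every row was written. One possible extra check, sketched here as an assumption rather than as part of the original notebook, is to compare row counts. The helper name `row_counts_match` is hypothetical; it only relies on both objects having a `count()` method, as Spark DataFrames do.

```python
def row_counts_match(original_df, reloaded_df) -> bool:
    """Return True when both DataFrames report the same number of rows."""
    return original_df.count() == reloaded_df.count()


# In this notebook the call would be:
# row_counts_match(brewDF, df_load)
```

Matching counts do not prove the contents are identical, so treat this as a quick sanity check only.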