# First look at Spark/PySpark

## Import libraries

In [1]:
import pyspark
from pyspark.sql import SparkSession

In [2]:
pyspark.__version__

'3.3.3'

In [3]:
pyspark.__file__

'/opt/spark/python/pyspark/__init__.py'

## Instantiate a Spark session

We need to instantiate a **SparkSession**, an object that we use to interact with Spark.

```py
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("test") \
    .getOrCreate()
```
* **`SparkSession`** class is the entry point into all functionality in Spark. To create a SparkSession, use the **`builder`** attribute.
* **`master()`** sets the Spark master URL to connect to. The `local` string means that Spark runs locally, in a single JVM, rather than on a cluster. `[*]` means that Spark uses as many worker threads as there are logical CPU cores on the machine.
* **`appName()`** sets a name for the application/session, which will be shown in the Spark web UI.
* **`getOrCreate()`** gets an existing SparkSession or, if there is no existing one, creates a new one based on the options set in this builder.


In [4]:
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("test") \
    .getOrCreate()

24/02/22 08:52:13 WARN Utils: Your hostname, GRAD0365UBUNTU resolves to a loopback address: 127.0.1.1; using 10.5.4.63 instead (on interface wlp0s20f3)
24/02/22 08:52:13 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


24/02/22 08:52:14 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
24/02/22 08:52:14 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.


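Once we've instantiated a session, we can access the Spark UI by browsing to **`localhost:4040`**, the default port. If that port is already taken, Spark tries the next one; in the log above it fell back to port 4041, so the UI is at `localhost:4041`. The UI displays all current jobs. Since we've just created the instance, there should be no jobs currently running.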

## Read a CSV file

For this example we will use the [High Volume For-Hire Vehicle Trip Records for January 2021](https://github.com/DataTalksClub/nyc-tlc-data/releases/tag/fhvhv) file.

In [5]:
!wget https://github.com/DataTalksClub/nyc-tlc-data/releases/download/fhvhv/fhvhv_tripdata_2021-01.csv.gz -O ../data/fhvhv_tripdata_2021-01.csv.gz

--2024-02-22 08:52:15--  https://github.com/DataTalksClub/nyc-tlc-data/releases/download/fhvhv/fhvhv_tripdata_2021-01.csv.gz
Resolving github.com (github.com)... 140.82.121.3
Connecting to github.com (github.com)[140.82.121.3]:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://objects.githubusercontent.com/github-production-release-asset-2e65be/513814948/035746e8-4e24-47e8-a3ce-edcf6d1b11c7?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAVCODYLSA53PQK4ZA%2F20240222%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20240222T075215Z&X-Amz-Expires=300&X-Amz-Signature=5969715ea17004a87d69d86fda7f0eb3eb0327c525509e1c77c700bee87aaed1&X-Amz-SignedHeaders=host&actor_id=0&key_id=0&repo_id=513814948&response-content-disposition=attachment%3B%20filename%3Dfhvhv_tripdata_2021-01.csv.gz&response-content-type=application%2Foctet-stream [following]
--2024-02-22 08:52:15--  https://objects.githubusercontent.com/github-production-release-asset-2e65be/51381494

Similarly to Pandas, Spark can read CSV files into **dataframes**, a tabular data structure. Unlike Pandas, Spark can handle much bigger datasets, but by default it does not infer the datatype of each column: every column is read as a string.

Let's read the file and create a dataframe.

In [6]:
df = spark.read \
    .option("header", "true") \
    .csv("../data/fhvhv_tripdata_2021-01.csv.gz")

* **`read`** returns a `DataFrameReader`, the object used to load data from external sources.
* **`option()`** sets an option on the reader. In this case, we're specifying that the first line of the CSV file contains the column names.
* **`csv()`** reads the CSV file at the given path and returns a dataframe.

We can see the contents of the dataframe with **`df.show()`** (only the first 20 rows are shown by default) or **`df.head()`**, which returns the first rows as `Row` objects.

In [7]:
df.show()

+-----------------+--------------------+-------------------+-------------------+------------+------------+-------+
|hvfhs_license_num|dispatching_base_num|    pickup_datetime|   dropoff_datetime|PULocationID|DOLocationID|SR_Flag|
+-----------------+--------------------+-------------------+-------------------+------------+------------+-------+
|           HV0003|              B02682|2021-01-01 00:33:44|2021-01-01 00:49:07|         230|         166|   null|
|           HV0003|              B02682|2021-01-01 00:55:19|2021-01-01 01:18:21|         152|         167|   null|
|           HV0003|              B02764|2021-01-01 00:23:56|2021-01-01 00:38:05|         233|         142|   null|
|           HV0003|              B02764|2021-01-01 00:42:51|2021-01-01 00:45:50|         142|         143|   null|
|           HV0003|              B02764|2021-01-01 00:48:14|2021-01-01 01:08:42|         143|          78|   null|
|           HV0005|              B02510|2021-01-01 00:06:59|2021-01-01 00:43:01|

In [8]:
df.head(2)

[Row(hvfhs_license_num='HV0003', dispatching_base_num='B02682', pickup_datetime='2021-01-01 00:33:44', dropoff_datetime='2021-01-01 00:49:07', PULocationID='230', DOLocationID='166', SR_Flag=None),
 Row(hvfhs_license_num='HV0003', dispatching_base_num='B02682', pickup_datetime='2021-01-01 00:55:19', dropoff_datetime='2021-01-01 01:18:21', PULocationID='152', DOLocationID='167', SR_Flag=None)]

We can also check the current schema with **`df.printSchema()`**; you will notice that all columns are typed as strings.

In [9]:
df.printSchema()

root
 |-- hvfhs_license_num: string (nullable = true)
 |-- dispatching_base_num: string (nullable = true)
 |-- pickup_datetime: string (nullable = true)
 |-- dropoff_datetime: string (nullable = true)
 |-- PULocationID: string (nullable = true)
 |-- DOLocationID: string (nullable = true)
 |-- SR_Flag: string (nullable = true)



### Infer datatypes

We can use a trick with Pandas to infer the datatypes:
1. Create a smaller CSV file with the first 100 records.
2. Import Pandas and create a Pandas dataframe. This dataframe will have inferred datatypes.
3. Create a Spark dataframe from the Pandas dataframe and check its schema.
```py
spark.createDataFrame(df_pandas).schema
```
4. Based on the output of the previous method, import the datatypes needed from **`pyspark.sql.types`**, and create a **`StructType`** object containing a list of **`StructField`** objects, one per column. **`types`** contains all of the available data types for Spark dataframes.
```py
from pyspark.sql.types import StructType, StructField, StringType, TimestampType, IntegerType
schema = StructType([StructField(...), StructField(...), ...])
```
5. Create a new Spark dataframe and include the schema as an option.
```py
df = spark.read \
    .option("header", "true") \
    .schema(schema) \
    .csv("../data/fhvhv_tripdata_2021-01.csv.gz")
```
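
Steps 1 and 2 can also be sketched without decompressing the file first: `pandas.read_csv` accepts an `nrows` argument and infers gzip compression from the file suffix. The example below builds its own small gzipped CSV so that it is self-contained; the file name and the rows are hypothetical, not the real trip data.

```python
import gzip
import os
import tempfile

import pandas as pd

# Build a small hypothetical gzipped CSV so the sketch is self-contained
tmp_dir = tempfile.mkdtemp()
gz_path = os.path.join(tmp_dir, "sample.csv.gz")
with gzip.open(gz_path, "wt") as f:
    f.write("dispatching_base_num,pickup_datetime,PULocationID,SR_Flag\n")
    for i in range(500):
        f.write(f"B02682,2021-01-01 00:33:44,{i % 260 + 1},\n")

# nrows=100 stops after 100 data rows; compression is inferred from the .gz suffix
df_sample = pd.read_csv(gz_path, nrows=100)

print(len(df_sample))
print(df_sample.dtypes)
```

An entirely empty column such as `SR_Flag` comes out as `float64` here, which is why the inferred types still need a manual review before they are turned into a Spark schema.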

In [11]:
!gzip -dk "../data/fhvhv_tripdata_2021-01.csv.gz"

gzip: ../data/fhvhv_tripdata_2021-01.csv already exists; do you wish to overwrite (y or n)? ^C


In [12]:
!head -n 101 "../data/fhvhv_tripdata_2021-01.csv" > "../data/head.csv"
!head -n 5 "../data/head.csv"

hvfhs_license_num,dispatching_base_num,pickup_datetime,dropoff_datetime,PULocationID,DOLocationID,SR_Flag
HV0003,B02682,2021-01-01 00:33:44,2021-01-01 00:49:07,230,166,
HV0003,B02682,2021-01-01 00:55:19,2021-01-01 01:18:21,152,167,
HV0003,B02764,2021-01-01 00:23:56,2021-01-01 00:38:05,233,142,
HV0003,B02764,2021-01-01 00:42:51,2021-01-01 00:45:50,142,143,


In [13]:
!wc -l "../data/head.csv"

101 ../data/head.csv


In [14]:
import pandas as pd
df_pandas = pd.read_csv("../data/head.csv")
df_pandas.dtypes

hvfhs_license_num        object
dispatching_base_num     object
pickup_datetime          object
dropoff_datetime         object
PULocationID              int64
DOLocationID              int64
SR_Flag                 float64
dtype: object

In [15]:
# workaround to avoid AttributeError: 'DataFrame' object has no attribute 'iteritems'
pd.DataFrame.iteritems = pd.DataFrame.items

spark.createDataFrame(df_pandas).schema

StructType([StructField('hvfhs_license_num', StringType(), True), StructField('dispatching_base_num', StringType(), True), StructField('pickup_datetime', StringType(), True), StructField('dropoff_datetime', StringType(), True), StructField('PULocationID', LongType(), True), StructField('DOLocationID', LongType(), True), StructField('SR_Flag', DoubleType(), True)])

In [16]:
# modify the previous output to have the data types as we want
from pyspark.sql.types import StructType, StructField, StringType, TimestampType, IntegerType 

schema = StructType([
        StructField('hvfhs_license_num', StringType(), True),
        StructField('dispatching_base_num', StringType(), True), 
        StructField('pickup_datetime', TimestampType(), True), 
        StructField('dropoff_datetime', TimestampType(), True), 
        StructField('PULocationID', IntegerType(), True), 
        StructField('DOLocationID', IntegerType(), True), 
        StructField('SR_Flag', StringType(), True)
])

In [17]:
df = spark.read \
    .option("header", "true") \
    .schema(schema) \
    .csv("../data/fhvhv_tripdata_2021-01.csv.gz")

df.printSchema()

root
 |-- hvfhs_license_num: string (nullable = true)
 |-- dispatching_base_num: string (nullable = true)
 |-- pickup_datetime: timestamp (nullable = true)
 |-- dropoff_datetime: timestamp (nullable = true)
 |-- PULocationID: integer (nullable = true)
 |-- DOLocationID: integer (nullable = true)
 |-- SR_Flag: string (nullable = true)



### Infer datatypes: a quicker option

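Another option, which saves us all the previous steps, is to set the **`inferSchema`** option to `True`. Spark then makes an extra pass over the data to work out the column types, so reading takes longer on large files. Note that `SR_Flag` is inferred as an integer here, whereas we declared it as a string in the manual schema.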

In [18]:
df = spark.read \
    .option("header", True) \
    .option("inferSchema", True) \
    .csv("../data/fhvhv_tripdata_2021-01.csv.gz")

df.printSchema()

[Stage 4:>                                                          (0 + 1) / 1]

root
 |-- hvfhs_license_num: string (nullable = true)
 |-- dispatching_base_num: string (nullable = true)
 |-- pickup_datetime: timestamp (nullable = true)
 |-- dropoff_datetime: timestamp (nullable = true)
 |-- PULocationID: integer (nullable = true)
 |-- DOLocationID: integer (nullable = true)
 |-- SR_Flag: integer (nullable = true)



                                                                                

## Partitions / saving to parquet files

A **Spark cluster** is composed of multiple **executors**. Each executor can process data independently in order to parallelize and speed up work.

In the previous example we read a single large gzip-compressed CSV file. Gzip is not a splittable format, so this file can only be read by a single task on one executor, which means that the rest of the executors will be idle and we will not take advantage of parallelization.

To avoid this we can split a file into multiple parts so that each executor can take care of a part and have all executors working simultaneously. These splits are called **partitions**.

In our example, we partition the dataframe that we obtained when reading the CSV file, and save the result in parquet format. This will create multiple parquet files.

In [19]:
# create 24 partitions of the dataframe
df = df.repartition(24)

# save the partitions in parquet format
df.write.parquet("../data/fhvhv/2021/01")

[Stage 5:>                                                          (0 + 1) / 1]

24/02/22 08:54:14 WARN MemoryManager: Total allocation exceeds 95,00% (960.285.889 bytes) of heap memory
Scaling row group sizes to 89,43% for 8 writers





We can check in the Spark UI the progress of our jobs, which are divided into stages that contain tasks. The tasks in a stage will not start until all tasks in the previous stage are finished.

By default, Spark's parallelism in local mode matches the number of CPU cores available, and each partition is processed by one task. For file sources, the number of partitions also depends on how the input can be split.

Besides the 24 parquet files, you should also see a `_SUCCESS` file which should be empty. This file is created to indicate that the job has finished successfully.

Trying to write the files again will raise an error because, by default, Spark refuses to write to a path that already exists. You can force an overwrite with the `mode` argument.

In [20]:
df.write.parquet("../data/fhvhv/2021/01/", mode="overwrite")

[Stage 8:>                                                          (0 + 1) / 1]

24/02/22 08:54:46 WARN MemoryManager: Total allocation exceeds 95,00% (960.285.889 bytes) of heap memory
Scaling row group sizes to 89,43% for 8 writers





The opposite of repartitioning into more partitions is **coalescing**: merging the existing partitions into a smaller number of partitions, which is done with `df.coalesce()`.

## Read parquet files

In [21]:
df = spark.read.parquet("../data/fhvhv/2021/01")
df.printSchema()

root
 |-- hvfhs_license_num: string (nullable = true)
 |-- dispatching_base_num: string (nullable = true)
 |-- pickup_datetime: timestamp (nullable = true)
 |-- dropoff_datetime: timestamp (nullable = true)
 |-- PULocationID: integer (nullable = true)
 |-- DOLocationID: integer (nullable = true)
 |-- SR_Flag: integer (nullable = true)

