# Spark DataFrames

We can read the parquet files created in the previous section into a dataframe. Parquet files store their schema, so, unlike with CSV, we do not need to specify or infer it when reading them. This is also one of the reasons parquet files are smaller than CSVs: because the type of every column is known, values can be stored and compressed more efficiently (for example, as 4-byte integers instead of 8-byte longs).
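To make the size argument concrete, here is a small sketch in plain Python (no Spark needed). It only illustrates why knowing the column types helps. It is not a description of how parquet encodes data internally, which relies on its own encodings and compression. The sample values mirror the trip data shown in this notebook, and all variable names are made up for the example.

```python
import struct
from datetime import datetime, timezone

# Sample values in the style of the trip data (chosen for illustration).
location_id = 256
pickup = datetime(2021, 1, 16, 21, 58, 50, tzinfo=timezone.utc)

# Without a schema (CSV), a timestamp is stored as text.
timestamp_as_text = pickup.strftime("%Y-%m-%d %H:%M:%S").encode("utf-8")

# With a known schema, values fit in fixed-width binary fields.
timestamp_as_int64 = struct.pack("<q", int(pickup.timestamp()))  # seconds since epoch
id_as_int32 = struct.pack("<i", location_id)  # 32-bit integer
id_as_int64 = struct.pack("<q", location_id)  # 64-bit long

sizes = {
    "timestamp_as_text": len(timestamp_as_text),
    "timestamp_as_int64": len(timestamp_as_int64),
    "id_as_int32": len(id_as_int32),
    "id_as_int64": len(id_as_int64),
}

for name, size in sizes.items():
    print(f"{name}: {size} bytes")
```

The text timestamp takes 19 bytes against 8 for the binary one, and the 32-bit integer takes half the space of the 64-bit long. Multiplied over millions of rows, such differences add up.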

## Import libraries

In [1]:
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, TimestampType, IntegerType 

## Create a Spark session

In [2]:
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("test") \
    .getOrCreate()

24/02/22 08:39:43 WARN Utils: Your hostname, GRAD0365UBUNTU resolves to a loopback address: 127.0.1.1; using 10.5.4.63 instead (on interface wlp0s20f3)
24/02/22 08:39:43 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


24/02/22 08:39:44 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


## Read (partitioned) parquet files

In [3]:
df = spark.read.parquet("../data/fhvhv/2021/01/")
df.printSchema()

root
 |-- hvfhs_license_num: string (nullable = true)
 |-- dispatching_base_num: string (nullable = true)
 |-- pickup_datetime: timestamp (nullable = true)
 |-- dropoff_datetime: timestamp (nullable = true)
 |-- PULocationID: integer (nullable = true)
 |-- DOLocationID: integer (nullable = true)
 |-- SR_Flag: integer (nullable = true)



We can apply many Pandas-like operations to Spark dataframes.
* To select a few columns, we use the **`select()`** method.
* To filter rows by some value, we use **`filter()`**.
* Many other operations are described in [this quickstart guide from the official Spark documentation](https://spark.apache.org/docs/latest/api/python/getting_started/quickstart_df.html).

In [4]:
df.select("pickup_datetime", "PULocationID", "dropoff_datetime", "DOLocationID") \
    .filter(df.PULocationID == 256) \
    .show() 

+-------------------+------------+-------------------+------------+
|    pickup_datetime|PULocationID|   dropoff_datetime|DOLocationID|
+-------------------+------------+-------------------+------------+
|2021-01-16 21:58:50|         256|2021-01-16 22:16:27|         177|
|2021-01-17 19:36:28|         256|2021-01-17 19:52:28|          49|
|2021-01-16 20:19:14|         256|2021-01-16 20:33:52|          36|
|2021-01-11 10:42:29|         256|2021-01-11 10:46:44|         217|
|2021-01-30 00:23:44|         256|2021-01-30 00:42:09|         168|
|2021-01-14 15:55:47|         256|2021-01-14 16:03:43|         255|
|2021-01-10 19:11:19|         256|2021-01-10 19:32:45|          36|
|2021-01-21 16:18:24|         256|2021-01-21 16:31:47|          34|
|2021-01-04 07:03:43|         256|2021-01-04 07:20:17|         162|
|2021-01-10 00:55:17|         256|2021-01-10 01:04:55|         144|
|2021-01-01 01:14:38|         256|2021-01-01 01:17:29|         255|
|2021-01-16 12:11:12|         256|2021-01-16 12:

## Actions vs transformations

Spark uses a programming concept called **lazy evaluation**. Before Spark does anything with the data in your program, it first builds a step-by-step plan of the functions and data it will need. Spark builds a **DAG** (directed acyclic graph) from your code and postpones execution until the last possible moment. For example, in the code above, Spark does not run any job until we call the **`show()`** method.

In Spark we distinguish between transformations and actions:
* **Transformations:** lazy operations, which are not executed right away. These are operations we use for transforming the data, such as:
  * Selecting columns.
  * Filtering.
  * Joins.
  * Group by operations.
  * Repartitioning.
  * ...
* **Actions:** eager operations, which are executed right away. Computation only happens when an action is triggered; at that point the job runs all of the transformations that lead up to the action in order to produce a result. Examples of actions are:
  * Show, take, head.
  * Write (reading a file only defines the dataframe, although inferring a schema may trigger a small job).
  * ...
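The difference between the two kinds of operations can be sketched with a toy model in plain Python. This is not how Spark is implemented; `ToyLazyFrame` and its methods are hypothetical names invented for this illustration, and the sample rows only imitate the trip data. Transformations merely append a step to a plan, and the plan runs only when the action `take()` is called.

```python
from itertools import islice
from typing import Any, Callable, Dict, Iterator, List, Tuple

Row = Dict[str, Any]
Step = Callable[[Iterator[Row]], Iterator[Row]]


class ToyLazyFrame:
    """Hypothetical toy class: transformations are recorded, actions run them."""

    def __init__(self, rows: List[Row], plan: Tuple[Step, ...] = ()) -> None:
        self._rows = rows
        self._plan = plan  # recorded steps; nothing has been executed yet

    # Transformations: return a new frame with one more step in the plan.
    def select(self, *cols: str) -> "ToyLazyFrame":
        def step(rows: Iterator[Row]) -> Iterator[Row]:
            return ({c: r[c] for c in cols} for r in rows)
        return ToyLazyFrame(self._rows, self._plan + (step,))

    def filter(self, predicate: Callable[[Row], bool]) -> "ToyLazyFrame":
        def step(rows: Iterator[Row]) -> Iterator[Row]:
            return (r for r in rows if predicate(r))
        return ToyLazyFrame(self._rows, self._plan + (step,))

    # Action: only here is the recorded plan applied to the data.
    def take(self, n: int) -> List[Row]:
        rows: Iterator[Row] = iter(self._rows)
        for step in self._plan:
            rows = step(rows)
        return list(islice(rows, n))


# Sample rows imitating the trip data (values chosen for illustration).
trips = [
    {"PULocationID": 247, "DOLocationID": 169, "SR_Flag": None},
    {"PULocationID": 256, "DOLocationID": 177, "SR_Flag": None},
    {"PULocationID": 14, "DOLocationID": 227, "SR_Flag": None},
    {"PULocationID": 256, "DOLocationID": 49, "SR_Flag": None},
]

calls = []  # records every row the predicate is asked about


def pickup_is_256(row: Row) -> bool:
    calls.append(row)
    return row["PULocationID"] == 256


lazy = ToyLazyFrame(trips).select("PULocationID", "DOLocationID").filter(pickup_is_256)
calls_before_action = len(calls)  # the predicate has not run yet

result = lazy.take(10)  # the action triggers select + filter
calls_after_action = len(calls)

print(calls_before_action, calls_after_action)
print(result)
```

Building the chain of `select()` and `filter()` does not touch the rows at all; the predicate is only evaluated once `take()` asks for a result. Spark follows the same principle, with the important addition that it can optimize the whole plan before running it.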

In [5]:
# transformation select()
df.select("pickup_datetime", "PULocationID", "dropoff_datetime", "DOLocationID")

DataFrame[pickup_datetime: timestamp, PULocationID: int, dropoff_datetime: timestamp, DOLocationID: int]

In [6]:
# transformations select() + filter()
df.select("pickup_datetime", "PULocationID", "dropoff_datetime", "DOLocationID") \
    .filter(df.PULocationID == 256)

DataFrame[pickup_datetime: timestamp, PULocationID: int, dropoff_datetime: timestamp, DOLocationID: int]

In [7]:
# action show() after transformations
df.select("pickup_datetime", "PULocationID", "dropoff_datetime", "DOLocationID") \
    .filter(df.PULocationID == 256).show()

+-------------------+------------+-------------------+------------+
|    pickup_datetime|PULocationID|   dropoff_datetime|DOLocationID|
+-------------------+------------+-------------------+------------+
|2021-01-16 21:58:50|         256|2021-01-16 22:16:27|         177|
|2021-01-17 19:36:28|         256|2021-01-17 19:52:28|          49|
|2021-01-16 20:19:14|         256|2021-01-16 20:33:52|          36|
|2021-01-11 10:42:29|         256|2021-01-11 10:46:44|         217|
|2021-01-30 00:23:44|         256|2021-01-30 00:42:09|         168|
|2021-01-14 15:55:47|         256|2021-01-14 16:03:43|         255|
|2021-01-10 19:11:19|         256|2021-01-10 19:32:45|          36|
|2021-01-21 16:18:24|         256|2021-01-21 16:31:47|          34|
|2021-01-04 07:03:43|         256|2021-01-04 07:20:17|         162|
|2021-01-10 00:55:17|         256|2021-01-10 01:04:55|         144|
|2021-01-01 01:14:38|         256|2021-01-01 01:17:29|         255|
|2021-01-16 12:11:12|         256|2021-01-16 12:

## Functions and User Defined Functions (UDFs)

Besides the SQL and Pandas-like commands we've seen so far, Spark provides additional built-in functions that allow for more complex data manipulation. By convention, these functions are imported as follows.

In [8]:
from pyspark.sql import functions as F

Example of built-in functions usage:

In [9]:
df \
    .withColumn("pickup_date", F.to_date(df.pickup_datetime)) \
    .withColumn("dropoff_date", F.to_date(df.dropoff_datetime)) \
    .select("pickup_date", "PULocationID", "dropoff_date", "DOLocationID") \
    .show()

+-----------+------------+------------+------------+
|pickup_date|PULocationID|dropoff_date|DOLocationID|
+-----------+------------+------------+------------+
| 2021-01-25|         247|  2021-01-25|         169|
| 2021-01-01|          14|  2021-01-01|         227|
| 2021-01-29|         134|  2021-01-29|         138|
| 2021-01-12|           4|  2021-01-12|         137|
| 2021-01-26|           7|  2021-01-26|         239|
| 2021-01-22|         242|  2021-01-22|          78|
| 2021-01-23|          80|  2021-01-23|         148|
| 2021-01-19|         244|  2021-01-19|         128|
| 2021-01-12|         230|  2021-01-12|         225|
| 2021-01-20|         228|  2021-01-20|          26|
| 2021-01-29|         123|  2021-01-29|         130|
| 2021-01-16|          47|  2021-01-16|         119|
| 2021-01-03|          14|  2021-01-03|          75|
| 2021-01-28|          61|  2021-01-28|          61|
| 2021-01-10|         150|  2021-01-10|         150|
| 2021-01-30|         205|  2021-01-30|       

* **`withColumn()`**: adds a new column to the dataframe.
* **`F.to_date()`**: converts a timestamp to a date, dropping the time of day.

Find the list of built-in functions in the [Spark documentation](https://spark.apache.org/docs/latest/api/sql/index.html).

We can also create our own **User Defined Functions (UDFs)**. UDFs are regular Python functions that are wrapped with **`F.udf()`**, which also declares the type of the value they return.

In [10]:
# prefix the base number (in hex) with s/, a/ or e/ depending on whether
# it is divisible by 7, by 3, or by neither
def crazy_stuff(base_num):
    num = int(base_num[1:])
    if num % 7 == 0:
        return f"s/{num:03x}"
    elif num % 3 == 0:
        return f"a/{num:03x}"
    else:
        return f"e/{num:03x}"

In [11]:
crazy_stuff("B02682")

'a/a7a'
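Because a UDF starts out as an ordinary Python function, it can be checked without Spark, as the call above shows. One thing worth testing this way is null handling: `dispatching_base_num` is declared nullable in the schema, and a Python UDF receives `None` for null values, so `base_num[1:]` would raise an error on such rows. The sketch below adds a tolerant wrapper; `crazy_stuff_safe` is a hypothetical name, and returning `None` for unusable input is an assumption about the desired behaviour, not something the original notebook specifies.

```python
from typing import Optional


def crazy_stuff(base_num: str) -> str:
    # Same logic as in the notebook: drop the leading letter, parse the number.
    num = int(base_num[1:])
    if num % 7 == 0:
        return f"s/{num:03x}"
    elif num % 3 == 0:
        return f"a/{num:03x}"
    else:
        return f"e/{num:03x}"


def crazy_stuff_safe(base_num: Optional[str]) -> Optional[str]:
    """Hypothetical wrapper: return None instead of failing on bad input."""
    if base_num is None:
        return None  # null column value
    try:
        return crazy_stuff(base_num)
    except ValueError:
        return None  # value without a parsable number, e.g. "B"


# Quick checks in plain Python, no Spark session required.
assert crazy_stuff("B02682") == "a/a7a"
assert crazy_stuff_safe("B02884") == "s/b44"
assert crazy_stuff_safe(None) is None
assert crazy_stuff_safe("B") is None
```

The wrapper can be converted to a UDF in the same way as the original function, by passing `crazy_stuff_safe` to `F.udf()` with `returnType=StringType()`.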

In [12]:
# convert the previous regular function into a UDF
crazy_stuff_udf = F.udf(crazy_stuff, returnType=StringType())

In [13]:
# apply the UDF to our dataframe
df \
    .withColumn("pickup_date", F.to_date(df.pickup_datetime)) \
    .withColumn("dropoff_date", F.to_date(df.dropoff_datetime)) \
    .withColumn("base_id", crazy_stuff_udf(df.dispatching_base_num)) \
    .select("base_id", "pickup_date", "PULocationID", "dropoff_date", "DOLocationID") \
    .show()

+-------+-----------+------------+------------+------------+
|base_id|pickup_date|PULocationID|dropoff_date|DOLocationID|
+-------+-----------+------------+------------+------------+
|  e/9ce| 2021-01-25|         247|  2021-01-25|         169|
|  e/9ce| 2021-01-01|          14|  2021-01-01|         227|
|  e/9ce| 2021-01-29|         134|  2021-01-29|         138|
|  e/b32| 2021-01-12|           4|  2021-01-12|         137|
|  e/9ce| 2021-01-26|           7|  2021-01-26|         239|
|  e/b32| 2021-01-22|         242|  2021-01-22|          78|
|  e/9ce| 2021-01-23|          80|  2021-01-23|         148|
|  e/acc| 2021-01-19|         244|  2021-01-19|         128|
|  e/b38| 2021-01-12|         230|  2021-01-12|         225|
|  s/b44| 2021-01-20|         228|  2021-01-20|          26|
|  e/b32| 2021-01-29|         123|  2021-01-29|         130|
|  e/b3e| 2021-01-16|          47|  2021-01-16|         119|
|  e/b32| 2021-01-03|          14|  2021-01-03|          75|
|  e/9ce| 2021-01-28|   