In this file:
* Part-1
    * Create a SparkSession object
    * Download a gzipped csv file from the web into the `data` folder
    * Decompress the gzipped csv (not required: both pandas and PySpark can read gzipped csv files directly)
    * Read the csv as a PySpark DataFrame object
    * View the top 5 rows of the PySpark DataFrame
    * Because PySpark reads every csv column as a string by default, we derive the correct schema for the dataset as follows:
        1. Read a few rows of the csv file into a pandas DataFrame object
        2. Convert the pandas DataFrame to a PySpark DataFrame object and read its schema (Spark infers it by mapping the pandas data types to PySpark data types)
        3. Copy the extracted schema and edit it in VS Code
        4. Pass the edited schema as a variable when reading the csv file as a PySpark DataFrame object
    * Repartition the PySpark DataFrame into 24 partitions
    * Write the PySpark DataFrame to a folder as 24 parquet partition files
* Part-2
    * Read the parquet partitions into a PySpark DataFrame
    * Experiment with some PySpark DataFrame functions
        * Lazy functions (transformations)
        * Eager functions (actions)

# Part-1 

## Create a SparkSession object

In [1]:
import pyspark
from pyspark.sql import SparkSession

In [2]:
# Libraries for setting the directories
from pathlib import Path
import os

In [3]:
# Creating a SparkSession object
spark = SparkSession.builder.master("local[*]").appName('test04').getOrCreate()

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/02/23 18:12:26 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [4]:
type(spark)

pyspark.sql.session.SparkSession

In [5]:
# Setting the working directory
working_dir = os.getcwd()
parent_working_dir = os.path.dirname(working_dir) 
data_dir = os.path.join(parent_working_dir, 'data')
print(f'Current Directory: {working_dir}')
print(f'Data Directory: {data_dir}')

Current Directory: /home/sanyashireen/week_5_batch_processing/code
Data Directory: /home/sanyashireen/week_5_batch_processing/data
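
The `Path` import above is not used by the `os.path` calls. As a rough sketch, the same sibling-folder lookup could be written with `pathlib`. The `code_dir` value below is simply the path printed above, hard-coded for illustration (the notebook would use `Path.cwd()`), and `data_dir_alt` is a hypothetical name chosen so it does not clash with `data_dir`.

```python
from pathlib import PurePosixPath

# Hard-coded for illustration; in the notebook this would be Path.cwd()
code_dir = PurePosixPath('/home/sanyashireen/week_5_batch_processing/code')

# The data folder is a sibling of the code folder
data_dir_alt = code_dir.parent / 'data'
print(f'Data Directory: {data_dir_alt}')
```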


In [6]:
# Moving to the data directory
os.chdir(data_dir)
print(f'Current Directory: {os.getcwd()}')

Current Directory: /home/sanyashireen/week_5_batch_processing/data


## Download a gzipped csv file from the web into the data folder

In [7]:
# Checking we are in the data directory and
# downloading the gzipped csv file from the web
!ls
!wget https://github.com/DataTalksClub/nyc-tlc-data/releases/download/fhvhv/fhvhv_tripdata_2021-01.csv.gz

# Use the `-P` option to specify the output folder if it differs from the current working directory
#-P '/home/sanyashireen/week_5_batch_processing/data/'

fhvhv  head.csv  taxi+_zone_lookup.csv	zones
--2023-02-23 18:14:31--  https://github.com/DataTalksClub/nyc-tlc-data/releases/download/fhvhv/fhvhv_tripdata_2021-01.csv.gz
Resolving github.com (github.com)... 140.82.114.4
Connecting to github.com (github.com)|140.82.114.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://objects.githubusercontent.com/github-production-release-asset-2e65be/513814948/035746e8-4e24-47e8-a3ce-edcf6d1b11c7?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20230223%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20230223T181431Z&X-Amz-Expires=300&X-Amz-Signature=0d4d7c0ed042c37917e5e72aa5d2a90d88bd0716a1b41f80165d6bff05ce9822&X-Amz-SignedHeaders=host&actor_id=0&key_id=0&repo_id=513814948&response-content-disposition=attachment%3B%20filename%3Dfhvhv_tripdata_2021-01.csv.gz&response-content-type=application%2Foctet-stream [following]
--2023-02-23 18:14:31--  https://objects.githubusercontent.com/github-produ

In [8]:
!ls 

fhvhv  fhvhv_tripdata_2021-01.csv.gz  head.csv	taxi+_zone_lookup.csv  zones
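
Re-running the `wget` cell downloads the file again. A possible pure-Python alternative, sketched here with only the standard library, skips the download when the file is already present. The helper name `download_if_missing` is made up for this example and is not part of the notebook.

```python
from pathlib import Path
from urllib.request import urlretrieve


def download_if_missing(url, dest_dir):
    """Download url into dest_dir unless a file with that name is already there."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    # Use the last URL segment as the local file name
    target = dest_dir / url.rsplit('/', 1)[-1]
    if not target.exists():
        urlretrieve(url, str(target))
    return target
```

In the notebook it would be called with the same URL as the `wget` command and `data_dir` as the destination folder.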


## Decompress the gzipped csv (not required, both pandas and PySpark can read gzipped csv files directly)

In [12]:
# Decompressing the gzipped file to a plain csv - this is not required, as both pandas and PySpark can read gzipped csv files directly
# I did it because I was not able to run the wc command below on the compressed file
#!gunzip -v fhvhv_tripdata_2021-01.csv.gz

fhvhv_tripdata_2021-01.csv.gz:	 82.7% -- replaced with fhvhv_tripdata_2021-01.csv


In [13]:
#!wc -l fhvhv_tripdata_2021-01.csv

11908469 fhvhv_tripdata_2021-01.csv


In [9]:
# Combining the two commands above: count the lines (header included) and view the first 10 lines
!zcat fhvhv_tripdata_2021-01.csv.gz | wc -l
!zcat fhvhv_tripdata_2021-01.csv.gz | head -n 10

11908469
hvfhs_license_num,dispatching_base_num,pickup_datetime,dropoff_datetime,PULocationID,DOLocationID,SR_Flag
HV0003,B02682,2021-01-01 00:33:44,2021-01-01 00:49:07,230,166,
HV0003,B02682,2021-01-01 00:55:19,2021-01-01 01:18:21,152,167,
HV0003,B02764,2021-01-01 00:23:56,2021-01-01 00:38:05,233,142,
HV0003,B02764,2021-01-01 00:42:51,2021-01-01 00:45:50,142,143,
HV0003,B02764,2021-01-01 00:48:14,2021-01-01 01:08:42,143,78,
HV0005,B02510,2021-01-01 00:06:59,2021-01-01 00:43:01,88,42,
HV0005,B02510,2021-01-01 00:50:00,2021-01-01 01:04:57,42,151,
HV0003,B02764,2021-01-01 00:14:30,2021-01-01 00:50:27,71,226,
HV0003,B02875,2021-01-01 00:22:54,2021-01-01 00:30:20,112,255,

gzip: stdout: Broken pipe
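
The same count and preview can also be obtained from Python without decompressing to disk, which avoids the `Broken pipe` message that `head` triggers when it closes the pipe early. This is a minimal sketch using the standard `gzip` module. `count_and_preview` is a hypothetical helper name, and its count should match `wc -l` as long as the file ends with a newline.

```python
import gzip
from itertools import islice


def count_and_preview(gz_path, n_preview=10):
    """Return (total line count, first n_preview lines) of a gzipped text file."""
    with gzip.open(gz_path, 'rt', encoding='utf-8') as fh:
        preview = [line.rstrip('\r\n') for line in islice(fh, n_preview)]
        # Keep reading from where the preview stopped to finish the count
        total = len(preview) + sum(1 for _ in fh)
    return total, preview
```

For this notebook the call would be `count_and_preview('fhvhv_tripdata_2021-01.csv.gz')`.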


In [10]:
# Assigning the file name to a variable
data_file = 'fhvhv_tripdata_2021-01.csv.gz'

## Read the csv as a PySpark DataFrame object

In [11]:
# Reading the gzipped CSV as a PySpark DF
df = spark.read.option("header", "true").csv(f'{data_dir}/{data_file}')

In [12]:
type(df)

pyspark.sql.dataframe.DataFrame

## View the top 5 rows of the PySpark DataFrame

In [13]:
# Viewing the first 5 rows of the PySpark DF as a table
df.show(5)

+-----------------+--------------------+-------------------+-------------------+------------+------------+-------+
|hvfhs_license_num|dispatching_base_num|    pickup_datetime|   dropoff_datetime|PULocationID|DOLocationID|SR_Flag|
+-----------------+--------------------+-------------------+-------------------+------------+------------+-------+
|           HV0003|              B02682|2021-01-01 00:33:44|2021-01-01 00:49:07|         230|         166|   null|
|           HV0003|              B02682|2021-01-01 00:55:19|2021-01-01 01:18:21|         152|         167|   null|
|           HV0003|              B02764|2021-01-01 00:23:56|2021-01-01 00:38:05|         233|         142|   null|
|           HV0003|              B02764|2021-01-01 00:42:51|2021-01-01 00:45:50|         142|         143|   null|
|           HV0003|              B02764|2021-01-01 00:48:14|2021-01-01 01:08:42|         143|          78|   null|
+-----------------+--------------------+-------------------+-------------------+------------+------------+-------+

In [15]:
# Viewing the first few rows as a list
# Note: without a schema, Spark reads every csv column as a string, including the numeric columns
df.head(5)

[Row(hvfhs_license_num='HV0003', dispatching_base_num='B02682', pickup_datetime='2021-01-01 00:33:44', dropoff_datetime='2021-01-01 00:49:07', PULocationID='230', DOLocationID='166', SR_Flag=None),
 Row(hvfhs_license_num='HV0003', dispatching_base_num='B02682', pickup_datetime='2021-01-01 00:55:19', dropoff_datetime='2021-01-01 01:18:21', PULocationID='152', DOLocationID='167', SR_Flag=None),
 Row(hvfhs_license_num='HV0003', dispatching_base_num='B02764', pickup_datetime='2021-01-01 00:23:56', dropoff_datetime='2021-01-01 00:38:05', PULocationID='233', DOLocationID='142', SR_Flag=None),
 Row(hvfhs_license_num='HV0003', dispatching_base_num='B02764', pickup_datetime='2021-01-01 00:42:51', dropoff_datetime='2021-01-01 00:45:50', PULocationID='142', DOLocationID='143', SR_Flag=None),
 Row(hvfhs_license_num='HV0003', dispatching_base_num='B02764', pickup_datetime='2021-01-01 00:48:14', dropoff_datetime='2021-01-01 01:08:42', PULocationID='143', DOLocationID='78', SR_Flag=None)]

### Looking at the data types of the objects returned by PySpark DF functions

In [16]:
type(df.head(5))

list

In [17]:
type(df.head(5)[0])

pyspark.sql.types.Row

### Experimenting with some functions of the PySpark DF object

In [19]:
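# Without parentheses this returns the bound method itself; df.describe().show() would compute summary statistics
df.describe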

<bound method DataFrame.describe of DataFrame[hvfhs_license_num: string, dispatching_base_num: string, pickup_datetime: string, dropoff_datetime: string, PULocationID: string, DOLocationID: string, SR_Flag: string]>

In [20]:
type(df.describe)

method

In [21]:
# This is not a pandas DataFrame but a PySpark SQL DataFrame
type(df)

pyspark.sql.dataframe.DataFrame

In [22]:
# As we can see, every column is a string
df.schema

StructType(List(StructField(hvfhs_license_num,StringType,true),StructField(dispatching_base_num,StringType,true),StructField(pickup_datetime,StringType,true),StructField(dropoff_datetime,StringType,true),StructField(PULocationID,StringType,true),StructField(DOLocationID,StringType,true),StructField(SR_Flag,StringType,true)))

## Creating a pandas DF with fewer rows to extract the ideal schema

In [None]:
# First create a smaller file and save it as head.csv
#!export data_folder=/home/sanyashireen/week_5_batch_processing/data
#!export data_file=fhvhv_tripdata_2021-01.csv
#!head -n 100 "${data_folder}/${data_file}" > "${data_folder}/head.csv" 

In [28]:
# Decompress the gzipped csv and copy the first 101 lines (header + 100 rows) into a new file head.csv
!zcat fhvhv_tripdata_2021-01.csv.gz | head -n 101 > head.csv


gzip: stdout: Broken pipe


In [29]:
!head -n 10 head.csv

hvfhs_license_num,dispatching_base_num,pickup_datetime,dropoff_datetime,PULocationID,DOLocationID,SR_Flag

HV0003,B02682,2021-01-01 00:33:44,2021-01-01 00:49:07,230,166,

HV0003,B02682,2021-01-01 00:55:19,2021-01-01 01:18:21,152,167,

HV0003,B02764,2021-01-01 00:23:56,2021-01-01 00:38:05,233,142,

HV0003,B02764,2021-01-01 00:42:51,2021-01-01 00:45:50,142,143,

HV0003,B02764,2021-01-01 00:48:14,2021-01-01 01:08:42,143,78,

HV0005,B02510,2021-01-01 00:06:59,2021-01-01 00:43:01,88,42,

HV0005,B02510,2021-01-01 00:50:00,2021-01-01 01:04:57,42,151,

HV0003,B02764,2021-01-01 00:14:30,2021-01-01 00:50:27,71,226,

HV0003,B02875,2021-01-01 00:22:54,2021-01-01 00:30:20,112,255,



In [23]:
import pandas as pd

In [24]:
# Read csv data into pandas df
df_pandas = pd.read_csv('head.csv')

In [25]:
df_pandas.dtypes

hvfhs_license_num        object
dispatching_base_num     object
pickup_datetime          object
dropoff_datetime         object
PULocationID              int64
DOLocationID              int64
SR_Flag                 float64
dtype: object

In [29]:
# Converting the pandas DF to a PySpark DF and extracting the schema
# This schema is copied and edited in VS Code
spark.createDataFrame(df_pandas).schema

StructType(List(StructField(hvfhs_license_num,StringType,true),StructField(dispatching_base_num,StringType,true),StructField(pickup_datetime,StringType,true),StructField(dropoff_datetime,StringType,true),StructField(PULocationID,LongType,true),StructField(DOLocationID,LongType,true),StructField(SR_Flag,DoubleType,true)))

In [27]:
# Looking at the data type of the converted df
type(spark.createDataFrame(df_pandas)) #.show() or .schema

pyspark.sql.dataframe.DataFrame

### To assign the data types we have to use PySpark data types, which have to be imported

In [31]:
from pyspark.sql import types

In [33]:
# Assigning the edited schema obtained above to a variable
# True indicates that the column can contain null values
schema = types.StructType([
    types.StructField('hvfhs_license_num', types.StringType(), True),
    types.StructField('dispatching_base_num', types.StringType(), True),
    types.StructField('pickup_datetime', types.TimestampType(), True),
    types.StructField('dropoff_datetime', types.TimestampType(), True),
    types.StructField('PULocationID', types.IntegerType(), True),
    types.StructField('DOLocationID', types.IntegerType(), True),
    types.StructField('SR_Flag', types.StringType(), True)])
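
As an alternative to editing the `StructType` by hand, `spark.read.schema(...)` also accepts a DDL-formatted string such as `col0 INT, col1 DOUBLE`. The sketch below builds such a string from the pandas dtypes printed earlier. The dictionaries `default_sql_types` and `overrides` are illustrative names, and the overrides just repeat the manual corrections made in the schema cell above.

```python
# pandas dtypes as printed by df_pandas.dtypes above
pandas_dtypes = {
    'hvfhs_license_num': 'object',
    'dispatching_base_num': 'object',
    'pickup_datetime': 'object',
    'dropoff_datetime': 'object',
    'PULocationID': 'int64',
    'DOLocationID': 'int64',
    'SR_Flag': 'float64',
}

# Default pandas-to-Spark SQL type names (mirrors the inferred schema above)
default_sql_types = {'object': 'STRING', 'int64': 'BIGINT', 'float64': 'DOUBLE'}

# Manual corrections, the same ones made by hand in the schema cell
overrides = {
    'pickup_datetime': 'TIMESTAMP',
    'dropoff_datetime': 'TIMESTAMP',
    'PULocationID': 'INT',
    'DOLocationID': 'INT',
    'SR_Flag': 'STRING',
}

# Join "name TYPE" pairs into one DDL string, preferring the overrides
ddl_schema = ', '.join(
    f'{name} {overrides.get(name, default_sql_types[dtype])}'
    for name, dtype in pandas_dtypes.items()
)
print(ddl_schema)
```

The resulting string could then be passed as `.schema(ddl_schema)` in place of the `StructType` variable.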

In [39]:
# Re-reading the csv as a PySpark DF with the new Schema
pyspark_df = spark.read\
          .option("header", "true")\
          .schema(schema)\
          .csv(f'{data_dir}/{data_file}')

In [40]:
pyspark_df.head(3)

[Row(hvfhs_license_num='HV0003', dispatching_base_num='B02682', pickup_datetime=datetime.datetime(2021, 1, 1, 0, 33, 44), dropoff_datetime=datetime.datetime(2021, 1, 1, 0, 49, 7), PULocationID=230, DOLocationID=166, SR_Flag=None),
 Row(hvfhs_license_num='HV0003', dispatching_base_num='B02682', pickup_datetime=datetime.datetime(2021, 1, 1, 0, 55, 19), dropoff_datetime=datetime.datetime(2021, 1, 1, 1, 18, 21), PULocationID=152, DOLocationID=167, SR_Flag=None),
 Row(hvfhs_license_num='HV0003', dispatching_base_num='B02764', pickup_datetime=datetime.datetime(2021, 1, 1, 0, 23, 56), dropoff_datetime=datetime.datetime(2021, 1, 1, 0, 38, 5), PULocationID=233, DOLocationID=142, SR_Flag=None)]

In [48]:
# Repartitioning into 24 partitions, which determines how many files are written below
pyspark_df = pyspark_df.repartition(24)

In [49]:
# Folder to write the partitions into (the same folder is listed and read back below)
data_partition_dir = 'fhvhv/2021/01'

# Without mode='overwrite' the write fails if the folder already exists
# pyspark_df.write.parquet(f'{data_dir}/{data_partition_dir}/')

# To allow overwriting, pass mode='overwrite'
# Writing the PySpark DF as parquet files after changing the schema
pyspark_df.write.parquet(f'{data_dir}/{data_partition_dir}/', mode='overwrite')

                                                                                

In [30]:
# Counting how many partitions are written to the folder
# The count is 24 partitions + the `total` line printed by ls -l + Spark's _SUCCESS marker file
!ls -lh fhvhv/2021/01/ | wc -l

26
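
To count only the data files, so that the `total` line from `ls -l` and the `_SUCCESS` marker do not inflate the number, a small sketch with `pathlib` could look like this. `count_parquet_parts` is a hypothetical helper, and it assumes Spark's usual `part-*.parquet` file naming.

```python
from pathlib import Path


def count_parquet_parts(folder):
    """Count Spark part files in folder, ignoring _SUCCESS and hidden .crc files."""
    return sum(1 for p in Path(folder).glob('part-*.parquet') if p.is_file())
```

Called as `count_parquet_parts('fhvhv/2021/01')` from the data directory, it would be expected to report the 24 partitions written above.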


# Part-2

## Reading back the parquet partition files from disk

In [42]:
# Folder containing all the partitions
data_partition_dir = 'fhvhv/2021/01'

In [43]:
# Reading all the parquet partitions into a single PySpark DF
df = spark.read.parquet(f"{data_dir}/{data_partition_dir}/")

In [44]:
df

DataFrame[hvfhs_license_num: string, dispatching_base_num: string, pickup_datetime: timestamp, dropoff_datetime: timestamp, PULocationID: int, DOLocationID: int, SR_Flag: string]

In [45]:
type(df)

pyspark.sql.dataframe.DataFrame

## Basic functions for PySpark DF
* .select()
* .filter()
* .take()
* .withColumn(new_column_or_old_column_name, function_to_apply)

In [46]:
# Prints the schema in a nice tree like structure
df.printSchema()

root
 |-- hvfhs_license_num: string (nullable = true)
 |-- dispatching_base_num: string (nullable = true)
 |-- pickup_datetime: timestamp (nullable = true)
 |-- dropoff_datetime: timestamp (nullable = true)
 |-- PULocationID: integer (nullable = true)
 |-- DOLocationID: integer (nullable = true)
 |-- SR_Flag: string (nullable = true)



### Select Function (Columns Selection)

In [39]:
# Select only certain columns
# .select(), .filter(), joins and .groupBy() are transformations: they are lazy and not executed immediately
df.select('pickup_datetime', 'dropoff_datetime', 'PULocationID', 'DOLocationID')

DataFrame[pickup_datetime: timestamp, dropoff_datetime: timestamp, PULocationID: int, DOLocationID: int]
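
The output above is only a description of the result, because no rows have been read yet. The idea can be illustrated with a plain-Python analogy using generators. This is not Spark code, and `read_rows` with its four sample values is a made-up stand-in for a data source.

```python
rows_read = 0


def read_rows():
    """Stand-in for a data source; counts how many rows were actually read."""
    global rows_read
    for license_num in ['HV0003', 'HV0005', 'HV0003', 'HV0003']:
        rows_read += 1
        yield license_num


# "Transformation": builds a pipeline but reads nothing yet
pipeline = (row for row in read_rows() if row == 'HV0003')
reads_before = rows_read
print(reads_before)

# "Action": consuming the pipeline finally pulls rows through it
result = list(pipeline)
print(rows_read, result)
```

Nothing is read until `list(...)` consumes the pipeline, much as Spark defers work until an action such as `.show()` is called.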

### Select function with Filter function (Row selection based on values)

In [41]:
# Filter & show data
# .show(), .take(), .head() and writes (e.g. .write.parquet()) are actions: they are eager and executed immediately
df.select('pickup_datetime', 'dropoff_datetime', 'PULocationID', 'DOLocationID')\
.filter(df.hvfhs_license_num == 'HV0003')\
.show()

+-------------------+-------------------+------------+------------+
|    pickup_datetime|   dropoff_datetime|PULocationID|DOLocationID|
+-------------------+-------------------+------------+------------+
|2021-01-05 22:14:07|2021-01-05 22:32:28|         189|         107|
|2021-01-02 17:59:55|2021-01-02 18:10:39|          88|         137|
|2021-01-02 23:57:54|2021-01-03 00:15:48|         238|         224|
|2021-01-06 15:53:13|2021-01-06 16:07:07|         169|         208|
|2021-01-07 07:35:24|2021-01-07 07:55:49|          75|          88|
|2021-01-07 08:45:12|2021-01-07 08:51:17|         210|         210|
|2021-01-02 15:44:26|2021-01-02 16:10:50|         243|          69|
|2021-01-04 16:50:28|2021-01-04 16:57:43|         250|         213|
|2021-01-03 10:30:34|2021-01-03 10:44:53|          87|          79|
|2021-01-03 22:05:20|2021-01-03 22:27:55|          68|         181|
|2021-01-04 08:01:02|2021-01-04 08:33:27|          95|         236|
|2021-01-02 13:01:10|2021-01-02 13:08:11|       

In [49]:
# Filter & return the first 5 rows as a list
# .show(), .take(), .head() and writes (e.g. .write.parquet()) are actions: they are eager and executed immediately
df.select('pickup_datetime', 'dropoff_datetime', 'PULocationID', 'DOLocationID')\
.filter(df.hvfhs_license_num == 'HV0003')\
.head(5)

[Row(pickup_datetime=datetime.datetime(2021, 1, 5, 22, 14, 7), dropoff_datetime=datetime.datetime(2021, 1, 5, 22, 32, 28), PULocationID=189, DOLocationID=107),
 Row(pickup_datetime=datetime.datetime(2021, 1, 2, 17, 59, 55), dropoff_datetime=datetime.datetime(2021, 1, 2, 18, 10, 39), PULocationID=88, DOLocationID=137),
 Row(pickup_datetime=datetime.datetime(2021, 1, 2, 23, 57, 54), dropoff_datetime=datetime.datetime(2021, 1, 3, 0, 15, 48), PULocationID=238, DOLocationID=224),
 Row(pickup_datetime=datetime.datetime(2021, 1, 6, 15, 53, 13), dropoff_datetime=datetime.datetime(2021, 1, 6, 16, 7, 7), PULocationID=169, DOLocationID=208),
 Row(pickup_datetime=datetime.datetime(2021, 1, 7, 7, 35, 24), dropoff_datetime=datetime.datetime(2021, 1, 7, 7, 55, 49), PULocationID=75, DOLocationID=88)]

### take function (selecting the first n records)

In [60]:
# Extracting records as a list
df.select('pickup_datetime', 'dropoff_datetime', 'PULocationID', 'DOLocationID')\
.filter(df.hvfhs_license_num == 'HV0003')\
.take(5)

[Row(pickup_datetime=datetime.datetime(2021, 1, 5, 22, 14, 7), dropoff_datetime=datetime.datetime(2021, 1, 5, 22, 32, 28), PULocationID=189, DOLocationID=107),
 Row(pickup_datetime=datetime.datetime(2021, 1, 2, 17, 59, 55), dropoff_datetime=datetime.datetime(2021, 1, 2, 18, 10, 39), PULocationID=88, DOLocationID=137),
 Row(pickup_datetime=datetime.datetime(2021, 1, 2, 23, 57, 54), dropoff_datetime=datetime.datetime(2021, 1, 3, 0, 15, 48), PULocationID=238, DOLocationID=224),
 Row(pickup_datetime=datetime.datetime(2021, 1, 6, 15, 53, 13), dropoff_datetime=datetime.datetime(2021, 1, 6, 16, 7, 7), PULocationID=169, DOLocationID=208),
 Row(pickup_datetime=datetime.datetime(2021, 1, 7, 7, 35, 24), dropoff_datetime=datetime.datetime(2021, 1, 7, 7, 55, 49), PULocationID=75, DOLocationID=88)]

## Advanced Functions
New columns can be created by applying advanced functions to existing columns.
All these functions can be imported from the `functions` module of PySpark.

***Examples:***
* functions.to_date(column_name)
* You can also define and apply custom functions

In [51]:
from pyspark.sql import functions as F
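
For the custom-function case mentioned in the list above, one possible approach is to write an ordinary Python function and wrap it with `F.udf`. The function below and its labels are placeholders invented for this sketch, not values from the dataset documentation. The Spark wiring is shown only in comments so that the function itself can be tried without a Spark session.

```python
def license_to_company(license_num):
    """Map an hvfhs_license_num code to a label; the labels are placeholders."""
    labels = {'HV0003': 'company_a', 'HV0005': 'company_b'}  # hypothetical mapping
    return labels.get(license_num, 'unknown')


# In the notebook this would be wrapped before use on a column, e.g.:
#   to_company_udf = F.udf(license_to_company, returnType=types.StringType())
#   df.withColumn('company', to_company_udf(df.hvfhs_license_num))
print(license_to_company('HV0003'))
```

Keeping the logic in a plain function makes it easy to test on a few values before applying it to millions of rows.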

In [52]:
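# Note: dropoff_date below is derived from pickup_datetime, so both date columns in the output match;
# use F.to_date(df.dropoff_datetime) to get the actual drop-off date
df\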
    .withColumn('pickup_date', F.to_date(df.pickup_datetime))\
    .withColumn('dropoff_date', F.to_date(df.pickup_datetime))\
    .select('pickup_date', 'dropoff_date', 'PULocationID', 'DOLocationID')\
    .show()

+-----------+------------+------------+------------+
|pickup_date|dropoff_date|PULocationID|DOLocationID|
+-----------+------------+------------+------------+
| 2021-01-03|  2021-01-03|         255|          34|
| 2021-01-05|  2021-01-05|         189|         107|
| 2021-01-02|  2021-01-02|          88|         137|
| 2021-01-02|  2021-01-02|         238|         224|
| 2021-01-06|  2021-01-06|         169|         208|
| 2021-01-07|  2021-01-07|          75|          88|
| 2021-01-07|  2021-01-07|         210|         210|
| 2021-01-02|  2021-01-02|         243|          69|
| 2021-01-04|  2021-01-04|         250|         213|
| 2021-01-03|  2021-01-03|          87|          79|
| 2021-01-03|  2021-01-03|          68|         181|
| 2021-01-04|  2021-01-04|          95|         236|
| 2021-01-02|  2021-01-02|         262|         236|
| 2021-01-04|  2021-01-04|         225|         233|
| 2021-01-06|  2021-01-06|         237|          83|
| 2021-01-05|  2021-01-05|         231|       