# Module 5 Homework

In this homework we'll put what we learned about Spark into practice.

For this homework we will be using the Yellow 2024-10 data from the official website: 

```bash
wget https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-10.parquet
```

In [1]:
!mkdir -p ./../data/raw/yellow/2024/10/
!wget https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-10.parquet -O ./../data/raw/yellow/2024/10/yellow_tripdata_2024-10.parquet

--2025-03-09 12:46:47--  https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-10.parquet
Resolving d37ci6vzurychx.cloudfront.net (d37ci6vzurychx.cloudfront.net)... 3.164.82.160, 3.164.82.112, 3.164.82.40, ...
Connecting to d37ci6vzurychx.cloudfront.net (d37ci6vzurychx.cloudfront.net)|3.164.82.160|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 64346071 (61M) [binary/octet-stream]
Saving to: ‘./../data/raw/yellow/2024/10/yellow_tripdata_2024-10.parquet’


2025-03-09 12:46:52 (14.7 MB/s) - ‘./../data/raw/yellow/2024/10/yellow_tripdata_2024-10.parquet’ saved [64346071/64346071]



## Question 1: Install Spark and PySpark

- Install Spark
- Run PySpark
- Create a local spark session
- Execute spark.version.

In [2]:
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql import types


spark = SparkSession.builder \
            .master("local[*]") \
            .appName("test") \
            .getOrCreate()
spark.version

25/03/09 07:16:57 WARN Utils: Your hostname, SRCIND-21BQ9G3 resolves to a loopback address: 127.0.1.1; using 172.26.144.30 instead (on interface eth0)
25/03/09 07:16:57 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/03/09 07:16:58 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


'3.5.5'


## Question 2: Yellow October 2024

Read the October 2024 Yellow data into a Spark DataFrame.

Repartition the DataFrame to 4 partitions and save it to Parquet.

What is the average size, in MB, of the Parquet files (those ending with the .parquet extension) that were created? Select the answer that most closely matches.

- 6MB
- 25MB
- 75MB
- 100MB

In [3]:
df = spark.read.parquet("./../data/raw/yellow/2024/10/")

                                                                                

In [4]:
df \
    .repartition(4) \
    .write.parquet("./../data/pq/yellow/2024/10", mode="overwrite")

                                                                                

In [5]:
!ls -lh ./../data/pq/yellow/2024/10/

total 90M
-rw-r--r-- 1 root root   0 Mar  9 12:47 _SUCCESS
-rw-r--r-- 1 root root 23M Mar  9 12:47 part-00000-dc5facc8-d46e-40a6-9bae-a32002188af4-c000.snappy.parquet
-rw-r--r-- 1 root root 23M Mar  9 12:47 part-00001-dc5facc8-d46e-40a6-9bae-a32002188af4-c000.snappy.parquet
-rw-r--r-- 1 root root 23M Mar  9 12:47 part-00002-dc5facc8-d46e-40a6-9bae-a32002188af4-c000.snappy.parquet
-rw-r--r-- 1 root root 23M Mar  9 12:47 part-00003-dc5facc8-d46e-40a6-9bae-a32002188af4-c000.snappy.parquet
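
The average asked for in this question can also be computed rather than read off the listing. The following is a minimal sketch using only the Python standard library; the helper name `average_parquet_size_mb` is hypothetical, and the path in the usage comment assumes the output directory written above. Sizes are reported in units of 1024 × 1024 bytes, which is what `ls -lh` prints as `M`.

```python
from pathlib import Path


def average_parquet_size_mb(directory: str) -> float:
    """Return the mean size of the *.parquet files in `directory`, in units of 1024*1024 bytes."""
    # Only names ending in .parquet are counted; _SUCCESS and checksum files are skipped.
    sizes = [p.stat().st_size for p in Path(directory).glob("*.parquet")]
    if not sizes:
        raise ValueError(f"no .parquet files found in {directory}")
    return sum(sizes) / len(sizes) / (1024 * 1024)


# Usage (assumed path, matching the write step above):
# average_parquet_size_mb("./../data/pq/yellow/2024/10")
```

With four files of about 23M each, as in the listing above, the result should land near 23, which makes 25MB the closest option.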


In [6]:
df = spark.read.parquet("./../data/pq/yellow/2024/10/")

## Question 3: Count records 

How many taxi trips were there on the 15th of October?

Consider only trips that started on the 15th of October.

- 85,567
- 105,567
- 125,567
- 145,567

In [7]:
df \
    .filter("tpep_pickup_datetime >= '2024-10-15 00:00:00' AND tpep_pickup_datetime < '2024-10-16 00:00:00'") \
    .count()

128893
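
The filter above uses a half-open window, from `2024-10-15 00:00:00` inclusive to `2024-10-16 00:00:00` exclusive, so a trip starting at exactly midnight on the 16th is excluded while one starting at 23:59:59 on the 15th is kept. Below is a small standard-library sketch of the same boundary logic; the helper `started_on` is hypothetical and the timestamps are made up for illustration, not taken from the dataset.

```python
from datetime import date, datetime, timedelta


def started_on(pickup: datetime, day: date) -> bool:
    """True if `pickup` falls in the half-open window [day 00:00, next day 00:00)."""
    start = datetime.combine(day, datetime.min.time())
    end = start + timedelta(days=1)
    return start <= pickup < end


target = date(2024, 10, 15)
# Illustrative timestamps, not rows from the dataset.
print(started_on(datetime(2024, 10, 15, 0, 0, 0), target))     # True
print(started_on(datetime(2024, 10, 15, 23, 59, 59), target))  # True
print(started_on(datetime(2024, 10, 16, 0, 0, 0), target))     # False
```

Of the listed options, 125,567 is the closest to the 128,893 returned above.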

## Question 4: Longest trip

What is the length of the longest trip in the dataset in hours?

- 122
- 142
- 162
- 182

In [8]:
def calculate_trip_time(pickup_datetime, dropoff_datetime):
    hour = (dropoff_datetime - pickup_datetime).total_seconds() / 3600
    return hour

In [9]:
calculate_trip_time_udf = F.udf(calculate_trip_time, types.DoubleType())

In [10]:
df \
    .withColumn('trip_time_hours', calculate_trip_time_udf(df.tpep_pickup_datetime, df.tpep_dropoff_datetime)) \
    .select('tpep_pickup_datetime', 'tpep_dropoff_datetime', 'trip_time_hours') \
    .agg(F.max('trip_time_hours')) \
    .show()

                                                                                

+--------------------+
|max(trip_time_hours)|
+--------------------+
|  162.61777777777777|
+--------------------+
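
As a quick sanity check of the arithmetic inside `calculate_trip_time`, the function can be called directly on plain `datetime` objects outside Spark. The pickup and dropoff values below are invented for illustration (they are not the actual longest trip); they are 6 days, 18 hours, 37 minutes and 4 seconds apart, that is 585,424 seconds.

```python
from datetime import datetime


def calculate_trip_time(pickup_datetime, dropoff_datetime):
    # Same logic as the UDF body above: elapsed time in hours.
    return (dropoff_datetime - pickup_datetime).total_seconds() / 3600


# Hypothetical timestamps chosen for illustration only.
pickup = datetime(2024, 10, 1, 0, 0, 0)
dropoff = datetime(2024, 10, 7, 18, 37, 4)

hours = calculate_trip_time(pickup, dropoff)
print(round(hours, 2))  # 162.62
```

In Spark itself, the same difference could likely be computed without a Python UDF by subtracting `F.unix_timestamp` values for the two columns and dividing by 3600, which would avoid passing every row through Python.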



## Question 5: User Interface

On which local port does Spark's user interface, which shows the application's dashboard, run?

- 80
- 443
- 4040
- 8080

In [11]:
# 4040 by default for the application UI; 8080 is the master web UI when running a standalone cluster

## Question 6: Least frequent pickup location zone

Load the zone lookup data into a temp view in Spark:

```bash
wget https://d37ci6vzurychx.cloudfront.net/misc/taxi_zone_lookup.csv
```

Using the zone lookup data and the Yellow October 2024 data, what is the name of the LEAST frequent pickup location Zone?

- Governor's Island/Ellis Island/Liberty Island
- Arden Heights
- Rikers Island
- Jamaica Bay

In [13]:
!mkdir -p ./../data/zone
!wget https://d37ci6vzurychx.cloudfront.net/misc/taxi_zone_lookup.csv -O ./../data/zone/taxi_zone_lookup.csv

--2025-03-09 12:51:13--  https://d37ci6vzurychx.cloudfront.net/misc/taxi_zone_lookup.csv
Resolving d37ci6vzurychx.cloudfront.net (d37ci6vzurychx.cloudfront.net)... 3.164.82.112, 3.164.82.197, 3.164.82.160, ...
Connecting to d37ci6vzurychx.cloudfront.net (d37ci6vzurychx.cloudfront.net)|3.164.82.112|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 12331 (12K) [text/csv]
Saving to: ‘./../data/zone/taxi_zone_lookup.csv’


2025-03-09 12:51:14 (1.45 GB/s) - ‘./../data/zone/taxi_zone_lookup.csv’ saved [12331/12331]



In [20]:
zones_schema = types.StructType([
    types.StructField('LocationID', types.IntegerType(), True), 
    types.StructField('Borough', types.StringType(), True), 
    types.StructField('Zone', types.StringType(), True), 
    types.StructField('service_zone', types.StringType(), True)
])

In [43]:
df_zones = spark.read.csv("./../data/zone/taxi_zone_lookup.csv", header=True, schema=zones_schema)
# df_zones.show(1)

In [44]:
df \
    .groupBy(F.col("PULocationID")) \
    .count() \
    .orderBy(F.col("count").asc()) \
    .limit(1) \
    .join(df_zones, df.PULocationID == df_zones.LocationID, how="inner") \
    .show(truncate=False)

+------------+-----+----------+---------+---------------------------------------------+------------+
|PULocationID|count|LocationID|Borough  |Zone                                         |service_zone|
+------------+-----+----------+---------+---------------------------------------------+------------+
|105         |1    |105       |Manhattan|Governor's Island/Ellis Island/Liberty Island|Yellow Zone |
+------------+-----+----------+---------+---------------------------------------------+------------+
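
The cell above groups by pickup location, counts, sorts ascending, keeps the first row, and joins it to the zone lookup. The same sequence on plain Python data is sketched below as a toy example; the IDs and zone names in `pickups` and `zones` are placeholders chosen for illustration, not values from the dataset.

```python
from collections import Counter

# Toy stand-ins for the PULocationID column and the zone lookup table.
pickups = [1, 2, 2, 3, 3, 3]
zones = {1: "Zone A", 2: "Zone B", 3: "Zone C"}

# Equivalent of groupBy("PULocationID").count()
counts = Counter(pickups)

# Equivalent of orderBy(count asc).limit(1); ties are broken by the lower ID here.
least_id, least_count = min(counts.items(), key=lambda kv: (kv[1], kv[0]))

# Equivalent of the inner join on LocationID
print(least_id, least_count, zones[least_id])  # 1 1 Zone A
```

A count of 1, as in the result above, may be shared by several zones. In that case `limit(1)` returns whichever row happens to sort first, so adding a secondary sort key would make the result deterministic.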


