# Victor Papin - TD 2

# Exploratory Data Analysis with PySpark and Spark SQL

This notebook uses New York City taxi data from [TLC Trip Record Data](https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page).

## Instructions

- Load and explore NYC taxi data from January 2019. The exercises can be done in PySpark or Spark SQL (a subset of the questions will be re-answered in whichever language was not chosen for the main work).
- Load the zone lookup table to answer the questions about the NYC boroughs.
- Load NYC taxi data from January 2025 and compare the two months.
- With any remaining time, work on the "Where to go from here" section.
- The lab due date is TBD (due dates will be updated in the README for the class repo).

In [1]:
# Define the name of the new catalog
catalog = 'taxi_eda_db'

# define variables for the trips data
schema = 'yellow_taxi_trips'
volume = 'data'
file_name = 'yellow_tripdata_2019-01.parquet'
table_name = 'tbl_yellow_taxi_trips'
path_volume = f'/Volumes/{catalog}/{schema}/{volume}'
path_table = f'{catalog}.{schema}'
download_url = 'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2019-01.parquet'

In [2]:
# create the catalog/schema/volume
spark.sql('create catalog if not exists ' + catalog)
spark.sql('create schema if not exists ' + catalog + '.' + schema)
spark.sql('create volume if not exists ' + catalog + '.' + schema + '.' + volume)

NameError: name 'spark' is not defined

In [3]:
# Get the data
dbutils.fs.cp(download_url, f"{path_volume}/{file_name}")

NameError: name 'dbutils' is not defined

In [4]:
# Create the DataFrame. Parquet files carry their own schema, so the
# CSV-only options (header, inferSchema, sep) are not needed here.
df_trips = spark.read.parquet(f"{path_volume}/{file_name}")

NameError: name 'spark' is not defined

In [5]:
# Show the dataframe
df_trips.show()

NameError: name 'df_trips' is not defined

In [6]:
# Spark and dbutils are unavailable in this environment (see the NameErrors
# above), so the rest of the notebook uses a synthetic pandas stand-in with
# the same column names as the yellow taxi trip data.
import datetime
import random

import numpy as np
import pandas as pd

n = 1000
np.random.seed(42)
random.seed(42)

# Pickups start within 31 days of 2019-01-01; trips last 5 to 60 whole minutes
start_date = datetime.datetime(2019, 1, 1)
pickup_times = [start_date + datetime.timedelta(minutes=random.randint(0, 60 * 24 * 31)) for _ in range(n)]
trip_durations = [datetime.timedelta(minutes=random.randint(5, 60)) for _ in range(n)]
dropoff_times = [pickup_times[i] + trip_durations[i] for i in range(n)]

# Fares are linear in distance plus noise; tips are 0-20% of the fare
passenger_counts = np.random.randint(1, 7, size=n)
trip_distances = np.random.uniform(0.5, 20.0, size=n)
fare_amounts = trip_distances * 2.5 + 2.5 + np.random.normal(0, 2.0, size=n)
tip_amounts = fare_amounts * np.random.uniform(0, 0.2, size=n)
extras = np.random.choice([0.0, 0.5, 1.0, 2.5, 3.0, 5.0], size=n)
PULocationIDs = np.random.randint(1, 6, size=n)
DOLocationIDs = np.random.randint(1, 6, size=n)

df_trips = pd.DataFrame({
    'trip_id': range(1, n + 1),
    'tpep_pickup_datetime': pickup_times, 'tpep_dropoff_datetime': dropoff_times,
    'passenger_count': passenger_counts, 'trip_distance': trip_distances,
    'fare_amount': fare_amounts, 'extra': extras, 'tip_amount': tip_amounts,
    'PULocationID': PULocationIDs, 'DOLocationID': DOLocationIDs,
})

# Derived columns used throughout the lab
df_trips['trip_time_minutes'] = (df_trips['tpep_dropoff_datetime'] - df_trips['tpep_pickup_datetime']).dt.total_seconds() / 60
df_trips['date'] = df_trips['tpep_pickup_datetime'].dt.date
df_trips['day_of_week'] = df_trips['tpep_pickup_datetime'].dt.day_name()

# Time-of-day buckets: the first matching condition wins, 18:00 onward is Evening
hours = df_trips['tpep_pickup_datetime'].dt.hour
conditions = [(hours < 6), (hours < 12), (hours < 18)]
labels = ['Late Night', 'Morning', 'Afternoon']
df_trips['time_of_day'] = np.select(conditions, labels, default='Evening')


## Lab

### Part 1
This section can be completed using either PySpark commands or SQL commands. (A later section re-answers a self-chosen subset of the questions in the language not used for the main section; i.e., if PySpark is chosen for the main lab, SQL should be used to repeat some of the questions.)

- Add a column that creates a unique key to identify each record, in order to answer questions about individual trips.
- Which trip has the highest passenger count?
- What is the average passenger count?
- Shortest/longest trip by distance? By time?
- Busiest/slowest single day?
- Busiest/slowest time of day? (You may want to bucket these by hour or create times such as morning, afternoon, evening, late night.)
- On average, which day of the week is slowest/busiest?
- Does trip distance or number of passengers affect tip amount?
- What was the highest "extra" charge, and on which trip?
- Are there any data points that seem strange or look like outliers? (Make sure to explain your reasoning in a markdown cell.)
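
The time-of-day buckets suggested in the list above can be expressed as a small function. This is a minimal sketch that assumes the same cut-offs as the `time_of_day` column built in the setup cell (before 06:00 is Late Night, before 12:00 Morning, before 18:00 Afternoon, otherwise Evening). The name `bucket_hour` and the sample timestamps are illustrative, not part of the lab.

```python
from collections import Counter
from datetime import datetime


def bucket_hour(hour: int) -> str:
    """Map an hour of day (0-23) to a coarse time-of-day label."""
    if not 0 <= hour <= 23:
        raise ValueError(f"hour must be in 0..23, got {hour}")
    if hour < 6:
        return "Late Night"
    if hour < 12:
        return "Morning"
    if hour < 18:
        return "Afternoon"
    return "Evening"


# Hypothetical pickup timestamps, for illustration only.
pickups = [
    datetime(2019, 1, 1, 0, 30),
    datetime(2019, 1, 1, 5, 59),
    datetime(2019, 1, 1, 6, 0),
    datetime(2019, 1, 1, 13, 15),
    datetime(2019, 1, 1, 23, 45),
]

# Count trips per bucket; most_common() lists the busiest bucket first.
bucket_counts = Counter(bucket_hour(ts.hour) for ts in pickups)
print(bucket_counts.most_common())
```

Bucketing by a function keeps the cut-offs in one place, so changing what counts as "Morning" does not require touching the aggregation.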

In [7]:
df_trips['trip_key']='y2019_'+df_trips['trip_id'].astype(str)
print(df_trips[['trip_id','trip_key']].head(5))

   trip_id trip_key
0        1  y2019_1
1        2  y2019_2
2        3  y2019_3
3        4  y2019_4
4        5  y2019_5


In [8]:
mx=df_trips['passenger_count'].max()
res=df_trips[df_trips['passenger_count']==mx][['trip_id','passenger_count']]
print(mx)
print(res.head(10))

6
    trip_id  passenger_count
12       13                6
16       17                6
17       18                6
24       25                6
34       35                6
35       36                6
36       37                6
50       51                6
63       64                6
69       70                6


In [9]:
print(round(df_trips['passenger_count'].mean(),3))

3.457


In [10]:
rmin_d=df_trips.loc[df_trips['trip_distance'].idxmin(),['trip_id','trip_distance']]
rmax_d=df_trips.loc[df_trips['trip_distance'].idxmax(),['trip_id','trip_distance']]
rmin_t=df_trips.loc[df_trips['trip_time_minutes'].idxmin(),['trip_id','trip_time_minutes']]
rmax_t=df_trips.loc[df_trips['trip_time_minutes'].idxmax(),['trip_id','trip_time_minutes']]
print('min_dist',float(rmin_d['trip_distance']),int(rmin_d['trip_id']))
print('max_dist',float(rmax_d['trip_distance']),int(rmax_d['trip_id']))
print('min_time',float(rmin_t['trip_time_minutes']),int(rmin_t['trip_id']))
print('max_time',float(rmax_t['trip_time_minutes']),int(rmax_t['trip_id']))

min_dist 0.5903244485897557 158
max_dist 19.988567652527998 801
min_time 5.0 42
max_time 60.0 39
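
`idxmin` and `idxmax` return only the first row that attains the extreme value. The synthetic durations are whole minutes between 5 and 60 spread over 1,000 trips, so many trips necessarily tie for the shortest and longest time, and trips 42 and 39 are just the first of each group. A small sketch, with made-up rows and an illustrative helper named `all_extremes`, that reports every tied trip:

```python
def all_extremes(rows, key):
    """Return (ids_at_min, ids_at_max) for a numeric field.

    `rows` is a list of dicts that each have a 'trip_id' and the `key` field.
    """
    values = [row[key] for row in rows]
    lo, hi = min(values), max(values)
    at_min = [row["trip_id"] for row in rows if row[key] == lo]
    at_max = [row["trip_id"] for row in rows if row[key] == hi]
    return at_min, at_max


# Made-up trips, for illustration only.
trips = [
    {"trip_id": 1, "trip_time_minutes": 5.0},
    {"trip_id": 2, "trip_time_minutes": 42.0},
    {"trip_id": 3, "trip_time_minutes": 60.0},
    {"trip_id": 4, "trip_time_minutes": 5.0},
    {"trip_id": 5, "trip_time_minutes": 60.0},
]

shortest_ids, longest_ids = all_extremes(trips, "trip_time_minutes")
print(shortest_ids, longest_ids)
```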


In [11]:
by_date=df_trips.groupby('date').size().sort_values(ascending=False)
print('busiest',by_date.index[0],int(by_date.iloc[0]))
print('slowest',by_date.index[-1],int(by_date.iloc[-1]))

busiest 2019-01-13 49
slowest 2019-01-17 23


In [12]:
g=df_trips.groupby('time_of_day').size().sort_values(ascending=False)
print('busiest',g.index[0],int(g.iloc[0]))
print('slowest',g.index[-1],int(g.iloc[-1]))

busiest Late Night 284
slowest Evening 218


In [13]:
g=df_trips.groupby('day_of_week').size().sort_values(ascending=False)
print('busiest',g.index[0],int(g.iloc[0]))
print('slowest',g.index[-1],int(g.iloc[-1]))

busiest Sunday 155
slowest Monday 120
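
The counts above are monthly totals, which is not quite the same as an average per weekday: a 31-day month contains some weekdays five times and others four times. A sketch that counts the occurrences and divides, using only the two totals printed above (the helper name `weekday_occurrences` is illustrative):

```python
from collections import Counter
from datetime import date, timedelta

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def weekday_occurrences(year: int, month: int) -> Counter:
    """Count how many times each weekday name occurs in the given month."""
    day = date(year, month, 1)
    counts = Counter()
    while day.month == month:
        counts[WEEKDAYS[day.weekday()]] += 1
        day += timedelta(days=1)
    return counts


occurrences = weekday_occurrences(2019, 1)

# Totals copied from the output above; the other weekdays are not printed there.
totals = {"Sunday": 155, "Monday": 120}
avg_per_day = {name: total / occurrences[name] for name, total in totals.items()}

print(dict(occurrences))
print(avg_per_day)
```

With the full per-weekday totals, the same division gives a ranking that is not biased toward weekdays that simply occur more often in the month.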


In [14]:
cd=df_trips['trip_distance'].corr(df_trips['tip_amount'])
cp=df_trips['passenger_count'].corr(df_trips['tip_amount'])
print(round(cd,3),round(cp,3))

0.603 -0.019
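
The two numbers above are Pearson correlation coefficients, the default for `Series.corr`. Because the synthetic tips are generated as a fraction of the fare, and the fare is a linear function of distance, a positive distance/tip correlation is built into the data rather than discovered, and a correlation on its own does not show that one variable causes the other. A minimal sketch of the computation on made-up values, with an illustrative helper named `pearson`:

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up values, for illustration only.
distances = [1.0, 2.0, 3.0, 4.0, 5.0]
tips = [0.5, 1.1, 1.4, 2.1, 2.4]   # grows roughly in step with distance
passengers = [3, 1, 4, 1, 3]       # no clear relation to the tips

r_distance = pearson(distances, tips)
r_passengers = pearson(passengers, tips)
print(round(r_distance, 3), round(r_passengers, 3))
```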


In [15]:
mx=df_trips['extra'].max()
subset=df_trips[df_trips['extra']==mx][['trip_id','extra']]
print(mx)
print(len(subset))
print(subset.head(10))

5.0
162
    trip_id  extra
1         2    5.0
8         9    5.0
11       12    5.0
13       14    5.0
16       17    5.0
26       27    5.0
29       30    5.0
35       36    5.0
45       46    5.0
54       55    5.0


In [16]:
q1=df_trips['trip_distance'].quantile(0.25)
q3=df_trips['trip_distance'].quantile(0.75)
iqr=q3-q1
lower=q1-1.5*iqr
upper=q3+1.5*iqr
outliers=df_trips[(df_trips['trip_distance']<lower)|(df_trips['trip_distance']>upper)]
print(len(outliers))

0


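Conclusion (outliers): if the printed count is small, there are few anomalies; otherwise, inspect the distribution of distances. Here the count is 0, which is expected because the synthetic distances are drawn uniformly between 0.5 and 20.0, so none falls outside the 1.5 × IQR fences.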
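
To see the 1.5 × IQR rule actually flag something, the sketch below applies it to a small made-up list containing one implausibly long trip. It uses only the standard library. Choosing `method="inclusive"` for `statistics.quantiles` is an assumption made here so that the quartiles are linearly interpolated between data points, which should mirror the default behavior of pandas `quantile`. The helper name `iqr_outliers` is illustrative.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values that fall outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Made-up distances in miles; 80.0 is a deliberately implausible entry.
sample_distances = [1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 80.0]
flagged = iqr_outliers(sample_distances)
print(flagged)
```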

### Part 2

- Using the code for loading the first dataset as an example, load the taxi zone lookup and answer the following questions.
- Which borough had the most pickups? Dropoffs?
- What are the busy/slow times by borough?
- What are the busiest days of the week by borough?
- What is the average trip distance by borough?
- What is the average trip fare by borough?
- What are the highest/lowest fare amounts for a trip, and which borough is associated with each?
- Load the dataset from the most recently available January. Is there a change in any of the average metrics?
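
Joining trips to the zone lookup is an inner join by default in pandas `merge`, so any trip whose location ID is missing from the lookup silently disappears from the borough counts. A hedged sketch of how to check for that, using an in-memory SQLite database. The table and column names mirror this notebook, but the rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE zones (LocationID INTEGER PRIMARY KEY, Borough TEXT);
    CREATE TABLE trips (trip_id INTEGER PRIMARY KEY, PULocationID INTEGER);
    INSERT INTO zones VALUES (1, 'Manhattan'), (2, 'Brooklyn');
    -- Trip 4 points at a LocationID that is absent from the lookup.
    INSERT INTO trips VALUES (1, 1), (2, 2), (3, 1), (4, 99);
""")

# LEFT JOIN keeps every trip; trips with no matching zone get NULL zone columns.
unmatched = conn.execute("""
    SELECT t.trip_id, t.PULocationID
    FROM trips AS t
    LEFT JOIN zones AS z ON t.PULocationID = z.LocationID
    WHERE z.LocationID IS NULL
""").fetchall()

# Pickups per borough, busiest first, counting matched trips only.
pickups_by_borough = conn.execute("""
    SELECT z.Borough, COUNT(*) AS pickups
    FROM trips AS t
    JOIN zones AS z ON t.PULocationID = z.LocationID
    GROUP BY z.Borough
    ORDER BY pickups DESC
""").fetchall()
conn.close()

print(unmatched)
print(pickups_by_borough)
```

If the unmatched list is non-empty, the borough totals will not add up to the number of trips, which is worth stating next to any "most pickups" answer.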

In [17]:
import pandas as pd
zones=pd.DataFrame({'LocationID':[1,2,3,4,5],'Borough':['Manhattan','Brooklyn','Queens','Bronx','Staten Island']})
df_pickup=df_trips.merge(zones,left_on='PULocationID',right_on='LocationID')
df_dropoff=df_trips.merge(zones,left_on='DOLocationID',right_on='LocationID')
print(df_pickup.shape)

(1000, 17)


In [18]:
p=df_pickup.groupby('Borough').size().sort_values(ascending=False)
d=df_dropoff.groupby('Borough').size().sort_values(ascending=False)
print('pickups',p.index[0],int(p.iloc[0]))
print('dropoffs',d.index[0],int(d.iloc[0]))
print(p)
print(d)

pickups Brooklyn 227
dropoffs Brooklyn 234
Borough
Brooklyn         227
Manhattan        202
Staten Island    200
Bronx            194
Queens           177
dtype: int64
Borough
Brooklyn         234
Staten Island    203
Queens           196
Manhattan        185
Bronx            182
dtype: int64


In [19]:
g=df_pickup.groupby(['Borough','time_of_day']).size().reset_index(name='count')
idx=g.groupby('Borough')['count'].idxmax()
idx2=g.groupby('Borough')['count'].idxmin()
print('busiest_by_borough')
print(g.loc[idx].reset_index(drop=True))
print('slowest_by_borough')
print(g.loc[idx2].reset_index(drop=True))

busiest_by_borough
         Borough time_of_day  count
0          Bronx  Late Night     64
1       Brooklyn  Late Night     72
2      Manhattan     Morning     60
3         Queens  Late Night     48
4  Staten Island  Late Night     58
slowest_by_borough
         Borough time_of_day  count
0          Bronx     Evening     36
1       Brooklyn   Afternoon     45
2      Manhattan  Late Night     42
3         Queens     Evening     36
4  Staten Island   Afternoon     46


In [20]:
g=df_pickup.groupby(['Borough','day_of_week']).size().reset_index(name='count')
idx=g.groupby('Borough')['count'].idxmax()
print(g.loc[idx].reset_index(drop=True))

         Borough day_of_week  count
0          Bronx      Friday     37
1       Brooklyn     Tuesday     37
2      Manhattan     Tuesday     35
3         Queens    Thursday     34
4  Staten Island      Sunday     40




In [21]:
print(df_pickup.groupby('Borough')['trip_distance'].mean().sort_values(ascending=False))

Borough
Queens           10.566802
Staten Island    10.386284
Bronx            10.337919
Brooklyn         10.182001
Manhattan         9.855275
Name: trip_distance, dtype: float64


In [22]:
print(df_pickup.groupby('Borough')['fare_amount'].mean().sort_values(ascending=False))

Borough
Queens           29.218161
Staten Island    28.403549
Bronx            28.334713
Brooklyn         28.067635
Manhattan        27.084104
Name: fare_amount, dtype: float64


In [23]:
mx=df_pickup['fare_amount'].idxmax()
mn=df_pickup['fare_amount'].idxmin()
print('max_fare',float(df_pickup.loc[mx,'fare_amount']),df_pickup.loc[mx,'Borough'])
print('min_fare',float(df_pickup.loc[mn,'fare_amount']),df_pickup.loc[mn,'Borough'])

max_fare 54.723256500667404 Staten Island
min_fare 1.5065662077230084 Staten Island


In [24]:
n2=100
start2=datetime.datetime(2025,1,1)
p2=[start2+datetime.timedelta(minutes=random.randint(0,60*24*31)) for _ in range(n2)]
t2=[datetime.timedelta(minutes=random.randint(5,60)) for _ in range(n2)]
d2=[p2[i]+t2[i] for i in range(n2)]
pc2=np.random.randint(1,7,size=n2)
dist2=np.random.uniform(0.5,20.0,size=n2)
fare2=dist2*2.5+2.5+np.random.normal(0,2.0,size=n2)
recent=pd.DataFrame({'trip_id':range(1,n2+1),'tpep_pickup_datetime':p2,'tpep_dropoff_datetime':d2,'passenger_count':pc2,'trip_distance':dist2,'fare_amount':fare2})
print('distance_2019',df_trips['trip_distance'].mean())
print('distance_2025',recent['trip_distance'].mean())
print('passenger_2019',df_trips['passenger_count'].mean())
print('passenger_2025',recent['passenger_count'].mean())
print('fare_2019',df_trips['fare_amount'].mean())
print('fare_2025',recent['fare_amount'].mean())

distance_2019 10.255216862761452
distance_2025 10.217044710628052
passenger_2019 3.457
passenger_2025 3.56
fare_2019 28.191600885371315
fare_2025 28.16574191728123
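
To make the comparison easier to read, the averages printed above can be turned into relative changes. Both samples here are synthetic draws from the same distributions (1,000 trips for 2019, 100 for 2025), so any difference reflects sampling noise rather than a real shift in ridership. The helper name `pct_change` is illustrative.

```python
def pct_change(old: float, new: float) -> float:
    """Relative change from old to new, in percent."""
    if old == 0:
        raise ValueError("old value must be non-zero")
    return (new - old) / old * 100.0


# Averages copied from the output above, as (2019, 2025) pairs.
averages = {
    "trip_distance": (10.255216862761452, 10.217044710628052),
    "passenger_count": (3.457, 3.56),
    "fare_amount": (28.191600885371315, 28.16574191728123),
}

changes = {name: pct_change(old, new) for name, (old, new) in averages.items()}
for name, change in changes.items():
    print(f"{name}: {change:+.2f}%")
```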


### Part 3

- Choose 3 questions from above and re-answer them using the language you did not use for the main notebook (i.e., if you completed the exercise in Python, redo 3 questions in pure SQL). At least one of the redone questions must involve a join.

In [3]:
import sqlite3, pandas as pd

conn = sqlite3.connect(":memory:")
df_trips.to_sql("trips", conn, index=False, if_exists="replace")
zones = pd.DataFrame({"LocationID":[1,2,3,4,5],"Borough":["Manhattan","Brooklyn","Queens","Bronx","Staten Island"]})
zones.to_sql("zones", conn, index=False, if_exists="replace")


# Question 1 (join): average trip distance by pickup borough
q1 = pd.read_sql_query("""
SELECT z.Borough, AVG(t.trip_distance) AS avg_distance
FROM trips t
JOIN zones z ON t.PULocationID = z.LocationID
GROUP BY z.Borough
ORDER BY avg_distance DESC
""", conn)
print(q1)

# Question 2 (join): average fare by pickup borough

q2 = pd.read_sql_query("""
SELECT z.Borough, AVG(t.fare_amount) AS avg_fare
FROM trips t
JOIN zones z ON t.PULocationID = z.LocationID
GROUP BY z.Borough
ORDER BY avg_fare DESC
""", conn)
print(q2)

# Question 3: busiest and slowest day of the week by trip count

q3 = pd.read_sql_query("""
SELECT day_of_week, COUNT(*) AS trips
FROM trips
GROUP BY day_of_week
ORDER BY trips DESC
""", conn)
print("busiest:", q3.iloc[0]["day_of_week"], int(q3.iloc[0]["trips"]))
print("calmest:", q3.iloc[-1]["day_of_week"], int(q3.iloc[-1]["trips"]))
print(q3)


NameError: name 'df_trips' is not defined
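
The queries above read the `day_of_week` column that was precomputed in pandas. When working in pure SQL, derived fields such as the time-of-day bucket can be computed inside the query instead. A self-contained sketch using the standard-library `sqlite3` module with made-up rows. SQLite's `strftime` is used here as an assumption about the engine; Spark SQL would use its own `hour()` function.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE trips (trip_id INTEGER PRIMARY KEY, tpep_pickup_datetime TEXT)"
)
# Hypothetical pickup timestamps, for illustration only.
conn.executemany(
    "INSERT INTO trips VALUES (?, ?)",
    [
        (1, "2019-01-01 00:30:00"),
        (2, "2019-01-01 05:59:00"),
        (3, "2019-01-01 06:00:00"),
        (4, "2019-01-01 13:15:00"),
        (5, "2019-01-01 23:45:00"),
    ],
)

# Bucket each pickup by hour inside the query, then count trips per bucket.
rows = conn.execute("""
    SELECT CASE
             WHEN CAST(strftime('%H', tpep_pickup_datetime) AS INTEGER) < 6  THEN 'Late Night'
             WHEN CAST(strftime('%H', tpep_pickup_datetime) AS INTEGER) < 12 THEN 'Morning'
             WHEN CAST(strftime('%H', tpep_pickup_datetime) AS INTEGER) < 18 THEN 'Afternoon'
             ELSE 'Evening'
           END AS time_of_day,
           COUNT(*) AS trip_count
    FROM trips
    GROUP BY time_of_day
    ORDER BY trip_count DESC, time_of_day
""").fetchall()
conn.close()

print(rows)
```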

### Part 4

As of Spark v4, DataFrames have native visualization support. Choose at least 3 questions from above and provide visualizations.


In [None]:
import matplotlib.pyplot as plt
import pandas as pd

# Plot 1: pickups by borough (bar chart)
zones = pd.DataFrame({"LocationID":[1,2,3,4,5],"Borough":["Manhattan","Brooklyn","Queens","Bronx","Staten Island"]})
df_pickup = df_trips.merge(zones, left_on="PULocationID", right_on="LocationID")
s = df_pickup.groupby("Borough").size().sort_values(ascending=False)
plt.figure()
s.plot(kind="bar")
plt.title("Pickups by Borough")
plt.xlabel("Borough")
plt.ylabel("Pickups")
plt.xticks(rotation=45)
plt.tight_layout()

# Plot 2: trips per day (line chart)
daily = df_trips.groupby("date").size().sort_index()
plt.figure()
daily.plot()
plt.title("Trips per Day (January)")
plt.xlabel("Date")
plt.ylabel("Trips")
plt.tight_layout()

# Plot 3: trip distance by time of day (box plot)
cats = ["Late Night","Morning","Afternoon","Evening"]
data = [df_trips.loc[df_trips["time_of_day"]==c, "trip_distance"] for c in cats]
plt.figure()
plt.boxplot(data)
# Label the boxes through xticks; boxplot draws them at positions 1..N
plt.xticks(range(1, len(cats) + 1), cats)
plt.title("Trip Distance by Time of Day")
plt.xlabel("Time of Day")
plt.ylabel("Trip Distance")
plt.tight_layout()

plt.show()


# Where to go from here

- Continue building the dataset by loading more data: start by completing the data for 2019, then calculate the busiest season (fall, winter, spring, summer).
- Explore a dataset or datasets of your choosing.
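
For the busiest-season question, one possible starting point is a month-to-season mapping followed by a count. The sketch below assumes meteorological seasons for the Northern Hemisphere (winter is December through February, and so on). That definition, the helper name `season_of`, and the sample dates are all assumptions made for illustration, not part of the lab.

```python
from collections import Counter
from datetime import date


def season_of(month: int) -> str:
    """Map a month number (1-12) to a meteorological season name."""
    if not 1 <= month <= 12:
        raise ValueError(f"month must be in 1..12, got {month}")
    if month in (12, 1, 2):
        return "winter"
    if month in (3, 4, 5):
        return "spring"
    if month in (6, 7, 8):
        return "summer"
    return "fall"


# Hypothetical pickup dates, for illustration only.
pickup_dates = [
    date(2019, 1, 15),
    date(2019, 2, 3),
    date(2019, 4, 20),
    date(2019, 7, 4),
    date(2019, 7, 19),
    date(2019, 7, 30),
    date(2019, 10, 31),
]

season_counts = Counter(season_of(d.month) for d in pickup_dates)
busiest_season, busiest_count = season_counts.most_common(1)[0]
print(busiest_season, busiest_count)
```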