# 2 Cab Driver Perspective

In this notebook, we will analyze the taxi data from the perspective of a cab driver who wants to maximize earnings and tips.

From the cab driver's perspective, it is useful to know which pickup locations and times are best for maximizing earnings and tips, and to get an idea of which dropoff locations the driver is likely to end up in.

The first four steps are the same as before. After that, we will use additional fields for our analysis, including pickup locations, pickup times, trip duration, and earnings metrics.

In [1]:
%%time
#We will import the necessary Python libraries in this step. The %%time command keeps track of the execution time for each step
import sqlite3         # Provides powerful relational database query capabilities using the SQL language
import pandas as pd    # Pandas provides a powerful DataFrame to manipulate and analyze tabular data in memory

Wall time: 2.93 s


In [2]:
%%time
#We connect to a SQLite database. This database was prepared using the notebook "00 Prepare Taxi Trip Data"
#We will examine the contents of this database by looking at the sqlite_master table
cn = sqlite3.connect('../taxiJul.db')
master = pd.read_sql_query("SELECT * from sqlite_master;",cn)

Wall time: 383 ms


In [3]:
master

| | type | name | tbl_name | rootpage | sql |
|---|---|---|---|---|---|
| 0 | table | distinct_pu_latlong | distinct_pu_latlong | 3 | `CREATE TABLE distinct_pu_latlong(\n pu_latlon...` |
| 1 | index | latlong | distinct_pu_latlong | 2914013 | `` CREATE INDEX latlong ON `distinct_pu_latlong` ... `` |
| 2 | table | taxiJul | taxiJul | 2 | `CREATE TABLE taxiJul(\n "VendorID" TEXT,\n "...` |
| 3 | view | taxiJulEnrich | taxiJulEnrich | 0 | `` CREATE VIEW taxiJulEnrich AS SELECT *,\n1 AS `... `` |
| 4 | table | zones | zones | 1714273 | `CREATE TABLE zones(\n "LocationID" TEXT,\n "...` |
| 5 | table | 01_Tourist_Resident | 01_Tourist_Resident | 3520136 | `CREATE TABLE "01_Tourist_Resident"(\n dropoff...` |
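
The same inspection can be tried on a throwaway database. The sketch below is only an illustration: the in-memory database, the `trips` table and the `trips_enriched` view are made-up stand-ins, not objects from `taxiJul.db`. It shows that `sqlite_master` lists both tables and views, which is how the `taxiJulEnrich` view appears in the output above.

```python
import sqlite3
import pandas as pd

# Hypothetical in-memory database; nothing here touches taxiJul.db
cn_demo = sqlite3.connect(":memory:")
cn_demo.execute("CREATE TABLE trips (zone TEXT, tip REAL)")
cn_demo.execute("CREATE VIEW trips_enriched AS SELECT *, 1 AS ride FROM trips")

# sqlite_master holds one row per table, index, or view in the database
objects = pd.read_sql_query(
    "SELECT type, name FROM sqlite_master ORDER BY name;", cn_demo
)
print(objects)
```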


In [4]:
#We are interested in the taxiJulEnrich view
#Read the first two rows of this view to examine the columns available
sample = pd.read_sql_query("SELECT * from taxiJulEnrich LIMIT 2;",cn)
sample.columns

Index(['VendorID', 'tpep_pickup_datetime', 'tpep_dropoff_datetime',
       'passenger_count', 'trip_distance', 'pickup_longitude',
       'pickup_latitude', 'RatecodeID', 'store_and_fwd_flag',
       'dropoff_longitude', 'dropoff_latitude', 'payment_type', 'fare_amount',
       'extra', 'mta_tax', 'tip_amount', 'tolls_amount',
       'improvement_surcharge', 'total_amount', 'pickup_OBJECTID',
       'pickup_Shape_Leng', 'pickup_Shape_Area', 'pickup_zone',
       'pickup_LocationID', 'pickup_borough', 'dropoff_OBJECTID',
       'dropoff_Shape_Leng', 'dropoff_Shape_Area', 'dropoff_zone',
       'dropoff_Location', 'dropoff_borough', 'count', 'pu_year', 'pu_month',
       'pu_day', 'pu_hour', 'pu_minute', 'pu_second', 'pu_weekday', 'do_year',
       'do_month', 'do_day', 'do_hour', 'do_minute', 'do_second', 'do_weekday',
       'pu_latlong'],
      dtype='object')

In [5]:
%%time
#In this step, we use SQLite to group the taxi trips by pickup zone, pickup weekday, pickup hour, and dropoff zone
#For each group we return the ride count and the summed trip durations, tips, and total amounts
#The original data set does not contain trip durations, so we calculate them in minutes
#by subtracting the pickup time from the dropoff time
#Due to the large size of the dataset (11 million+ records), it is more efficient to aggregate in SQLite than to
#load everything into memory. The results are sorted by ride count in descending order
query = """
SELECT `pickup_zone`, `pu_weekday`, `pu_hour`, `dropoff_zone`,
       SUM((julianday(`tpep_dropoff_datetime`) - julianday(`tpep_pickup_datetime`))*24*60) AS `tripduration_minutes`,
       SUM(`count`) AS `ridecount`,
       SUM(`tip_amount`) AS `sum_tip_amount`,
       SUM(`total_amount`) AS `sum_total_amount`
FROM taxiJulEnrich
GROUP BY `pickup_zone`, `pu_weekday`, `pu_hour`, `dropoff_zone`
ORDER BY `ridecount` DESC;
"""
df = pd.read_sql_query(query, cn)

Wall time: 1min 30s
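
To see how the duration arithmetic works, here is a minimal sketch on a tiny in-memory table. The table name `trips`, its columns, and the two rows are invented for illustration, and `COUNT(*)` is used here in place of `SUM(count)`. `julianday()` returns a fractional day number, so the difference between two timestamps is in days, and multiplying by 24*60 converts it to minutes.

```python
import sqlite3

# Hypothetical two-row table standing in for taxiJulEnrich
cn_demo = sqlite3.connect(":memory:")
cn_demo.execute("CREATE TABLE trips (zone TEXT, pickup TEXT, dropoff TEXT, tip REAL)")
cn_demo.executemany(
    "INSERT INTO trips VALUES (?, ?, ?, ?)",
    [
        ("Zone A", "2015-07-01 08:00:00", "2015-07-01 08:15:00", 2.0),  # 15 minutes
        ("Zone A", "2015-07-01 09:00:00", "2015-07-01 09:30:00", 3.0),  # 30 minutes
    ],
)

rows = cn_demo.execute(
    """
    SELECT zone,
           SUM((julianday(dropoff) - julianday(pickup)) * 24 * 60) AS minutes,
           COUNT(*) AS ridecount,
           SUM(tip) AS sum_tip
    FROM trips
    GROUP BY zone
    """
).fetchall()

zone, minutes, ridecount, sum_tip = rows[0]
# The duration is a float and may differ from 45 by a tiny rounding error
print(zone, round(minutes, 2), ridecount, sum_tip)
```

The small rounding error explains values such as 6794.200014 in the real output: the durations pass through floating-point day fractions.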


In [6]:
df.head()

| | pickup_zone | pu_weekday | pu_hour | dropoff_zone | tripduration_minutes | ridecount | sum_tip_amount | sum_total_amount |
|---|---|---|---|---|---|---|---|---|
| 0 | Penn Station/Madison Sq West | 4 | 7 | Midtown Center | 7027.88334 | 768 | 806.41 | 7431.81 |
| 1 | Penn Station/Madison Sq West | 3 | 7 | Midtown Center | 6794.200014 | 763 | 803.5 | 7284.9 |
| 2 | Upper East Side South | 3 | 8 | Midtown East | 8257.566678 | 762 | 853.68 | 7749.88 |
| 3 | Upper East Side South | 3 | 14 | Upper East Side North | 5851.283344 | 749 | 543.3 | 5715.9 |
| 4 | Upper East Side North | 3 | 12 | Upper East Side South | 8323.500002 | 729 | 685.84 | 7186.04 |


In [7]:
%%time
#In this step, we make the data more user-friendly and useful for analysis
#Firstly, weekday numbers (e.g. 1) are converted to names (e.g. Monday)
#Secondly, hours are converted from strings to numbers
#Thirdly, we calculate average ride duration, average tip per minute and average amount per minute (in dollars)
weekday_names = {'0': 'Sunday', '1': 'Monday', '2': 'Tuesday', '3': 'Wednesday',
                 '4': 'Thursday', '5': 'Friday', '6': 'Saturday'}
df['pu_weekday'] = df['pu_weekday'].replace(weekday_names)
df['pu_hour'] = pd.to_numeric(df['pu_hour'])
df['avg_tripduration_minutes'] = df['tripduration_minutes'] / df['ridecount']
df['avg_tip_perminute'] = df['sum_tip_amount'] / df['ridecount'] / df['avg_tripduration_minutes']
df['avg_amount_perminute'] = df['sum_total_amount'] / df['ridecount'] / df['avg_tripduration_minutes']

Wall time: 1.39 s
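
The per-minute metrics can be checked on a small example. The sketch below uses a made-up two-row frame (`demo`) with the same column names as `df`. It also shows that dividing the summed tips by the ride count and then by the average duration gives the same result as dividing the summed tips by the summed duration. Note that `map` is used here instead of `replace`: unlike `replace`, it would turn any weekday value missing from the dictionary into NaN.

```python
import pandas as pd

# Hypothetical aggregated rows in the same shape as df
demo = pd.DataFrame({
    "pu_weekday": ["0", "4"],
    "pu_hour": ["7", "18"],
    "tripduration_minutes": [100.0, 50.0],
    "ridecount": [10, 5],
    "sum_tip_amount": [20.0, 5.0],
})

weekday_names = {"0": "Sunday", "1": "Monday", "2": "Tuesday", "3": "Wednesday",
                 "4": "Thursday", "5": "Friday", "6": "Saturday"}
demo["pu_weekday"] = demo["pu_weekday"].map(weekday_names)
demo["pu_hour"] = pd.to_numeric(demo["pu_hour"])

# Form used in the notebook: tips per ride divided by minutes per ride
demo["avg_tripduration_minutes"] = demo["tripduration_minutes"] / demo["ridecount"]
long_form = demo["sum_tip_amount"] / demo["ridecount"] / demo["avg_tripduration_minutes"]

# Equivalent shorter form: total tips divided by total minutes
short_form = demo["sum_tip_amount"] / demo["tripduration_minutes"]

print(demo[["pu_weekday", "pu_hour"]])
print(long_form.tolist(), short_form.tolist())
```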


In [8]:
df.head()

| | pickup_zone | pu_weekday | pu_hour | dropoff_zone | tripduration_minutes | ridecount | sum_tip_amount | sum_total_amount | avg_tripduration_minutes | avg_tip_perminute | avg_amount_perminute |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Penn Station/Madison Sq West | Thursday | 7.0 | Midtown Center | 7027.88334 | 768 | 806.41 | 7431.81 | 9.15089 | 0.114744 | 1.057475 |
| 1 | Penn Station/Madison Sq West | Wednesday | 7.0 | Midtown Center | 6794.200014 | 763 | 803.5 | 7284.9 | 8.904587 | 0.118263 | 1.072223 |
| 2 | Upper East Side South | Wednesday | 8.0 | Midtown East | 8257.566678 | 762 | 853.68 | 7749.88 | 10.836702 | 0.103382 | 0.938519 |
| 3 | Upper East Side South | Wednesday | 14.0 | Upper East Side North | 5851.283344 | 749 | 543.3 | 5715.9 | 7.812127 | 0.092851 | 0.976863 |
| 4 | Upper East Side North | Wednesday | 12.0 | Upper East Side South | 8323.500002 | 729 | 685.84 | 7186.04 | 11.417695 | 0.082398 | 0.863344 |


In [9]:
%%time
#We now export our dataset to a CSV file. This file will be used by Excel and Tableau for further analysis and visualization
df.to_csv('02_Cabdriver.csv', index_label='ROWID')

Wall time: 20.3 s
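
As a small illustration of what `index_label='ROWID'` does, the sketch below writes a made-up two-row frame to an in-memory buffer instead of a file. The DataFrame index becomes the first CSV column and is named `ROWID`, which gives tools that read the file a named row identifier.

```python
import io
import pandas as pd

# Hypothetical frame; the real export writes df to 02_Cabdriver.csv
demo = pd.DataFrame({"pickup_zone": ["Zone A", "Zone B"], "ridecount": [10, 5]})

buffer = io.StringIO()
demo.to_csv(buffer, index_label="ROWID")
csv_text = buffer.getvalue()
header = csv_text.splitlines()[0]
print(header)

# Reading the file back with ROWID as the index recovers the original columns
restored = pd.read_csv(io.StringIO(csv_text), index_col="ROWID")
print(restored)
```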


### Most Lucrative Pickup Zones

Where in New York City can a cab driver earn the most tips and get the highest-value fares? As expected, the airports top the list. However, did you know that LaGuardia Airport is more lucrative than JFK Airport?

The chart below was created using the Microsoft Excel workbook stored [here](02_CabDriver.xlsx).

This workbook contains a dynamic link to the CSV file created in the previous step. You can use Pivot Tables in Excel to further explore the dataset yourself.

![Top 15 Pickup Zones With The Maximum Tips](Top15PickupZonesWithMaxTips.png "Top 15 Pickup Zones With The Maximum Tips")

What if you look at it from the perspective of tips earned in cents per minute? This metric measures the return on the cab driver's time. Based on this metric, a different picture emerges for the top 15 pickup locations. It turns out that New York yellow cab drivers can earn a lot in tips by picking up passengers at Newark Airport!

![Top 15 Pickup Zones Based on Tips (Cents Per Minute)](Top15TipsCentsPerMinute.png "Top 15 Pickup Zones Based on Tips (Cents Per Minute)")
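
The charts above were built with Excel pivot tables, and the notebook does not show the exact pivot settings. As a rough pandas equivalent, the sketch below ranks pickup zones by tips in cents per minute. It rests on assumptions: the frame `demo` and its zone names are made up, and the metric is computed as total tips divided by total minutes per zone, which may differ from how the workbook aggregates it.

```python
import pandas as pd

# Hypothetical rows in the same shape as the exported CSV
demo = pd.DataFrame({
    "pickup_zone": ["Zone A", "Zone A", "Zone B"],
    "tripduration_minutes": [100.0, 300.0, 50.0],
    "ridecount": [10, 20, 5],
    "sum_tip_amount": [30.0, 50.0, 15.0],
})

# Sum the additive columns per pickup zone first, then take the ratio,
# so that zones are weighted by total driving time
by_zone = demo.groupby("pickup_zone")[
    ["tripduration_minutes", "ridecount", "sum_tip_amount"]
].sum()
by_zone["tip_cents_per_minute"] = (
    100 * by_zone["sum_tip_amount"] / by_zone["tripduration_minutes"]
)

top = by_zone.sort_values("tip_cents_per_minute", ascending=False).head(15)
print(top)
```

Summing before dividing avoids averaging the per-row rates, which would give a zone's rare, short trips the same weight as its busiest routes.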

## Interactive Visualization In Tableau

The following interactive visualization was created using Tableau Public. It is based on the same CSV file created above. Be sure to click on the link below to explore the visualization.

[Pickup Zones With Max Tips](https://public.tableau.com/profile/vijay.balasubramaniam#!/vizhome/02_CabDriver/PickupZonesWithMaxTips) - The most tips are earned at LaGuardia and JFK airports, as indicated by the bubble sizes. The bubble colors indicate the number of rides (darker = more rides).

This visualization was created in Tableau as follows:
```
Marks: Circle
Labels: Pickup Zone
Colors: SUM(Ridecount)
Color Palette: Automatic
Size: SUM(Tips)
Detail: MEDIAN(Avg Trip Duration), MEDIAN(Total Amount), MEDIAN(Tip Amount)
```

Note that Tableau Public is a free tool. You can download the Tableau workbook using the link above and explore it on your own if you wish.