# Tourist or Resident Perspective

In this notebook, we will analyze the taxi data from the perspective of a tourist or resident who wants to explore interesting parts of the city.

Taxi dropoffs are used as a proxy for the "popularity" of a given location. We will rank-order the taxi zones contained in the data by popularity at various times of the day and week.

Then, we will use machine learning to discover hotspots of activity by applying a clustering algorithm (unsupervised learning) to the latitude and longitude coordinates contained in the data.
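As a rough illustration of what clustering coordinates means, here is a minimal sketch. The text above does not name the algorithm, so k-means is used purely as an assumed stand-in, and the coordinates, the `kmeans` helper and the starting centers are all invented for this example.

```python
# Minimal k-means sketch on (latitude, longitude) pairs.
# Assumption: k-means is only an illustrative choice of clustering algorithm.
# The coordinates below are made up; they are not taken from the taxi data.
from math import dist

def kmeans(points, centers, iterations=10):
    """Assign each point to its nearest center, then move each center to the mean of its points."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            # Plain Euclidean distance on degrees: a simplification that is
            # tolerable only over a small area such as a single city
            nearest = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
            clusters[nearest].append(p)
        # A center with no assigned points is left where it is
        centers = [
            tuple(sum(axis) / len(group) for axis in zip(*group)) if group else center
            for group, center in zip(clusters, centers)
        ]
    return centers, clusters

dropoffs = [
    (40.750, -73.990), (40.752, -73.988), (40.751, -73.992),  # first tight group
    (40.710, -74.010), (40.712, -74.008), (40.711, -74.012),  # second tight group
]

# Fixed starting centers keep the sketch deterministic
centers, clusters = kmeans(dropoffs, centers=[dropoffs[0], dropoffs[-1]])
for center, group in zip(centers, clusters):
    print(f"hotspot near ({center[0]:.3f}, {center[1]:.3f}) with {len(group)} dropoffs")
```

Each resulting center can be read as a candidate hotspot, and the size of its group as a measure of how busy it is.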

In [1]:
%%time
#Import the required Python libraries. The %%time cell magic reports the execution time of each cell
import sqlite3         # Provides powerful relational database query capabilities using the SQL language
import pandas as pd    # Pandas provides a powerful DataFrame to manipulate and analyze tabular data in memory

Wall time: 1.69 s


In [2]:
%%time
#Connect to the SQLite database prepared by the notebook "00 Prepare Taxi Trip Data"
#The sqlite_master table lists every table, index and view in the database, so we read it to see what is available
cn = sqlite3.connect('../taxiJul.db')
master = pd.read_sql_query("SELECT * FROM sqlite_master;", cn)

Wall time: 251 ms


In [3]:
master

Unnamed: 0,type,name,tbl_name,rootpage,sql
0,table,distinct_pu_latlong,distinct_pu_latlong,3,CREATE TABLE distinct_pu_latlong(\n pu_latlon...
1,index,latlong,distinct_pu_latlong,2914013,CREATE INDEX latlong ON `distinct_pu_latlong` ...
2,table,taxiJul,taxiJul,2,"CREATE TABLE taxiJul(\n ""VendorID"" TEXT,\n ""..."
3,view,taxiJulEnrich,taxiJulEnrich,0,"CREATE VIEW taxiJulEnrich AS SELECT *,\n1 AS `..."
4,table,zones,zones,1714273,"CREATE TABLE zones(\n ""LocationID"" TEXT,\n ""..."
5,table,01_Tourist_Resident,01_Tourist_Resident,3520136,"CREATE TABLE ""01_Tourist_Resident""(\n dropoff..."
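The `type` column in the listing above separates tables, indexes and views. The following sketch shows how that column can be used to list only the queryable objects. It runs against a throwaway in-memory database, and the names `trips`, `trips_zone` and `trips_enriched` are hypothetical stand-ins rather than objects from `taxiJul.db`.

```python
import sqlite3

# In-memory database with hypothetical objects, standing in for ../taxiJul.db
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE trips (dropoff_zone TEXT, do_hour TEXT)")
demo.execute("CREATE INDEX trips_zone ON trips (dropoff_zone)")
demo.execute("CREATE VIEW trips_enriched AS SELECT *, 1 AS trip_count FROM trips")

# sqlite_master holds one row per schema object; filtering on type
# leaves out the index and keeps only the table and the view
objects = demo.execute(
    "SELECT type, name FROM sqlite_master "
    "WHERE type IN ('table', 'view') ORDER BY name"
).fetchall()
print(objects)
demo.close()
```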


In [4]:
#We are interested in the taxiJulEnrich view
#Read a single row from this view to see which columns are available
sample = pd.read_sql_query("SELECT * FROM taxiJulEnrich LIMIT 1;", cn)
sample.columns

Index(['VendorID', 'tpep_pickup_datetime', 'tpep_dropoff_datetime',
       'passenger_count', 'trip_distance', 'pickup_longitude',
       'pickup_latitude', 'RatecodeID', 'store_and_fwd_flag',
       'dropoff_longitude', 'dropoff_latitude', 'payment_type', 'fare_amount',
       'extra', 'mta_tax', 'tip_amount', 'tolls_amount',
       'improvement_surcharge', 'total_amount', 'pickup_OBJECTID',
       'pickup_Shape_Leng', 'pickup_Shape_Area', 'pickup_zone',
       'pickup_LocationID', 'pickup_borough', 'dropoff_OBJECTID',
       'dropoff_Shape_Leng', 'dropoff_Shape_Area', 'dropoff_zone',
       'dropoff_Location', 'dropoff_borough', 'count', 'pu_year', 'pu_month',
       'pu_day', 'pu_hour', 'pu_minute', 'pu_second', 'pu_weekday', 'do_year',
       'do_month', 'do_day', 'do_hour', 'do_minute', 'do_second', 'do_weekday',
       'pu_latlong'],
      dtype='object')

In [None]:
%%time
#Use SQLite to count the taxi trips by dropoff zone, weekday and hour, sorted in descending order of trip count
#With 11 million+ records, aggregating inside SQLite is more efficient than loading every row into memory first
query = """
    SELECT `dropoff_zone`, `do_weekday`, `do_hour`, SUM(`count`) AS `ridecount`
    FROM taxiJulEnrich
    GROUP BY `dropoff_zone`, `do_weekday`, `do_hour`
    ORDER BY `ridecount` DESC;
"""
df = pd.read_sql_query(query, cn)
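To make the aggregation concrete, here is a small sketch of the same GROUP BY pattern on a toy table. The in-memory `trips` table and its six rows are invented for illustration. Only the column names mirror the ones used in the query above.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE trips (dropoff_zone TEXT, do_weekday TEXT, do_hour TEXT, `count` INTEGER)"
)
# Six made-up trips, each carrying a count of 1 so that SUM(count) equals the number of rides
demo.executemany(
    "INSERT INTO trips VALUES (?, ?, ?, 1)",
    [
        ("Midtown Center", "3", "7"),
        ("Midtown Center", "3", "7"),
        ("Midtown Center", "3", "7"),
        ("Midtown East", "3", "8"),
        ("Midtown East", "3", "8"),
        ("Midtown Center", "4", "7"),
    ],
)

# One output row per (zone, weekday, hour) combination, busiest first
ranked = demo.execute(
    """
    SELECT dropoff_zone, do_weekday, do_hour, SUM(`count`) AS ridecount
    FROM trips
    GROUP BY dropoff_zone, do_weekday, do_hour
    ORDER BY ridecount DESC
    """
).fetchall()
for row in ranked:
    print(row)
demo.close()
```

Only the aggregated rows cross from SQLite into Python, which is why this approach keeps memory use low even when the underlying table is large.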

In [49]:
#Let's examine the first 20 rows of the results
#Midtown Center takes most of the top spots, and the busiest slots all fall in hours 7 to 9 on weekdays 2 to 4
#Weekdays are shown as numbers here, which are hard to read. We will convert them to text in the next step
df.head(20)

Unnamed: 0,dropoff_zone,do_weekday,do_hour,ridecount
0,Midtown Center,3,7,10075
1,Midtown Center,4,7,9825
2,Midtown Center,3,8,9466
3,Midtown Center,4,8,9289
4,Midtown Center,3,9,7932
5,Midtown Center,2,7,7920
6,Midtown East,3,8,7603
7,Midtown Center,4,9,7515
8,Midtown East,3,7,7372
9,Midtown Center,2,8,7367


In [None]:
#In this step, we make the data more readable and easier to analyze
#First, weekday numbers stored as strings (e.g. '1') are converted to day names (e.g. 'Monday'), with '0' meaning Sunday
#Second, hours are converted from strings to numbers
weekday_names = {
    '0': 'Sunday',
    '1': 'Monday',
    '2': 'Tuesday',
    '3': 'Wednesday',
    '4': 'Thursday',
    '5': 'Friday',
    '6': 'Saturday',
}
df['do_weekday'] = df['do_weekday'].replace(weekday_names)
df['do_hour'] = pd.to_numeric(df['do_hour'])
df
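The mapping in this step treats 0 as Sunday, which matches the convention of SQLite's `strftime('%w', ...)` function. Whether the preparation notebook derived `do_weekday` that way is an assumption here, so the sketch below only checks the convention itself. The helper `weekday_code` and the two dates are chosen for illustration.

```python
import sqlite3

WEEKDAY_NAMES = {
    '0': 'Sunday', '1': 'Monday', '2': 'Tuesday', '3': 'Wednesday',
    '4': 'Thursday', '5': 'Friday', '6': 'Saturday',
}

demo = sqlite3.connect(":memory:")

def weekday_code(date_text):
    """Return SQLite's %w weekday code for an ISO date, as text ('0' = Sunday)."""
    return demo.execute("SELECT strftime('%w', ?)", (date_text,)).fetchone()[0]

# 2024-01-07 was a Sunday and 2024-01-08 a Monday (dates picked only for this check)
labels = {d: WEEKDAY_NAMES[weekday_code(d)] for d in ("2024-01-07", "2024-01-08")}
print(labels)
demo.close()
```

Other tools number weekdays differently (pandas' `dt.dayofweek`, for instance, counts Monday as 0), so it is worth confirming the source of the numbers before applying a mapping like this one.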

In [52]:
%%time
#We now export our dataset to a CSV file. This file will be used by Excel and Tableau for further analysis and visualization
df.to_csv('01_Tourist_Resident.csv', index_label='ROWID')


Wall time: 204 ms


### Most Popular Dropoff Zones

This chart was created in the Microsoft Excel workbook stored [here](01_Tourist_Resident.xlsx).

![Most Popular Dropoff Zones](01_Tourist_Resident/MostPopularDropoffZones.png "Most Popular Zones")