# 3 Business Owner Perspective

In this notebook, we will analyze the taxi data from the perspective of a business owner who is interested in establishing a retail outlet in New York City. The business owner wants to determine the optimal store location, ideally on a busy street that receives a lot of taxi traffic.

We will start with a bird's eye view of Manhattan, and examine general taxi traffic on a map that ranks the various taxi zones based on dropoffs. We will use this view to pick a general neighborhood.

Once we pick the general zone we want to be in, we will get down to the street level. We will construct a detailed street map with individual dropoffs plotted on it. 

Then, we will use machine learning to discover hotspots of activity. This will be achieved by applying a clustering algorithm (unsupervised machine learning) to Latitude and Longitude information contained in the data. The resulting clusters will become the short list of potential locations for the business owner's new store!

In [13]:
#The bird's eye view is constructed by using summarizing the data by dropoff zone, then visualizing it using QGIS
#We have already summarized the data by dropoff zone in Step 1, so we will pull in that information here.
import pandas as pd
df = pd.read_csv('../01_Tourist_Resident/01_Tourist_Resident.csv')
df

Unnamed: 0,ROWID,dropoff_zone,do_weekday,do_hour,ridecount
0,0,Midtown Center,Wednesday,7.0,10075
1,1,Midtown Center,Thursday,7.0,9825
2,2,Midtown Center,Wednesday,8.0,9466
3,3,Midtown Center,Thursday,8.0,9289
4,4,Midtown Center,Wednesday,9.0,7932
5,5,Midtown Center,Tuesday,7.0,7920
6,6,Midtown East,Wednesday,8.0,7603
7,7,Midtown Center,Thursday,9.0,7515
8,8,Midtown East,Wednesday,7.0,7372
9,9,Midtown Center,Tuesday,8.0,7367


In [66]:
#Summarize the ridecounts by dropoff zones, and calcuate a rank by ridecount
import numpy as np

summary = pd.DataFrame(pd.pivot_table(df, values='ridecount', index=['dropoff_zone'],aggfunc=np.sum))
summary = summary.sort_values(by='ridecount',ascending=False)
summary['rank'] = np.arange(summary.shape[0])+1
summary

Unnamed: 0_level_0,ridecount,rank
dropoff_zone,Unnamed: 1_level_1,Unnamed: 2_level_1
Midtown Center,456367,1
Times Sq/Theatre District,385849,2
Murray Hill,381494,3
Midtown East,376738,4
Penn Station/Madison Sq West,339835,5
Upper East Side South,334615,6
Union Sq,332632,7
Upper East Side North,330459,8
Clinton East,314599,9
East Village,300551,10


In [67]:
#Save the resulting table to a CSV file. This file will be used by QGIS
summary.to_csv('zoneSummary.csv')

In [68]:
#Launch QGIS and create a new project
#Choose Layer-->Add Layer-->Add Vector Layer, and import the shapefile 'taxi_zones_wgs84.shp' created in Step 0


## Taxi Zone Map
<img src='taxiZones.png'>

The above is the raw shapefile downloaded from the NYC website. We will enhance this map in subsequent steps to make it more useful.

In [None]:
#In QGIS, install the MMQGIS plugin. You will use this plugin to perform an attribute join
#Once the plugin is installed, choose MMQGIS-->Combine-->Attributes Join from CSV File
#Choose the zoneSummary.csv file created in the previous step
#Match the dropoff_zone field in the CSV file with the zone field in the current shapefile
#Save the output to taxi_zones_wgs84_enh.shp. This file now contains the additional attributes for ride counts and ranks
#We will use these additional fields to format the map

#The ridecount and rank fields are created as string fields by default. We need to convert them to numbers
#Open the attributes table for the shapefile and choose New Field
#Choose Create New Field and Create New Virtual field of integer type call the new fields nridecount and nrank




## Apply Colors To Map

Choose 'Categorized'

Pick 'nridecount' as the column, and 'Spectral as the color ramp'

Press the 'Classify' button, followed up 'Apply'

<img src='layerColors.png'>

## Apply Labels To Map


### Bird's Eye View of Manhattan

[Explore this map interactively](http://qgiscloud.com/vbalasu/taxi_zones_color_by_area_midtown). Here, you can click on each neighborhood to see the ride counts.

##### Blue color = More taxi dropoffs, Labels = Rank

![Bird's Eye View](BirdsEyeView.png)

Looking at the above map, we determine that Midtown Manhattan is where the action is. We will zoom in further in the next step.

![Midtown Focus](MidtownFocus.png)

## Zooming In

Based on our analysis, we have shortlisted to the top 4 neighborhoods, which all happen to be right next to each other. These zones are as follows:
- Midtown Center (#1)
- Times Sq/Theatre District (#2)
- Murray Hill (#3)
- Midtown East (#4)

In the next step, we will pull the detailed taxi trip data for these 4 neighborhoods


In [69]:
%%time
#We will import the necessary Python libraries in this step. The %%time command keeps track of the execution time for each step
import sqlite3         # Provides powerful relational database query capabilities using the SQL language
import pandas as pd    # Pandas provides a powerful DataFrame to manipulate and analyze tabular data in memory

Wall time: 1 ms


In [71]:
%%time
#We connect to a SQLite database. This database was prepared using the notebook "00 Prepare Taxi Trip Data"
#We will examine the contents of this database by looking at the sqlite_master table
cn = sqlite3.connect('../taxiJul.db')

Wall time: 29.1 ms


In [72]:
#We are interested in the taxiJulEnrich view
#Read the first row of this view to examine the columns available
sample = pd.read_sql_query("SELECT * from taxiJulEnrich LIMIT 1;",cn)
sample.columns

Index(['VendorID', 'tpep_pickup_datetime', 'tpep_dropoff_datetime',
       'passenger_count', 'trip_distance', 'pickup_longitude',
       'pickup_latitude', 'RatecodeID', 'store_and_fwd_flag',
       'dropoff_longitude', 'dropoff_latitude', 'payment_type', 'fare_amount',
       'extra', 'mta_tax', 'tip_amount', 'tolls_amount',
       'improvement_surcharge', 'total_amount', 'pickup_OBJECTID',
       'pickup_Shape_Leng', 'pickup_Shape_Area', 'pickup_zone',
       'pickup_LocationID', 'pickup_borough', 'dropoff_OBJECTID',
       'dropoff_Shape_Leng', 'dropoff_Shape_Area', 'dropoff_zone',
       'dropoff_Location', 'dropoff_borough', 'count', 'pu_year', 'pu_month',
       'pu_day', 'pu_hour', 'pu_minute', 'pu_second', 'pu_weekday', 'do_year',
       'do_month', 'do_day', 'do_hour', 'do_minute', 'do_second', 'do_weekday',
       'pu_latlong'],
      dtype='object')

In [73]:
%%time
#In this step, we use SQLite to group the taxi trips by dropoff zone, and return the counts
#Due to the large size of the dataset (11 million+ records), it is more efficient to process using SQLite rather than 
#load everything into memory
#We calculate the number of trips by dropoff zone, weekday and hour, and sort the results in descending order
#- Midtown Center (#1)
#- Times Sq/Theatre District (#2)
#- Murray Hill (#3)
#- Midtown East (#4)
df = pd.read_sql_query("SELECT * FROM taxiJulEnrich WHERE `dropoff_zone` IN ('Midtown Center','Times Sq/Theater District','Murray Hill','Midtown East');", cn)

Wall time: 53.6 s


In [74]:
df.describe().transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
passenger_count,1214599.0,1.680424,1.349408,0.0,1.0,1.0,2.0,9.0
trip_distance,1214599.0,4.888817,3010.647175,0.0,0.9,1.4,2.2,3318000.0
fare_amount,1214599.0,11.023237,8.146214,-80.0,6.5,9.0,12.0,814.0
extra,1214599.0,0.242556,0.358181,-1.0,0.0,0.0,0.5,1.5
mta_tax,1214599.0,0.499369,0.021012,-0.5,0.5,0.5,0.5,0.5
tip_amount,1214599.0,1.457989,2.135993,-41.0,0.0,1.08,2.06,661.38
tolls_amount,1214599.0,0.244104,1.273687,-5.54,0.0,0.0,0.0,591.48
improvement_surcharge,1214599.0,0.299822,0.010007,-0.3,0.3,0.3,0.3,0.3
total_amount,1214599.0,13.767449,10.278073,-80.3,8.3,11.16,14.8,814.8
pickup_Shape_Leng,1214599.0,0.046963,0.030995,0.024696,0.03527,0.041514,0.046108,0.290556


In [75]:
df.columns

Index(['VendorID', 'tpep_pickup_datetime', 'tpep_dropoff_datetime',
       'passenger_count', 'trip_distance', 'pickup_longitude',
       'pickup_latitude', 'RatecodeID', 'store_and_fwd_flag',
       'dropoff_longitude', 'dropoff_latitude', 'payment_type', 'fare_amount',
       'extra', 'mta_tax', 'tip_amount', 'tolls_amount',
       'improvement_surcharge', 'total_amount', 'pickup_OBJECTID',
       'pickup_Shape_Leng', 'pickup_Shape_Area', 'pickup_zone',
       'pickup_LocationID', 'pickup_borough', 'dropoff_OBJECTID',
       'dropoff_Shape_Leng', 'dropoff_Shape_Area', 'dropoff_zone',
       'dropoff_Location', 'dropoff_borough', 'count', 'pu_year', 'pu_month',
       'pu_day', 'pu_hour', 'pu_minute', 'pu_second', 'pu_weekday', 'do_year',
       'do_month', 'do_day', 'do_hour', 'do_minute', 'do_second', 'do_weekday',
       'pu_latlong'],
      dtype='object')

In [111]:
%%time
#Convert longitude and latitude to numeric, and store them as X and Y columns. Delete the text columns
latlong = df[['dropoff_longitude','dropoff_latitude']]
pd.options.mode.chained_assignment = None  # default='warn'
latlong['X'] = pd.to_numeric(df['dropoff_longitude'])
latlong['Y'] = pd.to_numeric(df['dropoff_latitude'])
del latlong['dropoff_longitude']
del latlong['dropoff_latitude']

Wall time: 1.19 s


In [112]:
latlong.describe().transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
X,1214599.0,-73.976404,0.00366,-73.984108,-73.979027,-73.976662,-73.973984,-73.966576
Y,1214599.0,40.75437,0.005015,40.741974,40.751091,40.755047,40.758415,40.763641


In [113]:
%%time
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=10)
model = kmeans.fit(latlong)
print("model\n", model)

model
 KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=10, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)
Wall time: 1min 10s


In [114]:
model.cluster_centers_

array([[-73.97300935,  40.75616031],
       [-73.97963543,  40.75094798],
       [-73.9808618 ,  40.74500483],
       [-73.98137698,  40.75572842],
       [-73.97566408,  40.75185393],
       [-73.96974338,  40.76025142],
       [-73.9754983 ,  40.74690388],
       [-73.97432196,  40.75980209],
       [-73.97821989,  40.76021787],
       [-73.97715866,  40.75544863]])

In [115]:
%%time
predict = kmeans.predict(latlong)
latlong['group'] = pd.Series(predict, index=latlong.index)

Wall time: 271 ms


In [116]:
latlong

Unnamed: 0,X,Y,group
0,-73.975922,40.757702,9
1,-73.978607,40.761799,8
2,-73.977631,40.747368,6
3,-73.979843,40.749844,1
4,-73.972984,40.755630,0
5,-73.975082,40.752052,4
6,-73.972618,40.756092,0
7,-73.972862,40.755867,0
8,-73.978050,40.745770,6
9,-73.972054,40.756809,0


In [131]:
#Create a pivot table of the cluster groups, and calculate counts of rides for each cluster
#Then order the cluster in descending order of counts
summary = latlong.pivot_table(index=['group'])
counts = latlong.pivot_table(index=['group'],aggfunc=np.size)
summary['count'] = counts['X']
summary = summary.sort_values(['count'],ascending=False)
summary

Unnamed: 0_level_0,X,Y,count
group,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1,-73.979636,40.750945,181000.0
0,-73.973013,40.756159,151770.0
4,-73.975666,40.751851,138650.0
9,-73.977164,40.755444,129451.0
5,-73.969744,40.760251,122037.0
7,-73.974322,40.759801,114694.0
8,-73.978219,40.760216,111179.0
2,-73.980861,40.745004,99016.0
3,-73.981379,40.755728,83972.0
6,-73.975494,40.7469,82830.0


# Pick the top 3 locations

Based on the matrix above, we know that the top 3 clusters are Group 1, Group 0 and Group 4

We will use the latitude and longitude of the cluster centers, not the averages shown in the above table

So the top 3 locations are as follows:
1. Group 1 : Longitude:-73.97963543,  Latitude:40.75094798
1. Group 0 : Longitude:-73.97300935, Latitude:40.75616031
1. Group 4 : Longitude:-73.97566408, Latitude:40.75185393

# And The Winners Are ...

A quick visit to Google Maps tells us that these correspond to the following street locations:
1. 40th Street, Between Madison Ave and Park Ave (40.75094798,-73.97963543)
1. 49th Street and Lexington Ave (40.75616031,-73.97300935)
1. 43rd Street and Lexington Ave (40.75185393,-73.97566408)

### #1 40th Street, Between Madison Ave and Park Ave
![40th Street, Between Madison Ave and Park Ave](Rank1.png)

### #2 49th Street and Lexington Ave
![49th Street and Lexington Ave](Rank2.png)

### #3 43rd Street and Lexington Ave
![43rd Street and Lexington Ave](Rank3.png)

We have successfully identified the top 3 locations based on our traffic and neighborhood selections. The business owner now has a solid data-driven foundation to make the location decision. Of course, there will be other factors such as rent, crime rate, etc. that should also be considered, but we now have a solid start. 