# Analysis of New York City Green Taxi Data

### The objective

Programmatically download the NYC green taxi data from [here](https://s3.amazonaws.com/nyc-tlc/trip+data/green_tripdata_2015-09.csv)

Based on what you can find out about green taxis, provide a count of how many transactions originate and terminate at one of the NYC area airports (JFK and LGA).

# Concerning green taxis 



The goal of the Boro Taxi program is to improve access to street-hail transportation throughout the five boroughs – especially for persons with disabilities and people who live or spend time in areas of New York City historically underserved by the yellow taxi industry. These Boro Taxis are referred to as Green Taxis.

> [Boro Taxi drivers](http://www.nyc.gov/html/tlc/html/passenger/shl_passenger.shtml) can pick up passengers from the street in northern Manhattan (north of West 110th street and East 96th street), the Bronx, Queens (excluding the airports), Brooklyn and Staten Island and they may drop you off anywhere. Each vehicle is associated with a local car service that has been affiliated with the Boro Taxi program and can still participate in pre-arranged trips. Boro Taxi drivers can be dispatched to pick you up in northern Manhattan, the Bronx, Queens, Brooklyn and Staten Island and at the airports, but may not pick up any trips – pre-arranged or street hail – in the Manhattan exclusionary zone.

Another article I came across in Forbes provides similar [information](https://www.forbes.com/sites/johngiuffo/2013/09/30/nycs-new-green-taxis-what-you-should-know/#1b90ebc32a28) on which fares Green Taxis may pick up:
> The rules for the owners of the new 'Boro Taxis,' as the green ones are dubbed, are simple: they are not allowed to pick up street hails on the home turf, as it were, of the yellow taxi – Manhattan below 110th St. on the West Side, and below 96th St. on the East Side, or at either LaGuardia or JFK airports. Otherwise, they can ply the streets anywhere else in the city, and can also be on call through a dispatcher.

See below for a map highlighting the areas described above. Again, no fares may be picked up in the Yellow area. The Boro Taxis have free rein over the "Green" area, and they may pick up prearranged fares from JFK and LaGuardia (Grey area).

![title](http://www.nyc.gov/html/tlc/images/features/map_service_area.png)

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import requests
import io
import os
from pprint import pprint
from pandasql import sqldf
import json

In [2]:
from green import * 

`green` (imported above) contains useful functions to complete the tasks. In my opinion, it is best to put helpers and utilities in a separate Python file that is imported into the notebook; it keeps things clean.

In [3]:
# sql on Pandas DataFrames!
pysqldf = lambda q: sqldf(q, globals())

In [4]:
%matplotlib inline

## Get data


In [5]:
## data.json contains information about the dataset, i.e., value labels
## url, categorical columns and numerical columns.
## this info was curated from http://www.nyc.gov/html/tlc/downloads/pdf/data_dictionary_trip_records_green.pdf
with open('data.json') as json_data:
    conf_dict = json.load(json_data)

In [7]:
data = get_data()
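
The body of `get_data()` lives in the helper module and is not shown here. As a rough illustration of the "programmatically download" step, the sketch below uses only the standard library. The function names `parse_trips` and `download_trips` are hypothetical, and the real helper may well use `requests` and `pandas` instead (both are imported above).

```python
import csv
import io
import urllib.request

# URL taken from the objective at the top of this notebook
URL = "https://s3.amazonaws.com/nyc-tlc/trip+data/green_tripdata_2015-09.csv"


def parse_trips(text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def download_trips(url=URL, timeout=60):
    """Fetch the CSV over HTTP and parse it (hypothetical stand-in for get_data)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return parse_trips(response.read().decode("utf-8"))


# Offline usage example with a tiny in-memory sample instead of the real file
sample = "RateCodeID,Pickup_latitude,Pickup_longitude\n2,40.64,-73.78\n1,40.70,-73.90\n"
rows = parse_trips(sample)
print(len(rows), rows[0]["RateCodeID"])
```

To get a DataFrame like the one used in the rest of this notebook, `pd.read_csv` can be pointed at the same URL or at the downloaded text.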

# Airport dropoffs and pickups


Based on the rules governing Green Taxi fares, Green Taxis can only pick up prearranged fares at JFK or LaGuardia, so I expect very few fares to originate at either NYC-area airport.

JFK is a very easy ask, provided we trust the `RateCodeID` in the dataset. Taxi trips to JFK are flat rate and are logged accordingly, so we could get an easy count of fares that originated or ended at JFK.

LaGuardia trips aren't so easy and will require significantly more work! LaGuardia trips are regular-rate fares, so there is no marker in the record that makes this count as easy as the JFK one.

#### Considerations

1.  The markers in the data set are subject to human input.  So how good are they?
2.  Latitude and Longitude of drop off and pick up are available.  


## Solutions

To provide the counts, the plan is as follows: create a bounding box for JFK and for LaGuardia by dropping pins around each airport in Google Maps and recording the latitudes and longitudes. Then, using [shapely](https://github.com/Toblerity/Shapely), iterate through the trips and determine whether the pickup location and/or the dropoff location is contained in each airport's bounding box. It may be more appropriate to convert the latitudes and longitudes to the [Universal Transverse Mercator coordinate system](https://en.wikipedia.org/wiki/Universal_Transverse_Mercator_coordinate_system), but I think the results from my approach are good enough.

The point of doing JFK as well is to compare my construction against the `RateCodeID` in the dataset.  I am assuming that the geo locations are accurate.  

I only thought of using the bounding boxes after I figured out that using the geocode from Google wasn't going to work out so well. The original plan was to use the RateCodeID and the distances between the dropoff geolocation and the JFK geolocation (as determined from Google Maps). I was disregarding pickup locations since there should be very few pickups at JFK. I would then take the median distance as the radius of a circle centered on the JFK geolocation and check which points fell inside the circle, but it just was not working out.
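
For reference, the abandoned radius idea can be sketched roughly as follows. This is not the code I originally tried; it is a minimal illustration that assumes great-circle (haversine) distance in kilometres, and the names `haversine_km` and `within_median_radius` are made up for the example.

```python
import math
from statistics import median

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def within_median_radius(center, tagged_dropoffs, candidates):
    """Flag candidates lying within the median tagged-dropoff distance of center."""
    radius = median(haversine_km(*center, *p) for p in tagged_dropoffs)
    return [haversine_km(*center, *p) <= radius for p in candidates]


# Toy usage: the center and points are illustrative, not taken from the trip data
center = (40.646198, -73.792402)
tagged = [(40.646, -73.790), (40.650, -73.780), (40.700, -73.900)]
print(within_median_radius(center, tagged, [(40.645, -73.789), (40.766197, -73.862458)]))
```

A weakness of this design is visible even in the toy data: any mis-tagged drop-off far from the airport inflates the median, and a circle is a poor fit for an airport's footprint.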

As mentioned previously, I will use bounding boxes to get a "better" count of the number of trips that start or end at JFK and LaGuardia. The idea is to build a polygon from the latitude and longitude coordinates of each airport's boundary and then test whether a particular point lies inside that polygon.
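
To make the containment test concrete, here is a minimal pure-Python sketch of the classic ray-casting rule for point-in-polygon. It only illustrates the idea; it is not how shapely implements `contains`, and it does not treat points exactly on the boundary specially. Coordinates are treated as planar with x = latitude and y = longitude, the same ordering this notebook uses when building points and polygons. The function name `point_in_polygon` is made up for the example.

```python
def point_in_polygon(point, vertices):
    """Return True if point lies inside the polygon given by vertices.

    Casts a ray from the point in the +x direction and counts how many
    polygon edges it crosses; an odd count means the point is inside.
    """
    x, y = point
    inside = False
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge crosses that line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# JFK vertices as used in this notebook, in (lat, long) order
jfk_bb = [(40.646198, -73.792402),
          (40.640564, -73.789269),
          (40.645677, -73.775322),
          (40.649975, -73.782145)]

# The mean of the vertices lies inside this (convex) quadrilateral
center = (sum(v[0] for v in jfk_bb) / 4, sum(v[1] for v in jfk_bb) / 4)
print(point_in_polygon(center, jfk_bb))
print(point_in_polygon((40.7, -73.9), jfk_bb))
```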

In [7]:
## I constructed these bounding boxes by hand - going to Google Maps and
## dropping points on the map; each vertex is (latitude, longitude)
jfk_bb = [(40.646198,-73.792402), 
          (40.640564,-73.789269), 
          (40.645677,-73.775322), 
          (40.649975,-73.782145)]
lag_bb = [(40.766197,-73.862458),
          (40.769317,-73.858123),
          (40.771982,-73.871384),
          (40.776845,-73.872357),
          (40.767777,-73.885275),
          (40.776942,-73.887592)]

from shapely.geometry import Point
from shapely.geometry.polygon import Polygon

jfk_polygon = Polygon(jfk_bb)
lag_polygon = Polygon(lag_bb)

The map below shows the points that will be used as vertices of the polygon that encloses JFK.

In [8]:
## if you find that the following cell won't display the map 
## check that you enabled the widgets with 

## jupyter nbextension enable --py --sys-prefix widgetsnbextension
## jupyter nbextension enable --py --sys-prefix gmaps

In [9]:
import gmaps
import os
gmaps.configure(api_key = os.environ["GOOGLE_API_KEY"])
layers = [gmaps.symbol_layer(jfk_bb, fill_color="red", stroke_color="red", scale=3)]
layers.append( gmaps.symbol_layer(lag_bb))

fig_jfk = gmaps.figure()
fig_jfk.add_layer(layers[0])
fig_jfk

The figures will not display properly when converted to HTML, so I've included a static version.
![title](pics/jfk_bb.png)

And the map below shows the vertices that will be used in the polygon to enclose LaGuardia.

In [10]:
fig_lag = gmaps.figure()
fig_lag.add_layer(layers[1])
fig_lag

The figures will not display properly when converted to HTML, so I've included a static version.
![title](pics/lga_bb.png)

### Fares with RateCodeID = JFK

I'm going to compare my method to the fares actually tagged as JFK fares.

In [11]:
conf_dict["Columns"]["RateCodeID"]

{'1': 'Standard Rate',
 '2': 'JFK',
 '3': 'Newark',
 '4': 'Nassau or Westchester',
 '5': 'Negotiated Fare',
 '6': 'Group Ride'}

In [12]:
# grab all trips that had a rate code associated with JFK flat rate
jfk_tagged_trips = data[data["RateCodeID"] == 2]
jfk_tagged_trips.shape

(4435, 21)

The `RateCodeID` variable implies that there are 4,435 fares starting or ending at JFK. Next I will use the latitude and longitude (both pickup and dropoff) to see how many of these fares are actually within the bounding box for JFK.

In [13]:
# pull out the starting latitude and longitude for each of the 4,435 fares
start_geocode = list(zip(jfk_tagged_trips["Pickup_latitude"],jfk_tagged_trips["Pickup_longitude"]))
# pull out the ending latitude and longitude for each of the 4,435 fares
end_geocode = list(zip(jfk_tagged_trips["Dropoff_latitude"],jfk_tagged_trips["Dropoff_longitude"]))

## create a list of Points
start_points = list(map( lambda x: Point(x[0], x[1]), start_geocode))
end_points = list(map( lambda x: Point(x[0], x[1]), end_geocode))

In [14]:
## create binary vectors for trips starting and ending at JFK
starting_at_jfk = list(map( lambda x: 1*jfk_polygon.contains(x), start_points))
ending_at_jfk = list(map( lambda x: 1*jfk_polygon.contains(x), end_points))

print("trips starting at jfk: {}".format(np.array(starting_at_jfk).sum()))
print("trips ending at jfk: {}".format(np.array(ending_at_jfk).sum()))

trips starting at jfk: 19
trips ending at jfk: 2204


Based on this approach, only about half of the fares with a RateCodeID corresponding to JFK (at most 2,223 of 4,435) actually start or end at JFK! But, as I expected, the fares mainly end at JFK, since Green Taxis can only pick up pre-arranged fares at JFK and LaGuardia. Below is a plot of the first 10 observations of `jfk_tagged_trips`. Blue markers are starting locations and green markers are ending locations. As you can see, 2 out of the first 10 records were actually JFK fares.

In [15]:
## plot the first 10 trips in jfk_tagged_trips against the JFK bounding box
starting = [gmaps.symbol_layer(start_geocode[0:10], fill_color="blue",stroke_color="blue", scale=3)]
ending = [gmaps.symbol_layer(end_geocode[0:10], fill_color = "green",stroke_color="green", scale=3)]

fig_jfk = gmaps.figure()
fig_jfk.add_layer(layers[0])
fig_jfk.add_layer(starting[0])
fig_jfk.add_layer(ending[0])
fig_jfk

The figures will not display properly when converted to HTML, so I've included a static version.
![title](pics/first_ten.png)

## JFK - full data set

Next, we'll repeat the same analysis on the entire dataset.  

In [16]:
## pull out the lat and long for all trips 
start_geocode = list(zip(data["Pickup_latitude"],data["Pickup_longitude"]))
end_geocode = list(zip(data["Dropoff_latitude"],data["Dropoff_longitude"]))

In [17]:
start_points = list(map( lambda x: Point(x[0], x[1]), start_geocode))
end_points = list(map( lambda x: Point(x[0], x[1]), end_geocode))

In [18]:
starting_at_jfk = list(map( lambda x: 1*jfk_polygon.contains(x), start_points))
ending_at_jfk = list(map( lambda x: 1*jfk_polygon.contains(x), end_points))

In [19]:
print("trips starting at jfk: {:,}".format(np.array(starting_at_jfk).sum()))
print("trips ending at jfk: {:,}".format(np.array(ending_at_jfk).sum()))

trips starting at jfk: 269
trips ending at jfk: 12,130


WOW! I was actually surprised by this. This method caught 12,130 trips ending at JFK, when there were only 4,435 with RateCodeID = JFK.

Just to make sure, we visualize all the fares we tagged as terminating at JFK. Based on the visual, I think it was a fairly successful approach. What I don't understand is why these would not be tagged appropriately. It would be interesting to see the average fare amount of those that were not tagged correctly. I may be able to get a better read on this count if I play around with the bounding box used for JFK.
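
One way to look at the mismatch is to cross-tabulate the two flags for each trip: "tagged with the JFK rate code" against "drop-off inside the JFK box". The sketch below does this with toy boolean lists standing in for the real columns; the function name `cross_tab` and the toy values are illustrative only and are not results from the dataset.

```python
from collections import Counter


def cross_tab(tagged, in_box):
    """Count trips for each (tagged as JFK, drop-off inside box) combination."""
    return Counter(zip(tagged, in_box))


# Toy flags for five imaginary trips
tagged = [True, True, False, False, False]
in_box = [True, False, True, True, False]

counts = cross_tab(tagged, in_box)
for (is_tagged, is_inside), n in sorted(counts.items()):
    print("tagged={!s:5} inside={!s:5} trips={}".format(is_tagged, is_inside, n))
```

On the actual DataFrame, `pd.crosstab` applied to `data["RateCodeID"] == 2` and the `ending_at_jfk` vector should produce the same kind of table, and the untagged-but-inside cell is the group whose average fare would be worth inspecting.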

In [20]:
e = filter( lambda x: jfk_polygon.contains(x), end_points) # keep all trips that terminate at JFK
e = list(map( lambda loc: (loc.x, loc.y), e)) # extract the lat and long; list() because map is lazy in Python 3

In [21]:
fig = gmaps.figure()
fig.add_layer(gmaps.heatmap_layer(e))
fig

The figures will not display properly when converted to HTML, so I've included a static version.
![title](pics/jfk_heatmap.png)

## LaGuardia

We'll apply the same analysis to find the number of trips starting and ending at LaGuardia Airport.

In [22]:
starting_at_lag = list(map( lambda x: 1*lag_polygon.contains(x), start_points))
ending_at_lag = list(map( lambda x: 1*lag_polygon.contains(x), end_points))

In [23]:
print("trips starting at lag: {:,}".format(np.array(starting_at_lag).sum()))
print("trips ending at lag: {:,}".format(np.array(ending_at_lag).sum()))

trips starting at lag: 264
trips ending at lag: 14,357


In [24]:
e = filter( lambda x: lag_polygon.contains(x), end_points) # keep all trips that terminate at LaGuardia
e = list(map( lambda loc: (loc.x, loc.y), e)) # extract the lat and long; list() because map is lazy in Python 3

In [25]:
fig = gmaps.figure()
fig.add_layer(gmaps.heatmap_layer(e))
fig

The figures will not display properly when converted to HTML, so I've included a static version.
![title](pics/lga_heatmap.png)

The visual above shows that I'm directionally correct, but could stand to refine the bounding box for LGA. For instance, the one red spot on the heatmap just below Grand Central Pkwy is the New York LaGuardia Airport Marriott, which I caught unintentionally. I'll have to pay closer attention to the `shapely` documentation for `Polygon`.
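
One thing worth checking when refining the box: a polygon is built by joining the vertices in the order listed, and in `lag_bb` the edge from the fourth to the fifth point crosses the closing edge from the sixth point back to the first, so the ring is self-intersecting rather than a simple outline. The sketch below is a pure-Python check for that, plus one possible fix (ordering the vertices by angle around their mean). The helper names are made up for the example, and whether this explains the Marriott hit is not something established here. In shapely itself, the `is_valid` attribute of a `Polygon` reports this kind of problem.

```python
import math

# LaGuardia vertices as used in this notebook, in (lat, long) order
lag_bb = [(40.766197, -73.862458),
          (40.769317, -73.858123),
          (40.771982, -73.871384),
          (40.776845, -73.872357),
          (40.767777, -73.885275),
          (40.776942, -73.887592)]


def segments_cross(p1, p2, p3, p4):
    """True if segment p1-p2 properly crosses segment p3-p4."""
    def orient(a, b, c):
        return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
    return (orient(p3, p4, p1) * orient(p3, p4, p2) < 0 and
            orient(p1, p2, p3) * orient(p1, p2, p4) < 0)


def ring_self_intersects(vertices):
    """True if any two non-adjacent edges of the closed ring cross."""
    n = len(vertices)
    edges = [(vertices[i], vertices[(i + 1) % n]) for i in range(n)]
    for i in range(n):
        for j in range(i + 2, n):
            if i == 0 and j == n - 1:
                continue  # first and last edges share a vertex
            if segments_cross(*edges[i], *edges[j]):
                return True
    return False


def order_around_centroid(vertices):
    """Sort vertices by angle around their mean so the ring does not cross itself."""
    cx = sum(v[0] for v in vertices) / len(vertices)
    cy = sum(v[1] for v in vertices) / len(vertices)
    return sorted(vertices, key=lambda v: math.atan2(v[1] - cy, v[0] - cx))


print(ring_self_intersects(lag_bb))
print(ring_self_intersects(order_around_centroid(lag_bb)))
```

The reordered ring is only one candidate outline; the pins may still need to be moved to exclude the hotel.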

For JFK
* trips starting at jfk: 269
* trips ending at jfk: 12,130

For LaGuardia
* trips starting at LGA: 264
* trips ending at LGA: 14,357

In [26]:
# just some cleanup
try:
    del starting_at_lag, ending_at_lag, starting_at_jfk, ending_at_jfk, start_points, end_points
except NameError:
    pass