# Where should we put taxi stations in New York City?

## 1- Problem description

In this hands-on session, we will explore New York City taxi data. The goal is to understand taxi behaviour, generate insights about how the number of rides varies throughout the day and across the city, and suggest the best locations for future taxi stops where people can be picked up or dropped off by cabs, or wait for one.

## 2- Data

<img src="taxi.png">

We will be using the training data from the 'New York City Taxi Trip Duration' dataset, which can be obtained from Kaggle: https://www.kaggle.com/c/nyc-taxi-trip-duration/data. The dataset includes pickup time, geo-coordinates, number of passengers, and several other variables.

Please download train.csv and put it under '../data_used/'.


## 3- Read data

In [None]:
# Import libraries
import numpy as np 
import pandas as pd 
from datetime import timedelta
import datetime as dt
import matplotlib.pyplot as plt
import folium
%matplotlib inline

In [None]:
# Some set up:
np.random.seed(1987)
pd.set_option('display.float_format', lambda x: '%.3f' % x)
plt.rcParams['figure.figsize'] = [8,8]

In [None]:
# Read data:
nyc_taxi_data = pd.read_csv('../data_used/train.csv')

## 4- First data exploration & cleaning 

In [None]:
# Display the first 5 lines of data
# TODO

In [None]:
# Display data info
# TODO

In [None]:
# Display data description 
# TODO

In [None]:
# Trip duration clean-up

As we noted earlier, the trip_duration variable contains some outliers, specifically a maximum trip duration of 980 hours and a minimum of 1 second. We exclude data that lies more than 2 standard deviations from the mean.

In [None]:
m = np.mean(nyc_taxi_data['trip_duration'])
s = np.std(nyc_taxi_data['trip_duration'])
nyc_taxi_data = nyc_taxi_data[nyc_taxi_data['trip_duration'] <= m + 2*s]
nyc_taxi_data = nyc_taxi_data[nyc_taxi_data['trip_duration'] >= m - 2*s]
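To see what a mean ± 2 standard deviations cut does on skewed data, here is a minimal sketch on synthetic durations (the log-normal parameters and the two injected outliers are assumptions for illustration, not values taken from train.csv). On data like this, a single extreme value inflates the standard deviation so much that the lower bound falls below zero, so very short trips may survive the filter.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed durations in seconds (illustrative only, not the Kaggle data)
rng = np.random.default_rng(1987)
durations = pd.Series(rng.lognormal(mean=6.5, sigma=0.7, size=10_000))
durations.iloc[0] = 980 * 3600  # one extreme 980-hour trip
durations.iloc[1] = 1           # one 1-second trip

# ddof=0 matches the default of np.std used in the notebook
m, s = durations.mean(), durations.std(ddof=0)
lower, upper = m - 2 * s, m + 2 * s

# Keep only durations inside [lower, upper]
kept = durations[durations.between(lower, upper)]

print(f"bounds: [{lower:.0f}, {upper:.0f}] s, kept {len(kept)} of {len(durations)}")
print("1-second trip still present:", (kept == 1).any())
```

If short trips also need to go, an explicit minimum duration or a percentile-based cut would be one alternative worth trying.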

In [None]:
# Latitude and Longitude clean-up

The borders of New York City, expressed in coordinates, come out to be:
- City_long_border = (-74.03, -73.75)
- City_lat_border = (40.63, 40.85) 

In [None]:
xlim = [-74.03, -73.75]
ylim = [40.63, 40.85]
nyc_taxi_data = nyc_taxi_data[(nyc_taxi_data.pickup_longitude> xlim[0]) & (nyc_taxi_data.pickup_longitude < xlim[1])]
nyc_taxi_data = nyc_taxi_data[(nyc_taxi_data.dropoff_longitude> xlim[0]) & (nyc_taxi_data.dropoff_longitude < xlim[1])]
nyc_taxi_data = nyc_taxi_data[(nyc_taxi_data.pickup_latitude> ylim[0]) & (nyc_taxi_data.pickup_latitude < ylim[1])]
nyc_taxi_data = nyc_taxi_data[(nyc_taxi_data.dropoff_latitude> ylim[0]) & (nyc_taxi_data.dropoff_latitude < ylim[1])]
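The four filters above can also be written as a single boolean mask. The sketch below uses a hypothetical three-row sample (the coordinates are made up) and the border values listed earlier. It assumes pandas 1.3 or later for `inclusive="neither"`, which reproduces the strict `>` and `<` comparisons.

```python
import pandas as pd

# Hypothetical mini-sample: only the first trip lies fully inside the box
trips = pd.DataFrame({
    "pickup_longitude": [-73.98, -74.20, -73.95],
    "pickup_latitude": [40.75, 40.70, 40.78],
    "dropoff_longitude": [-73.97, -73.99, -73.60],
    "dropoff_latitude": [40.76, 40.72, 40.77],
})

xlim, ylim = (-74.03, -73.75), (40.63, 40.85)

# Start with "keep everything" and narrow the mask for each coordinate
mask = pd.Series(True, index=trips.index)
for prefix in ("pickup", "dropoff"):
    mask &= trips[f"{prefix}_longitude"].between(*xlim, inclusive="neither")
    mask &= trips[f"{prefix}_latitude"].between(*ylim, inclusive="neither")

inside = trips[mask]
print(inside)
```

Building one mask avoids creating an intermediate DataFrame after every condition.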

In [None]:
# Date format clean-up

In [None]:
nyc_taxi_data['pickup_datetime'] = pd.to_datetime(nyc_taxi_data.pickup_datetime)
nyc_taxi_data.loc[:, 'pickup_date'] = nyc_taxi_data['pickup_datetime'].dt.date
nyc_taxi_data['dropoff_datetime'] = pd.to_datetime(nyc_taxi_data.dropoff_datetime) 

In [None]:
# Create columns month, week, day and hour pick up:
nyc_taxi_data['month'] = nyc_taxi_data.pickup_datetime.dt.month
nyc_taxi_data['week'] = nyc_taxi_data.pickup_datetime.dt.isocalendar().week
nyc_taxi_data['day'] = nyc_taxi_data.pickup_datetime.dt.day
nyc_taxi_data['hour'] = nyc_taxi_data.pickup_datetime.dt.hour

In [None]:
# Display new cleaned data 
nyc_taxi_data.head()

In [None]:
# Plot trip_duration distribution using hist()

In [None]:
# TODO

We see that most trip durations are under 2000 s.
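A reading taken from a histogram can be backed with a number. The sketch below computes the share of trips under 2000 s on a synthetic series (the distribution is an assumption, so the printed share is not the one for train.csv). With the real data, `nyc_taxi_data['trip_duration']` would take the place of `trip_duration`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the trip_duration column (illustrative only)
rng = np.random.default_rng(1987)
trip_duration = pd.Series(rng.lognormal(mean=6.5, sigma=0.7, size=10_000))

# The mean of a boolean Series is the fraction of True values
share_under_2000 = (trip_duration < 2000).mean()
print(f"{share_under_2000:.1%} of trips last less than 2000 s")

# A few quantiles give a compact picture of the tail
print(trip_duration.quantile([0.5, 0.9, 0.99]))
```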

In [None]:
# plot the evolution of number of trips over time

In [None]:
plt.plot(nyc_taxi_data.groupby('pickup_date').count()[['id']], 'o-')
plt.title('Trips over time.')
plt.ylabel('Number of trips')
plt.show()

Around 8,000 trips per day in New York City!
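The per-day count behind such a plot can also be summarised directly. This is a minimal sketch on synthetic pickups (5,000 random timestamps in January 2016, an assumption for illustration). It uses `groupby(...).size()`, which counts rows per day without depending on a particular column such as `id`.

```python
import numpy as np
import pandas as pd

# 5,000 synthetic pickups spread over January 2016 (illustrative only)
rng = np.random.default_rng(1987)
start = pd.Timestamp("2016-01-01")
offsets = pd.to_timedelta(rng.integers(0, 31 * 24 * 3600, size=5_000), unit="s")
pickups = pd.DataFrame({"id": np.arange(5_000), "pickup_datetime": start + offsets})

# Number of trips per calendar day
daily = pickups.groupby(pickups["pickup_datetime"].dt.date).size()

print(daily.describe())
```

On the real data, `daily.mean()` would give the average number of trips per day.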

## 5- Data Visualization using matplotlib & Folium

In [None]:
# Let's have a look at drop-off and pick-up locations

In [None]:
longitude = list(nyc_taxi_data.pickup_longitude) + list(nyc_taxi_data.dropoff_longitude)
latitude = list(nyc_taxi_data.pickup_latitude) + list(nyc_taxi_data.dropoff_latitude)

In [None]:
# TODO

In [None]:
# Display NYC map with Folium:

In [None]:
# Function to generate a new New York City map
def generateNYCmap(default_location=[40.737595, -73.993647],default_width='80%', default_height='80%', default_zoom_start=11):
    base_map = folium.Map(location=default_location,width=default_width, height=default_height, control_scale=True,zoom_control=True, zoom_start=default_zoom_start)
    return base_map

In [None]:
# Display nyc map
# TODO

In [None]:
# Create a heatmap of pick-up and drop-off locations (use HeatMap from folium.plugins and only the first 3 months)

In [None]:
from folium.plugins import HeatMap

In [None]:
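# Filter on the first 3 months
df_heatMap = nyc_taxi_data[nyc_taxi_data.month <= 3].copy()
# Weight column: one per trip, summed per location in the HeatMapWithTime cell further down
df_heatMap['count'] = 1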

In [None]:
data = df_heatMap[['pickup_latitude', 'pickup_longitude']].groupby(['pickup_latitude', 'pickup_longitude']).sum().reset_index().values.tolist()

In [None]:
# Create Heatmap here 
# TODO

In [None]:
# This function is used to display maps with features 
def embed_map(m):
    from IPython.display import IFrame

    m.save('../data_generated/index.html')
    return IFrame('../data_generated/index.html', width='100%', height='750px')

In [None]:
embed_map(nyc_map)

In [None]:
# We want to see the evolution of this heatmap over time (use HeatMapWithTime from folium.plugins)

In [None]:
from folium.plugins import HeatMapWithTime

In [None]:
df_hour_list = []
for hour in df_heatMap.hour.sort_values().unique():
    df_hour_list.append(df_heatMap.loc[df_heatMap.hour == hour, ['pickup_latitude', 'pickup_longitude', 'count']].groupby(['pickup_latitude', 'pickup_longitude']).sum().reset_index().values.tolist())
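The loop above builds one list per hour, each holding `[latitude, longitude, weight]` triples, which is the nested shape `HeatMapWithTime` takes as its data. Here is a minimal sketch of that structure on a hypothetical three-row frame (the values are made up), where the weight is the number of pickups at a location.

```python
import pandas as pd

# Hypothetical pickups: two at the same spot at 08:00, one elsewhere at 09:00
demo = pd.DataFrame({
    "pickup_latitude": [40.75, 40.75, 40.76],
    "pickup_longitude": [-73.99, -73.99, -73.98],
    "hour": [8, 8, 9],
})
demo["count"] = 1  # one unit of weight per trip

# One frame per hour; summing 'count' gives the weight of each location
frames = [
    demo.loc[demo.hour == hour, ["pickup_latitude", "pickup_longitude", "count"]]
        .groupby(["pickup_latitude", "pickup_longitude"]).sum()
        .reset_index().values.tolist()
    for hour in sorted(demo.hour.unique())
]

print(frames)
```

Each inner list corresponds to one step of the animation, in hour order.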

In [None]:
# generate a new base map
# TODO

In [None]:
# Create HeatMapWithTime
# TODO

In [None]:
embed_map(nyc_map_2)

In [None]:
# Let's create clusters of pick up and drop off locations (Use KMeans)

In [None]:
from sklearn.cluster import KMeans

In [None]:
df_loc = pd.DataFrame()
df_loc['longitude'] = longitude
df_loc['latitude'] = latitude

In [None]:
# Kmeans fit on df_loc
# TODO
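One possible approach to the clustering step, sketched on synthetic points. The three hot spots, their sizes, `n_clusters=3`, and the names `df_demo` and `kmeans_demo` are all assumptions for illustration. The notebook does not fix a number of clusters, so that choice is left open for the real data.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(1987)

# Three synthetic hot spots as (longitude, latitude); hypothetical values
centers = np.array([[-73.99, 40.75], [-73.87, 40.77], [-73.78, 40.64]])
sizes = (600, 300, 100)
points = np.vstack([
    center + rng.normal(scale=0.005, size=(n, 2))
    for center, n in zip(centers, sizes)
])
df_demo = pd.DataFrame(points, columns=["longitude", "latitude"])

# Fit on longitude first, latitude second, so cluster_centers_[:, 0] is longitude
kmeans_demo = KMeans(n_clusters=3, n_init=10, random_state=1987)
kmeans_demo.fit(df_demo[["longitude", "latitude"]])
df_demo["label"] = kmeans_demo.labels_

# Rank clusters from most to least visited
ranking = df_demo["label"].value_counts()
print(ranking)
print(kmeans_demo.cluster_centers_)
```

Keeping the column order as longitude then latitude matches how the cluster centers are indexed in the plotting cell later in the notebook.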

In [None]:
df_loc['label'] = kmeans.labels_

In [None]:
df_loc = df_loc.sample(100000)

In [None]:
# Plot clusters
# TODO

In [None]:
# Let's order clusters by most visited: 

In [None]:
df_loc['count']= 1

In [None]:
df_loc.groupby('label').count().sort_values(by='count', ascending=False)

Let's plot the cluster centers:

In [None]:
fig,ax = plt.subplots(figsize = (10,10))
for label in df_loc.label.unique():
    ax.plot(df_loc.longitude[df_loc.label == label],df_loc.latitude[df_loc.label == label],'.', alpha = 0.4, markersize = 0.1, color = 'gray')
    ax.plot(kmeans.cluster_centers_[label,0],kmeans.cluster_centers_[label,1],'o', color = 'brown')
    ax.annotate(label, (kmeans.cluster_centers_[label,0],kmeans.cluster_centers_[label,1]), color = 'brown', fontsize = 25)
ax.set_title('Cluster Centers')
plt.show()


## 6- So where should we put taxi stations?

In [None]:
# Finally, add markers to represent taxi stations

In [None]:
# TODO
# Use folium.ClickForMarker


In [None]:
embed_map(nyc_map)