# Gotham Cabs

<img src="Gotham cab.png">

It is around the year 2034 in the city of Gotham, and the last time Batman got into a fight
with the Joker, the Batmobile (Batman's high-tech car) was seriously damaged. Apparently,
it would take his butler, Alfred, a while to fix the car and during that time Batman needs
to use a cab to save people!<br>
<br>Alfred needs your help to predict taxi trip durations
between points in Gotham City. Accurate predictions would
significantly help Batman's missions.<br>
<br>Lucius (Batman's tech support staff) has pulled together a rich dataset of
recorded taxi trip durations between various parts of the city and is sharing it with you for
your modeling purposes.<br>
<br>The input features of the aforementioned data file are:
<br>
- pickup datetime: the date and time at which the taxi picked up a passenger. For instance, a pickup datetime of "6/14/2034 3:00:00 AM" means the passenger was picked up at 3:00 AM on June 14, 2034. You can also derive the day of the week or the season from this variable. For instance, the 2034 calendar (search for it on Google) shows that "6/14/2034" is a Wednesday.<br>
<br>
- pickup x: This is a variable that represents the x coordinate of the location the taxi picked up the passenger.<br>
<br>
- pickup y: This is a variable that represents the y coordinate of the location the taxi picked up the passenger.<br>
<br>

- dropoff x: This is a variable that represents the x coordinate of the location the taxi dropped off the passenger.<br>
<br>
- dropoff y: This is a variable that represents the y coordinate of the location where the taxi dropped off the passenger.<br>
<br>
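As a minimal sketch of how the weekday mentioned above can be derived, the snippet below parses the example timestamp with Python's standard library. The format string `PICKUP_FORMAT` and the helper `describe_pickup` are illustrative assumptions that match the example in the description; the training file loaded later stores timestamps as `YYYY-MM-DD HH:MM:SS`, so it would need a different format string.

```python
from datetime import datetime

# Assumed format for values such as "6/14/2034 3:00:00 AM"
PICKUP_FORMAT = "%m/%d/%Y %I:%M:%S %p"


def describe_pickup(pickup_datetime):
    """Return (weekday name, hour of day 0-23, month number) for one timestamp."""
    parsed = datetime.strptime(pickup_datetime, PICKUP_FORMAT)
    return parsed.strftime("%A"), parsed.hour, parsed.month


weekday, hour, month = describe_pickup("6/14/2034 3:00:00 AM")
print(weekday, hour, month)
```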
The response variable is:
<br>
<br>
- duration: the duration of the trip in seconds.<br>
<br>
This is a competition: we will compare the MSE on the test data. The winner is the one with the lowest MSE. Good luck!
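For reference, MSE (mean squared error) is the average of the squared differences between actual and predicted durations. The sketch below is a plain-Python illustration; the function name `mse` and the predicted values are hypothetical and are not results from any model in this notebook.

```python
def mse(actual, predicted):
    """Average of squared differences between actual and predicted durations."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sum(squared_errors) / len(squared_errors)


# Durations in seconds; the predictions are made up for illustration only
actual = [690, 399, 715]
predicted = [700, 380, 715]
print(mse(actual, predicted))
```

Because the errors are squared, a few trips with very large errors can dominate the score.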

### Import packages

In [2]:
import pylab
import calendar
import numpy as np
import pandas as pd
import seaborn as sns
from scipy import stats

from datetime import datetime
import holidays
import matplotlib.pyplot as plt
import warnings
pd.options.mode.chained_assignment = None
warnings.filterwarnings("ignore", category=DeprecationWarning)
%matplotlib inline

from sklearn.cluster import KMeans

### Import dataset

In [3]:
df=pd.read_csv("Train.csv", sep=",")

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1300000 entries, 0 to 1299999
Data columns (total 6 columns):
pickup_datetime    1300000 non-null object
duration           1300000 non-null int64
pickup_x           1300000 non-null float64
pickup_y           1300000 non-null float64
dropoff_x          1300000 non-null float64
dropoff_y          1300000 non-null float64
dtypes: float64(4), int64(1), object(1)
memory usage: 59.5+ MB


In [4]:
df.head()

Unnamed: 0,pickup_datetime,duration,pickup_x,pickup_y,dropoff_x,dropoff_y
0,2034-02-02 22:43:13,690,154.696567,325.477213,126.881573,322.21187
1,2034-01-30 16:30:13,399,160.754531,337.409138,156.181906,344.56497
2,2034-03-26 17:11:42,715,144.344672,330.828932,168.868622,372.795481
3,2034-03-07 23:45:27,566,122.749169,439.668599,131.837488,350.767551
4,2034-05-27 20:09:32,1565,184.348606,398.135385,104.896974,332.185639


In [5]:
# check for missing values in each column
df.isnull().sum()

pickup_datetime    0
duration           0
pickup_x           0
pickup_y           0
dropoff_x          0
dropoff_y          0
dtype: int64

### Data processing and Feature engineering

##### 1. Extract the "Date" and "Hour" from the feature "pickup_datetime".
##### 2. Extract "Weekday", "Month", and "Holiday" from "Date".
##### 3. Use "pickup_x", "pickup_y", "dropoff_x", "dropoff_y" to create "Euclidean distance" and "Manhattan distance".
##### 4. Use K-means to create cluster labels with 4, 8, and 16 clusters.
##### 5. Use PCA to create "pickup_pca" and "dropoff_pca".
##### 6. Dummy-encode "Hour", "Weekday", "Month".
##### 7. Create a new dataset for modeling.

In [6]:
#extract date as a new column
df["Date"] = df.pickup_datetime.apply(lambda x: x.split()[0])

In [7]:
#extract hour as a new column
df["Hour"]=df.pickup_datetime.apply(lambda x: x.split()[1].split(":")[0])

In [8]:
#extract weekday as a new column
df["Weekday"] = df.Date.apply(lambda dateString : calendar.day_name[datetime.strptime(dateString,"%Y-%m-%d").weekday()])

In [9]:
#extract month as a new column
df["Month"] = df.Date.apply(lambda dateString:  calendar.month_name[datetime.strptime(dateString,"%Y-%m-%d").month] )
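The row-wise `apply` calls above parse each string separately, which can be slow on 1.3 million rows. A possible vectorized alternative, sketched here on two timestamps copied from the `df.head()` output, parses the column once with `pd.to_datetime` and reads the parts through the `.dt` accessor. The variable names `sample` and `parsed` are illustrative only.

```python
import pandas as pd

# Two timestamps copied from the df.head() output
sample = pd.DataFrame(
    {"pickup_datetime": ["2034-02-02 22:43:13", "2034-01-30 16:30:13"]}
)

# Parse the whole column once instead of once per row
parsed = pd.to_datetime(sample["pickup_datetime"], format="%Y-%m-%d %H:%M:%S")

sample["Date"] = parsed.dt.strftime("%Y-%m-%d")
sample["Hour"] = parsed.dt.strftime("%H")  # zero-padded string, e.g. "22"
sample["Weekday"] = parsed.dt.day_name()
sample["Month"] = parsed.dt.month_name()
print(sample)
```

Keeping "Hour" as a zero-padded string preserves dummy column names of the form `Hour_00`, `Hour_01`, and so on.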

In [10]:
# flag pickups that fall on a US public holiday (1) or not (0)
usholiday = holidays.UnitedStates()
df['Holiday'] = df['Date'].map(lambda x: 1 if x in usholiday else 0)

In [11]:
from sklearn import metrics
from scipy.spatial import distance

In [12]:
df['Euclidean distance'] = df.apply(lambda p: round(np.linalg.norm(np.array((p.pickup_x,p.pickup_y))-np.array((p.dropoff_x,p.dropoff_y))),3) ,axis=1)

In [13]:
df['Manhattan distance'] = df.apply(lambda p: round(abs(p.dropoff_x-p.pickup_x)+abs(p.dropoff_y-p.pickup_y),3) ,axis=1)
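The two distance features can also be computed on whole columns at once instead of row by row. The sketch below assumes the same column names as `df`; the helper `add_distances` is hypothetical, and the two sample rows are copied from the `df.head()` output. NumPy rounding may differ from Python's `round` in the last digit for rare tie cases.

```python
import numpy as np
import pandas as pd


def add_distances(frame):
    """Return a copy with Euclidean and Manhattan distances rounded to 3 decimals."""
    out = frame.copy()
    dx = out["dropoff_x"] - out["pickup_x"]
    dy = out["dropoff_y"] - out["pickup_y"]
    out["Euclidean distance"] = np.hypot(dx, dy).round(3)       # sqrt(dx^2 + dy^2)
    out["Manhattan distance"] = (dx.abs() + dy.abs()).round(3)  # |dx| + |dy|
    return out


# First two rows of the training data
sample = pd.DataFrame({
    "pickup_x": [154.696567, 160.754531],
    "pickup_y": [325.477213, 337.409138],
    "dropoff_x": [126.881573, 156.181906],
    "dropoff_y": [322.21187, 344.56497],
})
result = add_distances(sample)
print(result[["Euclidean distance", "Manhattan distance"]])
```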

In [14]:
#Create cluster with cluster number=4
kmeans = KMeans(n_clusters=4, random_state=0)
df['cluster_4'] = kmeans.fit_predict(df[['pickup_x','pickup_y','dropoff_x','dropoff_y']])

In [15]:
#Create cluster with cluster number=8
kmeans = KMeans(n_clusters=8, random_state=0)
df['cluster_8'] = kmeans.fit_predict(df[['pickup_x','pickup_y','dropoff_x','dropoff_y']])

In [16]:
#Create cluster with cluster number=16
kmeans = KMeans(n_clusters=16, random_state=0)
df['cluster_16'] = kmeans.fit_predict(df[['pickup_x','pickup_y','dropoff_x','dropoff_y']])
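The three clustering cells differ only in the number of clusters, so they could be folded into one loop. The sketch below assumes scikit-learn is available; the helper `add_cluster_labels`, the constant `COORD_COLS`, and the synthetic coordinates are illustrative and are not part of the original notebook. `n_init=10` is set explicitly so the number of restarts does not depend on the installed scikit-learn version.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

COORD_COLS = ["pickup_x", "pickup_y", "dropoff_x", "dropoff_y"]


def add_cluster_labels(frame, cluster_counts=(4, 8, 16), seed=0):
    """Return a copy with one K-means label column per requested cluster count."""
    out = frame.copy()
    for k in cluster_counts:
        model = KMeans(n_clusters=k, n_init=10, random_state=seed)
        # Always fit on the coordinate columns only, never on earlier labels
        out[f"cluster_{k}"] = model.fit_predict(out[COORD_COLS])
    return out


# Synthetic coordinates, for illustration only
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.uniform(100, 450, size=(200, 4)), columns=COORD_COLS)
labeled = add_cluster_labels(toy)
print(labeled[["cluster_4", "cluster_8", "cluster_16"]].head())
```

The labels are arbitrary identifiers rather than ordered quantities, so it may be worth treating them as categorical when they are passed to a model.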

In [17]:
#import PCA, and create two features with PCA.
from sklearn.decomposition import PCA

In [18]:
pca = PCA(n_components=1)

In [19]:
df['pickup_pca'] = pca.fit_transform(df[['pickup_x', 'pickup_y']])

In [20]:
df['dropoff_pca'] = pca.fit_transform(df[['dropoff_x', 'dropoff_y']])
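Each `fit_transform` call above refits the same `pca` object, so "pickup_pca" and "dropoff_pca" are projections onto two separately estimated axes. To make the one-component projection concrete, here is a NumPy-only sketch: center the points, take the first right singular vector, and project onto it. The function `first_component_scores` is hypothetical, and its output may differ from scikit-learn's by a sign flip.

```python
import numpy as np


def first_component_scores(points):
    """Project 2-D points onto their first principal component."""
    points = np.asarray(points, dtype=float)
    centered = points - points.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[0]


# Toy points on the line y = 2x, so all variance lies along one direction
toy = np.array([[0.0, 0.0], [1.0, 2.0], [2.0, 4.0]])
scores = first_component_scores(toy)
print(np.abs(scores))
```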

In [21]:
df.head()

Unnamed: 0,pickup_datetime,duration,pickup_x,pickup_y,dropoff_x,dropoff_y,Date,Hour,Weekday,Month,Holiday,Euclidean distance,Manhattan distance,cluster_4,cluster_8,cluster_16,pickup_pca,dropoff_pca
0,2034-02-02 22:43:13,690,154.696567,325.477213,126.881573,322.21187,2034-02-02,22,Thursday,February,0,28.006,31.08,1,6,4,-17.039229,-11.360709
1,2034-01-30 16:30:13,399,160.754531,337.409138,156.181906,344.56497,2034-01-30,16,Monday,January,0,8.492,11.728,0,1,0,-11.443359,0.352978
2,2034-03-26 17:11:42,715,144.344672,330.828932,168.868622,372.795481,2034-03-26,17,Sunday,March,0,48.607,66.49,0,0,0,-6.418135,22.986239
3,2034-03-07 23:45:27,566,122.749169,439.668599,131.837488,350.767551,2034-03-07,23,Tuesday,March,0,89.364,97.989,0,2,1,92.325252,14.072647
4,2034-05-27 20:09:32,1565,184.348606,398.135385,104.896974,332.185639,2034-05-27,20,Saturday,May,0,103.257,145.401,0,0,4,21.528007,5.168034


In [22]:
# Dummy-encode the three variables 'Hour', 'Weekday', 'Month'
df_dummy=df[['Hour', 'Weekday', 'Month']]
df_dummy=pd.get_dummies(df_dummy)

In [24]:
df_dummy.head()

Unnamed: 0,Hour_00,Hour_01,Hour_02,Hour_03,Hour_04,Hour_05,Hour_06,Hour_07,Hour_08,Hour_09,...,Weekday_Thursday,Weekday_Tuesday,Weekday_Wednesday,Month_April,Month_February,Month_January,Month_July,Month_June,Month_March,Month_May
0,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,1,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
3,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,1,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1


In [25]:
df_dummy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1300000 entries, 0 to 1299999
Data columns (total 38 columns):
Hour_00              1300000 non-null uint8
Hour_01              1300000 non-null uint8
Hour_02              1300000 non-null uint8
Hour_03              1300000 non-null uint8
Hour_04              1300000 non-null uint8
Hour_05              1300000 non-null uint8
Hour_06              1300000 non-null uint8
Hour_07              1300000 non-null uint8
Hour_08              1300000 non-null uint8
Hour_09              1300000 non-null uint8
Hour_10              1300000 non-null uint8
Hour_11              1300000 non-null uint8
Hour_12              1300000 non-null uint8
Hour_13              1300000 non-null uint8
Hour_14              1300000 non-null uint8
Hour_15              1300000 non-null uint8
Hour_16              1300000 non-null uint8
Hour_17              1300000 non-null uint8
Hour_18              1300000 non-null uint8
Hour_19              1300000 non-null uint8
Hour_20

In [26]:
# convert the dummy variable to bool value
df_dummy[df_dummy.columns[0:38]] = df_dummy[df_dummy.columns[0:38]].astype('bool')
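The encoding and the conversion to bool can also be done in a single call by passing `dtype=bool` to `pd.get_dummies`. The sketch below uses two rows whose values come from the `df.head()` output; the variable names `sample` and `encoded` are illustrative.

```python
import pandas as pd

sample = pd.DataFrame({
    "Hour": ["22", "16"],
    "Weekday": ["Thursday", "Monday"],
})

# One boolean column per distinct value, named <column>_<value>
encoded = pd.get_dummies(sample, dtype=bool)
print(encoded.columns.tolist())
print(encoded.dtypes)
```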

In [27]:
df_dummy.columns

Index(['Hour_00', 'Hour_01', 'Hour_02', 'Hour_03', 'Hour_04', 'Hour_05',
       'Hour_06', 'Hour_07', 'Hour_08', 'Hour_09', 'Hour_10', 'Hour_11',
       'Hour_12', 'Hour_13', 'Hour_14', 'Hour_15', 'Hour_16', 'Hour_17',
       'Hour_18', 'Hour_19', 'Hour_20', 'Hour_21', 'Hour_22', 'Hour_23',
       'Weekday_Friday', 'Weekday_Monday', 'Weekday_Saturday',
       'Weekday_Sunday', 'Weekday_Thursday', 'Weekday_Tuesday',
       'Weekday_Wednesday', 'Month_April', 'Month_February', 'Month_January',
       'Month_July', 'Month_June', 'Month_March', 'Month_May'],
      dtype='object')

In [28]:
#Create new dataset
df1=df.drop(['pickup_datetime','Date','Hour','Weekday','Month'],axis=1)

In [29]:
# combine the two datasets into a new dataset
df_new=pd.concat([df1,df_dummy],axis =1)

In [30]:
df_new.head()

Unnamed: 0,duration,pickup_x,pickup_y,dropoff_x,dropoff_y,Holiday,Euclidean distance,Manhattan distance,cluster_4,cluster_8,...,Weekday_Thursday,Weekday_Tuesday,Weekday_Wednesday,Month_April,Month_February,Month_January,Month_July,Month_June,Month_March,Month_May
0,690,154.696567,325.477213,126.881573,322.21187,0,28.006,31.08,1,6,...,True,False,False,False,True,False,False,False,False,False
1,399,160.754531,337.409138,156.181906,344.56497,0,8.492,11.728,0,1,...,False,False,False,False,False,True,False,False,False,False
2,715,144.344672,330.828932,168.868622,372.795481,0,48.607,66.49,0,0,...,False,False,False,False,False,False,False,False,True,False
3,566,122.749169,439.668599,131.837488,350.767551,0,89.364,97.989,0,2,...,False,True,False,False,False,False,False,False,True,False
4,1565,184.348606,398.135385,104.896974,332.185639,0,103.257,145.401,0,0,...,False,False,False,False,False,False,False,False,False,True


In [31]:
df_new=df_new[['Euclidean distance', 'pickup_x', 'dropoff_x', 'dropoff_y', 'Holiday',
      'Manhattan distance', 'cluster_4', 'cluster_8',
       'cluster_16', 'pickup_pca', 'dropoff_pca', 'Hour_00', 'Hour_01',
       'Hour_02', 'Hour_03', 'Hour_04', 'Hour_05', 'Hour_06', 'Hour_07',
       'Hour_08', 'Hour_09', 'Hour_10', 'Hour_11', 'Hour_12', 'Hour_13',
       'Hour_14', 'Hour_15', 'Hour_16', 'Hour_17', 'Hour_18', 'Hour_19',
       'Hour_20', 'Hour_21', 'Hour_22', 'Hour_23', 'Weekday_Friday',
       'Weekday_Monday', 'Weekday_Saturday', 'Weekday_Sunday',
       'Weekday_Thursday', 'Weekday_Tuesday', 'Weekday_Wednesday',
       'Month_April', 'Month_February', 'Month_January', 'Month_July',
       'Month_June', 'Month_March', 'Month_May', 'pickup_y','duration']]

In [32]:
df_new.head()

Unnamed: 0,Euclidean distance,pickup_x,dropoff_x,dropoff_y,Holiday,Manhattan distance,cluster_4,cluster_8,cluster_16,pickup_pca,...,Weekday_Wednesday,Month_April,Month_February,Month_January,Month_July,Month_June,Month_March,Month_May,pickup_y,duration
0,28.006,154.696567,126.881573,322.21187,0,31.08,1,6,4,-17.039229,...,False,False,True,False,False,False,False,False,325.477213,690
1,8.492,160.754531,156.181906,344.56497,0,11.728,0,1,0,-11.443359,...,False,False,False,True,False,False,False,False,337.409138,399
2,48.607,144.344672,168.868622,372.795481,0,66.49,0,0,0,-6.418135,...,False,False,False,False,False,False,True,False,330.828932,715
3,89.364,122.749169,131.837488,350.767551,0,97.989,0,2,1,92.325252,...,False,False,False,False,False,False,True,False,439.668599,566
4,103.257,184.348606,104.896974,332.185639,0,145.401,0,0,4,21.528007,...,False,False,False,False,False,False,False,True,398.135385,1565


In [33]:
df_new.to_csv('data for processing and model0501.csv')
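Note that `to_csv` writes the row index as an unnamed first column by default, which shows up as "Unnamed: 0" when the file is read back. The sketch below illustrates the difference using in-memory buffers; the helper `round_trip_columns` and the tiny frame are hypothetical.

```python
import io

import pandas as pd


def round_trip_columns(frame, write_index):
    """Write a frame to CSV in memory, read it back, and return the column names."""
    buffer = io.StringIO()
    frame.to_csv(buffer, index=write_index)
    buffer.seek(0)
    return pd.read_csv(buffer).columns.tolist()


frame = pd.DataFrame({"duration": [690, 399], "Holiday": [0, 0]})
print(round_trip_columns(frame, write_index=True))
print(round_trip_columns(frame, write_index=False))
```

Passing `index=False` when saving keeps the stored file limited to the feature and response columns.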

### Exploratory Data Analysis

In [34]:
import matplotlib.pyplot as plt

In [35]:
df.head()

Unnamed: 0,pickup_datetime,duration,pickup_x,pickup_y,dropoff_x,dropoff_y,Date,Hour,Weekday,Month,Holiday,Euclidean distance,Manhattan distance,cluster_4,cluster_8,cluster_16,pickup_pca,dropoff_pca
0,2034-02-02 22:43:13,690,154.696567,325.477213,126.881573,322.21187,2034-02-02,22,Thursday,February,0,28.006,31.08,1,6,4,-17.039229,-11.360709
1,2034-01-30 16:30:13,399,160.754531,337.409138,156.181906,344.56497,2034-01-30,16,Monday,January,0,8.492,11.728,0,1,0,-11.443359,0.352978
2,2034-03-26 17:11:42,715,144.344672,330.828932,168.868622,372.795481,2034-03-26,17,Sunday,March,0,48.607,66.49,0,0,0,-6.418135,22.986239
3,2034-03-07 23:45:27,566,122.749169,439.668599,131.837488,350.767551,2034-03-07,23,Tuesday,March,0,89.364,97.989,0,2,1,92.325252,14.072647
4,2034-05-27 20:09:32,1565,184.348606,398.135385,104.896974,332.185639,2034-05-27,20,Saturday,May,0,103.257,145.401,0,0,4,21.528007,5.168034


In [None]:
x=df["Hour"]