## Introduction
This is a comprehensive Exploratory Data Analysis for the [New York City Taxi Trip Duration](https://www.kaggle.com/c/nyc-taxi-trip-duration) competition with Python and Data Visualization libraries such as matplotlib and seaborn. I also use [New York City Taxi with OSRM](https://www.kaggle.com/oscarleo/new-york-city-taxi-with-osrm) to support the primary dataset.

The goal of this playground challenge is to predict the duration of taxi rides in NYC based on features like trip coordinates or pickup date and time. We start the exploratory data analysis by loading the dataset with pandas, then check for missing values, engineer features, check for outliers, and examine univariate and bivariate relationships. Finally, we fit regression models (a Decision Tree and Gradient Boosting).
We also implement the haversine formula to calculate the distance between two points given by longitude and latitude, as follows:

$$ s = r \theta $$
where $r$ is the Earth's radius, and $\theta$ is the central angle calculated as

$$ \theta = 2 \arcsin\left( \sqrt{\sin^2 \left(\frac{\phi_2-\phi_1}{2}\right) + \cos(\phi_1)\cos(\phi_2)\sin^2 \left( \frac{\lambda_2-\lambda_1}{2} \right) } \right) $$
with:

$$ \begin{align} \phi &= \text{latitude}\\ \lambda &= \text{longitude}\\ \end{align} $$

# File descriptions
- train.csv - the training set (contains 1458644 trip records)
- test.csv - the testing set (contains 625134 trip records)
- sample_submission.csv - a sample submission file in the correct format

# Data fields
- id - a unique identifier for each trip
- vendor_id - a code indicating the provider associated with the trip record
- pickup_datetime - date and time when the meter was engaged
- dropoff_datetime - date and time when the meter was disengaged
- passenger_count - the number of passengers in the vehicle (driver entered value)
- pickup_longitude - the longitude where the meter was engaged
- pickup_latitude - the latitude where the meter was engaged
- dropoff_longitude - the longitude where the meter was disengaged
- dropoff_latitude - the latitude where the meter was disengaged
- store_and_fwd_flag - This flag indicates whether the trip record was held in vehicle memory before sending to the vendor because the vehicle did not have a connection to the server - Y=store and forward; N=not a store and forward trip
- trip_duration - duration of the trip in seconds

In [None]:
%matplotlib inline
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = [8, 5]
import xgboost as xgb
import seaborn as sns
from datetime import timedelta
import datetime as dt
from sklearn.model_selection import train_test_split
import warnings

## LOAD DATA

In [None]:
df_train = pd.read_csv('../input/nyc-taxi-trip-duration/train.zip')
df_test = pd.read_csv("../input/nyc-taxi-trip-duration/test.zip")
df_sample_submission = pd.read_csv('../input/nyc-taxi-trip-duration/sample_submission.zip')

In [None]:
df_train.shape

In [None]:
df_test.shape

In [None]:
# Check which train columns are missing from the test set
print([column for column in df_train.columns if column not in df_test.columns])

As expected, neither 'dropoff_datetime' nor 'trip_duration' is in the test data.

In [None]:
len(df_train['id'].value_counts()) == len(df_train)

There are no duplicate ids in the train data.

In [None]:
df_train.head()

### Check missing values

In [None]:
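def check_missing_value(df):
    # Count rows that contain at least one missing value
    n_rows = len(df)
    n_missing = n_rows - len(df.dropna())
    if n_missing == 0:
        print("no missing values")
    else:
        print("%d of %d rows (%.2f%%) contain missing values" % (n_missing, n_rows, 100 * n_missing / n_rows))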
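
When rows with missing values do turn up, a per-column breakdown is usually more useful than a single count. The following is a minimal sketch of such a summary; `missing_summary` and the toy frame are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Return per-column missing counts and fractions, largest first."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "n_missing": counts,
        "frac_missing": counts / len(df),
    })
    return summary.sort_values("n_missing", ascending=False)

# Toy frame standing in for df_train (hypothetical values)
toy = pd.DataFrame({
    "passenger_count": [1, 2, np.nan, 1],
    "trip_duration": [455, 663, 2124, 429],
})
summary = missing_summary(toy)
print(summary)
```

On the real data this could be called as `missing_summary(df_train)`.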

In [None]:
check_missing_value(df_train)

In [None]:
check_missing_value(df_test)

## FEATURE ANALYSIS

## Univariate Feature

In [None]:
df_train['vendor_id'].value_counts()

In [None]:
df_train['store_and_fwd_flag'].value_counts()

In [None]:
def convert_binary_variable(df):
    df['store_and_fwd_flag'] = 1 * (df.store_and_fwd_flag.values == 'Y')
    return df
df_train = convert_binary_variable(df_train)
df_test = convert_binary_variable(df_test)

Both 'vendor_id' and 'store_and_fwd_flag' are binary variables.

Apply a logarithmic transformation to the target, since the evaluation metric is RMSLE, and plot the distribution to see whether it is approximately normal.

In [None]:
sns.set()
df_train['log_trip_duration'] = np.log(df_train['trip_duration'].values + 1)
plt.hist(df_train['log_trip_duration'].values, bins=100)
plt.xlabel('log(trip_duration)')
plt.ylabel('number of train records')
plt.show()
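
The reason for modelling `log(trip_duration + 1)` is that plain RMSE on the transformed target is the same quantity as RMSLE on the raw durations. The sketch below checks this on a few made-up numbers; the `rmsle` helper and the sample values are assumptions for illustration only.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error on raw (untransformed) values."""
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

durations = np.array([300.0, 900.0, 1800.0])   # hypothetical true durations (s)
predicted = np.array([330.0, 800.0, 2000.0])   # hypothetical predictions (s)

# RMSE computed on the log1p-transformed values ...
log_rmse = np.sqrt(np.mean((np.log1p(predicted) - np.log1p(durations)) ** 2))
# ... matches RMSLE computed from the raw values
score = rmsle(durations, predicted)

# np.expm1 inverts np.log1p, which is how predictions are mapped back to seconds
recovered = np.expm1(np.log1p(durations))
```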

In [None]:

upper_limit = np.percentile(df_train['log_trip_duration'],99.99)
lower_limit = np.percentile(df_train['log_trip_duration'],0.01)
df_train_filter = df_train[(df_train['log_trip_duration'] <= upper_limit) & (df_train['log_trip_duration'] >= lower_limit) ]
print(len(df_train_filter))

In [None]:
df_train.dtypes

Since pickup_datetime and dropoff_datetime have dtype object, we convert pickup_datetime to datetime and derive calendar features from it.

### Conversion Of Datatype

In [None]:
def convert_datetime(df):
    '''
    convert datetime in string format to datetime object
    '''
    df['pickup_datetime'] = pd.to_datetime(df.pickup_datetime)
    df.loc[:, 'pickup_date'] = df['pickup_datetime'].dt.date
    df.loc[:, 'pickup_month'] = df['pickup_datetime'].dt.month
    df.loc[:, 'pickup_weekday'] = df['pickup_datetime'].dt.weekday
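    # isocalendar().week is a nullable UInt32 column; cast to int so the feature matrix stays numeric
    df.loc[:, 'pickup_weekofyear'] = df['pickup_datetime'].dt.isocalendar().week.astype(int)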
    df.loc[:, 'pickup_hour'] = df['pickup_datetime'].dt.hour

    return df

In [None]:
df_train = convert_datetime(df_train )
df_test = convert_datetime(df_test)
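
As a quick check of the conventions involved, the sketch below derives the same kind of features for two made-up timestamps. Note that pandas numbers weekdays from Monday = 0 to Sunday = 6, which matters when labelling plots by day of the week. The toy frame is hypothetical.

```python
import pandas as pd

# Two made-up pickup times in the same string format as the dataset
toy = pd.DataFrame({"pickup_datetime": ["2016-03-14 17:24:55", "2016-06-12 00:43:35"]})
ts = pd.to_datetime(toy["pickup_datetime"])

toy["pickup_weekday"] = ts.dt.weekday                            # Monday=0 ... Sunday=6
toy["pickup_hour"] = ts.dt.hour
toy["pickup_weekofyear"] = ts.dt.isocalendar().week.astype(int)  # ISO week number
print(toy)
```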

## Bivariate Analysis

In [None]:
def get_stats_describe(df,feature1,feature2):
    biv_columns = df.groupby([feature1])[[feature2]].agg(['size','mean','median','var','std']).reset_index()
    biv_columns = biv_columns.round(3)
    biv_columns.columns = [feature1,'size','mean','median','var','std']
    biv_columns = biv_columns.set_index(feature1)
    return biv_columns
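
For reference, the same table can be produced with pandas named aggregation, which avoids renaming the columns by hand. This is only an alternative sketch on a toy frame with hypothetical values, not a replacement for the function above.

```python
import pandas as pd

toy = pd.DataFrame({
    "pickup_hour": [7, 7, 18, 18, 18],
    "log_trip_duration": [6.0, 6.4, 6.8, 7.0, 7.2],
})

# One row per group; each keyword becomes an output column
stats = (
    toy.groupby("pickup_hour")["log_trip_duration"]
       .agg(size="size", mean="mean", median="median", var="var", std="std")
       .round(3)
)
print(stats)
```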

In [None]:
df_train_date = get_stats_describe(df_train,'pickup_date','log_trip_duration')
df_train_weekday = get_stats_describe(df_train,'pickup_weekday','log_trip_duration')
df_train_hour = get_stats_describe(df_train,'pickup_hour','log_trip_duration')

In [None]:
df_train_date.describe()

In [None]:
fig, ax = plt.subplots(nrows=2, sharex=True, sharey=False)
ax[0].plot(df_train_date[["size"]],".-")
fig.suptitle("Number of Samples and Average Log Trip Duration by Date")
ax[0].set_ylabel('Number of Samples')

ax[1].plot(df_train_date[["mean"]],".-")
ax[1].set_ylabel('Average Log Trip Duration')
ax[1].set_xlabel('Date')

In [None]:
fig, ax = plt.subplots(nrows=2, sharex=True, sharey=False)
ax[0].plot(df_train_weekday[["size"]],"o-")
fig.suptitle("Number of Samples and Average Log Trip Duration by Day")
ax[0].set_ylabel('Number of Samples')

ax[1].plot(df_train_weekday[["mean"]],"o-")
ax[1].set_ylabel('Average Log Trip Duration')
ax[1].set_xticks(range(7))  # pandas weekday: Monday=0 ... Sunday=6
ax[1].set_xticklabels(["Mon","Tue","Wed","Thu","Fri","Sat","Sun"])
ax[1].set_xlabel('Day of the Week')

In [None]:
fig, ax = plt.subplots(nrows=2, sharex=True, sharey=False)
ax[0].plot(df_train_hour[["size"]],"o-")
fig.suptitle("Number of Samples and Average Log Trip Duration by Hour")
ax[0].set_ylabel('Number of Samples')

ax[1].plot(df_train_hour[["mean"]],"o-")
ax[1].set_ylabel('Average Log Trip Duration')
ax[1].set_xlabel('Hour of the Day')

The morning rush hour is around 7:00 am. Trip volume rises sharply after 4 pm and peaks around 6 pm.

## Number of passengers vs Trip Duration

In [None]:
pclt = df_train[["passenger_count","log_trip_duration"]].boxplot( by="passenger_count", figsize = (10,6))
pclt.set_xlabel("Number of Passengers")
pclt.set_ylabel("Log Time Duration")
pclt.set_title("distribution of time duration by number of passengers")

In [None]:
df_train.columns

## Vendors vs Trip Duration

In [None]:
vilt = df_train[["vendor_id","log_trip_duration"]].boxplot( by="vendor_id", figsize = (10,6))
vilt.set_xlabel("Vendor")
vilt.set_ylabel("Log Time Duration")
vilt.set_title("distribution of time duration by vendor id")

## Distance vs Trip Duration (Haversine Distance)

Exploratory analyses on Kaggle suggest that the distance (km) between pickup and dropoff points is a significant feature for trip duration. Let's calculate the distance with the [haversine](https://en.wikipedia.org/wiki/Haversine_formula) formula and investigate its patterns.


$$ s = r \theta $$
where $r$ is the Earth's radius, and $\theta$ is the central angle calculated as

$$ \theta = 2 \arcsin\left( \sqrt{\sin^2 \left(\frac{\phi_2-\phi_1}{2}\right) + \cos(\phi_1)\cos(\phi_2)\sin^2 \left( \frac{\lambda_2-\lambda_1}{2} \right) } \right) $$
with:

$$ \begin{align} \phi &= \text{latitude}\\ \lambda &= \text{longitude}\\ \end{align} $$

In [None]:
def haversine_distance(lat1, long1, lat2, long2):
    # returns the great-circle distance in km; inputs are in degrees
    # r = mean Earth radius in km
    lat1, long1, lat2, long2 = map(np.radians, (lat1, long1, lat2, long2))
    r = 6371 
    lat = lat2 - lat1
    long = long2 - long1
    d = np.sin(lat * 0.5) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(long * 0.5) ** 2
    h = 2 * r * np.arcsin(np.sqrt(d))
    return h
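
A quick way to sanity-check a haversine implementation is to compare it with a case whose answer is known in closed form: two points on the same meridian one degree of latitude apart are separated by $r \cdot \pi / 180$. The sketch below restates the formula in a standalone helper (`haversine_km` is a hypothetical name) so it can run on its own.

```python
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2, r=6371.0):
    """Great-circle distance in km between points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * r * np.arcsin(np.sqrt(a))

# One degree of latitude along a meridian should equal r * pi / 180
one_degree = haversine_km(40.0, -74.0, 41.0, -74.0)
expected = 6371.0 * np.pi / 180

# Identical points are zero km apart, and the function is vectorised over arrays
same_point = haversine_km(40.75, -73.98, 40.75, -73.98)
batch = haversine_km(np.array([40.0, 40.75]), np.array([-74.0, -73.98]),
                     np.array([41.0, 40.75]), np.array([-74.0, -73.98]))
```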

In [None]:
def get_distance(df):
    df.loc[:, 'distance_haversine'] = haversine_distance(df['pickup_latitude'].values, \
                                                      df['pickup_longitude'].values, \
                                                      df['dropoff_latitude'].values, \
                                                      df['dropoff_longitude'].values)
    
    return df

df_train = get_distance(df_train)
df_test = get_distance(df_test)

In [None]:
df_train_filter = df_train[df_train.trip_duration < 100000]

In [None]:
df_train_filter.head()

In [None]:
fig, ax = plt.subplots(ncols=1, nrows=1)
ax.scatter(df_train_filter.distance_haversine, df_train_filter.trip_duration, s=1, alpha=0.5)
ax.set_xlabel("Distance Haversine")
ax.set_ylabel("Trip Duration")

## Average Speed by Hour and Weekday

In [None]:
# distance_haversine is in km and trip_duration in seconds, so this is metres per second
df_train.loc[:, 'avg_speed_h'] = 1000 * df_train['distance_haversine'] / df_train['trip_duration']
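
The units are easy to get wrong here, so a worked example on made-up numbers may help: multiplying km by 1000 and dividing by seconds gives metres per second, while km/h needs a factor of 3600 instead. The values below are hypothetical.

```python
distance_km = 3.0    # hypothetical haversine distance
duration_s = 600.0   # hypothetical trip duration (10 minutes)

speed_m_per_s = 1000 * distance_km / duration_s    # metres per second
speed_km_per_h = 3600 * distance_km / duration_s   # kilometres per hour

print(speed_m_per_s, speed_km_per_h)
```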

In [None]:
fig, ax = plt.subplots(ncols=2, sharey=True)
# Select the column before aggregating so non-numeric columns are never averaged
ax[0].plot(df_train.groupby('pickup_hour')['avg_speed_h'].mean(), 'o-', lw=2, alpha=0.7)
ax[1].plot(df_train.groupby('pickup_weekday')['avg_speed_h'].mean(), 'o-', lw=2, alpha=0.7)
ax[0].set_xlabel('hour')
ax[0].set_ylabel('average speed (m/s)')

ax[1].set_xlabel('weekday')
ax[1].set_xticks(range(7))
ax[1].set_xticklabels(["Mon","Tue","Wed","Thu","Fri","Sat","Sun"])
fig.suptitle('Average Traffic Speed Over Time')
plt.show()

### Distance from external dataset

In [None]:
# add 3 more features
fr1 = pd.read_csv('../input/new-york-city-taxi-with-osrm/fastest_routes_train_part_1.csv', usecols=['id', 'total_distance', 'total_travel_time',  'number_of_steps'])
fr2 = pd.read_csv('../input/new-york-city-taxi-with-osrm/fastest_routes_train_part_2.csv', usecols=['id', 'total_distance', 'total_travel_time', 'number_of_steps'])
df_test_street_info = pd.read_csv('../input/new-york-city-taxi-with-osrm/fastest_routes_test.csv',
                               usecols=['id', 'total_distance', 'total_travel_time', 'number_of_steps'])
df_train_street_info = pd.concat((fr1, fr2))
df_train = df_train.merge(df_train_street_info, how='left', on='id')
df_test = df_test.merge(df_test_street_info, how='left', on='id')
df_train_street_info.head()
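
Because this is a left merge, any trip whose id is absent from the OSRM files ends up with NaN in the three new columns. One way to see how many rows are affected is the `indicator` option of `merge`; the sketch below shows the idea on toy frames with hypothetical ids and values.

```python
import pandas as pd

trips = pd.DataFrame({"id": ["id1", "id2", "id3"],
                      "trip_duration": [455, 663, 2124]})
routes = pd.DataFrame({"id": ["id1", "id3"],
                       "total_distance": [2009.1, 11060.8]})

# indicator=True adds a "_merge" column saying where each row came from
merged = trips.merge(routes, how="left", on="id", indicator=True)
unmatched = merged.loc[merged["_merge"] == "left_only", "id"].tolist()
print(unmatched)
```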

In [None]:
fig, ax = plt.subplots(ncols=1, nrows=1)
ax.scatter(df_train.total_distance, df_train.trip_duration, s=1, alpha=0.5)
ax.set_xlabel("Total Distance")
ax.set_ylabel("Trip Duration")

In [None]:
df_train.head()

In [None]:
fig, ax = plt.subplots(ncols=1, nrows=1)
ax.scatter(df_train.total_distance, df_train.total_travel_time, s=1, alpha=0.5)
ax.set_xlabel("Total Distance")
ax.set_ylabel("Total Travel Time")

## Comparison between Test and Train data

### Time

In [None]:
plt.plot(df_train.groupby('pickup_date')[['id']].count(), label='train')
plt.plot(df_test.groupby('pickup_date')[['id']].count(), label='test')
plt.title('Number of Samples in Train and Test')
plt.legend(loc=0)
plt.xlabel('Time')
plt.ylabel('Number of Samples')
plt.show()

## Geolocation

In [None]:
fig, ax = plt.subplots(ncols=2, sharex=True, sharey=True)
ax[0].scatter(df_train['pickup_longitude'].values, df_train['pickup_latitude'].values,
              s=1, label='train', alpha=0.1)
ax[1].scatter(df_test['pickup_longitude'].values, df_test['pickup_latitude'].values,
              s=1, label='test', alpha=0.1)
fig.suptitle('Trip Location Distribution of Train and Test Data')
ax[0].legend(loc=0)
ax[0].set_ylabel('latitude')
ax[0].set_xlabel('longitude')
ax[1].set_xlabel('longitude')
ax[1].legend(loc=0)
plt.show()

Let's zoom in to the city and see the difference.

In [None]:
long_lim = (-74.03, -73.75)
lat_lim = (40.63, 40.85)
fig, ax = plt.subplots(ncols=2, sharex=True, sharey=True)
ax[0].scatter(df_train['pickup_longitude'].values, df_train['pickup_latitude'].values,
              s=1, label='train', alpha=0.1)
ax[1].scatter(df_test['pickup_longitude'].values, df_test['pickup_latitude'].values,
              s=1, label='test', alpha=0.1)
fig.suptitle('Trip Location Distribution of Train and Test Data')
ax[0].legend(loc=0)
ax[0].set_ylabel('latitude')
ax[0].set_xlabel('longitude')
ax[1].set_xlabel('longitude')
ax[1].legend(loc=0)
plt.xlim(long_lim)
plt.ylim(lat_lim)
plt.show()

The train and test data appear to overlap closely in both their time and geolocation distributions.
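
The visual impression can be backed by a number. One simple option, sketched here under the assumption that a histogram overlap coefficient is an acceptable summary, is to bin both samples on a common grid and sum the smaller of the two normalised bin heights (1.0 means identical histograms). The helper name and the synthetic longitudes are hypothetical.

```python
import numpy as np

def histogram_overlap(a, b, bins=50):
    """Sum of the bin-wise minimum of two normalised histograms on a shared grid."""
    lo, hi = min(a.min(), b.min()), max(a.max(), b.max())
    ha, _ = np.histogram(a, bins=bins, range=(lo, hi))
    hb, _ = np.histogram(b, bins=bins, range=(lo, hi))
    return np.minimum(ha / ha.sum(), hb / hb.sum()).sum()

# Synthetic longitudes standing in for the train and test pickup_longitude columns
rng = np.random.default_rng(0)
train_lon = rng.normal(-73.97, 0.04, 10_000)
test_lon = rng.normal(-73.97, 0.04, 4_000)
shifted_lon = test_lon + 0.2   # deliberately displaced sample for contrast

same = histogram_overlap(train_lon, test_lon)
shifted = histogram_overlap(train_lon, shifted_lon)
print(same, shifted)
```

On the real data the same function could be applied to each coordinate column of `df_train` and `df_test`.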

# MODELING

In [None]:
df_train.columns

In [None]:
np.setdiff1d(df_train.columns, df_test.columns)

In [None]:
# Columns left out of the model: target, identifiers, raw datetimes, train-only columns and distance_haversine
excluded = ['id', 'pickup_datetime', 'dropoff_datetime', 'pickup_date',
            'trip_duration', 'log_trip_duration', 'avg_speed_h', 'distance_haversine']
features = [column for column in df_train.columns if column not in excluded]

In [None]:
features

In [None]:
len(features)

In [None]:
df_train_filter.head()

In [None]:
df_train_filter.isnull().sum()

In [None]:
df_train_filter = df_train.dropna()
X = df_train_filter[features].values
y = np.log(df_train_filter['trip_duration'].values + 1)
Xtr, Xv, ytr, yv = train_test_split(X, y, test_size=0.2, random_state=7)
Xtst = df_test[features].values

In [None]:
df_train_filter.head()

### decision tree

In [None]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import GridSearchCV

In [None]:
param_grid = {'max_depth': np.arange(3, 10)}
tree = GridSearchCV(DecisionTreeRegressor(), param_grid)
tree.fit(Xtr, ytr)

In [None]:
tree.best_params_

In [None]:
y_pred = tree.predict(Xv)
# yv is log-transformed, so RMSE on this scale is the RMSLE
rmsle = np.sqrt(np.mean(np.square(y_pred - yv)))
print("rmsle of decision tree is: %.3f"%rmsle)

In [None]:
# submission
ytest = tree.predict(Xtst)
df_test['trip_duration'] = np.exp(ytest) - 1
df_test[['id', 'trip_duration']].to_csv('dt_submission.csv.gz', index=False, compression='gzip')

### Gradient Boosting tree

In [None]:
dtrain = xgb.DMatrix(Xtr, label=ytr)
dvalid = xgb.DMatrix(Xv, label=yv)
dtest = xgb.DMatrix(Xtst)
watchlist = [(dtrain, 'train'), (dvalid, 'valid')]

# 'verbosity' and 'reg:squarederror' replace the deprecated 'silent' and 'reg:linear'
xgb_pars = {'min_child_weight': 50, 'eta': 0.3, 'colsample_bytree': 0.3, 'max_depth': 9,
            'subsample': 0.8, 'lambda': 1., 'nthread': -1, 'booster' : 'gbtree', 'verbosity': 0,
            'eval_metric': 'rmse', 'objective': 'reg:squarederror'}

In [None]:
model = xgb.train(xgb_pars, dtrain, num_boost_round=60, evals=watchlist, early_stopping_rounds=50,
                  maximize=False, verbose_eval=10)

In [None]:
y_pred = model.predict(dvalid)
# yv is log-transformed, so RMSE on this scale is the RMSLE
rmsle = np.sqrt(np.mean(np.square(y_pred - yv)))
print("rmsle of gbt is: %.3f"%rmsle)

In [None]:
# for submission
ytest = model.predict(dtest)
df_test['trip_duration'] = np.exp(ytest) - 1
df_test[['id', 'trip_duration']].to_csv('xgb_submission.csv.gz', index=False, compression='gzip')

In [None]:
fs = ['f%i' % i for i in range(len(features))]
name = dict(zip(fs, features))

feature_importance_dict = model.get_fscore()
f_i = pd.DataFrame({'feature': list(feature_importance_dict.keys()), \
                    'importance': list(feature_importance_dict.values())})
f_i["feature"] = f_i["feature"].apply(lambda x: name[x])

In [None]:
f_i = f_i.sort_values("importance")
f_i.set_index("feature").plot(kind='barh')

By split count (the F score plotted above), dropoff_latitude is the most important feature for trip duration.