# Predictions of NYC Taxi Trip Duration

### Table of contents 
* [Data loading](#Data-loading)
* [Data exploration](#Data-exploration)
* [Data preprocessing](#Data-preprocessing)
* [Features engineering](#Features-engineering)
* [Modeling](#Modeling)
* [Predictions](#Predictions)
* [Submission](#Submission)

# Data loading

In [None]:
import os
print(os.listdir("../input"))

import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
%matplotlib inline

from math import radians, cos, sin, asin, sqrt
from datetime import datetime

In [None]:
fp1 = os.path.join("..", "input", "train.csv")
fp2 = os.path.join("..", "input", "test.csv")

### Train preview

In [None]:
train = pd.read_csv(fp1, index_col=0)
train.head() 

We have :
* **id** : a unique identifier for each trip
* **vendor_id** : a code indicating the provider associated with the trip record
* **pickup_datetime** : date and time when the meter was engaged
* **dropoff_datetime** : date and time when the meter was disengaged
* **passenger_count** : the number of passengers in the vehicle (driver entered value)
* **pickup_longitude** : the longitude where the meter was engaged
* **pickup_latitude** : the latitude where the meter was engaged
* **dropoff_longitude** : the longitude where the meter was disengaged
* **dropoff_latitude** : the latitude where the meter was disengaged
* **store_and_fwd_flag** : this flag indicates whether the trip record was held in vehicle memory before sending to the vendor because the vehicle did not have a connection to the server - Y=store and forward; N=not a store and forward trip
* **trip_duration** : duration of the trip in seconds
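
Since "trip_duration" is given in seconds and both timestamps are available in the train set, the two can be cross-checked. The snippet below is a minimal sketch on a hypothetical two-row sample (the values are illustrative assumptions, not read from the file); it compares the elapsed time between pickup and dropoff with the recorded duration.

```python
import pandas as pd

# Hypothetical sample mimicking the train columns described above.
sample = pd.DataFrame({
    "pickup_datetime": ["2016-03-14 17:24:55", "2016-06-12 00:43:35"],
    "dropoff_datetime": ["2016-03-14 17:32:30", "2016-06-12 00:54:38"],
    "trip_duration": [455, 663],
})

pickup = pd.to_datetime(sample["pickup_datetime"])
dropoff = pd.to_datetime(sample["dropoff_datetime"])

# Elapsed time between meter on and meter off, in seconds.
elapsed_seconds = (dropoff - pickup).dt.total_seconds()

# Flag rows where the recorded duration differs by more than one second.
mismatch = (elapsed_seconds - sample["trip_duration"]).abs() > 1
print("rows with inconsistent duration:", int(mismatch.sum()))
```

The same check could be run on the full `train` frame once it is loaded; rows flagged here would be candidates for closer inspection.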

In [None]:
train.shape

In [None]:
train.dtypes

In [None]:
train.describe()

### Test preview

In [None]:
test = pd.read_csv(fp2, index_col=0)
test.head()

In [None]:
test.dtypes

In [None]:
test.shape

# Data exploration

#### Let's look at the distribution of every numeric column with histograms :

In [None]:
train.hist(bins=50, figsize=(20,15))
plt.show()

#### We are going to focus on "trip_duration" and look at its values below 5000 seconds :

In [None]:
train.loc[train['trip_duration'] < 5000, 'trip_duration'].hist();

plt.title('trip_duration')
plt.show()

#### We can apply a log transformation (log1p) to "trip_duration" to reduce its skew :

In [None]:
np.log1p(train['trip_duration']).hist();
plt.title('log_trip_duration')
plt.show()
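
To make the transformation concrete, here is a small sketch of how `np.log1p` and its inverse `np.expm1` behave. The duration values are hypothetical and only serve as an illustration.

```python
import numpy as np

# Hypothetical durations in seconds, including the edge case 0.
durations = np.array([0.0, 60.0, 600.0, 3600.0])

# log1p(x) = log(1 + x): defined at x = 0, unlike a plain log.
log_durations = np.log1p(durations)

# expm1(x) = exp(x) - 1 is the inverse, so the original scale is recovered.
recovered = np.expm1(log_durations)

print(np.allclose(recovered, durations))
```

Because the two functions are inverses, a model trained on the log-transformed target can have its predictions mapped back to seconds with `np.expm1`.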

# Data preprocessing

### Cleaning

In [None]:
plt.subplots(figsize=(15,5))
train.boxplot(); 

#### • Trip_duration

In [None]:
train = train[(train.trip_duration < 5000)]

Most "trip_duration" values lie between 0 and 5000 seconds, so we keep only those rows.

#### • Pickup

In [None]:
train.plot(kind='scatter', x='pickup_longitude', y='pickup_latitude', alpha=0.1);

In [None]:
train = train.loc[(train['pickup_longitude'] > -75) & (train['pickup_longitude'] < -73)]
train = train.loc[(train['pickup_latitude'] > 40) & (train['pickup_latitude'] < 41)]

Most pickup points lie within longitude ∈ [-75;-73] and latitude ∈ [40;41], so we keep only those rows.

#### • Dropoff

In [None]:
train.plot(kind='scatter', x='dropoff_longitude', y='dropoff_latitude', alpha=0.1);

In [None]:
train = train.loc[(train['dropoff_longitude'] > -75) & (train['dropoff_longitude'] < -73)]
train = train.loc[(train['dropoff_latitude'] > 40.5) & (train['dropoff_latitude'] < 41.5)]

Most dropoff points lie within longitude ∈ [-75;-73] and latitude ∈ [40.5;41.5], so we keep only those rows.
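
The pickup and dropoff filters follow the same pattern, so they could be factored into one helper. The function below is a sketch only: `in_bounding_box` is a hypothetical name, and the sample coordinates are made up for illustration.

```python
import pandas as pd

def in_bounding_box(df, lon_col, lat_col, lon_range, lat_range):
    """Return a boolean mask of rows strictly inside the given box."""
    lon_min, lon_max = lon_range
    lat_min, lat_max = lat_range
    return (
        (df[lon_col] > lon_min) & (df[lon_col] < lon_max)
        & (df[lat_col] > lat_min) & (df[lat_col] < lat_max)
    )

# Hypothetical points: one inside the box, one far away, one too far north.
points = pd.DataFrame({
    "pickup_longitude": [-73.98, -121.93, -73.95],
    "pickup_latitude": [40.75, 37.39, 41.70],
})

mask = in_bounding_box(points, "pickup_longitude", "pickup_latitude",
                       lon_range=(-75, -73), lat_range=(40, 41))
filtered = points[mask]
print(len(filtered), "of", len(points), "rows kept")
```

Building the mask first also makes it easy to count how many rows a filter would remove before actually dropping them.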

#### • Passenger_count

In [None]:
train['passenger_count'].hist(bins=100, log=True, figsize=(10,5));
plt.title('passenger_count')
plt.show()

In [None]:
train = train.loc[(train['passenger_count'] >= 0) & (train['passenger_count'] <= 6)]

### Missing values

In [None]:
train.isnull().sum()

### Duplicated values 

In [None]:
train.duplicated().sum()

In [None]:
train = train.drop_duplicates()
train.duplicated().sum()

### Categorical features

In [None]:
train.dtypes

As we can see, "pickup_datetime", "dropoff_datetime" and "store_and_fwd_flag" are object types. We drop "store_and_fwd_flag" here because we won't use it. The two datetime columns are converted from str to datetime later, in the features engineering section.

In [None]:
train.drop(["store_and_fwd_flag"], axis=1, inplace=True)
test.drop(["store_and_fwd_flag"], axis=1, inplace=True)

In [None]:
train.shape, test.shape

# Features engineering

### Features creation

#### Shortcuts :

In [None]:
# "plat" rather than "plt", so that matplotlib's plt is not overwritten
plg, plat = 'pickup_longitude', 'pickup_latitude'
dlg, dlt = 'dropoff_longitude', 'dropoff_latitude'
pdt, ddt = 'pickup_datetime', 'dropoff_datetime'

#### We create a function to calculate the distance from pickup to dropoff :

##### Source: Stack Overflow

In [None]:
def haversine(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance in kilometers between two points 
    on the earth (specified in decimal degrees)
    """
    # convert decimal degrees to radians 
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    # haversine formula 
    dlon = lon2 - lon1 
    dlat = lat2 - lat1 
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a)) 
    # Radius of earth in kilometers is 6371
    km = 6371 * c
    return km

def euclidian_distance(x):
    """
    Row-wise wrapper around haversine. Despite its name, it returns the
    great circle distance in km, not a Euclidean distance.
    """
    x1, y1 = np.float64(x[plg]), np.float64(x[plat])
    x2, y2 = np.float64(x[dlg]), np.float64(x[dlt])    
    return haversine(x1, y1, x2, y2)

In [None]:
%%time
train['distance'] = train[[plg, plat, dlg, dlt]].apply(euclidian_distance, axis=1)

In [None]:
%%time
test['distance'] = test[[plg, plat, dlg, dlt]].apply(euclidian_distance, axis=1)
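
Applying a Python function row by row can be slow on a large frame. As an alternative, the same haversine formula can be written with NumPy so that it operates on whole columns at once. This is a sketch under assumptions: `haversine_np` is a hypothetical name and the coordinates are illustrative.

```python
import numpy as np

def haversine_np(lon1, lat1, lon2, lat2):
    """Great circle distance in km, computed element-wise on arrays."""
    lon1, lat1, lon2, lat2 = map(np.radians, (lon1, lat1, lon2, lat2))
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    # 6371 km is the mean Earth radius, as in the scalar version.
    return 6371 * 2 * np.arcsin(np.sqrt(a))

# Hypothetical coordinates: a zero-length trip and a short trip.
lon1 = np.array([-73.98, -73.98])
lat1 = np.array([40.75, 40.75])
lon2 = np.array([-73.98, -73.96])
lat2 = np.array([40.75, 40.77])

km = haversine_np(lon1, lat1, lon2, lat2)
print(km)
```

On the real data, the call would take the four coordinate columns directly, for example `haversine_np(train['pickup_longitude'], train['pickup_latitude'], train['dropoff_longitude'], train['dropoff_latitude'])`.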

#### We convert string to datetime :

In [None]:
train[pdt] = train[pdt].apply(lambda x : datetime.strptime(x, "%Y-%m-%d %H:%M:%S"))
train[ddt] = train[ddt].apply(lambda x : datetime.strptime(x, "%Y-%m-%d %H:%M:%S"))

In [None]:
test[pdt] = test[pdt].apply(lambda x : datetime.strptime(x, "%Y-%m-%d %H:%M:%S"))
# The test dataset has no "dropoff_datetime" column

#### We create columns from "pickup_datetime" :

In [None]:
train['month'] = train[pdt].apply(lambda x : x.month)
train['week_day'] = train[pdt].apply(lambda x : x.weekday())
train['day_month'] = train[pdt].apply(lambda x : x.day)
train['pickup_time_minutes'] = train[pdt].apply(lambda x : x.hour * 60.0 + x.minute)

In [None]:
test['month'] = test[pdt].apply(lambda x : x.month)
test['week_day'] = test[pdt].apply(lambda x : x.weekday())
test['day_month'] = test[pdt].apply(lambda x : x.day)
test['pickup_time_minutes'] = test[pdt].apply(lambda x : x.hour * 60.0 + x.minute)
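
The same columns can also be derived without `apply`, using `pd.to_datetime` and the `.dt` accessor. The sketch below runs on a hypothetical two-row sample and keeps the column names used above.

```python
import pandas as pd

# Hypothetical sample with the same timestamp format as the dataset.
sample = pd.DataFrame({
    "pickup_datetime": ["2016-03-14 17:24:55", "2016-06-12 00:43:35"],
})

# Parse the strings once, then read the components through .dt
pickup = pd.to_datetime(sample["pickup_datetime"], format="%Y-%m-%d %H:%M:%S")

sample["month"] = pickup.dt.month
sample["week_day"] = pickup.dt.weekday          # Monday = 0, Sunday = 6
sample["day_month"] = pickup.dt.day
sample["pickup_time_minutes"] = pickup.dt.hour * 60.0 + pickup.dt.minute

print(sample)
```

Wrapping these lines in a function that takes a DataFrame would avoid repeating the block for train and test.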

Let's have a look at our datasets with the new columns we've created :

In [None]:
train.head()

In [None]:
test.head()

In [None]:
train.shape, test.shape

### Features selection

In [None]:
features_train = ["vendor_id", "passenger_count", "pickup_longitude", "pickup_latitude", "dropoff_longitude", "dropoff_latitude", "distance", "month", "week_day", "day_month", "pickup_time_minutes"]
X_train = train[features_train]
y_train = np.log1p(train["trip_duration"])

features_test = ["vendor_id", "passenger_count", "pickup_longitude", "pickup_latitude", "dropoff_longitude", "dropoff_latitude", "distance", "month", "week_day", "day_month", "pickup_time_minutes"]
X_test = test[features_test]

In [None]:
#Last check
#X_train.dtypes
#X_test.dtypes

# Modeling

I chose **Random Forest**. This algorithm handles tabular data well, whether the features are numerical or categorical with fewer than a few hundred categories. Unlike linear models, random forests can capture non-linear relationships between the features and the target.
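
To illustrate the point about non-linearity, here is a small sketch on synthetic data (not the taxi dataset): the target is a quadratic function of a single feature, which a straight line cannot fit but a forest can approximate. The data, the split and the settings are all assumptions chosen for the demonstration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

# Synthetic data: y depends on x squared, plus a little noise.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 1))
y = X[:, 0] ** 2 + rng.normal(0, 0.1, size=500)

# Simple split: first 400 rows to fit, last 100 to validate.
X_fit, X_val = X[:400], X[400:]
y_fit, y_val = y[:400], y[400:]

linear = LinearRegression().fit(X_fit, y_fit)
forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_fit, y_fit)

# score() returns R² on the held-out rows.
linear_r2 = linear.score(X_val, y_val)
forest_r2 = forest.score(X_val, y_val)
print("linear R²:", round(linear_r2, 3), "| forest R²:", round(forest_r2, 3))
```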

In [None]:
from sklearn.ensemble import RandomForestRegressor 
#from sklearn.model_selection import GridSearchCV

GridSearchCV can tune the model's hyperparameters: instead of sampling randomly from a distribution, it evaluates every combination in the grid we define. The grid search cells below are left commented out, and the final model uses hand-picked values.

Hyperparameters I will use : 
* **n_estimators** : the number of trees in the forest
* **min_samples_leaf** : the minimum number of samples required to be at a leaf node
* **max_features** : the number of features to consider when looking for the best split
* **max_depth** : the maximum depth of the tree
* **bootstrap** : whether bootstrap samples are used when building trees
* **n_jobs** : the number of jobs to run in parallel for both fit and predict, "-1" means using all processors
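
As an illustration of how such a search is wired together, here is a minimal sketch on small synthetic data. The grid values, `cv=3` and the scoring choice are assumptions for the example, not the settings used in this notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Small synthetic regression problem so the search runs quickly.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 3))
y = 2 * X[:, 0] + np.sin(6 * X[:, 1]) + rng.normal(0, 0.05, size=200)

# Every combination is evaluated: 2 x 2 x 2 = 8 candidates.
param_grid = {
    "n_estimators": [10, 20],
    "min_samples_leaf": [2, 4],
    "max_features": [0.5, 1.0],
}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=3,                               # 3-fold cross-validation
    scoring="neg_mean_squared_error",   # higher (closer to 0) is better
)
search.fit(X, y)

n_candidates = len(search.cv_results_["params"])
print(n_candidates, "candidates evaluated")
print("best parameters:", search.best_params_)
```

The cost grows with the product of the grid sizes times the number of folds, which is worth keeping in mind on a dataset of this size.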

In [None]:
#param_grid_rf = {'n_estimators' : [10, 20, 100],
                 #'min_samples_leaf' : [2, 4, 6],
                 #'max_features' : [0.2, 0.5, 1.0],  # 1.0 = all features ('auto' was removed in scikit-learn 1.3)
                 #'max_depth' : [50, 80, 100]}
#rf = RandomForestRegressor()
#grid_search_rf = GridSearchCV(RandomForestRegressor(), param_grid_rf)
#grid_search_rf.fit(X_train, y_train)

In [None]:
#print("Final score : ", round(grid_search_rf.score(X_train, y_train)*100, 4), " %")
#print("Best parameters : ", grid_search_rf.best_params_)
#print("Best configuration : ", grid_search_rf.best_estimator_)

In [None]:
#rf = RandomForestRegressor(n_estimators=100, min_samples_leaf=10, min_samples_split=15, max_depth=100, bootstrap=True, n_jobs=-1)
#rf = RandomForestRegressor(n_estimators=100, min_samples_leaf=2, max_features=0.7, max_depth=100, bootstrap=True, n_jobs=-1)
# max_features=1.0 (all features) replaces 'auto', which was removed in scikit-learn 1.3
rf = RandomForestRegressor(n_estimators=100, min_samples_leaf=2, max_features=1.0, max_depth=50, bootstrap=True, n_jobs=-1)

In [None]:
rf.fit(X_train, y_train)

# Predictions

In [None]:
#y_pred = grid_search_rf.predict(X_test)

In [None]:
log_pred = rf.predict(X_test)
y_pred = np.expm1(log_pred)  # inverse of the np.log1p applied to the target

# Submission

In [None]:
my_submission = pd.DataFrame({'id': test.index, 'trip_duration': y_pred})
my_submission.head()

In [None]:
my_submission.to_csv("submission.csv", index=False)