# New York City Taxi Fare Prediction

## Can we predict a rider's taxi fare?

## Import data

In [None]:
# load some default Python modules
%matplotlib inline

import time

from sklearn.metrics import mean_squared_error
import plotly.offline as py
import plotly.graph_objs as go
py.init_notebook_mode()

from math import radians, cos, sin, asin, sqrt
import re
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
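try:
    from prophet import Prophet      # package name since Prophet 1.0
except ImportError:
    from fbprophet import Prophet    # older releases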
plt.style.use('seaborn-whitegrid')
import warnings
warnings.filterwarnings('ignore')

We start by importing the data. The original file train.csv contains more than 55 million rows. Because we use a Kaggle kernel, we load only the first 2 million rows.

In [None]:
train = pd.read_csv("../input/train.csv", nrows = 2_000_000)

print("shape of train data", train.shape)
train.head()


We check the data type of each feature.

In [None]:
# datatypes
train.dtypes

We look more closely at the data.

In [None]:
train.describe()

## Data cleaning

We can see that there are some outliers in the dataset.

For example:
<p> 
    <ul>
        <li> The minimum of fare amount is negative and the maximum is more than 60,000 USD</li>
        <li> The maximum of passenger count is 208 and minimum is 0 </li>
        <li> Some latitude and longitude are very high </li>
    </ul>
</p>

In New York City, the minimum taxi fare is 2.5 USD. We remove rows where fare_amount is less than 2.5 USD.

In [None]:
# Checking for valid fare amount
print('Old size: %d' % len(train))
train = train.drop(train[train['fare_amount']<2.5].index, axis=0)
print('New size after dropping invalid fare amount: %d' % len(train))

We check whether there is any missing data.

In [None]:
# check missing data
train.isnull().sum()

We remove the rows that contain missing values.

In [None]:
print("old size: %d" % len(train))
train = train.dropna(how='any', axis=0)
print("New size after dropping missing value: %d" % len(train))

Now we look more closely at the passenger count.

In [None]:
# checking the passenger count
train.passenger_count.hist(bins=10, figsize = (16,8))
plt.xlabel("Passenger Count")
plt.ylabel("Frequency")

It seems that some rides are recorded with more than 200 passengers.

In [None]:
# checking passenger counts greater than 7
train[train.passenger_count >7].passenger_count.hist(bins=10, figsize = (16,8))
plt.xlabel("Passenger Count")
plt.ylabel("Frequency")

The maximum capacity of a taxi is 7, so we remove rides with more than 7 passengers, as well as rides with no passenger at all.

In [None]:
print('Old size: %d' % len(train))
train = train.drop(train[train['passenger_count']>7].index, axis = 0)
train = train.drop(train[train['passenger_count']<1].index, axis = 0)
print('New size: %d' % len(train))

Now we look for outliers in the taxi fare.

In [None]:
# checking for taxi fare
train.fare_amount.hist(bins=10, figsize = (16,8))
plt.xlabel("Taxi Fare")
plt.ylabel("Frequency")

In [None]:
# checking for taxi fare more than 250 USD
train[train.fare_amount >250].fare_amount.hist(bins=10, figsize = (16,8))
plt.xlabel("Taxi Fare")
plt.ylabel("Frequency")

In [None]:
print('Old size: %d' % len(train))
train = train.drop(train[train['fare_amount']>250].index, axis = 0)
print('New size: %d' % len(train))

In [None]:
# Let's see the distribution of fare amounts below 100 USD
train[train.fare_amount <100 ].fare_amount.hist(bins=100, figsize = (16,8))
plt.xlabel("Fare Amount")
plt.ylabel("Number of rides")

As we can see, the majority of taxi rides cost around 7 USD, which suggests that taxis are mostly used for short distances.

The bounding box around New York City is:
<p>
    <ul>
        <li>North Latitude: 40.917577</li>
        <li>South Latitude: 40.477399</li> 
        <li>East Longitude: -73.700272 </li>
        <li>West Longitude: -74.259090</li>
    </ul>       
</p>
We remove rides that start or end outside New York City:

In [None]:
print('Old size: %d' % len(train))
train = train[(train['pickup_longitude'] >= -74.259090) & (train['pickup_longitude'] <= -73.700272)]
train = train[(train['dropoff_longitude'] >= -74.259090) & (train['dropoff_longitude'] <= -73.700272)]
train = train[(train['pickup_latitude'] >= 40.477399) & (train['pickup_latitude'] <= 40.917577)]
train = train[(train['dropoff_latitude'] >= 40.477399) & (train['dropoff_latitude'] <= 40.917577)]
print('New size: %d' % len(train))
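The same bounding-box test can be sketched for a single point with plain Python. The function name `in_nyc` and the sample coordinates are hypothetical and only serve as an illustration; the four limits are the ones listed above.

```python
# Bounding box around New York City, as listed above
NORTH, SOUTH = 40.917577, 40.477399
EAST, WEST = -73.700272, -74.259090

def in_nyc(latitude, longitude):
    """Return True if the point lies inside the bounding box."""
    return SOUTH <= latitude <= NORTH and WEST <= longitude <= EAST

print(in_nyc(40.7580, -73.9855))  # approximate point in Manhattan
print(in_nyc(0.0, 0.0))           # a (0, 0) coordinate is far outside the box
```

A ride is kept only when both its pickup and its dropoff point pass this test, which is what the four filters above do column by column.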

We also remove rides where pickup and dropoff location are exactly the same:

In [None]:
print('Old size: %d' % len(train))
train = train[(train['pickup_longitude'] != train['dropoff_longitude']) | (train['pickup_latitude'] != train['dropoff_latitude'])]
print('New size: %d' % len(train))

Let's check whether the data looks better!

In [None]:
train.describe()

We now look at the test set:

In [None]:
test = pd.read_csv("../input/test.csv")
print("shape of test data", test.shape)
test.head()

In [None]:
#check for missing value
test.isnull().sum()

In [None]:
# checking for basic stats
test.describe()

It seems fine!

## Feature engineering

<p>
    <ol>
        <li>First, we add the ride distance in kilometers</li>
         <li>Second, we add time features</li>
     </ol>
 </p>
        

In [None]:
# Time features (hour, month, year, weekday), used later by the XGBoost model
def time_features(dataframe):
    dataframe['pickup_datetime'] = dataframe['pickup_datetime'].astype(str).str.slice(0, 16)
    dataframe['pickup_datetime'] = pd.to_datetime(dataframe['pickup_datetime'], utc=True, format='%Y-%m-%d %H:%M')
    dataframe['hour_of_day'] = dataframe.pickup_datetime.dt.hour
    dataframe['month'] = dataframe.pickup_datetime.dt.month
    dataframe["year"] = dataframe.pickup_datetime.dt.year
    dataframe["weekday"] = dataframe.pickup_datetime.dt.weekday    
    return dataframe
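To make the output of `time_features` concrete, here is a minimal standard-library sketch of the same extraction for a single timestamp. The helper name `time_features_of` and the sample string are hypothetical; the slicing to 16 characters and the format string mirror the function above.

```python
from datetime import datetime

def time_features_of(pickup_datetime: str) -> dict:
    """Extract the same four features as time_features, for one timestamp string."""
    # Keep only 'YYYY-MM-DD HH:MM', as the DataFrame version does
    parsed = datetime.strptime(pickup_datetime[:16], "%Y-%m-%d %H:%M")
    return {
        "hour_of_day": parsed.hour,
        "month": parsed.month,
        "year": parsed.year,
        "weekday": parsed.weekday(),  # Monday = 0, Sunday = 6 (same convention as pandas)
    }

# Hypothetical sample value, assuming a 'YYYY-MM-DD HH:MM:SS UTC' layout
features = time_features_of("2009-06-15 17:26:21 UTC")
print(features)
```

Seconds and the timezone suffix are dropped by the slice, so two rides in the same minute get identical time features.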

In [None]:
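# Distance between two latitude/longitude points using the haversine formula.
# Returns the distance in kilometers (12742 km = 2 * Earth radius), so the
# 'distance_miles' columns created below actually hold kilometers.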
def distance(lat1, lon1, lat2, lon2):
    p = 0.017453292519943295 # Pi/180
    a = 0.5 - np.cos((lat2 - lat1) * p)/2 + np.cos(lat1 * p) * np.cos(lat2 * p) * (1 - np.cos((lon2 - lon1) * p)) / 2
    return 12742 * np.arcsin(np.sqrt(a))   # 2*R*asin...
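As a sanity check, the compact expression above can be compared with the textbook form of the haversine formula. The sketch below is a scalar, standard-library version; the function names and the two sample points (approximate coordinates for midtown Manhattan and JFK airport) are assumptions made for illustration.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_DIAMETER_KM = 12742  # 2 * mean Earth radius, same constant as above

def distance_compact(lat1, lon1, lat2, lon2):
    """Scalar version of the compact formula used in the notebook."""
    p = 0.017453292519943295  # pi / 180
    a = (0.5 - cos((lat2 - lat1) * p) / 2
         + cos(lat1 * p) * cos(lat2 * p) * (1 - cos((lon2 - lon1) * p)) / 2)
    return EARTH_DIAMETER_KM * asin(sqrt(a))

def distance_textbook(lat1, lon1, lat2, lon2):
    """Haversine formula written with sin^2 of the half-angles."""
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return EARTH_DIAMETER_KM * asin(sqrt(a))

# Approximate, illustrative coordinates
midtown = (40.7580, -73.9855)
jfk = (40.6413, -73.7781)

d_compact = distance_compact(*midtown, *jfk)
d_textbook = distance_textbook(*midtown, *jfk)
one_degree = distance_compact(40.0, -74.0, 41.0, -74.0)  # one degree of latitude
print(d_compact, d_textbook, one_degree)
```

Both forms agree because $(1 - \cos\theta)/2 = \sin^2(\theta/2)$. One degree of latitude comes out at roughly 111 km, which confirms that the result is in kilometers rather than miles.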

In [None]:
train['distance_miles'] = distance(train.pickup_latitude, train.pickup_longitude, \
                                      train.dropoff_latitude, train.dropoff_longitude)

In [None]:
test['distance_miles'] = distance(test.pickup_latitude, test.pickup_longitude, \
                                      test.dropoff_latitude, test.dropoff_longitude)

In [None]:
print("Average $USD/Km : {:0.2f}".format(train.fare_amount.sum()/train.distance_miles.sum()))

In [None]:
# scatter plot distance - fare
plt.scatter(train.distance_miles, train.fare_amount, alpha=0.2)
plt.xlabel('distance (km)')
plt.ylabel('fare $USD')
plt.show()

The relation between the distance and the fare amount appears to be roughly linear.

## FbProphet model

Prophet is a procedure for forecasting time series data based on an additive model where non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects. It works best with time series that have strong seasonal effects and several seasons of historical data.
Prophet is open source software released by Facebook.
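
Written as a formula, the additive model described above has the following general form. The symbols follow the notation of the Prophet paper and are not used elsewhere in this notebook: $g(t)$ is the trend, $s(t)$ the seasonal component (yearly, weekly, daily), $h(t)$ the holiday effect and $\varepsilon_t$ the error term.

$$y(t) = g(t) + s(t) + h(t) + \varepsilon_t$$

Here $y(t)$ is the fare amount and $t$ the pickup time. No holiday calendar is passed to the model in this notebook, so the $h(t)$ term plays no role.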

Prophet is very resource-intensive, so we only use part of the training data set.
We take the first 1 million rows to fit Prophet.

In [None]:
prophet_df = train.iloc[:1000000]

The Prophet library requires the columns to be named 'ds' and 'y', so we rename the existing columns.

In [None]:
prophet_df = prophet_df.reset_index()[["pickup_datetime", "fare_amount"]]
prophet_df.columns = ["ds", "y"]

In [None]:
prophet_df.head()

We convert the 'ds' column to datetime and sort the rows chronologically.

In [None]:
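# Prophet expects timezone-naive timestamps, so the UTC marker is dropped after parsing
prophet_df['ds'] = pd.to_datetime(prophet_df['ds'], utc=True).dt.tz_localize(None)
prophet_df['y'] = pd.to_numeric(prophet_df['y'], errors='coerce')
# Sort the whole frame: assigning a sorted column back re-aligns on the index and reorders nothing
prophet_df = prophet_df.sort_values('ds').reset_index(drop=True)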
prophet_df.head()

### Split train/test

The train set is the first 80% of the values, and the test set is the last 20%.

In [None]:
df_train = prophet_df.iloc[:round(len(prophet_df)*0.8)]
df_test = prophet_df.iloc[round(len(prophet_df)*0.8):]

### Fitting the model

In [None]:
model = Prophet(changepoint_prior_scale=2.5, daily_seasonality=True)

start = time.time()
model.fit(df_train)
print("Fitting duration : {:.3f}s".format(time.time() - start) )

In [None]:
future_data = df_test.drop("y", axis=1)
start = time.time()
forecast_data = model.predict(future_data)
print("Predict duration : {:.3f}s".format(time.time() - start) )

In [None]:
forecast_data["y"] = df_test["y"].values
forecast_data[['ds', 'y', 'yhat', 'yhat_lower', 'yhat_upper']].tail()

Now let's compare the prediction and confidence to the real data (you can zoom, pan on the plot)

In [None]:
py.iplot([
    go.Scatter(x=df_test['ds'], y=df_test['y'], name='y'),
    go.Scatter(x=forecast_data['ds'], y=forecast_data['yhat'], name='yhat'),
    go.Scatter(x=forecast_data['ds'], y=forecast_data['yhat_upper'], fill='tonexty', mode='none', name='upper'),
    go.Scatter(x=forecast_data['ds'], y=forecast_data['yhat_lower'], fill='tonexty', mode='none', name='lower'),
    go.Scatter(x=forecast_data['ds'], y=forecast_data['trend'], name='Trend')
])

### Metric

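For the metric, I'll compute the MSE over consecutive windows of 48 observations to check how it evolves over the test period. Note that each row is one ride, not one hour, so a window does not correspond to a fixed time span.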

In [None]:
mse = []
# Positional slicing: .loc is end-inclusive and would give overlapping 49-row windows
for i in range(0, len(forecast_data), 48):
    window = forecast_data.iloc[i:i+48]
    mse.append(mean_squared_error(window["y"], window["yhat"]))

plt.figure(figsize=(20,12))
plt.plot(mse) # MSE per window of 48 rows
plt.title("Evolution of MSE over the test set")
plt.show()



We can see that the MSE is acceptable during the summer, but every winter shows noticeably more error. Weather may explain part of this. The error peaks are harder to predict because Prophet only sees the timestamp and has no information about the ride itself, such as its distance. The rest of the time we have a fairly good approximation.


In [None]:
model.plot_components(forecast_data)
plt.show()

The component plots give us four analyses.

  <ul>
    <li>A global trend that increases over the years. This is logical, as we predict over 6 years of data and taxi fares are known to increase slightly over the years.
    </li> 
    <li>A yearly seasonality showing that prices are fairly stable from May to December and then start to decrease, because it is the least touristic period.</li>
    <li>A weekly seasonality showing that taxi fares are higher at the weekend in particular.</li>
    <li>A daily seasonality showing that the highest fares occur around 4 a.m.</li>
  </ul>



In [None]:
# Calculate root mean squared error on the test period.
# forecast_data['y'] holds the true fares of df_test, aligned row by row with 'yhat'
print('RMSE: %f' % np.sqrt(np.mean((forecast_data['yhat'] - forecast_data['y'])**2)) )

This result is to be expected because Prophet only uses the timestamp. It would be interesting to also take the distance and the number of passengers into account.

## Train a linear model

Our model will take the form  $X⋅w=y$  where  $X$  is a matrix of input features, and  $y$ is a column of the target variable, fare_amount, for each row. The weight column  $w$  is what we will "learn".

First let's set up our input matrix $X$ and target column $y$ from our training set. The matrix $X$ consists of the distance feature, plus a second column of 1s to allow the model to learn a constant bias term.

In a way, the column of 1s is a hack to extend the model to support a bias term.

A simpler example is in 2D space, where $x∈ℝ$ is your "input" and $y∈ℝ$ is your "target". If you try to capture this relationship with a linear model of form $y=ax$, where $a∈ℝ$, your model could only be lines that pass through the origin (0,0) and you could not effectively capture most 2D relationships.

However if you extend the model to have a second variable $b$ -- sometimes called the bias term -- say $y=ax+b$, then your model can be (almost) any 2D line, and the model can now capture most 2D linear relationships.

Note that if we write $\vec{x}=\begin{pmatrix} x & 1 \end{pmatrix}$ and $\vec{w}=\begin{pmatrix} a\\b \end{pmatrix}$ then the following two models are equivalent:

$$y=ax+b$$
$$y=\vec{x} \cdot \vec{w} $$
So adding the column of 1s to our inputs $\vec{x}$ allows us to write the model in a more concise way (just $\vec{w}$ instead of $a$ and $b$), while still allowing the model (encoded by the $\vec{w}$ column) to learn the additional bias term. The column $y$ should consist of the target fare_amount values.
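
The bias-column trick can be checked on a tiny example. The sketch below solves the normal equations $(X^T X)\,w = X^T y$ by hand for $X = (x \;\; 1)$ using only plain Python. The function name `fit_with_bias_column` and the toy data are made up for illustration; this is not the fitting procedure used by scikit-learn.

```python
def fit_with_bias_column(xs, ys):
    """Least-squares fit of y = a*x + b via the normal equations for X = [x, 1]."""
    rows = [(x, 1.0) for x in xs]  # append the constant column of 1s
    # Entries of the 2x2 matrix X^T X and of the vector X^T y
    sxx = sum(r[0] * r[0] for r in rows)
    sx1 = sum(r[0] * r[1] for r in rows)
    s11 = sum(r[1] * r[1] for r in rows)
    sxy = sum(r[0] * y for r, y in zip(rows, ys))
    s1y = sum(r[1] * y for r, y in zip(rows, ys))
    # Invert the 2x2 system explicitly
    det = sxx * s11 - sx1 * sx1
    a = (sxy * s11 - sx1 * s1y) / det
    b = (sxx * s1y - sx1 * sxy) / det
    return a, b

# Toy data: fare = 2.0 * distance + 2.5 (values invented for the example)
distances = [1.0, 2.0, 3.0, 4.0]
fares = [4.5, 6.5, 8.5, 10.5]

a, b = fit_with_bias_column(distances, fares)
print(a, b)
```

The weight attached to the column of 1s recovers the constant term $b$, while the weight attached to the distance recovers the slope $a$.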

In [None]:
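# .copy() avoids pandas' SettingWithCopyWarning when the bias column is added
X = train[['distance_miles']].copy()
y = train[['fare_amount']]
X['default'] = 1  # constant column of 1s for the bias term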


We train the linear model with the sklearn library:

In [None]:
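from sklearn.linear_model import LinearRegression
# The 'default' column of 1s already provides the bias term, so no separate intercept is fitted.
# (The former normalize=True option no longer exists in recent scikit-learn releases.)
modelRegression = LinearRegression(fit_intercept=False)
modelRegression.fit(X,y)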


In [None]:
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import r2_score
predictions = modelRegression.predict(X)
# RMSE is the square root of the mean *squared* error; sklearn metrics take (y_true, y_pred)
print("RMSE : " + str(sqrt(mean_squared_error(y, predictions))))
print("R2 : " + str(r2_score(y, predictions)))

The RMSE measures the typical prediction error in USD: the lower it is, the better the line fits the data.

The R-squared score measures the share of the variance in fare amount that is explained by the regression line. The closer it is to 1, the better the fit. Here it is relatively close to 1, which confirms the good relation between distance and price.
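
To make the two indicators concrete, here is a minimal plain-Python sketch of their definitions. The function names and the four fares are invented for the example; in the notebook itself the scikit-learn functions are used.

```python
from math import sqrt

def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean of squared residuals."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Made-up fares (USD) and predictions, for illustration only
fares = [3.0, 5.0, 7.0, 9.0]
predicted = [2.5, 5.5, 7.0, 9.5]

print(rmse(fares, predicted))
print(r_squared(fares, predicted))
print(r_squared(predicted, fares))  # swapping the arguments changes the result
```

Two details matter in practice: the RMSE is the root of the mean squared error (not of the mean absolute error), and R-squared is not symmetric, so the true values must be passed first.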

Plot of our prediction:

In [None]:
plt.plot(X[['distance_miles']], predictions, 'r')
plt.scatter(train.distance_miles, train.fare_amount, alpha=0.2)
plt.xlabel('distance (km)')
plt.ylabel('fare $USD')
plt.show()

## XGBoost model

XGBoost is a well-known model in Kaggle competitions. It combines decision trees with boosting and can give good accuracy. We want to use this model to predict the price of a ride from our variables. First, we split the data into a training set and a testing set. X holds the features used for the prediction and y is what we want to predict (the fare amount).

In [None]:
train = time_features(train)


from sklearn.model_selection import train_test_split
y = train.fare_amount
X = train.drop(['fare_amount', 'key', 'pickup_datetime'], axis=1)
# DataFrame.as_matrix() was removed from pandas; .values returns the same NumPy array
train_X, test_X, train_y, test_y = train_test_split(X.values, y.values, test_size=0.2)





For the Kaggle competition:

In [None]:
#train_y = train.fare_amount
#train_X = train.drop(['fare_amount', 'key', 'pickup_datetime'], axis=1)
#sample_submission = pd.read_csv("../input/sample_submission.csv")
#test_X = test.drop(['key', 'pickup_datetime' ], axis=1)
#test_y = sample_submission.drop(['key'], axis=1)


Then we create two functions. The first function builds and fits the model with specific parameters:
* max_depth: the maximum depth of each tree;
* nb_estimators: the number of trees;
* learning_rate: the learning speed, i.e. the factor by which each tree's prediction is multiplied before the predictions are added together. It can reduce overfitting;
* early_stopping_rounds: the number of consecutive rounds without improvement of the validation error after which training stops. If the error reaches a minimum and then rises again, this parameter stops the training and goes back to the round with the minimum error.

The second function returns the error of the fitted model, computed as the mean absolute error on the test set. We want this value to be as small as possible.
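
The early-stopping rule described above can be sketched in a few lines of plain Python. This is a simplified illustration of the idea, not XGBoost's actual implementation; the function name `early_stopping_round`, the `patience` argument and the error values are all hypothetical.

```python
def early_stopping_round(validation_errors, patience):
    """Return (best_round, stopped_at) for a sequence of per-round validation errors.

    Training stops once `patience` consecutive rounds have passed without
    improving on the best error seen so far.
    """
    best_round, best_error = 0, float("inf")
    for round_idx, err in enumerate(validation_errors):
        if err < best_error:
            best_round, best_error = round_idx, err
        elif round_idx - best_round >= patience:
            return best_round, round_idx
    # The error kept improving often enough: all rounds were run
    return best_round, len(validation_errors) - 1

# Invented validation errors: the minimum is reached at round 3, then the error rises
errors = [5.0, 4.2, 3.9, 3.7, 3.8, 3.75, 3.9, 4.0, 4.1]
best_round, stopped_at = early_stopping_round(errors, patience=3)
print(best_round, stopped_at)
```

With a patience of 3, training stops three rounds after the best one, and the model from the best round is the one that is kept.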


In [None]:
from xgboost import XGBRegressor
from sklearn.metrics import mean_absolute_error
from math import sqrt

def XGBoost(train_X, test_X, train_y, test_y, max_depth, nb_estimators, learning_rate, early_stopping_rounds):
    # XGBRegressor's keyword is n_estimators; recent xgboost versions take
    # early_stopping_rounds in the constructor rather than in fit()
    model = XGBRegressor(max_depth = max_depth, n_estimators = nb_estimators, learning_rate = learning_rate,
                         early_stopping_rounds = early_stopping_rounds)
    model.fit(train_X, train_y, eval_set=[(test_X, test_y)], verbose=False)
    return model

def errorXGBoost(model, test_X, test_y):
    # Mean absolute error on the held-out set, returned as a string for printing
    predictions = model.predict(test_X)
    return str(mean_absolute_error(test_y, predictions))

Mean absolute error has been chosen because it gives a score in the same unit as the fare amount (USD).
Several tests have been done with different parameters. The maximum depth and the number of estimators are the most important parameters in XGBoost. They were changed many times to reach a good score while staying robust and avoiding overfitting.

In [None]:

model = XGBoost(train_X, test_X, train_y, test_y, 5, 500, 0.05, 5)
print(errorXGBoost(model, test_X, test_y))


Submission to the Kaggle competition:

In [None]:
test = pd.read_csv("../input/test.csv")
test_keys = test['key']  # keep the ids for the submission file
test['distance_miles'] = distance(test.pickup_latitude, test.pickup_longitude, \
                                      test.dropoff_latitude, test.dropoff_longitude)

test = time_features(test)
test = test.drop(['key', 'pickup_datetime' ], axis=1)

# DataFrame.as_matrix() was removed from pandas; .values returns the same NumPy array
prediction = model.predict(test.values)
holdout = pd.DataFrame({'key': test_keys, 'fare_amount': prediction})
holdout.to_csv('predictionTest.csv', index=False)


According to the Kaggle competition leaderboard, we have an RMSE of 6.05 USD.