<h1> New York Taxi Fare Prediction </h1>
<h2> Introduction </h2>
This project was about learning how to spot poor or incorrect data, and how to work with a large data set when I can only train on a subset of it.
I used the project below extensively for guidance and inspiration.

https://www.kaggle.com/btyuhas/bayesian-optimization-with-xgboost

The notebook linked below, a popular notebook for this competition, was invaluable for generating the haversine distance feature and the airport distance features.

https://www.kaggle.com/jsylas/python-version-of-top-ten-rank-r-22-m-2-88


In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score, train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.metrics import mean_absolute_error
from sklearn.linear_model import LinearRegression, Ridge
from sklearn import preprocessing
from sklearn.cluster import KMeans

from xgboost import XGBRegressor

import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session
train = pd.read_csv('/kaggle/input/new-york-city-taxi-fare-prediction/train.csv', nrows = 20000, parse_dates = ['pickup_datetime'])
test = pd.read_csv('/kaggle/input/new-york-city-taxi-fare-prediction/test.csv', parse_dates = ['pickup_datetime'])
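
Note that `nrows` reads only the first rows of the file. If a random subset is preferred, one option is the callable form of `skiprows` in `pd.read_csv`. The sketch below is only an illustration: it uses a small in-memory CSV as a stand-in for `train.csv`, and `sample_csv`, `frac` and `seed` are hypothetical names that do not appear in the notebook.

```python
import io

import numpy as np
import pandas as pd


def sample_csv(path_or_buffer, frac, seed=0, **read_kwargs):
    """Read roughly `frac` of the data rows of a CSV, chosen at random."""
    rng = np.random.default_rng(seed)
    # Row 0 is the header and is always kept; every other row is skipped
    # with probability 1 - frac.
    return pd.read_csv(
        path_or_buffer,
        skiprows=lambda i: i > 0 and rng.random() > frac,
        **read_kwargs,
    )


# Tiny in-memory stand-in for train.csv (1,000 rows, two columns).
csv_text = "fare_amount,passenger_count\n" + "\n".join(
    f"{5 + i % 20},{1 + i % 4}" for i in range(1000)
)
sample = sample_csv(io.StringIO(csv_text), frac=0.1)
print(len(sample), "rows sampled out of 1000")
```

The whole file is still scanned, so this is slower than `nrows`, but the sampled rows are spread across the file rather than taken from the top.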

<h2> Preprocessing the data </h2>

Initially I dropped the rows containing 0 values in columns which should not have any zeroes.

In [None]:
nonzerocols = ['pickup_latitude', 'pickup_longitude', 'dropoff_longitude', 'dropoff_latitude', 'passenger_count']
for col in nonzerocols:
    train.drop(train[train[col] == 0].index, inplace = True)
train.reset_index(inplace=True, drop = True)


Whilst looking at plots of the data, I discovered an interesting fact: some of the positional values seem to have been switched.

New York is at a latitude of about 40.7 degrees and a longitude of about -74 degrees.

Here a small minority of journeys appear to have their longitude and latitude switched! The histogram plots made it clear that this only affects a few entries (I found 3 such errors in the first 10,000 values of pickup_latitude).

It is possible to fix this in order to use all the data; however, given that we already have 5 GB of data, I will opt to drop the incorrect rows instead.
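
For reference, a minimal sketch of what such a fix could look like, assuming a row counts as "switched" when its latitude falls in the New York longitude band and its longitude falls in the New York latitude band. The helper `unswap_coordinates` and the demo values are hypothetical and are not used anywhere else in this notebook.

```python
import pandas as pd


def unswap_coordinates(df, lat_col, lon_col):
    """Swap latitude and longitude back where they appear to be exchanged."""
    df = df.copy()
    # A row looks swapped when the "latitude" lies in the New York longitude
    # band and the "longitude" lies in the New York latitude band.
    swapped = df[lat_col].between(-76, -72) & df[lon_col].between(38, 42)
    df.loc[swapped, [lat_col, lon_col]] = (
        df.loc[swapped, [lon_col, lat_col]].to_numpy()
    )
    return df


# Made-up rows: one correct, one switched, one simply wrong (latitude 0.34).
demo = pd.DataFrame({
    "pickup_latitude": [40.75, -73.98, 0.34],
    "pickup_longitude": [-73.99, 40.76, -73.95],
})
fixed = unswap_coordinates(demo, "pickup_latitude", "pickup_longitude")
print(fixed)
```

Rows that are wrong for other reasons, such as the latitude of 0.34, are left untouched by this helper and would still need to be dropped.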

In addition to the switching, some of these values are simply incorrect; for example, one latitude value is 0.34. Since New York is nowhere near zero latitude, I will drop latitude values outside of the range (38, 42) and longitude values outside of the range (-76, -72).

Fortunately the test data does have correct values for latitude and longitude.

I also dropped rows with missing values, negative fares, and fares above 500.

In [None]:
train.dropna(how='any', axis='rows', inplace=True)
train.drop(train[(train['pickup_latitude'] < 38) | (train['pickup_latitude'] > 42)].index, inplace = True)
train.drop(train[(train['dropoff_latitude'] < 38) | (train['dropoff_latitude'] > 42)].index, inplace = True)
train.drop(train[(train['pickup_longitude'] < -76) | (train['pickup_longitude'] > -72)].index, inplace = True)
train.drop(train[(train['dropoff_longitude'] < -76) | (train['dropoff_longitude'] > -72)].index, inplace = True)
train.drop(train[(train['fare_amount'] < 0) | (train['fare_amount'] > 500)].index, inplace = True)

train.reset_index(inplace=True, drop = True)
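
The same filtering can also be written as a single boolean mask, which avoids the repeated in-place drops. This is only a sketch on a made-up frame; `keep_plausible_rows` is a hypothetical helper name, and the bounds are the ones used in the cell above.

```python
import pandas as pd


def keep_plausible_rows(df):
    """Keep complete rows with New York coordinates and a fare in [0, 500]."""
    df = df.dropna(how="any")
    # Series.between is inclusive on both ends by default, which matches
    # dropping values strictly below or strictly above each bound.
    mask = (
        df["pickup_latitude"].between(38, 42)
        & df["dropoff_latitude"].between(38, 42)
        & df["pickup_longitude"].between(-76, -72)
        & df["dropoff_longitude"].between(-76, -72)
        & df["fare_amount"].between(0, 500)
    )
    return df[mask].reset_index(drop=True)


# Made-up rows: the second has a bad latitude, the third a negative fare.
demo = pd.DataFrame({
    "pickup_latitude": [40.75, 0.34, 40.70, 40.71],
    "dropoff_latitude": [40.76, 40.70, 40.72, 40.73],
    "pickup_longitude": [-73.99, -73.98, -73.97, -74.00],
    "dropoff_longitude": [-73.98, -73.97, -73.96, -73.99],
    "fare_amount": [9.5, 7.0, -3.0, 12.0],
})
clean = keep_plausible_rows(demo)
print(clean)
```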

Next I combined the training and test data to make preprocessing slightly easier.

In [None]:
train["is_train"] = 1
test["is_train"] = 0
data = pd.concat([train, test.drop(['key'], axis = 1)])

data["lat_diff_squared"] = np.power(data.dropoff_latitude.subtract(data.pickup_latitude),2)
data["long_diff_squared"] = np.power(data.dropoff_longitude.subtract(data.pickup_longitude),2)

data['day'] = data['pickup_datetime'].dt.day
data['month'] = data['pickup_datetime'].dt.month
data['year'] = data['pickup_datetime'].dt.year
data['hour'] = data['pickup_datetime'].dt.hour

data = pd.concat([data, pd.get_dummies(data.hour, prefix = 'hour')], axis = 1)
data = pd.concat([data, pd.get_dummies(data.day, prefix = 'day')], axis = 1)
data = pd.concat([data, pd.get_dummies(data.month, prefix = 'month')], axis = 1)

data = data.drop(['hour', 'day','month'], axis = 1)

data.drop(['key', 'pickup_datetime'], axis = 1, inplace = True)

As an example, the distribution of one of the coordinate features is shown below.

In [None]:
plt.hist(data.dropoff_longitude, range = (-74.1, -73.7), bins = 70)
plt.show()

<h2> Feature Engineering </h2>

Initially I used a K-means clustering algorithm on the geographical data to generate points of interest as features (see the appendix for the code).

Ultimately this proved unsuccessful, and so I instead opted to use the features given in [this](https://www.kaggle.com/jsylas/python-version-of-top-ten-rank-r-22-m-2-88) notebook, namely the haversine distance and airport distances.

In [None]:
def sphere_dist(pickup_lat, pickup_lon, dropoff_lat, dropoff_lon):
    """
    Return the great-circle distance (km) between pickup and dropoff coordinates.
    """
    #Define earth radius (km)
    R_earth = 6371
    #Convert degrees to radians
    pickup_lat, pickup_lon, dropoff_lat, dropoff_lon = map(np.radians,
                                                             [pickup_lat, pickup_lon, 
                                                              dropoff_lat, dropoff_lon])
    #Compute distances along lat, lon dimensions
    dlat = dropoff_lat - pickup_lat
    dlon = dropoff_lon - pickup_lon
    
    #Compute haversine distance
    a = np.sin(dlat/2.0)**2 + np.cos(pickup_lat) * np.cos(dropoff_lat) * np.sin(dlon/2.0)**2
    return 2 * R_earth * np.arcsin(np.sqrt(a))
    
data['haversine_dist'] = sphere_dist(data.pickup_latitude, data.pickup_longitude, data.dropoff_latitude, data.dropoff_longitude)
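
As a quick sanity check of the haversine function, the sketch below repeats the same formula and applies it to two of the airport coordinates used later in this notebook (JFK and LaGuardia). The result should come out at roughly 17 km; the variable names are only for illustration.

```python
import numpy as np


def sphere_dist(pickup_lat, pickup_lon, dropoff_lat, dropoff_lon):
    """Great-circle (haversine) distance in km between two points."""
    R_earth = 6371  # mean earth radius in km
    pickup_lat, pickup_lon, dropoff_lat, dropoff_lon = map(
        np.radians, [pickup_lat, pickup_lon, dropoff_lat, dropoff_lon]
    )
    dlat = dropoff_lat - pickup_lat
    dlon = dropoff_lon - pickup_lon
    a = (np.sin(dlat / 2.0) ** 2
         + np.cos(pickup_lat) * np.cos(dropoff_lat) * np.sin(dlon / 2.0) ** 2)
    return 2 * R_earth * np.arcsin(np.sqrt(a))


jfk = (40.639722, -73.778889)
lga = (40.77725, -73.872611)

d_km = sphere_dist(jfk[0], jfk[1], lga[0], lga[1])
d_back_km = sphere_dist(lga[0], lga[1], jfk[0], jfk[1])
d_zero_km = sphere_dist(jfk[0], jfk[1], jfk[0], jfk[1])
print(f"JFK to LGA: {d_km:.1f} km")
```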


In [None]:
"""
Compute the summed distance from the pickup and dropoff coordinates to each landmark.
JFK: John F. Kennedy International Airport
EWR: Newark Liberty International Airport
LGA: LaGuardia Airport
SOL: Statue of Liberty
NYC: New York City centre
"""
jfk_coord = (40.639722, -73.778889)
ewr_coord = (40.6925, -74.168611)
lga_coord = (40.77725, -73.872611)
sol_coord = (40.6892,-74.0445) # Statue of Liberty
nyc_coord = (40.7141667,-74.0063889) 


pickup_lat = data['pickup_latitude']
dropoff_lat = data['dropoff_latitude']
pickup_lon = data['pickup_longitude']
dropoff_lon = data['dropoff_longitude']

pickup_jfk = sphere_dist(pickup_lat, pickup_lon, jfk_coord[0], jfk_coord[1]) 
dropoff_jfk = sphere_dist(jfk_coord[0], jfk_coord[1], dropoff_lat, dropoff_lon) 
pickup_ewr = sphere_dist(pickup_lat, pickup_lon, ewr_coord[0], ewr_coord[1])
dropoff_ewr = sphere_dist(ewr_coord[0], ewr_coord[1], dropoff_lat, dropoff_lon) 
pickup_lga = sphere_dist(pickup_lat, pickup_lon, lga_coord[0], lga_coord[1]) 
dropoff_lga = sphere_dist(lga_coord[0], lga_coord[1], dropoff_lat, dropoff_lon)
pickup_sol = sphere_dist(pickup_lat, pickup_lon, sol_coord[0], sol_coord[1]) 
dropoff_sol = sphere_dist(sol_coord[0], sol_coord[1], dropoff_lat, dropoff_lon)
pickup_nyc = sphere_dist(pickup_lat, pickup_lon, nyc_coord[0], nyc_coord[1]) 
dropoff_nyc = sphere_dist(nyc_coord[0], nyc_coord[1], dropoff_lat, dropoff_lon)



data['jfk_dist'] = pickup_jfk + dropoff_jfk
data['ewr_dist'] = pickup_ewr + dropoff_ewr
data['lga_dist'] = pickup_lga + dropoff_lga
data['sol_dist'] = pickup_sol + dropoff_sol
data['nyc_dist'] = pickup_nyc + dropoff_nyc
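
The same landmark features can be built in a loop over a dictionary of coordinates. The sketch below is an alternative layout of the cell above, not a different method: it is self-contained, so the haversine function is repeated, and `LANDMARKS` and `add_landmark_distances` are hypothetical names. The demo trip is made up and runs from JFK to LaGuardia.

```python
import numpy as np
import pandas as pd


def sphere_dist(pickup_lat, pickup_lon, dropoff_lat, dropoff_lon):
    """Great-circle (haversine) distance in km between two points."""
    R_earth = 6371  # mean earth radius in km
    pickup_lat, pickup_lon, dropoff_lat, dropoff_lon = map(
        np.radians, [pickup_lat, pickup_lon, dropoff_lat, dropoff_lon]
    )
    dlat = dropoff_lat - pickup_lat
    dlon = dropoff_lon - pickup_lon
    a = (np.sin(dlat / 2.0) ** 2
         + np.cos(pickup_lat) * np.cos(dropoff_lat) * np.sin(dlon / 2.0) ** 2)
    return 2 * R_earth * np.arcsin(np.sqrt(a))


# Same coordinates as in the cell above.
LANDMARKS = {
    "jfk": (40.639722, -73.778889),
    "ewr": (40.6925, -74.168611),
    "lga": (40.77725, -73.872611),
    "sol": (40.6892, -74.0445),
    "nyc": (40.7141667, -74.0063889),
}


def add_landmark_distances(df):
    """Add '<name>_dist' = pickup-to-landmark + landmark-to-dropoff distance."""
    df = df.copy()
    for name, (lat, lon) in LANDMARKS.items():
        pickup = sphere_dist(df["pickup_latitude"], df["pickup_longitude"], lat, lon)
        dropoff = sphere_dist(lat, lon, df["dropoff_latitude"], df["dropoff_longitude"])
        df[f"{name}_dist"] = pickup + dropoff
    return df


# One made-up trip that starts at JFK and ends at LaGuardia.
trip = pd.DataFrame({
    "pickup_latitude": [40.639722], "pickup_longitude": [-73.778889],
    "dropoff_latitude": [40.77725], "dropoff_longitude": [-73.872611],
})
out = add_landmark_distances(trip)
print(out.filter(like="_dist"))
```

For this trip `jfk_dist` and `lga_dist` both reduce to the plain JFK to LaGuardia distance, because one of the two terms in each sum is zero.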


Now that the features are made I can split the data and begin modelling.

In [None]:
train = data.loc[data["is_train"] == 1]
test = data.loc[data["is_train"] == 0]
X = train.drop(['fare_amount', 'is_train'], axis = 1)
y = train.fare_amount
X_test = test.drop(['fare_amount','is_train'], axis = 1)

<h2> Modelling </h2>

I initially wanted to use an ensemble of tree methods; however, I decided instead to train a simple XGBoost regressor (`XGBRegressor`, xgbr for short) on 1.5m data points.

I attempted to optimise hyperparameters whilst only using 20000 data points; however, I found the results to be very variable, and the chosen parameters actually made the model perform worse when training on all 1.5m data points (see appendix for code). This is a problem I wish to learn how to solve in the future: how do you optimise hyperparameters under the constraint that you cannot test them on your full training data set?
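
One possible direction, which I have not tried in this project, is successive halving: many candidate settings are scored on a small sample, and only the best ones are re-scored on progressively larger samples. The sketch below uses scikit-learn's experimental `HalvingRandomSearchCV` on synthetic data, with `GradientBoostingRegressor` as a stand-in for `XGBRegressor` so that it runs without xgboost. The data, the parameter grid and the sample sizes are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.model_selection import HalvingRandomSearchCV

# Synthetic regression data standing in for the taxi features and fares.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(600, 4))
y_demo = (3.0 * np.abs(X_demo[:, 0]) + X_demo[:, 1]
          + rng.normal(scale=0.5, size=600))

# Hypothetical search space; every value is a plain list of options.
param_distributions = {
    "max_depth": [1, 2, 3, 4],
    "learning_rate": [0.05, 0.1, 0.3],
}

search = HalvingRandomSearchCV(
    GradientBoostingRegressor(n_estimators=30, random_state=0),
    param_distributions,
    n_candidates=6,          # candidates scored in the first round
    factor=3,                # keep about one third of them each round
    resource="n_samples",    # the budget that grows is the sample size
    min_resources=100,       # first round uses 100 rows per candidate
    cv=3,
    scoring="neg_mean_absolute_error",
    random_state=0,
)
search.fit(X_demo, y_demo)
print(search.best_params_)
```

Whether settings chosen this way would transfer to the full 1.5m rows is exactly the open question above, so this would still need checking against a held-out set.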

In [None]:
xgbr = XGBRegressor()
xgbr.fit(X, y)
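
To estimate the mean absolute error before submitting, the model can be scored on a hold-out split. The sketch below shows the idea on synthetic data, with `GradientBoostingRegressor` standing in for `XGBRegressor` so that it runs without xgboost. The fare formula is invented for the demo and is not the real New York tariff.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic data: fare grows linearly with distance, plus noise.
rng = np.random.default_rng(0)
distance_km = rng.uniform(0.5, 20.0, size=500)
fare = 2.5 + 1.6 * distance_km + rng.normal(scale=1.0, size=500)
X_demo = distance_km.reshape(-1, 1)

# Hold out 20% of the rows for validation.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, fare, test_size=0.2, random_state=0
)

model = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)
val_mae = mean_absolute_error(y_val, model.predict(X_val))

# Baseline for comparison: always predict the mean training fare.
baseline_mae = mean_absolute_error(y_val, np.full_like(y_val, y_tr.mean()))
print(f"model MAE: {val_mae:.2f}, baseline MAE: {baseline_mae:.2f}")
```

In the notebook, the same pattern would use `X`, `y` and `xgbr` in place of the synthetic arrays and the stand-in model.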

<h2> Conclusions </h2>

Interestingly, my best result came from using the features given by the squared difference between the dropoff and pickup coordinates. I am unsure how to make the best use of the other valuable features I have found.

The plot below shows that the feature importance is completely dominated by the haversine distance, and I have not yet learned how to prevent the tree from using this feature so heavily.

In future projects I would like to address this problem, as well as the hyperparameter question mentioned previously. My best score with all the features was a mean absolute error of $3.15, using default xgbr parameters and only 1m data points. With 1.5m data points the score got worse, suggesting an overfitting issue due to the lack of any regularisation from non-default hyperparameters.

This project should probably have been tackled with neural networks, and this is an approach I would enjoy attempting in future.

In [None]:
num_feat = 10
plt.xticks(range(1,len(X.columns[-num_feat:])+1), X.columns[-num_feat:], rotation=70, fontsize = 15)
plt.bar(x = range(1,len(X.columns[-num_feat:])+1), height = xgbr.feature_importances_.tolist()[-num_feat:])
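
The bar chart above shows the last `num_feat` columns of `X`, which are not necessarily the most important ones. To rank features by importance instead, the values can be put in a `pd.Series` and sorted. The sketch below uses made-up names and importances as stand-ins for `X.columns` and `xgbr.feature_importances_`.

```python
import numpy as np
import pandas as pd

# Stand-in values; in the notebook these would be X.columns and
# xgbr.feature_importances_.
feature_names = ["year", "haversine_dist", "jfk_dist", "lga_dist", "lat_diff_squared"]
importances = np.array([0.06, 0.68, 0.12, 0.05, 0.09])

# Sort from most to least important and keep the top three.
ranked = pd.Series(importances, index=feature_names).sort_values(ascending=False)
top = ranked.head(3)
print(top)
```

Calling `top.plot.bar()` would then draw the ranked chart with matplotlib.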

<h2> Appendix </h2>

Below is the code I initially used to generate 5 cluster points on the map.

In [None]:
# kmeans = KMeans(n_clusters = 5)
# train['cluster_label_pickup'] = kmeans.fit_predict(train[['pickup_latitude', 'pickup_longitude']])
# test['cluster_label_pickup'] = kmeans.predict(test[['pickup_latitude', 'pickup_longitude']])

# # plt.figure(figsize = (7,7))
# # plt.scatter(x = train['pickup_latitude'], y = train['pickup_longitude'], c=train.cluster_label_pickup, cmap='viridis')
# # plt.scatter(kmeans.cluster_centers_[:, 0],kmeans.cluster_centers_[:,1], s = 100, c = 'red')


# train = pd.concat([train, pd.get_dummies(train.cluster_label_pickup, prefix = 'clust_pick')], axis = 1)
# train = train.drop(['cluster_label_pickup'], axis = 1)

# test = pd.concat([test, pd.get_dummies(test.cluster_label_pickup, prefix = 'clust_pick')], axis = 1)
# test = test.drop(['cluster_label_pickup'], axis = 1)

# train['cluster_label_dropoff'] = kmeans.fit_predict(train[['dropoff_latitude', 'dropoff_longitude']])
# test['cluster_label_dropoff'] = kmeans.predict(test[['dropoff_latitude', 'dropoff_longitude']])

# train = pd.concat([train, pd.get_dummies(train.cluster_label_dropoff, prefix = 'clust_drop')], axis = 1)
# train = train.drop(['cluster_label_dropoff'], axis = 1)

# test = pd.concat([test, pd.get_dummies(test.cluster_label_dropoff, prefix = 'clust_drop')], axis = 1)
# test = test.drop(['cluster_label_dropoff'], axis = 1)


I used the code below to try to optimise the number of clusters; however, this did more harm than good and requires a more thoughtful approach.

In [None]:
# Tested both xgbr and rf: basically no difference in cross-val scores, maybe slightly better for about 6 clusters. However, the test data is less spread out, so 6 clusters is too many.
# for i in range(2,11):
#     kmeans = KMeans(n_clusters = i)
    
#     train['cluster_label_pickup'] = kmeans.fit_predict(train[['pickup_latitude', 'pickup_longitude']])

#     train = pd.concat([train, pd.get_dummies(train.cluster_label_pickup, prefix = 'clust_pick')], axis = 1)
#     train = train.drop(['cluster_label_pickup'], axis = 1)

#     train['cluster_label_dropoff'] = kmeans.fit_predict(train[['dropoff_latitude', 'dropoff_longitude']])
    

#     train = pd.concat([train, pd.get_dummies(train.cluster_label_dropoff, prefix = 'clust_drop')], axis = 1)
#     train = train.drop(['cluster_label_dropoff'], axis = 1)

    
#     print(cross_val_score(rf, train.drop(['fare_amount'], axis = 1), train.fare_amount, cv = 3, scoring = 'neg_mean_absolute_error').mean(), i)
    
#     #reset training data so we can remake columns with new kmeans cluster number
#     train = train.loc[:,:'long_diff_squared']

Below is the code for hyperparameter optimisation.

In [None]:
# parameter_grid_xgbr = {'max_depth':[1,2,3],'min_child_weight':[0.01],
#                         'gamma':[0, 0.1, 0.2]}
# best_xgbr = GridSearchCV(XGBRegressor(), param_grid = parameter_grid_xgbr,cv=3, verbose = 1,n_jobs=-1, scoring = 'neg_mean_absolute_error')
# best_xgbr.fit(X,y)
# print(best_xgbr.best_params_)

Prediction submission code is given below.

In [None]:
# preds = xgbr.predict(X_test)
# test_submission = pd.DataFrame({'key':pd.read_csv('/kaggle/input/new-york-city-taxi-fare-prediction/test.csv', parse_dates = ['pickup_datetime']).key, 'fare_amount':preds})
# test_submission.to_csv('submission27.csv', index = False)