# New York Taxi Fare 

## Executive Summary

Ever wondered what transportation will cost when planning your budget? This model aims to estimate taxi fares in New York, and it can also serve as a gauge of what a fare should be so that you are not overcharged. The data was cleaned, enriched through feature engineering, and then modelled. Specifically, we rely on an ANN model to predict **fare_amount**, because it has comparatively lower MSE on both the train and test sets and therefore exhibits the best bias-variance tradeoff among the models tried.

### Import libraries

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
import warnings
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from keras.models import Sequential
from keras.layers import Dense,LSTM, TimeDistributed, Flatten, MaxPooling1D,Conv1D,Dropout

from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression, Ridge, Lasso,ElasticNet,HuberRegressor,PassiveAggressiveRegressor,SGDRegressor
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeRegressor,ExtraTreeRegressor
from sklearn.svm import SVR
from sklearn.ensemble import AdaBoostRegressor,BaggingRegressor,RandomForestRegressor,ExtraTreesRegressor,GradientBoostingRegressor
from sklearn.model_selection import train_test_split,GridSearchCV,RandomizedSearchCV
from sklearn.preprocessing import StandardScaler
from statsmodels.tsa.arima.model import ARIMA
from math import radians, cos, sin, asin, sqrt
pd.set_option('display.float_format', lambda x: '%.3f' % x)

warnings.filterwarnings("ignore")
%matplotlib inline

### Data Loading

In [None]:
# loading train data
# will only be including 1,000,000 rows in this notebook due to size constraints
df = pd.read_csv('/kaggle/input/new-york-city-taxi-fare-prediction/train.csv',nrows = 1000000)

# loading test data
test_df = pd.read_csv('/kaggle/input/new-york-city-taxi-fare-prediction/test.csv')

# loading sample submissions
sample = pd.read_csv('/kaggle/input/new-york-city-taxi-fare-prediction/sample_submission.csv')

In [None]:
# number of rows and columns
print(f'Number of records: {df.shape[0]}')
print(f'Number of columns: {df.shape[1]}')

In [None]:
# Data types
df.info()

### Data dictionary

|Feature|Type|Dataset|Description|
|:---|:---|:---|:---|
|`key`|object|train/test|Unique ID field - pickup_datetime + unique integer|
|`fare_amount`|float|train|Cost of taxi fare|
|`pickup_datetime`|object|train/test|Date and time of pick up|
|`pickup_longitude`|float|train/test|Longitude coordinate of where taxi ride started|
|`pickup_latitude`|float|train/test|Latitude coordinate of where taxi ride started|
|`dropoff_longitude`|float|train/test|Longitude coordinate of where taxi ride ended|
|`dropoff_latitude`|float|train/test|Latitude coordinate of where taxi ride ended|
|`passenger_count`|float|train/test|Indicating number of passengers in the taxi|

In [None]:
# Statistical summary of data
df.describe()

1) Negative fare amounts? Max fare amount over $1000?! <br>
2) 0 passengers, yet it's in the records?

### Exploratory Data Analysis 
Cleaning dataset

In [None]:
df.head()

In [None]:
# 69 rows with at least 1 null value
df[df.isnull().any(axis=1)]

In [None]:
# columns with null values
df.columns[df.isnull().any()]

In [None]:
# drop these records since the missing fields are important in determining prices
df1 = df[~df.isnull().any(axis=1)].copy()

#### Dropping unrealistic longitudes/latitudes
1) Longitudes should be negative and latitudes should be positive

In [None]:
# swap these values
incorrect_location = df1[((df1['dropoff_latitude'] < 0) | (df1['pickup_latitude'] < 0)) & ((df1['dropoff_longitude'] > 0) | (df1['pickup_longitude'] > 0))]

In [None]:
# swap columns
incorrect_location.columns = ['key','fare_amount',"pickup_datetime","pickup_latitude","pickup_longitude",
                              "dropoff_latitude","dropoff_longitude","passenger_count"]

In [None]:
# merge these values back into original df
df1.loc[df1.index.isin(incorrect_location.index),["pickup_latitude","pickup_longitude","dropoff_latitude","dropoff_longitude"]] = incorrect_location[["pickup_latitude","pickup_longitude","dropoff_latitude","dropoff_longitude"]]
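
The swap above works by relabelling columns and letting pandas align on column names. A more explicit sketch, assuming the same column names and using a small made-up frame (`swap_lat_lon` is a hypothetical helper, not part of the notebook), trades the underlying values for the flagged rows:

```python
import pandas as pd

def swap_lat_lon(frame):
    """Return a copy where rows with negative latitude and positive longitude are swapped."""
    out = frame.copy()
    # same rule as the notebook: a latitude below 0 together with a longitude above 0
    flagged = (((out['dropoff_latitude'] < 0) | (out['pickup_latitude'] < 0))
               & ((out['dropoff_longitude'] > 0) | (out['pickup_longitude'] > 0)))
    lat_lon = ['pickup_latitude', 'pickup_longitude', 'dropoff_latitude', 'dropoff_longitude']
    lon_lat = ['pickup_longitude', 'pickup_latitude', 'dropoff_longitude', 'dropoff_latitude']
    # .to_numpy() bypasses label alignment so the values really trade places
    out.loc[flagged, lat_lon] = out.loc[flagged, lon_lat].to_numpy()
    return out

# made-up data: the second row has latitude and longitude stored the wrong way round
toy = pd.DataFrame({
    'pickup_longitude': [-73.98, 40.75],
    'pickup_latitude': [40.75, -73.98],
    'dropoff_longitude': [-73.99, 40.76],
    'dropoff_latitude': [40.76, -73.99],
})
fixed = swap_lat_lon(toy)
```

Working on a copy keeps the input frame untouched, which makes the step easy to re-run while exploring.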

In [None]:
# remaining odd coordinates, drop them
df1[((df1['dropoff_latitude'] < 0) | (df1['pickup_latitude'] < 0)) & ((df1['dropoff_longitude'] > 0) | (df1['pickup_longitude'] > 0))]

In [None]:
# drop the remaining 77 rows
todrop = df1[((df1['dropoff_latitude'] < 0) | (df1['pickup_latitude'] < 0)) & ((df1['dropoff_longitude'] > 0) | (df1['pickup_longitude'] > 0))]
df1 = df1[~df1.index.isin(todrop.index)]

#### Dropping unrealistic longitudes/latitudes
2) Coordinates should fall within USA

In [None]:
# dropping these records that sit in the ATLANTIC OCEAN:
# train:
df1 = df1.drop(df1[(df1['dropoff_latitude'] == 0) & (df1['dropoff_longitude'] == 0) & (df1['pickup_latitude'] == 0) & (df1['pickup_longitude'] ==0)].index)

Since longitudes and latitudes for locations in the contiguous USA fall roughly between -125 and -67 and between 24 and 50 respectively, remove all records that fall outside these ranges.

In [None]:
df1.head()

In [None]:
# drop these records: Odd latitudes
df1[(df1["pickup_latitude"] < 24) | (df1["pickup_latitude"] > 50) | (df1["dropoff_latitude"] < 24) | (df1["dropoff_latitude"] > 50)]

In [None]:
# drop these records: Odd longitudes
df1.loc[(df1["pickup_longitude"] < -125) | (df1["pickup_longitude"] > -67) | (df1["dropoff_longitude"] < -125) | (df1["dropoff_longitude"] > -67)]

In [None]:
todrop = df1.loc[(df1["pickup_longitude"] < -125) | (df1["pickup_longitude"] > -67) | (df1["dropoff_longitude"] < -125) | (df1["dropoff_longitude"] > -67)]
df1 = df1[~df1.index.isin(todrop.index)]
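
The range check described above can also be written with `Series.between`, which sidesteps operator-precedence pitfalls when several comparisons are combined. A minimal sketch on a made-up frame; `in_usa_bounds` is a hypothetical helper name, and the bounds are the ones quoted in the text (note that `between` includes both endpoints by default):

```python
import pandas as pd

LON_RANGE = (-125, -67)   # longitude bounds quoted in the text
LAT_RANGE = (24, 50)      # latitude bounds quoted in the text

def in_usa_bounds(frame):
    """Boolean mask: True where all four coordinates lie inside the bounds."""
    lon_ok = (frame['pickup_longitude'].between(*LON_RANGE)
              & frame['dropoff_longitude'].between(*LON_RANGE))
    lat_ok = (frame['pickup_latitude'].between(*LAT_RANGE)
              & frame['dropoff_latitude'].between(*LAT_RANGE))
    return lon_ok & lat_ok

# made-up data: row 1 has a bad pickup longitude, row 2 a bad dropoff latitude
toy = pd.DataFrame({
    'pickup_longitude': [-73.98, -130.0, -73.98],
    'pickup_latitude': [40.75, 40.75, 40.75],
    'dropoff_longitude': [-73.99, -73.99, -73.99],
    'dropoff_latitude': [40.76, 40.76, 60.0],
})
kept = toy[in_usa_bounds(toy)]
```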

#### Drop records that fall outside of `test_df`'s coordinates

In [None]:
# looking at range of pickup latitude and longitude in test set
fig,ax = plt.subplots(2,figsize = (12,8))
sns.boxplot(x = test_df['pickup_latitude'],ax = ax[0])
sns.boxplot(x = test_df['pickup_longitude'],ax = ax[1])

In [None]:
test_df.describe()

In [None]:
df1 = df1[((df1['pickup_longitude'] > -75) & (df1['pickup_longitude'] < -72)) & ((df1['pickup_latitude'] > 40) & (df1['pickup_latitude'] < 42)) & ((df1['dropoff_longitude'] > -75) & (df1['dropoff_longitude'] < -72)) & ((df1['dropoff_latitude'] > 40) & (df1['dropoff_latitude'] < 42))]

In [None]:
# looking at range of pickup latitude and longitude in the filtered train set
fig,ax = plt.subplots(2,figsize = (12,8))
sns.boxplot(x = df1['pickup_latitude'],ax = ax[0])
sns.boxplot(x = df1['pickup_longitude'],ax = ax[1])

#### Drop unrealistic cab fares
Drop zero and negative cab fares

In [None]:
df1 = df1.drop(df1[df1['fare_amount'] <= 0].index)

#### Drop unrealistic passenger count

In [None]:
fig,ax = plt.subplots(figsize = (12,8))
sns.boxplot(x = df1['passenger_count'])

In [None]:
df1[df1['passenger_count'] >50]

In [None]:
# drop the 2 extreme values
df1 = df1.drop(df1[df1['passenger_count'] > 50].index)

#### Drop unrealistic fare amounts

In [None]:
fig,ax = plt.subplots(figsize = (12,8))
sns.boxplot(x = df1['fare_amount'])

In [None]:
# Assumption: cab fares are all below $200
df1 = df1.drop(df1[df1['fare_amount'] > 200].index)

### Feature Engineering
Changing datatypes and creating new fields

In [None]:
pd.to_datetime(pd.to_datetime(df1.head()['pickup_datetime']).dt.strftime("%Y-%m-%d %H:%M"))

In [None]:
# changing date column to datetime(ns) 
df1['pickup_datetime'] = pd.to_datetime(pd.to_datetime(df1['pickup_datetime']).dt.strftime("%Y-%m-%d %H:%M"))
test_df['pickup_datetime'] = pd.to_datetime(pd.to_datetime(test_df['pickup_datetime']).dt.strftime("%Y-%m-%d %H:%M"))

In [None]:
# Creating separate fields for year, month, day, weekday, hour and minute
# train set:
df1['year'] = df1['pickup_datetime'].dt.year
df1['month'] = df1['pickup_datetime'].dt.month
df1['day'] = df1['pickup_datetime'].dt.day
df1['weekday'] = df1['pickup_datetime'].dt.weekday
df1['hour'] = df1['pickup_datetime'].dt.hour
df1['min'] = df1['pickup_datetime'].dt.minute

# test set:
test_df['year'] = test_df['pickup_datetime'].dt.year
test_df['month'] = test_df['pickup_datetime'].dt.month
test_df['day'] = test_df['pickup_datetime'].dt.day
test_df['weekday'] = test_df['pickup_datetime'].dt.weekday
test_df['hour'] = test_df['pickup_datetime'].dt.hour
test_df['min'] = test_df['pickup_datetime'].dt.minute
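
Because the train and test sets receive identical date parts, the extraction can be wrapped in one function and applied to both. A minimal sketch, assuming `pickup_datetime` is already a datetime column; `add_datetime_parts` is a hypothetical helper name and the frame is made up:

```python
import pandas as pd

def add_datetime_parts(frame, col='pickup_datetime'):
    """Return a copy of frame with year/month/day/weekday/hour/min columns added."""
    out = frame.copy()
    dt = out[col].dt
    out['year'] = dt.year
    out['month'] = dt.month
    out['day'] = dt.day
    out['weekday'] = dt.weekday   # Monday is 0, Sunday is 6
    out['hour'] = dt.hour
    out['min'] = dt.minute
    return out

# made-up single-row example
toy = pd.DataFrame({'pickup_datetime': pd.to_datetime(['2015-01-27 13:08'])})
parts = add_datetime_parts(toy)
```

Keeping the logic in one place means the train and test columns cannot drift apart if a field is added later.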

#### `distance`
Converting longitudes, latitudes into distance in km 

In [None]:
# define haversine formula to convert points to distance in km
def haversine(df2):
    """
    Calculate the great circle distance between two points 
    on the earth (specified in decimal degrees)
    """
    # convert decimal degrees to radians 
    lon1 = df2['pickup_longitude']
    lon2 = df2['dropoff_longitude']
    lat1 = df2['pickup_latitude']
    lat2 = df2['dropoff_latitude']
    
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])

    # haversine formula 
    dlon = lon2 - lon1 
    dlat = lat2 - lat1 
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a)) 
    r = 6371 # Radius of earth in kilometers.
    return c * r

In [None]:
# apply formula to get distance column
df1['distance'] = df1.apply(haversine,axis = 1)
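
Row-wise `apply` calls the Python function once per record, which can be slow on a million rows. A vectorized sketch of the same haversine formula using NumPy, assuming the same column names and Earth radius (`haversine_vectorized` is a hypothetical name, and the frame is made up):

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371  # same radius as the row-wise version

def haversine_vectorized(frame):
    """Great-circle distance in km for every row at once, using NumPy arrays."""
    lon1, lat1, lon2, lat2 = (
        np.radians(frame[c].to_numpy(dtype=float))
        for c in ['pickup_longitude', 'pickup_latitude',
                  'dropoff_longitude', 'dropoff_latitude']
    )
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# made-up data: a zero-length trip and a trip of one degree of latitude
toy = pd.DataFrame({
    'pickup_longitude': [-73.98, 0.0],
    'pickup_latitude': [40.75, 0.0],
    'dropoff_longitude': [-73.98, 0.0],
    'dropoff_latitude': [40.75, 1.0],
})
toy['distance'] = haversine_vectorized(toy)
```

One degree of latitude should come out at about 111 km, which is a quick sanity check on the formula.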

Since `distance` captures the relationship between distance travelled and cab fare, the longitude and latitude features it is derived from can be dropped.

In [None]:
# Dropping correlated and redundant columns
df2 = df1.copy()
df2 = df2.drop(columns = ['pickup_latitude','pickup_longitude','dropoff_latitude','dropoff_longitude'])

In [None]:
# doing the same for the test set
test_df['distance'] = test_df.apply(haversine,axis = 1)

# Dropping correlated and redundant columns
test_df = test_df.drop(columns = ['pickup_latitude','pickup_longitude','dropoff_latitude','dropoff_longitude'])

### Data Visualization

In [None]:
# to visualize correlation between variables
corr = df2.select_dtypes('number').corr()  # key and pickup_datetime are not numeric
mask = np.triu(np.ones_like(corr,dtype = bool))
fig,ax = plt.subplots(figsize = (12,8))
cmap = sns.diverging_palette(230, 20, as_cmap=True)
sns.heatmap(corr,ax = ax,annot = True,cmap = cmap,mask = mask)

In [None]:
# Spread of variables
fig,ax = plt.subplots(2,figsize = (12,8))
sns.violinplot(y = df2['fare_amount'],x = df2['year'],ax = ax[0])
sns.violinplot(y = df2['fare_amount'],x = df2['month'],ax = ax[1])

In [None]:
# visualize mean fare by year and by month
fig,ax = plt.subplots(2,figsize = (12,8))
sns.barplot(y = df2['fare_amount'],x = df2['year'],ax = ax[0],palette = 'Set2')
sns.barplot(y = df2['fare_amount'],x = df2['month'],ax = ax[1],palette = 'Set2')

In [None]:
# save cleaned data as a separate csv

# df2.to_csv('../data/distanced_train.csv',index = False)
# test_df.to_csv('../data/distanced_test.csv',index = False)

In [None]:
# df2 = pd.read_csv('../data/distanced_train.csv')
# test_df = pd.read_csv('../data/distanced_test.csv')

## Data modeling 

In [None]:
# Separating predictor variables and target variable
X = df2.drop(columns = ['fare_amount','key','pickup_datetime'])
y = df2['fare_amount']

In [None]:
X_train,X_test,y_train,y_test = train_test_split(X,y,random_state = 42)

In [None]:
# looking at rows, columns for train and validation set
print(f'X_train: {X_train.shape}')
print(f'y_train: {y_train.shape}')
print(f'X_val: {X_test.shape}')
print(f'y_val: {y_test.shape}')

### StandardScaler
Standardize the features to improve prediction accuracy, especially when the variables are on different scales or magnitudes. Scale differences affect models that rely on distance metrics (e.g. k-NN, PCA) and slow down gradient descent convergence when training deep neural networks. Scaling ensures that every feature contributes comparably to the models.

In [None]:
X_train.head()

In [None]:
# scale data
ss = StandardScaler()
ss.fit(X_train)
X_train_ss = ss.transform(X_train)
X_test_ss = ss.transform(X_test)
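
For intuition, the arithmetic behind this step is a per-column shift and rescale using statistics taken from the training split only. A NumPy sketch under that assumption (the arrays are made up; it mirrors what `StandardScaler` computes with its default settings, not the notebook's actual data):

```python
import numpy as np

# made-up features: two columns on very different scales
train = np.array([[1.0, 2010.0], [2.0, 2012.0], [3.0, 2014.0]])
test = np.array([[2.0, 2015.0]])

# statistics come from the training split only, to avoid leaking test information
mean = train.mean(axis=0)
std = train.std(axis=0)        # population std (ddof=0), as StandardScaler uses

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std
```

After scaling, each training column has mean 0 and standard deviation 1, while test values are expressed relative to the training distribution.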

In [None]:
def get_models(models=None):
    # use a fresh dict per call instead of a shared mutable default
    if models is None:
        models = dict()
    # linear models
    models['lr'] = LinearRegression()
    models['lasso'] = Lasso()
    models['ridge'] = Ridge()
    models['en'] = ElasticNet()
    models['huber'] = HuberRegressor()
    models['pa'] = PassiveAggressiveRegressor(max_iter=1000, tol=1e-3)
   
    return models

def get_models_nl(models=None):
    # use a fresh dict per call instead of a shared mutable default
    if models is None:
        models = dict()
    # non-linear models
    models['svr'] = SVR()
    # ensemble models
    n_trees = 100
    models['ada'] = AdaBoostRegressor(n_estimators=n_trees)
    models['bag'] = BaggingRegressor(n_estimators=n_trees)
    models['rf'] = RandomForestRegressor(n_estimators=n_trees)
    models['et'] = ExtraTreesRegressor(n_estimators=n_trees)
    models['gbm'] = GradientBoostingRegressor(n_estimators=n_trees)
    return models

def evaluate_models(models, X_train_ss,y_train,X_test_ss,y_test):
    for name, model in models.items():
        # fit models
        model_fit = model.fit(X_train_ss,y_train)
        # make predictions
        train_preds = model_fit.predict(X_train_ss)
        test_preds = model_fit.predict(X_test_ss)
        # evaluate predictions with mean squared error
        train_mse = mean_squared_error(y_train,train_preds)
        test_mse = mean_squared_error(y_test,test_preds)
        print(f'{name}:')
        print(f'----')
        print(f'Train MSE: {round(train_mse,2)}')
        print(f'Test MSE: {round(test_mse,2)}')
        print(f'\n')
        
def pipeline(model, models):
    # wrap the named model from the `models` dict in a single-step Pipeline
    pipe = Pipeline([(model, models[model])])
    return pipe

def params(model):
    

    if model == 'lasso':
        return {"alpha":[0.01,0.1,1,2,5,10],
               }
    
    
    elif model == 'ridge':
        return {
            "alpha":[0.01,0.1,1,2,5,10],
            }
    
    elif model == 'en':
        return {
            'alpha':[0.01,0.1,1,10],
            'l1_ratio':[0.2,0.3,0.4,0.5,0.6]
            }
    elif model == 'knn':
        return {
            'n_neighbors':[4,5,6,7]}

    elif model == 'dt':
        return {
            'max_depth':[3,4,5],
            'min_samples_split':[2,3,4],
            'min_samples_leaf':[2,3,4]
        }
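    elif model == 'bag':
        # integer max_features must not exceed the number of columns in X
        return {
            'n_estimators':[20,50,100,150],
            'max_features':[2,4,6],
            'max_samples':[0.1,0.2,0.3,0.5,0.7],
            'bootstrap':[True]
        }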
        
    elif model == 'rf':
        return {
            'n_estimators':[100,150],
            'max_depth':[4],
            'min_samples_leaf':[2,3,4]
        }
    elif model == 'et':
        return {
            'n_estimators':[50,100,150,200],
            'max_depth':[1000,2000,3000],
            'min_samples_leaf':[10000,20000,30000],
        }
    elif model == 'abc':
        return {
            'n_estimators':[50,100,150,200],
            'learning_rate':[0.3,0.6,1]
        }
    elif model == 'gbc':
        return {
            'learning_rate':[0.2],
            'max_depth':[1000,2000,3000],
            'min_samples_split':[10000,20000,30000]
            
        }
    elif model == 'xgb':
        return {
            'eval_metric' : ['auc'],
            'subsample' : [0.8], 
            'colsample_bytree' : [0.5], 
            'learning_rate' : [0.1],
            'max_depth' : [5], 
            'scale_pos_weight': [5], 
            'n_estimators' : [100,200],
            'reg_alpha' : [0, 0.05],
            'reg_lambda' : [2,3],
            'gamma' : [0.01]
                             
        }
    elif model == 'svr':
        return {
            'kernel': ['rbf', 'linear','poly'], 
            'C': [1,20,50,100],
            'gamma':['scale','auto'],
            'epsilon':[0.1,1,10]
        }
    elif model == 'ada':
        return {
            'n_estimators':[50,100,150],
            'learning_rate':[0.01,0.1,1],
            
        }
            
    elif model == 'gbm':
        return {
            'learning_rate' : [0.1,0.3,0.6,1], 
            'min_samples_split':[10000,20000,30000],
            'min_samples_leaf': [10000,20000,30000],
            'max_depth' : [8,10,20]
       }



# randomized hyperparameter search with RandomizedSearchCV
def grid_search_rs(model,models,X_train = X_train_ss,y_train = y_train,X_test = X_test_ss,y_test=y_test):
    pipe_params = params(model)
    model = models[model]
    gs = RandomizedSearchCV(model,param_distributions = pipe_params,cv = 5,scoring = 'neg_mean_squared_error', verbose=True, n_jobs=8)
    # use the function arguments rather than the global arrays
    gs.fit(X_train,y_train)
    # scores are negative MSE, so they are negated when printed
    train_score = gs.score(X_train,y_train)
    test_score = gs.score(X_test,y_test)
    
    print(f'Results from: {model}')
    print(f'-----------------------------------')
    print(f'Best Hyperparameters: {gs.best_params_}')
    print(f'Mean CV MSE: {-round(gs.best_score_,4)}')
    print(f'Train MSE: {-round(train_score,4)}')
    print(f'Test MSE: {-round(test_score,4)}')
    print(' ')
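
The helpers above score models by mean squared error (MSE). A small NumPy sketch, on made-up numbers, of how MSE differs from mean absolute error (MAE): squaring penalises one large miss far more than several small ones.

```python
import numpy as np

# made-up fares and predictions; the last prediction misses by 20
y_true = np.array([10.0, 12.0, 8.0, 50.0])
y_pred = np.array([11.0, 11.0, 9.0, 30.0])

errors = y_true - y_pred
mse = np.mean(errors ** 2)       # (1 + 1 + 1 + 400) / 4
mae = np.mean(np.abs(errors))    # (1 + 1 + 1 + 20) / 4
rmse = np.sqrt(mse)              # square root brings MSE back to fare units
```

Here the single 20-unit miss dominates the MSE, so the two metrics can rank models differently when a few fares are badly mispredicted.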

#### Linear Models

In [None]:
models = get_models()
evaluate_models(models,X_train_ss,y_train,X_test_ss,y_test)

In [None]:
%time grid_search_rs("ridge",models)

#### Neural Nets

In [None]:
model = Sequential()
model.add(Dense(64,activation = 'relu',kernel_initializer = 'normal',input_dim = X_train_ss.shape[1]))
model.add(Dropout(0.3))
model.add(Dense(32,activation = 'relu'))
model.add(Dense(1))
model.compile(loss = 'mse',optimizer = 'adam',metrics = ['mae'])
history_model = model.fit(X_train_ss,y_train, epochs = 100, batch_size = 50000, validation_data = (X_test_ss,y_test),verbose = 2)

In [None]:
model1 = Sequential()
model1.add(Dense(128,activation = 'relu',kernel_initializer = 'normal',input_dim = X_train_ss.shape[1]))
model1.add(Dropout(0.3))
model1.add(Dense(64,activation = 'relu'))
model1.add(Dense(1))
model1.compile(loss = 'mse',optimizer = 'adam',metrics = ['mean_squared_error'])
history_model1 = model1.fit(X_train_ss,y_train, epochs = 30, batch_size = 50000, validation_data = (X_test_ss,y_test),verbose = 2)

In [None]:
fig,ax = plt.subplots(figsize = (20,10))
ax.plot(history_model.history['loss'],label = 'Train Loss')
ax.plot(history_model.history['val_loss'],label = 'Val Loss')
ax.plot(history_model1.history['loss'],label = 'Train Loss - Wider Layers')
ax.plot(history_model1.history['val_loss'],label = 'Val Loss - Wider Layers')
plt.legend()

In [None]:
print(f"Evaluating ANN's(Vanilla) performance:")
print('------')
print(f'Train Score: {mean_squared_error(y_train,model.predict(X_train_ss))}')
print(f'Test Score: {mean_squared_error(y_test,model.predict(X_test_ss))}')
print(f'\n')
print(f"Evaluating ANN's(wider layers) performance:")
print('------')
print(f'Train Score: {mean_squared_error(y_train,model1.predict(X_train_ss))}')
print(f'Test Score: {mean_squared_error(y_test,model1.predict(X_test_ss))}')

### Final predictions
Fit the models on the entire training set now.

In [None]:
ss = StandardScaler()
ss.fit(X)
X_ss = ss.transform(X)
test_df_ss = ss.transform(test_df[X.columns])

In [None]:
print(f'Shape of X: {X_ss.shape}')
print(f'Shape of y: {y.shape}')

#### Linear Model

In [None]:
ridge = Ridge(alpha = 10)
ridge.fit(X_ss,y)
ridge_preds = pd.DataFrame({"key":test_df["key"],"fare_amount":ridge.predict(test_df_ss)})
# ridge_preds.to_csv('../submissions/my_submissions_ridge.csv',index = False)

#### ANN

In [None]:
finalmodel = Sequential()
finalmodel.add(Dense(64,activation = 'relu',kernel_initializer = 'normal',input_dim = X_ss.shape[1]))
finalmodel.add(Dropout(0.3))
finalmodel.add(Dense(32,activation = 'relu'))
finalmodel.add(Dense(1))
finalmodel.compile(loss = 'mse',optimizer = 'adam',metrics = ['mae'])
history_finalmodel = finalmodel.fit(X_ss,y, epochs = 100, batch_size = 50000,verbose = 2)

In [None]:
ann_preds = pd.DataFrame({"key": test_df['key'], "fare_amount":finalmodel.predict(test_df_ss).flatten()})
ann_preds.to_csv("my_final_submission.csv", index=False)