## New York Taxi Fare Prediction

- Decision Tree, Random Forest, and XGBoost regression techniques are used to predict the taxi fare
- The training data is huge, so only a sample (the first 100k records) is used to build the model
- Results should improve with a larger training set and more hyperparameter tuning with cross-validation. Because of memory limits in the Kaggle environment, only 100k records are used.

## Steps Taken to build the model
- Load the data / Cleanup the data
- Feature Engineering
- Exploratory Data Analysis
- Univariate and Bivariate Analysis
- Distribution of data
- Decision Tree for predicting the taxi fare
- Random Forest for Predicting the taxi fare
- XGBoost for Predicting the taxi fare

## Result
- Decision Tree: 77.9% accuracy (R² score on the training data)
- Random Forest: 78.4% accuracy
- XGBoost: 85.46% accuracy

## Final Result: XGBoost's accuracy of 85.46% is much higher than that of Random Forest and Decision Tree, which suggests that XGBoost is the best of the three at predicting New York taxi fares


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV

import xgboost as xgb
from sklearn.metrics import mean_squared_error

from sklearn.ensemble import StackingRegressor

In [None]:
## Load 100k rows only
data = pd.read_csv("/kaggle/input/new-york-city-taxi-fare-prediction/train.csv", nrows=100_000, parse_dates=['pickup_datetime'])


In [None]:
print(data.shape)
print(data.info())

In [None]:
data.head()

In [None]:
data.describe()

## Fare Amount Distribution

In [None]:
data[data.fare_amount<100].fare_amount.hist(bins=100, figsize=(14,3))
plt.xlabel('fare $USD')
plt.title('Histogram');

### Calculate the distance between two GPS locations
- The raw latitude and longitude values are not very useful for modeling on their own
- Instead, we calculate the distance between the pickup and dropoff points

In [None]:
from math import sin, cos, sqrt, atan2, radians

def calculateDistance(lt1, ln1, lt2, ln2):
    """Return the haversine distance in meters between two points given in degrees."""

    # approximate radius of earth in km
    R = 6373.0

    lat1 = radians(lt1)
    lon1 = radians(ln1)
    lat2 = radians(lt2)
    lon2 = radians(ln2)

    dlon = lon2 - lon1
    dlat = lat2 - lat1

    a = sin(dlat / 2)**2 + cos(lat1) * cos(lat2) * sin(dlon / 2)**2
    c = 2 * atan2(sqrt(a), sqrt(1 - a))

    # R is in km, so multiply by 1000 to return meters
    distance = R * c * 1000
    
    return distance
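A function built on `math` handles one pair of points per call. As a rough sketch, the same haversine formula can be applied to whole arrays at once with NumPy, which tends to be faster on larger samples. `haversine_m` is a hypothetical name that is not part of the original notebook, and the Earth radius is kept at the same 6373 km approximation.

```python
import numpy as np

EARTH_RADIUS_KM = 6373.0  # same approximation as calculateDistance above


def haversine_m(lat1, lon1, lat2, lon2):
    """Haversine distance in meters for arrays of coordinates in degrees."""
    lat1, lon1, lat2, lon2 = (np.radians(np.asarray(v, dtype=float))
                              for v in (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    a = np.clip(a, 0.0, 1.0)  # guard against rounding slightly outside [0, 1]
    c = 2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a))
    return EARTH_RADIUS_KM * c * 1000


# One degree of latitude, and a trip that starts and ends at the same point
distances = haversine_m([0.0, 40.75], [0.0, -73.99], [1.0, 40.75], [0.0, -73.99])
print(distances)
```

With a DataFrame, the four coordinate columns could be passed in directly. The function returns a NumPy array that can be assigned to a new column.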

## Feature Engineering
- The raw pickup_datetime value is not very useful as a feature on its own
- We can extract weekday and pickup_time from pickup_datetime, which should be good features for prediction
- Weekday shows which days of the week are the busiest
- Pickup time shows which hours of the day are peak hours
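To illustrate the extraction on a small scale, here is a minimal sketch using two made-up timestamps (the `toy` frame is hypothetical, not taken from the taxi data). It uses the pandas `.dt` accessor to derive the weekday name, the hour, and a fractional pickup time.

```python
import pandas as pd

# Two made-up pickups: a Sunday at 02:30 and a Monday at 18:15
toy = pd.DataFrame({
    "pickup_datetime": pd.to_datetime(["2015-01-04 02:30:00", "2015-01-05 18:15:00"])
})

toy["weekday"] = toy["pickup_datetime"].dt.day_name()
toy["pickup_hour"] = toy["pickup_datetime"].dt.hour
# fractional hour, so 18:15 becomes 18.25
toy["pickup_time"] = toy["pickup_datetime"].dt.hour + toy["pickup_datetime"].dt.minute / 60

print(toy)
```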

## Data Cleanup
- Remove rows with a negative fare amount, which doesn't make sense
- Remove rows with a distance <= 0
- Remove all rows that contain NaN values

In [None]:
def featureCleanup(dfOrig, train = True):
    if(train):
        # .copy() avoids pandas' SettingWithCopyWarning when columns are added below
        df = dfOrig[dfOrig['fare_amount'] >= 0].copy()
    else:
        df = dfOrig.copy()
        
    df['weekday'] = df['pickup_datetime'].dt.day_name()
    df['pickup_hour'] = df['pickup_datetime'].dt.hour
    df['pickup_time'] = df['pickup_datetime'].dt.hour + df['pickup_datetime'].dt.minute/60
    
    df['distance'] = df.apply(lambda x: 
                              calculateDistance(x['pickup_latitude'], 
                                                x['pickup_longitude'],
                                                x['dropoff_latitude'],
                                                x['dropoff_longitude']), 
                              axis=1)
    
    df.drop(columns = ['pickup_latitude','pickup_longitude','dropoff_latitude','dropoff_longitude','pickup_datetime','key'], 
          inplace = True)
    
    if(train):
        # drop rows with any missing value (recent pandas rejects how= and thresh= together)
        df.dropna(axis=0, how='any', inplace=True)

        df = df[df['distance'] > 0]
    
    return df

In [None]:
trainData = featureCleanup(data)

In [None]:
trainData.head()

In [None]:
def plotChart(df, x, y, title, num):
    plt.subplot(5, 2, num)
    sns.lineplot(data = df, x= x, y = y)
    plt.title(title)
    #plt.xticks(rotation = 90)
    plt.legend(loc='upper right')

# Exploratory Data Analysis

## Bivariate Analysis
- Taxi fares are generally highest on Sundays
- Average trip distance is generally highest on Sundays and Wednesdays (Wednesday has the maximum distance because of an outlier)
- Taxi fares are highest between 2 AM and 4 AM. Midnight charges?
- An outlier distorts the distance vs. fare plot

In [None]:
plt.figure(figsize  = (15,30))
# numeric_only=True skips the string 'weekday' column, which recent pandas cannot average
plotChart(trainData.groupby(by="weekday").mean(numeric_only=True).reset_index(), 'weekday', 'fare_amount', 'weekday vs fare', 1)
plotChart(trainData.groupby(by="weekday").mean(numeric_only=True).reset_index(), 'weekday', 'distance', 'weekday vs distance', 2)
plotChart(trainData.groupby(by="pickup_hour").mean(numeric_only=True).reset_index(), 'pickup_hour', 'fare_amount', 'hour vs fare', 3)
plotChart(trainData.groupby(by="distance").mean(numeric_only=True).reset_index(), 'distance', 'fare_amount', 'distance vs fare', 4)

## Univariate Analysis
- Single-passenger rides are the most common. Commuting from office to home?
- Thursday, Friday, and Saturday have the highest ride counts
- Moderate hours are from 9:00 AM to 5:00 PM
- Peak hours are from 6:00 PM to 9:00 PM. People leaving the office for home?

In [None]:
plt.figure(figsize  = (20,40))
for idx, col in enumerate(trainData.columns.drop(['fare_amount', 'distance', 'pickup_time'])):
    plt.subplot(10, 2, idx + 1)
    # pass the column as x= explicitly; newer seaborn versions expect keyword arguments here
    sns.countplot(x=trainData[col])


## Outliers Detection
- There are outliers in the dataset, but they have limited impact on ML models based on decision trees.
- Outlier detection and treatment are therefore skipped here.

### Convert weekday names to numbers
- ML models need numeric inputs rather than string values, so weekday names are converted to weekday numbers.

In [None]:
trainData.drop(columns=['pickup_hour'], inplace=True)
trainData['weekday'] = trainData['weekday'].map({"Monday": 1, "Tuesday": 2, "Wednesday": 3, "Thursday": 4, "Friday": 5, "Saturday": 6, "Sunday": 7})
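As a small check of the mapping step, the sketch below runs the same dictionary on a few made-up dates. It also shows two details worth knowing: `dt.dayofweek + 1` gives the same 1 to 7 scale directly from the timestamp, and any name missing from the dictionary silently maps to NaN. The variable names here are hypothetical.

```python
import pandas as pd

weekday_to_number = {"Monday": 1, "Tuesday": 2, "Wednesday": 3, "Thursday": 4,
                     "Friday": 5, "Saturday": 6, "Sunday": 7}

dates = pd.Series(pd.to_datetime(["2015-01-04", "2015-01-05", "2015-01-09"]))
names = dates.dt.day_name()

mapped = names.map(weekday_to_number)
# pandas numbers weekdays from Monday=0, so adding 1 gives the same 1-7 scale
from_accessor = dates.dt.dayofweek + 1

print(mapped.tolist(), from_accessor.tolist())
# a misspelled or unexpected name is not in the dict and maps to NaN
print(pd.Series(["Sunday", "sunday"]).map(weekday_to_number).isna().tolist())
```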

In [None]:
y_train = trainData.pop('fare_amount')
X_train = trainData

In [None]:
X_train.head()

# Decision Tree

## Hyperparameter Tuning for Decision Trees
- max_depth needs to be set to avoid overfitting
- Try several max_depth values from 4 to 10 to identify the best one
- Other parameters such as min_samples_split can also be tuned, but fitting then takes a lot of time

In [None]:
params = {
    'max_depth': [4,5,6,7,8,9,10]
}

### Grid Search Cross-Validation Technique
- We don't have to split the data into train and test parts because the test data is provided separately
- So I am using cross-validation to validate the model on held-out validation folds
- cv=4 means the data is split into 4 parts: in each round, 3 parts are used for training and 1 part is used for validation
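To make the fold layout concrete, here is a minimal sketch on 8 made-up samples. It relies on the scikit-learn behavior that an integer `cv` for a regressor corresponds to `KFold` without shuffling. `X_toy` is a hypothetical array used only for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

X_toy = np.arange(8).reshape(-1, 1)  # 8 made-up samples

# For a regressor, an integer cv in GridSearchCV corresponds to KFold without shuffling
folds = list(KFold(n_splits=4).split(X_toy))
for round_no, (train_idx, val_idx) in enumerate(folds, start=1):
    print(f"round {round_no}: train={train_idx.tolist()} validate={val_idx.tolist()}")
```

Each sample lands in the validation part exactly once across the 4 rounds.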

In [None]:
# Instantiate the grid search model

dt = DecisionTreeRegressor(random_state=100)

grid_search = GridSearchCV(estimator=dt, param_grid = params, 
                          cv=4, n_jobs=-1, verbose=1)

In [None]:
grid_search.fit(X_train, y_train)
grid_search.best_estimator_

## Decision Tree Result
- The decision tree fits the training data with 77.9% accuracy (R² score)

In [None]:
y_train_predict = grid_search.predict(X_train)
print("Decision Tree Accuracy:", round(r2_score(y_train, y_train_predict)*100, 2), "%")
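The score printed above is computed on the same rows the model was fit on. `GridSearchCV` also stores the mean cross-validated score of the best parameter setting in `best_score_`, which is usually a more conservative estimate. The sketch below compares the two on synthetic data (an assumption for illustration, not the taxi data); `X_demo`, `y_demo`, and `search` are hypothetical names.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for the taxi features (an assumption for illustration)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(400, 2))
y_demo = 2.5 * X_demo[:, 0] + rng.normal(0, 2.0, size=400)

search = GridSearchCV(DecisionTreeRegressor(random_state=100),
                      param_grid={"max_depth": [2, 4, 8]}, cv=4)
search.fit(X_demo, y_demo)

train_r2 = r2_score(y_demo, search.predict(X_demo))  # scored on the rows used for fitting
cv_r2 = search.best_score_                           # mean R² over the 4 validation folds

print("best max_depth:", search.best_params_["max_depth"])
print("training R2:", round(train_r2, 3), "cross-validated R2:", round(cv_r2, 3))
```

In the notebook, the equivalent value would be `grid_search.best_score_`.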

# Random Forest

## Hyperparameter Tuning for Random Forest
- Number of estimators: 50
- max_depth values from 6 to 8 are tried to identify the best tree depth
- max_features values from 2 to 4 are tried to identify the best number of features
- We can iterate on this based on the results and tune the hyperparameters further to get optimal values

In [None]:
rfEstimator = RandomForestRegressor(random_state=42)
para_grids = {
            "n_estimators" : [50],
            "max_depth": [6,7,8],
            'max_features': [2,3,4]
        }

### Grid search will fit 45 random forests (9 parameter combinations × 5 folds, each forest with 50 trees) and then refit the best combination on the full training data for prediction

In [None]:
grid_rf = GridSearchCV(rfEstimator, para_grids, verbose=1, n_jobs=-1, cv=5)
grid_rf.fit(X_train, y_train)
grid_rf.best_estimator_
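The figure of 45 comes from the size of the parameter grid multiplied by the number of folds. A minimal sketch of that arithmetic, using a hypothetical helper `count_grid_fits` that is not part of scikit-learn:

```python
from itertools import product


def count_grid_fits(param_grid, cv):
    """Number of parameter combinations and cross-validation fits in a grid search."""
    combinations = list(product(*param_grid.values()))
    return len(combinations), len(combinations) * cv


rf_grid = {"n_estimators": [50], "max_depth": [6, 7, 8], "max_features": [2, 3, 4]}
n_combos, n_fits = count_grid_fits(rf_grid, cv=5)
print(n_combos, "combinations,", n_fits, "cross-validation fits")
```

With the default `refit=True`, `GridSearchCV` performs one additional fit of the best combination on the full training set.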

In [None]:
y_train_pred_rf = grid_rf.predict(X_train)
print("Random Forest Accuracy:", round(r2_score(y_train, y_train_pred_rf)*100, 2), "%")

## Random Forest accuracy is 78.41%, slightly better than the Decision Tree Regressor's 77.9%


# XGBoost (Extreme Gradient Boosting)
- This is one of the most widely used machine learning algorithms for tabular data today
- It combines hundreds of weak learners into one strong learner, which is what makes it special
- It is also usually fast to train because it leverages parallel computation
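The weak-learner idea can be sketched with plain scikit-learn trees. This is a simplified gradient boosting loop for squared error on synthetic data, not XGBoost's actual implementation (which adds regularization and second-order gradient information, among other things). All names are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 10, size=(300, 1))
y_demo = np.sin(X_demo[:, 0]) * 5 + rng.normal(0, 0.5, size=300)

learning_rate = 0.3
prediction = np.full_like(y_demo, y_demo.mean())  # start from a constant guess
mse_start = np.mean((y_demo - prediction) ** 2)

for _ in range(100):
    residual = y_demo - prediction                      # what is still unexplained
    weak = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_demo, residual)
    prediction += learning_rate * weak.predict(X_demo)  # small corrective step

mse_end = np.mean((y_demo - prediction) ** 2)
print("training MSE before:", round(mse_start, 3), "after:", round(mse_end, 3))
```

Each shallow tree is a poor model on its own, but the sum of many small corrections fits the training data far better than the constant starting guess.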

In [None]:
xg_reg = xgb.XGBRegressor(n_jobs=-1)

In [None]:
xg_reg.fit(X_train,y_train)

In [None]:
y_train_pred_xg = xg_reg.predict(X_train)
y_train_pred_xg

In [None]:
print("XGBoost Accuracy:", round(r2_score(y_train, y_train_pred_xg)*100, 2), "%")

## XGBoost's accuracy of 85.46% is much higher than Random Forest's, which suggests that XGBoost is the best of the three models at predicting New York taxi fares

# Hyperparameter tuning for XGBoost
- Note: XGBoost ships with reasonable default hyperparameters, but we can test further with cross-validation
- We will check whether tuning different parameters gives better results
- This hyperparameter tuning might take around 6-7 minutes because it trains 72 XGBoost models (18 parameter combinations × 4 folds) to find the best estimator

In [None]:
para_grids = {
            "n_estimators": [100,200],
            "learning_rate": [0.3,0.4,0.5],
            "max_depth": [6,7,8]
        }

grid_xg = GridSearchCV(xg_reg, para_grids, verbose=1, n_jobs=-1, cv=4)
grid_xg.fit(X_train, y_train)
grid_xg.best_estimator_

In [None]:
y_train_pred_xg_cv = grid_xg.predict(X_train)
y_train_pred_xg_cv

In [None]:
print("XGBoost Accuracy after Hyperparameter tuning:", round(r2_score(y_train, y_train_pred_xg_cv)*100, 2), "%")

## Accuracy didn't change even after hyperparameter tuning, which means XGBoost already predicts with a high level of accuracy using its default settings.

In [None]:
xgb.plot_tree(grid_xg.best_estimator_,num_trees=0)
plt.show()

## Using Stacking Regressor to check if it improves the accuracy
- Use Random Forest and XGBoost together to predict
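As a rough illustration of how stacking combines models, the sketch below stacks two scikit-learn regressors on synthetic data. The base learners here are stand-ins (a decision tree and a small random forest), not the XGBoost and Random Forest models used in this notebook, and all names are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X_demo = rng.uniform(0, 10, size=(200, 3))
y_demo = X_demo[:, 0] * 3 + X_demo[:, 1] + rng.normal(0, 1.0, size=200)

# Stand-in base learners (the notebook stacks XGBoost and Random Forest instead)
demo_learners = [
    ("tree", DecisionTreeRegressor(max_depth=4, random_state=0)),
    ("forest", RandomForestRegressor(n_estimators=20, max_depth=4, random_state=0)),
]

# With no final_estimator given, scikit-learn uses RidgeCV to combine the base predictions
stack = StackingRegressor(estimators=demo_learners, cv=4)
stack.fit(X_demo, y_demo)

print("training R2:", round(stack.score(X_demo, y_demo), 3))
print("base learners fitted:", list(stack.named_estimators_.keys()))
```

The final estimator is trained on cross-validated predictions of the base learners, so a stacked model is not guaranteed to beat its strongest member.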

In [None]:
base_learners = [
                 ('es1', xg_reg),
                 ('es2', grid_rf.best_estimator_)     
                ]

In [None]:
stregr = StackingRegressor(estimators=base_learners, cv=4,n_jobs=1,verbose=1)

In [None]:
stregr.fit(X_train, y_train)

In [None]:
y_predict_stack_reg = stregr.predict(X_train)

In [None]:
print("Accuracy:", round(r2_score(y_train, y_predict_stack_reg)*100, 2), "%")

## Accuracy goes down a little when Random Forest and XGBoost are stacked together.

# Predict the taxi fare for Test Data

In [None]:
test = pd.read_csv("/kaggle/input/new-york-city-taxi-fare-prediction/test.csv", parse_dates=['pickup_datetime'])

testData = featureCleanup(test, False)


In [None]:
testData.head()

In [None]:
testData.drop(columns=['pickup_hour'], inplace=True)
testData['weekday'] = testData['weekday'].map({"Monday": 1, "Tuesday": 2, "Wednesday": 3, "Thursday": 4, "Friday": 5, "Saturday": 6, "Sunday": 7})

In [None]:
y_test_pred_xg_cv = grid_xg.predict(testData)
y_test_pred_xg_cv

In [None]:
test['fare_amount_predicted'] = y_test_pred_xg_cv
test.head()