<a href="https://colab.research.google.com/github/sanjayrawat2468/CapstoneProject2-EDA/blob/main/Capstone_Project_2_EDA_Type.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - **NYC Taxi Trip Time Prediction**



##### **Project Type**    - EDA
##### **Contribution**    - Individual
##### **Team Member  -  Sanjay Rawat**


# **Project Summary**

## **To explore the trip attributes and build a predictive model that predicts the total trip duration of taxi trips in New York City.**



# **GitHub Link**

**https://github.com/sanjayrawat2468/CapstoneProject2-EDA/blob/main/Capstone_Project_2_EDA_Type.ipynb**

# **Problem Statement**


## **Your task is to build a model that predicts the total ride duration of taxi trips in New York City. Your primary dataset is one released by the NYC Taxi and Limousine Commission, which includes pickup time, geo-coordinates, number of passengers, and several other variables.**

# **Data Description**

## The dataset is based on the 2016 NYC Yellow Cab trip record data made available in BigQuery on Google Cloud Platform. The data was originally published by the NYC Taxi and Limousine Commission (TLC). The data was sampled and cleaned for the purposes of this project. Based on individual trip attributes, you should predict the duration of each trip in the test set.


# **Data fields**
## **id -** A unique identifier for each trip.
## **vendor_id -** A code indicating the provider associated with the trip record.
## **pickup_datetime -** Date and time when the meter was engaged.
## **dropoff_datetime -** Date and time when the meter was disengaged.
## **passenger_count -** The number of passengers in the vehicle. (driver entered value)
## **pickup_longitude -** The longitude where the meter was engaged.
## **pickup_latitude -** The latitude where the meter was engaged.
## **dropoff_longitude -** The longitude where the meter was disengaged.
## **dropoff_latitude -** The latitude where the meter was disengaged.
## **store_and_fwd_flag -** This flag indicates whether the trip record was held in vehicle memory before sending to the vendor because the vehicle did not have a connection to the server - Y=store and forward; N=not a store and forward trip.
## **trip_duration -** Duration of the trip in seconds. (Target variable)



# ***Let's Begin !***

# **Import Libraries**

In [None]:
# Importing Required Libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import datetime as dt
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn import metrics
from sklearn.model_selection import train_test_split, GridSearchCV
import statsmodels.api as sm
from sklearn.model_selection import learning_curve
from sklearn.model_selection import ShuffleSplit
import warnings; warnings.simplefilter('ignore')


# **Mount Google Drive**


In [None]:
# Mounting Drive
from google.colab import drive
drive.mount('/content/drive')

# **Dataset Loading**

In [None]:
# Load Dataset
data = pd.read_csv('/content/NYC Taxi Data.csv',on_bad_lines='skip')


# **Dataset First View**

In [None]:
# Dataset First Look
data.head()

# **Dataset Rows & Columns count**

In [None]:
# Dataset Rows & Columns count
print("Number of rows is: ", data.shape[0])
print("Number of columns is: ", data.shape[1])

# **Dataset Information**

In [None]:
# Dataset Info
data.info()

In [None]:
data.describe()

# **Duplicate Values**

In [None]:
# Dataset Duplicate Value Count
data.duplicated().sum()

# **Missing Values/Null Values**

In [None]:
# Missing Values/Null Values Count
data.isnull().sum()

In [None]:
data.shape

# **What did you know about your dataset?**

**The dataset originally contained 106745 rows and 11 columns. A check for null values found a negligible number of them, so I dropped those rows, leaving 106744 rows and 11 columns.**

# **Understanding Your Variables**

In [None]:
# Dataset Columns
data.columns

# **Check Unique Values for each variable**

In [None]:
# Check Unique Values for each variable.
for column in data.columns:
    print('We have %d unique %s in our dataset' % (data[column].nunique(), column))


In [None]:
data.dtypes

# **Data Wrangling**

In [None]:
# We have pickup_datetime, dropoff_datetime of the type 'object'. Convert it into type 'datetime'.
data['pickup_datetime'] = pd.to_datetime(data['pickup_datetime'],errors = 'coerce')
data['dropoff_datetime'] = pd.to_datetime(data['dropoff_datetime'],errors = 'coerce')


In [None]:
# let us extract or add some new features from existing ones
data['pickup_weekday']=data['pickup_datetime'].dt.day_name()
data['dropoff_weekday']=data['dropoff_datetime'].dt.day_name()
data['pickup_weekday_num']=data['pickup_datetime'].dt.weekday
data['pickup_hour']=data['pickup_datetime'].dt.hour
data['month']=data['pickup_datetime'].dt.month

**1** - pickup_weekday and dropoff_weekday, which contain the name of the day on which the ride started and ended.

**2** - pickup_weekday_num which will contain the day number instead of characters with Monday=0 and Sunday=6.

**3** - pickup_hour with an hour of the day in the 24-hour format.

**4** - month with the pickup month number, with January=1 and December=12.

**Let's extract some other features from the dataset, using the longitude and latitude columns.**

In [None]:
# Importing the required library for the distance calculation
from geopy.distance import great_circle

In [None]:
# Let's create a function to compute the distance feature (in km) from the coordinates
def cal_distance(pickup_lat,pickup_long,dropoff_lat,dropoff_long):
 
 start_coordinates=(pickup_lat,pickup_long)
 stop_coordinates=(dropoff_lat,dropoff_long)
 
 return great_circle(start_coordinates,stop_coordinates).km
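
The great-circle distance used above can also be written out directly with the haversine formula. The following is a minimal sketch, not the notebook's method: `haversine_km` and `EARTH_RADIUS_KM` are hypothetical names, and the radius is an assumed mean value, so results should be close to, but not necessarily identical to, those from `geopy`.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius for this sketch

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between points given in decimal degrees.

    Accepts scalars or NumPy arrays, so whole columns can be passed at once.
    """
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    # Haversine term: squared half-chord length between the two points
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# Two toy trips: identical points, and one degree of longitude along the equator
demo = haversine_km(
    np.array([0.0, 0.0]), np.array([0.0, 0.0]),
    np.array([0.0, 0.0]), np.array([0.0, 1.0]),
)
print(demo)
```

Because the function works on arrays, it could be called with the four coordinate columns in one go, which is usually much faster than a row-wise `apply` on a dataset of this size.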

In [None]:
# Treating null values
data.dropna(inplace = True)

In [None]:
# Now applying the above function to dataset
data['distance'] = data.apply(lambda x: cal_distance(x['pickup_latitude'],x['pickup_longitude'],x['dropoff_latitude'],x['dropoff_longitude'] ), axis=1)

In [None]:
# Calculating the speed with speed = distance/time formula
data['speed'] = (data.distance/(data.trip_duration/3600))

**Let's define time segments of the day (morning, afternoon, and so on) based on the pickup hour.**

In [None]:
# Let's create a function to define the different time segments of a day.
def time_of_day(x):
    if x in range(6,12):
        return 'Morning'
    elif x in range(12,16):
        return 'Afternoon'
    elif x in range(16,22):
        return 'Evening'
    else:
        return 'Late night'

In [None]:
# Applying the above function to the dataset
data['pickup_timeofday']=data['pickup_hour'].apply(time_of_day)

# **Exploratory Data Analysis**

## **Univariate Analysis**

In [None]:
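# Let's look at the distribution of the target variable, trip_duration
plt.figure(figsize = (10,5))
# histplot replaces distplot, which is deprecated in recent seaborn releases
sns.histplot(data['trip_duration'], bins=50, kde=True, stat='density')
plt.xlabel('Trip Duration')
plt.show()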

**From the above plot we can see that the target is not normally distributed; it is right-skewed.**

In [None]:
# Applying a log transformation to the target variable
plt.figure(figsize = (10,5))
sns.histplot(np.log10(data['trip_duration']), bins=50, kde=True, stat='density')
plt.xlabel('log10(Trip Duration)')
plt.show()

**After the log transformation the target variable is approximately normally distributed.**
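
The effect of the log transformation can also be checked numerically with the sample skewness. This is only a sketch on synthetic log-normal data (the `durations` array is made up for illustration and is not the taxi dataset); values near zero indicate a roughly symmetric distribution.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
# Synthetic right-skewed "durations" in seconds; NOT the taxi data
durations = rng.lognormal(mean=6.5, sigma=0.8, size=10_000)

raw_skew = skew(durations)            # strongly positive for right-skewed data
log_skew = skew(np.log10(durations))  # close to zero after the log transform
print(f"skewness before log: {raw_skew:.2f}, after log: {log_skew:.2f}")
```

On the notebook's data the same two calls could be applied to `data['trip_duration']` to back up the visual impression from the plots.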

In [None]:
plt.figure(figsize = (10,5))
sns.boxplot(x=data['trip_duration'])
plt.xlabel('Trip Duration')
plt.show()

In [None]:
plt.figure(figsize = (10,5))
data.trip_duration.groupby(pd.cut(data.trip_duration, np.arange(1,7200,600))).count().plot(kind='bar')
plt.xlabel('Trip Duration (seconds)')
plt.ylabel('Trip Counts')
plt.show()

In [None]:
# Trips done according to different time segments of the day
plt.figure(figsize = (10,5))
sns.countplot(x="pickup_timeofday",data=data)
plt.title('Pickup Time of Day')
plt.xlabel('Parts of the Day')
plt.ylabel('Count')
plt.show()

**From the above graph we can clearly see that most trips take place in the evening segment.**

## **A Look at Passenger Count**

In [None]:
plt.figure(figsize = (10,5))
sns.countplot(x='passenger_count',data=data)
plt.ylabel('Count')
plt.xlabel('No. of Passengers')
plt.show()

In [None]:
# Let's remove the rows with zero passengers or more than six passengers.
data=data[data['passenger_count']!=0]
data=data[data['passenger_count']<=6]

In [None]:
# Re-plotting the above graph
plt.figure(figsize = (10,5))
sns.countplot(x='passenger_count',data=data)
plt.ylabel('Count')
plt.xlabel('No. of Passengers')
plt.show()

In [None]:
# Plotting the distance as a boxplot
plt.figure(figsize = (10,5))
sns.boxplot(y=data['distance'])
plt.ylabel('Distance Travelled (km)')
plt.show()

**There are some trips with over 100 km distance & some of the trips with 0 km distance.**

**These could be some possible reasons for a distance of 0 km:**

**The dropoff location couldn’t be tracked.**

**The passengers or driver cancelled the trip due to some issue.**

**Due to some technical issue in software, etc.**

# **Bivariate Analysis**

**Trip Duration per hour**



In [None]:
plt.figure(figsize = (10,5))
sns.lineplot(x='pickup_hour',y='trip_duration',data=data)
plt.xlabel('Time of Pickup (24hr format)')
plt.ylabel('Duration (seconds)')
plt.show()

**Trip duration peaks around 3 pm and is lowest around 6 am, when the streets are likely less busy.**

**Trip duration per weekday**

In [None]:
plt.figure(figsize = (10,5))
sns.lineplot(x='pickup_weekday_num',y='trip_duration',data=data)
plt.ylabel('Duration (seconds)')
plt.xlabel('Pickup Day of the Week (Monday=0)')
plt.show()

**Trip duration is longest on Thursday.**


**Trip duration per month**



In [None]:
plt.figure(figsize = (10,5))
sns.lineplot(x='month',y='trip_duration', data=data)
plt.ylabel('Duration (seconds)')
plt.xlabel('Month of Trip ')
plt.show()

**Trip duration is lowest in the 2nd month and highest in the 5th month.**

**Distance VS Hour**



In [None]:
plt.figure(figsize = (10,5))
sns.lineplot(y='distance',x='pickup_hour',data=data)
plt.ylabel('Distance')
plt.xlabel('Pickup Hour')
plt.show()

**Trip distance is highest during the early morning hours; from the evening onwards it increases gradually towards the late-night hours.**

**Distance VS Weekday**



In [None]:
plt.figure(figsize = (10,5))
sns.lineplot(x='pickup_weekday_num', y='distance',data=data)
plt.ylabel('Distance')
plt.xlabel('Pickup Day of the Week')
plt.show()

# **Feature Engineering**


**One Hot Encoding**



In [None]:
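# dtype=int keeps the dummy columns numeric, so they are picked up as numeric features later
dummy = pd.get_dummies(data.pickup_weekday, prefix='pickup_weekday', drop_first=True, dtype=int)
data = pd.concat([data, dummy], axis=1)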
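
To see what `drop_first=True` does, here is a small sketch on a toy column (the `toy` frame is made up for illustration). The first category in sorted order is dropped and becomes the implicit reference level, encoded as all zeros.

```python
import pandas as pd

# Toy weekday column; NOT the taxi data
toy = pd.DataFrame({"pickup_weekday": ["Monday", "Friday", "Sunday", "Monday"]})

dummies = pd.get_dummies(
    toy["pickup_weekday"], prefix="pickup_weekday", drop_first=True, dtype=int
)
# "Friday" sorts first, so it is dropped and represented by zeros in both columns
print(dummies)
```

Dropping one level avoids perfectly collinear dummy columns, which matters for the linear models fitted later.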

In [None]:
# Trip duration in hours
data['trip_duration_hour']=data['trip_duration']/3600

In [None]:
data=data.drop(['id','pickup_datetime', 'dropoff_datetime', 'pickup_weekday', 'dropoff_weekday', 'pickup_weekday_num', 'pickup_timeofday', 'trip_duration', 'speed'], axis=1)
data.head()

# **Correlation Analysis**

In [None]:
plt.figure(figsize=(20,10))
# numeric_only=True skips text columns such as store_and_fwd_flag
correlation = data.corr(numeric_only=True)
sns.heatmap(abs(correlation), annot=True, cmap='coolwarm')

In [None]:
numeric_features= data.describe().columns

In [None]:
# Use every numeric column except the target (trip_duration_hour) as a feature
features = [c for c in numeric_features if c != 'trip_duration_hour']
features

In [None]:
from scipy.stats import zscore
# Standardise each feature to zero mean and unit variance
X = data[features].apply(zscore)

In [None]:
# Target: log10 of the trip duration in hours
y = np.log10(data['trip_duration_hour'])
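
Because the target is the base-10 logarithm of the duration in hours, every prediction and error metric later in the notebook is on that log scale. A minimal sketch of mapping values back to seconds, using a hypothetical helper name `log_hours_to_seconds` and toy durations:

```python
import numpy as np

def log_hours_to_seconds(y_log10_hours):
    """Invert y = log10(trip_duration_seconds / 3600) back to seconds."""
    return np.power(10.0, np.asarray(y_log10_hours, dtype=float)) * 3600.0

# Round trip on toy durations (seconds); NOT the taxi data
seconds = np.array([300.0, 900.0, 3600.0])
y_toy = np.log10(seconds / 3600.0)
recovered = log_hours_to_seconds(y_toy)
print(recovered)
```

Applying the same inverse to model predictions would give durations in seconds, which is the unit of the original `trip_duration` column.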

# **Splitting the data into train and test sets**


**Splitting the data set in 75-25 split for training and testing purpose respectively.**



In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

**Recursive Feature Elimination (RFE)**

**RFE tries to select the features that are important and eliminates the features that are not important.**



In [None]:
# Importing RFE
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
lm =  LinearRegression()
rfe = RFE(lm, n_features_to_select=12)
rfe= rfe.fit(X_train, y_train)
rfe.support_

In [None]:
# Assigning the rfe features from X_train to col 
col= X_train.columns[rfe.support_]

# **Building Model** 

**Assigning remaining features after eliminating unimportant features from X_train.**

In [None]:
X_train_rfe= X_train[col]

In [None]:
import statsmodels.api as sm
# Adding a constant variable
X_train_rfe= sm.add_constant(X_train_rfe)

**Running the Linear Model.**



In [None]:
lm= sm.OLS(y_train, X_train_rfe).fit()

In [None]:
print(lm.summary())

# **Prediction**

In [None]:
y_pred_train= lm.predict(X_train_rfe)

In [None]:
X_test_rfe= X_test[col]
# Adding a constant variable
X_test_rfe= sm.add_constant(X_test_rfe)
y_pred_test= lm.predict(X_test_rfe)

# **Model Evaluation**

In [None]:
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error

**Train**

In [None]:
lr_train_mse  = mean_squared_error((y_train), (y_pred_train))
print("Train MSE :" , lr_train_mse)

lr_train_rmse = np.sqrt(lr_train_mse)

print("Train RMSE :" ,lr_train_rmse)

lr_train_r2 = r2_score((y_train), (y_pred_train))
print("Train R2 :" ,lr_train_r2) 

lr_train_r2_ = 1-(1-r2_score((y_train), (y_pred_train)))*((X_train.shape[0]-1)/(X_train.shape[0]-X_train.shape[1]-1))
print("Train Adjusted R2 : ",lr_train_r2_)

**Test**

In [None]:
lr_test_mse  = mean_squared_error((y_test), (y_pred_test))
print("Test MSE :" , lr_test_mse)

lr_test_rmse = np.sqrt(lr_test_mse)

print("Test RMSE :" ,lr_test_rmse)

lr_test_r2 = r2_score((y_test), (y_pred_test))
print("Test R2 :" ,lr_test_r2)

lr_test_r2_ = 1-(1-r2_score((y_test), (y_pred_test)))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1))
print("Test Adjusted R2 : ",lr_test_r2_)

**As we can clearly see the Linear regression model does not provide us with high accuracy**
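
The adjusted R2 computed in the evaluation cells follows `1 - (1 - R2) * (n - 1) / (n - p - 1)`, where `n` is the number of samples and `p` the number of features. A small sketch of a reusable helper, with the hypothetical name `adjusted_r2` and toy values:

```python
from sklearn.metrics import r2_score

def adjusted_r2(y_true, y_pred, n_features):
    """Adjusted R^2 for n = len(y_true) samples and p = n_features predictors."""
    n = len(y_true)
    r2 = r2_score(y_true, y_pred)
    return 1 - (1 - r2) * (n - 1) / (n - n_features - 1)

# Toy values; NOT results from the taxi models
y_true_toy = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y_pred_toy = [1.1, 1.9, 3.2, 3.8, 5.1, 5.9]
adj = adjusted_r2(y_true_toy, y_pred_toy, n_features=2)
print(adj)
```

Passing the number of features a model actually used (for example, the count of RFE-selected columns for the linear model) may be more appropriate than always using `X_train.shape[1]`.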

# **Running Lasso Regression**


In [None]:
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
# Cross validation
lasso = Lasso()
parameters = {'alpha': [1e-15,1e-13,1e-10,1e-8,1e-5,1e-4,1e-3,1e-2,1e-1,1,5,10,20,30,40,45,50,55,60,100]}
lasso_regressor = GridSearchCV(lasso, parameters, scoring='r2', cv=5)
lasso_regressor.fit(X_train, y_train)

In [None]:
print('The best fit alpha value is found out to be :', lasso_regressor.best_params_)
print('The R2 score using the same alpha is :', lasso_regressor.best_score_)

In [None]:
lasso_regressor.score(X_train, y_train)

In [None]:
y_pred_lasso_train = lasso_regressor.predict(X_train)
y_pred_lasso_test = lasso_regressor.predict(X_test)


# **Model Evaluation**


**Train**

In [None]:
lasso_train_mse  = mean_squared_error(y_train, y_pred_lasso_train)
print("Train MSE :" , lasso_train_mse)

lasso_train_rmse = np.sqrt(lasso_train_mse)
print("Train RMSE :" ,lasso_train_rmse)

lasso_train_r2 = r2_score(y_train, y_pred_lasso_train)
print("Train R2 :" ,lasso_train_r2)

lasso_train_r2_= 1-(1-r2_score(y_train, y_pred_lasso_train))*((X_train.shape[0]-1)/(X_train.shape[0]-X_train.shape[1]-1))
print("Train Adjusted R2 : ", lasso_train_r2_)

**Test**

In [None]:
lasso_test_mse  = mean_squared_error(y_test, y_pred_lasso_test)
print("Test MSE :" , lasso_test_mse)

lasso_test_rmse = np.sqrt(lasso_test_mse)
print("Test RMSE :" ,lasso_test_rmse)

lasso_test_r2 = r2_score(y_test, y_pred_lasso_test)
print("Test R2 :" ,lasso_test_r2)

lasso_test_r2_= 1-(1-r2_score(y_test, y_pred_lasso_test))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1))
print("Test Adjusted R2 : ", lasso_test_r2_)

**The Lasso regression model does not improve on the Linear model either.**



# **Running Ridge Regression**


In [None]:
from sklearn.linear_model import Ridge
# Cross validation
ridge = Ridge()
parameters = {'alpha': [1e-15,1e-13,1e-10,1e-8,1e-5,1e-4,1e-3,1e-2,1e-1,1,5,10,20,30,40,45,50,55,60,100]}
ridge_regressor = GridSearchCV(ridge, parameters, scoring='r2', cv=5)
ridge_regressor.fit(X_train, y_train)

In [None]:
print('The best fit alpha value is found out to be :' ,ridge_regressor.best_params_)
print('The R2 score using the same alpha is :', ridge_regressor.best_score_)

In [None]:
ridge_regressor.score(X_train, y_train)


In [None]:
y_pred_ridge_train=ridge_regressor.predict(X_train)
y_pred_ridge_test = ridge_regressor.predict(X_test)


# **Model Evaluation**


**Train**

In [None]:
ridge_train_mse  = mean_squared_error(y_train, y_pred_ridge_train)
print("Train MSE :" , ridge_train_mse)

ridge_train_rmse = np.sqrt(ridge_train_mse)
print("Train RMSE :" ,ridge_train_rmse)

ridge_train_r2 = r2_score(y_train, y_pred_ridge_train)
print("Train R2 :" ,ridge_train_r2)

ridge_train_r2_= 1-(1-r2_score(y_train, y_pred_ridge_train))*((X_train.shape[0]-1)/(X_train.shape[0]-X_train.shape[1]-1))
print("Train Adjusted R2 : ", ridge_train_r2_)

**Test**

In [None]:
ridge_test_mse  = mean_squared_error(y_test, y_pred_ridge_test)
print("Test MSE :" , ridge_test_mse)

ridge_test_rmse = np.sqrt(ridge_test_mse)
print("Test RMSE :" ,ridge_test_rmse)

ridge_test_r2 = r2_score(y_test, y_pred_ridge_test)
print("Test R2 :" ,ridge_test_r2)

ridge_test_r2_= 1-(1-r2_score(y_test, y_pred_ridge_test))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1))
print("Test Adjusted R2 : ", ridge_test_r2_)

**The Ridge regression model does not improve on the Linear model either.**



# **Running Decision Tree Regressor**


In [None]:
# Importing the required library
from sklearn.tree import DecisionTreeRegressor

In [None]:
# Maximum depth of trees
max_depth = [4,6,8,10]
 
# Minimum number of samples required to split a node
min_samples_split = [10,20,30]
 
# Minimum number of samples required at each leaf node
min_samples_leaf = [8,16,22]
 
# Hyperparameter Grid
param_dict_dt = {
              'max_depth' : max_depth,
              'min_samples_split' : min_samples_split,
              'min_samples_leaf' : min_samples_leaf}
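# Single-combination grid that is actually passed to GridSearchCV below
# (presumably the best parameters from an earlier full search), which keeps the runtime short
cache = {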
              'max_depth' : [10],
              'min_samples_split' : [10],
              'min_samples_leaf' : [22]}

In [None]:
param_dict_dt


In [None]:
dt = DecisionTreeRegressor()

# Grid search
dt_grid = GridSearchCV(estimator=dt,
                       param_grid = cache,
                       cv = 5, verbose=2, scoring='r2')

dt_grid.fit(X_train,y_train)

In [None]:
dt_grid.best_score_


In [None]:
dt_grid.best_estimator_


In [None]:
y_pred_dt_train=dt_grid.predict(X_train)
y_pred_dt_test=dt_grid.predict(X_test)


# **Model Evaluation**


**Train**

In [None]:
dt_train_mse  = mean_squared_error(y_train, y_pred_dt_train)
print("Train MSE :" , dt_train_mse)

dt_train_rmse = np.sqrt(dt_train_mse)
print("Train RMSE :" ,dt_train_rmse)

dt_train_r2 = r2_score(y_train, y_pred_dt_train)
print("Train R2 :" ,dt_train_r2)

dt_train_r2_= 1-(1-r2_score(y_train, y_pred_dt_train))*((X_train.shape[0]-1)/(X_train.shape[0]-X_train.shape[1]-1))
print("Train Adjusted R2 : ", dt_train_r2_)

**Test**

In [None]:
dt_test_mse  = mean_squared_error(y_test, y_pred_dt_test)
print("Test MSE :" , dt_test_mse)
dt_test_rmse = np.sqrt(dt_test_mse)
print("Test RMSE :" ,dt_test_rmse)

dt_test_r2 = r2_score(y_test, y_pred_dt_test)
print("Test R2 :" ,dt_test_r2)

dt_test_r2_= 1-(1-r2_score(y_test, y_pred_dt_test))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1))
print("Test Adjusted R2 : ", dt_test_r2_)

In [None]:
plt.figure(figsize=(10,5))
sns.histplot(y_test - y_pred_dt_test, bins=50, kde=True, stat='density')
plt.title('Error Term', fontsize=20)
plt.show()

**The decision tree with the selected hyperparameters does improve the predictions of the model considerably. It still isn't ideal but it is certainly much better than Linear models.**

# **Running XGBoost Regressor**


In [None]:
n_estimators = [80,150,200]
 
# Maximum depth of trees
max_depth = [5,8,10]
min_samples_split = [40,50]
learning_rate=[0.2,0.4,0.6]
 
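# Hyperparameter Grid
# Note: 'min_samples_' is not a documented XGBoost parameter, so it is not expected to affect the fit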
param_xgb = {'n_estimators' : n_estimators,
              'max_depth' : max_depth,
             'min_samples_' : min_samples_split,
             'learning_rate' : learning_rate
             }
cache = {'n_estimators' : [200],
              'max_depth' : [8],
             'min_samples_' : [40],
             'learning_rate' : [0.2],
             }

In [None]:
param_xgb


In [None]:
data.isnull().sum()

In [None]:
import xgboost as xgb
# 'gpu_hist' needs a GPU runtime; XGBoost 2.0+ spells this tree_method='hist', device='cuda'
xgb_model = xgb.XGBRegressor(tree_method = 'gpu_hist')

# Grid search
xgb_grid = GridSearchCV(estimator=xgb_model,param_grid=cache,cv=3,verbose=1,scoring="r2")

xgb_grid.fit(X_train,y_train)

In [None]:
xgb_grid.best_score_

In [None]:
xgb_grid.best_params_

In [None]:
y_pred_xgb_train=xgb_grid.predict(X_train)
y_pred_xgb_test=xgb_grid.predict(X_test)

# **Model Evaluation**


**Train**

In [None]:
xgb_train_mse  = mean_squared_error(y_train, y_pred_xgb_train)
print("Train MSE :" , xgb_train_mse)

xgb_train_rmse = np.sqrt(xgb_train_mse)
print("Train RMSE :" ,xgb_train_rmse)

xgb_train_r2 = r2_score(y_train, y_pred_xgb_train)
print("Train R2 :" ,xgb_train_r2)

xgb_train_r2_= 1-(1-r2_score((y_train), (y_pred_xgb_train)))*((X_train.shape[0]-1)/(X_train.shape[0]-X_train.shape[1]-1))
print("Train Adjusted R2 : ", xgb_train_r2_)

**Test**

In [None]:
xgb_test_mse  = mean_squared_error(y_test, y_pred_xgb_test)
print("Test MSE :" , xgb_test_mse)

xgb_test_rmse = np.sqrt(xgb_test_mse)
print("Test RMSE :" ,xgb_test_rmse)

xgb_test_r2 = r2_score(y_test, y_pred_xgb_test)
print("Test R2 :" ,xgb_test_r2)

xgb_test_r2_= 1-(1-r2_score((y_test), (y_pred_xgb_test)))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1))
print("Test Adjusted R2 : ", xgb_test_r2_)

**Model summary for train data**

In [None]:
models= ['Linear Regression', 'Lasso Regression', 'Ridge Regression','DecisionTree Regressor','XGBoost Regressor']
train_mse= [lr_train_mse, lasso_train_mse, ridge_train_mse, dt_train_mse, xgb_train_mse]
train_rmse= [lr_train_rmse, lasso_train_rmse, ridge_train_rmse, dt_train_rmse, xgb_train_rmse]
train_r2= [lr_train_r2, lasso_train_r2, ridge_train_r2, dt_train_r2, xgb_train_r2]
train_adjusted_r2= [lr_train_r2_, lasso_train_r2_, ridge_train_r2_, dt_train_r2_, xgb_train_r2_]

**Model summary for test data**

In [None]:
models= ['Linear Regression', 'Lasso Regression', 'Ridge Regression','DecisionTree Regressor','XGBoost Regressor']
test_mse= [lr_test_mse, lasso_test_mse, ridge_test_mse, dt_test_mse, xgb_test_mse]
test_rmse= [lr_test_rmse, lasso_test_rmse, ridge_test_rmse, dt_test_rmse, xgb_test_rmse]
test_r2= [lr_test_r2, lasso_test_r2, ridge_test_r2, dt_test_r2, xgb_test_r2]
test_adjusted_r2= [lr_test_r2_, lasso_test_r2_, ridge_test_r2_, dt_test_r2_, xgb_test_r2_]

In [None]:
Train_data_df=pd.DataFrame({'Model Name': models, 'Train MSE': train_mse, 'Train RMSE': train_rmse, 'Train R^2': train_r2, 
                            'Train Adjusted R^2': train_adjusted_r2})
Train_data_df
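
A matching table for the test metrics can be built the same way. The sketch below shows one possible helper, `metrics_row` (a hypothetical name), run on toy predictions only; the numbers it prints are not results from this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error, r2_score

def metrics_row(name, y_true, y_pred, n_features, split):
    """Collect MSE, RMSE, R^2 and adjusted R^2 for one model and one data split."""
    n = len(y_true)
    mse = mean_squared_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    return {
        "Model Name": name,
        f"{split} MSE": mse,
        f"{split} RMSE": np.sqrt(mse),
        f"{split} R^2": r2,
        f"{split} Adjusted R^2": 1 - (1 - r2) * (n - 1) / (n - n_features - 1),
    }

# Toy targets and predictions; NOT the taxi models
y_true_toy = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
toy_preds = {"Model A": y_true_toy + 0.1, "Model B": y_true_toy * 1.2}

test_df = pd.DataFrame(
    [metrics_row(name, y_true_toy, pred, 1, "Test") for name, pred in toy_preds.items()]
)
print(test_df)
```

In this notebook the `test_mse`, `test_rmse`, `test_r2` and `test_adjusted_r2` lists defined earlier could instead be passed straight to `pd.DataFrame`, mirroring how `Train_data_df` is built, so that train and test metrics can be compared side by side.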

# **Conclusion**

**MSE and RMSE, the metrics used to evaluate regression performance, vary little between training and testing for the Decision Tree and XGBoost regressors. Their R^2 is also about the same on both sets.**

**The Linear models don't perform well on either the training or the test set.**

**From the above table we can conclude that the XGBoost Regressor is the best of the models compared for predicting the duration of a taxi trip.**