# Machine Learning Model on NYC Taxi Trip Duration

## Agenda

1. [Dataset Description](#dataset)
2. [Import relevant packages](#package)
3. [Loading the data](#read)
4. [Data Wrangling/Explore the Dataset](#explore)
5. [Descriptive Statistics of the dataset](#describe)
6. [Exploratory Data Analysis(EDA)](#EDA)
    - 6.1. [Data Visualization](#visualize)
    - 6.2. [Feature Engineering](#feature)
7. [Splitting Dataset into Train and Test](#split)
8. [Learning Algorithm Selection](#algo)
    - 8.1 [Building Linear Regression Model](#logreg)
    - 8.2 [Building Decision Tree Regressor Model](#dt)
    - 8.3 [Building Random Forest Regressor Model](#rf)
    - 8.4 [Building AdaBoost Regressor Model](#ab)
    - 8.5 [Building GradientBoosting Regressor Model](#gb)
    - 8.6 [Building XGB Regressor Model](#xgb)
    - 8.7 [Building LGBM Regressor Model](#lgbm)
9. [Model Performance Assessment](#perform)
    - 9.1 [RMSE Score](#perform)
    - 9.2 [R2 Score](#perform)
    - 9.3 [Train and Test Score](#perform)
10. [Model Explainability](#explain)
    - 10.1 [Eli5](#eli5)
    - 10.2 [LIME](#lime)
    - 10.3 [SHAP](#shap)
11. [Closing Remarks](#close)

## 1. Dataset Description <a id='dataset'>

<p/>
The dataset contains records of taxi trips in New York City together with their durations. I will apply several data analysis techniques to gain insight into the data and to see how the different variables relate to the target variable, trip duration. My objective is to build a model that predicts the total duration of a taxi trip in New York City.


<p/>
<b>File Descriptions:</b>
<br/>
<b>taxi_train.csv</b> - the training set (contains 1458644 trip records)

<p/>
<b>Data Fields:</b>
<br/>
<b>id</b> - a unique identifier for each trip.<br/>
<b>vendor_id</b> - a code indicating the provider associated with the trip record <br/>
<b>pickup_datetime</b> - date and time when the meter was engaged. <br/>
<b>dropoff_datetime</b> - date and time when the meter was disengaged.<br/>
<b>passenger_count</b> - the number of passengers in the vehicle (driver entered value). <br/>
<b>pickup_longitude</b> - the longitude where the meter was engaged. <br/>
<b>pickup_latitude</b> - the latitude where the meter was engaged. <br/>
<b>dropoff_longitude</b> - the longitude where the meter was disengaged. <br/>
<b>dropoff_latitude</b> - the latitude where the meter was disengaged.<br/>
<b>store_and_fwd_flag</b> - This flag indicates whether the trip record was held in vehicle memory before sending to the vendor because the vehicle did not have a connection to the server - Y=store and forward; N=not a store and forward trip.<br/>
<b>trip_duration</b> - duration of the trip in seconds.<br/>

## 2. Import relevant packages <a id='package'>

In [None]:
import numpy as np
import pandas as pd
import dask.dataframe as dd
import pandas_profiling
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import plotly_express as px
import time
import random 
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.model_selection import train_test_split
from pylab import rcParams
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import AdaBoostRegressor
from sklearn.ensemble import GradientBoostingRegressor
from xgboost.sklearn import XGBRegressor
import lightgbm as lgb
from vecstack import stacking
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn import metrics
import eli5
from eli5.sklearn import PermutationImportance
import lime
import lime.lime_tabular
import shap
import geopandas as gpd
from shapely.geometry import Point,Polygon
import descartes

## 3. Loading the data <a id='read'>

We'll be working with a dataset that was used in a Kaggle competition ([data dictionary](https://www.kaggle.com/c/nyc-taxi-trip-duration/overview)).

In [None]:
#Read data through Pandas, parsing the pickup and dropoff columns as datetimes
df_taxi = pd.read_csv('../input/nyc-taxi-trip-duration/train.zip',parse_dates=['pickup_datetime','dropoff_datetime'],infer_datetime_format=True)

## 4. Data Wrangling/Explore the Dataset <a id='explore'>

In [None]:
#Getting the head of the dataset
df_taxi.head(5)

In [None]:
#Shape of the dataset
df_taxi.shape

**We have 1458644 observations and 11 columns: 10 features plus our target variable, trip_duration.**

In [None]:
#Data Type of features for dataset
df_taxi.dtypes

In [None]:
#Info of the dataset
df_taxi.info()

In [None]:
#Checking for null values in dataset
df_taxi.isnull().sum()

**Great! No missing values in the dataset.**

In [None]:
#Checking for duplicate values in id
df_taxi[df_taxi.duplicated(['id'], keep=False)]

**No duplicates in id, which is the trip id!**

In [None]:
#Checking Date and Time range
print('Datetime range: {} to {}'.format(df_taxi.pickup_datetime.min(),df_taxi.dropoff_datetime.max()))

**The data covers 6 full months, from January 2016 to June 2016!**

In [None]:
#Checking no. of vendors
df_taxi['vendor_id'].value_counts()

In [None]:
#Checking Passenger count
print('Passenger Count: {} to {}'.format(df_taxi.passenger_count.min(),df_taxi.passenger_count.max()))

In [None]:
#Number of unique pickup and dropoff timestamps
print(df_taxi['pickup_datetime'].nunique())
print(df_taxi['dropoff_datetime'].nunique())

**Both columns contain many distinct pickup and dropoff timestamps.**

In [None]:
#Performing Pandas profiling to understand quick overview of columns
report = pandas_profiling.ProfileReport(df_taxi)
#Converting the profile report to an html file
report.to_file('taxi_train.html')

from IPython.display import display,HTML,IFrame
display(HTML(open('taxi_train.html').read()))  

## 5. Descriptive Statistics of the dataset <a id='describe'> 

In [None]:
#Summary statistics for the dataset
df_taxi.describe()

## 6. Exploratory Data Analysis(EDA) <a id='EDA'> 

### Feature Engineering & Data Visualization<a id='feature'>

#### Let's have a look at the distribution of various variables in the dataset.

In [None]:
#Passenger Count
sns.distplot(df_taxi['passenger_count'],kde=False)
plt.title('Distribution of Passenger Count')
plt.show()

**Here we can see that most trips carry 1 or 2 passengers. Large groups travelling together are rare.**

#### Let's create some features from the datetime stamps.

In [None]:
#Creating pickup and dropoff day
df_taxi['pickup_day']=df_taxi['pickup_datetime'].dt.day_name()
df_taxi['dropoff_day']=df_taxi['dropoff_datetime'].dt.day_name()

In [None]:
#Creating pickup and dropoff month
df_taxi['pickup_month']=df_taxi['pickup_datetime'].dt.month
df_taxi['dropoff_month']=df_taxi['dropoff_datetime'].dt.month

In [None]:
#Creating pickup and dropoff hour
df_taxi['pickup_hour']=df_taxi['pickup_datetime'].dt.hour
df_taxi['dropoff_hour']=df_taxi['dropoff_datetime'].dt.hour
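
The three cells above rely on the pandas `.dt` accessor. As a small illustration, the sketch below applies the same accessors to two made-up timestamps (the `toy` frame is hypothetical and not part of the dataset) so the resulting values are easy to check by eye.

```python
import pandas as pd

# Two made-up pickup timestamps, used only for illustration
toy = pd.DataFrame({
    'pickup_datetime': pd.to_datetime(['2016-01-01 17:24:55',
                                       '2016-06-30 23:59:39'])
})

# Same accessors as in the notebook cells above
toy['pickup_day'] = toy['pickup_datetime'].dt.day_name()   # weekday name
toy['pickup_month'] = toy['pickup_datetime'].dt.month      # 1 to 12
toy['pickup_hour'] = toy['pickup_datetime'].dt.hour        # 0 to 23

print(toy)
```

The accessors only work on datetime columns, which is why the dates are parsed when the file is read.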

In [None]:
df_taxi.head(2)

In [None]:
#Plotting monthly Pickup and Dropoff trip distribution
figure,ax=plt.subplots(nrows=1,ncols=2,figsize=(15,4))
sns.countplot(x='pickup_month',data=df_taxi,ax=ax[0])
ax[0].set_title('The distribution of number of pickups each month')
sns.countplot(x='dropoff_month',data=df_taxi,ax=ax[1])
ax[1].set_title('The distribution of number of dropoffs each month')
plt.tight_layout()

**There is not much difference between the pickup and dropoff month distributions.**

In [None]:
#Plotting daily Pickup and Dropoff trip distribution
figure,ax=plt.subplots(nrows=1,ncols=2,figsize=(15,4))
sns.countplot(x='pickup_day',data=df_taxi,ax=ax[0])
ax[0].set_title('The distribution of number of pickups each day')
sns.countplot(x='dropoff_day',data=df_taxi,ax=ax[1])
ax[1].set_title('The distribution of number of dropoffs each day')
plt.tight_layout()

**We can see most trips were taken on Friday & least trips were taken on Monday.**

In [None]:
#Plotting hourly Pickup and Dropoff trip distribution
figure,ax=plt.subplots(nrows=1,ncols=2,figsize=(20,5))
sns.countplot(x='pickup_hour',data=df_taxi,ax=ax[0])
ax[0].set_title('The distribution of number of pickups each hour')
sns.countplot(x='dropoff_hour',data=df_taxi,ax=ax[1])
ax[1].set_title('The distribution of number of dropoffs each hour')
plt.tight_layout()

**Both distributions look quite similar; the majority of trips were taken between 6 PM and 10 PM.**

In [None]:
#Creating a new column according to the traffic scenario of New York
def rush_hour(hour):
    #hour is an integer from 0 to 23
    if 7 <= hour <= 9:
        return 'rush_hour_morning(7-9)'
    elif 9 < hour < 16:
        return 'normal_hour_afternoon(9-16)'
    elif 16 <= hour <= 19:
        return 'rush_hour_evening(16-19)'
    elif 19 < hour <= 23:
        return 'normal_hour_evening(19-23)'
    else:
        #the remaining hours are 0 to 6, i.e. after midnight
        return 'latenight(23 onwards)'
#Applying the function to each hour value of the Series (no row-wise DataFrame apply needed)
df_taxi['traffic_scenerio_pickup']=df_taxi['pickup_hour'].apply(rush_hour)
df_taxi['traffic_scenerio_dropoff']=df_taxi['dropoff_hour'].apply(rush_hour)
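
Calling a Python function per row is slow on more than a million trips. A possible alternative, sketched here under the assumption that the same hour boundaries are wanted, is `pd.cut`, which bins the whole column in one call. The label names and the `hours` series are hypothetical and chosen only for this example.

```python
import pandas as pd

# Made-up hours covering every boundary of the buckets
hours = pd.Series([0, 6, 7, 9, 10, 15, 16, 19, 20, 23])

# Right-inclusive bins: (-1, 6], (6, 9], (9, 15], (15, 19], (19, 23]
bins = [-1, 6, 9, 15, 19, 23]
labels = ['late_night', 'rush_hour_morning', 'normal_hour_afternoon',
          'rush_hour_evening', 'normal_hour_evening']  # hypothetical names

traffic = pd.cut(hours, bins=bins, labels=labels)
print(pd.DataFrame({'hour': hours, 'traffic': traffic}))
```

The result is a categorical column, which can be passed to `pd.get_dummies` in the same way as the string labels.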

In [None]:
#Plotting pickup and dropoff trip distribution as per traffic scenario
figure,ax=plt.subplots(nrows=1,ncols=2,figsize=(20,5))
sns.countplot(x='traffic_scenerio_pickup',data=df_taxi,ax=ax[0])
ax[0].set_title('The distribution of number of pickups as per traffic scenario')
sns.countplot(x='traffic_scenerio_dropoff',data=df_taxi,ax=ax[1])
ax[1].set_title('The distribution of number of dropoffs as per traffic scenario')
plt.tight_layout()

#### Distribution of the trip duration

In [None]:
sns.distplot(df_taxi['trip_duration'],kde=True)
plt.title('The distribution of trip duration')

**This histogram is extremely right-skewed, which suggests the presence of outliers. Let's look at a boxplot of this variable.**

In [None]:
sns.boxplot(x=df_taxi['trip_duration']) #passing the data as x draws a horizontal boxplot
plt.title('A boxplot depicting the trip duration distribution')

**We can see a few outliers, which we have to treat.**

In [None]:
#Dropping trip_duration <1 min
df_taxi= df_taxi[df_taxi.trip_duration>60] # >1 min

In [None]:
#Dropping trip_duration >2 Hrs
df_taxi= df_taxi[df_taxi.trip_duration<=7200] # keep trips of at most 2 hrs

**Removed trips shorter than 1 minute or longer than 2 hours, since it does not seem sensible to hire a taxi for less than a minute or more than 2 hours in a city like New York!**

#### Distribution of vendor_id

In [None]:
sns.countplot(x='vendor_id',data=df_taxi)

**The trip counts of the vendors are not much different.**

#### Analysing geographical boundary of NYC.

In [None]:
#Checking Longitude and Latitude bounds available in the data
print('Longitude Bounds: {} to {}'.format(min(df_taxi.pickup_longitude.min(),df_taxi.dropoff_longitude.min()),max(df_taxi.pickup_longitude.max(),df_taxi.dropoff_longitude.max())))
print('Latitude Bounds: {} to {}'.format(min(df_taxi.pickup_latitude.min(),df_taxi.dropoff_latitude.min()),max(df_taxi.pickup_latitude.max(),df_taxi.dropoff_latitude.max())))

In [None]:
#The borders of NY City, in coordinates comes out to be: city_long_border = (-74.03, -73.75) & city_lat_border = (40.63, 40.85)
#Comparing this to our 'df_taxi.describe()' output we see that there are some coordinate points (pick ups/drop offs) that fall outside these borders. So let's limit our area of investigation to within the NY City borders.
df_taxi = df_taxi[df_taxi['pickup_longitude'] <= -73.75]
df_taxi = df_taxi[df_taxi['pickup_longitude'] >= -74.03]
df_taxi = df_taxi[df_taxi['pickup_latitude'] <= 40.85]
df_taxi = df_taxi[df_taxi['pickup_latitude'] >= 40.63]
df_taxi = df_taxi[df_taxi['dropoff_longitude'] <= -73.75]
df_taxi = df_taxi[df_taxi['dropoff_longitude'] >= -74.03]
df_taxi = df_taxi[df_taxi['dropoff_latitude'] <= 40.85]
df_taxi = df_taxi[df_taxi['dropoff_latitude'] >= 40.63]

**Limited the data to trips whose pickup and dropoff points fall within the New York City longitude and latitude borders!**
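
The eight filters above can also be written as one boolean mask. The sketch below shows the idea on a made-up three-row frame; the helper `in_city` and the `toy` data are hypothetical, while the border values are the ones quoted in the cell above.

```python
import pandas as pd

# City borders as stated in the notebook
LON_MIN, LON_MAX = -74.03, -73.75
LAT_MIN, LAT_MAX = 40.63, 40.85

# Made-up trips: row 0 is inside, row 1 drops off too far north,
# row 2 is picked up too far west
toy = pd.DataFrame({
    'pickup_longitude':  [-73.98, -73.98, -75.00],
    'pickup_latitude':   [40.75, 40.75, 40.75],
    'dropoff_longitude': [-73.99, -73.99, -73.99],
    'dropoff_latitude':  [40.70, 41.20, 40.70],
})

def in_city(df, prefix):
    """True where the <prefix> point lies inside the bounding box (borders included)."""
    return (df[f'{prefix}_longitude'].between(LON_MIN, LON_MAX)
            & df[f'{prefix}_latitude'].between(LAT_MIN, LAT_MAX))

mask = in_city(toy, 'pickup') & in_city(toy, 'dropoff')
inside = toy[mask]
print(f'kept {len(inside)} of {len(toy)} trips')
```

`Series.between` includes both endpoints by default, which matches the `<=` and `>=` comparisons used in the notebook.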

In [None]:
#Getting distance (in km) from geographical co-ordinates
from math import radians, sin, cos, sqrt, asin
def haversine(columns):
    lat1, lon1, lat2, lon2 = columns
    R = 6372.8 # Earth radius in kilometers
    
    dLat = radians(lat2 - lat1)
    dLon = radians(lon2 - lon1)
    lat1 = radians(lat1)
    lat2 = radians(lat2)
    
    a = sin(dLat/2)**2 + cos(lat1)*cos(lat2)*sin(dLon/2)**2
    c = 2*asin(sqrt(a))
    
    return R * c

cols = ['pickup_latitude','pickup_longitude','dropoff_latitude','dropoff_longitude']
distances = df_taxi[cols].apply(lambda x: haversine(x),axis = 1)
df_taxi['distance_km'] = distances.copy()
df_taxi['distance_km'] = round(df_taxi.distance_km,2)
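
Applying `haversine` row by row can take a while on this many trips. A vectorized variant with NumPy, sketched below, computes the same great-circle formula on whole arrays at once. The function name `haversine_np` and the sample coordinates are assumptions made for this illustration; the Earth radius is the value used in the cell above.

```python
import numpy as np

def haversine_np(lat1, lon1, lat2, lon2, radius_km=6372.8):
    """Great-circle distance in km between points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * radius_km * np.arcsin(np.sqrt(a))

# Made-up points: identical points, then one degree of latitude apart
lat1 = np.array([40.75, 40.0])
lon1 = np.array([-73.98, -73.98])
lat2 = np.array([40.75, 41.0])
lon2 = np.array([-73.98, -73.98])

dist = haversine_np(lat1, lon1, lat2, lon2)
print(dist)
```

With the notebook's frame, the four coordinate columns of `df_taxi` could be passed in place of the sample arrays, and the result assigned to `distance_km`.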

In [None]:
sns.scatterplot(x='distance_km',y='trip_duration',data=df_taxi)

In [None]:
#Removing distance Outliers
df_taxi = df_taxi[df_taxi['distance_km'] > 0]

**Removed trips with a distance of 0, which seem to be cancelled trips.**

In [None]:
#Getting Speed(Km/h) of the taxi 
df_taxi['speed_km/h']= 3600*(df_taxi.distance_km/df_taxi.trip_duration)  #3600 to convert it from km/s to km/h

In [None]:
#Checking Distance and Speed range
print('Distance Bounds: {} to {}'.format(df_taxi.distance_km.min(),df_taxi.distance_km.max()))
print('Speed Bounds: {} to {}'.format(df_taxi['speed_km/h'].min(),df_taxi['speed_km/h'].max()))

In [None]:
#Removing speed Outliers
df_taxi = df_taxi[df_taxi['speed_km/h'] > 0]
df_taxi = df_taxi[df_taxi['speed_km/h'] < 100]

**Removed trips with an average speed of zero or of 100 km/h and above, as they seem to be outliers.**

In [None]:
#Dropping passenger count=0
df_taxi= df_taxi[df_taxi.passenger_count>0]
df_taxi['passenger_count'].value_counts()

In [None]:
#Plotting Trip Distribution
plt.figure(figsize=(10,6))
plt.hist(df_taxi.trip_duration, bins=100)
plt.xlabel('Trip_duration')
plt.ylabel('Number of Trips')
plt.title('Trip Distribution')
plt.show()

**The distribution is skewed, so we can apply a transform such as the log transform!**

In [None]:
#Applying a log transform (log1p) to the trip_duration column to reduce the skewness
df_taxi['log_trip_duration']= np.log1p(df_taxi['trip_duration'])
plt.hist(df_taxi['log_trip_duration'].values, bins=100)
plt.title('Log Trip Distribution')
plt.xlabel('log(trip_duration)')
plt.ylabel('Number of Trips')
plt.show()
sns.distplot(df_taxi["log_trip_duration"], bins =100)
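
`np.log1p(x)` computes log(1 + x), and `np.expm1` is its inverse. If a model were trained on the log-transformed target, its predictions would have to be mapped back to seconds with `np.expm1` before computing an error in the original units. The sketch below shows the round trip on made-up durations.

```python
import numpy as np

# Made-up trip durations in seconds
durations = np.array([61.0, 600.0, 7200.0])

log_durations = np.log1p(durations)   # log(1 + x), compresses the long right tail
recovered = np.expm1(log_durations)   # exp(y) - 1, back to seconds

print(log_durations)
print(recovered)
```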

In [None]:
#Visualizing Passenger road map for picking up
fig, ax = plt.subplots(ncols=1, nrows=1,figsize=(10,8))
plt.ylim(40.63, 40.85)
plt.xlim(-74.03,-73.75)
ax.scatter(df_taxi['pickup_longitude'],df_taxi['pickup_latitude'], s=0.02, alpha=1)

In [None]:
#Visualizing Passenger road map for dropoff
fig, ax = plt.subplots(ncols=1, nrows=1,figsize=(10,8))
plt.ylim(40.63, 40.85)
plt.xlim(-74.03,-73.75)
ax.scatter(df_taxi['dropoff_longitude'],df_taxi['dropoff_latitude'], s=0.02, alpha=1)

In [None]:
#Converting Data to Geo Dataframe for pickup 
gdf=gpd.GeoDataFrame(df_taxi,geometry=gpd.points_from_xy(df_taxi['pickup_longitude'],df_taxi['pickup_latitude']))

In [None]:
#Geometry point has been generated for pickup
gdf.head(2)

In [None]:
#Visualizing pickup points with geopandas
gdf.plot(figsize=(12,10))

In [None]:
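#Getting New York City map from Geopandas
#Note: gpd.datasets was removed in GeoPandas 1.0; on newer versions the same file is available via the geodatasets package (geodatasets.get_path('nybb'))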
nyc = gpd.read_file(gpd.datasets.get_path('nybb'))
ax = nyc.plot(figsize=(12, 10))

**We can overlay the data points on the map to get a better idea!**

##### Categorical Encoding - One Hot Encoding

In [None]:
#Applying one hot encoding to the categorical variables
taxi_vendor=pd.get_dummies(df_taxi['vendor_id'], prefix='vendor_id',drop_first= True)
taxi_pax=pd.get_dummies(df_taxi['passenger_count'], prefix='passenger',drop_first= True)
taxi_store_and_fwd_flag=pd.get_dummies(df_taxi['store_and_fwd_flag'], prefix='store_and_fwd_flag',drop_first= True)
taxi_pickup_day=pd.get_dummies(df_taxi['pickup_day'], prefix='pickup_day',drop_first= True)
taxi_dropoff_day=pd.get_dummies(df_taxi['dropoff_day'], prefix='dropoff_day',drop_first= True)
taxi_pickup_month=pd.get_dummies(df_taxi['pickup_month'], prefix='pickup_month',drop_first= True)
taxi_dropoff_month=pd.get_dummies(df_taxi['dropoff_month'], prefix='dropoff_month',drop_first= True)
taxi_pickup_traffic_scenerio=pd.get_dummies(df_taxi['traffic_scenerio_pickup'], prefix='pickup_',drop_first= True)
taxi_dropoff_traffic_scenerio=pd.get_dummies(df_taxi['traffic_scenerio_dropoff'], prefix='dropoff_',drop_first= True)

In [None]:
#Adding encoded columns to final data
df_taxi=pd.concat([df_taxi,taxi_pax,taxi_vendor,taxi_store_and_fwd_flag,taxi_pickup_day,taxi_dropoff_day,taxi_pickup_month,taxi_dropoff_month,taxi_pickup_traffic_scenerio,taxi_dropoff_traffic_scenerio],axis=1)
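
`pd.get_dummies` can also encode several columns of a frame in one call through its `columns` argument. In that case the original columns are replaced by their dummies, so no separate `concat` is needed. The sketch below uses a made-up three-row frame (`toy` is hypothetical) to show the resulting column names.

```python
import pandas as pd

# Made-up rows with two categorical columns and one numeric column
toy = pd.DataFrame({
    'vendor_id': [1, 2, 2],
    'store_and_fwd_flag': ['N', 'Y', 'N'],
    'distance_km': [1.2, 3.4, 0.8],
})

# drop_first=True drops the first level of each encoded column
encoded = pd.get_dummies(toy, columns=['vendor_id', 'store_and_fwd_flag'],
                         drop_first=True)
print(encoded.columns.tolist())
```

The prefix of each dummy column defaults to the name of the source column.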

In [None]:
df_taxi.head(2)

In [None]:
#Dropping unnecessary columns from dataset
df_taxi=df_taxi.drop(['id','vendor_id','passenger_count','pickup_datetime','dropoff_datetime','pickup_longitude','pickup_latitude','dropoff_longitude','dropoff_latitude','log_trip_duration','speed_km/h','store_and_fwd_flag','traffic_scenerio_pickup','traffic_scenerio_dropoff','pickup_month','dropoff_month','pickup_day','dropoff_day','pickup_hour','dropoff_hour','geometry','dropoff_month_7'],axis=1)

In [None]:
df_taxi.columns

## 7. Splitting Dataset into Train and Test <a id='split'> 

In [None]:
#Assigning X and y variables
X = df_taxi.drop('trip_duration',axis=1)
y = df_taxi['trip_duration']

In [None]:
#Splitting the dataset into train and test
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.25,random_state=0)

## 8. Learning Algorithm Selection<a id='algo'> 

### 8.1. Building Linear Regression Model <a id='logreg'>

In [None]:
lr = LinearRegression()
lr.fit(X_train,y_train)
lr_pred = lr.predict(X_test)

In [None]:
#RMSE score 
lr_rmse = np.sqrt(metrics.mean_squared_error(lr_pred,y_test))
lr_rmse

In [None]:
#R2 score
lr_r2score = metrics.r2_score(y_test,lr_pred) #r2_score expects (y_true, y_pred)
lr_r2score
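
Unlike the mean squared error, the R2 score is not symmetric in its arguments, and `metrics.r2_score` takes the true values first and the predictions second. The sketch below re-implements the definition with NumPy on made-up numbers to show that swapping the two changes the result. The function `r2` is a hypothetical helper written for this illustration.

```python
import numpy as np

def r2(y_true, y_pred):
    """R2 = 1 - SS_res / SS_tot, with SS_tot taken around the mean of y_true."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot

# Made-up values
y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]

correct_order = r2(y_true, y_pred)
swapped_order = r2(y_pred, y_true)
print(correct_order, swapped_order)
```

With the arguments in the right order, the value matches what the `score` method of a scikit-learn regressor reports on the same data.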

In [None]:
#Train Score
lr_train=lr.score(X_train,y_train)
lr_train

In [None]:
#Test Score
lr_test=lr.score(X_test,y_test)
lr_test

In [None]:
#Null RMSE
y_null=np.zeros_like(y_test,dtype=float)
y_null.fill(y_test.mean())
np.sqrt(metrics.mean_squared_error(y_test,y_null))
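
The null RMSE is the error of a model that always predicts the mean of the target, so any useful model should score below it. It equals the population standard deviation of the target, as the sketch below checks on made-up durations.

```python
import numpy as np

# Made-up durations in seconds
y_toy = np.array([300.0, 600.0, 900.0, 1500.0])

# Constant prediction: the mean of the target
y_null = np.full_like(y_toy, y_toy.mean())
null_rmse = np.sqrt(np.mean((y_toy - y_null) ** 2))

print(null_rmse, y_toy.std())
```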

In [None]:
coef1 = pd.DataFrame(lr.coef_,index=X_train.columns)
coef1.plot(kind='bar', title='Model Coefficients')

**The coefficient plot shows that most columns contribute to the regression model, while a few columns have little effect.**

### 8.2. Building Decision Tree Regressor Model <a id='dt'>

In [None]:
dt=DecisionTreeRegressor()
dt.fit(X_train,y_train)
dt_pred=dt.predict(X_test)

In [None]:
#RMSE score 
dt_rmse = np.sqrt(metrics.mean_squared_error(dt_pred,y_test))
dt_rmse

In [None]:
#R2 score
dt_r2score = metrics.r2_score(y_test,dt_pred) #r2_score expects (y_true, y_pred)
dt_r2score

In [None]:
#Train Score
dt_train=dt.score(X_train,y_train)
dt_train

In [None]:
#Test Score
dt_test=dt.score(X_test,y_test)
dt_test

### 8.3. Building Random Forest Regressor Model <a id='rf'>

In [None]:
rf=RandomForestRegressor()
rf.fit(X_train,y_train)
rf_pred=rf.predict(X_test)

In [None]:
#RMSE score 
rf_rmse = np.sqrt(metrics.mean_squared_error(rf_pred,y_test))
rf_rmse

In [None]:
#R2 score
rf_r2score = metrics.r2_score(y_test,rf_pred) #r2_score expects (y_true, y_pred)
rf_r2score

In [None]:
#Train Score
rf_train=rf.score(X_train,y_train)
rf_train

In [None]:
#Test Score
rf_test=rf.score(X_test,y_test)
rf_test

### 8.4. Building AdaBoost Regressor Model <a id='ab'>

In [None]:
ab=AdaBoostRegressor()
ab.fit(X_train,y_train)
ab_pred=ab.predict(X_test)

In [None]:
#RMSE score 
ab_rmse = np.sqrt(metrics.mean_squared_error(ab_pred,y_test))
ab_rmse

In [None]:
#R2 score
ab_r2score = metrics.r2_score(y_test,ab_pred) #r2_score expects (y_true, y_pred)
ab_r2score

In [None]:
#Train Score
ab_train=ab.score(X_train,y_train)
ab_train

In [None]:
#Test Score
ab_test=ab.score(X_test,y_test)
ab_test

### 8.5. Building GradientBoosting Regressor Model <a id='gb'>

In [None]:
gb = GradientBoostingRegressor()
gb.fit(X_train,y_train)
gb_pred = gb.predict(X_test)

In [None]:
#RMSE score 
gb_rmse = np.sqrt(metrics.mean_squared_error(gb_pred,y_test))
gb_rmse

In [None]:
#R2 score
gb_r2score = metrics.r2_score(y_test,gb_pred) #r2_score expects (y_true, y_pred)
gb_r2score

In [None]:
#Train Score
gb_train=gb.score(X_train,y_train)
gb_train

In [None]:
#Test Score
gb_test=gb.score(X_test,y_test)
gb_test

### 8.6. Building XGB Regressor Model <a id='xgb'>

In [None]:
xgb= XGBRegressor()
xgb.fit(X_train,y_train)
xgb_pred=xgb.predict(X_test)

In [None]:
#RMSE score 
xgb_rmse = np.sqrt(metrics.mean_squared_error(xgb_pred,y_test))
xgb_rmse

In [None]:
#R2 score
xgb_r2score = metrics.r2_score(y_test,xgb_pred) #r2_score expects (y_true, y_pred)
xgb_r2score

In [None]:
#Train Score
xgb_train=xgb.score(X_train,y_train)
xgb_train

In [None]:
#Test Score
xgb_test=xgb.score(X_test,y_test)
xgb_test

### 8.7. Building LGBM Regressor Model <a id='lgbm'>

In [None]:
lgbm = lgb.LGBMRegressor()
lgbm.fit(X_train,y_train)
lgbm_pred = lgbm.predict(X_test)

In [None]:
#RMSE score 
lgbm_rmse = np.sqrt(metrics.mean_squared_error(lgbm_pred,y_test))
lgbm_rmse

In [None]:
#R2 score
lgbm_r2score = metrics.r2_score(y_test,lgbm_pred) #r2_score expects (y_true, y_pred)
lgbm_r2score

In [None]:
#Train Score
lgbm_train=lgbm.score(X_train,y_train)
lgbm_train

In [None]:
#Test Score
lgbm_test=lgbm.score(X_test,y_test)
lgbm_test

## 9. Model Performance Assessment <a id='perform'> 

In [None]:
#Creating dictionary for all the metrics and models
#(named model_metrics so that it does not overwrite the imported sklearn metrics module)
model_metrics = {'Metrics': ['RMSE Score','R2 Score','Train Score','Test Score'],'Linear Regression':[lr_rmse,lr_r2score,lr_train,lr_test],
          'Decision Tree Regressor':[dt_rmse,dt_r2score,dt_train,dt_test],'Random Forest Regressor':[rf_rmse,rf_r2score,rf_train,rf_test],
        'AdaBoost Regressor':[ab_rmse,ab_r2score,ab_train,ab_test],
          'GradientBoosting Regressor':[gb_rmse,gb_r2score,gb_train,gb_test],'XGBoost Regressor':[xgb_rmse,xgb_r2score,xgb_train,xgb_test],
           'LGBM Regressor':[lgbm_rmse,lgbm_r2score,lgbm_train,lgbm_test]}

In [None]:
#Converting dictionary to dataframe
model_metrics = pd.DataFrame(model_metrics)
model_metrics
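
The fit, predict and score steps repeated for each model in section 8 could also be collected in a small helper that returns the four metrics. The sketch below is one possible shape for such a helper; it runs on synthetic data, and the function name `evaluate` and the data are assumptions made for this illustration, not the notebook's results.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data, only to make the sketch runnable
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=500)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

def evaluate(model):
    """Fit the model and return the four metrics used in this notebook."""
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    return {
        'RMSE Score': np.sqrt(mean_squared_error(y_test, pred)),
        'R2 Score': r2_score(y_test, pred),
        'Train Score': model.score(X_train, y_train),
        'Test Score': model.score(X_test, y_test),
    }

models = {
    'Linear Regression': LinearRegression(),
    'Decision Tree Regressor': DecisionTreeRegressor(random_state=0),
}

# One column per model, one row per metric
summary = pd.DataFrame({name: evaluate(m) for name, m in models.items()})
print(summary)
```

Adding another model to the comparison then only requires a new entry in the `models` dictionary.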

**Looking at the performance matrix above, we can say that XGBoost is the best model for this dataset. We may perform hyperparameter tuning on the XGBoost model to improve its performance.**

## 10. Model Explainability <a id='explain'> 

### 10.1. Eli5<a id='eli5'>

In [None]:
#Finding the importance of columns for prediction
perm = PermutationImportance(xgb, random_state=1).fit(X_test,y_test) #scoring against the true test targets
eli5.show_weights(perm, feature_names = X_test.columns.tolist())
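
Permutation importance measures how much a model's error grows when the values of one column are shuffled, which breaks the link between that column and the target. The sketch below shows the idea with NumPy only. It uses synthetic data and a least-squares fit as a stand-in model, so it is an illustration of the technique and not the eli5 implementation. For brevity it scores on the same data it was fitted on; a held-out set would normally be used.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
X = rng.normal(size=(n, 2))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=n)  # only column 0 drives y

# Stand-in model: ordinary least squares with an intercept
A = np.column_stack([X, np.ones(n)])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

def predict(X_):
    return np.column_stack([X_, np.ones(len(X_))]) @ coef

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

baseline = rmse(y, predict(X))

importances = []
for j in range(X.shape[1]):
    X_perm = X.copy()
    X_perm[:, j] = rng.permutation(X_perm[:, j])  # shuffle one column
    importances.append(rmse(y, predict(X_perm)) - baseline)

print(importances)
```

Shuffling the informative column raises the error sharply, while shuffling the unrelated column leaves it almost unchanged.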

## 11. Closing Remarks <a id='close'> 

**In this project we covered various aspects of the machine learning development cycle. We observed that data exploration and variable analysis are a very important part of the whole cycle and should be done to understand the data thoroughly. We also cleaned the data while exploring, as there were some outliers that had to be treated before feature engineering. We then did feature engineering to keep only the features that are most significant and cover most of the variance in the dataset. Finally, we trained the models on this feature set to get the results.**