<a href="https://colab.research.google.com/github/sanjay2097/NYC-Taxi-Trip-Time-Prediction/blob/main/NYC_Taxi_Time_Prediction_Capstone_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## <b> Data Description </b>

### The dataset is based on the 2016 NYC Yellow Cab trip record data made available in BigQuery on Google Cloud Platform. The data was originally published by the NYC Taxi and Limousine Commission (TLC). The data was sampled and cleaned for the purposes of this project. Based on individual trip attributes, you should predict the duration of each trip in the test set.

### <b>NYC Taxi Data.csv</b> - the training set (contains 1458644 trip records)


### Data fields
* #### id - a unique identifier for each trip
* #### vendor_id - a code indicating the provider associated with the trip record
* #### pickup_datetime - date and time when the meter was engaged
* #### dropoff_datetime - date and time when the meter was disengaged
* #### passenger_count - the number of passengers in the vehicle (driver entered value)
* #### pickup_longitude - the longitude where the meter was engaged
* #### pickup_latitude - the latitude where the meter was engaged
* #### dropoff_longitude - the longitude where the meter was disengaged
* #### dropoff_latitude - the latitude where the meter was disengaged
* #### store_and_fwd_flag - This flag indicates whether the trip record was held in vehicle memory before sending to the vendor because the vehicle did not have a connection to the server - Y=store and forward; N=not a store and forward trip
* #### trip_duration - duration of the trip in seconds

## **STEPS** 
### 1. Data Analysis
### 2. Feature Engineering
### 3. Feature Selection
### 4. Model Building
### 5. Model Validation & Selection


In [None]:
# Importing libraries
import numpy as np
import pandas as pd
import math
from datetime import datetime

from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from lightgbm import LGBMRegressor

from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error


from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV

import seaborn as sns
import folium
import matplotlib.pyplot as plt
%matplotlib inline

import warnings
warnings.filterwarnings("ignore")

### DATA ANALYSIS

In [None]:
# Mount drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Loading Data
dataset = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/NYC Taxi Data.csv')

In [None]:
dataset.head()

In [None]:
dataset.shape

(1458644, 11)

There are approx 1.46 million records in our dataset.

In [None]:
dataset.describe()

From a preliminary analysis using the describe function we can see that there are anomalous values in passenger_count and trip_duration that need to be addressed later.

In [None]:
# Checking null values
dataset.isnull().sum()

In [None]:
# Checking duplicated values
dataset.duplicated().sum()

0

*There are no null or duplicated values in the given dataset.*

In [None]:
# Copying data to new dataframe for further analysis
df = dataset.copy()

In [None]:
# Lets look at the distribution plot of the features
pos = 1
fig = plt.figure(figsize=(18,26))
for i in df.describe().columns:
    ax = fig.add_subplot(6,2,pos)
    pos = pos + 1
    sns.distplot(df[i],ax=ax)

Inferences from the distribution plots:

1. There are two major vendors in NYC.

2. A passenger count of 1 is the most frequent value.

3. The distribution of trip duration is highly skewed.

### Analysis of independent variables

#### Vendor ID

In [None]:
df['vendor_id'].value_counts().plot(kind='bar')
plt.ylabel('Count')
plt.xlabel('Vendor ID')

*Both vendors seem to have almost equal market share, but Vendor 2 is evidently the more popular of the two as per the above graph.*

#### Datetime

In [None]:
# Converting datetime datatype from object
df['pickup_datetime'] = pd.to_datetime(df['pickup_datetime'])
df['dropoff_datetime'] = pd.to_datetime(df['dropoff_datetime'])

In [None]:
# Adding new features month , day and hour from datetime
df['hour'] = df['pickup_datetime'].dt.hour 
df['day'] = df.pickup_datetime.dt.day_name()
df['month'] = df.pickup_datetime.dt.month_name()

In [None]:
# Analyzing month 
data=df['month']
plt.figure(figsize=(8,5))
sns.countplot(x=data, palette='rainbow')
plt.show()

All the months are closely distributed with March being highest and January lowest.

In [None]:
# Analyzing day
data=df['day']
plt.figure(figsize=(8,5))
sns.countplot(x=data, palette='rainbow')
plt.show()

We can see that Friday has the largest count of trips in the dataset and Monday the lowest.

In [None]:
# Analyzing hour
data=df['hour']
plt.figure(figsize=(8,5))
sns.countplot(x=data, palette='rainbow')
plt.show()

Between 7 am and 3 pm the hourly trip counts are close to each other; they rise from 5 pm to 10 pm and then decline until 5 am.

#### Passenger Count

In [None]:
# Analyzing passenger count
data=df['passenger_count']
plt.figure(figsize=(8,5))
sns.countplot(x=data, palette='rainbow')
plt.show()

The passenger_count variable has a minimum value of 0 passengers. These observations are most likely errors and will need to be removed from the dataset.

According to the NYC Taxi & Limousine Commission, the maximum number of people allowed in a yellow taxicab, by law, is 5 passengers and one child. Observations with more than 6 passengers are likely errors and will also need to be removed from the dataset.

In [None]:
# Keeping only trips with 1 to 6 passengers (drops counts of 0 and counts above 6)
df = df[(df['passenger_count']>0) & (df['passenger_count']<=6)]

#### store_and_fwd_flag

In [None]:
# analyzing trip data storing flag column
df['store_and_fwd_flag'].value_counts()

Most trip records were sent to the vendor directly (flag N); only the remainder were held in vehicle memory first because the vehicle did not have a connection to the server (flag Y).

#### Longitude and Latitude

The coordinate bounds of New York City used here are:

longitude = (-74.03, -73.77)  ,
latitude = (40.63, 40.85)


Any coordinates outside these bounds are treated as outliers.

In [None]:
# Max and min values of lat and long in pickup and dropoff location
print(np.min(df['pickup_longitude']), np.min(df['pickup_latitude']))
print(np.max(df['pickup_longitude']), np.max(df['pickup_latitude']))

print(np.min(df['dropoff_longitude']), np.min(df['dropoff_latitude']))
print(np.max(df['dropoff_longitude']), np.max(df['dropoff_latitude']))

In [None]:
# Removing outlier coordinates
west, south, east, north = -74.03, 40.63, -73.77, 40.85

df = df[(df.pickup_latitude> south) & (df.pickup_latitude < north)]
df = df[(df.dropoff_latitude> south) & (df.dropoff_latitude < north)]
df = df[(df.pickup_longitude> west) & (df.pickup_longitude < east)]
df = df[(df.dropoff_longitude> west) & (df.dropoff_longitude < east)]

In [None]:
# Visualization of coordinates
f, (ax1, ax2) = plt.subplots(1, 2, sharey=True,figsize=(18,10))

df.plot(kind='scatter', x='pickup_longitude', y='pickup_latitude',
                color='yellow', 
                s=.02, alpha=.6, subplots=True, ax=ax1)
ax1.set_title("Pickups")
ax1.set_facecolor('black')

df.plot(kind='scatter', x='dropoff_longitude', y='dropoff_latitude',
                color='yellow', 
                s=.02, alpha=.6, subplots=True, ax=ax2)
ax2.set_title("Dropoffs")
ax2.set_facecolor('black') 

In [None]:
# Finding total distance covered in each trip with a get_distance function (haversine formula)
from math import sin, cos, sqrt, atan2, radians

def get_distance(lon_1, lon_2, lat_1, lat_2):

    # approximate radius of earth in km
    R = 6373.0

    lat1 = radians(lat_1)
    lon1 = radians(lon_1)
    lat2 = radians(lat_2)
    lon2 = radians(lon_2)

    dlon = lon2 - lon1
    dlat = lat2 - lat1

    a = sin(dlat / 2)**2 + cos(lat1) * cos(lat2) * sin(dlon / 2)**2
    c = 2 * atan2(sqrt(a), sqrt(1 - a))

    distance = R * c

    return distance

In [None]:
# Applying get_distance function to calculate each trip distance
df["distance"] = df.apply(lambda x: get_distance(x["pickup_longitude"],x["dropoff_longitude"],x["pickup_latitude"],x["dropoff_latitude"]),axis=1)
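A row-wise `apply` calls the Python function once per trip, which can be slow on more than a million rows. The same haversine formula can be evaluated on whole columns at once with NumPy. The sketch below is an illustration under assumptions: `haversine_km` is a hypothetical helper name, its argument order (lon, lat, lon, lat) differs from `get_distance`, and the coordinates are made-up sample points rather than rows from the dataset.

```python
import numpy as np

EARTH_RADIUS_KM = 6373.0  # same approximate radius as get_distance uses

def haversine_km(lon_1, lat_1, lon_2, lat_2):
    """Great-circle distance in km; accepts scalars or array-likes."""
    lon1, lat1, lon2, lat2 = (
        np.radians(np.asarray(v, dtype=float))
        for v in (lon_1, lat_1, lon_2, lat_2)
    )
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arctan2(np.sqrt(a), np.sqrt(1 - a))

# Hypothetical coordinates, for illustration only
pickup_lon = np.array([-73.98, -73.98])
pickup_lat = np.array([40.75, 40.75])
dropoff_lon = np.array([-73.98, -73.78])
dropoff_lat = np.array([40.75, 40.64])

distances = haversine_km(pickup_lon, pickup_lat, dropoff_lon, dropoff_lat)
print(distances)  # first trip has identical endpoints, so its distance is 0
```

With the dataframe used here, the four coordinate columns could be passed in directly and the result assigned to `df["distance"]`.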

#### Distance

In [None]:
df.distance.describe()

count    1.438573e+06
mean     3.292866e+00
std      3.662317e+00
min      0.000000e+00
25%      1.224953e+00
50%      2.068546e+00
75%      3.767414e+00
max      2.720017e+01
Name: distance, dtype: float64

We can see from the describe output that there are trips where the trip distance is zero, which is not possible; we need to drop these.

In [None]:
# Boxplot of distance
plt.figure(figsize = (12,5))
sns.boxplot(x=df.distance)
plt.show()

During the earlier analysis of the longitude and latitude columns we limited all trips to within NYC, so the remaining outliers are genuine extreme values and we have no further reason to remove them.

In [None]:
# Total outlier values where value is 0
len(df[df.distance==0])

71929

In [None]:
# Removing zero-distance trips
df = df[df.distance>0]

In [None]:
# Plotting distribution of trip distance
from scipy import stats
from scipy.stats import norm, skew

plt.rcParams["figure.figsize"] = (12,6)
sns.distplot(df['distance'] , fit=norm);

# Get the fitted parameters used by the function
(mu, sigma) = norm.fit(df['distance'])
print( '\n mu = {:.2f} and sigma = {:.2f}\n'.format(mu, sigma))

# Plot the distribution
plt.legend([r'Normal dist. ($\mu=$ {:.2f} and $\sigma=$ {:.2f} )'.format(mu, sigma)],
            loc='best')
plt.ylabel('Frequency')
plt.title('Distance distribution')
plt.show()


### Bivariate Analysis

In [None]:
# Analysis between pickup hour and trip duration
group1 = df.groupby('hour').trip_duration.median()
plt.figure(figsize = (12,6))
sns.pointplot(x=group1.index, y=group1.values)
plt.ylabel('Median Trip Duration')
plt.xlabel('Pickup Hour')
plt.show()

There is a sharp increase in the typical (median) trip duration from 6 am until 10 am, when most working people commute.

In [None]:
# Analysis between weekday and trip duration
group1 = df.groupby('day').trip_duration.median()
plt.figure(figsize = (12,6))
sns.pointplot(x=group1.index, y=group1.values)
plt.ylabel('Median Trip Duration')
plt.xlabel('Weekday')
plt.show()

The median trip duration on Sunday is slightly lower than on other days, possibly because it is the weekend and fewer people are going out.

In [None]:
# Analysis between weekday and trip distance
group1 = df.groupby('day').distance.median()
plt.figure(figsize = (12,6))
sns.pointplot(x=group1.index, y=group1.values)
plt.ylabel('Median Trip Distance')
plt.xlabel('Weekday')
plt.show()

On Sunday, trips cover a greater median distance than on other days, possibly because of more travel to and from airports or people travelling farther to meet friends and family.

In [None]:
# Day-wise comparison of traffic
n = sns.FacetGrid(df, col='day')
n.map(plt.hist, 'hour')
plt.show()

From the plots above we can see that weekends have a higher share of late-night trips than other days, as people tend to stay out late.

#### Analyzing target variable

#### Trip Duration

In [None]:
df.trip_duration.describe()

In [None]:
# boxplot trip duration
plt.figure(figsize = (12,5))
sns.boxplot(x=df.trip_duration)
plt.show()

In [None]:
# Number of trips above the 99th percentile of trip duration
len(df[df.trip_duration>df.trip_duration.quantile(0.99)])

14326

In [None]:
# Removing outliers 
df = df[df.trip_duration<df.trip_duration.quantile(0.99)]
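The cut-off applied above is the 99th percentile, which is generally not the same threshold as a mean-plus-three-standard-deviations rule, especially for a skewed variable. The sketch below compares the two on synthetic, right-skewed data. The lognormal parameters and variable names are assumptions chosen for illustration and are not fitted to the taxi data.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic right-skewed "durations" in seconds (not the taxi data)
durations = rng.lognormal(mean=6.5, sigma=0.7, size=100_000)

# Percentile rule: a fixed share of rows (about 1%) lies above the cut-off
p99_cutoff = np.quantile(durations, 0.99)
# Three-sigma rule: the cut-off depends on the mean and spread instead
sigma_cutoff = durations.mean() + 3 * durations.std()

n_above_p99 = int((durations > p99_cutoff).sum())
n_above_sigma = int((durations > sigma_cutoff).sum())

print(f"99th percentile cutoff: {p99_cutoff:.0f} s, rows above: {n_above_p99}")
print(f"mean + 3*std cutoff:    {sigma_cutoff:.0f} s, rows above: {n_above_sigma}")
```

The percentile rule removes a predictable fraction of rows whatever the shape of the distribution, which may be why it is convenient for a heavily skewed target such as trip duration.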

In [None]:
# Plotting trip duration distribution
from scipy import stats
from scipy.stats import norm, skew

plt.rcParams["figure.figsize"] = (12,6)
sns.distplot(df['trip_duration'] , fit=norm);

# Get the fitted parameters used by the function
(mu, sigma) = norm.fit(df['trip_duration'])
print( '\n mu = {:.2f} and sigma = {:.2f}\n'.format(mu, sigma))

# Plot the distribution
plt.legend([r'Normal dist. ($\mu=$ {:.2f} and $\sigma=$ {:.2f} )'.format(mu, sigma)],
            loc='best')
plt.ylabel('Frequency')
plt.title('Duration distribution')
plt.show()


### Feature Selection

In [None]:
# Copying into new dataframe
new_df=df.copy()

In [None]:
new_df.info()

In [None]:
# Removing unnecessary features which are of no use
new_df.drop(['id','pickup_datetime','dropoff_datetime','store_and_fwd_flag'], 
                axis = 1, inplace = True)

In [None]:
# Removing duplicates (keep=False drops every copy of a duplicated row, not only the repeats)
new_df.drop_duplicates(keep=False, inplace=True)

In [None]:
# Visualising the correlation between attributes
corr = new_df.select_dtypes(include='number').corr(method='kendall')
plt.figure(figsize=(16,10))
plt.title("Correlation Between Different Variables\n")
sns.heatmap(corr, annot=True)
plt.show()

There are no high correlations in our dataset.

In [None]:
# Encoding Categorical Data
df1 = pd.get_dummies(new_df, columns=['vendor_id','passenger_count','hour','month','day'], drop_first=True)

In [None]:
df1.info()

In [None]:
# Preliminary analysis using stats OLS
import statsmodels.api as sm
x =   df1.loc[:, df1.columns != 'trip_duration']
Y = df1['trip_duration']

In [None]:
x = sm.add_constant(x)
model= sm.OLS(Y, x).fit()
model.summary()

#### Splitting Dataset

In [None]:
# Separating independent and target variables
X = df1.loc[:, df1.columns != 'trip_duration']
y = df1['trip_duration']

In [None]:
# Creating test and training dataset
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=42)
print(X_train.shape)
print(X_test.shape)

(1135087, 45)
(283772, 45)


In [None]:
X_train.info()

In [None]:
y_train.shape

(1138917,)

In [None]:
# Scaling variables (fit on the training set only, then apply the same scaler to the test set)
sc = StandardScaler()
X_train.iloc[:, :5] = sc.fit_transform(X_train.iloc[:, :5])
X_test.iloc[:, :5] = sc.transform(X_test.iloc[:, :5])
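A common convention when standardizing is to estimate the mean and standard deviation on the training rows only and reuse those statistics for the test rows, so that no information from the test set enters preprocessing. The sketch below spells this out with plain NumPy on synthetic data. The array names and the three made-up columns are assumptions for illustration. The arithmetic mirrors what `StandardScaler` does with `fit_transform` on train followed by `transform` on test.

```python
import numpy as np

rng = np.random.default_rng(42)
# Synthetic stand-ins for continuous columns (not the taxi data)
train = rng.normal(loc=[-73.97, 40.75, 3.3], scale=[0.04, 0.03, 3.6], size=(1000, 3))
test = rng.normal(loc=[-73.97, 40.75, 3.3], scale=[0.04, 0.03, 3.6], size=(250, 3))

# "fit": statistics come from the training rows only
mean_ = train.mean(axis=0)
std_ = train.std(axis=0)  # population std (ddof=0), as StandardScaler uses

# "transform": apply the same statistics to both splits
train_scaled = (train - mean_) / std_
test_scaled = (test - mean_) / std_

print(train_scaled.mean(axis=0).round(6))  # approximately zero by construction
print(test_scaled.mean(axis=0).round(3))   # close to, but not exactly, zero
```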

## Applying Regression Models

### Linear Regression

In [None]:
# Applying linear regression model
reg = LinearRegression().fit(X_train, y_train)
reg.score(X_train, y_train)

0.6244058010191571

In [None]:
y_train_pred = reg.predict(X_train)
y_test_pred = reg.predict(X_test)

In [None]:
#Train set metrics
L_MSE  = mean_squared_error((y_train), (y_train_pred))
print("MSE :" , L_MSE)

L_RMSE = np.sqrt(L_MSE)

print("RMSE :" ,L_RMSE)

L_r2 = r2_score((y_train), (y_train_pred))
print("R2 :" ,L_r2)

L_ar2 = 1-(1-r2_score((y_train), (y_train_pred)))*((X_train.shape[0]-1)/(X_train.shape[0]-X_train.shape[1]-1))
print("Adjusted R2 : ",L_ar2)

MSE : 118225.05400404647
RMSE : 343.8387034701685
R2 : 0.6244058010191571
Adjusted R2 :  0.6243909101571052
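The same four metrics are recomputed for every model in this notebook, so a small helper can reduce copy-paste slips. The sketch below uses only NumPy. `regression_metrics` is a hypothetical helper name, the dictionary keys follow the labels used in the comparison tables, and the sample values are invented. Adjusted R2 is computed as $1 - (1 - R^2)\frac{n-1}{n-p-1}$, with $n$ rows and $p$ features, matching the expression in the cells above.

```python
import numpy as np

def regression_metrics(y_true, y_pred, n_features):
    """Return MSE, RMSE, R2 and adjusted R2 for one set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n = y_true.shape[0]
    ss_res = float(np.sum((y_true - y_pred) ** 2))         # residual sum of squares
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))  # total sum of squares
    mse = ss_res / n
    r2 = 1 - ss_res / ss_tot
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)
    return {"MSE": mse, "RMSE": mse ** 0.5, "R2_score": r2, "Adjusted R2": adj_r2}

# Invented durations in seconds, for illustration only
y_true = [300.0, 600.0, 900.0, 1200.0, 1500.0]
y_pred = [330.0, 570.0, 930.0, 1170.0, 1530.0]
print(regression_metrics(y_true, y_pred, n_features=2))
```

Calling the helper once with the training split and once with the test split, each time with the matching number of rows, would avoid mixing train and test quantities in the same formula.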


In [None]:
# Test set metrics
Lt_MSE  = mean_squared_error((y_test), (y_test_pred))
print("MSE :" , Lt_MSE)

Lt_RMSE = np.sqrt(Lt_MSE)
print("RMSE :" ,Lt_RMSE)

Lt_r2 = r2_score((y_test), (y_test_pred ))
print("R2 :" ,Lt_r2)

Lt_ar2 = 1-(1-r2_score((y_test), (y_test_pred )))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1))
print("Adjusted R2 : ", Lt_ar2)

MSE : 119221.04665171781
RMSE : 345.2840086822988
R2 : 0.623009378367501
Adjusted R2 :  0.622949586251257


In [None]:
# Residual plot
residuals=y_test_pred-y_test

plt.figure(figsize=(8,5), dpi=120, facecolor='w', edgecolor='b')
f = range(0,len(y_test))
k = [0 for i in range(0,len(y_test))]
plt.scatter( f, residuals, label = 'residuals')
plt.plot( f, k , color = 'red', label = 'zero residual line' )
plt.xlabel('test sample index')
plt.ylabel('residuals')
plt.title('Residual plot')
plt.legend()

In [None]:
# Storing the training and test set metrics for comparison
training_df = pd.DataFrame()
test_df = pd.DataFrame()

a=pd.Series(
    {'MSE':round((L_MSE),3),'RMSE':round((L_RMSE),3),'R2_score':round((L_r2),3),'Adjusted R2':round((L_ar2),3)},
    name='Linear regression ')

# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
training_df = pd.concat([training_df, a.to_frame().T])

b=pd.Series(
    {'MSE':round((Lt_MSE),3),'RMSE':round((Lt_RMSE),3),'R2_score':round((Lt_r2),3),'Adjusted R2':round((Lt_ar2),3)},
     name='Linear regression ')            

test_df = pd.concat([test_df, b.to_frame().T])
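An alternative to growing the comparison table one row at a time is to collect each model's metrics in a dictionary keyed by model name and build the table once at the end. The sketch below assumes pandas is available. The model names and numbers are placeholders for illustration, not results from this notebook.

```python
import pandas as pd

# Placeholder numbers purely for illustration, not results from this notebook
rows = {
    "Model A": {"MSE": 900.0, "RMSE": 30.0, "R2_score": 0.995, "Adjusted R2": 0.99},
    "Model B": {"MSE": 1600.0, "RMSE": 40.0, "R2_score": 0.98, "Adjusted R2": 0.97},
}

# One row per model, one column per metric
comparison_df = pd.DataFrame.from_dict(rows, orient="index")
print(comparison_df.sort_values("R2_score", ascending=False))
```

Because each row is keyed by an explicit model name, a mislabelled entry is easy to spot when the table is printed.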

### Ridge Regression

In [None]:
# Applying ridge regression model
ridge = Ridge()
parameters = {'alpha': [1e-15,1e-13,1e-10,1e-8,1e-5,1e-4,1e-3,1e-2,1e-1,1,5,10,20,30,40,45,50,55,60,100]}
ridge_regressor = GridSearchCV(ridge, parameters, scoring='r2', cv=5)
ridge_regressor.fit(X_train, y_train)

GridSearchCV(cv=5, estimator=Ridge(),
             param_grid={'alpha': [1e-15, 1e-13, 1e-10, 1e-08, 1e-05, 0.0001,
                                   0.001, 0.01, 0.1, 1, 5, 10, 20, 30, 40, 45,
                                   50, 55, 60, 100]},
             scoring='r2')

In [None]:
ridge_regressor.score(X_train, y_train)

0.5505939605752155

In [None]:
print("The best fit alpha value is found out to be :" ,ridge_regressor.best_params_)

The best fit alpha value is found out to be : {'alpha': 1}


In [None]:
y_pred_ridge_train = ridge_regressor.predict(X_train)
y_pred_ridge_test = ridge_regressor.predict(X_test)

In [None]:
# Training metrics
r_MSE  = mean_squared_error(y_train, y_pred_ridge_train)
print("Train MSE :" , r_MSE)

r_RMSE = np.sqrt(r_MSE)
print("Train RMSE :" , r_RMSE)

r_r2 = r2_score(y_train, y_pred_ridge_train)
print("Train R2 :" ,r_r2)

r_ar2 = 1-(1-r2_score(y_train, y_pred_ridge_train))*((X_train.shape[0]-1)/(X_train.shape[0]-X_train.shape[1]-1))
print("Train Adjusted R2 : ", r_ar2)


Train MSE : 64267.70763792365
Train RMSE : 253.51076434329894
Train R2 : 0.5505939605752155
Train Adjusted R2 :  0.5505744111713912


In [None]:
# Testing metrics
rt_MSE  = mean_squared_error(y_test, y_pred_ridge_test)
print("Test MSE :" , rt_MSE)

rt_RMSE = np.sqrt(rt_MSE)
print("Test RMSE :" ,rt_RMSE)

rt_r2 = r2_score(y_test, y_pred_ridge_test)
print("Test R2 :" ,rt_r2)

rt_ar2 = 1-(1-r2_score(y_test, y_pred_ridge_test))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1))
print("Test Adjusted R2 : ",rt_ar2)


Test MSE : 64016.78555554594
Test RMSE : 253.51076434329894
Test R2 : 0.5513328561325768
Test Adjusted R2 :  0.5512547769727136


In [None]:
# Storing the training and test set metrics for comparison
a=pd.Series(
    {'MSE':round((r_MSE),3),'RMSE':round((r_RMSE),3),'R2_score':round((r_r2),3),'Adjusted R2':round((r_ar2),3)},
    name='Ridge regression ')

training_df = pd.concat([training_df, a.to_frame().T])

b=pd.Series(
    {'MSE':round((rt_MSE),3),'RMSE':round((rt_RMSE),3),'R2_score':round((rt_r2),3),'Adjusted R2':round((rt_ar2),3)},
     name='Ridge regression ')            

test_df = pd.concat([test_df, b.to_frame().T])

### Lasso Regression

In [None]:
# Applying lasso regressor
lasso = Lasso()
parameters = {'alpha': [1e-10,1e-8,1e-5,1e-4,1e-3,1e-2,1e-1,1,5,10,20,30,40,45,50,60,100]}
lasso_regressor = GridSearchCV(lasso, parameters, scoring='r2', cv=5)
lasso_regressor.fit(X_train, y_train)

In [None]:
lasso_regressor.score(X_train, y_train)

0.5505939610899314

In [None]:
print("The best fit alpha value is found out to be :" ,lasso_regressor.best_params_)

In [None]:
y_pred_lasso_test = lasso_regressor.predict(X_test)
y_pred_lasso_train = lasso_regressor.predict(X_train)

In [None]:
# Training metrics
l_MSE  = mean_squared_error(y_train, y_pred_lasso_train)
print("Train MSE :" , l_MSE)

l_RMSE = np.sqrt(l_MSE)
print("Train RMSE :" ,l_RMSE)

l_r2 = r2_score(y_train, y_pred_lasso_train)
print("Train R2 :" ,l_r2)
l_ar2 = 1-(1-r2_score(y_train, y_pred_lasso_train))*((X_train.shape[0]-1)/(X_train.shape[0]-X_train.shape[1]-1))
print("Train Adjusted R2 : ",l_ar2)

In [None]:
# Test metrics
lt_MSE  = mean_squared_error(y_test, y_pred_lasso_test)
print("Test MSE :" , lt_MSE)

lt_RMSE = np.sqrt(lt_MSE)
print("Test RMSE :" ,lt_RMSE)

lt_r2 = r2_score(y_test, y_pred_lasso_test)
print("Test R2 :" ,lt_r2)

lt_ar2 = 1-(1-r2_score(y_test, y_pred_lasso_test))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1))
print("Test Adjusted R2 : ",lt_ar2)

In [None]:
# storing the training and test set metrics for comparison
a=pd.Series(
    {'MSE':round((l_MSE),3),'RMSE':round((l_RMSE),3),'R2_score':round((l_r2),3),'Adjusted R2':round((l_ar2),3)},
    name='Lasso regression ')

training_df = pd.concat([training_df, a.to_frame().T])

b=pd.Series(
    {'MSE':round((lt_MSE),3),'RMSE':round((lt_RMSE),3),'R2_score':round((lt_r2),3),'Adjusted R2':round((lt_ar2),3)},
     name='Lasso regression ')            

test_df = pd.concat([test_df, b.to_frame().T])

### Decision Tree

In [None]:
from sklearn.tree import DecisionTreeRegressor

# Hyperparameter Grid
param_dt = { 'max_depth' : [9,10] , 'min_samples_split' : [40,50] , 'min_samples_leaf' : [20,30] }

dt_model = DecisionTreeRegressor()

# Grid search
dt_grid = GridSearchCV(estimator = dt_model,
                       param_grid = param_dt,
                       cv = 5, verbose = 2, scoring = 'r2')

dt_grid.fit(X_train,y_train)

In [None]:
dt_grid.best_score_

0.5908874319926922

In [None]:
dt_grid.best_estimator_

DecisionTreeRegressor(max_depth=10, min_samples_leaf=30, min_samples_split=40)

In [None]:
dt_optimal_model =dt_grid.best_estimator_

In [None]:
y_pred_dt_train = dt_optimal_model.predict(X_train)
y_pred_dt_test = dt_optimal_model.predict(X_test)

In [None]:
# Training metrics
d_MSE  = mean_squared_error(y_train, y_pred_dt_train)
print("Train MSE :" , d_MSE)

d_RMSE = np.sqrt(d_MSE)
print("Train RMSE :" ,d_RMSE)

d_r2 = r2_score(y_train, y_pred_dt_train)
print("Train R2 :" ,d_r2)

d_ar2 = 1-(1-r2_score(y_train, y_pred_dt_train))*((X_train.shape[0]-1)/(X_train.shape[0]-X_train.shape[1]-1))
print("Train Adjusted R2 : ",d_ar2)

Train MSE : 57840.23474131701
Train RMSE : 240.49996827716424
Train R2 : 0.5955394743353739
Train Adjusted R2 :  0.5955218800855118


In [None]:
# Testing metrics
dt_MSE  = mean_squared_error(y_test, y_pred_dt_test)
print("Test MSE :" , dt_MSE)

dt_RMSE = np.sqrt(dt_MSE)
print("Test RMSE :" ,dt_RMSE)

from sklearn.metrics import mean_absolute_error
print('Test MAE: {:0.2f}'.format(mean_absolute_error(y_test, y_pred_dt_test)))

dt_r2 = r2_score(y_test, y_pred_dt_test)
print("Test R2 :" ,dt_r2)

dt_ar2 = 1-(1-r2_score(y_test, y_pred_dt_test))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1))
print("Test Adjusted R2 : ",dt_ar2)

Test MSE : 58322.073117166576
Test RMSE : 241.49963378267591
Test R2 : 0.5955394743353739
Test Adjusted R2 :  0.5911736042156011


In [None]:
#Residual plot
plt.scatter((y_test)-(y_pred_dt_test),(y_pred_dt_test))

In [None]:
# storing the training and test set metrics for comparison
a=pd.Series(
    {'MSE':round((d_MSE),3),'RMSE':round((d_RMSE),3),'R2_score':round((d_r2),3),'Adjusted R2':round((d_ar2),3)},
    name='Decision Tree regression ')

training_df = pd.concat([training_df, a.to_frame().T])

b=pd.Series(
    {'MSE':round((dt_MSE),3),'RMSE':round((dt_RMSE),3),'R2_score':round((dt_r2),3),'Adjusted R2':round((dt_ar2),3)},
     name='Decision Tree regression ')            

test_df = pd.concat([test_df, b.to_frame().T])

In [None]:
importances = dt_optimal_model.feature_importances_

importance_dict = {'Feature' : list(X_train.columns),
                   'Feature Importance' : importances}

importance_df = pd.DataFrame(importance_dict)
importance_df.sort_values(by=['Feature Importance'],ascending=False,inplace=True)
importance_df

In [None]:
plt.figure(figsize=(15,10))
plt.title('Feature importance')
sns.barplot(x='Feature',y="Feature Importance",data=importance_df[:10])

#### XGB Regressor

In [None]:
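# Hyperparameter Grid
# Note: min_samples_split is a scikit-learn tree parameter that XGBoost does not define,
# so varying it is not expected to change the fitted model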
param_xgb = {'n_estimators' : [100,150] ,'max_depth' : [9,10],'min_samples_split':[30,40] }

# Create an instance of the  XGBRegressor
import xgboost as xgb
xgb_m = xgb.XGBRegressor(tree_method = "gpu_hist")

# Grid search
xgb_grid = GridSearchCV(estimator = xgb_m , param_grid = param_xgb, cv = 5 , verbose=2 , scoring="r2")

xgb_grid.fit(X_train,y_train)

In [None]:
xgb_grid.best_score_

0.7930116054925561

In [None]:
xgb_grid.best_params_

{'max_depth': 10, 'min_samples_split': 30, 'n_estimators': 150}

In [None]:
xgb_model = xgb_grid.best_estimator_

In [None]:
y_pred_xgb_test = xgb_model.predict(X_test)
y_pred_xgb_train = xgb_model.predict(X_train)

In [None]:
# Training metrics
xg_MSE  = mean_squared_error(y_train, y_pred_xgb_train)
print("Train MSE :" , xg_MSE)

xg_RMSE = np.sqrt(xg_MSE)
print("Train RMSE :" ,xg_RMSE)

xg_r2 = r2_score(y_train, y_pred_xgb_train)
print("Train R2 :" ,xg_r2)

xg_ar2 = 1-(1-r2_score((y_train), (y_pred_xgb_train)))*((X_train.shape[0]-1)/(X_train.shape[0]-X_train.shape[1]-1))
print("Train Adjusted R2 : ",xg_ar2)

Train MSE : 56911.226799658325
Train RMSE : 238.56074027311854
Train R2 : 0.8191963046842581
Train Adjusted R2 :  0.8191891365147477


In [None]:
# Testing metrics
xgt_MSE  = mean_squared_error(y_test, y_pred_xgb_test)
print("Test MSE :" , xgt_MSE)

xgt_RMSE = np.sqrt(xgt_MSE)
print("Test RMSE :" ,xgt_RMSE)

xgt_r2 = r2_score(y_test, y_pred_xgb_test)
print("Test R2 :" ,xgt_r2)

xgt_ar2 = 1-(1-r2_score((y_test), (y_pred_xgb_test)))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1))
print("Test Adjusted R2 : ", xgt_ar2)

Test MSE : 65775.86341208882
Test RMSE : 256.4680553443037
Test R2 : 0.7920091767976405
Test Adjusted R2 :  0.7919761886786661


In [None]:
#Residual plot
plt.scatter((y_test)-(y_pred_xgb_test),(y_pred_xgb_test))

In [None]:
# storing the training and test set metrics for comparison
a=pd.Series(
    {'MSE':round((xg_MSE),3),'RMSE':round((xg_RMSE),3),'R2_score':round((xg_r2),3),'Adjusted R2':round((xg_ar2),3)},
    name='XGB regression ')

training_df = pd.concat([training_df, a.to_frame().T])

b=pd.Series(
    {'MSE':round((xgt_MSE),3),'RMSE':round((xgt_RMSE),3),'R2_score':round((xgt_r2),3),'Adjusted R2':round((xgt_ar2),3)},
     name='XGB regression ')            

test_df = pd.concat([test_df, b.to_frame().T])

In [None]:
# Checking important features
importances = xgb_model.feature_importances_

importance_dict = {'Feature' : list(X_train.columns),
                   'Feature Importance' : importances}

importance_df = pd.DataFrame(importance_dict)
importance_df.sort_values(by=['Feature Importance'],ascending=False,inplace=True)
importance_df

In [None]:
plt.figure(figsize=(15,10))
plt.title('Feature importance')
sns.barplot(x='Feature',y="Feature Importance",data=importance_df[:10])

#### LGBM Regressor

In [None]:
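# Applying LightGBM
# Note: "n_estimator" (LightGBM documents "n_estimators") and "min_samples_split" are not
# LightGBM parameter names, so these grid entries are not expected to change the fitted model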

params={"n_estimator":[5,10],"max_depth":[10,20] ,"min_samples_split":[20,30]}

lgb = LGBMRegressor()
gs_lgb = GridSearchCV(lgb,params,cv=5,verbose=2,scoring='r2')
gs_lgb.fit(X_train,y_train)

In [None]:
print(gs_lgb.best_score_)
print(gs_lgb.best_params_)

0.7614648788398489
{'max_depth': 20, 'min_samples_split': 20, 'n_estimator': 5}


In [None]:
lgb_model = gs_lgb.best_estimator_

In [None]:
y_pred_lgb_train = lgb_model.predict(X_train)
y_pred_lgb = lgb_model.predict(X_test)

In [None]:
# Training metrics
lg_MSE  = mean_squared_error(y_train, y_pred_lgb_train)
print("Train MSE :" , lg_MSE)

lg_RMSE = np.sqrt(lg_MSE)
print("Train RMSE :" ,lg_RMSE)

lg_r2 = r2_score(y_train, y_pred_lgb_train)
print("Train R2 :" ,lg_r2)

lg_ar2 = 1-(1-r2_score((y_train), (y_pred_lgb_train)))*((X_train.shape[0]-1)/(X_train.shape[0]-X_train.shape[1]-1))
print("Train Adjusted R2 : ",lg_ar2)

Train MSE : 74751.06856926194
Train RMSE : 273.40641647419676
Train R2 : 0.7625201531202257
Train Adjusted R2 :  0.7625107379597957


In [None]:
# Testing metrics
lgt_MSE  = mean_squared_error(y_test, y_pred_lgb)
print("Test MSE :" , lgt_MSE)

lgt_RMSE = np.sqrt(lgt_MSE)
print("Test RMSE :" ,lgt_RMSE)

lgt_r2 = r2_score(y_test, y_pred_lgb)
print("Test R2 :" ,lgt_r2)

lgt_ar2 = 1-(1-r2_score((y_test), (y_pred_lgb)))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1))
print("Test Adjusted R2 : ",lgt_ar2)

Test MSE : 75925.30534677533
Test RMSE : 275.54546874658513
Test R2 : 0.7599154774749186
Test Adjusted R2 :  0.7598773991757369


In [None]:
#Residual plot
plt.scatter((y_test)-(y_pred_lgb),(y_pred_lgb))

In [None]:
# storing the training and test set metrics for comparison
a=pd.Series(
    {'MSE':round((lg_MSE),3),'RMSE':round((lg_RMSE),3),'R2_score':round((lg_r2),3),'Adjusted R2':round((lg_ar2),3)},
    name='LGBM regression ')

training_df = pd.concat([training_df, a.to_frame().T])

b=pd.Series(
    {'MSE':round((lgt_MSE),3),'RMSE':round((lgt_RMSE),3),'R2_score':round((lgt_r2),3),'Adjusted R2':round((lgt_ar2),3)},
     name='LGBM regression ')            

test_df = pd.concat([test_df, b.to_frame().T])

In [None]:
# Checking important features
importances = lgb_model.feature_importances_

importance_dict = {'Feature' : list(X_train.columns),
                   'Feature Importance' : importances}

importance_df = pd.DataFrame(importance_dict)
importance_df.sort_values(by=['Feature Importance'],ascending=False,inplace=True)
importance_df

In [None]:
plt.figure(figsize=(15,10))
plt.title('Feature importance')
sns.barplot(x='Feature',y="Feature Importance",data=importance_df[:10])

#### Random Forest

In [None]:
from sklearn.ensemble import RandomForestRegressor

params={"max_depth": [8,10],"min_samples_split": [40,50],"n_estimators": [100,150]}
RFR = RandomForestRegressor()
RFR_grid = GridSearchCV(RFR, params, scoring='r2',verbose=2, cv=3, n_jobs=-1)
RFR_grid.fit(X_train, y_train)

In [None]:
print(RFR_grid.best_score_)
print(RFR_grid.best_params_)

0.6826572522903533
{'max_depth': 9, 'min_samples_split': 30, 'n_estimators': 50}


In [None]:
rf_model = RFR_grid.best_estimator_

In [None]:
y_pred_rf = rf_model.predict(X_test)
y_pred_rf_train = rf_model.predict(X_train)

In [None]:
# Training metrics
rf_MSE  = mean_squared_error(y_train, y_pred_rf_train)
print("Train MSE :" , rf_MSE)

rf_RMSE = np.sqrt(rf_MSE)
print("Train RMSE :" ,rf_RMSE)

rf_r2 = r2_score(y_train, y_pred_rf_train)
print("Train R2 :" ,rf_r2)

rf_ar2 = 1-(1-r2_score((y_train), (y_pred_rf_train)))*((X_train.shape[0]-1)/(X_train.shape[0]-X_train.shape[1]-1))
print("Train Adjusted R2 : ",rf_ar2)

Train MSE : 1.3230569789031397
Train RMSE : 1.1502421392485755
Train R2 : 0.6838898979332522
Train Adjusted R2 :  0.6838785491816053


In [None]:
# Testing metrics
rft_MSE  = mean_squared_error(y_test, y_pred_rf)
print("Test MSE :" , rft_MSE)

rft_RMSE = np.sqrt(rft_MSE)
print("Test RMSE :" ,rft_RMSE)

rft_r2 = r2_score(y_test, y_pred_rf)
print("Test R2 :" ,rft_r2)

rft_ar2 = 1-(1-r2_score((y_test), (y_pred_rf)))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1))
print("Test Adjusted R2 : ",rft_ar2)

Train MSE : 1.336514139218553
Train RMSE : 1.1560770472674184
Train R2 : 0.6831758116037868
Train Adjusted R2 :  0.6831644372155394


In [None]:
#Residual plot
plt.scatter((y_test)-(y_pred_rf),(y_pred_rf))

In [None]:
# storing the training and test set metrics for comparison
a=pd.Series(
    {'MSE':round((rf_MSE),3),'RMSE':round((rf_RMSE),3),'R2_score':round((rf_r2),3),'Adjusted R2':round((rf_ar2),3)},
    name='Random Forest regression ')

training_df = pd.concat([training_df, a.to_frame().T])

b=pd.Series(
    {'MSE':round((rft_MSE),3),'RMSE':round((rft_RMSE),3),'R2_score':round((rft_r2),3),'Adjusted R2':round((rft_ar2),3)},
     name='Random Forest regression ')            

test_df = pd.concat([test_df, b.to_frame().T])

In [None]:
importances = rf_model.feature_importances_

importance_dict = {'Feature' : list(X_train.columns),
                   'Feature Importance' : importances}

importance_df = pd.DataFrame(importance_dict)
importance_df.sort_values(by=['Feature Importance'],ascending=False,inplace=True)
importance_df

In [None]:
plt.figure(figsize=(15,10))
plt.title('Feature importance')
sns.barplot(x='Feature',y="Feature Importance",data=importance_df[:10])

### CatBoost Regressor

In [None]:
#pip install catboost

In [None]:
import catboost as cb

params = {'iterations': [100,150],
        'learning_rate': [0.01, 0.1],
        'depth': [9,10],
        'l2_leaf_reg': [5,7]}

cbr = cb.CatBoostRegressor()
cbr_grid = GridSearchCV(cbr, params, scoring='r2',verbose=2, cv=5)
cbr_grid.fit(X_train, y_train)

In [None]:
print(cbr_grid.best_score_)
print(cbr_grid.best_params_)

0.7490337927030396
{'depth': 10, 'iterations': 150, 'l2_leaf_reg': 5, 'learning_rate': 0.1}


In [None]:
cat_model = cbr_grid.best_estimator_

In [None]:
y_pred_cb = cat_model.predict(X_test)
y_pred_cb_train = cat_model.predict(X_train)

In [None]:
# Training metrics
cb_MSE  = mean_squared_error(y_train, y_pred_cb_train)
print("Train MSE :" , cb_MSE)

cb_RMSE = np.sqrt(cb_MSE)
print("Train RMSE :" ,cb_RMSE)

cb_r2 = r2_score(y_train, y_pred_cb_train)
print("Train R2 :" ,cb_r2)

cb_ar2 = 1-(1-r2_score((y_train), (y_pred_cb_train)))*((X_train.shape[0]-1)/(X_train.shape[0]-X_train.shape[1]-1))
print("Train Adjusted R2 : ",cb_ar2)

Train MSE : 0.02537573020631267
Train RMSE : 0.1592976151934255
Train R2 : 0.7538787888455403
Train Adjusted R2 :  0.7538690639034688


In [None]:
# Testing metrics
cbt_MSE  = mean_squared_error(y_test, y_pred_cb)
print("Test MSE :" , cbt_MSE)

cbt_RMSE = np.sqrt(cbt_MSE)
print("Test RMSE :" ,cbt_RMSE)

cbt_r2 = r2_score(y_test, y_pred_cb)
print("Test R2 :" ,cbt_r2)

cbt_ar2 = 1-(1-r2_score((y_test), (y_pred_cb)))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1))
print("Test Adjusted R2 : ",cbt_ar2)

Train MSE : 0.028594564847447875
Train RMSE : 0.1690992751239575
Train R2 : 0.7212566158062704
Train Adjusted R2 :  0.7212456018702859


In [None]:
#Residual plot
plt.scatter((y_test)-(y_pred_cb),(y_pred_cb))

In [None]:
# storing the training and test set metrics for comparison
a=pd.Series(
    {'MSE':round((cb_MSE),3),'RMSE':round((cb_RMSE),3),'R2_score':round((cb_r2),3),'Adjusted R2':round((cb_ar2),3)},
    name='Catboost regression ')

training_df = pd.concat([training_df, a.to_frame().T])

b=pd.Series(
    {'MSE':round((cbt_MSE),3),'RMSE':round((cbt_RMSE),3),'R2_score':round((cbt_r2),3),'Adjusted R2':round((cbt_ar2),3)},
     name='Catboost regression ')            

test_df = pd.concat([test_df, b.to_frame().T])

In [None]:
importances = cat_model.feature_importances_

importance_dict = {'Feature' : list(X_train.columns),
                   'Feature Importance' : importances}

importance_df = pd.DataFrame(importance_dict)
importance_df.sort_values(by=['Feature Importance'],ascending=False,inplace=True)
importance_df

### CONCLUSION :

In this project we covered various aspects of the machine learning development cycle. Data exploration and variable analysis proved to be a very important part of the cycle and should be done to understand the data thoroughly. We cleaned the data while exploring it, since there were outliers that had to be treated before feature engineering. We then did feature engineering and selection to keep only the features that are most significant and cover most of the variance in the dataset. Finally, we trained the models on this feature set; in the recorded runs XGBoost gave the highest test R2 (about 0.79), followed by LightGBM (about 0.76).