##### **Project Type -**  Regression 

##### **Presented By -** Suraj Kumar


# **Project Name**    - **SEOUL BIKE SHARING DEMAND PREDICTION**



Rental bikes have been introduced in many cities to improve mobility and comfort. Making a rental bike available and accessible to the public at the right time matters because it reduces waiting time, so providing the city with a stable supply of rental bikes becomes a major concern. The crucial part is predicting the bike count required at each hour. The dataset contains weather information (temperature, humidity, wind speed, visibility, dew point, solar radiation, snowfall, rainfall), the number of bikes rented per hour, and date information.

# Variables

**Date** : year-month-day

**Rented Bike count** - Count of bikes rented at each hour

**Hour** - Hour of the day

**Temperature**-Temperature in Celsius

**Humidity**- %

**Windspeed** - m/s

**Visibility** - 10m

**Dew point temperature** - Celsius

**Solar radiation** - MJ/m2

**Rainfall** - mm

**Snowfall** - cm

**Seasons** - Winter, Spring, Summer, Autumn

# **INTRODUCTION**

This is a regression problem: we have to predict a continuous target variable, the rented bike count, from independent variables that mostly describe weather conditions.

We will follow these steps to systematize the search for the best prediction.

1) Exploring the data and analysing the relationships among the variables.

2) Removing outliers and dropping correlated variables.

3) Defining the target variable and the feature variables.

4) Splitting the data for training and testing.

5) Choosing different models, such as linear regression, random forest regression and polynomial regression.

6) Fitting the data and predicting the result.

7) Evaluating the result using different metrics such as Mean Squared Error, R2_score etc.

8) Regularization with Lasso and Ridge, and hyperparameter tuning using Grid Search CV.

9) Comparing the different models with the help of the metrics.

10) Analysing the importance of different features in prediction (model explainability).

11) Conclusion

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import math  # numpy.math was removed in NumPy 2.0; the standard-library module is equivalent

from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

### Dataset Loading

Reading the file.

In [None]:
# Load Dataset
from google.colab import drive
drive.mount('/content/drive')

In [None]:
path ='/content/drive/MyDrive/Capstone_Project-2/'

In [None]:
df=pd.read_csv(path+'SeoulBikeData.csv',encoding='unicode_escape') 


### Dataset First View

In [None]:
# Dataset First Look
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape

Number of row=8760

Number of columns=14

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
len(df[df.duplicated()])


No Duplicate Values.

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isna().sum().sum()

No Missing Values.

In [None]:
# Dataset Columns
df.columns

# Exploratory Data Analysis

In [None]:
# Dataset Describe
df.describe(include='all')


After looking at the mean, max and standard deviation values, several columns appear to contain many outliers.

In [None]:
# Distribution of the dependent variable 'Rented Bike Count'
plt.figure(figsize=(7,7))
sns.distplot(df['Rented Bike Count'],color='r')

The density plot of the rented bike count shows that the majority of values lie between 100 and 1200 rented bikes, with outliers up to 3500.

The density plot is positively skewed, so the target needs a transformation to bring its distribution closer to normal.

In [None]:
#skewness and kurtosis
print("Skewness: %f" % df['Rented Bike Count'].skew())
print("Kurtosis: %f" % df['Rented Bike Count'].kurt())

Here skewness is 1.153 while the kurtosis is 0.853.

In [None]:
# Reducing skewness by taking the square root of the target variable.
plt.figure(figsize=(7,7))
sns.distplot(np.sqrt(df['Rented Bike Count']),color='r')

In [None]:
#skewness and kurtosis
print("Skewness after transformation: %f" % np.sqrt(df['Rented Bike Count']).skew())
print("Kurtosis after transformation: %f" % np.sqrt(df['Rented Bike Count']).kurt())

After taking the square root of the target variable, the skewness drops to 0.23 and the kurtosis to -0.65.
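
The effect of the square-root transform on skewness can be sketched on synthetic data. The gamma-distributed sample below is only a hypothetical stand-in for `Rented Bike Count`, and the helper `sample_skewness` is an illustrative name that computes the plain third standardized moment (pandas' `.skew()` applies a small-sample correction, so its values differ slightly).

```python
import numpy as np

def sample_skewness(values):
    """Third standardized moment (no small-sample bias correction)."""
    values = np.asarray(values, dtype=float)
    centered = values - values.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5

# Synthetic right-skewed counts; NOT the real bike data.
rng = np.random.default_rng(0)
counts = rng.gamma(shape=1.5, scale=450.0, size=5000)

skew_raw = sample_skewness(counts)
skew_sqrt = sample_skewness(np.sqrt(counts))

print(f"skewness of raw values:         {skew_raw:.3f}")
print(f"skewness after square root:     {skew_sqrt:.3f}")
```

The square root compresses large values more than small ones, which is why it pulls in the long right tail.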

*Analysing the correlation between different numerical variables.*

In [None]:
#Relation Between Two Numerical Variables
sns.pairplot(df,vars=['Rented Bike Count','Hour','Temperature(°C)','Humidity(%)','Wind speed (m/s)','Visibility (10m)',], hue='Seasons')


*First, convert the Date column from string to datetime format for further processing.*
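
A minimal sketch of this conversion on a few hand-written dates. It assumes the CSV stores dates as day/month/year strings such as `01/12/2017`; that is an assumption about the file and is worth checking with `df['Date'].head()`. Passing an explicit `format` avoids silently swapping day and month for days up to the 12th.

```python
import pandas as pd

# Hypothetical sample values in the assumed day/month/year layout.
dates = pd.Series(["01/12/2017", "13/12/2017", "05/01/2018"])

# An explicit format makes the parsing unambiguous.
parsed = pd.to_datetime(dates, format="%d/%m/%Y")

parts = pd.DataFrame({
    "Day": parsed.dt.day,
    "Month": parsed.dt.month,
    "Year": parsed.dt.year,
})
print(parts)
```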

In [None]:
# Convert the Date column to datetime dtype
# (assuming the CSV stores dates as day/month/year, hence dayfirst=True)
df['Date']=pd.to_datetime(df['Date'],dayfirst=True)

#Breaking Down the Date into 3 Components
df['Day']=df['Date'].dt.day
df['Month']=df['Date'].dt.month
df['Year']=df['Date'].dt.year

In [None]:
df.drop(['Date'],axis=1,inplace=True) #Removing Date Column

In [None]:
df.describe().columns

In [None]:
# Chart - 1 visualization code
#Vizualizing Density of various features
numerical_features=df.describe().columns
for col in numerical_features[1:]:
    fig = plt.figure(figsize=(9, 6))
    ax = fig.gca()
    feature = df[col]
    sns.distplot(feature,bins=50, ax = ax,color='y')
    ax.axvline(feature.mean(), color='magenta', linestyle='dashed', linewidth=2)
    ax.axvline(feature.median(), color='cyan', linestyle='dashed', linewidth=2)    
    ax.set_title(col)
plt.show()

*From the density plots of the numerical features, it can be concluded that:*

Features with a near-normal distribution:

1) Temperature (mean 13 degrees Celsius)

2) Humidity (mean 58%)

Features with a skewed distribution:

1) Wind speed (mean 1.62 m/s)

2) Visibility (mean 1434, in units of 10 m)

3) Solar radiation (mean 0.5 MJ/m2)

4) Rainfall (mean 0.1 mm)

5) Snowfall (mean 0.064 cm)

*Analysing the relation of Number of Rented Bike with respect to different numerical features*.

In [None]:
# Chart - 2 visualization code
#Visualizing Relation of Dependent Variable with numerical independent features
for col in numerical_features[1:]:
    fig = plt.figure(figsize=(9, 6))
    ax = fig.gca()
    feature = df[col]
    label = df['Rented Bike Count']
    correlation = feature.corr(label)
    plt.scatter(x=feature, y=label)
    plt.xlabel(col)
    plt.ylabel('Rented Bike Count')
    ax.set_title('Rented Bike Count vs ' + col + '- correlation: ' + str(correlation))
    z = np.polyfit(df[col], df['Rented Bike Count'], 1)
    y_hat = np.poly1d(z)(df[col])

    plt.plot(df[col], y_hat, "r--", lw=1)

plt.show()

**Conclusion-**

1) Most bikes are rented between hours 15 and 20, which shows that the evening period sees higher demand.

2) Temperatures of 20 to 30 degrees Celsius see the highest demand for rental bikes (autumn or summer season).

3) Demand is highest at humidity between 40% and 70%.

4) Lower wind speed increases the demand for rental bikes.

5) Demand for rental bikes increases with higher visibility.

6) Higher dew point temperature comes with greater demand for rental bikes.

7) Demand decreases with higher solar radiation.

8) Demand decreases during heavier rainfall and snowfall.

# Data Preprocessing

In [None]:
#Removing Outliers
df=df[df['Wind speed (m/s)']<=4]
df=df[df['Visibility (10m)']>=100]
df=df[df['Solar Radiation (MJ/m2)']<=3]
df=df[df['Rainfall(mm)']<=10]
df=df[df['Snowfall (cm)']<=4]

*Analyzing the relation between rental bike count and numerical features*

In [None]:
# Chart - 2 visualization code
#Visualizing the relation between numerical features and the dependent variable after outlier removal
for col in numerical_features[1:-2]:
    fig = plt.figure(figsize=(9, 6))
    ax = fig.gca()
    feature = df[col]
    label = df['Rented Bike Count']
    correlation = feature.corr(label)
    plt.scatter(x=feature, y=label)
    plt.xlabel(col)
    plt.ylabel('Rented Bike Count')
    ax.set_title('Rented Bike Count vs ' + col + '- correlation: ' + str(correlation))
    z = np.polyfit(df[col], df['Rented Bike Count'], 1)
    y_hat = np.poly1d(z)(df[col])

    plt.plot(df[col], y_hat, "r--", lw=1)

plt.show()

 *Analysing Categorical features*

In [None]:
df['Seasons'].value_counts()

It can be seen that the records are distributed evenly across the seasons.

In [None]:
# Chart - 3 visualization code
#ploting Number of Rented bike in different Seasons
bike_rented_per_season=df.groupby(['Seasons'])['Rented Bike Count'].mean()
plt.rcParams['figure.figsize']=(7,7)
sns.barplot(y=bike_rented_per_season,x=bike_rented_per_season.index,data=df)
plt.ticklabel_format(style='plain', axis='y')
plt.show()

We can see that the demand for rental bikes is higher in autumn and summer, with averages of about 800 and 1050 respectively, while winter sees the lowest demand, at nearly 200.

In [None]:
df['Holiday'].value_counts()

The data clearly shows that working days far outnumber holidays.

In [None]:
#Visualizing Number of Rented Bike on the Basis of Holiday
bike_rented_on_holiday=df.groupby(['Holiday'])['Rented Bike Count'].mean()
plt.rcParams['figure.figsize']=(7,7)
sns.barplot(y=bike_rented_on_holiday,x=bike_rented_on_holiday.index,data=df)
plt.ticklabel_format(style='plain', axis='y')
plt.show()

From the bar chart, it can be clearly seen that the demand for rental bikes is higher on working days.

In [None]:
df['Functioning Day'].value_counts()

Here it can be seen that functioning days far outnumber non-functioning days.

In [None]:
bike_rented_on_functioning_day=df.groupby(['Functioning Day'])['Rented Bike Count'].mean()
plt.rcParams['figure.figsize']=(7,7)
sns.barplot(y=bike_rented_on_functioning_day,x=bike_rented_on_functioning_day.index,data=df)
plt.ticklabel_format(style='plain', axis='y')
plt.show()

*Analysing the correlation among different variables.*

In [None]:
# Chart - 4 visualization code
#Visualizing the correlation among different variable
plt.figure(figsize=(15,8))
cbar_kws = { 
            "shrink":1,
            'extend':'min', 
            'extendfrac':0.1, 
            "ticks":np.arange(0,22), 
            "drawedges":True,
           }
correlation = df.corr(numeric_only=True)  # skip the string columns (Seasons, Holiday, Functioning Day)
sns.heatmap(abs(correlation), annot=True, cmap='coolwarm',linewidth=1,cbar_kws=cbar_kws)

From the heatmap it can be seen that:

1) The majority of the features are not correlated with each other.

2) Temperature has its highest correlation with dew point temperature.

3) Humidity is also highly correlated with visibility.

Identifying correlated variables with the help of the variance inflation factor (VIF) gives a clearer picture.

In [None]:
#Treating Correlating Variables
from statsmodels.stats.outliers_influence import variance_inflation_factor
def calc_vif(x):
  vif=pd.DataFrame()
  vif['variable']=x.columns
  vif['vif']=[variance_inflation_factor(x.values,i) for i in range(x.shape[1])]
  return (vif)

In [None]:
calc_vif(df[[i for i in df.describe().columns if i not in ['Rented Bike Count','Day','Month','Year']]])

In [None]:
calc_vif(df[[i for i in df.describe().columns if i not in ['Rented Bike Count','Day','Month','Year','Dew point temperature(°C)']]])

Removing the Dew Point Temperature.
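
To make the VIF idea concrete, here is a small NumPy sketch on synthetic data. For feature j, VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing feature j on the remaining features. The function name `vif_with_intercept` and the synthetic columns are illustrative assumptions. The sketch adds an intercept to each auxiliary regression, so its numbers may differ from the statsmodels output above if no constant column is passed there.

```python
import numpy as np

def vif_with_intercept(features):
    """VIF per column, from regressing each column on the others plus an intercept."""
    features = np.asarray(features, dtype=float)
    n_rows, n_cols = features.shape
    vifs = []
    for j in range(n_cols):
        target = features[:, j]
        others = np.delete(features, j, axis=1)
        design = np.column_stack([np.ones(n_rows), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        r_squared = 1.0 - residuals.var() / target.var()
        vifs.append(1.0 / (1.0 - r_squared))
    return np.array(vifs)

# Synthetic stand-ins: dew point is built to be nearly collinear with temperature.
rng = np.random.default_rng(0)
temperature = rng.normal(13.0, 12.0, size=2000)
dew_point = temperature - rng.normal(9.0, 2.0, size=2000)
wind_speed = rng.normal(1.7, 1.0, size=2000)

vifs = vif_with_intercept(np.column_stack([temperature, dew_point, wind_speed]))
for name, value in zip(["temperature", "dew_point", "wind_speed"], vifs):
    print(f"{name:12s} VIF = {value:.2f}")
```

In this toy setup the two collinear columns get a large VIF while the independent one stays close to 1, which mirrors the reasoning for dropping dew point temperature.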

In [None]:
df.drop(['Dew point temperature(°C)'],axis=1,inplace=True)

In [None]:
numerical_features=['Hour','Temperature(°C)','Humidity(%)','Wind speed (m/s)','Visibility (10m)','Solar Radiation (MJ/m2)','Rainfall(mm)','Snowfall (cm)']

Let us look at the correlation between the variables after this treatment.

In [None]:
#Visualizing Correlation after treatment
plt.figure(figsize=(15,8))
correlation = df.corr(numeric_only=True)  # skip the string columns
sns.heatmap(abs(correlation), annot=True)

In [None]:
categorical_features=df.describe(include=['object','category']).columns

In [None]:
numerical_features

*Analysing the number of rented bike with respect to the different categorical features.*

First of all we will plot boxplots and analyse the density as well as the outliers.

In [None]:
#Visualizing outlier through Boxplot
for col in categorical_features:
    fig = plt.figure(figsize=(9, 6))
    ax = fig.gca()
    df.boxplot(column = 'Rented Bike Count', by = col, ax = ax)
    ax.set_title('Label by ' + col)
    ax.set_ylabel("Rented Bike Count")
plt.show()

After visualization we can conclude that:

1) The demand for rental bikes is highest in autumn and summer.

2) The features above show a high density of outliers, so removing them could cause major data loss.

*Encoding the Categorical Data*

Encoding lets us process the categorical data by assigning numerical values to the categories.
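
As a small illustration of one-hot encoding with `drop_first=True`, the sketch below encodes a toy `Seasons` column (hypothetical values, not the real data). One category is dropped and acts as the baseline, which avoids a redundant, perfectly collinear column.

```python
import pandas as pd

# Toy column with the four season labels.
toy = pd.DataFrame({"Seasons": ["Winter", "Spring", "Summer", "Autumn", "Winter"]})

# Categories are sorted alphabetically, so 'Autumn' is the dropped baseline.
encoded = pd.get_dummies(toy["Seasons"], prefix="Seasons", drop_first=True)
print(encoded)
```

A row with all indicator columns equal to zero therefore represents the baseline category.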

In [None]:
#Encoding the categorical variables
df_pr=df.copy()
def encoder(data,columns):
  data=pd.concat([data,pd.get_dummies(data[columns],prefix=columns,drop_first=True)],axis=1)
  data=data.drop([columns],axis=1)
  return data

for col in categorical_features:
  df_pr=encoder(df_pr,col)
df_pr.head()

In [None]:
df_pr.drop(['Day','Year'],axis=1,inplace=True)

In [None]:
df_pr.head()

## ***7. ML Model Implementation***

Now we will apply various models, namely linear regression, random forest regression and polynomial regression, and then evaluate the results.

*Assigning and Splitting the data for training and testing:*

In [None]:

# ML Model - 1 Implementation
x=df_pr.iloc[:,1:]
y=np.sqrt(df_pr.iloc[:,:1])


# Split the data into training and test sets
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.25,random_state=0)
print(x_train.shape)
print(x_test.shape)

*Scaling the data*

Scaling removes the impact of differences in magnitude between the features.
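
The arithmetic behind min-max scaling can be sketched in plain NumPy. The toy arrays are illustrative. The key point is that the minimum and range are learned from the training data only and then reused on the test data, so test values can fall outside [0, 1].

```python
import numpy as np

# Toy training and test matrices (two features each).
train = np.array([[0.0, 100.0], [10.0, 300.0], [20.0, 500.0]])
test = np.array([[5.0, 200.0], [25.0, 100.0]])

# Statistics come from the training split only.
col_min = train.min(axis=0)
col_range = train.max(axis=0) - col_min

train_scaled = (train - col_min) / col_range
test_scaled = (test - col_min) / col_range

print(train_scaled)
print(test_scaled)  # [[0.25 0.25] [1.25 0.  ]]
```

This is the same pattern as calling `fit_transform` on the training split and `transform` on the test split.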

In [None]:
from sklearn.preprocessing import MinMaxScaler
scaler=MinMaxScaler()
x_train=scaler.fit_transform(x_train)
x_test=scaler.transform(x_test)


### ML Model - 1-Linear Regression

*Implementation*

In [None]:
#Implementation
reg=LinearRegression().fit(x_train,y_train)

*Model features*

In [None]:
reg.score(x_train,y_train) 

In [None]:
reg.intercept_

In [None]:
reg.coef_ #coefficient of parameter

*Prediction using the model*

In [None]:
#Prediction
y_train_pred=reg.predict(x_train)
y_test_pred=reg.predict(x_test)

*Model performance using evaluation metrics.*




In [None]:
from sklearn.metrics import mean_absolute_error
# Visualizing evaluation Metric Score chart
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

mse=mean_squared_error(((y_test)**2),((y_test_pred)**2)) 
print('MSE',mse)

mae=mean_absolute_error(((y_test)**2),((y_test_pred)**2))
print('MAE',mae)

rmse=np.sqrt(mse)
print('RMSE',rmse)

r2=r2_score(((y_test)**2),((y_test_pred)**2))
print('R2',r2)
print("Adjusted R2 : ",1-(1-r2_score(((y_test)**2), ((y_test_pred)**2)))*((x_test.shape[0]-1)/(x_test.shape[0]-x_test.shape[1]-1)))

Here the R2 score is nearly 61%, which is low and generally not acceptable, so the model needs further tuning and transformation.
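
The adjusted R2 expression used in the cell above can be wrapped in a small helper. The name `adjusted_r2` is illustrative; `n_samples` and `n_features` correspond to `x_test.shape[0]` and `x_test.shape[1]` in the notebook.

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1)."""
    if n_samples - n_features - 1 <= 0:
        raise ValueError("need n_samples > n_features + 1")
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

# Adding features always lowers the adjusted value for a fixed R2.
print(adjusted_r2(0.61, n_samples=2000, n_features=5))
print(adjusted_r2(0.61, n_samples=2000, n_features=50))
```

Unlike plain R2, the adjusted value penalizes features that do not improve the fit, so it is the fairer number when comparing models with different feature counts.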

In [None]:
#Visualizing the relation between the predicted value and actual values.
plt.figure(figsize=(8,5))
plt.scatter((y_test**2),((y_test_pred)**2),color='brown')
plt.xlabel('True_Values')
plt.ylabel('Predicted_Values')
plt.show()

In the scatter plot above, the points become more scattered at higher values, which points toward higher error there.

It shows that the model works well for lower values.

In [None]:
error=((y_test)**2)-((y_test_pred)**2)

In [None]:
sns.distplot(error)
plt.show()

The density plot above shows an approximately normal distribution of the errors, which means the majority of predictions have low error.

# Cross- Validation & Hyperparameter Tuning



Now we will try to penalise the coefficients to reduce the error.

Here we will use three methods, Lasso regression, Ridge regression and ElasticNet regression, and cross-validate them.
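
The idea of cross-validating a penalty strength can be sketched without scikit-learn. This is a simplified illustration on synthetic data, not the notebook's actual search: it uses closed-form ridge regression, contiguous unshuffled folds, and reports the negative mean squared error so that higher is better. The names `ridge_fit` and `cv_mse` are hypothetical.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge regression with an unpenalised intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    coef = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    return coef, y_mean - x_mean @ coef

def cv_mse(X, y, alpha, n_folds=5):
    """Mean squared error averaged over contiguous, unshuffled folds."""
    indices = np.arange(len(y))
    errors = []
    for held_out in np.array_split(indices, n_folds):
        train = np.setdiff1d(indices, held_out)
        coef, intercept = ridge_fit(X[train], y[train], alpha)
        pred = X[held_out] @ coef + intercept
        errors.append(np.mean((y[held_out] - pred) ** 2))
    return float(np.mean(errors))

# Synthetic regression problem; NOT the bike data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([3.0, -2.0, 0.0, 0.0, 1.0]) + rng.normal(scale=0.5, size=200)

alphas = [1e-3, 1e-1, 1.0, 10.0, 100.0]
scores = {alpha: -cv_mse(X, y, alpha) for alpha in alphas}  # negative MSE
best_alpha = max(scores, key=scores.get)
print("best alpha:", best_alpha, "negative MSE:", scores[best_alpha])
```

Because the score is a negated error, the "best" score is the one closest to zero, which is also how the `neg_mean_squared_error` values printed below should be read.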

*Lasso Regression*

In [None]:
#Hyperparameter Tuning using Lasso Regression
from sklearn.linear_model import Lasso
lasso=Lasso(alpha=0.0001,max_iter=8000)
lasso.fit(x_train,y_train) #fitting model

In [None]:
lasso.score(x_train,y_train)

In [None]:
lasso.coef_

*Cross Validation*

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)
#Implementing Cross Validation
from sklearn.model_selection import GridSearchCV
lasso=Lasso()
parameters={'alpha':[1e-15,1e-13,1e-10,1e-8,1e-5,1e-4,1e-3,1e-2,1e-1,1,5,10,20,30,40,45,50,55,60,100,0.0014]}
lasso_regressor=GridSearchCV(lasso,parameters,scoring='neg_mean_squared_error', cv=5)
lasso_regressor.fit(x_train,y_train) #Fitting the model

*Best Lasso Parameter*

In [None]:
#Analysing optimal parameter
print("The best fit alpha value is found out to be :" ,lasso_regressor.best_params_)
print("\nUsing ",lasso_regressor.best_params_, " the negative mean squared error is: ", lasso_regressor.best_score_)

*Prediction through Lasso on test data*

In [None]:
#Predicting through model
y_pred_lasso=lasso_regressor.predict(x_test) # Prediction on test data

*Analysing ERROR*



In [None]:
#Visualizing the accuracy of Predicted value with True Value
plt.figure(figsize=(8,5))
plt.scatter(((y_test)),np.array(y_pred_lasso))
plt.xlabel('True_Values')
plt.ylabel('Lasso_predicted_Value')
plt.show()

In the scatter plot above, the relationship between true and predicted values looks more linear, which suggests a reduction in error compared to the simple linear regression model.

*Evaluation of Lasso Model*

In [None]:
#Evaluation of Lasso Model
MSE  = mean_squared_error(((y_test)**2), (y_pred_lasso)**2)
print("MSE :" , MSE)

RMSE = np.sqrt(MSE)
print("RMSE :" ,RMSE)

# Note: R2 here is computed on the square-root scale, unlike the MSE above (original scale)
r2 = r2_score(((y_test)), (y_pred_lasso))
print("R2 :" ,r2)
print("Adjusted R2 : ",1-(1-r2_score(((y_test)), (y_pred_lasso)))*((x_test.shape[0]-1)/(x_test.shape[0]-x_test.shape[1]-1)))

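Here the R2 score rises to about 67%. This value is computed on the square-root scale, whereas the linear regression R2 above was computed on the original scale, so the increase should be read with caution.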
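
Since the target is modelled on the square-root scale, one way to keep all metrics comparable is to square both the true and the predicted values before computing anything. The helper below is a sketch of that idea; `metrics_on_original_scale` is a hypothetical name, and the toy arrays are made up for illustration.

```python
import numpy as np

def metrics_on_original_scale(y_true_sqrt, y_pred_sqrt):
    """Back-transform sqrt-scale values by squaring, then compute MSE, RMSE and R2."""
    y_true = np.ravel(y_true_sqrt) ** 2
    y_pred = np.ravel(y_pred_sqrt) ** 2
    residuals = y_true - y_pred
    mse = float(np.mean(residuals ** 2))
    total = float(np.sum((y_true - y_true.mean()) ** 2))
    r2 = 1.0 - float(np.sum(residuals ** 2)) / total
    return {"MSE": mse, "RMSE": mse ** 0.5, "R2": r2}

# Toy sqrt-scale values: the third prediction is off by one on the sqrt scale.
toy = metrics_on_original_scale(np.array([1.0, 2.0, 3.0]), np.array([1.0, 2.0, 4.0]))
print(toy)
```

Calling the same helper for every model, for example with `y_test` and `y_pred_lasso`, would put the Lasso, Ridge and linear regression scores on one scale.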

*Ridge Regression*

In [None]:
#Hyperparameter tuning using Ridge Regression
from sklearn.linear_model import Ridge
parameters={'alpha' : [1e-15,1e-10,1e-8,1e-5,1e-4,1e-3,1e-2,1,5,10,20,30,40,45,50,55,60,100,0.1]}
ridge_regressor=GridSearchCV(Ridge(),parameters,scoring='neg_mean_squared_error',cv=3)
ridge_regressor.fit(x_train,y_train) #Fitting the model

*Best Parameter on Ridge Regression*

In [None]:
#Analysing the optimal parameters
print("The best fit alpha value is found out to be :" ,ridge_regressor.best_params_)
print("\nUsing ",ridge_regressor.best_params_, " the negative mean squared error is: ", ridge_regressor.best_score_)

*Prediction using Ridge Regression*

In [None]:
#Prediction using Ridge Regression
y_pred_train_ridge=ridge_regressor.predict(x_train) #prediction on training data

*Evaluation of Ridge Regression*

In [None]:
#Evaluation of model(Ridge) on training data
MSE  = mean_squared_error((y_train), (y_pred_train_ridge)) # on the square-root scale
print("MSE :" , MSE)

RMSE = np.sqrt(MSE)
print("RMSE :" ,RMSE)

r2 = r2_score((y_train)**2, (y_pred_train_ridge)**2) # on the original scale
print("R2 :" ,r2)
# Adjusted R2 uses the training-set shape because these are training-set predictions
print("Adjusted R2 : ",1-(1-r2)*((x_train.shape[0]-1)/(x_train.shape[0]-x_train.shape[1]-1)))

On train data,the R2_Score is nearly 58%.

In [None]:
y_pred_ridge=ridge_regressor.predict(x_test) #Prediction on test data

In [None]:
#Evaluating model
lr_MSE  = mean_squared_error(((y_test)**2), (y_pred_ridge)**2)
print("MSE :" , lr_MSE)

lr_RMSE = np.sqrt(lr_MSE)
print("RMSE :" ,lr_RMSE)

lr_r2 = r2_score(((y_test)**2),(y_pred_ridge)**2)
print("R2 :" ,lr_r2)
print("Adjusted R2 : ",1-(1-lr_r2)*((x_test.shape[0]-1)/(x_test.shape[0]-x_test.shape[1]-1)))

On test data, the R2 score is nearly 60%, which is lower than with Lasso regression.

In [None]:
#Visualizing the accuracy of Predicted value with True Value
plt.figure(figsize=(8,5))
plt.scatter((y_test),np.array(y_pred_ridge))
plt.xlabel('True_Values')
plt.ylabel('Ridge_predicted_Value')
plt.show()

The scatter plot above is less dense than the one for Lasso regression, which points toward a decrease in accuracy.

*ElasticNet*

In [None]:
#HyperparameterTuning using ElasticNet
from sklearn.linear_model import ElasticNet
#a * L1 + b * L2
#alpha = a + b and l1_ratio = a / (a + b)
elasticnet = ElasticNet(alpha=0.1, l1_ratio=0.5)

In [None]:
elasticnet.fit(x_train,(y_train)) #Fitting the model

In [None]:
elasticnet.score(x_train, (y_train)) #Evaluating the model

With ElasticNet we get an even lower R2 score of nearly 50% (on the training data), the lowest accuracy of the three.
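
The relation between the two penalty weights and the `alpha` / `l1_ratio` parameters noted in the code comment above can be written out as a small sketch. Both function names are illustrative. The penalty form assumed here is `a * ||w||_1 + 0.5 * b * ||w||_2^2`, which is the convention in the scikit-learn documentation.

```python
def elastic_net_params(a, b):
    """Map penalty weights a (L1 term) and b (L2 term) to (alpha, l1_ratio)."""
    alpha = a + b
    l1_ratio = a / (a + b)
    return alpha, l1_ratio

def elastic_net_penalty(weights, alpha, l1_ratio):
    """alpha * l1_ratio * ||w||_1 + 0.5 * alpha * (1 - l1_ratio) * ||w||_2^2."""
    l1 = sum(abs(w) for w in weights)
    l2_squared = sum(w * w for w in weights)
    return alpha * l1_ratio * l1 + 0.5 * alpha * (1.0 - l1_ratio) * l2_squared

# Equal weights on both terms give the alpha=0.1, l1_ratio=0.5 used above.
alpha, l1_ratio = elastic_net_params(0.05, 0.05)
penalty = elastic_net_penalty([1.0, -2.0, 0.0], alpha, l1_ratio)
print(alpha, l1_ratio, penalty)
```

With `l1_ratio=1` the penalty reduces to Lasso and with `l1_ratio=0` to a Ridge-style penalty, so ElasticNet interpolates between the two models tried earlier.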

From the above evaluation, we can conclude that Lasso regression gives the best result among the regularized linear models.

### ML Model - 2-Random Forest Regressor

*Implementation of Random Forest Regressor*

In [None]:
# Implementation the model
from sklearn.ensemble import RandomForestRegressor

In [None]:
rf = RandomForestRegressor() #Initializing the model
grid_search = GridSearchCV(estimator = rf, param_grid = {'n_estimators':[50,80,100],'max_depth':[3,5,7]}, cv = 3, n_jobs = -1, verbose = 2)
grid_search.fit(x_train,y_train) #fitting the model

*Prediction using Random Forest Regressor*

In [None]:
# predicting for both train and test
y_pred_train2=grid_search.predict(x_train) #Prediction with Train Data
y_pred_test2=grid_search.predict(x_test)  #Prediction with Test Data


*Evaluation of Random Forest Regressor*

In [None]:
#Model Evaluation on Training set
print('The evaluation metric values for training set - Random ForestRegressor with GridSearchCV:')
print('The MAE of training set = ',mean_absolute_error(y_train, y_pred_train2))
print('The MSE of training set = ',mean_squared_error(y_train, y_pred_train2))
print('The R2_score of training set = ',r2_score(y_train, y_pred_train2))

In [None]:
#Model Evaluation on testing set
rf_mae=mean_absolute_error(y_test**2, y_pred_test2**2)
rf_mse=mean_squared_error(y_test**2, y_pred_test2**2)
rf_r2=r2_score(y_test**2,y_pred_test2**2)

print('The evaluation metric values for test set - Random Forest Regressor with GridSearchCV:')
print('The MAE of test set = ',mean_absolute_error(y_test**2, y_pred_test2**2))
print('The MSE of test set = ',mean_squared_error(y_test**2, y_pred_test2**2))
print('The R2_score of test set = ',r2_score(y_test**2,y_pred_test2**2))

Here we get a good result with an R2 score of 77% on test data, much better than Lasso and linear regression.

In [None]:
from sklearn.metrics import roc_auc_score,confusion_matrix,accuracy_score

In [None]:
#Visualizing the accuracy of predicted train data with respect to actual train data
plt.scatter(y_train,y_pred_train2)
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')

In [None]:
#Visualizing the accuracy of predicted test data with respect to actual test data
plt.scatter(y_test,y_pred_test2)
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')

The scatter plots show a roughly linear relationship between predicted and actual values, which indicates good accuracy; the spread is wider on the test data than on the training data.

*Hypertuning And Cross Validation On Random Forest Regression Model*

In [None]:
#Cross Validation And Hyperparameter Tuning
from sklearn.model_selection import GridSearchCV 

In [None]:
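# max_features='auto' was removed in scikit-learn 1.3; for a regressor it meant all features, i.e. 1.0
parameters = {'criterion':['squared_error', 'absolute_error', 'poisson'],'max_features':[1.0, 'sqrt', 'log2']}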

In [None]:
#Performing the grid search using the parameters with cv of 5
grid = GridSearchCV(rf,parameters,cv=5,scoring='neg_mean_squared_error')
#Fitting it on our training dataset
#grid.fit(x_train,y_train)

*Optimal Parameters*

In [None]:
#grid.best_params_ #Optimal Parameter

In [None]:
# max_features=1.0 (all features) replaces the removed 'auto' option
rf_2=RandomForestRegressor(n_estimators=100,random_state=0,criterion='squared_error',max_features=1.0,max_depth=15) #Initializing Tuned optimal model

In [None]:
rf_2.fit(x_train,y_train) #fitting the model

*Prediction on Model*

In [None]:
rf_2_y_train_pred=rf_2.predict(x_train) #Prediction on Train Data

In [None]:
rf_2_y_test_pred=rf_2.predict(x_test) #Prediction on Test Data

*Evaluation*

In [None]:
#Evaluation of test data
rfc_2_t_mae=mean_absolute_error(y_test**2,rf_2_y_test_pred**2)
rfc_2_t_mse=mean_squared_error(y_test**2,rf_2_y_test_pred**2)
rfc_2_t_rmse=np.sqrt(mean_squared_error(y_test**2,rf_2_y_test_pred**2))
rfc_2_r2=r2_score(y_test**2,rf_2_y_test_pred**2)


print('THE MEAN SQUARED ERROR in tuned parameter is : ',mean_squared_error(y_test**2,rf_2_y_test_pred**2))
print('THE MEAN ABSOLUTE ERROR in tuned parameter is : ',mean_absolute_error(y_test**2,rf_2_y_test_pred**2))
print('THE ROOT MEAN SQUARED ERROR in tuned parameter is : ',np.sqrt(mean_squared_error(y_test**2,rf_2_y_test_pred**2)))
print('THE R_2 SCORE in tuned parameter in training data is : ',r2_score(y_train**2,rf_2_y_train_pred**2))
print('THE R_2 SCORE in tuned parameter in test data is : ',r2_score(y_test**2,rf_2_y_test_pred**2))

This gives the best result, with an R2 score of 86% on test data, which is quite a good score and in an acceptable range.

In [None]:
#Visualizing the accuracy of predicted train value with respect to actual train value
plt.scatter(y_train,rf_2_y_train_pred,color='g')
plt.xlabel('Actual Value')
plt.ylabel('Predicted Value')

In [None]:
#Visualizing the accuracy of predicted test value with respect to actual test value
plt.scatter(y_test,rf_2_y_test_pred,color='g')
plt.xlabel('Actual Value')
plt.ylabel('Predicted Value')

The scatter plots above show a tighter linear relationship between predicted and actual values than before, reflecting the higher accuracy.

### ML Model - 3-Polynomial Regression

*Implementation*

In [None]:
from sklearn.preprocessing import PolynomialFeatures
# ML Model - 3 Implementation

# Defining the variables
dependent_variable = 'Rented Bike Count'
independent_variables = list(set(df_pr.columns[1:].tolist()) - {dependent_variable})

x=df_pr.iloc[:,1:]
y=np.sqrt(df_pr.iloc[:,:1])

X_train, X_test, Y_train, Y_test = train_test_split(x, y, test_size = 0.2, random_state = 0)
poly_features = PolynomialFeatures(degree=2) #Initializing the model
X_train_poly = poly_features.fit_transform(X_train)
poly_model = LinearRegression() 
poly_model.fit(X_train_poly, Y_train) #fitting the model
y_train_predicted = poly_model.predict(X_train_poly) #Predicting the model on train data
y_test_predict = poly_model.predict(poly_features.transform(X_test)) #Predicting on test data, reusing the fitted transformer
#Evaluation of the model
poly_mse_test =mean_squared_error(((Y_test)**2), ((y_test_predict)**2))
poly_mae_test=mean_absolute_error(((Y_test)**2),((y_test_predict)**2))
poly_rmse_test=np.sqrt(poly_mse_test)

*Evaluation*

In [None]:
print('MEAN SQUARE ERROR OF TEST DATA : ',poly_mse_test)
print('MEAN ABSOLUTE ERROR OF TEST DATA : ',poly_mae_test)
print('ROOT MEAN SQUARE ERROR OF TEST DATA : ',poly_rmse_test)

In [None]:
r2_poly_train=r2_score(((Y_train)**2),((y_train_predicted)**2)) #Evaluation
r2_poly_test=r2_score(((Y_test)**2),((y_test_predict)**2))

In [None]:
print('r2_score of polynomial_train_data',r2_poly_train)
print('r2_score of polynomial_test_data',r2_poly_test)

Its R2 score of 72% on test data is quite acceptable, but lower than that of the random forest regression model.
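
To see what the degree-2 expansion used above adds to the feature set, the sketch below lists the terms by hand for three example features. The helper `degree2_terms` is a hypothetical illustration, not the scikit-learn implementation, and the feature names are placeholders. With p input features and a bias column, a full degree-2 expansion has C(p + 2, 2) columns.

```python
from itertools import combinations_with_replacement
from math import comb

def degree2_terms(names):
    """Bias, linear terms, then all squares and pairwise products."""
    terms = ["1"] + list(names)
    for a, b in combinations_with_replacement(names, 2):
        terms.append(f"{a}^2" if a == b else f"{a}*{b}")
    return terms

names = ["Hour", "Temperature", "Humidity"]
terms = degree2_terms(names)
print(terms)
print(len(terms), "columns; formula gives", comb(len(names) + 2, 2))
```

The column count grows quadratically with the number of inputs, which is one reason a polynomial model can overfit or become slow on wide feature sets.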

In [None]:
#Vizualizing the predicted train data with respect to the actual train data
plt.scatter(Y_train,y_train_predicted)
plt.xlabel('Actual value'),plt.ylabel('Predicted value')
plt.title('Training Error')
plt.show()

In [None]:
#Vizualising the predicted test data with respect to the actual test data
plt.scatter(Y_test,y_test_predict)
plt.xlabel('Actual value'),plt.ylabel('Predicted value')
plt.title('Test Error')
plt.show()

From the above Scatter Plot we can visualize the accuracy of the Predicted Values with respect to Actual Values.

In [None]:
#Visualizing error density
error_poly=((Y_test)**2)-((y_test_predict)**2)
sns.distplot(error_poly)

In [None]:

print("Skewness: %f" % error_poly.skew())
print("Kurtosis: %f" % error_poly.kurt())

Although the skewness is within an acceptable range, the kurtosis is quite high.

The error density is roughly bell-shaped, but the high kurtosis means its tails are heavier than those of a normal distribution.

# **Comparison**

*Mean Squared Error*

In [None]:
#Comparison of different model with respect to following metrics
#MEAN SQUARE ERROR
# lr_MSE comes from the tuned Ridge model evaluated above
model=['Linear Regression (Ridge)','RANDOM FOREST REGRESSION','TUNED RFC','POLYNOMIAL REGRESSION']
acc=[lr_MSE,rf_mse,rfc_2_t_mse,poly_mse_test]
plt.figure(figsize=(12,8))
sns.barplot(x=model,y=acc)
plt.xlabel('Model')
plt.ylabel('MEAN SQUARED ERROR')


The bar plot above shows that the tuned random forest regression model gives the lowest mean squared error, at less than 60000, while the linear regression model gives the highest, at more than 140000.

*R2_Score*

In [None]:
#R2_SCORE
# lr_r2 comes from the tuned Ridge model evaluated above
model=['Linear Regression (Ridge)','RANDOM FOREST REGRESSION','TUNED RFC','POLYNOMIAL REGRESSION']
acc=[lr_r2,rf_r2,rfc_2_r2,r2_poly_test]
plt.figure(figsize=(12,8))
sns.barplot(x=model,y=acc)
plt.xlabel('Model')
plt.ylabel('R2_SCORE')


Comparing the R2 scores of the different models, the tuned random forest model gives the highest accuracy at more than 85%, while linear regression gives the lowest at less than 60%.

# Analysing importance of different features

The best-fit model here is the random forest model, so we now try to find the importance of its features.

In [None]:
#feature importance in the grid-searched random forest regressor
rf_optimal_model=grid_search.best_estimator_
rf_optimal_model

In [None]:
rf_optimal_model.feature_importances_

In [None]:
importances=rf_optimal_model.feature_importances_

In [None]:
importance_dict = {'Feature' : list(x.columns),
                   'Feature Importance' : importances}

importance_df = pd.DataFrame(importance_dict)

In [None]:
importance_df['Feature Importance'] = round(importance_df['Feature Importance'],2)

In [None]:
importance_df=importance_df.sort_values(by=['Feature Importance'],ascending=False)

In [None]:
importance_df.head(10).reset_index()

In [None]:

# visualizing feature importance 
plt.figure(figsize=(8,8))
plt.title('Feature Importance')
sns.barplot(x=importance_df['Feature Importance'],y=importance_df['Feature'],hue=importance_df['Feature'])

From the bar chart above we can see that Temperature and Hour play the largest role in affecting the demand for rental bikes.

# Model Explainability

*Explanation of the model using ELI5.*

In [None]:
%pip install eli5

In [None]:
import eli5 as eli

In [None]:
eli.explain_weights(rf_2)

In [None]:
eli.explain_prediction(rf_2 , np.array(x_test)[1])

In [None]:
eli.show_prediction(rf_2, x_test[1],
                    feature_names=list(x.columns),
                    show_feature_values=True)

According to ELI5, temperature and hour have a negative contribution to the predicted demand at higher values.