# Bike Sharing Demand Prediction With Regression Models and GridSearchCV

## Business Understanding
#### Project Goal:
The goal of the project is to use the bike share dataset to predict bike sharing demand on a daily basis by building regression models and tuning them with GridSearchCV.
#### Practical use:
* The project can help bike sharing companies solve real-time problems such as:
    * Planning supply better once the expected demand is known.
    * Planning for the seasons with high demand.
    * Planning for the days and hours with high demand.
    * Increasing profit by managing the bikes according to demand.
  
 Data Source: https://cycling.data.tfl.gov.uk/


In [None]:

import pandas as pd
import numpy as np
import math
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.linear_model import ElasticNet

from sklearn.metrics import make_scorer

from pandas import DataFrame
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict
from sklearn.model_selection import GridSearchCV 

## Data Understanding

The DataFrame has the following features:

**timestamp** :timestamp field for grouping the data\
**cnt** :the count of new bike shares\
**t1** :real temperature in C\
**t2** :"feels like" temperature in C\
**hum** :humidity in percentage\
**wind_speed** :wind speed in km/h\
**weather_code** :category of the weather\
**is_holiday** :boolean field - 1 holiday / 0 non holiday\
**is_weekend** :boolean field - 1 weekend / 0 weekday\
**season** :category field for meteorological seasons:
   > 0-spring\
   > 1-summer\
   > 2-fall\
   > 3-winter.
   
**weather_code** :category description:
   > 1 = Clear or mostly clear, but some values include haze, fog, patches of fog or fog in the vicinity.\
   > 2 = Scattered clouds or few clouds.\
   > 3 = Broken clouds.\
   > 4 = Cloudy.\
   > 7 = Rain, light rain shower or light rain.\
   > 10 = Rain with thunderstorm.\
   > 26 = Snowfall.\
   > 94 = Freezing fog.
    

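The category codes listed above can be kept next to the data as plain lookup tables. The following is a minimal sketch, assuming shortened label strings and a hypothetical three-row sample frame (neither is part of the original notebook); it only shows how the codes could be mapped to readable labels for plots or reports.

```python
import pandas as pd

# Labels transcribed (and shortened) from the feature description above.
SEASON_LABELS = {0: "spring", 1: "summer", 2: "fall", 3: "winter"}
WEATHER_LABELS = {
    1: "clear",
    2: "scattered clouds",
    3: "broken clouds",
    4: "cloudy",
    7: "rain",
    10: "rain with thunderstorm",
    26: "snowfall",
    94: "freezing fog",
}

# Hypothetical sample rows standing in for the real dataset.
sample = pd.DataFrame({"season": [0, 3, 1], "weather_code": [1, 26, 7]})

# Map the numeric codes to human-readable labels.
sample["season_label"] = sample["season"].map(SEASON_LABELS)
sample["weather_label"] = sample["weather_code"].map(WEATHER_LABELS)
print(sample)
```

Keeping the mapping separate from the model features leaves the numeric codes untouched for the later encoding step.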

In [None]:
# reading the data from the CSV file into a data frame
data_frame = pd.read_csv('../input/bike-sharing/Bike_sharing.csv')

In [None]:
# displaying the first few rows of the dataset
data_frame.head()

In [None]:
# displaying the info about the dataset
data_frame.info()

In [None]:
# finding the null values in the dataset
data_frame.isna().sum()

In [None]:
# function to handle the date column in the data frame
def Handling_time_feature(df, column):
    # parse the timestamp once, then split it into separate features
    timestamps = pd.to_datetime(df[column])
    df['year'] = timestamps.dt.year
    df['month'] = timestamps.dt.month
    df['day'] = timestamps.dt.day
    df['hour'] = timestamps.dt.hour
    df = df.drop(columns=[column])  # dropping the original date column
    return df


# calling Handling_time_feature() to split the timestamp feature into separate features
data_frame = Handling_time_feature(data_frame, 'timestamp')
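As a quick illustration of the timestamp split, here is a small sketch on a hypothetical two-row frame (the timestamps and counts are made up for the example). It uses the pandas `.dt` accessor, which is one common way to extract the same year, month, day and hour parts.

```python
import pandas as pd

# Hypothetical rows; the values are illustrative only.
toy = pd.DataFrame(
    {
        "timestamp": ["2015-01-04 00:00:00", "2016-07-15 17:00:00"],
        "cnt": [182, 3500],
    }
)

# Parse the strings into datetimes, then pull out the calendar parts.
parsed = pd.to_datetime(toy["timestamp"])
toy["year"] = parsed.dt.year
toy["month"] = parsed.dt.month
toy["day"] = parsed.dt.day
toy["hour"] = parsed.dt.hour

# The raw timestamp is no longer needed once the parts are extracted.
toy = toy.drop(columns=["timestamp"])
print(toy)
```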

## Exploratory Data Analysis

In [None]:
plt.figure(figsize=(18, 12))
heatmap = sns.heatmap(data_frame.corr(), vmin=-1, vmax=1, annot=True, cmap= 'YlGnBu')
heatmap.set_title('Correlation Heatmap', fontdict={'fontsize':25}, pad=25);

* From the above correlation matrix we can see a positive linear relationship between cnt and the features t1, t2, hour and wind_speed.
* We can also see a negative linear relationship between hum and cnt.
* The remaining features show little or no linear relationship with cnt.
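To read the heatmap numerically, the column of correlations with the target can be extracted and sorted. The sketch below uses synthetic data (the coefficients and noise level are assumptions chosen only so that temperature correlates positively and humidity negatively with the count); it is not the notebook's dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data: count rises with temperature and falls with humidity.
rng = np.random.default_rng(0)
temp = rng.uniform(0, 30, 500)
hum = rng.uniform(30, 100, 500)
cnt = 50 * temp - 10 * hum + rng.normal(0, 100, 500)
toy = pd.DataFrame({"t1": temp, "hum": hum, "cnt": cnt})

# Pearson correlation of every feature with the target, strongest positive first.
corr_with_cnt = toy.corr()["cnt"].drop("cnt").sort_values(ascending=False)
print(corr_with_cnt)
```

Applied to the real frame, the same expression would list the features in the order discussed in the bullets above.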

In [None]:
# plots for Bike share count VS Temperature, Temperature feels like, Humidity, Windspeed
fig,ax = plt.subplots(2,2, figsize=(15,15))
plot = sns.scatterplot(x="cnt", y="t1",hue = 'season',data=data_frame,ax= ax[0,0])
plot.set_title("Bike Share count VS Temperature")
plot = sns.scatterplot(x="cnt", y="t2",hue = 'season', palette="ch:r=-.5,l=.75",data=data_frame,ax= ax[0,1])
plot.set_title("Bike Share count VS Temperature feels like")
plot = sns.scatterplot(x="cnt", y="hum",hue = 'season', palette="ch:r=-.5,l=.75",data=data_frame,ax= ax[1,0])
plot.set_title("Bike Share count VS Humidity")
plot = sns.scatterplot(x="cnt", y="wind_speed",hue = 'season',data=data_frame,ax= ax[1,1])
plot.set_title("Bike Share count VS Windspeed")

#### Temperature VS count in different seasons:
* From the above plots we can see that bike sharing is high in seasons 1 and 2 (**Summer, Fall**), moderate in season 0 (**Spring**) and low in season 3 (**Winter**).
#### Temperature feels like VS count in different seasons:
* The "feels like" temperature shows the same pattern: bike sharing is high in seasons 1 and 2 (**Summer, Fall**), moderate in season 0 (**Spring**) and low in season 3 (**Winter**).

In [None]:
# Bike share based on season and temperature
# (relplot is a figure-level function and creates its own figure)
plot= sns.relplot(data=data_frame, x='cnt', y='t1', col="season", hue="t1", height=5, aspect=.65)

In [None]:
# Bike share based on season and temperature feels like
plot= sns.relplot(data=data_frame, x='cnt', y='t2', col="season", hue="t2", height=5, aspect=0.65)

In [None]:
# Bike share based on season and Humidity
plot= sns.relplot(data=data_frame, x='cnt', y='hum', col="season", hue="t2", height=5, aspect=0.65)

In [None]:
# Bike share based on season and Wind speed
plot= sns.relplot(data=data_frame, x='cnt', y='wind_speed', col="season", hue="t2", height=5, aspect=0.65)

In [None]:
# Bike share count by weekend, month, hour, holiday, season and weather code
fig, ax = plt.subplots(3,2, figsize=(12,12))
sns.boxplot(x="is_weekend", y="cnt", data=data_frame, ax=ax[0,0]) 
sns.boxplot(x="month", y="cnt", data=data_frame, ax=ax[0,1]) 
sns.boxplot(x="hour", y="cnt", data=data_frame, ax=ax[1,0]) 
sns.boxplot(x="is_holiday", y="cnt", data=data_frame, ax=ax[1,1]) 
sns.boxplot(x="season", y="cnt", data=data_frame, ax=ax[2,0]) 
sns.boxplot(x="weather_code", y="cnt", data=data_frame, ax=ax[2,1])
# boxplot draws on the given axes, so there are no extra figures to close
fig.tight_layout()


#### Weekend VS count:
* The boxplot shows that the median bike share count is higher on weekdays (**0**) than on weekends (**1**).

#### Month VS count:
* The boxplot shows that the median bike share count is highest in months 5 to 9, moderate in months 4 and 10, and low in the remaining months.

#### Hour VS count:
* The boxplot shows that the median bike share count is highest during the hours 7, 8, 9, 17, 18 and 19.

#### Holiday VS count:
* The boxplot shows that the median bike share count is higher on non-holidays (**0**) than on holidays (**1**).

#### Season VS count:
* The boxplot shows that the median bike share count is highest in season **1** (summer).

#### Weather VS count:
* The boxplot shows that the median bike share count is highest for weather_code **1, 2, 3**.
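The medians read off the boxplots can also be computed directly with a group-by. This is a minimal sketch on a hypothetical six-row frame (the counts are invented for illustration), showing the weekday versus weekend comparison.

```python
import pandas as pd

# Hypothetical counts: three weekday rows (0) and three weekend rows (1).
toy = pd.DataFrame(
    {
        "is_weekend": [0, 0, 0, 1, 1, 1],
        "cnt": [900, 1100, 1000, 600, 800, 700],
    }
)

# Median count per group, the same statistic the boxplot's centre line shows.
median_by_weekend = toy.groupby("is_weekend")["cnt"].median()
print(median_by_weekend)
```

Replacing `"is_weekend"` with `"hour"`, `"month"` or `"season"` gives the other comparisons in the same way.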

## Data Preparation

#### Normalization with MinMaxScaler:
 MinMaxScaler scales each input variable separately to the range 0-1. In the above data frame the features 't1', 't2', 'hum' and 'wind_speed' are on a much larger scale than the other features, so we apply MinMaxScaler to these features to normalize the data.

#### One-hot encoding with pandas:
One-hot encoding represents a categorical variable as a set of binary indicator columns, one per category, so that the model does not treat the category codes as ordered numbers.
We apply one-hot encoding to the categorical features 'weather_code', 'season', 'is_weekend' and 'is_holiday'.
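The two preparation steps can be sketched on a small hypothetical frame. Note one deliberate difference from this notebook, which scales the full data frame before splitting: the sketch fits the scaler on the training rows only and reuses it for the test rows, a common way to keep test information out of preprocessing. The column values and split size here are assumptions for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Hypothetical data: one numeric feature, one categorical feature, one target.
toy = pd.DataFrame(
    {
        "t1": [0.0, 5.0, 10.0, 15.0, 20.0, 25.0, 30.0, 35.0],
        "season": [3, 3, 0, 0, 1, 1, 2, 2],
        "cnt": [200, 300, 700, 900, 1500, 1700, 1200, 1100],
    }
)

# One-hot encode the categorical column into season_0 ... season_3.
toy = pd.get_dummies(toy, columns=["season"], prefix="season")

# Split first, then fit the scaler on the training rows only.
train, test = train_test_split(toy, test_size=0.25, random_state=42)
train, test = train.copy(), test.copy()
scaler = MinMaxScaler().fit(train[["t1"]])
train["t1"] = scaler.transform(train[["t1"]]).ravel()
test["t1"] = scaler.transform(test[["t1"]]).ravel()

print(train.head())
```

With this ordering the test rows may fall slightly outside the 0-1 range, which is expected because the scaler never saw them.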

In [None]:
# function to normalize the features
def Normalize_Function(df, column_list):
    # apply min-max scaling to the numerical features
    for column in column_list:
        # fit the scaler on the column (this uses the full data frame, before the train/test split)
        scale = MinMaxScaler().fit(df[[column]])    
        # transform the column to the range 0-1
        df[column] = scale.transform(df[[column]])
    
    return df        
 

# columns required to scale
columns_to_scale = ['t1', 't2','hum', 'wind_speed']

# calling the  Normalize_Function to scale the features
data_frame = Normalize_Function(data_frame, columns_to_scale)


# function for one-hot encoding a categorical column
def oneHot_Encoder(data_frame, column, prefix):
    # build one indicator column per category of the given column
    temp_df = pd.get_dummies(data_frame[[column]], columns=[column], prefix=[prefix])
    # attach the indicator columns and drop the original categorical column
    data_frame = data_frame.join(temp_df)
    data_frame = data_frame.drop(columns=[column])

    return data_frame

# converting the features into encoded features based on each category
data_frame= oneHot_Encoder(data_frame,'weather_code','W')
data_frame = oneHot_Encoder(data_frame,'season','season')
data_frame = oneHot_Encoder(data_frame,'is_weekend','Weekend')
data_frame = oneHot_Encoder(data_frame,'is_holiday','Holiday')
data_frame.head(5)

## Modeling and Performance Tuning

In [None]:
# splitting the data frame into features and target variable for training and testing the model
X = data_frame.drop(columns=['cnt'])
y = data_frame['cnt']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)
print(data_frame.shape)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

#### Performance evaluation metrics MSE, RMSE

In [None]:
# Function to Generate the MSE, RMSE
def mse_rmse(trues, preds):
    '''
    Compute MSE and rMSE for each column separately.
    '''
    mse = np.sum(np.square(trues - preds), axis=0) / trues.shape[0]
    rmse = np.sqrt(mse)
    return mse, rmse


def rmse_scorer(trues, preds):
    '''
    Compute rMSE
    '''
    mse, rmse = mse_rmse(trues, preds)
    return rmse
# Make the scoring function for GridSearch
rmse_scoring = make_scorer(rmse_scorer, greater_is_better=False)
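As a sanity check on the metric, the sketch below computes MSE and RMSE by hand on four made-up values and compares the MSE with scikit-learn's `mean_squared_error`. It also shows why the grid-search scores are negated later: a scorer built with `greater_is_better=False` returns the negative of the metric. The arrays and the helper name `rmse_metric` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import make_scorer, mean_squared_error

# Hypothetical true and predicted counts.
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 330.0, 400.0])

# Squared errors are 100, 100, 900 and 0, so the MSE is 1100 / 4 = 275.
mse = np.mean(np.square(y_true - y_pred))
rmse = np.sqrt(mse)
print(mse, rmse, mean_squared_error(y_true, y_pred))


def rmse_metric(trues, preds):
    """Root mean squared error of a single target column."""
    return np.sqrt(np.mean(np.square(trues - preds)))


# With greater_is_better=False the scorer returns -RMSE, so larger is better.
scorer = make_scorer(rmse_metric, greater_is_better=False)
X = np.arange(4.0).reshape(-1, 1)
model = LinearRegression().fit(X, y_pred)
neg_rmse = scorer(model, X, y_pred)
print(neg_rmse)
```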

In [None]:
# Creating the dataframe for model evaluation metric results
Models_Performance = pd.DataFrame(columns=['Model', 'Train RMSE','Tuning Parameters', 'Test RMSE'])

In [None]:
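def Alpha_Generator(RangeMin, RangeMax):
    # generate 20 log-spaced alpha values starting at RangeMin and moving towards RangeMax
    # (np.arange excludes the end point, so RangeMax itself is not part of the list)
    base = 2 
    steps = 20
    bottom = math.log(RangeMin, base)
    top = math.log(RangeMax, base)
    exps = np.arange(bottom, top, (top - bottom) / steps)
    alphas = [np.power(base, ex) for ex in exps]
    return alphas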
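The alpha grid above is a geometric (log-spaced) sequence. As a hedged alternative, NumPy's `np.geomspace` with `endpoint=False` should produce the same values without manual logarithms. The sketch re-implements the generator under the hypothetical name `alpha_generator` and compares the two. Because `np.arange` with a floating-point step can occasionally return one element more or fewer than expected, the comparison uses only the first 20 values.

```python
import math

import numpy as np


def alpha_generator(range_min, range_max, steps=20, base=2):
    """Log-spaced values from range_min towards range_max (end excluded)."""
    bottom = math.log(range_min, base)
    top = math.log(range_max, base)
    exps = np.arange(bottom, top, (top - bottom) / steps)
    return [np.power(base, ex) for ex in exps]


# Grid built with the manual generator.
alphas_loop = alpha_generator(1e-4, 1)

# Same grid built with NumPy's geometric spacing helper.
alphas_geom = np.geomspace(1e-4, 1, num=20, endpoint=False)

print(np.allclose(alphas_loop[:20], alphas_geom))
```

`np.geomspace` takes the number of points directly, which avoids the floating-point step issue.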

#### Linear Regression Model

In [None]:
# Linear Regression Model

lm = LinearRegression().fit(X_train, y_train)

# applying the linear model on train data
lm_pred_train=lm.predict(X_train)
Linear_Train_MSE, Linear_Train_RMSE=  mse_rmse(y_train, lm_pred_train)

# applying the linear model on test data
lm_pred=lm.predict(X_test)
Linear_MSE, Linear_RMSE=  mse_rmse(y_test, lm_pred)

data =[{'Model': 'Linear_Regression','Train RMSE':Linear_Train_RMSE,'Tuning Parameters':'none','Test RMSE': Linear_RMSE }]
# DataFrame.append was removed in pandas 2.0, so the row is added with pd.concat
Models_Performance = pd.concat([Models_Performance, pd.DataFrame(data)], ignore_index=True)

Models_Performance

#### Lasso

In [None]:
lasso = Lasso()
lasso.fit(X_train, y_train)

# applying the lasso model on train data
lasso_pred_train=lasso.predict(X_train)
lasso_Train_MSE, lasso_Train_RMSE=  mse_rmse(y_train, lasso_pred_train)

# applying the lasso model on test data
lasso_pred=lasso.predict(X_test)
lasso_MSE, lasso_RMSE=  mse_rmse(y_test, lasso_pred)


data =[{'Model': 'Lasso','Train RMSE':lasso_Train_RMSE,'Tuning Parameters':'none','Test RMSE': lasso_RMSE }]
# DataFrame.append was removed in pandas 2.0, so the row is added with pd.concat
Models_Performance = pd.concat([Models_Performance, pd.DataFrame(data)], ignore_index=True)
Models_Performance

#### Lasso With GridSearchCV

In [None]:
# generating alpha values for lasso
alphas=Alpha_Generator(1e-4, 1)
alphas

In [None]:
# Applying the GridsearchCV with alpha and CrossValidation
param_grid = [{'alpha':alphas}]
lasso_grid_search = GridSearchCV(estimator=lasso,
                           param_grid=param_grid,
                           scoring=rmse_scoring,
                           n_jobs=-1,
                           verbose= 1,cv=10)

lasso_grid_search.fit(X_train, y_train)

In [None]:
# retrieving the stats from GridSearchCV
stats = lasso_grid_search.cv_results_

In [None]:
Lasso_df = DataFrame(stats)
Lasso_df

In [None]:
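# plot for RMSE vs Alpha for Lasso
scores = (-stats["mean_test_score"])
alpha = np.asarray(stats["param_alpha"], dtype=float)  # masked array -> plain float array

plt.figure(figsize=(8, 5))
# recent seaborn versions require x and y to be passed as keyword arguments
sns.lineplot(x=alpha, y=scores)
plt.xlabel('Alpha value')
plt.ylabel('RMSE')
plt.title("RMSE vs Alpha for LASSO")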

In [None]:
print("Best Tuning Parameter:",lasso_grid_search.best_params_)
print("Best RMSEscore",-lasso_grid_search.best_score_)

In [None]:
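# best_estimator_ is already refit on the full training data (refit=True by default)
lasso_bestmodel = lasso_grid_search.best_estimator_

In [None]:
# note: 'Train RMSE' reuses the value of the untuned Lasso model, and 'Test RMSE' stores the
# mean cross-validated RMSE (-best_score_) from the training folds, not the held-out test RMSE
data =[{'Model': 'Lasso_with_GridsearchCV','Train RMSE':lasso_Train_RMSE,'Tuning Parameters':lasso_grid_search.best_params_ ,'Test RMSE': -lasso_grid_search.best_score_ }]
# DataFrame.append was removed in pandas 2.0, so the row is added with pd.concat
Models_Performance = pd.concat([Models_Performance, pd.DataFrame(data)], ignore_index=True)
Models_Performance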

#### Ridge

In [None]:
ridge = Ridge()
ridge.fit(X_train, y_train)

# applying the ridge model on train data
Ridge_pred_train=ridge.predict(X_train)
Ridge_Train_MSE, Ridge_Train_RMSE=  mse_rmse(y_train, Ridge_pred_train)


# applying the ridge model on test data
Ridge_pred=ridge.predict(X_test)
Ridge_MSE, Ridge_RMSE=  mse_rmse(y_test, Ridge_pred)


data =[{'Model': 'Ridge','Train RMSE':Ridge_Train_RMSE,'Tuning Parameters':'none','Test RMSE': Ridge_RMSE }]
# DataFrame.append was removed in pandas 2.0, so the row is added with pd.concat
Models_Performance = pd.concat([Models_Performance, pd.DataFrame(data)], ignore_index=True)
Models_Performance

#### Ridge with GridSearchCV

In [None]:
# generating alpha values
alphas=Alpha_Generator(1, 1e-6)
alphas

In [None]:
# Applying the GridsearchCV with alpha and CrossValidation

param_grid = [{'alpha':alphas}]
ridge_grid_search = GridSearchCV(estimator=ridge,
                           param_grid=param_grid,
                           scoring=rmse_scoring,
                           n_jobs=-1,
                           verbose= 1,cv=10)

ridge_grid_search.fit(X_train, y_train)

In [None]:
# retrieving the stats from GridSearchCV
stats = ridge_grid_search.cv_results_

In [None]:
Ridge_df = DataFrame(stats)
Ridge_df

In [None]:
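# plot for RMSE vs Alpha for Ridge
scores = (-stats["mean_test_score"])
alpha = np.asarray(stats["param_alpha"], dtype=float)  # masked array -> plain float array

plt.figure(figsize=(8, 5))
# recent seaborn versions require x and y to be passed as keyword arguments
sns.lineplot(x=alpha, y=scores)
plt.xlabel('Alpha value')
plt.ylabel('RMSE')
plt.title("RMSE vs Alpha for Ridge")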

In [None]:
print("Best Tuning Parameter:",ridge_grid_search.best_params_)
print("Best RMSEscore",-ridge_grid_search.best_score_)

In [None]:
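# best_estimator_ is already refit on the full training data (refit=True by default)
ridge_bestmodel = ridge_grid_search.best_estimator_

In [None]:
# note: 'Train RMSE' reuses the value of the untuned Ridge model, and 'Test RMSE' stores the
# mean cross-validated RMSE (-best_score_) from the training folds, not the held-out test RMSE
data =[{'Model': 'Ridge_with_GridsearchCV','Train RMSE':Ridge_Train_RMSE,'Tuning Parameters':ridge_grid_search.best_params_ ,'Test RMSE': -ridge_grid_search.best_score_ }]
# DataFrame.append was removed in pandas 2.0, so the row is added with pd.concat
Models_Performance = pd.concat([Models_Performance, pd.DataFrame(data)], ignore_index=True)
Models_Performance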

#### Elastic Net

In [None]:
elastic = ElasticNet(alpha=0.002, l1_ratio= 0.5)
elastic.fit(X_train, y_train)

# applying the elastic net model on train data
elastic_pred_train=elastic.predict(X_train)
elastic_Train_MSE, elastic_Train_RMSE=  mse_rmse(y_train, elastic_pred_train)

# applying the elastic net model on test data
elastic_pred=elastic.predict(X_test)
elastic_MSE, elastic_RMSE=  mse_rmse(y_test, elastic_pred)


data =[{'Model': 'Elastic','Train RMSE':elastic_Train_RMSE,'Tuning Parameters':'none','Test RMSE': elastic_RMSE }]
# DataFrame.append was removed in pandas 2.0, so the row is added with pd.concat
Models_Performance = pd.concat([Models_Performance, pd.DataFrame(data)], ignore_index=True)
Models_Performance

#### Elastic Net with GridSearchCV

In [None]:
# Generating the alphas and L1 ratios
alphas =Alpha_Generator(1, 1e-6)
print(alphas)
# six evenly spaced ratios from 0 to 1; linspace avoids floating-point step drift
l1_ratios = np.linspace(0, 1, 6)
print(l1_ratios)

In [None]:
# Applying the GridsearchCV with alpha and CrossValidation
param_grid = [{'alpha':alphas, 'l1_ratio': l1_ratios}]
elastic_grid_search = GridSearchCV(estimator=elastic,
                           param_grid=param_grid,
                           scoring=rmse_scoring,
                           n_jobs=-1,
                           verbose= 1,cv=10)

elastic_grid_search.fit(X_train, y_train)

In [None]:
# retrieving the stats from GridSearchCV
stats = elastic_grid_search.cv_results_

# best_estimator_ is already refit on the full training data (refit=True by default)
elastic_bestmodel = elastic_grid_search.best_estimator_


In [None]:
elastic_bestmodel.predict(X_test)

In [None]:
elastic_df = DataFrame(stats)
elastic_df

In [None]:
print("Best Tuning Parameter:",elastic_grid_search.best_params_)
print("Best RMSEscore",-elastic_grid_search.best_score_)

In [None]:
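# note: 'Train RMSE' reuses the value of the untuned Elastic Net model, and 'Test RMSE' stores the
# mean cross-validated RMSE (-best_score_) from the training folds, not the held-out test RMSE
data =[{'Model': 'Elastic_with_GridsearchCV','Train RMSE':elastic_Train_RMSE,'Tuning Parameters':elastic_grid_search.best_params_ ,'Test RMSE': -elastic_grid_search.best_score_ }]
# DataFrame.append was removed in pandas 2.0, so the row is added with pd.concat
Models_Performance = pd.concat([Models_Performance, pd.DataFrame(data)], ignore_index=True)
Models_Performance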

## Modeling and Performance Tuning
#### Regression Models

* Predicting the bike sharing demand on a daily basis with linear regression models, and using Root Mean Square Error (RMSE) as the evaluation metric, gave the errors listed below for each model.
* Without hyperparameter tuning, the best result came from the Linear Regression model, with a Test RMSE of 905.633076 and a Train RMSE of 885.999083.
		
|     Model          | Train RMSE         | Tuning Parameters  |      Test RMSE     |
| :------------------| :------------------| :------------------| :------------------|
| Linear_Regression  | 885.999083         | none               | 905.633076         |
| Lasso              | 886.790976         | none               | 906.141956         |
| Ridge              | 886.086497         | none               | 905.764295         |
| Elastic            | 886.944678         | none               | 906.506568         |

#### Regression Models With GridSearchCV
* **Lasso With GridSearchCV**

	* Using GridSearchCV with 10-fold cross-validation over different alpha values gave the RMSE values shown in the "RMSE vs Alpha for LASSO" plot.
	* The RMSE increases as alpha increases from its smallest value.
	* The optimal alpha, where the RMSE is lowest, is 'alpha': 0.0010000000000000024.

* **Ridge With GridSearchCV**

	* Using GridSearchCV with 10-fold cross-validation over different alpha values gave the RMSE values shown in the "RMSE vs Alpha for Ridge" plot.
	* The RMSE increases as alpha increases from its smallest value.
	* The optimal alpha, where the RMSE is lowest, is 'alpha': 0.03162277660168379.
	
* **Elastic Net With GridSearchCV**
	* Using GridSearchCV with 10-fold cross-validation over different alpha and l1_ratio values gave a mean cross-validated RMSE of 886.851187.
	* The optimal solution is at 'alpha': 1.5848931924611134e-05, 'l1_ratio': 0.8.
	
|     Model                            | Train RMSE         | Tuning Parameters                                  |      Test RMSE     |
| :----------------------------------- | :------------------| :------------------------------------------------- | :------------------|
| Linear_Regression                    | 885.999083         | none                                               | 905.633076         |
| Lasso_with_GridsearchCV              | 886.790976         | {'alpha': 0.0010000000000000024}                   | 886.851327         |
| Ridge_with_GridsearchCV              | 886.086497         | {'alpha': 0.03162277660168379}                     | 886.851185         |
| Elastic_with_GridsearchCV            | 886.944678         | {'alpha': 1.5848931924611134e-05, 'l1_ratio': 0.8} | 886.851187         |
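In the notebook code, the 'Test RMSE' entries for the GridSearchCV rows are taken from `-best_score_`, which is the mean cross-validated RMSE on the training folds, and the 'Train RMSE' entries repeat the untuned models' values. The sketch below shows one way the tuned model could instead be scored on the held-out split. It runs on synthetic data from `make_regression` and uses scikit-learn's built-in `"neg_root_mean_squared_error"` scorer, so the printed numbers have no relation to the table above; the helper name `rmse` is an assumption.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic regression data standing in for the bike sharing features.
X, y = make_regression(n_samples=300, n_features=8, noise=20.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)

# Grid search over log-spaced alphas with 10-fold cross-validation.
search = GridSearchCV(
    Ridge(),
    {"alpha": np.geomspace(1e-6, 1, num=20)},
    scoring="neg_root_mean_squared_error",
    cv=10,
)
search.fit(X_train, y_train)


def rmse(trues, preds):
    """Root mean squared error as a plain float."""
    return float(np.sqrt(np.mean(np.square(trues - preds))))


# best_estimator_ is refit on all training rows because refit=True by default.
best = search.best_estimator_
cv_rmse = -search.best_score_                      # mean RMSE over the CV folds
train_rmse = rmse(y_train, best.predict(X_train))  # fit quality on training rows
test_rmse = rmse(y_test, best.predict(X_test))     # held-out estimate

print(search.best_params_, cv_rmse, train_rmse, test_rmse)
```

Reporting the three numbers separately makes the tuned and untuned rows comparable on the same held-out data.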

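## Model Evaluation Metrics

The table below shows that, among the tuned models, Ridge_with_GridsearchCV and Elastic_with_GridsearchCV have the lowest RMSE values. Note that for the GridSearchCV rows the 'Test RMSE' column holds the mean cross-validated RMSE on the training data (`-best_score_`) and the 'Train RMSE' column repeats the value of the untuned model, so these rows are not directly comparable with the held-out Test RMSE of the untuned models.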

In [None]:
# displaying the evaluation metrics table without truncating the parameter column
# (max_colwidth=None replaces the deprecated value -1)
with pd.option_context('display.max_colwidth', None):
    display(Models_Performance)

## Conclusion

With this project we can predict bike sharing demand on a daily basis using linear, Lasso, Ridge and Elastic Net regression models, with RMSE values between roughly 886 and 907 across the models. Bike sharing companies can use these predictions to manage their bikes according to demand, which can help increase profits.