# BOOM_BIKE_LINEAR_REGRESSION

Problem Statement:

A bike-sharing system is a service in which bikes are made available for shared use to individuals on a short term basis for a price or free. Many bike share systems allow people to borrow a bike from a "dock" which is usually computer-controlled wherein the user enters the payment information, and the system unlocks it. This bike can then be returned to another dock belonging to the same system.

A US bike-sharing provider, BikeIndia, has recently suffered considerable dips in its revenues due to the ongoing Corona pandemic. The company is finding it very difficult to sustain itself in the current market scenario. So, it has decided to come up with a mindful business plan to be able to accelerate its revenue as soon as the ongoing lockdown comes to an end and the economy returns to a healthy state.

In such an attempt, BikeIndia aspires to understand the demand for shared bikes among the people once the nationwide Covid-19 quarantine ends. They have planned this to prepare themselves to cater to people's needs once the situation gets better all around, to stand out from other service providers, and to make huge profits.

They have contracted a consulting company to understand the factors on which the demand for these shared bikes depends. Specifically, they want to understand the factors affecting the demand for these shared bikes in the American market. The company wants to know:

- Which variables are significant in predicting the demand for shared bikes.
- How well those variables describe the bike demand.

Based on various meteorological surveys and people's styles, the service provider firm has gathered a large dataset on daily bike demand across the American market based on some factors.


Business Goal:

We are required to model the demand for shared bikes with the available independent variables. It will be used by the management to understand how exactly the demand varies with different features. They can accordingly adjust the business strategy to meet the demand levels and the customers' expectations. Further, the model will be a good way for management to understand the demand dynamics of a new market.



In [None]:
# Suppress warnings

import warnings
warnings.filterwarnings('ignore')

# import numpy, pandas, matplotlib and seaborn
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


In [None]:
# display up to 500 rows
pd.set_option('display.max_rows', 500)

In [None]:
# read the file
df=pd.read_csv("../input/bike-sharing/day.csv").set_index("instant")

In [None]:
# reading first five rows of the file.
df.head()

In [None]:
# reading the information of the file.
df.info()

In [None]:
# dteday column is stored as object dtype --> change it to a datetime stamp
df['dteday']= pd.to_datetime(df['dteday'])
df.info()

In [None]:
# reading the shape of the file
df.shape

In [None]:
# reading the statistics of the file.  
df.describe()

In [None]:
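# To find the correlation between the numeric columns of the dataframe
# (selecting numeric columns explicitly keeps dteday out of the calculation)
df.select_dtypes(include="number").corr()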

In [None]:
# replacing the numeric codes with categorical labels
# (assigning the result back avoids in-place replacement on a column selection,
# which recent pandas versions may not propagate to df)
df["weathersit"] = df["weathersit"].map({1: "cloudy", 2: "mist", 3: "snow", 4: "rain"})
df["season"] = df["season"].map({1: "spring", 2: "summer", 3: "fall", 4: "winter"})
month_labels = ("jan","feb","March","april","may","june","july","aug","sept","oct","nov","dec")
df["mnth"] = df["mnth"].map(dict(zip(range(1, 13), month_labels)))
weekday_labels = ("sun","mon","tue","wed","thu","fri","sat")
df["weekday"] = df["weekday"].map(dict(zip(range(7), weekday_labels)))
df.info()

In [None]:
# drop the unnecessary columns
df.drop(['casual','registered',"dteday","atemp"],axis=1,inplace=True)
df.head()

casual and registered add up to the column cnt, so both are dropped. dteday is not necessary, so it is dropped as well. atemp and temp have almost the same effect on the model, so atemp is dropped.
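The claim that casual and registered add up to cnt can be checked before dropping them. The following is a minimal sketch on a small illustrative frame that only borrows the column names from the dataset; the three rows are sample values for illustration, and the same comparison could be run on `df` before the drop.

```python
import pandas as pd

# Small illustrative frame using the same column names as day.csv
sample = pd.DataFrame({
    "casual": [331, 131, 120],
    "registered": [654, 670, 1229],
    "cnt": [985, 801, 1349],
})

# If this holds for every row, casual and registered together reveal the target
components_sum_to_target = bool(
    ((sample["casual"] + sample["registered"]) == sample["cnt"]).all()
)
print(components_sum_to_target)
```

If the check passes on the real data, keeping either column as a predictor would leak the target into the model.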

In [None]:
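# to see the correlation values of the dataframe
# (only numeric columns are used; the label columns created above are text)
plt.figure(figsize = (16, 10))
cor=df.select_dtypes(include="number").corr()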
ax=sns.heatmap(cor, annot = True, cmap="YlGnBu")
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.5, top - 0.5)
plt.show()

In [None]:
# to check the categorical columns for missing or unknown values
print(df["season"].value_counts())
print(df["yr"].value_counts())
print(df["mnth"].value_counts())
print(df["holiday"].value_counts())
print(df["weekday"].value_counts())
print(df["workingday"].value_counts())
print(df["weathersit"].value_counts())

In [None]:
# box plot for categorical variables
plt.figure(figsize=(20, 12))
plt.subplot(2,3,1)
sns.boxplot(x="season",y="cnt",data=df)
plt.subplot(2,3,2)
sns.boxplot(x="weathersit",y="cnt",data=df)
plt.subplot(2,3,3)
sns.boxplot(x="mnth",y="cnt",data=df)
plt.subplot(2,3,4)
sns.boxplot(x="weekday",y="cnt",data=df)
plt.show()

In [None]:
# pair plot of the numerical variables against cnt
# (pairplot creates its own figure, so the size is set through height)
sns.pairplot(df,x_vars=("temp","hum","windspeed"),y_vars="cnt",height=5)
plt.show()

In [None]:
# Creating dummy variables for object type variables
# (dtype=int keeps the dummies numeric; newer pandas versions default to bool)
df = pd.get_dummies(df,drop_first=True,dtype=int)
df.head()
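The effect of `drop_first=True` can be seen on a tiny frame. This is only a sketch: the `toy` frame and its values are hypothetical, and only the `season` and `cnt` column names are borrowed from the dataset.

```python
import pandas as pd

# Hypothetical frame with one categorical column and one numeric column
toy = pd.DataFrame({
    "season": ["spring", "summer", "fall", "winter"],
    "cnt": [10, 20, 30, 40],
})

# One level per categorical column is dropped and acts as the baseline
dummies = pd.get_dummies(toy, drop_first=True, dtype=int)
print(dummies.columns.tolist())
```

The alphabetically first level (`fall` here) becomes the baseline, so the coefficients of the remaining season dummies are read relative to it.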

In [None]:
# test and train data split
from sklearn.model_selection import train_test_split
df_train,df_test=train_test_split(df,train_size=.7,random_state=100)

In [None]:
# to see  the shape of the dataframe
print(df_train.shape)
print(df_test.shape)

In [None]:
# plot the heat map for test data
plt.figure(figsize = (40, 40))
cor=df_test.corr()
ax=sns.heatmap(cor, annot = True, cmap="YlGnBu")
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.5, top - 0.5)
plt.show()

In [None]:
# to see the correlation values for train data
plt.figure(figsize = (30, 40))
cor=df_train.corr()
ax=sns.heatmap(cor, annot = True, cmap="YlGnBu")
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.5, top - 0.5)
plt.show()

In [None]:
# feature scaling by using MinMax scaler
from sklearn.preprocessing import MinMaxScaler
scaler=MinMaxScaler()

In [None]:
# apply scaler to numeric columns except dummy variables
num=['temp', 'hum', 'windspeed',"cnt"]
df_train[num]=scaler.fit_transform(df_train[num])
df_train.describe()
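Min-max scaling maps each value to (x - min) / (max - min), with min and max taken from the training data only; the same statistics are then reused for the test data. A minimal sketch of that arithmetic follows, using hypothetical temperature values rather than the notebook's data.

```python
import pandas as pd

# Hypothetical train and test values for a single numeric column
train = pd.DataFrame({"temp": [2.0, 10.0, 30.0]})
test = pd.DataFrame({"temp": [5.0, 35.0]})

# Statistics come from the training data only
train_min, train_max = train["temp"].min(), train["temp"].max()

train_scaled = (train["temp"] - train_min) / (train_max - train_min)
# The test data reuses the training statistics (the "transform" step)
test_scaled = (test["temp"] - train_min) / (train_max - train_min)

print(train_scaled.tolist())
print(test_scaled.tolist())
```

Because the test values are scaled with training statistics, they can fall outside the 0 to 1 range, which is expected and not an error.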

In [None]:
# dividing into X and y for the model
y_train=df_train.pop("cnt")
X_train=df_train

In [None]:
#Building our model by using RFE

# Importing RFE and LinearRegression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

In [None]:
# Running RFE with the output number of the variable equal to 15
lm = LinearRegression()
lm.fit(X_train, y_train)

# n_features_to_select must be passed by keyword in recent scikit-learn versions
rfe = RFE(lm, n_features_to_select=15)             # running RFE and selecting the 15 features that best describe bike demand
rfe = rfe.fit(X_train, y_train)

In [None]:
list(zip(X_train.columns,rfe.support_,rfe.ranking_))

In [None]:
# to see top 15 columns from rfe
col = X_train.columns[rfe.support_]
col

In [None]:
# to see other than top 15 columns from rfe
X_train.columns[~rfe.support_]
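The idea behind recursive feature elimination can be sketched in a few lines of NumPy: fit a linear model, discard the feature with the smallest absolute coefficient, and repeat until the desired number of features remains. This is a simplified illustration on synthetic data, not scikit-learn's implementation; the function name `eliminate_recursively` and the data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 200
X = rng.normal(size=(n_samples, 4))
# Hypothetical target: only features 0 and 2 carry signal
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.1, size=n_samples)

def eliminate_recursively(X, y, n_keep):
    """Repeatedly drop the feature with the smallest absolute coefficient."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_keep:
        # Least-squares fit with an intercept on the remaining features
        design = np.column_stack([np.ones(len(y)), X[:, remaining]])
        coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
        # Skip the intercept when looking for the weakest feature
        weakest = int(np.argmin(np.abs(coefs[1:])))
        remaining.pop(weakest)
    return remaining

selected = eliminate_recursively(X, y, n_keep=2)
print(selected)
```

Comparing coefficient magnitudes is only meaningful when the features are on similar scales, which is one reason the numeric columns are scaled before RFE is run.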

Building model

In [None]:
# Creating X_train dataframe with RFE selected variables
X_train_rfe = X_train[col]

In [None]:
# importing statsmodels api and defining the function for lm summary
import statsmodels.api as sm
def fit_LRM(X_train):
    # Function to fit the linear regression model from the statsmodels package
    # using the selected variables in X_train (y_train comes from the notebook scope)
    # Adding a constant variable
    X_train = sm.add_constant(X_train)
    lm = sm.OLS(y_train,X_train).fit()
    print(lm.summary())
    return lm

In [None]:
lm=fit_LRM(X_train_rfe)

All p-values are less than 0.05, so calculating VIF.

In [None]:
# Check for the VIF values and define the function for VIF calculation for the feature variables. 
from statsmodels.stats.outliers_influence import variance_inflation_factor
def getVIF(X_train):
    # Calculate the VIFs for the new model
    vif = pd.DataFrame()
    X = X_train
    vif['Features'] = X.columns
    vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    vif['VIF'] = round(vif['VIF'], 2)
    vif = vif.sort_values(by = "VIF", ascending = False)
    return(vif)

In [None]:
# checking the VIF values
getVIF(X_train_rfe)

hum has a high VIF value, so we have to drop it.
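For reference, the VIF of a feature is 1 / (1 - R²), where R² comes from regressing that feature on all the other features. The sketch below computes it by hand with NumPy on synthetic data; the helper name `vif_of_column` and the data are hypothetical, and the intercept handling may differ from the statsmodels function used in this notebook.

```python
import numpy as np

rng = np.random.default_rng(1)
a = rng.normal(size=300)
b = rng.normal(size=300)
c = a + 0.1 * rng.normal(size=300)  # nearly a copy of a, so a and c are collinear
X = np.column_stack([a, b, c])

def vif_of_column(X, i):
    """VIF of column i: 1 / (1 - R^2) from regressing it on the other columns."""
    target = X[:, i]
    others = np.delete(X, i, axis=1)
    design = np.column_stack([np.ones(len(target)), others])
    coefs, *_ = np.linalg.lstsq(design, target, rcond=None)
    residuals = target - design @ coefs
    r_squared = 1.0 - residuals.var() / target.var()
    return 1.0 / (1.0 - r_squared)

vifs = [vif_of_column(X, i) for i in range(X.shape[1])]
print([round(v, 2) for v in vifs])
```

The two collinear columns get large VIFs while the independent column stays close to 1, which is the pattern that motivates dropping one variable from a collinear group.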

In [None]:
# drop hum as it has a high VIF
X_train2 = X_train_rfe.drop('hum', axis=1)

In [None]:
# to fit the linear regression model after dropping hum
lm1=fit_LRM(X_train2)

No p-values are greater than 5%, so check VIF.

In [None]:
# getting VIF for X_train2
getVIF(X_train2)

Dropping temp is not a good idea because the adjusted R-squared drops from 83.8 to 77.5.
So dropping the next variable, i.e. workingday.

In [None]:
# dropping next variable i.e. workingday
X_train3 = X_train2.drop('workingday', axis=1)

In [None]:
# checking the linear regression model 
X_train4 = sm.add_constant(X_train3)
lm3 = sm.OLS(y_train,X_train4).fit() 
print(lm3.summary())

weekday_sat has a high p-value, so drop it.

In [None]:
# dropping weekday_sat
X_train5=X_train4.drop("weekday_sat",axis=1)

In [None]:
# checking linear regression model
X_train6 = sm.add_constant(X_train5)
lm4 = sm.OLS(y_train,X_train6).fit() 
print(lm4.summary())

No high p-values, so checking VIF.

In [None]:
X_train6=X_train6.drop("const",axis=1)

In [None]:
#checking VIF values
getVIF(X_train6)

windspeed has a high p-value, so drop it.

In [None]:
# dropping windspeed
X_train7=X_train6.drop("windspeed",axis=1)

In [None]:
# checking linear regression model
X_train8 = sm.add_constant(X_train7)
lm5 = sm.OLS(y_train,X_train8).fit() 
print(lm5.summary())

mnth_jan has a high p-value, so remove it.

In [None]:
# drop the mnth_jan
X_train9=X_train8.drop("mnth_jan",axis=1)

In [None]:
# checking linear regression
X_train10 = sm.add_constant(X_train9)
lm6 = sm.OLS(y_train,X_train10).fit() 
print(lm6.summary())

No p-values are greater than 5%, so check VIF.

In [None]:
# get VIF for X_train10 (without the constant, as in the earlier VIF checks)
getVIF(X_train10.drop(["const"],axis=1))

No VIF values are greater than 5.

Residual Analysis of the train data

So, now to check if the error terms are also normally distributed (which is, in fact, one of the major assumptions of linear regression), let us plot the histogram of the error terms and see what it looks like.

In [None]:
y_train_cnt = lm6.predict(X_train10)

In [None]:
# Plot the histogram of the error terms
fig = plt.figure()
sns.histplot((y_train - y_train_cnt), bins = 20, kde = True)   # distplot is deprecated in recent seaborn releases
fig.suptitle('Error Terms', fontsize = 20)                  # Plot heading
plt.xlabel('Errors', fontsize = 18)
plt.show()
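As a numeric complement to the histogram, the skewness and excess kurtosis of the residuals can be computed; both should be close to zero for roughly normal errors. This is a sketch on synthetic residuals, where the `residuals` array is a hypothetical stand-in for `y_train - y_train_cnt`.

```python
import numpy as np

rng = np.random.default_rng(42)
# Hypothetical stand-in for the training residuals
residuals = rng.normal(loc=0.0, scale=0.1, size=500)

centered = residuals - residuals.mean()
std = residuals.std()

# Third and fourth standardized moments
skewness = np.mean(centered**3) / std**3
excess_kurtosis = np.mean(centered**4) / std**4 - 3.0

print(round(float(residuals.mean()), 4))
print(round(float(skewness), 3), round(float(excess_kurtosis), 3))
```

Large absolute values of either statistic would suggest that the errors are not well described by a normal distribution and that the histogram deserves a closer look.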

Making predictions

In [None]:
num=['temp', 'hum', 'windspeed', 'cnt']
df_test[num] = scaler.transform(df_test[num])

In [None]:
y_test = df_test.pop('cnt')
X_test = df_test


In [None]:

# Now let's use our model to make predictions.

# Creating X_test_new dataframe by keeping only the final model's variables from X_test
X_train10= X_train10.drop(['const'], axis=1)
X_test_new = X_test[X_train10.columns]

# Adding a constant variable 
X_test_new = sm.add_constant(X_test_new)

In [None]:
y_pred = lm6.predict(X_test_new)

In [None]:
fig = plt.figure()
plt.scatter(y_test,y_pred)
fig.suptitle('y_test vs y_pred', fontsize=20)              # Plot heading 
plt.xlabel('y_test', fontsize=18)                          # X-label
plt.ylabel('y_pred', fontsize=16) 

In [None]:
from sklearn.metrics import mean_squared_error
from math import sqrt
rmse = sqrt(mean_squared_error(y_test, y_pred))
print('Model RMSE:',rmse)

from sklearn.metrics import r2_score
r2=r2_score(y_test, y_pred)
print('Model r2_score:',r2)

In [None]:
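# calculation of adjusted R square for test data set
n = X_test_new.shape[0]
# p is the number of predictors in the final model (const excluded);
# counting every column of X_test, including unused ones, would give a lower value
p = X_test_new.shape[1] - 1
adjusted_r2_pred = 1-(1-r2)*(n-1)/(n-p-1)
adjusted_r2_pred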
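The adjusted R-squared formula, 1 - (1 - R²)(n - 1)/(n - p - 1), depends on how many predictors p are counted. The sketch below wraps the formula in a small helper and evaluates it for two values of p; the function name `adjusted_r2` and the inputs are illustrative assumptions, not outputs of this notebook.

```python
def adjusted_r2(r2, n_samples, n_predictors):
    """Adjusted R-squared: penalizes R-squared for the number of predictors."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)

# Illustrative inputs: same R-squared and sample size, different predictor counts
few_predictors = adjusted_r2(r2=0.81, n_samples=219, n_predictors=10)
many_predictors = adjusted_r2(r2=0.81, n_samples=219, n_predictors=29)

print(round(few_predictors, 4), round(many_predictors, 4))
```

For a fixed R-squared and sample size, counting more predictors always lowers the adjusted value, so p should match the number of variables actually used by the model.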


Conclusion: These are the variables that influence the bike demand:
yr
holiday
temp
season_spring
season_summer
season_winter
mnth_july
mnth_sept
weathersit_mist
weathersit_snow

Train data set: the R-squared is 82.4% and the adjusted R-squared is 82%.

Test data set: the R-squared is 81% and the adjusted R-squared is 78%.

There is a drop of about 3% between the train and test data sets, so the model is accepted.

Finally, we can conclude that, as per our final model, the top 3 predictor variables that influence the bike sharing demand are:

- Temperature (temp): a coefficient value of 0.5029 indicates that a unit increase in the temp variable increases the bike hire numbers by 0.5029 units.
- Year (yr): a coefficient value of 0.2326 indicates that a unit increase in the yr variable increases the bike hire numbers by 0.2326 units.
- Season_winter: a coefficient value of 0.0829 indicates that a unit increase in the season_winter variable increases the bike hire numbers by 0.0829 units.

Since temp and cnt were min-max scaled before fitting, these units refer to the scaled variables.
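Because both the predictor and the target were min-max scaled, a coefficient can be translated back to original units by multiplying it by the ratio of the target range to the predictor range. The sketch below shows the arithmetic; the helper name `coefficient_in_original_units` and the ranges are hypothetical placeholders, not values read from the dataset.

```python
def coefficient_in_original_units(scaled_coef, x_min, x_max, y_min, y_max):
    """Convert a slope fitted on min-max scaled x and y back to original units."""
    return scaled_coef * (y_max - y_min) / (x_max - x_min)

# Hypothetical ranges: temperature from 0 to 40, daily count from 0 to 8000
slope = coefficient_in_original_units(
    0.5029, x_min=0.0, x_max=40.0, y_min=0.0, y_max=8000.0
)
print(round(slope, 2))
```

With the actual training minima and maxima substituted in, the result would read as the change in daily rentals per one-unit change in temperature.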

Negative coefficient variables are:

holiday
season_spring
mnth_july
weathersit_mist
weathersit_snow

Positive coefficient variables are:

year
temperature
season_summer
season_winter
mnth_sept
