# BOOM BIKES


### Author : Diprodeep Ghosh

# **Background:**
A bike-sharing system is a service in which bikes are made available for shared use to individuals on a short term basis for a price or free. Many bike share systems allow people to borrow a bike from a "dock" which is usually computer-controlled wherein the user enters the payment information, and the system unlocks it. This bike can then be returned to another dock belonging to the same system.

# **Problem Statement:**
A US bike-sharing provider BoomBikes has recently suffered considerable dips in their revenues due to the ongoing Corona pandemic. The company is finding it very difficult to sustain in the current market scenario. So, it has decided to come up with a mindful business plan to be able to accelerate its revenue as soon as the ongoing lockdown comes to an end, and the economy restores to a healthy state.

In such an attempt, BoomBikes aspires to understand the demand for shared bikes among the people after this ongoing quarantine situation ends across the nation due to Covid-19. They have planned this to prepare themselves to cater to the people's needs once the situation gets better all around and stand out from other service providers and make huge profits.

They have contracted a consulting company to understand the factors on which the demand for these shared bikes depends. Specifically, they want to understand the factors affecting the demand for these shared bikes in the American market. The company wants to know:

- Which variables are significant in predicting the demand for shared bikes.
- How well those variables describe the bike demand.

Based on various meteorological surveys and people's styles, the service provider firm has gathered a large dataset on daily bike demands across the American market based on some factors.

# **Business Goal:**
We are required to model the demand for shared bikes with the available independent variables. It will be used by the management to understand how exactly the demands vary with different features. They can accordingly manipulate the business strategy to meet the demand levels and meet the customer's expectations. Further, the model will be a good way for management to understand the demand dynamics of a new market.

In [None]:
import warnings
warnings.filterwarnings('ignore')

from sklearn import preprocessing
from IPython.core.interactiveshell import InteractiveShell
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression
# from collections import defaultdicta
from sklearn.preprocessing import LabelEncoder
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

%matplotlib inline

# READ DATA

In [None]:
# READ DATA

bikedata = pd.read_csv("../input/boombikes/day.csv",parse_dates=['dteday']) 
print(bikedata.head())

In [None]:
#shape check 
print(bikedata.shape)

In [None]:
#  descriptive information check

print(bikedata.info())

In [None]:
#descriptive  statistical information check

print(bikedata.describe())

# DATA QUALITY CHECK

## NULL/MISSING values checking :

In [None]:
# percentage of missing values in each column

round(100*(bikedata.isnull().sum()/len(bikedata.index)), 2).sort_values(ascending=False)

All percentages are zero, so the dataset has no missing or NULL values.

# Duplicate data Checking

In [None]:

# Work on a copy so the original dataframe stays untouched for comparison
bike_duplicate = bikedata.copy()

# Checking for duplicates and dropping the entire duplicate row if any
bike_duplicate.drop_duplicates(subset=None, inplace=True)
bike_duplicate.shape

After running the drop-duplicates command, the shape of the dataframe is the same as that of the original, so there are no duplicate rows in the dataset.
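Comparing shapes before and after `drop_duplicates` works, but the duplicated rows can also be counted directly. The following is a minimal sketch on a small made-up dataframe (`toy` and its values are hypothetical, not the BoomBikes data):

```python
import pandas as pd

# Hypothetical toy data: the third row repeats the first one exactly
toy = pd.DataFrame({"temp": [9.1, 14.3, 9.1], "cnt": [985, 801, 985]})

# duplicated() flags every row that repeats an earlier row
n_duplicates = int(toy.duplicated().sum())
print(n_duplicates)

# drop_duplicates() returns a new dataframe and leaves `toy` unchanged
deduped = toy.drop_duplicates()
print(toy.shape, deduped.shape)
```

A count of zero on the real data would confirm the same conclusion without relying on a shape comparison.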

# Data Cleaning

### Removing Unwanted Columns

From a high-level analysis of the data dictionary, we can conclude that the columns instant, dteday, casual and registered can be dropped from the dataset, for the following reasons:

1. instant: it is only an index value.

2. dteday: we already have separate columns for month and year, so we do not need the date separately.

3. casual & registered: these contain the bike counts split by user category. Since our target count is not specific to any category, we do not need them.

I will create a new dataframe named bikedata_new that holds the data without the dropped columns.

In [None]:
bikedata.columns

In [None]:
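# .copy() makes bikedata_new independent of bikedata, so later column
# assignments do not trigger pandas' SettingWithCopyWarning
bikedata_new=bikedata[['season', 'yr', 'mnth', 'holiday', 'weekday',
       'workingday', 'weathersit', 'temp', 'atemp', 'hum', 'windspeed',
       'cnt']].copy()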

In [None]:
bikedata_new.info()

# Creating Dummy Variable

We will create dummy variables for the following columns:

1. 'mnth'

2. 'weekday'

3. 'season'

4. 'weathersit'

In [None]:
#converting the datatype to category
bikedata_new['season']=bikedata_new['season'].astype('category')
bikedata_new['weathersit']=bikedata_new['weathersit'].astype('category')
bikedata_new['mnth']=bikedata_new['mnth'].astype('category')
bikedata_new['weekday']=bikedata_new['weekday'].astype('category')
bikedata_new.info()

In [None]:
#creating the dummy variables
#using drop_first to drop the first variable for each set of dummies created

# dtype=int keeps the dummies numeric (recent pandas versions default to bool)
bikedata_new = pd.get_dummies(bikedata_new, drop_first=True, dtype=int)

bikedata_new.info()

In [None]:
bikedata_new.head()
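As a small illustration of what `drop_first=True` does, the sketch below encodes a hypothetical three-level column (`toy` and its values are made up). A variable with k levels yields k-1 dummy columns, and the dropped first level becomes the baseline, represented by all zeros.

```python
import pandas as pd

# Hypothetical categorical column with levels 1, 2 and 3
toy = pd.DataFrame({"season": pd.Categorical([1, 2, 3, 1])})

# Level 1 is dropped, so only season_2 and season_3 remain
dummies = pd.get_dummies(toy, drop_first=True, dtype=int)
print(dummies.columns.tolist())
print(dummies.values.tolist())
```

Rows belonging to level 1 show up as zeros in both remaining columns, which is why no information is lost by dropping it.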

# Data Splitting


We will split the entire data set into a training set and a testing set in a 70:30 ratio.

In [None]:
from sklearn.model_selection import train_test_split

np.random.seed(0)
df_train, df_test = train_test_split(bikedata_new, train_size = 0.70, test_size = 0.30, random_state = 333)

In [None]:
#checking out training set info
df_train.info()

In [None]:
#checking out training set size
df_train.shape

In [None]:
#checking out testing set info
df_test.info()

In [None]:
#checking out testing set size
df_test.shape

So the data has been successfully split into 70% train data and 30% test data.

# EDA on Training Dataset

###  Numeric Variables  

In [None]:

bikedata_num=df_train[[ 'temp', 'atemp', 'hum', 'windspeed','cnt']] #taking only numerical variable

sns.pairplot(bikedata_num, diag_kind='kde')
plt.show()

##### From the above pair plot we can see a linear relationship between cnt and both temp and atemp

# Categorical Variables

We will create a boxplot for each of the categorical variables to see how it stacks up against the target variable.

In [None]:
#taking categorical variables before creating dummy variables

plt.figure(figsize=(25, 10))
plt.subplot(2,3,1)
sns.boxplot(x = 'season', y = 'cnt', data = bikedata)
plt.subplot(2,3,2)
sns.boxplot(x = 'mnth', y = 'cnt', data = bikedata)
plt.subplot(2,3,3)
sns.boxplot(x = 'weathersit', y = 'cnt', data = bikedata)
plt.subplot(2,3,4)
sns.boxplot(x = 'holiday', y = 'cnt', data = bikedata)
plt.subplot(2,3,5)
sns.boxplot(x = 'weekday', y = 'cnt', data = bikedata)
plt.subplot(2,3,6)
sns.boxplot(x = 'workingday', y = 'cnt', data = bikedata)
plt.show()

The boxplots above suggest the following:

season: Season 3 has the highest number of bookings, with a median close to 5000, closely followed by season 2 and season 4. This indicates that season can be a good predictor for the dependent variable.

mnth: Months 5, 6, 7, 8 and 9 account for the bulk of the bookings, with medians hovering around 4000 bookings per day. This indicates that mnth shows a trend in bookings and can be a good predictor for the dependent variable.

weathersit: Most bookings occur during weathersit 1, with a median close to 5000 bookings (over the two-year period), followed by weathersit 2. This indicates that weathersit shows a trend in bike bookings and can be a good predictor for the dependent variable.

holiday: The large majority of bookings occurred on non-holidays, so the data for this variable is heavily skewed toward one class and it is unlikely to be a good predictor for the dependent variable.

weekday: The weekday variable shows very similar distributions across days, with medians between 4000 and 5000 bookings.

workingday: More bookings occurred on working days, with a median close to 5000 bookings.

In [None]:
plt.figure(figsize = (25,20))
ax=sns.heatmap(bikedata_new.corr(), annot = True, cmap="YlGnBu")
bottom,top = ax.get_ylim()
ax.set_ylim(bottom + 0.5, top - 0.5)
plt.show()


# RESCALING 

In [None]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
    

In [None]:
df_train.head()

In [None]:
df_train.columns

In [None]:
# Apply scaler()

numerical_vars = ['temp', 'atemp', 'hum', 'windspeed','cnt']

df_train[numerical_vars] = scaler.fit_transform(df_train[numerical_vars])

In [None]:
# Checking values after scaling
df_train.head()

In [None]:
df_train.describe()
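For reference, `MinMaxScaler` maps each column to x' = (x - min) / (max - min), where min and max are learned from the data passed to `fit`. The pure-Python sketch below mirrors that idea on made-up numbers (the helper names `fit_min_max` and `apply_min_max` are hypothetical). It also shows why values scaled later with the training statistics can fall outside [0, 1].

```python
def fit_min_max(values):
    """Learn the min and max from the training values."""
    return min(values), max(values)

def apply_min_max(values, lo, hi):
    """Scale values with previously learned statistics."""
    return [(v - lo) / (hi - lo) for v in values]

train = [10.0, 20.0, 30.0]   # made-up training values
test = [15.0, 35.0]          # made-up unseen values

lo, hi = fit_min_max(train)
scaled_train = apply_min_max(train, lo, hi)
scaled_test = apply_min_max(test, lo, hi)
print(scaled_train)   # → [0.0, 0.5, 1.0]
print(scaled_test)    # → [0.25, 1.25]
```

Learning the statistics on the training set only, and reusing them on the test set, keeps information about the test data out of the model.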

# LINEAR MODEL

### Dividing the training dataset into X and Y sets for the model building

In [None]:
y_train = df_train.pop('cnt')
X_train = df_train

### Recursive feature elimination:

In [None]:
# Importing RFE and LinearRegression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

In [None]:
# Running RFE with the number of selected variables equal to 15
lm = LinearRegression()

# n_features_to_select must be passed by keyword in recent scikit-learn versions
rfe = RFE(lm, n_features_to_select=15)
rfe = rfe.fit(X_train, y_train)
list(zip(X_train.columns,rfe.support_,rfe.ranking_))

In [None]:
col = X_train.columns[rfe.support_]
col


In [None]:
X_train.columns[~rfe.support_]

In [None]:
# Creating X_train dataframe with RFE selected variables
X_train_rfe = X_train[col]

# Model 1

In [None]:
#VIF CHECK 
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Create a dataframe that will contain the names of all the feature variables and their respective VIFs
vif = pd.DataFrame()
vif['Features'] = X_train_rfe.columns
vif['VIF'] = [variance_inflation_factor(X_train_rfe.values, i) for i in range(X_train_rfe.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif
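For intuition, the VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the other predictors. The sketch below implements this definition with NumPy on made-up data; the function name `vif_with_intercept` and the synthetic columns are hypothetical. Note that this sketch adds an intercept to each auxiliary regression, so its numbers can differ from those of the statsmodels call above, which is given a design matrix without a constant column.

```python
import numpy as np

def vif_with_intercept(X):
    """VIF per column: 1 / (1 - R^2) from regressing it on the other columns."""
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    vifs = []
    for i in range(n_cols):
        target = X[:, i]
        others = np.delete(X, i, axis=1)
        # Auxiliary regression of column i on the remaining columns plus intercept
        design = np.column_stack([np.ones(n_rows), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residual = target - design @ coef
        r_squared = 1.0 - residual.var() / target.var()
        vifs.append(1.0 / (1.0 - r_squared))
    return vifs

# Made-up data: x3 is almost a copy of x1, x2 is unrelated to both
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = x1 + 0.01 * rng.normal(size=200)
vifs = vif_with_intercept(np.column_stack([x1, x2, x3]))
print([round(v, 2) for v in vifs])
```

The near-duplicate columns get very large VIFs while the unrelated column stays close to 1, which is the pattern the elimination steps below look for.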

In [None]:
import statsmodels.api as sm

# Add a constant
X_train_lm1 = sm.add_constant(X_train_rfe)

# Create a first fitted model
lr1 = sm.OLS(y_train, X_train_lm1).fit()

In [None]:
# parameter check

lr1.params

In [None]:
#model Summary
print(lr1.summary())

Removing the variable 'atemp' based on its high p-value and high VIF.

# Model 2

In [None]:
X_train_new = X_train_rfe.drop(["atemp"], axis = 1)

In [None]:
# Check for the VIF values of the feature variables. 
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Create a dataframe that will contain the names of all the feature variables and their respective VIFs
vif = pd.DataFrame()
vif['Features'] = X_train_new.columns
vif['VIF'] = [variance_inflation_factor(X_train_new.values, i) for i in range(X_train_new.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

In [None]:
# Add a constant
X_train_lm2 = sm.add_constant(X_train_new)

# Fit the second model
lr2 = sm.OLS(y_train, X_train_lm2).fit()

In [None]:
# Check the parameters obtained

lr2.params

In [None]:
print(lr2.summary())

Removing the variable 'hum' based on its very high VIF value. We drop 'hum' rather than 'temp' because, from general domain knowledge, temperature is likely to affect a business like bike rental.

# Model 3

In [None]:
X_train_new = X_train_new.drop(["hum"], axis = 1)

In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Create a dataframe that will contain the names of all the feature variables and their respective VIFs
vif = pd.DataFrame()
vif['Features'] = X_train_new.columns
vif['VIF'] = [variance_inflation_factor(X_train_new.values, i) for i in range(X_train_new.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

In [None]:
# Add a constant
X_train_lm3 = sm.add_constant(X_train_new)

# Fit the third model
lr3 = sm.OLS(y_train, X_train_lm3).fit()

In [None]:
lr3.params

In [None]:
print(lr3.summary())

Removing the variable 'season_3' based on its very high VIF value.

# MODEL 4

In [None]:
X_train_new = X_train_new.drop(["season_3"], axis = 1)

In [None]:
# Check for the VIF values of the feature variables. 
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Create a dataframe that will contain the names of all the feature variables and their respective VIFs
vif = pd.DataFrame()
vif['Features'] = X_train_new.columns
vif['VIF'] = [variance_inflation_factor(X_train_new.values, i) for i in range(X_train_new.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

In [None]:
# Add a constant
X_train_lm4 = sm.add_constant(X_train_new)

# Fit the fourth model
lr4 = sm.OLS(y_train, X_train_lm4).fit()

In [None]:
lr4.params

In [None]:
print(lr4.summary())

Removing the variable 'mnth_10' based on its very high p-value compared to the others.

# MODEL 5

In [None]:
X_train_new = X_train_new.drop(["mnth_10"], axis = 1)

In [None]:
# Check for the VIF values of the feature variables. 
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Create a dataframe that will contain the names of all the feature variables and their respective VIFs
vif = pd.DataFrame()
vif['Features'] = X_train_new.columns
vif['VIF'] = [variance_inflation_factor(X_train_new.values, i) for i in range(X_train_new.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

In [None]:
# Add a constant
X_train_lm5 = sm.add_constant(X_train_new)

# Fit the fifth model
lr5 = sm.OLS(y_train, X_train_lm5).fit()

In [None]:
lr5.params

In [None]:
print(lr5.summary())

Removing the variable 'mnth_3' based on its high p-value compared to the others.

# MODEL 6

In [None]:
X_train_new = X_train_new.drop(["mnth_3"], axis = 1)

In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Create a dataframe that will contain the names of all the feature variables and their respective VIFs
vif = pd.DataFrame()
vif['Features'] = X_train_new.columns
vif['VIF'] = [variance_inflation_factor(X_train_new.values, i) for i in range(X_train_new.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

In [None]:
# Add a constant
X_train_lm6 = sm.add_constant(X_train_new)

# Fit the sixth model
lr6 = sm.OLS(y_train, X_train_lm6).fit()

In [None]:
lr6.params

In [None]:
print(lr6.summary())

Now we have a model in which all the p-values are effectively zero and all the VIFs are below 5, indicating low multicollinearity between the predictors. So we can consider this as our final model.

# Interpretation

Final Model Interpretation

## Hypothesis Testing:

H0 : B1 = B2 = ......... = Bn = 0

H1 : at least one Bi != 0    

In [None]:
lr6.params

From the list above, none of the estimated coefficients is equal to zero, and the lr6 summary shows that each has a p-value of effectively zero, so we can reject the null hypothesis.


### Significance of the final model

The overall significance of the final model is determined by the F-statistic (the higher the value, the stronger the evidence that the model is significant). From the above summary of the lr6 model we can see that it has:

F-statistic:          233.8

An F-statistic of 233.8 is far greater than 1. The accompanying Prob (F-statistic) in the summary is the p-value of this test, and a value close to zero confirms that the overall model is significant.
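For an OLS model with an intercept, the F-statistic can be derived from R², the number of observations and the number of predictors. The sketch below shows that relationship; the function name `f_statistic` and the numbers in the example call are made up for illustration and are not taken from the lr6 summary.

```python
def f_statistic(r_squared, n_obs, n_predictors):
    """Overall F-statistic of an OLS model with an intercept."""
    # Explained variation per predictor
    explained = r_squared / n_predictors
    # Unexplained variation per residual degree of freedom
    unexplained = (1.0 - r_squared) / (n_obs - n_predictors - 1)
    return explained / unexplained

# Made-up example: R^2 = 0.5 with 12 observations and 1 predictor
print(f_statistic(0.5, 12, 1))
```

Plugging in the training R², the training-set size from `df_train.shape` and the ten predictors of lr6 should land close to the value reported in the summary.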


# Assumptions Validation

#### Residual Analysis Of Training Data

In [None]:
y_train_predict = lr6.predict(X_train_lm6)

In [None]:
residual = y_train-y_train_predict


fig = plt.figure()
# histplot with kde=True replaces the deprecated sns.distplot
sns.histplot(residual, bins = 20, kde = True)
fig.suptitle('Error Terms', fontsize = 20)
plt.xlabel('Errors', fontsize = 18)
plt.show()

##### The residuals are approximately normally distributed and centred around zero, which supports the normality-of-errors assumption of linear regression

## Linear relationship between X and Y

In [None]:
# Reuse the numeric training columns selected during EDA (bikedata_num);
# bikedata_new is left unchanged
sns.pairplot(bikedata_num, diag_kind='kde')
plt.show()

Using the pair plot, we can see a linear relationship between the predictors temp and atemp and the target variable 'cnt'.

## Multicollinearity between the predictor variables

In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor
vif = pd.DataFrame()
vif['Features'] = X_train_new.columns
vif['VIF'] = [variance_inflation_factor(X_train_new.values, i) for i in range(X_train_new.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

##### As no VIF value is above 5, we can assume there is no significant multicollinearity between the predictor variables

# Applying the Final Model(lr6) to make Prediction

In [None]:
#applying scaling

numerical_vars = ['temp', 'atemp', 'hum', 'windspeed','cnt']

df_test[numerical_vars] = scaler.transform(df_test[numerical_vars])
df_test.head()

In [None]:
df_test.describe()

In [None]:
#Dividing into X_test and y_test

y_test = df_test.pop('cnt')
X_test = df_test

X_test.info()

In [None]:
#Selecting the variables that are part of final model.
col1=X_train_new.columns

X_test=X_test[col1]

# Adding constant variable to test dataframe
X_test_lm6 = sm.add_constant(X_test)

X_test_lm6.info()

In [None]:
# Making predictions using the final model (lr6)

y_predict = lr6.predict(X_test_lm6)

# MODEL EVALUATION

In [None]:
# Plotting y_test and y_pred to understand the spread

fig = plt.figure()
plt.scatter(y_test, y_predict, alpha=.5)
fig.suptitle('y_test vs y_pred', fontsize = 20)              # Plot heading 
plt.xlabel('y_test', fontsize = 18)                          # X-label
plt.ylabel('y_pred', fontsize = 16)                          # Y-label
plt.show()

# R^2 Test

In [None]:
#r2 = 1-(RSS/TSS)

from sklearn.metrics import r2_score
r2_score(y_test, y_predict)
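The formula in the comment above, R² = 1 - RSS/TSS, can be spelled out in a few lines. This is a pure-Python sketch on made-up numbers (the function name `r_squared` is hypothetical); `r2_score` from scikit-learn remains the call used in this notebook.

```python
def r_squared(y_true, y_pred):
    """R^2 = 1 - RSS / TSS."""
    mean_y = sum(y_true) / len(y_true)
    # Residual sum of squares: squared errors of the predictions
    rss = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: squared deviations from the mean
    tss = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - rss / tss

y_true = [1.0, 2.0, 3.0, 4.0]
print(r_squared(y_true, y_true))      # perfect predictions → 1.0
print(r_squared(y_true, [2.5] * 4))   # always predicting the mean → 0.0
```

The two calls mark the reference points: a perfect model scores 1, and a model no better than the mean scores 0.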

# Adjusted R^2  TEST

In [None]:
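# Reuse the computed score instead of hard-coding it
r2 = r2_score(y_test, y_predict)   # 0.8203092200749708 in the original run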

In [None]:
# n is number of rows in X

n = X_test.shape[0]


# Number of features (predictors, p) is the shape along axis 1
p = X_test.shape[1]

# We find the Adjusted R-squared using the formula

adjusted_r2 = 1-(1-r2)*(n-1)/(n-p-1)
adjusted_r2
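The adjusted R² formula used above can be wrapped in a small reusable function. This is a sketch with a hypothetical function name (`adjusted_r_squared`) and made-up inputs, chosen so the arithmetic is easy to follow by hand.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1.0 - (1.0 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Made-up example: R^2 = 0.5, 11 observations, 5 predictors
# 1 - 0.5 * 10 / 5 = 0.0, so the penalty wipes out the apparent fit
print(adjusted_r_squared(0.5, 11, 5))
```

The penalty grows with the number of predictors relative to the number of observations, which is why the adjusted value is always at or below the plain R².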

# FINAL RESULT

##### FINAL RESULT COMPARISON: 

##### Train R^2 :0.824  

##### Train Adjusted R^2 :0.821  

##### Test R^2 :0.820  

##### Test Adjusted R^2 :0.812

# The equation of best fitted surface based on model lr6

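cnt = 0.084143 + (yr × 0.230846) + (workingday × 0.043203) + (temp × 0.563615) − (windspeed × 0.155191) + (season_2 × 0.082706) + (season_4 × 0.128744) + (mnth_9 × 0.094743) + (weekday_6 × 0.056909) − (weathersit_2 × 0.074807) − (weathersit_3 × 0.306992)

Here cnt, temp and windspeed are the min-max scaled values used to fit the model, not the raw values.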
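To see how the fitted equation is applied, the sketch below evaluates it for one hypothetical day. The coefficients are copied from the equation above; the function name `predict_scaled_cnt` and the feature values in `day` are made up for illustration. The result is on the scaled cnt axis, so mapping it back to a raw count would need the scaler's inverse transform.

```python
# Coefficients of the final model lr6, as listed in the equation above
INTERCEPT = 0.084143
COEFFICIENTS = {
    "yr": 0.230846,
    "workingday": 0.043203,
    "temp": 0.563615,
    "windspeed": -0.155191,
    "season_2": 0.082706,
    "season_4": 0.128744,
    "mnth_9": 0.094743,
    "weekday_6": 0.056909,
    "weathersit_2": -0.074807,
    "weathersit_3": -0.306992,
}

def predict_scaled_cnt(features):
    """Evaluate the linear equation; features not given default to 0."""
    return INTERCEPT + sum(
        coef * features.get(name, 0.0) for name, coef in COEFFICIENTS.items()
    )

# Hypothetical day: second year, working day, season 2, clear weather,
# with scaled temp = 0.5 and scaled windspeed = 0.2
day = {"yr": 1, "workingday": 1, "temp": 0.5, "windspeed": 0.2, "season_2": 1}
print(round(predict_scaled_cnt(day), 4))
```

With every feature at zero the prediction reduces to the intercept, which corresponds to the baseline categories dropped during dummy encoding.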