**Problem Statement**

A US bike-sharing provider, BoomBikes, has recently suffered a considerable dip in revenue due to the ongoing coronavirus pandemic. The company is finding it very difficult to sustain itself in the current market. It has therefore decided to draw up a mindful business plan so that it can accelerate its revenue as soon as the ongoing lockdown ends and the economy returns to a healthy state.

You are required to model the demand for shared bikes with the available independent variables. Management will use the model to understand how demand varies with different features. They can then adjust the business strategy to meet demand levels and customers' expectations. The model will also be a good way for management to understand the demand dynamics of a new market.



## Step 1: Reading and Understanding the Data

In [None]:
# Suppress warnings

import warnings
warnings.filterwarnings('ignore')

In [None]:
import numpy as np
import pandas as pd

In [None]:
#Reading CSV data
bike= pd.read_csv("../input/boombikes/day.csv")
bike.head()

Inspecting the data frame

In [None]:
bike.info()

In [None]:
bike.shape

In [None]:
bike.describe() 

### Dropping the unwanted columns in the data frame



instant: just a row index, so it is of no use for building the model

dteday: yr is already present as a binary variable and day of week is present as a categorical variable, so this becomes a redundant column

casual, registered: directly related to cnt, the target variable

atemp: the "feels like" temperature; the original temperature (temp) is already present, so this becomes an extra feature

For these reasons, the variables above can be dropped from the data frame.

In [None]:
#Dropping unwanted columns from the data frame
bike.drop(['instant','casual','registered','dteday','atemp'],axis=1,inplace=True)
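The reason for dropping `casual` and `registered` can be checked directly. The sketch below assumes, as the column descriptions suggest, that `cnt` is the sum of the two; the helper name `is_sum_of_parts` and the toy values are illustrative only. If the check passes on the real data, keeping either column would leak the target into the features.

```python
import pandas as pd

def is_sum_of_parts(df, parts=("casual", "registered"), total="cnt"):
    """Return True if the `total` column equals the row-wise sum of `parts`."""
    return bool((df[list(parts)].sum(axis=1) == df[total]).all())

# Toy rows standing in for day.csv (illustrative values, not loaded from the file)
toy = pd.DataFrame({
    "casual": [331, 131, 120],
    "registered": [654, 670, 1229],
    "cnt": [985, 801, 1349],
})

leaks_target = is_sum_of_parts(toy)
print(leaks_target)
```

On the real data frame, the same call would be made before the `drop` cell, while the two columns still exist.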


In [None]:
bike.head(150)

## Step 2: Visualizing the Data 

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
#Pairplot for numerical variables
sns.pairplot(bike,vars=['cnt','temp','windspeed','hum','weathersit','weekday','season'])
plt.show()

As we can observe, there is some positive correlation between temp (temperature) and cnt (total number of bikes rented).

In [None]:
plt.figure(figsize=(20,10))
plt.subplot(1,3,1)
sns.boxplot(x = 'holiday', y = 'cnt', data = bike)
plt.subplot(1,3,2)
sns.boxplot(x = 'workingday', y = 'cnt', data = bike)
plt.subplot(1,3,3)
sns.boxplot(x = 'yr', y = 'cnt', data = bike)
plt.show()

As you can observe, the number of bikes rented (cnt) increased considerably in 2019 compared to 2018.

In [None]:
plt.figure(figsize = (10, 5))
sns.boxplot(x = 'yr', y = 'cnt', hue = 'holiday', data = bike)
plt.show()

In [None]:
plt.figure(figsize = (10, 5))
sns.boxplot(x = 'season', y = 'cnt', hue = 'weathersit', data = bike)
plt.show()

In [None]:
plt.figure(figsize = (15,10))
sns.boxplot(x = 'yr', y = 'cnt', hue = 'mnth', data = bike)
plt.show()

In [None]:
#Heat map for the variables
plt.figure(figsize=(15,10))
sns.heatmap(bike.corr(),annot=True)
plt.show()

## Step 3: Data Preparation

### Dummy Variables


In [None]:
# Get the dummy variables for the feature 'Season' and store it in a new variable - 'seas'
bike['season']=bike['season'].astype(object)
seas = pd.get_dummies(bike['season'],drop_first=True)
seas.head(350)

Now, you don't need four columns. The spring column (the 1st column) is dropped, as the season can be identified with just the last three columns, where:

000- spring season

100- summer season

010- fall season

001- winter season
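The encoding listed above can be reproduced on a handful of rows. This is a minimal sketch on a toy series of season codes (the values are made up for illustration); `.astype(int)` is used only so the output prints as 0/1 regardless of the pandas version.

```python
import pandas as pd

# Toy season codes: 1 = spring, 2 = summer, 3 = fall, 4 = winter
codes = pd.Series([1, 2, 3, 4, 1])

# drop_first=True removes the column for the smallest code (1, spring),
# so spring becomes the baseline encoded as all zeros
dummies = pd.get_dummies(codes, drop_first=True).astype(int)
dummies = dummies.rename(columns={2: "summer", 3: "fall", 4: "winter"})

print(dummies)
```

Each row has at most one 1, and a row of all zeros means the baseline category.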

In [None]:
#Changing the names columns so that it can be understood easily
seas = seas.rename(columns={1:'spring', 2:'summer', 3:'fall', 4:'winter'})
seas.head()

In [None]:
# Add the results to the original bike dataframe

bike = pd.concat([bike, seas], axis = 1)

In [None]:
bike.head(300)

In [None]:
# Drop 'season' as we have created the dummies for it
bike.drop(['season'], axis = 1, inplace = True)

bike.head()

In [None]:
# Get the dummy variables for the feature 'weekday' and store it in a new variable - 'wday'
bike['weekday']=bike['weekday'].astype(object)
wday = pd.get_dummies(bike['weekday'],drop_first=True)
wday.head(150)

In [None]:
#Changing the names columns so that it can be understood easily

wday = wday.rename(columns={0:'sun',1:'mon', 2:'tue', 3:'wed', 4:'thu',5:'fri',6:'sat'})
wday.head(350)

Now, you don't need seven columns. The sun column (column 0) is dropped, as the day can be identified with just the last six columns, where:

100000-mon

010000-tue

001000-wed

000100-thu

000010-fri

000001-sat

000000-sun

In [None]:
#lets concat this data frame to original data frame
bike = pd.concat([bike, wday], axis = 1)

In [None]:
bike.head()

In [None]:
# Drop 'weekday' as we have created the dummies for it
bike.drop(['weekday'], axis = 1, inplace = True)

bike.head()

In [None]:
# Get the dummy variables for the feature 'mnth' and store it in a new variable - 'month'
bike['mnth']=bike['mnth'].astype(object)
month = pd.get_dummies(bike['mnth'],drop_first=True)
month.head(150)

In [None]:
#Changing the names columns so that it can be understood easily

month = month.rename(columns={1:'jan', 2:'feb', 3:'mar',4:'apr',5:'may',6:'jun',7:'jul',8:'aug',9:'sep',10:'oct',11:'nov',12:'dec'})
month.head(350)

Now, you don't need twelve columns. The jan (1st) column is dropped, as the month can be identified with just the last eleven columns, where:

10000000000-feb

01000000000-mar

00100000000-apr

00010000000-may

00001000000-jun

00000100000-jul

00000010000-aug

00000001000-sep

00000000100-oct

00000000010-nov

00000000001-dec

00000000000-jan

In [None]:
#lets concat this data frame to original data frame
bike = pd.concat([bike, month], axis = 1)

In [None]:
bike.head(200)

In [None]:
# Drop 'mnth' as we have created the dummies for it
bike.drop(['mnth'], axis = 1, inplace = True)

bike.head()

In [None]:
# Get the dummy variables for the feature 'weathersit' and store it in a new variable - 'wsit'
bike['weathersit']=bike['weathersit'].astype(object)
wsit = pd.get_dummies(bike['weathersit'],drop_first=True)
wsit.head(150)

In [None]:
#Changing the names columns so that it can be understood easily

wsit = wsit.rename(columns={1:'clear', 2:'mist+cloudy', 3:'lightsnow',4:'heavyrain'})
wsit.head(350)

Now, you don't need three columns. The 'clear' column (column 1) is dropped, as 'weathersit' can be identified with just the last two columns, where:

00-clear

10-mist+cloudy

01-lightsnow



In [None]:
#lets concat this data frame to original data frame

bike = pd.concat([bike, wsit], axis = 1)

In [None]:
bike.head()

In [None]:
# Drop 'weathersit' as we have created the dummies for it
bike.drop(['weathersit'], axis = 1, inplace = True)

bike.head()

In [None]:
pd.set_option('display.max_columns', 300)
bike.shape

## Splitting the Data into Training and Testing Sets

In [None]:
from sklearn.model_selection import train_test_split

# We specify random_state so that the train and test data sets always have the same rows, respectively

bike_train, bike_test = train_test_split(bike, train_size = 0.7, test_size = 0.3, random_state = 100)

In [None]:
bike_train.shape

In [None]:
bike_test.shape
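The role of a fixed seed in the split can be illustrated without scikit-learn. The sketch below is not `train_test_split`'s own algorithm and will not select the same rows; the helper `split_indices` is a hypothetical name used only to show that a fixed seed makes a 70/30 shuffle repeatable.

```python
import numpy as np

def split_indices(n_rows, train_frac=0.7, seed=100):
    """Shuffle row positions with a seeded generator and cut them 70/30."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_rows)
    n_train = int(round(train_frac * n_rows))
    return order[:n_train], order[n_train:]

train_idx, test_idx = split_indices(10)
train_again, test_again = split_indices(10)

# Same seed, same rows in each part
print(np.array_equal(train_idx, train_again), len(train_idx), len(test_idx))
```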

### Rescaling the Features 


In [None]:
#MinMaxScaling

from sklearn.preprocessing import MinMaxScaler

In [None]:
scaler = MinMaxScaler()

In [None]:
# Apply scaler() to all the columns except the 'yes-no' and 'dummy' variables
num_vars = ['temp', 'hum', 'windspeed','cnt']

bike_train[num_vars] = scaler.fit_transform(bike_train[num_vars])

bike_train.head()
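Min-max scaling maps each value to `(x - min) / (max - min)`, with the minimum and maximum learned from the training rows. The sketch below spells this out with NumPy on made-up temperatures; `fit_min_max` and `apply_min_max` are illustrative helpers, not part of scikit-learn. It also shows why held-out rows should reuse the training parameters: a value beyond the training range then lands outside [0, 1] instead of silently redefining the scale.

```python
import numpy as np

def fit_min_max(train_values):
    """Learn the scaling parameters from the training data only."""
    train_values = np.asarray(train_values, dtype=float)
    return train_values.min(axis=0), train_values.max(axis=0)

def apply_min_max(values, lo, hi):
    """Scale values with previously learned parameters: (x - lo) / (hi - lo)."""
    return (np.asarray(values, dtype=float) - lo) / (hi - lo)

train_temp = np.array([10.0, 20.0, 30.0])   # toy training temperatures
test_temp = np.array([15.0, 35.0])          # toy held-out temperatures

lo, hi = fit_min_max(train_temp)
train_scaled = apply_min_max(train_temp, lo, hi)
test_scaled = apply_min_max(test_temp, lo, hi)

print(train_scaled, test_scaled)
```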

### Dividing into X and Y sets for the model building

In [None]:
y_train = bike_train.pop('cnt')
X_train = bike_train

### Building our model

#### RFE

In [None]:
# Importing RFE and LinearRegression

from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

In [None]:
# Running RFE with the output number of variables equal to 15
lm = LinearRegression()
lm.fit(X_train, y_train)

rfe = RFE(lm, n_features_to_select=15)             # running RFE
rfe = rfe.fit(X_train, y_train)

In [None]:
list(zip(X_train.columns,rfe.support_,rfe.ranking_))

In [None]:
col = X_train.columns[rfe.support_]
col

In [None]:
X_train.columns[~rfe.support_]

### Building model using statsmodel, for the detailed statistics

In [None]:
# Creating X_train dataframe with RFE selected variables
X_train_rfe = X_train[col]

In [None]:
# Adding a constant variable 
import statsmodels.api as sm  
X_train_rfe = sm.add_constant(X_train_rfe)

In [None]:
lm = sm.OLS(y_train,X_train_rfe).fit()   # Running the linear model

In [None]:
#Let's see the summary of our linear model
print(lm.summary())

'fall' variable is insignificant so let's drop it 

In [None]:
X_train_new = X_train_rfe.drop(["fall"], axis = 1)

Rebuilding model without fall

In [None]:
# Adding a constant variable 
import statsmodels.api as sm  
X_train_lm = sm.add_constant(X_train_new)

In [None]:
lm = sm.OLS(y_train,X_train_lm).fit()   # Running the linear model

In [None]:
#Let's see the summary of our linear model
print(lm.summary())

In [None]:
X_train_new.columns

In [None]:
X_train_new = X_train_new.drop(['const'], axis=1)

In [None]:
#Calculate the VIFs for the new model
from statsmodels.stats.outliers_influence import variance_inflation_factor

vif = pd.DataFrame()
X = X_train_new
vif['Features'] = X.columns
vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif
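The VIF of a feature is `1 / (1 - R²)`, where R² comes from regressing that feature on all the other features. The sketch below computes it from that definition with NumPy; `vif_from_definition` is an illustrative helper. Note one assumption: the sketch adds an intercept to every auxiliary regression, so its numbers can differ from `variance_inflation_factor` when that function is called on a matrix without a constant column.

```python
import numpy as np

def vif_from_definition(X):
    """VIF per column: regress column i on the others (plus intercept), then 1 / (1 - R^2)."""
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    vifs = []
    for i in range(n_cols):
        target = X[:, i]
        others = np.delete(X, i, axis=1)
        design = np.column_stack([np.ones(n_rows), others])
        beta, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ beta
        ss_res = residuals @ residuals
        ss_tot = ((target - target.mean()) ** 2).sum()
        r2 = 1.0 - ss_res / ss_tot
        vifs.append(1.0 / (1.0 - r2))   # undefined if a column is perfectly collinear
    return vifs

# Two uncorrelated columns: VIF close to 1
independent = np.array([[1, 0], [0, 1], [-1, 0], [0, -1]])
# Two nearly identical columns: VIF far above the usual thresholds
collinear = np.array([[1, 1.1], [2, 1.9], [3, 3.2], [4, 3.9], [5, 5.1]])

vif_independent = vif_from_definition(independent)
vif_collinear = vif_from_definition(collinear)
print(vif_independent, vif_collinear)
```

A large VIF means the feature is almost a linear combination of the others, which is why such features are candidates for removal.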

In [None]:
# The VIF of hum is too high, which indicates multicollinearity, so let's drop it
X_train_new1 = X_train_new.drop(["hum"], axis = 1)
X_train_new1.columns

In [None]:
# Adding a constant variable 
import statsmodels.api as sm  
X_train_lm = sm.add_constant(X_train_new1)

In [None]:
lm1 = sm.OLS(y_train,X_train_lm).fit() 

In [None]:
print(lm1.summary())

In [None]:
#Calculate the VIFs for the new model
from statsmodels.stats.outliers_influence import variance_inflation_factor

vif = pd.DataFrame()
X = X_train_new1
vif['Features'] = X.columns
vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

In [None]:
# holiday has a high p-value, so it is insignificant; let's drop it
X_train_new2 = X_train_new1.drop(["holiday"], axis = 1)
X_train_new2.columns

In [None]:
# Adding a constant variable 
import statsmodels.api as sm  
X_train_lm = sm.add_constant(X_train_new2)

In [None]:
lm2 = sm.OLS(y_train,X_train_lm).fit() 

In [None]:
print(lm2.summary())

In [None]:
#Calculate the VIFs for the new model
from statsmodels.stats.outliers_influence import variance_inflation_factor

vif = pd.DataFrame()
X = X_train_new2
vif['Features'] = X.columns
vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

In [None]:
#Dropping the windspeed column, as temp is important to our model
X_train_new3 = X_train_new2.drop(["windspeed"], axis = 1)
X_train_new3.columns

In [None]:
# Adding a constant variable 
import statsmodels.api as sm  
X_train_lm= sm.add_constant(X_train_new3)

In [None]:
lm3 = sm.OLS(y_train,X_train_lm).fit() 

In [None]:
print(lm3.summary())

In [None]:
#Calculate the VIFs for the new model
from statsmodels.stats.outliers_influence import variance_inflation_factor

vif = pd.DataFrame()
X = X_train_new3
vif['Features'] = X.columns
vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

In [None]:
# workingday has a high VIF. temp is very important to our business, so we keep temp and drop workingday instead to reduce the VIF
X_train_new4 = X_train_new3.drop(["workingday"], axis = 1)
X_train_new4.columns

In [None]:
# Adding a constant variable 
import statsmodels.api as sm  
X_train_lm = sm.add_constant(X_train_new4)

In [None]:
lm4 = sm.OLS(y_train,X_train_lm).fit() 

In [None]:
print(lm4.summary())

In [None]:
#Calculate the VIFs for the new model
from statsmodels.stats.outliers_influence import variance_inflation_factor

vif = pd.DataFrame()
X = X_train_new4
vif['Features'] = X.columns
vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

In [None]:
#We need to drop the 'oct' column as it has a high p-value and is hence insignificant
X_train_new5 = X_train_new4.drop(["oct"], axis = 1)
X_train_new5.columns

In [None]:
# Adding a constant variable 
import statsmodels.api as sm  
X_train_lm = sm.add_constant(X_train_new5)

In [None]:
lm5 = sm.OLS(y_train,X_train_lm).fit() 

In [None]:
print(lm5.summary())

In [None]:
#Calculate the VIFs for the new model
from statsmodels.stats.outliers_influence import variance_inflation_factor

vif = pd.DataFrame()
X = X_train_new5
vif['Features'] = X.columns
vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

In [None]:
#We need to drop the 'sat' column as it has a high p-value and is hence insignificant
X_train_new6 = X_train_new5.drop(["sat"], axis = 1)
X_train_new6.columns

In [None]:
# Adding a constant variable 
import statsmodels.api as sm  
X_train_lm = sm.add_constant(X_train_new6)

In [None]:
lm6 = sm.OLS(y_train,X_train_lm).fit() 

In [None]:
print(lm6.summary())

In [None]:
#Calculate the VIFs for the new model
from statsmodels.stats.outliers_influence import variance_inflation_factor

vif = pd.DataFrame()
X = X_train_new6
vif['Features'] = X.columns
vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

### Residual Analysis

In [None]:

y_train_cnt = lm6.predict(X_train_lm)

In [None]:
# Importing the required libraries for plots.
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
# Plot the histogram of the error terms
fig = plt.figure()
sns.histplot((y_train - y_train_cnt), bins = 20, kde = True)   # distplot is deprecated in recent seaborn
fig.suptitle('Error Terms', fontsize = 20)                  # Plot heading 
plt.xlabel('Errors', fontsize = 18)                         # X-label

The error terms are approximately normally distributed, with a mean close to 0.0.
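A mean of zero on the training residuals is expected by construction whenever the model includes an intercept, so the shape of the histogram is the more informative part of this check. The sketch below illustrates that property on made-up numbers with a NumPy least-squares fit; it is not the notebook's model.

```python
import numpy as np

# Toy data: one feature plus an intercept column
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 2.9, 5.2, 6.8, 9.1])
design = np.column_stack([np.ones_like(x), x])

# Ordinary least squares fit
beta, *_ = np.linalg.lstsq(design, y, rcond=None)
residuals = y - design @ beta

# With an intercept, the training residuals sum to zero up to rounding error
mean_residual = residuals.mean()
print(mean_residual)
```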

## Making predictions

In [None]:
#Apply scaling on test sets, reusing the scaler already fitted on the training data

num_vars = ['temp', 'hum', 'windspeed','cnt']

bike_test[num_vars] = scaler.transform(bike_test[num_vars])

bike_test.head()

In [None]:
y_test = bike_test.pop('cnt')
X_test = bike_test

In [None]:
# Now let's use our model to make predictions.

# Creating X_test_new dataframe by dropping variables from X_test
X_test_new = X_test[X_train_new6.columns]

# Adding a constant variable 
X_test_new = sm.add_constant(X_test_new)

In [None]:
# Making predictions
y_cnt = lm6.predict(X_test_new)

In [None]:
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

In [None]:
#Checking r-squared on test set
r_squared = r2_score(y_test, y_cnt)
r_squared
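For reference, R² is `1 - SS_res / SS_tot`: the share of the variance in the observed values that the predictions account for. A minimal NumPy sketch on made-up numbers follows; `r_squared_score` is an illustrative helper that follows the same definition `r2_score` uses for a single target.

```python
import numpy as np

def r_squared_score(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = ((y_true - y_pred) ** 2).sum()
    ss_tot = ((y_true - y_true.mean()) ** 2).sum()
    return 1.0 - ss_res / ss_tot

toy_true = [1.0, 2.0, 3.0, 4.0]
toy_pred = [1.1, 1.9, 3.2, 3.8]

toy_r2 = r_squared_score(toy_true, toy_pred)
print(round(toy_r2, 4))
```

A test-set value close to the training R² from the model summary suggests the model generalizes rather than overfits.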

## Model Evaluation

In [None]:
# Plotting y_test and y_cnt to understand the spread.
fig = plt.figure()
plt.scatter(y_test,y_cnt)
fig.suptitle('y_test vs y_cnt', fontsize=20)              # Plot heading 
plt.xlabel('y_test', fontsize=18)                          # X-label
plt.ylabel('y_cnt', fontsize=16)                          # Y-label

#### y = 0.2316 * yr + 0.5431 * temp + 0.0972 * summer + 0.1453 * winter + 0.0601 * aug + 0.1207 * sep - 0.0793 * `mist+cloudy` - 0.2931 * `lightsnow` + 0.0645

In [None]:
#FinalModel
print(lm6.summary())
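The fitted equation reported above can also be applied by hand. The sketch below copies its coefficients into a dictionary; `predict_scaled_cnt` is an illustrative helper, and the input row is made up. Because `cnt` and `temp` were min-max scaled before fitting, both the `temp` input and the result are on the scaled 0 to 1 range, not in raw units.

```python
# Coefficients copied from the final equation
COEFFICIENTS = {
    "yr": 0.2316,
    "temp": 0.5431,
    "summer": 0.0972,
    "winter": 0.1453,
    "aug": 0.0601,
    "sep": 0.1207,
    "mist+cloudy": -0.0793,
    "lightsnow": -0.2931,
}
INTERCEPT = 0.0645

def predict_scaled_cnt(features):
    """Apply the linear equation; features missing from the dict count as 0."""
    return INTERCEPT + sum(
        coef * features.get(name, 0.0) for name, coef in COEFFICIENTS.items()
    )

# Hypothetical day: second year, scaled temperature 0.5, summer, clear weather
example_day = {"yr": 1, "temp": 0.5, "summer": 1}
scaled_prediction = predict_scaled_cnt(example_day)
print(round(scaled_prediction, 5))
```

Reading the coefficients this way also makes their signs easy to interpret: `temp` and `yr` raise the predicted demand, while the two weather dummies lower it.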

**The number of bikes rented increases from January to June.**

**In the 4th month (April), the number of bikes rented increases rapidly compared to the previous month, so there will be more demand for bikes in April.**

**If weathersit is 'heavy rain' (4 in the dataset), we don't have any bikes rented that day in any season. If the weather situation is light snow (3 in the dataset), we have very few bikes rented in all seasons.**
