<a id="prob"></a>
### <font color = "sky blue" > Problem Statement

A US bike-sharing provider, BoomBikes, wants to know:

- Which variables are significant in predicting the demand for shared bikes.
- How well those variables describe the bike demand.

<a id="prob"></a>
#### <font color = "sky blue" > Import Necessary Libraries

In [None]:
#Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')

<a id="prob"></a>
#### <font color = "sky blue" > Read Data and know the data

In [None]:
bb=pd.read_csv("../input/boom-bike-dataset/bike_sharing_data.csv")
bb.head()

In [None]:
bb.shape

In [None]:
bb.info()

<a id="prob"></a>
#### <font color = "sky blue" > Data Preparation

In [None]:
#From the given data we can see that instant is an index column so we drop it
bb.drop(['instant'],axis=1,inplace=True)

In [None]:
bb.head()

In [None]:
#The dteday column carries the same information as yr and mnth, so we drop dteday to avoid confusion

bb.drop(['dteday'],axis=1,inplace=True)
bb.head()

In [None]:
#we know that casual+registered=cnt and cnt is our target variable so we will not consider casual and registered
bb.drop(['casual','registered'],axis=1,inplace=True)
bb.head()
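
The claim that `casual + registered = cnt` can be checked before the two columns are dropped. The sketch below uses a small hand-made frame (`toy`, with made-up values) rather than the real file, so it only illustrates the kind of check one might run.

```python
import pandas as pd

# Toy rows standing in for the real file; the values are made up for illustration.
toy = pd.DataFrame({
    "casual": [331, 131, 120],
    "registered": [654, 670, 1229],
    "cnt": [985, 801, 1349],
})

# True only if every row satisfies casual + registered == cnt.
parts_match_total = bool((toy["casual"] + toy["registered"] == toy["cnt"]).all())
print(parts_match_total)
```

On the real data the same expression would be applied to `bb` before the `drop` call.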

In [None]:
bb.info()

In [None]:
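#From the data we can see that season, yr, mnth, holiday, weekday, workingday and weathersit are all categorical variables
#We replace the numeric codes of season, mnth, weekday and weathersit with readable labels
#The result is assigned back to the column, because replace(..., inplace=True) on a selected column may not update bb in newer pandas versions

bb['season'] = bb['season'].replace({1:"spring",2:'summer',3:'fall',4:"winter"})

bb['mnth'] = bb['mnth'].replace({1:'Jan',2:'Feb',3:'Mar',4:'Apr',5:'May',6:'June',7:'July',8:'Aug',9:'Sep',10:'Oct',11:'Nov',12:'Dec'})

#Check bb['weekday'].unique() first: if the file codes Sunday as 0 rather than 7, the 7:'Sun' entry never matches
bb['weekday'] = bb['weekday'].replace({1:'Mon',2:'Tue',3:'Wed',4:'Thurs',5:'Fri',6:'Sat',7:'Sun'})

bb['weathersit'] = bb['weathersit'].replace({1:"Clear_Few Clouds",2:"Mist_cloudy",3:"Light_rain",4:'Heavy_Rain'})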
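
A label dictionary only helps if it covers every code that actually occurs. The sketch below uses a hypothetical series of weekday codes (it deliberately contains a 0) to show how `replace` silently keeps an uncovered code, while `map` exposes it as a missing value. The codes are an assumption for illustration, not taken from the data file.

```python
import pandas as pd

# Hypothetical weekday codes; the 0 is included on purpose.
codes = pd.Series([0, 1, 6, 3])
labels = {1: "Mon", 2: "Tue", 3: "Wed", 4: "Thurs", 5: "Fri", 6: "Sat", 7: "Sun"}

# replace() keeps values that have no entry in the dictionary ...
replaced = codes.replace(labels)
# ... whereas map() turns them into NaN, which makes gaps easy to spot.
mapped = codes.map(labels)
unmapped_codes = sorted(codes[mapped.isna()].unique().tolist())

print(replaced.tolist())
print(unmapped_codes)
```

If `unmapped_codes` is not empty for a real column, the dictionary needs another entry before dummy variables are created.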

In [None]:
bb.head()

In [None]:
bb.info()

<a id="prob"></a>
#### <font color = "sky blue" > PERFORMING EDA

<a id="prob"></a>
#### <font color = "sky blue" > Visualising numerical variables


In [None]:
# Pairplot of numerical variables
sns.pairplot(bb,vars=['temp','atemp','hum','windspeed','cnt'])
plt.show()

In [None]:
#Check the correlation between the numeric columns (the label columns are excluded)
plt.figure(figsize=(16,20))
sns.heatmap(bb.select_dtypes(include='number').corr(),annot=True)
plt.show()

In [None]:
#temp and atemp are almost perfectly correlated with each other, so we drop the temp column
bb.drop(['temp'],axis=1,inplace=True)
bb.head()
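
Reading strongly correlated pairs off a large heatmap can be error-prone. As a rough sketch, the pairs above a chosen threshold can also be listed programmatically. The data below is synthetic (`toy`, generated with a fixed seed), and the 0.95 threshold is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic data: atemp is nearly a rescaled copy of temp, windspeed is independent.
rng = np.random.default_rng(0)
temp = rng.uniform(0, 35, size=200)
toy = pd.DataFrame({
    "temp": temp,
    "atemp": temp * 1.1 + rng.normal(0, 0.5, size=200),
    "windspeed": rng.uniform(0, 30, size=200),
})

corr = toy.corr().abs()
# Keep only the upper triangle so that each pair is listed once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
high_pairs = [
    (row, col)
    for row in upper.index
    for col in upper.columns
    if pd.notna(upper.loc[row, col]) and upper.loc[row, col] > 0.95
]
print(high_pairs)
```

One column of each listed pair is then a candidate for removal, as was done with `temp` above.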

<a id="prob"></a>
#### <font color = "sky blue" > Visualising categorical variables 

In [None]:

plt.figure(figsize=(20, 12))
plt.subplot(3,3,1)
sns.boxplot(x = 'season', y = 'cnt', data = bb)
plt.subplot(3,3,2)
sns.boxplot(x = 'yr', y = 'cnt', data = bb)
plt.subplot(3,3,3)
sns.boxplot(x = 'mnth', y = 'cnt', data = bb)
plt.subplot(3,3,4)
sns.boxplot(x = 'workingday', y = 'cnt', data = bb)
plt.subplot(3,3,5)
sns.boxplot(x = 'weathersit', y = 'cnt', data = bb)
plt.subplot(3,3,6)
sns.boxplot(x = 'weekday', y = 'cnt', data = bb)
plt.subplot(3,3,7)
sns.boxplot(x = 'holiday', y = 'cnt', data = bb)
plt.show()

  
  <div class="alert alert-block alert-success">
    Demand for bikes is higher:
    <br>In summer and fall.
    <br>In 2019, and in the months from March to October.
    <br>When the weather is clear and on holidays.
  </div>
    

In [None]:
bb.info()

In [None]:
#Grouping the categorical columns
bb_categorical = bb.select_dtypes(include=['object'])
bb_categorical.columns

In [None]:
#Create the dummy variables (dtype=int gives 0/1 columns, which statsmodels can use directly)
bb_dummies = pd.get_dummies(bb_categorical, drop_first=True, dtype=int)
bb_dummies.head()
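
To make the effect of `drop_first=True` concrete, here is a minimal sketch on a hand-made `season` column (`toy` is illustrative, not the real data). The first category in sorted order is dropped and becomes the baseline that the remaining dummy columns are compared against.

```python
import pandas as pd

toy = pd.DataFrame({"season": ["spring", "summer", "fall", "winter", "fall"]})

# Categories are sorted alphabetically, so "fall" is dropped and becomes the baseline.
dummies = pd.get_dummies(toy, drop_first=True, dtype=int)
print(dummies.columns.tolist())
print(dummies)
```

A row whose dummy columns are all 0 therefore belongs to the baseline category, here `fall`.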

In [None]:
bb.drop(bb_categorical.columns,axis=1,inplace=True)

In [None]:
bb.columns

In [None]:
bb= pd.concat([bb, bb_dummies],axis=1)

In [None]:
bb.columns

In [None]:
bb.head()

In [None]:
bb.info()

<a id="prob"></a>
#### <font color = "sky blue" > Split the data

In [None]:
from sklearn.model_selection import train_test_split

# random_state fixes the split, so that the train and test data sets always contain the same rows
np.random.seed(0)
bb_train, bb_test = train_test_split(bb, train_size = 0.7, test_size = 0.3, random_state = 100)

In [None]:
bb_train.head(2)

In [None]:
bb_test.head(2)

In [None]:
bb_train.columns

In [None]:
bb_test.columns

In [None]:
#SCALING THE NUMERICAL DATA
from sklearn.preprocessing import MinMaxScaler
scaler=MinMaxScaler()

In [None]:
num_vars=['atemp','hum','windspeed','cnt']
bb_train[num_vars] = scaler.fit_transform(bb_train[num_vars])

In [None]:
bb_train.head()

In [None]:
bb_train.describe()
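
Min-max scaling maps each training column to the range 0 to 1 using that column's training minimum and maximum. The sketch below reproduces the arithmetic by hand on made-up numbers, to show why the scaler is fitted on the training set only and then merely applied to the test set.

```python
import pandas as pd

# Made-up values for illustration.
train = pd.Series([10.0, 20.0, 30.0], name="windspeed")
test = pd.Series([15.0, 40.0], name="windspeed")

# "fit": learn the minimum and maximum from the training data only.
lo, hi = train.min(), train.max()

# "transform": apply the same two numbers to both sets.
train_scaled = (train - lo) / (hi - lo)
test_scaled = (test - lo) / (hi - lo)

print(train_scaled.tolist())
print(test_scaled.tolist())
```

Test values can fall outside the 0 to 1 range when they lie beyond the training minimum or maximum; that is expected and not a sign of an error.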

In [None]:
#CREATING X AND Y
X_train = bb_train
y_train = bb_train.pop('cnt')

In [None]:
X_train.head()

In [None]:
y_train

In [None]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

<a id="prob"></a>
#### <font color = "sky blue" > BUILD USING RFE APPROACH FOR FEATURE SELECTION(20 Var)

In [None]:
lm = LinearRegression()
lm.fit(X_train, y_train)

rfe = RFE(lm, n_features_to_select=20)   # the number of features must be passed by keyword in current scikit-learn
rfe = rfe.fit(X_train, y_train)

In [None]:
list(zip(X_train.columns,rfe.support_,rfe.ranking_))

In [None]:
col = X_train.columns[rfe.support_]
col

In [None]:
X_train.columns[~rfe.support_]

In [None]:
X_train_rfe = X_train[col]

<a id="prob"></a>
#### <font color = "sky blue" > Building model using statsmodel

In [None]:
import statsmodels.api as sm  
X_train_rfe1 = sm.add_constant(X_train_rfe)

In [None]:
lm = sm.OLS(y_train,X_train_rfe1).fit()

In [None]:
print(lm.summary())

In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

vif = pd.DataFrame()
X = X_train_rfe
vif['Features'] = X.columns
vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

In [None]:
X_train_rfe1.head()

In [None]:
#The hum column has a very high VIF, so we drop it and build the model again
X_train_rfe=X_train_rfe.drop(['hum'],axis=1)
X_train_rfe1 = sm.add_constant(X_train_rfe)
lm1 = sm.OLS(y_train,X_train_rfe1).fit()
print(lm1.summary())

In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

vif = pd.DataFrame()
X = X_train_rfe
vif['Features'] = X.columns
vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

In [None]:
X_train_rfe

In [None]:
#weekday_Thurs has a very high p-value, so we drop it
X_train_rfe=X_train_rfe.drop(['weekday_Thurs'],axis=1)


In [None]:
X_train_rfe2 = sm.add_constant(X_train_rfe)
lm2 = sm.OLS(y_train,X_train_rfe2).fit()
print(lm2.summary())

In [None]:
vif = pd.DataFrame()
X = X_train_rfe
vif['Features'] = X.columns
vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

In [None]:
#mnth_July has a high p-value, so we drop it
X_train_rfe=X_train_rfe.drop(['mnth_July'],axis=1)
X_train_rfe3 = sm.add_constant(X_train_rfe)
lm3 = sm.OLS(y_train,X_train_rfe3).fit()
print(lm3.summary())

In [None]:
vif = pd.DataFrame()
X = X_train_rfe
vif['Features'] = X.columns
vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

In [None]:
#season_winter has a very high p-value, which means it is insignificant, so we drop it
X_train_rfe=X_train_rfe.drop(['season_winter'],axis=1)
X_train_rfe4 = sm.add_constant(X_train_rfe)
lm4 = sm.OLS(y_train,X_train_rfe4).fit()
print(lm4.summary())

In [None]:
vif = pd.DataFrame()
X = X_train_rfe
vif['Features'] = X.columns
vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

In [None]:
#weekday_Tue has a very high p-value, which means it is insignificant, so we drop it
X_train_rfe=X_train_rfe.drop(['weekday_Tue'],axis=1)
X_train_rfe5 = sm.add_constant(X_train_rfe)
lm5 = sm.OLS(y_train,X_train_rfe5).fit()
print(lm5.summary())

In [None]:
vif = pd.DataFrame()
X = X_train_rfe
vif['Features'] = X.columns
vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

In [None]:
#Adding atemp as it is a significant feature
X_train_rfe['atemp']=X_train['atemp']
X_train_rfe.head()

In [None]:
X_train_rfe6 = sm.add_constant(X_train_rfe)
lm6 = sm.OLS(y_train,X_train_rfe6).fit()
print(lm6.summary())

In [None]:
vif = pd.DataFrame()
X = X_train_rfe
vif['Features'] = X.columns
vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

In [None]:
#drop mnth Nov as it has high P
X_train_rfe=X_train_rfe.drop(['mnth_Nov'],axis=1)
X_train_rfe

In [None]:
X_train_rfe7 = sm.add_constant(X_train_rfe)
lm7 = sm.OLS(y_train,X_train_rfe7).fit()
print(lm7.summary())


In [None]:
vif = pd.DataFrame()
X = X_train_rfe
vif['Features'] = X.columns
vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

In [None]:
#Drop weekday Mon due to high p-value
X_train_rfe=X_train_rfe.drop(['weekday_Mon'],axis=1)
X_train_rfe.head()



In [None]:
X_train_rfe8 = sm.add_constant(X_train_rfe)
lm8 = sm.OLS(y_train,X_train_rfe8).fit()
print(lm8.summary())

In [None]:
vif = pd.DataFrame()
X = X_train_rfe
vif['Features'] = X.columns
vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

In [None]:
#Drop weekday_Wed due to high p-value
X_train_rfe=X_train_rfe.drop(['weekday_Wed'],axis=1)
X_train_rfe.head()


In [None]:
X_train_rfe9 = sm.add_constant(X_train_rfe)
lm9 = sm.OLS(y_train,X_train_rfe9).fit()
print(lm9.summary())

In [None]:
vif = pd.DataFrame()
X = X_train_rfe
vif['Features'] = X.columns
vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

In [None]:
#Drop weekday Fri due to high p-value
X_train_rfe=X_train_rfe.drop(['weekday_Fri'],axis=1)
X_train_rfe.head()


In [None]:
X_train_rfe10 = sm.add_constant(X_train_rfe)
lm10= sm.OLS(y_train,X_train_rfe10).fit()
print(lm10.summary())

In [None]:
vif = pd.DataFrame()
X = X_train_rfe
vif['Features'] = X.columns
vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

In [None]:
#Adding mnth_Dec
X_train_rfe['mnth_Dec']=X_train['mnth_Dec']
X_train_rfe.head()

In [None]:
X_train_rfe11 = sm.add_constant(X_train_rfe)
lm11 = sm.OLS(y_train,X_train_rfe11).fit()
print(lm11.summary())

In [None]:
vif = pd.DataFrame()
X = X_train_rfe
vif['Features'] = X.columns
vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

In [None]:
#Adding mnth_Feb
X_train_rfe['mnth_Feb']=X_train['mnth_Feb']
X_train_rfe.head()

In [None]:
X_train_rfe12 = sm.add_constant(X_train_rfe)
lm12 = sm.OLS(y_train,X_train_rfe12).fit()
print(lm12.summary())

In [None]:
vif = pd.DataFrame()
X = X_train_rfe
vif['Features'] = X.columns
vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

In [None]:
#Adding mnth_May
X_train_rfe['mnth_May']=X_train['mnth_May']
X_train_rfe.head()

In [None]:
X_train_rfe13 = sm.add_constant(X_train_rfe)
lm13 = sm.OLS(y_train,X_train_rfe13).fit()
print(lm13.summary())

In [None]:
#Drop mnth_May due to high p-value
X_train_rfe=X_train_rfe.drop(['mnth_May'],axis=1)
X_train_rfe.head()

We have now considered and checked all the columns, so we stop building models and decide which model to choose.
Of all the models, lm12 seems to give a good result, so we choose it.

In [None]:
#Predict values on the training set with the chosen model and its own design matrix
y_train_cnt = lm12.predict(X_train_rfe12)

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
#CALCULATING RESIDUALS

res=y_train - y_train_cnt

In [None]:
#Checking ASSUMPTION OF NORMALITY:
# Plot the histogram of the error terms
fig = plt.figure()
sns.histplot(res, bins = 20, kde = True, stat = "density")   # distplot is deprecated in recent seaborn versions
fig.suptitle('Error Terms', fontsize = 20)                  # Plot heading 
plt.xlabel('Errors', fontsize = 18)                         # X-label

In [None]:
#Checking the columns of the chosen model
X_train_rfe12.columns

In [None]:
print(X_train_rfe12.shape)
print(res.shape)

In [None]:
bb_test.columns

In [None]:
#Scaling the test data

num_vars=['atemp','hum','windspeed','cnt']
bb_test[num_vars] = scaler.transform(bb_test[num_vars])

In [None]:
#Creating x and y sets

y_test = bb_test.pop('cnt')
X_test = bb_test

In [None]:
X_train_new=X_train_rfe12.drop(['const'], axis=1)

In [None]:
# Now let's use our model to make predictions.
# Creating X_test_new dataframe by keeping only the variables used in the chosen model

X_test_new = X_test[X_train_new.columns]

# Adding a constant variable 
X_test_new = sm.add_constant(X_test_new)

In [None]:
X_train_rfe12.columns

<a id="prob"></a>
#### <font color = "sky blue" > Making predictions on the chosen model


In [None]:
y_pred = lm12.predict(X_test_new)

In [None]:
#CHECKING PREDICTED V/s TEST DATA 

fig = plt.figure()
plt.scatter(y_test,y_pred)
fig.suptitle('y_test vs y_pred', fontsize=20)              # Plot heading 
plt.xlabel('y_test', fontsize=18)                          # X-label
plt.ylabel('y_pred', fontsize=16)                          # Y-label

The model seems good enough to predict the demand for bikes. The actual and predicted values of cnt (the demand) lie close to each other in the scatter plot, which indicates that the model explains the variation in demand well.

In [None]:
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

In [None]:
#Returns the mean squared error; we'll take a square root
np.sqrt(mean_squared_error(y_test, y_pred))

In [None]:
#Calculate the r square for test

r_squared = r2_score(y_test, y_pred)
r_squared
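
`r2_score` returns the plain R-squared. An adjusted value for the test set can be derived from it with the usual formula, 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of rows and p the number of features. The helper name `adjusted_r2` and the numbers below are illustrative only, not results from this notebook.

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R-squared for a model with an intercept."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


# Illustrative numbers only, not results from this notebook.
example = adjusted_r2(r2=0.80, n_samples=101, n_features=10)
print(round(example, 4))
```

In the notebook the inputs would be `r_squared`, `X_test_new.shape[0]`, and the number of feature columns excluding the constant.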

In [None]:
X_t=X_train_new.iloc[:,0].values

In [None]:
#Plotting the residuals to see if a pattern exists
#Checking assumption of homoscedasticity and autocorrelation
fig = plt.figure()                                                # keep a handle so the title goes on this figure
plt.scatter(X_t,res)
fig.suptitle('Independent vars vs res', fontsize=20)              # Plot heading 
plt.xlabel('Independent variables', fontsize=18)                          # X-label
plt.ylabel('Residuals', fontsize=16)  
plt.show()

In [None]:
print(X_train_rfe12.columns)
print(lm12.summary())

<a id="prob"></a>
#### <font color = " sky blue " > We can see that the equation for best fitted line is:

<a id="prob"></a>
#### <font color = " Purple " >cnt = 0.2381 X yr - 0.0796 X holiday - 0.1447 X windspeed - 0.0870 X season_spring - 0.0675 X season_winter - 0.0686 X mnth_Dec - 0.0490 X mnth_Feb - 0.0869 X mnth_Jan + 0.0330 X mnth_May - 0.0578 X mnth_Nov + 0.0693 X mnth_Sep + 0.0201 X weekday_Sat - 0.2941 X weathersit_Light_rain - 0.0814 X weathersit_Mist_cloudy + 0.3605 X atemp

<a id="prob"></a>
#### <font color = " Purple " >R-Squared : 0.83
    
<a id="prob"></a>
#### <font color = " Purple " > Adjusted R-Squared :0.82
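
Because `cnt` was min-max scaled together with the other numeric columns, the fitted model predicts demand on the 0 to 1 scale rather than in rentals per day. A minimal sketch of the conversion back is shown below. The function name and the minimum and maximum values are hypothetical placeholders, not values from the data.

```python
import numpy as np


def to_rental_counts(scaled_pred, cnt_min, cnt_max):
    """Invert min-max scaling: scaled value times the range, plus the minimum."""
    return np.asarray(scaled_pred) * (cnt_max - cnt_min) + cnt_min


# Hypothetical training-set minimum and maximum of cnt, for illustration only.
counts = to_rental_counts([0.0, 0.5, 1.0], cnt_min=100, cnt_max=8100)
print(counts.tolist())
```

In the notebook itself the two bounds would come from the fitted scaler, namely `scaler.data_min_` and `scaler.data_max_` at the position of `cnt` in `num_vars`.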

<a id="prob"></a>
#### <font color = " Purple" >Demand increases with temperature (atemp) and year, in the months of May and September, and on Saturdays.
<a id="prob"></a>
#### <font color = " Purple " >Demand decreases with windspeed, on holidays, in spring and winter, in light rain or mist and cloud, and in the months of Nov, Dec, Jan and Feb.

<a id="prob"></a>
#### <font color = " Purple " >Final recommendations for the company:
<a id="prob"></a>
#### <font color = " Purple " >Demand is higher in the months of May and September.