## <font color="blue">Importing packages</font>


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

# Suppress warnings
warnings.filterwarnings('ignore')

In [None]:
## increasing the column view
pd.set_option('display.max_columns', 999)

<br>
<br>

## <font color="blue">Step 1: Reading and Understanding the Data</font>

Let's start with the following steps:

1. Importing data using the pandas library
2. Understanding the structure of the data

In [None]:
df = pd.read_csv("/kaggle/input/boom-bike-dataset/day.csv")
df.head()

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
df.shape

In [None]:
### dropping the instant & dteday columns as they are not required for the model
df.drop(columns=["instant","dteday"], inplace=True)

In [None]:
### checking null count of each column
df.isnull().sum()

<br>
<br>

## <font color="blue">Step 2: Visualising the Data</font>

Let's now visualise our data using seaborn. We'll first make scatter plots of all the continuous variables to see which ones are most correlated with `cnt`.

In [None]:
### scatter plots for continuous vars
contvars = ['temp', 'atemp', 'hum','windspeed','casual','registered']
plt.figure(figsize=(20,20))
for i in range(1,len(contvars)+1):
    plt.subplot(3,2,i)
    sns.scatterplot(data=df , y="cnt" , x=contvars[i-1])

plt.show()

In [None]:
### box plot for categorical vars
catvars = ["season","yr","mnth","holiday","weekday","workingday","weathersit"]
plt.figure(figsize=(20,20))
for i in range(1,len(catvars)+1):
    plt.subplot(4,2,i)
    sns.boxplot(data=df , y="cnt" , x=catvars[i-1])
    

plt.show()    
    
    
# season : season (1:spring, 2:summer, 3:fall, 4:winter)
# yr : year (0: 2018, 1:2019)
# mnth : month ( 1 to 12)
# holiday : whether the day is a holiday or not (extracted from http://dchr.dc.gov/page/holiday-schedule)
# weekday : day of the week
# workingday : if day is neither weekend nor holiday is 1, otherwise is 0.
# weathersit : 
#  1: Clear, Few clouds, Partly cloudy
#  2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
#  3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
#  4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
    


<br>
<br>

## <font color="blue">Step 3: Data Preparation</font>

1. We need to convert `season`, `mnth`, `weekday` and `weathersit` into dummy columns, as these are multi-level categorical variables.

2. As `yr`, `holiday` and `workingday` are already binary categorical columns, there is no need to convert them.
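
To make the encoding concrete, here is a minimal sketch on a toy column (the `toy` frame and its values are made up for illustration and are not taken from `day.csv`). With `drop_first=True`, a variable with k levels yields k-1 indicator columns, and the dropped level becomes the all-zeros baseline.

```python
import pandas as pd

# Toy data for illustration only; not taken from day.csv
toy = pd.DataFrame({"season": [1, 2, 3, 4, 2]})

# k = 4 levels -> k - 1 = 3 indicator columns; level 1 is the all-zero baseline
dummies = pd.get_dummies(toy["season"], drop_first=True, dtype=int)
dummies.columns = ["Summer", "Fall", "Winter"]

print(dummies)
```

The first row (season 1) is all zeros, which is how the baseline level is represented once its own column has been dropped.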

In [None]:
## create Dummy for season
## 1. 000 - spring
## 2. 100 - summer
## 3. 010 - fall
## 4. 001 - winter

## dtype=int keeps the dummies numeric (newer pandas versions default to bool)
season = pd.get_dummies(df.season, drop_first=True, dtype=int)

In [None]:
## convert column names 
season.columns = ["Summer","Fall","Winter"]
season.head()

In [None]:
# merging the dummy columns into the original dataframe and dropping the season column
df = pd.concat([df,season] , axis=1)

df.drop(columns="season",inplace=True)
        
df.head() 

In [None]:
## create Dummy for weekdays
## 0. 000000 - Monday
## 1. 100000 - Tuesday
## 2. 010000 - Wednesday
## 3. 001000 - Thursday
## 4. 000100 - Friday
## 5. 000010 - Saturday
## 6. 000001 - Sunday

weekday = pd.get_dummies(df.weekday, drop_first=True, dtype=int)

In [None]:
## convert column names 
weekday.columns = ["Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday"]
weekday.head()

In [None]:
# merging the dummy columns into the original dataframe and dropping the weekday column
df = pd.concat([df,weekday] , axis=1)

df.drop(columns="weekday",inplace=True)
        
df.head() 

In [None]:
## create Dummy for weathersit
## 1. 00 - Clear, Few clouds, Partly cloudy
## 2. 10 - Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
## 3. 01 - Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds

## 4. N.A - Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
## (not considered, as the box plot shows no data for this category)

weathersit = pd.get_dummies(df.weathersit, drop_first=True, dtype=int)

In [None]:
## convert column names 
weathersit.columns = ["Mist","LightSnow"]
weathersit.head()

In [None]:
# merging the dummy columns into the original dataframe and dropping the weathersit column
df = pd.concat([df,weathersit] , axis=1)

df.drop(columns="weathersit",inplace=True)
        
df.head() 

In [None]:
## create Dummy for mnth
## 1. 00000000000 - Jan
## 2. 10000000000 - Feb
## 3. 01000000000 - Mar
## 4. 00100000000 - Apr
## 5. 00010000000 - May
## 6. 00001000000 - June
## 7. 00000100000 - July
## 8. 00000010000 - Aug
## 9. 00000001000 - Sep
## 10. 00000000100 - Oct
## 11. 00000000010 - Nov
## 12. 00000000001 - Dec

mnth = pd.get_dummies(df.mnth, drop_first=True, dtype=int)

In [None]:
## convert column names 
mnth.columns = ["Feb","Mar","Apr","May","June","July","Aug","Sep","Oct","Nov","Dec"]
mnth.head()

In [None]:
# merging the dummy columns into the original dataframe and dropping the mnth column
df = pd.concat([df,mnth] , axis=1)

df.drop(columns="mnth",inplace=True)
        
df.head()

In [None]:
## As yr is already a binary categorical column, we only rename it to the year it flags (2019) for readability
df = df.rename(columns={"yr":"2019"})

df.head()

<br>

## <font color="blue">Step 4: Splitting the Data into Training and Testing Sets</font>

As you know, the first basic step for regression is performing a train-test split.

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
df_train,df_test = train_test_split(df , train_size=0.7 ,test_size=0.3, random_state=100)
print(df_train.shape)
print(df_test.shape)

<br>
<br>

## <font color="blue">Step 5: Rescaling the Features </font>

In [None]:
from sklearn.preprocessing import MinMaxScaler

In [None]:
### rescaling temp , atemp , hum , windspeed , casual , registered , cnt
scale_feature = ["temp" , "atemp" , "hum" , "windspeed" , "casual" , "registered","cnt"]

In [None]:
scaler = MinMaxScaler()

In [None]:
df_train[scale_feature] = scaler.fit_transform(df_train[scale_feature])
df_train.head()

In [None]:
### checking the impact of scaling on the train set
df_train.describe()
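
For reference, min-max scaling maps each value to `(x - min) / (max - min)`. The sketch below reproduces that arithmetic with NumPy on made-up numbers (the `train` and `test` arrays are hypothetical). It also illustrates the convention that the minimum and maximum are learned from the training data only and then reused on the test data.

```python
import numpy as np

# Hypothetical train/test values for a single feature
train = np.array([10.0, 20.0, 30.0, 40.0])
test = np.array([15.0, 45.0])

# "fit": learn min and max from the training data only
lo, hi = train.min(), train.max()

# "transform": apply x' = (x - min) / (max - min) with the training parameters
train_scaled = (train - lo) / (hi - lo)
test_scaled = (test - lo) / (hi - lo)

print(train_scaled)  # spans exactly [0, 1]
print(test_scaled)   # may fall slightly outside [0, 1], which is expected
```

Test values above the training maximum scale to more than 1; that is normal and simply reflects that the test set was not used to choose the scaling.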

<br>
<br>

## <font color="blue">Step 6: Checking correlation coefficients in the Training Set </font>

In [None]:
# Let's check the correlation coefficients to see which variables are highly correlated

plt.figure(figsize = (30, 20))
sns.heatmap(df_train.corr(), annot = True, cmap="YlGnBu")
plt.show()
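
A large heatmap can be hard to read, so a ranked list of correlations with the target is a handy complement. The sketch below uses a small made-up frame (`toy` and its values are hypothetical); on the real data the same two lines could be applied to `df_train`.

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    "cnt":  [10, 20, 30, 40, 50],
    "temp": [1, 2, 3, 4, 6],
    "hum":  [9, 7, 8, 4, 3],
})

# Correlation of every other column with the target, signed
corr_with_target = toy.corr()["cnt"].drop("cnt")

# Rank by absolute strength so strong negative relationships are not overlooked
ranked = corr_with_target.abs().sort_values(ascending=False)

print(corr_with_target)
print(ranked)
```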

<br>
<br>

## <font color="blue">Step 7: Dividing into X and Y sets for the model building</font>

In [None]:
y_train = df_train.pop('cnt')
X_train = df_train

<br>
<br>

## <font color="blue">Step 8: Building a linear model</font>

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import RFE

In [None]:
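lm = LinearRegression()
lm.fit(X_train,y_train)
### selecting the top 20 features, guided by the heatmap above
### (n_features_to_select must be passed as a keyword in recent scikit-learn versions)
rfe = RFE(lm, n_features_to_select=20)
rfe = rfe.fit(X_train, y_train)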

In [None]:
### checking the list of features as per RFE
list(zip(X_train.columns , rfe.support_,rfe.ranking_))

In [None]:
### selecting the top RFE columns
topRFEcolumns = X_train.columns[rfe.support_]
topRFEcolumns

In [None]:
### creating the linear model with these RFE selected features

import statsmodels.api as sm

In [None]:
## adding constant to the training vars

X_train_sm = sm.add_constant(X_train[topRFEcolumns])

In [None]:
## creating the Linear Model

lm_sm = sm.OLS(y_train , X_train_sm)

lm_model = lm_sm.fit()

In [None]:
### checking the summary of the Linear Model
lm_model.summary()

In [None]:
## creating function to calculate VIF

from statsmodels.stats.outliers_influence import variance_inflation_factor

def showVIF(data,cols):
    vif = pd.DataFrame()
    vif['Features'] = cols
    vif['VIF'] = [variance_inflation_factor(data[cols].values , i) for i in range(data[cols].shape[1])]
    vif['VIF'] = round(vif['VIF'] ,2)
    vif = vif.sort_values(by ="VIF" , ascending=False)
    return vif

In [None]:
## show VIF
print(showVIF(X_train_sm,topRFEcolumns))
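
As a reminder of what the number means: the VIF of a feature is `1 / (1 - R²)`, where R² comes from regressing that feature on all the other features. The sketch below computes it by hand with NumPy on synthetic data. The function name `vif_manual` and the data are hypothetical, and the sketch includes an intercept in each auxiliary regression, so its values can differ from `showVIF`, which passes the feature columns without the constant.

```python
import numpy as np

def vif_manual(X, i):
    """VIF of column i: regress it on the other columns plus an intercept."""
    y = X[:, i]
    others = np.delete(X, i, axis=1)
    A = np.column_stack([np.ones(len(y)), others])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1.0 - resid.var() / y.var()   # R^2 of the auxiliary regression
    return 1.0 / (1.0 - r2)

# Synthetic features: c is nearly a copy of a, b is independent
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + 0.1 * rng.normal(size=200)
X = np.column_stack([a, b, c])

vifs = [vif_manual(X, i) for i in range(X.shape[1])]
print(vifs)
```

The two near-duplicate columns get a large VIF while the independent one stays close to 1, which is the pattern that motivates dropping one of a pair such as `temp` and `atemp`.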

<br>

-  ### We will remove the column temp, which has a very high VIF, and then rebuild the model

In [None]:
## selecting new list of feature columns
FeatureColumns = ['2019','atemp', 'hum', 'windspeed', 'casual', 'registered',
       'Fall', 'Winter', 'Wednesday', 'Friday', 'Saturday', 'Mist',
       'LightSnow', 'Feb', 'Mar', 'Apr', 'July', 'Aug', 'Dec']

## adding constant to the training vars

X_train_sm = sm.add_constant(X_train[FeatureColumns])

## creating the Linear Model

lm_sm = sm.OLS(y_train , X_train_sm)

lm_model = lm_sm.fit()

### checking the summary of the Linear Model
print(lm_model.summary())

## check VIF
print('\n',showVIF(X_train_sm,FeatureColumns))

<br>

-  ### We will remove the column registered, which has a very high VIF, and then rebuild the model

In [None]:
## selecting new list of feature columns
FeatureColumns = ['2019','atemp', 'hum', 'windspeed', 'casual',
       'Fall', 'Winter', 'Wednesday', 'Friday', 'Saturday', 'Mist',
       'LightSnow', 'Feb', 'Mar', 'Apr', 'July', 'Aug', 'Dec']

## adding constant to the training vars

X_train_sm = sm.add_constant(X_train[FeatureColumns])

## creating the Linear Model

lm_sm = sm.OLS(y_train , X_train_sm)

lm_model = lm_sm.fit()

### checking the summary of the Linear Model
print(lm_model.summary())

## check VIF
print('\n',showVIF(X_train_sm,FeatureColumns))

<br>

-  ### We will remove the column Dec, which has a very high P-Value, and then rebuild the model


In [None]:
## selecting new list of feature columns
FeatureColumns = ['2019','atemp', 'hum', 'windspeed', 'casual',
       'Fall', 'Winter', 'Wednesday', 'Friday', 'Saturday', 'Mist',
       'LightSnow', 'Feb', 'Mar', 'Apr', 'July', 'Aug']

## adding constant to the training vars

X_train_sm = sm.add_constant(X_train[FeatureColumns])

## creating the Linear Model

lm_sm = sm.OLS(y_train , X_train_sm)

lm_model = lm_sm.fit()

### checking the summary of the Linear Model
print(lm_model.summary())

## check VIF
print('\n',showVIF(X_train_sm,FeatureColumns))

<br>

-  ### We will remove the column Mar, which has a very high P-Value, and then rebuild the model


In [None]:
## selecting new list of feature columns
FeatureColumns = ['2019','atemp', 'hum', 'windspeed', 'casual',
       'Fall', 'Winter', 'Wednesday', 'Friday', 'Saturday', 'Mist',
       'LightSnow', 'Feb','Apr', 'July', 'Aug']

## adding constant to the training vars

X_train_sm = sm.add_constant(X_train[FeatureColumns])

## creating the Linear Model

lm_sm = sm.OLS(y_train , X_train_sm)

lm_model = lm_sm.fit()

### checking the summary of the Linear Model
print(lm_model.summary())

## check VIF
print('\n',showVIF(X_train_sm,FeatureColumns))

<br>

-  ### We will remove the column Feb, which has a very high P-Value, and then rebuild the model


In [None]:
## selecting new list of feature columns
FeatureColumns = ['2019','atemp', 'hum', 'windspeed', 'casual',
       'Fall', 'Winter', 'Wednesday', 'Friday', 'Saturday', 'Mist',
       'LightSnow', 'Apr', 'July', 'Aug']

## adding constant to the training vars

X_train_sm = sm.add_constant(X_train[FeatureColumns])

## creating the Linear Model

lm_sm = sm.OLS(y_train , X_train_sm)

lm_model = lm_sm.fit()

### checking the summary of the Linear Model
print(lm_model.summary())

## check VIF
print('\n',showVIF(X_train_sm,FeatureColumns))

<br>

-  ### We will remove the column Wednesday, which has a very high P-Value, and then rebuild the model


In [None]:
## selecting new list of feature columns
FeatureColumns = ['2019','atemp', 'hum', 'windspeed', 'casual',
       'Fall', 'Winter', 'Friday', 'Saturday', 'Mist',
       'LightSnow', 'Apr', 'July', 'Aug']

## adding constant to the training vars

X_train_sm = sm.add_constant(X_train[FeatureColumns])

## creating the Linear Model

lm_sm = sm.OLS(y_train , X_train_sm)

lm_model = lm_sm.fit()

### checking the summary of the Linear Model
print(lm_model.summary())

## check VIF
print('\n',showVIF(X_train_sm,FeatureColumns))

<br>

-  ### We will remove the column atemp, which has a very high VIF, and then rebuild the model


In [None]:
## selecting new list of feature columns
FeatureColumns = ['2019','hum', 'windspeed', 'casual',
       'Fall', 'Winter', 'Friday', 'Saturday', 'Mist',
       'LightSnow', 'Apr', 'July', 'Aug']

## adding constant to the training vars

X_train_sm = sm.add_constant(X_train[FeatureColumns])

## creating the Linear Model

lm_sm = sm.OLS(y_train , X_train_sm)

lm_model = lm_sm.fit()

### checking the summary of the Linear Model
print(lm_model.summary())

## check VIF
print('\n',showVIF(X_train_sm,FeatureColumns))

<br>

-  ### We will remove the column hum, which has a very high P-Value and VIF, and then rebuild the model


In [None]:
## selecting new list of feature columns
FeatureColumns = ['2019', 'windspeed', 'casual',
       'Fall', 'Winter', 'Friday', 'Saturday', 'Mist',
       'LightSnow', 'Apr', 'July', 'Aug']

## adding constant to the training vars

X_train_sm = sm.add_constant(X_train[FeatureColumns])

## creating the Linear Model

lm_sm = sm.OLS(y_train , X_train_sm)

lm_model = lm_sm.fit()

### checking the summary of the Linear Model
print(lm_model.summary())

## check VIF
print('\n',showVIF(X_train_sm,FeatureColumns))

-  ### We will remove the column Aug, which has a very high P-Value, and then rebuild the model


In [None]:
## selecting new list of feature columns
FeatureColumns = ['2019', 'windspeed', 'casual',
       'Fall', 'Winter', 'Friday', 'Saturday', 'Mist',
       'LightSnow', 'Apr', 'July']

## adding constant to the training vars

X_train_sm = sm.add_constant(X_train[FeatureColumns])

## creating the Linear Model

lm_sm = sm.OLS(y_train , X_train_sm)

lm_model = lm_sm.fit()

### checking the summary of the Linear Model
print(lm_model.summary())

## check VIF
print('\n',showVIF(X_train_sm,FeatureColumns))

<br>

-  ### We will remove the column July, which has a very high P-Value, and then rebuild the model


In [None]:
## selecting new list of feature columns
FeatureColumns = ['2019', 'windspeed', 'casual',
       'Fall', 'Winter', 'Friday', 'Saturday', 'Mist',
       'LightSnow', 'Apr']

## adding constant to the training vars

X_train_sm = sm.add_constant(X_train[FeatureColumns])

## creating the Linear Model

lm_sm = sm.OLS(y_train , X_train_sm)

lm_model = lm_sm.fit()

### checking the summary of the Linear Model
print(lm_model.summary())

## check VIF
print('\n',showVIF(X_train_sm,FeatureColumns))
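
The VIF-driven removals in the cells above follow a repeatable pattern: compute the VIFs, drop the worst feature if it exceeds a threshold, and repeat. That part can be sketched as a loop. The helper names `vif` and `drop_high_vif`, the threshold of 5 and the synthetic data are all assumptions of this sketch; it does not cover the p-value-based removals, which were judged from the statsmodels summary.

```python
import numpy as np

def vif(X, i):
    """VIF of column i, using an auxiliary regression with an intercept."""
    y = X[:, i]
    A = np.column_stack([np.ones(len(y)), np.delete(X, i, axis=1)])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return y.var() / resid.var()   # algebraically equal to 1 / (1 - R^2)

def drop_high_vif(X, names, threshold=5.0):
    """Repeatedly drop the feature with the highest VIF above the threshold."""
    names = list(names)
    while X.shape[1] > 1:
        vifs = [vif(X, i) for i in range(X.shape[1])]
        worst = int(np.argmax(vifs))
        if vifs[worst] <= threshold:
            break
        X = np.delete(X, worst, axis=1)
        names.pop(worst)
    return names

# Synthetic stand-ins: "atemp" is nearly a copy of "temp", "hum" is independent
rng = np.random.default_rng(1)
temp = rng.normal(size=300)
atemp = temp + 0.05 * rng.normal(size=300)
hum = rng.normal(size=300)
X = np.column_stack([temp, atemp, hum])

kept = drop_high_vif(X, ["temp", "atemp", "hum"])
print(kept)
```

Dropping one feature at a time matters: once one of two collinear columns is gone, the VIF of the other usually falls sharply, so removing both in one pass would discard more than necessary.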

<br>
<br>

- ### Finally, we have a list of 10 features that give a good R2 value in the model, and none of them has a P-Value > 0.05 or a VIF > 5

- ### Features : <font color="green">casual , windspeed , 2019 , Fall, Winter, Mist, Apr, Friday, Saturday, LightSnow </font>

<br>
<br>

## <font color="blue">Step 9: Residual Analysis of the train data</font>

To check whether the error terms are normally distributed (one of the major assumptions of linear regression), let us plot a histogram of the error terms and see what it looks like.

In [None]:
## getting predicted value from the model
y_train_predict = lm_model.predict(X_train_sm)

In [None]:
# Plot the histogram of the error terms
fig = plt.figure()
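# distplot is deprecated in recent seaborn releases; histplot with kde=True gives a comparable view
sns.histplot(y_train - y_train_predict, kde=True)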
fig.suptitle('Error Terms', fontsize = 15)
plt.xlabel('Errors', fontsize = 18)  
plt.show()

- ### We can see that the error terms of the model's predictions on our training data are approximately normally distributed, with the mean at 0
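
A histogram is a visual check, and a couple of summary numbers can back it up. The sketch below fits a small least-squares line on synthetic data (all values are made up) and computes the mean and skewness of the residuals. One caveat it illustrates: when the model includes a constant term, the training residuals average to zero by construction, so the shape of the distribution is the more informative part of the check.

```python
import numpy as np

# Synthetic data: a straight line plus Gaussian noise
rng = np.random.default_rng(42)
x = rng.uniform(0, 1, size=300)
y = 2.0 + 3.0 * x + rng.normal(scale=0.2, size=300)

# Ordinary least squares with an intercept column
A = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
residuals = y - A @ coef

# Mean is ~0 by construction; skewness near 0 suggests a symmetric distribution
mean_resid = residuals.mean()
skew = np.mean((residuals - mean_resid) ** 3) / residuals.std() ** 3

print(mean_resid, skew)
```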


<br>
<br>

##  <font color="blue">Step 10: Making Predictions Using the Final Model</font>

#### Applying the scaling on the test sets

In [None]:
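## use transform (not fit_transform) so the test set is scaled with the min/max learned from the training set
df_test[scale_feature] = scaler.transform(df_test[scale_feature])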
df_test.head()

In [None]:
### checking the impact of scaling on the test set
df_test.describe()

<br>

#### Dividing into X and Y vars from test sets

In [None]:
y_test = df_test.pop('cnt')
X_test = df_test

<br>

#### Adding constant variable to test dataframe

In [None]:
X_test_sm = sm.add_constant(X_test[FeatureColumns])

<br>

#### Making predictions

In [None]:
y_test_predict = lm_model.predict(X_test_sm)

<br>
<br>

## <font color="blue">Step 11: Model Evaluation</font>

In [None]:
# Plotting y_test and y_test_predict to understand the spread

fig = plt.figure()
plt.scatter(y_test, y_test_predict)
fig.suptitle('y_test vs y_test_predict', fontsize = 20) 
plt.xlabel('y_test', fontsize = 18) 
plt.ylabel('y_test_predict', fontsize = 16)  
plt.show()

In [None]:
## checking the R2-score of the Test data

from sklearn.metrics import r2_score

r2_score(y_test, y_test_predict)
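
For reference, R² is `1 - SS_res / SS_tot`, and adjusted R² additionally penalises the number of features. The sketch below writes both out with NumPy on made-up numbers; the function names `r2_manual` and `adjusted_r2` are hypothetical helpers, not part of scikit-learn.

```python
import numpy as np

def r2_manual(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R^2 penalises R^2 for the number of predictors used."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

# Made-up values for illustration
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 9.5])

r2 = r2_manual(y_true, y_pred)
r2_adj = adjusted_r2(r2, n_samples=len(y_true), n_features=1)

print(r2, r2_adj)
```

Applied to the test set, `adjusted_r2` would take the number of test rows and the number of features in the final model.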

<br>

## <font color="green">The R2 values on the train (0.772) and test (0.708) data are reasonably close, so we can conclude that the model generalises well to unseen data.</font>