## Step 1: Reading and Understanding the Data


In [None]:
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

## Data Description

In [None]:
bikedata=pd.read_csv('../input/boom-bike-dataset/bike_sharing_data.csv')
bikedata

In [None]:
bikedata.shape

In [None]:
bikedata.info()

In [None]:
bikedata.describe()

## Step 2: Check - Null Values


In [None]:
# Calculate the percentage of null values in each column
round(100*(bikedata.isnull().sum()/len(bikedata.index)),2).sort_values(ascending=False)

In [None]:
# Calculate the percentage of null values in each row (divide by the number of columns)
round(100*(bikedata.isnull().sum(axis=1)/bikedata.shape[1]),2).sort_values(ascending=False)

### From the above output, there are no null values in any row or column.
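
As a minimal sketch of the same check on a toy dataframe (the frame `toy` and its columns are made up for illustration and are not part of the bike data), `isnull().mean()` gives the fraction of missing values directly, so no row or column count has to be hard-coded:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one missing value in column "a"
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0, 4.0], "b": [1.0, 2.0, 3.0, 4.0]})

# Percentage of nulls per column: mean of the boolean mask times 100
col_null_pct = (toy.isnull().mean() * 100).round(2)

# Percentage of nulls per row: the same idea along axis=1
row_null_pct = (toy.isnull().mean(axis=1) * 100).round(2)

print(col_null_pct)
print(row_null_pct)
```

Taking the mean along the relevant axis keeps the denominator in step with the dataframe's actual shape.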

## Step 2.1: Check for Duplicates

In [None]:
# Work on a real copy so the original bikedata dataframe is left untouched
bike_dp = bikedata.copy()

# Checking for duplicates and dropping the entire duplicate row if any
bike_dp.drop_duplicates(subset=None, inplace=True)

In [None]:
bike_dp.shape

### The shape is unchanged after dropping duplicates, so we conclude that the dataset contained no duplicate rows.
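
A non-destructive way to run the same check is to count duplicates before dropping anything. The sketch below uses a hypothetical toy frame (`toy` is not part of the notebook's data) to show the idea:

```python
import pandas as pd

# Hypothetical toy frame: rows 0 and 1 are identical
toy = pd.DataFrame({"x": [1, 1, 2], "y": ["a", "a", "b"]})

# duplicated() flags every repeat of an earlier row
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates() returns a new frame; toy itself is not modified
deduped = toy.drop_duplicates()

print(n_duplicates, toy.shape, deduped.shape)
```

If the duplicate count is zero, the shapes before and after will match, which is the comparison made above.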

## Step 2.2: Dropping unwanted columns

The following variables can be removed from further analysis:

- instant: It is just an index value.
- dteday: This holds the date; it is redundant because we already have separate columns for year ('yr') and month ('mnth').
- casual & registered: These columns contain the counts of bikes booked by different categories of customers. Since the model targets the total count 'cnt', these two columns are ignored.

Saving the new dataframe as bikedata_new.

In [None]:
bikedata.columns

In [None]:
# .copy() avoids chained-assignment warnings when columns are modified later
bikedata_new=bikedata[['season','yr','mnth','holiday','weekday','workingday','weathersit','temp','atemp','hum','windspeed','cnt']].copy()
bikedata_new.info()

#### Converting the 4 categorical variables (season, weathersit, mnth, weekday) to the 'category' data type.

In [None]:
bikedata_new['season']=bikedata_new['season'].astype('category')
bikedata_new['weathersit']=bikedata_new['weathersit'].astype('category')
bikedata_new['mnth']=bikedata_new['mnth'].astype('category')
bikedata_new['weekday']=bikedata_new['weekday'].astype('category')

bikedata_new.info()

## Creating Dummy Variables

In [None]:
# Dropping the original variable for which dummies were created
# Dropping the first dummy variable of each set of dummies created

# dtype=int keeps the dummies numeric (recent pandas versions default to bool)
bikedata_new=pd.get_dummies(bikedata_new,drop_first=True,dtype=int)

bikedata_new.info()

In [None]:
bikedata_new.shape
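
To make the effect of `drop_first=True` concrete, here is a small sketch on a hypothetical toy frame (the values are invented; only the column names mirror the notebook). A four-level category yields three dummy columns, and the dropped first level becomes the baseline encoded as all zeros:

```python
import pandas as pd

# Hypothetical toy frame with a four-level categorical column
toy = pd.DataFrame({
    "season": pd.Categorical([1, 2, 3, 4, 1]),
    "cnt": [10, 20, 30, 40, 50],
})

# Only the categorical column is encoded; season_1 is dropped as the baseline
dummies = pd.get_dummies(toy, drop_first=True, dtype=int)

print(dummies)
```

Rows with `season == 1` have zeros in `season_2`, `season_3` and `season_4`, so every dummy coefficient is read relative to that baseline level.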

# SPLITTING THE DATA

### Splitting the data into Train and Test
- Split the data into TRAIN and TEST sets (70:30 ratio).
- We will use the train_test_split method from the sklearn package for this.

In [None]:
# Checking the shape before splitting

bikedata_new.shape

In [None]:
# Checking the info before splitting

bikedata_new.info()


In [None]:
from sklearn.model_selection import train_test_split

# Specifying 'random_state' so that the train and test sets always contain the same rows

np.random.seed(0)
df_train, df_test = train_test_split(bikedata_new, train_size = 0.70, test_size = 0.30, random_state = 333)


In [None]:
df_train.info()

In [None]:
df_train.shape

In [None]:
df_test.info()

In [None]:
df_test.shape

# EXPLORATORY DATA ANALYSIS

* Performing EDA on df_train dataset

### Making a pairplot of all the numeric variables

In [None]:
df_train.columns

In [None]:
# Creating a new dataframe with the numeric variables

bikenum=df_train[[ 'temp', 'atemp', 'hum', 'windspeed','cnt']]

sns.pairplot(bikenum,diag_kind='kde')
plt.show()

#### From the above pairplot, there is a linear relationship between temp, atemp and cnt.

### Building boxplot of all categorical variables against target variable

In [None]:
plt.figure(figsize=(25, 10))

plt.subplot(2,3,1)
sns.boxplot(x = 'season', y = 'cnt', data = bikedata)
plt.subplot(2,3,2)
sns.boxplot(x = 'mnth', y = 'cnt', data = bikedata)
plt.subplot(2,3,3)
sns.boxplot(x = 'weathersit', y = 'cnt', data = bikedata)
plt.subplot(2,3,4)
sns.boxplot(x = 'holiday', y = 'cnt', data = bikedata)
plt.subplot(2,3,5)
sns.boxplot(x = 'weekday', y = 'cnt', data = bikedata)
plt.subplot(2,3,6)
sns.boxplot(x = 'workingday', y = 'cnt', data = bikedata)

plt.show()

Observation: There are 6 categorical variables in the dataset; box plots were used to observe their effect on the target variable ('cnt').

- season: Almost 32% of bike bookings occurred in season 3, with a median of over 5000 bookings (over the 2-year period). Seasons 2 and 4 followed with 27% and 25% of total bookings. This indicates that season can be a good predictor for the dependent variable.
- mnth: About 10% of bike bookings occurred in each of months 5, 6, 7, 8 and 9, with a median of over 4000 bookings per month. This indicates that mnth shows a trend in bookings and can be a good predictor for the dependent variable.
- weathersit: Almost 67% of bike bookings occurred during weathersit 1, with a median close to 5000 bookings (over the 2-year period). weathersit 2 followed with 30% of total bookings. This indicates that weathersit shows a trend in bike bookings and can be a good predictor for the dependent variable.
- holiday: Almost 97.6% of bike bookings occurred on non-holidays. The data is heavily skewed toward non-holidays, which indicates that holiday is unlikely to be a good predictor for the dependent variable.
- weekday: The weekday variable shows a very similar share on every day of the week (between 13.5% and 14.8% of total bookings), with medians between 4000 and 5000 bookings. This variable may have little or no influence on the target, so I will let the model decide whether it should be included.
- workingday: Almost 69% of bike bookings occurred on working days, with a median close to 5000 bookings (over the 2-year period). This indicates that workingday can be a good predictor for the dependent variable.
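
The percentage shares quoted above are not computed in any cell shown here. One way such shares and medians could be derived is sketched below on a hypothetical toy frame (the numbers are invented, so they will not reproduce the figures in the observations):

```python
import pandas as pd

# Hypothetical toy frame standing in for the bike data
toy = pd.DataFrame({"season": [1, 1, 2, 2], "cnt": [10, 30, 20, 40]})

# Share of total bookings contributed by each category, in percent
share_pct = (toy.groupby("season")["cnt"].sum() / toy["cnt"].sum() * 100).round(1)

# Median bookings per category (the line inside each box of a boxplot)
median_cnt = toy.groupby("season")["cnt"].median()

print(share_pct)
print(median_cnt)
```

Replacing `"season"` with any other categorical column gives the corresponding breakdown for that variable.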

# Correlation Matrix

In [None]:
# checking the correlation coefficients -which variables are highly correlated. 


plt.figure(figsize = (25,20))
sns.heatmap(bikedata_new.corr(), annot = True, cmap="YlGnBu")
plt.show()

Observation: 
- The heatmap shows which variables are correlated with one another (potential multicollinearity) and which variables are highly correlated with the target variable.
- We will validate these correlations against VIF and p-values when identifying which variables to select for or eliminate from the model.

# RESCALING 

In [None]:
from sklearn.preprocessing import MinMaxScaler

In [None]:
scale = MinMaxScaler()

In [None]:
# Before scaling, check the values
df_train.head()

In [None]:
df_train.columns

In [None]:
# Applying scaler() to all the numeric variables

num_vars = ['temp', 'atemp', 'hum', 'windspeed','cnt']

df_train[num_vars] = scale.fit_transform(df_train[num_vars])

In [None]:
df_train.head()

In [None]:
df_train.describe()
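
For reference, the arithmetic behind min-max scaling can be sketched in plain NumPy. This is an illustration with made-up values, not a replacement for `MinMaxScaler`: the minimum and maximum are learned from the training values only and then reused for the test values.

```python
import numpy as np

train = np.array([10.0, 20.0, 30.0])   # hypothetical training values
test = np.array([15.0, 40.0])          # hypothetical test values

# "fit": learn the minimum and maximum from the training data only
lo, hi = train.min(), train.max()

# "transform": apply x' = (x - min) / (max - min) with the training min/max
train_scaled = (train - lo) / (hi - lo)
test_scaled = (test - lo) / (hi - lo)

print(train_scaled)
print(test_scaled)
```

Training values always land in [0, 1], while a test value beyond the training range (40 here) scales to a number outside that interval. This is expected when the scaler is fitted on the training set alone.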

# Building Linear Model

In [None]:
y_train = df_train.pop('cnt')
X_train = df_train

# RFE

In [None]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

In [None]:
# RFE with the number of output variables set to 15
lin = LinearRegression()
lin.fit(X_train, y_train)

# n_features_to_select must be passed by keyword in recent scikit-learn versions
rfe = RFE(lin, n_features_to_select=15)             #RFE
rfe = rfe.fit(X_train, y_train)


In [None]:
list(zip(X_train.columns,rfe.support_,rfe.ranking_))

In [None]:
col = X_train.columns[rfe.support_]
col

In [None]:
X_train.columns[~rfe.support_]

In [None]:
# Creating X_train dataframe with RFE selected variables
X_train_rfe = X_train[col]

## Linear Model using statsmodels

## Model 1

### Observing VIF


In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Creating a dataframe that will contain the names of all the feature variables and their respective VIFs
vif = pd.DataFrame()
vif['Features'] = X_train_rfe.columns
vif['VIF'] = [variance_inflation_factor(X_train_rfe.values, i) for i in range(X_train_rfe.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif
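
The VIF of a feature is 1 / (1 - R^2), where R^2 comes from regressing that feature on all the other features. The NumPy sketch below illustrates the formula on synthetic data. The helper name `vif_with_intercept` and the data are assumptions made for illustration. Note that it adds an intercept to each auxiliary regression, whereas `variance_inflation_factor` regresses on the columns exactly as passed, so its numbers can differ from the table above when no constant column is included.

```python
import numpy as np

def vif_with_intercept(X):
    """Return the VIF of each column of X, using auxiliary regressions with an intercept."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    vifs = []
    for i in range(k):
        y = X[:, i]
        others = np.delete(X, i, axis=1)
        A = np.column_stack([np.ones(n), others])      # intercept + remaining features
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)   # least-squares fit
        resid = y - A @ coef
        r2 = 1.0 - resid.var() / y.var()               # R^2 of the auxiliary regression
        vifs.append(1.0 / (1.0 - r2))
    return vifs

# Synthetic data: b is almost a copy of a, c is independent of both
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = a + 0.1 * rng.normal(size=200)
c = rng.normal(size=200)

vifs = vif_with_intercept(np.column_stack([a, b, c]))
print([round(v, 2) for v in vifs])
```

The two nearly collinear columns receive large VIFs while the independent column stays close to 1, which is the pattern the elimination steps below look for.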

In [None]:
import statsmodels.api as sm

# Add a constant
X_train_lm1 = sm.add_constant(X_train_rfe)

# Create a first fitted model
linr1 = sm.OLS(y_train, X_train_lm1).fit()

In [None]:
# Checking the parameters obtained

linr1.params



In [None]:
# Summary of the linear regression model obtained
print(linr1.summary())

#### Note: 'atemp' has a high p-value and a high VIF, so we drop 'atemp'.

## Model 2

In [None]:
X_train_new = X_train_rfe.drop(["atemp"], axis = 1)

#### Checking VIF

In [None]:
# Checking for the VIF values of the feature variables. 
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Create a dataframe that will contain the names of all the feature variables and their respective VIFs
vif = pd.DataFrame()
vif['Features'] = X_train_new.columns
vif['VIF'] = [variance_inflation_factor(X_train_new.values, i) for i in range(X_train_new.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

In [None]:
# Add a constant
X_train_lm2 = sm.add_constant(X_train_new)

# Create the second fitted model
linr2 = sm.OLS(y_train, X_train_lm2).fit()

In [None]:
# Checking the parameters obtained

linr2.params

In [None]:

# summary of the linear regression model obtained
print(linr2.summary())

#### Note: Dropping 'hum', as it has the second-highest VIF. 'temp' is retained because temperature can be an important factor.

## Model 3

In [None]:
X_train_new = X_train_new.drop(["hum"], axis = 1)

#### Checking VIF

In [None]:
# Check for the VIF values of the feature variables. 
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Create a dataframe that will contain the names of all the feature variables and their respective VIFs
vif = pd.DataFrame()
vif['Features'] = X_train_new.columns
vif['VIF'] = [variance_inflation_factor(X_train_new.values, i) for i in range(X_train_new.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

In [None]:
# Add a constant
X_train_lm3 = sm.add_constant(X_train_new)

# Create the third fitted model
linr3 = sm.OLS(y_train, X_train_lm3).fit()


In [None]:
# Checking the parameters obtained

linr3.params

In [None]:
# Summary of the linear regression model obtained
print(linr3.summary())

#### Note: Dropping 'season_2', as it has the second-highest VIF. 'temp' is retained because temperature can be an important factor.

## Model 4

In [None]:
X_train_new = X_train_new.drop(["season_2"], axis = 1)

#### VIF Checking

In [None]:
# Check for the VIF values of the feature variables. 
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Create a dataframe that will contain the names of all the feature variables and their respective VIFs
vif = pd.DataFrame()
vif['Features'] = X_train_new.columns
vif['VIF'] = [variance_inflation_factor(X_train_new.values, i) for i in range(X_train_new.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

In [None]:
# Add a constant
X_train_lm4 = sm.add_constant(X_train_new)

# Create the fourth fitted model
linr4 = sm.OLS(y_train, X_train_lm4).fit()

In [None]:
# Checking the parameters obtained

linr4.params

In [None]:
# summary of the linear regression model obtained
print(linr4.summary())

#### Note: Dropping 'mnth_10', as it has a high p-value.

## Model 5

In [None]:
X_train_new = X_train_new.drop(["mnth_10"], axis = 1)

#### VIF Checking

In [None]:
# Check for the VIF values of the feature variables. 
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Create a dataframe that will contain the names of all the feature variables and their respective VIFs
vif = pd.DataFrame()
vif['Features'] = X_train_new.columns
vif['VIF'] = [variance_inflation_factor(X_train_new.values, i) for i in range(X_train_new.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

In [None]:
# Add a constant
X_train_lm5 = sm.add_constant(X_train_new)

# Create the fifth fitted model
linr5 = sm.OLS(y_train, X_train_lm5).fit()

In [None]:
# Checking the parameters obtained

linr5.params

In [None]:
# summary of the linear regression model obtained
print(linr5.summary())

#### Note: Dropping 'mnth_8', as it has a high p-value.

## Model 6

In [None]:
X_train_new = X_train_new.drop(["mnth_8"], axis = 1)

#### VIF Checking

In [None]:
# Check for the VIF values of the feature variables. 
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Create a dataframe that will contain the names of all the feature variables and their respective VIFs
vif = pd.DataFrame()
vif['Features'] = X_train_new.columns
vif['VIF'] = [variance_inflation_factor(X_train_new.values, i) for i in range(X_train_new.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

In [None]:
# Add a constant
X_train_lm6 = sm.add_constant(X_train_new)

# Create the sixth fitted model
linr6 = sm.OLS(y_train, X_train_lm6).fit()

In [None]:
# Checking the parameters obtained

linr6.params

In [None]:
# summary of the linear regression model obtained
print(linr6.summary())

##### Note: We will consider this our final model (unless the test data metrics turn out not to be close to the training metrics). Multicollinearity between the predictors appears to be very low, and the p-values of all the predictors appear to be significant.

# Final Model Interpretation

## Hypothesis Testing:

#### Hypothesis testing states that:
##### $H_0: \beta_1 = \beta_2 = \dots = \beta_n = 0$
##### $H_1:$ at least one $\beta_i \neq 0$

### linr6 model coefficient values

* const           0.082570
* yr              0.231870
* temp            0.580437
* windspeed      -0.152419
* season_4        0.129183
* mnth_3          0.065965
* mnth_4          0.083857
* mnth_5          0.072684
* mnth_6          0.061821
* mnth_9          0.090881


### Conclusion: None of the above coefficients is equal to zero and, per the model summary, their p-values are significant, so we reject the null hypothesis.

## F Statistics
The F-statistic is used to test the overall significance of the model: the higher the F-statistic, the more significant the model.

F-statistic: 194.5
Prob (F-statistic): 4.74e-165
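
For reference, in a linear regression with an intercept, $p$ predictors and $n$ observations, the overall F-statistic is tied to $R^2$ by a standard identity (stated here as background, not taken from the notebook output):

$$F = \frac{R^2 / p}{(1 - R^2) / (n - p - 1)}$$

Under $H_0$ this quantity follows an F distribution with $p$ and $n - p - 1$ degrees of freedom, and Prob (F-statistic) is the corresponding tail probability. A value as small as the one above therefore supports rejecting $H_0$.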

# VALIDATE ASSUMPTIONS

### Error terms are normally distributed with mean zero (not X, Y)
Residual Analysis Of Training Data

In [None]:
y_train_pred = linr6.predict(X_train_lm6)

In [None]:
res = y_train-y_train_pred

# Plot the histogram of the error terms
fig = plt.figure()
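# histplot replaces the deprecated distplot; kde=True overlays the density curve
sns.histplot(res, bins = 20, kde = True, stat = "density")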
fig.suptitle('Error Terms', fontsize = 20)                   
plt.xlabel('Errors', fontsize = 18)                         

#### From the above histogram, the residuals are approximately normally distributed. Hence this assumption of linear regression is valid.

### There is a linear relationship between X and Y

In [None]:
# Reusing bikenum, the numeric-variable dataframe created during EDA

sns.pairplot(bikenum, diag_kind='kde')
plt.show()

### From the above pair plot, we see that there is a linear relationship between temp and atemp and the target variable 'cnt'.

## There is No Multicollinearity between the predictor variables

In [None]:
# Check for the VIF values of the feature variables. 
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Create a dataframe that will contain the names of all the feature variables and their respective VIFs
vif = pd.DataFrame()
vif['Features'] = X_train_new.columns
vif['VIF'] = [variance_inflation_factor(X_train_new.values, i) for i in range(X_train_new.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

### Note: From the VIF calculation, we conclude that there is no multicollinearity between the predictor variables, as all the values are within the permissible range of below 5.

# MAKING PREDICTION USING FINAL MODEL

#### Now that we have fitted the model and checked the assumptions, it's time to go ahead and make predictions using the final model (linr6)

Applying the scaling on the test sets

In [None]:
# Apply the scaler to all numeric variables in the test dataset. Note: we only call scale.transform here,
# because the min/max values learned from the training data must also be applied to the test data.


num_vars = ['temp', 'atemp', 'hum', 'windspeed','cnt']

df_test[num_vars] = scale.transform(df_test[num_vars])

In [None]:
df_test.head()

In [None]:
df_test.describe()

#### Dividing  X_test and y_test

In [None]:
y_test = df_test.pop('cnt')
X_test = df_test

X_test.info()




In [None]:
#Selecting the variables that were part of final model.
col1=X_train_new.columns

X_test=X_test[col1]

# Adding constant variable to test dataframe
X_test_lm6 = sm.add_constant(X_test)

X_test_lm6.info()

In [None]:
# Making predictions using the final model (linr6)

y_pred = linr6.predict(X_test_lm6)

## MODEL EVALUATION

In [None]:
# Plotting y_test and y_pred to understand the spread
# import matplotlib.pyplot as plt
# import numpy as np


fig = plt.figure()
plt.scatter(y_test, y_pred, alpha=.5)
fig.suptitle('y_test vs y_pred', fontsize = 20)              # Plot heading 
plt.xlabel('y_test', fontsize = 18)                          # X-label
plt.ylabel('y_pred', fontsize = 16)      

### R^2 Value for TEST

In [None]:
from sklearn.metrics import r2_score
r2_score(y_test, y_pred)

## Adjusted R^2 Value for TEST
Formula for Adjusted R^2

$$R^2_{adj} = 1 - (1 - R^2)\,\frac{n-1}{n-p-1}$$

In [None]:
# Compute the value rather than hard-coding it (0.7768676080828206 in the original run)
r2 = r2_score(y_test, y_pred)

In [None]:
# Get the shape of X_test

X_test.shape

In [None]:
# n is number of rows in X

n = X_test.shape[0]


# Number of features (predictors, p) is the shape along axis 1
p = X_test.shape[1]

# We find the Adjusted R-squared using the formula

adjusted_r2 = 1-(1-r2)*(n-1)/(n-p-1)
adjusted_r2
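
The same calculation can be wrapped in a small helper so that it is not tied to global variables. The function name `adjusted_r2_score` and the example inputs below are hypothetical and chosen only to show the arithmetic:

```python
def adjusted_r2_score(r2, n, p):
    """Adjusted R^2 for a model with p predictors fitted on n observations."""
    if n - p - 1 <= 0:
        raise ValueError("adjusted R^2 needs n > p + 1")
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Hypothetical inputs: R^2 = 0.80 on 200 rows with 9 predictors
example = adjusted_r2_score(r2=0.80, n=200, p=9)
print(round(example, 4))
```

The penalty factor (n - 1) / (n - p - 1) grows with the number of predictors, so adjusted R^2 is always at or below plain R^2 and drops further when predictors are added without improving the fit.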

# Final Result 

#### Train R^2: 0.77986 - Train Adjusted R^2: 0.76614
##### The R^2 on the test set (about 0.777) is close to these values, which suggests that the model generalizes well to unseen data.

# FINAL REPORT

##### As per our final model, the top 3 predictor variables that influence bike bookings are:
- Temperature (temp): A coefficient of 0.580437 indicates that a unit increase in the temp variable increases bike hire numbers by 0.580437 units.
- Weather Situation 3 (weathersit_3): A coefficient of -0.276877 indicates that, relative to weathersit_1, a unit increase in the weathersit_3 variable decreases bike hire numbers by 0.276877 units.
- Year (yr): A coefficient of 0.231870 indicates that a unit increase in the yr variable increases bike hire numbers by 0.231870 units.

##### SO IT IS RECOMMENDED TO GIVE THESE VARIABLES UTMOST IMPORTANCE WHILE PLANNING, TO ACHIEVE MAXIMUM BOOKING.
 
##### The next best features that can also be considered are:
- season_4: A coefficient of 0.129183 indicates that, relative to season_1, a unit increase in the season_4 variable increases bike hire numbers by 0.129183 units.
- windspeed: A coefficient of -0.152419 indicates that a unit increase in the windspeed variable decreases bike hire numbers by 0.152419 units.

* NOTE:
  - The details of weathersit_1 & weathersit_3
    - weathersit_1: Clear, Few clouds, Partly cloudy
    - weathersit_3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
  - The details of season_1 & season_4
    - season_1: spring
    - season_4: winter
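
Because `temp` and `cnt` were both min-max scaled before fitting, the "units" in the coefficient interpretations above are scaled units: a change of 1 in scaled `temp` spans the full training range of temperature. The sketch below shows how a scaled slope could be converted back to original units. The helper name `to_original_units` and the ranges are hypothetical placeholders, since the actual training minima and maxima are not shown in this notebook.

```python
def to_original_units(scaled_coef, x_min, x_max, y_min, y_max):
    """Convert a slope fitted on min-max scaled x and y back to original units."""
    return scaled_coef * (y_max - y_min) / (x_max - x_min)

temp_coef_scaled = 0.580437  # temp coefficient from the final model

# Hypothetical ranges, for illustration only
slope = to_original_units(temp_coef_scaled, x_min=0.0, x_max=40.0, y_min=0.0, y_max=8000.0)
print(round(slope, 4))
```

With these placeholder ranges, each additional degree of temperature would correspond to roughly 116 more bookings. Substituting the real training ranges (available from the fitted scaler's `data_min_` and `data_max_` attributes) gives the actual figure.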