## <font color="green">This notebook is shipped in a compressed folder. Please execute all the cells below on your local machine to check the EDA and model-building results.</font>

# Problem Statement
- A bike-sharing system is a service in which bikes are made available for shared use to individuals on a short-term basis, for a price or for free. Many bike-share systems allow people to borrow a bike from a "dock", which is usually computer-controlled: the user enters the payment information and the system unlocks the bike. The bike can then be returned to another dock belonging to the same system.


- A US bike-sharing provider, BoomBikes, has recently suffered a considerable dip in revenue due to the ongoing COVID-19 pandemic. The company is finding it very difficult to sustain itself in the current market scenario. So, it has decided to come up with a mindful business plan to accelerate its revenue as soon as the ongoing lockdown comes to an end and the economy returns to a healthy state.


- In such an attempt, BoomBikes aspires to understand the demand for shared bikes once the nationwide COVID-19 quarantine ends. It wants to be ready to cater to people's needs when the situation improves, to stand out from other service providers, and to make huge profits.


- They have contracted a consulting company to understand the factors on which the demand for these shared bikes depends. Specifically, they want to understand the factors affecting the demand for these shared bikes in the American market. 


- The company wants to know which variables are significant in predicting the demand for shared bikes, and how well those variables describe the bike demand. Based on various meteorological surveys and people's styles, the service provider has gathered a large dataset on daily bike demand across the American market.



<img src="image1.jfif">

## 1. Importing Required Libraries

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import sklearn

In [None]:
#Ignore warnings
import warnings
warnings.filterwarnings('ignore')

## 2. Reading the day.csv data 


In [None]:
#Reading data from day.csv 
data=pd.read_csv('day.csv')

# Displaying the first 5 rows of data
data.head()

In [None]:
# Distribution of the target column cnt (histplot replaces the deprecated distplot)
sns.histplot(data['cnt'],kde=True)

## 3. Data Description

In [None]:
#Displaying the number of rows and columns in the data
data.shape

In [None]:
#Displaying data types along with the non-null count of all columns
data.info()

## Inferences: 
    - Null values are not present in the data
    - Some of the categorical columns [season, yr, mnth, holiday, weekday, workingday, weathersit] are stored with datatype int, which we need to convert to object in further steps
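
One way to spot integer columns that are really categories is to look for a small number of distinct values. The sketch below is only an illustration on a made-up miniature frame (the values and the helper name `low_cardinality_int_columns` are assumptions, not part of day.csv); on the real data a threshold of 12 would be needed to catch `mnth`.

```python
import pandas as pd

# Hypothetical miniature of day.csv; the real file has more rows and columns.
df = pd.DataFrame({
    "season": [1, 1, 2, 3, 4, 4],
    "weathersit": [1, 2, 1, 3, 2, 1],
    "temp": [9.1, 14.8, 20.3, 27.5, 12.2, 6.4],
    "cnt": [985, 1600, 3400, 5200, 2100, 1300],
})

def low_cardinality_int_columns(frame, max_levels=4):
    """Return integer columns with at most `max_levels` distinct values."""
    return [
        col for col in frame.columns
        if pd.api.types.is_integer_dtype(frame[col])
        and frame[col].nunique() <= max_levels
    ]

candidates = low_cardinality_int_columns(df)
print(candidates)
```

The result is only a list of candidates; the data dictionary remains the authority on which columns are categorical.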

In [None]:
#Summary of the dataset
data.describe()

## 4. Inspecting Missing Values

In [None]:
data.isnull().sum()

## Inferences:

    - To be confident, we ran isnull().sum(), which confirms that we don't have any null values

=========================================
Dataset characteristics
=========================================	
day.csv has the following fields:
	
	- instant: record index
	- dteday : date
	- season : season (1:spring, 2:summer, 3:fall, 4:winter)
	- yr : year (0: 2018, 1:2019)
	- mnth : month ( 1 to 12)
	- holiday : whether the day is a holiday or not (extracted from http://dchr.dc.gov/page/holiday-schedule)
	- weekday : day of the week
	- workingday : if day is neither weekend nor holiday is 1, otherwise is 0.
	- weathersit : 
		- 1: Clear, Few clouds, Partly cloudy
		- 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
		- 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
		- 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
	- temp : temperature in Celsius
	- atemp: feeling temperature in Celsius
	- hum: humidity
	- windspeed: wind speed
	- casual: count of casual users
	- registered: count of registered users
	- cnt: count of total rental bikes including both casual and registered

## 5. Treating unnecessary columns

In [None]:
data.head()

In [None]:
# removing instant column from the dataset since its a record index and doesn't help in our analysis
data.drop('instant',axis=1,inplace=True)
data.head()

In [None]:
# Confirming the columns after removal of 'instant' column
data.head()

## Let's not consider dteday (date) in this analysis, since the date information is already captured by the yr, mnth and weekday columns. Keeping it may lead to multicollinearity

In [None]:
# Dropping the dteday column by name, which does not depend on the column order
data=data.drop(columns=['dteday'])
data.head()

## Assumption 1: Verify that Data has linear Relationship

In [None]:
# Pairplot to understand the relationship between variables in the data 
sns.pairplot(data)

# Insights:

    - We observed a linear relationship between the temp and cnt columns, which is a good sign for our prediction
    - We observed that temp and atemp have a high degree of positive correlation, forming almost a straight line, which indicates a strong linear relationship. Dropping one of the two columns is good practice to avoid multicollinearity
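
The visual impression from the pairplot can be backed by numbers. The sketch below lists column pairs whose absolute Pearson correlation exceeds a threshold; it runs on synthetic data, and the helper name `correlated_pairs` and the 0.9 threshold are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
temp = rng.uniform(0, 35, size=200)
demo = pd.DataFrame({
    "temp": temp,
    "atemp": 1.1 * temp + rng.normal(0, 0.5, size=200),  # nearly a linear copy of temp
    "windspeed": rng.uniform(0, 30, size=200),           # generated independently of temp
})

def correlated_pairs(frame, threshold=0.9):
    """List (col_a, col_b, r) for column pairs whose |Pearson r| exceeds threshold."""
    corr = frame.corr()
    cols = list(corr.columns)
    pairs = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            r = corr.loc[a, b]
            if abs(r) > threshold:
                pairs.append((a, b, round(float(r), 3)))
    return pairs

pairs = correlated_pairs(demo)
print(pairs)
```

On the real data, the same function can be called on the numeric columns of `data` to confirm which pair to thin out.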

## Visualise the categorical columns

In [None]:
## Visualise the categorical columns against cnt

# (column, label) pairs; x and y are passed by keyword, as recent seaborn versions require
categorical_plots=[('season','Season'),('yr','year'),('mnth','month'),('holiday','holiday'),
                   ('weekday','weekday'),('workingday','workingday'),('weathersit','weathersit')]

plt.figure(figsize=(15,15))
for position,(column,label) in enumerate(categorical_plots,start=1):
    plt.subplot(3,3,position)
    plt.title('Distribution of '+label+' vs cnt')
    sns.boxplot(x=column,y='cnt',data=data)
plt.show()

## 6. Multicollinearity Treatment

# Treating temp and atemp Features
    - temp : temperature in Celsius
    - atemp: feeling temperature in Celsius
    

In [None]:
# Lets understand correlations of numerical columns

plt.figure(figsize=(20,8))
sns.heatmap(data.corr(),cmap='gray',annot=True)

# Insights

    - We could observe that temp and atemp are highly correlated, so we can drop one of the columns to avoid multicollinearity
    - We observed a similar relationship in the pairplot visualisation

In [None]:
# Drop atemp column from the dataset
data.drop(['atemp'],axis=1,inplace=True)

In [None]:
#confirming that atemp column has been dropped by using dataframe.head()
data.head()

## Investigating where datatype is integer

    - casual: count of casual users
    - registered: count of registered users
    - cnt: count of total rental bikes including both casual and registered
    
### Since cnt (count of total rental bikes) is equal to the sum of the casual and registered columns, we keep the cnt column and drop casual and registered so that we avoid multicollinearity in the data

In [None]:
#since casual and registered sum is equal to cnt we can drop them to avoid multicollinearity

data.drop(['casual','registered'],axis=1,inplace=True)

In [None]:
#Confirming that the selected  columns are dropped
data.head()

## Assumption: The independent variables should not be highly correlated with each other; when they are, the model suffers from multicollinearity.

    - We have removed atemp since this column (independent variable) is correlated with the temp variable
    - We have removed the casual and registered columns since the sum of these two features equals the cnt column, which is our target column, and keeping them may mislead our model
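
The claim that the target is an exact sum of two other columns is easy to verify before dropping them. This is a minimal sketch on a made-up three-row frame (the numbers are illustrative only); on the real data the same expression would be evaluated on `data` before the drop.

```python
import pandas as pd

# Illustrative values only, not taken from day.csv.
rentals = pd.DataFrame({
    "casual": [331, 131, 120],
    "registered": [654, 670, 1229],
})
rentals["cnt"] = rentals["casual"] + rentals["registered"]

# True only if every row satisfies cnt == casual + registered.
is_exact_sum = bool((rentals["casual"] + rentals["registered"]).eq(rentals["cnt"]).all())
print(is_exact_sum)
```

If the check holds, keeping casual and registered as features would let the model reconstruct the target directly, which is why they are removed.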

In [None]:
#correlation between numerical columns

numerical_columnscorr=data[['temp','hum','windspeed','cnt']]

plt.figure(figsize=(20,8))
sns.heatmap(numerical_columnscorr.corr(),cmap='gray',annot=True)
plt.show()

## Inferences:

    - cnt(Target) column has highest correlation with temp column

## Treating the season column, which is encoded as 1, 2, 3, 4 in the given dataframe

In [None]:
# Function to convert season column to their season names
def func_season(x):
    if x==1:
        return 'spring'
    elif x==2:
        return 'summer'
    elif x==3:
        return 'fall'
    elif x==4:
        return 'winter'

In [None]:
# Apply above function on season column
data['season']=data['season'].apply(func_season)

In [None]:
# confirming the change made in the previous step
data.head()

## Treating the mnth (month) column, which is encoded as 1 to 12 in calendar order

In [None]:
# Treat mnth(month) column which was given as encoded format in dataset as 1 to 12

def func_mnth(x):
    return x.map({1:'jan',2:'feb',3:'mar',4:'apr',5:'may',6:'june',7:'july',8:'aug',9:'sep',10:'oct',11:'nov',12:'dec'})

# convert month column with the name of months which was given as 1-12 in our dataset
data[['mnth']]=data[['mnth']].apply(func_mnth)

In [None]:
data.head()

## Univariate Analysis

## Lets treat categorical columns

In [None]:
# create a univariate_analysis function which plots the count of each category in a feature

def univariate_analysis(feature):
    plt.figure(figsize=(10,6))
    plt.title('Count of different categories in '+feature)
    sns.countplot(x=feature,data=data)   # x is passed by keyword, as recent seaborn versions require
    plt.show()

In [None]:
# univariate analysis on Season column

univariate_analysis('season')

In [None]:
# univariate analysis on mnth(month) column
univariate_analysis('mnth')

In [None]:
# univariate analysis on holiday column
univariate_analysis('holiday')

In [None]:
# univariate analysis on weekday column
univariate_analysis('weekday')

In [None]:
# univariate analysis on workingday column
univariate_analysis('workingday')

In [None]:
# univariate analysis on weathersit column
univariate_analysis('weathersit')

## Multivariate Analysis

In [None]:
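# Let's understand correlations of numerical columns
# season and mnth now hold strings, so only the numeric columns are passed to corr()

plt.figure(figsize=(20,8))
sns.heatmap(data.select_dtypes(include='number').corr(),cmap='gray',annot=True)
plt.show()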

# Insights from heatmap

- cnt (target variable) is most strongly correlated with temp. atemp showed almost the same correlation and was already dropped above.
- windspeed is negatively correlated with cnt, which is expected: when wind speed is high, fewer people prefer to ride bikes. Further analysis is required in the next steps.

# Datatype Handling

In [None]:
#Verify the datatype of a column
data.info()

### We noticed from the data dictionary that the following columns are categorical, but their datatype in the dataset is int, so we need to treat them:
 
 - yr
 - holiday
 - weekday
 - workingday
 - weathersit

In [None]:
#Let's convert the columns that are stored as int but are categorical in nature to datatype object

data['yr']=data['yr'].astype('object')
data['holiday']=data['holiday'].astype('object')
data['weekday']=data['weekday'].astype('object')
data['workingday']=data['workingday'].astype('object')
data['weathersit']=data['weathersit'].astype('object')

In [None]:
#Verify the datatype of a column
data.info()

In [None]:
## We have converted the categorical columns to the required datatype. Let's proceed with further analysis

In [None]:
data.head()

In [None]:
#consider variable y as the target column, which is cnt (count) of bikes

y=data.iloc[:,-1:]
y

In [None]:
#dropping cnt column from the dataset since it has been taken as Target column
data.drop('cnt',axis=1,inplace=True)

## Categorical column Treatment in Feature/input variables

In [None]:
#categorical columns are formed in the new dataframe which is data_categorical
data_categorical=data.iloc[:,0:7]
data_categorical

In [None]:
#Non-Categorical columns are created in a new dataframe. i.e..,data_noncategorical

data_noncategorical=data.iloc[:,7:]
data_noncategorical

## Categorical columns are treated using dummy variable creation

In [None]:
# We use pandas' built-in get_dummies() method to encode the categorical columns.
# dtype=int keeps the dummies as 0/1 integers (recent pandas versions default to booleans).

data_categorical=pd.get_dummies(data_categorical,drop_first=True,dtype=int)
data_categorical

# Inferences:

    - We could observe that get_dummies() has been applied to the categorical columns
    - drop_first=True has dropped one category in each feature to avoid the dummy variable trap
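
To make the dummy variable trap concrete, the sketch below encodes a tiny made-up season column with and without `drop_first=True`. The frame `toy` is an assumption for illustration; `pd.get_dummies` and its `drop_first` and `dtype` parameters are the same pandas API used above.

```python
import pandas as pd

toy = pd.DataFrame({"season": ["spring", "summer", "fall", "winter", "fall"]})

full = pd.get_dummies(toy, dtype=int)                      # one column per category
reduced = pd.get_dummies(toy, drop_first=True, dtype=int)  # first category (alphabetically) dropped

print(full.columns.tolist())
print(reduced.columns.tolist())

# In the full encoding every row sums to 1, which duplicates the intercept column.
print(full.sum(axis=1).tolist())
```

With the first level removed, the dropped category (here `fall`) becomes the baseline against which the remaining coefficients are read.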

In [None]:
# Let's see the shape of the categorical and non-categorical columns in our input/feature matrix x
print(data_categorical.shape)
print(data_noncategorical.shape)

In [None]:
#let's combine the two dataframes data_categorical and data_noncategorical, since the categorical columns have now been treated

final_data=pd.concat([data_categorical,data_noncategorical],axis=1)
final_data

In [None]:

plt.figure(figsize=(20,8))
sns.heatmap(final_data.corr(),annot=True,cmap='gray')

In [None]:
y

In [None]:
final_data.head()

## Splitting the Data into Training and Testing Sets for model building process

In [None]:
# import train_test_split from sklearn.model_selection

from sklearn.model_selection import train_test_split

In [None]:
# Let's use 80 percent of our data as training data and 20 percent as test data
x_train,x_test,y_train,y_test=train_test_split(final_data,y,test_size=0.20,random_state=42)

In [None]:
# Verifying the shape after train and test split
print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)

## Scaling the Data

In [None]:
# Applying min max scaling
from sklearn.preprocessing import MinMaxScaler
mm=MinMaxScaler()

## We are going to apply scaling to the three continuous independent variables (temp, hum, windspeed)
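
A common pattern is to learn the scaling parameters from the training split only and reuse them on the test split, so that no information from the test rows reaches the model. The sketch below shows that pattern on synthetic data; the frame `features` and the column ranges are assumptions, while `MinMaxScaler`, `fit_transform` and `transform` are the scikit-learn calls already used in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(42)
features = pd.DataFrame({
    "temp": rng.uniform(2, 35, size=100),
    "hum": rng.uniform(20, 100, size=100),
    "windspeed": rng.uniform(1, 34, size=100),
})

train, test = train_test_split(features, test_size=0.20, random_state=42)
train, test = train.copy(), test.copy()

num_cols = ["temp", "hum", "windspeed"]
scaler = MinMaxScaler()
train[num_cols] = scaler.fit_transform(train[num_cols])  # learn min/max from the train rows only
test[num_cols] = scaler.transform(test[num_cols])        # reuse the train min/max on the test rows

print(train[num_cols].min().round(3).tolist(), train[num_cols].max().round(3).tolist())
```

Test values can fall slightly outside 0 to 1 with this approach, which is expected because the test rows did not take part in the fit.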

In [None]:
# Apply min-max scaling, which rescales each column to the range 0 to 1
# Note: x_train and x_test were split off before this step, so only final_data is rescaled here

final_data[['temp','hum','windspeed']]=mm.fit_transform(final_data[['temp','hum','windspeed']])
final_data.head()

In [None]:
# Look at the description to confirm that the scaled numerical columns now have min 0 and max 1
final_data.describe()

In [None]:
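# Apply scaling to the target column, i.e. y, the dependent variable
# A separate scaler is used so that the min/max learned for the features is not overwritten
y_scaler=MinMaxScaler()
y=y_scaler.fit_transform(y)
y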

In [None]:
final_data.head()

## Model Building

In [None]:
# import Linearregression model
from sklearn.linear_model import LinearRegression

In [None]:
#Instantiating the object
lr=LinearRegression()

# Fitting the model
lr.fit(x_train,y_train)

In [None]:
# Applying prediction
y_pred=lr.predict(x_test)
y_pred

## Intercept and coefficient of our model

In [None]:
# Intercept and coefficients of our model
# A multiple linear regression has one coefficient per feature, as shown in the output below, and one intercept, i.e. the predicted value when every feature is 0

print("Intercept of our model is: ",lr.intercept_)
print()
print("Coefficients of our model are: ",lr.coef_)

## Residual Analysis

## Lets see another analysis where error terms should be normally distributed

In [None]:
# y_train_pred is used to predict output on the training data
y_train_pred=lr.predict(x_train)

In [None]:
# Finding the residual, i.e. the difference between the actual value and the predicted value
residual=y_train-y_train_pred
residual

## Assumption: Error Terms should be normally distributed

In [None]:
# histplot is used to analyse the distribution of the residual terms (it replaces the deprecated distplot)

plt.title('Distribution of Error terms')
sns.histplot(np.ravel(residual),kde=True)
plt.show()

# Inferences:

    - We could see that the error terms are approximately normally distributed with a mean close to 0. Now let's evaluate our model
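
The histogram can be complemented with a few numbers. The sketch below fits an ordinary least squares model on synthetic data with NumPy and reports the residual mean, skewness and excess kurtosis; for normally distributed errors all three should be close to 0. The data and coefficients are made up for illustration and are not the notebook's model.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 2))
y = 3.0 + X @ np.array([2.0, -1.0]) + rng.normal(scale=0.5, size=300)

# Ordinary least squares with an explicit intercept column.
A = np.column_stack([np.ones(len(X)), X])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
residual = y - A @ beta

z = (residual - residual.mean()) / residual.std()
skewness = float(np.mean(z ** 3))             # 0 for a symmetric distribution
excess_kurtosis = float(np.mean(z ** 4) - 3)  # 0 for a normal distribution

print(round(float(residual.mean()), 6), round(skewness, 3), round(excess_kurtosis, 3))
```

A residual mean of essentially 0 is guaranteed whenever the model includes an intercept, so skewness and kurtosis carry the actual information about the shape.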

## Model Evaluation

In [None]:
# Import r2_score from sklearn.metrics
from sklearn.metrics import r2_score

In [None]:
# R-Squared score 

r2_score(y_test,y_pred)

In [None]:
# metrics evaluation

from sklearn.metrics import mean_squared_error,mean_absolute_error
print("Mean squared Error: ",mean_squared_error(y_test,y_pred))
print("Mean Absolute Error: ",mean_absolute_error(y_test,y_pred))
print("Root mean squared Error: ",(np.sqrt(mean_squared_error(y_test,y_pred))))


## Inferences:

    - It is good practice to look at the Mean Squared Error and Mean Absolute Error alongside the R-squared score.

In [None]:
# Plotting y_test and y_pred using a scatter plot to understand the spread of the target variable

plt.figure(figsize=(10,6))
plt.title('y_test vs y_pred')
plt.scatter(y_test,y_pred)
plt.xlabel('y_test', fontsize=15)                          
plt.ylabel('y_pred', fontsize=15)
plt.show()

## Inference:

        - y_test and y_pred are linearly related as shown above, which indicates the model has performed well in predicting on test data

## Let's look at R2 and Adjusted R2; the latter is a good metric for multiple linear regression analysis


### r2 score on train data

In [None]:
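#r2 score on train data
yhat=np.ravel(lr.predict(x_train))                      # yhat is the predicted value
y_true=np.ravel(y_train)                                # actual values as a flat array
RSS=np.sum((y_true-yhat)**2)                            # residual sum of squares
TSS=np.sum((y_true-np.mean(y_true))**2)                 # total sum of squares around the mean of the actual values
r2=1-(RSS/TSS)
print(r2)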

### r2 score on test data

In [None]:
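#r2 score on test data

yhat=np.ravel(lr.predict(x_test))          # yhat here is the predicted value on test data
y_true=np.ravel(y_test)                    # actual values as a flat array
RSS=np.sum((y_true-yhat)**2)
TSS=np.sum((y_true-np.mean(y_true))**2)    # mean of the actual values, not of the predictions
r2=1-(RSS/TSS)
print(r2)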

### Adjusted r2 on training data

In [None]:
# Adjusted r2 on training data

yhat = np.ravel(lr.predict(x_train))                   # predicted values on train data
y_true = np.ravel(y_train)                             # actual values as a flat array
SumSquaresResidual = np.sum((y_true-yhat)**2)
SumSquaresTotal = np.sum((y_true-np.mean(y_true))**2)
r_squared = 1 - SumSquaresResidual/SumSquaresTotal
adjusted_r_squared = 1 - (1-r_squared)*(len(y_true)-1)/(len(y_true)-x_train.shape[1]-1)
print(r_squared, adjusted_r_squared)

### Adjusted r2 on test data

In [None]:
# Adjusted r2 on test data

yhat = np.ravel(lr.predict(x_test))                    # predicted values on test data
y_true = np.ravel(y_test)                              # actual values as a flat array
SumSquaresResidual = np.sum((y_true-yhat)**2)
SumSquaresTotal = np.sum((y_true-np.mean(y_true))**2)
r_squared = 1 - SumSquaresResidual/SumSquaresTotal
adjusted_r_squared = 1 - (1-r_squared)*(len(y_true)-1)/(len(y_true)-x_test.shape[1]-1)
print(r_squared, adjusted_r_squared)

## Inference:

    - Adjusted r2 is considered a better metric than plain r2 for multiple linear regression because it penalises additional features; we obtained 0.834375 on the train data
    - Adjusted r2 for the test data is 0.825761
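
The adjusted R2 calculation is repeated several times in this notebook, so it can be wrapped in one helper. The sketch below uses the same formula, 1 - (1 - R2)(n - 1)/(n - p - 1), where n is the number of rows and p the number of features; the function name `r2_and_adjusted` and the sample numbers are assumptions for illustration.

```python
import numpy as np

def r2_and_adjusted(y_true, y_hat, n_features):
    """Return (R2, adjusted R2) for predictions y_hat from a model with n_features predictors."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_hat = np.asarray(y_hat, dtype=float).ravel()
    rss = np.sum((y_true - y_hat) ** 2)          # residual sum of squares
    tss = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares around the actual mean
    r2 = 1 - rss / tss
    n = len(y_true)
    adjusted = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)
    return float(r2), float(adjusted)

# Made-up actual and predicted values, treated as coming from a 2-feature model.
r2, adj = r2_and_adjusted([3, 5, 7, 9, 11, 13], [3.2, 4.8, 7.1, 9.3, 10.6, 13.0], n_features=2)
print(round(r2, 4), round(adj, 4))
```

In the notebook it could be called as `r2_and_adjusted(y_test, lr.predict(x_test), x_test.shape[1])`, assuming the variables defined in the cells above.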

# Inferences on Overall model Building,evaluation and Prediction:

    - Adjusted r2 is a good measure for multiple linear regression, since we have multiple independent variables
    - Adjusted r2 on train data is 0.834375
    - Adjusted r2 on test data is 0.825761
    - The error terms are approximately normally distributed with a mean close to 0
    - The scatter plot of y_test against y_pred shows a good linear relation, which indicates good model performance

## Stats model Analysis

In [None]:
# Import stats models library for statistical analysis

import statsmodels.api as sm   
from statsmodels.stats.outliers_influence import variance_inflation_factor

In [None]:
# Created a user-defined function called statsmodel_analysis for further statistical analysis, which reduces the lines of code

def statsmodel_analysis(x,y):
    x=sm.add_constant(x.astype(float))   # step 1: add a constant (cast so dummy columns are numeric)
    result=sm.OLS(y,x).fit()             # step 2: fit the model
    print(result.summary())              # step 3: print the summary

In [None]:
# Created a user-defined function called MeasureVIF for further statistical analysis, which reduces the lines of code

def MeasureVIF(x):
    values = x.values.astype(float)      # cast so dummy columns are numeric
    vif = pd.DataFrame()
    vif['Features'] = x.columns
    vif['VIF'] = [variance_inflation_factor(values, i) for i in range(x.shape[1])]
    vif['VIF'] = round(vif['VIF'], 2)
    vif = vif.sort_values(by = "VIF", ascending = False)
    return(vif)

## stats model 1

In [None]:
# considering all the columns, for which adjusted r2 was 0.834 in the model evaluation performed with sklearn

x_train_stats1=x_train
statsmodel_analysis(x_train_stats1,y_train)

In [None]:
MeasureVIF(x_train_stats1)

## Insights:

    - P-value for weekday_2 is 0.987, so let's remove it since it is statistically insignificant in our model
    - Highest VIF:172.63 has been noticed for workingday_1
    - Adjusted r2 : 0.834
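
For intuition about what a VIF measures: the textbook definition is 1 / (1 - R2_j), where R2_j comes from regressing feature j on all the other features. The sketch below computes that by hand with NumPy on synthetic columns; the helper name `vif_by_hand` is an assumption. It adds an intercept to each auxiliary regression, whereas MeasureVIF passes x without a constant column, so the two can give different numbers.

```python
import numpy as np

def vif_by_hand(X):
    """Textbook VIF: 1 / (1 - R^2) from regressing each column on the others plus an intercept."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = []
    for j in range(p):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ beta
        r2 = 1 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

rng = np.random.default_rng(7)
a = rng.normal(size=500)
b = a + rng.normal(scale=0.1, size=500)  # almost a copy of a
c = rng.normal(size=500)                 # generated independently of a and b
vifs = vif_by_hand(np.column_stack([a, b, c]))
print(np.round(vifs, 2))
```

The two nearly identical columns get a large VIF while the independent one stays close to 1, which is the pattern to look for in the tables produced by MeasureVIF.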

## Stats model 2

In [None]:
# weekday_2 has high p-value greater than 0.05 lets drop this column from our analysis

# x_train_stats2 dataframe doesn't hold weekday_2 column/feature

x_train_stats2=x_train_stats1.drop(['weekday_2'],axis=1)
statsmodel_analysis(x_train_stats2,y_train)

In [None]:
MeasureVIF(x_train_stats2)

## Insights:

    - P-value for mnth_june is 0.966, so let's remove it since it is statistically insignificant in our model
    - VIF for temp is 44.75
    - Adjusted r2 :0.835 

## stats model 3

In [None]:
# Drop mnth_june column where p-value is 0.966 which is statistically not significant

# x_train_stats3 dataframe doesn't hold the weekday_2 and mnth_june columns/features

x_train_stats3=x_train_stats1.drop(['weekday_2','mnth_june'],axis=1)
statsmodel_analysis(x_train_stats3,y_train)

In [None]:
MeasureVIF(x_train_stats3)

# Insights:

    - pvalue of weekday_1 is 0.796 we can remove this feature, Since high p-value indicates feature is insignificant in model
    - VIF for hum is highest at 37.77
    - Adjusted r2 : 0.835

    

## stats Model 4

In [None]:
# Drop weekday_1 column where p-value is 0.796 which is statistically not significant

# x_train_stats4 dataframe doesn't hold weekday_2, mnth_june and weekday_1

x_train_stats4=x_train_stats1.drop(['weekday_2','mnth_june','weekday_1'],axis=1)
statsmodel_analysis(x_train_stats4,y_train)

In [None]:
MeasureVIF(x_train_stats4)

# Insights:
    - PValue of mnth_feb is 0.770 we can remove this feature, Since high p-value indicates feature is insignificant in model
    - VIF for hum is highest at 37.76
    - Adjusted r2 : 0.835

## stats Model 5

In [None]:
# Drop mnth_feb column where p-value is 0.770 which is statistically not significant

# x_train_stats5 dataframe doesn't hold the weekday_2, mnth_june, weekday_1 and mnth_feb columns/features

x_train_stats5=x_train_stats1.drop(['weekday_2','mnth_june','weekday_1','mnth_feb'],axis=1)
statsmodel_analysis(x_train_stats5,y_train)

In [None]:
MeasureVIF(x_train_stats5)

# Insights:

    - Adjusted r-squared is 0.836
    - pvalue of mnth_aug is 0.694 we can remove this feature, Since high p-value indicates feature is insignificant in model
    - VIF for hum is high at 34.87

## stats model 6

In [None]:
# Drop mnth_aug column where p-value is 0.694 which is statistically not significant

# x_train_stats6 dataframe doesn't hold weekday_2, mnth_june,weekday_1 ,mnth_feb and mnth_aug columns/feature

x_train_stats6=x_train_stats1.drop(['weekday_2','mnth_june','weekday_1','mnth_feb','mnth_aug'],axis=1)
statsmodel_analysis(x_train_stats6,y_train)

In [None]:
MeasureVIF(x_train_stats6)

# Insights:

    - Adjusted r-squared is 0.836
    - pvalue of mnth_jan is 0.415 we can remove this feature, Since high p-value indicates feature is insignificant in model
    - VIF for hum is high at 33.33

## stats model 7

In [None]:
# Drop mnth_jan column where p-value is 0.415 which is statistically not significant

# x_train_stats7 dataframe doesn't hold weekday_2, mnth_june,weekday_1 ,mnth_feb, mnth_aug and mnth_jan columns/feature

x_train_stats7=x_train_stats1.drop(['weekday_2','mnth_june','weekday_1','mnth_feb','mnth_aug','mnth_jan'],axis=1)
statsmodel_analysis(x_train_stats7,y_train)

In [None]:
MeasureVIF(x_train_stats7)

# Insights:

    - Adjusted r-squared is 0.836
    - pvalue of weekday_3 is 0.288 we can remove this feature, Since high p-value indicates feature is insignificant in model
    - VIF for hum is high at 32.19

## stats model 8

In [None]:
# Drop weekday_3 column where p-value is 0.288 which is statistically not significant

# x_train_stats8 dataframe doesn't hold weekday_2, mnth_june,weekday_1 ,mnth_feb, mnth_aug,mnth_jan and weekday_3 columns/feature

x_train_stats8=x_train_stats1.drop(['weekday_2','mnth_june','weekday_1','mnth_feb','mnth_aug','mnth_jan','weekday_3'],axis=1)
statsmodel_analysis(x_train_stats8,y_train)

In [None]:
MeasureVIF(x_train_stats8)

# Insights:

    - Adjusted r-squared is 0.836
    - pvalue of weekday_6 is 0.248 we can remove this feature,Since high p-value indicates feature is insignificant in model
    - VIF for hum is high at 32.18

## stats model 9

In [None]:
# Drop weekday_6 column where p-value is 0.248 which is statistically not significant

# x_train_stats9 dataframe doesn't hold weekday_2, mnth_june,weekday_1 ,mnth_feb, mnth_aug,mnth_jan,weekday_3 and weekday_6 columns/feature

x_train_stats9=x_train_stats1.drop(['weekday_2','mnth_june','weekday_1','mnth_feb','mnth_aug','mnth_jan','weekday_3','weekday_6'],axis=1)
statsmodel_analysis(x_train_stats9,y_train)

In [None]:
MeasureVIF(x_train_stats9)

# Insights:

    - Adjusted r-squared is 0.836
    - pvalue of mnth_oct is 0.209 we can remove this feature, Since high p-value indicates feature is insignificant in model
    - VIF for hum is high at 32.00

## statsmodel 10

In [None]:
# Drop mnth_oct column where p-value is 0.209 which is statistically not significant

# x_train_stats10 dataframe doesn't hold weekday_2, mnth_june,weekday_1 ,mnth_feb, mnth_aug,mnth_jan,weekday_3,weekday_6 and mnth_oct columns/feature

x_train_stats10=x_train_stats1.drop(['weekday_2','mnth_june','weekday_1','mnth_feb','mnth_aug','mnth_jan','weekday_3','weekday_6','mnth_oct'],axis=1)
statsmodel_analysis(x_train_stats10,y_train)

In [None]:
MeasureVIF(x_train_stats10)

# Insights:

    - Adjusted r-squared is 0.836
    - pvalue of season_summer is 0.257 we can remove this feature, Since high p-value indicates feature is insignificant in model
    - VIF for hum is at 31.66

## statsmodel 11

In [None]:
# Drop season_summer  column where p-value is 0.257 which is statistically not significant

# x_train_stats11 dataframe doesn't hold weekday_2, mnth_june,weekday_1 ,mnth_feb, mnth_aug,mnth_jan,weekday_3,weekday_6, mnth_oct and season_summer columns/feature

x_train_stats11=x_train_stats1.drop(['weekday_2','mnth_june','weekday_1','mnth_feb','mnth_aug','mnth_jan','weekday_3','weekday_6','mnth_oct','season_summer'],axis=1)
statsmodel_analysis(x_train_stats11,y_train)

In [None]:
MeasureVIF(x_train_stats11)

# Insights:

    - Adjusted r-squared is 0.835
    - P-value of all the features/columns is less than 0.05, which indicates all the features are significant to the model
    - VIF for hum(humidity) is 28.79 which is the highest VIF among all other features

## Stats model 12

In [None]:
# Drop the humidity (hum) column: its VIF of 28.79 is the highest and indicates multicollinearity, even though its p-value is below 0.05

# x_train_stats12 dataframe doesn't hold weekday_2, mnth_june,weekday_1 ,mnth_feb, mnth_aug,mnth_jan,weekday_3,weekday_6, mnth_oct,season_summer and hum columns/feature

x_train_stats12=x_train_stats1.drop(['weekday_2','mnth_june','weekday_1','mnth_feb','mnth_aug','mnth_jan','weekday_3','weekday_6','mnth_oct','season_summer','hum'],axis=1)
statsmodel_analysis(x_train_stats12,y_train)

In [None]:
MeasureVIF(x_train_stats12)

# Insights:

    - Adjusted r-squared is 0.830
    - P-value of all the features/columns is less than 0.05, which indicates all the features are significant to the model
    - VIF for temp is 7.93 which is the highest VIF among all other features

## consider new Features which are statistically significant

#### We consider the features obtained in stats model 12 and proceed with model  building to check the accuracy on test data

In [None]:
#x_train_selected features hold the features which are statistically significant 

x_train_selectedfeatures=x_train_stats1.drop(['weekday_2','mnth_june','weekday_1','mnth_feb','mnth_aug','mnth_jan','weekday_3','weekday_6','mnth_oct','season_summer','hum'],axis=1)
x_train_selectedfeatures

In [None]:
x_train_selectedfeatures.shape

### Now we are left with 17 columns. Let's keep the same 17 columns for the test data as well

In [None]:
#x_test selected features
x_testselectedfeatures=x_test.drop(['weekday_2','mnth_june','weekday_1','mnth_feb','mnth_aug','mnth_jan','weekday_3','weekday_6','mnth_oct','season_summer','hum'],axis=1)
x_testselectedfeatures

### Fit the model

In [None]:
lr.fit(x_train_selectedfeatures,y_train)

### Make predictions on test data

In [None]:
y_predselectedfeatures=lr.predict(x_testselectedfeatures)
y_predselectedfeatures

In [None]:
# Plotting y_test and y_predselectedfeatures using a scatter plot to understand the spread of the target variable

plt.figure(figsize=(10,6))
plt.scatter(y_test,y_predselectedfeatures)
plt.xlabel('y_test', fontsize=15)                          
plt.ylabel('y_predselectedfeatures', fontsize=15)
plt.show()

#### Adjusted r2 on training data

In [None]:
# Adjusted r2 on training data

yhat = np.ravel(lr.predict(x_train_selectedfeatures))  # predicted values on train data
y_true = np.ravel(y_train)                             # actual values as a flat array
SumSquaresResidual = np.sum((y_true-yhat)**2)
SumSquaresTotal = np.sum((y_true-np.mean(y_true))**2)
r_squared = 1 - SumSquaresResidual/SumSquaresTotal
adjusted_r_squared = 1 - (1-r_squared)*(len(y_true)-1)/(len(y_true)-x_train_selectedfeatures.shape[1]-1)
print(r_squared, adjusted_r_squared)

#### Adjusted r2 on test data

In [None]:
# Adjusted r2 on test data

yhat = np.ravel(lr.predict(x_testselectedfeatures))    # predicted values on test data
y_true = np.ravel(y_test)                              # actual values as a flat array
SumSquaresResidual = np.sum((y_true-yhat)**2)
SumSquaresTotal = np.sum((y_true-np.mean(y_true))**2)
r_squared = 1 - SumSquaresResidual/SumSquaresTotal
adjusted_r_squared = 1 - (1-r_squared)*(len(y_true)-1)/(len(y_true)-x_testselectedfeatures.shape[1]-1)
print(r_squared, adjusted_r_squared)

# Actions performed above:

    - Importing and reading data
    - Finding Missing values    
    - Univariate and Bivariate Analysis for better insights of data
    - Feature selection
    - Considered the Linear regression Assumptions
    - Model Building
    - Model evaluation
    - Statistical Analysis to avoid multicollinearity and finding best features of the data
    - Verifying the accuracy after training the model on the new features that were selected through the statistical analysis

# Final Result Comparison between Train model and Test:
    
- Train Adjusted R^2 : 0.835
- Test Adjusted R^2  : 0.847
- The difference in Adjusted R^2 between train and test data is about 1%, which is less than 5% and indicates a **Good model**

    

# Inferences on Overall Stats model Building:
   - As observed in the analysis, demand for bikes was higher in 2019 than in 2018, which indicates the business was gradually increasing its revenue over time. The business can therefore focus on the other variables and plan accordingly after the pandemic.
   - The business can focus more on the summer and winter seasons and on the months of March, May and September, which have a good influence on bike rentals.
   - The spring season has a negative coefficient, i.e. it is negatively associated with bike rentals, so offers during this time could attract customers and increase revenue.