# Linear Regression
## Bike Sharing Assignment
#### Problem Statement:

- To identify the variables affecting the demand for shared bikes in the American market.

- To know which variables are significant in predicting the demand for shared bikes.

- To know the accuracy of the model, i.e. how well these variables can predict demand.
- To know the priority of the variables, i.e. how strongly each one influences demand.

# Steps for achieving the above objectives:
    - Reading, understanding and visualising the data
    - Preparing the data for modelling (train-test split, rescaling etc.)
    - Training the model
    - Residual analysis
    - Predictions and evaluation on the test set

## Step 1: Reading and Understanding the Data

Let us first import NumPy and pandas and read the bike sharing dataset

In [None]:
## Suppress warnings
import warnings
warnings.filterwarnings('ignore')

In [None]:
## Importing required libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import calendar

In [None]:
bike = pd.read_csv('../input/bike-sharing-data-set/day.csv') ## Importing data set into bike variable

In [None]:
bike.head() # Display the first 5 records of the dataframe using head()

In [None]:
bike.shape # Shape of data frame (no of rows & cols)

In [None]:
bike.info() ## Info of dataset , column data types and non- null values 

In [None]:
bike.isnull().sum() ### Looking for any null values present in given data set

In [None]:
bike.describe() ## getting statistical information about dataset

#### Dropping columns which are irrelevant for model building

Dropping instant, dteday, casual, registered

- 1) instant - It is a unique row identifier and carries no information for regression, so it is dropped.
- 2) dteday - It is redundant because the yr and mnth columns already capture the date information.
- 3) casual & registered - These are components of the target: cnt is the sum of casual and registered. They are not available at prediction time, and keeping them would leak the target into the predictors, so they are dropped as well.
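
The claim that cnt is the sum of casual and registered can be checked before dropping the columns. The sketch below uses a small made-up frame (the numbers are illustrative, only the column names mirror the dataset); on the real data the same expression would be run against `bike` before the drop.

```python
import pandas as pd

# Toy rows with made-up numbers; only the column names mirror the dataset.
toy = pd.DataFrame({
    "casual": [100, 250, 80],
    "registered": [900, 1500, 1200],
    "cnt": [1000, 1750, 1280],
})

# True only when every row satisfies cnt == casual + registered.
is_sum = (toy["casual"] + toy["registered"]).eq(toy["cnt"]).all()
print(is_sum)
```

If the check holds, casual and registered add no information beyond cnt itself, which is why they cannot be used as predictors.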

In [None]:
not_req = ['instant','dteday','casual','registered']

bike = bike.drop(not_req,axis=1)

In [None]:
bike.shape ## checking whether changes are implemented

In [None]:
bike.head()  ## It's always good practice to inspect the dataframe

## Before modelling, let's visualise the data. Looking at the data, some categorical variables are stored with int data types alongside the numerical variables

In [None]:
## Correlation between variables
## Heat map is one of best way of visualizing correlation between variables

plt.figure(figsize=(20,8))
ax = sns.heatmap(bike.corr(),annot=True)
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.5, top - 0.5)
plt.show()



In [None]:
## nunique() is a pandas method that counts the distinct values in each column, which gives an idea of whether a column is categorical or not

bike.nunique()
bike.nunique().sort_values(ascending= True)

In [None]:
## From the counts above, the columns [temp, hum, windspeed, atemp, cnt] are definitely numerical
## A pair plot is a good way to visualise numerical columns

sns.pairplot(data=bike,vars=['temp', 'atemp', 'hum','windspeed','cnt'])
plt.show()


### From the plots above, temp and atemp are highly correlated. To avoid multicollinearity it is better to drop one of them, hence dropping atemp
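
Besides reading the pair plot, highly correlated pairs can be listed programmatically. The sketch below is a minimal illustration on synthetic data: the 0.9 threshold and the generated values are assumptions, not taken from the notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
temp = rng.uniform(0, 35, size=200)
toy = pd.DataFrame({
    "temp": temp,
    "atemp": temp * 1.1 + rng.normal(0, 0.5, size=200),  # nearly a rescaled copy of temp
    "hum": rng.uniform(20, 90, size=200),                 # unrelated to temp
})

corr = toy.corr().abs()
# Keep only the upper triangle so that each pair is listed once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
high_pairs = [
    (row, col)
    for row in upper.index
    for col in upper.columns
    if upper.loc[row, col] > 0.9  # assumed cut-off for "highly correlated"
]
print(high_pairs)
```

Only one variable of each reported pair needs to be dropped; which one to keep is a modelling choice.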

In [None]:
bike = bike.drop(['atemp'],axis =1) ## Using drop function and we need column so axis = 1

In [None]:
bike.shape  ## Shape of data frame after atemp variable is removed

In [None]:
co = bike[['temp', 'hum','windspeed','cnt']].corr()  ## calculating correlation again for numeric columns
co

In [None]:
## Visualizing corr() using heat map for best representation

ax = sns.heatmap(co,annot=True,cmap="YlGnBu")
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.5, top - 0.5)
plt.show()

### Let's visualise the categorical columns in the data frame.
    - The nunique output above shows that several columns are categorical.
    - To visualise the categories, let's draw a box plot of cnt for each categorical column using subplots

In [None]:
plt.figure(figsize=(20, 12))

plt.subplot(2,4,1)
sns.boxplot(x = 'yr', y = 'cnt', data = bike)

plt.subplot(2,4,2)
sns.boxplot(x = 'holiday', y = 'cnt', data = bike)

plt.subplot(2,4,3)
sns.boxplot(x = 'workingday', y = 'cnt', data = bike)

plt.subplot(2,4,4)
sns.boxplot(x = 'weathersit', y = 'cnt', data = bike)

plt.subplot(2,4,5)
sns.boxplot(x = 'season', y = 'cnt', data = bike)

plt.subplot(2,4,6)
sns.boxplot(x = 'weekday', y = 'cnt', data = bike)

plt.subplot(2,4,7)
sns.boxplot(x = 'mnth', y = 'cnt', data = bike)
plt.show()

### Inferences:
    - In 2019 the cnt values are clearly higher than in 2018, i.e. shared bike usage increased year over year
    - For weathersit, category 1 (clear) has the highest cnt values, so bike usage is highest in clear weather compared to other weather conditions.
    - Spring has an outlier of about 8000, however season 3 (fall, as per the data dictionary) has the highest usage of bikes
    - The cnt value is lower on holidays
    - In the month plot, every month except October stays above 0, so there is always at least some usage; October has a day with almost no bookings, yet it has a decent number of bookings compared to other months

### We can also visualise these category columns by using hue as holiday

In [None]:
plt.figure(figsize = (10, 5))
sns.boxplot(x = 'mnth', y = 'cnt', hue = 'holiday', data = bike)
plt.show()

-  <font color = blue>`Grouping by holiday gives an interesting insight: in March and August there are no cnt values on holidays`</font>

## Step 2: Data Preparation for model building
    - Converting the integer-coded category columns into labelled categories
    - Regression needs only numeric columns, so any non-numeric columns must then be encoded as dummy variables

In [None]:
bike.shape
bike.head()

In [None]:
bike.nunique().sort_values(ascending=True)

In [None]:
bike.weathersit.value_counts()

### Even though the data dictionary lists 4 weathersit types, only 3 of them occur in the dataset, as shown above

#### Based on the nunique output above, we can map the categorical variables to descriptive labels

- 1) season : It has 4 levels and can be mapped as (1 - Spring, 2 - Summer, 3 - Fall, 4 - Winter)
- 2) yr : No need to modify as it is a binary categorical variable
- 3) mnth : It has 12 levels and can be mapped as (1 - Jan to 12 - Dec)
- 4) holiday, workingday : No need to modify as they are binary categorical variables
- 5) weekday : It has 7 levels and can be mapped as (0 - Sun to 6 - Sat)
- 6) weathersit : It has 3 unique levels in the data and can be mapped as (1 - Clear, 2 - Mist & Cloudy, 3 - Light Rain & Snow, 4 - Heavy Rain & Snow)

In [None]:
## Converting the required categorical columns into labels using map(); for mnth, calendar.month_name gives the month name
bike.season = bike.season.map({1: 'spring',2:'summer',3:'fall',4:'winter'})
bike['mnth'] = bike['mnth'].apply(lambda x:calendar.month_name[x])   
bike.weekday = bike.weekday.map({0:"Sunday",1:"Monday",2:"Tuesday",3:"Wednesday",4:"Thursday",5:"Friday",6:"Saturday"})
bike.weathersit = bike.weathersit.map({1:'Clear',2:'Mist & Cloudy', 
                                             3:'Light Rain & Snow',4:'Heavy Rain & Snow'})

In [None]:
## Inspect the data frame whether changes has been reflected or not
bike.head()

In [None]:
## Lets look the info of dataframe for data types etc
bike.info() 

## Creating dummy variables for the categorical columns created above

In [None]:
cat_cols = ['season','mnth','weekday','weathersit']

cat_df = bike[cat_cols]
cat_df

`pandas provides the get_dummies function to create dummy variables for the given columns. By default it creates one column per level, but to avoid redundancy and multicollinearity it is necessary to drop the first level (drop_first=True)`
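
The effect of `drop_first=True` is easiest to see on a tiny example. This is a sketch on a made-up `season` column, not the notebook's data; `dtype=int` is passed only so that the output shows 0/1 rather than booleans.

```python
import pandas as pd

toy = pd.DataFrame({"season": ["spring", "summer", "fall", "winter", "fall"]})

full = pd.get_dummies(toy, dtype=int)                      # one column per level
reduced = pd.get_dummies(toy, drop_first=True, dtype=int)  # first level (alphabetically) dropped

print(list(full.columns))
print(list(reduced.columns))
```

In the full encoding every row sums to 1, so any one column is fully determined by the others. Dropping the first level (here `fall`) removes that redundancy, and the dropped level becomes the baseline against which the remaining coefficients are read.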

In [None]:
cat_df = pd.get_dummies(cat_df,drop_first=True,dtype=int) ## Separate dataframe for all dummy variables; dtype=int keeps them numeric (newer pandas versions default to bool)

In [None]:
cat_df.head() ## Inspecting top 5 rows for results 

In [None]:
## Merging cat_df dataframe with bike

bike = pd.concat([bike,cat_df],axis = 1)    ### Concatenate the dummy variables data frame with our original df bike so that we can use them


In [None]:
bike.shape  ## Checking the shape to confirm the dummy columns were added

In [None]:
bike.head() ## Checking few rows of data frame

In [None]:
pd.set_option('display.max_columns',50)  ## pandas truncates wide dataframes by default (display.max_columns defaults to 20), so raise the limit to 50

In [None]:
bike.head()

In [None]:
## To avoid redundancy we drop the original columns for which dummy variables were created

bike = bike.drop(cat_cols,axis = 1) 


In [None]:
bike.head()

In [None]:
bike.shape

### The data is now prepared for the regression steps (splitting the data, model building etc.)

### Step 4 : Split the data set into training and test

In [None]:
## Importing required libraries for Linear regression 
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import RFE


import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor


In [None]:
np.random.seed(0)    ## Seeds NumPy's global random number generator for reproducibility
## Using train_test_split we divide the given data set into training and test sets in a 70:30 ratio; random_state fixes the split
bike_train,bike_test=train_test_split(bike, train_size = 0.7, random_state = 100)  

In [None]:
bike_train.shape ## Shape of training set after split 

In [None]:
bike_test.shape ## Shape of test set after split

In [None]:
bike_train.head()  ## Inspecting data for train set

In [None]:
num_cols = ['temp','hum','windspeed','cnt']   ## Creating list of numeric columns which requires scaling

#### Scaling puts the numeric columns on a comparable range, which helps interpretation. Numeric columns such as cnt and hum are on very different scales, which makes their coefficients hard to compare; once they are rescaled, the coefficients are much easier to interpret.

`We use MinMax scaling (a normalization technique) so that every scaled column lies between 0 and 1 on the training set, where the highest value in a column maps to 1 and the lowest to 0`
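
Min-max scaling applies x' = (x - min) / (max - min) column by column, where min and max are learned from the training set only. The sketch below reproduces that arithmetic by hand on made-up numbers to show why the test set is only transformed, never refitted; it illustrates the formula and is not a replacement for `MinMaxScaler`.

```python
import pandas as pd

# Made-up values; only the column names mirror the notebook.
train = pd.DataFrame({"temp": [2.0, 10.0, 30.0], "cnt": [500.0, 3000.0, 8000.0]})
test = pd.DataFrame({"temp": [5.0, 34.0], "cnt": [1000.0, 8500.0]})

# "Fit": learn min and max from the training data only.
col_min = train.min()
col_max = train.max()

# "Transform": apply the same training statistics to both sets.
train_scaled = (train - col_min) / (col_max - col_min)
test_scaled = (test - col_min) / (col_max - col_min)

print(train_scaled)
print(test_scaled)
```

Because the test rows reuse the training min and max, scaled test values can fall slightly outside [0, 1]. That is expected and does not indicate a bug.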

In [None]:
scaler = MinMaxScaler()  ## Creating MinMaxScaler object 
bike_train[num_cols] = scaler.fit_transform(bike_train[num_cols]) ## Fit the scaler on the training set and write the transformed values back into it

In [None]:
bike_train.head() ## Inspecting data frame

In [None]:
bike_train.describe()  ## After scaling it is good to use describe function because describe will give min max of entries in the columns apart from mean,median,count

In [None]:
# Let's check the correlation coefficients to see which variables are highly correlated on training data set 

plt.figure(figsize=(25,15))
ax = sns.heatmap(bike_train.corr(),annot=True)
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.5, top - 0.5)
plt.show()

_"From the heat map above, temp is highly correlated with cnt, so let's visualise that relationship on its own"_

In [None]:
plt.figure(figsize=[5,5])
plt.scatter(bike_train.temp, bike_train.cnt)
plt.show()

## To select a useful subset of predictors, let's build the model using RFE (Recursive Feature Elimination)
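
The idea behind recursive feature elimination can be sketched without any library: fit the model, drop the feature with the smallest absolute coefficient, and repeat until the desired number of features remains. The helper `backward_eliminate` and the synthetic data below are hypothetical and simplified; scikit-learn's RFE follows the same loop but handles estimators and rankings more generally.

```python
import numpy as np

def backward_eliminate(X, y, n_keep):
    """Repeatedly refit least squares and drop the weakest remaining feature."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_keep:
        # Include an intercept column so the slopes match an ordinary linear fit.
        design = np.column_stack([np.ones(len(y)), X[:, remaining]])
        coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
        weakest = int(np.argmin(np.abs(coefs[1:])))  # skip the intercept
        remaining.pop(weakest)
    return remaining

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 5))  # features already on a common 0-1 scale
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(0, 0.1, size=200)

selected = backward_eliminate(X, y, n_keep=2)
print(selected)
```

Comparing coefficient magnitudes only makes sense when the features share a scale, which is one reason the numeric columns were rescaled to the 0-1 range before this step.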

In [None]:
y_train = bike_train.pop('cnt')  ## Split the data into y (the dependent variable cnt) and X (the independent variables)
X_train = bike_train

In [None]:
lm = LinearRegression()   ## Creating linear regression object 
lm.fit(X_train, y_train) ### Fit the model on the training set 

rfe = RFE(lm, n_features_to_select=12)   # run RFE to keep the top 12 variables (the keyword form is required by recent scikit-learn versions)
rfe = rfe.fit(X_train, y_train) ## Fit it 

In [None]:
## RFE exposes two attributes that make our job easier: support_ (a boolean mask of the selected columns) and ranking_ (1 for selected features, larger numbers for features eliminated earlier)
list(zip(X_train.columns,rfe.support_,rfe.ranking_))  

In [None]:
## Now that we have columns needs to be considered for model building
col = X_train.columns[rfe.support_] ## Extracting column names from data set 
col

In [None]:
X_train_rfe = X_train[col]  ## Creating another data frame with selected columns

In [None]:
X_train_rfe.head() ## Inspecting data frame of top 5 rows

In [None]:
## statsmodels' OLS does not add an intercept by default, so without a constant column the fitted line would be forced through the origin
X_train_rfe = sm.add_constant(X_train_rfe)  ## Adding constant

In [None]:
lm1 = sm.OLS(y_train,X_train_rfe).fit()  ## Fit the linear model by ordinary least squares (OLS)
lm1.params

In [None]:
print(lm1.summary()) ## Printing summary of model

### From the summary above, R-squared and adjusted R-squared are almost equal, Prob (F-statistic) is almost zero, and the p-values show at least 11 variables to be statistically significant

#### However, we will use the variance inflation factor (VIF) to check how strongly the predictors are correlated with each other
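
For one predictor, the textbook VIF is 1 / (1 - R²), where R² comes from regressing that predictor on all the others. The sketch below computes it by hand on synthetic data, with an intercept in the auxiliary regression. The helper `vif` is hypothetical, and its values may differ from what `variance_inflation_factor` returns on a design matrix that has no constant column.

```python
import numpy as np

def vif(X, i):
    """Textbook VIF of column i: 1 / (1 - R^2) from regressing it on the other columns."""
    target = X[:, i]
    others = np.delete(X, i, axis=1)
    design = np.column_stack([np.ones(len(target)), others])
    coefs, *_ = np.linalg.lstsq(design, target, rcond=None)
    resid = target - design @ coefs
    r2 = 1.0 - resid.var() / target.var()
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = a + rng.normal(scale=0.1, size=500)  # almost a copy of a
c = rng.normal(size=500)                 # independent of a and b
X = np.column_stack([a, b, c])

vifs = [vif(X, i) for i in range(X.shape[1])]
print([round(v, 2) for v in vifs])
```

A VIF near 1 means the predictor is barely explained by the others, while large values signal multicollinearity; the cut-off of 5 used later in this notebook is a common rule of thumb.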

In [None]:
vif = pd.DataFrame()
X = X_train_rfe
vif['Features'] = X.columns
vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

`The constant also gets a VIF, but it is not a predictor, so we remove it before recalculating the VIFs`

In [None]:
X_train_rfe = X_train_rfe.drop(['const'], axis=1)

In [None]:
vif = pd.DataFrame()
X = X_train_rfe
vif['Features'] = X.columns
vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

`From the table above, hum has a high VIF, so we will drop this variable`

In [None]:
X_train_new1 = X_train_rfe.drop(["hum"], axis = 1) ## Dropping the hum column and adding to new dataframe

In [None]:
X_train_new1 = sm.add_constant(X_train_new1) ## Adding constant to df

In [None]:
lm2 = sm.OLS(y_train,X_train_new1).fit() ## model build again

In [None]:
lm2.summary()

`As per the lm2 summary above, there is no drastic change in R-squared and adjusted R-squared, and the p-values are <= 0.05, i.e. statistically significant`

#### Let's calculate the VIF again to check the results

In [None]:
X_train_new1 = X_train_new1.drop(['const'], axis=1)

In [None]:
vif = pd.DataFrame()
X = X_train_new1
vif['Features'] = X.columns
vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

- As per the table above, the VIFs have reduced drastically after dropping the hum column. Now let's remove one more column to improve the model: mnth_July is selected because its p-value is 0.05.

In [None]:
X_train_new2 = X_train_new1.drop(["mnth_July"], axis = 1)

In [None]:
X_train_new2 = sm.add_constant(X_train_new2) ## Adding constant

In [None]:
lm3 = sm.OLS(y_train,X_train_new2).fit()   ## Building model again

In [None]:
lm3.summary()

`From the summary above, the p-value is 0.08 for season. Before dropping anything, let's look at the VIF values`

In [None]:
X_train_new2 = X_train_new2.drop(['const'], axis=1)

In [None]:
vif = pd.DataFrame()
X = X_train_new2
vif['Features'] = X.columns
vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

`Now all the VIFs are below 5, which is acceptable. season_spring has a p-value of 0.008, so let's drop it and see whether the model improves further`

In [None]:
X_train_new3 = X_train_new2.drop(["season_spring"], axis = 1)

In [None]:
X_train_new3 = sm.add_constant(X_train_new3)

In [None]:
lm4 = sm.OLS(y_train,X_train_new3).fit()


In [None]:
lm4.summary()

In [None]:
X_train_new3 = X_train_new3.drop(['const'], axis=1)

In [None]:
vif = pd.DataFrame()
X = X_train_new3
vif['Features'] = X.columns
vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

# Inference: 
    - We now have our final model, with an R-squared of about 0.83 (83%), which is a pretty good fit
    - The adjusted R-squared and R-squared are almost equal, which is acceptable
    - The F-statistic is again good
    - All the p-values are approximately 0, which strongly suggests that the variables are statistically significant 
    - The VIFs for all predictors are less than 5 

## Step 7: Residual Analysis of the train data

So, now to check whether the error terms are normally distributed (which is, in fact, one of the major assumptions of linear regression), let us plot the histogram of the error terms and see what it looks like.
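
Alongside the histogram, two quick numeric checks are useful: the residuals of a fit with an intercept should average to zero, and for roughly normal errors about 95% of them should lie within two standard deviations. The sketch below runs both checks on synthetic data with Gaussian noise; the data and the single-predictor model are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=1000)
y = 0.2 + 0.5 * x + rng.normal(0, 0.1, size=1000)  # synthetic data with Gaussian noise

# Least-squares fit with an intercept column.
design = np.column_stack([np.ones_like(x), x])
coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
residuals = y - design @ coefs

resid_mean = residuals.mean()
# Share of residuals within two standard deviations of the mean.
within_2sd = np.mean(np.abs(residuals - resid_mean) < 2 * residuals.std())
print(resid_mean, within_2sd)
```

On the notebook's model, `residuals` would be `y_train - y_train_cnt`. A share far from 0.95, or a clearly skewed histogram, would suggest that the normality assumption is doubtful.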

In [None]:
X_train_new3.shape ## Shape of final model df
X_train_new3 = sm.add_constant(X_train_new3) ##  Adding constant variable

In [None]:
y_train_cnt = lm4.predict(X_train_new3)

In [None]:
fig = plt.figure()
sns.histplot((y_train - y_train_cnt), bins = 20, kde = True)   # histplot replaces the deprecated distplot (needs seaborn >= 0.11)
fig.suptitle('Error Terms', fontsize = 20)                  # Plot heading 
plt.xlabel('Errors', fontsize = 18)                         # X-label
plt.show()

## Step 8: Making Predictions Using the Final Model

Now that we have fitted the model and checked that the error terms are approximately normally distributed,
it's time to go ahead and make predictions using the final, i.e. fourth, model (lm4).

### Let's perform the same operations (such as scaling) on the test data set and make predictions

In [None]:
bike_test[num_cols] = scaler.transform(bike_test[num_cols]) ## Only transform (not fit_transform) the test set, because the scaler is already fitted on the train set

In [None]:
bike_test.describe() ## Describe function to see the values again

In [None]:
bike_test.head()

In [None]:
## Dividing data set into X and y variables 
y_test = bike_test.pop('cnt') 
X_test = bike_test

In [None]:
X_test = sm.add_constant(X_test) ## Adding the constant 

In [None]:
X_test.columns  ## Getting test columns

In [None]:
X_train_new3.columns ## Getting final model training data set columns

In [None]:
X_train_new3 = X_train_new3.drop(['const'], axis=1) ## Dropping the constant columns

In [None]:
X_train_new3.columns ## Getting columns

In [None]:
X_test = X_test[X_train_new3.columns] ## Modifying test data set with same columns as train set

In [None]:
X_test.columns

In [None]:
X_train_new3.columns

In [None]:
X_test = sm.add_constant(X_test) ## Adding constant

In [None]:
X_test.head()

#### Predicting values

In [None]:
y_pred = lm4.predict(X_test) 

In [None]:
y_pred

### Getting the R-squared for the test set

In [None]:
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error, mean_absolute_error
r2_score(y_true= y_test ,y_pred=y_pred)

In [None]:
print('R-Squared',round(r2_score(y_test, y_pred),2))
print('RMSE',round(np.sqrt(mean_squared_error(y_test, y_pred)),4))   ## square root of the MSE, i.e. the RMSE
print('Mean Absolute Error',mean_absolute_error(y_test, y_pred))

### Note: the difference between the training-set R-squared and the test-set R-squared above is less than 5%, which is within an acceptable range
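
For reference, the two headline metrics can be computed by hand: R-squared is 1 - SS_res / SS_tot and RMSE is the square root of the mean squared error. The helpers `r_squared` and `rmse` below are hypothetical and use made-up numbers; they only illustrate the formulas behind `r2_score` and `mean_squared_error`.

```python
import numpy as np

def r_squared(y_true, y_hat):
    """1 - (residual sum of squares) / (total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    ss_res = np.sum((y_true - y_hat) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

def rmse(y_true, y_hat):
    """Square root of the mean squared error, in the units of the target."""
    y_true = np.asarray(y_true, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_hat) ** 2)))

y_true = [1.0, 2.0, 3.0, 4.0]
y_hat = [1.0, 2.0, 3.0, 5.0]   # one prediction is off by 1

print(r_squared(y_true, y_hat), rmse(y_true, y_hat))
```

Since cnt was min-max scaled, an RMSE computed on the notebook's test set is expressed on that 0-1 scale rather than in raw bike counts.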

## Step 9: Model Evaluation

Let's now plot the graph for actual vs predicted values.

In [None]:
fig = plt.figure()
plt.scatter(y_test,y_pred)
fig.suptitle('y_test vs y_pred', fontsize=20)              # Plot heading 
plt.xlabel('y_test', fontsize=18)                          # X-label
plt.ylabel('y_pred', fontsize=16)                          # Y-label
plt.show()

#### Getting the priority of the predictors which influence demand

In [None]:
param = pd.DataFrame(lm4.params)
param.rename(columns = {0:'Coefficient value'},inplace = True)
param.insert(0,'Variables',param.index)
param.sort_values(by = 'Coefficient value',ascending = False,inplace = True)
param.set_index('Variables',inplace = True)
param

We can see that the equation of our best fitted line is:

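$$
\begin{aligned}
\text{cnt} = {} & 0.125926 + 0.548008 \times \text{temp} + 0.232861 \times \text{yr} + 0.129345 \times \text{season\_winter} \\
& + 0.101195 \times \text{mnth\_September} + 0.088080 \times \text{season\_summer} - 0.078375 \times \text{weathersit\_Mist \& Cloudy} \\
& - 0.098685 \times \text{holiday} - 0.153246 \times \text{windspeed} - 0.282869 \times \text{weathersit\_Light Rain \& Snow}
\end{aligned}
$$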

In [None]:
r2_score(y_true= y_test ,y_pred=y_pred)

Inferences :
- 1) A positive coefficient means cnt increases as the variable increases, and a negative coefficient means cnt decreases 
- 2) Among the positive predictors, temp is the strongest, followed by yr and season_winter
- 3) Among the negative predictors, weathersit_Light Rain & Snow is the strongest, followed by windspeed
      
What we can interpret:
   - temp is the top variable and pushes the count of shared bikes up. A unit increase in (scaled) temp is estimated to increase cnt by about 0.548 (scaled) units, keeping all the other attributes constant.
   - The yr coefficient is about 0.233. It indicates that demand (cnt) was higher in 2019 than in 2018, keeping the other variables constant.
   - Bad weather and windspeed decrease the demand for bikes, i.e. they are not favourable for bike rentals
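
Because temp and cnt were both min-max scaled, the temp coefficient is expressed in scaled units. It can be converted back to raw units with coefficient × (cnt range) / (temp range). The sketch below uses the coefficient from the fitted equation but hypothetical ranges, since the notebook does not print the training min and max; the real ranges could be recovered from the fitted scaler.

```python
# Coefficient of temp from the final model (scaled units).
coef_temp = 0.548008

# Hypothetical training-set ranges, chosen only to show the arithmetic.
cnt_min, cnt_max = 100.0, 8100.0   # assumed raw range of cnt
temp_min, temp_max = 5.0, 35.0     # assumed raw range of temp

# One raw unit of temp moves scaled temp by 1 / (temp_max - temp_min);
# one scaled unit of cnt corresponds to (cnt_max - cnt_min) raw rentals.
rentals_per_degree = coef_temp * (cnt_max - cnt_min) / (temp_max - temp_min)
print(round(rentals_per_degree, 2))
```

Under these assumed ranges, one extra unit of raw temp corresponds to roughly 146 additional rentals, other variables held constant. With the real training ranges the figure would differ.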
   
#### To summarize, the business is growing year over year and temperature plays a major role in bike rentals. Season and weather are good predictors of bike sharing demand. Also, bike sharing is lower on holidays. 
      