# Bike Sharing using Linear Regression



This is my first attempt on Kaggle.

I learnt a lot during this development. The notebook covers:

1. Exploratory data analysis
2. Creating dummy variables for the categorical variables
3. Splitting the data into train and test sets
4. Scaling using MinMaxScaler
5. General assessment of the data using statsmodels
6. Analysis using RFE
7. Linear regression prediction
8. Evaluation using R2
9. The most influential factors

# Problem Statement
__A US bike-sharing provider BoomBikes has recently suffered considerable dips in their revenues due to the ongoing Corona pandemic. The company is finding it very difficult to sustain in the current market scenario. So, it has decided to come up with a mindful business plan to be able to accelerate its revenue as soon as the ongoing lockdown comes to an end, and the economy restores to a healthy state__

# Business Goal:
_You are required to model the demand for shared bikes with the available independent variables. It will be used by the management to understand how exactly the demands vary with different features. They can accordingly manipulate the business strategy to meet the demand levels and meet the customer's expectations. Further, the model will be a good way for management to understand the demand dynamics of a new market_

In [None]:
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import mean_squared_error, r2_score

%matplotlib inline

In [None]:
# Importing day.csv
df = pd.read_csv('/kaggle/input/boombikes/day.csv')
df.head()

In [None]:
df.info()

In [None]:
df.drop(['instant'],axis=1,inplace=True)

In [None]:
df.drop(['dteday'],axis=1,inplace=True)

In [None]:
df.drop(['casual','registered'],axis=1,inplace=True)

In [None]:
df.workingday.value_counts()

In [None]:
df.holiday.value_counts()

In [None]:
df.season.value_counts()

In [None]:
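# Assign the result back; an in-place replace on a selected column may not update df in newer pandas
df['season'] = df['season'].replace({1:"spring",2:"summer",3:"fall",4:"winter"})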
df.head()

In [None]:
df.weathersit.value_counts()

In [None]:
df['weathersit'] = df['weathersit'].replace({1:"Clear Clouds",2:"Mist cloudy",3:"Light rain",4:'Heavy Rain'})
df.head()

In [None]:
df.weathersit.value_counts()

In [None]:
df['weekday'] = df['weekday'].replace({0:"Sunday",1:"Monday",2:"Tuesday",3:"Wednesday",4:"Thursday",5:"Friday",6:"Saturday"})
df.head()

In [None]:
df['mnth'] = df['mnth'].replace({1:'Jan', 2:'Feb', 3:'Mar', 4:'Apr',5:'May', 6:'Jun', 7:'Jul', 8:'Aug',9:'Sep', 10:'Oct', 11:'Nov', 12:'Dec'})
df.head()

In [None]:
df[['temp','atemp','hum','windspeed','cnt']]=df[['temp','atemp','hum','windspeed','cnt']].apply(pd.to_numeric)

In [None]:
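# distplot is deprecated in recent seaborn releases; histplot with kde=True draws a similar plot
sns.histplot(df.cnt, bins = 20, kde = True)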
plt.show()

In [None]:
sns.pairplot(df,vars =['temp','atemp','hum','windspeed','cnt'],hue="weathersit")
plt.show()

### cnt is highly correlated with both temp and atemp, and temp and atemp are highly correlated with each other. One of the two (preferably temp) can be removed to avoid multicollinearity.
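
A minimal sketch of how such a pair could be flagged numerically rather than by eye. The data here is synthetic (not the real `day.csv` values), and `correlated_pairs` and its `threshold` argument are hypothetical names used only for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the bike data (made-up values, not day.csv)
rng = np.random.default_rng(0)
temp = rng.uniform(0, 35, size=200)
demo = pd.DataFrame({
    "temp": temp,
    "atemp": temp * 1.1 + rng.normal(0, 0.5, size=200),  # nearly a rescaled copy of temp
    "hum": rng.uniform(20, 100, size=200),               # unrelated to temp
})

def correlated_pairs(frame, threshold=0.9):
    """Return (col_a, col_b, r) for column pairs whose |Pearson r| exceeds threshold."""
    corr = frame.corr()
    cols = list(corr.columns)
    pairs = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            r = corr.loc[a, b]
            if abs(r) > threshold:
                pairs.append((a, b, round(float(r), 3)))
    return pairs

pairs = correlated_pairs(demo)
print(pairs)
```

Any pair returned here is a candidate for dropping one of its two columns before fitting a linear model.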

In [None]:
df.plot(kind='scatter', x='atemp', y='cnt', alpha=0.5)
plt.show()

__There is a good correlation between atemp and cnt.__

In [None]:
sns.lmplot(x='atemp', y='cnt', data=df, aspect=1.5, scatter_kws={'alpha':0.8})

In [None]:
df.boxplot(column='cnt', by='season')
plt.show()

### Count is highest in the fall season, followed by summer and winter.
### Spring has the lowest count.


In [None]:
plt.figure(figsize=(30, 15))
plt.subplot(2,3,1)
sns.boxplot(x = 'yr', y = 'cnt', data = df)
plt.subplot(2,3,2)
sns.boxplot(x = 'mnth', y = 'cnt', data = df)
plt.subplot(2,3,3)
sns.boxplot(x = 'holiday', y = 'cnt', data = df)
plt.subplot(2,3,4)
sns.boxplot(x = 'workingday', y = 'cnt', data = df)
plt.subplot(2,3,5)
sns.boxplot(x = 'season', y = 'cnt', data = df)
plt.subplot(2,3,6)
sns.boxplot(x = 'weekday', y = 'cnt', data = df)
plt.show()
sns.boxplot(x = 'weathersit', y = 'cnt', data = df)
plt.show()

In [None]:
plt.figure(figsize=(10,6))
sns.boxplot(x = 'mnth', y = 'cnt', data = df)
plt.show()

#### Monthly counts are higher in March, April, September and October.

In [None]:
plt.figure(figsize=(10,6))
sns.boxplot(x = 'mnth', y = 'cnt',hue='season', data = df)
plt.show()

### Business is good in summer, winter and fall.

In [None]:
plt.figure(figsize=(10,6))
sns.boxplot(x = 'mnth', y = 'cnt',hue='weathersit', data = df)
plt.show()


#### Counts are highest in clear weather, followed by misty, cloudy weather.

In [None]:
plt.figure(figsize=(10,6))
sns.boxplot(x = 'mnth', y = 'cnt',hue='holiday', data = df)
plt.show()


#### Business is good during holidays.

In [None]:
plt.figure(figsize=(10,6))
sns.boxplot(x = 'season', y = 'cnt',hue='weathersit', data = df)
plt.show()

In [None]:
plt.figure(figsize=(10,6))
sns.boxplot(x = 'weekday', y = 'cnt',hue='weathersit', data = df)
plt.show()

#### Counts are good on all weekdays, except on rainy days.
#### Saturdays and Wednesdays show good counts.


In [None]:
plt.figure(figsize = (16, 10))
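# season, weathersit, weekday and mnth now hold strings, so correlate the numeric columns only
sns.heatmap(df.select_dtypes(include='number').corr(), annot = True, cmap="YlGnBu")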
plt.show()

In [None]:
df.drop('temp',axis=1, inplace = True)

#### temp and atemp are highly correlated. It is better to drop the temp column to avoid multicollinearity.

## Creating the dummy variables for the categorical variables
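
A minimal sketch of what `drop_first=True` does, on a tiny made-up `season` column. The dropped category becomes the baseline that the remaining dummy coefficients are measured against. The `dtype=int` argument is my assumption for getting 0/1 columns instead of booleans.

```python
import pandas as pd

# Tiny made-up column, only to show the encoding
demo = pd.DataFrame({"season": ["spring", "summer", "fall", "winter", "fall"]})

# Categories are taken in sorted order, so drop_first=True drops "fall",
# which then acts as the baseline season.
dummies = pd.get_dummies(demo["season"], drop_first=True, dtype=int)
print(dummies.columns.tolist())

# A "fall" row is encoded as all zeros across the remaining columns
print(dummies.iloc[2].tolist())
```

With this encoding, a negative coefficient on `spring` would read as "lower demand than in fall", not lower demand in absolute terms.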

In [None]:
#Convert variables to object type
df['mnth']=df['mnth'].astype(object)
df['season']=df['season'].astype(object)
df['weathersit']=df['weathersit'].astype(object)
df['weekday']=df['weekday'].astype(object)
df.info()

In [None]:
# dtype=int keeps the dummies as 0/1 numbers (newer pandas defaults to booleans)
seasons=pd.get_dummies(df['season'],drop_first=True,dtype=int)
df=pd.concat([df,seasons],axis=1)
df.drop('season',axis=1,inplace = True)


In [None]:
weather=pd.get_dummies(df['weathersit'],drop_first=True,dtype=int)
df=pd.concat([df,weather],axis=1)
df.drop('weathersit',axis=1,inplace = True)


In [None]:
wkday=pd.get_dummies(df['weekday'],drop_first=True,dtype=int)
df=pd.concat([df,wkday],axis=1)
df.drop('weekday',axis=1,inplace = True)

In [None]:
month=pd.get_dummies(df['mnth'],drop_first=True,dtype=int)
df=pd.concat([df,month],axis=1)
df.drop('mnth',axis=1,inplace = True)

## Splitting the train and test data 
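
A minimal sketch, on made-up data, showing that an integer `random_state` alone is enough to make the split reproducible; a separate NumPy seed is not needed for this.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up frame with 100 rows
demo = pd.DataFrame({"x": np.arange(100), "y": np.arange(100) * 2})

# Same random_state -> same rows in each split, run after run
train_a, test_a = train_test_split(demo, train_size=0.7, random_state=100)
train_b, test_b = train_test_split(demo, train_size=0.7, random_state=100)

print(train_a.shape, test_a.shape)
print(train_a.index.equals(train_b.index))
```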

In [None]:
from sklearn.model_selection import train_test_split

# random_state fixes the shuffle, so the train and test sets contain the same rows on every run
df_train, df_test = train_test_split(df, train_size = 0.7, test_size = 0.3, random_state = 100)

In [None]:
df_train.shape

In [None]:
df_test.shape

In [None]:
df.shape

## Scaling using MinMaxScaler
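
Min-max scaling maps each value to `(x - min) / (max - min)`. A minimal sketch on made-up numbers, assuming the usual practice of learning `min` and `max` from the training data and reusing them for the test data:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up single-column data
train = np.array([[0.0], [5.0], [10.0]])
test = np.array([[2.5], [12.0]])

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train)  # learns min=0 and max=10 from the training data
test_scaled = scaler.transform(test)        # reuses the training min and max

print(train_scaled.ravel())
print(test_scaled.ravel())
```

A test value above the training maximum (12 here) lands above 1 after scaling. That is expected, and it keeps the test set on the same scale the model was trained on.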

In [None]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()

In [None]:
num_vars = ['atemp','hum','windspeed','cnt']

df_train[num_vars] = scaler.fit_transform(df_train[num_vars])

In [None]:
# Remove the 'cnt' column from the training set and assign it to y_train
y_train = df_train.pop('cnt')
X_train = df_train

In [None]:
X_train.head()

In [None]:
y_train.head()

## General assessment of the data using statsmodels

In [None]:
# Importing RFE and LinearRegression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

In [None]:
import statsmodels.api as sm
X_train_lm = sm.add_constant(X_train)

lr_1 = sm.OLS(y_train, X_train_lm).fit()

lr_1.params

In [None]:
print(lr_1.summary())

In [None]:
# Create a dataframe that will contain the names of all the feature variables and their respective VIFs
from statsmodels.stats.outliers_influence import variance_inflation_factor
vif = pd.DataFrame()
vif['Features'] = X_train.columns
vif['VIF'] = [variance_inflation_factor(X_train.values, i) for i in range(X_train.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

___Many variables show a VIF greater than 5, so RFE is used for variable selection.___
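
For reference, the VIF of a feature is `1 / (1 - R^2)`, where `R^2` comes from regressing that feature on all the other features. Below is a minimal NumPy sketch on synthetic data. The helper `vif` is hypothetical, and it includes an intercept in each regression, so its values may differ from what `variance_inflation_factor` returns on a matrix without a constant column.

```python
import numpy as np

def vif(X):
    """VIF_j = 1 / (1 - R_j^2), regressing column j on the other columns plus an intercept."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = []
    for j in range(k):
        y = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ coef
        r2 = 1.0 - resid.var() / y.var()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

# Synthetic features: b is almost a copy of a, c is independent
rng = np.random.default_rng(1)
a = rng.normal(size=300)
b = a + rng.normal(scale=0.1, size=300)
c = rng.normal(size=300)

vifs = vif(np.column_stack([a, b, c]))
print(np.round(vifs, 2))
```

The two nearly collinear columns get a large VIF, while the independent one stays close to 1.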

## Analysis using RFE 
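
A minimal sketch of recursive feature elimination on synthetic data, where only the first two columns actually drive the target. RFE repeatedly fits the estimator and drops the weakest feature until `n_features_to_select` remain. The data and effect sizes are made up for illustration.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
# Only columns 0 and 1 influence y; the other four are pure noise
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

rfe = RFE(estimator=LinearRegression(), n_features_to_select=2)
rfe.fit(X, y)

print(rfe.support_)   # boolean mask of the kept features
print(rfe.ranking_)   # 1 for kept features, larger numbers were eliminated earlier
```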

In [None]:
# Running RFE with the number of output variables equal to 15
lm = LinearRegression()
lm.fit(X_train, y_train)

# n_features_to_select must be passed by keyword in recent scikit-learn versions
rfe = RFE(lm, n_features_to_select=15)             # running RFE
rfe = rfe.fit(X_train, y_train)

In [None]:
list(zip(X_train.columns,rfe.support_,rfe.ranking_))

In [None]:
## Columns selected by RFE
col = X_train.columns[rfe.support_]
col

In [None]:
X_train.columns[~rfe.support_]

In [None]:
X_train_rfe = X_train[col]

In [None]:
# Adding a constant variable 
import statsmodels.api as sm  
X_train_rfe = sm.add_constant(X_train_rfe)

In [None]:
# Running the linear model
lm = sm.OLS(y_train,X_train_rfe).fit()   

In [None]:
#Let's see the summary of our linear model
print(lm.summary())

In [None]:
X_train_rfe = X_train_rfe.drop(['const'], axis=1)

In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

vif = pd.DataFrame()
X = X_train_rfe
vif['Features'] = X.columns
vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

In [None]:
# Removing the humidity variable
X_train_hum = X_train_rfe.drop(["hum"], axis = 1)
X_train_hum.head()

In [None]:
# Adding a constant variable 
import statsmodels.api as sm  
X_train_lm = sm.add_constant(X_train_hum)

In [None]:
lm = sm.OLS(y_train,X_train_lm).fit() 

In [None]:
#Let's see the summary of our linear model
print(lm.summary())

In [None]:
X_train_lm = X_train_lm.drop(['const'], axis=1)

In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

vif = pd.DataFrame()
X = X_train_lm 
vif['Features'] = X.columns
vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

In [None]:
# Removing the Saturday variable, whose p-value is > 0.05
X_train_temp = X_train_lm.drop(["Saturday"], axis =1)
X_train_temp.head()

In [None]:
# Adding a constant variable 
import statsmodels.api as sm  
X_train_lm1 = sm.add_constant(X_train_temp)

In [None]:
lm = sm.OLS(y_train,X_train_lm1).fit() 
# Let's see the summary of our linear model
print(lm.summary())

In [None]:
## Predicting the Y train using X Train
y_train_pred = lm.predict(X_train_lm1)

In [None]:
# Plot the histogram of the error terms
fig = plt.figure()
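# distplot is deprecated in recent seaborn releases; histplot with kde=True draws a similar plot
sns.histplot((y_train - y_train_pred), bins = 20, kde = True)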
fig.suptitle('Error Terms', fontsize = 20)                  # Plot heading 
plt.xlabel('Errors', fontsize = 18)                         # X-label

In [None]:
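num_vars = ['atemp','hum','windspeed','cnt']

# Reuse the scaler fitted on the training data; refitting on the test set would put it on a different scale
df_test[num_vars] = scaler.transform(df_test[num_vars])
df_test.head()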

In [None]:
y_test = df_test.pop('cnt')
X_test = df_test
y_test.head()

In [None]:
X_test_new = X_test[X_train_temp.columns]

In [None]:
X_test_new.shape

In [None]:
# Adding constant variable to test dataframe
X_test_m4 = sm.add_constant(X_test_new)

In [None]:
y_pred = lm.predict(X_test_m4)

In [None]:
# Plotting y_test and y_pred to understand the spread.
fig = plt.figure()
plt.scatter(y_test,y_pred)
fig.suptitle('y_test vs y_pred', fontsize=20)              # Plot heading 
plt.xlabel('y_test', fontsize=18)                          # X-label
plt.ylabel('y_pred', fontsize=16)                          # Y-label

In [None]:
# Calculate R-squared on the test set
r_squared = r2_score(y_test, y_pred)
print('R Square value',r_squared)

In [None]:
print(lm.summary())


# Feedback

## The most influential factors

### Year

### atemp (feeling temperature in Celsius)

## Month

___*September has a positive influence, possibly because the fall climate is ending and the winter climate is starting.___

___*Jan, Jul, Nov and Dec have a negative influence, possibly because of mixed weather along with rain.___

## Season
___*Winter has a positive influence; people may enjoy the start of the winter climate.___

___*Spring has a negative influence.___

## Weathersit
___*Misty, cloudy weather has a negative influence and reduces business.___

___*Light rain has a negative influence and reduces business.___

## Others
___*Holidays have a negative influence, possibly because people use other modes of transport.___

___*Windspeed plays a major role: stronger wind makes cycling more difficult.___
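
A minimal sketch of how positive and negative influences like those above can be read from a fitted linear model: when the features share a 0 to 1 scale, the sign of each coefficient gives the direction and its absolute value gives a rough ranking. The data and effect sizes below are made up for illustration and are not the notebook's results.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 300

# Hypothetical features, already on a 0-1 scale
demo = pd.DataFrame({
    "atemp": rng.uniform(0, 1, n),
    "windspeed": rng.uniform(0, 1, n),
    "holiday": rng.integers(0, 2, n),
})

# Made-up effect sizes, chosen only for this illustration
y = (0.5 * demo["atemp"]
     - 0.15 * demo["windspeed"]
     - 0.08 * demo["holiday"]
     + rng.normal(0, 0.02, n))

model = LinearRegression().fit(demo, y)
coefs = pd.Series(model.coef_, index=demo.columns)

# Order the coefficients by absolute size, keeping their signs
ranked = coefs.reindex(coefs.abs().sort_values(ascending=False).index)
print(ranked.round(2))
```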
