# Bike Sharing Assignment

## Problem Statement

A bike-sharing system is a service in which bikes are made available for shared use to individuals on a short-term basis, for a fee or for free. Many bike-share systems allow people to borrow a bike from a "dock", which is usually computer-controlled: the user enters payment information and the system unlocks a bike. The bike can then be returned to another dock belonging to the same system.

A US bike-sharing provider, BoomBikes, has recently suffered considerable dips in revenue due to the ongoing Corona pandemic. The company is finding it very difficult to sustain itself in the current market. It has therefore decided to come up with a mindful business plan to accelerate its revenue as soon as the ongoing lockdown ends and the economy returns to a healthy state.

To that end, BoomBikes wants to understand the demand for shared bikes once the nationwide Covid-19 quarantine ends. The aim is to be ready to cater to people's needs when the situation improves, stand out from other service providers, and make substantial profits.

They have contracted a consulting company to understand the factors on which the demand for these shared bikes depends. Specifically, they want to understand the factors affecting the demand for these shared bikes in the American market. The company wants to know:

- Which variables are significant in predicting the demand for shared bikes.
- How well those variables describe the bike demand.

Based on various meteorological surveys and people's styles, the service provider has gathered a large dataset on daily bike demand across the American market.

## Business Goals

You are required to model the demand for shared bikes with the available independent variables. The model will be used by the management to understand how exactly demand varies with different features. They can accordingly adjust the business strategy to meet demand levels and customers' expectations. Further, the model will be a good way for management to understand the demand dynamics of a new market.

## Step 1 : Reading and Understanding the Data

In [None]:
import warnings
warnings.filterwarnings('ignore')

In [None]:
#Importing Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

import sklearn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import RFE
from sklearn.metrics import r2_score

from math import sqrt

In [None]:
#Reading the Data Set

Bike = pd.read_csv('../input/bike-sharing-data/day.csv')

In [None]:
#Understanding the Data Set

Bike.head()

In [None]:
#Describing the Data Set

Bike.describe()

In [None]:
# Rows and Columns of the Data Set

Bike.shape

In [None]:
#Information in the Data

Bike.info()

Since the mean and median of each numeric column are close to each other, we can infer that outliers are minimal.
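The mean-versus-median comparison can be made explicit. The sketch below is only an illustration on made-up numbers, not the actual `day.csv` values; the helper name `outlier_report` is hypothetical. It reports the relative gap between mean and median and counts points outside the usual 1.5 × IQR fences.

```python
import pandas as pd

# Toy stand-in for two numeric columns of day.csv (values are made up).
toy = pd.DataFrame({
    "cnt": [985, 801, 1349, 1562, 1600, 1606, 1510, 959],
    "hum": [80.6, 69.6, 43.7, 59.0, 43.7, 51.8, 49.9, 53.6],
})

def outlier_report(df):
    """Per column: mean, median, their relative gap, and the 1.5*IQR outlier count."""
    q1 = df.quantile(0.25)
    q3 = df.quantile(0.75)
    iqr = q3 - q1
    # Boolean frame marking values outside the IQR fences, summed per column.
    outside = (df.lt(q1 - 1.5 * iqr) | df.gt(q3 + 1.5 * iqr)).sum()
    return pd.DataFrame({
        "mean": df.mean(),
        "median": df.median(),
        "rel_gap": (df.mean() - df.median()).abs() / df.median().abs(),
        "iqr_outliers": outside,
    })

report = outlier_report(toy)
print(report)
```

On the real data, the same helper could be called as `outlier_report(Bike[['cnt','temp','atemp','hum','windspeed']])`.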

## Step 2 : Data Visualisation


#### Let's check the Numeric Variables Initially
#### Pairplot of the numeric Variables

In [None]:
sns.pairplot(data=Bike,vars=['cnt','temp','atemp','hum','windspeed'])
plt.show()

#### Let's check the Categorical Variables
#### Boxplot of the Categorical Variables

In [None]:
plt.figure(figsize=(20,20))
plt.subplot(2,3,1)
sns.boxplot(x='season', y='cnt', data = Bike)
plt.subplot(2,3,2)
sns.boxplot(x = 'mnth', y = 'cnt', data = Bike)
plt.subplot(2,3,3)
sns.boxplot(x = 'holiday', y = 'cnt', data = Bike)
plt.subplot(2,3,4)
sns.boxplot(x = 'weathersit', y = 'cnt', data = Bike)
plt.subplot(2,3,5)
sns.boxplot(x = 'yr', y = 'cnt', data = Bike)
plt.subplot(2,3,6)
sns.boxplot(x = 'weekday', y = 'cnt', data = Bike)
plt.show()

In [None]:
plt.figure(figsize=(10,8))
sns.boxplot(x='workingday',y='cnt',data = Bike)
plt.show()

### Conclusions:
###### Bike-sharing counts are lowest in the spring season
###### The number of bike shares increased in 2019
###### Counts increase during the summer season
###### Counts decrease on holidays

## Step 3 : Data Preparation

#### Here we will drop the columns which are irrelevant for Model Building

In [None]:
# Dropping the Columns

Bike.drop(['instant','dteday','casual','registered'], axis=1 , inplace = True)

In [None]:
# Verifying the Head of the Data Set with the Columns Dropped
Bike.head()

In [None]:
# Next, we will convert the Numeric Values into Categorical Data
import calendar
Bike['mnth']=Bike['mnth'].apply(lambda x: calendar.month_abbr[x])

In [None]:
# Mapping Seasons Values into Names
Bike.season = Bike.season.map({1:'Spring',2:'Summer',3:'Fall',4:'Winter'})

In [None]:
# Mapping WeatherSit Values into Names
Bike.weathersit = Bike.weathersit.map({1:'Clear',2:'Mist_&_Cloudy',3:'Light_Snow_&_Rain',4:'Heavy_Snow_&_Rain'})

In [None]:
# Mapping Weekday Values into Days
# weekday is coded 0-6 in day.csv (assuming 0 = Sunday)
Bike.weekday = Bike.weekday.map({0:'Sunday',1:'Monday',2:'Tuesday',3:'Wednesday',4:'Thursday',5:'Friday',6:'Saturday'})

In [None]:
# Verifying the Data Set
Bike.head()
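`Series.map` returns `NaN` for any code that is missing from the mapping dictionary, so it is worth confirming that every code was translated. A minimal sketch on made-up codes (the series and the deliberately incomplete dictionary are illustrative only):

```python
import pandas as pd

codes = pd.Series([0, 1, 2, 6], name="weekday")   # made-up sample of integer codes
labels = {1: "Mon", 2: "Tue", 6: "Sat"}           # deliberately incomplete mapping

mapped = codes.map(labels)

# Codes that did not find a label end up as NaN after map().
unmapped_codes = sorted(codes[mapped.isna()].unique().tolist())
print(unmapped_codes)
```

On the notebook's frame, a quick equivalent is `Bike[['season','mnth','weathersit','weekday']].isna().sum()`, which should show zero for every column.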

### Creating Dummy Variables

In [None]:
# Creating Dummy Variables for the Columns Season , Mnth , WeatherSit , Weekday
Dummy = Bike[['season','mnth','weathersit','weekday']]
Dummy = pd.get_dummies(Dummy,drop_first=True,dtype=int)   # dtype=int keeps the dummies numeric for statsmodels
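To see what `drop_first=True` does, here is a small sketch on a made-up `season` column. The first category in sorted order (`Fall` here) gets no column of its own and becomes the baseline, which is represented by all dummies being zero.

```python
import pandas as pd

toy = pd.DataFrame({"season": ["Spring", "Summer", "Fall", "Winter", "Fall"]})

# Four categories produce three dummy columns; 'Fall' is the dropped baseline.
dummies = pd.get_dummies(toy, drop_first=True, dtype=int)
print(dummies)
```

Dropping one level avoids perfect collinearity between the dummies and the intercept in a linear model.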

In [None]:
# Appending Dummy Variables to the Original Data Set
Bike = pd.concat([Dummy,Bike],axis=1)

In [None]:
# Verify the Data Set after addition of Dummy
Bike.head()

In [None]:
# Dropping the Columns against which the Dummy Variables are created
Bike.drop(['season','mnth','weathersit','weekday'], axis = 1 , inplace = True)

In [None]:
# Verifying the Data Set
Bike.head()

In [None]:
Bike.shape

## Step 4 : Splitting the Data into Training and Testing Sets


#### Performing Train-Test Split

In [None]:
train, test = train_test_split(Bike,train_size = 0.7, test_size=0.3 , random_state = 100)

## Step 5 : Rescaling

In [None]:
Scaler = MinMaxScaler()

In [None]:
# Performing Scaler() function to all the columns except the Dummy Variables in the Data Set

Num_V = ['cnt','hum','windspeed','temp','atemp']

train[Num_V] = Scaler.fit_transform(train[Num_V])
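With its default range, `MinMaxScaler` maps each column to `(x - min) / (max - min)`, where `min` and `max` are learned from the data passed to `fit`. The hand-written sketch below uses made-up temperatures and plain NumPy rather than scikit-learn; it shows why the minimum and maximum should come from the training data only and then be reused for the test data.

```python
import numpy as np

train_temp = np.array([2.4, 14.1, 20.0, 35.3])   # made-up training values
test_temp = np.array([10.0, 36.0])               # made-up test values

# "fit": learn the range from the training data only
lo, hi = train_temp.min(), train_temp.max()

def min_max(x):
    """Apply the training-set range to any array."""
    return (x - lo) / (hi - lo)

train_scaled = min_max(train_temp)   # lies in [0, 1] by construction
test_scaled = min_max(test_temp)     # may fall slightly outside [0, 1]
print(train_scaled, test_scaled)
```

A test value above the training maximum scales to slightly more than 1, which is expected and harmless for a linear model.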

In [None]:
# Verifying the Data Head

train.head()

In [None]:
train.describe()

In [None]:
# Verifying the Correlation Coefficients to understand the Variables which are highly correlated to each other

plt.figure(figsize=(20,15))
sns.heatmap(train.corr(), annot = True, cmap = "YlGnBu")
plt.show()

##### Here, we will try to Build Models using all the Columns

### Next, we will divide the X and Y Sets for the Model Building

In [None]:
y_train = train.pop('cnt')
X_train = train

## Step 6 :  Building a Linear Model 

##### We will first shortlist variables with RFE, then fit a Regression Line through the Training Data Set using statsmodels. 
##### statsmodels does not add an intercept by default, so we will explicitly add a constant using sm.add_constant(X). 

In [None]:
# Running RFE with the number of output variables equal to 13
lm = LinearRegression()
lm.fit(X_train,y_train)

#Running RFE
rfe = RFE(lm, n_features_to_select=13)
rfe = rfe.fit(X_train,y_train)

In [None]:
# Columns for RFE and their weights

list(zip(X_train.columns , rfe.support_, rfe.ranking_))
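The idea behind recursive feature elimination can be sketched by hand: fit a model, drop the weakest feature, and repeat until the desired number remains. The code below is a simplified illustration in plain NumPy on synthetic data, not scikit-learn's implementation; the function name `simple_rfe` is hypothetical.

```python
import numpy as np

def simple_rfe(X, y, n_keep):
    """Repeatedly fit least squares and drop the feature with the smallest |coefficient|."""
    active = list(range(X.shape[1]))
    while len(active) > n_keep:
        # Design matrix: intercept column plus the currently active features.
        A = np.column_stack([np.ones(len(y)), X[:, active]])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        weakest = int(np.argmin(np.abs(coef[1:])))   # skip the intercept
        active.pop(weakest)
    return active

# Synthetic data: only columns 0 and 3 actually drive y.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.1, size=200)

selected = simple_rfe(X, y, n_keep=2)
print(selected)
```

Ranking by coefficient size is only meaningful when the features share a comparable scale, which is one reason the numeric columns are rescaled before this step.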

#### Initiating with the Columns selected by RFE

#### Model

In [None]:
# Display the Columns selected by RFE for Manual Elimination

Columns = X_train.columns[rfe.support_]
Columns

In [None]:
X_train.columns[~rfe.support_]

In [None]:
# Creating X_train Data Frame with RFE selected variables

X_train_rfe = X_train[Columns]

# Adding a Constant Variable

X_train_rfe = sm.add_constant(X_train_rfe)

In [None]:
# Fitting the Model

lm = sm.OLS(y_train,X_train_rfe).fit()

In [None]:
# Verifying the Parameters 

lm.params

In [None]:
# Print a summary of the linear regression model obtained
print(lm.summary())

In [None]:
# Dropping mnth_Jan as it has high P Value

X_train_new = X_train_rfe.drop(["mnth_Jan"], axis = 1)

In [None]:
## Rebuilding the Model 

# Adding a constant variable 
X_train_lm = sm.add_constant(X_train_new)

In [None]:
lm = sm.OLS(y_train,X_train_lm).fit()   # Running the linear model

In [None]:
print(lm.summary())

### Verifying the VIF

In [None]:
vif = pd.DataFrame()
X = X_train_new
vif['Features'] = X.columns
vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

#### Some VIF values are not below 5, so we drop a few of the variables

In [None]:
X_train_new = X_train_new.drop(['const'], axis=1)

In [None]:
# Calculate the VIFs for the new model again
vif = pd.DataFrame()
X = X_train_new
vif['Features'] = X.columns
vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif
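For reference, the variance inflation factor of a feature is `1 / (1 - R²)`, where `R²` comes from regressing that feature on all the others. The sketch below computes it by hand with NumPy on synthetic data. It is an illustration under one assumption worth flagging: each auxiliary regression here includes an intercept, so the numbers may differ from what `variance_inflation_factor` returns for a frame without a constant column.

```python
import numpy as np

def vif(X):
    """VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing column j on the other columns."""
    out = []
    for j in range(X.shape[1]):
        target = X[:, j]
        others = np.column_stack([np.ones(len(target)), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - resid.var() / target.var()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

rng = np.random.default_rng(1)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)     # nearly a copy of a, so a and c are collinear

vifs = vif(np.column_stack([a, b, c]))
print(vifs.round(2))
```

The independent column `b` gets a VIF close to 1, while the nearly duplicated columns get very large values, which is the pattern that motivates dropping one of a collinear pair.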

In [None]:
# Rebuilding Model

### dropping hum from the model
X_train_new = X_train_new.drop(['hum'], axis=1)

In [None]:
# Adding a constant variable 
X_train_lm = sm.add_constant(X_train_new)

# Refit the model without hum
lm = sm.OLS(y_train,X_train_lm).fit()  

In [None]:
# Check the summary
print(lm.summary())

In [None]:
# Calculate the VIFs for the new model again
vif = pd.DataFrame()
X = X_train_new
vif['Features'] = X.columns
vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

## Step 7: Residual Analysis of the train data

Next, to check whether the error terms are normally distributed (which is, in fact, one of the major assumptions of linear regression), let us plot a histogram of the error terms and see what it looks like.

In [None]:
y_train_cnt = lm.predict(X_train_lm)

In [None]:
# Plot the histogram of the error terms
fig = plt.figure()
sns.histplot((y_train - y_train_cnt), bins = 20, kde = True)   # distplot is deprecated in recent seaborn
fig.suptitle('Error Terms', fontsize = 20)                  # Plot heading 
plt.xlabel('Errors', fontsize = 18)                         # X-label
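A histogram gives a visual impression; a few summary numbers can back it up. The sketch below uses synthetic data (the helper name `residual_summary` is hypothetical) and reports the residual mean, skewness and excess kurtosis, all of which should be near zero for roughly normal errors centred on zero.

```python
import numpy as np

def residual_summary(y_true, y_fitted):
    """Mean, skewness and excess kurtosis of the residuals."""
    resid = np.asarray(y_true) - np.asarray(y_fitted)
    z = (resid - resid.mean()) / resid.std()
    return {
        "mean": resid.mean(),
        "skew": (z ** 3).mean(),
        "excess_kurtosis": (z ** 4).mean() - 3.0,
    }

# Synthetic example: fitted values differ from the truth by Gaussian noise.
rng = np.random.default_rng(2)
y_true = rng.normal(size=5000)
y_fitted = y_true - rng.normal(scale=0.1, size=5000)

summary = residual_summary(y_true, y_fitted)
print(summary)
```

In the notebook, the same helper could be called as `residual_summary(y_train, y_train_cnt)`.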

## Step 8 : Making Predictions Using the Final Model

Now that we have fitted the model and checked the normality of error terms, it's time to go ahead and make predictions using the final model.

#### Applying the scaling on the test sets

In [None]:
# Reusing the scaler fitted on the training set so that train and test share the same scale
test[Num_V] = Scaler.transform(test[Num_V])
test.describe()

#### Dividing into X and Y Set

In [None]:
y_test = test.pop('cnt')
X_test = test

In [None]:
# Adding constant variable to test dataframe
X_test = sm.add_constant(X_test)

In [None]:
# predicting using values used by the final model
test_col = X_train_lm.columns
X_test=X_test[test_col[1:]]
# Adding constant variable to test dataframe
X_test = sm.add_constant(X_test)
X_test.info()

In [None]:
# Making predictions using the final model

y_pred = lm.predict(X_test)

In [None]:
from sklearn.metrics import r2_score
r2_score(y_test, y_pred)

In [None]:
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(y_test, y_pred)
mse
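Two related metrics are often reported alongside R² and MSE: the root mean squared error, which is in the same units as the target, and adjusted R², which penalises extra predictors. A minimal sketch on made-up numbers follows; the function name `regression_metrics` is hypothetical.

```python
from math import sqrt

import numpy as np

def regression_metrics(y_true, y_pred, n_features):
    """Return R^2, adjusted R^2 and RMSE for a fitted regression."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n = len(y_true)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - n_features - 1)
    rmse = sqrt(ss_res / n)
    return {"r2": r2, "adj_r2": adj_r2, "rmse": rmse}

# Made-up values, one predictor.
metrics = regression_metrics([1, 2, 3, 4, 5], [1.1, 1.9, 3.2, 3.8, 5.0], n_features=1)
print(metrics)
```

For the notebook's model, `n_features` would be the number of predictors excluding the constant, for example `X_test.shape[1] - 1`.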

### Model Evaluation

In [None]:
# Plotting y_test and y_pred to understand the spread

fig = plt.figure()
plt.scatter(y_test, y_pred)
fig.suptitle('y_test vs y_pred', fontsize = 20)              # Plot heading 
plt.xlabel('y_test', fontsize = 18)                          # X-label
plt.ylabel('y_pred', fontsize = 16) 

In [None]:
param = pd.DataFrame(lm.params)
param.insert(0,'Variables',param.index)
param.rename(columns = {0:'Coefficient value'},inplace = True)
param['index'] = list(range(len(param)))
param.set_index('index',inplace = True)
param.sort_values(by = 'Coefficient value',ascending = False,inplace = True)
param


We can see that the equation of our best fitted line is:

cnt = 0.199648 + 0.491508 × temp + 0.233482 × yr + 0.083084 × season_Winter - 0.066942 × season_Spring - 0.052418 × mnth_Jul + 0.076686 × mnth_Sep - 0.285155 × weathersit_Light Snow & Rain - 0.081558 × weathersit_Mist & Cloudy - 0.098013 × holiday - 0.147977 × windspeed

Positive coefficients, such as those of temp and season_Summer, indicate that an increase in the variable leads to an increase in the value of cnt.
Negative coefficients indicate that an increase in the variable leads to a decrease in the value of cnt.
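To make the signs concrete, the fitted equation can be evaluated for a single day. The coefficients below are copied from the equation above; the input day is hypothetical, and both `temp` and `windspeed` are assumed to be on the min-max scaled 0 to 1 range used during training. The result is the scaled `cnt`, not a raw rental count.

```python
# Coefficients as reported in the fitted equation.
coef = {
    "const": 0.199648,
    "temp": 0.491508,
    "yr": 0.233482,
    "season_Winter": 0.083084,
    "season_Spring": -0.066942,
    "mnth_Jul": -0.052418,
    "mnth_Sep": 0.076686,
    "weathersit_Light_Snow_&_Rain": -0.285155,
    "weathersit_Mist_&_Cloudy": -0.081558,
    "holiday": -0.098013,
    "windspeed": -0.147977,
}

def predict_scaled_cnt(day):
    """Evaluate the linear equation; features missing from `day` are treated as 0."""
    return coef["const"] + sum(coef[name] * value for name, value in day.items())

# Hypothetical clear September day in the second year, with scaled temp and windspeed.
clear_day = {"temp": 0.5, "yr": 1, "mnth_Sep": 1, "windspeed": 0.3}
rainy_day = dict(clear_day, **{"weathersit_Light_Snow_&_Rain": 1})

print(predict_scaled_cnt(clear_day))
print(predict_scaled_cnt(rainy_day))
```

The difference between the two predictions equals the `weathersit_Light_Snow_&_Rain` coefficient, which is how each dummy coefficient should be read: a shift relative to the baseline category with everything else held fixed.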

1. temp has the largest coefficient (0.4915), so it has the strongest influence on demand.

2. It is followed by weathersit_Light Snow & Rain, which has the largest negative coefficient (-0.2852).

3. Bike rentals are higher in the month of September.

4. Rentals drop on holidays.

This indicates that bike rentals are mainly affected by temperature, season and month.