### Bike Sharing Demand
### Problem Statement:
A US bike-sharing provider BoomBikes has recently suffered considerable dips in their revenues. They have contracted a consulting company to understand the factors on which the demand for these shared bikes depends. Specifically, they want to understand the factors affecting the demand for these shared bikes in the American market. The company wants to know:

- Which variables are significant in predicting the demand for shared bikes.
- How well those variables describe the bike demand.

You are required to model the demand for shared bikes with the available independent variables. It will be used by the management to understand how exactly the demands vary with different features. They can accordingly manipulate the business strategy to meet the demand levels and meet the customer's expectations. Further, the model will be a good way for management to understand the demand dynamics of a new market.

In [None]:
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
pd.set_option('display.max_columns',200)
pd.set_option('display.max_rows', 200)

import matplotlib.pyplot as plt
import seaborn as sns

import statsmodels
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

import sklearn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import RFE
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error

### 1. Data Insights

In [None]:
df = pd.read_csv("day.csv")

In [None]:
df.head(20)

In [None]:
df.shape

In [None]:
df.info()

The output of `df.info()` shows that no column in the data set contains null values.

In [None]:
df.describe()

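1. In the description, mean = 365.500 and median = 365.500. These values belong to the `instant` column, which is only a running record index.

2. For an index column the mean and median coincide by construction, so this equality says nothing about outliers.

3. To judge outliers, compare the mean with the 50% row for `cnt`, `temp`, `atemp`, `hum` and `windspeed`; where the two are close, the column has few extreme values.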
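
A more direct outlier check than comparing mean and median is the 1.5 × IQR rule. The sketch below is only an illustration on made-up numbers (the helper name `iqr_outlier_counts` and the toy frame are assumptions, not part of the original notebook); the same function could be applied to the numeric columns of `df`.

```python
import pandas as pd

def iqr_outlier_counts(frame: pd.DataFrame) -> pd.Series:
    """Count values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] for each numeric column."""
    numeric = frame.select_dtypes("number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    outside = (numeric < q1 - 1.5 * iqr) | (numeric > q3 + 1.5 * iqr)
    return outside.sum()

# Toy data (not from day.csv): one obvious outlier in 'cnt', none in 'temp'.
toy = pd.DataFrame({
    "cnt": [10, 12, 11, 13, 12, 100],
    "temp": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})
counts = iqr_outlier_counts(toy)
print(counts)
```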

### 2. Visualizing Data

###### 1. Numerical Data

1. Viewing only the numerical columns
2. Since `casual + registered = cnt`, only the `cnt` column is used
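
The identity between the three count columns can be verified before dropping `casual` and `registered`. A minimal sketch on illustrative rows (the values are made up; only the column names match `day.csv`):

```python
import pandas as pd

# Illustrative rows using the same column names as day.csv.
toy = pd.DataFrame({
    "casual": [10, 25, 40],
    "registered": [90, 175, 260],
    "cnt": [100, 200, 300],
})

# True only if every row satisfies casual + registered == cnt.
identity_holds = bool((toy["casual"] + toy["registered"] == toy["cnt"]).all())
print(identity_holds)
```

On the real data the same expression with `df` in place of `toy` would confirm whether the relationship holds for every row.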

In [None]:
num_vars = ['cnt','temp','atemp','hum','windspeed']
df[num_vars].head()

In [None]:
sns.pairplot(df, vars=num_vars)

###### 2. Categorical Data

In [None]:
plt.figure(figsize=(20,20))

plt.subplot(2,3,1)
sns.boxplot(x = 'season', y = 'cnt', data = df)

plt.subplot(2,3,2)
sns.boxplot(x = 'weathersit', y = 'cnt', data = df)


##### Seasons
- 1 - Spring
- 2 - Summer
- 3 - Fall
- 4 - Winter

##### Weathersit

- 1 - Clear
- 2 - Mist & Cloudy
- 3 - Light Snow & Rain
- 4 - Heavy Snow & Rain

In [None]:
plt.figure(figsize=(20,20))

plt.subplot(2,3,1)
sns.boxplot(x = 'weekday', y = 'cnt', data = df)

plt.subplot(2,3,2)
sns.boxplot(x = 'workingday', y = 'cnt', data = df)

plt.subplot(2,3,3)
sns.boxplot(x = 'holiday', y = 'cnt', data = df)



In [None]:
plt.figure(figsize=(20,20))

plt.subplot(2,3,1)
sns.boxplot(x = 'mnth', y = 'cnt', data = df)

plt.subplot(2,3,2)
sns.boxplot(x = 'yr', y = 'cnt', data = df)

### Observations from the plot
1. Spring has a very low bike-sharing count. One guess is that spring weather is pleasant for walking, so people opt to walk instead.
2. The data set shows that people do not use bike sharing in heavy snow or rain.
3. Bike sharing is high in clear weather conditions, i.e. (Clear, Few clouds, Partly cloudy).
4. The number of bike shares increased in 2019.
5. Bike sharing increases in the summer months.
6. Bike sharing is lower on holidays.
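
Observations read off box plots can be backed with group summaries. A small sketch on made-up values (the toy frame is an assumption; on the real data the same `groupby` call would be run on `df` for `season`, `weathersit`, `yr`, `mnth` or `holiday`):

```python
import pandas as pd

# Made-up counts, only to show the shape of the summary.
toy = pd.DataFrame({
    "season": ["Spring", "Spring", "Fall", "Fall", "Summer"],
    "cnt": [1000, 2000, 5000, 6000, 4000],
})

# Mean and median demand per category, sorted from lowest to highest median.
season_summary = (
    toy.groupby("season")["cnt"]
    .agg(["mean", "median"])
    .sort_values("median")
)
print(season_summary)
```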

### 3. Preparing Data

###### 1. Dropping Data 

In [None]:
df.drop(['instant','dteday','casual','registered'], axis=1,inplace=True)
df

In [None]:
df.head(10)

###### 2. Mapping Data

a. Mapping Season

In [None]:
df['season'] = df['season'].map({1: 'Spring',2:'Summer',3:'Fall',4:'Winter'})

In [None]:
df['season'].describe()

In [None]:
sns.countplot(x='season', data=df)  # pass the column by keyword; recent seaborn versions expect x=/data=

b. Mapping Weather Condition


In [None]:
df['weathersit'] = df['weathersit'].map({1:'Clear',2:'Mist & Cloudy',3:'Light Snow & Rain',4:'Heavy Snow & Rain'})

In [None]:
df['weathersit'].describe()

In [None]:
sns.countplot(x='weathersit', data=df)  # pass the column by keyword; recent seaborn versions expect x=/data=

c. Mapping weekdays

In [None]:
df['weekday'] = df['weekday'].map({0:"Sunday",1:"Monday",2:"Tuesday",3:"Wednesday",4:"Thursday",5:"Friday",6:"Saturday"})

In [None]:
plt.figure(figsize=(10,9))
sns.countplot(x='weekday', data=df)  # pass the column by keyword; recent seaborn versions expect x=/data=
plt.show()

#####  Converting month to categorical value for future use

In [None]:
import calendar
df['mnth'] = df['mnth'].apply(lambda x: calendar.month_abbr[x])

In [None]:
df.head(51)

### Creating Dummy Variables 

In [None]:
dummy = df[['season','mnth','weekday','weathersit']]

In [None]:
dummy = pd.get_dummies(dummy, drop_first=True, dtype=int)  # integer 0/1 columns; newer pandas versions default to bool

In [None]:
dummy.columns
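
With `drop_first=True`, the first category of each column (in sorted order) is dropped and becomes the baseline that the remaining dummy columns are compared against. A minimal sketch on a toy `season` column (the toy frame is an assumption for illustration):

```python
import pandas as pd

toy = pd.DataFrame({"season": ["Spring", "Summer", "Fall", "Winter", "Fall"]})

# Categories are sorted alphabetically, so 'Fall' is dropped and acts as the baseline.
encoded = pd.get_dummies(toy, drop_first=True, dtype=int)
print(encoded.columns.tolist())
print(encoded)
```

A row whose season is the baseline category has 0 in every remaining season column.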

In [None]:
df = pd.concat([dummy,df],axis=1)

In [None]:
df.head()

In [None]:
df.columns.value_counts()

In [None]:
df.shape

In [None]:
#Dropping columns 
df.drop(['season', 'mnth', 'weekday','weathersit'], axis = 1, inplace = True)
df.head()

In [None]:
df.shape

### 4. Splitting and Re-scaling Data

In [None]:
#splitting data
train, test = train_test_split(df, train_size = 0.7, test_size = 0.3, random_state = 100)

In [None]:
#Rescaling 

scaler = MinMaxScaler()
num_vars = ['cnt','hum','windspeed','temp','atemp']

train[num_vars] = scaler.fit_transform(train[num_vars])

In [None]:
train.head()

In [None]:
train.describe()
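
Min-max scaling maps each column to the range 0 to 1 using the minimum and maximum of the data it was fitted on. The sketch below writes the formula out by hand in pandas rather than calling `MinMaxScaler`, on made-up numbers, to show that the training minimum and range are reused unchanged for unseen rows:

```python
import pandas as pd

# Made-up values; 'temp' stands in for any numeric column.
train_toy = pd.DataFrame({"temp": [10.0, 20.0, 30.0]})
test_toy = pd.DataFrame({"temp": [15.0, 35.0]})

# Statistics are learned from the training rows only.
col_min = train_toy.min()
col_range = train_toy.max() - train_toy.min()

train_scaled = (train_toy - col_min) / col_range
test_scaled = (test_toy - col_min) / col_range  # same min and range as training

print(train_scaled["temp"].tolist())
print(test_scaled["temp"].tolist())
```

Test values can fall outside 0 to 1 when they lie beyond the training range; that is expected and keeps train and test on the same scale.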

### 5. Dividing Training Data Set

In [None]:
y_train = train.pop("cnt")
X_train = train

In [None]:
print(X_train.shape)
print(y_train.shape)

In [None]:
#Creating LinearRegression Object
lm = LinearRegression()
#fitting the model
lm.fit(X_train,y_train)

### Selecting Top 12 features using RFE

In [None]:
rfe = RFE(lm, n_features_to_select=12)  # keyword argument is required in recent scikit-learn versions
rfe = rfe.fit(X_train, y_train)

#### Viewing columns with RFE Score

In [None]:
list(zip(X_train.columns,rfe.support_,rfe.ranking_))

In [None]:
rfe_col = X_train.columns[rfe.support_]
rfe_col

In [None]:
X_train.columns[~rfe.support_]
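
The idea behind recursive feature elimination can be sketched by hand: fit a linear model, drop the feature with the smallest absolute coefficient, and repeat until the requested number remains. The code below is a simplified NumPy illustration of that idea, not scikit-learn's implementation; the function name `backward_eliminate` and the synthetic data are assumptions.

```python
import numpy as np

def backward_eliminate(X, y, names, n_keep):
    """Repeatedly refit least squares and drop the feature with the smallest |coefficient|."""
    keep = list(range(X.shape[1]))
    while len(keep) > n_keep:
        # Design matrix: intercept column plus the features still in play.
        design = np.column_stack([np.ones(len(y)), X[:, keep]])
        coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
        weakest = int(np.argmin(np.abs(coefs[1:])))  # skip the intercept
        keep.pop(weakest)
    return [names[i] for i in keep]

# Synthetic data: only columns 'a' and 'c' drive y. All features share the
# same 0-1 scale, so coefficient magnitudes are comparable.
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + 0.01 * rng.normal(size=200)

selected = backward_eliminate(X, y, ["a", "b", "c", "d"], n_keep=2)
print(selected)
```

Comparing coefficient magnitudes only makes sense when features are on a similar scale, which is one reason the numeric columns are rescaled before this step.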

## 6.  Creating Models 

### Model 1

In [None]:
X_train_rfe = X_train[rfe_col]

In [None]:
#adding constant
X_train_lm = sm.add_constant(X_train_rfe)

In [None]:
#fitting model
lm = sm.OLS(y_train, X_train_lm).fit()  # fit on the design matrix that includes the constant

In [None]:
#lm.params

In [None]:
print(lm.summary())


In [None]:
vif = pd.DataFrame()
X = X_train_rfe
vif['Features'] = X.columns
vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

#### Observation :
`hum` has a high VIF score and a high p-value, so the next model is built without that column.
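
The VIF of a feature is 1 / (1 - R²), where R² comes from regressing that feature on all the others; a large value means the feature is nearly a linear combination of the rest. The NumPy sketch below illustrates the formula on synthetic data. It always includes an intercept in the auxiliary regression, so its numbers may differ from what `variance_inflation_factor` returns on a matrix without a constant column; the function name `vif_values` is an assumption.

```python
import numpy as np

def vif_values(X):
    """VIF_i = 1 / (1 - R_i^2), regressing column i on the other columns plus an intercept."""
    X = np.asarray(X, dtype=float)
    out = []
    for i in range(X.shape[1]):
        target = X[:, i]
        others = np.column_stack([np.ones(len(X)), np.delete(X, i, axis=1)])
        fitted = others @ np.linalg.lstsq(others, target, rcond=None)[0]
        ss_res = np.sum((target - fitted) ** 2)
        ss_tot = np.sum((target - target.mean()) ** 2)
        out.append(ss_tot / ss_res)  # algebraically equal to 1 / (1 - R^2)
    return np.array(out)

# Synthetic data: x2 is almost a copy of x1, x3 is independent of both.
rng = np.random.default_rng(1)
x1 = rng.normal(size=500)
x2 = x1 + 0.1 * rng.normal(size=500)
x3 = rng.normal(size=500)

values = vif_values(np.column_stack([x1, x2, x3]))
print(values.round(2))
```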

### Model 2 

In [None]:
X_train_2 = X_train_lm.drop(["hum"], axis = 1)
X_train_lm = sm.add_constant(X_train_2)
lm = sm.OLS(y_train,X_train_lm).fit()

In [None]:
print(lm.summary())

In [None]:
vif = pd.DataFrame()
X = X_train_2
vif['Features'] = X.columns
vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

#### Observation :
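The `const` column is the intercept added by `sm.add_constant`. It has a low p-value, and its very high VIF is an artifact of including the intercept in the VIF table, so `const` is left out of the VIF calculation in the next model. The constant is added back before fitting, so the model keeps its intercept.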

###  Model 3

In [None]:
X_train_3 = X_train_lm.drop(["const"], axis = 1)
X_train_lm = sm.add_constant(X_train_3)
lm = sm.OLS(y_train,X_train_lm).fit()

In [None]:
print(lm.summary())

In [None]:
vif = pd.DataFrame()
X = X_train_3
vif['Features'] = X.columns
vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

### 7. Residual Analysis

In [None]:
y_train_res = lm.predict(X_train_lm)

In [None]:
fig = plt.figure()
sns.histplot((y_train - y_train_res), bins = 20, kde = True)  # distplot is deprecated in recent seaborn releases
fig.suptitle('Error Terms', fontsize = 20)
plt.xlabel('Errors', fontsize = 18)                 

### 8. Prediction using Final Model

In [None]:
num_vars = ['cnt','hum','windspeed','temp','atemp']
test[num_vars] = scaler.transform(test[num_vars])  # reuse the scaler fitted on the training data; do not refit on test data

In [None]:
test.shape

In [None]:
test.describe()

###  Dividing Data set

In [None]:
y_test = test.pop('cnt')
X_test = test

In [None]:
print(y_test.shape)
print(X_test.shape)

In [None]:
# predicting using values used by the final model
test_col = X_train_lm.columns
X_test=X_test[test_col[1:]]

# Adding constant variable to test dataframe
X_test = sm.add_constant(X_test)

In [None]:
X_test.info()

### 9. Prediction

In [None]:
y_pred = lm.predict(X_test)

In [None]:
print("R2-Score : ",r2_score(y_test,y_pred).round(2))

In [None]:
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(y_test, y_pred)
print("MeanSquaredError = ",mse)
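
Both metrics follow directly from the residuals: R² is 1 minus the ratio of the residual sum of squares to the total sum of squares, and MSE is the mean of the squared residuals. A small NumPy sketch on made-up values (the helper name `r2_and_mse` is an assumption), which can serve as a cross-check on the library results:

```python
import numpy as np

def r2_and_mse(y_true, y_hat):
    """Return (R^2, MSE) computed from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    ss_res = np.sum((y_true - y_hat) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot, ss_res / len(y_true)

# Made-up scaled targets and predictions.
r2, mse = r2_and_mse([0.2, 0.4, 0.6, 0.8], [0.25, 0.35, 0.65, 0.75])
print(round(r2, 4), round(mse, 6))
```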

### 10. Evaluating Model

In [None]:
fig = plt.figure()
plt.scatter(y_test, y_pred)
fig.suptitle('y_test VS y_pred', fontsize = 20)              
plt.xlabel('y_test', fontsize = 16)                         
plt.ylabel('y_pred', fontsize = 16)  

In [None]:
param = pd.DataFrame(lm.params)
param.insert(0,'Variables',param.index)
param.rename(columns = {0:'Coefficient value'},inplace = True)
param['index'] = list(range(len(param)))  # avoid hard-coding the number of parameters
param.set_index('index',inplace = True)
param.sort_values(by = 'Coefficient value',ascending = False,inplace = True)
param


cnt = 0.199648 + 0.491508 × temp + 0.233482 × yr + 0.083084 × season_Winter + 0.076686 × mnth_Sep + 0.045280 × season_Summer - 0.052418 × mnth_Jul - 0.066942 × season_Spring - 0.081558 × weathersit_Mist & Cloudy - 0.098013 × holiday - 0.147977 × windspeed - 0.285155 × weathersit_Light Snow & Rain

#### The positive variables: temp, yr, season_Winter, mnth_Sep, season_Summer indicate that an increase in these values will lead to an increase in the value of cnt

#### The negative variables: mnth_Jul, season_Spring, weathersit_Mist & Cloudy, holiday, windspeed, weathersit_Light Snow & Rain indicate that an increase in these values will lead to a decrease in the value of cnt
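
The fitted equation can be evaluated by hand for a single day. The sketch below copies the coefficients from the equation above and applies them to a hypothetical day (the input values and the helper name `predict_scaled_cnt` are assumptions). `temp` and `windspeed` are assumed to be already min-max scaled as in the training step, and the result is on the scaled `cnt` axis, not in raw ride counts.

```python
# Coefficients copied from the fitted equation above.
coefficients = {
    "const": 0.199648,
    "temp": 0.491508,
    "yr": 0.233482,
    "season_Winter": 0.083084,
    "mnth_Sep": 0.076686,
    "season_Summer": 0.045280,
    "mnth_Jul": -0.052418,
    "season_Spring": -0.066942,
    "weathersit_Mist & Cloudy": -0.081558,
    "holiday": -0.098013,
    "windspeed": -0.147977,
    "weathersit_Light Snow & Rain": -0.285155,
}

def predict_scaled_cnt(features):
    """Intercept plus the sum of coefficient * value; features not listed count as 0."""
    total = coefficients["const"]
    for name, value in features.items():
        total += coefficients[name] * value
    return total

# Hypothetical day: second year, summer, clear weather, mid-range temperature.
example_day = {"temp": 0.5, "yr": 1, "season_Summer": 1, "windspeed": 0.3}
scaled_cnt = predict_scaled_cnt(example_day)
print(round(scaled_cnt, 4))
```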

## Final Observation

1. Temperature has the highest impact
2. In September, bike rentals are high
3. Rentals are lower on holidays

#### Temperature and season have a high impact on bike rentals