# Problem Statement

A bike-sharing system is a service in which bikes are made available to individuals for shared use on a short-term basis, either for a price or for free. Many bike-share systems let people borrow a bike from a "dock", which is usually computer-controlled: the user enters the payment information and the system unlocks the bike. The bike can then be returned to another dock belonging to the same system.

A US bike-sharing provider, BoomBikes, has recently suffered considerable dips in its revenues due to the ongoing Corona pandemic. The company is finding it very difficult to sustain itself in the current market scenario. So, it has decided to come up with a mindful business plan to accelerate its revenue as soon as the ongoing lockdown comes to an end and the economy returns to a healthy state.

To that end, BoomBikes wants to understand the demand for shared bikes once the nationwide Covid-19 quarantine ends. The aim is to be ready to cater to people's needs once the situation improves, to stand out from other service providers, and to make huge profits.

They have contracted a consulting company to understand the factors on which the demand for these shared bikes depends. Specifically, they want to understand the factors affecting the demand for these shared bikes in the American market. The company wants to know:
- Which variables are significant in predicting the demand for shared bikes.
- How well those variables describe the bike demand.

Based on various meteorological surveys and people's styles, the service provider has gathered a large dataset on daily bike demand across the American market.

# Business Goal

You are required to model the demand for shared bikes with the available independent variables. Management will use the model to understand exactly how demand varies with different features. They can then adjust the business strategy to meet demand levels and customers' expectations. Further, the model will be a good way for management to understand the demand dynamics of a new market.

In [None]:
#import the required libraries

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings('ignore')

%matplotlib inline

from sklearn import linear_model
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC
from sklearn.feature_selection import RFE
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Reading and Understanding the Data

In [None]:
# read the data
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
        
bike_data = pd.read_csv('/kaggle/input/boombikes/day.csv', parse_dates=['dteday'])

In [None]:
# Top 5 rows
bike_data.head()

In [None]:
# Bottom 5 rows
bike_data.tail()

In [None]:
# Dataframe shape
bike_data.shape

In [None]:
# Checking data type and null values
bike_data.info()

In [None]:
# Summary for numerical variables
bike_data.describe()

# Missing Data Check

In [None]:
# verify the null records
bike_data.isnull().sum()

As there are no missing values in the data set, we do not have to handle any missing data.

# Duplicate check

In [None]:
print('No of rows before drop duplicates:',bike_data.shape[0])
bike_data.drop_duplicates(subset=None, inplace=True)
print('No of rows after drop duplicates:',bike_data.shape[0])

# Encoding categorical variables

In [None]:
# Season Variable
bike_data.season.value_counts()

In [None]:
# Convert season data in categorical values (1:spring, 2:summer, 3:fall, 4:winter)
bike_data.season = bike_data.season.map({1:'spring', 2:'summer', 3:'fall', 4:'winter'})
bike_data.season.value_counts(normalize=True)

In [None]:
# year variable
bike_data.yr.value_counts()

In [None]:
# Month variable
bike_data.mnth.value_counts()

In [None]:
# Convert month data in categorical values
bike_data.mnth = bike_data.mnth.map({
    1:'Jan', 
    2:'Feb', 
    3:'Mar', 
    4:'Apr',
    5:'May',
    6:'Jun',
    7:'Jul',
    8:'Aug',
    9:'Sep',
    10:'Oct',
    11:'Nov',
    12:'Dec'})
bike_data.mnth.value_counts(normalize=True)

In [None]:
# holiday variable
bike_data.holiday.value_counts()

In [None]:
# Weekday variable
bike_data.weekday.value_counts()

In [None]:
bike_data.weekday = bike_data.weekday.map({
    0:'Sun',
    1:'Mon',
    2:'Tue',
    3:'Wed',
    4:'Thu',
    5:'Fri',
    6:'Sat'
})
bike_data.weekday.value_counts()

In [None]:
bike_data.workingday.value_counts()

In [None]:
bike_data.weathersit.value_counts()

In [None]:
# Converting data as
#   1: Clear, Few clouds, Partly cloudy ---> Clear
#   2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist ---> Cloudy
#   3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds ---> LightRain
#   4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog ---> Thunderstorm

bike_data.weathersit = bike_data.weathersit.map({1:'Clear',2:'Cloudy',3:'LightRain',4:'Thunderstorm'})
bike_data.weathersit.value_counts(normalize=True)

In [None]:
# Verify dataframe post encoding
bike_data.info()

In [None]:
# yr, holiday and workingday are categorical variables, so convert their data type to category
bike_data.yr = bike_data.yr.astype('category')
bike_data.holiday = bike_data.holiday.astype('category')
bike_data.workingday = bike_data.workingday.astype('category')
bike_data.info()
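
The label mappings in this section rely on `Series.map` with a dictionary. The sketch below uses a small made-up series (not the BoomBikes data) to show one thing worth checking after such a mapping: any code that is missing from the dictionary is silently turned into NaN, so a null count right after the mapping reveals unexpected category codes.

```python
import pandas as pd

# Toy stand-in for a coded column such as `season`; the values are made up.
codes = pd.Series([1, 2, 3, 4, 2, 9])

# Same mapping style as the notebook uses for `season`.
season_labels = {1: 'spring', 2: 'summer', 3: 'fall', 4: 'winter'}
mapped = codes.map(season_labels)

# Codes that are missing from the dictionary become NaN, so counting
# nulls after the mapping reveals any unexpected category code.
unmapped = int(mapped.isnull().sum())
print(mapped.tolist())
print('unmapped codes:', unmapped)
```

In the notebook itself, the same check would be a call to `isnull().sum()` on each mapped column.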

# Exploratory Data Analysis

In [None]:
# Numerical columns for EDA
numerical_columns = bike_data.select_dtypes(include=['int64','float64']).columns

# categorical columns for EDA
categorical_columns = bike_data.select_dtypes(exclude=['int64','float64','datetime64']).columns

In [None]:
# Distribution plot for numerical columns
plt.figure(figsize=(20,10))

for i in range(len(numerical_columns)):
    plt.subplot(2,4,i+1)
    # histplot with kde=True replaces the deprecated distplot
    sns.histplot(x=bike_data[numerical_columns[i]], kde=True, stat='density')

plt.show()

**Inference**
- Most columns look roughly normally distributed; windspeed and casual are slightly skewed.
- More investigation is needed to identify the outliers.

In [None]:
# Box plot for numerical variables
plt.figure(figsize=(20,10))

for i in range(len(numerical_columns)):
    plt.subplot(2,4,i+1)
    # Passing the column as x draws a horizontal box plot
    sns.boxplot(x=bike_data[numerical_columns[i]])

plt.show()

**Inference**
- The box plots show no outliers in instant, temp, atemp, registered and cnt.
- There seem to be some outliers in hum, windspeed and casual, which we need to handle.
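
The box plot whiskers follow the 1.5*IQR rule, which is also the rule used for outlier removal in this notebook. A minimal sketch of that rule on made-up, humidity-like values is shown below; the helper name `iqr_bounds` is hypothetical and only illustrates how the fences are computed.

```python
import numpy as np

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences at k * IQR beyond the quartiles."""
    q25, q75 = np.percentile(values, [25, 75])
    iqr = q75 - q25
    return q25 - k * iqr, q75 + k * iqr

# Made-up readings with one obvious low outlier (0.0).
toy = np.array([0.0, 0.55, 0.60, 0.62, 0.65, 0.70, 0.72])
lower, upper = iqr_bounds(toy)

# Keep only the values that fall inside the fences.
kept = toy[(toy >= lower) & (toy <= upper)]
print(lower, upper, len(kept))
```

Anything below the lower fence or above the upper fence is treated as an outlier; here only the 0.0 reading is removed.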

In [None]:
# Handle outliers with the 1.5*IQR rule

b_rows = bike_data.shape[0]

for col in ['hum', 'windspeed']:
    q75, q25 = np.percentile(bike_data[col], [75, 25])
    iqr = q75 - q25

    # Named lower/upper to avoid shadowing the built-in min() and max()
    lower = q25 - (iqr * 1.5)
    upper = q75 + (iqr * 1.5)

    # Keep only the rows that fall inside the fences
    bike_data = bike_data[(bike_data[col] >= lower) & (bike_data[col] <= upper)]
a_rows = bike_data.shape[0]

print('percent reduction in data after deleting outliers:', (b_rows - a_rows) / b_rows * 100)

In [None]:
# Categorical data analysis
plt.figure(figsize=(20,15))

for i in range(len(categorical_columns)):
    plt.subplot(3,3,i+1)
    sns.boxplot(x=categorical_columns[i], y='cnt', data=bike_data)

plt.show()

**Inference**
- The highest number of bookings occurs in fall, followed by summer and winter.
- The number of bookings increased in 2019 compared to 2018.
- Jun-Sep are the busiest months in terms of bookings.
- There are more bookings on non-holidays than on holidays.
- Bookings are mostly flat across the days of the week.
- Bookings are quite similar between working and non-working days.
- Clear weather seems to be the most favorable for bike riding, followed by cloudy weather.

In [None]:
def categorical_plot(col):
    sns.set(style="whitegrid")
    plt.figure(figsize = (12,6))

    # Average count per category (x and y passed as keywords)
    plt.subplot(1,2,1)
    sns.barplot(x=col, y='cnt', data=bike_data, ci=None)

    # Average count per category, split by year
    plt.subplot(1,2,2)
    sns.barplot(x=col, y='cnt', data=bike_data, hue='yr', ci=None)
    l = plt.legend()
    l.get_texts()[0].set_text('2018')
    l.get_texts()[1].set_text('2019')
    plt.show()

In [None]:
categorical_plot('season')

**Inference**
- The highest number of bookings occurs in fall, followed by summer and winter.
- The year-on-year plot shows the same trend.

In [None]:
categorical_plot('weathersit')

**Inference**
- Clear weather seems to be the most favorable for bike riding, followed by cloudy weather.
- People seem to avoid bike riding in rainy weather.
- The year-on-year graph shows a similar trend.

In [None]:
categorical_plot('mnth')

**Inference**
- Most people ride bikes from May to October, as these months recorded the highest bookings.
- January recorded the lowest number of bookings.
- The year-on-year graph shows a similar trend.

In [None]:
categorical_plot('weekday')

**Inference**
- There is not much difference in day-wise bookings. Thursday recorded the most bookings, followed by Saturday and Sunday.
- Friday had the fewest bookings in 2018, whereas Tuesday had the fewest in 2019.

In [None]:
categorical_plot('workingday')

**Inference**
- There is not much difference in bookings between working and non-working days.
- Non-working days recorded slightly more bookings.
- The year-on-year graph shows a similar trend.

In [None]:
categorical_plot('holiday')

**Inference**
- Non-holidays recorded more bookings than holidays.
- This may be because more people use these bikes for their daily commute.
- The year-on-year graph shows a similar trend.

In [None]:
# pairplot creates its own figure, so no separate plt.figure() call is needed
sns.pairplot(bike_data[numerical_columns])
plt.show()

**Inferences**
The following observations can be made from the above graph:
- instant is an index variable and is not useful in the model. We can consider dropping it.
- temp and atemp each have a linear relationship with cnt. However, temp and atemp are also very strongly related to each other, so we should use only one of them in model building.
- hum and windspeed seem to have a negative relationship with cnt.
- casual and registered have a strong relationship with cnt; this needs further analysis.

In [None]:
correlation = bike_data[numerical_columns].corr()
# Mask the upper triangle (above the diagonal) so that each pair is shown once
mask = np.triu(np.ones_like(correlation, dtype=bool), k=1)

plt.figure(figsize=(10,10))
sns.heatmap(correlation, mask=mask, cmap="RdYlGn", annot=True)
plt.show()

**Inferences**
The following observations can be made from the above graph:
- We can confirm the high correlation between temp and atemp.
- We can confirm the negative relationship of hum and windspeed with cnt.
- casual & registered: Both columns contain the count of bikes booked by different categories of customers. From the analysis, we can see that 'cnt' = 'casual' + 'registered'. Since our objective is to predict the total count of bikes and not the count by category, we will drop these two columns.
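
The identity 'cnt' = 'casual' + 'registered' can be verified row by row before the columns are dropped. The sketch below does this on a small illustrative frame; in the notebook the same comparison would be run on `bike_data`. If the identity holds, keeping both columns as predictors would let the model reconstruct the target exactly, which is another reason to drop them.

```python
import pandas as pd

# Small illustrative frame; in the notebook the real frame is `bike_data`.
toy = pd.DataFrame({
    'casual':     [331, 131, 120],
    'registered': [654, 670, 1229],
    'cnt':        [985, 801, 1349],
})

# True only if the identity holds on every row.
identity_holds = bool((toy['casual'] + toy['registered'] == toy['cnt']).all())
print(identity_holds)
```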

In [None]:
# Based on the high-level analysis of the data and the data dictionary, the following variables can be removed from further analysis

# instant: it is only an index value
bike_data.drop('instant', axis=1, inplace=True)

# dteday: this holds the date; since we already have separate columns for year and month, we can do without it
bike_data.drop('dteday', axis=1, inplace=True)

# atemp: it is highly correlated with temp and has a similar correlation with the target variable, so we keep temp and drop atemp
bike_data.drop('atemp', axis=1, inplace=True)

# casual, registered: our objective is to predict the total count of bikes and not the count by category, so we drop these two columns
bike_data.drop('casual', axis=1, inplace=True)
bike_data.drop('registered', axis=1, inplace=True)

# Data Preparation for Modeling

### Dummy variable creation for categorical columns

In [None]:
# dtype=int keeps the dummies numeric (newer pandas versions default to bool)
bike_data_with_dummies = pd.get_dummies(bike_data[categorical_columns], drop_first=True, dtype=int)

bike_data_with_dummies

In [None]:
# Dropping the original columns as there would be duplicate variables
bike_data = bike_data.drop(categorical_columns, axis=1)

In [None]:
# Concatenate the dummy dataframe with the original dataframe
bike_data = pd.concat([bike_data, bike_data_with_dummies], axis=1)

In [None]:
# Verify the changes
bike_data
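
To make the effect of `drop_first=True` concrete, here is a minimal sketch on a made-up column that reuses the notebook's `weathersit` labels. The first level in sorted order is dropped and becomes the baseline, so a column with k levels yields k-1 dummy columns and the baseline is encoded as all zeros.

```python
import pandas as pd

# Toy column; the labels mirror the notebook's `weathersit` mapping.
toy = pd.DataFrame({'weathersit': ['Clear', 'Cloudy', 'LightRain', 'Clear']})

# drop_first=True drops the first level in sorted order ('Clear' here),
# which becomes the baseline encoded as all zeros.
dummies = pd.get_dummies(toy, drop_first=True, dtype=int)
print(dummies.columns.tolist())
print(dummies.iloc[0].tolist())
```

Each remaining dummy coefficient in the model is then read relative to the dropped baseline level.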

# Train Test Split

In [None]:
np.random.seed(0)
bike_data_train, bike_data_test = train_test_split(bike_data, train_size=0.7, random_state=100)

In [None]:
bike_data_train

In [None]:
bike_data_test

In [None]:
# Verify the correlation of the variables in the train set

correlation = bike_data_train.corr()
# Mask the upper triangle (above the diagonal) so that each pair is shown once
mask = np.triu(np.ones_like(correlation, dtype=bool), k=1)

plt.figure(figsize=(20,20))
sns.heatmap(correlation, mask=mask, cmap="RdYlGn")
plt.show()

# Scaling the numerical variables

In [None]:
# Scale the numerical variables using the standard scaler
numerical_columns = ['temp','hum', 'windspeed','cnt']

sc = StandardScaler()
bike_data_train[numerical_columns] = sc.fit_transform(bike_data_train[numerical_columns])
bike_data_test[numerical_columns] = sc.transform(bike_data_test[numerical_columns])

In [None]:
# Summary of the numerical variables after scaling
bike_data_train[numerical_columns].describe()

In [None]:
bike_data_test[numerical_columns].describe()
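
The scaler is fitted on the training split only and then applied to the test split, so the test data does not influence the learned mean and standard deviation. The sketch below reproduces that idea with plain NumPy on made-up values; it is a stand-in for `StandardScaler`, not the notebook's actual scaler object.

```python
import numpy as np

# Made-up training and test values for a single feature.
train = np.array([10.0, 20.0, 30.0])
test = np.array([20.0, 40.0])

# "Fit": learn the statistics from the training split only.
mean = train.mean()
std = train.std()  # population std (ddof=0), which StandardScaler also uses

# "Transform": apply the same training statistics to both splits.
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

print(train_scaled.mean(), train_scaled.std())
print(test_scaled)
```

This is why the scaled training columns have mean 0 and standard deviation 1 in the summary above, while the scaled test columns are only close to those values.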

In [None]:
# Functions to build the model and check VIF

def build_model(X, y):
    X = sm.add_constant(X)   # add the intercept term
    lm = sm.OLS(y, X).fit()  # fit the OLS model
    print(lm.summary())      # print the model summary
    return lm
    
def checkVIF(X):
    vif = pd.DataFrame()
    vif['Features'] = X.columns
    vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    vif['VIF'] = round(vif['VIF'], 2)
    vif = vif.sort_values(by = "VIF", ascending = False)
    return vif

In [None]:
# Prepare X and Y variable for the predictions
y_train = bike_data_train.pop('cnt')
X_train = bike_data_train

### Model-1

In [None]:
lm1 = build_model(X_train, y_train)
checkVIF(X_train)

- R-squared: 0.861 and Adj. R-squared: 0.852
- The model explains about 86% of the variance in cnt. However, there are multiple insignificant variables (high p-value) and multicollinear variables. We will iterate, removing these variables step by step, to arrive at the best-suited model.


In [None]:
# Dropping workingday_1 as it is highly correlated with other independent variables
X_train.drop('workingday_1', axis=1, inplace=True)

### Model-2

In [None]:
lm2 = build_model(X_train, y_train)
checkVIF(X_train)

In [None]:
# Dropping season_winter as it is highly correlated with other independent variables
X_train.drop('season_winter', axis=1, inplace=True)

### Model-3

In [None]:
lm3 = build_model(X_train, y_train)
checkVIF(X_train)

### Model-4

In [None]:
# Dropping weekday_Thu as its high p-value shows it is not significant for the model.
X_train.drop('weekday_Thu', axis=1, inplace=True)
lm4 = build_model(X_train, y_train)
checkVIF(X_train)

### Model-5

In [None]:
# Dropping weekday_Wed, weekday_Sat and mnth_Feb as their high p-values show they are not significant for the model.
X_train.drop(['weekday_Wed','weekday_Sat','mnth_Feb'], axis=1, inplace=True)
lm5 = build_model(X_train, y_train)
checkVIF(X_train)

### Model-6

In [None]:
# Dropping mnth_Dec, mnth_Jun and mnth_Nov as their high p-values show they are not significant for the model.
X_train.drop(['mnth_Dec','mnth_Jun','mnth_Nov'], axis=1, inplace=True)
lm6 = build_model(X_train, y_train)
checkVIF(X_train)

### Model-7

In [None]:
# Dropping weekday_Mon and mnth_Aug as their high p-values show they are not significant for the model.
X_train.drop(['weekday_Mon','mnth_Aug'], axis=1, inplace=True)
lm7 = build_model(X_train, y_train)
checkVIF(X_train)

### Model-8

In [None]:
# Dropping season_summer as its high p-value shows it is not significant for the model.
X_train.drop(['season_summer'], axis=1, inplace=True)
lm8 = build_model(X_train, y_train)
checkVIF(X_train)

### Model-9

In [None]:
# Dropping mnth_Jan as its high p-value shows it is not significant for the model.
X_train.drop(['mnth_Jan'], axis=1, inplace=True)
lm9 = build_model(X_train, y_train)
checkVIF(X_train)

### Model-10

In [None]:
# Dropping mnth_May as its high p-value shows it is not significant for the model.
X_train.drop(['mnth_May'], axis=1, inplace=True)
lm10 = build_model(X_train, y_train)
checkVIF(X_train)

**Inference**
- Now we have a model where
    - R-squared: 0.851 and Adj. R-squared: 0.847
    - No variable has a p-value above 0.05
    - No variable has a VIF above 5
#### This seems to be a stable model. We now move on to its interpretation.

# Model Interpretation

### 1. Hypothesis Testing

The hypotheses for the overall model are:

- H0: B1 = B2 = ... = Bn = 0
- H1: at least one Bi != 0

In [None]:
lm10.params

##### Every coefficient is non-zero and, as noted in the model summary, has a p-value below 0.05, so we REJECT the NULL HYPOTHESIS

### 2. F-Statistic

- F-statistic: 198.7
- Prob (F-statistic): 9.43e-191

##### The F-statistic of 198.7 is much greater than 1 and its p-value (9.43e-191) is practically zero, which indicates that the overall model is significant
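
For intuition, the overall F-statistic of an OLS model with an intercept can be written in terms of R-squared, the number of observations n and the number of predictors k. The helper below is a minimal sketch of that textbook formula; the function name and the numbers passed to it are illustrative and are not taken from the notebook's model.

```python
def f_statistic(r2, n, k):
    """Overall F-statistic for an OLS model with an intercept.

    r2: R-squared, n: number of observations,
    k: number of predictors (excluding the intercept).
    """
    return (r2 / k) / ((1.0 - r2) / (n - k - 1))

# Illustrative numbers only, not the notebook's values.
f_value = f_statistic(0.5, 12, 1)
print(f_value)
```

A high R-squared combined with many observations per predictor gives a large F-statistic, which is why the value reported in the summary is far above 1.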

# Model Validation

#### We validate the below assumptions of Linear Regression
- 1. Linear Relationship 
- 2. Homoscedasticity 
- 3. Absence of Multicollinearity
- 4. Normality of Errors

### 1. Linear Relationship

In [None]:
sm.graphics.plot_ccpr(lm10, 'temp')
plt.show()

In [None]:
sm.graphics.plot_ccpr(lm10, 'windspeed')
plt.show()

In [None]:
sm.graphics.plot_ccpr(lm10, 'hum')
plt.show()

##### The plots above show the relationship between the response and each predictor, given the other predictors in the model. We can see clearly that linearity is well preserved

### 2. Homoscedasticity

In [None]:
# Plot the residuals against the target to look for patterns in their spread
X_train = sm.add_constant(X_train)
y_train_pred = lm10.predict(X_train)
residual = y_train - y_train_pred
sns.scatterplot(x=y_train, y=residual)
plt.axhline(0, color='r')  # zero-residual reference line
plt.xlabel('Count')
plt.ylabel('Residual')
plt.show()

##### There is no visible pattern in the residual values, thus homoscedasticity is well preserved

### 3. Multicollinearity

In [None]:
# X_train now contains the 'const' column added above; exclude it from the VIF table
checkVIF(X_train.drop('const', axis=1))

##### All the predictor variables have a VIF below 5, so we can assume that multicollinearity among the predictor variables is insignificant.
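
As a reminder of what the VIF measures: each predictor is regressed on the remaining predictors, and VIF = 1 / (1 - R-squared) of that auxiliary regression. The sketch below implements this with NumPy on synthetic data. It is an illustration only: the helper `vif` is hypothetical, and it adds an intercept to each auxiliary regression, whereas statsmodels' `variance_inflation_factor` uses the design matrix exactly as it is passed in.

```python
import numpy as np

def vif(X, j):
    """VIF of column j: regress it on the other columns plus an intercept."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    centered = y - y.mean()
    r2 = 1.0 - (resid @ resid) / (centered @ centered)
    return 1.0 / (1.0 - r2)

# Synthetic predictors: `c` is nearly a copy of `a`, `b` is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + 0.1 * rng.normal(size=200)
X = np.column_stack([a, b, c])

vifs = [vif(X, j) for j in range(X.shape[1])]
print(vifs)
```

The two nearly identical columns get a large VIF, while the independent column stays close to 1, which is the pattern the threshold of 5 is meant to catch.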

### 4. Normality of Errors

In [None]:
# Verify that residuals are normally distributed
res = y_train-y_train_pred

# Plot the histogram of the error terms
fig = plt.figure()
sns.histplot(x=res, bins=20, kde=True, stat='density')  # replaces the deprecated distplot
fig.suptitle('Error Terms')                  
plt.xlabel('Errors')                         
plt.show()

In [None]:
sm.qqplot((y_train - y_train_pred), fit=True, line='45')
plt.show()

##### Based on the histogram and the Q-Q plot, the error terms appear to follow an approximately normal distribution

# Making Predictions Using the Final Model

In [None]:
# Prepare x and y variable from test set
y_test = bike_data_test.pop('cnt')
X_test = bike_data_test

In [None]:
# Filter columns from the final X_train  
col=X_train.columns
X_test = sm.add_constant(X_test)
X_test=X_test[col]
X_test

In [None]:
# Making predictions using the final model (lm10)
y_pred = lm10.predict(X_test)

# Model Evaluation

In [None]:
# Plotting y_test and y_pred to understand the spread
fig = plt.figure()
plt.scatter(y_test, y_pred, alpha = 0.5)
fig.suptitle('y_test vs y_pred')             
plt.xlabel('y_test')                          
plt.ylabel('y_pred')
plt.show()

In [None]:
# Calculating R-squared and Adj. R-squared for the test dataset
from sklearn.metrics import r2_score
r2 = r2_score(y_test, y_pred)
r2

In [None]:
# n is number of rows in test dataset
n = X_test.shape[0]

# Number of columns in X_test. Note that this includes the 'const' column
# added above, so p counts the intercept along with the predictors
p = X_test.shape[1]

# We find the Adjusted R-squared using the formula
adjusted_r2 = round(1-(1-r2)*(n-1)/(n-p-1),6)
adjusted_r2
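
The adjustment applied above can be wrapped in a small helper. This is a minimal sketch of the standard formula; the function name and the numbers are illustrative. In the usual definition, p counts the predictors only, so when the design matrix carries a `const` column, that column would be subtracted from the count first.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R-squared for n observations and p predictors.

    p counts the predictors only; the intercept is not included.
    """
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

# Illustrative numbers only, not the notebook's values.
adj = adjusted_r2(0.5, 11, 5)
print(adj)
```

The penalty grows with p, so adding predictors that barely improve R-squared lowers the adjusted value.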

# Summary

Data Set | R-squared | Adj. R-squared
:--------------|:---------|:--------------
Train Data Set | 0.851 | 0.847
Test Data Set | 0.812 | 0.798

---
Overall we have a decent model, but we also acknowledge that we could do better. <br>
As per the final model, the variables below have a significant influence on the target variable, i.e. the number of bookings. Therefore they should be considered before making any decision. Because cnt was standardized before fitting, each coefficient is expressed in standard deviations of cnt.<br>
- **yr**<br>
A coefficient value of ‘1.019406’ indicates that rental numbers are increasing year over year.
- **temp**<br>
A coefficient value of ‘0.432292’ indicates that temperature has a significant positive impact on bike rentals.
- **weathersit_LightRain (weathersit = 3)**<br>
A coefficient value of ‘-0.944379’ indicates that light snow and light rain deter people from renting bikes.
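
Since the target was standardized before fitting, a coefficient can be converted back to a number of rentals by multiplying it by the standard deviation of cnt in the training set. The sketch below shows the arithmetic on made-up counts with a hypothetical coefficient; in the notebook, the fitted scaler `sc` stores the training standard deviations in its `scale_` attribute, in the order of the scaled columns.

```python
import numpy as np

# Made-up daily counts standing in for the training target.
cnt_train = np.array([1000.0, 3000.0, 5000.0, 7000.0])
cnt_std = cnt_train.std()  # population std, as StandardScaler uses

def to_original_units(coef, target_std):
    """Convert a coefficient on a standardized target back to raw units."""
    return coef * target_std

# Hypothetical coefficient of 0.5 on a 0/1 dummy variable.
effect = to_original_units(0.5, cnt_std)
print(effect)
```

For a 0/1 dummy, the converted value is the estimated change in daily rentals when the dummy switches from 0 to 1, with the other variables held fixed.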