## Boom Bikes Prediction by Vivek Chowdhury

### Problem Statement:

Boom Bikes wants to increase its revenue and sales after the Covid period. We are provided with a dataset, and the company would like to know which features help predict the number of riders.

### Step1: Importing libraries and packages and Reading the Dataframe

In [None]:
# Importing the necessary libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import RFE
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.metrics import r2_score
%matplotlib inline

In [None]:
# To ignore warnings

import warnings
warnings.filterwarnings("ignore")

In [None]:
# Reading the dataframe

bikes = pd.read_csv('../input/boombikes/day.csv')
bikes.head()

### Step2: Analyzing the dataframe and Initial Cleaning

In [None]:
# Checking the number of rows and columns

bikes.shape

In [None]:
# Checking for any null values and the data type

bikes.info()

- There are no null values in any of the columns
- A sanity check shows that the columns mnth, weekday, weathersit and season are stored as numeric codes and should be converted to categorical columns.
- The casual and registered columns can be dropped, because together they make up the target cnt.

In [None]:
# Initial data cleaning for the sanity check performed

bikes.mnth = bikes.mnth.map({1:"Jan", 2:"Feb", 3:"Mar", 4:"Apr", 5:"May", 6:"Jun", 7:"Jul", 8:"Aug", 9:"Sep", 10:"Oct", 11:"Nov", 12:"Dec"})

bikes.weekday = bikes.weekday.map({1:"Mon", 2:"Tue", 3:"Wed", 4:"Thu", 5:"Fri", 6:"Sat", 0:"Sun"})

bikes.weathersit = bikes.weathersit.map({1:"Clear", 2:"Cloudy", 3:"Rainy", 4:"Thunderstorm_or_snow"})

bikes.season = bikes.season.map({1:"Spring", 2:"Summer", 3:"Fall", 4:"Winter"})
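A dictionary passed to `Series.map` turns every code that is missing from the dictionary into NaN, so it is worth confirming that nothing was lost after a mapping like the ones above. The following is a minimal sketch on a made-up Series; the names `codes` and `labels` and the stray code 5 are illustrative assumptions, not part of the dataset.

```python
import pandas as pd

# Toy stand-in for a coded column such as bikes.weathersit (values are made up)
codes = pd.Series([1, 2, 3, 2, 5])

# The same style of lookup used above; note that code 5 has no label
labels = {1: "Clear", 2: "Cloudy", 3: "Rainy", 4: "Thunderstorm_or_snow"}
mapped = codes.map(labels)

# Codes absent from the dictionary become NaN, so count them before moving on
n_unmapped = int(mapped.isna().sum())
unmapped_codes = sorted(codes[mapped.isna()].unique().tolist())
print(n_unmapped, unmapped_codes)
```

Running the same `isna().sum()` check on the real columns right after mapping would show whether any row carries an unexpected code.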


In [None]:
bikes.head()

- The columns instant and dteday are not needed for modelling, so we can drop them.

In [None]:
# Dropping the "casual", "registered", "instant" and "dteday" column

bikes.drop(columns = ["instant","dteday", "casual", "registered"], axis = 1, inplace = True)

- Now that we are done with the initial cleaning of the data, we will proceed with the EDA.

### Step3: Exploratory Data Analysis (EDA)

#### EDA on categorical features

In [None]:
bikes.info()

In [None]:
cat_cols = bikes.select_dtypes(include = object).columns

num_cols = bikes.select_dtypes(include = ['int64','float64' ]).columns

In [None]:
print(cat_cols)
print()
print(num_cols)

In [None]:
# Checking the column: "season"

bikes.season.value_counts()

In [None]:
# Visualizing the column "season" with "cnt"

plt.figure(figsize=[12,5])
sns.barplot(x = bikes["season"], y = bikes["cnt"], ci=None)
plt.title("Count of people using Boom Bikes vs Season", size = 15, color = "Purple", pad = 20)
plt.xlabel("\nSeason", size = 12, color = "green")
plt.ylabel("Count of people using Boom Bikes\n", size = 12, color = "green")
plt.show()

- Average ridership is clearly highest in summer and fall.
- Ridership is lowest in spring.

In [None]:
# Checking the column: "mnth"

bikes.mnth.value_counts()

In [None]:
# Visualizing the column "mnth" with "cnt"

plt.figure(figsize=[12,5])
sns.barplot(x = bikes["mnth"], y = bikes["cnt"], ci=None)
plt.title("Count of people using Boom Bikes vs Month of the Year", size = 15, color = "Purple", pad = 20)
plt.xlabel("\nMonth of the Year", size = 12, color = "green")
plt.ylabel("Count of people using Boom Bikes\n", size = 12, color = "green")
plt.show()

- From the above bar graph, we can see that most people like to use the bikes during the summer and fall, but not so much during the beginning of the year or spring.

In [None]:
# Checking the column: "weekday"

bikes.weekday.value_counts()

In [None]:
# Visualizing the column "weekday" with "cnt"

plt.figure(figsize=[12,5])
sns.barplot(x = bikes["weekday"], y = bikes["cnt"], ci=None)
plt.title("Count of people using Boom Bikes vs Weekday", size = 15, color = "Purple", pad = 20)
plt.xlabel("\nWeekday", size = 12, color = "green")
plt.ylabel("Count of people using Boom Bikes\n", size = 12, color = "green")
plt.show()

- Here, we do not observe a noticeable difference.
- Boom Bikes sees roughly equal ridership throughout the week.

In [None]:
# Checking the column: "weathersit"

bikes.weathersit.value_counts()

In [None]:
# Visualizing the column "weathersit" with "cnt"

plt.figure(figsize=[12,5])
sns.barplot(x = bikes["weathersit"], y = bikes["cnt"], ci=None)
plt.title("Count of people using Boom Bikes vs Weather Conditions", size = 15, color = "Purple", pad = 20)
plt.xlabel("\nWeather Conditions", size = 12, color = "green")
plt.ylabel("Count of people using Boom Bikes\n", size = 12, color = "green")
plt.show()

- The graph shows that people avoid riding bikes on rainy days.
- Bikes are, however, a go-to choice in clear weather.

#### EDA on numerical features

In [None]:
num_cols

In [None]:
# Checking the column: 'yr'

bikes.yr.value_counts()

In [None]:
# Visualizing the column "yr" with "cnt"

plt.figure(figsize=[12,5])
sns.barplot(x = bikes["yr"], y = bikes["cnt"], ci=None)
plt.title("Count of people using Boom Bikes vs Year", size = 15, color = "Purple", pad = 20)
plt.xlabel("\nYear", size = 12, color = "green")
plt.ylabel("Count of people using Boom Bikes\n", size = 12, color = "green")
plt.xticks(ticks = [0, 1], labels = ["2018","2019"])
plt.show()

- Boom Bikes gained popularity from one year to the next, with a clear increase in the number of people using its bikes.

In [None]:
# Checking the column: 'holiday'

bikes.holiday.value_counts()

In [None]:
# Visualizing the column "holiday" with "cnt"

plt.figure(figsize=[12,5])
sns.barplot(x = bikes["holiday"], y = bikes["cnt"], ci=None)
plt.title("Count of people using Boom Bikes vs Holidays", size = 15, color = "Purple", pad = 20)
plt.xlabel("\nHolidays", size = 12, color = "green")
plt.ylabel("Count of people using Boom Bikes\n", size = 12, color = "green")
plt.xticks(ticks = [0, 1], labels = ["Not a Holiday","Holiday"])
plt.show()

- We can see that more people use these bikes on a regular day than on a holiday.
- This suggests that many people use these bikes to commute to their workplace.

In [None]:
# Checking the column: 'workingday'

bikes.workingday.value_counts()

In [None]:
# Visualizing the column "workingday" with "cnt"

plt.figure(figsize=[12,5])
sns.barplot(x = bikes["workingday"], y = bikes["cnt"], ci=None)
plt.title("Count of people using Boom Bikes vs WorkingDays", size = 15, color = "Purple", pad = 20)
plt.xlabel("\nWorkingDays", size = 12, color = "green")
plt.ylabel("Count of people using Boom Bikes\n", size = 12, color = "green")
plt.xticks(ticks = [0, 1], labels = ["Weekends","Weekdays"])
plt.show()

- People use Boom Bikes almost equally on working days and on non-working days (weekends and holidays).
- This matches the day-of-the-week analysis done previously.

In [None]:
# Checking the column "temp" for outliers with a box plot

plt.figure(figsize=[12,5])
sns.boxplot(x = bikes["temp"])
plt.title("Distribution of ActualTemperature", size = 15, color = "Purple", pad = 20)
plt.xlabel("\nActualTemperature", size = 12, color = "green")
plt.show()

- There are no outliers in this column.

In [None]:
# Checking the column "atemp" for outliers with a box plot

plt.figure(figsize=[12,5])
sns.boxplot(x = bikes["atemp"])
plt.title("Distribution of FeelingTemperature", size = 15, color = "Purple", pad = 20)
plt.xlabel("\nFeelingTemperature", size = 12, color = "green")
plt.show()

- There are no outliers in this column.

In [None]:
# Checking the column "hum" for outliers with a box plot

plt.figure(figsize=[12,5])
sns.boxplot(x = bikes["hum"])
plt.title("Distribution of Humidity", size = 15, color = "Purple", pad = 20)
plt.xlabel("\nHumidity", size = 12, color = "green")
plt.show()

- There are some outliers here.

In [None]:
# Checking the percentile 

bikes["hum"].describe(percentiles=[0.01,0.02,0.05,0.25,0.5,0.75])

In [None]:
# Removing the low-humidity outliers (hum below 20)

bikes = bikes[bikes["hum"] >= 20]

In [None]:
# Re-checking the column "hum" after removing the outliers

plt.figure(figsize=[12,5])
sns.boxplot(x = bikes["hum"])
plt.title("Distribution of Humidity", size = 15, color = "Purple", pad = 20)
plt.xlabel("\nHumidity", size = 12, color = "green")
plt.show()

In [None]:
# Checking the column "windspeed" for outliers with a box plot

plt.figure(figsize=[12,5])
sns.boxplot(x = bikes["windspeed"])
plt.title("Distribution of WindSpeed", size = 15, color = "Purple", pad = 20)
plt.xlabel("\nWindSpeed", size = 12, color = "green")
plt.show()

- There are some outliers here.

In [None]:
# Checking the percentile 

bikes["windspeed"].describe(percentiles=[0.25,0.5,0.75,0.95,0.98,0.99])

- Here, we can remove the top 1% of values as outliers.

In [None]:
# Removing the top 1%

wind_p99 = bikes.windspeed.quantile(0.99)
bikes = bikes[bikes["windspeed"] <= wind_p99]
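The percentile cut used above can be tried on a small example to see exactly which rows it removes. This is a sketch on made-up wind-speed values; the frame `df` and its numbers are assumptions for illustration only.

```python
import pandas as pd

# Toy wind-speed values (made up); 40.0 plays the role of an extreme reading
df = pd.DataFrame({"windspeed": [5.0, 7.0, 9.0, 11.0, 13.0, 40.0]})

# Upper cutoff at the 99th percentile, as in the cell above
cutoff = df["windspeed"].quantile(0.99)

# Keep rows at or below the cutoff and report how many were dropped
trimmed = df[df["windspeed"] <= cutoff]
n_dropped = len(df) - len(trimmed)
print(round(cutoff, 2), n_dropped)
```

A percentile cutoff removes roughly the same share of rows whether or not they are true outliers, so it helps to print the number of dropped rows alongside the cutoff.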

In [None]:
# Re-checking the column "windspeed" after removing the top 1%

plt.figure(figsize=[12,5])
sns.boxplot(x = bikes["windspeed"])
plt.title("Distribution of WindSpeed", size = 15, color = "Purple", pad = 20)
plt.xlabel("\nWindSpeed", size = 12, color = "green")
plt.show()

- The top 1% of windspeed values have been removed from the column.

In [None]:
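# Visualizing the correlations between the numerical columns, including the "cnt" target column
# (the categorical columns still hold strings here, so only num_cols are used)

heat = bikes[num_cols].corr()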

plt.figure(figsize=[10,5])
sns.heatmap(heat, cmap = 'YlGnBu', annot = True)
plt.show()

- The heatmap shows a strong correlation between 'cnt' and the 'temp'/'atemp' columns.
- 'cnt' is also correlated with 'yr', as well as with 'windspeed' and 'hum'.
- 'temp' and 'atemp' are very highly correlated with each other, so we may have to drop one of them to avoid multicollinearity.
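Reading highly correlated pairs off a heatmap by eye becomes harder as columns are added, so the same information can be pulled out of the correlation matrix directly. The sketch below uses synthetic data in which "atemp" is built to track "temp"; the data, the 0.9 threshold and the name `high_pairs` are all assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins: "atemp" is built to track "temp" closely (values are made up)
rng = np.random.default_rng(0)
temp = rng.uniform(0, 35, size=200)
toy = pd.DataFrame({
    "temp": temp,
    "atemp": temp * 1.1 + rng.normal(0, 0.5, size=200),
    "hum": rng.uniform(20, 100, size=200),
})

# Absolute correlations, keeping only the upper triangle so each pair appears once
corr = toy.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Pairs above a hypothetical 0.9 threshold are candidates for dropping one member
high_pairs = [(r, c) for r in upper.index for c in upper.columns
              if pd.notna(upper.loc[r, c]) and upper.loc[r, c] > 0.9]
print(high_pairs)
```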

### Step4: Preparing the dataframe for modelling

In [None]:
# First, we need to create the dummy variables for the categorical variables

list(cat_cols)

In [None]:
# Creating dummies for each categorical column; drop_first avoids a redundant level
# and dtype=int keeps the new columns numeric for the regression later on
for col in cat_cols:
    dummy = pd.get_dummies(bikes[col], drop_first = True, dtype = int)
    bikes = pd.concat([bikes, dummy], axis = 1)
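With `drop_first = True`, pandas drops the first level in sorted order, and that level becomes the baseline against which the remaining dummy coefficients are read. A small sketch on made-up season labels shows which level disappears; the Series `season` here is a toy, not the real column.

```python
import pandas as pd

# Toy season labels, matching the mapping used earlier
season = pd.Series(["Spring", "Summer", "Fall", "Winter", "Fall"])

# drop_first=True drops the first level in sorted order ("Fall" here),
# which becomes the baseline the other seasons are compared against
dummies = pd.get_dummies(season, drop_first=True, dtype=int)
kept = list(dummies.columns)
print(kept)
```

Rows belonging to the dropped level have zeros in every dummy column, so a coefficient on "Spring", for example, is read relative to "Fall".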

In [None]:
bikes.drop(columns = list(cat_cols), axis = 1, inplace = True)

In [None]:
# Visualizing the correlations again, now including the dummy variables

heat1 = bikes.corr()

plt.figure(figsize=[25,10])
sns.heatmap(heat1, cmap = 'YlGnBu', annot = True)
plt.show()

In [None]:
bikes.info()

In [None]:
# Defining the columns to be scaled

cols_to_scale = ['temp', 'atemp', 'hum', 'windspeed','cnt']

In [None]:
# Using train_test_split to split the dataframe

df_train, df_test = train_test_split(bikes, train_size = 0.7, random_state = 100)

In [None]:
# Scaling the numerical columns

scaler = MinMaxScaler()

df_train[cols_to_scale] = scaler.fit_transform(df_train[cols_to_scale])

df_train.head()
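The scaler is fitted on the training rows only, so the minimum and maximum it learns come from the training data. The sketch below reproduces min-max scaling by hand in NumPy on made-up numbers to show the consequence: test values outside the training range land outside [0, 1]. The helper `min_max` is a hypothetical name, not a scikit-learn function.

```python
import numpy as np

# Made-up train and test values for one feature
train = np.array([10.0, 20.0, 30.0])
test = np.array([5.0, 25.0, 35.0])

# Learn min and max from the training data only
lo, hi = train.min(), train.max()

def min_max(x, lo, hi):
    """Scale x relative to the training range [lo, hi]."""
    return (x - lo) / (hi - lo)

train_scaled = min_max(train, lo, hi)
test_scaled = min_max(test, lo, hi)
print(train_scaled, test_scaled)
```

This mirrors the pattern of calling `fit_transform` on the training set and only `transform` on the test set, which keeps information from the test rows out of the scaling step.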

In [None]:
# Splitting X and y

y_train = df_train.pop('cnt')

X_train = df_train

### Step5: Building the model

In [None]:
# Choosing columns using RFE

lm = LinearRegression()

lm.fit(X_train, y_train)

rfe = RFE(lm, n_features_to_select = 15)

rfe = rfe.fit(X_train, y_train)
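Recursive feature elimination with a linear estimator works by fitting the model, discarding the feature with the weakest coefficient, and repeating until the requested number of features remains. The following is a simplified NumPy sketch of that idea on synthetic data, not scikit-learn's implementation; the function `rfe_linear` and the data are assumptions made for illustration.

```python
import numpy as np

def rfe_linear(X, y, n_keep):
    """Simplified recursive feature elimination for a linear model.

    Repeatedly fits least squares (with an intercept) and removes the
    feature whose coefficient has the smallest magnitude.
    """
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_keep:
        A = np.column_stack([np.ones(len(X)), X[:, remaining]])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        weakest = int(np.argmin(np.abs(coef[1:])))  # skip the intercept
        remaining.pop(weakest)
    return remaining

# Synthetic data: only columns 0 and 2 drive y (all values are made up)
rng = np.random.default_rng(42)
X = rng.uniform(0, 1, size=(300, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(0, 0.05, size=300)

selected = rfe_linear(X, y, n_keep=2)
print(selected)
```

Because the ranking relies on coefficient size, features on very different scales would be ranked unfairly, which is one reason the numerical columns are scaled before this step.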

In [None]:
list(zip(X_train.columns, rfe.support_, rfe.ranking_))

In [None]:
# Columns selected by RFE

rfe_cols = X_train.columns[rfe.support_]
rfe_cols

In [None]:
# Columns RFE did not select

n_rfe_cols = X_train.columns[~rfe.support_]
n_rfe_cols

In [None]:
# Building the first model

X_train_rfe = X_train[rfe_cols]

X_train_sm = sm.add_constant(X_train_rfe)

model1 = sm.OLS(y_train, X_train_sm)

result = model1.fit()

In [None]:
result.summary()

In [None]:
# Checking the VIF

vif = pd.DataFrame()
vif['Features'] = X_train_sm.columns
vif['VIF'] = [variance_inflation_factor(X_train_sm.values, i) for i in range(X_train_sm.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif
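The VIF of a feature is 1 / (1 - R²), where R² comes from regressing that feature on all the other features. The sketch below computes it by hand in NumPy on synthetic columns, two of which are nearly identical; the function `vif` and the data are assumptions, and the statsmodels function used above remains the one applied to the real data.

```python
import numpy as np

def vif(X, j):
    """VIF of column j: 1 / (1 - R^2) from regressing it on the other columns."""
    target = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(X)), others])
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    resid = target - A @ coef
    r2 = 1.0 - resid.var() / target.var()
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(1)
a = rng.normal(size=500)
b = a + rng.normal(scale=0.1, size=500)   # nearly a copy of a
c = rng.normal(size=500)                  # unrelated to a and b
X = np.column_stack([a, b, c])

vifs = [vif(X, j) for j in range(X.shape[1])]
print([round(v, 1) for v in vifs])
```

Here the helper adds the intercept itself, whereas `variance_inflation_factor` in statsmodels does not, which is why the design matrix that already contains the constant is passed to it above. The value reported for the `const` row is not a feature VIF and can be ignored when judging multicollinearity.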

- Since "holiday" has a very high P value, we are dropping this column and building the model again.

In [None]:
# Dropping the column: "holiday"

X_train_sm.drop(columns = "holiday", axis = 1, inplace = True)

In [None]:
# Building the 2nd Model

X_train_sm = sm.add_constant(X_train_sm)

model2 = sm.OLS(y_train, X_train_sm)

result = model2.fit()

In [None]:
result.summary()

In [None]:
# Checking the VIF

vif = pd.DataFrame()
vif['Features'] = X_train_sm.columns
vif['VIF'] = [variance_inflation_factor(X_train_sm.values, i) for i in range(X_train_sm.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

- Here, "atemp" has a very high p-value. The heatmap also showed that it is very highly correlated with "temp", so we drop "atemp" and rebuild the model.

In [None]:
# Dropping the column: "atemp"

X_train_sm.drop(columns = "atemp", axis = 1, inplace = True)

In [None]:
# Building the 3rd Model

X_train_sm = sm.add_constant(X_train_sm)

model3 = sm.OLS(y_train, X_train_sm)

result = model3.fit()

In [None]:
result.summary()

In [None]:
# Checking the VIF

vif = pd.DataFrame()
vif['Features'] = X_train_sm.columns
vif['VIF'] = [variance_inflation_factor(X_train_sm.values, i) for i in range(X_train_sm.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

- Since "Sun" has a very high P-value, we drop it.

In [None]:
# Dropping the column: "Sun"

X_train_sm.drop(columns = "Sun", axis = 1, inplace = True)

In [None]:
# Building the 4th Model

X_train_sm = sm.add_constant(X_train_sm)

model4 = sm.OLS(y_train, X_train_sm)

result = model4.fit()

In [None]:
result.summary()

In [None]:
# Checking the VIF

vif = pd.DataFrame()
vif['Features'] = X_train_sm.columns
vif['VIF'] = [variance_inflation_factor(X_train_sm.values, i) for i in range(X_train_sm.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

- Dropping "Nov" as it has a very high P value.

In [None]:
# Dropping the column: "Nov"

X_train_sm.drop(columns = "Nov", axis = 1, inplace = True)

In [None]:
# Building the 5th Model

X_train_sm = sm.add_constant(X_train_sm)

model5 = sm.OLS(y_train, X_train_sm)

result = model5.fit()

In [None]:
result.summary()

In [None]:
# Checking the VIF

vif = pd.DataFrame()
vif['Features'] = X_train_sm.columns
vif['VIF'] = [variance_inflation_factor(X_train_sm.values, i) for i in range(X_train_sm.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

- Since the p-values are significant for all the remaining features and the VIF values are well below 3, we take this as our final model.

In [None]:
# predicting in train set

y_train_pred = result.predict(X_train_sm)

In [None]:
# Residual Analysis: the errors should be roughly normal and centred on zero

sns.histplot(y_train - y_train_pred, kde = True)
plt.show()
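The residual plot can be backed by two numbers: the mean of the residuals, which is zero up to rounding for least squares with an intercept, and their skewness, which should be near zero if the errors are roughly symmetric. This is a sketch on synthetic data with assumed names (`resid`, `skew`), not output from the model above.

```python
import numpy as np

# Synthetic linear data with normal noise (all values are made up)
rng = np.random.default_rng(7)
x = rng.uniform(0, 1, size=400)
y = 0.3 + 0.5 * x + rng.normal(0, 0.1, size=400)

# Ordinary least squares with an intercept column
A = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
resid = y - A @ coef

# Mean and sample skewness of the residuals
mean_resid = float(resid.mean())
skew = float(np.mean((resid - mean_resid) ** 3) / resid.std() ** 3)
print(round(mean_resid, 6), round(skew, 3))
```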

### Step6: Predicting on the test set

In [None]:
# Scaling the test set

df_test[cols_to_scale] = scaler.transform(df_test[cols_to_scale])

df_test.head()

In [None]:
df_test.yr.value_counts()

In [None]:
# Splitting into X and y

y_test = df_test.pop("cnt")

X_test = df_test

In [None]:
# Dropping the columns that are not part of the final model

X_test.drop(columns = n_rfe_cols, axis = 1, inplace = True)
X_test.drop(columns = ["Nov", "atemp", "Sun", "holiday"], axis = 1, inplace = True)

In [None]:
# Adding constant

X_test_sm = sm.add_constant(X_test)

In [None]:
# Predicting on the test set

y_test_pred = result.predict(X_test_sm)

In [None]:
# Residual Analysis on the test set

sns.histplot(y_test - y_test_pred, kde = True)
plt.show()

In [None]:
# Checking the R2_score

round(r2_score(y_true = y_test, y_pred = y_test_pred),2)
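The R² score is one minus the ratio of the residual sum of squares to the total sum of squares. It can be complemented by adjusted R², which penalises the number of features. Below is a minimal hand computation on made-up values; the function names `r2` and `adjusted_r2` are assumptions introduced here.

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

def adjusted_r2(r2_value, n_samples, n_features):
    """R^2 corrected for the number of features in the model."""
    return 1.0 - (1.0 - r2_value) * (n_samples - 1) / (n_samples - n_features - 1)

# Made-up targets and predictions
y_true = np.array([0.2, 0.4, 0.6, 0.8])
y_pred = np.array([0.25, 0.35, 0.65, 0.75])

score = r2(y_true, y_pred)
adj = adjusted_r2(score, n_samples=len(y_true), n_features=1)
print(round(float(score), 3), round(float(adj), 3))
```

For the model above, `n_samples` would be the number of test rows and `n_features` the number of predictors left in the final model.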

In [None]:
# Checking the parameters

result.params

#### Therefore, our final MLR model equation is:

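- cnt = 0.367 + (0.224*yr) + (0.044*workingday) + (0.539*temp) + (0.051*Winter) + (0.062*Sep) + (0.056*Sat) - (0.225*hum) - (0.124*windspeed) - (0.117*Spring) - (0.076*Jul) - (0.179*Rainy)
- Here cnt, temp, hum and windspeed are the min-max scaled values, so the coefficients are expressed in scaled units.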
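Since cnt was min-max scaled before fitting, the equation returns a scaled value that has to be mapped back to a rider count. The sketch below evaluates the equation for one hypothetical day and inverts the scaling. The coefficients are copied from the equation above, while the day's feature values and the `cnt_min` and `cnt_max` figures are placeholders, not values taken from the dataset.

```python
# Coefficients copied from the equation above; everything is in min-max scaled units
INTERCEPT = 0.367
COEF = {
    "yr": 0.224, "workingday": 0.044, "temp": 0.539, "Winter": 0.051,
    "Sep": 0.062, "Sat": 0.056, "hum": -0.225, "windspeed": -0.124,
    "Spring": -0.117, "Jul": -0.076, "Rainy": -0.179,
}

def predict_scaled(day):
    """Scaled cnt for one day; features missing from `day` count as 0."""
    return INTERCEPT + sum(COEF[name] * value for name, value in day.items())

def to_raw_count(scaled, cnt_min, cnt_max):
    """Invert min-max scaling using the training minimum and maximum of cnt."""
    return scaled * (cnt_max - cnt_min) + cnt_min

# Hypothetical day: second year, working day, mid-range scaled weather values
day = {"yr": 1, "workingday": 1, "temp": 0.5, "hum": 0.5, "windspeed": 0.2}
scaled = predict_scaled(day)

# cnt_min and cnt_max are placeholders; the real values live in the fitted scaler
raw = to_raw_count(scaled, cnt_min=100, cnt_max=8100)
print(round(scaled, 4), round(raw, 1))
```

In the notebook, the actual minimum and maximum of cnt could be read from the fitted scaler's `data_min_` and `data_max_` attributes at the position of "cnt" in `cols_to_scale`.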

### Step7: Conclusion

- Boom Bikes clearly thrives during summer and fall, when there are ample customers using the bikes. This is supported by the EDA and by the positive coefficients of "Sep" and "temp".
- On clear days with high temperatures, more people tend to go out and hence use Boom Bikes more.
- The scenarios that need attention are bad weather conditions and the spring season.
- The company can introduce schemes and special offers for riders during these periods to increase ridership and boost its income.
- It can also offer portable umbrellas or rain suits, so that riders are confident about commuting in bad weather.

##### If the above steps are taken, business for Boom Bikes should thrive in the days to come.