# Shared Bikes Sales analysis

## Problem statement
- A US bike-sharing provider, BoomBikes, has recently suffered a considerable dip in revenue due to the ongoing Covid-19 pandemic.
- BoomBikes wants to understand the demand for shared bikes once the nationwide quarantine ends.
- The company wants to know:
    1. Which variables are significant in predicting the demand for shared bikes.
    2. How well those variables describe the bike demand.

## Solution

### Importing and understanding the available data
- Import the data 
- Clean the data and create derived variables that can give more insight
- Visualize the demand with different variables

In [None]:
# Suppress warnings

import warnings
warnings.filterwarnings('ignore')

In [None]:
import numpy as np 
import pandas as pd

In [None]:
df = pd.read_csv('../input/boombikes/day.csv')

In [None]:
df.head()

In [None]:
df.info()

In [None]:
df.isnull().sum()

In [None]:
df.head()

#### Dropping variables which are not required for evaluation

In [None]:
# The variables below are not useful for the analysis.
# temp and atemp are highly correlated, so we remove temp from the analysis.

# We also drop casual and registered, since cnt = casual + registered:
# keeping them would let the model reconstruct the target directly.
redundant_variables = ['instant', 'dteday', 'temp', 'casual', 'registered']

df = df.drop(redundant_variables, axis=1)
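The two reasons given for dropping columns can be checked before dropping them. The snippet below is a minimal sketch on a made-up toy frame (the values are not from `day.csv`); on the real data the same two checks would run on `df` before the `drop` call.

```python
import pandas as pd

# Toy stand-in for day.csv; the values are invented for illustration only.
toy = pd.DataFrame({
    "temp": [0.20, 0.35, 0.50, 0.65, 0.80],
    "atemp": [0.22, 0.36, 0.49, 0.62, 0.76],
    "casual": [100, 150, 300, 500, 700],
    "registered": [900, 1200, 2000, 2500, 3000],
})
toy["cnt"] = toy["casual"] + toy["registered"]

# Pearson correlation between the two temperature columns
temp_corr = toy["temp"].corr(toy["atemp"])

# True when cnt equals casual + registered on every row
cnt_is_sum = bool((toy["cnt"] == toy["casual"] + toy["registered"]).all())

print(round(temp_corr, 3), cnt_is_sum)
```

A correlation close to 1 supports keeping only one of the two temperature columns, and `cnt_is_sum` being `True` confirms that `casual` and `registered` would leak the target.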

In [None]:
df.head()

#### Now we can visualise the relationship of our target variable with other variables

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
sns.pairplot(df)
plt.show()

Among all the continuous variables, `cnt` shows an interesting relationship with `atemp`. Let's create a separate regplot for `atemp` vs `cnt`.

In [None]:
sns.regplot(x='atemp', y='cnt', data=df)
plt.show()

The plot shows that `cnt` tends to increase as the temperature increases.

Many of the variables are categorical, so boxplots are a better way to visualize the pattern across the categories.

In [None]:
plt.figure(figsize=(20,20))
plt.subplot(3,3,1)
sns.boxplot(x='season', y='cnt', data=df)
plt.subplot(3,3,2)
sns.boxplot(x='yr', y='cnt', data=df)
plt.subplot(3,3,3)
sns.boxplot(x='mnth', y='cnt', data=df)
plt.subplot(3,3,4)
sns.boxplot(x='holiday', y='cnt', data=df)
plt.subplot(3,3,5)
sns.boxplot(x='weekday', y='cnt', data=df)
plt.subplot(3,3,6)
sns.boxplot(x='workingday', y='cnt', data=df)
plt.subplot(3,3,7)
sns.boxplot(x='weathersit', y='cnt', data=df)
plt.show()

Influence of different variables on demand:
1. Season - Demand varies considerably with season. Spring has noticeably lower demand, while summer and fall have higher demand.
2. Year - 2019 has much higher demand. This suggests the business has good prospects and was seeing a positive trend in demand before Covid.
3. Month - This is a more detailed view of the seasons, and it shows the same pattern: the months corresponding to summer and fall have higher demand.
4. Holiday - Demand appears higher on non-holidays; the median differs noticeably from the median on holidays.
5. Weekday - Although the range of values differs across weekdays, the medians are close for all days. So, at first glance, weekday does not seem to affect demand much.
6. Working day - Working days have a slightly higher 25th percentile, but the medians are fairly close.
7. Weathersit - Weather has a strong influence on demand. Demand is highest in clear weather and drops when it is misty or cloudy. It drops further during rain, and there is almost no demand when it is snowy.

Let's also have a look at the correlations

In [None]:
plt.figure(figsize=(20,20))
sns.heatmap(df.corr(), annot=True, cmap="YlGnBu")
plt.show()

- The heatmap shows a high positive correlation of `cnt` with `atemp` and `yr`.
- `cnt` has a notable negative correlation with `windspeed` and `weathersit`.
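A heatmap of many variables can be hard to read, so it may help to pull out only the column for the target and sort it. This is a small sketch on invented numbers; on the notebook's data the same expression would be applied to `df`.

```python
import pandas as pd

# Invented values, used only to show the mechanics
toy = pd.DataFrame({
    "atemp": [0.2, 0.4, 0.5, 0.7, 0.8],
    "windspeed": [0.30, 0.25, 0.20, 0.15, 0.10],
    "cnt": [1000, 2100, 2900, 4200, 5000],
})

# Correlation of every predictor with cnt, strongest positive first
corr_with_cnt = toy.corr()["cnt"].drop("cnt").sort_values(ascending=False)
print(corr_with_cnt)
```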

### Data Preparation

In [None]:
# We have to first create dummy columns for our analysis
#season : season (1:spring, 2:summer, 3:fall, 4:winter)
season_dict = { 
    1: 'Spring',
    2: 'Summer',
    3: 'Fall',
    4: 'Winter'
}
# Months
month_dict = {
    1: 'Jan',
    2: 'Feb',
    3: 'Mar',
    4: 'Apr',
    5: 'May',
    6: 'Jun',
    7: 'Jul',
    8: 'Aug',
    9: 'Sep',
    10: 'Oct',
    11: 'Nov',
    12: 'Dec'
}
# weathersit : 
# 1: Clear, Few clouds, Partly cloudy, Partly cloudy
# 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
# 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
# 4: Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog
weathersit_dict = {
    1: 'Clear',
    2: 'Mist',
    3: 'Light_Snow',
    4: 'Heavy_Rain'
}
weekday_dict = {
    0: 'Sun',
    1: 'Mon',
    2: 'Tue',
    3: 'Wed',
    4: 'Thu',
    5: 'Fri',
    6: 'Sat',
}

Converting the categorical columns to strings, since `pd.get_dummies` encodes object (string) columns and names each dummy column after its category label.

In [None]:
df.weathersit = df.weathersit.apply(lambda x: weathersit_dict[x])
df.weathersit.value_counts()

In [None]:
df.mnth = df.mnth.apply(lambda x: month_dict[x])
df.season = df.season.apply(lambda x: season_dict[x])
df.weekday = df.weekday.apply(lambda x: weekday_dict[x])
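As an alternative to `apply` with a lambda, `Series.map` accepts the dictionary directly. A minimal sketch on a hypothetical series of season codes follows; one difference worth knowing is that `map` leaves `NaN` for codes missing from the dictionary instead of raising a `KeyError`, so it is worth counting them.

```python
import pandas as pd

# Hypothetical season codes; the dictionary mirrors season_dict above
season_codes = pd.Series([1, 2, 3, 4, 1])
season_names = {1: "Spring", 2: "Summer", 3: "Fall", 4: "Winter"}

labelled = season_codes.map(season_names)

# Number of codes that had no entry in the dictionary
unmapped = int(labelled.isna().sum())
print(labelled.tolist(), unmapped)
```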

Now that we have named categories, it's easier to interpret the variation of `cnt` across these categories, which we plotted earlier.

In [None]:
plt.figure(figsize=(10,12))
plt.subplot(4,1,1)
sns.boxplot(x='season', y='cnt', data=df)
plt.subplot(4,1,2)
sns.boxplot(x='weathersit', y='cnt', data=df)
plt.subplot(4,1,3)
sns.boxplot(x='mnth', y='cnt', data=df)
# Fourth panel of the same 4x1 grid
plt.subplot(4,1,4)
sns.boxplot(x='weekday', y='cnt', data=df)
plt.show()

In [None]:
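# dtype=int keeps the dummies as 0/1 integers; newer pandas versions return booleans by default
month_dummies = pd.get_dummies(df.mnth, drop_first = True, dtype = int)
season_dummies = pd.get_dummies(df.season, drop_first = True, dtype = int)
weathersit_dummies = pd.get_dummies(df.weathersit, drop_first = True, dtype = int)
weekday_dummies = pd.get_dummies(df.weekday, drop_first = True, dtype = int)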

In [None]:
df = pd.concat([df, month_dummies, season_dummies, weathersit_dummies, weekday_dummies], axis=1)
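To see what `drop_first=True` does, here is a small sketch on a hypothetical season column. With k categories it keeps k-1 dummy columns; the dropped category (the first in sorted order) becomes the baseline and is represented by a row of zeros.

```python
import pandas as pd

# Hypothetical values, for illustration only
seasons = pd.Series(["Spring", "Summer", "Fall", "Winter", "Fall"])

dummies = pd.get_dummies(seasons, drop_first=True, dtype=int)

# "Fall" sorts first, so it is the dropped baseline category
print(list(dummies.columns))
print(dummies.iloc[2].tolist())  # the row for "Fall"
```

Coefficients of the remaining dummies are then read relative to that baseline.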

In [None]:
df.head()

We can now remove the original columns for which we have the dummies

In [None]:
df.drop(['mnth', 'weekday', 'weathersit', 'season'], axis=1, inplace=True)
df.head()

### Splitting the Data into Training and Testing Sets

In [None]:
from sklearn.model_selection import train_test_split
df_train, df_test = train_test_split(df, train_size = 0.7, test_size = 0.3, random_state = 100)

### Rescaling the Features
We need to rescale all the features that are neither binary nor dummies so that their coefficients are comparable. We exclude the target variable, since we are interested in the coefficients of the predictor variables.
- Using MinMaxScaler here.
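Min-max scaling maps each value to `(x - min) / (max - min)`. The sketch below does this by hand with NumPy on made-up numbers, to show why the minimum and maximum should come from the training data only and then be reused on the test data.

```python
import numpy as np

# Made-up feature values
train = np.array([10.0, 20.0, 30.0, 40.0])
test = np.array([15.0, 45.0])

# The scaling parameters are learned from the training split only
lo, hi = train.min(), train.max()

train_scaled = (train - lo) / (hi - lo)
test_scaled = (test - lo) / (hi - lo)

print(train_scaled, test_scaled)
```

Note that a test value outside the training range (45 here) ends up outside [0, 1]. That is expected and is preferable to refitting the scaler on the test set.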

In [None]:
df.describe()

In [None]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()

In [None]:
num_vars = ['atemp', 'hum', 'windspeed']
df_train[num_vars] = scaler.fit_transform(df_train[num_vars])
df_train.head()

Dividing the train dataset into the target and predictor variables.

In [None]:
y_train = df_train.pop('cnt')
X_train = df_train

Using RFE to narrow down to the top 15 features

In [None]:
# Importing RFE and LinearRegression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

In [None]:
# Running RFE with the number of output variables equal to 15
lm = LinearRegression()
lm.fit(X_train, y_train)

# n_features_to_select must be passed by keyword in recent scikit-learn versions
rfe = RFE(lm, n_features_to_select=15)             # running RFE
rfe = rfe.fit(X_train, y_train)

In [None]:
list(zip(X_train.columns,rfe.support_,rfe.ranking_))

In [None]:
# Getting top15 columns
top15_cols = X_train.columns[rfe.support_]

top15_cols

In [None]:
# following columns are dropped from analysis
X_train.columns[~rfe.support_]
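For intuition on what RFE does, here is a small sketch on synthetic data (not the bike data). Only columns 0 and 3 drive the target, so RFE should keep those two. The data, seed, and coefficients are all assumptions made for the illustration.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# Only columns 0 and 3 influence the synthetic target
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + 0.01 * rng.normal(size=200)

# RFE repeatedly refits the model and drops the weakest feature
rfe_demo = RFE(LinearRegression(), n_features_to_select=2)
rfe_demo.fit(X, y)

selected = np.flatnonzero(rfe_demo.support_).tolist()
print(selected, rfe_demo.ranking_.tolist())
```

Selected features get rank 1 in `ranking_`; higher ranks were eliminated earlier.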

Getting model summary

In [None]:
# Creating X_train dataframe with RFE selected variables
X_train_rfe = X_train[top15_cols]

In [None]:
# Adding a constant variable 
import statsmodels.api as sm  
X_train_rfe = sm.add_constant(X_train_rfe)

In [None]:
lm = sm.OLS(y_train,X_train_rfe).fit()

In [None]:
print(lm.summary())

In [None]:
X_train_rfe = X_train_rfe.drop(['const'], axis = 1)

In [None]:
# Calculate the VIFs for the new model
from statsmodels.stats.outliers_influence import variance_inflation_factor

vif = pd.DataFrame()
X = X_train_rfe
vif['Features'] = X.columns
vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif
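The VIF of a feature is `1 / (1 - R²)`, where R² comes from regressing that feature on all the other features. The sketch below implements this textbook definition with NumPy on synthetic data. The helper name `vif_sketch` is hypothetical, and because it includes an intercept in each auxiliary regression, its numbers may differ from those returned by the statsmodels function used above.

```python
import numpy as np

def vif_sketch(X):
    """Textbook VIF: regress each column on the others (with intercept)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = []
    for i in range(k):
        target = X[:, i]
        others = np.delete(X, i, axis=1)
        design = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        r2 = 1.0 - resid.var() / target.var()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + 0.1 * rng.normal(size=200)  # nearly a copy of a

vifs = vif_sketch(np.column_stack([a, b, c]))
print(vifs.round(2))
```

The near-duplicate columns `a` and `c` get large VIFs, while the independent column `b` stays close to 1.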

In [None]:
# Dropping hum since its VIF is very high, i.e. it is highly collinear with the other predictors.
X_train_new = X_train_rfe.drop(['hum'], axis = 1)

In [None]:
# Rebuilding the model
X_train_lm = sm.add_constant(X_train_new)
lm = sm.OLS(y_train,X_train_lm).fit()
print(lm.summary())

The R-squared value for the model is good. All the predictor variables are significant, since their p-values are very low.

In [None]:
#Let's check for collinearity again in the new model
vif = pd.DataFrame()
X = X_train_new
vif['Features'] = X.columns
vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

`atemp` has a high VIF. We can drop it. 

In [None]:
X_train_new = X_train_new.drop(['atemp'], axis = 1)
# Rebuilding the model
X_train_lm = sm.add_constant(X_train_new)
lm = sm.OLS(y_train,X_train_lm).fit()
print(lm.summary())

Dropping July now.

In [None]:
X_train_new = X_train_new.drop(['Jul'], axis = 1)
# Rebuilding the model
X_train_lm = sm.add_constant(X_train_new)
lm = sm.OLS(y_train,X_train_lm).fit()
print(lm.summary())

Dropping Winter

In [None]:
X_train_new = X_train_new.drop(['Winter'], axis = 1)
# Rebuilding the model
X_train_lm = sm.add_constant(X_train_new)
lm = sm.OLS(y_train,X_train_lm).fit()
print(lm.summary())

- All the variables are significant now.
- Let's check the VIF now

In [None]:
#Let's check for collinearity again in the new model
vif = pd.DataFrame()
X = X_train_new
vif['Features'] = X.columns
vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

So, we now have a fairly stable model with very little multicollinearity and a good adjusted R-squared value of 0.789.
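Adjusted R-squared penalizes R-squared for the number of predictors: `1 - (1 - R²)(n - 1) / (n - p - 1)`, with `n` observations and `p` predictors. A minimal sketch with hypothetical inputs (not the values from the summary above):

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R-squared for a model with an intercept."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Hypothetical numbers, chosen only to illustrate the formula
value = adjusted_r2(r2=0.80, n_obs=101, n_predictors=10)
print(round(value, 4))
```

Adding a predictor that barely improves R-squared lowers this value, which is why it is the more useful figure when comparing models with different numbers of variables.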

### Residual analysis on the training data

In [None]:
y_train_cnt = lm.predict(X_train_lm)

In [None]:
fig = plt.figure()
# histplot with kde=True replaces the deprecated distplot
sns.histplot((y_train - y_train_cnt), bins = 50, kde = True)
fig.suptitle('Error Terms', fontsize = 20)                  # Plot heading 
plt.xlabel('Errors', fontsize = 18)                         # X-label
plt.show()
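Besides the plot, two quick numeric checks on the residuals can be useful: their mean should be essentially zero for a model with an intercept, and their skewness should be small if they are roughly symmetric. The sketch below shows both on a synthetic regression; on the notebook's model the residuals would be `y_train - y_train_cnt`.

```python
import numpy as np

# Synthetic data with normally distributed noise, for illustration only
rng = np.random.default_rng(1)
x = rng.uniform(size=300)
y = 2.0 + 3.0 * x + rng.normal(scale=0.5, size=300)

# Ordinary least squares with an intercept
design = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
residuals = y - design @ coef

resid_mean = residuals.mean()
z = (residuals - resid_mean) / residuals.std()
skewness = (z ** 3).mean()

print(resid_mean, skewness)
```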

### Making predictions

In [None]:
df_test[num_vars] = scaler.transform(df_test[num_vars])

#### Dividing into X_test and y_test

In [None]:
y_test = df_test.pop('cnt')
X_test = df_test

In [None]:
# Using only the filtered columns present in X_train_new
X_test_new = X_test[X_train_new.columns]

In [None]:
# Adding a constant variable 
X_test_new = sm.add_constant(X_test_new)

In [None]:
# Making predictions
y_pred = lm.predict(X_test_new)

#### Let's compute the final r2_score

In [None]:
from sklearn.metrics import r2_score
r2_score(y_test, y_pred)
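For reference, the R-squared score is `1 - SS_res / SS_tot`: one minus the ratio of the squared prediction errors to the squared deviations from the mean. A small hand computation on made-up numbers:

```python
import numpy as np

# Made-up true values and predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 9.5])

ss_res = ((y_true - y_hat) ** 2).sum()          # squared prediction errors
ss_tot = ((y_true - y_true.mean()) ** 2).sum()  # squared deviations from the mean

r2 = 1.0 - ss_res / ss_tot
print(r2)
```

Comparing this test-set score with the training R-squared gives a rough sense of how well the model generalizes.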

### Conclusions and recommendations to the bike company

#### Statistics

In [None]:
print(lm.summary())
# Total variables considered - 11

#### Recommendations:
- `yr` has a positive coefficient, so demand is increasing year over year. This gives confidence that the business model has the potential for high demand once the Covid problems are over.
- Demand is higher on working days and Saturdays. To boost demand on non-working days, the company could introduce offers for those days (especially Sunday).
- Bad weather (snow and mist) and winter adversely affect demand. In fact, demand stays low until spring. The company needs a strategy and offers for this period.
