# BoomBikes Regression

## Objective:  To understand the factors affecting the demand for the shared bikes in the American market after quarantine.

**To find:**
   - Which variables are significant in predicting the demand for shared bikes?
   - How well do those variables describe the bike demand?

In [None]:
# Importing packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_palette('pastel')

In [None]:
# Importing dataset
df = pd.read_csv('../input/boombikes/day.csv')
df.shape

**There are 730 rows and 16 columns (including the target).**

In [None]:
# Let's understand the dataset
df.info()

**Mix of categorical and numerical data types.
<br> cnt is the target variable.**

In [None]:
# Let's look at the values.
df.describe()

**Range is varied. Will have to scale later.**

In [None]:
# Any null values?
# The non-null counts from df.info() already suggest there are no nulls,
# but it is better to confirm explicitly.
df.isnull().sum()

**Alright, dataset with no null values. That's good.**

In [None]:
# What does our data look like?
df.head(5)

## Data Cleaning

In [None]:
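# Parse the date column so the weekday and holiday flags can be checked against real dates.
df['dteday'] = pd.to_datetime(df['dteday'], format = '%d-%m-%Y')
# Inspect the last 60 rows from the second year (yr == 1) whose weekday is encoded as 0.
df[(df['weekday'] == 0) & (df['yr'] == 1)].tail(60)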

In [None]:
# df[(df['holiday'] == 0) & (df['workingday'] == 0) & (df['yr'] == 1)]
df[df['yr'] == 1].head(31)

In [None]:
# The weekday and holiday columns turned out not to be encoded properly.
# Ideally we would ask the company to correct the data issues.
# Since that is not possible here, we will continue with the data as it is.
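
One way to inspect the weekday encoding is to compare each encoded value with the day name implied by the date itself. The sketch below uses a few made-up rows (they are not taken from `day.csv`); the frame `sample` and the column `actual_day` are illustrative names only.

```python
import pandas as pd

# Synthetic rows for illustration only; the weekday codes are hypothetical.
sample = pd.DataFrame({
    "dteday": ["01-01-2018", "02-01-2018", "03-01-2018", "06-01-2018"],
    "weekday": [6, 0, 1, 4],
})
sample["dteday"] = pd.to_datetime(sample["dteday"], format="%d-%m-%Y")

# Day name implied by the calendar date.
sample["actual_day"] = sample["dteday"].dt.day_name()

# For each encoded value, list the day names it co-occurs with.
# A consistent encoding maps every code to exactly one day name.
mapping = sample.groupby("weekday")["actual_day"].unique()
print(mapping)
```

If a code shows up next to more than one day name, the encoding is inconsistent and should be raised with the data owner.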

In [None]:
# Let's start by removing unwanted variables.
df = df.drop(['instant','dteday','casual','atemp','registered'], axis = 1)
df

**Deleting atemp as it is highly related to temp.
<br>Deleting instant, dteday, casual and registered as they are not important.
<br>A quick Google search says that humidity and windspeed are not directly related. Hence keeping both of them.**

In [None]:
# Looking at holiday and workingday. They seem to carry the same information.
df[['holiday','workingday']].sample(15)

In [None]:
df[df['workingday'] == 0]['holiday'].value_counts()

**From the data dictionary we know that**
 <br>holiday:
  - 0 => not a public holiday, i.e., weekend or weekday, hence working day can be 0 or 1
  - 1 => public holiday, i.e., working day = 0
<br>**Hence if we delete holiday, we might lose the information about public holidays.
<br>So, we'll keep it until we check VIF.**
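
The relationship between the two flags described above can be tabulated with a cross-tabulation. This is a minimal sketch on made-up flags (the frame `toy` is hypothetical and its counts say nothing about the real dataset).

```python
import pandas as pd

# Synthetic flags for illustration; not rows from the real dataset.
toy = pd.DataFrame({
    "holiday":    [0, 0, 0, 1, 0, 0, 1],
    "workingday": [1, 1, 0, 0, 1, 0, 0],
})

# Rows: holiday flag, columns: workingday flag, cells: number of days.
table = pd.crosstab(toy["holiday"], toy["workingday"])
print(table)
```

If the data dictionary holds, the cell for holiday = 1 and workingday = 1 should be zero, while holiday = 0 should appear with both working-day values.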


In [None]:
# Mapping numeric codes to readable category labels.
# Assigning the result back avoids calling replace(inplace=True) on a column selection.
df['season'] = df['season'].replace({1:'spring', 2:'summer',3:'fall',4:'winter'})
df['mnth'] = df['mnth'].replace({1:'jan', 2:'feb',3:'march',4:'april',5:'may',6:'june',7:'july',
                                 8:'august',9:'sept',10:'oct',11:'nov',12:'dec'})
df['workingday'] = df['workingday'].replace({1:'working', 0:'holiday'})
df['holiday'] = df['holiday'].replace({1:'pubholiday', 0:'n_pubholiday'})
df['weekday'] = df['weekday'].replace({6:'mon', 0:'tue',1:'wed',2:'thu',3:'fri',4:'sat',5:'sun'})
df['weathersit'] = df['weathersit'].replace({1:'clear', 2:'misty',3:'light_rain_snow',4:'heavy_rain_snow'})
df['yr'] = df['yr'].replace({0:'2018', 1:'2019'})

**The same pattern holds for weekday and working day. We know that by default Saturdays and Sundays will be holidays.
<br>But if we delete one of them, we'll lose data on weekdays that were a holiday,
<br>for example a Monday that was a public holiday. But there are only 21 of them.**


In [None]:
# Let's check
df[df['workingday'] == 'holiday'][['weekday','workingday','holiday']]
# Hence, we can say that there are holidays that are neither public holidays nor weekends. So it is better to keep all 3.

In [None]:
# Now let's look at the data
df

## EDA

In [None]:
# Let's start by looking at pairplot
sns.pairplot( data = df)
plt.show()

**We can see that humidity and windspeed are not linearly related to our target variable.
<br>So, we'll go ahead and remove them.
<br>But there is no multicollinearity. So, that is good.**
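
The visual impression from a pairplot can be backed by a number, for example the Pearson correlation of each predictor with the target. The sketch below runs on synthetic data (the frame `toy` and its values are invented for illustration), and Pearson correlation only captures linear association.

```python
import numpy as np
import pandas as pd

# Synthetic data for illustration only; the real columns may behave differently.
rng = np.random.default_rng(0)
temp = np.linspace(0, 1, 50)
toy = pd.DataFrame({
    "temp": temp,
    "hum": rng.normal(size=50),                          # unrelated noise
    "cnt": 2 * temp + rng.normal(scale=0.05, size=50),   # nearly linear in temp
})

# Pearson correlation of every predictor with the target.
corr_with_target = toy.corr()["cnt"].drop("cnt")
print(corr_with_target.round(2))
```

A value close to 0 supports the reading that a variable has no linear relationship with the target, although it does not rule out a non-linear one.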

In [None]:
# Deleting humidity and windspeed as they are not linearly related to our target.
df.drop(['hum','windspeed'], axis = 1, inplace = True)

In [None]:
# Storing categorical variables as a list
cat_vars = ['season','yr', 'mnth', 'holiday', 'weekday', 'workingday', 'weathersit']
# Looking at the count of the categories.
plt.figure(figsize = (18,15))
count=1
for i in cat_vars:
    plt.subplot(4,2,count)
    sns.countplot(x = i, data = df)
    count+=1

plt.show()

**From the above countplot we can see that:**
- Fall has the most data points of all seasons.
- Both years have the same number of data points.
- We have fewer data points for Feb, most likely because it is the shortest month rather than because of cold weather.
- There are more working days than holidays (unfortunately!), so the holiday count is lower.
- The weather is clear for most of the year, hence the count is highest for that category.

**However, a countplot only shows the number of data points in each category, not the number of rentals.
<br> Hence, let's compare with our target variable.**
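
The difference between counting rows and summarising the target can be seen on a tiny example. This is a sketch on invented numbers (the frame `toy` is hypothetical): the category with more rows is not the one with more rentals per day.

```python
import pandas as pd

# Invented numbers for illustration only.
toy = pd.DataFrame({
    "season": ["fall", "fall", "fall", "spring", "spring"],
    "cnt":    [100, 120, 110, 300, 320],
})

# What a countplot shows: number of rows (days) per category.
days_per_season = toy["season"].value_counts()

# What a barplot of cnt shows by default: mean of the target per category.
mean_cnt = toy.groupby("season")["cnt"].mean()

print(days_per_season)
print(mean_cnt)
```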

#### Bivariate Analysis

In [None]:
# Let's now check how many bikes were rented throughout each category.
plt.figure(figsize = (18,15))
count=1
for i in cat_vars:
    plt.subplot(4,2,count)
    sns.barplot(y = "cnt",x = i, data = df)
    count+=1

plt.show()

**From the above plots we can see that:**
- During the fall, more bikes were rented. Nice weather may be the reason behind this.
- The number of bikes rented was much higher in 2019 than in 2018. Maybe the company started getting recognition.
- June to Sept is summer to fall in the US, which is quite pleasant, and people like to spend time outdoors. Hence, the number is higher.
- Whether it is a weekday or a holiday does not seem to have much effect. Maybe people are renting for all purposes throughout the week.
- Few people ride a bike when it is snowing or raining, for obvious reasons. Hence the number is highest on clear days.

**Therefore we can guess that month, season and weather situation have more effect on the target variable than the others. We will confirm it later with the help of correlation.**

## Data Preparation

In [None]:
# Converting categorical to dummy variables
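# dtype = int keeps the dummies numeric (0/1) so that statsmodels can use them later.
df = pd.concat([df, pd.get_dummies(df[cat_vars], drop_first = True, dtype = int)], axis = 1)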
df.drop(cat_vars, axis = 1, inplace = True)
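
To see what `drop_first = True` does, here is a minimal sketch on a made-up column (the frame `toy` is hypothetical, and `dtype=int` is an assumption made here to get 0/1 integers). A variable with k categories becomes k - 1 dummy columns, and the dropped category becomes the baseline.

```python
import pandas as pd

# Made-up column for illustration.
toy = pd.DataFrame({"season": ["spring", "summer", "fall", "winter", "fall"]})

# Categories are ordered alphabetically, so 'fall' is the first one and gets dropped.
dummies = pd.get_dummies(toy, columns=["season"], drop_first=True, dtype=int)
print(dummies)
```

A row that is all zeros therefore stands for the baseline category, here `fall`.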

- From the plots above we saw that the target variable is positively related to temp and 2019. That is, as these two increase, the number of rentals increases.
- Feb and spring, working day and Saturday, and October and winter seem to be correlated. However, whether to delete them will be decided based on VIF.

In [None]:
# Splitting into train and test
from sklearn.model_selection import train_test_split
df_train, df_test = train_test_split(df, train_size = 0.7, random_state = 100)

In [None]:
# Scaling temp and cnt variables.
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()

# Fitting & Transforming Train 
df_train[['cnt','temp']] = scaler.fit_transform(df_train[['cnt','temp']])

# Transforming test
df_test[['cnt','temp']] = scaler.transform(df_test[['cnt','temp']])
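
The reason for fitting on train and only transforming test is that min-max scaling uses the minimum and maximum of the data it was fitted on. The sketch below reproduces the formula `(x - min) / (max - min)` in plain NumPy on invented numbers; it illustrates the idea and is not the scikit-learn implementation.

```python
import numpy as np

# Invented values for illustration.
train = np.array([10.0, 20.0, 40.0])
test = np.array([5.0, 30.0, 50.0])

# The statistics come from the training data only.
lo, hi = train.min(), train.max()

train_scaled = (train - lo) / (hi - lo)
test_scaled = (test - lo) / (hi - lo)   # reuses the train min and max

print(train_scaled)
print(test_scaled)
```

Scaled test values can fall outside [0, 1] when the test set contains values beyond the training range, which is expected.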

In [None]:
# Checking if properly scaled
df_train.describe()

In [None]:
# Dividing train as X, y
y_train = df_train.pop('cnt')
X_train = df_train

## Linear Regression

In [None]:
X_train.shape

In [None]:
# We'll start by using RFE to coarse tune our model.
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import RFE
lr = LinearRegression()
rfe = RFE(lr, n_features_to_select = 10)
rfe = rfe.fit(X_train,y_train)

In [None]:
# Columns and their rank have been stored as a dataframe.
rank = pd.DataFrame(list(zip( X_train.columns, rfe.support_, rfe.ranking_)))

In [None]:
# Let's see which features RFE dropped and which it selected.
rank.sort_values(by = 2)

**Out of 27 columns, 10 are retained as expected.**


In [None]:
# Save these as a separate training data.
X_train_RFE = X_train[list(X_train.columns[rfe.support_])]
X_train_RFE.sample(10)

In [None]:
# Now let's build a model with the selected 10 values.
import statsmodels.api as sm
X_train_RFE = sm.add_constant(X_train_RFE)
model1 = sm.OLS(y_train, X_train_RFE).fit()
print(model1.summary())

**From the summary we can make following inferences:**
- The R^2 is 0.822 and the adjusted R^2 is 0.818. The two are close, which is a good sign.
- The p-values are approximately 0 for every variable except weekday_mon. We'll check VIF before removing it.
- A few variables have negative coefficients, for example season_spring, public holiday, misty and rainy days. This means that they affect the target negatively. Their p-values are < 0.05, so these effects are significant.
- We would expect the number of rentals to be high during spring. However, the coefficient is negative, which is strange.
- Rain, snow and misty weather make the roads unsafe for travel, so it makes sense that rentals go down in these conditions.
- The coefficient for public holiday is also negative. These are days that families spend together, or days with large processions or crowds. We could infer that this is why the number of rentals drops on holidays.

**Let's know more with VIF.**


In [None]:
# Checking Variance Inflation Factor
from statsmodels.stats.outliers_influence import variance_inflation_factor

#Let's check the documentation to know more about this.
help(variance_inflation_factor)

In [None]:
# Let's check VIF for one variable
variance_inflation_factor(X_train_RFE.values,2)

**Hence, to check the VIF we need to pass the values as a matrix together with the column index of the independent variable.**
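
For intuition, the VIF of a column is 1 / (1 - R^2), where R^2 comes from regressing that column on all the other columns. The sketch below implements this definition in plain NumPy on synthetic data; it is not the statsmodels code, the helper name `vif` is made up, and it assumes the matrix contains a constant column and that `j` points at a non-constant column.

```python
import numpy as np

def vif(X, j):
    """VIF of column j: 1 / (1 - R^2) from regressing column j on the other columns."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    coef, *_ = np.linalg.lstsq(others, y, rcond=None)
    resid = y - others @ coef
    centered = y - y.mean()
    r2 = 1 - (resid @ resid) / (centered @ centered)
    return 1.0 / (1.0 - r2)

# Synthetic predictors: c is almost a copy of a, b is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + 0.1 * rng.normal(size=200)
X = np.column_stack([np.ones(200), a, b, c])

print(vif(X, 1), vif(X, 2), vif(X, 3))
```

The nearly duplicated columns get a large VIF, while the independent one stays close to 1.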

In [None]:
# Now we will check VIF for all variables by storing it in a DataFrame.
vif = pd.DataFrame()
vif['Features'] = X_train_RFE.columns
vif['VIF'] = [variance_inflation_factor(X_train_RFE.values, i) for i in range(X_train_RFE.shape[1])]
vif['VIF'] = round(vif['VIF'],2)
vif.sort_values(by = 'VIF', ascending = False)

**From the table above we can see that the constant has the highest VIF, but we need it, so we ignore that.<br> The VIF of weekday_mon is 1.01, which is very low. But it is insignificant (high p-value). So, we will drop this variable.**

In [None]:
X = X_train_RFE.drop('weekday_mon', axis = 1)

In [None]:
# Building the model again
X_lr = sm.add_constant(X)
model2 = sm.OLS(y_train, X_lr).fit()
print(model2.summary())

**According to the summary:**
- R^2 has dropped by 0.001 while adjusted R^2 is unchanged. Deleting the variable was a good idea because R^2 and adjusted R^2 are now slightly closer.
- All p-values are < 0.05, which makes the variables significant.
- Those which had negative coefficients are still negative. We can then conclude that they affect the target negatively. That is, the target variable decreases when these variables increase.


## Residual Analysis

In [None]:
# Checking for multicollinearity.
corr = df.corr()
fig, axes = plt.subplots(figsize = (20,10))
# Mask weak correlations (|r| < 0.4) so that only the stronger ones are annotated.
sns.heatmap(corr, annot = True, cmap = 'coolwarm', ax = axes, mask = corr.abs() < 0.4)
plt.show()

In [None]:
corr_lr = X_lr.corr()
fig, axes = plt.subplots(figsize = (20,10))
sns.heatmap(corr_lr, annot = True, cmap = 'coolwarm', ax = axes, mask = corr_lr.abs() < 0.4)
plt.show()

In [None]:
vif2 = pd.DataFrame()
vif2['Features'] = X_lr.columns
vif2['VIF'] = [variance_inflation_factor(X_lr.values, i) for i in range(X_lr.shape[1])]
vif2

**Although the correlation coefficient is 0.6, the VIF for season_spring is 4.729, which is below 5. Hence we can ignore the correlation.**

In [None]:
# Calculating Residuals.
y_pred = model2.predict(X_lr)
res = y_train - y_pred

In [None]:
# Checking normality of errors.
from statsmodels.graphics.gofplots import qqplot
qqplot(res, line = 's')
plt.show()

In [None]:
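# distplot is deprecated in recent seaborn versions; histplot with a KDE shows the same picture.
sns.histplot(res, kde = True)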
plt.show()

In [None]:
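# Checking homoscedasticity (constant variance): fitted values on x, residuals on y.
# Recent seaborn versions require x and y to be passed as keyword arguments.
sns.regplot(x = y_pred, y = res, line_kws = {'color':'red'})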
plt.show()

In [None]:
# No autocorrelation
# In the summary we saw that the Durbin-Watson test value is 2.074. A value close to 2 indicates no autocorrelation.
# Hence, the assumption holds.
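
For reference, the Durbin-Watson statistic is the sum of squared differences between consecutive residuals divided by the sum of squared residuals. The sketch below computes it in plain NumPy on invented residuals; the helper name `durbin_watson_stat` is made up for this example (statsmodels provides its own `durbin_watson` in `statsmodels.stats.stattools`).

```python
import numpy as np

def durbin_watson_stat(residuals):
    """Sum of squared successive differences divided by the sum of squared residuals."""
    e = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

# Invented residuals for illustration.
alternating = [1, -1, 1, -1]          # negative autocorrelation, value above 2
clustered = [1, 1, 1, -1, -1, -1]     # positive autocorrelation, value below 2

print(durbin_watson_stat(alternating))
print(durbin_watson_stat(clustered))
```

The statistic ranges from 0 to 4, and values near 2 are what we hope to see.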

## Predictions using Final model

In [None]:
# Storing the names of selected features
index = list(X_lr.columns)
index.remove('const')

In [None]:
# Selecting the features and storing it as X_test
X_test = df_test[index]
X_test = sm.add_constant(X_test)
y_test = df_test.pop('cnt')

In [None]:
# Predicting on the test set
y_test_pred = model2.predict(X_test)

In [None]:
# Calculating R^2 and adjusted R^2
from sklearn.metrics import r2_score
r2 = r2_score(y_true = y_test, y_pred = y_test_pred)
n = X_test.shape[0]        # number of test observations
p = X_test.shape[1] - 1    # number of predictors, excluding the constant
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

In [None]:
# Let's compare them
r2, adj_r2

**This is really good. There is only a 0.01 difference between R^2 and adjusted R^2. The model performed exceptionally well on the test data. Also, about 80% of the variance is explained by the model.**
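
As a reminder, adjusted R^2 is 1 - (1 - R^2)(n - 1)/(n - p - 1), where n is the number of observations and p the number of predictors without the constant. Below is a small sketch of that formula; the helper name `adjusted_r2` and the numbers passed to it are illustrative, not results from this notebook.

```python
def adjusted_r2(r2, n_samples, n_predictors):
    """Adjusted R^2 for a model with an intercept and n_predictors features."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)

# Illustrative inputs only.
example = adjusted_r2(0.9, 101, 10)
print(round(example, 4))
```

The penalty grows with the number of predictors, so adjusted R^2 never exceeds R^2 for a model with at least one predictor.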

In [None]:
# Now let's see how much each feature contributes to the target variable.
model2.params
# The coefficients tell us how much the (scaled) count changes for a unit change in each feature.

cnt = 0.141396 + 0.489605 × temp - 0.064774 × season_spring + 0.052311 × season_summer + 0.095684 × season_winter + 0.233162 × yr_2019 + 0.095415 × mnth_sept - 0.099109 × holiday_pubholiday - 0.299799 × weathersit_light_rain_snow - 0.077023 × weathersit_misty

**We can say that temperature has the greatest effect on rentals.**
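
The fitted equation can be applied by hand to see how the coefficients combine. The sketch below copies the coefficients from the equation above; the helper `predict_scaled_cnt` and the example inputs are made up, and both `temp` and the returned count are on the min-max scale used for training, not in original units.

```python
# Coefficients copied from the fitted equation above.
INTERCEPT = 0.141396
COEFFS = {
    "temp": 0.489605,
    "season_spring": -0.064774,
    "season_summer": 0.052311,
    "season_winter": 0.095684,
    "yr_2019": 0.233162,
    "mnth_sept": 0.095415,
    "holiday_pubholiday": -0.099109,
    "weathersit_light_rain_snow": -0.299799,
    "weathersit_misty": -0.077023,
}

def predict_scaled_cnt(features):
    """Scaled cnt for a dict of feature values; features left out count as 0."""
    return INTERCEPT + sum(COEFFS[name] * value for name, value in features.items())

# Hypothetical day: mid-range scaled temperature in 2019, baseline for everything else.
print(predict_scaled_cnt({"temp": 0.5, "yr_2019": 1}))
```

Turning the result back into an actual number of rentals would need the minimum and maximum of `cnt` learned by the scaler, which are not shown here.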

In [None]:
import warnings
warnings.filterwarnings('ignore')