# Predicting Bike Sharing Demand with Ensembles

A detailed explanation of a similar project can be found in my article at the following link:

https://www.linkedin.com/pulse/predicting-bike-sharing-demand-ensemble-methods-oğuz-can-yurteri/

## Introduction

A bike sharing system is a modern form of bike rental in which individuals borrow a bike from one dock and return it at another dock belonging to the same system. Membership, rental and return processes are generally automated and digital. As of mid-2018, there were approximately 1,600 bike sharing programs with 18.2 million bikes in over 1,000 cities.

Apart from their positive effects on traffic, the environment and health, bike sharing systems are attractive because of the data they generate for research purposes. Departure and arrival times, locations and duration of usage are explicitly recorded in these systems. This data can be used to measure mobility and detect important events in a city.

The goal of this project is to predict hourly bike usage based on environmental and seasonal conditions.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import norm
from numpy import median
from numpy import mean

import xgboost as xgb
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV,cross_val_score
from sklearn.metrics import mean_squared_log_error as msle 
from sklearn.metrics import mean_absolute_error as mae
from sklearn.inspection import permutation_importance

import warnings
warnings.filterwarnings("ignore")

data = pd.read_csv('/kaggle/input/bike-sharing-dataset/hour.csv')

data.head()

To improve readability, the attributes are renamed as follows:

- instant: instant
- dteday: date
- season: season
- yr: year
- mnth: month
- hr: hour
- holiday: holiday
- weekday: weekday
- workingday: workingday
- weathersit: weathersit
- temp: temp
- atemp: feel_temp
- hum: humidity
- windspeed: windspeed
- casual: casual
- registered: registered
- cnt: bike_use (Target variable)

The target variable to be predicted is bike_use. Distinguishing between casual and registered users is outside the scope of the project, so these two attributes are removed from the dataset. The instant attribute is just an index, so it is also removed.


In [None]:
data.columns = ['instant','date','season','year','month','hour','holiday','weekday','workingday','weathersit','temp','feel_temp','humidity','windspeed','casual','registered','bike_use']

data.drop(columns='instant',inplace=True)

## Exploratory Data Analysis

There are 17379 instances in the dataset, which is fewer than the number of hours in two full years, so some dates and hours are missing. However, there are no blank rows or columns.
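The size of the gap can be checked with a little arithmetic. The sketch below assumes the data spans exactly the calendar years 2011 and 2012, which is what the year attribute suggests; the variable names are illustrative only.

```python
from datetime import date

# Assumption: the data covers 1 Jan 2011 through 31 Dec 2012 inclusive.
n_days = (date(2013, 1, 1) - date(2011, 1, 1)).days  # 2012 is a leap year
expected_hours = n_days * 24
observed_hours = 17379  # number of instances reported above

missing_hours = expected_hours - observed_hours
print(n_days, expected_hours, missing_hours)
```

Under that assumption, the difference between expected and observed hours gives the number of missing hourly records.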

The number of instances is balanced across seasons, years, months, hours and days, as expected. The missing dates and hours therefore do not appear to be systematic and are most probably due to problems in data acquisition, so they are not treated as an issue.

Holidays and non-working days are fewer than non-holidays and working days, and weather situation 1 (clear, few clouds, partly cloudy) is the most frequent, as expected.

In [None]:
cat_var = ['season','year','month','hour']

for var in cat_var:
    sns.countplot(x=var,data=data)
    plt.show()

In [None]:
cat_var2 = ['holiday', 'weekday','workingday', 'weathersit']

for var in cat_var2:
    sns.countplot(x=var,data=data)
    plt.show()

The distributions of the continuous variables temp and feel_temp are close to normal, whereas windspeed is right-skewed and humidity is left-skewed.

In [None]:
cont_var = ['temp', 'feel_temp','humidity', 'windspeed']

for var in cont_var:
    sns.histplot(data[var])
    plt.show()

The distribution of the target variable, bike_use, is right-skewed and its values range between 0 and 1000.

In [None]:
sns.histplot(data.bike_use)

The correlation map shows that people tend to use bikes less in bad weather and high humidity, but more at high temperatures. The temp, feel_temp and humidity variables should therefore be used as inputs in modelling. There is also a strong positive correlation between temp and feel_temp.

In [None]:
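# Select the weather-related columns and the target by name rather than by position
corr_cols = ['weathersit','temp','feel_temp','humidity','windspeed','bike_use']
corrmat = data[corr_cols].corr()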

sns.heatmap(corrmat, vmax=0.9,square=True, annot=True)

The seasonal effect on bike usage is strong, and demand levels also differ between months within the same season. For this reason, both season and month should be used as inputs in the model.

Although both season and month are nominal categorical variables, only the season attribute is one-hot encoded: adding 12 extra attributes for month would increase sparsity considerably, which is not worth it given the low variation between months within the same season.
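As a small illustration of what one-hot encoding does to the column count, the sketch below encodes a toy frame. The values are made up; only the column names and the `season_is` prefix follow this notebook.

```python
import pandas as pd

# Toy frame with hypothetical values; only the column names follow the dataset.
toy = pd.DataFrame({"season": [1, 2, 3, 4, 1], "month": [1, 4, 7, 10, 12]})

# One indicator column is created per distinct season value.
encoded = pd.get_dummies(toy, columns=["season"], prefix="season_is")
print(encoded.columns.tolist())
```

Encoding month the same way would add one column per month, and every row would have exactly one non-zero entry among them.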

Total bike usage in 2012 is higher than in 2011, which suggests that both the number of users and the demand for the system increased over time, so the year attribute should also be used as an input in the model.

Bike usage peaks during rush hours. A closer look at the relationship between bike_use and hour shows that hours 7, 8, 9, 17, 18, 19 and 20 behave differently on working days. To capture this, a new feature called is_towork is defined (1: hour is 7, 8, 9, 17, 18, 19 or 20 and workingday is 1; 0: otherwise) to flag rush hours. The hour attribute could be one-hot encoded instead, but adding 24 extra attributes would increase the sparsity and dimensionality of the data too much. So, the original hour attribute and the new is_towork attribute are used together as inputs.

In [None]:
cat_var = ['season','year','month','hour']

for var in cat_var:
    sns.barplot(x=var,y='bike_use',data=data,estimator=sum)
    plt.show()

In [None]:
f, ax = plt.subplots(figsize=(8, 6))
fig = sns.boxplot(x='hour', y="bike_use", hue='workingday', data=data)

In [None]:
cat_var2 = ['holiday', 'weekday','workingday', 'weathersit']

for var in cat_var2:
    sns.barplot(x=var,y='bike_use',data=data,estimator=mean)
    plt.show()

In [None]:
var = 'weekday'
data1 = pd.concat([data['bike_use'], data[var]], axis=1)
f, ax = plt.subplots(figsize=(8, 6))
fig = sns.boxplot(x=var, y="bike_use", data=data1)

In [None]:
data.drop(columns=['date','casual','registered'],inplace=True)

# Weekend flag: weekday codes 0 and 6 are treated as the weekend
data['is_weekend'] = data['weekday'].isin([0, 6]).astype(int)

# Rush-hour flag: commuting hours on a working day
rush_hours = [7, 8, 9, 17, 18, 19, 20]
data['is_towork'] = (data['hour'].isin(rush_hours) & (data['workingday'] == 1)).astype(int)

data = pd.get_dummies(data, columns=["season"],prefix='season_is')

Based on the exploratory data analysis, the following attributes will be used as inputs of the regression model to predict bike_use:

- year
- month
- hour
- holiday
- weekday
- workingday
- weathersit
- temp
- feel_temp
- humidity
- windspeed
- is_weekend
- is_towork
- season_is_1 (one-hot encoded season variable)
- season_is_2 (one-hot encoded season variable)
- season_is_3 (one-hot encoded season variable)
- season_is_4 (one-hot encoded season variable)

## Modelling

The dataset is divided into a training set (80% of the data) and a test set (20% of the data). Both are representative of the overall data (they cover all seasons, months, days, hours etc.).
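One way to check that both subsets cover the same category values is to compare their sets of unique values per column. The sketch below does this on a synthetic stand-in frame, not on the real dataset; the `toy` frame and its values are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 200 days of hourly rows with a hypothetical weekday column.
toy = pd.DataFrame({"hour": list(range(24)) * 200})
toy["weekday"] = (toy.index // 24) % 7

train, test = train_test_split(toy, test_size=0.2, random_state=42)

# For each column, check that train and test both contain every value.
coverage = {
    col: set(train[col]) == set(test[col]) == set(toy[col])
    for col in ["hour", "weekday"]
}
print(len(train), len(test), coverage)
```

The same comparison could be run on `x_train` and `x_test` for the categorical attributes of the real data.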

In order to predict hourly bike usage, two decision tree based ensemble learning methods, Random Forest Regression and Extreme Gradient Boosting (XGBoost) Regression, are used separately and the results are compared.

Ensemble learning methods combine several machine learning models to improve results. Random Forest Regression uses bootstrap aggregating (bagging) to combine decision trees. Many decision trees are built in parallel on different training subsets. The subsets are drawn by sampling the training dataset with replacement (bootstrap), and a different subset of features can be used in each tree. The prediction is the average of the individual trees' predictions (aggregating). XGBoost uses boosting instead: decision trees are built sequentially, each one learning from the errors (residuals) of its predecessors.
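The difference between the two ideas can be sketched by hand with plain decision trees. This is a simplified illustration on synthetic data, not the actual Random Forest or XGBoost algorithms: feature subsampling is omitted because there is only one feature, and XGBoost adds regularization and other refinements that are not shown. The data, `n_trees` and `learning_rate` are arbitrary choices.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic one-feature regression problem (hypothetical data).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X[:, 0]) * 50 + 100 + rng.normal(0, 5, size=300)

n_trees = 20

# Bagging: each tree sees a bootstrap sample; predictions are averaged.
bagged_trees = []
for _ in range(n_trees):
    idx = rng.integers(0, len(X), size=len(X))  # sample with replacement
    tree = DecisionTreeRegressor(max_depth=3).fit(X[idx], y[idx])
    bagged_trees.append(tree)
bagging_pred = np.mean([t.predict(X) for t in bagged_trees], axis=0)

# Boosting: each tree is fit to the residuals left by the trees before it.
learning_rate = 0.3
boosting_pred = np.full(len(y), y.mean())
for _ in range(n_trees):
    residuals = y - boosting_pred
    tree = DecisionTreeRegressor(max_depth=3).fit(X, residuals)
    boosting_pred += learning_rate * tree.predict(X)

# Training-set MAE of each approach against a constant-mean baseline.
baseline_mae = np.mean(np.abs(y - y.mean()))
bagging_mae = np.mean(np.abs(y - bagging_pred))
boosting_mae = np.mean(np.abs(y - boosting_pred))
print(baseline_mae, bagging_mae, boosting_mae)
```

In bagging the trees are independent of each other and could be trained in any order, whereas in boosting each tree depends on the predictions accumulated so far.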

Mean absolute error (MAE) and mean squared logarithmic error (MSLE) are selected as evaluation metrics. MAE shows by how many bikes, on average, the prediction deviates from the actual demand. MSLE is selected because it penalizes underestimates more than overestimates. Supplying too few bikes, especially in rush hours, could lead to customer dissatisfaction and lost customers, so underestimating is more harmful than overestimating.
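The asymmetry of the logarithmic error can be seen on a single observation. The sketch below uses a hypothetical demand of 100 bikes and compares predictions that are 50 bikes too low and 50 bikes too high; the helper function is illustrative and uses the log(1 + x) form of the error.

```python
import math

def squared_log_error(actual, predicted):
    """Squared logarithmic error for a single observation."""
    return (math.log1p(predicted) - math.log1p(actual)) ** 2

actual = 100  # hypothetical hourly demand
under = squared_log_error(actual, 50)   # 50 bikes too few
over = squared_log_error(actual, 150)   # 50 bikes too many

print(f"underestimate: {under:.3f}, overestimate: {over:.3f}")
```

Both predictions have the same absolute error, but the underestimate receives the larger logarithmic penalty.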

In [None]:
x = data.drop(columns=['bike_use'])
y = data.bike_use

x_train, x_test, y_train, y_test = train_test_split(x,y,test_size = 0.2,random_state = 42)

Grid search with 5-fold cross-validation is applied to the training dataset to find the best number of decision trees (estimators) and maximum depth of each tree for Random Forest Regression and XGBoost Regression, based on mean absolute error.

In [None]:
model = xgb.XGBRegressor()

n_estimators = [1000,2000,3000]
max_depth = [3,6]

param_grid = dict(n_estimators=n_estimators, max_depth=max_depth)
grid_search = GridSearchCV(model, param_grid, scoring="neg_mean_absolute_error", cv=5)

grid_result = grid_search.fit(x_train,y_train)
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))

In [None]:
model = RandomForestRegressor()

n_estimators = [500,1000,1500]
max_depth = [3,6]

param_grid = dict(n_estimators=n_estimators, max_depth=max_depth)
grid_search = GridSearchCV(model, param_grid, scoring="neg_mean_absolute_error", cv=5)

grid_result = grid_search.fit(x_train,y_train)
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))

On the test dataset, XGBoost is better in terms of MAE, while Random Forest is slightly better in terms of MSLE.

In [None]:
rf_reg = RandomForestRegressor(n_estimators=1000, max_depth=6)

rf_reg.fit(x_train,y_train)

pred_test = rf_reg.predict(x_test)

print("mae test: ",mae(y_test,pred_test))
print("msle test: ",msle(y_test,pred_test))

In [None]:
xgb_reg = xgb.XGBRegressor(n_estimators=1000, max_depth=6)

xgb_reg.fit(x_train,y_train)

pred_test = xgb_reg.predict(x_test)

# MSLE cannot be computed for negative predictions, so clip them to zero
pred_test = np.clip(pred_test, 0, None)

print("mae test: ",mae(y_test,pred_test))
print("msle test: ",msle(y_test,pred_test))

Feature importances from mean impurity decrease and from permutation importance for the XGBoost model are shown below.

The mean impurity decrease method measures feature importance by how much each feature contributes to reducing the weighted impurity (here, variance) across the splits of the trees. For XGBoost, the feature_importances_ attribute reports gain-based importance by default, which plays the same role.

The permutation feature importance method measures feature importance by shuffling each feature in the test dataset in turn and observing how much the model performance decreases.
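The mechanism can be sketched by hand on synthetic data. This is a minimal illustration with a single shuffle per feature and MAE as the score; the `predict` function is a hypothetical stand-in for a fitted model, and the library routine used in this notebook repeats the shuffle several times and uses the estimator's own scoring.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000
X = rng.normal(size=(n, 2))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=n)  # column 1 is pure noise

def predict(X):
    """Stand-in for a fitted model; it only uses column 0."""
    return 3.0 * X[:, 0]

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

baseline = mae(y, predict(X))
importances = []
for col in range(X.shape[1]):
    X_shuffled = X.copy()
    # Break the link between this feature and the target by shuffling it.
    X_shuffled[:, col] = rng.permutation(X_shuffled[:, col])
    importances.append(mae(y, predict(X_shuffled)) - baseline)

print(importances)
```

Shuffling the informative column increases the error, while shuffling the column the model ignores leaves it unchanged.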

Both methods show that hour, is_towork, temp, year, humidity and workingday are important features for predicting bike demand. The assumption behind is_towork is supported by its importance in the model. The is_weekend feature is also important in terms of permutation importance, although the assumption behind it is less well supported than the one behind is_towork.

In [None]:
# Mean impurity decrease feature importance

feat_importances = pd.Series(xgb_reg.feature_importances_, index=x_train.columns)
feat_importances.sort_values(ascending=True,inplace=True)
feat_importances.plot(kind='barh')

In [None]:
# Permutation feature importance

result = permutation_importance(xgb_reg, x_test, y_test, n_repeats=10,
                                random_state=42, n_jobs=2)
sorted_idx = result.importances_mean.argsort()

fig, ax = plt.subplots()
ax.boxplot(result.importances[sorted_idx].T,
           vert=False, labels=x_test.columns[sorted_idx])
fig.tight_layout()
plt.show()    

The residuals of the XGBoost model are shown below.

The residuals are distributed symmetrically around 0 but are not normally distributed, which suggests that there are patterns the model does not capture. In particular, there are some residuals with high magnitude. Investigating specific dates and hours with high deviation shows that particular events took place on those days, such as holidays, earthquakes and festivals. These cannot be supplied as inputs, so the model cannot represent them.

Although this reduces performance, it also means that the data and the model could be used to validate anomaly or event detection algorithms.

In [None]:
res = y_test - pred_test

sns.histplot(res)
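To follow up on the high-magnitude residuals discussed above, the rows with the largest absolute errors can be listed and traced back to the original records through their index. The sketch below uses made-up numbers; `actual` and `predicted` are hypothetical stand-ins for `y_test` and the model predictions.

```python
import pandas as pd

# Hypothetical actual and predicted counts, indexed like the original rows.
actual = pd.Series([120, 300, 15, 410, 95], index=[10, 11, 12, 13, 14])
predicted = pd.Series([110.0, 140.0, 20.0, 380.0, 100.0], index=actual.index)

residuals = actual - predicted

# Keep the two rows with the largest absolute residuals.
largest = residuals.abs().nlargest(2)
print(largest)
```

Because the residuals keep the index of the test rows, the labels in `largest` can be used to look up the corresponding dates and hours in the raw data.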