# 3: Bike rental prediction

We use machine learning to make two kinds of prediction from our bike rental data.

First, we want to predict the number of daily rentals. We frame this as a regression problem and carry it out on both the Dublin and London data.

Second, to plan and manage a bike rental network, controllers need to remove bikes from full stations so that more can be dropped off, and add bikes to empty stations so that more rentals can be made. To facilitate this, it would be useful to predict when a station needs an intervention (that is, when it is at 0% or 100% capacity). We frame this as a classification problem.
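
As a rough illustration of the classification target, the sketch below labels station snapshots as needing an intervention when they are completely empty or completely full. The DataFrame and its column names (`station_id`, `available_bikes`, `capacity`) are hypothetical and are not taken from the station data used in this notebook.

```python
import pandas as pd

# Hypothetical station snapshots; the column names are illustrative assumptions
snapshots = pd.DataFrame({
    'station_id': [1, 1, 2, 2, 3],
    'available_bikes': [0, 12, 30, 15, 7],
    'capacity': [20, 20, 30, 30, 25],
})

# A station is empty at 0% capacity and full at 100% capacity
is_empty = snapshots['available_bikes'] == 0
is_full = snapshots['available_bikes'] == snapshots['capacity']

# Binary target: True when the station needs bikes added or removed
snapshots['needs_intervention'] = is_empty | is_full
print(snapshots)
```

Comparing the integer counts directly avoids any floating-point issues that could arise from testing an occupancy ratio against exactly 0.0 or 1.0.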


In [28]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV, train_test_split
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
import xgboost as xgb
import lightgbm as lgb
from pathlib import Path
import datetime
import warnings
import holidays

warnings.filterwarnings('ignore')

charts_folder = Path('charts/')
data_folder = Path('data/')
    
sns.set(style="darkgrid", font='serif', rc={'lines.linewidth': 0.7})    


In [18]:
daily_data = pd.read_parquet(data_folder/'dublin_london_daily_rentals.parquet')
daily_data.head() 

Unnamed: 0,city,date,num_rentals
0,Dublin,2018-08-01,4431.0
1,Dublin,2018-08-02,8392.0
2,Dublin,2018-08-03,7798.0
3,Dublin,2018-08-04,5040.0
4,Dublin,2018-08-05,4384.0


As this dataset contains only the city, the date and the number of rentals, we extract additional features from the date column.

We will use the day of the week, the month of the year, whether the day falls on a weekend, and whether the day is a public holiday. Additionally, as we know from our statistical analysis that bike rental behaviour changed significantly post-covid, we add an indicator variable for whether a date falls before March 2020.

In [19]:
daily_data['day_of_week_name'] = daily_data['date'].dt.day_name()
daily_data['month_name'] = daily_data['date'].dt.month_name()
daily_data['is_weekend'] = daily_data['date'].dt.day_name().isin(['Saturday', 'Sunday'])
daily_data.head()

Unnamed: 0,city,date,num_rentals,day_of_week_name,month_name,is_weekend
0,Dublin,2018-08-01,4431.0,Wednesday,August,False
1,Dublin,2018-08-02,8392.0,Thursday,August,False
2,Dublin,2018-08-03,7798.0,Friday,August,False
3,Dublin,2018-08-04,5040.0,Saturday,August,True
4,Dublin,2018-08-05,4384.0,Sunday,August,True


In [20]:
# Build a holiday calendar per country, then flag each row using the calendar for its city
ireland_holidays = holidays.Ireland(years=daily_data[daily_data['city'] == 'Dublin']['date'].dt.year)
england_holidays = holidays.UnitedKingdom(years=daily_data[daily_data['city'] == 'London']['date'].dt.year)
daily_data['is_holiday'] = daily_data.apply(
    lambda row: row['date'] in (ireland_holidays if row['city'] == 'Dublin' else england_holidays),
    axis=1,
)
daily_data.loc[daily_data['is_holiday']].head()

Unnamed: 0,city,date,num_rentals,day_of_week_name,month_name,is_weekend,is_holiday
5,Dublin,2018-08-06,4277.0,Monday,August,False,True
89,Dublin,2018-10-29,3768.0,Monday,October,False,True
146,Dublin,2018-12-25,1253.0,Tuesday,December,False,True
147,Dublin,2018-12-26,1893.0,Wednesday,December,False,True
153,Dublin,2019-01-01,2047.0,Tuesday,January,False,True


In [21]:
daily_data['pre_covid'] = daily_data['date'] < datetime.datetime(2020,3,1)

In [30]:
cat_cols = ['city', 'day_of_week_name', 'month_name']
cat_subset = daily_data[cat_cols]

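# `sparse` was renamed to `sparse_output` in scikit-learn 1.2; use `sparse=False` on older versions
encoder = OneHotEncoder(sparse_output=False)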
encoded = encoder.fit_transform(cat_subset)
df_encoded = pd.DataFrame(encoded, columns=encoder.get_feature_names_out(cat_cols))

one_hot_df = pd.concat([daily_data, df_encoded], axis=1)
one_hot_df.drop(columns=cat_cols+['date'], inplace=True)
one_hot_df.head()

Unnamed: 0,num_rentals,is_weekend,is_holiday,pre_covid,city_Dublin,city_London,day_of_week_name_Friday,day_of_week_name_Monday,day_of_week_name_Saturday,day_of_week_name_Sunday,...,month_name_December,month_name_February,month_name_January,month_name_July,month_name_June,month_name_March,month_name_May,month_name_November,month_name_October,month_name_September
0,4431.0,False,False,True,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,8392.0,False,False,True,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,7798.0,False,False,True,1.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,5040.0,True,False,True,1.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,4384.0,True,False,True,1.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [34]:
X_train, X_test, y_train, y_test = train_test_split(one_hot_df.drop(columns=['num_rentals']), one_hot_df['num_rentals'], test_size=0.2, random_state=495)

In [None]:
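# Seed NumPy's global RNG, which scikit-learn estimators fall back on when random_state is not set
# (the `random` module was never imported, and seeding it would not affect these models)
np.random.seed(495)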

rf_params = {'max_depth': [3, 5, 10, 20, None], 'n_estimators': [50, 100, 200], 'max_features': ['sqrt', None]}
xgb_params = {'learning_rate': [0.01, 0.1, 0.2], 'n_estimators': [50, 100, 200], 'max_depth': [3, 4, 5]}
lgb_params = {'learning_rate': [0.01, 0.1, 0.2], 'n_estimators': [50, 100, 200], 'max_depth': [3, 4, 5]}

models = {
    'RandomForest': GridSearchCV(RandomForestRegressor(n_jobs=-1), param_grid=rf_params).fit(X_train, y_train).best_estimator_,
    'XGB': GridSearchCV(xgb.XGBRegressor(n_jobs=-1), param_grid=xgb_params).fit(X_train, y_train).best_estimator_,
    'LightGBM': GridSearchCV(lgb.LGBMRegressor(n_jobs=-1), param_grid=lgb_params).fit(X_train, y_train).best_estimator_
}
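
The dictionary above keeps only the refit best estimator from each search, so a natural next step is to score those estimators on the held-out test set. The sketch below shows one way this might look for a single model. It uses synthetic data from `make_regression` as a stand-in for the one-hot encoded rental features, a deliberately small parameter grid, and mean absolute error and R² as the metrics; these are all assumptions for illustration and not the evaluation used in this notebook.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the encoded rental features (illustrative only)
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=495)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=495
)

# Small grid so the sketch runs quickly; the real grids are larger
search = GridSearchCV(
    RandomForestRegressor(random_state=495),
    param_grid={'max_depth': [3, None], 'n_estimators': [50]},
    cv=3,
)
search.fit(X_train, y_train)

# Score the refit best estimator on data it has not seen during the search
predictions = search.best_estimator_.predict(X_test)
test_mae = mean_absolute_error(y_test, predictions)
test_r2 = r2_score(y_test, predictions)
print(search.best_params_, round(test_mae, 1), round(test_r2, 3))
```

Keeping the `GridSearchCV` object itself, and not only `best_estimator_`, also leaves `best_params_` and `cv_results_` available for inspection afterwards.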