## Bike Sharing Demand Prediction

Author: Ian Ho

This is a notebook for the Bike Sharing Demand Prediction competition on Kaggle. As with most competitions, good and robust results depend on both data processing and modelling. This notebook focuses on the modelling and optimization portions and only briefly covers preprocessing. Nonetheless, the preprocessing is built with proper pipeline methods, and readers are welcome to try alternative methods and evaluate their effect on test error.

In particular, I will be highlighting the use of Bayesian Optimization in finding the optimal hyperparameters for the XGBoost Regressor.

### Results

This notebook took only about 3 hours to complete, yet it achieves a leaderboard error (RMSLE) of 0.39, which places it well within the top 10%!

### Dependencies

In [None]:
import numpy as np 
import pandas as pd

from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_log_error
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

from xgboost import XGBRegressor

from bayes_opt import BayesianOptimization

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

### Import Data

In [None]:
train = pd.read_csv("/kaggle/input/bike-sharing-demand/train.csv")
submit = pd.read_csv("/kaggle/input/bike-sharing-demand/test.csv")

## Feature Engineering

### Extracting features from datetime

In [None]:
# Extract date, hour, weekday, month and year from the datetime column, then drop it

def date_extractor(data):

    data['datetime'] = pd.to_datetime(data['datetime'])
    # The vectorised .dt accessor replaces the row-by-row apply calls
    data["date"] = data["datetime"].dt.date
    data["hour"] = data["datetime"].dt.hour
    data["weekday"] = data["datetime"].dt.dayofweek + 1  # ISO weekday: Monday=1 ... Sunday=7
    data["month"] = data["datetime"].dt.month
    data["year"] = data["datetime"].dt.year

    data.drop(['datetime'], axis=1, inplace=True)
    
    return data

In [None]:
train = date_extractor(train)
submit = date_extractor(submit)

### Dropping columns ```casual``` and ```registered``` as they do not appear in the test set

In [None]:
drop_col = ['casual', 'registered']
for col in drop_col:
    train.drop(col, axis=1, inplace=True)
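
The claim in the heading above can be checked with a set difference. The sketch below uses plain Python sets, and the two column lists are assumptions reconstructed from the column names used in this notebook (as they stand after `date_extractor` has run). With the real DataFrames the equivalent check would be `set(train.columns) - set(submit.columns)`.

```python
# Column names assumed from this notebook, after date_extractor has run
train_columns = {
    "season", "holiday", "workingday", "weather",
    "temp", "atemp", "humidity", "windspeed",
    "casual", "registered", "count",
    "date", "hour", "weekday", "month", "year",
}
submit_columns = {
    "season", "holiday", "workingday", "weather",
    "temp", "atemp", "humidity", "windspeed",
    "date", "hour", "weekday", "month", "year",
}

# Columns present in the training data but missing from the test data
only_in_train = sorted(train_columns - submit_columns)
print(only_in_train)
```

Besides `casual` and `registered`, the difference also contains `count`, which is the label and is isolated separately further down.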

### Creating pipeline for Data Transformation

In [None]:
numeric_transformer = Pipeline(steps=[('scaler', StandardScaler())])
categorical_transformer = Pipeline(steps=[('onehotencoder', OneHotEncoder(handle_unknown='ignore'))])

numeric_features = ['temp', 'atemp', 'humidity', 'windspeed']
categorical_features = ['season', 'holiday', 'workingday', 'weather', 'hour', 'weekday', 'month', 'year']

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

### Isolating label column

In [None]:
y = train['count']

### Dropping irrelevant columns

In [None]:
train.drop(['date', 'count'], axis=1, inplace=True)
submit.drop(['date'], axis=1, inplace=True)

### Transforming Data using Pipelines

In [None]:
X = preprocessor.fit_transform(train)
X_submit = preprocessor.transform(submit)

### User-defined function for RMSLE 

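RMSLE stands for root mean squared logarithmic error. Here $p_i$ is the predicted count, $a_i$ is the actual count, and $n$ is the number of observations: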

$RMSLE = \sqrt{\frac{1}{n} \sum_{i=1}^n (\log(p_i + 1) - \log(a_i+1))^2 }$
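
To make the formula concrete, here is a minimal plain-Python sketch that follows it term by term. The function name `rmsle_reference` and the sample values are illustrative assumptions, not part of the competition data.

```python
import math

def rmsle_reference(actual, predicted):
    """Plain-Python RMSLE, mirroring the formula term by term."""
    squared_log_errors = [
        (math.log(p + 1) - math.log(a + 1)) ** 2
        for a, p in zip(actual, predicted)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))

# Illustrative values only
actual = [10, 100, 1000]
predicted = [20, 110, 1000]
print(rmsle_reference(actual, predicted))

# The same absolute miss of 10 bikes costs more when the true count is small
small = rmsle_reference([10], [20])
large = rmsle_reference([100], [110])
print(small, large)
```

Because the error is measured on the log scale, the metric penalises relative rather than absolute differences, which suits demand counts that range from a handful to several hundred.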

In [None]:
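def RMSLE(y_true, y_pred):
    
    # Vectorised form of the formula above; np.log1p(x) computes log(x + 1)
    log_diff = np.log1p(np.asarray(y_true)) - np.log1p(np.asarray(y_pred))
    
    # Rounded to 3 decimal places to keep the optimization log readable
    return round(float(np.sqrt(np.mean(log_diff ** 2))), 3)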

### Splitting the Data for the XGBoost Regressor

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

### Bayesian Optimization for Hyperparameter Tuning

- Bayesian Optimization maximises its objective, whereas the goal here is to minimise RMSLE. I therefore wrote ```XGB_error``` to return the negative of RMSLE, so that maximising this value minimises the error. Obviously, this is a quick fix, but judging by the iterations it works pretty well!


- I decided to search over the following hyperparameters, as defined in the [XGBoost documentation](https://xgboost.readthedocs.io/en/latest/parameter.html):
    - ```max_depth``` : Maximum depth of a tree. Increasing this value makes the model more complex and more prone to overfitting.
    - ```min_child_weight``` : Minimum sum of instance weight needed in a child. The larger ```min_child_weight``` is, the more conservative the algorithm will be.
    - ```gamma``` : Minimum loss reduction required to make a further partition on a leaf node of the tree.
    - ```colsample_bytree``` : Subsample ratio of columns when constructing each tree.
    - ```subsample``` : Subsample ratio of the training instances. Setting it to 0.5 means that XGBoost randomly samples half of the training data prior to growing trees, which helps prevent overfitting.
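
The negation trick from the first bullet can be sketched without any modelling library. In the sketch below, `validation_error` is a hypothetical stand-in for a model's validation RMSLE, and `random_search_maximize` is a toy random-search maximizer that stands in for Bayesian Optimization. Neither is the method used in this notebook; they only show how a maximizer ends up minimising an error and why integer parameters need a cast.

```python
import random

def validation_error(max_depth, subsample):
    """Hypothetical stand-in for a validation RMSLE (lower is better)."""
    return 0.01 * (max_depth - 6) ** 2 + (subsample - 0.8) ** 2 + 0.35

def objective(max_depth, subsample):
    # The maximizer proposes floats, so the integer parameter is cast first.
    # Returning the negative error turns minimization into maximization.
    return -validation_error(int(max_depth), subsample)

def random_search_maximize(f, bounds, n_iter, seed=0):
    """Toy maximizer: uniform random search over the given bounds."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_iter):
        params = {name: rng.uniform(lo, hi) for name, (lo, hi) in bounds.items()}
        target = f(**params)
        if best is None or target > best["target"]:
            best = {"target": target, "params": params}
    return best

best = random_search_maximize(
    objective,
    bounds={"max_depth": (1, 20), "subsample": (0.2, 0.9)},
    n_iter=200,
)

# Flip the sign back to read the result as an error
best_error = -best["target"]
print(best["params"], best_error)
```

The returned dictionary uses `target` and `params` keys to resemble the shape of `optimizer.max` printed later in this notebook. The best target is the largest negative error, so its sign has to be flipped back before reporting it.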

In [None]:
def XGB_error(max_depth, min_child_weight, gamma, colsample_bytree, subsample, X_train, X_test, y_train, y_test):
    
    XGB = XGBRegressor(
        max_depth=max_depth, 
        min_child_weight=min_child_weight, 
        gamma=gamma, 
        colsample_bytree=colsample_bytree, 
        subsample=subsample
    )
    
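    # Fit on log(count) so that squared error on the target approximates RMSLE;
    # np.log requires strictly positive counts
    y_train = np.log(y_train)
    XGB.fit(X_train, y_train)
    y_pred = XGB.predict(X_test)
    # Map predictions back to the original count scale
    y_pred = np.exp(y_pred)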
    
    return -RMSLE(y_test, y_pred)


def optimize_XGB(X_train, X_test, y_train, y_test):
    
    def XGB_wrapper(max_depth, min_child_weight, gamma, colsample_bytree, subsample):
        
        return XGB_error(
            # The optimizer proposes floats, so max_depth is truncated to an integer
            max_depth=int(max_depth), 
            min_child_weight=min_child_weight, 
            gamma=gamma, 
            colsample_bytree=colsample_bytree, 
            subsample=subsample,
            X_train=X_train, 
            X_test=X_test, 
            y_train=y_train, 
            y_test=y_test
        )
    
    optimizer = BayesianOptimization(
        f=XGB_wrapper,
        pbounds={
            # Truncated to an int in XGB_wrapper; XGBoost reads 0 as "no depth limit"
            "max_depth": (0, 20), 
            "min_child_weight": (1, 10), 
            "gamma": (0.2, 0.8), 
            "colsample_bytree": (0.2, 0.9), 
            "subsample": (0.2, 0.9)
        },
        random_state=0,
        verbose=2
    )
        
    optimizer.maximize(n_iter=50)

    print("Final result:", optimizer.max)
    print()
    final_params = optimizer.max['params']
    for k, v in final_params.items():
        print(k,'=',v,',')

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)


optimize_XGB(X_train, X_test, y_train, y_test)

### Retrain with all the data, and with optimal hyperparameters, make submission

In [None]:
# Best hyperparameters reported by the optimization run above
XGB = XGBRegressor(
    colsample_bytree=0.8420564768063263,
    gamma=0.3169832518324276,
    max_depth=6,
    min_child_weight=8.956069480250676,
    subsample=0.8715769930248918,
)

y_logged = np.log(y)
XGB.fit(X, y_logged)

y_submit = XGB.predict(X_submit)
y_submit = np.exp(y_submit)

# Reload the test file to recover the datetime column dropped by date_extractor
submit = pd.read_csv("/kaggle/input/bike-sharing-demand/test.csv")
submission = pd.concat([submit['datetime'], pd.DataFrame(y_submit)], axis=1)
submission.columns = ['datetime', 'count']
submission.set_index('datetime', inplace=True)
submission.to_csv('submission.csv')