## Project for the [AWS Machine Learning Engineer Nanodegree](https://www.udacity.com/course/aws-machine-learning-engineer-nanodegree--nd189)

In this project, I experiment with the AutoML framework AutoGluon as well as a regular CatBoost model.

First, some simple baseline models are trained. After that, the score is improved by creating new features from the raw timestamp and by hyperparameter tuning. As so often in machine learning, providing better features to the model led to a rather big improvement. Hyperparameter tuning has much less impact but still reduces the error by a considerable margin.

# Prerequisites

In [None]:
!pip install -U "mxnet<2.0.0" bokeh==2.0.1
!pip install autogluon --no-cache-dir

In [None]:
import os
from os.path import join
import json

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

from sklearn.metrics import mean_squared_log_error, make_scorer, mean_squared_error
from sklearn.model_selection import cross_val_score, cross_val_predict, train_test_split
from sklearn.linear_model import Ridge
from sklearn.inspection import permutation_importance

from skopt import BayesSearchCV
from skopt.space import Real, Integer

from catboost import CatBoostRegressor
from autogluon.tabular import TabularPredictor
import autogluon.core as ag

In [None]:
input_dir = '/kaggle/input/bike-sharing-demand'

train = pd.read_csv(join(input_dir, 'train.csv')).drop(['casual', 'registered'], axis=1)
y_train = train['count']
x_train = train.drop('count', axis=1)

x_test = pd.read_csv(join(input_dir, 'test.csv'))

train.head()

In [None]:
def cap_at_zero(predictions):
    predictions[predictions < 0] = 0
    return predictions


class RidgeNonNegative(Ridge):
    
    def predict(self, *args, **kwargs):
        predictions = super().predict(*args, **kwargs)
        return cap_at_zero(predictions)


class CatBoostRegressorNonNegative(CatBoostRegressor):
    
    def predict(self, *args, **kwargs):
        predictions = super().predict(*args, **kwargs)
        return cap_at_zero(predictions)


class TabularPredictorNonNegative(TabularPredictor):
    
    def predict(self, *args, **kwargs):
        predictions = super().predict(*args, **kwargs)
        return cap_at_zero(predictions)


def root_mean_squared_log_error(*args, **kwargs):
    return np.sqrt(mean_squared_log_error(*args, **kwargs))


def root_mean_squared_error(*args, **kwargs):
    return np.sqrt(mean_squared_error(*args, **kwargs))


rmsle_scorer = make_scorer(score_func=root_mean_squared_log_error, greater_is_better=False)


def save_submission(y_pred, file_name='submission.csv'):
    dirname = 'submission_files'
    if not os.path.exists(dirname):
        os.makedirs(dirname)
    
    submission = pd.read_csv(join(input_dir,'sampleSubmission.csv'))
    submission['count'] = y_pred
    assert (submission['count'] >= 0).all()
    submission.to_csv(os.path.join(dirname, file_name), index=False)

# Quick EDA

There are 8 ready-to-use features and 1 target variable.  
In addition, there's one more column `datetime` which can be used for feature engineering later.

In [None]:
plt.rcParams["figure.figsize"] = (20, 20)
train.hist();

# Baseline models

In [None]:
x_train_raw = x_train.drop('datetime', axis=1)
x_test_raw = x_test.drop('datetime', axis=1)

In [None]:
ridge = RidgeNonNegative()
cv_pred_ridge = cross_val_predict(estimator=ridge,
                                  X=x_train_raw,
                                  y=y_train)
ridge_baseline_rmse = round(root_mean_squared_error(y_train, cv_pred_ridge), 2)
print('Baseline ridge-regression:\n'
      f'RMSE = {ridge_baseline_rmse}\n'
      f'RMSLE = {round(root_mean_squared_log_error(y_train, cv_pred_ridge), 2)}')

ridge.fit(x_train_raw, y_train)
y_pred_ridge = ridge.predict(x_test_raw)
save_submission(y_pred_ridge, 'ridge_regression_baseline.csv')

Note that the metric used to train and evaluate the AutoGluon models (RMSE) is not the same as the one used in the competition (RMSLE). Thus, the AutoGluon leaderboard only shows the relative performance of the AutoGluon models and cannot be used for comparison with the ridge regression or the CatBoost model. See the last section of this notebook for a comprehensive comparison of all scores on the competition test set.
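To illustrate why scores on the two metrics are not interchangeable, here is a minimal, dependency-free sketch of RMSE and RMSLE. The numbers are made up for illustration and are not taken from the dataset. The `log1p` transform mirrors what `mean_squared_log_error` in scikit-learn computes.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error (inputs must be non-negative)."""
    squared_log_errors = [
        (math.log1p(t) - math.log1p(p)) ** 2 for t, p in zip(y_true, y_pred)
    ]
    return math.sqrt(sum(squared_log_errors) / len(y_true))


# Hypothetical hourly rental counts: one quiet hour and one busy hour.
y_true = [10, 500]
pred_miss_quiet = [60, 500]  # off by 50 on the quiet hour
pred_miss_busy = [10, 550]   # off by 50 on the busy hour

# RMSE cannot tell the two predictions apart ...
print('RMSE :', rmse(y_true, pred_miss_quiet), rmse(y_true, pred_miss_busy))
# ... while RMSLE penalises the miss on the quiet hour far more heavily.
print('RMSLE:', rmsle(y_true, pred_miss_quiet), rmsle(y_true, pred_miss_busy))
```

Because RMSLE measures relative rather than absolute deviations, a model selected on RMSE is not necessarily the one that scores best on the competition metric.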

In [None]:
auto_model = TabularPredictorNonNegative(label='count',
                                         problem_type='regression',
                                         eval_metric='root_mean_squared_error',
                                         verbosity=0)
auto_model.fit(train_data=train,
               time_limit=300,
               presets='best_quality')
y_pred_auto = auto_model.predict(x_test)
save_submission(y_pred_auto, 'autogluon_baseline.csv')
autogluon_baseline_rmse = -auto_model.leaderboard(silent=True).score_val.iloc[0]
auto_model.leaderboard(silent=True)

In [None]:
# for comparison: no time limit
auto_model = TabularPredictorNonNegative(label='count',
                                         problem_type='regression',
                                         eval_metric='root_mean_squared_error',
                                         verbosity=0)
auto_model.fit(train_data=train,
               time_limit=None,
               presets='best_quality')
auto_model.leaderboard(silent=True)

In [None]:
cb_model = CatBoostRegressorNonNegative(verbose=0)
cv_pred_cb = cross_val_predict(estimator=cb_model,
                               X=x_train_raw,
                               y=y_train)
cb_baseline_rmse = round(root_mean_squared_error(y_train, cv_pred_cb), 2)
print('Baseline catboost:\n'
      f'RMSE = {cb_baseline_rmse}\n'
      f'RMSLE = {round(root_mean_squared_log_error(y_train, cv_pred_cb), 2)}')

cb_model.fit(x_train_raw, y_train)
y_pred_cb = cb_model.predict(x_test_raw)
save_submission(y_pred_cb, 'catboost_baseline.csv')

# Feature Engineering and further EDA

In [None]:
def engineer_features(df: pd.DataFrame) -> pd.DataFrame:

    df = df.copy()
    df['datetime'] = pd.to_datetime(df.datetime)
    
    df['year'] = df.datetime.dt.year
    # Series.dt.week was removed in pandas 2.0; isocalendar() is the replacement
    df['week'] = df.datetime.dt.isocalendar().week.astype(int)
    df['hour'] = df.datetime.dt.hour
    df['weekday'] = df.datetime.dt.day_name()
    
    return df

def make_categorical(df: pd.DataFrame, cat_features: list):
    for f in cat_features:
        df[f] = df[f].astype('category')
    return df
    
x_train_eng = engineer_features(x_train)
x_test_eng = engineer_features(x_test)
x_train_eng.head()

In [None]:
x_train_eng.year.value_counts().to_frame(name='num_datapoints').rename_axis('year', axis='columns')

In [None]:
plt.rcParams["figure.figsize"] = (10, 5)
x_train_eng[['week', 'hour']].hist()
plt.show();

In [None]:
weekday_counts = x_train_eng.weekday.value_counts()[['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']]

plt.bar(weekday_counts.index, weekday_counts.values)
plt.title('weekday');

In [None]:
categorical_features = ['year', 'weekday', 'season', 'weather']
x_train_eng = make_categorical(x_train_eng, categorical_features)
x_test_eng = make_categorical(x_test_eng, categorical_features)

# Feature Selection

#### redundant features

`datetime` can now be dropped because it was used to create the four features `year`, `week`, `weekday` and `hour`.

The feature `workingday` is redundant because the features `holiday` and `weekday` already contain the same information.
If `holiday` is 1 or `weekday` is 'Saturday' or 'Sunday', then `workingday` is 0; otherwise, it is 1 (see below).

Finally, the feature `week`, which refers to the calendar week, contains the information of `season`, just at a much finer granularity.

In the experiments below, one model is fitted after dropping the redundant features while another one uses all features.
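The redundancy argument can be sketched for a single row without pandas. The timestamp format `YYYY-MM-DD HH:MM:SS` and the helper names `timestamp_features` and `derive_workingday` are assumptions made for this illustration; they are not part of the notebook's pipeline.

```python
from datetime import datetime

WEEKDAY_NAMES = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
                 'Friday', 'Saturday', 'Sunday']


def timestamp_features(timestamp: str) -> dict:
    """Derive year, ISO calendar week, hour and weekday name from a timestamp."""
    dt = datetime.strptime(timestamp, '%Y-%m-%d %H:%M:%S')  # assumed format
    return {
        'year': dt.year,
        'week': dt.isocalendar()[1],
        'hour': dt.hour,
        'weekday': WEEKDAY_NAMES[dt.weekday()],
    }


def derive_workingday(holiday: int, weekday: str) -> int:
    """Rule from the text: 0 on holidays and weekends, 1 otherwise."""
    if holiday == 1 or weekday in ('Saturday', 'Sunday'):
        return 0
    return 1


features = timestamp_features('2011-01-01 05:00:00')
print(features)
print(derive_workingday(holiday=0, weekday=features['weekday']))
```

Note that the ISO calendar week of 1 January 2011 is 52, because that Saturday still belongs to the last week of 2010. The first days of a year can therefore share a `week` value with the last days of the same year.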

In [None]:
day_stats = x_train_eng.groupby(['holiday', 'weekday']).apply(lambda d: d.workingday.value_counts())
day_stats.to_frame(name='count').rename_axis('workingday', axis='columns')

In [None]:
cb_model = CatBoostRegressorNonNegative(verbose=0, cat_features=['year', 'weekday', 'weather'])
cv_pred_cb = cross_val_predict(estimator=cb_model,
                               X=x_train_eng.drop(['datetime', 'workingday', 'season'], axis=1),
                               y=y_train)
print('Catboost with non-redundant features:\n'
      f'RMSE = {round(root_mean_squared_error(y_train, cv_pred_cb), 2)}\n'
      f'RMSLE = {round(root_mean_squared_log_error(y_train, cv_pred_cb), 2)}')

In [None]:
cb_model = CatBoostRegressorNonNegative(verbose=0, 
                                        cat_features=categorical_features)
cv_pred_cb = cross_val_predict(estimator=cb_model,
                               X=x_train_eng,
                               y=y_train)
print('Catboost with all features:\n'
      f'RMSE = {round(root_mean_squared_error(y_train, cv_pred_cb), 2)}\n'
      f'RMSLE = {round(root_mean_squared_log_error(y_train, cv_pred_cb), 2)}')

In [None]:
# for feature importance:
x_tr, x_val, y_tr, y_val = train_test_split(x_train_eng, y_train)
cb_model = CatBoostRegressorNonNegative(verbose=0, cat_features=categorical_features)
cb_model.fit(x_tr, y_tr);

As we can see, the model with all features (including the redundant ones) performs better. Therefore, it is not a good idea to simply drop the features that seem redundant. However, it is still likely that not all of these features are required for optimal performance.

To find the smallest subset of features with optimal performance, the plot below shows the permutation feature importance of all features. By dropping the features whose importance is close to zero in the plot, we can match the performance of the model with all features while using only a subset of them. All other things being equal, a model with fewer features is preferable.
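The idea behind permutation importance can be sketched in a few lines of plain Python: shuffle one feature column at a time and measure how much the error grows. This is a simplified illustration, not scikit-learn's implementation, which by default measures the drop in the estimator's own score. The toy model, the data and the name `permutation_importance_sketch` are made up for this example.

```python
import random


def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance_sketch(predict, rows, y_true, n_repeats=20, seed=0):
    """Mean increase in error after shuffling each column of `rows` in turn."""
    rng = random.Random(seed)
    baseline = mean_absolute_error(y_true, [predict(r) for r in rows])
    importances = []
    for col in range(len(rows[0])):
        increases = []
        for _ in range(n_repeats):
            # Shuffle only this column; all other columns stay untouched.
            shuffled = [r[col] for r in rows]
            rng.shuffle(shuffled)
            permuted = [r[:col] + (v,) + r[col + 1:] for r, v in zip(rows, shuffled)]
            error = mean_absolute_error(y_true, [predict(r) for r in permuted])
            increases.append(error - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


def toy_predict(row):
    """Stand-in for a fitted model that only looks at the first feature."""
    return 10 * row[0]


# Hypothetical validation data: (hour-like feature, pure-noise feature).
rows = [(hour, random.Random(hour).random()) for hour in range(24)]
y_true = [10 * hour for hour, _ in rows]

hour_importance, noise_importance = permutation_importance_sketch(
    toy_predict, rows, y_true)
print(hour_importance, noise_importance)
```

Shuffling a feature the model ignores leaves the error unchanged, so its importance is zero. Features like that are candidates for removal.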

In [None]:
feature_importances = permutation_importance(cb_model, x_val, y_val, n_repeats=50)

mean_imp = sorted(zip(cb_model.feature_names_, feature_importances['importances_mean']), key=lambda x: x[1], reverse=True)
sorted_features, sorted_mean_importance = zip(*mean_imp)

In [None]:
plt.bar(sorted_features, sorted_mean_importance)
plt.xticks(rotation=70)
plt.ylabel('importance')
plt.title('feature importance');

In [None]:
x_train_final = x_train_eng.drop(['season', 'holiday', 'windspeed', 'year'], axis=1)
x_test_final = x_test_eng.drop(['season', 'holiday', 'windspeed', 'year'], axis=1)

In [None]:
cb_model = CatBoostRegressorNonNegative(verbose=0, cat_features=['weekday', 'weather'])
cv_pred_cb = cross_val_predict(estimator=cb_model,
                               X=x_train_final,
                               y=y_train)
cb_features_rmse = round(root_mean_squared_error(y_train, cv_pred_cb), 2)
print('Catboost with final features:\n'
      f'RMSE = {cb_features_rmse}\n'
      f'RMSLE = {round(root_mean_squared_log_error(y_train, cv_pred_cb), 2)}')

In [None]:
cb_model.fit(x_train_final, y_train)
y_pred_feature_eng = cb_model.predict(x_test_final)

save_submission(y_pred_feature_eng, 'catboost_with_engineered_features.csv')

In [None]:
%%time

auto_model_feature_eng = TabularPredictorNonNegative(label='count',
                                                     problem_type='regression',
                                                     eval_metric='root_mean_squared_error',
                                                     verbosity=0)
auto_model_feature_eng.fit(train_data=pd.concat([x_train_final, y_train], axis=1),
                           time_limit=None,
                           presets='best_quality')
save_submission(auto_model_feature_eng.predict(x_test_final), 'autogluon_feature_engineering.csv')
autogluon_features_rmse = -auto_model_feature_eng.leaderboard(silent=True).score_val.iloc[0]
auto_model_feature_eng.leaderboard(silent=True)

# Hyperparameter Tuning

In [None]:
%%time

cb_hyperparams = {'iterations': (100, 1000),
                   'learning_rate': (0.01, 0.03),
                   'depth': (4, 10),
                   'l2_leaf_reg': (0, 100),
                   'random_strength': (1, 10),
                   'colsample_bylevel': (0.5, 1),
                   'subsample': (0.5, 1)}

bayesian_search = BayesSearchCV(
    estimator=CatBoostRegressorNonNegative(verbose=0, cat_features=['weekday', 'weather']),
    search_spaces=cb_hyperparams,
    scoring=rmsle_scorer,
    cv=5,
    n_iter=100,
    refit=False,
    n_jobs=-1,
    error_score='raise',  # 0 would be the best possible value for the negated RMSLE scorer
    verbose=0,
)
bayesian_search.fit(x_train_final, y_train)
bayesian_search.best_params_

In [None]:
cb_model_hyper = CatBoostRegressorNonNegative(verbose=0, cat_features=['weekday', 'weather'], **bayesian_search.best_params_)
cv_pred_cb = cross_val_predict(estimator=cb_model_hyper,
                               X=x_train_final,
                               y=y_train)
cb_hyper_rmse = round(root_mean_squared_error(y_train, cv_pred_cb), 2)
print('Catboost after hyperparameter-tuning:\n'
      f'RMSE = {cb_hyper_rmse}\n'
      f'RMSLE = {round(root_mean_squared_log_error(y_train, cv_pred_cb), 2)}')

In [None]:
cb_model_hyper.fit(x_train_final, y_train)
y_pred_hyper = cb_model_hyper.predict(x_test_final)

save_submission(y_pred_hyper, 'catboost_with_hyperparam_tuning.csv')

In [None]:
%%time

auto_model_hyper = TabularPredictorNonNegative(label='count',
                                               problem_type='regression',
                                               eval_metric='root_mean_squared_error',
                                               verbosity=0)
auto_model_hyper.fit(train_data=pd.concat([x_train_final, y_train], axis=1),
                     time_limit=None,
                     presets='best_quality',
                     hyperparameter_tune_kwargs='bayesopt')
save_submission(auto_model_hyper.predict(x_test_final), 'autogluon_hyper.csv')
autogluon_hyper_rmse = -auto_model_hyper.leaderboard(silent=True).score_val.iloc[0]
auto_model_hyper.leaderboard(silent=True)

# Model comparison

In [None]:
df = pd.DataFrame([
    ['baseline', 'ridge_regression', ridge_baseline_rmse, 1.43],
    ['baseline', 'catboost', cb_baseline_rmse, 1.32],
    ['baseline', 'autogluon', autogluon_baseline_rmse, 1.40],
    ['add_features', 'catboost', cb_features_rmse, 0.63],
    ['hyperparam', 'catboost', cb_hyper_rmse, 0.53],
    ['add_features', 'autogluon', autogluon_features_rmse, 0.45],
    ['hyperparam', 'autogluon', autogluon_hyper_rmse, 0.45]
],
    columns=['iteration', 'model', 'cv_rmse', 'submission_rmsle'])
df

In [None]:

sns.pointplot(data=df, x='iteration', y='cv_rmse', hue='model')
plt.ylim(0, df.cv_rmse.max() + df.cv_rmse.max() / 10)
plt.xlabel('')
plt.ylabel('cross-validated RMSE')
plt.title('Model comparison: cross-validation')

plt.savefig('model_comparison_cv.png', dpi=300)

In [None]:
sns.pointplot(data=df, x='iteration', y='submission_rmsle', hue='model')
plt.ylim(0, 1.5)
plt.xlabel('')
plt.ylabel('leaderboard RMSLE')
plt.title('Model comparison: leaderboard')

plt.savefig('model_comparison_leaderboard.png', dpi=300)

## Summary

Even with the fairly small dataset at hand, the score improved substantially with some simple feature engineering and hyperparameter tuning.

The results also show that it is not trivial to beat AutoML models, and they demonstrate the usefulness of AutoML for quickly creating baseline and exploratory models.