# Loading the Dataset

In [25]:
import pandas as pd

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report
from sklearn.preprocessing import LabelEncoder

from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

In [26]:
pd.options.display.max_columns = 999

Case Study #1 saved the encoded data to "encoded_data.csv" and the final cleaned data to "PCA_data.csv". Both files are loaded here so the dataset can be split and the models trained.

In [27]:
PCA_data = pd.read_csv("./PCA_data.csv")
encoded_data = pd.read_csv("./encoded_data.csv")

# Categorizing and Splitting the Data

The aim is to train three models to classify bike demand as "low," "medium," "high," or "extreme" based on rental bike counts. To do this, the `count` column of the encoded data is cut into bins with edges [0, 0.25, 0.5, 0.75, 1.0] and labels ['low', 'medium', 'high', 'extreme'], and the labels are then encoded as integers. The PCA features and the binned target are then split into training (80%) and test (20%) sets.

In [28]:
PCA_X = PCA_data
PCA_y = encoded_data['count']

bins = [0, 0.25, 0.5, 0.75, 1.0]
labels = ['low', 'medium', 'high', 'extreme']

label_encoder = LabelEncoder()
PCA_y_binned = label_encoder.fit_transform(pd.cut(PCA_y, bins=bins, labels=labels))

PCA_X_train, PCA_X_test, PCA_y_train, PCA_y_test = train_test_split(PCA_X, PCA_y_binned, test_size=0.2, random_state=42)
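As a small illustration of how the binning and encoding steps behave, the sketch below applies `pd.cut` and `LabelEncoder` to a handful of made-up values (`toy_counts` is hypothetical and not taken from the dataset). It shows two details worth keeping in mind: intervals are right-closed by default, so an exact 0 is left unbinned unless `include_lowest=True` is passed, and `LabelEncoder` orders classes alphabetically rather than by bin order.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical scaled counts, chosen only to illustrate the bin edges.
toy_counts = pd.Series([0.0, 0.10, 0.25, 0.30, 0.60])

bins = [0, 0.25, 0.5, 0.75, 1.0]
labels = ['low', 'medium', 'high', 'extreme']

# Default intervals are right-closed: (0, 0.25], (0.25, 0.5], ...
binned = pd.cut(toy_counts, bins=bins, labels=labels)
print(binned.isna().sum())  # the exact value 0.0 falls outside (0, 0.25]

# include_lowest=True makes the first interval include its left edge.
binned_inclusive = pd.cut(toy_counts, bins=bins, labels=labels,
                          include_lowest=True)

# LabelEncoder sorts class names alphabetically, not by bin order.
encoder = LabelEncoder()
codes = encoder.fit_transform(binned_inclusive.tolist())
print(list(encoder.classes_))  # ['high', 'low', 'medium']
print(list(codes))             # [1, 1, 1, 2, 0]
```

The alphabetical ordering is why the classification reports further down list the rows as high, low, medium.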


The number of instances for each category:

In [35]:
binned_counts = pd.cut(encoded_data['count'], bins=bins, labels=labels)
print(binned_counts.value_counts())

count
low        12105
medium      3782
high        1375
extreme        0
Name: count, dtype: int64
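Because the classes are imbalanced, one optional variation is to stratify the split so that the training and test sets keep roughly the same class proportions. The sketch below is not what the cell above does; it uses hypothetical arrays (`X_demo`, `y_demo`) with a roughly 70/22/8 class mix and the `stratify` argument of `train_test_split`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels, roughly mirroring the class mix above.
y_demo = np.array([1] * 70 + [2] * 22 + [0] * 8)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

# stratify=y_demo keeps the class proportions similar in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Share of each class in the test subset.
test_share = np.bincount(y_te, minlength=3) / len(y_te)
print(test_share)
```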


# Hyperparameter Tuning and Training

To find the best hyperparameters for each model, a grid search with cross-validation is run over a predefined parameter grid per model. The Random Forest classifier is tuned over `n_estimators`, `max_depth`, and `min_samples_split`. The XGBoost model is tuned over `n_estimators`, `max_depth`, and `learning_rate`. The MLP model is tuned over `hidden_layer_sizes`, `activation`, `solver`, and `learning_rate`.

In [29]:
rf = RandomForestClassifier()
xgb = XGBClassifier()
mlp = MLPClassifier()

rf_params = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10]
}

xgb_params = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 6, 9],
    'learning_rate': [0.01, 0.1, 0.2]
}

mlp_params = {
    'hidden_layer_sizes': [(50,), (100,), (100, 50)],
    'activation': ['relu', 'tanh'],
    'solver': ['adam', 'sgd'],
    'learning_rate': ['constant', 'adaptive']
}
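To get a feel for the cost of these grids, the sketch below counts the candidate combinations with scikit-learn's `ParameterGrid` and multiplies by the number of folds. The value `cv_folds = 5` is assumed here to match the 5-fold setting used in the search.

```python
from sklearn.model_selection import ParameterGrid

rf_params = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
}
xgb_params = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 6, 9],
    'learning_rate': [0.01, 0.1, 0.2],
}
mlp_params = {
    'hidden_layer_sizes': [(50,), (100,), (100, 50)],
    'activation': ['relu', 'tanh'],
    'solver': ['adam', 'sgd'],
    'learning_rate': ['constant', 'adaptive'],
}

cv_folds = 5  # assumed to match cv=5 in the grid search
fit_counts = {}
for name, grid in [('rf', rf_params), ('xgb', xgb_params), ('mlp', mlp_params)]:
    n_candidates = len(ParameterGrid(grid))
    fit_counts[name] = (n_candidates, n_candidates * cv_folds)
    print(name, n_candidates, n_candidates * cv_folds)
```

This gives 36, 27, and 24 candidates, or 180, 135, and 120 cross-validation fits respectively, plus one final refit per search.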

The following function runs the 5-fold grid search for a model and returns the best estimator, refit on the full training set, together with its parameters:

In [30]:
def perform_grid_search(model, params, X_train, y_train):
    grid_search = GridSearchCV(estimator=model, param_grid=params,
                               cv=5, scoring='accuracy', n_jobs=-1)
    grid_search.fit(X_train, y_train)
    return grid_search.best_estimator_, grid_search.best_params_

best_rf, best_rf_params = perform_grid_search(rf, rf_params, PCA_X_train, PCA_y_train)
best_xgb, best_xgb_params = perform_grid_search(xgb, xgb_params, PCA_X_train, PCA_y_train)
best_mlp, best_mlp_params = perform_grid_search(mlp, mlp_params, PCA_X_train, PCA_y_train)
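The same pattern can be tried quickly on synthetic data. The sketch below is only an illustration: the dataset from `make_classification` and the reduced grid `small_grid` are hypothetical stand-ins, not the notebook's data or grids. It relies on `GridSearchCV` refitting the best candidate on the full training data (`refit=True` is the default), which is why `best_estimator_` can be used for prediction directly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic three-class data standing in for the PCA features.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=4,
    n_classes=3, random_state=0,
)

# A deliberately small, hypothetical grid so the search finishes fast.
small_grid = {'n_estimators': [10, 20], 'max_depth': [3, None]}

search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_grid=small_grid, cv=5, scoring='accuracy',
)
search.fit(X_demo, y_demo)

print(search.best_params_)            # winning combination
print(round(search.best_score_, 3))   # mean cross-validated accuracy

# best_estimator_ has already been refit on all of X_demo.
preds = search.best_estimator_.predict(X_demo[:5])
print(preds)
```

Note that `best_score_` is a cross-validation score on the training data; the held-out test set is a separate check.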


In [31]:
print("Best Random Forest Parameters:", best_rf_params)
print("Best XGBoost Parameters:", best_xgb_params)
print("Best MLP Parameters:", best_mlp_params)

Best Random Forest Parameters: {'max_depth': 30, 'min_samples_split': 5, 'n_estimators': 100}
Best XGBoost Parameters: {'learning_rate': 0.2, 'max_depth': 6, 'n_estimators': 200}
Best MLP Parameters: {'activation': 'tanh', 'hidden_layer_sizes': (100, 50), 'learning_rate': 'adaptive', 'solver': 'adam'}


# Testing

With each model trained using the best hyperparameters from its grid, the models are evaluated on the held-out test set with the following code.

In [33]:
# Evaluate a model on the test set and print its classification report
def evaluate_model(model, X_test, y_test):
    y_pred = model.predict(X_test)
    unique_labels = label_encoder.inverse_transform(sorted(set(y_test)))
    report = classification_report(y_test, y_pred, target_names=unique_labels)
    print(report)

print("Random Forest Evaluation:")
evaluate_model(best_rf, PCA_X_test, PCA_y_test)

print("XGBoost Evaluation:")
evaluate_model(best_xgb, PCA_X_test, PCA_y_test)

print("MLP Evaluation:")
evaluate_model(best_mlp, PCA_X_test, PCA_y_test)

Random Forest Evaluation:
              precision    recall  f1-score   support

        high       0.88      0.83      0.85       290
         low       0.99      0.99      0.99      2420
      medium       0.91      0.92      0.91       743

    accuracy                           0.96      3453
   macro avg       0.93      0.91      0.92      3453
weighted avg       0.96      0.96      0.96      3453

XGBoost Evaluation:
              precision    recall  f1-score   support

        high       0.89      0.84      0.86       290
         low       0.99      0.99      0.99      2420
      medium       0.91      0.93      0.92       743

    accuracy                           0.97      3453
   macro avg       0.93      0.92      0.93      3453
weighted avg       0.97      0.97      0.97      3453

MLP Evaluation:
              precision    recall  f1-score   support

        high       0.85      0.83      0.84       290
         low       1.00      1.00      1.00      2420
      medium 

All three models show high accuracy, with **RF** at 96%, **XGBoost** at 97%, and **MLP** at 97%. This suggests that the models are generally suitable for predicting bike rental demand categories from the PCA-transformed features. The high precision and recall for the '*low*' category across all models indicate that they are especially good at predicting low demand, which is also the majority class (support = 2420 of 3453 test instances), so overall accuracy is driven largely by this class.


The '*high*' and '*medium*' categories have slightly lower **precision** and **recall** than the '*low*' category. This is anticipated, given the lower support (number of instances) for these classes (290 for high and 743 for medium). The **F1-scores** for the '*high*' and '*medium*' categories remain quite high, demonstrating that the models perform well even with fewer examples.
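The effect of class support on the summary rows can be checked by hand. The sketch below recomputes the macro and weighted F1 averages from the per-class values printed in the Random Forest report above; the variable names are illustrative only.

```python
# Per-class F1 and support copied from the Random Forest report.
f1_scores = {'high': 0.85, 'low': 0.99, 'medium': 0.91}
support = {'high': 290, 'low': 2420, 'medium': 743}

total = sum(support.values())

# Macro average: every class counts equally.
macro_f1 = sum(f1_scores.values()) / len(f1_scores)

# Weighted average: each class counts in proportion to its support.
weighted_f1 = sum(f1_scores[c] * support[c] for c in f1_scores) / total

print(total)                  # 3453
print(round(macro_f1, 2))     # 0.92
print(round(weighted_f1, 2))  # 0.96
```

The weighted average sits closer to the '*low*' score because that class makes up about 70% of the test set, while the macro average gives the smaller classes equal weight.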


**Random Forest** performs well, with high **accuracy** and balanced **precision** and **recall** across categories. In the '*high*' category it trails **XGBoost** slightly (0.88/0.83 versus 0.89/0.84 precision/recall), while its precision there is higher than the **MLP**'s (0.88 versus 0.85) at the same recall (0.83).

**XGBoost** slightly surpasses **Random Forest** in **accuracy** (0.97 versus 0.96), in **precision** and **recall** for the '*high*' category (each 0.01 higher), and in **recall** for the '*medium*' category (0.93 versus 0.92). This could be attributed to **XGBoost**'s capacity to handle complicated relationships and interactions between features.

**MLP** performs similarly to **XGBoost**, with **precision** and **recall** of 1.00 (to two decimals) for the '*low*' category and high scores in the '*medium*' and '*high*' categories. The **MLP**'s performance suggests that neural networks can effectively detect patterns in the data after PCA transformation.


It is worth noting that the '*extreme*' category does not appear in the evaluation metrics because, with the selected bin ranges, no instances fall into it.