## Learning Methodology
<h4>Team Twin AI</h4>
<h4><b>Overview</b></h4>

This is the Machine Learning component of our solution to the FormulaAI Hack 2022 Competition. The workflow for this notebook is outlined as follows:
- Standardisation and Pipelines
- Model Experimentations I: Classification
- Model Experimentations II: Regression
- Evaluation and Predictions
- Leaderboard: Ranking the Challenger Models

In [13]:
import os

import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
pd.set_option('display.max_columns', None, 'display.max_rows', 100)

import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns
import deepchecks as dc

from scipy import stats
from sklearn.feature_selection import RFE
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import mutual_info_classif
from sklearn.feature_selection import GenericUnivariateSelect
from sklearn.preprocessing import scale,StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import log_loss

from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
#from catboost import CatBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression

from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
#from catboost import CatBoostRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR

from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import RepeatedKFold
from sklearn.experimental import enable_hist_gradient_boosting
#from sklearn.ensemble import HistGradientBoostingClassifier
#from sklearn.ensemble import HistGradientBoostingRegressor

import random
import time
from datetime import datetime

#### <i>Read the data</i>

In [2]:
#final_data_weather = pd.read_csv('final_data_weather.csv')
#final_data_weather.drop('Unnamed: 0', axis =1, inplace = True)
#print(final_data_weather.shape)
#
#final_data_rain = pd.read_csv('final_data_rain.csv')
#final_data_rain.drop('Unnamed: 0', axis =1, inplace = True)
#print(final_data_rain.shape)

(534055, 14)
(534055, 14)


In [14]:
final_data_weather = pd.read_csv('final_data_weather.csv', index_col = False)
print(final_data_weather.shape)
final_data_weather.head(2)

(534055, 15)


Unnamed: 0,M_SESSION_UID,M_SESSION_TIME,M_FRAME_IDENTIFIER,M_PLAYER_CAR_INDEX,M_SESSION_LINK_IDENTIFIER,M_TRACK_TEMPERATURE,M_TRACK_LENGTH,M_AIR_TEMPERATURE,M_TRACK_ID,M_TIME_OFFSET,M_RAIN_PERCENTAGE,M_AI_DIFFICULTY,M_NUM_MARSHAL_ZONES,M_SESSION_TIME_SPENT,M_WEATHER
0,2.106082e+16,28.86,624,19,2184232491,33,4650,25,28,5.0,3.0,90,16.0,788,0
1,2.106082e+16,28.86,624,19,2184232491,33,4650,25,28,5.0,3.0,90,16.0,788,0


In [15]:
weather_X = final_data_weather.drop('M_WEATHER', axis=1)
weather_y = final_data_weather['M_WEATHER']

rain_X = final_data_weather.drop('M_RAIN_PERCENTAGE', axis=1)
rain_y = final_data_weather['M_RAIN_PERCENTAGE']

print(weather_X.shape)
print(rain_X.shape)

(534055, 14)
(534055, 14)


<br>
<h4><b>1. Cross-Validation, Standardisation and Pipelines</b></h4>

***Creating train, test and validation sets***
    
We first split our data into train and test sets. The test set is our holdout set and stays untouched until the end of each of the two sequences of experiments (classification and regression). The validation sets are split out of the train data during cross-validation and are used for primary evaluation and to compute cross-validation scores in each experiment.

In [16]:
X_train, X_test, y_train, y_test = train_test_split(weather_X, weather_y, test_size=.25, random_state=42)
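As a side note on the holdout split: with an imbalanced target, one option is to pass `stratify=` to `train_test_split` so that train and test keep similar class proportions. The sketch below is illustrative only; it uses synthetic data, and the names `X_demo`, `y_demo` and the class probabilities are assumptions, not values from the competition data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced 3-class target standing in for M_WEATHER (illustrative only).
rng = np.random.default_rng(42)
y_demo = rng.choice([0, 1, 2], size=1000, p=[0.8, 0.15, 0.05])
X_demo = rng.normal(size=(1000, 4))

# stratify=y_demo keeps the class proportions approximately equal in both splits.
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

# Share of each class in the train and test parts.
train_share = np.bincount(yd_train, minlength=3) / len(yd_train)
test_share = np.bincount(yd_test, minlength=3) / len(yd_test)
print(train_share.round(3), test_share.round(3))
```

The split in the cell above does not stratify; whether that matters depends on how rare the smallest weather class is in the real data.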

Considering the heavy class imbalance in the weather target, ***we will use repeated k-fold cross-validation to further split our train data***. For our classification experiments, we will use ***repeated stratified k-fold cross-validation***, which keeps the class proportions roughly equal in every fold. We choose *k = 10* because this value has been shown empirically to yield test error estimates that suffer from neither excessively high bias nor very high variance, which gives a reasonable bias-variance trade-off in training.

In [17]:
skfold = RepeatedStratifiedKFold(n_splits=10, n_repeats = 2, random_state=1)
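To make the splitter's behaviour concrete, here is a minimal sketch on synthetic labels (the arrays `X_demo` and `y_demo` and the class counts are made up for illustration). It shows that 10 splits repeated twice give 20 train/validation pairs, and that each validation fold mirrors the overall class mix.

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

# Synthetic imbalanced labels: 800 / 150 / 50 samples for classes 0 / 1 / 2.
rng = np.random.default_rng(0)
y_demo = np.array([0] * 800 + [1] * 150 + [2] * 50)
X_demo = rng.normal(size=(len(y_demo), 3))

cv_demo = RepeatedStratifiedKFold(n_splits=10, n_repeats=2, random_state=1)
splits = list(cv_demo.split(X_demo, y_demo))

# 10 folds x 2 repeats = 20 train/validation pairs.
n_folds = len(splits)

# Class counts inside every validation fold; stratification keeps them
# close to one tenth of the overall class counts.
val_counts = [np.bincount(y_demo[val_idx], minlength=3) for _, val_idx in splits]
print(n_folds, val_counts[0])
```

This is also why the fold counter in the training output further down runs past 10.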

Now we define a helper function that we will use for all our classification experiments. It returns the fitted model together with an instance of a small class that holds the mean validation scores for the experiment.

In [18]:
class ClassifierScores:
    """Holds the mean cross-validation scores of one experiment."""
    def __init__(self, mean_accuracy, mean_loss):
        self.accuracy = mean_accuracy
        self.logloss = mean_loss

def training(X_train, y_train, model):
    """Fit `model` on every fold of `skfold`; return the model and its mean validation scores."""
    fold_no = 1
    n_scores, log_scores = [], []
    for train_index, val_index in skfold.split(X_train, y_train):
        # select the rows of the current training and validation folds
        train_X, val_X = X_train.iloc[train_index], X_train.iloc[val_index]
        train_y, val_y = y_train.iloc[train_index], y_train.iloc[val_index]

        model.fit(train_X, train_y)
        n_scores.append(model.score(val_X, val_y))
        log_scores.append(log_loss(val_y, model.predict_proba(val_X), labels = [0,1,2]))
        print('For Fold {}, the Accuracy is {},'.format(str(fold_no), n_scores[fold_no - 1]), 
              'and the LogLoss is', log_scores[fold_no - 1])

        fold_no += 1

    mean_accuracy, std_accuracy = np.mean(n_scores), np.std(n_scores)
    mean_loss, std_loss = np.mean(log_scores), np.std(log_scores)

    print('\n======================================')
    print('Average Accuracy and LogLoss:')
    print('Accuracy: {:.6f} (+/- {:.6f}), LogLoss: {:.6f} (+/- {:.6f})'.format(
        mean_accuracy, std_accuracy, mean_loss, std_loss))

    # the model returned is the one fitted on the last fold
    return model, ClassifierScores(mean_accuracy, mean_loss)
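For comparison, the same per-fold bookkeeping can be delegated to scikit-learn's `cross_validate`. This is a hedged alternative sketch, not the helper used in this notebook: it runs on a synthetic dataset with a `LogisticRegression` stand-in, and the names `X_demo`, `y_demo`, `cv_demo` and `results` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_validate

# Small synthetic 3-class problem (illustrative only).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, n_informative=5, n_classes=3, random_state=0
)

cv_demo = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=1)

# cross_validate clones and refits the estimator on every fold and
# collects one score per fold for each requested metric.
results = cross_validate(
    LogisticRegression(max_iter=1000),
    X_demo, y_demo,
    cv=cv_demo,
    scoring=["accuracy", "neg_log_loss"],
)

mean_accuracy = results["test_accuracy"].mean()
# scikit-learn reports log loss negated (higher is better), so flip the sign.
mean_logloss = -results["test_neg_log_loss"].mean()
print(mean_accuracy, mean_logloss)
```

The hand-written loop keeps the per-fold printout and returns a fitted model, which `cross_validate` does not do by default.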

<br>

***To avoid information leakage from held-out data (validation folds and, ultimately, the test set) into the models we train, we will make use of pipelines in most of our experiments.***

We define a pipeline construct below that standardises our data (each feature is rescaled to zero mean and unit variance; the shape of its distribution is unchanged) and then fits a model on the standardised data. For experiments on just raw features, we will fit the model directly, without Standard Scaler.

In [19]:
def scale_pipe(model_name, x_train, y_train):
    """
    Builds a pipeline that standardises the features (zero mean, unit variance per column)
    and then fits the given model on the standardised data. Cross-validation scores are
    printed by training(); the fitted pipeline is returned.
    """
    trans = StandardScaler()
    model_pipeline = Pipeline([('scaler', trans), ('model', model_name)])
    
    # the scaler is re-fitted on the training part of each fold, so no validation data leaks in
    training(x_train, y_train, model_pipeline)
    
    return model_pipeline
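The leakage argument can be checked directly: when a pipeline is fitted on the training rows only, the scaler's statistics come from those rows and not from the full dataset. A minimal sketch on synthetic data (all names here, such as `X_demo` and `pipe`, are illustrative, and `LogisticRegression` is only a stand-in model):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
y_demo = (X_demo[:, 0] > 5.0).astype(int)

# A fixed train/validation split of the rows (illustrative).
train_idx, val_idx = np.arange(150), np.arange(150, 200)

pipe = Pipeline([("scaler", StandardScaler()), ("model", LogisticRegression())])
pipe.fit(X_demo[train_idx], y_demo[train_idx])

# The scaler inside the pipeline has only seen the training rows.
fitted_mean = pipe.named_steps["scaler"].mean_
train_mean = X_demo[train_idx].mean(axis=0)
full_mean = X_demo.mean(axis=0)

# At scoring time the validation rows are transformed with the training statistics.
val_accuracy = pipe.score(X_demo[val_idx], y_demo[val_idx])
print(fitted_mean, train_mean, full_mean, val_accuracy)
```

Scaling the whole dataset first and splitting afterwards would instead bake validation-row statistics into the training features.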

We will experiment with various models with and without standardisation and see how they perform. But before we proceed, let's see what our data looks like when standardised.

In [6]:
scaler = StandardScaler()
pd.DataFrame(scaler.fit_transform(final_data_weather), columns = final_data_weather.columns)

Unnamed: 0,M_SESSION_UID,M_SESSION_TIME,M_FRAME_IDENTIFIER,M_PLAYER_CAR_INDEX,M_SESSION_LINK_IDENTIFIER,M_TRACK_TEMPERATURE,M_TRACK_LENGTH,M_AIR_TEMPERATURE,M_TRACK_ID,M_TIME_OFFSET,M_RAIN_PERCENTAGE,M_AI_DIFFICULTY,M_NUM_MARSHAL_ZONES,M_SESSION_TIME_SPENT,M_WEATHER
0,-0.941930,-0.907981,-0.958823,0.481238,-0.834546,-0.185343,-1.821453,-0.390822,1.745594,-0.918268,-0.320525,1.701675,-1.353735,-0.227546,-0.627758
1,-0.941930,-0.907981,-0.958823,0.481238,-0.834546,-0.185343,-1.821453,-0.390822,1.745594,-0.918268,-0.320525,1.701675,-1.353735,-0.227546,-0.627758
2,-0.941930,-0.907981,-0.958823,0.481238,-0.834546,-0.185343,-1.821453,-0.390822,1.745594,-0.723403,-0.320525,1.701675,-1.353735,-0.227546,-0.627758
3,-0.941930,-0.907981,-0.958823,0.481238,-0.834546,-0.185343,-1.821453,-0.390822,1.745594,-0.723403,-0.320525,1.701675,-1.353735,-0.227546,-0.627758
4,-0.941930,-0.907981,-0.958823,0.481238,-0.834546,-0.185343,-1.821453,-0.390822,1.745594,-0.528538,-0.320525,1.701675,-1.353735,-0.227546,-0.627758
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
534050,2.209103,-0.928847,-0.981442,-2.199539,1.230293,-1.901298,0.544278,-2.205470,-0.654983,-0.723403,1.130695,-0.517774,2.862515,-1.070066,1.372537
534051,2.209103,-0.928847,-0.981442,-2.199539,1.230293,-1.901298,0.544278,-2.205470,-0.654983,-0.528538,0.767890,-0.517774,2.862515,-1.070066,1.372537
534052,2.209103,-0.928847,-0.981442,-2.199539,1.230293,-1.901298,0.544278,-2.205470,-0.654983,0.056057,-0.139123,-0.517774,2.862515,-1.070066,1.372537
534053,2.209103,-0.928847,-0.981442,-2.199539,1.230293,-1.901298,0.544278,-2.205470,-0.654983,0.640653,-0.139123,-0.517774,2.862515,-1.070066,1.372537
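Standardisation shifts and rescales each column to zero mean and unit variance; it does not change the shape of a column's distribution. A small check on synthetic, right-skewed data (the exponential sample is an assumption chosen for illustration) shows that the skewness is the same before and after scaling:

```python
import numpy as np
from scipy import stats
from sklearn.preprocessing import StandardScaler

# A clearly non-Gaussian, right-skewed feature (illustrative only).
rng = np.random.default_rng(0)
skewed = rng.exponential(scale=3.0, size=(10_000, 1))

scaled = StandardScaler().fit_transform(skewed)

# Location and scale are normalised ...
mean_after = scaled.mean()
std_after = scaled.std()

# ... but the skewness, a shape property, is unaffected by a linear rescaling.
skew_before = stats.skew(skewed[:, 0])
skew_after = stats.skew(scaled[:, 0])
print(mean_after, std_after, skew_before, skew_after)
```

Tree-based models such as the gradient boosting variants used below split on feature thresholds, so they are largely insensitive to this kind of monotonic rescaling.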


<br>
<h4><b>2. Model Experimentations I: Weather Classification</b></h4>
Here are the 4 classification algorithms we will experiment with:

- Gradient Boosted Trees Classifier
- XGBoost Classifier
- Light Gradient Boosted Machines Classifier
- RandomForest Classifier


##### **(a) Gradient Boosted Trees Classifier**

*First Experiment: Without Standardisation*

In [20]:
GBC = GradientBoostingClassifier()

In [21]:
_, model_gbc = training(X_train, y_train, GBC)  # training returns (fitted model, scores); keep the scores
model_gbc

KeyboardInterrupt: 

In [None]:
model_gb_accuracy = model_gbc.accuracy
model_gb_logloss = model_gbc.logloss
model_gb_logloss

*Second Experiment: With Standardisation*

In [None]:
GBC2 = GradientBoostingClassifier()
model_gbc_scaled = scale_pipe(GBC2, X_train, y_train)
model_gbc_scaled

For Fold 1 the Accuracy is 1.0, and the LogLoss is 0.0025856786731771144
For Fold 2 the Accuracy is 1.0, and the LogLoss is 0.0022258996481037564
For Fold 3 the Accuracy is 1.0, and the LogLoss is 0.0022946443723346334
For Fold 4 the Accuracy is 1.0, and the LogLoss is 0.0021741248483805337
For Fold 5 the Accuracy is 1.0, and the LogLoss is 0.0022748596871658266
For Fold 6 the Accuracy is 1.0, and the LogLoss is 0.0025054619903956393
For Fold 7 the Accuracy is 1.0, and the LogLoss is 0.0024137793942517354
For Fold 8 the Accuracy is 1.0, and the LogLoss is 0.0022935201711755965
For Fold 9 the Accuracy is 1.0, and the LogLoss is 0.002422249096510743
For Fold 10 the Accuracy is 1.0, and the LogLoss is 0.002341781701161599
For Fold 11 the Accuracy is 1.0, and the LogLoss is 0.002391817244708867
For Fold 12 the Accuracy is 1.0, and the LogLoss is 0.002379902752678636
For Fold 13 the Accuracy is 1.0, and the LogLoss is 0.00226958303699276
For Fold 14 the Accuracy is 1.0, and the LogLoss is 0

##### **(b) XGBoost Classifier**

*First Experiment: Without Standardisation*

In [22]:
xgb = XGBClassifier()

In [None]:
warnings.filterwarnings('ignore')
_, model_xgb = training(X_train, y_train, xgb)  # training returns (fitted model, scores); keep the scores
model_xgb

For Fold 1, the Accuracy is 1.0, and the LogLoss is 6.595909427796833e-06
For Fold 2, the Accuracy is 1.0, and the LogLoss is 6.239530378252062e-06
For Fold 3, the Accuracy is 1.0, and the LogLoss is 6.334016523466604e-06
For Fold 4, the Accuracy is 1.0, and the LogLoss is 7.366365452343651e-06
For Fold 5, the Accuracy is 1.0, and the LogLoss is 6.633528435390757e-06
For Fold 6, the Accuracy is 1.0, and the LogLoss is 5.848046982227228e-06
For Fold 7, the Accuracy is 1.0, and the LogLoss is 6.650637248941316e-06
For Fold 8, the Accuracy is 1.0, and the LogLoss is 8.264812783209205e-06


In [None]:
model_xgb_accuracy = model_xgb.accuracy
model_xgb_logloss = model_xgb.logloss
model_xgb_logloss

*Second Experiment: With Standardisation*

In [None]:
xgb2 = XGBClassifier()
model_xgb_scaled = scale_pipe(xgb2, X_train, y_train)
model_xgb_scaled


##### **(c) Light Gradient Boosted Machines Classifier**

*First Experiment: Without Standardisation*

In [20]:
lgb = LGBMClassifier()

In [21]:
_, model_lgb = training(X_train, y_train, lgb)  # training returns (fitted model, scores); keep the scores
model_lgb

KeyboardInterrupt: 

In [None]:
model_lgb_accuracy = model_lgb.accuracy
model_lgb_logloss = model_lgb.logloss
model_lgb_logloss

*Second Experiment: With Standardisation*

In [None]:
lgb2 = LGBMClassifier()
model_lgb_scaled = scale_pipe(lgb2, X_train, y_train)
model_lgb_scaled

<br>
<h4><b>2. Model Experimentations I: Classification</b></h4>
Models we will experiment with:
- XGBoost Cl