## Part III: Learning Methodology
<h4>Team Twin AI</h4>
<h4><b>Overview</b></h4>

This is the Machine Learning component of our solution to the FormulaAI Hack 2022 Competition. The workflow for this notebook is outlined as follows:
- Standardisation and Pipelines
- Model Experimentations I: Weather Classification
- Leaderboard Ranking and Holdout Evaluation I
- Model Experimentations II: Regression
- Leaderboard Ranking and Holdout Evaluation II
- Predictions and Exporting

In [None]:
import os

import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
pd.set_option('display.max_columns', None, 'display.max_rows', 100)

import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns
#import deepchecks as dc

from scipy import stats
from sklearn.feature_selection import RFE
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import mutual_info_classif
from sklearn.feature_selection import GenericUnivariateSelect
from sklearn.preprocessing import scale,StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import accuracy_score
from sklearn.metrics import log_loss

from xgboost import XGBClassifier
from xgboost import XGBRFClassifier
from lightgbm import LGBMClassifier
#from catboost import CatBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression

from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from xgboost import XGBRFRegressor
from sklearn.svm import SVR

from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import RepeatedKFold
from sklearn.experimental import enable_hist_gradient_boosting
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import ElasticNet

import random
import time
from datetime import datetime

#### <i>Read the data</i>

In [None]:
final_data_weather = pd.read_csv('../input/final-weather-data-formulaai/final_data_weather.csv', index_col = False)
print(final_data_weather.shape)
final_data_weather.head(2)

In [None]:
weather_X = final_data_weather.drop('M_WEATHER', axis=1)
weather_y = final_data_weather['M_WEATHER']

rain_X = final_data_weather.drop('M_RAIN_PERCENTAGE', axis=1)
rain_y = final_data_weather['M_RAIN_PERCENTAGE']

print(weather_X.shape)
print(rain_X.shape)

<br>
<h4><b>1. Cross-Validation, Standardisation and Pipelines</b></h4>

***Creating train, test and validation sets***
    
We first split our data into train and test sets. The test set is our holdout set and will not be unlocked until the end of each of the 2 sequences of experiments for classification and regression, respectively. The validation set will be split out of the train data and will be used for primary evaluation and to compute cross validation scores in each of our experiments.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(weather_X, weather_y, test_size=.25, random_state=42)

Considering the large class imbalance in the weather target, ***we will implement repeated k-fold cross validation to further split our train data***. For our classification experiments, we will use ***repeated stratified k-fold cross validation***, which preserves the class proportions in every fold. We choose *k = 10* because this value has been shown empirically to yield test error rate estimates that suffer neither from excessively high bias nor from very high variance. In other words, *k = 10* gives a reasonable bias-variance trade-off in training.

In [None]:
skfold = RepeatedStratifiedKFold(n_splits=10, n_repeats = 2, random_state=1)
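
To make the behaviour of the splitter concrete, here is a minimal sketch on synthetic, imbalanced labels (not the competition data). The names `X_demo`, `y_demo` and `demo_cv` are illustrative only, and the fold count is reduced to 5 to keep the example small.

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

# Synthetic 3-class target with an 80/15/5 imbalance (illustrative only)
y_demo = np.array([0] * 80 + [1] * 15 + [2] * 5)
X_demo = np.zeros((len(y_demo), 1))

demo_cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=1)

# Count how many samples of each class land in every validation fold
fold_class_counts = [
    np.bincount(y_demo[val_idx], minlength=3)
    for _, val_idx in demo_cv.split(X_demo, y_demo)
]

n_folds = len(fold_class_counts)  # n_splits * n_repeats
print(n_folds, fold_class_counts[0])
```

Every validation fold keeps the same class ratio as the full target, which is why stratification matters when the rare classes would otherwise be missing from some folds.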

Now, we will define a helper function that we will use for all our classification experiments. This function will also help to return validation scores for each experiment.

In [None]:
def training(X_train, y_train, model):

    fold_no = 1    
    n_scores, log_scores = [],[]
    for train_index, val_index in skfold.split(X_train, y_train):
        # select rows
        train_X, val_X = X_train.iloc[train_index], X_train.iloc[val_index]
        train_y, val_y = y_train.iloc[train_index], y_train.iloc[val_index]
        
        #eval_set = [(val_X, val_y)]
        model.fit(train_X, train_y)#, eval_metric=metric, eval_set=eval_set, verbose=False)
        
        n_scores.append(model.score(val_X, val_y))
        log_scores.append(log_loss(val_y, model.predict_proba(val_X), labels = [0,1,2]))
        print('For Fold {}, the Accuracy is {},'.format(str(fold_no), n_scores[fold_no - 1]), 
              'and the LogLoss is', log_scores[fold_no - 1])

        fold_no += 1
     
    #cv_score = cross_val_score(model, train_X, train_y, scoring='accuracy', cv=skfold, n_jobs=-1, error_score='raise')
    mean_accuracy, std_accuracy = np.mean(n_scores), np.std(n_scores)
    mean_loss, std_loss = np.mean(log_scores), np.std(log_scores)
    
    print('\n======================================')
    
    print('Mean Accuracy: %.3f (%.3f)' % (mean_accuracy, std_accuracy))
    #print('Mean CV Accuracy: %.3f (%.3f)' % (np.mean(cv_score), np.std(cv_score)))
    #
    #print('\n======================================')
    #
    print('Average LogLoss: {} ({})'.format(mean_loss, std_loss))

    return model, mean_loss
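
For reference, the log loss reported by the helper is the mean negative log-probability that the model assigns to the true class. The sketch below checks this by hand against `sklearn.metrics.log_loss`; the probabilities are made up purely for illustration.

```python
import numpy as np
from sklearn.metrics import log_loss

# Made-up labels and predicted class probabilities (each row sums to 1)
demo_true = np.array([0, 1, 2, 0])
demo_proba = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.2, 0.6],
    [0.5, 0.3, 0.2],
])

# Pick the probability given to the true class of each row, then average -log
true_class_proba = demo_proba[np.arange(len(demo_true)), demo_true]
manual_loss = -np.mean(np.log(true_class_proba))

sklearn_loss = log_loss(demo_true, demo_proba, labels=[0, 1, 2])
print(manual_loss, sklearn_loss)
```

Passing `labels` explicitly, as the helper does, keeps the column order well defined even if a validation fold happens to miss one of the classes.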

<br>

***To avoid information leakage from our test data into the models we want to train, we will make use of pipelines in most of our experiments.***

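We define a pipeline construct below that standardises our data (rescaling each feature to zero mean and unit variance), then fits a model on the standardised data. Because the scaler sits inside the pipeline, its statistics are computed only from the data the pipeline is fitted on. For experiments on just raw features, we will train the model without the StandardScaler step.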

In [None]:
def scale_pipe(model_name, x_train, y_train):
    """
    This function standardises the data (zero mean, unit variance per feature), then applies a 
    pipeline construct to fit a model on the standardised data
    """
    trans = StandardScaler()
    pipe = Pipeline([('scaler', trans), ('model', model_name)])
    
    model_pipeline, mean_loss = training(x_train, y_train, pipe)
    
    return model_pipeline, mean_loss
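
The leakage argument can be checked directly. In this minimal sketch on synthetic data, the scaler inside a fitted pipeline keeps the statistics of the data it was fitted on, even after predicting on deliberately shifted new data. `LogisticRegression` and all `demo`-style names are stand-ins chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X_fit = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
y_fit = (X_fit[:, 0] > 5.0).astype(int)
# New data drawn from a very different location, to make any leakage visible
X_new = rng.normal(loc=50.0, scale=2.0, size=(50, 3))

demo_pipe = Pipeline([('scaler', StandardScaler()), ('model', LogisticRegression())])
demo_pipe.fit(X_fit, y_fit)

# predict() only applies the already-fitted scaler; it never refits it
demo_pipe.predict(X_new)
fitted_mean = demo_pipe.named_steps['scaler'].mean_
mean_unchanged = np.allclose(fitted_mean, X_fit.mean(axis=0))
print(mean_unchanged)
```

The same holds inside each cross-validation fold: the scaler is refitted on that fold's training rows and only applied to the validation rows.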

We will experiment with various models with and without standardisation and see how they perform. But before we proceed, let's see what our data looks like when standardised.

In [None]:
scaler = StandardScaler()
pd.DataFrame(scaler.fit_transform(final_data_weather), columns = final_data_weather.columns)
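
As a quick check of what the scaler does, the sketch below standardises a skewed synthetic feature. The mean becomes 0 and the standard deviation 1, but the shape of the distribution (measured here by skewness) is unchanged. The exponential sample is an assumption made for illustration only.

```python
import numpy as np
from scipy import stats
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# A clearly non-Gaussian (right-skewed) synthetic feature
skewed = rng.exponential(scale=2.0, size=(1000, 1))

scaled_demo = StandardScaler().fit_transform(skewed)

mean_after = scaled_demo.mean()
std_after = scaled_demo.std()
skew_before = stats.skew(skewed.ravel())
skew_after = stats.skew(scaled_demo.ravel())
print(mean_after, std_after, skew_before, skew_after)
```

If a more Gaussian-like shape were needed, a transform such as `PowerTransformer` would be the relevant tool; that is outside what this notebook does.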

<br>
<h4><b>2. Model Experimentations I: Weather Classification</b></h4>
Here are the 4 classification algorithms we will experiment with:

- XGBoost Classifier
- XGB Random Forest Classifier
- Light Gradient Boosted Machines Classifier
- Gradient Boosted Trees with CatBoost Classifier

##### **(a) XGBoost Classifier**

*First Experiment: Without Standardisation*

In [None]:
xgb = XGBClassifier()
model_xgb, model_xgb_loss = training(X_train, y_train, xgb)
model_xgb

In [None]:
model_xgb_loss

*Second Experiment: With Standardisation*

In [None]:
xgb2 = XGBClassifier()
# pass the fresh estimator (xgb2), not the one already fitted in the first experiment
xgb_scaled, xgb_scaled_loss = scale_pipe(xgb2, X_train, y_train)
xgb_scaled

In [None]:
xgb_scaled_loss

##### **(b) XGBoost Random Forest Classifier**

*First Experiment: Without Standardisation*

In [None]:
xgbrf = XGBRFClassifier(n_estimators=100, subsample=0.9, colsample_bynode=0.2)
xgb_forest, xgb_forest_loss = training(X_train, y_train, xgbrf)
xgb_forest

In [None]:
xgb_forest_loss

*Second Experiment: With Standardisation*

In [None]:
xgbrf2 = XGBRFClassifier(n_estimators=100, subsample=0.9, colsample_bynode=0.2)
xgb_forest_scaled, xgb_forest_scaled_loss = scale_pipe(xgbrf2, X_train, y_train)
xgb_forest_scaled

In [None]:
xgb_forest_scaled_loss

##### **(c) Light Gradient Boosted Machines Classifier**

*First Experiment: Without Standardisation*

In [None]:
lgb = LGBMClassifier()
lgb_model, lgb_loss = training(X_train, y_train, lgb)
lgb_model

In [None]:
lgb_loss

*Second Experiment: With Standardisation*

In [None]:
lgb2 = LGBMClassifier()
lgb_scaled, lgb_scaled_loss = scale_pipe(lgb2, X_train, y_train)
lgb_scaled

In [None]:
lgb_scaled_loss

##### **(d) Gradient Boosted Trees with CatBoost Classifier**

*First Experiment: Without Standardisation*

In [None]:
from catboost import CatBoostClassifier

ctb = CatBoostClassifier(verbose=0, n_estimators=100)
ctb_model, ctb_loss = training(X_train, y_train, ctb)
ctb_model

In [None]:
ctb_loss

*Second Experiment: With Standardisation*

In [None]:
ctb2 = CatBoostClassifier(verbose=0, n_estimators=100)
ctb_scaled, ctb_scaledloss = scale_pipe(ctb2, X_train, y_train)
ctb_scaled

In [None]:
ctb_scaledloss

<br>
<h4><b>3. Leaderboard Ranking and Holdout Evaluation I </b></h4>

Before we unlock our holdout set, which is 25% of our entire dataset, we will create a leaderboard of the models built so far and rank them by log loss, both with and without standardisation. This will tell us whether to evaluate on a standardised transform of our test set or on the raw features.

In [None]:
unpiped_models = [model_xgb_loss, xgb_forest_loss, lgb_loss, ctb_loss]
piped_models = [xgb_scaled_loss, xgb_forest_scaled_loss, lgb_scaled_loss, ctb_scaledloss]
model_index = ['XGBoost','XGBoost Random Forest','LightGBM', 'CatBoost Classifier']

pre_leaderboard_1 = pd.DataFrame({'Log Loss (Without Standardisation)': unpiped_models, 
                                  'Log Loss (With Standardisation)': piped_models}, index = model_index)

pd.set_option('display.float_format', lambda x: '%.10f' % x)
pre_leaderboard_1.round(10)

Now, let's sort the table in ascending order, starting with the first column. We rank in ascending order because the goal of the learning algorithms is to minimise the log loss.

In [None]:
pre_leaderboard_1.nsmallest(4, 'Log Loss (Without Standardisation)')

In [None]:
pre_leaderboard_1.nsmallest(4, 'Log Loss (With Standardisation)')

From the above results, we can see that only XGBoost Random Forest performed better with standardisation. In both cases, XGBoost outperformed other models by a large margin. 

***We will therefore select XGBoost and unlock our test (holdout) set for evaluation without standardisation.***

In [None]:
xgb1 = model_xgb

print('Average LogLoss from Cross Validation:', model_xgb_loss)
print('\n======================================')

eval_set = [(X_test, y_test)]
xgb1.fit(X_train, y_train, eval_metric = 'mlogloss', eval_set = eval_set, verbose=False)
xgb1_eval = log_loss(y_test, xgb1.predict_proba(X_test), labels = [0,1,2])

print('\n======================================')
print('LogLoss on Holdout:', xgb1_eval)

<br>
<h4><b>4. Model Experimentations II: Regression</b></h4>
Here are the 4 regression algorithms we will experiment with:

- RandomForest Regressor
- Light Gradient Boosted Trees Regressor (with Early Stopping)
- ElasticNet Regularised Regression with Grid Search Optimisation
- XGBoost Regressor

Before we proceed, we specify a new train-test split using the rain percentage data. We also change our k-fold algorithm to RepeatedKFold. In each of these experiments, we will use cross validation to give us insights into which model performs the best before we then unlock holdout to apply the best algorithm.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(rain_X, rain_y, test_size=.25, random_state=42)
kfold = RepeatedKFold(n_splits=10, n_repeats = 2, random_state=1)
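
As a small illustration of the new splitter, the sketch below runs `RepeatedKFold` with the same settings on 50 dummy rows. It yields `n_splits * n_repeats` train/validation splits, and every row is used for validation exactly once per repeat. `X_demo` and `demo_kfold` are illustrative names.

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold

X_demo = np.arange(50).reshape(-1, 1)
demo_kfold = RepeatedKFold(n_splits=10, n_repeats=2, random_state=1)

val_counts = np.zeros(len(X_demo), dtype=int)
n_splits_total = 0
for _, val_idx in demo_kfold.split(X_demo):
    # Track how often each row ends up in a validation fold
    val_counts[val_idx] += 1
    n_splits_total += 1

print(n_splits_total, val_counts.min(), val_counts.max())
```

No stratification is applied here, which suits a continuous target such as the rain percentage.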

In [None]:
def train_reg(X_train, y_train, model):
    """
    Fits a regressor on each repeated k-fold split and reports the validation
    R^2 and MAE per fold, followed by their mean and standard deviation
    """
    fold_no = 1    
    r2_scores, mae_scores = [],[]
    # the target is continuous, so use the unstratified repeated k-fold splitter
    for train_index, val_index in kfold.split(X_train, y_train):
        # select rows
        train_X, val_X = X_train.iloc[train_index], X_train.iloc[val_index]
        train_y, val_y = y_train.iloc[train_index], y_train.iloc[val_index]
        
        model.fit(train_X, train_y)
        
        # for regressors, .score() returns R^2 rather than accuracy
        r2_scores.append(model.score(val_X, val_y))
        mae_scores.append(mean_absolute_error(val_y, model.predict(val_X)))
        print('For Fold {}, the R^2 is {},'.format(str(fold_no), r2_scores[fold_no - 1]), 
              'and the MAE is', mae_scores[fold_no - 1])

        fold_no += 1
     
    mean_r2, std_r2 = np.mean(r2_scores), np.std(r2_scores)
    mean_mae, std_mae = np.mean(mae_scores), np.std(mae_scores)
    
    print('\n======================================')
    
    print('Mean R^2: %.3f (%.3f)' % (mean_r2, std_r2))
    print('Average MAE: {} ({})'.format(mean_mae, std_mae))

    return model, mean_mae
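
One detail worth keeping in mind when reading the per-fold output: for scikit-learn regressors, `.score()` returns the coefficient of determination (R²), not an accuracy. A minimal sketch on synthetic data, with `LinearRegression` used only as a stand-in model:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 2))
y_demo = X_demo @ np.array([1.5, -2.0]) + rng.normal(scale=0.5, size=100)

reg_demo = LinearRegression().fit(X_demo, y_demo)
demo_pred = reg_demo.predict(X_demo)

score_value = reg_demo.score(X_demo, y_demo)  # R^2
r2_value = r2_score(y_demo, demo_pred)        # same quantity, computed explicitly
mae_value = mean_absolute_error(y_demo, demo_pred)
print(score_value, r2_value, mae_value)
```

R² is unitless and higher is better, whereas MAE is in the units of the target and lower is better, so the two should not be mixed on one leaderboard.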

##### **(a) Random Forest Regressor**

*First Experiment: Without Standardisation*

In [None]:
rforest = RandomForestRegressor(n_estimators=100)
cv_scores = cross_val_score(rforest, X_train, y_train, scoring='neg_mean_absolute_error', 
                            cv=kfold, n_jobs=-1, error_score='raise')

In [None]:
# 'neg_mean_absolute_error' returns negated MAE values, so flip the sign
rforest_mae = -np.mean(cv_scores)
print('MAE and Standard Deviation: %.3f (%.3f)' % (rforest_mae, np.std(cv_scores)))
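
A note on the sign convention: scikit-learn scorers follow a "higher is better" rule, so `scoring='neg_mean_absolute_error'` returns the MAE multiplied by -1. The sketch below shows this on synthetic data; `LinearRegression` and the `demo` names are placeholders for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 2))
y_demo = X_demo @ np.array([1.5, -2.0]) + rng.normal(scale=0.5, size=100)

neg_scores = cross_val_score(
    LinearRegression(), X_demo, y_demo,
    scoring='neg_mean_absolute_error',
    cv=KFold(n_splits=5, shuffle=True, random_state=1),
)

# Every entry is <= 0; negate the mean to recover a conventional MAE
demo_mae = -np.mean(neg_scores)
print(neg_scores, demo_mae)
```

This matters for the leaderboard later on: ranking with "smallest is best" only works if every entry is a positive MAE.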

*Second Experiment: With Standardisation*

In [None]:
def reg_pipe(model_name, x_train, y_train):
    """
    This function standardises the data (zero mean, unit variance per feature), 
    then applies repeated 10-fold cross-validation and returns the mean MAE
    """
    trans = StandardScaler()
    pipe = Pipeline([('scaler', trans), ('model', model_name)])
    
    scores = cross_val_score(pipe, x_train, y_train, scoring='neg_mean_absolute_error', 
                            cv=kfold, n_jobs=-1, error_score='raise')
    # the scorer returns negated MAE values, so flip the sign
    scaled_mae = -np.mean(scores)
    
    return pipe, scaled_mae

In [None]:
r_forest_scaled, r_forest_scaled_mae = reg_pipe(RandomForestRegressor(n_estimators = 20), X_train, y_train)
r_forest_scaled_mae

##### **(b) Light Gradient Boosted Trees Regressor (with Early Stopping)**

*First Experiment: Without Standardisation*

In [None]:
lgb = LGBMRegressor()

fold_no = 1    
n_scores, mae_scores = [],[]
for train_index, val_index in kfold.split(X_train, y_train):
    # select rows
    train_X, val_X = X_train.iloc[train_index], X_train.iloc[val_index]
    train_y, val_y = y_train.iloc[train_index], y_train.iloc[val_index]

    eval_set = [(val_X, val_y)]
    lgb.fit(train_X, train_y, early_stopping_rounds=10, 
            eval_metric='mean_absolute_error', eval_set=eval_set, verbose=False)

    n_scores.append(lgb.score(val_X, val_y))
    mae_scores.append(mean_absolute_error(val_y, lgb.predict(val_X)))
    print('For Fold {}, the R^2 is {},'.format(str(fold_no), n_scores[fold_no - 1]), 
    'and the MAE is', mae_scores[fold_no - 1])

    fold_no += 1
                      
lgb_mae = np.mean(mae_scores)

print('\n======================================')
print('MAE and Standard Deviation: %.3f (%.3f)' % (lgb_mae, np.std(mae_scores)))

*Second Experiment: With Standardisation (and Without Early Stopping)*

In [None]:
lgb_scaled, lgb_scaled_mae = reg_pipe(LGBMRegressor(), X_train, y_train)
lgb_scaled_mae

##### **(c) ElasticNet Regularised Regression with GridSearch Optimisation**

*First Experiment: Without Standardisation*

In [None]:
start_model = ElasticNet()

# define grid
grid = dict()
grid['alpha'] = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 0.0, 1.0, 10.0, 100.0]
grid['l1_ratio'] = np.arange(0, 1, 0.01)

# define search
search = GridSearchCV(start_model, grid, scoring='neg_mean_absolute_error', cv=kfold, n_jobs=-1)

# perform the search
results = search.fit(X_train, y_train)

# summarize
# best_score_ is the negated MAE for this scorer, so flip the sign
print('MAE: %.3f' % -results.best_score_)
print('Config: %s' % results.best_params_)
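
The best configuration found by `GridSearchCV` is exposed as a plain dictionary in `best_params_`, keyed by parameter name. Below is a minimal sketch on synthetic data with a much smaller grid than the one above; the grid values and `demo` names are illustrative only.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV, KFold

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 3))
y_demo = X_demo @ np.array([2.0, 0.0, -1.0]) + rng.normal(scale=0.3, size=120)

demo_grid = {'alpha': [0.01, 0.1, 1.0], 'l1_ratio': [0.2, 0.5, 0.8]}
demo_search = GridSearchCV(
    ElasticNet(), demo_grid,
    scoring='neg_mean_absolute_error',
    cv=KFold(n_splits=3, shuffle=True, random_state=1),
)
demo_search.fit(X_demo, y_demo)

# Read the tuned values by name rather than by position
best_alpha = demo_search.best_params_['alpha']
best_l1_ratio = demo_search.best_params_['l1_ratio']

# Equivalent shortcut: unpack the whole dictionary into a new estimator
refit_model = ElasticNet(**demo_search.best_params_)
print(best_alpha, best_l1_ratio)
```

Looking parameters up by key avoids relying on the order of the dictionary's values.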

In [None]:
#Implement cross-validation for ElasticNet across train data

enet = ElasticNet(alpha = results.best_params_['alpha'],
                        l1_ratio = results.best_params_['l1_ratio'])

fold_no = 1    
n_scores, mae_scores = [],[]
for train_index, val_index in kfold.split(X_train, y_train):
    # select rows
    train_X, val_X = X_train.iloc[train_index], X_train.iloc[val_index]
    train_y, val_y = y_train.iloc[train_index], y_train.iloc[val_index]

    enet.fit(train_X, train_y)

    n_scores.append(enet.score(val_X, val_y))
    mae_scores.append(mean_absolute_error(val_y, enet.predict(val_X)))
    print('For Fold {}, the R^2 is {},'.format(str(fold_no), n_scores[fold_no - 1]), 
    'and the MAE is', mae_scores[fold_no - 1])

    fold_no += 1
                      
enet_mae = np.mean(mae_scores)

print('\n======================================')
print('MAE and Standard Deviation: %.3f (%.3f)' % (enet_mae, np.std(mae_scores)))

*Second Experiment: With Standardisation*

In [None]:
enet_scaled = ElasticNet(alpha = results.best_params_['alpha'],
                        l1_ratio = results.best_params_['l1_ratio'])

enet_scaled, enet_scaled_mae = reg_pipe(enet_scaled, X_train, y_train)
enet_scaled_mae

##### **(d) XGBoost Regressor**

*First Experiment: Without Standardisation*

In [None]:
xgb_reg = XGBRegressor(objective='reg:squarederror')
cv_scores = cross_val_score(xgb_reg, X_train, y_train, scoring='neg_mean_absolute_error', 
                            cv=kfold, n_jobs=-1, error_score='raise')

# the scorer returns negated MAE values, so flip the sign
xgbr_mae = -np.mean(cv_scores)
print('MAE and Standard Deviation: %.3f (%.3f)' % (xgbr_mae, np.std(cv_scores)))

*Second Experiment: With Standardisation*

In [None]:
xgbr = XGBRegressor(objective='reg:squarederror')
xgbr_scaled, xgbr_scaled_mae = reg_pipe(xgbr, X_train, y_train)
xgbr_scaled_mae

<br>
<h4><b>5. Leaderboard Ranking and Holdout Evaluation II</b></h4>

In [None]:
unpiped_models_reg = [rforest_mae, lgb_mae, enet_mae, xgbr_mae]
piped_models_reg = [r_forest_scaled_mae, lgb_scaled_mae, enet_scaled_mae, xgbr_scaled_mae]
# index labels follow the order of the regression experiments above
reg_model_index = ['Random Forest','LightGBM','ElasticNet', 'XGBoost']

pre_leaderboard_2 = pd.DataFrame({'MAE (Without Standardisation)': unpiped_models_reg, 
                                  'MAE (With Standardisation)': piped_models_reg}, index = reg_model_index)

pd.set_option('display.float_format', lambda x: '%.10f' % x)
pre_leaderboard_2.round(10)

In [None]:
pre_leaderboard_2.nsmallest(4, 'MAE (Without Standardisation)')

In [None]:
pre_leaderboard_2.nsmallest(4, 'MAE (With Standardisation)')

## Part IV: Predictions and Exporting

In [None]:
import json

def predictor(X, classifier, regressor):
    """
    returns a test json response of predictions across the time interval 
    of {5,10,15,30,60} in minutes after a session timestamp
    """
    # X is assumed to be a single-row DataFrame holding the model features
    output = dict()
    intervals = [5,10,15,30,60]
    
    # 'offset' avoids shadowing the imported time module
    for offset in intervals:
        # Assign the current time offset to the feature row
        X['M_TIME_OFFSET'] = offset
        weather_prediction = classifier.predict(X)
        rain_prediction = regressor.predict(X)
        
        # predict() returns one value per row, so take the first element
        output[offset] = {
            'weather_type': int(weather_prediction[0]),
            'rain_percentage' : round(float(rain_prediction[0]),2)
        }
    return json.dumps(output)
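
To show the shape of the JSON response, here is a self-contained sketch that uses scikit-learn dummy estimators in place of the trained models. The helper `predict_offsets`, the column `M_AIR_TEMPERATURE` and the training values are hypothetical and exist only for this illustration; the input is assumed to be a single-row DataFrame.

```python
import json

import pandas as pd
from sklearn.dummy import DummyClassifier, DummyRegressor

def predict_offsets(row, classifier, regressor, offsets=(5, 10, 15, 30, 60)):
    """Hypothetical helper: one prediction pair per time offset, as JSON."""
    output = {}
    for offset in offsets:
        features = row.copy()  # leave the caller's DataFrame untouched
        features['M_TIME_OFFSET'] = offset
        output[offset] = {
            'weather_type': int(classifier.predict(features)[0]),
            'rain_percentage': round(float(regressor.predict(features)[0]), 2),
        }
    return json.dumps(output)

# Tiny made-up training frame; M_AIR_TEMPERATURE is an assumed column name
demo_X = pd.DataFrame({'M_TIME_OFFSET': [0, 0, 0], 'M_AIR_TEMPERATURE': [20, 25, 30]})
demo_clf = DummyClassifier(strategy='most_frequent').fit(demo_X, [0, 0, 1])
demo_reg = DummyRegressor(strategy='mean').fit(demo_X, [10.0, 20.0, 30.0])

response = predict_offsets(demo_X.iloc[[0]], demo_clf, demo_reg)
parsed = json.loads(response)
print(parsed)
```

Note that `json.dumps` turns the integer offsets into string keys, and the explicit `int(...)` and `float(...)` casts are there because NumPy scalar types are not always JSON serialisable.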

In [None]:
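import pickle

# NOTE: train_df_fu11 and moc_model are not defined in this notebook and are
# assumed to be available from elsewhere in the solution
model_feat = '../input/final-weather-data-formulaai/twin_ai_features.pkl'
with open(model_feat, 'wb') as feat_file:
    pickle.dump(train_df_fu11.columns, feat_file)


model_file = '../input/final-weather-data-formulaai/twin_ai_model.h5'
with open('model_file.pkl', 'wb') as out_file:
    pickle.dump(moc_model, out_file)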

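
A minimal sketch of a pickle round trip for a fitted estimator, written to a temporary directory so that it runs anywhere. The `DummyRegressor` and the file name `demo_model.pkl` are placeholders, not the models or paths used in this solution.

```python
import os
import pickle
import tempfile

from sklearn.dummy import DummyRegressor

# Placeholder model: always predicts the mean of its training targets
demo_model = DummyRegressor(strategy='mean').fit([[0], [1]], [1.0, 3.0])

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'demo_model.pkl')

    # Context managers close the file handles even if dumping or loading fails
    with open(demo_path, 'wb') as out_file:
        pickle.dump(demo_model, out_file)

    with open(demo_path, 'rb') as in_file:
        restored_model = pickle.load(in_file)

restored_pred = float(restored_model.predict([[5]])[0])
print(restored_pred)
```

Pickled estimators should only be loaded from trusted sources and, ideally, with the same library versions that were used to save them.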