# Before you begin

This notebook was written in Google Colab.

To see the interactive plots, please open the Colab link below.

<a href="https://colab.research.google.com/drive/1WPxPqsUsWxgZcmeHRsXn-jYR6aJLdegO?usp=sharing" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab"/></a>

There are many similar notebooks for various competitions; see the GitHub repository below.

<img src="https://github.githubassets.com/images/modules/logos_page/GitHub-Mark.png" width=50 align='left' alt="GitHub logo" />
&nbsp; <font size="5">[Github: Kaggle-Notebook](https://github.com/JayAhn0104/Kaggle-Notebook)</font>

# Overview

<br>

## Competition description

<img src="https://storage.googleapis.com/kaggle-competitions/kaggle/4699/logos/thumb76_76.png" width=40 align='left' alt="Prudential Life Insurance Assessment logo"/>
&nbsp; 
<font size="5">[Prudential Life Insurance Assessment](https://www.kaggle.com/c/prudential-life-insurance-assessment)</font>

- Problem type: (Ordinal) classification
  - Predict the life insurance risk level (8 ordered classes) of an individual
- Evaluation metric: [quadratic weighted kappa](https://www.kaggle.com/c/prudential-life-insurance-assessment/overview/evaluation)

<br>

## Notebook description

This notebook provides a '**proper workflow**' for a Kaggle submission.

The workflow is divided into three main steps.
1. Data preprocessing
2. Model selection (hyperparameter tuning, model combination, model comparison)
3. Training final model & Prediction on Test-set

At each stage, a detailed description of the work and an appropriate procedure are provided.

Through this notebook, readers can learn the 'proper workflow' for a Kaggle submission
and, using it as a basic structure, apply it to other competitions with some adjustments.

**Warnings**:
- The purpose of this notebook
  - This notebook focuses on the 'procedure' rather than the 'result'.
  - Thus this notebook does not guide you on how to achieve the top score, since I personally think that a result is meaningful only when it comes from an appropriate procedure.

- The intended readers of this notebook
  - Those who know the basic usage of data processing tools (e.g., numpy, pandas)
  - Those who know the basic concepts of machine learning models


# 0. Preliminaries

### > Set Configurations 

- Set the configurations for this notebook

In [None]:
config = {
    'data_name': 'prudential-life-insurance-assessment',
    'random_state': 2022
}

### > Install Libraries

In [None]:
!pip install tune_sklearn skorch

# 1. Data preprocessing

The data preprocessing work is divided into 8 steps here.

Some of these steps are mandatory and some are optional.

Optional steps are marked separately.

It is important to go through the steps in the given order.
Be careful not to reverse the order.

## 1-1. Load Dataset

Load the train-set and test-set into the working environment.


In [None]:
%%bash 

(
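# config['data_name'] is a Python variable and is not visible inside a %%bash cell,
# so the directory name (the same value) is written out literally
cd prudential-life-insurance-assessment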
unzip train.csv
unzip test.csv
)

In [None]:
import numpy as np
import pandas as pd

train = pd.read_csv('/kaggle/input/{}/train.csv'.format(config['data_name']))
test = pd.read_csv('/kaggle/input/{}/test.csv'.format(config['data_name']))

### > Concatenate the 'train' and 'test' data for preprocessing

The same preprocessing should be applied to both the train-set and the test-set.

To process them at once, drop the response variable 'Response' (and 'Id') from 'train' and concatenate the result with 'test'.

In [None]:
all_features = pd.concat((train.drop(['Id','Response'], axis=1), test.drop(['Id'], axis=1)), axis=0)

## 1-2. Missing Value Treatment

Missing (NA) values in the data must be treated properly before model training.

There are three main treatment methods:
1. Remove the variables which have NA values
2. Remove the rows (observations) which have NA values
3. Impute the NA values with other values

Which of the above methods is chosen is at the analyst's discretion.
It is important to choose the appropriate method for the situation.
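The three treatment methods can be sketched on a toy table as follows. This is only an illustration: the `toy` frame and its columns `a`, `b`, `c` are made up and are not part of the competition data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'a' has one NA, 'b' is mostly NA, 'c' is complete
toy = pd.DataFrame({
    'a': [1.0, 2.0, np.nan, 4.0],
    'b': [np.nan, np.nan, np.nan, 1.0],
    'c': [5, 6, 7, 8],
})

# 1. Remove the variables (columns) that contain NA values
dropped_cols = toy.dropna(axis=1)

# 2. Remove the rows (observations) that contain NA values
dropped_rows = toy.dropna(axis=0)

# 3. Impute the NA values, here with the column-wise median
imputed = toy.fillna(toy.median())

print(dropped_cols.columns.tolist())
print(len(dropped_rows))
print(imputed)
```

Methods 1 and 2 discard information, so they are usually reserved for columns or rows that are mostly empty, while imputation keeps every observation at the cost of introducing estimated values.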

### > Check missing values in each variable


In [None]:
import missingno as msno
msno.bar(all_features)

### > Remove variables with a high proportion of NaN values (NaN proportion > 30%)

In [None]:
condition_col_idx = ((all_features.isnull().sum() / all_features.shape[0]) > 0.3).values

print(all_features.iloc[:,condition_col_idx].isnull().sum()/ all_features.shape[0])

all_features = all_features.iloc[:, ~condition_col_idx]

### > Impute the NA values properly

After this step, 4 variables still have NA values.

They are all numeric (integer, float) variables.

In [None]:
all_features.iloc[:,all_features.isnull().any().values].head()

#### >> Replace NA by 'median value' of the variable grouped by another (relevant) categorical variable


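The figures below show, for each variable with missing values (target), the ANOVA F-statistic against each candidate grouping variable (by_var).

(The ANOVA test is used to determine whether the differences between groups are significant. A larger F-statistic means that the distribution of the target differs more strongly across the groups, which makes that grouping variable a reasonable choice for group-wise imputation.)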


In [None]:
import statsmodels.api as sm
from statsmodels.formula.api import ols

by_var_list = ['Employment_Info_2', 'Employment_Info_3', 'Employment_Info_5']
target_list = ['Employment_Info_1', 'Employment_Info_4', 'Employment_Info_6']

anova_res_list = []
for target in target_list:
  for by_var in by_var_list:
    formula = target + '~' + by_var
    model = ols(formula, data=all_features).fit()
    anova_table = sm.stats.anova_lm(model, typ=2)
    # .iloc[0] selects the F-statistic of by_var by position (first row of the ANOVA table)
    anova_res_list.append([target, by_var, anova_table['F'].iloc[0]])

anova_res = pd.DataFrame(np.vstack((anova_res_list)), columns=['target', 'by_var', 'F-statistic'])    
anova_res['F-statistic'] = anova_res['F-statistic'].astype(np.float32)

import plotly.express as px
fig = px.bar(anova_res, x='target', y='F-statistic', color='by_var', width=800, height=400, barmode='group')
fig.show()

In [None]:
import statsmodels.api as sm
from statsmodels.formula.api import ols

by_var_list = all_features.iloc[:,all_features.columns.str.startswith('Medical')].select_dtypes(exclude=np.float64).columns
target_list = ['Medical_History_1']

anova_res_list = []
for target in target_list:
  for by_var in by_var_list:
    formula = target + '~' + by_var
    model = ols(formula, data=all_features).fit()
    anova_table = sm.stats.anova_lm(model, typ=2)
    # .iloc[0] selects the F-statistic of by_var by position (first row of the ANOVA table)
    anova_res_list.append([target, by_var, anova_table['F'].iloc[0]])

anova_res = pd.DataFrame(np.vstack((anova_res_list)), columns=['target', 'by_var', 'F-statistic'])    
anova_res['F-statistic'] = anova_res['F-statistic'].astype(np.float32)

import plotly.express as px
fig = px.bar(anova_res, x='target', y='F-statistic', color='by_var', width=800, height=400, barmode='group')
fig.show()

In [None]:
method = 'median'

# Fill the NA values of each target variable with the median of that same variable
# within the group defined by the relevant categorical variable
all_features.loc[:,'Employment_Info_1'] = all_features.loc[:,'Employment_Info_1'].fillna(all_features.groupby('Employment_Info_3')['Employment_Info_1'].transform(method))
all_features.loc[:,'Employment_Info_4'] = all_features.loc[:,'Employment_Info_4'].fillna(all_features.groupby('Employment_Info_5')['Employment_Info_4'].transform(method))
all_features.loc[:,'Employment_Info_6'] = all_features.loc[:,'Employment_Info_6'].fillna(all_features.groupby('Employment_Info_3')['Employment_Info_6'].transform(method))

all_features.loc[:,'Medical_History_1'] = all_features.loc[:,'Medical_History_1'].fillna(all_features.groupby('Medical_History_23')['Medical_History_1'].transform(method))
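
As a toy illustration of group-wise median imputation, the sketch below fills each NA with the median of its own group. The `toy` frame and the column names `group` and `value` are hypothetical and only stand in for a categorical variable and a numeric variable with missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'value' has one NA in each group
toy = pd.DataFrame({
    'group': ['a', 'a', 'a', 'b', 'b', 'b'],
    'value': [1.0, 3.0, np.nan, 10.0, np.nan, 30.0],
})

# transform('median') returns, for every row, the median of that row's group
group_median = toy.groupby('group')['value'].transform('median')

# fillna only touches the NA positions; observed values are kept as they are
toy['value_imputed'] = toy['value'].fillna(group_median)
print(toy)
```

The key point is that `fillna` is called on the variable being imputed, while the grouping variable only determines which median is used.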

In [None]:
assert not all_features.isnull().sum().any()

## 1-3. Categorical variable consideration

### > 'Medical_Keyword_[1-48]' variables

'Medical_Keyword_[1-48]' variables have only {0, 1} values

In [None]:
var_idx = all_features.columns.str.startswith('Medical_Keyword')
np.unique(all_features.iloc[:,var_idx].values)

However, the row-wise sum of these variables (axis=1) takes values greater than 1.

So these variables are a set of independent binary indicators,
NOT the one-hot encoding of 'ONE' categorical variable.

Thus we change their data type from 'int64' to 'uint8' to prevent additional one-hot encoding.

In [None]:
all_features.iloc[:,var_idx].sum(axis=1).value_counts()

In [None]:
var_idx = all_features.columns.str.startswith('Medical_Keyword')
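# astype with a column-to-dtype mapping changes the dtype reliably
# (assigning through .iloc may keep the original int64 dtype in recent pandas versions)
all_features = all_features.astype({col: np.uint8 for col in all_features.columns[var_idx]})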

## 1-4. Dummify categorical variables

For linear models without regularization, the first or last dummy column should be dropped (to prevent linear dependency). Here, however, no column is dropped, for the convenience of using the factorization model.
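
The difference between the two encodings can be seen on a small example. The `color` column below is hypothetical and used only for illustration.

```python
import pandas as pd

# Hypothetical categorical variable with three levels
toy = pd.DataFrame({'color': ['red', 'green', 'blue', 'green']})

# Keep every level: the dummy columns of one variable always sum to 1 per row,
# which is the linear dependency mentioned above
full = pd.get_dummies(toy, drop_first=False)

# Drop the first level: the dependency disappears, the dropped level becomes the baseline
reduced = pd.get_dummies(toy, drop_first=True)

print(full.columns.tolist())
print(reduced.columns.tolist())
```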

In [None]:
data_set = pd.get_dummies(all_features, drop_first=False)

## 1-5. Scaling continuous variables

Min-max scaling maps every variable to the range [0, 1], so that only relative information, not the absolute magnitude of the values, is considered.

Besides, scaling is known to often make parameter optimization more stable when training a model.
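
For reference, min-max scaling computes $(x - x_{min}) / (x_{max} - x_{min})$ for each column. The sketch below checks this against `MinMaxScaler` on a made-up array; the `toy` values are purely illustrative.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical data: two columns on very different scales
toy = np.array([[1.0, 100.0],
                [2.0, 300.0],
                [4.0, 200.0]])

# Column-wise min-max formula written out by hand
manual = (toy - toy.min(axis=0)) / (toy.max(axis=0) - toy.min(axis=0))

# The same transformation with scikit-learn
scaled = MinMaxScaler().fit_transform(toy)

print(np.allclose(manual, scaled))
```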

In [None]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
data_set = scaler.fit_transform(data_set)

## 1-6. Split Train & Test set

In [None]:
n_train = train.shape[0]
X_train = data_set[:n_train].astype(np.float32)
X_test = data_set[n_train:].astype(np.float32)
y_train = train['Response'].values.astype(np.int64)

## 1-7. Outlier Detection on Training data (*optional*)

Detect and remove outlier observations in the train-set.

- Methodology: [Isolation Forest](https://ieeexplore.ieee.org/abstract/document/4781136/?casa_token=V7U3M1UIykoAAAAA:kww9pojtMeJtXaBcNmw0eVlJaXEGGICi1ogmeHUFMpgJ2h_XCbSd2yBU5mRgd7zEJrXZ01z2)
  - How it works
    - Isolation Forest builds a decision tree that repeatedly splits the given data on a 'random criterion' until every terminal node contains only one observation (this is defined as 'isolation').
    - 'Normality' is defined from the number of splits needed to isolate an observation. A smaller value means a higher degree of outlierness.
    - This tree is built several times, and the average of the measured 'normality' values is taken as the final 'normality' value.
  - Assumptions
    - Outliers require relatively few splits to be isolated.
    - Normal observations require relatively many splits to be isolated.
  - Outlier determination
    - Whether an observation is an outlier is decided from its measured 'normality' value.
      - sklearn's IsolationForest uses '0' as the cut-off: observations with a negative `decision_function` value are flagged as outliers.
      - I personally think it is better to set the cut-off by considering the 'distribution' of the 'normality' values.
      - The details of this method are given below.

In [None]:
from sklearn.ensemble import IsolationForest
clf = IsolationForest(
    n_estimators=100,
    max_samples='auto',
    n_jobs=-1,
    random_state=config['random_state'])

clf.fit(X_train)
normality_df = pd.DataFrame(clf.decision_function(X_train), columns=['normality'])

- The discriminant value
  - The discriminant value (threshold) is defined from the 1st quartile ($q_1$) and 3rd quartile ($q_3$) of the distribution of the measured normality values, with $k=1.5$:

$$threshold = q_1 - k*(q_3 - q_1)$$


- Motivation
  - This discriminant method is adapted from Tukey's boxplot idea. For the distribution of any continuous variable, Tukey designates observations smaller than $q_1 - k*(q_3 - q_1)$ or larger than $q_3 + k*(q_3 - q_1)$ as outliers.

- How we do it
  - We do not apply this rule to a specific variable; we apply it to the obtained normality values.

  - That is, we assume that an outlier lies far to the left of the other observations in the measured normality distribution, so only the lower threshold is used.

In [None]:
def outlier_threshold(normality, k=1.5):
  q1 = np.quantile(normality, 0.25)
  q3 = np.quantile(normality, 0.75)  
  threshold = q1 - k*(q3-q1)
  return threshold

threshold = outlier_threshold(normality_df['normality'].values, k=1.5)

import plotly.express as px
fig = px.histogram(normality_df, x='normality', width=400, height=400)
fig.add_vline(x=threshold, line_width=3, line_dash="dash", line_color="red")
fig.show()

In [None]:
import plotly.express as px
px.box(normality_df, x='normality', orientation='h', width=400, height=400)

In [None]:
X_train = X_train[normality_df['normality'].values>=threshold]
y_train = y_train[normality_df['normality'].values>=threshold]

print('{} out of {} observations are removed from train_set'.format(train.shape[0] - X_train.shape[0], train.shape[0]))

## 1-8. Output variable transformation

In [None]:
print('Before transformation:', np.unique(y_train))

y_train_trans = y_train - 1

print('After transformation:', np.unique(y_train_trans))


# 2. Model Selection


## > Modeling Strategy
Our goal is to build a model that predicts the life insurance risk class of an individual given some information. The formula can be expressed as:

$\hat{y} = \underset{k \in \{1,\cdots,K\}}{\operatorname{argmax}}f_{k}(x)$

where,
  - $y \in \{1,\cdots,K\} $: labels 
    - $ 1 < \cdots < K$
  - $x$: an input observation
  - $f_{k}(x)$: a function of $x$ that outputs predicted value for each $k$

This can be formulated as an "**ordinal** classification" problem, whose output variable has an ordinal characteristic.


- Problem 
  - Standard classification models cannot take the ordinal relationship of the output variable into account

- Solution 
  - According to [A Simple Approach to Ordinal Classification](https://www.cs.waikato.ac.nz/~eibe/pubs/ordinal_tech_report.pdf), a simple transformation lets us solve the ordinal classification problem with standard (binary) classification methods.
  > We can take advantage of the ordered class value by transforming a k-class ordinal regression problem to a k-1 binary classification problem, we convert an ordinal attribute A* with ordinal value V1, V2, V3, … Vk into k-1 binary attributes, one for each of the original attribute’s first k − 1 values. The ith binary attribute represents the test A* > Vi

  - How it works
    1. Convert the ordinal output variable $Y$ into $K-1$ binary attributes ($Y_{1}, \cdots, Y_{K-1}$). The $i$-th binary attribute $Y_{i} \in \{0, 1\}$ represents $Y > i$.
    2. Train one binary classifier per attribute to estimate $Pr(Y_{1} = 1), \cdots, Pr(Y_{K-1}=1)$, which are the same as $Pr(Y > 1), \cdots, Pr(Y > K-1)$.
    3. Get the class probabilities $Pr(Y=k)$ from the estimated $Pr(Y > 1), \cdots, Pr(Y > K-1)$, i.e.,
      - $Pr(Y=1) = 1 - Pr(Y>1)$
      - $Pr(Y=i) = Pr(Y>i-1) - Pr(Y>i)$ for $1 < i < K$
      - $Pr(Y=K) = Pr(Y>K-1)$

The figure below shows the overall procedure when $Y \in \{1, 2, 3, 4\}$.

<center><img src='https://drive.google.com/uc?export=view&id=1ImlHhVUuBXAfHwEBfg0RS6NXXkDNJk3H' width = 1000></center>
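
The three steps can be sketched numerically for $K=4$. The helper `cumulative_to_class_proba` and the probability values below are hypothetical and only illustrate the arithmetic; they are not the classifier used later in this notebook.

```python
import numpy as np

K = 4
y = np.array([1, 2, 3, 4])

# Step 1: K-1 binary targets, column i-1 encodes the test Y > i
binary_targets = np.stack([(y > i).astype(int) for i in range(1, K)], axis=1)

def cumulative_to_class_proba(p_greater):
    """Turn Pr(Y > 1), ..., Pr(Y > K-1) (one row per observation) into Pr(Y = 1), ..., Pr(Y = K)."""
    p_greater = np.asarray(p_greater, dtype=float)
    n = p_greater.shape[0]
    upper = np.hstack([np.ones((n, 1)), p_greater])   # prepend Pr(Y > 0) = 1
    lower = np.hstack([p_greater, np.zeros((n, 1))])  # append Pr(Y > K) = 0
    return upper - lower

# Step 2 would estimate these with K-1 binary classifiers; here they are made up
p_greater = np.array([[0.9, 0.6, 0.2]])

# Step 3: class probabilities as differences of neighbouring cumulative probabilities
class_proba = cumulative_to_class_proba(p_greater)
print(binary_targets)
print(class_proba)
```

Because the $K-1$ binary models are trained independently, the estimated $Pr(Y>i)$ are not guaranteed to decrease in $i$, so some differences can come out slightly negative in practice. Taking the argmax over classes still yields a prediction in that case.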


## > Model Selection method

To estimate the probabilities for each binary classification problem, we use the following models.
- Logistic regression
- Random forest
- XGBoost
- Multi-layer perceptron

However, we have to "choose" one final methodology to make predictions on the test set.
To do this, a "fair evaluation" of the models is essential. A "fair evaluation" must satisfy the following two conditions.

1. Select optimal hyperparameters for each model
  - If no hyperparameter search is performed, differences in model performance may simply reflect poorly chosen hyperparameter values.
2. Apply the same evaluation method to every model
  - If the models are not evaluated in the same way, they cannot be compared at all.

The final model can be selected only when the models are compared through an evaluation method that satisfies both conditions.




### > Define a scoring function for hyper parameter tuning


In [None]:
from sklearn.metrics import make_scorer
from sklearn.metrics import cohen_kappa_score

def weighted_kappa(y_true, y_pred):
  try:
    score = cohen_kappa_score(y_true, y_pred, weights='quadratic')
  except Exception:  # return NaN instead of failing the whole tuning run
    score = np.nan
  return score

target_metric = make_scorer(weighted_kappa, greater_is_better=True)
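
To see why the quadratic weighting suits an ordinal target, the sketch below compares two made-up predictions that each contain exactly one error: one misses by a single class, the other by seven classes. The label vectors are hypothetical.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Hypothetical labels for the 8 ordered classes (0..7)
y_true = np.array([0, 1, 2, 3, 4, 5, 6, 7])
y_near = np.array([1, 1, 2, 3, 4, 5, 6, 7])  # one error, off by 1 class
y_far = np.array([7, 1, 2, 3, 4, 5, 6, 7])   # one error, off by 7 classes

kappa_perfect = cohen_kappa_score(y_true, y_true, weights='quadratic')
kappa_near = cohen_kappa_score(y_true, y_near, weights='quadratic')
kappa_far = cohen_kappa_score(y_true, y_far, weights='quadratic')

print(kappa_perfect, kappa_near, kappa_far)
```

Both predictions have the same accuracy, but the quadratic weighted kappa penalizes the distant error much more heavily, which accuracy alone cannot express.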

## 2-1. Define an Ordinal Classifier class

This class enables the ordinal classification formulation by using standard binary classification models.


In [None]:
from sklearn.base import BaseEstimator
from sklearn.base import clone
from sklearn.metrics import accuracy_score
class OrdinalClassifier(BaseEstimator):

    def __init__(self, clf):
        self.clf = clf
        self.clfs = {}

    def fit(self, X, y):
        self.unique_class = np.sort(np.unique(y))
        if self.unique_class.shape[0] > 2:
            for i in range(self.unique_class.shape[0]-1):
                # for each k - 1 ordinal value we fit a binary classification problem
                binary_y = (y > self.unique_class[i]).astype(np.uint8)
                clf = clone(self.clf)
                if hasattr(clf, 'module'):
                    # skorch networks (MLP): BCEWithLogitsLoss needs float targets of shape (n, 1)
                    binary_y_reshape = binary_y.astype('float32').reshape(-1,1)
                    clf.fit(X, binary_y_reshape)
                else:
                    # scikit-learn style estimators
                    clf.fit(X, binary_y)
                self.clfs[i] = clf

    def predict_proba(self, X):
        clfs_predict = {k: self.clfs[k].predict_proba(X) for k in self.clfs}
        predicted = []
        for i, y in enumerate(self.unique_class):
            if i == 0:
                # V1 = 1 - Pr(y > V1)
                predicted.append(1 - clfs_predict[i][:,1])
            elif i in clfs_predict:
                # Vi = Pr(y > Vi-1) - Pr(y > Vi)
                predicted.append(clfs_predict[i-1][:,1] - clfs_predict[i][:,1])
            else:
                # Vk = Pr(y > Vk-1)
                predicted.append(clfs_predict[i-1][:,1])
        if hasattr(self.clf, 'module'):  # skorch networks (MLP)
            pred_proba = np.hstack((predicted))
        else:  # scikit-learn style estimators
            pred_proba = np.vstack(predicted).T
        
        return pred_proba

    def predict(self, X):
        return np.argmax(self.predict_proba(X), axis=1)

    def score(self, X, y, sample_weight=None):
        _, indexed_y = np.unique(y, return_inverse=True)
        return accuracy_score(indexed_y, self.predict(X), sample_weight=sample_weight)

## 2-2. Hyper parameter tuning by using Tune_SKlearn (Ray Tune)

- Package: tune_sklearn
  - This package makes it easy to apply [Ray Tune](https://docs.ray.io/en/latest/tune/index.html) to sklearn models.
  - Ray Tune is a python package that provides various hyperparameter tuning algorithms (HyperOpt, BayesianOptimization, ...).
- Tuning procedure
  - Define an appropriate search space for each model's hyperparameters.
  - 5-fold CV (Cross Validation) is performed for each specific hyper-parameter value combination of the search space by using the hyper-parameter tuning algorithm (HyperOpt)
    - Training: Training by using Scikit-Learn and Skorch packages
    - Validation: Evaluate the model using an appropriate evaluation metric
  - The hyperparameter combination with the highest average CV score is designated as the optimal hyperparameters of the model.
    - Save this CV result and use it for model comparison
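
The tuning procedure can be sketched in miniature as follows. Note the substitutions: scikit-learn's `GridSearchCV` stands in for `TuneSearchCV` with HyperOpt, a plain `LogisticRegression` stands in for the ordinal classifier, and the data are synthetic. The logic (5-fold CV per candidate, pick the best mean score, keep its per-fold scores) is the same.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import cohen_kappa_score, make_scorer
from sklearn.model_selection import GridSearchCV

# Synthetic 3-class data, for illustration only
X_toy, y_toy = make_classification(n_samples=200, n_features=8, n_informative=4,
                                   n_classes=3, random_state=0)

# Quadratic weighted kappa as the evaluation metric
qwk = make_scorer(cohen_kappa_score, weights='quadratic')

# Search space and 5-fold CV for every candidate in it
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={'C': [0.01, 0.1, 1.0]},
    scoring=qwk,
    cv=5,
)
search.fit(X_toy, y_toy)

# Per-fold scores of the best candidate, kept for the later model comparison
best_idx = search.best_index_
fold_scores = [search.cv_results_[f'split{i}_test_score'][best_idx] for i in range(5)]
print(search.best_params_, fold_scores)
```

Keeping the five per-fold scores, rather than only their mean, is what allows the box-plot comparison between models later on.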



### > Make a dataframe for containing CV results

In [None]:
model_list = []
for name in ['linear', 'rf', 'xgb', 'mlp']:
  model_list.append(np.full(5, name))
  
best_cv_df = pd.DataFrame({'model': np.hstack((model_list)), 'accuracy':None, 'kappa':None, 'best_hyper_param': None})

### Logistic regression

In [None]:
from tune_sklearn import TuneSearchCV
from sklearn.linear_model import SGDClassifier

# Define a search space
parameters = {
    'clf__alpha': list(np.geomspace(1e-5, 1e-2, 4)),
    'clf__max_iter': [1000],
    'clf__tol': [1e-4, 1e-3, 1e-2],
    'clf__loss': ['log'],  # logistic loss; this option is named 'log_loss' in scikit-learn >= 1.1
    'clf__penalty': ['l2'],
    'clf__random_state': [config['random_state']],
}

# Define an ordinal classifier
clf = SGDClassifier()
ordinal_clf = OrdinalClassifier(clf)

# Specify the hyper parameter tuning algorithm
tune_search = TuneSearchCV(
    ordinal_clf,
    parameters,
    search_optimization='hyperopt',
    n_trials=6,
    n_jobs=-1,
    scoring={'accuracy':'accuracy', 'kappa':target_metric},
    cv=5,
    refit='kappa',
    verbose=1,
    random_state=config['random_state']
    )

# Run hyper parameter tuning
X = X_train
y = y_train_trans
tune_search.fit(X, y)

# Save the tuning results 
model_name = 'linear'

## Save the optimal hyperparameter values
best_cv_df.loc[best_cv_df['model']==model_name, 'best_hyper_param'] = str(tune_search.best_params_)

## Save the CV results
cv_df = pd.DataFrame(tune_search.cv_results_)
cv_values = cv_df.loc[tune_search.best_index_, cv_df.columns.str.startswith('split')].values
best_cv_df.loc[best_cv_df['model']==model_name, 'accuracy'] = cv_values[:5]
best_cv_df.loc[best_cv_df['model']==model_name, 'kappa'] = cv_values[5:]

# Visualize the tuning results with parallel coordinate plot
tune_result_df = pd.concat([pd.DataFrame(tune_search.cv_results_['params']), cv_df.loc[:,cv_df.columns.str.startswith('mean')] ], axis=1)
import plotly.express as px
fig = px.parallel_coordinates(tune_result_df, color='mean_test_kappa')
fig.show()

### Random forest

In [None]:
from tune_sklearn import TuneSearchCV
from sklearn.ensemble import RandomForestClassifier

# Define a search space
parameters = {
    'clf__n_estimators': [50, 100],
    'clf__criterion': ['gini', 'entropy'],
    'clf__max_depth': [20, 25, 30],
    'clf__max_features': ['auto'],  # same as 'sqrt' for classifiers; 'auto' was removed in scikit-learn 1.3
    'clf__random_state': [config['random_state']]
}

# Define an ordinal classifier
clf = RandomForestClassifier()
ordinal_clf = OrdinalClassifier(clf)

# Specify the hyper parameter tuning algorithm
tune_search = TuneSearchCV(
    ordinal_clf,
    parameters,
    search_optimization='hyperopt',
    n_trials=4,
    n_jobs=-1,
    scoring={'accuracy':'accuracy', 'kappa':target_metric},
    cv=5,
    refit='kappa',
    verbose=1,
    random_state=config['random_state']
    )

# Run hyper parameter tuning
X = X_train
y = y_train_trans
tune_search.fit(X, y)

# Save the tuning results 
model_name = 'rf'

## Save the optimal hyperparameter values
best_cv_df.loc[best_cv_df['model']==model_name, 'best_hyper_param'] = str(tune_search.best_params_)

## Save the CV results
cv_df = pd.DataFrame(tune_search.cv_results_)
cv_values = cv_df.loc[tune_search.best_index_, cv_df.columns.str.startswith('split')].values
best_cv_df.loc[best_cv_df['model']==model_name, 'accuracy'] = cv_values[:5]
best_cv_df.loc[best_cv_df['model']==model_name, 'kappa'] = cv_values[5:]

# Visualize the tuning results with parallel coordinate plot
tune_result_df = pd.concat([pd.get_dummies(pd.DataFrame(tune_search.cv_results_['params']), drop_first=False), 
                            cv_df.loc[:,cv_df.columns.str.startswith('mean')] ], axis=1)
tune_result_df = tune_result_df.astype({'clf__criterion_entropy':'int64', 'clf__criterion_gini':'int64'})
import plotly.express as px
px.parallel_coordinates(tune_result_df, color='mean_test_kappa')

### XGBoost

In [None]:
from tune_sklearn import TuneSearchCV
from xgboost import XGBClassifier

# Define a search space
parameters = {
    'clf__n_estimators': [10, 50],
    'clf__learning_rate': list(np.geomspace(1e-2, 1, 3)),
    'clf__min_child_weight': [5, 10, 15],
    'clf__gamma': [0.5, 2],
    'clf__subsample': [0.6, 1.0],
    'clf__colsample_bytree': [0.6, 1.0],
    'clf__max_depth': [5, 10, 15],
    'clf__lambda': [1],
    'clf__objective': ['binary:logistic'],
    'clf__random_state': [config['random_state']]
}

# Define an ordinal classifier
clf = XGBClassifier()
ordinal_clf = OrdinalClassifier(clf)

# Specify the hyper parameter tuning algorithm
tune_search = TuneSearchCV(
    ordinal_clf,
    parameters,
    search_optimization='hyperopt',
    n_trials=4,
    n_jobs=-1,
    scoring={'accuracy':'accuracy', 'kappa':target_metric},
    cv=5,
    refit='kappa',
    verbose=1,
    random_state=config['random_state']
    )

# Run hyper parameter tuning
X = X_train
y = y_train_trans
tune_search.fit(X, y)

# Save the tuning results 
model_name = 'xgb'

## Save the optimal hyperparameter values
best_cv_df.loc[best_cv_df['model']==model_name, 'best_hyper_param'] = str(tune_search.best_params_)

## Save the CV results
cv_df = pd.DataFrame(tune_search.cv_results_)
cv_values = cv_df.loc[tune_search.best_index_, cv_df.columns.str.startswith('split')].values
best_cv_df.loc[best_cv_df['model']==model_name, 'accuracy'] = cv_values[:5]
best_cv_df.loc[best_cv_df['model']==model_name, 'kappa'] = cv_values[5:]

# Visualize the tuning results with parallel coordinate plot
tune_result_df = pd.concat([pd.DataFrame(tune_search.cv_results_['params']), cv_df.loc[:,cv_df.columns.str.startswith('mean')] ], axis=1)
import plotly.express as px
fig = px.parallel_coordinates(tune_result_df, color='mean_test_kappa')
fig.show()

### Multi-layer perceptron

In [None]:
import torch
from torch import nn
from skorch import NeuralNetClassifier
from skorch.callbacks import EarlyStopping
from skorch.callbacks import EpochScoring
from skorch.callbacks import Checkpoint
from tune_sklearn import TuneSearchCV

# Define a model structure
class MLP(nn.Module):
    def __init__(self, num_inputs=X_train.shape[1], num_outputs=1, layer1=512, layer2=256, dropout1=0, dropout2=0):
        super(MLP, self).__init__()

        self.linear_relu_stack = nn.Sequential(
            nn.Linear(num_inputs, layer1),
            nn.LeakyReLU(),
            nn.Dropout(dropout1),
            nn.Linear(layer1, layer2),
            nn.LeakyReLU(),
            nn.Dropout(dropout2),
            nn.Linear(layer2, num_outputs)
            )
    def forward(self, x):
        x = self.linear_relu_stack(x)
        return x  

def try_gpu(i=0): 
    return f'cuda:{i}' if torch.cuda.device_count() >= i + 1 else 'cpu'

# Set model configurations
mlp = NeuralNetClassifier(
    MLP(num_inputs=X_train.shape[1], num_outputs=1),
    criterion=nn.BCEWithLogitsLoss(),
    optimizer=torch.optim.Adam,
    device=try_gpu(),
    verbose=0,
    callbacks=[EarlyStopping(monitor='valid_loss', patience=5,
                             threshold=1e-3, lower_is_better=True)]  # a lower validation loss is better
                          )
# Define a search space
parameters = {
    'clf__lr': list(np.geomspace(1e-4, 1e-1, 4)),
    'clf__module__layer1': [128, 256, 512],
    'clf__module__layer2': [128, 256, 512],
    'clf__module__dropout1': [0, 0.1],
    'clf__module__dropout2': [0, 0.1],
    'clf__batch_size': [128, 256],
    'clf__optimizer__weight_decay': list(np.geomspace(1e-5, 1e-1, 5)),
    'clf__max_epochs': [2000],
    'clf__iterator_train__shuffle': [True],
    'clf__callbacks__EarlyStopping__threshold': [1e-4, 1e-3]
    }

def use_gpu(device):
    return device != 'cpu'

# Define an ordinal classifier
clf = mlp
ordinal_clf = OrdinalClassifier(clf)

# Specify the hyper parameter tuning algorithm
tune_search = TuneSearchCV(
    ordinal_clf,
    parameters,
    search_optimization='hyperopt',
    n_trials=10,
    n_jobs=-1,
    scoring={'accuracy':'accuracy', 'kappa':target_metric},
    cv=5,
    refit='kappa',
    verbose=1,
    random_state=config['random_state']
    )

# Run hyper parameter tuning
X = X_train
y = y_train_trans
tune_search.fit(X, y)

# Save the tuning results 
model_name = 'mlp'

## Save the optimal hyperparameter values
best_cv_df.loc[best_cv_df['model']==model_name, 'best_hyper_param'] = str(tune_search.best_params_)

## Save the CV results
cv_df = pd.DataFrame(tune_search.cv_results_)
cv_values = cv_df.loc[tune_search.best_index_, cv_df.columns.str.startswith('split')].values
best_cv_df.loc[best_cv_df['model']==model_name, 'accuracy'] = cv_values[:5]
best_cv_df.loc[best_cv_df['model']==model_name, 'kappa'] = cv_values[5:]

# Visualize the tuning results with parallel coordinate plot
tune_result_df = pd.concat([pd.DataFrame(tune_search.cv_results_['params']), cv_df.loc[:,cv_df.columns.str.startswith('mean')] ], axis=1)
tune_result_df.rename({
    'clf__callbacks__EarlyStopping__threshold':'EarlyStopping_threshold',
    'clf__optimizer__weight_decay': 'weight_decay'
    }, axis=1, inplace=True)
import plotly.express as px
px.parallel_coordinates(tune_result_df, color='mean_test_kappa')

### > Save CV results

## 2-3. Model Comparison based on CV results

Compare the CV results (measured using the optimal hyperparameter values).

The figure below shows the following ranking:

xgb > linear > mlp > rf




In [None]:
fig = px.box(best_cv_df, x='model', y='kappa', color='model', width=800)
fig.show()

## 2-4. Model Combination

Although a final model could be selected based on the above results, combining the predicted values from multiple models has been observed to improve prediction performance in many cases. ([Can multi-model combination really enhance the prediction skill of probabilistic ensemble forecasts?](https://rmets.onlinelibrary.wiley.com/doi/abs/10.1002/qj.210?casa_token=OwyF2RbEywAAAAAA:gahpwGRdOWzLXyafYQQt_voHOF8MedTBLd1SBv4vkdT3ZTLVoKZQj3zl-KbrhSkX5x8CndeCxwBoL_-S))

For classification problems, the final probabilities are derived by combining the predicted 'probabilities' for each class in a 'proper way'.

This notebook uses the following two model combination methods.

1. Simple Average
2. Stacked Generalization (Stacking)


The combined models must be compared with the single models (e.g., rf, xgb, ...) on equal terms,
so their performance is measured by applying the same CV method as above.

Based on the CV results, we select (linear, xgb, mlp) as the base estimators for model combination.

### > Simple Average

The simple average method derives the final probability value by 'averaging' the probability values that multiple models predict for each class.

The top 3 models (linear, xgb, mlp) of the above CV results are selected as the base estimators whose predictions are combined.

For example,
- Base Estimations
  - $P_{linear}(Y=1|X=x)$ = 0.80
  - $P_{xgb}(Y=1|X=x)$ = 0.80
  - $P_{mlp}(Y=1|X=x)$ = 0.85
- Final Estimation
  - $P_{average}(Y=1|X=x)$ = 0.817 (= (0.80 + 0.80 + 0.85) / 3)


In [None]:
from sklearn.model_selection import KFold
from tqdm import notebook
from sklearn.metrics import accuracy_score
from sklearn.metrics import log_loss
from sklearn.metrics import roc_auc_score

def CV_ensemble(ensemble_name, ensemble_func, estimators, X_train, y_train, n_folds=5, shuffle=True, random_state=2022):
  # Use the n_folds / shuffle arguments instead of hard-coded values
  # (KFold only accepts a random_state when shuffle=True)
  kf = KFold(n_splits=n_folds, random_state=random_state if shuffle else None, shuffle=shuffle)

  res_list = []
  for train_idx, valid_idx in notebook.tqdm(kf.split(X_train), total=kf.get_n_splits(), desc='Eval_CV'):
    X_train_train, X_valid = X_train[train_idx], X_train[valid_idx]
    y_train_train, y_valid = y_train[train_idx], y_train[valid_idx]

    ensemble_pred_proba = ensemble_func(estimators, X_train_train, y_train_train, X_valid)
    accuracy = accuracy_score(y_valid, ensemble_pred_proba.argmax(axis=1))
    kappa = weighted_kappa(y_valid, ensemble_pred_proba.argmax(axis=1))

    res_list.append([ensemble_name, accuracy, kappa])
  # Build the frame from the list directly so that accuracy and kappa stay numeric
  res_df = pd.DataFrame(res_list, columns=['model', 'accuracy', 'kappa'])
  return res_df

def ensemble_average(estimators, X_train, y_train, X_test):
  # Fit each base estimator and collect its predicted class probabilities on X_test
  preds = []
  for estimator in estimators:
    estimator.fit(X_train, y_train)
    preds.append(estimator.predict_proba(X_test))

  # Average the (n_samples, n_classes) probability matrices over the estimators
  pred_fin = np.mean(np.stack(preds, axis=0), axis=0)
  return pred_fin

In [None]:
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

linear_ordinal = OrdinalClassifier(SGDClassifier()).set_params(**eval(best_cv_df.loc[best_cv_df['model']=='linear', 'best_hyper_param'].values[0]))
rf_ordinal = OrdinalClassifier(RandomForestClassifier()).set_params(**eval(best_cv_df.loc[best_cv_df['model']=='rf', 'best_hyper_param'].values[0]))
xgb_ordinal = OrdinalClassifier(XGBClassifier()).set_params(**eval(best_cv_df.loc[best_cv_df['model']=='xgb', 'best_hyper_param'].values[0]))
mlp_ordinal = OrdinalClassifier(mlp).set_params(**eval(best_cv_df.loc[best_cv_df['model']=='mlp', 'best_hyper_param'].values[0]))

estimators = [linear_ordinal, xgb_ordinal, mlp_ordinal]
estimators_name = 'linear_xgb_mlp'
ensemble_name = 'average' + '_' + estimators_name

X = X_train
y = y_train_trans

res_df = CV_ensemble(ensemble_name, ensemble_average, estimators, X, y, n_folds=5, shuffle=True, random_state=config['random_state'])
best_cv_df = pd.concat([best_cv_df, res_df]).reset_index(drop=True)  # DataFrame.append was removed in pandas 2.0

In [None]:
fig = px.box(best_cv_df, x='model', y='kappa', color='model', width=800)
fig.show()

### > Stacked generalization (Stacking)

In [Stacked generalization](https://www.jair.org/index.php/jair/article/view/10228), the predicted probabilities of the base estimators are treated as the 'input data', and y (Response) of each row is treated as the 'output variable'. 
The 'Meta Learner' is trained on these data, and its predicted probabilities are used as the final predicted probabilities.

- The 'Meta Learner' can be any classification model. However, this notebook uses a ridge model (logistic regression with ridge penalty) to prevent overfitting.

- The input data for the 'Meta Learner' are the probabilities that the base estimators predict on the validation folds during CV.

- The trained 'Meta Learner' produces the final predicted probabilities for the test-set, using the base estimators' predicted probabilities on the test-set as input data.

The total process, in order, is as follows:
1. (Base estimators) Run CV on Train-set
2. (Meta Learner) Train on CV predictions (predicted probabilities on validation data of CV) with corresponding y values
3. (Base estimators) Train on Train-set
4. (Base estimators) Predict on Test-set
5. (Meta Learner) Predict on the base estimators' Test-set predictions

<img align='top' src='https://drive.google.com/uc?export=view&id=1uDxSIIFt8rUJkuIwRYU4lALvOPqlXPG5' width='600' height='400'>
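
The five steps above can be sketched on toy data as follows. This is only a minimal illustration, not the notebook's own `stack_clf`: the synthetic dataset, the two base estimators, and the use of scikit-learn's `cross_val_predict` to collect the out-of-fold probabilities are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_predict, train_test_split

# Toy 3-class data (assumption: stands in for the competition data)
X_toy, y_toy = make_classification(n_samples=300, n_features=10, n_informative=5,
                                   n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.25,
                                          random_state=0, stratify=y_toy)

# Hypothetical base estimators chosen only for the illustration
base_estimators = [
    RandomForestClassifier(n_estimators=50, random_state=0),
    LogisticRegression(max_iter=1000),
]
kf = KFold(n_splits=5, shuffle=True, random_state=0)

# 1. (Base estimators) Run CV on the train-set and keep the out-of-fold probabilities
cv_preds = np.hstack([
    cross_val_predict(est, X_tr, y_tr, cv=kf, method='predict_proba')
    for est in base_estimators
])

# 2. (Meta learner) Train on the CV predictions with the corresponding y values
meta_learner = LogisticRegression(max_iter=1000)  # default penalty is L2 (ridge)
meta_learner.fit(cv_preds, y_tr)

# 3-4. (Base estimators) Train on the whole train-set, then predict on the test-set
test_preds = np.hstack([
    est.fit(X_tr, y_tr).predict_proba(X_te) for est in base_estimators
])

# 5. (Meta learner) Predict on the base estimators' test-set predictions
stack_proba = meta_learner.predict_proba(test_preds)
print(cv_preds.shape, stack_proba.shape)
```

Because the meta learner only ever sees out-of-fold probabilities, it is not trained on predictions that the base estimators made for rows they had already seen.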


For example,
- Assume that 
  - $Y \in \{0, 1, 2\}$
- Base Estimators
  - rf
    - $P_{rf}(Y=0|X=x)$ = 0.75
    - $P_{rf}(Y=1|X=x)$ = 0.10
    - $P_{rf}(Y=2|X=x)$ = 0.15
  - xgb
    - $P_{xgb}(Y=0|X=x)$ = 0.80
    - $P_{xgb}(Y=1|X=x)$ = 0.10
    - $P_{xgb}(Y=2|X=x)$ = 0.10
- Meta Learner (logistic regression with ridge (l2) penalty)
  - when Y=0:
    - intercept = 0.1
    - coefficient = [0.8, 0.1, -0.1, 0.9, 0.2, -0.05]
  - predicted probabilities
    - $P_{stack}(Y=0|X=x)$ = $\mathrm{sigmoid}(0.1 + 0.8 \cdot 0.75 + 0.1 \cdot 0.10 - 0.1 \cdot 0.15 + 0.9 \cdot 0.80 + 0.2 \cdot 0.10 - 0.05 \cdot 0.10)$ = $\mathrm{sigmoid}(1.43)$ = 0.8069
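
The arithmetic of this example can be checked with a few lines of NumPy. The variable names below are hypothetical, and the sketch follows the one-class sigmoid simplification used in the example; a multinomial logistic regression would normalize the scores of all classes with a softmax instead.

```python
import numpy as np

def sigmoid(z):
    """Logistic function."""
    return 1.0 / (1.0 + np.exp(-z))

# Base-estimator probabilities from the example, ordered [rf(0), rf(1), rf(2), xgb(0), xgb(1), xgb(2)]
base_probs = np.array([0.75, 0.10, 0.15, 0.80, 0.10, 0.10])

# Meta-learner parameters for class Y=0, taken from the example
intercept = 0.1
coef = np.array([0.8, 0.1, -0.1, 0.9, 0.2, -0.05])

linear_score = intercept + coef @ base_probs  # 1.43
p_stack = sigmoid(linear_score)               # about 0.8069
print(linear_score, p_stack)
```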


**Warnings**:

- The columns of the predicted-probability matrix $[P_{rf}(Y=0|X=x), \cdots, P_{xgb}(Y=2|X=x)]$ are **linearly dependent**, because the class probabilities of each base estimator sum to 1.
- Thus, a penalized linear model or a non-linear model is recommended as the final estimator.
- If you want to use a plain linear model with no penalty, remove the first or last class probability of each base estimator (e.g., remove $P_{rf}(Y=2|X=x)$ and $P_{xgb}(Y=2|X=x)$).
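
The linear dependence can be seen numerically. The sketch below is an illustration under assumptions: it uses random probability matrices (rows summing to 1) in place of real base-estimator outputs, and the helper `random_probs` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_probs(n_rows, n_classes):
    """Random rows that sum to 1, mimicking predict_proba output."""
    raw = rng.random((n_rows, n_classes))
    return raw / raw.sum(axis=1, keepdims=True)

# Two hypothetical base estimators, 3 classes each
p_a = random_probs(50, 3)
p_b = random_probs(50, 3)

# All 6 columns: the columns of p_a and the columns of p_b both sum to 1,
# so one column is a linear combination of the others
stacked = np.hstack([p_a, p_b])
rank_full = np.linalg.matrix_rank(stacked)

# Dropping the last class of each estimator removes the dependence
reduced = np.hstack([p_a[:, :-1], p_b[:, :-1]])
rank_reduced = np.linalg.matrix_rank(reduced)

print(stacked.shape[1], rank_full)     # 6 columns, rank 5
print(reduced.shape[1], rank_reduced)  # 4 columns, rank 4
```

With an L2 penalty the rank deficiency is harmless because the penalized problem still has a unique solution, which is why the ridge model can take all columns as input.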

In [None]:
from sklearn.model_selection import KFold
from tqdm import notebook


def stack_clf(estimators, X_train, y_train, X_test, n_folds=5, shuffle=True, random_state=2022):
  final_estimator = estimators[-1]
  num_estimators = len(estimators)-1

  kf = KFold(n_splits=n_folds, random_state=random_state, shuffle=shuffle)
  preds = []
  y_valid_list = []
  # Get CV predictions
  for train_idx, valid_idx in notebook.tqdm(kf.split(X_train), total=kf.get_n_splits(), desc='Stack_CV'):
    X_train_train, X_valid = X_train[train_idx], X_train[valid_idx]
    y_train_train, y_valid = y_train[train_idx], y_train[valid_idx]
    
    valid_preds = []
    for estimator in estimators[:num_estimators]:
        estimator.fit(X_train_train, y_train_train)
        # Warning: the columns of this matrix are linearly dependent.
        # To get a linearly independent matrix, drop the first column.
        valid_preds.append(estimator.predict_proba(X_valid))
    
    preds.append(np.hstack((valid_preds)))
    y_valid_list.append(y_valid)

  cv_preds = np.vstack((preds))
  cv_y = np.hstack((y_valid_list))

  # Get test predictions
  test_preds =[]
  for estimator in estimators[:num_estimators]:
      estimator.fit(X_train, y_train)
      # Warning: the columns of this matrix are linearly dependent.
      # To get a linearly independent matrix, drop the first column.
      test_preds.append(estimator.predict_proba(X_test))

  test_preds_mat = np.hstack((test_preds))

  # Fit the final estimator on the CV predictions
  # and make a prediction on the test-set predictions
  final_estimator.fit(cv_preds, cv_y)
  print('Training score: {}'.format(target_metric(final_estimator, cv_preds, cv_y)))
  print(' Estimated coefficients: {} \n intercept: {}'.format(final_estimator.coef_, final_estimator.intercept_))
  
  pred_fin = final_estimator.predict_proba(test_preds_mat)
  return pred_fin

In [None]:
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression

# Base estimators
linear_ordinal = OrdinalClassifier(SGDClassifier()).set_params(**eval(best_cv_df.loc[best_cv_df['model']=='linear', 'best_hyper_param'].values[0]))
rf_ordinal = OrdinalClassifier(RandomForestClassifier()).set_params(**eval(best_cv_df.loc[best_cv_df['model']=='rf', 'best_hyper_param'].values[0]))
xgb_ordinal = OrdinalClassifier(XGBClassifier()).set_params(**eval(best_cv_df.loc[best_cv_df['model']=='xgb', 'best_hyper_param'].values[0]))
mlp_ordinal = OrdinalClassifier(mlp).set_params(**eval(best_cv_df.loc[best_cv_df['model']=='mlp', 'best_hyper_param'].values[0]))

estimators = [linear_ordinal, xgb_ordinal, mlp_ordinal]
estimators_name = 'linear_xgb_mlp'

# Final estimator
clf = LogisticRegression(penalty='l2', max_iter=1000, random_state=config['random_state'])

estimators.append(clf)
ensemble_func = stack_clf
ensemble_name = 'stack_ridge' + '_by_' + estimators_name

# Run CV 
X = X_train
y = y_train_trans

res_df = CV_ensemble(ensemble_name, ensemble_func, estimators, X, y, n_folds=5, shuffle=True, random_state=config['random_state'])
best_cv_df = pd.concat([best_cv_df, res_df])  # DataFrame.append was removed in pandas 2.0

## 2-5. Model Comparison based on CV results including model combination methods

From the figure below, 'stack_ridge_by_linear_xgb_mlp' shows the best performance.

In [None]:
fig = px.box(best_cv_df, x='model', y='kappa', color='model', width=800)
fig.show()

In [None]:
best_cv_df.to_csv('best_cv_results.csv', index=False)

# 3. Make a prediction with the best model


In [None]:
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression

# Base estimators
linear_ordinal = OrdinalClassifier(SGDClassifier()).set_params(**eval(best_cv_df.loc[best_cv_df['model']=='linear', 'best_hyper_param'].values[0]))
rf_ordinal = OrdinalClassifier(RandomForestClassifier()).set_params(**eval(best_cv_df.loc[best_cv_df['model']=='rf', 'best_hyper_param'].values[0]))
xgb_ordinal = OrdinalClassifier(XGBClassifier()).set_params(**eval(best_cv_df.loc[best_cv_df['model']=='xgb', 'best_hyper_param'].values[0]))
mlp_ordinal = OrdinalClassifier(mlp).set_params(**eval(best_cv_df.loc[best_cv_df['model']=='mlp', 'best_hyper_param'].values[0]))

estimators = [linear_ordinal, xgb_ordinal, mlp_ordinal]
estimators_name = 'linear_xgb_mlp'

# Final estimator
clf = LogisticRegression(penalty='l2', max_iter=1000, random_state=config['random_state'])

estimators.append(clf)
ensemble_func = stack_clf
ensemble_name = 'stack_ridge' + '_by_' + estimators_name

# Fit on the full Train-set and predict on the Test-set
X = X_train
y = y_train_trans

pred_proba = stack_clf(estimators, X, y,  X_test, n_folds=5, shuffle=True, random_state=config['random_state'])
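pred = pred_proba.argmax(axis=1)  # most probable class index per row
pred_trans = pred + 1  # shift class indices back by one (assumes y_train_trans = Response - 1)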

res_df = pd.DataFrame({'Id': test['Id'], 'Response': pred_trans})
res_df.to_csv('submission.csv', index=False)
print(ensemble_name)