# Before you begin

This notebook was written in Google Colab.

To see the interactive plots, please open the Colab link below.

<a href="https://colab.research.google.com/drive/1Kgd6OOrRE7rXrl62HTu4PHtAED2d2zWJ?usp=sharing" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab"/></a>

Similar notebooks for various other competitions are available at the GitHub address below.

<img src="https://github.githubassets.com/images/modules/logos_page/GitHub-Mark.png" width=50 align='left' alt="GitHub logo" />
&nbsp; <font size="5">[Github: Kaggle-Notebook](https://github.com/JayAhn0104/Kaggle-Notebook)</font>

# Overview

<br>

## Competition Description

<img src="https://storage.googleapis.com/kaggle-competitions/kaggle/3936/logos/thumb76_76.png" width=50 align='left' alt="Competition logo"/>
&nbsp; 
<font size="5">[Forest Cover Type Prediction](https://www.kaggle.com/c/forest-cover-type-prediction)</font>

<br>

- Problem type: (Multi-class) classification
  - Predict the forest categories by using cartographic variables 
- Evaluation metric: Accuracy

<br>

## Notebook Description

This notebook provides the '**proper workflow**' for a Kaggle submission.

The workflow is divided into three main steps.
1. Data preprocessing
2. Model selection (hyper parameter tuning, model combination, model comparison)
3. Training final model & Prediction on Test-set

At each stage, detailed descriptions of the work and an appropriate procedure will be provided.

Through this notebook, readers can learn the 'proper workflow' for a Kaggle submission. 
Using it as a basic structure, they can apply it to other competitions with a few adjustments.

**Warnings**:
- The purpose of this notebook
  - This notebook focuses on the 'procedure' rather than the 'result'. 
  - It therefore does not show how to achieve the top score, since I personally think that a result is only meaningful when it comes from an appropriate procedure.
  - Still, this is a competition, so the score inevitably matters. Following this notebook, you will get a top 15% (score: 0.12519) result in this competition.

- The readers this notebook is intended for
  - Readers who know the basic usage of data processing tools (e.g., numpy, pandas)
  - Readers who know the basic concepts of machine learning models 


# 0. Preliminaries

### > Set Configurations 

- Set the configurations for this notebook

In [None]:
config = {
    'data_name': 'forest-cover-type-prediction',
    'random_state': 2022
}

### > Install Libraries

In [None]:
!pip install tune_sklearn skorch

# 1. Data preprocessing

Data preprocessing is divided into 9 steps here.

Some of these steps are mandatory and some are optional.

Optional steps are marked separately.

It is important to go through each step in order.
Be careful not to reverse the order.

## 1-1. Load Dataset

Load the train-set and test-set into the working environment.


### > Load Data-set

In [None]:
import numpy as np
import pandas as pd
import os

train = pd.read_csv('/kaggle/input/{}/train.csv'.format(config['data_name']))
test = pd.read_csv('/kaggle/input/{}/test.csv'.format(config['data_name']))

### > Concatenate the 'train' and 'test' data for preprocessing

Data preprocessing should be applied identically to the train-set and the test-set.

To process both at once, drop the response variable 'Cover_Type' (and 'Id') from 'train' and concatenate the result with 'test'.

In [None]:
all_features = pd.concat((train.drop(['Id','Cover_Type'], axis=1), test.drop(['Id'], axis=1)), axis=0)

## 1-2. Missing Value Treatment

Missing (NA) values in the data must be treated properly before model training.

There are three main treatment methods:
1. Remove the variables which have NA values
2. Remove the rows (observations) which have NA values
3. Impute the NA values with other values

Which method to use is at the analyst's discretion.
It is important to choose the one appropriate for the situation.
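
The three treatments listed above can be sketched on a toy frame as follows. This is only an illustration: the `toy` data and the choice of the column median for imputation are assumptions made for the example, not something this notebook applies (the competition data has no missing values).

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical, not from the competition)
toy = pd.DataFrame({
    'a': [1.0, 2.0, np.nan, 4.0],
    'b': [10.0, 20.0, 30.0, 40.0],
})

drop_cols = toy.dropna(axis=1)       # 1. remove the variables that contain NA
drop_rows = toy.dropna(axis=0)       # 2. remove the rows that contain NA
imputed = toy.fillna(toy.median())   # 3. impute NA with the column median

print(drop_cols.shape, drop_rows.shape, imputed['a'].tolist())
```

Dropping a column loses a whole variable and dropping rows loses observations, so imputation is usually preferred when only a few values are missing.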

### > Check missing values in each variable

There are no missing values in the data-set.

In [None]:
all_features.isnull().sum().values

## 1-3. Adding new features (*optional*)

New variables can be created using the given data.
These variables are called 'derived variables'.

Appropriate derived variables can add new information.

This can have a positive effect on model performance, though not always.


### > Get Euclidean distance by using horizontal distance and vertical distance 

The data contains 'Horizontal_Distance_To_Hydrology' and 'Vertical_Distance_To_Hydrology'. 

Using the Pythagorean theorem, we can calculate the Euclidean distance to hydrology.

In [None]:
all_features['Euclidean_Distance_To_Hydrology'] = np.sqrt(np.power(all_features['Horizontal_Distance_To_Hydrology'],2) + np.power(all_features['Vertical_Distance_To_Hydrology'],2))

## 1-4. Variable modification

### > Aspect

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/1/1a/Brosen_windrose.svg/600px-Brosen_windrose.svg.png" width=180 align='left' alt="Compass rose"/>

According to the data description, 'Aspect' is the aspect in degrees azimuth. 
In cartographic data, the azimuth is an angular direction measured clockwise from north, from 0 to 360 degrees; here it gives the compass direction that the terrain slope faces. 
For example, an azimuth of 90 degrees is east.

'Aspect' is a circular quantity: 0 and 360 degrees denote the same direction (north), yet they are the two most distant values numerically. A model that uses the raw number therefore cannot capture the directional information.

Thus we need to convert these values appropriately to represent the direction.

<br>

Procedure:
- Bin values into discrete intervals based on the cardinal direction 
- Label the binned discrete intervals based on the cardinal direction



In [None]:
aspect_label_list = ['N', 'NNE', 'NE', 'ENE', 'E', 'ESE', 'SE', 'SSE', 'S', 'SSW', 'SW', 'WSW', 'W', 'WNW', 'NW', 'NNW']
aspect_interval = np.linspace(11.25, 371.25, 17)
aspect_interval[0] = 0
all_features['Aspect_direction'] = pd.cut(all_features['Aspect']+11.25, aspect_interval, right=True, labels=aspect_label_list, ordered=False)
all_features.drop('Aspect', inplace=True, axis=1)
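
As a point of comparison, here is a minimal wrap-aware sketch of the same idea. It is not the notebook's binning: it assumes each of the 16 sectors is 22.5 degrees wide and centred on its direction, and that 0 and 360 should both map to 'N', so its sector boundaries differ from those produced by the `pd.cut` call above. The helper name `aspect_to_direction` is hypothetical.

```python
import numpy as np

labels = np.array(['N', 'NNE', 'NE', 'ENE', 'E', 'ESE', 'SE', 'SSE',
                   'S', 'SSW', 'SW', 'WSW', 'W', 'WNW', 'NW', 'NNW'])

def aspect_to_direction(aspect_deg):
    """Map azimuth degrees to one of 16 compass sectors (hypothetical helper)."""
    aspect = np.asarray(aspect_deg, dtype=float)
    # Shift by half a sector, wrap at 360, then divide by the sector width
    idx = np.floor(((aspect + 11.25) % 360) / 22.5).astype(int)
    return labels[idx]

print(aspect_to_direction([0, 90, 180, 270, 359, 360]))
```

The modulo makes the mapping periodic, so values just below 360 fall into the same sector as values just above 0.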

## 1-5. Dummify categorical variables

For linear models without regularization, the first or last dummy column should be dropped to prevent linear dependency. Here, however, no column is dropped (plain one-hot encoding), because this is more convenient for the factorization machine model.

In [None]:
data_set = pd.get_dummies(all_features, drop_first=False)

## 1-6. Scaling continuous variables

Min-max scaling maps every variable to the range from 0 to 1, so that only relative information is kept rather than the absolute magnitudes of the values.

In addition, scaling often makes parameter optimization more stable when training a model.
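
To make the mapping concrete, the transformation is `(x - min) / (max - min)` per column. The sketch below applies it by hand with NumPy to a small hypothetical array; it is only meant to show the arithmetic, not to replace `MinMaxScaler`.

```python
import numpy as np

# Hypothetical data: two columns on very different scales
x = np.array([[10.0, 200.0],
              [20.0, 400.0],
              [40.0, 800.0]])

# Column-wise min-max scaling: (x - min) / (max - min)
x_min = x.min(axis=0)
x_max = x.max(axis=0)
x_scaled = (x - x_min) / (x_max - x_min)

print(x_scaled)
```

After scaling, both columns hold the same values because they have the same relative spacing, which is exactly the 'relative information' mentioned above.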

In [None]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
data_set = scaler.fit_transform(data_set)

## 1-7. Split Train & Test set

In [None]:
n_train = train.shape[0]
X_train = data_set[:n_train].astype(np.float32)
X_test = data_set[n_train:].astype(np.float32)
y_train = train['Cover_Type'].values.astype(np.int64)

## 1-8. Outlier Detection on Training data (*optional*)

Detect and remove outlier observations that exist in the train-set.

- Methodology: [Isolation Forest](https://ieeexplore.ieee.org/abstract/document/4781136/?casa_token=V7U3M1UIykoAAAAA:kww9pojtMeJtXaBcNmw0eVlJaXEGGICi1ogmeHUFMpgJ2h_XCbSd2yBU5mRgd7zEJrXZ01z2)
  - How it works
    - Isolation Forest builds a decision tree that repeatedly splits the given data on a 'random criterion' until only one observation remains in every terminal node (this is defined as 'isolation').
    - 'Normality' is defined from the number of splits needed for isolation. A smaller value means a higher degree of outlierness.
    - This is repeated over several trees, and the average of the measured 'normality' values is taken as the final 'normality' value.
  - Assumptions
    - Outliers require relatively few splits to be isolated.
    - For normal data, the number of splits required to be isolated is relatively large.
  - Outlier determination
    - Whether an observation is an outlier is decided from its measured 'normality' value.
      - sklearn's IsolationForest uses '0' as the decision threshold. 
      - I personally think it is better to set the threshold by considering the 'distribution' of the 'normality' values.
      - The details of the method are given below.
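
The '0' threshold mentioned above can be checked on toy data. This is a minimal sketch under assumptions: the two-cluster data set is made up for the illustration, and the variable names are hypothetical.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
# Toy data (hypothetical): a tight cluster plus a few far-away points
inliers = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
outliers = rng.uniform(low=8.0, high=10.0, size=(5, 2))
X_toy = np.vstack([inliers, outliers])

forest = IsolationForest(n_estimators=100, random_state=0).fit(X_toy)
scores = forest.decision_function(X_toy)   # lower = more anomalous
flags = forest.predict(X_toy)              # -1 = outlier, 1 = inlier

# predict() flags exactly the observations whose score is below 0
print(np.array_equal(flags == -1, scores < 0))
```

Because `predict` is just a fixed cut of `decision_function` at 0, replacing that cut with a data-driven threshold only requires working with the scores directly.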

In [None]:
from sklearn.ensemble import IsolationForest
clf = IsolationForest(
    n_estimators=100,
    max_samples='auto',
    n_jobs=-1,
    random_state=config['random_state'])

clf.fit(X_train)
normality_df = pd.DataFrame(clf.decision_function(X_train), columns=['normality'])

- The discriminant value 
  - The discriminant value (threshold) is defined from the 1st quartile ($q_1$) and the 3rd quartile ($q_3$) of the distribution of the measured normality values, with $k=1.5$:

$$threshold = q_1 - k*(q_3 - q_1)$$


- Motivation
  - This discriminant method is adapted from Tukey's boxplot idea. For the distribution of any continuous variable, Tukey designates observations smaller than $q_1 - k*(q_3 - q_1)$ or larger than $q_3 + k*(q_3 - q_1)$ as outliers.

- How we apply it 
  - Our methodology applies this rule to the obtained normality values rather than to a specific variable.

  - That is, it assumes that an outlier lies far to the left of the other observations in the measured normality distribution.

In [None]:
def outlier_threshold(normality, k=1.5):
  q1 = np.quantile(normality, 0.25)
  q3 = np.quantile(normality, 0.75)  
  threshold = q1 - k*(q3-q1)
  return threshold

threshold = outlier_threshold(normality_df['normality'].values, k=1.5)

import plotly.express as px
fig = px.histogram(normality_df, x='normality', width=400, height=400)
fig.add_vline(x=threshold, line_width=3, line_dash="dash", line_color="red")
fig.show()

In [None]:
import plotly.express as px
px.box(normality_df, x='normality', orientation='h', width=400, height=400)

In [None]:
X_train = X_train[normality_df['normality'].values>=threshold]
y_train = y_train[normality_df['normality'].values>=threshold]

print('{} observations are removed from train_set'.format(train.shape[0] - X_train.shape[0]))

## 1-9. Output variable transformation

PyTorch's loss functions (e.g., `nn.CrossEntropyLoss`) expect class labels that start from 0.

Since our output variable takes values from 1 to 7, we shift them to the range 0 to 6.

In [None]:
np.unique(y_train)

In [None]:
y_train_trans = y_train - 1
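
A small sketch of the shift and its inverse, using hypothetical labels. The inverse step is an assumption about what is presumably needed later: predictions made on the 0-based scale have to be moved back to 1..7 before they are written to a submission.

```python
import numpy as np

y = np.array([1, 3, 7, 2])     # hypothetical Cover_Type labels in 1..7
y_shifted = y - 1              # 0..6, the range used for training
y_restored = y_shifted + 1     # back to 1..7 for the submission file

print(y_shifted, y_restored)
```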


# 2. Model Selection

Our goal is to build a model that predicts the forest cover type given the cartographic information of the forest. The formula can be expressed as:

$\hat{y} = \underset{k \in \{1,\cdots,K\}}{\operatorname{argmax}}f_{k}(x)$

where,
  - $y \in \{1,\cdots,K\} $: labels 
  - $x$: an input observation
  - $f_{k}(x)$: a function of $x$ that outputs predicted value for each $k$
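
The argmax rule above can be illustrated with NumPy. The score matrix is hypothetical; each row holds $f_k(x)$ for one observation and $K = 3$ classes.

```python
import numpy as np

# Hypothetical scores f_k(x) for 2 observations and K = 3 classes
scores = np.array([[0.1, 0.7, 0.2],
                   [0.5, 0.3, 0.2]])

# argmax returns 0-based positions, so add 1 to get labels in {1, ..., K}
y_hat = scores.argmax(axis=1) + 1

print(y_hat)  # [2 1]
```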

This is a typical multiclass classification problem, and various machine learning models can be applied. This notebook uses the following models.
- Logistic regression
- Support vector machine
- Random forest
- XGBoost
- Multi-layer perceptron
- Factorization machine

However, we have to "choose" one final methodology to make predictions on the test set.
To do this, a "fair evaluation" of the models is essential. A "fair evaluation" must satisfy the following two conditions.

1. Optimal hyperparameters are selected for each model
  - Without a hyperparameter search, differences in model performance may simply be caused by poorly chosen hyperparameter values.
2. The same evaluation method is used for every model
  - If the evaluation methods differ, the models cannot be compared at all.

Only when the models are compared through an evaluation method that satisfies both conditions can the final model be selected.




## 2-1. Hyper parameter tuning by using Tune_SKlearn (Ray Tune)

- Package: tune_sklearn
  - This package makes it easy to apply [Ray Tune](https://docs.ray.io/en/latest/tune/index.html) to sklearn models.
  - Ray Tune is a python package that provides various hyperparameter tuning algorithms (HyperOpt, BayesianOptimization, ...).
- Tuning procedure
  - Define an appropriate search space for each model's hyperparameters.
  - 5-fold CV (cross-validation) is performed for each hyperparameter combination that the tuning algorithm (HyperOpt) proposes from the search space.
    - Training: train the model using the Scikit-Learn and Skorch packages
    - Validation: evaluate the model using an appropriate evaluation metric
  - The hyperparameter combination with the highest average CV score is designated as the optimal hyperparameters of the model.
    - This CV result is saved and used for model comparison


### > Make a dataframe for containing CV results

In [None]:
model_list = []
for name in ['linear', 'svm', 'rf', 'xgb', 'mlp', 'fm']:
  model_list.append(np.full(5, name))
  
best_cv_df = pd.DataFrame({'model': np.hstack((model_list)), 'log_loss':None, 'accuracy':None, 'best_hyper_param':None})

### > Logistic regression

In [None]:
from tune_sklearn import TuneSearchCV
from sklearn.linear_model import SGDClassifier

# Define a search space
parameters = {
    'max_iter': [1000],
    'loss': ['log'],
    'penalty': ['l2'],
    'random_state': [config['random_state']],
    'alpha': list(np.geomspace(1e-6, 1e-3, 4)),
    'tol': list(np.geomspace(1e-4, 1e-1, 4))
}

# Specify the hyper parameter tuning algorithm
tune_search = TuneSearchCV(
    SGDClassifier(),
    parameters,
    search_optimization='hyperopt',
    n_trials=10,
    n_jobs=-1,
    scoring=['neg_log_loss', 'accuracy'],
    cv=5,
    refit='accuracy', # target metric of competition
    verbose=1,
    random_state=config['random_state']
    )

# Run hyper parameter tuning
X = X_train
y = y_train_trans
tune_search.fit(X, y)

# Save the tuning results 
model_name = 'linear'

## Save the optimal hyper parmater values
best_cv_df.loc[best_cv_df['model']==model_name, 'best_hyper_param'] = str(tune_search.best_params_)

## Save the CV results
cv_df = pd.DataFrame(tune_search.cv_results_)
cv_values = cv_df.loc[tune_search.best_index_, cv_df.columns.str.startswith('split')].values
best_cv_df.loc[best_cv_df['model']==model_name, 'log_loss'] = cv_values[:5]
best_cv_df.loc[best_cv_df['model']==model_name, 'accuracy'] = cv_values[5:10]

# Visualize the tuning results with parallel coordinate plot
tune_result_df = pd.concat([pd.DataFrame(tune_search.cv_results_['params']), cv_df.loc[:,cv_df.columns.str.startswith('mean')] ], axis=1)
import plotly.express as px
fig = px.parallel_coordinates(tune_result_df, color='mean_test_accuracy')
fig.show()

### > Support vector machine

In [None]:
from tune_sklearn import TuneSearchCV
from sklearn.linear_model import SGDClassifier

# Define a search space
parameters = {
    'alpha': list(np.geomspace(1e-7, 1e-3, 3)),
    'epsilon': list(np.geomspace(1e-5, 1e-1, 3)),
    'loss': ['hinge'],
    'tol': list(np.geomspace(1e-7, 1e-1, 4)),
    'max_iter': [1000],
    'penalty': ['l2'],
    'random_state': [config['random_state']]
}

# Specify the hyper parameter tuning algorithm
tune_search = TuneSearchCV(
    SGDClassifier(),
    parameters,
    search_optimization='hyperopt',
    n_trials=10,
    n_jobs=-1,
    scoring=['accuracy'],
    cv=5,
    refit='accuracy', # target metric of competition
    verbose=1,
    random_state=config['random_state']
    )

# Run hyper parameter tuning
X = X_train 
y = y_train_trans
tune_search.fit(X, y)

# Save the tuning results 
model_name = 'svm'

## Save the optimal hyper parmater values
best_cv_df.loc[best_cv_df['model']==model_name, 'best_hyper_param'] = str(tune_search.best_params_)

## Save the CV results
cv_df = pd.DataFrame(tune_search.cv_results_)
cv_values = cv_df.loc[tune_search.best_index_, cv_df.columns.str.startswith('split')].values
best_cv_df.loc[best_cv_df['model']==model_name, 'accuracy'] = cv_values[:5]

# Visualize the tuning results with parallel coordinate plot
tune_result_df = pd.concat([pd.DataFrame(tune_search.cv_results_['params']), cv_df.loc[:,cv_df.columns.str.startswith('mean')] ], axis=1)
import plotly.express as px
fig = px.parallel_coordinates(tune_result_df, color='mean_test_accuracy')
fig.show()

### > Random forest

In [None]:
from tune_sklearn import TuneSearchCV
from sklearn.ensemble import RandomForestClassifier

# Define a search space
parameters = {
    'n_estimators': [100, 500, 1000],
    'criterion': ['gini', 'entropy'],
    'max_depth': [20, 25, 30],
    'max_features': ['auto'],
    'random_state': [config['random_state']]
}

# Specify the hyper parameter tuning algorithm
tune_search = TuneSearchCV(
    RandomForestClassifier(),
    parameters,
    search_optimization='hyperopt',
    n_trials=10,
    n_jobs=-1,
    scoring=['neg_log_loss', 'accuracy'],
    cv=5,
    refit='accuracy',
    verbose=1,
    random_state=config['random_state']
    )


# Run hyper parameter tuning
X = X_train 
y = y_train_trans
tune_search.fit(X, y)

# Save the tuning results 
model_name = 'rf'

## Save the optimal hyper parmater values
best_cv_df.loc[best_cv_df['model']==model_name, 'best_hyper_param'] = str(tune_search.best_params_)

## Save the CV results
cv_df = pd.DataFrame(tune_search.cv_results_)
cv_values = cv_df.loc[tune_search.best_index_, cv_df.columns.str.startswith('split')].values
best_cv_df.loc[best_cv_df['model']==model_name, 'log_loss'] = cv_values[:5]
best_cv_df.loc[best_cv_df['model']==model_name, 'accuracy'] = cv_values[5:10]

# Visualize the tuning results with parallel coordinate plot
tune_result_df = pd.concat([pd.DataFrame(tune_search.cv_results_['params']), cv_df.loc[:,cv_df.columns.str.startswith('mean')] ], axis=1)
import plotly.express as px
fig = px.parallel_coordinates(tune_result_df, color='mean_test_accuracy')
fig.show()

### > XGBoost

In [None]:
from tune_sklearn import TuneSearchCV
from xgboost import XGBClassifier

# Define a search space
parameters = {
    'n_estimators': [50, 100, 200],
    'learning_rate': list(np.geomspace(1e-2, 1, 3)),
    'min_child_weight': [10, 15, 20],
    'gamma': [0.5, 2],
    'subsample': [0.6, 1.0],
    'colsample_bytree': [0.6, 1.0],
    'max_depth': [5, 10, 15],
    'objective': ['multi:softmax'],
    'random_state': [config['random_state']]
}

# Specify the hyper parameter tuning algorithm
tune_search = TuneSearchCV(
    XGBClassifier(),
    parameters,
    search_optimization='hyperopt',
    n_trials=10,
    n_jobs=-1,
    scoring=['neg_log_loss', 'accuracy'],
    cv=5,
    refit='accuracy',
    verbose=1,
    random_state=config['random_state']
    )


# Run hyper parameter tuning
X = X_train 
y = y_train_trans
tune_search.fit(X, y)

# Save the tuning results 
model_name = 'xgb'

## Save the optimal hyper parmater values
best_cv_df.loc[best_cv_df['model']==model_name, 'best_hyper_param'] = str(tune_search.best_params_)

## Save the CV results
cv_df = pd.DataFrame(tune_search.cv_results_)
cv_values = cv_df.loc[tune_search.best_index_, cv_df.columns.str.startswith('split')].values
best_cv_df.loc[best_cv_df['model']==model_name, 'log_loss'] = cv_values[:5]
best_cv_df.loc[best_cv_df['model']==model_name, 'accuracy'] = cv_values[5:10]

# Visualize the tuning results with parallel coordinate plot
tune_result_df = pd.concat([pd.DataFrame(tune_search.cv_results_['params']), cv_df.loc[:,cv_df.columns.str.startswith('mean')] ], axis=1)
import plotly.express as px
fig = px.parallel_coordinates(tune_result_df, color='mean_test_accuracy')
fig.show()

### > Multi-layer perceptron

In [None]:
import torch
from torch import nn
from skorch import NeuralNetClassifier
from skorch.callbacks import EarlyStopping
from skorch.callbacks import Checkpoint
from tune_sklearn import TuneSearchCV

# Define a model structure
class MLP(nn.Module):
    def __init__(self, num_inputs=X_train.shape[1], num_outputs=len(np.unique(y_train)), layer1=512, layer2=256, dropout1=0, dropout2=0):
        super(MLP, self).__init__()

        self.linear_relu_stack = nn.Sequential(
            nn.Linear(num_inputs, layer1),
            nn.LeakyReLU(),
            nn.Dropout(dropout1),
            nn.Linear(layer1, layer2),
            nn.LeakyReLU(),
            nn.Dropout(dropout2),
            nn.Linear(layer2, num_outputs)
            )
    def forward(self, x):
        x = self.linear_relu_stack(x)
        return x  

def try_gpu(i=0): 
    return f'cuda:{i}' if torch.cuda.device_count() >= i + 1 else 'cpu'

# Set model configurations
mlp = NeuralNetClassifier(
    MLP(num_inputs=X_train.shape[1], num_outputs=len(np.unique(y_train))),
    optimizer=torch.optim.Adam,
    criterion=nn.CrossEntropyLoss(),
    iterator_train__shuffle=True,
    device=try_gpu(),
    verbose=0,
    callbacks=[EarlyStopping(monitor='valid_loss', patience=5,
                             threshold=1e-4, lower_is_better=True),
               Checkpoint(monitor='valid_loss_best')]
                          )

# Define a search space
parameters = {
    'lr': list(np.geomspace(1e-4, 1e-1, 4)),
    'module__layer1': [128, 256, 512],
    'module__layer2': [128, 256, 512],
    'module__dropout1': [0, 0.1],
    'module__dropout2': [0, 0.1],
    'optimizer__weight_decay': list(np.append(0, np.geomspace(1e-5, 1e-3, 3))),
    'max_epochs': [1000],
    'batch_size': [32, 128],
    'callbacks__EarlyStopping__threshold': list(np.geomspace(1e-4, 1e-2, 3))
    }

def use_gpu(device):
    return True if not device == 'cpu' else False 

# Specify the hyper parameter tuning algorithm
tune_search = TuneSearchCV(
    mlp,
    parameters,
    search_optimization='hyperopt',
    n_trials=10,
    n_jobs=-1,
    scoring=['neg_log_loss', 'accuracy'],
    cv=5,
    refit='accuracy',
    verbose=1,
    random_state=config['random_state']
    )

# Run hyper parameter tuning
X = X_train 
y = y_train_trans
tune_search.fit(X, y)

# Save the tuning results 
model_name = 'mlp'

## Save the optimal hyper parmater values
best_cv_df.loc[best_cv_df['model']==model_name, 'best_hyper_param'] = str(tune_search.best_params_)

## Save the CV results
cv_df = pd.DataFrame(tune_search.cv_results_)
cv_values = cv_df.loc[tune_search.best_index_, cv_df.columns.str.startswith('split')].values
best_cv_df.loc[best_cv_df['model']==model_name, 'log_loss'] = cv_values[:5]
best_cv_df.loc[best_cv_df['model']==model_name, 'accuracy'] = cv_values[5:10]

# Visualize the tuning results with parallel coordinate plot
tune_result_df = pd.concat([pd.DataFrame(tune_search.cv_results_['params']), cv_df.loc[:,cv_df.columns.str.startswith('mean')] ], axis=1)
tune_result_df.rename({
    'callbacks__EarlyStopping__threshold':'Earlystoping_threshold',
    'optimizer__weight_decay': 'weight_decay'
    }, axis=1, inplace=True)
import plotly.express as px
fig = px.parallel_coordinates(tune_result_df, color='mean_test_accuracy')
fig.show()

### > Factorization Machine

[Factorization Machines](https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=5694074&casa_token=WNncod4Fzy0AAAAA:06BUH6Q3Mh-HhboU-WV9p4h5AykMCWcYedWlcFDLtNw4tIkNWZg9oadIz32UuMx9rFDyqOTGY1w&tag=1), proposed by Steffen Rendle in 2010, is a supervised algorithm that can be used for classification, regression, and ranking tasks. 
It quickly gained attention and became a popular and impactful method for making predictions and recommendations.
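
A factorization machine models pairwise feature interactions through latent vectors, and the pairwise sum can be computed in linear time with a "square of sum minus sum of squares" identity; the `FM` module defined later in this notebook relies on the same form. The sketch below checks that identity numerically with NumPy. The sizes and random inputs are arbitrary assumptions for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_factors = 6, 4
x = rng.normal(size=n_features)
V = rng.normal(size=(n_features, n_factors))   # one latent vector per feature

# Naive pairwise sum: sum_{i<j} <v_i, v_j> x_i x_j
naive = sum(V[i] @ V[j] * x[i] * x[j]
            for i in range(n_features) for j in range(i + 1, n_features))

# Linear-time form: 0.5 * sum_f [(sum_i v_if x_i)^2 - sum_i (v_if x_i)^2]
vx = V * x[:, None]
fast = 0.5 * np.sum(vx.sum(axis=0) ** 2 - (vx ** 2).sum(axis=0))

print(np.isclose(naive, fast))
```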


#### >> Preprocessing Data for implementing Factorization Machine

Since the factorization machine uses an embedding layer, all input variables must be of type 'int'.

To account for this, each 'float' variable is divided into several bins according to its values, and every value is replaced by the integer index of the bin it falls into.
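
The binning step can be sketched on a single hypothetical feature: NumPy's Sturges rule decides how many bins to use, and `pd.cut` with `labels=False` returns the integer bin index of every value. The `values` array is made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical scaled feature in [0, 1]
values = np.array([0.0, 0.1, 0.45, 0.5, 0.9, 1.0])

# Number of bins chosen by the Sturges rule
n_bins = len(np.histogram(values, bins='sturges')[0])

# Equal-width bins; labels=False returns the integer bin index of each value
codes = pd.cut(values, bins=n_bins, labels=False)

print(n_bins, codes)
```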

In [None]:
def prepro_for_fm(X_train, X_test, bin_method='sturges'):
  n_train = X_train.shape[0]
  all_data = np.vstack((X_train, X_test))  # avoid shadowing the built-in all()

  # Columns with at most 2 unique values are treated as binary and kept as-is
  col_num_uniq = np.apply_along_axis(lambda x: len(np.unique(x)), 0, all_data)
  remain_iidx = (col_num_uniq<=2)
  to_bin_iidx = (col_num_uniq>2)

  all_remain = all_data[:,remain_iidx]
  all_to_bin = all_data[:,to_bin_iidx]

  # Discretize each remaining column into equal-width bins
  for j in range(all_to_bin.shape[1]):
    bin_size = len(np.histogram(all_to_bin[:,j], bins=bin_method)[0])
    all_to_bin[:,j] = pd.cut(all_to_bin[:,j], bins=bin_size, labels=False)

  # One-hot encode the bin indices
  all_to_bin_df = pd.DataFrame(all_to_bin).astype('object')
  all_to_bin_array = pd.get_dummies(all_to_bin_df, drop_first=False).to_numpy()

  all_array = np.hstack((all_to_bin_array, all_remain)).astype(np.int64)
  field_dims = all_array.shape[1]
  # For each row, collect the column indices of the active (==1) features
  all_fm = np.vstack((np.apply_along_axis(lambda x: np.where(x==1), 1, all_array)))

  return all_fm[:n_train], all_fm[n_train:], field_dims


X_train_fm, X_test_fm, field_dims = prepro_for_fm(X_train, X_test, bin_method='sturges')

In [None]:
import torch
from torch import nn
from skorch import NeuralNetClassifier
from skorch.callbacks import EarlyStopping
from skorch.callbacks import Checkpoint
from tune_sklearn import TuneSearchCV

# Define a model structure
class FM(nn.Module):
    def __init__(self, num_inputs=field_dims, num_factors=20, output_dim=7):
        super(FM, self).__init__()
        self.output_dim = output_dim
        for i in range(output_dim):
          setattr(self, f'embedding_{i}', nn.Embedding(num_inputs, num_factors))
        self.fc = nn.Embedding(num_inputs, output_dim)
        self.bias = nn.Parameter(torch.zeros((output_dim,)))

    def forward(self, x):
        square_of_sum_list = []
        sum_of_square_list = []
        for i in range(self.output_dim):
          square_of_sum_list.append(torch.sum(getattr(self, f'embedding_{i}')(x), dim=1)**2)
          sum_of_square_list.append(torch.sum(getattr(self, f'embedding_{i}')(x)**2, dim=1))
        square_of_sum = torch.stack(square_of_sum_list, dim=1)
        sum_of_square = torch.stack(sum_of_square_list, dim=1)
        x = self.bias + self.fc(x).sum(dim=1) + 0.5 * (square_of_sum - sum_of_square).sum(dim=2)
        return x

def try_gpu(i=0): 
    return f'cuda:{i}' if torch.cuda.device_count() >= i + 1 else 'cpu'

# Set model configurations
fm = NeuralNetClassifier(
    FM(num_inputs=field_dims, output_dim=len(np.unique(y_train_trans))),
    optimizer=torch.optim.Adam,
    criterion=nn.CrossEntropyLoss(),
    iterator_train__shuffle=True,
    device=try_gpu(),
    verbose=0,
    callbacks=[EarlyStopping(monitor='valid_loss', patience=5,
                             threshold=1e-4, lower_is_better=True),
               Checkpoint(monitor='valid_loss_best')]
                          )

# Define a search space
parameters = {
    'lr': list(np.geomspace(1e-4, 1e-2, 3)),
    'module__num_factors': [50, 100, 150],
    'optimizer__weight_decay': [1e-5, 1e-4, 1e-1],
    'max_epochs': [1000],
    'batch_size': [16, 32]
    }

def use_gpu(device):
    return True if not device == 'cpu' else False 

# Specify the hyper parameter tuning algorithm
tune_search = TuneSearchCV(
    fm, 
    parameters, 
    search_optimization='hyperopt',
    n_trials=10,
    n_jobs=-1,
    scoring=['neg_log_loss', 'accuracy'],
    cv=5,
    refit='accuracy',
    mode='max',   
    use_gpu = use_gpu(try_gpu()),
    random_state=config['random_state'],
    verbose=1,
    )

# Run hyper parameter tuning
X = X_train_fm
y = y_train_trans
tune_search.fit(X, y)

# Save the tuning results 
model_name = 'fm'

## Save the optimal hyper parmater values
best_cv_df.loc[best_cv_df['model']==model_name, 'best_hyper_param'] = str(tune_search.best_params_)

## Save the CV results
cv_df = pd.DataFrame(tune_search.cv_results_)
cv_values = cv_df.loc[tune_search.best_index_, cv_df.columns.str.startswith('split')].values
best_cv_df.loc[best_cv_df['model']==model_name, 'log_loss'] = cv_values[:5]
best_cv_df.loc[best_cv_df['model']==model_name, 'accuracy'] = cv_values[5:10]

# Visualize the tuning results with parallel coordinate plot
tune_result_df = pd.concat([pd.DataFrame(tune_search.cv_results_['params']), cv_df.loc[:,cv_df.columns.str.startswith('mean')] ], axis=1)
tune_result_df.rename({
    'callbacks__EarlyStopping__threshold':'Earlystoping_threshold',
    'optimizer__weight_decay': 'weight_decay'
    }, axis=1, inplace=True)
import plotly.express as px
fig = px.parallel_coordinates(tune_result_df, color='mean_test_accuracy')
fig.show()

## 2-2. Model Comparison based on CV results

Compare the CV results (measured using the best hyper parameter values) \\
The figure below shows that \\
rf > xgb >> fm > mlp >> linear > svm



In [None]:
fig = px.box(best_cv_df, x='model', y='accuracy', color='model', width=600)
fig.show()

## 2-3. Model Combination

Although it is possible to select a final model based on the above results, it has been observed that in many cases combining the predicted values from multiple models improves prediction performance. ([Can multi-model combination really enhance the prediction skill of probabilistic ensemble forecasts?](https://rmets.onlinelibrary.wiley.com/doi/abs/10.1002/qj.210?casa_token=OwyF2RbEywAAAAAA:gahpwGRdOWzLXyafYQQt_voHOF8MedTBLd1SBv4vkdT3ZTLVoKZQj3zl-KbrhSkX5x8CndeCxwBoL_-S))

For classification problems, the final probabilities are derived by combining the predicted 'probabilities' for each class in a 'proper way'.

This notebook uses following two model combination methods.

1. Simple Average
2. Stacked Generalization (Stacking)


The combined models need to be compared with the single models (e.g., rf, xgb, ...).
So their performance is measured by applying the same CV method as above.

Based on the CV results, we select (rf, xgb, mlp) as the base estimators for model combination. (Although fm performs slightly better than mlp in terms of CV results, mlp was chosen because it has a shorter training time.)

### > Simple Average

The simple average method derives the final probability value by 'averaging' the probability values that the individual models predict for each class.

For example,
- Base Estimations
  - $P_{rf}(Y=1|X=x)$ = 0.75
  - $P_{xgb}(Y=1|X=x)$ = 0.80
  - $P_{mlp}(Y=1|X=x)$ = 0.80
- Final Estimation
  - $P_{average}(Y=1|X=x)$ ≈ 0.783 (= (0.75 + 0.80 + 0.80) / 3)
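
For full probability matrices, the same averaging is an element-wise mean over the models. The sketch below uses three hypothetical `predict_proba` outputs (2 observations, 3 classes); the numbers are made up for the illustration.

```python
import numpy as np

# Hypothetical predict_proba outputs of three models: 2 observations x 3 classes
p_rf = np.array([[0.7, 0.2, 0.1], [0.2, 0.5, 0.3]])
p_xgb = np.array([[0.6, 0.3, 0.1], [0.1, 0.6, 0.3]])
p_mlp = np.array([[0.5, 0.4, 0.1], [0.3, 0.4, 0.3]])

# Element-wise mean over the model axis
p_avg = np.mean([p_rf, p_xgb, p_mlp], axis=0)

# Predicted class = column with the highest averaged probability
y_hat = p_avg.argmax(axis=1)

print(p_avg, y_hat)
```

Each row of the averaged matrix still sums to 1, so it can be used directly as a probability forecast.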


In [None]:
from sklearn.model_selection import KFold
from tqdm import notebook
from sklearn.metrics import accuracy_score
from sklearn.metrics import log_loss
from sklearn.metrics import roc_auc_score

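def CV_ensemble(ensemble_name, ensemble_func, estimators, X_train, y_train, n_folds=5, shuffle=True, random_state=2022):
  # Use the function arguments; KFold only accepts random_state when shuffle=True
  kf = KFold(n_splits=n_folds, shuffle=shuffle, random_state=random_state if shuffle else None)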

  res_list = []
  for train_idx, valid_idx in notebook.tqdm(kf.split(X_train), total=kf.get_n_splits(), desc='Eval_CV'):
    X_train_train, X_valid = X_train[train_idx], X_train[valid_idx]
    y_train_train, y_valid = y_train[train_idx], y_train[valid_idx]

    ensemble_pred_proba = ensemble_func(estimators, X_train_train, y_train_train, X_valid)
    neg_log_loss = np.negative(log_loss(y_valid, ensemble_pred_proba))
    accuracy = accuracy_score(y_valid, ensemble_pred_proba.argmax(axis=1))

    res_list.append([ensemble_name, neg_log_loss, accuracy])
  # Build the frame from the list directly so the score columns stay numeric
  res_df = pd.DataFrame(res_list, columns=['model', 'log_loss', 'accuracy'])
  return res_df

def ensemble_average(estimators, X_train, y_train, X_test):
  # Fit each base estimator and average the predicted class probabilities
  preds = []
  for estimator in estimators:
    if hasattr(estimator, 'module__num_factors'): # for factorization machine
      X_train_fm, X_test_fm, _ = prepro_for_fm(X_train, X_test)
      estimator.fit(X_train_fm, y_train)
      preds.append(estimator.predict_proba(X_test_fm))
    else: # for other models
      estimator.fit(X_train, y_train)
      preds.append(estimator.predict_proba(X_test))

  # np.stack gives shape (num_estimators, num_samples, num_class);
  # averaging over axis 0 yields one probability matrix
  return np.mean(np.stack(preds), axis=0)

In [None]:
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

linear = SGDClassifier(**eval(best_cv_df.loc[best_cv_df['model']=='linear', 'best_hyper_param'].values[0]))
svm = SGDClassifier(**eval(best_cv_df.loc[best_cv_df['model']=='svm', 'best_hyper_param'].values[0]))
rf = RandomForestClassifier(**eval(best_cv_df.loc[best_cv_df['model']=='rf', 'best_hyper_param'].values[0]))
xgb = XGBClassifier(**eval(best_cv_df.loc[best_cv_df['model']=='xgb', 'best_hyper_param'].values[0]))
mlp = mlp.set_params(**eval(best_cv_df.loc[best_cv_df['model']=='mlp', 'best_hyper_param'].values[0]))
fm = fm.set_params(**eval(best_cv_df.loc[best_cv_df['model']=='fm', 'best_hyper_param'].values[0]))

estimators = [rf, xgb, mlp]
estimators_name = 'rf_xgb_mlp'
ensemble_name = 'average' + '_by_' + estimators_name

X = X_train
y = y_train_trans

res_df = CV_ensemble(ensemble_name, ensemble_average, estimators, X, y, n_folds=5, shuffle=True, random_state=config['random_state'])
best_cv_df = pd.concat([best_cv_df, res_df], ignore_index=True) # DataFrame.append was removed in pandas 2.0

In [None]:
fig = px.box(best_cv_df, x='model', y='accuracy', color='model', width=800)
fig.show()

### > Stacked generalization (Stacking)

In [stacked generalization](https://www.jair.org/index.php/jair/article/view/10228), the predicted probabilities of the base estimators are treated as the 'input data', and y (Cover_Type) of each row is treated as the 'output variable'. 
The 'Meta Learner' is trained on these data, and its predicted probabilities are used as the final predicted probabilities.

- The 'Meta Learner' can be any classification model. However, this notebook uses a ridge model (logistic regression with ridge penalty) to prevent overfitting.

- As input data for the 'Meta Learner', the predicted probabilities of the base estimators on the validation folds of CV are used. These are out-of-fold predictions, so the 'Meta Learner' never sees a prediction made on a row that the base estimator was trained on.

- The trained 'Meta Learner' produces the final predicted probabilities for the Test-set, taking the base estimators' predicted probabilities for the Test-set as input data.

The total process, in order, is as follows:
1. (Base estimators) Run CV on Train-set
2. (Meta Learner) Train on CV predictions (predicted probabilities on validation data of CV) with corresponding y values
3. (Base estimators) Train on Train-set
4. (Base estimators) Predict on Test-set
5. (Meta Learner) Predict on predictions on Test-set

<img align='top' src='https://drive.google.com/uc?export=view&id=1uDxSIIFt8rUJkuIwRYU4lALvOPqlXPG5' width='600' height='400'>
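
The five steps above can be sketched with sklearn-only estimators as follows. This is a minimal illustration under assumptions, not the implementation used in this notebook: the data are synthetic (`make_classification`), the base estimators are a small random forest and a decision tree chosen only for speed, and `cross_val_predict` is used as a shortcut to obtain the out-of-fold predicted probabilities.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_predict, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class data (an assumption for illustration only)
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          random_state=0, stratify=y)

base_estimators = [RandomForestClassifier(n_estimators=30, random_state=0),
                   DecisionTreeClassifier(max_depth=3, random_state=0)]
meta_learner = LogisticRegression(max_iter=1000)  # the default penalty is l2 (ridge)

kf = KFold(n_splits=5, shuffle=True, random_state=0)

# 1. (Base estimators) Run CV on Train-set: out-of-fold predicted probabilities
cv_preds = np.hstack([
    cross_val_predict(est, X_tr, y_tr, cv=kf, method='predict_proba')
    for est in base_estimators
])

# 2. (Meta Learner) Train on CV predictions with corresponding y values
meta_learner.fit(cv_preds, y_tr)

# 3. and 4. (Base estimators) Train on Train-set, then predict on Test-set
test_preds = np.hstack([
    est.fit(X_tr, y_tr).predict_proba(X_te) for est in base_estimators
])

# 5. (Meta Learner) Predict on predictions on Test-set
pred_fin = meta_learner.predict_proba(test_preds)
print(cv_preds.shape, pred_fin.shape)
```

`cross_val_predict` returns the predictions in the original row order, so `cv_preds` lines up with `y_tr` without any re-indexing.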


For example,
- Assume that 
  - $Y \in \{0, 1, 2\}$
- Base Estimators
  - rf
    - $P_{rf}(Y=0|X=x)$ = 0.75
    - $P_{rf}(Y=1|X=x)$ = 0.10
    - $P_{rf}(Y=2|X=x)$ = 0.15
  - xgb
    - $P_{xgb}(Y=0|X=x)$ = 0.80
    - $P_{xgb}(Y=1|X=x)$ = 0.10
    - $P_{xgb}(Y=2|X=x)$ = 0.10
- Meta Learner (logistic regression with ridge (l2) penalty)
  - when Y=0:
    - intercept = 0.1
    - coefficient = [0.8, 0.1, -0.1, 0.9, 0.2, -0.05]
  - predicted probabilities
    - $P_{stack}(Y=0|X=x) = \mathrm{sigmoid}(0.1 + 0.8 \times 0.75 + 0.1 \times 0.10 - 0.1 \times 0.15 + 0.9 \times 0.80 + 0.2 \times 0.10 - 0.05 \times 0.10) = \mathrm{sigmoid}(1.43) \approx 0.8069$
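
The arithmetic of this example can be reproduced with a few lines of NumPy. The numbers are copied from the example; the variable names are only illustrative. Note that this treats class 0 on its own, as the example does. A multinomial logistic regression would instead normalize the scores of all classes with a softmax.

```python
import numpy as np

# Values copied from the example above
intercept = 0.1
coef = np.array([0.8, 0.1, -0.1, 0.9, 0.2, -0.05])
base_probs = np.array([0.75, 0.10, 0.15,   # rf:  P(Y=0), P(Y=1), P(Y=2)
                       0.80, 0.10, 0.10])  # xgb: P(Y=0), P(Y=1), P(Y=2)

# Linear score of the meta learner for class 0, then the sigmoid
linear_score = intercept + coef @ base_probs
p_stack_0 = 1.0 / (1.0 + np.exp(-linear_score))
print(round(float(linear_score), 2), round(float(p_stack_0), 4))
```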


**Warnings**:

- The columns of the predicted probability matrix $[P_{rf}(Y=0|X=x), \cdots, P_{xgb}(Y=2|X=x)]$ are **linearly dependent**, because the class probabilities of each base estimator sum to 1.
- Thus, a linear model with a penalty or a non-linear model is recommended as the final estimator.
- If you want to apply a plain linear model with no penalty, please remove the first or last class probability of each base estimator (e.g., remove $P_{rf}(Y=2|X=x)$ and $P_{xgb}(Y=2|X=x)$)
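
The linear dependence can be checked numerically. The sketch below uses randomly generated probabilities (hypothetical stand-ins for the outputs of two base estimators) and compares the number of columns with the matrix rank, before and after removing the last class of each estimator.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical predicted probabilities of two base estimators (50 rows, 3 classes each)
p_rf = rng.dirichlet(np.ones(3), size=50)
p_xgb = rng.dirichlet(np.ones(3), size=50)

# All class probabilities side by side: 6 columns
full = np.hstack([p_rf, p_xgb])
# Last class of each base estimator removed: 4 columns
reduced = np.hstack([p_rf[:, :-1], p_xgb[:, :-1]])

rank_full = np.linalg.matrix_rank(full)
rank_reduced = np.linalg.matrix_rank(reduced)
print(full.shape[1], rank_full)        # number of columns vs. rank
print(reduced.shape[1], rank_reduced)
```

The full matrix has fewer independent directions than columns, since the three columns of each estimator sum to the same constant. The reduced matrix has full column rank.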


sklearn provides [StackingClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.StackingClassifier.html) for this, but it cannot be applied to the skorch models.

So the code below implements the stacking operation.

In [None]:
from sklearn.model_selection import KFold
from tqdm import notebook


def stack_clf(estimators, X_train, y_train, X_test, n_folds=5, shuffle=True, random_state=2022):
  # The last element is the final estimator (meta learner); the others are base estimators
  final_estimator = estimators[-1]
  base_estimators = estimators[:-1]

  def fit_predict_proba(estimator, X_fit, y_fit, X_pred):
    # Fit one base estimator and return its predicted probabilities for X_pred
    if hasattr(estimator, 'module__num_factors'): # for factorization machine
      X_fit, X_pred, _ = prepro_for_fm(X_fit, X_pred)
    estimator.fit(X_fit, y_fit)
    return estimator.predict_proba(X_pred)

  kf = KFold(n_splits=n_folds, random_state=random_state, shuffle=shuffle)
  preds = []
  y_valid_list = []
  for train_idx, valid_idx in notebook.tqdm(kf.split(X_train), total=kf.get_n_splits(), desc='Stack_CV'):
    X_train_train, X_valid = X_train[train_idx], X_train[valid_idx]
    y_train_train, y_valid = y_train[train_idx], y_train[valid_idx]

    valid_preds = [fit_predict_proba(est, X_train_train, y_train_train, X_valid) for est in base_estimators]

    # warning: the columns of this matrix are linearly dependent.
    # To get linearly independent columns, drop one class column per base estimator
    preds.append(np.hstack(valid_preds))
    y_valid_list.append(y_valid)

  # Out-of-fold predictions and their labels are the training data of the final estimator
  cv_preds = np.vstack(preds)
  cv_y = np.hstack(y_valid_list)

  final_estimator.fit(cv_preds, cv_y)
  print(' Train score: {}'.format(final_estimator.score(cv_preds, cv_y)))
  print(' Estimated coefficients: {} \n intercept: {}'.format(final_estimator.coef_, final_estimator.intercept_))

  # Refit the base estimators on the whole Train-set and predict on the Test-set
  test_preds = [fit_predict_proba(est, X_train, y_train, X_test) for est in base_estimators]

  test_preds_mat = np.hstack(test_preds) # same column layout (and linear dependence) as cv_preds
  pred_fin = final_estimator.predict_proba(test_preds_mat)
  return pred_fin

In [None]:
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression

# Base estimators
linear = SGDClassifier(**eval(best_cv_df.loc[best_cv_df['model']=='linear', 'best_hyper_param'].values[0]))
svm = SGDClassifier(**eval(best_cv_df.loc[best_cv_df['model']=='svm', 'best_hyper_param'].values[0]))
rf = RandomForestClassifier(**eval(best_cv_df.loc[best_cv_df['model']=='rf', 'best_hyper_param'].values[0]))
xgb = XGBClassifier(**eval(best_cv_df.loc[best_cv_df['model']=='xgb', 'best_hyper_param'].values[0]))
mlp = mlp.set_params(**eval(best_cv_df.loc[best_cv_df['model']=='mlp', 'best_hyper_param'].values[0]))
fm = fm.set_params(**eval(best_cv_df.loc[best_cv_df['model']=='fm', 'best_hyper_param'].values[0]))

estimators = [rf, xgb, mlp]
estimators_name = 'rf_xgb_mlp'

# Final estimator
clf = LogisticRegression(penalty='l2', max_iter=1000, random_state=config['random_state'])

estimators.append(clf)
ensemble_func = stack_clf
ensemble_name = 'stack_ridge' + '_by_' + estimators_name

# Run CV 
X = X_train
y = y_train_trans

res_df = CV_ensemble(ensemble_name, ensemble_func, estimators, X, y, n_folds=5, shuffle=True, random_state=config['random_state'])
best_cv_df = pd.concat([best_cv_df, res_df], ignore_index=True) # DataFrame.append was removed in pandas 2.0

## 2-4. Model Comparison based on CV results including model combination methods

From the figure below, we can observe that the model combination methods outperform the single models in terms of both accuracy and its variance. 
In 5-fold CV, the model combination methods show much more stable performance.

As a result, the 'stack_ridge_by_rf_xgb_mlp' model is chosen as the best model.
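
A box plot shows the comparison visually. The same comparison can also be made numerically by summarizing the fold-wise accuracy per model. The sketch below uses a small made-up frame `cv_df` in the same layout as `best_cv_df` (the model names and values are hypothetical, not results of this notebook).

```python
import pandas as pd

# Hypothetical CV results: one row per (model, fold); values are made up
cv_df = pd.DataFrame({
    'model': ['rf'] * 3 + ['stack'] * 3,
    'accuracy': [0.80, 0.84, 0.82, 0.85, 0.86, 0.87],
})

# Mean and standard deviation of the accuracy over the folds, per model
summary = (cv_df.groupby('model')['accuracy']
                .agg(['mean', 'std'])
                .sort_values('mean', ascending=False))
print(summary)
```

A higher `mean` with a lower `std` corresponds to a box that sits higher and is narrower in the plot.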

In [None]:
fig = px.box(best_cv_df, x='model', y='accuracy', color='model', width=800 )
fig.show()

In [None]:
best_cv_df.to_csv('best_cv_results.csv', index=False)

# 3. Make a prediction with the best model


In [None]:
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression

# Base estimators
linear = SGDClassifier(**eval(best_cv_df.loc[best_cv_df['model']=='linear', 'best_hyper_param'].values[0]))
svm = SGDClassifier(**eval(best_cv_df.loc[best_cv_df['model']=='svm', 'best_hyper_param'].values[0]))
rf = RandomForestClassifier(**eval(best_cv_df.loc[best_cv_df['model']=='rf', 'best_hyper_param'].values[0]))
xgb = XGBClassifier(**eval(best_cv_df.loc[best_cv_df['model']=='xgb', 'best_hyper_param'].values[0]))
mlp = mlp.set_params(**eval(best_cv_df.loc[best_cv_df['model']=='mlp', 'best_hyper_param'].values[0]))
fm = fm.set_params(**eval(best_cv_df.loc[best_cv_df['model']=='fm', 'best_hyper_param'].values[0]))

estimators = [rf, xgb, mlp]
estimators_name = 'rf_xgb_mlp'

# Final estimator
clf = LogisticRegression(penalty='l2', max_iter=1000, random_state=config['random_state'])

estimators.append(clf)
ensemble_func = stack_clf
ensemble_name = 'stack_ridge' + '_by_' + estimators_name

# Train on the whole Train-set and predict on the Test-set
X = X_train
y = y_train_trans

pred_proba = stack_clf(estimators, X, y,  X_test, n_folds=5, shuffle=True, random_state=config['random_state'])
pred = pred_proba.argmax(axis=1)
pred_trans = pred + 1 # shift the 0-based class index back to the original Cover_Type labels

res_df = pd.DataFrame({'Id': test['Id'], 'Cover_Type': pred_trans})
res_df.to_csv('submission.csv', index=False)
print(ensemble_name)