## Introduction

Previous versions of this notebook presented neural network models that differ in their optimizer and initializer. I intended to compare these models with GridSearchCV, but it took a very long time, so I decided to evaluate each model separately. Here are the results:

* **V1** - optimizer: **rmsprop**, initializer: **glorot_uniform** -------> _**0.94749**_
* **V2** - optimizer: **rmsprop**, initializer: **normal** ----------------> _**0.94635**_
* **V3** - optimizer: **rmsprop**, initializer: **uniform** ---------------> _**0.94638**_
* **V4** - optimizer: **adam**,    initializer: **glorot_uniform** ----------> _**0.94871**_
* **V5** - optimizer: **adam**,    initializer: **normal** -------------------> _**0.94897**_
* **V6** - optimizer: **adam**,    initializer: **uniform** ------------------> _**0.94903**_ <-best
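
The scores listed above can also be put into a small table and ranked. This is only an illustrative sketch: the name `results` is an assumption, and the numbers are copied from the list.

```python
import pandas as pd

# Scores copied from the list above; `results` is an illustrative name.
results = pd.DataFrame({
    'version': ['V1', 'V2', 'V3', 'V4', 'V5', 'V6'],
    'optimizer': ['rmsprop'] * 3 + ['adam'] * 3,
    'initializer': ['glorot_uniform', 'normal', 'uniform'] * 2,
    'score': [0.94749, 0.94635, 0.94638, 0.94871, 0.94897, 0.94903],
})

# Rank the versions from the best to the worst score.
ranked = results.sort_values('score', ascending=False).reset_index(drop=True)
best = ranked.iloc[0]
print(ranked)
print(f"Best: {best['version']} ({best['optimizer']}, {best['initializer']})")

# Average score per optimizer.
optimizer_means = results.groupby('optimizer')['score'].mean()
print(optimizer_means)
```

In this table every **adam** variant scores higher than every **rmsprop** variant, while the differences between initializers are much smaller.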
 
The code below covers data preprocessing. It also makes it possible to search for the best neural network model with GridSearchCV and includes other steps related to model evaluation.

## Import necessary libraries and datasets

In [None]:
import math
import numpy as np

import pandas as pd
pd.set_option('display.max_columns', 60)

import seaborn as sns
import matplotlib.pyplot as plt

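import tensorflow as tf
# Import everything from tensorflow.keras to avoid mixing it with the standalone keras package.
from tensorflow.keras.models import Model, Sequential
from tensorflow.keras.layers import Dense, Dropout, BatchNormalization
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau
# Note: this wrapper was deprecated and removed in later Keras releases;
# newer environments may need the separate scikeras package instead.
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier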

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn import metrics
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix, classification_report

In [None]:
INT8_MIN = np.iinfo(np.int8).min
INT8_MAX = np.iinfo(np.int8).max
INT16_MIN = np.iinfo(np.int16).min
INT16_MAX = np.iinfo(np.int16).max
INT32_MIN = np.iinfo(np.int32).min
INT32_MAX = np.iinfo(np.int32).max

FLOAT16_MIN = np.finfo(np.float16).min
FLOAT16_MAX = np.finfo(np.float16).max
FLOAT32_MIN = np.finfo(np.float32).min
FLOAT32_MAX = np.finfo(np.float32).max


def memory_usage(data, detail = 1):
    if detail:
        display(data.memory_usage())
    memory = data.memory_usage().sum() / (1024 * 1024)
    print("Memory usage : {0:.2f}MB".format(memory))
    return memory


def compress_dataset(data):
    memory_before_compress = memory_usage(data, 0)
    print()
    print('=' * 50)
    for col in data.columns:
        col_dtype = data[col][:100].dtype

        if col_dtype != 'object':
            print("Name: {0:24s} Type: {1}".format(col, col_dtype))
            col_series = data[col]
            col_min = col_series.min()
            col_max = col_series.max()

            if col_dtype == 'float64':
                print(" variable min: {0:15s} max: {1:15s}".format(str(np.round(col_min, 4)), str(np.round(col_max, 4))))
                if (col_min > FLOAT16_MIN) and (col_max < FLOAT16_MAX):
                    data[col] = data[col].astype(np.float16)
                    print("  float16 min: {0:15s} max: {1:15s}".format(str(FLOAT16_MIN), str(FLOAT16_MAX)))
                    print("compress float64 --> float16")
                elif (col_min > FLOAT32_MIN) and (col_max < FLOAT32_MAX):
                    data[col] = data[col].astype(np.float32)
                    print("  float32 min: {0:15s} max: {1:15s}".format(str(FLOAT32_MIN), str(FLOAT32_MAX)))
                    print("compress float64 --> float32")
                else:
                    pass
                memory_after_compress = memory_usage(data, 0)
                print("Compress Rate: [{0:.2%}]".format((memory_before_compress-memory_after_compress) / memory_before_compress))
                print('=' * 50)

            if col_dtype == 'int64':
                print(" variable min: {0:15s} max: {1:15s}".format(str(col_min), str(col_max)))
                type_flag = 64
                if (col_min > INT8_MIN / 2) and (col_max < INT8_MAX / 2):
                    type_flag = 8
                    data[col] = data[col].astype(np.int8)
                    print("     int8 min: {0:15s} max: {1:15s}".format(str(INT8_MIN), str(INT8_MAX)))
                elif (col_min > INT16_MIN) and (col_max < INT16_MAX):
                    type_flag = 16
                    data[col] = data[col].astype(np.int16)
                    print("    int16 min: {0:15s} max: {1:15s}".format(str(INT16_MIN), str(INT16_MAX)))
                elif (col_min > INT32_MIN) and (col_max < INT32_MAX):
                    type_flag = 32
                    data[col] = data[col].astype(np.int32)
                    print("    int32 min: {0:15s} max: {1:15s}".format(str(INT32_MIN), str(INT32_MAX)))
                else:
                    pass
                memory_after_compress = memory_usage(data, 0)
                print("Compress Rate: [{0:.2%}]".format((memory_before_compress-memory_after_compress) / memory_before_compress))
                # Report the type that was actually chosen (64 means the column stayed int64).
                if type_flag == 64:
                    print("no compression (int64 kept)")
                else:
                    print("compress (int64) ==> (int{0})".format(type_flag))
                print('=' * 50)

    print()
    memory_after_compress = memory_usage(data, 0)
    print("Compress Rate: [{0:.2%}]".format((memory_before_compress-memory_after_compress) / memory_before_compress))
    
    return data

In [None]:
df_train = pd.read_csv('../input/tabular-playground-series-dec-2021/train.csv')
df_test = pd.read_csv('../input/tabular-playground-series-dec-2021/test.csv')

## Train set summary

Let's see what the train set looks like.

In [None]:
df_train.head()

The `Id` column **is redundant**. Let's remove it.

In [None]:
df_train.drop('Id', axis = 1, inplace = True)

Let's check how big our data is.

In [None]:
print(f'Train set shape:   {df_train.shape}')

Our train set has **4 000 000 rows** and **55 columns**.

Let's find out more about the data.

In [None]:
df_train.info()

All columns consist of **integers**, and the set is huge: it uses a lot of memory.

Let's look at the distribution of each column.

In [None]:
df_train.describe()

Columns `Soil_Type7` and `Soil_Type15` contain only **one value, 0**. They introduce no variability, so we can remove them. All remaining `Soil_Type` columns have **two values, 0 and 1**.

In [None]:
df_train.drop(['Soil_Type7', 'Soil_Type15'], axis = 1, inplace = True)
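
Instead of reading constant columns off the `describe()` output, they can also be detected programmatically. The sketch below uses a made-up toy frame (`toy` and its values are assumptions, not the competition data) and `DataFrame.nunique()`.

```python
import pandas as pd

# Toy frame standing in for df_train; the values are made up.
toy = pd.DataFrame({
    'Soil_Type6': [0, 1, 0, 1],
    'Soil_Type7': [0, 0, 0, 0],
    'Soil_Type15': [0, 0, 0, 0],
    'Elevation': [2500, 3100, 2800, 2950],
})

# A column with a single distinct value carries no information for the model.
n_unique = toy.nunique()
constant_cols = n_unique[n_unique == 1].index.tolist()
print(constant_cols)

# Drop the constant columns and keep the rest.
toy_reduced = toy.drop(columns=constant_cols)
print(toy_reduced.columns.tolist())
```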

Finally, let's check whether the dataset has any missing values.

In [None]:
df_train.isnull().sum().max() != 0

There are **no missing values** in the train dataset.

## Test set summary

We'll carry out exactly the same steps as above.

Let's see what the test set looks like.

In [None]:
df_test.head()

The `Id` column **is redundant**. Let's remove it.

In [None]:
df_test.drop('Id', axis = 1, inplace = True)

Let's check how big our data is.

In [None]:
print(f'Test set shape:   {df_test.shape}')

Our test set has **1 000 000 rows** and **54 columns**.

Let's find out more about the data.

In [None]:
df_test.info()

All columns consist of **integers**, and the set is huge: it uses a lot of memory, just like the train set.

Let's look at the distribution of each column.

In [None]:
df_test.describe()

Columns `Soil_Type7` and `Soil_Type15` contain only **one value, 0**. They introduce no variability, so we can remove them. All remaining `Soil_Type` columns have **two values, 0 and 1**, just like in the train set.

In [None]:
df_test.drop(['Soil_Type7', 'Soil_Type15'], axis = 1, inplace = True)

Finally, let's check whether the dataset has any missing values.

In [None]:
df_test.isnull().sum().max() != 0

There are **no missing values** in the test dataset.

## Target summary

Let's check the number of classes in the target column.

In [None]:
df_train['Cover_Type'].unique()

We have **7** classes. We need to check whether the classes are balanced.

In [None]:
sns.countplot(x = df_train['Cover_Type'])

In [None]:
df_train['Cover_Type'].value_counts()

Unfortunately, **the classes are imbalanced**. Class no. 5 appears only once, and a single sample cannot be shared between both parts of a stratified split, so we'll remove it.

In [None]:
df_train.drop(df_train[df_train['Cover_Type'] == 5].index, axis = 0, inplace = True)
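
The same clean-up can be written in a more general form that removes every class with too few samples. This is a minimal sketch on a made-up toy target (`toy_target` and its counts are assumptions); the threshold of 2 reflects the fact that a stratified split needs at least two samples per class.

```python
import pandas as pd

# Toy target; the labels mimic Cover_Type, but the counts are made up.
toy_target = pd.Series([1, 1, 2, 2, 2, 3, 3, 5], name='Cover_Type')

counts = toy_target.value_counts()
# Classes with fewer than 2 samples cannot appear in both parts of a stratified split.
rare_classes = counts[counts < 2].index.tolist()
print(rare_classes)

# Keep only the rows whose class is not rare.
filtered = toy_target[~toy_target.isin(rare_classes)]
print(filtered.value_counts().sort_index())
```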

## Correlation

In [None]:
non_binary_columns = list(df_train.columns[:10])
sns.heatmap(df_train[non_binary_columns].corr())

There is **no notable correlation between the non-binary features**.
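
A heatmap is read by eye, so it can help to also extract the strongest pairwise correlation as a number. The helper below is a hypothetical sketch (`strongest_correlation` is not part of the notebook) demonstrated on made-up data.

```python
import numpy as np
import pandas as pd

def strongest_correlation(frame):
    """Return (column_a, column_b, r) for the pair with the largest |r|."""
    corr = frame.corr()
    values = corr.to_numpy()
    # Use only the upper triangle (k=1) so the diagonal is ignored
    # and each pair is considered once.
    rows, cols = np.triu_indices_from(values, k=1)
    best = np.argmax(np.abs(values[rows, cols]))
    i, j = rows[best], cols[best]
    return corr.index[i], corr.columns[j], values[i, j]

# Made-up data: column 'c' is an exact negative multiple of column 'a'.
toy = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0, 5.0],
    'b': [2.0, 1.0, 4.0, 3.0, 6.0],
    'c': [-1.0, -2.0, -3.0, -4.0, -5.0],
})
col_a, col_b, r = strongest_correlation(toy)
print(col_a, col_b, r)
```

Applied to `df_train[non_binary_columns]`, a value of `r` close to 0 would support the reading of the heatmap.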

## Standard Scaler

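Let's scale our datasets. The scaler is fitted on the train set and then applied to the test set, so both sets go through the same transformation.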

In [None]:
columns = list(df_test.columns)

scaler = StandardScaler()

df_train[columns] = scaler.fit_transform(df_train[columns])
df_test = pd.DataFrame(scaler.transform(df_test), columns = df_test.columns)
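
To make the scaling step concrete, the sketch below shows on made-up arrays that `transform` reuses the mean and standard deviation learned from the train data. The array names are assumptions for illustration only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy arrays standing in for the train and test features.
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
test = np.array([[2.5, 25.0], [5.0, 50.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from train only
test_scaled = scaler.transform(test)        # reuses the train statistics

# The same transformation written out by hand (population std, ddof=0).
mean = train.mean(axis=0)
std = train.std(axis=0)
manual_test_scaled = (test - mean) / std

print(np.allclose(test_scaled, manual_test_scaled))
print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
```

Only the train set is guaranteed to end up with zero mean and unit variance; the test set is merely expressed in the same units.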

## Memory releasing

The datasets are very large and use a huge amount of memory, so we convert the columns to types that use less memory.

In [None]:
df_train = compress_dataset(df_train)
df_test = compress_dataset(df_test)
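
For comparison, pandas has a built-in way to downcast numeric columns. The sketch below is an alternative to `compress_dataset`, not a replacement for it: `pd.to_numeric` never goes below `float32`, whereas the function above may choose `float16`. The toy frame is made up.

```python
import numpy as np
import pandas as pd

# Toy frame; the columns are created explicitly as int64 / float64.
toy = pd.DataFrame({
    'small_int': np.array([0, 1, 1, 0], dtype=np.int64),
    'large_int': np.array([100, 2500, 40000, 7], dtype=np.int64),
    'real': np.array([0.5, -1.25, 3.0, 2.75], dtype=np.float64),
})
before = toy.memory_usage(deep=True).sum()

for col in toy.columns:
    if pd.api.types.is_integer_dtype(toy[col]):
        # Picks the smallest integer type that can hold all values.
        toy[col] = pd.to_numeric(toy[col], downcast='integer')
    elif pd.api.types.is_float_dtype(toy[col]):
        # Picks float32 when the values fit.
        toy[col] = pd.to_numeric(toy[col], downcast='float')

after = toy.memory_usage(deep=True).sum()
print(toy.dtypes)
print(f'{before} bytes -> {after} bytes')
```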

## Neural network

Let's define our features and target.

In [None]:
X = df_train[columns]
y = df_train['Cover_Type']

We need to convert the target column from a vector of shape (number of rows, 1) to a matrix of shape **(number of rows, number of classes)**. Each column of the new target matrix corresponds to one class, so each row consists of 0s (not this class) and a single 1 (the row's class).

In [None]:
y = pd.get_dummies(y)
y.head()
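
The effect of `pd.get_dummies` is easiest to see on a tiny example. The sketch below uses a made-up target (`toy_y` is an assumption) that, like `Cover_Type` after the clean-up, has no class 5.

```python
import pandas as pd

# Toy target without class 5, mirroring the labels left in Cover_Type.
toy_y = pd.Series([1, 2, 7, 6, 3, 4, 2], name='Cover_Type')

one_hot = pd.get_dummies(toy_y, dtype=int)
print(one_hot.shape)             # (number of rows, number of classes)
print(one_hot.columns.tolist())  # column labels are the original class values

# Each row contains exactly one 1.
row_sums = one_hot.sum(axis=1)
print(row_sums.unique())
```

Note that the column positions run from 0 to 5 while the column labels skip 5, which is why predicted indices have to be mapped back to class values later on.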

Split the data into a train set and a test set.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify = y, test_size = 0.2, random_state = 12)
print(f'X train shape:  {X_train.shape}')
print(f'X test shape:   {X_test.shape}')
print(f'y train shape:  {y_train.shape}')
print(f'y test shape:   {y_test.shape}')
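
The `stratify` argument keeps the class proportions roughly the same in both parts of the split, which matters for imbalanced targets. A minimal sketch on made-up labels (all names here are assumptions):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced labels: 80% class 0, 20% class 1.
labels = np.array([0] * 80 + [1] * 20)
features = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels, stratify=labels, test_size=0.2, random_state=12)

# The share of class 1 stays close to 20% in both parts.
print(len(y_tr), y_tr.mean())
print(len(y_te), y_te.mean())
```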

Let's build the neural network model. We use the GridSearchCV tool to select the best optimizer and initializer.

In [None]:
def model_cnn(optimizer, initializer):
    model = Sequential()
    model.add(Dense(128, activation = 'relu', kernel_initializer = initializer, input_shape = [X.shape[1]]))
    model.add(Dropout(0.3))
    model.add(BatchNormalization())
    model.add(Dense(64, activation = 'relu', kernel_initializer = initializer))
    model.add(Dropout(0.3))
    model.add(BatchNormalization())
    model.add(Dense(32, activation = 'relu', kernel_initializer = initializer))
    model.add(Dropout(0.3))
    model.add(BatchNormalization())
    model.add(Dense(16, activation = 'relu', kernel_initializer = initializer))
    model.add(Dropout(0.3))
    model.add(BatchNormalization())
    model.add(Dense(8, activation = 'relu', kernel_initializer = initializer))
    model.add(Dropout(0.3))
    model.add(BatchNormalization())
    model.add(Dense(6, activation = 'softmax', kernel_initializer = initializer))

    model.compile(optimizer = optimizer, loss = 'categorical_crossentropy', metrics = ['accuracy'])

    return model

early_stop = EarlyStopping(monitor = 'val_accuracy', patience = 10, 
                           verbose = 1, mode = 'max', restore_best_weights = True)
red_lr = ReduceLROnPlateau(monitor = 'val_loss', factor = 0.5, patience = 5, verbose = 1)

In [None]:
model = KerasClassifier(build_fn = model_cnn, epochs = 100, batch_size = 1024, 
                        verbose = 1, callbacks = [early_stop, red_lr], validation_data = (X_test, y_test))

In [None]:
params_grid = dict(optimizer = ['adam', 'rmsprop'],
                   initializer = ['uniform', 'glorot_uniform', 'normal'])

grid_model = GridSearchCV(estimator = model, param_grid = params_grid, 
                          cv = 10, verbose = 0)
# grid_model.fit(X_train, y_train)
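
The grid search is expensive because one network is trained for every parameter combination and every fold. The sketch below only counts the fits with scikit-learn's `ParameterGrid`; it trains nothing, and `cv_folds` is an illustrative name for the `cv = 10` setting.

```python
from sklearn.model_selection import ParameterGrid

params_grid = dict(optimizer=['adam', 'rmsprop'],
                   initializer=['uniform', 'glorot_uniform', 'normal'])
cv_folds = 10

# Enumerate every optimizer/initializer combination in the grid.
combinations = list(ParameterGrid(params_grid))
for combo in combinations:
    print(combo)

# One model is trained per combination and per fold.
total_fits = len(combinations) * cv_folds
print(f'{len(combinations)} combinations x {cv_folds} folds = {total_fits} fits')
```

With the default `refit = True`, GridSearchCV trains one more model on the whole training data at the end, each fit running for up to 100 epochs unless early stopping ends it sooner.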

Results of the models.

In [None]:
# pd.DataFrame(grid_model.cv_results_)[['param_initializer', 'param_optimizer', 'mean_test_score', 'std_test_score', 'rank_test_score']].sort_values('rank_test_score')

In [None]:
# print(f'Best: {grid_model.best_score_} using {grid_model.best_params_}')

Time to predict classes for the test data. We need to map the predicted values in the following way:
* 0 -> 1
* 1 -> 2
* 2 -> 3
* 3 -> 4
* 4 -> 6
* 5 -> 7

because the column indices don't correspond to the class values (class 5 was removed).
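
The mapping listed above can be written either as arithmetic or as a lookup into an array of class values. The sketch below shows on made-up indices that both give the same result; `class_values` and `pred_idx` are illustrative names.

```python
import numpy as np

# Remaining class values after class 5 was removed, in column order.
class_values = np.array([1, 2, 3, 4, 6, 7])

# Hypothetical predicted column indices.
pred_idx = np.array([0, 4, 5, 2, 1, 3])

# Lookup-based mapping: index i becomes class_values[i].
mapped_lookup = class_values[pred_idx]

# Arithmetic mapping: shift by one, then skip the missing class 5.
mapped_shift = pred_idx + 1
mapped_shift[mapped_shift > 4] += 1

print(mapped_lookup)
print(mapped_shift)
```

The lookup variant keeps working if a different class is removed, as long as `class_values` lists the remaining classes in column order.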

In [None]:
# best_model = grid_model.best_estimator_

# preds = best_model.predict(X_test)
# preds += 1
# preds[preds > 4] += 1
# preds

We need to extract the class values from the target matrix. We can do it in two steps. First, we take the index of the column where the 1 appears in each row, and then we map these indices exactly the same way as before.

In [None]:
# y_true = np.argmax(np.array(y_test), axis = 1)
# y_true += 1
# y_true[y_true > 4] += 1
# y_true

Now we can evaluate our model.

## Evaluation

In [None]:
# print(classification_report(y_true, preds))

In [None]:
# cm = confusion_matrix(y_true, preds)
# fig, ax = plt.subplots(figsize = (10,10))
# cmd = ConfusionMatrixDisplay(cm, display_labels = pd.DataFrame(y_true)[0].sort_values().unique())
# cmd.plot(cmap = plt.cm.Blues, ax = ax)
# plt.show()
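
Since the evaluation cells are commented out, here is a small self-contained sketch of the same metrics on made-up labels. The `toy_true` and `toy_pred` arrays are assumptions and have nothing to do with the real model output.

```python
import numpy as np
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Made-up labels for illustration only.
toy_true = np.array([1, 2, 2, 6, 7, 7])
toy_pred = np.array([1, 2, 1, 6, 7, 6])
labels = [1, 2, 6, 7]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(toy_true, toy_pred, labels=labels)
print(cm)

# Accuracy equals the diagonal sum divided by the number of samples.
accuracy_from_cm = np.trace(cm) / cm.sum()
print(accuracy_from_cm, accuracy_score(toy_true, toy_pred))

print(classification_report(toy_true, toy_pred, labels=labels, zero_division=0))
```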

In [None]:
# sns.countplot(x = preds)
# plt.grid()

In [None]:
# model = best_model.fit(X_train, y_train, validation_data = (X_test, y_test), 
#                   verbose = 1, callbacks = [early_stop, red_lr])

In [None]:
# plt.figure(figsize = (15, 5))

# plt.plot(model.history['accuracy'])
# plt.plot(model.history['val_accuracy'])

# plt.title('Model Accuracy', size = 16)
# plt.xlabel('Epoch')
# plt.legend(['Train accuracy', 'Test accuracy'], loc = 4)
# plt.grid()
# plt.show()
           
# plt.figure(figsize = (15, 5))

# plt.plot(model.history['loss'])
# plt.plot(model.history['val_loss'])

# plt.title('Model loss', size = 16)
# plt.xlabel('Epoch')
# plt.legend(['Train loss', 'Test loss'], loc = 7)
# plt.grid()
# plt.show()

## Submission

In [None]:
# sub = pd.read_csv('../input/tabular-playground-series-dec-2021/sample_submission.csv')
# preds = best_model.predict(df_test)
# preds += 1
# preds[preds > 4] += 1
# sub['Cover_Type'] = preds
# sub.head()

In [None]:
# sub.to_csv('gs_best.csv', index = False)