# Introduction

While feed-forward networks (FFNs) with batch normalization have great potential to exploit the many levels of abstract representation that a deep network provides, in practice their depth is limited by training with stochastic gradient descent (SGD). Beyond a certain number of layers, training becomes unstable and the network runs into problems such as vanishing and exploding gradients. Moreover, SGD and regularization techniques such as dropout perturb batch normalization, which leads to high variance in the training error. Self-normalizing neural networks are designed to address these problems.

Self-Normalizing Neural Networks (SNNs) are neural networks that automatically keep their activations close to zero mean and unit variance (per neuron). This is accomplished by using the SELU (scaled exponential linear unit) activation function together with LeCun normal kernel initialization.
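
To make the self-normalizing property concrete, here is a minimal NumPy sketch (not code from the paper or from Keras). It assumes standardized random inputs, a hypothetical width of 512 units and 32 layers, and draws weights from a plain normal distribution with variance 1/fan_in as a stand-in for the LeCun normal initializer, which in Keras is a truncated normal. `ALPHA` and `SCALE` are the SELU constants given in the paper; the helper names `selu`, `lecun_normal_weights`, and `propagate` are illustrative.

```python
import numpy as np

# SELU constants from the SNN paper (alpha and lambda), in double precision.
ALPHA = 1.6732632423543772
SCALE = 1.0507009873554805


def selu(x):
    """SELU: SCALE * x for x > 0, SCALE * ALPHA * (exp(x) - 1) otherwise."""
    # np.minimum keeps expm1 from overflowing on large positive inputs.
    return SCALE * np.where(x > 0, x, ALPHA * np.expm1(np.minimum(x, 0.0)))


def lecun_normal_weights(rng, fan_in, fan_out):
    """Zero-mean normal weights with variance 1 / fan_in (assumption: untruncated)."""
    return rng.normal(0.0, np.sqrt(1.0 / fan_in), size=(fan_in, fan_out))


def propagate(x, n_layers, width, rng):
    """Pass x through n_layers random dense SELU layers (no biases, no training)."""
    stats = []
    for _ in range(n_layers):
        x = selu(x @ lecun_normal_weights(rng, x.shape[1], width))
        stats.append((x.mean(), x.var()))
    return x, stats


rng = np.random.default_rng(0)
inputs = rng.normal(size=(1024, 512))  # synthetic standardized inputs
_, layer_stats = propagate(inputs, n_layers=32, width=512, rng=rng)

# Report the activation statistics at a few depths.
for depth in (1, 8, 16, 32):
    mean, var = layer_stats[depth - 1]
    print(f"layer {depth:2d}: mean={mean:+.3f}, variance={var:.3f}")
```

In a run like this, the printed mean should stay near 0 and the variance near 1 at every depth, which is the behaviour the paper describes. The exact numbers depend on the seed and the layer width.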

The following excerpt is from the [research paper](https://arxiv.org/pdf/1706.02515.pdf) on self-normalizing neural networks:

> Self-normalizing neural networks (SNNs) are robust to perturbations and do not have high variance
in their training errors. SNNs push neuron activations to zero mean and unit variance
thereby leading to the same effect as batch normalization, which enables to robustly learn many
layers. SNNs are based on scaled exponential linear units “SELUs” which induce self-normalizing
properties like variance stabilization which in turn avoids exploding and vanishing gradients.

# Code

## Some preprocessing

In [None]:
import numpy as np
import pandas as pd

In [None]:
train_df = pd.read_csv("../input/tabular-playground-series-nov-2021/train.csv")
test_df = pd.read_csv("../input/tabular-playground-series-nov-2021/test.csv")
sub_df = pd.read_csv("../input/tabular-playground-series-nov-2021/sample_submission.csv")

Separating features and targets

In [None]:
train_df.drop(columns=["id"], inplace=True)
test_df.drop(columns=["id"], inplace=True)

X = train_df.drop(columns=["target"]).values
y = train_df["target"].values

## Building SNN Model

Notice that the **[LeCun Normal](https://www.tensorflow.org/api_docs/python/tf/keras/initializers/LecunNormal)** kernel initializer is used instead of the default one (Glorot uniform). Although this network does not contain dropout layers, deep networks with a large number of neurons often include [dropout](https://keras.io/api/layers/regularization_layers/dropout/) layers. The authors of the SNN paper advise against using standard dropout in an SNN, because it disturbs the mean and variance that SELU maintains. Instead, they propose a variant called **alpha dropout**, which is designed to preserve them. **[Alpha dropout](https://keras.io/api/layers/regularization_layers/alpha_dropout/)** is available as a layer in Keras.
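
Why standard dropout is discouraged can be illustrated numerically. The sketch below is a NumPy re-implementation of the alpha dropout rule described in the paper, written for illustration only; it is not the Keras layer itself, and the function names and the dropout rate of 0.2 are assumptions. Dropped units are set to the SELU saturation value −λα instead of 0, and an affine correction then restores zero mean and unit variance.

```python
import numpy as np

# SELU constants from the SNN paper.
ALPHA = 1.6732632423543772
SCALE = 1.0507009873554805
ALPHA_PRIME = -SCALE * ALPHA  # value SELU approaches for very negative inputs


def standard_dropout(x, rate, rng):
    """Inverted dropout: zero out units and rescale the survivors."""
    keep = rng.random(x.shape) < (1.0 - rate)
    return np.where(keep, x / (1.0 - rate), 0.0)


def alpha_dropout(x, rate, rng):
    """Alpha dropout: set dropped units to ALPHA_PRIME, then apply a * x + b."""
    keep_prob = 1.0 - rate
    keep = rng.random(x.shape) < keep_prob
    # a and b are chosen so that zero-mean, unit-variance input stays that way.
    a = (keep_prob * (1.0 + rate * ALPHA_PRIME**2)) ** -0.5
    b = -a * ALPHA_PRIME * rate
    return a * np.where(keep, x, ALPHA_PRIME) + b


rng = np.random.default_rng(0)
activations = rng.normal(size=1_000_000)  # synthetic zero-mean, unit-variance input

std_out = standard_dropout(activations, rate=0.2, rng=rng)
alpha_out = alpha_dropout(activations, rate=0.2, rng=rng)

print(f"standard dropout: mean={std_out.mean():+.3f}, variance={std_out.var():.3f}")
print(f"alpha dropout:    mean={alpha_out.mean():+.3f}, variance={alpha_out.var():.3f}")
```

With a rate of 0.2, standard dropout inflates the variance to about 1/0.8 = 1.25, while alpha dropout keeps it close to 1. In a Keras model, the built-in alpha dropout layer linked above is the practical way to apply this.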

Although the SNN built here has only three hidden layers, many more layers can be stacked in an SNN. The first hidden layer has 128 neurons, and the second and third hidden layers have 32 neurons each.

In [None]:
from tensorflow.keras import layers
from tensorflow.keras.models import Sequential


def build_model():
    model = Sequential([
        layers.Dense(units=128, activation="selu", kernel_initializer="lecun_normal", input_shape=X.shape[1:]),
        layers.Dense(units=32, activation="selu", kernel_initializer="lecun_normal"),
        layers.Dense(units=32, activation="selu", kernel_initializer="lecun_normal"),
        layers.Dense(units=1, activation="sigmoid")
    ])

    model.compile(
        optimizer="adam",
        loss="binary_crossentropy",
        metrics=["AUC"]
    )

    return model


build_model().summary()

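Defining the callbacks: `ReduceLROnPlateau` multiplies the learning rate by 0.8 when the validation loss has not improved for 10 epochs, and `EarlyStopping` ends training after 60 epochs without improvement and restores the weights from the best epoch.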

In [None]:
from tensorflow.keras.callbacks import ReduceLROnPlateau, EarlyStopping


reduce_lr = ReduceLROnPlateau(
    monitor="val_loss",
    factor=0.8,
    patience=10,
)

early_stop = EarlyStopping(
    monitor="val_loss",
    patience=60,
    restore_best_weights=True
)

callbacks = [reduce_lr, early_stop]
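
How the learning rate evolves under these settings can be sketched in plain Python. The function below is a simplified stand-in for the plateau rule, not the Keras implementation: it ignores `min_delta`, `cooldown`, and `min_lr`, and it assumes Adam's default starting learning rate of 0.001. The name `simulate_plateau_schedule` and the made-up loss sequence are purely illustrative.

```python
def simulate_plateau_schedule(val_losses, initial_lr=1e-3, factor=0.8, patience=10):
    """Return the learning rate after each epoch under a simplified plateau rule.

    The rate is multiplied by `factor` whenever `patience` consecutive epochs
    pass without a new best validation loss.
    """
    lr = initial_lr
    best = float("inf")
    wait = 0
    schedule = []
    for loss in val_losses:
        if loss < best:
            # New best loss: remember it and reset the stagnation counter.
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                lr *= factor
                wait = 0
        schedule.append(lr)
    return schedule


# Hypothetical run: five improving epochs followed by 30 stagnant ones.
losses = [1.0, 0.9, 0.8, 0.7, 0.6] + [0.6] * 30
schedule = simulate_plateau_schedule(losses)

print(f"{schedule[-1]:.6f}")  # 0.000512, i.e. 0.001 * 0.8**3
```

Thirty stagnant epochs trigger three reductions. With an early-stopping patience of 60, a fully stagnant run would see only a handful of such reductions before training ends.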

## Training Model

The model is validated with `StratifiedKFold` cross-validation using 7 folds. A large batch size of 2048 is used to speed up training.

In [None]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold


EPOCHS = 500
BATCH_SIZE = 2048
FOLDS = 7

cv = StratifiedKFold(n_splits=FOLDS, shuffle=True, random_state=42)
test_preds = []
mean_score = 0

for fold, (train_idx, val_idx) in enumerate(cv.split(X, y)):
    X_train, y_train = X[train_idx], y[train_idx]
    X_val, y_val = X[val_idx], y[val_idx]

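    # Fit the scaler on the training fold only, so that no statistics
    # leak from the validation or test data.
    scaler = MinMaxScaler()

    X_train = scaler.fit_transform(X_train)
    X_val = scaler.transform(X_val)
    # Pass a NumPy array, matching how the scaler was fitted.
    X_test = scaler.transform(test_df.values)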

    model = build_model()

    model.fit(
        X_train,
        y_train,
        validation_data=(X_val, y_val),
        epochs=EPOCHS,
        batch_size=BATCH_SIZE,
        callbacks=callbacks,
        verbose=0
    )

    y_pred = model.predict(X_val)
    score = roc_auc_score(y_val, y_pred)
    mean_score += score

    print(f"FOLD {fold} | Score: {score}")

    test_preds.append(model.predict(X_test))


print()
print(f"Mean score of all folds: {mean_score/FOLDS}")
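
What stratification does can be checked on toy data. The sketch below uses synthetic labels with a 30% positive rate (an assumption for illustration, not the competition data) and the same `StratifiedKFold` settings as above, to show that each validation fold keeps the overall class ratio.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in data: 700 rows, 30% positives (not the competition data).
X_demo = np.zeros((700, 3))
y_demo = np.array([1] * 210 + [0] * 490)

cv_demo = StratifiedKFold(n_splits=7, shuffle=True, random_state=42)

# Record the share of positive labels in each validation fold.
fold_rates = []
for fold, (train_idx, val_idx) in enumerate(cv_demo.split(X_demo, y_demo)):
    rate = y_demo[val_idx].mean()
    fold_rates.append(rate)
    print(f"fold {fold}: {len(val_idx)} validation rows, positive rate {rate:.3f}")
```

Because every fold sees about the same class balance, the per-fold AUC scores are more comparable than they would be with an unstratified split.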

In [None]:
# Average the per-fold test predictions and flatten them to one value per row.
sub_df["target"] = np.mean(test_preds, axis=0).ravel()
sub_df.to_csv("submission.csv", index=False)

sub_df.head()