Building on the very good notebook by https://www.kaggle.com/mlanhenke (https://www.kaggle.com/mlanhenke/tps-11-nn-baseline-keras), which has a clean and easy-to-understand structure, I prepared a few variations to further help the participants of this competition.

The improvements so far are:

* deterministic random seeding
* enabling TPU usage just by switching accelerator type
* adding mish and gelu activations
* model summary and plot


How the network is organized in layers, and the details of its architecture, can make an even bigger difference. Technically speaking, the architecture determines the representational capacity of the deep neural network: depending on the layers you use, the network may or may not be able to read and process all the information available in the data.

It is said that the best practices deep learning practitioners follow to assemble well-performing DNNs depend mainly on:

* Relying on pre-trained models (so you have to be very knowledgeable about the available solutions)
* Reading cutting-edge papers
* Copying from top Kaggle Kernels, including those from previous competitions
* Trial and error
* Ingenuity & luck

Actually, a famous lecture by Prof. Geoffrey Hinton (video: https://www.youtube.com/watch?v=i0cKa0di_lo, slides: https://www.cs.toronto.edu/~hinton/coursera/lecture16/lec16.pdf) explains that you can achieve similar, and often better, results using automated methods such as Bayesian optimization. Bayesian optimization also keeps you from getting stuck because you cannot figure out the best combination of hyper-parameters among so many possible values.

In this notebook I show how to tune the architecture of a neural network, i.e. neural architecture search (NAS), using KerasTuner.

KerasTuner (https://keras.io/keras_tuner/) was announced as “flexible and efficient hyperparameter tuning for Keras models” by Francois Chollet, the creator of Keras. Chollet also created a series of notebooks for Kaggle competitions to showcase how KerasTuner works and what it can do:

* https://www.kaggle.com/fchollet/keras-kerastuner-best-practices for the Digit Recognizer dataset

* https://www.kaggle.com/fchollet/titanic-keras-kerastuner-best-practices for the Titanic dataset

* https://www.kaggle.com/fchollet/moa-keras-kerastuner-best-practices for the Mechanisms of Action (MoA) Prediction competition

The recipe proposed by Chollet for running KerasTuner is made of simple steps starting from your existing Keras model:
* Wrap your model in a function with hp as first parameter
* Define hyper-parameters at the beginning of the function
* Replace DNN static values with hyper-parameters
* Create block branches as hyper-parameters (alternative paths)
* Define dynamically hyper-parameters as you build the network

In the following notebook you can find all of this implemented in an easy way. Just copy it and change the *tunable_model* function according to the experiments you want to run.

In [None]:
import os
import numpy as np
import pandas as pd
import random

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import StandardScaler,MinMaxScaler
from sklearn.model_selection import StratifiedKFold, train_test_split

from warnings import filterwarnings
filterwarnings('ignore')
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '1'

In [None]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import Sequential, Model
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau
from tensorflow.keras.layers import Dense, Flatten, Input, Concatenate, Dropout
from tensorflow.keras.utils import plot_model

import tensorflow_addons as tfa

import keras_tuner as kt

In [None]:
def seed_everything(seed):
    """
    Seeds basic parameters for reproducibility of results
    """
    random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    np.random.seed(seed)
    tf.random.set_seed(seed)
    
SEED = 3024
seed_everything(SEED)
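
What seeding buys you can be checked without TensorFlow. The following stdlib-only sketch (the helper name `draw_numbers` is made up for illustration) reseeds Python's `random` module and confirms that the same seed reproduces the same sequence:

```python
import random


def draw_numbers(seed, n=5):
    """Reseed Python's global RNG and draw n pseudo-random floats."""
    random.seed(seed)
    return [random.random() for _ in range(n)]


first = draw_numbers(3024)
second = draw_numbers(3024)
other = draw_numbers(3025)

print(first == second)  # same seed, same sequence
print(first == other)   # a different seed gives a different sequence
```

NumPy and TensorFlow keep their own generators, which is why `seed_everything` seeds each of them separately. Some accelerator operations may remain non-deterministic even after seeding.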

In [None]:
def gelu(x):
    return 0.5 * x * (1 + tf.tanh(tf.sqrt(2 / np.pi) * (x + 0.044715 * tf.pow(x, 3))))

class Mish(keras.layers.Activation):
    '''
    Mish Activation Function.
    see: https://github.com/digantamisra98/Mish/blob/master/Mish/TFKeras/mish.py
    .. math::
        mish(x) = x * tanh(softplus(x)) = x * tanh(ln(1 + e^{x}))
    Shape:
        - Input: Arbitrary. Use the keyword argument `input_shape`
        (tuple of integers, does not include the samples axis)
        when using this layer as the first layer in a model.
        - Output: Same shape as the input.
    Examples:
        >>> X = Activation('Mish', name="conv1_act")(X_input)
    '''

    def __init__(self, activation, **kwargs):
        super(Mish, self).__init__(activation, **kwargs)
        self.__name__ = 'Mish'

def mish(inputs):
    return inputs * tf.math.tanh(tf.math.softplus(inputs))

keras.utils.get_custom_objects().update({'gelu': keras.layers.Activation(gelu)})
keras.utils.get_custom_objects().update({'mish': Mish(mish)})
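
As a quick sanity check of the two formulas, the sketch below evaluates them on plain floats using only the standard library. The function names `gelu_tanh`, `gelu_exact` and `mish_scalar` are illustrative and are not part of the notebook. It compares the tanh approximation used in `gelu` above with the exact GELU definition, x * Phi(x):

```python
import math


def gelu_tanh(x):
    """Tanh approximation of GELU, the same formula as gelu() above."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


def gelu_exact(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def mish_scalar(x):
    """mish(x) = x * tanh(softplus(x)) = x * tanh(ln(1 + e^x))."""
    return x * math.tanh(math.log1p(math.exp(x)))


grid = [i / 10 for i in range(-50, 51)]  # -5.0 ... 5.0
max_gap = max(abs(gelu_tanh(x) - gelu_exact(x)) for x in grid)
print(f"largest gap between tanh and exact GELU on the grid: {max_gap:.6f}")
print(f"mish(1.0) = {mish_scalar(1.0):.4f}")
```

On this grid the two GELU variants stay very close, so the approximation is a reasonable stand-in when an error function is inconvenient.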

In [None]:
def plot_history(history, start=0, fold=0):
    epochs = np.arange(len(history.history['loss']))
    plt.figure(figsize=(15,5))
    plt.plot(epochs[start:], history.history['loss'][start:], 
             label='train', color='blue')
    plt.plot(epochs[start:], history.history['val_loss'][start:], 
             label='validation', color='red')
    plt.ylabel('Loss',size=14)
    plt.title(f"Fold: {fold+1}")
    plt.legend()
    plt.show()

In [None]:
# Loading data
df_train = pd.read_csv('../input/tabular-playground-series-nov-2021/train.csv')

# Preparing training data
X = df_train.drop(columns=['id','target']).copy()
y = df_train['target'].copy()

# Data scaling: min-max normalization of every feature to [0, 1]
scaler = MinMaxScaler(feature_range=(0, 1))
X = pd.DataFrame(columns=X.columns, data=scaler.fit_transform(X))

del(df_train)
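
`MinMaxScaler` with `feature_range=(0, 1)` rescales each column with (x - min) / (max - min). The sketch below reproduces that arithmetic on a single list in plain Python. It is an illustration only, not scikit-learn's implementation, and the name `min_max_scale` is hypothetical:

```python
def min_max_scale(column):
    """Rescale a list of numbers to [0, 1] via (x - min) / (max - min)."""
    lo, hi = min(column), max(column)
    span = hi - lo
    if span == 0:
        # Constant column: avoid dividing by zero and map every value to 0.0.
        return [0.0 for _ in column]
    return [(x - lo) / span for x in column]


raw = [2.0, 4.0, 6.0, 10.0]  # made-up feature values
scaled = min_max_scale(raw)
print(scaled)
```

Unlike standardization (zero mean, unit variance), this only changes the range of each feature and leaves the shape of its distribution untouched.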

In [None]:
### define callbacks
early_stopping = EarlyStopping(
    monitor='val_auc', 
    min_delta=0, 
    patience=5, 
    verbose=0,
    mode='max',  # val_auc is higher-is-better
    baseline=None, 
    restore_best_weights=True
)

reduce_lr = ReduceLROnPlateau(
    monitor='val_auc', 
    factor=0.2,
    patience=2,
    mode='max'  # val_auc is higher-is-better
)
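
Because AUC is a higher-is-better metric, the direction given in `mode` decides what counts as an improvement. The following is a simplified sketch of patience-based early stopping in plain Python. It is not the Keras implementation, and the function name and the AUC values are made up for illustration:

```python
def stopping_epoch(values, patience, mode):
    """Return the epoch index at which training would stop, or None.

    Simplified early stopping: stop once the monitored value has failed to
    improve for `patience` consecutive epochs.
    """
    best = None
    wait = 0
    for epoch, value in enumerate(values):
        if mode == "max":
            improved = best is None or value > best
        else:
            improved = best is None or value < best
        if improved:
            best, wait = value, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


val_auc = [0.70, 0.72, 0.74, 0.73, 0.73, 0.72]  # made-up values
print(stopping_epoch(val_auc, patience=2, mode="max"))
print(stopping_epoch(val_auc, patience=2, mode="min"))
```

With these values the `"max"` run stops at epoch index 4, two epochs after the peak, while the `"min"` run stops at index 2 while the AUC is still rising, because every increase is read as a failure to improve.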

In [None]:
def tunable_model(hp):
    
    layer_1_size = hp.Int('layer_1_size', min_value=2, max_value=512, step=2)
    layer_2_size = hp.Int('layer_2_size', min_value=2, max_value=512, step=2)
    layer_3_size = hp.Int('layer_3_size', min_value=2, max_value=512, step=2)
    layer_4_size = hp.Int('layer_4_size', min_value=2, max_value=512, step=2)
    layer_5_size = hp.Int('layer_5_size', min_value=2, max_value=512, step=2)
    
    hidden_layers = hp.Int('hidden_layers', min_value=1, max_value=5, step=1)
    dense_dropout = hp.Float('dense_dropout', min_value=0, max_value=0.5, step=0.05)
    skip_layers = hp.Boolean('skip_layers')
    
    activation = hp.Choice('activation', values=['relu', 'swish', 'elu', 'mish', 'gelu'])
    
    inputs_sequence = Input(shape=(X.shape[1],))
    x = Flatten()(inputs_sequence)

    skips = list()
    layers = [layer_1_size, layer_2_size, layer_3_size, layer_4_size,layer_5_size][:hidden_layers]
    
    for layer, nodes in enumerate(layers):
        x = Dense(nodes, activation=activation)(x)
        x = Dropout(dense_dropout)(x)
        if layer != (len(layers) - 1) and skip_layers is True:
            skips.append(x)
    
    if len(skips) > 0:
        x = Concatenate(axis=1)([x] + skips)
        
    output_class = Dense(1, activation='sigmoid')(x)

    model = Model(inputs=inputs_sequence, outputs=output_class)
    
    model.compile(
        keras.optimizers.Adam(learning_rate=0.001),
        loss='binary_crossentropy',
        metrics=[keras.metrics.AUC(name='auc')])  # fixed name so 'val_auc' is always logged
    
    return model
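
To see what the `skip_layers` switch does to the network, it helps to compute the width of the tensor that reaches the final sigmoid layer. This small sketch mirrors the wiring of `tunable_model` in plain Python. The helper name and the layer sizes are illustrative, not tuned results:

```python
def final_dense_input_width(layer_sizes, hidden_layers, skip_layers):
    """Width of the tensor fed to the sigmoid output layer.

    Mirrors the wiring in tunable_model: only the first `hidden_layers`
    sizes are used, and with skip connections the outputs of all hidden
    layers are concatenated.
    """
    used = layer_sizes[:hidden_layers]
    if skip_layers and len(used) > 1:
        # [x] + skips: the last layer plus every earlier layer's output.
        return sum(used)
    return used[-1]


sizes = [64, 32, 16, 8, 4]  # illustrative values
print(final_dense_input_width(sizes, hidden_layers=3, skip_layers=True))
print(final_dense_input_width(sizes, hidden_layers=3, skip_layers=False))
print(final_dense_input_width(sizes, hidden_layers=1, skip_layers=True))
```

With skip connections the output layer sees every hidden representation at once (64 + 32 + 16 units here) instead of only the last one. With a single hidden layer the switch has no effect.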

In [None]:
try:
    # detect and init the TPU
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver.connect()
    # instantiate a distribution strategy
    tf_strategy = tf.distribute.TPUStrategy(tpu)
    print("Running on TPU:", tpu.master())
except Exception:
    # no TPU available: fall back to the default strategy (CPU or GPU)
    tf_strategy = tf.distribute.get_strategy()
    print(f"Running on {tf_strategy.num_replicas_in_sync} replicas")
    print("Number of CPUs Available: ", len(tf.config.list_physical_devices('CPU')))
    print("Number of GPUs Available: ", len(tf.config.list_physical_devices('GPU')))

In [None]:
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, stratify=y, random_state=SEED)

In [None]:
with tf_strategy.scope():
    tuner = kt.BayesianOptimization(hypermodel=tunable_model,
                                    objective=kt.Objective("val_auc", direction="max"),
                                    max_trials=60,
                                    num_initial_points=3,
                                    directory='storage',
                                    project_name='tps11',
                                    seed=101)

    tuner.search(X_train, y_train, 
                 epochs=1000,
                 batch_size=1024, 
                 validation_data=(X_valid, y_valid),
                 shuffle=True,
                 verbose=1,
                 callbacks = [early_stopping, reduce_lr]
                 )
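
A rough count of the search space shows why an exhaustive grid search is out of the question and a budget of 60 guided trials is used instead. The sketch assumes that both bounds of each range are inclusive and that values fall on the step grid. The helper `count_steps` is hypothetical:

```python
def count_steps(min_value, max_value, step):
    """Number of grid points from min_value to max_value inclusive."""
    return round((max_value - min_value) / step) + 1


layer_size_options = count_steps(2, 512, 2)     # each of the 5 layer sizes
hidden_layer_options = count_steps(1, 5, 1)
dropout_options = count_steps(0.0, 0.5, 0.05)
skip_options = 2                                # hp.Boolean
activation_options = 5                          # relu, swish, elu, mish, gelu

total = (layer_size_options ** 5 * hidden_layer_options * dropout_options
         * skip_options * activation_options)

print(f"{total:,} raw combinations vs. 60 trials")
```

This is a raw count: sizes of layers beyond `hidden_layers` do not change the model, so the number of distinct architectures is smaller, but it remains far beyond what could be tried one by one.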

In [None]:
best_hps = tuner.get_best_hyperparameters()[0]
model = tuner.hypermodel.build(best_hps)

In [None]:
print(best_hps.values)

In [None]:
model.summary()

In [None]:
plot_model(
    model, 
    to_file='baseline.png', 
    show_shapes=True,
    show_layer_names=True
)