> **Essential ML process for Intrusion Detection**
<br>`NOTE: {python3.8 numpy 1.19.5} are max versions for the June 2022 conda tensorflow 2.2-2.6 builds (at least) - seems like the pip build works with numpy >= 1.20 but pip install breaks the consistency of the conda environment`<br>This was fixed in the Dec.2022 builds

***
**In This Notebook:**<br>
* Model Definitions
* Using Class Weights
* Notes on Using Class Weights
<br><br>

***
**Model Definitions**<br>
* tensorflow.keras "feed forward"
* tensorflow.keras RNN
* tensorflow.keras LSTM
* tensorflow.keras Conv1D

***
**tensorflow.keras "feed forward"**

In [None]:
# shape[0] = rows|observations ; shape[1] = cols|features
# shape for initial input tensor depends on first layer:
#     Dense (Feed Forward|Fully Connected) uses 2D
#     CNN1D, RNN both use 3D (with different semantics for the 3rd dim!)

# Dense initial layer: no need to reshape ... 
shape = (X_train.shape[1],)   # trailing comma: a 1-tuple, not a bare int

In [None]:
X_train.shape, X_test.shape, shape

In [None]:
# Dense layer = Feed Forward|Fully Connected 
# If you don't specify an Activation function, no activation is applied 
#   (ie. "linear" activation: a(x) = x).

# NO Spaces in names
model_name = 'feed_forward'

model = keras.Sequential()
# use the proper shape!
model.add(keras.layers.InputLayer(input_shape=shape, name='optionalLayer'))

model.add(keras.layers.Dense(128, activation='relu', name='InitialLayer'))
model.add(keras.layers.Dense(64, activation='relu', name='mid_Layer'))
model.add(keras.layers.Dense(32, activation='relu', name="mid-Layer"))

# output layers
model.add(keras.layers.Dense(CLASSES, name="OutputLayer"))
model.add(keras.layers.Softmax(name="ResultLayer"))

***
**tensorflow.keras RNN**

In [None]:
# shape[0] = rows|observations ; shape[1] = cols|features
# shape for initial input tensor depends on first layer:
#     Dense (Feed Forward|Fully Connected) uses 2D
#     CNN1D, RNN both use 3D (with different semantics for the 3rd dim!)

# reshape the datasets to 3D
X_train = X_train.values.reshape(X_train.shape[0], X_train.shape[1], 1)
X_test = X_test.values.reshape(X_test.shape[0], X_test.shape[1], 1)

# shape for initial input tensor: CNN1D, RNN 
shape = (X_train.shape[1], X_train.shape[2])

In [None]:
X_train.shape, X_test.shape, shape
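
The 2D to 3D reshape used above can be checked on a small array first. This is a minimal sketch, assuming a NumPy array as a hypothetical stand-in for the feature matrix; the names `X_demo`, `X_demo_3d` and `demo_shape` are illustrative and not part of the notebook.

```python
import numpy as np

# Hypothetical stand-in for the feature matrix: 6 observations, 4 features.
X_demo = np.arange(24, dtype=float).reshape(6, 4)

# Add a trailing axis of size 1: (rows, features) -> (rows, features, 1).
# Each feature then acts as one step holding a single value.
X_demo_3d = X_demo.reshape(X_demo.shape[0], X_demo.shape[1], 1)

# Per-sample input shape: everything except the leading batch axis.
demo_shape = (X_demo_3d.shape[1], X_demo_3d.shape[2])

print(X_demo.shape, X_demo_3d.shape, demo_shape)
```

The reshape only adds an axis; the values themselves are unchanged.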

In [None]:
# A recurrent layer (SimpleRNN, LSTM) that feeds another recurrent layer needs
#    return_sequences=True
# so that it outputs the whole sequence:
# shape of this output is (batch_size, timesteps, units).
# A recurrent layer that feeds a Dense layer can keep the default (False):
# shape of this output is (batch_size, units)
#    where units corresponds to the argument passed to the constructor

model_name = 'RNN'
model = keras.Sequential()
# use the proper shape!
model.add(keras.layers.InputLayer(input_shape = shape))

model.add(keras.layers.SimpleRNN(128, return_sequences=True))
model.add(keras.layers.SimpleRNN(64))

# output layers
model.add(keras.layers.Dense(CLASSES, name="OutputLayer"))
model.add(keras.layers.Softmax(name="ResultLayer"))
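
The output shapes described in the return_sequences comments above can be illustrated with a toy recurrence. This is a sketch in plain NumPy with random weights, not the Keras implementation; the function `simple_rnn_forward` and its parameters are hypothetical names chosen for the example.

```python
import numpy as np

def simple_rnn_forward(x, units, return_sequences=False, seed=0):
    """Toy tanh recurrence; x has shape (batch, timesteps, features)."""
    rng = np.random.default_rng(seed)
    batch, timesteps, features = x.shape
    w_x = rng.normal(scale=0.1, size=(features, units))  # input weights
    w_h = rng.normal(scale=0.1, size=(units, units))     # recurrent weights
    h = np.zeros((batch, units))
    states = []
    for t in range(timesteps):
        # New state depends on the current step and the previous state.
        h = np.tanh(x[:, t, :] @ w_x + h @ w_h)
        states.append(h)
    if return_sequences:
        return np.stack(states, axis=1)  # (batch, timesteps, units)
    return h                             # (batch, units): last step only

x = np.ones((8, 10, 1))                  # 8 observations, 10 steps, 1 value each
seq = simple_rnn_forward(x, units=128, return_sequences=True)
last = simple_rnn_forward(seq, units=64)
print(seq.shape, last.shape)
```

The first call keeps the time axis so that the second recurrent step has a sequence to read; the second call drops it, which is the 2D shape a Dense layer expects.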

***
**tensorflow.keras LSTM**

In [None]:
# shape[0] = rows|observations ; shape[1] = cols|features
# shape for initial input tensor depends on first layer:
#     Dense (Feed Forward|Fully Connected) uses 2D
#     CNN1D, RNN both use 3D (with different semantics for the 3rd dim!)

# reshape the datasets to 3D
X_train = X_train.values.reshape(X_train.shape[0], X_train.shape[1], 1)
X_test = X_test.values.reshape(X_test.shape[0], X_test.shape[1], 1)

# shape for initial input tensor: CNN1D, RNN 
shape = (X_train.shape[1], X_train.shape[2])

In [None]:
X_train.shape, X_test.shape, shape

In [None]:
# A recurrent layer (SimpleRNN, LSTM) that feeds another recurrent layer needs
#    return_sequences=True
# so that it outputs the whole sequence:
# shape of this output is (batch_size, timesteps, units).
# A recurrent layer that feeds a Dense layer can keep the default (False):
# shape of this output is (batch_size, units)
#    where units corresponds to the argument passed to the constructor

model_name = 'lstm'
model = keras.Sequential()
# use the proper shape!
model.add(keras.layers.InputLayer(input_shape = shape))

model.add(keras.layers.LSTM(128, return_sequences=True)) 
model.add(keras.layers.LSTM(64))

# output layers
model.add(keras.layers.Dense(CLASSES, name="OutputLayer"))
model.add(keras.layers.Softmax(name="ResultLayer"))

***
**tensorflow.keras Conv1D**

In [None]:
# shape[0] = rows|observations ; shape[1] = cols|features
# shape for initial input tensor depends on first layer:
#     Dense (Feed Forward|Fully Connected) uses 2D
#     CNN1D, RNN both use 3D (with different semantics for the 3rd dim!)

# reshape the datasets to 3D
X_train = X_train.values.reshape(X_train.shape[0], X_train.shape[1], 1)
X_test = X_test.values.reshape(X_test.shape[0], X_test.shape[1], 1)

# shape for initial input tensor: CNN1D, RNN 
shape = (X_train.shape[1], X_train.shape[2])

In [None]:
X_train.shape, X_test.shape, shape

In [None]:
model_name = 'conv1D'
model = keras.Sequential()
# use the proper shape!
model.add(keras.layers.InputLayer(input_shape = shape))

model.add(keras.layers.Conv1D(filters = 64,
                              kernel_size = 4, strides = 1,
                              padding = 'valid'))
# same keras namespace as the rest of the block (no separate tf import needed)
model.add(keras.layers.LSTM(64))
model.add(keras.layers.Dense(32, activation='relu'))

# output layers
model.add(keras.layers.Dense(CLASSES, name="OutputLayer"))
model.add(keras.layers.Softmax(name="ResultLayer"))
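
The Conv1D layer above shortens the sequence before it reaches the LSTM. The sketch below computes the resulting number of steps for 'valid' and 'same' padding; the helper name `conv1d_output_length` and the figure of 40 input features are assumptions made for illustration.

```python
def conv1d_output_length(steps, kernel_size, strides=1, padding="valid"):
    """Number of output steps of a 1D convolution over `steps` inputs."""
    if padding == "valid":
        # The kernel must fit entirely inside the input.
        return (steps - kernel_size) // strides + 1
    if padding == "same":
        # Input is padded so that only the stride shrinks the output.
        return -(-steps // strides)  # ceiling division
    raise ValueError(f"unsupported padding: {padding!r}")

# Hypothetical input with 40 features, kernel_size=4, strides=1 as in the cell above.
steps_out = conv1d_output_length(40, kernel_size=4, strides=1, padding="valid")
print(steps_out)
```

With 64 filters, the tensor passed on to the LSTM would then have shape (batch_size, steps_out, 64).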

***
**Using Class Weights**

Balanced weighting is one of the widely used methods for imbalanced classification models. It modifies the class weights of the majority and minority classes during the model training process to achieve better model results.

Unlike oversampling and undersampling, balanced weighting does not modify the dataset. Instead, each observation is weighted so that wrong predictions for the minority class count for more when the loss value is calculated during training. The weights for the loss function can be arbitrary, but a typical choice is class weights that are inversely proportional to the frequency of each label.
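
To make the idea concrete, here is a minimal NumPy sketch of a class-weighted cross-entropy. The probabilities, labels and weights are made-up illustrative values, and averaging the weighted losses over the batch is an assumption about the reduction, not a statement of how any particular library does it.

```python
import numpy as np

# Hypothetical predicted probabilities for 4 observations and 2 classes.
probs = np.array([[0.9, 0.1],
                  [0.8, 0.2],
                  [0.7, 0.3],
                  [0.6, 0.4]])
y_true = np.array([0, 0, 0, 1])         # class 1 is the minority class
class_weights = np.array([1.0, 3.0])    # illustrative weights, one per class

# Cross-entropy per observation: -log(probability of the true class).
per_sample = -np.log(probs[np.arange(len(y_true)), y_true])

# Each observation inherits the weight of its class.
sample_w = class_weights[y_true]

unweighted = per_sample.mean()
weighted = (sample_w * per_sample).mean()
print(unweighted, weighted)
```

The dataset is untouched; only the contribution of the misclassified minority observation to the loss grows.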

Let's start with the practical requirements for using class weights in our model, with an explanation of terms and concepts after. The blocks below can be used to make the changes to our typical example neural network. There are four things to take care of:

> 1a. No keras.Softmax() layer at the end of the model definition<br>
1b. Define the loss function with the parameter from_logits=True<br><br>
2a. Calculate the weights<br>
2b. Choose whether we pass the weights to model.fit() or model.compile()  

**1a. No keras.Softmax() layer at the end of the model definition**
<br>Many multi-layer neural networks end with a Softmax() layer, to convert real-valued scores ("logits") to a normalized probability distribution that is more convenient for display to users and passing to other programs. However, <a href="https://www.tensorflow.org/tutorials/quickstart/beginner"> the tensorflow docs</a> point out that it is impossible to provide an exact and numerically stable loss calculation for all models when using a softmax output.
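
The numerical issue can be reproduced with plain NumPy. This sketch only illustrates the general problem with extreme logits; it is not how Keras computes the loss internally, and both function names are hypothetical.

```python
import numpy as np

def loss_via_softmax(logits, label):
    """Naive route: softmax first, then -log of the true-class probability."""
    exp = np.exp(logits)
    probs = exp / exp.sum()
    return -np.log(probs[label])

def loss_from_logits(logits, label):
    """Stable route: log-sum-exp on shifted logits, no explicit probabilities."""
    shifted = logits - logits.max()
    return np.log(np.exp(shifted).sum()) - shifted[label]

logits = np.array([1000.0, 0.0, -1000.0])   # deliberately extreme scores

# Silence the overflow / divide warnings that the naive route triggers.
with np.errstate(over="ignore", invalid="ignore", divide="ignore"):
    naive = loss_via_softmax(logits, label=1)
stable = loss_from_logits(logits, label=1)
print(naive, stable)
```

The naive route overflows and returns a non-finite value, while working directly on the logits gives a finite loss.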

So, step 1a is to modify our model definition block:

In [None]:
# output layers
model.add(keras.layers.Dense(CLASSES, name="OutputLayer"))
# model.add(keras.layers.Softmax(name="ResultLayer"))

**1b. Define the loss function with the parameter from_logits=True**
<br>tf.keras built-in loss functions may be passed by their string identifier or by instantiating a loss class. Instantiating the class lets you pass configuration arguments, while the string identifier is more convenient when all you need are the default parameters.

By default (from_logits=False), the tf.keras cross-entropy loss functions assume that the model output is already a probability distribution, such as the output of a softmax layer. With the Softmax() layer removed, we need to tell the loss function that its inputs are "logits", which means using the class instantiation method.

It is probably easiest to just replace the model.compile() block ...

In [None]:
# OLD

In [None]:
# NEW

In [39]:
# class_weight =   causes error with keras.Softmax as last model layer
#                * with no Softmax, we must tell the loss function to use logits

loss_fn = keras.losses.SparseCategoricalCrossentropy(from_logits=True)
model.compile(loss = loss_fn,
              # loss_weights = loss_wts,
              optimizer = "adam",
              metrics = ['acc','mse']
             )

**2a. Calculate the weights**
<br>Scikit-Learn has a convenient 
<a href="https://scikit-learn.org/stable/modules/generated/sklearn.utils.class_weight.compute_class_weight.html">compute_class_weight</a> function that makes this painless. With class_weight = "balanced", the weights are calculated as <br>
> n_samples / (n_classes * np.bincount(y))

Note that classes and y are keyword-only parameters in recent scikit-learn versions, so they must be passed by name rather than by position.
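
The "balanced" formula can also be evaluated by hand. The sketch below assumes integer labels 0..K-1 with every class present (a requirement of `np.bincount` used this way); `y_demo` is a made-up label vector, not the notebook's `y_train`.

```python
import numpy as np

# Hypothetical labels: 6 observations of class 0, 3 of class 1, 1 of class 2.
y_demo = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 2])

counts = np.bincount(y_demo)     # observations per class
n_samples = len(y_demo)
n_classes = len(counts)

# n_samples / (n_classes * np.bincount(y))
balanced = n_samples / (n_classes * counts)
print(dict(enumerate(balanced)))
```

The rarer a class, the larger its weight, and the weights of all observations add up to the number of samples, so the overall scale of the loss is preserved.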

In [None]:
# add this block
# just above the model.compile() block

In [25]:
import numpy
from sklearn.utils import compute_class_weight

# one weight per class label, in the order returned by numpy.unique()
loss_wts = compute_class_weight(class_weight = "balanced",
                                classes = numpy.unique(y_train),
                                y = y_train
                               )
# model.compile() just needs the array for loss_weights =
# model.fit() requires a dict {class label: weight} for class_weight = 
clas_wts = dict(zip(numpy.unique(y_train), loss_wts))
clas_wts

{0: 0.37587241246176234,
 1: 0.5486863782190237,
 2: 2.1425427458947013,
 3: 21.896193771626297,
 4: 218.20689655172413}

**2b. Choose whether to pass the weights to model.fit() or model.compile()**
<br>Passing a class_weight argument to model.fit() weights the importance of each sample during training according to the class it belongs to. This is typically used when there is an uneven distribution of samples per class.

Passing a loss_weights argument to model.compile() weights the individual loss values when they are combined into the final loss value of the model. This is typically used for models with multiple loss functions (see the note below), but it can also be used for simple multiclass models.

In [None]:
# after the early stopping block
# replace the model.fit() block with this, or just add the line
##     class_weight = clas_wts,

In [None]:
hist = model.fit(X_train, y_train, 
                 epochs=EPOCHS, 
                 batch_size = BATCH_SIZE,
                 # validation_data=(X_test,y_test),
                 validation_split = .15,
                 class_weight = clas_wts,    # no keras.Softmax() layer!
                 # callbacks=[monitor],
                 shuffle = True
                )

_...   Done!   ..._

***
**Notes on Using Class Weights**<br>
* Final Softmax() layer
* Logit vs. Softmax
* loss_weights to model.compile()
* Using Class Weights in Scikit Learn
<br><br>

**Final Softmax() layer**
<br>If the model really must return a probability,
<a href="https://www.tensorflow.org/tutorials/quickstart/beginner"> 
the tensorflow docs</a> suggest wrapping the *trained* model to attach the softmax to it:<br>
> probability_model = tf.keras.Sequential([<br>&emsp;model,<br>&emsp;tf.keras.layers.Softmax()<br>])

This throws an error, suggesting it needs to be rewritten in the keras functional model style ... 
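
One workaround that leaves the trained model untouched is to apply the softmax to the predicted logits afterwards. This is a sketch under assumptions: the `logits` array is a made-up stand-in for what `model.predict()` would return from a model without a Softmax() layer, and the `softmax` helper is written here in NumPy rather than taken from Keras.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Hypothetical logits for 2 observations and 3 classes.
logits = np.array([[ 2.0, 0.5, -1.0],
                   [-0.3, 0.1,  3.2]])

probs = softmax(logits)
print(probs.sum(axis=1), probs.argmax(axis=1))
```

Softmax preserves the ordering of the scores, so the predicted class is the same whether it is read from the logits or from the probabilities.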

**Logit vs. Softmax**
<br>In statistics, the logistic function is a ratio of two exponential expressions, e^x / (e^x + 1), whose graph is the logistic curve. "Sigmoid" refers to any real function whose graph resembles an elongated letter "S", and in practice usually means the logistic function. The inverse of the logistic function is known as the "logit" (logistic unit) function.

In machine learning, the logit function (also known as the log-odds function) calculates the natural log of the odds that an observation belongs to one of the classification categories. By extension, "logits" are the raw, unnormalized scores from the last layer of a classifier: a vector of K real values for each observation that range from negative infinity to infinity, where K equals the number of classes.

The softmax function transforms a vector of K real values into a vector of K positive values that sum to one, which can be interpreted as probabilities. Small or negative inputs become small probabilities, and large inputs become large probabilities.

Sigmoid is used for binary classification with only 2 classes, while softmax applies to multiclass problems; sigmoid can be viewed as a special case of softmax. Used this way, these functions fit a classifier only when the classes are mutually exclusive.
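
The relationship between the two functions can be checked numerically: the sigmoid of a score z equals the first softmax output for the two-element vector [z, 0]. The helper functions below are a minimal NumPy sketch written for this check, and the value of `z` is arbitrary.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def softmax(v):
    """Softmax of a 1D vector, shifted by its maximum for stability."""
    e = np.exp(v - np.max(v))
    return e / e.sum()

z = 1.7                                   # arbitrary score for the positive class
two_class = softmax(np.array([z, 0.0]))   # second class pinned at score 0
print(sigmoid(z), two_class)
```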

When from_logits = True, the loss function takes a vector of ground truth values and a vector of logits and returns a scalar loss for each observation. An extra vector of real-valued loss_weights can easily be applied to the logit values for each observation before the final loss is calculated. 

When from_logits = False (the default in tf.keras), the loss function gets a vector of ground truth values and a vector of "softmaxed" relative probabilities for each class. This "preprocessing" makes it impossible to apply loss_weights and get a properly accurate result. 

In either case, the final loss value is the negative log probability of the true class: the loss is zero if the model is sure of the correct class.
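
For moderate scores the two routes give the same number, which the NumPy sketch below checks on made-up logits. The function names are hypothetical, and the code illustrates the definition of the loss rather than the tf.keras implementation.

```python
import numpy as np

def log_softmax(logits):
    """Log of the softmax, computed without forming the probabilities first."""
    shifted = logits - logits.max()
    return shifted - np.log(np.exp(shifted).sum())

def loss_from_logits(logits, label):
    return -log_softmax(logits)[label]

def loss_from_probs(probs, label):
    return -np.log(probs[label])

logits = np.array([0.2, 1.5, -0.7])       # hypothetical scores, true class = 1
probs = np.exp(log_softmax(logits))       # the same scores after softmax

loss_a = loss_from_logits(logits, 1)
loss_b = loss_from_probs(probs, 1)

# A model that is almost certain of the correct class has a loss close to zero.
confident = loss_from_logits(np.array([0.0, 30.0, 0.0]), 1)
print(loss_a, loss_b, confident)
```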

**loss_weights to model.compile()**
<br>Passing the loss_weights to model.compile() works when the final layer of the model is keras.Softmax(), which can be used in a classifier only when the classes are mutually exclusive.

This functionality is really meant for multilabel classification and models with multiple loss functions [examples exist]. You can assign different levels of importance to the loss values in their contribution to the final loss. 

One common application is multilabel classes (not mutually exclusive) like item {shirt, shoes, socks} and color {black, white, red} to classify "Red Shirt" and "Black Shoes", or determine if an image shows a cat, a dog, or both.

With multiple loss functions, the book Deep Learning with Python says: "This is useful in particular if the loss values use different scales. For instance, the mean squared error (MSE) loss used for the age-regression task typically takes a value around 3–5, whereas the cross-entropy loss used for the gender-classification task can be as low as 0.1. In such a situation, to balance the contribution of the different losses, you can assign a weight of 10 to the crossentropy loss and a weight of 0.25 to the MSE loss."
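
The arithmetic in the quote can be spelled out directly. The loss values below are picked from the ranges mentioned in the quote (an MSE of 4.0 lies in the "3–5" range), and the output names "age" and "gender" are hypothetical labels for the two tasks.

```python
# Illustrative loss values taken from the ranges given in the quote.
losses = {"age": 4.0,       # MSE of the age-regression output
          "gender": 0.1}    # cross-entropy of the gender-classification output

# Weights suggested in the quote: 0.25 for the MSE, 10 for the cross-entropy.
loss_weights = {"age": 0.25, "gender": 10.0}

# The final loss is the weighted sum of the individual losses.
contributions = {name: loss_weights[name] * value for name, value in losses.items()}
total_loss = sum(contributions.values())
print(contributions, total_loss)
```

With these weights both tasks contribute about equally to the final loss, instead of the MSE term dominating by a factor of forty.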


**Using Class Weights in Scikit Learn**
<br>Only some Scikit Learn classifiers can take class_weight as an argument, including:
> <br>##  --  Linear  --  ##<br>
sklearn.linear_model.LogisticRegression<br>
sklearn.linear_model.SGDClassifier<br>
sklearn.linear_model.RidgeClassifier
<br>##  --  Support Vector  --  ##<br>
sklearn.svm.SVC<br>
sklearn.svm.LinearSVC
<br>##  --  Non-linear  --  ##<br>
sklearn.tree.DecisionTreeClassifier
<br>##  --  Ensemble: bagging  --  ##<br>
sklearn.ensemble.RandomForestClassifier<br><br>

 ***