 <div>
<img src="https://edlitera-images.s3.amazonaws.com/new_edlitera_logo.png" width="500"/>
</div>

# `Training models in Keras`

* a two-step process: first compile the model, then fit it

* we will use the `fit()` method to train our model

* before we **fit** a model, we must **compile it**

    * this will allow us to decide which optimizer, which loss function and which metrics to use for our model

# `Compiling models in Keras`

* performed using the `compile()` method

* additional information needed before training our neural network (up until now we have only defined the structure of our network and nothing else)

* when compiling models, there are two parameters we must define:

    * **`optimizer`**
    * **`loss`** (the loss function)


* it is also a good idea to always define the **`metrics`** parameter
    * it allows us to specify which metrics the model should evaluate during training and testing
    
    
* other parameters we can set **(for beginners it is best if they just leave them at default values)**:

    * **loss_weights** 
    * **weighted_metrics** 
    * **run_eagerly** 
    * **kwargs** 
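
A minimal sketch of a `compile()` call with the two required parameters and the recommended `metrics` parameter. The small two-feature model is a hypothetical stand-in, and the string identifiers (`"adam"`, `"binary_crossentropy"`, `"accuracy"`) are a shorthand that selects each object with its default settings:

```python
import keras
from keras import layers

# Hypothetical two-feature binary classifier, used only to have something to compile
model = keras.Sequential([
    keras.Input(shape=(2,)),
    layers.Dense(8, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])

# Required: optimizer and loss; recommended: metrics.
# String identifiers select the default configuration of each object.
model.compile(
    optimizer="adam",
    loss="binary_crossentropy",
    metrics=["accuracy"],
)

print(type(model.optimizer).__name__)
```

Passing objects instead of strings (for example `Adam(learning_rate=...)`) is the way to go when the defaults need to be changed.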

## `Optimizing models`

* to optimize neural networks we use a technique known as gradient descent

### `Gradient Descent`

* a generic optimization algorithm

* used to optimize neural networks

* it starts with some random values for each weight $w_i$ and iteratively adjusts these values in order to minimize a **`cost function`** until the algorithm **converges to a minimum**

* an important parameter that modifies gradient descent is the **learning rate**

<center><img src="https://edlitera-images.s3.amazonaws.com/gradient_descent.png" width="700">

source:
<br>
https://morioh.com/p/15c995420be6
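
The iterative adjustment described above can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: the cost function $(w - 3)^2$, the single weight `w`, the fixed starting value and the step count are all made up for the example:

```python
def cost(w):
    # Toy cost function with its minimum at w = 3
    return (w - 3.0) ** 2


def gradient(w):
    # Derivative of the cost with respect to w
    return 2.0 * (w - 3.0)


w = 0.0              # starting value (a fixed stand-in for a random initialization)
learning_rate = 0.1  # step size

for step in range(50):
    w = w - learning_rate * gradient(w)   # move against the gradient

print(w, cost(w))
```

Each step moves `w` a little in the direction that lowers the cost, so after enough steps `w` ends up very close to 3.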

#### `Learning rate`

* denoted as $\eta$

* typically a small positive value between 0 and 1

* determines the magnitude of changes for each of the model parameters in each iteration

* if it's too small, it will take a long time for gradient descent to converge to a minimum (to get to the bottom)


* if it's too large, gradient descent might keep overshooting the minimum and fail to converge
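
To make the effect of the learning rate concrete, here is a small sketch that runs the same number of gradient descent steps with three different values of $\eta$. The toy cost $(w - 3)^2$, the helper `run_gradient_descent` and the chosen values are assumptions made for illustration:

```python
def run_gradient_descent(learning_rate, steps=20, start=0.0):
    """Minimize the toy cost (w - 3)^2 and return the final value of w."""
    w = start
    for _ in range(steps):
        grad = 2.0 * (w - 3.0)          # gradient of (w - 3)^2
        w = w - learning_rate * grad
    return w


# Distance from the minimum (w = 3) after 20 steps for three learning rates
errors = {eta: abs(run_gradient_descent(eta) - 3.0) for eta in (0.01, 0.1, 1.0)}

for eta, error in errors.items():
    print(f"eta={eta}: distance from minimum = {error:.4f}")
```

With the smallest value the weight has barely moved after 20 steps, with the middle value it is close to the minimum, and with the largest value it jumps back and forth over the minimum without getting any closer.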

**Gradient Descent variants:**

* `Batch Gradient Descent`
* `Stochastic Gradient Descent`
* `Mini-batch Gradient Descent`

**Note:** Mini-batch Gradient Descent is the one that is typically used

#### `Mini-batch Gradient Descent`

* it's a middle ground between Batch Gradient Descent and Stochastic Gradient Descent
    * **"the best of both worlds"**

* it computes the gradients using a small random subset of the training instances (these are called **mini batches**)

* provides a variety of advantages:
    * reduced noise (we accumulate gradients from multiple training examples)
    * efficiency
    * fast convergence

**We need to choose the right batch size: that size becomes a hyperparameter of the model**

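* small batch size: each update is cheap, so training progresses quickly, but the gradient estimates are noisier (less accurate)

* big batch size: each update is slower to compute, but the gradient estimates are more accurate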
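
A minimal NumPy sketch of mini-batch gradient descent, assuming a one-feature linear model trained with mean squared error on synthetic data. The data, the variable names and the hyperparameter values are illustrative choices, not part of the original material:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data for illustration: y = 2x + 1 plus a little noise
X = rng.uniform(-1.0, 1.0, size=(256, 1))
y = 2.0 * X[:, 0] + 1.0 + rng.normal(0.0, 0.05, size=256)

w, b = 0.0, 0.0
learning_rate = 0.1
batch_size = 32          # hyperparameter

for epoch in range(100):
    order = rng.permutation(len(X))              # reshuffle every epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]    # one random mini-batch
        x_batch, y_batch = X[idx, 0], y[idx]
        error = w * x_batch + b - y_batch
        # Gradients of the mean squared error, averaged over the mini-batch
        grad_w = 2.0 * np.mean(error * x_batch)
        grad_b = 2.0 * np.mean(error)
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b

print(w, b)
```

Changing `batch_size` to `len(X)` turns this into Batch Gradient Descent, and changing it to `1` turns it into Stochastic Gradient Descent.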

**Hardware limitations**

* we prefer to train Deep Learning models on GPUs

* because of that, we aim to pick batch sizes that are powers of two (8, 16, 32)

* bigger batch sizes also require more memory so take that into account

* mini-batch gradient descent comes with its own set of problems:

    * it is hard to choose a good learning rate
    * it is easy to get trapped in a local minimum

<center><img src="https://edlitera-images.s3.amazonaws.com/local_minima_problem.png" width="700">

source:
<br>
https://medium.com/@raza.shan83/gradient-descent-c568801d0b62

### `Gradient descent optimization algorithms`

**Optimization algorithms:**

* Adagrad
* Adadelta
* RMSprop
* Adam

**Honourable mentions:**

* AdaMax
* Nadam
* AMSGrad
* AdamW
* QHAdam

### `Gradient descent optimization algorithms in Keras`

* list of optimizers available at: https://keras.io/api/optimizers/

* Keras supports the following optimizers
    <br>
    
    * SGD
    * RMSprop
    * Adam
    * Adadelta
    * Adagrad
    * Adamax
    * Nadam
    * Ftrl
    
    
* most of the time you will use one of the more popular optimizers: **SGD, RMSprop or Adam**

In [1]:
from keras.models import Sequential
from keras.layers import Dense


# Define example model

model = Sequential()

model.add(Dense(8, activation="relu", input_shape=(8,)))

model.add(Dense(1, activation="sigmoid"))

In [4]:
import keras
from tensorflow.keras.optimizers import Adam
from keras.losses import CategoricalCrossentropy

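# Define optimizer and loss function
# Note: learning_rate is set explicitly here; Adam's default is 0.001

optim = Adam(learning_rate=0.2)

# Note: this cell only demonstrates the compile() API; a model with a single
# sigmoid output, like the one above, would normally be trained with BinaryCrossentropy

loss_function = CategoricalCrossentropy()

# Compile the model

model.compile(loss=loss_function, optimizer=optim)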

## `Loss functions`

* optimizers such as Adam try to minimize error in the algorithm

* the error they are trying to minimize is computed using loss functions 
    * they allow us to quantify how well a model is performing

**Loss functions used in classification problems:** 

* `Binary Cross-Entropy`
* `Categorical Cross-Entropy`
* `Sparse Categorical Cross-Entropy`

**Honourable mentions:**

* `Hinge Loss`
* `KL Loss`
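
To show how a loss function quantifies performance, here is a hand-written NumPy sketch of binary cross-entropy. The helper `binary_cross_entropy` and the example probabilities are hypothetical and only meant to illustrate the idea; in practice the Keras implementation would be used:

```python
import numpy as np


def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    y_prob = np.clip(y_prob, eps, 1.0 - eps)   # avoid log(0)
    return float(-np.mean(y_true * np.log(y_prob)
                          + (1.0 - y_true) * np.log(1.0 - y_prob)))


y_true = np.array([1.0, 0.0, 1.0, 0.0])

good_predictions = np.array([0.9, 0.1, 0.8, 0.2])   # right and fairly confident
bad_predictions = np.array([0.2, 0.9, 0.3, 0.7])    # wrong on every sample

loss_good = binary_cross_entropy(y_true, good_predictions)
loss_bad = binary_cross_entropy(y_true, bad_predictions)

print(loss_good, loss_bad)
```

The better the predicted probabilities match the labels, the lower the loss, which is exactly the quantity the optimizer tries to reduce.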

### `Loss functions in Keras`

* list of losses available at: https://keras.io/api/losses/

* for simple classification tasks you will mostly use:
    <br>
    
    * **binary crossentropy** for binary classification problems
    * **categorical crossentropy** for multiclass classification problems
    * **sparse categorical crossentropy** - variant of categorical crossentropy that takes integer class labels instead of one-hot encoded labels, which makes it more memory-efficient
        * we will use this for multiclass classification problems
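
The two categorical variants differ in the label format they expect, not in the value they compute. A small NumPy sketch, assuming three samples, three classes and made-up probabilities:

```python
import numpy as np

# Predicted class probabilities for 3 samples and 3 classes (each row sums to 1)
y_prob = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.3, 0.5],
])

sparse_labels = np.array([0, 1, 2])          # integer class indices
one_hot_labels = np.eye(3)[sparse_labels]    # the same labels, one-hot encoded

# Categorical cross-entropy: uses the one-hot matrix
categorical = -np.mean(np.sum(one_hot_labels * np.log(y_prob), axis=1))

# Sparse categorical cross-entropy: picks the true-class probability by index
sparse = -np.mean(np.log(y_prob[np.arange(3), sparse_labels]))

print(categorical, sparse)
```

Both computations give the same number; the sparse version simply avoids building the one-hot matrix.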

In [6]:
import keras
from tensorflow.keras.optimizers import Adam
from keras.losses import SparseCategoricalCrossentropy


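# Define a loss function and an optimizer
# Note: SparseCategoricalCrossentropy expects integer class labels
# and an output layer with one unit per class

optim = Adam()
loss_function = SparseCategoricalCrossentropy()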

# Compile the model

model.compile(loss=loss_function, optimizer=optim)


## `Metrics`

* list of metrics available at: https://keras.io/api/metrics/

* divided into:
    <br>
    
    * Accuracy metrics
    * Probabilistic metrics
    * Regression metrics
    * Classification metrics based on True/False positives & negatives
    * etc.
    
    

* used to judge the performance of the model


* keep in mind that any loss function can also be used as a metric

In [7]:
import keras
from tensorflow.keras.optimizers import Adam
from keras.losses import SparseCategoricalCrossentropy
from keras.metrics import SparseCategoricalAccuracy

# Define a loss function, an optimizer and a metric

loss_function = SparseCategoricalCrossentropy()

metric = SparseCategoricalAccuracy()

optim = Adam()

# Compile the model

model.compile(loss=loss_function, optimizer=optim, metrics=[metric])

# `Fitting models in Keras`

* performed using the `fit()` method
    * the model is trained for a certain number of epochs
    

### `Arguments`

**`x`**

- input data used for predicting target data

**`y`**

- target data

**`batch_size`** 

- the size of the minibatches (**remember:** use powers of 2)

- **NOTE:** no need to use this if you use a generator or something similar to feed data into your model

**`epochs`** 

- how many epochs to train the model for

**`verbose`** 

- verbosity mode, you can leave it on "auto"

**`callbacks`** 

- the progress bar logger and the history callbacks get created automatically, but all the others need to be specified

**`validation_split`** 

- the fraction of the training data (a value between 0 and 1) to set aside for validation

**`validation_data`**

- if you have separate data for validation, pass it using this argument

**`shuffle`**

- `True` if you want to shuffle the data before each epoch

**`validation_batch_size`**  

- the size of the validation data minibatch (remember: use powers of 2)
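
A minimal sketch of how several of these arguments fit together in one `fit()` call. The random data, the tiny model and the argument values are assumptions chosen to keep the example fast; with `validation_split`, Keras takes the validation samples from the end of the provided data before any shuffling:

```python
import numpy as np
import keras
from keras import layers

# Random stand-in data: 200 samples, 2 features, binary labels
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)).astype("float32")
y = (X[:, 0] + X[:, 1] > 0).astype("float32")

model = keras.Sequential([
    keras.Input(shape=(2,)),
    layers.Dense(8, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

history = model.fit(
    X, y,
    batch_size=32,
    epochs=5,
    validation_split=0.2,   # hold out the last 20% of the samples for validation
    shuffle=True,           # reshuffle the training part before each epoch
    verbose=0,              # silent; 1 shows a progress bar
)

# fit() returns a History object; history.history maps names to per-epoch values
print(sorted(history.history.keys()))
```

The `History` object is handy for plotting training and validation curves after training has finished.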

## `Example`

In [8]:
# Import necessary libraries

import pandas as pd
import numpy as np

import keras
from keras.layers import Dense, Input
from keras.models import Sequential
from tensorflow.keras.optimizers import Adam
from keras.losses import BinaryCrossentropy
from keras.metrics import BinaryAccuracy

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report

In [9]:
# Load data
# Display first five rows

df = pd.read_csv("https://edlitera-datasets.s3.amazonaws.com/students.csv")

df.head()

Unnamed: 0,daily_study,monthly_tuition,passing_grade
0,7,27,1
1,2,43,0
2,7,26,1
3,8,29,1
4,3,42,0


In [10]:
# Shuffle dataset

df = df.sample(frac=1).reset_index(drop=True)

In [11]:
# Separate features from the label

X = df[["daily_study", "monthly_tuition"]]

# Flatten data

y = df["passing_grade"]
y = y.values.flatten()

In [12]:
# Split data into train and test data

X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.3, 
    random_state=42
)

# Split train data into train and validation data

X_train, X_valid, y_train, y_valid = train_test_split(
    X_train, y_train, 
    test_size=0.3, 
    random_state=42
)

In [13]:
# Define scaler and scale data

scaler = StandardScaler()

X_train = scaler.fit_transform(X_train)
X_valid = scaler.transform(X_valid)
X_test = scaler.transform(X_test)

In [14]:
# Define input dimension

input_dimension = X_train.shape[1]

print(input_dimension)

2


In [15]:
# Define model

model = Sequential()

model.add(Dense(8, activation="relu", input_shape=(input_dimension,)))

model.add(Dense(4, activation="relu"))

model.add(Dense(1, activation="sigmoid"))


model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_2 (Dense)              (None, 8)                 24        
_________________________________________________________________
dense_3 (Dense)              (None, 4)                 36        
_________________________________________________________________
dense_4 (Dense)              (None, 1)                 5         
Total params: 65
Trainable params: 65
Non-trainable params: 0
_________________________________________________________________


In [16]:
# Define optimizer

optim = Adam()

# Define loss

loss_function = BinaryCrossentropy()

# Define metric we will track 

metric = BinaryAccuracy()

# Compile model

model.compile(
    loss=loss_function,
    optimizer=optim,
    metrics=[metric]
)

In [17]:
# Fit model

model.fit(
    X_train, 
    y_train, 
    epochs=100, 
    batch_size=32, 
    validation_data=(X_valid, y_valid), 
    verbose=1
)

Epoch 1/100
Epoch 2/100
...
Epoch 77/100
Epoch 78

<keras.callbacks.History at 0x24c3d4a3f40>

# `Evaluating models in Keras and making predictions`

* by evaluating we analyze the performance of our model
    * returns the loss value
    * returns what we specified as metrics

* performed using the `evaluate()` method


* evaluation is performed in batches

### `Arguments`

**`x`**

- input data used for predicting target data

**`y`**

- target data

**`batch_size`** 

- the size of the minibatches (**remember:** use powers of 2)

- **NOTE:** no need to use this if you use a generator or something similar to feed data into your model

**`verbose`** 

- verbosity mode, you can leave it on "auto"

**`callbacks`** 

- the callbacks we want to apply during evaluation

**`steps`** 

- total number of batches (steps) to process before the evaluation round is declared finished

## `Making predictions using trained models in Keras`

* once we have a trained model, we can use it to generate predictions
    * computation for this is done in batches

* to generate predictions for some input samples we use the `predict()` method

### `Arguments`

**`x`**

- input samples for which we want to generate predictions 

**`batch_size`** 

- the size of the minibatches (**remember:** use powers of 2)

- **NOTE:** no need to use this if you use a generator or something similar to feed data into your model

**`verbose`** 

- verbosity mode, you need to select 0 or 1

- 0 will make sure that nothing is shown on the screen while making predictions

- 1 will display progress on screen when making predictions

**`steps`** 

- total number of batches (steps) to process before the prediction round is declared finished

**`callbacks`** 

- the callbacks we want to apply during prediction

## `Example`

In [18]:
# Evaluate model

y_pred = model.predict(X_test)

score = model.evaluate(X_test, y_test, verbose=1)

print(score)

[0.04340706393122673, 0.9900000095367432]


In [19]:
# Predict classes for the test examples

predicted_classes = np.where(y_pred > 0.6, 1,0)

In [20]:
# Generate classification report

print(classification_report(y_test, predicted_classes, labels=[0,1]))

              precision    recall  f1-score   support

           0       1.00      0.98      0.99       161
           1       0.98      1.00      0.99       139

    accuracy                           0.99       300
   macro avg       0.99      0.99      0.99       300
weighted avg       0.99      0.99      0.99       300



## Very important note:

* when using binary crossentropy use the following to get the final predictions: 

`predicted_classes = np.where(y_pred > 0.5, 1, 0)`

* when using sparse categorical crossentropy use the following to get the final predictions: 

`predicted_classes = np.argmax(y_pred, axis=-1)`

**Why?**

* the outputs we get from our neural network are probabilities that some value belongs in some class
    * we use the code above to turn those probabilities into actual class predictions
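
A short NumPy sketch of both conversions on made-up probabilities. The arrays stand in for what `predict()` would return for a sigmoid output layer and for a softmax output layer, respectively:

```python
import numpy as np

# Binary case: one sigmoid output per sample, shape (n_samples, 1)
binary_probs = np.array([[0.91], [0.08], [0.55]])
binary_classes = np.where(binary_probs > 0.5, 1, 0).ravel()

# Multiclass case: one softmax row per sample, shape (n_samples, n_classes)
multiclass_probs = np.array([
    [0.1, 0.7, 0.2],
    [0.8, 0.1, 0.1],
])
multiclass_classes = np.argmax(multiclass_probs, axis=-1)

print(binary_classes, multiclass_classes)
```

In the binary case the threshold decides the class; in the multiclass case the class with the highest probability wins.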

# `Take-home exercise`

**Using the dataset available at `https://edlitera-datasets.s3.amazonaws.com/breast_cancer_data.csv`, create a model that can predict whether a person has a benign tumor or a malignant tumor. The `diagnosis` column indicates whether a person has a benign tumor (0) or a malignant tumor (1). Try to achieve the best possible accuracy you can by modifying hyperparameters.**

## `Solution:`

# `Saving and loading models in Keras`

* saving a model in Keras can be done in three ways, depending on the need
    * **saving the whole model**
    * **saving only the architecture (configuration) of the model**
    * **saving only the model weights**

**Model architecture**

* specifies the model layers and how they are connected

* allows us to create a variant of the model that can be freshly initialized and that has no compilation information

**Model weights**

* the part of the model that is optimized


* we can apply the weights we save to some other untrained model that is of the same structure as our original one without needing to train from the beginning


* **ESSENTIAL FOR TRANSFER LEARNING**

**Model compilation information**

* the arguments we set for the **`compile()`** method when creating and training the model

**Model optimizer state**

* very important in some cases

* if we want to completely recreate some conditions, optimizer state is extremely important
    * a lot of popular optimizers have values that change during training, so starting with a fresh optimizer would not lead to good results even if we used the same model configuration and the same weights

# `Saving and loading the whole model`

* we can save a whole model to a single artifact

* this saves the following:

    * **model architecture**
    * **model weights** 
    * **model compilation information**
    * **model optimizer state**

**Saving a model**

* done using the following code
```
model.save('path/to/location')
```

**Loading a model**

* done using the following code
```
model = keras.models.load_model('path/to/location')
```

**Example**

In [22]:
from keras.models import load_model

# Save model

model.save("my_model.h5")


# Load saved model

recreate_model = load_model("my_model.h5")
recreate_model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_2 (Dense)              (None, 8)                 24        
_________________________________________________________________
dense_3 (Dense)              (None, 4)                 36        
_________________________________________________________________
dense_4 (Dense)              (None, 1)                 5         
Total params: 65
Trainable params: 65
Non-trainable params: 0
_________________________________________________________________


## `Saving and loading the architecture of a model`

* we can save the architecture of a single layer, of a whole sequential model, or of a whole functional model

**Saving the architecture of a single layer and loading it**

```

layer = keras.layers.Dense(20, activation="relu", input_shape=(10,))

layer_config = layer.get_config()

new_layer = keras.layers.Dense.from_config(layer_config)
```

**Saving the architecture of a Sequential Model and loading it**


```
model = keras.Sequential([keras.Input((32,)), keras.layers.Dense(1)])

model_config = model.get_config()

new_model = keras.Sequential.from_config(model_config)
```

**Saving the architecture of a Functional Model and loading it**


```
x = keras.Input((50,))
out = keras.layers.Dense(1, activation='sigmoid')(x)
model = keras.Model(x, out)

model_config = model.get_config()


new_model = keras.Model.from_config(model_config)
```
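
A model rebuilt from its configuration has the same structure but freshly initialized weights. The following sketch checks this, assuming a small Sequential model made up for the example:

```python
import numpy as np
import keras

model = keras.Sequential([keras.Input((32,)), keras.layers.Dense(4)])

# Rebuild the same architecture from its configuration
new_model = keras.Sequential.from_config(model.get_config())

original_kernel = model.get_weights()[0]
rebuilt_kernel = new_model.get_weights()[0]

same_shape = original_kernel.shape == rebuilt_kernel.shape    # structure is preserved
same_values = np.allclose(original_kernel, rebuilt_kernel)    # weights are not copied

print(same_shape, same_values)
```

To carry the learned values over as well, the weights have to be saved and loaded separately, or the whole model has to be saved.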

## `Saving and loading the weights of a model`

* useful when we just need to perform inference using a model
    * the optimizer state and the compilation information are then useless, so we can load only weights and apply them to some model architecture they fit
    
    
* makes it easy to perform **transfer learning**

**Saving weights**

```
model.save_weights("pretrained_checkpoint")
```

**Loading weights**

```
model.load_weights("pretrained_checkpoint")
```
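
A sketch of moving weights from one model to another model with the same structure. The helper `build_model`, the temporary file location and the `.weights.h5` file name are assumptions; the suffix is used here because recent Keras versions expect it for weight files, while older versions also accept other names:

```python
import os
import tempfile

import numpy as np
import keras


def build_model():
    # Both models must share this structure for the weights to fit
    return keras.Sequential([
        keras.Input((4,)),
        keras.layers.Dense(8, activation="relu"),
        keras.layers.Dense(1, activation="sigmoid"),
    ])


source_model = build_model()   # stands in for a trained model
target_model = build_model()   # fresh model with different random weights

# Assumed file name; saved to a temporary directory to keep the example tidy
weights_path = os.path.join(tempfile.mkdtemp(), "demo.weights.h5")
source_model.save_weights(weights_path)
target_model.load_weights(weights_path)

x = np.ones((3, 4), dtype="float32")
same_output = np.allclose(
    source_model.predict(x, verbose=0),
    target_model.predict(x, verbose=0),
)
print(same_output)
```

After loading, both models produce the same predictions, even though the second one was never trained.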

# `How to further improve models using regularization techniques`

* we test how our models perform on some type of testing dataset
    * it gives us an idea of how well our model generalizes to unseen data

* achieving the proper fit is very hard
    * deep learning models are prone to overfitting more than underfitting

* **getting to the global minimum is not necessary**
    * well-designed and regularized complex Deep Learning models can converge to local minima that are very close to the global minimum

**Regularization methods:**

   * `Early Stopping`
   * `L2 Regularization`
   * `L1 Regularization`
   * `Dropout`
   * `Batch Normalization`

    
**Honourable mentions:**

* `Multitask Learning`
* `Parameter Sharing`

* we typically use more than one regularization method when designing neural networks

## `Early stopping`

* very common way of avoiding overfitting
    * used because it is very easy to implement

* monitors the error on the validation dataset alongside the error on the training set
    * if the validation error stops improving (or starts rising) while the training error keeps falling, our model has started overfitting

* using early stopping we can select that state of the model where the validation error was smallest
    * essentially, we select those weights that were present when our model performed best on the validation set

<center><img src="https://edlitera-images.s3.amazonaws.com/early_stopping.png" width="700">

source:
<br>
https://www.semanticscholar.org/paper/Deep-Learning-for-NLP-and-Speech-Recognition-Kamath-Liu/6fe17c5de719df4dcbdb7967e14f7b457ec8c2ca
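
In Keras, early stopping is available as the `EarlyStopping` callback. A minimal sketch, assuming random stand-in data, a tiny model and illustrative values for `patience` and `epochs`:

```python
import numpy as np
import keras
from keras import layers

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2)).astype("float32")
y = (X[:, 0] - X[:, 1] > 0).astype("float32")

model = keras.Sequential([
    keras.Input(shape=(2,)),
    layers.Dense(8, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

early_stopping = keras.callbacks.EarlyStopping(
    monitor="val_loss",          # quantity to watch
    patience=3,                  # epochs without improvement before stopping
    restore_best_weights=True,   # roll back to the best epoch's weights
)

history = model.fit(
    X, y,
    validation_split=0.2,
    epochs=30,
    batch_size=32,
    callbacks=[early_stopping],
    verbose=0,
)

epochs_run = len(history.history["val_loss"])
print(epochs_run)
```

Training ends either when the validation loss has not improved for `patience` epochs in a row or when the maximum number of epochs is reached, whichever comes first.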

## `Dropout`

* a bit like an ensemble technique for machine learning

* randomly "drops out" (temporarily removes) neurons in the neural network during training
    * this makes it impossible for any prediction to completely depend on a single neuron
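
The mechanism can be sketched in NumPy. This assumes the "inverted dropout" variant, in which the surviving units are scaled up by `1 / (1 - rate)` during training so that nothing has to change at inference time; the helper `dropout` is hypothetical and only meant as an illustration:

```python
import numpy as np


def dropout(activations, rate, training, rng):
    """Inverted dropout: zero a random fraction of units and rescale the rest."""
    if not training:
        return activations                      # inference: the layer does nothing
    keep_mask = rng.random(activations.shape) >= rate
    return activations * keep_mask / (1.0 - rate)


rng = np.random.default_rng(0)
activations = np.ones((1000, 10))

train_out = dropout(activations, rate=0.5, training=True, rng=rng)
test_out = dropout(activations, rate=0.5, training=False, rng=rng)

dropped_fraction = (train_out == 0).mean()   # close to the dropout rate
train_mean = train_out.mean()                # rescaling keeps the average close to 1

print(dropped_fraction, train_mean)
```

Because a different random set of units is removed in every training step, the network cannot rely on any single unit.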

## `Batch Normalization`

* created to combat **internal covariate shift**

**Internal covariate shift**

* we usually normalize data before feeding it into networks (or any other ML model)

* problem with deep learning: distribution changes frequently during the learning procedure
    * outputs of layers become non-normalized inputs for the subsequent layers

**Batch normalization = normalizing the outputs of intermediate layers during training** 

**How it works:**

* the output of a hidden layer gets normalized using the mean and variance of the mini-batch we are currently training on before being fed as an input into the next layers

**Benefit:**

* faster learning - learning rate can be higher

**Important:**

* Batch Normalization keeps a **moving average of the mean and variance** seen during training
    * at inference time it uses these fixed values instead of the statistics of the current batch, so a prediction does not depend on which other samples happen to be in the batch
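
A simplified NumPy sketch of the two modes. The class `SimpleBatchNorm` is hypothetical and leaves out the learnable scale and shift parameters that a real batch normalization layer has; the momentum value and the simulated data are also assumptions:

```python
import numpy as np


class SimpleBatchNorm:
    """Per-feature batch normalization without the learnable scale and shift."""

    def __init__(self, num_features, momentum=0.9, eps=1e-5):
        self.momentum = momentum
        self.eps = eps
        self.moving_mean = np.zeros(num_features)
        self.moving_var = np.ones(num_features)

    def __call__(self, x, training):
        if training:
            mean, var = x.mean(axis=0), x.var(axis=0)
            # Update the moving averages that will be used at inference time
            self.moving_mean = self.momentum * self.moving_mean + (1 - self.momentum) * mean
            self.moving_var = self.momentum * self.moving_var + (1 - self.momentum) * var
        else:
            mean, var = self.moving_mean, self.moving_var
        return (x - mean) / np.sqrt(var + self.eps)


rng = np.random.default_rng(0)
bn = SimpleBatchNorm(num_features=3)

for _ in range(200):                               # simulated training steps
    batch = rng.normal(loc=5.0, scale=2.0, size=(64, 3))
    normalized = bn(batch, training=True)

single_sample = np.array([[5.0, 5.0, 5.0]])
inference_out = bn(single_sample, training=False)  # uses the moving statistics

print(normalized.mean(axis=0), inference_out)
```

During training each mini-batch is normalized with its own statistics, while at inference time even a single sample can be normalized because the stored moving averages are used.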
