# Deep Learning - Exercise 4

This lecture covers advanced CNN topics such as transfer learning and 1D convolutions for time-series processing.

We will use the CIFAR-10 dataset again, and [FordA](https://www.timeseriesclassification.com/description.php?Dataset=FordA) for the time-series classification task.

[Open in Google colab](https://colab.research.google.com/github/rasvob/VSB-FEI-Deep-Learning-Exercises/blob/main/dl_04.ipynb)
[Download from Github](https://github.com/rasvob/VSB-FEI-Deep-Learning-Exercises/blob/main/dl_04.ipynb)

##### Remember to set **GPU** runtime in Colab!

In [None]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import pandas as pd
import matplotlib.pyplot as plt # plotting
import matplotlib.image as mpimg # images
import numpy as np #numpy
import tensorflow as tf
import tensorflow.keras as keras
import requests

from tensorflow.keras.layers import Activation
from tensorflow.keras.utils import get_custom_objects
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.applications import ResNet50
from tensorflow.keras.applications import VGG19

from sklearn.model_selection import train_test_split # split for validation sets
from sklearn.preprocessing import normalize # normalization of the matrix
from scipy.signal import convolve2d # convolution of 2D signals
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

from PIL import Image
from io import BytesIO
from skimage.transform import resize


tf.version.VERSION

In [None]:
def show_history(history):
    plt.figure()
    for key in history.history.keys():
        plt.plot(history.epoch, history.history[key], label=key)
    plt.legend()
    plt.tight_layout()

def show_example(train_x, train_y, class_names):
    plt.figure(figsize=(10,10))
    for i in range(25):
        plt.subplot(5,5,i+1)
        plt.xticks([])
        plt.yticks([])
        plt.grid(False)
        plt.imshow(train_x[i], cmap=plt.cm.binary)
        plt.xlabel(class_names[train_y[i][0]])
    plt.show()
                
def compute_metrics(y_true, y_pred, show_confusion_matrix=False):
    print(f'\tAccuracy: {accuracy_score(y_true, y_pred)*100:8.2f}%')
    if (show_confusion_matrix):
        print('\tConfusion matrix:\n', confusion_matrix(y_true, y_pred))

# 🔎 What is *transfer learning* about? 🔎

* Transfer learning consists of taking features learned on one problem, and leveraging them on a new, similar problem. 
    * For instance, features from a model that has learned to identify cars may be useful to kick-start a model meant to identify trucks.
        * 🔎 Do you know any famous CNN models?

* Transfer learning is usually done for tasks where your dataset has too little data to train a full-scale model from scratch.
    * 🔎 How do we benefit from it?
    
## 📌 Usual pipeline

1) Take layers from a previously trained model.

2) Freeze them, so you avoid destroying any of the information they contain during future training rounds.

3) Add some new, trainable layers, on top of the frozen layers. 
    * 💡 They will learn how to turn the features extracted by pre-trained layers into predictions on a new dataset.

4) Train the new layers using your dataset.

* 💡 Optional step: Fine-tuning (= unfreezing the entire model you obtained above, or part of it), and re-training it on the new data with a very **low** learning rate. 
    * This can potentially achieve meaningful improvements, by incrementally adapting the pretrained features to the new data.
    * 🔎 Why do we use **low** learning rate?
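
The pipeline above can be sketched without any deep learning framework. The following is a minimal NumPy illustration, not the Keras workflow used later: the "pre-trained" layer is just a fixed random matrix (`W_frozen`, a made-up stand-in), the data are synthetic, and the new head is a plain logistic regression. It only shows that training updates the head while the frozen part stays untouched.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (made up for illustration): 200 samples with 20 raw inputs each.
X = rng.normal(size=(200, 20))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

# Steps 1-2: a stand-in for "pre-trained" layers, kept frozen (never updated).
W_frozen = rng.normal(size=(20, 8))
W_frozen_before = W_frozen.copy()

def extract_features(inputs):
    return np.tanh(inputs @ W_frozen)

# Step 3: a new trainable head (logistic regression) on top of the frozen features.
w_head = np.zeros(8)
b_head = 0.0

def predict(inputs):
    logits = extract_features(inputs) @ w_head + b_head
    return 1.0 / (1.0 + np.exp(-logits))

def binary_cross_entropy(probs, targets):
    eps = 1e-12  # avoids log(0)
    return -np.mean(targets * np.log(probs + eps) + (1 - targets) * np.log(1 - probs + eps))

initial_loss = binary_cross_entropy(predict(X), y)

# Step 4: gradient descent updates only the head; W_frozen is left untouched.
learning_rate = 0.1
for _ in range(200):
    feats = extract_features(X)
    error = predict(X) - y  # gradient of the loss w.r.t. the logits
    w_head -= learning_rate * feats.T @ error / len(X)
    b_head -= learning_rate * error.mean()

final_loss = binary_cross_entropy(predict(X), y)
print(f"loss before: {initial_loss:.3f}, after: {final_loss:.3f}")
```

Fine-tuning would correspond to also updating `W_frozen`, with a much smaller step size than the one used for the head.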


# 🚀 Let's start!

## Import dataset **CIFAR10** again
* I think (or hope 😀) that you remember most of these details from the previous lecture 🙂
    * The CIFAR-10 dataset contains 60,000 32x32 color images in 10 different classes. 
    * The 10 different classes represent airplanes, cars, birds, cats, deer, dogs, frogs, horses, ships, and trucks. 
    * There are 6,000 images of each class.

## We will resize the images into (224, 224) shape because we will use ResNet50 later and we will also one-hot encode our labels
* 💡 If you do not encode the labels you will run into shape mismatch error which is hard to debug - trust me, I've been there 🙂
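
To make the shape issue concrete, here is a small NumPy sketch of one-hot encoding. The three labels are made up; the `(n_samples, 1)` layout mirrors how the CIFAR-10 labels are indexed in `show_example` above. A 10-unit softmax output compared with categorical cross-entropy expects targets of shape `(n_samples, 10)`, not `(n_samples, 1)`.

```python
import numpy as np

# Made-up integer labels in an (n_samples, 1) layout.
labels = np.array([[3], [0], [9]])
num_classes = 10

# Row i of the identity matrix is the one-hot vector for class i.
one_hot = np.eye(num_classes, dtype=np.float32)[labels.ravel()]

print(labels.shape, "->", one_hot.shape)  # (3, 1) -> (3, 10)

# argmax recovers the original integer labels.
recovered = one_hot.argmax(axis=1)
```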

In [None]:
# CIFAR-10 is a basic dataset for image classification
dataset = tf.keras.datasets.cifar10
img_size = 224
subset = 1000
test_size = 0.2

class_names = ['airplane', 'automobile', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']
class_count = len(class_names)

# data from any Keras dataset are loaded using the load_data function
(train_x, train_y), (test_x, test_y) = dataset.load_data()

train_y = tf.keras.utils.to_categorical(train_y[:subset], class_count)
test_y = tf.keras.utils.to_categorical(test_y[:subset], class_count)

train_x_resized = tf.image.resize(train_x[:subset], [img_size, img_size])
test_x_resized = tf.image.resize(test_x[:subset], [img_size, img_size])

In [None]:
show_example(train_x, np.argmax(train_y, axis=1).reshape(-1, 1), class_names)

# Instantiate a `ResNet50` model with pre-trained weights.
* 🔎 What does **include_top** do?
* What does the **weights='imagenet'** parameter mean?
    * 🔎 Do we need it? 
    * 🔎 What happens if we use random weights?

In [None]:
base_model = ResNet50(
    weights='imagenet',  # Load weights pre-trained on ImageNet.
    input_shape=(img_size, img_size, 3),
    include_top=False)  # Do not include the ImageNet classifier part at the top.

## 📌 IMPORTANT: Freeze the base model 📌
* We don't want to train the encoder path of the model yet

In [None]:
base_model.trainable = False

# ⚡ Create a model input and output layers and interconnect all the parts together
* 💡 We make sure that the base_model is running in inference mode here, by passing `training=False`.

## 📌 Notes about BatchNormalization layer
* Many image models contain **BatchNormalization** layers. 
* Here are a few things to keep in mind:
    * BatchNormalization contains 2 non-trainable weights that get updated during training. 
        * These are the variables **tracking the mean and variance of the inputs**.
* 💡 When you **unfreeze** a model that contains BatchNormalization layers in order to do **fine-tuning**, you should **keep the BatchNormalization layers in inference mode by passing training=False** when calling the base model. 
    * **Otherwise the updates applied to the non-trainable weights will suddenly destroy what the model has learned.**
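
The difference between the two modes can be sketched in NumPy. This is a simplified illustration, not Keras' implementation: it assumes the usual moving-average update with momentum 0.99 (the documented Keras default), and the stored statistics and the batch are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
momentum = 0.99  # documented Keras default for BatchNormalization

# Statistics as they might look after pre-training (values made up).
moving_mean, moving_var = 0.0, 1.0

# A batch from the new dataset with a very different distribution.
batch = rng.normal(loc=5.0, scale=3.0, size=256)

def batchnorm_stats(batch, moving_mean, moving_var, training):
    """Return (statistics used for normalization, stored moving statistics)."""
    if training:
        mean, var = batch.mean(), batch.var()
        moving_mean = momentum * moving_mean + (1 - momentum) * mean
        moving_var = momentum * moving_var + (1 - momentum) * var
        return (mean, var), (moving_mean, moving_var)
    # Inference mode: normalize with the stored statistics and leave them alone.
    return (moving_mean, moving_var), (moving_mean, moving_var)

used_train, stored_train = batchnorm_stats(batch, moving_mean, moving_var, training=True)
used_infer, stored_infer = batchnorm_stats(batch, moving_mean, moving_var, training=False)

print("training mode  - used:", used_train, "stored:", stored_train)
print("inference mode - used:", used_infer, "stored:", stored_infer)
```

In training mode the layer normalizes with the batch statistics and shifts the stored ones, which is exactly what the pre-trained convolutional weights were not tuned for.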


* 🔎 What does the **GlobalAveragePooling2D** layer do?
    * After convolutional operations, *tf.keras.layers.Flatten* reshapes a tensor into (n_samples, height*width*channels), for example turning (16, 28, 28, 3) into (16, 2352).
    * The *GlobalAveragePooling2D* layer is an alternative: it averages each feature map over its spatial axes (height and width), so only the last (channel) axis remains.
        * This means that the resulting shape will be (n_samples, channels).
        * 💡 For instance, if your last convolutional layer had 64 filters, it would turn (16, 7, 7, 64) into (16, 64).
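
The two shapes from the example can be checked quickly in NumPy. This sketch uses random numbers in place of real feature maps and only mimics what the two layers do to the shape.

```python
import numpy as np

# Random stand-in for the output of a last conv layer: (n_samples, height, width, channels).
feature_maps = np.random.default_rng(0).normal(size=(16, 7, 7, 64))

# Flatten: keep the sample axis, merge everything else.
flattened = feature_maps.reshape(feature_maps.shape[0], -1)

# Global average pooling: average over height and width, keep the channels.
pooled = feature_maps.mean(axis=(1, 2))

print(flattened.shape)  # (16, 3136)
print(pooled.shape)     # (16, 64)
```

A Dense layer placed after `pooled` needs 49 times fewer input weights than one placed after `flattened`.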

# 📌 Make sure that you call the `preprocess_input` function
* Each Keras Application expects a specific kind of input preprocessing. 
* For ResNet, call `tf.keras.applications.resnet.preprocess_input` on your inputs before passing them to the model.
    * 💡 It will convert the input images from RGB to BGR, then will zero-center each color channel with respect to the ImageNet dataset, without scaling.
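
As a rough picture of what this preprocessing does, here is a NumPy sketch. It is not the Keras function itself; `resnet_style_preprocess` is a hypothetical helper, and the per-channel means are the ImageNet values that, to my knowledge, Keras uses for this preprocessing mode.

```python
import numpy as np

# Per-channel ImageNet means in BGR order (assumed to match Keras' values).
IMAGENET_MEAN_BGR = np.array([103.939, 116.779, 123.68], dtype=np.float32)

def resnet_style_preprocess(images_rgb):
    """Hypothetical helper: RGB -> BGR, then zero-center each channel, no scaling."""
    images = np.asarray(images_rgb, dtype=np.float32)
    images_bgr = images[..., ::-1]  # reverse the channel axis
    return images_bgr - IMAGENET_MEAN_BGR

# One pure-red RGB pixel as a batch of shape (1, 1, 1, 3).
pixel = np.array([[[[255, 0, 0]]]], dtype=np.uint8)
out = resnet_style_preprocess(pixel)
print(out[0, 0, 0])  # blue, green, red after centering
```

In the notebook you should still call the library function, since it is guaranteed to match the weights you downloaded.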

In [None]:
inputs = keras.Input(shape=(img_size, img_size, 3), dtype=tf.uint8)
x = tf.cast(inputs, tf.float32)
x = tf.keras.applications.resnet50.preprocess_input(x)
x = base_model(x, training=False)
x = keras.layers.GlobalAveragePooling2D()(x)
outputs = keras.layers.Dense(class_count, activation='softmax')(x)
model = keras.Model(inputs, outputs)

## Compile the model and check number of parameters
* Why do we have only **20,490** trainable parameters?
* Why do we use `CategoricalAccuracy` and `CategoricalCrossentropy`?
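
A quick sanity check of that number: with the base model frozen, only the final Dense layer is trainable. ResNet50's last feature map has 2048 channels, so after global average pooling the Dense layer sees 2048 inputs.

```python
feature_channels = 2048  # channels in ResNet50's final feature map
num_classes = 10         # CIFAR-10 classes

dense_weights = feature_channels * num_classes
dense_biases = num_classes
trainable_params = dense_weights + dense_biases

print(trainable_params)  # 20490
```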

In [None]:
model.compile(optimizer=keras.optimizers.Adam(),
              loss=keras.losses.CategoricalCrossentropy(),
              metrics=[keras.metrics.CategoricalAccuracy()])

model.summary()

![Meme01](https://github.com/rasvob/VSB-FEI-Deep-Learning-Exercises/blob/main/images/dl_meme_tf_01.png?raw=true)

## 💡 Always check that all the shapes match what the model expects!
* Otherwise you will run into a shape mismatch issue in the training loop, which is harder to debug than C++ templates 😅

In [None]:
train_x_resized.shape, train_y.shape

In [None]:
test_x_resized.shape, test_y.shape

## 🚀 Fit the model

In [None]:
model_checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath='weights.best.hdf5',
    save_weights_only=True,
    monitor='val_loss',
    mode='auto',
    save_best_only=True)

In [None]:
history = model.fit(train_x_resized, train_y, validation_split=0.2, batch_size=32, epochs=10, callbacks=[model_checkpoint_callback])

show_history(history)

# Load best setup
model.load_weights("weights.best.hdf5")
test_loss, test_acc = model.evaluate(test_x_resized, test_y)
print('Test accuracy: ', test_acc)

# 🚀 Fine-tuning
* Once your model has converged on the new data, you can try to unfreeze all or part of the base model and retrain the whole model end-to-end with a very low learning rate.
    * 💡 It could also potentially lead to quick overfitting -- keep that in mind.
* It is critical to only do this step **after the model with frozen layers has been trained to convergence**. 
    * 💡 If you mix randomly-initialized trainable layers with trainable layers that hold pre-trained features, the randomly-initialized layers will cause very large gradient updates during training.
    * This will **destroy your pre-trained features**.
    
### It's also critical to use a *very low learning rate* at this stage
* You are training a much larger model than in the first round of training, on a dataset that is typically very small. 
    * 💡 As a result, you are at **risk of overfitting** very quickly if you apply large weight updates.
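
To get a feeling for the scale of the updates, the sketch below computes the very first Adam step for a set of made-up gradients, using the textbook Adam update rule (Keras' implementation may differ in small details). On the first step the bias-corrected ratio is close to the sign of the gradient, so every weight moves by roughly the learning rate, no matter how large the gradient is.

```python
import numpy as np

def adam_first_step(grad, learning_rate, beta_1=0.9, beta_2=0.999, eps=1e-7):
    """Weight change produced by the first Adam step (textbook formulation)."""
    m = (1 - beta_1) * grad        # first moment after one step
    v = (1 - beta_2) * grad ** 2   # second moment after one step
    m_hat = m / (1 - beta_1)       # bias correction
    v_hat = v / (1 - beta_2)
    return -learning_rate * m_hat / (np.sqrt(v_hat) + eps)

# Large made-up gradients, as a freshly unfrozen model might produce.
grad = np.random.default_rng(0).normal(scale=10.0, size=1000)

default_step = adam_first_step(grad, 1e-3)    # Adam's default learning rate
fine_tune_step = adam_first_step(grad, 1e-5)  # learning rate used for fine-tuning

print("largest change with lr=1e-3:", np.abs(default_step).max())
print("largest change with lr=1e-5:", np.abs(fine_tune_step).max())
```

With `1e-5` each pre-trained weight is nudged about a hundred times more gently per step than with the default rate.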

## Unfreeze the base model

In [None]:
base_model.trainable = True

## 💡 Recompile your model after you make any changes
* The `trainable` attribute of any inner layer is taken into account after re-compilation

* Calling `compile()` on a model is meant to "freeze" the behavior of that model. 
    * This implies that the trainable attribute values at the time the model is compiled should be preserved throughout the lifetime of that model, until compile is called again. 
    * Hence, if you change any trainable value, make sure to call `compile()` again on your model for your changes to be taken into account.

In [None]:
model.compile(optimizer=keras.optimizers.Adam(1e-5),
              loss=keras.losses.CategoricalCrossentropy(),
              metrics=[keras.metrics.CategoricalAccuracy()])

model.summary()

In [None]:
model_checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath='weights.best.hdf5',
    save_weights_only=True,
    monitor='val_loss',
    mode='auto',
    save_best_only=True)

In [None]:
history = model.fit(train_x_resized, train_y, validation_split=0.2, batch_size=32, epochs=10, callbacks=[model_checkpoint_callback])

show_history(history)

In [None]:
# Load best setup
model.load_weights("weights.best.hdf5")
test_loss, test_acc = model.evaluate(test_x_resized, test_y)
print('Test accuracy: ', test_acc)

# Now you are an absolute expert on CNN applications in the image classification tasks 👏 

## We can switch to the time series processing part of the lecture! 🙂
* 🔎 What tasks can you imagine for time series processing?
* We will use a CNN again, but now in the Conv1D variant
    * 🔎 What is the difference between 1D, 2D and 3D convolutions?

### There is definitely a cool mathematical expression for each conv layer type; however, I would like you to understand the topic, so we will use the diagrams below 🙂

![Meme02](https://github.com/rasvob/VSB-FEI-Deep-Learning-Exercises/blob/main/images/dl_meme_tf_02.png?raw=true)

### 📒 Conv2D
* Conv2D is generally used on Image data. 
* It is called 2 dimensional CNN because the kernel slides along 2 dimensions on the data as shown in the following image.

![Conv2D](https://github.com/rasvob/VSB-FEI-Deep-Learning-Exercises/blob/main/images/dl_04_conv2d.png?raw=true)


### 📒 Conv1D
* The following plot illustrates how the kernel moves over accelerometer data.
* Each row represents the acceleration time series for one axis.
    * The kernel can only move in one dimension, along the time axis.

![Conv1D](https://github.com/rasvob/VSB-FEI-Deep-Learning-Exercises/blob/main/images/dl_04_conv1d.png?raw=true)

# 📌 Summary
* In a 1D CNN, the kernel moves in 1 direction. Each input and output sample is 2-dimensional (steps, channels), not counting the batch axis. Mostly used on time-series data.
* In a 2D CNN, the kernel moves in 2 directions. Each input and output sample is 3-dimensional (height, width, channels). Mostly used on image data.
* In a 3D CNN, the kernel moves in 3 directions. Each input and output sample is 4-dimensional (depth, height, width, channels). Mostly used on 3D image data (MRI, video).
    * 💡 You can check https://towardsdatascience.com/understanding-1d-and-3d-convolution-neural-network-keras-9d8f76e29610 for more details
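
The 1D case is easy to reproduce by hand. The following NumPy sketch is not how Keras computes it internally; it assumes stride 1, no padding, and random weights, and slides 64 kernels of size 3 along the time axis of one 500-step, single-channel series.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

rng = np.random.default_rng(0)

time_steps, channels, filters, kernel_size = 500, 1, 64, 3
series = rng.normal(size=(time_steps, channels))            # one sample: (steps, channels)
kernel = rng.normal(size=(kernel_size, channels, filters))  # (kernel_size, in_channels, filters)
bias = np.zeros(filters)

# All windows of 3 consecutive time steps: shape (498, channels, kernel_size).
windows = sliding_window_view(series, kernel_size, axis=0)

# Each output step is the dot product of one window with every filter.
output = np.einsum("tck,kcf->tf", windows, kernel) + bias

print(series.shape, "->", output.shape)  # (500, 1) -> (498, 64)
print("weights:", kernel.size + bias.size)  # 256
```

The input and output are both 2-dimensional per sample, and the kernel only ever moves along the first (time) axis.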

# Download the FordA data

* Let's download [FordA](https://www.timeseriesclassification.com/description.php?Dataset=FordA) dataset converted for our purposes to the [Feather file format](https://arrow.apache.org/docs/python/feather.html), a binary file format for data exchange.

* 💡 The classification problem is to diagnose whether a certain symptom exists or does not exist in an automotive subsystem.
    * Each case consists of 500 measurements of engine noise and a classification.

* 💡 The data originates from the ARFF file format used in the Weka data analysis tool and has classes labeled $\{-1,1\}$
    * We will convert it to the $\{0,1\}$ set

In [None]:
train = pd.read_feather('https://github.com/rasvob/VSB-FEI-Deep-Learning-Exercises/raw/main/datasets/FordA_TRAIN.feather')
test = pd.read_feather('https://github.com/rasvob/VSB-FEI-Deep-Learning-Exercises/raw/main/datasets/FordA_TEST.feather')
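# Map the label -1 to 0; assigning the result avoids relying on an in-place update of a column view
train['target'] = train['target'].replace({-1: 0})
test['target'] = test['target'].replace({-1: 0})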
print('Train: ',train.shape)
print('Test: ', test.shape)

## ⚡ We can take a look at the data
* The data contain 500 time steps of a measurement and a single target value.
* The time series is almost normalized, so scaling or normalizing is not strictly necessary.
    * It may slightly improve the results, but that depends on your experiments.

### 🔎 What would you do if the time series was one continuous sequence?
* How would you preprocess such data and feed it into an ANN?
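
One common option, sketched here as an assumption rather than the only answer, is to cut the long recording into fixed-length, possibly overlapping windows. The signal below is a made-up ramp, and the window length and stride are arbitrary choices.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

signal = np.arange(2000, dtype=np.float32)  # stand-in for one long recording
window_length = 500                         # arbitrary choice
stride = 250                                # 50% overlap between neighbouring windows

# Every possible window, then keep each `stride`-th one.
windows = sliding_window_view(signal, window_length)[::stride]

# Add a trailing channel axis: (n_windows, window_length, 1).
windows = windows[..., np.newaxis]

print(windows.shape)  # (7, 500, 1)
```

Each window then plays the role of one row of the FordA table; how to label a window depends on the task.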

In [None]:
colors = ['b', 'g']
plt.figure(figsize=(21,9))
for idx in range(10):
  plt.plot(train.iloc[idx].iloc[:-1], c=colors[int(train.iloc[idx].iloc[-1])])  # explicit positional access; the last column is the target
plt.tight_layout()
plt.show()

In [None]:
train.head()

In [None]:
train.groupby('target').mean()

In [None]:
train.groupby('target').std()

# Check the labels balance
* 🔎 Which metrics can we use? Why?

In [None]:
train.target.value_counts()

# Convert the data into NumPy arrays and separate *X* and *y* from each other for the training and testing data.

In [None]:
train_x, train_y = train.drop(columns=['target']).values, train.target.values
test_x, test_y = test.drop(columns=['target']).values, test.target.values

## ⚡ Create a baseline model
* Let's try some simple baseline models on the data: DecisionTree and RandomForest.
    * As you will see, it is difficult for them to reach high accuracy.

In [None]:
base_models = [DecisionTreeClassifier(random_state=13), RandomForestClassifier(random_state=13)]

for model in base_models:
    model.fit(train_x, train_y)
    y_pred = model.predict(test_x)
    print(type(model).__name__)
    compute_metrics(test_y, y_pred)

## Fully connected ANN model
* Let's try a basic neural network model for this task.
    * It is a typical Dense network with three hidden layers and dropout regularization - it should be able to beat the RandomForest classifier.

In [None]:
model = keras.Sequential([
    keras.layers.Dense(256, activation='relu', input_shape=train_x[0].shape),
    keras.layers.Dropout(0.2),
    keras.layers.Dense(256, activation='relu'),
    keras.layers.Dropout(0.2),
    keras.layers.Dense(64, activation='relu'),
    keras.layers.Dense(1, activation='sigmoid')
])

model.summary()
model.compile(optimizer=keras.optimizers.Adam(), loss=keras.losses.BinaryCrossentropy(from_logits=False), metrics=[keras.metrics.BinaryAccuracy()])

In [None]:
model_checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath='weights.best.hdf5',
    save_weights_only=True,
    monitor='val_loss',
    mode='auto',
    save_best_only=True)

In [None]:
history = model.fit(train_x, train_y, validation_split=0.2, epochs=10, batch_size=32, callbacks=[model_checkpoint_callback])
show_history(history)

model.load_weights("weights.best.hdf5")
test_loss, test_acc = model.evaluate(test_x, test_y)
print('Test accuracy: ', test_acc)

# 🚀 Now we will finally use the CNN! 🙂
* To use convolution in a single dimension, we need to reshape the data into the proper format.
    * The format is the same as for RNNs: $(number\_of\_vectors, vector\_length, number\_of\_dimensions)$
        * Given the user experience for the time series analysis tasks in Tensorflow, sharing the same format among CNN and RNN must've been an accident 😅

In [None]:
train_xc = np.reshape(train_x, (*train_x.shape, 1))
test_xc = np.reshape(test_x, (*test_x.shape, 1))
train_xc.shape, test_xc.shape

## Let's try a single convolution layer as an input mapping
* It generates a huge number of weights for the Dense layers after flattening

* The results are far from excellent
    * 🔎 Why?
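
The "huge number of weights" can be worked out by hand. This sketch assumes the Keras defaults for `Conv1D` (stride 1, no padding) and the layer sizes of the model defined in the next cell.

```python
time_steps, in_channels = 500, 1
kernel_size, filters, dense_units = 3, 64, 64

# Without padding, a kernel of size 3 fits into 500 steps at 498 positions.
conv_output_steps = time_steps - kernel_size + 1
flattened_features = conv_output_steps * filters

conv_params = kernel_size * in_channels * filters + filters
dense_params = flattened_features * dense_units + dense_units

print(conv_output_steps, flattened_features)  # 498 31872
print(conv_params, dense_params)              # 256 2039872
```

Almost all of the parameters sit in the first Dense layer, while the convolution itself has only 256.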

In [None]:
model = keras.Sequential([
    keras.layers.Conv1D(64, kernel_size=3, activation='relu', input_shape=train_xc[0].shape),
    keras.layers.Flatten(),
    keras.layers.Dense(64, activation='relu'),
    keras.layers.Dense(1, activation='sigmoid')
])

model.summary()
model.compile(optimizer=keras.optimizers.Adam(), loss=keras.losses.BinaryCrossentropy(from_logits=False), metrics=[keras.metrics.BinaryAccuracy()])

In [None]:
model_checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath='weights.best.hdf5',
    save_weights_only=True,
    monitor='val_loss',
    mode='auto',
    save_best_only=True)

In [None]:
# Use the reshaped (samples, steps, 1) arrays that match the Conv1D input shape
history = model.fit(train_xc, train_y, validation_split=0.2, epochs=10, batch_size=32, callbacks=[model_checkpoint_callback])
show_history(history)

model.load_weights("weights.best.hdf5")
test_loss, test_acc = model.evaluate(test_xc, test_y)
print('Test accuracy: ', test_acc)

## A slightly more complicated model is able to beat all previous models with a smaller number of weights

In [None]:
model = keras.Sequential([
    keras.layers.Conv1D(64, kernel_size=3, activation='relu', input_shape=train_xc[0].shape),
    keras.layers.Conv1D(64, kernel_size=3, activation='relu'),
    keras.layers.MaxPool1D(2),
    keras.layers.Conv1D(64, kernel_size=3, activation='relu'),
    keras.layers.Conv1D(64, kernel_size=3, activation='relu'),
    keras.layers.Flatten(),
    keras.layers.Dense(64, activation='relu'),
    keras.layers.Dense(1, activation='sigmoid')
])

model.summary()
model.compile(optimizer=keras.optimizers.Adam(), loss=keras.losses.BinaryCrossentropy(from_logits=False), metrics=[keras.metrics.BinaryAccuracy()])

In [None]:
model_checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath='weights.best.hdf5',
    save_weights_only=True,
    monitor='val_loss',
    mode='auto',
    save_best_only=True)

In [None]:
# Use the reshaped (samples, steps, 1) arrays that match the Conv1D input shape
history = model.fit(train_xc, train_y, validation_split=0.2, epochs=10, batch_size=32, callbacks=[model_checkpoint_callback])
show_history(history)

model.load_weights("weights.best.hdf5")
test_loss, test_acc = model.evaluate(test_xc, test_y)
print('Test accuracy: ', test_acc)

## An even more capable model with more pooling layers, but with about 1/4 of the weights of the previous model, is able to achieve more than 90% accuracy.
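
The weight comparison can be checked without building the models. The helper below is a hypothetical parameter counter written for this illustration; it assumes stride-1 convolutions without padding and non-overlapping pooling, and describes the two deeper architectures from this section as simple tuples.

```python
def count_params(layers, steps=500, channels=1):
    """Count weights for a stack of ('conv', filters, kernel), ('pool', size), ('dense', units)."""
    total, flat = 0, None
    for layer in layers:
        kind = layer[0]
        if kind == "conv":
            _, filters, kernel = layer
            total += kernel * channels * filters + filters
            steps, channels = steps - kernel + 1, filters
        elif kind == "pool":
            steps //= layer[1]
        elif kind == "dense":
            if flat is None:  # first Dense layer sees the flattened conv output
                flat = steps * channels
            total += flat * layer[1] + layer[1]
            flat = layer[1]
    return total

conv = ("conv", 64, 3)
pool = ("pool", 2)
head = [("dense", 64), ("dense", 1)]

previous_model = [conv, conv, pool, conv, conv] + head
pooled_model = [conv, pool, conv, pool, conv, pool, conv] + head

previous_params = count_params(previous_model)
pooled_params = count_params(pooled_model)
print(previous_params, pooled_params, round(pooled_params / previous_params, 3))
# 1036865 275009 0.265
```

The extra pooling layers shorten the sequence before flattening, which is where most of the savings come from.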

In [None]:
model = keras.Sequential([
    keras.layers.Conv1D(64, kernel_size=3, activation='relu', input_shape=train_xc[0].shape),
    keras.layers.MaxPool1D(2),
    keras.layers.Conv1D(64, kernel_size=3, activation='relu'),
    keras.layers.MaxPool1D(2),
    keras.layers.Conv1D(64, kernel_size=3, activation='relu'),
    keras.layers.MaxPool1D(2),
    keras.layers.Conv1D(64, kernel_size=3, activation='relu'),
    keras.layers.Flatten(),
    keras.layers.Dense(64, activation='relu'),
    keras.layers.Dense(1, activation='sigmoid')
])

model.summary()
model.compile(optimizer=keras.optimizers.Adam(), loss=keras.losses.BinaryCrossentropy(from_logits=False), metrics=[keras.metrics.BinaryAccuracy()])

In [None]:
model_checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath='weights.best.hdf5',
    save_weights_only=True,
    monitor='val_loss',
    mode='auto',
    save_best_only=True)

In [None]:
# Use the reshaped (samples, steps, 1) arrays that match the Conv1D input shape
history = model.fit(train_xc, train_y, validation_split=0.2, epochs=10, batch_size=32, callbacks=[model_checkpoint_callback])
show_history(history)

model.load_weights("weights.best.hdf5")
test_loss, test_acc = model.evaluate(test_xc, test_y)
print('Test accuracy: ', test_acc)

# ✅  Tasks for the lecture (2p)

1) Choose any of the models from [Keras pre-trained models](https://keras.io/api/applications/) - **(1p)**

    * Investigate its architecture
    * Search for the input shape the model needs - remember to preprocess the data and call the correct `preprocess_input` function
        * 💡 There could be more variants of the model, the choice depends on you
    * Use the selected model for CIFAR-10 classification, 
        * Fine-tune it, experiment with it and write down your conclusions!
    
2) Define your own model for the FordA data task  - **(1p)**

    * Try to beat defined models or have at least the same accuracy score
        * 💡 You can also try to minimize the number of parameters for having approx. the same accuracy as we do!
    * Experiment with the model and write down your conclusions!