# 4. Using ANN for image recognition

In this class, we will combine the knowledge we have gained so far, both about images and about ANN classification, and use it for more practical and fun applications. Let's learn how to use ANNs to classify images.

We will use a [dataset](https://www.kaggle.com/datasets/jorgebuenoperez/datacleaningglassesnoglasses) from [Kaggle](https://www.kaggle.com) consisting of the faces of more than 4000 people, some of them wearing glasses and some not. Let's develop an ANN model that can tell the two groups apart.

The first step is to gain a bit more knowledge about ANNs and learn which type of neural network is particularly good for images. But before we do this, let's re-create some of the functions we wrote at the end of the previous class:

In [None]:
import matplotlib.pyplot as plt
import numpy as np

def table(reference, predicted):
    """ computes contingency table for predicted and reference class label indices """
    indices = reference.unique()
    n = len(indices)
    ct = np.zeros((n, n))
    for i in range(n):
        ni = sum(reference == indices[i])
        for j in range(n):
            ct[i, j] = sum((reference == indices[i]) & (predicted == indices[j])) / ni

    return ct

def ct_heatmap(ct, classes):
    """ shows heatmap for the table """
    plt.imshow(ct, clim = [0, 1])
    plt.colorbar()

    n = len(classes)
    plt.gca().set_xticks(range(n), classes)
    plt.gca().set_yticks(range(n), classes)
    for i in range(n):
        for j in range(n):
            plt.text(i, j, round(ct[j, i], 3), color = "white" if ct[j, i] < 0.5 else "black")

As you may have noticed, we did not copy the methods `train()` and `predict()` here because we are going to learn a new way of training and will write new functions for that.

Now let's learn some theory.

## Convolutional neural networks 


Convolutional Neural Networks (CNNs) are a type of Artificial Neural Networks (ANNs) designed specifically for image classification tasks. CNNs are inspired by the human visual system and are particularly effective for processing grid-like data, such as images (containing pixels).  

Why not use the simple ANNs we learned in the previous class? Because they need features as inputs, that is, measurements which are relevant for classification. In the case of Iris data, we used geometric measurements of the flowers, because they are indeed different for various Iris species.

In the case of images, we do not have features, we have images. And what we need is to take the images and compute relevant features that will be used for classification. This is exactly how CNN works.

To start, let’s first have a look at the main structure of the CNNs architecture and its main components. A general illustration of a convolutional neural network is shown below:

<img src="./images/CNN-Architecture.png" style="width:800px; height:350px;"/>

A CNN architecture consists of two blocks. The first block is needed to create features from images, and it usually consists of the following components:

* **Input Layer:** the network takes an image as an input, so the input is 2D (for grayscale images) or 3D (for color images), not just a vector of values like in the network we used in the previous class. 

* **Convolutional Layers:** these layers are the core building blocks of CNNs. They apply different filters (like the ones we learned in the first class) to the input images in order to reveal various patterns and features, such as edges, contours, bright and dark spots, textures, or more complex structures.

* **Activation Function:** like in simple ANN, activation functions can also be applied to the output of convolutional layers. They introduce non-linearity, allowing the network to learn more complex relationships in the data.   

* **Pooling Layer:** pooling layers are needed to reduce the size of the feature maps created by convolutional layers. They summarize small neighborhoods of each feature map, keeping only the most important information.

These layers have a special property: they take images as input (not necessarily images, it can be any 3-way array that has width, height, and a number of channels or slices) and produce images as output.

The next block of layers works only with numeric tabulated data, like the data we used for Iris classification in the previous class. This means that the output from the previous block must be reshaped from 3D to a simple vector of numbers in order to proceed. This is done by a special operation called *flattening* (or *unfolding*). After that, the information is sent to the next layers:

* **Fully Connected Layers:** they are exactly the same as the layers we used in the previous class for the Iris classification. They take a set of numbers as input and produce a set of numbers as output, and they are usually followed by an activation function.

* **Output Layers:** again, same as we used before — they collect outputs from the fully connected layers and narrow them down to one or several final outputs.

One can say that convolutional and pooling layers are needed to construct different features that represent various properties of the image (e.g., the presence of vertical, horizontal, or diagonal lines, circles, etc.), while the fully connected layers and the output layer use these features to do the classification, like in the case of Iris data.

This is a simple data flow in a typical CNN:

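$\text{Input} \rightarrow \text{Convolution} \rightarrow \text{Pooling} \rightarrow \text{Flattening} \rightarrow \text{Fully Connected Layers} \rightarrow \text{Output}$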

Now let's learn more about the new layers.



### Convolution

What is a *convolution*? Well, you already used convolution in the first class when you tried different filters for images. The operation of computing the intensity of the final pixels based on the linear combination of their neighbors and the weights of the filter is called a *convolution*.

In the case of CNN, convolution is a way to compute different features for image pixels. Let's recall how images can be represented as a matrix with numbers:

<img src="./images/Original Image-Pixels.png" style="width:800px; height:350px;"/>

In this case, instead of 0 and 255, we use values of 0 (for black) and 1 (for white) just for the sake of simplicity.

In order to apply a convolution to this simple image, we need to define a *filter* (also known as a *kernel*) — a small grid of numbers that will slide from one pixel to another. These numbers are weights, which are used to compute a weighted sum of the original pixel intensities. The result of this computation is the new value, which we call a *feature*.

By applying this operation to every pixel of the original (input) image, we create a *feature map*. You can think of the feature map as another image based on the original one. Here is an example:

<img src="./images/Filters-General.png" style="width:800px; height:1200px;"/>

And here is an animated illustration of this process, also for a 3x3 filter:

<img src="./images/Image-Kernel-Filter.gif" style="width:800px; height:300px;"/>

This is exactly what a convolutional layer does: it creates feature maps.
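
The sliding-window computation described above can be sketched in a few lines of NumPy. This is only an illustration: the function name `convolve2d`, the 5x5 image, and the vertical-edge filter are made up for this example, and the filter moves one pixel at a time without any extra border. Like the convolutional layers in PyTorch, the sketch does not flip the kernel.

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide the kernel over the image and return the feature map."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            # part of the image currently covered by the filter
            window = image[r:r + kh, c:c + kw]
            # weighted sum of pixel intensities gives one feature value
            feature_map[r, c] = np.sum(window * kernel)
    return feature_map

# hypothetical 5x5 image: white vertical bar (1) on black background (0)
image = np.array([
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
], dtype=float)

# hypothetical 3x3 filter which reacts to vertical edges
kernel = np.array([
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
], dtype=float)

feature_map = convolve2d(image, kernel)
print(feature_map)
```

Each row of the resulting 3x3 feature map is `[3, 0, -3]`: a positive value where the window has the bar on its right side, zero where the bar is in the middle, and a negative value where the bar is on the left.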

At this step, it is recommended to open the third sheet of the [Excel workbook](../mlcourse.xlsm) and play with the filtering/convolution example to refresh how it works. Try, e.g., applying the filter from the illustration above.

What makes this procedure different in the case of CNN is that we do not define the weights of the filters, only their size. The weights are automatically created during the learning process. In other words, CNN automatically generates features that are best suited for classification or any other task. It literally learns from the data; we just define the number of layers and the size of the kernels.

Once the convolution is done, the network applies an activation function to the feature maps, as we also did for our simple model for Iris classification. For example, if ReLU is used as an activation function, it will replace all negative values in the feature map with 0, as shown below.

<img src="./images/Filter-Activation2.png" style="width:800px; height:250px;"/>
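
As a small numeric check of what ReLU does, here is a sketch with a made-up feature map (the values are hypothetical and not taken from the illustration):

```python
import numpy as np

# hypothetical feature map with positive and negative values
feature_map = np.array([
    [3.0, 0.0, -3.0],
    [1.5, -0.5, 2.0],
])

# ReLU keeps positive values and replaces negative ones with 0
activated = np.maximum(feature_map, 0)
print(activated)
```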

#### Stride and padding

In addition to the filter size, the convolutional layer has two other important parameters — *stride* and *padding*. 

**Stride** is the step with which the filter moves across the image. If the stride is equal to one, the filter window simply moves from one pixel to the next. If the stride is larger, it jumps with a bigger step.

Here is an illustration from Wikimedia showing convolution with stride equal to 2, so it processes every second row (2 and 4 in this case) and every second column (2 and 4 as well):

<img src="images/strides.gif">

**Padding** is needed to process all pixels and avoid size reduction after filtration. As you can see in the example above, a 3x3 filter cannot be centered at row 1 or column 1. The boundary pixels are not processed because, in this case, part of the filter would be outside of the image. To avoid this situation, one can add padding around the original image (usually filled with zeros).

Here is an illustration from Wikimedia where convolution still works with stride = 2, but this time padding is added, so the filter takes rows 1, 3, and 5 and the same columns:

<img src="images/padding-strides.gif">

Try to implement padding for the filtration example in the Excel workbook.
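
The effect of stride and padding on the size of the feature map can also be checked with a small helper. This is a sketch: the function name `conv_output_size` is hypothetical, and the formula is the standard one for a convolution without dilation, applied to one dimension (rows or columns) at a time.

```python
def conv_output_size(n, kernel_size, stride=1, padding=0):
    """Number of output rows (or columns) for an input with n rows (or columns)."""
    return (n + 2 * padding - kernel_size) // stride + 1

# 5x5 image and 3x3 filter, as in the illustrations above
size_plain = conv_output_size(5, 3)                         # stride 1, no padding
size_stride = conv_output_size(5, 3, stride=2)              # rows 2 and 4
size_padded = conv_output_size(5, 3, stride=2, padding=1)   # rows 1, 3, and 5

# 3x3 filter with stride 1 and padding 1 keeps the image size
size_same = conv_output_size(256, 3, stride=1, padding=1)

print(size_plain, size_stride, size_padded, size_same)
```

With these settings the helper gives 3, 2, 3, and 256 rows, which agrees with the two animations: two positions per dimension with stride 2 and no padding, three positions once a one-pixel padding is added.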


### Pooling


After the activation function is applied to the convolution results, the network continues with pooling. 

Pooling maintains important information while discarding less relevant details, creating spatial hierarchies of features. The most common form of pooling is "max pooling", which keeps the biggest value inside a pooling window. If we take a pooling window of size 2 by 2 and apply it to the activated feature map we produced together at the previous step, the result will look as follows:

<img src="./images/Pooling-Main.png" style="width:800px; height:1000px;"/>
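
Max pooling with non-overlapping windows can be sketched in NumPy as follows. The function name `max_pool2d` and the 4x4 feature map are hypothetical and serve only as an illustration.

```python
import numpy as np

def max_pool2d(feature_map, size=2):
    """Keep the biggest value inside every non-overlapping size x size window."""
    h, w = feature_map.shape
    # drop rows and columns which do not fill a complete window
    h, w = h - h % size, w - w % size
    blocks = feature_map[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))

# hypothetical 4x4 activated feature map
feature_map = np.array([
    [1, 3, 0, 2],
    [4, 2, 1, 0],
    [0, 0, 5, 6],
    [1, 2, 7, 8],
], dtype=float)

pooled = max_pool2d(feature_map)
print(pooled)
```

The 4x4 map shrinks to 2x2, and each value of the result is the maximum of the corresponding 2x2 window: 4 and 2 in the first row, 2 and 8 in the second.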

### Fully connected layers

After pooling, the feature maps are unfolded, or *flattened*, so they look like vectors of numbers (like in tabulated data) and are transferred to a set of fully connected layers, followed by the output layer. The fully connected layer is similar to what we had in the case of Iris data: it consists of linear neurons and an activation function:

<img src="./images/Fully connected.png" style="width:800px; height:400px;"/>

## Implementation of CNN in PyTorch


Let's build a simple CNN network for the classification of color images using PyTorch. 

We will assume that the input image has only three channels (RGB). The first convolutional layer will take a three-channel image and produce four different feature maps, like applying four different filters to the same image. Then we will apply an activation function and pool the features using a 2x2 max pooling layer. 

Because we will also add padding, the convolutional layer will produce feature maps of the same size as the original images. After the pooling layer, however, the width and height will be halved. If the original images have a size of 256x256 pixels, after pooling, we will get 4 feature maps with 128x128 pixels each.

After that, we add the second convolutional layer, which will apply its filters to the feature maps produced and pooled in the previous steps. It will take the four feature maps from the previous layer and produce eight new feature maps as a result. 

The feature maps will also be pooled, so after pooling we will get 8 feature maps with 64x64 pixels each (for an image of size 256x256). After flattening, this gives a vector with 8 × 64 × 64 = 32,768 feature values, which is sent to the fully connected layers.
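
The sizes mentioned above can be verified by sending a random tensor through the same sequence of layers. This is only a sketch: the layers are created as separate objects with untrained weights, and the input is random noise of the assumed size 256x256 instead of a real image.

```python
import torch
import torch.nn as nn

# the same layer settings as described in the text
conv1 = nn.Conv2d(in_channels=3, out_channels=4, kernel_size=3, stride=1, padding=1)
conv2 = nn.Conv2d(in_channels=4, out_channels=8, kernel_size=3, stride=1, padding=1)
pool = nn.MaxPool2d(2, 2)

# batch with one random RGB "image" of size 256x256
x = torch.rand(1, 3, 256, 256)

with torch.no_grad():
    x = pool(torch.relu(conv1(x)))
    shape_after_first_block = tuple(x.shape)

    x = pool(torch.relu(conv2(x)))
    shape_after_second_block = tuple(x.shape)

    x = torch.flatten(x, 1)
    flattened_shape = tuple(x.shape)

print(shape_after_first_block)   # (1, 4, 128, 128)
print(shape_after_second_block)  # (1, 8, 64, 64)
print(flattened_shape)           # (1, 32768)
```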

Here is the full code of our network:

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F


class ImageClassifier(nn.Module):
    def __init__(self, width, height):
        super().__init__()

        # first convolutional layer
        # - takes 3 channel image (RGB) and produces 4 channels (maps) with features
        # - it uses kernel of size 3x3 and adds a pad of 1 pixel around to keep the same size
        self.conv1 = nn.Conv2d(in_channels=3, out_channels=4, kernel_size=3, stride=1, padding=1)        # convolutional layer 1
        # max pooling layer
        # - has size of 2x2 so it halves the width and height of each feature map
        self.pool = nn.MaxPool2d(2, 2)
        # second convolutional layer
        # - takes 4 channels (feature maps) and produce 8 channels with features
        # - it uses kernel of size 3x3 and adds a pad of 1 pixel around to keep the same size
        self.conv2 = nn.Conv2d(in_channels=4, out_channels=8, kernel_size=3, stride=1, padding=1)         # convolutional layer 2

        # set of fully connected layers, number of inputs in this case depends on width
        # and height of the original image (each is reduced by a factor of 4 by the two pooling layers)
        self.fc1 = nn.Linear(8 * (width // 4) * (height // 4), 1024)
        self.fc2 = nn.Linear(1024, 256)
        # the output layer in this case has two outputs — one for each class
        # the classification decision will be made by taking the biggest of the
        # two output values
        self.fc3 = nn.Linear(256, 2)

    def forward(self, x):
        x = F.relu(self.conv1(x))    # send input to 1st convolutional layer and ReLU
        x = self.pool(x)             # pool the feature maps from previous layer
        x = F.relu(self.conv2(x))    # send pooled features to 2nd convolutional layer and ReLU
        x = self.pool(x)             # pool the feature maps again

        x = torch.flatten(x, 1)      # flatten the outputs from the previous layer
        x = F.relu(self.fc1(x))      # send flattened features to the 1st linear layer + ReLU
        x = F.relu(self.fc2(x))      # send output of the 1st layer to the 2nd linear layer + ReLU
        y_hat = self.fc3(x)          # send output of the 2nd layer to the output layer
        return y_hat

Let's initialize the model for images of 256x256 pixels and look at the summary:

In [None]:
from torchinfo import summary

model = ImageClassifier(256, 256)
summary(model)

For example, the first convolutional layer should have 12 kernels of size 3x3 (one kernel for each combination of input and output channel) plus four bias values, which gives: $3 \times 4 \times 3 \times 3 + 4 = 112$ parameters.

For the second we have: $4 \times 8 \times 3 \times 3 + 8 = 296$ parameters.

And so on. In total, the model has more than 30 million parameters to train!
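
The numbers can be reproduced with plain arithmetic. In the sketch below the helpers `conv_params` and `linear_params` are hypothetical names, and the layer sizes assume 256x256 input images, as in the model above.

```python
def conv_params(in_channels, out_channels, kernel_size):
    """Weights of all kernels plus one bias per output channel."""
    return in_channels * out_channels * kernel_size * kernel_size + out_channels

def linear_params(n_inputs, n_outputs):
    """One weight per input-output pair plus one bias per output."""
    return n_inputs * n_outputs + n_outputs

counts = {
    "conv1": conv_params(3, 4, 3),
    "conv2": conv_params(4, 8, 3),
    "fc1": linear_params(8 * 64 * 64, 1024),
    "fc2": linear_params(1024, 256),
    "fc3": linear_params(256, 2),
}
total = sum(counts.values())

for name, n in counts.items():
    print(f"{name}: {n:,}")
print(f"total: {total:,}")
```

Almost all parameters belong to the first fully connected layer, because every one of the 32,768 flattened feature values is connected to each of its 1024 neurons.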

## Classification of faces


Now let's learn how we can use the CNN we have just defined for a real dataset.


### Load images as dataset

First of all, it is worth mentioning that PyTorch has an additional library, `torchvision`, which helps to load images as datasets, assign labels to them, etc. This library also provides a module, `transforms`, which can apply various transformations like the ones we used in the first class: crop, resize, etc.

For CNN it is important that:

1. All images have the same size (same number of pixels).
2. All images have the same color model or grayscale format.
3. Pixels have intensity between 0 and 1.
4. All images are PyTorch tensors.

All of this can be achieved by defining a sequence of transformations from the `torchvision.transforms` module, which will be applied automatically to each image.

Let's look at the following code:

In [None]:
from torchvision import datasets, transforms

# define same size for all images
img_width = 256
img_height = 256

# define path to folder with images for each subset
image_path = "dataset"

# transformation sequence
transform = transforms.Compose([
    transforms.Resize([img_height, img_width]), # resize to the same (height, width) for each image
    transforms.ToTensor()                       # convert to PyTorch tensor
])

# create a dataset based on the image folder structure and defined transformation
dataset = datasets.ImageFolder(root=image_path, transform=transform)
dataset

First, we load two modules from `torchvision`. Module `datasets` contains classes which can load images from disk, assign labels, transform them into Torch tensors, and combine them into a dataset. Module `transforms`, as mentioned already, contains methods for the transformation of images.

Then we define the width and height of the target images in pixels. All images will be rescaled to this size. After that, we define the location of the images on disk (path to folder).

If you open the folder `dataset` you can see that inside this folder there are two others. Folder `glasses` contains images of faces with glasses, and folder `noglasses` contains face images without glasses. Check several images from each folder.

The class `ImageFolder` that we use to load the images "knows" about this. It assigns all images from the first subfolder the class label `"glasses"` and all images from the second subfolder the class label `"noglasses"`.

Let's investigate the dataset a little more:

In [None]:
# shows list of classes
dataset.classes

In [None]:
# numeric labels for each class
dataset.class_to_idx

In [None]:
# show number of elements in the dataset
len(dataset)

In [None]:
# list of first 10 images
dataset.samples[0:10]

As you can see, inside the `dataset` we do not have tensors with image pixels, but just a full path to every image and a numeric label, which is connected to the text class label. The images will be loaded during the training and prediction processes. This approach helps to save your computer memory.

Let's show some of the images using the PIL library we learned in the first class:

In [None]:
img_indices = range(0, 4000, 200)
list(img_indices)
list(range(len(img_indices)))

img_indices[2]

In [None]:
from PIL import Image
import matplotlib.pyplot as plt
import numpy as np

# take every 200th of the first 4000 images
img_indices = range(0, 4000, 200)

plt.figure(figsize = (10, 8))

for i in range(len(img_indices)):
    path, class_ind = dataset.samples[img_indices[i]]
    img = Image.open(path)

    plt.subplot(4, 5, i + 1)
    plt.imshow(np.array(img))
    plt.axis('off')
    plt.title(dataset.classes[class_ind])


Finally, let's split the whole dataset into training, validation, and test sets. This time we will do it randomly, using the function `random_split` from PyTorch.

An ideal split would be to take 70% for training, 20% for validation, and 10% for testing. But our dataset is large (4000+ images), and using even 70% of it for training would make this process very long unless we run it on a powerful computer with a GPU.

To save time, we will take only 20% (800+) of the images for training, 10% for validation, and 10% for testing. However, because the subset sizes passed to `random_split` must sum up to the size of the whole dataset, we will create a fourth subset, `rest_set`, which we will simply not use.

Later, when you learn all the content, try to increase the training set and see how it affects the model quality. In general, the more data you have, the more efficient the training process will be.

In [None]:
nall = len(dataset)
ntrain = int(nall * 0.20)
ntrain

In [None]:
from torch.utils.data import random_split

nall = len(dataset)
ntrain = int(nall * 0.20)
nval = int(nall * 0.10)
ntest = int(nall * 0.10)
nrest = nall - ntrain - nval - ntest


# we need a seed here as well because of random splits
torch.manual_seed(12)
[train_set, val_set, test_set, rest_set] = random_split(dataset, (ntrain, nval, ntest, nrest))
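
To see what `random_split` returns without loading any images, here is a sketch with a toy dataset of 100 items. The toy tensors and variable names are made up for this illustration, and the seed is passed through a `torch.Generator` object instead of the global `torch.manual_seed()` call, which is an alternative way to make the split reproducible.

```python
import torch
from torch.utils.data import TensorDataset, random_split

# hypothetical toy dataset: 100 items with one value and one label each
toy_dataset = TensorDataset(torch.arange(100).float().unsqueeze(1), torch.arange(100) % 2)

n = len(toy_dataset)
ntrain = int(n * 0.20)
nval = int(n * 0.10)
ntest = int(n * 0.10)
nrest = n - ntrain - nval - ntest

# generator with a fixed seed makes the split reproducible
generator = torch.Generator().manual_seed(12)
parts = random_split(toy_dataset, [ntrain, nval, ntest, nrest], generator=generator)

sizes = [len(part) for part in parts]
print(sizes)

# every item of the original dataset ends up in exactly one subset
all_indices = [i for part in parts for i in list(part.indices)]
print(len(set(all_indices)) == n)
```

Each subset only stores indices of the items it contains, so splitting does not copy the data.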


### Training the CNN model

The training process for the CNN model is almost identical to the one we used in the previous class. We will introduce two differences, though.

The first important difference is that we will not send all images at once to the training process. It would take a lot of memory and computational power. Instead, we will do it in small portions — *batches*. 

So we will split all training and validation sets into batches and make a loop, so it takes images from the first batch and trains the model using this batch. Then it takes images from the second batch, trains the model with the second batch, and so on.

This way of training is more efficient and is also a bit faster. Speed is important because in this case our model is very large and the images contain a lot of pixels, so the training process will be much slower than in the case of Iris classification.

In order to use batches, PyTorch has a special class called `DataLoader`. So let's create loaders for the training and validation sets.

In [None]:
# import data loader class
from torch.utils.data import DataLoader

# define how many images will be in one batch
batch_size = 10

# create data loaders with this batch size
train_loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_set, batch_size=batch_size, shuffle=False)

As you can see, there is a special parameter, `shuffle`, which is set to `True` for the training set loader. This means that images will be sorted into batches randomly. This is very important as it helps to avoid situations when, for example, all images in one batch contain faces without glasses, so the model cannot learn anything from such a batch.
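
How a loader cuts a dataset into batches can be checked on a toy example. The sketch below uses random tensors instead of real images; the dataset size of 25 and the 8x8 image size are arbitrary assumptions.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# hypothetical toy dataset: 25 random RGB "images" of size 8x8 with random labels
toy_images = torch.rand(25, 3, 8, 8)
toy_labels = torch.randint(0, 2, (25,))
toy_set = TensorDataset(toy_images, toy_labels)

loader = DataLoader(toy_set, batch_size=10, shuffle=True)

# number of batches and shape of the inputs in each batch
print(len(loader))
batch_shapes = [tuple(inputs.shape) for inputs, labels in loader]
print(batch_shapes)
```

With 25 items and a batch size of 10, the loader produces three batches, and the last one contains only the remaining 5 items.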

The batch size is another (together with the learning rate) important setting which can influence the training quality, so it makes sense to vary it a little if the quality of the trained model is not satisfactory. 

Settings like these, which cannot be optimized automatically and which you have to choose yourself, are called *hyperparameters*. So any model has *parameters* (such as weights and biases of the neurons), which are estimated automatically during the learning process, and *hyperparameters* (such as learning rate, batch size, number of epochs, etc.), which must be tuned manually by a data scientist.

Here is the code that implements batch learning (we again wrap it in a function, as in the previous class). We use the same optimizer and the same loss function as in the final example with Irises. The process can take 10-15 minutes, so it is a good idea to start it, make sure it works, and take a break:

In [None]:
def train_model(model, train_loader, val_loader, nepochs = 100, lr = 0.001):
    """ trains CNN model with provided data """

    # define a loss function
    loss_function = nn.CrossEntropyLoss()

    # define optimizer which will update the model parameters using the gradients
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)

    # prepare arrays for losses
    train_losses = np.zeros(nepochs)
    val_losses = np.zeros(nepochs)

    # training loop
    for epoch in range(nepochs):  # Number of training epochs

        # train
        model.train()
        train_loss = 0
        for batch_data in train_loader:
            inputs, labels = batch_data
            optimizer.zero_grad()
            labels_predicted = model(inputs)
            loss = loss_function(labels_predicted, labels)
            loss.backward()
            optimizer.step()
            train_loss = train_loss + loss.item()
        train_losses[epoch] = train_loss / len(train_loader)

        # validate
        val_loss = 0
        model.eval()
        for batch_data in val_loader:
            inputs, labels = batch_data
            labels_predicted = model(inputs)
            loss = loss_function(labels_predicted, labels)
            val_loss = val_loss + loss.item()
        val_losses[epoch] = val_loss / len(val_loader)

        # show how big the losses are at this epoch
        print(f'Epoch {epoch}, train loss: {train_losses[epoch]:.4f} - validation loss {val_losses[epoch]:.4f}')

    return train_losses, val_losses


In [None]:
# seed the random numbers generator to get reproducible results
torch.manual_seed(12)

# initialize the model
model = ImageClassifier(img_width, img_height)

# train it for 40 epochs and learning rate of 0.01
train_losses, val_losses = train_model(model, train_loader, val_loader, nepochs = 40, lr = 0.01)

Let's look at the loss values:

In [None]:
# show plot with losses
plt.plot(train_losses, label = "train")
plt.plot(val_losses, label = "val")
plt.legend()
plt.xlabel("Epochs")
plt.ylabel("Loss")

Here, we have several problems. First of all, the validation loss behaves strangely: it jumps up and down. Perhaps we need to use a smaller learning rate; we will find this out later.

The second problem is that, starting from approximately the 20th epoch, the validation loss slowly goes up, so the final model is not the best one. We will talk about how to handle this problem a bit later, but now let's check the performance of the model on the test set.

To do this, we need to write a new function for making predictions. As you remember, this time we used datasets based on image folders. These types of datasets contain paths to the images and corresponding labels, and then use data loaders to load the images from disk, preprocess them, split them into batches, and feed the model with the batches.

For making predictions, we do not need batches, but using a data loader is still handy as it automates a lot of things. What we can do is create a loader with a single batch, so all images will be in that batch, and use it to make predictions. 

Here it is:

In [None]:
def predict(model, dataset):
    """ takes a CNN model and a dataset, returns reference and predicted class label indices """
    model.eval()
    # loader with a single batch which contains all images from the dataset
    data_loader = DataLoader(dataset, batch_size = len(dataset))
    # gradients are not needed for making predictions
    with torch.no_grad():
        for inputs, labels in data_loader:
            output = model(inputs)
            _, predicted_labels = torch.max(output, 1)

    return labels, predicted_labels

As you can see, this function returns both labels and predicted labels, so we can easily reuse the other functions we created in the last class to check the model performance — the computation of the cross table and the visualization of this table as a heat map.
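
A single batch with all images can require a lot of memory for larger datasets. A possible variant, shown here only as a sketch, makes predictions batch by batch and concatenates the results. The function name `predict_in_batches`, the tiny linear model, and the random toy data are all hypothetical and are used just to keep the example self-contained.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def predict_in_batches(model, dataset, batch_size=50):
    """Return reference and predicted class label indices, processing the data in batches."""
    model.eval()
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=False)
    all_labels, all_predicted = [], []
    # gradients are not needed for predictions
    with torch.no_grad():
        for inputs, labels in loader:
            output = model(inputs)
            all_labels.append(labels)
            # index of the biggest output value is the predicted class
            all_predicted.append(output.argmax(dim=1))
    return torch.cat(all_labels), torch.cat(all_predicted)

# hypothetical toy model and data: 30 random 8x8 RGB "images" with random labels
toy_model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 2))
toy_labels = torch.randint(0, 2, (30,))
toy_set = TensorDataset(torch.rand(30, 3, 8, 8), toy_labels)

labels, predicted_labels = predict_in_batches(toy_model, toy_set, batch_size=8)
print(labels.shape, predicted_labels.shape)
```

The order of the return values is the same as in `predict()`: reference labels first, predicted labels second.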

Let's make predictions first:

In [None]:
# predict() returns the reference labels first and the predicted labels second
labels, predicted_labels = predict(model, test_set)

And visualize the performance.

In [None]:
ct = table(labels, predicted_labels)
ct_heatmap(ct, dataset.classes)

Not bad at all, right? Of course, the result will vary if you comment out the `manual_seed()` line and run the code several times, because in this case the weights will be initialized randomly, and every time you run the training process, you will get a different performance.

Now you have a new achievement — you created and trained a CNN model that can automatically recognize people with glasses. Similar models can do a more useful job, for example, detecting if a car driver uses a mobile phone while driving (you probably heard that this is illegal) or sorting different objects (for example, garbage, vegetables, or something similar).

There are still a couple of things to learn, but now it is time for an exercise:

### Exercise

In the previous class, you learned how to save a state dictionary of a model at any stage to a variable (by taking a deep copy of the dictionary). Modify the function `train_model()` in the code block below so it always results in a model with the lowest validation loss.

For example, if you run a model for 100 epochs and the lowest validation loss was obtained at epoch 67, the function will save the state of this model, and at the end of the training loop, it will load this state to the current model.

In [None]:
from copy import deepcopy

def train_model(model, train_loader, val_loader, nepochs = 100, lr = 0.001):
    """ trains CNN model with provided data """

    # define a loss function
    loss_function = nn.CrossEntropyLoss()

    # define optimizer which will update the model parameters using the gradients
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)

    # prepare arrays for losses
    train_losses = np.zeros(nepochs)
    val_losses = np.zeros(nepochs)

    # HINT: here you need to initialize two variables
    # - one will keep the best validation loss value
    # - second one will keep the state dictionary of the model you got this loss for
    best_model = None
    best_loss = 99999999999.0

    # training loop
    for epoch in range(nepochs):  # Number of training epochs

        # train
        model.train()
        train_loss = 0
        for batch_data in train_loader:
            inputs, labels = batch_data
            optimizer.zero_grad()
            labels_predicted = model(inputs)
            loss = loss_function(labels_predicted, labels)
            loss.backward()
            optimizer.step()
            train_loss = train_loss + loss.item()
        train_losses[epoch] = train_loss / len(train_loader)

        # validate
        val_loss = 0
        model.eval()
        for batch_data in val_loader:
            inputs, labels = batch_data
            labels_predicted = model(inputs)
            loss = loss_function(labels_predicted, labels)
            val_loss = val_loss + loss.item()
        val_losses[epoch] = val_loss / len(val_loader)

        # show how big the losses are at this epoch
        print(f'Epoch {epoch}, train loss: {train_losses[epoch]:.4f} - validation loss {val_losses[epoch]:.4f}')

        # HINT:
        # here you need to add a condition which will compare current model with the
        # best one you got so far. If the current model is better, you save its state as
        # the new best. You also need to save the best loss value — this is the way to
        # see if the next model will be even better
        if val_losses[epoch] < best_loss:
            best_model = deepcopy(model.state_dict())
            best_loss = val_losses[epoch]

    # HINT:
    # here you need to load the state of the best model from the loop
    # to your model object
    model.load_state_dict(best_model)

    return train_losses, val_losses

The solution:

In [None]:
from copy import deepcopy

def train_model(model, train_loader, val_loader, nepochs = 100, lr = 0.001):
    """ trains CNN model with provided data """

    # define a loss function
    loss_function = nn.CrossEntropyLoss()

    # define optimizer which will update the model parameters using the gradients
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)

    # prepare arrays for losses
    train_losses = np.zeros(nepochs)
    val_losses = np.zeros(nepochs)

    # initialize two variables which will keep the best validation loss and the best model state
    best_loss = 999999999999
    best_model_state = None

    # training loop
    for epoch in range(nepochs):  # Number of training epochs

        # train
        model.train()
        train_loss = 0
        for batch_data in train_loader:
            inputs, labels = batch_data
            optimizer.zero_grad()
            labels_predicted = model(inputs)
            loss = loss_function(labels_predicted, labels)
            loss.backward()
            optimizer.step()
            train_loss = train_loss + loss.item()
        train_losses[epoch] = train_loss / len(train_loader)

        # validate
        val_loss = 0
        model.eval()
        for batch_data in val_loader:
            inputs, labels = batch_data
            labels_predicted = model(inputs)
            loss = loss_function(labels_predicted, labels)
            val_loss = val_loss + loss.item()
        val_losses[epoch] = val_loss / len(val_loader)

        # show how big the losses are at this epoch
        print(f'Epoch {epoch}, train loss: {train_losses[epoch]:.4f} - validation loss {val_losses[epoch]:.4f}')

        # check if the validation loss is lower than the best one found so far
        # if so, save the loss value and the model state
        if val_losses[epoch] < best_loss:
            best_loss = val_losses[epoch]
            best_model_state = deepcopy(model.state_dict())

    # load the state from the best model
    model.load_state_dict(best_model_state)
    return train_losses, val_losses

After you write your function, test it using the code below:

In [None]:
# seed the random numbers generator to get reproducible results
torch.manual_seed(12)

# initialize the model
model = ImageClassifier(img_width, img_height)

# train it for 40 epochs and learning rate of 0.01
train_losses, val_losses = train_model(model, train_loader, val_loader, nepochs = 40, lr = 0.01)

In [None]:

# compute loss of the final model on validation set
loss_function = nn.CrossEntropyLoss()
val_loss = 0
model.eval()
for batch_data in val_loader:
    inputs, labels = batch_data
    labels_predicted = model(inputs)
    loss = loss_function(labels_predicted, labels)
    val_loss = val_loss + loss.item()
val_loss = val_loss / len(val_loader)

val_loss

If the loss computed in this test is equal to the smallest validation loss printed during training (so it is never larger than the loss shown for the last epoch), the function works.
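
Since we compute the average loss over a data loader in several places, it can be convenient to wrap it into a small helper. The sketch below is one possible way to do it; the name `average_loss()` and the toy data are our assumptions for illustration and are not part of the class code.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset


def average_loss(model, loader, loss_function):
    """Return the mean of the batch losses for all batches in the loader."""
    model.eval()
    total = 0.0
    with torch.no_grad():  # no gradients are needed for evaluation
        for inputs, labels in loader:
            total += loss_function(model(inputs), labels).item()
    return total / len(loader)


# toy data and model, only to show the call (hypothetical stand-ins)
torch.manual_seed(0)
toy_inputs = torch.randn(20, 4)
toy_labels = torch.randint(0, 2, (20,))
toy_loader = DataLoader(TensorDataset(toy_inputs, toy_labels), batch_size=5)
toy_model = nn.Linear(4, 2)

toy_loss = average_loss(toy_model, toy_loader, nn.CrossEntropyLoss())
print(f"{toy_loss:.4f}")
```

With the real objects the call would look like `average_loss(model, val_loader, nn.CrossEntropyLoss())`. Note that this averages the batch means, so if the last batch is smaller than the others the value differs slightly from the mean over all individual images.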

Let's visualize this by showing a plot with losses from the training process and the loss of the final model as a horizontal line. If the line touches the validation loss curve at the global minimum, your function works correctly.

In [None]:
# show plot with losses
plt.plot(train_losses, label = "train")
plt.plot(val_losses, label = "val")
plt.legend()
plt.xlabel("Epochs")
plt.ylabel("Loss")

# show the recently computed loss as horizontal line
plt.plot(plt.xlim(), [val_loss, val_loss], color = "black", linestyle = "--")
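
If you also want to know at which epoch the best model was found, NumPy can locate the minimum for you. Here is a minimal sketch with made-up loss values (they are not results from the training above):

```python
import numpy as np

# made-up validation losses, only for illustration
example_val_losses = np.array([0.69, 0.52, 0.41, 0.38, 0.40, 0.45])

# index of the smallest value is the epoch with the best model
best_epoch = int(np.argmin(example_val_losses))
best_val_loss = float(example_val_losses[best_epoch])
print(best_epoch, best_val_loss)
```

With the real `val_losses` array you can mark this epoch on the plot, for example with `plt.axvline(best_epoch, color="black", linestyle=":")`.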

### Making predictions for new images

What if we want to make a prediction for a new image? Just one single image whose real label we do not know, so we cannot use the data loader in this case. Let's take image number 2000 from the original dataset for this purpose:

In [None]:
# dataset items contain two elements - path and label, so we take the first one
new_image_path = dataset.imgs[2000][0]
new_image_path

In [None]:
from PIL import Image
img = Image.open(new_image_path)
display(img)

The person wears glasses. Let's see how to make a prediction without a data loader:

In [None]:
# manually apply all transformations to the new image
img_transformed = transform(img)

# add a batch dimension because the model does not work with single images:
# it expects a tensor of images, so if we need to feed the model one image
# we make it a tensor with one image
img_transformed = img_transformed.unsqueeze(0)

# apply the model
output = model(img_transformed)

# compute label
_, labels_predicted = torch.max(output, 1)

# show all outcomes
(output, labels_predicted, dataset.classes[labels_predicted])
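
The raw output values are not easy to interpret. If we assume that the model returns raw scores (logits), which is what `nn.CrossEntropyLoss` expects, they can be turned into probabilities with the softmax function. The sketch below uses made-up output values for one image and two classes:

```python
import torch

# made-up raw model output (logits) for a single image and two classes
example_output = torch.tensor([[-1.2, 2.3]])

# softmax along the class dimension turns logits into values that sum to one
probabilities = torch.softmax(example_output, dim=1)

# the most probable class is the same one torch.max() picks from the logits
predicted_index = int(torch.argmax(probabilities, dim=1))
print(probabilities, predicted_index)
```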

### Apply network to camera image

Now let's learn how we can use the model to make predictions in real time by taking photos with the front camera of your computer. Run the next code to get an image of your face or the face of your friend.

In [None]:
# connect to camera
import cv2
camera = cv2.VideoCapture(0)
camera.isOpened()

In [None]:
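# take a picture (repeat if necessary to get better result)
ret, frame = camera.read()

# OpenCV returns colors in BGR order, so convert them to RGB
frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
plt.imshow(frame)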

In [None]:
# close connection to the camera
camera.release()

Now use the coordinate axes of the plot to define a crop rectangle. Make sure the face is located at the center of the cropped image:

In [None]:
# left top right bottom
crop_rect = [250, 70, 474, 294]
img = Image.fromarray(frame)
img_cropped = img.crop(crop_rect)
print(img_cropped.size)
display(img_cropped)
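
If you do not want to pick the numbers by hand, a crop rectangle around the image center can also be computed. The helper below, `center_crop_box()`, is a hypothetical function written for this illustration; it only works well if the face is already in the middle of the photo. The blank 640 x 480 image is a stand-in for the camera frame.

```python
from PIL import Image


def center_crop_box(width, height, size):
    """Return (left, top, right, bottom) of a centered square with side `size`."""
    left = (width - size) // 2
    top = (height - size) // 2
    return (left, top, left + size, top + size)


# blank image with a typical webcam resolution, used only for illustration
example_img = Image.new("RGB", (640, 480))
box = center_crop_box(*example_img.size, 224)
example_cropped = example_img.crop(box)
print(box, example_cropped.size)
```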

Apply the model and see the prediction:

In [None]:
img_transformed = transform(img_cropped)

# add a batch dimension, so the model gets a tensor with one image
img_transformed = img_transformed.unsqueeze(0)
output = model(img_transformed)

# compute label
_, labels_predicted = torch.max(output, 1)

# show all outcomes
(output, labels_predicted, dataset.classes[labels_predicted])

## How to perform calculations on the GPU

You may have noticed that, in contrast to the experiments with the Iris data, training and using a model for images take much longer. This process can be sped up if you have an NVIDIA GPU in your computer (even a simple one, like the GTX 1060).

In order to do this, you need to send your model and your data to the GPU device using a special method. Below, you will find code that implements this approach. The code is versatile: if you have a GPU, it will use it, and if not, it will run on the CPU without any changes.

First of all, let's learn how to detect compatible GPU devices. In Torch, such a device is called `cuda`, after [CUDA](https://en.wikipedia.org/wiki/CUDA), the NVIDIA computing platform and library that lets you use GPU devices for calculations. Here is how to check if you have one:

In [None]:
torch.cuda.is_available()

In [None]:
torch.device("cuda")

The next code automatically detects the available device and keeps it in a separate variable. If you have CUDA, it will select it; if not, it will select your CPU instead:

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

Now let's re-write the training loop in order to use the detected device for training. Here we reuse the data loaders and model class you have created before in this class, so to make it work, you should run the previous code cells.

The code is kept as simple as possible, without a validation loop, just to give you an idea. As you can see, there are only two changes: one line where we send the model to the device (`model.to(device)`), and two lines where we do the same with the inputs and the labels (`inputs.to(device)` and `labels.to(device)`).

In [None]:
# initialize a model
model = ImageClassifier(img_width, img_height)

# define number of epochs and learning rate
nepochs = 3
lr = 0.01

# define a loss function
loss_function = nn.CrossEntropyLoss()

# define the optimizer, which updates the model weights using the computed gradients
optimizer = torch.optim.SGD(model.parameters(), lr=lr)

# NEW: send the model to device you defined earlier
model.to(device)

# training loop
for epoch in range(nepochs):  # Number of training epochs

    # train
    model.train()
    train_loss = 0
    for batch_data in train_loader:
        inputs, labels = batch_data

        # NEW: send batch data to the device as well
        inputs = inputs.to(device)
        labels = labels.to(device)

        optimizer.zero_grad()
        labels_predicted = model(inputs)
        loss = loss_function(labels_predicted, labels)
        loss.backward()
        optimizer.step()
        train_loss = train_loss + loss.item()

    # average the loss over all batches and show it
    epoch_loss = train_loss / len(train_loader)
    print(f'Epoch {epoch}, train loss: {epoch_loss:.4f}')


If you train your model on a GPU and want to make predictions, the new data (test set, validation set, or new image) must also be loaded onto the GPU first. Alternatively, you can unload your model from GPU/CUDA to CPU by running: `model.to(torch.device("cpu"))` after training.
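
The first option can be sketched as follows. The toy linear model and the random data are hypothetical stand-ins for the trained classifier and the new images; the pattern itself (send the data to the device, bring the result back to the CPU) stays the same.

```python
import torch
from torch import nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# hypothetical stand-in for a trained model that lives on the device
toy_model = nn.Linear(4, 2).to(device)
toy_model.eval()

# new data starts on the CPU and has to follow the model to the device
new_data = torch.randn(3, 4)
with torch.no_grad():
    output = toy_model(new_data.to(device))

# bring the result back to the CPU, e.g. to convert it to a NumPy array
predicted = torch.argmax(output, dim=1).cpu()
print(predicted.device)
```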

## Transfer learning

As you may have noticed, the training process, even for a relatively simple network like the one we created on a relatively small dataset, takes a long time. Using a GPU/CUDA can solve this problem, but only partially. Good models with high accuracy are much more complicated and have been trained on a much larger dataset. Can we use the benefits of the model trained by someone else?

Yes, we can, and this is exactly what *transfer learning* does. The idea is that you take a model that has already been trained on a very broad range of images and image classes. And then you fine-tune this model to make it work on your particular dataset.

Because the weights of this model are already set, you do not need to start the learning process from scratch, and this saves a lot of resources.

Torch already has several pre-trained models, including famous ones like [AlexNet](https://en.wikipedia.org/wiki/AlexNet), [VGG](https://www.kaggle.com/code/blurredmachine/vggnet-16-architecture-a-complete-guide) and many others. Models for image classification are located in `torchvision.models`. Let's load one of those ([residual network](https://en.wikipedia.org/wiki/Residual_neural_network) with 18 layers, there is also one with 50):

In [None]:
from torchvision import models
from torchinfo import summary

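# load ResNet18 together with its pre-trained weights
# (without the weights argument the model starts with random weights)
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)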
summary(model)

As you can see, the model has fewer parameters than ours, but the architecture is much more complex, which makes it more efficient and versatile (the output is actually truncated, so you do not see the full structure).

In order to use the network, we need to know how to transform (preprocess) the images and what the image size should be. Luckily, we can get the already-prepared stack of transformations, which is connected to the weights of the model. 

Here is how to get it.

In [None]:
weights = models.ResNet18_Weights.DEFAULT
transform = weights.transforms()
transform

Because the transformation stack is different from what we used before, we need to recreate the dataset, the subsets and the data loaders. We just repeat the code with the new transformation object:

In [None]:
# create a dataset based on the image folder structure and defined transformation
dataset = datasets.ImageFolder(root=image_path, transform=transform)
dataset

In [None]:
from torch.utils.data import random_split

nall = len(dataset)
ntrain = int(nall * 0.20)
nval = int(nall * 0.10)
ntest = int(nall * 0.10)
nrest = nall - ntrain - nval - ntest


torch.manual_seed(12)

[train_set, val_set, test_set, rest_set] = random_split(dataset, (ntrain, nval, ntest, nrest))
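
To see how the sizes add up, here is a small sketch with a toy dataset of 103 items (a hypothetical stand-in for the image dataset). It also shows an alternative to the global seed: passing a dedicated random generator to `random_split()`, which makes the split reproducible regardless of other code that uses random numbers.

```python
import torch
from torch.utils.data import TensorDataset, random_split

# toy dataset with 103 items (hypothetical stand-in for the image dataset)
toy_dataset = TensorDataset(torch.arange(103))

n_all = len(toy_dataset)
n_train = int(n_all * 0.20)
n_val = int(n_all * 0.10)
n_test = int(n_all * 0.10)
n_rest = n_all - n_train - n_val - n_test  # absorbs the rounding remainder

# a dedicated generator makes the split reproducible on its own
generator = torch.Generator().manual_seed(12)
subsets = random_split(toy_dataset, [n_train, n_val, n_test, n_rest], generator=generator)
sizes = [len(s) for s in subsets]
print(sizes)
```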

Now we create the data loaders:

In [None]:
# import data loader class
from torch.utils.data import DataLoader

# define how many images will be in one batch
batch_size = 10

# create data loaders with this batch size
train_loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_set, batch_size=batch_size, shuffle=False)
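
A quick way to check that a data loader does what you expect is to take one batch and look at its shape. The sketch below does this for toy data with made-up sizes; with the real `train_loader` the first dimension would be the batch size and the rest would depend on the transformation stack.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# toy "images": 25 items, 3 channels, 8 x 8 pixels (made-up sizes)
toy_images = torch.randn(25, 3, 8, 8)
toy_labels = torch.randint(0, 2, (25,))
toy_loader = DataLoader(TensorDataset(toy_images, toy_labels), batch_size=10, shuffle=False)

# take the first batch and look at the shapes
first_inputs, first_labels = next(iter(toy_loader))
n_batches = len(toy_loader)
print(first_inputs.shape, first_labels.shape, n_batches)
```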

Before training, we actually need to modify the model. The original model can classify images among 1000 classes, so it has 1000 outputs. You can see this if you check the output layer, which is named `fc` in this model:

In [None]:
model.fc

So we can adjust it by creating a new linear layer with as many inputs as the original one and only two outputs. We can replace part of the ResNet18 model. Here is how to do it:

In [None]:
# replace the output layer in the ResNet18 model by a new one with 2 outputs
in_features = model.fc.in_features
model.fc = nn.Linear(in_features, 2)

Now we are ready to train, or rather fine-tune, the model. We will start with just 10 epochs. If you have created the smart `train_model()` function in one of the exercises, it will end up with the best model found during training.

In [None]:
train_losses, val_losses = train_model(model, train_loader, val_loader, nepochs=10, lr = 0.01)

As you can see, already after 1-2 epochs the loss gets very small. Let's check the performance:

In [None]:
labels, predicted_labels = predict(model, test_set)
ct = table(labels, predicted_labels)
ct_heatmap(ct, dataset.classes)

Perfect result and much faster than training the CNN model from scratch!
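
If you want a single number in addition to the heatmap, you can compute the accuracy, which is the share of images classified correctly. The sketch below uses made-up label tensors and assumes that the reference and predicted labels are tensors of class indices:

```python
import torch

# made-up reference and predicted class indices, only for illustration
example_labels = torch.tensor([0, 0, 1, 1, 1, 0, 1, 0])
example_predicted = torch.tensor([0, 0, 1, 1, 0, 0, 1, 0])

# share of items where the prediction matches the reference
accuracy = (example_labels == example_predicted).float().mean().item()
print(f"{accuracy:.3f}")
```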

## What to do next?

If you want to continue and develop your skills further, here are some useful links.

To learn Python in a more systematic way, you can look into the following materials:

* [Python for kids](https://www.geeksforgeeks.org/python-for-kids/)
* [Google Python class](https://developers.google.com/edu/python)
* [Python tutorial at geeksforgeeks.org](https://www.geeksforgeeks.org/python-programming-language-tutorial/)
* [Scientific Python lectures](https://lectures.scientific-python.org/)

There are tonnes more, of course, and Python is probably the most popular programming language nowadays.

As for data science, machine learning, or artificial intelligence, there are hundreds of good courses as well. We recommend a [free online course](https://course.fast.ai) from FastAI, which covers all aspects of modern ML/AI including more sophisticated topics like Stable Diffusion. It is also based on Jupyter notebooks, so you will feel comfortable from the start.

Happy learning!