## Deep Convolutional Classifier using the Fashion MNIST dataset

![alt text](fashion-mnist_long.png)

In this project, we will build a convolutional neural network (hereafter CNN) to classify the fashion images in the [Fashion MNIST dataset](https://www.kaggle.com/zalando-research/fashionmnist). This dataset consists of 70000 greyscale images of 28 x 28 pixels (60000 for training and 10000 for testing). Each image belongs to one of 10 classes:

| Label number | Label |
| --- | --- |
| 0 | T-shirt |
| 1 | Trouser |
| 2 | Pullover |
| 3 | Dress |
| 4 | Coat |
| 5 | Sandal |
| 6 | Shirt |
| 7 | Sneaker |
| 8 | Bag |
| 9 | Ankle boot |

Before we proceed, we will briefly explain what a CNN is.

As explained in this [IBM article](https://www.ibm.com/topics/convolutional-neural-networks), neural networks are a subset of machine learning and are at the heart of deep learning algorithms. They are composed of layers of nodes: an input layer, one or more hidden layers, and an output layer. Each node connects to nodes in the next layer and has an associated weight and threshold, as shown below.

<figure>
    <img src="neural_networks-001.png" style="width:80%">
    <figcaption>Source: https://tikz.net/neural_networks/</figcaption>
</figure>

The basic building block of every neural network is the *artificial neuron*, which is described by a model called the *perceptron*, whose structure is:

<figure>
    <img src="activation_function.ppm" style="width:80%">
    <figcaption>Source: Google Images (2019)</figcaption>
</figure>

The idea behind this mechanism is to model the behavior of a biological neuron. A biological neuron receives information from other neurons through its dendrites, processes it in its cell body, and sends it on through its axon. To model this mechanism, computer scientists developed the perceptron, which receives data from other neurons, scales it by its *weights*, combines it with a *linear function* (a weighted sum plus a bias), and 'fires' depending on the output of its *activation function*.

<figure>
    <img src="neuron.png" style="width:80%">
    <figcaption>Source: https://en.wikipedia.org/wiki/Neuron#/media/File:Blausen_0657_MultipolarNeuron.png</figcaption>
</figure>
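The perceptron described above can be sketched in a few lines of NumPy. This is a minimal illustration, not part of the project code: the step activation and the hand-picked weights, which make the neuron behave like a logical AND, are assumptions chosen for clarity.

```python
import numpy as np

def step(z):
    """Heaviside step activation: the neuron 'fires' (1) only when its input is positive."""
    return np.where(z > 0, 1, 0)

def perceptron(x, w, b):
    """Single artificial neuron: linear function w.x + b followed by an activation."""
    z = np.dot(w, x) + b   # weighted sum of the inputs plus a bias
    return step(z)         # the activation decides whether the neuron fires

# Hand-picked (illustrative) weights that turn the neuron into a logical AND
w = np.array([1.0, 1.0])
b = -1.5
for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, int(perceptron(np.array(x), w, b)))
```

Only the input `[1, 1]` pushes the weighted sum above zero, so only that input makes the neuron fire.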

When working with image, speech, or audio signal inputs, *convolutional neural networks* typically outperform other neural networks. They have three main types of layers:
- Convolutional layer
- Pooling layer
- Fully-connected (FC) layer

The *convolutional layer* is the first layer of a convolutional network. Convolutional layers can be followed by additional convolutional layers or by *pooling layers*, while the *fully-connected* layer is the final layer. With each layer, the CNN increases in complexity, identifying larger portions of the image. Earlier layers focus on simple features, such as colors and edges. As the image data progresses through the layers of the CNN, the network starts to recognize larger elements or shapes of the object until it finally identifies the intended object.

#### Convolutional layer

The convolutional layer is the core building block of a CNN, and it is where most of the computation occurs. It requires a few components:

- **Input data**: Since in this project we work with clothing images, the input is a greyscale image. It is made up of a matrix of pixels, so it has two spatial dimensions, height and width. Because the images are greyscale, they have just one *channel*; RGB images have 3.
- **Filter**: A filter is a feature detector made of one or more *kernels* that move across the *receptive fields* of the image, checking whether a feature is present. This process is known as a *convolution*. A kernel is a two-dimensional (2-D) array of weights, and its size determines the size of the receptive field. The kernel is applied to an area of the image, and a dot product is computed between the input pixels in that area and the kernel weights. The result is written into an output array. The kernel then shifts by a *stride* and repeats the process until it has swept across the entire image. Each filter contains one kernel per input channel, and the layer has one filter per output channel.
- **Feature map**: The final output of the series of dot products between the input and the filter is known as a feature map, activation map, or convolved feature.
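The sweep described in the list above can be written as a naive NumPy loop. This is a sketch under simple assumptions (one channel, no padding, no bias); like `nn.Conv2d`, it computes a cross-correlation, i.e. the kernel is not flipped. The function name `conv2d` and the example kernel are illustrative only.

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    """Slide `kernel` over `image` (no padding) and store the dot product at each position."""
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Receptive field currently covered by the kernel
            patch = image[i*stride:i*stride + kh, j*stride:j*stride + kw]
            out[i, j] = np.sum(patch * kernel)   # dot product -> one entry of the feature map
    return out

image = np.arange(16, dtype=float).reshape(4, 4)   # toy 4x4 "image"
kernel = np.array([[1.0, 0.0], [0.0, -1.0]])        # hypothetical 2x2 diagonal-difference kernel
print(conv2d(image, kernel))             # 3x3 feature map
print(conv2d(image, kernel, stride=2))   # 2x2 feature map
```

In this toy image every pixel differs by exactly 5 from its lower-right neighbour, so every entry of the feature map is -5; increasing the stride from 1 to 2 shrinks the output from 3x3 to 2x2.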

After each convolution, a CNN applies an *activation function* to the feature map, introducing nonlinearity into the model.
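The model defined later in this notebook uses the ReLU activation (`nn.ReLU`), $\max(0, x)$, applied element-wise to each feature map. A minimal NumPy sketch with a made-up 2x2 feature map:

```python
import numpy as np

def relu(feature_map):
    """ReLU activation: keeps positive values and sets negative values to zero."""
    return np.maximum(feature_map, 0.0)

feature_map = np.array([[-2.0, 0.5], [3.0, -0.1]])
print(relu(feature_map))   # values: [[0.0, 0.5], [3.0, 0.0]]
```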

As mentioned earlier, another convolutional layer can follow the initial one. When this happens, the structure of the CNN becomes hierarchical, since later layers see the pixels within the receptive fields of earlier layers.

<a title="Vincent Dumoulin, Francesco Visin, MIT &lt;http://opensource.org/licenses/mit-license.php&gt;, via Wikimedia Commons" href="https://commons.wikimedia.org/wiki/File:Convolution_arithmetic_-_Padding_strides.gif"><img width="512" alt="Convolution arithmetic - Padding strides" src="https://upload.wikimedia.org/wikipedia/commons/0/04/Convolution_arithmetic_-_Padding_strides.gif"></a>

<figure>
    <img src="convolution_example.png" style="width:80%">
    <figcaption>Source: https://developer.nvidia.com/discover/convolution</figcaption>
</figure>

#### Pooling layer

Pooling layers, also known as downsampling layers, perform dimensionality reduction: they gradually shrink the spatial dimensions of the representation, reducing the number of values passed on to the next layer. Like the convolutional layer, the pooling operation sweeps a filter across the entire input, but this filter has no weights. Instead, it applies an aggregation function to the values within the receptive field and writes the result into the output array. There are two main types of pooling:

- **Max pooling**: As the filter moves across the input, it selects the pixel with the maximum value to send to the output array. This approach is used more often than average pooling.
- **Average pooling**: As the filter moves across the input, it calculates the average value within the receptive field to send to the output array.

<a title="Rafay Qayyum - Introduction To Pooling Layers In CNN" href="https://pub.towardsai.net/introduction-to-pooling-layers-in-cnn-dafe61eabe34"><img width="512" alt="Introduction To Pooling Layers In CNN" src="https://miro.medium.com/v2/resize:fit:828/1*fXxDBsJ96FKEtMOa9vNgjA.gif"></a>
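Both pooling types can be sketched with the same loop, changing only the aggregation function. This is an illustrative NumPy version assuming a single channel and no padding. Note that the model later in this notebook uses `nn.MaxPool2d(kernel_size_2, stride=1)`, i.e. overlapping windows, whereas the example below uses the common non-overlapping setting where the stride equals the window size.

```python
import numpy as np

def pool2d(x, size=2, stride=2, mode="max"):
    """Sweep a size x size window over x and aggregate each receptive field (no weights involved)."""
    out_h = (x.shape[0] - size) // stride + 1
    out_w = (x.shape[1] - size) // stride + 1
    agg = np.max if mode == "max" else np.mean   # max pooling or average pooling
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = agg(x[i*stride:i*stride + size, j*stride:j*stride + size])
    return out

x = np.array([[1.0, 3.0, 2.0, 0.0],
              [4.0, 2.0, 1.0, 1.0],
              [0.0, 1.0, 5.0, 6.0],
              [2.0, 2.0, 7.0, 8.0]])
print(pool2d(x, mode="max"))    # [[4, 2], [2, 8]]
print(pool2d(x, mode="mean"))   # [[2.5, 1.0], [1.25, 6.5]]
```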

Although a lot of information is lost in the pooling layer, pooling also benefits the CNN: it reduces complexity, improves efficiency, and limits the risk of overfitting.

As a final comment, in a convolutional layer each kernel of a filter is applied to its own input channel, and the resulting maps are summed to produce a single feature map. Thus, as said above, the number of filters determines the number of output channels of a convolutional layer.

<a title="Irhum Shafkat - Intuitively Understanding Convolutions for Deep Learning" href="https://towardsdatascience.com/intuitively-understanding-convolutions-for-deep-learning-1f6f42faee1#:~:text=Each%20filter%20in%20a%20convolution,a%20processed%20version%20of%20each."><img width="1024" alt="Filter's output combination" src="https://miro.medium.com/v2/resize:fit:2000/1*CYB2dyR3EhFs1xNLK8ewiA.gif"></a>
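The channel-summing step can be made explicit in NumPy. In this sketch (no bias, no padding, stride 1, all names illustrative) the weight array uses the `(out_channels, in_channels, kH, kW)` layout of PyTorch's `nn.Conv2d.weight`, so each filter holds one kernel per input channel and produces one output channel:

```python
import numpy as np

def conv_layer(x, weights):
    """Multi-channel convolution without padding or bias, stride 1.

    x:       (in_channels, H, W) input
    weights: (out_channels, in_channels, kH, kW)
    returns: (out_channels, H - kH + 1, W - kW + 1)
    """
    out_c, in_c, kh, kw = weights.shape
    H, W = x.shape[1:]
    out = np.zeros((out_c, H - kh + 1, W - kw + 1))
    for o in range(out_c):          # one filter per output channel
        for c in range(in_c):       # one kernel per input channel inside each filter
            for i in range(H - kh + 1):
                for j in range(W - kw + 1):
                    # The per-channel results are summed into a single feature map
                    out[o, i, j] += np.sum(x[c, i:i+kh, j:j+kw] * weights[o, c])
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 5, 5))           # e.g. an RGB-like input with 3 channels
weights = rng.standard_normal((4, 3, 3, 3))  # 4 filters, each with 3 kernels of size 3x3
print(conv_layer(x, weights).shape)          # (4, 3, 3)
```

The number of output channels (4) comes from the number of filters, not from the number of input channels.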

#### Fully-connected layer

Fully-connected layers do not take 2D or 3D arrays as inputs and do not perform convolutions. Instead, they take 1D arrays (for example, flattened feature maps) and apply a linear transformation to them. They are made of the perceptrons described above, and their objective is to produce the final output of the entire network.

<figure>
    <img src="fully_cnn.jpg" style="width:100%">
    <figcaption>Source: https://developersbreach.com/convolution-neural-network-deep-learning/</figcaption>
</figure>
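A fully-connected layer applied to a flattened feature map reduces to a matrix-vector product, $y = Wx + b$. The sketch below uses random, untrained weights and a hypothetical 1x14x14 feature map (the size that reaches the first linear layer of the model in this notebook); it only illustrates the shapes involved:

```python
import numpy as np

rng = np.random.default_rng(0)
feature_maps = rng.standard_normal((1, 14, 14))  # hypothetical output of the last conv layer
x = feature_maps.reshape(-1)                     # flatten to 196 values (what nn.Flatten does per sample)
W = rng.standard_normal((10, x.size)) * 0.01     # one row of weights per output neuron (untrained)
b = np.zeros(10)                                 # one bias per output neuron
scores = W @ x + b                               # linear transformation y = Wx + b
print(x.shape, scores.shape)                     # (196,) (10,)
```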

We will now implement a CNN that predicts the label of the clothing item shown in each input image. Let us start by importing the packages.

In [None]:
import torch
from torch import nn
import torch.optim as optim
from torch.utils.data import Dataset
from torch.utils.data import DataLoader
from torchvision.transforms import ToTensor
from torchvision.datasets import FashionMNIST
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import sklearn.metrics as met

# ------------------------------- Plot features ------------------------------
# Properties to decorate the plots.
plt.rcParams['axes.linewidth'] = 0.5
plt.rcParams['text.usetex'] = False
plt.rcParams['font.family'] = 'serif'
# 'New Century Schoolbook' is a serif font, so it belongs in font.serif (font.sans-serif is ignored when family='serif')
plt.rcParams['font.serif'] = ['New Century Schoolbook', 'Times', 'Liberation Serif', 'Times New Roman', 'DejaVu Serif']
plt.rcParams['font.size'] = 10
plt.rcParams['legend.frameon'] = False
plt.rcParams['legend.edgecolor'] = 'k'
plt.rcParams['legend.markerscale'] = 7
plt.rcParams['xtick.minor.visible'] = True
plt.rcParams['ytick.minor.visible'] = True
plt.rcParams['xtick.top'] = False
plt.rcParams['ytick.right'] = False
plt.rcParams['xtick.direction'] = 'in'
plt.rcParams['ytick.direction'] = 'in'
plt.rcParams['xtick.major.width']= 0.5
plt.rcParams['xtick.major.size']= 5.0
plt.rcParams['xtick.minor.width']= 0.5
plt.rcParams['xtick.minor.size']= 3.0
plt.rcParams['ytick.major.width']= 0.5
plt.rcParams['ytick.major.size']= 5.0
plt.rcParams['ytick.minor.width']= 0.5
plt.rcParams['ytick.minor.size']= 3.0
# ----------------------------------------------------------------------------

As a first step, we load the set of images, which is provided by torchvision. Before that, let us define a custom Dataset class that one-hot encodes the labels.

In [None]:
# Custom Dataset class that wraps a torchvision dataset and one-hot encodes its labels
class CustomImageDataset(Dataset):
    def __init__(self, dataset):
        super().__init__()
        self.dataset = dataset
        
    # Override __len__() to return the number of samples
    def __len__(self):
        return len(self.dataset)
    
    # Override __getitem__() to return an (image, one-hot label) pair
    def __getitem__(self, i):
        image, label = self.dataset[i]
        # One-hot encode the integer label so that it has the same shape (10,) as the model's output,
        # which nn.MSELoss needs in order to compare them
        zer = torch.zeros(10)
        zer[label] = 1.
        label = zer
        return image, label
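The manual one-hot encoding in `__getitem__` can equivalently be written by indexing an identity matrix, as in the NumPy sketch below (the helper name `one_hot` is illustrative). In PyTorch itself, `torch.nn.functional.one_hot(torch.tensor(label), num_classes=10).float()` produces the same vector.

```python
import numpy as np

def one_hot(label, num_classes=10):
    """Return a float vector with a 1 at position `label`, as CustomImageDataset does."""
    return np.eye(num_classes, dtype=np.float32)[label]

print(one_hot(3))   # [0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
```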

In [None]:
# Download training data
train_d = FashionMNIST(
    root='Dataset',
    train=True,
    download=True,
    transform=ToTensor(),
)

train_data = CustomImageDataset(train_d)

# Download test data
test_d = FashionMNIST(
    root='Dataset',
    train=False,
    download=True,
    transform=ToTensor(),
)

test_data = CustomImageDataset(test_d)

Once the Dataset objects are loaded, we wrap them in DataLoader objects, which feed the CNN with mini-batches of images (shuffled, in the case of the training set) instead of one image at a time.

In [None]:
# Size of the batch of images
batch_size = 1000

train_dl = DataLoader(train_data, batch_size=batch_size, shuffle=True)
test_dl = DataLoader(test_data, batch_size=batch_size, shuffle=False)
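With `batch_size = 1000`, each epoch processes a fixed number of batches, which is what `len(dataloader)` returns. A small sketch of that arithmetic (the helper is hypothetical; `drop_last` mirrors the `DataLoader` argument of the same name, which defaults to `False`):

```python
import math

def num_batches(n_samples, batch_size, drop_last=False):
    """Number of batches a DataLoader yields per epoch."""
    if drop_last:
        return n_samples // batch_size   # the incomplete last batch is discarded
    return math.ceil(n_samples / batch_size)

print(num_batches(60000, 1000))   # 60 training batches per epoch
print(num_batches(10000, 1000))   # 10 test batches
print(num_batches(60000, 128))    # 469: the last batch holds only 60000 - 468*128 = 96 images
```

When the batch size does not divide the dataset size, the last batch is smaller unless `drop_last=True`.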

Before proceeding, let us briefly look at **PyTorch tensors**.

In [None]:
# Tensor definition
t = torch.tensor([1.0, 3.5, -np.pi, 0.0, 9.0], dtype=torch.double)

In [None]:
print(len(t), "\n", t.size(), "\n", t.shape, "\n")

In [None]:
print(t*t, "\n", t+t, "\n", t[-1], "\n", t.view(1, 5), "\n", t.view(5, 1))

In [None]:
print(t.reshape(1, 5), "\n", t.reshape(5, 1))

In [None]:
# Some mathematical functions
print(torch.sin(t), "\n", torch.arctan(t), "\n", t.cosh(), "\n", t.exp())

In [None]:
a = torch.arange(1.0, 6.0, 1.0)
print(a, "\n")
print(torch.pow(t, 2.0), "\n", t.pow(2), "\n", t.pow(a), "\n")
a = torch.linspace(1.0, 6.0, 5)
print(a, "\n")
print(torch.pow(t, 2.0), "\n", t.pow(2), "\n", t.pow(a))

In [None]:
print(t == t, "\n", torch.ge(t, 2.0*t), "\n", t.isnan(), "\n", t.argmax())

In [None]:
# Some statistical functions
print(t.mean(), "\n", t.mode(), "\n", t.median(), "\n", t.sum())

In [None]:
print(t.numpy())

In [None]:
ran = torch.rand(2, 2, 3)
print(ran)

In [None]:
ran_int = torch.randint(0, 100, (2, 2, 3))
print(ran_int)

Coming back to the main task, we check whether CUDA is available for training. Running on a CUDA-capable GPU usually makes training much faster than on the CPU.

In [None]:
# Get cpu or gpu device for training.
device = "cuda" if torch.cuda.is_available() else "cpu"
print("Using {} device".format(device))

Now that we have loaded the dataset and wrapped it in DataLoader objects, it is time to define the model that will predict the label of each input image. As said before, the model is a CNN, and its architecture is shown below. First, as usual in machine learning, we define the hyperparameters of the CNN.

In [None]:
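# Model parameters
n_inputs = 196     # Inputs of the first linear layer: 1 channel x 14 x 14 pixels after the conv/pool stages
n_hidden = 98      # Neurons in the hidden fully-connected layer
n_outputs = 10     # One output per clothing class
in_channels = 1    # Greyscale input images
out_channels = 1   # Feature maps produced by each convolutional layer
kernel_size_1 = 7  # Kernel size of the first convolution
kernel_size_2 = 5  # Kernel size of the pooling window and of the second convolution
p_dropout = 0.1    # Dropout probability
lr = 1e-3          # Learning rate
n_epochs = 300     # Number of epochs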
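The value `n_inputs = 196` follows from the output-size formula for convolution and pooling layers, $\lfloor (n + 2p - k)/s \rfloor + 1$, for input size $n$, kernel size $k$, padding $p$ and stride $s$. A short sketch tracing a 28 x 28 image through the layers of the model, assuming no padding and no dilation (the PyTorch defaults used here):

```python
def conv_output_size(n, kernel, stride=1, padding=0):
    """Output side length of a convolution or pooling window: floor((n + 2*padding - kernel) / stride) + 1."""
    return (n + 2 * padding - kernel) // stride + 1

# Same values as the hyperparameter cell above
kernel_size_1, kernel_size_2, out_channels = 7, 5, 1

side = 28                                     # Fashion-MNIST images are 28 x 28
side = conv_output_size(side, kernel_size_1)  # Conv2d, 7x7 kernel        -> 22
side = conv_output_size(side, kernel_size_2)  # MaxPool2d, 5x5, stride=1  -> 18
side = conv_output_size(side, kernel_size_2)  # Conv2d, 5x5 kernel        -> 14
print(side, out_channels * side * side)       # 14 196, matching n_inputs
```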

In [None]:
# Model definition
class Model(nn.Module):
    # Define model elements
    def __init__(self):
        """ The super() builtin returns a proxy object (a temporary object of the superclass) that
        lets you avoid referring to the base class explicitly and gives access to the methods
        of the base class. """
        super().__init__()
        # Sequence of transformations implemented by the layers of the network
        self.cnn = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size_1),   # 1x28x28 -> 1x22x22
            nn.ReLU(),
            nn.MaxPool2d(kernel_size_2, stride=1),                 # 1x22x22 -> 1x18x18
            nn.Conv2d(out_channels, out_channels, kernel_size_2),  # 1x18x18 -> 1x14x14
            nn.ReLU(),
            nn.Flatten(),                                          # 1x14x14 -> 196
            nn.Linear(n_inputs, n_hidden),
            nn.Dropout(p_dropout),
            nn.ReLU(),
            nn.Linear(n_hidden, n_outputs),
            nn.Softmax(dim=1)                                      # probabilities over the 10 classes
        )
    
    # Method that transforms inputs into outputs through the layers of the network
    def forward(self, X):
        output = self.cnn(X)
        return output
    
# Now we can create a model and send it at once to the device
model = Model().to(device)
# We can also inspect its parameters using its state_dict() method
print(model.state_dict())

![alt text](red.JPG)
###### Architecture of the CNN

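The last layer applies the softmax function, which turns the $N = 10$ scores $x_{i}$ produced by the final linear layer into positive values that sum to 1 and can be read as class probabilities:

$$
\begin{equation}
    \mathrm{Softmax}(x_{i}) = \frac{\mathrm{e}^{x_{i}}}{\sum_{j=1}^{N}\mathrm{e}^{x_{j}}}
\end{equation}
$$

For $N = 2$, the softmax reduces to the logistic sigmoid function, shown below.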

<figure>
    <img src="sigmoid.png" style="width:80%">
    <figcaption>Source: https://www.researchgate.net/figure/A-Basic-sigmoid-function-with-two-parameters-c1-and-c2-as-commonly-used-for-subitizing_fig2_325868989</figcaption>
</figure>
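A NumPy sketch of the softmax, together with a check that for two classes it matches the sigmoid. Subtracting the maximum score before exponentiating is a standard numerical-stability trick: it does not change the result mathematically but avoids overflow for large scores.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    z = x - np.max(x, axis=-1, keepdims=True)   # shift scores so the largest is 0
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)

def sigmoid(x):
    """Logistic sigmoid function."""
    return 1.0 / (1.0 + np.exp(-x))

scores = np.array([2.0, 1.0, 0.1])
p = softmax(scores)
print(p, p.sum())   # positive probabilities that sum to 1

# With two classes, softmax([a, 0]) gives sigmoid(a) for the first class
a = 1.3
print(softmax(np.array([a, 0.0]))[0], sigmoid(a))
```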

Now that the model is defined, and before training it, let us take a look at the dataset.

In [None]:
# Labels of the clothes, based on the table given at the beginning of the notebook
labels_map = {
    0: "T-Shirt",
    1: "Trouser",
    2: "Pullover",
    3: "Dress",
    4: "Coat",
    5: "Sandal",
    6: "Shirt",
    7: "Sneaker",
    8: "Bag",
    9: "Ankle Boot",
}

# Plot a random sample of training images (train_d keeps the original integer labels)
figure = plt.figure(figsize=(5, 5), dpi=180)
cols, rows = 3, 3
for i in range(1, cols*rows + 1):
    sample_idx = torch.randint(len(train_d), size=(1,)).item()
    img, label = train_d[sample_idx]
    figure.add_subplot(rows, cols, i)
    plt.title(labels_map[label])
    plt.axis("off")
    plt.imshow(img.squeeze(), cmap="gray")
plt.tight_layout()
plt.savefig('examples.jpg')

Let us run a quick statistical analysis of the labels to check whether the dataset is imbalanced.

In [None]:
# Convert the target tensors into pandas DataFrame objects to make the computations easier
df_train = pd.DataFrame(train_d.targets.numpy(), columns = ['Labels'])
df_test = pd.DataFrame(test_d.targets.numpy(), columns = ['Labels'])

Now that we have DataFrame objects, we can count how many images there are for each label.

In [None]:
df_train.value_counts()

In [None]:
df_test.value_counts()
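The balance can also be summarized by a single number, the ratio between the most and least frequent class counts. Below is a hypothetical helper sketched in NumPy and tested on synthetic labels; on the real data it could be called as `imbalance_ratio(train_d.targets.numpy())`.

```python
import numpy as np

def imbalance_ratio(labels, num_classes=10):
    """Ratio between the most and least frequent class counts (1.0 means perfectly balanced)."""
    counts = np.bincount(labels, minlength=num_classes)
    return counts.max() / counts.min()

balanced = np.repeat(np.arange(10), 6000)                        # 6000 labels per class
skewed = np.concatenate([balanced, np.zeros(6000, dtype=int)])   # synthetic: extra class-0 labels
print(imbalance_ratio(balanced), imbalance_ratio(skewed))        # 1.0 2.0
```

The helper assumes every class appears at least once; otherwise the division by zero yields `inf`.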

The counts show that the labels are perfectly balanced (6000 images per class in the training set and 1000 per class in the test set), which makes training easier than with imbalanced classes. Note that the labels are already integer-encoded, so no label-encoding step is needed. To move forward, we define the functions that train and test the CNN.

In [None]:
# We define the training function
def train_loop(dataloader, model, loss_fn, optimizer):
    n_batches = len(dataloader)   # number of batches per epoch (not hard-coded to a batch size of 1000)
    tmp = []

    # Activate training-specific behavior such as dropout
    model.train()
    # We iterate over batches
    for batch, (X, y) in enumerate(dataloader):
        # Move the batch to the same device as the model
        X, y = X.to(device), y.to(device)
        # We calculate the model's prediction
        pred = model(X)
        # With the model's prediction we calculate the loss function
        loss = loss_fn(pred, y)

        # We apply the backpropagation method
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Training progress
        loss, current = loss.item(), batch + 1
        tmp.append(loss)
        print(f"Current batch = {current} | Loss = {loss:>7f} | Processed batches: [{current:>2d}/{n_batches:>2d}]")
    
    # Average training loss over all batches of the epoch
    loss_avg = np.mean(tmp)
    return loss_avg

# We define the test function
# (num_batches is kept for compatibility with the calls below but is not used:
#  the number of batches is counted while iterating)
def test_loop(dataloader, model, loss_fn, num_batches):
    test_loss = 0
    j = 0
    
    # Deactivate dropout during evaluation
    model.eval()
    # To test, we do not need to calculate gradients
    with torch.no_grad():
        # We iterate over batches
        for X, y in dataloader:
            X, y = X.to(device), y.to(device)
            # Model's prediction
            pred = model(X)
            # Corresponding error, which we accumulate in a total value
            test_loss += loss_fn(pred, y).item()
            j += 1
            
    # We calculate the average loss and print it
    test_loss /= j
    print(f"Test Error: Avg loss = {test_loss:>8f} \n")
    return test_loss
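The test loop reports only the average loss. A common complementary metric is accuracy, the fraction of images whose highest-probability class matches the true class. Below is a NumPy sketch with made-up predictions (the helper is not part of the original code); inside `test_loop`, the PyTorch equivalent for one batch would be `(pred.argmax(1) == y.argmax(1)).float().mean().item()`.

```python
import numpy as np

def batch_accuracy(pred, target_one_hot):
    """Fraction of samples whose highest-probability class matches the one-hot target."""
    return np.mean(np.argmax(pred, axis=1) == np.argmax(target_one_hot, axis=1))

pred = np.array([[0.7, 0.2, 0.1],
                 [0.1, 0.3, 0.6],
                 [0.5, 0.4, 0.1]])
target = np.eye(3)[[0, 2, 1]]           # true classes 0, 2, 1 as one-hot rows
print(batch_accuracy(pred, target))     # 2 out of 3 correct
```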

In order to train the model, we need to instantiate an optimizer and a loss function.

In [None]:
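# Loss function object: Mean Squared Error between the softmax output and the one-hot label.
# (nn.CrossEntropyLoss on raw logits with integer labels is the more common choice for classification.)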
loss_fn = nn.MSELoss()

# We instantiate an optimizer. In this case we choose an Adam optimizer.
optimizer = torch.optim.Adam(model.parameters(), lr=lr, eps=1e-08, weight_decay=0, amsgrad=False)
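With its default `reduction='mean'`, `nn.MSELoss` averages the squared differences over every element of the batch, i.e. over all 10 class scores of every image. A NumPy sketch with made-up softmax outputs for a single three-class example shows that a confident correct prediction yields a much smaller loss than a wrong one:

```python
import numpy as np

def mse(pred, target):
    """Mean squared error averaged over every element, like nn.MSELoss() with reduction='mean'."""
    return np.mean((pred - target) ** 2)

target = np.array([[0.0, 1.0, 0.0]])         # one-hot target: true class 1
confident = np.array([[0.05, 0.90, 0.05]])   # softmax output close to the target
wrong = np.array([[0.80, 0.10, 0.10]])       # softmax output favouring the wrong class
print(mse(confident, target), mse(wrong, target))
```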

In [None]:
# Print model's state_dict size to gain some perspective about the model
print("Model's state_dict size:")
for param_tensor in model.state_dict():
    print(param_tensor, "\t", model.state_dict()[param_tensor].size())

We will record the training and test losses at every epoch so that we can plot them against the epoch number.

In [None]:
# We define two lists to store the training and test loss at every epoch, for plotting.
# Uncomment them to train the model from scratch; leave them commented if you load a trained model.
# loss_to_plot = []
# loss_to_plot_test = []

We are ready to train the model. Let us train it for $n_{epochs}$ epochs, as defined above.

In [None]:
# We train the model by iterating over the epochs.
# Uncomment to train from scratch; leave commented if you load a trained model.
# for epoch in range(n_epochs):
#     print(f"Epoch {epoch+1}\n=============================================")
#     loss_to_plot.append(train_loop(train_dl, model, loss_fn, optimizer))
#     loss_to_plot_test.append(test_loop(test_dl, model, loss_fn, batch_size))
# print("Done!")

The model has been trained. It is good practice to save it before continuing.

In [None]:
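# Save the model to a .pth file (uncomment after training from scratch).
# torch.save(model.state_dict(), "trained_model.pth")

# Load a trained model. map_location="cpu" lets a model trained on a GPU be loaded on any machine;
# the evaluation cells below run on the CPU.
model = Model()
model.load_state_dict(torch.load("trained_model.pth", map_location="cpu"))
# Switch to evaluation mode (disables dropout)
model.eval()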

Now it is time to check the quality of the predictions. To do so, we plot both loss functions and build a *confusion matrix* to see which classes the model tends to confuse.

In [None]:
# We save both loss functions (uncomment after training from scratch).
#np.savetxt('loss_to_plot.txt', loss_to_plot)
#np.savetxt('loss_to_plot_test.txt', loss_to_plot_test)

# We choose an image from the first test batch and compute the model's prediction for it
ind = 78
X, y = next(iter(test_dl))
with torch.no_grad():
    pred_cpu = model(X)[ind].numpy()
image_cpu = X[ind]

# We plot the image to be predicted and as a title the corresponding prediction
fig, ax = plt.subplots(1, 1, dpi=180)
fig.set_size_inches(4.0, 4.0)
ax.axis("off")
plt.title(labels_map[np.argmax(pred_cpu)])
ax.imshow(image_cpu.squeeze(), cmap="gray")

In [None]:
lp = np.loadtxt('loss_to_plot.txt')
lp_test = np.loadtxt('loss_to_plot_test.txt')

# Let us plot both loss functions
fig, ax = plt.subplots(1, 1, figsize=(7, 6), dpi=200)
ax.plot([i for i in range(3, n_epochs+1)], lp[2:], color='darkblue', lw=1.5, label='Training error')
ax.plot([i for i in range(3, n_epochs+1)], lp_test[2:], ls=':', color='maroon', lw=1.5, label='Test error')
ax.set_xlabel('Epoch')
ax.set_ylabel('Average loss function')
ax.set_xticks([0, 75, 150, 225, 300])
ax.set_xticklabels(['0', '75', '150', '225', '300'])
plt.legend()
plt.tight_layout()
plt.savefig('loss.jpg')

We stop training at 300 epochs to avoid overfitting: when the model is trained for more epochs, the average test loss shows a slight tendency to increase.
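Choosing the number of epochs from the loss curve can be automated with early stopping, a standard technique that is not used in the original code. The sketch below is a hypothetical helper working on a list of per-epoch losses: it stops once the loss has not improved for `patience` consecutive epochs and returns the best epoch. Strictly, this selection should be made on a separate validation split rather than on the test set, so that the test error remains an unbiased estimate.

```python
def early_stopping_epoch(losses, patience=10, min_delta=0.0):
    """Return the (1-based) epoch with the best loss, stopping once the loss has not
    improved by more than `min_delta` for `patience` consecutive epochs."""
    best_loss, best_epoch, waited = float("inf"), 0, 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best_loss - min_delta:
            best_loss, best_epoch, waited = loss, epoch, 0   # new best: reset the counter
        else:
            waited += 1
            if waited >= patience:
                break                                        # no improvement for too long
    return best_epoch

losses = [0.9, 0.5, 0.3, 0.25, 0.26, 0.27, 0.28, 0.29]   # toy loss curve that starts rising
print(early_stopping_epoch(losses, patience=3))           # 4
```

For instance, `early_stopping_epoch(np.loadtxt('loss_to_plot_test.txt'), patience=20)` would return the epoch with the lowest saved loss, subject to the chosen patience.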

To conclude the analysis of the model's performance, let us build a confusion matrix. It shows which classes the model tends to predict for each true class.

In [None]:
# Function that gets all the predictions made by the CNN
@torch.no_grad() # turn off gradients during inference for memory efficiency
def get_all_preds(network, dataloader):
    """ Return the model's predictions for every sample in the dataloader, together with the true class indices """
    all_preds = torch.tensor([])
    model = network
    model.eval()   # make sure dropout is disabled
    tmp_labels = np.array([])
    for batch in dataloader:
        images, labels = batch
        tmp_labels = np.concatenate((tmp_labels, labels.argmax(1).numpy()))  # one-hot -> class index
        preds = model(images) # get preds
        all_preds = torch.cat((all_preds, preds), dim=0) # join along existing axis
    
    return all_preds, tmp_labels

# Let us define the function that plots the confusion matrix
def plot_confusion_matrix(cm, target_names, title='Confusion matrix', cmap=None, normalize=True):
    """
    given a sklearn confusion matrix (cm), make a nice plot

    Arguments
    ---------
    cm:           confusion matrix from sklearn.metrics.confusion_matrix

    target_names: given classification classes such as [0, 1, 2]
                  the class names, for example: ['high', 'medium', 'low']

    title:        the text to display at the top of the matrix

    cmap:         the gradient of the values displayed from matplotlib.pyplot.cm
                  see http://matplotlib.org/examples/color/colormaps_reference.html
                  plt.get_cmap('jet') or plt.cm.Blues

    normalize:    If False, plot the raw numbers
                  If True, plot the proportions

    Usage
    -----
    plot_confusion_matrix(cm           = cm,                  # confusion matrix created by
                                                              # sklearn.metrics.confusion_matrix
                          normalize    = True,                # show proportions
                          target_names = y_labels_vals,       # list of names of the classes
                          title        = best_estimator_name) # title of graph

    Citation
    ---------
    http://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html

    """
    import itertools

    accuracy = np.trace(cm)/np.sum(cm).astype('float')
    
    if normalize:
        cm = cm.astype('float')/cm.sum(axis=1)[:, np.newaxis]

    if cmap is None:
        cmap = plt.get_cmap('Blues')    # Choose Blues by default

    plt.figure(figsize=(12, 10), dpi=300)
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()

    if target_names is not None:
        tick_marks = np.arange(len(target_names))
        plt.xticks(tick_marks, target_names, rotation=45)
        plt.yticks(tick_marks, target_names)


    thresh = cm.max()/1.5 if normalize else cm.max()/2
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        if normalize:
            plt.text(j, i, "{:0.4f}".format(cm[i, j]),
                     horizontalalignment="center",
                     color="white" if cm[i, j] > thresh else "black")
        else:
            plt.text(j, i, "{:,}".format(cm[i, j]),
                     horizontalalignment="center",
                     color="white" if cm[i, j] > thresh else "black")


    plt.ylabel('Correct class')
    plt.xlabel('Predicted class\naccuracy = {:0.4f}'.format(accuracy))
    plt.tight_layout()
    plt.savefig('confusion_matrix.jpg')

With the functions defined, let us get all the predictions over the test set and plot them as a confusion matrix.

In [None]:
# Get the predictions over the test set
all_preds_test, labels_test = get_all_preds(network=model, dataloader=test_dl)

# Get the confusion matrix
cm = met.confusion_matrix(y_true=labels_test, y_pred=all_preds_test.argmax(1).numpy())

# Plot the predictions as a confusion matrix
plot_confusion_matrix(cm, target_names=list(labels_map.values()), cmap='Blues', normalize=True)
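For reference, the matrix computed by `met.confusion_matrix` and the row normalization applied by `normalize=True` can be sketched in NumPy. In this illustrative version (toy labels, helper name chosen for the example), entry `cm[i, j]` counts samples of true class `i` predicted as class `j`, so each normalized diagonal entry is the recall of that class:

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes):
    """cm[i, j] counts samples of true class i predicted as class j (same convention as sklearn)."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

y_true = np.array([0, 0, 1, 1, 1, 2])
y_pred = np.array([0, 1, 1, 1, 2, 2])
cm = confusion_matrix(y_true, y_pred, 3)
recall = cm / cm.sum(axis=1, keepdims=True)   # row normalization, as normalize=True does
print(cm)
print(np.diag(recall))            # per-class recall: 1/2, 2/3, 1
print(np.trace(cm) / cm.sum())    # overall accuracy: 4/6
```

To apply it to the notebook's data, `labels_test` (a float array) would first need to be cast with `labels_test.astype(int)`.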

### References

[1] [A Comprehensive Guide to Convolutional Neural Networks — the ELI5 way](https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53)

[2] [Fashion MNIST classification using custom PyTorch Convolution Neural Network (CNN)](https://boscoj2008.github.io/customCNN/)

[3] [Convolutional Neural Networks](https://www.ibm.com/cloud/learn/convolutional-neural-networks)