# DAA ML Intro - Activity 3

This is a slightly longer activity that uses deep neural networks for image classification. We pull the data in using a slightly different method, but all the imports and downloads are done below.

In [None]:
!pip3 install torch torchvision torchsummary rich ipywidgets scikit_learn
!wget https://github.com/YalinZhengLab/outreach/raw/main/test.zip
!wget https://github.com/YalinZhengLab/outreach/raw/main/train.zip
!mkdir data
!unzip test.zip -d data/test/
!unzip train.zip -d data/train/

In [None]:
# Import all the libraries we need
from pathlib import Path
import torch
import random
import numpy as np

from torchsummary import summary

from torch.utils.data import Dataset, DataLoader
from torch.optim import SGD
from torch.nn import CrossEntropyLoss

from rich.progress import track

from torchvision.io import read_image

from sklearn.metrics import accuracy_score, f1_score

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# What is AI, and how does it work?

AI models are computer programs that perform complex tasks to save us time and effort. Usually, these are tasks that are time-consuming for people and too hard for simpler computer programs to solve. An easy example is classifying images into "Cats" and "Dogs": as humans we can do this very easily, but computers find it much harder (how does a computer know what a "dog" is?). By using AI, the computer "learns" the features of dogs and cats and can distinguish between the two.

Automating these processes allows us to spend our time on more important things instead of these tedious tasks. Perhaps more importantly, it allows us to do them on a scale that simply wouldn't be possible without AI (e.g. fraud detection for card transactions).

Recently, most cutting-edge AI has been built using *neural networks*; these are massive mathematical models originally developed to imitate how our brains work. These models have millions, or even billions, of *parameters* that change as the model learns, improving the model as training continues.

<img src="./img/nn.png"/>

In short, the way a model "learns" is by computing a *loss function* every so often in training; this is a general measure of how well the model is doing. The higher the loss, the worse the model performs, and the more the parameters change to try and fix it using some clever maths. All of this is handled by what we installed at the top of the page, but it's good to know how the model is working underneath the surface!
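To make the idea of a loss concrete, here is a minimal sketch of a cross-entropy loss for a single image, written in plain Python. The function name `cross_entropy` and the scores are made up for illustration; the real calculation is handled by the library, and this only shows the general shape of it.

```python
import math

def cross_entropy(scores, true_label):
    """Cross-entropy loss for one example, given raw class scores."""
    # Softmax turns raw scores into probabilities that sum to 1.
    biggest = max(scores)
    exps = [math.exp(s - biggest) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    # The loss is the negative log of the probability given to the true class.
    return -math.log(probs[true_label])

# Hypothetical scores for the classes [cat, dog]; the true label is 1 (dog).
confident_and_right = cross_entropy([0.5, 3.0], true_label=1)
confident_and_wrong = cross_entropy([3.0, 0.5], true_label=1)

print("Loss when right:", confident_and_right)
print("Loss when wrong:", confident_and_wrong)
```

The loss is small when the model gives the true class a high score, and large when it confidently picks the wrong class.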

# Training our own Neural Network

We are going to build a model for detecting Glaucoma from fundus photography. This is a bit of a jump in difficulty but Python has several tools that will make it easier for us!

Glaucoma is a common eye condition and a leading cause of irreversible blindness. Clinicians take a photo of the back of the eye and examine the image for features that are characteristic of the condition.

![Example of a healthy retina](img/ex1.jpg)
![Example of another healthy retina](img/ex2.jpg)

We have an open-source dataset from the Rotterdam EyePACS AIROGS challenge (https://www.kaggle.com/datasets/deathtrooper/eyepacs-airogs-light?resource=download) that we are going to use to train our AI model. Under the `data/train` directory, we have 2500 positive (RG) and 2500 negative cases (NRG) of glaucoma.

Our first job is to create a Python object that holds all of our data, ready for training. Unfortunately, the Python tool that manages the AI for us needs this code and it's a bit abstract, so feel free to run it and move on.

In [None]:
# Feel free to run this and ignore it. Classes are tricky!
class FundusDataset(Dataset):
    def __init__(self, files):
        # This function is run when creating a FundusDataset object
        print("Fundus Dataset initialised!")
        self.files = files

    def __len__(self):
        # This function is called when using len() on the FundusDataset object
        return len(self.files)

    def __getitem__(self, idx):
        # This function is called when using indexing (e.g. my_dataset[1]) on the dataset.
        
        image_path = self.files[idx]

        # Read the file (read_image expects the path as a string)
        image = read_image(str(image_path))

        # Get the label
        label = 0 if "NRG" in str(image_path) else 1

        # return outputs the image, as well as the label for the model.
        return image.float(), label

The code above doesn't actually do anything yet; it just sets up the next step.
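If you are curious how `__len__` and `__getitem__` get used, here is a toy sketch. `ToyDataset` is a hypothetical stand-in that stores file names only and never opens an image, so it runs without any data on disk.

```python
class ToyDataset:
    # A hypothetical, simplified version of FundusDataset for illustration only.
    def __init__(self, files):
        self.files = files

    def __len__(self):
        # Called by len(toy)
        return len(self.files)

    def __getitem__(self, idx):
        # Called by toy[idx]; the label comes from the folder name, as above.
        name = self.files[idx]
        label = 0 if "NRG" in name else 1
        return name, label

toy = ToyDataset(["NRG/a.jpg", "RG/b.jpg", "NRG/c.jpg"])
print(len(toy))  # uses __len__
print(toy[1])    # uses __getitem__ with idx=1
```

PyTorch relies on these same two methods to find out how many items there are and to fetch them one at a time.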

In the code below, we actually create an instance of the `FundusDataset` object we defined above. Again, this needs a bit of setting up, but it is a little more understandable.

First, we need to get all the image files in the training folder. The `Path` type points at the training folder and gives us access to some helpful tools for working with files and folders. For example, `.glob()` finds files based on a pattern: `**` means "any folder" and `*` means "any file", so `**/*.jpg` matches any file in any folder, as long as its name ends in .jpg. This will give us a list of all the files in the training folder:



In [None]:
files = list(Path("./data/train").glob("**/*.jpg"))
print(files)
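To see the pattern at work without touching the real data, the sketch below builds a tiny throwaway folder tree and searches it. The folder and file names are made up for illustration.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Build a tiny folder tree: two .jpg files in sub-folders and one .txt file.
    (root / "RG").mkdir()
    (root / "NRG").mkdir()
    (root / "RG" / "eye1.jpg").touch()
    (root / "NRG" / "eye2.jpg").touch()
    (root / "NRG" / "notes.txt").touch()

    # "**/*.jpg" matches .jpg files in any folder below root.
    found = sorted(p.name for p in root.glob("**/*.jpg"))

print(found)
```

The `.txt` file is skipped because it does not match the `*.jpg` part of the pattern.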

Now that we have the list of all the training files, we can shuffle it and take a subset for training:

In [None]:
# Randomly shuffle this list to get a good mix of positive and negative eyes.
random.shuffle(files)

# 5000 eyes is too many to work with for now, so to speed things up we can take the first 1000
# Note the colon in the index here - this is saying "take every single element up to the 1000th".
files = files[:1000]
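Shuffling before slicing matters because the file list is grouped by folder, so the first files may all belong to one class. The sketch below shows this on made-up file names, using a seeded random generator (an assumption for repeatability, not something the activity requires).

```python
import random
from collections import Counter

# Made-up file names standing in for the real images: 10 RG followed by 10 NRG.
fake_files = [f"RG/eye{i}.jpg" for i in range(10)] + [f"NRG/eye{i}.jpg" for i in range(10)]

# Without shuffling, the first 8 files would all come from the RG folder.
unshuffled_subset = fake_files[:8]

# A seeded generator makes the shuffle repeatable from run to run.
rng = random.Random(0)
rng.shuffle(fake_files)

# Keep the first 8 files and count how many fall in each class.
subset = fake_files[:8]
counts = Counter("NRG" if "NRG" in name else "RG" for name in subset)
print(counts)
```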

Now we are ready to use the `FundusDataset` type we defined above to handle the data for our model!

In [None]:
# We have created the FundusDataset object to convert this list
# into a type of object PyTorch needs. It's fairly straightforward,
# but the code is a bit challenging, so we've collapsed it above.
training_dataset = FundusDataset(files)

# Next, we convert this into a "DataLoader" - this prepares the 
# data ready to be put into the AI model!
training_dataloader = DataLoader(training_dataset, batch_size=64, shuffle=True)
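As a rough illustration of what `batch_size` does, the sketch below feeds ten made-up "images" (random numbers, not real photos) through a `DataLoader` with a batch size of 4. `TensorDataset` is used here only as a convenient stand-in for `FundusDataset`.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Ten fake 3-channel 8x8 "images" with alternating 0/1 labels (made-up data).
images = torch.rand(10, 3, 8, 8)
labels = torch.tensor([0, 1] * 5)
toy_loader = DataLoader(TensorDataset(images, labels), batch_size=4, shuffle=True)

batch_sizes = []
for batch_images, batch_labels in toy_loader:
    # Each batch stacks several images into one tensor.
    print(batch_images.shape, batch_labels.shape)
    batch_sizes.append(len(batch_labels))
```

Ten items with a batch size of 4 gives two full batches and one smaller batch at the end.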

The next thing we want to do is load and train the model. We are going to do this using the ResNet model (https://arxiv.org/abs/1512.03385), a well-established, mature network that performs well on classification tasks.

We do not need to know exactly how this AI works (although that's the excitement of AI research!); we only need to set up a few things for training to take place:

In [None]:
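# We can download the untrained model as follows.
# pretrained=False asks for random starting weights; this tagged version (v0.10.0)
# of torchvision does not accept the newer "weights" argument.
model = torch.hub.load('pytorch/vision:v0.10.0', 'resnet18', pretrained=False, num_classes=2)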

# The loss function tells the model how well it's doing.
criterion = CrossEntropyLoss()

# The optimizer works to reduce the loss function by changing the model parameters.
# The amount the model parameters change depends on how bad the loss function is.
# An important input here is "lr" - this is the learning rate of the model.
# The higher the learning rate, the faster the model will train,
# but the less stable training becomes.
optimizer = SGD(model.parameters(), lr=0.05, momentum=0.9)

# This moves the model object to the right bit of memory. Safe to ignore this!
model.to(device)

# Finally, print a summary of the model. (3, 256, 256) means the input images are 256x256 pixels
# with 3 channels: red, green and blue.
summary(model, input_size=(3,256, 256))

And just like that, you have built a neural network with around 11 million parameters, ready to train!
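The trade-off in the learning rate can be seen on a much simpler problem. The sketch below is not how ResNet is trained; it is a toy example that uses gradient descent to find the minimum of f(x) = x², with a hypothetical helper called `minimise` and three hand-picked learning rates.

```python
def minimise(lr, steps=20):
    # Gradient descent on f(x) = x**2, whose gradient is 2*x.
    # The minimum is at x = 0, and we start a long way from it at x = 5.
    x = 5.0
    for _ in range(steps):
        x = x - lr * 2 * x
    return x

for lr in (0.01, 0.1, 1.1):
    print("lr =", lr, "-> final x =", minimise(lr))
```

With a small learning rate, x creeps slowly towards 0; with a moderate one it gets there much faster; with one that is too large, each step overshoots and x ends up further away than where it started.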

Now we can get to training our model; this is the part where the computer does the most work!

In [None]:
# A for loop tells Python to do something a number of times.
# In this case, we are running through the entire dataset 20 times before stopping.
# Try changing this 20 to a smaller number and see what happens!
for epoch in range(20):

    # This is a measure of how well the model has done this epoch. Low is good!
    running_loss = 0.0
    
    for data in track(training_dataloader):
        # get the inputs; data is a list of [inputs, labels]
        inputs, labels = data

        # Move these to the right device. Safe to ignore this!
        inputs = inputs.to(device)
        labels = labels.to(device)

        # Zero the parameter gradients - this stops the model from making changes
        # to the parameters based on old results
        optimizer.zero_grad()

        # Get the model output for the input
        outputs = model(inputs)

        # Mark how well the model has done
        loss = criterion(outputs, labels)

        # Adjust the model parameters - if the model has done badly, adjust them a lot
        # The parameters are adjusted in a way that should make the model a little better
        # each time!
        loss.backward()
        optimizer.step()

        # Add the loss to the running loss - this is a measure of how well the model
        # has done this epoch!
        running_loss += loss.item()

    # Print the average loss for the epoch 
    print(epoch, "loss", running_loss / len(training_dataloader))
    running_loss = 0.0

print('Finished Training')
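The reason for zeroing the gradients on every pass through the loop is that PyTorch adds new gradients to the ones it has already stored. The sketch below shows this on a single made-up parameter `w` rather than a full model.

```python
import torch

# A single made-up parameter with gradient tracking switched on.
w = torch.tensor(1.0, requires_grad=True)

# The gradient of 3 * w with respect to w is 3.
(3 * w).backward()
first = w.grad.item()

# Calling backward() again adds to the stored gradient instead of replacing it.
(3 * w).backward()
second = w.grad.item()

# Zeroing resets the stored gradient. optimizer.zero_grad() clears the stored
# gradients in a similar way for every parameter in the model.
w.grad.zero_()
after_reset = w.grad.item()

print(first, second, after_reset)
```

Without the reset, each update would be based on a mixture of the current batch and all the batches before it.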

## Testing the Model

Congratulations - you have just trained an AI model!

Now that we have trained the model, we want to test how well it is doing. When testing a model, it's unfair to use data the model has already seen: we don't want to test the model's *memory*, but its *understanding* of the typical features of glaucoma. For this reason, we typically split the dataset and reserve some images specifically for testing - in fact, the `data` folder has a specific folder for us here. All we need to do is set up the images in the same way as for training and get the outputs.

This is a key point for AI models; beyond the training dataset, they don't really know anything else. The model may be too specialised on the training dataset to perform well on data outside what it's seen before. However, if the model is not specialised enough, it may just not perform well on anything!

From here, we can use some measures to see how well our model stacks up. We'll be using the `f1_score`, which is a fairly standard place to start for classification tasks, where a high F1 score is good. Let's do that now:

In [None]:
# This process is very similar to the training process - we set up the dataset as before.
files = list(Path("./data/test").glob("**/*.jpg"))
random.shuffle(files)
files = files[:100]
testing_dataset = FundusDataset(files)
# Use the testing dataset here (not the training one). There is no need to shuffle for testing.
testing_dataloader = DataLoader(testing_dataset, batch_size=64, shuffle=False)


# Set up two lists 
truth = []
predictions = []

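# Put the model into evaluation mode, so that layers such as batch normalisation
# behave consistently while we test.
model.eval()

# The for loop now doesn't do something a number of times, but does something for
# every element in a list - in this case, we are taking each batch of images and their labels
# one by one and comparing the model's predictions with what we know to be true.
for data in track(testing_dataloader):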
    
    # get the inputs; data is a list of [inputs, labels]
    inputs, labels = data

    # Getting the inputs to the right bit of computer memory
    inputs = inputs.to(device)

    # Add the actual true labels to the truth list
    truth.append(labels.numpy())

    # Use the model to predict the diagnosis
    outputs = model(inputs).detach().cpu().numpy().argmax(axis=1)

    # Add this to the list
    predictions.append(outputs)

# Because of how the output comes out, we need the following code to "flatten" the lists
# into one long array.
truth = np.concatenate(truth)
predictions = np.concatenate(predictions)

We now have two arrays - for every image in our testing dataset, we have the model's predicted label (1 for positive, 0 for negative) and the corresponding *true* value. We can look at these by printing out the objects:

In [None]:
print(truth)
print(predictions)

Clearly this isn't the best way to see how our model is doing - just looking at these numbers makes my eyes hurt! Let's look at the accuracy (how many it got right divided by the total number of images) and the F1 score.

In [None]:
accuracy = accuracy_score(truth, predictions)
f1 = f1_score(truth, predictions)

print("Accuracy:", accuracy)
print("F1 Score:", f1)
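To see what these two numbers measure, the sketch below works them out by hand on ten made-up labels (not real model output). It uses the usual definitions: precision is the share of positive predictions that were right, recall is the share of true positives that were found, and F1 is 2 × precision × recall / (precision + recall).

```python
# Made-up example: 2 positive eyes and 8 negative eyes.
truth_example = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
# A cautious "model" that only finds one of the two positive eyes.
pred_example = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

pairs = list(zip(truth_example, pred_example))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

accuracy_by_hand = (tp + tn) / len(pairs)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1_by_hand = 2 * precision * recall / (precision + recall)

print("Accuracy:", accuracy_by_hand)
print("F1 Score:", f1_by_hand)
```

Here the accuracy looks high because most eyes are negative, while the F1 score is noticeably lower because half of the positive eyes were missed. This is one reason to report both.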