<hr style="color:green" />
<h1 style="color:green">COSC2673 Assignment 2: Image Classification for Cancerous Cells</h1>
<h2 style="color:green">File 06: Basic PyTorch Fully Connected Neural Network model test on Main data</h2>
<hr style="color:green" />

<p>
In this file, we train a basic fully connected neural network with PyTorch, using a simple three-layer configuration.
</p>

In [80]:
import pandas as pd
import numpy as np
import os
import cv2

import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torchvision
import torch.utils.data
import torchvision.transforms as transforms
from torch.utils.data import Dataset
from torch.utils.data import DataLoader
from torchvision.io import read_image


Configure whether this notebook runs on Google Colab or locally
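As an alternative to setting the flag by hand, the environment can often be detected at runtime. The following is a minimal sketch and not part of the original notebook. It assumes that a Colab kernel already has the `google.colab` package loaded at startup, and the helper name `running_on_colab` is hypothetical.

```python
import sys


def running_on_colab() -> bool:
    """Return True when the google.colab package is already loaded.

    Assumption: a Colab kernel has google.colab in sys.modules at startup,
    while a local Jupyter or plain Python session normally does not.
    """
    return "google.colab" in sys.modules


# Hypothetical replacement for the hand-set flag used in this notebook.
isGoogleColab = running_on_colab()
print("Running on Google Colab:", isGoogleColab)
```

The `useFullData` flag would still need to be set manually, since it reflects a choice rather than something that can be detected.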

In [81]:
# On Google Colab, for full training, set both to True. Locally, setting both to False is advised
isGoogleColab = False
useFullData = False

In [82]:
# In local, the base directory is the current directory
baseDirectory = "./"

if isGoogleColab:
    from google.colab import drive
    
    # If this is running on Google colab, assume the notebook runs in a "COSC2673" folder, which also contains the data files 
    # in a subfolder called "image_classification_data"
    drive.mount("/content/drive")
    !ls /content/drive/'My Drive'/COSC2673/

    # Import the directory so that custom python libraries can be imported
    import sys
    sys.path.append("/content/drive/MyDrive/COSC2673/")

    # Set the base directory to the Google Drive specific folder
    baseDirectory = "/content/drive/MyDrive/COSC2673/"

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
06.PyTorchBasic01.ipynb  dataframe_utility.py		images_main.csv
07.PyTorchBasic02.ipynb  graphing_utility.py		__pycache__
a2_utility.py		 Image_classification_data	pytorch_utility.py
data_basic_utility.py	 Image_classification_data.zip	statistics_utility.py


Import the custom python files that contain reusable code

In [83]:
import data_basic_utility as dbutil
import graphing_utility as graphutil
import statistics_utility as statsutil

import a2_utility as a2util
import pytorch_utility as ptutil
from pytorch_utility import CancerBinaryDataset
from pytorch_utility import CancerCellTypeDataset


# randomSeed = dbutil.get_random_seed()
randomSeed = 266305
print("Random Seed: " + str(randomSeed))

Random Seed: 266305


In [84]:
# this file should have previously been created in the root directory
dfImages = pd.read_csv(baseDirectory + "images_main.csv")

In [85]:
# Get the training split and the validation split
dfImagesTrain = dfImages[dfImages["trainValTest"] == 0].reset_index()
dfImagesVal = dfImages[dfImages["trainValTest"] == 1].reset_index()

dfImagesTrain.head()

Unnamed: 0,index,ImageName,isCancerous,cellType,trainValTest
0,0,./Image_classification_data/patch_images\1.png,0,0,0
1,1,./Image_classification_data/patch_images\10.png,0,0,0
2,3,./Image_classification_data/patch_images\1000.png,1,2,0
3,4,./Image_classification_data/patch_images\10000...,0,1,0
4,5,./Image_classification_data/patch_images\10001...,0,1,0


Note: The custom Datasets for both the isCancerous data and the Cell Type data are defined in the pytorch_utility.py file.

Also, rather than loading all the training images and calculating the mean and standard deviation here, those values were computed separately in file 05a.PyTorchGetMeanAndStd.ipynb

Here we can just define the values to use, which shouldn't change unless the data is reloaded and a new train/validation/test split is generated
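For reference, per-channel normalization subtracts the channel mean and divides by the channel standard deviation. The sketch below illustrates that arithmetic in plain Python on made-up pixel values. It is not the code from 05a.PyTorchGetMeanAndStd.ipynb, the names `pixels`, `channel_stats` and `normalize` are hypothetical, and the use of the population standard deviation is an illustrative choice.

```python
from statistics import fmean, pstdev

# Made-up pixel intensities in [0, 1] for each RGB channel (illustrative only).
pixels = {
    "R": [0.2, 0.4, 0.6, 0.8],
    "G": [0.1, 0.3, 0.5, 0.7],
    "B": [0.5, 0.5, 0.7, 0.9],
}

# Per-channel mean and population standard deviation of the "training" pixels.
channel_stats = {c: (fmean(v), pstdev(v)) for c, v in pixels.items()}


def normalize(value, channel):
    """Apply (value - mean) / std for the given channel."""
    mean, std = channel_stats[channel]
    return (value - mean) / std


normalized = {c: [normalize(x, c) for x in v] for c, v in pixels.items()}

for c, values in normalized.items():
    # After normalization each channel has a mean near 0 and a std near 1.
    print(c, round(fmean(values), 6), round(pstdev(values), 6))
```

Because the statistics come from the training split only, the same fixed values are applied to the validation images as well.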

In [None]:
train_mean, train_std = ptutil.getTrainMeanAndStdTensors()
print(train_mean)
print(train_std)

In [89]:
# Create a transform operation that also normalizes the images according to the mean and standard deviations of the images
transform_normalize = transforms.Compose(
    [transforms.ToPILImage(),
    transforms.ToTensor(), 
    transforms.Normalize(train_mean, train_std)])


In [90]:
cancerous_training_data = None

# Create a custom Dataset for the training and validation data
if useFullData:
    cancerous_training_data = CancerBinaryDataset(isGoogleColab, dfImagesTrain, baseDirectory, transform=transform_normalize)
else:
    # For testing in a small dataset
    dfImagesTrainTest = dfImagesTrain.iloc[range(1000), :].reset_index()
    cancerous_training_data = CancerBinaryDataset(isGoogleColab, dfImagesTrainTest, baseDirectory, transform=transform_normalize, target_transform=None)

cancerous_validation_data = CancerBinaryDataset(isGoogleColab, dfImagesVal, baseDirectory, transform=transform_normalize, target_transform=None)

# Create data loaders
cancerous_train_dataloader = DataLoader(cancerous_training_data, batch_size=32, shuffle=True, num_workers=2)
cancerous_val_dataloader = DataLoader(cancerous_validation_data, batch_size=32, shuffle=True, num_workers=2)

Now, create a class for the basic fully connected neural network. For this basic NN, we will use 3 fully connected layers. The number of input features will be 27 x 27 x 3, or 2187.

Layer 1: Input is the flattened image, 27 x 27 pixels with 3 color values (RGB), giving 2187 features. Experiment initially with 1458 nodes
Layer 2: Input is 1458 from the previous layer, down to 729
Layer 3: Input is 729 from the previous layer. Since this is a binary classification problem, the output will be 2 classes
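To get a feel for the size of this network, the number of trainable parameters can be worked out from the layer sizes listed above. This is a small arithmetic sketch, assuming each layer has one bias per output node (the default for `nn.Linear`).

```python
# Layer sizes taken from the description above.
layer_sizes = [27 * 27 * 3, 1458, 729, 2]

total_params = 0
for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
    weights = n_in * n_out   # one weight per input/output pair
    biases = n_out           # one bias per output node
    print(f"{n_in} -> {n_out}: {weights + biases} parameters")
    total_params += weights + biases

print("Total trainable parameters:", total_params)
```

Almost all of the parameters sit in the first layer, which is typical when a fully connected layer is applied directly to raw pixels.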

In this, we will use the **ReLU** (Rectified Linear Unit) activation function between the layers. It introduces non-linearity; without it, the stacked linear layers would collapse into a single linear transformation
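The effect of the activation can be seen on a one-dimensional toy example. In the sketch below the two "layers" are simple scalar functions with made-up weights (`layer1` and `layer2` are hypothetical and unrelated to the trained model). Without ReLU their composition has a constant slope, so it is still linear; with ReLU in between, the slope changes.

```python
def relu(x):
    """Rectified Linear Unit: pass positive values, clamp negatives to 0."""
    return max(0.0, x)


def layer1(x):
    return 2.0 * x - 1.0   # hypothetical weight and bias, illustration only


def layer2(x):
    return -3.0 * x + 0.5  # hypothetical weight and bias, illustration only


def without_relu(x):
    return layer2(layer1(x))


def with_relu(x):
    return layer2(relu(layer1(x)))


# A linear (affine) function has the same slope between equally spaced points.
xs = [-1.0, 0.0, 1.0]
slopes_plain = [without_relu(b) - without_relu(a) for a, b in zip(xs, xs[1:])]
slopes_relu = [with_relu(b) - with_relu(a) for a, b in zip(xs, xs[1:])]

print("slopes without ReLU:", slopes_plain)
print("slopes with ReLU:   ", slopes_relu)
```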

In [91]:
# Create a class for the Neural Network
class PT_NN_IsCancerous(nn.Module):

    # In the constructor, initialize the layers to use
    def __init__(self):
        super(PT_NN_IsCancerous, self).__init__()
        self.fc1 = nn.Linear(27 * 27 * 3, 1458)
        self.fc2 = nn.Linear(1458, 729)
        self.fc3 = nn.Linear(729, 2)

    # Create the forward function, which is used in training
    def forward(self, x):
        # process through each layer
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)

        # return the result
        return x


Now train the Fully Connected Neural Network Model.

During training, we will use the following:
- Softmax Cross Entropy Loss as our loss function (`nn.CrossEntropyLoss`). It applies a softmax to convert the raw scores for each class into probabilities, then penalises a low probability for the true class
- The Adam optimizer, an adaptive-learning-rate variant of stochastic gradient descent
- Initially, just 10 epochs
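The loss described in the first bullet can be computed by hand for a single sample. The following is a plain-Python sketch of the arithmetic with made-up scores, not the PyTorch implementation; the function names are hypothetical.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]   # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, true_class):
    """Negative log-probability assigned to the true class."""
    return -math.log(softmax(logits)[true_class])


logits = [2.0, 0.5]   # made-up scores for classes 0 and 1
probs = softmax(logits)
print("probabilities:", probs)
print("loss if true class is 0:", cross_entropy(logits, 0))
print("loss if true class is 1:", cross_entropy(logits, 1))
```

The loss is small when the highest score belongs to the true class and grows as the true class is given less probability, which is the signal the optimizer uses to adjust the weights.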

In [92]:
# set the Learning Rate to use
learning_rate = 0.0001
epochsToUse = 10

net = PT_NN_IsCancerous()
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(net.parameters(), lr=learning_rate)

for epoch in range(epochsToUse):
    print("Starting Epoch " + str(epoch) + "...")
    for i, data in enumerate(cancerous_train_dataloader, 0):
        # Get the inputs
        inputs, labels = data

        # This should convert the image tensors into vectors
        inputs = inputs.view(-1, 27 * 27 * 3)

        # Zero the parameter gradients
        optimizer.zero_grad()

        # Perform Forward and Backward propagation then optimize the weights
        outputs = net(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

Starting Epoch 0...
Starting Epoch 1...
Starting Epoch 2...
Starting Epoch 3...
Starting Epoch 4...
Starting Epoch 5...
Starting Epoch 6...
Starting Epoch 7...
Starting Epoch 8...
Starting Epoch 9...


Training on the full data in Nelson's local environment takes a very long time; it was stopped after 100 minutes. Full training will need to be done in Colab.

Now predict on the validation data and evaluate. While looping through the data loader, we need to collect the labels from each batch, because the order of the predictions does not match the order of the original target values in the dataset (shuffle is turned on)
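The point about ordering can be illustrated without PyTorch. In this sketch a made-up dataset is shuffled and processed in batches, with `predict` standing in for the network (all names are hypothetical). Collecting the labels from the same batches as the predictions keeps each pair aligned, whereas comparing against the labels in their original order generally does not.

```python
import random

# Made-up dataset: each item is (feature, label); label is 1 when feature > 0.5.
random.seed(0)
dataset = [(x, int(x > 0.5)) for x in (random.random() for _ in range(100))]


def predict(feature):
    """Stand-in for the network: a deliberately imperfect threshold rule."""
    return int(feature > 0.4)


def evaluate(items, batch_size=32):
    """Collect labels and predictions batch by batch, keeping pairs together."""
    y_true, y_pred = [], []
    for start in range(0, len(items), batch_size):
        batch = items[start:start + batch_size]
        y_true.extend(label for _, label in batch)
        y_pred.extend(predict(feature) for feature, _ in batch)
    return y_true, y_pred


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


original_labels = [label for _, label in dataset]
shuffled = dataset[:]
random.shuffle(shuffled)
y_true, y_pred = evaluate(shuffled)

# Pairing predictions with labels from the same batches gives the true accuracy.
print("aligned accuracy:   ", accuracy(y_true, y_pred))
# Pairing them with the original dataset order mixes up the samples.
print("misaligned accuracy:", accuracy(original_labels, y_pred))
```

Since validation metrics do not depend on sample order, setting `shuffle=False` on a validation loader would be another way to avoid the issue.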

In [93]:
correct, total = 0,  0
predictions = []

# Set the Neural Network into evaluation (test) mode
net.eval()

step=0

y_val_cancerous = []
y_pred_cancerous = []

# Looping through this dataloader processes the data in batches of 32 (or whatever batch size is configured in the data loader)
for i, data in enumerate(cancerous_val_dataloader, 0):
    inputs, labels = data


    # This should convert the image tensors into vectors
    inputs = inputs.view(-1, 27 * 27 * 3)

    outputs = net(inputs)
    _, predicted = torch.max(outputs.data, 1)
    
    # print(labels)
    # print(predicted)  
    # print(len(labels))
    # print(len(predicted))

    # Loop through the batch, build the lists of the raw label and prediction values
    for j in range(len(labels)):
        y_val_cancerous.append(labels[j].item())
        y_pred_cancerous.append(predicted[j].item())

    predictions.append(predicted)
    total += labels.size(0)
    correct += (predicted == labels).sum().item()


accuracy = (correct/total) * 100
print("The testing set accuracy of the isCancerous Classification Network is " + str(accuracy) + "%")

The testing set accuracy of the isCancerous Classification Network is 86.32395732298738%


In [94]:
for i in range(3):
    print(predictions[i])

print(len(predictions))

tensor([0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0,
        0, 1, 1, 1, 1, 1, 1, 0])
tensor([1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0,
        1, 1, 1, 1, 1, 1, 0, 1])
tensor([0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0,
        1, 0, 1, 0, 0, 1, 1, 1])
33
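The count of 33 printed above is the number of batches, not the number of predictions. A quick arithmetic check, assuming the 1031 validation samples shown in the classification reports and the `DataLoader` default of keeping the final partial batch:

```python
import math

n_samples = 1031    # validation rows, per the support total in the reports
batch_size = 32     # as configured in the DataLoader

n_batches = math.ceil(n_samples / batch_size)
last_batch = n_samples - (n_batches - 1) * batch_size

print("batches:", n_batches)
print("size of final batch:", last_batch)
```

So 32 full batches are followed by one short batch, which is why per-batch code should use `len(labels)` rather than a fixed 32.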


In [95]:
# y_pred_cancerous = [item for sublist in y_pred_cancerous for item in sublist]
print("Labels")
for i in range(5):
    print(y_val_cancerous[i])

print("Predictions")
for i in range(5):
    print(y_pred_cancerous[i])

Labels
0
1
1
1
1
Predictions
0
1
1
0
0


Convert these results into a confusion matrix and then get an F1 Score for comparison

In [96]:
# assuming predictions is just a list of predicted values
cm = confusion_matrix(y_val_cancerous, y_pred_cancerous)

print("Binary Classification Results for isCancerous Predictions")
print(classification_report(y_val_cancerous, y_pred_cancerous))
print("- Accuracy Score: " + str(accuracy_score(y_val_cancerous, y_pred_cancerous)))
print("- Precision Score: " + str(precision_score(y_val_cancerous, y_pred_cancerous)))
print("- Recall Score: " + str(recall_score(y_val_cancerous, y_pred_cancerous)))
print("- F1 Score: " + str(f1_score(y_val_cancerous, y_pred_cancerous)))

Binary Classification Results for isCancerous Predictions
              precision    recall  f1-score   support

           0       0.83      0.82      0.83       408
           1       0.88      0.89      0.89       623

    accuracy                           0.86      1031
   macro avg       0.86      0.86      0.86      1031
weighted avg       0.86      0.86      0.86      1031

- Accuracy Score: 0.8632395732298739
- Precision Score: 0.8813291139240507
- Recall Score: 0.8940609951845907
- F1 Score: 0.8876494023904382
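The confusion matrix itself is computed but never printed. As a worked check, the scores above can be reproduced from a set of confusion counts. The counts below are inferred from the reported precision, recall and class supports (408 and 623), so treat them as a reconstruction rather than notebook output.

```python
# Counts for the positive class (isCancerous = 1), inferred from the
# reported scores and supports; the notebook does not print them.
tp, fp, fn, tn = 557, 75, 66, 333

precision = tp / (tp + fp)            # of predicted positives, how many are right
recall = tp / (tp + fn)               # of actual positives, how many are found
f1 = 2 * precision * recall / (precision + recall)
overall_accuracy = (tp + tn) / (tp + fp + fn + tn)

print("precision:", precision)
print("recall:   ", recall)
print("f1:       ", f1)
print("accuracy: ", overall_accuracy)
```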


**Results for 10 epochs**

Binary Classification Results for isCancerous Predictions
              precision    recall  f1-score   support

           0       0.83      0.82      0.83       408
           1       0.88      0.89      0.89       623

    accuracy                           0.86      1031
   macro avg       0.86      0.86      0.86      1031
weighted avg       0.86      0.86      0.86      1031

- Accuracy Score: 0.8632395732298739
- Precision Score: 0.8813291139240507
- Recall Score: 0.8940609951845907
- F1 Score: 0.8876494023904382

**Results for 20 epochs**

Binary Classification Results for isCancerous Predictions
              precision    recall  f1-score   support

           0       0.80      0.83      0.81       408
           1       0.88      0.87      0.87       623

    accuracy                           0.85      1031
   macro avg       0.84      0.85      0.84      1031
weighted avg       0.85      0.85      0.85      1031

- Accuracy Score: 0.8496605237633366
- Precision Score: 0.8836065573770492
- Recall Score: 0.8651685393258427
- F1 Score: 0.8742903487429036

Now also train a model for CellType Predictions

In [97]:
# Create a custom Dataset for the training and validation data
celltype_training_data = CancerCellTypeDataset(isGoogleColab, dfImagesTrain, baseDirectory, transform=transform_normalize)
celltype_validation_data = CancerCellTypeDataset(isGoogleColab, dfImagesVal, baseDirectory, transform=transform_normalize)

# Create data loaders
celltype_train_dataloader = DataLoader(celltype_training_data, batch_size=32, shuffle=True, num_workers=4)
celltype_val_dataloader = DataLoader(celltype_validation_data, batch_size=32, shuffle=True, num_workers=4)



Create a class for the Cell Type neural network model. The structure is the same as the binary classifier, except that the final layer outputs 4 classes

In [98]:
# Create a class for the Neural Network
class PT_NN_CellType(nn.Module):

    # In the constructor, initialize the layers to use
    def __init__(self):
        super(PT_NN_CellType, self).__init__()
        self.fc1 = nn.Linear(27 * 27 * 3, 1458)
        self.fc2 = nn.Linear(1458, 729)
        self.fc3 = nn.Linear(729, 4)

    # Create the forward function, which is used in training
    def forward(self, x):
        # process through each layer
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)

        # return the result
        return x

Now train the fully connected neural network model. Use the same configuration (loss function, optimizer, etc.) as the binary classifier

In [None]:
# set the Learning Rate to use
learning_rate = 0.0001
epochsToUse = 10

net = PT_NN_CellType()
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(net.parameters(), lr=learning_rate)

for epoch in range(epochsToUse):
    print("Starting Epoch " + str(epoch) + "...")
    for i, data in enumerate(celltype_train_dataloader, 0):
        # Get the inputs
        inputs, labels = data

        # This should convert the image tensors into vectors
        inputs = inputs.view(-1, 27 * 27 * 3)

        # Zero the parameter gradients
        optimizer.zero_grad()

        # Perform Forward and Backward propagation then optimize the weights
        outputs = net(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

Starting Epoch 0...
Starting Epoch 1...
Starting Epoch 2...
Starting Epoch 3...
Starting Epoch 4...
Starting Epoch 5...
Starting Epoch 6...
Starting Epoch 7...
Starting Epoch 8...
Starting Epoch 9...


Predict on the Validation data and evaluate the results

In [None]:
correct, total = 0,  0
predictions = []

# Set the Neural Network into evaluation (test) mode
net.eval()

step=0

y_val_celltype = []
y_pred_celltype = []

# Looping through this dataloader processes the data in batches of 32 (or whatever batch size is configured in the data loader)
for i, data in enumerate(celltype_val_dataloader, 0):
    inputs, labels = data

    # This should convert the image tensors into vectors
    inputs = inputs.view(-1, 27 * 27 * 3)

    outputs = net(inputs)
    _, predicted = torch.max(outputs.data, 1)
    
    # print(labels)
    # print(predicted)  
    # print(len(labels))
    # print(len(predicted))

    # Loop through the batch, build the lists of the raw label and prediction values
    for j in range(len(labels)):
        y_val_celltype.append(labels[j].item())
        y_pred_celltype.append(predicted[j].item())

    predictions.append(predicted)
    total += labels.size(0)
    correct += (predicted == labels).sum().item()


accuracy = (correct/total) * 100
print("The testing set accuracy of the CellType Classification Network is " + str(accuracy) + "%")

Convert these results into a confusion matrix and then get an F1 Score for comparison

In [None]:
# assuming predictions is just a list of predicted values
cm = confusion_matrix(y_val_celltype, y_pred_celltype)

print("CellType Multi-class Classification Results for Cell Type Predictions")
print(classification_report(y_val_celltype, y_pred_celltype))
print("- Accuracy Score: " + str(accuracy_score(y_val_celltype, y_pred_celltype)))
print("- Precision Score: " + str(precision_score(y_val_celltype, y_pred_celltype, average="micro")))
print("- Recall Score: " + str(recall_score(y_val_celltype, y_pred_celltype, average="micro")))
print("- F1 Score: " + str(f1_score(y_val_celltype, y_pred_celltype, average="micro")))

**Results for 10 epochs:**

CellType Multi-class Classification Results for Cell Type Predictions
              precision    recall  f1-score   support

           0       0.63      0.55      0.59       155
           1       0.61      0.79      0.69       185
           2       0.88      0.88      0.88       623
           3       0.23      0.10      0.14        68

    accuracy                           0.77      1031
   macro avg       0.59      0.58      0.58      1031
weighted avg       0.75      0.77      0.76      1031

- Accuracy Score: 0.7672162948593598
- Precision Score: 0.7672162948593598
- Recall Score: 0.7672162948593598
- F1 Score: 0.7672162948593597

**Results for 20 epochs**

CellType Multi-class Classification Results for Cell Type Predictions
              precision    recall  f1-score   support

           0       0.62      0.50      0.55       155
           1       0.59      0.69      0.64       185
           2       0.88      0.89      0.88       623
           3       0.20      0.19      0.20        68

    accuracy                           0.75      1031
   macro avg       0.57      0.57      0.57      1031
weighted avg       0.75      0.75      0.74      1031

- Accuracy Score: 0.7468477206595538
- Precision Score: 0.7468477206595538
- Recall Score: 0.7468477206595538
- F1 Score: 0.746847720659554
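In both CellType runs the precision, recall and F1 scores are identical to the accuracy. That is expected with `average="micro"` on a single-label multi-class problem: every wrong prediction counts once as a false positive and once as a false negative, so all three reduce to correct / total. The plain-Python sketch below shows this on made-up labels (the data and helper names are hypothetical) and contrasts it with a macro average, which weights every class equally.

```python
def counts(y_true, y_pred, cls):
    """Return (tp, fp, fn) for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == cls and p == cls for t, p in pairs)
    fp = sum(t != cls and p == cls for t, p in pairs)
    fn = sum(t == cls and p != cls for t, p in pairs)
    return tp, fp, fn


def f1_from_counts(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Made-up labels for a 4-class problem (illustration only).
y_true = [0, 0, 1, 1, 2, 2, 2, 2, 3, 3]
y_pred = [0, 1, 1, 1, 2, 2, 2, 2, 2, 0]
classes = [0, 1, 2, 3]

per_class = [counts(y_true, y_pred, c) for c in classes]

# Micro: pool the counts over all classes, then compute one score.
micro_f1 = f1_from_counts(*(sum(col) for col in zip(*per_class)))
# Macro: compute one score per class, then take the unweighted mean.
macro_f1 = sum(f1_from_counts(*c) for c in per_class) / len(classes)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print("accuracy:", accuracy)
print("micro F1:", micro_f1)
print("macro F1:", macro_f1)
```

For the results above, the macro-averaged row of the classification report is therefore the more informative summary of how the minority classes (such as class 3) are handled.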