<hr style="color:green" />
<h1 style="color:green">COSC2673 Assignment 2: Image Classification for Cancerous Cells</h1>
<h2 style="color:green">File 10: Basic PyTorch FC-NN with Cross Validation</h2>
<hr style="color:green" />

<p>
In this file, we train a basic fully connected neural network with PyTorch, using a simple 3-layer configuration, and update the process to use cross-validation.
</p>

In [22]:
import pandas as pd
import numpy as np
import os
import cv2

import matplotlib.pyplot as plt
from sklearn.model_selection import KFold
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torchvision
import torch.utils.data
import torchvision.transforms as transforms
from torch.utils.data import Dataset
from torch.utils.data import DataLoader
from torchvision.io import read_image


Configure whether this script runs on Google Colab or locally

In [23]:
# On Google Colab, when running full training, set both to True. When running locally, setting both to False is advised
isGoogleColab = False
useFullData = False

In [24]:
# In local, the base directory is the current directory
baseDirectory = "./"

if isGoogleColab:
    from google.colab import drive
    
    # If this is running on Google colab, assume the notebook runs in a "COSC2673" folder, which also contains the data files 
    # in a subfolder called "image_classification_data"
    drive.mount("/content/drive")
    !ls /content/drive/'My Drive'/COSC2673/

    # Import the directory so that custom python libraries can be imported
    import sys
    sys.path.append("/content/drive/MyDrive/COSC2673/")

    # Set the base directory to the Google Drive specific folder
    baseDirectory = "/content/drive/MyDrive/COSC2673/"

Import the custom python files that contain reusable code

In [25]:
import data_basic_utility as dbutil
import graphing_utility as graphutil
import statistics_utility as statsutil

import a2_utility as a2util
import pytorch_utility as ptutil
from pytorch_utility import CancerBinaryDataset
from pytorch_utility import CancerCellTypeDataset


# randomSeed = dbutil.get_random_seed()
randomSeed = 266305
print("Random Seed: " + str(randomSeed))

Random Seed: 266305


In [26]:
# this file should have previously been created in the root directory
dfImages = pd.read_csv(baseDirectory + "images_main.csv")

In [27]:
# Get The training Split and the Validation Split together
dfImagesTrainVal = dfImages[(dfImages["trainValTest"] == 0) | (dfImages["trainValTest"] == 1)].reset_index()
dfImagesTrainVal.head()

Unnamed: 0,index,ImageName,isCancerous,cellType,trainValTest
0,0,./Image_classification_data/patch_images\1.png,0,0,0
1,1,./Image_classification_data/patch_images\10.png,0,0,0
2,3,./Image_classification_data/patch_images\1000.png,1,2,0
3,4,./Image_classification_data/patch_images\10000...,0,1,0
4,5,./Image_classification_data/patch_images\10001...,0,1,0


Note: The Custom Datasets for both the isCancerous data and the Cell Type data are defined in the pytorch_utility.py file.

Also, rather than loading all the training images and calculating the mean and standard deviation values in here, that was run separately in file 05a.PyTorchGetMeanAndStd.ipynb

Here we can just define the values to use, which shouldn't change unless the data is reloaded and a new train/validation/test split is generated
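As an illustration of what that separate step computes, the following is a minimal sketch of a per-channel mean and standard deviation over a stack of RGB images scaled to [0, 1]. The function name `channel_mean_std` and the random stand-in images are assumptions made for illustration; the actual calculation lives in 05a.PyTorchGetMeanAndStd.ipynb and may differ.

```python
import numpy as np

def channel_mean_std(images):
    """Return per-channel mean and std for an array shaped (N, H, W, 3)."""
    # Collapse the image, row and column axes so only the channel axis remains
    pixels = images.reshape(-1, images.shape[-1])
    return pixels.mean(axis=0), pixels.std(axis=0)

# Stand-in data: 8 random 27 x 27 RGB images with values in [0, 1]
rng = np.random.default_rng(0)
fake_images = rng.random((8, 27, 27, 3))

mean, std = channel_mean_std(fake_images)
print(mean.shape, std.shape)
```

The statistics should be computed on training images only, so that no information from the validation or test splits leaks into the normalization.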

In [28]:
train_mean, train_std = ptutil.getTrainMeanAndStdTensors()
print(train_mean)
print(train_std)

tensor([0.8035, 0.5909, 0.7640])
tensor([0.1246, 0.1947, 0.1714])


In [29]:
# Create a transform operation that also normalizes the images using the per-channel mean and standard deviation of the training images
transform_normalize = transforms.Compose(
    [transforms.ToPILImage(),
    transforms.ToTensor(), 
    transforms.Normalize(train_mean, train_std)])
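To make the effect of `transforms.Normalize` concrete, the sketch below applies the same per-channel formula, (value - mean) / std, by hand to a single pixel. The pixel values are made up for illustration; the mean and standard deviation are the rounded values printed above.

```python
# Rounded per-channel statistics printed by the earlier cell
train_mean = [0.8035, 0.5909, 0.7640]
train_std = [0.1246, 0.1947, 0.1714]

def normalize_pixel(pixel, mean, std):
    """Apply (value - mean) / std to each channel of one RGB pixel."""
    return [(v - m) / s for v, m, s in zip(pixel, mean, std)]

# A hypothetical pixel, already scaled to [0, 1] as ToTensor would do
pixel = [0.8035, 0.2000, 0.9000]
normalized = normalize_pixel(pixel, train_mean, train_std)
print(normalized)
```

A channel value equal to the training mean maps to 0, values below the mean become negative, and values above it become positive.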


Now, create a class for the basic fully connected neural network. For this basic NN, we will use 3 fully connected layers. Each flattened image has 27 x 27 x 3 = 2187 input features.

Layer 1: Input is the flattened image, which is 27 x 27 pixels with 3 color values (RGB), giving 2187 features. Experiment initially with 1458 nodes
Layer 2: Input is 1458 from the previous layer, down to 729
Layer 3: Input is 729 from the previous layer. Since this is a binary classification problem, the output will be 2 classes

In this, we will use the **ReLU** activation function. This is the Rectified Linear Unit function, ReLU(x) = max(0, x). Applying it between the linear layers introduces the non-linearity that lets the network model non-linear relationships.
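As a quick sanity check on the size of this architecture, the sketch below counts the trainable parameters by hand. It assumes each layer has one weight per input-output pair plus one bias per output, which is the default for `nn.Linear`.

```python
def linear_params(in_features, out_features):
    """Weights plus biases of one fully connected layer."""
    return in_features * out_features + out_features

input_size = 27 * 27 * 3  # 2187 features per flattened image
layer_sizes = [input_size, 1458, 729, 2]

total = 0
for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
    count = linear_params(n_in, n_out)
    total += count
    print(f"{n_in} -> {n_out}: {count} parameters")

print("Total:", total)
```

The first layer dominates the total, which comes to roughly 4.26 million parameters. That is a large model for 27 x 27 patches, so overfitting is worth watching for in the validation scores.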

In [30]:
# Create a class for the Neural Network
class PT_NN_IsCancerous(nn.Module):

    # In the constructor, initialize the layers to use
    def __init__(self):
        super(PT_NN_IsCancerous, self).__init__()
        self.fc1 = nn.Linear(27 * 27 * 3, 1458)
        self.fc2 = nn.Linear(1458, 729)
        self.fc3 = nn.Linear(729, 2)

    # Create the forward function, which is used in training
    def forward(self, x):
        # process through each layer
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)

        # return the result
        return x


Now train the Fully Connected Neural Network Model.

During training, we will use the following:
- Softmax Cross Entropy Loss as our loss function. It converts the raw scores for each class into probabilities with softmax, then penalizes a low probability on the true class
- The Adam optimizer, an adaptive-learning-rate variant of stochastic gradient descent
- Initially, just 10 epochs
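To show what the loss in the first bullet computes for a single example, here is a minimal sketch in plain Python: softmax turns the raw scores into probabilities, and the loss is the negative log of the probability given to the true class. The scores are hypothetical. `nn.CrossEntropyLoss` applies the log-softmax step internally and, by default, averages over the batch, which is why the network's `forward` returns raw scores without a softmax.

```python
import math

def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    shifted = [s - max(scores) for s in scores]  # shift for numerical stability
    exps = [math.exp(s) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(scores, true_class):
    """Negative log of the probability assigned to the true class."""
    return -math.log(softmax(scores)[true_class])

scores = [2.0, 0.5]  # hypothetical raw outputs for classes 0 and 1
print(softmax(scores))
print(cross_entropy(scores, 0), cross_entropy(scores, 1))
```

The loss is small when the true class has the highest score and grows as the model becomes confidently wrong.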

In [31]:
# set the Learning Rate and Epochs to use
learning_rate = 0.0001 # Note: 0.00003 is too slow
epochsToUse = 10

accuracy_scores = []
precision_scores = []
recall_scores = []
f1_scores = []

kfolds = KFold(n_splits=5, random_state=randomSeed, shuffle=True)

# Iterate over 5 splits of the data
for k, (train_index, vali_index) in enumerate(kfolds.split(dfImagesTrainVal)):
    print("Fold " + str(k) + ":")
    # Split the dataset between train and validation for this split
    dfImagesTrain = dfImagesTrainVal.iloc[train_index, ]
    dfImagesVal = dfImagesTrainVal.iloc[vali_index, ]

    cancerous_training_data = None

    #--------- Set up the data for training and validation
    # Create a custom Dataset for the training and validation data
    if useFullData:        
        dfImagesTrain = dfImagesTrain.reset_index()
        cancerous_training_data = CancerBinaryDataset(isGoogleColab, dfImagesTrain, baseDirectory, transform=transform_normalize)
    else:
        # For testing in a small dataset
        dfImagesTrainTest = dfImagesTrain.iloc[range(200), :].reset_index()
        cancerous_training_data = CancerBinaryDataset(isGoogleColab, dfImagesTrainTest, baseDirectory, transform=transform_normalize, target_transform=None)

    dfImagesVal = dfImagesVal.reset_index()
    cancerous_validation_data = CancerBinaryDataset(isGoogleColab, dfImagesVal, baseDirectory, transform=transform_normalize, target_transform=None)

    # Create data loaders
    cancerous_train_dataloader = DataLoader(cancerous_training_data, batch_size=32, shuffle=True, num_workers=2)
    cancerous_val_dataloader = DataLoader(cancerous_validation_data, batch_size=32, shuffle=True, num_workers=2)

    #--------- Train the NN over 10 epochs
    net = PT_NN_IsCancerous()
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(net.parameters(), lr=learning_rate)

    for epoch in range(epochsToUse):
        print("   Starting Epoch " + str(epoch) + "...")
        for i, data in enumerate(cancerous_train_dataloader, 0):
            # Get the inputs
            inputs, labels = data

            # This should convert the image tensors into vectors
            inputs = inputs.view(-1, 27 * 27 * 3)

            # Zero the parameter gradients
            optimizer.zero_grad()

            # Perform Forward and Backward propagation then optimize the weights
            outputs = net(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

    
    #--------- Predict on Validation and get the Evaluation Metric 
    correct, total = 0,  0
    step=0
    predictions = []

    # Set the Neural Network into evaluation (test) mode
    net.eval()

    y_val_cancerous = []
    y_val_pred_cancerous = []

    # Looping through this dataloader processes the data in batches of 32 (or whatever batch size is configured in the data loader)
    for i, data in enumerate(cancerous_val_dataloader, 0):
        inputs, labels = data

        # This should convert the image tensors into vectors
        inputs = inputs.view(-1, 27 * 27 * 3)

        outputs = net(inputs)
        _, predicted = torch.max(outputs.data, 1)

        # Loop through the batch, build the lists of the raw label and prediction values
        for j in range(len(labels)):
            y_val_cancerous.append(labels[j].item())
            y_val_pred_cancerous.append(predicted[j].item())

        predictions.append(predicted)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

    accuracy_scores.append(accuracy_score(y_val_cancerous, y_val_pred_cancerous))
    precision_scores.append(precision_score(y_val_cancerous, y_val_pred_cancerous))
    recall_scores.append(recall_score(y_val_cancerous, y_val_pred_cancerous))
    f1_scores.append(f1_score(y_val_cancerous, y_val_pred_cancerous))  


print("IsCancerous Binary Classification Results after K-Folds CV:")
print("Average Accuracy Score: " + str(np.mean(accuracy_scores)))
print("Average Precision Score: " + str(np.mean(precision_scores)))
print("Average Recall Score: " + str(np.mean(recall_scores)))
print("Average F1 Score: " + str(np.mean(f1_scores)))      

Fold 0:
   Starting Epoch 0...
   Starting Epoch 1...
   Starting Epoch 2...
   Starting Epoch 3...
   Starting Epoch 4...
   Starting Epoch 5...
   Starting Epoch 6...
   Starting Epoch 7...
   Starting Epoch 8...
   Starting Epoch 9...
Fold 1:
   Starting Epoch 0...
   Starting Epoch 1...
   Starting Epoch 2...
   Starting Epoch 3...
   Starting Epoch 4...
   Starting Epoch 5...
   Starting Epoch 6...
   Starting Epoch 7...
   Starting Epoch 8...
   Starting Epoch 9...
Fold 2:
   Starting Epoch 0...
   Starting Epoch 1...
   Starting Epoch 2...
   Starting Epoch 3...
   Starting Epoch 4...
   Starting Epoch 5...
   Starting Epoch 6...
   Starting Epoch 7...
   Starting Epoch 8...
   Starting Epoch 9...
Fold 3:
   Starting Epoch 0...
   Starting Epoch 1...
   Starting Epoch 2...
   Starting Epoch 3...
   Starting Epoch 4...
   Starting Epoch 5...
   Starting Epoch 6...
   Starting Epoch 7...
   Starting Epoch 8...
   Starting Epoch 9...
Fold 4:
   Starting Epoch 0...
   Starting Epoch

In [32]:
print(accuracy_scores)

[0.7914317925591883, 0.8043968432919955, 0.7609921082299888, 0.7913141567963903, 0.8127467569091935]
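Beyond the average, the spread across folds gives a sense of how stable the result is. The sketch below summarizes the five accuracy values printed above with their mean and sample standard deviation; the variable names are illustrative.

```python
import statistics

# Per-fold accuracy values copied from the output above
fold_accuracy = [
    0.7914317925591883,
    0.8043968432919955,
    0.7609921082299888,
    0.7913141567963903,
    0.8127467569091935,
]

mean_accuracy = statistics.mean(fold_accuracy)
std_accuracy = statistics.stdev(fold_accuracy)  # sample standard deviation
print(f"Accuracy: {mean_accuracy:.4f} +/- {std_accuracy:.4f}")
```

Reporting both numbers makes it easier to judge whether a later change in the model is a real improvement or within fold-to-fold variation.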


Create a class for the Cell Type Neural Network model. The structure of the class is fundamentally the same; the only difference is that the model needs to output 4 classes

In [33]:
# Create a class for the Neural Network
class PT_NN_CellType(nn.Module):

    # In the constructor, initialize the layers to use
    def __init__(self):
        super(PT_NN_CellType, self).__init__()
        self.fc1 = nn.Linear(27 * 27 * 3, 1458)
        self.fc2 = nn.Linear(1458, 729)
        self.fc3 = nn.Linear(729, 4)

    # Create the forward function, which is used in training
    def forward(self, x):
        # process through each layer
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)

        # return the result
        return x

Now train the Fully Connected Neural Network model for cell type. Use the same configuration (loss function, optimizer, etc.) as the binary classifier
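The training cell for this model scores the multi-class predictions with `average="micro"`. For single-label problems like this one, micro-averaged precision and recall (and therefore F1) reduce to plain accuracy, so the four reported averages are expected to be identical. The sketch below checks this on a handful of made-up labels, computing the micro-average by hand instead of calling scikit-learn.

```python
def micro_precision_recall(y_true, y_pred, classes):
    """Pool true positives, false positives and false negatives over all classes."""
    tp = fp = fn = 0
    for c in classes:
        for t, p in zip(y_true, y_pred):
            if p == c and t == c:
                tp += 1
            elif p == c and t != c:
                fp += 1
            elif p != c and t == c:
                fn += 1
    return tp / (tp + fp), tp / (tp + fn)

# Hypothetical labels for a 4-class problem
y_true = [0, 1, 2, 3, 1, 2, 0, 3]
y_pred = [0, 1, 1, 3, 1, 2, 2, 3]

precision, recall = micro_precision_recall(y_true, y_pred, classes=range(4))
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(precision, recall, accuracy)
```

If per-class performance matters, for example when some cell types are rare, a macro or weighted average would give additional information that the micro-average hides.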

In [34]:
# set the Learning Rate and Epochs to use
learning_rate = 0.00003
epochsToUse = 10

accuracy_scores = []
precision_scores = []
recall_scores = []
f1_scores = []

kfolds = KFold(n_splits=5, random_state=randomSeed, shuffle=True)

# Iterate over 5 splits of the data
for k, (train_index, vali_index) in enumerate(kfolds.split(dfImagesTrainVal)):
    print("Fold " + str(k) + ":")

    # Split the dataset between train and validation for this split
    dfImagesTrain = dfImagesTrainVal.iloc[train_index, ]
    dfImagesVal = dfImagesTrainVal.iloc[vali_index, ]
    
    celltype_training_data = None

    #--------- Set up the data for training and validation
    # Create a custom Dataset for the training and validation data
    if useFullData:
        dfImagesTrain = dfImagesTrain.reset_index()
        celltype_training_data = CancerCellTypeDataset(isGoogleColab, dfImagesTrain, baseDirectory, transform=transform_normalize)
    else:
        # For testing in a small dataset
        dfImagesTrainTest = dfImagesTrain.iloc[range(200), :].reset_index()
        celltype_training_data = CancerCellTypeDataset(isGoogleColab, dfImagesTrainTest, baseDirectory, transform=transform_normalize, target_transform=None)

    dfImagesVal = dfImagesVal.reset_index()
    celltype_validation_data = CancerCellTypeDataset(isGoogleColab, dfImagesVal, baseDirectory, transform=transform_normalize, target_transform=None)

    # Create data loaders
    celltype_train_dataloader = DataLoader(celltype_training_data, batch_size=32, shuffle=True, num_workers=2)
    celltype_val_dataloader = DataLoader(celltype_validation_data, batch_size=32, shuffle=True, num_workers=2)

    #--------- Train the NN over 10 epochs
    net = PT_NN_CellType()
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(net.parameters(), lr=learning_rate)

    for epoch in range(epochsToUse):
        print("   Starting Epoch " + str(epoch) + "...")
        for i, data in enumerate(celltype_train_dataloader, 0):
            # Get the inputs
            inputs, labels = data

            # This should convert the image tensors into vectors
            inputs = inputs.view(-1, 27 * 27 * 3)

            # Zero the parameter gradients
            optimizer.zero_grad()

            # Perform Forward and Backward propagation then optimize the weights
            outputs = net(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

    
    #--------- Predict on Validation and get the Evaluation Metric 
    correct, total = 0,  0
    step=0
    predictions = []

    # Set the Neural Network into evaluation (test) mode
    net.eval()

    y_val_celltype = []
    y_val_pred_celltype = []

    # Looping through this dataloader processes the data in batches of 32 (or whatever batch size is configured in the data loader)
    for i, data in enumerate(celltype_val_dataloader, 0):
        inputs, labels = data

        # This should convert the image tensors into vectors
        inputs = inputs.view(-1, 27 * 27 * 3)

        outputs = net(inputs)
        _, predicted = torch.max(outputs.data, 1)

        # Loop through the batch, build the lists of the raw label and prediction values
        for j in range(len(labels)):
            y_val_celltype.append(labels[j].item())
            y_val_pred_celltype.append(predicted[j].item())

        predictions.append(predicted)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

    accuracy_scores.append(accuracy_score(y_val_celltype, y_val_pred_celltype))
    precision_scores.append(precision_score(y_val_celltype, y_val_pred_celltype, average="micro"))
    recall_scores.append(recall_score(y_val_celltype, y_val_pred_celltype, average="micro"))
    f1_scores.append(f1_score(y_val_celltype, y_val_pred_celltype, average="micro"))  


print("CellType Multi-class Classification Results after K-Folds CV:")
print("Average Accuracy Score: " + str(np.mean(accuracy_scores)))
print("Average Precision Score: " + str(np.mean(precision_scores)))
print("Average Recall Score: " + str(np.mean(recall_scores)))
print("Average F1 Score: " + str(np.mean(f1_scores)))      

Fold 0:
   Starting Epoch 0...
   Starting Epoch 1...
   Starting Epoch 2...
   Starting Epoch 3...
   Starting Epoch 4...
   Starting Epoch 5...
   Starting Epoch 6...
   Starting Epoch 7...
   Starting Epoch 8...
   Starting Epoch 9...
Fold 1:
   Starting Epoch 0...
   Starting Epoch 1...
   Starting Epoch 2...
   Starting Epoch 3...
   Starting Epoch 4...
   Starting Epoch 5...
   Starting Epoch 6...
   Starting Epoch 7...
   Starting Epoch 8...
   Starting Epoch 9...


AttributeError: 'NoneType' object has no attribute 'append'
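The traceback above is truncated, so the failing line cannot be identified from this output. One common way to hit this exact message, offered here only as an assumption and not as a diagnosis of this notebook, is reassigning a list to the result of `list.append`, which returns `None`. The error then surfaces on the next call, which would be consistent with a failure that appears only after the first fold has completed. The variable names below are illustrative.

```python
scores = []

# list.append mutates the list in place and returns None
result = scores.append(0.79)
print(result)   # None
print(scores)   # [0.79]

# The buggy pattern: the name now refers to None instead of the list
broken = []
broken = broken.append(0.79)

try:
    broken.append(0.80)
except AttributeError as err:
    message = str(err)
    print(message)
```

If this is the cause, the fix is to call `append` on its own line without assigning its return value.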