-------------------------------------------------------
- Wiley Winters
- MSDS 686 Deep Learning
- Week 7-8 Kaggle Project&nbsp;&mdash;&nbsp;Brain Tumor Classification
- 2025-MAR-09
--------------------------------------------------------

## Requirements

----------------------------------------------
### Required for 80%
Complete project on *kaggle.com* using the skills learned in the <u>Deep Learning</u> class.  The following are required:
- Show/plot sample images or data with labels
- Include at least one of the following
  - Convolution
  - Max Pooling
  - Batch Normalization
  - Dropout
  - LSTM
  - TF-IDF
- Use validation data
- Evaluate model on test data

-------------------------------------------
### Additional for another 20%
- Use data augmentation
- Use at least one of the following:
  - Kernels
  - Activation functions
  - Loss functions
  - Libraries
  - Methods
- Learning rate optimization
- Functional API model
- Transfer learning with or without trainable parameters
- Confusion matrix and / or ROC plots
- Plots of accuracy/loss vs epochs
- Show/plot sample incorrect prediction with labels and correct label

----------------------------------------------------------------
<a name='imports'></a>
## 1.0 <span style='color:blue'>|</span> Load Libraries and Packages

In [None]:
# General Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os, logging, random

# Data prep and model scoring
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score

# TensorFlow likes to display a lot of debug information
# on my home system
# I will squash the messages
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
logging.getLogger('tensorflow').setLevel(logging.FATAL)

# tensorflow and keras' API
import tensorflow as tf
from tensorflow import keras

# Model building
from tensorflow.keras import backend, optimizers, regularizers
from tensorflow.keras.models import Model, Sequential
from tensorflow.keras.utils import plot_model
from tensorflow.keras.layers import Input, Dense, Dropout, Flatten, Rescaling
from tensorflow.keras.layers import Conv2D, MaxPooling2D, BatchNormalization
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Model architecture visualization
from visualkeras import layered_view

# Model training
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau
from tensorflow.keras.metrics import Precision, Recall, AUC

# Make plots have guidelines
plt.style.use('ggplot')

# Squash Python warnings
import warnings
warnings.filterwarnings('ignore')

<a name='random'></a>
### 1.1 <span style='color:blue'>|</span> Set Random Seed for Reproducibility

In [None]:
tf.keras.utils.set_random_seed(42)
tf.random.set_seed(42)
np.random.seed(42)
random.seed(42)

<a name='global'></a>
### 1.2 <span style='color:blue'>|</span> Declare Global Variables

In [None]:
# Define training and testing image directories
home_dir = '/home/wiley'
trn_dir = home_dir+'/regis/dataScience/kaggleProject/images/data/training'
tst_dir = home_dir+'/regis/dataScience/kaggleProject/images/data/testing'

# Define classes
classes = ['negative', 'positive']

# Image size and shape
img_size = (224, 224)
img_shape = (224, 224, 3)

# Number of classes
num_classes = 2

# Declare batch size
batch_size = 64

<a name='functions'></a>
## 2.0 <span style='color:blue'>|</span> Define Functions

---------------------------------------------------------------
<a name='load_df'></a>
### 2.1 <span style='color:blue'>|</span> Load DataFrames
- Join image filename and path information
- Create labels from class directory names
- Create dataframe
- Randomize dataframe rows

In [None]:
def load_dataframe(path):
    # Derive image file paths and labels from directory structure
    labels, paths = zip(*[(label, os.path.join(path, label, image))
                        for label in os.listdir(path)
                        if os.path.isdir(os.path.join(path, label))
                        for image in os.listdir(os.path.join(path, label))])

    # Create DataFrame
    df = pd.DataFrame({'paths': paths, 'labels': labels})
    
    # Randomize rows to help eliminate bias
    df = df.sample(frac=1, random_state=42).reset_index(drop=True)
    
    return df
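
To make the directory-to-label logic concrete, the sketch below builds a throwaway directory tree and applies the same comprehension used in `load_dataframe()`. The helper name `list_labeled_paths`, the placeholder file names, and the stray `README.txt` are hypothetical, and pandas is left out so the example runs with the standard library alone.

```python
import os
import tempfile

def list_labeled_paths(path):
    """Return (label, file path) pairs, one per file in each class subdirectory."""
    return [(label, os.path.join(path, label, image))
            for label in sorted(os.listdir(path))
            if os.path.isdir(os.path.join(path, label))
            for image in sorted(os.listdir(os.path.join(path, label)))]

with tempfile.TemporaryDirectory() as root:
    # Hypothetical layout: one subdirectory per class, empty placeholder files
    for label, count in (('negative', 3), ('positive', 2)):
        os.makedirs(os.path.join(root, label))
        for i in range(count):
            open(os.path.join(root, label, f'img_{i}.jpg'), 'w').close()

    # A stray file at the top level is skipped by the isdir() check
    open(os.path.join(root, 'README.txt'), 'w').close()

    pairs = list_labeled_paths(root)

# Count how many files were found per label
label_counts = {}
for label, _ in pairs:
    label_counts[label] = label_counts.get(label, 0) + 1

print(label_counts)
```

Because labels come straight from the subdirectory names, any non-image file placed inside a class directory would also be picked up, so the directories are assumed to contain images only.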

<a name='metrics'></a>
### 2.2 <span style='color:blue'>|</span> Plot Performance Metrics
Plot the following:
- Training loss
- Validation loss
- Training Accuracy
- Validation Accuracy
- Training Precision
- Validation Precision
- Training Recall
- Validation Recall
- Training AUC
- Validation AUC

In [None]:
def plot_history(history):
    epochs = range(1, len(history.history['accuracy']) + 1)

    # Plot training and validation loss
    plt.figure(figsize=(20,12))
    plt.subplot(2,2,1)
    plt.plot(epochs, history.history['loss'], 'b', label = 'Training Loss')
    plt.plot(epochs, history.history['val_loss'], 'r', label = 'Validation Loss')
    plt.title('Training and Validation Loss')
    plt.xlabel('Epochs')
    plt.ylabel('Loss')
    plt.legend()

    # Plot training and validation accuracy
    plt.subplot(2,2,2)
    plt.plot(epochs, history.history['accuracy'], 'b', label = 'Training Accuracy')
    plt.plot(epochs, history.history['val_accuracy'], 'r', label = 'Validation Accuracy')
    plt.title('Training and Validation Accuracy')
    plt.xlabel('Epochs')
    plt.ylabel('Accuracy')
    plt.legend()
    
    plt.suptitle('Model Loss and Accuracy over Epochs', fontsize=16)
    plt.show()

    # Plot training and validation precision
    plt.figure(figsize=(20,12))
    plt.subplot(2,2,1)
    plt.plot(epochs, history.history['precision'], 'b', label='Training Precision')
    plt.plot(epochs, history.history['val_precision'], 'r', label='Validation Precision')
    plt.title('Training and Validation Precision')
    plt.xlabel('Epochs')
    plt.ylabel('Precision')
    plt.legend()

    # Plot training and validation recall
    plt.subplot(2,2,2)
    plt.plot(epochs, history.history['recall'], 'b', label='Training Recall')
    plt.plot(epochs, history.history['val_recall'], 'r', label='Validation Recall')
    plt.title('Training and Validation Recall')
    plt.xlabel('Epochs')
    plt.ylabel('Recall')
    plt.legend()

    plt.suptitle('Model Precision and Recall over Epochs', fontsize=16)
    plt.show()

    # Plot training and validation AUC
    plt.figure(figsize=(5,3))
    plt.plot(epochs, history.history['auc'], 'b', label='Training AUC')
    plt.plot(epochs, history.history['val_auc'], 'r', label='Validation AUC')
    plt.title('Training and Validation AUC')
    plt.xlabel('Epochs')
    plt.ylabel('AUC')
    plt.legend()
    plt.show()

<a name='performance'></a>
### 2.3 <span style='color:blue'>|</span> Evaluate Model's Performance on Test DataSet
- Infer loss, accuracy, precision, recall, and AUC from dataset
- Compute F1 Score

In [None]:
def score_model(model, ds):
    # Get metrics from test data; evaluate() returns them in the order
    # passed to compile(): loss, accuracy, precision, recall, auc
    loss, acc, prec, recall, auc = model.evaluate(ds)

    # Calculate F1 Score from precision and recall (0 if both are 0)
    f1_score = 2 * (prec * recall) / (prec + recall) if (prec + recall) else 0.0

    # Print results
    print('-' * 30)
    print(f'Loss:      {loss:.4f}')
    print(f'Accuracy:  {acc:.4f}')
    print(f'Precision: {prec:.4f}')
    print(f'Recall:    {recall:.4f}')
    print(f'AUC:       {auc:.4f}')
    print(f'F1 Score:  {f1_score:.4f}')
    print('-' * 30)

<a name='cm_matrix'></a>
### 2.4 <span style='color:blue'>|</span> Plot Confusion Matrix

In [None]:
def plot_cm(model, ds):
    # Get predicted class indices from dataset
    # (ds must be created with shuffle=False so ds.classes lines up)
    preds = np.argmax(model.predict(ds), axis=1)

    # Create confusion matrix
    cm = confusion_matrix(ds.classes, preds)

    # Visualize confusion matrix
    plt.figure(figsize=(5,3))
    sns.heatmap(cm, annot=True, fmt='d', cmap='viridis',
                xticklabels=classes,
                yticklabels=classes)
    plt.title('Confusion Matrix')
    plt.xlabel('Predicted')
    plt.ylabel('True')
    plt.show()

<a name='tpr'></a>
### 2.5 <span style='color:blue'>|</span> Compute TPR and TNR

In [None]:
def compute_tpr(model, ds):
    # Get predicted class indices from dataset
    # (ds must be created with shuffle=False so ds.classes lines up)
    preds = np.argmax(model.predict(ds), axis=1)
    
    # Create confusion matrix
    cm = confusion_matrix(ds.classes, preds)

    # Extract required values from confusion matrix
    (tn, fp, fn, tp) = cm.flatten()

    # Calculate TPR
    tpr = tp / (tp + fn)

    # Calculate TNR
    tnr = tn / (tn + fp)

    # Print TPR and TNR
    print('-' * 30)
    print(f'True Positive Rate (TPR): {tpr:.4f}')
    print(f'True Negative Rate (TNR): {tnr:.4f}')
    print('-' * 30)

<a name='load_data'></a>
## 3.0 <span style='color:blue'>|</span> Load Data

--------------------------------------------------
<a name='load_df'></a>
### 3.1 <span style='color:blue'>|</span> Create and Load DataFrame for EDA

In [None]:
# Load training data
trn_df = load_dataframe(trn_dir)

# Load testing data
tst_df = load_dataframe(tst_dir)

# Take a look at the results
print('Training:   \n', trn_df.head(10).to_markdown())
print('Testing:    \n', tst_df.head(10).to_markdown())

<a name='eda'></a>
## 4.0 <span style='color:blue'>|</span> EDA

------------------------------------------
<a name='trn_dist'></a>
### 4.1 <span style='color:blue'>|</span> Look at Training Images' Distribution

In [None]:
plt.figure(figsize=(6,4))
trn_df['labels'].value_counts().plot(kind='bar')
plt.title('Distribution of Image Counts in Training Data')
plt.xlabel('Category')
plt.ylabel('Image Count')
plt.show()

Negative images slightly outnumber positive ones, but the classes are balanced closely enough to continue without additional data wrangling.

<a name='tst_dist'></a>
### 4.2 <span style='color:blue'>|</span> Look at Testing Images' Distribution

In [None]:
plt.figure(figsize=(6,4))
tst_df['labels'].value_counts().plot(kind='bar')
plt.title('Distribution of Image Counts in Testing Data')
plt.xlabel('Category')
plt.ylabel('Image Count')
plt.show()

The distribution mirrors that of the *training data*, but with fewer images.

<a name='shape'></a>
### 4.3 <span style='color:blue'>|</span> Examine Shape of Training and Testing DataFrames

In [None]:
print('Training Shape: \n', trn_df.shape)
print('Testing Shape:  \n', tst_df.shape)

**NOTE:**&nbsp;&nbsp;Since the dataframes are built from the contents of the image directories, there should be no missing values or duplicates.

<a name='wrangling'></a>
## 4.0 <span style='color:blue'>|</span> Data Wrangling

-------------------------------------
<a name='cr_val'></a>
### 4.1 <span style='color:blue'>|</span> Create a Validation Subset from Training Data
I will use `flow_from_dataframe()` to create the datasets for model training, so there is no need to create a separate directory structure for the validation data.

In [None]:
# Hold out 20% of the training rows for validation, keeping class proportions
trn_df, val_df = train_test_split(trn_df, test_size=0.2, random_state=42,
                                  stratify=trn_df['labels'])
print(val_df.sample(10).to_markdown())
print(f'Validation Shape: {val_df.shape}')

<a name='proc_imgs'></a>
### 4.2 <span style='color:blue'>|</span> Process Images from DataFrames

In [None]:
# Apply image augmentation
gen = ImageDataGenerator(rescale=1./255,
                         brightness_range=(0.5, 1.5),
                         rotation_range=20,
                         width_shift_range=0.2,
                         height_shift_range=0.2,
                         shear_range=0.2,
                         zoom_range=0.2)

# The test dataset should not be augmented
# just rescaled
tst_gen = ImageDataGenerator(rescale=1./255)

# Create training datagen set
trn_gen = gen.flow_from_dataframe(trn_df, x_col='paths', y_col='labels',
                                  batch_size=batch_size, target_size=img_size,
                                  shuffle=True)

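# Create validation datagen set
# Validation images are only rescaled, not augmented, so the validation
# metrics reflect unmodified images
val_gen = ImageDataGenerator(rescale=1./255).flow_from_dataframe(
    val_df, x_col='paths', y_col='labels',
    batch_size=batch_size, target_size=img_size,
    shuffle=True)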

# Create test datagen set
tst_gen = tst_gen.flow_from_dataframe(tst_df, x_col='paths', y_col='labels',
                                      batch_size=16, target_size=img_size,
                                      shuffle=False)

<a name='exam_imgs'></a>
### 4.3 <span style='color:blue'>|</span> Examine a few Images and their Labels
The images displayed are drawn from the test generator, so they have been rescaled but not augmented.

In [None]:
# Avoid shadowing the built-in dict
class_indices = trn_gen.class_indices
classes = list(class_indices.keys())
images, labels = next(tst_gen)

plt.figure(figsize=(20,20))
for i, (image, label) in enumerate(zip(images, labels)):
    plt.subplot(4,4,i+1)
    plt.imshow(image)
    class_name = classes[np.argmax(label)]
    plt.title(class_name, color='k', fontsize=15)

plt.show()

<a name='configure'></a>
## 5.0 <span style='color:blue'>|</span> Configure Training Values

-----------------------------------------------
<a name='basic_values'></a>
### 5.1 <span style='color:blue'>|</span> Basic Values

In [None]:
# Number of training epochs
epochs = 50

# Steps per epoch
steps_per_ep = trn_gen.samples // batch_size

# Validation steps (based on the validation set, not the test set)
val_steps = val_gen.samples // batch_size

print(f'Image shape:      {img_shape}')
print(f'Epochs:           {epochs}')
print(f'Batch size:       {batch_size}')
print(f'Steps per epoch:  {steps_per_ep}')
print(f'Validation steps: {val_steps}')
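
The step counts above use floor division, which leaves any final partial batch out of the per-epoch step count. The short sketch below works through that arithmetic. The sample counts and the helper `steps_for` are hypothetical and are not taken from the actual dataset.

```python
import math

def steps_for(samples, batch_size, keep_partial=False):
    """Number of batches used to cover `samples` images in one pass."""
    if keep_partial:
        # Round up so the last, smaller batch is included
        return math.ceil(samples / batch_size)
    # Round down so only full batches are counted
    return samples // batch_size

# Hypothetical counts for illustration only
train_samples, val_samples, batch = 1000, 250, 64

full_train_steps = steps_for(train_samples, batch)
skipped_per_epoch = train_samples - full_train_steps * batch
all_val_steps = steps_for(val_samples, batch, keep_partial=True)

print(f'Full training steps:       {full_train_steps}')
print(f'Images skipped per epoch:  {skipped_per_epoch}')
print(f'Validation steps (ceil):   {all_val_steps}')
```

With shuffling enabled on the training generator, the images left out differ from epoch to epoch, so floor division is usually harmless for training. For validation, rounding up is one way to make sure every image is scored.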

<a name='callbacks'></a>
### 5.2 <span style='color:blue'>|</span> Define Callbacks
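With these *callbacks*, training stops if the validation loss stops decreasing (`EarlyStopping()`), and the learning rate is reduced if the validation loss plateaus (`ReduceLROnPlateau()`).  Note that `EarlyStopping()` uses a shorter patience (4) than `ReduceLROnPlateau()` (5), so training will often stop before the learning rate is reduced.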

In [None]:
# Define early_stop callback
early_stop = EarlyStopping(monitor='val_loss', min_delta=0.000000001, patience=4,
                           baseline=None, restore_best_weights=True, start_from_epoch=0)

# Define reduce LR on Plateau callback
reduceLRO = ReduceLROnPlateau(monitor='val_loss', factor=0.1, patience=5, mode='auto',
                              min_delta=0.0001, cooldown=0, min_lr=0)
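
The effect of `patience` can be illustrated with a small, framework-free simulation. This is a simplified sketch of patience-based early stopping, not the Keras implementation. The function `early_stop_epoch` and the loss values are hypothetical.

```python
def early_stop_epoch(val_losses, patience, min_delta=0.0):
    """Return the 1-based epoch at which training would stop, or None.

    An epoch counts as an improvement only if the loss drops by more
    than `min_delta` below the best value seen so far.
    """
    best = float('inf')
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            # New best value: reset the patience counter
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

# Hypothetical validation losses: best at epoch 3, late improvement at epoch 8
losses = [0.70, 0.55, 0.50, 0.52, 0.51, 0.53, 0.54, 0.49]

stop_with_4 = early_stop_epoch(losses, patience=4)
stop_with_5 = early_stop_epoch(losses, patience=5)

print(f'patience=4 stops at epoch: {stop_with_4}')
print(f'patience=5 stops at epoch: {stop_with_5}')
```

In this made-up sequence, a patience of 4 ends training before the improvement at epoch 8 is ever seen, while a patience of 5 lets training continue. The same trade-off applies when choosing the patience values for the callbacks above.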

<a name='baseline_model'></a>
## 6.0 <span style='color:blue'>|</span> Baseline Model

--------------------------------------------
<a name='architecture'></a>
### 6.1 <span style='color:blue'>|</span> Define Model's Architecture
The functional API method was chosen to define the model's architecture, since it provides a lot of control over the model's structure.

In [None]:
backend.clear_session()

inputs  = Input(shape=(img_shape))

# Conv Layer 1
conv1   = Conv2D(filters=32, kernel_size=4, padding='same',
                 activation='relu')(inputs)
pool1   = MaxPooling2D(pool_size=(2,2))(conv1)
drop1   = Dropout(0.1)(pool1)

# Conv Layer 2
conv2   = Conv2D(filters=64, kernel_size=4, padding='same',
                 activation='relu')(drop1)
pool2   = MaxPooling2D(pool_size=(2,2))(conv2)
drop2   = Dropout(0.1)(pool2)

# Conv Layer 3
conv3   = Conv2D(filters=128, kernel_size=4, padding='same',
                 activation='relu')(drop2)
pool3   = MaxPooling2D(pool_size=(2,2))(conv3)
drop3   = Dropout(0.1)(pool3)

# Conv Layer 4
conv4   = Conv2D(filters=128, kernel_size=4, padding='same',
                 activation='relu')(drop3)
pool4   = MaxPooling2D(pool_size=(2,2))(conv4)
drop4   = Dropout(0.1)(pool4)

# Apply Batch Normalization, Flatten, and Dense Layers
batch1  = BatchNormalization()(drop4)
flatten = Flatten()(batch1)
dense1  = Dense(128, activation='relu')(flatten)
dropout = Dropout(0.5)(dense1)
dense2  = Dense(512, activation='relu')(dropout)

# Last Dense layer with softmax activation
preds   = Dense(num_classes, activation='softmax')(dense2)

# Pulling it all together
model_base = Model(inputs, preds)

model_base.summary()
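
As a rough cross-check on the summary, the feature-map sizes and parameter counts can be worked out by hand. The sketch below assumes the layer settings shown above (4x4 kernels, `padding='same'`, and 2x2 max pooling with default strides). The helper `conv_params` is hypothetical, and the batch normalization and final dense layers are left out for brevity.

```python
def conv_params(kernel, in_ch, out_ch):
    """Weights plus biases for a square-kernel Conv2D layer."""
    return kernel * kernel * in_ch * out_ch + out_ch

size, channels = 224, 3            # height/width and channels of img_shape
total_conv = 0
for filters in (32, 64, 128, 128):
    # 'same' padding keeps height and width unchanged through the convolution
    total_conv += conv_params(4, channels, filters)
    # 2x2 max pooling halves height and width
    size //= 2
    channels = filters

# Flatten turns the final feature map into a single vector
flat_features = size * size * channels

# First dense layer: one weight per input feature per unit, plus biases
dense1_params = flat_features * 128 + 128

print(f'Final feature map:   {size} x {size} x {channels}')
print(f'Flattened features:  {flat_features}')
print(f'Conv parameters:     {total_conv}')
print(f'Dense(128) params:   {dense1_params}')
```

Under these assumptions, most of the trainable parameters sit in the first dense layer rather than in the convolutional blocks.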

<a name='layered_view'></a>
### 6.2 <span style='color:blue'>|</span> Visualize Layers

In [None]:
layered_view(model_base, legend=True, max_xy=300)

<a name='compile'></a>
### 6.3 <span style='color:blue'>|</span> Compile and Train Model
The `RMSprop()` optimizer was selected for this model, with a learning rate of 0.0005.  The loss function `categorical_crossentropy` was selected because the generators supply one-hot encoded labels and the final layer uses a softmax activation.

In [None]:
# Configure RMSprop optimizer
opt = optimizers.RMSprop(learning_rate=0.0005)

# Compile base model
model_base.compile(optimizer=opt, loss='categorical_crossentropy',
                   metrics=['accuracy',
                            tf.keras.metrics.Precision(name='precision'),
                            tf.keras.metrics.Recall(name='recall'),
                            tf.keras.metrics.AUC(curve='PR', name='auc')])

# batch_size is not passed to fit(); the generators already yield batches
hist_base = model_base.fit(trn_gen, steps_per_epoch=steps_per_ep,
                           epochs=epochs, validation_data=val_gen,
                           validation_steps=val_steps,
                           callbacks=[early_stop, reduceLRO])

<a name='evaluate'></a>
## 7.0 <span style='color:blue'>|</span> Evaluate Performance

------------------------------------------------------------
<a name='history'></a>
### 7.1 <span style='color:blue'>|</span> Plot Training and Validation Metrics

In [None]:
plot_history(hist_base)

<a name='score'></a>
### 7.2 <span style='color:blue'>|</span> Score Model
To evaluate the model's performance, the following metrics will be computed on the test dataset:
- Model Loss&nbsp;&mdash;&nbsp;the categorical cross-entropy on the test set; lower values mean the predicted probabilities are closer to the true labels
- Model Accuracy&nbsp;&mdash;&nbsp;the proportion of all classifications that were correct
- Precision&nbsp;&mdash;&nbsp;the proportion of the model's positive classifications that are actually positive
- Recall&nbsp;&mdash;&nbsp;the proportion of actual positives that the model classified as positive
- Area Under Curve (AUC)&nbsp;&mdash;&nbsp;computed here on the precision-recall curve (`curve='PR'`); it summarizes how well precision holds up as recall increases across classification thresholds
- F1 Score&nbsp;&mdash;&nbsp;describes the harmonic mean of the precision and recall of the model
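
The relationship between these metrics is easiest to see with concrete counts. The sketch below computes accuracy, precision, recall, and F1 from the four confusion-matrix counts. The helper `binary_metrics` and the counts are hypothetical and are not results from this model.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against division by zero when a denominator is empty
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    # F1 is the harmonic mean of precision and recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {'accuracy': accuracy, 'precision': precision,
            'recall': recall, 'f1': f1}

# Hypothetical counts for illustration only
m = binary_metrics(tp=90, fp=10, fn=30, tn=70)

for name, value in m.items():
    print(f'{name:<10} {value:.4f}')
```

Because it is a harmonic mean, F1 is pulled toward the lower of precision and recall, so a model cannot score well on F1 by excelling at only one of the two.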

In [None]:
score_model(model_base, tst_gen)

<a name='plot_cm'></a>
### 7.3 <span style='color:blue'>|</span> Plot Confusion Matrix
A confusion matrix provides a visual representation of a model's performance when it comes to comparing true positives, false negatives, true negatives, and false positives.

In [None]:
plot_cm(model_base, tst_gen)

<a name='tpr_tnr'></a>
### 7.4 <span style='color:blue'>|</span> Compute TPR and TNR
The True Positive Rate (TPR) and True Negative Rate (TNR) are good indicators of how well the model is predicting positives (1s) and negatives (0s).  
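
As a small worked example, the sketch below builds a 2x2 confusion matrix by hand, with rows as true labels and columns as predicted labels (the same layout scikit-learn's `confusion_matrix` uses), and derives TPR and TNR from it. The helper `confusion_2x2` and the label lists are hypothetical.

```python
def confusion_2x2(y_true, y_pred):
    """Rows are true labels, columns are predicted labels (0 then 1)."""
    cm = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm

# Hypothetical labels: 4 negatives (0) and 6 positives (1)
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0]

cm = confusion_2x2(y_true, y_pred)

# Row 0 holds the true negatives' outcomes, row 1 the true positives'
(tn, fp), (fn, tp) = cm

tpr = tp / (tp + fn)   # share of actual positives that were found
tnr = tn / (tn + fp)   # share of actual negatives that were found

print(f'Confusion matrix: {cm}')
print(f'TPR: {tpr:.4f}')
print(f'TNR: {tnr:.4f}')
```

Flattening this matrix row by row gives the order `tn, fp, fn, tp`, which is the order `compute_tpr()` relies on when unpacking.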

In [None]:
compute_tpr(model_base, tst_gen)

<a name='save_weights'></a>
## 8.0 <span style='color:blue'>|</span> Save Weights for Future Use

---------------------------------------------------

In [None]:
#model_base.save('kaggleProject.keras')

<a name='dicussion'></a>
## 9.0 <span style='color:blue'>|</span> Discussion and Conclusions

-------------------------------------------------------
<a name='about_data'></a>
### 9.1 <span style='color:blue'>|</span> About the Data
A majority of the image data was obtained from Viradiya's Kaggle notebook (2021), and it contains two classes of images: those with brain tumors and those without.  All of the images are labeled and divided into their respective classes.  The *healthy* class was in the minority; therefore, additional *healthy* images were copied from a dataset curated by Bhuvaji in 2019.  A total of 2,589 healthy and 2,513 tumor images were used for model training.  A traditional 80/20 split was used to separate the images into training and testing datasets.  In addition, another 20% was taken from the training dataset for validation during the training process.

Other sources were examined, but discarded due to poor labeling and unknown origin of the images.

<a name='question'></a>
### 9.2 <span style='color:blue'>|</span> Research Question
A review of the literature showed that many studies on using machine learning to classify brain tumor MRI images concentrated on classifying images by tumor type; however, very few looked at simply determining whether an image shows a tumor or not.

The question the author of this study would like to answer is: ***Can a CNN model be developed that can accurately predict if an MRI brain image contains a tumor or not?***  While classifying tumor types is important, many times a tumor is too small to be recognized, or a post-resection MRI is not clear enough for the radiologist or neurosurgeon to make an accurate diagnosis.  This is where using ML/AI to predict whether an image is positive for a brain tumor can come into play.

<a name='methods'></a>
### 9.3 <span style='color:blue'>|</span> Methods
A review of peer-reviewed articles was conducted in search of reliable MRI brain image datasets.  Many papers pointed to the MRI datasets on [kaggle.com](https://www.kaggle.com/search?q=brain+tumor+mri+dataset) as their source.  I found this interesting, since a few of the datasets on kaggle.com lack enough documentation to make them useful for research purposes.

Once the data was curated, a model architecture had to be decided upon.  Current literature indicated that Convolutional Neural Networks (CNNs) are often used for this type of image classification problem.  Saeedi, et al., suggested an architecture of four convolutional layers, each one followed by a *MaxPooling2D* and a *Dropout* component.  The output of these layers is then flattened and fed into two dense layers for final classification (2023).  This architecture formed the basis for the one used in this study.

### 9.3.1 <span style='color:blue'>|</span> Callbacks
To prevent overfitting, two callbacks were employed.  The `EarlyStopping()` callback stops the model's training when the validation loss stops decreasing, and the `ReduceLROnPlateau()` callback decreases the learning rate if the validation loss plateaus.