# Final Project
Class: Adv ML

By: Sierra Cheung

Citations:

* Data: https://www.kaggle.com/datasets/preetviradiya/brian-tumor-dataset/code
  - Owner: Preet Viradiya
  - The data includes 4600 brain scans: 2087 labeled 'Healthy' and 2513 labeled 'Brain Tumor' (meaning they contain a brain tumor)

  - type: image data, target: labels (cancer vs healthy)

  - I decided to use this dataset because well-trained machine learning models for medical scans would be an extremely useful innovation. Currently, doctors need extensive training to evaluate a scan, and it takes a long time to learn how to spot minute differences in the images. Doctors also vary in their ability to spot key details, both between doctors (one doctor is not always as good as the next at reading scans) and within a single career. For example, as doctors age their eyesight may get poorer, while newer doctors may not yet be trained enough to have a reliable eye. A machine learning model that is accurate and consistent could be used widely and easily. A model is also not as constrained by time as a doctor: all it needs to do is scan images, whereas doctors have a long list of duties to take care of day to day. While this may not be able to replace a doctor in evaluating scans, it could provide a 'second opinion' in the form of a generally reliable classification. In the case of diagnosing cancer, the reassurance of an evaluation that is 90% accurate or better can make a world of difference for treatment and peace of mind for all involved: doctors, patients, family members, etc.

* Pitfalls: 
  - The data is limited (only tumor vs. no tumor), so I can't look at classification of malignant vs. benign tumors. I might need time-series data on growth for that.
  - The data (I think) is mostly from adults, so it may not generalize to children. This bias really limits how widely the model can be applied. It could be used selectively, only on the kinds of patients it was trained on, but for real-world implementation it's important to get a wider sample that better represents the target population/general population.


* Preprocessor: adapted from the COVID lung classification homework
  - The preprocessing steps here are essentially just reading in each image, adjusting the channel order to RGB, resizing, and then putting the images into an array for Keras.
  - I decided to use this one because the type of data is very similar: these are scans of brains instead of lungs. I started out by seeing if the models I used in that homework performed similarly on this data set, and then built off of that baseline, because the models I started with were not very good. The process of building them is outlined below.
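
As a rough illustration of the steps described above, the sketch below applies the channel reordering and scaling to a tiny synthetic array using NumPy only. The 2x2 `bgr` image and the helper name `to_rgb_unit` are made up for illustration; the notebook itself does this with OpenCV, and resizing is left out here.

```python
import numpy as np

def to_rgb_unit(bgr_image):
    """Reverse the channel axis (BGR -> RGB) and scale uint8 pixels to [0, 1]."""
    rgb = bgr_image[..., ::-1]              # flip the last axis: B,G,R -> R,G,B
    return rgb.astype(np.float32) / 255.0   # float32 array, ready for Keras

# Hypothetical 2x2 image in which every pixel is pure blue in BGR order
bgr = np.zeros((2, 2, 3), dtype=np.uint8)
bgr[..., 0] = 255

rgb = to_rgb_unit(bgr)
print(rgb.shape)   # (2, 2, 3)
print(rgb[0, 0])   # blue is now the last channel
```

Skipping the reordering would swap the red and blue channels, which matters most when a network was pretrained on RGB images.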

## Loading Data and Libraries

In [None]:
#loading libraries
import os
import sys
import cv2
import time
import zipfile
import matplotlib
import numpy as np
import pandas as pd
import tensorflow as tf
from itertools import repeat
from google.colab import drive
from skimage.transform import resize
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split
# Use the public tf.keras API throughout; mixing in tensorflow.python.keras can break across versions
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import Dense, Dropout, Flatten, Activation, BatchNormalization
from tensorflow.keras.layers import Conv2D, MaxPooling2D
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.optimizers import Adam, SGD, Adagrad, Adadelta, RMSprop
from tensorflow.keras.applications import VGG16, ResNet50
# Imports do not need a tf.device scope
from tensorflow.keras.callbacks import ReduceLROnPlateau, ModelCheckpoint

In [None]:
%%capture
#mount gdrive
drive.mount('/content/gdrive')

#loading data & opening the zip file in READ mode
with zipfile.ZipFile('gdrive/My Drive/archive.zip', 'r') as zf:  # 'zf' avoids shadowing the built-in zip()
    # printing & extracting contents
    zf.printdir()
    zf.extractall()

In [None]:
# Extracting all filenames iteratively
base_path = 'Brain Tumor Data Set/Brain Tumor Data Set/'
categories = ['Brain Tumor', 'Healthy']

# load file names to fnames list object
fnames = []
for category in categories:
    image_folder = os.path.join(base_path, category)
    file_names = os.listdir(image_folder)
    full_path = [os.path.join(image_folder, file_name) for file_name in file_names]
    fnames.append(full_path)

print('number of images for each category:', [len(f) for f in fnames])
print(fnames[0:1]) #examples of file names

number of images for each category: [2513, 2087]
[['Brain Tumor Data Set/Brain Tumor Data Set/Brain Tumor/Cancer (473).jpg', 'Brain Tumor Data Set/Brain Tumor Data Set/Brain Tumor/Cancer (1222).jpg', 'Brain Tumor Data Set/Brain Tumor Data Set/Brain Tumor/Cancer (161).jpg', 'Brain Tumor Data Set/Brain Tumor Data Set/Brain Tumor/Cancer (550).jpg', 'Brain Tumor Data Set/Brain Tumor Data Set/Brain Tumor/Cancer (1834).jpg', 'Brain Tumor Data Set/Brain Tumor Data Set/Brain Tumor/Cancer (15).jpg', 'Brain Tumor Data Set/Brain Tumor Data Set/Brain Tumor/Cancer (116).jpg', 'Brain Tumor Data Set/Brain Tumor Data Set/Brain Tumor/Cancer (804).jpg', 'Brain Tumor Data Set/Brain Tumor Data Set/Brain Tumor/Cancer (1649).jpg', 'Brain Tumor Data Set/Brain Tumor Data Set/Brain Tumor/Cancer (2009).jpg', 'Brain Tumor Data Set/Brain Tumor Data Set/Brain Tumor/Cancer (2043).jpg', 'Brain Tumor Data Set/Brain Tumor Data Set/Brain Tumor/Cancer (630).jpg', 'Brain Tumor Data Set/Brain Tumor Data Set/Brain Tumor/Ca

In [None]:
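def preprocessor(data, shape=(192, 192)):
    """Read one image file and return a float32 RGB array scaled to [0, 1]."""
    img = cv2.imread(data)                      # OpenCV loads images in BGR order
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)  # reorder channels to RGB
    img = cv2.resize(img, shape)                # resize to the model input size
    img = img / 255.0                           # scale pixel values to [0, 1]
    return np.array(img, dtype=np.float32)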

In [None]:
#Import image files iteratively and preprocess them into array of correctly structured data

# Create list of paths
image_filepaths=fnames[0]+fnames[1]
preprocessed_image_data=list(map(preprocessor,image_filepaths ))
X= np.array(preprocessed_image_data) # Assigning to X to highlight that this represents feature input data for our model

In [None]:
print(len(X) ) #same number of elements as filenames

4600


In [None]:
# Create y data made up of correctly ordered labels from file folders
# 2 folders (2513 cancer images, 2087 healthy ones)

print('number of images for each category:', [len(f) for f in fnames])
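cancer=list(repeat("CANCER", len(fnames[0])))    # 2513 images from the 'Brain Tumor' folder
healthy=list(repeat("HEALTHY", len(fnames[1])))  # 2087 images from the 'Healthy' folder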

#combine to single list
y_labels = cancer+healthy

#check length
print(len(y_labels) )

#one-hot encode
y=pd.get_dummies(y_labels, dtype=int)  # integer 0/1 columns (newer pandas versions default to booleans)
display(y)

number of images for each category: [2513, 2087]
4600


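|      | CANCER | HEALTHY |
|------|--------|---------|
| 0    | 1      | 0       |
| 1    | 1      | 0       |
| 2    | 1      | 0       |
| 3    | 1      | 0       |
| 4    | 1      | 0       |
| ...  | ...    | ...     |
| 4595 | 0      | 1       |
| 4596 | 0      | 1       |
| 4597 | 0      | 1       |
| 4598 | 0      | 1       |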


In [None]:
#train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify = y, test_size = 0.20, random_state = 1987)
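
Because the two classes are not perfectly balanced, it helps to know what accuracy a model would reach by predicting the larger class for every image. The sketch below works that out from the folder counts printed earlier. The count of 503 tumor images in the test split is an assumption about how the stratified split rounds, not something the notebook prints.

```python
# Class counts printed when the file names were loaded
n_cancer, n_healthy = 2513, 2087
n_total = n_cancer + n_healthy

# Accuracy of always predicting the larger class on the full data set
majority_baseline = max(n_cancer, n_healthy) / n_total
print(round(majority_baseline, 5))       # 0.5463

# A 20% test split holds 920 images
n_test = round(0.20 * n_total)
# ASSUMPTION: the stratified split places 503 tumor images in the test set
n_test_cancer = 503
test_baseline = n_test_cancer / n_test
print(n_test, round(test_baseline, 5))   # 920 0.54674
```

A `val_acc` of 0.54674, the value that appears in several of the training logs in this notebook, would be consistent with a model that predicts the tumor class for every test image.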

## Model 1

In [None]:
base_vgg16 = VGG16(input_shape=(192,192,3),include_top=False, weights='imagenet') #pretrained weight from model in imagenet comp.

Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/vgg16/vgg16_weights_tf_dim_ordering_tf_kernels_notop.h5


In [None]:
# Callbacks come from the public tf.keras API; a tf.device scope has no effect on imports
from tensorflow.keras.callbacks import ReduceLROnPlateau, ModelCheckpoint

In [None]:
flat1 = Flatten()(base_vgg16.layers[-1].output)
class1 = Dense(1024, activation='relu')(flat1)
output = Dense(2, activation='softmax')(class1)
# define model1
model1 = Model(inputs=base_vgg16.inputs, outputs=output) # reuse the VGG16 input tensor so the new head sits on top of the pretrained base
  
mc = ModelCheckpoint('best_model.h5', monitor='val_acc',mode='max', verbose=1, save_best_only=True) # save only when val_acc reaches a new maximum
red_lr= ReduceLROnPlateau(monitor='val_acc',patience=2,verbose=1,factor=0.5, min_lr=0.001) # halve the lr when val_acc fails to improve for 2 epochs, never going below 0.001

model1.compile(loss='categorical_crossentropy', optimizer='sgd', metrics=['acc']) 

model1.fit(X_train, y_train,batch_size=1,epochs = 3, verbose=1,validation_data=(X_test,y_test),callbacks=[mc,red_lr])

Epoch 1/3
Epoch 00001: val_acc improved from -inf to 0.54674, saving model to best_model.h5
Epoch 2/3
Epoch 00002: val_acc did not improve from 0.54674
Epoch 3/3
Epoch 00003: val_acc did not improve from 0.54674

Epoch 00003: ReduceLROnPlateau reducing learning rate to 0.004999999888241291.


<keras.callbacks.History at 0x7fad4c1b3050>
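
The learning-rate message in the log can be reproduced by hand. The sketch below is a simplified re-implementation of the reduce-on-plateau rule in plain Python, not the Keras callback itself. It assumes the default SGD learning rate of 0.01 and uses made-up validation accuracies from epoch 2 onward, since the log only records that they did not improve.

```python
import numpy as np

def plateau_schedule(val_accs, lr=0.01, factor=0.5, patience=2, min_lr=0.001):
    """Return the learning rate in effect after each epoch."""
    best, wait, history = float("-inf"), 0, []
    for acc in val_accs:
        if acc > best:            # improvement: remember it and reset the counter
            best, wait = acc, 0
        else:                     # no improvement this epoch
            wait += 1
            if wait >= patience and lr > min_lr:
                lr = max(lr * factor, min_lr)   # never drop below min_lr
                wait = 0
        history.append(lr)
    return history

# Epoch 1 matches the log; the later values are hypothetical placeholders
val_accs = [0.54674, 0.54, 0.54, 0.54, 0.54, 0.54, 0.54, 0.54, 0.54]
schedule = plateau_schedule(val_accs)
print(schedule)

# The long decimal in the log is 0.01 * 0.5 stored in float32
logged_lr = float(np.float32(0.01) * np.float32(0.5))
print(logged_lr)   # 0.004999999888241291
```

With `min_lr=0.001` and a starting rate of 0.01, the rate can shrink by at most a factor of ten, no matter how long the plateau lasts.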

## Model 2

In [None]:
base_resnet = ResNet50(input_shape=(192,192,3),include_top=False, weights='imagenet') #pretrained weights

Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/resnet/resnet50_weights_tf_dim_ordering_tf_kernels_notop.h5


In [None]:
flat1 = Flatten()(base_resnet.layers[-1].output)
class1 = Dense(1024, activation='relu')(flat1)
output = Dense(2, activation='softmax')(class1)
# define model2
model2 = Model(inputs=base_resnet.inputs, outputs=output)
  
mc = ModelCheckpoint('best_model.h5', monitor='val_acc',mode='max', verbose=1, save_best_only=True)
red_lr= ReduceLROnPlateau(monitor='val_acc',patience=2,verbose=1,factor=0.5, min_lr=0.001)

model2.compile(loss='categorical_crossentropy', optimizer='sgd', metrics=['acc']) 

model2.fit(X_train, y_train,batch_size=1,epochs = 3, verbose=1,validation_data=(X_test,y_test),callbacks=[mc,red_lr])

Epoch 1/3
Epoch 00001: val_acc improved from -inf to 0.51630, saving model to best_model.h5
Epoch 2/3
Epoch 00002: val_acc improved from 0.51630 to 0.55109, saving model to best_model.h5
Epoch 3/3

In [None]:
flat1 = Flatten()(base_resnet.layers[-1].output)
class1 = Dense(1024, activation='relu')(flat1)
output = Dense(2, activation='tanh')(class1) # experiment: tanh instead of softmax (tanh outputs lie in [-1, 1] and are not probabilities)
# define new model
model2b = Model(inputs=base_resnet.inputs, outputs=output) # reuse the ResNet50 input tensor so the new head sits on top of the pretrained base
  
mc = ModelCheckpoint('best_model.h5', monitor='val_acc',mode='max', verbose=1, save_best_only=True)
red_lr= ReduceLROnPlateau(monitor='val_acc',patience=2,verbose=1,factor=0.5, min_lr=0.001) # halve the lr when val_acc fails to improve for 2 epochs, never going below 0.001

model2b.compile(loss='categorical_crossentropy', optimizer='sgd', metrics=['acc']) 

model2b.fit(X_train, y_train,batch_size=10,epochs = 5, verbose=1,validation_data=(X_test,y_test),callbacks=[mc,red_lr])

Epoch 1/5
Epoch 00001: val_acc improved from -inf to 0.54674, saving model to best_model.h5
Epoch 2/5
Epoch 00002: val_acc did not improve from 0.54674
Epoch 3/5
Epoch 00003: val_acc did not improve from 0.54674

Epoch 00003: ReduceLROnPlateau reducing learning rate to 0.004999999888241291.
Epoch 4/5
Epoch 00004: val_acc did not improve from 0.54674
Epoch 5/5
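
Model 2b pairs a tanh output layer with categorical cross-entropy. The NumPy sketch below, using made-up scores for a single image, shows why that pairing is awkward: softmax turns the two scores into probabilities that sum to 1, while tanh can produce negative values, for which the log in the cross-entropy is undefined. The helper names are illustrative, and the loss is the textbook formula rather than Keras's internal implementation.

```python
import numpy as np

def softmax(z):
    """Map raw scores to probabilities that sum to 1."""
    e = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return e / e.sum()

def cross_entropy(y_true, y_pred):
    """Textbook categorical cross-entropy for one example."""
    return float(-np.sum(y_true * np.log(y_pred)))

logits = np.array([-1.5, 0.5])   # hypothetical scores for [CANCER, HEALTHY]
y_true = np.array([1.0, 0.0])    # the image is labeled CANCER

p_softmax = softmax(logits)
p_tanh = np.tanh(logits)

print(p_softmax, p_softmax.sum())   # two positive values that sum to 1
print(p_tanh)                       # first value is negative, so not a probability
loss_softmax = cross_entropy(y_true, p_softmax)
print(loss_softmax)                 # finite loss

with np.errstate(invalid="ignore"):
    loss_tanh = cross_entropy(y_true, p_tanh)
print(loss_tanh)                    # nan: log of a negative number
```

This does not prove that the tanh output is what held model 2b back, but it is one reason to prefer softmax whenever the loss expects class probabilities.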

## Model 3

In [None]:
model3 = tf.keras.Sequential([   
  tf.keras.layers.Conv2D(kernel_size=3, filters=32, padding='same', activation='relu', input_shape=(192, 192, 3)),
  tf.keras.layers.MaxPooling2D(pool_size=2),
  tf.keras.layers.Conv2D(kernel_size=3, filters=64, padding='same', activation='relu'),
  tf.keras.layers.MaxPooling2D(pool_size=2),
  tf.keras.layers.Conv2D(kernel_size=3, filters=128, padding='same', activation='relu'),
  tf.keras.layers.MaxPooling2D(pool_size=2),
  tf.keras.layers.Conv2D(kernel_size=3, filters=512, padding='same', activation='relu'),
  tf.keras.layers.Conv2D(kernel_size=3, filters=512, padding='same', activation='relu'),
  tf.keras.layers.Conv2D(kernel_size=3, filters=512, padding='same', activation='relu'),
  tf.keras.layers.MaxPooling2D(pool_size=3),
  tf.keras.layers.Flatten(),
  tf.keras.layers.Dense(2, activation='softmax')
])


model3.compile(
  optimizer="sgd", 
  loss= 'categorical_crossentropy',
  metrics=['accuracy'])

In [None]:
# Fitting
model3.fit(X_train, y_train, epochs = 2, verbose=1,validation_data=(X_test,y_test)) #, callbacks=[red_lr]) for callback that automatically adjusts lr

Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x7fadf584f6d0>
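
To see how the layer choices in Model 3 translate into tensor sizes, the sketch below traces the spatial size and parameter count through the stack by hand. It assumes the Keras defaults that the cell above relies on: 'same' padding keeps the size through each convolution, and each max pool uses a stride equal to its pool size with no padding. It also assumes a single `Dense(2)` output layer. The helper name `conv_params` is made up.

```python
def conv_params(kernel, in_ch, out_ch):
    """Weights plus biases for a square 2D convolution."""
    return kernel * kernel * in_ch * out_ch + out_ch

# (layer type, value): conv -> number of filters, pool -> pool size
layers = [("conv", 32), ("pool", 2), ("conv", 64), ("pool", 2),
          ("conv", 128), ("pool", 2), ("conv", 512), ("conv", 512),
          ("conv", 512), ("pool", 3)]

size, channels, total = 192, 3, 0
for kind, value in layers:
    if kind == "conv":
        total += conv_params(3, channels, value)  # 'same' padding: size unchanged
        channels = value
    else:
        size //= value                            # stride equals pool size
    print(kind, value, (size, size, channels))

flat = size * size * channels
total += flat * 2 + 2                             # final Dense(2) layer
print(flat, total)                                # 32768 5468738
```

The spatial size only shrinks at the pooling layers (192, 96, 48, 24, then 8), so the flattened vector that feeds the output layer has 32768 values.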

## Conclusion:
* Discussion of models: 
  
  __Model 1:__ 

  For the first model I wanted to try transfer learning from a pretrained VGG model. I used VGG16 as the base, then flattened its output and added a fully connected layer at the end that would give me 2 classes (corresponding to the healthy and cancer classes for my images). I originally toyed around with the number of epochs and the optimizer, but the accuracy was still only about 0.5-0.6, so I scrapped this idea in the interest of time and moved on to model 2.

  __Model 2:__ 

  For my second model I decided to try ResNet50 as my pretrained base. Again I added the dense layer at the end to map to my two classes. As a starting point I used a fully connected layer with relu, which seems to be pretty conventional, and softmax in the output layer. I ran this for 3 epochs and it performed fairly well: the accuracy score in the end was 0.90. Both this model and the VGG16 model took a couple of hours to run, though, so in model 3 I decided to try a simpler model to see if I could get decent accuracy scores faster.

  __Model 3:__ 

  Model 3 was not pretrained, but it was based on the AlexNet architecture, because I recall it won the ImageNet challenge in 2012. I used the basic structure first: two alternations between convolutional and max pooling layers, then 3 convolutions and one more max pool before flattening to a dense layer. This one only performed at about 0.55 accuracy. Originally, when reading about AlexNet, relu seemed to be able to speed up training, but since my first attempt didn't work well I tried changing the activation function to tanh. That didn't change much, though; the score was still in the 50s accuracy-wise. I went back to relu because using other activation functions didn't seem to make much of a difference. I stuck with 'same' padding throughout because I wanted each convolution to keep the spatial size of its input. I also tried adding an extra convolution-pooling pair before the block of convolutional layers, which ended up with a model that performed at 0.54 accuracy.

* __top model:__ 
  
  The top performing model by a long shot was the ResNet-based model. I did try one variation, model '2b', just increasing the epochs and changing the optimizer, but these changes alone didn't do much. I would have liked to add some fully connected and pooling layers at the end and toy around with the types of pooling or the types of activation function, but in the interest of time I only ran the one variation.
  
  Overall I don't think 90% accuracy is too bad; it would be useful but not sufficient in the medical field. I think it could be useful in that doctors would have a 'second opinion' from the model at very little cost to them (i.e. without having to get a second doctor who would need all of the specialized training and expertise). However, at a rate of 90% it wouldn't be enough to be confident in it as a diagnostic tool without a doctor's full consideration/analysis of the scans herself. I do think that with more testing I might be able to get a better accuracy score if I continue to work with this data. Additionally, while I know more data isn't always the solution, I do think I might be able to improve it with more data, because brain tumors can vary drastically in size, shape, location, and whether or not they are malignant. If I could try this again I would like to find time-series data if possible, or a dataset that has some kind of metric regarding symptoms or growth rate. I think those could be really important features aside from the scans themselves, and I wonder if using them in conjunction would help me get better classification.
  


  
I think it's worth mentioning a couple of setbacks I had in the process, the main one being the time constraint. As you probably know, the thesis deadline got moved *very* last-minute, so a lot of the time I had originally allocated to running my models was then split to finish my thesis by the new deadline. I also had some trouble running so many models, or running models that were too big, because my 'runtime' in Colab kept timing out and I'd have to start over. I think in the future I should get a computer with a more powerful GPU so I don't run into this issue as much. If I had been able to run more models, or more complicated models, I would have liked to test out a lot more ResNet architectures, because this model seemed the most promising. I could have spent more time tweaking the activation functions or optimizers, or changing the structure of the dense layers at the end before the output layer.