## Intro
The following model was built using Angie Ashraf's [approach](https://www.kaggle.com/angieashraf/89-brain-tumor-detection-using-dl).

### My Approach:
1. Preliminary Data Analysis: check for null values and class distribution, do basic plot analysis.
2. Data Preprocessing: split data into train and test sets, create a data generator function with image resizing and float32 and NumPy array conversions.
3. Building Model: use MobileNetV2 with a global average pooling layer, dropout of 0.2 and a top dense layer with sigmoid activation.
4. Optimization: use binary cross-entropy loss with the Adam optimizer.
5. Training Model: use early stopping.
6. Results: binary accuracy and loss graphs, classification report and confusion matrix.

### Libraries

In [None]:
import os #directory navigation
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt # data visualization
import seaborn as sns # data visualization
from skimage.io import imread # data processing; images
import keras # deep learning
from keras import Sequential # model building
from keras.applications import MobileNetV2 # pretrained model
from keras.layers import Dense # neural network layer
from keras.preprocessing import image # data processing; images
import tensorflow as tf # machine learning; deep learning
import tensorflow.keras.layers as layers # model building
import warnings # warning filters (not used below)

In [None]:
# Reproducibility
import random
def set_seed(seed=42):
    random.seed(seed)  # the batch generator below shuffles with random.shuffle
    np.random.seed(seed)
    tf.random.set_seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    os.environ['TF_DETERMINISTIC_OPS'] = '1'
set_seed()

## Preliminary Data Analysis

In [None]:
brain_df = pd.read_csv('../input/brain-tumor/Brain Tumor.csv', usecols=[0,1])
brain_df.head()

In [None]:
# Check for null values
brain_df.isnull().sum()

In [None]:
# Check for imbalance
brain_df['Class'].value_counts()

In [None]:
# Plot the value count
sns.countplot(x='Class', data=brain_df)

In [None]:
# Get image paths to build a dictionary for data generators
path_list = []
base_path = '../input/brain-tumor/Brain Tumor/Brain Tumor'
for entry in os.listdir(base_path):
    path_list.append(os.path.join(base_path, entry))

In [None]:
# Create path dictionary and map it to brain_df['Path']
paths_dict = {os.path.splitext(os.path.basename(x))[0]: x for x in path_list}
brain_df['Path'] = brain_df['Image'].map(paths_dict.get)
brain_df.head()

In [None]:
# Plot a few samples
for x in range(0,9):
    plt.subplot(3,3,x+1)
    # Remove x and y axis scales
    plt.xticks([])
    plt.yticks([])
    img = imread(brain_df['Path'][x])
    plt.imshow(img)
    plt.xlabel(brain_df['Class'][x])

In [None]:
# Split brain_df into train and test lists (roughly 80/20) for data generators
msk = np.random.rand(len(brain_df)) <= 0.8

train_df = brain_df[msk]
test_df = brain_df[~msk]
train_df.to_csv('brain_tumor_train.csv', index=False)
test_df.to_csv('brain_tumor_test.csv', index=False)
train_list = train_df.values.tolist()
test_list = test_df.values.tolist()
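
The random mask above gives an approximately 80/20 split but does not control the class ratio in each part. A minimal sketch of a stratified alternative, using NumPy only; `stratified_split` and the toy labels are hypothetical and are not part of the notebook's pipeline:

```python
import numpy as np

def stratified_split(labels, train_frac=0.8, seed=42):
    """Return (train_idx, test_idx) with roughly equal class proportions."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    train_idx, test_idx = [], []
    for cls in np.unique(labels):
        # Shuffle the positions of this class, then cut at train_frac
        idx = rng.permutation(np.flatnonzero(labels == cls))
        cut = int(round(train_frac * len(idx)))
        train_idx.extend(idx[:cut])
        test_idx.extend(idx[cut:])
    return np.sort(train_idx), np.sort(test_idx)

# Toy labels (hypothetical): 60 negatives and 40 positives
toy_labels = np.array([0] * 60 + [1] * 40)
train_idx, test_idx = stratified_split(toy_labels)
```

With real data, the returned index arrays could be passed to `brain_df.iloc[...]` to build the two frames.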

## Data Preprocessing
1. Data Cleaning. Data is clean; images are stored in one folder, with feature and label details in a CSV file.
2. Data Integration. Data comes from one source; no data integration techniques were applied.
3. Data Transformation. Images are resized to 224x224 (below, in the generator function) for the MobileNetV2 pretrained base, then converted to float32 NumPy arrays for the CNN. Data augmentation is a possible future addition.
4. Data Reduction. No data reduction techniques were used (apart from the image size reduction from 240x240 to 224x224). The number of channels cannot be reduced either, even though the images are nearly grayscale, because the pretrained model expects 3-channel input.
5. Data discretization. Not applicable to images; the inputs are already discrete.
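
To illustrate the transformation step, here is a small NumPy sketch of the two conversions involved: repeating a single grayscale channel into three, and scaling pixel values to [-1, 1], which is the range MobileNetV2's `preprocess_input` produces. The helper names and the toy image are hypothetical. In this notebook `cv2.imread` already returns three channels by default, so the first helper is only needed if images are loaded as single-channel arrays.

```python
import numpy as np

def to_three_channels(gray):
    """Stack a (H, W) grayscale image into (H, W, 3) by repeating the channel."""
    return np.repeat(gray[..., np.newaxis], 3, axis=-1)

def scale_to_unit_range(img):
    """Map pixel values in [0, 255] to float32 values in [-1, 1]."""
    return img.astype(np.float32) / 127.5 - 1.0

# Toy 2x2 image (hypothetical values)
toy_gray = np.array([[0, 255], [128, 64]], dtype=np.uint8)
toy_rgb = to_three_channels(toy_gray)
toy_scaled = scale_to_unit_range(toy_rgb)
```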

### Data Generator
Below is a generator function taken from [this article](https://medium.com/@anuj_shah/creating-custom-data-generator-for-training-deep-learning-models-part-2-be9ad08f3f0e). It is used because of CPU limitations: images are loaded batch by batch rather than all at once.

In [None]:
from random import shuffle
import cv2
def generator(samples, batch_size=32, shuffle_data=True, resize=224):
    """
    Yields the next batch of (images, labels).
    Each row of `samples` is [image_name, label, image_path, ...],
    as produced by `train_df.values.tolist()` above.
    """
    num_samples = len(samples)
    while True: # Loop forever so the generator never terminates
        if shuffle_data:
            shuffle(samples)

        # Get index to start each batch: [0, batch_size, 2*batch_size, ..., largest multiple of batch_size below num_samples]
        for offset in range(0, num_samples, batch_size):
            # Get the samples you'll use in this batch
            batch_samples = samples[offset:offset+batch_size]

            # Initialise X_train and y_train arrays for this batch
            X_train = []
            y_train = []

            # For each example
            for batch_sample in batch_samples:
                # Load image (X) and label (y)
                label = batch_sample[1]
                img_path = batch_sample[2]
                img = cv2.imread(img_path)

                # Resize to the model's input size and convert to float32
                img = cv2.resize(img, (resize, resize))
                img = img.astype(np.float32)
                # Scale pixels to [-1, 1] as MobileNetV2 expects, then add example to arrays
                X_train.append(keras.applications.mobilenet_v2.preprocess_input(img))
                y_train.append(label)

            # Make sure they're numpy arrays (as opposed to lists)
            X_train = np.array(X_train)
            y_train = np.array(y_train)

            # The generator-y part: yield the next training batch
            yield X_train, y_train

In [None]:
# Create test and train generators; the test generator keeps a fixed order
train_generator = generator(train_list)
test_generator = generator(test_list, shuffle_data=False)

In [None]:
# Resizing image (not used, since we are using generator with resize function)
# from PIL.Image import open
# brain_df['pixels']=brain_df['paths'].map(lambda x:np.asarray(open(x).resize((331,331))))

In [None]:
# CPU stats
# import os, psutil  

# def cpu_stats():
#     pid = os.getpid()
#     py = psutil.Process(pid)
#     memory_use = py.memory_info()[0] / 2. ** 30
#     return 'memory GB:' + str(np.round(memory_use, 2))

## Building Model

In [None]:
model = Sequential([
    # base
    MobileNetV2(input_shape=(224, 224, 3),include_top=False, weights='imagenet'),
    layers.GlobalAveragePooling2D(),
    layers.Dropout(0.2),
    layers.Dense(units=1, activation='sigmoid',name='preds'),   
])
model.layers[0].trainable= False
# show model summary
model.summary()
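
To make the head of the model concrete, here is a NumPy sketch of what happens after the frozen base at inference time: for a 224x224 input, MobileNetV2 without its top outputs a 7x7x1280 feature map, global average pooling reduces it to 1280 values, and the dense layer with sigmoid turns those into one probability. The random features and weights are placeholders, not trained values; dropout is omitted because it is inactive at inference.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder base output for a batch of 2 images: (batch, 7, 7, 1280)
features = rng.normal(size=(2, 7, 7, 1280))

# Global average pooling: mean over the two spatial axes -> (batch, 1280)
pooled = features.mean(axis=(1, 2))

# Dense(1, sigmoid): 1280 weights plus 1 bias (random placeholders here)
weights = rng.normal(scale=0.01, size=(1280, 1))
bias = np.zeros(1)
logits = pooled @ weights + bias
probs = 1.0 / (1.0 + np.exp(-logits))

# Trainable parameters in the head when the base is frozen
head_params = weights.size + bias.size
```

Since the base is frozen, only these 1,281 head parameters are updated during training.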

## Optimization

In [None]:
model.compile(
    # Set the loss as binary_crossentropy
    loss='binary_crossentropy',
    # Set the optimizer to Adam
    optimizer=keras.optimizers.Adam(epsilon=0.01),
    # Set the metric as binary accuracy
    metrics=['binary_accuracy']
)
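
For reference, a NumPy sketch of the loss and metric chosen above. It follows the textbook definition of binary cross-entropy and a 0.5 threshold for binary accuracy; the function names and the clipping constant `eps` are my own choices for illustration, not Keras internals.

```python
import numpy as np

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)], with p clipped away from 0 and 1."""
    y_true = np.asarray(y_true, dtype=np.float64)
    y_prob = np.clip(np.asarray(y_prob, dtype=np.float64), eps, 1 - eps)
    losses = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    return float(np.mean(losses))

def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of samples whose thresholded probability matches the label."""
    predicted = np.asarray(y_prob) > threshold
    return float(np.mean(predicted == (np.asarray(y_true) == 1)))

# Toy labels and probabilities (hypothetical)
toy_loss = binary_cross_entropy([1, 0], [0.5, 0.5])
toy_acc = binary_accuracy([1, 0, 1, 0], [0.9, 0.2, 0.4, 0.6])
```

A prediction of 0.5 for every sample gives a loss of ln 2 (about 0.693), which is a useful baseline when reading the loss curves.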

In [None]:
# Measure memory consumption by file

# import sys

# # These are the usual ipython objects, including this one you are creating
# ipython_vars = ['In', 'Out', 'exit', 'quit', 'get_ipython', 'ipython_vars']

# # Get a sorted list of the objects and their sizes
# sorted([(x, sys.getsizeof(globals().get(x))/1024**3) for x in dir() if not x.startswith('_') and x not in sys.modules and x not in ipython_vars], 
#        key=lambda x: x[1], reverse=True)


## Training Model

In [None]:
# Set parameters for model training
batch_size = 32
train_size = len(train_list)
test_size = len(test_list)
steps_per_epoch = train_size//batch_size
validation_steps = test_size//batch_size
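
Note that floor division counts only full batches, so the last partial batch of each pass is not covered by `steps_per_epoch` or `validation_steps`. A small sketch of the arithmetic; `steps_for` and the size of 100 are hypothetical, not the notebook's actual numbers:

```python
import math

def steps_for(num_samples, batch_size, drop_last=True):
    """Number of batches per pass; drop_last=True mirrors the floor division above."""
    if drop_last:
        return num_samples // batch_size
    return math.ceil(num_samples / batch_size)

toy_test_size = 100  # hypothetical size
full_batches = steps_for(toy_test_size, 32)                  # full batches only
all_batches = steps_for(toy_test_size, 32, drop_last=False)  # includes the partial batch
skipped = toy_test_size - full_batches * 32                  # samples left out per pass
```

This matters later when comparing predictions to labels: only `validation_steps * batch_size` samples are predicted.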

In [None]:
# Use early stopping to avoid wasting resources
early_stopping = keras.callbacks.EarlyStopping(
    patience=10,
    min_delta=0.001,
    restore_best_weights=True,
)
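
The idea behind `patience` and `min_delta` can be sketched in plain Python. This is a simplified model of the stopping rule, assuming validation loss is the monitored quantity; it is not the Keras implementation, and the loss values are made up for illustration.

```python
def early_stopping_epoch(val_losses, patience=10, min_delta=0.001):
    """Return (stop_epoch, best_epoch), both 0-based; stop_epoch is None if never triggered."""
    best = float("inf")
    best_epoch = None
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            # Improvement larger than min_delta: remember it and reset the counter
            best, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return None, best_epoch

# Toy validation losses (hypothetical); 0.3495 improves on 0.35 by less than min_delta
toy_losses = [0.50, 0.40, 0.35, 0.36, 0.3495, 0.37, 0.38]
stop_epoch, best_epoch = early_stopping_epoch(toy_losses, patience=3, min_delta=0.001)
```

With `restore_best_weights=True`, the model would be rolled back to the weights from `best_epoch` once training stops.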

In [None]:
# Train the model (model.fit accepts generators; fit_generator is deprecated in TF 2.x)
history = model.fit(
    train_generator,
    steps_per_epoch = steps_per_epoch,
    epochs=110,
    validation_data=test_generator,
    validation_steps = validation_steps,
    verbose=1,
    callbacks = [early_stopping]
)
model.save("model_brain_adam.h5")
print("Saved model to disk")

## Results

In [None]:
# Plot loss and binary accuracy curves
history_df = pd.DataFrame(history.history)
# Start the plot at epoch 5
history_df.loc[5:, ['loss', 'val_loss']].plot()
history_df.loc[5:, ['binary_accuracy', 'val_binary_accuracy']].plot()

print(("Best Validation Loss: {:0.4f}" +\
      "\nBest Validation Accuracy: {:0.4f}")\
      .format(history_df['val_loss'].min(), 
              history_df['val_binary_accuracy'].max()))

In [None]:
# Evaluate samples using the model I've pretrained, saved, and loaded back
pretrained_cnn = keras.models.load_model('../input/h5files/model_brain_adam.h5')
eval_score = pretrained_cnn.evaluate(test_generator, steps = validation_steps)
# print loss score
print('Eval loss:',eval_score[0])
# print accuracy score
print('Eval accuracy:',eval_score[1])

In [None]:
# Output classification report and confusion matrix
from sklearn.metrics import confusion_matrix , classification_report
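# Get predicted and true classes for our report and matrix.
# Request a fresh, unshuffled pass over the test samples; only validation_steps
# full batches are predicted, so the labels are sliced to the same length.
eval_generator = generator(test_list, batch_size=batch_size, shuffle_data=False)
y_pred = np.rint(pretrained_cnn.predict(eval_generator, steps = validation_steps)).astype(int).ravel()
y_test = [i[1] for i in test_list[:validation_steps * batch_size]]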
target_classes = ['No Tumor','Tumor']

classification_report(y_test , y_pred , output_dict = True
                      , target_names=target_classes)

In [None]:
confusion_matrix(y_test , y_pred ) 

## Conclusion
The model misclassified 43 images out of 693, with a sensitivity of 94.006% and a specificity of 94.272%.
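
Sensitivity and specificity follow directly from the confusion matrix counts: sensitivity is TP / (TP + FN) and specificity is TN / (TN + FP). A minimal NumPy sketch, assuming label 1 means tumor; the helper name and the toy labels are hypothetical and unrelated to the results above.

```python
import numpy as np

def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels where 1 is the positive class."""
    y_true = np.asarray(y_true).ravel()
    y_pred = np.asarray(y_pred).ravel()
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    return tp / (tp + fn), tn / (tn + fp)

# Toy labels (hypothetical): 4 positives, 6 negatives
toy_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
toy_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
toy_sens, toy_spec = sensitivity_specificity(toy_true, toy_pred)
```

The same function can be applied to `y_test` and `y_pred` from the cells above.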