# **Introduction**
* This kernel is my project for the course - Machine Learning => CSE4020
* In this kernel I will first go through the dataset, perform the exploratory data analysis, and then try 3 convolutional neural network approaches: a simple model created by me, the NASNET Mobile model, whose architecture is designed to scale from smaller datasets to larger ones, and the DenseNet 121 model provided by Fast.ai.

# **Contents**
* Importing modules and an explanation of their use
* Explanation of the competition and the description of the dataset
* Loading the data and the exploratory data analysis
* Creating, training and validating the model
* Creating, training and validating the NASNET Mobile model
* Creating, training and validating the DenseNet 121 model
* Choosing the model with the best results and creating a submission

# **Importing the modules and an explanation of their use**
We will use multiple libraries and modules in this kernel. Between them they cover a variety of jobs such as reading files, storing the data efficiently, building deep learning models, image processing and plotting! These modules are:
* Numpy - The math module that makes matrix multiplication and other complicated mathematical operations easier.
* Pandas - The module that provides us with efficient and powerful data structures for storing and using our data.
* Matplotlib - A powerful module that will help us plot and visualize our data.
* Opencv - Imported as cv2, it is used for image processing; in this case, we will use it for loading images.
* Keras - A widely used high-level API for Deep Learning. It provides a high degree of functionality and makes using Tensorflow easier.
* Glob - A module that helps us find files whose names match a pattern.
* TQDM - A useful library which provides us with a progress bar while loading data and training.
* Torchvision - A module that provides popular datasets and models for use with PyTorch.
* Sklearn - A very powerful module that provides us with tools for Machine Learning and Data Analysis in Python.
* Imgaug - A machine learning oriented package that helps us create a larger set of altered images from a smaller set of images.


In [None]:
# Modules required for simple model
from glob import glob 
import numpy as np
import pandas as pd
import cv2,os
import keras
from keras.utils import np_utils
from keras.preprocessing.image import ImageDataGenerator
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.layers import Flatten
from keras.layers import BatchNormalization
from keras.layers import Activation
from keras.layers import Conv2D
from keras.layers import MaxPool2D
from tqdm import tqdm_notebook,trange
import matplotlib.pyplot as plt
import gc
scale = 70
seed = 7
# Modules required for NASNET Mobile Model
from random import shuffle
from sklearn.model_selection import train_test_split
from imgaug import augmenters as iaa
import imgaug as ia
# Modules required for DenseNet 121 Model
#from fastai.vision import *
#import torchvision

# **Explaining the competition and Dataset Description**
* The competition is a binary classification problem for images, which means that in this kernel we will be dividing the images into 2 separate classes.
* The dataset consists of microscopic images of lymph node tissue with a resolution of 96 pixels by 96 pixels. We have to classify each image based on whether the 32 pixel by 32 pixel centre region shows metastatic cancer tissue.
* The dataset has about 220000 images for training our models and about 57000 images in the testing set.
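
To make the labelling rule concrete, here is a minimal sketch of how the 32 by 32 centre region could be cut out of a 96 by 96 image with NumPy slicing. The function name `center_crop` and the random stand-in image are assumptions made for illustration; they are not part of the competition code.

```python
import numpy as np

def center_crop(image, size=32):
    """Return the central size x size region of an H x W x C image."""
    h, w = image.shape[:2]
    top = (h - size) // 2
    left = (w - size) // 2
    return image[top:top + size, left:left + size]

# Stand-in for one 96x96 three-channel patch (random values, not real data)
dummy = np.random.randint(0, 256, size=(96, 96, 3), dtype=np.uint8)
center = center_crop(dummy)
print(center.shape)
```

The models in this kernel are still trained on the full 96 by 96 image; the crop only shows which pixels decide the label.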

# **Loading the data and exploratory data analysis**
* We will first create a DataFrame which will store the paths of the files in the training folder and then read the labels of the images from the given csv file.
* We will then load the images in the uint8 format. This keeps the memory footprint of the images small, which lets the data fit in the 14GB of memory provided to us.
* We will then perform the Exploratory Data Analysis, where we will "SEE" our data for the first time. We will also check for dataset imbalance and look for image features like RGB channels, HSV channels, and other features.
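
As a rough back-of-the-envelope check of the memory argument, the sketch below estimates the size of the full image array for two data types. The helper `array_size_gb` is a hypothetical name, and the image count is the approximate figure quoted above.

```python
def array_size_gb(n_images, height=96, width=96, channels=3, bytes_per_value=1):
    """Size in GB of an image array with the given shape and bytes per value."""
    return n_images * height * width * channels * bytes_per_value / 1024**3

n_images = 220_000  # approximate training set size quoted above
uint8_gb = array_size_gb(n_images, bytes_per_value=1)    # 1 byte per value
float32_gb = array_size_gb(n_images, bytes_per_value=4)  # 4 bytes per value
print(f"uint8: {uint8_gb:.1f} GB, float32: {float32_gb:.1f} GB")
```

Under these assumptions the uint8 array needs about 5.7 GB, while the same data stored as float32 would need about 22.7 GB and would not fit in 14GB.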

In [None]:
#set paths to training and test data
path = "../input/"
train_path = path + 'train/'
test_path = path + 'test/'
df = pd.DataFrame({'path': glob(os.path.join(train_path,'*.tif'))})
df['id'] = df.path.map(lambda x: x.split('/')[3].split(".")[0])
labels = pd.read_csv(path+"train_labels.csv")
# merge labels and filepaths
df = df.merge(labels, on = "id") 
# Loading the images
def load_data(N,df):
    # allocate a numpy array for the images - 3 channels
    X = np.zeros([N,96,96,3],dtype=np.uint8) 
    #convert the labels to an array
    y = df['label'].values[0:N]  # as_matrix() was removed in pandas 1.0
    #read images one by one; tqdm_notebook displays a progress bar
    for i, row in tqdm_notebook(df.iterrows(), total=N):
        if i == N:
            break
        X[i] = cv2.imread(row['path'])
    return X,y
N = 10000
X,y = load_data(N=N,df=df)

In [None]:
# Displaying the loaded images
fig = plt.figure(figsize=(10, 4), dpi=150)
np.random.seed(100)
for plotNr,idx in enumerate(np.random.randint(0,N,8)):
    ax = fig.add_subplot(2, 8//2, plotNr+1, xticks=[], yticks=[])
    plt.imshow(X[idx][..., ::-1]) # cv2.imread returns BGR; reverse the channels to RGB for display
    ax.set_title('Label: ' + str(y[idx]))

So we can see some sample images from the dataset. We can also see how nearly impossible it is for computer scientists like us to discern the cancer-containing images!

## Data Distribution
Let us look at the class distribution in the dataset.

In [None]:
fig = plt.figure(figsize=(4, 2),dpi=150)
plt.bar([1,0], [(y==0).sum(), (y==1).sum()]); #plot a bar chart of the label frequency
plt.xticks([1,0],["Negative (N={})".format((y==0).sum()),"Positive (N={})".format((y==1).sum())]);
plt.ylabel("# of samples")

From the above graph, we can see that there is a roughly 40/60 split between the positive and negative classes. This means that even a baseline classifier that assigns every image to the more populated class would reach an accuracy of about 60%!
We may have to undersample the negative class or generate more positive samples so that we can avoid a bias and improve classification stability.
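
The baseline accuracy and the undersampling idea can be sketched as follows. This is only an illustration on toy labels with the same 60/40 split; the function name `undersample_indices` is an assumption, and the kernel itself does not apply this step.

```python
import numpy as np

def undersample_indices(labels, seed=0):
    """Return shuffled indices that keep an equal number of samples per class."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    pos_idx = np.flatnonzero(labels == 1)
    neg_idx = np.flatnonzero(labels == 0)
    n = min(len(pos_idx), len(neg_idx))  # size of the smaller class
    keep = np.concatenate([
        rng.choice(pos_idx, size=n, replace=False),
        rng.choice(neg_idx, size=n, replace=False),
    ])
    rng.shuffle(keep)
    return keep

# Toy labels with a 60/40 negative/positive split (not the real data)
toy_labels = np.array([0] * 60 + [1] * 40)

# Accuracy of always predicting the majority class
baseline_accuracy = max(np.mean(toy_labels == 0), np.mean(toy_labels == 1))

keep = undersample_indices(toy_labels)
balanced = toy_labels[keep]
print(baseline_accuracy, balanced.mean())
```

The trade-off is that undersampling throws away part of the majority class, so generating more positive samples through augmentation may be preferable when data is scarce.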

## Let us look at each class individually!
We will now compare the pixel distribution in each of the BGR channels in both positive and negative samples to see if there are any features that can be engineered!

In [None]:
#Separating the classes
positive_samples = X[y == 1]
negative_samples = X[y == 0]
#Binning each pixel value for the histogram
nr_of_bins = 256 
fig,axs = plt.subplots(4,2,sharey=True,figsize=(8,8),dpi=150)
#Individual colour channels (cv2.imread loads images in BGR order)
axs[0,0].hist(positive_samples[:,:,:,0].flatten(),bins=nr_of_bins,density=True)
axs[0,1].hist(negative_samples[:,:,:,0].flatten(),bins=nr_of_bins,density=True)
axs[1,0].hist(positive_samples[:,:,:,1].flatten(),bins=nr_of_bins,density=True)
axs[1,1].hist(negative_samples[:,:,:,1].flatten(),bins=nr_of_bins,density=True)
axs[2,0].hist(positive_samples[:,:,:,2].flatten(),bins=nr_of_bins,density=True)
axs[2,1].hist(negative_samples[:,:,:,2].flatten(),bins=nr_of_bins,density=True)
#All channels
axs[3,0].hist(positive_samples.flatten(),bins=nr_of_bins,density=True)
axs[3,1].hist(negative_samples.flatten(),bins=nr_of_bins,density=True)
# Labelling the Plots
axs[0,0].set_title("Positive samples (N =" + str(positive_samples.shape[0]) + ")");
axs[0,1].set_title("Negative samples (N =" + str(negative_samples.shape[0]) + ")");
axs[0,1].set_ylabel("Blue",rotation='horizontal',labelpad=35,fontsize=12) # cv2.imread loads BGR: channel 0 is blue
axs[1,1].set_ylabel("Green",rotation='horizontal',labelpad=35,fontsize=12)
axs[2,1].set_ylabel("Red",rotation='horizontal',labelpad=35,fontsize=12) # channel 2 is red
axs[3,1].set_ylabel("RGB",rotation='horizontal',labelpad=35,fontsize=12)
for i in range(4):
    axs[i,0].set_ylabel("Relative frequency")
axs[3,0].set_xlabel("Pixel value")
axs[3,1].set_xlabel("Pixel value")
fig.tight_layout()

### Conclusions:
We can draw some interesting conclusions from the graphs above!
* Negative samples have higher peaks on the right-hand side of the graph, showing that brightness is higher in negative samples.
* Positive samples have darker pixels in the green channel.


### Mean Brightness Distribution
For each image, we average its pixel values to get a single brightness value, and then plot the distribution of these values for each class. This can help us generate another feature!

In [None]:
# We use 64 bins to get smooth graphs
nr_of_bins = 64 
fig,axs = plt.subplots(1,2,sharey=True, sharex = True, figsize=(8,2),dpi=150)
axs[0].hist(np.mean(positive_samples,axis=(1,2,3)),bins=nr_of_bins,density=True);
axs[1].hist(np.mean(negative_samples,axis=(1,2,3)),bins=nr_of_bins,density=True);
axs[0].set_title("Mean brightness, positive samples");
axs[1].set_title("Mean brightness, negative samples");
axs[0].set_xlabel("Image mean brightness")
axs[1].set_xlabel("Image mean brightness")
axs[0].set_ylabel("Relative frequency")
axs[1].set_ylabel("Relative frequency");

### Conclusions:
* The mean brightness distribution shows that the positive samples follow a roughly normal distribution, with image brightness in the range of 100-240, while the negative samples have a bimodal distribution, with values ranging from 50-255.


## Creating the model
Let us create a simple model for this problem - with 3 convolutional blocks. The basic architecture consists of :
* A convolutional layer followed by 
* A batch normalization layer followed by
* An activation layer
* We repeat the above block and follow it up with a maxpooling layer and a dropout layer.

This is a very standard and basic architecture which, repeated three times, can already give decent results on this kind of problem!
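
To see how the image shrinks as it passes through three such blocks, here is a small sketch of the shape arithmetic. It assumes 3x3 convolutions without padding (the Keras default, `padding='valid'`), 2x2 max pooling and 128 filters in the last block, as used in the model defined further down; the helper names `conv_out` and `pool_out` are made up for this example.

```python
def conv_out(size, kernel=3):
    """Spatial size after a 'valid' (no padding) convolution with stride 1."""
    return size - kernel + 1

def pool_out(size, pool=2):
    """Spatial size after non-overlapping max pooling (floor division)."""
    return size // pool

size = 96
sizes = [size]
for _ in range(3):                   # three convolutional blocks
    size = conv_out(conv_out(size))  # two 3x3 convolutions per block
    size = pool_out(size)            # one 2x2 max pooling per block
    sizes.append(size)

flattened = size * size * 128  # 128 filters in the last block
print(sizes, flattened)
```

Under these assumptions the spatial size goes from 96 to 46, 21 and finally 8, so the flatten layer feeds 8192 values into the dense layer.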

We will also divide the data into 80% for training and 20% for validation, and call the garbage collector to free the RAM held by arrays we no longer need.

In [None]:
# Getting the number of images in the dataset
N = df["path"].size
X,y = load_data(N=N,df=df)

# Collecting garbage
positive_samples = None
negative_samples = None
gc.collect();

# Setting up the training/validation ratio
training_portion = 0.8 
split_idx = int(np.round(training_portion * y.shape[0]))

#Setting seeds to ensure we can repeat this process 
np.random.seed(42) 
idx = np.arange(y.shape[0])
np.random.shuffle(idx)
X = X[idx]
y = y[idx]

In [None]:
# Network Parameters
kernel_size = (3,3)
pool_size= (2,2)
first_filters = 32
second_filters = 64
third_filters = 128

# Setting up dropout parameters for regularization
dropout_conv = 0.3
dropout_dense = 0.5

# Creating model
model = Sequential()

# Convolutional Block 1
model.add(Conv2D(first_filters, kernel_size, input_shape = (96, 96, 3)))
model.add(BatchNormalization())
model.add(Activation("relu"))
model.add(Conv2D(first_filters, kernel_size, use_bias=False))
model.add(BatchNormalization())
model.add(Activation("relu"))
model.add(MaxPool2D(pool_size = pool_size)) 
model.add(Dropout(dropout_conv))

# Convolutional Block 2
model.add(Conv2D(second_filters, kernel_size, use_bias=False))
model.add(BatchNormalization())
model.add(Activation("relu"))
model.add(Conv2D(second_filters, kernel_size, use_bias=False))
model.add(BatchNormalization())
model.add(Activation("relu"))
model.add(MaxPool2D(pool_size = pool_size))
model.add(Dropout(dropout_conv))

# Convolutional Block 3
model.add(Conv2D(third_filters, kernel_size, use_bias=False))
model.add(BatchNormalization())
model.add(Activation("relu"))
model.add(Conv2D(third_filters, kernel_size, use_bias=False))
model.add(BatchNormalization())
model.add(Activation("relu"))
model.add(MaxPool2D(pool_size = pool_size))
model.add(Dropout(dropout_conv))

# Dense Layer 
model.add(Flatten())
model.add(Dense(256, activation='relu'))
model.add(BatchNormalization())
model.add(Dropout(dropout_dense))

# Sigmoid output - gives the probability (between 0 and 1) that the image is positive
model.add(Dense(1, activation = "sigmoid"))

model.summary()

batch_size = 50

model.compile(loss=keras.losses.binary_crossentropy,
              optimizer=keras.optimizers.Adam(0.001), 
              metrics=['accuracy'])

### Training and Validation of the model:
We will now train the model for three epochs (should take ~20mins). That means the model will have performed a forward and backward pass for each image in the training set three times (apart from the few leftover samples that do not fill a complete batch).
To do so, we will split the training data into batches and feed one batch after another into the network. The batch size is a critical parameter for training a neural network.
Keras can do the splitting automatically for you, but I thought doing it this way makes it more transparent what is happening.
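
As a side note on the batching itself, the sketch below shows one way to generate batch boundaries that also cover a final, smaller batch, so that no samples are discarded. The generator name `batch_slices` and the toy sizes are assumptions for illustration; the training loop in this kernel uses the simpler floor-based approach instead.

```python
def batch_slices(n_samples, batch_size):
    """Yield (start, stop) index pairs covering all samples, including a final partial batch."""
    for start in range(0, n_samples, batch_size):
        yield start, min(start + batch_size, n_samples)

# Toy example: 120 samples in batches of 50
slices = list(batch_slices(n_samples=120, batch_size=50))
print(slices)
```

Each pair can be used directly as `X[start:stop]` and `y[start:stop]`.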

In [None]:
epochs = 3 #how many epochs we want to perform
for epoch in range(epochs):
    #compute how many batches we'll need
    iterations = np.floor(split_idx / batch_size).astype(int) #the floor makes us discard a few samples here, I got lazy...
    loss,acc = 0,0 #we will compute running loss and accuracy
    with trange(iterations) as t: #display a progress bar
        for i in t:
            start_idx = i * batch_size #starting index of the current batch
            x_batch = X[start_idx:start_idx+batch_size] #the current batch
            y_batch = y[start_idx:start_idx+batch_size] #the labels for the current batch

            metrics = model.train_on_batch(x_batch, y_batch) #train the model on a batch

            loss = loss + metrics[0] #compute running loss
            acc = acc + metrics[1] #compute running accuracy
            t.set_description('Running training epoch ' + str(epoch)) #set progressbar title
            t.set_postfix(loss="%.2f" % round(loss / (i+1),2),acc="%.2f" % round(acc / (i+1),2)) #display metrics

Now, to verify that our model also works with data it hasn't seen yet, we will perform a validation epoch, i.e., check the accuracy on the validation set without further training the network.

In [None]:
#compute how many batches we'll need
iterations = np.floor((y.shape[0]-split_idx) / batch_size).astype(int) #as above, not perfect
loss,acc = 0,0 #we will compute running loss and accuracy
with trange(iterations) as t: #display a progress bar
    for i in t:
        start_idx = split_idx + i * batch_size #starting index of the current batch, offset past the training samples
        x_batch = X[start_idx:start_idx+batch_size] #the current batch
        y_batch = y[start_idx:start_idx+batch_size] #the labels for the current batch
        
        metrics = model.test_on_batch(x_batch, y_batch) #compute metric results for this batch using the model
        
        loss = loss + metrics[0] #compute running loss
        acc = acc + metrics[1] #compute running accuracy
        t.set_description('Running validation') #set progressbar title
        t.set_postfix(loss="%.2f" % round(loss / (i+1),2),acc="%.2f" % round(acc / (i+1),2))
        
print("Validation loss:",loss / iterations)
print("Validation accuracy:",acc / iterations)

## Create a submission
Now that we have trained a model, we can create a submission by predicting the labels of the test data and see where we stand on the leaderboard!
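
The submission needs the image id, which is the file name without its extension. Below is a minimal sketch of extracting it with `os.path`, so that it does not depend on how many folders the path contains. The function name `image_id_from_path` and the example file names are hypothetical.

```python
import os

def image_id_from_path(file_path):
    """Return the file name without its directory and extension."""
    base = os.path.basename(file_path)
    return os.path.splitext(base)[0]

example = "../input/test/abc123.tif"  # hypothetical file name
print(image_id_from_path(example))
```

For paths of the form `../input/test/<id>.tif` this gives the same result as splitting on `/` and taking the fourth element, but it keeps working if the data lives in a deeper folder.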

In [None]:
X = None
y = None
gc.collect();
base_test_dir = path + 'test/' #specify test data folder
test_files = glob(os.path.join(base_test_dir,'*.tif')) #find the test file names
submission = pd.DataFrame() #create a dataframe to hold results
file_batch = 5000 #we will predict 5000 images at a time
max_idx = len(test_files) #last index to use
for idx in range(0, max_idx, file_batch): #iterate over test image batches
    print("Indexes: %i - %i"%(idx, idx+file_batch))
    test_df = pd.DataFrame({'path': test_files[idx:idx+file_batch]}) #add the filenames to the dataframe
    test_df['id'] = test_df.path.map(lambda x: x.split('/')[3].split(".")[0]) #add the ids to the dataframe
    test_df['image'] = test_df['path'].map(cv2.imread) #read the batch
    K_test = np.stack(test_df["image"].values) #convert to numpy array
    predictions = model.predict(K_test,verbose = 1) #predict the labels for the test data
    test_df['label'] = predictions #store them in the dataframe
    submission = pd.concat([submission, test_df[["id", "label"]]])
print(submission.head())
submission.to_csv("submission.csv", index = False, header = True) #create the submission file