# Metastatic tissue identifier with Convolutional Neural Networks:

Welcome, folks! In this project we will implement a CNN from scratch to classify images of tissue as cancerous or non-cancerous. For this we will use the **Histopathologic Cancer Detection** competition dataset, which contains over 220 thousand labeled images for training and 57458 unseen images to classify and submit.

Having said that, let's get started!

First, we import the main libraries needed for the exploratory data analysis (EDA):

In [None]:
%pip install visualkeras

In [None]:
import os
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import cv2
import visualkeras
import tensorflow as tf

## Exploratory Data Analysis:

Now we will list the files available in the main directory to see how the images are organized.

In [None]:
os.listdir('../input/histopathologic-cancer-detection')

Let's print the file names of the first five images in the training dataset:

In [None]:
os.listdir('../input/histopathologic-cancer-detection/train')[:5]

To know the exact number of images in each folder, we print the length of the list of file names, as shown below:

In [None]:
len(os.listdir('../input/histopathologic-cancer-detection/train'))

In [None]:
len(os.listdir('../input/histopathologic-cancer-detection/test'))

Perfect. The dataset also provides a CSV file with two columns, the image ids and their respective categories. Let's load it and print the first rows:

In [None]:
df=pd.read_csv('../input/histopathologic-cancer-detection/train_labels.csv')
df.head()

The next step is to look at the distribution of the categories, as counts and as proportions, to see whether the data is balanced.

In [None]:
df.label.value_counts()

In [None]:
sns.set(style='whitegrid')
# Map the numeric labels to readable names and count the images in each category
label_counts = df['label'].map({0: 'Non-cancerous tissue', 1: 'Cancerous tissue'}).value_counts()
# Plot the Series directly so the code does not depend on the column names
# produced by reset_index(), which differ between pandas versions
label_counts.plot(kind='pie', title='Category Images', autopct='%1.1f%%',
                  shadow=False, legend=False, fontsize=14, figsize=(18,8))

Evidently the dataset is unbalanced. To avoid a model biased towards the majority class, I prefer to undersample both classes down to the size of the smaller one, which has 89117 images. Since my main purpose is to help you understand the process, we will round this down to 89000 images per class so the numbers are easier to follow in the splitting step later.
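
To make the idea of undersampling concrete, here is a minimal sketch in plain Python. The helper `undersample` and the toy id lists are hypothetical and only illustrate the technique; the notebook itself does the real sampling with pandas further down.

```python
import random

def undersample(ids_by_label, per_class, seed=42):
    """Randomly keep `per_class` ids from every class (hypothetical helper)."""
    rng = random.Random(seed)
    return {label: rng.sample(ids, per_class) for label, ids in ids_by_label.items()}

# Toy stand-in for the real id lists: 13 negatives and 9 positives
toy_ids = {0: [f"neg_{i}" for i in range(13)],
           1: [f"pos_{i}" for i in range(9)]}

balanced = undersample(toy_ids, per_class=8)
print({label: len(ids) for label, ids in balanced.items()})
```

Both classes end up with the same number of ids, drawn without replacement, which is the same effect we want with 89000 images per class.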

In the following lines I will display the first image in the training folder and print its category, looking it up by its id in the CSV file:

In [None]:
from PIL import Image

In [None]:
im = Image.open('../input/histopathologic-cancer-detection/train/'+os.listdir('../input/histopathologic-cancer-detection/train')[0])
plt.imshow(im)
plt.axis('off')
print(df[df.id==os.listdir('../input/histopathologic-cancer-detection/train')[0].split('.')[0]].label)

Above we see that the first training image has label 1, which corresponds to metastatic tissue. Let's look at more examples of both categories by running the following cells. If you want to know more about the functions used to display images by category, I encourage you to take a look at my other notebook, in which I explain them in more detail:

https://www.kaggle.com/georgesaavedra/best-intel-image-classifiers

In [None]:
Labels = df.label.values
Labels

In [None]:
def get_indexes(label,list_n):
  for x in range(len(Labels)):
    if Labels[x]==label:
      list_n.append(x)
  return list_n

In [None]:
no_cancer=[]
no_cancer=get_indexes(0,no_cancer)
cancer=[]
cancer=get_indexes(1,cancer)

In [None]:
def get_classlabel(class_code):
    labels = {0:'Non-cancerous', 1:'Cancerous'}
    
    return labels[class_code]

In [None]:
import random

f,ax = plt.subplots(2,4, figsize=(12,12))
types_img=[no_cancer, cancer]

# Row 0 shows non-cancerous samples, row 1 shows cancerous samples
for z in range(2):
    for j in range(4):
        rnd_number=random.choice(types_img[z])
        ax[z,j].imshow(Image.open('../input/histopathologic-cancer-detection/train/'+df.iloc[rnd_number,0]+'.tif'))
        ax[z,j].set_title(get_classlabel(z))
        ax[z,j].axis('off')

# Adjust the spacing once, after all subplots are drawn
plt.tight_layout()

## Data Preparation:

We will remove the following two images, since one of them caused an error during training and the other does not represent its category:

In [None]:
f, (ax1, ax2) = plt.subplots(1,2,figsize=(17,17))

ax1.imshow(Image.open('../input/histopathologic-cancer-detection/train/'+'dd6dfed324f9fcb6f93f46f32fc800f2ec196be2.tif'))
ax1.axis('off')
ax1.set_title('Error Image')

ax2.imshow(Image.open('../input/histopathologic-cancer-detection/train/'+'9369c7278ec8bcc6c880d99194de09fc2bd4efbe.tif'))
ax2.axis('off')
ax2.set_title('Black Image')

In [None]:
df.shape

In [None]:
# removing this image because it caused a training error previously
df = df[df['id'] != 'dd6dfed324f9fcb6f93f46f32fc800f2ec196be2']

# removing this image because it's black
df = df[df['id'] != '9369c7278ec8bcc6c880d99194de09fc2bd4efbe']

print(df.shape)

Once both images are removed, we create one dataframe per category, each containing 89000 randomly sampled images, as we said earlier:

In [None]:
SAMPLE_SIZE=89000

# take a random sample of SAMPLE_SIZE images from class 0
df_0 = df[df['label'] == 0].sample(SAMPLE_SIZE, random_state = 42)
# take a random sample of SAMPLE_SIZE images from class 1
df_1 = df[df['label'] == 1].sample(SAMPLE_SIZE, random_state = 42)

Once both dataframes are created, we concatenate them into a single dataframe containing 178000 images. Finally, we shuffle it, because after the concatenation all class 0 rows come before the class 1 rows:

In [None]:
from sklearn.utils import shuffle

# concat the dataframes
df_data = pd.concat([df_0, df_1], axis=0).reset_index(drop=True)
# shuffle
df_data = shuffle(df_data)

df_data['label'].value_counts()

In [None]:
df_data.head(10)

Now that we have a balanced dataset, we can split it into training (90%) and validation (10%) sets, as in a typical ML workflow.

In [None]:
# train_test_split
from sklearn.model_selection import train_test_split

# stratify=y keeps the same class proportions in the training and validation sets.
y = df_data['label']

df_train, df_val = train_test_split(df_data, test_size=0.10, random_state=42, stratify=y)

print(df_train.shape)
print(df_val.shape)

We can see that the sizes of both sets make sense. Let's check that the class balance was kept:

In [None]:
df_train['label'].value_counts()

In [None]:
df_val['label'].value_counts()
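
The numbers above follow from simple arithmetic: 10% of 178000 is 17800 validation images (8900 per class), which leaves 160200 training images (80100 per class). As a rough illustration of what a stratified split does, here is a simplified sketch in plain Python. The function `stratified_split` is a hypothetical helper, not scikit-learn's implementation, and the toy labels are made up.

```python
import random
from collections import defaultdict

def stratified_split(labels, val_fraction, seed=42):
    """Return (train_idx, val_idx), holding out `val_fraction` of each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, val_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_val = round(len(indices) * val_fraction)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return train_idx, val_idx

# Toy balanced dataset: 50 images per class
toy_labels = [0] * 50 + [1] * 50
train_idx, val_idx = stratified_split(toy_labels, val_fraction=0.10)
val_counts = {c: sum(toy_labels[i] == c for i in val_idx) for c in (0, 1)}
print(len(train_idx), len(val_idx), val_counts)
```

Because the split is done class by class, the validation set keeps the same 50/50 balance as the full dataset.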

At this point we could build the model and load all the training and validation images into memory according to the ids in both sets. However, this would exhaust the machine's RAM and force us to restart it. Instead, we will create directories and sub-directories and save each image in its respective folder, so that the images can be read from disk in batches. This uses significantly less RAM and allows us to successfully train our model for several epochs.

As I said, we will create a base directory containing training and validation sub-directories, and each of these will contain one folder per category, 'cancerous_tissue' and 'non_cancerous_tissue'. We will do this step by step as follows:

In [None]:
# Create a new directory
base_dir = 'base_dir'
os.mkdir(base_dir)

#[CREATE FOLDERS INSIDE THE BASE DIRECTORY]
# train_dir
train_dir = os.path.join(base_dir, 'train_dir')
os.mkdir(train_dir)

# val_dir
val_dir = os.path.join(base_dir, 'val_dir')
os.mkdir(val_dir)

# [CREATE FOLDERS INSIDE THE TRAIN AND VALIDATION FOLDERS]
# Inside each folder we create separate folders for each class

# create new folders inside train_dir
non_cancerous_tissue = os.path.join(train_dir, 'non_cancerous_tissue')
os.mkdir(non_cancerous_tissue)
cancerous_tissue = os.path.join(train_dir, 'cancerous_tissue')
os.mkdir(cancerous_tissue)

# create new folders inside val_dir
non_cancerous_tissue = os.path.join(val_dir, 'non_cancerous_tissue')
os.mkdir(non_cancerous_tissue)
cancerous_tissue = os.path.join(val_dir, 'cancerous_tissue')
os.mkdir(cancerous_tissue)
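
Note that `os.mkdir` raises an error if the folder already exists, for example when the cell is run twice. A more compact alternative is sketched below; it is only an illustration, `make_class_dirs` is a hypothetical helper, and a temporary folder is used so that the sketch does not touch the real `base_dir`.

```python
import os
import tempfile

def make_class_dirs(base_dir, splits, classes):
    """Create base_dir/<split>/<class> for every combination and return the paths."""
    paths = []
    for split in splits:
        for cls in classes:
            path = os.path.join(base_dir, split, cls)
            os.makedirs(path, exist_ok=True)  # no error if the folder already exists
            paths.append(path)
    return paths

SPLITS = ['train_dir', 'val_dir']
CLASSES = ['non_cancerous_tissue', 'cancerous_tissue']

# Temporary root folder, used only for this demonstration
demo_root = tempfile.mkdtemp()
created = make_class_dirs(demo_root, SPLITS, CLASSES)

# Calling it a second time is safe, unlike os.mkdir
make_class_dirs(demo_root, SPLITS, CLASSES)
train_folders = sorted(os.listdir(os.path.join(demo_root, 'train_dir')))
print(train_folders)
```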

We have to set the 'id' column as the index so that we can look up image labels more easily. This will help us when copying the images to the directories we just created:

In [None]:
# Set the id as the index in df_data
df_data.set_index('id', inplace=True)

In [None]:
df_data.head()

Time to copy the images to their respective directories:

In [None]:
import shutil

# Get a list of train and val images
train_list = list(df_train['id'])
val_list = list(df_val['id'])

# Transfer the train images
for image in train_list:
    # the id in the csv file does not have the .tif extension therefore we add it here
    fname = image + '.tif'
    # get the label for a certain image
    target = df_data.loc[image,'label']
    
    # these must match the folder names
    if target == 0:
        label = 'non_cancerous_tissue'
    if target == 1:
        label = 'cancerous_tissue'
    
    # source path to image
    src = os.path.join('../input/histopathologic-cancer-detection/train', fname)
    # destination path to image
    dst = os.path.join(train_dir, label, fname)
    # copy the image from the source to the destination
    shutil.copyfile(src, dst)


# Transfer the val images
for image in val_list:
    # the id in the csv file does not have the .tif extension therefore we add it here
    fname = image + '.tif'
    # get the label for a certain image
    target = df_data.loc[image,'label']
    
    # these must match the folder names
    if target == 0:
        label = 'non_cancerous_tissue'
    if target == 1:
        label = 'cancerous_tissue'
    
    # source path to image
    src = os.path.join('../input/histopathologic-cancer-detection/train', fname)
    # destination path to image
    dst = os.path.join(val_dir, label, fname)
    # copy the image from the source to the destination
    shutil.copyfile(src, dst)
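
The two loops above differ only in the list of ids and the destination folder, so they could be folded into one function. The following is a minimal sketch of that idea; `copy_images` and `LABEL_TO_FOLDER` are hypothetical names, and the demo uses empty placeholder files in temporary folders rather than the real dataset.

```python
import os
import shutil
import tempfile

# These must match the folder names created earlier
LABEL_TO_FOLDER = {0: 'non_cancerous_tissue', 1: 'cancerous_tissue'}

def copy_images(ids, labels_by_id, src_dir, dst_root, ext='.tif'):
    """Copy each image into dst_root/<class folder>/ and return how many were copied."""
    copied = 0
    for image_id in ids:
        fname = image_id + ext  # the ids in the csv file have no extension
        folder = LABEL_TO_FOLDER[labels_by_id[image_id]]
        os.makedirs(os.path.join(dst_root, folder), exist_ok=True)
        shutil.copyfile(os.path.join(src_dir, fname),
                        os.path.join(dst_root, folder, fname))
        copied += 1
    return copied

# Demo with empty placeholder files in temporary folders
src_dir, dst_root = tempfile.mkdtemp(), tempfile.mkdtemp()
toy_image_labels = {'img_a': 0, 'img_b': 1, 'img_c': 1}
for image_id in toy_image_labels:
    open(os.path.join(src_dir, image_id + '.tif'), 'wb').close()

n_copied = copy_images(list(toy_image_labels), toy_image_labels, src_dir, dst_root)
n_cancerous = len(os.listdir(os.path.join(dst_root, 'cancerous_tissue')))
print(n_copied, n_cancerous)
```

With the real data, the same function could be called once with the training ids and `train_dir`, and once with the validation ids and `val_dir`.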

To confirm that the images were successfully copied, we print the number of images in each folder. The numbers should match the ones we got when the dataset was split:

In [None]:
# check how many training images we have in each folder
print(len(os.listdir('base_dir/train_dir/non_cancerous_tissue')))
print(len(os.listdir('base_dir/train_dir/cancerous_tissue')))

In [None]:
# check how many validation images we have in each folder
print(len(os.listdir('base_dir/val_dir/non_cancerous_tissue')))
print(len(os.listdir('base_dir/val_dir/cancerous_tissue')))

Perfect, our images are now available in the notebook's output data, organized by split and by class. We could start creating our model and train it with flow_from_directory right away. However, in order to increase the accuracy we will perform data augmentation by creating flipped, shifted, zoomed and rotated versions of the existing images. We also have to scale the pixel values. Since pixels range from 0 to 255, min-max scaling amounts to dividing by 255, and this can be done in the same augmentation step.
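
To get an intuition for what rescaling and flipping do to the pixel values, here is a tiny sketch in plain Python on a 2x2 grayscale "image". The helper functions are hypothetical and are not how ImageDataGenerator is implemented; they only illustrate the transformations.

```python
def rescale(image, factor=1.0 / 255):
    """Min-max scaling for pixels in [0, 255]: (x - 0) / (255 - 0) = x / 255."""
    return [[pixel * factor for pixel in row] for row in image]

def horizontal_flip(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]

def vertical_flip(image):
    """Mirror the image top to bottom."""
    return image[::-1]

toy_image = [[0, 128],
             [255, 64]]

scaled = rescale(toy_image)
h_flipped = horizontal_flip(toy_image)
v_flipped = vertical_flip(toy_image)
print(scaled)
print(h_flipped)
print(v_flipped)
```

Flips do not change which tissue is shown, only its orientation, which is why they are a safe augmentation for this kind of image.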

In [None]:
train_path = 'base_dir/train_dir'
valid_path = 'base_dir/val_dir'
test_path = '../input/histopathologic-cancer-detection/test'

num_train_samples = len(df_train)
num_val_samples = len(df_val)
train_batch_size = 32
val_batch_size = 32

train_steps = np.ceil(num_train_samples / train_batch_size)
val_steps = np.ceil(num_val_samples / val_batch_size)
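
As a quick check of the arithmetic, the number of steps per epoch is the number of batches needed to see every image once, where the last batch may be smaller than the others. A minimal sketch with the sizes from our 90/10 split follows; `steps_per_epoch` is a hypothetical helper, and `math.ceil` is used here because it returns an integer, whereas `np.ceil` returns a float.

```python
import math

def steps_per_epoch(num_samples, batch_size):
    """Number of batches needed to see every sample once (the last one may be partial)."""
    return math.ceil(num_samples / batch_size)

# 178000 balanced images split 90/10
train_steps_example = steps_per_epoch(160200, 32)
val_steps_example = steps_per_epoch(17800, 32)
print(train_steps_example, val_steps_example)
```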

ImageDataGenerator allows us to augment our data, and it is applied to our training and validation images by reading them with the flow_from_directory function. Notice that the training and validation generators are both shuffled. We also have to create a second, unshuffled generator over the validation set, on which we will run a new prediction to compute the error metrics, since the predictions must stay in the same order as the true labels:

In [None]:
from tensorflow.keras.preprocessing.image import ImageDataGenerator

IMAGE_SIZE=96
datagen = ImageDataGenerator(rescale=1.0/255,
                             featurewise_center=False,
                             samplewise_center=False,
                             featurewise_std_normalization=False,
                             samplewise_std_normalization=False,
                             zca_whitening=False,
                             rotation_range=10,
                             zoom_range = 0.1,
                             width_shift_range=0.1,
                             height_shift_range=0.1,
                             horizontal_flip=True,
                             vertical_flip=True)

train_gen = datagen.flow_from_directory(train_path,
                                        target_size=(IMAGE_SIZE,IMAGE_SIZE),
                                        batch_size=train_batch_size,
                                        class_mode='binary',
                                        shuffle=True)

val_gen = datagen.flow_from_directory(valid_path,
                                      target_size=(IMAGE_SIZE,IMAGE_SIZE),
                                      batch_size=val_batch_size,
                                      class_mode='binary',
                                      shuffle=True)

# Note: shuffle=False keeps this copy of the validation set in a fixed order
val2_gen = datagen.flow_from_directory(valid_path,
                                       target_size=(IMAGE_SIZE,IMAGE_SIZE),
                                       batch_size=1,
                                       class_mode='binary',
                                       shuffle=False)

Nice! The function found all the images, distributed into two classes. These generator objects will be passed as arguments when training the model.
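
One detail worth keeping in mind is how the two classes are numbered. flow_from_directory infers the classes from the sub-folder names in alphanumeric order, so the indices do not necessarily match the labels in the CSV file. The sketch below mimics that mapping in plain Python; `class_indices_from_folders` is a hypothetical helper, and with the real generators the mapping can be read from `train_gen.class_indices`.

```python
def class_indices_from_folders(folder_names):
    """Mimic the alphabetical folder-to-index mapping (simplified sketch)."""
    return {name: index for index, name in enumerate(sorted(folder_names))}

class_indices = class_indices_from_folders(['non_cancerous_tissue', 'cancerous_tissue'])
print(class_indices)
```

With our folder names, 'cancerous_tissue' gets index 0 and 'non_cancerous_tissue' gets index 1, which is the opposite of the CSV labels. The sigmoid output of a model trained on these generators is therefore the probability of the non-cancerous class.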

## Modeling:

We will start by importing all libraries and functions needed to create the model as follows:

In [None]:
from sklearn.metrics import confusion_matrix
import itertools

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Flatten, Conv2D, MaxPool2D
from tensorflow.keras.layers import BatchNormalization
from tensorflow.keras.optimizers import RMSprop,Adam,SGD,Adadelta

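Before we start creating our model, we will define three callbacks that help with the training: one stops it once the validation accuracy reaches a threshold, one reduces the learning rate when the validation accuracy stops improving, and one saves the best model according to its validation accuracy: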

In [None]:
#Will stop the training once it reaches 99% validation accuracy:
class myCallback(tf.keras.callbacks.Callback):
  def on_epoch_end(self, epoch, logs=None):
    # logs may be None or may lack 'val_accuracy', so guard before comparing
    val_accuracy = (logs or {}).get('val_accuracy')
    if val_accuracy is not None and val_accuracy > 0.99:
      print("\nReached 99% validation accuracy so cancelling training!")
      self.model.stop_training = True

callbacks = myCallback()

#Will reduce the learning rate if validation accuracy didn't improve in one epoch:
from tensorflow.keras.callbacks import ReduceLROnPlateau
lr_reduction = ReduceLROnPlateau(monitor='val_accuracy',
                                 patience=1, 
                                 verbose=1, 
                                 factor=0.5, 
                                 min_lr=0.000001)

#Will save the very best model according to validation accuracy:
from tensorflow.keras.callbacks import ModelCheckpoint
model_dir = 'CNN_model_histo.h5'
checkpoint = ModelCheckpoint(model_dir, monitor='val_accuracy', verbose=1,
                             save_best_only=True, mode='max')
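
To see how the learning rate evolves with `factor=0.5` and `patience=1`, here is a simplified simulation in plain Python. It is only a sketch: `simulate_reduce_on_plateau` is a hypothetical function that ignores the `cooldown` and `min_delta` options of the real callback, and the validation accuracies are made up.

```python
def simulate_reduce_on_plateau(val_accuracies, initial_lr=0.001, factor=0.5,
                               patience=1, min_lr=0.000001):
    """Return the learning rate in effect after each epoch (simplified sketch)."""
    lr, best, wait = initial_lr, float('-inf'), 0
    lrs = []
    for acc in val_accuracies:
        if acc > best:
            # Improvement: remember the new best value and reset the counter
            best, wait = acc, 0
        else:
            wait += 1
            if wait >= patience:
                # No improvement for `patience` epochs: shrink the learning rate
                lr = max(lr * factor, min_lr)
                wait = 0
        lrs.append(lr)
    return lrs

# Made-up validation accuracies, only to illustrate the behaviour
toy_val_accuracies = [0.90, 0.92, 0.91, 0.91, 0.93]
lr_schedule = simulate_reduce_on_plateau(toy_val_accuracies)
print(lr_schedule)
```

In this toy run the learning rate is halved after the third and fourth epochs, because neither of them beats the best accuracy seen so far.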

The architecture I decided to use comes from one of my previous projects, in which its performance was considerably high. It consists of 4 blocks of layers, each with a 2D convolution, 2D max pooling and batch normalization, followed at the end by a couple of Dropout and Dense layers.

I kindly encourage you to take a look at the following notebook, in which I explain this architecture in more detail for a similar task:

https://www.kaggle.com/georgesaavedra/tumor-classification-cnn

In [None]:
optimizer = Adam(learning_rate=0.001,beta_1=0.9,beta_2=0.999)

model=Sequential()
model.add(Conv2D(32,(3,3),strides=1,padding='Same',activation='relu',input_shape=(IMAGE_SIZE, IMAGE_SIZE, 3)))
model.add(MaxPool2D(2,2))
model.add(BatchNormalization())
model.add(Conv2D(64,(3,3), strides=1,padding= 'Same', activation='relu'))
model.add(MaxPool2D(2,2))
model.add(BatchNormalization())
model.add(Conv2D(128,(3,3), strides=1,padding= 'Same', activation='relu'))
model.add(MaxPool2D(2,2))
model.add(BatchNormalization())
model.add(Conv2D(256,(3,3), strides=1,padding= 'Same', activation='relu'))
model.add(MaxPool2D(2,2))
model.add(BatchNormalization())
model.add(Flatten())
model.add(Dropout(0.2))
model.add(Dense(512, activation = "relu"))
model.add(Dropout(0.2))
model.add(Dense(1, activation = "sigmoid"))

model.compile(optimizer = optimizer , loss = "binary_crossentropy", metrics=["accuracy"])

In [None]:
model.summary()
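
The shapes reported in the summary can be checked by hand. A convolution with 'same' padding and stride 1 keeps the spatial size, and each 2x2 max pooling halves it, so the 96x96 input shrinks step by step before the Flatten layer. A small sketch of this arithmetic follows; `feature_map_sizes` is a hypothetical helper.

```python
def feature_map_sizes(input_size, num_blocks, pool=2):
    """Spatial size after each Conv2D('same', stride 1) + MaxPool2D(pool) block."""
    sizes = []
    size = input_size
    for _ in range(num_blocks):
        size = size // pool  # the convolution keeps the size, the pooling divides it
        sizes.append(size)
    return sizes

sizes = feature_map_sizes(96, num_blocks=4)
flatten_units = sizes[-1] * sizes[-1] * 256   # 256 filters in the last Conv2D
dense_params = flatten_units * 512 + 512      # weights + biases of Dense(512)
print(sizes, flatten_units, dense_params)
```

Most of the trainable parameters of this network sit in the first Dense layer, not in the convolutions.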

The following function displays our model architecture layer by layer. It does not show the details of each layer, but we can associate each color with a type of layer: Yellow: Conv2D, Red: MaxPooling, Green: BatchNormalization, Blue: Flatten, Black: Dropout, Yellow: Dense.

In [None]:
visualkeras.layered_view(model)

Now it is time to train our model using the generators pointing to the directories we created. We will train for 20 epochs so as to find the best possible model.

*Important: given the depth of our network, training takes a considerably long time. Each epoch took me around 8 minutes and 20 seconds, so be patient if you want to train it again.*

In [None]:
# model.fit accepts generators directly; fit_generator is deprecated in TensorFlow 2
history = model.fit(train_gen, validation_data=val_gen,
                    epochs=20, verbose=1,
                    callbacks=[callbacks, lr_reduction, checkpoint])

The best model reached 95.37% validation accuracy and was saved in the CNN_model_histo.h5 file, which we will load later. Before that, let's plot the performance curves against the epochs:

In [None]:
pd.DataFrame(history.history)

In [None]:
def metrics_plot(history):
    acc = history.history['accuracy']
    val_acc = history.history['val_accuracy']
    loss = history.history['loss']
    val_loss = history.history['val_loss']

    epochs = range(1,len(acc)+1,1)

    plt.plot(epochs, acc, 'r', label='Training accuracy')
    plt.plot(epochs, val_acc, 'b', label='Validation accuracy')
    plt.title('Training and validation accuracy')
    plt.legend()
    plt.figure()

    plt.plot(epochs, loss, 'r', label='Training Loss')
    plt.plot(epochs, val_loss, 'b', label='Validation Loss')
    plt.title('Training and validation loss')
    plt.legend()

    plt.show()

In [None]:
metrics_plot(history)

We can see that the validation curves oscillated before settling, much like an underdamped response, with a settling time of around 15 epochs. This is the main reason why we trained for 20 epochs and used ModelCheckpoint.
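
When the validation curve oscillates like this, the last epoch is not necessarily the best one, which is what ModelCheckpoint with `save_best_only=True` takes care of. The sketch below shows the same selection in plain Python; `best_epoch` is a hypothetical helper and the accuracies are made up, but with the real training run it could be called on `history.history['val_accuracy']`.

```python
def best_epoch(val_accuracies):
    """Return (1-based epoch, accuracy) of the highest validation accuracy."""
    index = max(range(len(val_accuracies)), key=lambda i: val_accuracies[i])
    return index + 1, val_accuracies[index]

# Made-up oscillating validation accuracies
toy_val_accuracy = [0.88, 0.93, 0.90, 0.94, 0.92]
best = best_epoch(toy_val_accuracy)
print(best)
```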

In the next cell we will load the model saved in the h5 file and confirm that it works properly, so that we can reuse it at another time without training the model again.

In [None]:
from tensorflow.keras.models import load_model

model_saved = load_model('./CNN_model_histo.h5')

Here is where we use the unshuffled copy of the validation set we created earlier. We will evaluate the performance of the loaded model and predict the labels of these images as follows:

In [None]:
model_saved.evaluate(val2_gen, steps=len(df_val), verbose=1)

In [None]:
predicted_val_prob = model_saved.predict(val2_gen, steps=len(df_val), verbose=1)

In [None]:
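# Round the probabilities to 0 or 1 and flatten to a 1-D array of integer class indices
Y_val_pred= np.round(predicted_val_prob).astype(int).ravel()
Y_val_pred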

The next cells compute metrics such as accuracy, recall, precision, F1-score and area under the ROC curve, shown as a classification report, a confusion matrix and a summary table:

In [None]:
from sklearn.metrics import classification_report

y_true = val2_gen.classes
report = classification_report(y_true, Y_val_pred)

print(report)

In [None]:
from sklearn.metrics import confusion_matrix

f,ax = plt.subplots(figsize=(15, 15))
confusion_mtx = confusion_matrix(y_true, Y_val_pred)
sns.set(font_scale=1.4)
sns.heatmap(confusion_mtx, annot=True, linewidths=0.01,cmap="Greens",linecolor="gray",ax=ax)
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.title("Confusion Matrix Validation set")
plt.show()
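
The precision, recall and F1-score in the report can all be derived from the four cells of the confusion matrix. Here is a minimal sketch in plain Python on made-up labels; `binary_metrics` is a hypothetical helper, not scikit-learn's implementation. Keep in mind that with our folder names the generators assign index 0 to 'cancerous_tissue', so `positive=0` would be the choice for metrics on the cancerous class.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute precision, recall, F1 and accuracy from the confusion matrix counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)

    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / len(pairs)
    return {'tp': tp, 'fp': fp, 'fn': fn, 'tn': tn,
            'precision': precision, 'recall': recall, 'f1': f1, 'accuracy': accuracy}

# Made-up labels, only to illustrate the formulas
toy_true = [1, 1, 1, 0, 0, 0, 0, 1]
toy_pred = [1, 0, 1, 0, 0, 1, 0, 1]
toy_metrics = binary_metrics(toy_true, toy_pred)
print(toy_metrics)
```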

In [None]:
from sklearn.metrics import precision_recall_fscore_support as score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import accuracy_score

In [None]:
metrics = []

precision, recall, fscore, _ = score(y_true, Y_val_pred, average='weighted')
accuracy = accuracy_score(y_true, Y_val_pred)
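# AUC is a ranking metric, so it is computed from the predicted probabilities rather than the rounded labels
auc = roc_auc_score(y_true, predicted_val_prob.ravel())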
metrics.append(pd.Series({'precision':precision, 'recall':recall,
                          'fscore':fscore, 'accuracy':accuracy,
                          'auc':auc}, name='CNN model'))
    
metrics = pd.concat(metrics, axis=1)

In [None]:
metrics
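
A note on the area under the curve: it measures how well the model ranks positive images above negative ones, so it is best computed from probabilities, because rounding them to 0 or 1 discards the ranking. The sketch below illustrates this with a pairwise definition of AUC in plain Python; `roc_auc` is a hypothetical helper (not scikit-learn's `roc_auc_score`) and the labels and probabilities are made up.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up labels and predicted probabilities
toy_y = [0, 0, 1, 1, 1]
toy_probs = [0.2, 0.6, 0.55, 0.7, 0.9]
toy_rounded = [1 if p > 0.5 else 0 for p in toy_probs]

auc_from_probs = roc_auc(toy_y, toy_probs)
auc_from_labels = roc_auc(toy_y, toy_rounded)
print(auc_from_probs, auc_from_labels)
```

In this toy case the two values differ, which shows that thresholding before computing the AUC changes the result.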

## Inference:

We will follow almost the same process as before: we take a look at the unseen images and copy all of them into a new directory:

In [None]:
os.listdir('../input/histopathologic-cancer-detection/test')[:5]

In [None]:
print('\n Amount of images in test dataset: ', len(os.listdir('../input/histopathologic-cancer-detection/test')))

In [None]:
# create test_dir
test_dir = 'test_set_dir'
os.mkdir(test_dir)
    
# create test_images inside test_dir
test_images = os.path.join(test_dir, 'test_images')
os.mkdir(test_images)

In [None]:
# Transfer the test images into test_images

test_list = os.listdir('../input/histopathologic-cancer-detection/test')

for image in test_list:
    fname = image
    # source path to image
    src = os.path.join('../input/histopathologic-cancer-detection/test', fname)
    # destination path to image
    dst = os.path.join(test_images, fname)
    # copy the image from the source to the destination
    shutil.copyfile(src, dst)

In [None]:
len(os.listdir('test_set_dir/test_images'))

Once all the images are copied to our 'test_set_dir' directory, we read them with the flow_from_directory function as before and pass the resulting generator to the prediction function:

In [None]:
test_path = 'test_set_dir'
test_gen = datagen.flow_from_directory(test_path,
                                       target_size=(IMAGE_SIZE,IMAGE_SIZE),
                                       batch_size=1,
                                       class_mode=None,  # unlabeled test images: yield only the image batches
                                       shuffle=False)

In [None]:
test_predictions = model_saved.predict(test_gen, steps=len(os.listdir('test_set_dir/test_images')), verbose=1)

In [None]:
test_predictions

In [None]:
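# flow_from_directory assigns class indices alphabetically by folder name:
# 'cancerous_tissue' -> 0 and 'non_cancerous_tissue' -> 1, so the sigmoid output
# is P(non-cancerous). The competition label is 1 for cancerous tissue, hence 1 - p.
cancer_prob=1-test_predictions

The following line creates the csv file that we will submit:

In [None]:
submission=pd.DataFrame(cancer_prob, columns=['label'])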
submission['id']=test_gen.filenames
submission['id']=submission['id'].str.split('/', n=1, expand=True)[1].str.split('.', n=1, expand=True)[0] 
submission.set_index('id', inplace=True)
submission.head()

In [None]:
submission.to_csv('submission.csv')
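
The two transformations used to build the submission, stripping the folder and extension from each file name and converting the sigmoid output into the probability of cancerous tissue, can be sketched in plain Python as follows. `to_submission_rows` is a hypothetical helper and the file names and outputs are made up; it assumes, as in this notebook, that the sigmoid output is the probability of the class with index 1 ('non_cancerous_tissue').

```python
import os

def to_submission_rows(filenames, sigmoid_outputs):
    """Return (image id, P(cancerous)) pairs from generator file names and model outputs."""
    rows = []
    for path, p_index_1 in zip(filenames, sigmoid_outputs):
        # 'test_images/abc123.tif' -> 'abc123'
        image_id = os.path.splitext(os.path.basename(path))[0]
        # The model outputs P(class index 1); the competition expects P(cancerous)
        rows.append((image_id, 1.0 - p_index_1))
    return rows

toy_filenames = ['test_images/abc123.tif', 'test_images/def456.tif']
toy_outputs = [0.9, 0.25]
rows = to_submission_rows(toy_filenames, toy_outputs)
print(rows)
```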

I would like to hear any feedback on how to improve the performance of the model, or let me know if you found a different one that works even better!

If you liked this notebook, I would really appreciate your upvote, especially if you want to see more projects/tutorials like this one. I also encourage you to take a look at my project portfolio; I am sure you will love it.

Thank you!