Identify metastatic tissue in histopathologic scans of lymph node sections
Datasets obtained from the Kaggle competition: https://www.kaggle.com/c/histopathologic-cancer-detection/data
Create an algorithm to identify metastatic cancer in small image patches taken from larger digital pathology scans
Binary classification of H&E pathology images of lymph node sections into 2 classes:
- 0: neg -- The center 32x32px region of a patch doesn't contain a single tumor cell
- 1: pos -- The center 32x32px region of a patch contains at least one pixel of tumor tissue
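
The labeling rule above depends only on the central 32x32px window of each 96x96px patch. A minimal sketch of that rule follows, assuming a hypothetical per-pixel tumor mask as input (the Kaggle data ships only the final 0/1 labels, so `tumor_mask` and `patch_label` are illustrative names, not part of the dataset or this project):

```python
PATCH_SIZE = 96   # patch width/height in pixels, as stated in this README
CENTER_SIZE = 32  # size of the central region that determines the label


def patch_label(tumor_mask):
    """Return 1 if any pixel in the central 32x32 region is tumor, else 0.

    tumor_mask is a hypothetical 96x96 nested list of 0/1 values.
    """
    start = (PATCH_SIZE - CENTER_SIZE) // 2  # 32
    stop = start + CENTER_SIZE               # 64 (exclusive)
    for row in tumor_mask[start:stop]:
        if any(row[start:stop]):
            return 1
    return 0


# Toy masks: no tumor, tumor only outside the center, tumor inside the center.
empty = [[0] * PATCH_SIZE for _ in range(PATCH_SIZE)]
outer_only = [row[:] for row in empty]
outer_only[5][5] = 1
in_center = [row[:] for row in empty]
in_center[40][50] = 1

labels = [patch_label(m) for m in (empty, outer_only, in_center)]
```

Under this rule, tumor pixels that fall only in the outer border of a patch leave the label at 0.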
- The dataset contains a train folder and a test folder. There are 220025 images in the train folder, and each image has a corresponding label (0 or 1) saved in a separate file, train_labels.csv. The test folder contains 57458 images to evaluate prediction performance.
- The ratio of neg/pos images is around 60%/40% in the train folder.
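
The class ratio quoted above can be checked directly from the labels file. The sketch below assumes the CSV has `id` and `label` columns (an assumption about the file layout) and uses a tiny in-memory sample rather than the real file; `class_ratio` is a hypothetical helper name.

```python
import csv
import io
from collections import Counter


def class_ratio(csv_file):
    """Return {label: fraction of rows} for an open CSV with a 'label' column."""
    reader = csv.DictReader(csv_file)
    counts = Counter(int(row["label"]) for row in reader)
    total = sum(counts.values())
    return {label: n / total for label, n in sorted(counts.items())}


# Tiny stand-in for train_labels.csv (3 negatives, 2 positives).
sample = io.StringIO("id,label\na1,0\na2,1\na3,0\na4,0\na5,1\n")
ratios = class_ratio(sample)
print(ratios)
```

For the real file, the same function could be called inside `with open("train_labels.csv", newline="") as f:`.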
- All the images have size 96x96px in RGB mode. Here is a panel of representative images from each class:
- To improve prediction accuracy, avoid overfitting, and improve training efficiency, the images in the train folder are first split into new train, val, and test datasets, keeping the class ratio consistent across the three. The new directory structure is shown below:
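
The class-preserving split described above can be sketched as follows. This is not necessarily how the split was done in this project: the 80/10/10 fractions, the seed, and the `stratified_split` helper are assumptions for illustration. The idea is to shuffle and slice each class separately so that every subset keeps roughly the original neg/pos ratio.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.8, 0.1, 0.1), seed=0):
    """Split image ids into (train, val, test) lists, class by class.

    labels maps image id -> class (0 or 1). The fractions are hypothetical.
    """
    by_class = defaultdict(list)
    for image_id, label in labels.items():
        by_class[label].append(image_id)

    rng = random.Random(seed)
    train, val, test = [], [], []
    for label in sorted(by_class):
        ids = by_class[label]
        rng.shuffle(ids)
        n_train = round(len(ids) * fractions[0])
        n_val = round(len(ids) * fractions[1])
        train.extend(ids[:n_train])
        val.extend(ids[n_train:n_train + n_val])
        test.extend(ids[n_train + n_val:])  # remainder goes to test
    return train, val, test


# Toy labels with the same 60%/40% neg/pos ratio as the train folder.
labels = {f"img{i}": (0 if i < 60 else 1) for i in range(100)}
train_ids, val_ids, test_ids = stratified_split(labels)
```

Each returned list could then be used to copy files into the corresponding folder.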
- CNN built from scratch

Here I first tried to build an 11-layer CNN model from scratch:
((Conv-ReLU)x3-MaxPool)x3 --flattening-- FC1-ReLU-FC2
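
To make the architecture string concrete, the sketch below traces feature-map shapes through the three stacks for a 96x96px input. The kernel size (3x3, padding 1, stride 1), the 2x2 max pooling, and the channel widths (32, 64, 128) are assumptions chosen for illustration; the README does not state them.

```python
def conv_out(size, kernel=3, stride=1, padding=1):
    """Spatial output size of a convolution (assumed 3x3, padding 1)."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, kernel=2, stride=2):
    """Spatial output size of max pooling (assumed 2x2, stride 2)."""
    return (size - kernel) // stride + 1


def trace_shapes(input_size=96, channels=(32, 64, 128)):
    """Return (channels, height, width) after each (Conv-ReLU)x3-MaxPool stack."""
    size = input_size
    shapes = []
    for out_channels in channels:  # hypothetical channel widths
        for _ in range(3):         # three Conv-ReLU layers per stack
            size = conv_out(size)
        size = pool_out(size)      # one MaxPool per stack
        shapes.append((out_channels, size, size))
    return shapes


shapes = trace_shapes()
c, h, w = shapes[-1]
flat_features = c * h * w          # input width of FC1 after flattening
n_weight_layers = 3 * 3 + 2        # 9 conv layers + FC1 + FC2 = 11
print(shapes, flat_features, n_weight_layers)
```

Under these assumptions the spatial size halves at each stack (96 to 48 to 24 to 12), and the 9 convolutional layers plus 2 fully connected layers account for the 11 layers mentioned above.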
The CNN model is trained for 50 epochs, and the model with the lowest loss on the validation dataset is saved as the optimal model for evaluation and further analysis.
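
The "keep the model with the lowest validation loss" rule can be sketched independently of any deep learning framework. In the example below, the loss values are made up and `select_best_epoch` is a hypothetical helper; in a real training loop, the model weights would be saved at the point marked in the comment.

```python
import math


def select_best_epoch(val_losses):
    """Return (epoch, loss) of the lowest validation loss, with epochs starting at 1."""
    best_epoch, best_loss = None, math.inf
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss
            # A real training loop would save the model weights here.
    return best_epoch, best_loss


# Made-up validation losses for six epochs.
val_losses = [0.52, 0.41, 0.35, 0.37, 0.33, 0.36]
best_epoch, best_loss = select_best_epoch(val_losses)
print(best_epoch, best_loss)
```

Selecting on validation loss rather than training loss means later epochs that start to overfit do not overwrite the saved model.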
The trained CNN model is evaluated using the test dataset generated from the original train folder.
- Overall prediction accuracy: 0.974
- Precision: 0.966
- Recall: 0.970
- F1 score: 0.968
- ROC_AUC: 0.983
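
For reference, the metrics listed above can be computed from confusion-matrix counts and predicted scores as sketched below. The counts and scores are toy values, not this project's results, and the function names are hypothetical. The last lines check that the reported F1 score is consistent with the reported precision and recall.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


def roc_auc(labels, scores):
    """ROC AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy counts and scores, for illustration only.
toy = classification_metrics(tp=8, fp=2, fn=2, tn=8)
toy_auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])

# Consistency check on the reported numbers: F1 from precision 0.966 and recall 0.970.
reported_f1 = 2 * 0.966 * 0.970 / (0.966 + 0.970)
print(round(reported_f1, 3))
```

The pairwise AUC above is quadratic in the number of samples, so it is only meant for small examples.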
Here are the confusion matrix and the ROC curve:
In addition, t-SNE plots generated from the raw pixels of each image and from the features extracted by the trained CNN model give a clear picture of how well the data points are separated after being propagated through the trained CNN model.
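
A comparison like this could be produced with scikit-learn's `TSNE`, as in the rough sketch below. It assumes NumPy and scikit-learn are installed, and it uses random stand-in arrays in place of the real flattened images and CNN features; the `embed_2d` helper, the perplexity value, and the array sizes are all illustrative assumptions.

```python
import numpy as np
from sklearn.manifold import TSNE


def embed_2d(vectors, perplexity=10.0, seed=0):
    """Project row vectors to 2D with t-SNE (perplexity must be below the sample count)."""
    tsne = TSNE(n_components=2, perplexity=perplexity, init="pca", random_state=seed)
    return tsne.fit_transform(np.asarray(vectors, dtype=np.float32))


rng = np.random.default_rng(0)

# Stand-in for flattened raw pixels: 60 samples with no class structure.
raw_pixels = rng.random((60, 300))

# Stand-in for CNN features: two well-separated groups of 30 samples each.
cnn_features = np.concatenate(
    [rng.normal(0.0, 1.0, (30, 16)), rng.normal(5.0, 1.0, (30, 16))]
)

raw_embedding = embed_2d(raw_pixels)
feature_embedding = embed_2d(cnn_features)
print(raw_embedding.shape, feature_embedding.shape)
```

Each embedding has one 2D point per image, which can then be scatter-plotted and colored by class label.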
The trained CNN model has a Private Score of 0.8799 and a Public Score of 0.9472 on Kaggle for predictions on images in the test folder.
Although neural networks have proven very successful at extracting information, identifying objects, and classifying images, how they perform these challenging tasks remains largely a mystery. Here I plotted the output of each stack of convolutional layers (conv1, conv2, conv3) as well as each convolutional layer in the first stack (conv1_1, conv1_2, conv1_3) to take a peek inside the 'black box'.
As we can see from the outputs of conv1_1, conv1_2, and conv1_3, as well as the post-activation images, the first few convolutional layers capture basic 'content' features of images such as shapes, edges, and boundaries.
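
To illustrate how a single convolution followed by ReLU can respond to edges, here is a toy sketch using a hand-written Sobel-style kernel on a tiny synthetic image. The kernel is an assumption for illustration, not a filter learned by the trained model, and the helper names are hypothetical. As in most CNN frameworks, the "convolution" is computed as a cross-correlation.

```python
def conv2d_valid(image, kernel):
    """2D cross-correlation without padding, on nested lists."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j] for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def relu(feature_map):
    """Element-wise ReLU: negative responses are set to zero."""
    return [[max(0.0, v) for v in row] for row in feature_map]


# 6x6 toy image: dark left half, bright right half (one vertical edge).
image = [[0.0] * 3 + [1.0] * 3 for _ in range(6)]

# Hand-written horizontal-gradient (Sobel-style) kernel.
sobel_x = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]

response = relu(conv2d_valid(image, sobel_x))
print(response[0])
```

The response is nonzero only in the columns whose 3x3 window straddles the dark-to-bright boundary, which is the kind of edge map seen in early-layer outputs.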
Deeper convolutional layers tend to capture higher-order features such as colors and textures. It's visually obvious that the output from the deeper layers (e.g. conv3) contains very 'abstract' information that can hardly be inferred from the original input image. In contrast, features captured by the first stack of convolutional layers (conv1: conv1_1, conv1_2, conv1_3) mainly contain information about object and shape arrangement that can easily be inferred from the original image.