# <center>[Histopathologic Cancer Detection](https://www.kaggle.com/competitions/histopathologic-cancer-detection/overview)</center>
---

## [About the data](https://www.kaggle.com/competitions/histopathologic-cancer-detection/data)
In this dataset, you are provided with a large number of small pathology images to classify. Files are named with an image **`id`**. The **`train_labels.csv`** file provides the ground truth for the images in the **`train`** folder. You are predicting the labels for the images in the **`test`** folder. A **`positive label`** indicates that the center **`32x32px`** region of a patch contains at least one pixel of tumor tissue. Tumor tissue in the outer region of the patch does not influence the label. This outer region is provided to enable fully-convolutional models that do not use zero-padding, to ensure consistent behavior when applied to a whole-slide image.
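As a rough illustration of the labeling rule, the sketch below computes the pixel bounds of the centered 32x32 window inside a patch. The 96x96 patch size is an assumption (the text above only states the size of the center region), and `center_window` is a hypothetical helper, not part of the competition code.

```python
def center_window(height, width, size=32):
    """Return (top, left, bottom, right) of the centered size x size window."""
    top = (height - size) // 2
    left = (width - size) // 2
    return top, left, top + size, left + size

# Assumed patch size of 96x96 px (not stated in the text above)
bounds = center_window(96, 96)
print(bounds)  # (32, 32, 64, 64)
```

Only tumor pixels that fall inside this window affect the label; the surrounding border is context.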

The original PCam dataset contains duplicate images due to its probabilistic sampling; however, the version presented on Kaggle does not contain duplicates. We have otherwise maintained the same data and splits as the PCam benchmark.

## Importing the packages

In [1]:
import os
import shutil
import pandas as pd
import numpy as np
from PIL import Image
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import torch

from torch import nn, optim
from torch.utils.data import DataLoader, Dataset
from torch.utils.data.sampler import SubsetRandomSampler
from torchvision.datasets import ImageFolder
import torchvision.transforms as T
from torchvision.utils import make_grid
from torchmetrics.functional import accuracy
import pytorch_lightning as pl

# Check if gpu support is available
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print()
print(device)

%matplotlib inline




cpu


## Paths

In [2]:
! tree -d /kaggle

/kaggle
├── input
│   └── histopathologic-cancer-detection
│       ├── test
│       └── train
├── lib
│   └── kaggle
├── src
└── working

8 directories


In [3]:
csv_submission_ex_file = '/kaggle/input/histopathologic-cancer-detection/sample_submission.csv'
train_labels_csv_file = '/kaggle/input/histopathologic-cancer-detection/train_labels.csv'
train_dir = '/kaggle/input/histopathologic-cancer-detection/train/'
test_dir = '/kaggle/input/histopathologic-cancer-detection/test/'

## Collecting the data

### Sample submission 

In [4]:
# Display the first 10 lines of the file (header plus 9 rows)
! head $csv_submission_ex_file

id,label
0b2ea2a822ad23fdb1b5dd26653da899fbd2c0d5,0
95596b92e5066c5c52466c90b69ff089b39f2737,0
248e6738860e2ebcf6258cdc1f32f299e0c76914,0
2c35657e312966e9294eac6841726ff3a748febf,0
145782eb7caa1c516acbe2eda34d9a3f31c41fd6,0
725dabe6ecccc68b958a2c7dd75bcbf362c7cb03,0
aa0307865281d4484ddf8c637c348292968b93a7,0
f4e5dc9c949920f1b3362982e15e99bf6f3ef83b,0
95e08c9cedc28a9b4a86f4fc1e06c1972134be08,0


In [5]:
submission_data_df = pd.read_csv(filepath_or_buffer=csv_submission_ex_file)
submission_data_df.head()

Unnamed: 0,id,label
0,0b2ea2a822ad23fdb1b5dd26653da899fbd2c0d5,0
1,95596b92e5066c5c52466c90b69ff089b39f2737,0
2,248e6738860e2ebcf6258cdc1f32f299e0c76914,0
3,2c35657e312966e9294eac6841726ff3a748febf,0
4,145782eb7caa1c516acbe2eda34d9a3f31c41fd6,0


In [6]:
submission_data_df.shape

(57458, 2)

### Train data

In [7]:
# Display the first 10 lines of the file (header plus 9 rows)
! head $train_labels_csv_file

id,label
f38a6374c348f90b587e046aac6079959adf3835,0
c18f2d887b7ae4f6742ee445113fa1aef383ed77,1
755db6279dae599ebb4d39a9123cce439965282d,0
bc3f0c64fb968ff4a8bd33af6971ecae77c75e08,0
068aba587a4950175d04c680d38943fd488d6a9d,0
acfe80838488fae3c89bd21ade75be5c34e66be7,0
a24ce148f6ffa7ef8eefb4efb12ebffe8dd700da,1
7f6ccae485af121e0b6ee733022e226ee6b0c65f,1
559e55a64c9ba828f700e948f6886f4cea919261,0


In [8]:
data_df = pd.read_csv(filepath_or_buffer=train_labels_csv_file)
data_df.head()

Unnamed: 0,id,label
0,f38a6374c348f90b587e046aac6079959adf3835,0
1,c18f2d887b7ae4f6742ee445113fa1aef383ed77,1
2,755db6279dae599ebb4d39a9123cce439965282d,0
3,bc3f0c64fb968ff4a8bd33af6971ecae77c75e08,0
4,068aba587a4950175d04c680d38943fd488d6a9d,0


In [9]:
# Shape of the data
data_df.shape

(220025, 2)

In [10]:
# Target feature value_counts
print(data_df["label"].value_counts())
data_df["label"].value_counts(normalize=True) * 100

0    130908
1     89117
Name: label, dtype: int64


0    59.496875
1    40.503125
Name: label, dtype: float64

1. Since the target feature `label` has only two unique values, `[0, 1]`, this is a **`Binary Classification`** problem. It is also a **`Supervised Learning`** problem, as the dataset is **labelled.**

## Downsampling the dataset
A positive label indicates that the center 32x32 px region of a patch contains at least one pixel of tumor tissue. There are **130,908 normal cases (0)** and **89,117 abnormal (cancerous) tissue images (1)**. Training on the full dataset takes considerable time and compute, so one option is to downsample the roughly 220,000 images in the train folder to 10,000 images and then split them into training and testing subsets. The cells below sketch that approach but are commented out in this run:

In [11]:
# np.random.seed(42)
# train_imgs_orig = os.listdir("/kaggle/input/histopathologic-cancer-detection/train")
# selected_image_list = []

# for img in np.random.choice(train_imgs_orig, 10000, replace=False):  # sample without replacement
#     selected_image_list.append(img)
    
# len(selected_image_list)

## Train Test split

In [12]:
# np.random.seed(42)
# np.random.shuffle(selected_image_list)

# train_idx = selected_image_list[:8000]
# test_idx = selected_image_list[8000:]

# print(f"Number of images in the downsampled training dataset: {len(train_idx)}")
# print(f"Number of images in the downsampled testing dataset: {len(test_idx)}")
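A minimal sketch of the same downsample-then-split idea using only the standard library. `random.sample` draws without replacement, so no file is selected twice. The file names are synthetic placeholders standing in for the contents of the train folder.

```python
import random

# Synthetic stand-ins for the file names returned by os.listdir(train_dir)
all_files = [f"img_{i:06d}.tif" for i in range(220_025)]

rng = random.Random(42)                    # local RNG so the draw is reproducible
selected = rng.sample(all_files, 10_000)   # sampling without replacement

split = int(0.8 * len(selected))           # 80/20 split
train_files, test_files = selected[:split], selected[split:]

print(len(train_files), len(test_files))
```

Because the sample is already in random order, slicing it gives a random split without a second shuffle.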

In [13]:
# ### downsampled_train_dataset
# os.mkdir('/kaggle/working/downsampled_train_dataset')

# for file_name in train_idx:
#     source = os.path.join('/kaggle/input/histopathologic-cancer-detection/train',file_name)
#     destination = os.path.join('/kaggle/working/downsampled_train_dataset/', file_name)
#     shutil.copyfile(source, destination)

# ### downsampled_test_dataset
# os.mkdir('/kaggle/working/downsampled_test_dataset')

# for file_name in test_idx:
#     # the source is the train folder here too: the labelled train data is used for both training and testing
#     source = os.path.join('/kaggle/input/histopathologic-cancer-detection/train',file_name)
#     destination = os.path.join('/kaggle/working/downsampled_test_dataset/', file_name)
#     shutil.copyfile(source, destination)
    
# print("Copying of data from Source to Destination is completed!!!")

In the preceding code snippet, we first create two folders, **`downsampled_train_dataset`** and **`downsampled_test_dataset`**, under `/kaggle/working`:
```python
os.mkdir('/kaggle/working/downsampled_train_dataset')
os.mkdir('/kaggle/working/downsampled_test_dataset')
```
Then we loop over `train_idx` to get the image names for the training subset and copy each file from the input storage (`/kaggle/input`) to the notebook's working storage (`/kaggle/working`) using Python's **shutil** module. The same is done for `test_idx`.
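Note that `os.mkdir` raises `FileExistsError` if the folder already exists, for example when the cell is re-run. The following is a hedged sketch of a re-runnable copy step based on `os.makedirs(..., exist_ok=True)`. It runs on temporary directories with empty placeholder files, and `copy_subset` is a hypothetical helper, not part of the notebook.

```python
import os
import shutil
import tempfile

def copy_subset(file_names, src_dir, dst_dir):
    """Copy the named files from src_dir to dst_dir, creating dst_dir if needed."""
    os.makedirs(dst_dir, exist_ok=True)  # safe to call again on re-runs
    for name in file_names:
        shutil.copyfile(os.path.join(src_dir, name), os.path.join(dst_dir, name))
    return len(os.listdir(dst_dir))

# Demo on temporary directories with empty placeholder files
src = tempfile.mkdtemp()
dst = os.path.join(tempfile.mkdtemp(), "downsampled_train_dataset")
names = [f"img_{i}.tif" for i in range(3)]
for name in names:
    open(os.path.join(src, name), "wb").close()

copied = copy_subset(names, src, dst)
print(copied)
```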

## Label extraction

The labels are read from `train_labels.csv` into a dictionary keyed by image `id`. This dictionary is used to look up the label of each image when training and evaluating the image classification model, as shown here:

In [14]:
# Create a dictionary {image_id: label}
dict_labels_train = data_df.set_index('id')['label'].to_dict()

In [15]:
# Also create a dictionary for the test data; its labels are unknown, so every id maps to the placeholder "NA"
test_id_list = [os.path.splitext(img)[0] for img in os.listdir(test_dir)]
test_label_list = ["NA" for img_id in test_id_list]
print(test_id_list[:5])
print(test_label_list[:5])

# use the zip function to create a list of tuples
test_id_label_list = list(zip(test_id_list, test_label_list))
print(test_id_label_list[:5])
# use the dict function to convert the list of tuples into a dictionary
dict_labels_test = dict(test_id_label_list)

['a7ea26360815d8492433b14cd8318607bcf99d9e', '59d21133c845dff1ebc7a0c7cf40c145ea9e9664', '5fde41ce8c6048a5c2f38eca12d6528fa312cdbb', 'bd953a3b1db1f7041ee95ff482594c4f46c73ed0', '523fc2efd7aba53e597ab0f69cc2cbded7a6ce62']
['NA', 'NA', 'NA', 'NA', 'NA']
[('a7ea26360815d8492433b14cd8318607bcf99d9e', 'NA'), ('59d21133c845dff1ebc7a0c7cf40c145ea9e9664', 'NA'), ('5fde41ce8c6048a5c2f38eca12d6528fa312cdbb', 'NA'), ('bd953a3b1db1f7041ee95ff482594c4f46c73ed0', 'NA'), ('523fc2efd7aba53e597ab0f69cc2cbded7a6ce62', 'NA')]
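The same placeholder dictionary can be built in one step with `dict.fromkeys`. A small sketch with made-up file names in place of the real test folder listing:

```python
import os

# Made-up file names standing in for os.listdir(test_dir)
test_files = ["aaa111.tif", "bbb222.tif", "ccc333.tif"]

# Strip the extension to get the image id, then map every id to the placeholder "NA"
test_ids = [os.path.splitext(name)[0] for name in test_files]
placeholder_labels = dict.fromkeys(test_ids, "NA")

print(placeholder_labels)
```

Sharing one value across all keys is safe here because the placeholder is an immutable string.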


## Loading the dataset
Torchvision's `ImageFolder` **expects data to be arranged in one subfolder per class.** Here all train/test images sit in a single folder without subfolders, so we cannot use `ImageFolder` to feed a `DataLoader` directly. Therefore, we will write our own custom `Dataset` class for loading the data, as follows:

In [16]:
class LoadCancerDataset(Dataset):
    
    def __init__(self, data_folder, transform=T.Compose([T.CenterCrop(32), T.ToTensor()]), dict_labels=None):
        self.data_folder = data_folder
        self.list_image_files = os.listdir(data_folder)
        self.transform = transform
        # Avoid a mutable default argument; fall back to an empty mapping
        self.dict_labels = dict_labels if dict_labels is not None else {}
        # Raises KeyError if an image id has no entry in dict_labels
        self.labels = [self.dict_labels[i.split('.')[0]] for i in self.list_image_files]
        
    def __len__(self):
        return len(self.list_image_files)
    
    def __getitem__(self, idx):
        img_name = os.path.join(self.data_folder, self.list_image_files[idx])
        image = Image.open(img_name)
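        # Keep all colour channels of the transformed image
        image = self.transform(image)
        # The file name without its extension is the image id used as the label key
        img_name_short = self.list_image_files[idx].split('.')[0]
        label = self.dict_labels[img_name_short]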
        return image, label
    
# class LoadCancerDataset(Dataset):
#     def __init__(self, csv_file, img_dir, transform=None):
#         self.csv_file = pd.read_csv(csv_file)
#         self.img_dir = img_dir

#         self.transform = transform

#     def __len__(self):
#         return len(self.csv_file)

#     def __getitem__(self, idx):
#         img_name = os.path.join(self.img_dir, self.csv_file.iloc[idx, 0])
#         img_name = img_name + '.tif'

#         sample = Image.open(img_name)

#         if self.transform is not None:
#             sample = self.transform(sample)

#         return {'sample': sample}

In the preceding code block, we have defined a custom dataset class.

The custom class defined earlier inherits from `torch.utils.data.Dataset`. The **LoadCancerDataset** custom class is initialized in the **`__init__`** method and accepts three arguments: **the `path to the data folder`, the `transform` (by default, a center crop to 32x32 followed by conversion to a tensor), and a `dictionary` mapping image IDs to labels.**

The **LoadCancerDataset** class reads all the images in the folder and **extracts** the image name from the filename, which is also the `ID` for the images. This image name is then matched with the label in the dictionary with the labels and IDs.

The **LoadCancerDataset** class returns the images and their labels, which can then be used in the DataLoader module of the `torch.utils.data` library as it can now read the images with their corresponding label.
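The id-to-label lookup at the heart of the class can be sketched without PyTorch or image files. `LabelLookup` below is a hypothetical stand-in that implements only `__len__` and `__getitem__`, the two methods a map-style dataset provides, and returns the file name in place of the image tensor.

```python
import os

class LabelLookup:
    """Hypothetical stand-in: pairs each file name with its label via the image id."""

    def __init__(self, file_names, dict_labels):
        self.file_names = list(file_names)
        self.dict_labels = dict_labels

    def __len__(self):
        return len(self.file_names)

    def __getitem__(self, idx):
        file_name = self.file_names[idx]
        image_id = os.path.splitext(file_name)[0]  # "abc.tif" -> "abc"
        return file_name, self.dict_labels[image_id]

# Made-up ids and labels for illustration
labels = {"abc": 0, "def": 1}
lookup = LabelLookup(["abc.tif", "def.tif"], labels)
print(len(lookup), lookup[1])
```

Note that the dictionary key must be a string, so the extension has to be stripped down to the bare id before the lookup.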

## Augmenting the dataset
Now that we have loaded the data, we will start the process of data preprocessing by augmenting the images, as follows:

In [17]:
data_T_train = T.Compose([ T.CenterCrop(32), T.RandomHorizontalFlip(), T.RandomVerticalFlip(), T.ToTensor(), ])

data_T_test = T.Compose([ T.CenterCrop(32), T.ToTensor(), ])

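In the preceding code block, we used **Torchvision's** transforms to crop each image to its center 32x32 region. For the training set, we also **augment the data** with random horizontal and vertical flips. These flips are applied on the fly each time an image is loaded (each with probability 0.5 by default), so **the dataset size is unchanged, but the model can see different orientations of an image across epochs.**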
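To illustrate how a random flip behaves, here is a sketch on a tiny nested-list "image". Each call returns a single image that is either mirrored or left as is, so nothing is added to the dataset. `random_hflip` is a hypothetical helper that only mimics the idea behind `T.RandomHorizontalFlip`; it is not torchvision's implementation.

```python
import random

def random_hflip(image, p=0.5, rng=random):
    """Return the image mirrored left-to-right with probability p, else unchanged."""
    if rng.random() < p:
        return [row[::-1] for row in image]
    return image

image = [[1, 2, 3],
         [4, 5, 6]]

rng = random.Random(0)
# Each load yields one version of the image; no extra copies are stored
versions = [random_hflip(image, rng=rng) for _ in range(4)]
for v in versions:
    print(v)
```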

Now, we will call our **LoadCancerDataset** custom class with the path to the data folder, transformer, and the image label dictionary to convert it to the format accepted by the `torch.utils.data.DataLoader` module.

> The **test_dir** does not have any labels, because it holds the data that you need to make predictions on and submit to the Kaggle competition. The test labels are not provided, because Kaggle uses them to evaluate your model and rank your submission. Therefore, you cannot build the label dictionary for the test_dir from a CSV file as we did for the train_dir. However, **you can still create a LoadCancerDataset object for the test_dir by passing a dictionary that maps every test image id to a placeholder label** (here `dict_labels_test`, which uses `"NA"`). An empty dictionary would not work, because the class looks up every image id and would raise a `KeyError`. The `__getitem__` method then returns the placeholder alongside each image, and you can ignore it when making predictions. You only need the image data from the test_dir to feed into your model and generate probabilities of cancer detection.

In [18]:
train_set = LoadCancerDataset(data_folder=train_dir, transform = data_T_train, dict_labels=dict_labels_train)
# test_set uses placeholder labels ("NA"): the true labels are hidden by Kaggle for leaderboard evaluation
test_set = LoadCancerDataset(data_folder=test_dir, transform = data_T_test, dict_labels=dict_labels_test) 

In the final step, we create the `train_dataloader` and `test_dataloader` data loaders from these two datasets using the `DataLoader` class from `torch.utils.data`. The code to do so is illustrated in the following snippet:


In [19]:
train_dataloader = DataLoader(train_set, batch_size=256, num_workers=2, pin_memory=True, shuffle=True)
test_dataloader = DataLoader(test_set, batch_size=256, num_workers=2, pin_memory=True, shuffle=False)

In [20]:
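print(f"Number of batches in the DataLoader: {len(train_dataloader)}")
# The last batch may be partial, so this product is only an upper bound on the dataset size
print(f"Batches x batch size (upper bound on dataset size): {len(train_dataloader) * 256}")

Number of batches in the DataLoader: 860
Batches x batch size (upper bound on dataset size): 220160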
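With the default `drop_last=False`, the number of batches is the dataset size divided by the batch size, rounded up, and the final batch holds whatever is left over. The arithmetic below assumes one training image per row of `train_labels.csv` (220,025 rows):

```python
import math

n_images = 220_025   # rows in train_labels.csv (assumed equal to the number of train images)
batch_size = 256

n_batches = math.ceil(n_images / batch_size)
last_batch = n_images - (n_batches - 1) * batch_size

print(n_batches)   # 860
print(last_batch)  # 121
```

This is why multiplying the number of batches by 256 gives 220,160 rather than the number of images.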


In [21]:
image, label = next(iter(train_dataloader))
print(image, label)

In the preceding code snippets, we have the following:
1. We started with the original data, which has no per-class subfolders. Since the downsampling cells are commented out, the full training folder in Kaggle's input storage is used.
2. Using the LoadCancerDataset custom class, we created two datasets, `train_set` and `test_set`, by reading the images and their labels (placeholder labels for the test set).
3. In the process of creating the datasets, we also used the Torchvision transforms module to crop each image to its center 32 x 32 px square and to convert the images to tensors.
4. In the final step, `train_set` and `test_set` are used to create the `train_dataloader` and `test_dataloader` data loaders.

At this point, `train_dataloader` serves about 220,000 training images in 860 batches, and `test_dataloader` serves the unlabeled test images (57,458 ids in the sample submission). All the images are cropped to 32 x 32, converted to tensor form, and served in batches of up to 256 images. We will use the train data loader to train our model and the test data loader to generate predictions for submission, since the test labels are hidden.