# HOWTO: upload a module and perform simple image classification

In this notebook, I explain a few things that took me some time to figure out, using the HPA Single Cell Classification competition as an example.



# Default folders and how to use them

The basic directory structure of Kaggle's notebook environment is split into `/kaggle/input/`, where the files uploaded by the user appear, and `/kaggle/output/`, which receives the files produced by a notebook run.

The notebook itself runs from the `/kaggle/working/` folder, so to use the module we have to put it in the working folder. This folder doesn't let users upload files directly, so we need to copy the files from within the code.

Let's say we want to use our own module in the notebook. First, we upload the module as a dataset, which makes it available under `/kaggle/input/`. Then we explicitly copy the module into the working directory so that it can be imported into the notebook. Later you can update the module and click **More Actions -> Check for Updates** to refresh it and start using the updated version.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from shutil import copyfile, copytree
import matplotlib.pyplot as plt
import cv2 as cv

# We load explicitly md so we can print markdown from code
from IPython.display import Markdown as md
import os

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

# import os
# for dirname, _, filenames in os.walk('/kaggle/input'):
#     for filename in filenames:
#         print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

We use `copyfile` to copy the module.

In [None]:
# Copy our module into the working directory so that it can be imported
copyfile(src="/kaggle/input/hpamodule1/module_1.py", dst="/kaggle/working/module_1.py")

# Reuse the pretrained segmentation models saved by a previous run, if they were added as input.
# Each file is checked separately so that a missing file does not break the copy.
prev_run = "../input/howto-upload-module-and-simple-classification"
for fname in ("nuclei-model.pth", "cell-model.pth"):
    if os.path.exists(f"{prev_run}/{fname}"):
        copyfile(src=f"{prev_run}/{fname}", dst=f"./{fname}")

# copytree(src = "/kaggle/input/hpa-cell-segmentation/hpacellseg", dst = "/kaggle/working/hpacellseg")
import module_1 as hpm
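
The copy-then-import step above can be wrapped in a small helper. The following is a minimal sketch rather than part of the original module: `copy_and_import`, `demo_module`, and the temporary folders (standing in for `/kaggle/input/` and `/kaggle/working/`) are hypothetical names chosen for illustration.

```python
import importlib
import shutil
import sys
import tempfile
from pathlib import Path


def copy_and_import(src_file, work_dir, module_name):
    """Copy a .py file into work_dir and import it under module_name."""
    work_dir = Path(work_dir)
    work_dir.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(src_file, work_dir / f"{module_name}.py")
    # In a Kaggle notebook the working folder is already importable;
    # here we add the folder to the import path explicitly.
    if str(work_dir) not in sys.path:
        sys.path.insert(0, str(work_dir))
    importlib.invalidate_caches()
    return importlib.import_module(module_name)


# Demo with temporary folders instead of the real Kaggle folders
with tempfile.TemporaryDirectory() as tmp:
    input_dir = Path(tmp) / "input"
    input_dir.mkdir()
    (input_dir / "demo_module.py").write_text("def greet():\n    return 'hello'\n")
    mod = copy_and_import(input_dir / "demo_module.py", Path(tmp) / "working", "demo_module_copy")
    greeting = mod.greet()

print(greeting)
```

In the notebook itself the explicit `sys.path` handling is not needed, because the copy lands in the folder the notebook runs from.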

Let's see how the module works by loading a row from the submission file and using the `test` folder as the image source. You can also use the training set as the source by setting `img_folder = 'train'`; in that case the csv file that is used is train.csv.

In [None]:
# Set data folders and files
hpa_data = hpm.HPA(data_folder='/kaggle/input/hpa-single-cell-image-classification', img_folder='test')
print(hpa_data.sample_sub_pd.loc[5, :])
hpa_data.get_rgby_images(hpa_data.sample_sub_pd.loc[5, 'ID'])

# hpa_data = hpm.HPA(data_folder='/kaggle/input/hpa-single-cell-image-classification', img_folder='train')
# print(hpa_data.train_csv_pd.loc[10, :])
# hpa_data.get_rgby_images(hpa_data.train_csv_pd.loc[10, 'ID'])

We can show the image, which is loaded as a tuple with one entry per channel in the `img` member variable.

In [None]:
fig = plt.figure(figsize=(23,23))
ax1 = fig.add_subplot(141)
ax1.imshow(hpa_data.img[0])
ax1.title.set_text("Original red channel")
ax2 = fig.add_subplot(142)
ax2.imshow(hpa_data.img[1])
ax2.title.set_text("Original green channel")
ax3 = fig.add_subplot(143)
ax3.imshow(hpa_data.img[2])
ax3.title.set_text("Original blue channel")
ax4 = fig.add_subplot(144)
ax4.imshow(hpa_data.img[3])
ax4.title.set_text("Original yellow channel")


In the end, everything written to the working folder, including the copied module, is transferred to the output folder.

# Download only specific files on your PC

This competition has a large dataset, around 160GB. If you only need some of the files, you can install the Kaggle API and download just those files to your local machine. The Kaggle API is hosted on GitHub (https://github.com/Kaggle/kaggle-api), where you can also find instructions on using it. After installing it, create a new API token (see https://www.kaggle.com/docs/api for details). Then you can use the following function to download a specific file to your PC.

In [None]:
'''
from kaggle.api.kaggle_api_extended import KaggleApi
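import os

# Each image is stored as one PNG file per channel
ch_names = ['red', 'green', 'blue', 'yellow']

def download_file(api, fname):
    # Skip the download if the red channel is already on disk
    if os.path.exists(f"./data/train/{fname}_red.png"):
        return

    for ch in ch_names: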
        api.competition_download_file(
            'hpa-single-cell-image-classification', 
            f'train/{fname}_{ch}.png', 
            path='./data/train'
        )
        
api = KaggleApi()
api.authenticate()
download_file(api, '5c27f04c-bb99-11e8-b2b9-ac1f6b6435d0')
'''

# Dilation

Our first processing step is to dilate the loaded images. We use the module's `dilate` function, which applies the following steps to every channel: **Gaussian blur -> Otsu's threshold -> dilation with 5x5 kernel**.

In [None]:
hpa_data.dilate()

In [None]:
fig = plt.figure(figsize=(25,12))
ax1 = fig.add_subplot(241)
ax1.imshow(hpa_data.img[0])
ax1.title.set_text("Original red channel")
ax2 = fig.add_subplot(245, sharex=ax1, sharey=ax1)
ax2.imshow(hpa_data.img_dilate[0])
ax2.title.set_text("Dilate red channel")
ax3 = fig.add_subplot(242)
ax3.imshow(hpa_data.img[1])
ax3.title.set_text("Original green channel")
ax4 = fig.add_subplot(246, sharex=ax1, sharey=ax1)
ax4.imshow(hpa_data.img_dilate[1])
ax4.title.set_text("Dilate green channel")
ax5 = fig.add_subplot(243)
ax5.imshow(hpa_data.img[2])
ax5.title.set_text("Original blue channel")
ax6 = fig.add_subplot(247, sharex=ax1, sharey=ax1)
ax6.imshow(hpa_data.img_dilate[2])
ax6.title.set_text("Dilate blue channel")
ax7 = fig.add_subplot(244)
ax7.imshow(hpa_data.img[3])
ax7.title.set_text("Original yellow channel")
ax8 = fig.add_subplot(248, sharex=ax1, sharey=ax1)
ax8.imshow(hpa_data.img_dilate[3])
ax8.title.set_text("Dilate yellow channel")

# Connected components and the blue channel

Next we run connected components (cc) analysis on the blue channel, since this channel shows the cell nuclei. The nucleus is typically near the center of the cell, so we can easily spot individual cells by calling `connectedComponentsWithStats` from OpenCV (imported here as `cv`). We run the cc analysis on the dilated images from the previous step.

In [None]:
hpa_data.connected_components()

We can show the detected regions by drawing rectangles and centroids from the data in the `cc_*` variables (`cc_numLabels`, `cc_labels`, `cc_stats`, `cc_centroids`), which are generated by the call to `connected_components`. We start counting from 1, since label 0 is reserved for the background, which is also reported as a connected region.

In [None]:
import matplotlib.patches as pch

fig = plt.figure(figsize=(25,12))
ax1 = fig.add_subplot(111)
ax1.imshow(hpa_data.cc_labels)
for i in range(1, hpa_data.cc_numLabels):
    left = hpa_data.cc_stats[i, cv.CC_STAT_LEFT]
    top = hpa_data.cc_stats[i, cv.CC_STAT_TOP]
    width = hpa_data.cc_stats[i, cv.CC_STAT_WIDTH]
    height = hpa_data.cc_stats[i, cv.CC_STAT_HEIGHT]
    rect = pch.Rectangle((left, top), 
        width, height, linewidth=1, edgecolor='r', facecolor='none')
    circ = pch.Circle((hpa_data.cc_centroids[i]), 
        5, linewidth=1, edgecolor='none', facecolor='r')
    ax1.add_patch(rect)
    ax1.add_patch(circ)

# Segmentation using CellProfiling/HPA-Cell-Segmentation code

After trying a couple of segmentation methods without much success, I adopted the popular, widely used cell segmentation algorithm available on GitHub at https://github.com/CellProfiling/HPA-Cell-Segmentation/. Before using it, we need to install the package, so we run pip install on the uploaded source folders.

In [None]:
#!pip install https://github.com/CellProfiling/HPA-Cell-Segmentation/archive/master.zip
!pip install ../input/pytorch-zoo/pytorch_zoo-master
!pip install ../input/hpacellsegmentation/HPA-Cell-Segmentation-master

On the first run, the module downloads two files with pretrained models, one for nuclei and one for cells. If this is not the first run and you have added the output files of a previous run as input, the following code loads those files instead.

In [None]:
import hpacellseg.cellsegmentator as cellsegmentator
from PIL import Image
from hpacellseg.utils import label_cell, label_nuclei

NUC_MODEL = "./nuclei-model.pth"
CELL_MODEL = "./cell-model.pth"
segmentator = cellsegmentator.CellSegmentator(
    NUC_MODEL,
    CELL_MODEL,
    scale_factor=0.25,
    device="cuda",
    padding=True,
    multi_channel_model=True,
)

We detect nuclei and cells.

In [None]:
# Nuclei segmentation

nuc_segments = segmentator.pred_nuclei([hpa_data.b_img])

nuc_mask = label_nuclei(nuc_segments[0])

fig = plt.figure(figsize=(24, 24))
ax1 = fig.add_subplot(131)
ax1.imshow(hpa_data.b_img)
ax1.set_title('Original nuclei')
ax2 = fig.add_subplot(132)
ax2.imshow(nuc_segments[0])
ax2.set_title('HPA segmented nuclei')
ax3 = fig.add_subplot(133)
ax3.imshow(nuc_mask)
ax3.set_title('HPA labeled nuclei')
plt.show()

# Cell segmentation

cell_segments = segmentator.pred_cells([[hpa_data.r_img], [hpa_data.y_img], [hpa_data.b_img]])
nuc_mask, cell_mask = label_cell(nuc_segments[0], cell_segments[0])

rgb_img = np.stack([
    cv.convertScaleAbs(hpa_data.r_img, alpha=(255.0/65535.0)), 
    cv.convertScaleAbs(hpa_data.g_img, alpha=(255.0/65535.0)), 
    cv.convertScaleAbs(hpa_data.b_img, alpha=(255.0/65535.0))], 2)

fig = plt.figure(figsize=(24, 24))
ax1 = fig.add_subplot(131)
ax1.imshow(rgb_img)
ax1.set_title('Original cell rgb')
ax2 = fig.add_subplot(132)
ax2.imshow(cell_mask)
ax2.set_title('HPA segmented cells')
ax3 = fig.add_subplot(133)
ax3.imshow(rgb_img)
ax3.imshow(cell_mask, alpha=0.3)
ax3.set_title('HPA cell rgb + cell label')
plt.show()
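
The factor `alpha=255.0/65535.0` used above maps the 16-bit value range onto the 8-bit range. As a rough NumPy illustration of that arithmetic (the helper `to_uint8` is a hypothetical name, and this assumes the channels are 16-bit images, as the scaling factor suggests):

```python
import numpy as np


def to_uint8(img16):
    """Scale 16-bit intensities to 8 bits: |x * 255 / 65535|, rounded and clipped."""
    scaled = np.abs(img16.astype(np.float64) * (255.0 / 65535.0))
    return np.clip(np.rint(scaled), 0, 255).astype(np.uint8)


sample = np.array([[0, 257, 32768, 65535]], dtype=np.uint16)
converted = to_uint8(sample)
print(converted)
```

Since 65535 / 255 = 257, every step of 257 in the 16-bit image corresponds to one step in the 8-bit image.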

# Submit segmented cell masks with label 0

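We can already create a submission file by assigning class label 0 to every segmented cell. Each mask is written as label `0` with confidence `1`, followed by the encoded mask.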

In [None]:
# !pip install pycocotools
!pip install ../input/pycocotools/dist/pycocotools-2.0.2.tar
    
import base64
from pycocotools import _mask as coco_mask
import typing as t
import zlib

with open('submission.csv', 'w') as sub_file:
    sub_file.write("ID,ImageWidth,ImageHeight,PredictionString\n")

    for cnt, image_id in enumerate(hpa_data.sample_sub_pd.loc[:, 'ID']):
        print(f"{cnt}: {image_id}")
        hpa_data.get_rgby_images(image_id)
        nuc_segments = segmentator.pred_nuclei([hpa_data.b_img])
        cell_segments = segmentator.pred_cells([[hpa_data.r_img], [hpa_data.y_img], [hpa_data.b_img]])
        nuc_mask, cell_mask = label_cell(nuc_segments[0], cell_segments[0])
        # NumPy shapes are (rows, columns), i.e. (height, width)
        height, width = hpa_data.b_img.shape[:2]
        predictions = []
        for ci in range(1, cell_mask.max() + 1):
            # Binary mask of one cell, shaped (H, W, 1) in Fortran order for the COCO encoder
            mask = np.squeeze(cell_mask == ci)
            mask_to_encode = mask.reshape(mask.shape[0], mask.shape[1], 1)
            mask_to_encode = mask_to_encode.astype(np.uint8)
            mask_to_encode = np.asfortranarray(mask_to_encode)
            encoded_mask = coco_mask.encode(mask_to_encode)[0]["counts"]
            binary_str = zlib.compress(encoded_mask, zlib.Z_BEST_COMPRESSION)
            base64_str = base64.b64encode(binary_str)
            # Class label 0 with confidence 1 for every cell
            predictions.append(f"0 1 {base64_str.decode()}")
        # Always end the row with a newline, even when no cells were found
        prediction_string = " ".join(predictions)
        sub_file.write(f"{image_id},{width},{height},{prediction_string}\n")
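
The mask strings written above are zlib-compressed, base64-encoded RLE bytes. A quick way to sanity-check that packing step without pycocotools is a round trip using only the standard library. This is a sketch: `pack_rle`, `unpack_rle`, `prediction_string`, and the placeholder bytes are hypothetical stand-ins for real `coco_mask.encode` output.

```python
import base64
import zlib


def pack_rle(rle_counts: bytes) -> str:
    """Compress RLE bytes with zlib and encode them as base64 text."""
    compressed = zlib.compress(rle_counts, zlib.Z_BEST_COMPRESSION)
    return base64.b64encode(compressed).decode()


def unpack_rle(encoded: str) -> bytes:
    """Reverse pack_rle: base64-decode, then decompress."""
    return zlib.decompress(base64.b64decode(encoded))


def prediction_string(encoded_masks, label=0, confidence=1):
    """Join masks as 'label confidence mask' triplets separated by spaces."""
    return " ".join(f"{label} {confidence} {m}" for m in encoded_masks)


fake_counts = b"placeholder-rle-bytes"  # stands in for a real RLE 'counts' value
packed = pack_rle(fake_counts)
row = prediction_string([packed, packed])
print(row)
```

Base64 output never contains spaces, so the triplets in a row can be split on whitespace again when reading a submission back.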