# Object-based Image Classification
*Author: Vladislav Kim*
* [Introduction](#intro)
* [Leukemia coculture screen](#coculture)
* [Generate initial segmentation](#initialsegm)
* [Image labelling](#label)
* [Training and test set generation](#trainset)
* [Random forest classifier](#randomforest)
* [Parameter tuning and feature selection](#featureselect)
* [Comparison with other classifiers](#comparison)


<a id="intro"></a> 
## Introduction
Segmentation with classical computer vision approaches such as watershed often produces results that have to be filtered by region properties to eliminate segmentation artefacts such as small objects and noise. If the image set is large (as in high-throughput screening), filtering with fixed thresholds may produce suboptimal results that vary considerably between images. To automate the filtering of artefacts we can resort to machine learning approaches.


There are a number of different schemes and machine learning models that can be used for this purpose. Here we will show how to train an object-based random forest classifier. The input for this classifier will be cropped bounding regions of the initial segmentation generated by simple connected component labelling. The task will be to classify the image patches into various cell types.

In [None]:
# load third-party Python modules
import javabridge
import bioformats as bf
import skimage
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sn
import pandas as pd

import sys
sys.path.append('..')

javabridge.start_vm(class_path=bf.JARS)

<a id="coculture"></a> 
## Leukemia coculture screen
Here we are dealing with coculture images containing 2 cell types, stroma and leukemia cells, which were not stained differentially. Primary leukemia cells are small and roughly circular. The stroma cells are large and may take on various shapes. Because of this minimal staining palette we need machine learning to automate the identification of leukemia cells.

In [None]:
from base.utils import load_imgstack
imgstack = load_imgstack(fname="data/AML_trainset/180528_Plate3/r02c14.tiff")

In [None]:
# remove a 'dummy' z-axis
img = np.squeeze(imgstack)

Here we will visualize 3 color channels individually:
* Hoechst stains nuclei
* Lysosomal dye marks lysosomal compartments
* Calcein stains only viable cells

In [None]:
from base.plot import plot_channels
gamma = 0.4
plot_channels([img[:,:,i]**gamma for i in range(3)],
              nrow=1, ncol=3, cmap='gray',
              titles=['Hoechst', 'Lysosomal dye', 'Viability'])

In [None]:
from base.plot import combine_channels

img_rgb = combine_channels([img[:,:,i] for i in range(3)],
                            colors=['blue', 'red', 'green'],
                            blend=[1.5,1.5,2],
                            gamma=[0.6, 0.6,0.6])

In [None]:
plt.figure(figsize=(10,10))
plt.imshow(img_rgb)
plt.axis('off')

<a id="initialsegm"></a> 
## Generate initial segmentation
We can generate an initial segmentation using simple connected component labelling in the nucleus (Hoechst) channel:

In [None]:
from transform.process import threshold_img
hoechst = img[:,:,0]**gamma
img_th = threshold_img(hoechst, method='otsu')

In [None]:
plt.figure(figsize=(10,10))
plt.imshow(img_th)
plt.axis('off')
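`threshold_img` is a helper from this repository, so its internals are not shown here. As a rough sketch of what Otsu thresholding does, assuming the helper wraps something similar to `skimage.filters.threshold_otsu`, the following uses a synthetic image; `otsu_mask` is a hypothetical name introduced only for illustration:

```python
import numpy as np
from skimage.filters import threshold_otsu

def otsu_mask(img):
    """Return a boolean foreground mask using Otsu's global threshold.

    Hypothetical stand-in for threshold_img(..., method='otsu');
    the real helper may behave differently.
    """
    t = threshold_otsu(img)
    return img > t

# Synthetic example: noisy dark background with one bright square
rng = np.random.default_rng(0)
demo = rng.normal(loc=0.1, scale=0.02, size=(64, 64))
demo[20:40, 20:40] += 0.8

mask = otsu_mask(demo)
```

Otsu's method picks the threshold that best separates the two intensity populations, so it works well when the histogram is clearly bimodal, as it is for stained nuclei on a dark background.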

Apply morphological erosion in order to shrink adjacent boundaries:

In [None]:
from skimage.morphology import binary_erosion, disk
img_th = binary_erosion(threshold_img(hoechst, binary=True, method='otsu'), disk(5))
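To see why erosion helps before labelling, here is a minimal sketch on synthetic data (not the screen images): two discs joined by a thin bridge form a single connected component, and erosion with a small disc removes the bridge so that they are labelled separately. The shapes and sizes are arbitrary assumptions chosen for the demonstration.

```python
import numpy as np
from skimage.morphology import binary_erosion, disk
from skimage.measure import label

# Two filled discs of radius 10 joined by a thin bridge
mask = np.zeros((40, 70), dtype=bool)
yy, xx = np.ogrid[:40, :70]
mask |= (yy - 20) ** 2 + (xx - 20) ** 2 <= 10 ** 2
mask |= (yy - 20) ** 2 + (xx - 50) ** 2 <= 10 ** 2
mask[19:22, 20:50] = True   # 3-pixel-wide bridge between the discs

# Before erosion the bridge merges both discs into one component
n_before = label(mask, connectivity=1).max()

# Erosion removes structures narrower than the footprint
eroded = binary_erosion(mask, disk(3))
n_after = label(eroded, connectivity=1).max()
```

The trade-off is that every object shrinks by roughly the footprint radius, so very small nuclei can disappear entirely if the footprint is too large.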

Use connected component labelling and visualize the result:

In [None]:
from skimage.measure import label
from skimage.color import label2rgb
segm = label(img_th, connectivity=1)

In [None]:
fig, ax = plt.subplots(figsize=(8, 8))
ax.imshow(label2rgb(segm, image=hoechst, bg_label=0))
ax.axis('off')

The strategy here is to subset the labelled regions based on their properties (area and perimeter). Since most leukemia nuclei are smaller than stroma nuclei, we can use a range of values: about 6 to 30 microns for the radius of leukemia nucleus candidates. Define a `dict` of lower and upper bounds for the features `area` and `perimeter`:

In [None]:
bounds = {'area': (500, 6000), 'perimeter': (100, 1000)}

The `filter_segm` function subsets the labelled regions based on the defined lower and upper `bounds`:

In [None]:
from segment.cv_methods import filter_segm
segm1 = filter_segm(img=hoechst, labels=segm, bounds=bounds)

Thus large (`area > 6000` or `perimeter > 1000`) and small (`area < 500` or `perimeter < 100`) regions were removed and only medium-size objects (presumably leukemia nuclei) were retained:

In [None]:
fig, ax = plt.subplots(figsize=(8, 8))
ax.imshow(label2rgb(segm1, image=hoechst, bg_label=0))
ax.axis('off')
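The implementation of `filter_segm` is not shown in this notebook. A minimal sketch of the idea, assuming a region is kept only when every listed property lies within its bounds, could look as follows; `filter_by_bounds` is a hypothetical name and the real helper may differ:

```python
import numpy as np
from skimage.measure import regionprops

def filter_by_bounds(labels, bounds, intensity_image=None):
    """Keep only labelled regions whose properties lie within `bounds`.

    Hypothetical stand-in for `filter_segm`. `bounds` maps a regionprops
    property name to a (lower, upper) tuple.
    """
    keep = []
    for region in regionprops(labels, intensity_image=intensity_image):
        if all(lo <= getattr(region, prop) <= hi
               for prop, (lo, hi) in bounds.items()):
            keep.append(region.label)
    # Zero out every region that failed at least one bound;
    # surviving regions keep their original label values
    return np.where(np.isin(labels, keep), labels, 0)

# Toy label image with three regions of different size
demo = np.zeros((30, 30), dtype=int)
demo[1:3, 1:3] = 1        # area 4
demo[5:15, 5:15] = 2      # area 100
demo[18:29, 1:29] = 3     # area 308

filtered = filter_by_bounds(demo, {'area': (50, 200)})
```

Keeping the original label values in the output makes it possible to look up the same regions in the unfiltered label image later on.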

We can see that some of the small nuclei are missing because they overlap with the larger stroma nuclei. To address this, we search for bright spots (apoptotic nuclei) within the mask of the bigger regions. The approach we take here is to "break up" large objects into smaller chunks, which can then be prefiltered further, for example by intensity.

In [None]:
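# element-wise maximum (rather than a sum) keeps the original label value
# for regions that exceed both the area and the perimeter bound
big = np.maximum(
    filter_segm(img=hoechst, labels=segm, bounds={'area': (6000, np.inf)}),
    filter_segm(img=hoechst, labels=segm, bounds={'perimeter': (1000, np.inf)}))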

In [None]:
fig, ax = plt.subplots(figsize=(8, 8))
ax.imshow(label2rgb(big, image=hoechst, bg_label=0))
ax.axis('off')

In [None]:
big_obj = hoechst*np.isin(segm, np.unique(big[big != 0]))

Apply the `white_tophat` filter, which preserves only bright spots with a radius $<25$ pixels:

In [None]:
from skimage.morphology import white_tophat
from skimage.morphology import remove_small_objects
img_tophat = white_tophat(big_obj, disk(25))
# bright spots from the large regions that were filtered out
# in the previous step
segm2 = remove_small_objects(threshold_img(img_tophat, method='yen', binary=True),
                                min_size=500)

In [None]:
plt.figure(figsize=(10,10))
plt.imshow(segm2)
plt.axis('off')
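The effect of the white top-hat (image minus its morphological opening) is easiest to see on a toy image. The sketch below uses made-up intensities and a smaller footprint than the one used for the Hoechst channel: a small bright spot sitting on a large dim object survives, while the flat interior of the large object is suppressed.

```python
import numpy as np
from skimage.morphology import white_tophat, disk

img = np.zeros((100, 100), dtype=float)
img[20:80, 20:80] = 0.5    # large, dim object (think: stroma nucleus)
img[48:53, 48:53] = 1.0    # small bright spot on top of it

# Opening removes everything too small to contain the footprint;
# subtracting it from the image leaves only those small structures
tophat = white_tophat(img, disk(10))

spot_value = tophat[50, 50]        # spot minus the dim object underneath
interior_value = tophat[30, 30]    # flat interior of the large object
```

Because the spot is measured relative to its local surroundings, a subsequent global threshold is less sensitive to how bright the underlying large object is.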

Now merge these bright spots from the large regions with the regions in `segm1`:

In [None]:
# only non-background pixels
segm1 = (segm1 != 0)

In [None]:
segm_out = label(np.logical_or(segm1, segm2))

In [None]:
plt.figure(figsize=(10,10))
plt.imshow(label2rgb(segm_out, image=hoechst, bg_label=0))
plt.axis('off')

We can use these merged labelled regions that were filtered by size as our initial segmentation and use machine learning to refine the segmentation.

In [None]:
from skimage.measure import regionprops
feats_out =  regionprops(label_image=segm_out, intensity_image=hoechst)

We can extract bounding box coordinates for each labelled region and visualize these bounding boxes in the RGB image (with padding `pad = 20`):

In [None]:
pad = 20
bbox = []

for f in feats_out:
    ymin, xmin, ymax, xmax = f.bbox
    # clip x against the number of columns and y against the number of rows
    bb = np.array((max(0, xmin - pad),
                  min(xmax + pad, hoechst.shape[1] - 1),
                  max(0, ymin - pad),
                  min(ymax + pad, hoechst.shape[0] - 1)))
    bbox.append(bb)

In [None]:
from base.plot import show_bbox

In [None]:
show_bbox(img_rgb, bbox)
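The padding and clipping logic can be isolated in a small function, which also makes it easy to check on a non-square image. This is only a sketch: `pad_bbox` is a hypothetical helper, and the output order `(xmin, xmax, ymin, ymax)` is assumed to match what the loop above builds.

```python
def pad_bbox(bbox, pad, shape):
    """Pad a regionprops bounding box and clip it to the image extent.

    bbox  : (ymin, xmin, ymax, xmax) as returned by regionprops
    shape : image shape, (n_rows, n_cols, ...)
    Returns (xmin, xmax, ymin, ymax).
    """
    ymin, xmin, ymax, xmax = bbox
    n_rows, n_cols = shape[:2]
    return (max(0, xmin - pad), min(xmax + pad, n_cols - 1),
            max(0, ymin - pad), min(ymax + pad, n_rows - 1))

# A box close to three borders of a 100-row by 200-column image
padded = pad_bbox((5, 10, 40, 190), pad=20, shape=(100, 200))
```

Here the x coordinates are clipped against the number of columns and the y coordinates against the number of rows, which only matters when the image is not square.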

We see that this initial segmentation captures mostly viable and apoptotic leukemia cells; however, there is an appreciable number of stroma cell fragments in the image. Another complication is that a couple of apoptotic leukemia nuclei overlap with stroma cells. If we naively labelled all Calcein-positive cells as viable, apoptotic leukemia cells overlapping with Calcein-stained stroma would be falsely counted as viable. This is why we will train a classifier to recognize such cases (both stroma cell fragments and dead leukemia cells overlapping with Calcein-positive stroma).


Now we need to store both bounding box coordinates as well as region properties for each image patch and we need an easy (and scalable) way of retrieving bounding box information for each image. First generate a table (`DataFrame`) with all the information:

In [None]:
from base.future_versions import regionprops_table

In [None]:
keys = [k for k in feats_out[0]]

In [None]:
exclude = ['convex_image', 'coords', 'extent',
           'filled_image', 'image']

In [None]:
selected_keys = list(set(keys) - set(exclude))
# sort by key lexicographically
selected_keys.sort()

In [None]:
feat_dict = regionprops_table(segm_out,
                              intensity_image=hoechst,
                              properties=selected_keys)

feat_df = pd.DataFrame(feat_dict)


In [None]:
feat_df.iloc[:6,:10]

In [None]:
feat_df.to_csv("data/AML_trainset/180528_Plate3/r02c14.csv", index=False)
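The notebook imports `regionprops_table` from a local `base.future_versions` module. Recent scikit-image releases (0.16 and later) ship a function of the same name in `skimage.measure`; the sketch below assumes such a version is installed and uses a toy mask to show how multi-valued properties such as `bbox` are expanded into separate columns.

```python
import numpy as np
import pandas as pd
from skimage.measure import label, regionprops_table

# Toy mask with two rectangular objects
mask = np.zeros((20, 20), dtype=bool)
mask[2:6, 2:6] = True      # 4 x 4 square
mask[10:18, 8:12] = True   # 8 x 4 rectangle
labels = label(mask)

# Each requested property becomes one or more columns
table = regionprops_table(labels,
                          properties=['label', 'area', 'bbox', 'centroid'])
df = pd.DataFrame(table)

# bbox is split into bbox-0 .. bbox-3 (ymin, xmin, ymax, xmax)
bbox_cols = [c for c in df.columns if c.startswith('bbox')]
```

Storing one row per object with the bounding box as plain numeric columns keeps the table easy to write to CSV and to join with labels assigned later.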

<a id="label"></a> 
## Image labelling


<a id="trainset"></a> 
## Training and test set generation


<a id="randomforest"></a> 
## Random forest classifier

<a id="featureselect"></a> 
## Parameter tuning and feature selection

<a id="comparison"></a> 
## Comparison with other classifiers