<img src="https://cellero.com/wp-content/uploads/2019/12/human-protein-atlas-logo-1.png" width="600" height="400">


## <center>Human Protein Atlas - Single Cell Classification</center>
### <center>🔬Find individual human cell differences in microscope images🔬</center>

# Table of contents <a id='0.1'></a>

1. [Version Notes](#0)
2. [Introduction](#1)
    * [Evaluation Metric: Mean Average Precision (mAP)](#1.1)
3. [Import Packages](#2)
4. [Utility](#3)
5. [Data Overview](#4)
6. [Exploratory Data Analysis](#5)
    * 6.1 [Preprocessing Labels](#5.1)
    * 6.2 [Exploring Image Data](#5.2)
    * 6.3 [Label Wise Cell Segmentation](#5.3)
    * 6.4 [Label Wise Single Cell Segmentation](#5.4)
7. [Useful Resources](#6)
8. [Reference](#7)

# 1. <a id='0'>Version Notes📃</a>
[Table of contents](#0.1)

* Version 12: Updated EDA for bar graph explanation.
* Version 19: Added visualization for single cell mask.
* Version 20: Visualize images for single label/organelle.
* Version 21: Updated submission.csv table representation.
* Version 23: Added Mean Average Precision (mAP) explanation video in Introduction section.
* Version 25: Added **Useful Resources** section.

# 2. <a id='1'>Introduction📒</a>
[Table of contents](#0.1)

Welcome to the **Human Protein Atlas - Single Cell Classification** competition hosted by [Human Protein Atlas](https://www.proteinatlas.org/). This competition aims to solve the single-cell image classification challenge, which will help us characterize single-cell heterogeneity in the large collection of images by generating more accurate annotations of the subcellular localizations for thousands of human proteins in individual cells. Please go through this section thoroughly to develop an understanding of the competition goal and data.

## What is Human Protein Atlas?

The Human Protein Atlas is an initiative based in Sweden that aims to map proteins in all human cells, tissues, and organs. The data in the Human Protein Atlas [database](https://www.proteinatlas.org/) is freely accessible to scientists all around the world, allowing them to explore the cellular makeup of the human body.

In [None]:
from IPython.display import HTML

HTML('<center><iframe width="950" height="450" src="https://www.youtube.com/embed/P4gz6DrZOOI" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe></center>')

## Competition Goal (Brief Introduction)
[Table of contents](#0.1)
* This is a segmentation and classification problem. We are provided with microscope images of cells and, for each image, labels of protein location that are assigned collectively to all cells in the image. This means every training image can carry multiple labels. There are **19 labels/classes** in total.

* The training image-level labels are provided for each sample in train.csv.

* For each sample we have 4 image files. Each file represents a different filter on the subcellular protein patterns represented by the sample. Colors are red for microtubule channels, blue for nuclei channels, yellow for Endoplasmic Reticulum (ER) channels, and green for protein of interest.

* We have to develop models capable of segmenting and classifying each individual cell with precise labels. This means we are predicting protein organelle localization labels for each cell in the image. First we need to segment each cell in the image and then assign it the appropriate label(s), so this is a segmentation plus classification problem. Not all cells in an image necessarily contain the protein, so we should only assign labels to the cells in which the protein is present. The labels only apply to the cells where green is present. **This is a weakly supervised multi-label classification problem**. Please visit [this](https://www.kaggle.com/c/hpa-single-cell-image-classification/discussion/215736) awesome discussion to know more.

## What are weak image-level labels?

The labels you will get for training are image level labels while the task is to predict cell level labels. That is to say, each training image contains a number of cells that have collectively been labeled. The prediction task is to look at images of the same type and predict the labels of each individual cell within those images.

As the training labels are a collective label for all the cells in an image, it means that each labeled pattern can be seen in the image but not necessarily that each cell within the image expresses the pattern. This imprecise labeling is what we refer to as weak.

During the challenge you will both need to segment the cells in the images and predict the labels of those segmented cells.
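
To make the idea of weak labels concrete, here is a minimal sketch of the simplest possible baseline: copy the image-level labels onto every segmented cell. The function name `propagate_image_labels` and the cell count are hypothetical; only the `|`-separated label format comes from train.csv.

```python
def propagate_image_labels(image_label, n_cells):
    """Assign every image-level label to every cell (naive weak-label baseline).

    image_label -- label string as stored in train.csv, e.g. "0|16"
    n_cells     -- number of cells found by a segmentation step (assumed given)
    """
    labels = [int(part) for part in image_label.split("|")]
    # Each cell gets its own copy of the full image-level label list.
    return [list(labels) for _ in range(n_cells)]


cell_labels = propagate_image_labels("0|16", n_cells=3)
print(cell_labels)  # [[0, 16], [0, 16], [0, 16]]
```

Such a baseline is wrong whenever a cell does not express every pattern of its image, which is exactly the noise that weak labels introduce.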

## About Competition Data
We are provided with the following files:

   * train.csv - filenames and image level labels for the training set. For each filename there are 4 files (images) in the train directory.
   
   * sample_submission.csv - the test set filenames and a guide for constructing a working submission.
   
   * train - the training image directory. Each training sample has four files, and each file represents a different filter on the subcellular protein patterns represented by the sample. The PNG files are named **"filename_filtercolor.png"**. Colors are red for microtubule channels, blue for nuclei channels, yellow for Endoplasmic Reticulum (ER) channels, and green for protein of interest. "**The green filter should hence be used to predict the label, and the other filters are used as references**". See the example below.
   
   <div class="alert alert-block alert-info">
       Example - 

       * 000a6c98-bb9b-11e8-b2b9-ac1f6b6435d0_blue.png
       * 000a6c98-bb9b-11e8-b2b9-ac1f6b6435d0_green.png
       * 000a6c98-bb9b-11e8-b2b9-ac1f6b6435d0_red.png
       * 000a6c98-bb9b-11e8-b2b9-ac1f6b6435d0_yellow.png
   </div>

   * test - test image directory. "**Our task is to segment and label the images in this folder**". 
    
   * The labels are represented as integers that map to the following:    
    0. Nucleoplasm
    1. Nuclear membrane
    2. Nucleoli
    3. Nucleoli fibrillar center
    4. Nuclear speckles
    5. Nuclear bodies
    6. Endoplasmic reticulum
    7. Golgi apparatus
    8. Intermediate filaments
    9. Actin filaments 
    10. Microtubules
    11. Mitotic spindle
    12. Centrosome
    13. Plasma membrane
    14. Mitochondria
    15. Aggresome
    16. Cytosol
    17. Vesicles and punctate cytosolic patterns
    18. Negative
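
The file naming scheme and the integer-to-name mapping above can be written down as a small lookup sketch. The helper name `channel_paths` and the `"train"` directory argument are illustrative assumptions; the label names and the `filename_filtercolor.png` pattern are taken from the description above.

```python
import os

# Integer label -> organelle name, as listed in the competition data description.
LABEL_NAMES = {
    0: "Nucleoplasm",
    1: "Nuclear membrane",
    2: "Nucleoli",
    3: "Nucleoli fibrillar center",
    4: "Nuclear speckles",
    5: "Nuclear bodies",
    6: "Endoplasmic reticulum",
    7: "Golgi apparatus",
    8: "Intermediate filaments",
    9: "Actin filaments",
    10: "Microtubules",
    11: "Mitotic spindle",
    12: "Centrosome",
    13: "Plasma membrane",
    14: "Mitochondria",
    15: "Aggresome",
    16: "Cytosol",
    17: "Vesicles and punctate cytosolic patterns",
    18: "Negative",
}

CHANNELS = ("red", "green", "blue", "yellow")


def channel_paths(image_dir, sample_id):
    """Return {channel: path} for the four PNG files of one sample."""
    return {c: os.path.join(image_dir, f"{sample_id}_{c}.png") for c in CHANNELS}


paths = channel_paths("train", "000a6c98-bb9b-11e8-b2b9-ac1f6b6435d0")
print(paths["green"])

# Decode a train.csv style label string into organelle names.
names = [LABEL_NAMES[int(i)] for i in "0|16".split("|")]
print(names)  # ['Nucleoplasm', 'Cytosol']
```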

## Label Description
[Table of contents](#0.1)   
   * Nucleoplasm - Nucleoplasm is a type of protoplasm that is composed of thick fluid and constitutes chromatin fibres made up of DNA and usually found in the nucleus of the cells. This fluid contains primarily water, dissolved ions, and a complex mixture of molecules.
   
   * Nuclear membrane - The nuclear membrane appears as a thin circle around the nucleus. It is not perfectly smooth and sometimes it is also possible to see the folds of the membrane as small circles or dots inside the nucleus.
   
   * Nucleoli - The nucleoli are non-membrane enclosed, highly conserved, sub-organelles within the nucleus.
   
   * Nucleoli fibrillar center - Nucleoli fibrillary center can appear as a spotty cluster or as a single bigger spot in the nucleolus, depending on the cell type.
   
   * Nuclear speckles - Nuclear speckles are self-organizing non-membrane bound sub-compartments found in the interchromatin regions of the nucleoplasm.
   
   * Nuclear bodies - Nuclear bodies is a collective term for a number of non-membrane bound nuclear sub-compartments. They vary in shape, size and numbers depending on the type of bodies as well as cell type, but are usually more rounded compared to nuclear speckles.
   
   * Endoplasmic reticulum - The endoplasmic reticulum (ER) is a delicate membranous network composed of sheets and tubules that spread throughout the cytoplasm and are contiguous with the nuclear membrane.
   
   * Golgi apparatus - The Golgi apparatus is a central hub in the endomembrane system of human cells, placed at the intersection of the endosomal-, secretory- and lysosomal pathways. It consists of several stacks of flattened cisternae and tubular connections, forming a ribbon-like structure that is highly dynamic. 
   
   * Intermediate filaments - Intermediate filaments often exhibit a slightly tangled structure with strands crossing every so often. They can appear similar to microtubules, but do not match well with the staining in the red microtubule channel. Intermediate filaments may extend through the whole cytosol, or be concentrated in an area close to the nucleus.
   
   * Actin filaments - Actin filaments can be seen as long and rather straight bundles of filaments or as branched networks of thinner filaments. They are usually located close to the edges of the cells.
   
   * Microtubules - Microtubules are one of three principal components of the cytoskeleton. Microtubules are seen as thin strands that stretch throughout the whole cell. It is almost always possible to detect the center from which they all originate (the centrosome). And yes, as you might have guessed, this overlaps the staining in the red channel.
   
   * Mitotic spindle - The mitotic spindle can be seen as an intricate structure of microtubules radiating from each of the centrosomes at opposite ends of a dividing cell (mitosis). At this stage, the chromatin of the cell is condensed, as visible by intense DAPI staining. The size and exact shape of the mitotic spindle changes during mitotic progression, clearly reflecting the different stages of mitosis.
   
   * Centrosome - This class includes centrosomes and centriolar satellites. They can be seen as a more or less distinct staining of a small area at the origin of the microtubules, close to the nucleus. When a cell is dividing, the two centrosomes move to opposite ends of the cell and form the poles of the mitotic spindle.
   
   * Plasma membrane - The plasma membrane, or cell membrane, consists of a lipid bilayer which separates the interior of the cell from the exterior. The membrane is composed of phospholipids, cholesterol, glycolipids, and a large fraction of membrane proteins, organized together in different domains. 
   
   * Mitochondria - Mitochondria generate the energy that is needed to power the functions of the cell, but also participate directly in several other cellular processes, including apoptosis, cell cycle control and calcium homeostasis. Mitochondria are small organelles distributed in varying numbers and patterns in the cytosol of most human cells. Mitochondria are enclosed by a double membrane, with the inner membrane folded into characteristic cristae. 
   
   * Aggresome - Aggresomes are structures that form in response to accumulation of misfolded proteins in the cytosol. Aggresome formation is a regulated process that occurs in response to overload of the protein folding- and degradation systems, due to cellular stress or disease. 
   
   * Cytosol - The cytosol is a semi-fluid matrix that fills the space between the plasma membrane and the nuclear membrane and embeds various organelles and cellular substructures.
   
   * Vesicles and punctate cytosolic patterns - Vesicles is a collective term for cytoplasmic organelles that are often too small to have distinct features when imaged by light microscopy. The majority of the vesicles are membrane-bound organelles; however, large protein complexes and cytosolic bodies can also fall under this category, as they are difficult to distinguish. Examples of organelles with a vesicle annotation are the members of the endolysosomal pathway, transport vesicles, peroxisomes, and lipid droplets. The following substructures fall in this class:
       * Vesicles
       * Peroxisomes
       * Endosomes
       * Lysosomes
       * Lipid droplets
     
   * Negative - This class includes negative stainings and unspecific patterns. This means that the cells have no green staining (negative), or have staining but no pattern can be deciphered from the staining (unspecific).
    
## What are we predicting?
[Table of contents](#0.1)
    
Let us first understand what **protein targeting** is: "Protein targeting or protein sorting is the biological mechanism by which proteins are transported to their appropriate destinations within or outside the cell."
    
We are predicting the location (organelle) where the protein is targeted in a single cell. There are **19 labels/organelles present in the dataset (18 labels for specific locations, and 1 label for negative and unspecific signal)**.    
    
For each image we need to segment every cell present in it, identify the label/organelle of each cell (out of 19), and submit a string that lists, for each cell, the predicted label, its detection score (confidence), and the encoded segmentation mask. The sample submission looks something like this:

<table style="width:70%; height:200px">
  <tr>
    <th style="text-align:left">ImageID</th>
    <th style="text-align:left">ImageWidth</th>
    <th style="text-align:left">ImageHeight</th>
    <th style="text-align:left">PredictionString</th>
  </tr>
  <tr>
    <td style="text-align:left">ImageAID</td>
    <td style="text-align:left">ImageAWidth</td>
    <td style="text-align:left">ImageAHeight</td>
    <td style="text-align:left">LabelA1 ConfidenceA1 EncodedMaskA1 LabelA2 ConfidenceA2 EncodedMaskA2 ...</td>
  </tr>
  <tr>
    <td style="text-align:left">ImageBID</td>
    <td style="text-align:left">ImageBWidth</td>
    <td style="text-align:left">ImageBHeight</td>
    <td style="text-align:left">LabelB1 ConfidenceB1 EncodedMaskB1 LabelB2 ConfidenceB2 EncodedMaskB2 ...</td>
  </tr>
</table>

### What exactly is **PredictionString** column in submission.csv? 

The PredictionString column contains, for every cell present in the image, the label, the confidence score for that classified label/organelle, and the mask information encoded in RLE encoding. Some of the cells in the image may not have the protein of interest. 

For more information regarding evaluation metric check [here](https://www.kaggle.com/c/open-images-2019-instance-segmentation/overview/evaluation).
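
As a rough illustration of the layout described above, the sketch below joins per-cell predictions into a single PredictionString. The function name `build_prediction_string` and the four-decimal confidence formatting are assumptions, and the mask strings are placeholders: producing a real encoded mask is outside the scope of this sketch.

```python
def build_prediction_string(cells):
    """Join per-cell predictions into one PredictionString.

    cells -- iterable of (label, confidence, encoded_mask) tuples, where
             encoded_mask is an already-encoded mask string (assumed given).
    """
    parts = []
    for label, confidence, encoded_mask in cells:
        # One "Label Confidence EncodedMask" triple per predicted cell label.
        parts.append(f"{label} {confidence:.4f} {encoded_mask}")
    return " ".join(parts)


# "MASK_A" and "MASK_B" are placeholders, not real encoded masks.
cells = [(0, 0.91, "MASK_A"), (16, 0.4, "MASK_B")]
prediction_string = build_prediction_string(cells)
print(prediction_string)  # 0 0.9100 MASK_A 16 0.4000 MASK_B
```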

## Evaluation Metric: Mean Average Precision (mAP) <a id='1.1'></a>
[Table of contents](#0.1)

Please check this discussion [here](https://www.kaggle.com/c/hpa-single-cell-image-classification/discussion/219885) for more information. 
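
For intuition, here is a simplified sketch of average precision for a single class, followed by the mean over classes. It uses the non-interpolated form (mean of the precision values at each true positive), which may differ in detail from the official competition scorer. Whether a detection counts as a true positive is normally decided by mask overlap with the ground truth; here that decision is assumed to be given as a boolean list.

```python
def average_precision(scores, is_true_positive, n_ground_truth):
    """Non-interpolated AP for one class.

    scores           -- confidence of each detection
    is_true_positive -- True if the detection matched a ground-truth cell
                        (matching itself is assumed to be done elsewhere)
    n_ground_truth   -- number of ground-truth cells for this class
    """
    if n_ground_truth == 0:
        return 0.0
    # Rank detections from most to least confident.
    ranked = sorted(zip(scores, is_true_positive), key=lambda pair: pair[0], reverse=True)
    true_positives = 0
    precision_sum = 0.0
    for rank, (_, hit) in enumerate(ranked, start=1):
        if hit:
            true_positives += 1
            precision_sum += true_positives / rank  # precision at this recall step
    return precision_sum / n_ground_truth


def mean_average_precision(per_class_ap):
    """Average the per-class AP values."""
    return sum(per_class_ap) / len(per_class_ap)


ap = average_precision([0.9, 0.8, 0.7, 0.6], [True, False, True, True], n_ground_truth=4)
print(round(ap, 4))  # 0.6042
```

A missed ground-truth cell lowers the score through `n_ground_truth`, and a confident false positive lowers the precision of every true positive ranked below it.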

In [None]:
HTML('<center><iframe width="950" height="450" src="https://www.youtube.com/embed/FppOzcDvaDI" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe></center>')

# 3. <a id='2'>Import Packages📚</a>
[Table of contents](#0.1)

In [None]:
!pip install https://github.com/CellProfiling/HPA-Cell-Segmentation/archive/master.zip

In [None]:
# basic 
import warnings
import os, gc, cv2
import numpy as np
import pandas as pd
from glob import glob
from tqdm.notebook import tqdm

# visualize
import seaborn as sns
import matplotlib.pyplot as plt

# segmentation tool
import hpacellseg.cellsegmentator as cellsegmentator
from hpacellseg.utils import label_cell, label_nuclei

%matplotlib inline
warnings.filterwarnings('ignore')

Let us have a look at the competition data directory.

In [None]:
# directory
print('Competition Data/Files')
ROOT = '../input/hpa-single-cell-image-classification/'
os.listdir(ROOT)

# 4. <a id='3'>Utility🔨</a>
[Table of contents](#0.1)

In [None]:
# read and visualize sample image
def read_sample_image(filename):
    
    '''
    read individual images
    of different filters (R, G, B, Y)
    and stack them.
    ---------------------------------
    Arguments:
    filename -- sample image path
    
    Returns:
    stacked_images -- stacked (RGBY) image
    '''
    
    red = cv2.imread(os.path.join(ROOT, 'train/') + filename + "_red.png", cv2.IMREAD_UNCHANGED)
    green = cv2.imread(os.path.join(ROOT, 'train/') + filename + "_green.png", cv2.IMREAD_UNCHANGED)
    blue = cv2.imread(os.path.join(ROOT, 'train/') + filename + "_blue.png", cv2.IMREAD_UNCHANGED)
    yellow = cv2.imread(os.path.join(ROOT, 'train/') + filename + "_yellow.png", cv2.IMREAD_UNCHANGED)

    stacked_images = np.transpose(np.array([red, green, blue, yellow]), (1,2,0))
    return stacked_images

def plot_all(im, label):
    
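    '''
    plot the RGB composite and the
    Red, Green, Blue, Yellow
    filter images.
    --------------------------
    Arguments:
    im -- stacked (RGBY) image
    label -- label name (currently unused)
    '''
    
    plt.figure(figsize=(15, 15))
    plt.subplot(1, 5, 1)
    # only the first three channels (R, G, B) can be shown as a color image
    plt.imshow(im[:,:,:3])
    plt.title('RGB Image')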
    plt.axis('off')
    plt.subplot(1, 5, 2)
    plt.imshow(im[:,:,0], cmap='Reds')
    plt.title('Microtubule channels')
    plt.axis('off')
    plt.subplot(1, 5, 3)
    plt.imshow(im[:,:,1], cmap='Greens')
    plt.title('Protein of Interest')
    plt.axis('off')
    plt.subplot(1, 5, 4)
    plt.imshow(im[:,:,2], cmap='Blues')
    plt.title('Nucleus')
    plt.axis('off')
    plt.subplot(1, 5, 5)
    plt.imshow(im[:,:,3], cmap='Oranges')
    plt.title('Endoplasmic Reticulum')
    plt.axis('off')
    plt.show()

# read and visualize sample image
def read_sample_image_seg(filename):
    
    '''
    read individual images
    of different filters (R, B, Y)
    and stack them for segmentation.
    ---------------------------------
    Arguments:
    filename -- sample image file path
    
    Returns:
    stacked_images -- image paths as [[red], [yellow], [blue]],
                      followed by the red, blue and yellow paths.
    '''
    
    red = os.path.join(ROOT, 'train/') + filename + "_red.png"
    blue = os.path.join(ROOT, 'train/') + filename + "_blue.png"
    yellow = os.path.join(ROOT, 'train/') + filename + "_yellow.png"

    stacked_images = [[red], [yellow], [blue]]
    return stacked_images, red, blue, yellow

# segment cell 
def segmentCell(image, segmentator):
    
    '''
    segment cell and nuclei from
    microtubules, endoplasmic reticulum,
    and nuclei (R, Y, B) filters.
    ------------------------------------
    Argument:
    image -- [[red], [yellow], [blue]] lists of image paths
    segmentator -- CellSegmentator class object
    
    Returns:
    cell_mask -- segmented cell mask
    '''
    
    nuc_segmentations = segmentator.pred_nuclei(image[2])
    cell_segmentations = segmentator.pred_cells(image)
    nuclei_mask, cell_mask = label_cell(nuc_segmentations[0], cell_segmentations[0])
    
    # free the intermediate results first, then run the garbage collector
    del nuc_segmentations, cell_segmentations, nuclei_mask
    gc.collect()
    
    return cell_mask

# plot segmented cells mask, image
def plot_cell_segments(mask, red, blue, yellow):
    
    '''
    plot segmented cells
    and images
    ---------------------
    Arguments:
    mask -- cell mask
    red -- red filter image path
    blue -- blue filter image path
    yellow -- yellow filter image path
    '''
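    # use the function arguments, not the notebook-level globals r, b, y
    microtubule = plt.imread(red)
    endoplasmicrec = plt.imread(yellow)
    nuclei = plt.imread(blue)
    # composite for display: microtubules -> R, ER -> G, nuclei -> B
    img = np.dstack((microtubule, endoplasmicrec, nuclei))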
    
    plt.figure(figsize=(15, 15))
    plt.subplot(1, 3, 1)
    plt.imshow(img)
    plt.title('Image')
    plt.axis('off')
    
    plt.subplot(1, 3, 2)
    plt.imshow(mask)
    plt.title('Mask')
    plt.axis('off')
    
    plt.subplot(1, 3, 3)
    plt.imshow(img)
    plt.imshow(mask, alpha=0.6)
    plt.title('Image + Mask')
    plt.axis('off')
    plt.show()

# plot single segmented cells mask, image
def plot_single_cell(mask, red, blue, yellow):
    
    '''
    plot single cell mask
    and image
    ---------------------
    Arguments:
    mask -- cell mask
    red -- red filter image path
    blue -- blue filter image path
    yellow -- yellow filter image path
    '''
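    # use the function arguments, not the notebook-level globals r, b, y
    microtubule = plt.imread(red)
    endoplasmicrec = plt.imread(yellow)
    nuclei = plt.imread(blue)
    # composite for display: microtubules -> R, ER -> G, nuclei -> B
    img = np.dstack((microtubule, endoplasmicrec, nuclei))
    
    # [-2] picks the contour list in both OpenCV 3 (three return values) and OpenCV 4 (two)
    contours = cv2.findContours(mask.astype('uint8'),
                                cv2.RETR_TREE, 
                                cv2.CHAIN_APPROX_SIMPLE)[-2]

    # crop around the contour with the largest area
    cnt = max(contours, key=cv2.contourArea)
    x, yc, w, h = cv2.boundingRect(cnt)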
    
    plt.figure(figsize=(15, 15))
    plt.subplot(1, 3, 1)
    plt.imshow(img[yc:yc+h, x:x+w])
    plt.title('Cell Image')
    plt.axis('off')
    
    plt.subplot(1, 3, 2)
    plt.imshow(mask[yc:yc+h, x:x+w])
    plt.title('Cell Mask')
    plt.axis('off')
    
    plt.subplot(1, 3, 3)
    plt.imshow(img[yc:yc+h, x:x+w])
    plt.imshow(mask[yc:yc+h, x:x+w], alpha=0.6)
    plt.title('Cell Image + Mask')
    plt.axis('off')
    plt.show()
    
def binary_mask(rgby_images):
    
    '''
    generate masks from 
    rgby images.
    --------------------
    Arguments:
    rgby_images -- RGBY cell images
    
    Return:
    mask -- binary mask.
    '''
    pass

# 5. <a id='4'>Data Overview🧫</a>
## 5.1 <a id='4.1'>Train Data</a>
[Table of contents](#0.1)

In this section we will develop intuition for data we are working with.

In [None]:
train_df = pd.read_csv(os.path.join(ROOT, 'train.csv'))
train_df.head()

**📌 Observations**

We have the following columns in train.csv:

   * ID - The base filename of the sample. All samples consist of four files - blue, green, red, and yellow.
   * Label - This represents the labels assigned to each sample.
   
As mentioned in the [data](https://www.kaggle.com/c/hpa-single-cell-image-classification/data) section of the competition and in the [Introduction](#1) section above, a sample can have multiple labels. 

In [None]:
print(f'We have {train_df.shape[0]} rows and {train_df.shape[1]} columns in train.csv.')

## Missing Values

In [None]:
print(f'Missing values in each column of train.csv:\n{train_df.isnull().sum()}')

## Unique Values

In [None]:
print('Unique values in each column of train.csv')
print('##########################################')
for col in train_df:
    print(f'{col}: {train_df[col].nunique()}')

## 5.2 <a id='4.2'>Test/Submission Data</a>
[Table of contents](#0.1)

In [None]:
sample_sub = pd.read_csv(os.path.join(ROOT, 'sample_submission.csv'))
sample_sub.head()

In [None]:
print(f'We have {sample_sub.shape[0]} rows and {sample_sub.shape[1]} columns in sample_submission.csv.')

# 6. <a id='5'>Exploratory Data Analysis🧬</a>
## 6.1 <a id=5.1>Preprocessing Labels</a>
[Table of contents](#0.1)

Since the **"Label"** column can hold multiple labels per sample, we need to binarize it so that each label/class becomes its own column.  

In [None]:
# splitting label column into a list of label strings
train_df["Label"] = train_df["Label"].str.split("|")

# class labels '0' ... '18'
class_labels = [str(i) for i in range(19)]

# binarizing each label/class
for label in tqdm(class_labels):
    train_df[label] = train_df['Label'].map(lambda result: 1 if label in result else 0)

# rename column
train_df.columns = ['ID', 'Label', 'Nucleoplasm', 'Nuclear membrane', 'Nucleoli', 'Nucleoli fibrillar center',
                    'Nuclear speckles', 'Nuclear bodies', 'Endoplasmic reticulum', 'Golgi apparatus', 'Intermediate filaments',
                    'Actin filaments', 'Microtubules', 'Mitotic spindle', 'Centrosome', 'Plasma membrane', 'Mitochondria',
                    'Aggresome', 'Cytosol', 'Vesicles and punctate cytosolic patterns', 'Negative']

train_df

In [None]:
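# drop the non-numeric columns before summing so only the binary label columns are added up
class_counts = train_df.drop(['ID', 'Label'], axis=1).sum().sort_values(ascending=False)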

print('Per class count in train dataset')
print('-------------------------------------------------')
for column in class_counts.keys():
    print(f"The class {column} has {train_df[column].sum()} samples")

In [None]:
plt.figure(figsize=(14,12))
with sns.axes_style("whitegrid"):
    aa = sns.barplot(y=class_counts.index.values, x=class_counts.values, palette='gist_earth')
    plt.title("Label Distribution")

**📌 Observations**

   * **Nucleoplasm** is the most frequent label, with around **8797** occurrences.
   * Cytosol is the second most frequent label with 5685 occurrences, followed by Plasma membrane with 3111.  
   * **Negative** is the rarest label, with only 34 samples with unspecified location.
   * Most labels seem to have fewer than 2000 occurrences. 

In [None]:
label_per_image = train_df.drop(['ID', 'Label'], axis=1).sum(axis=1)

plt.figure(figsize=(16,10))
with sns.axes_style("whitegrid"):
    # pass the series as x explicitly; recent seaborn versions expect keyword arguments here
    ax = sns.countplot(x=label_per_image, palette='Pastel1')
    for p in ax.patches:
        height = p.get_height()
        ax.text(p.get_x()+p.get_width()/2.,
                height + 3,
                '{:1.2f}%'.format(height/len(label_per_image)*100),
                ha="center", fontsize=12)
    plt.title("Label Per Sample/Image", fontsize=16)

**ðŸ“Œ Observations**

   * 48% of samples have only 1 label, and 40% have 2 labels per image.
   * 10% of samples have 3 labels.
   * A very small number of samples have more than 3 labels.
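
The percentages above can be cross-checked without plotting. The sketch below computes the share of samples per label count directly from raw `|`-separated label strings; the helper name `labels_per_image_share` and the toy input are made up for illustration.

```python
from collections import Counter


def labels_per_image_share(label_strings):
    """Return {number_of_labels: percentage_of_samples}."""
    counts = Counter(len(s.split("|")) for s in label_strings)
    total = len(label_strings)
    return {n: 100.0 * c / total for n, c in sorted(counts.items())}


# Toy label strings in the train.csv format (not real data).
toy_labels = ["0", "16|0", "14", "0|5|16"]
share = labels_per_image_share(toy_labels)
print(share)  # {1: 50.0, 2: 25.0, 3: 25.0}
```

On the real data this would be called on the original `Label` column, before it is split into lists.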

## 6.2 <a id=5.2>Exploring Image Data</a>
[Table of contents](#0.1)

Each sample consists of four image files. Each file represents a different filter on the subcellular protein patterns represented by the sample (ID).

* Red for Microtubule channels.
* Blue for Nuclei channels.
* Yellow for Endoplasmic Reticulum (ER) channels.
* Green for Protein of interest.

The "**Protein of interest**" (green channel) is what we are predicting labels for in each sample (possibly multiple labels). These labels are the organelles. It is the pattern(s) in the green channel that we should classify. 

We want to visualize images that have a single label to get a proper understanding of the organelles. Thanks to [Kishan Joshi](https://www.kaggle.com/joshi98kishan) my teammate for suggesting this and helping me with this work. 

In [None]:
# keep only the samples that carry exactly one label
train = train_df.loc[train_df['Label'].apply(len) == 1]

In [None]:
for label in train_df.drop(['ID', 'Label'], axis=1):
    print(label)
    # take the ID value directly instead of parsing the printed representation
    im = read_sample_image(train[train[label]==1].sample(1).ID.iloc[0])
    plot_all(im, label)

## 6.3 <a id=5.3>Label Wise Cell Segmentation</a>
[Table of contents](#0.1)

For segmenting cells we will use the [HPA Cell Segmentation](https://github.com/CellProfiling/HPA-Cell-Segmentation) tool provided by the competition host. For more details check [here](https://www.kaggle.com/lnhtrang/hpa-public-data-download-and-hpacellseg/notebook). We will use the HPACellSegmentation model to segment the cells in images. 

The `CellSegmentator` class takes the following arguments:

* nuclei_model - This should be a string containing the path to the nuclei-model weights. If the weights do not exist at the path, they will be downloaded to it. Defaults to ./nuclei_model.pth.

* cell_model - This should be a string containing the path to the cell-model weights. If the weights do not exist at the path, they will be downloaded to it. Defaults to ./cell_model.pth.

* scale_factor - This value determines how much the images should be scaled before being fed to the models. For HPA Cell images, a value of 0.25 is good. Defaults to 0.25.

* device - Inform Torch which device to put the model on. Valid values are 'cpu' or 'cuda' or a specific cuda device like 'cuda:0'. Defaults to cuda.

* padding - If True, add some padding before feeding the images to the neural networks. This is not required but can make segmentations, especially cell segmentations, more accurate. Defaults to False.

* multi_channel_model - If True, use the pretrained three-channel version of the model. Having this set to True gives you better cell segmentations but requires you to give the model endoplasmic reticulum images as part of the cell segmentation. Otherwise, the version trained with only two channels, microtubules and nuclei, will be used. Defaults to True.

We will generate cell segmentations with the `pred_cells` method and then post-process them into cell masks with the `label_cell` function. 

In [None]:
NUC_MODEL = "./nuclei-model.pth"
CELL_MODEL = "./cell-model.pth"
segmentator = cellsegmentator.CellSegmentator(
    NUC_MODEL,
    CELL_MODEL,
    scale_factor=0.25,
    device="cpu",
    padding=False,
    multi_channel_model=True,
)

I am going to plot the images and mask for each label present in the dataset. 

In [None]:
for label in train_df.drop(['ID', 'Label'], axis=1):
    print(label)
    # take the ID value directly instead of parsing the printed representation
    im, r, b, y = read_sample_image_seg(train[train[label]==1].sample(1).ID.iloc[0])
    mask = segmentCell(im, segmentator)
    plot_cell_segments(mask, r, b, y)

## 6.4 <a id=5.4>Label Wise Single Cell Segmentation</a>
[Table of contents](#0.1)

In this section we will look at each cell closely on the basis of labels.

In [None]:
for label in train_df.drop(['ID', 'Label'], axis=1):
    print(label)
    # take the ID value directly instead of parsing the printed representation
    im, r, b, y = read_sample_image_seg(train[train[label]==1].sample(1).ID.iloc[0])
    mask = segmentCell(im, segmentator)
    plot_single_cell(mask, r, b, y)
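
`plot_single_cell` crops around the largest contour only. As a sketch of how every cell could be located, the function below collects one bounding box per cell id. It assumes the mask is a 2D grid of integer cell ids with 0 as background, and it is written in plain Python on a toy mask for clarity; the name `cell_bounding_boxes` is hypothetical. For a NumPy mask, `mask.tolist()` could be passed in, although a vectorized approach would be much faster on full-size images.

```python
def cell_bounding_boxes(mask):
    """Return {cell_id: (x, y, w, h)} for each non-zero id in a 2D label mask."""
    extents = {}  # cell_id -> [x_min, y_min, x_max, y_max]
    for row_idx, row in enumerate(mask):
        for col_idx, cell_id in enumerate(row):
            if cell_id == 0:
                continue  # background pixel
            if cell_id not in extents:
                extents[cell_id] = [col_idx, row_idx, col_idx, row_idx]
            else:
                box = extents[cell_id]
                box[0] = min(box[0], col_idx)
                box[1] = min(box[1], row_idx)
                box[2] = max(box[2], col_idx)
                box[3] = max(box[3], row_idx)
    # Convert corner coordinates to the (x, y, width, height) convention.
    return {
        cell_id: (x0, y0, x1 - x0 + 1, y1 - y0 + 1)
        for cell_id, (x0, y0, x1, y1) in extents.items()
    }


toy_mask = [
    [0, 1, 1, 0],
    [0, 1, 1, 2],
    [0, 0, 0, 2],
]
boxes = cell_bounding_boxes(toy_mask)
print(boxes)  # {1: (1, 0, 2, 2), 2: (3, 1, 1, 2)}
```

Each box can then be used to slice the image and the mask in the same way as `img[yc:yc+h, x:x+w]` above.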

In [None]:
gc.collect()

# 7. <a id='6'>Useful Resources</a>
[Table of contents](#0.1)

1. **Leveraging Instance-, Image- and Dataset-Level Information for Weakly Supervised Instance Segmentation.** [[Paper]](http://mftp.mmcheng.net/Papers/21PAMI_InsImgDatasetWSIS.pdf) [[Code]](https://github.com/yun-liu/LIID)

1. **Matrix Completion for Weakly-supervised Multi-label Image Classification.** [[Paper]](http://ca.cs.cmu.edu/sites/default/files/complete_14.pdf)

1. **Weakly Supervised Multi-Label Learning via Label Enhancement.** [[Paper]](https://www.ijcai.org/Proceedings/2019/0430.pdf)

1. **Puzzle-CAM: Improved localization via matching partial and full features** [[Code]](https://github.com/OFRIN/PuzzleCAM) [[Paper]](https://arxiv.org/abs/2101.11253) [[Discussion]](https://www.kaggle.com/c/hpa-single-cell-image-classification/discussion/222094) [[Approach]](https://www.kaggle.com/c/hpa-single-cell-image-classification/discussion/217395)

1. **Protein localization patterns Explained** [[notebook]](https://www.kaggle.com/lnhtrang/single-cell-patterns)

1. **The Previous Human Protein Atlas Competition** [[Discussion]](https://www.kaggle.com/c/hpa-single-cell-image-classification/discussion/214518)

# 8. <a id='7'>Reference👀</a>
[Table of contents](#0.1)

* https://www.kaggle.com/allunia/protein-atlas-exploration-and-baseline
* https://www.kaggle.com/lnhtrang/hpa-public-data-download-and-hpacellseg/notebook
* https://www.kaggle.com/thedrcat/hpa-single-cell-classification-eda
* [mAP](https://www.kaggle.com/c/hpa-single-cell-image-classification/discussion/219885)