# &emsp;&emsp; &emsp;&emsp;&emsp;Human Protein Atlas - Single Cell Classification

### Introduction:

&emsp;&emsp;&emsp; The human body consists of trillions of cells, and not all of them are alike. Where a protein sits inside a cell matters, so differences in protein location can give rise to cellular heterogeneity. Proteins play a crucial role in cellular processes: groups of proteins come together at specific locations to perform a task, and the outcome depends on which proteins are present. Different subcellular distributions of a single protein can therefore lead to large functional differences. Finding such differences, and figuring out why and how they occur, is important for understanding how cells function, how diseases develop, and ultimately how to develop better treatments for those diseases.   
    
&emsp;&emsp;&emsp; This is a supervised multi-label classification problem. We are given microscope images of cells, and each image carries one set of protein-location labels that applies to all the cells in it collectively. In this notebook we develop a model that segments each individual cell and classifies it with its own labels.

In [None]:
!pip install ../input/hpa-library-install/iterative-stratification-master/iterative-stratification-master
!pip install ../input/hpa-library-install/pytorch_zoo-master/pytorch_zoo-master
!pip install ../input/hpa-library-install/HPA-Cell-Segmentation-master/HPA-Cell-Segmentation-master

In [None]:
# importing required libraries for basic operations
import pandas as pd       # for dataset processing 
import numpy as np        # for mathematical operations
import pickle             # for files read and write operations
import os                 # for system related operations
import zipfile            # for zip files read and write operations
from tqdm import tqdm     # for displaying progress bar
import cv2                # for image processing
from PIL import Image     # for image processing

# importing required libraries for plotting
import matplotlib.pyplot as plt        
import plotly.graph_objects as go
import plotly.express as px

# importing required libraries for model creation 
from fastai.vision.all import *                                   # deep learning library

from iterstrat.ml_stratifiers import MultilabelStratifiedKFold    # library for sampling dataset
import warnings
warnings.filterwarnings("ignore")

### Data:

&emsp;&emsp;&emsp; We have train data for training our model and test data for testing it. The training data consists of a directory containing all the training images and a CSV file containing the labels for those images.

### Train Data:

In [None]:
# loading train.csv file into a dataframe
train_df = pd.read_csv('../input/hpa-single-cell-image-classification/train.csv')
print('(rows, columns)',train_df.shape)
train_df.head()

### Train Data Description

&emsp;&emsp;&emsp; There are 2 columns and 21806 records in the train.csv file.  
   
1. **ID**:- Contains the image file name    
2. **Label**:- Contains the corresponding labels for each image file. There are 19 labels in total, numbered 0-18. Their meanings are as follows:

In [None]:
pd.read_csv('../input/additional-data/Labels.csv').style.hide_index()

### Images for training

In [None]:
len(os.listdir('../input/hpa-single-cell-image-classification/train'))

A total of 87224 image files are given for training, yet train.csv has only 21806 records. This is because each image is provided as 4 separate channel files, so 4 files belong to a single image (21806 x 4 = 87224). Every image has the following 4 channels:   
  
1. Red (Microtubules)
2. Green (Protein of interest)
3. Blue (Nucleus)
4. Yellow (Endoplasmic reticulum)
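The relationship between files and records can be checked by grouping file names by image ID. The following is a minimal sketch, assuming every file is named `<ID>_<color>.png` as in the listing further down; the helper `group_channels` and the two image IDs are made up for illustration.

```python
from collections import defaultdict

CHANNELS = {"red", "green", "blue", "yellow"}

def group_channels(filenames):
    """Map each image ID to the set of channel colors found for it."""
    groups = defaultdict(set)
    for name in filenames:
        stem = name.rsplit(".", 1)[0]          # drop the ".png" extension
        image_id, color = stem.rsplit("_", 1)  # split "<id>_<color>"
        groups[image_id].add(color)
    return dict(groups)

# Hypothetical listing for two images (the IDs are invented)
files = [f"{i}_{c}.png" for i in ("img-a", "img-b") for c in sorted(CHANNELS)]
groups = group_channels(files)

# Images for which all four channel files are present
complete = [i for i, colors in groups.items() if colors == CHANNELS]
print(len(files), len(groups), len(complete))
```

On the real directory, passing `os.listdir(...)` to `group_channels` should give one entry per record in train.csv if every image has all four channels.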

According to https://biologydictionary.net/

**Microtubules**: Microtubules are microscopic hollow tubes made of the proteins alpha and beta tubulin that are part of a cell’s cytoskeleton, a network of protein filaments that extends throughout the cell, gives the cell shape, and keeps its organelles in place. Microtubules are the largest structures in the cytoskeleton at about 24 nanometers thick. They have roles in cell movement, cell division, and transporting materials within cells.

**Nucleus**: The cell nucleus is a large organelle in eukaryotic organisms which protects the majority of the DNA within each cell. The nucleus also produces the necessary precursors for protein synthesis. The DNA housed within the cell nucleus contains the information necessary for the creation of the majority of the proteins needed to keep a cell functional. While some DNA is stored in other organelles, such as mitochondria, the majority of an organism’s DNA is located in the cell nucleus. The DNA housed in the cell nucleus is extremely valuable, and as such the cell nucleus has a variety of important structures to help maintain, process, and protect the DNA.

**Endoplasmic reticulum**: The endoplasmic reticulum (ER) is a large organelle made of membranous sheets and tubules that begin near the nucleus and extend across the cell. The endoplasmic reticulum creates, packages, and secretes many of the products created by a cell. Ribosomes, which create proteins, line a portion of the endoplasmic reticulum.

**Protein of interest**: Little information is disclosed about the protein of interest. It is the protein the scientists are studying, and it is the one marked in the green channel of each image.

Let's take the first record from our training data and visualize the image and its four channels.

In [None]:
train_df.iloc[0] # first record in training data

In [None]:
print('Following are the four channels for above image id in the training Image directory')
print('\n')
ID = '5c27f04c-bb99-11e8-b2b9-ac1f6b6435d0'
for i in os.listdir('../input/hpa-single-cell-image-classification/train'):
    if(ID in i):
        print(i)

Let's visualize these 4 channels

In [None]:
#create custom color maps
cdict1 = {'red':   ((0.0,  0.0, 0.0),
                   (1.0,  0.0, 0.0)),

         'green': ((0.0,  0.0, 0.0),
                   (0.75, 1.0, 1.0),
                   (1.0,  1.0, 1.0)),

         'blue':  ((0.0,  0.0, 0.0),
                   (1.0,  0.0, 0.0))}

cdict2 = {'red':   ((0.0,  0.0, 0.0),
                   (0.75, 1.0, 1.0),
                   (1.0,  1.0, 1.0)),

         'green': ((0.0,  0.0, 0.0),
                   (1.0,  0.0, 0.0)),

         'blue':  ((0.0,  0.0, 0.0),
                   (1.0,  0.0, 0.0))}

cdict3 = {'red':   ((0.0,  0.0, 0.0),
                   (1.0,  0.0, 0.0)),

         'green': ((0.0,  0.0, 0.0),
                   (1.0,  0.0, 0.0)),

         'blue':  ((0.0,  0.0, 0.0),
                   (0.75, 1.0, 1.0),
                   (1.0,  1.0, 1.0))}

cdict4 = {'red': ((0.0,  0.0, 0.0),
                   (0.75, 1.0, 1.0),
                   (1.0,  1.0, 1.0)),

         'green': ((0.0,  0.0, 0.0),
                   (0.75, 1.0, 1.0),
                   (1.0,  1.0, 1.0)),

         'blue':  ((0.0,  0.0, 0.0),
                   (1.0,  0.0, 0.0))}


import matplotlib.colors  # explicit import for LinearSegmentedColormap

# give each colormap the same name it is registered under
newcmap = matplotlib.colors.LinearSegmentedColormap('greens', cdict1)
plt.register_cmap('greens', newcmap)

newcmap = matplotlib.colors.LinearSegmentedColormap('reds', cdict2)
plt.register_cmap('reds', newcmap)

newcmap = matplotlib.colors.LinearSegmentedColormap('blues', cdict3)
plt.register_cmap('blues', newcmap)

newcmap = matplotlib.colors.LinearSegmentedColormap('yellows', cdict4)
plt.register_cmap('yellows', newcmap)

green = cv2.imread('../input/hpa-single-cell-image-classification/train/5c27f04c-bb99-11e8-b2b9-ac1f6b6435d0_green.png', 0)
red = cv2.imread('../input/hpa-single-cell-image-classification/train/5c27f04c-bb99-11e8-b2b9-ac1f6b6435d0_red.png', 0)
blue = cv2.imread('../input/hpa-single-cell-image-classification/train/5c27f04c-bb99-11e8-b2b9-ac1f6b6435d0_blue.png', 0)
yellow = cv2.imread('../input/hpa-single-cell-image-classification/train/5c27f04c-bb99-11e8-b2b9-ac1f6b6435d0_yellow.png', 0)

#display each channel separately
fig, ax = plt.subplots(nrows = 2, ncols=2, figsize=(15, 15))
ax[0, 0].imshow(green, cmap="greens")
ax[0, 0].set_title("Protein of interest (Green)", fontsize=18)
ax[0, 1].imshow(red, cmap="reds")
ax[0, 1].set_title("Microtubules (Red)", fontsize=18)
ax[1, 0].imshow(blue, cmap="blues")
ax[1, 0].set_title("Nucleus (Blue)", fontsize=18)
ax[1, 1].imshow(yellow, cmap="yellows")
ax[1, 1].set_title("Endoplasmic reticulum (Yellow)", fontsize=18)
for i in range(2):
    for j in range(2):
        ax[i, j].set_xticklabels([])
        ax[i, j].set_yticklabels([])
        ax[i, j].tick_params(left=False, bottom=False)
plt.show()

Let's visualize the RGB image for image id 5c27f04c-bb99-11e8-b2b9-ac1f6b6435d0 with its labels

In [None]:
# merging the red, green and blue channels to produce an RGB image
plt.figure(figsize=(7,7))
green = cv2.imread('../input/hpa-single-cell-image-classification/train/5c27f04c-bb99-11e8-b2b9-ac1f6b6435d0_green.png', cv2.IMREAD_UNCHANGED)
red = cv2.imread('../input/hpa-single-cell-image-classification/train/5c27f04c-bb99-11e8-b2b9-ac1f6b6435d0_red.png', cv2.IMREAD_UNCHANGED)
blue = cv2.imread('../input/hpa-single-cell-image-classification/train/5c27f04c-bb99-11e8-b2b9-ac1f6b6435d0_blue.png', cv2.IMREAD_UNCHANGED)
img = cv2.merge((red, green, blue))
cv2.imwrite('first_image.tif', img)
plt.imshow(img)
plt.xticks([])
plt.yticks([])
plt.title("RGB Image", fontsize=18)
plt.xlabel('Labels    8:Intermediate filaments \n5:Nuclear bodies\n0:Nucleoplasm   ', fontsize=18)
plt.tick_params(left=False, bottom=False)
plt.show()

Let's visualize each label individually. We know what each label represents, but visualizing them will help us understand them more clearly.

### Train Data Preprocessing

&emsp;&emsp;&emsp; Each image contains many cells, which may carry different labels. To visualize each label individually, we need to crop each cell, together with its corresponding label, out of the given images and then visualize the crops.

These are the steps to crop individual cells from images:  
1. Image segmentation (the segmentation process will segment individual cells in the given images and then label it with corresponding classes).
2. Cropping process (the cropping process will crop individual cells from all images after segmentation).

**Processing train.csv data to perform image segmentation**

In [None]:
class_labels = [str(i) for i in range(19)] # list of class labels [0-18]

train_label = train_df.Label # labels in training data

# one-hot encoding of class labels in training data
for x in class_labels: 
    train_df[x] = train_df['Label'].apply(lambda r: int(x in r.split('|')))
    
train_df.head()

Now there are 19 more binary columns. To understand them, take the first record: its labels are 8, 5 and 0, so the columns '8', '5' and '0' are 1 and all the other label columns (0-18 except 0, 5 and 8) are 0. The same holds for every other image.    
     
1 means True.    
0 means False.
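The encoding can be illustrated without pandas. This is a small sketch of the same idea; `encode_labels` and `decode_labels` are hypothetical helper names, not part of the notebook. Note that the label string is split on `|` before testing membership: a plain substring test would wrongly find label `1` inside `10|11`.

```python
NUM_CLASSES = 19

def encode_labels(label_string, num_classes=NUM_CLASSES):
    """Turn a '|'-separated label string into a 0/1 list of length num_classes."""
    present = {int(tok) for tok in label_string.split("|")}
    return [int(i in present) for i in range(num_classes)]

def decode_labels(vector):
    """Inverse of encode_labels; labels come back in ascending order."""
    return "|".join(str(i) for i, flag in enumerate(vector) if flag)

vec = encode_labels("8|5|0")
print(vec)
print(decode_labels(vec))  # "0|5|8"

# Splitting first avoids false matches such as '1' in '10|11'
print(encode_labels("10|11")[1])  # 0
```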

**Checking class label imbalance**

In [None]:
len(train_df.Label.unique())

There are 432 unique label combinations in total.

In [None]:
max_label_counts = 0

for i in train_df.Label.unique():
    if(max_label_counts <= len(i.split('|'))):
        max_label_counts = len(i.split('|'))
        
max_label_counts

The maximum number of labels for an image in our training data is 5.

In [None]:
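# start from the label count of the first record and keep the smallest one seen
min_label_counts = len(train_df.Label.iloc[0].split('|'))

for i in train_df.Label.unique():
    if(min_label_counts >= len(i.split('|'))):
        min_label_counts = len(i.split('|'))
        
min_label_counts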

The minimum number of class labels for an Image in our training data is 1
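Both extremes, and everything in between, can be read off a single histogram of labels per image. A minimal sketch, assuming label strings in the same `a|b|c` form; the `sample` list is invented and `labels_per_image` is a hypothetical helper.

```python
from collections import Counter

def labels_per_image(label_strings):
    """Count how many images carry 1, 2, 3, ... labels."""
    return Counter(len(s.split("|")) for s in label_strings)

# Invented label strings in the train.csv format
sample = ["8|5|0", "14|0", "6|1", "16|10", "14", "0"]
dist = labels_per_image(sample)

print(dict(dist))
print(min(dist), max(dist))  # smallest and largest label count per image
```

Applied to `train_df.Label`, `min(dist)` and `max(dist)` should agree with the two loops above.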

**Class label imbalance visualization**

In [None]:
# Adding string labels to the training csv data for better visualization
Labels = {0:  "Nucleoplasm", 1:  "Nuclear membrane",  2:  "Nucleoli",  3:  "Nucleoli fibrillar center" ,  4:  "Nuclear speckles",
          5:  "Nuclear bodies", 6:  "Endoplasmic reticulum",  7:  "Golgi apparatus", 8:  "Intermediate filaments",
          9:  "Actin filaments", 10: "Microtubules", 11:  "Mitotic spindle", 12:  "Centrosome",  13:  "Plasma membrane",
          14:  "Mitochondria",   15:  "Aggresome", 16:  "Cytosol",  17:  "Vesicles and punctate cytosolic patterns",   
          18:  "Negative"}

# Map the Individual labels to String_label
train_df["string_label"] = train_df.Label.apply(lambda x: "|".join([Labels[int(i)] for i in x.split("|")]))
train_df.head()

In [None]:
from collections import Counter  # explicit import instead of relying on the fastai star import

label_counts = Counter([c for sublist in train_df.string_label.str.split("|").to_list() for c in sublist])
fig = px.bar(x=list(label_counts.keys()), y=list(label_counts.values()), opacity=0.85, 
             color=list(label_counts.keys()),
             labels={
                 "y":"Number of Occurrences Within The Dataset", 
                 "x":"Label Name", 
                 "color":"Label Name"
             },
             title="Number of Occurrences For Each Label Within The Dataset")
fig.update_layout(legend_title=None,
                  xaxis_title="Label Names",
                  yaxis_title="Number of Occurrences Within The Dataset")
fig.update_xaxes(categoryorder="total descending")
fig.show()

From the above graph:

1. We can see that the training data is highly imbalanced. 

2. We can see that the most common label is Nucleoplasm, a coarse-grained cellular component.  
   
3. In contrast, small or thin components such as the mitotic spindle, microtubules and vesicles are rare in our training data. In addition, rare classes such as Aggresome and Negative also have very little representation in the dataset. For these classes prediction will be difficult, because the few examples we have may not capture all the variation normally present in these biological structures. We should therefore expect to struggle to make accurate predictions on the minority classes.

Let's compare, for each label, the number of records where it occurs on its own with its total number of occurrences

In [None]:
unique_counts = {}
for label in class_labels:
    unique_counts[label] = len(train_df[train_df.Label == label])

full_counts = {}
for label in class_labels:
    count = 0
    for row_label in train_df['Label']:
        if label in row_label.split('|'): count += 1
    full_counts[label] = count
    
counts = list(zip(full_counts.keys(), full_counts.values(), unique_counts.values()))
counts = np.array(sorted(counts, key=lambda x:-x[1]))
counts = pd.DataFrame(counts, columns=['label', 'Total Count', 'Individual Count'])
counts.label = [int(i) for i in counts.label]
counts = counts.sort_values(by='label')
counts.set_index('label').T

#### Sampling

To deal with the class imbalance, we will perform downsampling here.   
    
Steps:   
1. For each label, we first randomly choose 500 records that carry only that single label.
2. Some class labels don't have 500 single-label records, so for those labels we select the remaining records from the records that have more than one label.
3. If a label still doesn't have 500 records, we leave it as it is.
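The steps above can be sketched as one generic function instead of one line per class. This is a simplified illustration on plain label strings rather than the DataFrame, and it is an assumption of this sketch (not the notebook's exact behaviour) that rows already chosen for a class are excluded when topping up from multi-label rows. The name `downsample` and the toy data are hypothetical.

```python
import random

def downsample(label_strings, classes, target=500, seed=42):
    """Return sorted row indices chosen by the three steps above."""
    rng = random.Random(seed)
    chosen = set()
    for cls in classes:
        # Step 1: rows whose only label is this class
        single = [i for i, s in enumerate(label_strings) if s == cls]
        picked = rng.sample(single, min(target, len(single)))
        shortfall = target - len(picked)
        if shortfall > 0:
            # Step 2: top up from other rows that contain the class
            picked_set = set(picked)
            multi = [i for i, s in enumerate(label_strings)
                     if cls in s.split("|") and i not in picked_set]
            # Step 3: if there are still too few rows, take what exists
            picked += rng.sample(multi, min(shortfall, len(multi)))
        chosen.update(picked)  # the set removes duplicates across classes
    return sorted(chosen)

# Toy data: 5 rows of "0", 2 of "1", 3 of "0|1", 1 of "2"
rows = ["0"] * 5 + ["1"] * 2 + ["0|1"] * 3 + ["2"]
idx = downsample(rows, classes=["0", "1", "2"], target=3)
print(len(idx), [rows[i] for i in idx])
```

With a DataFrame, the returned indices could be passed to `train_df.iloc[...]`; the fixed `seed` plays the same role as `random_state=42` in the cell below.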

In [None]:
train_dfs_0 = train_df[train_df['Label'] == '0'].sample(n=500, random_state=42).reset_index(drop=True)
train_dfs_1u = train_df[train_df['Label'] == '1'].sample(n=221, random_state=42).reset_index(drop=True)
train_dfs_1 = train_df[train_df['1'] == 1].sample(n=500-221, random_state=42).reset_index(drop=True)
train_dfs_2 = train_df[train_df['Label'] == '2'].sample(n=500, random_state=42).reset_index(drop=True)
train_dfs_3 = train_df[train_df['Label'] == '3'].sample(n=500, random_state=42).reset_index(drop=True)
train_dfs_4 = train_df[train_df['Label'] == '4'].sample(n=500, random_state=42).reset_index(drop=True)
train_dfs_5 = train_df[train_df['Label'] == '5'].sample(n=500, random_state=42).reset_index(drop=True)
train_dfs_6u = train_df[train_df['Label'] == '6'].sample(n=476, random_state=42).reset_index(drop=True)
train_dfs_6 = train_df[train_df['6'] == 1].sample(n=500-476, random_state=42).reset_index(drop=True)
train_dfs_7 = train_df[train_df['Label'] == '7'].sample(n=500, random_state=42).reset_index(drop=True)
train_dfs_8 = train_df[train_df['Label'] == '8'].sample(n=500, random_state=42).reset_index(drop=True)
train_dfs_9u = train_df[train_df['Label'] == '9'].sample(n=294, random_state=42).reset_index(drop=True)
train_dfs_9 = train_df[train_df['9'] == 1].sample(n=500-294, random_state=42).reset_index(drop=True)
train_dfs_10u = train_df[train_df['Label'] == '10'].sample(n=404, random_state=42).reset_index(drop=True)
train_dfs_10 = train_df[train_df['10'] == 1].sample(n=500-404, random_state=42).reset_index(drop=True)
train_dfs_11u = train_df[train_df['Label'] == '11'].sample(n=1, random_state=42).reset_index(drop=True)
train_dfs_11 = train_df[train_df['11'] == 1].reset_index(drop=True)
train_dfs_12 = train_df[train_df['Label'] == '12'].sample(n=500, random_state=42).reset_index(drop=True)
train_dfs_13 = train_df[train_df['Label'] == '13'].sample(n=500, random_state=42).reset_index(drop=True)
train_dfs_14 = train_df[train_df['Label'] == '14'].sample(n=500, random_state=42).reset_index(drop=True)
train_dfs_15u = train_df[train_df['Label'] == '15'].sample(n=82, random_state=42).reset_index(drop=True)
train_dfs_15 = train_df[train_df['15'] == 1].reset_index(drop=True)
train_dfs_16 = train_df[train_df['Label'] == '16'].sample(n=500, random_state=42).reset_index(drop=True)
train_dfs_17u = train_df[train_df['Label'] == '17'].sample(n=274, random_state=42).reset_index(drop=True)
train_dfs_17 = train_df[train_df['17'] == 1].sample(n=500-274, random_state=42).reset_index(drop=True)
train_dfs_18 = train_df[train_df['18'] == 1].reset_index(drop=True)
train_dfs_ = [train_dfs_0, train_dfs_1u, train_dfs_1, train_dfs_2, train_dfs_3, train_dfs_4, train_dfs_5, train_dfs_6u,
              train_dfs_6, train_dfs_7, train_dfs_8, train_dfs_9u, train_dfs_9, train_dfs_10u, train_dfs_10, train_dfs_11u, 
              train_dfs_11, train_dfs_12, train_dfs_13, train_dfs_14, train_dfs_15u, train_dfs_15, train_dfs_16,
              train_dfs_17u, train_dfs_17, train_dfs_18]

In [None]:
train_dfs = pd.concat(train_dfs_, ignore_index=True)
train_dfs.drop_duplicates(inplace=True, ignore_index=True)
len(train_dfs)

After sampling there are 8081 records in total, and the dataset is now much better balanced, which is good for training our model.

In [None]:
train_dfs.head()

Let's compare, for each label in our sampled dataset, the number of records where it occurs on its own with its total number of occurrences

In [None]:
unique_counts = {}
for label in class_labels:
    unique_counts[label] = len(train_dfs[train_dfs.Label == label])

full_counts = {}
for label in class_labels:
    count = 0
    for row_label in train_dfs['Label']:
        if label in row_label.split('|'): count += 1
    full_counts[label] = count
    
counts = list(zip(full_counts.keys(), full_counts.values(), unique_counts.values()))
counts = np.array(sorted(counts, key=lambda x:-x[1]))
counts = pd.DataFrame(counts, columns=['label', 'Total Count', 'Individual Count'])
counts.label = [int(i) for i in counts.label]
counts = counts.sort_values(by='label')
counts.set_index('label').T

Now that we have a more balanced dataset, the next step is to perform segmentation and masking so that we can crop individual cells from the given images, both for visualization and for training.

#### Image Segmentation

For segmentation we will use a pretrained model that is specific to cell segmentation. Read more about it at https://github.com/CellProfiling/HPA-Cell-Segmentation/

In [None]:
import warnings
warnings.filterwarnings('ignore')

import hpacellseg.cellsegmentator as cellsegmentator
from hpacellseg.utils import label_cell, label_nuclei

from sklearn.preprocessing import MultiLabelBinarizer
from array import array

NUC_MODEL = "../input/processed-hpa-data/nuclei-model.pth"
CELL_MODEL = "../input/processed-hpa-data/cell-model.pth"

segmentator = cellsegmentator.CellSegmentator(
    NUC_MODEL,
    CELL_MODEL,
    scale_factor=0.25,
    padding=False,
    multi_channel_model=True,
)

nuc_segmentations = segmentator.pred_nuclei([img])

f, ax = plt.subplots(1, 2, figsize=(10,10))
ax[0].imshow(img)
ax[0].set_title('Original Nuclei', size=20)
ax[1].imshow(nuc_segmentations[0])
ax[1].set_title('Segmented Nuclei', size=20)
plt.show()

# Cell segmentation
inter_step = [[i] for i in [red, green, blue]]
cell_segmentations = segmentator.pred_cells(inter_step)

f, ax = plt.subplots(1, 2, figsize=(10,10))
ax[0].imshow(cv2.merge((red, green, blue)))
ax[0].set_title('Original Cells', size=20)
ax[1].imshow(cell_segmentations[0])
ax[1].set_title('Segmented Cells', size=20)
plt.show()

Let's apply masking on above two segmented images so that we can visualize each cell individually

In [None]:
nuclei_mask = label_nuclei(nuc_segmentations[0])
# Cell masks
cell_nuclei_mask, cell_mask = label_cell(nuc_segmentations[0], cell_segmentations[0])
# Plotting
f, ax = plt.subplots(1, 3, figsize=(16,16))
ax[0].imshow(nuclei_mask)
ax[0].set_title('Nuclei Mask', size=20)
ax[1].imshow(cell_nuclei_mask)
ax[1].set_title('Cell Nuclei Mask', size=20)
ax[2].imshow(cell_mask)
ax[2].set_title('Cell Mask', size=20)
plt.show()

Visualizing each cell individually.

In [None]:
# Unique vector of cell_mask numbers
numbers = set(np.ravel(cell_mask))
numbers.remove(0)

fig = plt.figure(figsize=(20,len(numbers)))
index = 1

ax = fig.add_subplot(len(numbers)//5+1, 5, index)
ax.set_title("Complete Cell Mask", size=16)
plt.imshow(cell_mask)

index += 1
for number in numbers:
    isolated_cell = np.where(cell_mask==number, cell_mask, 0)
    ax = fig.add_subplot(len(numbers)//5+1, 5, index)
    ax.set_title(f"Segment {number}", size=16)
    plt.imshow(isolated_cell)
    index += 1

Let's segment all the training images, crop the individual cells from each of them together with their corresponding labels, and create a new image dataset. We will use it to train our model and to visualize each label individually.

In [None]:
# function to crop segmented cells from given images

def get_cropped_cell(img, msk):
    bmask = msk.astype(int)[...,None]
    masked_img = img * bmask
    true_points = np.argwhere(bmask)
    top_left = true_points.min(axis=0)
    bottom_right = true_points.max(axis=0)
    cropped_arr = masked_img[top_left[0]:bottom_right[0]+1,top_left[1]:bottom_right[1]+1]
    return cropped_arr

In [None]:
# return the per-channel mean and per-channel mean of squares of a cropped cell (pixel values scaled to 0-1)

def get_stats(cropped_cell):
    x = (cropped_cell/255.0).reshape(-1,3).mean(0)
    x2 = ((cropped_cell/255.0)**2).reshape(-1,3).mean(0)
    return x, x2
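The two quantities returned by `get_stats` are enough to recover a standard deviation later, through the identity std = sqrt(E[x^2] - E[x]^2); the statistics block further down relies on it. Here is a minimal single-channel sketch in plain Python; the helper names and pixel values are made up for illustration.

```python
import math
import statistics

def mean_and_mean_sq(values):
    """Return (E[x], E[x^2]) for a list of pixel values."""
    n = len(values)
    return sum(values) / n, sum(v * v for v in values) / n

def std_from_moments(mean, mean_sq):
    """Standard deviation from the first two moments."""
    # max() guards against a tiny negative value caused by rounding
    return math.sqrt(max(mean_sq - mean ** 2, 0.0))

pixels = [0.0, 0.25, 0.5, 1.0]  # invented values already scaled to 0-1
m, m2 = mean_and_mean_sq(pixels)
std = std_from_moments(m, m2)

# Agrees with the population standard deviation computed directly
print(m, m2, std, statistics.pstdev(pixels))
```

One caveat: averaging per-cell means gives every cell equal weight regardless of its size, so the dataset-level figure is an approximation of the true per-pixel statistics.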

In [None]:
ROOT = '../input/hpa-single-cell-image-classification'
def read_img(image_id, color, train_or_test='train', image_size=None):
    filename = f'{ROOT}/{train_or_test}/{image_id}_{color}.png'
    assert os.path.exists(filename), f'not found {filename}'
    img = cv2.imread(filename, cv2.IMREAD_UNCHANGED)
    if image_size is not None:
        img = cv2.resize(img, (image_size, image_size))
    if img.max() > 255:
        img_max = img.max()
        img = (img/255).astype('uint8')
    return img
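One detail of `read_img` is worth checking on your own data. If the PNG files are 16-bit (an assumption here, not verified in this notebook), the largest possible value is 65535, and 65535 / 255 = 257, which does not fit into `uint8`. Dividing by 257 instead maps the full 16-bit range onto 0-255. The sketch below shows the arithmetic on single values; `to_uint8_range` is a hypothetical helper.

```python
def to_uint8_range(value_16bit):
    """Map a 16-bit intensity (0..65535) onto 0..255, using 65535 = 255 * 257."""
    return value_16bit // 257

for v in (0, 256, 257, 32768, 65535):
    # third column: result of the division by 255 used in read_img
    print(v, to_uint8_range(v), v / 255)
```

For a NumPy image the equivalent would be `(img // 257).astype('uint8')`. Whether this matters in practice depends on how bright the brightest pixels in the dataset are.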

The code cell below takes around 8 hours to complete. I have already run it and saved the output, so I will use the saved data directly.

In [None]:
"""
x_tot,x2_tot = [],[]
lbls = []
num_files = len(train_dfs)
all_cells = []
cell_mask_dir = '../input/hpa-mask/hpa_cell_mask'
train_or_test = 'train'

with zipfile.ZipFile('cells.zip', 'w') as img_out:

    for idx in tqdm(range(1606, 2000)):
        image_id = train_dfs.iloc[idx].ID
        labels = train_dfs.iloc[idx].Label
        cell_mask = np.load(f'{cell_mask_dir}/{image_id}.npz')['arr_0']
        red = read_img(image_id, "red", train_or_test, None)
        green = read_img(image_id, "green", train_or_test, None)
        blue = read_img(image_id, "blue", train_or_test, None)
        #yellow = read_img(image_id, "yellow", train_or_test, image_size)
        stacked_image = np.transpose(np.array([blue, green, red]), (1,2,0))

        for cell in range(1, np.max(cell_mask) + 1):
            bmask = cell_mask == cell
            cropped_cell = get_cropped_cell(stacked_image, bmask)
            fname = f'{image_id}_{cell}.jpg'
            im = cv2.imencode('.jpg', cropped_cell)[1]
            img_out.writestr(fname, im)
            x, x2 = get_stats(cropped_cell)
            x_tot.append(x)
            x2_tot.append(x2)
            all_cells.append({
                'image_id': image_id,
                'r_mean': x[0],
                'g_mean': x[1],
                'b_mean': x[2],
                'cell_id': cell,
                'image_labels': labels,
                'size1': cropped_cell.shape[0],
                'size2': cropped_cell.shape[1],
            })

#image stats
img_avr =  np.array(x_tot).mean(0)
img_std =  np.sqrt(np.array(x2_tot).mean(0) - img_avr**2)
cell_train_df = pd.DataFrame(all_cells)
cell_train_df.to_csv('cell_train_df.csv', index=False)
print('mean:',img_avr, ', std:', img_std)
"""
""

In [None]:
# loading processed train csv data
train_df = pd.read_csv('../input/processed-hpa-data/cell_train_df.csv')
train_df.head()

Now that we have an individual image for each cell, let's visualize an example image for each label

In [None]:
# Choosing one image id for each label

image_label_0 = train_df['image_id'][train_df['image_labels']=='0'].iloc[0]
image_label_1 = train_df['image_id'][train_df['image_labels']=='1'].iloc[0]
image_label_2 = train_df['image_id'][train_df['image_labels']=='2'].iloc[0]
image_label_3 = train_df['image_id'][train_df['image_labels']=='3'].iloc[0]
image_label_4 = train_df['image_id'][train_df['image_labels']=='4'].iloc[0]
image_label_5 = train_df['image_id'][train_df['image_labels']=='5'].iloc[0]
image_label_6 = train_df['image_id'][train_df['image_labels']=='6'].iloc[0]
image_label_7 = train_df['image_id'][train_df['image_labels']=='7'].iloc[0]
image_label_8 = train_df['image_id'][train_df['image_labels']=='8'].iloc[0]
image_label_9 = train_df['image_id'][train_df['image_labels']=='9'].iloc[0]
image_label_10 = train_df['image_id'][train_df['image_labels']=='10'].iloc[0]
image_label_11 = train_df['image_id'][train_df['image_labels']=='11'].iloc[0]
image_label_12 = train_df['image_id'][train_df['image_labels']=='12'].iloc[0]
image_label_13 = train_df['image_id'][train_df['image_labels']=='13'].iloc[0]
image_label_14 = train_df['image_id'][train_df['image_labels']=='14'].iloc[0]
image_label_15 = train_df['image_id'][train_df['image_labels']=='15'].iloc[0]
image_label_16 = train_df['image_id'][train_df['image_labels']=='16'].iloc[0]
image_label_17 = train_df['image_id'][train_df['image_labels']=='17'].iloc[0]
image_label_18 = train_df['image_id'][train_df['image_labels']=='18'].iloc[0]

In [None]:
import matplotlib.image as mpimg
image_ids = [image_label_0, image_label_1 ,image_label_2, image_label_3,image_label_4 ,image_label_5,image_label_6,
             image_label_7, image_label_8, image_label_9, image_label_10, image_label_11, image_label_12,
             image_label_13, image_label_14, image_label_15, image_label_16, image_label_17, image_label_18]

plt.figure(figsize=(20,25))
for i, label in Labels.items():
    plt.subplot(6,5,int(i)+1)
    img = mpimg.imread('../input/processed-hpa-data/cells/cells/'+image_ids[int(i)]+'_1.jpg')
    imgplot = plt.imshow(img)
    plt.title(label, fontsize=15)
    plt.xticks([])
    plt.yticks([])

plt.suptitle('Individual Cells with labels',fontsize=20)
plt.show()

### Test Data

&emsp;&emsp;&emsp; The test data contains only images, for which the labels have to be predicted.

### Model Creation and Training

We will use the fastai library to create our model. The classification model needs images of individual cells as input. Due to system limitations we will not use all the images for training; instead we have created a balanced sample dataset to train the model. We use the RGB channels only, which proved to work well in the previous HPA challenge. We have already saved the extracted cells as RGB jpg images so that we can feed them easily into our classifier.
    
We will first train three models based on three different pretrained architectures, ResNet, EfficientNet and DenseNet, and after training we will compare them to see which one performs best.
   
1. **Resnet**: A residual neural network (ResNet) is an artificial neural network (ANN) of a kind that builds on constructs known from pyramidal cells in the cerebral cortex. Residual neural networks do this by utilizing skip connections, or shortcuts, to jump over some layers. Typical ResNet models are implemented with double- or triple-layer skips that contain nonlinearities (ReLU) and batch normalization in between. An additional weight matrix may be used to learn the skip weights; these models are known as HighwayNets.   

2. **EfficientNet**: EfficientNet is a convolutional neural network architecture and scaling method that uniformly scales all dimensions of depth/width/resolution using a compound coefficient. Unlike conventional practice, which scales these factors arbitrarily, the EfficientNet scaling method scales network width, depth, and resolution together with a set of fixed scaling coefficients.

3. **Densenet**: DenseNet is one of the new discoveries in neural networks for visual object recognition. DenseNet is quite similar to ResNet with some fundamental differences. ResNet uses an additive method (+) that merges the previous layer (identity) with the future layer, whereas DenseNet concatenates (.) the output of the previous layer with the future layer. 
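The difference between the additive (ResNet) and concatenating (DenseNet) connections can be shown on plain lists. This is only a toy illustration: `layer` is a stand-in for a learned layer, not a real network component, and real blocks operate on feature maps rather than short lists.

```python
def layer(features):
    """Stand-in for a learned layer: here it simply doubles each value."""
    return [2 * v for v in features]

def residual_block(x):
    # ResNet-style: add the layer output to the input (length is unchanged)
    return [a + b for a, b in zip(x, layer(x))]

def dense_block(x):
    # DenseNet-style: concatenate the input with the layer output (length grows)
    return x + layer(x)

x = [1.0, 2.0, 3.0]
print(residual_block(x))  # [3.0, 6.0, 9.0]
print(dense_block(x))     # [1.0, 2.0, 3.0, 2.0, 4.0, 6.0]
```

The growing length in `dense_block` is why each DenseNet layer can see the features of all earlier layers, while a residual block keeps the feature size fixed.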

In [None]:
labels = [str(i) for i in range(19)]
for x in labels: 
    train_df[x] = train_df['image_labels'].apply(lambda r: int(x in r.split('|')))
train_df.head()

In [None]:
# test data for checking the performance of our model after training

test_performance_df_0 = train_df[train_df['image_labels'] == '0'].sample(n=25, random_state=42).reset_index(drop=True)
test_performance_df_1 = train_df[train_df['image_labels'] == '1'].sample(n=25, random_state=42).reset_index(drop=True)
test_performance_df_2 = train_df[train_df['image_labels'] == '2'].sample(n=25, random_state=42).reset_index(drop=True)
test_performance_df_3 = train_df[train_df['image_labels'] == '3'].sample(n=25, random_state=42).reset_index(drop=True)
test_performance_df_4 = train_df[train_df['image_labels'] == '4'].sample(n=25, random_state=42).reset_index(drop=True)
test_performance_df_5 = train_df[train_df['image_labels'] == '5'].sample(n=25, random_state=42).reset_index(drop=True)
test_performance_df_6 = train_df[train_df['image_labels'] == '6'].sample(n=25, random_state=42).reset_index(drop=True)
test_performance_df_7 = train_df[train_df['image_labels'] == '7'].sample(n=25, random_state=42).reset_index(drop=True)
test_performance_df_8 = train_df[train_df['image_labels'] == '8'].sample(n=25, random_state=42).reset_index(drop=True)
test_performance_df_9 = train_df[train_df['image_labels'] == '9'].sample(n=25, random_state=42).reset_index(drop=True)
test_performance_df_10 = train_df[train_df['image_labels'] == '10'].sample(n=25, random_state=42).reset_index(drop=True)
test_performance_df_11 = train_df[train_df['image_labels'] == '10|11'].sample(n=25, random_state=42).reset_index(drop=True)
test_performance_df_11['image_labels'] = ['11']*25
test_performance_df_12 = train_df[train_df['image_labels'] == '12'].sample(n=25, random_state=42).reset_index(drop=True)
test_performance_df_13 = train_df[train_df['image_labels'] == '13'].sample(n=25, random_state=42).reset_index(drop=True)
test_performance_df_14 = train_df[train_df['image_labels'] == '14'].sample(n=25, random_state=42).reset_index(drop=True)
test_performance_df_15 = train_df[train_df['image_labels'] == '15'].sample(n=25, random_state=42).reset_index(drop=True)
test_performance_df_16 = train_df[train_df['image_labels'] == '16'].sample(n=25, random_state=42).reset_index(drop=True)
test_performance_df_17 = train_df[train_df['image_labels'] == '17'].sample(n=25, random_state=42).reset_index(drop=True)
test_performance_df_18 = train_df[train_df['image_labels'] == '18'].sample(n=25, random_state=42).reset_index(drop=True)

test_performance_df_ = [test_performance_df_0, test_performance_df_1, test_performance_df_2, test_performance_df_3, test_performance_df_4, test_performance_df_5,
                      test_performance_df_6, test_performance_df_7, test_performance_df_8, test_performance_df_9, test_performance_df_10, test_performance_df_11,
                      test_performance_df_12, test_performance_df_13, test_performance_df_14, test_performance_df_15, test_performance_df_16, test_performance_df_17,
                      test_performance_df_18]

test_performance_df = pd.concat(test_performance_df_, ignore_index=True)
test_performance_df.drop_duplicates(inplace=True, ignore_index=True)

test_performance_df.head()

In [None]:
# 20% sampled dataset 
train_dfs = train_df.sample(frac=0.20, random_state=42)
train_dfs = train_dfs.reset_index(drop=True)
len(train_dfs)

Let's compare how often each label occurs on its own (individual count) with its total number of occurrences in our sampled dataset.
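
To make the difference between the two counts concrete, here is a minimal sketch on a made-up list of label strings in the same `a|b` format as `image_labels`. The values in `toy_labels` are invented for illustration and are not taken from the dataset.

```python
from collections import Counter

# Made-up label strings in the same 'a|b' format as the image_labels column.
toy_labels = ['0', '0|11', '10|11', '11|16', '16', '0']

# Individual count: rows whose label string is exactly one class.
individual = Counter(s for s in toy_labels if '|' not in s)

# Total count: every row in which the class appears, alone or with others.
total = Counter(lbl for s in toy_labels for lbl in s.split('|'))

for lbl in sorted(total, key=int):
    print(lbl, 'total =', total[lbl], 'individual =', individual[lbl])
```

In this toy list, label 11 only ever appears together with other labels, so its individual count is zero even though its total count is not.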

In [None]:
unique_counts = {}
for lbl in labels:
    unique_counts[lbl] = len(train_dfs[train_dfs.image_labels == lbl])

full_counts = {}
for lbl in labels:
    count = 0
    for row_label in train_dfs['image_labels']:
        if lbl in row_label.split('|'): count += 1
    full_counts[lbl] = count
    
counts = list(zip(full_counts.keys(), full_counts.values(), unique_counts.values()))
counts = np.array(sorted(counts, key=lambda x:-x[1]))
counts = pd.DataFrame(counts, columns=['label', 'Total Count', 'Individual Count'])
counts.set_index('label').T

Class label 11 has 0 individual counts, which may cause a zero-division error. To handle this, we will add 10 records for label 11, taken from multi-label records that contain label 11 (for example '10|11').

In [None]:
df_11 = train_df[train_df.image_labels.str.contains('11')][0:10]
df_11['image_labels'] = ['11']*10
train_dfs = pd.concat((train_dfs, df_11))

unique_counts = {}
for lbl in labels:
    unique_counts[lbl] = len(train_dfs[train_dfs.image_labels == lbl])

full_counts = {}
for lbl in labels:
    count = 0
    for row_label in train_dfs['image_labels']:
        if lbl in row_label.split('|'): count += 1
    full_counts[lbl] = count
    
counts = list(zip(full_counts.keys(), full_counts.values(), unique_counts.values()))
counts = np.array(sorted(counts, key=lambda x:-x[1]))
counts = pd.DataFrame(counts, columns=['label', 'Total Count', 'Individual Count'])
counts.set_index('label').T

Now Label 11 has 10 individual counts.

Splitting the sampled training data into training and validation sets using multilabel stratification, so that the class labels are balanced across both sets.
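
One way to sanity-check a stratified split is to compare the share of each label per fold. The sketch below does this on a small made-up multi-hot matrix with hand-assigned folds. It does not perform the stratification itself, and `y_toy`, `fold_toy` and `label_share_per_fold` are hypothetical names used only for illustration.

```python
import numpy as np

# Made-up multi-hot labels: 8 samples x 3 labels.
y_toy = np.array([
    [1, 0, 0], [1, 0, 0],
    [0, 1, 0], [0, 1, 0],
    [1, 1, 0], [1, 1, 0],
    [0, 0, 1], [0, 0, 1],
])
# Hand-assigned fold index for each sample (two folds).
fold_toy = np.array([0, 1, 0, 1, 0, 1, 0, 1])

def label_share_per_fold(y, fold):
    """Return one row per fold holding the fraction of samples carrying each label."""
    return np.stack([y[fold == f].mean(axis=0) for f in np.unique(fold)])

shares = label_share_per_fold(y_toy, fold_toy)
print(shares)
```

If the rows of `shares` are close to each other, the labels are spread evenly across the folds.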

In [None]:
nfold = 5

y = train_dfs[labels].values
X = train_dfs[['image_id', 'cell_id']].values

train_dfs['fold'] = np.nan

from iterstrat.ml_stratifiers import MultilabelStratifiedKFold
mskf = MultilabelStratifiedKFold(n_splits=nfold, shuffle=True, random_state=None)
for i, (_, test_index) in enumerate(mskf.split(X, y)):
    train_dfs.iloc[test_index, -1] = i
    
train_dfs['fold'] = train_dfs['fold'].astype('int')

In [None]:
# mark fold 0 as the validation set (a single assignment avoids pandas chained indexing)
train_dfs['is_valid'] = train_dfs['fold'] == 0

In [None]:
train_dfs.is_valid.value_counts()

Now there are 29125 records in Train data and 7284 records in validation data.
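
As a quick arithmetic check, using one of five folds for validation should hold out about one fifth of the records. The sketch below only reuses the two counts quoted above.

```python
n_train, n_valid = 29125, 7284   # counts quoted above
nfold = 5                        # number of folds used for the split

total_records = n_train + n_valid
valid_share = n_valid / total_records

print(f'total records: {total_records}')
print(f'validation share: {valid_share:.4f} (expected about {1 / nfold:.4f})')
```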

In [None]:
# defining function to return the cropped-cell image path for a given record
def get_x(r): 
    return '../input/processed-hpa-data/cells/cells/'+(r['image_id']+'_'+str(r['cell_id'])+'.jpg')

# defining function to return the list of labels for a given record
def get_y(r): 
    return r['image_labels'].split('|')

In [None]:
# sample_stats = (image_array_mean, image_array_std) # one image have 3 channels # mean and std of all cell images.
sample_stats = ([0.07290461, 0.04505656, 0.07713918] , [0.1727259 , 0.10327134, 0.14257778])
item_tfms = RandomResizedCrop(224, min_scale=0.75, ratio=(1.,1.))
batch_tfms = [*aug_transforms(flip_vert=True, size=128, max_warp=0), Normalize.from_stats(*sample_stats)]
bs=128

In [None]:
# code to create batch dataset
dblock = DataBlock(blocks=(ImageBlock, MultiCategoryBlock(vocab=labels)),
                splitter=ColSplitter(col='is_valid'),
                get_x=get_x,
                get_y=get_y,
                item_tfms=item_tfms,
                batch_tfms=batch_tfms,
                )
dls = dblock.dataloaders(train_dfs, bs=bs)

In [None]:
dls.show_batch(nrows=3, ncols=3)

#### Training with Resnet

In [None]:
# copying pretrained models 
if not os.path.exists('/root/.cache/torch/hub/checkpoints/'):
        os.makedirs('/root/.cache/torch/hub/checkpoints/')

!cp ../input/models/resnet50-19c8e357.pth /root/.cache/torch/hub/checkpoints/
!cp ../input/models/densenet121-a639ec97.pth /root/.cache/torch/hub/checkpoints/
!cp ../input/models/adv-efficientnet-b7-4652b6dd.pth /root/.cache/torch/hub/checkpoints/

In [None]:
#creating our cnn model

res_learn = cnn_learner(dls, resnet50, metrics=[accuracy_multi, PrecisionMulti()]).to_fp16()

In [None]:
torch.cuda.empty_cache()
res_learn.lr_find()

The suggested learning rate is around 0.03.

In [None]:
lr=3e-2 # learning parameter
torch.cuda.empty_cache() # empty GPU cache memory
res_learn.fine_tune(5,base_lr=lr) # starting training with 5 epochs

After epoch 7 the gap between train and validation loss starts to increase, although both losses are still decreasing. To keep the model from overfitting, 10 epochs are enough; to increase the accuracy further, we should increase the fraction of training data. Currently we have trained our model on 20% of the total training data due to system limitations.

In [None]:
res_learn.recorder.plot_loss() # plotting train and validation loss

In [None]:
res_learn.save('hpa_resnet50_model') # saving our model

#### Training with Efficient net

In [None]:
# locating and importing the pretrained EfficientNet package
# We will use transfer learning with EfficientNet-B7 to train our learner

import sys  # needed for sys.path below

package_path = '../input/efficientnet-pytorch/EfficientNet-PyTorch/EfficientNet-PyTorch-master'
sys.path.append(package_path)

%cd /kaggle/input/efficientnet-pytorch/EfficientNet-PyTorch/EfficientNet-PyTorch-master
from efficientnet_pytorch import EfficientNet
%cd -

def get_learner(lr=1e-3):
    # Optimization function and parameters
    opt_func = partial(Adam, lr=lr, wd=0.01, eps=1e-8)    
    model = EfficientNet.from_pretrained("efficientnet-b7", advprop=True)
    # Set output layer
    model._fc = nn.Linear(2560, dls.c)
    # Group model, dataloader and metrics
    learn = Learner(
        dls, model, opt_func=opt_func,
        metrics=[accuracy_multi, PrecisionMulti()]
        ).to_fp16()
    return learn

In [None]:
# Initialize learner
effi_learn=get_learner()

In [None]:
# Finding the best value for the learning rate


torch.cuda.empty_cache()
effi_learn.lr_find()


In [None]:
# Training our model 

lr = 1e-3
torch.cuda.empty_cache()
effi_learn.fine_tune(5,base_lr=lr)


In [None]:
effi_learn.recorder.plot_loss()

In [None]:
effi_learn.save('hpa_effi-b7_model') # saving our model

#### Training with Densenet

In [None]:
dens_learn = cnn_learner(dls, models.densenet121, metrics=[accuracy_multi, PrecisionMulti()]).to_fp16()

In [None]:

torch.cuda.empty_cache() # empty GPU cache memory
dens_learn.lr_find() # finding best value for learning parameter to train our cnn model


In [None]:

lr=3e-2 # learning parameter
torch.cuda.empty_cache() # empty GPU cache memory
dens_learn.fine_tune(5,base_lr=lr) # starting training with 5 epochs


In [None]:
dens_learn.recorder.plot_loss()

In [None]:
dens_learn.save('hpa_densenet121_model') # saving our model

### Models Comparison

We have trained 3 models based on Resnet, Efficient net and Densenet to find out which one performs better. Each model was trained for 5 epochs.  
   
After 5 epochs the losses are:    
Resnet :- Train Loss = 0.132128 and Validation Loss = 0.132426  
Efficient net :- Train Loss = 0.095444 and Validation Loss = 0.121629     
Densenet :- Train Loss = 0.127250 and Validation Loss = 0.130601     
     
The difference between Train Loss and Validation Loss is smallest for the Resnet based model, which suggests it is a balanced model, neither overfitted nor underfitted. So we will use the Resnet based model to perform prediction on test images.
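
The comparison above can be reproduced with a few lines of arithmetic. The sketch below only uses the loss values quoted in this section; the dictionary layout and variable names are illustrative.

```python
# (train loss, validation loss) after 5 epochs, as quoted above.
losses = {
    'Resnet': (0.132128, 0.132426),
    'Efficient net': (0.095444, 0.121629),
    'Densenet': (0.127250, 0.130601),
}

# Gap between validation and train loss for each model.
gaps = {name: valid - train for name, (train, valid) in losses.items()}

smallest_gap = min(gaps, key=gaps.get)
lowest_valid = min(losses, key=lambda name: losses[name][1])

for name, gap in gaps.items():
    print(f'{name}: gap = {gap:.6f}')
print('smallest gap:', smallest_gap)
print('lowest validation loss:', lowest_valid)
```

The model with the lowest validation loss is reported separately because the two criteria need not agree; this notebook selects on the gap.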

##### Let's train the Resnet model for 10 epochs to get better results. Since we are training our model on only 20% of the data due to system limitations, its performance will be limited. To increase the performance of the model, train it on 100% of the data.

In [None]:
res_learn = cnn_learner(dls, resnet50, metrics=[accuracy_multi, PrecisionMulti(), RocAucMulti()]).to_fp16()
lr=3e-2 # learning parameter
torch.cuda.empty_cache() # empty GPU cache memory
res_learn.fine_tune(10,base_lr=lr) # starting training with 10 epochs

In [None]:
res_learn.recorder.plot_loss() # plotting train and validation loss

### Model Prediction on Test Data

In [None]:
# loading sample_submission.csv into a dataframe 
test_df = pd.read_csv('../input/hpa-single-cell-image-classification/sample_submission.csv')
test_df

The sample submission contains the image file names for which we have to make predictions. Looking at it, we see that we need to predict a string for each test image, which can be generated as follows.   

1. Segment each single cell contained in the image.
2. Predict the class label and confidence for each cell.
3. Then generate a string by encoding the mask of each segmented cell.

The structure of the prediction string is as follows:

ImageID,ImageWidth,ImageHeight,PredictionString

1. ImageAID,ImageAWidth,ImageAHeight,LabelA1 ConfidenceA1 EncodedMaskA1 LabelA2 ConfidenceA2 EncodedMaskA2 ...

2. ImageBID,ImageBWidth,ImageBHeight,LabelB1 ConfidenceB1 EncodedMaskB1 LabelB2 ConfidenceB2 EncodedMaskB2 …

Sample real values could be:

ID,ImageWidth,ImageHeight,PredictionString
1. 721568e01a744247,1118,1600,0 0.637833 eNqLi8xJM7BOTjS08DT2NfI38DfyM/Q3NMAJgJJ+RkBs7JecF5tnAADw+Q9I
2. 7b018c5e3a20daba,1600,1066,16 0.85117 eNqLiYrLN7DNCjDMMIj0N/Iz9DcwBEIDfyN/QyA2AAsBRfxMPcKTA1MMADVADIo=
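
The format above can be assembled with a simple join. This is a minimal sketch: `build_prediction_string` is a hypothetical helper, and the mask strings in the example are short placeholders rather than real encoded masks.

```python
def build_prediction_string(cells):
    """Join (label, confidence, encoded_mask) triples into one space-separated string."""
    return ' '.join(f'{label} {confidence} {mask}' for label, confidence, mask in cells)

# Two made-up cells belonging to one image; 'maskA' and 'maskB' are placeholders.
example = build_prediction_string([(0, 0.637833, 'maskA'), (16, 0.85117, 'maskB')])
print(example)
```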

Below is the code provided by the organisers to encode the segmentation mask.

The next two cells create the encoded string for each segmented cell in the images; the whole process takes around 1 hour to complete for all images. I have already run it, so I will use the saved data.
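
The last stage of the encoding (zlib compression followed by base64) can be tried in isolation with the standard library. The sketch below is an assumption-laden illustration: it uses placeholder bytes instead of real COCO RLE counts, and `pack` and `unpack` are hypothetical helper names, not part of the organisers' code.

```python
import base64
import zlib

# Placeholder bytes standing in for the COCO RLE "counts" field (not a real mask).
rle_counts = b'placeholder-rle-counts'

def pack(raw: bytes) -> str:
    """Compress with zlib, then base64-encode to an ASCII string."""
    compressed = zlib.compress(raw, zlib.Z_BEST_COMPRESSION)
    return base64.b64encode(compressed).decode('ascii')

def unpack(text: str) -> bytes:
    """Reverse of pack: base64-decode, then decompress."""
    return zlib.decompress(base64.b64decode(text))

packed = pack(rle_counts)
restored = unpack(packed)
print(packed)
```

Because the step is lossless, `unpack` recovers the original bytes, which is useful when checking saved encodings.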

In [None]:
"""
import base64
import numpy as np
from pycocotools import _mask as coco_mask
import typing as t
import zlib


def encode_binary_mask(mask: np.ndarray) -> t.Text:

  # check input mask --
  if mask.dtype != np.bool_:
    raise ValueError(
        "encode_binary_mask expects a binary mask, received dtype == %s" %
        mask.dtype)

  mask = np.squeeze(mask)
  if len(mask.shape) != 2:
    raise ValueError(
        "encode_binary_mask expects a 2d mask, received shape == %s" %
        mask.shape)

  # convert input mask to expected COCO API input --
  mask_to_encode = mask.reshape(mask.shape[0], mask.shape[1], 1)
  mask_to_encode = mask_to_encode.astype(np.uint8)
  mask_to_encode = np.asfortranarray(mask_to_encode)

  # RLE encode mask --
  encoded_mask = coco_mask.encode(mask_to_encode)[0]["counts"]

  # compress and base64 encoding --
  binary_str = zlib.compress(encoded_mask, zlib.Z_BEST_COMPRESSION)
  base64_str = base64.b64encode(binary_str)
  return base64_str.decode('ascii')
  
"""
''

Performing test image segmentation and generating encoding strings..

In [None]:
"""
x_tot,x2_tot = [],[]
lbls = []
num_files = len(test_df)
all_cells = []
train_or_test = 'test'
cell_mask_dir = 'F:/HPA/work/cell_masks'

with zipfile.ZipFile('F:/HPA/test_cells.zip', 'w') as img_out:

    for idx in tqdm(range(num_files)):
        image_id = test_df.iloc[idx].ID
        cell_mask = np.load(f'{cell_mask_dir}/{image_id}.npz')['arr_0']
        red = read_img(image_id, "red", train_or_test, None)
        green = read_img(image_id, "green", train_or_test, None)
        blue = read_img(image_id, "blue", train_or_test, None)
        #yellow = read_img(image_id, "yellow", train_or_test, image_size)
        stacked_image = np.transpose(np.array([blue, green, red]), (1,2,0))

        for j in range(1, np.max(cell_mask) + 1):
            bmask = (cell_mask == j)
            enc = encode_binary_mask(bmask)
            cropped_cell = get_cropped_cell(stacked_image, bmask)
            fname = f'{image_id}_{j}.jpg'
            im = cv2.imencode('.jpg', cropped_cell)[1]
            img_out.writestr(fname, im)
            x, x2 = get_stats(cropped_cell)
            x_tot.append(x)
            x2_tot.append(x2)
            all_cells.append({
                'image_id': image_id,
                'fname': fname,
                'r_mean': x[0],
                'g_mean': x[1],
                'b_mean': x[2],
                'cell_id': j,
                'size1': cropped_cell.shape[0],
                'size2': cropped_cell.shape[1],
                'enc': enc,
            })

#image stats
img_avr =  np.array(x_tot).mean(0)
img_std =  np.sqrt(np.array(x2_tot).mean(0) - img_avr**2)
cell_test_df = pd.DataFrame(all_cells)
cell_test_df.to_csv('F:/HPA/cell_test_df.csv', index=False)
print('mean:',img_avr, ', std:', img_std)
"""
""

In [None]:
# loading saved data for encoded cell segment 
cell_test_df = pd.read_csv('../input/processed-hpa-data/cell_test_df.csv')
cell_test_df.head()

In [None]:
# code to create batch dataset
test_dl = res_learn.dls.test_dl(cell_test_df)
torch.cuda.empty_cache()
test_dl.show_batch()

Predicting labels for the cells present in all test images.

In [None]:
# performing prediction
preds, _ = res_learn.get_preds(dl=test_dl) 

In [None]:
preds.shape

In [None]:
# saving prediction in a file
with open('preds.pickle', 'wb') as handle:
    pickle.dump(preds, handle)

In [None]:
cls_prds = torch.argmax(preds, dim=-1)
len(cls_prds), cls_prds

Creating submission file..

In [None]:
sample_submission = pd.read_csv('../input/hpa-single-cell-image-classification/sample_submission.csv')
sample_submission.head()

In [None]:
# combining predicted labels and encoded string (confidence is fixed at 1 for every cell)
cell_test_df['cls'] = np.array(cls_prds)
cell_test_df['pred'] = cell_test_df['cls'].astype(str) + ' 1 ' + cell_test_df['enc']
cell_test_df.head()

In [None]:
# Grouping cell records by the image from which the segmented cells were cropped.
subm = cell_test_df.groupby(['image_id'])['pred'].apply(lambda x: ' '.join(x)).reset_index()
subm.head()

In [None]:
sub = pd.merge(sample_submission,subm,how="left",left_on='ID',right_on='image_id')
sub.head()

In [None]:
# copy our predictions into PredictionString; rows without a prediction keep the sample value
has_pred = sub['pred'].notna()
sub.loc[has_pred, 'PredictionString'] = sub.loc[has_pred, 'pred']

In [None]:
sub = sub[sample_submission.columns]
sub.head()

In [None]:
sub.to_csv('submission.csv', index=False)

### Model Performance on Test data or unseen data

To check the performance of the model we need labeled, unseen data. So we will use the 25 records per label that we separated from the training data before training.
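
A quick way to confirm that the held-out records are really unseen is to look for (image_id, cell_id) pairs shared with the training sample. The sketch below is illustrative only: `overlapping_cells` is a hypothetical helper and the identifiers are made up.

```python
def overlapping_cells(held_out, training):
    """Return the (image_id, cell_id) pairs present in both collections."""
    return set(held_out) & set(training)

# Made-up identifiers for illustration only.
held_out_pairs = [('img_a', 1), ('img_b', 2)]
training_pairs = [('img_b', 2), ('img_c', 3)]

overlap = overlapping_cells(held_out_pairs, training_pairs)
print(len(overlap), 'held-out cells also appear in the training sample')
```

With DataFrames, the pairs could be built with `zip(df['image_id'], df['cell_id'])`; an empty result would mean no held-out cell was used for training.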

In [None]:
test_performance_df.head()

In [None]:
test_performance_df.shape

In [None]:
# code to create batch dataset
test_dl = res_learn.dls.test_dl(test_performance_df)
torch.cuda.empty_cache()
test_dl.show_batch()

In [None]:
# performing predictions
predictions, _ = res_learn.get_preds(dl=test_dl)

In [None]:
print(predictions)

In [None]:
# Converting predicted probabilities into class labels
cls_predictions = torch.argmax(predictions, dim=-1)
len(cls_predictions), cls_predictions

In [None]:
cls_predictions = np.array(cls_predictions)

In [None]:
from sklearn.metrics  import accuracy_score, confusion_matrix

In [None]:
# Calculating accuracy score
true_label = [int(i) for i in test_performance_df['image_labels']]
accuracy_score(true_label, cls_predictions)

We got 53.26% accuracy, which is low. This is because we trained our model on only 20% of the given training data due to system limitations. To increase the prediction accuracy, train the model on the whole training dataset.

In [None]:
# Plotting confusion matrix to check the prediction accuracy for each class

import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(15,10))
cfm = confusion_matrix(true_label, cls_predictions)
sns.heatmap(cfm, annot=True)
plt.show()

**Let's interpret the confusion matrix**  
1. For class label '0'    
Out of 25 records, 7 are predicted correctly. That means 28% accuracy for class '0'.
2. For class label '1'   
Out of 25 records, 19 are predicted correctly. That means 76% accuracy for class '1'.
3. For class label '2'    
Out of 25 records, 13 are predicted correctly. That means 52% accuracy for class '2'.    
    
and so on....
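
The per-class figures above are the diagonal of the confusion matrix divided by the row sums. The sketch below shows the calculation on a hypothetical 3-class matrix: the diagonal follows the three counts quoted above, while the off-diagonal values are made up so that each row sums to 25. The real matrix in this notebook has one row per class label.

```python
import numpy as np

# Hypothetical confusion matrix (rows = true class, columns = predicted class).
cfm_toy = np.array([
    [ 7, 10,  8],
    [ 3, 19,  3],
    [ 6,  6, 13],
])

# Correct predictions per class divided by the number of records of that class.
per_class_accuracy = cfm_toy.diagonal() / cfm_toy.sum(axis=1)

for cls, acc in enumerate(per_class_accuracy):
    print(f"class '{cls}': {acc:.0%}")
```

The same two operations applied to `cfm` would give the accuracy for every class at once.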

#### NOTE:- We have trained our model on only 20% of the given training data due to system limitations. To increase the prediction accuracy, train the model on the whole training dataset.