# Malaria Bounding Boxes

Dataset courtesy - https://www.kaggle.com/kmader/malaria-bounding-boxes

## Context

Malaria is a disease caused by Plasmodium parasites that remains a major threat in global health, affecting 200 million people and causing 400,000 deaths a year. The main species of malaria that affect humans are Plasmodium falciparum and Plasmodium vivax.

For malaria as well as other microbial infections, manual inspection of thick and thin blood smears by trained microscopists remains the gold standard for parasite detection and stage determination because of its low reagent and instrument cost and high flexibility. Despite manual inspection being extremely low throughput and susceptible to human bias, automatic counting software remains largely unused because of the wide range of variations in brightfield microscopy images. However, a robust automatic counting and cell classification solution would provide enormous benefits due to faster and more accurate quantitative results without human variability; researchers and medical professionals could better characterize stage-specific drug targets and better quantify patient reactions to drugs.

Previous attempts to automate the process of identifying and quantifying malaria have not gained major traction, partly due to the difficulty of replication, comparison, and extension. Authors also rarely make their image sets available, which precludes replication of results and assessment of potential improvements. The lack of both a standard set of images and a standard set of metrics for reporting results has impeded the field.


## Content

Images are in .png or .jpg format. There are 3 sets of images consisting of 1364 images (~80,000 cells) with different researchers having prepared each one: from Brazil (Stefanie Lopes), from Southeast Asia (Benoit Malleret), and time course (Gabriel Rangel). Blood smears were stained with Giemsa reagent.

### Labels

The data consists of two classes of uninfected cells (RBCs and leukocytes) and four classes of infected cells (gametocytes, rings, trophozoites, and schizonts). Annotators were permitted to mark some cells as difficult if they were not clearly in one of the cell classes. The data is heavily imbalanced: uninfected RBCs make up over 95% of all cells, far outnumbering uninfected leukocytes and infected cells.

A class label and set of bounding box coordinates were given for each cell. For all data sets, infected cells were given a class label by Stefanie Lopes, malaria researcher at the Dr. Heitor Vieira Dourado Tropical Medicine Foundation hospital, indicating stage of development or marked as difficult.


## Problem statement

This is an object detection problem: given a test image, the algorithm has to determine the following:
1. The objects/cells present in the images of the blood smears
2. A class label along with the class probability assigned to each of the objects/cells detected in the image
3. The location, determined by the bounding box coordinates, for each of the objects detected 
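
The three outputs listed above can be pictured as one record per detected cell. The snippet below is a minimal sketch of such a record, not the format of any particular detector: the `Detection` class, the `filter_detections` helper, the 0.5 threshold, the `"trophozoite"` label string, and all scores and coordinates are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Detection:
    """One detected cell (hypothetical container, not part of the dataset)."""
    category: str  # predicted class label, e.g. "red blood cell"
    score: float   # class probability in [0, 1]
    r_min: int     # top row of the bounding box
    c_min: int     # left column of the bounding box
    r_max: int     # bottom row of the bounding box
    c_max: int     # right column of the bounding box


def filter_detections(detections: List[Detection], min_score: float = 0.5) -> List[Detection]:
    """Keep only detections whose class probability reaches min_score."""
    return [d for d in detections if d.score >= min_score]


# Illustrative values only: scores and coordinates are made up.
predictions = [
    Detection("red blood cell", 0.97, 100, 200, 220, 330),
    Detection("trophozoite", 0.31, 400, 500, 520, 640),
]
confident = filter_detections(predictions, min_score=0.5)
print([d.category for d in confident])
```

Keeping the label, probability, and box together in one record makes it straightforward to threshold or sort detections before evaluating them.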

## Exploratory Data Analysis

### 1. Importing the libraries

In [68]:
import numpy as np
import os
import json
import cv2
import matplotlib.pyplot as plt
from collections import Counter

1. How many images are there?
2. How many objects are there, and what is the count for each?
3. What is the average number of objects per image?

In [25]:
data_subdir = 'data/malaria'
# The data directory sits one level above the current working directory
abs_path = os.path.abspath(os.path.join(os.getcwd(), os.pardir))
data_dir = os.path.join(abs_path, data_subdir)
if os.path.isdir(data_dir):
    print('The directory path is valid')

The directory path is valid


In [34]:
# Walk the data directory and list the sub-directories found at each level
for dirpath, dirnames, _ in os.walk(data_dir):
    print(f"The dir path is --{dirpath.replace(abs_path, '')} and the dir name is --{dirnames}")

The dir path is --/data/malaria and the dir name is --['images']
The dir path is --/data/malaria/images and the dir name is --[]


In [42]:
with open(os.path.join(data_dir,'training.json'),'r') as f:
    train_data= json.load(f)

In [46]:
print(f"The number of images in the train data are {len(train_data)}")

The number of images in the train data are 1208


In [66]:
train_data[9]

{'image': {'checksum': '8b5cb906538df0f49df5e72efef40eaa',
  'pathname': '/images/bbf687b5-c6f9-4821-b2e5-a25df1acba47.png',
  'shape': {'r': 1200, 'c': 1600, 'channels': 3}},
 'objects': [{'bounding_box': {'minimum': {'r': 971, 'c': 1066},
    'maximum': {'r': 1095, 'c': 1189}},
   'category': 'red blood cell'},
  {'bounding_box': {'minimum': {'r': 1079, 'c': 847},
    'maximum': {'r': 1191, 'c': 969}},
   'category': 'red blood cell'},
  {'bounding_box': {'minimum': {'r': 110, 'c': 806},
    'maximum': {'r': 235, 'c': 958}},
   'category': 'red blood cell'},
  {'bounding_box': {'minimum': {'r': 316, 'c': 771},
    'maximum': {'r': 432, 'c': 929}},
   'category': 'red blood cell'},
  {'bounding_box': {'minimum': {'r': 21, 'c': 1061},
    'maximum': {'r': 146, 'c': 1191}},
   'category': 'red blood cell'},
  {'bounding_box': {'minimum': {'r': 798, 'c': 1392},
    'maximum': {'r': 927, 'c': 1531}},
   'category': 'red blood cell'},
  {'bounding_box': {'minimum': {'r': 703, 'c': 377},
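
Each annotation in the record above stores its box as a `minimum` and a `maximum` corner, each with an `r` and a `c` key. The sketch below converts that layout into the more common `(x_min, y_min, x_max, y_max)` form. It assumes that `r` is the row (vertical axis) and `c` is the column (horizontal axis), which matches the `shape` entry of 1200 rows by 1600 columns; the helper names `box_to_xyxy` and `box_size` are hypothetical.

```python
def box_to_xyxy(bounding_box):
    """Convert a {'minimum': {'r', 'c'}, 'maximum': {'r', 'c'}} box
    to (x_min, y_min, x_max, y_max), assuming r = row (y) and c = column (x)."""
    lo, hi = bounding_box["minimum"], bounding_box["maximum"]
    return lo["c"], lo["r"], hi["c"], hi["r"]


def box_size(bounding_box):
    """Return (width, height) of the box in pixels."""
    x_min, y_min, x_max, y_max = box_to_xyxy(bounding_box)
    return x_max - x_min, y_max - y_min


# First object of the record shown above
first_box = {"minimum": {"r": 971, "c": 1066}, "maximum": {"r": 1095, "c": 1189}}
print(box_to_xyxy(first_box))  # (1066, 971, 1189, 1095)
print(box_size(first_box))     # (123, 124)
```

The `(x_min, y_min)` and `(x_max, y_max)` pairs are in the corner order that drawing calls such as `cv2.rectangle` expect.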
  

In [57]:
# Total number of annotated cells (objects) across all training images
no_cells = sum(len(i['objects']) for i in train_data)
print('The number of infectious/non-infectious cells in the training data are {}'.format(no_cells))

The number of infectious/non-infectious cells in the training data are 80113
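
The class imbalance described in the Content section can be checked by counting the `category` field of every object. The following is a minimal sketch that assumes records shaped like the entries of `training.json`. The `category_shares` helper and the toy `sample_records` list are hypothetical, and the `"ring"` label string is an assumption about how that class is spelled in the annotations.

```python
from collections import Counter


def category_shares(records):
    """Return {category: fraction of all objects}, most frequent first."""
    counts = Counter(obj["category"] for rec in records for obj in rec["objects"])
    total = sum(counts.values())
    return {cat: n / total for cat, n in counts.most_common()}


# Toy records in the same shape as training.json entries (values are made up)
sample_records = [
    {"objects": [{"category": "red blood cell"}] * 3},
    {"objects": [{"category": "red blood cell"}, {"category": "ring"}]},
]
shares = category_shares(sample_records)
for cat, share in shares.items():
    print(f"{cat}: {share:.1%}")
```

Calling `category_shares(train_data)` on the real annotations would show how dominant the red blood cell class is.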


### Checking the number of cells per image
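
With the totals reported above, 80113 cells over 1208 images works out to roughly 66.3 cells per image. A compact summary of the per-image counts can complement the full frequency table. The sketch below assumes records shaped like the entries of `training.json`; the `cells_per_image_summary` helper and the `toy_records` list are hypothetical.

```python
from statistics import mean, median


def cells_per_image_summary(records):
    """Summarise how many annotated objects each image contains."""
    counts = [len(rec["objects"]) for rec in records]
    return {
        "images": len(counts),
        "cells": sum(counts),
        "mean": mean(counts),
        "median": median(counts),
        "min": min(counts),
        "max": max(counts),
    }


# Toy records with 58, 57 and 67 objects each (values are made up)
toy_records = [{"objects": [{}] * n} for n in (58, 57, 67)]
summary = cells_per_image_summary(toy_records)
print(summary)
```

Running the same helper on `train_data` would answer the "average number of objects per image" question directly.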

In [70]:
Counter([len(i['objects']) for i in train_data]).most_common()

[(58, 25),
 (57, 24),
 (67, 23),
 (53, 22),
 (59, 22),
 (43, 21),
 (45, 21),
 (51, 21),
 (55, 20),
 (64, 20),
 (73, 20),
 (54, 19),
 (40, 19),
 (52, 19),
 (68, 19),
 (48, 18),
 (44, 18),
 (38, 18),
 (50, 17),
 (46, 17),
 (79, 17),
 (74, 16),
 (82, 16),
 (39, 16),
 (47, 16),
 (63, 15),
 (34, 15),
 (56, 15),
 (83, 15),
 (88, 14),
 (37, 14),
 (41, 14),
 (61, 13),
 (42, 13),
 (49, 13),
 (71, 13),
 (62, 13),
 (69, 13),
 (35, 13),
 (72, 13),
 (77, 13),
 (80, 13),
 (29, 12),
 (85, 12),
 (81, 12),
 (66, 12),
 (31, 11),
 (36, 11),
 (70, 11),
 (75, 11),
 (86, 11),
 (33, 11),
 (25, 11),
 (65, 11),
 (28, 10),
 (26, 10),
 (96, 10),
 (90, 10),
 (24, 9),
 (89, 9),
 (23, 9),
 (104, 9),
 (84, 8),
 (60, 8),
 (78, 8),
 (99, 8),
 (108, 8),
 (27, 8),
 (76, 8),
 (103, 8),
 (32, 7),
 (93, 7),
 (19, 7),
 (20, 6),
 (92, 6),
 (94, 6),
 (101, 6),
 (30, 6),
 (87, 6),
 (159, 6),
 (95, 6),
 (98, 6),
 (97, 6),
 (18, 5),
 (105, 5),
 (22, 5),
 (106, 5),
 (16, 5),
 (17, 5),
 (192, 4),
 (21, 4),
 (111, 4),
 (131, 4),
 (