# Exploration and analysis

In this exercise, you will explore and analyze a dataset containing images of Plasmodium vivax (P. vivax) infected blood smears. Plasmodium vivax is one of the parasites responsible for malaria, a disease that affects millions of people worldwide and causes over 400,000 deaths a year. The ability to accurately identify infected blood smears is crucial for the diagnosis and treatment of malaria.

### The dataset

> For malaria as well as other microbial infections, manual inspection of thick and thin blood smears by trained microscopists remains the gold standard for parasite detection and stage determination because of its low reagent and instrument cost and high flexibility. Despite manual inspection being extremely low throughput and susceptible to human bias, automatic counting software remains largely unused because of the wide range of variations in brightfield microscopy images. However, a robust automatic counting and cell classification solution would provide enormous benefits due to faster and more accurate quantitative results without human variability; researchers and medical professionals could better characterize stage-specific drug targets and better quantify patient reactions to drugs. [SOURCE](https://bbbc.broadinstitute.org/BBBC041/)

This dataset contains 1,328 images of blood smears featuring approximately 80,000 cells. Each image was prepared by one of three research centers, resulting in slight variations in picture quality (color, exposure, brightness, etc.). Cells within these images were annotated by hand with a class label and a set of bounding box coordinates. The data consists of two classes of uninfected cells (red blood cells and leukocytes) and four classes of infected cells (gametocytes, rings, trophozoites, and schizonts). Annotators were permitted to mark some cells as difficult if they did not clearly belong to one of the cell classes.

**Download the dataset [here](https://data.broadinstitute.org/bbbc/BBBC041/malaria.zip) and extract all files to a directory named `malaria` within the same directory as this notebook.** _This download is approximately 2.2GB, so make sure your computer has enough free space and a stable, fast internet connection!_

In [None]:
import os
import json

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from PIL import Image, ImageDraw
from notebook_checker import start_checks

# Start automatic globals checks
%start_checks

def image_formatter(image):
    """
    Formatter method that makes sure that images don't get displayed bigger than 400x300.
    Assumes that the aspect ratio is 4:3 (width:height).
    """
    width = min(image.width, 400)
    height = min(image.height, 300)
    
    return image.resize((width, height))._repr_png_()

# Get the formatters for png and jpeg, and overwrite them for PIL Image instances to decrease notebook size ~tenfold~
im_form_png = get_ipython().display_formatter.formatters['image/png']
im_form_png.for_type(Image.Image, image_formatter)

# We don't need both png and jpeg formats to be stored in the ipynb file, so we set jpeg to return None
im_form_jpeg = get_ipython().display_formatter.formatters['image/jpeg']
im_form_jpeg.for_type(Image.Image, lambda image: None)

# Goals for data exploration

When doing data exploration, you're usually trying to do several things at once, namely

1. Understand what the data is about and how it is structured

2. Try to get the data in a more usable format for further processing / programming

3. Look for notable things that stand out and might be worth investigating further

So, you're processing the data a little further with each step, while also trying to understand how it is structured. During this process, you should always be on the lookout for results that stand out. Such results can have very different possible causes, like:

* Outliers or even mislabeled data

* Errors you made in your code to handle the data

* Incorrect assumptions about the structure of the data

* Interesting patterns in the data you might want to investigate further

This means you should check your work at each step, and make sure you understand the results you get. If you don't understand a result, you should always investigate further. Ideally, you'd make a quick sanity check at each step, to confirm your results, which is what we'll try and do in this notebook.

## Preliminary analysis

Before you do any coding on a large collection of files, it is always a good idea to inspect the files and folders manually. That way, you can better interpret the results you get back from your code. Open up the folder `malaria` on your computer, and take a look at some of these images and the folder's structure.

**Q1. Describe the structure of the `malaria` folder and list anything that stands out.**

*Your answer goes here.*

### Counting files in each folder

Let's start by doing something simple that we can also verify manually: counting the number of files in a folder. You can reuse your `count_files()` function from the `os` introduction for this.

Check your function with the examples provided below. _Does the malaria directory contain 2 files, and do we get the expected number of images?_

In [None]:
def count_files(directory):
    # YOUR CODE HERE

data_dir = 'malaria'
print(f"The {data_dir} directory contains {count_files(data_dir)} files.")

image_dir = os.path.join(data_dir, 'images')
print(f"The {image_dir} directory contains {count_files(image_dir)} images.")

**Q2. Does this result make sense to you? What did you compare this result against to check it is correct?**

*Your answer goes here.*

## Loading the data

Now that we’ve confirmed the directory structure and confirmed that all images are present, let’s load the data from the JSON files. As we have already discovered, the dataset is split into a training and a test set. We'll start by loading these sets and inspecting their sizes to get an overview of the dataset.
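
As a reminder of how `json.load` behaves, here is a minimal sketch on a made-up file. The records and the file name `toy.json` are hypothetical and only loosely shaped like the dataset's entries; the point is that `json.load` expects an open file object rather than a path string.

```python
import json
import os
import tempfile

# Hypothetical toy records, only loosely shaped like the dataset's entries
records = [
    {"image": {"pathname": "/images/a.png"}, "objects": []},
    {"image": {"pathname": "/images/b.png"}, "objects": []},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy.json")

    # Write the toy records to disk so there is a file to read back
    with open(toy_path, "w") as f:
        json.dump(records, f)

    # json.load takes an open file object, not a path string
    with open(toy_path) as f:
        loaded = json.load(f)

# A JSON array becomes a Python list, so len() gives the number of records
print(type(loaded), len(loaded))
```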

In the cell below, load the 'training.json' and 'test.json' files using `json.load`.

In [None]:
# YOUR CODE HERE

print(f'Number of training records: {len(train)}')
print(f'Number of test records: {len(test)}')

**Q3. Does this result make sense to you? What did you compare this result against to check it is correct?**

*Your answer goes here.*

### Inspecting the JSON

Next, before we load the entire dataset into pandas, let's take a look at one of these JSON entries to get a better idea of what information we are going to be working with:

In [None]:
# Print the first record from the training set
train[0]

Let's break down the structure of the entries:
- **image**: This section contains metadata about the image.
  - **checksum**: A value that can be used to check the image was downloaded correctly.
  - **pathname**: The file path to the image.
  - **shape**: The dimensions of the image.
    - **r**: Number of rows (height of the image).
    - **c**: Number of columns (width of the image).
    - **channels**: Number of color channels (3 for RGB images, 1 if the image is black and white).
- **objects**: This is a list of objects (cells) that were detected in the image. For each detected object, we have a bounding box and a category.
  - **bounding_box**: The coordinates of the bounding box surrounding the object. _A bounding box is a rectangular box that can be drawn around an object in an image to define its spatial location._
    - **minimum**: The top-left corner of the bounding box.
      - **r**: Row coordinate of the top-left corner.
      - **c**: Column coordinate of the top-left corner.
    - **maximum**: The bottom-right corner of the bounding box.
      - **r**: Row coordinate of the bottom-right corner.
      - **c**: Column coordinate of the bottom-right corner.
  - **category**: The category or type of the object (e.g., 'red blood cell').
 
That is a lot of information! 

### Converting to a `DataFrame`

Now that we have a basic understanding of the dataset's structure, let's proceed by transforming it into a `DataFrame`. By using *Pandas* here, we can explore, filter, or further analyze our data more easily and efficiently. Pandas provides many functions for data wrangling and analysis, which will help us quickly get a better understanding of our dataset.

Convert the train JSON to a `DataFrame` using [`pd.json_normalize()`](https://pandas.pydata.org/docs/reference/api/pandas.json_normalize.html). Read the documentation of this function to see how to use it.
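
To get a feel for what `pd.json_normalize()` does before applying it to the real data, the sketch below runs it on two made-up records. The records are hypothetical; they only mimic the nesting of the dataset. Nested dictionaries are flattened into dotted column names, while lists are kept as they are.

```python
import pandas as pd

# Hypothetical toy records with one level of nesting and a list field
toy_records = [
    {"image": {"pathname": "/images/a.png", "shape": {"r": 1200, "c": 1600}},
     "objects": [1, 2]},
    {"image": {"pathname": "/images/b.png", "shape": {"r": 1200, "c": 1600}},
     "objects": [3]},
]

# Each record becomes one row; nested keys are joined with '.'
df_toy = pd.json_normalize(toy_records)

print(sorted(df_toy.columns))
print(df_toy.shape)
```

Note that the `objects` column still holds Python lists after normalizing, which is why a later step is needed to unpack it.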

In [None]:
# YOUR CODE HERE

display(df_train)

**Q4. Does this result make sense to you? What did you compare this result against to check it is correct?**

*Your answer goes here.*

*Hint: Look at the number of rows and columns in the resulting DataFrame.*

## Unpacking the data further

Now that we have our data loaded in a `DataFrame`, we can use *Pandas* to quickly take a look at some statistics. The cell below uses *Pandas* `describe()` to provide us with some information about the numeric columns in our data.

In [None]:
df_train.describe()

**Q5. What do these statistics say about the shapes and channels for each image?**

*Your answer goes here.*

### Cleaning up the DataFrame

Now we know some of the data actually isn't very useful, so it can be removed from the `DataFrame`. Remove all 3 columns for the image shape, and also the image checksum, as we won't be needing it. The resulting `DataFrame` should have 2 columns left: the objects list and the path to the image.

In [None]:
# YOUR CODE HERE

display(df_train_clean)

### Exploding the bounding boxes

The next step in processing the data is unpacking the *objects* column, which currently still holds a list of JSON objects (one per bounding box) for each image. As this is a list of items inside each `DataFrame` row, the easiest way to do this is to use [`explode()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.explode.html). Apply `explode()` on the *objects* column and inspect the results.
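
The following sketch shows the effect of `explode()` on a small made-up `DataFrame` (the column names and values are hypothetical). Each list element gets its own row, the other columns are repeated, and the original index value is repeated as well.

```python
import pandas as pd

# Hypothetical toy frame: one row per image, with a list of objects per row
df_toy = pd.DataFrame({
    "path": ["a.png", "b.png"],
    "objects": [
        [{"category": "ring"}, {"category": "red blood cell"}],
        [{"category": "leukocyte"}],
    ],
})

# One row per list element; the index keeps pointing at the original row
df_exploded = df_toy.explode("objects")

print(df_exploded.index.tolist())
print(df_exploded["path"].tolist())
```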

In [None]:
# YOUR CODE HERE

display(df_train_exploded)

Now, each row in the *objects* column should contain one JSON object, which we can split into separate `DataFrame` columns again using [`pd.json_normalize()`](https://pandas.pydata.org/docs/reference/api/pandas.json_normalize.html). Apply `json_normalize()` on the exploded *objects* column and inspect the results.

In [None]:
# YOUR CODE HERE

display(df_bounding_box)

Next, we will recombine the image paths with this new bounding box dataframe, so it is clear which image each bounding box belongs to. 

If you want to combine `DataFrames`, their *indices* must always match, but the original exploded dataframe has indices from 0 to 1207, while the bounding box dataframe has indices 0 to 80112. Before we can continue with the recombination of the dataframes, you should use [`reset_index()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.reset_index.html) to change any index to run from 0 to the number of rows.
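
A minimal sketch of what `reset_index()` does, using a made-up frame with a repeated index (similar to what you get after exploding). By default the old index is kept as a new column named `index`, and the new index simply counts the rows.

```python
import pandas as pd

# Hypothetical toy frame with a repeated index
df_repeated = pd.DataFrame({"value": ["x", "y", "z"]}, index=[0, 0, 1])

# The old index moves into a column called 'index'; rows are renumbered
df_reindexed = df_repeated.reset_index()

print(df_reindexed.columns.tolist())
print(df_reindexed.index.tolist())
```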

In [None]:
# YOUR CODE HERE

display(df_exploded_reindexed)

Note that indices now run from 0 to 80112, and there is now a new column called *index*, which has the old index for each row. This information will be useful to keep in the dataframe, as it can be used to identify which image a bounding box belongs to. 

Find a way to combine this reindexed dataframe with the bounding box dataframe into a single dataframe, but remove the *objects* column from the result, as that information is already stored in the bounding box dataframe.

In [None]:
# YOUR CODE HERE

display(df_train_merged)

_Your resulting dataframe should have 7 columns._

### Cleaning up further

Now we finally have all the JSON information in a fully normalized `DataFrame`, but it's a little hard to read, partially because the column names are so long. Here we'll do a last cleanup step to make everything a bit more readable.

Write a function `replace_column_name(name)` that takes a column name and returns a new string, which should be a bit shorter and more readable. The function should make the following replacements:

- `'index'` should become `'img.id'`
- `'image'` should become `'img'`
- `'pathname'` should become `'path'`
- `'bounding_box'` should become `'bb'`
- `'minimum'` should become `'min'`
- `'maximum'` should become `'max'`


In [None]:
def replace_column_name(name):
    # YOUR CODE HERE

# First test, should print bb.min.r 
print(replace_column_name('bounding_box.minimum.r'))

# Second test, should print img.id 
print(replace_column_name('index'))

# Third test, should print category
print(replace_column_name('category'))

Now, find a way in *Pandas* to use this `replace_column_name()` function to actually rename all the columns in the `DataFrame`.
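
One possible approach is sketched below on a made-up `DataFrame`: `DataFrame.rename()` accepts a function for its `columns` argument and applies it to every column name. The helper `shorten()` and the column names are hypothetical and unrelated to the replacement list above.

```python
import pandas as pd

def shorten(name):
    # Hypothetical helper: abbreviates a single word in a column name
    return name.replace("temperature", "temp")

df_toy = pd.DataFrame({
    "temperature.minimum": [1],
    "temperature.maximum": [5],
    "city": ["Oslo"],
})

# rename() calls the function once per column name and returns a new frame
df_renamed = df_toy.rename(columns=shorten)

print(df_renamed.columns.tolist())
```

Keep in mind that `rename()` returns a new `DataFrame` by default, so the result has to be assigned to a variable.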

In [None]:
# YOUR CODE HERE

display(df_train_merged)

Lastly, all the image paths start with `'/images/'` to indicate they are in the images folder. However, this is the same for every image, and uses `/` to separate folders, so this would only work on Mac or Linux as a filepath! To fix this, you should remove the string `'/images/'` at the start of each path, keeping only the filename after the second `/` and store the result back into the *img.path* column.
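
String operations on a whole column are available through the `.str` accessor. The sketch below shows one of several ways to keep only the last part of a `/`-separated string, using a made-up `Series` of paths.

```python
import pandas as pd

# Hypothetical toy paths with the same prefix as described above
paths = pd.Series(["/images/a.png", "/images/b.png"])

# Split each string on '/' and keep only the last piece
filenames = paths.str.split("/").str[-1]

print(filenames.tolist())
```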

In [None]:
# YOUR CODE HERE

display(df_train_merged)

## Showing images

Now that we’ve loaded all the data into pandas, let’s take a closer look at the images themselves. Visualizing a few of the images can give us valuable insights into the data we’ll be analyzing, such as the quality of the images, the appearance of the cells, and any visible patterns that might help us interpret the dataset.

Use `os.path.join()` to construct a path from the current directory to the image file. The result should look something like `malaria/images/8d02117d-6c71-4e47-b50a-6cc8d5eb1d55.png` if you're on Mac or Linux, and like `malaria\images\8d02117d-6c71-4e47-b50a-6cc8d5eb1d55.png` if you're on Windows instead. Next, use `Image.open()` from the `PIL` library to read in the image using that path, and then display the result. You can change the `image_index` to other integers between 0 and 80112 to view different files.

In [None]:
image_index = 0
image_file = df_train_merged['img.path'].loc[image_index]
print(image_file)

# YOUR CODE HERE

display(img)

### Showing a specific cell

Next, it might be interesting to see what just one of the cells looks like. Implement the function `show_specific_cell(cell)` that takes a `Series` describing a cell, finds and loads the corresponding image from the disk, crops that specific cell from that image, and displays it. Remember that, respectively, `'bb.min.c'` and `'bb.min.r'` stand for the _column_ and the _row_ where the bounding box of a cell starts, and that `'bb.max.c'` and `'bb.max.r'` stand for where the bounding box ends. Use `Image.crop(box)` to cut out the relevant part from the image.
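
The order of the coordinates in the `box` tuple is easy to get wrong, so here is a small sketch on a blank synthetic image (no dataset files needed). `Image.crop()` expects `(left, upper, right, lower)`, which means columns come before rows. The variable names below are made up for this example.

```python
from PIL import Image

# Synthetic white image: PIL sizes are (width, height)
img_toy = Image.new("RGB", (200, 100), "white")

# Made-up bounding box in row/column terms
min_r, min_c = 10, 30
max_r, max_c = 60, 90

# PIL wants (left, upper, right, lower), so column first, then row
box = (min_c, min_r, max_c, max_r)
cropped = img_toy.crop(box)

print(cropped.size)  # (60, 50): width = 90 - 30, height = 60 - 10
```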


In [None]:
def show_specific_cell(cell):
    # YOUR CODE HERE

cell_index = 0
cell = df_train_merged.loc[cell_index]
display(cell)

show_specific_cell(cell)

## Showing all bounding boxes

Now, write a function `show_image_bbs(df_cells, image_id, colors)` that takes an `image_id` to select a specific image (so a value between 0 and 1207 for the training data), and displays that image, including all of its bounding boxes in `df_cells`. The color of each of these boxes should indicate which category that cell belongs to, as described by the dictionary `colors`. The result should be a single displayed image with many different rectangles drawn on it, some of which should be different colors.

**Hint:** Take a look at how you used `ImageDraw` in `os_and_pil.ipynb`! The color of a rectangle can be set through the optional argument `outline`.
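
As a quick refresher, the sketch below draws a single colored rectangle on a blank synthetic image. The coordinates are made up; as with cropping, the order is `[x0, y0, x1, y1]`, so columns come before rows.

```python
from PIL import Image, ImageDraw

# Synthetic white image to draw on
img_toy = Image.new("RGB", (100, 80), "white")
draw = ImageDraw.Draw(img_toy)

# Draw only the outline; without a fill the inside stays untouched
draw.rectangle([10, 20, 50, 60], outline="red")

print(img_toy.getpixel((10, 20)))  # top-left corner of the rectangle
print(img_toy.getpixel((30, 40)))  # a pixel inside the rectangle
```

Drawing modifies the image in place, so if you want to keep the original image unchanged you could draw on a copy made with `Image.copy()`.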

In [None]:
def show_image_bbs(df_cells, image_id, colors):
    # YOUR CODE HERE
    
# Create a colors dictionary using all unique categories and a list of colors
colors = dict(zip(df_train_merged.category.unique(),
                  ['red', 'green', 'yellow', 'grey', 'blue', 'orange', 'purple']))
print(colors)

# Show the first image in the training data with all its bounding boxes
show_image_bbs(df_train_merged, 0, colors)

# Analysis

After observing some of the images firsthand, we can start making some hypotheses and asking further questions about this data set, such as:

- Are certain categories of cells more common than others?
- Does the size and shape of the bounding boxes vary a lot?
- How many of the bounding boxes overlap with other bounding boxes?
- Do certain categories of cells have unique visual characteristics?
- How much do images differ in terms of background noise or visual clarity?

In this section of the notebook, we will use a combination of `pandas` and `seaborn` to explore relationships in the data. By visualizing and analyzing various aspects, like cell distributions and bounding box sizes, we aim to ultimately understand the characteristics of each cell type and how they appear across images.

## Counting classes

Let's start by visually inspecting some of these attributes. Use *Seaborn* to create a plot that displays the frequency of each of the cell types. Your horizontal axis should show the different categories of cells, and your vertical axis the count of each of those categories.

In [None]:
# YOUR CODE HERE

As there are so many more red blood cells than any other type, it is hard to read the smaller values in the plot. Use *Pandas* to create a `Series` with counts for each of the categories and display these exact counts.
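
A common way to count how often each value occurs is `Series.value_counts()`. The sketch below uses a short made-up list of labels, not the real data.

```python
import pandas as pd

# Hypothetical toy labels
labels = pd.Series(
    ["red blood cell"] * 4 + ["ring"] * 2 + ["leukocyte"]
)

# Counts per unique value, sorted from most to least frequent
counts = labels.value_counts()

print(counts)
```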

In [None]:
# YOUR CODE HERE

**Q6. Does this result make sense to you? What did you compare this result against to check it is correct?**

*Your answer goes here.*

*Hint: There are two healthy cell types, namely red blood cells and leukocytes (white blood cells). Try to find a source online about how many more red blood cells there should be compared to white blood cells in a normal blood sample. So, find out in what ratio these two cell types naturally occur, and compare that to the numbers you're finding here.*

### Number of cells per image

You might have already noticed that the number of cells per image is highly variable. A histogram of the number of cells per image will allow us to see if some images have significantly higher or lower cell counts.

Use *Seaborn* to create a histogram that shows the distribution of the number of cells per image. The horizontal axis should show the number of cells, and the vertical axis the number of times this occurred, with both axes clearly labeled. Note that you might need to do one or more *Pandas* operations on the data before you can actually create this specific plot using *Seaborn*.
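
The kind of *Pandas* operation meant here is sketched below on made-up data: grouping rows by an identifier and counting the rows in each group. The column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: one row per item, with the group it belongs to
df_toy = pd.DataFrame({
    "group": [0, 0, 0, 1, 2, 2],
    "label": ["a", "b", "a", "a", "b", "b"],
})

# Number of rows in each group, as a Series indexed by group
rows_per_group = df_toy.groupby("group").size()

print(rows_per_group.tolist())
```

A `Series` like this could then be passed to a plotting function such as `sns.histplot()` to show how the group sizes are distributed.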

In [None]:
# YOUR CODE HERE

**Q7. Does this result make sense to you? What did you compare this result against to check it is correct?**

*Your answer goes here.*

## Pivoting data

As we have already concluded before, red blood cells are by far the most common cells in our images. This observation makes sense, given that red blood cells are the primary component of blood. However, our main interest lies in identifying and analyzing the presence of infected cells, and so it would be useful to know how many of each of the different cell categories there are in each of the different images.

For this we can construct a [`pivot_table()`](https://pandas.pydata.org/docs/reference/api/pandas.pivot_table.html) from our data set. This pivot table should be _indexed_ by the image id, the _columns_ should be the different categories, and the _values_ in the table should _count_ how often each combination occurs. 

Note that we're only interested in how often each combination occurs, so it doesn't actually matter which value we end up counting, as long as we choose one for the pivot table. Make sure to also set a sensible `fill_value` for combinations that don't occur, and use `margins=True` to include a total count for each row and column. This will add an `'All'` row **and** column to your pivot table, showing the total cell count for each image and the total count of each cell type across all images!

**Hint:** Take a look at `pandas_advanced.ipynb` for a reference on how `pivot_table` works!
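
For reference, here is a minimal sketch of a counting pivot table with margins on made-up data. The column names (`img`, `cat`, `row`) are hypothetical; the `row` column only exists so there is something to count.

```python
import pandas as pd

# Hypothetical toy data: one row per cell
df_toy = pd.DataFrame({
    "img": ["a", "a", "a", "b", "b"],
    "cat": ["rbc", "rbc", "ring", "rbc", "leuko"],
    "row": [1, 2, 3, 4, 5],
})

# Count rows per (img, cat) pair; missing pairs become 0;
# margins=True adds an 'All' row and an 'All' column with totals
table = pd.pivot_table(
    df_toy,
    index="img",
    columns="cat",
    values="row",
    aggfunc="count",
    fill_value=0,
    margins=True,
)

print(table)
```

A quick sanity check on a table like this is that the bottom-right `'All'` value equals the number of rows in the original data.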

In [None]:
# YOUR CODE HERE

display(df_train_pivoted)

**Q8. Does this result make sense to you? What did you compare this result against to check it is correct?**

*Your answer goes here.*

### Finding the images with the fewest and the most number of cells

As you have probably already seen in your histogram, there are a couple of images with a relatively low or extremely high number of annotated cells.

These extremes are worth a closer look. We can use the `'All'` column in the pivot table to find the image with the lowest number of cells overall and the image with the largest number of cells overall. 

In [None]:
# Find the image where the 'All' column has its lowest value
min_cell_id = df_train_pivoted['All'].idxmin()
min_cell_count = df_train_pivoted['All'].loc[min_cell_id]

# Do the same for the highest value, but drop the 'All' row, as this row has the highest value by definition
max_cell_id = df_train_pivoted['All'].drop(index='All').idxmax()
max_cell_count = df_train_pivoted['All'].loc[max_cell_id]

print(f"Image with the lowest number of cells is ID {min_cell_id}, which has {min_cell_count} cells")
show_image_bbs(df_train_merged, min_cell_id, colors)

print(f"Image with the largest number of cells is ID {max_cell_id}, which has {max_cell_count} cells")
show_image_bbs(df_train_merged, max_cell_id, colors)


**Q9. Which of these images do you think is more useful in identifying infected cells? Explain your answer.**

*Your answer goes here.*

### Most different cell types

It might also be interesting to find an image with more varieties of cells. Let's find the image with the highest number of different cell categories!

Use your pivot table to identify the image with the most distinct cell types. Display this image using `show_image_bbs()`, and print the number of unique cell types it contains.

**Hint:** You can count how many different cell types are present in each image by converting non-zero counts to *True* or 1, then summing them across each row! Don't forget that you'll want to drop the `'All'` row, just like the example in the previous cell, and also the `'All'` column, so that the totals are not counted as a cell type.
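
The idea in the hint can be sketched on a small made-up count table (without any `'All'` margins). Comparing with zero gives a table of booleans, and summing booleans counts the `True` values.

```python
import pandas as pd

# Hypothetical toy counts: rows are images, columns are cell types
counts = pd.DataFrame(
    {"red blood cell": [10, 7, 3], "ring": [0, 2, 1], "leukocyte": [0, 0, 1]},
    index=["a", "b", "c"],
)

# True where a type occurs at least once; summing per row counts the types
n_types = (counts > 0).sum(axis=1)

print(n_types.tolist())
print(n_types.idxmax())
```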

In [None]:
# YOUR CODE HERE

### Counting infections

With the pivot table created, we can also easily determine how many images contain infected blood. To do this, we will categorize the images into three distinct groups:

1. **Infected**: These are images that contain any of the infected cell types: gametocytes, rings, trophozoites, or schizonts. Identifying these images is crucial as they indicate the presence of malaria.
2. **Possibly Infected**: These images contain red blood cells and leukocytes, along with cells labeled as "difficult." The "difficult" category might include cells that are challenging to classify and could potentially indicate infection.
3. **Uninfected**: These are images that contain only red blood cells and leukocytes. These images represent healthy blood samples.

Use *Pandas* to create masks for each of these 3 groups, and then create three separate `DataFrames` to store each group of images: `df_infected`, `df_possibly_infected`, and `df_uninfected`.

In [None]:
# YOUR CODE HERE

print(f"Number of images with infected cells: {df_infected.shape[0]}")
print(f"Number of images with possibly infected cells: {df_possibly_infected.shape[0]}")
print(f"Number of images with only uninfected cells: {df_uninfected.shape[0]}")

display(df_infected.head())
display(df_possibly_infected.head())
display(df_uninfected.head())

### Bounding box sizes

Another useful characteristic we can analyze is the cell dimensions. Fortunately, the bounding boxes in this dataset are closely aligned with the edges of each cell. This allows us to use the bounding boxes to get a fairly accurate approximation of the width and height of a cell. 

Add three new columns to `df_train_merged`:

- `'bb.width'`: Width of each bounding box
- `'bb.height'`: Height of each bounding box
- `'bb.area'`: Total area of each bounding box (width × height)
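
Column arithmetic in *Pandas* works element-wise, so new columns can be computed from existing ones in a single expression. The sketch below uses made-up rectangles with hypothetical column names (`left`, `right`, `top`, `bottom`) rather than the dataset's own columns.

```python
import pandas as pd

# Hypothetical toy rectangles, given by their edges in pixels
df_rect = pd.DataFrame({
    "left": [30, 5],
    "right": [90, 45],
    "top": [10, 0],
    "bottom": [60, 20],
})

# Element-wise arithmetic: each operation is applied row by row
df_rect["width"] = df_rect["right"] - df_rect["left"]
df_rect["height"] = df_rect["bottom"] - df_rect["top"]
df_rect["area"] = df_rect["width"] * df_rect["height"]

print(df_rect)
```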

In [None]:
# YOUR CODE HERE

display(df_train_merged)

Now that we’ve calculated the dimensions and area of each bounding box, let’s identify the smallest and largest cells in the dataset. Using the `'bb.area'` column, find the cell with the smallest bounding box area and the cell with the largest bounding box area. Print the category and area of these cells, and then display the cells using `show_specific_cell()`.

In [None]:
# YOUR CODE HERE

### Aggregating data and analyzing categories

Bounding box area and aspect ratio might reveal patterns unique to each cell category. For example, some categories of cells may consistently have larger or smaller bounding boxes or more elongated shapes. Let’s calculate a summary of statistics for these features across different categories.

Group the `DataFrame` by `'category'` and then use Pandas' `.agg` to aggregate the mean, standard deviation, minimum, and maximum for each of `'bb.width'`, `'bb.height'`, and `'bb.area'`. This will allow us to see if any categories tend to have larger areas, more elongated shapes, or greater variability in these dimensions.
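
As an illustration of the pattern, the sketch below aggregates several statistics per group on made-up data. The column names (`kind`, `width`, `height`) are hypothetical. Passing a list of function names to `.agg` produces one result column per combination of input column and statistic.

```python
import pandas as pd

# Hypothetical toy measurements for two groups
df_toy = pd.DataFrame({
    "kind": ["a", "a", "b", "b"],
    "width": [10, 20, 30, 50],
    "height": [1, 3, 5, 5],
})

# One row per group; columns form a (measurement, statistic) MultiIndex
summary = df_toy.groupby("kind")[["width", "height"]].agg(
    ["mean", "std", "min", "max"]
)

print(summary)
```

Individual values can be looked up with a tuple, for example `summary.loc["a", ("width", "mean")]`.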

In [None]:
# Set the number of decimals displayed to 2
pd.set_option("display.precision", 2)

# YOUR CODE HERE

**Q10. Do you see any interesting patterns in this table? Would any of these features be good indicators for specific categories?**

*Your answer goes here.*

### Plotting bounding box widths

After looking at the statistics above, one of the possible things to investigate further is how the width of the bounding boxes is distributed for each category. The plot below shows the distribution of bounding box widths across different cell categories. Each category is represented by a different color. By setting `common_norm=False`, we visualize the relative proportion of each bounding box width within each category, which allows us to compare the spread of widths directly, regardless of how many cells exist for each category.

In [None]:
sns.kdeplot(data=df_train_merged, x='bb.width', hue='category', common_norm=False, fill=True)
plt.title('Bounding Box Widths')
plt.show()

From this plot, we can observe that bounding box widths for red blood cells tend to be between 70 and 150 pixels, indicating that red blood cells are generally smaller in width than other cell types. This difference in bounding box width distributions could mean this would be a useful feature when distinguishing the different cell categories.

***Ultimately, to answer the original question of automated cell labeling for this data set, we'd like to find features that can be used to predict if a cell is healthy or not. The less overlap there is between the plotted distributions for the different cell categories, the better that feature is at distinguishing between these categories.***

# BONUS: Further Exploration 

As an additional challenge, you should try and construct some other new features that might be useful when distinguishing between the different cell categories. For this you may use the images of the different cells, the shapes of their bounding boxes, or any other part of the data you want to explore further. Note that the [Image](https://pillow.readthedocs.io/en/stable/reference/Image.html#the-image-class) class also has several functions to extract different types of information from an image. In general, your steps for this should be:

1. Try to come up with a hypothesis of what might be a good distinguishing feature
2. Write the code to compute this feature for the whole data set
3. Try to compute or visually show if this is indeed a good distinguishing feature

Remember: we are more interested in the way you get to an answer than the answer itself, so **include graphs, comments, and markdown cells that explain your choices**.