# Exploring the BBBC021 dataset

## A little background on high content imaging / screening

In a high-content screening / imaging assay, a cell line is treated with a number of different compounds (often on the order of 10k, 100k, or more molecules) for a given period of time, 
and then the cells are [fixed](https://en.wikipedia.org/wiki/Fixation_(histology)) and stained with fluorescent dyes which visualize important cellular structures that are then imaged under a microscope.
Through this procedure, we can directly observe the impact of the given (drug) molecules on cellular morphology, 
that is, changes in cell and subcellular shape and structure.
The biophysical interaction by which a bioactive molecule exerts its effects on cells is known as its [mechanism of action (MoA)](https://en.wikipedia.org/wiki/Mechanism_of_action).
Different compounds with the same MoA will have similar effects on cellular morphology, which we should be able to detect in our screen.
Note that a molecule may in fact have more than one MoA: these ["dirty drugs"](https://en.wikipedia.org/wiki/Dirty_drug) may exhibit multiple effects on cellular processes in the assay simultaneously, 
or their effects may change with dosage.

## Our dataset: BBBC021 from the Broad Bioimage Benchmark Collection

The [Broad Bioimage Benchmark Collection](https://bbbc.broadinstitute.org/) is a collection of open microscopy imaging datasets published by the [Broad Institute](https://www.broadinstitute.org/), 
an MIT- and Harvard-affiliated research institute in Cambridge, MA, USA. 
The [BBBC021 dataset](https://bbbc.broadinstitute.org/BBBC021) comprises a [high-content screening](https://en.wikipedia.org/wiki/High-content_screening) assay of [human MCF-7 cells](https://en.wikipedia.org/wiki/MCF-7), 
a breast cancer cell line very commonly used in biomedical research.



In the BBBC021 dataset, three structures have been stained: DNA, and the cytoskeletal proteins F-actin and β-tubulin, 
which form actin filaments and microtubules, respectively.

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
import holoviews as hv
import numpy as np

from pybbbc import BBBC021

In [None]:
hv.extension("bokeh")

In [None]:
im_opts = hv.opts.Image(
    aspect="equal",
    tools=["hover"],
    active_tools=["wheel_zoom"],
    colorbar=True,
    cmap="fire",
    normalize=False,
)

rgb_opts = hv.opts.RGB(
    aspect="equal",
    active_tools=["wheel_zoom"],
)

hv.opts.defaults(im_opts, rgb_opts)

### Working with pybbbc

#### Constructing the BBBC021 object

When you create the `BBBC021` object, you can choose which images to include by selecting subsets with keyword arguments. For example:

In [None]:
from pybbbc import BBBC021

# Entire BBBC021 dataset, including unknown MoA

bbbc021_all = BBBC021()

# Just the images with known MoA

bbbc021_moa = BBBC021(moa=[moa for moa in BBBC021.MOA if moa != "null"])

`BBBC021` has a number of useful constant class attributes that describe the entirety of the dataset
(and can be accessed without creating an object):

* `IMG_SHAPE`
* `CHANNELS`
* `PLATES`
* `COMPOUNDS`
* `MOA`

These don't change with the subset of BBBC021 you have selected. On the other hand, the following lowercase attributes do:

* `moa`
* `compounds`
* `plates`
* `sites`
* `wells`

For example, `BBBC021.MOA` will give you a list of all the MoAs in the full dataset:

In [None]:
BBBC021.MOA
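
The split between uppercase class constants and lowercase per-subset attributes is a common Python pattern. The toy class below is a hypothetical illustration of that pattern only: `ToyDataset` and its placeholder MoA labels are made up for this sketch, and it is not how pybbbc is implemented.

```python
class ToyDataset:
    # Class-level constant: describes the full dataset and is
    # available without creating an instance.
    MOA = ["moa_a", "moa_b", "null"]

    def __init__(self, moa=None):
        # Default to the full list when no subset is requested.
        self._selected = list(self.MOA) if moa is None else list(moa)

    @property
    def moa(self):
        # Instance-level view: only the MoAs in the selected subset.
        return self._selected


known = ToyDataset(moa=[m for m in ToyDataset.MOA if m != "null"])

print(ToyDataset.MOA)  # unaffected by any selection
print(known.moa)       # reflects the selection made at construction
```

Reading the uppercase name on the class always describes the whole dataset, while the lowercase name on an object describes only what that object was built with.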

### Access an image and its metadata

Your initialized `BBBC021` object is indexable and has a length. 
An index is the integer offset into the subset of BBBC021 you have selected.

In [None]:
print(f'Number of images in BBBC021: {len(bbbc021_all)}')
print(f'Number of images with known MoA: {len(bbbc021_moa)}')

What you get back from the object is a `tuple` of the given image followed by its associated metadata
in the form of a `namedtuple`:

In [None]:
image, metadata = bbbc021_moa[0]

plate, compound, image_idx = metadata  # it can be unpacked like a regular `tuple`

print(f'{metadata=}\n\n{metadata.plate=}\n\n{metadata.compound=}')
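
Because the metadata is a `namedtuple`, positional unpacking and attribute access are interchangeable. The sketch below mimics that behaviour with stand-in types. The type names `Metadata` and `Compound` and all values are placeholders chosen for illustration; only the field names that appear in this notebook are taken from it, and pybbbc's real tuples may be defined differently.

```python
from collections import namedtuple

# Stand-in types for illustration only; not pybbbc's actual definitions.
Compound = namedtuple("Compound", ["compound", "concentration", "moa"])
Metadata = namedtuple("Metadata", ["plate", "compound", "image_idx"])

example = Metadata(
    plate="placeholder-plate",
    compound=Compound(
        compound="placeholder-compound",
        concentration=1.0,
        moa="placeholder-moa",
    ),
    image_idx=0,
)

# Positional unpacking, as with any tuple
plate, compound, image_idx = example

# Attribute access returns the very same objects
same_plate = example.plate is plate
same_moa = example.compound.moa == compound.moa

# _asdict() gives a field-name -> value mapping, handy for logging
as_dict = example._asdict()
print(as_dict)
```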

### View the metadata `DataFrame`s

The metadata is compiled into two Pandas `DataFrame`s, `image_df` and `moa_df`, 
which contain only metadata from the selected subset of the BBBC021 dataset.

`image_df` contains image-level metadata. 
Each row corresponds to an image in the subset of BBBC021 you selected:

In [None]:
bbbc021_moa.image_df

`image_idx` corresponds to the absolute index of the image in the full BBBC021 dataset.
`relative_image_idx` is the index you would use to access the given image as in:

`image, metadata = your_bbbc021_obj[relative_image_idx]`
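
The difference between the two indices can be illustrated without pybbbc at all. In this toy sketch (hypothetical labels, not real BBBC021 records), filtering a full list keeps track of each item's absolute position, while the relative index is simply its position in the filtered result.

```python
# Toy stand-in for the full dataset: one MoA label per image (hypothetical values).
full_dataset_moas = ["null", "moa_a", "null", "moa_b", "moa_a"]

# Select the subset with a known MoA, remembering each image's absolute index.
subset_image_idx = [
    image_idx for image_idx, moa in enumerate(full_dataset_moas) if moa != "null"
]

# relative_image_idx is the position inside the subset;
# image_idx is the position inside the full dataset.
for relative_image_idx, image_idx in enumerate(subset_image_idx):
    print(f"{relative_image_idx=} -> {image_idx=} ({full_dataset_moas[image_idx]})")
```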

`moa_df` is a metadata `DataFrame` which provides you with all the compound-concentration pairs in the selected BBBC021 subset: 

In [None]:
bbbc021_moa.moa_df

### Visualize all BBBC021 images

In [None]:
def make_layout(image_idx):
    image, metadata = bbbc021_all[image_idx]

    prefix = f"{metadata.compound.compound} @ {metadata.compound.concentration:.2e} μM, {metadata.compound.moa}"

    plots = []

    cmaps = ["fire", "kg", "kb"]

    for channel_idx, im_channel in enumerate(image):
        plot = hv.Image(
            im_channel,
            bounds=(0, 0, im_channel.shape[1], im_channel.shape[0]),
            label=f"{prefix} | {bbbc021_all.CHANNELS[channel_idx]}",
        ).opts(cmap=cmaps[channel_idx])
        plots.append(plot)

    plots.append(
        hv.RGB(
            image.transpose(1, 2, 0),
            # image is channels-first (C, H, W); avoid relying on the leaked loop variable
            bounds=(0, 0, image.shape[2], image.shape[1]),
            label="Channel overlay",
        )
    )

    return hv.Layout(plots).cols(2)


hv.DynamicMap(make_layout, kdims="image").redim.range(image=(0, len(bbbc021_all) - 1))
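
The overlay in `make_layout` works because `image.transpose(1, 2, 0)` moves the channel axis from first to last, giving the height × width × channel layout that `hv.RGB` takes for an image array. A minimal NumPy sketch on a small dummy array (the 4×5 size is arbitrary and not the real `IMG_SHAPE`):

```python
import numpy as np

# Dummy channels-first image: 3 channels, 4 rows, 5 columns (arbitrary toy size).
channels_first = np.arange(3 * 4 * 5, dtype=np.float32).reshape(3, 4, 5)

# Move the channel axis to the end: (C, H, W) -> (H, W, C).
channels_last = channels_first.transpose(1, 2, 0)

print(channels_first.shape, "->", channels_last.shape)

# Each pixel now holds its three channel values side by side.
pixel = channels_last[2, 3]
per_channel = channels_first[:, 2, 3]
```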