# Introduction

This notebook collects code that is useful for performing image analysis on pathology slides.

We first demonstrate how to deal with whole-slide images (WSIs) using both OpenSlide and BioFormats.

We are using a `.svs` file from The Cancer Genome Atlas for demonstration, as well as a `.vsi` file from our database. 

These will NOT be included with the notebook by default, so you'll have to grab a WSI from someone to test out the same routines as in this notebook.
Alternatively, if you're just using this as a reference, you should be able to copy-paste the relevant bits to your own code.



In [None]:
# imports
import os

import numpy as np
import matplotlib.pyplot as plt
from PIL import Image

%matplotlib inline

# Working With WSIs: Openslide

To install openslide on a Mac, do `brew install openslide` followed by `pip install openslide-python`. Simple!

In [None]:
import openslide

## Opening a Slide, Looking at Parameters

The [OpenSlide API](http://openslide.org/api/python/) doesn't have extensive documentation, but it does have a lot of the properties we'd need for loading up the `scn` files from Leica. 

The following opens a handle to an `openslide.OpenSlide` instance, which provides the associated metadata (via properties) and pixel loading (through the `read_region()` method).

In [None]:
img_path = os.path.join('data', 'TCGA-AD-6899-01Z-00-DX1.646f5e1a-212f-4b15-8689-8b55f7ba8c47.svs')
#img_path = os.path.join('data', 'TCGA-A6-6654-01Z-00-DX1.ed491b61-7c44-4275-879b-22f8007b5ff1.svs')

img_slide = openslide.OpenSlide(img_path)

In [None]:
# Pull out the features of the slide scan
scan_properties = img_slide.properties
scan_levels = img_slide.level_count
scan_dimensions = img_slide.dimensions
scan_level_dimensions = img_slide.level_dimensions
scan_level_downsamples = img_slide.level_downsamples

print('Scan Levels: {}'.format(scan_levels))
print('Scan Dimensions: {}'.format(scan_dimensions))
print('Scan Level Dimensions: {}'.format(scan_level_dimensions))
print('Scan Level Downsamples: {}'.format(scan_level_downsamples))

We can also skim through all the properties of the slide. 
Note that these come in two forms: First, a set of metadata and properties associated with the scanner (e.g. `aperio.PropertyName`), and second, a set of OpenSlide-specific parameters (`openslide.parameter`). 
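According to the documentation, the properties come as a mapping, which is a read-only dict-like object: you can iterate over it with `.items()` and look up individual values with bracket notation (e.g. `img_slide.properties['openslide.vendor']`). Dot-notation does not work, because the keys are plain strings that themselves contain dots.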
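
As a quick illustration of dict-style access, here is a minimal sketch that uses a plain `dict` as a stand-in for `img_slide.properties`, so it runs without a slide file. The keys follow the `openslide.*` naming described above, but the values are made up for the example.

```python
# Stand-in for `img_slide.properties`: a plain dict with made-up values,
# used here only so the snippet runs without a slide file
properties = {
    'openslide.vendor': 'aperio',
    'openslide.mpp-x': '0.2520',
    'openslide.objective-power': '40',
}

# Bracket lookup raises KeyError for a missing key; values are strings,
# so convert them before doing arithmetic
mpp_x = float(properties['openslide.mpp-x'])

# .get() returns None (or a default you supply) instead of raising
mpp_y = properties.get('openslide.mpp-y')

# Group the keys by prefix (the part before the first dot)
prefixes = sorted({key.split('.', 1)[0] for key in properties})
print(mpp_x, mpp_y, prefixes)
```

Since every value is a string, numeric properties such as the microns-per-pixel need an explicit `float()` or `int()` conversion.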

In [None]:
# Print out all the properties associated with this slide
for k, v in img_slide.properties.items():
    print('{}: {}'.format(k,v))

## Select a Level to Work With

Openslide gives us a `get_thumbnail()` method, but the way the scanners work, this might not translate into the full scan height / width.

Instead we can load up an intermediate pyramidal level, in case we want to do tissue finding.
The code below will allow you to sort and select the correct tissue level (useful in case the levels aren't inherently ordered by size).
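
If you'd rather pick a level by size than by a fixed index, something like the following sketch may help. `pick_level` and `max_side` are hypothetical names of my own, and the dimensions are invented; the function only assumes a sequence of `(width, height)` pairs shaped like `level_dimensions`. OpenSlide also offers `get_best_level_for_downsample()` if you think in terms of downsample factors instead.

```python
def pick_level(level_dimensions, max_side):
    """Return the index of the largest level whose longest side is <= max_side.

    Falls back to the smallest level if no level fits. Does not assume the
    levels are ordered by size.
    """
    longest = [max(w, h) for w, h in level_dimensions]
    fitting = [i for i, side in enumerate(longest) if side <= max_side]
    if fitting:
        return max(fitting, key=lambda i: longest[i])
    return min(range(len(longest)), key=lambda i: longest[i])

# Hypothetical (width, height) pairs, shaped like `img_slide.level_dimensions`
example_dims = ((80000, 60000), (20000, 15000), (5000, 3750), (1250, 937))
level = pick_level(example_dims, max_side=6000)
print(level, example_dims[level])
```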

In [None]:
# Get the image dims associated with an intermediate-sized level
widths, heights = zip(*scan_level_dimensions)
idx = np.argsort(heights)  # level indices sorted from smallest to largest height

# Take the middle index; fix this later if you want a specific size or level
#intermediate_idx = idx[int(len(idx)/2)]
# Hard-coded for this slide; the value must be smaller than scan_levels
intermediate_idx = 2
int_width = scan_level_dimensions[intermediate_idx][0]
int_height = scan_level_dimensions[intermediate_idx][1]

print('intermediate index value: {}'.format(intermediate_idx))
print('width: {}'.format(int_width))
print('height: {}'.format(int_height))

## Load the Image

The openslide `read_region()` method seems to be the easiest way to get at a particular level. 
To get the whole thing, give the origin as (0,0) and the width / height from the `scan_level_dimensions` property we extracted earlier.
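
One thing to watch when reading anything other than the whole level: `read_region()` takes the location in level-0 (full resolution) coordinates, while the size is given in pixels of the level you are reading. The sketch below shows the conversion. `level_to_level0` is a hypothetical helper, and the downsample values are made up in the style of `level_downsamples`.

```python
def level_to_level0(xy_level, downsample):
    """Convert an (x, y) position at some level to level-0 coordinates."""
    x, y = xy_level
    return int(round(x * downsample)), int(round(y * downsample))

# Hypothetical downsample factors, shaped like `img_slide.level_downsamples`
example_downsamples = (1.0, 4.0, 16.0)

# Top-left corner of the tile, expressed in level-2 pixels
location = level_to_level0((100, 50), example_downsamples[2])
size = (256, 256)  # (width, height) in level-2 pixels
print(location, size)

# With a real slide this would be: img_slide.read_region(location, 2, size)
```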

In [None]:
img_tile = img_slide.read_region((0,0), intermediate_idx, (int_width, int_height))

# Convert the PIL image returned by read_region() to a numpy array
img_tile = np.array(img_tile)

In [None]:
# Remove the alpha channel, if there is one
if img_tile.shape[2] > 3:
    img_tile = img_tile[:,:,0:3]

In [None]:
plt.imshow(img_tile)
plt.show()

# Working with WSIs: BioFormats

Because life is hard, sometimes Openslide doesn't work. We can try using the [Python Bioformats](https://pythonhosted.org/python-bioformats/) extension instead.

In [None]:
import os
import javabridge
import bioformats

# Bioformats is written in Java, so we need a bridge to run the commands
javabridge.start_vm(class_path=bioformats.JARS)

# The OMEXML doesn't have user-friendly documentation, 
# so it can be useful to look at the methods and properties directly
# See: https://stackoverflow.com/questions/1911281/how-do-i-get-list-of-methods-in-a-python-class
import inspect

## Grabbing an Image Object and Looking at Metadata

The metadata in bioformats is XML-formatted.
The `get_omexml_metadata()` function will return a string of XML, which you can then parse using `OMEXML()`.
The result is a bioformats metadata object.

The documentation for the bioformats API leaves a lot to be desired, but you can look through the **[OME XML schema](https://www.openmicroscopy.org/Schemas/Documentation/Generated/OME-2016-06/ome.html)** to get an idea of what you can do with these objects.

It may also help to look at **[the source for the OMEXML class](https://pythonhosted.org/python-bioformats/_modules/bioformats/omexml.html#OMEXML.Image)** to get an idea of what kinds of methods are available.

I've also included the `inspect` module code to look at the various properties and methods associated with the object, but you have to play around with them to see which ones give you what you want.
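
To get a feel for what the metadata looks like underneath, here is a small sketch that parses a hand-written OME-XML fragment with the standard library. The XML is a made-up minimal example, not real scanner output, and the namespace version is an assumption (the one your files use may differ).

```python
import xml.etree.ElementTree as ET

# Namespace of the 2016-06 OME schema; assumed here, check your own files
OME_NS = {'ome': 'http://www.openmicroscopy.org/Schemas/OME/2016-06'}

# Hand-written minimal example, not real scanner output
example_xml = """
<OME xmlns="http://www.openmicroscopy.org/Schemas/OME/2016-06">
  <Image ID="Image:0" Name="overview">
    <Pixels ID="Pixels:0" DimensionOrder="XYCZT" Type="uint8"
            SizeX="1024" SizeY="768" SizeC="3" SizeZ="1" SizeT="1"/>
  </Image>
  <Image ID="Image:1" Name="full resolution">
    <Pixels ID="Pixels:1" DimensionOrder="XYCZT" Type="uint8"
            SizeX="40000" SizeY="30000" SizeC="3" SizeZ="1" SizeT="1"/>
  </Image>
</OME>
"""

root = ET.fromstring(example_xml.strip())

# Collect (name, width, height) for every image series in the document
sizes = []
for image in root.findall('ome:Image', OME_NS):
    pixels = image.find('ome:Pixels', OME_NS)
    sizes.append((image.get('Name'),
                  int(pixels.get('SizeX')),
                  int(pixels.get('SizeY'))))
print(sizes)
```

The `OMEXML` class wraps this same structure, so `image(i).Pixels.SizeX` corresponds to the `SizeX` attribute of the i-th `Image` element.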

In [None]:
# Set image path
img_path = os.path.join('data', 'wsi_occ', 'OCC-01-0008-01Z-01-O01.vsi')

# Grab the metadata as OME-XML
img_xml = bioformats.get_omexml_metadata(path=img_path)

# Read in the XML with methods to get and set properties
img_metadata = bioformats.OMEXML(img_xml)

# Look at the members to see what you can do with this object
#inspect.getmembers(img_metadata)

### Finding the Full Resolution Image Index

There's probably an easier way to find this, but the order of the images in the XML is not guaranteed. So here, we cycle through the images for this WSI and find the one with the largest pixel count (width times height).

In [None]:
# Grab the number of images described by this metadata
img_count = img_metadata.get_image_count()

# Keep track of each index's pixel count
pixel_counts = np.zeros(img_count)

for img_idx in range(img_count):
    pixel_counts[img_idx] = img_metadata.image(img_idx).Pixels.SizeX * img_metadata.image(img_idx).Pixels.SizeY
    
# Get the sorted indices
pixel_idxes = np.argsort(pixel_counts)
target_idx = pixel_idxes[-1]

print('Index of the largest image: {}'.format(target_idx))

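# Pull out the image corresponding to the largest pixel count
img_object = img_metadata.image(target_idx)
# Note: SizeX is the image width and SizeY the height, so despite their names
# img_rows holds the width in pixels and img_cols holds the height
img_rows = img_object.Pixels.SizeX
img_cols = img_object.Pixels.SizeY
print('SizeX (width): {}, SizeY (height): {}'.format(img_rows, img_cols))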

## Read in Pixel Data

After you figure out the image you want, there are two ways to access the pixel data: 

- Using the `bioformats.ImageReader()` class, and
- Using the `bioformats.load_image()` convenience function.

The main difference is that the first way allows you to read in a portion of the image, which is good for tiling or reading in annotated areas. The downside is that you need to create a context object as shown below (e.g. `with bioformats.ImageReader(path) as reader:`).

`bioformats.load_image` seems to avoid this, but I don't think there's a way to grab only a part of the image: you have to load the whole thing, which is not feasible for large images.

The result, in both cases, is a straightforward `np.ndarray` object.
Also note that in both cases, you need to pass in a `series` parameter which indicates the index of the image you want to load (this can be used to grab the highest-resolution image, as detected above, or it can be used to grab a smaller thumbnail-sized image).
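
For tiling, it helps to generate the `XYWH` tuples up front. The sketch below is one way to do that; `tile_grid` is a hypothetical helper of my own and the image size is invented. Edge tiles are clipped so they never run past the image border.

```python
def tile_grid(width, height, tile_size):
    """Yield (x, y, w, h) tuples covering a width x height image.

    Tiles on the right and bottom edges are clipped to the image border.
    """
    for y in range(0, height, tile_size):
        for x in range(0, width, tile_size):
            yield (x, y,
                   min(tile_size, width - x),
                   min(tile_size, height - y))

# Hypothetical image size, just for demonstration
tiles = list(tile_grid(width=2500, height=1200, tile_size=1000))
print(len(tiles), tiles[-1])

# With a reader open, each tuple can be passed straight through, e.g.:
# reader.read(series=target_idx, XYWH=tiles[0])
```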

In [None]:
# Use the bioformats ImageReader to open up access to the pixel data
with bioformats.ImageReader(img_path) as reader:
    # XYWH is (x, y, width, height): a 1000x1000 tile whose corner is at the image center
    img_tile = reader.read(series=target_idx, XYWH=(int(img_rows/2),int(img_cols/2),1000,1000))

    # Adjust the color channels for OCC ROIs
    #img_tile = img_tile[...,::-1]
    print('Img tile size: {}'.format(img_tile.shape))
    print('Type: {}'.format(type(img_tile)))
    plt.imshow(img_tile)
    plt.show()
# No explicit reader.close() is needed; the with-block closes the reader on exit

In [None]:
# Alternatively, use the convenience function to directly load the image
# Note this will only open images up to 2GB; larger ones fail with the following error:
#
## JavaException: Image plane too large. Only 2GB of data can be extracted at one time. You can workaround the problem by opening the plane in tiles; for further details, see: https://docs.openmicroscopy.org/bio-formats/5.9.0/about/bug-reporting.html#common-issues-to-check

#img_full = bioformats.load_image(img_path, series=target_idx)
# The array is indexed [y, x, channel]; img_cols holds SizeY and img_rows holds SizeX
#img = img_full[int(img_cols/2)-250:int(img_cols/2)+250,int(img_rows/2)-250:int(img_rows/2)+250,:]

# For OCC ROIs, reverse the channel order
#img = img[...,::-1]

#fig, ax = plt.subplots(figsize=(12,12))
#ax.imshow(img)
#ax.axis('off')

# Basic Image Processing

This section will run through some basic image processing using any `numpy`-style image array.

Using code from [the scikit-image docs](https://scikit-image.org/docs/stable/auto_examples/color_exposure/plot_ihc_color_separation.html#sphx-glr-auto-examples-color-exposure-plot-ihc-color-separation-py).

## Color Deconvolution for Pathology Stain Separation

See the [scikit-image gallery example](https://scikit-image.org/docs/stable/auto_examples/color_exposure/plot_ihc_color_separation.html#sphx-glr-auto-examples-color-exposure-plot-ihc-color-separation-py) as well as [documentation for stain separation](https://scikit-image.org/docs/stable/api/skimage.color.html#skimage.color.separate_stains) for a list of the different stain conversion matrices you can import.
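
For intuition, color deconvolution boils down to two steps: convert RGB intensities to optical density with a negative log, then multiply by the inverse of a matrix whose rows are the stain colors. The sketch below is a simplified numpy version, not the exact scikit-image implementation (which handles scaling and clipping differently between versions). The matrix values are the commonly quoted hematoxylin/eosin/DAB vectors that I believe scikit-image ships as `rgb_from_hed`; treat them as an assumption.

```python
import numpy as np

# Rows: optical-density color of hematoxylin, eosin and DAB (assumed values)
RGB_FROM_HED = np.array([[0.65, 0.70, 0.29],
                         [0.07, 0.99, 0.11],
                         [0.27, 0.57, 0.78]])
HED_FROM_RGB = np.linalg.inv(RGB_FROM_HED)

def separate_stains_sketch(rgb, stains_from_rgb, eps=1e-6):
    """RGB in [0, 1] -> per-pixel stain concentrations."""
    rgb = np.clip(np.asarray(rgb, dtype=float), eps, 1.0)
    optical_density = -np.log(rgb)  # Beer-Lambert: absorbance is additive
    return optical_density @ stains_from_rgb

def combine_stains_sketch(stains, rgb_from_stains):
    """Per-pixel stain concentrations -> RGB in [0, 1]."""
    optical_density = np.asarray(stains, dtype=float) @ rgb_from_stains
    return np.exp(-optical_density)

# Round trip on a single synthetic pixel: mostly hematoxylin, a little eosin
concentrations = np.array([[[0.8, 0.1, 0.0]]])
rgb_pixel = combine_stains_sketch(concentrations, RGB_FROM_HED)
recovered = separate_stains_sketch(rgb_pixel, HED_FROM_RGB)
print(np.round(recovered, 3))
```

The round trip only works this cleanly because the pixel was synthesized from the same matrix; real slides will leave some residual in whichever channel the stain vectors fit worst.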

In [None]:
from skimage.color import separate_stains, hed_from_rgb

img_separated = separate_stains(img_tile, hed_from_rgb)
img_hema = img_separated[:,:,0]
img_eosin = img_separated[:,:,1]
img_dab = img_separated[:,:,2]

# Display
fig, ax = plt.subplots(1, 4, figsize=(20,10))
ax[0].imshow(img_tile)
ax[0].set_title('Original')
ax[1].imshow(img_hema, cmap=plt.cm.gray)
ax[1].set_title('Hematoxylin')
ax[2].imshow(img_eosin, cmap=plt.cm.gray)
ax[2].set_title('Eosin')
ax[3].imshow(img_dab, cmap=plt.cm.gray)
ax[3].set_title('DAB')

for a in ax:
    a.axis('off')
    
fig.tight_layout()

## Filtering Regional Maxima

This is something you often need to do if you've got brightish objects that you want to segment, but you can't simply threshold them.
In this case let's try to segment the results of color deconvolution to find the nuclei.
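
The trick used here is sometimes called an h-dome transform: lower the image by a height `h`, reconstruct it by dilation under the original, and subtract. To show what that does, here is a from-scratch numpy sketch on a toy image. The reconstruction loop is a slow stand-in for `skimage.morphology.reconstruction`, assuming a 4-connected neighborhood, and all names are my own.

```python
import numpy as np

def reconstruct_by_dilation(seed, mask):
    """Grayscale reconstruction by dilation (4-connected, iterate to stability)."""
    mask = np.asarray(mask, dtype=float)
    marker = np.minimum(np.asarray(seed, dtype=float), mask)
    while True:
        padded = np.pad(marker, 1, mode='constant', constant_values=-np.inf)
        # Maximum over each pixel and its four direct neighbors
        dilated = np.maximum.reduce([
            padded[1:-1, 1:-1],
            padded[:-2, 1:-1],
            padded[2:, 1:-1],
            padded[1:-1, :-2],
            padded[1:-1, 2:],
        ])
        updated = np.minimum(dilated, mask)  # never rise above the mask
        if np.array_equal(updated, marker):
            return updated
        marker = updated

def h_dome(image, h):
    """Keep only the parts of each regional maximum within h of its top."""
    image = np.asarray(image, dtype=float)
    return image - reconstruct_by_dilation(image - h, image)

# Toy image: flat background with one bright peak and one faint bump
toy = np.full((5, 9), 0.2)
toy[2, 2] = 0.9
toy[2, 6] = 0.25
domes = h_dome(toy, h=0.125)
print(np.round(domes[2], 3))
```

The bright peak is capped at `h` while the faint bump keeps its full height above the background, which is why the result is much easier to threshold than the raw image.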

In [None]:
from skimage import img_as_float
from scipy.ndimage import gaussian_filter
#from scipy.ndimage import binary_opening, binary_closing

from skimage.morphology import reconstruction

# Convert image to a float for subtraction from the original
img_nuc = img_as_float(img_hema)

# Run a simple gaussian filter to blur the image
img_nuc = gaussian_filter(img_nuc, 1)

# Isolate the regional maxima (an "h-dome" transform): reconstruct the image
# from a lowered copy of itself, then subtract the reconstruction

# First create a "seed": the image lowered by a fixed height h = 0.125
seed = img_nuc - 0.125

# Next create a "mask": Just the image itself
mask = img_nuc

# Create a "dilated" image: reconstruction through dilation
dilated = reconstruction(seed, mask, method='dilation')

img_nuc_filtered = img_nuc - dilated

In [None]:
fig, ax = plt.subplots(nrows=1,
                       ncols=3,
                       figsize=(20, 10),
                       sharex=True,
                       sharey=True)

ax[0].imshow(img_nuc, cmap='gray')
ax[0].set_title('original image')

ax[1].imshow(dilated, vmin=img_nuc.min(), vmax=img_nuc.max(), cmap='gray')
ax[1].set_title('dilated')

ax[2].imshow(img_nuc_filtered, cmap='gray')
ax[2].set_title('image - dilated')

for a in ax:
    a.axis('off')

fig.tight_layout()

## Initial Segmentation

Thresholding is simple and easy, but requires hard-coding in a value to use for the image (or for all images).

Otsu's method is also simple, but uses a histogram of the image to make the threshold -- slightly more flexible.
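
To make the histogram idea concrete, here is a from-scratch sketch of Otsu's method: try every split point in the histogram and keep the one that maximizes the between-class variance. In practice you'd just call `filters.threshold_otsu`; `otsu_threshold_sketch` is a hypothetical name and is only meant to show the mechanics.

```python
import numpy as np

def otsu_threshold_sketch(values, nbins=256):
    """Return the histogram bin center that best splits values into two classes."""
    counts, edges = np.histogram(np.ravel(values), bins=nbins)
    counts = counts.astype(float)
    centers = (edges[:-1] + edges[1:]) / 2

    # Class sizes for every split: w0 counts from the left, w1 from the right
    w0 = np.cumsum(counts)
    w1 = np.cumsum(counts[::-1])[::-1]

    # Class means for every split (guard against empty classes)
    weighted = counts * centers
    m0 = np.cumsum(weighted) / np.maximum(w0, 1e-12)
    m1 = (np.cumsum(weighted[::-1]) / np.maximum(w1[::-1], 1e-12))[::-1]

    # Between-class variance when splitting after bin i
    variance = w0[:-1] * w1[1:] * (m0[:-1] - m1[1:]) ** 2
    return centers[np.argmax(variance)]

# Toy two-level "image": half the pixels dark, half bright
two_level = np.concatenate([np.full(50, 0.2), np.full(50, 0.8)])
threshold = otsu_threshold_sketch(two_level)
print(threshold, (two_level > threshold).sum())
```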

In [None]:
from skimage import filters

img_thresholded = img_nuc_filtered > 0.02
img_otsu = img_nuc_filtered > filters.threshold_otsu(img_nuc_filtered)

# Display
fig, ax = plt.subplots(1, 3, figsize=(20, 10), sharey=True)

ax[0].imshow(img_nuc_filtered, cmap='gray')
ax[0].set_title('filtered image')

ax[1].imshow(img_thresholded,  cmap='gray')
ax[1].set_title('simple threshold')

ax[2].imshow(img_otsu, cmap='gray')
ax[2].set_title('otsu thresholded image')

for a in ax:
    a.axis('off')

fig.tight_layout()

## Segmentation Cleanup: Area Filtering

There are two convenience functions in `skimage.morphology` for this: `remove_small_objects` drops connected components smaller than a given size, and `remove_small_holes` fills holes smaller than a given area.
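
If you're curious what these do under the hood, the sketch below reproduces the idea with `scipy.ndimage`: label the connected components, count the pixels in each, and keep only the big ones. Filling holes is the same operation applied to the inverted image. The function names are my own, and this assumes the default 4-connectivity.

```python
import numpy as np
from scipy import ndimage as ndi

def remove_small_objects_sketch(binary, min_size):
    """Drop connected components with fewer than min_size pixels."""
    binary = np.asarray(binary, dtype=bool)
    labeled, _ = ndi.label(binary)          # 0 is background
    sizes = np.bincount(labeled.ravel())    # pixel count per label
    keep = sizes >= min_size
    keep[0] = False                         # never keep the background
    return keep[labeled]

def fill_small_holes_sketch(binary, area_threshold):
    """Fill background regions with fewer than area_threshold pixels."""
    binary = np.asarray(binary, dtype=bool)
    return ~remove_small_objects_sketch(~binary, area_threshold)

# Toy image: a 3x3 block with a one-pixel hole, plus an isolated speck
toy = np.zeros((7, 7), dtype=bool)
toy[1:4, 1:4] = True
toy[2, 2] = False
toy[5, 5] = True

cleaned = remove_small_objects_sketch(toy, min_size=2)
filled = fill_small_holes_sketch(cleaned, area_threshold=2)
print(toy.sum(), cleaned.sum(), filled.sum())
```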

In [None]:
from skimage.morphology import remove_small_objects, remove_small_holes

img_open = remove_small_objects(img_otsu, min_size=64)
img_close = remove_small_holes(img_open, area_threshold=64)

# Display
fig, ax = plt.subplots(1, 3, figsize=(20, 10), sharey=True)

ax[0].imshow(img_otsu, cmap='gray')
ax[0].set_title('Otsu thresholded image')

ax[1].imshow(img_open,  cmap='gray')
ax[1].set_title('small objects removed')

ax[2].imshow(img_close, cmap='gray')
ax[2].set_title('small holes filled')

for a in ax:
    a.axis('off')

fig.tight_layout()

# Replace img_nuc_bin with the final step of processing for the next section
img_nuc_bin = img_close

## Segmentation Labeling: Watershed

Once you get a binary image separated from the original, it's time to figure out which objects should be pulled apart as separate things.
For regular oval objects, we typically turn towards watershed -- but this can be tricky to actually code up and minimize noise.

Here is the process:

- Get the inverse of the Euclidean distance transform of the binary image
- Set the background to a very negative number (so you have a "lip" around the border with basins inside each blob)
- Suppress local minima to "even out" the catchment basins and provide a smooth interior segmentation (protect against oversegmentation)
- Run watershed
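
Before running this on real nuclei, it can help to see the idea on a toy image of two overlapping discs. Note that this sketch swaps in a simpler marker strategy than the steps above: instead of suppressing local minima, it thresholds the distance map to get one marker per disc. The 0.8 factor is an arbitrary choice that happens to work for this toy shape.

```python
import numpy as np
from scipy import ndimage as ndi
from skimage.segmentation import watershed

# Toy binary image: two overlapping discs of radius 16
rows, cols = np.indices((80, 80))
disc_a = (rows - 40) ** 2 + (cols - 28) ** 2 < 16 ** 2
disc_b = (rows - 40) ** 2 + (cols - 52) ** 2 < 16 ** 2
binary = disc_a | disc_b

# Distance from each object pixel to the background
distance = ndi.distance_transform_edt(binary)

# Markers: one small blob around each distance peak (swapped-in shortcut,
# not the minima suppression used elsewhere in this notebook)
markers, n_markers = ndi.label(distance > 0.8 * distance.max())

# Flood the negated distance map from the markers, restricted to the object
labels = watershed(-distance, markers, mask=binary)

n_connected = ndi.label(binary)[1]
print(n_connected, n_markers, labels.max())
```

Plain connected-component labeling sees a single object here, while the watershed splits it at the neck between the two discs.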

In [None]:
# Imports

from scipy.ndimage import distance_transform_edt
from scipy.ndimage import generate_binary_structure, grey_dilation
# watershed lives in skimage.segmentation (newer releases no longer expose it in skimage.morphology)
from skimage.segmentation import watershed, clear_border
from skimage.measure import label
from skimage.feature import peak_local_max
from skimage.color import label2rgb

In [None]:
# Get the negated euclidean distance transform: minus the distance from each
# object pixel to the background, so object centers become the deepest points
img_distance = -distance_transform_edt(img_nuc_bin)

# Set the background to a very negative number
img_distance[~img_nuc_bin] = -100

# Plot the distance map
fig, ax = plt.subplots(1,1,figsize=(20,10))
ax.imshow(img_distance)
ax.axis('off')
fig.tight_layout()

In [None]:
# Suppress local minima in the image to prevent over-segmentation
# See: https://github.com/janelia-flyem/gala/blob/master/gala/morpho.py

# The height threshold is determined empirically, based on distances of the objects in the image
hthreshold = 1
maxval = img_distance.max()

img_inv = maxval - img_distance.astype(float)

marker = img_inv - hthreshold

mask = img_inv

sel = generate_binary_structure(marker.ndim, 1)
diff = True
while diff:
    markernew = grey_dilation(marker, footprint=sel)
    markernew = np.minimum(markernew, mask)
    diff = (markernew - marker).max() > 0
    marker = markernew

filled = maxval - marker

In [None]:
fig, ax = plt.subplots(1,1,figsize=(20,10))
ax.imshow(filled)
ax.axis('off')
fig.tight_layout()

In [None]:
# Perform watershed
labels_ws = watershed(filled, mask=img_nuc_bin, watershed_line=True)

fig, ax = plt.subplots(1,2,figsize=(20,10), sharey=True)

ax[0].imshow(img_nuc_bin, cmap='gray')
ax[0].set_title('Binary Nuclear Image')

ax[1].imshow(label2rgb(labels_ws, bg_label=0))
ax[1].set_title('Watershed Segmentation')

for a in ax:
    a.axis('off')
    
fig.tight_layout()