# Image Processing with Python -- Tutorial Pipeline

##  About this Tutorial

*This tutorial aims to teach the basics of (bio-)image processing with python, in particular the analysis of microscopy data in biology. It is based on an example pipeline for 2D cell segmentation and it follows a 'learning by doing' philosophy.*


#### Instructions

- This notebook contains detailed instructions on how to program a pipeline to segment cells in 2D microscopy images.


- Simply go through the instructions step by step and try to implement each step as best as you can.
    - By doing so, you will learn about key concepts of (bio-)image processing with python!


- If you are stuck...
    - ...first think some more about the problem and see if you can get yourself unstuck.
    - ...search the internet (in particular StackOverflow) for a solution to your problem.
    - ...if you are working through this tutorial in class, ask one of the tutors for help.
    - ...if nothing else helps, you can have a look at the solutions (`tutorial_pipeline_solutions.py`) for inspiration.


#### Background 

The aim of this pipeline is the *identification and segmentation of cells* in 2D confocal fluorescence microscopy images. Cell identification and segmentation are among the most common tasks in bio-image analysis and are often essential for the extraction of useful quantitative information from microscopy data.

The pipeline is optimized to run with the provided example images, which are dual-color spinning-disc confocal micrographs (objective: 40X 1.2NA W) of cells in a live zebrafish embryo in early development (~10h post fertilization), fluorescently labeled with two membrane-localized fusion proteins.

##  Setup

- Depending on your preferences, there are several alternatives for implementing this pipeline:
    1. Implement it directly in this Jupyter notebook *[recommended]*.
    2. Implement it in an Integrated Development Environment (IDE) like Spyder.
    3. Implement it in a simple text editor, running the program on the terminal.


- Make sure that you have downloaded the **example image data** (`example_cells_1.tif` and `example_cells_2.tif`).


- **Python version**
    - The solutions are provided in **python 2.7.x**, which is pre-installed if you are doing this tutorial in class.
    - However, most of the code should work with only minor changes in python 3.x, so feel free to use python 3 if you are more comfortable with it.


- **Required modules**
    - If you are doing this tutorial in class, all required modules are pre-installed.
    - Otherwise, you should make sure that they are installed before you get started.
    - List of all required modules:
        - numpy
        - scipy
        - matplotlib
        - scikit-image
        - tifffile
    - With the exception of tifffile, all required modules come pre-installed if you are using the **Anaconda distribution** of python.
    - You can install tifffile from the command line using `pip install tifffile` or (if you are using Anaconda) `conda install tifffile -c conda-forge`.



- **Python legacy: integer division**
    - Using python 2.7.x has a curious caveat: integer divisions that should result in fractions do not automatically convert to float type. For example, `3/2` will return `1` (preserving integer type) instead of the more sensible `1.5` (converting to float type). Fortunately, the improved division function of python 3 can be imported from "the future", see below.

In [2]:
from __future__ import division
print(3/2)

1.5


*With this, you are ready to get started!*

----

## Importing Modules & Packages

Let's import the package NumPy, which enables the manipulation of numerical arrays:

In [3]:
import numpy as np  

_(Note: if you are not familiar with arrays and NumPy, we strongly recommend that you first complete the accompanying introductory tutorial on this topic before carrying on here.)_

Recall that once imported, we can use functions/modules from the package, for example to create an array:

In [5]:
a = np.array([1, 2, 3])

print(a)
print(type(a))

[1 2 3]
<type 'numpy.ndarray'>


Note that the package is imported under a variable name (here `np`). You can choose this name freely yourself. For example, it would be just as valid (but not as convenient) to write:

```python
import numpy as lovelyArrayTool
a = lovelyArrayTool.array([1,2,3])
```

#### <font color='teal'> Exercise </font>

Using the import command as above, follow the instructions in the comments below to import two additional modules that we will be using frequently in this pipeline.

In [None]:
# The plotting module matplotlib.pyplot as plt

# The image processing package scipy.ndimage as ndi


## Importing & Handling Image Data

#### Background

Images are essentially just numbers (representing intensity) in an ordered grid of pixels. Image processing means simply to carry out mathematical operations on these numbers.

The ideal object for storing and manipulating ordered grids of numbers is the **array**. Many mathematical operations are well defined on arrays and can be computed quickly by vector-based computation.

Arrays can have any number of dimensions (or "axes"). For example, a 2D array could represent the x and y axis of a grayscale image, a 3D array could contain a z-stack (zyx), a 4D array could also have multiple channels for each image (czyx) and a 5D array could have time on top of that (tczyx).
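
To make the axis conventions above concrete, here is a minimal sketch using empty placeholder arrays. The array sizes and variable names are arbitrary assumptions chosen for illustration; they are not the dimensions of the example images.

```python
import numpy as np

# Placeholder arrays filled with zeros; the sizes are made up for illustration.
gray_2d = np.zeros((4, 6), dtype=np.uint8)             # yx: one grayscale image
stack_3d = np.zeros((3, 4, 6), dtype=np.uint8)         # zyx: a z-stack of 3 slices
multi_4d = np.zeros((2, 3, 4, 6), dtype=np.uint8)      # czyx: 2 channels
movie_5d = np.zeros((5, 2, 3, 4, 6), dtype=np.uint8)   # tczyx: 5 time points

for name, arr in [("yx", gray_2d), ("zyx", stack_3d),
                  ("czyx", multi_4d), ("tczyx", movie_5d)]:
    print("%-6s ndim=%d shape=%s" % (name, arr.ndim, arr.shape))

# Indexing an axis with a single integer removes that axis from the result:
first_channel = multi_4d[0]           # all z-slices of channel 0 (zyx)
first_slice = multi_4d[0, 1, :, :]    # channel 0, z-slice 1 (yx)
print(first_channel.shape)
print(first_slice.shape)
```

The same idea applies to any axis order; what matters is knowing which axis of your own data holds which dimension.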

#### <font color='teal'> Exercise </font>

We will now proceed to import some image data, verifying that we get what we expect and then further specifying the data we will work with. Before you start, it makes sense to have a quick look at the data in Fiji/ImageJ so you know what you are working with!

Follow the instructions in the comments below.

In [8]:
# (i) Specify the filename
# Create a string variable with the name of the file to be imported ('example_cells_1.tif')
# Suggested name for the variable: filename
# Note: If the file is not in your current working directory, the filename variable must contain the 
#       entire path to the file, for example r'/home/jack/data/example_cells_1.tif'. Note the r at
#       the beginning of the string: it designates this string as a "raw" string, which helps to
#       avoid problems with backslashes (as in Windows paths) and other special symbols


# (ii) Load the image
# Import the function 'imread' from the module 'tifffile'.

# Load 'example_cells_1.tif' and store it in a variable.
# Suggested name for the variable: img


# (iii) Check that everything is in order
# Check that 'img' is a variable of type 'ndarray' - use Python's built-in function 'type'.

# Print the shape of the array using the numpy-function 'shape'. 
# Make sure you understand the output; recall that the image has 2 color channels and is 930 by 780 pixels. 

# Check the datatype of the individual numbers in the array. You can use the array attribute 'dtype' to do so.
    

# (iv) Allocate the green channel to a separate new variable
# For segmentation, we will only work with the green channel, so we need to allocate it to a new variable. 
# The green channel in this image is the first channel (or channel 0 in python). 
# We can allocate it to a new variable by slicing the 'img' array.
# Hint: Recall that the image has three dimensions, two (rows and columns) defining the size of the image 
#       in terms of pixels, and one defining the number of channels. To slice the array, you need to index  
#       each dimension to specify what you want from it.
#       For example, array A below has two dimensions.
#         A = np.array([[1,2,3],[4,5,6]])
#       To obtain all entries in the first row, we would slice like this:
#         B = A[0,:]
#       You can slice the 2D green channel out of the 3D 'img' array in a similar fashion. 


# (v) Look at the image to confirm that everything worked as intended
# Show one of the channels as an image; use pyplot's functions plt.imshow followed by plt.show. 
# Check the documentation for plt.imshow and note the parameters that can be specified, such as the color map (cmap)
# and interpolation. Since you are working with scientific data, interpolation is unwelcome, so you should set it to
# 'none'. The most common cmap for grayscale images is naturally 'gray'.
# You may also want to adjust the size of the figure. You can do this by preparing the figure canvas with
# the function plt.figure before calling plt.imshow. The canvas size is adjusted using the keyword argument
# figsize when calling plt.figure.



## Preprocessing

#### Background

The goal of image preprocessing is to prepare or optimize the images to make further analysis easier. The specific preprocessing steps used in a pipeline depend on the type of image, the microscopy technique used, the image quality, and the desired downstream analysis. 

The most common operations include:

- Deconvolution
    - Image reconstruction based on information about the PSF of the microscope
    - Often included with microscope software or done via the 'Huygens' server
    - *Our images are not deconvolved, but will do just fine regardless*


- Conversion to 8-bit images to save memory / computational time
    - *Our images are already 8-bit*


- Cropping of images to an interesting region
    - *Our field of view is fine as it is*


- Smoothing of technical noise
    - This is a very common step and is likely to improve almost any type of downstream analysis
    - Commonly used filters are the `Gaussian filter` and the `median filter`
    - *Here, we will be using a Gaussian filter.*


- Corrections of technical artifacts
    - Common examples are uneven illumination and multi-channel bleed-through
    - *Here, we will deal with uneven signal by adaptive thresholding*


- Background subtraction
    - There are various ways of subtracting background signal from an image
    - Two different types are often distinguished:
        - `uniform background subtraction` treats all regions of the image the same
        - `adaptive background subtraction` automatically accounts for differences between regions of the image
    - *Here, we do something similar to adaptive background subtraction when we do adaptive thresholding*

#### Gaussian Smoothing

A Gaussian filter smoothens an image by convolving it with a Gaussian-shaped kernel. In the case of a 2D image, the Gaussian kernel is also 2D and will look something like this:

<img src="ipynb_images/gaussian_kernel_grid.png" alt="Gaussian Kernel Figure" style="width: 300px;"/>

How much the image is smoothed by a Gaussian kernel is determined by the standard deviation of the Gaussian distribution, usually referred to as **sigma** ($\sigma$). A higher $\sigma$ means a broader distribution and thus more smoothing.

**Choosing the correct value of $\sigma$:** this depends a lot on your images, in particular on the pixel size. In general, the chosen $\sigma$ should be large enough to blur out noise but small enough so the "structures of interest" do not get blurred too much. Usually, the best value for $\sigma$ is simply found by trying out some different options and looking at the result. 
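
The trade-off described above can be sketched on a synthetic image. Everything here is an assumption made for illustration (a single bright line standing in for a membrane, Gaussian random noise, and the three tested values of $\sigma$); it is not the tutorial data and not a recommendation for a particular $\sigma$.

```python
import numpy as np
import scipy.ndimage as ndi

# Synthetic test image: one bright vertical line on a dark, noisy background
rng = np.random.RandomState(0)
clean = np.zeros((64, 64))
clean[:, 32] = 100.0
noisy = clean + rng.normal(0, 10, clean.shape)

results = {}
for sigma in (1, 3, 6):
    smooth = ndi.gaussian_filter(noisy, sigma=sigma)
    # Remaining noise: standard deviation in a region far away from the line
    noise_level = smooth[:, :16].std()
    # Remaining structure: average intensity along the line itself
    line_peak = smooth[:, 32].mean()
    results[sigma] = (noise_level, line_peak)
    print("sigma=%d  noise=%.2f  line peak=%.2f" % (sigma, noise_level, line_peak))
```

Larger values of `sigma` suppress more of the noise but also flatten the line, which is why the value is usually tuned by eye on the real images.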

#### <font color='teal'> Exercise </font>

Follow the instructions in the comments below.

In [None]:
# (i) Create a variable for the smoothing factor sigma, which should be an integer value
# After implementing the Gaussian smoothing function below, you can modify this variable 
# to find the ideal value of sigma.


# (ii) Perform the smoothing on the image
# To do so, use the Gaussian filter function 'ndi.filters.gaussian_filter' (in recent SciPy
# versions, call it directly as 'ndi.gaussian_filter') from the image processing package
# ndimage, which was imported at the start of the tutorial.
# Check out the documentation of scipy to see how to implement this function. 
# Allocate the output to a new variable.


# (iii) Visualize the result using plt.imshow and plt.show
# Compare with the original image visualized in the step above. 
# Does the output make sense? Is this what you expected? 
# Can you optimize sigma such that the image looks smooth without blurring the membranes too much?



## Adaptive Thresholding

#### Background

The easiest way to distinguish foreground objects (here: membranes) from the image background is to threshold the image, meaning all pixels with an intensity above a certain threshold are accepted as foreground, all others are set as background. However, just applying a fixed intensity threshold often gives rather poor results due to varying background and foreground intensities across the image.

One way of solving this problem is to use an *adaptive thresholding* algorithm, which adjusts the threshold locally in different regions of the image to account for varying signal-to-noise ratios. 

Our approach to adaptive thresholding works as follows:

1. *Generation of a "background image":* this image should always have higher intensities than the local background but lower intensities than the local foreground. This can be achieved by strong blurring/smoothing of the image, as illustrated in this 1D example:

<img src="ipynb_images/adaptive_bg_1D.png" alt="Adaptive Background Figure" style="width: 400px;"/>
    
2. *Thresholding of original image with the background.* In practice, thresholding means creating a *mask*: an array of Boolean type (`bool`) which contains only the values `True` and `False` (or `1` and `0`). Any pixel with an intensity higher than the corresponding background pixel is set to `1` and all others to `0`.

#### <font color='teal'> Exercise </font>

Implement the two steps of adaptive thresholding:

1. Use a strong "mean filter" to create the background image. This simply assigns each pixel the average value of its local neighborhood. Just like the Gaussian blur, it can be done by convolution, but this time using a "uniform kernel":

 <img src="ipynb_images/uniform_filter_SE.png" alt="Uniform Filter SE Figure" style="width: 300px;"/>
    
 To define which pixels should be considered as the "neighborhood" of a given pixel, a `structuring element` (`SE`) is used. This is a small binary image where all pixels set to `1` will be considered as part of the neighborhood and all pixels set to `0` will not be considered. Here, we use a disc-shaped `SE`, as this reduces artifacts compared to a square `SE`.
  
 *Side note:* The Gaussian blur above also used an `SE` but it was defined automatically by the `gaussian_filter` function based on the $\sigma$ value we specified. Here, we explicitly generate the `SE`.<br><br>

2. Use the background image for thresholding. Pixels with higher values in the original image than in the background should be given the value 1 and pixels with lower values in the original image than in the background should be given the value 0. The resulting binary image should represent the cell membranes.
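
To illustrate what "mean filtering by convolution with a uniform kernel" means, here is a small sketch on a made-up 7x7 array. It builds a disc-shaped `SE` of size 5 with the expression given in the exercise cell and then uses `ndi.convolve` with the normalized `SE`. This is a stand-in chosen to keep the example short, not the `skimage.filters.rank.mean` function the exercise asks for; the variable names are assumptions.

```python
import numpy as np
import scipy.ndimage as ndi

# Disc-shaped SE of size i = 5 (same expression as in the exercise cell)
i = 5
center = i // 2
grid = np.mgrid[:i, :i]
struct = (grid[0] - center)**2 + (grid[1] - center)**2 <= center**2
print(struct.astype(int))

# Uniform kernel: every pixel inside the disc gets the same weight,
# and the weights sum to 1, so the convolution computes a local mean.
kernel = struct.astype(float) / struct.sum()

# Made-up test image: a simple intensity ramp
img = np.arange(49, dtype=float).reshape(7, 7)
background = ndi.convolve(img, kernel, mode='reflect')

# Check one pixel by hand: average the disc neighborhood around (3, 3)
neighborhood = img[1:6, 1:6]
manual_mean = neighborhood[struct].mean()
print("filtered: %.2f   by hand: %.2f" % (background[3, 3], manual_mean))
```

For real images the `SE` would be much larger than the structures of interest, so that the result varies smoothly across the image.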

Follow the instructions in the comments below.

In [None]:
# Step 1

# (i) Create a disk-shaped structuring element and assign it to a new variable.
# Structuring elements are small binary images that indicate which pixels 
# should be considered as the 'neighborhood' of the central pixel. 
#
# An example of a small disk-shaped SE would be this:
#   0 0 1 0 0
#   0 1 1 1 0
#   1 1 1 1 1
#   0 1 1 1 0
#   0 0 1 0 0
#
# The equation below creates such structuring elements. 
# It is an elegant but complicated piece of code and at the moment it is not 
# necessary for you to understand it in detail. Use it to create structuring 
# elements of different sizes (by changing 'i') and find a way to visualize 
# the result.
# 
# Try to answer the following questions: 
#   - Is the resulting SE really circular? 
#   - Can certain values of 'i' cause problems? If so, why?
#   - What value of 'i' should be used for the SE?
#     Note that, similar to the sigma in Gaussian smoothing, the size of the SE
#     is first estimated based on the images and by thinking about what would 
#     make sense. Later, it can be optimized by trial and error.

#struct = (np.mgrid[:i,:i][0] - np.floor(i/2))**2 + (np.mgrid[:i,:i][1] - np.floor(i/2))**2 <= np.floor(i/2)**2


# (ii) Create the background
# Run a mean filter over the image using the disc SE and assign the output to a new variable.
# Use the function 'skimage.filters.rank.mean' (you first need to import the 'skimage.filters.rank' module).
# Think about why a mean filter is used and if a different function (e.g. minimum, maximum or median) 
# would work equally well.


# (iii) Visualize the resulting background image. 
# Compare it to the images generated above. Does the outcome make sense?



In [1]:
# Step 2  

# (iv) Threshold the Gaussian-smoothed original image using the background image created in step 1 
#      to obtain the cell membrane segmentation
# Set pixels with higher values in the original than in the bg to 1 and pixels with lower values to 0. 
# You can use a "relational operator" to do this, since numpy arrays will automatically perform element-wise
# comparisons when compared to other arrays of the same shape.


# (v) Visualize and understand the output. 
# What do you observe? 
# Are you happy with this result as a membrane segmentation? 



## Improving Masks with Binary Morphology

#### Background

Morphological operations such as `erosion`, `dilation`, `closing` and `opening` are common tools used (among other things) to improve masks after they are generated by thresholding. They can be used to fill small holes, remove noise, increase or decrease the size of an object, or smoothen mask outlines.

Most morphological operations are - once again - simple kernel functions that are applied at each pixel of the image based on its neighborhood as defined by a `structuring element` (`SE`). For example, `dilation` simply assigns the central pixel the maximum pixel value within the neighborhood; it is a maximum filter. Conversely, `erosion` is a minimum filter. Additional options emerge from combining the two: `morphological closing`, for example, is a `dilation` followed by an `erosion`. This is used to fill in gaps and holes or to smooth mask outlines without significantly changing the mask's area. Finally, there are also some more complicated morphological operations, such as `hole filling`.
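
The relationships described above can be checked on a toy mask. The mask (a rectangle with a one-pixel hole) and the square 3x3 `SE` are assumptions made for illustration only.

```python
import numpy as np
import scipy.ndimage as ndi

# Toy mask: a filled rectangle with a single-pixel hole in the middle
mask = np.zeros((9, 11), dtype=bool)
mask[2:7, 2:9] = True
mask[4, 5] = False

se = np.ones((3, 3), dtype=bool)   # square 3x3 structuring element

dilated = ndi.binary_dilation(mask, structure=se)
eroded = ndi.binary_erosion(mask, structure=se)

# Dilation acts as a maximum filter, erosion as a minimum filter
as_int = mask.astype(np.uint8)
dilation_is_max = np.array_equal(dilated, ndi.maximum_filter(as_int, footprint=se) == 1)
erosion_is_min = np.array_equal(eroded, ndi.minimum_filter(as_int, footprint=se) == 1)
print("dilation == maximum filter: %s" % dilation_is_max)
print("erosion  == minimum filter: %s" % erosion_is_min)

# Closing (dilation followed by erosion) fills the hole
# without changing the outline of the rectangle
closed = ndi.binary_closing(mask, structure=se)
print("pixels before/after closing: %d / %d" % (mask.sum(), closed.sum()))

# Hole filling sets enclosed background pixels to foreground
filled = ndi.binary_fill_holes(mask)
```

On masks that touch the image border, the binary operations and the rank filters can differ because they treat pixels outside the image differently, so the equivalence shown here should be read as holding for objects away from the border.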

#### <font color='teal'> Exercise </font>

Improve the membrane segmentation from above with morphological operations.

Specifically, use `binary hole filling` to get rid of the speckles of foreground pixels that litter the insides of the cells. Furthermore, try other types of morphological filtering to see how they change the image and to see if you can improve the membrane mask even more, e.g. by filling in gaps.

Follow the instructions in the comments below.

In [None]:
# (i) Get rid of speckles using binary hole filling
# Use the function ndi.binary_fill_holes for this. Be sure to read up on the docs to
# understand exactly what it does. For this to work as intended, you will have to 
# invert the mask, which you can do using the function np.logical_not. Again, be
# sure to understand why this has to be done.


# (ii) Try out other morphological operations to further improve the membrane mask
# The various operations are available in ndimage, for example ndi.binary_closing.
# Play around and see how the different functions affect the mask. Can you optimize
# the mask, for example by closing gaps?
# Note that the default SE for these functions is a square. Feel free to create a
# new disc-shaped SE and see how that changes the outcome.
# Also, if you pay close attention you will notice that some of these operations
# introduce artifacts at the image boundaries. Can you come up with a way of
# solving this?


# (iii) Visualize the final result
# At this point you should have a pretty neat membrane mask.
# If you are not satisfied with the quality of your membrane segmentation, you should go back 
# and fine-tune the size of the SE in the adaptive thresholding section and also optimize the
# morphological cleaning operations.
# Note that the quality of the membrane segmentation will have a significant impact on the 
# cell segmentation we will perform downstream.



## Connected Components Labeling

#### Background

Based on our membrane segmentation, we can get a preliminary segmentation of the cells in the image by considering each background region surrounded by membranes as a cell. This can already be good enough for many simple analyses.

The only thing we still need to do in order to get there is to label each cell individually. Only if each separate cell has a unique number (an `ID`) assigned can we analyze values such as the mean intensity at the single-cell level.

The simple function we can use to achieve this is called `connected components labeling`. It gives every connected group of foreground pixels a unique `ID` number.
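
As a minimal sketch of what labeling does, consider the toy mask below (made up for illustration). It contains three separate groups of foreground pixels, and `ndi.label` with its default settings treats only directly adjacent pixels (not diagonal ones) as connected.

```python
import numpy as np
import scipy.ndimage as ndi

# Toy mask with three separate foreground objects
mask = np.array([[1, 1, 0, 0, 0],
                 [1, 1, 0, 1, 1],
                 [0, 0, 0, 1, 1],
                 [0, 1, 0, 0, 0]], dtype=bool)

labels, n_objects = ndi.label(mask)
print(labels)
print("number of objects: %d" % n_objects)

# With unique IDs, per-object measurements become possible,
# for example the size of each object in pixels:
sizes = np.bincount(labels.ravel())[1:]   # index 0 is the background
print(sizes)
```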

#### <font color='teal'> Exercise </font>

Use your membrane segmentation for connected components labeling.

Follow the instructions in the comments below.

In [None]:
# (i) Label connected components
# Use the function ndi.label from ndimage. 
# Note that this function labels foreground pixels (`1`), so you may need to invert your mask
# again if your membrane mask is currently labeled as foreground.


# (ii) Visualize the output
# Here, it is no longer ideal to use a 'gray' colormap, since we want to visualize that each
# cell has a unique ID. Play around with different colormaps (check the docs to see what
# types of colormaps are available) and choose one that you are happy with.
# Take a close look at the picture and note mistakes in the segmentation. Depending on the
# quality of your membrane mask, there will most likely be some cells that are falsely 
# labeled as the same cell; this is called "under-segmentation". We will resolve this
# issue in the next step. Note that our downstream pipeline does not involve any steps to
# resolve "over-segmentation", so you should fine-tune your membrane mask such that this
# is not a common problem.



## Cell Segmentation by Seeding & Expansion

#### Background

The segmentation we achieved by membrane masking and connected components labeling is a good start. We could for example use it to measure the fluorescence intensity in each cell's cytoplasm. However, we cannot use it to measure intensities at the membrane of the cells, nor can we use it to accurately measure things like cell shape or size.

To improve this (and to also resolve cases of under-segmentation), we can use a "seeding & expansion" strategy. Expansion algorithms such as the `watershed` start "growing outward" from a small `seed` until they touch the boundaries of neighboring cells, which are themselves growing outward from neighboring seeds. Since the "growth rate" at the boundaries of the growing seeds is adjusted based on image intensity (higher intensity means slower expansion), these expansion methods often end up perfectly tracing the cells' outlines.

### Seeding by Distance Transform

#### Background

An array of `seeds` contains a few pixels at the center of each cell labeled by a unique ID number and otherwise surrounded by zeros. The expansion algorithm will start from these central pixels and grow outward until all zeros are overwritten by an ID label. In the case of `watershed` expansion, one can imagine the `seeds` as the sources from which water pours into the cells and starts filling them up.

For multi-channel images that contain a nuclear staining, it is quite common to mask the nuclei by thresholding and use e.g. an eroded version of the nuclei as seeds for cell segmentation. However, there are good alternative seeding approaches for cases where nuclei are not available or not nicely separable by thresholding.

Here, we will use a `distance transform` for seeding. In a `distance transform`, each pixel in the foreground (here the cells) is assigned a value corresponding to its distance from the closest background pixel (here the membrane segmentation). In other words, we encode within the image how far each pixel of a cell is away from the membrane (see figure below). The pixels furthest away from the membrane will be at the center of the cells and will have the highest values. Using a function to detect `local maxima`, we will find these high-value peaks and use them as seeds for our segmentation.

<img src="ipynb_images/distance_transform.png" alt="Distance Transform Figure" style="width: 900px;"/>

One big advantage of this approach is that it will create two separate seeds for two cells, even if the two cells are connected by a hole in the membrane segmentation. Consequently, under-segmentation artifacts will be reduced.
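
This advantage can be sketched with two overlapping discs standing in for two cells that are joined through a gap in the membrane mask. The geometry is an assumption made for illustration. For brevity, local maxima are detected here by comparing the distance transform with its maximum-filtered version, which is a simple stand-in for the `peak_local_max` function used in the exercise.

```python
import numpy as np
import scipy.ndimage as ndi

# Two discs of radius 10 whose centers are 16 pixels apart:
# they overlap and therefore form one connected foreground region.
yy, xx = np.mgrid[:40, :60]
disc_1 = (yy - 20)**2 + (xx - 22)**2 <= 10**2
disc_2 = (yy - 20)**2 + (xx - 38)**2 <= 10**2
cells = disc_1 | disc_2

# Connected components labeling sees only one object (under-segmentation)
n_components = ndi.label(cells)[1]

# Distance of each foreground pixel to the closest background pixel
dist = ndi.distance_transform_edt(cells)

# A pixel is a local maximum if it equals the maximum of its 7x7 neighborhood
local_max = (dist == ndi.maximum_filter(dist, size=7)) & cells

# Give each seed a unique ID
seeds, n_seeds = ndi.label(local_max)

print("connected components: %d" % n_components)
print("seeds from distance transform: %d" % n_seeds)
```

On real masks, irregular outlines produce additional small maxima, which is why the exercise below adds a smoothing step before the maxima are detected.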

#### <font color='teal'> Exercise </font>

Retrieve seeds using distance transformation.

This involves the following three steps:

1. Run the distance transform on your membrane mask.

2. Due to irregularities in the membrane shape, the distance transform may have some smaller local maxima in addition to those at the center of the cells. This will lead to additional seeds, which will lead to over-segmentation. To resolve this problem, smoothen the distance transform by applying a dilation/maximum filter. 

3. Find the seeds by detecting local maxima. Optimize the seeding by changing the dilation in step 2, aiming to have exactly one seed for each cell.

Follow the instructions in the comments below.

In [1]:
# (i) Distance transform on thresholded membranes
# Use the function ndi.distance_transform_edt.
    
    
# (ii) Visualize the output and understand what you are seeing.


# (iii) Dilate the distance transform
# Use ndi.filters.maximum_filter (or ndi.maximum_filter in recent SciPy versions) to dilate the distance transform.
# Read the documentation to remind yourself how and where the structuring element can be defined with this function.
# You can try different SE sizes and shapes. 


# (iv) Retrieve the local maxima (the 'peaks') in the distance transform
# Use the function peak_local_max from the module skimage.feature. By default, this function will return the
# indices of the pixels where the local maxima are. However, we instead need a boolean mask of the same shape 
# as the original image, where all the local maximum pixels are labeled as `1` and everything else as `0`.
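# This can be achieved by setting the keyword argument 'indices' to False.
# Note: newer scikit-image releases no longer accept the 'indices' argument; there, you can create an
# empty boolean array and set the pixels at the returned peak coordinates to True instead.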


# (v) Visualize the output as an overlay on the original (smoothed) image
# If you just look at the local maxima image, it will simply look like a bunch of distributed dots.
# To get an idea if the seeds are well-placed, you will need to overlay these dots onto the original image.
# To do this, it is important to first understand a key point about how the pyplot module works: 
# every plotting command is slapped on top of the previous plotting commands, until everything is ultimately 
# shown when plt.show is called. Hence, you can first plot the original input (or the smoothed) image and 
# then plot the seeds on top of it before showing both with 'plt.show'.
# As you can see if you try this, you will not get the desired result because the zero values in the seed array
# are painted in black over the image you want in the background. To solve this problem, you need to mask 
# these zero values before plotting the seeds. You can do this by creating an appropriately masked array
# using the function 'np.ma.array'. Check the docs to figure out how to do this.


# (vi) Optimize the seeding
# Ideally, there should be exactly one seed for each cell.
# If you are not satisfied with your seeding, go back to the dilation step above and optimize it to get 
# rid of additional maxima. You can also try using the keyword argument min_distance in peak_local_max 
# to solve cases where there are multiple small seeds at the center of a cell.
# Note that good seeding is essential for a good segmentation with an expansion algorithm. However,
# no segmentation is ever perfect, so it's okay if a few cells end up being oversegmented!


# (vii) Label the seeds
# Use connected component labeling to give each cell seed a unique ID number.



### Expansion by Watershed

#### Background

To achieve a cell segmentation, the `seeds` now need to be expanded outward until they follow the outline of the cell. The most commonly used expansion algorithm is the `watershed`.

Imagine the intensity in the raw/smoothed image as a topographical height profile; high-intensity regions are peaks, low-intensity regions are valleys. In this representation, cells are deep valleys (with the seeds at the center), enclosed by mountains. As the name suggests, the `watershed` algorithm can be understood as the gradual filling of this landscape with water, starting from the seed. As the water level rises, the seed expands - until it finally reaches the 'spine' of the cell membrane 'mountain range'. Here, the water would flow over into the neighboring valley, but since that valley is itself filled up with water from the neighboring cell's seed, the two water surfaces touch and the expansion stops.

<img src="ipynb_images/watershed_illustration.png" alt="Watershed Figure" style="width: 900px;"/>

#### <font color='teal'> Exercise </font>

Expand your seeds by means of a watershed expansion.

Follow the instructions in the comments below.

In [1]:
# (i) Perform watershed
# Use the function watershed from the module skimage.morphology (in newer scikit-image
# releases, it is found in skimage.segmentation instead).
# Use the labeled cell seeds and the smoothed membrane image as input.


# (ii) Show the result as transparent overlay over the smoothed input image
# This can be done similarly to the masked overlay of the seeds, but now you don't need to mask 
# the background in the overlaid image (there will be none, since everything gets labeled in
# the watershed). Instead, you need to make the overlaid image semi-transparent. 
# This can be achieved using the optional argument 'alpha' of the 'plt.imshow' function 
# to specify the opacity.
# Be sure to choose an appropriate colormap that allows you to distinguish the segmented
# cells (I would suggest 'prism').



#### *A Note on Segmentation Quality*

This concludes the segmentation of the cells in the example image. Depending on the quality you achieved in each step along the way, the final segmentation may be of greater or lesser quality (in terms of over-/under-segmentation errors).

It should be noted that the segmentation will likely *never* be 'perfect'! This can't be helped because image segmentation is ultimately a `computational classification task` and all such tasks are subject to a fundamental trade-off between specificity and sensitivity, which in this case takes the form of a trade-off between over- and under-segmentation.

This raises an important question: ***when should I stop trying to optimize my segmentation?***

There is no absolute answer to this question, but the best answer is probably ***when you can use it to address your biological questions.***

*Importantly, this implies that you should already have a relatively clear question in mind when you are working on the segmentation!*

## Postprocessing: Removing Cells at the Image Border

#### Background

Since segmentation is never perfect, it often makes sense to explicitly remove artifacts afterwards. For example, one could filter out objects that are too small, have a very strange shape, or very strange intensity values. 

**Warning:** Filtering out objects is equivalent to the *removal of outliers* in data analysis and *should only be done for good reason and with caution!*

As an example of postprocessing, we will now filter out a particular group of problematic cells: those that are cut off at the image border.
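
To show the general idea on a small scale, here is one possible approach applied to a made-up label array. The array, the variable names, and the choice of building the border mask by array indexing are all assumptions for illustration; try the exercise below on the real segmentation yourself before comparing.

```python
import numpy as np

# Toy segmentation: 0 is background, 1 to 3 are cell IDs
seg = np.array([[1, 1, 0, 0, 0, 0],
                [1, 1, 0, 2, 2, 0],
                [0, 0, 0, 2, 2, 0],
                [0, 0, 0, 0, 0, 0],
                [0, 0, 0, 0, 3, 3]])

# Border mask: True only in the outermost rows and columns
border = np.zeros(seg.shape, dtype=bool)
border[0, :] = border[-1, :] = True
border[:, 0] = border[:, -1] = True

# Work on a copy so that the original segmentation stays intact
clean = np.copy(seg)
for cell_id in np.unique(seg):
    if cell_id == 0:
        continue                      # skip the background
    cell_mask = seg == cell_id
    if np.logical_and(cell_mask, border).any():
        clean[cell_mask] = 0          # cell touches the border: remove it

print(clean)
```

In this toy example, cells 1 and 3 touch the border and are removed, while cell 2 is kept.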

#### <font color='teal'> Exercise </font>

Iterate through all the cells in your segmentation and remove those that are at the image border.

Follow the instructions in the comments below. Note that the instructions will tend to be less specific from here on, so you need to figure out how to approach a problem yourself.

In [None]:
# (i) Create image border mask
# We need some way to check if a cell is at the border. For this, we generate a 'mask' of the image border,
# i.e. a Boolean array of the same size as the image where only the border pixels are set to `1` and all 
# others to `0`, like this:
#   1 1 1 1 1
#   1 0 0 0 1
#   1 0 0 0 1
#   1 0 0 0 1
#   1 1 1 1 1
# There are multiple ways of generating this mask, for example by erosion or by array indexing.
# It is up to you to find a way to do it.


# (ii) 'Delete' the cells at the border:
# Note: When modifying a segmentation (in this case by deleting some cells), it makes sense
#       to work on a copy of the array, not on the original. This avoids unexpected behaviors,
#       especially within jupyter notebooks. Use the function np.copy to copy an array.

# Iterate over all cells in the segmentation. Use a for-loop and the function np.unique;
# remember that each cell in our segmentation is labeled with a different integer.

    # Create a mask that contains only the 'current' cell in the iteration
    # Hint: Remember that the comparison of an array with some number (array==number)
    #       returns a Boolean mask of the pixels in 'array' whose value is 'number'.

    # Using the cell mask and the border mask from above, test if the cell has pixels touching 
    # the image boundary or not.
    # Hint: np.logical_and

    # If a cell touches the image boundary, delete it by setting its pixels in the segmentation to 0.

    
# OPTIONAL: re-label the remaining cells to keep the numbering consistent from 1 to N (with 0 as background).


# (iii) Visualize the result
# Show the result as transparent overlay over the original/blurred image. 
# Here you have to combine alpha (to show cells transparently) and 'np.ma.array'
# (to hide empty space where the border cells were deleted).



## Identifying Cell Edges

#### Background

With the final segmentation in hand, we can now start to think about measurements and data analysis. However, to extract interesting measurements from our cells, the segmentation on its own is often not enough: additional masks that identify sub-regions for each cell allow more precise and more biologically relevant measurements.

The most useful example of this is an additional mask that identifies only the edge pixels of each cell. This is useful for a number of purposes, including:

- Edge intensities are a good measure of membrane intensity, which is often a desired readout.
- The intensity profile along the edge contains information on cell polarity.
- The length of the edge (relative to the cell area) is an informative feature about the cell shape. 
- Showing colored edges can be a nice way of visualizing cell segmentations.

There are many ways of identifying edge pixels in a fully labeled segmentation. Here, we will use a simple and relatively fast method based on erosion. In the <font color='green'>optional advanced content</font> you will find an even faster solution as an example of how `vectorization` can speed up your code.

#### <font color='teal'> Exercise </font>

Create a labeled mask of cell edges by following these steps:


- Create an empty array of the same size and data type as the segmentation
    - This will be your final cell edge mask; you gradually add cell edges as you iterate over cells
    

- *For each cell:*
    - Erode the cell's mask by 1 pixel
    - Using the eroded mask and the original mask, create a new mask of only the cell's edge pixels
    - Add the cell's edge pixels into the empty image generated above, labeling them with the cell's original ID number
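
The steps above can be sketched on a toy label image as follows. This is only an illustration under stated assumptions: `toy_seg` and `toy_edges` are made-up names, `ndi` is assumed to be `scipy.ndimage`, and the erosion uses SciPy's default structuring element.

```python
import numpy as np
import scipy.ndimage as ndi

# Hypothetical toy segmentation with two rectangular cells (labels 1 and 2)
toy_seg = np.zeros((7, 10), dtype=np.uint8)
toy_seg[1:6, 1:5] = 1
toy_seg[1:6, 5:9] = 2

# Empty edge image with the same shape and dtype as the segmentation
toy_edges = np.zeros_like(toy_seg)

for cell_id in np.unique(toy_seg)[1:]:        # [1:] skips the background label 0
    cell_mask = toy_seg == cell_id
    eroded = ndi.binary_erosion(cell_mask)    # shrink the cell by 1 pixel
    edge_mask = np.logical_xor(cell_mask, eroded)
    toy_edges[edge_mask] = cell_id            # label the edge with the cell's ID

print(toy_edges)
```

Because each cell is eroded separately, neighboring cells each keep their own edge, so the shared boundary ends up two pixels wide (one pixel per cell).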


Follow the instructions in the comments below.

In [None]:
# (i) Create an empty array of the same size and data type as the segmentation
# Hint: np.zeros_like(...) or np.zeros(...,dtype=...)


# (ii) Iterate over the cells
# Hint: np.unique


    # (iii) Erode the cell's mask by 1 pixel
    # Hint: smart indexing and ndi.binary_erosion
    
    
    # (iv) Create cell edge mask
    # Hint: np.logical_xor
    
    
    # (v) Add the cell edge mask to the empty array generated above, labeling it with the cell's ID
    # Hint: smart indexing
    

# (vi) Visualize the result
# Note: Because the lines are so thin (1pxl wide), they may not be displayed correctly in small images.
#       If you wish, you can try and find a solution for this problem. One simple option is just to
#       show a sub-region of the image so it is rendered bigger.



## Extracting Quantitative Measurements

#### Background

The ultimate goal of image segmentation is of course the extraction of quantitative measurements, in this case on a single-cell level. Measures of interest can be based on intensity (in different channels) or on the size and shape of the cells.

To exemplify how different properties of cells can be measured, we will extract the following:

- Cell ID (so all other measurements can be traced back to the cell that was measured)
- Mean intensity of each cell, for each channel
- Mean intensity at the membrane of each cell, for each channel
- The cell area, i.e. the number of pixels that make up the cell
- The cell outline length, i.e. the number of pixels that make up the cell edge

*Note:* It makes sense to use smoothed/filtered/background-subtracted images for segmentation. When it comes to measurements, however, it's best to get back to the raw data!
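
As a rough sketch of how such measurements can be collected, the example below uses tiny made-up arrays in place of the real segmentation, edge mask and raw image. All names (`toy_seg`, `toy_edges`, `toy_green`, `toy_results`) and the dictionary keys are assumptions chosen for illustration, and only one channel is shown.

```python
import numpy as np

# Hypothetical toy data: a label image, its edge image, and one raw channel
toy_seg = np.array([[1, 1, 1, 0],
                    [1, 1, 1, 0],
                    [1, 1, 1, 2],
                    [0, 0, 2, 2]])
toy_edges = np.array([[1, 1, 1, 0],
                      [1, 0, 1, 0],
                      [1, 1, 1, 2],
                      [0, 0, 2, 2]])
toy_green = np.arange(16, dtype=float).reshape(4, 4)

# One empty list per measurement
toy_results = {"cell_id": [],
               "green_intensity_mean": [],
               "green_membrane_mean": [],
               "cell_area": [],
               "cell_edge": []}

for cell_id in np.unique(toy_seg)[1:]:       # [1:] skips the background label 0
    cell_mask = toy_seg == cell_id
    edge_mask = toy_edges == cell_id
    toy_results["cell_id"].append(int(cell_id))
    toy_results["green_intensity_mean"].append(float(np.mean(toy_green[cell_mask])))
    toy_results["green_membrane_mean"].append(float(np.mean(toy_green[edge_mask])))
    toy_results["cell_area"].append(int(np.sum(cell_mask)))    # number of cell pixels
    toy_results["cell_edge"].append(int(np.sum(edge_mask)))    # number of edge pixels

for key, values in toy_results.items():
    print(key, values)
```

Converting the values to plain Python `int` and `float` is a design choice made here so that the dictionary contains only built-in types.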

#### <font color='teal'> Exercise </font>

Extract the measurements listed above for each cell and collect them in a dictionary.

Follow the instructions in the comments below.

In [1]:
# (i) Create a dictionary that contains a key-value pairing for each measurement
# The keys should be strings describing the type of measurement (e.g. 'green_intensity_mean') 
# and the values should be empty lists. These empty lists will be filled with the results of 
# the measurements and the dictionary will make it easy to work with this data.


# (ii) Record the measurements for each cell
# Iterate over the segmented cells (np.unique).
# Inside the loop, create a mask for the current cell and use it to extract the measurements listed above. 
# Add them to the appropriate list in the dictionary using the list.append method.
# Hint: Remember that you can get out all the values within a masked area by indexing the image 
#       with the mask. For example, np.mean(image[cell_mask]) will return the mean of all the 
#       intensity values of 'image' that are masked by 'cell_mask'.


# (iii) Print the results and check that they make sense



## Simple Analysis & Visualisation

#### Background

By extracting quantitative measurements from an image we cross over from 'image analysis' to 'data analysis'. 

This section briefly explains how to do basic data analysis and plotting, including boxplots, scatterplots and linear fits. It also showcases how to map data back onto the image, creating an "image-based heatmap".

For a more in-depth intro to data analysis with single-cell segmentation data, refer to the <font color='green'>optional advanced content</font>.
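
To illustrate the linear fit, here is a minimal sketch using synthetic numbers rather than real measurements. The arrays `toy_area` and `toy_intensity` are randomly generated stand-ins, so the printed values say nothing about the example images.

```python
import numpy as np
from scipy.stats import linregress

# Synthetic stand-in data (not real measurements): two unrelated variables
rng = np.random.default_rng(42)
toy_area = rng.uniform(200, 800, size=300)
toy_intensity = rng.normal(1000, 50, size=300)

# Linear fit of intensity over area
fit = linregress(toy_area, toy_intensity)
r_squared = fit.rvalue ** 2
print("slope:", fit.slope, "intercept:", fit.intercept)
print("r:", fit.rvalue, "r-squared:", r_squared, "p:", fit.pvalue)

# Two x-values are enough to define the fitted line for plotting
x_line = np.array([toy_area.min(), toy_area.max()])
y_line = fit.slope * x_line + fit.intercept
```

The pair `x_line`, `y_line` could then be passed to `plt.plot` after re-creating the scatter plot, so that the line is drawn on top of the points.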

#### <font color='teal'> Exercise </font>

Follow the instructions in the comments below.

*Note:* If you're working in jupyter notebook, feel free to split the code cell below into multiple cells to make it easier to view the plots and modify the code without scrolling up and down all the time.
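
For the image-based heatmap mentioned in the background, the following sketch shows one way of painting a per-cell value back onto the label image. The toy arrays and variable names are assumptions for illustration only.

```python
import numpy as np

# Hypothetical toy segmentation; 0 is background
toy_seg = np.array([[1, 1, 0, 2],
                    [1, 1, 0, 2],
                    [0, 0, 0, 0],
                    [3, 3, 3, 0]])
cell_ids = np.unique(toy_seg)[1:]            # [1:] skips the background label 0

# Per-cell areas, in the same order as the cell IDs
toy_areas = np.array([np.sum(toy_seg == c) for c in cell_ids])

# Rescale so the largest area maps to 255, then convert to 8-bit
areas_8bit = (toy_areas * 255 / toy_areas.max()).astype(np.uint8)

# Paint each cell with its rescaled area
toy_heatmap = np.zeros(toy_seg.shape, dtype=np.uint8)
for index, cell_id in enumerate(cell_ids):
    toy_heatmap[toy_seg == cell_id] = areas_8bit[index]

print(toy_heatmap)
```

For display, the background zeros can be hidden with `np.ma.array(toy_heatmap, mask=toy_heatmap == 0)` before overlaying the result with `plt.imshow` and an `alpha` value below 1.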

In [6]:
# (i) Familiarize yourself with the data structure of the results dict and summarize the results
# Recall that a dataset in a dictionary is accessed through its key rather than by position.
# In our case, the datasets inside the dict are lists of values, ordered in the same order
# as the cell IDs. 
# For each dataset in the results dict, print its name (the key) along with its mean, standard 
# deviation, maximum, minimum, and median. The appropriate numpy methods (e.g. np.median) work
# with lists just as well as with arrays.


# (ii) Create a box plot showing the mean cell and mean membrane intensities for both channels. 
# Use the function plt.boxplot. Use the 'labels' keyword of 'plt.boxplot' (renamed to 'tick_labels' in Matplotlib 3.9) to label the x axis with 
# the corresponding key names. Feel free to play around with the various options of the boxplot 
# function to make your plot look nicer. Remember that you can first call plt.figure to adjust 
# settings such as the size of the plot.


# (iii) Create a scatter plot of red membrane intensity over cell area
# Use the function plt.scatter for this. Be sure to properly label the plot using plt.xlabel,
# plt.ylabel and plt.title.


# (iv) Perform a linear fit of red membrane intensity over cell area
# Use the function linregress from the module scipy.stats. Be sure to read the docs to
# understand the output of this function. Print the output. In addition to R, also
# calculate and print R-squared.


# (v) Think about the result
# Note that the fit seems to return a highly significant p-value but a very low correlation coefficient (r-value). 
# Based on prior knowledge, we would not expect a linear correlation to be present in our data. 
# This should prompt several questions:
#   1) What does this p-value actually mean? Check the docs of linregress!
#   2) Could there be artifacts in our segmentation that bias this analysis?
#   3) With single-cell analysis, we quickly get to a large number of datapoints. 
#      This can skew statistical analyses and should be accounted for by multiple
#      testing correction and/or by comparison with randomized datasets.
# In general, it's always good to be very careful when doing data analysis. Make sure you understand the functions 
# you are using and always check for possible errors or biases in your analysis!


# (vi) Overlay the linear fit onto the scatter plot
# Recall that a linear function is defined by `y = slope * x + intercept`.

# To define the line you'd like to plot, you need two values of x (the starting point
# and the end point of the line). What values of x make sense?

# When you have the x-values for the starting point and end point, get the corresponding y 
# values from the fit using the equation above.

# Plot the line with plt.plot. Adjust the line's properties so it is well visible.
# Hint: remember that you will have to re-create the scatterplot before plotting the line, 
# so that the line will be placed on top of the scatterplot.

# Use plt.legend to add information about the line to the plot.

# Label the plot and finally show it with plt.show.


# (vii) Map the cell area back onto the image as a 'heatmap'
# Scale the cell area data to 8bit so that it can be used as pixel intensity values.
# Hint: if the largest cell area should correspond to the value 255 in uint8, then 
# the other cell areas correspond to cell_area*255/largest_cell_area.
    
# Initialize a new image; all values should be zeros, the shape should be identical to
# the images we worked with before and the dtype should be uint8.

# Iterate over the segmented cells (np.unique). In addition to the cell IDs, the
# for-loop should also include a simple counter (starting from 0) with which the area
# data can be accessed.
    
# Mask the current cell and assign the cell's (re-scaled) area to the cell's pixels.
    
# Visualize the result as a colored semi-transparent overlay over the raw/smoothed original input image.
# Optional: See if you can exclude outliers to make the colormapping more informative!



## Writing Output to Files

#### Background

The final step of the pipeline shows how to write various outputs of the pipeline to files.

Data can be saved to files in a human-readable format such as text files (e.g. to import into Excel), in a format readable for other programs such as tif-images (e.g. to view in Fiji) or in a language-specific file that makes it easy to reload the data into python in the future (e.g. for further analysis).
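
The sketch below illustrates three of these formats with small made-up data: a NumPy file, a JSON file and a tab-separated text file. The names `toy_results` and `toy_seg` and the file names are hypothetical, and everything is written to a temporary folder so that no real files are touched.

```python
import json
import os
import tempfile

import numpy as np

# Hypothetical results dict and label image standing in for the pipeline output
toy_results = {"cell_id": [1, 2], "cell_area": [9, 3]}
toy_seg = np.array([[1, 1, 0], [0, 2, 2]], dtype=np.uint16)

out_dir = tempfile.mkdtemp()   # temporary folder, so no real files are overwritten

# NumPy file: fast storage and reloading of arrays
npy_path = os.path.join(out_dir, "toy_seg.npy")
np.save(npy_path, toy_seg)
seg_reloaded = np.load(npy_path)

# JSON file: works for dicts of plain Python lists, numbers and strings
json_path = os.path.join(out_dir, "toy_results.json")
with open(json_path, "w") as outfile:
    json.dump(toy_results, outfile)
with open(json_path, "r") as infile:
    results_reloaded = json.load(infile)

# Tab-separated text file: header line first, then one line per cell
keys = list(toy_results.keys())
tsv_path = os.path.join(out_dir, "toy_results.txt")
with open(tsv_path, "w") as outfile:
    outfile.write("\t".join(keys) + "\n")
    for i in range(len(toy_results["cell_id"])):
        outfile.write("\t".join(str(toy_results[k][i]) for k in keys) + "\n")
```

Note that `json.dump` raises a `TypeError` for NumPy integers such as `np.int64`, so such values may need to be converted with `int(...)` before saving.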

#### <font color='teal'> Exercise </font>

Follow the instructions in the comments below.

In [None]:
# (i) Write one or more of the images you produced to a tif file
# Use the function imwrite from the tifffile module (called imsave in older versions). Make sure that the array you are writing
# is of integer type. If necessary, you can use the method 'astype' for conversions, e.g.
# some_array.astype(np.uint8).
# You can also try adding the segmentation to the original image, creating an image with
# three channels, one of them being the segmentation. 
# After writing the file, load it into Fiji and check that everything worked as intended.


# (ii) Write a figure to a png or pdf
# Recreate the scatter plot from above (with or without the regression line), then save the figure
# as a png using plt.savefig. Alternatively, you can also save it to a pdf, which will create a
# vector graphic that can be imported into programs like Adobe Illustrator.


# (iii) Save the segmentation as a numpy file
# Numpy files allow fast storage and reloading of numpy arrays. Use the function np.save to save
# the array and reload it using np.load.


# (iv) Save the result dictionary as a json file
# Json files are generic text files that can store basic python objects (dicts, lists, strings, numbers) and reload them again.
# You will need to open an empty file object using "open" in write mode ('w'). It's best to do so
# using the 'with'-statement (context manager) to make sure that the file object will be closed
# automatically when you are done with it. Use the function json.dump to write the results to
# the file.
# Hint: Refer to the python documentation for input and output to understand how file objects are
#       handled in python in general.
    

# Note: json files can be re-loaded again as follows:
#with open('my_filename.json', 'r') as infile:
#   results = json.load(infile)


# (v) Write a tab-separated text file of the results dict
# The most generic way of saving numerical results is a simple text file. It can be imported into 
# pretty much any other program.

# To write normal text files, open an empty file object in write mode ('w') using the 'with'-statement.

# Use the file_object.write(string) method to write strings to the file. First write the header of the
# data (the result dict keys), separated by tabs ('\t'). It makes sense to first generate a complete
# string with all the headers and then write this string to the file. Note that you will need to 
# explicitly write 'newline' characters ('\n') at the end of the line to switch to the next line.

# After writing the headers, iterate over all the cells saved and write the data to the file by
# creating strings similar to the header string.

# After writing the data, have a look at the output file in a text editor or in a spreadsheet
# program like Excel.



## Batch Processing

#### Background

In practice, we never work with just a single image, so we would like to make it possible to run our analysis pipeline for multiple images and then collect and analyze all the results. This final section of the tutorial shows how to do just that.

#### <font color='teal'> Exercise </font>

To run a pipeline multiple times, it needs to be packaged as a function or even as a module. Jupyter notebook is not well suited for this, so if you're working in a notebook, first extract your code to a .py file (see instructions below). If you are not working in a notebook, create a copy of your pipeline; we will modify this copy into a function that can then be called repeatedly for different images.

To export a jupyter notebook as a .py file, use `File > Download as > Python (.py)`, then save the file. Open the resulting python script in a text editor or in an IDE like Spyder. 


**Let's clean the script a bit:**

- Remove the line `%matplotlib inline`. It is not valid python code outside of a notebook.


- Go through the script and comment out everything related to plotting; when running a pipeline for dozens or hundreds of images, we usually do not want it to generate tons of plots


- Similarly, it can make sense to remove some print statements if you have many of them.


- Remove the sections `Simple Analysis & Visualisation` and `Writing Output to Files`; we will collect the output for each image within python and then analyze everything at once


- Feel free to delete the background information to make the script more concise.


**Converting the pipeline to a function:**

Convert the entire pipeline into a function that accepts a filename as input, runs everything, and returns the final segmentation and the results dictionary. To do this, you must:

- Add the function definition statement at the beginning of the script (after the imports)
- Replace the 'hard-coded' filename by a variable that is accepted by the function
- Indent all the code
- Add a return statement at the end


**Importing the function and running it for multiple input files:**

To actually run the pipeline function for multiple input files, we need to do the following:

- Iterate over all the filenames in a directory
- For each filename, call the pipeline function
- Collect the returned results
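
These three steps can be sketched as follows. The function `run_pipeline` here is a hypothetical stand-in that returns dummy output instead of a real segmentation, and the input files are empty placeholders created in a temporary folder purely for illustration.

```python
import os
import tempfile


def run_pipeline(filename):
    """Hypothetical stand-in for the real pipeline function."""
    # The real function would load the image, segment it and measure the cells
    segmentation = None
    results = {"cell_id": [1, 2], "source": os.path.basename(filename)}
    return segmentation, results


# Temporary folder with empty placeholder files, only for illustration
data_dir = tempfile.mkdtemp()
for name in ["example_cells_1.tif", "example_cells_2.tif", "notes.txt"]:
    open(os.path.join(data_dir, name), "w").close()

# Keep only the relevant input files
filenames = sorted(f for f in os.listdir(data_dir)
                   if f.startswith("example_cells_") and f.endswith(".tif"))

# Run the pipeline for each file and key the output by filename
all_results = {}
for fname in filenames:
    seg, res = run_pipeline(os.path.join(data_dir, fname))
    all_results[fname] = res

print(list(all_results.keys()))
```

Using the filename as the dictionary key keeps every set of results traceable to the image it came from.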

Follow the instructions in the code below.

In [1]:
# (i) Test if your pipeline function actually works
# Import your function using the normal python syntax for imports, like this:
#   from your_module import your_function
# Run the function and visualize the resulting segmentation. Make sure everything
# works as intended.


# (ii) Get all relevant filenames from the directory
# Use the function 'listdir' from the module 'os' to get a list of all the files
# in a directory. Find a way to filter out only the relevant input files, namely
# "example_cells_1.tif" and "example_cells_2.tif". Of course, one would usually
# do this for many more images, otherwise it's not worth the effort.
# Hint: Loop over the filenames and use if statements to decide which ones to 
#        keep and which ones to throw away.


# (iii) Iterate over the relevant input filenames and run the pipeline function
# Be sure to collect the output of the pipeline function in a way that allows
# you to trace it back to the file it came from. You could for example use a
# dictionary with the filenames as keys.


# (iv) Recreate the scatterplot from above, this time with all the cells
# You can color-code the dots to indicate which file they came from (add a legend
# to label it explicitly).



### <font color='teal'>*Congratulations! You have completed the tutorial!*</font>

<br>

**...but if you now just go back to your work and do nothing, you will forget all you learned within a month or two!**


So, what to do?

- Have a look at the <font color='green'>optional advanced content</font>


- Start applying what you have learned to your own work!


- Stay engaged even if you currently don't need your new skills at work!

    - Play around with data from your work, even if you don't need it at the moment

    - Find yourself an interesting little 'pet project' to play around with

    - Look for tutorials online with additional/advanced content
    
    - Join seminars/events related to coding and image analysis
        - Check out the [Bio-IT Portal](https://bio-it.embl.de/) for more info! *[internal access only]*
        - Join the [EMBL Coding Club](https://bio-it.embl.de/coding-club/) *[internal access only]*