# Harpy pipeline

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
import harpy as hp

## 1. Read in the data

In this notebook, we will be working with a Resolve Biosciences Molecular Cartography dataset of a wild-type (WT) mouse liver sample. The dataset will be downloaded and cached using `pooch` via `harpy.datasets.registry`.

The DAPI image is then read in using the `bioio` package (see https://bioio-devs.github.io/bioio/) and added to a `SpatialData` object (see https://spatialdata.scverse.org/en/latest/ for more information).

In [None]:
import tempfile
from harpy.datasets.registry import get_registry

unit_testing = True

# The dataset will be downloaded from the registry. If path is set to None, the example data will be downloaded to the default cache folder of your OS. Change path to a directory of your choice to override this behaviour.
registry = get_registry(path = None) # on Windows, set path (e.g. to r"c:\tmp")
path_image = registry.fetch("transcriptomics/resolve/mouse/20272_slide1_A1-1_DAPI.tiff")
path_coordinates = registry.fetch("transcriptomics/resolve/mouse/20272_slide1_A1-1_results.txt")

# The OUTPUT_DIR is the directory where the SpatialData .zarr will be saved. Change it to your output directory of choice.
OUTPUT_DIR =  tempfile.gettempdir()

In [None]:
from bioio import BioImage

# The DAPI image is read using bioio
img = BioImage(path_image)

# We print the image dimensions
print('Image dimensions: ', img.dims)

# We can have a look at the dask array
img.dask_data

In [None]:
# We "squeeze" the data to collapse the T and Z dimensions
array = img.dask_data.squeeze((0, 2)) # Squeeze T and Z dimension

# Let's look at the new dask array
array
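
To see what the squeeze does without loading the real image, here is a minimal NumPy sketch. It assumes the five dimensions are ordered T, C, Z, Y, X, and the toy array `stack` is hypothetical.

```python
import numpy as np

# Hypothetical 5D stack in T, C, Z, Y, X order with one time point and one z-plane.
stack = np.zeros((1, 1, 1, 4, 6), dtype=np.uint16)

# Drop axis 0 (T) and axis 2 (Z); the channel axis is kept even though its length is 1.
image_cyx = stack.squeeze((0, 2))

print(stack.shape)      # (1, 1, 1, 4, 6)
print(image_cyx.shape)  # (1, 4, 6)
```

The result has the `("c", "y", "x")` layout that is passed as `dims` when the image is added to the `SpatialData` object.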

In [None]:
import os
import uuid
from spatialdata import SpatialData, read_zarr

# Create an empty SpatialData object
sdata = SpatialData()

# Set the path for the SpatialData .zarr
zarr_path = os.path.join(OUTPUT_DIR, f"sdata_{uuid.uuid4()}.zarr")

# Write the SpatialData to Zarr
sdata.write(zarr_path)

In [None]:
# Reload the Zarr data back as a SpatialData
sdata = read_zarr(sdata.path)

# Check if SpatialData is backed (i.e. stored on disk)
sdata.is_backed()

In [None]:
# We add the DAPI image to the SpatialData object
sdata = hp.im.add_image_layer(
    sdata, # The SpatialData object to which the new image layer will be added.
    arr = array, # The array containing the image data to be added.
    dims = ( "c", "y", "x" ), # A tuple specifying the dimensions of the image data
    output_layer = "raw_image", # The name of the output layer where the image data will be stored.
    overwrite = True,
)

In [None]:
# We can access the DAPI image like this:
sdata["raw_image"] # Or, alternatively: sdata.images["raw_image"]

In [None]:
# Plot a crop of the DAPI image
hp.pl.plot_image(
    sdata, 
    img_layer = "raw_image" , 
    crd = [0, 6432, 0, 6432], # The coordinates for the region of interest in the format (xmin, xmax, ymin, ymax). If None, the entire image is plotted.
    figsize = (5,5),
)
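
The `crd` argument lists the x-range first, while arrays are indexed with y first. The sketch below shows how such a crop maps onto array slicing. The helper `crop_from_crd` is hypothetical and only illustrates the coordinate convention; it is not part of Harpy.

```python
import numpy as np

def crop_from_crd(image_cyx, crd):
    """Crop a (c, y, x) array with crd = [x_min, x_max, y_min, y_max] (hypothetical helper)."""
    x_min, x_max, y_min, y_max = crd
    # Rows are y and columns are x, so the y-range comes first when indexing.
    return image_cyx[:, y_min:y_max, x_min:x_max]

# Toy image with 1 channel, 10 rows (y) and 20 columns (x).
image = np.arange(1 * 10 * 20).reshape(1, 10, 20)
crop = crop_from_crd(image, [5, 15, 2, 6])

print(crop.shape)  # (1, 4, 10)
```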

In [None]:
# Or, alternatively, via spatialdata-plot:
import spatialdata_plot
sdata.pl.render_images("raw_image").pl.show()

<b>Exercise</b>:

- Use the `Harpy` function `hp.pl.plot_shapes` to visualize another crop (e.g.: `x_min=2000`, `x_max=4000`, `y_min=1000`, `y_max=4000`). 

- Bonus: How would you save the plot to disk?

- Bonus: Read the docstring of `hp.pl.plot_shapes`. What does the `fig_kwargs` parameter do? Can you change `dpi` of the resulting image?

<details>
<summary>Click to reveal the solution</summary>

```python
hp.pl.plot_shapes(
    sdata,
    img_layer="raw_image",
    crd=[2000, 4000, 1000, 4000],
    output=f"{OUTPUT_DIR}/plot.png",
    fig_kwargs={"dpi": 300},
)
```

</details>


<b>Exercise</b>:

- Uncomment the following cell and explore the DAPI image in Napari. Try changing the contrast of the image.

In [None]:
# from napari_spatialdata import Interactive

# Interactive(sdata)

<b>Exercise</b>:
- Bonus: Add DAPI as a multiscale image to the SpatialData object (tip: read the documentation).

<details>
<summary>Click to reveal the solution</summary>

```python
# Add as multiscale image
sdata=hp.im.add_image_layer(
    sdata,
    arr = array,
    dims = ( "c", "y", "x" ),
    output_layer = "raw_image",
    scale_factors = [2, 2, 2, 2],
    overwrite = True,
)

# Now it is a DataTree
type(sdata["raw_image"])  

# Let's have a look at the dask array
from harpy.image._image import _get_spatial_element
se = _get_spatial_element(sdata, layer="raw_image")
se.data
```

</details>

## 2. Image preprocessing

### 2.1 Tiling correction and inpainting

RESOLVE data is acquired in tiles with uneven illumination, which can strongly affect the downstream analysis. RESOLVE assured us this shouldn't impact the transcript counts, but we can check later on whether this is the case. This step is not necessary for most other imaging-based spatial transcriptomics technologies (Xenium, MERSCOPE, ...), but you should plot the entire image to check whether your data needs this preprocessing step.

Harpy's `hp.im.tiling_correction` function corrects for uneven illumination (using BaSiC on the back-end). The function needs to know the size of the imaging tiles. The `tile_size` parameter is set to the tile size of RESOLVE (2144) by default.

The same function also removes the black lines between the tiles using OpenCV's inpainting.
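
To build some intuition for what an illumination correction does, here is a minimal NumPy sketch. It is not the BaSiC algorithm that Harpy uses. It simply assumes that the shading pattern is identical in every tile and estimates it as the average tile. The helper `correct_tiles` and the toy data are hypothetical, and inpainting is not shown.

```python
import numpy as np

def correct_tiles(image, tile_size):
    """Divide every tile by a normalized average tile (a crude flat-field estimate)."""
    n_y, n_x = image.shape[0] // tile_size, image.shape[1] // tile_size
    # View the image as a grid of tiles: (n_y, n_x, tile_size, tile_size).
    tiles = image.reshape(n_y, tile_size, n_x, tile_size).swapaxes(1, 2).astype(float)
    # Average over all tiles, then scale so that the mean of the flat-field is 1.
    flatfield = tiles.mean(axis=(0, 1))
    flatfield /= flatfield.mean()
    corrected = tiles / flatfield
    # Put the tiles back in their original positions.
    return corrected.swapaxes(1, 2).reshape(image.shape), flatfield

# Toy data: a constant signal of 100, darker on the left of every 4 x 4 tile.
shading = np.tile(np.linspace(0.5, 1.5, 4), (4, 1))
image = 100 * np.tile(shading, (2, 2))

corrected, flatfield = correct_tiles(image, tile_size=4)
print(np.allclose(corrected, 100.0))  # True
```

On real data the shading is not known exactly and the tissue content differs per tile, which is why a dedicated method such as BaSiC is used instead of a plain average.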

In [None]:
# Performing tiling correction
sdata, flatfields = hp.im.tiling_correction(
    sdata = sdata,
    img_layer = "raw_image",
    tile_size = 2144, # This is set to 2144 by default
    output_layer = "tiling_correction",
    crd = [0, 6432, 0, 6432],
    overwrite=True
)

In [None]:
# Plot the raw and corrected image side-by-side
hp.pl.plot_image(sdata, img_layer=[ "raw_image", "tiling_correction" ], crd =  [2000, 6000, 2000, 6000], figsize=(10,10))

### 2.2 Min-max filtering and contrast enhancement
The next preprocessing steps include:

- Min-max filtering. The goal of this step is to subtract background noise and make the borders of the nuclei/cells cleaner. It will also remove some debris. Note that if you set the size of the filter too small (smaller than the size of your nuclei), the function will create "donuts" (black spots in the center of your cells). If the size of the min-max filter is too large, not enough background will be subtracted. Generally, you want to aim for the average nucleus size, and some fine-tuning may be necessary. For nuclei in RESOLVE data, 45-55 is a good starting point.

- Contrast enhancement. We also recommend enhancing the contrast of your image. Harpy does this using contrast limited adaptive histogram equalization (CLAHE). The amount of correction is controlled by the `contrast_clip` value. If the image is already quite bright, 3.5 might be a good starting point. For dark images, you can go up to 10 or even more. Make sure that, at the end, the whole image is evenly illuminated and no cells remain dark in the background.

If you think your data needs further image processing steps, you can perform these using the `hp.im.map_image` function (see section 2.3).
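
The effect of the filter size can be illustrated with a small NumPy sketch. It assumes that the background is estimated as a maximum filter applied to a minimum filter of the image and then subtracted. The helpers `window_filter` and `min_max_filter` are hypothetical and written for clarity rather than speed; they are not Harpy's implementation.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def window_filter(image, size, reducer):
    """Apply `reducer` (np.min or np.max) over a size x size window around each pixel."""
    pad = size // 2  # `size` is assumed to be odd
    padded = np.pad(image, pad, mode="edge")
    windows = sliding_window_view(padded, (size, size))
    return reducer(windows, axis=(-2, -1))

def min_max_filter(image, size):
    """Subtract a background estimate: the maximum filter of the minimum filter."""
    background = window_filter(window_filter(image, size, np.min), size, np.max)
    return image - background

# Toy image: flat background of 10 with one 5 x 5 "nucleus" of intensity 100.
image = np.full((21, 21), 10, dtype=np.int64)
image[8:13, 8:13] = 100

filtered_large = min_max_filter(image, size=9)  # window larger than the nucleus
filtered_small = min_max_filter(image, size=3)  # window smaller than the nucleus

print(filtered_large[10, 10], filtered_large[0, 0])  # 90 0
print(filtered_small[10, 10])                        # 0
```

With the larger window only the background is removed. With the window smaller than the nucleus, the nucleus itself is treated as background and subtracted, which is the origin of the black spots described above.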

In [None]:
# Perform min max filtering
sdata = hp.im.min_max_filtering(
    sdata,
    img_layer = "tiling_correction",
    output_layer = "min_max_filtered",
    size_min_max_filter = 45,
    overwrite = True,
)

# Plot the min max filtered image
hp.pl.plot_image(
    sdata,
    img_layer = "min_max_filtered",
    crd = [2000,6000,2000,6000],
    figsize = (5, 5),
)

# Perform contrast enhancement using CLAHE
sdata = hp.im.enhance_contrast(
    sdata,
    img_layer = "min_max_filtered",
    output_layer = "clahe",
    contrast_clip = 3.5,
    chunks = 20000,
    overwrite = True
)

# Plot the contrast enhanced image
hp.pl.plot_image(
    sdata,
    img_layer = "clahe",
    crd = [2000,6000,2000,6000],
    figsize = (5, 5),
)

<b>Exercise</b>:

- Change the `size_min_max_filter` parameter in `hp.im.min_max_filtering`. What do you see? Try some extreme values.
- Change the `contrast_clip` parameter in `hp.im.enhance_contrast`. What do you see? Try some extreme values.
- Try image preprocessing on a different crop.

<b>Exercise</b>:

- Uncomment the following cell and explore the preprocessed images in Napari.

In [None]:
#Interactive(sdata)

### 2.3 Custom distributed preprocessing of images using `hp.im.map_image` and `Dask`

See https://docs.dask.org/en/stable/generated/dask.array.map_blocks.html and https://docs.dask.org/en/latest/generated/dask.array.map_overlap.html

Set `blockwise=True` if you want to do distributed processing using `dask.array.map_blocks` or `dask.array.map_overlap`. Set `blockwise=False` if your function is already distributed (e.g. when using `dask_image` filters, see https://image.dask.org/en/latest/dask_image.ndfilters.html).
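
Why an overlap (the `depth` parameter) matters can be shown without Dask. The NumPy-only sketch below processes an image chunk by chunk, optionally extending each chunk with a halo of neighbouring pixels and trimming it again afterwards. This is only an illustration of the idea behind `map_overlap`; the helpers `box_sum` and `apply_blockwise` are hypothetical.

```python
import numpy as np

def box_sum(block):
    """Sum of each pixel's 3 x 3 neighbourhood (edges are replicated)."""
    height, width = block.shape
    padded = np.pad(block, 1, mode="edge")
    return sum(padded[dy:dy + height, dx:dx + width] for dy in range(3) for dx in range(3))

def apply_blockwise(image, func, chunk, depth=0):
    """Apply `func` to (y, x) chunks of a 2D image, with a halo of `depth` pixels."""
    out = np.empty_like(image)
    height, width = image.shape
    for y0 in range(0, height, chunk):
        for x0 in range(0, width, chunk):
            y1, x1 = min(y0 + chunk, height), min(x0 + chunk, width)
            # Extend the chunk by `depth` pixels on each side, clipped at the image border.
            ys, xs = max(y0 - depth, 0), max(x0 - depth, 0)
            ye, xe = min(y1 + depth, height), min(x1 + depth, width)
            result = func(image[ys:ye, xs:xe])
            # Trim the halo so that only the chunk's own pixels are written.
            out[y0:y1, x0:x1] = result[y0 - ys:y0 - ys + (y1 - y0), x0 - xs:x0 - xs + (x1 - x0)]
    return out

image = np.arange(64).reshape(8, 8)
reference = box_sum(image)  # result when the whole image is processed at once

no_overlap = apply_blockwise(image, box_sum, chunk=4, depth=0)
with_overlap = apply_blockwise(image, box_sum, chunk=4, depth=1)

print(np.array_equal(no_overlap, reference))    # False
print(np.array_equal(with_overlap, reference))  # True
```

Functions that only look at one pixel at a time (such as a multiplication) do not need an overlap. Functions that look at a neighbourhood need a `depth` at least as large as that neighbourhood to avoid artefacts at chunk borders.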

In [None]:
import numpy as np
from numpy.typing import NDArray

# Define your custom function
def _my_dummy_function(image: NDArray, parameter: int | float )->NDArray:
    # input (1,1,y,x)
    # output (1,1,y,x)
    print(f"Type of the image is: {type(image)}")
    print(image.shape)
    return image*parameter

fn_kwargs = {"parameter": 2}

# Apply custom function
sdata = hp.im.map_image(
    sdata,
    func = _my_dummy_function,
    fn_kwargs = fn_kwargs,
    img_layer = "raw_image",
    output_layer="dummy_image",
    chunks = 5000,
    blockwise = True, # if blockwise == True --> input to _my_dummy_function is a numpy array of size chunks, else it is a Dask array (with chunksize chunks)
    depth = 1000, # if blockwise == True, and depth specified, will use map_overlap instead of map_blocks for distributed processing
    overwrite = True,
    dtype = np.uint16,
    meta = np.array((), dtype=np.uint16),
)

In [None]:
from harpy.image._image import _get_spatial_element

_get_spatial_element(sdata, layer="raw_image").data.compute()[ :, :10, :10 ]

In [None]:
_get_spatial_element(sdata, layer="dummy_image").data.compute()[ :, :10,:10 ]

<b>Exercise</b>:

- Adapt `_my_dummy_function` so it accepts a new parameter, `parameter_2`, and multiplies the image by (`parameter` + `parameter_2`).

<details>
<summary>Click to reveal the solution</summary>

```python
def _my_dummy_function(image: NDArray, parameter: int | float, parameter_2: int | float )->NDArray:
    # input (1,1,y,x)
    # output (1,1,y,x)
    print(f"Type of the image is: {type(image)}" )
    print(image.shape)
    return image*(parameter + parameter_2)

fn_kwargs = {"parameter": 2 , "parameter_2": 2}

sdata = hp.im.map_image(
    sdata,
    func = _my_dummy_function,
    fn_kwargs = fn_kwargs,
    img_layer = "raw_image",
    output_layer="dummy_image",
    chunks = 5000,
    blockwise = True, # if blockwise == True --> input to _my_dummy_function is a numpy array of size chunks, else it is a Dask array (with chunksize chunks)
    depth = 1000, # if blockwise == True, and depth specified, will use map_overlap instead of map_blocks for distributed processing
    overwrite = True,
    dtype = np.uint16,
    meta = np.array((), dtype=np.uint16),
)
```

</details>

<b>Exercise</b>:

- Bonus: Run the cell where `hp.im.map_image` is called in debug mode. Set a breakpoint in `_my_dummy_function`. Inspect the shape and type of `image` when you set `blockwise=True` or `blockwise=False`. Set the `depth` parameter to `100`. What do you observe?

## 3. Segmentation

### 3.1 Nucleus segmentation

To segment the nuclei, we show an example using Cellpose, a deep learning model based on a U-Net architecture.

Multiple parameters need to be given as an input to the Cellpose algorithm. We recommend tuning these to achieve optimal segmentation quality (see https://cellpose.readthedocs.io/en/latest/settings.html). It is often a good idea to fine-tune the parameters on a crop of the image (especially when you only have a CPU to work with).

- diameter: An estimate of the average nucleus diameter, in pixels. If set to None, Cellpose will try to estimate the diameter, but this might take a long time and is usually far off. As a guideline, you can use approx. 7 micrometer (in this case about 50 pixels at 0.138 micrometer per pixel) for a standard nucleus, but this may vary depending on your specific tissue, sample, etc.
- device: The device you want to work on. If you only have a CPU, you can skip this input parameter.
- flow_threshold: Controls which mask shapes are accepted. If you increase it, more masks with less round shapes will be accepted. Usually set between 0.6 and 0.95 (max. is 1). Lower this parameter if you start segmenting artefacts. Increase it if the segmentation misses some non-round cells.
- cellprob_threshold (named mask_threshold in some Cellpose versions): Controls how many of the possible masks are kept. Decreasing the parameter will output more masks. Larger values will output fewer masks. Usually set between 0 and -6.
- min_size: The minimum size of a nucleus.
- model_type: If segmenting whole cells instead of nuclei, set this to 'cyto'. You can do this with and without a nucleus channel. When you want to include a nucleus channel for the segmentation, make sure your image is 3D and that the first channel contains the complete cell staining and the second one the nucleus channel (set the channel parameter to np.array([1,0])).
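
The pixel value for `diameter` follows from the physical nucleus size and the pixel size. A short worked example, using the 7 micrometer guideline and the 0.138 micrometer per pixel mentioned above (adapt both values to your own data):

```python
# Assumed values: a typical nucleus of about 7 micrometer and the pixel size quoted above.
nucleus_diameter_um = 7.0
pixel_size_um = 0.138  # micrometer per pixel

diameter_px = nucleus_diameter_um / pixel_size_um
print(round(diameter_px, 1))  # 50.7
```

This is why a value of about 50 is a reasonable starting point for `diameter` on this dataset.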

In [None]:
"""
ADVANCED: You can set up a local Dask distributed cluster for parallel computing. Once the cluster is created, a Dask Client is used to connect to it. 
The Dask dashboard link allows you to monitor cluster performance and task progress.
"""

# from dask.distributed import Client, LocalCluster

# # Create a local Dask cluster
# cluster = LocalCluster(
#     n_workers=1,              # Number of worker processes
#     threads_per_worker=10,    # Number of threads per worker
#     memory_limit="32GB",      # Memory limit per worker
# )

# # Connect a Client to the cluster
# client = Client(cluster)

# # Print the Dask dashboard link
# print(client.dashboard_link)


In [None]:
import torch
from cellpose import models
from harpy.image import cellpose_callable

gpu = False
device = "cpu"  # mps broken in cellpose (macOS), see https://github.com/MouseLand/cellpose/issues/1063
model = models.CellposeModel(gpu=gpu, pretrained_model='nuclei', device = torch.device(device))

# model = client.scatter(model) # ADVANCED: Uncomment this when using the Dask Client. We pass a loaded model to cellpose_callable, but we scatter the model to avoid a large task graph.

# Perform nucleus segmentation
sdata = hp.im.segment(
    sdata,
    img_layer="clahe", # The image layer in sdata to be segmented.
    chunks=2048,
    depth=200,
    model=cellpose_callable,
    # parameters that will be passed to the callable cellpose_callable:
    pretrained_model=model,
    diameter=50,
    flow_threshold=0.9,
    cellprob_threshold=-4,
    output_labels_layer="segmentation_mask",
    output_shapes_layer="segmentation_mask_boundaries",
    crd=[2000, 4000, 2000, 4000] if unit_testing else None,  # region to segment [x_min, x_max, y_min, y_max]
    overwrite=True,
)

#client.close() # ADVANCED: Uncomment this when using the Dask Client.

In [None]:
# Plot segmentation results
hp.pl.plot_shapes(sdata, img_layer="clahe", shapes_layer="segmentation_mask_boundaries", figsize=(5,5), crd = [2000, 4000, 2000, 4000])

In [None]:
# or via spatialdata-plot
sdata.pl.render_images("clahe").pl.render_labels("segmentation_mask").pl.show()

In [None]:
# To visualize only a crop using spatialdata-plot, we can't pass any coordinates, but we can perform a bounding box query and then plot the resulting `SpatialData` object.
sdata_small = sdata.query.bounding_box(
    min_coordinate=[2000, 2000], max_coordinate=[4000, 4000], axes=("x", "y"), target_coordinate_system="global"
)

sdata_small.pl.render_images("clahe").pl.render_labels("segmentation_mask", fill_alpha=0.5  ).pl.show()

<b>Exercise</b>:

- Try changing segmentation parameters to see how they affect the results.
- Go to the [documentation](https://spatialdata.scverse.org/projects/plot/en/latest/) of `spatialdata-plot`, and try to visualize the cell boundaries (i.e. the segmentation shapes layer).

<details>
<summary>Click to reveal the solution</summary>

```python
sdata_small.pl.render_images("clahe").pl.render_shapes("segmentation_mask_boundaries", fill_alpha=1.0).pl.show()
```

</details>

### 3.2 Nucleus expansion
In some cases, it may be useful to expand the nucleus segmentations to approximate the cell bodies. Note that this is not very precise: while it increases the number of transcripts assigned to a cell, it also introduces more wrongly assigned transcripts (i.e. transcripts that actually belong to other cells).

In [None]:
# Expand labels layer masks
sdata = hp.im.expand_labels_layer(
    sdata,
    labels_layer="segmentation_mask",
    distance=10, # Number of pixels to expand
    output_labels_layer="segmentation_mask_expanded", # Creates a new labels layer
    output_shapes_layer="segmentation_mask_expanded_boundaries", # Creates a new shapes layer
    overwrite=True,
)

In [None]:
# Plot nuclei masks vs expanded nuclei masks
hp.pl.plot_shapes(
    sdata,
    img_layer="clahe",
    shapes_layer=["segmentation_mask_boundaries", "segmentation_mask_expanded_boundaries"],
    figsize=(10,10),
    crd=[2000, 4000, 2000, 4000],
)

## 4. Allocating the transcripts

### 4.1 Creating the count matrix
In this step we
- load the transcripts: in the case of RESOLVE this is done with a specific loader. If no specific loader exists for your data type, you can use the general `hp.io.read_transcripts` function.
- allocate the transcripts to the correct cell. This allocation step creates the count matrix, which is saved in an [AnnData](https://anndata.readthedocs.io/en/stable/) object.
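
Conceptually, allocation looks up the segmentation label under every transcript and counts transcripts per cell and gene. The NumPy sketch below illustrates this on toy data. It is a simplified stand-in, not Harpy's implementation, and all coordinates and gene assignments are hypothetical.

```python
import numpy as np

# Toy segmentation mask (y, x): 0 is background, 1 and 2 are cells.
labels = np.array([
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [2, 2, 0, 0],
])

# Toy transcripts: pixel coordinates and gene names (all hypothetical).
x = np.array([1, 2, 0, 3, 1])
y = np.array([0, 1, 2, 0, 2])
genes = np.array(["Axl", "Vwf", "Axl", "Axl", "Axl"])

# Look up the label under each transcript; rows index y, columns index x.
cell_of_transcript = labels[y, x]
in_cell = cell_of_transcript > 0  # transcripts on the background are dropped

cell_ids, cell_idx = np.unique(cell_of_transcript[in_cell], return_inverse=True)
gene_names, gene_idx = np.unique(genes[in_cell], return_inverse=True)

# Build the cells x genes count matrix.
counts = np.zeros((len(cell_ids), len(gene_names)), dtype=int)
np.add.at(counts, (cell_idx, gene_idx), 1)

print(counts.tolist())  # [[1, 1], [2, 0]]
```

In this toy example one of the five transcripts falls on the background, so only four end up in the count matrix. The same effect explains why only a fraction of the transcripts is assigned to cells on real data.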

In [None]:
# Read in RESOLVE transcript data as a points layer
sdata = hp.io.read_resolve_transcripts(
    sdata, 
    output_layer="transcripts", # Name of the points layer of the SpatialData object to which the transcripts will be added.
    path_count_matrix=path_coordinates, # Path to the file containing the transcripts information specific to Resolve.
    overwrite=True
)

# Allocate transcripts to cells based on the segmentation masks
sdata = hp.tb.allocate(
    sdata=sdata,
    labels_layer="segmentation_mask", # The labels layer (i.e. segmentation mask) in `sdata` to be used to allocate the transcripts to cells.
    points_layer="transcripts", # The points layer in `sdata` that contains the transcripts.
    output_layer="table_transcriptomics", # The table layer in `sdata` in which to save the AnnData object with the transcripts counts per cell.
    update_shapes_layers=False,
    overwrite=True,
)

In [None]:
# Inspect the new points layer
print(type(sdata.points["transcripts"]))
sdata.points["transcripts"].head()

In [None]:
# Inspect the new table layer
display(sdata.tables["table_transcriptomics"])

print('Number of cells: ', len(sdata.tables["table_transcriptomics"].obs.index))
print('Number of genes: ', len(sdata.tables["table_transcriptomics"].var.index))

In [None]:
# Inspect the count matrix in the new table layer
sdata.tables["table_transcriptomics"].to_df().head() # On large count matrices, calls to .to_df() should be avoided

In [None]:
# Inspect the var of the new table layer
sdata.tables["table_transcriptomics"].var.head()

In [None]:
# Inspect the obs of the new table layer
sdata.tables["table_transcriptomics"].obs.head()

In [None]:
# Inspect the spatial coordinates stored in obsm
sdata.tables["table_transcriptomics"].obsm['spatial'][:5] # x,y,(z) coordinates of cell centre (calculated based on mean transcripts location)

In [None]:
# Inspect the spatialdata_attrs in .uns to check the instance_key and region_key
sdata.tables["table_transcriptomics"].uns['spatialdata_attrs']

# NOTE: The AnnData object that is added as a table layer is annotated by the labels layer "segmentation_mask". The instance_key ('cell_ID') matches the labels in "segmentation_mask".
# NOTE: Tables of a SpatialData object can theoretically be annotated by a labels layer, a shapes layer or a points layer, but tables generated by the Harpy pipeline will always use a labels layer.

In [None]:
import dask.array as da

print('Number of cells in table: ', len(sdata.tables["table_transcriptomics"].obs))
print('Number of segmentation masks in labels layer: ', len(da.unique(sdata.labels["segmentation_mask"].data).compute()) - 1) # We subtract 1 because 0 is also a value, but this corresponds to the background.
print('Number of segmentation boundaries in shapes layer: ', len(sdata.shapes["segmentation_mask_boundaries"]))

# NOTE: Not all segmentation masks are included in the table layer "table_transcriptomics". This is because not all cells could be assigned transcripts.

<b>Exercise</b>:

- Run .compute() on the points layer. What is the data type of the resulting object?

<details>
<summary>Click to reveal the solution</summary>

```python
from IPython.display import display

display(sdata["transcripts"].compute().head())
display(type(sdata["transcripts"].compute())) 
```

</details>

<b>Exercise</b>:

- Have a look at https://docs.dask.org/en/stable/dataframe.html.

<b>Exercise</b>:

- Bonus: Extract the transformation from the points layer "transcripts" using `spatialdata.transformations.get_transformation`. See https://spatialdata.scverse.org/en/stable/generated/spatialdata.transformations.get_transformation.html
- Bonus: Now extract the transformation from the labels layer "segmentation_mask" and from the image layer "clahe".

<details>
<summary>Click to reveal the solution</summary>

```python
from IPython.display import display
from spatialdata.transformations import get_transformation

display(get_transformation(sdata["transcripts"]))
display(get_transformation(sdata["segmentation_mask"]))
display(get_transformation(sdata["clahe"]))
```

</details>

<b>Exercise</b>:

- Visualize the points layer and the labels layer using napari-spatialdata. Convince yourself they are registered.

In [None]:
# Interactive(sdata)

### 4.2 Visualizing gene expression

In [None]:
# Plot the expression of the Axl gene using hp.pl.plot_shapes()
hp.pl.plot_shapes(
    sdata,
    img_layer="clahe",
    shapes_layer="segmentation_mask_boundaries",
    figsize=(5,5),
    crd=[2000, 4000, 2000, 4000],
    table_layer="table_transcriptomics",
    column="Axl",
)

# NOTE: In Harpy/SpatialData there is a connection between tables, shapes and labels via the region_key and the cell id, which allows us to plot a certain column of a table spatially.

In [None]:
# Plot the expression of the Axl gene using spatialdata-plot
import matplotlib.pyplot as plt

plt.figure(figsize=(5, 5))
ax = plt.gca()

gene_name = "Axl"
sdata.pl.render_labels("segmentation_mask", color=gene_name, method="datashader", fill_alpha=0.5).pl.show(
    coordinate_systems="global", ax=ax
)

In [None]:
# Explore gene expression interactively using napari-spatialdata

#Interactive(sdata)

<b>Exercise</b>:

- Use `hp.pl.plot_shapes` to plot the expression of some other genes that are in the dataset.

<details>
<summary>Click to reveal the solution</summary>

```python
display(sdata["table_transcriptomics"].var)

hp.pl.plot_shapes(
    sdata,
    img_layer="clahe",
    shapes_layer="segmentation_mask_boundaries",
    figsize=(5,5),
    crd=[2000, 4000, 2000, 4000],
    table_layer="table_transcriptomics",
    column="Vwf",
)
```

</details>

<b>Exercise</b>:

Use `napari-spatialdata` to visualize the gene expression of the gene `Axl`.

In [None]:
# Interactive(sdata)

### 4.3 Transcript quality
After we have created the AnnData object, we want to check the transcript quality. First, we create a plot to check whether the transcript density is similar across the whole tissue. If this isn't the case, there can be multiple biological or technical reasons. Note that gene panel choices can also have an influence.
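
One simple way to think about a transcript density image is as a 2D histogram of the transcript coordinates. The sketch below uses NumPy on randomly generated, hypothetical coordinates. The actual Harpy function may bin, smooth and scale the result differently.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical transcript coordinates in pixels for a 1000 x 1000 image.
x = rng.uniform(0, 1000, size=5000)
y = rng.uniform(0, 1000, size=5000)

# Count transcripts in 100 x 100 pixel bins; the first axis of the result is y.
bin_size = 100
edges = np.arange(0, 1000 + bin_size, bin_size)
density, _, _ = np.histogram2d(y, x, bins=(edges, edges))

print(density.shape)       # (10, 10)
print(int(density.sum()))  # 5000
```

Regions with a much lower count than their surroundings are the ones worth inspecting more closely.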

In [None]:
# Create transcript density image
sdata = hp.im.transcript_density(
    sdata,
    img_layer="clahe", # The layer of the SpatialData object used for determining image boundary.
    points_layer="transcripts", # The layer name that contains the transcript data points, by default "transcripts".
    output_layer="transcript_density", # The name of the output image layer
    overwrite=True,
)

In [None]:
# Plot transcript density
hp.pl.plot_image(sdata, img_layer = ["clahe", "transcript_density"], figsize=(10,10))

In [None]:
# Check number of transcripts
print('Number of transcripts in points layer: ', len(sdata.points["transcripts"]))
print('Number of transcripts assigned to cells: ', sdata.tables["table_transcriptomics"].X.sum())
print('Percentage of transcripts kept: ', ((sdata.tables["table_transcriptomics"].X.sum())/len(sdata.points["transcripts"]))*100)

# NOTE: Only a fraction of transcripts are assigned to cells.

In [None]:
# Check number of genes
print('Number of genes in points layer: ', sdata.points['transcripts'].compute()['gene'].nunique())
print('Number of genes found in cells: ', len(sdata.tables["table_transcriptomics"].var.index))

# NOTE: In general, we don't want to lose any genes, but this may happen if they have a low abundance.

In [None]:
# Check which genes are not found in cells
genes_not_found_in_cells = set(sdata.points['transcripts'].compute()['gene'].unique()) - set(sdata.tables["table_transcriptomics"].var.index)

print("Number of genes not found in cells: ", len(genes_not_found_in_cells))
print("Genes not found in cells:", genes_not_found_in_cells)


In [None]:
# Analyse and visualize the proportion of transcripts that could not be assigned to a cell during allocation step.

df = hp.pl.analyse_genes_left_out(
    sdata,
    labels_layer="segmentation_mask",
    table_layer="table_transcriptomics",
    points_layer="transcripts",
)

# NOTE: In general we see a downward trend. The more transcripts are measured for a gene, the lower the proportion of them that is located in cells.
# NOTE: The function also prints the ten genes with the highest proportion of transcripts filtered out. If a lot of these genes are markers for the same cell type, you will want to find out why this is happening (bad staining, large cell body compared to nucleus, etc.)

In [None]:
# Inspect analyse_genes_left_out() output table
df.sort_values(by="proportion_kept", ascending=True)

## 5. Processing the AnnData table

### 5.1 Filtering and Normalization

The next steps are performed to further process the AnnData object:

- QC metrics are calculated.
- Filtering: cells with fewer than a certain number of counts (e.g. 10) and genes occurring in fewer than a certain number of cells (e.g. 5) are filtered out.
- Normalization: for small gene panels (<500), we recommend normalizing the data based on the size of the segmented object (`size_norm=True`). For transcriptome-wide methods, we recommend library size normalization based on the total expression (`size_norm=False`).
- log1p-transformation of the expression data (y=ln(1+x)).
- Scale data to unit variance and zero mean. The scaling is capped at `max_value_scale`.
- PCA calculation
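
The filtering, normalization, log-transformation and scaling steps can be sketched with NumPy on a toy count matrix. The sketch makes some assumptions: size normalization is written as counts per 100 pixels of segmented area, only the upper side of the scaled values is capped, and the QC metrics and PCA are left out. All numbers are hypothetical.

```python
import numpy as np

# Toy count matrix (cells x genes) and cell sizes in pixels; all values are hypothetical.
counts = np.array([
    [12, 0, 3],
    [4, 9, 7],
    [1, 0, 2],  # only 3 counts in total: removed by min_counts
    [30, 2, 8],
    [6, 5, 4],
], dtype=float)
sizes = np.array([1500.0, 900.0, 400.0, 2500.0, 1100.0])

min_counts, min_cells, max_value_scale = 10, 2, 10

# Filtering: drop cells with too few counts, then genes found in too few cells.
keep_cells = counts.sum(axis=1) >= min_counts
counts, sizes = counts[keep_cells], sizes[keep_cells]
keep_genes = (counts > 0).sum(axis=0) >= min_cells
counts = counts[:, keep_genes]

# Size normalization (assumed form): counts per 100 pixels of segmented area.
normalized = counts * 100 / sizes[:, None]

# log1p-transformation: y = ln(1 + x).
logged = np.log1p(normalized)

# Scale each gene to zero mean and unit variance, then cap large values.
scaled = (logged - logged.mean(axis=0)) / logged.std(axis=0, ddof=1)
scaled = np.minimum(scaled, max_value_scale)

print(scaled.shape)  # (4, 3)
```

After these steps every gene has a mean close to 0 and a standard deviation close to 1, which is what the checks on the preprocessed table later in this section verify.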


The last QC plot shows the size of the nucleus in relation to the counts. When working with whole cells, very large cells with very low counts are probably not real cells, and you should filter based on the maximum size.

In [None]:
# Perform preprocessing.
sdata = hp.tb.preprocess_transcriptomics(
    sdata,
    labels_layer="segmentation_mask",
    table_layer="table_transcriptomics",
    output_layer="table_transcriptomics_preprocessed", # write results to a new slot; we could also write to the same slot (when passing overwrite=True).
    min_counts=10,
    min_cells=5,
    size_norm=True,
    highly_variable_genes=False,  # If True, will only retain highly variable genes. This can be used for transcriptome-wide methods.
    max_value_scale=10, # The maximum value to which data will be scaled
    n_comps=50, # Number of principal components to calculate.
    overwrite=True,
    update_shapes_layers=False,
)

In [None]:
# Inspect preprocessed table
sdata.tables[ "table_transcriptomics_preprocessed" ]

In [None]:
# Inspect expression values
sdata.tables["table_transcriptomics_preprocessed"].to_df().head()

In [None]:
# Check mean expression values per gene
sdata.tables["table_transcriptomics_preprocessed"].to_df().mean(axis=0).head() # mean ~ 0

In [None]:
# Check standard deviation of expression values per gene
sdata.tables["table_transcriptomics_preprocessed"].to_df().std(axis=0).head() # std ~ 1

In [None]:
# Check max expression value per gene
sdata.tables["table_transcriptomics_preprocessed"].to_df().max(axis=0).head() # max ~ 10

In [None]:
# Inspect obs of preprocessed table
sdata.tables["table_transcriptomics_preprocessed"].obs.head()

# n_genes_by_counts: The number of genes with at least 1 count in a cell
# log1p_n_genes_by_counts: log1p-transformed n_genes_by_counts
# total_counts: Total number of counts for a cell
# log1p_total_counts: log1p-transformed total_counts
# pct_counts_in_top_2_genes: The percentage of the total gene expression in each cell that comes from the top 2 most highly expressed genes in that cell
# pct_counts_in_top_5_genes: The percentage of the total gene expression in each cell that comes from the top 5 most highly expressed genes in that cell 
# n_counts: Number of counts in a cell
# shapeSize: Area of cell (in pixels)

In [None]:
# Check sum of transcript counts
(sdata.tables["table_transcriptomics"].to_df()).sum(axis=1).head()

In [None]:
# Check number of genes
(sdata.tables["table_transcriptomics"].to_df()>0).sum(axis = 1).head()

In [None]:
# Inspect var of preprocessed table
sdata.tables["table_transcriptomics_preprocessed"].var.head()

# n_cells_by_counts: Number of cells this gene is found in
# mean_counts: Mean counts over all cells
# log1p_mean_counts: log1p of mean_counts
# pct_drop_by_counts: Percentage of cells this gene does not appear in
# total_counts: Total number of counts for a gene
# log1p_total_counts: log1p of total_counts
# n_cells: Number of cells this gene is found in
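# mean: Mean expression of the gene before scaling
# std: Standard deviation of the expression of the gene before scaling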

In [None]:
# Plot preprocessing QC plots
hp.pl.preprocess_transcriptomics(
    sdata,
    table_layer="table_transcriptomics_preprocessed",
)

In [None]:
# Plot total counts
hp.pl.plot_shapes(
    sdata,
    img_layer="clahe",
    table_layer="table_transcriptomics_preprocessed",
    column="total_counts",
    shapes_layer="segmentation_mask_boundaries",
    crd=[2000, 4000, 2000, 4000],
    figsize=(8,8)
)

In [None]:
# Filter cells on size
sdata = hp.tb.filter_on_size(
    sdata,
    labels_layer="segmentation_mask",
    table_layer="table_transcriptomics_preprocessed",
    output_layer="table_transcriptomics_filter",
    min_size=500, # Minimum cell size
    max_size=100000, # Maximum cell size
    update_shapes_layers=False,
    overwrite=True,
)

In [None]:
# Check which cells have been removed
hp.pl.plot_shapes(
    sdata,
    img_layer="clahe",
    table_layer="table_transcriptomics_filter",
    column="total_counts",
    shapes_layer="segmentation_mask_boundaries",
    crd=[2000, 4000, 2000, 4000],
    figsize=(8,8)
)

In [None]:
# Explore results interactively

#Interactive( sdata )

### 5.2 Clustering

This function performs the neighborhood analysis, the Leiden clustering and the UMAP calculation using standard scanpy functions.

You need to define the following parameters:
- The number of PCs used: between 15 and 20 is a good starting point (based on the plot of PCs).
- The number of neighbors used: 35 is generally a good value. In general, fewer neighbors means more spread, while more neighbors means everything is tighter.
- The cluster resolution.

It produces the UMAP and a marker gene list per cluster, which can be inspected to identify cell types.

In [None]:
import scanpy as sc

# Leiden clustering
sdata = hp.tb.leiden(
    sdata,
    labels_layer="segmentation_mask",
    table_layer="table_transcriptomics_filter",
    output_layer="table_transcriptomics_clustered",
    calculate_umap=True,
    calculate_neighbors=True,
    n_pcs=17, # The number of principal components to use when calculating neighbors.
    n_neighbors=35, # The number of neighbors to consider when calculating neighbors.
    resolution=0.8,
    rank_genes=True,
    key_added="leiden",
    overwrite=True,
)

# Plot UMAP
sc.pl.umap(sdata.tables["table_transcriptomics_clustered"], color=["leiden"], show=True)

In [None]:
sc.pl.rank_genes_groups(sdata.tables["table_transcriptomics_clustered"], n_genes=8, sharey=False, show=True)

In [None]:
# Plot clusters spatially
hp.pl.plot_shapes(
    sdata,
    img_layer="clahe",
    table_layer="table_transcriptomics_clustered",
    column="leiden",
    shapes_layer="segmentation_mask_boundaries",
    alpha=1.0,
    linewidth=0,
    crd=[2000, 4000, 2000, 4000]
)

<b>Exercise</b>:

Change the parameters of `hp.tb.leiden`. What do you observe?

In [None]:
#from napari_spatialdata import Interactive

#del sdata.tables["table_transcriptomics_clustered"].uns["leiden_colors"]
#Interactive(sdata)

In [None]:
import matplotlib.pyplot as plt

# For fun, also plot via spatialdata-plot
plt.figure(figsize=(5, 5))
ax = plt.gca()

column = "leiden"

adata = sdata.tables[ "table_transcriptomics_clustered" ]

#cmap = matplotlib.colors.LinearSegmentedColormap.from_list(
#                    "new_map",
#                    adata.uns[column + "_colors"],
#                    N=len(adata.uns[column + "_colors"]),
#                )

sdata_small = sdata.query.bounding_box(
    min_coordinate=[2000, 2000], max_coordinate=[4000, 4000], axes=("x", "y"), target_coordinate_system="global"
)

sdata_small.pl.render_labels("segmentation_mask", color=column, cmap=None, method="datashader", fill_alpha=1).pl.show(
    coordinate_systems="global", ax=ax
)

### 5.3 Cell type annotation

Next, we use a marker gene list and score cells for each cell type using those markers via scanpy's `sc.tl.score_genes` function.
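
As a rough intuition for what such a score measures: `sc.tl.score_genes` computes the average expression of a gene set minus the average expression of a reference set of genes. The sketch below is a deliberately simplified version in which the reference set is simply all other genes, whereas scanpy samples the reference from expression-matched bins. The gene names, expression values and the rule of assigning the highest-scoring cell type are hypothetical and only illustrate the idea.

```python
from statistics import mean


def score_cell(expression, markers):
    """Mean expression of the marker genes minus the mean of all other genes."""
    marker_values = [value for gene, value in expression.items() if gene in markers]
    reference_values = [value for gene, value in expression.items() if gene not in markers]
    return mean(marker_values) - mean(reference_values)


def annotate(expression, marker_sets):
    """Score one cell for every cell type and return the best label with all scores."""
    scores = {cell_type: score_cell(expression, genes) for cell_type, genes in marker_sets.items()}
    return max(scores, key=scores.get), scores


# Toy marker list and one toy cell (all values are made up for illustration)
marker_sets = {
    "Hepatocyte": {"Alb", "Apoa1"},
    "Endothelial": {"Pecam1", "Kdr"},
}
cell = {"Alb": 9.0, "Apoa1": 7.0, "Pecam1": 1.0, "Kdr": 0.0, "Actb": 4.0}

label, scores = annotate(cell, marker_sets)
print(label, scores)
```

A positive score means the markers are expressed above the background of the cell, a negative score means they are expressed below it.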

In [None]:
import pandas as pd

# Download annotation file from registry
path_mg = registry.fetch("transcriptomics/resolve/mouse/markerGeneListMartinNoLow.csv")

# Inspect annotation file containing markers
display(pd.read_csv(path_mg).head()) # This is a one-hot encoded matrix with cell types listed in the first row and marker genes in the first column.

In [None]:
# Annotate cells
sdata, celltypes_scored, celltypes_all = hp.tb.score_genes(
    sdata,
    labels_layer="segmentation_mask",
    table_layer="table_transcriptomics_clustered",
    output_layer="table_transcriptomics_score_genes",
    path_marker_genes=path_mg, # Path to annotation file
    overwrite=True,
)

In [None]:
# Inspect new table layer
sdata["table_transcriptomics_score_genes"]

In [None]:
# Inspect new table layer obs
sdata.tables["table_transcriptomics_score_genes"].obs.head()

In [None]:
# Plot cell type annotations on UMAP
sc.pl.umap(sdata.tables["table_transcriptomics_score_genes"], color="annotation")

In [None]:
# Plot cell type annotations spatially
hp.pl.plot_shapes(
    sdata,
    column="annotation",
    img_layer="clahe",
    table_layer= "table_transcriptomics_score_genes",
    shapes_layer="segmentation_mask_boundaries",
    linewidth=0,
    alpha=0.7,
    crd=[2000, 4000, 2000, 4000]
)

### 5.4 Squidpy

In [None]:
# Try calculating spatial neighbors using Squidpy
import squidpy as sq

sq.gr.spatial_neighbors(
    adata=sdata["table_transcriptomics_score_genes"], 
    coord_type="generic", # Set to 'generic' for targeted spatial transcriptomics
    n_neighs=6, # Only used when delaunay = False
    radius=None, # To compute the neighbors based on the radius
    delaunay=False, # Whether to compute the graph from Delaunay triangulation
)

sdata.tables["table_transcriptomics_score_genes"]

In [None]:
# BUT, this is not yet backed to the zarr store!
from spatialdata import read_zarr

sdata = read_zarr(sdata.path)

sdata.tables["table_transcriptomics_score_genes"]

# NOTE: .uns["spatial_neighbors"], .obsp["spatial_connectivities"] and .obsp["spatial_distances"] are no longer in table!

In [None]:
# Let's try calculating the spatial neighbors again, but we'll make sure the new table is backed to the zarr store by using hp.tb.add_table_layer().
from harpy.utils._keys import _REGION_KEY

sq.gr.spatial_neighbors(
    adata=sdata["table_transcriptomics_score_genes"], 
    coord_type="generic", # Set to 'generic' for targeted spatial transcriptomics
    n_neighs=6, # Only used when delaunay = False
    radius=None, # To compute the neighbors based on the radius
    delaunay=False, # Whether to compute the graph from Delaunay triangulation
)

region = sdata["table_transcriptomics_score_genes"].obs[_REGION_KEY].cat.categories.to_list()

sdata = hp.tb.add_table_layer(
    sdata,
    adata=sdata.tables["table_transcriptomics_score_genes"],
    output_layer="table_transcriptomics_squidpy",
    region=region, # A list of regions to associate with the table data. Typically this is all unique elements in adata.obs[_REGION_KEY].
    overwrite=True,
)

In [None]:
# Inspect spatial connectivities of the first 10 rows and columns
sdata.tables['table_transcriptomics_squidpy'].obsp['spatial_connectivities'].toarray()[0:10,0:10]

In [None]:
# Inspect number of neighbors (for first 10 cells)
sdata.tables['table_transcriptomics_squidpy'].obsp['spatial_connectivities'].toarray().sum(axis=1)[0:10]

# NOTE: Every cell has exactly 6 neighbors when using n_neighs=6

In [None]:
# Inspect for every cell how many cells have it as a neighbor (for first 10 cells)
sdata.tables['table_transcriptomics_squidpy'].obsp['spatial_connectivities'].toarray().sum(axis=0)[0:10]

# NOTE: Not every cell is a neighbor of exactly 6 cells when using n_neighs=6
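
The difference between row sums and column sums follows from the fact that "being among the k nearest neighbors" is not a symmetric relation. The pure-Python sketch below builds a k-nearest-neighbor adjacency matrix for four toy points to show this; `knn_adjacency` is a hypothetical helper written for illustration and is not how Squidpy builds its graph internally.

```python
import math


def knn_adjacency(points, k):
    """Row i has a 1 in column j when point j is among the k nearest neighbors of point i."""
    n = len(points)
    adjacency = [[0] * n for _ in range(n)]
    for i, point in enumerate(points):
        others = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: math.dist(point, points[j]),
        )
        for j in others[:k]:
            adjacency[i][j] = 1
    return adjacency


# Three points close together and one isolated point far away
points = [(0, 0), (1, 0), (2.5, 0), (10, 0)]
adj = knn_adjacency(points, k=1)

out_degree = [sum(row) for row in adj]        # neighbors each point has (row sums)
in_degree = [sum(col) for col in zip(*adj)]   # how often each point is chosen (column sums)
print(out_degree, in_degree)
```

Every point has exactly one neighbor, but the isolated point is nobody's nearest neighbor, while the second point is chosen twice.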

In [None]:
import matplotlib.pyplot as plt

# Access the spatial connectivities matrix
matrix = sdata.tables['table_transcriptomics_squidpy'].obsp['spatial_connectivities']

# Create the plot
plt.figure(figsize=(12, 10), dpi=300)
plt.imshow(matrix.toarray(), cmap='gray_r')
plt.colorbar()  
plt.title("Spatial Connectivities", fontsize=18)  
plt.show()

In [None]:
# Inspect spatial distances
sdata.tables['table_transcriptomics_squidpy'].obsp['spatial_distances'].toarray()[0:4,0:4]

<b>Exercise</b>:

- Build a graph of spatial neighbors using a radius of 100 pixels. Inspect the neighbor relationships in both directions. How does this compare to the results for the 6-nearest-neighbors spatial graph? Try the same for Delaunay triangulation.

<details>
<summary>Click to reveal the solution</summary>

```python
# radius-based spatial graphs
from harpy.utils._keys import _REGION_KEY

sq.gr.spatial_neighbors(
    adata=sdata["table_transcriptomics_score_genes"], 
    coord_type="generic",
    radius=100,
    delaunay=False,
)

region = sdata["table_transcriptomics_score_genes"].obs[_REGION_KEY].cat.categories.to_list()

sdata = hp.tb.add_table_layer(
    sdata,
    adata=sdata.tables["table_transcriptomics_score_genes"],
    output_layer="table_transcriptomics_squidpy",
    region=region,
    overwrite=True,
)

print('Inspect spatial connectivities of first 10 rows and columns:')
display(sdata.tables['table_transcriptomics_squidpy'].obsp['spatial_connectivities'].toarray()[0:10,0:10])

print('Inspect number of neighbors (for first 10 cells):')
display(sdata.tables['table_transcriptomics_squidpy'].obsp['spatial_connectivities'].toarray().sum(axis=1)[0:10])

print('Inspect for every cell how many cells have it as a neighbor (for first 10 cells):')
display(sdata.tables['table_transcriptomics_squidpy'].obsp['spatial_connectivities'].toarray().sum(axis=0)[0:10])

print('Plot spatial connectivities:')
import matplotlib.pyplot as plt
matrix = sdata.tables['table_transcriptomics_squidpy'].obsp['spatial_connectivities']

plt.figure(figsize=(12, 10), dpi=300)
plt.imshow(matrix.toarray(), cmap='gray_r')
plt.colorbar()  
plt.title("Spatial Connectivities", fontsize=18)  
plt.show()

# Delaunay triangulation
from harpy.utils._keys import _REGION_KEY

sq.gr.spatial_neighbors(
    adata=sdata["table_transcriptomics_score_genes"], 
    coord_type="generic",
    delaunay=True,
)

region = sdata["table_transcriptomics_score_genes"].obs[_REGION_KEY].cat.categories.to_list()

sdata = hp.tb.add_table_layer(
    sdata,
    adata=sdata.tables["table_transcriptomics_score_genes"],
    output_layer="table_transcriptomics_squidpy",
    region=region,
    overwrite=True,
)

print('Inspect spatial connectivities of first 10 rows and columns:')
display(sdata.tables['table_transcriptomics_squidpy'].obsp['spatial_connectivities'].toarray()[0:10,0:10])

print('Inspect number of neighbors (for first 10 cells):')
display(sdata.tables['table_transcriptomics_squidpy'].obsp['spatial_connectivities'].toarray().sum(axis=1)[0:10])

print('Inspect for every cell how many cells have it as a neighbor (for first 10 cells):')
display(sdata.tables['table_transcriptomics_squidpy'].obsp['spatial_connectivities'].toarray().sum(axis=0)[0:10])

print('Plot spatial connectivities:')
import matplotlib.pyplot as plt
matrix = sdata.tables['table_transcriptomics_squidpy'].obsp['spatial_connectivities']

plt.figure(figsize=(12, 10), dpi=300)
plt.imshow(matrix.toarray(), cmap='gray_r')
plt.colorbar()  
plt.title("Spatial Connectivities", fontsize=18)  
plt.show()
```

</details>

In [None]:
# Calculate neighborhood enrichment
sdata = hp.tb.nhood_enrichment(
    sdata, 
    labels_layer="segmentation_mask", 
    table_layer="table_transcriptomics_squidpy", 
    output_layer="table_transcriptomics_squidpy", 
    celltype_column = "annotation",
    overwrite=True
)

# Plot neighborhood enrichment
hp.pl.nhood_enrichment(
    sdata, 
    table_layer="table_transcriptomics_squidpy",
)

# Write the updated table layer back to the zarr store
sdata = hp.tb.add_table_layer(
    sdata,
    adata=sdata.tables["table_transcriptomics_squidpy"],
    output_layer="table_transcriptomics_squidpy",
    region=region,
    overwrite=True,
)
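
Conceptually, a neighborhood enrichment score compares how often two cell types are connected in the spatial graph with how often that happens when the cell type labels are shuffled at random, and reports the difference as a z-score. The sketch below illustrates this permutation idea in pure Python on a toy chain of eight cells. The helpers `count_pairs` and `enrichment_z`, the number of permutations and the seed are hypothetical choices for illustration and do not reproduce the Harpy or Squidpy implementation.

```python
import random
from statistics import mean, pstdev


def count_pairs(edges, labels, a, b):
    """Count undirected edges that join one cell of type `a` and one cell of type `b`."""
    return sum(1 for i, j in edges if {labels[i], labels[j]} == {a, b})


def enrichment_z(edges, labels, a, b, n_perms=200, seed=0):
    """Z-score of the observed a-b edge count against counts under shuffled labels."""
    rng = random.Random(seed)
    observed = count_pairs(edges, labels, a, b)
    shuffled = list(labels)
    null_counts = []
    for _ in range(n_perms):
        rng.shuffle(shuffled)
        null_counts.append(count_pairs(edges, shuffled, a, b))
    spread = pstdev(null_counts)
    return (observed - mean(null_counts)) / spread if spread > 0 else 0.0


# Toy tissue: eight cells in a row, each connected to the next one
labels = ["A", "A", "A", "A", "B", "B", "B", "B"]
edges = [(i, i + 1) for i in range(len(labels) - 1)]

z_same = enrichment_z(edges, labels, "A", "A")
z_cross = enrichment_z(edges, labels, "A", "B")
print(z_same, z_cross)
```

Because the two cell types are spatially segregated in this toy example, A-A contacts are enriched (positive z-score) and A-B contacts are depleted (negative z-score).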

In [None]:
# Calculate Moran’s I global spatial auto-correlation statistics
sq.gr.spatial_autocorr(
    adata=sdata.tables["table_transcriptomics_squidpy"],
    mode="moran",
    n_perms=100,
    n_jobs=1,
)

# Write the updated table layer back to the zarr store
sdata = hp.tb.add_table_layer(
    sdata,
    adata=sdata.tables["table_transcriptomics_squidpy"],
    output_layer="table_transcriptomics_squidpy",
    region=region,
    overwrite=True,
)

In [None]:
# Inspect highest Moran's I scores
sdata.tables["table_transcriptomics_squidpy"].uns["moranI"].head(10)

In [None]:
# Inspect lowest Moran's I scores
sdata.tables["table_transcriptomics_squidpy"].uns["moranI"].tail(10)
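
To help interpret these scores: global Moran's I is the weighted sum of products of deviations from the mean over all neighbor pairs, scaled by the number of observations divided by the total weight, and divided by the sum of squared deviations. Values close to 1 indicate that similar values cluster together in space, values around 0 indicate no spatial structure, and negative values indicate that neighbors tend to differ. The sketch below computes the statistic for two toy patterns on a chain of four cells; `morans_i` is a hypothetical helper for illustration and skips the permutation-based p-values that Squidpy also reports.

```python
def morans_i(values, weights):
    """Global Moran's I for `values` given a square spatial weight matrix `weights`."""
    n = len(values)
    mean_value = sum(values) / n
    deviations = [value - mean_value for value in values]
    total_weight = sum(sum(row) for row in weights)
    cross_products = sum(
        weights[i][j] * deviations[i] * deviations[j]
        for i in range(n)
        for j in range(n)
    )
    squared_deviations = sum(d * d for d in deviations)
    return (n / total_weight) * cross_products / squared_deviations


# Four cells in a row; each cell is connected to its direct neighbors
chain = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]

clustered = morans_i([1, 1, 0, 0], chain)    # similar values sit next to each other
alternating = morans_i([1, 0, 1, 0], chain)  # neighbors always differ
print(clustered, alternating)
```

The clustered pattern gives a positive score and the alternating pattern gives a negative one, which mirrors how genes at the top and bottom of the `moranI` table behave.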

In [None]:
#Interactive(sdata)

### 5.5 Region annotation

In [None]:
# Region annotation in Napari

# from napari_spatialdata import Interactive
# Interactive(sdata)

# NOTE: - In napari, create a new shapes layer, annotate a region of interest, save to sdata using Shift + E and close Napari.
#       - For more info, see: https://spatialdata.scverse.org/en/latest/tutorials/notebooks/notebooks/examples/napari_rois.html 
#       - Currently, there is only support for saving rectangles, polygons and points.

In [None]:
# Let's check the sdata object to see whether the layer was correctly added
sdata

In [None]:
hp.pl.plot_shapes(
    sdata,
    img_layer="clahe",
    shapes_layer="segmentation_mask_boundaries",
    crd=[2000, 4000, 2000, 4000],
    figsize=(5,5)
)

In [None]:
# This cell is only needed for the unit tests to pass
from spatialdata.models import ShapesModel

if unit_testing:
    import geopandas as gpd
    from shapely.geometry import box

    # Define the rectangle boundaries
    x_min, y_min, x_max, y_max = 2250, 2250, 3000, 3000

    # Create a Shapely box (rectangle)
    rectangle = box(x_min, y_min, x_max, y_max)

    # Create a GeoDataFrame
    polygons= gpd.GeoDataFrame({'geometry': [rectangle]})
    polygons=ShapesModel.parse( polygons )
    sdata.shapes[ "region_annotation" ] = polygons
    sdata.write_element( element_name="region_annotation" )

In [None]:
if not unit_testing:
    # We need to make sure the shapes layer is backed to zarr
    sdata.write_element(element_name='region_annotation')
    sdata = read_zarr(sdata.path)

In [None]:
# Note that we can also import a GeoJSON from another source (e.g. QuPath)

# import geopandas as gpd
# gdf_regions = gpd.read_file(path_to_GeoJSON)
# sdata = hp.sh.add_shapes_layer(sdata, input=gdf_regions, output_layer='region_annotation', overwrite=True)

In [None]:
from shapely.geometry import Point

# Get spatial coordinates
spatial_coords = sdata.tables['table_transcriptomics_squidpy'].obsm["spatial"]
spatial_coords_df = pd.DataFrame(spatial_coords, columns=["x", "y"], index=sdata.tables['table_transcriptomics_squidpy'].obs.index)

# Define function to assign region annotations to cells
def assign_region(centroid, gdf):
    for index, row in gdf.iterrows():
        if Point(centroid).within(row["geometry"]):
            return "Yes"
    return "No"  

# Create new column in obs to check if cells are in region
sdata.tables['table_transcriptomics_squidpy'].obs["in_region"] = spatial_coords_df.apply(lambda row: assign_region((row["x"], row["y"]), sdata.shapes["region_annotation"]), axis=1)

# Write the updated table layer back to the zarr store
sdata = hp.tb.add_table_layer(
    sdata,
    adata=sdata.tables["table_transcriptomics_squidpy"],
    output_layer="table_transcriptomics_squidpy",
    region=region,
    overwrite=True,
)

In [None]:
# Check obs
sdata.tables['table_transcriptomics_squidpy'].obs.head()

In [None]:
# Plot region annotation shapes layer
hp.pl.plot_shapes(
    sdata, 
    img_layer='clahe', 
    shapes_layer='region_annotation',
    alpha=0.5,
    crd=[2000, 4000, 2000, 4000]
)

In [None]:
# Plot cells colored according to in_region column
hp.pl.plot_shapes(
    sdata,
    column="in_region",
    img_layer="clahe",
    table_layer="table_transcriptomics_squidpy",
    shapes_layer="segmentation_mask_boundaries",
    linewidth=0,
    alpha=0.7,
    cmap="rainbow",
    crd=[2000, 4000, 2000, 4000]
)

### 5.6 TissUUmaps

TissUUmaps is a visualization tool for interactive exploration of your spatial data. It can visualize data from an AnnData .h5ad file or from a CSV file. You can also simultaneously visualize images (multiple file types, including tiff) and regions (GeoJSON).

It can be installed using this link: https://tissuumaps.github.io/installation/ \
Documentation can be found here: https://tissuumaps.github.io/TissUUmaps-docs/ 

In [None]:
# Export the table, the region annotation and the image for use in TissUUmaps
from skimage.io import imsave
import numpy as np

if not unit_testing:

    # Save AnnData as h5ad
    sdata.tables["table_transcriptomics_squidpy"].write(os.path.join(OUTPUT_DIR, 'adata.h5ad'))

    # Export shapes layer as GeoJSON
    sdata.shapes['region_annotation'].to_file(os.path.join(OUTPUT_DIR, "region_annotation.geojson"), driver="GeoJSON")

    # Export image as tiff
    img = sdata.images['clahe'].data.compute()
    imsave(os.path.join(OUTPUT_DIR, "clahe.tiff"), img)