# SPArrOW pipeline

In this notebook, we will demonstrate how to use the SPArrOW pipeline to analyze targeted spatial transcriptomics data using the raw data from a Molecular Cartography ([Resolve Biosciences](https://resolvebiosciences.com/)) mouse liver WT dataset.

When you make use of the SPArrOW pipeline tools, please cite [Pollaris et al. (2024)](https://www.biorxiv.org/content/biorxiv/early/2024/07/06/2024.07.04.601829.full.pdf).

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
import harpy as hp

## 1. Read in the data

The dataset will be downloaded and cached using `pooch` via `harpy.datasets.registry`.

In [None]:
import tempfile
from harpy.datasets.registry import get_registry

unit_testing = True # Set to False during training

# If path is set to None, the example data will be downloaded to the default cache folder of your OS. Set this to a custom path to change this behaviour.
path = None
# path = r"c:\tmp" # Recommended on Windows
# path = "/staging/leuven/stg_00143/spatial_data_training" # e.g. on HPC

registry = get_registry(path = path)
path_image = registry.fetch("transcriptomics/resolve/mouse/20272_slide1_A1-1_DAPI.tiff")
path_coordinates = registry.fetch("transcriptomics/resolve/mouse/20272_slide1_A1-1_results.txt")

In [None]:
# The OUTPUT_DIR is the directory where the SpatialData .zarr will be saved. Change it to your output directory of choice.
OUTPUT_DIR = tempfile.gettempdir()

# OUTPUT_DIR = "/staging/leuven/stg_00143/spatial_data_training/output_dir" # e.g. on HPC

In [None]:
from dask_image.imread import imread

# The DAPI image is read using dask image
img = imread(path_image)

# We print the image dimensions
print('Image dimensions: ', img.shape)

img

In [None]:
import os
import uuid
from spatialdata import SpatialData, read_zarr

# Create an empty SpatialData object
sdata = SpatialData()

# Set the path for the SpatialData .zarr
zarr_path = os.path.join(OUTPUT_DIR, f"sdata_{uuid.uuid4()}.zarr")

# Write the SpatialData to Zarr
sdata.write(zarr_path)

In [None]:
# Reload the Zarr data back as a SpatialData
sdata = read_zarr(sdata.path)

# Check if SpatialData is backed (i.e. stored on disk)
sdata.is_backed()

In [None]:
# We add the DAPI image to the SpatialData object
sdata = hp.im.add_image_layer(
    sdata, # The SpatialData object to which the new image layer will be added.
    arr = img, # The array containing the image data to be added.
    dims = ( "c", "y", "x" ), # A tuple specifying the dimensions of the image data
    output_layer = "raw_image", # The name of the output layer where the image data will be stored.
    overwrite = True,
)

In [None]:
# We can access the DAPI image like this:
sdata["raw_image"] # Or, alternatively: sdata.images["raw_image"]

In [None]:
# Plot a crop of the DAPI image
hp.pl.plot_image(
    sdata, 
    img_layer = "raw_image" , 
    crd = [0, 6432, 0, 6432], # The coordinates for the region of interest in the format (xmin, xmax, ymin, ymax). If None, the entire image is plotted.
    figsize = (5,5),
)

In [None]:
# Or, alternatively, via spatialdata-plot:
import spatialdata_plot
sdata.pl.render_images("raw_image").pl.show()

<b>Exercise</b>:

- Use the `Harpy` function `hp.pl.plot_image` to visualize another crop (e.g.: `x_min=2000`, `x_max=4000`, `y_min=1000`, `y_max=4000`).

- Find the documentation for `hp.pl.plot_image` in the Harpy [readthedocs](https://harpy.readthedocs.io/en/latest/api.html). How would you save the plot to disk?

- Bonus: `hp.pl.plot_image` is a wrapper function around `hp.pl.plot_shapes`. Read the documentation for `hp.pl.plot_shapes`. What does the `fig_kwargs` parameter do? Can you set the `dpi` to 300 for a plot that will be saved?

<details> 
<summary>Click to reveal the solution</summary>

```python
hp.pl.plot_image(sdata, img_layer="raw_image", crd=[2000, 4000, 1000, 4000], output=f"{OUTPUT_DIR}/plot.png", fig_kwargs={"dpi": 300})
```
</details>


<b>Exercise</b>:

- Uncomment the following cell and explore the DAPI image in Napari. Try changing the contrast of the image.

In [None]:
from napari_spatialdata import Interactive

# Interactive(sdata)

<b>Exercise</b>:
- Bonus: Add DAPI as a multiscale image to the SpatialData object (tip: read the documentation).

<details>
<summary>Click to reveal the solution</summary>

```python
# Add as multiscale image
sdata=hp.im.add_image_layer(
    sdata,
    arr = img, # The Dask array that was read from the DAPI image above
    dims = ( "c", "y", "x" ),
    output_layer = "raw_image",
    scale_factors = [2, 2, 2, 2],
    overwrite = True,
)

# Now it is a DataTree
type(sdata["raw_image"])  

# Let's have a look at the dask array
from harpy.image._image import _get_spatial_element
se = _get_spatial_element(sdata, layer="raw_image")
se.data
```
</details>

## 2. Image preprocessing

### 2.1 Tiling correction and inpainting

Molecular Cartography data is acquired in tiles with uneven illumination, which can strongly influence the downstream analysis. Resolve Biosciences assured us this shouldn't impact the transcript counts, but we can check later on whether this is the case. This step is not necessary for most other imaging-based spatial transcriptomics technologies (Xenium, Merscope, ...), but you should plot the entire image to check whether your data needs this preprocessing step.

Harpy's `hp.im.tiling_correction` function can be used to correct for uneven illumination (using BaSiC on the back end). The function needs to know the size of the imaging tiles. The `tile_size` parameter defaults to the tile size of Molecular Cartography (2144).

`hp.im.tiling_correction` also corrects for the black lines between the tiles by using OpenCV's inpainting.
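To build intuition for what an illumination correction does, here is a minimal NumPy sketch on synthetic data. It is not BaSiC and not what `hp.im.tiling_correction` runs internally: it assumes every tile shares the same multiplicative shading pattern and simply estimates that pattern as the average tile. The function names `split_into_tiles` and `naive_flatfield_correction` are hypothetical, and the inpainting of the lines between tiles is not shown.

```python
import numpy as np


def split_into_tiles(image, tile_size):
    """Cut a 2D image into a stack of square tiles, shape (n_tiles, tile_size, tile_size)."""
    h, w = image.shape
    return (
        image.reshape(h // tile_size, tile_size, w // tile_size, tile_size)
        .swapaxes(1, 2)
        .reshape(-1, tile_size, tile_size)
    )


def naive_flatfield_correction(image, tile_size):
    """Estimate a shared shading pattern as the mean tile and divide it out of every tile."""
    h, w = image.shape
    tiles = split_into_tiles(image, tile_size).astype(float)
    flatfield = tiles.mean(axis=0)
    flatfield /= flatfield.mean()  # normalize so the correction preserves overall brightness
    corrected = tiles / flatfield
    rows, cols = h // tile_size, w // tile_size
    corrected = corrected.reshape(rows, cols, tile_size, tile_size).swapaxes(1, 2).reshape(h, w)
    return corrected, flatfield


# Synthetic example: a uniformly bright sample imaged as 2 x 3 tiles,
# each tile darker on the left and brighter on the right.
tile_size = 4
shading = np.tile(np.linspace(0.5, 1.5, tile_size), (tile_size, 1))
image = 100.0 * np.tile(shading, (2, 3))

corrected, flatfield = naive_flatfield_correction(image, tile_size)
print("raw range:      ", image.min(), image.max())
print("corrected range:", corrected.min(), corrected.max())
```

On real tissue the average tile also contains biological signal, which is one reason a dedicated method such as BaSiC is used instead of this naive estimate.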

In [None]:
# Performing tiling correction
sdata, flatfields = hp.im.tiling_correction(
    sdata = sdata,
    img_layer = "raw_image",
    tile_size = 2144, # This is set to 2144 by default
    output_layer = "tiling_correction",
    crd = None,
    overwrite=True,
)

# FIXME: It might be better to split up tiling_correction into two separate functions (illumination correction and inpainting) or it should be an argument in the function.

In [None]:
# Plot the raw and corrected image side-by-side
hp.pl.plot_image(sdata, img_layer=[ "raw_image", "tiling_correction" ], crd = [2000, 6000, 2000, 6000], figsize=(10,10))

### 2.2 Min-max filtering and contrast enhancement
The next preprocessing steps include:

- A min-max filter can be applied. The goal of this filter is to subtract background noise and make the borders of the nuclei/cells cleaner. It will also remove some debris. Note that if you set the size of the filter too small (smaller than the size of your nuclei), the filter will create "donuts" (black spots in the center of your cells). If the size of the min-max filter is too big, not enough background will be subtracted. Generally, you want to aim for the average nucleus size, and some fine-tuning may be necessary. For nuclei in Molecular Cartography data, 45-55 should be a good starting point.

- We also recommend performing contrast enhancement on your image. Harpy does this using histogram equalization (CLAHE). The amount of correction can be tuned with the `contrast_clip` value. If the image is already quite bright, 3.5 might be a good starting point. For dark images, you can go up to 10 or even more. Make sure that, in the end, the whole image is evenly illuminated and no cells remain dark in the background.

If you think your data needs further image processing steps, you can perform these using the `hp.im.map_image` function (see section 2.3).
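The effect of the filter size can be illustrated with a small NumPy sketch. It assumes the min-max filter is a minimum filter followed by a maximum filter whose result is subtracted from the image as a background estimate; Harpy's actual implementation may differ in details such as border handling. The helper names `_sliding_filter` and `min_max_background_subtract` are hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def _sliding_filter(image, size, reducer):
    """Apply `reducer` (np.min or np.max) over size x size windows; `size` must be odd."""
    pad = size // 2
    padded = np.pad(image, pad, mode="edge")
    windows = sliding_window_view(padded, (size, size))
    return reducer(windows, axis=(-2, -1))


def min_max_background_subtract(image, size):
    """Estimate the background as max(min(image)) and subtract it from the image."""
    background = _sliding_filter(_sliding_filter(image, size, np.min), size, np.max)
    return image - background


# Synthetic image: background of 10 with one 5 x 5 "nucleus" of intensity 50.
image = np.full((21, 21), 10)
image[8:13, 8:13] = 50

too_small = min_max_background_subtract(image, size=3)  # filter smaller than the nucleus
large_enough = min_max_background_subtract(image, size=7)  # filter larger than the nucleus

print("nucleus intensity, size=3:", too_small[10, 10])
print("nucleus intensity, size=7:", large_enough[10, 10])
```

With the filter smaller than the object, the object itself is treated as background and removed; with a filter slightly larger than the object, only the constant background is subtracted.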

In [None]:
# Perform min max filtering
sdata = hp.im.min_max_filtering(
    sdata,
    img_layer = "tiling_correction",
    output_layer = "min_max_filtered",
    size_min_max_filter = 45,
    overwrite = True,
)

# Plot the min max filtered image
hp.pl.plot_image(
    sdata,
    img_layer = "min_max_filtered",
    crd = [2000,6000,2000,6000],
    figsize = (5, 5),
)

# Perform contrast enhancement using CLAHE
sdata = hp.im.enhance_contrast(
    sdata,
    img_layer = "min_max_filtered",
    output_layer = "clahe",
    contrast_clip = 3.5,
    chunks = 20000,
    overwrite = True
)

# Plot the contrast enhanced image
hp.pl.plot_image(
    sdata,
    img_layer = "clahe",
    crd = [2000,6000,2000,6000],
    figsize = (5, 5),
)

<b>Exercise</b>:

- Change the `size_min_max_filter` parameter in `hp.im.min_max_filtering`. What do you see? Try some extreme values.
- Change the `contrast_clip` parameter in `hp.im.enhance_contrast`. What do you see? Try some extreme values.
- Try image preprocessing on a different crop.

<b>Exercise</b>:

- Uncomment the following cell and explore the preprocessed images in Napari.

In [None]:
# Interactive(sdata)

### 2.3 Custom distributed preprocessing of images using `hp.im.map_image` and `Dask`

See https://docs.dask.org/en/stable/generated/dask.array.map_blocks.html and https://docs.dask.org/en/latest/generated/dask.array.map_overlap.html for background on the Dask functions used below.

Set `blockwise=True` if you want to do distributed processing using `dask.array.map_blocks` or `dask.array.map_overlap`. Set `blockwise=False` if your function is already distributed (e.g. when using `dask_image` filters, see https://image.dask.org/en/latest/dask_image.ndfilters.html).
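Why an overlap (`depth`) matters for chunked processing can be shown without Dask. The sketch below is a NumPy-only, 1D illustration of the idea and not how Dask or Harpy implement it: a filter that looks at neighbouring samples gives wrong values at chunk borders unless each chunk is extended with a halo that is trimmed again afterwards. The function names `box_blur_1d` and `process_in_chunks` are hypothetical.

```python
import numpy as np


def box_blur_1d(row, radius):
    """Mean over a window of 2 * radius + 1 samples, with edge padding."""
    padded = np.pad(row, radius, mode="edge")
    kernel = np.ones(2 * radius + 1) / (2 * radius + 1)
    return np.convolve(padded, kernel, mode="valid")


def process_in_chunks(row, chunk, depth, radius):
    """Filter each chunk separately, extended by `depth` samples on both sides."""
    out = np.empty(len(row), dtype=float)
    for start in range(0, len(row), chunk):
        stop = min(start + chunk, len(row))
        lo, hi = max(start - depth, 0), min(stop + depth, len(row))
        result = box_blur_1d(row[lo:hi], radius)
        # Trim the halo so that only the chunk's own samples are kept.
        out[start:stop] = result[start - lo : start - lo + (stop - start)]
    return out


row = np.arange(20, dtype=float) ** 2
reference = box_blur_1d(row, radius=2)

without_overlap = process_in_chunks(row, chunk=5, depth=0, radius=2)
with_overlap = process_in_chunks(row, chunk=5, depth=2, radius=2)

print("matches without overlap:", np.allclose(without_overlap, reference))
print("matches with overlap:   ", np.allclose(with_overlap, reference))
```

The overlap has to be at least as large as the neighbourhood your function looks at; for a purely pixel-wise function, such as the multiplication in the cell below, no overlap is needed.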

In [None]:
import numpy as np
from numpy.typing import NDArray

# Define your custom function
def _my_dummy_function(image: NDArray, parameter: int | float )->NDArray:
    # input (1,1,y,x)
    # output (1,1,y,x)
    print(f"Type of the image is: {type(image)}")
    print(image.shape)
    return image*parameter

fn_kwargs = {"parameter": 2}

# Apply custom function
sdata = hp.im.map_image(
    sdata,
    func = _my_dummy_function,
    fn_kwargs = fn_kwargs,
    img_layer = "raw_image",
    output_layer="dummy_image",
    chunks = 5000,
    blockwise = False, # if blockwise == True --> input to _my_dummy_function is a numpy array of size chunks, else it is a Dask array (with chunksize chunks)
    depth = 1000, # if blockwise == True, and depth specified, will use map_overlap instead of map_blocks for distributed processing
    overwrite = True,
    dtype = np.uint16,
    meta = np.array((), dtype=np.uint16),
)

In [None]:
from harpy.image._image import _get_spatial_element

_get_spatial_element(sdata, layer="raw_image").data.compute()[ :, :10, :10 ]

In [None]:
_get_spatial_element(sdata, layer="dummy_image").data.compute()[ :, :10,:10 ]

<b>Exercise</b>:

- Adapt `_my_dummy_function` so it accepts a new parameter, `parameter_2`. Now adapt `_my_dummy_function` so the image is multiplied by (`parameter` + `parameter_2`).

<details>
<summary>Click to reveal the solution</summary>

```python
def _my_dummy_function(image: NDArray, parameter: int | float, parameter_2: int | float )->NDArray:
    # input (1,1,y,x)
    # output (1,1,y,x)
    print(f"Type of the image is: {type(image)}" )
    print(image.shape)
    return image*(parameter + parameter_2)

fn_kwargs = {"parameter": 2 , "parameter_2": 2}

sdata = hp.im.map_image(
    sdata,
    func = _my_dummy_function,
    fn_kwargs = fn_kwargs,
    img_layer = "raw_image",
    output_layer="dummy_image",
    chunks = 5000,
    blockwise = True, # if blockwise == True --> input to _my_dummy_function is a numpy array of size chunks, else it is a Dask array (with chunksize chunks)
    depth = 1000, # if blockwise == True, and depth specified, will use map_overlap instead of map_blocks for distributed processing
    overwrite = True,
    dtype = np.uint16,
    meta = np.array((), dtype=np.uint16),
)
```
</details>

<b>Exercise</b>:

- Bonus: Run the cell where `hp.im.map_image` is called in debug mode. Set a breakpoint in `_my_dummy_function`. Inspect the shape and type of `image` when you set `blockwise=True` or `blockwise=False`. Set the `depth` parameter to `100`. What do you observe?

## 3. Segmentation

### 3.1 Nucleus segmentation

To segment the nuclei, we show an example using Cellpose, a deep learning model based on a U-Net architecture.

Multiple parameters need to be passed to the Cellpose algorithm. We recommend tuning these to achieve optimal segmentation quality (see https://cellpose.readthedocs.io/en/latest/settings.html). It is often a good idea to fine-tune the parameters on a crop of the image (especially when you only have a CPU to work with).

- `diameter`: An estimate of the average nucleus diameter, given in pixels. If set to `None`, Cellpose will try to estimate the diameter, but this might take a long time and is usually far off. As a guideline, you can use approx. 7 micrometer (in this case 50 pixels at 0.138 micrometer per pixel) for a standard nucleus, but this may vary depending on your specific tissue, sample...
- `device`: Defines the device you want to work on. If you only have a CPU, you can skip this input parameter.
- `flow_threshold`: Controls how strictly the shape of the masks is checked. If you increase it, more masks with less round shapes will be accepted. Usually set between 0.6 and 0.95 (max. is 1). Lower this parameter if you start segmenting artefacts. Increase it if the segmentation misses some non-round cells.
- `cellprob_threshold`: Controls how many of the possible masks are kept. Decreasing the parameter will output more masks. Larger values will output fewer masks. Usually set between 0 and -6.
- `min_size`: Indicates the minimum size of a nucleus.
- `pretrained_model`: If segmenting whole cells instead of nuclei, set this to 'cyto'. You can do this with and without a nucleus channel. When you want to include a nucleus channel for the segmentation, make sure your image is 3D and that the first channel contains the complete cell staining and the second one the nucleus channel (set the channel parameter to `np.array([1,0])`).
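The conversion behind the `diameter` guideline above is a simple division of the physical size by the pixel size. The helper name `diameter_in_pixels` below is hypothetical and only serves as an illustration:

```python
def diameter_in_pixels(diameter_um: float, pixel_size_um: float) -> float:
    """Convert a physical diameter (micrometer) to pixels."""
    return diameter_um / pixel_size_um


# 7 micrometer nucleus at 0.138 micrometer per pixel
print(f"{diameter_in_pixels(7.0, 0.138):.1f} pixels")  # prints: 50.7 pixels
```

The notebook uses a value of 50 pixels, which is this number rounded to a convenient starting point for tuning.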

In [None]:
# first we rechunk on disk
from spatialdata.transformations import get_transformation

sdata=hp.im.add_image_layer(
    sdata,
    arr=sdata[ "clahe" ].data.rechunk( 2048 ),
    transformations=get_transformation( sdata[ "clahe" ], get_all=True ),
    output_layer = "clahe",
    overwrite=True,
)

In [None]:
"""
ADVANCED: You can set up a local Dask distributed cluster for parallel computing. Once the cluster is created, a Dask Client is used to connect to it. 
The Dask dashboard link allows you to monitor cluster performance and task progress.
"""

from dask.distributed import Client, LocalCluster

# Create a local Dask cluster
cluster = LocalCluster(
    n_workers=8,  # Number of worker processes
    threads_per_worker=1,  # Number of threads per worker
    memory_limit="32GB",  # Memory limit per worker
)

# Connect a Client to the cluster
client = Client(cluster)

# Print the Dask dashboard link
print(client.dashboard_link)

In [None]:
import torch
from cellpose import models
from harpy.image import cellpose_callable

gpu = False
device = "cpu"  # mps broken in cellpose (macOS), see https://github.com/MouseLand/cellpose/issues/1063

# Perform nucleus segmentation
sdata = hp.im.segment(
    sdata,
    img_layer="clahe", # The image layer in sdata to be segmented.
    chunks=2048, # Setting chunks=None would be equivalent to setting chunks=2048, as the chunks on disk are 2048
    depth=200,
    model=cellpose_callable,
    # Parameters that will be passed to the callable `cellpose_callable`:
    pretrained_model="nuclei", # can also be "cyto", "cyto3", or a path to a fine-tuned cellpose model.
    device=device,
    diameter=50,
    flow_threshold=0.9,
    cellprob_threshold=-4,
    output_labels_layer="segmentation_mask",
    output_shapes_layer="segmentation_mask_boundaries",
    crd=[ 2000,4000,2000,4000 ] if unit_testing else None,  # Region to segment: [x_min, x_max, y_min, y_max]
    overwrite=True,
)

# client.close() # ADVANCED: Uncomment this when using the Dask Client.
# FIXME: Can we add support for both Cellpose 3 and 4?

In [None]:
# Plot segmentation results
hp.pl.plot_shapes(sdata, img_layer="clahe", shapes_layer="segmentation_mask_boundaries", figsize=(5,5), crd = [2000, 4000, 2000, 4000])

In [None]:
# or via spatialdata-plot
sdata.pl.render_images("clahe").pl.render_labels("segmentation_mask").pl.show()

In [None]:
# spatialdata-plot does not accept crop coordinates, but we can perform a bounding box query and then plot the resulting `SpatialData` object.
sdata_small = sdata.query.bounding_box(
    min_coordinate=[2000, 2000], max_coordinate=[4000, 4000], axes=("x", "y"), target_coordinate_system="global"
)

sdata_small.pl.render_images("clahe").pl.render_labels("segmentation_mask", fill_alpha=0.5).pl.show()

<b>Exercise</b>:

- Try changing segmentation parameters to see how they affect the results. Work on a crop of the image.
- Go to the [documentation](https://spatialdata.scverse.org/projects/plot/en/latest/) of `spatialdata-plot`, and try to visualize the cell boundaries (i.e. the segmentation shapes layer)

<details>
<summary>Click to reveal the solution</summary>

```python
sdata_small.pl.render_images("clahe").pl.render_shapes("segmentation_mask_boundaries", fill_alpha=1.0).pl.show()
```
</details>

### 3.2 Nucleus expansion
In some cases, it may be useful to expand the nuclei segmentations to approximate the cell bodies. Note that this is not very precise: while it increases the number of transcripts assigned to a cell, it also introduces more wrongly assigned transcripts (i.e. transcripts that actually belong to other cells).
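The idea of label expansion can be sketched with a brute-force NumPy implementation: every background pixel within a given distance of a labeled pixel takes over the label of its nearest labeled pixel. This is only an illustration for tiny arrays (it is similar in spirit to `skimage.segmentation.expand_labels`), not the implementation used by `hp.im.expand_labels_layer`. The function name `expand_labels_bruteforce` is hypothetical, and the sketch assumes at least one labeled pixel.

```python
import numpy as np


def expand_labels_bruteforce(labels, distance):
    """Grow each label by up to `distance` pixels without overwriting other labels."""
    ys, xs = np.nonzero(labels)
    seeds = np.stack([ys, xs], axis=1)  # coordinates of all labeled pixels
    seed_labels = labels[ys, xs]
    yy, xx = np.indices(labels.shape)
    pixels = np.stack([yy.ravel(), xx.ravel()], axis=1)
    # Euclidean distance from every pixel to every labeled pixel
    d = np.linalg.norm(pixels[:, None, :] - seeds[None, :, :], axis=2)
    nearest = d.argmin(axis=1)
    expanded = np.where(d.min(axis=1) <= distance, seed_labels[nearest], 0)
    return expanded.reshape(labels.shape)


labels = np.array([[0, 0, 1, 0, 0, 0, 2, 0, 0]])
expanded = expand_labels_bruteforce(labels, distance=1)
print(labels)
print(expanded)
```

Pixels that end up in an expanded mask are attributed to that nucleus, whether or not they truly belong to the same cell, which is the trade-off described above.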

In [None]:
# Expand labels layer masks
sdata = hp.im.expand_labels_layer(
    sdata,
    labels_layer="segmentation_mask",
    distance=10, # Number of pixels to expand
    output_labels_layer="segmentation_mask_expanded", # Creates a new labels layer
    output_shapes_layer="segmentation_mask_expanded_boundaries", # Creates a new shapes layer
    overwrite=True,
)

In [None]:
# Plot nuclei masks vs expanded nuclei masks
hp.pl.plot_shapes(
    sdata,
    img_layer="clahe",
    shapes_layer=["segmentation_mask_boundaries", "segmentation_mask_expanded_boundaries"],
    figsize=(10,10),
    crd=[2000, 4000, 2000, 4000],
)

In [None]:
# spatialdata-plot does not accept crop coordinates, but we can perform a bounding box query and then plot the resulting `SpatialData` object.
sdata_small = sdata.query.bounding_box(
    min_coordinate=[2000, 2000], max_coordinate=[4000, 4000], axes=("x", "y"), target_coordinate_system="global"
)

sdata_small.pl.render_images(
    "clahe",
    cmap="gray",
).pl.render_shapes(
    "segmentation_mask_boundaries", 
    fill_alpha=0, 
    outline_width=0.3,
    outline_color='cyan',
    outline_alpha=1,
).pl.render_shapes(
    "segmentation_mask_expanded_boundaries",
    fill_alpha=0, 
    outline_width=0.3,
    outline_color='orange', 
    outline_alpha=1,
).pl.show(
    title="segmentation masks (cyan) vs. expanded segmentation masks (orange)",
    figsize=(10, 10),
    colorbar=False,
)

## 4. Allocating the transcripts

### 4.1 Creating the count matrix
In this step we
- load in the transcripts: in the case of Molecular Cartography, we can use `hp.io.read_resolve_transcripts`. If no specific loader exists for your data type, you can use the general `hp.io.read_transcripts` function.
- allocate the transcripts to the correct cell. This allocation step creates the count matrix saved in an [anndata](https://anndata.readthedocs.io/en/stable/) object.
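Conceptually, allocation looks up the segmentation label under each transcript and counts transcripts per cell and gene. The toy sketch below illustrates this with plain NumPy; it is not the implementation of `hp.tb.allocate`, and the function name `allocate_counts` as well as the toy coordinates are made up for the example.

```python
from collections import Counter

import numpy as np


def allocate_counts(labels, transcripts):
    """Count transcripts per (cell_id, gene); label 0 is background."""
    counts = Counter()
    for gene, x, y in transcripts:
        cell_id = int(labels[y, x])  # labels are indexed as [row (y), column (x)]
        if cell_id != 0:
            counts[(cell_id, gene)] += 1
    return counts


# Toy segmentation mask with two cells (labels 1 and 2)
labels = np.array([
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [2, 2, 0, 0],
])
# Toy transcripts as (gene, x, y)
transcripts = [("Axl", 1, 0), ("Axl", 2, 1), ("Vwf", 0, 2), ("Vwf", 3, 0), ("Axl", 1, 2)]

counts = allocate_counts(labels, transcripts)
assigned = sum(counts.values())
print(dict(counts))
print(f"{assigned} of {len(transcripts)} transcripts assigned to a cell")
```

Transcripts that fall on background pixels are not assigned to any cell, which is why the count matrix usually contains fewer transcripts than the points layer.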

In [None]:
# Read in Molecular Cartography transcript data as a points layer
sdata = hp.io.read_resolve_transcripts(
    sdata, 
    output_layer="transcripts", # Name of the points layer of the SpatialData object to which the transcripts will be added.
    path_count_matrix=path_coordinates, # Path to the file containing the transcripts information specific to Molecular Cartography.
    overwrite=True
)

# Allocate transcripts to cells based on the segmentation masks
sdata = hp.tb.allocate(
    sdata=sdata,
    labels_layer="segmentation_mask", # The labels layer (i.e. segmentation mask) in `sdata` to be used to allocate the transcripts to cells.
    points_layer="transcripts", # The points layer in `sdata` that contains the transcripts.
    output_layer="table_transcriptomics", # The table layer in `sdata` in which to save the AnnData object with the transcripts counts per cell.
    update_shapes_layers=False,
    overwrite=True,
)

In [None]:
# Inspect the new points layer
print(type(sdata.points["transcripts"]))
sdata.points["transcripts"].head()

In [None]:
# Inspect the new table layer
display(sdata.tables["table_transcriptomics"])

print('Number of cells: ', len(sdata.tables["table_transcriptomics"].obs.index))
print('Number of genes: ', len(sdata.tables["table_transcriptomics"].var.index))

In [None]:
# Inspect the count matrix in the new table layer
sdata.tables["table_transcriptomics"].to_df().head() # On large count matrices, calls to .to_df() should be avoided

In [None]:
# Inspect the var of the new table layer
sdata.tables["table_transcriptomics"].var.head()

In [None]:
# Inspect the obs of the new table layer
sdata.tables["table_transcriptomics"].obs.head()

In [None]:
# Inspect the spatial coordinates stored in obsm
sdata.tables["table_transcriptomics"].obsm['spatial'][:5] # x,y,(z) coordinates of cell centre (calculated based on mean transcripts location)

In [None]:
# Inspect the spatialdata_attrs in .uns to check the instance_key and region_key
sdata.tables["table_transcriptomics"].uns['spatialdata_attrs']

# NOTE: The AnnData object that is added as a table layer is annotated by the labels layer "segmentation_mask". The instance_key ('cell_ID') matches the labels in "segmentation_mask".
# NOTE: Tables of a SpatialData object can theoretically be annotated by a labels layer, a shapes layer or a points layer, but tables generated by the Harpy pipeline will always use a labels layer.

In [None]:
import dask.array as da

print('Number of cells in table: ', len(sdata.tables["table_transcriptomics"].obs))
print('Number of segmentation masks in labels layer: ', len(da.unique(sdata.labels["segmentation_mask"].data).compute()) - 1) # We subtract 1 because 0 is also a value, but this corresponds to the background.
print('Number of segmentation boundaries in shapes layer: ', len(sdata.shapes["segmentation_mask_boundaries"]))

# NOTE: Not all segmentation masks are included in the table layer "table_transcriptomics". This is because not all cells could be assigned transcripts.

<b>Exercise</b>:

- Run `.compute()` on the `sdata.points['transcripts']` layer. What is the data type of the resulting object? What is the data type of the original points layer?
- Have a look at https://docs.dask.org/en/stable/dataframe.html to understand the difference between both data types.

<details>
<summary>Click to reveal the solution</summary>

```python
from IPython.display import display

display(sdata["transcripts"].compute().head())
display(type(sdata["transcripts"].compute()))
display(type(sdata["transcripts"]))
```
</details>

<b>Exercise</b>:

- Bonus: Extract transformation from the points layer "transcripts" using `spatialdata.transformations.get_transformation`. See https://spatialdata.scverse.org/en/stable/generated/spatialdata.transformations.get_transformation.html
- Bonus: Now extract the transformation from the labels layer "segmentation_mask" and for the image layer "clahe".

<details>
<summary>Click to reveal the solution</summary>

```python
from IPython.display import display
from spatialdata.transformations import get_transformation

display(get_transformation(sdata["transcripts"]))
display(get_transformation(sdata["segmentation_mask"]))
display(get_transformation(sdata["clahe"]))
```
</details>

<b>Exercise</b>:

- Visualize the points layer and the labels layer using napari-spatialdata. Convince yourself they are registered.

In [None]:
# Interactive(sdata)

### 4.2 Visualizing gene expression

In [None]:
# Plot the expression of the Axl gene using hp.pl.plot_shapes()
hp.pl.plot_shapes(
    sdata,
    img_layer="clahe",
    shapes_layer="segmentation_mask_boundaries",
    figsize=(5,5),
    crd=[2000, 4000, 2000, 4000],
    table_layer="table_transcriptomics",
    column="Axl",
)

# NOTE: In Harpy/SpatialData there is a connection between tables, shapes and labels via the region_key and the cell id, which allows us to plot a certain column of a table spatially.

In [None]:
# Plot the expression of the Axl gene using spatialdata-plot
import matplotlib.pyplot as plt

plt.figure(figsize=(5, 5))
ax = plt.gca()

gene_name = "Axl"

sdata_small = sdata.query.bounding_box(
    min_coordinate=[2000, 2000], max_coordinate=[4000, 4000], axes=("x", "y"), target_coordinate_system="global"
)

sdata_small.pl.render_labels("segmentation_mask", color=gene_name, method="datashader", fill_alpha=0.5, table_name="table_transcriptomics").pl.show(
    coordinate_systems="global", ax=ax
)

In [None]:
# Explore gene expression interactively using napari-spatialdata

# Interactive(sdata)

<b>Exercise</b>:

- Use `hp.pl.plot_shapes` to plot the expression of some other genes that are in the dataset.

<details>
<summary>Click to reveal the solution</summary>

```python
display(sdata["table_transcriptomics"].var)

hp.pl.plot_shapes(
    sdata,
    img_layer="clahe",
    shapes_layer="segmentation_mask_boundaries",
    figsize=(5,5),
    crd=[2000, 4000, 2000, 4000],
    table_layer="table_transcriptomics",
    column="Vwf",
)
```
</details>

<b>Exercise</b>:

- Use `napari-spatialdata` to visualize the expression of the gene `Axl`.

In [None]:
# Interactive(sdata)

### 4.3 Transcript quality
After we have created the AnnData object, we want to check the transcript quality. First, we create a plot to check whether the transcript density is similar across the whole tissue. If this isn't the case, there can be multiple biological or technical reasons. Note that the choice of gene panel can also have an influence.
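A transcript density map is, at its core, a 2D histogram of the transcript coordinates. The sketch below shows that idea with `numpy.histogram2d` on toy coordinates. It is an illustration only, not the implementation of `hp.im.transcript_density` (which may, for example, apply additional smoothing), and the function name `transcript_density_grid` is hypothetical.

```python
import numpy as np


def transcript_density_grid(x, y, width, height, bin_size):
    """Count transcripts in square bins; returns an array indexed as [y_bin, x_bin]."""
    x_edges = np.arange(0, width + bin_size, bin_size)
    y_edges = np.arange(0, height + bin_size, bin_size)
    density, _, _ = np.histogram2d(y, x, bins=(y_edges, x_edges))
    return density


# Toy transcript coordinates in a 4 x 4 field of view
x = np.array([0.5, 1.5, 3.0, 3.5])
y = np.array([0.5, 0.5, 3.0, 1.0])

density = transcript_density_grid(x, y, width=4, height=4, bin_size=2)
print(density)
```

Regions with a clearly lower count than their surroundings are the ones to inspect more closely in the plots below.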

In [None]:
# Create transcript density image
sdata = hp.im.transcript_density(
    sdata,
    img_layer="clahe", # The layer of the SpatialData object used for determining image boundary.
    points_layer="transcripts", # The layer name that contains the transcript data points, by default "transcripts".
    output_layer="transcript_density", # The name of the output image layer
    overwrite=True,
)

In [None]:
# Plot transcript density
hp.pl.plot_image(sdata, img_layer = ["clahe", "transcript_density"], figsize=(10,10))

In [None]:
# Plot transcript and cell density using matplotlib
import matplotlib.pyplot as plt
import pandas as pd

# Get cell coordinates
df_cells = pd.DataFrame(sdata.tables["table_transcriptomics"].obsm['spatial'], columns=['x', 'y'])

# Get transcript coordinates
df_transcripts = sdata.points["transcripts"].compute()

# Create a side-by-side plot
fig, axs = plt.subplots(1, 2, figsize=(30, 15))

# Plot cell density (left)
h1 = axs[0].hexbin(
    df_cells["x"], df_cells["y"],
    gridsize=100,
    cmap="viridis",
    linewidths=0.2,
    edgecolors='face',
)
fig.colorbar(h1, ax=axs[0], label='Cell Count')
axs[0].set_title("Cell Density (Hexbin)")
axs[0].set_xlabel("x")
axs[0].set_ylabel("y")
axs[0].axis("equal")
axs[0].invert_yaxis()

# Plot transcript density (right)
h2 = axs[1].hexbin(
    df_transcripts["x"], df_transcripts["y"],
    gridsize=500,
    cmap="viridis",
    linewidths=0.2,
    edgecolors='face',
)
fig.colorbar(h2, ax=axs[1], label='Transcript Count')
axs[1].set_title("Transcript Density (Hexbin)")
axs[1].set_xlabel("x")
axs[1].set_ylabel("y")
axs[1].axis("equal")
axs[1].invert_yaxis()

plt.tight_layout()
plt.show()

# FIXME: We could (should) include this plotting functionality in harpy? Maybe also an option to plot hexbin for column variables in table layer.


In [None]:
# Check number of transcripts
print('Number of transcripts in points layer: ', len(sdata.points["transcripts"]))
print('Number of transcripts assigned to cells: ', sdata.tables["table_transcriptomics"].X.sum())
print('Percentage of transcripts kept: ', ((sdata.tables["table_transcriptomics"].X.sum())/len(sdata.points["transcripts"]))*100)

# NOTE: Only a fraction of transcripts are assigned to cells.

In [None]:
# Check number of genes
print('Number of genes in points layer: ', sdata.points['transcripts'].compute()['gene'].nunique())
print('Number of genes found in cells: ', len(sdata.tables["table_transcriptomics"].var.index))

# NOTE: In general, we don't want to lose any genes, but this may happen if they have a low abundance.

In [None]:
# Check which genes are not found in cells
genes_not_found_in_cells = set(sdata.points['transcripts'].compute()['gene'].unique()) - set(sdata.tables["table_transcriptomics"].var.index)

print("Number of genes not found in cells: ", len(genes_not_found_in_cells))
print("Genes not found in cells:", genes_not_found_in_cells)


In [None]:
# Convert the transcript points table to a DataFrame
df_transcripts = sdata.points["transcripts"].compute()

# Filter and count transcripts for Ms4a7
gene_name = "Ms4a7"
count = (df_transcripts["gene"] == gene_name).sum()

print(f"Number of transcripts for {gene_name}: {count}")


In [None]:
# Analyse and visualize the proportion of transcripts that could not be assigned to a cell during allocation step.

df_analyse_genes_left_out = hp.pl.analyse_genes_left_out(
    sdata,
    labels_layer="segmentation_mask",
    table_layer="table_transcriptomics",
    points_layer="transcripts",
)

# NOTE: In general we see a downward trend. The more a gene is measured, the less it is located in cells (in ratio). 
# NOTE: The function also prints the ten genes with the highest proportion of transcripts filtered out. If a lot of these genes are markers for the same cell type, you will want to find out why this is happening (bad staining, large cell body compared to nucleus, etc.)

In [None]:
# Inspect analyse_genes_left_out() output table
df_analyse_genes_left_out.sort_values(by="proportion_kept", ascending=True)
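The per-gene proportion inspected above boils down to the number of transcripts of a gene that ended up in a cell divided by the total number of transcripts of that gene. A toy sketch of that ratio (the function name `proportion_kept` and the toy gene lists are made up for the example, and this is not the implementation of `hp.pl.analyse_genes_left_out`):

```python
from collections import Counter


def proportion_kept(all_genes, assigned_genes):
    """Fraction of transcripts per gene that was assigned to a cell."""
    total = Counter(all_genes)
    kept = Counter(assigned_genes)
    return {gene: kept[gene] / n for gene, n in total.items()}


# One entry per transcript: all detected transcripts vs. those assigned to a cell
all_genes = ["Axl"] * 4 + ["Vwf"] * 2 + ["Ms4a7"]
assigned_genes = ["Axl", "Axl", "Axl", "Vwf"]

for gene, fraction in sorted(proportion_kept(all_genes, assigned_genes).items(), key=lambda kv: kv[1]):
    print(f"{gene}: {fraction:.2f}")
```

A gene with a proportion of 0 is detected in the tissue but never inside a segmentation mask, so it will be missing from the count matrix.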

## 5. Processing the AnnData table

### 5.1 Filtering and Normalization

The next steps are performed to further process the AnnData object:

- QC metrics are calculated.
- Filtering: cells with fewer than a certain number of counts (e.g. 10) and genes occurring in fewer than a certain number of cells (e.g. 5) are filtered out.
- Normalization: for small gene panels (<500 genes), we recommend normalizing the data based on the size of the segmented object (`size_norm=True`). For transcriptome-wide methods, we recommend library size normalization based on the total expression (`size_norm=False`).
- log1p-transformation of the expression data (y=ln(1+x)).
- Scale data to unit variance and zero mean. The scaling is capped at `max_value_scale`.
- PCA calculation
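The filtering, size normalization, log1p and scaling steps listed above can be sketched on a toy count matrix with NumPy. This is an illustration under simplifying assumptions, not the implementation of `hp.tb.preprocess_transcriptomics`: the exact scale factor used for size normalization may differ, QC metrics and PCA are left out, and the function name `preprocess_counts` is hypothetical.

```python
import numpy as np


def preprocess_counts(counts, areas, min_counts=10, min_cells=5, max_value=10.0):
    """Filter, size-normalize, log1p-transform and scale a cells x genes count matrix."""
    counts = np.asarray(counts, dtype=float)
    areas = np.asarray(areas, dtype=float)

    # Filtering: drop cells with too few counts, then genes found in too few cells
    keep_cells = counts.sum(axis=1) >= min_counts
    counts, areas = counts[keep_cells], areas[keep_cells]
    keep_genes = (counts > 0).sum(axis=0) >= min_cells
    counts = counts[:, keep_genes]

    normalized = counts / areas[:, None]  # size normalization (counts per pixel of area)
    logged = np.log1p(normalized)  # y = ln(1 + x)

    # Scale every gene to zero mean and unit variance, capped at max_value
    scaled = (logged - logged.mean(axis=0)) / logged.std(axis=0, ddof=1)
    return np.minimum(scaled, max_value), keep_cells, keep_genes


# Toy data: 6 cells x 4 genes, with the area of each segmentation mask in pixels
toy_counts = np.arange(24).reshape(6, 4) + 1
toy_areas = np.array([100.0, 120.0, 90.0, 110.0, 105.0, 95.0])

scaled, keep_cells, keep_genes = preprocess_counts(toy_counts, toy_areas)
print("cells kept:", int(keep_cells.sum()), "genes kept:", int(keep_genes.sum()))
print("mean per gene:", scaled.mean(axis=0).round(6))
print("std per gene: ", scaled.std(axis=0, ddof=1).round(6))
```

A gene with identical values in all cells would have a standard deviation of zero and would need special handling, which this sketch omits.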


The last plot shows the size of the nucleus related to the counts. When working with whole cells, if there are some really big cells with really low counts, they are probably not real cells and you should filter based on max size. 

In [None]:
# Perform preprocessing.
sdata = hp.tb.preprocess_transcriptomics(
    sdata,
    labels_layer="segmentation_mask",
    table_layer="table_transcriptomics",
    output_layer="table_transcriptomics_preprocessed", # Write results to a new slot. We could also write to the same slot (when passing overwrite=True).
    min_counts=10,
    min_cells=5,
    size_norm=True,
    highly_variable_genes=False,  # If True, will only retain highly variable genes. This can be used for transcriptome-wide methods.
    max_value_scale=10, # The maximum value to which data will be scaled
    n_comps=50, # Number of principal components to calculate.
    overwrite=True,
    update_shapes_layers=False,
)

In [None]:
# Inspect preprocessed table
sdata.tables[ "table_transcriptomics_preprocessed" ]

In [None]:
# Inspect expression values
sdata.tables["table_transcriptomics_preprocessed"].to_df().head()

In [None]:
# Check mean expression values per gene
sdata.tables["table_transcriptomics_preprocessed"].to_df().mean(axis=0).head() # mean ~ 0

In [None]:
# Check standard deviation of expression values per gene
sdata.tables["table_transcriptomics_preprocessed"].to_df().std(axis=0).head() # std ~ 1

In [None]:
# Check max expression value per gene
sdata.tables["table_transcriptomics_preprocessed"].to_df().max(axis=0).head() # max ~ 10

In [None]:
# Inspect obs of preprocessed table
sdata.tables["table_transcriptomics_preprocessed"].obs.head()

# n_genes_by_counts: The number of genes with at least 1 count in a cell
# log1p_n_genes_by_counts: log1p-transformed n_genes_by_counts
# total_counts: Total number of counts for a cell
# log1p_total_counts: log1p-transformed total_counts
# pct_counts_in_top_2_genes: The percentage of the total gene expression in each cell that comes from the top 2 most highly expressed genes in that cell
# pct_counts_in_top_5_genes: The percentage of the total gene expression in each cell that comes from the top 5 most highly expressed genes in that cell 
# n_counts: Number of counts in a cell
# shapeSize: Area of cell (in pixels)

In [None]:
# Check sum of transcript counts
(sdata.tables["table_transcriptomics"].to_df()).sum(axis=1).head()

In [None]:
# Check number of genes
(sdata.tables["table_transcriptomics"].to_df()>0).sum(axis = 1).head()

In [None]:
# Inspect var of preprocessed table
sdata.tables["table_transcriptomics_preprocessed"].var.head()

# n_cells_by_counts: Number of cells this gene is found in
# mean_counts: Mean counts over all cells
# log1p_mean_counts: log1p of mean_counts
# pct_drop_by_counts: Percentage of cells this gene does not appear in
# total_counts: Total number of counts for a gene
# log1p_total_counts: log1p of total_counts
# n_cells: Number of cells this gene is found in
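# mean: Mean expression of the gene, stored by the scaling step
# std: Standard deviation of the gene's expression, stored by the scaling step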

In [None]:
# Plot preprocessing QC plots
hp.pl.preprocess_transcriptomics(
    sdata,
    table_layer="table_transcriptomics_preprocessed",
)

In [None]:
# Additionally, plot a histogram of segmentation mask areas
import seaborn as sns
import matplotlib.pyplot as plt

sns.histplot(sdata.tables["table_transcriptomics_preprocessed"].obs["shapeSize"], kde=False)
plt.title("Area of Segmentation Masks")
plt.xlabel("shapeSize (pixels)")
plt.ylabel("Count")
plt.tight_layout()
plt.show()


In [None]:
# Plot total counts
hp.pl.plot_shapes(
    sdata,
    img_layer="clahe",
    table_layer="table_transcriptomics_preprocessed",
    column="total_counts",
    shapes_layer="segmentation_mask_boundaries",
    crd=[2000, 4000, 2000, 4000],
    figsize=(8,8)
)

In [None]:
# Filter cells on size
sdata = hp.tb.filter_on_size(
    sdata,
    labels_layer="segmentation_mask",
    table_layer="table_transcriptomics_preprocessed",
    output_layer="table_transcriptomics_filter",
    min_size=500, # Minimum cell size
    max_size=100000, # Maximum cell size
    update_shapes_layers=True,
    overwrite=True,
)

In [None]:
hp.pl.plot_shapes(
    sdata, 
    img_layer="clahe", 
    shapes_layer="segmentation_mask_boundaries", 
    shapes_layer_filtered="filtered_size_segmentation_mask_boundaries", # Filtered cells will be plotted in red.
    figsize=(5,5), 
    crd = [2000, 4000, 2000, 4000]
)

### 5.2 Clustering

This function computes the neighborhood graph, the Leiden clustering and the UMAP embedding using standard scanpy functions.

You need to define the following parameters:
- The number of PCs used: between 15 and 20 is a good starting point (based on the plot of PCs).
- The number of neighbors used: 35 is generally a good value. In general, fewer neighbors gives a more spread-out embedding, while more neighbors gives a tighter one.
- Cluster resolution.

It computes the UMAP and a list of marker genes per cluster, which can be inspected to identify cell types.

In [None]:
import scanpy as sc

# Leiden clustering
sdata = hp.tb.leiden(
    sdata,
    labels_layer="segmentation_mask",
    table_layer="table_transcriptomics_filter",
    output_layer="table_transcriptomics_clustered",
    calculate_umap=True,
    calculate_neighbors=True,
    n_pcs=17, # The number of principal components to use when calculating neighbors.
    n_neighbors=35, # The number of neighbors to consider when calculating neighbors.
    resolution=0.8,
    rank_genes=True,
    key_added="leiden",
    overwrite=True,
)

# Plot UMAP
sc.pl.umap(sdata.tables["table_transcriptomics_clustered"], color=["leiden"], show=True)

In [None]:
# Plot clusters spatially
hp.pl.plot_shapes(
    sdata,
    img_layer="clahe",
    table_layer="table_transcriptomics_clustered",
    column="leiden",
    shapes_layer="segmentation_mask_boundaries",
    alpha=1.0,
    linewidth=0,
    # crd=[2000, 4000, 2000, 4000]
)

In [None]:
# We can plot other variables on the UMAP as well
from matplotlib.pyplot import rc_context
color_vars = [
    "n_counts",
    "n_genes_by_counts",
    "shapeSize",
    "Glul",
    "leiden",
]
with rc_context({"figure.figsize": (3, 3)}):
    sc.pl.umap(sdata.tables["table_transcriptomics_clustered"], color=color_vars, ncols=3)

In [None]:
sc.pl.rank_genes_groups(sdata.tables["table_transcriptomics_clustered"], n_genes=8, sharey=False, show=True)

<b>Exercise</b>:

Change the parameters of `hp.tb.leiden`. What do you observe?

In [None]:
#from napari_spatialdata import Interactive

#del sdata.tables["table_transcriptomics_clustered"].uns["leiden_colors"]
#Interactive(sdata)

In [None]:
import matplotlib.pyplot as plt
import spatialdata_plot  # noqa: F401 (registers the `.pl` accessor on SpatialData objects)

# For fun, also plot via spatialdata-plot
plt.figure(figsize=(5, 5))
ax = plt.gca()

column = "leiden"

adata = sdata.tables[ "table_transcriptomics_clustered" ]

#cmap = matplotlib.colors.LinearSegmentedColormap.from_list(
#                    "new_map",
#                    adata.uns[column + "_colors"],
#                    N=len(adata.uns[column + "_colors"]),
#                )

sdata_small = sdata.query.bounding_box(
    min_coordinate=[2000, 2000], max_coordinate=[4000, 4000], axes=("x", "y"), target_coordinate_system="global"
)

sdata_small.pl.render_labels("segmentation_mask", color=column, cmap=None, method="datashader", fill_alpha=1, table_name= "table_transcriptomics_clustered").pl.show(
    coordinate_systems="global", ax=ax
)

### 5.3 Cell type annotation

#### 5.3.1 Annotating clusters

In [None]:
# First, we specify a dictionary with marker genes for some interesting cell types.
marker_genes_dict = {
    'LSEC': ['Stab2', 'Pecam1'],
    'HepatocytesPortal': ['Pck1', 'Hal', 'Sds'],
    'HepatocytesCentral': ['Cyp2e1', 'Glul', 'Lgr5'],
    'Cholangiocytes': ['Spp1', 'Sox9','Epcam'],
    'B_cells': ['Ccr7', 'Cd19', 'Cd79a'],
    'Kupffer_cells': ['Axl', 'Cd5l', 'Clec4f'],
}

In [None]:
# We can visualize the expression of these marker genes in the Leiden clusters using scanpy's matrix plot.
sc.pl.matrixplot(
    sdata.tables["table_transcriptomics_clustered"], 
    var_names=marker_genes_dict, 
    groupby="leiden", 
    cmap="Blues",
    standard_scale="var",
    colorbar_title="column scaled\nexpression",
)

In [None]:
# We can also use a dot plot
sc.pl.dotplot(
    sdata.tables["table_transcriptomics_clustered"], 
    var_names=marker_genes_dict, 
    groupby="leiden", 
    cmap="Blues",
)

In [None]:
# The heatmap plot does not collapse the cells into a single average value per cluster
sc.pl.heatmap(
    sdata.tables["table_transcriptomics_clustered"], 
    var_names=marker_genes_dict, 
    groupby="leiden", 
    cmap="viridis", 
)

# For more visualization options, see the scanpy documentation: https://scanpy.readthedocs.io/en/stable/tutorials/plotting/core.html

#### 5.3.2 Annotating cells using sc.tl.score_genes

We can also use a marker gene list and score cells for each cell type using those markers via scanpy's `sc.tl.score_genes` function.
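
To make the idea of marker-based scoring concrete, here is a minimal plain-Python sketch. It is a simplification and not what `sc.tl.score_genes` or `hp.tb.score_genes` do internally: scanpy subtracts the average of a sampled reference gene set, whereas this example (with the hypothetical helper `score_cells` and made-up expression values) simply subtracts the mean over all genes and assigns each cell the cell type with the highest score.

```python
from statistics import mean


def score_cells(expr, genes, markers):
    """Score every cell for every cell type and pick the best-scoring type.

    expr: list of per-cell expression lists (cells x genes).
    genes: gene names, in the column order of expr.
    markers: dict mapping a cell type to its list of marker genes.
    """
    index = {gene: i for i, gene in enumerate(genes)}
    all_scores, annotations = [], []
    for row in expr:
        background = mean(row)  # simplified reference: mean over all genes
        scores = {}
        for cell_type, marker_genes in markers.items():
            values = [row[index[g]] for g in marker_genes if g in index]
            scores[cell_type] = mean(values) - background if values else float("-inf")
        all_scores.append(scores)
        annotations.append(max(scores, key=scores.get))
    return all_scores, annotations


genes = ["Glul", "Cyp2e1", "Pck1", "Stab2"]
markers = {
    "HepatocytesCentral": ["Glul", "Cyp2e1"],
    "HepatocytesPortal": ["Pck1"],
    "LSEC": ["Stab2"],
}
expr = [
    [2.0, 1.5, 0.1, 0.0],  # high central hepatocyte markers
    [0.0, 0.2, 1.8, 0.1],  # high portal hepatocyte marker
    [0.1, 0.0, 0.2, 2.5],  # high LSEC marker
]
scores, annotations = score_cells(expr, genes, markers)
print(annotations)
# ['HepatocytesCentral', 'HepatocytesPortal', 'LSEC']
```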

In [None]:
import pandas as pd

# Download annotation file from registry
path_mg = registry.fetch("transcriptomics/resolve/mouse/markerGeneListMartinNoLow.csv")

df = pd.read_csv(path_mg, index_col=0, delimiter=",")
df.columns = df.columns.str.replace(' ', '_', regex=False) # whitespaces no longer allowed since spatialdata>=0.3.0

# Inspect annotation file containing markers
display(df.head()) # This is a one-hot encoded matrix with cell types as column headers and marker genes as the row index.

In [None]:
# Annotate cells
sdata, celltypes_scored, celltypes_all = hp.tb.score_genes(
    sdata,
    labels_layer="segmentation_mask",
    table_layer="table_transcriptomics_clustered",
    output_layer="table_transcriptomics_score_genes",
    path_marker_genes=df, # path_marker_genes can also be a dataframe
    overwrite=True,
)

In [None]:
# Inspect new table layer
sdata["table_transcriptomics_score_genes"]

In [None]:
# Inspect new table layer obs
sdata.tables["table_transcriptomics_score_genes"].obs.head()

In [None]:
# Plot cell type annotations on UMAP
sc.pl.umap(sdata.tables["table_transcriptomics_score_genes"], color="annotation")

In [None]:
# Plot cell type annotations spatially
hp.pl.plot_shapes(
    sdata,
    column="annotation",
    img_layer="clahe",
    table_layer= "table_transcriptomics_score_genes",
    shapes_layer="segmentation_mask_boundaries",
    linewidth=0,
    alpha=0.7,
    crd=None,
)

In [None]:
# Let's inspect the cell type counts and percentages
counts = sdata.tables["table_transcriptomics_score_genes"].obs['annotation'].value_counts()
percentages = sdata.tables["table_transcriptomics_score_genes"].obs['annotation'].value_counts(normalize=True) * 100

cluster_summary = pd.DataFrame({
    'count': counts,
    'percentage': percentages.round(2)
})

print(cluster_summary)

### 5.4 Squidpy

#### 5.4.1 Constructing spatial graphs
The spatial graph is a network in which each node represents an observation (spot/cell) and edges signify neighborhood relationships (calculated from the spatial coordinates of the observations). This graph is useful for various analyses, such as neighborhood enrichment and calculating spatial statistics such as spatial autocorrelation.

We use `squidpy.gr.spatial_neighbors` to compute the spatial neighbors graph in this non-grid dataset, setting `coord_type="generic"` and `n_neighs=6` to specify that each observation should have 6 neighbors (K-nearest neighbors, KNN). We could also define neighbor relationships based on a radius around each cell, or set `delaunay=True` to apply Delaunay triangulation.
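
The difference between a K-nearest-neighbors graph and a radius-based graph can be illustrated on four toy coordinates in plain Python. The helpers `knn_graph` and `radius_graph` are hypothetical names written for this sketch and are not part of squidpy. Note that the KNN graph is directed: every point has exactly `k` neighbors, but an isolated point may not be a neighbor of any other point.

```python
import math


def knn_graph(coords, k):
    """adj[i][j] = 1 if point j is among the k nearest neighbors of point i."""
    n = len(coords)
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        others = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: math.dist(coords[i], coords[j]),
        )
        for j in others[:k]:
            adj[i][j] = 1
    return adj


def radius_graph(coords, radius):
    """adj[i][j] = 1 if points i and j are at most `radius` apart (symmetric)."""
    n = len(coords)
    return [
        [1 if i != j and math.dist(coords[i], coords[j]) <= radius else 0 for j in range(n)]
        for i in range(n)
    ]


# Three points close together and one far-away outlier.
coords = [(0, 0), (1, 0), (0, 1), (10, 10)]

knn = knn_graph(coords, k=2)
print([sum(row) for row in knn])        # neighbors per point: [2, 2, 2, 2]
print([sum(col) for col in zip(*knn)])  # times chosen as a neighbor: [2, 3, 3, 0]

rad = radius_graph(coords, radius=1.5)
print([sum(row) for row in rad])        # neighbors per point: [2, 2, 2, 0]
```

The outlier has two neighbors in the KNN graph but none in the radius graph, so the choice of graph determines whether isolated cells take part in downstream neighborhood statistics.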

In [None]:
# Try calculating spatial neighbors using Squidpy
import squidpy as sq

sq.gr.spatial_neighbors(
    adata=sdata["table_transcriptomics_score_genes"], 
    coord_type="generic", # Set to 'generic' for targeted spatial transcriptomics
    n_neighs=6, # Only used when delaunay = False
    radius=None, # To compute the neighbors based on the radius
    delaunay=False, # Whether to compute the graph from Delaunay triangulation
    set_diag=False, # Whether to set the diagonal of the connectivity matrix to 1 (i.e. whether cells should be considered neighbors of themselves).
    key_added="KNN"
)

sdata.tables["table_transcriptomics_score_genes"]

In [None]:
# However, the result is not yet backed to the zarr store: reading the data back from disk loses it.
from spatialdata import read_zarr

sdata = read_zarr(sdata.path)

sdata.tables["table_transcriptomics_score_genes"]

# NOTE: .uns["KNN_neighbors"], .obsp["KNN_connectivities"] and .obsp["KNN_distances"] are no longer in the table!

In [None]:
# Let's try calculating the spatial neighbors again, but we'll make sure the new table is backed to the zarr store by using hp.tb.add_table_layer().
from harpy.utils._keys import _REGION_KEY

sq.gr.spatial_neighbors(
    adata=sdata["table_transcriptomics_score_genes"], 
    coord_type="generic", # Set to 'generic' for targeted spatial transcriptomics
    n_neighs=6, # Only used when delaunay = False
    radius=None, # To compute the neighbors based on the radius
    delaunay=False, # Whether to compute the graph from Delaunay triangulation
    set_diag=False, # Whether to set the diagonal of the connectivity matrix to 1 (i.e. whether cells should be considered neighbors of themselves).
    key_added="KNN"
)

region = sdata["table_transcriptomics_score_genes"].obs[_REGION_KEY].cat.categories.to_list()

sdata = hp.tb.add_table_layer(
    sdata,
    adata=sdata.tables["table_transcriptomics_score_genes"],
    output_layer="table_transcriptomics_squidpy",
    region=region, # A list of regions to associate with the table data. Typically this is all unique elements in adata.obs[_REGION_KEY].
    overwrite=True,
)

In [None]:
# Inspect spatial connectivities for a small block of the matrix (rows and columns 6 to 9)
sdata.tables['table_transcriptomics_squidpy'].obsp['KNN_connectivities'].toarray()[6:10,6:10]

In [None]:
# Inspect spatial distances
sdata.tables['table_transcriptomics_squidpy'].obsp['KNN_distances'].toarray()[6:10,6:10]

In [None]:
# Inspect number of neighbors (for first 10 cells)
sdata.tables['table_transcriptomics_squidpy'].obsp['KNN_connectivities'].toarray().sum(axis=1)[0:10] # sums across the columns for each row

# NOTE: Every cell has exactly 6 neighbors when using n_neighs=6

In [None]:
# Inspect for every cell how many cells have it as a neighbor (for first 10 cells)
sdata.tables['table_transcriptomics_squidpy'].obsp['KNN_connectivities'].toarray().sum(axis=0)[0:10] # sums across the rows for each column

# NOTE: Not every cell is a neighbor of exactly 6 cells when using n_neighs=6

In [None]:
sq.pl.spatial_scatter(
    sdata.tables["table_transcriptomics_squidpy"],
    shape=None,
    color="annotation",
    connectivity_key="KNN_connectivities",
    size=30,
    figsize=(15,15),
    legend_loc='best',
    legend_fontsize=7,
    dpi=300
)

<b>Exercise</b>:

- Build a graph of spatial neighbors using a radius (e.g. 150 pixels) and plot the results using `sq.pl.spatial_scatter`. Inspect the neighbor relationships in both directions for the first 10 cells. How does this compare to the results for the 6-nearest-neighbors spatial graph? Do the same for Delaunay triangulation.

<details>
<summary>Click to reveal the solution</summary>

```python
# radius-based spatial graphs
from harpy.utils._keys import _REGION_KEY

sq.gr.spatial_neighbors(
    adata=sdata["table_transcriptomics_score_genes"], 
    coord_type="generic", # Set to 'generic' for targeted spatial transcriptomics
    n_neighs=6, # Only used when delaunay = False
    radius=150, # To compute the neighbors based on the radius
    delaunay=False, # Whether to compute the graph from Delaunay triangulation
    set_diag=False, # Whether to set the diagonal of the connectivity matrix to 1 (i.e. whether cells should be considered neighbors of themselves).
    key_added="radius"
)

region = sdata["table_transcriptomics_score_genes"].obs[_REGION_KEY].cat.categories.to_list()

sdata = hp.tb.add_table_layer(
    sdata,
    adata=sdata.tables["table_transcriptomics_score_genes"],
    output_layer="table_transcriptomics_squidpy",
    region=region, # A list of regions to associate with the table data. Typically this is all unique elements in adata.obs[_REGION_KEY].
    overwrite=True,
)

print('Inspect number of neighbors (for first 10 cells):')
display(sdata.tables['table_transcriptomics_squidpy'].obsp['radius_connectivities'].toarray().sum(axis=1)[0:10])

print('Inspect for every cell how many cells have it as a neighbor (for first 10 cells):')
display(sdata.tables['table_transcriptomics_squidpy'].obsp['radius_connectivities'].toarray().sum(axis=0)[0:10])

sq.pl.spatial_scatter(
    sdata.tables["table_transcriptomics_squidpy"],
    shape=None,
    color="annotation",
    connectivity_key="radius_connectivities",
    size=30,
    figsize=(15,15),
    legend_loc='best',
    legend_fontsize=7,
    dpi=300
)

# Delaunay triangulation
from harpy.utils._keys import _REGION_KEY

sq.gr.spatial_neighbors(
    adata=sdata["table_transcriptomics_score_genes"], 
    coord_type="generic", # Set to 'generic' for targeted spatial transcriptomics
    n_neighs=6, # Only used when delaunay = False
    radius=None, # To compute the neighbors based on the radius
    delaunay=True, # Whether to compute the graph from Delaunay triangulation
    set_diag=False, # Whether to set the diagonal of the connectivity matrix to 1 (i.e. whether cells should be considered neighbors of themselves).
    key_added="delaunay"
)

region = sdata["table_transcriptomics_score_genes"].obs[_REGION_KEY].cat.categories.to_list()

sdata = hp.tb.add_table_layer(
    sdata,
    adata=sdata.tables["table_transcriptomics_score_genes"],
    output_layer="table_transcriptomics_squidpy",
    region=region, # A list of regions to associate with the table data. Typically this is all unique elements in adata.obs[_REGION_KEY].
    overwrite=True,
)

print('Inspect number of neighbors (for first 10 cells):')
display(sdata.tables['table_transcriptomics_squidpy'].obsp['delaunay_connectivities'].toarray().sum(axis=1)[0:10])

print('Inspect for every cell how many cells have it as a neighbor (for first 10 cells):')
display(sdata.tables['table_transcriptomics_squidpy'].obsp['delaunay_connectivities'].toarray().sum(axis=0)[0:10])

sq.pl.spatial_scatter(
    sdata.tables["table_transcriptomics_squidpy"],
    shape=None,
    color="annotation",
    connectivity_key="delaunay_connectivities",
    size=30,
    figsize=(15,15),
    legend_loc='best',
    legend_fontsize=7,
    dpi=300
)
```

</details>

#### 5.4.2 Neighborhood enrichment analysis
Next, we calculate the neighborhood enrichment score with `squidpy.gr.nhood_enrichment`. This function stores a dictionary in `adata.uns['annotation_nhood_enrichment']` (the key is derived from `cluster_key`) that contains a z-score matrix and a count matrix.

The count matrix records how often each pair of cell types are neighbors in the dataset. Each row in this count matrix represents a cell type, and each column shows how many times it is connected to each of the other cell types.

For each pair of cell types, the observed counts from the original data are compared to, for example, 1000 permutations (depending on the n_perms argument) and a z-score is calculated. The z-score is a measure of how many standard deviations the observed count deviates from the distribution generated by random permutations.
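
The permutation test described above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not squidpy's implementation: the helper names `pair_counts` and `nhood_zscores` are hypothetical, each undirected edge is counted once in both directions, and the toy tissue is a chain of 20 cells in which the two cell types are fully segregated.

```python
import random
from statistics import mean, pstdev


def pair_counts(edges, labels, types):
    """Count how often each ordered pair of cell types is connected by an edge."""
    counts = {(a, b): 0 for a in types for b in types}
    for i, j in edges:
        counts[(labels[i], labels[j])] += 1
        counts[(labels[j], labels[i])] += 1
    return counts


def nhood_zscores(edges, labels, n_perms=200, seed=0):
    """Compare observed pair counts with counts obtained after shuffling the labels."""
    types = sorted(set(labels))
    observed = pair_counts(edges, labels, types)
    rng = random.Random(seed)
    shuffled = list(labels)
    permuted = {pair: [] for pair in observed}
    for _ in range(n_perms):
        rng.shuffle(shuffled)  # break the link between position and cell type
        for pair, count in pair_counts(edges, shuffled, types).items():
            permuted[pair].append(count)
    zscores = {}
    for pair, obs in observed.items():
        sd = pstdev(permuted[pair])
        zscores[pair] = (obs - mean(permuted[pair])) / sd if sd > 0 else float("nan")
    return observed, zscores


# A chain of 20 cells: the first 10 are type "A", the last 10 are type "B".
labels = ["A"] * 10 + ["B"] * 10
edges = [(i, i + 1) for i in range(19)]

observed, zscores = nhood_zscores(edges, labels)
print(observed[("A", "A")], observed[("A", "B")])
print(zscores[("A", "A")] > 0, zscores[("A", "B")] < 0)
```

Cells of the same type neighbor each other far more often than expected by chance (positive z-score), while the two types are rarely neighbors (negative z-score).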

In [None]:
# Calculate neighborhood enrichment
sq.gr.nhood_enrichment(
    sdata.tables["table_transcriptomics_squidpy"], 
    cluster_key='annotation', 
    connectivity_key='KNN'
)

# Add the table layer again so that the results are backed to the zarr store
sdata = hp.tb.add_table_layer(
    sdata,
    adata=sdata.tables["table_transcriptomics_squidpy"],
    output_layer="table_transcriptomics_squidpy",
    region=region,
    overwrite=True,
)

# NOTE: Also see harpy.tb.nhood_enrichment and harpy.pl.nhood_enrichment

In [None]:
# Plot neighborhood enrichment
sq.pl.nhood_enrichment(
    sdata.tables["table_transcriptomics_squidpy"],
    cluster_key='annotation', 
    mode='zscore',
)

#### 5.4.3 Spatial autocorrelation
Moran’s I can be understood as the Pearson correlation between the value at each location and the average value of its neighbors. Like the Pearson correlation, Moran’s I generally lies between -1 and 1: a positive value indicates positive spatial autocorrelation (similar values cluster together), and a negative value indicates negative spatial autocorrelation (neighboring values tend to differ).
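
As a worked example, the sketch below computes Moran's I for a toy chain of six locations in plain Python, using the standard definition given in the docstring. The helper names `morans_i` and `chain_weights` are hypothetical and the binary neighbor weights are an assumption for this example; squidpy additionally estimates significance from permutations, which is omitted here.

```python
def morans_i(values, weights):
    """Moran's I = (N / W) * sum_ij w_ij (x_i - m)(x_j - m) / sum_i (x_i - m)^2.

    values: one value per location.
    weights: N x N matrix of spatial weights (here 1 for neighbors, 0 otherwise).
    N is the number of locations, W the sum of all weights, m the mean value.
    """
    n = len(values)
    m = sum(values) / n
    dev = [v - m for v in values]
    w_total = sum(sum(row) for row in weights)
    numerator = sum(weights[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n))
    denominator = sum(d * d for d in dev)
    return (n / w_total) * (numerator / denominator)


def chain_weights(n):
    """Binary weights for n locations on a line, each linked to its direct neighbors."""
    return [[1 if abs(i - j) == 1 else 0 for j in range(n)] for i in range(n)]


weights = chain_weights(6)
print(f"{morans_i([0, 0, 0, 1, 1, 1], weights):.2f}")  # clustered values: 0.60
print(f"{morans_i([0, 1, 0, 1, 0, 1], weights):.2f}")  # alternating values: -1.00
```

Genes whose expression forms spatial domains therefore rank at the top of the Moran's I table, while genes with a checkerboard-like or random pattern score near or below zero.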

In [None]:
# Calculate Moran’s I global spatial auto-correlation statistics
sq.gr.spatial_autocorr(
    adata=sdata.tables["table_transcriptomics_squidpy"],
    mode="moran",
    n_perms=100,
    n_jobs=1,
    connectivity_key='KNN_connectivities'
)

# Add the table layer again so that the results are backed to the zarr store
sdata = hp.tb.add_table_layer(
    sdata,
    adata=sdata.tables["table_transcriptomics_squidpy"],
    output_layer="table_transcriptomics_squidpy",
    region=region,
    overwrite=True,
)

In [None]:
# Inspect highest Moran's I scores
sdata.tables["table_transcriptomics_squidpy"].uns["moranI"].head(10)

In [None]:
# Inspect lowest Moran's I scores
sdata.tables["table_transcriptomics_squidpy"].uns["moranI"].tail(10)

In [None]:
# Let's inspect the spatial expression patterns for the 9 highest Moran's I scores.
color_vars = sdata.tables["table_transcriptomics_squidpy"].uns["moranI"].index[0:9]

with rc_context({"figure.figsize": (4, 5)}):
    sq.pl.spatial_scatter(
        sdata.tables["table_transcriptomics_squidpy"],
        shape=None,
        color=color_vars,
        size=5,
        ncols=3,
        dpi=300
    )


In [None]:
# UMAPs of 9 highest Moran's I scores
with rc_context({"figure.figsize": (4, 4)}):
    sc.pl.umap(
        sdata.tables["table_transcriptomics_clustered"], 
        color=color_vars, 
        ncols=3,
    )

### 5.5 TissUUmaps

TissUUmaps is a visualization tool for interactive exploration of your spatial data. It can visualize data from an AnnData .h5ad file or from a CSV file. You can also simultaneously visualize images (multiple file types, including tiff) and regions (GeoJSON).

It can be installed using this link: https://tissuumaps.github.io/installation/ \
Documentation can be found here: https://tissuumaps.github.io/TissUUmaps-docs/ 

In [None]:
# Export the table, image and shapes for visualization in TissUUmaps
from skimage.io import imsave

if not unit_testing:

    # Save AnnData as h5ad
    sdata.tables["table_transcriptomics_squidpy"].write(os.path.join(OUTPUT_DIR, 'adata.h5ad'))

    # Export image as tiff
    img = sdata.images['clahe'].data.compute()
    imsave(os.path.join(OUTPUT_DIR, "clahe.tiff"), img)

    # Export shapes layer as GeoJSON
    sdata.shapes['segmentation_mask_boundaries'].to_file(os.path.join(OUTPUT_DIR, "segmentation_mask_boundaries.geojson"), driver="GeoJSON")


## 6. Segmentation-free analysis

In [None]:
# First, we create new labels and shapes layers for a hexagonal grid 
shape = (12864, 10720)

size = 50 # radius of the hexagon, or side length of the square.

sdata = hp.im.add_grid_labels_layer(
    sdata, 
    shape=shape, 
    size=size, 
    output_shapes_layer=f"shapes_spots_{size}um", 
    output_labels_layer=f"labels_spots_{size}um", 
    grid_type='hexagon', # Set to 'square' for square grid
    offset=(0, 0), 
    chunks=1024, 
    client=None, 
    transformations=None, 
    scale_factors=(2, 2, 2, 2),
    overwrite=True
)

In [None]:
# Allocate transcripts
sdata = hp.tb.allocate(
    sdata=sdata,
    labels_layer=f"labels_spots_{size}um",
    points_layer="transcripts", # The points layer in `sdata` that contains the transcripts.
    output_layer="table_transcriptomics_hex", # The table layer in `sdata` in which to save the AnnData object with the transcript counts per grid spot.
    update_shapes_layers=False,
    overwrite=True,
)

# Perform preprocessing.
sdata = hp.tb.preprocess_transcriptomics(
    sdata,
    labels_layer=f"labels_spots_{size}um",
    table_layer="table_transcriptomics_hex",
    output_layer="table_transcriptomics_hex_preprocessed", # Write results to a new slot. We could also write to the same slot (when passing overwrite=True).
    min_counts=10,
    min_cells=5,
    size_norm=True,
    highly_variable_genes=False,  # If True, will only retain highly variable genes. This can be used for transcriptome-wide methods.
    max_value_scale=10, # The maximum value to which data will be scaled
    n_comps=50, # Number of principal components to calculate.
    overwrite=True,
    update_shapes_layers=False,
)

In [None]:
import scanpy as sc

# Leiden clustering
sdata = hp.tb.leiden(
    sdata,
    labels_layer=f"labels_spots_{size}um",
    table_layer="table_transcriptomics_hex_preprocessed",
    output_layer="table_transcriptomics_hex_preprocessed",
    calculate_umap=True,
    calculate_neighbors=True,
    n_pcs=17, # The number of principal components to use when calculating neighbors.
    n_neighbors=35, # The number of neighbors to consider when calculating neighbors.
    resolution=0.8,
    rank_genes=True,
    key_added="leiden",
    overwrite=True,
)

# Plot UMAP
sc.pl.umap(sdata.tables["table_transcriptomics_hex_preprocessed"], color=["leiden"], show=True)

In [None]:
# Plot clusters spatially
hp.pl.plot_shapes(
    sdata,
    img_layer="clahe",
    table_layer="table_transcriptomics_hex_preprocessed",
    column="leiden",
    shapes_layer=f"shapes_spots_{size}um",
    alpha=1.0,
    linewidth=0,
    # crd=[2000, 4000, 2000, 4000]
)

In [None]:
from harpy.utils._keys import _REGION_KEY
import squidpy as sq

sq.gr.spatial_neighbors(
    adata=sdata["table_transcriptomics_hex_preprocessed"], 
    coord_type="grid", # The hexagonal spots form a regular grid; use 'generic' for segmented cells
    n_neighs=6, # Only used when delaunay = False
    radius=None, # To compute the neighbors based on the radius
    delaunay=False, # Whether to compute the graph from Delaunay triangulation
    set_diag=False, # Whether to set the diagonal of the connectivity matrix to 1 (i.e. whether cells should be considered neighbors of themselves).
    key_added=None
)

region = sdata["table_transcriptomics_hex_preprocessed"].obs[_REGION_KEY].cat.categories.to_list()

sdata = hp.tb.add_table_layer(
    sdata,
    adata=sdata.tables["table_transcriptomics_hex_preprocessed"],
    output_layer="table_transcriptomics_hex_preprocessed",
    region=region, # A list of regions to associate with the table data. Typically this is all unique elements in adata.obs[_REGION_KEY].
    overwrite=True,
)

In [None]:
# Calculate Moran’s I global spatial auto-correlation statistics
sq.gr.spatial_autocorr(
    adata=sdata.tables["table_transcriptomics_hex_preprocessed"],
    mode="moran",
    n_perms=100,
    n_jobs=1,
)

# Add the table layer again so that the results are backed to the zarr store
sdata = hp.tb.add_table_layer(
    sdata,
    adata=sdata.tables["table_transcriptomics_hex_preprocessed"],
    output_layer="table_transcriptomics_hex_preprocessed",
    region=region,
    overwrite=True,
)

In [None]:
from matplotlib.pyplot import rc_context

# Let's inspect the spatial expression patterns for the 9 highest Moran's I scores.
color_vars = sdata.tables["table_transcriptomics_hex_preprocessed"].uns["moranI"].index[0:9]
with rc_context({"figure.figsize": (4, 5)}):
    sq.pl.spatial_scatter(
        sdata.tables["table_transcriptomics_hex_preprocessed"],
        shape=None,
        color=color_vars,
        size=5,
        ncols=3,
        dpi=300
    )


In [None]:
sq.pl.spatial_scatter(
    sdata.tables["table_transcriptomics_hex_preprocessed"],
    shape=None,
    color="leiden",
    connectivity_key="spatial_connectivities",
    size=30,
    figsize=(15,15),
    legend_loc='best',
    legend_fontsize=7,
    dpi=300
)