# Quick Start: Generate Expression Data for a Customised Tissue  


- **Creator**: Amir Akbarnejad (aa36@sanger.ac.uk)
- **Affiliation**: Wellcome Sanger Institute and University of Cambridge
- **Date of Creation**: 01.07.2025
- **Date of Last Modification**: 01.07.2025

This tutorial demonstrates how to generate in silico spatial expression data using MintFlow.
**To run the notebook, you only need to modify the parts marked with `TODO:MODIFY:`. Everything else can be left untouched if the goal is simply to run the notebook.**

This notebook is only for demonstration, and to get biologically meaningful results you may need different data and/or settings.

In [None]:
import os, sys
import yaml
import mintflow
import scanpy as sc
import squidpy as sq
import matplotlib.pyplot as plt
from tqdm.autonotebook import tqdm
import numpy as np
from pprint import pprint

import pandas as pd
import seaborn as sns
import torch

import mintflow.interface.perturbation.module_gen_micsizefactor
import mintflow.interface.perturbation.module_gen_stdata

## 1. Overview

To generate in silico spatial expression data, you need to provide MintFlow with:
1. A cell-cell neighbourhood graph (usually computed from the cells' spatial locations).
2. The cells' cell type labels.
3. A batch index (an integer), i.e. the index of the training batch that the tissue hypothetically belongs to. Generation is conditioned on the batch token of one of the biological/technical batches seen during training.

Afterwards, given the specified tissue (i.e. given items 1, 2, and 3), the generative model can generate spatial expression data and, if requested, multiple samples or realisations of it.

Note:
- Novel cell type labels, i.e. cell type labels that the model has never seen during training, are not allowed.
- Setting the batch index to, e.g., 0 does not mean that the specified tissue (i.e. its cell type labels and neighbourhood graph) has to be a crop or a subset of the 0-th biological batch in the training set. Instead, you can freely create even a de novo tissue by
  - arbitrarily specifying the cells' 2D locations,
  - computing the neighbourhood graph based on the cells' locations,
  - and arbitrarily assigning cell type labels of your choice to the cells.
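
The first constraint in the note can be checked before generation. The following is a minimal sketch, assuming the set of cell type labels seen during training is available as a plain Python collection. The helper name `find_novel_celltypes` and the example labels are hypothetical and not part of MintFlow.

```python
def find_novel_celltypes(labels, known_labels):
    """Return the sorted labels that are absent from `known_labels`."""
    return sorted(set(labels) - set(known_labels))


# Hypothetical labels, for illustration only.
known_labels = {"T cell", "B cell", "Fibroblast"}
tissue_labels = ["T cell", "Fibroblast", "Neuron", "T cell"]

novel = find_novel_celltypes(tissue_labels, known_labels)
if novel:
    print(f"Labels not seen during training: {novel}")
```

With an AnnData object, `labels` could be the cell type column of `.obs`, e.g. `adata.obs['broad_celltypes']` in this tutorial.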


The following sections demonstrate these steps.

## 2. Download a sample anndata object and a sample MintFlow checkpoint
 
- Download this sample `.h5ad` file from Google Drive: [(link to the file on google drive)](https://drive.google.com/file/d/187Y44hpY5OuwMu0_PA9r9WvycMOx-uz5/view?usp=sharing)
and place it in a directory of your choice. Then set the variable `path_anndata` below to the path where you placed the `.h5ad` file.
- The first tutorial notebook demonstrated how to save a checkpoint to disk by calling `mintflow.dump_checkpoint`. Download this sample checkpoint file from Google Drive [(link to the file on google drive)](https://drive.google.com/file/d/1KS40-BCE4Zapq0osNjRkMEXs8IGRQj3g/view?usp=sharing)
and place it in a directory of your choice. Then set the variable `path_checkpoint` below to the path where you placed the `.pt` file.



In [None]:
path_anndata = './NonGit/data_train_single_section.h5ad'  
# TODO:MODIFY: set to the path where you've put the `.h5ad` file that you downloaded above.

path_checkpoint = './NonGit/sample_checkpoint.pt'  
# TODO:MODIFY: set to the path where you've put the `.pt` file that you downloaded above.
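
Before loading anything, it can help to confirm that both paths point to existing files. This is a small optional sketch using only the standard library; the helper `missing_files` is hypothetical and not part of MintFlow.

```python
from pathlib import Path


def missing_files(*paths):
    """Return the given paths that do not point to an existing file."""
    return [str(p) for p in paths if not Path(p).is_file()]


# Same placeholder paths as in the cell above.
path_anndata = './NonGit/data_train_single_section.h5ad'
path_checkpoint = './NonGit/sample_checkpoint.pt'

for path in missing_files(path_anndata, path_checkpoint):
    print(f"File not found, please check the path: {path}")
```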

## 3. Load the MintFlow checkpoint

In [None]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(device)

In [None]:
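checkpoint_mintflow = torch.load(
    path_checkpoint,
    map_location='cpu',  # load onto the CPU first; the model is moved to `device` below
    weights_only=False   # unpickles arbitrary Python objects: only load checkpoints you trust
)
checkpoint_mintflow['model'].to(device)
print("Loaded the checkpoint.")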

## 4. Make a tissue with customised cell type labels and cell spatial locations  
As explained above, one can arbitrarily specify the cells' spatial locations and cell type labels. Here, however, we simply use a crop of the tissue used for training.
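
For reference, a de novo tissue could be specified along the following lines. This is only a sketch with NumPy: the number of cells, the rectangle size, and the labels `typeA`, `typeB`, `typeC` are all hypothetical, and in practice the labels must be ones the model has seen during training.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical tissue: 200 cells placed uniformly in a 1000 x 400 rectangle.
num_cells = 200
xy = rng.uniform(low=[0.0, 0.0], high=[1000.0, 400.0], size=(num_cells, 2))

# Hypothetical cell type labels, assigned at random.
allowed_labels = np.array(["typeA", "typeB", "typeC"])
celltypes = rng.choice(allowed_labels, size=num_cells)

# Optionally impose spatial structure: all cells in the left half become "typeA".
celltypes[xy[:, 0] < 500.0] = "typeA"
```

Such arrays could then be stored in an AnnData object, with the coordinates under `.obsm['spatial']` and the labels as a column of `.obs`, after which the neighbourhood graph is computed as in Section 4.3.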

### 4.1. Load the anndata object

In [None]:
adata = sc.read_h5ad(
    path_anndata
)

### 4.2. Select a crop from it

In [None]:
adata = adata[
    (adata.obs['x_centroid'] > 5000.0) & (adata.obs['x_centroid'] < 6000.0) &
    (adata.obs['y_centroid'] > 2100.0) & (adata.obs['y_centroid'] < 2500.0)
].copy()  # select a crop; copy so that it is an independent object rather than a view
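
The crop is a boolean mask over the cell centroids: a cell is kept only if it lies strictly inside the rectangle. The toy example below illustrates the same logic with NumPy; the coordinate values are made up.

```python
import numpy as np

# Toy centroids (hypothetical values) for five cells.
x = np.array([4900.0, 5200.0, 5800.0, 6100.0, 5500.0])
y = np.array([2200.0, 2300.0, 2600.0, 2400.0, 2150.0])

# Same rectangle as in the cell above.
mask = (x > 5000.0) & (x < 6000.0) & (y > 2100.0) & (y < 2500.0)
print(mask)        # which cells are kept
print(mask.sum())  # how many cells are kept
```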

### 4.3. Create the neighbourhood graph

In [None]:
# create the neighbourhood graph
kwargs_neighbourhood_graph = {
    'spatial_key': 'spatial',
    'library_key': None,
    'set_diag': False,
    'delaunay': False,
    'n_neighs': 5
}
adata.uns = {}
sq.gr.spatial_neighbors(
    adata=adata,
    **kwargs_neighbourhood_graph
)
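
To build intuition for what a k-nearest-neighbour graph with `n_neighs=5` looks like, here is a brute-force NumPy sketch on random points. It is not how squidpy computes the graph, and squidpy's output may differ (e.g. in symmetrisation; it also stores a sparse matrix in `adata.obsp['spatial_connectivities']`). The helper `knn_connectivities` and the point cloud are hypothetical.

```python
import numpy as np


def knn_connectivities(xy, n_neighs):
    """Dense 0/1 matrix where row i marks the n_neighs nearest cells of cell i."""
    dists = np.linalg.norm(xy[:, None, :] - xy[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)  # a cell is not its own neighbour
    nearest = np.argsort(dists, axis=1)[:, :n_neighs]
    conn = np.zeros(dists.shape, dtype=int)
    conn[np.arange(xy.shape[0])[:, None], nearest] = 1
    return conn


rng = np.random.default_rng(seed=0)
xy = rng.uniform(0.0, 100.0, size=(50, 2))  # hypothetical cell locations
conn = knn_connectivities(xy, n_neighs=5)
print(conn.sum(axis=1))  # number of neighbours marked in each row
```

The brute-force distance matrix scales quadratically with the number of cells, so this is only suitable for small illustrations.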

### 4.4. Visualise the selected tissue crop

In [None]:
sc.pl.spatial(
    adata,
    spot_size=5,
    color='broad_celltypes'
)

## 5. Generate expression data for the specified tissue
Having created the customised tissue section (i.e. given items 1, 2, and 3 explained at the beginning of the notebook), we now proceed to generate expression data for it. This is done by calling `mintflow.generate_insilico_ST_data`.

Some important arguments of `mintflow.generate_insilico_ST_data`:
- `obskey_celltype`: the name of the `.obs` column that contains the cell type labels. The labels have to be among the ones seen during training.
- `batch_index_trainingdata`: generation is also conditioned on a batch index. For example, if `batch_index_trainingdata` is set to 1, generation is conditioned on the batch with index 1 seen during training. Note that this index is zero-based. To check the batch index assigned to each tissue section, run the cell below. In this tutorial there is a single tissue section, hence a single batch, so `batch_index_trainingdata` is set to 0.
- `estimate_spatial_sizefactors_on_sections`: two size factors are needed to generate `Xint` and `Xmic`. To generate these size factors, MintFlow filters out cells with similar cell type labels and MCC vectors in some tissue sections. This argument specifies the tissue section(s) used for this purpose. In this tutorial there is a single tissue section, so `estimate_spatial_sizefactors_on_sections` is set to `[0]`.

In [None]:
# prints the batch index assigned to each tissue section in the training set
pprint(checkpoint_mintflow['data_mintflow']['train_list_tissue_section'].map_Batchname_to_inflowBatchID)

In [None]:
result_generation = mintflow.generate_insilico_ST_data(
    adata=adata,
    obskey_celltype='broad_celltypes',
    obspkey_neighbourhood_graph='spatial_connectivities',
    device=device,
    batch_index_trainingdata=0,
    num_generated_realisations=3,
    model=checkpoint_mintflow['model'],
    data_mintflow=checkpoint_mintflow['data_mintflow'],
    dict_all4_configs=checkpoint_mintflow['dict_all4_configs'],
    estimate_spatial_sizefactors_on_sections=[0]
)

The above cell generates `num_generated_realisations=3` sets of expression data, or "realisations", for the tissue. The variation in the expression of each gene across these realisations can be informative.

We can obtain, e.g., the average microenvironment component of the generated expression as follows:

In [None]:
Xmic_average = np.stack(
    [realisation['MintFLow_Generated_Xmic'] for realisation in result_generation['list_generated_realisations_ie_expressions']]
).mean(0)

Intuitively, `Xmic_average` is the generative model's "on average" estimate of the microenvironment-induced part of expression, given the provided cell locations and cell type labels.
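
The variation across realisations can be summarised in the same way as the average. The sketch below uses random toy arrays as stand-ins for the generated realisations, and assumes each realisation is a dense array of shape (cells, genes); that shape and the variable names are assumptions, not MintFlow's documented output format.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical stand-ins for 3 generated realisations of shape (cells, genes).
num_realisations, num_cells, num_genes = 3, 4, 2
realisations = [
    rng.poisson(lam=5.0, size=(num_cells, num_genes)).astype(float)
    for _ in range(num_realisations)
]

stacked = np.stack(realisations)     # shape: (realisations, cells, genes)
mean_per_cell = stacked.mean(0)      # average over realisations
std_per_cell = stacked.std(0)        # spread over realisations, per cell and gene
std_per_gene = std_per_cell.mean(0)  # one summary value per gene
print(std_per_gene)
```

With only a few realisations the standard deviation is a rough estimate; increasing `num_generated_realisations` would give a more stable picture at the cost of more computation.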