broadinstitute/TissueMosaic
TissueMosaic: A python library for the analysis of biological tissues

Description

TissueMosaic is a Python library for the analysis of biological tissues and cellular micro-environments based on self-supervised learning. It is built on PyTorch, PyTorch Lightning, and anndata.

Spatially resolved transcriptomic technologies (such as SlideSeq, MerFish, SmFish, BaristaSeq, ExSeq, STARMap, and others) allow measuring gene expression with spatial resolution. Deconvolution methods and/or analysis of marker genes can be used to assign a discrete cell type (such as Macrophage, B-Cell, ...) or cell-type proportions to each spot.

This type of data can be conveniently organized into anndata objects, data structures specifically designed for transcriptomic data. Each anndata object contains a list of all the cells in a tissue together with (at a minimum):

  1. the gene expression profile
  2. the cell-type label
  3. the spatial coordinates (either in 2D or 3D)

This rich data can unlock interesting scientific discoveries, but it is difficult to analyze. This is where TissueMosaic comes in.

In short, tissues are converted into images and cropped into overlapping patches. Semantic features are associated with each patch via self-supervised learning (ssl). The learned features are then used in downstream tasks (such as differential gene expression analysis).

What's appealing about this approach is that it is unbiased, meaning that the researcher does not need to know a priori which features are important. Given enough data and a sufficiently large neural network, this approach should be able to extract biologically relevant features useful in solving downstream tasks.

Negative results are also interesting because they suggest that the task at hand cannot be solved based on cellular co-arrangement alone (i.e. cell-type labels and spatial coordinates). In that case, more information (for example histopathology imaging) might be necessary to define the tissue micro-environments.
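
The image conversion described above can be sketched as follows. This is a minimal illustration, not the library's actual implementation: cells with 2D coordinates and discrete cell-type labels are binned into a multi-channel image, one channel per cell type (the function name and `pixel_size` parameter are ours).

```python
import numpy as np

def rasterize(xy, cell_types, n_types, pixel_size=1.0):
    """Bin cells into a (n_types, H, W) image; each channel counts one cell type."""
    ij = np.floor(xy / pixel_size).astype(int)
    ij -= ij.min(axis=0)                      # shift so indices start at 0
    h, w = ij.max(axis=0) + 1
    img = np.zeros((n_types, h, w))
    np.add.at(img, (cell_types, ij[:, 0], ij[:, 1]), 1.0)  # accumulate counts
    return img

# Toy example: 4 cells of 2 cell types, two spatial clusters
xy = np.array([[0.2, 0.1], [0.4, 0.3], [2.5, 2.5], [2.7, 2.6]])
types = np.array([0, 0, 1, 1])
img = rasterize(xy, types, n_types=2)
print(img.shape)  # (2, 3, 3)
```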

Typical workflow

A typical workflow consists of 3 steps:

  1. Multiple anndata objects (corresponding to multiple tissues, possibly in a diverse set of conditions) are converted to (sparse) images. These images are cropped into overlapping patches of a characteristic length and fed into an ssl framework. Importantly, in this step the model has no access to the gene expression profile. It only uses the cell-type labels together with their spatial coordinates to create a multi-channel image (in which each channel encodes the density of a specific cell type). Therefore, the model can only leverage the cellular co-arrangement as a learning signal. See notebook1.
  2. Once a model is trained, any (new or old) anndata object can be processed. As described above, the anndata object is transformed into a sparse image and cropped into overlapping patches. Semantic features are associated with each patch and then transferred to the cells belonging to the patch. Ultimately, each cell acquires a new set of annotations describing its local micro-environment. This step can be repeated multiple times (once for each trained model) to compare the quality of the features generated by different ssl models and/or different patch sizes. See notebook2.
  3. Finally, we evaluate the quality of the features. To this end, we use the ssl annotations to predict the gene expression profile conditioned on the cell type. We compare against multiple baselines to show that the ssl features are biologically informative. See notebook3.
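
The patch-to-cell feature transfer in step 2 can be sketched like this (an illustration of the idea, not the package's code; the function name and arguments are ours): each cell receives the average of the feature vectors of all overlapping patches that contain it.

```python
import numpy as np

def transfer_features(cell_xy, patch_origins, patch_size, patch_features):
    """Average the features of every square patch that contains each cell."""
    n_cells, dim = cell_xy.shape[0], patch_features.shape[1]
    out = np.zeros((n_cells, dim))
    counts = np.zeros(n_cells)
    for origin, feat in zip(patch_origins, patch_features):
        # boolean mask of cells falling inside this patch
        inside = np.all((cell_xy >= origin) & (cell_xy < origin + patch_size), axis=1)
        out[inside] += feat
        counts[inside] += 1
    return out / np.maximum(counts, 1)[:, None]

# Toy example: 2 overlapping patches of size 10, 3 cells
cells = np.array([[2.0, 2.0], [7.0, 7.0], [12.0, 12.0]])
origins = np.array([[0.0, 0.0], [5.0, 5.0]])
feats = np.array([[1.0, 0.0], [0.0, 1.0]])
cell_feats = transfer_features(cells, origins, 10.0, feats)
print(cell_feats)  # the middle cell averages both patches: [0.5, 0.5]
```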

Why image-based self supervised learning?

Spatial transcriptomic data is a type of tabular data and could be analyzed without converting it to images. However, image-based approaches offer three notable advantages:

  1. We can leverage state-of-the-art approaches that are continuously developed by the broader ML community.
  2. By changing the patch size, we can easily obtain information about the cellular environment at different spatial resolutions, from local (a few cells) to global (thousands of cells).
  3. It is trivial to combine cell-typing information with other imaging modalities such as histopathology: the corresponding images can simply be concatenated channel-wise before being fed to the algorithm.

Installation

First, you need Python >= 3.11.0 and PyTorch (with CUDA support). The following command, run from your terminal, should report True:

python -c 'import torch; print(torch.cuda.is_available())'

If not, install Pytorch: https://pytorch.org/get-started/locally/

Finally, install TissueMosaic and its dependencies:

git clone https://github.com/broadinstitute/TissueMosaic.git
cd TissueMosaic
pip install -r requirements.txt
pip install .

Installation should complete in <10 minutes.

Versions the software has been tested on

Environment 1:

  • System: Linux Ubuntu 22.04.4 LTS
  • Python = 3.11.0, CUDA = 12.1
  • Dependencies: anndata=0.10.6, leidenalg=0.9.1, lightly=1.5.1, lightning_bolts=0.7.0, matplotlib=3.8.3, neptune=1.9.1, numpy=1.26.4, pandas=1.5.3, python_igraph=0.10.4, pytorch-lightning=1.7.7, scanpy=1.9.8, scikit_learn=1.4.1, scipy=1.12.0, seaborn=0.13.2, torch=2.2.1, torchvision=0.17.1, umap_learn=0.5.5

Expected Input

TissueMosaic expects the input to be a folder of anndata objects along with a config file.

Each anndata object should contain the following fields:

  • in obs:
    • 'x': an array of shape (n_cells,) containing the x-coordinate of each spot.
    • 'y': an array of shape (n_cells,) containing the y-coordinate of each spot.
  • in obsm:
    • 'cell_type_proportions': an array of shape (n_cells, n_cell_types) containing the cell-type proportions for each spot.
  • in uns:
    • 'status': a string indicating the external condition label of the sample.

This information can be included under different keys, as long as you update the config file accordingly (example config files are provided in the run/ directory).

If you are running TissueMosaic on your own data, take care to update the 'categories_to_channels' key in the config file. This key is used to map the cell-type labels to the channels in the image and must match the column labels of the 'cell_type_proportions' array.

The main hyperparameter to adjust is the patch size. The 'global_size' and 'local_size' keys specify the patch size of the global and local crops (for Dino); 'local_size' should be 0.5-0.75 of 'global_size'. We find empirically that global_size=96 and local_size=64 work well for most tissues, but this may vary depending on your particular use case. See the manuscript for additional details.
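
For orientation, a hypothetical config fragment. The key names ('global_size', 'local_size', 'categories_to_channels') come from the text above, but the surrounding layout and the cell-type names other than ES are illustrative, so consult the example configs in run/ before copying:

```yaml
# Illustrative fragment only; the exact layout may differ from the configs in run/.
global_size: 96            # patch size of the global crops (Dino)
local_size: 64             # patch size of the local crops, 0.5-0.75 of global_size
categories_to_channels:    # cell-type label -> image channel; must match the
  - ES                     # column labels of 'cell_type_proportions'
  - RS
  - SPC
```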

You can typically leave all other hyperparameters at their default values.

How to run

Please refer to the documentation (https://tissuemosaic.readthedocs.io/) for a quick start tutorial. Running this tutorial with a trained model takes approximately 1.5 hours on a 4-core system.

There are 3 ways to run the code:

You can run the notebooks sequentially. Each notebook demonstrates one step of the workflow described in Typical workflow.

Or you can run the code locally from the command line. First, download the example data (first published in Dissecting Mammalian Spermatogenesis Using Spatial Transcriptomics by Chen et al.) and untar it into the "testis_anndata" directory:

gsutil -m cp gs://ld-data-bucket/tissue-mosaic/slideseq_testis_anndata_h5ad.tar.gz ./
mkdir -p ./testis_anndata
tar -xzf slideseq_testis_anndata_h5ad.tar.gz -C ./testis_anndata

Next, navigate to the "TissueMosaic/run" directory and train the model (this will take about 3 hours for 500 epochs on a single Nvidia RTX 4090 GPU):

cd TissueMosaic/run
python main_1_train_ssl.py --config config_dino_ssl.yaml --data_folder testis_anndata --ckpt_out dino_testis.pt

# or alternatively
# python main_1_train_ssl.py --config config_barlow_ssl.yaml --data_folder testis_anndata --ckpt_out barlow_testis.pt
# python main_1_train_ssl.py --config config_simclr_ssl.yaml --data_folder testis_anndata --ckpt_out simclr_testis.pt
# python main_1_train_ssl.py --config config_vae_ssl.yaml --data_folder testis_anndata --ckpt_out vae_testis.pt

Next, extract the features (this will take 5-10 minutes to run):

mkdir testis_anndata_featurized
python main_2_featurize.py \
    --anndata_in testis_anndata \
    --anndata_out testis_anndata_featurized \
    --ckpt_in dino_testis.pt \
    --feature_key dino \
    --n_patches 500 \
    --ncv_k 10 25 100 \
    --suffix featurized

Finally, evaluate the features based on their ability to predict the gene expression profile (this will take ~45 minutes to run for 1500 genes):

# set environment thread counts
export OMP_NUM_THREADS=1
export MKL_NUM_THREADS=1

mkdir gr_results
python main_3_gene_regression.py \
    --anndata_in testis_anndata_featurized \
    --out_dir gr_results \
    --out_prefix dino_ctype \
    --feature_key dino_spot_features \
    --alpha_regularization_strength 0.01 \
    --filter_feature 2.0 \
    --fc_bc_min_umi 500 \
    --fg_bc_min_pct_cells_by_counts 10 \
    --cell_types ES

This will write the gene regression evaluation metrics to the specified output directory.
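
Conceptually, this evaluation fits a regularized linear regression from the ssl features to gene expression (hence the --alpha_regularization_strength flag above). The idea can be sketched in closed form; this is a simplified illustration, not the package's implementation:

```python
import numpy as np

def ridge_fit_predict(X_train, y_train, X_test, alpha=0.01):
    """Closed-form ridge regression: w = (X^T X + alpha I)^-1 X^T y."""
    d = X_train.shape[1]
    w = np.linalg.solve(X_train.T @ X_train + alpha * np.eye(d), X_train.T @ y_train)
    return X_test @ w

# Toy check: recover a linear map from "ssl features" to "expression"
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))      # feature vectors for 100 cells
true_w = rng.normal(size=(8, 3))   # 3 hypothetical genes
y = X @ true_w                     # noiseless expression
pred = ridge_fit_predict(X[:80], y[:80], X[80:], alpha=0.01)
print(np.allclose(pred, y[80:], atol=0.1))  # predictions close to truth
```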

It might make sense to train your model remotely on Google Cloud (or another cloud provider) using Terra or Cromwell and cromshell. After installing cromshell and connecting to a Cromwell server, you can submit a run as follows:

cd TissueMosaic/run
./submit_neptune_ml.sh neptune_ml.wdl --py main_1_train_ssl.py --wdl WDL_parameters.json --ml config_dino_ssl.yaml

# or alternatively
# ./submit_neptune_ml.sh neptune_ml.wdl --py main_1_train_ssl.py --wdl WDL_parameters.json --ml config_barlow_ssl.yaml
# ./submit_neptune_ml.sh neptune_ml.wdl --py main_1_train_ssl.py --wdl WDL_parameters.json --ml config_simclr_ssl.yaml
# ./submit_neptune_ml.sh neptune_ml.wdl --py main_1_train_ssl.py --wdl WDL_parameters.json --ml config_vae_ssl.yaml

Steps 2 and 3 can be run locally since they are much faster (see above).

Features and Limitations

Features:

  1. We have implemented multiple ssl strategies (a convolutional VAE, DINO, Barlow Twins, SimCLR) based on recent advances in image-based machine learning.
  2. TissueMosaic can be used to analyze any type of localized quantitative measurement, not only mRNA count data (for example, spatial proteomics).

Current limitations:

  1. TissueMosaic works only with 2D tissue slices. No 3D support at the moment.

Future Improvements

We hope to soon support:

  1. Pairing with histopathology (i.e. dense images)
  2. Extension to handle 3D images

Contributing

We aspire to make TissueMosaic an easy-to-use and useful software package for the bioinformatics community. While we test and improve TissueMosaic together with our research collaborators, your feedback is invaluable to us and allows us to steer TissueMosaic in the direction you find most useful in your research. If you have an interesting idea or suggestion, please do not hesitate to reach out to us.

If you encounter a bug, please file a detailed GitHub issue and we will get back to you as soon as possible.

Citation

This software package was developed by Sandeep Kambhampati, Luca D'Alessio, and Fedor Grab.
