GitHub - scDiffEq/LARRY-dataset: Documentation associated with preparing and formatting LARRY datasets for ML applications with PyTorch / PyTorch Lightning

This is a Python package that downloads and preprocesses the LARRY dataset from the AllonKleinLab GitHub repository. The LARRY dataset is a group of single-cell lineage-traced datasets that have been used to study transcriptional landscapes and cell fate during differentiation [1]. The group contains three datasets, all of which use hematopoietic progenitor cells from mouse bone marrow and the LARRY lentiviral barcoding strategy:

in vitro
in vivo - transplanted into mice
Cytokine-perturbed (in vitro) - split between various cytokine culture conditions.

The package includes functions to download the LARRY dataset, format the files into AnnData, and perform preprocessing and gene-filtering steps. The preprocessed data can then be split into training and test sets for use in different machine learning tasks. Dimension reduction can be performed either after data splitting, to avoid information leakage from the test set, or on all of the data.
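To illustrate why the order of splitting and dimension reduction matters, here is a minimal sketch that fits a PCA on the training cells only and then reuses that fit to project the test cells. It uses plain NumPy on a random toy matrix and is not the package's own implementation; `fit_pca` and `apply_pca` are hypothetical helper names introduced for this example.

```python
import numpy as np

def fit_pca(X_train, n_components):
    """Learn the centering vector and principal axes from training cells only."""
    mean = X_train.mean(axis=0)
    # Rows of Vt are the principal axes of the centered training matrix.
    _, _, Vt = np.linalg.svd(X_train - mean, full_matrices=False)
    return mean, Vt[:n_components]

def apply_pca(X, mean, components):
    """Project any cells onto axes that were learned elsewhere."""
    return (X - mean) @ components.T

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))  # toy stand-in for a cells x genes matrix

# Split first ...
idx = rng.permutation(X.shape[0])
train_idx, test_idx = idx[:80], idx[80:]

# ... then fit on the training cells and reuse that fit for the test cells.
mean, components = fit_pca(X[train_idx], n_components=5)
Z_train = apply_pca(X[train_idx], mean, components)
Z_test = apply_pca(X[test_idx], mean, components)

print(Z_train.shape, Z_test.shape)
```

Because the mean and the axes are computed from the training rows alone, nothing about the test cells influences the embedding they are later evaluated in. Fitting on all of the data is simpler, but it lets test-set statistics shape the representation.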

Installation

`pip` distribution

pip install larry-dataset

Development version

git clone https://github.com/mvinyard/LARRY-dataset.git; cd LARRY-dataset

pip install -e .

Quickstart

Downloads pre-processed data from AllonKleinLab/paper-data to ./KleinLabData (by default). The data is formatted into AnnData and returned to the user. A .h5ad file is also saved locally. The download and conversion steps take several minutes because of the large normed_counts expression matrices, but this only happens once.

import larry
    
dataset = "in_vitro" # can also choose from: "in_vivo" or "cytokine_perturbation"
adata = larry.fetch(dataset)

AnnData object with n_obs × n_vars = 130887 × 25289
    obs: 'Library', 'Cell barcode', 'Time point', 'Starting population', 'Cell type annotation', 'Well', 'SPRING-x', 'SPRING-y'
    var: 'gene_name'
    obsm: 'X_clone'
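The "only happens once" behavior described above follows a common build-then-cache pattern. The sketch below shows that pattern in general terms, using only the standard library. It is an assumption about the overall shape, not the package's actual code: `load_or_build` and `slow_build` are hypothetical names, the file name is made up, and a JSON file stands in for the saved .h5ad file.

```python
import json
import tempfile
from pathlib import Path

def load_or_build(cache_path, build):
    """Return cached data if cache_path exists; otherwise build, save, and return it."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        return json.loads(cache_path.read_text())
    data = build()  # slow step, e.g. download + conversion
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    cache_path.write_text(json.dumps(data))
    return data

calls = []

def slow_build():
    """Hypothetical stand-in for the expensive download and formatting."""
    calls.append(1)
    return {"dataset": "in_vitro", "n_obs": 130887}

with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "KleinLabData" / "in_vitro.json"
    first = load_or_build(cache, slow_build)   # builds and writes the cache
    second = load_or_build(cache, slow_build)  # reads the cache, no rebuild

print(len(calls), first == second)
```

The second call returns the same data without running the slow step again, which is why repeat calls are fast once the local file exists.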

import larry

LARRY_LightningData = larry.LARRY_LightningDataModule()
LARRY_LightningData.prepare_data()

AnnData object with n_obs × n_vars = 130887 × 25289
    obs: 'Library', 'Cell barcode', 'Time point', 'Starting population', 'Cell type annotation', 'Well', 'SPRING-x', 'SPRING-y'
    var: 'gene_name'
    uns: 'dataset', 'h5ad_path'
    obsm: 'X_clone'
Preprocessing performed previously. Loading...done.

Under the hood, LARRY_LightningData calls larry.fetch() and larry.pp.Yeo2021_recipe() and, if task == "fate_prediction", larry.pp.annotate_fate_test_train().

Print the updated adata:

LARRY_LightningData.adata

AnnData object with n_obs × n_vars = 130887 × 25289
    obs: 'Library', 'Cell barcode', 'Time point', 'Starting population', 'Cell type annotation', 'Well', 'SPRING-x', 'SPRING-y', 'cell_idx', 'clone_idx'
    var: 'gene_name', 'highly_variable', 'corr_cell_cycle', 'pass_filter'
    uns: 'dataset', 'h5ad_path', 'highly_variable_genes_idx', 'n_corr_cell_cycle', 'n_hv', 'n_mito', 'n_pass', 'n_total', 'pp_h5ad_path'
    obsm: 'X_clone', 'X_pca', 'X_scaled', 'X_umap'
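The added 'clone_idx' column, together with 'Time point', makes it possible to ask which clones were observed at more than one time point. The sketch below shows that grouping on a handful of invented rows that mimic those obs columns. The values are toy data, not taken from the dataset.

```python
from collections import defaultdict

# Toy rows mimicking three obs columns; values are invented for illustration.
obs = [
    {"cell_idx": 0, "clone_idx": 7, "Time point": 2},
    {"cell_idx": 1, "clone_idx": 7, "Time point": 6},
    {"cell_idx": 2, "clone_idx": 3, "Time point": 4},
    {"cell_idx": 3, "clone_idx": 7, "Time point": 4},
    {"cell_idx": 4, "clone_idx": 9, "Time point": 2},
    {"cell_idx": 5, "clone_idx": 9, "Time point": 2},
]

# Collect the set of time points at which each clone was observed.
timepoints_by_clone = defaultdict(set)
for row in obs:
    timepoints_by_clone[row["clone_idx"]].add(row["Time point"])

# Keep clones seen at more than one time point.
multi_timepoint_clones = sorted(
    clone for clone, tps in timepoints_by_clone.items() if len(tps) > 1
)
print(multi_timepoint_clones)  # [7]
```

On the real object, where adata.obs is a pandas DataFrame, a comparable query would be adata.obs.groupby("clone_idx")["Time point"].nunique(), assuming the columns hold values of the kind shown here.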

Sources

Repositories

References

[1] Weinreb, C., Rodriguez-Fraticelli, A., Camargo, F.D., Klein, A.M. Lineage tracing on transcriptional landscapes links state to fate during differentiation. Science 367, eaaw3381 (2020). https://doi.org/10.1126/science.aaw3381

Please email Michael E. Vinyard (mvinyard@broadinstitute.org) with any questions or interests.
