quickstart tutorial #209

Merged
merged 2 commits into from Oct 27, 2022
11 changes: 7 additions & 4 deletions docs/index.rst
@@ -9,15 +9,18 @@
.. toctree::
:hidden:
:maxdepth: 1
:caption: Singularity Tutorial:
:caption: Quickstart Tutorial:

notebooks/singularity_preprocessing.ipynb
notebooks/gamma_delta.ipynb
notebooks/Q1-singularity-preprocessing.ipynb
notebooks/Q2-analysis.ipynb
notebooks/Q3-singularity-changeo.ipynb
notebooks/Q4-pseudobulk.ipynb
notebooks/Q5-object-prep.ipynb

.. toctree::
:hidden:
:maxdepth: 1
:caption: Tutorial:
:caption: Extended Tutorial:

notebooks/1_dandelion_preprocessing-10x_data.ipynb
notebooks/2_dandelion_filtering-10x_data.ipynb
@@ -4,15 +4,16 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Dandelion preprocessing with Singularity\n",
"# Singularity preprocessing\n",
"\n",
"![dandelion_logo](img/dandelion_logo_illustration.png)\n",
"\n",
"Arguably the greatest strength of the dandelion package is a streamlined preprocessing setup making use of a variety of specialised single cell VDJ algorithms:\n",
"\n",
"- V(D)J gene reannotation with `igblastn` and parsed to AIRR format with [changeo's](https://changeo.readthedocs.io/en/stable/examples/10x.html) `MakeDB.py` [[Gupta2015]](https://academic.oup.com/bioinformatics/article/31/20/3356/195677), with the pipeline strengthened by running `blastn` in parallel and using the best alignments\n",
"- Reassigning heavy chain IG V gene alleles with [TIgGER](https://tigger.readthedocs.io/en/stable/) [[Gadala-Maria15]](https://www.pnas.org/content/112/8/E862)\n",
"- Reassigning IG constant region calls by blasting against a curated set of highly specific C gene sequences\n",
"1. V(D)J gene reannotation with `igblastn` and parsed to AIRR format with [changeo's](https://changeo.readthedocs.io/en/stable/examples/10x.html) `MakeDB.py` [[Gupta2015]](https://academic.oup.com/bioinformatics/article/31/20/3356/195677), with the pipeline strengthened by running `blastn` in parallel and using the best alignments\n",
"2. Reassigning heavy chain IG V gene alleles with [TIgGER](https://tigger.readthedocs.io/en/stable/) [[Gadala-Maria15]](https://www.pnas.org/content/112/8/E862)\n",
"3. Reassigning IG constant region calls by blasting against a curated set of highly specific C gene sequences\n",
"4. Quantifying mutations via SHazaM's [observedMutations](https://shazam.readthedocs.io/en/stable/topics/observedMutations/)\n",
"\n",
"However, running this workflow requires a high number of dependencies and databases, which can be troublesome to set up. As such, we've put together a Singularity container that comes pre-configured with all of the required software and resources, allowing you to run the pre-processing pipeline with a single call and easy installation.\n",
"\n",
@@ -22,29 +23,32 @@
"\n",
" singularity pull library://kt16/default/sc-dandelion:latest\n",
"\n",
"In order to prepare your BCR data for ingestion, create a folder for each sample you'd like to analyse, name it with your sample ID, and store the Cell Ranger `filtered_contig_annotations.csv` and `filtered_contig.fasta` output files inside.\n",
"In order to prepare your BCR data for ingestion, create a folder for each sample you'd like to analyse, name it with your sample ID, and store the Cell Ranger `all_contig_annotations.csv` and `all_contig.fasta` output files inside.\n",
"\n",
" 5841STDY7998693\n",
" ├── filtered_contig_annotations.csv\n",
" └── filtered_contig.fasta\n",
" ├── all_contig_annotations.csv\n",
" └── all_contig.fasta\n",
"\n",
"Please ensure that the only subfolders present in your folder are such per-sample subfolders with the `.csv` and `.fasta` files.\n",
"\n",
"You can then navigate to the directory holding all your sample folders and run Dandelion pre-processing like so:\n",
"\n",
"```bash\n",
"singularity run -B $PWD /path/to/sc-dandelion_latest.sif dandelion-preprocess [optional arguments follow here]\n",
"singularity run -B $PWD /path/to/sc-dandelion_latest.sif dandelion-preprocess\n",
"```\n",
"\n",
"If you're running TR data rather than IG data, specify `--chain TR`.\n",
"Any optional arguments get added at the end of this line.\n",
"\n",
"If you're running TR data rather than IG data, specify `--chain TR` to skip steps 2-4 in the preprocessing. This notably works with TRGD data, which most versions of Cell Ranger struggle to annotate correctly. However, the contigs are still reconstructed, so Dandelion's preprocessing can annotate them for you.\n",
"\n",
"If you wish to process files that have a different prefix than `filtered`, e.g. `all_contig_annotations.csv` and `all_contig.fasta`, provide the desired file prefix with `--file_prefix`. In that case, be sure that your input folder contains those files rather than the filtered ones. You can also provide the `--filter_to_high_confidence` flag to only keep the contigs that Cell Ranger has called as high confidence.\n",
"You can provide the `--filter_to_high_confidence` flag to only keep the contigs that Cell Ranger has called as high confidence. If you wish to process files that have a different prefix than `all`, e.g. `filtered_contig_annotations.csv` and `filtered_contig.fasta`, provide the desired file prefix with `--file_prefix`. In that case, be sure that your input folder contains those files rather than the `all` ones. We use `all` as default as it's possible to subset contigs to relevant ones later.\n",
"\n",
"## Recommended parameterisation\n",
"\n",
"If in possession of gene expression data that the BCR data will be integrated with, the following parameterisation is likely to yield the best results:\n",
"\n",
"```bash\n",
"singularity run -B $PWD /path/to/sc-dandelion_latest.sif dandelion-preprocess \\\n",
" --file_prefix all \\\n",
" --filter_to_high_confidence\n",
"```\n",
"\n",
@@ -54,9 +58,9 @@
"\n",
"By default, this workflow will analyse all provided IG samples jointly with TIgGER to maximise inference power, and in the event of multiple input folders will prepend the sample IDs to the cell barcodes to avoid erroneously merging barcodes overlapping between samples at this stage. TIgGER should be ran on a per-individual level. If running the workflow on multiple individuals' worth of data at once, or wanting to flag the cell barcodes in a non-default manner, information can be provided to the script in the form of a CSV file passed through the `--meta` argument:\n",
"\n",
"- The first row of the CSV needs to be a header identifying the information in the columns, and the first column needs to contain sample IDs.\n",
"- Barcode flagging can be controlled by an optional `prefix`/`suffix` column. The pipeline will then add the specified prefixes/suffixes to the barcodes of the samples. This may be desirable, as corresponding gene expression samples are likely to have different IDs, and providing the matched ID will pre-format the BCR output to match the GEX nomenclature.\n",
"- Individual information for TIgGER can be specified in an optional `individual` column. If specified, TIgGER will be ran for each unique value present in the column, pooling the corresponding samples.\n",
"1. The first row of the CSV needs to be a header identifying the information in the columns, and the first column needs to contain sample IDs.\n",
"2. Barcode flagging can be controlled by an optional `prefix`/`suffix` column. The pipeline will then add the specified prefixes/suffixes to the barcodes of the samples. This may be desirable, as corresponding gene expression samples are likely to have different IDs, and providing the matched ID will pre-format the BCR output to match the GEX nomenclature.\n",
"3. Individual information for TIgGER can be specified in an optional `individual` column. If specified, TIgGER will be ran for each unique value present in the column, pooling the corresponding samples.\n",
"\n",
"It's possible to just pass a prefix/suffix or individual information. An excerpt of a sample CSV file that could be used on input:\n",
"\n",
@@ -69,13 +73,15 @@
" WSSS8090102,WSSS8015043,A40\n",
" [...]\n",
"\n",
"If specifying a metadata file, only subfolders with names provided in the sample column will be processed.\n",
"\n",
"The delimiter between the barcode and the prefix/suffix can be controlled with the `--sep` argument. By default, the workflow will strip out the trailing `\"-1\"` from the Cellranger ouput barcode names; pass `--keep_trailing_hyphen_number` if you don't want to do that. Pass `--clean_output` if you want to remove intermediate files and just keep the primary output. The intermediate files may be useful for more detailed inspection.\n",
"\n",
"## Output\n",
"\n",
"The main file of interest will be `dandelion/filtered_contig_dandelion.tsv`, stored in a new subfolder each sample folder. This is an AIRR formatted export of the corrected contigs, which can be used for downstream analysis by both dandelion itself, and other packages like [scirpy](https://icbi-lab.github.io/scirpy/generated/scirpy.io.read_airr.html) [[Sturm2020]](https://academic.oup.com/bioinformatics/article/36/18/4817/5866543) and changeo [[Gupta2015]](https://academic.oup.com/bioinformatics/article/31/20/3356/195677).\n",
"The main file of interest will be `dandelion/all_contig_dandelion.tsv`, stored in a new subfolder each sample folder. This is an AIRR formatted export of the corrected contigs, which can be used for downstream analysis by both dandelion itself, and other packages like [scirpy](https://icbi-lab.github.io/scirpy/generated/scirpy.io.read_airr.html) [[Sturm2020]](https://academic.oup.com/bioinformatics/article/36/18/4817/5866543) and changeo [[Gupta2015]](https://academic.oup.com/bioinformatics/article/31/20/3356/195677).\n",
"\n",
"The file above features a contig space filtered with immcantation. If this is not of interest to you and you wish to see the full contig space as provided on input, refer to `dandelion/tmp/filtered_contig_iblastn_db-all.tsv`.\n",
"The file above features a contig space filtered with immcantation. If this is not of interest to you and you wish to see the full contig space as provided on input, refer to `dandelion/tmp/all_contig_iblast_db-all.tsv`.\n",
"\n",
"The plots showing the impact of TIgGER are in `<tigger>/<tigger>_reassign_alleles.pdf`, for each TIgGER folder (one per unique individual if using `--meta`, `tigger` otherwise). The impact of C gene reannotation is shown in `dandelion/data/assign_isotype.pdf` for each sample.\n",
"\n",
@@ -106,7 +112,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.10"
"version": "3.8.8"
}
},
"nbformat": 4,
757 changes: 757 additions & 0 deletions docs/notebooks/Q2-analysis.ipynb


48 changes: 48 additions & 0 deletions docs/notebooks/Q3-singularity-changeo.ipynb
@@ -0,0 +1,48 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "artistic-pearl",
"metadata": {},
"source": [
"# Singularity changeo clonotype calling\n",
"\n",
"Dandelion comes with a pair of functions, `ddl.pp.calculate_threshold()` and `ddl.tl.define_clones()`, that come together to run the [changeo clonotype calling pipeline](https://changeo.readthedocs.io/en/stable/examples/cloning.html). However, this can be quite fiddly to properly set up - one part of the process is in R, so it requires an operational rpy2 and appropriate dependencies. Meanwhile the other is a command line script which can be difficult to recognise properly within some virtual environments. In summary, creating a simple singularity pipeline to circumvent the need for annoying setup creates a user-friendly way to access this functionality.\n",
"\n",
"Once in possession of a `.h5ddl` file, a saved form of the Dandelion object, the changeo clonotype calling can be ran like so:\n",
"\n",
"```\n",
"singularity run -B $PWD /path/to/sc-dandelion_latest.sif changeo-clonotypes \\\n",
" --h5ddl vdj.h5ddl\n",
"```\n",
"\n",
"By default, this will save the changeo clones into a new column called `changeo_clone_id`. All Dandelion functions that operate on clone calls can be directed to this column by providing `clone_id=\"changeo_clone_id\"` as an argument. If wishing to save the clones to a different column, provide its desired name via `--key_added`.\n",
"\n",
"If wishing to override the threshold determined by SHazaM, provide it via `--manual_threshold`.\n",
"\n",
"The SHazaM plot and Dandelion object with changeo clones identified are saved to files named after the original input file (with `_shazam.pdf` and `_changeo.h5ddl` appended respectively). These destinations can be controlled via `--plot_file` and `--h5ddl_out`."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "dandelion-tutorial",
"language": "python",
"name": "dandelion-tutorial"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.13"
}
},
"nbformat": 4,
"nbformat_minor": 5
}