- About Xenomnake
1.1. PDX Model
1.2. Spatial Integration
1.3. Pipeline Overview
1.4. Software Dependencies
1.5. Publication - Installation
2.1. Create conda environment and build dependencies - Running Xenomake
3.1. QuickStart
3.1.2 Memory Requirements and Runtimes
3.2. Step1: Xenomake Init
3.2.1. Description
3.2.2. Flags
3.2.3. Command Line Implementation
3.2.4. Output
3.3. Step2: Xenomake Species
3.3.1 Description
3.3.2 Flags
3.3.3. Command Line Implementation
3.3.4. Output
3.4 Step3: Xenomake Config
3.4.1 Decsription
3.4.2 Flags
3.4.3 Command Line Implementation
3.5. Step3: Run Snakemake Pipeline
3.5.1. Snakemake Description
3.5.2. Dry Run
3.5.3. Execute Snakemake Workflow
3.5.4. Snakemake Standard Outs
3.5.5. Xenomake Output Structure - Downstream Cell-Cell Interactions
- QC
5.1. Scanpy QC Plots
5.2. Xengsort Metrics
5.3. STAR Metrics
5.4. UMI Metrics - Test Dataset
6.1. Download Data
6.2. Run Xenomake Pipeline - Optional Downloads
7.1. Dropseq
7.2. Picard
Processing pipeline for Patient Derived Xenograft (PDX) Spatial Transcriptomics Datasets
Patient-derived xenografts are a widely used component of cancer research
These models consist of grafting a human tumor sample into mice to study interactions between a patient's tumor and stroma, screening compounds that target a patient's tumor, and simulating human tumor biology
With the rising popularity of spatially resolved transcriptomic technologies, there is a growing need for processing pipelines that can handle reads from PDX samples. Spatial processing has the potential to analyze tumor/stroma interactions and tumor microenvironments.
To facilitate the adoption of spatial transcriptomics for PDX studies, we thus have developed a pipeline "Xenomake", which is an end-to-end pipeline that includes read alignment and gene quantification steps for reads generated by spatial transcriptomic platforms and uses a xenograft sorting tool (Xengsort) to apportion xenograft reads to the host and graft genomes.
Xenomake's structure is based on Snakemake and includes scripts to handle multi-mapping, downstream analysis, and handling "ambiguous" reads
Xenomake utilizes a number of pre-existing tools for our pipeline and would like to thank works from these groups. We encourage users to read through these tools to understand how our pipeline is built.
- Snakemake: Mölder F, Jablonski KP, Letcher B et al. Sustainable data analysis with Snakemake [version 1; peer review: 1 approved, 1 approved with reservations]. F1000Research 2021, 10:33 (https://doi.org/10.12688/f1000research.29032.1)
- Drop-seq: Broad Institute. Drop-seq Tools. https://github.com/broadinstitute/Drop-seq. 2015
- Xengsort: Zentgraf, J., Rahmann, S. Fast lightweight accurate xenograft sorting. Algorithms Mol Biol 16, 2 (2021). https://doi.org/10.1186/s13015-021-00181-w
- Picard: Broad Institute. Picard Toolkit. https://github.com/broadinstitute/picard. 2019.
Xenomake: a pipeline for processing and sorting xenograft reads from spatial transcriptomic experiments
Benjamin S Strope, Katherine E Pendleton, William Z Bowie, Gloria V Echeverria, Qian Zhu
bioRxiv 2023.09.04.556109; doi: https://doi.org/10.1101/2023.09.04.556109
# Create conda environment with the provided environment file
conda env create -n xenomake -f environment.yaml
# Activate conda environment
conda activate xenomake
#install xenomake
pip install xenomake
#Initialize Xenomake
xenomake init \
-r1 <read1> \
-r2 <read2> \
-s <sample name> \
-o <output directory> \
--spatial_mode <preset spatial mode [visium,dbit-seq,seq-scope] or custom> \
--run_mode <preset run mode [lenient,stringent] or custom> \
# Print Help Function Call
xenomake species --help
# Create species folder
xenomake species \
-mr <mouse_reference_assembly.fa> \
-hr <human_reference_assemble.fa> \
-ma <mouse_annotation.gtf> \
-ha <human_annotation.gtf> \
If default run mode and spatial mode, this step can be skipped
If 'custom' run mode or spatial mode, you must input the following parameters
# Print Help Function Call
xenomake spatial --help
# Create configuration file
xenomake spatial \
--barcode_file <barcode whitelist> \
--umi <umi structure> \
--cell_barcode <cell barcode structure> \
--beads <number of spots/beads> \
--spot_diameter_um <spot diameter> \
--slide_size_um <slide size> \
--ambiguous <re-partition ambiguous reads from xenograft sorting> \
--downstream <perform downstream data processing> \
--genic_only <only count genic reads (i.e., no UTR, intergenic,intronic> \
--mm_reads <if read is mult-imapped, choose most likely position>
# Dry Run
xenomake run -n
# Initialize Xenomake Pipeline (species dir and config.yaml need to be created)
xenomake run \
--cores <n_threads>
This is a REQUIRED step to initialize your implementation of the xenomake pipeline
xenomake init takes information about your reads, run modes, spatial chemistry, and sample/project names to create a configuration file for your project
Run Modes
===========
Run modes refer to how reads are handled during mapping and xenograft sorting portions of the pipeline. Options are the following:
1. lenient: xenograft ambiguous re-mapped (True), multi-mapped re-aligned (True), only genic reads (True) <br>
2. stringent: xenograft ambiguous re-mapped (False), multi-mapped re-aligned (False), only genic reads (True) <br>
3. custom: User Defined
Spatial Modes
===============
Spatial modes refer to the spatial technology used to generate reads
1. visium <br>
2. slide-seq <br>
6. dbit-seq <br>
7. custom: User Defined barcode structure, umi structure, spot size, slide size, barcode file, number of beads/spots
- --read1: paired-end read in fastq format
- --read2: second paired-end read
- --outdir: name of project directory where the output file will be written
- --sample: name of sample to prepend filenames and create sample directory within the project directory
- --spatial_mode: preset run modes based upon spatial method used [custom,visium,slide_seq,hdst_seq,stereo_seq,pixel_seq,dbit_seq]
- --run_mode: preset run modes based upon read processing. This controls multi-mapped, ambiguous, and non-genic read handling [custom, prude, lenient]
- --root_dir: directory of project execution (default: ./)
- --temp_dir: temporary directory (default: /tmp)
- --picard: path to user preferred picard.jar tool (default: tools/picard.jar)
- --dropseq_tools: path to user preferred download of dropseq tools directory (default: tools/Dropseq_2.5.1)
- --download_species: download mm10 and hg38 genomes and annotations to the current directory (default: False)
- --download_index: download star indices and xengosort index to the current directory (default: False)
# Print Help Function Call
xenomake init --help
#Initialize Project
xenomake init \
-r1 <sample_R1.fastq.gz> \
-r2 <sample_R2.fastq.gz> \
-o <project_name> \
-s <sample_name> \
--spatial_mode <spatial_name> \
--run_mode <custom, lenient, spatial>
sample name: A1
spatial chemistry: visium
run mode: lenient
project "downsampled" initialized, proceed to species setup and project execution
This is a REQUIRED step to initialize your implementation of the xenomake pipeline
xenomake species links all inputs in a new directory structure that is then easily referenced in project run
- --mouse_ref: Mouse Reference Assembly in fasta format
- --human_ref: Human Reference Assembly in fasta format
- --mouse_annotation: Mouse Genome Annotation File in .gtf format
- --human_annotation: Human Genome Annotation File in .gtf format
Including optional flags for indices will skip parts in the pipeline such as STAR --runMode CreateIndex and xengsort index
These two rules can be time-consuming and if these files are already generated, it doesn't need to be done again.
- --human_index: STAR index directory for human genome reference
- --mouse_index: STAR index directory for mouse genome reference
- --xengsort_hash: Xengsort hash table filepath to be used in xengsort classify step
- --xengsort_info: Xengosrt info table filepath to be used in xengsort classify step
# Print Help Function Call
xenomake species --help
# Create species folder
xenomake species \
--mouse_reference <mouse_reference_assembly.fa> \
--human_reference <human_reference_assemble.fa> \
--mouse_annotation <mouse_annotation.gtf> \
--human_annotaion <human_annotation.gtf> \
species directory successfully completed, keep project execution in the current directory
├─ species
│ ├── idx.hash
│ ├── idx.info
│ ├── human
│ │ ├── genome.fa
│ │ ├── annotation.gtf
│ │ ├── star_index/
│ ├── mouse
│ │ ├── genome.fa
│ │ ├── annotation.gtf
│ │ ├── star_index/
Only necessary if you selected custom for either run_mode or spatial_mode as a part of xenomake init
- --barcode_file:
- --umi:
- --cell_bc:
- --beads:
- spot_diameter_um:
- slide_size_um:
- downstream:
- mm_reads:
- genic_only:
- --ambiguous:
if spatial_mode: custom
xenomake config \
--barcode_file <path to spot/cell barcode file> \
--umi <umi structure in units of bases> \
--cell_bc <barcode structure in unit of bases> \
--beads <number of spots/beads in spatial array> \
--spot_diameter_um <diameter of spot in um> \
--slide_size_um <diameter of slide in um>
if run_mode : custom
xenomake config \
--downstream <downstream processing (True/False)> \
--mm_reads <count multi-mapped reads (True/False)> \
--genic_only <only count genic reads in final count matrix (True/False)> \
--ambiguous <partition xenograft 'ambiguous' reads back into alignments (True/False)>
Xenomake uses the Snakemake workflow management system to create reproducible and scalable data analyses
To get an understanding of snakemake, please visit: https://snakemake.github.io/
For additional information, read Snakemake's paper https://doi.org/10.12688/f1000research.29032.1
It is recommended to first execute a dry-run with flag --dry-run, Xenomake will only show the execution plan instead of performing steps.
-cores: number of cores to use for project run
--dry-run: Dry-Run showing the workflow
--rerun-incomplete: Force to rerun incompletely generated files if project halted
--keep-going: If a job fails, keep executing independent jobs that don't rely on its output
#Dry Run
xenomake run --cores <n_cores> --dry-run
#Run Project
xenomake run --cores <n_cores> --keep-going
Below shows an example of the messages printed by Snakemake. Briefly, they give an overview of the jobs in line for execution and print messages when a rule is being performed. This includes a few important features:
- Start time
- Finish time
- Rule name
- Input files
- Output file destinations
- Wildcard values
- Additional params and log destinations
Snakemake will also print any standard-out messages from tools performed. We have saved some of these as log files for subsequent review
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 16
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 DGE_Human
1 Expression_QC
...
1 Subset_Overlapped
1 Xengsort_Clasify
1 all
28
Select jobs to execute...
[Fri Sep 8 10:36:04 2023]
rule Preprocess:
input: fastq/sub1.fq, fastq/sub2.fq
output: out/treated/preprocess/unaligned_bc_umi_tagged.bam, out/treated/logs/Preprocess.summary
log: out/treated/logs/Preprocess.log
jobid: 6
wildcards: OUTDIR=out, sample=treated
[Fri Sep 8 10:36:06 2023]
Finished job 6.
1 of 28 steps (4%) done
Select jobs to execute...
...
...
...
[Fri Sep 8 10:47:41 2023]
localrule all:
input: out/treated/final/treated_human_counts.tsv.gz, out/treated/final/treated_mouse_counts.tsv.gz, out/treated/final/treated_human.hdf5, out/treated/final/treated_mouse.hdf5, out/treated/final/treated_human.h5ad, out/treated/final/treated_mouse.hdf5, out/treated/final/treated_mouse.h5ad, out/treated/final/treated_human_processed.h5ad, out/treated/final/treated_mouse_processed.h5ad, out/treated/qc/human_qc.png
jobid: 0
threads: 16
[Fri Sep 8 14:37:01 2023]
Finished job 0.
29 of 29 steps (100%) done
├─ OUTDIR
│ ├── sample
│ │ ├── final.bam: Mapped, Sorted, Xenograft Processed alignments in bam format
│ │ ├── final
│ │ │ ├── tissue_positions_list.csv
│ │ │ ├── {sample}_counts.tsv.gz: Gene by Barcode umi count matrix generated by dropseq DigitalGeneExpression
│ │ │ ├── {sample}_human.h5ad:Spatial Anndata object
│ │ │ ├── {sample}_human.hdf5: 10X genomics formatted output structure (can be used for downstream spatial processing)
│ ├── xengsort: Xenograft sorted fastq files generated using xengsort toolkit
│ │ │ ├── ambiguous.fq.gz
│ │ │ ├── both.fq.gz
│ │ │ ├── graft.fq.gz
│ │ │ ├── host.fq.gz
│ │ │ ├── neither.fq.gz
│ ├── mapping
│ │ │ ├── STAR aligned genomes: Aligned files without Xenograft Processing
│ ├── logs
│ │ │ ├── Tagging and trimming logs
│ │ │ ├── STAR log files
│ │ │ ├── BAM subsetting logs
│ │ │ ├── Xengsort log
│ │ │ ├── Multi-map filtering logs
│ │ │ ├── Digital Gene Expression logs
│ ├── qc
│ │ │ ├── Xengsort summary files
│ │ │ ├── STAR summary files
│ │ │ ├── Dataset qc plots
│ │ │ ├── Digital Gene Expression summary
│ ├── preprocessing
│ │ │ ├── unaligned_bc_umi_tagged.bam: Unaligned bam file. Cell-barcode/UMI tagged and Adapter/PolyA trimmed using dropseq workflows
Using the outputs from Xenomake, we were able to identify group specific expression and differential signalling genes in our model. We provide scripts in the repository CellInteract
to perform identical analyses that are present in the Xenomake Publication
Cell Signalling molecules such as chemokines/cytokines are simultaneously secreted by both stromal cell types and epithelial tumor cells in the tumor microenvironment. It is often difficult to attribute signalling expression to stromal or tumor cells in standard models. However in PDX models processed with means similar to Xenomake, we are able to study the differential production of cellular signals and identify biomarkers in stromal and epithelial compartments
Access further description as well as scripts to perform these downstream analysis:
Link to CelllInteract Repository
: https://github.com/bernard2012/CellInteract
Counts number of reads partitioned to each file
prefix | host | graft | ambiguous | both | neither |
---|---|---|---|---|---|
test | 6652723 | 13654867 | 2603727 | 1297366 | 154218 |
## Running time statistics
- Running time in seconds: 225.0 sec
- Running time in minutes: 3.75 min
- Done. [2023-08-17 16:19:49]
Started job on | Aug 16 18:57:26
Started mapping on | Aug 16 18:58:54
Finished on | Aug 16 19:04:22
Mapping speed, Million of reads per hour | 968.48
Number of input reads | 88239353
Average input read length | 108
UNIQUE READS:
Uniquely mapped reads number | 36755354
Uniquely mapped reads % | 41.65%
Average mapped length | 109.05
Number of splices: Total | 7555478
Number of splices: Annotated (sjdb) | 7421066
Number of splices: GT/AG | 7487922
Number of splices: GC/AG | 23361
Number of splices: AT/AC | 1030
Number of splices: Non-canonical | 43165
Mismatch rate per base, % | 1.75%
Deletion rate per base | 0.04%
Deletion average length | 1.90
Insertion rate per base | 0.04%
Insertion average length | 1.89
MULTI-MAPPING READS:
Number of reads mapped to multiple loci | 10085928
% of reads mapped to multiple loci | 11.43%
Number of reads mapped to too many loci | 768576
% of reads mapped to too many loci | 0.87%
UNMAPPED READS:
Number of reads unmapped: too many mismatches | 0
% of reads unmapped: too many mismatches | 0.00%
Number of reads unmapped: too short | 38624549
% of reads unmapped: too short | 43.77%
Number of reads unmapped: other | 2004946
% of reads unmapped: other | 2.27%
CHIMERIC READS:
Number of chimeric reads | 0
% of chimeric reads | 0.00%
Example of top 4 spots with highest expression
CELL_BARCODE | NUM_GENIC_READS | NUM_TRANSCRIPTS | NUM_GENES |
---|---|---|---|
CAGTTCCGCGGGTCGA | 429279 | 67931 | 9287 |
GAAACTCGTGCGATGC | 398339 | 59154 | 8748 |
AAGAGATGAATCGGTA | 362017 | 57447 | 8723 |
CCAAGAAAGTGGGCGA | 367515 | 54742 | 8306 |
Follow Installation Instructions for Xenomake if not done already
Paired End Fastq Files Downsamples to 8.4 million reads
Taken from: https://genomemedicine.biomedcentral.com/articles/10.1186/s13073-023-01185-4
wget https://zenodo.org/records/10655403/files/sub1.fq.gz
wget https://zenodo.org/records/10655403/files/sub2.fq.gz
xenomake init \
-r1 sub1.fq.gz \
-r2 sub2.fq.gz \
-s downsampled \
-o test_run \
--run_mode lenient \
--spatial_mode visium \
--download_genome True \
--download_index True \
--download_xengsort True
xenomake species \
-hr GRCh38.p14.fna.gz \
-ha GRCh38.p14.gtf \
-mr GRCm39.fna.gz \
-ma GRCm39.gtf \
--human_index human_index/ \
--mouse_index mouse_index/ \
--xengsort_hash idx.hash \
--xengsort_info idx.info
xenomake run --cores 4
Current version of Dropseq-tools is 2.5.3 and is present at <repo_dir>/tools/Dropseq-2.5.3/
For updated versions check:
https://github.com/broadinstitute/Drop-seq
The provided picard.jar file is currently compatible with Java installed in the Xenomake conda environment.
Java: v11.0.15
Picard: v2.27.5
If you choose to use your own Picard executable, ensure compatibility with Java
https://broadinstitute.github.io/picard/