saturation.R

Table of contents:

Introduction
Getting started
Usage examples
- GEX
- VDJ
- ADT
Running parallel jobs with rush

Introduction

Here is an R script saturation.R for estimating sequencing saturation from a GEX, VDJ, or ADT dataset from the 10x Genomics platform.

The script uses the binomial distribution to downsample the reads and estimate a saturation curve. This can be helpful to determine if a sequencing experiment has enough reads.

10xgenomics.com gives us this formula for sequencing saturation:

Sequencing Saturation = 1 - (n_deduped_reads / n_reads)

Here is my illustration of the relationship between sequencing saturation and reads per unique molecular identifier (UMI):

We can compute reads per UMI from the saturation, and vice versa:

d <- data.frame(sat = seq(0, 1, length.out = 1001))
d$rpu <- 1 / (1 - d$sat)

Learn more about sequencing saturation from the 10xgenomics.com documentation:

File formats:

What is the molecule_info.h5 file? (GEX)
What is the all_contig_annotations.csv file? (VDJ)
What is the sample.stat.csv.gz file? (ADT)

Getting started

Install the dependencies:

install.packages(
  c("data.table", "ggplot2", "ggtext", "glue", "optparse", "pbapply", "scales", "stringr", "BiocManager")
)
BiocManager::install("rhdf5")

See output for an example of the output files.

Usage examples

GEX

Rscript saturation.R --out output --file molecule_info.h5

Reading molecule_info.h5
Estimating GEX saturation for 519882 barcodes
  |++++++++++++++++++++++++++++++++++++++++++++++++++| 100%
Writing output/saturation-gex.tsv
Writing output/total_reads-vs-saturation-gex.pdf

VDJ

Rscript saturation.R --out output/tcr --file all_contig_annotations.csv

Reading all_contig_annotations.csv
INFO: Removing '-1' from the end of each barcode
Estimating VDJ saturation for 20233 barcodes
  |++++++++++++++++++++++++++++++++++++++++++++++++++| 100%
Writing output/tcr/saturation-vdj.tsv
Writing output/tcr/total_reads-vs-saturation-vdj.pdf

ADT

Rscript saturation.R --out output --file Batch_1A_ADT.stat.csv.gz

Reading Batch_1A_ADT.stat.csv.gz
Estimating ADT saturation curve for 729372 barcodes
  |++++++++++++++++++++++++++++++++++++++++++++++++++| 100%
Writing output/saturation-adt.tsv
Writing output/total_reads-vs-saturation-adt.pdf
Writing output/saturation-adt-feature.tsv
Writing output/histogram-saturation-adt-feature.pdf

Running parallel jobs with rush

Install rush by Wei Shen:

go install github.com/shenwei356/rush

Then make a list of input files and pass it to rush:

ls /project/cellranger_output/*/{molecule_info.h5,all_contig_annotations.csv,*.stat.csv.gz} > files.txt

# Run 16 jobs in parallel, capture outputs from each job in one file
rush -i files.txt -o rush-saturation.txt -j16 'Rscript saturation.R --out out/{/%} --file {}'

Name		Name	Last commit message	Last commit date
Latest commit History 15 Commits
output		output
LICENSE		LICENSE
README.md		README.md
saturation.R		saturation.R

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

output

output

LICENSE

LICENSE

README.md

README.md

saturation.R

saturation.R

Repository files navigation

saturation.R

Introduction

Getting started

Usage examples

GEX

VDJ

ADT

Running parallel jobs with rush

About

Releases 2

Languages

License

slowkow/saturation

Folders and files

Latest commit

History

Repository files navigation

saturation.R

Introduction

Getting started

Usage examples

GEX

VDJ

ADT

Running parallel jobs with rush

About

Topics

Resources

License

Stars

Watchers

Forks

Languages