Below, we demonstrate how to use sincei to explore scRNA-seq and scATAC-seq data generated with the 10x multiome protocol. The 10x multiome kit allows joint profiling of ATAC-seq and RNA-seq from single cells. Here, we analyse the two data sets separately. We use the dataset published with Persad et al. (2023), which profiles CD34+ cells from human bone marrow.
The raw FASTQ files were downloaded from GEO and processed using the standard 10x Genomics cellranger-arc workflow. Below is the structure of the output directory from the workflow:
<output_dir>/outs:
├── analysis
├── atac_cut_sites.bigwig
├── atac_fragments.tsv.gz
├── atac_fragments.tsv.gz.tbi
├── atac_peak_annotation.tsv
├── atac_peaks.bed
├── atac_possorted_bam.bam
├── atac_possorted_bam.bam.bai
├── cloupe.cloupe
├── filtered_feature_bc_matrix
├── filtered_feature_bc_matrix.h5
├── gex_molecule_info.h5
├── gex_possorted_bam.bam
├── gex_possorted_bam.bam.bai
├── per_barcode_metrics.csv
├── raw_feature_bc_matrix
├── raw_feature_bc_matrix.h5
├── summary.csv
└── web_summary.html
We will use gex_possorted_bam.bam for gene-expression data and
atac_possorted_bam.bam for chromatin accessibility analysis using
sincei. These files can also be produced as part of the cellranger count
workflow for scRNA-seq or scATAC-seq data alone.
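Before starting, it can help to confirm that the BAM files named above, and their indices, are actually present. The following is a minimal sketch of such a check. The function name `check_outs` is hypothetical and not part of sincei or cellranger, and the example path is a placeholder.

```shell
# Hypothetical helper (not part of sincei): verify that a cellranger-arc
# "outs" directory contains the BAM files and indices used in this tutorial.
check_outs() {
    outs_dir=$1
    missing=0
    for f in gex_possorted_bam.bam gex_possorted_bam.bam.bai \
             atac_possorted_bam.bam atac_possorted_bam.bam.bai
    do
        # -s is true only if the file exists and is not empty
        if [ ! -s "${outs_dir}/${f}" ]; then
            echo "missing or empty: ${outs_dir}/${f}" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example call (the path is a placeholder):
# check_outs cellranger_output_rep1/outs && echo "all inputs found"
```

The function returns a non-zero status if any file is missing, so it can be used to stop a script before the longer-running steps begin.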
For convenience, we provide a subset of this data (chromosome 2 only),
which can be downloaded and unpacked as follows:
mkdir 10x_multiome && wget -O 10x_multiome/10x_multiome_testdata.tar.gz https://figshare.com/ndownloader/files/41303289
tar -xvzf 10x_multiome/10x_multiome_testdata.tar.gz ## releases 7 files
Most cell barcodes in droplet-based protocols (like 10x Genomics) do not
contain cells and therefore have very low counts. These must be filtered
out at the beginning of the analysis. Although the cellranger pipeline
already provides a list of filtered barcodes, sincei also allows you to
extract per-barcode count distributions, indicating which barcodes
should be removed. This can be done using the scFilterBarcodes tool.
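barcodes=737K-arc-v1.txt # cellranger-arc barcodes in this case
mkdir -p sincei_output   # make sure the output directory exists
for r in 1 2
do
    bamfile=cellranger_output_rep${r}/outs/atac_possorted_bam.bam
    # --minCount: minimum number of genomic bins with signal per barcode
    # --rankPlot: write the knee plot of barcode counts to this file
    scFilterBarcodes -p 20 -b "${bamfile}" -w "${barcodes}" \
        -o sincei_output/atac_barcodes_rep${r}.tsv \
        --minCount 100 --minMappingQuality 10 --cellTag CB \
        --rankPlot sincei_output/barcode_rankplot_rep${r}.png
done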
The above example uses a whitelist of possible ATAC barcodes from the
cellranger-arc workflow (see the 10x Genomics documentation for more
details). Providing a whitelist is optional in general, but recommended
for 10x Genomics data.
The output file contains a list of filtered barcodes that contain counts
in at least -mc (--minCount) regions of the genome. Unlike other tools
with similar options, sincei splits the genome into 100 kb bins and
reports whether or not a barcode has signal in those bins. This way,
barcodes with high counts that are present in only one genomic bin can
also be filtered out. In most cases, the output is the same as with the
usual approach of filtering by total counts. The -rp (--rankPlot) option
produces the familiar knee plot of the barcode counts.
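To illustrate the difference between filtering by total counts and filtering by the number of bins with signal, here is a minimal sketch on made-up data. It is not sincei's implementation: the toy read table, its column layout and the `min_bins` threshold are all assumptions chosen for illustration.

```shell
# Toy per-read table (hypothetical data): barcode <TAB> chromosome <TAB> position
reads=$(printf '%s\t%s\t%s\n' \
  AAAC chr2 1500 \
  AAAC chr2 150200 \
  AAAC chr2 320000 \
  GGGT chr2 1000 \
  GGGT chr2 1200 \
  GGGT chr2 1900 \
  GGGT chr2 2500)

bin_size=100000   # 100 kb bins, as described in the text
min_bins=2        # hypothetical threshold for this toy example

# For each barcode, count total reads and the number of distinct bins with
# signal, then keep only barcodes that cover at least min_bins bins.
kept=$(printf '%s\n' "$reads" | awk -F'\t' -v bs="$bin_size" -v mb="$min_bins" '
  {
    bin = $2 ":" int($3 / bs)
    if (!(($1, bin) in seen)) { seen[$1, bin] = 1; nbins[$1]++ }
    total[$1]++
  }
  END {
    for (bc in nbins)
      if (nbins[bc] >= mb) print bc "\t" total[bc] "\t" nbins[bc]
  }' | sort)

# Columns: barcode, total reads, number of bins with signal
printf '%s\n' "$kept"
```

In this toy table, barcode GGGT has more reads than AAAC, but all of them fall into a single bin, so it is dropped. AAAC has fewer reads spread over three bins and is kept. A filter on total counts alone would rank the two barcodes the other way round.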
Please follow :doc:`this tutorial <sincei_tutorial_10xATAC>` for further analysis of scATAC-seq samples from the above data.
Please follow :doc:`this tutorial <sincei_tutorial_10xRNA>` for further analysis of scRNA-seq samples from the above data.
Currently, sincei doesn't provide a method for doublet estimation and removal, which is an important step in the analysis of droplet-based data. Instead, we use simpler filters on the minimum and maximum number of detected features per cell, which mitigate this issue to some extent. However, this could lead to some differences in results compared to the published data used here. Despite this difference, the major published cell types can be separated with sincei for both the ATAC and RNA fractions of the data, as shown in the two tutorials above.