### Overview

This demo walks through LongcellPre step by step. It is intended for users who want to use intermediate results from LongcellPre or understand the details of the tool.

If you only want to apply this tool to your data for preprocessing, you can simply run:

In [None]:
library(future)
library(Longcellsrc)
library(LongcellPre)

# update your input path here
fastq = "path of your input fastq or fq.gz"
barcodes = "path of your input cell barcode whitelist"
gtf_path = "path of your gtf annotation"
genome_path = "path of your genome annotation"
minimap_bed_path = "path of your bed annotation for minimap2, can be generated from gtf" # optional
genome_name = "the genome name used for mapping, ex. hg38"
toolkit = "your 10X sequencing toolkit" # placeholder: replace with the toolkit value for your library
work_dir = "The output directory"

# specify the path for those tools
samtools = "samtools"
minimap2 = "minimap2"
bedtools = "bedtools"

RunLongcellPre(fastq = fastq, barcode_path = barcodes, toolkit = toolkit,
               genome_path = genome_path, genome_name = genome_name,
               gtf_path = gtf_path, minimap_bed_path = minimap_bed_path, work_dir = work_dir,
               samtools = samtools, minimap2 = minimap2, bedtools = bedtools,
               cores = 4, strategy = "multicore")
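
Because the pipeline calls external tools, it can help to confirm that they can be found before starting a long run. The following is a minimal sketch using base R's `Sys.which()`; it is not part of LongcellPre, and the variable names `tools`, `resolved`, and `missing_tools` are only illustrative.

```r
# Tool names as passed to RunLongcellPre(); replace with full paths if needed.
tools <- c(samtools = "samtools", minimap2 = "minimap2", bedtools = "bedtools")

# Sys.which() returns an empty string for executables it cannot locate.
resolved <- Sys.which(tools)
missing_tools <- names(tools)[!nzchar(unname(resolved))]

if (length(missing_tools) > 0) {
  warning("Not found on PATH: ", paste(missing_tools, collapse = ", "))
} else {
  message("All external tools were found.")
}
```

If a tool is reported as missing, pass its full path in the corresponding argument instead of the bare name.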

### Step-by-step demo for LongcellPre

#### load necessary libraries and initialization

In [2]:
library(future)
library(Longcellsrc)
library(LongcellPre)

In [None]:
work_dir = "your output directory"
init(work_dir)

#### build annotation

The annotation used by LongcellPre can be generated from an isoform annotation in gtf format. For common species such as human and mouse, you can download the gtf reference from GENCODE: https://www.gencodegenes.org/
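
Before building the annotation, you may want to confirm that your gtf file contains exon records. The sketch below reads a gtf with base R, relying only on the standard nine tab-separated gtf columns. It uses a two-line toy file written to a temporary path (made-up data), and it is independent of how `annotation()` parses the file internally.

```r
# Write a two-line toy gtf (tab-separated, 9 columns) to a temporary file.
gtf_lines <- c(
  paste("chr1", "toy", "gene", 100, 500, ".", "+", ".",
        'gene_id "G1";', sep = "\t"),
  paste("chr1", "toy", "exon", 100, 200, ".", "+", ".",
        'gene_id "G1"; transcript_id "T1";', sep = "\t")
)
toy_gtf <- tempfile(fileext = ".gtf")
writeLines(gtf_lines, toy_gtf)

# Read the file; quote = "" keeps the double quotes inside the attribute column.
gtf <- read.table(toy_gtf, sep = "\t", quote = "", comment.char = "#",
                  stringsAsFactors = FALSE,
                  col.names = c("seqname", "source", "feature", "start", "end",
                                "score", "strand", "frame", "attribute"))

# Keep only exon records, which carry the isoform structure.
exons <- gtf[gtf$feature == "exon", ]
print(exons[, c("seqname", "start", "end", "strand")])
```

To check your own file, replace `toy_gtf` with your gtf path.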

In [13]:
gtf_path = "your gtf path"

In [14]:
refer = annotation(gtf_path = gtf_path, work_dir = work_dir, overwrite = FALSE)

“The annotation already exists. If you want to overwrite it please set overwrite as TRUE!”


The `annotation()` function generates two tables. The first records the locations of the split, non-overlapping exons, and the second records the location of each exon.

In [15]:
gene_bed = refer[[1]]
exon_gtf = refer[[2]]

The first table guides the search for the reads of each gene in the bam file and is required. The second table serves as the canonical isoform annotation and is only needed when you want to map your reads to the canonical isoforms to get the cell-by-isoform matrix. If this table is not provided, isoforms can still be extracted, but they are stored as strings of splicing sites.

When there is no gtf file available as the isoform annotation for the sample, the `annotation()` function can accept a gene bed file that indicates the locations of the targeted genes. LongcellPre will then search for reads of the targeted genes based on the bed file.

#### cell barcode match and reads extraction

This part can be achieved with the function `reads_extract_bc()`. This function contains three main steps:

(1) Trim the adapter sequence and identify the cell barcode and UMI for each read.

(2) Map the trimmed fastq file to the reference genome to get the bam file. Common reference genomes can be downloaded from https://www.10xgenomics.com/cn/support/software/cell-ranger/downloads

(3) Extract the isoform information for each read. The read extraction is guided by the `gene_bed` annotation output from `annotation()`. The information needed from `gene_bed` here is the location of each gene and its strand. If you are only interested in several target genes, you can filter the other genes out of `gene_bed`; reads aligned to those genes will then not be searched.
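
The filtering mentioned in step (3) can be sketched as an ordinary data frame subset. The toy table below is made up, and its column names (`chr`, `start`, `end`, `strand`, `gene`) are assumptions for illustration only; check the actual column names of your `gene_bed` (for example with `colnames(gene_bed)`) before adapting this.

```r
# Toy stand-in for gene_bed; the column names here are hypothetical.
gene_bed_toy <- data.frame(
  chr    = c("chr12", "chr7", "chr17"),
  start  = c(1000, 2000, 3000),
  end    = c(1500, 2600, 3900),
  strand = c("+", "-", "+"),
  gene   = c("GAPDH", "ACTB", "TP53"),
  stringsAsFactors = FALSE
)

# Genes of interest.
target_genes <- c("GAPDH", "ACTB")

# Keep only the rows belonging to the target genes.
gene_bed_target <- gene_bed_toy[gene_bed_toy$gene %in% target_genes, ]
print(gene_bed_target)
```

The filtered table would then be passed as the `gene_bed` argument in place of the full one.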

In [None]:
fastq = "path of your input fastq or fq.gz"
barcodes = "path of your input cell barcode whitelist"
genome_path = "path of your genome annotation"
minimap_bed_path = "path of your bed annotation for minimap2, can be generated from gtf"
genome_name = "the genome name used for mapping, ex. hg38"
toolkit = "your 10X sequencing toolkit" # placeholder: replace with the toolkit value for your library
work_dir = "The output directory"

# specify the path for those tools
samtools = "samtools"
minimap2 = "minimap2"
bedtools = "bedtools"

In [None]:
plan(strategy = "multicore", workers = 4)
reads_bc = reads_extract_bc(fastq_path = fastq, barcode_path = barcodes, gene_bed = gene_bed,
                            adapter = NULL, genome_path = genome_path, genome_name = genome_name,
                            toolkit = toolkit, minimap_bed_path = minimap_bed_path, work_dir = work_dir,
                            minimap2 = minimap2, samtools = samtools, bedtools = bedtools)
bc = reads_bc[[1]]
qual = reads_bc[[2]]

This function returns two data frames. The first records the cell barcode, UMI, isoform, and polyA existence for each read. The second records the distribution of Needleman scores between the adapter found beside each confidently identified cell barcode (edit distance = 0) and the original adapter sequence. The second table is used to evaluate data quality and to guide the filtering of scattered UMI clusters in the UMI deduplication step.

These return values are also saved to files, along with the bam file, in the output directory.

#### UMI deduplication and isoform correction

This part can be achieved with the function `umi_count_corres()`. This function contains three main steps:

(1) Cluster reads with the same or similar UMIs in each cell into groups.

(2) Correct wrong mappings and truncations within each UMI cluster to get a polished isoform representation.

(3) If the canonical isoform annotation is given (the `exon_gtf` from the annotation step), this function tries to map each read to a canonical isoform and builds a cell-by-isoform matrix.
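
To make the idea behind step (1) concrete, the sketch below groups toy UMIs by edit distance using base R only. It is an illustration of clustering similar UMIs, not the algorithm implemented in `umi_count_corres()`; the UMI sequences are made up, and the threshold of one edit is an assumption chosen for the example.

```r
# Toy UMIs observed in one cell for one gene (made-up data).
umis <- c("AACGT", "AACGA", "TTGCA", "TTGCA", "GGGGG")

# Pairwise Levenshtein (edit) distances between UMIs.
d <- as.dist(utils::adist(umis))

# Single-linkage clustering cut at height 1: UMIs connected by a chain of
# single-edit steps end up in the same cluster.
clusters <- stats::cutree(stats::hclust(d, method = "single"), h = 1)

# Number of reads supporting each UMI cluster.
cluster_sizes <- table(clusters)

# Show which UMIs were grouped together.
print(split(umis, clusters))
```

Here the first two UMIs differ by one base and fall into one cluster, the two identical UMIs form a second cluster, and the last UMI stays on its own.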

In [None]:
plan(strategy = "multicore",workers = 4)
umi_count_corres(data = bc, qual = qual, gene_bed = gene_bed, gtf = exon_gtf)

This function returns nothing; it writes the single-cell isoform quantification to the output folder.