Skip to content
Shijia Zhu edited this page May 15, 2022 · 49 revisions

STAR2bSMRT: STARlong and STARshort based Single Molecule Real Time Iso-Seq analysis

Contents

  1. STAR2bSMRT dependent packages
  2. STAR2bSMRT input data
  3. STAR2bSMRT parameter setting
  4. STAR2bSMRT running example

Description

STAR2bSMRT is a novel hybrid sequencing-based alternative splicing identification method, which is specially tailored for long read RNAseq correction.

The long read RNAseq differs from the long read DNAseq from three folds:

  1. as opposed to the primary application of DNAseq, such as the do novo genome assebmly, the RNAseq-based splicing isoform identification often has the pre-built genome reference, which can serve as the great prior knowledge for both read alignment and correction;
  2. for the genome assembly, every base in the DNAseq is indispensable, while only the very small profortion of splicing junction sites in the RNAseq reads act the essential role in the splicing isoform identification;
  3. different from the DNAseq, the RNAseq reads demonstrate variational coverages across the genome in response to different gene expression and alternative splicing.

These difference motivates the proposal of the novel long read RNAseq-specific correction method, STAR2bSMRT. Given the matched hybrid-sequencing, STAR2bSMRT first aligns both long and short reads to the genome, obtaining the approximate splicing junctions; next, differing from corretion of the whole long read suquence, STAR2bSMRT only corrects those long read junction sites, via maximizing the correlation between the long and short read junctions.

STAR2bSMRT has the following contributions:

  1. in addition to short reads, it also incorporates prior knowledge to perform long read correction: genome reference, and annotated junction sites (gtf file);
  2. it only corrects junction sites, largely saving efforts from correcting the whole sequence;
  3. it converts the question of sequence correction into that of statistical optimization, also enabling us to automatically select the best parameters. Two parameters are crucial during the optimization: thresDis (the tolerance distance between short read and long read-derived junction sites), and thresSR (the number of short reads which support the splicing junction sites).

1. Load STAR2bSMRT dependent tools and R packages

STAR2bSMRT depends on STARlong, STARshort, samtools, bedtools, and kallisto (optional).

STARlong_PATH="/STARlong/STAR/source/"
STARshort_PATH="/STARshort/STAR/source/"
samtools_PATH="/samtools/bin/"
bedtools_PATH="/BEDTools/bin/"
kallisto_PATH="/kallisto/bin/"
export PATH=$PATH:$STARlong_PATH:$STARshort_PATH:$samtools_PATH:$bedtools_PATH:$kallisto_PATH

Build genome indexes for both STARlong and STARshort, details see STAR :

STAR \
--runThreadN NumberOfThreads \
--runMode genomeGenerate \
--genomeDir /path/to/genomeDir \
--genomeFastaFiles /path/to/genome/fasta \
--sjdbGTFfile /path/to/annotations.gtf

Open R console, and import the STAR2bSMRT dependent R packages.

library( Biostrings )
library( foreach )
library( doMC )
library( STAR2bSMRT )

Back to above

2. STAR2bSMRT input data

Three types of output by SMRTportal (one or combinations) could be used for STAR2bSMRT in either fasta or fastq format:

  • phqv the polished high-quality isoforms (phqv.fasta) which have >= 99% accuracy
  • flnc the full-length non-chimeric reads which generally have 90-100% accuracy and high-quality isoforms
  • nfl the non-full-length reads, which can increase the power of correlation between long-read and short-read junction sites

Back to above

3. STAR2bSMRT parameter setting

STAR2bSMRT aims to correct the long-read junction sites, via maximizing the correlation between the short-read junction site and long-read junction site. Two parameters are crucial during the optimization: thresSR (the number of short reads which support the splicing junction sites) and thresDis (the tolerance distance between short read and long read-derived junction sites). The long read junction would be corrected to the short read junction, if the distance between the long and short read junctions is shorter than the defined thresDis, and that short read junction is supported by more short reads than the defined thresSR. STAR2bSMRT does grid searching for the best settings for those two parameters by maximizing the correlation between short-read and long-read junctions. More parameters are listed as followed:

  1. genomeDir character value indicating the directory of STAR genome index for both STARlong and STARshort read mapping
  2. genomeFasta character value indicating the fasta file of genome reference
  3. phqv character value indicating the Isoseq polished high QV trascripts in fasta/fastq, where read counts for each transcript consensus should be saved in transcript names
  4. flnc character value indicating the Isoseq full-length non-chimeric reads in fasta/fastq format
  5. nfl character value indicating the Isoseq non-full-length reads in fasta/fastq format
  6. SR1 character value indicating the short read file in fastq format: single-end or paired-end R1
  7. SR2 character value indicating the short read file in fastq format: paired-end R2
  8. useSJout boolean value indicating whether to use the STARshort generated SJ.out.tab for splicing junction. If FALSE, STAR2bSMRT infer the splicing junction from bam files. By default, FALSE.
  9. adjustNCjunc boolean value indicating whether to minimize the non-canonical junction sites.
  10. thresSR a vector of integers indicating the searching range for the number of short reads which support the splicing junction sites.
  11. thresDis a vector of integers indicating the searching range for the tolerance distance between short read-derived splicing junction and long read-derived junction.
  12. outputDir character value indicating the direcotry where results are saved.
  13. fixedMatchedLS boolean value indicating how often the distance is calculate betwen long read and short read-derived junction sites. If TRUE, only calculated once at the very beginning, which may save running time; otherwise, calculate repeatly after every long read correction. By default, FALSE.
  14. fuzzyMatch integer value indicating the distance for fuzzyMatch
  15. chrom character value indicating the chromosome of interest. By default, STAR2bSMRT works on the whole genome.
  16. s integeter value indicating the start position of the transcript of interest. This is useful for target Isoseq sequencing.
  17. e integeter value indicating the end position of the transcript of interest. This is useful for target Isoseq sequencing.
  18. cores integer value indicating the number of cores for parallel computing

Back to above

4. Running example

The testing datasets could be downloaded from the GEO GSE137127 (NRXN1-alpha target short read Illumina RNAseq), and SRA PRJNA563972 (NRXN1-alpha targeted long read Isoseq).

Followed is one running example:

########################################################################
######## set required information for STAR2bSMRT
########################################################################
genomeDir="/hg19/genomeDir"             # STAR mapper genome index
genomeFasta = "/hg19/hg19.fa"           # genome reference fasta file  
chrom = "chr2"                          # NRXN1-alpha located chromosome
s = 50147488                            # 
e = 51259537                            # NRXN1-alpha rough genomic ranges (s, e), to cover reads falling in
thresSR=c(1:100)                        # searching range of short read support 
thresDis=c(0:30)                        # searching range of distance of short-read and long-read junctions
adjustNCjunc=TRUE                       # to minimize the number of non-canonical junction sites
fixedMatchedLS=FALSE                    # repeatedly update 
useSJout=FALSE                          # 
fuzzyMatch=100                          # allow fuzzy matching
outputDir="/NRXN1_alpha/Cont1_2607"     # the output saving directory 
system( paste("mkdir -p", outputDir) )

########################################################################
# The long read and short read input
########################################################################
# the polished high-quality isoforms used as input
LRphqv="/Isoseq/2607/polished_high_qv_consensus_isoforms.fasta"

# the paired-end short reads as input 
SR1="/Illumina/2607/2607.R1.fastq.gz"
SR2="/Illumina/2607/2607.R2.fastq.gz"

########################################################################
# run STAR2bSMRT_NRXN, specially designed for 
# NRXN1-alpha splicing isoform detection
########################################################################
STAR2bSMRT_NRXN( genomeDir=genomeDir, genomeFasta=genomeFasta, LRphqv=LRphqv, LRflnc=NULL, LRnfl=NULL,
                 SR1=SR1, SR2=SR2, useSJout=useSJout,  adjustNCjunc=adjustNCjunc, 
                 thresSR=thresSR, thresDis=thresDis, outputDir=outputDir, 
                 fixedMatchedLS=fixedMatchedLS, fuzzyMatch=fuzzyMatch, 
                 chrom=chrom , s=s , e=e , cores=10 )
    

Back to above