-
Notifications
You must be signed in to change notification settings - Fork 1
Home
- STAR2bSMRT dependent packages
- STAR2bSMRT input data
- STAR2bSMRT parameter setting
- STAR2bSMRT running example
STAR2bSMRT is a novel hybrid sequencing-based alternative splicing identification method, which is specially tailored for long read RNAseq correction.
The long read RNAseq differs from the long read DNAseq from three folds:
- as opposed to the primary application of DNAseq, such as the do novo genome assebmly, the RNAseq-based splicing isoform identification often has the pre-built genome reference, which can serve as the great prior knowledge for both read alignment and correction;
- for the genome assembly, every base in the DNAseq is indispensable, while only the very small profortion of splicing junction sites in the RNAseq reads act the essential role in the splicing isoform identification;
- different from the DNAseq, the RNAseq reads demonstrate variational coverages across the genome in response to different gene expression and alternative splicing.
These difference motivates the proposal of the novel long read RNAseq-specific correction method, STAR2bSMRT. Given the matched hybrid-sequencing, STAR2bSMRT first aligns both long and short reads to the genome, obtaining the approximate splicing junctions; next, differing from corretion of the whole long read suquence, STAR2bSMRT only corrects those long read junction sites, via maximizing the correlation between the long and short read junctions.
STAR2bSMRT has the following contributions:
- in addition to short reads, it also incorporates prior knowledge to perform long read correction: genome reference, and annotated junction sites (gtf file);
- it only corrects junction sites, largely saving efforts from correcting the whole sequence;
- it converts the question of sequence correction into that of statistical optimization, also enabling us to automatically select the best parameters. Two parameters are crucial during the optimization: thresDis (the tolerance distance between short read and long read-derived junction sites), and thresSR (the number of short reads which support the splicing junction sites).
STAR2bSMRT depends on STARlong, STARshort, samtools, bedtools, and kallisto (optional).
STARlong_PATH="/STARlong/STAR/source/"
STARshort_PATH="/STARshort/STAR/source/"
samtools_PATH="/samtools/bin/"
bedtools_PATH="/BEDTools/bin/"
kallisto_PATH="/kallisto/bin/"
export PATH=$PATH:$STARlong_PATH:$STARshort_PATH:$samtools_PATH:$bedtools_PATH:$kallisto_PATH
Build genome indexes for both STARlong and STARshort, details see STAR :
STAR \
--runThreadN NumberOfThreads \
--runMode genomeGenerate \
--genomeDir /path/to/genomeDir \
--genomeFastaFiles /path/to/genome/fasta \
--sjdbGTFfile /path/to/annotations.gtf
Open R console, and import the STAR2bSMRT dependent R packages.
library( Biostrings )
library( foreach )
library( doMC )
library( STAR2bSMRT )
Three types of output by SMRTportal (one or combinations) could be used for STAR2bSMRT in either fasta or fastq format:
-
phqv
the polished high-quality isoforms (phqv.fasta) which have >= 99% accuracy -
flnc
the full-length non-chimeric reads which generally have 90-100% accuracy and high-quality isoforms -
nfl
the non-full-length reads, which can increase the power of correlation between long-read and short-read junction sites
STAR2bSMRT aims to correct the long-read junction sites, via maximizing the correlation between the short-read junction site and long-read junction site. Two parameters are crucial during the optimization: thresSR
(the number of short reads which support the splicing junction sites) and thresDis
(the tolerance distance between short read and long read-derived junction sites). The long read junction would be corrected to the short read junction, if the distance between the long and short read junctions is shorter than the defined thresDis
, and that short read junction is supported by more short reads than the defined thresSR
. STAR2bSMRT does grid searching for the best settings for those two parameters by maximizing the correlation between short-read and long-read junctions. More parameters are listed as followed:
-
genomeDir
character value indicating the directory of STAR genome index for both STARlong and STARshort read mapping -
genomeFasta
character value indicating the fasta file of genome reference -
phqv
character value indicating the Isoseq polished high QV trascripts in fasta/fastq, where read counts for each transcript consensus should be saved in transcript names -
flnc
character value indicating the Isoseq full-length non-chimeric reads in fasta/fastq format -
nfl
character value indicating the Isoseq non-full-length reads in fasta/fastq format -
SR1
character value indicating the short read file in fastq format: single-end or paired-end R1 -
SR2
character value indicating the short read file in fastq format: paired-end R2 -
useSJout
boolean value indicating whether to use the STARshort generated SJ.out.tab for splicing junction. If FALSE, STAR2bSMRT infer the splicing junction from bam files. By default, FALSE. -
adjustNCjunc
boolean value indicating whether to minimize the non-canonical junction sites. -
thresSR
a vector of integers indicating the searching range for the number of short reads which support the splicing junction sites. -
thresDis
a vector of integers indicating the searching range for the tolerance distance between short read-derived splicing junction and long read-derived junction. -
outputDir
character value indicating the direcotry where results are saved. -
fixedMatchedLS
boolean value indicating how often the distance is calculate betwen long read and short read-derived junction sites. If TRUE, only calculated once at the very beginning, which may save running time; otherwise, calculate repeatly after every long read correction. By default, FALSE. -
fuzzyMatch
integer value indicating the distance for fuzzyMatch -
chrom
character value indicating the chromosome of interest. By default, STAR2bSMRT works on the whole genome. -
s
integeter value indicating the start position of the transcript of interest. This is useful for target Isoseq sequencing. -
e
integeter value indicating the end position of the transcript of interest. This is useful for target Isoseq sequencing. -
cores
integer value indicating the number of cores for parallel computing
The testing datasets could be downloaded from the GEO GSE137127 (NRXN1-alpha target short read Illumina RNAseq), and SRA PRJNA563972 (NRXN1-alpha targeted long read Isoseq).
Followed is one running example:
########################################################################
######## set required information for STAR2bSMRT
########################################################################
genomeDir="/hg19/genomeDir" # STAR mapper genome index
genomeFasta = "/hg19/hg19.fa" # genome reference fasta file
chrom = "chr2" # NRXN1-alpha located chromosome
s = 50147488 #
e = 51259537 # NRXN1-alpha rough genomic ranges (s, e), to cover reads falling in
thresSR=c(1:100) # searching range of short read support
thresDis=c(0:30) # searching range of distance of short-read and long-read junctions
adjustNCjunc=TRUE # to minimize the number of non-canonical junction sites
fixedMatchedLS=FALSE # repeatedly update
useSJout=FALSE #
fuzzyMatch=100 # allow fuzzy matching
outputDir="/NRXN1_alpha/Cont1_2607" # the output saving directory
system( paste("mkdir -p", outputDir) )
########################################################################
# The long read and short read input
########################################################################
# the polished high-quality isoforms used as input
LRphqv="/Isoseq/2607/polished_high_qv_consensus_isoforms.fasta"
# the paired-end short reads as input
SR1="/Illumina/2607/2607.R1.fastq.gz"
SR2="/Illumina/2607/2607.R2.fastq.gz"
########################################################################
# run STAR2bSMRT_NRXN, specially designed for
# NRXN1-alpha splicing isoform detection
########################################################################
STAR2bSMRT_NRXN( genomeDir=genomeDir, genomeFasta=genomeFasta, LRphqv=LRphqv, LRflnc=NULL, LRnfl=NULL,
SR1=SR1, SR2=SR2, useSJout=useSJout, adjustNCjunc=adjustNCjunc,
thresSR=thresSR, thresDis=thresDis, outputDir=outputDir,
fixedMatchedLS=fixedMatchedLS, fuzzyMatch=fuzzyMatch,
chrom=chrom , s=s , e=e , cores=10 )