Tool Name	Version	Description
SMRT Analysis	v2.3.0	The analysis toolkit for RSII Iso-seq data.
SMRT Link	v5.1.0	The analysis toolkit for Sequel Iso-seq data.
R	3.3.2	The R language program.
Perl	5.24	The Perl language program.
Python	2.7	The Python language program.
cDNA-primer	1.1.6	The wrapper of HMMer for get full-length reads.
HMMer	3.1b2	The tool to identify primer and polyA sequences.
seqtk	1.2-r94	The toolkit to manipulate fastq/fasta files.
GMAP	2017-11-15	The mapper for aligning CCS reads onto reference genome.
SAMTools	1.5	The toolkit to manipulate BAM files.
BEDTools	2-2.26.0	The utilities to manipulate BED files.

The given version is just to suggest you to use this version, but not to prohibit you from using newer version, although we haven’t tested the newer ones and some unknown error may occur.

Needed R packages: ggplot2, EMT, etc.

The needed in-house scripts (fqNameMapping.pl, fqSeFilter.pl, etc.) are packaged in the source release.

III. Tutorial

3.1 Setup Environment

Assume that the pipeline is put at ~/bin/isoSeq.

This pipeline is composed of perl, python, bash and R scripts as well as pre-compiled C binaries, so no need to build/compile them. The only thing needed to do is making them executable:

cd ~/bin/isoSeq
chmod u+x *.{pl,py,sh,R} skyjoin faFilter faCount

Add the path of the pipeline as the prefix of the environment variable $PATH, so that all scripts can be runed as executable command directly.

PATH=~/bin/isoSeq:$PATH

After doing this, the correct $PATH may look like:

echo $PATH
~/bin/isoSeq:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin

Test whether the pipeline is accessible:

isoSeq.sh

If pipeline hele message is shown, the environment is ready, otherwise if the below error is shown:

bash: isoSeq.sh: command not found...

check whether the pipeline scripts are executable and their path is in the $PATH variable.

3.2 Prepare Configuration File

The format and description of the configuration file is explained in the help message of isoSeq.sh. It (e.g. sample.conf) is a text file with many lines, each of which is TSV (tab-separated value) or space-separated values. One line is the complete information for one sample. Columns in one line is:

Column	Description
SampleName	The sample name. Results of the sample will be put under a directory named as the given SampleName (under --outDir)
InputDirectory	The input directory of Iso-seq raw data. The directory is composed of cell directories. For Sequel, each cell directory is composed of files like .subreads.bam, .subreadset.xml, etc. For RSII, each cell directory is composed of .h5 files.
Primer	The primer-pair sequences file in FASTA format. The forward primer must be named as F0, F1, F2, …, etc. Likely the reverse primer must be R0, R1, R2, …, etc. F0 and R0 is a pair, F1 and R1 is a pair, and so on. A example file can be: >F0 AAGCAGTGGTATCAACGCAGAGTACATGGGG >R0 GTACTCTGCGTTGATACCACTGCTT >F1 TCGTCGGGGACAACTTTGTACAAAAAAGTTGGATGGGG >R1 CCCAACTTTCTTGTACAAAGTTGTCCCC
Junction (Optional)	If the corresponding RNA-seq junction information is available for the sample, this information in .bed format (output of RNA-seq mapping, like TopHat) can be provided. If so, the following extra analysis will be done: 1. Remove mis-aligned exon guiding by RNA-seq junction 2. Fill missing exon guiding by RNA-seq junction 3. Make consensus splicing site guiding by RNA-seq junction 4. Statistics of PacBio junction supporting by RNA-seq junction 5. Identify SE, A5SS and A3SS with the help of RNA-seq junction 6. Comparison of SE, A5SS and A3SS identified with PacBio and RNA-seq 7. Comparison of PSI of SE, A5SS and A3SS identified with PacBio and RNA-seq 8. Independ combination between alternative splicing events and APA

An example of configuration file:

#SampleName	InputDirectory	Primer	Junction(Optional)
sample1	sample1_data	primer1.fa	junctions.bed
sample2	sample2_data	primer2.fa

The line prefixed with a ‘#’ isn’t necessary and will be skipped automatically by the pipeline.

3.3 Prepare Resource Data

Some resource (data) can be specified in env.conf file in the source. Needed resource files include:

Variable	Default Value
resource directory	The resource directory according to where you put your resource files. It will be used in below variable.
gmapDir	$resource/gmapdb/20180109
genomeHg19	$resource/fna/hg19/all.fa
chrSizeHg19	$resource/chrSize/hg19.tsv
twoBitHg19	$resource/blat/hg19/softMasked.2bit
hg19RefGeneGpe	$resource/geneModel/hg19/refGene.gpe
gapHg19	$resource/gap/hg19.bed4+

<gmapDir> can be downloaded from GMAP website or generated by gmap_build command in GMAP.

genomeHg19 is a fasta file with sequences of all chromosomes. It can be downloaded from UCSC (http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/chromFa.tar.gz).

chrSizeHg19 is a TSV file with chr name and its length. It can be downloaded from UCSC (http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.chrom.sizes).

twoBitHg19 a compressed binary file storing genome sequence. It can be downloaded from UCSC (http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.2bit).

hg19RefGeneGpe is GPE (Gene Prediction Extension) file describing gene model. It can be downloaded from UCSC (http://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/refGene.txt.gz).

gapHg19 is a BED file describing genome gap. It can be downloaded from UCSC (http://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/gap.txt.gz).

These variables can be overwrote by command option of isoSeq.sh, for example:

isoSeq.sh --genomeSeq /path/to/your/genome/sequence

will overwrite the “$resource/fna/hg19/all.fa”.

3.4 Start to Run the Pipeline

A complete command to run the pipeline may be:

isoSeq.sh --gmapDir gmapDir/hg19 \
    --gmapDB hg19 \
    --genomeSeq hg19.fasta \
    --gpe hg19.refGene.gpe \
    --gap hg19.gap.bed \
    --twoBit hg19.2bit \
    --chrSize hg19.sizes \
    --outDir . \
    --thread 10 \
    sample.conf \
    >isoSeq.log 2>isoSeq.err

GMAP DB can be prepared by:

gmap_build --db=hg19 --circular=chrM hg19.fasta

If the variables in env.conf are specified correctly, you can simply run the analysis as:

isoSeq.sh --thread 10 sample.conf >isoSeq.log 2>isoSeq.err

IV. Get help

You can send the author zhangsjsky@foxmail.com any information about this analysis pipeline, like bug reporting, performance improvement suggestion.

Name		Name	Last commit message	Last commit date
Latest commit History 51 Commits
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Table of Contents

I. Introduction

II. Prerequisites

III. Tutorial

3.1 Setup Environment

3.2 Prepare Configuration File

3.3 Prepare Resource Data

3.4 Start to Run the Pipeline

IV. Get help

About

Releases

Packages

License

zhangsjsky/IsoSeq

Folders and files

Latest commit

History

Repository files navigation

Table of Contents

I. Introduction

II. Prerequisites

III. Tutorial

3.1 Setup Environment

3.2 Prepare Configuration File

3.3 Prepare Resource Data

3.4 Start to Run the Pipeline

IV. Get help

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Packages