Sailfish-cir is a computational tool to estimate the relative abundance of circular RNA transcripts from high-throughput RNA-seq data.
It accepts output of circRNA identification tools (CIRI, KNIFE, circRNA_finder) or a BED-format file to specify the reference set of circular RNA transcripts in RNA-seq data. Then, it transforms all circular transcripts to pseudo-linear transcripts. Finally, it estimates the expression of both linear and circular transcripts using Sailfish framework.
The following three tools should be installed before running Sailfish-cir.
1. Sailfish Sailfish (http://www.cs.cmu.edu/~ckingsf/software/sailfish/) is used to quantify RNA expression. Sailfish 0.7.0 or above is recommended. 2. gffread gffread is part of Cufflinks (http://cole-trapnell-lab.github.io/cufflinks/). This GFF/GTF parsing utility is used to extract sequence. 3. gffutils gffutils (https://pypi.python.org/pypi/gffutils) is used to build a database for reference set of all RNA transcripts.
We assume these pre-installed tools are running under a Unix-like environment, and paths of gffread and sailfish binary executive should be added to your
Other than above third party tools, several data files are required for Sailfish-cir as well.
1. Raw sequencing reads file in fasta/fastq format; 2. A CIRI output file or a BED file; 3. Gene annotation file in .gtf format, and a single fasta file or multi-fasta files containing RNA sequences.
python path/to/your/sailfish_cir.py -g path/to/your/genomic/sequence/foo.fa -a path/to/your/annotation/bar.gtf -1 path/to/your/reads/mate1.fastq -2 /path/to/your/reads/mate2.fastq -o /path/to/where/you/want/your/result -c /path/to/your/CIRI/output/file
-g path to genomic sequence fasta file -a path to gene annotation file, ie, .gtf or .gff files -r path to single-end raw sequencing reads file. -1 path to raw pair-end reads, mate 1 -2 path to raw pair-end reads, mate 2 -c path to CIRI output file to specify circular RNA --bed path to bed file which contains circular RNA. --circRNA_finder path to circRNA_finder output file. --KNIFE_report_folder path to KNIFE output folder, make sure it has the subdirectory "circReads" -o output folder that contains the index built by sailfish and quantification results -k minimum match size used during sailfish quantification, default is 21 --libtype format string describing the library type of your reads. default is "IU", [read more on libtype of Sailfish](http://sailfish.readthedocs.org/en/master/library_type.html) --mll mean library length, this option is to fix up the effective length. -h/--help print this help message
Sailfish-cir expression estimats can be found at a subdirectory named
quant_circular under the path set by
In order to generate reference sequences of circular RNA transcripts, a database of your genomic annotation is needed. It is recommended to create the database file manually using the same filename with extension ".db" in the same directory. Since gffutils don't guess GTF file format, and GTF format was changed after GRCH37.75. It will be time-consuming when building the database using the same option for earlier gtf version.
output file from circRNA identification tools
This script accepts BED-format file or output files of circRNA identification tools (CIRI, circRNA_finder, KNIFE are supported) to specify the circular RNA transcripts.
We also provide a small utility to convert those outputs to a BED-format file. usage as follows:
python path/to/your/sailfish_cir.py convert_CIRI your/CIRI/output/file python path/to/your/sailfish_cir.py convert_KNIFE your/KNIFE/output/folder
For CIRI, this will create a .bed file in the same folder. the 'name' (4th) column follows this "circular_transcript@host_gene" pattern. For KNIFE, this will create a "summarized_knife_junction.bed" under your KNIFE output folder. circRNA_finder output is actually BED-format.
If you want to quantify circRNA from multiple circRNA detection results, convert those output files into BED format and merge them into a single .bed file, then use "--bed" option.
- BED-file format support: use '--bed' option to input a bed file as the reference set of circular RNA transcripts
- Convert utility: convert CIRI outputs into a BED-file format
- Issue of effective length correction: fix the effective length correction when estimating circular RNA expression values in real rRNA-depleted RNA-seq data. Please use '--mll' to specify the estimated effective fragment length of RNA-seq data.
- Add some filters to unclutter the GTF entries.
- the first version. Snapshot with documents is here.
Example files are available in MEGA
Genome sequence file: hg19 fasta file
Sailfish (ver 0.9.0): Sailfish Linux binary
RNA-seq simulation scripts simulation
Simulated RNA-seq datasets 11 simulated datasets (uniform expresion distribution)
Musheng Li, Xueying Xie, Jing Zhou, Mengying Sheng, Xiaofeng Yin, Eun-A Ko, Tong Zhou* and Wanjun Gu*. Quantifying circular RNA expression from RNA-seq data using model-based framework. Bioinformatics, 2017, in press.
Musheng Li (email@example.com)