Clone this wiki locally
Trinity: Frequently Asked Questions
- There are too many transcripts! What do I do?
- What computing resources are required?
- How long should this take?
- How do I use reads I downloaded from SRA?
- How can I run this in parallel on a computing grid?
- How do I identify the specific reads that were incorporated into the transcript assemblies?
- How do I combine multiple libraries in a single Trinity run? Or how do I combine paired and single reads?
- Trinity process died due to 'std::bad_alloc'
- Butterfly fails with java Error: Cannot create GC thread. Out of system resources.
- Can I combine strand-specific and non-strand-specific reads?
There are too many transcripts! What do I do?
Welcome to RNA-Seq de novo assembly! :) There are several aspects of de novo assembled transcripts to be aware of:
Lots of transcripts is the rule rather than the exception.
Most of the transcripts are very lowly expressed, and the deeper you sequence and the more complex your genome, the larger the number of lowly expressed transcripts you will be able to assemble. Biological relevance of the lowly expressed transcripts could be questionable - some are bound to be very relevant.
There's really no good reason to immediately filter them out. They can be 'passengers' throughout all of your data analyses, and if any of them are important, they'll ideally surface in the relevant study. You can put them all through Trinotate.github.io for annotation/analysis, and you can put them through DE studies just fine (those with insufficent reads will get directly filtered out during the DE analysis protocols to avoid problems associated with multiple hypothesis testing); If the read counts are few or lacking, they simply won't surface as significant DE entries, but if there's protein homology or other interesting features, you'll want to continue to capture this info - hence don't feel the need to immediately filter!
If you want to guestimate 'how many expressed genes/transcripts do I really have?', then examine counts of transcripts or genes vs. min expression thresholds. This will enable you to count the number that reflect the majority of the reads, excluding counting those that have little read support. The entries with >= ~1 fpkm or tpm tend to be heavily enriched for transcripts that correspond to what we typically think of as 'genes' in the pre-RNA-Seq era, and what typically are awarded annotation status in genome annotation.
Related to the above point, consider computing the contig Ex90N50 value (contig N50 computed based on transcript contigs representing the top 90% of expressed transcripts), in addition to the Ex90 count of transcripts (the number of transcripts that represent this top 90% of expression data). You'll find these numbers to be more palatable as compared to the hundreds of thousands (or millions) of transcript contigs being reconstructed - which we well know are biased towards any lowly expressed transcripts that can be reconstructed.
If for some reason you do want to filter out those transcripts that have little read support, you can filter them based on the RSEM abundance estimates.
If you're concerned about having too many isoforms for a given gene, you can use the filtering method above to remove isoforms that are weakly expressed compared to more dominant isoforms. Also, you can simply perform DE analysis at the 'gene' level in addition to the 'isoform' level, so as to increase your power for DE analysis. Other methods such as Corset can also be used to regroup relevant transcripts into 'gene' clusters. Since the 'gene' aggregates the expression data from all of the isoforms, problems with having too many isoforms quickly dissipate.
If you have compelling reasons for filtering, please share them with us on our Trinity google forum, and we'll update our documentation accordingly!
What computing resources are required?
Ideally, you will have access to a large-memory server, roughly having ~1G of RAM per 1M reads to be assembled. The memory usage mostly depends on the complexity of the RNA-Seq data set, specifically on the number of unique k-mers. If you do not have access to a high-memory server, other freely available options are available.
How long should this take?
Trinity run-time depends on a number of factors including the number of reads to be assembled and the complexity of the transcript graphs. The assembly from start to finish can take anywhere from ~1/2 hour to 1 hour per million reads (your mileage may vary). If you have a large data set, be sure to use the --normalize_reads parameter to greatly improve run times.
How do I use reads I downloaded from SRA?
RNA-Seq data downloaded from SRA tends to exist in a .sra file that needs to be converted to fastq file format. This can be done using the SRA Toolkit like so:
SRA_TOOLKIT/fastq-dump --defline-seq '@$sn[_$rn]/$ri' --split-files file.sra
How can I run this in parallel on a computing grid?
The initial Inchworm and Chrysalis steps currently need to be run on a single server, however parts of Chrysalis and all of Butterfly can be run as naive parallel processes on a computing grid. Both SGE and LSF are currently supported. Configuring Trinity for leveraging these grid-computing platforms is described here.
How do I identify the specific reads that were incorporated into the transcript assemblies?
Currently, the mappings of reads to transcripts are not reported. To obtain this information, we recommend realigning the reads to the assembled transcripts Bowtie. Capturing both properly paired read alignments and single unpaired alignments can be captured and quantified as described here.
How do I combine multiple libraries in a single Trinity run? Or, how do I combine paired and single reads?
If you have RNA-Seq data from multiple libraries and you want to run them all through Trinity in a single pass, simply combine all your left.fq files into one left.fq file, and combine all right.fq files into one right.fq file. Then run Trinity using these separately concatenated left and right input files.
If you have additional singletons, add them to the .fq file that they correspond to based on the sequencing method used (if they're equivalent to the left.fq entries, add them there, etc).
There is no good way to combine strand-specific data with non-strand-specific data, unless you decide to treat the entire data set as non-strand-specific.
Trinity process died due to 'std::bad_alloc'
This is an indicator that the process ran out of available RAM. If you have more RAM resources to make available to Trinity, then simply rerun your original Trinity command with the altered resources allocated and it should resume approximately where it left off.
If you are resource limited, please consider running Trinity here. If you want to continue to try to run Trinity given your available resources, you can reduce the total RAM requirements by running Trinity with parameter '--min_kmer_cov 2'. Although the assembly should still be of high quality and require less RAM, lowly expressed transcripts may be more highly fragmented in the assembly.
Butterfly fails with java Error: Cannot create GC thread. Out of system resources.
There are a couple reasons why this error message might creep up.
all memory has been consumed on the machine. Each butterfly process wants to reserve 10G of maximum heap space. If there's less than 10G of free memory on the machine per butterfly (--CPU setting), then java may not be able to initialize (depends on your OS configuration). Try reducing the --CPU setting and rerunning your Trinity command. It should resume where it left off.
NUMA architecture: one of our users found that the java invocation required: -XX:ParallelGCThreads=, otherwise it would try to use too many threads.
Can I combine strand-specific and non-strand-specific reads?
You can do so, but you wouldn't be able to benefit as from the strand-specificity, since you'll need to run Trinity in non-strand-specific mode.