trinity_assembly

Brian Haas edited this page Jun 7, 2017 · 4 revisions

Running Trinity

When running Trinity on RNA-Seq data involving multiple samples and replicates per sample, we aim to build a single assembly from all reads combined. Because Trinity requires a lot of RAM and can take hours to days to run on large data sets, here we'll run it only on a small subset of the data as a way of getting the 'Trinity experience'. For downstream analysis (further below), you'll use a provided Trinity assembly that was built from all the reads and took 2 hours to complete. Your own assembly should take ~10 minutes.

Try assembling a small subset of 100k reads from the in silico normalized and quality-trimmed reads like so.

First, take the top 100k reads from each of the paired-end files of one of the samples. Since the FASTQ format uses 4 lines per sequence record, we'll take the top 400k lines of each file to obtain the top 100k reads.

% head -n400000 insil_norm_ex/wt37_1.P.qtrim.normalized_K25_C50_pctSD200.fq > left.100k.fastq

% head -n400000 insil_norm_ex/wt37_2.P.qtrim.normalized_K25_C50_pctSD200.fq > right.100k.fastq
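
To confirm that each subset really holds 100k reads, you can count the lines and divide by four. The following is a minimal sketch, assuming the files use the plain four-line FASTQ layout (no wrapped sequence lines); `count_fastq_reads` is a hypothetical helper name, not part of Trinity.

```shell
# Hypothetical helper (not part of Trinity): count the reads in a FASTQ file,
# assuming exactly 4 lines per record.
count_fastq_reads() {
    local lines
    lines=$(wc -l < "$1")
    echo $(( lines / 4 ))
}

# Report the read count for each subset file that exists in the current directory.
for f in left.100k.fastq right.100k.fastq; do
    if [ -f "$f" ]; then
        echo "$f: $(count_fastq_reads "$f") reads"
    fi
done
```

Both files should report the same number of reads, since Trinity expects the left and right files to be paired record for record.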

Now run Trinity to assemble those reads:

% $TRINITY_HOME/Trinity --left left.100k.fastq --right right.100k.fastq \
   --CPU 2 --max_memory 1G --seqType fq --output test_trinity_outdir

Running Trinity on this data set may take ~10 minutes. You'll see it progress through several stages: Jellyfish generates the k-mer catalog, Inchworm assembles 'draft' contigs, Chrysalis clusters the contigs and builds de Bruijn graphs, and finally Butterfly traces paths through the graphs and reconstructs the final isoform sequences.

Running a typical Trinity job requires ~1 hour and ~1G of RAM per ~1 million PE (paired-end) reads. You'd normally run it on a high-memory machine and let it churn for hours or days.
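
As a rough illustration of that rule of thumb, the sketch below turns a paired-end read count into a ballpark runtime and memory figure. The helper name `estimate_trinity_resources` and the 50M read count are made up for illustration, and real requirements vary with the data and the hardware.

```shell
# Hypothetical helper: apply the rule of thumb
# (~1 hour and ~1G RAM per ~1 million PE reads) to a PE read count.
estimate_trinity_resources() {
    local pe_reads=$1
    # Round up to whole millions of PE reads.
    local millions=$(( (pe_reads + 999999) / 1000000 ))
    echo "~${millions} hours, ~${millions}G RAM"
}

# Example with an illustrative 50 million PE reads.
estimate_trinity_resources 50000000
```

The memory figure gives a starting point for choosing a value for --max_memory on a real run.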

The assembled transcripts will be found at 'test_trinity_outdir/Trinity.fasta'.

Just to look at the top few lines of the assembled transcript fasta file, you can run:

% head test_trinity_outdir/Trinity.fasta 

and you can see the Fasta-formatted Trinity output:

>TRINITY_DN889_c0_g1_i1 len=259 path=[473:0-258] [-1, 473, -2]
GAACAATGTCTACACTGTCTTCAACTTGGATGACAAGGAACTTTCATTGGCTCAAGCTAA
CTACAATTCATCTCTGAAACCAGATATTGAAGAAATCAAGGATACTGTCCCTAGCGCTGT
GCTGGCTCCACAATACTACAACACATTCTCAGCTGACCCAACTGCCACTGCAGTCACTGG
TAACATCTTTGCACCAGAGGCCACTATGTCCATGGCTGCTCCAGCTAATGCTTCTAGAAA
CTCTTCATTAAACTCTCCT
>TRINITY_DN810_c0_g1_i1 len=226 path=[407:0-225] [-1, 407, -2]
GATGATATCAACAATGAGACTTGTGAACCAGGTGAAGAAAACTCTTTCTTTGTATGCGAC
CTAGGTGAAATTGAAAGATTGTACGCTAACTGGTGGAAAGAACTACCAAGAGTTCAGCCA
TTTTACGCTGTCAAGTGTAACCCAGATTTGAAGATAATAAGAAAATTGGCTGACCTCGGA

Note, the sequences you see will likely be different, as the order of sequences in the output is not deterministic.

The FASTA sequence header for each transcript contains the transcript identifier (e.g. 'TRINITY_DN889_c0_g1_i1'), the length of the transcript, and then some information about how the software reconstructed the path by traversing nodes within the graph.
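
To pull just the identifier and length out of each header, something like the following could be used. It is a sketch that assumes every header follows the '>ID len=N path=...' layout shown above; `fasta_ids_and_lengths` is a hypothetical helper name.

```shell
# Hypothetical helper: print "identifier<TAB>length" for each FASTA header,
# assuming headers look like: >ID len=N path=[...]
fasta_ids_and_lengths() {
    awk '/^>/ {
        id = substr($1, 2)        # drop the leading ">"
        len = $2
        sub(/^len=/, "", len)     # keep only the number
        print id "\t" len
    }' "$1"
}

# List the first few transcripts of the test assembly, if it exists.
if [ -f test_trinity_outdir/Trinity.fasta ]; then
    fasta_ids_and_lengths test_trinity_outdir/Trinity.fasta | head
fi
```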

Multiple isoforms are often reconstructed for the same 'gene'. Here, the 'gene' identifier corresponds to the prefix of the transcript identifier, such as 'TRINITY_DN506_c0_g1', and the different isoforms for that 'gene' carry different isoform numbers in the suffix of the identifier (e.g. TRINITY_DN506_c0_g1_i1 and TRINITY_DN506_c0_g1_i2 would be two different isoform sequences reconstructed for the single gene TRINITY_DN506_c0_g1). It is useful to perform certain downstream analyses, such as differential expression, at both the 'gene' and the 'isoform' level.
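
The gene/isoform naming convention can be used to count how many isoforms were reconstructed per gene. The sketch below assumes the gene identifier is simply the transcript identifier with its trailing '_i<number>' suffix removed, as described above; `isoforms_per_gene` is a hypothetical helper name.

```shell
# Hypothetical helper: print "gene<TAB>isoform_count" for each gene, taking the
# gene identifier to be the transcript identifier minus its trailing _i<N> suffix.
isoforms_per_gene() {
    awk '/^>/ {
        gene = substr($1, 2)          # drop the leading ">"
        sub(/_i[0-9]+$/, "", gene)    # strip the isoform suffix
        count[gene]++
    }
    END { for (g in count) print g "\t" count[g] }' "$1" | LC_ALL=C sort
}

# Show only the genes with more than one reconstructed isoform, if the file exists.
if [ -f test_trinity_outdir/Trinity.fasta ]; then
    isoforms_per_gene test_trinity_outdir/Trinity.fasta | awk -F'\t' '$2 > 1'
fi
```

On the small 100k-read test assembly there may be few or no multi-isoform genes, so an empty result here is not necessarily a sign of a problem.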

Switch to using a more complete assembly (provided)

Now that you've gotten the 'Trinity assembly experience', let's switch to using a more complete assembly that was generated using 18M total PE reads (2M per replicate * 3 samples * 3 replicates per sample).

First, to avoid confusion, delete your 'test_trinity_outdir/' directory, recursively.

% rm -rf test_trinity_outdir/

Then, copy over the following Trinity assembly to your workspace:

% cp ~/shared_ro/Trinity.fasta .

We'll use this Trinity.fasta file for all further studies below.

Using Bandage

The Bandage software provides a graphical interface for navigating the structures of assemblies and can be useful for exploring complex transcript isoform structures.

Try downloading Bandage from https://rrwick.github.io/Bandage/, and loading and viewing your Trinity.fasta file.