Functional Annotation of Transcripts

Brian Haas edited this page Oct 30, 2015 · 2 revisions

Functional Annotation of Trinity Transcriptome Assemblies

We developed Trinotate, a functional annotation protocol and supporting software for Trinity de novo transcriptome assemblies. Trinotate is available on GitHub; visit the Trinotate website for documentation and software.

Add Annotations to Expression Matrices

It's often useful to attach a brief annotation string to each transcript (or gene) identifier so that it is carried through all downstream expression analyses and appears in outputs such as heatmaps. One way to do this is to update the feature ID (the left-most column of your count matrix) to include annotation identifiers. Using the Trinotate report file (see above), you can integrate functional annotations as follows:

The scripts for doing this are in either the Trinotate or the Trinity package, so note which one each command uses: script paths are prefixed with ${TRINOTATE_HOME} or ${TRINITY_HOME} accordingly.

First, using the Trinotate report file, generate a map from each feature identifier to its annotated feature identifier like so:

                  Trinotate_report.xls  > annot_feature_map.txt

Looking at the first few lines of the output file, you'll see formatting that resembles the following:

comp0_c0    comp0_c0^YFY8_SCHPO^Steroid_dh^Tm3
comp0_c0_seq1   comp0_c0_seq1^YFY8_SCHPO^Steroid_dh^Tm3
comp1000_c0 comp1000_c0^ARD1_SCHPO^Acetyltransf_1
comp1000_c0_seq1    comp1000_c0_seq1^ARD1_SCHPO^Acetyltransf_1
comp10023_c0    comp10023_c0^TF29_SCHPO^Peptidase_A2E
comp10023_c0_seq1   comp10023_c0_seq1^TF29_SCHPO^Peptidase_A2E
comp1002_c0 comp1002_c0^CCS1_SCHPO
comp1002_c0_seq1    comp1002_c0_seq1^CCS1_SCHPO
comp1002_c0_seq2    comp1002_c0_seq2^CCS1_SCHPO
comp10033_c0    comp10033_c0^RSM1_SCHPO^zf-C3HC

The above maps each original feature identifier (Trinity transcript or gene identifier) to a version that includes the accessions of the top BLAST hits along with Pfam identifiers, transmembrane domains, and signal peptides.
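
To make the mapping format concrete, here is a minimal Python sketch that reads such a file into a dictionary and splits an annotated identifier into its '^'-separated parts. It is not part of Trinotate or Trinity: the function names `parse_feature_map` and `split_annotation` are hypothetical, and the sketch assumes two whitespace-delimited columns per line, as in the listing above.

```python
def parse_feature_map(lines):
    """Build {original_id: annotated_id} from two-column mapping lines.

    Assumption: columns are separated by whitespace (tabs or spaces),
    and the annotated identifier itself contains no whitespace.
    """
    feature_map = {}
    for line in lines:
        fields = line.strip().split(None, 1)
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        feature_map[fields[0]] = fields[1]
    return feature_map


def split_annotation(annotated_id):
    """Split 'id^annot1^annot2' into ('id', ['annot1', 'annot2'])."""
    feature_id, *annotations = annotated_id.split("^")
    return feature_id, annotations


if __name__ == "__main__":
    example = [
        "comp0_c0    comp0_c0^YFY8_SCHPO^Steroid_dh^Tm3\n",
        "comp1002_c0 comp1002_c0^CCS1_SCHPO\n",
    ]
    for original, annotated in parse_feature_map(example).items():
        print(original, split_annotation(annotated))
```

With a real file, you would pass an open file handle for annot_feature_map.txt in place of the example list.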

Given this annotation mapping file, you can then update your expression matrix. For example, given a counts matrix 'Trinity_trans.counts.matrix', which looks like so:

          ds_rep1 hs_rep1 log_rep1        plat_rep1
comp4806_c1_seq1        0.00    1.00    0.00    2.00
comp3938_c1_seq1        6.00    4.00    9.00    4.00
comp31964_c0_seq1       0.00    0.00    0.00    0.00
comp6295_c0_seq1        95.00   77.00   95.00   86.00
comp3109_c1_seq1        22.00   8.00    28.00   47.00
comp3439_c1_seq1        4.00    1.00    3.00    6.00

we can integrate functional annotations like so:

${TRINITY_HOME}/Analysis/DifferentialExpression/ \
    Trinity_trans.counts.matrix annot_feature_map.txt > Trinity_trans.counts.wAnnot.matrix

and looking at the top entries of our updated matrix:

            ds_rep1 hs_rep1 log_rep1    plat_rep1
comp4806_c1_seq1    0.00    1.00    0.00    2.00
comp3938_c1_seq1^POPI_SCHPO^POP1    6.00    4.00    9.00    4.00
comp31964_c0_seq1^NAA25_SCHPO   0.00    0.00    0.00    0.00
comp6295_c0_seq1^NAGS_SCHPO^DUF619  95.00   77.00   95.00   86.00
comp3109_c1_seq1^DEP1_SCHPO^Sds3    22.00   8.00    28.00   47.00
comp3439_c1_seq1^CUT12_SCHPO    4.00    1.00    3.00    6.00
comp4603_c2_seq1^MAP4_SCHPO 1.00    0.00    0.00    0.00
comp596_c0_seq1^ELL1_SCHPO^ELL  174.00  81.00   58.00   203.01
comp4091_c0_seq1^BYR1_SCHPO^Pkinase_Tyr 139.00  51.00   66.00   203.01

This updated matrix can be used for downstream analysis steps, such as differential expression analysis.
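
The renaming step itself is simple enough to sketch in Python, which may help if you want to check the result or apply the same idea elsewhere. This is an illustration of the idea only, not the Trinity script: the function name `annotate_matrix` is hypothetical, and the sketch assumes a tab-delimited matrix whose first line is the sample header. Features without an entry in the map keep their original identifier, as with comp4806_c1_seq1 above.

```python
def annotate_matrix(matrix_lines, feature_map):
    """Return matrix lines with the first column replaced by annotated IDs.

    Assumptions: the matrix is tab-delimited, the first line is the
    header, and feature_map is {original_id: annotated_id}.
    """
    out = []
    for i, line in enumerate(matrix_lines):
        line = line.rstrip("\n")
        if i == 0 or not line:
            out.append(line)  # header and blank lines pass through unchanged
            continue
        feature_id, sep, values = line.partition("\t")
        # Fall back to the original ID when no annotation is available.
        out.append(feature_map.get(feature_id, feature_id) + sep + values)
    return out


if __name__ == "__main__":
    feature_map = {"comp3938_c1_seq1": "comp3938_c1_seq1^POPI_SCHPO^POP1"}
    matrix = [
        "\tds_rep1\ths_rep1\tlog_rep1\tplat_rep1\n",
        "comp4806_c1_seq1\t0.00\t1.00\t0.00\t2.00\n",
        "comp3938_c1_seq1\t6.00\t4.00\t9.00\t4.00\n",
    ]
    print("\n".join(annotate_matrix(matrix, feature_map)))
```

Only the identifier column is changed; the count values are passed through as text, so no rounding or reformatting is introduced.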