Skip to content
Evan Staton edited this page Jan 21, 2022 · 16 revisions

This brief guide will show you how to get GO term associations for your genes using HMMER2GO, and also analyze the GO associations for a study set of genes using Ontologizer.

HMMER2GO

The script below will fetch the latest Arabidopsis thaliana coding sequences and run the full analysis. In order of processing, the following script will:

  1. find the longest ORF for each sequence and translate them,
  2. search the translated ORFs for domain matches,
  3. download the protein domain to GO term mappings, and finally,
  4. tabulate the type and number of GO terms for each sequence.

Please note the comment below about the run time. You may want to try a smaller subset to test the commands, or simply read through the steps to understand what each command is doing. Once this process completes you should be able to explore the results with Ontologizer, as explained in the next section.

If we are using Docker, let's consider that our input data is under the current working directory in a directory called work and the Pfam DB is in a directory called /db. The following Docker command would mount both of these directories:

docker run -it --name hmmer2go-con -v $(pwd)/work:/work:Z -v /db:/db:Z sestaton/hmmer2go

Ideally the following script would be in the work directory. Inside the Docker container, you can change to the /work directory and run the script.

#!/bin/bash

wd=$(pwd)

Athal_cdscmp=Arabidopsis_thaliana.TAIR10.cds.all.fa.gz
Athal_base=$(echo ${Athal_cdscmp} | sed 's/\.fa.*//')
Athal_orfs=${Athal_base}_orfs.faa
pfam_out=${Athal_base}_orfs_Pfam-A.tblout
pfam_GO=${Athal_base}_orfs_Pfam-A_GO.tsv
pfam_GAF=${Athal_base}_orfs_Pfam-A_GO_mapping.gaf
pfam_db=`pwd`/db/pfam/latest/Pfam-A.hmm # Edit to the location of the models you want to use

## Set up Pfam-A DB
pfam_dir=$(dirname $pfam_db)
pfam_file=$(basename $pfam_db)
pfam_dl=${pfam_file}.gz

mkdir -p $pfam_dir && cd $pfam_dir
wget -q "http://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam35.0/Pfam-A.hmm.gz"
gunzip $pfam_dl
hmmpress $pfam_file

cd $wd

## Fetch latest A. thaliana CDS models
wget -q -O $Athal_cdscmp \
     "http://ftp.gramene.org/CURRENT_RELEASE/fasta/arabidopsis_thaliana/cds/Arabidopsis_thaliana.TAIR10.cds.all.fa.gz"

## Get longest ORF for each cDNA
time hmmer2go getorf --choose -i $Athal_cdscmp -o $Athal_orfs

## Find domain matches
time hmmer2go run -i $Athal_orfs -d $pfam_db -o $pfam_out -t 4

## Find Pfam -> GO term mapping
time hmmer2go mapterms -i $pfam_out -o $pfam_GO --map

## Create GAF file
time hmmer2go map2gaf -i $pfam_GO -o $pfam_GAF -s 'Arabidopsis thaliana'

This script will take about 9 hours to run (on a small cloud instance, though you can speed this up using multiple processors), so use nohup or your queuing system (e.g., nohup bash hmmer2go.sh 2>&1 > hmmer2go.out &). Note that this script fetches the latest Pfam models and creates the required database to run HMMER2GO. If you have a set of existing models you want to use, perhaps created with the hmmer2go pfamsearch command, then you will have to edit line 12 above and add the location of the Pfam HMM models you want to use (these need to be formatted with hmmpress, which will be handled by hmmer2go pfamsearch --createdb). The expected output should be:

32579 query sequences with 32579 GO terms mapped in file Arabidopsis_thaliana_GOterm_mapping.tsv.

Ontologizer

First, you will have to download Ontologizer and have Java installed. The following is a very simple example. The study set in a real use case would be derived from some experimental result.

Generate the population set:

egrep -v "^\!" Arabidopsis_thaliana_GOterm_mapping.tsv | cut -f2 | sort > athal_popset.txt

For the sake of example, we will create a study set of IDs based on the description (replace "term description" with any terms of interest, for example, "ATP"):

egrep -v "^\!" Arabidopsis_thaliana_GOterm_mapping.tsv | grep <term description> | cut -f2 | sort > athal_studyset.txt

With the population and study sets prepared, we can now run Ontologizer:

#!/bin/bash

# Download the same GO ontology used by HMMER2GO
wget http://purl.obolibrary.org/obo/go.obo

gaf=Arabidopsis_thaliana_GOterm_mapping.gaf

java -jar Ontologizer.jar \
     -s athal_studyset.txt \
     -a $gaf \
     -g go.obo \
     -p athal_popset.txt \
     -o athal_ontologizer
Clone this wiki locally