# Reading Annotations from a _GO Association File_ (GAF)

1. Download a GAF file
2. Load the GAF file into the GafReader
3. Get Annotations

**Bonus: Each line in the GAF file is stored in a namedtuple**:
  * Namedtuple fields
  * Print a subset of the namedtuple fields

## 1) Download a GAF file

In [7]:
!wget http://current.geneontology.org/annotations/goa_human.gaf.gz

--2019-04-08 09:43:16--  http://current.geneontology.org/annotations/goa_human.gaf.gz
Resolving current.geneontology.org (current.geneontology.org)... 143.204.145.94, 143.204.145.35, 143.204.145.220, ...
Connecting to current.geneontology.org (current.geneontology.org)|143.204.145.94|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 8524279 (8.1M) [application/gzip]
Saving to: ‘goa_human.gaf.gz’


utime(goa_human.gaf.gz): Operation not permitted
2019-04-08 09:43:20 (2.07 MB/s) - ‘goa_human.gaf.gz’ saved [8524279/8524279]



In [14]:
!gunzip goa_human.gaf.gz
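
If you only want to scan the file, decompressing it first is optional, because Python's standard `gzip` module can read the compressed download directly. The sketch below is not part of GOATOOLS, and the helper name `iter_gaf_rows` is hypothetical. It assumes the usual GAF layout, where header and comment lines start with `!` and columns are tab-separated.

```python
import gzip


def iter_gaf_rows(path):
    """Yield each annotation line of a GAF file as a list of column strings.

    Hypothetical helper (not part of GOATOOLS). Assumes header/comment
    lines start with '!' and that columns are tab-separated.
    """
    # Use gzip for .gz files; fall back to the built-in open otherwise
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.startswith("!") or not line.strip():
                continue  # skip header/comment lines and blank lines
            yield line.rstrip("\n").split("\t")
```

For example, `rows = list(iter_gaf_rows("goa_human.gaf.gz"))` would collect the raw columns of every annotation line, provided the compressed file is still present.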

## 2) Load the GAF file into the GafReader

In [1]:
from goatools.anno.gaf_reader import GafReader

ogaf = GafReader("goa_human.gaf")

  READ      476,348 associations: goa_human.gaf


## 3) Get Annotations
The annotations are stored in a dict where:
  * the key is the protein ID, and
  * the value is a set of GO IDs associated with that protein.

In [2]:
id2gos = ogaf.read_gaf()

In [3]:
for protein_id, go_ids in sorted(id2gos.items())[:3]:
    print("{PROT:7} : {GOs}\n".format(
        PROT=protein_id,
        GOs=' '.join(sorted(go_ids))))

A0A024R161 : GO:0003924 GO:0005515 GO:0005834 GO:0007186

A0A024RBG1 : GO:0003723 GO:0005829 GO:0008486 GO:0046872 GO:0052840 GO:0052842

A0A075B6H7 : GO:0002377 GO:0005615 GO:0006955
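
A common next step is to invert this mapping so that each GO ID points to the proteins annotated with it. The following is a minimal sketch using only the standard library. The function name `invert_id2gos` is hypothetical, and the sample dict is copied from the three entries printed above; in the notebook you would pass the real `id2gos` instead.

```python
from collections import defaultdict

# Sample data copied from the output above (stand-in for the real id2gos)
id2gos_sample = {
    "A0A024R161": {"GO:0003924", "GO:0005515", "GO:0005834", "GO:0007186"},
    "A0A024RBG1": {"GO:0003723", "GO:0005829", "GO:0008486",
                   "GO:0046872", "GO:0052840", "GO:0052842"},
    "A0A075B6H7": {"GO:0002377", "GO:0005615", "GO:0006955"},
}


def invert_id2gos(id2gos):
    """Return a dict mapping each GO ID to the set of protein IDs that carry it."""
    go2ids = defaultdict(set)
    for protein_id, go_ids in id2gos.items():
        for go_id in go_ids:
            go2ids[go_id].add(protein_id)
    return dict(go2ids)


go2ids = invert_id2gos(id2gos_sample)
print(go2ids["GO:0005615"])  # {'A0A075B6H7'}
```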



# Bonus: The GAF is stored as a list of namedtuples
The list of namedtuples is stored in the **GafReader** data member named **_associations_**.

Each namedtuple stores data for one line in the GAF file.

In [9]:
# Sort the list of GAF namedtuples by ID
nts = sorted(ogaf.associations, key=lambda nt: nt.DB_ID)

# Print one namedtuple
print(nts[0])

ntgafobj(DB='UniProtKB', DB_ID='A0A024R161', DB_Symbol='DNAJC25-GNG10', Qualifier=[], GO_ID='GO:0005834', DB_Reference={'PMID:21873635'}, Evidence_Code='IBA', With_From={'UniProtKB:P63212', 'FB:FBgn0004921', 'UniProtKB:O14610', 'RGD:621514', 'PANTHER:PTN001418483', 'RGD:69268', 'RGD:1595475'}, Aspect='C', DB_Name={'Guanine nucleotide-binding protein subunit gamma'}, DB_Synonym={'hCG_1994888', 'DNAJC25-GNG10'}, DB_Type='protein', Taxon=[9606], Date=datetime.date(2018, 4, 25), Assigned_By='GO_Central', Annotation_Extension=set(), Gene_Product_Form_ID=set())


## Namedtuple fields
```
Field                 Index  Status    Cardinality   Example
DB                        0  required  1             UniProtKB
DB_ID                     1  required  1             P12345
DB_Symbol                 2  required  1             PHO3
Qualifier                 3  optional  0 or greater  NOT
GO_ID                     4  required  1             GO:0003993
DB_Reference              5  required  1 or greater  PMID:2676709
Evidence_Code             6  required  1             IMP
With_From                 7  optional  0 or greater  GO:0000346
Aspect                    8  required  1             F
DB_Name                   9  optional  0 or 1        Toll-like receptor 4
DB_Synonym               10  optional  0 or greater  hToll|Tollbooth
DB_Type                  11  required  1             protein
Taxon                    12  required  1 or 2        taxon:9606
Date                     13  required  1             20090118
Assigned_By              14  required  1             SGD
Annotation_Extension     15  optional  0 or greater  part_of(CL:0000576)
Gene_Product_Form_ID     16  optional  0 or 1        UniProtKB:P12345-2
```
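
To make the field order concrete, here is a rough sketch of how one tab-separated GAF line could be split into a namedtuple with these 17 fields. It is not the GafReader implementation: `GafLine` and `parse_gaf_line` are hypothetical names, and every field is kept as a raw string. The namedtuple printed earlier shows that GafReader additionally converts some fields (for example, `Date` to a `datetime.date` and `With_From` to a set). The example line is assembled from the example values in the table.

```python
from collections import namedtuple

# Field names in column order, as listed in the table above
GAF_FIELDS = (
    "DB", "DB_ID", "DB_Symbol", "Qualifier", "GO_ID", "DB_Reference",
    "Evidence_Code", "With_From", "Aspect", "DB_Name", "DB_Synonym",
    "DB_Type", "Taxon", "Date", "Assigned_By",
    "Annotation_Extension", "Gene_Product_Form_ID",
)

# Hypothetical stand-in for the namedtuple type that GafReader builds
GafLine = namedtuple("GafLine", GAF_FIELDS)


def parse_gaf_line(line):
    """Split one tab-separated GAF line into a GafLine of raw strings."""
    cols = line.rstrip("\n").split("\t")
    # Pad missing trailing optional columns so all 17 fields are present
    cols += [""] * (len(GAF_FIELDS) - len(cols))
    return GafLine(*cols[:len(GAF_FIELDS)])


# Example line built from the example values in the table above
example = "\t".join([
    "UniProtKB", "P12345", "PHO3", "NOT", "GO:0003993", "PMID:2676709",
    "IMP", "GO:0000346", "F", "Toll-like receptor 4", "hToll|Tollbooth",
    "protein", "taxon:9606", "20090118", "SGD",
    "part_of(CL:0000576)", "UniProtKB:P12345-2",
])
nt = parse_gaf_line(example)
print(nt.DB_ID, nt.GO_ID, nt.Evidence_Code)  # P12345 GO:0003993 IMP
```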

## Print a subset of the namedtuple fields

In [7]:
fmtpat = '{DB_ID} {DB_Symbol:13} {GO_ID} {Evidence_Code} {Date} {Assigned_By}'
for nt_line in nts[:10]:
    print(fmtpat.format(**nt_line._asdict()))

A0A024R161 DNAJC25-GNG10 GO:0005834 IBA 2018-04-25 GO_Central
A0A024R161 DNAJC25-GNG10 GO:0005515 IBA 2018-04-25 GO_Central
A0A024R161 DNAJC25-GNG10 GO:0003924 IEA 2019-02-11 InterPro
A0A024R161 DNAJC25-GNG10 GO:0007186 IEA 2019-02-11 InterPro
A0A024RBG1 NUDT4B        GO:0005829 IDA 2016-12-04 HPA
A0A024RBG1 NUDT4B        GO:0003723 IEA 2019-02-12 UniProt
A0A024RBG1 NUDT4B        GO:0008486 IEA 2019-02-11 UniProt
A0A024RBG1 NUDT4B        GO:0046872 IEA 2019-02-12 UniProt
A0A024RBG1 NUDT4B        GO:0052840 IEA 2019-02-11 UniProt
A0A024RBG1 NUDT4B        GO:0052842 IEA 2019-02-11 UniProt
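
Because each record is a namedtuple, fields such as `Evidence_Code` can be used directly for counting and filtering. The sketch below illustrates this with a small stand-in namedtuple (`Row`, a hypothetical name holding only four of the fields) populated with the ten rows printed above. With the real data you would iterate over `nts` instead of `rows`.

```python
from collections import Counter, namedtuple

# Minimal stand-in for the GAF namedtuples; only the fields used here
Row = namedtuple("Row", "DB_ID GO_ID Evidence_Code Assigned_By")

# The ten rows printed above
rows = [
    Row("A0A024R161", "GO:0005834", "IBA", "GO_Central"),
    Row("A0A024R161", "GO:0005515", "IBA", "GO_Central"),
    Row("A0A024R161", "GO:0003924", "IEA", "InterPro"),
    Row("A0A024R161", "GO:0007186", "IEA", "InterPro"),
    Row("A0A024RBG1", "GO:0005829", "IDA", "HPA"),
    Row("A0A024RBG1", "GO:0003723", "IEA", "UniProt"),
    Row("A0A024RBG1", "GO:0008486", "IEA", "UniProt"),
    Row("A0A024RBG1", "GO:0046872", "IEA", "UniProt"),
    Row("A0A024RBG1", "GO:0052840", "IEA", "UniProt"),
    Row("A0A024RBG1", "GO:0052842", "IEA", "UniProt"),
]

# Count annotations per evidence code
evidence_counts = Counter(row.Evidence_Code for row in rows)
print(evidence_counts.most_common())  # [('IEA', 7), ('IBA', 2), ('IDA', 1)]

# Keep only annotations whose evidence code is not IEA
non_iea = [row for row in rows if row.Evidence_Code != "IEA"]
print(len(non_iea))  # 3
```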


Copyright (C) 2010-2019, DV Klopfenstein, Haibao Tang. All rights reserved.