### Week 5 - Biological Databases - Gene Ontology
- October 2023
- [https://github.com/tisimpson/bioinformatics1](https://github.com/tisimpson/bioinformatics1)
- [ian.simpson@ed.ac.uk](mailto:ian.simpson@ed.ac.uk)

In [2]:
import pandas as pd
import urllib.request  # load the submodule explicitly so that ul.request.urlretrieve works
import urllib as ul
import numpy as np

In [6]:
# retrieve the gene IDs saved in the previous section (read here from cams_geneids.txt; the variable names keep the dop_ prefix)
dop_gene_ids = pd.read_csv('../data/pathways/cams_geneids.txt',header=None)
dop_gene_ids.columns=['GeneID']
dop_gene_ids.head()

Unnamed: 0,GeneID
0,965
1,914
2,64115
3,152404
4,941


In [7]:
# We are going to retrieve the mapping file produced by the GeneOntology consortium that maps genes to GO terms.
# It is stored in the gene2go file at the NCBI (uncomment the next line to download it).
# ul.request.urlretrieve('https://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2go.gz','../data/ontology/gene2go.gz')

# We can read this file into a pandas dataframe using the read_csv function
gene2go = pd.read_csv('../data/ontology/gene2go.gz', compression='gzip', header=0, sep='\t')

# We can look at the first few rows of the dataframe using the head function
gene2go.head()

Unnamed: 0,#tax_id,GeneID,GO_ID,Evidence,Qualifier,GO_term,PubMed,Category
0,3702,814629,GO:0003674,ND,enables,molecular_function,-,Function
1,3702,814629,GO:0005634,ISM,located_in,nucleus,-,Component
2,3702,814629,GO:0008150,ND,involved_in,biological_process,-,Process
3,3702,814630,GO:0003700,ISS,enables,DNA-binding transcription factor activity,11118137,Function
4,3702,814630,GO:0005634,ISM,located_in,nucleus,-,Component


In [8]:
#now explicitly restrict to human (tax_id - 9606)
human_gene2go = gene2go[gene2go['#tax_id']==9606]

In [9]:
# We can look at the first few rows of the dataframe using the head function
human_gene2go.head()

Unnamed: 0,#tax_id,GeneID,GO_ID,Evidence,Qualifier,GO_term,PubMed,Category
962693,9606,1,GO:0003674,ND,enables,molecular_function,-,Function
962694,9606,1,GO:0005576,HDA,located_in,extracellular region,27068509,Component
962695,9606,1,GO:0005576,IDA,located_in,extracellular region,3458201,Component
962696,9606,1,GO:0005576,TAS,located_in,extracellular region,-,Component
962697,9606,1,GO:0005615,HDA,located_in,extracellular space,16502470,Component
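
The full gene2go file covers many species, so loading it whole just to keep the human rows can be memory hungry. A minimal sketch of an alternative, assuming the same `#tax_id` column as above, is to read the file in chunks and filter each chunk as it arrives. The toy table below is a made-up stand-in for the real file, not actual gene2go content.

```python
import io
import pandas as pd

# Toy stand-in for gene2go (tab-separated, same leading columns as the real file)
toy = io.StringIO(
    "#tax_id\tGeneID\tGO_ID\n"
    "3702\t814629\tGO:0003674\n"
    "9606\t1\tGO:0005576\n"
    "9606\t2\tGO:0005615\n"
)

# Read two rows at a time and keep only the human rows (tax_id 9606) from each chunk
chunks = pd.read_csv(toy, sep="\t", chunksize=2)
human_only = pd.concat([chunk[chunk["#tax_id"] == 9606] for chunk in chunks])
print(human_only)
```

For the real file the same pattern should work with the path and `compression='gzip'` in place of the `StringIO` object, and a much larger `chunksize`.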


In [10]:
# We can merge data frames on matching keys using the merge function in Pandas
# Here we keep only the GO annotations that are associated with the genes in our gene list (inner join on GeneID)
dop_gos = pd.merge(dop_gene_ids,human_gene2go,on='GeneID')

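# what is the most frequent GO term annotated to the genes in our list?
# we can use the pandas groupby function to group the data by GO_ID and then count the number of rows in each group
# note: a gene annotated to the same term with several evidence codes contributes one row per annotation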
dop_go_counts = dop_gos.groupby('GO_ID').size().sort_values(ascending=False)

# show the top10 terms in a prettytable
from prettytable import PrettyTable
top10 = dop_go_counts.head(10)
t = PrettyTable(['GO_ID','Count'])
for i in top10.index:
    t.add_row([i,top10[i]])
print(t)

+------------+-------+
|   GO_ID    | Count |
+------------+-------+
| GO:0005886 |  276  |
| GO:0005515 |  125  |
| GO:0009986 |   81  |
| GO:0007155 |   62  |
| GO:0016020 |   61  |
| GO:0005923 |   57  |
| GO:0009897 |   55  |
| GO:0042802 |   45  |
| GO:0070062 |   41  |
| GO:0098609 |   39  |
+------------+-------+
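
Note that `groupby('GO_ID').size()` counts annotation rows, and gene2go can list the same gene and GO term several times with different evidence codes (GeneID 1 and GO:0005576 above, for example). A small sketch of the difference between counting rows and counting distinct genes, using a made-up toy table with placeholder GO IDs:

```python
import pandas as pd

# Toy annotations: gene 1 is annotated to GO:A twice, with different evidence codes
toy_gos = pd.DataFrame({
    "GeneID":   [1, 1, 2, 3],
    "GO_ID":    ["GO:A", "GO:A", "GO:A", "GO:B"],
    "Evidence": ["IDA", "TAS", "IDA", "IDA"],
})

# Rows per term (what .size() reports) versus distinct genes per term
row_counts = toy_gos.groupby("GO_ID").size()
gene_counts = toy_gos.groupby("GO_ID")["GeneID"].nunique()
print(row_counts["GO:A"], gene_counts["GO:A"])
```

Applied to `dop_gos`, the `nunique()` version would give the number of distinct genes in the list annotated with each term, which can never exceed the size of the gene list.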


In [11]:
# ideally we would like a table that also includes the GO term description

#create a unique lookup dataframe for GO_ID term descriptions from our dopaminergic GO gene dataframe
unique_dop_gos = dop_gos[['GO_ID','GO_term']].drop_duplicates()

# now print out our top 10 GO terms with their descriptions using prettytable
t = PrettyTable(['GO_ID','Count','GO_term'])
for i in top10.index:
    t.add_row([i,top10[i],unique_dop_gos[unique_dop_gos['GO_ID'] == i]['GO_term'].values[0]])
print(t)

+------------+-------+----------------------------------+
|   GO_ID    | Count |             GO_term              |
+------------+-------+----------------------------------+
| GO:0005886 |  276  |         plasma membrane          |
| GO:0005515 |  125  |         protein binding          |
| GO:0009986 |   81  |           cell surface           |
| GO:0007155 |   62  |          cell adhesion           |
| GO:0016020 |   61  |             membrane             |
| GO:0005923 |   57  |    bicellular tight junction     |
| GO:0009897 |   55  | external side of plasma membrane |
| GO:0042802 |   45  |    identical protein binding     |
| GO:0070062 |   41  |      extracellular exosome       |
| GO:0098609 |   39  |        cell-cell adhesion        |
+------------+-------+----------------------------------+
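
The per-row lookup above works, but the same table can also be built in one step by merging the counts with the description lookup. This is only a sketch on a made-up toy table; the column names follow the ones used above, and `summary` is a hypothetical name.

```python
import pandas as pd

# Toy stand-in for dop_gos with the two columns needed for the lookup
toy_gos = pd.DataFrame({
    "GeneID":  [1, 2, 4, 2, 3],
    "GO_ID":   ["GO:0005886", "GO:0005886", "GO:0005886", "GO:0005515", "GO:0005515"],
    "GO_term": ["plasma membrane", "plasma membrane", "plasma membrane",
                "protein binding", "protein binding"],
})

# Count rows per GO_ID, then attach the description with a left merge
counts = toy_gos.groupby("GO_ID").size().rename("Count").reset_index()
lookup = toy_gos[["GO_ID", "GO_term"]].drop_duplicates()
summary = counts.merge(lookup, on="GO_ID", how="left").sort_values("Count", ascending=False)
print(summary.to_string(index=False))
```

Keeping the result as a dataframe also makes it easy to add further columns later, such as background counts.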


In [12]:
# we can use this dataframe to ask lots of interesting questions about the data

# how many human genes are there in our human gene2GO set?
num_human_genes_ingo = len(human_gene2go['GeneID'].drop_duplicates())
print('There are '+str(num_human_genes_ingo)+' human genes in our human gene2GO set')

# how many genes are annotated with the top GO term from our list (GO:0005886 in this run) in our human gene2GO set?
top_goid = dop_go_counts.index[0]
num_human_genes_withtop = len(human_gene2go[human_gene2go['GO_ID'] == top_goid]['GeneID'].drop_duplicates())
print('There are '+str(num_human_genes_withtop)+' human genes annotated with '+top_goid+' in our human gene2GO set')

# what is the size of our gene list?
num_human_genes_inlist = len(dop_gene_ids['GeneID'].drop_duplicates())
print('There are '+str(num_human_genes_inlist)+' genes in our gene list')

# how many genes would we expect to be annotated with the top GO_ID?
expectation = num_human_genes_withtop/num_human_genes_ingo * num_human_genes_inlist
print('We would expect to see this '+str(round(expectation,2))+' times')

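# how many annotations in our list carry the top GO_ID?
# note: dop_go_counts counts annotation rows, not unique genes, so this can exceed the size of the gene list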
observation = dop_go_counts[top_goid]
print('We actually see this '+str(round(observation,2))+' times')

# what's the enrichment?
print('So, the top GO term is found '+str(round(observation/expectation,2))+' times more frequently than we would expect by chance')

# why might we want to know this?

There are 20761 human genes in our human gene2GO set
There are 5178 human genes annotated with GO:0005886 in our human gene2GO set
There are 158 genes in our gene list
We would expect to see this 39.41 times
We actually see this 276 times
So, the top GO term is found 7.0 times more frequently than we would expect by chance
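
One reason to want these numbers is to ask whether the enrichment could plausibly arise by chance. A common way to frame this is a hypergeometric test: if we draw n genes at random from N annotated genes, K of which carry the term, how likely is it to see k or more genes with the term? The sketch below is an illustration with a hypothetical helper, `hypergeom_upper_tail`, written with the standard library only; it is not part of the original analysis, and it assumes k is a count of distinct genes rather than annotation rows.

```python
from math import comb

def hypergeom_upper_tail(N, K, n, k):
    """P(X >= k) when n genes are drawn without replacement from N genes, K of which carry the term."""
    total = comb(N, n)
    upper = min(K, n)
    # Sum the probability of every outcome at least as extreme as k
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / total

# Toy numbers: 4 genes in total, 2 carry the term, we draw 2 and both carry it
p = hypergeom_upper_tail(N=4, K=2, n=2, k=2)
print(p)
```

With the values printed above, N would be 20761, K would be 5178 and n would be 158, with k taken as the number of distinct genes in the list annotated with the top term.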
