# GO Term Processing

It's so frustrating that I even need this notebook. We use `goatools` and the PANTHER API to load information about GO terms and to decide which gene sets we should be analyzing.

We can't use PANTHER for the enrichment itself because it doesn't account for duplicate genes in the input list.

# What GO terms should we be analyzing?

We'll follow along with PANTHER to define which gene sets we should analyze, since that seems to be a standard in the field (GO, after all, links directly to it).

In [294]:
import pandas as pd
import requests

# Use panther API to get an example run
data = requests.get('https://pantherdb.org/services/oai/pantherdb/enrich/overrep?geneInputList=TP53,BRCA1,BRCA2,RAD51,MSH2,MSH6&organism=9606&annotDataSet=GO%3A0008150&enrichmentTestType=FISHER&correction=FDR')

# convert response object to a dataframe
go_terms = []
for val in data.json()['results']['result']:
    go_id = val['term'].get('id', 'unlabelled')
    go_label = val['term'].get('label', 'unlabelled')
    num_genes = val['number_in_reference']
    go_terms.append([go_id, go_label, num_genes])

go_terms = pd.DataFrame(go_terms, columns = ['go', 'label', 'num_genes_panther']).sort_values(by = 'go')
go_terms = go_terms.query('go != "unlabelled" & label != "unlabelled"')

# an example of how to run the panther analysis, in case that's helpful 

# formatted_data = []
# for val in data.json()['results']['result']:
#     number_in_list = val['number_in_list']
#     fold_enrichment = val['fold_enrichment']
#     fdr = val['fdr']
#     expected = val['expected']
#     number_in_ref = val['number_in_reference']
#     p = val['pValue']
#     term = val['term'].get('id', 'unlabelled') + " (" + val['term']['label'] + ")"
#     plus_minus = val['plus_minus']
    
#     row = [number_in_list, fold_enrichment, fdr, expected, number_in_ref, p, term, plus_minus]
#     formatted_data.append(row)

# formatted_data = pd.DataFrame(formatted_data, columns = ['number_in_list', 'fold_enrichment', 'fdr', 'expected', 'number_in_ref', 'p', 'term', 'plus_minus'])

In [295]:
go_terms.head(2)

Unnamed: 0,go,label,num_genes_panther
408,GO:0000002,mitochondrial genome maintenance,26
142,GO:0000003,reproduction,1457
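
The hard-coded query string above can also be assembled from a parameter dict, which makes it easier to swap the gene list or the annotation set. This is a minimal offline sketch: `build_overrep_url` and `terms_from_payload` are hypothetical helper names, and `payload` is made-up data that only mimics the response fields this notebook reads (`results`, `result`, `term`, `number_in_reference`).

```python
from urllib.parse import urlencode

PANTHER_OVERREP = "https://pantherdb.org/services/oai/pantherdb/enrich/overrep"

def build_overrep_url(genes, organism=9606, annot="GO:0008150"):
    """Assemble the same query string used in the request above (hypothetical helper)."""
    params = {
        "geneInputList": ",".join(genes),
        "organism": organism,
        "annotDataSet": annot,
        "enrichmentTestType": "FISHER",
        "correction": "FDR",
    }
    # safe="," keeps the commas in the gene list unescaped; ":" becomes %3A
    return PANTHER_OVERREP + "?" + urlencode(params, safe=",")

def terms_from_payload(payload):
    """Pull (GO id, label, reference count) rows out of a parsed JSON response."""
    rows = []
    for val in payload["results"]["result"]:
        term = val["term"]
        # some entries may lack an id or label; skip them here
        if "id" not in term or "label" not in term:
            continue
        rows.append((term["id"], term["label"], val["number_in_reference"]))
    return sorted(rows)

url = build_overrep_url(["TP53", "BRCA1", "BRCA2", "RAD51", "MSH2", "MSH6"])

# made-up payload with the same shape as the fields read in this notebook
payload = {"results": {"result": [
    {"term": {"id": "GO:0000003", "label": "reproduction"}, "number_in_reference": 1457},
    {"term": {"label": "UNCLASSIFIED"}, "number_in_reference": 10},
    {"term": {"id": "GO:0000002", "label": "mitochondrial genome maintenance"}, "number_in_reference": 26},
]}}
rows = terms_from_payload(payload)
print(url)
print(rows)
```

Skipping entries without an ID or label has the same effect as tagging them `'unlabelled'` and filtering afterwards.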


# Look up some info about the GO terms using goatools

This is a pretty frustrating package to use at baseline, but it's quite comprehensive. This is just following along their tutorial: https://github.com/tanghaibao/goatools/blob/main/notebooks/report_depth_level.ipynb. We also look up some genes here.

In [271]:
import goatools

In [274]:
# load the GO database
from goatools.base import download_go_basic_obo
obo_fname = download_go_basic_obo()

# Get ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2go.gz
from goatools.base import download_ncbi_associations
fin_gene2go = download_ncbi_associations()

from goatools.obo_parser import GODag
obodag = GODag("go-basic.obo")

from goatools.anno.genetogo_reader import Gene2GoReader

# Read NCBI's gene2go. Store annotations in a list of namedtuples
objanno = Gene2GoReader(fin_gene2go, taxids=[9606])
ns2assoc = objanno.get_ns2assc()

for nspc, id2gos in ns2assoc.items():
    print("{NS} {N:,} annotated human genes".format(NS=nspc, N=len(id2gos)))

  EXISTS: go-basic.obo
  EXISTS: gene2go
go-basic.obo: fmt(1.2) rel(2023-11-15) 46,228 Terms
HMS:0:00:47.747039 345,232 annotations, 20,762 genes, 18,774 GOs, 1 taxids READ: gene2go 
MF 18,348 annotated human genes
CC 19,783 annotated human genes
BP 18,683 annotated human genes


In [275]:
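# GeneID2nt_hs is assumed to be loaded already: a dict mapping NCBI GeneID -> namedtuple with a .Symbol field
symbol_to_id = {nt.Symbol: val for val, nt in GeneID2nt_hs.items()}
id_to_symbol = {val: nt.Symbol for val, nt in GeneID2nt_hs.items()}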

## Get the depth and level

Just a reminder: level is the length of the shortest path from a term up to the root of its namespace (e.g. biological_process), and depth is the length of the longest path.
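
To make the distinction concrete, here is a small sketch on a toy DAG (made-up term names, not real GO terms, and not how goatools computes these internally). The root is given level 0 and depth 0.

```python
from functools import lru_cache

# toy DAG: child -> list of parents
PARENTS = {
    "root": [],
    "A": ["root"],
    "B": ["root"],
    "C": ["A"],
    "D": ["C", "B"],   # D has a short route via B and a long route via C
}

@lru_cache(maxsize=None)
def level(term):
    """Length of the shortest path from term up to the root."""
    parents = PARENTS[term]
    return 0 if not parents else 1 + min(level(p) for p in parents)

@lru_cache(maxsize=None)
def depth(term):
    """Length of the longest path from term up to the root."""
    parents = PARENTS[term]
    return 0 if not parents else 1 + max(depth(p) for p in parents)

for t in PARENTS:
    print(t, level(t), depth(t))
```

Here `D` has level 2 (root, B, D) but depth 3 (root, A, C, D); for terms with a single route to the root, the two values coincide.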

In [296]:
from goatools.rpt.rpt_lev_depth import RptLevDepth

rptobj = RptLevDepth(obodag)

all_terms = obodag.values()
all_terms_unique = set(all_terms)
print(f"All terms: {len(all_terms)}, Unique terms: {len(all_terms_unique)}")

go_level_depth = [(val.id, val.level, val.depth) for val in all_terms]
go_level_depth = pd.DataFrame(go_level_depth, columns = ['go', 'level', 'depth'])

All terms: 46228, Unique terms: 42769
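
The gap between the two counts is most likely due to alternate GO IDs: as far as I understand `GODag`, an alternate ID is stored as an extra key pointing at the same term object as the primary ID, so `.values()` visits some terms more than once. A toy illustration with a plain dict (the `Term` class and the IDs are hypothetical, not goatools objects):

```python
class Term:
    def __init__(self, item_id):
        self.id = item_id

primary = Term("GO:0000001")
other = Term("GO:0000002")

# two keys (a primary ID and an alternate ID) share one object
dag = {"GO:0000001": primary, "GO:9999999": primary, "GO:0000002": other}

all_terms = list(dag.values())
unique_terms = set(all_terms)
ids = [t.id for t in all_terms]
print(len(all_terms), len(unique_terms), ids)
```

Because the aliased entries report the same `.id`, a table built from all values contains repeated rows, which is why the merge in the next cell drops duplicates on `go`.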


In [297]:
go_terms = go_terms.merge(go_level_depth, on = 'go', how = 'left').drop_duplicates(subset = 'go')

In [280]:
go_terms.head(2)

Unnamed: 0,go,label,num_genes_panther,level,depth
0,GO:0000002,mitochondrial genome maintenance,26,6.0,6.0
1,GO:0000003,reproduction,1457,1.0,1.0


## Look up what goatools thinks is the gene list

Most of the time the goatools gene count isn't even close to the PANTHER/GO count, but we'll include it for completeness.

In [281]:
from collections import defaultdict

In [306]:
go_to_gene = defaultdict(list)

for gene, go in ns2assoc['BP'].items():
    if gene in id_to_symbol:
        gene_symbol = id_to_symbol[gene]
    else:
        continue
    
    for g in go:
        go_to_gene[g].append(gene_symbol)
        
# write this to a file
with open('data/cwas-results/go-gene-sets/goatools-gene-sets.txt', 'w') as out:
    for key in sorted(go_to_gene):
        val = go_to_gene[key]
        out.write(key + '\t' + '\t'.join(val) + '\n')

In [307]:
go_terms['num_genes_goat'] = [len(go_to_gene[g]) for g in go_terms['go'].tolist()]

In [309]:
go_terms.head(2)

Unnamed: 0,go,label,num_genes_panther,level,depth,num_genes_goat
0,GO:0000002,mitochondrial genome maintenance,26,6.0,6.0,11
1,GO:0000003,reproduction,1457,1.0,1.0,4
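
One likely reason for the gap between `num_genes_panther` and `num_genes_goat` (1457 versus 4 for reproduction) is that `gene2go` lists only the terms a gene is directly annotated to, while PANTHER appears to also count genes annotated to descendant terms. The sketch below propagates direct annotations up a toy hierarchy; the hierarchy, the gene names, and the `propagate` helper are all made up for illustration.

```python
from collections import defaultdict

# toy hierarchy (simplified, not the real GO graph): child -> parents
PARENTS = {"reproduction": [], "gamete generation": ["reproduction"],
           "spermatogenesis": ["gamete generation"]}

# direct annotations only, as in a gene -> GO terms mapping
GENE_TO_TERMS = {"geneA": {"reproduction"}, "geneB": {"spermatogenesis"},
                 "geneC": {"gamete generation"}, "geneD": {"spermatogenesis"}}

def ancestors(term):
    """All terms reachable by walking parent links upward."""
    seen, stack = set(), list(PARENTS[term])
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(PARENTS[t])
    return seen

def propagate(gene_to_terms):
    """Map each term to the genes annotated to it or to any descendant."""
    term_to_genes = defaultdict(set)
    for gene, terms in gene_to_terms.items():
        for term in terms:
            for t in {term} | ancestors(term):
                term_to_genes[t].add(gene)
    return term_to_genes

# direct counts: invert the mapping without walking up the hierarchy
direct = defaultdict(set)
for gene, terms in GENE_TO_TERMS.items():
    for term in terms:
        direct[term].add(gene)

full = propagate(GENE_TO_TERMS)
print(len(direct["reproduction"]), len(full["reproduction"]))
```

In this toy case the top-level term goes from 1 directly annotated gene to 4 once descendants are included, the same direction as the discrepancy in the table above.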


## An example of goatools enrichment

In [310]:
from goatools.goea.go_enrichment_ns import GOEnrichmentStudyNS

goeaobj = GOEnrichmentStudyNS(
        GeneID2nt_hs.keys(), # List of human protein-coding genes
        ns2assoc, # geneid/GO associations
        obodag, # Ontologies
        propagate_counts = False,
        alpha = 0.05, # default significance cut-off
        methods = ['fdr_bh']) # default multipletest correction method


Load BP Ontology Enrichment Analysis ...
 81% 16,927 of 20,913 population items found in association

Load CC Ontology Enrichment Analysis ...
 86% 18,011 of 20,913 population items found in association

Load MF Ontology Enrichment Analysis ...
 84% 17,511 of 20,913 population items found in association


In [319]:
geneids_study = ['TP53', 'BRCA2', 'BRCA2', 'BRCA2', 'BRCA2', 'BRCA2', 'BRCA2', 'BRCA2', 'BRCA2', 'BRCA1', 'MSH6', 'MSH2', 'RAD51']
geneids_study = [symbol_to_id[g] for g in geneids_study]

In [320]:
# 'p_' means "pvalue". 'fdr_bh' is the multipletest method we are currently using.
goea_results_all = goeaobj.run_study(geneids_study)
goea_results_sig = [r for r in goea_results_all if r.p_fdr_bh < 0.05]


Runing BP Ontology Analysis: current study set of 13 IDs.
100%      6 of      6 study items found in association
 46%      6 of     13 study items found in population(20913)
Calculating 12,220 uncorrected p-values using fisher_scipy_stats
  12,220 terms are associated with 16,927 of 20,913 population items
     212 terms are associated with      6 of      6 study items
  METHOD fdr_bh:
      27 GO terms found significant (< 0.05=alpha) ( 27 enriched +   0 purified): statsmodels fdr_bh
       6 study items associated with significant GO IDs (enriched)
       0 study items associated with significant GO IDs (purified)

Runing CC Ontology Analysis: current study set of 13 IDs.
100%      6 of      6 study items found in association
 46%      6 of     13 study items found in population(20913)
Calculating 1,799 uncorrected p-values using fisher_scipy_stats
   1,799 terms are associated with 18,011 of 20,913 population items
      47 terms are associated with      6 of      6 study items
  M

In [321]:
results = []
fields = goea_results_sig[0].get_prtflds_default()

for res in goea_results_all:
    values = res.get_field_values(fields)
    results.append(values)
    
results = pd.DataFrame(results, columns = fields)

In [322]:
results.query('p_fdr_bh < 0.05').head(2)

Unnamed: 0,GO,NS,enrichment,name,ratio_in_study,ratio_in_pop,p_uncorrected,depth,study_count,p_fdr_bh,study_items
0,GO:0071479,BP,e,cellular response to ionizing radiation,4/6,37/20913,1.240236e-10,6,4,2e-06,"672, 675, 5888, 7157"
1,GO:0006302,BP,e,double-strand break repair,4/6,74/20913,2.1546e-09,8,4,1.3e-05,"672, 675, 4436, 7157"
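
As a sanity check on what the test is doing, the uncorrected p-value for GO:0071479 can be approximately reproduced from the counts in the table (4 of 6 study genes, 37 of 20,913 population genes) with a one-sided hypergeometric tail. This is a standard-library sketch, not the `fisher_scipy_stats` routine that goatools reports using. It also shows that the 13-entry study list collapses to 6 unique genes, which is the denominator goatools reports.

```python
from math import comb

study = ['TP53', 'BRCA2', 'BRCA2', 'BRCA2', 'BRCA2', 'BRCA2', 'BRCA2', 'BRCA2',
         'BRCA2', 'BRCA1', 'MSH6', 'MSH2', 'RAD51']
n_study = len(set(study))   # duplicates collapse: 13 entries -> 6 unique genes

def hypergeom_upper_tail(k, n, K, N):
    """P(X >= k) when drawing n genes from N, of which K carry the term."""
    total = comb(N, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return hits / total

# counts taken from the results table above
p = hypergeom_upper_tail(k=4, n=n_study, K=37, N=20913)
print(n_study, p)
```

The result is about 1.24e-10, in line with `p_uncorrected` above. Because the study list is collapsed to unique genes, a gene that appears eight times counts the same as a gene that appears once.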


# Directly interrogate GO

In [323]:
from io import StringIO
from urllib.parse import quote  # needed by format_go_url to escape the label

In [385]:
def format_go_url(go_term, label=None):
    
    if label:
        class_label = quote(label)
        url = f"""https://golr-aux.geneontology.io/solr/select?defType=edismax&qt=standard&indent=on&wt=csv&rows=100000&start=0&fl=bioentity_label&facet=true&facet.mincount=1&facet.sort=count&json.nl=arrarr&facet.limit=25&hl=true&hl.simple.pre=%3Cem%20class=%22hilite%22%3E&hl.snippets=1000&csv.encapsulator=&csv.separator=%09&csv.header=false&csv.mv.separator=%7C&fq=document_category:%22annotation%22&fq=isa_partof_closure:%22{go_term}%22&fq=taxon_subset_closure_label:%22Homo%20sapiens%22&fq=annotation_class_label:%22{class_label}%22&facet.field=aspect&facet.field=taxon_subset_closure_label&facet.field=type&facet.field=evidence_subset_closure_label&facet.field=regulates_closure_label&facet.field=isa_partof_closure_label&facet.field=annotation_class_label&facet.field=qualifier&facet.field=annotation_extension_class_closure_label&facet.field=assigned_by&facet.field=panther_family_label&q=*:*"""
        
    else:
        url = f"""https://golr-aux.geneontology.io/solr/select?defType=edismax&qt=standard&indent=on&wt=csv&rows=100000&start=0&fl=bioentity_label&facet=true&facet.mincount=1&facet.sort=count&json.nl=arrarr&facet.limit=25&hl=true&hl.simple.pre=%3Cem%20class=%22hilite%22%3E&hl.snippets=1000&csv.encapsulator=&csv.separator=%09&csv.header=false&csv.mv.separator=%7C&fq=document_category:%22annotation%22&fq=isa_partof_closure:%22{go_term}%22&fq=taxon_subset_closure_label:%22Homo%20sapiens%22&facet.field=aspect&facet.field=taxon_subset_closure_label&facet.field=type&facet.field=evidence_subset_closure_label&facet.field=regulates_closure_label&facet.field=isa_partof_closure_label&facet.field=annotation_class_label&facet.field=qualifier&facet.field=annotation_extension_class_closure_label&facet.field=assigned_by&facet.field=panther_family_label&q=*:*"""
        
    return url

The two examples below show that providing a class label substantially decreases the number of genes returned, since the label filter keeps only annotations made to that exact term and excludes annotations made to its child nodes.

In [387]:
# an example
url = format_go_url("GO:0050767", 'regulation of neurogenesis')
response = requests.get(url)

genes = response.text.split('\n')[:-1]
print(genes)
print(len(genes))

['LEF1', 'ANXA2', 'HEY1', 'CX3CR1', 'HES6', 'BMAL1', 'ASCL1', 'DOCK7', 'ASCL2', 'YAP1', 'IL1B', 'CTNNB1', 'CTNNB1', 'HLTF', 'HES1', 'YTHDF2', 'YTHDF2', 'HES2', 'HEY2', 'HELT', 'BHLHE40', 'CHD7', 'FEZF2', 'S100A10', 'DLL1', 'CX3CL1', 'PRUNE1', 'HES3', 'HES5', 'HES5', 'HMGB2', 'HMGB2', 'FERD3L', 'HOXB3', 'HES7', 'WNT3', 'BHLHE41', 'FEZF1', 'DLL4', 'HEYL', 'PER2', 'FXR1', 'POU4F1']
43


In [388]:
# an example
url = format_go_url("GO:0050767")
response = requests.get(url)

genes = response.text.split('\n')[:-1]
print(len(genes))

500
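
The two URLs built by `format_go_url` differ only in one extra filter query. Stripped of the faceting and highlighting parameters, the filters look like this. `build_filters` is a hypothetical helper for illustration; only the `fq` values are taken from the URLs above, and the comments reflect my reading of the field names.

```python
from urllib.parse import quote

def build_filters(go_term, label=None):
    """Return the Solr `fq` filter strings used by the two query variants."""
    filters = [
        'document_category:"annotation"',
        f'isa_partof_closure:"{go_term}"',   # the term plus its is_a/part_of descendants
        'taxon_subset_closure_label:"Homo sapiens"',
    ]
    if label:
        # restrict to annotations made directly to this term's label
        filters.append(f'annotation_class_label:"{label}"')
    return filters

full = build_filters("GO:0050767")
specific = build_filters("GO:0050767", "regulation of neurogenesis")

# percent-encode each filter the way it appears in the query string
encoded = ["fq=" + quote(f, safe=":") for f in specific]
print(len(full), len(specific))
print(encoded[-1])
```

Keeping the filters in a list makes it obvious that the label-restricted query is a strict subset of the full one.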


In [390]:
# actually run the damn thing
for index, row in go_terms.reset_index(drop = True).iterrows():

    if index % 100 == 0:
        print(index, end = ', ')

    go = row['go']
    go_id = go.split(':')[1]
    label = row['label']
    
    url_class = format_go_url(go, label)
    url_full = format_go_url(go)
    
    for url, data in zip([url_class, url_full], ['class', 'full']):
        if data == 'class':
            continue
        try:
            response = requests.get(url)
            genes = response.text.split('\n')[:-1]

            out = '\t'.join([go, label] + genes)

        except Exception:
            print(f'failed {data} for {go}')
            out = 'error'
    
        with open(f'data/cwas-results/go-gene-sets/go-api/{go_id}-{data}.txt', 'w') as outfile:
            outfile.write(out)

0, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 1100, 1200, 1300, 1400, 1500, 1600, 1700, 1800, 1900, 2000, 2100, 2200, 2300, 2400, 2500, 2600, 2700, 2800, 2900, 3000, 3100, 3200, 3300, 3400, 3500, 3600, 3700, 3800, 3900, 4000, 4100, 4200, 4300, 4400, 4500, 4600, 4700, 4800, 4900, 5000, 5100, 5200, 5300, 5400, 5500, 5600, 5700, 5800, 5900, 6000, 6100, 6200, 6300, 6400, 6500, 6600, 6700, 6800, 6900, 7000, 7100, 7200, 7300, 7400, 7500, 7600, 7700, 7800, 7900, 8000, 8100, 8200, 8300, 8400, 8500, 8600, 8700, 8800, 8900, 9000, 9100, 9200, 9300, 9400, 9500, 9600, 9700, 9800, 9900, 10000, 10100, 10200, 10300, 10400, 10500, 10600, 10700, 10800, 10900, 11000, 11100, 11200, 11300, 11400, 11500, 11600, 11700, 11800, 11900, 12000, 12100, 12200, 12300, 12400, 12500, 12600, 12700, 12800, 12900, 13000, 13100, 13200, 13300, 13400, 13500, 13600, 13700, 13800, 13900, 14000, 14100, 14200, 14300, 14400, 14500, 14600, 14700, 14800, 14900, 15000, 15100, 15200, 15300, 15400, 15500, 

In [391]:
import glob

In [392]:
lines = []
for file in glob.glob('data/cwas-results/go-gene-sets/go-api/*-full.txt'):
    # each file holds one tab-separated line: GO id, label, then genes
    with open(file) as infile:
        lines += [line.rstrip('\n').split('\t') for line in infile]

with open('data/cwas-results/go-gene-sets/go-api-gene-sets-full.txt', 'w') as out:
    for l in lines:
        out.write('\t'.join(l) + '\n')

In [410]:
def read_gene_sets(path):
    # each line is: GO id, label, then one gene per tab-separated field
    with open(path) as infile:
        rows = [line.rstrip('\r\n').split('\t') for line in infile]
    return pd.DataFrame([(r[0], r[1], r[2:]) for r in rows], columns = ['go', 'label', 'genes'])

# label-restricted gene sets; this file is not written above because the 'class' branch of the loop is skipped
short = read_gene_sets('data/cwas-results/go-gene-sets/go-api-gene-sets.txt')
full = read_gene_sets('data/cwas-results/go-gene-sets/go-api-gene-sets-full.txt')

In [411]:
go_output = short.merge(full, on = ['go', 'label'], suffixes = ['_specific', '_full'])
go_output['genes_specific'] = go_output['genes_specific'].apply(lambda s: ','.join(s))
go_output['genes_full'] = go_output['genes_full'].apply(lambda s: ','.join(s))

go_output = go_output.merge(go_terms, on = ['go', 'label'], how = 'left')

In [412]:
go_output.head(2)

Unnamed: 0,go,label,genes_specific,genes_full,num_genes_panther,level,depth,num_genes_goat
0,GO:0043012,regulation of fusion of sperm to egg plasma me...,NOX5,NOX5,1.0,4.0,6.0,1.0
1,GO:0002728,negative regulation of natural killer cell cyt...,"HLA-F,CD96","HLA-F,CD96",2.0,7.0,11.0,2.0


In [413]:
go_output.to_csv('data/cwas-results/go-gene-sets/full-go-gene-set.txt', sep ='\t', index = False)

In [407]:
go_output[go_output['label'].str.contains('neurogenesis')]

Unnamed: 0,go,label,genes_specific,genes_full
4004,GO:0050768,negative regulation of neurogenesis,"CCL11,TNF,PCM1,IL1B,WNT7A,IL6,TRIM11,DNAJB11,A...","URS0000424278_9606,CDKN2B,SEMA6D,SEMA4F,SPP1,N..."
4746,GO:0050769,positive regulation of neurogenesis,"XRCC2,CX3CR1,RGS14,KHDC3L,NUMBL,NUMBL,NUMB,NUM...","ANAPC2,ANAPC2,XRCC2,MYRF,METRN,TRPV2,MTOR,IL33..."
8676,GO:0050767,regulation of neurogenesis,"LEF1,ANXA2,HEY1,CX3CR1,HES6,BMAL1,ASCL1,DOCK7,...","URS0000424278_9606,ANAPC2,ANAPC2,LEF1,ANXA2,XR..."
10735,GO:0022008,neurogenesis,"XRCC2,WDR62,PCSK1,RARB,EGFR,CLN5,NDUFS2,NUP133...","PTK2,SSNA1,SSNA1,SSNA1,SSNA1,CDH23,P2RY12,DAB2..."
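
The gene strings above contain repeated symbols (e.g. `ANAPC2,ANAPC2`) and at least one entry that looks like an RNAcentral ID rather than a gene symbol (`URS0000424278_9606`). If downstream code needs unique symbols, something like the following could be applied to each string. `unique_genes` is a hypothetical helper, and treating the `URS` prefix as "not a gene symbol" is an assumption.

```python
def unique_genes(joined, drop_prefixes=("URS",)):
    """Split a comma-joined gene string, dropping repeats and non-symbol IDs."""
    kept = []
    seen = set()
    for gene in joined.split(","):
        if not gene or gene in seen or gene.startswith(drop_prefixes):
            continue
        seen.add(gene)
        kept.append(gene)
    return kept

# first few entries of genes_full for GO:0050767, copied from the table above
example = "URS0000424278_9606,ANAPC2,ANAPC2,LEF1,ANXA2"
cleaned = unique_genes(example)
print(cleaned)
```

Order is preserved, so the first occurrence of each symbol wins.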
