### Programming for Biomedical Informatics
#### Week 9 - Functional Analysis Using Ontologies

Ontologies are commonly used to aid the interpretation of molecular data, most often through functional annotation of genes and proteins with the Gene Ontology, combined with downstream enrichment analysis using tools such as GSEA, as we have discussed. Ontologies are also used to align unstructured data with a domain, for example by looking for words or phrases that can be mapped to classes in an ontology. One example is searching patient discharge summaries for terms from clinical terminologies.

In this notebook we will perform some basic phenotype extraction from publication abstracts, attempting to find examples of HPO terms that are associated with mentions of particular diseases.

To do this we will select up to 1000 papers from PubMed that are tagged with the MeSH Major Topic "Autism Spectrum Disorder", retrieve their titles and abstracts, and then search them for phenotypes and genes.

In [None]:
'''
We are going to use en_core_sci_sm, a biomedical named entity recognition model
from scispaCy, which is developed by the Allen Institute for biomedical text processing
https://allenai.github.io/scispacy/
'''

# %pip install scispacy
# %pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.4/en_core_sci_sm-0.5.4.tar.gz

In [None]:
# Setup for NLP

import pandas as pd
import spacy
from scispacy.linking import EntityLinker
from scispacy.abbreviation import AbbreviationDetector

# suppress warnings
import warnings
warnings.filterwarnings('ignore')

In [None]:
'''When this is first run it will download some large data files to perform the UMLS linking,
approx 2GB in total. On subsequent runs the model will take about 1 minute to load'''

# load the model
nlp = spacy.load("en_core_sci_sm");

# add abbreviations detector
nlp.add_pipe("abbreviation_detector");

# add UMLS entity-linker
nlp.add_pipe("scispacy_linker", config={"resolve_abbreviations": True,"linker_name": "umls","filter_for_definitions": False});

In [None]:
# let's look at a simple example
queryText = 'PAX6 and GLI3 is a highly conserved transcription factor that plays a critical role in eye development in all animals. Mutations in the PAX6 gene are associated with aniridia, a congenital eye malformation characterized by the absence of the iris and other eye abnormalities.'

concepts = dict()

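# run the NLP pipeline and keep the first UMLS candidate for each entity
doc = nlp(queryText)
for entity in doc.ents:
    # kb_ents is a list of (CUI, score) tuples; it is empty when nothing was linked,
    # so skip those entities instead of letting one failure end the whole loop
    if not entity._.kb_ents:
        continue
    concept, score = entity._.kb_ents[0]
    concepts[entity.text] = concept
    print(entity.text, concept, score)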
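
The cell above takes the candidate at position 0 of `entity._.kb_ents` for each entity. As a minimal sketch of that selection step, the helper below works on plain `(CUI, score)` tuples so it can be tried without loading the model. The name `best_candidate` and the `min_score` threshold are illustrative assumptions, not part of scispaCy, and the CUIs and scores are placeholders rather than real linker output.

```python
from typing import List, Optional, Tuple

Candidate = Tuple[str, float]  # (CUI, similarity score), the same shape as kb_ents entries


def best_candidate(candidates: List[Candidate], min_score: float = 0.0) -> Optional[Candidate]:
    """Return the highest-scoring candidate at or above min_score, or None."""
    eligible = [c for c in candidates if c[1] >= min_score]
    if not eligible:
        return None
    return max(eligible, key=lambda c: c[1])


# placeholder candidates, not real linker output
example = [("C0000001", 0.72), ("C0000002", 0.91)]

print(best_candidate(example))        # highest-scoring pair
print(best_candidate(example, 0.95))  # nothing passes the threshold
print(best_candidate([]))             # entity with no candidates
```

A threshold like this is one way to trade recall for precision when the linker returns weak matches; whether it helps on these abstracts would need checking.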

In [None]:
# the human phenotype ontology contains database_cross_reference entries that include the UMLS concept id
# we can use this to link the HPO terms to the UMLS concepts

import urllib.request
import pronto

# fetch the Human Phenotype Ontology OBO file and parse it with pronto
current_hpo_url = 'http://purl.obolibrary.org/obo/hp.obo'

# download the file
urllib.request.urlretrieve(current_hpo_url,'hpo.obo');

# parse the file
hpo = pronto.Ontology('hpo.obo')


In [None]:
# we can look in the xrefs (i.e. cross-references) of a term to find the UMLS concept id
def hpo2concept(hpo_id):
    term = hpo[hpo_id]
    # UMLS xref ids have the form 'UMLS:<CUI>'; return the CUI of the first one, if any
    for xref in term.xrefs:
        if xref.id.startswith('UMLS:'):
            return xref.id.split(':')[1]
    return None

# let's test this function
hpo2concept('HP:0001695')
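
The xref lookup can also be tried on plain strings, without downloading the ontology. The sketch below assumes xref ids of the form `PREFIX:LOCAL_ID`; the function name `first_umls_cui` and the identifiers are placeholders for illustration only.

```python
from typing import Iterable, Optional


def first_umls_cui(xref_ids: Iterable[str]) -> Optional[str]:
    """Return the CUI from the first 'UMLS:<CUI>' entry, or None if there is none."""
    for xref_id in xref_ids:
        # partition splits on the first ':' only, so the local id is kept whole
        prefix, _, local_id = xref_id.partition(":")
        if prefix == "UMLS" and local_id:
            return local_id
    return None


# placeholder identifiers, not real cross-references
print(first_umls_cui(["SNOMEDCT_US:0000000", "UMLS:C0000001"]))
print(first_umls_cui(["MSH:D000000"]))
```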

In [None]:
# we're now going to brute force the conversion of all HPO terms to UMLS concepts
hpo2umls = {term.id:hpo2concept(term.id) for term in hpo.terms()}

In [None]:
# look at the first 10 entries
list(hpo2umls.items())[:10]

In [None]:
# let's see if any of the CUIs from our test sentence have been mapped to HPO terms
for entity in concepts.keys():
    concept = concepts[entity]
    if concept in hpo2umls.values():
        print(entity, concept, [hpo[k] for k,v in hpo2umls.items() if v == concept])
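
The lookup above scans every entry of `hpo2umls` for each concept, which becomes slow when it is repeated for every entity in hundreds of abstracts. One possible alternative, sketched here on toy data, is to invert the dictionary once so that each CUI points directly at its HPO ids. The name `invert_mapping` and the identifiers are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Optional


def invert_mapping(term_to_cui: Dict[str, Optional[str]]) -> Dict[str, List[str]]:
    """Build a CUI -> [term ids] index, ignoring terms without a CUI."""
    cui_to_terms = defaultdict(list)
    for term_id, cui in term_to_cui.items():
        if cui is not None:
            cui_to_terms[cui].append(term_id)
    return dict(cui_to_terms)


# toy stand-in for hpo2umls, using placeholder identifiers
toy_hpo2umls = {"HP:9000001": "C0000001", "HP:9000002": None, "HP:9000003": "C0000001"}
umls2hpo = invert_mapping(toy_hpo2umls)

# a single dictionary lookup replaces the scan over all items
print(umls2hpo.get("C0000001", []))
print(umls2hpo.get("C0000009", []))
```

Note that more than one HPO term can share a CUI, which is why the index stores a list rather than a single id.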

In [None]:
'''
As a niche example (not a mainstream ontology) we will look at the ASDPTO ontology which
has been developed as a custom ontology for autism spectrum disorder.

For every term in the ASDPTO ontology, we will look for UMLS concepts.

NB we are limited to the work done by ASDPTO curators in adding annotations to the terms
'''

current_asdpto_url = 'https://data.bioontology.org/ontologies/ASDPTO/submissions/1/download?apikey=4a2fbff0-ef88-432e-b1a1-dffc07e71146'

# download the file
urllib.request.urlretrieve(current_asdpto_url,'autism.obo');

# parse the file
autism  = pronto.Ontology('autism.obo')


In [None]:
# function to find the UMLS concept for a term in ASDPTO
def find_concept(term):
    for annotation in term.annotations:
        try:
            # if the string contains a cui= then it is a UMLS concept
            # extract the CUI
            if 'cui=' in annotation.resource:
                #split the string on 'cui=' and take the remainder
                concept = annotation.resource.split('cui=')[1]
                return(concept)
        except:
            pass

# we can now use this function to find the UMLS concept for each term in the ASDPTO ontology
asdpto2umls = {term.name:find_concept(term) for term in autism.terms()}

# remove any None entries
asdpto2umls = {k:v for k,v in asdpto2umls.items() if v is not None}

# print how many terms have been mapped to UMLS concepts
print(f'There are {len(asdpto2umls)} terms in the ASDPTO ontology that have been mapped to UMLS concepts')

# look at the first 10 entries
list(asdpto2umls.items())[:10]
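
Splitting on `cui=` returns everything after the marker, so any trailing text in the annotation would end up in the concept id. We have not checked whether the ASDPTO annotations contain such text, but since a CUI is the letter C followed by seven digits, a stricter extraction can be sketched with a regular expression. The name `extract_cui` and the example URL are assumptions for illustration.

```python
import re
from typing import Optional

# a CUI is 'C' followed by seven digits; capture it when it follows 'cui='
CUI_PATTERN = re.compile(r"cui=(C\d{7})")


def extract_cui(resource: str) -> Optional[str]:
    """Return the CUI that follows 'cui=' in the string, or None if there is none."""
    match = CUI_PATTERN.search(resource)
    return match.group(1) if match else None


# hypothetical annotation strings
print(extract_cui("https://example.org/lookup?cui=C0000001&lang=en"))
print(extract_cui("https://example.org/lookup?id=123"))
```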

In [None]:
'''Now that we have all the NLP components in place to identify and map HPO terms let's now fetch the data and perform the analysis'''

In [None]:
# let's use our knowledge of eUtils to fetch the raw material for our analysis
# we will use urllib to call the eUtils API
# and the xml library to parse the results

import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# load my API key from the file
with open('../api_keys/ncbi.txt', 'r') as file:
    api_key = file.read().strip()

with open('../api_keys/ncbi_email.txt', 'r') as file:
    email = file.read().strip()

# the field tag goes outside the quoted phrase
pubmed_query = '"Autism Spectrum Disorder"[Majr]'

# Define the parameters for the eSearch request
esearch_params = {
    'db': 'pubmed',
    'term': pubmed_query,
    'api_key': api_key,
    'email': email,
    'usehistory': 'y'
}

# encode the parameters so they can be passed to the API
encoded_data = urllib.parse.urlencode(esearch_params).encode('utf-8')

# the base request url for eSearch
url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# make the request
request = urllib.request.Request(url, data=encoded_data)
response = urllib.request.urlopen(request)

# read into an XML object
esearch_data_XML = ET.fromstring(response.read())

# print the number of results
count = esearch_data_XML.find('Count').text
print(f'Total number of results: {count}')

# Extract WebEnv and QueryKey
webenv = esearch_data_XML.find('WebEnv').text
query_key = esearch_data_XML.find('QueryKey').text

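# Define the parameters for the eFetch request
efetch_params = {
    'db': 'pubmed',
    'query_key': query_key,
    'WebEnv': webenv,
    'retmode': 'xml',  # ask for XML explicitly, since we parse it with ElementTree
    'retmax': '1000',
    'api_key': api_key,
    'email': email
}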

# encode the parameters so they can be passed to the API
encoded_data = urllib.parse.urlencode(efetch_params).encode('utf-8')

# the base request url for eFetch
url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

# make the request
request = urllib.request.Request(url, data=encoded_data)
response = urllib.request.urlopen(request)

# read into an XML object
efetch_data_XML = ET.fromstring(response.read())

# let's look at the first 10 articles
for article in efetch_data_XML.findall('.//PubmedArticle')[:10]:
    pmid = article.findtext('.//PMID')
    title = article.findtext('.//ArticleTitle')
    # findtext returns None when an article has no abstract, instead of raising
    abstract = article.findtext('.//AbstractText')
    print(f'{pmid}: {title}')
    print(f'Abstract: {abstract}')

In [None]:
# for each article check whether it has an abstract and a title
# if it does, combine the title and abstract into a single string
# if it doesn't, skip it
articles = dict()

for article in efetch_data_XML.findall('.//PubmedArticle'):
    try:
        pmid = article.find('.//PMID')
        title = article.find('.//ArticleTitle')
        abstract = article.find('.//AbstractText')
        tiab = title.text + ' ' + abstract.text
        articles[pmid.text] = tiab
    except (AttributeError, TypeError):
        # a missing element (None) or an element without text ends up here
        continue

print(f'Number of articles with abstracts: {len(articles)}')
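
One limitation of the cell above is that `find('.//AbstractText')` returns only the first `AbstractText` element, and `.text` stops at the first nested tag. Structured abstracts have several labelled sections, so later sections are missed. The sketch below shows one way to collect all of the text, using a hand-written record in the shape of PubMed XML rather than a real article. The helper names `element_text` and `title_and_abstract` are our own.

```python
import xml.etree.ElementTree as ET

# hand-written record in the shape of PubMed XML, not a real article
SAMPLE_XML = """
<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>0</PMID>
      <Article>
        <ArticleTitle>A <i>placeholder</i> title.</ArticleTitle>
        <Abstract>
          <AbstractText Label="BACKGROUND">First section.</AbstractText>
          <AbstractText Label="RESULTS">Second section.</AbstractText>
        </Abstract>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>
"""


def element_text(element):
    """Join all text inside an element, including nested tags, with tidy spacing."""
    return " ".join("".join(element.itertext()).split())


def title_and_abstract(article):
    """Return the title plus every abstract section, or None if either is missing."""
    title = article.find(".//ArticleTitle")
    sections = article.findall(".//AbstractText")
    if title is None or not sections:
        return None
    parts = [element_text(title)] + [element_text(section) for section in sections]
    return " ".join(parts)


root = ET.fromstring(SAMPLE_XML)
tiab = title_and_abstract(root.find(".//PubmedArticle"))
print(tiab)
```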

In [None]:
# let's write a function based on the code above to perform NER on an article
# and return the ids of the HPO terms it mentions
def nlp_article(article):

    # case of no text
    if not article:
        return []

    current_article_concepts = dict()

    doc = nlp(article)
    for entity in doc.ents:
        # skip entities that could not be linked to UMLS
        if not entity._.kb_ents:
            continue
        concept, score = entity._.kb_ents[0]
        current_article_concepts[entity.text] = concept

    hpo_terms = []

    for entity, concept in current_article_concepts.items():
        # HPO terms whose UMLS cross-reference matches this concept
        current_hpo = [hpo[k] for k,v in hpo2umls.items() if v == concept]
        if current_hpo:
            print(entity, concept, current_hpo)
            hpo_terms.append(current_hpo[0].id)
    return list(set(hpo_terms))

In [None]:
# now we can apply this function to all the articles
articles_hpo = {pmid: nlp_article(article) for pmid, article in articles.items()}

In [None]:
# let's look at the results from the first 10 articles
# (print is needed because only the last expression in a cell is displayed)
print(list(articles_hpo.items())[:10])

# what percentage of articles have HPO terms
articles_with_hpo = [k for k,v in articles_hpo.items() if v]
print(f'Percentage of articles with HPO terms: {len(articles_with_hpo)/len(articles)*100:.2f}%')

In [None]:
# find the unique HPO terms found in the articles
unique_hpo_terms = list(set([term for terms in articles_hpo.values() for term in terms]))

# create a dataframe to store the data
df = pd.DataFrame(index=articles.keys(), columns=unique_hpo_terms)

# fill the dataframe
for pmid, terms in articles_hpo.items():
    df.loc[pmid, terms] = 1

# fill the NaN values with 0
df.fillna(0, inplace=True)

# print the first 5 rows
df.head()

In [None]:
# count the number of times each term appears and store this in a dataframe with columns
# 'HPO Term Name', 'HPO Term', 'Count' and sort by count
hpo_counts = df.sum().sort_values(ascending=False).reset_index()
hpo_counts.columns = ['HPO Term', 'Count']
hpo_counts['HPO Term Name'] = [hpo[term].name for term in hpo_counts['HPO Term']]

# use PrettyTable to display the data
from prettytable import PrettyTable

table = PrettyTable()
table.field_names = hpo_counts.columns
for row in hpo_counts.itertuples(index=False):
    table.add_row(row)
print(table)
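
If only the per-term counts are needed, the intermediate DataFrame can be skipped. As a small sketch, `collections.Counter` can count how many articles mention each term directly from a dictionary shaped like `articles_hpo`. The toy dictionary below uses placeholder ids.

```python
from collections import Counter

# toy stand-in for articles_hpo: PMID -> list of HPO ids found in that article
toy_articles_hpo = {
    "pmid_a": ["HP:9000001", "HP:9000002"],
    "pmid_b": ["HP:9000001"],
    "pmid_c": [],
}

# set() guards against counting a term twice for the same article
term_counts = Counter(
    term for terms in toy_articles_hpo.values() for term in set(terms)
)

# most_common() returns (term, count) pairs sorted by descending count
for term, count in term_counts.most_common():
    print(term, count)
```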

In [None]:
# optional additional exercise

import pandas as pd

# let's specifically look at all the DDG2P papers from PubMed
# read in the DDG2P data file
ddg2p = pd.read_csv('./data/DDG2P.csv')
print(ddg2p.head())

# get the list of unique PMIDs from the 'pmids' column
# dropna() guards against rows without PMIDs, which would otherwise break the join below
pmids = ddg2p['pmids'].dropna().str.split(';').explode().str.strip().unique()

# how many unique PMIDs are there
print(f'Number of unique PMIDs: {len(pmids)}')

# randomly select 1000 of these
import random
random.seed(42)
pmids_sample = random.sample(list(pmids), 1000)

import urllib.request
import xml.etree.ElementTree as ET

# load my API key from the file
with open('../api_keys/ncbi.txt', 'r') as file:
    api_key = file.read().strip()

with open('../api_keys/ncbi_email.txt', 'r') as file:
    email = file.read().strip()

# fetch the data for these PMIDs
# Define the parameters for the eFetch request
efetch_params = {
    'db': 'pubmed',
    'id': ','.join(pmids_sample),
    'retmode': 'xml',  # ask for XML explicitly, since we parse it with ElementTree
    'api_key': api_key,
    'email': email
}

# encode the parameters so they can be passed to the API
encoded_data = urllib.parse.urlencode(efetch_params).encode('utf-8')

# the base request url for eFetch
url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

# make the request
request = urllib.request.Request(url, data=encoded_data)
response = urllib.request.urlopen(request)

# read into an XML object
efetch_data_XML = ET.fromstring(response.read())

# for each article check whether it has an abstract and a title
# if it does, combine the title and abstract into a single string
# if it doesn't, skip it
articles = dict()

for article in efetch_data_XML.findall('.//PubmedArticle'):
    try:
        pmid = article.find('.//PMID')
        title = article.find('.//ArticleTitle')
        abstract = article.find('.//AbstractText')
        tiab = title.text + ' ' + abstract.text
        articles[pmid.text] = tiab
    except (AttributeError, TypeError):
        # a missing element (None) or an element without text ends up here
        continue

print(f'Number of articles with abstracts: {len(articles)}')

# print the first 10 articles
for pmid, tiab in list(articles.items())[:10]:
    print(f'{pmid}: {tiab}')
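
The cell above sends all 1000 PMIDs in a single POST request. If a request of that size proves slow or unreliable, one option is to split the ids into smaller batches and call eFetch once per batch. The sketch below shows only the batching step; the helper name `batched` and the batch size of 200 are assumptions, and the ids are generated placeholders.

```python
from typing import Iterator, List, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of items, each with at most batch_size elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


# placeholder ids standing in for pmids_sample
toy_pmids = [str(n) for n in range(1, 451)]

# one comma-separated 'id' parameter per eFetch request
id_params = [",".join(batch) for batch in batched(toy_pmids, 200)]
print(len(id_params))
print([len(param.split(",")) for param in id_params])
```

Each string in `id_params` could then be passed as the `id` value of its own request, with the parsed `PubmedArticle` elements collected across batches.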



In [None]:
# now we can apply this function to all the articles
articles_hpo = {pmid: nlp_article(article) for pmid, article in articles.items()}

In [None]:
# let's look at the results from the first 10 articles
# (print is needed because only the last expression in a cell is displayed)
print(list(articles_hpo.items())[:10])

# what percentage of articles have HPO terms
articles_with_hpo = [k for k,v in articles_hpo.items() if v]
print(f'Percentage of articles with HPO terms: {len(articles_with_hpo)/len(articles)*100:.2f}%')

In [None]:
# find the unique HPO terms found in the articles
unique_hpo_terms = list(set([term for terms in articles_hpo.values() for term in terms]))

# create a dataframe to store the data
df = pd.DataFrame(index=articles.keys(), columns=unique_hpo_terms)

# fill the dataframe
for pmid, terms in articles_hpo.items():
    df.loc[pmid, terms] = 1

# fill the NaN values with 0
df.fillna(0, inplace=True)

# print the first 5 rows
df.head()

In [None]:
# count the number of times each term appears and store this in a dataframe with columns
# 'HPO Term Name', 'HPO Term', 'Count' and sort by count
hpo_counts = df.sum().sort_values(ascending=False).reset_index()
hpo_counts.columns = ['HPO Term', 'Count']
hpo_counts['HPO Term Name'] = [hpo[term].name for term in hpo_counts['HPO Term']]

# use PrettyTable to display the data
from prettytable import PrettyTable

table = PrettyTable()
table.field_names = hpo_counts.columns
for row in hpo_counts.itertuples(index=False):
    table.add_row(row)
print(table)