# <span style="color:#4682B4">NCBI Taxonomy Data Profile</span>
---

The NCBI Taxonomy, created in 1991, is a reference database that contains the names and hierarchically arranged phylogenetic classifications of organisms. The database is organized as a tree, where each <span style="color:#4682B4">**_node_**</span> of the tree represents a <span style="color:#4682B4">**_taxon_**</span> and each entry has a primary name, secondary names, and a unique taxonomic identifier. The NCBI Taxonomy database is critical for linking nucleotide and protein sequences from the International Nucleotide Sequence Database Collaboration (INSDC) and from other biological databases that rely on INSDC data. These linkages can be made using either the organism name or the taxonomic ID.
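As a rough illustration of the tree structure, the sketch below stores a small slice of the hierarchy as parent pointers and walks from a node towards the root. The tax IDs and names are the ones returned by the queries later in this notebook; the dictionary layout and the `path_to_root` helper are assumptions made for illustration, not part of any NCBI tooling.

```python
# Hypothetical in-memory slice of the taxonomy tree: tax ID -> (name, parent tax ID).
# The IDs and names come from the query results later in this notebook;
# the dictionary layout itself is an illustrative assumption.
TAXA = {
    "114727": ("H1N1 subtype", "11320"),
    "11320": ("Influenza A virus", "197911"),
    "197911": ("Alphainfluenzavirus", "11308"),
    "11308": ("Orthomyxoviridae", None),  # parent omitted to keep the slice small
}


def path_to_root(tax_id, taxa=TAXA):
    """Follow parent pointers from a node until the slice runs out."""
    path = []
    while tax_id is not None and tax_id in taxa:
        name, parent = taxa[tax_id]
        path.append(name)
        tax_id = parent
    return path


print(path_to_root("114727"))
```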

The database is provided remotely by the National Center for Biotechnology Information (NCBI). INSDC partners send any requests for new names to NCBI Taxonomy curators before data is released. The database incorporates phylogenetic and taxonomic knowledge from published literature, web-based databases, and the advice of sequence submitters and outside taxonomy experts.

### <span style="color:#4682B4">Why NCBI Taxonomy</span>
We chose to use the NCBI Taxonomy database because it is the <span style="color:#4682B4">**_sole source for taxonomic classification_**</span> for the INSDC and forms the backbone for many other resources at the NCBI. The NCBI Taxonomy database contains formal and informal organism names and classifications for every sequence in INSDC's datasets (more than 160,000 organisms); these associations between pathogen names and genetic and genomic data are foundational for public health intelligence efforts. The inclusion of informal names also allows us to link pathogens (and their corresponding genetic information) to case reports and other non-traditional data sources that may use names outside the codes of nomenclature (e.g. "COVID-19" instead of SARS-CoV-2). Additionally, more than 150 external partners, each with specialty datasets of their own, maintain links to the NCBI Taxonomy database.

### <span style="color:#4682B4">Accessing the NCBI Taxonomy database</span>

There are three methods for accessing the NCBI Taxonomy. The first is the NCBI Taxonomy Browser, a web page that allows users to search for organisms, visualize the hierarchy at custom levels of classification, and view summary information for an organism, such as its lineage, on a taxon-specific page.

The second is Entrez, which, in contrast to the Taxonomy Browser, supports Boolean queries and common search fields across all NCBI databases. There are also several public APIs that allow programmatic access to the Entrez databases; <span style="color:#4682B4">**_we used E-utilities,_**</span> a suite of server-side programs that accept a fixed URL syntax for search, link, and retrieval.
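To make the fixed URL syntax concrete, here is a minimal sketch that assembles an E-utilities request URL without sending it. The base URL is the one used in the cells of this notebook; the `eutils_url` helper is a hypothetical name introduced only for this example.

```python
from urllib.parse import urlencode

# Base URL used by the E-utilities calls elsewhere in this notebook.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def eutils_url(eutil, **params):
    """Build (but do not send) an E-utilities request URL.

    `eutil` is the utility name, e.g. "esearch" or "efetch";
    `params` become the URL-encoded query string.
    """
    return f"{EUTILS_BASE}/{eutil}.fcgi?{urlencode(params)}"


print(eutils_url("esearch", db="taxonomy", term="orthomyxoviridae"))
```

Every utility follows the same pattern: the utility name selects the server-side program, and the query string carries the database and search parameters.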

A third option is to download the complete database as a full text taxdump file (in .dmp format), which is updated every hour on the site. 
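For readers who take the download route, the sketch below shows one way to split a single `.dmp` record into fields. It assumes the usual taxdump layout, in which fields are separated by a tab, a pipe, and a tab, and each record ends with a tab and a pipe; the sample line is hand-written for illustration rather than copied from the real file, and `parse_dmp_line` is a hypothetical helper.

```python
def parse_dmp_line(line):
    """Split one taxdump record into a list of field strings.

    Assumes fields are delimited by "\t|\t" and the record ends with "\t|".
    """
    line = line.rstrip("\n")
    if line.endswith("\t|"):
        line = line[:-2]  # drop the record terminator
    return line.split("\t|\t")


# Hand-written sample in the style of a names file record:
# tax ID, name, unique name (empty here), name class.
sample = "11320\t|\tInfluenza A virus\t|\t\t|\tscientific name\t|\n"
print(parse_dmp_line(sample))
```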

This notebook examines the information stored within the NCBI Taxonomy and the quality of metadata that is retrievable via publicly available APIs.

### <span style="color:gray">_Extracting data from NCBI Taxonomy database_</span>

Entrez spans dozens of databases (the query below returned 42 at the time it was run). To return a list of all Entrez database names and identify the one we want to query, we use the following:


In [57]:
entrez_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi"
entrez_response = requests.get(entrez_url)
all_db = BeautifulSoup(entrez_response.content, features="xml").getText()
print(all_db)



pubmed
protein
nuccore
ipg
nucleotide
structure
genome
annotinfo
assembly
bioproject
biosample
blastdbinfo
books
cdd
clinvar
gap
gapplus
grasp
dbvar
gene
gds
geoprofiles
homologene
medgen
mesh
ncbisearch
nlmcatalog
omim
orgtrack
pmc
popset
proteinclusters
pcassay
protfam
pccompound
pcsubstance
seqannot
snp
sra
taxonomy
biocollections
gtr
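Printing the text of the whole response works, but the database names can also be collected into a Python list. The sketch below does this with the standard library on a hand-written, abridged sample that assumes the response wraps each name in a `DbName` element; `list_databases` is a hypothetical helper, and the sample is not a real server response.

```python
import xml.etree.ElementTree as ET

# Abridged, hand-written sample in the assumed shape of an einfo response.
SAMPLE_EINFO = """
<eInfoResult>
  <DbList>
    <DbName>pubmed</DbName>
    <DbName>taxonomy</DbName>
  </DbList>
</eInfoResult>
"""


def list_databases(xml_text):
    """Return the text of every DbName element in document order."""
    root = ET.fromstring(xml_text.strip())
    return [element.text for element in root.iter("DbName")]


databases = list_databases(SAMPLE_EINFO)
print(databases)
print("taxonomy" in databases)
```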




To gather statistics about the <span style="color:#4682B4">**_taxonomy_**</span> database, we created a Python module, <span style="color:#4682B4">**_ncbi._**</span> This module is composed of three main functions, demonstrated below: <span style="color:#4682B4">**_api_soup_**</span>, <span style="color:#4682B4">**_id_search_**</span>, and <span style="color:#4682B4">**_get_metadata._**</span>

In [4]:
import sys

print("Checking python executable path (make sure it's the right virtualenv)")
print(sys.executable)
# append the path of the parent directory
sys.path.append("..")

Checking python executable path (make sure it's the right virtualenv)
/Users/haileyrobertson/Documents/GitHub/kr2-graph/build_graph_2/notebooks/env/bin/python


In [24]:
# import ncbi
from loguru import logger
import requests
from bs4 import BeautifulSoup
from bs4 import Tag

In [25]:
# Use E-Utils API to access data in XML format
def api_soup(eutil, params):
    url = f"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/{eutil}.fcgi"
    # Pass the query parameters by keyword so the call is unambiguous.
    response = requests.get(url, params=params)
    soup = BeautifulSoup(response.content, features="xml")

    return soup

In [26]:
# Get ID from text name
def id_search(name):
    logger.info(f"Searching ncbi for term {name}")

    params = {"db": "Taxonomy", "term": name}

    soup = api_soup("esearch", params)

    try:
        ncbi_id = soup.find("Id").getText()

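    except AttributeError:
        # No <Id> element came back; log any diagnostics the API returned.
        # Either list may be missing from the response, so check for None
        # before iterating over its children.
        errors = soup.find("ErrorList")
        warnings = soup.find("WarningList")

        for error in errors.children if errors is not None else []:
            logger.error(f"{error.name}: {error.getText()}")

        for warning in warnings.children if warnings is not None else []:
            logger.warning(f"{warning.name}: {warning.getText()}")

        return None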

    return ncbi_id

In [27]:

# Get NCBI metadata from ID
def get_metadata(ncbi_id):
    params = {"db": "Taxonomy", "id": ncbi_id}
    soup = api_soup("efetch", params)

    taxon = soup.TaxaSet.Taxon

    taxon_metadata = {
        "ScientificName": taxon.ScientificName.getText(),
        "ParentTaxId": taxon.ParentTaxId.getText(),
        "Rank": taxon.Rank.getText(),
        "Division": taxon.Division.getText(),
        "GeneticCode": {"GCId": taxon.GCId.getText(), "GCName": taxon.GCName.getText()},
        "MitoGeneticCode": {
            "MGCId": taxon.MGCId.getText(),
            "MGCName": taxon.MGCName.getText(),
        },
        "Lineage": taxon.Lineage.getText(),
        "CreateDate": taxon.CreateDate.getText(),
        "UpdateDate": taxon.UpdateDate.getText(),
        "PubDate": taxon.PubDate.getText(),
        # "LineageEx":taxon.LineageEx.getText(),
    }

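    # Secondary names are optional, so only add them when the element exists.
    if taxon.OtherNames:
        taxon_metadata["OtherNames"] = taxon.OtherNames.getText()

    # Collect each ancestor's ID, name, and rank, skipping whitespace text nodes.
    lineage_ex = []
    for ancestor in taxon.LineageEx.children:
        if isinstance(ancestor, Tag):
            lineage_ex.append(
                {
                    "TaxId": ancestor.TaxId.getText(),
                    "ScientificName": ancestor.ScientificName.getText(),
                    "Rank": ancestor.Rank.getText(),
                }
            )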

    taxon_metadata["LineageEx"] = lineage_ex

    return taxon_metadata


### <span style="color:gray">_Accessing what is in the NCBI Taxonomy database_</span>

The NCBI Taxonomy documentation indicates that the data model is built around a central framework called NameBank, and each entry includes a <span style="color:#4682B4">**_primary name_**</span>, <span style="color:#4682B4">**_secondary names_**</span>, a <span style="color:#4682B4">**_taxonomy identifier_**</span>, and <span style="color:#4682B4">**_name entity identifiers_**</span>, along with various other metadata about the lineage, genetic code, and linked Entrez records.

We first tested our database access using a sample query with the 'esearch' utility for the Entrez NCBI Taxonomy database. 

<span style="color:#4682B4">**_Input: Entrez text query (&term); Entrez database (&db)_**</span>

<span style="color:#4682B4">**_Expected Output: List of UIDs matching the Entrez query_**</span>


Example: Get the taxonomic IDs (TaxIDs) for 'influenza A subtype h1n1', 'alphainfluenzavirus', and 'orthomyxoviridae':

In [45]:
term_list = ['influenza A subtype h1n1','alphainfluenzavirus','orthomyxoviridae']
id_list = []

for each in term_list:
    id_list.append(id_search(each))

print(id_list)

2022-05-04 13:41:57.875 | INFO     | __main__:id_search:3 - Searching ncbi for term influenza A subtype h1n1
2022-05-04 13:41:58.159 | INFO     | __main__:id_search:3 - Searching ncbi for term alphainfluenzavirus
2022-05-04 13:41:58.308 | INFO     | __main__:id_search:3 - Searching ncbi for term orthomyxoviridae


['114727', '197911', '11308']
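Note that `id_search` keeps only the first `Id` element in the response. When a term could match several taxa, it may be useful to collect every returned ID instead. The sketch below does so on a hand-written, abridged sample that assumes the response lists matches as `Id` elements inside an `IdList`; `all_ids` is a hypothetical helper, not part of the `ncbi` module.

```python
import xml.etree.ElementTree as ET

# Abridged, hand-written sample in the assumed shape of an esearch response.
SAMPLE_ESEARCH = """
<eSearchResult>
  <Count>1</Count>
  <IdList>
    <Id>197911</Id>
  </IdList>
</eSearchResult>
"""


def all_ids(xml_text):
    """Return every Id in the response; an empty list means no match."""
    root = ET.fromstring(xml_text.strip())
    return [element.text for element in root.iter("Id")]


print(all_ids(SAMPLE_ESEARCH))
```

Returning an empty list for a term with no matches lets the caller distinguish "no hit" from "one hit" without catching an exception.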


We can then pass each of these IDs to the 'efetch' utility to retrieve the full record for that taxon from the NCBI Taxonomy database in XML format.

<span style="color:#4682B4">**_Input: List of UIDs (&id); Entrez database (&db); Retrieval type (&rettype); Retrieval mode (&retmode)_**</span>

<span style="color:#4682B4">**_Expected Output: Formatted data records as specified_**</span>


In [50]:
for each in id_list:
    print('\n' + str(get_metadata(each)))


{'ScientificName': 'H1N1 subtype', 'ParentTaxId': '11320', 'Rank': 'serotype', 'Division': 'Viruses', 'GeneticCode': {'GCId': '1', 'GCName': 'Standard'}, 'MitoGeneticCode': {'MGCId': '0', 'MGCName': 'Unspecified'}, 'Lineage': 'Viruses; Riboviria; Orthornavirae; Negarnaviricota; Polyploviricotina; Insthoviricetes; Articulavirales; Orthomyxoviridae; Alphainfluenzavirus; Influenza A virus', 'CreateDate': '2000/02/07 12:42:00', 'UpdateDate': '2020/04/07 15:24:16', 'PubDate': '2000/02/07 12:42:00', 'LineageEx': [{'TaxId': '10239', 'ScientificName': 'Viruses', 'Rank': 'superkingdom'}, {'TaxId': '2559587', 'ScientificName': 'Riboviria', 'Rank': 'clade'}, {'TaxId': '2732396', 'ScientificName': 'Orthornavirae', 'Rank': 'kingdom'}, {'TaxId': '2497569', 'ScientificName': 'Negarnaviricota', 'Rank': 'phylum'}, {'TaxId': '2497571', 'ScientificName': 'Polyploviricotina', 'Rank': 'subphylum'}, {'TaxId': '2497577', 'ScientificName': 'Insthoviricetes', 'Rank': 'class'}, {'TaxId': '2499411', 'Scientific

The returned metadata is defined below:

| Field name | Content description |
| :-- | :-- |
| `ScientificName` | scientific name of the taxon, validly published with respect to the relevant code of nomenclature |
| `ParentTaxId` | taxonomic identifier of the parent node in the hierarchy |
| `Rank` | Linnaean rank assigned to the taxon, where possible |
| `Division` | NCBI Taxonomy division the taxon belongs to (e.g. `Viruses`) |
| `GeneticCode` | identifier (`GCId`) and name (`GCName`) of the genetic code assigned to the taxon |
| `MitoGeneticCode` | identifier (`MGCId`) and name (`MGCName`) of the mitochondrial genetic code, where applicable |
| `Lineage` | semicolon-separated scientific names of the taxon's ancestors, from the root down to its parent |
| `LineageEx` | list of the same ancestors, each with its `TaxId`, `ScientificName`, and `Rank` |
| `CreateDate` | date and time the record was created |
| `UpdateDate` | date and time the record was last updated |
| `PubDate` | date and time the record was made public |
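Because `LineageEx` keeps the rank of each ancestor, it can be used to look up a taxon's ancestor at a given level. The sketch below runs on a shortened copy of the H1N1 record printed above; `ancestor_at_rank` is a hypothetical helper and assumes the dictionary shape returned by `get_metadata`.

```python
# Shortened copy of the H1N1 record shown above (first four ancestors only).
sample_metadata = {
    "ScientificName": "H1N1 subtype",
    "LineageEx": [
        {"TaxId": "10239", "ScientificName": "Viruses", "Rank": "superkingdom"},
        {"TaxId": "2559587", "ScientificName": "Riboviria", "Rank": "clade"},
        {"TaxId": "2732396", "ScientificName": "Orthornavirae", "Rank": "kingdom"},
        {"TaxId": "2497569", "ScientificName": "Negarnaviricota", "Rank": "phylum"},
    ],
}


def ancestor_at_rank(metadata, rank):
    """Return the first ancestor with the given rank, or None if absent."""
    for ancestor in metadata.get("LineageEx", []):
        if ancestor["Rank"] == rank:
            return ancestor
    return None


phylum = ancestor_at_rank(sample_metadata, "phylum")
print(phylum["TaxId"], phylum["ScientificName"])
```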

