# <span style="color:#4682B4">NCBI Taxonomy Data Profile</span>
---

The NCBI Taxonomy, created in 1991, is a reference database that contains the names and hierarchically arranged phylogenetic classifications of organisms. The database is organized as a tree, where each <span style="color:#4682B4">**_node_**</span> represents a <span style="color:#4682B4">**_taxon_**</span> and each entry has a primary name, secondary names, and a unique taxonomic identifier. The NCBI Taxonomy database is critical for linking nucleotide and protein sequences from the International Nucleotide Sequence Database Collaboration (INSDC) and from other biological databases that rely on INSDC data. These linkages can be made using either the organism name or the taxonomic ID.

The database is provided remotely by the National Center for Biotechnology Information (NCBI). INSDC partners send any requests for new names to NCBI Taxonomy curators before data is released. The database incorporates phylogenetic and taxonomic knowledge from published literature, web-based databases, and the advice of sequence submitters and outside taxonomy experts.

### <span style="color:#4682B4">Why NCBI Taxonomy</span>
We chose the NCBI Taxonomy database because it is the <span style="color:#4682B4">**_sole source for taxonomic classification_**</span> for the INSDC and forms the backbone for many other resources at the NCBI. It contains formal and informal organism names and classifications for every sequence in INSDC's datasets (more than 160,000 organisms); these associations between pathogen names and genetic and genomic data are foundational for public health intelligence efforts. The inclusion of informal names also allows us to link pathogens (and their corresponding genetic information) to case reports and other non-traditional data sources that may use names outside the codes of nomenclature (e.g., "COVID-19" instead of SARS-CoV-2). Additionally, more than 150 external partners maintain links to the NCBI Taxonomy database from specialty datasets of their own.

### <span style="color:#4682B4">Accessing the NCBI Taxonomy database</span>

There are three methods for accessing the NCBI Taxonomy. The first is the NCBI Taxonomy Browser, a web page that allows users to search for organisms, visualize the hierarchy at custom levels of classification, and view summary information, such as lineage, on a taxon-specific page.

The second is Entrez, which, in contrast to the Taxonomy Browser, supports Boolean queries and common search fields across all NCBI databases. Several public APIs also allow programmatic access to the Entrez databases; <span style="color:#4682B4">**_we used E-utilities,_**</span> a suite of server-side programs that accept a fixed URL syntax for search, link, and retrieval.
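The fixed URL syntax can be illustrated with a short sketch. The base URL is the one used by the cells later in this notebook; the helper name `build_eutils_url` is hypothetical and is not part of the project's `ncbi` package.

```python
from urllib.parse import urlencode

# Base URL shared by the E-utilities calls in this notebook
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_eutils_url(utility, params):
    """Return the request URL for one E-utility (hypothetical helper).

    utility: e.g. "einfo", "esearch", "efetch", "espell"
    params:  query-string parameters such as {"db": "taxonomy"}
    """
    query = urlencode(params)  # percent-encodes values; spaces become "+"
    url = f"{EUTILS_BASE}/{utility}.fcgi"
    return f"{url}?{query}" if query else url


print(build_eutils_url("einfo", {}))
print(build_eutils_url("esearch", {"db": "taxonomy", "term": "orthomyxoviridae"}))
```

Every utility shares the same base path and differs only in the program name and its query parameters, which is why a single helper like this can cover search, spelling, and retrieval calls.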

The third is to download the complete database as a full-text taxdump file (in .dmp format), which is updated every hour on the site.
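As a rough illustration of the .dmp layout, the sketch below parses two rows in the style of the `names.dmp` file from the taxdump archive, where fields are separated by a tab, a pipe, and a tab, and each row ends with a tab and a pipe. The sample rows are hand-written for the example (using TaxIDs that appear elsewhere in this notebook), and `parse_dmp` is a hypothetical helper, not part of the `ncbi` package.

```python
import io

# Illustrative rows: tax_id, name, unique name (empty here), name class.
# Hand-written for this example, not copied from a real dump.
SAMPLE_NAMES_DMP = (
    "11320\t|\tInfluenza A virus\t|\t\t|\tscientific name\t|\n"
    "11308\t|\tOrthomyxoviridae\t|\t\t|\tscientific name\t|\n"
)


def parse_dmp(handle):
    """Yield each row of a .dmp file as a list of stripped field strings."""
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        # Drop the trailing "\t|" terminator, then split on the "\t|\t" separator
        if line.endswith("\t|"):
            line = line[:-2]
        yield [field.strip() for field in line.split("\t|\t")]


rows = list(parse_dmp(io.StringIO(SAMPLE_NAMES_DMP)))
# Map TaxID to scientific name, skipping synonyms and other name classes
scientific_names = {row[0]: row[1] for row in rows if row[3] == "scientific name"}
print(scientific_names)
```

Reading the dump this way avoids per-query network calls, at the cost of keeping a local copy in sync with the hourly updates.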

This notebook examines the information stored within the NCBI Taxonomy and the quality of metadata that is retrievable via publicly available APIs.

### <span style="color:gray">_Extracting data from NCBI Taxonomy database_</span>

Entrez spans dozens of databases (42 in the output below). To return a list of all Entrez database names and identify the one we want to query, we call the EInfo utility without a database argument:


In [1]:
import sys

print("Checking python executable path")
print(sys.executable)
# append the path of the parent directory to access project packages
sys.path.append("..")

import requests
from bs4 import BeautifulSoup

# import project packages
import ncbi

# Set up notebook code formatting:
import inspect
from IPython.display import display, Code, HTML
from pygments.formatters import HtmlFormatter

formatter = HtmlFormatter()
display(HTML(f'<style>{formatter.get_style_defs(".highlight")}</style>'))

Checking python executable path
/home/ubuntu/code/kr2-graph/build_graph_2/notebooks/env/bin/python


In [2]:
entrez_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi"
entrez_response = requests.get(entrez_url)
all_db = BeautifulSoup(entrez_response.content, features="xml").getText()
print(all_db)



pubmed
protein
nuccore
ipg
nucleotide
structure
genome
annotinfo
assembly
bioproject
biosample
blastdbinfo
books
cdd
clinvar
gap
gapplus
grasp
dbvar
gene
gds
geoprofiles
homologene
medgen
mesh
ncbisearch
nlmcatalog
omim
orgtrack
pmc
popset
proteinclusters
pcassay
protfam
pccompound
pcsubstance
seqannot
snp
sra
taxonomy
biocollections
gtr




To gather statistics about the <span style="color:#4682B4">**_taxonomy_**</span> database and look at the available fields, we call the EInfo utility again, this time with `db=taxonomy`:

In [3]:
import pandas as pd

einfo_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi?db=taxonomy"
einfo_response = requests.get(einfo_url)
taxon_info = BeautifulSoup(einfo_response.content, features='xml')

# Extracting the field data
field_tag = taxon_info.FieldList

field = field_tag.find_all('Name')  
full_name = field_tag.find_all('FullName')  
desc = field_tag.find_all('Description')  
term_count = field_tag.find_all('TermCount')
is_date = field_tag.find_all('IsDate')
is_numerical = field_tag.find_all('IsNumerical')
single_token = field_tag.find_all('SingleToken')
hierarchy = field_tag.find_all('Hierarchy')
is_hidden = field_tag.find_all('IsHidden')

field_data = []
  
# Build one row per field from the parallel tag lists
for i in range(len(field)):
    rows = [
        field[i].get_text(), full_name[i].get_text(), desc[i].get_text(),
        term_count[i].get_text(), is_date[i].get_text(), is_numerical[i].get_text(),
        single_token[i].get_text(), hierarchy[i].get_text(), is_hidden[i].get_text(),
    ]
    field_data.append(rows)

# Convert the list into a dataframe
field_df = pd.DataFrame(
    field_data,
    columns=['Field Abbreviation', 'Full Field Name', 'Field Description',
             'Term Count', 'Is Date', 'Is Numerical', 'Single Token',
             'Hierarchy', 'Is Hidden'],
    dtype=str,
)

display(field_df)

| | Field Abbreviation | Full Field Name | Field Description | Term Count | Is Date | Is Numerical | Single Token | Hierarchy | Is Hidden |
|---|---|---|---|---|---|---|---|---|---|
| 0 | ALL | All Fields | All terms from all searchable fields | 36710851 | N | N | N | N | N |
| 1 | UID | Taxonomy ID | Unique number assigned to publication | 0 | N | Y | Y | N | Y |
| 2 | FILT | Filter | Limits the records | 303 | N | N | Y | N | N |
| 3 | SCIN | Scientific Name | Scientific name of organism | 2420442 | N | N | Y | N | N |
| 4 | COMN | Common Name | Common name of organism | 47323 | N | N | Y | N | N |
| 5 | TXSY | Synonym | Synonym of organism name | 221971 | N | N | Y | N | N |
| 6 | ALLN | All Names | All aliases for organism | 3686044 | N | N | Y | N | N |
| 7 | NXLV | Next Level | Immediate parent in taxonomic hierarchy | 373118 | N | N | Y | N | N |
| 8 | SBTR | Subtree | Any parent node in taxonomic hierarchy | 6103915 | N | N | Y | N | N |
| 9 | LNGE | Lineage | Lineage in taxonomic hierarchy | 3686044 | N | N | Y | N | N |
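These search fields can be combined with Entrez's Boolean operators by appending a field name in square brackets to a term. The sketch below only builds the query strings; `fielded_term` and `combine` are hypothetical helpers, and how each field behaves for a given query should be checked against the live service.

```python
def fielded_term(text, field):
    """Restrict a search term to one Entrez field, e.g. 'Orthomyxoviridae[Subtree]'."""
    return f"{text}[{field}]"


def combine(operator, *terms):
    """Join terms with an uppercase Boolean operator (AND, OR, NOT)."""
    return f" {operator.upper()} ".join(terms)


# Field names are taken from the 'Full Field Name' column of the EInfo output
family = fielded_term("Orthomyxoviridae", "Subtree")
genus = fielded_term("Alphainfluenzavirus", "Subtree")
query = combine("not", family, genus)
print(query)
```

The resulting string would be passed as the `term` parameter of an ESearch request against the taxonomy database.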


To learn about other records associated with the <span style="color:#4682B4">**_taxonomy_**</span> database entries, we read the LinkList element from the same EInfo response:

In [4]:
# Extracting the associated links
link_tag = taxon_info.LinkList

link_name = link_tag.find_all('Name')  
link_desc = link_tag.find_all('Description')  

link_data = []
  
# Loop to store the data in a list named 'link_data'
for i in range(0, len(link_name)):
    link_rows = [link_name[i].get_text(), link_desc[i].get_text()]
    link_data.append(link_rows)
  
# Converting the list into dataframe
link_df = pd.DataFrame(link_data, columns=['Link Name', 'Description'], dtype = str)

display(link_df)

| | Link Name | Description |
|---|---|---|
| 0 | taxonomy_assembly_exp | Assembly records associated with taxonomy reco... |
| 1 | taxonomy_bioproject_exp | BioProject records associated with taxonomy re... |
| 2 | taxonomy_biosample_exp | BioSample records associated with taxonomy rec... |
| 3 | taxonomy_biosystems_exp | BioSystems records associated with taxonomy re... |
| 4 | taxonomy_books | Books |
| 5 | taxonomy_cdd_exp | CDD records associated with taxonomy record (e... |
| 6 | taxonomy_clone_exp | Links to Clone DB (exploded for higher taxa) |
| 7 | taxonomy_dbvar_exp | dbVar records associated with taxonomy record ... |
| 8 | taxonomy_gds_exp | GEO DataSet records associated with taxonomy r... |
| 9 | taxonomy_gene_exp | Gene records associated with taxonomy record (... |


To extract data about organisms from these fields, we created a Python package, <span style="color:#4682B4">**_ncbi_**</span>. It is built around three main functions, demonstrated below: <span style="color:#4682B4">**_api_soup_**</span>, <span style="color:#4682B4">**_id_search_**</span>, and <span style="color:#4682B4">**_get_metadata_**</span>.

In [5]:
Code(inspect.getsource(ncbi.api_soup), language='python')

In [6]:
Code(inspect.getsource(ncbi.id_search), language='python')

In [7]:
Code(inspect.getsource(ncbi.get_metadata), language='python')

### <span style="color:gray">_Accessing what is in the NCBI Taxonomy database_</span>

The NCBI Taxonomy documentation indicates that the data model is built around a central framework called NameBank, and that each entry includes a <span style="color:#4682B4">**_primary name_**</span>, <span style="color:#4682B4">**_secondary names_**</span>, a <span style="color:#4682B4">**_taxonomy identifier_**</span>, and <span style="color:#4682B4">**_name entity identifiers_**</span>, along with various other metadata about the lineage, genetic code, and linked Entrez records.

We first tested database access by sending sample queries with formal organism names to the 'esearch' utility against the Entrez taxonomy database.

<span style="color:#4682B4">**_Input: Entrez text query (&term); Entrez database (&db)_**</span>

<span style="color:#4682B4">**_Expected Output: List of UIDs matching the Entrez query_**</span>


Sample data: Get the Taxonomic IDs (TaxID) for 'influenza A subtype h1n1', 'alphainfluenzavirus', and 'orthomyxoviridae':

In [8]:
term_list = ['influenza A subtype h1n1','alphainfluenzavirus','orthomyxoviridae']
id_list = []

for each in term_list:
    id_list.append(ncbi.id_search(each))

print(id_list)

2022-05-04 22:55:35.676 | INFO     | ncbi.id_search:id_search:8 - Searching ncbi for term influenza A subtype h1n1
2022-05-04 22:55:36.186 | INFO     | ncbi.id_search:id_search:8 - Searching ncbi for term alphainfluenzavirus
2022-05-04 22:55:36.704 | INFO     | ncbi.id_search:id_search:8 - Searching ncbi for term orthomyxoviridae


['114727', '197911', '11308']
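For readers without access to the project's `ncbi` package, the ID extraction step can be approximated with the standard library. This is a sketch under assumptions: the sample XML is a hand-written, trimmed stand-in for an ESearch response, and `extract_ids` is a hypothetical helper rather than the package's `id_search`.

```python
import xml.etree.ElementTree as ET

# Trimmed, hand-written stand-in for an ESearch response to 'orthomyxoviridae'
SAMPLE_ESEARCH_XML = (
    "<eSearchResult>"
    "<Count>1</Count><RetMax>1</RetMax><RetStart>0</RetStart>"
    "<IdList><Id>11308</Id></IdList>"
    "</eSearchResult>"
)


def extract_ids(esearch_xml):
    """Return the UIDs listed under IdList in an ESearch XML response."""
    root = ET.fromstring(esearch_xml)
    return [tag.text.strip() for tag in root.findall("./IdList/Id") if tag.text]


print(extract_ids(SAMPLE_ESEARCH_XML))
```

Returning a list rather than a single string keeps the helper usable when a term matches more than one taxon or none at all.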


Each of these IDs can then be passed to the 'efetch' utility to retrieve the organism's full record from the NCBI Taxonomy database in XML format.

<span style="color:#4682B4">**_Input: List of UIDs (&id); Entrez database (&db); Retrieval type (&rettype); Retrieval mode (&retmode)_**</span>

<span style="color:#4682B4">**_Expected Output: Formatted data records as specified_**</span>


In [9]:
for each in id_list: 
    print('\n' + str(ncbi.get_metadata(each)))


{'ScientificName': 'H1N1 subtype', 'ParentTaxId': '11320', 'Rank': 'serotype', 'Division': 'Viruses', 'GeneticCode': {'GCId': '1', 'GCName': 'Standard'}, 'MitoGeneticCode': {'MGCId': '0', 'MGCName': 'Unspecified'}, 'Lineage': 'Viruses; Riboviria; Orthornavirae; Negarnaviricota; Polyploviricotina; Insthoviricetes; Articulavirales; Orthomyxoviridae; Alphainfluenzavirus; Influenza A virus', 'CreateDate': '2000/02/07 12:42:00', 'UpdateDate': '2020/04/07 15:24:16', 'PubDate': '2000/02/07 12:42:00', 'LineageEx': [{'TaxId': '10239', 'ScientificName': 'Viruses', 'Rank': 'superkingdom'}, {'TaxId': '2559587', 'ScientificName': 'Riboviria', 'Rank': 'clade'}, {'TaxId': '2732396', 'ScientificName': 'Orthornavirae', 'Rank': 'kingdom'}, {'TaxId': '2497569', 'ScientificName': 'Negarnaviricota', 'Rank': 'phylum'}, {'TaxId': '2497571', 'ScientificName': 'Polyploviricotina', 'Rank': 'subphylum'}, {'TaxId': '2497577', 'ScientificName': 'Insthoviricetes', 'Rank': 'class'}, {'TaxId': '2499411', 'Scientific
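The dictionary above has the same nesting as the EFetch XML record. A minimal sketch of that conversion, assuming a trimmed hand-written record and a hypothetical `parse_taxon` helper (not the package's `get_metadata`), might look like this:

```python
import xml.etree.ElementTree as ET

# Trimmed, hand-written stand-in for an EFetch taxonomy record
SAMPLE_EFETCH_XML = """<TaxaSet>
  <Taxon>
    <TaxId>114727</TaxId>
    <ScientificName>H1N1 subtype</ScientificName>
    <ParentTaxId>11320</ParentTaxId>
    <Rank>serotype</Rank>
    <GeneticCode><GCId>1</GCId><GCName>Standard</GCName></GeneticCode>
    <LineageEx>
      <Taxon><TaxId>10239</TaxId><ScientificName>Viruses</ScientificName><Rank>superkingdom</Rank></Taxon>
      <Taxon><TaxId>11308</TaxId><ScientificName>Orthomyxoviridae</ScientificName><Rank>family</Rank></Taxon>
    </LineageEx>
  </Taxon>
</TaxaSet>"""


def element_to_value(element):
    """Convert an XML element to a string, dict, or (for LineageEx) list."""
    children = list(element)
    if not children:
        return (element.text or "").strip()
    if element.tag == "LineageEx":
        # LineageEx repeats the Taxon tag, so keep its children as an ordered list
        return [element_to_value(child) for child in children]
    # Simplification: other repeated child tags would overwrite each other here
    return {child.tag: element_to_value(child) for child in children}


def parse_taxon(efetch_xml):
    """Return the first Taxon record as a nested dict, or {} if none is present."""
    taxon = ET.fromstring(efetch_xml).find("./Taxon")
    return element_to_value(taxon) if taxon is not None else {}


record = parse_taxon(SAMPLE_EFETCH_XML)
print(record["ScientificName"], record["Rank"], len(record["LineageEx"]))
```

Keeping `LineageEx` as a list preserves the root-to-leaf order of the ancestors, which matters when the lineage is later loaded into a dataframe.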

The same workflow handles several terms in one pass, returning an ID and a full record for each term.

This same query can also be performed using informal names and synonyms for the organisms (excluding orthomyxoviridae, because the family has no other common name):

In [10]:
term_list = ['Influenzavirus A','h1n1']
id_list = []

for each in term_list:
    id_list.append(ncbi.id_search(each))


for each in id_list:
    print('\n' + str(ncbi.get_metadata(each)))

2022-05-04 22:55:39.483 | INFO     | ncbi.id_search:id_search:8 - Searching ncbi for term Influenzavirus A
2022-05-04 22:55:39.998 | INFO     | ncbi.id_search:id_search:8 - Searching ncbi for term h1n1



{'ScientificName': 'Alphainfluenzavirus', 'ParentTaxId': '11308', 'Rank': 'genus', 'Division': 'Viruses', 'GeneticCode': {'GCId': '1', 'GCName': 'Standard'}, 'MitoGeneticCode': {'MGCId': '0', 'MGCName': 'Unspecified'}, 'Lineage': 'Viruses; Riboviria; Orthornavirae; Negarnaviricota; Polyploviricotina; Insthoviricetes; Articulavirales; Orthomyxoviridae', 'CreateDate': '2002/05/08 12:00:00', 'UpdateDate': '2020/04/07 15:24:16', 'PubDate': '2002/06/12 19:01:00', 'LineageEx': [{'TaxId': '10239', 'ScientificName': 'Viruses', 'Rank': 'superkingdom'}, {'TaxId': '2559587', 'ScientificName': 'Riboviria', 'Rank': 'clade'}, {'TaxId': '2732396', 'ScientificName': 'Orthornavirae', 'Rank': 'kingdom'}, {'TaxId': '2497569', 'ScientificName': 'Negarnaviricota', 'Rank': 'phylum'}, {'TaxId': '2497571', 'ScientificName': 'Polyploviricotina', 'Rank': 'subphylum'}, {'TaxId': '2497577', 'ScientificName': 'Insthoviricetes', 'Rank': 'class'}, {'TaxId': '2499411', 'ScientificName': 'Articulavirales', 'Rank': 'o

Diving deeper into Entrez with 'Influenzavirus A subtype H1N1' (H1N1), which NCBI ranks as a serotype of the species Influenza A virus:

In [11]:
h1n1_id = ncbi.id_search('Influenzavirus A subtype H1N1')
print(f'ID: {h1n1_id}\n')

h1n1_data = ncbi.get_metadata(h1n1_id)
print(h1n1_data)

2022-05-04 22:55:42.470 | INFO     | ncbi.id_search:id_search:8 - Searching ncbi for term Influenzavirus A subtype H1N1


ID: 114727

{'ScientificName': 'H1N1 subtype', 'ParentTaxId': '11320', 'Rank': 'serotype', 'Division': 'Viruses', 'GeneticCode': {'GCId': '1', 'GCName': 'Standard'}, 'MitoGeneticCode': {'MGCId': '0', 'MGCName': 'Unspecified'}, 'Lineage': 'Viruses; Riboviria; Orthornavirae; Negarnaviricota; Polyploviricotina; Insthoviricetes; Articulavirales; Orthomyxoviridae; Alphainfluenzavirus; Influenza A virus', 'CreateDate': '2000/02/07 12:42:00', 'UpdateDate': '2020/04/07 15:24:16', 'PubDate': '2000/02/07 12:42:00', 'LineageEx': [{'TaxId': '10239', 'ScientificName': 'Viruses', 'Rank': 'superkingdom'}, {'TaxId': '2559587', 'ScientificName': 'Riboviria', 'Rank': 'clade'}, {'TaxId': '2732396', 'ScientificName': 'Orthornavirae', 'Rank': 'kingdom'}, {'TaxId': '2497569', 'ScientificName': 'Negarnaviricota', 'Rank': 'phylum'}, {'TaxId': '2497571', 'ScientificName': 'Polyploviricotina', 'Rank': 'subphylum'}, {'TaxId': '2497577', 'ScientificName': 'Insthoviricetes', 'Rank': 'class'}, {'TaxId': '2499411', 

In [12]:
# Extract the expanded lineage (a list of dicts with TaxId, ScientificName, and Rank)
lineage = h1n1_data["LineageEx"]

# Convert the list of dicts into a dataframe
lineage_df = pd.DataFrame(lineage)
lineage_df

| | TaxId | ScientificName | Rank |
|---|---|---|---|
| 0 | 10239 | Viruses | superkingdom |
| 1 | 2559587 | Riboviria | clade |
| 2 | 2732396 | Orthornavirae | kingdom |
| 3 | 2497569 | Negarnaviricota | phylum |
| 4 | 2497571 | Polyploviricotina | subphylum |
| 5 | 2497577 | Insthoviricetes | class |
| 6 | 2499411 | Articulavirales | order |
| 7 | 11308 | Orthomyxoviridae | family |
| 8 | 197911 | Alphainfluenzavirus | genus |
| 9 | 11320 | Influenza A virus | species |
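Because each lineage entry carries its rank, a pathogen can be rolled up to any ancestor level with a simple lookup. The sketch below copies the lineage rows shown above into a list of dicts; `taxon_at_rank` is a hypothetical helper.

```python
# Lineage of 'H1N1 subtype' (TaxID 114727), copied from the output above
LINEAGE = [
    {"TaxId": "10239", "ScientificName": "Viruses", "Rank": "superkingdom"},
    {"TaxId": "2559587", "ScientificName": "Riboviria", "Rank": "clade"},
    {"TaxId": "2732396", "ScientificName": "Orthornavirae", "Rank": "kingdom"},
    {"TaxId": "2497569", "ScientificName": "Negarnaviricota", "Rank": "phylum"},
    {"TaxId": "2497571", "ScientificName": "Polyploviricotina", "Rank": "subphylum"},
    {"TaxId": "2497577", "ScientificName": "Insthoviricetes", "Rank": "class"},
    {"TaxId": "2499411", "ScientificName": "Articulavirales", "Rank": "order"},
    {"TaxId": "11308", "ScientificName": "Orthomyxoviridae", "Rank": "family"},
    {"TaxId": "197911", "ScientificName": "Alphainfluenzavirus", "Rank": "genus"},
    {"TaxId": "11320", "ScientificName": "Influenza A virus", "Rank": "species"},
]


def taxon_at_rank(lineage, rank):
    """Return the first lineage entry with the given rank, or None if absent."""
    for entry in lineage:
        if entry["Rank"] == rank:
            return entry
    return None


family = taxon_at_rank(LINEAGE, "family")
print(family["TaxId"], family["ScientificName"])
```

Returning `None` for a missing rank makes gaps explicit, since not every lineage includes every rank.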


In [13]:
correction = ncbi.api_soup('espell', {"term": 'Influenza Virus B', "db":"taxonomy"})
print(correction.CorrectedQuery.prettify())

<CorrectedQuery>
 influenza virus a
</CorrectedQuery>



In [14]:
correction = ncbi.api_soup('esearch', {"term": 'Influenza Virus B', "db":"taxonomy"})
print(correction.ErrorList.prettify())

<ErrorList>
 <PhraseNotFound>
  Influenza
 </PhraseNotFound>
 <PhraseNotFound>
  Virus
 </PhraseNotFound>
 <PhraseNotFound>
  B
 </PhraseNotFound>
</ErrorList>



In [15]:
correction = ncbi.api_soup('espell', {"term": 'Influenzavirus B', "db":"taxonomy"})
print(correction.CorrectedQuery.prettify())

<CorrectedQuery>
 influenzavirus b
</CorrectedQuery>



In [16]:
id_response = ncbi.api_soup('esearch', {"term": 'Influenzavirus B', "db":"taxonomy"})
print(id_response.IdList.prettify())

<IdList>
 <Id>
  197912
 </Id>
</IdList>



In [17]:
id_response = ncbi.api_soup('esearch', {"term": 'Influenza B virus', "db":"taxonomy"})
print(id_response.IdList.prettify())

<IdList>
 <Id>
  11520
 </Id>
</IdList>
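The cells above suggest that name variants need care: 'Influenza Virus B' matches no phrase in ESearch, and ESpell's suggestion for it points to a different organism ('influenza virus a'), while 'Influenzavirus B' and 'Influenza B virus' resolve to two different TaxIDs. One way to surface such cases for manual review, rather than accepting a spelling suggestion automatically, is sketched below. The XML is a hand-written stand-in for the ESearch response behind the ErrorList shown above, and `summarize_search` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for the ESearch response to 'Influenza Virus B'
NO_MATCH_XML = (
    "<eSearchResult><Count>0</Count><IdList/>"
    "<ErrorList>"
    "<PhraseNotFound>Influenza</PhraseNotFound>"
    "<PhraseNotFound>Virus</PhraseNotFound>"
    "<PhraseNotFound>B</PhraseNotFound>"
    "</ErrorList></eSearchResult>"
)


def summarize_search(esearch_xml):
    """Collect matched IDs and unmatched phrases from an ESearch XML response."""
    root = ET.fromstring(esearch_xml)
    ids = [tag.text.strip() for tag in root.findall("./IdList/Id") if tag.text]
    not_found = [
        tag.text.strip()
        for tag in root.findall("./ErrorList/PhraseNotFound")
        if tag.text
    ]
    # Flag the term for a human to check whenever no TaxID came back
    return {"ids": ids, "not_found": not_found, "needs_review": not ids}


print(summarize_search(NO_MATCH_XML))
```

Flagging unmatched terms keeps a misleading correction from silently attaching a record to the wrong organism.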

