# Create Node and Relationship files for NCBI Taxonomy
This notebook creates Node and Relationship files that represent the NCBI Taxonomy ontology tree. The ontology is retrieved from [BioPortal](https://bioportal.bioontology.org/ontologies/NCBITAXON).

The Node and Relationship files can be imported into a Neo4j graph database using [kg-import](https://github.com/sbl-sdsc/kg-import).

In [1]:
import os
from pathlib import Path
import pandas as pd
from utils import parse_bioportal_csv

In [2]:
# reload modules before executing user code
%load_ext autoreload
%autoreload 2

In [3]:
# configure pandas dataframe
pd.options.display.max_rows = None  # display all rows
pd.options.display.max_columns = None  # display all columns

In [4]:
NODE_DIR = Path(os.getenv('NODE_DIR', default='../data'))
RELATIONSHIP_DIR = Path(os.getenv('RELATIONSHIP_DIR', default='../data'))

## NCBI Taxonomy

Specify the CSV Download URL from [BioPortal](https://bioportal.bioontology.org/).

In [5]:
ontology_url = 'https://data.bioontology.org/ontologies/NCBITAXON/download?apikey=8b5b7825-538d-40e0-9e9e-5ab9274a9aeb&download_format=csv'

Specify extra columns to be imported from the CSV file. The map assigns each original column name a new name to be used as a node property. Follow the Neo4j convention for property names: lower case, with underscores separating words.

In [6]:
extra_properties = {'DIV': 'division', 'RANK': 'rank'}
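
As a rough illustration of what such a map does, the sketch below renames columns in a tiny hand-made dataframe with `DataFrame.rename`. The sample rows and the `Class ID` column are made up for the example, and this is not necessarily how `parse_bioportal_csv` applies the map internally.

```python
import pandas as pd

# Hypothetical sample rows; not actual BioPortal output
raw = pd.DataFrame({
    'Class ID': ['A', 'B'],
    'DIV': ['Invertebrates', 'Plants and Fungi'],
    'RANK': ['species', 'genus'],
})

extra_properties = {'DIV': 'division', 'RANK': 'rank'}

# Columns not listed in the map keep their original names
renamed = raw.rename(columns=extra_properties)
print(list(renamed.columns))
```

Columns that are absent from the map are left untouched, so only the listed properties change name.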

Browse the Identifiers.org [registry](https://registry.identifiers.org/registry) to find the CURIE (compact URI) prefix for a data resource.

In [7]:
curie = 'taxonomy'
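
A CURIE joins the prefix and a local identifier with a colon, which is the form the node ids in this notebook take (for example `taxonomy:2491102`). The helpers below are a minimal sketch of that convention; `to_curie` and `split_curie` are hypothetical names used only for illustration and are not part of `utils`.

```python
def to_curie(prefix: str, local_id: str) -> str:
    """Join a registry prefix and a local identifier into a CURIE."""
    return f'{prefix}:{local_id}'


def split_curie(curie_id: str) -> tuple:
    """Split a CURIE at the first colon into (prefix, local identifier)."""
    prefix, _, local_id = curie_id.partition(':')
    return prefix, local_id


curie = 'taxonomy'
node_id = to_curie(curie, '2491102')
print(node_id)               # taxonomy:2491102
print(split_curie(node_id))  # ('taxonomy', '2491102')
```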

In [8]:
node_file_name = 'Organism.csv'
relationship_file_name = 'Organism-IS_A-Organism.csv'

## Parse ontology file and create node and relationship dataframes

In [9]:
nodes, relationships = parse_bioportal_csv(ontology_url, extra_properties, curie)

In [10]:
print('Number of nodes:', nodes.shape[0])

Number of nodes: 1983907


In [11]:
nodes.head()

| | id | name | synonyms | definition | url | division | rank |
|---|---|---|---|---|---|---|---|
| 0 | taxonomy:2491102 | Schrebera swietenioides | | | http://purl.bioontology.org/ontology/NCBITAXON... | Plants and Fungi | species |
| 1 | taxonomy:670317 | Sinularia sublimis | | | http://purl.bioontology.org/ontology/NCBITAXON... | Invertebrates | species |
| 2 | taxonomy:2522987 | Platygastridae sp. BIOUG20087-C03 | | | http://purl.bioontology.org/ontology/NCBITAXON... | Invertebrates | species |
| 3 | taxonomy:2248825 | Haemadipsa sp. THRF6 | | | http://purl.bioontology.org/ontology/NCBITAXON... | Invertebrates | species |
| 4 | taxonomy:2009595 | Apoidea sp. 0520C209 | | | http://purl.bioontology.org/ontology/NCBITAXON... | Invertebrates | species |


In [12]:
print('Number of relationships:', relationships.shape[0])

Number of relationships: 1983861


In [13]:
relationships.head()

| | from | to |
|---|---|---|
| 0 | taxonomy:2491102 | taxonomy:126565 |
| 1 | taxonomy:670317 | taxonomy:51814 |
| 2 | taxonomy:2522987 | taxonomy:1252372 |
| 3 | taxonomy:2248825 | taxonomy:2647130 |
| 4 | taxonomy:2009595 | taxonomy:889157 |
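
Before saving, it may be worth checking that every relationship endpoint refers to an existing node. The sketch below runs on small stand-in dataframes and assumes only the `id`, `from`, and `to` columns shown above; the helper `find_dangling` is hypothetical and not part of `utils`.

```python
import pandas as pd


def find_dangling(nodes: pd.DataFrame, relationships: pd.DataFrame) -> pd.DataFrame:
    """Return relationship rows whose 'from' or 'to' id is not a node id."""
    known = set(nodes['id'])
    ok = relationships['from'].isin(known) & relationships['to'].isin(known)
    return relationships[~ok]


# Stand-in data for illustration only
nodes_demo = pd.DataFrame({'id': ['taxonomy:1', 'taxonomy:2', 'taxonomy:3']})
rels_demo = pd.DataFrame({
    'from': ['taxonomy:2', 'taxonomy:3', 'taxonomy:4'],
    'to': ['taxonomy:1', 'taxonomy:1', 'taxonomy:99'],
})

dangling = find_dangling(nodes_demo, rels_demo)
print(dangling)
```

On the real data the same call would be `find_dangling(nodes, relationships)`; an empty result would suggest that all endpoints resolve to nodes.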


## Save files

In [14]:
nodes.to_csv(NODE_DIR / node_file_name, index=False)

In [15]:
relationships.to_csv(RELATIONSHIP_DIR / relationship_file_name, index=False)