## Extraction Pipeline

Run the cells in this notebook to turn an input JSON file into three output files: an RDF knowledge graph, a visualization of which tokens were identified, and a CSV file used for results analysis.

input_file should be a string containing the file path of the JSON extraction.

In [1]:
input_file = "./data/input/3235full.json"
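
Before running the later cells, it can help to confirm that the path points at a readable JSON file. The following is a minimal sketch using only the standard library. check_extraction_path is a hypothetical helper, not part of the extraction package, and it assumes nothing about the structure of the extraction beyond it being valid JSON.

```python
import json
from pathlib import Path


def check_extraction_path(path_str):
    """Hypothetical helper: return the path as a Path if it is a readable JSON file."""
    path = Path(path_str)
    if not path.is_file():
        raise FileNotFoundError(f"No extraction file at {path}")
    with path.open(encoding="utf-8") as handle:
        json.load(handle)  # raises json.JSONDecodeError if the file is not valid JSON
    return path
```

Calling check_extraction_path(input_file) in the first cell would surface a wrong path or a malformed file before the longer annotation step starts.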

The extraction is loaded with tree_table_extraction.load_extraction(input_file) and then passed to tree_table_extraction.make_tree_tables, which generates the tree table dictionary. This dictionary is stored in data.

Then the KG builder is created. We call build_KG(data) to build a preliminary KG structure inside the KG_Builder object.

In [2]:
from extraction.kg_builder import KG_Builder
from extraction.tree_table_extraction import load_extraction, make_tree_tables
from extraction.classifiers import *

# Add more description
# Auto-generation tools for creating package dependencies
# Add high-level description to readme
# Add workflow image to readme

# first, we create the tree tables from the .JSONs using the make_tree_tables function (Step 2)
data = make_tree_tables(load_extraction(input_file))

# now we feed that data into the KG builder
# Step 3 and 4 (includes NCBO annotation)
kg = KG_Builder()

# Document priority of classifiers
# Document how to configure list of ontologies
# Document additional configurable properties
#   Include different configurable options in a script; don't overly complicate
# kg.TokenClassifiers = [Free_Value_Token_Classifier(), Concept_Token_Classifier()]

kg.build_KG(data)

# Move helper code to serialize to data


Loaded extraction from ./data/input/3235full.json

REQ: GLA 300 N 404 
REQ: GLA 100 N 407 
REQ:  AGE YEARS 
REQ: historic_period eld long_time old_age senesce 
REQ:  AGE YEARS 
REQ: year old_age eld geezerhood long_time days class 
REQ:  SEX MALE N 
REQ: sexual_activity sexual_practice sex_activity arouse 
REQ:  SEX MALE N 
REQ: male_person 
REQ:  ETHNIC GROUP N 
REQ: cultural ethnical heathen heathenish 
REQ:  ETHNIC GROUP N 
REQ: grouping radical chemical_group 
REQ: ETHNIC GROUP N CAUCASIAN ETHNIC GROUP N
REQ: White White_person Caucasian_language 
REQ: ETHNIC GROUP N BLACK ETHNIC GROUP N
REQ: blackness inkiness total_darkness lightlessness blacken bootleg 
REQ: ETHNIC GROUP N ASIAN ORIENTAL ETHNIC GROUP N
REQ: Asiatic 
REQ: ETHNIC GROUP N ASIAN ORIENTAL ETHNIC GROUP N
REQ: oriental_person 
REQ: ETHNIC GROUP N OTHER ETHNIC GROUP N
REQ: early former 
REQ:  BODY WEIGHT KG 
REQ: organic_structure physical_structure dead_body torso consistency soundbox 
REQ:  BODY WEIGHT KG 
REQ: free_

KeyboardInterrupt: 

The following cell contains helper code for generating the KG serialization, the visualization, and the CSV file. Calling set_text sets the text of the table object to show what the KG Builder has annotated, and also populates the csv_rows array used to generate the results CSV.
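
The run recorded in this notebook reports the KG serialization at ./data/output/gra.3235full.json.ttl for the input ./data/input/3235full.json. The sketch below reproduces that one observed name by prefixing the input file name with gra. and appending .ttl. expected_ttl_path is a hypothetical helper, and the naming rule is an inference from that single log line, not the documented behavior of serialize.print_graph.

```python
from pathlib import PurePosixPath


def expected_ttl_path(input_file, output_dir):
    """Hypothetical helper: guess the Turtle file path from the observed naming pattern."""
    name = PurePosixPath(input_file).name  # e.g. "3235full.json"
    return str(PurePosixPath(output_dir) / f"gra.{name}.ttl")


print(expected_ttl_path("./data/input/3235full.json", "./data/output"))
```

Note that pathlib drops the leading ./ when it normalizes the path, so the printed value is data/output/gra.3235full.json.ttl, which refers to the same location.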

In [12]:
from extraction import serialize
from extraction import visualize

output_dir = "./data/output"

serialize.print_graph(data, input_file, output_dir)
visualize.print_visualization(data, input_file, output_dir)

Saved KG serialization to  ./data/output/gra.3235full.json.ttl 

Age:
  (SCO) http://semanticscience.org/resource/Age : age [AGE, NCBO-PREF]
  (SCO) http://semanticscience.org/resource/SIO_001013 : age [AGE, NCBO-PREF]
  (HHEAR) http://semanticscience.org/resource/SIO_001013 : age [AGE, NCBO-PREF]
  (LOINC) http://purl.bioontology.org/ontology/LNC/LP28815-6 : Age [AGE, NCBO-PREF]
  (LOINC) http://purl.bioontology.org/ontology/LNC/MTHU010047 : Age [AGE, NCBO-PREF]
  (CHEBI) http://purl.obolibrary.org/obo/CHEBI_84123 : advanced glycation end-product [AGE, NCBO-SYN]
  (NCIT) http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C25150 : Age [AGE, NCBO-PREF]
  (IOBC) http://purl.jp/bio/4/id/200906091247251300 : instar [AGE, NCBO-SYN]
  (IOBC) http://purl.jp/bio/4/id/200906047565540621 : aging (physiology) [AGE, NCBO-SYN]
  (IOBC) http://purl.jp/bio/4/id/200906029214622968 : age [AGE, NCBO-PREF]
  (IOBC) http://purl.jp/bio/4/id/200906094641329674 : tree age [AGE, NCBO-SYN]
  (NCIT) http://ncic

[<extraction.classifiers.Free_Value_Token_Classifier at 0x240f582bdd8>,
 <extraction.classifiers.Concept_Token_Classifier at 0x240f73eb208>,
 <extraction.classifiers.NCBO_Token_Classifier at 0x240f73eb240>]
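
The list above shows three classifier instances in a fixed order. How KG_Builder uses that order is not documented in this notebook. One common convention for such a list is first-match-wins, where each token goes to the first classifier that accepts it. The sketch below illustrates that convention only. Every class, method, and rule in it (DigitClassifier, UppercaseClassifier, classify, first_match) is hypothetical and unrelated to the real classes in extraction.classifiers.

```python
class DigitClassifier:
    """Hypothetical classifier: accepts tokens made only of digits."""

    def classify(self, token):
        return "number" if token.isdigit() else None


class UppercaseClassifier:
    """Hypothetical classifier: accepts tokens written in upper case."""

    def classify(self, token):
        return "label" if token.isupper() else None


def first_match(token, classifiers):
    """Return the result of the first classifier that accepts the token, else None."""
    for classifier in classifiers:
        result = classifier.classify(token)
        if result is not None:
            return result
    return None


chain = [DigitClassifier(), UppercaseClassifier()]
print([first_match(t, chain) for t in ["300", "AGE", "years"]])
# prints ['number', 'label', None]
```

Under this convention, reordering the list changes which classifier gets the first chance at a token, which is why the priority of the classifiers is worth documenting.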