## Semantization of simulated data step by step

This notebook will go through the different stages of building a Turtle file that follows the SPHN framework using simulated patient data in `JSON` format.

### Import libraries and start a graph file

In [102]:
from rdflib import URIRef, Graph, Namespace, Literal
from rdflib.namespace import RDF, OWL, XSD

Note that we imported the `RDF`, `OWL` and `XSD` namespaces, which will be used as prefixes for our semantized data. Since we need to follow the SPHN framework, we will also create an SPHN prefix using `Namespace`. To do that, we first need to import the SPHN ontology:

In [103]:
g = Graph()

# Importing ontology
g.parse("sphn_ontology.ttl")
print(len(g))

# Adding allergies and patients as prefixes
allergies = Namespace("http://sib.swiss/allergies/")
g.bind("allergies", allergies)
patients = Namespace("http://sib.swiss/fictivePatients/")
g.bind("patients", patients)

# Adding also SPHN as a variable since this is not automatic
sphn = Namespace("https://biomedit.ch/rdf/sphn-ontology/sphn#")

# Note that adding namespaces doesn't change the length of the graph
# because we haven't used those namespaces in any triple yet
print(len(g))

# We can test that by adding an ID
ID_test = "6824c567-a5b2-8741-2dc4-b13ec092dd27"

g.add((URIRef(patients + ID_test), RDF.type, sphn.SubjectPseudoIdentifier))
print(len(g))


8693
8693
8694


### Import JSON file and find the variables of interest

Now that we're ready to add elements to the graph, we can import some test patient data in `JSON` format and look for their patient ID and allergies.

In [21]:
import json

# Load data (the with-block closes the file once it has been read)
with open('test_patient.json', 'r') as handle:
    data = json.load(handle)

# Navigating through the JSON is not easy but looking at the file structure helps

# Patient ID location
print(data['entry'][0]['resource']['id'])

# for the allergy location, it becomes more convenient to convert 
# data into a dataframe

import pandas as pd

df = pd.DataFrame.from_dict(pd.json_normalize(data['entry']), orient='columns')

# Check column where allergies could be

pd.unique(df['resource.resourceType'])

# We isolate the rows with allergies and get the allergy name(s)

allergies_rows = df[df['resource.resourceType'] == 'AllergyIntolerance']
allergen = allergies_rows['resource.code.text']
allergen

# We isolate the allergy type as well

allergies_types = allergies_rows['resource.category']
allergies_types

6824c567-a5b2-8741-2dc4-b13ec092dd27


24    [environment]
Name: resource.category, dtype: object
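For readers who prefer to skip the dataframe step, the same lookup can be sketched with plain Python. The bundle below is a made-up, heavily trimmed stand-in that only mimics the fields used above (`entry`, `resource`, `resourceType`, `category`, `code.text`); the helper name `extract_allergies` is hypothetical.

```python
# Hypothetical, trimmed bundle that mimics the fields used above
bundle = {
    "entry": [
        {"resource": {"resourceType": "Patient", "id": "patient-001"}},
        {"resource": {"resourceType": "Encounter", "id": "enc-001"}},
        {"resource": {
            "resourceType": "AllergyIntolerance",
            "category": ["environment"],
            "code": {"text": "Mold (organism)"},
        }},
    ]
}

def extract_allergies(bundle):
    """Return (patient_id, [(allergen_text, categories), ...]) for one bundle."""
    resources = [entry["resource"] for entry in bundle["entry"]]
    # Assumes the Patient resource is the first entry, as in the cell above
    patient_id = resources[0]["id"]
    found = [
        (res["code"]["text"], res["category"])
        for res in resources
        if res["resourceType"] == "AllergyIntolerance"
    ]
    return patient_id, found

patient_id, found = extract_allergies(bundle)
print(patient_id)
print(found)
```

Note that `category` is a list, which is why the dataframe output above shows `[environment]` with brackets.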

### Create loops to add elements of interest to graph

Now we will create a for loop that will take the allergies, link them to their type and link them to the initial patient ID.

In [106]:
# We repeat the code from above to reconstruct our ontology from scratch

# Importing ontology
g = Graph()
g.parse("sphn_ontology.ttl")

# Adding allergies and patients as prefixes
allergies = Namespace("http://sib.swiss/allergies/")
g.bind("allergies", allergies)
patients = Namespace("http://sib.swiss/fictivePatients/")
g.bind("patients", patients)
substances = Namespace("http://sib.swiss/substances/")
g.bind("substances", substances)

# Adding also SPHN as a variable since this is not automatic
sphn = Namespace("https://biomedit.ch/rdf/sphn-ontology/sphn#")

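# Adding patient ID to the ontology
# (the URI is built from the ID value; patients.ID_test would instead
# create a URI that literally ends in "ID_test")
ID_test = Literal(data['entry'][0]['resource']['id'])
g.add((URIRef(patients + ID_test), RDF.type, sphn.SubjectPseudoIdentifier))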

# We create a list of allergy types to not create redundant nodes
allergy_types_all = []

# And we do the same for allergy substances
allergy_substances = []

# We create a splitting function to extract the terms we need
# from the allergies found above

def split_allergen(allergen):
    first_split = Literal(allergen).split(' (')[0]
    return(first_split)

def split_allergies_types(allergies):
    first_split = Literal(allergies).split('[')[1]
    second_split = first_split.split(']')[0]
    return(second_split)

# Both functions above return a plain string; note that the allergy type
# still carries its quote characters at this stage

# Now we create a loop to go through the terms
# and add them to our ontology

for i,j in zip(allergen,allergies_types):
    # Convert allergy type and substance to literal
    allergy_type = Literal(split_allergies_types(j))
    allergy_substance = Literal(split_allergen(i))
    
    # Check if any is part of a global list
    # and if not, we can add them to the ontology
    
    if allergy_type not in allergy_types_all:
        allergy_types_all.append(allergy_type)
        g.add((URIRef(allergies + allergy_type), RDF.type, sphn.Allergy))
        
    if allergy_substance not in allergy_substances:
        allergy_substances.append(allergy_substance)
        g.add((URIRef(allergies + allergy_type), sphn.hasSubstance, URIRef(substances + allergy_substance)))
    
    # Add to ontology by associating to the patient ID
    # (hasSubjectPseudoIdentifier is the property; SubjectPseudoIdentifier is the class)
    g.add((URIRef(allergies + allergy_type), sphn.hasSubjectPseudoIdentifier, URIRef(patients + ID_test)))
                                     

In [107]:
# let's check if our global lists have been filled and if our ontology has additional elements

print(allergy_types_all)
print(allergy_substances)

# ontology size
len(g)

[rdflib.term.Literal("'food'")]
[rdflib.term.Literal('Peanut')]


8697

With all of this information, we will now create a script that will:

 - Loop through each `JSON` file
 - Get patient ID, allergies substances and types
 - Add all of these iteratively to our ontology linking them to the patient
 - If an allergy type and/or substance is not in a global list, add their relationship
 - As a last step write a new Turtle file with the new ontology
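The first steps of this plan can be sketched without any graph library, which makes the control flow easier to see. This is only an illustration under stated assumptions: the function name `collect_allergies`, the file names and the two tiny bundles are made up, and sets replace the global lists because membership checks and deduplication then come for free.

```python
import glob
import json
import os
import tempfile

def collect_allergies(folder):
    """Map each patient ID to its (substance, type) pairs and track unique values."""
    per_patient = {}
    seen_types, seen_substances = set(), set()
    for path in sorted(glob.glob(os.path.join(folder, "*.json"))):
        with open(path, "r") as handle:
            data = json.load(handle)
        resources = [entry["resource"] for entry in data["entry"]]
        patient_id = resources[0]["id"]  # assumes the Patient resource comes first
        pairs = []
        for res in resources:
            if res["resourceType"] != "AllergyIntolerance":
                continue
            substance = res["code"]["text"]
            for allergy_type in res["category"]:
                pairs.append((substance, allergy_type))
                seen_types.add(allergy_type)
                seen_substances.add(substance)
        per_patient[patient_id] = pairs
    return per_patient, seen_types, seen_substances

# Demo with two made-up patient files written to a temporary folder
with tempfile.TemporaryDirectory() as folder:
    samples = {
        "a.json": {"entry": [
            {"resource": {"resourceType": "Patient", "id": "p1"}},
            {"resource": {"resourceType": "AllergyIntolerance",
                          "category": ["food"], "code": {"text": "Peanut"}}},
        ]},
        "b.json": {"entry": [
            {"resource": {"resourceType": "Patient", "id": "p2"}},
        ]},
    }
    for name, content in samples.items():
        with open(os.path.join(folder, name), "w") as handle:
            json.dump(content, handle)
    per_patient, seen_types, seen_substances = collect_allergies(folder)

print(per_patient)
```

Patients without allergies still appear in the result with an empty list, mirroring the "No allergies found!" branch of the notebook loop.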

### Final loop skeleton

First we want to loop through all the patient files we generated:

In [127]:
import glob
import re

json_files = glob.glob('../01_simulation/*json')

# We define global empty lists of allergy types and substances

# We create a list of allergy types to not create redundant nodes
allergy_types_all = []

# And we do the same for allergy substances
allergy_substances = []

# We repeat the code from above to reconstruct our ontology from scratch

# Importing ontology
g = Graph()
g.parse("sphn_ontology.ttl")

# Adding allergies and patients as prefixes
allergies = Namespace("http://sib.swiss/allergies/")
g.bind("allergies", allergies)
patients = Namespace("http://sib.swiss/fictivePatients/")
g.bind("patients", patients)
substances = Namespace("http://sib.swiss/substances/")
g.bind("substances", substances)

# Adding also SPHN as a variable since this is not automatic
sphn = Namespace("https://biomedit.ch/rdf/sphn-ontology/sphn#")

# We create a splitting function to extract the terms we need
# from the allergy types we find

def split_allergen(allergen):
    first_split = Literal(allergen).split(' (')[0]
    no_spaces = first_split.replace(' ', '')
    # remove special characters
    clean = re.sub(r"[^a-zA-Z0-9 ]", "", no_spaces)
    return(clean)

def split_allergies_types(allergies):
    first_split = Literal(allergies).split('[')[1]
    second_split = first_split.split(']')[0]
    # remove special characters
    clean = re.sub(r"[^a-zA-Z0-9 ]", "", second_split)
    return(clean)

# looping through JSON files
for json_file in json_files:
    
    print(json_file)
    
    # Load data (the with-block closes each file once it has been read)
    with open(json_file, 'r') as handle:
        data = json.load(handle)
    
    # Adding patient ID to the ontology
    ID_patient = Literal(data['entry'][0]['resource']['id'])
    g.add((URIRef(patients + ID_patient), RDF.type, sphn.SubjectPseudoIdentifier))

    # we convert JSON import into dataframe
    df = pd.DataFrame.from_dict(pd.json_normalize(data['entry']), orient='columns')

    # We isolate the rows with allergies and get the allergy name(s)

    allergen_rows = df[df['resource.resourceType'] == 'AllergyIntolerance']
    
    if len(allergen_rows) == 0:
        print("No allergies found!")
        continue
    else:
        print("Allergies found!")
        allergen = allergen_rows['resource.code.text']

        # We isolate the allergy type as well

        allergies_types = allergen_rows['resource.category']
    
        # Now we create a loop to go through the terms
        # and add them to our ontology

        for i,j in zip(allergen,allergies_types):
        
            # Convert allergy type and substance to literal
            allergy_type = Literal(split_allergies_types(j))
            allergy_substance = Literal(split_allergen(i))
    
            # Check if any is part of a global list
            # and if not, we can add them to the ontology
    
            if allergy_type not in allergy_types_all:
                allergy_types_all.append(allergy_type)
                g.add((URIRef(allergies + allergy_type), RDF.type, sphn.Allergy))
        
            if allergy_substance not in allergy_substances:
                allergy_substances.append(allergy_substance)
                g.add((URIRef(allergies + allergy_type), sphn.hasSubstance, URIRef(substances + allergy_substance)))
    
            # Add to ontology by associating to the patient ID
            g.add((URIRef(allergies + allergy_type), sphn.hasSubjectPseudoIdentifier, URIRef(patients + ID_patient)))


../01_simulation/Jerilyn993_Ngoc221_Wyman904_9070913e-8d33-3568-750f-e330b1fbddb1.json
No allergies found!
../01_simulation/Cordell41_Balistreri607_4cd812cc-e2ec-f5d7-585f-a2b64e9caad8.json
No allergies found!
../01_simulation/Katrina8_McGlynn426_9bb595e8-e91c-8a40-1919-a3a01800d780.json
No allergies found!
../01_simulation/Irene779_Berneice173_Mann644_d1f1d721-e00c-a3f7-1d2b-a06a951f6bad.json
No allergies found!
../01_simulation/Ella812_Lai148_Brakus656_2757674c-248e-5ed9-c44e-f8f805b12598.json
No allergies found!
../01_simulation/Arlen68_Barrows492_88f98dbb-0906-40b0-36cc-870eeaf455d6.json
No allergies found!
../01_simulation/Justin359_Dickinson688_c3fa6f65-9f26-df9c-5908-85fbfd2c3d21.json
No allergies found!
../01_simulation/Leonard963_Moen819_82e4ffca-e402-5773-ea88-d52a2f1a970e.json
No allergies found!
../01_simulation/Warren653_Maggio310_cb927656-00cc-e6b6-8caf-fd38b1959e65.json
No allergies found!
../01_simulation/Benjamin360_Rice937_5c60f4ee-dc7f-522a-e9a6-968b6af1853e.json
No 

We can check that the graph has grown and that our lists of allergy substances and types have been filled:

In [128]:
len(g)
#print(g.serialize())

8877

In [129]:
print(allergy_substances)
print(allergy_types_all)

[rdflib.term.Literal('Lisinopril'), rdflib.term.Literal('Ibuprofen'), rdflib.term.Literal('Latex'), rdflib.term.Literal('Mold'), rdflib.term.Literal('Housedustmite'), rdflib.term.Literal('Animaldander'), rdflib.term.Literal('Grasspollen'), rdflib.term.Literal('Treepollen'), rdflib.term.Literal('Cowsmilk'), rdflib.term.Literal('Eggs'), rdflib.term.Literal('Shellfish'), rdflib.term.Literal('Fish'), rdflib.term.Literal('Beevenom'), rdflib.term.Literal('PenicillinV'), rdflib.term.Literal('Treenut'), rdflib.term.Literal('Aspirin'), rdflib.term.Literal('Peanut'), rdflib.term.Literal('SulfamethoxazoleTrimethoprim')]
[rdflib.term.Literal('medication'), rdflib.term.Literal('environment'), rdflib.term.Literal('food')]
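The compact names in the list above come from the cleaning step. A standalone sketch of that step follows; the function name `clean_allergen` is hypothetical, and the raw labels are assumed examples of what the source text might look like, not values read from the patient files.

```python
import re

def clean_allergen(text):
    """Drop a trailing ' (...)' qualifier, then keep only letters and digits."""
    name = text.split(" (")[0]
    return re.sub(r"[^a-zA-Z0-9]", "", name)

# Assumed raw labels, chosen to match three of the cleaned names above
raw_labels = [
    "House dust mite (organism)",
    "Cow's milk (substance)",
    "Sulfamethoxazole / Trimethoprim",
]
cleaned = [clean_allergen(label) for label in raw_labels]
for raw, name in zip(raw_labels, cleaned):
    print(raw, "->", name)
```

Stripping characters is lossy: two different labels could collapse into the same name. Percent-encoding the label (for example with `urllib.parse.quote`) would be one alternative that keeps the URI valid without discarding information.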


As a final step we save our extended ontology to a Turtle file:

In [130]:
g.serialize(destination="sphn_ontology_extended.ttl")

<Graph identifier=N3fc40be2504f4a0eae9020764eac5774 (<class 'rdflib.graph.Graph'>)>