# Data Field Punning Demonstration
This notebook explores the utility of using OWL punning to query data using an abstract/ontological representation of a database schema. 

The ontology for the schema is very minimal. 

<div>
<img src='images/ontology-schema-element-branch.png' style="height: 125px" />
</div>

The ontology purposefully does not use complex OWL axioms or high-level upper ontology classes (e.g., BFO) to represent and relate schema elements. Rather, the classes are intended to match the well-understood concepts employed by data professionals. This does not prevent the ontology from being enriched later with axioms and an upper-level ontology.  
For convenience, I have included in the ontology some simple representations of "real-world" things under the entity branch.  

<img src='images/ontology-entity-branch.png' style="height: 400px" />

In actual use, these entities would be imported from other ontologies.

The ontology is used to represent a simple database consisting of providers, patients, and procedures.  

<img src='images/simple-tables.drawio.png' />  

The fields in the tables are punned by representing them as:
1. Object Properties. E.g.: 
```
   _:row :patient_id _:field_value .  
```
2. Classes. E.g.:  
```
  :patient_id rdfs:subClassOf :field .
```
3. Individuals. E.g.:
```
  :tooth_num :represents :tooth .
  :patient_id :represents :patient .
  :tooth_num :part_of :patient .
```

Data/field values are represented as instances, and a generic `has_value` data property is used to connect the literal value to the instance. E.g.:
```
  _:row :patient_id _:field_value . 
  _:field_value :has_value ?literal_value .
```
This permits other annotations to further describe instances of data/field values, if needed.  

Enumerated values are used to represent the meanings of literal values defined in fields. E.g.:
```
  :enumerated_value#M :represents :male_person; :has_value "M" .
  :enumerated_value#F :represents :female_person; :has_value "F" .
  
  :enumerated_value#M :defines_values_in :gender .
  :enumerated_value#F :defines_values_in :gender .
```


In [1]:
# use autoreload for debugging lib modules
%load_ext autoreload
%autoreload 2

In [2]:
import pandas as pds
from lib.helper_functions import init_graph, df_to_sql
from rdflib import Namespace, URIRef
from pandasql import sqldf
pysqldf = lambda q: sqldf(q, globals())

## load demo data into dataframes

In [3]:
patients = pds.read_csv('../data/patients.csv')
providers = pds.read_csv('../data/providers.csv')
procedures = pds.read_csv('../data/procedures.csv')

<img src='images/simple-tables.drawio.png' />

## load ontology into rdflib graph

In [4]:
g = init_graph('../ontology/data-field-punning.ttl')
g.bind(":", Namespace("https://data-field-punning.owl/")) # you can also use g.namespace_manager.bind(...)
g.bind("field:", Namespace("https://data-field-punning.owl/field/"))
g.bind("table:", Namespace("https://data-field-punning.owl/table/"))
g.bind("enum:", Namespace("https://data-field-punning.owl/enumerated_value/"))

add some namespaces to use as shortcuts

In [5]:
ns = Namespace("https://data-field-punning.owl/")
field_ns = Namespace(ns.field)
table_ns = Namespace(ns.table)
enum_ns = Namespace(ns.enumerated_value)

test simple sparql query

In [6]:
q = """
select ?cls ?cls_label where {
  ?cls a owl:Class
  optional {?cls rdfs:label ?cls_label}
}
"""
g.sparql_query_to_df(q).head() # note: I only display the first 5 results

Unnamed: 0,cls,cls_label
0,:canine,canine
1,:crown_restoration,crown restoration
2,:data_value,data value
3,:dentist,dentist
4,:entity,


## add table and field instances to graph

In [7]:
g.add_table_metadata(patients, 'patients', table_ns, field_ns, ns)
g.add_table_metadata(providers, 'providers', table_ns, field_ns, ns)
g.add_table_metadata(procedures, 'procedures', table_ns, field_ns, ns)

<Graph identifier=Nabe79ecb72d44897846d37b7c780ed18 (<class 'rdflib.graph.Graph'>)>

query to check that instances were added

In [8]:
q = """
prefix : <https://data-field-punning.owl/>
select ?field ?type ?field_name ?table ?table_name where {
  ?field a :field;
         rdfs:label ?field_name;
         rdf:type ?type;
         :member_of ?table .
  ?table rdfs:label ?table_name .
}
"""
g.sparql_query_to_df(q).head()

Unnamed: 0,field,type,field_name,table,table_name
0,:field/patients.patient_id,:field,patients.patient_id,:table/patients,patients
1,:field/patients.patient_id,owlClass,patients.patient_id,:table/patients,patients
2,:field/patients.name,:field,patients.name,:table/patients,patients
3,:field/patients.name,owlClass,patients.name,:table/patients,patients
4,:field/patients.gender,:field,patients.gender,:table/patients,patients


## add enumerated values
The values in `patients.gender` and `procedures.proc_code` are enums. Let's add them to the ontology schema.  
Note: For demonstration purposes, I've made the enums URL-safe. In a real-world scenario, the enums would need to be URL-encoded.
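
As a rough illustration of that encoding step, the sketch below percent-encodes a raw value before appending it to an enum IRI. The IRI pattern (`<table>.<field>#<value>`) follows the enum IRIs that appear in this notebook's query output; the function name `enum_iri` and the raw code `"D 2600/A"` are hypothetical.

```python
from urllib.parse import quote

ENUM_BASE = "https://data-field-punning.owl/enumerated_value/"


def enum_iri(table: str, field: str, value: str) -> str:
    """Build an enum IRI of the form <base><table>.<field>#<encoded value>."""
    # safe="" also encodes "/" so a value cannot introduce extra path segments
    return f"{ENUM_BASE}{table}.{field}#{quote(str(value), safe='')}"


print(enum_iri("patients", "gender", "M"))
print(enum_iri("procedures", "proc_code", "D 2600/A"))  # hypothetical raw code
```

The literal attached through `has_value` would still hold the original, unencoded value, so filters against the source data keep working.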

In [9]:
genders = list(pysqldf("select distinct gender from patients")['gender'])
proc_codes = list(pysqldf("select distinct proc_code from procedures")['proc_code'])

In [10]:
g.add_enums(genders, 'patients', 'gender', enum_ns, ns)
g.add_enums(proc_codes, 'procedures', 'proc_code', enum_ns, ns)

<Graph identifier=Nabe79ecb72d44897846d37b7c780ed18 (<class 'rdflib.graph.Graph'>)>

query the enums added to graph

In [11]:
q = """
prefix : <https://data-field-punning.owl/>
select ?enum ?label ?value ?defines where {
  ?enum a :enumerated_value;
    rdfs:label ?label;
    :has_value ?value;
    :defines_values_in ?defines .
}
"""
g.sparql_query_to_df(q).head()

Unnamed: 0,enum,label,value,defines
0,:enumerated_value/patients.gender#M,patients.gender M,M,:field/patients.gender
1,:enumerated_value/patients.gender#F,patients.gender F,F,:field/patients.gender
2,:enumerated_value/procedures.proc_code#d2300,procedures.proc_code d2300,d2300,:field/procedures.proc_code
3,:enumerated_value/procedures.proc_code#d2400,procedures.proc_code d2400,d2400,:field/procedures.proc_code
4,:enumerated_value/procedures.proc_code#d2500,procedures.proc_code d2500,d2500,:field/procedures.proc_code


## add what the data represents
The data in the tables represent things in the world. We need to connect the data to the things they represent.  
I created a simple mapping between IRIs and the classes represented by them. This could also be done using a `robot` template or `SSSOM` mapping file.  
Some of the mappings are at the field level. For example, the patient_id field represents a patient in general. Other mappings are at the level of enumerated values. For example, the value "F" in the patients.gender field represents a female.  
**Note**: This mapping involves punning the classes as individuals because the represents object property holds between individuals.

In [12]:
# this can be made into a function
for idx, iri, entity in pds.read_csv('../data/data_representations.csv').itertuples():
    g.add_spo(iri, ns.represents, entity)
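
One possible way to turn the loop above into a function is sketched here. It is kept independent of the graph class by taking a callback that adds a single triple. The column names `iri` and `entity` and the sample rows are assumptions; the real `data_representations.csv` may be laid out differently.

```python
import csv
import io


def add_representations(csv_file, add_triple, predicate):
    """Read (iri, entity) pairs and emit one `represents` triple per row."""
    count = 0
    for record in csv.DictReader(csv_file):
        add_triple(record["iri"].strip(), predicate, record["entity"].strip())
        count += 1
    return count


# Hypothetical file contents, used only to exercise the function.
sample = io.StringIO(
    "iri,entity\n"
    "https://data-field-punning.owl/field/patients.patient_id,https://data-field-punning.owl/patient\n"
    "https://data-field-punning.owl/field/providers.provider_id,https://data-field-punning.owl/dentist\n"
)

triples = []
n = add_representations(
    sample,
    lambda s, p, o: triples.append((s, p, o)),  # stand-in for a graph
    "https://data-field-punning.owl/represents",
)
print(n, triples[0])
```

With the notebook's graph, the callback could be `g.add_spo` and the predicate `ns.represents`, mirroring the loop in the cell above.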

In [13]:
q = """
prefix : <https://data-field-punning.owl/>
select ?uri ?label ?represents where {
    ?uri :represents ?represents .
    optional {
      ?uri rdfs:label ?label
    }
}
"""
g.sparql_query_to_df(q).head()

Unnamed: 0,uri,label,represents
0,:field/patients.patient_id,patients.patient_id,:patient
1,:field/procedures.patient_id,procedures.patient_id,:patient
2,:field/patients.primary_provider_id,patients.primary_provider_id,:dentist
3,:field/providers.provider_id,providers.provider_id,:dentist
4,:field/procedures.provider_id,procedures.provider_id,:dentist


## use representations to form sql queries

Conceptually, the schema data represents entities in the manner illustrated below. Using information about what the data *represent*, we can query multiple fields in multiple tables based on the hierarchy in the ontology instead of having to rely solely on knowing the field names and values to retrieve.

<img src='images/simple-table-field-entity-graph.drawio.png' />

Find every field name that represents a `person`.  
**Note**: This finds all fields that represent a subclass of `person`.

In [14]:
q = """
prefix : <https://data-field-punning.owl/>
select distinct ?table_name ?field_name (group_concat(?cls_name) as ?cls_names) where {
    ?cls rdfs:subClassOf :person;
         rdfs:label ?cls_label .
         
    ?field a :field;
        :represents ?cls;
        rdfs:label ?field_name;
        :member_of [a :table; rdfs:label ?table_name] .
        
    bind(replace(?cls_label, " ", "_") as ?cls_name)
}
group by ?table_name ?field_name
order by ?field_name
"""
field_df = g.sparql_query_to_df(q)
field_df

Unnamed: 0,table_name,field_name,cls_names
0,patients,patients.patient_id,patient
1,patients,patients.primary_provider_id,dentist
2,procedures,procedures.patient_id,patient
3,procedures,procedures.provider_id,dentist
4,providers,providers.provider_id,dentist


### build SQL query

In [15]:
q = df_to_sql(field_df)
print(q)

select 
  patients.patient_id as [patients.patient_id (patient)] 
  ,  patients.primary_provider_id as [patients.primary_provider_id (dentist)] 
  ,  procedures.patient_id as [procedures.patient_id (patient)] 
  ,  procedures.provider_id as [procedures.provider_id (dentist)] 
  ,  providers.provider_id as [providers.provider_id (dentist)] 
 
from providers 
inner join patients on 
  patients.primary_provider_id = procedures.provider_id
  and patients.primary_provider_id = providers.provider_id
  and patients.patient_id = procedures.patient_id
inner join procedures on 
  procedures.provider_id = providers.provider_id
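
The implementation of `df_to_sql` is not shown in this notebook. As a rough idea of how such a translation might work, the following sketch builds a select list with the same alias style and joins tables on fields that represent the same class. The function `fields_to_sql`, the join strategy, and the sample rows are assumptions for illustration; the SQL it produces is not identical to the output above.

```python
import sqlite3


def fields_to_sql(rows):
    """rows: (table_name, 'table.field', class_name) tuples, as in field_df."""
    select = [f"{field} as [{field} ({cls})]" for _, field, cls in rows]
    tables = list(dict.fromkeys(table for table, _, _ in rows))  # keep order, drop repeats
    sql = "select\n  " + "\n  , ".join(select) + f"\nfrom {tables[0]}"
    joined = [tables[0]]
    for table in tables[1:]:
        # Join on every pair of fields that represent the same class.
        # Assumes each new table shares at least one class with an earlier table.
        conditions = [
            f"{new_field} = {old_field}"
            for new_table, new_field, new_cls in rows if new_table == table
            for old_table, old_field, old_cls in rows
            if old_table in joined and old_cls == new_cls
        ]
        sql += f"\ninner join {table} on\n  " + "\n  and ".join(conditions)
        joined.append(table)
    return sql


rows = [
    ("patients", "patients.patient_id", "patient"),
    ("patients", "patients.primary_provider_id", "dentist"),
    ("procedures", "procedures.patient_id", "patient"),
    ("procedures", "procedures.provider_id", "dentist"),
    ("providers", "providers.provider_id", "dentist"),
]

# Tiny in-memory tables (made-up rows) to check that the generated SQL runs.
con = sqlite3.connect(":memory:")
con.executescript("""
create table patients (patient_id, primary_provider_id);
create table procedures (patient_id, provider_id);
create table providers (provider_id);
insert into patients values (1001, 1), (1002, 2);
insert into procedures values (1001, 1), (1002, 2);
insert into providers values (1), (2);
""")
sql = fields_to_sql(rows)
result = con.execute(sql).fetchall()
print(sql)
```

Each table is joined only after the tables it refers to are already in the `from` clause, which keeps the statement valid regardless of the order in which the fields are returned by the SPARQL query.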



In [16]:
sqldf(q)

Unnamed: 0,patients.patient_id (patient),patients.primary_provider_id (dentist),procedures.patient_id (patient),procedures.provider_id (dentist),providers.provider_id (dentist)
0,1001,1,1001,1,1
1,1004,1,1004,1,1
2,1002,2,1002,2,2
3,1005,2,1005,2,2
4,1003,3,1003,3,3
5,1006,3,1006,3,3


## use enumerated values to filter data

Find procedures that were root canals.  
The enumerated value `enum:procedures.proc_code#d2600` represents a `root canal` (i.e., `enum:procedures.proc_code#d2600 :represents :root_canal`).  
This permits us to filter for the procedure code value `d2600` using what the enum represents rather than the literal itself. This is useful in situations where different values represent the same kind of thing.
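
To illustrate the case where several literal values share one meaning, here is a small stdlib-only sketch. The mapping `enum_represents` stands in for the `represents` triples in the graph; the code `RC-OLD`, the `other_procedure` label, and the table rows are invented for the example.

```python
import sqlite3

# Hypothetical mapping from enumerated values to what they represent.
# "RC-OLD" is an invented legacy code, added to show two values with one meaning.
enum_represents = {
    "d2300": "other_procedure",
    "d2600": "root_canal",
    "RC-OLD": "root_canal",
}


def codes_for(entity):
    """Return every literal code whose enum represents the given entity."""
    return sorted(code for code, meaning in enum_represents.items() if meaning == entity)


con = sqlite3.connect(":memory:")
con.executescript("""
create table procedures (proc_id, patient_id, proc_code);
insert into procedures values (1, 1001, 'd2300'), (2, 1004, 'd2600'), (3, 1007, 'RC-OLD');
""")

codes = codes_for("root_canal")
placeholders = ", ".join("?" for _ in codes)
sql = (
    "select patient_id, proc_code from procedures "
    f"where proc_code in ({placeholders}) order by patient_id"
)
root_canals = con.execute(sql, codes).fetchall()
```

The query text never mentions a specific code, so adding another code that represents a root canal only requires a new mapping entry.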

In [17]:
q = """
prefix : <https://data-field-punning.owl/> 
select distinct ?table_name ?field_name ?cls_names ?enum_value
where {

    ?field a :field;
        :represents ?cls;
        rdfs:label ?field_name;
        :member_of [a :table; rdfs:label ?table_name] . 
    
    optional {
    ?enum a :enumerated_value;
        :has_value ?enum_value;
        :defines_values_in ?field;
        :represents :root_canal .
    }
    
    ?cls rdfs:label ?cls_names .
    filter(?cls = :procedure || ?cls = :patient)
}
"""
field_df = g.sparql_query_to_df(q)
field_df

Unnamed: 0,table_name,field_name,cls_names,enum_value
0,patients,patients.patient_id,patient,
1,procedures,procedures.proc_id,procedure,
2,procedures,procedures.patient_id,patient,
3,procedures,procedures.proc_code,procedure,d2600


### build SQL query

In [18]:
q = df_to_sql(field_df)
print(q)

select 
  patients.patient_id as [patients.patient_id (patient)] 
  ,  procedures.proc_id as [procedures.proc_id (procedure)] 
  ,  procedures.patient_id as [procedures.patient_id (patient)] 
  ,  procedures.proc_code as [procedures.proc_code (procedure)] 
 
from patients 
inner join procedures on 
  patients.patient_id = procedures.patient_id
  and procedures.proc_id = procedures.proc_code
where procedures.proc_code = 'd2600' 



In [19]:
sqldf(q)

Unnamed: 0,patients.patient_id (patient),procedures.proc_id (procedure),procedures.patient_id (patient),procedures.proc_code (procedure)


## add relations between fields
By adding relations between the fields, we can query for how entities represented by the data in the fields are related.  
For demonstration purposes, the relations are added to the graph directly. However, this information can also be in an external table.  
**To Do**: Write code to turn results into a SQL query.

In [20]:
g.add_spo(field_ns['/procedures.tooth_num'], ns['part_of'], field_ns['/procedures.patient_id'])
g.add_spo(field_ns['/procedures.tooth_num'], ns['participates_in'], field_ns['/procedures.proc_code'])

<Graph identifier=Nabe79ecb72d44897846d37b7c780ed18 (<class 'rdflib.graph.Graph'>)>

Find the fields whose data represents the entities that a `tooth` is `part of` or `participates in`.  
**Note**: The query searches for the field that *represents* a `tooth`, not the field itself.

In [21]:
q = """
prefix : <https://data-field-punning.owl/>
select ?subj_field ?predicate ?obj_field where {
  # find fields that represent a tooth
  ?subj 
      rdfs:label ?subj_field;
      :represents :tooth .
      
  # demonstrate that variables can be used as predicates
  {
    bind(:part_of as ?pred)
    ?pred rdfs:label ?predicate .
    
    ?subj ?pred ?obj .
    ?obj rdfs:label ?obj_field .
  } union {
    bind(:participates_in as ?pred)
    ?pred rdfs:label ?predicate .
    
    ?subj ?pred ?obj .
    ?obj rdfs:label ?obj_field .
  }
    
}
"""
g.sparql_query_to_df(q)

Unnamed: 0,subj_field,predicate,obj_field
0,procedures.tooth_num,part of,procedures.patient_id
1,procedures.tooth_num,participates in,procedures.proc_code
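
One possible starting point for translating these results into SQL is sketched below. It only collects the related fields into a select list and assumes they all come from a single table, as they do in the result above. The function name `relations_to_sql` is hypothetical, and handling joins across tables is left open.

```python
def relations_to_sql(relations):
    """relations: (subj_field, predicate, obj_field) rows with 'table.field' labels."""
    columns, tables = [], []
    for subj, _predicate, obj in relations:
        for field in (subj, obj):
            if field not in columns:
                columns.append(field)
            table = field.split(".", 1)[0]
            if table not in tables:
                tables.append(table)
    if len(tables) != 1:
        raise ValueError("this sketch only handles fields from a single table")
    return f"select {', '.join(columns)} from {tables[0]}"


relations = [
    ("procedures.tooth_num", "part of", "procedures.patient_id"),
    ("procedures.tooth_num", "participates in", "procedures.proc_code"),
]
sql = relations_to_sql(relations)
print(sql)
```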


## create a simple translation of the data into RDF

In [22]:
g.add_df(patients, 'patients', field_ns, ns)
g.add_df(procedures, 'procedures', field_ns, ns)
g.add_df(providers, 'providers', field_ns, ns)

<Graph identifier=Nabe79ecb72d44897846d37b7c780ed18 (<class 'rdflib.graph.Graph'>)>

Query to see if instance data was added using both:
* punned field names
* instances of field values that are members of field instances

query using fields as object properties (punned)

In [23]:
q = """
prefix : <https://data-field-punning.owl/>
select ?row ?field_name ?value where {
  ?field a :field;
      rdfs:label ?field_name .
      
  ?row a :row; 
      ?field [:has_value ?value] .
} 
order by ?row
limit 5
"""
g.sparql_query_to_df(q)

Unnamed: 0,row,field_name,value
0,N124df8790a1343cbb0f8deb37a3b22ed,patients.patient_id,1006
1,N124df8790a1343cbb0f8deb37a3b22ed,patients.name,Barney
2,N124df8790a1343cbb0f8deb37a3b22ed,patients.gender,M
3,N124df8790a1343cbb0f8deb37a3b22ed,patients.dob,2006-06-06
4,N124df8790a1343cbb0f8deb37a3b22ed,patients.primary_provider_id,3


similar query to above using instances of field values

In [57]:
q = """
prefix : <https://data-field-punning.owl/>
select ?row ?field_class ?field ?value where {
  ?row a :row .
  
  ?field_class_uri rdfs:subClassOf :field;
      rdfs:label ?field_class .
      
  ?field a ?field_class_uri;
      :member_of ?row.
      
  ?field_value_uri a :field_value;
      :member_of ?field;
      :has_value ?value .
}
order by ?row
limit 5
"""
g.sparql_query_to_df(q)

Unnamed: 0,row,field_class,field,value
0,N124df8790a1343cbb0f8deb37a3b22ed,patients.patient_id,N3b914df894d84b938d76d7f51c7e19cd,1006
1,N124df8790a1343cbb0f8deb37a3b22ed,patients.name,N85f371e874f5460bb5dfde3c2d0e74a2,Barney
2,N124df8790a1343cbb0f8deb37a3b22ed,patients.gender,N22c4df00c6564f44ac40ead2671ab976,M
3,N124df8790a1343cbb0f8deb37a3b22ed,patients.dob,Nbd9f92a0369447a18ddbb92906149641,2006-06-06
4,N124df8790a1343cbb0f8deb37a3b22ed,patients.primary_provider_id,N2740e78b2b8f48ccb42636a24ac7d141,3


## query for teeth that are part of a patient and participated in a procedure
Above we related the `tooth_num`, `patient_id`, and `proc_code` fields in the `procedures` table like so:
* `procedures.tooth_num` `part_of` `procedures.patient_id`  
* `procedures.tooth_num` `participates_in` `procedures.proc_code`

Using these relations between the fields, we can query for the data values that are related in this manner.  
Due to performance issues with rdflib, I divide this into two parts. The first part retrieves the URIs of the field values.  
The second part prints the literal values in a dataframe.

In [25]:
# part one: fetch field value uris
q = """
prefix : <https://data-field-punning.owl/>
select ?patient_value_uri ?tooth_value_uri ?procedure_value_uri where {

  # specify what the fields represent
  ?tooth_field_uri :represents :tooth .
  ?patient_field_uri :represents :patient .
  ?procedure_field_uri :represents :procedure .
  
  # specify how fields are related
  ?tooth_field_uri :part_of ?patient_field_uri .
  ?tooth_field_uri :participates_in ?procedure_field_uri .


  # find rows where the field contains the data/field value
  ?row a :row .
  ?row ?tooth_field_uri ?tooth_value_uri .
  ?row ?patient_field_uri ?patient_value_uri .
  ?row ?procedure_field_uri ?procedure_value_uri .
} 
"""
results = g.query(q)

In [26]:
# part two: display values in dataframe
data = [[str(g.value(patient_value_uri, ns.has_value)), 
         str(g.value(tooth_value_uri, ns.has_value)), 
         str(g.value(procedure_value_uri, ns.has_value))]
        for patient_value_uri, tooth_value_uri, procedure_value_uri in results]
pds.DataFrame(data, columns=["patient", "tooth", "procedure"])

Unnamed: 0,patient,tooth,procedure
0,1001,1,d2300
1,1002,2,d2400
2,1003,3,d2500
3,1004,4,d2600
4,1005,5,d2700
5,1006,6,d2800


This query returns the same results as above, but uses instances of fields, instead of using fields as object properties.  
**Note**: Punning is still used to relate data in the tooth field that is `part of` the data in the patient field.  
The fields that are represented as classes are punned to be represented as individuals: `?tooth_field_class_uri :part_of ?patient_field_class_uri`.

For performance reasons, displaying the results in a dataframe is more complicated.

In [27]:
# part one: fetch field uris
q = """
prefix : <https://data-field-punning.owl/>
select ?patient_field_uri ?tooth_field_uri ?procedure_field_uri where {
  # specify what the fields represent
  ?tooth_field_class_uri a owl:Class; :represents :tooth .
  ?patient_field_class_uri a owl:Class; :represents :patient .
  ?procedure_field_class_uri a owl:Class; :represents :procedure .
  
  # specify how fields are related (field classes are punned as individuals)
  ?tooth_field_class_uri :part_of ?patient_field_class_uri .
  ?tooth_field_class_uri :participates_in ?procedure_field_class_uri .

  # find instances of fields that are members of same row
  ?row a :row .
  ?tooth_field_uri 
      a ?tooth_field_class_uri;
      :member_of ?row .
  ?patient_field_uri 
      a ?patient_field_class_uri; 
      :member_of ?row .
  ?procedure_field_uri
      a ?procedure_field_class_uri;
      :member_of ?row .
} 
"""
results = g.query(q)

In [28]:
# part two: display values in dataframe
data = []
for patient_field_uri, tooth_field_uri, procedure_field_uri in results:
    # fetch field value uri that is a member of the field (uri)
    #   g.triples yields (s, p, o) tuples; after list(...), the field value uri
    #   (the subject) is the first element of the first tuple
    patient_value_uri = list(g.triples((None, ns.member_of, patient_field_uri)))[0][0]
    tooth_value_uri = list(g.triples((None, ns.member_of, tooth_field_uri)))[0][0]
    procedure_value_uri = list(g.triples((None, ns.member_of, procedure_field_uri)))[0][0]
    
    # put literal data values in data list
    data.append([str(g.value(patient_value_uri, ns.has_value)), 
                 str(g.value(tooth_value_uri, ns.has_value)), 
                 str(g.value(procedure_value_uri, ns.has_value))])

pds.DataFrame(data, columns=["patient", "tooth", "procedure"])

Unnamed: 0,patient,tooth,procedure
0,1001,1,d2300
1,1002,2,d2400
2,1003,3,d2500
3,1004,4,d2600
4,1005,5,d2700
5,1006,6,d2800


## use enum to filter RDF
Similar to above, we can filter for root canal procedures using what the enumerated value `enum:procedures.proc_code#d2600` represents.

In [75]:
# part one: fetch field value uris
q = """
base <https://data-field-punning.owl/>
prefix : <https://data-field-punning.owl/>
select ?patient ?tooth ?procedure where {
  # specify what the enum represents and the field whose values it defines
  ?enum_uri 
      a :enumerated_value;
      :represents :root_canal;
      :has_value ?enum_value;
      :defines_values_in ?field_uri .
      
  # fetch data/values from field defined by enum
  ?row 
      a :row;
      <field/procedures.patient_id> [:has_value ?patient];
      <field/procedures.tooth_num> [:has_value ?tooth];
      ?field_uri [:has_value ?procedure] .
  filter(?procedure = ?enum_value)  
} 
"""
g.sparql_query_to_df(q)

Unnamed: 0,patient,tooth,procedure
0,1004,4,d2600
