# Enriching Wikidata with the Getty KG
This notebook shows how graphs such as the Getty Vocabularies can be used to enrich Wikidata with `kgtk` operations. We demonstrate this enrichment on the records of people in the `Arnold Schwarzenegger` graph that exist both in Wikidata (with a Qnode) and in the Getty Vocabularies (with a ULAN ID). We will enrich their `date of birth` information.

Specifically, we will investigate: *Does Getty contain complementary information to Wikidata about people's date of birth?*

We will use KGTK to import Getty data, align Getty to Wikidata, query dates of birth in both graphs separately, compare the results, and enrich the Wikidata graph with the missing information.

In [1]:
import io
import os
import subprocess
import sys

import numpy as np
import pandas as pd
from IPython.display import display, HTML

# import papermill as pm
# sys.path.insert(0,'..')
from kgtk.configure_kgtk_notebooks import ConfigureKGTK
from kgtk.functions import kgtk, kypher

## Set up environment path
Here we set up the parameters and environment variables used in the following sections, including folders, input files, and the location of query outputs.

In [2]:
# Parameters

kgtk_path = "/Users/filipilievski/mcs/kgtk"

tutorial_deployment_path = "/Users/filipilievski/mcs/kgtk-tutorial-files/datasets"
project_deployment_path = tutorial_deployment_path + "/arnold-network-analysis"

# Folder on local machine where to create the output and temporary folders
input_path = "/Users/filipilievski/mcs/kgtk-tutorial-files/datasets/getty"
output_path = "/Users/filipilievski/mcs/kgtk-projects"
project_name = "getty-enrichment"

In [3]:
files = [
    "all",
    "label",
    'ULAN_term',
    'ULAN_subject',
    'ULAN_agentmap',
    'ULAN_biography',
    "ULAN_namespace"
]
ck = ConfigureKGTK(files, kgtk_path=kgtk_path)
ck.configure_kgtk(input_graph_path=input_path,
                  output_path=output_path,
                  project_name=project_name,
                  json_config_file="files_config.json")

User home: /Users/filipilievski
Current dir: /Users/filipilievski/mcs/kgtk/tutorial
KGTK dir: /Users/filipilievski/mcs/kgtk
Use-cases dir: /Users/filipilievski/mcs/kgtk/use-cases


In [4]:
os.environ['kgtk_path'] = kgtk_path
os.environ['KGTK_GRAPH_CACHE'] = os.environ['STORE']
os.environ['KGTK_OPTION_DEBUG'] = "false"

In [5]:
ck.print_env_variables()


STORE: /Users/filipilievski/mcs/kgtk-projects/getty-enrichment/temp.getty-enrichment/wikidata.sqlite3.db
OUT: /Users/filipilievski/mcs/kgtk-projects/getty-enrichment
EXAMPLES_DIR: /Users/filipilievski/mcs/kgtk/examples
USE_CASES_DIR: /Users/filipilievski/mcs/kgtk/use-cases
GRAPH: /Users/filipilievski/mcs/kgtk-tutorial-files/datasets/getty
kypher: kgtk query --graph-cache /Users/filipilievski/mcs/kgtk-projects/getty-enrichment/temp.getty-enrichment/wikidata.sqlite3.db
TEMP: /Users/filipilievski/mcs/kgtk-projects/getty-enrichment/temp.getty-enrichment
kgtk: kgtk
all: /Users/filipilievski/mcs/kgtk-tutorial-files/datasets/getty/all.tsv.gz
label: /Users/filipilievski/mcs/kgtk-tutorial-files/datasets/getty/labels.en.tsv.gz
ULAN_term: /Users/filipilievski/mcs/kgtk-tutorial-files/datasets/getty/ULANOut_2Terms.nt
ULAN_subject: /Users/filipilievski/mcs/kgtk-tutorial-files/datasets/getty/ULANOut_1Subjects.nt
ULAN_agentmap: /Users/filipilievski/mcs/kgtk-tutorial-files/datasets/getty/ULANOut_AgentM

## Approach overview

The Getty knowledge graph consists of [multiple vocabulary files](https://www.getty.edu/research/tools/vocabularies/), including ULAN (Union List of Artist Names), TGN (Thesaurus of Geographic Names), and AAT (Art & Architecture Thesaurus).
In this tutorial, we will focus on the ULAN vocabulary, which "includes names, rich relationships, notes, sources, and biographical information for artists, architects, firms, studios, repositories, and patrons, both individuals and corporate bodies, named and anonymous". The procedures for the other vocabularies should be analogous as they are also in `.nt` format.

The method that we will use consists of the following 5 steps:
1. Import Getty's ULAN file into KGTK
2. Align Getty to Wikidata
3. Query Wikidata, record known & unknown values
4. Query Getty to see if we can find these unknown values
5. Append the newly found values to Wikidata

## 1. Import ULAN into `kgtk`

As both ULAN and TGN are stored in n-triples (`.nt`) format, we can simply use the `import-ntriples` command. 

**Understanding prefixes** Getty conveniently provides an ontology file in an [RDF format](http://vocab.getty.edu/ontology.rdf), which defines the prefixes in the file header. We have transformed this file in KGTK format (`namespaces.tsv`) and we will use it to help KGTK understand prefixes in the data.
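
To illustrate what the namespace file is used for, here is a minimal sketch of how a prefix map turns full URIs into the compact `prefix:local` identifiers seen later in this notebook (for example `ulan:500224955`). The `NAMESPACES` dictionary and the `compact_uri` helper are hypothetical names introduced for illustration, and the two namespace URIs are assumptions based on Getty's published vocabularies; KGTK's own implementation may differ.

```python
# Hypothetical prefix map for illustration; the real one is read from the namespace file.
NAMESPACES = {
    "gvp": "http://vocab.getty.edu/ontology#",
    "ulan": "http://vocab.getty.edu/ulan/",
}

def compact_uri(uri, namespaces=NAMESPACES):
    """Return 'prefix:local' if a known namespace matches, else the URI unchanged."""
    # Try the longest namespace first so nested namespaces resolve correctly.
    for prefix, base in sorted(namespaces.items(), key=lambda kv: -len(kv[1])):
        if uri.startswith(base):
            return f"{prefix}:{uri[len(base):]}"
    return uri

print(compact_uri("http://vocab.getty.edu/ulan/500224955"))
print(compact_uri("http://vocab.getty.edu/ontology#estStart"))
```

URIs that match no known namespace are returned unchanged in this sketch.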

**Getty files** We will use four files from Getty's ULAN vocabulary:
1. `Biography`, which links agents to biographies, using the `gvp:biographyPreferred` property.
2. `Agent Map`, which links people to their roles ("agents"), through the `foaf:focus` property.
3. `Subjects`, which uses `dc:identifier` to link ULAN nodes to their ULAN ID strings.
4. `Terms`, which links agents to their year of birth and death, using the `gvp:estStart` and `gvp:estEnd` properties.

We first import each of these four files into KGTK:

In [6]:
%%time
kgtk("""
    import-ntriples 
        -i $ULAN_term 
        -o $TEMP/ULAN_term_KGTK.tsv 
        --namespace-file $ULAN_namespace 
        --namespace-id-use-uuid True 
        --build-new-namespaces False 
        --output-only-used-namespaces True 
        --structured-value-label gvp:structured_value 
        --structured-uri-label gvp:structured_uri 
        --newnode-prefix node 
        --newnode-use-uuid True
    """)

CPU times: user 43.9 ms, sys: 41 ms, total: 84.8 ms
Wall time: 2min 19s


In [7]:
%%time
kgtk("""
    import-ntriples 
        -i $ULAN_subject 
        -o $TEMP/ULAN_subject_KGTK.tsv 
        --namespace-file $ULAN_namespace 
        --namespace-id-use-uuid True 
        --build-new-namespaces False 
        --output-only-used-namespaces True 
        --structured-value-label gvp:structured_value 
        --structured-uri-label gvp:structured_uri 
        --newnode-prefix node 
        --newnode-use-uuid True
    """)

CPU times: user 15.8 ms, sys: 17.3 ms, total: 33.1 ms
Wall time: 51.7 s


In [8]:
%%time
kgtk("""
    import-ntriples 
        -i $ULAN_agentmap 
        -o $TEMP/ULAN_agentmap_KGTK.tsv 
        --namespace-file $ULAN_namespace 
        --namespace-id-use-uuid True 
        --build-new-namespaces False 
        --output-only-used-namespaces True 
        --structured-value-label gvp:structured_value 
        --structured-uri-label gvp:structured_uri 
        --newnode-prefix node 
        --newnode-use-uuid True
    """)

CPU times: user 10.8 ms, sys: 14.7 ms, total: 25.5 ms
Wall time: 23.5 s


In [9]:
%%time
kgtk("""
    import-ntriples 
        -i $ULAN_biography 
        -o $TEMP/ULAN_biography_KGTK.tsv 
        --namespace-file $ULAN_namespace 
        --namespace-id-use-uuid True 
        --build-new-namespaces False 
        --output-only-used-namespaces True 
        --structured-value-label gvp:structured_value 
        --structured-uri-label gvp:structured_uri 
        --newnode-prefix node 
        --newnode-use-uuid True
    """)

CPU times: user 32 ms, sys: 30.3 ms, total: 62.2 ms
Wall time: 1min 33s
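
The four import cells above differ only in their input and output files. As a minimal sketch, the same command strings could be generated in a loop. `COMMON_FLAGS` and `import_command` are hypothetical helper names introduced here; the flags themselves are copied from the cells above. The call to `kgtk` is left commented out so that the sketch only builds and prints the commands.

```python
# Flags shared by all four import cells, copied from the commands above.
COMMON_FLAGS = " ".join([
    "--namespace-file $ULAN_namespace",
    "--namespace-id-use-uuid True",
    "--build-new-namespaces False",
    "--output-only-used-namespaces True",
    "--structured-value-label gvp:structured_value",
    "--structured-uri-label gvp:structured_uri",
    "--newnode-prefix node",
    "--newnode-use-uuid True",
])

def import_command(name):
    """Build the import-ntriples command for one ULAN file (hypothetical helper)."""
    return (
        f"import-ntriples -i $ULAN_{name} "
        f"-o $TEMP/ULAN_{name}_KGTK.tsv {COMMON_FLAGS}"
    )

commands = [import_command(n) for n in ["term", "subject", "agentmap", "biography"]]
for command in commands:
    print(command)
    # kgtk(command)  # uncomment inside the notebook to run the import
```

Keeping the cells separate, as the notebook does, has the advantage that each import is timed and can be rerun on its own.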


After importing the files, we can use KGTK operations on them. We start with `kgtk cat`, which concatenates them into a single file that is more convenient to work with.

In [10]:
%%time
kgtk("""
    cat -i $TEMP/ULAN_term_KGTK.tsv $TEMP/ULAN_subject_KGTK.tsv $TEMP/ULAN_agentmap_KGTK.tsv $TEMP/ULAN_biography_KGTK.tsv 
        -o $TEMP/ULAN_all.tsv
    """)

CPU times: user 6.22 ms, sys: 9.73 ms, total: 15.9 ms
Wall time: 11 s


## 2. Build Getty-Wikidata Alignment
Getty provides a `WikidataAlignment` file but our analysis showed that this alignment file is incomplete or out-of-date. Thus, we build our own alignment file, which links ULAN IDs to Wikidata Qnodes.

We perform a join between the Wikidata and the ULAN graph, through the ULAN identifiers available in both graphs.
Wikidata uses the property `P245` to map Qnode IDs to ULAN identifiers, whereas Getty links ULAN nodes to their IDs with the `dc:identifier` property.

We will use the `skos:exactMatch` property to indicate alignment between ULAN nodes and Wikidata nodes.
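
The join described above can be sketched in plain Python on toy edges. The edge lists below are illustrative only: the first pair mirrors a Qnode and ULAN node that appear in this notebook's results, the identifier strings are assumed to equal the numeric part of the ULAN node, and the other rows are made up to show non-matching cases.

```python
# Toy edges as (node1, label, node2); not real query output.
wikidata_edges = [
    ("Q100948", "P245", "500224955"),
    ("Q_without_match", "P245", "000000001"),   # made-up row
]
getty_edges = [
    ("ulan:500224955", "dc:identifier", "500224955"),
    ("ulan:500999999", "dc:identifier", "500999999"),  # made-up row
]

# Index Wikidata Qnodes by their ULAN identifier (P245).
qnodes_by_identifier = {}
for node1, label, node2 in wikidata_edges:
    if label == "P245":
        qnodes_by_identifier.setdefault(node2, set()).add(node1)

# Join Getty nodes to Qnodes through the shared identifier.
alignment = sorted({
    (ulan_node, "skos:exactMatch", qnode)
    for ulan_node, label, identifier in getty_edges
    if label == "dc:identifier"
    for qnode in qnodes_by_identifier.get(identifier, ())
})
print(alignment)
```

Only nodes whose identifier occurs in both graphs end up in the alignment, which is the behavior we want from the Kypher query.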

Let's first see what results we get with this join operation:

In [11]:
%%time
kgtk("""
    query -i $all $TEMP/ULAN_all.tsv 
        --match 'all: (qnode)-[:P245]->(identifier), ULAN: (ulanid)-[p]->(identifier)' 
        --where 'p.label = "dc:identifier"' 
        --return 'distinct ulanid as node1, "skos:exactMatch" as label, qnode as node2' 
        --limit 10
    / add-labels
    """)

CPU times: user 51.3 ms, sys: 39.8 ms, total: 91.1 ms
Wall time: 2min 7s


Unnamed: 0,node1,label,node2,node2;label
0,ulan:500224955,skos:exactMatch,Q100948,'Rachel Carson'@en
1,ulan:500281177,skos:exactMatch,Q101771,'Gottfried Gruben'@en
2,ulan:500001235,skos:exactMatch,Q101791,'Sep Ruf'@en
3,ulan:500256782,skos:exactMatch,Q102139,'Margrethe II of Denmark'@en
4,ulan:500302331,skos:exactMatch,Q1024362,'Spanish National Research Council'@en
5,ulan:500286871,skos:exactMatch,Q1024426,'University of South Carolina'@en
6,ulan:500114625,skos:exactMatch,Q102711,'Dennis Hopper'@en
7,ulan:500304375,skos:exactMatch,Q10288082,'Wildenstein & Company'@en
8,ulan:500355461,skos:exactMatch,Q103876,'Peter O\'Toole'@en
9,ulan:500221924,skos:exactMatch,Q1049334,'United States Army Corps of Engineers'@en


The results look reasonable, so let's go ahead and store the alignment into a KGTK file:

In [12]:
%%time
kgtk("""
    query -i $all $TEMP/ULAN_all.tsv 
        --match 'all: (qnode)-[:P245]->(identifier), ULAN: (ulanid)-[p]->(identifier)' 
        --where 'p.label = "dc:identifier"' 
        --return 'distinct ulanid as node1, "skos:exactMatch" as label, qnode as node2' 
        -o $TEMP/ULAN_ALIGN.tsv
    """)

CPU times: user 2.79 ms, sys: 8.3 ms, total: 11.1 ms
Wall time: 1.02 s


We now run a simple Kypher query to count the Qnodes for which we have a ULAN mapping:

In [13]:
kgtk("""
    query -i $TEMP/ULAN_ALIGN.tsv 
        --match '(ulanid)-[]->(qnode)' 
        --return 'count(distinct qnode) as QNODE'
    """)

Unnamed: 0,QNODE
0,535


**Finding:** we obtain ULAN mappings for 535 Wikidata nodes.

## 3. Query Wikidata
We query Wikidata for these 535 people to see whether it records a date of birth for them, using the `P569` property.

We first take a glimpse at the query results (limited to 10 rows):

In [14]:
%%time
kgtk("""
    query -i $TEMP/ULAN_ALIGN.tsv $all 
        --match 'ALIGN: (ulanid)-[]->(qnode), 
                 all: (qnode)-[p:P569]->(birthdate)' 
        --return 'qnode as node1, p.label as label, birthdate as node2' 
        --limit 10
    / add-labels
    """)

CPU times: user 4.78 ms, sys: 7.48 ms, total: 12.3 ms
Wall time: 2 s


Unnamed: 0,node1,label,node2,node1;label,label;label
0,Q100948,P569,^1907-05-27T00:00:00Z/11,'Rachel Carson'@en,'date of birth'@en
1,Q101771,P569,^1929-06-21T00:00:00Z/11,'Gottfried Gruben'@en,'date of birth'@en
2,Q101791,P569,^1908-03-09T00:00:00Z/11,'Sep Ruf'@en,'date of birth'@en
3,Q102139,P569,^1940-04-16T00:00:00Z/11,'Margrethe II of Denmark'@en,'date of birth'@en
4,Q102711,P569,^1936-05-17T00:00:00Z/11,'Dennis Hopper'@en,'date of birth'@en
5,Q103876,P569,^1932-08-02T00:00:00Z/11,'Peter O\'Toole'@en,'date of birth'@en
6,Q1066442,P569,^1925-10-31T00:00:00Z/11,'Charles Moore'@en,'date of birth'@en
7,Q106775,P569,^1930-10-02T00:00:00Z/11,'Richard Harris'@en,'date of birth'@en
8,Q106775,P569,^1930-10-01T00:00:00Z/11,'Richard Harris'@en,'date of birth'@en
9,Q1080053,P569,^1864-07-20T00:00:00Z/11,'Louis Finot'@en,'date of birth'@en


Now that we understand the results, we perform the query for all 535 people:

In [15]:
%%time
kgtk("""
    query -i $TEMP/ULAN_ALIGN.tsv $all 
        --match 'ALIGN: (ulanid)-[]->(qnode), 
                 all: (qnode)-[p:P569]->(birthdate)' 
        --return 'qnode as node1, p.label as label, birthdate as node2' 
        -o $TEMP/WD_BD.tsv
    """)

CPU times: user 2.62 ms, sys: 6.74 ms, total: 9.36 ms
Wall time: 947 ms


We then count the distinct Qnodes for which we find a date of birth:

In [16]:
kgtk("""
    query -i $TEMP/WD_BD.tsv
        --match '(qnode)-[]->()' 
        --return 'count(distinct qnode) as Qnode'
    """)

Unnamed: 0,Qnode
0,266


**Finding:** Out of the 535 people, 266 have a date of birth in Wikidata.

## 4. Query Getty

*Let's see whether Getty can fill the knowledge gaps for the remaining people in Wikidata...*

We now query Getty for the same set of 535 people. In this query, we take the ULAN IDs that correspond to the Qnodes of interest, link these ULAN nodes to ULAN agents or "roles" (using `foaf:focus`), find their biography (using `gvp:biographyPreferred`), and get the birth year from the `gvp:estStart` property. As the birth year is a structured literal, we take its value (`gvp:structured_value`, the label we assigned during import).

Getty provides dates of birth at year granularity. We therefore query Getty for years of birth and format them as dates, using the appropriate year precision marker (`/9` in Wikidata and KGTK).
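
As a minimal sketch of this formatting step, the function below turns a year into a KGTK date literal with year precision. `year_to_kgtk_date` is a hypothetical helper; stripping the surrounding double quotes is a simplified stand-in for `kgtk_unstringify`, and the sketch assumes four-digit years.

```python
def year_to_kgtk_date(year):
    """Format a year as a KGTK date literal with year precision (/9)."""
    # Simplified stand-in for kgtk_unstringify: drop surrounding double quotes.
    bare_year = str(year).strip('"')
    return f"^{bare_year}-01-01T00:00:00Z/9"

print(year_to_kgtk_date('"1907"'))
```

Month and day are set to January 1 only as placeholders; the `/9` suffix tells consumers that only the year is meaningful.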

In [17]:
%%time
kgtk("""
    query -i $TEMP/ULAN_ALIGN.tsv $TEMP/ULAN_all.tsv
        --match 'ALIGN: (ulanid)-[]->(qnode), 
                 all: (ulanid)-[p0]->(ulanagent), 
                 all: (ulanagent)-[p1]->()-[p2]->()-[p3]->(datevalue)' 
        --where 'p0.label = "foaf:focus" AND p1.label = "gvp:biographyPreferred" AND p2.label = "gvp:estStart" AND p3.label = "gvp:structured_value"' 
        --return 'distinct qnode as node1, "P569" as label, printf("^%s-01-01T00:00:00Z/9", kgtk_unstringify(datevalue)) as node2' 
        --limit 10
    / add-labels
    """)

CPU times: user 10.7 ms, sys: 11.1 ms, total: 21.7 ms
Wall time: 18.1 s


Unnamed: 0,node1,label,node2,node1;label,label;label
0,Q100948,P569,^1907-01-01T00:00:00Z/9,'Rachel Carson'@en,'date of birth'@en
1,Q101771,P569,^1929-01-01T00:00:00Z/9,'Gottfried Gruben'@en,'date of birth'@en
2,Q101791,P569,^1908-01-01T00:00:00Z/9,'Sep Ruf'@en,'date of birth'@en
3,Q102139,P569,^1940-01-01T00:00:00Z/9,'Margrethe II of Denmark'@en,'date of birth'@en
4,Q1024362,P569,^1800-01-01T00:00:00Z/9,'Spanish National Research Council'@en,'date of birth'@en
5,Q1024426,P569,^1850-01-01T00:00:00Z/9,'University of South Carolina'@en,'date of birth'@en
6,Q102711,P569,^1936-01-01T00:00:00Z/9,'Dennis Hopper'@en,'date of birth'@en
7,Q10288082,P569,^1875-01-01T00:00:00Z/9,'Wildenstein & Company'@en,'date of birth'@en
8,Q103876,P569,^1932-01-01T00:00:00Z/9,'Peter O\'Toole'@en,'date of birth'@en
9,Q1049334,P569,^1850-01-01T00:00:00Z/9,'United States Army Corps of Engineers'@en,'date of birth'@en


As expected, we obtain dates of birth with a year precision (`/9`). We can thus go ahead and query for the dates of birth for all 535 entities:

In [18]:
%%time
kgtk("""
    query -i $TEMP/ULAN_ALIGN.tsv $TEMP/ULAN_all.tsv
        --match 'ALIGN: (ulanid)-[]->(qnode), 
                 all: (ulanid)-[p0]->(ulanagent), 
                 all: (ulanagent)-[p1]->()-[p2]->()-[p3]->(datevalue)' 
        --where 'p0.label = "foaf:focus" AND p1.label = "gvp:biographyPreferred" AND p2.label = "gvp:estStart" AND p3.label = "gvp:structured_value"' 
        --return 'distinct qnode as node1, "P569" as label, printf("^%s-01-01T00:00:00Z/9", kgtk_unstringify(datevalue)) as node2' 
        -o $TEMP/Getty_BD.tsv
    """)

CPU times: user 2.69 ms, sys: 7.51 ms, total: 10.2 ms
Wall time: 1 s


We validate the Getty results:

In [19]:
!kgtk validate -i $TEMP/Getty_BD.tsv --ignore-minimum-year


Data lines read: 0
Data lines passed: 0


Let's see how many results we found in Getty:

In [20]:
kgtk("""
    query -i $TEMP/Getty_BD.tsv 
        --match '(qnode)-[]->()' 
        --return 'count(distinct qnode) as Qnode'
    """)

Unnamed: 0,Qnode
0,535


**Finding:** We find date of birth for all 535 people in our Getty knowledge graph!

### 4a. How many values are novel?
Here we count how many entities have a date of birth in Getty but not in Wikidata:

In [21]:
%%time
kgtk("""
    ifnotexists -i $TEMP/Getty_BD.tsv 
        --filter-on $TEMP/WD_BD.tsv 
        --input-keys node1 
        --filter-keys node1 
        -o $TEMP/New_BD.tsv
    """)

CPU times: user 2.4 ms, sys: 7.04 ms, total: 9.44 ms
Wall time: 963 ms


In [22]:
kgtk("""
    query -i $TEMP/New_BD.tsv
        --match '(qnode)-[]->()' 
        --return 'count(distinct qnode) as Qnode'
    """)

Unnamed: 0,Qnode
0,269


**Finding:** Getty provides newly found dates of birth for 269 entities. This is expected, given that Getty has values for 535 entities and Wikidata for 266 (535 - 266 = 269).
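
The `ifnotexists` step used above behaves like an anti-join on `node1`: it keeps the Getty rows whose `node1` has no row in the Wikidata file. A minimal sketch on toy rows (illustrative values, not the full query output):

```python
# Toy rows as (node1, label, node2); illustrative only.
getty_rows = [
    ("Q100948", "P569", "^1907-01-01T00:00:00Z/9"),
    ("Q1024362", "P569", "^1800-01-01T00:00:00Z/9"),
]
wikidata_rows = [
    ("Q100948", "P569", "^1907-05-27T00:00:00Z/11"),
]

# Nodes that already have a date of birth in Wikidata.
known_nodes = {node1 for node1, _, _ in wikidata_rows}

# Anti-join: keep Getty rows whose node1 is not known in Wikidata.
new_rows = [row for row in getty_rows if row[0] not in known_nodes]
print(new_rows)
```

Because the comparison uses only `node1`, a node with any Wikidata date of birth is excluded, even if the Getty year differs.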

Let's see how many values we get in total:

In [23]:
kgtk("""
    query -i $TEMP/New_BD.tsv
        --match '(qnode)-[]->()' 
        --return 'count(qnode) as Qnode'
    """)

Unnamed: 0,Qnode
0,273


**Finding:** There are 273 rows for 269 distinct nodes, so Getty lists more than one birth year for some nodes (four extra rows in total).

### 4b. Do the known values in Getty and Wikidata match?
Let's check whether the values found in Getty match those in Wikidata. We first use the `ifexists` command to obtain the Getty birth dates for nodes that also have a birth date in Wikidata:

In [24]:
%%time
kgtk("""
    ifexists -i $TEMP/Getty_BD.tsv 
        --filter-on $TEMP/WD_BD.tsv 
        --input-keys node1 
        --filter-keys node1 
        -o $TEMP/matching_bd.tsv
    """)

CPU times: user 3.11 ms, sys: 8.2 ms, total: 11.3 ms
Wall time: 1.04 s


We expect to get birth date values from both sources for 266 nodes:

In [25]:
kgtk("""
    query -i $TEMP/matching_bd.tsv 
        --match '(qnode)-[]->()' 
        --return 'count(distinct qnode) as Qnode'
    """)

Unnamed: 0,Qnode
0,266


Our expectation is correct. Let's now see for how many of those nodes Wikidata and Getty agree on the year of birth:

In [26]:
kgtk("""
    query -i $TEMP/Getty_BD.tsv $TEMP/WD_BD.tsv
        --match 'Getty: (qnode)-[p]->(v1), WD: (qnode)-[]->(v2)' 
        --where 'kgtk_date_year(v1) = kgtk_date_year(v2)' 
        --return 'count(distinct qnode)'
    """)

Unnamed: 0,"count(DISTINCT graph_18_c1.""node1"")"
0,250


**Finding:** 250 of the 266 nodes have the same year of birth in Wikidata and Getty.
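
The comparison above ignores month and day, because Getty only records years. A minimal sketch of the same idea in plain Python follows; `kgtk_year` is a hypothetical stand-in for `kgtk_date_year` that parses the year with a regular expression, and the mismatching example values are made up.

```python
import re

def kgtk_year(literal):
    """Extract the year from a KGTK date literal such as ^1907-05-27T00:00:00Z/11."""
    match = re.match(r"\^([+-]?\d+)-", literal)
    if match is None:
        raise ValueError(f"not a KGTK date literal: {literal!r}")
    return int(match.group(1))

def years_agree(getty_dates, wikidata_dates):
    """True if at least one Getty year equals at least one Wikidata year."""
    getty_years = {kgtk_year(d) for d in getty_dates}
    wikidata_years = {kgtk_year(d) for d in wikidata_dates}
    return bool(getty_years & wikidata_years)

# Same year, different precision: counted as agreement.
print(years_agree(["^1907-01-01T00:00:00Z/9"], ["^1907-05-27T00:00:00Z/11"]))
# Made-up mismatch for illustration.
print(years_agree(["^1850-01-01T00:00:00Z/9"], ["^1864-07-20T00:00:00Z/11"]))
```

As in the Kypher query, a node with several values counts as agreeing when any pair of years matches.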

## 5. Append the newly found years to the Wikidata graph

We are now ready to insert the new values from Getty (273 edges for 269 nodes) into Wikidata. We first complete each edge with an id, using the `add-id` command:

In [27]:
%%time
kgtk("""add-id --debug -i $TEMP/New_BD.tsv --id-style wikidata -o $TEMP/New_BD_with_ID.tsv""")

CPU times: user 2.5 ms, sys: 6.95 ms, total: 9.45 ms
Wall time: 1.02 s


Finally, we concatenate the original Wikidata graph with the new edges from Getty:

In [28]:
%%time
kgtk("""cat -i $all $TEMP/New_BD_with_ID.tsv -o $OUT/all_plus_getty.tsv""")

CPU times: user 6.42 ms, sys: 9.5 ms, total: 15.9 ms
Wall time: 12.5 s


Let's count the number of edges in Wikidata before and after enrichment.

Before:

In [29]:
%%time
kgtk("""
    query -i $all 
    --match '(q)-[]->()'
    --return 'count(q)'
    """)

CPU times: user 5.13 ms, sys: 9.06 ms, total: 14.2 ms
Wall time: 1.12 s


Unnamed: 0,"count(graph_1_c1.""node1"")"
0,3577653


After:

In [31]:
%%time
kgtk("""
    query -i $OUT/all_plus_getty.tsv 
    --match '(q)-[]->()'
    --return 'count(q)'
    """)

CPU times: user 11.9 ms, sys: 12.5 ms, total: 24.4 ms
Wall time: 22.5 s


Unnamed: 0,"count(graph_21_c1.""node1"")"
0,3577926


**Finding:** As expected, the difference is 273 (3,577,926 - 3,577,653) edges.
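
The counts reported in this notebook can be cross-checked with a few lines of arithmetic. The variable names are introduced here for illustration; the numbers are the ones reported in the findings above.

```python
aligned_nodes = 535          # Qnodes with a ULAN mapping
nodes_with_wd_birth = 266    # of those, Qnodes with P569 in Wikidata
new_nodes = 269              # Qnodes with a birth year only in Getty
new_edges = 273              # rows in New_BD.tsv
edges_before = 3577653       # edges in the original graph
edges_after = 3577926        # edges in the enriched graph

# Every aligned node without a Wikidata birth date gains one from Getty.
print(aligned_nodes - nodes_with_wd_birth == new_nodes)
# The enriched graph grows by exactly the number of new edges.
print(edges_after - edges_before == new_edges)
```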