# Wikidata Enrichment Tutorial
This notebook shows how an external graph such as the Getty Vocabularies can be used to enrich Wikidata with `kgtk`. We run our queries on a sample of 432 records for people who appear in both Wikidata (with a Qnode) and the Getty Vocabularies (with a ULAN ID). We focus on three facts about them (date of birth, date of death, and place of birth) and check whether Getty can fill gaps in Wikidata for these three fields.

In [1]:
import os
import re
import time
import json
import subprocess

from kgtk.functions import kgtk, kypher

## Set up environment path
Here we set up the environment variables used in the following sections: folders, database files, query outputs, and so on.
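The cell below repeats the same two steps for every path: set an environment variable, then record its name. As a minimal sketch, the same pattern can be driven from a single dictionary. The helper `register_paths` and the `DEMO_*` names and paths are hypothetical, introduced only for illustration; they are not part of `kgtk`.

```python
import os

def register_paths(paths, registry):
    """Export each name -> path pair as an environment variable and
    remember the name so it can be printed or inspected later."""
    for name, path in paths.items():
        os.environ[name] = path
        if name not in registry:
            registry.append(name)
    return registry

# Hypothetical example paths; replace them with the real locations.
example_registry = register_paths(
    {"DEMO_DATABASE": "/tmp/demo/gvp", "DEMO_OUTPUT": "/tmp/demo/gvp/output"},
    [],
)
for name in example_registry:
    print('{}: "{}"'.format(name, os.environ[name]))
```

Keeping all paths in one dictionary makes it harder to forget the `append` step for a newly added file.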

In [2]:
# Parameters

# We will define environment variables to hold the full paths to the files as we will use them in the shell commands
kgtk_environment_variables = []

# Folder where the database files are stored
data_path = "/nas/home/bohuizha/KG/hunger-for-knowledge/data/gvp"
os.environ['DATABASE'] = data_path
kgtk_environment_variables.append('DATABASE')

# Folder of ULAN
ulan_path = "/nas/home/bohuizha/KG/hunger-for-knowledge/data/gvp/ULAN"
os.environ['ULAN'] = ulan_path
kgtk_environment_variables.append('ULAN')

ulan_full_nt_path = "/nas/home/bohuizha/KG/hunger-for-knowledge/data/gvp/ULAN/ULANOut_Full.nt"
os.environ['ULAN_FULL_NT'] = ulan_full_nt_path
kgtk_environment_variables.append('ULAN_FULL_NT')

ulan_full_path = "/nas/home/bohuizha/KG/hunger-for-knowledge/data/gvp/ULAN/full.tsv"
os.environ['ULAN_FULL'] = ulan_full_path
kgtk_environment_variables.append('ULAN_FULL')

ulan_wikialign_path = "/nas/home/bohuizha/KG/hunger-for-knowledge/data/gvp/ULAN/wiki.align.tsv"
os.environ['ULAN_ALIGN'] = ulan_wikialign_path
kgtk_environment_variables.append('ULAN_ALIGN')

# Folder of TGN
tgn_path = "/nas/home/bohuizha/KG/hunger-for-knowledge/data/gvp/TGN"
os.environ['TGN'] = tgn_path
kgtk_environment_variables.append('TGN')

tgn_full_nt_path = "/nas/home/bohuizha/KG/hunger-for-knowledge/data/gvp/TGN/TGNOut_Full.nt"
os.environ['TGN_FULL_NT'] = tgn_full_nt_path
kgtk_environment_variables.append('TGN_FULL_NT')

tgn_full_path = "/nas/home/bohuizha/KG/hunger-for-knowledge/data/gvp/TGN/full.tsv"
os.environ['TGN_FULL'] = tgn_full_path
kgtk_environment_variables.append('TGN_FULL')

tgn_wikialign_path = "/nas/home/bohuizha/KG/hunger-for-knowledge/data/gvp/TGN/wiki.align.tsv"
os.environ['TGN_ALIGN'] = tgn_wikialign_path
kgtk_environment_variables.append('TGN_ALIGN')

# namespaces
namespaces_path = '/nas/home/bohuizha/KG/hunger-for-knowledge/data/gvp/namespaces.tsv'
os.environ['NAMESPACES'] = namespaces_path
kgtk_environment_variables.append('NAMESPACES')

# Wikidata
wikidata_path = '/nas/home/bohuizha/KG/hunger-for-knowledge/data/gvp/all.tsv'
os.environ['WIKIDATA'] = wikidata_path
kgtk_environment_variables.append('WIKIDATA')

new_wikidata_path = "/nas/home/bohuizha/KG/hunger-for-knowledge/data/new.all.tsv"
os.environ['NEW_WIKIDATA'] = new_wikidata_path
kgtk_environment_variables.append('NEW_WIKIDATA')

label_path = '/nas/home/bohuizha/KG/hunger-for-knowledge/data/gvp/labels.en.tsv'
os.environ['KGTK_LABEL_FILE'] = label_path
kgtk_environment_variables.append('KGTK_LABEL_FILE')

# sample
ulan_qnodes_path = '/nas/home/bohuizha/KG/hunger-for-knowledge/data/gvp/ulan-qnodes.tsv'
os.environ['ULAN_QNODES'] = ulan_qnodes_path
kgtk_environment_variables.append('ULAN_QNODES')

samples_path = '/nas/home/bohuizha/KG/hunger-for-knowledge/data/gvp/samples.tsv'
os.environ['SAMPLES'] = samples_path
kgtk_environment_variables.append('SAMPLES')

# Output
output_path = "/nas/home/bohuizha/KG/hunger-for-knowledge/data/gvp/output"
os.environ['OUTPUT'] = output_path
kgtk_environment_variables.append('OUTPUT')

output_names = {
    "wiki_birthdate": "wiki.birthdate.tsv",
    "wiki_birthplace": "wiki.birthplace.tsv",
    "wiki_deathdate": "wiki.deathdate.tsv",
    "wiki_results": "wiki.results.tsv",
    "unknown": "unknown.tsv",
    "birthyear": "birthyear.tsv",
    "birthplace": "birthplace.tsv",
    "deathyear": "deathyear.tsv",
    "getty_results": "getty.results.tsv",
    "match_birthyear": "match.birthyear.tsv",
    "match_deathyear": "match.deathyear.tsv",
    "match_birthplace": "match.birthplace.tsv",
    "new_birthyear": "new.birthyear.tsv",
    "new_deathyear": "new.deathyear.tsv",
    "new_birthplace": "new.birthplace.tsv",
    "new_results": "new.results.tsv",
    "birthyear_withid": "withid.birthyear.tsv",
    "deathyear_withid": "withid.deathyear.tsv",
    "birthplace_withid": "withid.birthplace.tsv"
}

for key, value in output_names.items():
    variable = key.upper()
    os.environ[variable] = os.path.join(output_path, value)
    kgtk_environment_variables.append(variable)

for variable in kgtk_environment_variables:
    print("{}: \"{}\"".format(variable, os.environ[variable]))

DATABASE: "/nas/home/bohuizha/KG/hunger-for-knowledge/data/gvp"
ULAN: "/nas/home/bohuizha/KG/hunger-for-knowledge/data/gvp/ULAN"
ULAN_FULL_NT: "/nas/home/bohuizha/KG/hunger-for-knowledge/data/gvp/ULAN/ULANOut_Full.nt"
ULAN_FULL: "/nas/home/bohuizha/KG/hunger-for-knowledge/data/gvp/ULAN/full.tsv"
ULAN_ALIGN: "/nas/home/bohuizha/KG/hunger-for-knowledge/data/gvp/ULAN/wiki.align.tsv"
TGN: "/nas/home/bohuizha/KG/hunger-for-knowledge/data/gvp/TGN"
TGN_FULL_NT: "/nas/home/bohuizha/KG/hunger-for-knowledge/data/gvp/TGN/TGNOut_Full.nt"
TGN_FULL: "/nas/home/bohuizha/KG/hunger-for-knowledge/data/gvp/TGN/full.tsv"
TGN_ALIGN: "/nas/home/bohuizha/KG/hunger-for-knowledge/data/gvp/TGN/wiki.align.tsv"
NAMESPACES: "/nas/home/bohuizha/KG/hunger-for-knowledge/data/gvp/namespaces.tsv"
WIKIDATA: "/nas/home/bohuizha/KG/hunger-for-knowledge/data/gvp/all.tsv"
NEW_WIKIDATA: "/nas/home/bohuizha/KG/hunger-for-knowledge/data/new.all.tsv"
KGTK_LABEL_FILE: "/nas/home/bohuizha/KG/hunger-for-knowledge/data/gvp/labels.e

## Import TGN & ULAN into `kgtk` graphs

Import ULAN

In [8]:
%%time
kgtk("""
    import-ntriples 
        -i $ULAN_FULL_NT 
        -o $ULAN_FULL 
        --namespace-file $NAMESPACES 
        --namespace-id-use-uuid True 
        --build-new-namespaces False 
        --output-only-used-namespaces True 
        --structured-value-label gvp:structured_value 
        --structured-uri-label gvp:structured_uri 
        --newnode-prefix node 
        --newnode-use-uuid True
    """)

kgtk import-ntriples     -i $ULAN_FULL_NT     -o $ULAN_FULL     --namespace-file $NAMESPACES     --namespace-id-use-uuid True     --build-new-namespaces False     --output-only-used-namespaces True     --structured-value-label gvp:structured_value     --structured-uri-label gvp:structured_uri     --newnode-prefix node     --newnode-use-uuid True
Code: 0, Runtime: 1251.43


Import TGN

In [9]:
%%time
kgtk("""
    import-ntriples 
        -i $TGN_FULL_NT 
        -o $TGN_FULL 
        --namespace-file $NAMESPACES 
        --namespace-id-use-uuid True 
        --build-new-namespaces False 
        --output-only-used-namespaces True 
        --structured-value-label gvp:structured_value 
        --structured-uri-label gvp:structured_uri 
        --newnode-prefix node 
        --newnode-use-uuid True
    """)

kgtk import-ntriples     -i $TGN_FULL_NT     -o $TGN_FULL     --namespace-file $NAMESPACES     --namespace-id-use-uuid True     --build-new-namespaces False     --output-only-used-namespaces True     --structured-value-label gvp:structured_value     --structured-uri-label gvp:structured_uri     --newnode-prefix node     --newnode-use-uuid True
Code: 0, Runtime: 5784.43


## Build Getty-Wikidata Alignment
Here we build our own alignment file instead of using the WikidataAlignment file provided by Getty. In our testing, WikidataAlignment contained far fewer useful pairs than our own alignment file. The alignment file links ULAN IDs and TGN IDs to Wikidata Qnodes.

### 1. ULAN

In [3]:
%%time
kgtk("""
    query -i $WIKIDATA $ULAN_FULL 
        --match 'a: (qnode)-[:P245]->(identifier), f: (ulanid)-[p]->(identifier)' 
        --where 'p.label = "dc:identifier"' 
        --return 'distinct ulanid as node1, "skos:exactMatch" as label, qnode as node2' 
        --limit 10
    / add-labels
    """)

CPU times: user 14.5 ms, sys: 27.3 ms, total: 41.8 ms
Wall time: 6.2 s


Unnamed: 0,node1,label,node2,node2;label
0,ulan:500224955,skos:exactMatch,Q100948,'Rachel Carson'@en
1,ulan:500281177,skos:exactMatch,Q101771,'Gottfried Gruben'@en
2,ulan:500001235,skos:exactMatch,Q101791,'Sep Ruf'@en
3,ulan:500256782,skos:exactMatch,Q102139,'Margrethe II of Denmark'@en
4,ulan:500302331,skos:exactMatch,Q1024362,'Spanish National Research Council'@en
5,ulan:500286871,skos:exactMatch,Q1024426,'University of South Carolina'@en
6,ulan:500114625,skos:exactMatch,Q102711,'Dennis Hopper'@en
7,ulan:500304375,skos:exactMatch,Q10288082,'Wildenstein & Company'@en
8,ulan:500355461,skos:exactMatch,Q103876,'Peter O\'Toole'@en
9,ulan:500221924,skos:exactMatch,Q1049334,'United States Army Corps of Engineers'@en


In [4]:
%%time
kgtk("""
    query -i $WIKIDATA $ULAN_FULL 
        --match 'a: (qnode)-[:P245]->(identifier), f: (ulanid)-[p]->(identifier)' 
        --where 'p.label = "dc:identifier"' 
        --return 'distinct ulanid as node1, "skos:exactMatch" as label, qnode as node2' 
        -o $ULAN_ALIGN
    """)

CPU times: user 3.46 ms, sys: 11.2 ms, total: 14.6 ms
Wall time: 2.32 s


Similarly, we convert `ulan-qnodes.tsv`, which has the form `(qnode)-[P245]->(identifier)`, into `samples.tsv`, which has the form `(ulanid)-[skos:exactMatch]->(qnode)`, so that it is easier to use in the following sections.

In [5]:
%%time
kgtk("""
    query -i $ULAN_QNODES $ULAN_FULL 
        --match 'u: (qnode)-[]->(identifier), 
                 f: (ulanid)-[p]->(identifier)' 
        --where 'p.label = "dc:identifier"' 
        --return 'distinct ulanid as node1, "skos:exactMatch" as label, qnode as node2' 
        -o $SAMPLES
    """)

CPU times: user 4.83 ms, sys: 9.56 ms, total: 14.4 ms
Wall time: 1.85 s


In [6]:
kgtk("""
    query -i $SAMPLES 
        --match '(ulanid)-[]->(qnode)' 
        --return 'count(distinct ulanid), count(distinct qnode)'
    """)

Unnamed: 0,"count(DISTINCT graph_588_c1.""node1"")","count(DISTINCT graph_588_c1.""node2"")"
0,432,430


After checking manually, we found that ULAN ID "500316131" from `ulan-qnodes.tsv` is not in Getty.

### 2. TGN
Since every place in Getty is a TGN place, we map the Getty TGN place nodes to Qnodes through their TGN identifiers. The mapping relationship is:

- Wikidata: `(Q1234567) - [P1667] -> ("1234567")`
- TGN full: `(tgn:1234567) - [dc:identifier] -> ("1234567")`
- TGN full: `(tgn:1234567) - [foaf:focus] -> (tgn:1234567-place)`
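The mapping above is a two-hop join: identifier to TGN ID, then TGN ID to place node. The sketch below reproduces that join in plain Python on toy edges, to make the logic explicit. The edges are made up, the function name `align_tgn` is hypothetical, and the sketch assumes at most one Qnode per identifier; the actual alignment is done by the `kgtk query` cells that follow.

```python
# Toy edges as (node1, label, node2); the identifiers are made up.
wikidata_edges = [
    ("Q1234567", "P1667", '"1234567"'),
    ("Q7654321", "P1667", '"7654321"'),
]
tgn_edges = [
    ("tgn:1234567", "dc:identifier", '"1234567"'),
    ("tgn:1234567", "foaf:focus", "tgn:1234567-place"),
    ("tgn:9999999", "dc:identifier", '"9999999"'),
    ("tgn:9999999", "foaf:focus", "tgn:9999999-place"),
]

def align_tgn(wikidata_edges, tgn_edges):
    """Return (tgnplace, "skos:exactMatch", qnode) edges, joined on the TGN identifier."""
    # Simplification: keeps a single Qnode per identifier.
    qnode_by_identifier = {n2: n1 for n1, label, n2 in wikidata_edges if label == "P1667"}
    place_by_tgnid = {n1: n2 for n1, label, n2 in tgn_edges if label == "foaf:focus"}
    aligned = []
    for tgnid, label, identifier in tgn_edges:
        if label != "dc:identifier":
            continue
        qnode = qnode_by_identifier.get(identifier)
        place = place_by_tgnid.get(tgnid)
        if qnode is not None and place is not None:
            aligned.append((place, "skos:exactMatch", qnode))
    return aligned

alignment = align_tgn(wikidata_edges, tgn_edges)
print(alignment)
```

Only identifiers present on both sides produce an alignment edge, which is why unmatched TGN or Wikidata entries silently drop out.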

In [7]:
%%time
kgtk("""
    query -i $WIKIDATA $TGN_FULL
        --match 'a: (qnode)-[:P1667]->(identifier), 
                 f: (tgnid)-[p1]->(identifier), 
                 f: (tgnid)-[p2]->(tgnplace)' 
        --where 'p1.label = "dc:identifier" AND p2.label = "foaf:focus"' 
        --return 'tgnplace as node1, "skos:exactMatch" as label, qnode as node2' 
        --limit 10
    / add-labels
    """)

CPU times: user 10.3 ms, sys: 11.8 ms, total: 22.2 ms
Wall time: 4.2 s


Unnamed: 0,node1,label,node2,node2;label
0,tgn:7013445-place,skos:exactMatch,Q100,'Boston'@en
1,tgn:1000164-place,skos:exactMatch,Q1000,'Gabon'@en
2,tgn:2116540-place,skos:exactMatch,Q1001828,'Port Townsend'@en
3,tgn:1000165-place,skos:exactMatch,Q1005,'The Gambia'@en
4,tgn:1000167-place,skos:exactMatch,Q1006,'Guinea'@en
5,tgn:1000183-place,skos:exactMatch,Q1007,'Guinea-Bissau'@en
6,tgn:1000168-place,skos:exactMatch,Q1008,'Ivory Coast'@en
7,tgn:1000153-place,skos:exactMatch,Q1009,'Cameroon'@en
8,tgn:7001632-place,skos:exactMatch,Q1011,'Cape Verde'@en
9,tgn:7001660-place,skos:exactMatch,Q1013,'Lesotho'@en


In [8]:
%%time
kgtk("""
    query -i $WIKIDATA $TGN_FULL
        --match 'a: (qnode)-[:P1667]->(identifier), 
                 f: (tgnid)-[p1]->(identifier), 
                 f: (tgnid)-[p2]->(tgnplace)' 
        --where 'p1.label = "dc:identifier" AND p2.label = "foaf:focus"' 
        --return 'tgnplace as node1, "skos:exactMatch" as label, qnode as node2' 
        -o $TGN_ALIGN
    """)

CPU times: user 4.37 ms, sys: 10.2 ms, total: 14.6 ms
Wall time: 2.46 s


## Query for Wikidata
We query Wikidata for the dates of birth, dates of death, and places of birth of the sample records. For each field, we first show a preview of the query results (10 rows), then run the full query, and finally count the results.
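Because each field is queried three times with nearly identical text (preview, full run, count), the command strings can be assembled from shared parts. The following is a minimal sketch; `build_query` is a hypothetical helper, not a `kgtk` function, and it only builds strings from options that already appear in this notebook (`-i`, `--match`, `--where`, `--return`, `--limit`, `-o`, `/ add-labels`).

```python
def build_query(inputs, match, returns, where=None, limit=None,
                output=None, add_labels=False):
    """Assemble a `kgtk query` command string from its parts."""
    parts = ["query -i " + " ".join(inputs), "--match '{}'".format(match)]
    if where:
        parts.append("--where '{}'".format(where))
    parts.append("--return '{}'".format(returns))
    if limit is not None:
        parts.append("--limit {}".format(limit))
    if output:
        parts.append("-o " + output)
    command = " ".join(parts)
    if add_labels:
        command += " / add-labels"
    return command

match = "s: (ulanid)-[]->(qnode), a: (qnode)-[p:P569]->(birthdate)"
returns = "qnode as node1, p.label as label, birthdate as node2"
inputs = ["$SAMPLES", "$WIKIDATA"]

preview = build_query(inputs, match, returns, limit=10, add_labels=True)
full = build_query(inputs, match, returns, output="$WIKI_BIRTHDATE")
count = build_query(["$WIKI_BIRTHDATE"], "(qnode)-[]->()", "count(distinct qnode)")
print(preview)
```

Each resulting string could then be passed to `kgtk(...)` in the same way as the hand-written commands in the cells.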

### 1. Date of birth

In [9]:
%%time
kgtk("""
    query -i $SAMPLES $WIKIDATA 
        --match 's: (ulanid)-[]->(qnode), 
                 a: (qnode)-[p:P569]->(birthdate)' 
        --return 'qnode as node1, p.label as label, birthdate as node2' 
        --limit 10
    / add-labels
    """)

CPU times: user 12.5 ms, sys: 13 ms, total: 25.5 ms
Wall time: 4.87 s


Unnamed: 0,node1,label,node2,node1;label,label;label
0,Q100948,P569,^1907-05-27T00:00:00Z/11,'Rachel Carson'@en,'date of birth'@en
1,Q101771,P569,^1929-06-21T00:00:00Z/11,'Gottfried Gruben'@en,'date of birth'@en
2,Q101791,P569,^1908-03-09T00:00:00Z/11,'Sep Ruf'@en,'date of birth'@en
3,Q102139,P569,^1940-04-16T00:00:00Z/11,'Margrethe II of Denmark'@en,'date of birth'@en
4,Q102711,P569,^1936-05-17T00:00:00Z/11,'Dennis Hopper'@en,'date of birth'@en
5,Q103876,P569,^1932-08-02T00:00:00Z/11,'Peter O\'Toole'@en,'date of birth'@en
6,Q1066442,P569,^1925-10-31T00:00:00Z/11,'Charles Moore'@en,'date of birth'@en
7,Q106775,P569,^1930-10-02T00:00:00Z/11,'Richard Harris'@en,'date of birth'@en
8,Q106775,P569,^1930-10-01T00:00:00Z/11,'Richard Harris'@en,'date of birth'@en
9,Q1124,P569,^1946-08-19T00:00:00Z/11,'Bill Clinton'@en,'date of birth'@en


In [10]:
%%time
kgtk("""
    query -i $SAMPLES $WIKIDATA 
        --match 's: (ulanid)-[]->(qnode), 
                 a: (qnode)-[p:P569]->(birthdate)' 
        --return 'qnode as node1, p.label as label, birthdate as node2' 
        -o $WIKI_BIRTHDATE
    """)

CPU times: user 6.05 ms, sys: 11.5 ms, total: 17.6 ms
Wall time: 2.18 s


In [11]:
kgtk("""
    query -i $WIKI_BIRTHDATE
        --match '(qnode)-[]->()' 
        --return 'count(distinct qnode)'
    """)

Unnamed: 0,"count(DISTINCT graph_617_c1.""node1"")"
0,239


### 2. Date of death

In [12]:
%%time
kgtk("""
    query -i $SAMPLES $WIKIDATA 
        --match 's: (ulanid)-[]->(qnode), 
                 a: (qnode)-[p:P570]->(deathdate)' 
        --return 'qnode as node1, p.label as label, deathdate as node2' 
        --limit 10
    / add-labels
    """)

CPU times: user 9.01 ms, sys: 14.3 ms, total: 23.4 ms
Wall time: 4.67 s


Unnamed: 0,node1,label,node2,node1;label,label;label
0,Q100948,P570,^1964-04-14T00:00:00Z/11,'Rachel Carson'@en,'date of death'@en
1,Q101771,P570,^2003-11-24T00:00:00Z/11,'Gottfried Gruben'@en,'date of death'@en
2,Q101771,P570,^2003-01-01T00:00:00Z/9,'Gottfried Gruben'@en,'date of death'@en
3,Q101791,P570,^1982-07-29T00:00:00Z/11,'Sep Ruf'@en,'date of death'@en
4,Q102711,P570,^2010-05-29T00:00:00Z/11,'Dennis Hopper'@en,'date of death'@en
5,Q103876,P570,^2013-12-14T00:00:00Z/11,'Peter O\'Toole'@en,'date of death'@en
6,Q1066442,P570,^1993-12-16T00:00:00Z/11,'Charles Moore'@en,'date of death'@en
7,Q106775,P570,^2002-10-25T00:00:00Z/11,'Richard Harris'@en,'date of death'@en
8,Q1132047,P570,^1854-04-07T00:00:00Z/11,'William Strickland'@en,'date of death'@en
9,Q1132047,P570,^1854-01-01T00:00:00Z/9,'William Strickland'@en,'date of death'@en


In [13]:
%%time
kgtk("""
    query -i $SAMPLES $WIKIDATA 
        --match 's: (ulanid)-[]->(qnode), 
                 a: (qnode)-[p:P570]->(deathdate)' 
        --return 'qnode as node1, p.label as label, deathdate as node2' 
        -o $WIKI_DEATHDATE
    """)

CPU times: user 3.77 ms, sys: 12.8 ms, total: 16.6 ms
Wall time: 2.08 s


In [14]:
kgtk("""
    query -i $WIKI_DEATHDATE
        --match '(qnode)-[]->()' 
        --return 'count(distinct qnode)'
    """)

Unnamed: 0,"count(DISTINCT graph_620_c1.""node1"")"
0,198


### 3. Place of birth

In [16]:
%%time
kgtk("""
    query -i $SAMPLES $WIKIDATA 
        --match 's: (ulanid)-[]->(qnode), 
                 a: (qnode)-[p:P19]->(birthplace)' 
        --return 'qnode as node1, p.label as label, birthplace as node2' 
        --limit 10 
    / add-labels
    """)

CPU times: user 11.5 ms, sys: 13.4 ms, total: 24.9 ms
Wall time: 4.35 s


Unnamed: 0,node1,label,node2,node1;label,label;label,node2;label
0,Q101771,P19,Q1449,'Gottfried Gruben'@en,'place of birth'@en,'Genoa'@en
1,Q101791,P19,Q1726,'Sep Ruf'@en,'place of birth'@en,'Munich'@en
2,Q103876,P19,Q39121,'Peter O\'Toole'@en,'place of birth'@en,'Leeds'@en
3,Q106775,P19,Q133315,'Richard Harris'@en,'place of birth'@en,'Limerick'@en
4,Q1124,P19,Q80008,'Bill Clinton'@en,'place of birth'@en,'Hope'@en
5,Q11613,P19,Q572172,'Harry S. Truman'@en,'place of birth'@en,'Lamar'@en
6,Q11673,P19,Q18424,'Andrew Cuomo'@en,'place of birth'@en,'Queens'@en
7,Q11806,P19,Q16101,'John Adams'@en,'place of birth'@en,'Braintree'@en
8,Q11812,P19,Q4179352,'Thomas Jefferson'@en,'place of birth'@en,'Shadwell'@en
9,Q11816,P19,Q16101,'John Quincy Adams'@en,'place of birth'@en,'Braintree'@en


In [17]:
%%time
kgtk("""
    query -i $SAMPLES $WIKIDATA 
        --match 's: (ulanid)-[]->(qnode), 
                 a: (qnode)-[p:P19]->(birthplace)' 
        --return 'qnode as node1, p.label as label, birthplace as node2' 
        -o $WIKI_BIRTHPLACE
    """)

CPU times: user 3.98 ms, sys: 10.2 ms, total: 14.2 ms
Wall time: 2.14 s


In [18]:
kgtk("""
    query -i $WIKI_BIRTHPLACE
        --match '(qnode)-[]->()' 
        --return 'count(distinct qnode)'
    """)

Unnamed: 0,"count(DISTINCT graph_622_c1.""node1"")"
0,142


## Query for Getty

### 1. Date of birth
Dates of birth in Getty are given only as years, so we query the year of birth and convert it into a date literal with year precision. The same applies to dates of death.
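The conversion used in the queries is the `printf("^%s-01-01T00:00:00Z/9", ...)` expression: the year is placed on January 1 and tagged with precision `9`, which denotes year precision in Wikidata (the full dates from Wikidata shown earlier use `11`, day precision). A minimal Python sketch of the same formatting follows. It assumes the Getty value arrives as a quoted string such as `"1907"`, and `unstringify` here is a simplified stand-in for `kgtk_unstringify`, not the real function.

```python
def unstringify(value):
    """Simplified stand-in for kgtk_unstringify: strip surrounding double quotes."""
    if len(value) >= 2 and value[0] == '"' and value[-1] == '"':
        return value[1:-1]
    return value

def year_to_date_literal(year_string):
    """Format a quoted year as a KGTK date on January 1 with year precision (/9)."""
    return "^{}-01-01T00:00:00Z/9".format(unstringify(year_string))

print(year_to_date_literal('"1907"'))  # prints ^1907-01-01T00:00:00Z/9
```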

In [19]:
%%time
kgtk("""
    query -i $SAMPLES $ULAN_FULL
        --match 's: (ulanid)-[]->(qnode), 
                 f: (ulanid)-[p0]->(ulanagent), 
                 f: (ulanagent)-[p1]->()-[p2]->()-[p3]->(datevalue)' 
        --where 'p0.label = "foaf:focus" AND p1.label = "gvp:biographyPreferred" AND p2.label = "gvp:estStart" AND p3.label = "gvp:structured_value"' 
        --return 'distinct qnode as node1, "P569" as label, printf("^%s-01-01T00:00:00Z/9", kgtk_unstringify(datevalue)) as node2' 
        --limit 10
    / add-labels
    """)

CPU times: user 11 ms, sys: 12.1 ms, total: 23 ms
Wall time: 4.15 s


Unnamed: 0,node1,label,node2,node1;label,label;label
0,Q100948,P569,^1907-01-01T00:00:00Z/9,'Rachel Carson'@en,'date of birth'@en
1,Q101771,P569,^1929-01-01T00:00:00Z/9,'Gottfried Gruben'@en,'date of birth'@en
2,Q101791,P569,^1908-01-01T00:00:00Z/9,'Sep Ruf'@en,'date of birth'@en
3,Q102139,P569,^1940-01-01T00:00:00Z/9,'Margrethe II of Denmark'@en,'date of birth'@en
4,Q1024362,P569,^1800-01-01T00:00:00Z/9,'Spanish National Research Council'@en,'date of birth'@en
5,Q1024426,P569,^1850-01-01T00:00:00Z/9,'University of South Carolina'@en,'date of birth'@en
6,Q102711,P569,^1936-01-01T00:00:00Z/9,'Dennis Hopper'@en,'date of birth'@en
7,Q10288082,P569,^1875-01-01T00:00:00Z/9,'Wildenstein & Company'@en,'date of birth'@en
8,Q103876,P569,^1932-01-01T00:00:00Z/9,'Peter O\'Toole'@en,'date of birth'@en
9,Q1065,P569,^1945-01-01T00:00:00Z/9,'United Nations'@en,'date of birth'@en


In [20]:
%%time
kgtk("""
    query -i $SAMPLES $ULAN_FULL
        --match 's: (ulanid)-[]->(qnode), 
                 f: (ulanid)-[p0]->(ulanagent), 
                 f: (ulanagent)-[p1]->()-[p2]->()-[p3]->(datevalue)' 
        --where 'p0.label = "foaf:focus" AND p1.label = "gvp:biographyPreferred" AND p2.label = "gvp:estStart" AND p3.label = "gvp:structured_value"' 
        --return 'distinct qnode as node1, "P569" as label, printf("^%s-01-01T00:00:00Z/9", kgtk_unstringify(datevalue)) as node2' 
        -o $BIRTHYEAR
    """)

CPU times: user 4.43 ms, sys: 12 ms, total: 16.4 ms
Wall time: 2 s


Validate the Getty results:

In [21]:
!kgtk validate -i $BIRTHYEAR --ignore-minimum-year


Data lines read: 432
Data lines passed: 432


Count the results:

In [22]:
kgtk("""
    query -i $BIRTHYEAR 
        --match '(qnode)-[]->()' 
        --return 'count(distinct qnode)'
    """)

Unnamed: 0,"count(DISTINCT graph_596_c1.""node1"")"
0,430


Here we count how many new dates of birth we found in Getty:

In [23]:
%%time
kgtk("""
    ifnotexists -i $BIRTHYEAR 
        --filter-on $WIKI_BIRTHDATE 
        --input-keys node1 
        --filter-keys node1 
        -o $NEW_BIRTHYEAR
    """)

CPU times: user 5.21 ms, sys: 11.5 ms, total: 16.7 ms
Wall time: 2.16 s


In [24]:
kgtk("""
    query -i $NEW_BIRTHYEAR 
        --match '(qnode)-[]->()' 
        --return 'count(distinct qnode)'
    """)

Unnamed: 0,"count(DISTINCT graph_610_c1.""node1"")"
0,191


**There are 191 newly found dates of birth in Getty!**
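The `ifnotexists` and `ifexists` commands with `--input-keys node1 --filter-keys node1` act as an anti-join and a semi-join on `node1`: the first keeps Getty edges whose Qnode has no corresponding edge in the Wikidata file, and the second keeps those whose Qnode does. A minimal sketch of that split on toy data; the function `split_by_key` and the Qnodes `Q1` and `Q2` are hypothetical.

```python
def split_by_key(candidate_edges, reference_edges):
    """Split candidate edges into (new, overlapping) depending on whether
    their node1 also appears as node1 in the reference edges."""
    reference_keys = {node1 for node1, _label, _node2 in reference_edges}
    new = [edge for edge in candidate_edges if edge[0] not in reference_keys]
    overlapping = [edge for edge in candidate_edges if edge[0] in reference_keys]
    return new, overlapping

# Toy data with made-up Qnodes.
getty = [
    ("Q1", "P569", "^1900-01-01T00:00:00Z/9"),
    ("Q2", "P569", "^1910-01-01T00:00:00Z/9"),
]
wikidata = [("Q1", "P569", "^1900-03-04T00:00:00Z/11")]

new_edges, overlapping_edges = split_by_key(getty, wikidata)
print(len(new_edges), len(overlapping_edges))
```

Note that the split looks only at the key, not at the value, so an overlapping edge may still disagree with Wikidata.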

Check whether the values found in Getty match those in Wikidata:

In [25]:
%%time
kgtk("""
    ifexists -i $BIRTHYEAR 
        --filter-on $WIKI_BIRTHDATE 
        --input-keys node1 
        --filter-keys node1 
        -o $MATCH_BIRTHYEAR
    """)

CPU times: user 3.95 ms, sys: 13.1 ms, total: 17 ms
Wall time: 2.11 s


In [26]:
kgtk("""
    query -i $MATCH_BIRTHYEAR 
        --match '(qnode)-[]->()' 
        --return 'count(distinct qnode)'
    """)

Unnamed: 0,"count(DISTINCT graph_618_c1.""node1"")"
0,239


In [27]:
kgtk("""
    query -i $BIRTHYEAR $WIKI_BIRTHDATE 
        --match 'b: (qnode)-[p]->(v1), w: (qnode)-[]->(v2)' 
        --where 'kgtk_date_year(v1) = kgtk_date_year(v2)' 
        --return 'count(distinct qnode)'
    """)

Unnamed: 0,"count(DISTINCT graph_596_c1.""node1"")"
0,224


224 of the 239 Qnodes that have a date of birth in both Wikidata and Getty agree on the birth year.
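The comparison `kgtk_date_year(v1) = kgtk_date_year(v2)` ignores month, day, and precision, and a Qnode is counted once if any of its Getty values agrees with any of its Wikidata values. A minimal sketch of that logic on toy data; `date_year` is a simplified stand-in for `kgtk_date_year`, and `agreeing_qnodes` and the Qnodes are hypothetical.

```python
import re

DATE_YEAR = re.compile(r"^\^([+-]?\d+)-")

def date_year(literal):
    """Return the year of a KGTK date literal such as ^1907-05-27T00:00:00Z/11, or None."""
    found = DATE_YEAR.match(literal)
    return int(found.group(1)) if found else None

def agreeing_qnodes(getty_edges, wikidata_edges):
    """Qnodes for which at least one Getty year equals at least one Wikidata year."""
    wikidata_years = {}
    for qnode, _label, value in wikidata_edges:
        year = date_year(value)
        if year is not None:
            wikidata_years.setdefault(qnode, set()).add(year)
    return {
        qnode
        for qnode, _label, value in getty_edges
        if date_year(value) in wikidata_years.get(qnode, set())
    }

# Toy data with made-up Qnodes: Q1 agrees on the year, Q2 does not.
getty = [("Q1", "P569", "^1900-01-01T00:00:00Z/9"),
         ("Q2", "P569", "^1910-01-01T00:00:00Z/9")]
wikidata = [("Q1", "P569", "^1900-03-04T00:00:00Z/11"),
            ("Q2", "P569", "^1911-06-01T00:00:00Z/11")]
agreeing = agreeing_qnodes(getty, wikidata)
print(sorted(agreeing))
```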

### 2. Date of death
Some dates of death in Getty lie in the future, so we filter those items out (the queries below keep only years up to 2021).
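The queries hard-code the cutoff as `<= 2021`. As a hedged alternative, the sketch below applies the same filter with the cutoff as a parameter that defaults to the current year. The function name and the sample years are hypothetical.

```python
from datetime import date

def plausible_death_years(year_strings, max_year=None):
    """Keep only years that are not later than max_year (default: the current year)."""
    if max_year is None:
        max_year = date.today().year
    return [year for year in year_strings if int(year) <= max_year]

# Made-up years; 2090 stands in for a future estimate.
kept = plausible_death_years(["1964", "2003", "2090"], max_year=2021)
print(kept)  # prints ['1964', '2003']
```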

In [28]:
%%time
kgtk("""
    query -i $SAMPLES $ULAN_FULL 
        --match 's: (ulanid)-[]->(qnode), 
                 f: (ulanid)-[p0]->(ulanagent), 
                 f: (ulanagent)-[p1]->()-[p2]->()-[p3]->(datevalue)' 
        --where 'p0.label = "foaf:focus" AND p1.label = "gvp:biographyPreferred" AND p2.label = "gvp:estEnd" AND p3.label = "gvp:structured_value" AND cast(kgtk_unstringify(datevalue), int) <= 2021' 
        --return 'distinct qnode as node1, "P570" as label, printf("^%s-01-01T00:00:00Z/9", kgtk_unstringify(datevalue)) as node2' 
        --limit 10
    / add-labels
    """)

CPU times: user 12.2 ms, sys: 12.4 ms, total: 24.6 ms
Wall time: 4.63 s


Unnamed: 0,node1,label,node2,node1;label,label;label
0,Q100948,P570,^1964-01-01T00:00:00Z/9,'Rachel Carson'@en,'date of death'@en
1,Q101771,P570,^2003-01-01T00:00:00Z/9,'Gottfried Gruben'@en,'date of death'@en
2,Q101791,P570,^1982-01-01T00:00:00Z/9,'Sep Ruf'@en,'date of death'@en
3,Q1066442,P570,^1993-01-01T00:00:00Z/9,'Charles Moore'@en,'date of death'@en
4,Q1132047,P570,^1854-01-01T00:00:00Z/9,'William Strickland'@en,'date of death'@en
5,Q11613,P570,^1972-01-01T00:00:00Z/9,'Harry S. Truman'@en,'date of death'@en
6,Q11806,P570,^1826-01-01T00:00:00Z/9,'John Adams'@en,'date of death'@en
7,Q11812,P570,^1826-01-01T00:00:00Z/9,'Thomas Jefferson'@en,'date of death'@en
8,Q11816,P570,^1848-01-01T00:00:00Z/9,'John Quincy Adams'@en,'date of death'@en
9,Q11817,P570,^1845-01-01T00:00:00Z/9,'Andrew Jackson'@en,'date of death'@en


In [29]:
%%time
kgtk("""
    query -i $SAMPLES $ULAN_FULL 
        --match 's: (ulanid)-[]->(qnode), 
                 f: (ulanid)-[p0]->(ulanagent), 
                 f: (ulanagent)-[p1]->()-[p2]->()-[p3]->(datevalue)' 
        --where 'p0.label = "foaf:focus" AND p1.label = "gvp:biographyPreferred" AND p2.label = "gvp:estEnd" AND p3.label = "gvp:structured_value" AND cast(kgtk_unstringify(datevalue), int) <= 2021' 
        --return 'distinct qnode as node1, "P570" as label, printf("^%s-01-01T00:00:00Z/9", kgtk_unstringify(datevalue)) as node2' 
        -o $DEATHYEAR
    """)

CPU times: user 5.59 ms, sys: 9.21 ms, total: 14.8 ms
Wall time: 2.37 s


Perform validation:

In [30]:
!kgtk validate -i $DEATHYEAR --ignore-minimum-year


Data lines read: 193
Data lines passed: 193


Count:

In [31]:
kgtk("""
    query -i $DEATHYEAR 
        --match '(qnode)-[]->()' 
        --return 'count(distinct qnode)'
    """)

Unnamed: 0,"count(DISTINCT graph_597_c1.""node1"")"
0,193


Here we count how many new dates of death we found in Getty:

In [32]:
%%time
kgtk(""" 
    ifnotexists -i $DEATHYEAR 
        --filter-on $WIKI_DEATHDATE 
        --input-keys node1 
        --filter-keys node1 
        -o $NEW_DEATHYEAR
    """)

CPU times: user 7.61 ms, sys: 10.2 ms, total: 17.8 ms
Wall time: 2.45 s


In [33]:
kgtk("""
    query -i $NEW_DEATHYEAR 
        --match '(qnode)-[]->()' 
        --return 'count(distinct qnode)'
    """)

Unnamed: 0,"count(DISTINCT graph_611_c1.""node1"")"
0,10


**There are 10 newly found dates of death in Getty!**

Check whether the values found in Getty match those in Wikidata:

In [34]:
kgtk(""" 
    ifexists -i $DEATHYEAR 
        --filter-on $WIKI_DEATHDATE 
        --input-keys node1 
        --filter-keys node1 
        -o $MATCH_DEATHYEAR
    """)

In [35]:
kgtk("""
    query -i $MATCH_DEATHYEAR 
        --match '(qnode)-[]->()' 
        --return 'count(distinct qnode)'
    """)

Unnamed: 0,"count(DISTINCT graph_619_c1.""node1"")"
0,183


In [36]:
kgtk("""
    query -i $DEATHYEAR $WIKI_DEATHDATE 
        --match 'd: (qnode)-[p]->(v1), w: (qnode)-[]->(v2)' 
        --where 'kgtk_date_year(v1) = kgtk_date_year(v2)' 
        --return 'count(distinct qnode)'
    """)

Unnamed: 0,"count(DISTINCT graph_597_c1.""node1"")"
0,176


176 of the 183 Qnodes that have a date of death in both Wikidata and Getty agree on the death year.

### 3. Place of birth

In [37]:
%%time
kgtk("""
    query -i $SAMPLES $ULAN_FULL $TGN_ALIGN 
        --match 's: (ulanid)-[]->(qnode), 
                 f: (ulanid)-[p0]->(ulanagent), 
                 f: (ulanagent)-[p1]->(ulanbio)-[p2]->(tgnplace), 
                 w: (tgnplace)-[]->(tgnqnode)'
        --where 'p0.label = "foaf:focus" AND p1.label = "gvp:biographyPreferred" 
                 AND p2.label = "schema:birthPlace"' 
        --return 'distinct qnode as node1, "P19" as label, tgnqnode as node2' 
        --limit 10
    / add-labels
    """)

CPU times: user 14.4 ms, sys: 11.5 ms, total: 25.8 ms
Wall time: 4.77 s


Unnamed: 0,node1,label,node2,node1;label,label;label,node2;label
0,Q101791,P19,Q1726,'Sep Ruf'@en,'place of birth'@en,'Munich'@en
1,Q102139,P19,Q1748,'Margrethe II of Denmark'@en,'place of birth'@en,'Copenhagen'@en
2,Q122553,P19,Q84,'Charles II of England'@en,'place of birth'@en,'London'@en
3,Q1273122,P19,Q84,'Sam Taylor-Johnson'@en,'place of birth'@en,'London'@en
4,Q12976,P19,Q239,'Baudouin I of Belgium'@en,'place of birth'@en,'Brussels'@en
5,Q1351247,P19,Q60,'Karl Struss'@en,'place of birth'@en,'New York City'@en
6,Q1370307,P19,Q1297,'Sam Wanamaker'@en,'place of birth'@en,'Chicago'@en
7,Q1405,P19,Q220,'Augustus'@en,'place of birth'@en,'Rome'@en
8,Q1430,P19,Q220,'Marcus Aurelius'@en,'place of birth'@en,'Rome'@en
9,Q150966,P19,Q1735,"'Frederick III, Holy Roman Emperor'@en",'place of birth'@en,'Innsbruck'@en


In [38]:
%%time
kgtk("""
    query -i $SAMPLES $ULAN_FULL $TGN_ALIGN 
        --match 's: (ulanid)-[]->(qnode), 
                 f: (ulanid)-[p0]->(ulanagent), 
                 f: (ulanagent)-[p1]->(ulanbio)-[p2]->(tgnplace), 
                 w: (tgnplace)-[]->(tgnqnode)'
        --where 'p0.label = "foaf:focus" AND p1.label = "gvp:biographyPreferred" 
                 AND p2.label = "schema:birthPlace"' 
        --return 'distinct qnode as node1, "P19" as label, tgnqnode as node2' 
        -o $BIRTHPLACE
    """)

CPU times: user 5.92 ms, sys: 8.81 ms, total: 14.7 ms
Wall time: 1.9 s


Count:

In [39]:
kgtk("""
    query -i $BIRTHPLACE 
        --match '(qnode)-[]->()' 
        --return 'count(distinct qnode)'
    """)

Unnamed: 0,"count(DISTINCT graph_595_c1.""node1"")"
0,63


Here we count how many new places of birth we found in Getty:

In [40]:
%%time
kgtk("""
    ifnotexists -i $BIRTHPLACE 
        --filter-on $WIKI_BIRTHPLACE 
        --input-keys node1 
        --filter-keys node1 
        -o $NEW_BIRTHPLACE
    """)

CPU times: user 4.52 ms, sys: 11.1 ms, total: 15.6 ms
Wall time: 2.32 s


In [41]:
kgtk("""
    query -i $NEW_BIRTHPLACE 
        --match '(qnode)-[]->()' 
        --return 'count(distinct qnode)'
    """)

Unnamed: 0,"count(DISTINCT graph_612_c1.""node1"")"
0,14


**There are 14 newly found places of birth in Getty!**

Check whether the values found in Getty match those in Wikidata:

In [45]:
%%time
kgtk("""
    ifexists -i $BIRTHPLACE 
        --filter-on $WIKI_BIRTHPLACE 
        --input-keys node1 
        --filter-keys node1 
        -o $MATCH_BIRTHPLACE
    """)

CPU times: user 9.35 ms, sys: 7.63 ms, total: 17 ms
Wall time: 2.31 s


In [46]:
kgtk("""
    query -i $MATCH_BIRTHPLACE 
        --match '(qnode)-[]->()' 
        --return 'count(distinct qnode)'
    """)

Unnamed: 0,"count(DISTINCT graph_621_c1.""node1"")"
0,49


In [48]:
kgtk("""
    query -i $BIRTHPLACE $WIKI_BIRTHPLACE 
        --match 'b: (qnode)-[p]->(v), w: (qnode)-[]->(v)' 
        --return 'count(distinct qnode)'
    """)

Unnamed: 0,"count(DISTINCT graph_595_c1.""node1"")"
0,47


47 of the 49 Qnodes that have a place of birth in both Wikidata and Getty agree on the place.
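The counts reported in this section can be summarized with a little arithmetic: the number of new facts is the Getty count minus the overlap with Wikidata, and the agreement rate is the number of matching values divided by the overlap. The sketch below uses only the distinct-Qnode counts printed by the queries above; the dictionary layout is an illustrative choice.

```python
# Counts of distinct Qnodes reported by the queries in this notebook.
results = {
    "date of birth":  {"getty": 430, "overlap": 239, "same_value": 224},
    "date of death":  {"getty": 193, "overlap": 183, "same_value": 176},
    "place of birth": {"getty": 63,  "overlap": 49,  "same_value": 47},
}

summary = {}
for field, counts in results.items():
    new = counts["getty"] - counts["overlap"]             # found only in Getty
    agreement = counts["same_value"] / counts["overlap"]  # share of the overlap that agrees
    summary[field] = (new, agreement)
    print("{}: {} new, {:.1%} agreement on {} shared".format(
        field, new, agreement, counts["overlap"]))
```

The computed new counts reproduce the 191, 10, and 14 reported by the `ifnotexists` steps.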

## Append the newly found years and places to the Arnold graph

Add edge IDs:

In [49]:
%%time
kgtk("""add-id -i $NEW_BIRTHYEAR --id-style wikidata -o $BIRTHYEAR_WITHID""")

In [55]:
%%time
kgtk("""add-id -i $NEW_DEATHYEAR --id-style wikidata -o $DEATHYEAR_WITHID""")

In [56]:
%%time
kgtk("""add-id -i $NEW_BIRTHPLACE --id-style wikidata -o $BIRTHPLACE_WITHID""")

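Concatenate with the Arnold graph. The `ARNOLD` and `NEW_ARNOLD` environment variables are not defined in the setup cell above, so they must be set before running this cell: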

In [57]:
%%time
kgtk("""cat -i $ARNOLD $BIRTHYEAR_WITHID $DEATHYEAR_WITHID $BIRTHPLACE_WITHID -o $NEW_ARNOLD""")