# Getty Query Tutorial (Nationality)
In this notebook, we apply the same hunger-for-knowledge procedure to nationality in Wikidata and the Getty Vocabularies. The main difference is that we generalize the query from a single property to a property chain, that is, a path in KGTK.

In [1]:
import os
import json
import pandas as pd

from kgtk.functions import kgtk, kypher

## Step 0: Set up environment paths
Here we store the paths of the required graph files and the desired output files in environment variables.

In [3]:
# We will define environment variables to hold the full paths to the files as we will use them in the shell commands
kgtk_environment_variables = []

# Folder where the database files are stored
data_path = "/nas/home/bohuizha/KG/hunger-for-knowledge/data/"
os.environ['DATABASE'] = data_path
kgtk_environment_variables.append('DATABASE')

# Wikidata (all is much less than claims)
os.environ['WIKIDATA'] = data_path + "claims.tsv"
kgtk_environment_variables.append('WIKIDATA')

# Label file of Wikidata
os.environ['KGTK_LABEL_FILE'] = data_path + "labels.en.tsv"
kgtk_environment_variables.append('KGTK_LABEL_FILE')

# P31
os.environ['P31'] = data_path + "P31.tsv"
kgtk_environment_variables.append('P31')

# P279star
os.environ['P279STAR'] = data_path + "P279star.tsv"
kgtk_environment_variables.append('P279STAR')

# Folder of ULAN
ulan_path = data_path + "gvp/ULAN/"
os.environ['ULAN'] = ulan_path
kgtk_environment_variables.append('ULAN')

# Concatenation of the ULAN files used here: ULAN Subjects, AgentMap, Nationality, AAT-Wikidata Alignment
# The suffix "id" means we ran the add-id operation on the graph, which the paths command requires.
ulan_concat_path = ulan_path + "ulan.nation.concat.id.tsv"
os.environ['ULAN_NATION_CONCAT_ID'] = ulan_concat_path
kgtk_environment_variables.append('ULAN_NATION_CONCAT_ID')

# ULAN-Wikidata alignment file, maps ULAN ID to Wikidata Qnode
ulan_wikialign_path = ulan_path + "wiki.align.tsv"
os.environ['ULAN_ALIGN'] = ulan_wikialign_path
kgtk_environment_variables.append('ULAN_ALIGN')

# Output
output_path = data_path + "gvp/nation_output/"
# Create the output folder (and any missing parent folders) if it does not exist yet
os.makedirs(output_path, exist_ok=True)
os.environ['OUTPUT'] = output_path
kgtk_environment_variables.append('OUTPUT')

# Each file will be explained later in the procedure
output_names = {
    "wiki_nation": "wiki.nation.tsv",
    "wiki_unknown": "wiki.unknown.tsv",
    "pairs": "pairs.tsv",
    "paths": "paths.tsv",
    "paths_label": "paths.label.tsv",
    "getty_nation": "getty.nation.tsv",
    "getty_mapped": "getty.mapped.tsv",
    "getty_agree": "getty.agree.tsv",
    "correct_temp_1": "correct.temp.1.tsv",
    "correct_temp_2": "correct.temp.2.tsv",
    "incorrect_temp": "incorrect.temp.tsv",
    "correct": "correct.tsv",
    "incorrect": "incorrect.tsv",
    "unknown": "unknown.tsv"
}

for key, value in output_names.items():
    variable = key.upper()
    os.environ[variable] = os.path.join(output_path, value)
    kgtk_environment_variables.append(variable)

for variable in kgtk_environment_variables:
    print("{}: \"{}\"".format(variable, os.environ[variable]))

DATABASE: "/nas/home/bohuizha/KG/hunger-for-knowledge/data/"
WIKIDATA: "/nas/home/bohuizha/KG/hunger-for-knowledge/data/claims.tsv"
KGTK_LABEL_FILE: "/nas/home/bohuizha/KG/hunger-for-knowledge/data/labels.en.tsv"
P31: "/nas/home/bohuizha/KG/hunger-for-knowledge/data/P31.tsv"
P279STAR: "/nas/home/bohuizha/KG/hunger-for-knowledge/data/P279star.tsv"
ULAN: "/nas/home/bohuizha/KG/hunger-for-knowledge/data/gvp/ULAN/"
ULAN_NATION_CONCAT_ID: "/nas/home/bohuizha/KG/hunger-for-knowledge/data/gvp/ULAN/ulan.nation.concat.id.tsv"
ULAN_ALIGN: "/nas/home/bohuizha/KG/hunger-for-knowledge/data/gvp/ULAN/wiki.align.tsv"
OUTPUT: "/nas/home/bohuizha/KG/hunger-for-knowledge/data/gvp/nation_output/"
WIKI_NATION: "/nas/home/bohuizha/KG/hunger-for-knowledge/data/gvp/nation_output/wiki.nation.tsv"
WIKI_UNKNOWN: "/nas/home/bohuizha/KG/hunger-for-knowledge/data/gvp/nation_output/wiki.unknown.tsv"
PAIRS: "/nas/home/bohuizha/KG/hunger-for-knowledge/data/gvp/nation_output/pairs.tsv"
PATHS: "/nas/home/bohuizha/KG/hun
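
The setup cell repeats the same two lines (set the variable, remember its name) for every file. A minimal sketch of a helper that does both at once is shown here. The name `register_path`, the `DEMO_` prefix, and the temporary folder are hypothetical and used only for illustration; they are not part of KGTK or of this notebook's data layout.

```python
import os
import tempfile

def register_path(name, path, registry):
    """Export `path` as environment variable `name` and remember the name."""
    os.environ[name] = path
    if name not in registry:
        registry.append(name)
    return path

# Demo with a throwaway folder instead of the real data folder.
demo_registry = []
demo_root = tempfile.mkdtemp()
demo_out = register_path(
    "DEMO_OUTPUT", os.path.join(demo_root, "nation_output"), demo_registry
)
os.makedirs(demo_out, exist_ok=True)

# Same pattern as the output_names loop: upper-cased key -> full file path.
demo_names = {"wiki_nation": "wiki.nation.tsv", "pairs": "pairs.tsv"}
for key, value in demo_names.items():
    register_path("DEMO_" + key.upper(), os.path.join(demo_out, value), demo_registry)

for variable in demo_registry:
    print('{}: "{}"'.format(variable, os.environ[variable]))
```

Keeping the list of registered names makes it easy to print every path in one place, which is what the last loop of the setup cell does.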

## Step 1: Query Wikidata
Query Wikidata for the nationality (P27, country of citizenship) of people that have a ULAN ID (about 92k people).

In [6]:
# count total people
# ULAN ID
kgtk("""
    query -i $ULAN_ALIGN 
        --match '(ulanid)-[]->(qnode)'
        --return 'count(distinct ulanid)'
    """)

Unnamed: 0,"count(DISTINCT graph_627_c1.""node1"")"
0,92353


In [7]:
# Qnode
kgtk("""
    query -i $ULAN_ALIGN 
        --match '(ulanid)-[]->(qnode)'
        --return 'count(distinct qnode)'
    """)

Unnamed: 0,"count(DISTINCT graph_627_c1.""node2"")"
0,92210


In [6]:
%%time
kgtk("""
    query -i $ULAN_ALIGN $WIKIDATA
        --match 'w: (ulanid)-[]->(qnode), 
                 a: (qnode)-[:P27]->(nation)' 
        --return 'distinct ulanid as node1, "P27" as label, nation as node2'
        --limit 10
    / add-labels
    """)

CPU times: user 37.8 ms, sys: 31.3 ms, total: 69.1 ms
Wall time: 1min 26s


Unnamed: 0,node1,label,node2,label;label,node2;label
0,ulan:500023949,P27,Q142,'country of citizenship'@en,'France'@en
1,ulan:500123827,P27,Q30,'country of citizenship'@en,'United States of America'@en
2,ulan:500223082,P27,Q28,'country of citizenship'@en,'Hungary'@en
3,ulan:500108173,P27,Q28,'country of citizenship'@en,'Hungary'@en
4,ulan:500072302,P27,Q28513,'country of citizenship'@en,'Austria-Hungary'@en
5,ulan:500091345,P27,Q28,'country of citizenship'@en,'Hungary'@en
6,ulan:500059872,P27,Q28,'country of citizenship'@en,'Hungary'@en
7,ulan:500020713,P27,Q183,'country of citizenship'@en,'Germany'@en
8,ulan:500160869,P27,Q183,'country of citizenship'@en,'Germany'@en
9,ulan:500113074,P27,Q28,'country of citizenship'@en,'Hungary'@en


In [7]:
%%time
kgtk("""
    query -i $ULAN_ALIGN $WIKIDATA
        --match 'w: (ulanid)-[]->(qnode), 
                 a: (qnode)-[:P27]->(nation)' 
        --return 'distinct ulanid as node1, "P27" as label, nation as node2'
        -o $WIKI_NATION
    """)

CPU times: user 55.2 ms, sys: 34.7 ms, total: 89.9 ms
Wall time: 1min 44s
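
To make the two-graph match easier to follow, here is a rough pandas equivalent on toy data. It is only a sketch of the join logic (alignment edge, then a P27 claim on the aligned Qnode), not how Kypher executes the query. All Qnode IDs and the `align` edge label are made up.

```python
import pandas as pd

# Toy alignment edges: ULAN ID -> Wikidata Qnode (IDs are made up).
align = pd.DataFrame({
    "node1": ["ulan:1", "ulan:2", "ulan:3"],
    "label": ["align", "align", "align"],
    "node2": ["Q_a", "Q_b", "Q_c"],
})
# Toy claims: Q_a and Q_b have P27 values, Q_c only has another property.
claims = pd.DataFrame({
    "node1": ["Q_a", "Q_b", "Q_b", "Q_c"],
    "label": ["P27", "P27", "P27", "P21"],
    "node2": ["Q_x", "Q_x", "Q_y", "Q_m"],
})

# (ulanid)-[]->(qnode) joined with (qnode)-[:P27]->(nation)
p27 = claims[claims["label"] == "P27"]
joined = align.merge(
    p27, left_on="node2", right_on="node1", suffixes=("_align", "_claim")
)

# Return distinct (ulanid, "P27", nation) rows, as the --return clause does.
wiki_nation = (
    joined[["node1_align", "node2_claim"]]
    .rename(columns={"node1_align": "node1", "node2_claim": "node2"})
    .assign(label="P27")[["node1", "label", "node2"]]
    .drop_duplicates()
    .reset_index(drop=True)
)
print(wiki_nation)
```

In this toy example `ulan:2` ends up with two nationality rows and `ulan:3` with none, which is why the later steps count distinct ULAN IDs rather than rows.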


## Step 2: Record Known and Unknown

In [8]:
# count known ULAN IDs
kgtk("""
    query -i $WIKI_NATION
        --match '(ulanid)-[]->(nation)' 
        --return 'count(distinct ulanid)'
    """)

Unnamed: 0,"count(DISTINCT graph_643_c1.""node1"")"
0,69387


In [8]:
# count known Qnodes
kgtk("""
    query -i $WIKI_NATION $ULAN_ALIGN
        --match 'n: (ulanid)-[]->(nation), a: (ulanid)-[]->(qnode)' 
        --return 'count(distinct qnode)'
    """)

Unnamed: 0,"count(DISTINCT graph_627_c2.""node2"")"
0,69291


Filter the ULAN-Wikidata alignment down to the people whose nationality is unknown in Wikidata:

In [9]:
%%time
kgtk("""
    ifnotexists -i $ULAN_ALIGN 
        --filter-on $WIKI_NATION 
        --input-keys node1 
        --filter-keys node1 
        -o $WIKI_UNKNOWN
    """)

CPU times: user 4.89 ms, sys: 9.79 ms, total: 14.7 ms
Wall time: 2.35 s
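
The `ifnotexists` call above acts as an anti-join: it keeps the alignment rows whose `node1` does not appear as `node1` in the known-nationality file. A small pandas sketch of the same idea on made-up data:

```python
import pandas as pd

# Toy alignment (ULAN ID -> Qnode) and toy known nationalities (IDs are made up).
align = pd.DataFrame({
    "node1": ["ulan:1", "ulan:2", "ulan:3"],
    "node2": ["Q_a", "Q_b", "Q_c"],
})
wiki_nation = pd.DataFrame({
    "node1": ["ulan:1", "ulan:2", "ulan:2"],
    "node2": ["Q_x", "Q_x", "Q_y"],
})

# Keep alignment rows whose ULAN ID has no nationality row.
known_ids = set(wiki_nation["node1"])
wiki_unknown = align[~align["node1"].isin(known_ids)].reset_index(drop=True)
print(wiki_unknown)
```

Because every aligned ULAN ID is either known or unknown, the two distinct counts should add up to the total number of aligned IDs.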


In [10]:
# count unknown ULAN IDs
kgtk("""
    query -i $WIKI_UNKNOWN
        --match '(ulanid)-[]->(qnode)'
        --return 'count(distinct ulanid)'
    """)

Unnamed: 0,"count(DISTINCT graph_644_c1.""node1"")"
0,22966


In [9]:
# count unknown Qnodes
kgtk("""
    query -i $WIKI_UNKNOWN
        --match '(ulanid)-[]->(qnode)'
        --return 'count(distinct qnode)'
    """)

Unnamed: 0,"count(DISTINCT graph_644_c1.""node2"")"
0,22919


## Step 3: Query Getty
Based on the known nationalities, we query Getty. Previously we performed a single-property query, which can be viewed as a 1-hop path query; this time we generalize to paths between known (subject, object) pairs. As before, we compute the distribution of the found paths and pick the most frequent one (the property chain).

### 3-1: Build Pairs from Wikidata Results 
The kgtk paths operation needs a pairs file that lists the (source, target) pairs to connect.

In [64]:
%%time
kgtk("""
    query -i $WIKI_NATION
        --match '(ulanid)-[]->(nation)'
        --return 'distinct ulanid as source, nation as target'
        --skip 52000
        --limit 10000
        -o $PAIRS
    """)

CPU times: user 8.16 ms, sys: 10.2 ms, total: 18.3 ms
Wall time: 6.03 s


In [65]:
!head $PAIRS | column -ts $'\t'

source          target
ulan:500016167  Q174193
ulan:500043595  Q145
ulan:500063257  Q145
ulan:500028477  Q15180
ulan:500028477  Q212
ulan:500176055  Q183
ulan:500012823  Q40
ulan:500200032  Q15180
ulan:500200032  Q159
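
The `--skip 52000 --limit 10000` options select one window of the pairs. If the pairs were to be processed window by window, the (skip, limit) values could be generated as in the sketch below. This is an assumption about how one might batch the work, not something the notebook does, and it only makes sense if the query returns rows in a stable order. `batch_windows` is a hypothetical helper.

```python
def batch_windows(total, batch_size):
    """Yield (skip, limit) pairs that cover `total` rows in order."""
    for skip in range(0, total, batch_size):
        yield skip, min(batch_size, total - skip)

# Small demo: 25 rows in windows of 10.
windows = list(batch_windows(25, 10))
print(windows)  # [(0, 10), (10, 10), (20, 5)]
```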


### 3-2: Perform Paths Mapping
This is the core step of the query, and it is also the runtime bottleneck of the whole procedure.

In [66]:
%%time
kgtk("""
    paths --max_hops 4
        --path-file $PAIRS
        --path-mode NONE 
        --path-source source
        --path-target target
        -i $ULAN_NATION_CONCAT_ID
        --statistics-only
        -o $PATHS
    """)

CPU times: user 548 ms, sys: 304 ms, total: 852 ms
Wall time: 15min 33s


In [67]:
!head $PATHS

node1	label	node2	id
p0	0	ulan:500086261:foaf:focus:ulan:500086261-agent	p0-0-0
p0	1	ulan:500086261-agent:gvp:nationalityPreferred:aat:300021959	p0-1-1
p0	2	aat:300021959:skos:exactMatch:Q664	p0-2-2
p1	0	ulan:500292325:foaf:focus:ulan:500292325-agent	p1-0-3
p1	1	ulan:500292325-agent:gvp:nationalityPreferred:aat:300020669	p1-1-4
p1	2	aat:300020669:skos:exactMatch:Q12544	p1-2-5
p2	0	ulan:500123916:foaf:focus:ulan:500123916-agent	p2-0-6
p2	1	ulan:500123916-agent:gvp:nationalityPreferred:aat:300021959	p2-1-7
p2	2	aat:300021959:skos:exactMatch:Q664	p2-2-8
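
To illustrate what a bounded path search between a (source, target) pair looks like, here is a minimal depth-first sketch in plain Python. It is not the algorithm used by `kgtk paths`; it only enumerates simple directed paths of at most `max_hops` edges. The first three edges are taken from the output above; the fourth is a made-up distractor.

```python
from collections import defaultdict

def find_paths(edges, source, target, max_hops):
    """Return all simple directed paths from source to target.

    Each path has at most `max_hops` edges and is returned as a list of
    (node1, label, node2) triples.
    """
    out_edges = defaultdict(list)
    for node1, label, node2 in edges:
        out_edges[node1].append((node1, label, node2))

    found = []

    def extend(node, path, visited):
        if node == target and path:
            found.append(list(path))
            return
        if len(path) == max_hops:
            return
        for edge in out_edges[node]:
            nxt = edge[2]
            if nxt in visited:
                continue  # keep paths simple (no repeated nodes)
            path.append(edge)
            visited.add(nxt)
            extend(nxt, path, visited)
            visited.remove(nxt)
            path.pop()

    extend(source, [], {source})
    return found

edges = [
    ("ulan:500086261", "foaf:focus", "ulan:500086261-agent"),
    ("ulan:500086261-agent", "gvp:nationalityPreferred", "aat:300021959"),
    ("aat:300021959", "skos:exactMatch", "Q664"),
    # Hypothetical distractor edge that does not lead to the target.
    ("ulan:500086261", "demo:other", "demo:dead-end"),
]
paths_found = find_paths(edges, "ulan:500086261", "Q664", max_hops=4)
for path in paths_found:
    print(" ".join(label for _, label, _ in path))
```

The number of candidate paths grows quickly with the hop limit and the node degree, which is consistent with this step dominating the runtime.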


### 3-3: Select the Top 1 Property Chain
Here we use pandas to count how often each property chain occurs and to find the most frequent one.

In [68]:
# load paths.tsv file
paths = pd.read_csv(os.environ['PATHS'], sep='\t')
# extract the property: the paths command writes the edge id (node1:property:node2) into node2, so we pull the property out of it
paths.node2 = paths.node2.apply(lambda x: ':'.join(x.split(':')[2:4]))
# save processed property
paths.to_csv(os.environ['PATHS_LABEL'], sep='\t', index=False)
# join property chain for each found path
paths_concat = paths.groupby(paths['node1']).agg({'node2': lambda x: ' '.join(list(x))})
# count each property chain; value_counts sorts them by frequency, most frequent first
paths_concat.value_counts()#.head(1)

node2                                                 
foaf:focus gvp:nationalityPreferred skos:exactMatch       67
foaf:focus gvp:nationalityNonPreferred skos:exactMatch    10
dtype: int64
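
The same ranking can be sketched without pandas on a handful of rows. The rows for `p0` to `p2` are copied from the paths output above; `p3` is a made-up path that uses `gvp:nationalityNonPreferred`. The slice `[2:4]` assumes that every node ID contains exactly one colon (`prefix:local`), as in this data.

```python
from collections import Counter, defaultdict

rows = [
    ("p0", 0, "ulan:500086261:foaf:focus:ulan:500086261-agent"),
    ("p0", 1, "ulan:500086261-agent:gvp:nationalityPreferred:aat:300021959"),
    ("p0", 2, "aat:300021959:skos:exactMatch:Q664"),
    ("p1", 0, "ulan:500292325:foaf:focus:ulan:500292325-agent"),
    ("p1", 1, "ulan:500292325-agent:gvp:nationalityPreferred:aat:300020669"),
    ("p1", 2, "aat:300020669:skos:exactMatch:Q12544"),
    ("p2", 0, "ulan:500123916:foaf:focus:ulan:500123916-agent"),
    ("p2", 1, "ulan:500123916-agent:gvp:nationalityPreferred:aat:300021959"),
    ("p2", 2, "aat:300021959:skos:exactMatch:Q664"),
    # Hypothetical path with made-up IDs, added for contrast.
    ("p3", 0, "ulan:1:foaf:focus:ulan:1-agent"),
    ("p3", 1, "ulan:1-agent:gvp:nationalityNonPreferred:aat:2"),
    ("p3", 2, "aat:2:skos:exactMatch:Q_x"),
]

def edge_property(edge_id):
    """Extract 'prefix:name' of the property from 'node1:property:node2'."""
    return ":".join(edge_id.split(":")[2:4])

# Collect the properties of each path in hop order.
chains = defaultdict(list)
for path_id, hop, edge_id in sorted(rows, key=lambda r: (r[0], r[1])):
    chains[path_id].append(edge_property(edge_id))

chain_counts = Counter(" ".join(props) for props in chains.values())
top_chain, top_count = chain_counts.most_common(1)[0]
print(top_chain, top_count)
```

Sorting by the hop index makes the chain order explicit instead of relying on the row order of the file.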

### 3-4: Query Getty Using the Property Chain
Query Getty for the values of the unknown people, with the found property chain: `(ulanid)-[foaf:focus]->()-[gvp:nationalityPreferred]->()-[skos:exactMatch]->(Qnode of Nationality)`

In [69]:
%%time
kgtk("""
    query -i $WIKI_UNKNOWN $ULAN_NATION_CONCAT_ID
        --match 'unknown: (ulanid)-[]->(), 
                 ulan: (ulanid)-[p0]->()-[p1]->()-[p2]->(nation)'
        --where 'p0.label = "foaf:focus" AND
                 p1.label = "gvp:nationalityPreferred" AND
                 p2.label = "skos:exactMatch"'
        --return 'distinct ulanid as node1, "P27" as label, nation as node2'
        --limit 10
    / add-labels
    """)

CPU times: user 37.1 ms, sys: 37.2 ms, total: 74.4 ms
Wall time: 1min 46s


Unnamed: 0,node1,label,node2,label;label,node2;label
0,ulan:500000007,P27,Q1979615,'country of citizenship'@en,'culture of Germany'@en
1,ulan:500000016,P27,Q1985804,'country of citizenship'@en,'culture of France'@en
2,ulan:500000020,P27,Q242485,'country of citizenship'@en,'Flemings'@en
3,ulan:500000031,P27,Q1979615,'country of citizenship'@en,'culture of Germany'@en
4,ulan:500000034,P27,Q1979615,'country of citizenship'@en,'culture of Germany'@en
5,ulan:500000091,P27,Q1985804,'country of citizenship'@en,'culture of France'@en
6,ulan:500000123,P27,Q1979615,'country of citizenship'@en,'culture of Germany'@en
7,ulan:500000178,P27,Q1985804,'country of citizenship'@en,'culture of France'@en
8,ulan:500000205,P27,Q1979615,'country of citizenship'@en,'culture of Germany'@en
9,ulan:500000215,P27,Q1985804,'country of citizenship'@en,'culture of France'@en


In [72]:
%%time
kgtk("""
    query -i $WIKI_UNKNOWN $ULAN_NATION_CONCAT_ID
        --match 'unknown: (ulanid)-[]->(), 
                 ulan: (ulanid)-[p0]->()-[p1]->()-[p2]->(nation)'
        --where 'p0.label = "foaf:focus" AND
                 p1.label = "gvp:nationalityPreferred" AND
                 p2.label = "skos:exactMatch"'
        --return 'distinct ulanid as node1, "P27" as label, nation as node2'
        -o $GETTY_NATION
    """)

CPU times: user 4.97 ms, sys: 13.3 ms, total: 18.2 ms
Wall time: 5.17 s
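
Following a fixed property chain amounts to one join per hop. The sketch below shows this on a toy edge list with pandas; `follow_chain` is a hypothetical helper and all IDs are made up. It illustrates the logic of the three-hop match, not Kypher's execution.

```python
import pandas as pd

edges = pd.DataFrame(
    [
        ("ulan:1", "foaf:focus", "ulan:1-agent"),
        ("ulan:1-agent", "gvp:nationalityPreferred", "aat:10"),
        ("ulan:1-agent", "gvp:nationalityNonPreferred", "aat:11"),
        ("aat:10", "skos:exactMatch", "Q_x"),
        ("aat:11", "skos:exactMatch", "Q_y"),
        ("ulan:2", "foaf:focus", "ulan:2-agent"),
        ("ulan:2-agent", "gvp:nationalityPreferred", "aat:10"),
    ],
    columns=["node1", "label", "node2"],
)

def follow_chain(edges, start_nodes, chain):
    """Follow `chain` (a list of edge labels) from each start node.

    Returns a DataFrame with columns `start` and `end`.
    """
    frontier = pd.DataFrame({"start": start_nodes, "end": start_nodes})
    for label in chain:
        step = edges.loc[edges["label"] == label, ["node1", "node2"]]
        frontier = frontier.merge(step, left_on="end", right_on="node1")
        frontier = frontier[["start", "node2"]].rename(columns={"node2": "end"})
    return frontier.drop_duplicates().reset_index(drop=True)

chain = ["foaf:focus", "gvp:nationalityPreferred", "skos:exactMatch"]
unknown_ids = ["ulan:1"]  # only people without a Wikidata nationality
getty_nation = follow_chain(edges, unknown_ids, chain)
print(getty_nation)
```

Restricting the start nodes to the unknown ULAN IDs plays the role of the `unknown:` clause in the query.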


Count how many new results (in ULAN ID) we found

In [73]:
kgtk("""
    query -i $GETTY_NATION
        --match '(ulanid)-[]->(nation)'
        --return 'count(distinct ulanid)'
    """)

Unnamed: 0,"count(DISTINCT graph_648_c1.""node1"")"
0,5022


## Step 4: Record New Results

Map back to Wikidata

In [74]:
%%time
kgtk("""
    query -i $GETTY_NATION $ULAN_ALIGN
        --match 'g: (ulanid)-[p]->(nation), 
                 w: (ulanid)-[]->(qnode)'
        --return 'distinct qnode as node1, p.label as label, nation as node2'
        -o $GETTY_MAPPED
    """)

CPU times: user 9.41 ms, sys: 8.4 ms, total: 17.8 ms
Wall time: 5.56 s


Count how many new results (in Qnode) we found

In [75]:
kgtk("""
    query -i $GETTY_MAPPED
        --match '(qnode)-[]->(nation)'
        --return 'count(distinct qnode)'
    """)

Unnamed: 0,"count(DISTINCT graph_649_c1.""node1"")"
0,5019


## Step 5: Validate with Wikidata Constraints

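Keep the new results whose value (node2) is an instance of a class that is a subclass (via P279star) of one of the allowed value classes. Values that are themselves subclasses of an allowed class are recovered in a second pass, and everything else is recorded as incorrect.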

In [81]:
%%time
kgtk("""
    query -i $GETTY_MAPPED $P31 $P279STAR \
        --match 'g: (node1)-[nodeProp]->(node2), P31: (node2)-[]->(nodex), P279star: (nodex)-[]->(par)' \
        --where 'par in ["Q7275", "Q57662985", "Q6266", "Q161243", "Q148837", "Q2775969", "Q231002", "Q170156", "Q6256", "Q3024240", "Q15239622", "Q1145276", "Q133442", "Q1048835"]' \
        --return 'distinct node1 as `node1`, nodeProp.label as `label`, node2 as `node2`' \
        -o $CORRECT_TEMP_1
    """)

CPU times: user 7.48 ms, sys: 11.6 ms, total: 19.1 ms
Wall time: 6.37 s


In [82]:
%%time
kgtk("""
    ifnotexists -i $GETTY_MAPPED \
        --filter-on $CORRECT_TEMP_1 \
        --input-keys node1 node2 \
        --filter-keys node1 node2 \
        -o $INCORRECT_TEMP
    """)

CPU times: user 7.15 ms, sys: 9.23 ms, total: 16.4 ms
Wall time: 4.53 s


In [83]:
%%time
kgtk("""
    query -i $INCORRECT_TEMP $P279STAR \
        --match 'i: (node1)-[nodeProp]->(node2), P279star: (node2)-[]->(par)' \
        --where 'par in ["Q7275", "Q57662985", "Q6266", "Q161243", "Q148837", "Q2775969", "Q231002", "Q170156", "Q6256", "Q3024240", "Q15239622", "Q1145276", "Q133442", "Q1048835"]' \
        --return 'distinct node1 as `node1`, nodeProp.label as `label`, node2 as `node2`' \
        -o $CORRECT_TEMP_2
    """)

CPU times: user 6.42 ms, sys: 11.6 ms, total: 18 ms
Wall time: 5.77 s


In [84]:
%%time
kgtk("""
    ifnotexists -i $INCORRECT_TEMP \
        --filter-on $CORRECT_TEMP_2 \
        --input-keys node1 node2 \
        --filter-keys node1 node2 \
        -o $INCORRECT
    """)

CPU times: user 8.6 ms, sys: 8.6 ms, total: 17.2 ms
Wall time: 4.51 s


In [85]:
%%time
kgtk("""
    cat -i $CORRECT_TEMP_1 $CORRECT_TEMP_2 -o $CORRECT
    """)

CPU times: user 7.55 ms, sys: 9.62 ms, total: 17.2 ms
Wall time: 5.93 s
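
The two-pass validation above can be sketched in pandas as follows. All identifiers are made up, and the sketch assumes the P279star file lists (class, ancestor) pairs in (node1, node2), as the queries imply.

```python
import pandas as pd

# New statements mapped back to Wikidata: person -> candidate value.
mapped = pd.DataFrame({
    "node1": ["person_1", "person_2", "person_3"],
    "label": ["P27", "P27", "P27"],
    "node2": ["value_a", "value_b", "value_c"],
})
# Instance-of edges: value -> class.
p31 = pd.DataFrame({
    "node1": ["value_a", "value_c"],
    "node2": ["class_s", "class_t"],
})
# Transitive subclass edges: class -> ancestor.
p279star = pd.DataFrame({
    "node1": ["class_s", "value_b", "class_t"],
    "node2": ["allowed_1", "allowed_2", "other"],
})
allowed = {"allowed_1", "allowed_2"}

# Everything that has an allowed class among its ancestors.
under_allowed = set(p279star.loc[p279star["node2"].isin(allowed), "node1"])

# Pass 1: the value is an instance of a class under an allowed class.
pass1_values = set(p31.loc[p31["node2"].isin(under_allowed), "node1"])
correct_1 = mapped[mapped["node2"].isin(pass1_values)]
rest = mapped[~mapped["node2"].isin(pass1_values)]

# Pass 2: the value itself is a subclass of an allowed class.
correct_2 = rest[rest["node2"].isin(under_allowed)]
incorrect = rest[~rest["node2"].isin(under_allowed)]

correct = pd.concat([correct_1, correct_2], ignore_index=True)
print(correct)
print(incorrect)
```

Since the check depends only on the value, filtering on `node2` here gives the same split as the `ifnotexists` calls keyed on both `node1` and `node2`.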


Count (in ULAN ID) how many correct new results:

In [100]:
kgtk("""
    query -i $CORRECT $ULAN_ALIGN
        --match 'c: (qnode)-[]->(), w: (ulanid)-[]->(qnode)'
        --return 'count(distinct ulanid)'
    """)

Unnamed: 0,"count(DISTINCT graph_627_c2.""node1"")"
0,269


In [3]:
kgtk("""
    query -i $INCORRECT $ULAN_ALIGN
        --match 'c: (qnode)-[]->(), w: (ulanid)-[]->(qnode)'
        --return 'count(distinct ulanid)'
    """)

Unnamed: 0,"count(DISTINCT graph_627_c2.""node1"")"
0,4765


Count (in Qnode) how many correct new results:

In [10]:
kgtk("""
    query -i $CORRECT $ULAN_ALIGN
        --match 'c: (qnode)-[]->(), w: (ulanid)-[]->(qnode)'
        --return 'count(distinct qnode)'
    """)

Unnamed: 0,"count(DISTINCT graph_651_c1.""node1"")"
0,271


Count (in Qnode) how many incorrect new results:

In [12]:
kgtk("""
    query -i $INCORRECT $ULAN_ALIGN
        --match 'c: (qnode)-[]->(), w: (ulanid)-[]->(qnode)'
        --return 'count(distinct qnode)'
    """)

Unnamed: 0,"count(DISTINCT graph_652_c1.""node1"")"
0,4752


## Step 6: Record Unknown after Query
We record the people that are still unknown after querying Getty, including those whose Getty results were judged incorrect.

In [96]:
kgtk("""
    ifnotexists -i $WIKI_UNKNOWN 
        --filter-on $CORRECT 
        --input-keys node2 
        --filter-keys node1
        -o $UNKNOWN
    """)

In [97]:
# count in ULAN IDs
kgtk("""
    query -i $UNKNOWN
        --match '(ulanid)-[]->(qnode)'
        --return 'count(distinct ulanid)'
    """)

Unnamed: 0,"count(DISTINCT graph_653_c1.""node1"")"
0,22697


In [13]:
# count in Qnodes
kgtk("""
    query -i $UNKNOWN
        --match '(ulanid)-[]->(qnode)'
        --return 'count(distinct qnode)'
    """)

Unnamed: 0,"count(DISTINCT graph_653_c1.""node2"")"
0,22648
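
A quick consistency check of the counts reported so far. All numbers are copied from the cell outputs in this notebook; the variable names are ad hoc.

```python
# (ULAN IDs, Qnodes) as reported by the cells above.
aligned = (92353, 92210)    # Step 1: people in the alignment
known = (69387, 69291)      # Step 2: nationality known in Wikidata
unknown = (22966, 22919)    # Step 2: nationality unknown in Wikidata
correct = (269, 271)        # Step 5: validated new results
remaining = (22697, 22648)  # Step 6: still unknown after querying Getty

checks = {
    "known + unknown == aligned (ULAN)": known[0] + unknown[0] == aligned[0],
    "known + unknown == aligned (Qnode)": known[1] + unknown[1] == aligned[1],
    "unknown - correct == remaining (ULAN)": unknown[0] - correct[0] == remaining[0],
    "unknown - correct == remaining (Qnode)": unknown[1] - correct[1] == remaining[1],
}
for name, ok in checks.items():
    print(name, ok)

# Share of previously unknown ULAN IDs that gained a validated nationality.
gain = correct[0] / unknown[0]
print("{:.2%}".format(gain))
```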


## Step 7: Comparison of the Known Values
How often do the results from the two knowledge graphs agree?

In [4]:
%%time
kgtk("""
    query -i $ULAN_ALIGN $ULAN_NATION_CONCAT_ID
        --match 'w: (ulanid)-[]->(), 
                 u: (ulanid)-[p0]->()-[p1]->()-[p2]->(nation)'
        --where 'p0.label = "foaf:focus" AND
                 p1.label = "gvp:nationalityPreferred" AND
                 p2.label = "skos:exactMatch"'
        --return 'distinct ulanid as node1, "P27" as label, nation as node2'
        -o $GETTY_AGREE
    """)

CPU times: user 11.4 ms, sys: 17 ms, total: 28.4 ms
Wall time: 6.46 s


In [21]:
kgtk("""
    query -i $GETTY_AGREE $P31
        --match 'g: ()-[]->(nation), P: (nation)-[]->(class)'
        --return 'count(nation)'
    """)

Unnamed: 0,"count(graph_894_c1.""node2"")"
0,27788


In [22]:
kgtk("""
    query -i $GETTY_AGREE $P31
        --match 'g: ()-[]->(nation), P: (nation)-[]->(class)'
        --return 'distinct class, count(nation) as N'
        --order-by 'N desc'
    """)

Unnamed: 0,node2,N
0,Q19958368,22582
1,Q41710,1647
2,Q3624078,400
3,Q6256,306
4,Q223832,305
5,Q202686,305
6,Q1351282,305
7,Q112099,305
8,Q28171280,186
9,Q11042,164


Count the people (in Qnodes) for whom Wikidata and Getty give the same value:

In [6]:
%%time
kgtk("""
    query -i $WIKI_NATION $GETTY_AGREE $ULAN_ALIGN
        --match 'n: (ulanid)-[]->(nation), g: (ulanid)-[]->(nation), a: (ulanid)-[]->(qnode)'
        --return 'count(distinct qnode)'
    """)

CPU times: user 11.5 ms, sys: 12.1 ms, total: 23.7 ms
Wall time: 4.32 s


Unnamed: 0,"count(DISTINCT graph_627_c3.""node2"")"
0,325
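
Agreement here means that the same (ULAN ID, value) pair appears in both result files. A toy pandas sketch of that check, with made-up IDs:

```python
import pandas as pd

wiki_nation = pd.DataFrame({
    "node1": ["ulan:1", "ulan:2", "ulan:3"],
    "node2": ["Q_x", "Q_y", "Q_z"],
})
getty_agree = pd.DataFrame({
    "node1": ["ulan:1", "ulan:2", "ulan:4"],
    "node2": ["Q_x", "Q_w", "Q_z"],
})
align = pd.DataFrame({
    "node1": ["ulan:1", "ulan:2", "ulan:3", "ulan:4"],
    "node2": ["Q_a", "Q_b", "Q_c", "Q_d"],
})

# Same person and same value in both graphs.
agree = wiki_nation.merge(getty_agree, on=["node1", "node2"])

# Map the agreeing people to Qnodes and count the distinct ones.
agree_qnodes = agree.merge(align, on="node1", suffixes=("_nation", "_qnode"))
n_agree = agree_qnodes["node2_qnode"].nunique()

# People that have a value in both graphs, whether or not the values agree.
in_both = set(wiki_nation["node1"]) & set(getty_agree["node1"])
print(n_agree, len(in_both))
```

Comparing the agreement count with the number of people present in both graphs gives a rate rather than a raw count, which is easier to interpret.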


Compute the average number of values per node:

In [9]:
nodes = kgtk("""
    query -i $GETTY_AGREE $ULAN_ALIGN 
        --match 'g: (ulanid)-[]->(), w: (ulanid)-[]->(qnode)'
        --return 'count(distinct qnode) as node'
    """)

In [10]:
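import subprocess

# Count the lines of the output file with wc; the shell expands $GETTY_AGREE
values = subprocess.check_output("wc -l < $GETTY_AGREE", shell=True)
values = values.decode("utf-8").strip()
# Subtract the header line to get the number of (node, value) rows
values = int(values) - 1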

In [11]:
values / nodes.iloc[0]['node']

1.0027797081306462
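
The row count can also be obtained without calling `wc` through a shell. A minimal pure-Python sketch on a small temporary file; `count_data_rows` is a hypothetical helper and the file content is made up.

```python
import os
import tempfile

def count_data_rows(path):
    """Count the lines after the header line of a TSV file."""
    with open(path, encoding="utf-8") as handle:
        next(handle, None)  # skip the header
        return sum(1 for _ in handle)

# Write a tiny stand-in for getty.agree.tsv.
demo_path = os.path.join(tempfile.mkdtemp(), "getty.agree.tsv")
with open(demo_path, "w", encoding="utf-8") as handle:
    handle.write("node1\tlabel\tnode2\n")
    handle.write("ulan:1\tP27\tQ_x\n")
    handle.write("ulan:1\tP27\tQ_y\n")
    handle.write("ulan:2\tP27\tQ_x\n")

demo_values = count_data_rows(demo_path)

# Distinct people in the demo file (first column of each data row).
with open(demo_path, encoding="utf-8") as handle:
    next(handle)
    demo_nodes = len({line.split("\t")[0] for line in handle})

print(demo_values / demo_nodes)  # 1.5
```

A ratio close to 1, as in the result above, means that almost every person has a single value under the chosen property chain.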