# Hunger for Knowledge Procedure

## Step 0 Set up `kgtk`

## Step 1 Send a SPARQL query

Find spouse of politicians:

In [None]:
%%bash
kgtk query -i data/claims.tsv \
           --match 'c: (:Q82955)<-[:P106]-(politician)-[:P26]->(s)' \
           --return 'politician, "P26" as label, s' \
           -o data/spouse_of_politician_kgtk.tsv

Count politician-spouse pairs (-1):

In [6]:
%%bash
wc -l data/spouse_of_politician_kgtk.tsv

36975 data/spouse_of_politician_kgtk.tsv


Count how many politicians have spouse in Wikidata:

In [7]:
%%bash
kgtk query -i data/spouse_of_politician_kgtk.tsv \
           --match 'c: (p)-[]->()' \
           --return 'count(distinct p) as N'

N
30400


Find politicians don't have spouse listed in Wikidata:
- Find all politicians

In [None]:
%%bash
kgtk query -i data/claims.tsv \
           --match 'c: (:Q82955)<-[:P106]-(politician)-[p]->(s)' \
           --return 'politician, p.label, s' \
           -o data/politician.tsv

- Eliminate politicians who have spouse(s)

In [None]:
%%bash
kgtk ifnotexists -i data/politician.tsv \
                 --filter-on data/spouse_of_politician_kgtk.tsv \
                 --input-keys node1 \
                 --filter-keys node1 \
                 -o data/politician_wo_spouse.tsv

Count politicians in Wikidata who don't have spouse listed:

In [9]:
%%bash
kgtk query -i data/politician_wo_spouse.tsv \
           --match 'c: (p)-[]->()' \
           --return 'count(distinct p) as N'

N
619168


## Step 2 Infer properties

In [10]:
%%bash
WIKI_INFO="data/wikidata_infobox.tsv"
RESULTS="data/spouse_of_politician_kgtk.tsv"

kgtk query -i $RESULTS -i $WIKI_INFO \
           --match 's: (n)-[]->(v), w: (n)-[p]->(v)' \
           --return 'p.label, count(v) as N'

label	N
property:1namedata	10
property:2namedata	2
property:after	68
property:alongside	3
property:appointer	2
property:associatedActs	1
property:before	67
property:caption	2
property:children	3
property:coach	1
property:consort	2
property:coronation	1
property:deputy	1
property:deputyPresident	1
property:dictator	1
property:father	5
property:formerpartner	1
property:governor	64
property:issue	3
property:leader	2
property:lieutenantGovernor	1
property:liveInPartner	1
property:minister	1
property:monarch	4
property:mother	1
property:partner	17
property:pharaoh	1
property:preceded	13
property:predecessor	127
property:premier	1
property:president	294
property:primeminister	26
property:queen	5
property:regent	61
property:relations	3
property:relatives	1
property:royalHouse	1
property:spouse	5697
property:spouse(s)_	1
property:spouses	22
property:succeeded	17
property:successor	100
property:title	1
property:vicePresident	1
property:vicepresident	28
property:with	11


## Step 3 Run query in Wikidata infobox 

For those don't have spouse, query in Wikidata infobox:

In [None]:
%%bash
WIKI_INFO="data/wikidata_infobox.tsv"
QUERY_FILE="data/politician_wo_spouse.tsv"

kgtk query -i $QUERY_FILE -i $WIKI_INFO \
           --match 'p: (politician)-[]->(), w: (politician)-[property]->(spouse)' \
           --where 'property.label = "property:spouse"' \
           --return 'politician, property.label, spouse' \
           -o data/new_spouse_of_politician.tsv

It is weird that the returned file contains many duplicate lines, need further process.

Count politician with spouse that we newly found in Wikidata infobox:

In [11]:
%%bash
kgtk query -i data/new_spouse_of_politician.tsv \
           --match 'n: (p)-[]->()' \
           --return 'count(distinct p) as N'

N
39443


Count empty:

In [12]:
%%bash
kgtk query -i data/new_spouse_of_politician.tsv \
           --match 'n: (p)-[]->(s)' \
           --where 'kgtk_lqstring_text(s) != ""' \
           --return 'count(distinct p) as N'

N
38021


Since some of the empty items not excatly mean only return empty for that politician, for example,
```
Q1133864    property:spouse    nodemxZbyK2VRrGoaxfdLmyLxw-7343552
Q1133864    property:spouse    'Ethel Arnold'@en
Q1133864    property:spouse    ''@en
```
so it need further check.<div>
First filter out all numbers and empty strings:

In [13]:
%%bash
kgtk query -i data/new_spouse_of_politician.tsv \
           --match 'n: (p)-[]->(s)' \
           --where 'NOT kgtk_number(s) AND kgtk_lqstring_text(s) != ""' \
           -o data/cleaned_new_spouse_of_politician.tsv

Then find all empty:

In [14]:
%%bash
kgtk query -i data/new_spouse_of_politician.tsv \
           --match 'n: (p)-[]->(s)' \
           --where 'kgtk_lqstring_text(s) != ""' \
           -o data/empty_new_spouse_of_politician.tsv

Filter those in empty but also have spouse in cleaned ones, then we can count how many "pure" empty:

In [15]:
%%bash
kgtk ifnotexists -i data/empty_new_spouse_of_politician.tsv \
                 --filter-on data/cleaned_new_spouse_of_politician.tsv \
                 --input-keys node1 \
                 --filter-keys node1 \
                 -o data/pure_empty.tsv 

In [16]:
%%bash
kgtk query -i data/pure_empty.tsv \
           --match 'p: (politician)-[]->()' \
           --return 'count(distinct politician) as N'

N
0
