# Hunger for Knowledge Procedure

## Step 0 Set up `kgtk`
Install `kgtk` from the `dev` branch on GitHub:

In [None]:
%%bash
git clone -b dev https://github.com/usc-isi-i2/kgtk.git
cd kgtk
python setup.py install

## Step 1 Send a SPARQL query using `kgtk`

Example 1: Find the spouse(s) of politicians.

SPARQL query:
```
SELECT ?politicianLabel ?spouseLabel
WHERE 
{
  ?politician wdt:P31  wd:Q5 ;
              wdt:P106 wd:Q82955 ;
              wdt:P26  ?spouse .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
```

### Main `kgtk` query:

In [None]:
%%bash
kgtk query -i data/claims.tsv \
           --match 'c: (:Q82955)<-[:P106]-(politician)-[:P26]->(s)' \
           --return 'politician, "P26" as label, s' \
           -o data/spouse_of_politician_kgtk.tsv

### Count known results in Wikidata database:

Count politician-spouse pairs, i.e. **rows** (subtract 1 from the result to exclude the header line):

In [6]:
%%bash
wc -l data/spouse_of_politician_kgtk.tsv

36975 data/spouse_of_politician_kgtk.tsv
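
Because `wc -l` also counts the header line, the number of pairs is one less than the figure it prints. A minimal sketch of a header-free count, assuming only the standard `tail` and `wc` utilities; the file `sample.tsv` and its rows are toy data made up for illustration:

```shell
# Build a tiny KGTK-style edge file: one header line plus two data rows (toy data).
tmp=$(mktemp -d)
printf 'node1\tlabel\tnode2\n' > "$tmp/sample.tsv"
printf 'Q1\tP26\tQ2\n' >> "$tmp/sample.tsv"
printf 'Q3\tP26\tQ4\n' >> "$tmp/sample.tsv"

# Start reading at line 2 (skipping the header) and count the remaining rows.
# tr strips the padding that some wc implementations add.
rows=$(tail -n +2 "$tmp/sample.tsv" | wc -l | tr -d ' ')
echo "$rows"
```

Applied to the result file above, the same subtraction gives 36975 - 1 = 36974 politician-spouse pairs.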


Count how many **unique politicians** have a spouse in Wikidata:

In [7]:
%%bash
kgtk query -i data/spouse_of_politician_kgtk.tsv \
           --match 'c: (p)-[]->()' \
           --return 'count(distinct p) as N'

N
30400


### Find unknown results in Wikidata database:
- Find all politicians

In [23]:
%%bash
kgtk query -i data/claims.tsv \
           --match 'c: (:Q82955)<-[:P106]-(politician)' \
           --return 'politician, "P106" as label, "Q82955" as node2' \
           -o data/politician.tsv

- Eliminate politicians who have spouse(s)

In [25]:
%%bash
kgtk ifnotexists -i data/politician.tsv \
                 --filter-on data/spouse_of_politician_kgtk.tsv \
                 --input-keys node1 \
                 --filter-keys node1 \
                 -o data/politician_wo_spouse.tsv
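
The elimination step is an anti-join on `node1`: keep every politician row whose `node1` does not appear in the spouse file. The sketch below illustrates that idea with plain `awk` on toy files. It is not how `kgtk ifnotexists` is implemented, and the file names and rows are assumptions made up for the example:

```shell
tmp=$(mktemp -d)
# All politicians (toy data).
printf 'node1\tlabel\tnode2\nQ1\tP106\tQ82955\nQ2\tP106\tQ82955\nQ3\tP106\tQ82955\n' > "$tmp/politician.tsv"
# Politicians with a known spouse (toy data).
printf 'node1\tlabel\tnode2\nQ2\tP26\tQ9\n' > "$tmp/spouse.tsv"

# First file: remember every node1 of the filter file (skipping its header).
# Second file: print the header, then only rows whose node1 was not seen.
no_spouse=$(awk -F'\t' '
  NR == FNR { if (FNR > 1) seen[$1] = 1; next }
  FNR == 1 || !($1 in seen)
' "$tmp/spouse.tsv" "$tmp/politician.tsv")
echo "$no_spouse"
```

Here `Q2` is dropped because it has a spouse edge, while `Q1` and `Q3` remain.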

### Count unknown results in Wikidata database:

In [26]:
%%bash
kgtk query -i data/politician_wo_spouse.tsv \
           --match 'c: (p)-[]->()' \
           --return 'count(distinct p) as N'

N
619168


## Step 2 Infer properties

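Use the query results from the Wikidata database to infer the corresponding property in the Wikidata infobox data: match edges that share the same subject and value in both files, then return the most frequent infobox property.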

In [41]:
%%bash
WIKI_INFO="data/wikidata_infobox.tsv"
RESULTS="data/spouse_of_politician_kgtk.tsv"

kgtk query -i $RESULTS -i $WIKI_INFO \
           --match 's: (n)-[]->(v), w: (n)-[p]->(v)' \
           --return 'p.label, count(v) as N' \
           --order-by 'N desc' \
           --limit 1

label	N
property:spouse	5697
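
The inference is a join on (subject, value) followed by a frequency count of the infobox labels. A rough sketch of that logic with `awk` and `sort`, on toy data; the rows and the labels `property:partner` and `property:children` are hypothetical, chosen only to show a non-winning and a non-matching label:

```shell
tmp=$(mktemp -d)
# Known Wikidata results (toy data).
printf 'node1\tlabel\tnode2\nQ1\tP26\tQ2\nQ3\tP26\tQ4\n' > "$tmp/results.tsv"
# Infobox edges (toy data); only rows sharing node1 and node2 with a result count.
printf 'node1\tlabel\tnode2\nQ1\tproperty:spouse\tQ2\nQ1\tproperty:partner\tQ2\nQ3\tproperty:spouse\tQ4\nQ3\tproperty:children\tQ7\n' > "$tmp/infobox.tsv"

# Pass 1: index the (node1, node2) pairs of the known results.
# Pass 2: count infobox labels on edges with a matching pair.
# Then sort by count, descending, and keep the top label.
top=$(awk -F'\t' '
  NR == FNR { if (FNR > 1) known[$1 FS $3] = 1; next }
  FNR > 1 && (($1 FS $3) in known) { count[$2]++ }
  END { for (l in count) print l "\t" count[l] }
' "$tmp/results.tsv" "$tmp/infobox.tsv" | sort -t "$(printf '\t')" -k2,2nr | head -n 1)
echo "$top"
```

In this toy input `property:spouse` matches two known edges and `property:partner` one, so `property:spouse` is reported as the most frequent property.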


## Step 3 Run the query on the Wikidata infobox

For the politicians who have no spouse in Wikidata, look up the spouse in the Wikidata infobox:

In [28]:
%%bash
WIKI_INFO="data/wikidata_infobox.tsv"
QUERY_FILE="data/politician_wo_spouse.tsv"

kgtk query -i $QUERY_FILE -i $WIKI_INFO \
           --match 'p: (politician)-[]->(), w: (politician)-[property]->(spouse)' \
           --where 'property.label = "property:spouse"' \
           --return 'politician, property.label, spouse' \
           -o data/new_spouse_of_politician.tsv

- Count rows of new findings:

In [57]:
%%bash
wc -l data/new_spouse_of_politician.tsv

58739 data/new_spouse_of_politician.tsv


- Count unique politicians of new findings:

In [30]:
%%bash
kgtk query -i data/new_spouse_of_politician.tsv \
           --match 'n: (p)-[]->()' \
           --return 'count(distinct p) as N'

N
39443


## Step 4 Datatype distribution of new findings

### 1. Numbers:

- Rows:

In [42]:
%%bash
kgtk query -i data/new_spouse_of_politician.tsv \
           --match 'n: (p)-[]->(s)' \
           --where 'kgtk_number(s)' \
           -o data/numbers.tsv

wc -l data/numbers.tsv

7336 data/numbers.tsv


- Unique politicians:

In [43]:
%%bash
kgtk query -i data/numbers.tsv \
           --match '(p)-[]->()' \
           --return 'count(distinct p) as N'

N
4587


### 2. Structured literals:

- Rows:

In [44]:
%%bash
RESULTS=data/new_spouse_of_politician.tsv
WIKI_INFO=data/wikidata_infobox.tsv

kgtk query -i $RESULTS -i $WIKI_INFO \
           --match 'n: (p)-[q]->(s), w: (s)-[sv]->(v)' \
           --where 'NOT kgtk_lqstring(s) AND NOT kgtk_number(s) AND sv.label = "dbpedia:structured_value"' \
           --return 's as node1, sv.label, v' \
           -o data/structured_literals.tsv

wc -l data/structured_literals.tsv

1943 data/structured_literals.tsv


- Unique politicians:

In [46]:
%%bash
kgtk query -i data/structured_literals.tsv \
           --match '(p)-[]->()' \
           --return 'count(distinct p) as N'

N
1942


### 3. Strings:

#### All:

- Rows:

In [47]:
%%bash
kgtk query -i data/new_spouse_of_politician.tsv \
           --match 'n: (p)-[]->(s)' \
           --where 'kgtk_lqstring(s)' \
           -o data/strings.tsv

wc -l data/strings.tsv

48171 data/strings.tsv


- Unique politicians:

In [48]:
%%bash
kgtk query -i data/strings.tsv \
           --match '(p)-[]->()' \
           --return 'count(distinct p) as N'

N
38317


#### Empty:

- Rows:

In [49]:
%%bash
kgtk query -i data/new_spouse_of_politician.tsv \
           --match 'n: (p)-[]->(s)' \
           --where 'kgtk_lqstring_text(s) = ""' \
           -o data/empty_strings.tsv

wc -l data/empty_strings.tsv

5868 data/empty_strings.tsv


- Unique politicians:

In [50]:
%%bash
kgtk query -i data/empty_strings.tsv \
           --match 'n: (p)-[]->()' \
           --return 'count(distinct p) as N'

N
5867


#### Further check of empty strings:

An empty string does not necessarily mean that the infobox returns only an empty value for that politician. For example:
```
Q1133864    property:spouse    nodemxZbyK2VRrGoaxfdLmyLxw-7343552
Q1133864    property:spouse    'Ethel Arnold'@en
Q1133864    property:spouse    ''@en
```
Here the same politician has both an empty and a non-empty spouse string, so a further check is needed.

First, select all rows with non-empty strings:

In [None]:
%%bash
kgtk query -i data/new_spouse_of_politician.tsv \
           --match 'n: (p)-[]->(s)' \
           --where 'kgtk_lqstring_text(s) != ""' \
           -o data/non_empty_strings.tsv

Then we remove from the empty set every politician who also has a non-empty spouse string, and count the remaining "pure" empty ones:

In [53]:
%%bash
kgtk ifnotexists -i data/empty_strings.tsv \
                 --filter-on data/non_empty_strings.tsv \
                 --input-keys node1 \
                 --filter-keys node1 \
                 -o data/pure_empty.tsv 

wc -l data/pure_empty.tsv

297 data/pure_empty.tsv


In [54]:
%%bash
kgtk query -i data/pure_empty.tsv \
           --match 'p: (politician)-[]->()' \
           --return 'count(distinct politician) as N'

N
296
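
The "pure empty" check can also be pictured as a single pass that groups rows by politician: a politician is pure empty if it has an empty string and no non-empty string. A small sketch with `awk`, using the three `Q1133864` rows from the example plus one made-up politician `QX` whose only value is empty (`QX` and the file name are assumptions for illustration):

```shell
tmp=$(mktemp -d)
# printf reuses the format for every group of three arguments (one row each).
printf '%s\t%s\t%s\n' \
  node1 label node2 \
  Q1133864 property:spouse nodemxZbyK2VRrGoaxfdLmyLxw-7343552 \
  Q1133864 property:spouse "'Ethel Arnold'@en" \
  Q1133864 property:spouse "''@en" \
  QX property:spouse "''@en" > "$tmp/new.tsv"

# \047 is a single quote. Mark politicians with an empty language-qualified
# string and those with a non-empty one, then report the difference.
pure=$(awk -F'\t' '
  FNR == 1 { next }
  $3 ~ /^\047\047@/    { empty[$1] = 1; next }
  $3 ~ /^\047.+\047@/  { nonempty[$1] = 1 }
  END { for (p in empty) if (!(p in nonempty)) print p }
' "$tmp/new.tsv")
echo "$pure"
```

`Q1133864` is excluded because it also carries `'Ethel Arnold'@en`, so only `QX` is reported.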


### 4. Qnodes:

- Rows:

In [55]:
%%bash
kgtk query -i data/claims.tsv -i data/new_spouse_of_politician.tsv \
           --match 'c: (q)-[]->(), n:()-[]->(q)' \
           -o data/qnodes.tsv

wc -l data/qnodes.tsv

23946 data/qnodes.tsv


- Unique politicians:

In [56]:
%%bash
kgtk query -i data/qnodes.tsv \
           --match '(p)-[]->()' \
           --return 'count(distinct p) as N'

N
1275
