# Step 0 Set up `kgtk`
See `spouse_of_politician.ipynb` for the setup instructions.

# Step 1 Send a SPARQL query using `kgtk`

Example 1: Find the founding years of universities:

### Define aliases and variables

In [None]:
%%bash
# Database
CLAIMS=../data/claims.tsv.gz
WIKI_INFO=../data/wikidata_infobox.tsv
# Results
RESULTS=../data/founding_year_of_university.tsv
NEW_RESULTS=../data/new_founding_year_of_university.tsv
# Datatypes
NUMBERS=../data/numbers.tsv
STRINGS=../data/strings.tsv
STRUCTURED_LITERALS=../data/structured_literals.tsv
QNODES=../data/qnodes.tsv
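
Each `%%bash` cell runs in its own shell process, so variables assigned in one cell are not visible in the next, which is why later cells repeat assignments such as `RESULTS=...`. One possible workaround, sketched below, is to keep the assignments in a small file and source it at the top of every cell. The file name `paths.env` is a hypothetical choice and not part of the original notebook.

```shell
# Write the shared paths once (paths.env is a hypothetical file name).
cat > paths.env <<'EOF'
CLAIMS=../data/claims.tsv.gz
WIKI_INFO=../data/wikidata_infobox.tsv
RESULTS=../data/founding_year_of_university.tsv
NEW_RESULTS=../data/new_founding_year_of_university.tsv
EOF

# At the top of any later cell, load the same assignments again.
. ./paths.env
echo "$RESULTS"
```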

### Main `kgtk` query:

In [1]:
# SPARQL query: 
# SELECT DISTINCT ?universityLabel (YEAR(?inception) AS ?foundingYear) 
# WHERE 
# { 
#   ?university wdt:P31/wdt:P279* wd:Q3918 ; 
#               wdt:P571 ?inception . 
#   SERVICE wikibase:label { bd:serviceParam wikibase:language "en". } 
# } 
# where `P31` means "instance of", `P279` means "subclass of", `P571` means "inception", and `Q3918` is "university".

In [3]:
%%bash
kgtk query -i ../data/claims.tsv.gz \
           --match 'c: (q)-[p:P279]->(v:Q3918)' \
           --return 'q, p.label, v' \
           -o ../data/subclass_of_university.tsv

In [None]:
%%bash
kgtk query -i ../data/claims.tsv.gz \
           --match 'c: (q)-[p:P279]->()-[:P279]->(v:Q3918)' \
           --return 'q, p.label, v'
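
The SPARQL path `wdt:P279*` follows any number of "subclass of" hops, while the two queries above check one hop and two hops. As a rough illustration of what the full path means, the sketch below computes every node that reaches a root class through `P279` edges in a plain TSV edge list. The function name `subclass_closure`, the three-column layout (`node1`, `label`, `node2`), and the toy identifiers are assumptions made for this example.

```shell
# subclass_closure ROOT FILE
# Print every node that reaches ROOT through one or more P279 edges.
# Assumes FILE is tab-separated with a header line and the columns
# node1, label, node2 (an assumption about the layout).
subclass_closure() {
  awk -F'\t' -v root="$1" '
    NR > 1 && $2 == "P279" { n++; child[n] = $1; parent[n] = $3 }
    END {
      seen[root] = 1
      # Repeat full passes over the edges until no new node is found.
      do {
        changed = 0
        for (i = 1; i <= n; i++) {
          if ((parent[i] in seen) && !(child[i] in seen)) {
            seen[child[i]] = 1
            changed = 1
            print child[i]
          }
        }
      } while (changed)
    }
  ' "$2"
}

# Toy edges with made-up identifiers A, B, C, D.
toy=$(mktemp)
printf 'node1\tlabel\tnode2\n' > "$toy"
printf 'A\tP279\tQ3918\nB\tP279\tA\nC\tP279\tB\nD\tP31\tA\n' >> "$toy"
subclass_closure Q3918 "$toy"   # lists A, B and C; D is skipped (P31 edge)
rm -f "$toy"
```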

**Type 1:** return full date information:

In [29]:
%%bash
kgtk query -i ../data/claims.tsv.gz -i ../data/subclass_of_university.tsv \
           --match 'c: (u)<-[:P31]-(q)-[p:P571]->(d), s: (u)-[]->()' \
           --return 'distinct q, p.label, d' \
           -o ../data/founding_year_of_university.tsv

**Type 2:** return only `kgtk_date_date`:

In [None]:
%%bash
# head ../data/founding_year_of_university.tsv
kgtk query -i ../data/founding_year_of_university.tsv \
           --match '(q)-[p]->(v)' \
           --return 'q, p.label, kgtk_date_date(v) as node2' \
           -o ../data/founding_year_of_university.tsv

**Type 3:** return only `kgtk_date_year`:

In [6]:
%%bash
kgtk query -i ../data/claims.tsv.gz -i ../data/subclass_of_university.tsv \
           --match 'c: (u)<-[:P31]-(q)-[p:P571]->(d), s: (u)-[]->()' \
           --return 'distinct q, p.label, kgtk_date_year(d) as node2' \
           -o ../data/founding_year_of_university.tsv

For convenience in property inference, we use **Type 2**.
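
To make the difference between the three return types concrete, the sketch below imitates the date functions on plain text. It assumes that `kgtk_date_date` keeps the date part of a literal and `kgtk_date_year` keeps the year, and that literals look like `^YYYY-MM-DDThh:mm:ssZ/precision` with a four-digit, non-negative year. The helper names `date_part` and `year_part` and the sample literal are made up for illustration.

```shell
# Text-only imitation of the two date functions (an approximation, not
# the kgtk implementation). Non-matching input prints nothing.
date_part() { printf '%s\n' "$1" | sed -n 's/^\(\^[0-9]\{4\}-[0-9][0-9]-[0-9][0-9]\).*$/\1/p'; }
year_part() { printf '%s\n' "$1" | sed -n 's/^\^\([0-9]\{4\}\)-.*$/\1/p'; }

date_part '^1636-09-08T00:00:00Z/11'   # prints ^1636-09-08
year_part '^1636-09-08T00:00:00Z/11'   # prints 1636
```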

### Count known results in Wikidata database:

Count university/founding-year pairs, i.e. **rows** (subtract 1 from the result to exclude the header line):

In [7]:
%%bash
wc -l ../data/founding_year_of_university.tsv

2340 ../data/founding_year_of_university.tsv
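
Since `wc -l` also counts the header, the 2340 lines above correspond to 2339 data rows. A small helper that skips the header avoids the manual subtraction; the name `count_rows` and the toy file are assumptions for this sketch.

```shell
# count_rows FILE: number of data rows, i.e. lines after the header.
count_rows() { awk 'NR > 1 { n++ } END { print n + 0 }' "$1"; }

# Toy file with a header and two made-up rows.
toy=$(mktemp)
printf 'node1\tlabel\tnode2\nQ1\tP571\t1900\nQ2\tP571\t1950\n' > "$toy"
count_rows "$toy"   # prints 2
rm -f "$toy"
```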


Count how many **unique universities** have a founding year in Wikidata:

In [8]:
%%bash
kgtk query -i ../data/founding_year_of_university.tsv \
           --match '(p)-[]->()' \
           --return 'count(distinct p) as N'

N
2292


Duplicates:

In [None]:
%%bash
RESULTS=../data/founding_year_of_university.tsv

kgtk query -i $RESULTS \
           --match '(p)-[]->(s)' \
           --return 'p, count(s) as N' \
           --order-by 'N desc'
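
If the counts above are read as 2339 data rows for 2292 distinct universities, then 47 rows are second or later values for some university. The query above lists every university with its count; the sketch below keeps only those with more than one value. The function name `dup_subjects` is hypothetical, and `node1` is assumed to be the first tab-separated column.

```shell
# dup_subjects FILE: print "node1<TAB>count" for every node1 that occurs
# on more than one data row, sorted by node1.
dup_subjects() {
  awk -F'\t' '
    NR > 1 { c[$1]++ }
    END { for (k in c) if (c[k] > 1) print k "\t" c[k] }
  ' "$1" | sort
}

# Toy file: Q1 has two made-up values, Q2 has one.
toy=$(mktemp)
printf 'node1\tlabel\tnode2\nQ1\tP571\t1900\nQ1\tP571\t1901\nQ2\tP571\t1950\n' > "$toy"
dup_subjects "$toy"   # prints Q1 and the count 2, tab-separated
rm -f "$toy"
```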

### Find unknown results in Wikidata database:
- Find all universities (already completed)

In [10]:
%%bash
kgtk query -i ../data/claims.tsv.gz -i ../data/subclass_of_university.tsv \
           --match 'c: (u)<-[p:P31]-(q), s: (u)-[]->()' \
           --return 'q, p.label, u' \
           -o ../data/university.tsv

- Eliminate universities that already have a founding year

In [11]:
%%bash
kgtk ifnotexists -i ../data/university.tsv \
                 --filter-on ../data/founding_year_of_university.tsv \
                 --input-keys node1 \
                 --filter-keys node1 \
                 -o ../data/university_wo_founding_year.tsv

In [12]:
%%bash
wc -l ../data/university_wo_founding_year.tsv

397 ../data/university_wo_founding_year.tsv
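
Used this way, `kgtk ifnotexists` works as an anti-join on `node1`: a row of the input survives only if its `node1` does not occur in the filter file. The same idea on plain TSV files can be sketched with awk; the function name `anti_join` is hypothetical, and the key is assumed to be the first column of both files.

```shell
# anti_join INPUT FILTER: keep the INPUT header and every INPUT row whose
# first column never appears as the first column of a FILTER data row.
anti_join() {
  awk -F'\t' '
    pass == 1 { if (FNR > 1) seen[$1] = 1; next }   # collect FILTER keys
    FNR == 1 || !($1 in seen)                        # header, or unseen key
  ' pass=1 "$2" pass=2 "$1"
}

# Toy data with made-up identifiers: only U2 has a founding year.
all=$(mktemp); known=$(mktemp)
printf 'node1\tlabel\tnode2\nU1\tP31\tC\nU2\tP31\tC\nU3\tP31\tC\n' > "$all"
printf 'node1\tlabel\tnode2\nU2\tP571\t1950\n' > "$known"
anti_join "$all" "$known"   # prints the header, then the U1 and U3 rows
rm -f "$all" "$known"
```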


### Count unknown results in Wikidata database:

In [13]:
%%bash
kgtk query -i ../data/university_wo_founding_year.tsv \
           --match '(p)-[]->()' \
           --return 'count(distinct p) as N'

N
395


# Step 2 Infer properties

Use the query results from the Wikidata database to infer which Wikidata infobox property carries the same information, and return the most frequent property.

In [38]:
%%bash
RESULTS=../data/founding_year_of_university.tsv
WIKI_INFO=../data/wikidata_infobox.tsv

kgtk query -i $RESULTS -i $WIKI_INFO \
           --match 'f: (q)-[]->(y), w: (q)-[p]->(v)' \
           --where 'kgtk_date_year(y) = "^" + kgtk_unstringify(v)' \
           --return 'p.label, count(v) as N' \
           --order-by 'N desc' \
           --limit 1

label	N
property:established	686
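
The inference is a vote: for every university with a known founding year, each infobox property whose value matches that year gets one hit, and the property with the most hits wins. The sketch below replays this on toy data with awk. It simplifies the comparison to plain equality of a year column, so it is not equivalent to the `--where` clause above; the function name `top_property`, the file layouts, and all identifiers are assumptions.

```shell
# top_property KNOWN INFOBOX
# KNOWN:   node1, label, year    (toy layout, an assumption)
# INFOBOX: node1, label, value   (toy layout, an assumption)
# Count, per infobox property, how often its value equals the known year
# of the same node1, then print the property with the highest count.
# With tied counts, which property is printed is unspecified.
top_property() {
  awk -F'\t' '
    pass == 1 { if (FNR > 1) year[$1, $3] = 1; next }
    FNR > 1 && (($1, $3) in year) { hits[$2]++ }
    END {
      best = ""; max = 0
      for (p in hits) if (hits[p] > max) { max = hits[p]; best = p }
      if (best != "") print best "\t" max
    }
  ' pass=1 "$1" pass=2 "$2"
}

known=$(mktemp); info=$(mktemp)
printf 'node1\tlabel\tnode2\nU1\tP571\t1900\nU2\tP571\t1950\nU3\tP571\t2000\n' > "$known"
printf 'node1\tlabel\tnode2\n' > "$info"
printf 'U1\tproperty:established\t1900\nU2\tproperty:established\t1950\n' >> "$info"
printf 'U3\tproperty:motto\t2000\nU3\tproperty:established\t1999\n' >> "$info"
top_property "$known" "$info"   # property:established wins with 2 hits
rm -f "$known" "$info"
```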


# Step 3 Run query in Wikidata infobox

For the universities that don't have a founding year, query the Wikidata infobox:

In [40]:
%%bash
WIKI_INFO=../data/wikidata_infobox.tsv
QUERY_FILE=../data/university_wo_founding_year.tsv
NEW_RESULTS=../data/new_founding_year_of_university.tsv

kgtk query -i $QUERY_FILE -i $WIKI_INFO \
           --match 'u: (q)-[]->(), w: (q)-[p]->(v)' \
           --where 'p.label = "property:established"' \
           --return 'distinct q, p.label, v' \
           -o $NEW_RESULTS

The results can also be printed directly instead of being written to a file:

In [50]:
%%bash
WIKI_INFO=../data/wikidata_infobox.tsv
# QUERY_FILE=../data/university.tsv
QUERY_FILE=../data/university_wo_founding_year.tsv
NEW_RESULTS=../data/new_founding_year_of_university.tsv

kgtk query -i $QUERY_FILE -i $WIKI_INFO \
           --match 'u: (q)-[]->(), w: (q)-[p]->(v)' \
           --where 'p.label = "property:established"' \
           --return 'distinct q, p.label, v'

node1	label	node2
Q26910836	property:established	nodemxZbyK2VRrGoaxfdLmyLxw-7539514
Q3550203	property:established	''@en
Q3550203	property:established	nodemxZbyK2VRrGoaxfdLmyLxw-7539726
Q3550203	property:established	nodemxZbyK2VRrGoaxfdLmyLxw-7539727
Q5443327	property:established	nodemxZbyK2VRrGoaxfdLmyLxw-3383155
Q55393546	property:established	nodemxZbyK2VRrGoaxfdLmyLxw-2045292


- Count rows of new findings:

In [41]:
%%bash
wc -l ../data/new_founding_year_of_university.tsv

7 ../data/new_founding_year_of_university.tsv


- Count unique universities of new findings:

In [42]:
%%bash
kgtk query -i ../data/new_founding_year_of_university.tsv \
           --match 'n: (p)-[]->()' \
           --return 'count(distinct p) as N'

N
4


# Step 4 Datatype distribution of new findings
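
The categories in this step can be told apart by the shape of the `node2` value: KGTK writes plain strings in double quotes, language-qualified strings as `'text'@lang`, dates with a leading `^`, and numbers bare, while everything else is a symbol such as a Qnode or an internal node identifier. The sketch below is a rough shape-based classifier, not a replacement for `kgtk_number` or `kgtk_lqstring`; the function name `classify` is hypothetical and the number test is deliberately loose.

```shell
# classify VALUE: rough datatype of a KGTK node2 value, judged by its shape.
classify() {
  case "$1" in
    \"*\")             echo string ;;     # "text"
    \'*\'@*)           echo lqstring ;;   # 'text'@lang, including ''@en
    \^*)               echo date ;;       # ^2009-06-07T00:00:00Z/11
    [0-9]*|[-+][0-9]*) echo number ;;     # anything starting like a number
    *)                 echo symbol ;;     # Qnodes and other identifiers
  esac
}

classify "''@en"                                # prints lqstring
classify 'nodemxZbyK2VRrGoaxfdLmyLxw-7539514'   # prints symbol
classify '"2009-06-07"'                         # prints string
```

In the notebook's data, the `symbol` bucket still has to be split further: identifiers that link to a `dbpedia:structured_value` are counted as structured literals, and the rest as Qnodes.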

### 1. Numbers:

- Rows:

In [43]:
%%bash
kgtk query -i ../data/new_founding_year_of_university.tsv \
           --match 'n: (p)-[]->(s)' \
           --where 'kgtk_number(s)' \
           -o ../data/numbers.tsv

wc -l ../data/numbers.tsv

1 ../data/numbers.tsv


- Unique universities:

In [44]:
%%bash
kgtk query -i ../data/numbers.tsv \
           --match '(p)-[]->()' \
           --return 'count(distinct p) as N'

N
0


- Duplicates (universities that have multiple numbers):

In [None]:
%%bash
kgtk query -i ../data/numbers.tsv \
           --match '(p)-[]->(s)' \
           --return 'p, count(s) as N' \
           --order-by 'N desc'

### 2. Structured literals:

- Rows:

In [45]:
%%bash
NEW_RESULTS=../data/new_founding_year_of_university.tsv
WIKI_INFO=../data/wikidata_infobox.tsv

kgtk query -i $NEW_RESULTS -i $WIKI_INFO \
           --match 'n: (q)-[p]->(s), w: (s)-[sv]->(v)' \
           --where 'NOT kgtk_lqstring(s) AND NOT kgtk_number(s) AND sv.label = "dbpedia:structured_value"' \
           --return 'q, p.label, v' \
           -o ../data/structured_literals.tsv

wc -l ../data/structured_literals.tsv

6 ../data/structured_literals.tsv


In [46]:
%%bash
head ../data/structured_literals.tsv

node1	label	node2
Q26910836	property:established	"2009-06-07"
Q3550203	property:established	"--02-20"
Q3550203	property:established	"--03-22"
Q5443327	property:established	"2016-03-17"
Q55393546	property:established	"2017-08-18"
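
Two of the five rows above, both for `Q3550203`, have the form `--MM-DD`, which is a month and day without a year, so they cannot supply a founding year. A possible follow-up filter, sketched below, keeps only rows whose value is a complete quoted `YYYY-MM-DD` date. The function name `full_dates` is hypothetical; the sample rows are copied from the output above.

```shell
# full_dates FILE: keep the header and the rows whose third column is a
# quoted YYYY-MM-DD string; rows such as "--02-20" (no year) are dropped.
full_dates() {
  awk -F'\t' 'NR == 1 || $3 ~ /^"[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]"$/' "$1"
}

toy=$(mktemp)
printf 'node1\tlabel\tnode2\n' > "$toy"
printf 'Q26910836\tproperty:established\t"2009-06-07"\n' >> "$toy"
printf 'Q3550203\tproperty:established\t"--02-20"\n' >> "$toy"
printf 'Q3550203\tproperty:established\t"--03-22"\n' >> "$toy"
printf 'Q5443327\tproperty:established\t"2016-03-17"\n' >> "$toy"
printf 'Q55393546\tproperty:established\t"2017-08-18"\n' >> "$toy"
full_dates "$toy"   # header plus the Q26910836, Q5443327 and Q55393546 rows
rm -f "$toy"
```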


- Unique universities:

In [47]:
%%bash
kgtk query -i ../data/structured_literals.tsv \
           --match '(p)-[]->()' \
           --return 'count(distinct p) as N'

N
4


- Duplicates:

In [52]:
%%bash
kgtk query -i ../data/structured_literals.tsv \
           --match '(p)-[]->(s)' \
           --return 'p, count(s) as N' \
           --order-by 'N desc'

node1	N
Q3550203	2
Q55393546	1
Q5443327	1
Q26910836	1


### 3. Strings:

#### All:

- Rows:

In [53]:
%%bash
NEW_RESULTS=../data/new_founding_year_of_university.tsv

kgtk query -i $NEW_RESULTS \
           --match 'n: (p)-[]->(s)' \
           --where 'kgtk_lqstring(s)' \
           -o ../data/strings.tsv

wc -l ../data/strings.tsv

2 ../data/strings.tsv


- Unique universities:

In [54]:
%%bash
kgtk query -i ../data/strings.tsv \
           --match '(p)-[]->()' \
           --return 'count(distinct p) as N'

N
1


- Duplicates:

In [55]:
%%bash
kgtk query -i ../data/strings.tsv \
           --match '(p)-[]->(s)' \
           --return 'p, count(s) as N' \
           --order-by 'N desc'

node1	N
Q3550203	1


#### Empty:

- Rows:

In [56]:
%%bash
NEW_RESULTS=../data/new_founding_year_of_university.tsv

kgtk query -i $NEW_RESULTS \
           --match 'n: (p)-[]->(s)' \
           --where 'kgtk_lqstring_text(s) = ""' \
           -o ../data/empty_strings.tsv

wc -l ../data/empty_strings.tsv

2 ../data/empty_strings.tsv


- Unique universities:

In [57]:
%%bash
kgtk query -i ../data/empty_strings.tsv \
           --match 'n: (p)-[]->()' \
           --return 'count(distinct p) as N'

N
1


#### Further check empty strings:

First we keep every row whose value is not an empty string:

In [58]:
%%bash
kgtk query -i ../data/new_founding_year_of_university.tsv \
           --match 'n: (p)-[]->(s)' \
           --where 'NOT kgtk_lqstring(s) OR kgtk_lqstring_text(s) != ""' \
           -o ../data/non_empty_strings.tsv

Then we drop the universities in the empty set that also have a founding-year value in the non-empty set, and count how many "pure" empty ones remain:

In [59]:
%%bash
kgtk ifnotexists -i ../data/empty_strings.tsv \
                 --filter-on ../data/non_empty_strings.tsv \
                 --input-keys node1 \
                 --filter-keys node1 \
                 -o ../data/pure_empty.tsv 

wc -l ../data/pure_empty.tsv

1 ../data/pure_empty.tsv


In [60]:
%%bash
kgtk query -i ../data/pure_empty.tsv \
           --match 'p: (q)-[]->()' \
           --return 'count(distinct q) as N'

N
0


### 4. Qnodes

- Rows:

In [61]:
%%bash
NEW_RESULTS=../data/new_founding_year_of_university.tsv

kgtk query -i $NEW_RESULTS \
           --match 'n:()-[]->(q)' \
           --where 'NOT kgtk_lqstring(q) AND NOT kgtk_number(q)' \
           -o ../data/nodes.tsv

kgtk ifnotexists -i ../data/nodes.tsv \
                 --filter-on ../data/structured_literals.tsv \
                 --input-keys node1 \
                 --filter-keys node1 \
                 -o ../data/qnodes.tsv 

wc -l ../data/qnodes.tsv

1 ../data/qnodes.tsv


- Unique universities:

In [62]:
%%bash
kgtk query -i ../data/qnodes.tsv \
           --match '(p)-[]->()' \
           --return 'count(distinct p) as N'

N
0


- Duplicates:

In [None]:
%%bash
kgtk query -i ../data/qnodes.tsv \
           --match '(p)-[]->(s)' \
           --return 'p, count(s) as N'