# Step 0 Set up `kgtk`
See `spouse_of_politician.ipynb` for the `kgtk` setup steps.

# Step 1 Reproduce a SPARQL query with `kgtk`

Example 1: Find the cost of movies:
SPARQL query:
```
SELECT DISTINCT ?movieLabel ?cost
WHERE
{
  ?movie wdt:P31 wd:Q11424 ;
         wdt:P577 ?publicationDate ;
         wdt:P2130 ?cost .
  FILTER(YEAR(?publicationDate) = 2020) .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
```
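where `P31` is "instance of", `Q11424` is "film", `P577` is "publication date", and `P2130` is "cost" (`P2142`, "box office", is not used in this query).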

### Define aliases and variables

In [None]:
%%bash
# Each %%bash cell runs in its own shell, so later cells redefine the variables they use.
# Database
CLAIMS=../data/claims.tsv.gz
WIKI_INFO=../data/wikidata_infobox.tsv
# Results
RESULTS=../data/movie_with_revenue.tsv
NEW_RESULTS=../data/new_movie_with_revenue.tsv
# Datatypes
NUMBERS=../data/numbers.tsv
STRINGS=../data/strings.tsv
STRUCTURED_LITERALS=../data/structured_literals.tsv
QNODES=../data/qnodes.tsv

### Main `kgtk` query:

In [10]:
%%bash
kgtk query -i ../data/claims.tsv.gz \
           --match 'c: (:Q11424)<-[:P31]-(q)-[p:P577]->(d)' \
           --where 'kgtk_date_year(d) = 2020' \
           --return 'q, p.label, d' \
           -o ../data/movie_in_2020.tsv

In [14]:
%%bash
kgtk query -i ../data/claims.tsv.gz \
           --match 'c: (:Q11424)<-[:P31]-(q)-[p:P577]->(d)' \
           --return 'distinct q, p.label, d' \
           -o ../data/movie.tsv

In [44]:
%%bash
wc -l ../data/movie.tsv

263671 ../data/movie.tsv


In [49]:
%%bash
kgtk query -i ../data/claims.tsv.gz -i ../data/movie.tsv \
           --match 'c: (q)-[p:P2130]->(cost), m: (q)-[]->()' \
           --return 'distinct q, p.label, kgtk_quantity_number_int(cost) as node2' \
           -o ../data/movie_with_revenue.tsv

### Count known results in Wikidata database:

Count movie-cost pairs, i.e. **rows** (subtract 1 from the result for the header line):

In [47]:
%%bash
wc -l ../data/movie_with_revenue.tsv

2741 ../data/movie_with_revenue.tsv
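
Because `wc -l` also counts the header line, the number of data rows is one less than the figure it reports. A minimal Python sketch of that adjustment, run on a small in-memory sample rather than the real file (the helper name `count_data_rows` and the sample rows are illustrative assumptions):

```python
import io

def count_data_rows(handle):
    """Count non-empty lines after the header of a KGTK-style TSV."""
    lines = [line for line in handle if line.strip()]
    return max(len(lines) - 1, 0)  # drop the header line

# Illustrative sample only; not taken from the real data.
sample = io.StringIO(
    "node1\tlabel\tnode2\n"
    "Q1\tP2130\t100\n"
    "Q2\tP2130\t250\n"
)
data_rows = count_data_rows(sample)
print(data_rows)
```

Applied to the output above, 2741 lines correspond to 2740 data rows.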


Count how many **unique movies** have a cost in Wikidata:

In [81]:
%%bash
RESULTS=../data/movie_with_revenue.tsv

kgtk query -i $RESULTS \
           --match '(p)-[]->()' \
           --return 'count(distinct p) as N'

N
2737


Duplicates:

In [None]:
%%bash
RESULTS=../data/movie_with_revenue.tsv

kgtk query -i $RESULTS \
           --match '(p)-[]->(s)' \
           --return 'p, count(s) as N' \
           --order-by 'N desc'
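
The grouping in the query above amounts to counting rows per `node1` and sorting the counts in descending order. A rough Python equivalent on made-up edges (the rows and the name `rows_per_movie` are assumptions for illustration, not real results):

```python
from collections import Counter

# Illustrative (node1, node2) pairs; not real data.
edges = [("Q1", "100"), ("Q2", "250"), ("Q1", "120"), ("Q3", "90"), ("Q1", "130")]

rows_per_movie = Counter(node1 for node1, _ in edges)
# most_common() sorts by count, largest first, like ORDER BY N DESC
ranked = rows_per_movie.most_common()
# Movies with more than one row are the duplicates of interest
duplicates = [(movie, n) for movie, n in ranked if n > 1]
print(duplicates)
```

Any movie with `N` greater than 1 contributes more rows than unique movies, which explains the gap between the row count and the unique-movie count.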

### Find unknown results in Wikidata database:
- Find all movies (already done above)
- Remove the movies that already have a cost

In [50]:
%%bash
kgtk ifnotexists -i ../data/movie.tsv \
                 --filter-on ../data/movie_with_revenue.tsv \
                 --input-keys node1 \
                 --filter-keys node1 \
                 -o ../data/movie_wo_revenue.tsv

In [51]:
%%bash
wc -l ../data/movie_wo_revenue.tsv

258293 ../data/movie_wo_revenue.tsv
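
Here `ifnotexists` acts as an anti-join: it keeps the rows of the input file whose `node1` value does not occur in the `node1` column of the filter file. A minimal Python sketch of the same idea on made-up rows (the helper `anti_join` and all sample values are hypothetical):

```python
def anti_join(input_rows, filter_rows, key="node1"):
    """Keep input rows whose key value is absent from the filter rows."""
    filter_keys = {row[key] for row in filter_rows}
    return [row for row in input_rows if row[key] not in filter_keys]

# Illustrative rows only; not taken from the real files.
all_movies = [
    {"node1": "Q1", "label": "P577", "node2": "2019"},
    {"node1": "Q2", "label": "P577", "node2": "2020"},
    {"node1": "Q2", "label": "P577", "node2": "2021"},
    {"node1": "Q3", "label": "P577", "node2": "2020"},
]
movies_with_cost = [{"node1": "Q2", "label": "P2130", "node2": "250"}]

movies_without_cost = anti_join(all_movies, movies_with_cost)
print([row["node1"] for row in movies_without_cost])
```

Because every row of a matched movie is dropped, the row count can fall by more than the number of matched movies.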


### Count unknown results in Wikidata database:

In [84]:
%%bash
kgtk query -i ../data/movie_wo_revenue.tsv \
           --match '(p)-[]->()' \
           --return 'count(distinct p) as N'

N
236590


# Step 2 Infer properties

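Use the movie-cost pairs already known in Wikidata to infer which Wikidata infobox property holds the same information: match each known cost against the infobox values of the same movie and return the property that matches most often.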

In [23]:
%%bash
RESULTS=../data/movie_with_revenue.tsv
WIKI_INFO=../data/wikidata_infobox.tsv

kgtk query -i $RESULTS -i $WIKI_INFO \
           --match 'm: (q)-[]->(cost), w: (q)-[p]->(v)' \
           --where 'kgtk_quantity_number_int(cost) = kgtk_quantity_number_int(v)' \
           --return 'p.label, count(v) as N' \
           --order-by 'N desc' \
           --limit 1

label	N
property:budget	15


In [52]:
%%bash
RESULTS=../data/movie_with_revenue.tsv
WIKI_INFO=../data/wikidata_infobox.tsv

kgtk query -i $RESULTS -i $WIKI_INFO \
           --match 'm: (q)-[]->(cost), w: (q)-[p]->(s)-[sv]->(v)' \
           --where 'sv.label = "dbpedia:structured_value" AND kgtk_quantity_number_int(cost) = kgtk_quantity_number_int(kgtk_unstringify(v))' \
           --return 'p.label, count(v) as N' \
           --order-by 'N desc' \
           --limit 1

label	N
property:budget	928
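
Both queries above follow the same pattern: join the known costs with the infobox edges of the same movie, keep the edges whose integer value equals the known cost, and count the matches per property. A minimal Python sketch of that pattern on made-up data (the movie IDs, the values, and the `property:gross` and `property:runtime` labels are illustrative assumptions; value parsing is reduced to `int()`):

```python
from collections import Counter

# Known costs per movie (illustrative values, not real data).
known_cost = {"Q1": 1000000, "Q2": 250000}

# Infobox edges as (movie, property, value); property names other than
# property:budget are hypothetical.
infobox = [
    ("Q1", "property:budget", 1000000.0),
    ("Q1", "property:gross", 5000000.0),
    ("Q2", "property:budget", 250000.0),
    ("Q2", "property:runtime", 90.0),
    ("Q3", "property:budget", 10.0),  # no known cost, so it cannot match
]

# Count, per property, how often its value equals the known cost.
matches = Counter(
    prop
    for movie, prop, value in infobox
    if movie in known_cost and int(value) == known_cost[movie]
)
best_property, n_matches = matches.most_common(1)[0]
print(best_property, n_matches)
```

The property with the highest match count is then used as the lookup key in the next step.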


# Step 3 Run query in Wikidata infobox

For the movies without a cost in Wikidata, look up `property:budget` in the Wikidata infobox:

In [55]:
%%bash
WIKI_INFO=../data/wikidata_infobox.tsv
QUERY_FILE=../data/movie_wo_revenue.tsv
NEW_RESULTS=../data/new_movie_with_revenue.tsv

kgtk query -i $QUERY_FILE -i $WIKI_INFO \
           --match 'm: (movie)-[]->(), w: (movie)-[p]->(revenue)' \
           --where 'p.label = "property:budget"' \
           --return 'distinct movie, p.label, revenue' \
           -o $NEW_RESULTS

- Count rows of new findings:

In [58]:
%%bash
wc -l ../data/new_movie_with_revenue.tsv

18512 ../data/new_movie_with_revenue.tsv


- Count unique movies of new findings:

In [59]:
%%bash
kgtk query -i ../data/new_movie_with_revenue.tsv \
           --match 'n: (p)-[]->()' \
           --return 'count(distinct p) as N'

N
17181


# Step 4 Datatype distribution of new findings

### 1. Numbers:

- Rows:

In [60]:
%%bash
kgtk query -i ../data/new_movie_with_revenue.tsv \
           --match 'n: (p)-[]->(s)' \
           --where 'kgtk_number(s)' \
           -o ../data/numbers.tsv

wc -l ../data/numbers.tsv

1493 ../data/numbers.tsv


- Unique movies:

In [62]:
%%bash
kgtk query -i ../data/numbers.tsv \
           --match '(p)-[]->()' \
           --return 'count(distinct p) as N'

N
1492


- Duplicates (movies that have multiple numbers):

In [None]:
%%bash
kgtk query -i ../data/numbers.tsv \
           --match '(p)-[]->(s)' \
           --return 'p, count(s) as N' \
           --order-by 'N desc'

### 2. Structured literals:

- Rows:

In [68]:
%%bash
NEW_RESULTS=../data/new_movie_with_revenue.tsv
WIKI_INFO=../data/wikidata_infobox.tsv

kgtk query -i $NEW_RESULTS -i $WIKI_INFO \
           --match 'n: (q)-[p]->(s), w: (s)-[sv]->(v)' \
           --where 'NOT kgtk_lqstring(s) AND NOT kgtk_number(s) AND sv.label = "dbpedia:structured_value"' \
           --return 'q, p.label, v' \
           -o ../data/structured_literals.tsv

wc -l ../data/structured_literals.tsv

13142 ../data/structured_literals.tsv


In [69]:
%%bash
head ../data/structured_literals.tsv

node1	label	node2
Q1000394	property:budget	"361000.0"
Q1001943	property:budget	"7100000.0"
Q1002100	property:budget	"4.5E7"
Q1002480	property:budget	"4500000.0"
Q1004392	property:budget	"1.5E7"
Q1004440	property:budget	"1.2E7"
Q1004531	property:budget	"2.0E7"
Q1004567	property:budget	"1.5E7"
Q1004657	property:budget	"2100000.0"
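
The structured values above are numbers wrapped in double quotes, some of them in scientific notation. A small sketch of turning them into integers for comparison, roughly the role that `kgtk_unstringify` combined with `kgtk_quantity_number_int` plays in Step 2 (the helper `budget_to_int` is hypothetical, and it only strips the surrounding quotes rather than handling KGTK's full escaping rules):

```python
def budget_to_int(value):
    """Strip surrounding double quotes and convert the number to an int."""
    text = value.strip()
    if len(text) >= 2 and text[0] == '"' and text[-1] == '"':
        text = text[1:-1]
    # float() accepts both plain and scientific notation
    return int(float(text))

# Values copied from the head output above.
samples = ['"361000.0"', '"4.5E7"', '"1.2E7"']
converted = [budget_to_int(v) for v in samples]
print(converted)
```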


- Unique movies:

In [71]:
%%bash
kgtk query -i ../data/structured_literals.tsv \
           --match '(p)-[]->()' \
           --return 'count(distinct p) as N'

N
13120


- Duplicates:

In [None]:
%%bash
kgtk query -i ../data/structured_literals.tsv \
           --match '(p)-[]->(s)' \
           --return 'p, count(s) as N' \
           --order-by 'N desc'

### 3. Strings:

#### All:

- Rows:

In [72]:
%%bash
NEW_RESULTS=../data/new_movie_with_revenue.tsv

kgtk query -i $NEW_RESULTS \
           --match 'n: (p)-[]->(s)' \
           --where 'kgtk_lqstring(s)' \
           -o ../data/strings.tsv

wc -l ../data/strings.tsv

3869 ../data/strings.tsv


- Unique movies:

In [73]:
%%bash
kgtk query -i ../data/strings.tsv \
           --match '(p)-[]->()' \
           --return 'count(distinct p) as N'

N
3657


- Duplicates:

In [None]:
%%bash
kgtk query -i ../data/strings.tsv \
           --match '(p)-[]->(s)' \
           --return 'p, count(s) as N' \
           --order-by 'N desc'

#### Empty:

- Rows:

In [76]:
%%bash
NEW_RESULTS=../data/new_movie_with_revenue.tsv

kgtk query -i $NEW_RESULTS \
           --match 'n: (p)-[]->(s)' \
           --where 'kgtk_lqstring_text(s) = ""' \
           -o ../data/empty_strings.tsv

wc -l ../data/empty_strings.tsv

407 ../data/empty_strings.tsv


- Unique movies:

In [77]:
%%bash
kgtk query -i ../data/empty_strings.tsv \
           --match 'n: (p)-[]->()' \
           --return 'count(distinct p) as N'

N
406


#### Further check empty strings:

First we collect all rows whose value is not an empty string:

In [92]:
%%bash
kgtk query -i ../data/new_movie_with_revenue.tsv \
           --match 'n: (p)-[]->(s)' \
           --where 'NOT kgtk_lqstring(s) OR kgtk_lqstring_text(s) != ""' \
           -o ../data/non_empty_strings.tsv

Then we drop the empty-string rows whose movie also has a non-empty value, which leaves the movies whose only value is an empty string (the "pure" empty ones):

In [93]:
%%bash
kgtk ifnotexists -i ../data/empty_strings.tsv \
                 --filter-on ../data/non_empty_strings.tsv \
                 --input-keys node1 \
                 --filter-keys node1 \
                 -o ../data/pure_empty.tsv 

wc -l ../data/pure_empty.tsv

10 ../data/pure_empty.tsv


In [94]:
%%bash
kgtk query -i ../data/pure_empty.tsv \
           --match 'p: (q)-[]->()' \
           --return 'count(distinct q) as N'

N
9


Check one of them by hand:

In [95]:
%%bash
head -n 2 ../data/pure_empty.tsv

node1	label	node2
Q16242907	property:budget	''@en


In [96]:
%%bash
kgtk query -i ../data/new_movie_with_revenue.tsv \
           --match '(q:Q16242907)-[p]->(v)' \
           --return 'q, p.label, v'

node1	label	node2
Q16242907	property:budget	''@en
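
A value such as `''@en` is a language-qualified string whose text part is empty. A minimal sketch of detecting that case with a regular expression, assuming the `'text'@lang` shape seen in the output above (the pattern is a simplification that ignores escaped quotes, and `lq_text` is an illustrative helper, not a KGTK function):

```python
import re

# Simplified pattern for 'text'@lang; escaping rules are not handled.
LQ_STRING = re.compile(
    r"^'(?P<text>.*)'@(?P<lang>[A-Za-z]+(?:-[A-Za-z0-9]+)*)$", re.DOTALL
)

def lq_text(value):
    """Return the text part of a language-qualified string, else None."""
    match = LQ_STRING.match(value)
    return match.group("text") if match else None

def is_empty_lq_string(value):
    return lq_text(value) == ""

print(is_empty_lq_string("''@en"))
print(is_empty_lq_string("'$1 million'@en"))
```

Values that are not language-qualified strings, such as plain numbers, yield `None` and are therefore never reported as empty.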


### 4. Qnodes:

- Rows:

In [78]:
%%bash
NEW_RESULTS=../data/new_movie_with_revenue.tsv

kgtk query -i $NEW_RESULTS \
           --match 'n:()-[]->(q)' \
           --where 'NOT kgtk_lqstring(q) AND NOT kgtk_number(q)' \
           -o ../data/nodes.tsv

kgtk ifnotexists -i ../data/nodes.tsv \
                 --filter-on ../data/structured_literals.tsv \
                 --input-keys node1 \
                 --filter-keys node1 \
                 -o ../data/qnodes.tsv 

wc -l ../data/qnodes.tsv

11 ../data/qnodes.tsv


- Unique movies:

In [79]:
%%bash
kgtk query -i ../data/qnodes.tsv \
           --match '(p)-[]->()' \
           --return 'count(distinct p) as N'

N
10


- Duplicates:

In [80]:
%%bash
kgtk query -i ../data/qnodes.tsv \
           --match '(p)-[]->(s)' \
           --return 'p, count(s) as N'

node1	N
Q10851237	1
Q21065428	1
Q2706071	1
Q4155584	1
Q4796595	1
Q48671706	1
Q4927579	1
Q5284662	1
Q7278388	1
Q7288987	1
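
As a consistency check, the four datatype buckets should together account for every row of the new findings. The sketch below uses the `wc -l` outputs reported above, each reduced by one for the header line, and assumes the buckets do not overlap:

```python
# Line counts as reported by wc -l above (header line included).
line_counts = {
    "numbers": 1493,
    "structured_literals": 13142,
    "strings": 3869,
    "qnodes": 11,
}
total_lines = 18512  # new_movie_with_revenue.tsv

# Subtract the header line from every file.
bucket_rows = {name: n - 1 for name, n in line_counts.items()}
total_rows = total_lines - 1

print(sum(bucket_rows.values()), total_rows)
```

The two numbers agree (18511 rows), which is consistent with the four buckets covering the new findings without overlap.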
