# Generating Useful Wikidata Files

This notebook generates files that contain derived data that is useful in many applications. The input to the notebook is the full Wikidata or a subset of Wikidata. It also works for arbitrary KGs as long as they follow the representation requirements of Wikidata:

- the *instance of* relation is represented using the `P31` property
- the *subclass of* relation is represented using the `P279` property
- every property declares a datatype, which must be one of the Wikidata datatypes.

Inputs:

- `claims_file`: contains all statements, which consist of edges `node1/label/node2` where `label` is a property in Wikidata (sitelinks, labels, aliases and descriptions are not in the claims file).
- `item_file`: the subset of the `claims_file` consisting of edges for properties of data type `wikibase-item`
- `label_file`, `alias_file` and `description_file`: contain labels, aliases and descriptions. It is assumed that these files contain the labels, aliases and descriptions of all nodes appearing in the claims file. Users may provide these files for specific languages only.

Outputs:

- **Instance of (P31):** `derived.P31.tsv.gz` contains all the `instance of (P31)` edges present in the claims file.
- **Subclass of (P279):** `derived.P279.tsv.gz` contains all the `subclass of (P279)` edges present in the claims file.
- **Is A (isa):** `derived.isa.tsv.gz` contains edges `node1/isa/node2` where either `node1/P31/node2` or `node1/P279/node2` is present in the claims file.
- **Closure of subclass of (P279star):** `derived.P279star.tsv.gz` contains edges `node1/P279star/node2` where `node2` is reachable from `node1` via zero or more hops using the `P279` property. Because zero hops are allowed, every class reaches itself, for example `Q44/P279star/Q44`. This file is useful when you want to find all the instances of a class, including instances of subclasses of the given class.
- **In/out degrees:** `metadata.out_degree.tsv.gz` contains the out degree of every node, and `metadata.in_degree.tsv.gz` contains the in degree of every node.
- **Pagerank:** `metadata.pagerank.directed.tsv.gz` contains the page rank computed on the directed graph, and `metadata.pagerank.undirected.tsv.gz` contains the page rank computed on the undirected graph.

### Batch Invocation
Example batch command. The second argument is a notebook where the output will be stored. You can load it to see progress.

```
papermill Wikidata\ Useful\ Files.ipynb useful-files.out.ipynb \
-p claims_file /Volumes/GoogleDrive/Shared\ drives/KGTK-public-graphs/wikidata-20200803-v4/all.tsv.gz \
-p label_file /Volumes/GoogleDrive/Shared\ drives/KGTK-public-graphs/wikidata-20200803-v4/part.label.en.tsv.gz \
-p item_file /Volumes/GoogleDrive/Shared\ drives/KGTK-public-graphs/wikidata-20200803-v4/part.wikibase-item.tsv.gz \
-p property_item_file /Volumes/GoogleDrive/Shared\ drives/KGTK-public-graphs/wikidata-20200803-v4/part.property.wikibase-item.tsv.gz \
-p output_path <local folder> \
-p output_folder useful_files_v4 \
-p temp_folder temp.useful_files_v4 \
-p delete_database no \
-p languages es,ru,zh-cn
```
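
Paths that contain spaces are easy to mis-escape in a shell command. One way around that, sketched below, is to build the argument list in Python and let `subprocess.run` pass it through unchanged. The parameter names come from the command above; the paths are placeholders, and the helper `papermill_command` is hypothetical, not part of papermill.

```python
import shlex

def papermill_command(notebook, output_notebook, parameters):
    """Build the argv list for a papermill run; each parameter becomes `-p name value`."""
    argv = ["papermill", notebook, output_notebook]
    for name, value in parameters.items():
        argv += ["-p", name, str(value)]
    return argv

# Placeholder values (assumptions); substitute your own paths.
params = {
    "claims_file": "/path/with spaces/all.tsv.gz",
    "output_folder": "useful_files_v4",
    "languages": "es,ru,zh-cn",
}
argv = papermill_command("Wikidata Useful Files.ipynb", "useful-files.out.ipynb", params)

# Print the equivalent shell command with quoting applied where needed.
print(shlex.join(argv))
# To launch it: subprocess.run(argv, check=True)
```

Because the arguments are passed as a list, no backslash escaping or trailing line continuations are needed.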

In [2]:
import io
import os
import subprocess
import sys

import numpy as np
import pandas as pd

import altair as alt 
from kgtk.configure_kgtk_notebooks import ConfigureKGTK

In [3]:
# Parameters
# input_path = "/Volumes/saggu-ssd/kgtk-tutorial-files/datasets/arnold"
input_path = None
output_path = "/Volumes/saggu-ssd"
kgtk_path = "/Users/amandeep/Github/kgtk"

graph_cache_path = None

project_name = "arnold-github-test"

languages = 'en'

files = 'claims,label,label_all,alias,alias_all,description,description_all,item'

compute_pagerank = True
compute_degrees = True
compute_hits = False
compute_table_linker_files = True

debug = "false"

In [4]:
files = files.split(',')

In [5]:

ck = ConfigureKGTK(files, kgtk_path=kgtk_path)
ck.configure_kgtk(input_graph_path=input_path,
                  output_path=output_path,
                  project_name=project_name,
                 graph_cache_path=graph_cache_path)

User home: /Users/amandeep
Current dir: /Users/amandeep/Github/kgtk/use-cases
KGTK dir: /Users/amandeep/Github/kgtk
Use-cases dir: /Users/amandeep/Github/kgtk/use-cases
--2021-10-13 16:53:03--  https://github.com/usc-isi-i2/kgtk-tutorial-files/raw/main/datasets/arnold/claims.tsv.gz
Resolving github.com (github.com)... 140.82.112.4
Connecting to github.com (github.com)|140.82.112.4|:443... connected.
HTTP request sent, awaiting response... 404 Not Found
2021-10-13 16:53:04 ERROR 404: Not Found.

--2021-10-13 16:53:04--  https://github.com/usc-isi-i2/kgtk-tutorial-files/raw/main/datasets/arnold/labels.en.tsv.gz
Resolving github.com (github.com)... 140.82.112.4
Connecting to github.com (github.com)|140.82.112.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/usc-isi-i2/kgtk-tutorial-files/main/datasets/arnold/labels.en.tsv.gz [following]
--2021-10-13 16:53:05--  https://raw.githubusercontent.com/usc-isi-i2/kgtk-tutorial-file

In [7]:
ck.print_env_variables()

OUT: /Volumes/saggu-ssd/arnold-github-test
kypher: kgtk query --graph-cache /Volumes/saggu-ssd/arnold-github-test/temp.arnold-github-test/wikidata.sqlite3.db
kgtk: kgtk
USE_CASES_DIR: /Users/amandeep/Github/kgtk/use-cases
GRAPH: /Users/amandeep/isi-kgtk-tutorial/input
EXAMPLES_DIR: /Users/amandeep/Github/kgtk/examples
STORE: /Volumes/saggu-ssd/arnold-github-test/temp.arnold-github-test/wikidata.sqlite3.db
TEMP: /Volumes/saggu-ssd/arnold-github-test/temp.arnold-github-test
claims: None/claims.tsv.gz
label: None/labels.en.tsv.gz
label_all: None/labels.tsv.gz
alias: None/aliases.en.tsv.gz
alias_all: None/aliases.tsv.gz
description: None/descriptions.en.tsv.gz
description_all: None/descriptions.tsv.gz
item: None/claims.wikibase-item.tsv.gz


In [8]:
ck.load_files_into_cache()

kgtk query --graph-cache /Volumes/saggu-ssd/arnold-github-test/temp.arnold-github-test/wikidata.sqlite3.db -i "None/claims.tsv.gz" --as claims  -i "None/labels.en.tsv.gz" --as label  -i "None/labels.tsv.gz" --as label_all  -i "None/aliases.en.tsv.gz" --as alias  -i "None/aliases.tsv.gz" --as alias_all  -i "None/descriptions.en.tsv.gz" --as description  -i "None/descriptions.tsv.gz" --as description_all  -i "None/claims.wikibase-item.tsv.gz" --as item  --limit 3
[Errno 2] No such file or directory: '/Users/amandeep/Github/kgtk/use-cases/None/claims.tsv.gz'



In [5]:
debug = debug.lower() == "true"
languages = languages.split(',')

### Preview the input files

It is always good practice to peek at the files to make sure the column headings are what we expect

In [8]:
!$kypher -i item --limit 10 | col 

node1	label	node2	id	node2;wikidatatype
P10	P31	Q18610173	P10-P31-Q18610173-85ef4d24-0	wikibase-item
P1000	P31	Q18608871	P1000-P31-Q18608871-093affb5-0	wikibase-item
P1001	P1855	Q11696	P1001-P1855-Q11696-cdbf391b-0	wikibase-item
P1001	P1855	Q12371988	P1001-P1855-Q12371988-12c10bc0-0	wikibase-item
P1001	P1855	Q181574 P1001-P1855-Q181574-7f428c9b-0	wikibase-item
P1001	P1855	Q29868931	P1001-P1855-Q29868931-76b67d84-0	wikibase-item
P1001	P1855	Q8901	P1001-P1855-Q8901-15be5b36-0	wikibase-item
P1001	P31	Q15720608	P1001-P31-Q15720608-deeedec9-0	wikibase-item
P1001	P31	Q22984026	P1001-P31-Q22984026-8beb0cfe-0	wikibase-item
P1001	P31	Q22997934	P1001-P31-Q22997934-1e5b1a96-0	wikibase-item
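
The same header check can be scripted when the notebook runs unattended. The sketch below reads only the first line of a gzipped TSV file and reports which expected columns are missing. The helper names and the tiny stand-in file are assumptions made for illustration; point `read_header` at a real input file instead.

```python
import gzip
import os
import tempfile

def read_header(path):
    """Return the column names from the first line of a gzipped TSV file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return f.readline().rstrip("\n").split("\t")

def missing_columns(path, required=("node1", "label", "node2")):
    """List the required columns that are absent from the file header."""
    header = read_header(path)
    return [column for column in required if column not in header]

# Stand-in file with the same header layout as the preview above.
tmp_dir = tempfile.mkdtemp()
sample = os.path.join(tmp_dir, "sample.tsv.gz")
with gzip.open(sample, "wt", encoding="utf-8") as f:
    f.write("node1\tlabel\tnode2\tid\n")
    f.write("P10\tP31\tQ18610173\tP10-P31-Q18610173-85ef4d24-0\n")

print(read_header(sample))
print(missing_columns(sample))
```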


Force creation of the index on the label column

In [9]:
!$kypher -i item -o - \
--match '(i)-[:P31]->(c)' \
--limit 5 \
| column -t -s $'\t' 

node1  label  node2      id                              node2;wikidatatype
P10    P31    Q18610173  P10-P31-Q18610173-85ef4d24-0    wikibase-item
P1000  P31    Q18608871  P1000-P31-Q18608871-093affb5-0  wikibase-item
P1001  P31    Q15720608  P1001-P31-Q15720608-deeedec9-0  wikibase-item
P1001  P31    Q22984026  P1001-P31-Q22984026-8beb0cfe-0  wikibase-item
P1001  P31    Q22997934  P1001-P31-Q22997934-1e5b1a96-0  wikibase-item


Force creation of the index on the node2 column

In [10]:
!$kypher -i item -o - \
--match '(i)-[r]->(:Q5)' \
--limit 5 \
| column -t -s $'\t' 

node1     label  node2  id                          node2;wikidatatype
P1424     P1855  Q5     P1424-P1855-Q5-47bdcd17-0   wikibase-item
P1552     P1855  Q5     P1552-P1855-Q5-53b667e4-0   wikibase-item
P5869     P1855  Q5     P5869-P1855-Q5-3a19317f-0   wikibase-item
Q1000048  P31    Q5     Q1000048-P31-Q5-f02d7495-0  wikibase-item
Q1000061  P31    Q5     Q1000061-P31-Q5-6d7f3e39-0  wikibase-item


### Count the number of edges

Counting takes a long time

In [11]:
if debug:
    !$kypher -i item \
    --match '()-[r]->()' \
    --return 'count(r) as count' \
    --limit 10

count
393716


### Get labels, aliases and descriptions for other languages

In [12]:
for lang in languages:
    cmd = f"$kypher -i label_all -o $OUT/labels.{lang}.tsv.gz --match '(n1)-[l:label]->(n2)' --where 'n2.kgtk_lqstring_lang_suffix = \"{lang}\"' --return 'n1, l.label, n2, l.id' "
    !{cmd}

In [13]:
for lang in languages:
    cmd = f"$kypher -i alias_all -o $OUT/aliases.{lang}.tsv.gz --match '(n1)-[l:alias]->(n2)' --where 'n2.kgtk_lqstring_lang_suffix = \"{lang}\"' --return 'n1, l.label, n2, l.id' "
    !{cmd}

In [14]:
for lang in languages:
    cmd = f"$kypher -i description_all -o $OUT/descriptions.{lang}.tsv.gz --match '(n1)-[l:description]->(n2)' --where 'n2.kgtk_lqstring_lang_suffix = \"{lang}\"' --return 'n1, l.label, n2, l.id' "
    !{cmd}
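
The three loops above keep only the edges whose `node2` carries the requested language tag. As a rough illustration of that filter outside Kypher, the sketch below extracts the tag from a language-qualified string such as `'Boston'@en` and filters a list of edges. The helper names and the toy rows are assumptions; the sketch does not reproduce how `kgtk_lqstring_lang_suffix` is implemented.

```python
def lang_suffix(value):
    """Return the language tag of a string like 'Boston'@en, or None if there is none."""
    if not value.startswith("'"):
        return None
    _text, separator, lang = value.rpartition("'@")
    return lang if separator else None

def filter_by_language(edges, lang):
    """Keep the (node1, label, node2) edges whose node2 has the given language tag."""
    return [edge for edge in edges if lang_suffix(edge[2]) == lang]

# Toy rows for illustration only.
edges = [
    ("Q100", "label", "'Boston'@en"),
    ("Q100", "label", "'Бостон'@ru"),
    ("Q100", "label", "'波士顿'@zh-cn"),
]

for lang in ["en", "zh-cn"]:
    print(lang, filter_by_language(edges, lang))
```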

### Create the P31 and P279 files

Create the `P31` file

In [15]:
!$kypher -i item -o $OUT/derived.P31.tsv.gz \
--match '(n1)-[l:P31]->(n2)' \
--return 'l, n1, l.label, n2' 

In [16]:
!zcat < $OUT/derived.P31.tsv.gz | head | col

id	node1	label	node2
P10-P31-Q18610173-85ef4d24-0	P10	P31	Q18610173
P1000-P31-Q18608871-093affb5-0	P1000	P31	Q18608871
P1001-P31-Q15720608-deeedec9-0	P1001	P31	Q15720608
P1001-P31-Q22984026-8beb0cfe-0	P1001	P31	Q22984026
P1001-P31-Q22997934-1e5b1a96-0	P1001	P31	Q22997934
P1001-P31-Q61719275-0ccc11a5-0	P1001	P31	Q61719275
P1001-P31-Q70564278-b92b04ba-0	P1001	P31	Q70564278
P1004-P31-Q19829908-6077b37d-0	P1004	P31	Q19829908
P1004-P31-Q24075706-ef209004-0	P1004	P31	Q24075706
zcat: error writing to output: Broken pipe


Create the P279 file

In [17]:
!$kypher -i item -o $OUT/derived.P279.tsv.gz \
    --match '(n1)-[l:P279]->(n2)' \
    --return 'l, n1, l.label, n2' 

In [18]:
!zcat < $OUT/derived.P279.tsv.gz | head | col

zcat: error writing to output: Broken pipe
id	node1	label	node2
Q100039327-P279-Q327333-539148f1-0	Q100039327	P279	Q327333
Q100052008-P279-Q100116222-d1597eca-0	Q100052008	P279	Q100116222
Q100052008-P279-Q27304565-f02464c8-0	Q100052008	P279	Q27304565
Q1000660-P279-Q125977-c41a8764-0	Q1000660	P279	Q125977
Q1000660-P279-Q2030545-c687deec-0	Q1000660	P279	Q2030545
Q1000976-P279-Q34187-7e54a7c0-0 Q1000976	P279	Q34187
Q1001059-P279-Q216200-82eda284-0	Q1001059	P279	Q216200
Q1001059-P279-Q234460-f0e0aefd-0	Q1001059	P279	Q234460
Q100116222-P279-Q15831457-afcc497b-0	Q100116222	P279	Q15831457


### Create the file that contains all nodes reachable via P279 starting from a node2 in P31 or a node1 in P279

First compute the roots

In [19]:
!$kypher -i $OUT/derived.P279.tsv.gz -o $TEMP/P279.n1.tsv.gz \
--match '(n1)-[l]->()' \
--return 'n1 as id' 

In [20]:
!$kypher -i $OUT/derived.P31.tsv.gz -o $TEMP/P31.n2.tsv.gz \
--match '()-[l]->(n2)' \
--return 'n2 as id' 

In [21]:
!$kgtk cat --mode NONE -i $TEMP/P31.n2.tsv.gz $TEMP/P279.n1.tsv.gz \
| gzip > $TEMP/P279.roots.1.tsv.gz

In [22]:
!$kgtk sort2 --mode NONE --column id -i $TEMP/P279.roots.1.tsv.gz \
| gzip > $TEMP/P279.roots.2.tsv.gz

We have lots of duplicates

In [23]:
!zcat < $TEMP/P279.roots.2.tsv.gz | head

id
Q100039327
Q100039327
Q100052008
Q100052008
Q1000660
Q1000660
Q1000976
Q1001059
Q1001059
zcat: error writing to output: Broken pipe


In [24]:
!$kgtk compact -i $TEMP/P279.roots.2.tsv.gz --mode NONE \
    --presorted \
    --columns id \
    -o $TEMP/P279.roots.tsv
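
The cat, sort and compact steps above amount to a set union: the roots are every `node2` of a `P31` edge plus every `node1` of a `P279` edge, with duplicates removed. A minimal in-memory sketch of that idea follows; the function name is hypothetical, and the sample edges are copied from the previews earlier in this notebook.

```python
def p279_roots(p31_edges, p279_edges):
    """Union of P31 targets and P279 sources, deduplicated and sorted."""
    roots = {n2 for _n1, _label, n2 in p31_edges}
    roots |= {n1 for n1, _label, _n2 in p279_edges}
    return sorted(roots)

p31_edges = [
    ("P10", "P31", "Q18610173"),
    ("P1000", "P31", "Q18608871"),
]
p279_edges = [
    ("Q100039327", "P279", "Q327333"),
    ("Q100052008", "P279", "Q100116222"),
    ("Q100052008", "P279", "Q27304565"),
]

print(p279_roots(p31_edges, p279_edges))
```

This only works when the edges fit in memory; the file-based pipeline above is the one that scales to the full Wikidata.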

Now we can invoke the reachable-nodes command

In [25]:
!$kgtk reachable-nodes \
    --rootfile $TEMP/P279.roots.tsv \
    --selflink \
    -i $OUT/derived.P279.tsv.gz \
| gzip > $TEMP/P279.reachable.tsv.gz

In [26]:
!zcat < $TEMP/P279.reachable.tsv.gz | head | col

zcat: error writing to output: Broken pipe
node1	label	node2
Q100039327	reachable	Q100039327
Q100039327	reachable	Q327333
Q100039327	reachable	Q43229
Q100039327	reachable	Q16334295
Q100039327	reachable	Q16334298
Q100039327	reachable	Q61961344
Q100039327	reachable	Q16887380
Q100039327	reachable	Q28813620
Q100039327	reachable	Q99527517
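
To make the "zero or more hops" semantics concrete, here is a small breadth-first sketch that computes, for each root, every node reachable along `P279` edges, including the root itself (the effect of `--selflink`). It illustrates the idea on toy data with made-up node names and is not the algorithm `reachable-nodes` uses internally.

```python
from collections import defaultdict, deque

def reachable_edges(edges, roots):
    """Return (root, 'P279star', node) for every node reachable from each root in zero or more hops."""
    children = defaultdict(list)
    for n1, n2 in edges:
        children[n1].append(n2)

    result = []
    for root in roots:
        seen = {root}  # zero hops: every root reaches itself
        queue = deque([root])
        while queue:
            node = queue.popleft()
            for nxt in children.get(node, []):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        result.extend((root, "P279star", node) for node in sorted(seen))
    return result

# Toy subclass edges (hypothetical names): A is a subclass of B, B of C, D of C.
toy_edges = [("A", "B"), ("B", "C"), ("D", "C")]
closure = reachable_edges(toy_edges, ["A", "D"])
for edge in closure:
    print(*edge, sep="\t")
```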


The reachable-nodes command produces edges labeled `reachable`, so we need one command to rename them.

In [27]:
!$kypher -i $TEMP/P279.reachable.tsv.gz -o $TEMP/P279star.1.tsv.gz \
--match '(n1)-[]->(n2)' \
--return 'n1, "P279star" as label, n2 as node2' 

Now sort the renamed edges so that duplicates can be removed in the next step

In [28]:
!$kgtk sort2 -i $TEMP/P279star.1.tsv.gz -o $TEMP/P279star.2.tsv.gz

Make sure there are no duplicates

In [29]:
!$kgtk compact --presorted -i $TEMP/P279star.2.tsv.gz -o $TEMP/P279star.3.tsv.gz

Add ids

In [30]:
!$kgtk add-id --id-style node1-label-node2-num -i $TEMP/P279star.3.tsv.gz -o $OUT/derived.P279star.tsv.gz

In [31]:
!zcat < $OUT/derived.P279star.tsv.gz | head | col

zcat: error writing to output: Broken pipe
node1	label	node2	id
Q100039327	P279star	Q100039327	Q100039327-P279star-Q100039327-0000
Q100039327	P279star	Q16334295	Q100039327-P279star-Q16334295-0000
Q100039327	P279star	Q16334298	Q100039327-P279star-Q16334298-0000
Q100039327	P279star	Q16887380	Q100039327-P279star-Q16887380-0000
Q100039327	P279star	Q16889133	Q100039327-P279star-Q16889133-0000
Q100039327	P279star	Q23958946	Q100039327-P279star-Q23958946-0000
Q100039327	P279star	Q24229398	Q100039327-P279star-Q24229398-0000
Q100039327	P279star	Q26720107	Q100039327-P279star-Q26720107-0000
Q100039327	P279star	Q28813620	Q100039327-P279star-Q28813620-0000


This is how we would do the typical `?item P31/P279* ?class` in Kypher. 
The example shows how to get all the counts of instances of subclasses of city (Q515).

In [32]:
if debug:
    !$kypher -i $OUT/derived.P31.tsv.gz -i $OUT/derived.P279star.tsv.gz -i label \
    --match 'P31: (n1)-[:P31]->(c), P279star: (c)-[]->(:Q515), label: (n1)-[:label]->(label), label: (c)-[:label]->(c_label)' \
    --return 'distinct c as class, count(c) as count, c_label as `class name`, n1 as instance, label as `label`' \
    --order-by 'count(c) desc, c, n1' \
    --limit 10 \
    | col

class	count	class name	instance	label
Q1549591	627	'big city'@en	Q100	'Boston'@en
Q1093829	470	'city of the United States'@en	Q100	'Boston'@en
Q515	438	'city'@en	Q1001887	'Ifrane'@en
Q1637706	224	'city with millions of inhabitants'@en	Q10127	'Tangerang'@en
Q42744322	77	'urban municipality of Germany'@en	Q1022	'Stuttgart'@en
Q21518270	55	'state or insular area capital in the United States'@en Q100	'Boston'@en
Q2264924	54	'port settlement'@en	Q10400	'Almería'@en
Q1266818	41	'independent city'@en	Q123766 'Charlottesville'@en
Q13218391	36	'charter city'@en	Q159260 'Santa Clara'@en
Q13539802	33	'place with town rights and privileges'@en	Q131128 'Braunau am Inn'@en
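
The query above joins `P31` with `P279star` and groups by class. The sketch below shows the same join on toy data: first collect every class that reaches the target through `P279star`, then count the `P31` edges that point at one of those classes. All identifiers here are invented placeholders rather than real Q-nodes, and the function name is hypothetical.

```python
from collections import Counter

def instances_per_class(p31_edges, p279star_edges, target):
    """Count instances per class, for classes that reach `target` via P279star."""
    below_target = {n1 for n1, n2 in p279star_edges if n2 == target}
    return Counter(cls for _instance, cls in p31_edges if cls in below_target)

# Toy data: (class, ancestor) pairs, including the zero-hop self links.
p279star_edges = [
    ("city", "city"),
    ("big_city", "big_city"),
    ("big_city", "city"),
    ("river", "river"),
]
# Toy data: (instance, class) pairs.
p31_edges = [
    ("x1", "big_city"),
    ("x2", "big_city"),
    ("x3", "city"),
    ("x4", "river"),
]

counts = instances_per_class(p31_edges, p279star_edges, "city")
print(counts.most_common())
```

The self links in `P279star` are what make direct instances of the target class show up alongside instances of its subclasses.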


Illustrate that it is indeed `P279*`

In [33]:
if debug:
    !$kypher -i $OUT/derived.P31.tsv.gz -i $OUT/derived.P279star.tsv.gz -i label \
    --match 'P31: (n1)-[:P31]->(c), P279star: (c)-[]->(:Q63440326), label: (n1)-[:label]->(label), label: (c)-[:label]->(c_label)' \
    --return 'distinct c as class, c_label as `class name`, n1 as instance, label as `label`' \
    --order-by 'c, n1' \
    --limit 10 \
    | col 

class	class name	instance	label
Q63440326	'city of Oregon'@en	Q1065556	'Gold Beach'@en
Q63440326	'city of Oregon'@en	Q171224 'Eugene'@en
Q63440326	'city of Oregon'@en	Q43919	'Salem'@en
Q63440326	'city of Oregon'@en	Q6106	'Portland'@en
Q63440326	'city of Oregon'@en	Q846170 'Roseburg'@en
Q63440326	'city of Oregon'@en	Q849596 'Oregon City'@en


### Create a file to do generalized Is-A queries
The idea is that `(n1)-[:isa]->(n2)` when `(n1)-[:P31]->(n2)` or `(n1)-[:P279]->(n2)`

We do this by concatenating the files and renaming the relation

In [34]:
!$kgtk cat -i $OUT/derived.P31.tsv.gz $OUT/derived.P279.tsv.gz \
-o $TEMP/isa.1.tsv.gz

In [35]:
!$kypher -i $TEMP/isa.1.tsv.gz -o $OUT/derived.isa.tsv.gz \
--match '(n1)-[]->(n2)' \
--return 'n1, "isa" as label, n2' 

Example of how to use the `isa` relation

In [36]:
if debug:
    !$kypher -i $OUT/derived.isa.tsv.gz -i $OUT/derived.P279star.tsv.gz -i label -o - \
    --match 'isa: (n1)-[l:isa]->(c), P279star: (c)-[]->(:Q44), label: (n1)-[:label]->(label)' \
    --return 'distinct n1, l.label, "Q44" as node2, label as n1_label' \
    --limit 10 \
    | col

node1	label	node2	n1_label


### Create files with `isa/P279*` and `P31/P279*`
These files are useful to find all nodes that are below a q-node via P279 or isa.

> These files are very large and take many hours to compute

In [37]:
os.environ['P279STAR'] = f"{os.environ['OUT']}/derived.P279star.tsv.gz"
os.environ['ISA'] = f"{os.environ['OUT']}/derived.isa.tsv.gz"

In [38]:
!$kypher -i "$P279STAR" --as P279star -i "$ISA" --as isa  \
--match '\
  isa: (n1)-[]->(n2), \
  P279star: (n2)-[]->(n3)' \
--return 'distinct n1 as node1, "isa_star" as label, n3 as node2' \
-o "$TEMP"/derived.isastar_1.tsv.gz
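
The composition performed by this query can be pictured on a toy graph: relabel `P31` and `P279` edges as `isa`, then extend each `isa` edge with every ancestor of its target in `P279star`. The sketch below is an in-memory illustration with invented node names and hypothetical helper names; it would not scale to the full graph.

```python
def make_isa(p31_edges, p279_edges):
    """Relabel every P31 and P279 edge as an isa edge."""
    return [(n1, "isa", n2) for n1, _label, n2 in [*p31_edges, *p279_edges]]

def make_isa_star(isa_edges, p279star_edges):
    """n1 isa n2 together with n2 P279star n3 yields n1 isa_star n3 (deduplicated)."""
    ancestors = {}
    for n2, _label, n3 in p279star_edges:
        ancestors.setdefault(n2, set()).add(n3)
    result = {
        (n1, "isa_star", n3)
        for n1, _label, n2 in isa_edges
        for n3 in ancestors.get(n2, ())
    }
    return sorted(result)

# Toy data with made-up names.
p31 = [("rex", "P31", "dog")]
p279 = [("dog", "P279", "mammal"), ("mammal", "P279", "animal")]
p279star = [
    ("dog", "P279star", "dog"), ("dog", "P279star", "mammal"), ("dog", "P279star", "animal"),
    ("mammal", "P279star", "mammal"), ("mammal", "P279star", "animal"),
    ("animal", "P279star", "animal"),
]

isa = make_isa(p31, p279)
isa_star = make_isa_star(isa, p279star)
for edge in isa_star:
    print(*edge, sep="\t")
```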

Now add ids and sort it

In [39]:
!$kgtk add-id --id-style wikidata -i "$TEMP"/derived.isastar_1.tsv.gz \
/ sort2 -o "$OUT"/derived.isastar.tsv.gz

Also calculate the same file for P31/P279*

In [40]:
!$kypher -i item -i P279star \
--match '\
  item: (n1)-[:P31]->(n2), \
  P279star: (n2)-[]->(n3)' \
--return 'distinct n1 as node1, "P31P279star" as label, n3 as node2' \
-o "$TEMP"/derived.P31P279star.gz

Add ids and sort it

In [41]:
!$kgtk add-id --id-style wikidata -i "$TEMP"/derived.P31P279star.gz \
/ sort2 -o "$OUT"/derived.P31P279star.tsv.gz

It is also very big

In [42]:
if debug:
    !zcat < "$OUT"/derived.P31P279star.tsv.gz | wc

 1704159 6816636 100144221


## Compute pagerank

Now compute pagerank. These commands will exceed 16GB memory for graphs containing over 25 million nodes.

In [43]:
if compute_pagerank:
    !$kgtk graph-statistics -i "$item" -o $OUT/metadata.pagerank.directed.tsv.gz \
    --compute-pagerank True \
    --compute-hits False \
    --page-rank-property Pdirected_pagerank \
    --output-degrees True \
    --output-pagerank True \
    --output-hits False \
    --output-statistics-only \
    --undirected False \
    --log-file $TEMP/metadata.pagerank.directed.summary.txt 


	Using the fallback 'C' locale.


In [44]:
if compute_pagerank:
    !cat $TEMP/metadata.pagerank.directed.summary.txt

graph loaded! It has 66014 nodes and 393716 edges

*** Top relations:
P31	76899
P17	30213
P47	29116
P279	21917
P131	13895
P1889	13443
P106	11174
P1411	10914
P166	10452
P21	10171

*** Degrees:
in degree stats: mean=5.964129, std=0.363567, max=1
out degree stats: mean=5.964129, std=0.043517, max=1
total degree stats: mean=11.928258, std=0.380512, max=1

*** PageRank
Max pageranks
26562	Q23958852	0.071410
42551	Q23960977	0.032866
14856	Q35120	0.028596
11192	Q151885	0.026957
439	Q5	0.012807


In [45]:
if compute_pagerank:
    !$kgtk graph-statistics -i "$item" -o $OUT/metadata.pagerank.undirected.tsv.gz \
    --compute-pagerank True \
    --compute-hits False \
    --page-rank-property Pundirected_pagerank \
    --output-degrees True \
    --output-pagerank True \
    --output-hits False \
    --output-statistics-only \
    --undirected True \
    --log-file $TEMP/metadata.pagerank.undirected.summary.txt 


	Using the fallback 'C' locale.


In [46]:
if compute_pagerank:
    !cat $TEMP/metadata.pagerank.undirected.summary.txt 

graph loaded! It has 66014 nodes and 393716 edges

*** Top relations:
P31	76899
P17	30213
P47	29116
P279	21917
P131	13895
P1889	13443
P106	11174
P1411	10914
P166	10452
P21	10171

*** Degrees:
in degree stats: mean=0.000000, std=0.000000, max=1
out degree stats: mean=11.928258, std=0.380512, max=1
total degree stats: mean=11.928258, std=0.380512, max=1

*** PageRank
Max pageranks
439	Q5	0.022010
173	Q30	0.012919
4782	Q6581097	0.008353
7097	Q15221623	0.004738
1391	Q1860	0.004441
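
The directed and undirected runs rank different nodes at the top. To show why the `--undirected` flag matters, here is a generic power-iteration PageRank on a toy graph. It is a textbook sketch, not what `graph-statistics` runs internally; the damping factor of 0.85, the iteration count and the node names are assumptions.

```python
def pagerank(edges, damping=0.85, iterations=100, undirected=False):
    """Power-iteration PageRank over (source, target) pairs; returns {node: rank}."""
    edges = list(edges)
    if undirected:
        # Treat every edge as traversable in both directions.
        edges += [(target, source) for source, target in edges]

    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for source, target in edges:
        out_links[source].append(target)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without out-links is spread evenly over all nodes.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        new_rank = {node: (1.0 - damping) / n + damping * dangling / n for node in nodes}
        for node in nodes:
            if out_links[node]:
                share = damping * rank[node] / len(out_links[node])
                for target in out_links[node]:
                    new_rank[target] += share
        rank = new_rank
    return rank

toy = [("a", "hub"), ("b", "hub"), ("c", "hub"), ("hub", "a")]
directed = pagerank(toy)
undirected = pagerank(toy, undirected=True)
print(sorted(directed.items(), key=lambda item: -item[1]))
print(sorted(undirected.items(), key=lambda item: -item[1]))
```

In the directed case rank flows only along edge direction, so heavily referenced class nodes accumulate it; in the undirected case nodes with many edges in either direction benefit.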


## Compute Degrees

Kypher can compute the out degree by counting the node2s for each node1

In [47]:
!$kypher -i claims -o $TEMP/metadata.out_degree.tsv.gz \
--match '(n1)-[l]->()' \
--return 'distinct n1 as node1, count(distinct l) as node2, "Pout_degree" as label' 

In [48]:
!$kgtk add-id --id-style wikidata -i $TEMP/metadata.out_degree.tsv.gz \
/ sort2 -o $OUT/metadata.out_degree.tsv.gz

In [52]:
!zcat < $OUT/metadata.out_degree.tsv.gz | head | col

zcat: error writing to output: Broken pipe
node1	node2	label	id
P10	1	Pout_degree	P10-Pout_degree-6b86b2
P1000	1	Pout_degree	P1000-Pout_degree-6b86b2
P1001	13	Pout_degree	P1001-Pout_degree-3fdba3
P1004	5	Pout_degree	P1004-Pout_degree-ef2d12
P1005	3	Pout_degree	P1005-Pout_degree-4e0740
P1006	2	Pout_degree	P1006-Pout_degree-d4735e
P1007	2	Pout_degree	P1007-Pout_degree-d4735e
P101	13	Pout_degree	P101-Pout_degree-3fdba3
P1012	3	Pout_degree	P1012-Pout_degree-4e0740
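
The ids in this output end in a short hexadecimal suffix. For the rows shown, that suffix coincides with the first six hex digits of the SHA-256 hash of the `node2` value. The sketch below reproduces the pattern; it is an inference from the sample output, not a description of how `add-id --id-style wikidata` is implemented, and the helper name is hypothetical.

```python
import hashlib

def wikidata_style_id(node1, label, node2, width=6):
    """Build node1-label-<hash prefix>, hashing the node2 value with SHA-256."""
    digest = hashlib.sha256(node2.encode("utf-8")).hexdigest()
    return f"{node1}-{label}-{digest[:width]}"

# node1 and degree values taken from the output above.
for node1, degree in [("P10", "1"), ("P1006", "2"), ("P1005", "3")]:
    print(wikidata_style_id(node1, "Pout_degree", degree))
```

One consequence is that all nodes with the same degree share the same suffix, which is visible in the rows for `P10` and `P1000`.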


To count the in-degree we only care when the node2 is a wikibase-item

In [49]:
!$kypher -i claims -o $TEMP/metadata.in_degree.tsv.gz \
--match '()-[l]->(n2 {`wikidatatype`:"wikibase-item"})' \
--return 'distinct n2 as node1, count(distinct l) as node2, "Pin_degree" as label' \
--order-by 'n2'
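
The two degree queries differ in one respect: out-degree counts every edge leaving a node, while in-degree counts only edges whose `node2` is a `wikibase-item`. A minimal in-memory sketch of both counts is shown below; the dictionary keys, helper names and toy edges are assumptions made for illustration.

```python
from collections import Counter

def out_degree(edges):
    """Number of distinct edges leaving each node1."""
    distinct = {(edge["node1"], edge["id"]) for edge in edges}
    return Counter(node1 for node1, _edge_id in distinct)

def in_degree(edges):
    """Number of distinct edges arriving at each node2 of type wikibase-item."""
    distinct = {
        (edge["node2"], edge["id"])
        for edge in edges
        if edge["wikidatatype"] == "wikibase-item"
    }
    return Counter(node2 for node2, _edge_id in distinct)

# Toy edges with made-up identifiers.
toy_edges = [
    {"id": "e1", "node1": "A", "node2": "B", "wikidatatype": "wikibase-item"},
    {"id": "e2", "node1": "A", "node2": "C", "wikidatatype": "wikibase-item"},
    {"id": "e3", "node1": "B", "node2": "C", "wikidatatype": "wikibase-item"},
    {"id": "e4", "node1": "A", "node2": '"some text"', "wikidatatype": "string"},
]

print(out_degree(toy_edges))
print(in_degree(toy_edges))
```

Literal values such as strings are excluded from the in-degree because they are not nodes of the graph in their own right.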

In [50]:
!$kgtk add-id --id-style wikidata -i $TEMP/metadata.in_degree.tsv.gz \
/ sort2 -o $OUT/metadata.in_degree.tsv.gz

In [51]:
!zcat < $OUT/metadata.in_degree.tsv.gz | head | col

zcat: error writing to output: Broken pipe
node1	node2	label	id
Q100	168	Pin_degree	Q100-Pin_degree-80c3cd
Q1000	76	Pin_degree	Q1000-Pin_degree-f74efa
Q1000048	1	Pin_degree	Q1000048-Pin_degree-6b86b2
Q1000148	3	Pin_degree	Q1000148-Pin_degree-4e0740
Q100039327	1	Pin_degree	Q100039327-Pin_degree-6b86b2
Q100046246	1	Pin_degree	Q100046246-Pin_degree-6b86b2
Q100052008	1	Pin_degree	Q100052008-Pin_degree-6b86b2
Q100055982	1	Pin_degree	Q100055982-Pin_degree-6b86b2
Q100063122	1	Pin_degree	Q100063122-Pin_degree-6b86b2


Calculate the distribution so we can make a nice chart

In [53]:
!$kypher -i $OUT/metadata.in_degree.tsv.gz -o $OUT/statistics.in_degree.distribution.tsv \
--match '(n1)-[]->(n2)' \
--return 'distinct n2 as Pin_degree, count(distinct n1) as count, "count" as label' \
--order-by 'cast(n2, integer)' 

In [54]:
!head $OUT/statistics.in_degree.distribution.tsv | col

Pin_degree	count	label
1	16523	count
2	5740	count
3	3206	count
4	2123	count
5	1736	count
6	1414	count
7	1315	count
8	1120	count
9	851	count


In [55]:
!$kypher -i $OUT/metadata.out_degree.tsv.gz -o $OUT/statistics.out_degree.distribution.tsv \
--match '(n1)-[]->(n2)' \
--return 'distinct n2 as Pout_degree, count(distinct n1) as count, "count" as label' \
--order-by 'cast(n2, integer)' 
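
The distribution queries count how many nodes share each degree value and order the result numerically, which is why the degree is cast to an integer before sorting. The same computation in plain Python, as a sketch with a hypothetical helper name and toy input:

```python
from collections import Counter

def degree_distribution(degree_by_node):
    """Return (degree, number of nodes with that degree) pairs in numeric order."""
    return sorted(Counter(degree_by_node.values()).items())

# Toy degrees keyed by made-up node names.
degrees = {"a": 1, "b": 1, "c": 3, "d": 10, "e": 2}
distribution = degree_distribution(degrees)
print(distribution)

# Sorting the degrees as text would put "10" before "2".
print(sorted(str(degree) for degree in set(degrees.values())))
```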

Draw some charts

In [56]:
if debug:
    data = pd.read_csv(
        os.environ["OUT"] + "/statistics.in_degree.distribution.tsv", sep="\t"
    )

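    # The distribution file names its columns Pin_degree and count.
    chart = alt.Chart(data).mark_circle(size=60).encode(
        x=alt.X("Pin_degree", scale=alt.Scale(type="log"), title="in degree"),
        y=alt.Y("count", scale=alt.Scale(type="log"), title="count of nodes"),
        tooltip=["Pin_degree", "count"],
    ).interactive().properties(title="Distribution of In Degree")
    # An expression inside an if-block is not auto-displayed, so show it explicitly.
    display(chart)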

In [57]:
if debug:
    data = pd.read_csv(
        os.environ["OUT"] + "/statistics.out_degree.distribution.tsv", sep="\t"
    )

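    # The distribution file names its columns Pout_degree and count.
    chart = alt.Chart(data).mark_circle(size=60).encode(
        x=alt.X("Pout_degree", scale=alt.Scale(type="log"), title="out degree"),
        y=alt.Y("count", scale=alt.Scale(type="log"), title="count of nodes"),
        tooltip=["Pout_degree", "count"],
    ).interactive().properties(title="Distribution of Out Degree")
    # An expression inside an if-block is not auto-displayed, so show it explicitly.
    display(chart)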

## Summary of results

In [58]:
!ls -lh $OUT/*

-rw-r--r--  1 amandeep  staff    61B Oct 11 15:47 /Volumes/saggu-ssd/arnold-useful-files/aliases.de.tsv.gz
-rw-r--r--  1 amandeep  staff   1.3M Oct 11 15:47 /Volumes/saggu-ssd/arnold-useful-files/aliases.en.tsv.gz
-rw-r--r--  1 amandeep  staff    61B Oct 11 15:47 /Volumes/saggu-ssd/arnold-useful-files/aliases.es.tsv.gz
-rw-r--r--  1 amandeep  staff    61B Oct 11 15:47 /Volumes/saggu-ssd/arnold-useful-files/aliases.fr.tsv.gz
-rw-r--r--  1 amandeep  staff    61B Oct 11 15:47 /Volumes/saggu-ssd/arnold-useful-files/aliases.it.tsv.gz
-rw-r--r--  1 amandeep  staff    61B Oct 11 15:47 /Volumes/saggu-ssd/arnold-useful-files/aliases.nl.tsv.gz
-rw-r--r--  1 amandeep  staff    61B Oct 11 15:47 /Volumes/saggu-ssd/arnold-useful-files/aliases.pl.tsv.gz
-rw-r--r--  1 amandeep  staff    61B Oct 11 15:47 /Volumes/saggu-ssd/arnold-useful-files/aliases.pt.tsv.gz
-rw-r--r--  1 amandeep  staff    61B Oct 11 15:47 /Volumes/saggu-ssd/arnold-useful-files/aliases.ru.tsv.gz
-rw-r--r--  1 amandeep  staff    61B 

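Highest page rank. Note that the output below is ordered as text rather than as numbers (`9.998210185434762e-05` appears among the `e-06` values), so the `--order-by` clause probably needs a numeric cast, by analogy with the `cast(n2, integer)` used for the degree distributions.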

In [61]:
if debug:
    if compute_pagerank:
        !$kypher -i $OUT/metadata.pagerank.undirected.tsv.gz -i label \
        --match 'pagerank: (n1)-[:Pundirected_pagerank]->(page_rank), label: (n1)-[:label]->(label)' \
        --return 'distinct n1, label as label, page_rank as `undirected page rank`' \
        --order-by 'page_rank desc' \
        --limit 10 

node1	label	undirected page rank
Q5062876	'Centro Superior de Información de la Defensa'@en	9.999978201327167e-06
Q835831	'Mount Vernon'@en	9.999710615250981e-06
Q62302889	'art practice'@en	9.999306691384986e-06
Q12562330	'asymmetry property'@en	9.998677779801328e-06
Q23	'George Washington'@en	9.998210185434762e-05
Q42293667	'honorary doctor of Ben-Gurion University'@en	9.99804745885785e-06
Q608723	'Bristol Old Vic Theatre School'@en	9.997870819410728e-06
Q392316	'First Nations'@en	9.997432367947012e-06
Q55955335	'Mike Lowrey'@en	9.996814408119406e-06
Q23968798	'Eduard Sanjuán'@en	9.996765931174855e-06
