# Generating Useful Wikidata Files

This notebook generates files that contain derived data that is useful in many applications. The input to the notebook is the full Wikidata or a subset of Wikidata. It also works for arbitrary KGs as long as they follow the representation requirements of Wikidata:

- the *instance of* relation is represented using the `P31` property
- the *subclass of* relation is represented using the `P279` property
- every property declares a datatype, which must be one of the Wikidata datatypes.

Inputs:

- `claims_file`: contains all statements, which consist of edges `node1/label/node2` where `label` is a property in Wikidata (sitelinks, labels, aliases and descriptions are not in the claims file).
- `item_file`: the subset of the `claims_file` consisting of edges for properties of data type `wikibase-item`
- `label_file`, `alias_file` and `description_file`: contain labels, aliases and descriptions. It is assumed that these files contain the labels, aliases and descriptions of all nodes appearing in the claims file. Users may provide these files for specific languages only.

Outputs:

- **Instance of (P31):** `derived.P31.tsv.gz` contains all the `instance of (P31)` edges present in the claims file.
- **Subclass of (P279):** `derived.P279.tsv.gz` contains all the `subclass of (P279)` edges present in the claims file.
- **Is A (isa):** `derived.isa.tsv.gz` contains edges `node1/isa/node2` where either `node1/P31/node2` or `node1/P279/node2`
- **Closure of subclass of (P279star):** `derived.P279star.tsv.gz` contains edges `node1/P279star/node2` where `node2` is reachable from `node1` via zero or more hops using the `P279` property. Because zero hops are allowed, every node reaches itself, for example `Q44/P279star/Q44`. This file is useful when you want to find all the instances of a class, including instances of subclasses of the given class.
- **In/out degrees:** `metadata.out_degree.tsv.gz` contains the out degree of every node, and `metadata.in_degree.tsv.gz` contains the in degree of every node.
- **Pagerank:** outputs page rank on the directed graph in `metadata.pagerank.directed.tsv.gz` and page rank on the undirected graph in `metadata.pagerank.undirected.tsv.gz`.
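
To make the `P279star` semantics concrete, here is a minimal pure-Python sketch of a "zero or more hops" closure over a toy hierarchy. The function name `p279_star` and the identifiers `A`, `B`, `C` are hypothetical, and the sketch computes the closure for every node, whereas the notebook restricts the starting points to a set of roots. It is not how KGTK computes the file.

```python
from collections import defaultdict

def p279_star(edges):
    """Return all (node1, node2) pairs where node2 is reachable in zero or more hops."""
    parents = defaultdict(set)
    nodes = set()
    for child, parent in edges:
        parents[child].add(parent)
        nodes.update((child, parent))

    closure = set()
    for start in nodes:
        # Zero hops are allowed, so every node reaches itself.
        seen = {start}
        stack = [start]
        while stack:
            current = stack.pop()
            for parent in parents[current] - seen:
                seen.add(parent)
                stack.append(parent)
        closure.update((start, target) for target in seen)
    return closure

# Toy hierarchy: A is a subclass of B, and B is a subclass of C.
toy_edges = [("A", "B"), ("B", "C")]
closure = p279_star(toy_edges)
```

With this toy input, `closure` contains the self links `(A, A)`, `(B, B)`, `(C, C)` in addition to `(A, B)`, `(B, C)` and the two-hop pair `(A, C)`.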

### Batch Invocation
Example batch command. The second argument is a notebook where the output will be stored. You can load it to see progress.

```
papermill Wikidata\ Useful\ Files.ipynb useful-files.out.ipynb \
-p claims_file /Volumes/GoogleDrive/Shared\ drives/KGTK-public-graphs/wikidata-20200803-v4/all.tsv.gz \
-p label_file /Volumes/GoogleDrive/Shared\ drives/KGTK-public-graphs/wikidata-20200803-v4/part.label.en.tsv.gz \
-p item_file /Volumes/GoogleDrive/Shared\ drives/KGTK-public-graphs/wikidata-20200803-v4/part.wikibase-item.tsv.gz \
-p property_item_file /Volumes/GoogleDrive/Shared\ drives/KGTK-public-graphs/wikidata-20200803-v4/part.property.wikibase-item.tsv.gz \
-p output_path <local folder> \
-p output_folder useful_files_v4 \
-p temp_folder temp.useful_files_v4 \
-p delete_database no \
-p languages es,ru,zh-cn
```

In [24]:
import io
import os
import subprocess
import sys

import altair as alt  # needed by the degree distribution charts at the end of the notebook
import numpy as np
import pandas as pd
 
from kgtk.configure_kgtk_notebooks import ConfigureKGTK
from kgtk.functions import kgtk, kypher

In [38]:
# Parameters

input_path = "/Volumes/saggu-ssd/wikidata_import/wikidata-20211021/data"
output_path = "/Volumes/saggu-ssd/wikidata_import/wikidata-20211021"
kgtk_path = "/Users/amandeep/Github/kgtk"

graph_cache_path = None

project_name = "useful-files"

languages = 'en,ru,es,zh-cn,de,it,nl,pl,fr,pt,sv'

files = 'claims,label_all,alias_all,description_all'

compute_pagerank = True
compute_degrees = True
compute_hits = False
debug = False

In [8]:
files = files.split(',')
languages = languages.split(',')

In [9]:
ck = ConfigureKGTK(files, kgtk_path=kgtk_path)
ck.configure_kgtk(input_graph_path=input_path,
                  output_path=output_path,
                  project_name=project_name,
                 graph_cache_path=graph_cache_path)

User home: /Users/amandeep
Current dir: /Users/amandeep/Github/kgtk/use-cases
KGTK dir: /Users/amandeep/Github/kgtk
Use-cases dir: /Users/amandeep/Github/kgtk/use-cases


In [10]:
ck.print_env_variables()

STORE: /Volumes/saggu-ssd/wikidata_import/wikidata-20211021/useful-files/temp.useful-files/wikidata.sqlite3.db
KGTK_OPTION_DEBUG: false
USE_CASES_DIR: /Users/amandeep/Github/kgtk/use-cases
GRAPH: /Volumes/saggu-ssd/wikidata_import/wikidata-20211021/data
OUT: /Volumes/saggu-ssd/wikidata_import/wikidata-20211021/useful-files
kgtk: kgtk
TEMP: /Volumes/saggu-ssd/wikidata_import/wikidata-20211021/useful-files/temp.useful-files
EXAMPLES_DIR: /Users/amandeep/Github/kgtk/examples
KGTK_LABEL_FILE: /Volumes/saggu-ssd/wikidata_import/wikidata-20211021/data/labels.en.tsv.gz
KGTK_GRAPH_CACHE: /Volumes/saggu-ssd/wikidata_import/wikidata-20211021/useful-files/temp.useful-files/wikidata.sqlite3.db
kypher: kgtk query --graph-cache /Volumes/saggu-ssd/wikidata_import/wikidata-20211021/useful-files/temp.useful-files/wikidata.sqlite3.db
claims: /Volumes/saggu-ssd/wikidata_import/wikidata-20211021/data/claims.tsv.gz
label_all: /Volumes/saggu-ssd/wikidata_import/wikidata-20211021/data/labels.tsv.gz
alias_all

In [11]:
if graph_cache_path is None:
    ck.load_files_into_cache()

kgtk query --graph-cache /Volumes/saggu-ssd/wikidata_import/wikidata-20211021/useful-files/temp.useful-files/wikidata.sqlite3.db -i "/Volumes/saggu-ssd/wikidata_import/wikidata-20211021/data/claims.tsv.gz" --as claims  -i "/Volumes/saggu-ssd/wikidata_import/wikidata-20211021/data/labels.tsv.gz" --as label_all  -i "/Volumes/saggu-ssd/wikidata_import/wikidata-20211021/data/aliases.tsv.gz" --as alias_all  -i "/Volumes/saggu-ssd/wikidata_import/wikidata-20211021/data/descriptions.tsv.gz" --as description_all  --limit 3
id	node1	label	node2	rank	node2;wikidatatype
P10-P1628-32b85d-7927ece6-0	P10	P1628	"http://www.w3.org/2006/vcard/ns#Video"	normal	url
P10-P1628-acf60d-b8950832-0	P10	P1628	"https://schema.org/video"	normal	url
P10-P1629-Q34508-bcc39400-0	P10	P1629	Q34508	normal	wikibase-item


### Preview the input files

It is always good practice to peek at the files to make sure the column headings are what we expect.

In [12]:
!$kypher -i claims --limit 10 | col 

id	node1	label	node2	rank	node2;wikidatatype
P10-P1628-32b85d-7927ece6-0	P10	P1628	"http://www.w3.org/2006/vcard/ns#Video" normal	url
P10-P1628-acf60d-b8950832-0	P10	P1628	"https://schema.org/video"	normal	url
P10-P1629-Q34508-bcc39400-0	P10	P1629	Q34508	normal	wikibase-item
P10-P1630-53947a-fbe9093e-0	P10	P1630	"https://commons.wikimedia.org/wiki/File:$1"	normal	string
P10-P1659-P1651-c4068028-0	P10	P1659	P1651	normal	wikibase-property
P10-P1659-P18-5e4b9c4f-0	P10	P1659	P18	normal	wikibase-property
P10-P1659-P4238-d21d1ac0-0	P10	P1659	P4238	normal	wikibase-property
P10-P1659-P51-86aca4c5-0	P10	P1659	P51	normal	wikibase-property
P10-P1855-Q15075950-7eff6d65-0	P10	P1855	Q15075950	normal	wikibase-item
P10-P1855-Q4504-a69d2c73-0	P10	P1855	Q4504	normal	wikibase-item


Force creation of the index on the label column

In [13]:
!$kypher -i claims -o - \
--match '(i)-[:P31]->(c)' \
--limit 5 \
| column -t -s $'\t' 

id                                node1   label  node2       rank    node2;wikidatatype
P10-P31-Q18610173-85ef4d24-0      P10     P31    Q18610173   normal  wikibase-item
P1000-P31-Q18608871-093affb5-0    P1000   P31    Q18608871   normal  wikibase-item
P10000-P31-Q19833377-f87f0d4c-0   P10000  P31    Q19833377   normal  wikibase-item
P10000-P31-Q89560413-f555a944-0   P10000  P31    Q89560413   normal  wikibase-item
P10001-P31-Q107738007-c7725ce7-0  P10001  P31    Q107738007  normal  wikibase-item


Force creation of the index on the node2 column

In [14]:
!$kypher -i claims -o - \
--match '(i)-[r]->(:Q5)' \
--limit 5 \
| column -t -s $'\t' 

id                         node1  label  node2  rank    node2;wikidatatype
P1424-P1855-Q5-47bdcd17-0  P1424  P1855  Q5     normal  wikibase-item
P1552-P1855-Q5-53b667e4-0  P1552  P1855  Q5     normal  wikibase-item
P1963-P1855-Q5-1ba43aca-0  P1963  P1855  Q5     normal  wikibase-item
P3055-P1629-Q5-fb63cfeb-0  P3055  P1629  Q5     normal  wikibase-item
P5869-P1855-Q5-3a19317f-0  P5869  P1855  Q5     normal  wikibase-item


### Get labels, aliases and descriptions for other languages

In [12]:
for lang in languages:
    cmd = f"$kypher -i label_all -o $OUT/labels.{lang}.tsv.gz --match '(n1)-[l:label]->(n2)' --where 'n2.kgtk_lqstring_lang_suffix = \"{lang}\"' --return 'n1, l.label, n2, l.id' "
    !{cmd}

In [13]:
for lang in languages:
    cmd = f"$kypher -i alias_all -o $OUT/aliases.{lang}.tsv.gz --match '(n1)-[l:alias]->(n2)' --where 'n2.kgtk_lqstring_lang_suffix = \"{lang}\"' --return 'n1, l.label, n2, l.id' "
    !{cmd}

In [14]:
for lang in languages:
    cmd = f"$kypher -i description_all -o $OUT/descriptions.{lang}.tsv.gz --match '(n1)-[l:description]->(n2)' --where 'n2.kgtk_lqstring_lang_suffix = \"{lang}\"' --return 'n1, l.label, n2, l.id' "
    !{cmd}
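
The three loops above keep only the values whose language suffix matches `lang`. As a rough illustration of what that filter does, the sketch below extracts the suffix from strings shaped like `'Mount Vernon'@en`, the form that appears in the page rank listing at the end of this notebook. The helper `lang_suffix` and the sample rows are hypothetical; the real work is done by Kypher's `kgtk_lqstring_lang_suffix` property.

```python
def lang_suffix(value):
    """Return the language tag of a language-qualified string, or None if there is none."""
    # Assumed shape: a single-quoted text followed by @ and a language tag.
    if not value.startswith("'"):
        return None
    _text, separator, lang = value.rpartition("'@")
    return lang if separator else None

# Hypothetical label edges in node1/label/node2 order.
labels = [
    ("Q835831", "label", "'Mount Vernon'@en"),
    ("Q23", "label", "'George Washington'@en"),
    ("Q23", "label", "'Джордж Вашингтон'@ru"),
]

english = [edge for edge in labels if lang_suffix(edge[2]) == "en"]
```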

### Create the P31 and P279 files

Create the `P31` file

In [15]:
!$kypher -i claims -o $OUT/derived.P31.tsv.gz \
--match '(n1)-[l:P31]->(n2)' \
--return 'l, n1, l.label, n2' 

In [16]:
!zcat < $OUT/derived.P31.tsv.gz | head | col

id	node1	label	node2
P10-P31-Q18610173-85ef4d24-0	P10	P31	Q18610173
P1000-P31-Q18608871-093affb5-0	P1000	P31	Q18608871
P10000-P31-Q19833377-f87f0d4c-0 P10000	P31	Q19833377
P10000-P31-Q89560413-f555a944-0 P10000	P31	Q89560413
P10001-P31-Q107738007-c7725ce7-0	P10001	P31	Q107738007
P10001-P31-Q64221137-d154ffd9-0 P10001	P31	Q64221137
P10002-P31-Q93433126-dbd52b84-0 P10002	P31	Q93433126
P10003-P31-Q108914651-f3644858-0	P10003	P31	Q108914651
P10003-P31-Q42396390-7f1b5502-0 P10003	P31	Q42396390
zcat: error writing to output: Broken pipe


Create the P279 file

In [17]:
!$kypher -i claims -o $OUT/derived.P279.tsv.gz \
    --match '(n1)-[l:P279]->(n2)' \
    --return 'l, n1, l.label, n2' 

In [18]:
!zcat < $OUT/derived.P279.tsv.gz | head | col

id	node1	label	node2
Q100000030-P279-Q14748-30394205-0	Q100000030	P279	Q14748
Q100000058-P279-Q1622444-bd182663-0	Q100000058	P279	Q1622444
Q1000032-P279-Q1813494-0aa0f1dc-0	Q1000032	P279	Q1813494
Q1000032-P279-Q83602-482a1943-0 Q1000032	P279	Q83602
Q1000039-P279-Q11555767-2dddfd86-0	Q1000039	P279	Q11555767
Q100004761-P279-Q100095237-3971e1cd-0	Q100004761	P279	Q100095237
Q100004761-P279-Q126793-77b1fce8-0	Q100004761	P279	Q126793
Q100004761-P279-Q4544523-639fbe16-0	Q100004761	P279	Q4544523
Q1000064-P279-Q11016-0ab23344-0 Q1000064	P279	Q11016
zcat: error writing to output: Broken pipe


### Create the file that contains all nodes reachable via P279 starting from a node2 in P31 or a node1 in P279

First compute the roots

In [19]:
!$kypher -i $OUT/derived.P279.tsv.gz -o $TEMP/P279.n1.tsv.gz \
--match '(n1)-[l]->()' \
--return 'n1 as id' 

In [20]:
!$kypher -i $OUT/derived.P31.tsv.gz -o $TEMP/P31.n2.tsv.gz \
--match '()-[l]->(n2)' \
--return 'n2 as id' 

In [25]:
kgtk("""cat --mode NONE 
       -i $TEMP/P31.n2.tsv.gz
       -i $TEMP/P279.n1.tsv.gz
       -o $TEMP/P279.roots.1.tsv.gz""")

In [26]:
kgtk("""sort --mode NONE 
        --column id 
        -i $TEMP/P279.roots.1.tsv.gz 
        -o $TEMP/P279.roots.2.tsv.gz""")

We have lots of duplicates

In [27]:
!zcat < $TEMP/P279.roots.2.tsv.gz | head

id
Q100000030
Q100000058
Q1000017
Q1000032
Q1000032
Q1000039
Q100004761
Q100004761
Q100004761
zcat: error writing to output: Broken pipe


In [28]:
kgtk("""compact 
        -i $TEMP/P279.roots.2.tsv.gz 
        --mode NONE
        --presorted 
        --columns id
        -o $TEMP/P279.roots.tsv""")
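
For a single-column file that is already sorted, the compaction step amounts to dropping adjacent duplicates. A minimal sketch of that idea, using the ids from the listing above (the function name `compact_sorted` is hypothetical, and `kgtk compact` itself does more than this on multi-column files):

```python
from itertools import groupby

def compact_sorted(ids):
    """Drop adjacent duplicates from an already sorted list of ids."""
    return [key for key, _group in groupby(ids)]

sorted_ids = [
    "Q100000030", "Q100000058", "Q1000017",
    "Q1000032", "Q1000032", "Q1000039",
    "Q100004761", "Q100004761", "Q100004761",
]
roots = compact_sorted(sorted_ids)
```

Because `groupby` only merges neighbours, the input has to be sorted first, which is why the notebook runs `sort` before `compact --presorted`.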

Now we can invoke the reachable-nodes command

In [29]:
kgtk("""reachable-nodes
        --rootfile $TEMP/P279.roots.tsv
        --selflink 
        -i $OUT/derived.P279.tsv.gz
        -o $TEMP/P279.reachable.tsv.gz""")

In [30]:
!zcat < $TEMP/P279.reachable.tsv.gz | head | col

node1	label	node2
Q100000030	reachable	Q100000030
Q100000030	reachable	Q14748
Q100000030	reachable	Q14745
Q100000030	reachable	Q1357761
Q100000030	reachable	Q223557
Q100000030	reachable	Q35459920
Q100000030	reachable	Q488383
Q100000030	reachable	Q35120
Q100000030	reachable	Q4406616
zcat: error writing to output: Broken pipe


The reachable-nodes command produces edges labeled `reachable`, so we need one command to rename them.

In [33]:
!$kypher -i $TEMP/P279.reachable.tsv.gz -o $TEMP/P279star.1.tsv.gz \
--match '(n1)-[]->(n2)' \
--return 'distinct n1, "P279star" as label, n2 as node2' \
--order-by 'n1'

Add ids

In [34]:
!$kgtk add-id --id-style wikidata -i $TEMP/P279star.1.tsv.gz -o $OUT/derived.P279star.tsv.gz

In [35]:
!zcat < $OUT/derived.P279star.tsv.gz | head | col

node1	label	node2	id
Q100000030	P279star	Q100000030	Q100000030-P279star-Q100000030
Q100000030	P279star	Q14748	Q100000030-P279star-Q14748
Q100000030	P279star	Q14745	Q100000030-P279star-Q14745
Q100000030	P279star	Q1357761	Q100000030-P279star-Q1357761
Q100000030	P279star	Q223557 Q100000030-P279star-Q223557
Q100000030	P279star	Q35459920	Q100000030-P279star-Q35459920
Q100000030	P279star	Q488383 Q100000030-P279star-Q488383
Q100000030	P279star	Q35120	Q100000030-P279star-Q35120
Q100000030	P279star	Q4406616	Q100000030-P279star-Q4406616
zcat: error writing to output: Broken pipe


This is how we would do the typical `?item P31/P279* ?class` in Kypher. 
The example shows how to count the instances of city (Q515) and of its subclasses.

In [39]:
if debug:
    !$kypher -i $OUT/derived.P31.tsv.gz -i $OUT/derived.P279star.tsv.gz -i label \
    --match 'P31: (n1)-[:P31]->(c), P279star: (c)-[]->(:Q515), label: (n1)-[:label]->(label), label: (c)-[:label]->(c_label)' \
    --return 'distinct c as class, count(c) as count, c_label as `class name`, n1 as instance, label as `label`' \
    --order-by 'count(c) desc, c, n1' \
    --limit 10 \
    | col
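
The same `P31/P279*` pattern can be sketched in plain Python as a join between the `P31` edges and the closure pairs. All names and data below are hypothetical toy values; the closure set includes the self links that `derived.P279star.tsv.gz` also contains, which is what lets direct instances of the target class match.

```python
from collections import Counter

# Toy P31 edges as (instance, class) pairs.
p31 = [("paris", "city"), ("lyon", "city"), ("tokyo", "capital"), ("rex", "dog")]

# Toy P279star pairs as (class, ancestor), including self links.
p279star = {
    ("city", "city"),
    ("capital", "capital"),
    ("capital", "city"),
    ("dog", "dog"),
}

def instances_under(target, p31_edges, closure):
    """Return (instance, class) pairs whose class is the target or one of its subclasses."""
    classes = {cls for cls, ancestor in closure if ancestor == target}
    return [(instance, cls) for instance, cls in p31_edges if cls in classes]

matches = instances_under("city", p31, p279star)
counts = Counter(cls for _instance, cls in matches)
```

Here `tokyo` is found through the `capital` to `city` closure pair, while `rex` is excluded because `dog` never reaches `city`.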

Illustrate that it is indeed `P279*`

In [40]:
if debug:
    !$kypher -i $OUT/derived.P31.tsv.gz -i $OUT/derived.P279star.tsv.gz -i label \
    --match 'P31: (n1)-[:P31]->(c), P279star: (c)-[]->(:Q63440326), label: (n1)-[:label]->(label), label: (c)-[:label]->(c_label)' \
    --return 'distinct c as class, c_label as `class name`, n1 as instance, label as `label`' \
    --order-by 'c, n1' \
    --limit 10 \
    | col 

### Create a file to do generalized Is-A queries
The idea is that `(n1)-[:isa]->(n2)` when `(n1)-[:P31]->(n2)` or `(n1)-[:P279]->(n2)`

We do this by concatenating the files and renaming the relation

In [41]:
kgtk("""cat 
        -i $OUT/derived.P31.tsv.gz 
        -i $OUT/derived.P279.tsv.gz
        -o $TEMP/isa.1.tsv.gz""")

In [42]:
!$kypher -i $TEMP/isa.1.tsv.gz -o $OUT/derived.isa.tsv.gz \
--match '(n1)-[]->(n2)' \
--return 'n1, "isa" as label, n2' \
--order-by 'n1'
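
In plain Python terms, the two steps above are a concatenation followed by a relabel. A minimal sketch, using one `P31` edge and one `P279` edge taken from the listings earlier in this notebook (the function name `build_isa` is hypothetical):

```python
def build_isa(p31_edges, p279_edges):
    """Concatenate both edge lists, relabel every edge as isa, and order by node1."""
    merged = list(p31_edges) + list(p279_edges)
    relabeled = [(node1, "isa", node2) for node1, _label, node2 in merged]
    return sorted(relabeled, key=lambda edge: edge[0])

p31_edges = [("P10", "P31", "Q18610173")]
p279_edges = [("Q1000032", "P279", "Q1813494")]
isa_edges = build_isa(p31_edges, p279_edges)
```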

Example of how to use the `isa` relation

In [43]:
if debug:
    !$kypher -i $OUT/derived.isa.tsv.gz -i $OUT/derived.P279star.tsv.gz -i label -o - \
    --match 'isa: (n1)-[l:isa]->(c), P279star: (c)-[]->(:Q44), label: (n1)-[:label]->(label)' \
    --return 'distinct n1, l.label, "Q44" as node2, label as n1_label' \
    --limit 10 \
    | col

### Create files with `isa/P279*` and `P31/P279*`
These files are useful to find all nodes that are below a q-node via P279 or isa.

> These files are very large and take many hours to compute

In [44]:
os.environ['P279STAR'] = f"{os.environ['OUT']}/derived.P279star.tsv.gz"
os.environ['ISA'] = f"{os.environ['OUT']}/derived.isa.tsv.gz"

In [45]:
!$kypher -i "$P279STAR" --as P279star -i "$ISA" --as isa  \
--match '\
  isa: (n1)-[]->(n2), \
  P279star: (n2)-[]->(n3)' \
--return 'distinct n1 as node1, "isa_star" as label, n3 as node2' \
--order-by 'n1' \
-o "$TEMP"/derived.isastar_1.tsv.gz

database or disk is full



Now add ids

In [39]:
kgtk("""add-id 
        --id-style wikidata 
        -i "$TEMP"/derived.isastar_1.tsv.gz 
        -o "$OUT"/derived.isastar.tsv.gz""")

Also calculate the same file for P31/P279*

In [40]:
!$kypher -i item -i P279star \
--match '\
  item: (n1)-[:P31]->(n2), \
  P279star: (n2)-[]->(n3)' \
--return 'distinct n1 as node1, "P31P279star" as label, n3 as node2' \
--order-by 'n1' \
-o "$TEMP"/derived.P31P279star.gz

Add ids

In [41]:
kgtk("""add-id 
        --id-style wikidata 
        -i "$TEMP"/derived.P31P279star.gz 
        -o "$OUT"/derived.P31P279star.tsv.gz""")

It is also very big

In [42]:
if debug:
    !zcat < "$OUT"/derived.P31P279star.tsv.gz | wc

 1704159 6816636 100144221


## Compute pagerank

Now compute pagerank. These commands will exceed 16GB memory for graphs containing over 25 million nodes.
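
As background for the `--undirected` switch used below, here is a small power-iteration sketch of PageRank on a toy graph. It is only an illustration of the idea and makes several assumptions: the function name `pagerank`, the damping factor of 0.85, the fixed iteration count, and the toy edges are all hypothetical, and it is not the implementation behind `graph-statistics`.

```python
def pagerank(edges, undirected=False, damping=0.85, iterations=50):
    """Return a dict of node -> rank computed by plain power iteration."""
    if undirected:
        # Treat every edge as traversable in both directions.
        edges = edges + [(dst, src) for src, dst in edges]
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    count = len(nodes)
    rank = {node: 1.0 / count for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without out links is spread evenly over all nodes.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        base = (1.0 - damping) / count + damping * dangling / count
        new_rank = {node: base for node in nodes}
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank

toy = [("a", "c"), ("b", "c"), ("c", "d")]
directed = pagerank(toy)
undirected = pagerank(toy, undirected=True)
```

On this toy graph the directed ranking favours `d`, the end of the chain, while the undirected ranking favours `c`, the best connected node. That difference is the reason the notebook produces both files.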

In [43]:
if compute_pagerank:
    !$kgtk graph-statistics -i "$item" -o $OUT/metadata.pagerank.directed.tsv.gz \
    --compute-pagerank True \
    --compute-hits False \
    --page-rank-property Pdirected_pagerank \
    --output-degrees True \
    --output-pagerank True \
    --output-hits False \
    --output-statistics-only \
    --undirected False \
    --log-file $TEMP/metadata.pagerank.directed.summary.txt 


	Using the fallback 'C' locale.


In [44]:
if compute_pagerank:
    !cat $TEMP/metadata.pagerank.directed.summary.txt

graph loaded! It has 66014 nodes and 393716 edges

*** Top relations:
P31	76899
P17	30213
P47	29116
P279	21917
P131	13895
P1889	13443
P106	11174
P1411	10914
P166	10452
P21	10171

*** Degrees:
in degree stats: mean=5.964129, std=0.363567, max=1
out degree stats: mean=5.964129, std=0.043517, max=1
total degree stats: mean=11.928258, std=0.380512, max=1

*** PageRank
Max pageranks
26562	Q23958852	0.071410
42551	Q23960977	0.032866
14856	Q35120	0.028596
11192	Q151885	0.026957
439	Q5	0.012807


In [45]:
if compute_pagerank:
    !$kgtk graph-statistics -i "$item" -o $OUT/metadata.pagerank.undirected.tsv.gz \
    --compute-pagerank True \
    --compute-hits False \
    --page-rank-property Pundirected_pagerank \
    --output-degrees True \
    --output-pagerank True \
    --output-hits False \
    --output-statistics-only \
    --undirected True \
    --log-file $TEMP/metadata.pagerank.undirected.summary.txt 


	Using the fallback 'C' locale.


In [46]:
if compute_pagerank:
    !cat $TEMP/metadata.pagerank.undirected.summary.txt 

graph loaded! It has 66014 nodes and 393716 edges

*** Top relations:
P31	76899
P17	30213
P47	29116
P279	21917
P131	13895
P1889	13443
P106	11174
P1411	10914
P166	10452
P21	10171

*** Degrees:
in degree stats: mean=0.000000, std=0.000000, max=1
out degree stats: mean=11.928258, std=0.380512, max=1
total degree stats: mean=11.928258, std=0.380512, max=1

*** PageRank
Max pageranks
439	Q5	0.022010
173	Q30	0.012919
4782	Q6581097	0.008353
7097	Q15221623	0.004738
1391	Q1860	0.004441


## Compute Degrees

Kypher can compute the out degree by counting the node2s for each node1

In [47]:
!$kypher -i claims -o $TEMP/metadata.out_degree.tsv.gz \
--match '(n1)-[l]->()' \
--return 'distinct n1 as node1, count(distinct l) as node2, "Pout_degree" as label' 

In [48]:
!$kgtk add-id --id-style wikidata -i $TEMP/metadata.out_degree.tsv.gz \
/ sort2 -o $OUT/metadata.out_degree.tsv.gz

In [52]:
!zcat < $OUT/metadata.out_degree.tsv.gz | head | col

zcat: error writing to output: Broken pipe
node1	node2	label	id
P10	1	Pout_degree	P10-Pout_degree-6b86b2
P1000	1	Pout_degree	P1000-Pout_degree-6b86b2
P1001	13	Pout_degree	P1001-Pout_degree-3fdba3
P1004	5	Pout_degree	P1004-Pout_degree-ef2d12
P1005	3	Pout_degree	P1005-Pout_degree-4e0740
P1006	2	Pout_degree	P1006-Pout_degree-d4735e
P1007	2	Pout_degree	P1007-Pout_degree-d4735e
P101	13	Pout_degree	P101-Pout_degree-3fdba3
P1012	3	Pout_degree	P1012-Pout_degree-4e0740
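
The ids in the listing above appear to follow a simple pattern: `node1`, the label, and then a six-character suffix. The suffix `6b86b2` on the rows with value `1` matches the first six hex digits of the SHA-256 hash of the string `1`, while the `P279star` ids earlier in the notebook end with the `node2` identifier itself. The sketch below reproduces that pattern as inferred from these outputs. The function name, the `hash_width` parameter and the rule for deciding when to hash are assumptions, not KGTK's actual code.

```python
import hashlib
import re

def wikidata_style_id(node1, label, node2, hash_width=6):
    """Build an id shaped like the ones shown in this notebook's outputs."""
    if re.fullmatch(r"[PQ]\d+", node2):
        # Assumed rule: item and property identifiers are used verbatim.
        suffix = node2
    else:
        # Assumed rule: other values are replaced by a truncated SHA-256 hash.
        suffix = hashlib.sha256(node2.encode("utf-8")).hexdigest()[:hash_width]
    return f"{node1}-{label}-{suffix}"

degree_id = wikidata_style_id("P10", "Pout_degree", "1")
closure_id = wikidata_style_id("Q100000030", "P279star", "Q14748")
```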


To count the in-degree we only care when the node2 is a wikibase-item

In [49]:
!$kypher -i claims -o $TEMP/metadata.in_degree.tsv.gz \
--match '()-[l]->(n2 {`wikidatatype`:"wikibase-item"})' \
--return 'distinct n2 as node1, count(distinct l) as node2, "Pin_degree" as label' \
--order-by 'n2'
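
The two degree queries boil down to counting edges per `node1` and, for the in degree, counting edges per `node2` while skipping literal values. A minimal sketch with `collections.Counter`, using a handful of rows copied from the claims previews earlier in this notebook and reduced to `node1`, `label`, `node2` and datatype:

```python
from collections import Counter

claims = [
    ("P10", "P1628", '"https://schema.org/video"', "url"),
    ("P10", "P1629", "Q34508", "wikibase-item"),
    ("P10", "P1855", "Q4504", "wikibase-item"),
    ("P1424", "P1855", "Q5", "wikibase-item"),
    ("P1552", "P1855", "Q5", "wikibase-item"),
]

# Out degree: every edge counts for its node1, whatever the datatype of node2.
out_degree = Counter(node1 for node1, _label, _node2, _datatype in claims)

# In degree: only edges whose node2 is an item count; literals such as URLs are skipped.
in_degree = Counter(
    node2 for _node1, _label, node2, datatype in claims if datatype == "wikibase-item"
)
```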

In [50]:
!$kgtk add-id --id-style wikidata -i $TEMP/metadata.in_degree.tsv.gz \
/ sort2 -o $OUT/metadata.in_degree.tsv.gz

In [51]:
!zcat < $OUT/metadata.in_degree.tsv.gz | head | col

zcat: error writing to output: Broken pipe
node1	node2	label	id
Q100	168	Pin_degree	Q100-Pin_degree-80c3cd
Q1000	76	Pin_degree	Q1000-Pin_degree-f74efa
Q1000048	1	Pin_degree	Q1000048-Pin_degree-6b86b2
Q1000148	3	Pin_degree	Q1000148-Pin_degree-4e0740
Q100039327	1	Pin_degree	Q100039327-Pin_degree-6b86b2
Q100046246	1	Pin_degree	Q100046246-Pin_degree-6b86b2
Q100052008	1	Pin_degree	Q100052008-Pin_degree-6b86b2
Q100055982	1	Pin_degree	Q100055982-Pin_degree-6b86b2
Q100063122	1	Pin_degree	Q100063122-Pin_degree-6b86b2


Calculate the distribution so we can make a nice chart

In [53]:
!$kypher -i $OUT/metadata.in_degree.tsv.gz -o $OUT/statistics.in_degree.distribution.tsv \
--match '(n1)-[]->(n2)' \
--return 'distinct n2 as Pin_degree, count(distinct n1) as count, "count" as label' \
--order-by 'cast(n2, integer)' 
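
The distribution query groups nodes by their degree value and sorts the result numerically, which is why the `cast(n2, integer)` matters: degree values are stored as text, and text sorts `168` before `3`. A small sketch of the same grouping, using the first few in-degree rows from the listing above:

```python
from collections import Counter

# node -> in degree, kept as strings the way they appear in the TSV file.
in_degree = {
    "Q100": "168",
    "Q1000": "76",
    "Q1000048": "1",
    "Q1000148": "3",
    "Q100039327": "1",
}

distribution = Counter(in_degree.values())

# Sort by the integer value of the degree, not by its text form.
rows = sorted(distribution.items(), key=lambda item: int(item[0]))
text_order = sorted(distribution)
```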

In [54]:
!head $OUT/statistics.in_degree.distribution.tsv | col

Pin_degree	count	label
1	16523	count
2	5740	count
3	3206	count
4	2123	count
5	1736	count
6	1414	count
7	1315	count
8	1120	count
9	851	count


In [55]:
!$kypher -i $OUT/metadata.out_degree.tsv.gz -o $OUT/statistics.out_degree.distribution.tsv \
--match '(n1)-[]->(n2)' \
--return 'distinct n2 as Pout_degree, count(distinct n1) as count, "count" as label' \
--order-by 'cast(n2, integer)' 

Draw some charts

In [56]:
if debug:
    data = pd.read_csv(
        os.environ["OUT"] + "/statistics.in_degree.distribution.tsv", sep="\t"
    )

    alt.Chart(data).mark_circle(size=60).encode(
        # The distribution file names its degree column Pin_degree (see the header above).
        x=alt.X("Pin_degree", scale=alt.Scale(type="log")),
        y=alt.Y("count", scale=alt.Scale(type="log"), title="count of nodes"),
        tooltip=["Pin_degree", "count"],
    ).interactive().properties(title="Distribution of In Degree")

In [57]:
if debug:
    data = pd.read_csv(
        os.environ["OUT"] + "/statistics.out_degree.distribution.tsv", sep="\t"
    )

    alt.Chart(data).mark_circle(size=60).encode(
        # The distribution file names its degree column Pout_degree.
        x=alt.X("Pout_degree", scale=alt.Scale(type="log")),
        y=alt.Y("count", scale=alt.Scale(type="log"), title="count of nodes"),
        tooltip=["Pout_degree", "count"],
    ).interactive().properties(title="Distribution of Out Degree")

## Summary of results

In [58]:
!ls -lh $OUT/*

-rw-r--r--  1 amandeep  staff    61B Oct 11 15:47 /Volumes/saggu-ssd/arnold-useful-files/aliases.de.tsv.gz
-rw-r--r--  1 amandeep  staff   1.3M Oct 11 15:47 /Volumes/saggu-ssd/arnold-useful-files/aliases.en.tsv.gz
-rw-r--r--  1 amandeep  staff    61B Oct 11 15:47 /Volumes/saggu-ssd/arnold-useful-files/aliases.es.tsv.gz
-rw-r--r--  1 amandeep  staff    61B Oct 11 15:47 /Volumes/saggu-ssd/arnold-useful-files/aliases.fr.tsv.gz
-rw-r--r--  1 amandeep  staff    61B Oct 11 15:47 /Volumes/saggu-ssd/arnold-useful-files/aliases.it.tsv.gz
-rw-r--r--  1 amandeep  staff    61B Oct 11 15:47 /Volumes/saggu-ssd/arnold-useful-files/aliases.nl.tsv.gz
-rw-r--r--  1 amandeep  staff    61B Oct 11 15:47 /Volumes/saggu-ssd/arnold-useful-files/aliases.pl.tsv.gz
-rw-r--r--  1 amandeep  staff    61B Oct 11 15:47 /Volumes/saggu-ssd/arnold-useful-files/aliases.pt.tsv.gz
-rw-r--r--  1 amandeep  staff    61B Oct 11 15:47 /Volumes/saggu-ssd/arnold-useful-files/aliases.ru.tsv.gz
-rw-r--r--  1 amandeep  staff    61B 

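Highest page rank. Note that the listing below is ordered by the text form of `page_rank` rather than by its numeric value: the `e-05` value for Q23 sits between `e-06` values.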

In [61]:
if debug:
    if compute_pagerank:
        !$kypher -i $OUT/metadata.pagerank.undirected.tsv.gz -i label \
        --match 'pagerank: (n1)-[:Pundirected_pagerank]->(page_rank), label: (n1)-[:label]->(label)' \
        --return 'distinct n1, label as label, page_rank as `undirected page rank`' \
        --order-by 'page_rank desc' \
        --limit 10 

node1	label	undirected page rank
Q5062876	'Centro Superior de Información de la Defensa'@en	9.999978201327167e-06
Q835831	'Mount Vernon'@en	9.999710615250981e-06
Q62302889	'art practice'@en	9.999306691384986e-06
Q12562330	'asymmetry property'@en	9.998677779801328e-06
Q23	'George Washington'@en	9.998210185434762e-05
Q42293667	'honorary doctor of Ben-Gurion University'@en	9.99804745885785e-06
Q608723	'Bristol Old Vic Theatre School'@en	9.997870819410728e-06
Q392316	'First Nations'@en	9.997432367947012e-06
Q55955335	'Mike Lowrey'@en	9.996814408119406e-06
Q23968798	'Eduard Sanjuán'@en	9.996765931174855e-06


#### Move all the files to input folder
Should not do this, files should be left in output