# Generating Useful Wikidata Files

This notebook generates files that contain derived data that is useful in many applications. The input to the notebook is the full Wikidata or a subset of Wikidata. It also works for arbitrary KGs as long as they follow the representation requirements of Wikidata:

- the *instance of* relation is represented using the `P31` property
- the *subclass of* relation is represented using the `P279` property
- every property declares a datatype, which must be one of the Wikidata datatypes.

Inputs:

- `claims_file`: contains all statements, represented as edges `node1/label/node2` where `label` is a Wikidata property (sitelinks, labels, aliases and descriptions are not in the claims file).
- `item_file`: the subset of the `claims_file` consisting of edges for properties of data type `wikibase-item`
- `label_file`, `alias_file` and `description_file`: contain labels, aliases and descriptions. It is assumed that these files contain the labels, aliases and descriptions of all nodes appearing in the claims file. Users may provide these files for specific languages only.

Outputs:

- **Instance of (P31):** `derived.P31.tsv.gz` contains all the `instance of (P31)` edges present in the claims file.
- **Subclass of (P279):** `derived.P279.tsv.gz` contains all the `subclass of (P279)` edges present in the claims file.
- **Is A (isa):** `derived.isa.tsv.gz` contains edges `node1/isa/node2` where either `node1/P31/node2` or `node1/P279/node2` is present in the claims file.
- **Closure of subclass of (P279star):** `derived.P279star.tsv.gz` contains edges `node1/P279star/node2` where `node2` is reachable from `node1` via zero or more hops using the `P279` property. Because zero hops are allowed, the closure includes self-links, for example `Q44/P279star/Q44`. This file is useful when you want to find all the instances of a class, including instances of its subclasses.
- **In/out degrees:** `metadata.out_degree.tsv.gz` contains the out degree of every node, and `metadata.in_degree.tsv.gz` contains the in degree of every node.
- **PageRank:** `metadata.pagerank.directed.tsv.gz` contains the PageRank computed on the directed graph, and `metadata.pagerank.undirected.tsv.gz` contains the PageRank computed on the undirected graph.

### Batch Invocation
Example batch command. The second argument is a notebook where the output will be stored. You can load it to see progress.

```
papermill Wikidata\ Useful\ Files.ipynb useful-files.out.ipynb \
-p input_path /data/amandeep/wikidata-20211027-dwd-v3 \
-p output_path /data/amandeep/wikidata-20211027-dwd-v3 \
-p kgtk_path  /Users/amandeep/github/kgtk \
-p project_name useful-files \
-p languages en,es \
-p files claims,label_all,alias_all,description_all \
-p compute_pagerank True \
-p compute_degrees True
```
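The same invocation can also be assembled from Python, which sidesteps shell quoting of the spaces in the notebook name. The following is a minimal sketch using only the standard library; `build_papermill_command` is a hypothetical helper, not part of papermill, and it only relies on the `-p name value` form shown in the command above.

```python
import shlex


def build_papermill_command(notebook, output_notebook, parameters):
    """Return the argv list for a papermill run, with one `-p name value` triple per parameter."""
    argv = ["papermill", notebook, output_notebook]
    for name, value in parameters.items():
        argv += ["-p", name, str(value)]
    return argv


command = build_papermill_command(
    "Wikidata Useful Files.ipynb",
    "useful-files.out.ipynb",
    {
        "input_path": "/data/amandeep/wikidata-20211027-dwd-v3",
        "output_path": "/data/amandeep/wikidata-20211027-dwd-v3",
        "project_name": "useful-files",
        "languages": "en,es",
        "compute_pagerank": True,
    },
)

# shlex.join quotes the notebook name so the line can be pasted into a shell.
print(shlex.join(command))
```

The resulting list can be passed directly to `subprocess.run(command, check=True)`, in which case no quoting is needed at all.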

In [48]:
import io
import os
import subprocess
import sys

import numpy as np
import pandas as pd
 
from kgtk.configure_kgtk_notebooks import ConfigureKGTK
from kgtk.functions import kgtk, kypher

In [49]:
# Parameters

input_path = "/data/amandeep/wikidata-20211027-dwd-v3"
output_path = "/data/amandeep/wikidata-20211027-dwd-v3"
kgtk_path = "/Users/amandeep/github/kgtk"

graph_cache_path = None

project_name = "useful-files"

languages = 'en,ru,es,zh-cn,de,it,nl,pl,fr,pt,sv'

files = 'claims,label_all,alias_all,description_all'

compute_pagerank = True
compute_degrees = True
debug = False

In [50]:
files = files.split(',')
languages = languages.split(',')

In [51]:
ck = ConfigureKGTK(files, kgtk_path=kgtk_path)
ck.configure_kgtk(input_graph_path=input_path,
                  output_path=output_path,
                  project_name=project_name,
                 graph_cache_path=graph_cache_path)

User home: /nas/home/amandeep
Current dir: /data/amandeep/github/kgtk/use-cases
KGTK dir: /Users/amandeep/github/kgtk
Use-cases dir: /Users/amandeep/github/kgtk/use-cases


In [52]:
ck.print_env_variables()

KGTK_OPTION_DEBUG: false
TEMP: /data/amandeep/wikidata-20211027-dwd-v3/useful-files/temp.useful-files
EXAMPLES_DIR: /Users/amandeep/github/kgtk/examples
KGTK_LABEL_FILE: /data/amandeep/wikidata-20211027-dwd-v3/labels.en.tsv.gz
STORE: /data/amandeep/wikidata-20211027-dwd-v3/useful-files/temp.useful-files/wikidata.sqlite3.db
kgtk: kgtk
KGTK_GRAPH_CACHE: /data/amandeep/wikidata-20211027-dwd-v3/useful-files/temp.useful-files/wikidata.sqlite3.db
GRAPH: /data/amandeep/wikidata-20211027-dwd-v3
USE_CASES_DIR: /Users/amandeep/github/kgtk/use-cases
OUT: /data/amandeep/wikidata-20211027-dwd-v3/useful-files
kypher: kgtk query --graph-cache /data/amandeep/wikidata-20211027-dwd-v3/useful-files/temp.useful-files/wikidata.sqlite3.db
claims: /data/amandeep/wikidata-20211027-dwd-v3/claims.tsv.gz
label_all: /data/amandeep/wikidata-20211027-dwd-v3/labels.tsv.gz
alias_all: /data/amandeep/wikidata-20211027-dwd-v3/aliases.tsv.gz
description_all: /data/amandeep/wikidata-20211027-dwd-v3/descriptions.tsv.gz


In [53]:
if graph_cache_path is None:
    ck.load_files_into_cache()

kgtk query --graph-cache /data/amandeep/wikidata-20211027-dwd-v3/useful-files/temp.useful-files/wikidata.sqlite3.db -i "/data/amandeep/wikidata-20211027-dwd-v3/claims.tsv.gz" --as claims  -i "/data/amandeep/wikidata-20211027-dwd-v3/labels.tsv.gz" --as label_all  -i "/data/amandeep/wikidata-20211027-dwd-v3/aliases.tsv.gz" --as alias_all  -i "/data/amandeep/wikidata-20211027-dwd-v3/descriptions.tsv.gz" --as description_all  --limit 3
id	node1	label	node2	rank	node2;wikidatatype
P10-P1628-32b85d-7927ece6-0	P10	P1628	"http://www.w3.org/2006/vcard/ns#Video"	normal	url
P10-P1628-acf60d-b8950832-0	P10	P1628	"https://schema.org/video"	normal	url
P10-P1629-Q34508-bcc39400-0	P10	P1629	Q34508	normal	wikibase-item


### Preview the input files

It is always good practice to peek at the files to make sure the column headings are what we expect.

In [54]:
!$kypher -i claims --limit 10 | col 

/bin/bash: function: No such file or directory
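The header check can also be done without any shell tools. The sketch below is an illustration under stated assumptions, not part of KGTK: `read_header` and `missing_columns` are hypothetical helpers, and the demo file only mimics the claims header printed when the files were loaded into the cache.

```python
import csv
import gzip
import os
import tempfile


def read_header(path):
    """Return the column names from the first line of a gzipped TSV file."""
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        return next(csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE))


def missing_columns(path, required=("id", "node1", "label", "node2")):
    """Return the required column names that are absent from the file header."""
    header = read_header(path)
    return [name for name in required if name not in header]


# Self-contained demo on a tiny file with the same header as the claims file.
demo_path = os.path.join(tempfile.mkdtemp(), "claims.tsv.gz")
with gzip.open(demo_path, "wt", encoding="utf-8") as handle:
    handle.write("id\tnode1\tlabel\tnode2\trank\tnode2;wikidatatype\n")
    handle.write("P10-P1629-Q34508-bcc39400-0\tP10\tP1629\tQ34508\tnormal\twikibase-item\n")

print(read_header(demo_path))
print(missing_columns(demo_path))
```

In the notebook, the same functions could be pointed at `os.environ["claims"]` instead of the demo file.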



Force creation of the index on the label column

In [55]:
!$kypher -i claims -o - \
--match '(i)-[:P31]->(c)' \
--limit 5 \
| column -t -s $'\t' 

/bin/bash: function: No such file or directory


Force creation of the index on the node2 column

In [56]:
!$kypher -i claims -o - \
--match '(i)-[r]->(:Q5)' \
--limit 5 \
| column -t -s $'\t' 

/bin/bash: function: No such file or directory


### Create the P31 and P279 files

Create the `P31` file

In [57]:
!$kypher -i claims -o $OUT/derived.P31.tsv.gz \
--match '(n1)-[l:P31]->(n2)' \
--return 'l, n1, l.label, n2' 

In [58]:
!zcat < $OUT/derived.P31.tsv.gz | head | col

id	node1	label	node2
P10-P31-Q18610173-85ef4d24-0	P10	P31	Q18610173
P1000-P31-Q18608871-093affb5-0	P1000	P31	Q18608871
P10000-P31-Q19833377-f87f0d4c-0	P10000	P31	Q19833377
P10000-P31-Q89560413-f555a944-0	P10000	P31	Q89560413
P10001-P31-Q107738007-c7725ce7-0	P10001	P31	Q107738007
P10001-P31-Q64221137-d154ffd9-0	P10001	P31	Q64221137
P10002-P31-Q93433126-dbd52b84-0	P10002	P31	Q93433126
P10003-P31-Q108914651-f3644858-0	P10003	P31	Q108914651
P10003-P31-Q42396390-7f1b5502-0	P10003	P31	Q42396390

gzip: stdout: Broken pipe


Create the P279 file

In [59]:
!$kypher -i claims -o $OUT/derived.P279.tsv.gz \
    --match '(n1)-[l:P279]->(n2)' \
    --return 'l, n1, l.label, n2' 

In [60]:
!zcat < $OUT/derived.P279.tsv.gz | head | col

id	node1	label	node2
Q100000030-P279-Q14748-30394205-0	Q100000030	P279	Q14748
Q100000058-P279-Q1622444-bd182663-0	Q100000058	P279	Q1622444
Q1000032-P279-Q1813494-0aa0f1dc-0	Q1000032	P279	Q1813494
Q1000032-P279-Q83602-482a1943-0	Q1000032	P279	Q83602
Q1000039-P279-Q11555767-2dddfd86-0	Q1000039	P279	Q11555767
Q100004761-P279-Q100095237-3971e1cd-0	Q100004761	P279	Q100095237
Q100004761-P279-Q126793-77b1fce8-0	Q100004761	P279	Q126793
Q100004761-P279-Q4544523-639fbe16-0	Q100004761	P279	Q4544523
Q1000064-P279-Q11016-0ab23344-0	Q1000064	P279	Q11016

gzip: stdout: Broken pipe


### Create the file that contains all nodes reachable via P279 starting from a node2 in P31 or a node1 in P279

First compute the roots

In [61]:
!$kypher -i $OUT/derived.P279.tsv.gz -o $TEMP/P279.n1.tsv.gz \
--match '(n1)-[l]->()' \
--return 'n1 as id' 

In [62]:
!$kypher -i $OUT/derived.P31.tsv.gz -o $TEMP/P31.n2.tsv.gz \
--match '()-[l]->(n2)' \
--return 'n2 as id' 

In [63]:
kgtk("""cat --mode NONE 
       -i $TEMP/P31.n2.tsv.gz
       -i $TEMP/P279.n1.tsv.gz
       -o $TEMP/P279.roots.1.tsv.gz""")

In [64]:
kgtk("""sort --mode NONE 
        --column id 
        -i $TEMP/P279.roots.1.tsv.gz 
        -o $TEMP/P279.roots.2.tsv.gz""")

We have lots of duplicates

In [65]:
!zcat < $TEMP/P279.roots.2.tsv.gz | head

id
Q100000030
Q100000058
Q1000017
Q1000032
Q1000032
Q1000039
Q100004761
Q100004761
Q100004761

gzip: stdout: Broken pipe


In [66]:
kgtk("""compact 
        -i $TEMP/P279.roots.2.tsv.gz 
        --mode NONE
        --presorted 
        --columns id
        -o $TEMP/P279.roots.tsv""")
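Conceptually, the sort and compact steps amount to sorting the ids and then dropping adjacent duplicates. The sketch below illustrates that idea on a handful of ids from the output above; `unique_presorted` is a hypothetical helper and is not how KGTK implements `compact`.

```python
from itertools import groupby


def unique_presorted(values):
    """Yield each value once, assuming equal values are adjacent (sorted input)."""
    for value, _group in groupby(values):
        yield value


roots = ["Q1000032", "Q100000030", "Q1000032", "Q100004761", "Q100004761"]
deduplicated = list(unique_presorted(sorted(roots)))
print(deduplicated)
```

Only adjacent duplicates are removed, which is why the input has to be sorted first: on unsorted input such as `["a", "b", "a"]` both copies of `"a"` would survive.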

Now we can invoke the reachable-nodes command

In [67]:
kgtk("""reachable-nodes
        --rootfile $TEMP/P279.roots.tsv
        --selflink 
        -i $OUT/derived.P279.tsv.gz
        -o $TEMP/P279.reachable.tsv.gz""")

In [68]:
!zcat < $TEMP/P279.reachable.tsv.gz | head | col

node1	label	node2
Q100000030	reachable	Q100000030
Q100000030	reachable	Q14748
Q100000030	reachable	Q14745
Q100000030	reachable	Q1357761
Q100000030	reachable	Q223557
Q100000030	reachable	Q35459920
Q100000030	reachable	Q488383
Q100000030	reachable	Q35120
Q100000030	reachable	Q4406616

gzip: stdout: Broken pipe
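To make the "zero or more hops" semantics concrete, here is a small breadth-first sketch of what a reachability computation with self-links produces. It is an illustration on a toy graph with made-up class names, not the algorithm used by `reachable-nodes`; `closure_with_selflinks` is a hypothetical name.

```python
from collections import defaultdict, deque


def closure_with_selflinks(edges, roots):
    """Return (root, node) pairs where node is reachable from root in zero or more hops."""
    successors = defaultdict(list)
    for node1, node2 in edges:
        successors[node1].append(node2)

    pairs = []
    for root in roots:
        seen = {root}  # zero hops: every root reaches itself
        queue = deque([root])
        while queue:
            node = queue.popleft()
            for successor in successors[node]:
                if successor not in seen:
                    seen.add(successor)
                    queue.append(successor)
        pairs.extend((root, node) for node in sorted(seen))
    return pairs


# Toy subclass-of edges (hypothetical names, not real Q-nodes).
subclass_edges = [("poodle", "dog"), ("dog", "animal"), ("cat", "animal")]
closure = closure_with_selflinks(subclass_edges, roots=["poodle", "cat"])
for pair in closure:
    print(pair)
```

The `seen` set also guards against cycles in the subclass graph, which would otherwise make the traversal loop forever.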


The reachable-nodes command produces edges labeled `reachable`, so we need one command to rename them.

In [69]:
!$kypher -i $TEMP/P279.reachable.tsv.gz -o $TEMP/P279star.1.tsv.gz \
--match '(n1)-[]->(n2)' \
--return 'distinct n1, "P279star" as label, n2 as node2' \
--order-by 'n1'

Add ids

In [70]:
!$kgtk add-id --id-style wikidata -i $TEMP/P279star.1.tsv.gz -o $OUT/derived.P279star.tsv.gz

In [71]:
!zcat < $OUT/derived.P279star.tsv.gz | head | col


node1	label	node2	id
Q100000030	P279star	Q100000030	Q100000030-P279star-Q100000030
Q100000030	P279star	Q14748	Q100000030-P279star-Q14748
Q100000030	P279star	Q14745	Q100000030-P279star-Q14745
Q100000030	P279star	Q1357761	Q100000030-P279star-Q1357761
Q100000030	P279star	Q223557	Q100000030-P279star-Q223557
Q100000030	P279star	Q35459920	Q100000030-P279star-Q35459920
Q100000030	P279star	Q488383	Q100000030-P279star-Q488383
Q100000030	P279star	Q35120	Q100000030-P279star-Q35120
Q100000030	P279star	Q4406616	Q100000030-P279star-Q4406616

gzip: stdout: Broken pipe


This is how to express the typical SPARQL path pattern `?item P31/P279* ?class` in Kypher. 
The example counts the instances of the subclasses of city (Q515).

In [72]:
if debug:
    !$kypher -i $OUT/derived.P31.tsv.gz -i $OUT/derived.P279star.tsv.gz -i label \
    --match 'P31: (n1)-[:P31]->(c), P279star: (c)-[]->(:Q515), label: (n1)-[:label]->(label), label: (c)-[:label]->(c_label)' \
    --return 'distinct c as class, count(c) as count, c_label as `class name`, n1 as instance, label as `label`' \
    --order-by 'count(c) desc, c, n1' \
    --limit 10 \
    | col

Illustrate that it is indeed `P279*`

In [73]:
if debug:
    !$kypher -i $OUT/derived.P31.tsv.gz -i $OUT/derived.P279star.tsv.gz -i label \
    --match 'P31: (n1)-[:P31]->(c), P279star: (c)-[]->(:Q63440326), label: (n1)-[:label]->(label), label: (c)-[:label]->(c_label)' \
    --return 'distinct c as class, c_label as `class name`, n1 as instance, label as `label`' \
    --order-by 'c, n1' \
    --limit 10 \
    | col 
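The join performed by these queries can be sketched in plain Python: first collect every class whose `P279star` closure contains the target class, then keep the `P31` edges that point at one of those classes. The data and names below are made up for illustration, and `instances_of_class` is a hypothetical helper.

```python
from collections import Counter


def instances_of_class(p31_edges, p279star_edges, target_class):
    """Return (instance, direct_class) pairs where direct_class P279star target_class."""
    classes_below = {node1 for node1, node2 in p279star_edges if node2 == target_class}
    return [(instance, cls) for instance, cls in p31_edges if cls in classes_below]


# Toy data (hypothetical names). The closure includes the self-link (city, city).
p31 = [("paris", "capital"), ("lyon", "city"), ("seine", "river")]
p279star = [
    ("city", "city"),
    ("capital", "capital"),
    ("capital", "city"),
    ("river", "river"),
]

matches = instances_of_class(p31, p279star, "city")
counts = Counter(cls for _instance, cls in matches)
print(matches)
print(counts)
```

Because the closure contains the self-link `(city, city)`, direct instances of the target class are returned alongside instances of its subclasses, which is the `*` in `P279*`.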

### Create a file to do generalized Is-A queries
The idea is that `(n1)-[:isa]->(n2)` when `(n1)-[:P31]->(n2)` or `(n1)-[:P279]->(n2)`

We do this by concatenating the files and renaming the relation

In [74]:
kgtk("""cat 
        -i $OUT/derived.P31.tsv.gz 
        -i $OUT/derived.P279.tsv.gz
        -o $TEMP/isa.1.tsv.gz""")

In [75]:
!$kypher -i $TEMP/isa.1.tsv.gz -o $OUT/derived.isa.tsv.gz \
--match '(n1)-[]->(n2)' \
--return 'n1, "isa" as label, n2' \
--order-by 'n1'
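The concatenate-and-relabel step is simple enough to sketch in a few lines of Python. This is only an illustration of the idea on toy edges; `build_isa` is a hypothetical helper, and unlike the Kypher query, which orders by `node1` only, `sorted` here also breaks ties on `node2`.

```python
def build_isa(p31_edges, p279_edges):
    """Relabel every P31 and P279 edge as `isa` and return the edges sorted by node1."""
    combined = list(p31_edges) + list(p279_edges)
    return sorted((node1, "isa", node2) for node1, _label, node2 in combined)


# Toy edges (hypothetical ids).
p31_edges = [("Q2", "P31", "Q1")]
p279_edges = [("Q1", "P279", "Q0")]
isa_edges = build_isa(p31_edges, p279_edges)
print(isa_edges)
```

As in the query above, no deduplication is applied, so a node that is both an instance and a subclass of the same class yields two identical `isa` edges.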

Example of how to use the `isa` relation

In [76]:
if debug:
    !$kypher -i $OUT/derived.isa.tsv.gz -i $OUT/derived.P279star.tsv.gz -i label -o - \
    --match 'isa: (n1)-[l:isa]->(c), P279star: (c)-[]->(:Q44), label: (n1)-[:label]->(label)' \
    --return 'distinct n1, l.label, "Q44" as node2, label as n1_label' \
    --limit 10 \
    | col

### Create files with `isa/P279*` and `P31/P279*`
These files are useful to find all nodes that are below a q-node via P279 or isa.

> These files are very large and take many hours to compute

In [77]:
os.environ['P279STAR'] = f"{os.environ['OUT']}/derived.P279star.tsv.gz"
os.environ['ISA'] = f"{os.environ['OUT']}/derived.isa.tsv.gz"

In [78]:
!$kypher -i "$P279STAR" --as P279star -i "$ISA" --as isa  \
--match '\
  isa: (n1)-[]->(n2), \
  P279star: (n2)-[]->(n3)' \
--return 'distinct n1 as node1, "isa_star" as label, n3 as node2' \
--order-by 'n1' \
-o "$TEMP"/derived.isastar_1.tsv.gz

Now add ids

In [None]:
kgtk("""add-id 
        --id-style wikidata 
        -i "$TEMP"/derived.isastar_1.tsv.gz 
        -o "$OUT"/derived.isastar.tsv.gz""")

Also calculate the same file for `P31/P279*`

In [80]:
!$kypher -i claims -i "$P279STAR" --as P279star \
--match '\
  claims: (n1)-[:P31]->(n2), \
  P279star: (n2)-[]->(n3)' \
--return 'distinct n1 as node1, "P31P279star" as label, n3 as node2' \
--order-by 'n1' \
-o "$TEMP"/derived.P31P279star.tsv.gz

Add ids

In [81]:
kgtk("""add-id 
        --id-style wikidata 
        -i "$TEMP"/derived.P31P279star.tsv.gz
        -o "$OUT"/derived.P31P279star.tsv.gz""")

This file is also very large

In [42]:
if debug:
    !zcat < "$OUT"/derived.P31P279star.tsv.gz | wc

 1704159 6816636 100144221


## Compute pagerank

Now compute pagerank. These commands will exceed 16GB memory for graphs containing over 25 million nodes.

In [30]:
if compute_pagerank:
    kgtk("""graph-statistics 
            -i "$GRAPH/claims.wikibase-item.tsv.gz" 
            -o $OUT/metadata.pagerank.directed.tsv.gz 
            --compute-pagerank True 
            --compute-hits False 
            --page-rank-property Pdirected_pagerank 
            --output-degrees False 
            --use-mgzip True 
            --mgzip-threads 12 
            --output-pagerank True 
            --output-hits False 
            --output-statistics-only 
            --undirected False 
            --log-file $TEMP/metadata.pagerank.directed.summary.txt""")

In [31]:
if compute_pagerank:
    !cat $TEMP/metadata.pagerank.directed.summary.txt

graph loaded! It has 92915895 nodes and 654318387 edges

*** Top relations:
P2860	284754818
P31	97274091
P1433	37574876
P50	21623753
P921	17012294
P17	14282490
P407	14064733
P131	10893839
P106	8872011
P6259	8076591

*** Degrees:
in degree stats: mean=7.042050, std=0.460818, max=1
out degree stats: mean=7.042050, std=0.001477, max=1
total degree stats: mean=14.084100, std=0.460825, max=1

*** PageRank
Max pageranks
3927	Q4167836	0.023031
42785	Q13442814	0.021860
1926	Q1860	0.007249
2472	Q5	0.006367
1263646	Q35252665	0.005490


In [32]:
if compute_pagerank:
    kgtk("""graph-statistics 
            -i "$GRAPH/claims.wikibase-item.tsv.gz" 
            -o $OUT/metadata.pagerank.undirected.tsv.gz 
            --compute-pagerank True 
            --compute-hits False 
            --page-rank-property Pundirected_pagerank
            --use-mgzip True 
            --mgzip-threads 12
            --output-degrees False 
            --output-pagerank True 
            --output-hits False 
            --output-statistics-only 
            --undirected True 
            --log-file $TEMP/metadata.pagerank.undirected.summary.txt""")

In [33]:
if compute_pagerank:
    !cat $TEMP/metadata.pagerank.undirected.summary.txt 

graph loaded! It has 92915895 nodes and 654318387 edges

*** Top relations:
P2860	284754818
P31	97274091
P1433	37574876
P50	21623753
P921	17012294
P17	14282490
P407	14064733
P131	10893839
P106	8872011
P6259	8076591

*** Degrees:
in degree stats: mean=0.000000, std=0.000000, max=1
out degree stats: mean=14.084100, std=0.460825, max=1
total degree stats: mean=14.084100, std=0.460825, max=1

*** PageRank
Max pageranks
42785	Q13442814	0.030297
126633	Q1264450	0.013443
3927	Q4167836	0.012639
2472	Q5	0.008565
1926	Q1860	0.006830
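To illustrate the difference between the directed and undirected runs, here is a plain power-iteration PageRank on a toy graph. It is a sketch under stated assumptions, not the implementation behind `graph-statistics`: the `pagerank` function, the damping factor of 0.85 and the fixed iteration count are all choices made for this example.

```python
def pagerank(edges, damping=0.85, iterations=100, undirected=False):
    """Return a dict node -> PageRank score computed by power iteration."""
    edges = list(edges)
    if undirected:
        # Treat every edge as traversable in both directions.
        edges = edges + [(node2, node1) for node1, node2 in edges]

    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for node1, node2 in edges:
        out_links[node1].append(node2)

    count = len(nodes)
    rank = {node: 1.0 / count for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is spread over all nodes.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        base = (1.0 - damping) / count + damping * dangling / count
        new_rank = {node: base for node in nodes}
        for node in nodes:
            if out_links[node]:
                share = damping * rank[node] / len(out_links[node])
                for target in out_links[node]:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Toy star graph (hypothetical node names): three nodes point at one hub.
star = [("a", "hub"), ("b", "hub"), ("c", "hub")]
directed = pagerank(star)
undirected = pagerank(star, undirected=True)
print(max(directed, key=directed.get), max(undirected, key=undirected.get))
```

In the directed run the hub has no outgoing edges, so its rank is redistributed uniformly; in the undirected run the hub passes rank back to its neighbours, so the scores differ even though the same node ranks first.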


## Compute Degrees

Kypher can compute the out degree by counting the distinct edges leaving each node1

In [82]:
!$kypher -i claims -o $TEMP/metadata.out_degree.tsv.gz \
--match '(n1)-[l]->()' \
--order-by 'n1' \
--return 'distinct n1 as node1, count(distinct l) as node2, "Pout_degree" as label' 

In [83]:
kgtk("""add-id --id-style wikidata 
        -i $TEMP/metadata.out_degree.tsv.gz 
        -o $OUT/metadata.out_degree.tsv.gz""")

In [84]:
!zcat < $OUT/metadata.out_degree.tsv.gz | head | col

node1	node2	label	id
P10	19	Pout_degree	P10-Pout_degree-9400f1
P1000	10	Pout_degree	P1000-Pout_degree-4a44dc
P10000	23	Pout_degree	P10000-Pout_degree-535fa3
P10001	26	Pout_degree	P10001-Pout_degree-5f9c4a
P10002	20	Pout_degree	P10002-Pout_degree-f5ca38
P10003	20	Pout_degree	P10003-Pout_degree-f5ca38
P10004	21	Pout_degree	P10004-Pout_degree-6f4b66
P10005	19	Pout_degree	P10005-Pout_degree-9400f1
P10006	20	Pout_degree	P10006-Pout_degree-f5ca38

gzip: stdout: Broken pipe


To count the in-degree we only care when the node2 is a wikibase-item

In [85]:
!$kypher -i claims -o $TEMP/metadata.in_degree.tsv.gz \
--match '()-[l]->(n2 {`wikidatatype`:"wikibase-item"})' \
--return 'distinct n2 as node1, count(distinct l) as node2, "Pin_degree" as label' \
--order-by 'n2'

In [86]:
kgtk("""add-id --id-style wikidata 
        -i $TEMP/metadata.in_degree.tsv.gz
        -o $OUT/metadata.in_degree.tsv.gz""")

In [87]:
!zcat < $OUT/metadata.in_degree.tsv.gz | head | col

node1	node2	label	id
Q1	91	Pin_degree	Q1-Pin_degree-1da51b
Q100	13492	Pin_degree	Q100-Pin_degree-9ba93e
Q1000	5423	Pin_degree	Q1000-Pin_degree-f2069c
Q100000	125	Pin_degree	Q100000-Pin_degree-0f8ef3
Q10000000	1	Pin_degree	Q10000000-Pin_degree-6b86b2
Q100000001	3	Pin_degree	Q100000001-Pin_degree-4e0740
Q10000002	1	Pin_degree	Q10000002-Pin_degree-6b86b2
Q100000040	4	Pin_degree	Q100000040-Pin_degree-4b2277
Q10000005	1	Pin_degree	Q10000005-Pin_degree-6b86b2

gzip: stdout: Broken pipe
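The two degree queries can be summarized in a short Python sketch: the out degree counts the distinct edge ids leaving a node, while the in degree counts only edges whose `node2` has the `wikibase-item` type. The `degrees` function is a hypothetical helper, and the three demo rows reuse edge ids shown earlier in this notebook (the type of the `P31` row is assumed to be `wikibase-item`).

```python
from collections import defaultdict


def degrees(claims):
    """Return (out_degree, in_degree) dicts from (id, node1, label, node2, node2_type) rows."""
    out_edges = defaultdict(set)
    in_edges = defaultdict(set)
    for edge_id, node1, _label, node2, node2_type in claims:
        out_edges[node1].add(edge_id)
        # Literals such as URLs or strings are not nodes, so they get no in degree.
        if node2_type == "wikibase-item":
            in_edges[node2].add(edge_id)
    out_degree = {node: len(ids) for node, ids in out_edges.items()}
    in_degree = {node: len(ids) for node, ids in in_edges.items()}
    return out_degree, in_degree


claims = [
    ("P10-P1628-32b85d-7927ece6-0", "P10", "P1628", '"http://www.w3.org/2006/vcard/ns#Video"', "url"),
    ("P10-P1629-Q34508-bcc39400-0", "P10", "P1629", "Q34508", "wikibase-item"),
    ("P10-P31-Q18610173-85ef4d24-0", "P10", "P31", "Q18610173", "wikibase-item"),
]
out_degree, in_degree = degrees(claims)
print(out_degree)
print(in_degree)
```

Using sets of edge ids mirrors the `count(distinct l)` in the queries, so a repeated edge id is not counted twice.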


Calculate the distribution so we can make a nice chart

In [88]:
!$kypher -i $OUT/metadata.in_degree.tsv.gz -o $OUT/statistics.in_degree.distribution.tsv \
--match '(n1)-[]->(n2)' \
--return 'distinct n2 as Pin_degree, count(distinct n1) as count, "count" as label' \
--order-by 'cast(n2, integer)' 

In [89]:
!head $OUT/statistics.in_degree.distribution.tsv | col

Pin_degree	count	label
1	7182787 count
2	2146229 count
3	903632	count
4	454520	count
5	317411	count
6	214644	count
7	169879	count
8	115011	count
9	91834	count


In [90]:
!$kypher -i $OUT/metadata.out_degree.tsv.gz -o $OUT/statistics.out_degree.distribution.tsv \
--match '(n1)-[]->(n2)' \
--return 'distinct n2 as Pout_degree, count(distinct n1) as count, "count" as label' \
--order-by 'cast(n2, integer)' 

In [91]:
!head $OUT/statistics.out_degree.distribution.tsv | col

Pout_degree	count	label
1	6167233 count
2	2647513 count
3	2853195 count
4	3070920 count
5	4297447 count
6	5898946 count
7	4787449 count
8	3723237 count
9	3343266 count
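The distribution queries group nodes by their degree and count the nodes in each group. A minimal sketch of the same aggregation, using a few in-degree values from the output above; `degree_distribution` is a hypothetical helper.

```python
from collections import Counter


def degree_distribution(degree_by_node):
    """Return (degree, number_of_nodes) pairs sorted by numeric degree."""
    counts = Counter(degree_by_node.values())
    return sorted(counts.items())


in_degree = {"Q1": 91, "Q10000000": 1, "Q10000002": 1, "Q100000001": 3}
distribution = degree_distribution(in_degree)
print(distribution)
```

Sorting the degrees as integers plays the role of `cast(n2, integer)` in the queries; sorting them as strings would place `10` before `9`.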


Draw some charts

In [43]:
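if debug:
    import altair as alt  # altair is not imported at the top of the notebook

    data = pd.read_csv(
        os.environ["OUT"] + "/statistics.in_degree.distribution.tsv", sep="\t"
    )

    # The distribution file names its degree column `Pin_degree` (see the query above).
    chart = alt.Chart(data).mark_circle(size=60).encode(
        x=alt.X("Pin_degree", scale=alt.Scale(type="log"), title="in degree"),
        y=alt.Y("count", scale=alt.Scale(type="log"), title="count of nodes"),
        tooltip=["Pin_degree", "count"],
    ).interactive().properties(title="Distribution of In Degree")
    display(chart)  # expressions inside an `if` block are not displayed automatically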

In [44]:
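if debug:
    import altair as alt  # altair is not imported at the top of the notebook

    data = pd.read_csv(
        os.environ["OUT"] + "/statistics.out_degree.distribution.tsv", sep="\t"
    )

    # The distribution file names its degree column `Pout_degree` (see the query above).
    chart = alt.Chart(data).mark_circle(size=60).encode(
        x=alt.X("Pout_degree", scale=alt.Scale(type="log"), title="out degree"),
        y=alt.Y("count", scale=alt.Scale(type="log"), title="count of nodes"),
        tooltip=["Pout_degree", "count"],
    ).interactive().properties(title="Distribution of Out Degree")
    display(chart)  # expressions inside an `if` block are not displayed automatically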

## Summary of results

In [45]:
!ls -lh $OUT/*

-rw-r--r-- 1 amandeep isdstaff  19G Nov 16 02:53 /data/amandeep/wikidata-20210215-dwd-v3/useful-files/derived.isastar.tsv.gz
-rw-r--r-- 1 amandeep isdstaff 295M Nov 11 18:13 /data/amandeep/wikidata-20210215-dwd-v3/useful-files/derived.isa.tsv.gz
-rw-r--r-- 1 amandeep isdstaff 723M Nov 11 18:04 /data/amandeep/wikidata-20210215-dwd-v3/useful-files/derived.P279star.tsv.gz
-rw-r--r-- 1 amandeep isdstaff  40M Nov 11 17:59 /data/amandeep/wikidata-20210215-dwd-v3/useful-files/derived.P279.tsv.gz
-rw-r--r-- 1 amandeep isdstaff  18G Nov 17 11:52 /data/amandeep/wikidata-20210215-dwd-v3/useful-files/derived.P31P279star.tsv.gz
-rw-r--r-- 1 amandeep isdstaff 1.2G Nov 11 18:12 /data/amandeep/wikidata-20210215-dwd-v3/useful-files/derived.P31.tsv.gz
-rw-r--r-- 1 amandeep isdstaff 231M Nov 18 01:18 /data/amandeep/wikidata-20210215-dwd-v3/useful-files/metadata.in_degree.tsv.gz
-rw-r--r-- 1 amandeep isdstaff 598M Nov 17 19:49 /data/amandeep/wikidata-20210215-dwd-v3/useful-files/metadata.out_degree.tsv.gz

Highest page rank

In [46]:
if debug:
    if compute_pagerank:
        !$kypher -i $OUT/metadata.pagerank.undirected.tsv.gz -i label \
        --match 'pagerank: (n1)-[:Pundirected_pagerank]->(page_rank), label: (n1)-[:label]->(label)' \
        --return 'distinct n1, label as label, page_rank as `undirected page rank`' \
        --order-by 'page_rank desc' \
        --limit 10 