# Generating Useful Wikidata Files

This notebook generates files that contain derived data that is useful in many applications. The input to the notebook is the full Wikidata or a subset of Wikidata. It also works for arbitrary KGs as long as they follow the representation requirements of Wikidata:

- the *instance of* relation is represented using the `P31` property
- the *subclass of* relation is represented using the `P279` property
- every property declares a datatype, which must be one of the Wikidata datatypes.

Inputs:

- `claims_file`: contains all statements, which consist of edges `node1/label/node2` where `label` is a property in Wikidata (sitelinks, labels, aliases and descriptions are not in the claims file).
- `item_file`: the subset of the `claims_file` consisting of edges for properties of data type `wikibase-item`
- `label_file`, `alias_file` and `description_file`: contain labels, aliases and descriptions. It is assumed that these files contain the labels, aliases and descriptions of all nodes appearing in the claims file. Users may provide these files for specific languages only.

Outputs:

- **Instance of (P31):** `derived.P31.tsv.gz` contains all the `instance of (P31)` edges present in the claims file.
- **Subclass of (P279):** `derived.P279.tsv.gz` contains all the `subclass of (P279)` edges present in the claims file.
- **Is A (isa):** `derived.isa.tsv.gz` contains edges `node1/isa/node2` where either `node1/P31/node2` or `node1/P279/node2`
- **Closure of subclass of (P279star):** `derived.P279star.tsv.gz` contains edges `node1/P279star/node2` where `node2` is reachable from `node1` via zero or more hops using the `P279` property. Note that the closure includes self links, for example `Q44/P279star/Q44`. This file is useful, for example, when you want to find all the instances of a class, including instances of subclasses of the given class.
- **In/out degrees:** `metadata.out_degree.tsv.gz` contains the out degree of every node, and `metadata.in_degree.tsv.gz` contains the in degree of every node.
- **Pagerank:** outputs page rank of the directed graph in `metadata.pagerank.directed.tsv.gz` and page rank of the undirected graph in `metadata.pagerank.undirected.tsv.gz`.
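
To make the "zero or more hops" definition of `P279star` concrete, here is a minimal pure-Python sketch on a hypothetical three-class hierarchy. The function name `p279_star` and the toy node names are illustrative assumptions; the notebook itself computes this file with the KGTK `reachable-nodes` command, and this sketch starts the search from every node rather than from a root file.

```python
from collections import defaultdict, deque

def p279_star(edges):
    """Return all (node1, node2) pairs where node2 is reachable from node1
    via zero or more subclass-of hops."""
    parents = defaultdict(set)
    nodes = set()
    for child, parent in edges:
        parents[child].add(parent)
        nodes.update((child, parent))

    closure = set()
    for start in nodes:
        seen = {start}              # zero hops: every node reaches itself
        queue = deque([start])
        while queue:                # breadth-first walk up the hierarchy
            current = queue.popleft()
            for parent in parents[current]:
                if parent not in seen:
                    seen.add(parent)
                    queue.append(parent)
        closure.update((start, reached) for reached in seen)
    return closure

# Hypothetical toy hierarchy: A is a subclass of B, B is a subclass of C.
toy_edges = [("A", "B"), ("B", "C")]
closure = p279_star(toy_edges)
print(sorted(closure))
```

The self links such as `("A", "A")` correspond to the `Q44/P279star/Q44` edge mentioned in the list above.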

### Batch Invocation
Example batch command. The second argument is a notebook where the output will be stored. You can load it to see progress.

```
papermill Wikidata\ Useful\ Files.ipynb useful-files.out.ipynb \
-p input_path /data/amandeep/wikidata-20211027-dwd-v3 \
-p output_path /data/amandeep/wikidata-20211027-dwd-v3 \
-p kgtk_path  /Users/amandeep/github/kgtk \
-p project_name useful-files \
-p languages en,es \
-p files claims,label_all,alias_all,description_all \
-p compute_pagerank True \
-p compute_degrees True
```
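
If the batch run needs to be scripted, the same command can be assembled from a dictionary of parameters. The sketch below only builds and prints the argument list; the helper name `build_papermill_command` is a hypothetical name introduced for illustration, and the parameter values are copied from the example command above.

```python
import shlex

def build_papermill_command(notebook, output_notebook, parameters):
    """Return a papermill argument list with one `-p name value` pair per parameter."""
    command = ["papermill", notebook, output_notebook]
    for name, value in parameters.items():
        command += ["-p", name, str(value)]
    return command

# Values copied from the example batch command.
parameters = {
    "input_path": "/data/amandeep/wikidata-20211027-dwd-v3",
    "output_path": "/data/amandeep/wikidata-20211027-dwd-v3",
    "kgtk_path": "/Users/amandeep/github/kgtk",
    "project_name": "useful-files",
    "languages": "en,es",
    "files": "claims,label_all,alias_all,description_all",
    "compute_pagerank": True,
    "compute_degrees": True,
}
command = build_papermill_command(
    "Wikidata Useful Files.ipynb", "useful-files.out.ipynb", parameters
)
print(shlex.join(command))  # shell-quoted form of the same command
```

The resulting list could then be handed to `subprocess.run(command, check=True)`, assuming `papermill` is installed and on the path.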

In [1]:
import io
import os
import subprocess
import sys

import altair as alt
import numpy as np
import pandas as pd
 
from kgtk.configure_kgtk_notebooks import ConfigureKGTK
from kgtk.functions import kgtk, kypher

In [3]:
# Parameters

input_path = "/data/amandeep/wikidata-20220505/import-wikidata/data"
output_path = "/data/amandeep/wikidata-20220505/import-wikidata/data"
kgtk_path = "/Users/amandeep/github/kgtk"

graph_cache_path = None

project_name = "useful-files"

languages = 'en,ru,es,zh-cn,de,it,nl,pl,fr,pt,sv'

files = 'claims,label_all,alias_all,description_all'

compute_pagerank = False
compute_degrees = False
debug = False
compute_isa_star = False
compute_p31p279_star = False
files_for_cache = None

In [4]:
files = files.split(',')
languages = languages.split(',')
if files_for_cache is None:
    files_for_cache =  files
else:
    files_for_cache = files_for_cache.split(",")

In [5]:
ck = ConfigureKGTK(files, kgtk_path=kgtk_path)
ck.configure_kgtk(input_graph_path=input_path,
                  output_path=output_path,
                  project_name=project_name,
                 graph_cache_path=graph_cache_path)

User home: /nas/home/amandeep
Current dir: /data/amandeep/Github/kgtk/use-cases
KGTK dir: /Users/amandeep/github/kgtk
Use-cases dir: /Users/amandeep/github/kgtk/use-cases


In [6]:
ck.print_env_variables()

KGTK_GRAPH_CACHE: /data/amandeep/wikidata-20220505/import-wikidata/data/useful-files/temp.useful-files/wikidata.sqlite3.db
kypher: kgtk query --graph-cache /data/amandeep/wikidata-20220505/import-wikidata/data/useful-files/temp.useful-files/wikidata.sqlite3.db
EXAMPLES_DIR: /Users/amandeep/github/kgtk/examples
USE_CASES_DIR: /Users/amandeep/github/kgtk/use-cases
KGTK_LABEL_FILE: /data/amandeep/wikidata-20220505/import-wikidata/data/labels.en.tsv.gz
kgtk: kgtk
STORE: /data/amandeep/wikidata-20220505/import-wikidata/data/useful-files/temp.useful-files/wikidata.sqlite3.db
GRAPH: /data/amandeep/wikidata-20220505/import-wikidata/data
OUT: /data/amandeep/wikidata-20220505/import-wikidata/data/useful-files
TEMP: /data/amandeep/wikidata-20220505/import-wikidata/data/useful-files/temp.useful-files
KGTK_OPTION_DEBUG: false
claims: /data/amandeep/wikidata-20220505/import-wikidata/data/claims.tsv.gz
label_all: /data/amandeep/wikidata-20220505/import-wikidata/data/labels.tsv.gz
alias_all: /data/ama

In [7]:
if graph_cache_path is None:
    ck.load_files_into_cache(files=files_for_cache)

kgtk query --graph-cache /data/amandeep/wikidata-20220505/import-wikidata/data/useful-files/temp.useful-files/wikidata.sqlite3.db -i "/data/amandeep/wikidata-20220505/import-wikidata/data/claims.tsv.gz" --as claims  -i "/data/amandeep/wikidata-20220505/import-wikidata/data/labels.tsv.gz" --as label_all  -i "/data/amandeep/wikidata-20220505/import-wikidata/data/aliases.tsv.gz" --as alias_all  -i "/data/amandeep/wikidata-20220505/import-wikidata/data/descriptions.tsv.gz" --as description_all  --limit 3
id	node1	label	node2	rank	node2;wikidatatype
P10-P1628-32b85d-7927ece6-0	P10	P1628	"http://www.w3.org/2006/vcard/ns#Video"	normal	url
P10-P1628-acf60d-b8950832-0	P10	P1628	"https://schema.org/video"	normal	url
P10-P1629-Q34508-bcc39400-0	P10	P1629	Q34508	normal	wikibase-item


### Preview the input files

It is always good practice to peek at the files to make sure the column headings are what we expect

In [9]:
!$kypher -i claims --limit 10 | col 

/bin/bash: function: No such file or directory



Force creation of the index on the label column

In [10]:
!$kypher -i claims -o - \
--match '(i)-[:P31]->(c)' \
--limit 5 \
| column -t -s $'\t' 

/bin/bash: function: No such file or directory


Force creation of the index on the node2 column

In [11]:
!$kypher -i claims -o - \
--match '(i)-[r]->(:Q5)' \
--limit 5 \
| column -t -s $'\t' 

/bin/bash: function: No such file or directory


### Create the P31 and P279 files

Create the `P31` file

In [12]:
!$kypher -i claims -o $OUT/derived.P31.tsv.gz \
--match '(n1)-[l:P31]->(n2)' \
--return 'l, n1, l.label, n2' 

In [13]:
!zcat < $OUT/derived.P31.tsv.gz | head | col

id	node1	label	node2
P10-P31-Q18610173-85ef4d24-0	P10	P31	Q18610173
P10-P31-Q19847637-e81ded71-0	P10	P31	Q19847637
P1000-P31-Q18608871-093affb5-0	P1000	P31	Q18608871
P10000-P31-Q19833377-f87f0d4c-0 P10000	P31	Q19833377
P10000-P31-Q89560413-f555a944-0 P10000	P31	Q89560413
P10001-P31-Q107738007-c7725ce7-0	P10001	P31	Q107738007
P10001-P31-Q64221137-d154ffd9-0 P10001	P31	Q64221137
P10002-P31-Q93433126-dbd52b84-0 P10002	P31	Q93433126
P10003-P31-Q108914651-f3644858-0	P10003	P31	Q108914651

gzip: stdout: Broken pipe


Create the P279 file

In [14]:
!$kypher -i claims -o $OUT/derived.P279.tsv.gz \
    --match '(n1)-[l:P279]->(n2)' \
    --return 'l, n1, l.label, n2' 

In [15]:
!zcat < $OUT/derived.P279.tsv.gz | head | col

id	node1	label	node2
P2217-P279-Q986260-6ee7fda9-0	P2217	P279	Q986260
Q100000030-P279-Q14748-30394205-0	Q100000030	P279	Q14748
Q100000058-P279-Q1622444-bd182663-0	Q100000058	P279	Q1622444
Q1000032-P279-Q1813494-0aa0f1dc-0	Q1000032	P279	Q1813494
Q1000032-P279-Q83602-482a1943-0 Q1000032	P279	Q83602
Q1000039-P279-Q11555767-2dddfd86-0	Q1000039	P279	Q11555767
Q100004761-P279-Q100095237-3971e1cd-0	Q100004761	P279	Q100095237
Q100004761-P279-Q126793-77b1fce8-0	Q100004761	P279	Q126793
Q100004761-P279-Q4544523-639fbe16-0	Q100004761	P279	Q4544523

gzip: stdout: Broken pipe


### Create the file that contains all nodes reachable via P279 starting from a node2 in P31 or a node1 in P279

First compute the roots

In [16]:
!$kypher -i $OUT/derived.P279.tsv.gz -o $TEMP/P279.n1.tsv.gz \
--match '(n1)-[l]->()' \
--return 'n1 as id' 

In [17]:
!$kypher -i $OUT/derived.P31.tsv.gz -o $TEMP/P31.n2.tsv.gz \
--match '()-[l]->(n2)' \
--return 'n2 as id' 

In [18]:
kgtk("""cat --mode NONE 
       -i $TEMP/P31.n2.tsv.gz
       -i $TEMP/P279.n1.tsv.gz
       -o $TEMP/P279.roots.1.tsv.gz""")

In [19]:
kgtk("""sort --mode NONE 
        --column id 
        -i $TEMP/P279.roots.1.tsv.gz 
        -o $TEMP/P279.roots.2.tsv.gz""")

We have lots of duplicates

In [20]:
!zcat < $TEMP/P279.roots.2.tsv.gz | head

id
P2217
Q1
Q1
Q100000030
Q100000058
Q1000017
Q1000032
Q1000032
Q1000039

gzip: stdout: Broken pipe


In [21]:
kgtk("""compact 
        -i $TEMP/P279.roots.2.tsv.gz 
        --mode NONE
        --presorted 
        --columns id
        -o $TEMP/P279.roots.tsv""")
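
The concatenate, sort, and compact steps above amount to building a sorted list of unique root ids. A minimal pure-Python sketch of the same idea follows; the toy id lists are hypothetical stand-ins for the contents of `P31.n2.tsv.gz` and `P279.n1.tsv.gz`, and the loop only mirrors the effect of `compact --presorted` on a single `id` column.

```python
# Hypothetical toy ids standing in for the two temporary files.
p31_targets = ["Q1", "Q1000032", "Q1"]             # node2 values of P31 edges
p279_sources = ["P2217", "Q1000032", "Q1000039"]   # node1 values of P279 edges

combined = p31_targets + p279_sources   # like `cat`
combined.sort()                         # like `sort --column id`

# Because the list is sorted, duplicates are adjacent and can be dropped
# in a single pass, which is why the compact step can assume presorted input.
roots = []
for node in combined:
    if not roots or roots[-1] != node:
        roots.append(node)

print(roots)
```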

Now we can invoke the reachable-nodes command

In [22]:
kgtk("""reachable-nodes
        --rootfile $TEMP/P279.roots.tsv
        --selflink 
        -i $OUT/derived.P279.tsv.gz
        --label P279star
        -o $TEMP/P279.reachable.tsv.gz""")

In [23]:
!zcat < $TEMP/P279.reachable.tsv.gz | head | col

node1	label	node2
P2217	reachable	P2217
P2217	reachable	Q986260
P2217	reachable	Q3711325
P2217	reachable	Q107715
P2217	reachable	Q309314
P2217	reachable	Q246672
P2217	reachable	Q7184903
P2217	reachable	Q488383
P2217	reachable	Q35120

gzip: stdout: Broken pipe


The reachable-nodes command labels its edges `reachable` by default; the `--label P279star` option in the command above names them `P279star` instead, so no separate renaming step is needed.

Add ids

In [25]:
!$kgtk add-id --id-style wikidata -i $TEMP/P279.reachable.tsv.gz -o $OUT/derived.P279star.tsv.gz

In [26]:
!zcat < $OUT/derived.P279star.tsv.gz | head | col

node1	label	node2	id
P2217	P279star	P2217	P2217-P279star-P2217
P2217	P279star	Q986260 P2217-P279star-Q986260
P2217	P279star	Q3711325	P2217-P279star-Q3711325
P2217	P279star	Q107715 P2217-P279star-Q107715
P2217	P279star	Q309314 P2217-P279star-Q309314
P2217	P279star	Q246672 P2217-P279star-Q246672
P2217	P279star	Q7184903	P2217-P279star-Q7184903
P2217	P279star	Q488383 P2217-P279star-Q488383
P2217	P279star	Q35120	P2217-P279star-Q35120

gzip: stdout: Broken pipe


This is how we would do the typical `?item P31/P279* ?class` in Kypher. 
The example counts the instances of the subclasses of city (Q515).

In [27]:
if debug:
    !$kypher -i $OUT/derived.P31.tsv.gz -i $OUT/derived.P279star.tsv.gz -i label \
    --match 'P31: (n1)-[:P31]->(c), P279star: (c)-[]->(:Q515), label: (n1)-[:label]->(label), label: (c)-[:label]->(c_label)' \
    --return 'distinct c as class, count(c) as count, c_label as `class name`, n1 as instance, label as `label`' \
    --order-by 'count(c) desc, c, n1' \
    --limit 10 \
    | col

Illustrate that it is indeed `P279*`

In [28]:
if debug:
    !$kypher -i $OUT/derived.P31.tsv.gz -i $OUT/derived.P279star.tsv.gz -i label \
    --match 'P31: (n1)-[:P31]->(c), P279star: (c)-[]->(:Q63440326), label: (n1)-[:label]->(label), label: (c)-[:label]->(c_label)' \
    --return 'distinct c as class, c_label as `class name`, n1 as instance, label as `label`' \
    --order-by 'c, n1' \
    --limit 10 \
    | col 
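
The join behind `P31/P279*` can be mimicked on toy data to see what the two queries above compute. This is a rough sketch with hypothetical node names (plain words instead of q-nodes) and a hypothetical helper `instances_of`; it illustrates the logic only, not how Kypher executes the query.

```python
from collections import Counter

# Hypothetical toy data: P31 edges as (instance, class) pairs, and a
# P279star closure (including self links) as (class, ancestor) pairs.
p31 = [("Paris", "capital"), ("Lyon", "city"), ("Berlin", "capital"), ("Seine", "river")]
p279star = {
    ("city", "city"),
    ("capital", "capital"),
    ("capital", "city"),
    ("river", "river"),
}

def instances_of(target_class, p31_edges, p279star_edges):
    """Return (instance, class) pairs whose class reaches target_class via P279star."""
    classes_below = {cls for cls, ancestor in p279star_edges if ancestor == target_class}
    return [(item, cls) for item, cls in p31_edges if cls in classes_below]

matches = instances_of("city", p31, p279star)
counts = Counter(cls for _, cls in matches)
print(matches)
print(counts.most_common())
```

Because the closure contains the self link `("city", "city")`, direct instances of the target class are returned along with instances of its subclasses.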

### Create a file to do generalized Is-A queries
The idea is that `(n1)-[:isa]->(n2)` when `(n1)-[:P31]->(n2)` or `(n1)-[:P279]->(n2)`

We do this by concatenating the files and renaming the relation

In [29]:
kgtk("""cat 
        -i $OUT/derived.P31.tsv.gz 
        -i $OUT/derived.P279.tsv.gz
        -o $TEMP/isa.1.tsv.gz""")

In [30]:
!$kypher -i $TEMP/isa.1.tsv.gz -o $OUT/derived.isa.tsv.gz \
--match '(n1)-[]->(n2)' \
--return 'n1, "isa" as label, n2' \
--order-by 'n1'

Example of how to use the `isa` relation

In [31]:
if debug:
    !$kypher -i $OUT/derived.isa.tsv.gz -i $OUT/derived.P279star.tsv.gz -i label -o - \
    --match 'isa: (n1)-[l:isa]->(c), P279star: (c)-[]->(:Q44), label: (n1)-[:label]->(label)' \
    --return 'distinct n1, l.label, "Q44" as node2, label as n1_label' \
    --limit 10 \
    | col
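
A small sketch may help show how `isa` differs from `P31` when combined with `P279star`: following `isa` first also returns subclasses, not only instances. The node names and the helper `below` are hypothetical and chosen only for illustration.

```python
# Hypothetical toy edges as (node1, node2) pairs.
p31_edges = [("Q_rex", "Q_dog")]                                  # Rex is an instance of dog
p279_edges = [("Q_dog", "Q_mammal"), ("Q_mammal", "Q_animal")]    # subclass chain

# Concatenate both files and relabel every edge as `isa`.
isa_edges = sorted((n1, "isa", n2) for n1, n2 in p31_edges + p279_edges)

# P279star closure of the toy hierarchy, including self links.
p279star = {
    ("Q_dog", "Q_dog"), ("Q_dog", "Q_mammal"), ("Q_dog", "Q_animal"),
    ("Q_mammal", "Q_mammal"), ("Q_mammal", "Q_animal"),
    ("Q_animal", "Q_animal"),
}

def below(target, first_hop_edges):
    """Nodes whose first hop lands on a class that reaches `target` via P279star."""
    return sorted({n1 for n1, n2 in first_hop_edges if (n2, target) in p279star})

via_p31 = below("Q_animal", p31_edges)
via_isa = below("Q_animal", [(n1, n2) for n1, _, n2 in isa_edges])
print(via_p31)
print(via_isa)
```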

### Create files with `isa/P279*` and `P31/P279*`
These files are useful to find all nodes that are below a q-node via P279 or isa.

> These files are very large and take many hours to compute

In [32]:
os.environ['P279STAR'] = f"{os.environ['OUT']}/derived.P279star.tsv.gz"
os.environ['ISA'] = f"{os.environ['OUT']}/derived.isa.tsv.gz"

In [33]:
if compute_isa_star:
    !$kypher -i "$P279STAR" --as P279star -i "$ISA" --as isa  \
    --match '\
      isa: (n1)-[]->(n2), \
      P279star: (n2)-[]->(n3)' \
    --return 'distinct n1 as node1, "isa_star" as label, n3 as node2' \
    --order-by 'n1' \
    -o "$TEMP"/derived.isastar_1.tsv.gz

Now add ids

In [34]:
if compute_isa_star:
    kgtk("""add-id 
            --id-style wikidata 
            -i "$TEMP"/derived.isastar_1.tsv.gz 
            -o "$OUT"/derived.isastar.tsv.gz""")

Also calculate the same file for P31/P279*

In [35]:
if compute_p31p279_star:
    !$kypher -i claims -i "$P279STAR" --as P279star \
    --match '\
      claims: (n1)-[:P31]->(n2), \
      P279star: (n2)-[]->(n3)' \
    --return 'distinct n1 as node1, "P31P279star" as label, n3 as node2' \
    --order-by 'n1' \
    -o "$TEMP"/derived.P31P279star.tsv.gz

Add ids

In [36]:
if compute_p31p279_star:
    kgtk("""add-id 
            --id-style wikidata 
            -i "$TEMP"/derived.P31P279star.tsv.gz
            -o "$OUT"/derived.P31P279star.tsv.gz""")

It is also very big

In [36]:
if debug:
    !zcat < "$OUT"/derived.P31P279star.tsv.gz | wc

## Compute pagerank

Now compute pagerank. These commands will exceed 16GB memory for graphs containing over 25 million nodes.

In [37]:
if compute_pagerank:
    kgtk("""graph-statistics 
            -i "$GRAPH/claims.wikibase-item.tsv.gz" 
            -o $TEMP/metadata.pagerank.directed.tsv.gz 
            --compute-pagerank True 
            --compute-hits False 
            --page-rank-property Pdirected_pagerank 
            --output-degrees False 
            --use-mgzip True 
            --mgzip-threads 12 
            --output-pagerank True 
            --output-hits False 
            --output-statistics-only 
            --undirected False 
            --log-file $TEMP/metadata.pagerank.directed.summary.txt""")
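
For readers who want intuition for what the directed and undirected variants measure, here is a small power-iteration sketch in pure Python. It is an illustrative assumption, not the implementation behind `graph-statistics`; the function `pagerank`, its damping factor of 0.85, and the toy edges are all hypothetical choices.

```python
def pagerank(edges, damping=0.85, iterations=100, undirected=False):
    """Power-iteration PageRank over (source, target) pairs."""
    if undirected:
        # Treat every edge as traversable in both directions.
        edges = edges + [(dst, src) for src, dst in edges]
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is spread evenly.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        base = (1.0 - damping + damping * dangling) / len(nodes)
        new_rank = {node: base for node in nodes}
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank

# Hypothetical toy graph: A and B point to C, and C points to D.
toy_edges = [("A", "C"), ("B", "C"), ("C", "D")]
directed = pagerank(toy_edges)
undirected = pagerank(toy_edges, undirected=True)
print(directed)
print(undirected)
```

In the undirected variant the toy graph becomes a star centered on `C`, so `C` ranks highest and the three leaves tie; in the directed variant rank flows along the edge direction only.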

In [None]:
if compute_pagerank:
    directed_pagerank = kgtk("""
        query -i $TEMP/metadata.pagerank.directed.tsv.gz  
        --match '(n1)-[l:Pdirected_pagerank]->(pagerank)'
    """)

    directed_pagerank_sorted = directed_pagerank.sort_values("node2", ascending=False)
    directed_pagerank_sorted.insert(0, 'P1545', range(1, 1 + len(directed_pagerank_sorted)))
    directed_pagerank_sorted.to_csv(f"{os.environ['TEMP']}/directed-pagerank.ordinal.tsv", index=False, sep='\t')
    

    kgtk("""
        normalize -i "$TEMP"/directed-pagerank.ordinal.tsv
        / add-id --id-style wikidata 
        -o "$OUT"/metadata.pagerank.directed.tsv.gz
    """)
    kgtk("""
        head -i "$OUT"/metadata.pagerank.directed.tsv.gz
    """)

In [38]:
if compute_pagerank:
    !cat $TEMP/metadata.pagerank.directed.summary.txt

graph loaded! It has 94903511 nodes and 670635690 edges

*** Top relations:
P2860	285098156
P31	99559383
P1433	37893478
P50	22619544
P921	21565587
P17	14723889
P407	14494498
P131	11189895
P106	9239520
P6259	8076517

*** Degrees:
in degree stats: mean=7.066500, std=0.456495, max=1
out degree stats: mean=7.066500, std=0.001451, max=1
total degree stats: mean=14.133001, std=0.456502, max=1

*** PageRank
Max pageranks
7296	Q4167836	0.024407
30751	Q13442814	0.020599
2476	Q1860	0.007204
5853	Q5	0.006323
5852	Q11266439	0.005784


In [39]:
if compute_pagerank:
    kgtk("""graph-statistics 
            -i "$GRAPH/claims.wikibase-item.tsv.gz" 
            -o $TEMP/metadata.pagerank.undirected.tsv.gz 
            --compute-pagerank True 
            --compute-hits False 
            --page-rank-property Pundirected_pagerank
            --use-mgzip True 
            --mgzip-threads 12
            --output-degrees False 
            --output-pagerank True 
            --output-hits False 
            --output-statistics-only 
            --undirected True 
            --log-file $TEMP/metadata.pagerank.undirected.summary.txt""")

In [None]:
if compute_pagerank:
    undirected_pagerank = kgtk("""
        query -i $TEMP/metadata.pagerank.undirected.tsv.gz 
        --match '(n1)-[l:Pundirected_pagerank]->(pagerank)'
    """)

    undirected_pagerank = undirected_pagerank.sort_values("node2", ascending=False)
    undirected_pagerank.insert(0, 'P1545', range(1, 1 + len(undirected_pagerank)))
    undirected_pagerank.to_csv(f"{os.environ['TEMP']}/undirected-pagerank.ordinal.tsv", index=False, sep='\t')
    kgtk("""
        normalize -i "$TEMP"/undirected-pagerank.ordinal.tsv
        / add-id --id-style wikidata 
        -o "$OUT"/metadata.pagerank.undirected.tsv.gz
    """)
    kgtk("""
        head -i "$OUT"/metadata.pagerank.undirected.tsv.gz
    """)

In [40]:
if compute_pagerank:
    !cat $TEMP/metadata.pagerank.undirected.summary.txt 

graph loaded! It has 94903511 nodes and 670635690 edges

*** Top relations:
P2860	285098156
P31	99559383
P1433	37893478
P50	22619544
P921	21565587
P17	14723889
P407	14494498
P131	11189895
P106	9239520
P6259	8076517

*** Degrees:
in degree stats: mean=0.000000, std=0.000000, max=1
out degree stats: mean=14.133001, std=0.456502, max=1
total degree stats: mean=14.133001, std=0.456502, max=1

*** PageRank
Max pageranks
30751	Q13442814	0.029250
130053	Q1264450	0.013161
7296	Q4167836	0.012312
5853	Q5	0.008650
2476	Q1860	0.006818


## Compute Degrees

Kypher can compute the out degree by counting the node2s for each node1

In [41]:
if compute_degrees:
    !$kypher -i claims -o $TEMP/metadata.out_degree.tsv.gz \
    --match '(n1)-[l]->()' \
    --order-by 'n1' \
    --return 'distinct n1 as node1, count(distinct l) as node2, "Pout_degree" as label' 

In [42]:
if compute_degrees:
    kgtk("""add-id --id-style wikidata 
            -i $TEMP/metadata.out_degree.tsv.gz 
            -o $OUT/metadata.out_degree.tsv.gz""")

In [43]:
if compute_degrees:
    !zcat < $OUT/metadata.out_degree.tsv.gz | head | col

node1	node2	label	id
P10	20	Pout_degree	P10-Pout_degree-f5ca38
P1000	10	Pout_degree	P1000-Pout_degree-4a44dc
P10000	25	Pout_degree	P10000-Pout_degree-b7a568
P10001	30	Pout_degree	P10001-Pout_degree-624b60
P10002	21	Pout_degree	P10002-Pout_degree-6f4b66
P10003	20	Pout_degree	P10003-Pout_degree-f5ca38
P10004	23	Pout_degree	P10004-Pout_degree-535fa3
P10005	21	Pout_degree	P10005-Pout_degree-6f4b66
P10006	25	Pout_degree	P10006-Pout_degree-b7a568

gzip: stdout: Broken pipe


To count the in-degree we only care when the node2 is a wikibase-item

In [44]:
if compute_degrees:
    !$kypher -i claims -o $TEMP/metadata.in_degree.tsv.gz \
    --match '()-[l]->(n2 {`wikidatatype`:"wikibase-item"})' \
    --return 'distinct n2 as node1, count(distinct l) as node2, "Pin_degree" as label' \
    --order-by 'n2'
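
The two degree queries reduce to counting edges per source node and, for the in degree, per target node restricted to `wikibase-item` values. A minimal sketch on hypothetical toy claims (each tuple stands for one edge, with the datatype of `node2` as the fourth field):

```python
from collections import Counter

# Hypothetical toy claims: (node1, label, node2, datatype of node2).
claims = [
    ("Q1", "P31", "Q2", "wikibase-item"),
    ("Q1", "P569", "1990", "time"),
    ("Q3", "P31", "Q2", "wikibase-item"),
    ("Q3", "P279", "Q1", "wikibase-item"),
]

# Out degree: every edge counts for its node1, whatever the datatype of node2.
out_degree = Counter(node1 for node1, _, _, _ in claims)

# In degree: only edges whose node2 is an item count, so literals such as
# dates never show up as nodes with an in degree.
in_degree = Counter(
    node2 for _, _, node2, datatype in claims if datatype == "wikibase-item"
)

print(out_degree)
print(in_degree)
```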

In [45]:
if compute_degrees:
    kgtk("""add-id --id-style wikidata 
            -i $TEMP/metadata.in_degree.tsv.gz
            -o $OUT/metadata.in_degree.tsv.gz""")

In [46]:
if compute_degrees:
    !zcat < $OUT/metadata.in_degree.tsv.gz | head | col

node1	node2	label	id
Q1	104	Pin_degree	Q1-Pin_degree-5ef6fd
Q100	14133	Pin_degree	Q100-Pin_degree-ef9f82
Q1000	6812	Pin_degree	Q1000-Pin_degree-7536db
Q10000	2	Pin_degree	Q10000-Pin_degree-d4735e
Q100000 125	Pin_degree	Q100000-Pin_degree-0f8ef3
Q10000000	1	Pin_degree	Q10000000-Pin_degree-6b86b2
Q100000001	5	Pin_degree	Q100000001-Pin_degree-ef2d12
Q10000002	1	Pin_degree	Q10000002-Pin_degree-6b86b2
Q100000040	4	Pin_degree	Q100000040-Pin_degree-4b2277

gzip: stdout: Broken pipe


Calculate the distribution so we can make a nice chart

In [47]:
if compute_degrees:
    !$kypher -i $OUT/metadata.in_degree.tsv.gz -o $OUT/statistics.in_degree.distribution.tsv \
    --match '(n1)-[]->(n2)' \
    --return 'distinct n2 as Pin_degree, count(distinct n1) as count, "count" as label' \
    --order-by 'cast(n2, integer)' 
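
The distribution query groups nodes by their degree and orders the groups numerically. The sketch below, on hypothetical toy degrees, shows both the grouping and why the `cast(n2, integer)` in the order-by matters: degree values read from a file are strings, and strings sort lexicographically.

```python
from collections import Counter

# Hypothetical toy in degrees, keyed by node, as strings read from a TSV file.
degree_by_node = {"Q1": "1", "Q2": "10", "Q3": "1", "Q4": "9", "Q5": "1"}

# Number of nodes that share each degree value.
node_count_by_degree = Counter(degree_by_node.values())

# Lexicographic order puts "10" before "9"; casting to int restores numeric order.
lexicographic = sorted(node_count_by_degree)
numeric = sorted(node_count_by_degree, key=int)

distribution = [(int(degree), node_count_by_degree[degree]) for degree in numeric]
print(lexicographic)
print(distribution)
```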

In [48]:
if compute_degrees:
    !head $OUT/statistics.in_degree.distribution.tsv | col

Pin_degree	count	label
1	12410535	count
2	5079189 count
3	2954842 count
4	1981895 count
5	1530432 count
6	1212475 count
7	1008174 count
8	827467	count
9	706367	count


In [49]:
if compute_degrees:
    !$kypher -i $OUT/metadata.out_degree.tsv.gz -o $OUT/statistics.out_degree.distribution.tsv \
    --match '(n1)-[]->(n2)' \
    --return 'distinct n2 as Pout_degree, count(distinct n1) as count, "count" as label' \
    --order-by 'cast(n2, integer)' 

In [50]:
if compute_degrees:
    !head $OUT/statistics.out_degree.distribution.tsv | col

Pout_degree	count	label
1	6266209 count
2	2622464 count
3	2889122 count
4	3106569 count
5	4518981 count
6	6059016 count
7	5408942 count
8	5105646 count
9	6513341 count


Draw some charts

In [43]:
if debug:
    data = pd.read_csv(
        os.environ["OUT"] + "/statistics.in_degree.distribution.tsv", sep="\t"
    )

    alt.Chart(data).mark_circle(size=60).encode(
        x=alt.X("Pin_degree", scale=alt.Scale(type="log")),
        y=alt.Y("count", scale=alt.Scale(type="log"), title="count of nodes"),
        tooltip=["Pin_degree", "count"],
    ).interactive().properties(title="Distribution of In Degree")

In [44]:
if debug:
    data = pd.read_csv(
        os.environ["OUT"] + "/statistics.out_degree.distribution.tsv", sep="\t"
    )

    alt.Chart(data).mark_circle(size=60).encode(
        x=alt.X("Pout_degree", scale=alt.Scale(type="log")),
        y=alt.Y("count", scale=alt.Scale(type="log"), title="count of nodes"),
        tooltip=["Pout_degree", "count"],
    ).interactive().properties(title="Distribution of Out Degree")

## Summary of results

In [51]:
!ls -lh $OUT/*

-rw-r--r-- 1 amandeep isdstaff  21G May  6 23:11 /data/amandeep/wikidata-20220409/useful-files/useful-files/derived.isastar.tsv.gz
-rw-r--r-- 1 amandeep isdstaff 303M May  5 22:49 /data/amandeep/wikidata-20220409/useful-files/useful-files/derived.isa.tsv.gz
-rw-r--r-- 1 amandeep isdstaff 710M May  5 21:57 /data/amandeep/wikidata-20220409/useful-files/useful-files/derived.P279star.tsv.gz
-rw-r--r-- 1 amandeep isdstaff  42M May  5 20:10 /data/amandeep/wikidata-20220409/useful-files/useful-files/derived.P279.tsv.gz
-rw-r--r-- 1 amandeep isdstaff  22G May  8 00:24 /data/amandeep/wikidata-20220409/useful-files/useful-files/derived.P31P279star.tsv.gz
-rw-r--r-- 1 amandeep isdstaff 1.2G May  5 20:09 /data/amandeep/wikidata-20220409/useful-files/useful-files/derived.P31.tsv.gz
-rw-r--r-- 1 amandeep isdstaff 235M May  8 08:39 /data/amandeep/wikidata-20220409/useful-files/useful-files/metadata.in_degree.tsv.gz
-rw-r--r-- 1 amandeep isdstaff 610M May  8 06:52 /data/amandeep/wikidata-20220409/usef

Highest page rank

In [46]:
if debug:
    if compute_pagerank:
        !$kypher -i $OUT/metadata.pagerank.undirected.tsv.gz -i label \
        --match 'pagerank: (n1)-[:Pundirected_pagerank]->(page_rank), label: (n1)-[:label]->(label)' \
        --return 'distinct n1, label as label, page_rank as `undirected page rank`' \
        --order-by 'page_rank desc' \
        --limit 10 