# Calculating PageRank on Wikidata

In [1]:
import numpy as np
import pandas as pd
import os

In [2]:
%env MY=/Users/pedroszekely/Downloads
%env WD=/Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidata-20200504

env: MY=/Users/pedroszekely/Downloads
env: WD=/Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidata-20200504


We need to filter the Wikidata edge file to remove all edges where `node2` is a literal.
We can do this by running `ifexists` to keep only the edges where `node2` also appears in `node1`.
This takes 2-3 hours on a laptop.
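
To make the filter concrete before running it on the full file, here is a minimal pandas sketch of the same idea on a toy table. The rows are made up for illustration, and this only mimics the intent described above; it is not how `kgtk ifexists` is implemented.

```python
import pandas as pd

# Toy edge table; the values are invented for illustration only.
edges = pd.DataFrame(
    {
        "node1": ["Q1", "Q1", "Q2", "Q3"],
        "label": ["P31", "P569", "P279", "P31"],
        "node2": ["Q2", "1952-03-11", "Q3", "Q99"],
    }
)

# Keep only edges whose node2 value also occurs somewhere in the node1 column.
subjects = set(edges["node1"])
item_edges = edges[edges["node2"].isin(subjects)].reset_index(drop=True)
print(item_edges)
```

In this toy table the date literal is dropped, but so is the edge pointing to `Q99`, because `Q99` never appears as a `node1`. The same would apply to any item in the real file that has no outgoing edges of its own.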

In [None]:
!time gzcat "$WD/wikidata_edges_20200504.tsv.gz" \
  | pv -p -s 86582473531 \
  | kgtk ifexists --filter-on "$WD/wikidata_edges_20200504.tsv.gz" --input-keys node2 --filter-keys node1 \
  | gzip > "$MY/wikidata-item-edges.tsv.gz"

We have about 460 million edges that connect items to other items. Let's make sure this is what we want before spending a lot of time computing PageRank.

In [10]:
!gzcat $MY/wikidata-item-edges.tsv.gz | wc

 460763981 3225347876 32869769062


In [13]:
!gzcat $MY/wikidata-item-edges.tsv.gz | head

id	node1	label	node2	rank	node2;magnitude	node2;unit	node2;date	node2;item	node2;lower	node2;upper	node2;latitude	node2;longitude	node2;precision	node2;calendar	node2;entity-type
Q8-P31-1	Q8	P31	Q331769	normal				Q331769							item
Q8-P31-2	Q8	P31	Q60539479	normal				Q60539479							item
Q8-P31-3	Q8	P31	Q9415	normal				Q9415							item
Q8-P1343-1	Q8	P1343	Q20743760	normal				Q20743760							item
Q8-P1343-2	Q8	P1343	Q1970746	normal				Q1970746							item
Q8-P1343-3	Q8	P1343	Q19180675	normal				Q19180675							item
Q8-P461-1	Q8	P461	Q169251	normal				Q169251							item
Q8-P279-1	Q8	P279	Q16748867	normal				Q16748867							item
Q8-P460-1	Q8	P460	Q935526	normal				Q935526							item
gzcat: error writing to output: Broken pipe
gzcat: /Users/pedroszekely/Downloads/wikidata-item-edges.tsv.gz: uncompress failed


Let's do a sanity check to make sure that we have the edges that we want.
We can do this by counting the edges for each `node2;entity-type` value.
Good news: we only have items and properties.

In [25]:
!time gzcat $MY/wikidata-item-edges.tsv.gz | kgtk unique $MY/wikidata-item-edges.tsv.gz --column 'node2;entity-type'

node1	label	node2
item	count	460737401
property	count	26579
gzcat: error writing to output: Broken pipe
gzcat: /Users/pedroszekely/Downloads/wikidata-item-edges.tsv.gz: uncompress failed

real	20m30.673s
user	20m21.082s
sys	0m5.851s
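
The same count can be sketched in pandas by reading the file in chunks, so that only one column of one chunk is in memory at a time. The in-memory sample below is a made-up stand-in for the real file, and the chunk size is an arbitrary choice; for the real data one would pass the path of the gzipped TSV to `read_csv` instead.

```python
import io
from collections import Counter

import pandas as pd

# Toy stand-in for the edge file; rows are invented for illustration.
sample = io.StringIO(
    "node1\tlabel\tnode2\tnode2;entity-type\n"
    "Q8\tP31\tQ331769\titem\n"
    "Q8\tP279\tQ16748867\titem\n"
    "P31\tP1659\tP279\tproperty\n"
)

counts = Counter()
# chunksize makes read_csv return an iterator of DataFrames,
# so memory use is bounded by the chunk size rather than the file size.
for chunk in pd.read_csv(
    sample, sep="\t", usecols=["node2;entity-type"], dtype=str, chunksize=2
):
    counts.update(chunk["node2;entity-type"].value_counts().to_dict())

print(dict(counts))
```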


We only need `node1`, `label` and `node2`, so let's remove the other columns.

In [41]:
!time gzcat $MY/wikidata-item-edges.tsv.gz | kgtk remove_columns -c 'id,rank,node2;magnitude,node2;unit,node2;date,node2;item,node2;lower,node2;upper,node2;latitude,node2;longitude,node2;precision,node2;calendar,node2;entity-type' \
  | gzip > $MY/wikidata-item-edges-only.tsv.gz


real	34m58.124s
user	55m51.261s
sys	2m33.186s
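
An alternative way to express this step is to name the columns to keep rather than the ones to drop. The following is a rough pandas sketch on made-up data, using in-memory buffers in place of the real input and output files; it is an illustration of the idea, not a replacement for the `kgtk` command above.

```python
import io

import pandas as pd

keep = ["node1", "label", "node2"]

# Toy stand-in for the wide edge file; rows are invented for illustration.
sample = io.StringIO(
    "id\tnode1\tlabel\tnode2\trank\tnode2;entity-type\n"
    "Q8-P31-1\tQ8\tP31\tQ331769\tnormal\titem\n"
    "Q8-P279-1\tQ8\tP279\tQ16748867\tnormal\titem\n"
    "Q8-P460-1\tQ8\tP460\tQ935526\tnormal\titem\n"
)

out = io.StringIO()
write_header = True
for chunk in pd.read_csv(sample, sep="\t", usecols=keep, dtype=str, chunksize=2):
    # usecols does not guarantee column order, so select explicitly.
    chunk[keep].to_csv(out, sep="\t", index=False, header=write_header)
    write_header = False  # only the first chunk writes the header row

print(out.getvalue())
```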


In [4]:
!gzcat $MY/wikidata-item-edges-only.tsv.gz | head

gzcat: can't stat: /Users/pedroszekely/Downloads/wikidata-item-edges-only.tsv.gz (/Users/pedroszekely/Downloads/wikidata-item-edges-only.tsv.gz.gz): No such file or directory


In [44]:
!gunzip $MY/wikidata-item-edges-only.tsv.gz

The `kgtk graph_statistics` command computes PageRank along with in- and out-degrees. On the full edge file it runs out of memory on a laptop with 16 GB of RAM and the process is killed.

In [12]:
!time kgtk graph_statistics --directed --degrees --pagerank --log $MY/log.txt $MY/wikidata-item-edges-only.tsv > $MY/wikidata-pagerank-degrees.tsv

/bin/sh: line 1: 89795 Killed: 9               kgtk graph_statistics --directed --degrees --pagerank --log $MY/log.txt $MY/wikidata-item-edges-only.tsv > $MY/wikidata-pagerank-degrees.tsv

real	32m57.832s
user	19m47.624s
sys	8m58.352s
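
For reference, the quantities requested from `graph_statistics` can be sketched on a tiny graph with NumPy. This is the textbook power-iteration form of PageRank and is not necessarily how `kgtk graph_statistics` computes it. The function name `pagerank`, the damping factor of 0.85, and the uniform redistribution of rank from nodes without outgoing edges are all assumptions made for this illustration.

```python
import numpy as np


def pagerank(edges, damping=0.85, tol=1e-10, max_iter=100):
    """Power-iteration PageRank on a list of (source, target) pairs."""
    nodes = sorted({node for edge in edges for node in edge})
    index = {node: i for i, node in enumerate(nodes)}
    n = len(nodes)

    src = np.array([index[s] for s, _ in edges])
    dst = np.array([index[t] for _, t in edges])
    out_degree = np.bincount(src, minlength=n)
    in_degree = np.bincount(dst, minlength=n)

    rank = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        # Each edge carries rank[source] / out_degree[source] to its target.
        flow = np.bincount(dst, weights=rank[src] / out_degree[src], minlength=n)
        # Rank held by nodes with no outgoing edges is spread uniformly.
        dangling = rank[out_degree == 0].sum()
        new_rank = (1 - damping) / n + damping * (flow + dangling / n)
        converged = np.abs(new_rank - rank).sum() < tol
        rank = new_rank
        if converged:
            break

    return (
        dict(zip(nodes, rank.tolist())),
        dict(zip(nodes, in_degree.tolist())),
        dict(zip(nodes, out_degree.tolist())),
    )


# Toy directed graph; the edge labels are ignored, as only the links matter here.
toy_edges = [("Q8", "Q331769"), ("Q8", "Q9415"), ("Q9415", "Q331769")]
ranks, in_deg, out_deg = pagerank(toy_edges)
print(ranks, in_deg, out_deg)
```

The sketch keeps only three arrays of length equal to the number of nodes plus two of length equal to the number of edges, which gives a feel for why the full graph with hundreds of millions of edges is demanding on a 16 GB machine.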


In [11]:
!tail  $MY/wikidata-pagerank-degrees.tsv

Q1693	vertex_pagerank	0.0012109738177124639	Q1693-vertex_pagerank-2447
Q586	vertex_in_degree	1	Q586-vertex_in_degree-2448
Q586	vertex_out_degree	0	Q586-vertex_out_degree-2449
Q586	vertex_pagerank	0.0012109738177124639	Q586-vertex_pagerank-2450
Q64	vertex_in_degree	1	Q64-vertex_in_degree-2451
Q64	vertex_out_degree	0	Q64-vertex_out_degree-2452
Q64	vertex_pagerank	0.0012109738177124639	Q64-vertex_pagerank-2453
Q569998	vertex_in_degree	1	Q569998-vertex_in_degree-2454
Q569998	vertex_out_degree	0	Q569998-vertex_out_degree-2455
Q569998	vertex_pagerank	0.0012109738177124639	Q569998-vertex_pagerank-2456


In [8]:
!head $MY/wikidata-item-edges-only-1000.tsv

node1	label	node2
Q8	P31	Q331769
Q8	P31	Q60539479
Q8	P31	Q9415
Q8	P1343	Q20743760
Q8	P1343	Q1970746
Q8	P1343	Q19180675
Q8	P461	Q169251
Q8	P279	Q16748867
Q8	P460	Q935526
