## Step 6: Partition the files to follow the conventions KGTK uses for Wikidata

In [1]:
import sys  
sys.path.insert(0, 'tutorial')
from tutorial_setup import *
from generate_report import run

ALIAS: "/Users/pedroszekely/Downloads/kgtk-tutorial/miniwikidata/aliases.en.tsv.gz"
ALL: "/Users/pedroszekely/Downloads/kgtk-tutorial/miniwikidata/all.tsv.gz"
CLAIMS: "/Users/pedroszekely/Downloads/kgtk-tutorial/miniwikidata/claims.tsv.gz"
DESCRIPTION: "/Users/pedroszekely/Downloads/kgtk-tutorial/miniwikidata/descriptions.en.tsv.gz"
EXAMPLES_DIR: "/Users/pedroszekely/Documents/GitHub/kgtk/examples"
GE: "/Users/pedroszekely/Downloads/kgtk-tutorial/temp/graph-embedding"
ISA: "/Users/pedroszekely/Downloads/kgtk-tutorial/miniwikidata/derived.isa.tsv.gz"
ITEM: "/Users/pedroszekely/Downloads/kgtk-tutorial/miniwikidata/claims.wikibase-item.tsv.gz"
KGTK_PATH: "/Users/pedroszekely/Documents/GitHub/kgtk"
LABEL: "/Users/pedroszekely/Downloads/kgtk-tutorial/miniwikidata/labels.en.tsv.gz"
OUT: "/Users/pedroszekely/Downloads/kgtk-tutorial/output"
P279: "/Users/pedroszekely/Downloads/kgtk-tutorial/miniwikidata/derived.P279.tsv.gz"
P279STAR: "/Users/pedroszekely/Downloads/kgtk-tutorial/miniwikidata/de

In [2]:
%cd {output_path}

/Users/pedroszekely/Downloads/kgtk-tutorial


We'll use the partition-wikidata notebook to complete this step. The notebook expects a single input file that contains all edges and qualifiers together. We also need to specify a directory where the partitioned files will be created and a directory for temporary files. The temporary directory should be different from our own temp directory, because the partition notebook deletes any existing files in that folder.
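The directory requirements can be sketched as a small check. This is only an illustration: `partition_paths` is a hypothetical helper, not part of KGTK, and the example directories are placeholders for the values of `OUT` and `TEMP`. It builds the three paths and refuses a scratch folder that coincides with the shared temp directory.

```python
from pathlib import Path


def partition_paths(out_dir, temp_dir):
    """Build the paths handed to the partition notebook (hypothetical helper).

    The scratch folder lives under the parts folder, so clearing it
    cannot touch files in the shared temp directory.
    """
    out_dir, temp_dir = Path(out_dir), Path(temp_dir)
    paths = {
        "wikidata_input_path": temp_dir / "all_and_qualifiers.tsv.gz",
        "wikidata_parts_path": out_dir / "parts",
        "temp_folder_path": out_dir / "parts" / "temp",
    }
    if paths["temp_folder_path"].resolve() == temp_dir.resolve():
        raise ValueError("scratch folder must differ from the shared temp directory")
    return paths


# Placeholder directories; in the notebook these would come from
# os.environ["OUT"] and os.environ["TEMP"].
example = partition_paths("/tmp/kgtk-demo/output", "/tmp/kgtk-demo/temp")
for name, path in example.items():
    print(f"{name} = {path}")
```

Keeping the scratch folder inside the parts folder is one way to guarantee it is never the same folder as the shared temp directory.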

In [3]:
!mkdir -p $OUT/parts

Combine the main edges with the qualifiers:

In [4]:
!$kgtk cat -i $OUT/all.tsv.gz -i $OUT/Q154.qualifiers.tsv.gz -o $TEMP/all_and_qualifiers.tsv.gz

In [5]:
!zcat < $TEMP/all_and_qualifiers.tsv.gz | head

id	node1	label	node2
P10-P1628-32b85d-7927ece6-0	P10	P1628	"http://www.w3.org/2006/vcard/ns#Video"
P10-P1628-acf60d-b8950832-0	P10	P1628	"https://schema.org/video"
P10-P1629-Q34508-bcc39400-0	P10	P1629	Q34508
P10-P1659-P1651-c4068028-0	P10	P1659	P1651
P10-P1659-P18-5e4b9c4f-0	P10	P1659	P18
P10-P1659-P4238-d21d1ac0-0	P10	P1659	P4238
P10-P1659-P51-86aca4c5-0	P10	P1659	P51
P10-P1855-Q7378-555592a4-0	P10	P1855	Q7378
P10-P2302-Q21502404-d012aef4-0	P10	P2302	Q21502404
zcat: error writing to output: Broken pipe
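
As a rough illustration of what the combined file contains, the sketch below merges two gzipped KGTK edge files that share the same header, writing the header only once. It is a simplified stand-in, not how `kgtk cat` is implemented; the function name `concat_kgtk` and the sample rows are invented for the example.

```python
import gzip
import os
import tempfile


def concat_kgtk(inputs, output):
    """Concatenate gzipped TSV files that share one header line."""
    header = None
    with gzip.open(output, "wt", encoding="utf-8") as out:
        for path in inputs:
            with gzip.open(path, "rt", encoding="utf-8") as src:
                first = src.readline()
                if header is None:
                    header = first
                    out.write(header)
                elif first != header:
                    raise ValueError(f"header mismatch in {path}")
                for line in src:
                    out.write(line)


# Tiny demo files with invented rows; the qualifier row points at
# edge e1 through its node1 column.
workdir = tempfile.mkdtemp()
edges = os.path.join(workdir, "edges.tsv.gz")
quals = os.path.join(workdir, "quals.tsv.gz")
merged = os.path.join(workdir, "merged.tsv.gz")
with gzip.open(edges, "wt", encoding="utf-8") as f:
    f.write("id\tnode1\tlabel\tnode2\ne1\tQ1\tP31\tQ5\n")
with gzip.open(quals, "wt", encoding="utf-8") as f:
    f.write("id\tnode1\tlabel\tnode2\nq1\te1\tP580\t^2020-01-01T00:00:00Z/11\n")

concat_kgtk([edges, quals], merged)
with gzip.open(merged, "rt", encoding="utf-8") as f:
    merged_lines = f.read().splitlines()
print(len(merged_lines))
```

The header check is deliberately strict in this sketch: files with different columns are rejected rather than reconciled.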


In [6]:
pm.execute_notebook(
    os.environ["EXAMPLES_DIR"] + "/partition-wikidata.ipynb",
    os.environ["TEMP"] + "/partition-wikidata.out.ipynb",
    parameters=dict(
        wikidata_input_path = os.environ["TEMP"] + "/all_and_qualifiers.tsv.gz",
        wikidata_parts_path = os.environ["OUT"] + "/parts",
        temp_folder_path = os.environ["OUT"] + "/parts/temp",
        sort_extras = "--buffer-size 30% --temporary-directory $OUT/parts/temp",
        verbose = False
    )
)
;

Executing:   0%|          | 0/49 [00:00<?, ?cell/s]

''

The partition-wikidata notebook created the following partitioned KGTK files:

In [7]:
!ls $OUT/parts

aliases.en.tsv.gz                   metadata.property.datatypes.tsv.gz
aliases.tsv.gz                      metadata.types.tsv.gz
all.tsv.gz                          qualifiers.commonsMedia.tsv.gz
claims.commonsMedia.tsv.gz          qualifiers.external-id.tsv.gz
claims.external-id.tsv.gz           qualifiers.geo-shape.tsv.gz
claims.geo-shape.tsv.gz             qualifiers.globe-coordinate.tsv.gz
claims.globe-coordinate.tsv.gz      qualifiers.math.tsv.gz
claims.math.tsv.gz                  qualifiers.monolingualtext.tsv.gz
claims.monolingualtext.tsv.gz       qualifiers.musical-notation.tsv.gz
claims.musical-notation.tsv.gz      qualifiers.quantity.tsv.gz
claims.other.tsv.gz                 qualifiers.string.tsv.gz
claims.quantity.tsv.gz              qualifiers.tabular-data.tsv.gz
claims.string.tsv.gz                qualifiers.time.tsv.gz
claims.tabular-data.tsv.gz          qualifiers.tsv.gz
claims.time.tsv.gz                  qualifiers.url.tsv.gz
claims.tsv.gz                       quali

In [8]:
!$kypher -i $OUT/parts/claims.tsv.gz \
--match '(n1)-[]->()' \
--return 'count(distinct n1)'

count(DISTINCT graph_20_c1."node1")
15513
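
For readers less familiar with Kypher: the pattern `(n1)-[]->()` matches every edge, so the query counts the distinct `node1` values in the claims file. A plain-Python sketch of the same count follows, assuming a gzipped TSV whose header includes a `node1` column. The name `count_distinct_node1` and the demo rows are chosen for this example only.

```python
import csv
import gzip
import os
import tempfile


def count_distinct_node1(path):
    """Count the distinct values in the node1 column of a gzipped KGTK file."""
    with gzip.open(path, "rt", encoding="utf-8", newline="") as f:
        # QUOTE_NONE keeps KGTK string literals such as "https://..." intact
        reader = csv.DictReader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        return len({row["node1"] for row in reader})


# Demo file: two edges with subject P10 and one with an invented subject Q1
demo_path = os.path.join(tempfile.mkdtemp(), "claims.demo.tsv.gz")
with gzip.open(demo_path, "wt", encoding="utf-8") as f:
    f.write("id\tnode1\tlabel\tnode2\n")
    f.write('a\tP10\tP1628\t"https://schema.org/video"\n')
    f.write("b\tP10\tP1629\tQ34508\n")
    f.write("c\tQ1\tP31\tQ5\n")

distinct = count_distinct_node1(demo_path)
print(distinct)
```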


## Step 7: Run the Useful Files notebook

In [9]:
pm.execute_notebook(
    os.environ["USECASE_DIR"] + "/Wikidata Useful Files.ipynb",
    os.environ["TEMP"] + "/Wikidata Useful Files Out.ipynb",
    parameters=dict(
        output_path = os.environ["OUT"],
        output_folder = "useful_files",
        temp_folder = "temp.useful_files",
        wiki_root_folder = os.environ["OUT"] + "/parts/",
        cache_path = os.environ["OUT"] + "/temp.useful_files",
        languages = 'en',
        compute_pagerank = True,
        delete_database = True
    )
)
;

Executing:   0%|          | 0/96 [00:00<?, ?cell/s]

''

The Useful Files notebook created the following files:

In [10]:
!ls -lh $OUT/useful_files

total 15904
-rw-r--r--  1 pedroszekely  staff   628K Jan 24 13:13 aliases.en.tsv.gz
-rw-r--r--  1 pedroszekely  staff   148K Jan 24 13:13 derived.P279.tsv.gz
-rw-r--r--  1 pedroszekely  staff   1.7M Jan 24 13:14 derived.P279star.tsv.gz
-rw-r--r--  1 pedroszekely  staff   176K Jan 24 13:13 derived.P31.tsv.gz
-rw-r--r--  1 pedroszekely  staff    99K Jan 24 13:14 derived.isa.tsv.gz
-rw-r--r--  1 pedroszekely  staff   678K Jan 24 13:13 descriptions.en.tsv.gz
-rw-r--r--  1 pedroszekely  staff   627K Jan 24 13:13 labels.en.tsv.gz
-rw-r--r--  1 pedroszekely  staff   205K Jan 24 13:14 metadata.in_degree.tsv.gz
-rw-r--r--  1 pedroszekely  staff   114K Jan 24 13:14 metadata.out_degree.tsv.gz
-rw-r--r--  1 pedroszekely  staff   1.0M Jan 24 13:14 metadata.pagerank.directed.tsv.gz
-rw-r--r--  1 pedroszekely  staff   1.1M Jan 24 13:14 metadata.pagerank.undirected.tsv.gz
-rw-r--r--  1 pedroszekely  staff   1.2K Jan 24 13:14 statistics.in_degree.distribution.tsv
-rw-r--r--  1 pedroszekely  staff   3.3

Look at the distribution of out-degrees:

In [11]:
pd.read_table(os.environ['OUT']+'/useful_files/statistics.out_degree.distribution.tsv')

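| Unnamed: 0 | out_degree | count | label |
|---|---|---|---|
| 0 | 1 | 6206 | count |
| 1 | 2 | 3820 | count |
| 2 | 3 | 761 | count |
| 3 | 4 | 460 | count |
| 4 | 5 | 424 | count |
| ... | ... | ... | ... |
| 276 | 1140 | 1 | count |
| 277 | 1246 | 1 | count |
| 278 | 1256 | 1 | count |
| 279 | 1356 | 1 | count |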
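
To make the table concrete: the out-degree of a node is the number of edges in which it appears as `node1`, and the distribution counts how many nodes have each out-degree. The sketch below computes this for a handful of invented edges. It is not the code used by the Useful Files notebook, and it makes no assumption about which edges that notebook includes in its counts.

```python
from collections import Counter

# Invented (node1, label, node2) edges for illustration
edges = [
    ("Q1", "P31", "Q5"),
    ("Q1", "P27", "Q30"),
    ("Q2", "P31", "Q5"),
    ("Q3", "P31", "Q5"),
    ("Q3", "P106", "Q82955"),
    ("Q3", "P27", "Q30"),
    ("Q4", "P31", "Q5"),
]

# Out-degree per node: how many edges start at each node1
out_degree = Counter(node1 for node1, _, _ in edges)

# Distribution: how many nodes share each out-degree value
distribution = Counter(out_degree.values())

rows = sorted(distribution.items())
for degree, node_count in rows:
    print(degree, node_count)
```

In the tutorial's table, the first row reads the same way: 6206 nodes have an out-degree of 1.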


## Step 8: Run the Knowledge Graph Profiler

In [12]:
# The ; at the end suppresses this cell's output, which would otherwise be the very large JSON object returned by executing the profiler notebook.
pm.execute_notebook(
    os.environ["USECASE_DIR"] + "/Knowledge-Graph-Profiler.ipynb",
    "Knowledge-Graph-Profiler.out.ipynb",
    parameters=dict(
        wikidata_parts_folder = os.environ["OUT"] + "/parts",
        cache_folder = os.environ['TEMP'] + "/profiler_temp",
        output_folder = os.environ["OUT"] + "/profiler",
        compute_graph_statistics = "true"
    )
)
;

Executing:   0%|          | 0/76 [00:00<?, ?cell/s]

''

[Knowledge Graph Profiler output](Knowledge-Graph-Profiler.out.ipynb)

### Generate a report on the Profiler output

In [13]:
run(f'{os.environ["OUT"]}/profiler')

See the [profiler report](report.html) of the main classes and properties in our KG.