# Profile The Tutorial Graph



In [1]:
import io
import os
import subprocess
import sys

import numpy as np
import pandas as pd
from IPython.display import display, HTML

import papermill as pm

sys.path.insert(0,'..')
from configure_kgtk_notebooks import ConfigureKGTK

from kgtk.functions import kgtk, kypher

In [2]:
# Parameters

kgtk_path = "/Users/pedroszekely/Documents/GitHub/kgtk"

# Folder on local machine where to create the output and temporary folders
input_path = "/Users/pedroszekely/Documents/GitHub/kgtk-tutorial-files/datasets/arnold"
output_path = "/Users/pedroszekely/Downloads/kypher/projects"
project_name = "tutorial-profiling"

These are all the files that we have, but I am tempted to use only the `all` file, as that keeps the tutorial simpler.

In [11]:
files = [
    "all",
    "label",
    "alias",
    "description",
    "external_id",
    "monolingualtext",
    "quantity",
    "string",
    "time",
    "item",
    "wikibase_property",
    "qualifiers",
    "datatypes",
    "p279",
    "p279star",
    "p31",
    "in_degree",
    "out_degree",
    "pagerank_directed",
    "pagerank_undirected"
]
ck = ConfigureKGTK(kgtk_path=kgtk_path)
ck.configure_kgtk(input_graph_path=input_path,
                  output_path=output_path,
                  project_name=project_name)

User home: /Users/pedroszekely
Current dir: /Users/pedroszekely/Documents/GitHub/kgtk/tutorial
KGTK dir: /Users/pedroszekely/Documents/GitHub/kgtk
Use-cases dir: /Users/pedroszekely/Documents/GitHub/kgtk/use-cases


In [12]:
ck.print_env_variables(files)

kypher: kgtk query --graph-cache /Users/pedroszekely/Downloads/kypher/projects/tutorial-profiling/temp.tutorial-profiling/wikidata.sqlite3.db
GRAPH: /Users/pedroszekely/Documents/GitHub/kgtk-tutorial-files/datasets/arnold
EXAMPLES_DIR: /Users/pedroszekely/Documents/GitHub/kgtk/examples
kgtk: kgtk
STORE: /Users/pedroszekely/Downloads/kypher/projects/tutorial-profiling/temp.tutorial-profiling/wikidata.sqlite3.db
TEMP: /Users/pedroszekely/Downloads/kypher/projects/tutorial-profiling/temp.tutorial-profiling
USE_CASES_DIR: /Users/pedroszekely/Documents/GitHub/kgtk/use-cases
OUT: /Users/pedroszekely/Downloads/kypher/projects/tutorial-profiling
all: /Users/pedroszekely/Documents/GitHub/kgtk-tutorial-files/datasets/arnold/all.tsv.gz
label: /Users/pedroszekely/Documents/GitHub/kgtk-tutorial-files/datasets/arnold/labels.en.tsv.gz
alias: /Users/pedroszekely/Documents/GitHub/kgtk-tutorial-files/datasets/arnold/aliases.en.tsv.gz
description: /Users/pedroszekely/Documents/GitHub/kgtk-tutorial-files/

Set up the KGTK defaults:

In [25]:
os.environ['kgtk_path'] = kgtk_path
os.environ['KGTK_GRAPH_CACHE'] = os.environ['STORE']
os.environ['KGTK_LABEL_FILE'] = input_path + "/labels.en.tsv.gz"
os.environ['KGTK_OPTION_DEBUG'] = "false"
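
The same defaults can be collected in one place before they are applied. The sketch below is only an illustration: `kgtk_defaults` is a hypothetical helper (not part of KGTK or `ConfigureKGTK`), and the `/path/to/...` values are placeholders for the real locations used in this notebook.

```python
import os

def kgtk_defaults(kgtk_path, store_path, input_path):
    """Return the environment defaults used in this notebook as a dict.

    Hypothetical helper for illustration; the variable names are the ones
    set in the cell above.
    """
    return {
        "kgtk_path": kgtk_path,
        "KGTK_GRAPH_CACHE": store_path,
        # os.path.join avoids hand-written path separators
        "KGTK_LABEL_FILE": os.path.join(input_path, "labels.en.tsv.gz"),
        "KGTK_OPTION_DEBUG": "false",
    }

# Placeholder paths, assumed for the example only
defaults = kgtk_defaults(
    kgtk_path="/path/to/kgtk",
    store_path="/path/to/temp/wikidata.sqlite3.db",
    input_path="/path/to/arnold",
)
os.environ.update(defaults)
```

Keeping the values in a dictionary makes it easy to print or inspect them before they reach `os.environ`.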

Load all the files into the Kypher cache so that all graph aliases are defined:

In [26]:
ck.load_files_into_cache(file_list=files)

kgtk query --graph-cache /Users/pedroszekely/Downloads/kypher/projects/tutorial-profiling/temp.tutorial-profiling/wikidata.sqlite3.db -i "/Users/pedroszekely/Documents/GitHub/kgtk-tutorial-files/datasets/arnold/all.tsv.gz" --as all  -i "/Users/pedroszekely/Documents/GitHub/kgtk-tutorial-files/datasets/arnold/labels.en.tsv.gz" --as label  -i "/Users/pedroszekely/Documents/GitHub/kgtk-tutorial-files/datasets/arnold/aliases.en.tsv.gz" --as alias  -i "/Users/pedroszekely/Documents/GitHub/kgtk-tutorial-files/datasets/arnold/descriptions.en.tsv.gz" --as description  -i "/Users/pedroszekely/Documents/GitHub/kgtk-tutorial-files/datasets/arnold/claims.external-id.tsv.gz" --as external_id  -i "/Users/pedroszekely/Documents/GitHub/kgtk-tutorial-files/datasets/arnold/claims.monolingualtext.tsv.gz" --as monolingualtext  -i "/Users/pedroszekely/Documents/GitHub/kgtk-tutorial-files/datasets/arnold/claims.quantity.tsv.gz" --as quantity  -i "/Users/pedroszekely/Documents/GitHub/kgtk-tutorial-files/data

In [18]:
%cd {os.environ['OUT']}

/Users/pedroszekely/Downloads/kypher/projects/tutorial-profiling


## Get instance counts



We can compute the instance counts by retrieving all statements that use `instance of (P31)` and counting the distinct instances of each class:

In [27]:
%%time
kgtk("""
    query -i all
        --match '(instance)-[:P31]->(class)'
        --return 'class as class, count(distinct instance) as count'
        --order-by 'cast(count, int) desc'
        --limit 10 
    / add-labels
""")

CPU times: user 5.15 ms, sys: 10.3 ms, total: 15.5 ms
Wall time: 1.13 s


| | class | count | class;label |
|---|---|---|---|
| 0 | Q5 | 10918 | 'human'@en |
| 1 | Q15221623 | 3176 | 'bilateral relation'@en |
| 2 | Q11424 | 2126 | 'film'@en |
| 3 | Q4022 | 1547 | 'river'@en |
| 4 | Q3918 | 778 | 'university'@en |
| 5 | Q3917681 | 613 | 'embassy'@en |
| 6 | Q1549591 | 590 | 'big city'@en |
| 7 | Q19595382 | 583 | 'Wikidata property for authority control for p... |
| 8 | Q11862829 | 530 | 'academic discipline'@en |
| 9 | Q15632617 | 493 | 'fictional human'@en |
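
To make the aggregation concrete, here is a minimal plain-Python sketch of what the query computes: for every `P31` edge, group by class and count distinct instances. The function name `count_instances` and the toy edges are made up for illustration; the real work is done by Kypher inside the graph cache.

```python
from collections import defaultdict

def count_instances(edges):
    """Count distinct instances per class from (node1, label, node2) edges."""
    instances_by_class = defaultdict(set)
    for node1, label, node2 in edges:
        if label == "P31":
            # A set keeps each instance once, like count(distinct instance)
            instances_by_class[node2].add(node1)
    counts = {cls: len(members) for cls, members in instances_by_class.items()}
    # Largest counts first, mirroring the --order-by clause
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)

# Toy edges, invented for this example
toy_edges = [
    ("Q_alice", "P31", "Q5"),
    ("Q_bob", "P31", "Q5"),
    ("Q_bob", "P31", "Q5"),          # duplicate statement, counted once
    ("Q_nile", "P31", "Q4022"),
    ("Q_alice", "P106", "Q_actor"),  # not a P31 edge, ignored
]
print(count_instances(toy_edges))  # [('Q5', 2), ('Q4022', 1)]
```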


We want to add the profiling data back into the KG so that we can use it in queries and look at it in the browser.
To do so, we create a KGTK graph by using `node1, label, node2` as column headers:

In [29]:
%%time
kgtk("""
    query -i all
        --match '(instance)-[:P31]->(class)'
        --return 'class as node1, "P31_count" as label, count(distinct instance) as node2'
        --order-by 'cast(node2, int) desc'
        --limit 10
""")

CPU times: user 4.83 ms, sys: 10.7 ms, total: 15.5 ms
Wall time: 759 ms


| | node1 | label | node2 |
|---|---|---|---|
| 0 | Q5 | P31_count | 10918 |
| 1 | Q15221623 | P31_count | 3176 |
| 2 | Q11424 | P31_count | 2126 |
| 3 | Q4022 | P31_count | 1547 |
| 4 | Q3918 | P31_count | 778 |
| 5 | Q3917681 | P31_count | 613 |
| 6 | Q1549591 | P31_count | 590 |
| 7 | Q19595382 | P31_count | 583 |
| 8 | Q11862829 | P31_count | 530 |
| 9 | Q15632617 | P31_count | 493 |
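
The shape of the resulting graph can be sketched without KGTK: a tab-separated table whose header row is `node1`, `label`, `node2`, with one edge per line. The helper `counts_to_kgtk_tsv` below is hypothetical and only illustrates the layout; the two rows are taken from the output above.

```python
import csv
import io

def counts_to_kgtk_tsv(counts, label="P31_count"):
    """Render (class, count) pairs as tab-separated edges with a KGTK-style header."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["node1", "label", "node2"])
    for cls, count in counts:
        # One edge per class: class --label--> count
        writer.writerow([cls, label, count])
    return buffer.getvalue()

tsv_text = counts_to_kgtk_tsv([("Q5", 10918), ("Q15221623", 3176)])
print(tsv_text)
```

Because the counts are stored as ordinary edges, they can be queried with the same `--match` patterns as any other statement in the graph.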


It is good practice to add identifiers to the edges so that we can add qualifiers later if we desire:

In [59]:
%%time
kgtk("""
    query -i all
        --match '(instance)-[:P31]->(class)'
        --return 'class as node1, "P31count" as label, count(distinct instance) as node2'
        --order-by 'cast(node2, int) desc' 
    / add-id --id-style wikidata
""")

CPU times: user 43.6 ms, sys: 19.2 ms, total: 62.8 ms
Wall time: 1.19 s


| | node1 | label | node2 | id |
|---|---|---|---|---|
| 0 | Q5 | P31count | 10918 | Q5-P31count-2bf374 |
| 1 | Q15221623 | P31count | 3176 | Q15221623-P31count-73e7f3 |
| 2 | Q11424 | P31count | 2126 | Q11424-P31count-d8adfb |
| 3 | Q4022 | P31count | 1547 | Q4022-P31count-05fb3c |
| 4 | Q3918 | P31count | 778 | Q3918-P31count-93411f |
| ... | ... | ... | ... | ... |
| 4965 | Q996839 | P31count | 1 | Q996839-P31count-6b86b2 |
| 4966 | Q99934885 | P31count | 1 | Q99934885-P31count-6b86b2 |
| 4967 | Q99935030 | P31count | 1 | Q99935030-P31count-6b86b2 |
| 4968 | Q99960791 | P31count | 1 | Q99960791-P31count-6b86b2 |
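
In the output above, every edge whose `node2` is `1` ends in the same suffix, `6b86b2`, which is also how the SHA-256 digest of the string `1` begins. This suggests that the `wikidata` ID style has the form `node1-label-<hash prefix of node2>`. The sketch below reproduces that pattern under this assumption; `wikidata_style_id` is a hypothetical helper, not KGTK code, and it has only been checked against the count-1 rows shown here.

```python
import hashlib

def wikidata_style_id(node1, label, node2, digits=6):
    """Build an edge ID as node1-label-<first hex digits of SHA-256(node2)>.

    Assumption: this mirrors the pattern visible in the add-id output above;
    it is not taken from the KGTK source.
    """
    digest = hashlib.sha256(str(node2).encode("utf-8")).hexdigest()
    return f"{node1}-{label}-{digest[:digits]}"

print(wikidata_style_id("Q996839", "P31count", 1))  # Q996839-P31count-6b86b2
```

An ID derived from the edge's own content is stable across runs, which is what makes it usable as an anchor for qualifiers added later.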


Now that we have seen the individual steps, here is the complete query, which also writes the result to a file:

In [60]:
%%time
kgtk("""
    query -i all
        --match '(instance)-[:P31]->(class)'
        --return 'class as node1, "P31count" as label, count(distinct instance) as node2'
        --order-by 'cast(node2, int) desc'
    / add-id --id-style wikidata
    -o $OUT/metadata.p31.count.tsv
""")

CPU times: user 3.07 ms, sys: 12.4 ms, total: 15.5 ms
Wall time: 1.01 s


Confirm that the output file went to the right place:

In [61]:
!ls -l $OUT

total 528
-rw-r--r--  1 pedroszekely  staff  224219 Oct  8 18:26 metadata.p31.count.tsv
drwxr-xr-x  3 pedroszekely  staff      96 Oct  8 18:19 temp.tutorial-profiling


Load the `P31count` graph into the KGTK cache so that we can use it in later queries:

In [62]:
kgtk("""
    query -i $OUT/metadata.p31.count.tsv --as p31count --limit 2
""")

| | node1 | label | node2 | id |
|---|---|---|---|---|
| 0 | Q5 | P31count | 10918 | Q5-P31count-2bf374 |
| 1 | Q15221623 | P31count | 3176 | Q15221623-P31count-73e7f3 |
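
Outside of Kypher, the same file can be read as an ordinary tab-separated table. The sketch below is a minimal illustration using only the standard library; `load_counts` is a hypothetical helper, and the in-memory sample stands in for `metadata.p31.count.tsv` using the two rows shown above.

```python
import csv
import io

def load_counts(tsv_file):
    """Map each class (node1) to its integer count (node2) from a KGTK-style TSV."""
    # QUOTE_NONE: treat quote characters as plain text, since values are tab-delimited
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    return {row["node1"]: int(row["node2"]) for row in reader}

# In-memory stand-in for the output file, copied from the rows above
sample = io.StringIO(
    "node1\tlabel\tnode2\tid\n"
    "Q5\tP31count\t10918\tQ5-P31count-2bf374\n"
    "Q15221623\tP31count\t3176\tQ15221623-P31count-73e7f3\n"
)
counts = load_counts(sample)
print(counts["Q5"])  # 10918
```

For the real file, an open file handle can be passed in place of the `io.StringIO` object.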


## Count the instances of a class including the instances of all the subclasses

Approach:
- get the class of each instance
- get all the superclasses of the class of each instance
- for every superclass, count all the instances

> This query will run at the scale of all Wikidata, which contains millions of classes

We add the labels to make the results readable. Not surprisingly, `entity` has the most instances, and the top classes are the most general ones, such as `collection entity`, `set`, and `group`.

In [63]:
%%time
kgtk("""
    query -i all
        --match '
            (instance)-[:P31]->(class),
            (class)-[:P279star]->(superclass)'
        --return 'superclass as class, count(distinct instance) as count'
        --order-by 'cast(count, int) desc'
    / add-labels
""")

CPU times: user 47.4 ms, sys: 20 ms, total: 67.4 ms
Wall time: 7.64 s


| | class | count | class;label |
|---|---|---|---|
| 0 | Q35120 | 49231 | 'entity'@en |
| 1 | Q99527517 | 30567 | 'collection entity'@en |
| 2 | Q28813620 | 28116 | 'set'@en |
| 3 | Q16887380 | 28102 | 'group'@en |
| 4 | Q58415929 | 27411 | 'spatio-temporal entity'@en |
| ... | ... | ... | ... |
| 7849 | Q100166391 | 1 | 'salt production facility'@en |
| 7850 | Q1001059 | 1 | 'writ'@en |
| 7851 | Q1000660 | 1 | 'algebra over a field'@en |
| 7852 | Q100052008 | 1 | 'anthropomorphic Pantherinae'@en |
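
The three-step approach listed at the start of this section can be sketched in plain Python. This is an illustration on invented toy data, not the way Kypher evaluates the query. It assumes that the `P279star` edges already contain the full transitive closure and that each class is listed as its own superclass; `count_transitive` is a hypothetical name.

```python
from collections import defaultdict

def count_transitive(p31_edges, p279star_edges):
    """Count distinct instances per class, including instances of subclasses."""
    # Step 2 lookup table: class -> all of its superclasses (closure assumed precomputed)
    superclasses = defaultdict(set)
    for cls, superclass in p279star_edges:
        superclasses[cls].add(superclass)

    # Steps 1 and 3: credit each instance to every superclass of its class
    instances = defaultdict(set)
    for instance, cls in p31_edges:
        for superclass in superclasses.get(cls, ()):
            instances[superclass].add(instance)
    return {cls: len(members) for cls, members in instances.items()}

# Toy data, invented for the example
p31 = [("rex", "dog"), ("tom", "cat"), ("rex", "pet")]
p279star = [
    ("dog", "dog"), ("dog", "animal"),
    ("cat", "cat"), ("cat", "animal"),
    ("pet", "pet"), ("pet", "animal"),
]
result = count_transitive(p31, p279star)
print(result["animal"])  # 2: rex reaches animal via dog and pet but is counted once
```

Using a set per superclass plays the role of `count(distinct instance)`: an instance that reaches the same superclass along several paths still contributes one.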


Store the results in a file using a new property, `P31count_transitive`:

In [67]:
%%time
kgtk("""
    query -i all 
        --match '
            (instance)-[:P31]->(class),
            (class)-[:P279star]->(superclass)'
        --return 'superclass as node1, "P31count_transitive" as label, count(distinct instance) as node2'
        --order-by 'cast(node2, int) desc'
    -o $OUT/metadata.p31.count.transitive.tsv
""")

CPU times: user 6.38 ms, sys: 13.8 ms, total: 20.2 ms
Wall time: 8.21 s
