<a href="https://colab.research.google.com/github/versant2612/jnotebooks/blob/main/kgtk/02_kg_profiling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install kgtk==1.0.1

Collecting kgtk==1.0.1
  Downloading kgtk-1.0.1-py3-none-any.whl (550 kB)
Collecting SPARQLWrapper
  Downloading SPARQLWrapper-1.8.5-py3-none-any.whl (26 kB)
Collecting torchbiggraph
  Downloading torchbiggraph-1.0.0-py3-none-any.whl (99 kB)
Collecting pyrallel.lib==0.0.9
  Downloading pyrallel.lib-0.0.9-py3-none-any.whl (24 kB)
Collecting iso-639
  Downloading iso-639-0.4.5.tar.gz (167 kB)
Collecting redis
  Downloading redis-4.0.1-py3-none-any.whl (118 kB)
Collecting rdflib>=6.0.2
  Downloading rdflib-6.0.2-py3-none-any.whl (407 kB)
Collecting lz4
  Downloading lz4-3.1.3-cp37-cp37m-manylinux2010_x86_64.whl (1.8 MB)

# Knowledge Graph Profiling

The goal of profiling is to produce a summary of the classes, properties and instances present in a KG. Profiling is challenging because it is computationally expensive: the queries touch large parts of the KG. In this part of the tutorial, you will learn how to use KGTK to profile a KG, and how KGTK addresses the computational challenges of computing profiles. Along the way, you will learn advanced uses of the KGTK query command.

This part of the tutorial is divided into multiple subsections:
- Counting the number of instances, classes and properties
- Counting the number of instances of each class, the most basic form of profiling
- Extending instance counting to include the instances of all subclasses of a class
- Generalizing the Wikidata `instance of (P31)` to include `occupation (P106)` and `position held (P39)` so that our profiles include statistics about classes such as `director (Q1162163)`, which in Wikidata don't have instances
- Counting the number of times each property is used in the instances of each class and all its subclasses; you will learn how to divide a computationally challenging task into simpler queries that you can chain together
- Customizing the profiles to include items of interest

At the end, you will load the profile data in the browser so that you can get more insights into the knowledge present in the tutorial KG.

## Preamble: set up the environment and files used in the tutorial

In [1]:
import os
import numpy as np
import pandas as pd

from kgtk.configure_kgtk_notebooks import ConfigureKGTK
from kgtk.functions import kgtk, kypher

In [2]:
# Parameters

# Folder on local machine where to create the output and temporary folders
input_path = None
output_path = "/tmp/projects"
project_name = "tutorial-profiling"

Our Wikidata distribution partitions the knowledge in Wikidata into smaller files that make it possible for you to pick and choose which files you want to use. Our tutorial KG is a subset of Wikidata, and is partitioned in the same way as the full Wikidata. The following is a partial list of all the files:

In [3]:
files = [
    "all",
    "label",
    "alias",
    "description",
    "external_id",
    "monolingualtext",
    "quantity",
    "string",
    "time",
    "item",
    "wikibase_property",
    "qualifiers",
    "datatypes",
    "p279",
    "p279star",
    "p31",
    "in_degree",
    "out_degree",
    "pagerank_directed",
    "pagerank_undirected"
]
ck = ConfigureKGTK(files)
ck.configure_kgtk(input_graph_path=input_path,
                  output_path=output_path,
                  project_name=project_name)

User home: /root
Current dir: /content
KGTK dir: /
Use-cases dir: //use-cases
--2021-11-18 13:54:29--  https://github.com/usc-isi-i2/kgtk-tutorial-files/raw/main/datasets/arnold/all.tsv.gz
Resolving github.com (github.com)... 140.82.113.4
Connecting to github.com (github.com)|140.82.113.4|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://github.com/usc-isi-i2/kgtk-notebooks/raw/main/datasets/arnold/all.tsv.gz [following]
--2021-11-18 13:54:29--  https://github.com/usc-isi-i2/kgtk-notebooks/raw/main/datasets/arnold/all.tsv.gz
Reusing existing connection to github.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/usc-isi-i2/kgtk-notebooks/main/datasets/arnold/all.tsv.gz [following]
--2021-11-18 13:54:29--  https://raw.githubusercontent.com/usc-isi-i2/kgtk-notebooks/main/datasets/arnold/all.tsv.gz
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.

The KGTK setup command defines environment variables for all the files so that you can reuse the Jupyter notebook when you install it on your local machine.

In [4]:
ck.print_env_variables()

GRAPH: /root/isi-kgtk-tutorial/input
USE_CASES_DIR: //use-cases
KGTK_OPTION_DEBUG: false
EXAMPLES_DIR: //examples
TEMP: /tmp/projects/tutorial-profiling/temp.tutorial-profiling
kypher: kgtk query --graph-cache /tmp/projects/tutorial-profiling/temp.tutorial-profiling/wikidata.sqlite3.db
KGTK_LABEL_FILE: /root/isi-kgtk-tutorial/input/labels.en.tsv.gz
OUT: /tmp/projects/tutorial-profiling
KGTK_GRAPH_CACHE: /tmp/projects/tutorial-profiling/temp.tutorial-profiling/wikidata.sqlite3.db
STORE: /tmp/projects/tutorial-profiling/temp.tutorial-profiling/wikidata.sqlite3.db
kgtk: kgtk
all: /root/isi-kgtk-tutorial/input/all.tsv.gz
label: /root/isi-kgtk-tutorial/input/labels.en.tsv.gz
alias: /root/isi-kgtk-tutorial/input/aliases.en.tsv.gz
description: /root/isi-kgtk-tutorial/input/descriptions.en.tsv.gz
external_id: /root/isi-kgtk-tutorial/input/claims.external-id.tsv.gz
monolingualtext: /root/isi-kgtk-tutorial/input/claims.monolingualtext.tsv.gz
quantity: /root/isi-kgtk-tutorial/input/claims.quantit

The KGTK query command (https://kgtk.readthedocs.io/en/latest/transform/query/) uses a database to cache the files used in the queries. In this tutorial, we populate the cache now with the files we need so that they are ready for the later queries. Otherwise, KGTK populates the cache on demand, the first time you use a file. I like to do it at configuration time to keep all the aliases in one place so that I can quickly come here and see the aliases of all the files.

In [5]:
%%time
ck.load_files_into_cache()

kgtk query --graph-cache /tmp/projects/tutorial-profiling/temp.tutorial-profiling/wikidata.sqlite3.db -i "/root/isi-kgtk-tutorial/input/all.tsv.gz" --as all  -i "/root/isi-kgtk-tutorial/input/labels.en.tsv.gz" --as label  -i "/root/isi-kgtk-tutorial/input/aliases.en.tsv.gz" --as alias  -i "/root/isi-kgtk-tutorial/input/descriptions.en.tsv.gz" --as description  -i "/root/isi-kgtk-tutorial/input/claims.external-id.tsv.gz" --as external_id  -i "/root/isi-kgtk-tutorial/input/claims.monolingualtext.tsv.gz" --as monolingualtext  -i "/root/isi-kgtk-tutorial/input/claims.quantity.tsv.gz" --as quantity  -i "/root/isi-kgtk-tutorial/input/claims.string.tsv.gz" --as string  -i "/root/isi-kgtk-tutorial/input/claims.time.tsv.gz" --as time  -i "/root/isi-kgtk-tutorial/input/claims.wikibase-item.tsv.gz" --as item  -i "/root/isi-kgtk-tutorial/input/claims.wikibase-property.tsv.gz" --as wikibase_property  -i "/root/isi-kgtk-tutorial/input/qualifiers.tsv.gz" --as qualifiers  -i "/root/isi-kgtk-tutorial/i

## Compute global KG statistics
In this part of the tutorial we will compute global statistics about the number of instances in the KG, the number of properties used to describe all the instances and classes, and the number of classes.



Total number of edges in our graph:

In [6]:
%%bash
zcat < $all | wc -l

2654671


Counting the total number of nodes is a bit harder, as nodes can appear in the `node1` position or the `node2` position.
We count literals as nodes, because in KGTK they are nodes. The queries below:
- list all the nodes that appear in the `node1` position.
- list all the nodes that appear in the `node2` position.
- concatenate and deduplicate the two files

In [7]:
kgtk("""
    query -i all
        --match '(n1)-[id]->(n2)'
        --return 'distinct n1 as id'
        -o $TEMP/node1.tsv
""")

kgtk("""
    query -i all
        --match '(n1)-[id]->(n2)'
        --return 'distinct n2 as id'
        -o $TEMP/node2.tsv
""")

kgtk("""
    cat -i $TEMP/node1.tsv -i $TEMP/node2.tsv
    / compact
""")

Unnamed: 0,id
0,$a United States. $b Department of the Interior
1,((0?[1-9]|[1-2][0-9]|3[0-6])[LRC]?)(/(0?[1-9]|...
2,(([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]?|25[0...
3,(+20) 2
4,(+32) 2
...,...
1420765,url
1420766,wikibase-form
1420767,wikibase-item
1420768,wikibase-property


Equivalent SPARQL query, for comparison:

SELECT (COUNT(DISTINCT ?node) AS ?count_nodes)
WHERE {
  { ?node ?p ?o }
  UNION
  { ?s ?p ?node }
}
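
The node-counting steps described above can be sketched in plain Python. This is only an illustration over a tiny, invented edge list (the data and the helper name `count_nodes` are assumptions, not part of KGTK): it takes the union of the `node1` and `node2` values, which is what concatenating and deduplicating the two files achieves.

```python
# Toy edges as (node1, label, node2) tuples; the data is invented for illustration.
edges = [
    ("Q1", "P31", "Q5"),
    ("Q2", "P31", "Q5"),
    ("Q1", "P106", "Q33999"),
    ("Q1", "label", "'Alice'@en"),  # a literal value, counted as a node
]


def count_nodes(edge_list):
    """Count distinct values seen in the node1 or node2 position."""
    node1_values = {node1 for node1, _, _ in edge_list}
    node2_values = {node2 for _, _, node2 in edge_list}
    # Set union plays the role of concatenating and deduplicating the two files.
    return len(node1_values | node2_values)


node_count = count_nodes(edges)
print(node_count)
```

`Q1` appears in several edges and `Q5` appears twice as a value, but each is counted once.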

Counting the number of instances is easy as we can use the `instance of (P31)` property to identify the instances:

In [8]:
%%time
kgtk("""
    query -i all
        --match '(instance)-[:P31]->(class)'
        --return 'count(distinct instance) as count_instances'
""")

CPU times: user 29 ms, sys: 15.1 ms, total: 44.2 ms
Wall time: 3.81 s


Unnamed: 0,count_instances
0,58831


Equivalent SPARQL query, for comparison:

SELECT (COUNT(DISTINCT ?instance) AS ?count_instances)
WHERE { ?instance wdt:P31 ?class }

Counting the number of properties used is also easy: you query all the statements in the KG and count the distinct properties that appear as edge labels:

In [9]:
%%time
kgtk("""
    query -i all
        --match '(instance)-[l {label: property}]->(class)'
        --return 'count(distinct property) as count_property'
""")

CPU times: user 18.7 ms, sys: 11 ms, total: 29.7 ms
Wall time: 1.86 s


Unnamed: 0,count_property
0,3874


Equivalent SPARQL query, for comparison:

SELECT (COUNT(DISTINCT ?p) AS ?count_property)
WHERE { ?s ?p ?o }
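
Both counts rely on `count(distinct ...)`. The following sketch, over an invented edge list, shows why: an instance with two classes produces two `P31` statements but is still one instance, and a property used many times is still one property.

```python
# Toy edges as (node1, label, node2) tuples; the data is invented for illustration.
edges = [
    ("Q1", "P31", "Q5"),
    ("Q1", "P31", "Q215627"),  # the same instance with a second class
    ("Q2", "P31", "Q5"),
    ("Q2", "P106", "Q33999"),
    ("Q5", "P279", "Q215627"),
]

# All `instance of (P31)` statements, duplicates of node1 included.
p31_statements = [edge for edge in edges if edge[1] == "P31"]

# Distinct instances: the set removes the repeated Q1.
count_instances = len({node1 for node1, _, _ in p31_statements})

# Distinct properties: every edge label, counted once.
count_property = len({label for _, label, _ in edges})

print(len(p31_statements), count_instances, count_property)
```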

Counting the number of classes is more challenging because the notion of class in Wikidata is implicit. Here, we define **class** to be any item that is involved in a `subclass of (P279)` statement. Some classes don't have instances, so we cannot use `instance of (P31)` to count classes. The KGTK `p279star` graph is very handy for this task, and for any other task where you want to quickly traverse the `subclass of (P279)` hierarchy. KGTK defines the `subclass of (transitive) (P279star)` property to record all the superclasses of each class, including itself.

You can count the number of classes by counting the number of distinct classes that appear as values of `P279star`:

In [10]:
%%bash
zcat < $p279star | wc -l

431383


In [11]:
%%time
kgtk("""
    query -i p279star
        --match '(class)-[:P279star]->(super_class)'
        --return 'count(distinct super_class) as count_classes'
""")

[Errno 2] No such file or directory: '/content/p279star'


CPU times: user 10.7 ms, sys: 9.2 ms, total: 19.9 ms
Wall time: 541 ms


Equivalent SPARQL query over a generic RDF graph, for comparison:

SELECT (COUNT(DISTINCT ?class) AS ?count_classes)
WHERE {
  { ?s rdfs:subClassOf ?class }
  UNION
  { ?s rdf:type ?class }
}
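
To illustrate what a graph like `p279star` contains, here is a minimal sketch that computes a reflexive, transitive closure from a handful of `subclass of` edges. The toy hierarchy and the function name are assumptions made for the example; this is not how KGTK builds the file.

```python
from collections import defaultdict

# Toy `subclass of (P279)` edges as (subclass, superclass); the hierarchy is illustrative.
p279_edges = [
    ("Q33999", "Q713200"),    # actor -> performing artist
    ("Q713200", "Q483501"),   # performing artist -> artist
    ("Q2526255", "Q483501"),  # film director -> artist
]


def reflexive_transitive_closure(edge_list):
    """Map every class to the set of its superclasses, including itself."""
    parents = defaultdict(set)
    classes = set()
    for subclass, superclass in edge_list:
        parents[subclass].add(superclass)
        classes.update((subclass, superclass))

    closure = {}
    for cls in classes:
        reached = {cls}  # reflexive: every class is its own superclass
        stack = [cls]
        while stack:
            current = stack.pop()
            for parent in parents.get(current, ()):
                if parent not in reached:
                    reached.add(parent)
                    stack.append(parent)
        closure[cls] = reached
    return closure


p279star = reflexive_transitive_closure(p279_edges)

# Counting the distinct values of the closure counts the classes.
count_classes = len(set().union(*p279star.values()))
print(count_classes)
```

Because the closure is reflexive, every class that takes part in a `P279` statement shows up as a value, which is why counting distinct values counts all classes.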

Count the number of qualifiers. All the qualifiers are in a file, so we can count them by getting the number of lines in the file:

In [12]:
%%bash
zcat < $qualifiers | wc -l

322505


You can also count the number of qualifier edges using a query. It is instructive to do so, as this example shows how to access the qualifiers on an edge:
- The first match clause has `[id]`, which binds the variable `id` to the identifier of the edge.
- The second match clause uses `(id)` in the `node1` position, and puts the identifier of the qualifier edge in the `qualifier_id` variable.
- The return statement returns the count of `qualifier_id`, which is the number of qualifier edges.

In [13]:
kgtk("""
    query -i all
        --match '
            (n1)-[id]->(n2),
            (id)-[qualifier_id]->(qualifier_value)'
        --return 'count(distinct qualifier_id)'
""")

Unnamed: 0,"count(DISTINCT graph_1_c2.""id"")"
0,455226


We can enhance the query to show us the distribution of properties used as qualifiers by introducing a variable `qualifier_property` to capture the property:

In [14]:
%%time
kgtk("""
    query -i all
        --match '
            (n1)-[id]->(n2),
            (id)-[qualifier_id {label: qualifier_property}]->(qualifier_value)'
        --return 'qualifier_property as node1, "count" as label, count(distinct qualifier_id) as node2'
        --order-by 'cast(node2, int) desc'
    / add-labels
""")

CPU times: user 52.8 ms, sys: 25.2 ms, total: 78 ms
Wall time: 6.84 s


Unnamed: 0,node1,label,node2,node1;label
0,P1545,count,134301,'series ordinal'@en
1,P585,count,96781,'point in time'@en
2,P580,count,33212,'start time'@en
3,P459,count,32944,'determination method'@en
4,P805,count,19969,'statement is subject of'@en
...,...,...,...,...
719,P945,count,1,'allegiance'@en
720,P952,count,1,'ISCO-88 occupation code'@en
721,P97,count,1,'noble title'@en
722,P974,count,1,'tributary'@en
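
The qualifier pattern can be sketched in plain Python as follows. The edge identifiers and values are invented; the point is only that a qualifier edge is an edge whose `node1` is the identifier of another edge.

```python
from collections import Counter

# Toy edges as (id, node1, label, node2); identifiers and values are invented.
edges = [
    ("e1", "Q1", "P39", "Q1162163"),
    ("e2", "Q1", "P106", "Q33999"),
    ("e1-q1", "e1", "P580", "^2003"),  # qualifier on edge e1
    ("e1-q2", "e1", "P585", "^2005"),  # second qualifier on edge e1
    ("e2-q1", "e2", "P580", "^1970"),  # qualifier on edge e2
]

# Identifiers of all edges, used to recognize edges that describe other edges.
edge_ids = {edge_id for edge_id, _, _, _ in edges}

# A qualifier edge has an edge identifier in its node1 position.
qualifier_edges = [edge for edge in edges if edge[1] in edge_ids]

# Distribution of the properties used as qualifiers.
qualifier_counts = Counter(label for _, _, label, _ in qualifier_edges)

print(len(qualifier_edges))
print(qualifier_counts.most_common())
```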


## Get instance counts for each class

In this part you will do the simplest profiling query, where you count the number of direct instances of each class.
We can compute the instance counts by retrieving all statements that use `instance of (P31)` and counting the instances for each class.
We order the result by the number of instances to see the classes that have the most instances.
You can see that our tutorial KG contains a large number of people, and that there is a long tail of classes with very few instances; this is common in Wikidata, which defines over 1 million classes.

In [15]:
%%time
kgtk("""
    query -i all
        --match '(instance)-[:P31]->(class)'
        --return 'class as class, count(distinct instance) as count'
        --order-by 'cast(count, int) desc' 
    / add-labels
""")

CPU times: user 86.4 ms, sys: 28.1 ms, total: 115 ms
Wall time: 1.94 s


Unnamed: 0,class,count,class;label
0,Q5,13873,'human'@en
1,Q15221623,3177,'bilateral relation'@en
2,Q11424,2136,'film'@en
3,Q4022,1550,'river'@en
4,Q3918,815,'university'@en
...,...,...,...
5779,Q995347,1,'Christian movement'@en
5780,Q99566538,1,'Wikidata property for an identifier that gene...
5781,Q996839,1,'fraternal organization'@en
5782,Q99960791,1,'ministry of Andorra'@en


Equivalent SPARQL query, for comparison:

SELECT ?class (COUNT(DISTINCT ?instance) AS ?count)
WHERE { ?instance wdt:P31 ?class }
GROUP BY ?class


We want to add the profiling data back into the KG so that we can use it in queries and look at it in the browser.
To do so, we create a KGTK graph by using `node1, label, node2` as column headers:

In [16]:
%%time
kgtk("""
    query -i all
        --match '(instance)-[:P31]->(class)'
        --return 'class as node1, "P31_count" as label, count(distinct instance) as node2'
        --order-by 'cast(node2, int) desc'
    --limit 10 
""")

CPU times: user 10.6 ms, sys: 12.1 ms, total: 22.7 ms
Wall time: 839 ms


Unnamed: 0,node1,label,node2
0,Q5,P31_count,13873
1,Q15221623,P31_count,3177
2,Q11424,P31_count,2136
3,Q4022,P31_count,1550
4,Q3918,P31_count,815
5,Q4164871,P31_count,645
6,Q1549591,P31_count,627
7,Q3917681,P31_count,614
8,Q19595382,P31_count,595
9,Q11862829,P31_count,568
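
As a rough illustration of the `node1, label, node2` layout, the sketch below computes per-class instance counts from invented `P31` pairs and writes them as tab-separated edges with the standard `csv` module. It only mimics the shape of the query result and does not call KGTK.

```python
import csv
import io
from collections import defaultdict

# Toy `instance of (P31)` pairs as (instance, class); one pair is repeated on purpose.
p31_edges = [("Q1", "Q5"), ("Q2", "Q5"), ("Q2", "Q5"), ("Q3", "Q11424")]

# Group the distinct instances under each class.
instances_by_class = defaultdict(set)
for instance, cls in p31_edges:
    instances_by_class[cls].add(instance)

# One edge per class: class as node1, a count property as label, the count as node2.
rows = sorted(
    ((cls, "P31_count", len(members)) for cls, members in instances_by_class.items()),
    key=lambda row: row[2],
    reverse=True,
)

# Write the edges as TSV with the three column headers of an edge file.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerow(["node1", "label", "node2"])
writer.writerows(rows)
tsv_text = buffer.getvalue()
print(tsv_text)
```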


It is good practice to add identifiers to the edges so that we can add qualifiers later if we desire. To add the identifiers, we chain the query output to the `add-id` command:

In [17]:
%%time
kgtk("""
    query -i all
        --match '(instance)-[:P31]->(class)'
        --return 'class as node1, "P31count" as label, count(distinct instance) as node2'
        --order-by 'cast(node2, int) desc' 
    / add-id --id-style wikidata
""")

CPU times: user 102 ms, sys: 34.9 ms, total: 137 ms
Wall time: 1.81 s


Unnamed: 0,node1,label,node2,id
0,Q5,P31count,13873,Q5-P31count-247e30
1,Q15221623,P31count,3177,Q15221623-P31count-61d8c4
2,Q11424,P31count,2136,Q11424-P31count-907bdc
3,Q4022,P31count,1550,Q4022-P31count-c27484
4,Q3918,P31count,815,Q3918-P31count-96da2f
...,...,...,...,...
5779,Q995347,P31count,1,Q995347-P31count-6b86b2
5780,Q99566538,P31count,1,Q99566538-P31count-6b86b2
5781,Q996839,P31count,1,Q996839-P31count-6b86b2
5782,Q99960791,P31count,1,Q99960791-P31count-6b86b2
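
All the rows with a count of `1` share the suffix `6b86b2`, which matches the first six hexadecimal digits of the SHA-256 hash of the string `1`. This suggests that, for literal values such as these counts, the identifier is built from `node1`, the label, and a truncated hash of `node2`. That is an inference from the output above, not a description of the `add-id` implementation, and identifiers of item-valued edges (for example `Q1000048-P106-Q1622272-3a1be6b5-0`) clearly follow a different layout. A minimal sketch under that assumption, with a hypothetical helper name:

```python
import hashlib


def literal_edge_id(node1, label, node2, digits=6):
    """Build node1-label-hash, assuming a truncated SHA-256 of the node2 string."""
    digest = hashlib.sha256(node2.encode("utf-8")).hexdigest()
    return f"{node1}-{label}-{digest[:digits]}"


print(literal_edge_id("Q995347", "P31count", "1"))
# prints Q995347-P31count-6b86b2
```

A useful consequence of a scheme like this is that identifiers are deterministic: rerunning the pipeline on the same data yields the same edge identifiers.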


Now that we have seen the steps to create the graph with the counts, we output the results to a file using the `-o` option:

In [18]:
%%time
kgtk("""
    query -i all
        --match '(instance)-[:P31]->(class)'
        --return 'class as node1, "P31count" as label, count(distinct instance) as node2'
        --order-by 'cast(node2, int) desc'
    / add-id --id-style wikidata
    -o $OUT/metadata.p31.count.tsv
""")

CPU times: user 13.2 ms, sys: 10.1 ms, total: 23.3 ms
Wall time: 1.78 s


Confirm that the output file went to the right place:

In [19]:
!ls -l $OUT

total 260
-rw-r--r-- 1 root root 261137 Nov 18 14:13 metadata.p31.count.tsv
drwxr-xr-x 2 root root   4096 Nov 18 14:13 temp.tutorial-profiling


Load the `P31count` graph in the KGTK cache so that we can use it in queries later:

In [21]:
kgtk("""
    query -i $OUT/metadata.p31.count.tsv --as p31count --limit 20
""")

Unnamed: 0,node1,label,node2,id
0,Q5,P31count,13873,Q5-P31count-247e30
1,Q15221623,P31count,3177,Q15221623-P31count-61d8c4
2,Q11424,P31count,2136,Q11424-P31count-907bdc
3,Q4022,P31count,1550,Q4022-P31count-c27484
4,Q3918,P31count,815,Q3918-P31count-96da2f
5,Q4164871,P31count,645,Q4164871-P31count-3c2308
6,Q1549591,P31count,627,Q1549591-P31count-9a3553
7,Q3917681,P31count,614,Q3917681-P31count-fa7aec
8,Q19595382,P31count,595,Q19595382-P31count-a3aaf5
9,Q11862829,P31count,568,Q11862829-P31count-f8818b


### Summary of this section
In this section we:
- Computed the count of instances for every class in our KG.
- Illustrated the use of `instance of (P31)` to do queries.
- Illustrated common conventions to add identifiers to edges and to save results to files.

## Compute `P31count_transitive`, the count of instances of a class including the instances of all the subclasses

Approach:
- get the class of each instance
- get all the superclasses of the class of each instance
- for every superclass, count all the instances

> This query will run at the scale of all Wikidata, which contains millions of classes.

We add the labels to make the results readable. Not surprisingly, `entity` has the most instances, and the top classes are those at the top of the Wikidata ontology:

In [22]:
%%time
kgtk("""
    query -i all
        --match '
            (instance)-[:P31]->(class),
            (class)-[:P279star]->(superclass)'
        --return 'superclass as class, count(distinct instance) as count'
        --order-by 'cast(count, int) desc'
    / add-labels
""")

CPU times: user 179 ms, sys: 50.5 ms, total: 229 ms
Wall time: 17.8 s


Unnamed: 0,class,count,class;label
0,Q35120,58496,'entity'@en
1,Q99527517,38373,'collection entity'@en
2,Q28813620,35555,'set'@en
3,Q16887380,35533,'group'@en
4,Q58415929,30837,'spatio-temporal entity'@en
...,...,...,...
8926,Q99772908,1,'anthropomorphic equine'@en
8927,Q99860490,1,'neurological and physiological symptom'@en
8928,Q99960791,1,'ministry of Andorra'@en
8929,Q99969523,1,'anthropomorphic artiodactyla'@en
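
The three-step approach can be sketched in plain Python. The instances and the small hierarchy below are invented, and the closure is written out by hand (including the reflexive pairs) to stand in for `P279star`.

```python
from collections import defaultdict

# Toy `instance of (P31)` pairs as (instance, class).
p31_edges = [("A", "Q5"), ("B", "Q5"), ("C", "Q11424")]

# Toy reflexive transitive closure as (class, superclass) pairs:
# human (Q5) -> person (Q215627) -> entity (Q35120), film (Q11424) -> entity.
p279star_edges = [
    ("Q5", "Q5"), ("Q5", "Q215627"), ("Q5", "Q35120"),
    ("Q215627", "Q215627"), ("Q215627", "Q35120"),
    ("Q11424", "Q11424"), ("Q11424", "Q35120"),
    ("Q35120", "Q35120"),
]

# Index the closure by class for fast lookup.
superclasses = defaultdict(set)
for cls, superclass in p279star_edges:
    superclasses[cls].add(superclass)

# For every superclass, collect the distinct instances below it.
instances_by_superclass = defaultdict(set)
for instance, cls in p31_edges:
    for superclass in superclasses[cls]:
        instances_by_superclass[superclass].add(instance)

transitive_counts = {
    superclass: len(members)
    for superclass, members in instances_by_superclass.items()
}
print(sorted(transitive_counts.items(), key=lambda item: item[1], reverse=True))
```

Collecting instances in sets matters here: an instance with two classes that share a superclass must be counted once for that superclass.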


Store the results in a file using a new property `P31count_transitive`:

In [23]:
%%time
kgtk("""
    query -i all 
        --match '
            (instance)-[:P31]->(class),
            (class)-[:P279star]->(superclass)'
        --return 'superclass as node1, "P31count_transitive" as label, count(distinct instance) as node2'
        --order-by 'cast(node2, int) desc'
    / add-id --id-style wikidata
    -o $OUT/metadata.p31.count.transitive.tsv
""")

CPU times: user 75.4 ms, sys: 29.1 ms, total: 105 ms
Wall time: 14.1 s


Find the number of instances of `human: Q5`, `artist: Q483501` and `film director: Q2526255`. There are many instances of human, but only one of artist and none of film director.

In [24]:
kgtk("""
    filter -i $OUT/metadata.p31.count.transitive.tsv -p "Q5, Q483501, Q2526255 ;;" / add-labels
""")

Unnamed: 0,node1,label,node2,id,node1;label
0,Q5,P31count_transitive,13944,Q5-P31count_transitive-c2d55f,'human'@en
1,Q483501,P31count_transitive,1,Q483501-P31count_transitive-6b86b2,'artist'@en
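
The filter pattern has three fields separated by `;` (for `node1`, `label` and `node2`), each holding a comma-separated list of accepted values, with an empty field accepting anything. The following is a simplified emulation of that matching rule in plain Python, written for illustration only; the function names are hypothetical and it does not cover the other options of the `filter` command.

```python
def parse_pattern(pattern):
    """Split 'node1 values ; label values ; node2 values' into three sets."""
    parts = pattern.split(";")
    if len(parts) != 3:
        raise ValueError("expected three ';'-separated fields")
    return [
        {value.strip() for value in part.split(",") if value.strip()}
        for part in parts
    ]


def filter_edges(rows, pattern):
    """Keep rows whose fields match the pattern; an empty set matches anything."""
    allowed = parse_pattern(pattern)
    return [
        row
        for row in rows
        if all(not options or field in options for field, options in zip(row, allowed))
    ]


# Rows as (node1, label, node2), taken from the counts shown in this section.
rows = [
    ("Q5", "P31count_transitive", "13944"),
    ("Q483501", "P31count_transitive", "1"),
    ("Q35120", "P31count_transitive", "58496"),
]

selected = filter_edges(rows, "Q5, Q483501, Q2526255 ;;")
print(selected)
```

`Q2526255` is listed in the pattern but has no row, so it is simply absent from the result, just as in the output above.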


The reason there is only one instance of `artist: Q483501` and none of `film director: Q2526255` is that Wikidata uses the property `occupation: P106` to relate people to their occupations, so the connection between a human and artist or director is not `instance of: P31`. It would be nice if the browser page for `artist: Q483501` or `film director: Q2526255` showed the number of people with this occupation. DBpedia uses a different model where humans are instances of artist or film director.


### Summary of this section
In this section we:
- Computed the count of instances of every class, including all subclasses.
- Introduced `P279star`, the precomputed transitive closure of the Wikidata `subclass of (P279)` property that allows you to conveniently do queries over all super classes or subclasses of an entity.

## Define `P31x`, a generalization of `instance of: P31`

In our KG we are going to define a new property called `instance of (generalized): P31x` that behaves like DBpedia, so that we can ask for instances of `artist: Q483501`.
We do this by generalizing `occupation: P106` and `position held: P39` to also behave as `P31` statements.

Approach:
- Combine `x P31 y`, `x P106 y` and `x P39 y` statements using a new `P31x` predicate

Use the `filter` command to take a peek at the data and see whether our plan makes sense.

In [25]:
kgtk("""
    filter -i $item -p "; P39, P106 ;"
    / head
    / add-labels
""")

Unnamed: 0,node1,label,node2,id,node2;wikidatatype,node1;label,label;label,node2;label
0,Q1000048,P106,Q1622272,Q1000048-P106-Q1622272-3a1be6b5-0,wikibase-item,'Franz Zimmermann'@en,'occupation'@en,'university teacher'@en
1,Q1000048,P106,Q16267607,Q1000048-P106-Q16267607-e13e45d1-0,wikibase-item,'Franz Zimmermann'@en,'occupation'@en,'classical philologist'@en
2,Q100063874,P39,Q1162163,Q100063874-P39-Q1162163-ae076e77-0,wikibase-item,'Catherine Musson'@en,'position held'@en,'director'@en
3,Q100066085,P39,Q1162163,Q100066085-P39-Q1162163-93ac33fd-0,wikibase-item,'Anne-Laurence Mennessier'@en,'position held'@en,'director'@en
4,Q1001,P106,Q11774202,Q1001-P106-Q11774202-45d8eb34-0,wikibase-item,'Mahatma Gandhi'@en,'occupation'@en,'essayist'@en
5,Q1001,P106,Q17351648,Q1001-P106-Q17351648-e64838e9-0,wikibase-item,'Mahatma Gandhi'@en,'occupation'@en,'newspaper editor'@en
6,Q1001,P106,Q1930187,Q1001-P106-Q1930187-6cf568db-0,wikibase-item,'Mahatma Gandhi'@en,'occupation'@en,'journalist'@en
7,Q1001,P106,Q4964182,Q1001-P106-Q4964182-a0867b04-0,wikibase-item,'Mahatma Gandhi'@en,'occupation'@en,'philosopher'@en
8,Q1001,P106,Q808967,Q1001-P106-Q808967-57fe7a7e-0,wikibase-item,'Mahatma Gandhi'@en,'occupation'@en,'barrister'@en
9,Q100159381,P106,Q37226,Q100159381-P106-Q37226-d95f0b81-0,wikibase-item,'Victor Cherner'@en,'occupation'@en,'teacher'@en


Select all the `P31`, `P39` and `P106` statements and rewrite them as `P31x` statements, and also make sure that we do this only for humans:

In [26]:
kgtk("""
    query -i all
        --match '
            (n1)-[:P31]->(:Q5),
            (n1)-[r {label: property}]->(n2)'
        --where 'property in ["P106", "P39", "P31"]'
        --return 'distinct n1 as node1, "P31x" as label, n2 as node2'
        --limit 10
    / add-labels
""")

Unnamed: 0,node1,label,node2,node1;label,node2;label
0,Q1000048,P31x,Q1622272,'Franz Zimmermann'@en,'university teacher'@en
1,Q1000048,P31x,Q16267607,'Franz Zimmermann'@en,'classical philologist'@en
2,Q1000048,P31x,Q5,'Franz Zimmermann'@en,'human'@en
3,Q1000061,P31x,Q5,'Valentyn Symonenko'@en,'human'@en
4,Q100063874,P31x,Q5,'Catherine Musson'@en,'human'@en
5,Q100063874,P31x,Q1162163,'Catherine Musson'@en,'director'@en
6,Q100066085,P31x,Q5,'Anne-Laurence Mennessier'@en,'human'@en
7,Q100066085,P31x,Q1162163,'Anne-Laurence Mennessier'@en,'director'@en
8,Q1001,P31x,Q11774202,'Mahatma Gandhi'@en,'essayist'@en
9,Q1001,P31x,Q17351648,'Mahatma Gandhi'@en,'newspaper editor'@en
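
The rewrite can be sketched in plain Python over an invented edge list. The non-human entity and its class identifier (`Q_FICTIONAL`) are hypothetical placeholders, and `P27` only stands in for a property that is not generalized.

```python
# Toy edges as (node1, label, node2); Q9 and Q_FICTIONAL are hypothetical placeholders.
edges = [
    ("Q1", "P31", "Q5"),
    ("Q1", "P106", "Q2526255"),
    ("Q1", "P27", "Q30"),          # some other property, not generalized
    ("Q9", "P31", "Q_FICTIONAL"),  # not a human, so it is skipped
    ("Q9", "P106", "Q2526255"),
]

GENERALIZED_PROPERTIES = {"P31", "P106", "P39"}

# Entities that are instances of human (Q5).
humans = {node1 for node1, label, node2 in edges if label == "P31" and node2 == "Q5"}

# Rewrite the selected statements of humans as P31x statements, without duplicates.
p31x_edges = sorted(
    {
        (node1, "P31x", node2)
        for node1, label, node2 in edges
        if node1 in humans and label in GENERALIZED_PROPERTIES
    }
)
print(p31x_edges)
```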


The query needs to be more sophisticated, because the previous query adds the extended `instance of` only to humans. We need that restriction: without it, fictional characters that have occupations end up below `human (Q5)` due to the way the Wikidata ontology is structured. The fix is to concatenate (`cat`) the results of the previous query with the original `instance of (P31)` graph and to deduplicate (`compact`).
The resulting graph goes in file `derived.P31x.tsv`:

In [27]:
%%time
kgtk("""
    query -i item
        --match '
            (n1)-[:P31]->(:Q5),
            (n1)-[r {label: property}]->(n2)'
        --where 'property in ["P106", "P39", "P31"]'
        --return 'distinct n1 as node1, "P31x" as label, n2 as node2'
    / add-id --id-style wikidata

    / compact
    -o $OUT/derived.P31x.tsv
""")

CPU times: user 25.7 ms, sys: 11.5 ms, total: 37.2 ms
Wall time: 3.87 s


Load the `p31x` graph defining our generalized `instance of` property:

In [28]:
kgtk("""
    query -i $OUT/derived.P31x.tsv --as p31x --limit 20
""")

Unnamed: 0,node1,label,node2,id
0,Q1000048,P31x,Q1622272,Q1000048-P31x-Q1622272
1,Q1000048,P31x,Q16267607,Q1000048-P31x-Q16267607
2,Q1000048,P31x,Q5,Q1000048-P31x-Q5
3,Q1000061,P31x,Q5,Q1000061-P31x-Q5
4,Q100063874,P31x,Q1162163,Q100063874-P31x-Q1162163
5,Q100063874,P31x,Q5,Q100063874-P31x-Q5
6,Q100066085,P31x,Q1162163,Q100066085-P31x-Q1162163
7,Q100066085,P31x,Q5,Q100066085-P31x-Q5
8,Q1001,P31x,Q11774202,Q1001-P31x-Q11774202
9,Q1001,P31x,Q17351648,Q1001-P31x-Q17351648


Now we can fix our `P31count_transitive` property to also include classes such as `film director (Q2526255)`. Use the new `P31x` graph to substitute `P31x` for `P31` in our query that computes the class counts:

In [29]:
%%time
kgtk("""
    query -i all -i p31x
        --match '
            p31x: (instance)-[:P31x]->(class),
            all: (class)-[:P279star]->(superclass)'
        --return 'superclass as node1, "P31xcount_transitive" as label, count(distinct instance) as node2'
        --order-by 'cast(node2, int) desc'
    / add-id --id-style wikidata
    -o $OUT/metadata.p31x.count.transitive.tsv
""")

CPU times: user 21.5 ms, sys: 12 ms, total: 33.5 ms
Wall time: 3.06 s


Redo our query to get the number of instances of `human: Q5`, `artist: Q483501` and `film director: Q2526255`.
Now we get more reasonable counts for artist and film director:

In [30]:
kgtk("""
    filter -i $OUT/metadata.p31x.count.transitive.tsv -p "Q5, Q483501, Q2526255 ;;" / add-labels
""")

Unnamed: 0,node1,label,node2,id,node1;label
0,Q5,P31xcount_transitive,13873,Q5-P31xcount_transitive-247e30,'human'@en
1,Q483501,P31xcount_transitive,2575,Q483501-P31xcount_transitive-e7303a,'artist'@en
2,Q2526255,P31xcount_transitive,674,Q2526255-P31xcount_transitive-8ef532,'film director'@en


Find the classes that appear in the new file but not in the old file. To do this we use the `ifnotexists` command, which subtracts the statements of one graph from the statements of another graph.
> Some classes may appear in both graphs and have their counts updated (e.g., artists appeared with a count of 1 before):

In [31]:
kgtk("""
    ifnotexists -i $OUT/metadata.p31x.count.transitive.tsv
        --filter-on $OUT/metadata.p31.count.transitive.tsv
        --input-keys node1
        --filter-keys node1
    / add-labels
""")

Unnamed: 0,node1,label,node2,id,node1;label
0,Q713200,P31xcount_transitive,1912,Q713200-P31xcount_transitive-a991b8,'performing artist'@en
1,Q33999,P31xcount_transitive,1911,Q33999-P31xcount_transitive-dc4bc8,'actor'@en
2,Q15980804,P31xcount_transitive,1400,Q15980804-P31xcount_transitive-55fdec,'media professional'@en
3,Q2285706,P31xcount_transitive,1222,Q2285706-P31xcount_transitive-16a3e9,'head of government'@en
4,Q3282637,P31xcount_transitive,881,Q3282637-P31xcount_transitive-28096b,'film producer'@en
...,...,...,...,...,...
902,Q957729,P31xcount_transitive,1,Q957729-P31xcount_transitive-6b86b2,'photojournalist'@en
903,Q96172702,P31xcount_transitive,1,Q96172702-P31xcount_transitive-6b86b2,'Minister General of the Order of Franciscans'@en
904,Q978708,P31xcount_transitive,1,Q978708-P31xcount_transitive-6b86b2,'Prime Minister of East Timor'@en
905,Q98084799,P31xcount_transitive,1,Q98084799-P31xcount_transitive-6b86b2,'professional photographer'@en


### Summary of this section
In this section we:
- Computed `P31x`, our generalized `instance of` property. Results in `derived.P31x.tsv`.
- Computed `P31xcount_transitive` as a revision of `P31count_transitive` to also include counts via occupation and position held links. Results in `metadata.p31x.count.transitive.tsv`.
- Illustrated how to work with precomputed transitive closures (`P279star`), which enable KGTK to efficiently execute queries that otherwise would be very expensive.

## Compute the number of times each property appears in a class

In this section we will compute the distribution of the use of properties in every class in the KG.
We want to know the count of the different properties used in all instances of a class.
For example, if we look at `film (Q11424)` we want to see what properties are used to describe films, including all subclasses of film.

Computing this distribution is challenging because, as the query below shows, there are many classes in our KG:

In [32]:
kgtk("""
    query -i all --match '(entity)-[:P279]->(class)' --return 'count(distinct class) as `count of classes`'
""")

Unnamed: 0,count of classes
0,7483


Approach: we divide the task into two steps:
- For every entity, compute the set of properties used to describe it, and store this information in `item_properties.tsv`
- For every class, collect all the instances below it, and count the number of times each property appears in `item_properties.tsv`

The query for the first step is below.
The first pattern of the match clause gets the properties used in every instance of the KG.
I included a second pattern to get the data type of the property, and used the `--where` clause to exclude properties whose data type is external identifier, as there are so many of them, and for the tutorial we want the query to run faster.

In [33]:
%%time
kgtk("""
    query -i all
        --match '
            (entity)-[l {label: property}]->(),
            (property)-[:datatype]->(datatype)'
        --where 'datatype != "external-id"' 
        --return 'distinct entity as node1, "Phas_property" as label, property as node2'
    / add-labels
""")

CPU times: user 7.66 s, sys: 2.23 s, total: 9.89 s
Wall time: 23.9 s


Unnamed: 0,node1,label,node2,node1;label,node2;label
0,P8874,Phas_property,P1001,'Hong Kong film rating'@en,'applies to jurisdiction'@en
1,Q1001543,Phas_property,P1001,"'Embassy of Finland, Budapest'@en",'applies to jurisdiction'@en
2,Q100325415,Phas_property,P1001,"'Embassy of Belarus, Budapest'@en",'applies to jurisdiction'@en
3,Q1005422,Phas_property,P1001,"'Federal Office of Bundeswehr Equipment, Infor...",'applies to jurisdiction'@en
4,Q1006360,Phas_property,P1001,'Bundesminister'@en,'applies to jurisdiction'@en
...,...,...,...,...,...
837038,Q7020999,Phas_property,P991,'2017 French presidential election'@en,'successful candidate'@en
837039,Q72251,Phas_property,P991,'1876 United States presidential election'@en,'successful candidate'@en
837040,Q72472,Phas_property,P991,'1892 United States presidential election'@en,'successful candidate'@en
837041,Q72835,Phas_property,P991,'1908 United States presidential election'@en,'successful candidate'@en


The results look good, so we add the identifiers to the edges and store the results in `item_properties.tsv`.

In [34]:
%%time
kgtk("""
    query -i all
        --match '
            (property)-[:datatype]->(datatype), 
            (entity)-[l {label: property}]->()'
        --where 'datatype != "external-id"' 
        --return 'distinct entity as node1, "Phas_property" as label, property as node2'
    / add-id --id-style wikidata
    -o $TEMP/item_properties.tsv
""")

CPU times: user 76.5 ms, sys: 24.1 ms, total: 101 ms
Wall time: 12.8 s


In the second step, we use `P279star` to get all the superclasses of each entity, and then look up the entity in the `item_properties` graph to find the properties it uses.
We invent a new property called `P1963computed` to store the counts. Wikidata has a property `properties for this type (P1963)` where editors can manually specify the properties that should be used to describe the instances of a class. We are computing the properties bottom-up from the data, so we call the property `P1963computed`.

In the return clause, we list `superclass`, and the value of the `property` variable ahead of the `count` clause to tell KGTK that we want to aggregate by superclass and property. We reuse the Wikidata `quantity (P1114)` to record the counts:

> This query is very expensive to run on the full Wikidata as it touches every entity in Wikidata, but it will complete after many hours.
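
To make the aggregation concrete, here is a minimal pure-Python sketch of the same join and count on toy data. The entity, class, and function names are hypothetical and are not taken from the tutorial graph; the sketch only mirrors the logic of grouping by `(superclass, property)` and counting distinct `item_properties` edges.

```python
from collections import defaultdict

# Toy stand-ins for the three graphs used in the query (hypothetical data).
instance_of = [("alice", "film_director"), ("bob", "film_director"), ("bob", "actor")]
superclasses = {  # plays the role of P279star: each class maps to itself and its ancestors
    "film_director": {"film_director", "person"},
    "actor": {"actor", "person"},
}
item_properties = [("alice", "P21"), ("alice", "P106"), ("bob", "P21")]


def count_properties(instance_of, superclasses, item_properties):
    """Count distinct (entity, property) edges per (superclass, property) pair."""
    props_by_entity = defaultdict(set)
    for entity, prop in item_properties:
        props_by_entity[entity].add(prop)

    edges_by_group = defaultdict(set)
    for entity, cls in instance_of:
        for superclass in superclasses.get(cls, ()):
            for prop in props_by_entity[entity]:
                # A set keeps each edge once, like count(distinct l) in the query.
                edges_by_group[(superclass, prop)].add((entity, prop))
    return {group: len(edges) for group, edges in edges_by_group.items()}


counts = count_properties(instance_of, superclasses, item_properties)
print(counts[("person", "P21")])  # bob reaches "person" through two classes but counts once
```

Because the edges are collected in a set, an entity that reaches the same superclass through several classes contributes only once per property.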

In [35]:
%%time
kgtk("""
    query -i all -i p31x -i $TEMP/item_properties.tsv
        --match ' 
            p31x: (entity)-[]->(class), 
            all: (class)-[:P279star]->(superclass),
            item_properties: (entity)-[l]->(property)'
        --return 'distinct superclass as node1, "P1963computed" as label, property as node2, count(distinct l) as P1114'
        --order-by 'cast(P1114, int) desc'
        --limit 100
    / add-labels
""")

CPU times: user 284 ms, sys: 53.8 ms, total: 338 ms
Wall time: 53.2 s


Unnamed: 0,node1,label,node2,P1114,node1;label,node2;label
0,Q103940464,P1963computed,P31,13873,'continuant'@en,'instance of'@en
1,Q154954,P1963computed,P31,13873,'natural person'@en,'instance of'@en
2,Q159344,P1963computed,P31,13873,'heterotroph'@en,'instance of'@en
3,Q164509,P1963computed,P31,13873,'omnivore'@en,'instance of'@en
4,Q16887380,P1963computed,P31,13873,'group'@en,'instance of'@en
...,...,...,...,...,...,...
95,Q164509,P1963computed,P106,6280,'omnivore'@en,'occupation'@en
96,Q16887380,P1963computed,P106,6280,'group'@en,'occupation'@en
97,Q18336849,P1963computed,P106,6280,'item with given name property'@en,'occupation'@en
98,Q215627,P1963computed,P106,6280,'person'@en,'occupation'@en


The results look good, so we store them in `derived.P1963computed.tsv`:

In [36]:
%%time
kgtk("""
    query -i all -i p31x -i $TEMP/item_properties.tsv
        --match ' 
            p31x: (entity)-[]->(class), 
            all: (class)-[:P279star]->(superclass),
            item_properties: (entity)-[l]->(property)'
        --return 'distinct superclass as node1, "P1963computed" as label, property as node2, count(distinct l) as P1114' 
    / add-id --id-style wikidata
    / normalize --add-id True
    -o $OUT/derived.P1963computed.tsv
""")

CPU times: user 245 ms, sys: 59.8 ms, total: 305 ms
Wall time: 48.7 s


Add the new graph to the database and define the alias `p1963computed` for it.

In [37]:
kgtk("""
    query -i $OUT/derived.P1963computed.tsv --as p1963computed --limit 10
""")

Unnamed: 0,node1,label,node2,id
0,Q1005815,P1963computed,P103,Q1005815-P1963computed-P103
1,Q1005815-P1963computed-P103,P1114,1,Q1005815-P1963computed-P103-P1114-1-0000
2,Q1005815,P1963computed,P106,Q1005815-P1963computed-P106
3,Q1005815-P1963computed-P106,P1114,1,Q1005815-P1963computed-P106-P1114-1-0000
4,Q1005815,P1963computed,P108,Q1005815-P1963computed-P108
5,Q1005815-P1963computed-P108,P1114,1,Q1005815-P1963computed-P108-P1114-1-0000
6,Q1005815,P1963computed,P1343,Q1005815-P1963computed-P1343
7,Q1005815-P1963computed-P1343,P1114,1,Q1005815-P1963computed-P1343-P1114-1-0000
8,Q1005815,P1963computed,P140,Q1005815-P1963computed-P140
9,Q1005815-P1963computed-P140,P1114,1,Q1005815-P1963computed-P140-P1114-1-0000
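
In the normalized output above, each count is stored as a separate `P1114` edge whose `node1` is the `id` of the corresponding `P1963computed` edge. The sketch below shows, in plain Python, how such a pair of edges can be joined back into one `(class, property, count)` row. The `denormalize` helper is hypothetical and is not part of KGTK; the sample rows are copied from the output above.

```python
# The first four rows of the normalized output shown above.
edges = [
    {"node1": "Q1005815", "label": "P1963computed", "node2": "P103",
     "id": "Q1005815-P1963computed-P103"},
    {"node1": "Q1005815-P1963computed-P103", "label": "P1114", "node2": "1",
     "id": "Q1005815-P1963computed-P103-P1114-1-0000"},
    {"node1": "Q1005815", "label": "P1963computed", "node2": "P106",
     "id": "Q1005815-P1963computed-P106"},
    {"node1": "Q1005815-P1963computed-P106", "label": "P1114", "node2": "1",
     "id": "Q1005815-P1963computed-P106-P1114-1-0000"},
]


def denormalize(edges, main_label, qualifier_label):
    """Join each main edge with the qualifier edge that points at its id."""
    qualifier_by_edge_id = {
        e["node1"]: e["node2"] for e in edges if e["label"] == qualifier_label
    }
    rows = []
    for e in edges:
        if e["label"] == main_label and e["id"] in qualifier_by_edge_id:
            rows.append((e["node1"], e["node2"], int(qualifier_by_edge_id[e["id"]])))
    return rows


rows = denormalize(edges, "P1963computed", "P1114")
print(rows)
```

This is the same join that a query pattern of the form `(class)-[l:P1963computed]->(property), (l)-[:P1114]->(quantity)` expresses.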


Let's see the distribution of properties for a class. In this graph, the query for `film (Q11424)` fails with an "Empty set" error, so the cell below uses `film director (Q2526255)` instead.
> You can also try `entity (Q35120)`, which gives you the distribution of all properties in the KG:

In [39]:
%%time
kgtk("""
    query -i p1963computed
        --match '(class:Q2526255)-[l:P1963computed]->(property),
            (l)-[:P1114]->(quantity)'
        --return 'distinct class as class, property as property, quantity as count'
        --order-by 'cast(count, int) desc'
    / add-labels
""")

CPU times: user 17.6 ms, sys: 15.2 ms, total: 32.9 ms
Wall time: 1.63 s


Unnamed: 0,class,property,count,class;label,property;label
0,Q2526255,P106,674,'film director'@en,'occupation'@en
1,Q2526255,P31,674,'film director'@en,'instance of'@en
2,Q2526255,P21,673,'film director'@en,'sex or gender'@en
3,Q2526255,P569,665,'film director'@en,'date of birth'@en
4,Q2526255,P27,656,'film director'@en,'country of citizenship'@en
...,...,...,...,...,...
96,Q2526255,P582,1,'film director'@en,'end time'@en
97,Q2526255,P6758,1,'film director'@en,'supported sports team'@en
98,Q2526255,P740,1,'film director'@en,'location of formation'@en
99,Q2526255,P802,1,'film director'@en,'student'@en


Next, we invert the graph so that each property points to the classes it is used with. We store the resulting graph in `derived.Pproperty_domain.tsv` and define the alias `property_domain` for it in the database:

In [41]:
%%time
kgtk("""
    query -i p1963computed
        --match '
            (class)-[l:P1963computed]->(property),
            (l)-[:P1114]->(quantity)'
        --return 'distinct property as node1, "Pproperty_domain" as label, class as node2, quantity as P1114'
        --order-by 'property, cast(P1114, int) desc'
    / add-id --id-style wikidata
    / normalize --add-id True
    -o $OUT/derived.Pproperty_domain.tsv
""")



CPU times: user 24.5 ms, sys: 17.4 ms, total: 42 ms
Wall time: 3.39 s


In [42]:
kgtk("query -i $OUT/derived.Pproperty_domain.tsv --as property_domain --limit 10")

Unnamed: 0,node1,label,node2,id
0,P101,Pproperty_domain,Q103940464,P101-Pproperty_domain-Q103940464
1,P101-Pproperty_domain-Q103940464,P1114,353,P101-Pproperty_domain-Q103940464-P1114-353-0000
2,P101,Pproperty_domain,Q154954,P101-Pproperty_domain-Q154954
3,P101-Pproperty_domain-Q154954,P1114,353,P101-Pproperty_domain-Q154954-P1114-353-0000
4,P101,Pproperty_domain,Q159344,P101-Pproperty_domain-Q159344
5,P101-Pproperty_domain-Q159344,P1114,353,P101-Pproperty_domain-Q159344-P1114-353-0000
6,P101,Pproperty_domain,Q164509,P101-Pproperty_domain-Q164509
7,P101-Pproperty_domain-Q164509,P1114,353,P101-Pproperty_domain-Q164509-P1114-353-0000
8,P101,Pproperty_domain,Q16887380,P101-Pproperty_domain-Q16887380
9,P101-Pproperty_domain-Q16887380,P1114,353,P101-Pproperty_domain-Q16887380-P1114-353-0000


Let's see the distribution of classes for `field of work (P101)`. We restrict the results to subclasses of `group (Q16887380)` because otherwise the results contain too many abstract classes. We see that `field of work (P101)` is used most often with `human (Q5)` and its superclasses, followed by classes such as `professional`, `creator`, `author`, and `writer`:

In [46]:
kgtk("""
    query -i property_domain -i all
        --match '
            all: (class)-[:P279star]->(:Q16887380), 
            property_domain: (property:P101)-[l:Pproperty_domain]->(class),
            property_domain: (l)-[:P1114]->(quantity)'
        --return 'distinct property as node1, "Pproperty_domain" as label, class as node2, quantity as P1114'
        --order-by 'property, cast(P1114, int) desc'
        --limit 10
    / add-labels
""")

Unnamed: 0,node1,label,node2,P1114,node1;label,node2;label
0,P101,Pproperty_domain,Q164509,353,'field of work'@en,'omnivore'@en
1,P101,Pproperty_domain,Q16887380,353,'field of work'@en,'group'@en
2,P101,Pproperty_domain,Q45983014,353,'field of work'@en,'organisms by adaptation'@en
3,P101,Pproperty_domain,Q5,353,'field of work'@en,'human'@en
4,P101,Pproperty_domain,Q702269,331,'field of work'@en,'professional'@en
5,P101,Pproperty_domain,Q2500638,279,'field of work'@en,'creator'@en
6,P101,Pproperty_domain,Q482980,277,'field of work'@en,'author'@en
7,P101,Pproperty_domain,Q36180,274,'field of work'@en,'writer'@en
8,P101,Pproperty_domain,Q15980158,248,'field of work'@en,'non-fiction writer'@en
9,P101,Pproperty_domain,Q1650915,247,'field of work'@en,'researcher'@en
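
The restriction to subclasses in the query above relies on the `P279star` edges. As a rough illustration, the sketch below computes a reflexive transitive closure over a toy subclass hierarchy and uses it to filter classes. It assumes that `P279star` links every class to itself and to all of its ancestors, which is consistent with `group (Q16887380)` appearing in its own results above; the class names and helper functions are hypothetical.

```python
def reflexive_transitive_closure(subclass_of):
    """Map every class to the set containing itself and all of its ancestors."""
    nodes = set(subclass_of) | {p for parents in subclass_of.values() for p in parents}
    closure = {}
    for start in nodes:
        seen = {start}
        stack = [start]
        while stack:
            current = stack.pop()
            for parent in subclass_of.get(current, ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        closure[start] = seen
    return closure


def restrict_to_subclasses(classes, closure, ancestor):
    """Keep only the classes that reach `ancestor` through the closure."""
    return [c for c in classes if ancestor in closure.get(c, set())]


# Toy hierarchy (hypothetical): direct subclass-of edges only.
subclass_of = {
    "writer": {"author"},
    "author": {"creator"},
    "creator": {"person"},
    "film": {"visual_artwork"},
}
closure = reflexive_transitive_closure(subclass_of)
kept = restrict_to_subclasses(["writer", "film", "person"], closure, "person")
print(kept)
```

Precomputing the closure once turns each subclass test into a single lookup, which is why the queries can match a `P279star` edge instead of walking the hierarchy.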


### Summary of this section
In this section we:
- Computed `P1963computed` to record how frequently each property is used in every class.
- Used `P1963computed` to see the distribution of properties for a few classes.
- Illustrated the ability to break down very expensive queries into simpler steps.
- Illustrated a KGTK feature that allows you to use the results of one query as a new graph (`$TEMP/item_properties.tsv`) that can be integrated into other queries.

## Compute the distribution of units for quantity properties
This part of the tutorial illustrates how to work with KGTK structured literals:
- quantities: composed of a numeric value followed by the identifier of a unit; quantities can also define tolerances
- dates and times: composed of an ISO-formatted date, followed by a numeric precision indicator, and sometimes by a calendar
- monolingual strings: composed of a unicode string followed by a language tag

Additional documentation on the KGTK file format is in https://kgtk.readthedocs.io/en/latest/specification/
and documentation for the functions to operate on structured literals within queries is in https://kgtk.readthedocs.io/en/latest/transform/query/
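
As a rough illustration of the three literal shapes listed above, the following sketch parses them with simplified regular expressions. The patterns are assumptions based on the description in this section: they handle only Wikidata `Q` units and ignore calendars, and the KGTK specification remains the authoritative reference. `parse_literal` and `wd_unit` are hypothetical helpers; inside queries, the tutorial uses the built-in `kgtk_quantity_wd_units` function instead.

```python
import re

# Simplified patterns; see the KGTK specification for the full grammar.
QUANTITY = re.compile(
    r"^(?P<number>[+-]?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?)"  # numeric value
    r"(?:\[(?P<low>[^,\]]+),(?P<high>[^\]]+)\])?"        # optional tolerances
    r"(?P<unit>Q\d+)?$"                                   # optional Wikidata unit
)
DATE = re.compile(r"^\^(?P<iso>[^/]+)(?:/(?P<precision>\d+))?$")
LANG_STRING = re.compile(r"^'(?P<text>.*)'@(?P<lang>[A-Za-z-]+)$")


def parse_literal(value):
    """Return (kind, fields) for a literal, or ('symbol', ...) if nothing matches."""
    for kind, pattern in (("date", DATE), ("string", LANG_STRING), ("quantity", QUANTITY)):
        match = pattern.match(value)
        if match:
            return kind, match.groupdict()
    return "symbol", {"value": value}


def wd_unit(value):
    """Return the Wikidata unit of a quantity literal, or None if it has none."""
    kind, fields = parse_literal(value)
    return fields["unit"] if kind == "quantity" else None


print(parse_literal("+1.5[-0.1,+0.1]Q11573"))
print(parse_literal("^2017-04-23T00:00:00Z/11"))
print(parse_literal("'population'@en"))
print(wd_unit("46643"))
```

A quantity without a unit yields `None`, which corresponds to the empty `node2` values that appear when units are extracted from unitless quantities.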

Below is a specific example of how to query the units in structured literals. The objective in the example is to compute a distribution of the units used in all properties that store quantities.
The query uses the `quantity` graph, which contains all properties whose values are quantities. 

The results of the query are interesting as we see some inconsistencies in the data present in our small subset of Wikidata. 
For example, most instances of `population (P1082)` have no units, two have the unit `point in time (Q186408)`, and one has the unit `Habitants (Q15621516)`; neither of these is a `unit of measurement (Q47574)`.

In [47]:
kgtk("""
    query -i quantity
        --match '(n1)-[l {label: property}]->(quantity)'
        --return 'distinct property as node1, "Pproperty_units_used" as label, kgtk_quantity_wd_units(quantity) as node2, count(distinct l) as P1114'
        --order-by 'property, cast(P1114, int) desc'
    / add-labels
""")

Unnamed: 0,node1,label,node2,P1114,node1;label,node2;label
0,P1081,Pproperty_units_used,,6810,'Human Development Index'@en,
1,P1082,Pproperty_units_used,,46643,'population'@en,
2,P1082,Pproperty_units_used,Q186408,2,'population'@en,'point in time'@en
3,P1082,Pproperty_units_used,Q15621516,1,'population'@en,'Habitants'@en
4,P1082,Pproperty_units_used,Q5727902,1,'population'@en,'circa'@en
...,...,...,...,...,...,...
418,P8476,Pproperty_units_used,,992,'BTI Governance Index'@en,
419,P8477,Pproperty_units_used,,970,'BTI Status Index'@en,
420,P8687,Pproperty_units_used,,6469,'social media followers'@en,
421,P8843,Pproperty_units_used,,201,'poverty incidence'@en,


We will store the units graph in `derived.Pproperty_units_used.tsv`. The final query includes a `where` clause to filter out the NULL values.

In [48]:
kgtk("""
    query -i quantity
        --match '(n1)-[l {label: property}]->(quantity)'
        --where 'kgtk_quantity_wd_units(quantity) IS NOT NULL'
        --return 'distinct property as node1, "Pproperty_units_used" as label, kgtk_quantity_wd_units(quantity) as node2, count(distinct l) as P1114'
        --order-by 'property, cast(P1114, int) desc'
    / add-id --id-style wikidata
    / normalize --add-id True
    -o $OUT/derived.Pproperty_units_used.tsv
""")

kgtk("query -i $OUT/derived.Pproperty_units_used.tsv --as property_units_used --limit 10")

Unnamed: 0,node1,label,node2,id
0,P1082,Pproperty_units_used,Q186408,P1082-Pproperty_units_used-Q186408
1,P1082-Pproperty_units_used-Q186408,P1114,2,P1082-Pproperty_units_used-Q186408-P1114-2-0000
2,P1082,Pproperty_units_used,Q15621516,P1082-Pproperty_units_used-Q15621516
3,P1082-Pproperty_units_used-Q15621516,P1114,1,P1082-Pproperty_units_used-Q15621516-P1114-1-0000
4,P1082,Pproperty_units_used,Q5727902,P1082-Pproperty_units_used-Q5727902
5,P1082-Pproperty_units_used-Q5727902,P1114,1,P1082-Pproperty_units_used-Q5727902-P1114-1-0000
6,P1083,Pproperty_units_used,Q44666669,P1083-Pproperty_units_used-Q44666669
7,P1083-Pproperty_units_used-Q44666669,P1114,2,P1083-Pproperty_units_used-Q44666669-P1114-2-0000
8,P1083,Pproperty_units_used,Q42177,P1083-Pproperty_units_used-Q42177
9,P1083-Pproperty_units_used-Q42177,P1114,1,P1083-Pproperty_units_used-Q42177-P1114-1-0000


### Summary of this section
In this section we:
- Computed the distribution of the units used for properties that store quantities
- Found examples of inappropriate use of units of measure in Wikidata
- Illustrated how to use functions in `query` to extract elements from structured literals

## Compute the number of awards by sex or gender of the receiver

First, get a distribution of the `sex or gender (P21)` of people in our graph.
The distribution is skewed, perhaps because it is skewed in Wikidata or as a result of how the tutorial graph was constructed.

In [49]:
kgtk("""
    query -i all
        --match '
            (person)-[:P31]->(:Q5),
            (person)-[:P21]->(sex_or_gender)'
        --return 'distinct sex_or_gender as sex_or_gender, count(distinct person) as count'
    / add-labels
""")

Unnamed: 0,sex_or_gender,count,sex_or_gender;label
0,Q6581072,1783,'female'@en
1,Q6581097,8111,'male'@en


Below, we compute the distribution of `sex or gender (P21)` per type of award. We use the property `award received (P166)` to extract the awards that people received.

We create a new property `Paward_count` to record the count, and put the `sex or gender (P21)` as a qualifier.

In [50]:
%%time
kgtk("""
    query -i all
        --match '
            (actor)-[:P31]->(:Q5),
            (actor)-[:P21]->(sex_or_gender),
            (actor)-[:P166]->(award)-[:P31]->(award_type)'
        --return 'distinct award_type as node1, "Paward_count" as label, sex_or_gender as P21, count(distinct actor) as node2'
        --order-by 'award_type'
    / add-labels
""")

CPU times: user 22.8 ms, sys: 17.5 ms, total: 40.3 ms
Wall time: 2.75 s


Unnamed: 0,node1,label,P21,node2,node1;label,P21;label
0,Q101007233,Paward_count,Q6581097,1,'film critics association'@en,'male'@en
1,Q1011547,Paward_count,Q6581072,38,'Golden Globe Award'@en,'female'@en
2,Q1011547,Paward_count,Q6581097,42,'Golden Globe Award'@en,'male'@en
3,Q101251494,Paward_count,Q6581097,24,'star'@en,'male'@en
4,Q1044427,Paward_count,Q6581072,8,'Primetime Emmy Award'@en,'female'@en
...,...,...,...,...,...,...
220,Q96474707,Paward_count,Q6581097,16,'honorary award'@en,'male'@en
221,Q96474709,Paward_count,Q6581072,2,'award for best visual effects'@en,'female'@en
222,Q96474709,Paward_count,Q6581097,121,'award for best visual effects'@en,'male'@en
223,Q973011,Paward_count,Q6581097,18,'campaign medal'@en,'male'@en


Store the new `Paward_count` graph in a file and define the alias `award_count` for it:

In [51]:
%%time
kgtk("""
    query -i all
        --match '
            (actor)-[:P31]->(:Q5),
            (actor)-[:P21]->(sex_or_gender),
            (actor)-[:P166]->(award)-[:P31]->(award_type)'
        --return 'distinct award_type as node1, "Paward_count" as label, sex_or_gender as P21, count(distinct actor) as node2'
        --order-by 'award_type'
    / add-id --id-style wikidata
    / normalize --add-id True
    -o $OUT/derived.Paward_count.tsv
""")



CPU times: user 27.7 ms, sys: 33.4 ms, total: 61.1 ms
Wall time: 3.53 s


In [52]:
kgtk("query -i $OUT/derived.Paward_count.tsv --as award_count --limit 10")

Unnamed: 0,node1,label,node2,id
0,Q101007233,Paward_count,1,Q101007233-Paward_count-6b86b2
1,Q101007233-Paward_count-6b86b2,P21,Q6581097,Q101007233-Paward_count-6b86b2-P21-Q6581097-0000
2,Q1011547,Paward_count,38,Q1011547-Paward_count-aea921
3,Q1011547-Paward_count-aea921,P21,Q6581072,Q1011547-Paward_count-aea921-P21-Q6581072-0000
4,Q1011547,Paward_count,42,Q1011547-Paward_count-73475c
5,Q1011547-Paward_count-73475c,P21,Q6581097,Q1011547-Paward_count-73475c-P21-Q6581097-0000
6,Q101251494,Paward_count,24,Q101251494-Paward_count-c23560
7,Q101251494-Paward_count-c23560,P21,Q6581097,Q101251494-Paward_count-c23560-P21-Q6581097-0000
8,Q1044427,Paward_count,8,Q1044427-Paward_count-2c6242
9,Q1044427-Paward_count-2c6242,P21,Q6581072,Q1044427-Paward_count-2c6242-P21-Q6581072-0000
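
The six-character suffixes in the ids above (for example `6b86b2` for the count `1`) look like truncated hashes of the `node2` value. The sketch below checks one hypothesis: that the suffix is the first six hex digits of the SHA-256 digest of the value. This is an assumption inferred from the sample output, not documented KGTK behavior, and `literal_suffix` is a hypothetical helper.

```python
import hashlib


def literal_suffix(value: str, length: int = 6) -> str:
    """Return the first `length` hex digits of the SHA-256 digest of `value`."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:length]


# node2 values and id suffixes copied from the output above.
observed = {"1": "6b86b2", "38": "aea921", "42": "73475c", "24": "c23560", "8": "2c6242"}

# Compare the hypothesis against every observed pair.
matches = {value: literal_suffix(value) == suffix for value, suffix in observed.items()}
print(matches)
```

If the hypothesis holds for a pair, the id depends only on the edge's `node1`, `label`, and `node2` value, so rerunning the pipeline on the same data would produce the same id for that edge.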


### Summary of this section
In this section we:
- Profiled awards to find the gender or sex of awardees, and found that males appear more frequently. We don't know if it is a skew in Wikidata or the real world.
- Defined a new property to hold the data so that it can be shown in the browser.

In [53]:
kgtk("""
    query -i all
        --match '
            (award)-[:P31]->(award_type)-[:P279star]->(:Q4220917)'
        --return 'distinct award_type as award_type'
    / add-labels
""")

Unnamed: 0,award_type,award_type;label
0,Q1011547,'Golden Globe Award'@en
1,Q106301,'Academy Award for Best Supporting Actress'@en
2,Q110145,'MTV Movie Awards'@en
3,Q1111310,'Directors Guild of America Award'@en
4,Q1131772,'Saturn Award for Best Science Fiction Film'@en
...,...,...
90,Q96474700,'award for best screenplay'@en
91,Q96474701,'award for best adapted screenplay'@en
92,Q96474704,'award for best makeup and hairdressing'@en
93,Q96474707,'honorary award'@en
