# Example Scenario 2: Validation and statistics on a subset of Wikidata

*Bob wants to extract the subset of Wikidata that uses the `member of (P463)` property, validate it, and compute its statistics, including centrality metrics.*

## Preparation

To run this notebook, Bob needs the Wikidata edges file. We work with version `20200504` of Wikidata. This file is presumably not on Bob's laptop, so we download and unpack it first (note: macOS users might need to install `wget` first with `brew install wget`):

In [None]:
%%bash
wget ...

In [None]:
%%bash
gunzip wikidata_edges_20200504.tsv.gz

Here we assume that the Wikidata file has already been transformed to KGTK format.

Alternatively, you can download the Wikidata `json.bz2` dump and convert it yourself with the command:

`kgtk import_wikidata -i wikidata-20200504-all.json.bz2 --node wikidata_nodes_20200504.tsv --edge wikidata_edges_20200504.tsv --qual wikidata_qualifiers_20200504.tsv`

This conversion takes around 11 hours.

## Implementation in KGTK

First, we filter the edges file to keep only the `P463` relations:

In [None]:
%%bash
kgtk filter -p ' ; P463 ; ' wikidata_edges_20200504.tsv > p463.tsv
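
The pattern `' ; P463 ; '` leaves the subject and object positions empty and fixes only the property, so every edge with that property is kept. As a rough illustration of that selection (not how KGTK implements it), the sketch below applies the same idea to a tiny in-memory sample. The column names `id`, `node1`, `label`, `node2` and all sample rows are assumptions made up for this example.

```python
import csv
import io

# Hypothetical three-edge sample in a KGTK-style TSV layout.
# The rows are invented for illustration and are not real Wikidata edges.
SAMPLE = (
    "id\tnode1\tlabel\tnode2\n"
    "e1\tQ_a\tP463\tQ_org\n"
    "e2\tQ_a\tP31\tQ5\n"
    "e3\tQ_b\tP463\tQ_org\n"
)


def filter_by_label(tsv_text, label):
    """Return the rows whose 'label' column equals `label`."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [row for row in reader if row["label"] == label]


p463_edges = filter_by_label(SAMPLE, "P463")
for edge in p463_edges:
    print(edge["node1"], edge["label"], edge["node2"])
```

Only the two `P463` rows survive; the `P31` edge is dropped regardless of its subject or object.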

Next, we clean it and remove columns that are not relevant for this use case:

In [None]:
%env ignore_cols=id,rank,node2;magnitude,node2;unit,node2;item,node2;lower,node2;upper,node2;entity-type,node2;longitude,node2;latitude,node2;date,node2;calendar,node2;precision

In [None]:
%%bash
kgtk clean_data --error-limit 0 p463.tsv  / remove_columns -c "$ignore_cols" | grep . > graph.tsv
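
To make the effect of the column removal and the trailing `grep .` concrete, here is a minimal Python sketch of the same two ideas: drop the columns named in an ignore list and skip empty lines. It does not reproduce the validation performed by `clean_data`. The sample header and row are assumptions invented for illustration, and `IGNORE` holds only a few of the names from `ignore_cols`.

```python
import csv
import io

# A few of the column names listed in ignore_cols above.
IGNORE = {"id", "rank", "node2;magnitude"}

# Hypothetical sample with one data row and one blank line (invented data).
SAMPLE = (
    "id\tnode1\tlabel\tnode2\trank\tnode2;magnitude\n"
    "e1\tQ_a\tP463\tQ_org\tnormal\t\n"
    "\n"
)


def drop_columns(tsv_text, ignore):
    """Return rows without the ignored columns, skipping blank lines."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader)
    keep = [i for i, name in enumerate(header) if name not in ignore]
    rows = [[header[i] for i in keep]]
    for row in reader:
        if not any(row):  # blank line, comparable to what `grep .` filters out
            continue
        rows.append([row[i] for i in keep])
    return rows


cleaned = drop_columns(SAMPLE, IGNORE)
for row in cleaned:
    print("\t".join(row))
```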

Finally, we compute graph statistics:

In [None]:
%%bash
kgtk graph_statistics --directed --degrees --pagerank --log p463_summary.txt graph.tsv > p463_stats.tsv

You can now inspect the individual node metrics in `p463_stats.tsv` or read the summary in `p463_summary.txt`.

For example, we learn that the mean degree is 2.45 and that the node with the highest PageRank is ORCID, Inc. (`Q19861084`).
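
To give an intuition for these two metrics, the following sketch computes a mean degree and a PageRank score on a made-up four-node graph in plain Python. It is an illustration under assumptions, not KGTK's implementation: degree here counts incoming plus outgoing edges, PageRank uses a simple power iteration with a damping factor of 0.85, and the exact definitions used by `graph_statistics` may differ.

```python
from collections import defaultdict

# Hypothetical toy graph: three members pointing at one organisation.
EDGES = [("a", "org"), ("b", "org"), ("c", "org")]


def mean_degree(edges):
    """Average of (in-degree + out-degree) over all nodes."""
    degree = defaultdict(int)
    for src, dst in edges:
        degree[src] += 1  # outgoing edge
        degree[dst] += 1  # incoming edge
    return sum(degree.values()) / len(degree)


def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank; dangling nodes spread their rank evenly."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = defaultdict(list)
    for src, dst in edges:
        out_links[src].append(dst)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is redistributed to all.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        new_rank = {
            node: (1.0 - damping) / n + damping * dangling / n for node in nodes
        }
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank


ranks = pagerank(EDGES)
top_node = max(ranks, key=ranks.get)
print("mean degree:", mean_degree(EDGES))
print("highest PageRank:", top_node)
```

In this toy graph the hub node `org` receives every edge and therefore ends up with the highest rank, which mirrors why a widely joined organisation can dominate the `P463` graph.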