# Extracting commonsense knowledge from Wikidata

## Define relevant properties

In [1]:
%env properties=P1889,P461,P527,P186,P463,P276,P170,P366,P279

env: properties=P1889,P461,P527,P186,P463,P276,P170,P366,P279


## Keep only edges with relevant properties

In [2]:
%%bash
kgtk filter -p " ; $properties ; " input/wikidata_edges_20200504.tsv.gz > tmp/kgtk_wikidata_filter.tsv
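
For readers without KGTK at hand, the effect of this step can be approximated in plain Python: keep only the rows whose `label` column holds one of the selected properties. This is a minimal sketch, not KGTK's implementation; the sample edges and the helper name `filter_by_property` are made up for illustration.

```python
import csv
import io

# The property list defined at the top of the notebook.
PROPERTIES = "P1889,P461,P527,P186,P463,P276,P170,P366,P279".split(",")

def filter_by_property(rows, properties):
    """Yield only the edges whose label (property) is in `properties`."""
    wanted = set(properties)
    for row in rows:
        if row["label"] in wanted:
            yield row

# Tiny made-up edge file in the node1/label/node2 layout (illustrative only).
sample = "node1\tlabel\tnode2\nQ1\tP279\tQ2\nQ1\tP31\tQ3\nQ4\tP527\tQ5\n"
reader = csv.DictReader(io.StringIO(sample), delimiter="\t", quoting=csv.QUOTE_NONE)

kept = [(r["node1"], r["label"], r["node2"]) for r in filter_by_property(reader, PROPERTIES)]
print(kept)
```

The `P31` edge is dropped because `P31` is not in the property list; the other two edges are kept.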

## Remove columns

In [3]:
%env ignore_cols=id,rank,node2;magnitude,node2;unit,node2;item,node2;lower,node2;upper,node2;entity-type,node2;longitude,node2;latitude,node2;date,node2;calendar,node2;precision

env: ignore_cols=id,rank,node2;magnitude,node2;unit,node2;item,node2;lower,node2;upper,node2;entity-type,node2;longitude,node2;latitude,node2;date,node2;calendar,node2;precision


In [4]:
%%bash
kgtk remove_columns -c "$ignore_cols" -i tmp/kgtk_wikidata_filter.tsv > tmp/kgtk_wikidata_cols.tsv
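
As a rough illustration of what dropping these columns does, the sketch below removes every column named in `ignore_cols` from a header and its rows. The input header and the sample row are assumptions chosen for the example (the real file's header is not shown in this notebook), and `remove_columns` here is a hypothetical helper, not the KGTK command itself.

```python
IGNORE_COLS = (
    "id,rank,node2;magnitude,node2;unit,node2;item,node2;lower,node2;upper,"
    "node2;entity-type,node2;longitude,node2;latitude,node2;date,"
    "node2;calendar,node2;precision"
).split(",")

def remove_columns(header, rows, ignore):
    """Return (header, rows) with every column named in `ignore` dropped."""
    ignore = set(ignore)
    keep_idx = [i for i, name in enumerate(header) if name not in ignore]
    new_header = [header[i] for i in keep_idx]
    new_rows = [[row[i] for i in keep_idx] for row in rows]
    return new_header, new_rows

# Assumed input layout, for illustration only.
header = ["id", "node1", "label", "node2", "rank", "node2;magnitude", "node2;entity-type"]
rows = [["e1", "Q1", "P279", "Q2", "normal", "", "item"]]

new_header, new_rows = remove_columns(header, rows, IGNORE_COLS)
print(new_header)
print(new_rows)
```

Only `node1`, `label`, and `node2` survive, which matches the three-column header reported when the compacted file is read in the labeling step.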

## Deduplicate

In [5]:
%%bash
kgtk compact -i tmp/kgtk_wikidata_cols.tsv -o tmp/kgtk_wikidata_compact.tsv
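
Once only `node1`, `label`, and `node2` remain, deduplication amounts to dropping repeated triples. The sketch below shows that simplified view in plain Python; it is an assumption about the net effect on a three-column file, not a description of how `kgtk compact` works internally, and it keeps rows in first-seen order, which may differ from KGTK's output order.

```python
def deduplicate(rows):
    """Drop repeated rows, keeping the first occurrence of each triple."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

# Made-up edges; the first and third rows are identical.
edges = [
    ["Q1", "P279", "Q2"],
    ["Q4", "P527", "Q5"],
    ["Q1", "P279", "Q2"],
]
unique_edges = deduplicate(edges)
print(unique_edges)
```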

## Add labels

In [4]:
%%bash
kgtk --debug lift --verbose \
     --input-file tmp/kgtk_wikidata_compact.tsv \
     --label-file input/wikidata/wiki_labels.tsv \
     --output-file tmp/kgtk_wikidata.tsv \
     --columns-to-lift node1 node2 label \
     --prefilter-labels

Opening the input file: tmp/kgtk_wikidata_compact.tsv
KgtkReader: File_path.suffix: .tsv
KgtkReader: reading file tmp/kgtk_wikidata_compact.tsv
header: node1	label	node2
node1 column found, this is a KGTK edge file
KgtkReader: Special columns: node1=0 label=1 node2=2 id=-1
KgtkReader: Reading an edge file.
Opening the label file: input/wikidata/wiki_labels.tsv
KgtkReader: File_path.suffix: .tsv
KgtkReader: reading file input/wikidata/wiki_labels.tsv
header: node1	label	node2
node1 column found, this is a KGTK edge file
KgtkReader: Special columns: node1=0 label=1 node2=2 id=-1
KgtkReader: Reading an edge file.
Lifting with in-memory buffering.
Reading input data to prefilter the labels.
Loading input rows without labels from tmp/kgtk_wikidata_compact.tsv
Labels needed: 3431341
Loading labels from the label file.
Loading labels from input/wikidata/wiki_labels.tsv
Filtering for needed labels
label_match_column_idx=0 (node1).
label_select_column_idx=1 (label).
label_value_column_idx=2 (no
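
Conceptually, the lift step looks up a label for each value in the chosen columns and writes it to a new column named `<column>;label`. The following is a minimal sketch of that idea, assuming the labels fit in an in-memory dictionary; the edge, the label strings, and the helper `lift_labels` are made up for illustration, and leaving a missing label as an empty string is an assumption of this sketch.

```python
def lift_labels(edges, labels, columns=("node1", "node2", "label"), suffix=";label"):
    """Add a `<column><suffix>` field to each edge, looked up in `labels`."""
    lifted = []
    for edge in edges:
        row = dict(edge)
        for col in columns:
            # Missing labels become empty strings (assumption of this sketch).
            row[col + suffix] = labels.get(edge[col], "")
        lifted.append(row)
    return lifted

# Made-up label table and edge, for illustration only.
labels = {"Q1": "'first item'@en", "P279": "'subclass of'@en"}
edges = [{"node1": "Q1", "label": "P279", "node2": "Q2"}]

lifted = lift_labels(edges, labels)
print(list(lifted[0].keys()))
print(lifted[0])
```

The resulting column order is `node1`, `label`, `node2`, `node1;label`, `node2;label`, `label;label`. Here `Q2` has no entry in the label table, so `node2;label` stays empty.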

## Add PageRank

In [5]:
%%bash
kgtk --debug lift --verbose \
     --input-file tmp/kgtk_wikidata.tsv \
     --label-file input/wikidata/wikidata-pagerank-only-sorted2.tsv \
     --output-file tmp/kgtk_wikidata_with_pr.tsv \
     --columns-to-lift node1 node2 \
     --property vertex_pagerank \
     --lift-suffix ";pagerank" \
     --prefilter-labels

Opening the input file: tmp/kgtk_wikidata.tsv
KgtkReader: File_path.suffix: .tsv
KgtkReader: reading file tmp/kgtk_wikidata.tsv
header: node1	label	node2	node1;label	node2;label	label;label
node1 column found, this is a KGTK edge file
KgtkReader: Special columns: node1=0 label=1 node2=2 id=-1
KgtkReader: Reading an edge file.
Opening the label file: input/wikidata/wikidata-pagerank-only-sorted2.tsv
KgtkReader: File_path.suffix: .tsv
KgtkReader: reading file input/wikidata/wikidata-pagerank-only-sorted2.tsv
header: node1	relation	node2	id
node1 column found, this is a KGTK edge file
KgtkReader: Special columns: node1=0 label=1 node2=2 id=3
KgtkReader: Reading an edge file.
Lifting with in-memory buffering.
Reading input data to prefilter the labels.
Loading input rows without labels from tmp/kgtk_wikidata.tsv
Labels needed: 3431334
Loading labels from the label file.
Loading labels from input/wikidata/wikidata-pagerank-only-sorted2.tsv
Filtering for needed labels
label_match_column_idx=

## Compute statistics

In [7]:
%%bash
kgtk graph_statistics --directed --degrees --pagerank --hits --log p463_summary.txt -i tmp/kgtk_wikidata.tsv > tmp/stats/wiki_stats.tsv
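
To give a feel for the simplest of these statistics, the sketch below counts in-degree and out-degree per node for a directed edge list using only the standard library. It does not attempt PageRank or HITS, and it is not how `kgtk graph_statistics` computes its results; the edges are made up for illustration.

```python
from collections import Counter

def degree_counts(edges):
    """Return (in_degree, out_degree) counters for directed (node1, label, node2) edges."""
    in_degree = Counter()
    out_degree = Counter()
    for node1, _label, node2 in edges:
        out_degree[node1] += 1
        in_degree[node2] += 1
    return in_degree, out_degree

# Made-up directed edges, for illustration only.
edges = [
    ("Q1", "P279", "Q2"),
    ("Q3", "P279", "Q2"),
    ("Q1", "P527", "Q4"),
]
in_deg, out_deg = degree_counts(edges)
print(dict(in_deg))
print(dict(out_deg))
```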

## Filtering
* Should we filter on PageRank or another popularity measure?
* Should we also filter out nodes that have no label?
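
Both ideas could be prototyped as a single row predicate. This is a sketch only: it assumes the edge file carries `node1;label`, `node2;label`, `node1;pagerank`, and `node2;pagerank` columns (the names produced by the lift steps above), and the threshold value and the helper `keep_edge` are hypothetical.

```python
def keep_edge(row, min_pagerank):
    """Keep an edge only if both nodes have a label and a PageRank >= min_pagerank."""
    for node in ("node1", "node2"):
        if not row.get(node + ";label"):
            return False  # node has no label
        try:
            score = float(row.get(node + ";pagerank") or "")
        except ValueError:
            return False  # missing or unparsable PageRank value
        if score < min_pagerank:
            return False
    return True

# Made-up rows and threshold, for illustration only.
rows = [
    {"node1": "Q1", "label": "P279", "node2": "Q2",
     "node1;label": "'first item'@en", "node2;label": "'second item'@en",
     "node1;pagerank": "2.0e-6", "node2;pagerank": "5.0e-6"},
    {"node1": "Q3", "label": "P279", "node2": "Q4",
     "node1;label": "", "node2;label": "'fourth item'@en",
     "node1;pagerank": "9.0e-6", "node2;pagerank": "9.0e-6"},
    {"node1": "Q5", "label": "P527", "node2": "Q6",
     "node1;label": "'fifth item'@en", "node2;label": "'sixth item'@en",
     "node1;pagerank": "1.0e-9", "node2;pagerank": "9.0e-6"},
]
MIN_PAGERANK = 1.0e-6  # hypothetical cutoff

kept = [(r["node1"], r["node2"]) for r in rows if keep_edge(r, MIN_PAGERANK)]
print(kept)
```

The second row is dropped because `Q3` has no label, and the third because `Q5` falls below the cutoff. A sensible threshold would have to be chosen by inspecting the actual PageRank distribution.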