# Constructing ISI's COVID-19 Knowledge Graph

This notebook shows how we use our KGTK toolkit (link) to construct the COVID-19 knowledge graph for the CORD-19 corpus using the text extractions from the BLENDER group at UIUC. The text extractions include a variety of entity extractions with links to bioinformatics databases. Our approach is to:

* extract from Wikidata the subgraph that covers all the publications in the CORD-19 corpus and the entities identified by the BLENDER group
* define new Wikidata items for publications or entities that are not present in public Wikidata
* annotate the article items with the relevant entities
* preserve provenance

We implement the approach in the following steps:

* data preparation: convert the JSON representation of BLENDER output to CSV files that are easy to process
* extract Wikidata subgraph: extract from Wikidata the articles, authors, and entities mentioned in the BLENDER corpus
* create missing items: create nodes for articles and entities that are not present in Wikidata
* create mention edges: create edges to record the entity extractions from BLENDER, including justifications
* incorporate analytic outputs: add edges to record graph metrics such as pagerank
* export knowledge graph: export the graph to KGTK edges, RDF and Neo4j

## Data Preparation

Set up environment variables with the locations of the input files.

In [2]:
%env COVID=/Users/pedroszekely/data/covid/blender
%env WD=/Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidata-20200504

env: COVID=/Users/pedroszekely/data/covid/blender
env: WD=/Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidata-20200504
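The same locations can also be read from Python cells. The following is a minimal sketch, assuming the `COVID` and `WD` variables are set as above; the helper name `data_dir` and the fallback directories are hypothetical and only there so the cell runs when the variables are missing. Using `pathlib` avoids the quoting that the shell commands need because the `WD` path contains a space.

```python
import os
from pathlib import Path


def data_dir(var_name: str, default: str) -> Path:
    """Return the directory stored in an environment variable, or a default if it is unset."""
    return Path(os.environ.get(var_name, default))


# The fallback values are placeholders for illustration only (assumption).
covid_dir = data_dir("COVID", "data/covid/blender")
wd_dir = data_dir("WD", "data/wikidata-20200504")

# Build the path to the edges dump; no shell quoting is needed here.
edges_file = wd_dir / "wikidata_edges_20200504.tsv.gz"
print(edges_file)
```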


### Wikidata Data

The Wikidata files are large:

* `wikidata_nodes_20200504.tsv.gz`: the English labels, aliases and descriptions for all items in Wikidata
* `wikidata_edges_20200504.tsv.gz`: the edges for all statements in Wikidata
* `wikidata_qualifiers_20200504.tsv.gz`: the qualifier edges for all statements in Wikidata

In [18]:
ls -lh "$WD"

total 44833280
-rw-------  1 pedroszekely  staff    21M May 11 14:19 P279.tsv.gz
-rw-------  1 pedroszekely  staff   6.9M May 11 15:59 P279_star.tsv.gz
-rw-------  1 pedroszekely  staff    12M May 11 14:29 P279_truncated.tsv.gz
-rw-------  1 pedroszekely  staff   500M May 11 12:53 P31.tsv.gz
-rw-------  1 pedroszekely  staff    16G May 11 04:09 wikidata_edges_20200504.tsv.gz
-rw-------  1 pedroszekely  staff   2.2G May 11 03:22 wikidata_nodes_20200504.tsv.gz
-rw-------  1 pedroszekely  staff   2.4G May 11 04:19 wikidata_qualifiers_20200504.tsv.gz


Working with the dump files takes time because they are so large. The edges file alone has 1.1 billion lines and 86 billion characters, and it takes more than 6 minutes just to decompress it and count the lines:

In [19]:
!time gzcat "$WD/wikidata_edges_20200504.tsv.gz" | wc

 1105944516 7412374276 86582473531

real	6m29.270s
user	8m28.469s
sys	1m17.519s
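The same count can be done from Python by streaming the compressed file, which avoids writing the uncompressed data to disk. This is a minimal sketch, not a replacement for the `gzcat | wc` timing above; the function name `count_lines_and_bytes` and the tiny demo file are assumptions made so the cell runs without the 16G dump.

```python
import gzip
import os
import tempfile


def count_lines_and_bytes(path):
    """Stream a gzip file and return (line count, uncompressed byte count)."""
    lines = 0
    size = 0
    with gzip.open(path, "rb") as stream:
        for line in stream:
            lines += 1
            size += len(line)
    return lines, size


# Tiny stand-in file for illustration; point the function at the real dump instead.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_edges.tsv.gz")
    with gzip.open(demo_path, "wb") as out:
        out.write(b"id\tnode1\tlabel\tnode2\nQ8-P31-1\tQ8\tP31\tQ331769\n")
    demo_counts = count_lines_and_bytes(demo_path)

print(demo_counts)
```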


The edge file has the following columns:

* `id` is a unique identifier for an edge, and provides an identifier for each statement in Wikidata
* `node1`, `label` and `node2` hold the item, property and value (subject, predicate and object) of the statement
* `rank` is the Wikidata rank for the statement
* `node2;*` are additional columns that break the `node2` value into its components (for example magnitude, unit or date), so that it is easy to parse

In [20]:
!gzcat "$WD/wikidata_edges_20200504.tsv.gz" | head

id	node1	label	node2	rank	node2;magnitude	node2;unit	node2;date	node2;item	node2;lower	node2;upper	node2;latitude	node2;longitude	node2;precision	node2;calendar	node2;entity-type
Q8-P1245-1	Q8	P1245	"885155"	normal											
Q8-P373-1	Q8	P373	"Happiness"	normal											
Q8-P31-1	Q8	P31	Q331769	normal				Q331769							item
Q8-P31-2	Q8	P31	Q60539479	normal				Q60539479							item
Q8-P31-3	Q8	P31	Q9415	normal				Q9415							item
Q8-P508-1	Q8	P508	"13163"	normal											
Q8-P18-1	Q8	P18	"Sweet Baby Kisses Family Love.jpg"	normal											
Q8-P910-1	Q8	P910	Q8505256	normal				Q8505256							item
Q8-P349-1	Q8	P349	"00566227"	normal											
gzcat: error writing to output: Broken pipe
gzcat: /Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidata-20200504/wikidata_edges_20200504.tsv.gz: uncompress failed
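To illustrate how the `node2;*` columns simplify parsing, here is a minimal sketch that reads edge rows with Python's `csv` module and selects the edges whose value is an item. The two sample rows are copied from the output above; the helper names `make_line` and `read_edges` are hypothetical, and `make_line` only exists to avoid typing literal tab characters.

```python
import csv
import io

HEADER = ["id", "node1", "label", "node2", "rank",
          "node2;magnitude", "node2;unit", "node2;date", "node2;item",
          "node2;lower", "node2;upper", "node2;latitude", "node2;longitude",
          "node2;precision", "node2;calendar", "node2;entity-type"]


def make_line(values):
    """Build one tab-separated line, leaving unspecified columns empty."""
    return "\t".join(values.get(col, "") for col in HEADER)


sample = "\n".join([
    "\t".join(HEADER),
    make_line({"id": "Q8-P373-1", "node1": "Q8", "label": "P373",
               "node2": '"Happiness"', "rank": "normal"}),
    make_line({"id": "Q8-P31-1", "node1": "Q8", "label": "P31",
               "node2": "Q331769", "rank": "normal",
               "node2;item": "Q331769", "node2;entity-type": "item"}),
]) + "\n"


def read_edges(stream):
    """Yield one dict per edge; QUOTE_NONE keeps the quotes around string values."""
    return csv.DictReader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)


edges = list(read_edges(io.StringIO(sample)))

# The entity-type column tells us the value is an item without parsing node2.
item_edges = [e for e in edges if e["node2;entity-type"] == "item"]
for e in item_edges:
    print(e["node1"], e["label"], e["node2;item"])
```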


### BLENDER Data

We created a file that contains all the identifiers present in the BLENDER dataset with two columns:

* node2: the value of the identifier
* label: the Wikidata property used to store the identifier, e.g., P698 is PubMed ID

Later, we look up all these identifiers in Wikidata to find their q-nodes.

In [3]:
!head $COVID/corpus-identifiers.tsv

node2	label
PMC3670673	P932
3670673	P932
22621853	P698
9606	P685
851819	P351
851323	P351
7905	P351
856140	P351
4932	P685
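As a quick way to inspect this file, the identifiers can be grouped by the property that stores them. The following is a minimal sketch using the first rows shown above as inline sample data; the function name `identifiers_by_property` is hypothetical. Note that the sample lists the P932 value both with and without the `PMC` prefix.

```python
import csv
import io
from collections import defaultdict

# First rows of corpus-identifiers.tsv, copied from the output above.
sample = (
    "node2\tlabel\n"
    "PMC3670673\tP932\n"
    "3670673\tP932\n"
    "22621853\tP698\n"
    "9606\tP685\n"
    "851819\tP351\n"
)


def identifiers_by_property(stream):
    """Map each property (label column) to the set of identifier values (node2 column)."""
    groups = defaultdict(set)
    for row in csv.DictReader(stream, delimiter="\t", quoting=csv.QUOTE_NONE):
        groups[row["label"]].add(row["node2"])
    return dict(groups)


groups = identifiers_by_property(io.StringIO(sample))
for prop in sorted(groups):
    print(prop, sorted(groups[prop]))
```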


## Extract Wikidata Subgraph

First, extract all the edges for the identifier properties that we may want to use (this takes 107 minutes):

In [24]:
!time gzcat "$WD/wikidata_edges_20200504.tsv.gz" | kgtk filter -p ";P685,P486,P351,P5055,P698,P932;" | gzip > $COVID/corpus-identifier-edges.tsv.gz


real	107m13.861s
user	146m30.484s
sys	9m8.806s
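The logic of this step can be sketched in plain Python. This is an illustration of the filtering idea only, not how `kgtk filter` is implemented, and it assumes that the pattern above leaves the subject and object unconstrained and keeps edges whose `label` is one of the listed properties. The function name `filter_edges` and the tiny demo files are hypothetical.

```python
import gzip
import os
import tempfile

PROPERTIES = {"P685", "P486", "P351", "P5055", "P698", "P932"}


def filter_edges(in_path, out_path, properties):
    """Copy the header and every edge whose label is in properties; return the edge count."""
    kept = 0
    with gzip.open(in_path, "rt", encoding="utf-8") as src, \
            gzip.open(out_path, "wt", encoding="utf-8") as dst:
        header = src.readline()
        label_idx = header.rstrip("\n").split("\t").index("label")
        dst.write(header)
        for line in src:
            fields = line.rstrip("\n").split("\t")
            if fields[label_idx] in properties:
                dst.write(line)
                kept += 1
    return kept


# Tiny stand-in input so the sketch runs without the full dump.
with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, "edges.tsv.gz")
    dst_path = os.path.join(tmp, "identifier-edges.tsv.gz")
    with gzip.open(src_path, "wt", encoding="utf-8") as out:
        out.write("id\tnode1\tlabel\tnode2\trank\n")
        out.write('Q8-P373-1\tQ8\tP373\t"Happiness"\tnormal\n')
        out.write('Q21093209-P932-1\tQ21093209\tP932\t"2918564"\tnormal\n')
    kept = filter_edges(src_path, dst_path, PROPERTIES)
    with gzip.open(dst_path, "rt", encoding="utf-8") as result:
        kept_lines = result.read().splitlines()

print(kept, kept_lines)
```

Streaming line by line keeps memory use constant, which matters for a file with more than a billion lines.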


## Create Missing Items

## Create Mention Edges

## Incorporate Analytic Outputs

## Export Knowledge Graph

In [None]:
!time gzcat $COVID/pmcid.tsv.gz | kgtk ifexists --filter-on $COVID/covid-pmcids.tsv --left-keys node2 --right-keys id | gzip > covid-pmcid-edges.tsv.gz
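The command above reads as a semi-join: keep the rows of the input whose `node2` value appears in the `id` column of the filter file. The sketch below shows that idea in plain Python under that assumption; it is not the `kgtk ifexists` implementation. The helper names `read_tsv` and `if_exists` are hypothetical, the contents of the filter file are made up for illustration, and the sketch assumes both columns hold identical strings, including the quotes.

```python
import csv
import io


def read_tsv(text):
    """Parse tab-separated text into a list of dicts, keeping quotes as-is."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t",
                               quoting=csv.QUOTE_NONE))


def if_exists(left_rows, right_rows, left_key, right_key):
    """Keep the left rows whose left_key value occurs as a right_key value."""
    wanted = {row[right_key] for row in right_rows}
    return [row for row in left_rows if row[left_key] in wanted]


edges = read_tsv(
    "id\tnode1\tlabel\tnode2\n"
    'Q21093209-P932-1\tQ21093209\tP932\t"2918564"\n'
    'Q21142689-P932-1\tQ21142689\tP932\t"2792043"\n'
)
# Hypothetical filter file with a single id column.
corpus_ids = read_tsv('id\n"2918564"\n')

matches = if_exists(edges, corpus_ids, left_key="node2", right_key="id")
print([row["node1"] for row in matches])
```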

In [3]:
!gzcat covid-pmcid-edges.tsv.gz | wc

   12934   64679  750209


In [4]:
!gzcat covid-pmcid-edges.tsv.gz | head 

id	node1	label	node2	rank	node2;magnitude	node2;unit	node2;lower	node2;upper	node2;latitude	node2;longitude	node2;precision	node2;calendar	node2;entity-type
Q21093209-P932-1	Q21093209	P932	"2918564"	normal									
Q21142689-P932-1	Q21142689	P932	"2792043"	normal									
Q21144693-P932-1	Q21144693	P932	"1564183"	normal									
Q21146685-P932-1	Q21146685	P932	"4207625"	normal									
Q21256658-P932-1	Q21256658	P932	"1373654"	normal									
Q21283966-P932-1	Q21283966	P932	"1185527"	normal									
Q21328696-P932-1	Q21328696	P932	"2842971"	normal									
Q22680677-P932-1	Q22680677	P932	"4517126"	normal									
Q23912860-P932-1	Q23912860	P932	"4941879"	normal									
gzcat: error writing to output: Broken pipe
gzcat: covid-pmcid-edges.tsv.gz: uncompress failed


In [5]:
!wd u Q21093209

id Q21093209
Label RETRACTED: Influenza or not influenza: analysis of a case of high fever that happened 2000 years ago in Biblical time
Description scientific article
instance of (P31): scholarly article (Q13442814) | retracted paper (Q45182324)


In [6]:
!ls $COVID/covid-pmcid-edges.tsv.gz

/Users/pedroszekely/data/covid/covid-pmcid-edges.tsv.gz


In [7]:
!cd $(env COVID)

env: COVID: No such file or directory


In [8]:
pwd

'/Users/pedroszekely/data/covid'

In [9]:
!mlr --help

Usage: mlr [I/O options] {verb} [verb-dependent options ...] {zero or more file names}

Command-line-syntax examples:
  mlr --csv cut -f hostname,uptime mydata.csv
  mlr --tsv --rs lf filter '$status != "down" && $upsec >= 10000' *.tsv
  mlr --nidx put '$sum = $7 < 0.0 ? 3.5 : $7 + 2.1*$8' *.dat
  grep -v '^#' /etc/group | mlr --ifs : --nidx --opprint label group,pass,gid,member then sort -f group
  mlr join -j account_id -f accounts.dat then group-by account_name balances.dat
  mlr --json put '$attr = sub($attr, "([0-9]+)_([0-9]+)_.*", "\1:\2")' data/*.json
  mlr stats1 -a min,mean,max,p10,p50,p90 -f flag,u,v data/*
  mlr stats2 -a linreg-pca -f u,v -g shape data/*
  mlr put -q '@sum[$a][$b] += $x; end {emit @sum, "a", "b"}' data/*
  mlr --from estimates.tbl put '
  for (k,v in $*) {
    if (is_numeric(v) && k =~ "^[t-z].*$") {
      $sum += v; $count += 1
    }
  }
  $mean = $sum / $count # no assignment if count unset'
  mlr --from infile.dat put -f analyze.mlr
  mlr --from infile.d