# Constructing ISI's COVID-19 Knowledge Graph

This notebook shows how we use our KGTK toolking (link) to construct the COVID-19 knowledge graph for the CORD-19 corpus using the text extractions from the BLENDER group at UIUC. The text extractions inlcude a variety of entity extractions with links to bioinformatics databases. Our approach is to:

* extract from Wikidata the subgraph that covers all the publications in the CORD-19 corpus and the entities identified by the BLENDER group.
* define new Wikidata items for publications or entities that are not present in the public Wikidata
* annotate the article items with the relevant entities
* preserve provenance

We implement the approach in the following steps:

* data prepararation: convert the JSON representation of BLENDER output to CSV files that are easy to process
* extract Wikidata subgraph: extract from Wikidata the articles, authors, and entities mentioned in the BLENDER corpus
* create missing items: create nodes for articles and entities that are not present in Wikidata
* create mention edges: create edges to record the entity extractions from BLENDER, including justifications
* incorporate analytic outputs: add edges to record graph metrics such as pagerank
* export knowledge graph: export the graph to KGTL edges, RDF and Neo4J 

## Data Preparation

Set up environment variables with location of the input files

In [37]:
import numpy as np
import pandas as pd
import os

In [1]:
%env COVID=/Users/pedroszekely/data/covid/blender
%env WD=/Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidata-20200504

env: COVID=/Users/pedroszekely/data/covid/blender
env: WD=/Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidata-20200504


### Wikidata Data

The Wikidata files are the large:

* `wikidata_nodes_20200504.tsv.gz` the English labels, alias and descriptions for all items in Wikidata
* `wikidata_edges_20200504.tsv.gz` the edges for all statements in Wikidata
* `wikidata_qualifiers_20200504.tsv.gz` the qualifier edges for all statements in Wikidata

In [2]:
ls -lh "$WD"

total 44833280
-rw-------  1 pedroszekely  staff    21M May 11 14:19 P279.tsv.gz
-rw-------  1 pedroszekely  staff   6.9M May 11 15:59 P279_star.tsv.gz
-rw-------  1 pedroszekely  staff    12M May 11 14:29 P279_truncated.tsv.gz
-rw-------  1 pedroszekely  staff   500M May 11 12:53 P31.tsv.gz
-rw-------  1 pedroszekely  staff    16G May 11 04:09 wikidata_edges_20200504.tsv.gz
-rw-------  1 pedroszekely  staff   2.2G May 11 03:22 wikidata_nodes_20200504.tsv.gz
-rw-------  1 pedroszekely  staff   2.4G May 11 04:19 wikidata_qualifiers_20200504.tsv.gz


Working with the dump files takes time because they are so large (86B chars and 1.1B lines), more than 6 minutes to just unzip and count lines

In [19]:
!time gzcat "$WD/wikidata_edges_20200504.tsv.gz" | wc

 1105944516 7412374276 86582473531

real	6m29.270s
user	8m28.469s
sys	1m17.519s


The columns in the edge files:

* `id` is a unique identifier for an edge, and provides an identifier for each statement in Wikidata
* `node1`, `label` and `node2` are the item/property/value or subject/predicate/object
* `rank` is the Wikidata rank for the statement
* `node2;*` are additional columns that provide detailed information about the `node2` column, making it easy to parse

In [20]:
!gzcat "$WD/wikidata_edges_20200504.tsv.gz" | head

id	node1	label	node2	rank	node2;magnitude	node2;unit	node2;date	node2;item	node2;lower	node2;upper	node2;latitude	node2;longitude	node2;precision	node2;calendar	node2;entity-type
Q8-P1245-1	Q8	P1245	"885155"	normal											
Q8-P373-1	Q8	P373	"Happiness"	normal											
Q8-P31-1	Q8	P31	Q331769	normal				Q331769							item
Q8-P31-2	Q8	P31	Q60539479	normal				Q60539479							item
Q8-P31-3	Q8	P31	Q9415	normal				Q9415							item
Q8-P508-1	Q8	P508	"13163"	normal											
Q8-P18-1	Q8	P18	"Sweet Baby Kisses Family Love.jpg"	normal											
Q8-P910-1	Q8	P910	Q8505256	normal				Q8505256							item
Q8-P349-1	Q8	P349	"00566227"	normal											
gzcat: error writing to output: Broken pipe
gzcat: /Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidata-20200504/wikidata_edges_20200504.tsv.gz: uncompress failed


### BLENDER Data

The BLENDER data came in a JSON document per article. We used Python scripts to create simple TSV files that are easy to process.

The `corpus-identifiers.tsv` file contains all the identifiers present in the BLENDER dataset with two columns:

* node2: the value of the identifier
* label: the name of the property in Wikidata used to represent the specific identifier. For example `P698` is PubMed ID. See https://www.wikidata.org/wiki/Q93157077

In [56]:
ids = pd.read_csv(os.getenv("COVID")+'/corpus-identifiers.tsv', delimiter='\t')
ids

Unnamed: 0,node2,label
0,PMC3670673,P932
1,3670673,P932
2,22621853,P698
3,9606,P685
4,851819,P351
...,...,...
139032,19423234,P698
139033,22609285,P698
139034,12781505,P698
139035,10028170,P698


## Extract Wikidata Subgraph

### Step 1: find the `node1` for all the rows in `corpus-identifiers.tsv`

We do this with the `kgtk ifexists` command to scan the file of all edges in Wikidata to select the ones where `label` and `node2` match the rows in `corpus-identifiers.tsv`. We store the results in `corpus-identifier-edges.tsv`

In [31]:
!time gzcat "$WD/wikidata_edges_20200504.tsv.gz" \
  | kgtk ifexists --input-keys label node2 --filter-on $COVID/corpus-identifiers.tsv --filter-keys label node2 --filter-mode NONE \
  > $COVID/corpus-identifier-edges.tsv


real	50m53.543s
user	52m18.007s
sys	2m9.164s


We found 72,649 q-nodes in Wikidata for the identifiers in our corpus. 

In [58]:
idedges = pd.read_csv(os.getenv("COVID")+'/corpus-identifier-edges.tsv', delimiter='\t')
idedges.loc[:, ['id', 'node1', 'label', 'node2']]

Unnamed: 0,id,node1,label,node2
0,Q140-P685-1,Q140,P685,9689
1,Q556-P486-1,Q556,P486,D006859
2,Q688-P486-1,Q688,P486,D002713
3,Q716-P486-1,Q716,P486,D014025
4,Q1832-P486-1,Q1832,P486,D005682
...,...,...,...,...
72644,Q93147847-P932-1,Q93147847,P932,6730851
72645,Q93157077-P698-1,Q93157077,P698,31492122
72646,Q93157077-P932-1,Q93157077,P932,6731609
72647,Q93157501-P698-1,Q93157501,P698,31492169


Here are the counts of the q-nodes we found in Wikidata for each property.

In [44]:
idedges.label.value_counts()

P698     20426
P351     18235
P932     14169
P486     10096
P685      9721
P5055        2
Name: label, dtype: int64

### Step 2: get all the edges from Wikidata for the q-nodes in `corpus-identifier-edges.tsv`

To do this, we again scan all the edges in Wikidata looking for edges whose `node1` matches the `node1` in `corpus-identifier-edges.tsv`

In [32]:
!time gzcat "$WD/wikidata_edges_20200504.tsv.gz" \
  | kgtk ifexists --input-keys node1 --filter-on $COVID/corpus-identifier-edges.tsv --filter-keys node1 \
  > $COVID/corpus-edges.tsv


real	42m33.188s
user	44m19.812s
sys	1m50.926s


We now have 2.2 million edges for the entities in our corpus.

In [34]:
!wc $COVID/corpus-edges.tsv

 2297783 13662891 176151115 /Users/pedroszekely/data/covid/blender/corpus-edges.tsv


We can use Pandas to explore the data

In [69]:
ce = pd.read_csv(os.getenv("COVID")+'/corpus-edges.tsv', delimiter='\t')
ce.loc[:, ['id', 'node1', 'label', 'node2']]

  interactivity=interactivity, compiler=compiler, result=result)


Unnamed: 0,id,node1,label,node2
0,Q140-P225-1,Q140,P225,Panthera leo
1,Q140-P105-1,Q140,P105,Q7432
2,Q140-P171-1,Q140,P171,Q127960
3,Q140-P31-1,Q140,P31,Q16521
4,Q140-P1403-1,Q140,P1403,Q15294488
...,...,...,...,...
2297777,Q93157501-P478-1,Q93157501,P478,17
2297778,Q93157501-P50-1,Q93157501,P50,Q92401632
2297779,Q93157501-P577-1,Q93157501,P577,^2019-09-06T00:00:00Z/11
2297780,Q93157501-P698-1,Q93157501,P698,31492169


In [70]:
ce.label.value_counts()

P2860                 540720
wikipedia_sitelink    300857
P704                  120845
P2093                 109421
P639                   87783
                       ...  
P4611                      1
P3100                      1
P1268                      1
P3252                      1
P5420                      1
Name: label, Length: 952, dtype: int64

We can use the Wikidata cli to see what the top properties:

In [74]:
!wd u P2860, P704, P2093

[90mid[39m P2860
[42mLabel[49m cites work
[44mDescription[49m citation from one creative work to another
[30m[47minstance of[49m[39m [90m(P31)[39m[90m: [39mWikidata property for items about works [90m(Q18618644)[39m

[90mid[39m P704
[42mLabel[49m Ensembl transcript ID
[44mDescription[49m transcript ID issued by Ensembl database
[30m[47minstance of[49m[39m [90m(P31)[39m[90m: [39mWikidata property for an identifier [90m(Q19847637)[39m | Wikidata property related to medicine [90m(Q19887775)[39m

[90mid[39m P2093
[42mLabel[49m author name string
[44mDescription[49m string to store unspecified author name for publications; use if Wikidata item for author (P50) does not exist or is not known
[30m[47minstance of[49m[39m [90m(P31)[39m[90m: [39mWikidata property for items about works [90m(Q18618644)[39m | Wikidata property with datatype string that is not an external identifier [90m(Q21099935)[39m | Wikidata property to indicate a source [90m

How many authors have items in Wikidata?

In [77]:
!kgtk filter -p ';P50;' $COVID/corpus-edges.tsv | wc -l

   25284


### Step 3:

In [79]:
!kgtk filter -p ';P50, P2860;' $COVID/corpus-edges.tsv > $COVID/citation-and-author-edges.tsv

In [80]:
caae = pd.read_csv(os.getenv("COVID")+'/citation-and-author-edges.tsv', delimiter='\t')
caae.loc[:, ['id', 'node1', 'label', 'node2']]

Unnamed: 0,id,node1,label,node2
0,Q21090495-P2860-1,Q21090495,P2860,Q24611162
1,Q21090495-P2860-2,Q21090495,P2860,Q24655519
2,Q21090495-P2860-3,Q21090495,P2860,Q22065976
3,Q21090495-P2860-4,Q21090495,P2860,Q24650035
4,Q21090495-P2860-5,Q21090495,P2860,Q24684593
...,...,...,...,...
565998,Q93147847-P50-1,Q93147847,P50,Q61104970
565999,Q93147847-P50-2,Q93147847,P50,Q90414144
566000,Q93147847-P50-3,Q93147847,P50,Q87706998
566001,Q93157077-P50-1,Q93157077,P50,Q93157070


Let's look at some of the items we got

In [81]:
!wd u Q24611162, Q61104970

[90mid[39m Q24611162
[42mLabel[49m Viral mutation rates
[44mDescription[49m scientific article
[30m[47minstance of[49m[39m [90m(P31)[39m[90m: [39mscholarly article [90m(Q13442814)[39m

[90mid[39m Q61104970
[42mLabel[49m Veerasak Punyapornwithaya
[44mDescription[49m researcher ORCID ID = 0000-0001-9870-7773
[30m[47minstance of[49m[39m [90m(P31)[39m[90m: [39mhuman [90m(Q5)[39m


Fetch all the edges from Wikidata about the authors and citations in our corpus.
We do this by scanning the Wikidata edges file and extraction all edges where node1 matches node2 in `citation-and-author-edges.tsv`

In [83]:
!time gzcat "$WD/wikidata_edges_20200504.tsv.gz" \
  | kgtk ifexists --input-keys node1 --filter-on $COVID/citation-and-author-edges.tsv --filter-keys node2 \
  > $COVID/corpus-citations-and-authors.tsv


real	43m13.185s
user	44m37.153s
sys	2m6.936s


Keep the edges for the articles and entities that we have in the BLENDER corpus

First extract all the edges that we may want to use (this takes 107 minutes)

In [85]:
!wc $COVID/corpus-citations-and-authors.tsv

 11389393 78495849 854598718 /Users/pedroszekely/data/covid/blender/corpus-citations-and-authors.tsv


In [86]:
!head $COVID/corpus-citations-and-authors.tsv

id	node1	label	node2	rank	node2;magnitude	node2;unit	node2;date	node2;item	node2;lower	node2;upper	node2;latitude	node2;longitude	node2;precision	node2;calendar	node2;entity-type
Q493567-P21-1	Q493567	P21	Q6581097	normal				Q6581097							item
Q493567-P106-1	Q493567	P106	Q3779582	normal				Q3779582							item
Q493567-P106-2	Q493567	P106	Q39631	normal				Q39631							item
Q493567-P106-3	Q493567	P106	Q82955	normal				Q82955							item
Q493567-P106-4	Q493567	P106	Q1622272	normal				Q1622272							item
Q493567-P106-5	Q493567	P106	Q15634281	normal				Q15634281							item
Q493567-P19-1	Q493567	P19	Q984894	normal				Q984894							item
Q493567-P244-1	Q493567	P244	"n88290830"	normal											
Q493567-P214-1	Q493567	P214	"69131111"	normal											


In [7]:
!gzcat $COVID/corpus-identifier-edges.tsv.gz | kgtk filter -p ";P698,P932;" | gzip > $COVID/corpus-article-identifier-edges.tsv.gz

In [9]:
!gzcat $COVID/corpus-identifier-edges.tsv.gz | kgtk filter -p ";P685,P486,P351,P5055;" | gzip > $COVID/corpus-entity-identifier-edges.tsv.gz

## Create Missing Items

## Create Mention Edges

## Incorporate Analytic Outputs

## Export Knowledge Graph

In [None]:
!time gzcat $COVID/pmcid.tsv.gz | kgtk ifexists --filter-on $COVID/covid-pmcids.tsv --left-keys node2 --right-keys id | gzip > covid-pmcid-edges.tsv.gz

In [3]:
!gzcat covid-pmcid-edges.tsv.gz | wc

   12934   64679  750209


In [4]:
!gzcat covid-pmcid-edges.tsv.gz | head 

id	node1	label	node2	rank	node2;magnitude	node2;unit	node2;lower	node2;upper	node2;latitude	node2;longitude	node2;precision	node2;calendar	node2;entity-type
Q21093209-P932-1	Q21093209	P932	"2918564"	normal									
Q21142689-P932-1	Q21142689	P932	"2792043"	normal									
Q21144693-P932-1	Q21144693	P932	"1564183"	normal									
Q21146685-P932-1	Q21146685	P932	"4207625"	normal									
Q21256658-P932-1	Q21256658	P932	"1373654"	normal									
Q21283966-P932-1	Q21283966	P932	"1185527"	normal									
Q21328696-P932-1	Q21328696	P932	"2842971"	normal									
Q22680677-P932-1	Q22680677	P932	"4517126"	normal									
Q23912860-P932-1	Q23912860	P932	"4941879"	normal									
gzcat: error writing to output: Broken pipe
gzcat: covid-pmcid-edges.tsv.gz: uncompress failed


In [5]:
!wd u Q21093209

[90mid[39m Q21093209
[42mLabel[49m RETRACTED: Influenza or not influenza: analysis of a case of high fever that happened 2000 years ago in Biblical time
[44mDescription[49m scientific article
[30m[47minstance of[49m[39m [90m(P31)[39m[90m: [39mscholarly article [90m(Q13442814)[39m | retracted paper [90m(Q45182324)[39m


In [6]:
!ls $COVID/covid-pmcid-edges.tsv.gz

/Users/pedroszekely/data/covid/covid-pmcid-edges.tsv.gz


In [7]:
!cd $(env COVID)

env: COVID: No such file or directory


In [8]:
pwd

'/Users/pedroszekely/data/covid'

In [59]:
man mlr

MILLER(1)							     MILLER(1)



NAME
       miller - like awk, sed, cut, join, and sort for name-indexed data such
       as CSV and tabular JSON.

SYNOPSIS
       Usage: mlr [I/O options] {verb} [verb-dependent options ...] {zero or
       more file names}


DESCRIPTION
       Miller operates on key-value-pair data while the familiar Unix tools
       operate on integer-indexed fields: if the natural data structure for
       the latter is the array, then Miller's natural data structure is the
       insertion-ordered hash map.  This encompasses a variety of data
       formats, including but not limited to the familiar CSV, TSV, and JSON.
       (Miller can handle positionally-indexed data as a special case.) This
       manpage documents Miller v5.7.0.

EXAMPLES
   COMMAND-LINE SYNTAX
       mlr --csv cut -f hostname,uptime mydata.csv
       mlr --tsv --rs lf filter '$status != "down" && $upsec >= 10000' *.tsv
       mlr --nidx put '$sum = $7 < 0.0 ? 3.5 : $7 + 2.1*$8' *.dat
     