# Constructing ISI's COVID-19 Knowledge Graph

This notebook shows how we use our KGTK toolkit (link) to construct the COVID-19 knowledge graph for the CORD-19 corpus using the text extractions from the BLENDER group at UIUC. The text extractions include a variety of entity extractions with links to bioinformatics databases. Our approach is to:

* extract from Wikidata the subgraph that covers all the publications in the CORD-19 corpus and the entities identified by the BLENDER group.
* define new Wikidata items for publications or entities that are not present in the public Wikidata
* annotate the article items with the relevant entities
* preserve provenance

We implement the approach in the following steps:

* data preparation: convert the JSON representation of BLENDER output to CSV files that are easy to process
* extract Wikidata subgraph: extract from Wikidata the articles, authors, and entities mentioned in the BLENDER corpus
* create missing items: create nodes for articles and entities that are not present in Wikidata
* create mention edges: create edges to record the entity extractions from BLENDER, including justifications
* incorporate analytic outputs: add edges to record graph metrics such as pagerank
* export knowledge graph: export the graph to KGTK edges, RDF and Neo4J

## Data Preparation

Set up environment variables with location of the input files

In [1]:
import numpy as np
import pandas as pd
import os
import json

In [2]:
%env COVID=/Users/amandeep/Github/CKG-Covid/datasets/sandbox
%env WD=/Users/amandeep/Documents/wikidata-20200504

env: COVID=/Users/amandeep/Github/CKG-Covid/datasets/sandbox
env: WD=/Users/amandeep/Documents/wikidata-20200504


### Wikidata Data

The Wikidata files are large:

* `wikidata_nodes_20200504.tsv.gz` the English labels, aliases and descriptions for all items in Wikidata
* `wikidata_edges_20200504.tsv.gz` the edges for all statements in Wikidata
* `wikidata_qualifiers_20200504.tsv.gz` the qualifier edges for all statements in Wikidata

In [None]:
ls -lh "$WD"

Working with the dump files takes time because they are so large (86B characters and 1.1B lines): just unzipping a file and counting its lines takes more than 6 minutes.

In [None]:
!time gzcat "$WD/wikidata_edges_20200504.tsv.gz" | wc

The columns in the edge files:

* `id` is a unique identifier for an edge, and provides an identifier for each statement in Wikidata
* `node1`, `label` and `node2` are the item/property/value or subject/predicate/object
* `rank` is the Wikidata rank for the statement
* `node2;*` are additional columns that provide detailed information about the `node2` column, making it easy to parse
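
As a minimal sketch of this layout, the snippet below loads a tiny in-memory sample with pandas. The two rows are adapted from outputs shown later in this notebook, only one of the `node2;*` columns is included, and `sample` is a hypothetical stand-in for the real dump.

```python
import io
import pandas as pd

# Hypothetical miniature edge file: a header plus two tab-separated rows.
sample = (
    "id\tnode1\tlabel\tnode2\trank\tnode2;entity-type\n"
    "Q140-P225-1\tQ140\tP225\tPanthera leo\tnormal\t\n"
    "Q140-P31-1\tQ140\tP31\tQ16521\tnormal\titem\n"
)

# dtype=object keeps identifiers as strings instead of guessing numeric types.
edges = pd.read_csv(io.StringIO(sample), sep="\t", dtype=object)

# Every statement is a (node1, label, node2) triple keyed by its id.
triples = list(zip(edges["node1"], edges["label"], edges["node2"]))
```

Reading with `dtype=object` mirrors what later cells in this notebook do, so that identifiers with leading zeros or mixed formats are not coerced.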

In [None]:
!gzcat "$WD/wikidata_edges_20200504.tsv.gz" | head

### BLENDER Data

The BLENDER data came as one JSON document per article. We used Python scripts to create simple TSV files that are easy to process.

The `corpus-identifiers.tsv` file contains all the identifiers present in the BLENDER dataset with two columns:

* node2: the value of the identifier
* label: the name of the property in Wikidata used to represent the specific identifier. For example `P698` is PubMed ID. See https://www.wikidata.org/wiki/Q93157077

In [3]:
ids = pd.read_csv(os.getenv("COVID")+'/corpus-identifiers.tsv', delimiter='\t')
ids

Unnamed: 0,node2,label
0,3670673,P932
1,22621853,P698
2,9606,P685
3,851819,P351
4,851323,P351
...,...,...
99799,19423234,P698
99800,22609285,P698
99801,12781505,P698
99802,10028170,P698


## Extract Wikidata Subgraph

### Step 1: find the `node1` for all the rows in `corpus-identifiers.tsv`

We do this with the `kgtk ifexists` command to scan the file of all edges in Wikidata to select the ones where `label` and `node2` match the rows in `corpus-identifiers.tsv`. We store the results in `corpus-identifier-edges.tsv`

In [4]:
!time gzcat "$WD/wikidata_edges_20200504.tsv.gz" \
  | kgtk ifexists --input-keys label node2 --filter-on $COVID/corpus-identifiers.tsv --filter-keys label node2 --filter-mode NONE \
  > $COVID/corpus-identifier-edges.tsv


real	43m17.833s
user	46m24.846s
sys	0m53.892s


We found 72,649 identifier edges in Wikidata for the identifiers in our corpus; each edge links a q-node (`node1`) to one of our identifiers.
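
Conceptually, Step 1 is a semi-join: an edge is kept only if its (`label`, `node2`) pair occurs in the filter file. The pandas sketch below illustrates that idea on toy data, assuming both tables fit in memory (the real dump does not, which is why the notebook streams it through `kgtk ifexists`). The function name `semi_join` and the third sample row are illustrative and not part of KGTK.

```python
import pandas as pd

def semi_join(edges, filt, keys):
    """Return the rows of `edges` whose `keys` values also occur in `filt`."""
    wanted = set(map(tuple, filt[keys].astype(str).values))
    mask = [tuple(row) in wanted for row in edges[keys].astype(str).values]
    return edges.loc[mask]

edges = pd.DataFrame({
    "node1": ["Q140", "Q556", "Q140"],
    "label": ["P685", "P486", "P31"],
    "node2": ["9689", "D006859", "Q16521"],
})
filt = pd.DataFrame({"node2": ["9689", "D006859"], "label": ["P685", "P486"]})

# Only the two identifier edges survive; the P31 edge has no match.
matched = semi_join(edges, filt, ["label", "node2"])
```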

In [6]:
idedges = pd.read_csv(os.getenv("COVID")+'/corpus-identifier-edges.tsv', delimiter='\t')
#idedges.loc[:, [ 'node2']]

In [7]:
idedges

Unnamed: 0,id,node1,label,node2,rank,node2;magnitude,node2;unit,node2;date,node2;item,node2;lower,node2;upper,node2;latitude,node2;longitude,node2;precision,node2;calendar,node2;entity-type
0,Q140-P685-1,Q140,P685,9689,normal,,,,,,,,,,,
1,Q556-P486-1,Q556,P486,D006859,normal,,,,,,,,,,,
2,Q688-P486-1,Q688,P486,D002713,normal,,,,,,,,,,,
3,Q716-P486-1,Q716,P486,D014025,normal,,,,,,,,,,,
4,Q1832-P486-1,Q1832,P486,D005682,normal,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
72645,Q93147847-P932-1,Q93147847,P932,6730851,normal,,,,,,,,,,,
72646,Q93157077-P698-1,Q93157077,P698,31492122,normal,,,,,,,,,,,
72647,Q93157077-P932-1,Q93157077,P932,6731609,normal,,,,,,,,,,,
72648,Q93157501-P698-1,Q93157501,P698,31492169,normal,,,,,,,,,,,


Here are the counts of the q-nodes we found in Wikidata for each property.

In [8]:
idedges['label'].value_counts()

P698     20426
P351     18235
P932     14169
P486     10096
P685      9721
P5055        3
Name: label, dtype: int64

### Step 2: get all the edges from Wikidata for the q-nodes in `corpus-identifier-edges.tsv`

To do this, we again scan all the edges in Wikidata looking for edges whose `node1` matches the `node1` in `corpus-identifier-edges.tsv`

In [None]:
!time gzcat "$WD/wikidata_edges_20200504.tsv.gz" \
  | kgtk ifexists --input-keys node1 --filter-on $COVID/corpus-identifier-edges.tsv --filter-keys node1 \
  > $COVID/corpus-edges.tsv

We now have about 2.3 million edges for the entities in our corpus.

In [10]:
!wc $COVID/corpus-edges.tsv

 2297783 13662891 176151115 /Users/amandeep/Github/CKG-Covid/datasets/sandbox/corpus-edges.tsv
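
Because the dump is far too large to load at once, a filter like the one in Step 2 has to stream. The following is a rough sketch of the same selection in plain Python, assuming a gzip-compressed TSV with a header row. `stream_filter` is a hypothetical helper and says nothing about how `kgtk ifexists` is implemented; the sample rows are taken from outputs in this notebook.

```python
import csv
import gzip
import io

def stream_filter(gz_bytes, wanted_node1):
    """Yield rows of a gzipped edge TSV whose node1 is in `wanted_node1`."""
    with gzip.open(io.BytesIO(gz_bytes), mode="rt", encoding="utf-8", newline="") as handle:
        # QUOTE_NONE: treat double quotes as ordinary characters.
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            if row["node1"] in wanted_node1:
                yield row

# Tiny in-memory stand-in for wikidata_edges_20200504.tsv.gz.
raw = (
    "id\tnode1\tlabel\tnode2\n"
    "Q140-P31-1\tQ140\tP31\tQ16521\n"
    "Q93157501-P31-1\tQ93157501\tP31\tQ13442814\n"
)
gz_bytes = gzip.compress(raw.encode("utf-8"))

kept = list(stream_filter(gz_bytes, {"Q140"}))
```

Only the set of wanted `node1` values is held in memory; the edges themselves are read one row at a time.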


We can use Pandas to explore the data

In [11]:
ce = pd.read_csv(os.getenv("COVID")+'/corpus-edges.tsv', delimiter='\t', index_col=['id'], dtype=object)

In [12]:
ce.loc[:, ['node1', 'label', 'node2', 'node2;entity-type']].head(5)

Unnamed: 0_level_0,node1,label,node2,node2;entity-type
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Q140-P225-1,Q140,P225,Panthera leo,
Q140-P105-1,Q140,P105,Q7432,item
Q140-P171-1,Q140,P171,Q127960,item
Q140-P31-1,Q140,P31,Q16521,item
Q140-P1403-1,Q140,P1403,Q15294488,item


In [13]:
ce[ce['node2;entity-type']=='item'].loc[:, ['node1', 'label', 'node2', 'node2;entity-type']]

Unnamed: 0_level_0,node1,label,node2,node2;entity-type
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Q140-P105-1,Q140,P105,Q7432,item
Q140-P171-1,Q140,P171,Q127960,item
Q140-P31-1,Q140,P31,Q16521,item
Q140-P1403-1,Q140,P1403,Q15294488,item
Q140-P141-1,Q140,P141,Q278113,item
...,...,...,...,...
Q93157077-P50-1,Q93157077,P50,Q93157070,item
Q93157077-P921-1,Q93157077,P921,Q12184,item
Q93157501-P1433-1,Q93157501,P1433,Q15752156,item
Q93157501-P31-1,Q93157501,P31,Q13442814,item


In [14]:
ce[ce['node2;entity-type']=='item'].label.value_counts()

P2860    540720
P31       62617
P684      52486
P279      28076
P50       25283
          ...  
P4954         1
P790          1
P6104         1
P1425         1
P1918         1
Name: label, Length: 221, dtype: int64

In [16]:
ce.label.value_counts()[0:20]

P2860                 540720
wikipedia_sitelink    300857
P704                  120845
P2093                 109421
P639                   87783
P31                    62617
P684                   52486
P1843                  38031
P279                   28076
P2888                  26496
P645                   25676
P644                   25670
P50                    25283
P577                   20699
P1476                  20479
P698                   20438
P1433                  20413
P356                   20197
P478                   20164
P688                   19353
Name: label, dtype: int64

We can use the Wikidata CLI to see what the top properties are:

In [17]:
!wd u P2860, P704, P2093

id P2860
Label cites work
Description citation from one creative work to another
instance of (P31): Wikidata property for items about works (Q18618644)

id P704
Label Ensembl transcript ID
Description transcript ID issued by Ensembl database
instance of (P31): Wikidata property for an identifier (Q19847637) | Wikidata property related to medicine (Q19887775)

id P2093
Label author name string
Description string to store unspecified author name for publications; use if Wikidata item for author (P50) does not exist or is not known
instance of (P31): Wikidata property for items about works (Q18618644) | Wikidata property with datatype string that is not an external identifier (Q21099935) | Wikidata property to indicate a source

How many authors have items in Wikidata? We first look at the most frequent `node2` values, and then count the author (`P50`) edges.

In [18]:
ce.node2.value_counts()[0:20]

Q13442814    20473
Q7187        18221
Q20747295    15858
Q15978631    10696
Q16521        9503
Q7432         9053
Q22809680     8136
Q22809711     8090
Q1860         7786
Q83310        4767
1             3385
Q11173        3128
Q12136        3011
2             2207
3             2058
4             2008
6             1898
Q211005       1883
5             1638
7             1603
Name: node2, dtype: int64

In [19]:
!wd u Q13442814, Q7187, Q20747295, Q15978631

id Q13442814
Label scholarly article
Description article in an academic publication, usually peer reviewed
subclass of (P279): scholarly publication (Q591041) | article (Q191067) | scholarly work (Q55915575)

id Q7187
Label gene
Description basic physical and functional unit of heredity
subclass of (P279): Nucleic acid sequence (Q863908) | biological region (Q50365914) | biological sequence (Q3511065)

id Q20747295
Label protein-coding gene
Description Type of a gene
subclass of (P279): gene (Q7187)

id Q15978631
Label Homo sapiens
Description species of mammal
instance of (P31): taxon (Q16521)


In [20]:
!kgtk filter -p ';P50;' $COVID/corpus-edges.tsv | wc -l

   25284


### Step 3: get the edges for the cited articles and the authors

In [None]:
!kgtk filter -p ';P50, P2860;' $COVID/corpus-edges.tsv > $COVID/citation-and-author-edges.tsv

In [21]:
caae = pd.read_csv(os.getenv("COVID")+'/citation-and-author-edges.tsv', delimiter='\t')
caae.loc[:, ['id', 'node1', 'label', 'node2']]

Unnamed: 0,id,node1,label,node2
0,Q21090495-P2860-1,Q21090495,P2860,Q24611162
1,Q21090495-P2860-2,Q21090495,P2860,Q24655519
2,Q21090495-P2860-3,Q21090495,P2860,Q22065976
3,Q21090495-P2860-4,Q21090495,P2860,Q24650035
4,Q21090495-P2860-5,Q21090495,P2860,Q24684593
...,...,...,...,...
565998,Q93147847-P50-1,Q93147847,P50,Q61104970
565999,Q93147847-P50-2,Q93147847,P50,Q90414144
566000,Q93147847-P50-3,Q93147847,P50,Q87706998
566001,Q93157077-P50-1,Q93157077,P50,Q93157070


Let's look at some of the items we got

In [22]:
!wd u Q24611162, Q61104970

id Q24611162
Label Viral mutation rates
Description scientific article
instance of (P31): scholarly article (Q13442814)

id Q61104970
Label Veerasak Punyapornwithaya
Description researcher ORCID ID = 0000-0001-9870-7773
instance of (P31): human (Q5)


Fetch all the edges from Wikidata about the authors and citations in our corpus.
We do this by scanning the Wikidata edges file and extracting all edges where `node1` matches `node2` in `citation-and-author-edges.tsv`.

In [None]:
!time gzcat "$WD/wikidata_edges_20200504.tsv.gz" \
  | kgtk ifexists --input-keys node1 --filter-on $COVID/citation-and-author-edges.tsv --filter-keys node2 \
  > $COVID/corpus-citations-and-authors.tsv

The scan above extracts all the edges that we may want to use (it takes 107 minutes). From these we then keep the edges for the articles and entities that we have in the BLENDER corpus.

In [23]:
!wc $COVID/corpus-citations-and-authors.tsv

 11389393 78495849 854598718 /Users/amandeep/Github/CKG-Covid/datasets/sandbox/corpus-citations-and-authors.tsv


In [24]:
!head $COVID/corpus-citations-and-authors.tsv

id	node1	label	node2	rank	node2;magnitude	node2;unit	node2;date	node2;item	node2;lower	node2;upper	node2;latitude	node2;longitude	node2;precision	node2;calendar	node2;entity-type
Q493567-P21-1	Q493567	P21	Q6581097	normal				Q6581097							item
Q493567-P106-1	Q493567	P106	Q3779582	normal				Q3779582							item
Q493567-P106-2	Q493567	P106	Q39631	normal				Q39631							item
Q493567-P106-3	Q493567	P106	Q82955	normal				Q82955							item
Q493567-P106-4	Q493567	P106	Q1622272	normal				Q1622272							item
Q493567-P106-5	Q493567	P106	Q15634281	normal				Q15634281							item
Q493567-P19-1	Q493567	P19	Q984894	normal				Q984894							item
Q493567-P244-1	Q493567	P244	"n88290830"	normal											
Q493567-P214-1	Q493567	P214	"69131111"	normal											


In [27]:
ccau = pd.read_csv(os.getenv("COVID")+'/corpus-citations-and-authors.tsv', delimiter='\t', dtype=object)
ccau.loc[:, ['id', 'node1', 'label', 'node2']]

Unnamed: 0,id,node1,label,node2
0,Q493567-P21-1,Q493567,P21,Q6581097
1,Q493567-P106-1,Q493567,P106,Q3779582
2,Q493567-P106-2,Q493567,P106,Q39631
3,Q493567-P106-3,Q493567,P106,Q82955
4,Q493567-P106-4,Q493567,P106,Q1622272
...,...,...,...,...
11389387,Q92959340-P31-1,Q92959340,P31,Q5
11389388,Q93068965-P496-1,Q93068965,P496,0000-0002-0515-3933
11389389,Q93068965-P31-1,Q93068965,P31,Q5
11389390,Q93078579-P496-1,Q93078579,P496,0000-0003-4287-7831


In [28]:
ccau.node2.value_counts()[0:20]

Q13442814    288236
Q1860        156362
1             36519
2             33157
3             30061
4             27443
5             23557
6             23406
7             16362
8             16141
9             14921
10            14679
Q1251128      14032
11            13758
Q5            13615
12            13434
Q1650915       8181
Q1146531       7967
Q564954        6669
13             6077
Name: node2, dtype: int64

In [29]:
!wd u Q13442814, Q1860, Q1251128

id Q13442814
Label scholarly article
Description article in an academic publication, usually peer reviewed
subclass of (P279): scholarly publication (Q591041) | article (Q191067) | scholarly work (Q55915575)

id Q1860
Label English
Description West Germanic language originating in England with linguistic roots in French, German and Vulgar Latin
instance of (P31): natural language (Q33742) | modern language (Q1288568) | language (Q34770)
subclass of (P279): Anglic languages (Q1346342)

id Q1251128
Label Journal of Virology
Description scientific journal
instance of (P31): scientific journal (Q5633421) | delayed open access journal (Q5253501)


In [None]:
!cat $COVID/corpus-identifier-edges.tsv | kgtk filter -p ";P698,P932;" | gzip > $COVID/corpus-article-identifier-edges.tsv.gz

In [None]:
!cat $COVID/corpus-identifier-edges.tsv | kgtk filter -p ";P685,P486,P351,P5055;" | gzip > $COVID/corpus-entity-identifier-edges.tsv.gz

## Create Missing Items

### Step 4: find the identifiers in `corpus-identifiers.tsv` for which there is no `node1` in Wikidata

We use the `kgtk ifnotexists` command to scan `corpus-identifiers.tsv` and select the identifiers that have no matching edge, and therefore no `node1`, in `corpus-identifier-edges.tsv`. We store the results in `corpus-identifiers-not-in-wikidata.tsv`.

In [40]:
!cat $COVID/corpus-identifiers.tsv | \
kgtk ifnotexists --input-keys label node2 --filter-on  $COVID/corpus-identifier-edges.tsv --filter-keys label node2 \
--mode=NONE > $COVID/corpus-identifiers-not-in-wikidata.tsv
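
Step 4 is the complement of the earlier lookup, an anti-join on the (`label`, `node2`) pair. A small pandas sketch of the idea, using `merge` with `indicator=True`; the frames are toy data and the identifier `X-MISSING` is made up for illustration.

```python
import pandas as pd

identifiers = pd.DataFrame({
    "node2": ["9689", "D006859", "X-MISSING"],
    "label": ["P685", "P486", "P486"],
})
found = pd.DataFrame({
    "node1": ["Q140", "Q556"],
    "label": ["P685", "P486"],
    "node2": ["9689", "D006859"],
})

# Left-join on the key pair, then keep the rows that found no partner.
merged = identifiers.merge(
    found[["label", "node2"]].drop_duplicates(),
    on=["label", "node2"],
    how="left",
    indicator=True,
)
missing = merged.loc[merged["_merge"] == "left_only", ["node2", "label"]]
```

Dropping duplicates on the right side first keeps the left join from multiplying rows when several Wikidata items share an identifier.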

Format the missing identifiers file by quoting the identifiers as strings:

In [None]:
df = pd.read_csv('{}/corpus-identifiers-not-in-wikidata.tsv'.format(os.getenv('COVID')), sep='\t', dtype=object)
df['node2'] = df['node2'].map(lambda x: json.dumps(x))
df.to_csv('{}/corpus-identifiers-not-in-wikidata_formatted.tsv'.format(os.getenv('COVID')), sep='\t', index=False)
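
To make the effect of the quoting concrete, here is a small illustration of what `json.dumps` does to a plain string identifier; the two sample values are taken from identifiers that appear in outputs of this notebook.

```python
import json

raw_ids = ["C537806", "27521"]

# json.dumps wraps each string in double quotes (and escapes embedded ones).
quoted = [json.dumps(value) for value in raw_ids]

# json.loads reverses the quoting.
restored = [json.loads(value) for value in quoted]
```

Because the file is read with `dtype=object`, numeric-looking identifiers such as `27521` are strings at this point and are quoted as well.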

The number of identifiers missing from Wikidata (excluding the header row) is 28,112:

In [24]:
!wc -l $COVID/corpus-identifiers-not-in-wikidata_formatted.tsv

   28113 /Users/amandeep/Github/CKG-Covid/datasets/sandbox/corpus-identifiers-not-in-wikidata_formatted.tsv


### Process and clean CTD datasets for `Chemicals`, `Diseases` and `Genes`

#### Step 1: Process CTD Diseases file

In [100]:
rename_columns = {
    '# DiseaseName': 'label',
    'DiseaseID': 'P486',
    'Definition': 'descriptions',
    'Synonyms': 'aliases'
    
}
drop_columns = ['AltDiseaseIDs', 'ParentIDs', 'TreeNumbers', 'ParentTreeNumbers','SlimMappings']

df_disease = pd.read_csv('{}/CTD_diseases.tsv.gz'.format(os.getenv("COVID")), sep='\t', skiprows=27)
df_disease = df_disease.fillna('').rename(columns=rename_columns).drop(columns=drop_columns)
df_disease['P486'] = df_disease['P486'].map(lambda x: x[5:] if x.startswith('MESH') or x.startswith("OMIM") else x).map(lambda x: json.dumps(x))
df_disease.to_csv('{}/CTD_diseases_clean.tsv'.format(os.getenv("COVID")), sep='\t', index=False)
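
The `x[5:]` slice in the cell above removes a five-character source tag from the identifier. A slightly more explicit sketch of the same cleanup follows; `strip_source_prefix` is a hypothetical helper, it additionally checks for the colon, and the OMIM number is a made-up example value.

```python
def strip_source_prefix(identifier):
    """Drop a leading 'MESH:' or 'OMIM:' tag, leaving the bare identifier."""
    for prefix in ("MESH:", "OMIM:"):
        if identifier.startswith(prefix):
            return identifier[len(prefix):]
    return identifier

cleaned = [strip_source_prefix(x) for x in ["MESH:C537806", "OMIM:123456", "D006859"]]
```

Deriving the slice length from the prefix avoids the hard-coded `5` if another tag of a different length is ever added.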

Find the rows of `CTD_diseases_clean.tsv` whose disease ids are among the identifiers missing from Wikidata, as listed in `corpus-identifiers-not-in-wikidata_formatted.tsv`.

We do this with the `kgtk ifexists` command.

In [101]:
!cat $COVID/CTD_diseases_clean.tsv | \
kgtk ifexists --input-keys P486 --filter-on  $COVID/corpus-identifiers-not-in-wikidata_formatted.tsv --filter-keys node2 \
--mode=NONE > $COVID/CTD-diseases-corpus-identifiers.tsv

Add the column `node1`, creating pseudo Wikidata Qnodes with the formula:
Qnode(Disease_ID) = `Q00005550-disease-<Disease_ID>`.

Also add the column `P31`(instance of) with a constant value of `Q12136` (disease)

In [102]:
df_dc = pd.read_csv('{}/CTD-diseases-corpus-identifiers.tsv'.format(os.getenv("COVID")), sep='\t')
df_dc['node1'] = df_dc['P486'].map(lambda x: 'Q00005550-disease-{}'.format(x.replace('"', '')))
df_dc['P31'] = 'Q12136'
df_dc.to_csv('{}/CTD-diseases-corpus-identifiers-node1.tsv'.format(os.getenv("COVID")), sep='\t', index=False)
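
The naming formula can be written as a small function. This is only a restatement of the lambda in the cell above; the name `pseudo_qnode` is introduced here for illustration.

```python
def pseudo_qnode(kind, quoted_id):
    """Build a placeholder Q-node such as Q00005550-disease-C537806.

    `quoted_id` is the JSON-quoted identifier; the quotes are removed
    before the identifier is appended.
    """
    return "Q00005550-{}-{}".format(kind, quoted_id.replace('"', ""))

disease_node = pseudo_qnode("disease", '"C537806"')
```

The notebook uses the same pattern with the kinds `chemical` and `gene` in the following steps.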


Convert the CTD disease file to a `KGTK Edge` file.
We do this with the `mlr reshape` command.

In [103]:
!cat $COVID/CTD-diseases-corpus-identifiers-node1.tsv | \
mlr --itsv --otsv reshape -i label,P486,descriptions,aliases,P31 -o label,node2 \
> $COVID/CTD-diseases-corpus-identifiers-compact.tsv
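
The reshape turns one wide row per item into one (`node1`, `label`, `node2`) row per column. As a rough pandas analogue, assuming a single toy row with a subset of the columns, `DataFrame.melt` does the same wide-to-long conversion:

```python
import pandas as pd

wide = pd.DataFrame({
    "node1": ["Q00005550-disease-C537806"],
    "label": ["18-Hydroxylase deficiency"],
    "P486": ['"C537806"'],
    "P31": ["Q12136"],
})

# Melt under a temporary name, because the wide table already has a
# column called `label`; rename it once the wide columns are gone.
edges_long = (
    wide.melt(id_vars=["node1"], var_name="property", value_name="node2")
    .rename(columns={"property": "label"})
)
```

After the melt, the original `label` column has become an ordinary edge whose `label` is the string `label` and whose `node2` is the disease name.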

The `aliases` rows in the above file contain multiple values separated by `|`.
We can expand the values into multiple rows with the `kgtk expand` command.

In [104]:
!cat $COVID/CTD-diseases-corpus-identifiers-compact.tsv | kgtk expand --columns node1 label --mode NONE \
> $COVID/CTD-diseases-corpus-edges.tsv
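
For intuition, the expansion of `|`-separated values into separate rows can be sketched in pandas with `str.split` followed by `explode`. The frame below is toy data built from values shown in this notebook, and this is an analogue of the step, not a description of how `kgtk expand` works internally.

```python
import pandas as pd

compact = pd.DataFrame({
    "node1": ["Q00005550-disease-C537806"] * 2,
    "label": ["P486", "aliases"],
    "node2": ['"C537806"', "18-Oxidase Deficiency|Aldosterone deficiency 1"],
})

# Split each value on the literal "|" and emit one row per piece.
expanded = (
    compact.assign(node2=compact["node2"].str.split("|"))
    .explode("node2")
    .reset_index(drop=True)
)
```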

In [20]:
!head $COVID/CTD-diseases-corpus-edges.tsv | column -t

node1                      label         node2
Q00005550-disease-C537806  label         18-Hydroxylase  deficiency
Q00005550-disease-C537806  P486          """C537806"""
Q00005550-disease-C537806  descriptions
Q00005550-disease-C537806  aliases       18-alpha        hydroxylase  deficiency
Q00005550-disease-C537806  aliases       18-HYDROXYLASE  DEFICIENCY
Q00005550-disease-C537806  aliases       18-Oxidase      Deficiency
Q00005550-disease-C537806  aliases       Aldosterone     deficiency   1
Q00005550-disease-C537806  aliases       Aldosterone     deficiency   due         to  defect  in  18-hydroxylase
Q00005550-disease-C537806  aliases       ALDOSTERONE     DEFICIENCY   DUE         TO  DEFECT  IN  STEROID         18-HYDROXYLASE


#### Step 2: Process CTD Chemicals file

In [106]:
rename_columns = {
    '# ChemicalName': 'label',
    'ChemicalID': 'P486',
    'Definition': 'descriptions',
    'Synonyms': 'aliases',
    'CasRN': 'P231'  # CAS Registry Number
    
}
drop_columns = ['ParentIDs', 'TreeNumbers', 'ParentTreeNumbers','DrugBankIDs']

df_chemical = pd.read_csv('{}/CTD_chemicals.tsv.gz'.format(os.getenv("COVID")), sep='\t', skiprows=27)
df_chemical = df_chemical.fillna('').rename(columns=rename_columns).drop(columns=drop_columns)
df_chemical['P486'] = df_chemical['P486'].map(lambda x: x[5:] if x.startswith('MESH') or x.startswith("OMIM") else x).map(lambda x: json.dumps(x))
df_chemical['P231'] = df_chemical['P231'].map(lambda x: x[5:] if x.startswith('MESH') or x.startswith("OMIM") else x).map(lambda x: json.dumps(x) if x else x)
df_chemical.to_csv('{}/CTD_chemicals_clean.tsv'.format(os.getenv("COVID")), sep='\t', index=False)

Find the rows of `CTD_chemicals_clean.tsv` whose chemical ids are among the identifiers missing from Wikidata, as listed in `corpus-identifiers-not-in-wikidata_formatted.tsv`.

We do this with the `kgtk ifexists` command.

In [107]:
!cat $COVID/CTD_chemicals_clean.tsv | \
kgtk ifexists --input-keys P486 --filter-on  $COVID/corpus-identifiers-not-in-wikidata_formatted.tsv --filter-keys node2 \
--mode=NONE > $COVID/CTD-chemicals-corpus-identifiers.tsv

Add the column `node1`, creating pseudo Wikidata Qnodes with the formula:
Qnode(Chemical_ID) = `Q00005550-chemical-<Chemical_ID>`.

Also add the column `P31`(instance of) with a constant value of `Q11344` (chemical element)

In [108]:
df_cc = pd.read_csv('{}/CTD-chemicals-corpus-identifiers.tsv'.format(os.getenv("COVID")), sep='\t')
df_cc['node1'] = df_cc['P486'].map(lambda x: 'Q00005550-chemical-{}'.format(x.replace('"', '')))
df_cc['P31'] = 'Q11344'
df_cc.to_csv('{}/CTD-chemicals-corpus-identifiers-node1.tsv'.format(os.getenv("COVID")), sep='\t', index=False)


Convert the CTD chemical file to a `KGTK Edge` file.
We do this with the `mlr reshape` command.

In [109]:
!cat $COVID/CTD-chemicals-corpus-identifiers-node1.tsv | \
mlr --itsv --otsv reshape -i label,P486,descriptions,aliases,P31,P231 -o label,node2 \
> $COVID/CTD-chemicals-corpus-identifiers-compact.tsv

The `aliases` rows in the above file contain multiple values separated by `|`.
We can expand the values into multiple rows with the `kgtk expand` command.

In [110]:
!cat $COVID/CTD-chemicals-corpus-identifiers-compact.tsv | kgtk expand --columns node1 label --mode NONE \
> $COVID/CTD-chemicals-corpus-edges.tsv

In [19]:
!head $COVID/CTD-chemicals-corpus-edges.tsv | column -t

node1                       label         node2
Q00005550-chemical-C493119  label         07H239-A
Q00005550-chemical-C493119  P486          """C493119"""
Q00005550-chemical-C493119  descriptions
Q00005550-chemical-C493119  aliases
Q00005550-chemical-C493119  P31           Q11344
Q00005550-chemical-C493119  P231
Q00005550-chemical-C534883  label         10074-G5
Q00005550-chemical-C534883  P486          """C534883"""
Q00005550-chemical-C534883  descriptions


#### Step 3: Process CTD Genes file

In [112]:
rename_columns = {
    'GeneName': 'descriptions',
    'GeneID': 'P351',
    '# GeneSymbol': 'label',
    'Synonyms': 'aliases',
    'PharmGKBIDs': 'P7001',
    'UniProtIDs': 'P352'
    
}
drop_columns = ['AltGeneIDs', 'BioGRIDIDs']

df_gene = pd.read_csv('{}/CTD_genes.tsv.gz'.format(os.getenv("COVID")), sep='\t', skiprows=27, dtype=object)
df_gene = df_gene.fillna('').rename(columns=rename_columns).drop(columns=drop_columns)
df_gene['P351'] = df_gene['P351'].map(lambda x: str(x)).map(lambda x: x[5:] if x.startswith('MESH') or x.startswith("OMIM") else x).map(lambda x: json.dumps(x))
df_gene['P7001'] = df_gene['P7001'].map(lambda x: json.dumps(x) if x else x)
df_gene['P352'] = df_gene['P352'].map(lambda x: json.dumps(x) if x else x)
df_gene.to_csv('{}/CTD_genes_clean.tsv'.format(os.getenv("COVID")), sep='\t', index=False)

Find the rows of `CTD_genes_clean.tsv` whose gene ids are among the identifiers missing from Wikidata, as listed in `corpus-identifiers-not-in-wikidata_formatted.tsv`.

We do this with the `kgtk ifexists` command.

In [113]:
!cat $COVID/CTD_genes_clean.tsv | \
kgtk ifexists --input-keys P351 --filter-on  $COVID/corpus-identifiers-not-in-wikidata_formatted.tsv --filter-keys node2 \
--mode=NONE > $COVID/CTD-genes-corpus-identifiers.tsv

Add the column `node1`, creating pseudo Wikidata Qnodes with the formula:
Qnode(Gene_ID) = `Q00005550-gene-<Gene_ID>`.

Also add the column `P31`(instance of) with a constant value of `Q7187` (gene)

In [114]:
df_gc = pd.read_csv('{}/CTD-genes-corpus-identifiers.tsv'.format(os.getenv("COVID")), sep='\t')
df_gc['node1'] = df_gc['P351'].map(lambda x: 'Q00005550-gene-{}'.format(x.replace('"', '')))
df_gc['P31'] = 'Q7187'
df_gc.to_csv('{}/CTD-genes-corpus-identifiers-node1.tsv'.format(os.getenv("COVID")), sep='\t', index=False)


Convert the CTD gene file to a `KGTK Edge` file.
We do this with the `mlr reshape` command.

In [115]:
!cat $COVID/CTD-genes-corpus-identifiers-node1.tsv | \
mlr --itsv --otsv reshape -i label,P351,descriptions,aliases,P31,P7001,P352 -o label,node2 \
> $COVID/CTD-genes-corpus-identifiers-compact.tsv

The `aliases` rows in the above file contain multiple values separated by `|`.
We can expand the values into multiple rows with the `kgtk expand` command.

In [116]:
!cat $COVID/CTD-genes-corpus-identifiers-compact.tsv | kgtk expand --columns node1 label --mode NONE \
> $COVID/CTD-genes-corpus-edges.tsv

In [18]:
!head $COVID/CTD-genes-corpus-edges.tsv | column -t

node1                  label         node2
Q00005550-gene-27521   label         272I21T
Q00005550-gene-27521   P351          """27521"""
Q00005550-gene-27521   descriptions  DNA           segment,  272I21T
Q00005550-gene-27521   aliases
Q00005550-gene-27521   P31           Q7187
Q00005550-gene-27521   P7001
Q00005550-gene-27521   P352
Q00005550-gene-414970  label         2959A1
Q00005550-gene-414970  P351          """414970"""


## Create Mention Edges

In [3]:
from scripts.create_mention_edges import CreateMentionEdges


# The class CreateMentionEdges has the following input parameters in order,
#     1. The path where the folders `pmid_abs` and `pmcid` are (from Heng)
#     2. corpus-identifier-edges.tsv - file which has Qnodes for identifiers in Wikidata
#     3. CTD-genes-corpus-edges.tsv - file with Qnodes for genes created by us
#     4. CTD-diseases-corpus-edges.tsv - file with Qnodes for diseases created by us
#     5. CTD-chemicals-corpus-edges.tsv - file with Qnodes for chemicals created by us

cme = CreateMentionEdges(os.getenv("COVID"), 
                        '{}/corpus-identifier-edges.tsv'.format(os.getenv("COVID")),
                        '{}/CTD-genes-corpus-edges.tsv'.format(os.getenv("COVID")),
                        '{}/CTD-diseases-corpus-edges.tsv'.format(os.getenv("COVID")),
                        '{}/CTD-chemicals-corpus-edges.tsv'.format(os.getenv("COVID")))
cme.create_mention_edges(os.getenv("COVID"))


Total number of papers in the corpus:  20446
Done!


The code above creates three files in the output path `$COVID`:

* `covid_kgtk_blender_mentions.tsv`: the mention edges
* `covid_kgtk_blender_mentions_qualifiers.tsv`: the qualifiers for the mention edges
* `scholarly_articles_not_in_wikidata.tsv`: edges for the papers that were not present in Wikidata
    

In [17]:
!head $COVID/covid_kgtk_blender_mentions.tsv | column -t 

node1             label     node2             id
Q77092138         P2020003  Q166231           Q77092138-P2020003-0
Q77092138         P2020007  Q20747334         Q77092138-P2020007-1
Q77092138         P2020007  Q20747334         Q77092138-P2020007-2
Q77092138         P2020007  Q20747334         Q77092138-P2020007-3
Q77092138         P2020003  Q767485           Q77092138-P2020003-4
Q77092138         P2020001  Q77092138-text-0  Q77092138-P2020001-0
Q77092138-text-0  P2020012  Immunology        and                     prevention  of  infection  in  feedlot  cattle.  Q77092138-text-0-label-0
Q77092138-text-0  P31       Q1385610          Q77092138-text-0-P31-0
Q77092138         P2020001  Q77092138-text-1  Q77092138-P2020001-1


In [16]:
!head $COVID/covid_kgtk_blender_mentions_qualifiers.tsv | column -t 

node1                 label     node2                            id
Q77092138-P2020003-0  P4153     29                               Q77092138-P2020003-0-0
Q77092138-P2020003-0  P2043     9                                Q77092138-P2020003-0-1
Q77092138-P2020003-0  P1932     infection                        Q77092138-P2020003-0-2
Q77092138-P2020003-0  P2020008  http://blender.cs.illinois.edu/  Q77092138-P2020003-0-3
Q77092138-P2020003-0  P2020001  Q77092138-text-0                 Q77092138-P2020003-0-4
Q77092138-P2020007-1  P4153     50                               Q77092138-P2020007-1-0
Q77092138-P2020007-1  P2043     6                                Q77092138-P2020007-1-1
Q77092138-P2020007-1  P1932     cattle                           Q77092138-P2020007-1-2
Q77092138-P2020007-1  P2020008  http://blender.cs.illinois.edu/  Q77092138-P2020007-1-3
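
In the two outputs above, the `node1` of each qualifier row is the `id` of a mention edge. A minimal pandas sketch of attaching qualifiers to their edge, using a few rows copied from those outputs; the variable names are illustrative:

```python
import pandas as pd

mentions = pd.DataFrame({
    "node1": ["Q77092138"],
    "label": ["P2020003"],
    "node2": ["Q166231"],
    "id": ["Q77092138-P2020003-0"],
})
qualifiers = pd.DataFrame({
    "node1": ["Q77092138-P2020003-0"] * 3,
    "label": ["P4153", "P2043", "P1932"],
    "node2": ["29", "9", "infection"],
})

# One row per mention edge, one column per qualifier property.
wide = qualifiers.pivot(index="node1", columns="label", values="node2")

# Attach the qualifier columns to the edge they describe.
annotated = mentions.merge(wide, left_on="id", right_index=True, how="left")
```

`pivot` assumes each qualifier property occurs at most once per edge, which holds for the rows shown here but is not verified for the full file.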


In [15]:
!head $COVID/scholarly_articles_not_in_wikidata.tsv | column -t

id                         node1              label  node2
Q000077708998245-P31-0     Q000077708998245   P31    Q13442814
Q000077708998245-label-1   Q000077708998245   label  Immunity   to  infection.
Q000077708998245-P577-2    Q000077708998245   P577   1996
Q000077708998245-P1476-3   Q000077708998245   P1476  Immunity   to  infection.
Q000077708998245-P698-4    Q000077708998245   P698   8998245
Q0000777022951009-P31-0    Q0000777022951009  P31    Q13442814
Q0000777022951009-label-1  Q0000777022951009  label  Stock      or  stroke?     Stock  market  movement  and  stroke  incidence  in  Taiwan.
Q0000777022951009-P577-2   Q0000777022951009  P577   2012
Q0000777022951009-P1476-3  Q0000777022951009  P1476  Stock      or  stroke?     Stock  market  movement  and  stroke  incidence  in  Taiwan.


## Generate RDF Triples

`kgtk` comes packaged with the command `generate_wikidata_triples` which can generate RDF triples from a KGTK edge file.


### Step 1: concatenate the edge files we created in this notebook

We do this with the `kgtk cat` command.

In [22]:
!kgtk cat $COVID/corpus-edges.tsv \
          $COVID/corpus-citations-and-authors.tsv \
          $COVID/CTD-diseases-corpus-edges.tsv \
          $COVID/CTD-chemicals-corpus-edges.tsv \
          $COVID/CTD-genes-corpus-edges.tsv \
          $COVID/covid_kgtk_blender_mentions.tsv \
          $COVID/covid_kgtk_blender_mentions_qualifiers.tsv \
          $COVID/scholarly_articles_not_in_wikidata.tsv > $COVID/corpus-all.tsv

In [24]:
!wc -l $COVID/corpus-edges.tsv 
!wc -l $COVID/corpus-citations-and-authors.tsv 
!wc -l $COVID/CTD-diseases-corpus-edges.tsv 
!wc -l $COVID/CTD-chemicals-corpus-edges.tsv 
!wc -l $COVID/CTD-genes-corpus-edges.tsv 
!wc -l $COVID/covid_kgtk_blender_mentions.tsv 
!wc -l $COVID/covid_kgtk_blender_mentions_qualifiers.tsv 
!wc -l $COVID/scholarly_articles_not_in_wikidata.tsv
!wc -l $COVID/corpus-all.tsv

 2297783 /Users/amandeep/Github/CKG-Covid/datasets/sandbox/corpus-edges.tsv
 11389393 /Users/amandeep/Github/CKG-Covid/datasets/sandbox/corpus-citations-and-authors.tsv
    5305 /Users/amandeep/Github/CKG-Covid/datasets/sandbox/CTD-diseases-corpus-edges.tsv
  112456 /Users/amandeep/Github/CKG-Covid/datasets/sandbox/CTD-chemicals-corpus-edges.tsv
   41616 /Users/amandeep/Github/CKG-Covid/datasets/sandbox/CTD-genes-corpus-edges.tsv
 8142747 /Users/amandeep/Github/CKG-Covid/datasets/sandbox/covid_kgtk_blender_mentions.tsv
 25868901 /Users/amandeep/Github/CKG-Covid/datasets/sandbox/covid_kgtk_blender_mentions_qualifiers.tsv
     456 /Users/amandeep/Github/CKG-Covid/datasets/sandbox/scholarly_articles_not_in_wikidata.tsv
 47858650 /Users/amandeep/Github/CKG-Covid/datasets/sandbox/corpus-all.tsv
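
As a quick sanity check on these counts, the combined file should contain every input line except the repeated headers. The arithmetic below assumes each input has exactly one header row and that the concatenated file keeps a single header; that is an assumption about the output, though the counts above are consistent with it.

```python
# Line counts reported by `wc -l` for the eight input files above.
input_lines = [2297783, 11389393, 5305, 112456, 41616, 8142747, 25868901, 456]

# Assumption: one header per input, and only one header kept in the output.
expected_total = sum(input_lines) - (len(input_lines) - 1)
```

The result equals the 47858650 lines reported for `corpus-all.tsv`.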


### Step 2: generate RDF triples

We do this with the `kgtk generate_wikidata_triples` command.

In [None]:
!cat $COVID/corpus-all.tsv | kgtk generate_wikidata_triples \
                                -ap aliases \
                                -lp label \
                                -dp descriptions \
                                -pf properties.tsv \
                                -n 1000 \
                                -ig no  \
                                --debug \
                                -gt yes > $COVID/corpus-all.ttl

## Incorporate Analytic Outputs