# Query WikiData to get Biomedical Entities

We will get the nodes (and later some edges) for our biomedical graph from WikiData.

## To Do 
1. Is the script reproducible?
2. Automate reproduced script.
3. Does the script need improvements, now that it's been reproduced?
4. Automate improved script.
5. Compare reproduced vs improved.
6. Test with other algorithms besides Rephetio, compare again with reproduced + improved.
7. Create web interface.

- Include mechanisms so that the library dependencies are transferable, no matter who runs the notebook
- Fix node queries that time out (currently worked around with a limit)
- Check and include any differing nodes (for improved section)

In [3]:
import pandas as pd
from pathlib import Path

# Both imports below raise ModuleNotFoundError until data_tools is installed.
# Solution: pip install git+https://github.com/mmayers12/data_tools
# Note: pip also offers a different package named data_tools;
# so far the two have not conflicted.

from data_tools.df_processing import char_combine_iter 
from data_tools.wiki import node_query_pipeline

# New line recommended by notebook
from tqdm.autonotebook import tqdm 
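
Because the `data_tools` install step is easy to miss, a small pre-flight check can report which modules are absent before the imports run. This is a minimal sketch using only the standard library; `find_missing_modules` is a hypothetical helper name, not part of `data_tools`.

```python
from importlib.util import find_spec


def find_missing_modules(module_names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in module_names if find_spec(name) is None]


# Top-level packages this notebook imports.
required = ["pandas", "data_tools", "tqdm"]
missing = find_missing_modules(required)
if missing:
    print("Missing modules:", ", ".join(missing))
```

The check only looks at module names, so it cannot tell the GitHub `data_tools` apart from the other package of the same name.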

In [4]:
nodes = []

# Diseases

In [5]:
q = """ SELECT DISTINCT ?disease ?diseaseLabel ?umlscui ?snomed_ct ?doid ?mesh ?mondo ?omim ?orpha
        WHERE {

          # Initial typing for Disease 
          # Either an instance of Disease or has a Disease Ontology ID
          {?disease wdt:P31 wd:Q12136}UNION{?disease wdt:P699 ?doid}.

          OPTIONAL {?disease wdt:P2892 ?umlscui .}
          OPTIONAL {?disease wdt:P5806 ?snomed_ct. }
          OPTIONAL {?disease wdt:P699 ?doid. }
          OPTIONAL {?disease wdt:P486 ?mesh. }
          OPTIONAL {?disease wdt:P5270 ?mondo. }
          OPTIONAL {?disease wdt:P492 ?omim. }
          OPTIONAL {?disease wdt:P1550 ?orpha. }

          SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en" }
        }"""

In [8]:
dis_curi_map = {'umlscui': 'UMLS', 'snomed_ct': 'SNOMED', 'mesh': 'MESH', 
                'doid': 'DOID', 'mondo': 'MONDO', 'omim': 'OMIM', 'orpha': 'ORPHA'}

res = node_query_pipeline(q, dis_curi_map, 'disease')
# Open question: what does node_query_pipeline() do to produce this output format?
nodes.append(res)
nodes[0].head()

Unnamed: 0,id,name,label,xrefs
0,Q1001150,fibrillation,Disease,UMLS:C0232197
1,Q100165995,acute pulmonary hypertension,Disease,
2,Q1002195,autosomal recessive limb-girdle muscular dystr...,Disease,DOID:DOID:0110297|MONDO:MONDO:0012248|OMIM:609...
3,Q1003534,bulbar syndrome,Disease,
4,Q1004647,bullous pemphigoid,Disease,DOID:DOID:8506|MESH:D010391|MONDO:MONDO:001908...
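
On the open question in the cell above: the output has one row per item with the columns `id`, `name`, `label`, and `xrefs`, where `xrefs` is a `|`-separated list of prefixed identifiers. The sketch below shows one way such a shape could be derived from SPARQL JSON bindings. It is an illustration under assumptions, not the actual `data_tools` implementation; `bindings_to_nodes` and the hand-written sample bindings are hypothetical.

```python
def bindings_to_nodes(bindings, curi_map, node_type):
    """Collapse SPARQL JSON bindings into one record per Wikidata item."""
    label = node_type.replace("_", " ").title()
    merged = {}
    for row in bindings:
        # The item URI ends in its QID, e.g. .../entity/Q1004647
        qid = row[node_type]["value"].rsplit("/", 1)[-1]
        record = merged.setdefault(
            qid,
            {"id": qid, "name": row[node_type + "Label"]["value"],
             "label": label, "xrefs": set()},
        )
        for var, prefix in curi_map.items():
            if var in row:  # OPTIONAL variables are absent when unbound
                record["xrefs"].add(f"{prefix}:{row[var]['value']}")
    # Join the prefixed identifiers into a single pipe-separated string
    return [dict(rec, xrefs="|".join(sorted(rec["xrefs"]))) for rec in merged.values()]


# Hand-written sample: two result rows for the same item
sample = [
    {"disease": {"value": "http://www.wikidata.org/entity/Q1004647"},
     "diseaseLabel": {"value": "bullous pemphigoid"},
     "doid": {"value": "DOID:8506"}},
    {"disease": {"value": "http://www.wikidata.org/entity/Q1004647"},
     "diseaseLabel": {"value": "bullous pemphigoid"},
     "mesh": {"value": "D010391"}},
]
nodes_sketch = bindings_to_nodes(sample, {"doid": "DOID", "mesh": "MESH"}, "disease")
print(nodes_sketch)
```

The doubled `DOID:DOID:` prefix in the real output suggests that the stored value already starts with `DOID:` before the pipeline adds its own prefix; the sample above assumes that.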


# Compounds

In [17]:
q = """SELECT DISTINCT ?compound ?compoundLabel ?kegg_drug ?chebi ?drugbank_id ?umlscui ?chembl_id ?unii ?ikey ?pubchem_cid ?rxnorm ?mesh_supplemental_record_ui ?mesh_descriptor_ui
        WHERE {

          # Initial typing for Compound: select every Wikidata item
          # that is an instance of (P31) a chemical compound (Q11173)
          ?compound wdt:P31 wd:Q11173 .

          # Each matched item may optionally have the identifiers below,
          # each bound to its own variable
          OPTIONAL { ?compound wdt:P665 ?kegg_drug .}
          OPTIONAL { ?compound wdt:P683 ?chebi .}
          OPTIONAL { ?compound wdt:P715 ?drugbank_id .}
          OPTIONAL { ?compound wdt:P2892 ?umlscui .}
          OPTIONAL { ?compound wdt:P592 ?chembl_id .}
          OPTIONAL { ?compound wdt:P652 ?unii .}
          OPTIONAL { ?compound wdt:P3350 ?ikey .}
          OPTIONAL { ?compound wdt:P662 ?pubchem_cid .}
          OPTIONAL { ?compound wdt:P3345 ?rxnorm .}
          OPTIONAL { ?compound wdt:P6680 ?mesh_supplemental_record_ui .}
          OPTIONAL { ?compound wdt:P486 ?mesh_descriptor_ui .}

          SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en" }
        }
        
        limit 200"""  # temporary limit to avoid a timeout; remove once the query is fixed

In [18]:
chem_curi_map = {'unii': 'UNII', 
    'rxnorm': 'RxCUI', 
    'drugbank_id': 'DB', 
    'umlscui': 'UMLS', 
    'chebi': 'CHEBI', 
    'chembl_id': 'CHEMBL',
    'kegg_drug': 'KEGG', 
    'ikey': 'IKEY', 
    'pubchem_cid': 'PCID', 
    'mesh_supplemental_record_ui': 'MESH', 
    'mesh_descriptor_ui': 'MESH'}

res = node_query_pipeline(q, chem_curi_map, 'compound')
nodes.append(res)
nodes[1].head()

# Without a limit, the query takes too long and a JSONDecodeError is raised.
# Temporary fix: limit 200 in the query above.

Unnamed: 0,id,name,label,xrefs
0,Q1001150,fibrillation,Disease,UMLS:C0232197
1,Q100165995,acute pulmonary hypertension,Disease,
2,Q1002195,autosomal recessive limb-girdle muscular dystr...,Disease,DOID:DOID:0110297|MONDO:MONDO:0012248|OMIM:609...
3,Q1003534,bulbar syndrome,Disease,
4,Q1004647,bullous pemphigoid,Disease,DOID:DOID:8506|MESH:D010391|MONDO:MONDO:001908...
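
One possible replacement for the temporary `limit 200` is to fetch the compounds in pages, so that each request is smaller. The sketch below only builds and iterates the paged queries. `run_query` is a hypothetical callable standing in for whatever actually sends the SPARQL (it is not a `data_tools` function), the base query must not already contain a limit, and the page size is an arbitrary assumption. Paging needs an `ORDER BY` to be stable, which can itself be expensive on a large result set, so this may not be sufficient on its own.

```python
def paged_queries(base_query, order_var, page_size):
    """Yield copies of base_query with ORDER BY / LIMIT / OFFSET appended, one per page."""
    offset = 0
    while True:
        yield f"{base_query}\nORDER BY ?{order_var}\nLIMIT {page_size} OFFSET {offset}"
        offset += page_size


def fetch_all(base_query, order_var, run_query, page_size=1000):
    """Call run_query page by page until a page comes back short."""
    rows = []
    for query in paged_queries(base_query, order_var, page_size):
        page = run_query(query)
        rows.extend(page)
        if len(page) < page_size:
            break
    return rows


# Demonstration with a fake backend of 5 rows, so no network access is needed.
fake_rows = [f"Q{i}" for i in range(5)]


def fake_run_query(query):
    """Pretend endpoint: slice fake_rows according to the LIMIT and OFFSET in the query."""
    limit = int(query.split("LIMIT ")[1].split()[0])
    offset = int(query.split("OFFSET ")[1])
    return fake_rows[offset:offset + limit]


base = "SELECT ?compound WHERE { ?compound wdt:P31 wd:Q11173 . }"
print(fetch_all(base, "compound", fake_run_query, page_size=2))
```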


# Phenotype

In [11]:
q = """SELECT DISTINCT ?phenotype ?phenotypeLabel ?hpo ?mesh ?omim ?snomed
        WHERE {

          # Initial typing for phenotype
          {?phenotype wdt:P31 wd:Q169872.}UNION{?phenotype wdt:P3841 ?hpo}

          # Xrefs associated with phenotypes
          OPTIONAL {?phenotype wdt:P3841 ?hpo .}
          OPTIONAL {?phenotype wdt:P486 ?mesh . }
          OPTIONAL {?phenotype wdt:P492 ?omim . }
          OPTIONAL {?phenotype wdt:P5806 ?snomed . }

          SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en" }
        }"""

In [12]:
res = node_query_pipeline(q, {'mesh': 'MESH', 'omim': 'OMIM', 'hpo':'HP', 'snomed': 'SNOMED'}, 'phenotype')
nodes.append(res)
nodes[2].head()

Unnamed: 0,id,name,label,xrefs
0,Q150717,decane,Compound,CHEBI:41808|CHEMBL:CHEMBL134537|PCID:15600|UNI...
1,Q150731,undecane,Compound,CHEBI:46342|CHEMBL:CHEMBL132474|PCID:14257|RxC...
2,Q150744,dodecane,Compound,CHEBI:28817|CHEMBL:CHEMBL30959|KEGG:C08374|MES...
3,Q150788,tridecane,Compound,CHEBI:35998|CHEMBL:CHEMBL135694|KEGG:C13834|PC...
4,Q150808,tetradecane,Compound,CHEBI:41253|DB:03563|PCID:12389|UNII:03LY784Y58


# Gene

Genes are too numerous and will require filtering to a single taxon in order for the query to finish successfully.

For now we will only extract human genes, but in the future we will do the same for infectious taxa.

In [13]:
q = """SELECT DISTINCT ?gene ?geneLabel ?entrez ?symbol ?hgnc ?omim ?ensembl
        WHERE {{

          # Initial typing for Gene
          ?gene wdt:P31 wd:Q7187.
          ?gene wdt:P703 wd:{tax}.

          OPTIONAL{{?gene wdt:P351 ?entrez .}}
          OPTIONAL{{?gene wdt:P353 ?symbol .}}
          OPTIONAL{{?gene wdt:P354 ?hgnc .}}
          OPTIONAL{{?gene wdt:P492 ?omim .}}
          OPTIONAL{{?gene wdt:P594 ?ensembl .}}

          SERVICE wikibase:label {{ bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en" }}
        }}"""

human_tax_wd_id = 'Q15978631' 
q = q.format(tax=human_tax_wd_id)

In [14]:
gene_curi_map = {'entrez': 'NCBIGene', 'symbol': 'SYM', 'hgnc':'HGNC', 'omim':'OMIM', 'ensembl':'ENSG'}
res = node_query_pipeline(q, gene_curi_map, 'gene')
nodes.append(res)
nodes[3].head()

Unnamed: 0,id,name,label,xrefs
0,Q1016605,Burkitt lymphoma,Phenotype,HP:0030080|MESH:D002051|OMIM:113970
1,Q101971,wart,Phenotype,HP:0200043|MESH:D014860
2,Q102293266,calcium oxalate nephrolithiasis,Phenotype,HP:0008672
3,Q1027995,pyloric stenosis,Phenotype,HP:0002021|MESH:D011707
4,Q10282075,gingival fibromatosis,Phenotype,HP:0000169|MESH:D005351|OMIM:135300|OMIM:60554...
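
The doubled braces in the gene query exist so that `str.format` can fill in `{tax}` while leaving the SPARQL braces intact. A minimal sketch of reusing one template for several taxa follows. The template is abbreviated, and the second taxon ID is a placeholder rather than a real Wikidata item.

```python
GENE_TEMPLATE = """SELECT DISTINCT ?gene ?geneLabel ?entrez
        WHERE {{
          ?gene wdt:P31 wd:Q7187 .
          ?gene wdt:P703 wd:{tax} .
          OPTIONAL {{ ?gene wdt:P351 ?entrez . }}
        }}"""

taxa = {
    "human": "Q15978631",              # taken from the gene cell above
    "example_taxon": "Q_PLACEHOLDER",  # hypothetical; replace with a real taxon QID
}

# One formatted query per taxon; each could then be passed to node_query_pipeline
gene_queries = {name: GENE_TEMPLATE.format(tax=qid) for name, qid in taxa.items()}
print(gene_queries["human"])
```

Because each taxon gets its own query, the per-taxon results would each be small enough to run separately and could be concatenated afterwards.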


# Protein

In [19]:
q = """SELECT DISTINCT ?protein ?proteinLabel ?uniprot
        WHERE {{

          # Initial typing for Protein
          ?protein wdt:P31 wd:Q8054.
          ?protein wdt:P703 wd:{tax}.

          OPTIONAL{{?protein wdt:P352 ?uniprot .}}
          SERVICE wikibase:label {{ bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en" }}
        }}"""
q = q.format(tax=human_tax_wd_id)

In [20]:
res = node_query_pipeline(q, {'uniprot':'UniProt'}, 'protein')
nodes.append(res)
nodes[4].head()

Unnamed: 0,id,name,label,xrefs
0,Q1022703,ZMPSTE24,Gene,ENSG:ENSG00000084073|HGNC:12877|NCBIGene:10269...
1,Q106030625,MAP2K5-DT,Gene,HGNC:55261|NCBIGene:118732298|SYM:MAP2K5-DT
2,Q106030627,CD2AP-DT,Gene,HGNC:55263|NCBIGene:118732299|SYM:CD2AP-DT
3,Q106030628,ERBIN-DT,Gene,HGNC:55270|NCBIGene:118732300|SYM:ERBIN-DT
4,Q106030629,PPP2CA-DT,Gene,HGNC:55266|NCBIGene:118732304|SYM:PPP2CA-DT


# Pathway

In [21]:
q = """SELECT DISTINCT ?pathway ?pathwayLabel ?react ?wpid
        WHERE {

          # Initial typing for Pathway
          ?pathway wdt:P31 wd:Q4915012 .

          OPTIONAL{?pathway wdt:P3937 ?react .}
          OPTIONAL{?pathway wdt:P2410 ?wpid .}

          SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en" }
        }"""

In [22]:
res = node_query_pipeline(q, {'react':'REACT', 'wpid':'WP'}, 'pathway')
nodes.append(res)
nodes[5].head()

Unnamed: 0,id,name,label,xrefs
0,Q102769,sodium hydroxide,Compound,CHEBI:32145|CHEMBL:CHEMBL2105794|DB:11151|KEGG...
1,Q104219,bilirubin,Compound,CHEBI:16990|CHEMBL:CHEMBL501680|KEGG:C00486|ME...
2,Q104334,carbonic acid,Compound,CHEBI:28976|CHEMBL:CHEMBL1161632|KEGG:C01353|M...
3,Q105522,uric acid,Compound,CHEBI:17775|CHEMBL:CHEMBL792|DB:08844|KEGG:C00...
4,Q107184,cupric sulfate,Compound,CHEBI:23414|DB:06778|KEGG:C18713|PCID:24462|UM...


# Molecular Function

In [23]:
q = """SELECT DISTINCT ?molecular_function ?molecular_functionLabel ?goid
        WHERE {

          # Initial typing for Molecular Function
          ?molecular_function wdt:P31 wd:Q14860489 .
          ?molecular_function wdt:P686 ?goid

          SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en" }
        }"""

In [24]:
res = node_query_pipeline(q, {'goid':'GO'}, 'molecular_function')
nodes.append(res)
nodes[6].head()

Unnamed: 0,id,name,label,xrefs
0,Q1024612,C-X-C motif chemokine receptor 5,Protein,UniProt:P32302
1,Q1032902,"Mucin 16, cell surface associated",Protein,UniProt:Q8WXI7
2,Q105362742,Neurotensin/neuromedin N,Protein,UniProt:P30990
3,Q105412156,prepro-GRP,Protein,UniProt:P07492
4,Q1056532,CD44 molecule (Indian blood group),Protein,UniProt:P16070


# Biological Process

In [25]:
q = """SELECT DISTINCT ?biological_process ?biological_processLabel ?goid
        WHERE {

          # Initial typing for Biological Process
          ?biological_process wdt:P31 wd:Q2996394 .
          ?biological_process wdt:P686 ?goid

          SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en" }
        }"""

In [26]:
res = node_query_pipeline(q, {'goid':'GO'}, 'biological_process')
nodes.append(res)
nodes[7].head()

Unnamed: 0,id,name,label,xrefs
0,Q100166016,EML4 and NUDC in mitotic spindle formation,Pathway,REACT:R-HSA-9648025
1,Q100166017,Postmitotic nuclear pore complex (NPC) reforma...,Pathway,REACT:R-HSA-9615933
2,Q100166018,Sealing of the nuclear envelope (NE) by ESCRT-III,Pathway,REACT:R-HSA-9668328
3,Q100166022,Inhibition of DNA recombination at telomere,Pathway,REACT:R-HSA-9670095
4,Q100166036,Response of EIF2AK4 (GCN2) to amino acid defic...,Pathway,REACT:R-HSA-9633012


# Cellular Component

In [27]:
q = """SELECT DISTINCT ?cellular_component ?cellular_componentLabel ?goid
    WHERE {

      # Initial typing for Cellular Component
      ?cellular_component wdt:P31 wd:Q5058355 .
      ?cellular_component wdt:P686 ?goid

      SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en" }
    }"""

In [28]:
res = node_query_pipeline(q, {'goid':'GO'}, 'cellular_component')
nodes.append(res)
nodes[8].head()

Unnamed: 0,id,name,label,xrefs
0,Q1012651,ribonuclease P activity,Molecular Function,GO:0004526
1,Q13667380,metal ion binding,Molecular Function,GO:0046872
2,Q13667398,lipoprotein particle receptor binding,Molecular Function,GO:0070325
3,Q14326094,protein serine/threonine kinase activity,Molecular Function,GO:0004674
4,Q14326101,serine-type peptidase activity,Molecular Function,GO:0008236


# Anatomy

In [29]:
q = """SELECT DISTINCT ?anatomy ?anatomyLabel ?uberon ?mesh
        WHERE {

          # Anatomical Structures
          ?anatomy wdt:P1554 ?uberon
          
          OPTIONAL{?anatomy wdt:P486 ?mesh .}

          SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en" }
        }"""

In [30]:
res = node_query_pipeline(q, {'uberon':'UBERON', 'mesh': 'MESH'}, 'anatomy')
nodes.append(res)
nodes[9].head()

Unnamed: 0,id,name,label,xrefs
0,Q10344426,pachytene,Biological Process,GO:0000239
1,Q1057,metabolism,Biological Process,GO:0008152
2,Q105726,urination,Biological Process,GO:0060073
3,Q1068809,cell cycle checkpoint,Biological Process,GO:0000075
4,Q10746904,intramembranous ossification,Biological Process,GO:0001957


# Put them all together

In [31]:
nodes = pd.concat(nodes, sort=False, ignore_index=True)
len(nodes)

170010

In [32]:
nodes['id'].nunique()

152182

In [33]:
nodes[nodes['id'].duplicated(keep=False)].sort_values('id').head(50)

Unnamed: 0,id,name,label,xrefs
0,Q1001150,fibrillation,Disease,UMLS:C0232197
16682,Q1001150,fibrillation,Disease,UMLS:C0232197
16683,Q100165995,acute pulmonary hypertension,Disease,
1,Q100165995,acute pulmonary hypertension,Disease,
16684,Q1002195,autosomal recessive limb-girdle muscular dystr...,Disease,DOID:DOID:0110297|MONDO:MONDO:0012248|OMIM:609...
2,Q1002195,autosomal recessive limb-girdle muscular dystr...,Disease,DOID:DOID:0110297|MONDO:MONDO:0012248|OMIM:609...
16685,Q1003534,bulbar syndrome,Disease,
3,Q1003534,bulbar syndrome,Disease,
16686,Q1004647,bullous pemphigoid,Disease,DOID:DOID:8506|MESH:D010391|MONDO:MONDO:001908...
4,Q1004647,bullous pemphigoid,Disease,DOID:DOID:8506|MESH:D010391|MONDO:MONDO:001908...


In [34]:
nodes[nodes['id'].duplicated(keep=False)].sort_values('id').tail(50)

Unnamed: 0,id,name,label,xrefs
16658,Q979168,cerebral edema,Disease,DOID:DOID:4724|MESH:D001929|MONDO:MONDO:000668...
33340,Q979168,cerebral edema,Disease,DOID:DOID:4724|MESH:D001929|MONDO:MONDO:000668...
33341,Q980709,status epilepticus,Disease,DOID:DOID:1824|MESH:D013226|UMLS:C0038220
16659,Q980709,status epilepticus,Disease,DOID:DOID:1824|MESH:D013226|UMLS:C0038220
16660,Q98078318,bone sarcoma,Disease,DOID:DOID:0080639
33342,Q98078318,bone sarcoma,Disease,DOID:DOID:0080639
16661,Q980926,polymyositis,Disease,MESH:D017285|MONDO:MONDO:0019127|ORPHA:732|UML...
33343,Q980926,polymyositis,Disease,MESH:D017285|MONDO:MONDO:0019127|ORPHA:732|UML...
16662,Q98266891,Shrunken pore syndrome,Disease,
33344,Q98266891,Shrunken pore syndrome,Disease,
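
The duplicated rows shown above are exact copies, which would be consistent with a query cell having been re-run so that its result was appended to `nodes` more than once (an inference from the output, not something the notebook records). Below is a minimal sketch of two safeguards, using toy frames in place of the real query results: keying results by node type so that a re-run overwrites rather than appends, and dropping rows that are identical in every column after concatenation. The names `node_frames`, `appended`, `keyed`, and `deduped` are illustrative.

```python
import pandas as pd

# Toy stand-ins for the frames returned by the queries
disease = pd.DataFrame({"id": ["Q1001150"], "name": ["fibrillation"],
                        "label": ["Disease"], "xrefs": ["UMLS:C0232197"]})
pathway = pd.DataFrame({"id": ["Q100166016"],
                        "name": ["EML4 and NUDC in mitotic spindle formation"],
                        "label": ["Pathway"], "xrefs": ["REACT:R-HSA-9648025"]})

# List-based accumulation: a re-run cell appends the same frame twice
appended = pd.concat([disease, disease, pathway], sort=False, ignore_index=True)

# Safeguard 1: key by node type, so storing a result again replaces the earlier copy
node_frames = {}
node_frames["disease"] = disease
node_frames["disease"] = disease  # simulated re-run of the cell
node_frames["pathway"] = pathway
keyed = pd.concat(list(node_frames.values()), sort=False, ignore_index=True)

# Safeguard 2: remove rows that are identical in every column
deduped = appended.drop_duplicates().reset_index(drop=True)

print(len(appended), len(keyed), len(deduped))
```

Rows that share an `id` but differ in `label` would survive `drop_duplicates()`; whether to merge those is a separate decision.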


In [35]:
nodes['label'].value_counts()

Gene                  59058
Disease               33364
Biological Process    28674
Protein               25477
Molecular Function    11029
Cellular Component     4155
Pathway                3386
Anatomy                2568
Phenotype              1957
Compound                342
Name: label, dtype: int64

## Save

In [36]:
this_name = '01a_WikiData_Nodes'
out_dir = Path('../2_pipeline/').joinpath(this_name, 'out')

# Make the output directory if it doesn't already exist
out_dir.mkdir(parents=True, exist_ok=True)

nodes.to_csv(out_dir.joinpath('nodes.csv'), index=False)

## Open question: should the 'pipeline' folder be renamed to 'results'?