* = 'to do' item for later
* Convert to a python script (for automation)

Step 0: Relevant packages installed (ref 00_requirements.txt) <br>
Step 1: Data downloaded <br>
Step 1a: Nodes 

I. [Load Packages](#Load) (click a link to jump directly to that section) <br>
II. [Query for Biomedical Node Types in Wikidata](#Query)

## Load 
Packages and modules with relevant functions

In [1]:
from pathlib import Path
from tqdm.autonotebook import tqdm 

from data_tools.df_processing import char_combine_iter 
from data_tools.wiki import node_query_pipeline


Make an empty list for nodes (this will become a populated .csv)

In [2]:
nodes = []
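
The list-to-CSV step can be sketched as follows. This is a minimal sketch, assuming each item appended to `nodes` is a pandas DataFrame with the `id`, `name`, `label`, and `xrefs` columns seen in the printed output later in this notebook. The rows are copied from that output, and `nodes_example` and the in-memory buffer are illustrative stand-ins, not part of the original workflow.

```python
import io

import pandas as pd

# Two small frames standing in for results appended to `nodes`.
# Rows are copied from the Anatomy output shown in this notebook.
nodes_example = [
    pd.DataFrame({"id": ["Q101004"], "name": ["aorta"],
                  "label": ["Anatomy"], "xrefs": ["0000947"]}),
    pd.DataFrame({"id": ["Q988343"], "name": ["blood vessel"],
                  "label": ["Anatomy"], "xrefs": ["0001981"]}),
]

# Stack the per-type frames into one table and renumber the rows.
combined = pd.concat(nodes_example, ignore_index=True)

# Write to an in-memory buffer here; pass a file path instead to save to disk.
buffer = io.StringIO()
combined.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
```

Keeping `xrefs` as strings preserves the leading zeros in the identifiers when the table is written out.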

## Query
Biomedically relevant node types in Wikidata (ordered alphabetically)
* Node categories to add or adjust? Note that identifiers have been removed (irrelevant?)
** Remove xrefs column for node_query_pipeline function??
* Make into a for loop (DRY)?

In [6]:
# Anatomy (remove ?uberon... how ??)
q = """SELECT DISTINCT ?anatomy ?anatomyLabel ?uberon 
        WHERE {
          ?anatomy wdt:P1554 ?uberon
          SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en" }
        }""" 

res = node_query_pipeline(q, {}, 'anatomy')
nodes.append(res)

# Biological Process (wdt:P1554 ?uberon of anatomy vs wdt:P31 wd:Q2996394 .)
q = """SELECT DISTINCT ?biological_process ?biological_processLabel 
        WHERE {
          ?biological_process wdt:P31 wd:Q2996394 .
          SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en" }
        }"""

res = node_query_pipeline(q, {}, 'biological_process')
nodes.append(res)

# Cellular Component
q = """SELECT DISTINCT ?cellular_component ?cellular_componentLabel 
    WHERE {
      ?cellular_component wdt:P31 wd:Q5058355 .
      SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en" }
    }"""

res = node_query_pipeline(q, {}, 'cellular_component')
nodes.append(res)

# Compounds (ideas for how to specify?)
q = """SELECT DISTINCT ?compound ?compoundLabel
        WHERE {
          ?compound wdt:P31 wd:Q11173 .
          SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en" }
        }
        limit 150000""" 

res = node_query_pipeline(q, {}, 'compound')
nodes.append(res)

# Disease
q = """SELECT DISTINCT ?disease ?diseaseLabel 
        WHERE {
          ?disease wdt:P31 wd:Q12136 .
          SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en" }
        }"""
 
res = node_query_pipeline(q, {}, 'disease')
nodes.append(res)

# Genes (see below)

# Pathway
q = """SELECT DISTINCT ?pathway ?pathwayLabel
        WHERE {
          ?pathway wdt:P31 wd:Q4915012 .
          SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en" }
        }"""

res = node_query_pipeline(q, {}, 'pathway')
nodes.append(res)

# Phenotype (nothing for hpo? apply to Compound?)
q = """SELECT DISTINCT ?phenotype ?phenotypeLabel ?hpo 
        WHERE {
          { ?phenotype wdt:P31 wd:Q169872 . } UNION { ?phenotype wdt:P3841 ?hpo . }
          SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en" }
        }"""

res = node_query_pipeline(q, {}, 'phenotype')
nodes.append(res)

# Protein (see below)

# Molecular Function
q = """SELECT DISTINCT ?molecular_function ?molecular_functionLabel 
        WHERE {
          ?molecular_function wdt:P31 wd:Q14860489 .
          SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en" }
        }"""

res = node_query_pipeline(q, {}, 'molecular_function')
nodes.append(res)
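
One way to approach the "for loop (DRY)" to-do is sketched below. It is a minimal sketch, not the notebook's method: `QUERY_TEMPLATE`, `NODE_CLASSES`, and `build_queries` are hypothetical names, and only the node types that share the plain `wdt:P31` pattern are included. Anatomy, Compounds, and Phenotype use different patterns or a `limit` and would stay as separate queries.

```python
# Template for the simple "instance of" (wdt:P31) queries above.
# Doubled braces become single literal braces once str.format has run.
QUERY_TEMPLATE = """SELECT DISTINCT ?{name} ?{name}Label
        WHERE {{
          ?{name} wdt:P31 wd:{qid} .
          SERVICE wikibase:label {{ bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en" }}
        }}"""

# Node type -> Wikidata class, copied from the queries in the cell above.
NODE_CLASSES = {
    "biological_process": "Q2996394",
    "cellular_component": "Q5058355",
    "disease": "Q12136",
    "pathway": "Q4915012",
    "molecular_function": "Q14860489",
}


def build_queries(node_classes):
    """Return a {node_type: SPARQL query} mapping built from the template."""
    return {name: QUERY_TEMPLATE.format(name=name, qid=qid)
            for name, qid in node_classes.items()}


queries = build_queries(NODE_CLASSES)

# In the notebook, each query would then go through the existing pipeline:
# for name, q in queries.items():
#     nodes.append(node_query_pipeline(q, {}, name))
```

The call to `node_query_pipeline` is left commented out so the sketch runs without the `data_tools` package or network access.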

In [8]:
print(nodes) # Anatomy (more efficient way?)

[              id                                       name    label    xrefs
0       Q1001337  mesencephalic nucleus of trigeminal nerve  Anatomy  0001718
1       Q1002789                posterior ethmoidal foramen  Anatomy  0018654
2       Q1003805                           Nucleus ambiguus  Anatomy  0001719
3        Q101004                                      aorta  Anatomy  0000947
4     Q102277188                      anatomical projection  Anatomy  0004529
...          ...                                        ...      ...      ...
2562     Q988343                               blood vessel  Anatomy  0001981
2563     Q988861                                 Epineurium  Anatomy  0000124
2564     Q992893                           Brunner's glands  Anatomy  0001212
2565     Q994554                            thoracic cavity  Anatomy  0002224
2566     Q999472                            pulmonary trunk  Anatomy  0002333

[2567 rows x 4 columns],               id                     
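
A lighter alternative to printing the whole list is to print one line per node type. The sketch below assumes every item in `nodes` is a non-empty pandas DataFrame with a `label` column, as in the output above; `summarize` and `example` are hypothetical names, and the rows are copied from the Anatomy output.

```python
import pandas as pd


def summarize(frames):
    """Return one (label, row count) pair per DataFrame in the list."""
    return [(df["label"].iloc[0], len(df)) for df in frames]


# Toy frame with rows copied from the Anatomy output above.
example = [
    pd.DataFrame({"id": ["Q101004", "Q988343"],
                  "name": ["aorta", "blood vessel"],
                  "label": ["Anatomy", "Anatomy"],
                  "xrefs": ["0000947", "0001981"]}),
]

for label, n_rows in summarize(example):
    print(f"{label}: {n_rows} rows")
```

Calling `summarize(nodes)` after each `nodes.append(res)` would show which node types have been added and how many rows each returned.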

In [10]:
print(nodes) # Biological Process added


In [13]:
print(nodes) # Cellular Component added


In [15]:
print(nodes) # Compounds added


In [17]:
print(nodes) # Disease added


In [16]:
# Genes (issue with . after wd:{tax} -- apply this to Compounds?)

q = """SELECT DISTINCT ?gene ?geneLabel 
        WHERE {{
          ?gene wdt:P31 wd:Q7187.
          ?gene wdt:P703 wd:{tax}. 
          SERVICE wikibase:label {{ bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en" }}
        }}"""

human_tax_wd_id = 'Q15978631' 
q = q.format(tax=human_tax_wd_id)

gene_curi_map = {'entrez': 'NCBIGene', 'symbol': 'SYM', 'hgnc':'HGNC', 'omim':'OMIM', 'ensembl':'ENSG'}
res = node_query_pipeline(q, gene_curi_map, 'gene')
nodes.append(res)
res.head()  # preview the gene nodes just appended (nodes[3] depends on append order)
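
The Genes query relies on `str.format`, which is why its SPARQL braces are doubled. The sketch below illustrates only that mechanism under a shortened query (the label service line is omitted); `template` and `query` are illustrative names. It does not diagnose the issue with the period after `wd:{tax}` noted in the comment, but it shows one possible workaround: putting a space before the terminating period so it stays visibly apart from the substituted ID.

```python
# Doubled braces survive str.format as single literal braces;
# {tax} is the only placeholder that gets substituted.
template = """SELECT DISTINCT ?gene ?geneLabel
        WHERE {{
          ?gene wdt:P31 wd:Q7187 .
          ?gene wdt:P703 wd:{tax} .
        }}"""

human_tax_wd_id = "Q15978631"
query = template.format(tax=human_tax_wd_id)

# Inspect the final query text before sending it anywhere.
print(query)
```

The same pattern could apply to the Compounds query if it ever needs a parameter, which is an assumption about future use rather than something the notebook does today.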

In [21]:
print(nodes) # Pathway added


In [23]:
print(nodes) # Phenotype added


In [24]:
# Protein

q = """SELECT DISTINCT ?protein ?proteinLabel 
        WHERE {{
          ?protein wdt:P31 wd:Q8054.
          ?protein wdt:P703 wd:{tax}.
          SERVICE wikibase:label {{ bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en" }}
        }}"""
# Define the taxon ID here as well so this cell runs even if the Genes cell has not been run
human_tax_wd_id = 'Q15978631'
q = q.format(tax=human_tax_wd_id)

res = node_query_pipeline(q, {}, 'protein')
nodes.append(res)

In [19]:
print(nodes) # Molecular Function added



* Update and rename Mike's data_tools package? Check whether others use it first.
* Change `pd` to `pandas`.