This notebook uses SPARQL queries to retrieve and download biomedically relevant edge types from Wikidata for use by the downstream drug repurposing algorithm.

I. [Load Packages](#Load) (click a phrase to jump directly to that section) <br>
II. [Infectious Taxa as Part of Network](#Taxa) <br>
III. [Query for Biomedical Edge Types in Wikidata](#Query) <br>
IV. [Concatenate Node Types and Save as .csv](#Concatenate) <br>

## Load 
Packages and modules with relevant functions

In [5]:
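# Render matplotlib figures inline, below the cell that produces them
%matplotlib inline
import pandas as pd

import functools  # higher-order function helpers such as partial and reduce
from pathlib import Path
from itertools import chain  # iterates over several iterables as one sequence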
from tqdm.autonotebook import tqdm 

from data_tools.df_processing import char_combine_iter, add_curi
from data_tools.plotting import count_plot_h
from data_tools.wiki import execute_sparql_query, node_query_pipeline, standardize_nodes, standardize_edges



In [6]:
# TODO: consider moving process_taxa() into the data_tools package
def process_taxa(edges):
    """Build a standardized taxon node table from a taxon-disease query result."""
    nodes = edges.drop_duplicates(subset=['taxon', 'tax_id'])[['taxon', 'taxonLabel', 'tax_id']]
    nodes = add_curi(nodes, {'tax_id': 'NCBITaxon'})
    return standardize_nodes(nodes, 'taxon')
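To make the reshaping in `process_taxa` concrete, here is a minimal plain-Python sketch of the same idea on toy rows. It is an illustration only: the function name `process_taxa_sketch` and the disease IDs are hypothetical, and the output columns (`id`, `name`, `label`, `xrefs`) are assumed from the node table printed later in this notebook, not taken from the `data_tools` source.

```python
def process_taxa_sketch(rows):
    """Deduplicate taxon rows and reshape them into node records.

    Assumption: the output mirrors the id/name/label/xrefs columns seen in
    this notebook's node table; the real helpers live in data_tools.
    """
    seen = set()
    nodes = []
    for row in rows:
        tax_id = row.get("tax_id")
        key = (row["taxon"], tax_id)
        if key in seen:
            continue  # this taxon / tax_id pair was already emitted
        seen.add(key)
        nodes.append({
            "id": row["taxon"],
            "name": row["taxonLabel"],
            "label": "Taxon",
            # A CURIE is "<prefix>:<local id>"; taxa without a tax_id get None
            "xrefs": f"NCBITaxon:{tax_id}" if tax_id else None,
        })
    return nodes


# Two query rows for the same taxon linked to two placeholder diseases
toy_rows = [
    {"disease": "Q_DISEASE_A", "taxon": "Q1002578",
     "taxonLabel": "Babesia divergens", "tax_id": "32595"},
    {"disease": "Q_DISEASE_B", "taxon": "Q1002578",
     "taxonLabel": "Babesia divergens", "tax_id": "32595"},
]
toy_nodes = process_taxa_sketch(toy_rows)
print(toy_nodes)
```

The point of the sketch is that one taxon linked to many diseases should yield a single node row, which is why the deduplication keys on the taxon and its identifier rather than on the whole row.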

In [8]:
# Load the node table saved by the previous notebook (01a_nodes.csv).
# Open question: why are nodes needed to retrieve edges, and should they be kept here?

prev_dir = Path('../results/').resolve()
prev_nodes = pd.read_csv(prev_dir.joinpath('01a_nodes.csv')) 

In [9]:
nodes = []
edges = []

## Taxa
We account for the taxa involved in or related to disease. Wikidata expresses these relationships with two syntaxes, and we query them with two approaches.

#### Syntax in the Wikidata data model

1. Direct statements:  
    Taxon has-effect Disease... or Disease has-cause Taxon 
    

2. Qualifier Statements:  
    Disease has-cause infection (qual: of Taxon) 
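
The two syntaxes listed above can be reduced to the same `taxon causes disease` edge. The sketch below shows one way to do that on toy statements. The property and item IDs (`P828`, `P1542`, `P642`, `Q166231`) are the ones used in this notebook's queries; the statement layout and the `Q_DISEASE_*` / `Q_TAXON_*` identifiers are placeholders invented for illustration, and the sketch assumes that every direct-statement value is a taxon (the queries enforce this by requiring an NCBI taxonomy ID).

```python
HAS_CAUSE = "P828"     # has-cause
HAS_EFFECT = "P1542"   # has-effect
OF = "P642"            # qualifier "of"
INFECTION = "Q166231"  # the "infection" item used in the qualifier syntax

statements = [
    # Direct: Disease has-cause Taxon
    {"subject": "Q_DISEASE_1", "property": HAS_CAUSE,
     "value": "Q_TAXON_1", "qualifiers": {}},
    # Direct: Taxon has-effect Disease
    {"subject": "Q_TAXON_2", "property": HAS_EFFECT,
     "value": "Q_DISEASE_2", "qualifiers": {}},
    # Qualifier: Disease has-cause infection (qual: of Taxon)
    {"subject": "Q_DISEASE_3", "property": HAS_CAUSE,
     "value": INFECTION, "qualifiers": {OF: "Q_TAXON_3"}},
]


def to_causes_edge(stmt):
    """Return (taxon, 'causes', disease) for a statement, or None if it does not fit."""
    if stmt["property"] == HAS_CAUSE and stmt["value"] == INFECTION:
        # Qualifier syntax: the taxon sits in the "of" qualifier
        taxon = stmt["qualifiers"].get(OF)
        return (taxon, "causes", stmt["subject"]) if taxon else None
    if stmt["property"] == HAS_CAUSE:
        # Direct syntax, disease-side statement: flip it so the taxon comes first
        return (stmt["value"], "causes", stmt["subject"])
    if stmt["property"] == HAS_EFFECT:
        # Direct syntax, taxon-side statement: already in taxon -> disease order
        return (stmt["subject"], "causes", stmt["value"])
    return None


toy_edges = [edge for edge in map(to_causes_edge, statements) if edge is not None]
print(toy_edges)
```

All three statements end up with the taxon as the start node and the disease as the end node, which matches the argument order `'taxon', 'disease', 'causes'` passed to `standardize_edges` in the cells below.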

#### Approaches in the Wikidata data model
1. Direct links:      
    Taxon has-effect Disease
    

2. Punning down to a specific taxonomic level:  
    Parent_taxon has-effect Disease  
    Taxon has-parent* Parent_taxon  
    Taxon has-rank Species 
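
As a rough illustration of punning, the sketch below pushes a disease link asserted on a higher-level taxon down to its species-rank descendants. The taxonomy, ranks, and disease name are all made up, and the helper names `ancestors` and `pun_to_species` are hypothetical. It covers only the descendant direction; the queries in this notebook also match the reverse direction through a `UNION` and exclude parent taxa of two specific ranks, which the sketch leaves out.

```python
# Toy taxonomy: child -> parent (the has-parent relation, wdt:P171 in the queries)
parent_of = {
    "species_a": "genus_x",
    "species_b": "genus_x",
    "strain_a1": "species_a",
    "genus_x": "family_y",
}
rank_of = {
    "species_a": "species",
    "species_b": "species",
    "strain_a1": "strain",
    "genus_x": "genus",
    "family_y": "family",
}


def ancestors(taxon):
    """Yield every ancestor of `taxon`: one or more parent steps, like P171+."""
    current = parent_of.get(taxon)
    while current is not None:
        yield current
        current = parent_of.get(current)


def pun_to_species(parent_taxon):
    """Return the species-rank taxa that descend from `parent_taxon`."""
    return sorted(
        taxon for taxon, rank in rank_of.items()
        if rank == "species" and parent_taxon in ancestors(taxon)
    )


# The link is asserted at genus level ...
asserted = [("genus_x", "causes", "disease_1")]
# ... and punned down to each species below that genus
punned_edges = [
    (species, relation, disease)
    for parent, relation, disease in asserted
    for species in pun_to_species(parent)
]
print(punned_edges)
```

Because these edges are inferred and not asserted in Wikidata, the cells below tag them with `dsrc_type` set to `computed` and `comp_type` set to `punning`.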

In [24]:
# Approach 1
## Syntax 1 -- Qualifier statement: disease has-cause infection (qualifier: of taxon)
q = """SELECT DISTINCT ?disease ?taxon ?taxonLabel ?tax_id
    WHERE {{?disease wdt:P31 wd:Q12136}UNION{?disease wdt:P699 ?doid}.
      ?disease p:P828 [ps:P828 wd:Q166231;pq:P642 ?taxon;].
      OPTIONAL{?taxon wdt:P685 ?tax_id}.
      SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }}"""

qr = execute_sparql_query(q)
tax_nodes = process_taxa(qr)
edge_res = standardize_edges(qr, 'taxon', 'disease', 'causes')

nodes.append(tax_nodes)
edges.append(edge_res)

## Syntax 2 -- Direct statements
### a. Disease has-cause TAXON
q = """SELECT DISTINCT ?disease ?diseaseLabel ?doid ?taxon ?taxonLabel ?tax_id
    WHERE {{?disease wdt:P31 wd:Q12136}UNION{?disease wdt:P699 ?doid}.
        ?taxon wdt:P685 ?tax_id. 
       {?disease wdt:P828 ?taxon}UNION{?taxon wdt:P1542 ?disease}.
        OPTIONAL {?disease wdt:P699 ?doid.}
      SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }}"""

qr = execute_sparql_query(q)
tax_nodes = process_taxa(qr)
edge_res = standardize_edges(qr, 'taxon', 'disease', 'causes')

nodes.append(tax_nodes)
edges.append(edge_res)

### b. TAXON has-effect Disease (same query as a.: its UNION already covers both directions)
q = """SELECT DISTINCT ?disease ?diseaseLabel ?doid ?taxon ?taxonLabel ?tax_id
    WHERE {{?disease wdt:P31 wd:Q12136}UNION{?disease wdt:P699 ?doid}.
        ?taxon wdt:P685 ?tax_id.
           {?disease wdt:P828 ?taxon}UNION{?taxon wdt:P1542 ?disease}.
           OPTIONAL {?disease wdt:P699 ?doid.}
           SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }}"""

qr = execute_sparql_query(q)
tax_nodes = process_taxa(qr)
edge_res = standardize_edges(qr, 'taxon', 'disease', 'causes')

nodes.append(tax_nodes)
edges.append(edge_res)

# Approach 2
## Syntax 1
q = """SELECT DISTINCT ?disease ?diseaseLabel ?doid ?parent_tax ?parent_taxLabel ?par_taxid ?taxon ?taxonLabel ?tax_id
    WHERE {{?disease wdt:P31 wd:Q12136}UNION{?disease wdt:P699 ?doid}.
      ?disease p:P828 [ps:P828 wd:Q166231;
                       pq:P642 ?parent_tax;].
      OPTIONAL{?disease wdt:P699 ?doid}.
      OPTIONAL{?parent_tax wdt:P685 ?par_taxid}.
      FILTER NOT EXISTS {?parent_tax wdt:P105 wd:Q36732}.
      FILTER NOT EXISTS {?parent_tax wdt:P105 wd:Q3978005}.
      {?taxon wdt:P171+ ?parent_tax}UNION{?parent_tax wdt:P171+ ?taxon}
      ?taxon wdt:P105 wd:Q7432 .
      ?taxon wdt:P685 ?tax_id    
      SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }}"""

qr = execute_sparql_query(q)
tax_nodes = process_taxa(qr)
edge_res = standardize_edges(qr, 'taxon', 'disease', 'causes', 'computed')
edge_res['comp_type'] = 'punning'

nodes.append(tax_nodes)
edges.append(edge_res)

## Syntax 2
q = """SELECT DISTINCT ?disease ?diseaseLabel ?doid ?parent_tax ?parent_taxLabel ?parent_tax_id ?taxon ?taxonLabel ?tax_id
    WHERE {{?disease wdt:P31 wd:Q12136}UNION{?disease wdt:P699 ?doid}.
        ?parent_tax wdt:P685 ?parent_tax_id. 
      FILTER NOT EXISTS {?parent_tax wdt:P105 wd:Q36732}.
      FILTER NOT EXISTS {?parent_tax wdt:P105 wd:Q3978005}.      
       {?disease wdt:P828 ?parent_tax}UNION{?parent_tax wdt:P1542 ?disease}.
        OPTIONAL {?disease wdt:P699 ?doid.}
      {?taxon wdt:P171+ ?parent_tax}UNION{?parent_tax wdt:P171+ ?taxon}
      ?taxon wdt:P685 ?tax_id .
      ?taxon wdt:P105 wd:Q7432 .
      SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }}"""

qr = execute_sparql_query(q)
tax_nodes = process_taxa(qr)
edge_res = standardize_edges(qr, 'taxon', 'disease', 'causes', 'computed')
edge_res['comp_type'] = 'punning'

nodes.append(tax_nodes)
edges.append(edge_res)


| | start_id | end_id | type | dsrc_type | comp_type |
|---|---|---|---|---|---|
| 0 | Q10419976 | Q356713 | causes | computed | punning |
| 1 | Q10419992 | Q356713 | causes | computed | punning |
| 2 | Q10420018 | Q356713 | causes | computed | punning |
| 3 | Q10419948 | Q356713 | causes | computed | punning |
| 4 | Q10420014 | Q356713 | causes | computed | punning |


In [26]:
# Remove duplicates
tax_nodes = pd.concat(nodes, sort=False, ignore_index=True).drop_duplicates(subset=['id'])
nodes = [tax_nodes]

In [27]:
len(tax_nodes)

43609

In [25]:
tax_nodes.head()

| | id | name | label | xrefs |
|---|---|---|---|---|
| 0 | Q100152243 | Fomitopsis mounceae | Taxon | NCBITaxon:2126941 |
| 1 | Q1002578 | Babesia divergens | Taxon | NCBITaxon:32595 |
| 2 | Q100348353 | Ruhugu virus | Taxon | NCBITaxon:2652755 |
| 3 | Q100378193 | Cantharellus lutescens | Taxon | NCBITaxon:104198 |
| 4 | Q100604983 | Stamnaria yugrana | Taxon | NCBITaxon:2059531 |


## Query
Biomedically relevant edge types in Wikidata (ordered alphabetically) <br>
To confirm that an edge type category has been added, move it to its own cell and inspect it separately with the `print` function.

## Taxon Causes Disease (2 syntaxes, 2 approaches) *why this part?

There are two syntaxes for "taxon causes disease" in the Wikidata data model:

1. Direct statements:  
    Taxon has-effect Disease... or Disease has-cause Taxon 


2. Qualifier Statements:  
    Disease has-cause infection (qual: of Taxon) 
    
There are also two approaches:

1. Direct links:      
    Taxon has-effect Disease


2. Punning down to a specific taxonomic level:  
    Parent_taxon has-effect Disease  
    Taxon has-parent* Parent_taxon  
    Taxon has-rank Species 

### Syntax 1: disease has-cause infection - qualifier: of TAXON

In [None]:
q = """    SELECT DISTINCT ?disease ?taxon ?taxonLabel ?tax_id
    WHERE 
    {
     {?disease wdt:P31 wd:Q12136}UNION{?disease wdt:P699 ?doid}.

     ?disease p:P828 [ps:P828 wd:Q166231;
                       pq:P642 ?taxon;].
      OPTIONAL{?taxon wdt:P685 ?tax_id}.
          
      SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
    }"""

In [None]:
qr = execute_sparql_query(q)
tax_nodes = process_taxa(qr)
edge_res = standardize_edges(qr, 'taxon', 'disease', 'causes')

nodes.append(tax_nodes)
edges.append(edge_res)

edge_res.head()
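
The internals of `execute_sparql_query` are not shown in this notebook. As a hedged sketch of what such a helper might do, the code below builds a request URL for the public Wikidata Query Service endpoint and flattens a response in the standard SPARQL JSON results format into plain rows. The function names `build_query_url` and `bindings_to_rows` are hypothetical, the response is a canned example and not real query output, and `Q_PLACEHOLDER` stands in for a disease item. No network request is made.

```python
import json
from urllib.parse import urlencode

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"
ENTITY_PREFIX = "http://www.wikidata.org/entity/"


def build_query_url(query):
    """Return a GET URL asking the endpoint for JSON results."""
    return WDQS_ENDPOINT + "?" + urlencode({"query": query, "format": "json"})


def bindings_to_rows(payload):
    """Flatten SPARQL JSON bindings into dicts, shortening entity URIs to QIDs."""
    rows = []
    for binding in payload["results"]["bindings"]:
        row = {}
        for var, cell in binding.items():
            value = cell["value"]
            if cell.get("type") == "uri" and value.startswith(ENTITY_PREFIX):
                value = value[len(ENTITY_PREFIX):]  # keep only the Q-identifier
            row[var] = value
        rows.append(row)
    return rows


# Canned response with one binding; the disease URI is a placeholder
canned = json.loads("""
{
  "head": {"vars": ["disease", "taxon", "taxonLabel", "tax_id"]},
  "results": {"bindings": [
    {
      "disease": {"type": "uri", "value": "http://www.wikidata.org/entity/Q_PLACEHOLDER"},
      "taxon": {"type": "uri", "value": "http://www.wikidata.org/entity/Q1002578"},
      "taxonLabel": {"type": "literal", "xml:lang": "en", "value": "Babesia divergens"},
      "tax_id": {"type": "literal", "value": "32595"}
    }
  ]}
}
""")

sparql_rows = bindings_to_rows(canned)
print(sparql_rows)
print(build_query_url("SELECT 1"))
```

Variables left unbound by an `OPTIONAL` clause, such as `?tax_id` in the query above, are simply absent from a binding, so downstream code should not assume every key is present in every row.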