<a href="https://colab.research.google.com/github/wikipathways/pathway-figure-ocr/blob/master/notebooks/bte_with_pfocr_cooccurrence.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [49]:
!pip install numpy pandas requests requests_cache SetSimilaritySearch

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [100]:
from copy import copy, deepcopy
import json

# BTE with PFOCR Cooccurrence

I modified a query from the 3/17 Question of the Month to find relationships like this:

`1 of 3 selected genes` -> `any gene` -> `Valproic Acid`

In [51]:
import requests
import requests_cache


requests_cache.install_cache("pfocr_cache")

## Get BTE Results

In [52]:
query = {
    "message": {
        "query_graph": {
            "edges": {
                "e01": {
                    "constraints": [],
                    "object": "n0",
                    "predicates": ["biolink:related_to"],
                    "subject": "n1",
                },
                "e02": {
                    "constraints": [],
                    "object": "n1",
                    "predicates": ["biolink:related_to"],
                    "subject": "n2",
                },
            },
            "nodes": {
                "n0": {
                    "categories": ["biolink:Gene"],
                    "constraints": [],
                    "ids": ["NCBIGene:3855", "NCBIGene:211", "NCBIGene:26995"],
                    "is_set": False,
                },
                "n1": {
                    "categories": ["biolink:Gene"],
                    "constraints": [],
                    "is_set": False,
                },
                "n2": {
                    "categories": ["biolink:SmallMolecule"],
                    "constraints": [],
                    "ids": ["PUBCHEM.COMPOUND:3121"],
                    "is_set": False,
                },
            },
        }
    }
}

bte_r = requests.post("https://api.bte.ncats.io/v1/query", json=query)
print(bte_r.status_code)
if bte_r.status_code != 200:
    print(bte_r.text)

200


In [53]:
gene_q_node_ids = []
for q_node_id, v in query["message"]["query_graph"]["nodes"].items():
    if "biolink:Gene" in v["categories"]:
        gene_q_node_ids.append(q_node_id)
print(gene_q_node_ids)

['n0', 'n1']


In [54]:
bte_message = bte_r.json()["message"]
bte_results = bte_message["results"]
genes_to_bte_results = dict()
for bte_result in bte_results:
    bte_result_genes = []
    for gene_q_node_id in gene_q_node_ids:
        for entry in bte_result["node_bindings"][gene_q_node_id]:
            curie = entry["id"]
            # Keep only NCBIGene CURIEs, the identifier type used for the PFOCR gene sets.
            if curie.startswith("NCBIGene:"):
                bte_result_genes.append(curie)
    genes_key = tuple(sorted(bte_result_genes))
    if genes_key not in genes_to_bte_results:
        genes_to_bte_results[genes_key] = []
    genes_to_bte_results[genes_key].append(bte_result)

## Get PFOCR Data

Download the entire JSON file we gave to BTE.

In [55]:
pfocr_url = "https://www.dropbox.com/s/1f14t5zaseocyg6/bte_chemicals_diseases_genes.ndjson?dl=1"
pfocr_request = requests.get(pfocr_url)
print(f"status_code: {pfocr_request.status_code}")
if pfocr_request.status_code != 200:
    print(pfocr_request.text)

genes_to_figids = {}
figid_to_genes = {}
figid_to_pfocr_result = {}
for line in pfocr_request.text.splitlines():
    pfocr_result = json.loads(line)
    figid = pfocr_result["_id"]
    genes = {
        "NCBIGene:" + g
        for g in pfocr_result["associatedWith"]["mentions"]["genes"]["ncbigene"]
    }
    figid_to_pfocr_result[figid] = pfocr_result
    figid_to_genes[figid] = genes

    genes_key = tuple(sorted(genes))
    if genes_key not in genes_to_figids:
        genes_to_figids[genes_key] = []
    genes_to_figids[genes_key].append(figid)

status_code: 200


## Connect BTE Results & PFOCR

### Compare Algo Performance

The following three algorithms all match the BTE gene sets to the PFOCR gene sets and return the same overlaps, but their run times differ:
- Brute Force: 47s
- Check All BTE Results Genes Set: 5s
- SetSimilaritySearch: 2s
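
Timings like the ones above can be collected with a small wall-clock helper. The sketch below is illustrative only: `timed` is a hypothetical helper name, and the gene keys are toy stand-ins for `genes_to_bte_results` and `genes_to_figids`, not the real data.

```python
import time


def timed(label, func):
    """Run func once, print the elapsed wall-clock time, and return both."""
    start = time.perf_counter()
    result = func()
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.3f}s")
    return result, elapsed


# Toy stand-ins for the BTE result gene keys and the PFOCR figure gene keys.
toy_bte_keys = [("NCBIGene:1", "NCBIGene:2"), ("NCBIGene:2", "NCBIGene:3")]
toy_figure_keys = [("NCBIGene:1", "NCBIGene:2", "NCBIGene:9"), ("NCBIGene:3",)]


def brute_force():
    # Compare every BTE gene key with every figure gene key.
    return {
        (b, f)
        for b in toy_bte_keys
        for f in toy_figure_keys
        if len(set(b) & set(f)) >= 2
    }


overlaps, seconds = timed("Brute Force (toy data)", brute_force)
```

A single run is noisy, so for a fairer comparison each algorithm could be run several times and the best time kept.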

#### Brute Force

In [56]:
all_figure_genes_keys = genes_to_figids.keys()
bf_overlaps_2_plus = set()
for bte_result_genes_key in genes_to_bte_results.keys():
    bte_result_genes = set(bte_result_genes_key)
    for figure_genes_keys in all_figure_genes_keys:
        figure_genes = set(figure_genes_keys)
        if len(bte_result_genes.intersection(figure_genes)) >= 2:
            bf_overlaps_2_plus.add((bte_result_genes_key, figure_genes_keys))
print(len(bf_overlaps_2_plus))

164


#### Check All BTE Results Genes Set
First check whether each figure's gene set shares at least two genes with the union of all BTE result genes; only figures that pass this test are compared against the individual BTE results. Note that this approach may perform worse as the number of BTE results grows.

In [57]:
all_bte_result_genes_keys = genes_to_bte_results.keys()
all_bte_result_genes = set([]).union(
    *[set(bte_result_genes_key) for bte_result_genes_key in all_bte_result_genes_keys]
)
bf_overlaps_2_plus = set()

for figure_genes_keys in genes_to_figids.keys():
    figure_genes = set(figure_genes_keys)
    if len(figure_genes.intersection(all_bte_result_genes)) >= 2:
        for bte_result_genes_key in all_bte_result_genes_keys:
            bte_result_genes = set(bte_result_genes_key)
            if len(bte_result_genes.intersection(figure_genes)) >= 2:
                bf_overlaps_2_plus.add((bte_result_genes_key, figure_genes_keys))

print(len(bf_overlaps_2_plus))

164
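
The prefilter cannot drop a true match: if a figure shares two genes with one BTE result, it also shares them with the union of all BTE result genes. The reverse does not hold, which is why the inner per-result check stays. A minimal sketch on toy data (the gene labels `G1` to `G8` and both function names are made up for illustration):

```python
bte_keys = [("G1", "G2"), ("G3", "G4")]
figure_keys = [("G1", "G2", "G5"), ("G1", "G3"), ("G6", "G7"), ("G4", "G8")]


def pairwise(bte_keys, figure_keys):
    """Brute force: compare every BTE key with every figure key."""
    return {
        (b, f)
        for b in bte_keys
        for f in figure_keys
        if len(set(b) & set(f)) >= 2
    }


def prefiltered(bte_keys, figure_keys):
    """Skip figures that cannot match, then run the same per-result check."""
    all_bte_genes = set().union(*(set(b) for b in bte_keys))
    hits = set()
    for f in figure_keys:
        figure_genes = set(f)
        if len(figure_genes & all_bte_genes) < 2:
            continue  # fewer than two shared genes with the union: no match possible
        for b in bte_keys:
            if len(set(b) & figure_genes) >= 2:
                hits.add((b, f))
    return hits


# ("G1", "G3") passes the prefilter (both genes are in the union) but matches
# no single BTE result, so the inner check is still required.
print(pairwise(bte_keys, figure_keys) == prefiltered(bte_keys, figure_keys))
```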


#### SetSimilaritySearch

SetSimilaritySearch is sometimes too permissive, so the explicit overlap check is still needed. Even with that check, it is the fastest of the three approaches and returns the same results.

In [58]:
from SetSimilaritySearch import SearchIndex

pfocr_gene_sets = list(genes_to_figids.keys())
index = SearchIndex(pfocr_gene_sets, similarity_func_name="containment", 
    similarity_threshold=0.8)

sss_overlaps_2_plus = set()
for bte_result_genes_key in genes_to_bte_results.keys():
    bte_result_genes = set(bte_result_genes_key)
    results = index.query(bte_result_genes)
    for result in results:
        figure_genes = pfocr_gene_sets[result[0]]
        if len(bte_result_genes.intersection(figure_genes)) >= 2:
            sss_overlaps_2_plus.add((bte_result_genes_key, tuple(sorted(figure_genes))))
print(len(sss_overlaps_2_plus))

164
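
One way to see why the explicit overlap check is still needed: assuming the library's `containment` measure is the fraction of the query set found in an indexed set (an assumption worth confirming against the SetSimilaritySearch documentation), a BTE result key that holds a single NCBIGene ID reaches a score of 1.0 with only one shared gene. The sketch below re-implements that measure in plain Python on toy IDs; it does not call the library.

```python
def containment(query, candidate):
    """Fraction of the query set that also appears in the candidate set."""
    query, candidate = set(query), set(candidate)
    if not query:
        return 0.0
    return len(query & candidate) / len(query)


# Toy gene IDs, not taken from the real results.
figure_genes = {"NCBIGene:1", "NCBIGene:2", "NCBIGene:3"}
two_gene_result = {"NCBIGene:1", "NCBIGene:2"}
one_gene_result = {"NCBIGene:1"}  # e.g. only one binding carried an NCBIGene ID

kept = []
for query in (two_gene_result, one_gene_result):
    score = containment(query, figure_genes)
    overlap = len(query & figure_genes)
    passes_threshold = score >= 0.8
    # Both queries pass the 0.8 threshold, but only the first shares two genes.
    if passes_threshold and overlap >= 2:
        kept.append(sorted(query))
    print(sorted(query), score, overlap)
```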


### Choose and Apply SetSimilaritySearch

SetSimilaritySearch was fastest, so let's choose it and use it to augment the BTE results.

In [59]:
from SetSimilaritySearch import SearchIndex

pfocr_gene_sets = list(genes_to_figids.keys())
index = SearchIndex(pfocr_gene_sets, similarity_func_name="containment", 
    similarity_threshold=0.8)

overlaps_2_plus_count = 0
for bte_result_genes_key, genes_bte_results in genes_to_bte_results.items():
    bte_result_genes = set(bte_result_genes_key)
    results = index.query(bte_result_genes)
    #if results:
    #    print("------")
    #    print(f'bte_result_genes: {bte_result_genes}')
    for result in results:
        figure_genes = pfocr_gene_sets[result[0]]
        score = result[1]
        figids = genes_to_figids[tuple(sorted(figure_genes))]
        #print(f'{figids} ({score}) - {figure_genes}')
        common = bte_result_genes.intersection(figure_genes)
        #print(f'intersection: {common}')
        if len(common) >= 2:
            for figid in figids:
                overlaps_2_plus_count += 1
                pfocr_result = figid_to_pfocr_result[figid]
                #print("--------")
                #print(f'{figid} - {figure_genes}')
                for bte_result in genes_bte_results:
                    #print(f'bte_result: {sorted(bte_result_genes)}')
                    #print(f'common: {bte_result_genes.intersection(figure_genes)}')
                    nodes = set()
                    for q_node_id, values in bte_result["node_bindings"].items():
                        for value in values:
                            if value["id"] in figure_genes:
                                nodes.add(q_node_id)
                    if "pfocr" not in bte_result:
                        bte_result["pfocr"] = []
                    pfocr_entry = copy(pfocr_result)
                    pfocr_entry["nodes"] = sorted(nodes)
                    pfocr_entry["score"] = score
                    bte_result["pfocr"].append(pfocr_entry)
print(overlaps_2_plus_count)

172
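
The cell above attaches a shallow `copy` of each PFOCR record to the BTE result. That is enough here because only the top-level keys `nodes` and `score` are added; nested values such as `associatedWith` remain shared with the original record. A small sketch of the difference, using a made-up record:

```python
from copy import copy, deepcopy

# Made-up record with the same nesting as a PFOCR result.
original = {"_id": "toy_figure", "associatedWith": {"title": "Toy title"}}

shallow = copy(original)
shallow["score"] = 1.0  # new top-level key: exists only on the copy
print("score" in original)  # False

# Nested dicts are shared between a shallow copy and the original.
shallow["associatedWith"]["title"] = "edited via shallow copy"
print(original["associatedWith"]["title"])  # edited via shallow copy

# deepcopy duplicates nested values too, so edits stay local to the copy.
deep = deepcopy(original)
deep["associatedWith"]["title"] = "edited via deep copy"
print(original["associatedWith"]["title"])  # still: edited via shallow copy
```

If later code ever modifies nested fields per result, `deepcopy` would be the safer choice, at the cost of extra time and memory.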


In [118]:
import pandas as pd


kg_nodes = bte_message["knowledge_graph"]["nodes"]

bte_rows = []
for bte_result in bte_results:
    bte_row_template = {}
    for q_node_id, value in bte_result["node_bindings"].items():
        node_labels = []
        for v in value:
            id = v["id"]
            name = kg_nodes[id]["name"]
            #node_labels.append(f'{name} ({id})')
            node_labels.append(name)
        bte_row_template[q_node_id] = ",".join(node_labels)
        # just taking the first one for now
        bte_row_template[q_node_id + "_identifier"] = value[0]["id"]

    bte_row_template["score"] = bte_result["score"]
    
    if "pfocr" in bte_result:
        for pfocr_result in bte_result["pfocr"]:
            bte_row = deepcopy(bte_row_template)
            bte_row["figure_title"] = pfocr_result["associatedWith"]["title"]
            bte_row["figid"] = pfocr_result["_id"]
            bte_row["figure_url"] = pfocr_result["associatedWith"]["figureUrl"]
            bte_row["pfocr_score"] = pfocr_result["score"]
            bte_rows.append(bte_row)
    else:
        bte_rows.append(bte_row_template)
    
bte_df = pd.DataFrame(bte_rows)
bte_df

,n2,n2_identifier,n1,n1_identifier,n0,n0_identifier,score,figure_title,figid,figure_url,pfocr_score
0,Valproic acid,PUBCHEM.COMPOUND:3121,HMOX1,NCBIGene:3162,ALAS1,NCBIGene:211,0,Activated Oncogenic Pathway Modifies Iron Netw...,PMC5293201__pcbi.1005352.g001.jpg,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5...,1.0
1,Valproic acid,PUBCHEM.COMPOUND:3121,HMOX1,NCBIGene:3162,ALAS1,NCBIGene:211,0,The heme biosynthetic pathway and aspects of i...,PMC4279155__metabolites-04-00977-g001.jpg,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4...,1.0
2,Valproic acid,PUBCHEM.COMPOUND:3121,HMOX1,NCBIGene:3162,ALAS1,NCBIGene:211,0,Pathogenesis and Clinical Features of the Acut...,PMC6754303__nihms-1526961-f0002.jpg,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6...,1.0
3,Valproic acid,PUBCHEM.COMPOUND:3121,ATP5MG,NCBIGene:10632,TRUB2,NCBIGene:26995,0,,,,
4,Valproic acid,PUBCHEM.COMPOUND:3121,ATF2,NCBIGene:1386,KRT7,NCBIGene:3855,0,Transcriptional landscape of mouse-aged ovarie...,PMC6281605__41420_2018_121_Fig2_HTML.jpg,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6...,1.0
...,...,...,...,...,...,...,...,...,...,...,...
454,Valproic acid,PUBCHEM.COMPOUND:3121,SLC7A11,NCBIGene:23657,ALAS1,NCBIGene:211,0,,,,
455,Valproic acid,PUBCHEM.COMPOUND:3121,HPRT1,NCBIGene:3251,ALAS1,NCBIGene:211,0,Metabolic pathway differences between LLO118 a...,PMC7500861__nihms-1628382-f0005.jpg,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7...,1.0
456,Valproic acid,PUBCHEM.COMPOUND:3121,CFTR,NCBIGene:1080,KRT7,NCBIGene:3855,0,Network of pancreatic cancer-target genes,PMC6826333__or-42-06-2561-g01.jpg,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6...,1.0
457,Valproic acid,PUBCHEM.COMPOUND:3121,LAMP2,NCBIGene:3920,TRUB2,NCBIGene:26995,0,,,,


The following table is also [available as a Google Sheet](https://docs.google.com/spreadsheets/d/1yHBTxs6SghJGk-wQP7JbmpZ5Wbq6OY6Yz7PEtwd4j_w/edit?usp=sharing).

In [120]:
bte_df.rename(columns={"figure_url": "figure_count"}).groupby(["n0", "n1", "n2"])[["figure_count"]].count().sort_values("figure_count", ascending=False)

n0,n1,n2,figure_count
ALAS1,ALAD,Valproic acid,34
ALAS1,UROS,Valproic acid,28
ALAS1,CPOX,Valproic acid,17
KRT7,KRT19,Valproic acid,3
KRT7,KRT8,Valproic acid,3
...,...,...,...
ALAS1,TFPI2,Valproic acid,0
ALAS1,TBP,Valproic acid,0
ALAS1,TBL1XR1,Valproic acid,0
ALAS1,SUZ12,Valproic acid,0


## Display some figures

In [140]:
from IPython.display import Image, display


selected_rows = bte_df[
    (bte_df["n0_identifier"] == "NCBIGene:211")
    & (bte_df["n1_identifier"] == "NCBIGene:210")
][["figure_url", "figure_title"]]
for _, row in selected_rows.iterrows():
    display(Image(url=row["figure_url"]))
    print(row["figure_title"])
    # Three blank lines between figures.
    print("\n\n")


Evolution of the Tetrapyrrole Biosynthetic Pathway in Secondary Algae: Conservation, Redundancy and Replacement





Evolution of the Tetrapyrrole Biosynthetic Pathway in Secondary Algae: Conservation, Redundancy and Replacement





Heme biosynthesis





Increase of microRNA-210, Decrease of Raptor Gene Expression and Alteration of Mammalian Target of Rapamycin Regulated Proteins following Mithramycin Treatment of Human Erythroid Cells





Heme biosynthesis pathway (left) and the principle of HAL mediated PpIX production enhancement with dimethylsulphoxide (DMSO) or deferoxamine mesylate salt (DFO) (right)





Schematic of porphyrin-synthetic pathway to illustrate potential control points for increased PpIX accumulation





Heme biosynthesis





Haem synthetic pathway





Heme biosynthesis pathway (in red) connects with glucose (in green) and glutamine (in blue) metabolic pathways





Tetrapyrrole biosynthetic pathways in prokaryotes





Metabolic pathway differences between LLO118 and LLO56 and identification of mGPD2 as a candidate gene





Enzymes and intermediate products of the heme synthesis pathway





Heme Biosynthetic Pathway





Haem synthesis pathway





Heme synthesis pathway





Iron metabolic pathways in the processes of sponge plasticity





A: The heme biosynthetic pathway





Evolution of haem biosynthetic and degradative pathways





Sequence Evidence for the Presence of Two Tetrapyrrole Pathways in Euglena gracilis





Four identified pathways responsible for P





The metabolic pathway of FECH and heme





Heme biosynthetic pathway





Heme biosynthetic pathway and experimental porphyria models





Heme biosynthetic pathway, porphyrias and nutrients





Scheme of the heme biosynthesis pathway and membrane transporters





Leishmania heme auxotrophy





Chemical heme biosynthesis pathway





Heme biosynthesis pathway





The Heme Biosynthetic Pathway of the Obligate Wolbachia Endosymbiont of Brugia malayi as a Potential Anti-filarial Drug Target





Iron Necessity: The Secret of Wolbachia's Success?





Heme-dependent feedback inhibition of ALAS in the heme biosynthetic pathway





The metabolic roles of the endosymbiotic organelles of Toxoplasma and Plasmodium spp.





Metabolic pathway of aerial and submerged mycelia in the liquid surface culture of Cordyceps militaris





Heme metabolism



