<a href="https://colab.research.google.com/github/wikipathways/pathway-figure-ocr/blob/master/notebooks/bte_with_pfocr_cooccurrence.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
from copy import copy, deepcopy
import json

# BTE with PFOCR Cooccurrence

I modified a query from the 3/17 Question of the Month to find relationships like this:

`1 of 3 selected genes` -> `any gene` -> `Valproic Acid`

In [2]:
import requests
import requests_cache


requests_cache.install_cache("pfocr_cache")

## Get BTE Results

In [3]:
query = {
    "message": {
        "query_graph": {
            "edges": {
                "e01": {
                    "object": "n0",
                    "predicates": ["biolink:related_to"],
                    "subject": "n1"
                },
                "e02": {
                    "object": "n1",
                    "predicates": ["biolink:related_to"],
                    "subject": "n2"
                }
            },
            "nodes": {
                "n0": {
                    "categories": ["biolink:Gene"],
                    "ids": ["NCBIGene:3855", "NCBIGene:211", "NCBIGene:26995"]
                },
                "n1": {
                    "categories": ["biolink:Gene"]
                },
                "n2": {
                    "categories": ["biolink:SmallMolecule"],
                    "ids": ["PUBCHEM.COMPOUND:3121"]
                }
            }
        }
    }
}

# bte_r = requests.post("https://api.bte.ncats.io/v1/query", json=query)
bte_r = requests.post("http://localhost:3000/v1/query", json=query)
# To clear the cache for this request
# requests_cache.get_cache().delete(bte_r.cache_key)

print(bte_r.status_code)
if bte_r.status_code != 200:
    print(bte_r.text)

200


In [4]:
gene_q_node_ids = []
for q_node_id, v in query["message"]["query_graph"]["nodes"].items():
    if "categories" in v and "biolink:Gene" in v["categories"]:
        gene_q_node_ids.append(q_node_id)
print(gene_q_node_ids)

['n0', 'n1']


In [5]:
bte_message = bte_r.json()["message"]
bte_results = bte_message["results"]
genes_to_bte_results = dict()
for bte_result in bte_results:
    bte_result_genes = []
    for gene_q_node_id in gene_q_node_ids:
        for entry in bte_result["node_bindings"][gene_q_node_id]:
            curie = entry["id"]
            # keep only NCBI Gene CURIEs
            if curie.startswith("NCBIGene:"):
                bte_result_genes.append(curie)
    # group results by their sorted tuple of gene CURIEs
    genes_key = tuple(sorted(bte_result_genes))
    if genes_key not in genes_to_bte_results:
        genes_to_bte_results[genes_key] = []
    genes_to_bte_results[genes_key].append(bte_result)
print(f'BTE TRAPI result count: {len(bte_results)}')

BTE TRAPI result count: 367
BTE result gene count: 346


## Get PFOCR Data

Download the entire NDJSON file (one JSON record per figure) that we gave to BTE.

In [6]:
pfocr_url = "https://www.dropbox.com/s/1f14t5zaseocyg6/bte_chemicals_diseases_genes.ndjson?dl=1"
pfocr_request = requests.get(pfocr_url)
print(f"status_code: {pfocr_request.status_code}")
if pfocr_request.status_code != 200:
    print(pfocr_request.text)

genes_to_figids = {}
figid_to_genes = {}
figid_to_pfocr_result = {}
pfocr_result_count = 0
for line in pfocr_request.text.splitlines():
    pfocr_result_count += 1
    pfocr_result = json.loads(line)
    figid = pfocr_result["_id"]
    genes = set(
            ["NCBIGene:" + g for g in pfocr_result["associatedWith"]["mentions"]["genes"]["ncbigene"]]
            )
    figid_to_pfocr_result[figid] = pfocr_result
    figid_to_genes[figid] = genes

    genes_key = tuple(sorted(genes))
    if not genes_key in genes_to_figids:
        genes_to_figids[genes_key] = []
    genes_to_figids[genes_key].append(figid)
print(f'pfocr_result_count: {pfocr_result_count}')

status_code: 200
pfocr_result_count: 77719


#### How many CURIEs are in both?

In [7]:
all_bte_result_genes_keys = genes_to_bte_results.keys()
all_bte_result_genes = set([]).union(
    *[set(bte_result_genes_key) for bte_result_genes_key in all_bte_result_genes_keys]
)
print(f'BTE TRAPI result unique gene count: {len(all_bte_result_genes)}')

all_common_genes = set()
for figure_genes_keys in genes_to_figids.keys():
    figure_genes = set(figure_genes_keys)
    all_common_genes.update(
        figure_genes.intersection(all_bte_result_genes)
    )
print(len(all_common_genes))
print(f'Genes common to BTE TRAPI results and PFOCR figures count: {len(all_common_genes)}')

346
333


## Connect BTE Results & PFOCR

### Compare Algo Performance

The following three approaches all match BTE result gene sets against PFOCR figure gene sets and find the same 196 overlapping pairs, but their run times differ:
- Brute Force: 47s
- Check All BTE Results Genes Set: 5s
- SetSimilaritySearch: 2s
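
Timings like these can be collected by wrapping each approach in a small timer. The following is only a sketch: the `timed` helper, the `brute_force_pairs` function, and the toy gene sets are illustrative assumptions, not the notebook's data or the way the numbers above were measured.

```python
import time


def timed(fn, *args, **kwargs):
    """Run fn once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def brute_force_pairs(query_sets, candidate_sets, min_overlap=2):
    """Compare every query set with every candidate set."""
    return {
        (q, c)
        for q in query_sets
        for c in candidate_sets
        if len(set(q) & set(c)) >= min_overlap
    }


# Toy stand-ins for the BTE result gene keys and the figure gene keys.
toy_queries = [("A", "B"), ("C", "D")]
toy_candidates = [("A", "B", "X"), ("C", "Y"), ("C", "D", "A")]

pairs, seconds = timed(brute_force_pairs, toy_queries, toy_candidates)
print(f"{len(pairs)} overlapping pairs in {seconds:.6f}s")
```

A single run is noisy, so for a fairer comparison it may be worth repeating each measurement a few times and disabling the request cache's influence on anything being timed.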

#### TRAPI Results x Figs (Brute Force)

In [8]:
all_figure_genes_keys = genes_to_figids.keys()
bf_overlaps_2_plus = set()
for bte_result_genes_key in genes_to_bte_results.keys():
    bte_result_genes = set(bte_result_genes_key)
    for figure_genes_keys in all_figure_genes_keys:
        figure_genes = set(figure_genes_keys)
        if len(bte_result_genes.intersection(figure_genes)) >= 2:
            bf_overlaps_2_plus.add((bte_result_genes_key, figure_genes_keys))
print(len(bf_overlaps_2_plus))

196


#### TRAPI Results x Figs with check against set of all genes in any BTE result

Before checking every BTE result against a figure, first check whether the figure shares at least two genes with the union of genes across all BTE results. If it does not, no individual BTE result can match it, so we can skip the entire figure.

In [9]:
all_bte_result_genes_keys = genes_to_bte_results.keys()
all_bte_result_genes = set([]).union(
    *[set(bte_result_genes_key) for bte_result_genes_key in all_bte_result_genes_keys]
)
bf_overlaps_2_plus = set()

matched_figures = set()
matched_bte_results = set()

for figure_genes_keys in genes_to_figids.keys():
    figure_genes = set(figure_genes_keys)
    figure_ids = genes_to_figids[figure_genes_keys]
    if len(figure_genes.intersection(all_bte_result_genes)) >= 2:
        for bte_result_genes_key in all_bte_result_genes_keys:
            bte_result_genes = set(bte_result_genes_key)
            if len(bte_result_genes.intersection(figure_genes)) >= 2:
                matched_figures.update(set(figure_ids))
                matched_bte_results.add(bte_result_genes_key)
                bf_overlaps_2_plus.add((bte_result_genes_key, figure_genes_keys))

print(len(bf_overlaps_2_plus))
print(f'matched_figures count: {len(matched_figures)}')
print(f'matched BTE results count: {len(matched_bte_results)}')

196
matched_figures count: 63
matched_bte_results count: 71
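
The effect of the pre-filter can be seen on toy data. This is a minimal sketch under stated assumptions: `overlapping_pairs` and the letter-named "genes" are hypothetical, and the inner-check counter exists only to show how much work the filter avoids.

```python
def overlapping_pairs(query_sets, candidate_sets, min_overlap=2, prefilter=True):
    """Return (pairs, inner_checks) for query/candidate sets sharing >= min_overlap items."""
    # Union of every item that appears in any query set.
    all_query_items = set().union(*map(set, query_sets))
    pairs = set()
    inner_checks = 0
    for cand in candidate_sets:
        cand_set = set(cand)
        # A candidate with too few items from the union cannot match any query.
        if prefilter and len(cand_set & all_query_items) < min_overlap:
            continue
        for q in query_sets:
            inner_checks += 1
            if len(set(q) & cand_set) >= min_overlap:
                pairs.add((q, cand))
    return pairs, inner_checks


toy_queries = [("A", "B"), ("C", "D")]
toy_candidates = [("A", "B", "X"), ("C", "Y"), ("X", "Y", "Z"), ("C", "D", "A")]

filtered, filtered_checks = overlapping_pairs(toy_queries, toy_candidates)
unfiltered, unfiltered_checks = overlapping_pairs(
    toy_queries, toy_candidates, prefilter=False
)
print(filtered == unfiltered, filtered_checks, unfiltered_checks)
```

Both calls return the same pairs; the filtered run simply skips the candidates that could never match. The saving grows with the share of figures that have little overlap with the BTE genes.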


#### SetSimilaritySearch

The index search is sometimes too permissive, so we still need the explicit overlap check. Even so, it is the fastest approach and yields the same results once that check is applied.

In [10]:
from SetSimilaritySearch import SearchIndex

pfocr_gene_sets = list(genes_to_figids.keys())
index = SearchIndex(pfocr_gene_sets, similarity_func_name="containment", 
    similarity_threshold=0.8)

sss_overlaps_2_plus = set()
for bte_result_genes_key in genes_to_bte_results.keys():
    bte_result_genes = set(bte_result_genes_key)
    results = index.query(bte_result_genes)
    for result in results:
        figure_genes = pfocr_gene_sets[result[0]]
        if len(bte_result_genes.intersection(figure_genes)) >= 2:
            sss_overlaps_2_plus.add((bte_result_genes_key, tuple(sorted(figure_genes))))
print(len(sss_overlaps_2_plus))

196
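
To see why a containment search can return hits that the overlap check then rejects, here is a plain-Python sketch. It assumes containment means intersection size divided by query size, as the SetSimilaritySearch README describes. The `containment` and `classify` helpers and the `GENE:n` identifiers are made up for illustration, and the single-gene case is only one possible source of extra hits, not something this notebook confirms.

```python
def containment(query, candidate):
    """Fraction of the query set that also appears in the candidate set."""
    query, candidate = set(query), set(candidate)
    return len(query & candidate) / len(query) if query else 0.0


def classify(query, candidate, threshold=0.8, min_overlap=2):
    """Return (index_hit, kept_after_overlap_check) for one query/candidate pair."""
    index_hit = containment(query, candidate) >= threshold
    kept = index_hit and len(set(query) & set(candidate)) >= min_overlap
    return index_hit, kept


# Hypothetical figure gene set and three hypothetical query gene sets.
figure_genes = {"GENE:1", "GENE:2", "GENE:3"}
print(classify({"GENE:1", "GENE:3"}, figure_genes))  # both genes in the figure
print(classify({"GENE:1"}, figure_genes))            # one-gene query
print(classify({"GENE:1", "GENE:9"}, figure_genes))  # only half the query matches
```

The one-gene query clears the 0.8 containment threshold yet shares only one gene with the figure, so the `>= 2` overlap check is what removes it.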


### Choose and Apply SetSimilaritySearch

SetSimilaritySearch was fastest, so let's choose it and use it to augment the BTE results.

In [11]:
from SetSimilaritySearch import SearchIndex

pfocr_gene_sets = list(genes_to_figids.keys())
index = SearchIndex(pfocr_gene_sets, similarity_func_name="containment", 
    similarity_threshold=0.8)

overlaps_2_plus_count = 0
for bte_result_genes_key, genes_bte_results in genes_to_bte_results.items():
    bte_result_genes = set(bte_result_genes_key)
    results = index.query(bte_result_genes)
    #if results:
    #    print("------")
    #    print(f'bte_result_genes: {bte_result_genes}')
    for result in results:
        figure_genes = pfocr_gene_sets[result[0]]
        score = result[1]
        figids = genes_to_figids[tuple(sorted(figure_genes))]
        #print(f'{figids} ({score}) - {figure_genes}')
        common = bte_result_genes.intersection(figure_genes)
        #print(f'intersection: {common}')
        if len(common) >= 2:
            for figid in figids:
                overlaps_2_plus_count += 1
                pfocr_result = figid_to_pfocr_result[figid]
                #print("--------")
                #print(f'{figid} - {figure_genes}')
                for bte_result in genes_bte_results:
                    #print(f'bte_result: {sorted(bte_result_genes)}')
                    #print(f'common: {bte_result_genes.intersection(figure_genes)}')
                    nodes = set()
                    for q_node_id, values in bte_result["node_bindings"].items():
                        for value in values:
                            id = value["id"]
                            if id in figure_genes:
                                nodes.add(q_node_id)
                    if not "pfocr_notebook" in bte_result:
                        bte_result["pfocr_notebook"] = []
                    pfocr_entry = copy(pfocr_result)
                    pfocr_entry["nodes"] = sorted(nodes)
                    pfocr_entry["score"] = score
                    bte_result["pfocr_notebook"].append(pfocr_entry)
print(overlaps_2_plus_count)

207


In [12]:
import pandas as pd


kg_nodes = bte_message["knowledge_graph"]["nodes"]

bte_rows = []
for bte_result in bte_results:
    bte_row_template = {}
    for q_node_id, value in bte_result["node_bindings"].items():
        node_labels = []
        for v in value:
            id = v["id"]
            name = kg_nodes[id]["name"]
            #node_labels.append(f'{name} ({id})')
            node_labels.append(name)
        bte_row_template[q_node_id] = ",".join(node_labels)
        # just taking the first one for now
        bte_row_template[q_node_id + "_identifier"] = value[0]["id"]

    bte_row_template["score"] = bte_result["score"]
    
    if "pfocr" in bte_result or "pfocr_notebook" in bte_result:
        if "pfocr" in bte_result:
            for pfocr_result in bte_result["pfocr"]:
                bte_row = deepcopy(bte_row_template)
                bte_row["figure_url"] = pfocr_result["figureUrl"]
                bte_row["pmc"] = pfocr_result["pmc"]
                bte_row["nodes"] = pfocr_result["nodes"]
                bte_rows.append(bte_row)
        if "pfocr_notebook" in bte_result:
            for pfocr_result in bte_result["pfocr_notebook"]:
                bte_row = deepcopy(bte_row_template)
                bte_row["figure_title_notebook"] = pfocr_result["associatedWith"]["title"]
                bte_row["figid_notebook"] = pfocr_result["_id"]
                bte_row["figure_url_notebook"] = pfocr_result["associatedWith"]["figureUrl"]
                bte_row["pfocr_score_notebook"] = pfocr_result["score"]
                bte_rows.append(bte_row)
    else:
        bte_rows.append(bte_row_template)
    
bte_df = pd.DataFrame(bte_rows)
bte_df

,n1,n1_identifier,n0,n0_identifier,n2,n2_identifier,score,figure_url,pmc,nodes,figure_title_notebook,figid_notebook,figure_url_notebook,pfocr_score_notebook
0,IVL,NCBIGene:3713,KRT7,NCBIGene:3855,Valproic acid,PUBCHEM.COMPOUND:3121,2.966556,,,,,,,
1,FABP4,NCBIGene:2167,ALAS1,NCBIGene:211,Valproic acid,PUBCHEM.COMPOUND:3121,2.449312,,,,,,,
2,TP63,NCBIGene:8626,KRT7,NCBIGene:3855,Valproic acid,PUBCHEM.COMPOUND:3121,2.224773,,,,,,,
3,MAPK8,NCBIGene:5599,ALAS1,NCBIGene:211,Valproic acid,PUBCHEM.COMPOUND:3121,2.106262,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5...,PMC5354998,"[n1, n0]",,,,
4,MAPK8,NCBIGene:5599,ALAS1,NCBIGene:211,Valproic acid,PUBCHEM.COMPOUND:3121,2.106262,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1...,PMC1906540,"[n1, n0]",,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
705,POLD1,NCBIGene:5424,ALAS1,NCBIGene:211,Valproic acid,PUBCHEM.COMPOUND:3121,0.000000,,,,,,,
706,HERC6,NCBIGene:55008,ALAS1,NCBIGene:211,Valproic acid,PUBCHEM.COMPOUND:3121,0.000000,,,,,,,
707,SDK1,NCBIGene:221935,ALAS1,NCBIGene:211,Valproic acid,PUBCHEM.COMPOUND:3121,0.000000,,,,,,,
708,SHANK2,NCBIGene:22941,ALAS1,NCBIGene:211,Valproic acid,PUBCHEM.COMPOUND:3121,0.000000,,,,,,,


In [13]:
print(len(bte_df[bte_df["figure_url"].notnull()]["figure_url"].drop_duplicates()))
print(len(bte_df[bte_df["figure_url_notebook"].notnull()]["figure_url_notebook"].drop_duplicates()))

63
63


In [14]:
print(
    len(bte_df[bte_df["figure_url"].notnull()][
        ["n0_identifier", "n1_identifier", "n2_identifier"]
    ].drop_duplicates())
)
print(
    len(bte_df[bte_df["figure_url_notebook"].notnull()][
        ["n0_identifier", "n1_identifier", "n2_identifier"]
    ].drop_duplicates())
)

70
70


In [15]:
bte_df.rename(
    columns={"figure_url": "figure_count"}
).groupby(
    ["n0", "n1", "n2"]
)[["figure_count"]].count().sort_values("figure_count", ascending=False)

n1,n2,figure_count
ALAD,Valproic acid,34
HMBS,Valproic acid,33
UROS,Valproic acid,28
CPOX,Valproic acid,17
KRT19,Valproic acid,3
...,...,...
KAT6B,Valproic acid,0
KDM1A,Valproic acid,0
KDM5A,Valproic acid,0
KIF9,Valproic acid,0


## Display some figures

In [19]:
from IPython.display import Image, display


for _, row in bte_df[
    bte_df["figure_url_notebook"].notnull() & bte_df["figure_title_notebook"].notnull()
][["figure_url_notebook", "figure_title_notebook"]].drop_duplicates().iterrows():
    figure_title = row["figure_title_notebook"]
    figure_url = row["figure_url_notebook"]
    display(Image(url=figure_url))
    print(figure_title)
    print("")
    print("")
    print("")


Integrated genomic and molecular characterization of cervical cancer.





Porphyrin synthesis pathway in relation to COVID-19





Innate immune responses





Activated Oncogenic Pathway Modifies Iron Network in Breast Epithelial Cells: A Dynamic Modeling Perspective





The heme biosynthetic pathway and aspects of its regulation in hepatocytes





Pathogenesis and Clinical Features of the Acute Hepatic Porphyrias (AHPs)





Network of pancreatic cancer-target genes





Neural ectoderm and neural crest specification in deprivation of the external signals.A





Sub-network found among the genes in a differential expression experiment that compares FA patients to controls





DNA methylation and Transcriptome Changes Associated with Cisplatin Resistance in Ovarian Cancer.





Genes represented in the 38-gene DTC panel and their biological pathway associations





Integration of gene expression with the proposed mechanistic pathway initiated by comfrey treatment leading to tumorigenesis





Survey of the Impact of Deyolking on Biological Processes Covered by Shotgun Proteomic Analyses of Zebrafish Embryos





Transcriptional landscape of mouse-aged ovaries reveals a unique set of non-coding RNAs associated with physiological and environmental ovarian dysfunctions





Roundabout signaling pathway involved in the pathogenesis of COPD by integrative bioinformatics analysis.





Regulation network of CYP3A4 phenotype expression adapted from Klein et al





Simple Rx for congenital erythropoietic porphyria





Evolution of the Tetrapyrrole Biosynthetic Pathway in Secondary Algae: Conservation, Redundancy and Replacement





Evolution of the Tetrapyrrole Biosynthetic Pathway in Secondary Algae: Conservation, Redundancy and Replacement





Increase of microRNA-210, Decrease of Raptor Gene Expression and Alteration of Mammalian Target of Rapamycin Regulated Proteins following Mithramycin Treatment of Human Erythroid Cells





Transport of heme synthesis intermediates and heme in metazoans





Heme biosynthesis





Haem synthetic pathway





Heme biosynthesis pathway (in red) connects with glucose (in green) and glutamine (in blue) metabolic pathways





Haem synthesis pathway





Heme synthesis pathway





Evolution of haem biosynthetic and degradative pathways





Sequence Evidence for the Presence of Two Tetrapyrrole Pathways in Euglena gracilis





Four identified pathways responsible for P





Heme biosynthetic pathway and experimental porphyria models





Heme biosynthetic pathway, porphyrias and nutrients





Scheme of the heme biosynthesis pathway and membrane transporters





Leishmania heme auxotrophy





Heme biosynthesis pathway (left) and the principle of HAL mediated PpIX production enhancement with dimethylsulphoxide (DMSO) or deferoxamine mesylate salt (DFO) (right)





Occurrence of plant photodynamic stress





Schematic of porphyrin-synthetic pathway to illustrate potential control points for increased PpIX accumulation





Tetrapyrrole biosynthetic pathways in prokaryotes





Enzymes and intermediate products of the heme synthesis pathway





Heme Biosynthetic Pathway





Iron metabolic pathways in the processes of sponge plasticity





A: The heme biosynthetic pathway





Heme biosynthesis pathway





The metabolic pathway of FECH and heme





Heme biosynthetic pathway





Chemical heme biosynthesis pathway





Heme biosynthesis pathway





The Heme Biosynthetic Pathway of the Obligate Wolbachia Endosymbiont of Brugia malayi as a Potential Anti-filarial Drug Target





Iron Necessity: The Secret of Wolbachia's Success?





Heme-dependent feedback inhibition of ALAS in the heme biosynthetic pathway





The metabolic roles of the endosymbiotic organelles of Toxoplasma and Plasmodium spp.





Heme metabolism





Gene Expression Variability in Human Hepatic Drug Metabolizing Enzymes and Transporters.





Functional enrichment of combined differentially expressed gene list in severe asthma across all tissues





Heme biosynthesis





Metabolic pathway differences between LLO118 and LLO56 and identification of mGPD2 as a candidate gene





Metabolic pathway of aerial and submerged mycelia in the liquid surface culture of Cordyceps militaris





Lost in translocation: The function of the 18 kD translocator protein





Rev-erbA/PGC-1A pathway regulating heme homeostasis





Ingenuity Pathway Analysis of genes that are differentially methylated (P<1.8 × 10−6) between normal mucosa of cancer patients and controls





Network pathway analysis of down-regulated genes in the gene category mitotic cell cycle





Metabolic reconstruction of Marine Group II Euryarchaeota, SAR406, and SAR202, based on the top three or four most complete genomes





Integrated metabolic, proteomic and transcriptomic functional analysis





Rickettsia species synthesize cell envelope glycoconjugates from imported host sugars and fuel the TCA cycle with a range of host-acquired metabolites



