# BEL Graph RAG Example

This notebook demonstrates how to use Graph RAG (Retrieval Augmented Generation) with BEL graphs to review an analysis of a gene set related to SIRT1 and PARP1.

In [22]:
import os
import json
import sys
from pathlib import Path

import ndex2
from ndex2 import Ndex2
from ndex2.cx2 import CX2Network

# Add parent directory to path to import textToKnowledgeGraph
sys.path.append('..')

# Get NDEx account and password from environment variables
ndex_account = os.environ.get('NDEX_ACCOUNT')
ndex_password = os.environ.get('NDEX_PASSWORD')

# Verify credentials are available
if not ndex_account or not ndex_password:
    raise ValueError("NDEx credentials not found in environment variables. "
                    "Please set NDEX_ACCOUNT and NDEX_PASSWORD.")

ndex_client = Ndex2(username=ndex_account, password=ndex_password)


## Papers to build our knowledge graph for Graph RAG

In [4]:
relevant_papers = {
    "paper1": {
        "title": "PBX1-SIRT1 Positive Feedback Loop Attenuates ROS-Mediated HF-MSC Senescence and Apoptosis",
        "citation": "Stem Cell Rev Rep. 2023 Feb;19(2):443-454. doi: 10.1007/s12015-022-10425-w.",
        "pmcid": "PMC9902417"
    },
    "paper2": {
        "title": "A PARP1–BRG1–SIRT1 axis promotes HR repair by reducing nucleosome density at DNA damage sites",
        "citation": "Nucleic Acids Res. 2019 Sep 19;47(16):8563-8580. doi: 10.1093/nar/gkz592.",
        "pmcid": "PMC7145522"
    },
    "paper3": {
        "title": "SIRT1/PARP1 crosstalk: connecting DNA damage and metabolism.",
        "citation": "Genome Integr. 2013 Dec 20;4(1):6. doi: 10.1186/2041-9414-4-6. ",
        "pmcid": "PMC3898398"
    }
}

## Process the relevant papers and upload to NDEx

In [6]:
# Process each paper into a BEL knowledge graph, upload it to NDEx,
# and collect the returned NDEx network identifiers
from bel_processing import process_paper_to_bel_cx2

paper_ndex_ids = []
for paper in relevant_papers.values():
    cx2_network = process_paper_to_bel_cx2(paper["pmcid"])
    network_id = ndex_client.save_new_cx2_network(cx2_network.to_cx2())
    paper_ndex_ids.append(network_id)



INFO: Setting up output directory for PMC9902417
INFO: Successfully downloaded XML for PMCID PMC9902417.
INFO: Processing xml file to get text paragraphs
INFO: Processing paragraphs with LLM-BEL model
INFO: HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO: HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO: HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO: HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO: HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO: HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO: HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO: HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO: HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO: HTTP Request: POST 

Time taken: 278.21 seconds (4.64 minutes)


INFO: Successfully downloaded XML for PMCID PMC7145522.
INFO: Processing xml file to get text paragraphs
INFO: Processing paragraphs with LLM-BEL model
INFO: HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO: HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO: HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO: HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO: HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO: HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO: HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO: HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO: HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO: HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/

Time taken: 254.08 seconds (4.23 minutes)


INFO: Successfully downloaded XML for PMCID PMC3898398.
INFO: Processing xml file to get text paragraphs
INFO: Processing paragraphs with LLM-BEL model
INFO: HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO: HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO: HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO: HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO: HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO: HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO: HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO: HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO: HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO: HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/

Time taken: 238.49 seconds (3.97 minutes)
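
Each paper takes several minutes to process, so a failure on one paper should not discard the work done on the others. The following is a minimal sketch of a loop that times each paper and records failures separately. The `process_all` helper and the `fake_process` and `fake_upload` stand-ins are hypothetical; in the notebook they would correspond to `process_paper_to_bel_cx2` and the NDEx upload call.

```python
import time


def process_all(papers, process, upload):
    """Run process() then upload() for each paper, timing each one.

    Returns (results, failures): results maps paper id to the upload result,
    failures maps paper id to the error message.
    """
    results, failures = {}, {}
    for paper_id, paper in papers.items():
        start = time.perf_counter()
        try:
            results[paper_id] = upload(process(paper["pmcid"]))
        except Exception as exc:  # keep going so one bad paper does not stop the run
            failures[paper_id] = str(exc)
        elapsed = time.perf_counter() - start
        print(f"{paper_id}: {elapsed:.2f} seconds")
    return results, failures


# Hypothetical stand-ins so the sketch runs without network access.
def fake_process(pmcid):
    if pmcid == "PMC2":
        raise RuntimeError("download failed")
    return {"pmcid": pmcid}


def fake_upload(network):
    return "id-" + network["pmcid"]


demo_papers = {"a": {"pmcid": "PMC1"}, "b": {"pmcid": "PMC2"}}
demo_results, demo_failures = process_all(demo_papers, fake_process, fake_upload)
```

Keeping failures in their own dictionary makes it easy to rerun only the papers that did not complete.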


## Download the networks from NDEx, merge them, and upload the result

In [24]:
from ndex2.cx2 import RawCX2NetworkFactory

# Creating an instance of RawCX2NetworkFactory
cx2_factory = RawCX2NetworkFactory()

def merge_cx2(cx2_graphs):
    merged_graph = CX2Network()
    node_map = {}  # Maps node name to node ID in the merged graph

    # First, merge all nodes; nodes that share a name are treated as the same node
    for graph in cx2_graphs:
        for node_id, node in graph.get_nodes().items():
            # The node name alone defines uniqueness; the first node seen wins
            node_data = node["v"]
            node_name = node_data.get('name', '')

            if node_name not in node_map:
                # Create new node using add_node() as documented
                new_node_id = merged_graph.add_node(attributes=node_data)
                node_map[node_name] = new_node_id
    
    # Then, merge all edges
    for graph in cx2_graphs:
        for edge_id, edge_data in graph.get_edges().items():  
            # Get source and target directly from edge data
            source_id = edge_data.get('s')  
            target_id = edge_data.get('t') 
            
            # Look up node names the same way as in the node pass, so nodes
            # without a name still resolve to the shared '' entry
            source_name = graph.get_node(source_id)["v"].get('name', '')
            target_name = graph.get_node(target_id)["v"].get('name', '')
            
            merged_source = node_map[source_name]
            merged_target = node_map[target_name]
            
            # Create edge using add_edge() as documented
            merged_graph.add_edge(source=merged_source, 
                                target=merged_target, 
                                attributes=edge_data["v"])
    
    return merged_graph

# Get the networks from NDEx as a list of CX2Network objects
paper_cx2_networks = []
for ndex_id in paper_ndex_ids:
    # if it is a url string, get the id at the end
    ndex_id = ndex_id.split("/")[-1]
    # get the NDEx network
    print(f"getting {ndex_id}")
    response = ndex_client.get_network_as_cx2_stream(ndex_id)
    network = cx2_factory.get_cx2network(response.json())
    # Collect each paper's knowledge graph for merging
    paper_cx2_networks.append(network)

print("merging cx2 networks")
merged_knowledge_graph = merge_cx2(paper_cx2_networks)
merged_knowledge_graph.set_name("parp1_sirt1_knowledge_graph")
print("saving cx2 network to NDEx")
kg_network_id = ndex_client.save_new_cx2_network(merged_knowledge_graph.to_cx2())

getting 9aad3bb0-e275-11ef-8e41-005056ae3c32
getting 32651b82-e276-11ef-8e41-005056ae3c32
getting c0d77a24-e276-11ef-8e41-005056ae3c32
merging cx2 networks
saving cx2 network to NDEx
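
The merge above works in two passes: nodes are deduplicated by name, then edges are re-pointed at the merged node ids. The sketch below shows the same idea on plain dictionaries so it can be run without NDEx. The graph layout used here (`"nodes"` and `"edges"` keys) and the example BEL terms are hypothetical stand-ins, not the CX2 data model.

```python
def merge_by_name(graphs):
    """Merge toy graphs, treating nodes with the same name as one node."""
    merged_nodes = {}   # new node id -> attributes
    name_to_id = {}     # node name -> new node id
    merged_edges = []   # (new source id, new target id, attributes)

    # Pass 1: keep the first node seen for each name.
    for graph in graphs:
        for attrs in graph["nodes"].values():
            name = attrs.get("name", "")
            if name not in name_to_id:
                new_id = len(merged_nodes)
                merged_nodes[new_id] = dict(attrs)
                name_to_id[name] = new_id

    # Pass 2: translate each edge's endpoints from local ids to merged ids.
    for graph in graphs:
        for source_id, target_id, attrs in graph["edges"]:
            source_name = graph["nodes"][source_id].get("name", "")
            target_name = graph["nodes"][target_id].get("name", "")
            merged_edges.append(
                (name_to_id[source_name], name_to_id[target_name], dict(attrs))
            )

    return merged_nodes, merged_edges


# Two toy graphs that share the PARP1 node under different local ids.
graph_a = {
    "nodes": {0: {"name": "p(HGNC:SIRT1)"}, 1: {"name": "p(HGNC:PARP1)"}},
    "edges": [(0, 1, {"interaction": "decreases"})],
}
graph_b = {
    "nodes": {5: {"name": "p(HGNC:PARP1)"}, 6: {"name": "p(HGNC:NAMPT)"}},
    "edges": [(6, 5, {"interaction": "association"})],
}

nodes, edges = merge_by_name([graph_a, graph_b])
print(len(nodes), "nodes,", len(edges), "edges")
```

Note that this approach keeps only the attributes of the first node seen for a given name, and it never merges edges, so the same statement extracted from two papers appears twice in the result.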


## Prompt Templates
Ask an LLM to review an analysis of a gene set.

The LLM will be queried with and without the results of a query to the merged knowledge graph.

In [None]:
PROMPT_TEMPLATE = """
You are playing the role of an expert cancer biologist.

TASK:
1. Review the following gene/protein set.
2. Review the following summary of the gene set's function and potential relationship to cancer.
The summary was produced by an LLM.
3. Provide your critique, including your reasoning about the causal relationships between 
4. Provide advice that can be incorporated in the prompt to the LLM to improve its output
5. Provide additional advice for the LLM as a causal knowledge graph of relevant facts that would help it. 

Present one statement per line using BEL format. To refresh your knowledge of BEL, it is a language 
for representing biological knowledge in a computable form. Key aspects:
- Entities are represented with functions like p() for proteins, a() for abundances, bp() for biological processes
- Relationships between entities use operators like increases, decreases, directlyIncreases, association
- Entities must use standard namespaces (HGNC for human genes, CHEBI for chemicals, etc.)
- Statements in BEL are associated with evidence text

Genes: {geneset}

Gene set summary:
{gene_set_summary}


{knowledge_graph}

Output format:
## Genes
<genes>

## Critique:
<critique>

{{
"advice": "<advice_to_llm>",
"causal_relationship_advice_graph": " - <BEL_statement> : <evidence_text><newline>..."
}}
"""

KNOWLEDGE_GRAPH_TEMPLATE = """
Here is information in BEL format that may help you perform your critique and be used in your advice.
Be sure to distinguish when you draw on this information vs when you use your own knowledge.

{statements}

"""

GENE_SET = "FOXO1, FOXO3, HIF1A, NAMPT, NFKB1, PARP1, PPARG, SIRT1"

## This summary was generated by Claude 3.5 Sonnet
GENE_SET_SUMMARY = """
Core protein network in cancer metabolism:

Primary NAD+ regulatory circuit:

SIRT1-NAMPT-PARP1: NAD+-dependent switch coupling energy status to cell fate
NAMPT produces NAD+, SIRT1 and PARP1 compete for it under stress/damage
Competition determines survival vs death pathway activation

Transcriptional integration hub:

FOXO1/FOXO3 + NFKB1: integrate stress/survival signals
HIF1A + PPARG: coordinate metabolic adaptation and hypoxia response
Direct SIRT1-mediated deacetylation of FOXO1/FOXO3 and HIF1A

Critical additional regulators with strong experimental evidence:

PRKAA1/PRKAA2: master energy sensor, directly phosphorylates FOXOs, regulates NAMPT
TP53: couples metabolism to DNA damage via SIRT1/PARP1 interactions
PPARGC1A: controls mitochondrial function through SIRT1/PPARG/FOXO axis

System integration:

Functions as metabolic checkpoint where NAD+ availability determines outcomes
PRKAA1/2 sets threshold based on energy status (AMP:ATP ratio)
TP53 integrates DNA damage signals into the network
PPARGC1A fine-tunes mitochondrial response
All components show direct physical/functional interactions in cancer metabolism
Network particularly active under metabolic stress and DNA damage conditions
"""

PROMPT_TEMPLATE

'\nYou are playing the role of an expert cancer biologist.\n\nTASK:\n1. Review the following gene/protein set.\n2. Review the following summary of the gene set\'s function and potential relationship to cancer.\nThe summary was produced by an LLM.\n3. Provide your critique, including your reasoning about the causal relationships between \n4. Provide advice that can be incorporated in the prompt to the LLM to improve its output\n5. Provide additional advice for the LLM as a causal knowledge graph of relevant facts that would help it. \n\nPresent one statement per line using BEL format. To refresh your knowledge of BEL, it is a language \nfor representing biological knowledge in a computable form. Key aspects:\n- Entities are represented with functions like p() for proteins, a() for abundances, bp() for biological processes\n- Relationships between entities use operators like increases, decreases, directlyIncreases, association\n- Entities must use standard namespaces (HGNC for human gene
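
The two-template design means the same prompt can be rendered with or without a knowledge graph section, and literal braces in the output format must be doubled so that `str.format` does not treat them as placeholders. A minimal sketch of that pattern follows; the shortened templates, the `build_prompt` helper, and the example BEL statement are illustrative assumptions, not the notebook's actual prompt.

```python
# Shortened, hypothetical templates that mirror the structure used above.
TEMPLATE = """Genes: {geneset}

{knowledge_graph}
Output format:
{{"advice": "<advice>"}}
"""

KG_BLOCK = """Here are BEL statements that may help:
{statements}
"""


def build_prompt(geneset, statements=None):
    """Render the prompt; the knowledge graph block is empty when no statements are given."""
    kg = KG_BLOCK.format(statements=statements) if statements else ""
    return TEMPLATE.format(geneset=geneset, knowledge_graph=kg)


without_kg = build_prompt("SIRT1, PARP1")
with_kg = build_prompt("SIRT1, PARP1", "p(HGNC:SIRT1) decreases p(HGNC:PARP1)")
print(with_kg)
```

Because the statements are passed as a value rather than concatenated into the template, any braces they contain are left untouched by the formatting step.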

In [54]:
import os
from openai import OpenAI
from typing import List

# Get API key from environment
OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY")
if not OPENAI_API_KEY:
    raise ValueError("OPENAI_API_KEY not found in environment variables")

# Initialize OpenAI client
client = OpenAI(api_key=OPENAI_API_KEY)

# Template for extracting gene entities
entity_prompt_template = """
Interpret this text to extract all genes/proteins mentioned and output them as a whitespace-separated list of human gene symbols.
<example>
TP53 AKT1 MTOR
</example>
Only output that list, nothing else.
<text>
{text}
</text>
"""

def query_llm(prompt: str) -> str:
    """
    Query OpenAI's GPT model.
    """
    try:
        completion = client.chat.completions.create(
            model="gpt-4",
            messages=[
                {"role": "user", "content": prompt}
            ],
            temperature=0.7,
            max_tokens=1000
        )
        return completion.choices[0].message.content.strip()
    except Exception as e:
        print(f"Error querying OpenAI: {str(e)}")
        return ""

def get_entities(text: str) -> str:
    """
    Extract gene entities from text using LLM.

    Returns the model's whitespace-separated list of gene symbols as a single
    string, or an empty string if the query failed.
    """
    prompt = entity_prompt_template.format(text=text)
    return query_llm(prompt)

kg_query_string = get_entities(GENE_SET_SUMMARY)

print(f'Query string to KG in NDEx: {kg_query_string}')

# Hard-coded network id; this replaces the kg_network_id assigned when the merged graph was saved
kg_network_id = "7ce89103-a372-11ef-99aa-005056ae3c32"

neighborhood_cx = ndex_client.get_neighborhood(kg_network_id, kg_query_string, search_depth=1)
nice_kg_query_network = ndex2.create_nice_cx_from_raw_cx(neighborhood_cx)

# convert the network to a string containing one BEL statement per line
bel_statements = []
for edge_id, edge_obj in nice_kg_query_network.get_edges():
    bel_expression = nice_kg_query_network.get_edge_attribute_value(edge_obj, "bel_expression")
    # Skip edges without a BEL expression instead of failing on None
    if bel_expression:
        bel_statements.append(bel_expression)
knowledge_graph = "\n".join(bel_statements)

print("NDEx query done")

INFO: HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


Query string to KG in NDEx: SIRT1 NAMPT PARP1 FOXO1 FOXO3 NFKB1 HIF1A PPARG PRKAA1 PRKAA2 TP53 PPARGC1A
NDEx query done
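
The model's reply is passed to NDEx as a search string without any checks, so stray punctuation or commentary in the reply would end up in the query. The sketch below shows one way to normalize such a reply into a clean, deduplicated list first. The `parse_gene_symbols` helper and its regular expression are assumptions for illustration; the pattern is a rough heuristic and not the official HGNC naming rule.

```python
import re

# Heuristic: letters, digits, and hyphens, starting with a letter or digit.
SYMBOL_PATTERN = re.compile(r"^[A-Za-z0-9][A-Za-z0-9-]*$")


def parse_gene_symbols(response):
    """Split an LLM reply into unique gene-symbol-like tokens, keeping first-seen order."""
    symbols = []
    seen = set()
    for token in re.split(r"[\s,;/]+", response.strip()):
        if token and SYMBOL_PATTERN.match(token) and token not in seen:
            seen.add(token)
            symbols.append(token)
    return symbols


reply = "SIRT1 NAMPT, PARP1\nSIRT1 FOXO1/FOXO3 (none)"
symbols = parse_gene_symbols(reply)
query_string = " ".join(symbols)
print(query_string)
```

The resulting `query_string` has the same whitespace-separated shape as the one printed above, so it could be handed to the neighborhood query in the same way.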


## Query without the knowledge graph

In [57]:
prompt = PROMPT_TEMPLATE.format(
    geneset=GENE_SET,
    gene_set_summary=GENE_SET_SUMMARY,
    knowledge_graph=""
)

analysis_no_kg = query_llm(prompt)
print(analysis_no_kg)

INFO: HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


## Genes
FOXO1, FOXO3, HIF1A, NAMPT, NFKB1, PARP1, PPARG, SIRT1

## Critique:
The summary provides a reasonably accurate description of the roles of the given gene set in cancer metabolism. It appropriately highlights the importance of the SIRT1-NAMPT-PARP1 circuit in regulating NAD+ and its role in determining cell fate under stress conditions. The roles of transcription factors like FOXO1/3, NFKB1, HIF1A, and PPARG as integrators of cellular stress signals and metabolic adaptations are also well summarized. However, the summary doesn't define the directionality of the interactions between these genes/proteins. For instance, it mentions that SIRT1 mediates the deacetylation of FOXO1/3 and HIF1A, but it doesn't specify whether this increases or decreases the activity of these proteins. Similarly, the role of TP53 in coupling metabolism to DNA damage via SIRT1/PARP1 interactions is not clear.

## Advice:
The LLM should focus more on the directionality and consequences of the interaction

## Query with the knowledge graph

In [56]:
knowledge_graph_prompt = KNOWLEDGE_GRAPH_TEMPLATE.format(
    statements=knowledge_graph)

prompt = PROMPT_TEMPLATE.format(
    geneset=GENE_SET,
    gene_set_summary=GENE_SET_SUMMARY,
    knowledge_graph=knowledge_graph_prompt
)

analysis_with_kg = query_llm(prompt)
print(analysis_with_kg)

INFO: HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


## Genes
FOXO1, FOXO3, HIF1A, NAMPT, NFKB1, PARP1, PPARG, SIRT1

## Critique:
The LLM has correctly identified key components of the NAD+ regulatory circuit and their roles in cancer metabolism. It correctly states that NAMPT, SIRT1, and PARP1 are involved in the NAD+-dependent switch that couples energy status to cell fate. The transcriptional integration hub involving FOXO1/FOXO3 and NFKB1, and HIF1A and PPARG is also accurate. The additional regulators PRKAA1/2, TP53, and PPARGC1A are indeed involved in these processes. 

However, the LLM misses some key points. The BEL statements suggest that SIRT1 can both increase and decrease PARP1, which can be confusing. This could be better clarified by mentioning that SIRT1 usually inhibits PARP1 but can also increase its activity under certain conditions. The statement that "All components show direct physical/functional interactions in cancer metabolism" is too broad, as not all components physically interact with each other.

## Advice:
T