# BEL Graph RAG Example for Paper Use Cases



In [None]:
# %pip install ndex2 langchain

In [36]:
%pip install texttoknowledgegraph==0.3.8 -q

Note: you may need to restart the kernel to use updated packages.


In [2]:
import os
import json
from typing import List, Dict, Any
from pathlib import Path
import sys
from ndex2 import Ndex2
import ndex2
from ndex2.cx2 import CX2Network
from dotenv import load_dotenv
from ndex2.cx2 import RawCX2NetworkFactory
load_dotenv()



# Read the OpenAI API key and the NDEx account and password from environment variables
OPENAI_API_KEY   = os.getenv("OPENAI_API_KEY")
NDEX_ACCOUNT     = os.getenv("NDEX_ACCOUNT")
NDEX_PASSWORD    = os.getenv("NDEX_PASSWORD")
assert all([OPENAI_API_KEY, NDEX_ACCOUNT, NDEX_PASSWORD]), "Missing creds"

# Connect to NDEx using the provided credentials
ndex_client = Ndex2(username=NDEX_ACCOUNT, password=NDEX_PASSWORD)



## Run GraphRAG query on the first paper

In [28]:
# Download the network of the first paper: pmid10436023

BASE_KG_UUID = "7a3195e9-4c50-11f0-a218-005056ae3c32"   
base_cx2 = ndex_client.get_network_as_cx2_stream(BASE_KG_UUID).json()
with open("base_kg.cx2", "w") as f: json.dump(base_cx2, f)

In [9]:
from openai import OpenAI
from typing import List

# Initialize OpenAI client
client = OpenAI(api_key=OPENAI_API_KEY)

def query_llm(prompt: str) -> str:
    """
    Query OpenAI's GPT model.
    """
    try:
        completion = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "user", "content": prompt}
            ],
            temperature=0.7,
            max_tokens=1000
        )
        return completion.choices[0].message.content.strip()
    except Exception as e:
        print(f"Error querying OpenAI: {str(e)}")
        return ""


Try two options: letting the LLM use its own knowledge alongside the knowledge graph, and explicitly telling it to use only the knowledge graph. The corresponding prompt notes are:

- IMPORTANT NOTE: ONLY USE THE INFORMATION PROVIDED IN THE KNOWLEDGE GRAPH TO ANSWER THE QUESTION. DO NOT MAKE UP ANY INFORMATION OR USE YOUR OWN KNOWLEDGE.

- IMPORTANT NOTE: YOU CAN MAKE USE OF YOUR KNOWLEDGE OF BIOLOGY AND THE PROVIDED KNOWLEDGE GRAPH TO ANSWER THE QUESTION.
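
One way to switch between the two notes without editing the prompt by hand is to keep them as constants and splice the chosen one into a shared template. The sketch below is only an illustration: the `{note}` placeholder and the names `KG_ONLY_NOTE`, `KG_PLUS_LLM_NOTE`, `BASE_TEMPLATE`, and `build_prompt_template` are assumptions introduced here, not part of the notebook. The doubled braces keep `{geneset}` and `{knowledge_graph}` as placeholders for the later `.format()` call.

```python
# Hypothetical helper: the note texts are copied from the list above;
# the {note} placeholder and all names below are assumptions for illustration.
KG_ONLY_NOTE = (
    "IMPORTANT NOTE: ONLY USE THE INFORMATION PROVIDED IN THE KNOWLEDGE GRAPH "
    "TO ANSWER THE QUESTION. DO NOT MAKE UP ANY INFORMATION OR USE YOUR OWN KNOWLEDGE."
)
KG_PLUS_LLM_NOTE = (
    "IMPORTANT NOTE: YOU CAN MAKE USE OF YOUR KNOWLEDGE OF BIOLOGY AND THE "
    "PROVIDED KNOWLEDGE GRAPH TO ANSWER THE QUESTION."
)

# {{geneset}} and {{knowledge_graph}} are escaped so they survive the first format().
BASE_TEMPLATE = """
You are playing the role of an expert cancer biologist.

QUESTION: How does metabolism affect dna damage response?

{note}

TASK:
1. Review the following gene/protein set and the provided knowledge graph.
2. Answer the question under QUESTION based on the provided knowledge graph.
3. Provide your final answer as a paragraph summary

Genes: {{geneset}}

{{knowledge_graph}}
"""


def build_prompt_template(kg_only: bool) -> str:
    """Return a prompt template containing exactly one of the two notes."""
    note = KG_ONLY_NOTE if kg_only else KG_PLUS_LLM_NOTE
    return BASE_TEMPLATE.format(note=note)
```

The result of `build_prompt_template(kg_only=True)` can be passed wherever a template with `{geneset}` and `{knowledge_graph}` placeholders is expected.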


In [34]:
# Base prompt for LLM query
PROMPT_TEMPLATE = """
You are playing the role of an expert cancer biologist.

QUESTION: How does metabolism affect dna damage response?

IMPORTANT NOTE: ONLY USE THE INFORMATION PROVIDED IN THE KNOWLEDGE GRAPH TO ANSWER THE QUESTION. DO NOT MAKE UP ANY INFORMATION OR USE YOUR OWN KNOWLEDGE.

TASK:
1. Review the following gene/protein set and the provided knowledge graph.
2. Answer the question under QUESTION based on the provided knowledge graph.
3. Provide your final answer as a paragraph summary

Genes: {geneset}

{knowledge_graph}
"""


In [25]:
# Function that does graphrag query and returns the llm response

from ndex2 import create_nice_cx_from_raw_cx
from typing import Callable, Union, List

def graph_rag_query(
    geneset: Union[List[str], str],
    ndex_id: str,
    prompt_template: str,
    ndex_client,
    llm_query_fn: Callable[[str], str],
    search_depth: int = 1
) -> str:
    """
    Perform a graph-RAG query: pull a neighborhood from NDEx (1 hop by default),
    extract BEL context, fill the prompt, and query the LLM.

    Inputs:
      geneset         – either a list of HGNC symbols or a whitespace-delimited string
      ndex_id         – NDEx network UUID
      prompt_template – string containing placeholders {geneset} and {knowledge_graph}
      ndex_client     – an instantiated ndex2.client.Ndex2 object
      llm_query_fn    – function that takes a single string prompt and returns the LLM’s response
      search_depth    – how many hops out to pull (default=1)

    Returns:
      The raw response string from the LLM.
    """
    # Normalize gene list to Python list
    if isinstance(geneset, str):
        gene_list = geneset.split()
    else:
        gene_list = geneset

    # Build the semicolon-delimited search string for NDEx
    search_string = ";".join(gene_list)

    # 1) Fetch the neighborhood as raw CX2 JSON
    raw_cx2 = ndex_client.get_neighborhood(
        ndex_id,
        search_string=search_string,
        search_depth=search_depth
    )

    # 2) Wrap in the “nice” CX helper
    nice_net = create_nice_cx_from_raw_cx(raw_cx2)

    # 3) Extract BEL expressions from every edge, skipping edges that have none
    #    (a missing attribute would otherwise put None into the join below)
    bel_lines = []
    for edge_id, edge_obj in nice_net.get_edges():
        bel_stmt = nice_net.get_edge_attribute_value(edge_obj, "bel_expression")
        if bel_stmt:
            bel_lines.append(str(bel_stmt))
    knowledge_graph = "\n".join(bel_lines)

    # 4) Fill in the prompt template
    formatted_prompt = prompt_template.format(
        geneset=" ".join(gene_list),
        knowledge_graph=knowledge_graph
    )

    # 5) Call the LLM and return its response
    return llm_query_fn(formatted_prompt)


# ─── Example Usage ─────────────────────────────────────────────────────────────

# response = graph_rag_query(
#     geneset=["SIRT1", "PARP1", "TP53"],
#     ndex_id=BASE_KG_UUID,
#     prompt_template=PROMPT_TEMPLATE,
#     ndex_client=ndex_client,
#     llm_query_fn=query_llm
# )
# print(response)
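
To make the `search_depth` parameter concrete, here is a toy, offline illustration of what a depth-limited neighborhood around a set of seed genes looks like. It is only a sketch under stated assumptions: the actual neighborhood query runs on the NDEx server and its exact semantics may differ, the function name `neighborhood_edges` is hypothetical, and the edge statements are plain-text stand-ins rather than real BEL.

```python
from collections import deque
from typing import Iterable, List, Tuple

Edge = Tuple[str, str, str]  # (source, target, plain-text statement)


def neighborhood_edges(edges: List[Edge], seeds: Iterable[str], depth: int = 1) -> List[Edge]:
    """Return the edges reachable within `depth` hops of any seed (direction ignored)."""
    # Build an undirected adjacency list that remembers each edge's index.
    adjacency = {}
    for idx, (src, tgt, _) in enumerate(edges):
        adjacency.setdefault(src, []).append((tgt, idx))
        adjacency.setdefault(tgt, []).append((src, idx))

    visited = set(seeds)
    queue = deque((seed, 0) for seed in visited)
    kept = set()
    while queue:
        node, dist = queue.popleft()
        if dist == depth:
            continue  # do not expand beyond the requested depth
        for neighbor, idx in adjacency.get(node, []):
            kept.add(idx)
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append((neighbor, dist + 1))
    return [edges[i] for i in sorted(kept)]


# Toy edges (illustrative text, not real BEL syntax)
toy_edges = [
    ("TP53", "CDKN1A", "TP53 increases CDKN1A"),
    ("CDKN1A", "CDK2", "CDKN1A decreases CDK2"),
    ("TP53", "BRCA1", "TP53 increases BRCA1"),
]

one_hop = neighborhood_edges(toy_edges, ["TP53"], depth=1)
two_hop = neighborhood_edges(toy_edges, ["TP53"], depth=2)
knowledge_graph = "\n".join(stmt for _, _, stmt in one_hop)
```

With `depth=1` only the two edges touching TP53 are kept; with `depth=2` the CDKN1A to CDK2 edge is included as well, so a larger depth gives the LLM more context at the cost of a longer prompt.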

In [26]:
graphrag_res = graph_rag_query(
    geneset=["SIRT1", "NAMPT", "PARP1", "TP53", "BRCA1", "CDK2"],
    ndex_id=BASE_KG_UUID,
    prompt_template=PROMPT_TEMPLATE,
    ndex_client=ndex_client,
    llm_query_fn=query_llm
)

print(graphrag_res)


In [22]:
print(graphrag_res)


### LLM response using only the knowledge graph


The knowledge graph provides a complex interaction map highlighting how various proteins and genes are involved in the DNA damage response, particularly focusing on TP53, a key regulator in this process. Metabolism can influence the DNA damage response through the activity and regulation of several proteins such as SIRT1, NAMPT, and PARP1, although these specific metabolic proteins are not directly covered in the graph. However, TP53 is shown to have a pivotal role in regulating the activity of several key players in DNA repair and cell cycle regulation. TP53 directly increases the activity of BRCA1, a crucial protein in the repair of DNA double-strand breaks, and decreases the activity of PARP1, which is involved in base excision repair. Furthermore, TP53 enhances the expression of CDKN1A, which inhibits CDK2, thereby controlling the cell cycle progression in response to DNA damage. It also interacts with MDM2, which in turn influences TP53 activity through a feedback loop. Although the graph does not directly link metabolic processes to the DNA damage response, the interplay between these proteins suggests that metabolic changes affecting any of these interactions could consequently influence the DNA damage response.

### LLM response using knowledge graph and LLM knowledge

The relationship between metabolism and the DNA damage response involves a complex interplay of multiple proteins and pathways. In the provided gene/protein set, several key players are involved in this interaction. SIRT1 and NAMPT are important metabolic regulators, although their specific interactions are not detailed in the knowledge graph. PARP1 is directly decreased by TP53, indicating a suppressive effect on PARP1's role in DNA repair, specifically in poly(ADP-ribose) polymerase activity. TP53, a crucial tumor suppressor, enhances the expression of BRCA1, which is involved in homologous recombination repair, thereby promoting DNA repair pathways. TP53 also increases the expression of CDKN1A (p21), which helps in cell cycle arrest, allowing time for DNA repair. Furthermore, CDK2 activity, crucial for cell cycle progression, is inhibited by CDKN1B and indirectly by TP53, which can also have implications for DNA damage response by preventing the cell from progressing through the cycle with damaged DNA. The regulation of TP53 is tightly controlled by MDM2, which TP53 itself directly decreases, establishing a feedback loop to modulate the DNA damage response. The knowledge graph emphasizes TP53's central role in orchestrating the DNA damage response and its interactions with proteins that impact both metabolism and DNA repair mechanisms. Overall, metabolism influences the DNA damage response by modulating pathways and proteins like TP53, PARP1, and BRCA1, which coordinate the repair of DNA and maintain genomic stability.

## Demonstrate GraphRAG on Merged Network of Two Papers

In [27]:
# Download network of the second paper: pmid24360018
NEW_KG_UUID = "1a811f64-4b97-11f0-a218-005056ae3c32"   
new_cx2 = ndex_client.get_network_as_cx2_stream(NEW_KG_UUID).json()
with open("new_kg.cx2", "w") as f: json.dump(new_cx2, f)

In [None]:
# Merge base and new networks
from ndex2.cx2 import RawCX2NetworkFactory
from textToKnowledgeGraph.convert_to_cx2 import add_style_to_network

# Creating an instance of RawCX2NetworkFactory
cx2_factory = RawCX2NetworkFactory()
base_net = cx2_factory.get_cx2network(base_cx2)
new_net  = cx2_factory.get_cx2network(new_cx2)

def merge_cx2(cx2_graphs):
    merged_graph = CX2Network()
    node_map = {}  # Maps node name to node ID in the merged graph
    
    # First, merge all nodes
    for graph in cx2_graphs:
        for node_id, node in graph.get_nodes().items():
            # Nodes with the same name are treated as the same node
            node_data = node["v"]
            node_name = node_data.get('name', '')
            
            if node_name not in node_map:
                # Create new node using add_node() as documented
                new_node_id = merged_graph.add_node(attributes=node_data)
                node_map[node_name] = new_node_id
    
    # Then, merge all edges
    for graph in cx2_graphs:
        for edge_id, edge_data in graph.get_edges().items():  
            # Get source and target directly from edge data
            source_id = edge_data.get('s')  
            target_id = edge_data.get('t') 
            
            # Get source and target nodes
            source_node = graph.get_node(source_id)
            source_name = source_node["v"]["name"]
            target_node = graph.get_node(target_id)
            target_name = target_node["v"]["name"]
            
            merged_source = node_map[source_name]
            merged_target = node_map[target_name]
            
            # Create edge using add_edge() as documented
            merged_graph.add_edge(source=merged_source, 
                                target=merged_target, 
                                attributes=edge_data["v"])
    
    return merged_graph


cx2_graphs = [base_net, new_net]
merged_network = merge_cx2(cx2_graphs)

# Apply style to the merged network
add_style_to_network(
    cx2_network=merged_network,
    style_path="/Users/favourjames/Downloads/llm-text-to-knowledge-graph/textToKnowledgeGraph/cx_style.json"   
)

merged_network.set_name("Merged Network of Base and New KGs")

# Upload the merged network to NDEx (the returned value is the new network's URL, which ends in its UUID)
merged_uuid = ndex_client.save_new_cx2_network(merged_network.to_cx2())

print("Merged network UUID:", merged_uuid)

INFO: [2025-06-23 16:36:25] textToKnowledgeGraph.convert_to_cx2 - Setting visual style properties


Merged network UUID: https://www.ndexbio.org/v3/networks/d3c02378-5047-11f0-a218-005056ae3c32
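
The merge above keys nodes on their name and then rewrites every edge to point at the merged node IDs. The same idea can be sketched on plain dictionaries, without `ndex2`, to check the remapping logic in isolation. This is a simplified illustration under stated assumptions: the graph layout (`"nodes"` and `"edges"` keys) and the name `merge_by_name` are hypothetical, and, like the notebook's `merge_cx2`, it does not deduplicate edges.

```python
from typing import Dict, List, Tuple


def merge_by_name(graphs: List[dict]) -> Tuple[Dict[int, str], List[Tuple[int, int, str]]]:
    """Merge toy graphs of the form {"nodes": {id: name}, "edges": [(src, tgt, label)]}."""
    name_to_id: Dict[str, int] = {}
    merged_nodes: Dict[int, str] = {}
    merged_edges: List[Tuple[int, int, str]] = []

    # First pass: give each distinct node name a single ID in the merged graph.
    for graph in graphs:
        for name in graph["nodes"].values():
            if name not in name_to_id:
                new_id = len(name_to_id)
                name_to_id[name] = new_id
                merged_nodes[new_id] = name

    # Second pass: translate each edge's local node IDs to merged IDs via the names.
    for graph in graphs:
        for src, tgt, label in graph["edges"]:
            merged_src = name_to_id[graph["nodes"][src]]
            merged_tgt = name_to_id[graph["nodes"][tgt]]
            merged_edges.append((merged_src, merged_tgt, label))

    return merged_nodes, merged_edges


# Two toy graphs that share the PARP1 node under different local IDs
graph_a = {"nodes": {0: "TP53", 1: "PARP1"}, "edges": [(0, 1, "decreases")]}
graph_b = {"nodes": {0: "SIRT1", 1: "PARP1"}, "edges": [(0, 1, "decreases")]}

nodes, edges = merge_by_name([graph_a, graph_b])
```

Here PARP1 appears once in `nodes`, and both edges end at the same merged ID. Matching on name alone means two nodes with the same name but different types would also be collapsed, which may or may not be desirable for a given pair of networks.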


In [35]:
# Graphrag query with merged graph context
Merged_UUID = "d3c02378-5047-11f0-a218-005056ae3c32"

merged_graphrag_res = graph_rag_query(
    geneset=["SIRT1", "NAMPT", "PARP1", "TP53", "BRCA1", "CDK2"],
    ndex_id=Merged_UUID,
    prompt_template=PROMPT_TEMPLATE,
    ndex_client=ndex_client,
    llm_query_fn=query_llm
)

print(merged_graphrag_res)

INFO: [2025-06-23 16:46:25] httpx - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"



### LLM response using only the knowledge graph for merged network

Metabolism significantly influences the DNA damage response through a complex interplay involving the genes and proteins SIRT1, NAMPT, PARP1, and TP53. SIRT1, a metabolic regulator, directly decreases the activity of PARP1, a key player in DNA repair processes, indicating a regulatory control over DNA repair mechanisms. PARP1, in turn, enhances processes like DNA repair and cell death, as evidenced by its role in increasing biological processes linked to cell cycle and DNA damage responses (GO:0006974 and GO:0008219). TP53, a critical tumor suppressor, directly decreases the activity of both SIRT1 and PARP1, and simultaneously increases the activity of BRCA1, another vital DNA repair protein. This suggests that TP53 modulates DNA repair pathways by balancing the activities of these proteins. Moreover, the activity of PARP1 is positively regulated by NAMPT and other metabolic components, linking cellular metabolic status to DNA damage response pathways. Collectively, these interactions highlight a sophisticated network where metabolism, through key regulators like SIRT1 and NAMPT, affects the DNA damage response, primarily by modulating the activities of repair proteins such as PARP1 and TP53.

### LLM response using knowledge graph and LLM knowledge for merged network

The metabolism of cells can significantly influence the DNA damage response (DDR), a critical cellular mechanism for maintaining genomic integrity. The genes and proteins involved, such as SIRT1, NAMPT, PARP1, TP53, BRCA1, and CDK2, play distinct roles in this process. SIRT1 and PARP1 are central to the regulation of metabolism and the DDR. SIRT1, a NAD+-dependent deacetylase, is involved in cellular stress responses and regulates PARP1 activity. It acts as a negative regulator of PARP1, as evidenced by its direct decrease in PARP1 activity. Conversely, PARP1, which also relies on NAD+ for its activity, facilitates DNA repair by recognizing DNA strand breaks and recruiting DNA repair proteins, thus playing a direct role in the DDR. TP53, a pivotal tumor suppressor gene, regulates the expression of several genes involved in the DDR, including CDKN1A and BRCA1, and can downregulate PARP1 activity. This regulation is crucial, as overactivation of PARP1 can lead to NAD+ depletion, influencing cellular energy metabolism. Furthermore, TP53 activity is modulated by various factors such as ATM and PRKDC, which enhance its role in the DDR. CDK2, a cyclin-dependent kinase, interacts with the cell cycle and can be negatively regulated by TP53 through indirect pathways. Overall, the interplay between these metabolic regulators and the DDR components highlights the intricate connection between cellular metabolism and the maintenance of genomic stability, where metabolic enzymes and pathways can directly impact the efficacy and regulation of DNA repair mechanisms.