# 📚 Learning Path: Graphs, LLMs, and Knowledge Extraction with Historical Data
Welcome to this hands-on notebook! This guided journey is designed for beginners in both knowledge graphs and large language models (LLMs). By the end, you will:
- Understand what a knowledge graph is and why it matters for reasoning over data.
- Learn how to use LLMs to extract structured knowledge from unstructured text.
- See how to build, query, and visualize a knowledge graph from historical data.
- Practice asking questions and interpreting answers using both graph and LLM reasoning.
- Reflect on each step with learner prompts to deepen your understanding.

**How to use this notebook:**
- Read the explanations and learner comments in each cell.
- Run the code cells in order, and observe the outputs.
- After each section, pause to consider the learner prompts and questions.
- Try modifying the code or data to experiment and learn actively!

Let's begin your journey into the world of graphs and LLMs!

# 🧠 Historical GraphRAG Notebook
Welcome! In this notebook, you'll learn how to combine knowledge graphs and large language models (LLMs) to extract, organize, and reason about information from historical texts.

**What is a Knowledge Graph?**
A knowledge graph is a way to represent information as a network of entities (like people, places, or events) and the relationships between them. This makes it easier to explore connections and answer complex questions.
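
To make this concrete, here is a minimal sketch in plain Python that stores a few facts as (subject, predicate, object) triples and looks up everything connected to one entity. The facts and the helper name `facts_about` are illustrative assumptions; they are not taken from the notebook's dataset.

```python
# A tiny knowledge graph stored as (subject, predicate, object) triples.
# The facts below are illustrative examples, not rows from the notebook's CSV.
triples = [
    ("Marie Curie", "discovered", "Polonium"),
    ("Marie Curie", "won", "Nobel Prize in Physics"),
    ("Pierre Curie", "married", "Marie Curie"),
]


def facts_about(entity, triples):
    """Return every triple in which the entity appears as subject or object."""
    return [t for t in triples if entity in (t[0], t[2])]


for subject, predicate, obj in facts_about("Marie Curie", triples):
    print(f"{subject} --{predicate}--> {obj}")
```

Each printed line is one edge of the graph: the subject and object are nodes, and the predicate is the label on the connection between them.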

**Why use LLMs?**
LLMs can help us extract structured knowledge (facts and relationships) from unstructured text, making it possible to build and update knowledge graphs automatically.

As you go through each step, look for the learner prompts and questions to help you reflect and deepen your understanding.

## 🔧 Environment Setup
Before we start, let's make sure we have all the necessary tools. We'll use several Python libraries:
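- `langchain`, `langchain-core`, `langchain-community`, and `langchain-openai`: For working with LLMs, loading documents, and building chains of reasoning.
- `pandas`: For handling tabular data.
- `networkx`: For building and querying graphs.
- `matplotlib`: For drawing the graph visualizations.
- `python-dotenv` (imported as `dotenv`): For loading API keys and settings from a `.env` file.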

**Learner Prompt:**
- Why do you think we need both a graph library and an LLM library? What might each be responsible for in this project?
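
Before running the rest of the notebook, it can help to confirm that the libraries are importable. The sketch below uses only the standard library; the helper name `missing_modules` is an assumption made for illustration, and the list of import names should be adjusted to match your own `requirements.txt`.

```python
import importlib.util


def missing_modules(module_names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


# Import names for the libraries used in this notebook
# (the python-dotenv package is imported as "dotenv").
required = ["langchain", "langchain_core", "pandas", "networkx", "dotenv"]
missing = missing_modules(required)
print("Missing modules:", ", ".join(missing) if missing else "none")
```

If any names are reported as missing, run the installation cell before continuing.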

In [None]:
# Install the libraries listed in requirements.txt, which this notebook needs for working with LLMs and graphs.
# You only need to run this once per environment.
%pip install -r requirements.txt

# Learner Prompt:
# - What do you think would happen if you skipped this step or used older versions of these libraries?

In [None]:
# Import all the libraries we need for this project.
import os  # For interacting with the operating system and environment variables
import pandas as pd  # For loading and manipulating tabular data
import networkx as nx  # For creating and working with graphs
from collections import defaultdict  # For counting and grouping data efficiently
from langchain_openai import AzureChatOpenAI  # For using Azure's LLM
from langchain_core.prompts import PromptTemplate  # For creating prompts for the LLM
from langchain_core.output_parsers import StrOutputParser  # For parsing LLM outputs
from langchain_community.document_loaders import CSVLoader  # For loading CSV files as documents
from langchain.text_splitter import RecursiveCharacterTextSplitter  # For splitting text into manageable chunks
from langchain_core.documents import Document  # For working with document objects
from dotenv import load_dotenv  # For loading environment variables from a .env file

# Load environment variables (like API keys) from a .env file for security.
load_dotenv()

# Learner Prompt:
# - Can you identify which libraries are for graphs, which are for LLMs, and which are for data handling?

## 🤖 Azure OpenAI Setup
To use a hosted LLM, we need to connect to Azure OpenAI. This requires an API key, an endpoint URL, and an API version. We load these from a `.env` file instead of writing them into the notebook, so they stay out of shared code.

**Learner Prompt:**
- Why do you think it's important to keep API keys and endpoints out of your code? How does using a `.env` file help?
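
A missing setting often surfaces later as a confusing authentication error, so a quick check up front can save time. The sketch below is one possible way to do that with the standard library. The helper name `missing_settings` is hypothetical; the variable names are the ones this notebook reads from the environment. It reports only the names of missing settings and never prints their values.

```python
import os


def missing_settings(required_names, env=None):
    """Return the required setting names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required_names if not env.get(name)]


REQUIRED = [
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_API_VERSION",
]

missing = missing_settings(REQUIRED)
if missing:
    print("Missing settings:", ", ".join(missing))
else:
    print("All required settings are present.")
```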

In [None]:
# Load Azure OpenAI credentials from the .env file.
# This keeps sensitive information (like API keys) out of your code.
import os
from pathlib import Path
from dotenv import load_dotenv

# Get the parent directory of the current working directory (where the notebook is running)
parent_dir = Path.cwd().parent
env_path = parent_dir / ".env"

# Load the .env file from the parent directory
load_dotenv(dotenv_path=env_path, override=True)

AZURE_OPENAI_API_KEY = os.getenv("AZURE_OPENAI_API_KEY")
AZURE_OPENAI_ENDPOINT = os.getenv("AZURE_OPENAI_ENDPOINT")
AZURE_OPENAI_API_VERSION = os.getenv("AZURE_OPENAI_API_VERSION")
AzureOpenAIEmbeddingsModel = os.getenv("AzureOpenAIEmbeddingsModel", "text-embedding-ada-002")
AzureChatOpenAIModel = os.getenv("AzureChatOpenAIModel")

# Learner Prompt:
# - What could go wrong if you accidentally share your .env file or hard-code your API key in a public notebook?

In [None]:
# Load the Azure OpenAI LLM (Large Language Model) using the credentials above.
# This model will help us extract knowledge from text and answer questions.
from langchain_openai import AzureChatOpenAI
llm = AzureChatOpenAI(
    model=AzureChatOpenAIModel,
    azure_endpoint=AZURE_OPENAI_ENDPOINT,
    api_key=AZURE_OPENAI_API_KEY,
    api_version=AZURE_OPENAI_API_VERSION,
    # Temperature controls randomness: higher = more creative, lower = more focused
    temperature=1,
)

# Learner Prompt:
# - What do you think happens if you set the temperature to 0? To 2? Try changing it and see what kind of responses you get.

## 📄 Load and Chunk Historical Text
To build a knowledge graph, we first need to load our historical data. We'll use a CSV file with information about historical figures. Because an LLM has a limited context window and tends to extract facts more reliably from short, focused passages, we'll split the data into chunks of at most 500 characters with a 50-character overlap.

**Learner Prompt:**
- Why do you think it's important to split long documents into smaller chunks before processing them with an LLM?
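
To see what chunk size and overlap mean, here is a simplified sketch that cuts a string into fixed-size windows. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraph and line breaks); the function `chunk_text` is a hypothetical helper that only illustrates the two parameters.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Split text into fixed-size windows; neighbors share `chunk_overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
for chunk in chunk_text(sample, chunk_size=10, chunk_overlap=3):
    print(chunk)
```

The last three characters of each chunk reappear at the start of the next one. That overlap reduces the chance that a fact spanning a chunk boundary is lost.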

In [None]:
# Load the historical data from a CSV file as a list of documents.
loader = CSVLoader(file_path="historical_figures.csv")
docs = loader.load()

# Split the documents into smaller chunks for easier processing by the LLM.
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
documents = splitter.split_documents(docs)

# Learner Prompt:
# - What might happen if you use a very large chunk size? What about a very small one?

## 🏗️ Entity Extraction Prompt (with Example)
Now we'll use the LLM to extract structured knowledge from our text. We'll ask it to find entities (like people or events) and the relationships between them, and return these as triples (Subject, Predicate, Object).
This is a key step in building a knowledge graph from unstructured data.

**Learner Prompt:**
- Why do you think we use the format (Subject, Predicate, Object) for representing knowledge? Can you think of an example from your own experience?
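
The sketch below shows how a response in the `Triples:` / `New Entities:` layout requested in this section could be turned into Python data. The sample response is hand-written (no model is called), and `parse_extraction` is a hypothetical helper that is simpler than the parsing done later in this notebook.

```python
def parse_extraction(output):
    """Split a 'Triples: / New Entities:' response into triples and entity names."""
    triples_part, _, entities_part = output.partition("New Entities:")
    triples = []
    for line in triples_part.replace("Triples:", "").splitlines():
        parts = [p.strip() for p in line.split(",")]
        # Keep only well-formed lines with exactly three non-empty fields.
        if len(parts) == 3 and all(parts):
            triples.append(tuple(parts))
    entities = [line.strip() for line in entities_part.splitlines() if line.strip()]
    return triples, entities


# Hand-written stand-in for an LLM response.
sample_output = """Triples:
Isaac Newton, formulated, Laws of Motion
Isaac Newton, born in, Woolsthorpe

New Entities:
Isaac Newton
Laws of Motion
Woolsthorpe
"""

triples, entities = parse_extraction(sample_output)
print(triples)
print(entities)
```

Note that this simple comma split drops any line whose subject or object itself contains a comma, which is one reason a strict output format matters.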

In [None]:
# Create a prompt template for the LLM to extract triples and new entities from text.
dynamic_entity_prompt = PromptTemplate.from_template("""
You are an expert in information extraction. Analyze the following text and extract factual triples and new entities.

- Return the triples as plain lines in the format: Subject, Predicate, Object
- Do not use dashes, bullets, or numbering.
- Avoid extra punctuation like periods at the end.
- Return the new entities as a plain list, one per line.
- Do not include the triples in the entity list.

Text:
{input}

Known Entities:
{known}

Respond in this format exactly:

Triples:
Subject1, Predicate1, Object1
Subject2, Predicate2, Object2

New Entities:
Entity1
Entity2
Entity3
""")
dynamic_entity_chain = dynamic_entity_prompt | llm | StrOutputParser()

# Learner Prompt:
# - Why is it important to be very specific about the output format when working with LLMs?

## 🌐 Build Knowledge Graph from Triples
Once we have extracted triples (facts) from the text, we can build a knowledge graph. Each triple becomes a connection (edge) between two entities (nodes) in the graph.
This graph structure allows us to explore relationships and answer questions that would be hard to solve with plain text.

**Learner Prompt:**
- How might a knowledge graph help you find connections between historical figures that aren't obvious from reading text alone?
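
As an illustration of finding non-obvious connections, the following sketch searches for the shortest chain of entities linking two people. It uses a plain dictionary and breadth-first search rather than the notebook's `networkx` graph, and the triples and function names are illustrative assumptions.

```python
from collections import defaultdict, deque


def build_adjacency(triples):
    """Map each entity to its neighbors, ignoring edge direction."""
    adjacency = defaultdict(set)
    for subject, _predicate, obj in triples:
        adjacency[subject].add(obj)
        adjacency[obj].add(subject)
    return adjacency


def shortest_connection(triples, start, goal):
    """Breadth-first search for the shortest chain of entities from start to goal."""
    adjacency = build_adjacency(triples)
    if start not in adjacency or goal not in adjacency:
        return None
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:  # walk back to the start via parent links
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbor in sorted(adjacency[node]):
            if neighbor not in parents:
                parents[neighbor] = node
                queue.append(neighbor)
    return None


# Illustrative facts, not rows from the notebook's CSV.
triples = [
    ("Albert Einstein", "signed letter to", "Franklin D. Roosevelt"),
    ("Franklin D. Roosevelt", "authorized", "Manhattan Project"),
    ("J. Robert Oppenheimer", "directed", "Los Alamos Laboratory"),
    ("Los Alamos Laboratory", "part of", "Manhattan Project"),
]

path = shortest_connection(triples, "Albert Einstein", "J. Robert Oppenheimer")
print(" -> ".join(path))
```

No single triple mentions both people, yet the graph links them through intermediate entities.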

In [None]:
# Build the knowledge graph from the extracted triples.
import os
import time
import json
from networkx.readwrite import json_graph

# === CONFIG ===
REINDEX = True  # Set to False to skip LLM re-indexing if data exists
GRAPH_PATH = "historical_graph.json"
ENTITIES_PATH = "known_entities.json"

# === Setup containers ===
G = nx.MultiDiGraph()
all_triples = []
entity_usage = defaultdict(int)
known_entities = set()
MAX_ENTITIES_FOR_PROMPT = 40

if not REINDEX and os.path.exists(GRAPH_PATH) and os.path.exists(ENTITIES_PATH):
    print("🔁 Loading graph and entities from disk...")
    with open(GRAPH_PATH, "r", encoding="utf-8") as f:
        data = json.load(f)
        G = json_graph.node_link_graph(data, directed=True, multigraph=True)
    with open(ENTITIES_PATH, "r", encoding="utf-8") as f:
        known_entities = set(json.load(f))
    print(f"✅ Loaded graph with {len(G.nodes)} nodes and {len(G.edges)} edges.")
else:
    print("⚙️ Rebuilding graph from documents using LLM...")
    for i, doc in enumerate(documents):
        start = time.time()

        # Sort known entities by usage to prioritize important ones for the prompt.
        sorted_entities = sorted(known_entities, key=lambda e: -entity_usage[e])
        limited_entities = sorted_entities[:MAX_ENTITIES_FOR_PROMPT]
        known_str = ", ".join(limited_entities) if limited_entities else "(none)"

        # Use the LLM to extract triples and new entities from each document chunk.
        output = dynamic_entity_chain.invoke({"input": doc.page_content, "known": known_str})
        print(f"\n--- LLM Output for doc {i} ---\n{output.strip()}")

        # Split the output into triples and new entities.
        sections = output.strip().split("New Entities:")
        triples_block = sections[0].replace("Triples:", "").strip()
        new_entities_block = sections[1].strip() if len(sections) > 1 else ""

        triples = []
        for line in triples_block.split("\n"):
            if line.strip():
                parts = [p.strip(" ()\n") for p in line.split(",")]
                if len(parts) == 3:
                    triples.append(tuple(parts))
                else:
                    print(f"⚠️ Skipping malformed triple line: {line}")

        # Normalize new entities by stripping leading/trailing spaces and quotes
        new_entities = [e.lstrip("- ").strip(" \"'\n") for e in new_entities_block.split("\n") if e.strip()]

        all_triples.extend(triples)
        known_entities.update(new_entities)

        for s, r, o in triples:
            # Normalize subject and object by stripping leading hyphens and surrounding spaces
            s = s.lstrip("- ").strip()
            r = r.strip(" .-").lower()
            o = o.lstrip("- ").strip()
            G.add_node(s)
            G.add_node(o)
            G.add_edge(s, o, relation=r)
            # Track every graph node as a known entity so anchor detection can return names that exist in G
            known_entities.update((s, o))
            entity_usage[s] += 1
            entity_usage[o] += 1

        print(f"⏱️ Processed doc {i} in {round(time.time() - start, 2)}s")

    # === Save graph and entity index ===
    print("💾 Saving graph and entities to disk...")
    with open(GRAPH_PATH, "w", encoding="utf-8") as f:
        json.dump(json_graph.node_link_data(G), f, ensure_ascii=False, indent=2)
    with open(ENTITIES_PATH, "w", encoding="utf-8") as f:
        json.dump(sorted(list(known_entities)), f, ensure_ascii=False, indent=2)

    print(f"✅ Graph built with {len(G.nodes)} nodes, {len(G.edges)} edges, and {len(known_entities)} tracked entities.")

# Learner Prompt:
# - What are some advantages of storing the graph and entities to disk? When might you want to rebuild the graph from scratch?

In [None]:
# Function to search the graph for all nodes and edges within a certain depth from an anchor entity.
def search_graph(anchor_entity: str, depth: int = 2) -> nx.MultiDiGraph:
    if anchor_entity not in G:
        raise ValueError(f"Entity '{anchor_entity}' not found in the graph.")

    # Use BFS to find all nodes within the specified depth
    bfs_edges = nx.bfs_edges(G, source=anchor_entity, depth_limit=depth)
    nodes_in_scope = {anchor_entity}
    edges_in_scope = []

    for u, v in bfs_edges:
        nodes_in_scope.update([u, v])
        edges_in_scope.append((u, v))

    # Create a new subgraph from the collected nodes and edges
    subgraph = nx.MultiDiGraph()
    # Add the nodes first so the anchor is kept even if it has no outgoing edges
    subgraph.add_nodes_from(nodes_in_scope)
    for u, v in edges_in_scope:
        for key in G[u][v]:
            relation = G[u][v][key].get("relation", "")
            subgraph.add_edge(u, v, relation=relation)

    return subgraph

# Learner Prompt:
# - Why might you want to limit the depth of your search in a large graph? What could happen if you set the depth too high?

In [None]:
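# Alternative function: search the graph in an undirected way (ignoring edge direction).
# Note: this reuses the name `search_graph`, so running this cell replaces the directed version above.
def search_graph(anchor: str, depth: int = 2) -> nx.MultiDiGraph: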
    if anchor not in G:
        raise ValueError(f"Entity '{anchor}' not found in the graph.")

    undirected_G = G.to_undirected()
    visited_nodes = set([anchor])
    current_level = set([anchor])

    for _ in range(depth):
        next_level = set()
        for node in current_level:
            neighbors = undirected_G.neighbors(node)
            for neighbor in neighbors:
                if neighbor not in visited_nodes:
                    next_level.add(neighbor)
        visited_nodes.update(next_level)
        current_level = next_level

    subgraph = G.subgraph(visited_nodes).copy()
    return subgraph

# Learner Prompt:
# - What is the difference between a directed and undirected graph? When might you want to ignore edge direction?

## 📌 Anchor Detection from Question
When you ask a question, we need to figure out which entity in the graph is most relevant (the "anchor"). We'll use the LLM to match your question to the best entity.

**Learner Prompt:**
- Why do you think it's helpful to identify an anchor entity before searching the graph?
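
An LLM may return a name with different casing, extra quotes, or a small typo, so the result might not match any graph node exactly. The sketch below shows one possible safeguard using the standard library's `difflib`. It is a supplementary idea, not part of this notebook's pipeline; `resolve_anchor` and the `cutoff` value are assumptions chosen for illustration.

```python
import difflib


def resolve_anchor(candidate, known_entities, cutoff=0.6):
    """Map a model-returned name onto a known entity, or return None."""
    by_lower = {name.lower(): name for name in known_entities}
    cleaned = candidate.strip().strip("\"'").lower()
    if cleaned in by_lower:
        return by_lower[cleaned]  # exact match, ignoring case and quotes
    # Fall back to the most similar known name, if any is similar enough.
    close = difflib.get_close_matches(cleaned, list(by_lower), n=1, cutoff=cutoff)
    return by_lower[close[0]] if close else None


entities = ["Albert Einstein", "Marie Curie", "Manhattan Project"]
print(resolve_anchor('"albert einstein"', entities))
print(resolve_anchor("Albert Einstien", entities))
print(resolve_anchor("xyzzy", entities))
```

Returning `None` for an unmatched name lets the caller decide what to do, for example asking the user to rephrase the question.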

In [None]:
# Function to detect the main entity (anchor) in a user's question using the LLM.
def detect_anchor_entity(question: str) -> str:
    MAX_ENTITIES_FOR_ANCHOR_PROMPT = 40

    # Use global variables for known_entities and entity_usage
    sorted_entities = sorted(known_entities, key=lambda e: -entity_usage.get(e, 0))
    limited_entities = sorted_entities[:MAX_ENTITIES_FOR_ANCHOR_PROMPT]
    known_entities_list = "\n".join(f"- {e}" for e in limited_entities) if limited_entities else "(none)"

    # Pass the entity list as a template variable instead of an f-string,
    # so braces inside entity names cannot break the template.
    detect_prompt = PromptTemplate.from_template("""
You are a semantic matcher. Your task is to identify the main entity (person, event, or discovery) from the list below that the question is primarily about.

Only return the exact entity name from the list. Do not explain.

Known Entities:
{known}

Question: {question}
""")

    detect_chain = detect_prompt | llm | StrOutputParser()
    return detect_chain.invoke({"known": known_entities_list, "question": question}).strip()

# Learner Prompt:
# - What challenges might an LLM face when trying to match a question to the right entity?

In [None]:
# Visualize the subgraph around the detected anchor entity to see its connections.
import matplotlib.pyplot as plt  # For drawing the graph

def visualize_subgraph(query: str, depth: int = 2):
    anchor_entity = detect_anchor_entity(query)
    print(f"🔍 Detected anchor entity: {anchor_entity}")

    if anchor_entity not in G:
        print(f"⚠️ Entity '{anchor_entity}' not found in the graph.")
        return

    # Extract the subgraph based on depth
    visited = set()
    queue = [(anchor_entity, 0)]
    sub_nodes = set()

    while queue:
        current_node, current_depth = queue.pop(0)
        if current_depth > depth or current_node in visited:
            continue
        visited.add(current_node)
        sub_nodes.add(current_node)
        neighbors = list(G.successors(current_node)) + list(G.predecessors(current_node))
        for neighbor in neighbors:
            # A neighbor joins sub_nodes only when it is dequeued within the depth limit
            queue.append((neighbor, current_depth + 1))

    SG = G.subgraph(sub_nodes)

    # Visualize
    plt.figure(figsize=(10, 7))
    pos = nx.spring_layout(SG, seed=42)
    nx.draw_networkx_nodes(SG, pos, node_size=600, node_color="#FFEEEE")
    nx.draw_networkx_edges(SG, pos, arrows=True, arrowstyle='-|>', edge_color="#666")
    nx.draw_networkx_labels(SG, pos, font_size=10)
    edge_labels = {(u, v): data["relation"] for u, v, data in SG.edges(data=True)}
    nx.draw_networkx_edge_labels(SG, pos, edge_labels=edge_labels, font_size=9)
    plt.title(f"Subgraph for: '{anchor_entity}'", fontsize=14)
    plt.axis("off")
    plt.show()

# Learner Prompt:
# - How does visualizing a subgraph help you understand the relationships between entities?

## 🤔 Ask a Question Using the Graph
Now you can ask questions about historical figures or events. The system will use the knowledge graph to find relevant facts and help the LLM answer your question.

**Learner Prompt:**
- How might the answers you get from a knowledge graph differ from those you get from a search engine or a plain LLM?
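
The retrieval step can be pictured as: collect the facts near the anchor, turn them into sentences, and place them in front of the question. The sketch below approximates this with plain Python triples. The function names and sample facts are assumptions, and the neighborhood logic is a simplification of the notebook's `search_graph`.

```python
from collections import defaultdict


def neighborhood_facts(triples, anchor, depth=2):
    """Collect triples within `depth` hops of the anchor, ignoring edge direction."""
    touching = defaultdict(list)
    for triple in triples:
        touching[triple[0]].append(triple)
        touching[triple[2]].append(triple)

    seen_nodes = {anchor}
    frontier = {anchor}
    facts = []
    for _ in range(depth):
        next_frontier = set()
        for node in sorted(frontier):
            for triple in touching.get(node, []):
                if triple not in facts:
                    facts.append(triple)
                for endpoint in (triple[0], triple[2]):
                    if endpoint not in seen_nodes:
                        seen_nodes.add(endpoint)
                        next_frontier.add(endpoint)
        frontier = next_frontier
    return facts


def build_prompt(facts, question):
    """Format the facts as sentences and place them before the question."""
    context = "\n".join(f"{s} {p} {o}." for s, p, o in facts)
    return f"Answer the question based on these facts:\n\n{context}\n\nQuestion: {question}"


# Illustrative facts, not rows from the notebook's CSV.
triples = [
    ("Albert Einstein", "developed", "Theory of Relativity"),
    ("Albert Einstein", "signed letter to", "Franklin D. Roosevelt"),
    ("Franklin D. Roosevelt", "authorized", "Manhattan Project"),
    ("Manhattan Project", "produced", "Atomic Bomb"),
]

facts = neighborhood_facts(triples, "Albert Einstein", depth=2)
print(build_prompt(facts, "How was Einstein connected to the Manhattan Project?"))
```

Increasing `depth` pulls in more distant facts, which gives the LLM more context but also more material that may be irrelevant to the question.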

In [None]:
# Use the knowledge graph to answer a question by extracting relevant facts and passing them to the LLM.
def answer_question_with_graph(query: str, depth: int = 2):
    anchor_entity = detect_anchor_entity(query)
    print(f"🔍 Detected anchor entity: {anchor_entity}")

    subgraph = search_graph(anchor_entity, depth=depth)
    context = [f"{u} {data['relation']} {v}." for u, v, data in subgraph.edges(data=True)]
    context_text = "\n".join(context)

    response = llm.invoke(f"Answer the question based on these facts:\n\n{context_text}\n\nQuestion: {query}")

    return context, response

# Learner Prompt:
# - What are the advantages of providing the LLM with a focused set of facts from the graph, instead of the entire dataset?

## 🚀 Try It Out
Now it's your turn! Try asking your own questions about the historical data. Experiment with different questions and see how the graph and LLM work together to provide answers.

**Learner Prompt:**
- What kinds of questions work best with this system? What are its limitations?

In [None]:
# Example: Ask a question about a historical figure's contribution.
question = "How did Albert Einstein contribute to the development of the atomic bomb?"
# Set the depth for the subgraph search
depth = 2
retrieved_chunks, response = answer_question_with_graph(question, depth)

print("\nRetrieved Chunks:")
for line in retrieved_chunks:
    print("-", line)

print("\nAnswer:")
print(response.content)  # llm.invoke returns a message object; .content holds the answer text

visualize_subgraph(question, depth)

# Learner Prompt:
# - Try changing the question or the depth. What do you notice about the retrieved facts and the answer?

In [None]:
# Print all nodes (entities) in the knowledge graph.
print(list(G.nodes))

# Learner Prompt:
# - How many entities are in your graph? Are there any you didn't expect to see?

In [None]:
# Print all edges related to 'relativity' to explore specific relationships in the graph.
for u, v, data in G.edges(data=True):
    if "relativity" in u.lower() or "relativity" in v.lower():
        print(f"{u} --{data['relation']}--> {v}")

# Learner Prompt:
# - Try searching for other keywords or relationships. What patterns or connections do you discover?