# Generating DAGs from Wikipedia articles with LangChain

## Introduction
Yesterday I was bored, and I had already done some work generating protein association networks with Cytoscape and STRING.
I wondered whether this could be extended to other domains by using LLMs to generate JSON graph data.
I tried it on a paragraph from an article about the [Muslim Brotherhood](https://en.wikipedia.org/wiki/Muslim_Brotherhood), since I had to write a final paper on it for my first-year seminar at UCLA, and it yielded somewhat interesting results.
I was able to import the JSON into Cytoscape and run an analysis, and now I want to investigate whether this is feasible on a larger scale.


## Load Wikipedia article
We'll use LangChain's `WikipediaLoader` to fetch Wikipedia articles.
This way, we can retrieve articles programmatically and chunk them later.

In [1]:
from langchain_community.document_loaders import WikipediaLoader

In [2]:
docs = WikipediaLoader(query="Muslim Brotherhood", load_max_docs=1).load()
docs[0]

Document(metadata={'title': 'Muslim Brotherhood', 'summary': 'The Society of the Muslim Brothers (Arabic: جماعة الإخوان المسلمين Jamāʿat al-Ikhwān al-Muslimīn), better known as the Muslim Brotherhood (الإخوان المسلمون al-Ikhwān al-Muslimūn), is a transnational Sunni Islamist organization founded in Egypt by Islamic scholar, Imam and schoolteacher Hassan al-Banna in 1928. Al-Banna\'s teachings spread far beyond Egypt, influencing various Islamist movements from charitable organizations to political parties.\nInitially, as a Pan-Islamic, religious, and social movement, it preached Islam in Egypt, taught the illiterate, and set up hospitals and business enterprises. It later advanced into the political arena, aiming to end British colonial control of Egypt. The movement\'s self-stated aim is the establishment of a state ruled by sharia law under a caliphate–its most famous slogan is "Islam is the solution". Charity is a major aspect of its work.\nThe group spread to other Muslim countries

In [3]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

article_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1200,
    chunk_overlap = 200
)

split_article = article_splitter.split_documents(docs)
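
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal plain-Python sketch of a fixed-width sliding window. It is a simplification and not how `RecursiveCharacterTextSplitter` works internally: the real splitter first breaks on separators such as paragraphs and lines, so its chunks usually end at natural boundaries and are often shorter than `chunk_size`. The function name `sliding_chunks` is made up for this illustration.

```python
def sliding_chunks(text: str, chunk_size: int = 1200, chunk_overlap: int = 200) -> list[str]:
    """Cut text into fixed-width windows; neighbours share `chunk_overlap` characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghij" * 300  # 3000 characters of stand-in text
chunks = sliding_chunks(sample)
print([len(c) for c in chunks])          # [1200, 1200, 1000]
print(chunks[0][-200:] == chunks[1][:200])  # True: the overlap is shared
```

The overlap matters for causal extraction because a cause and its effect can sit on either side of a chunk boundary; repeating 200 characters gives the model a chance to see both in one chunk.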

## Defining the schema

For our DAG, we need to define schemas for nodes and edges. Nodes should be event-centric but minimal, and edges should be causal and backed by evidence.

In [4]:
# Minimal schemas for a temporal causal graph (Cytoscape-ready)

from enum import Enum
from typing import List, Optional, Dict, Any, Tuple, Literal
from pydantic import BaseModel, Field, constr, confloat
from datetime import date


# --- Enums ---

class NodeType(str, Enum):
    Event = "Event"
    Person = "Person"
    Organization = "Organization"
    Place = "Place"
    Concept = "Concept"


class RelationType(str, Enum):
    causes = "causes"
    leads_to = "leads_to"
    enables = "enables"
    triggers = "triggers"
    prevents = "prevents"
    mitigates = "mitigates"


Polarity = Literal[1, -1]  # +1 promotes/enables; -1 inhibits/prevents


# --- Evidence & Provenance (minimal) ---

class Evidence(BaseModel):
    quote: constr(strip_whitespace=True, min_length=1)
    citation_ids: List[str] = Field(default_factory=list)
    section: Optional[str] = None
    char_spans: Optional[Tuple[int, int]] = None


class Provenance(BaseModel):
    article_id: Optional[str] = None
    revision_id: Optional[str] = None
    run_id: Optional[str] = None


# --- Core graph models ---

class EventNode(BaseModel):
    id: constr(strip_whitespace=True, min_length=1)
    label: constr(strip_whitespace=True, min_length=1)
    type: NodeType = NodeType.Event
    time_start: Optional[date] = None
    time_end: Optional[date] = None
    wikidata_id: Optional[str] = None
    provenance: Optional[Provenance] = None
    meta: Dict[str, Any] = Field(default_factory=dict)


class CausalEdge(BaseModel):
    id: constr(strip_whitespace=True, min_length=1)
    source: constr(strip_whitespace=True, min_length=1)
    target: constr(strip_whitespace=True, min_length=1)
    relation_type: RelationType
    polarity: Polarity = 1
    confidence: confloat(ge=0.0, le=1.0) = 0.5
    lag_days: Optional[int] = None
    evidence: List[Evidence] = Field(default_factory=list)
    temporal_valid: bool = True
    provisional: bool = False
    meta: Dict[str, Any] = Field(default_factory=dict)


# --- Cytoscape export wrappers (minimal) ---

class CyNode(BaseModel):
    data: EventNode


class CyEdge(BaseModel):
    data: CausalEdge


class CyElements(BaseModel):
    nodes: List[CyNode] = Field(default_factory=list)
    edges: List[CyEdge] = Field(default_factory=list)
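
Nothing in these schemas forces the extracted graph to be acyclic, so an LLM could return `A causes B` and `B causes A` at the same time. One possible check, sketched here with Kahn's topological sort over plain `(source, target)` pairs, is to see which nodes never reach in-degree zero. The helper name `find_cycle_nodes` is an assumption for illustration and is not part of the notebook.

```python
from collections import defaultdict, deque


def find_cycle_nodes(edges: list[tuple[str, str]]) -> set[str]:
    """Return nodes that sit on, or downstream of, a cycle. Empty set means DAG."""
    successors = defaultdict(list)
    in_degree = defaultdict(int)
    nodes = set()
    for source, target in edges:
        successors[source].append(target)
        in_degree[target] += 1
        nodes.update((source, target))

    # Kahn's algorithm: repeatedly remove nodes with no remaining incoming edges.
    queue = deque(n for n in nodes if in_degree[n] == 0)
    visited = set()
    while queue:
        node = queue.popleft()
        visited.add(node)
        for nxt in successors[node]:
            in_degree[nxt] -= 1
            if in_degree[nxt] == 0:
                queue.append(nxt)
    return nodes - visited


acyclic = [("evt_a", "evt_b"), ("evt_b", "evt_c")]
cyclic = acyclic + [("evt_c", "evt_a")]
print(find_cycle_nodes(acyclic))  # set()
print(sorted(find_cycle_nodes(cyclic)))  # ['evt_a', 'evt_b', 'evt_c']
```

For a parsed `CyElements` object, the edge pairs could be built as `[(e.data.source, e.data.target) for e in elements.edges]`.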


## Creating extractor
Using the schemas defined above, we're going to build an extraction chain with Gemini. It'll take our document and parse it into CyElements, which can be imported into Cytoscape later.

In [15]:
# PydanticOutputParser and ChatPromptTemplate live in langchain_core in current releases
from langchain_core.output_parsers import PydanticOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain.chat_models import init_chat_model
parser = PydanticOutputParser(pydantic_object=CyElements)
format_instructions = parser.get_format_instructions()

SYSTEM = """You are extracting a causal, temporal graph from the provided document.
Goal: Return a Cytoscape-ready JSON object with nodes and edges.

Rules:
- Only include nodes that are concrete events/entities referenced in the document.
- Only include causal edges (cause → effect). Use relation_type from the allowed set.
- Set polarity: +1 for promotes/enables/causes; -1 for prevents/mitigates.
- Provide at least one evidence.quote per edge. Include citation_ids/section if visible.
- Use concise labels; include a year if present (e.g., "(1914)").
- Generate unique, stable ids. For nodes: 'evt_<slug>' etc. For edges: 'e_<src>_<dst>_<relation>'.
- If dates are known, populate time_start/time_end as strings (YYYY-MM-DDT00:00:00Z). If not possible, leave as null. You're not going to know hour minute second so leave those as zero. If day not known, leave as 01 since 00 not possible.
- temporal_valid should be true only if time_start(source) < time_start(target), or unknown.
- Return ONLY JSON that matches the schema.
"""

USER = """Document title: {title}
Optional context: {context}
Full text:
{doc}

Output format (must follow exactly):
{format_instructions}
"""

prompt = ChatPromptTemplate.from_messages([
    ("system", SYSTEM),
    ("user", USER)
])
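
The `temporal_valid` rule in the prompt is simple enough to recompute after extraction instead of trusting the model's own flag. Below is a small sketch of that rule on plain `date` values; the function name `is_temporally_valid` is hypothetical, and the dates in the demo are placeholders rather than facts from the article.

```python
from datetime import date
from typing import Optional


def is_temporally_valid(source_start: Optional[date], target_start: Optional[date]) -> bool:
    """True if the cause starts strictly before the effect, or if either date is unknown."""
    if source_start is None or target_start is None:
        return True  # the prompt treats unknown dates as valid
    return source_start < target_start


print(is_temporally_valid(date(1928, 1, 1), date(1948, 1, 1)))  # True
print(is_temporally_valid(date(1948, 1, 1), date(1928, 1, 1)))  # False
print(is_temporally_valid(None, date(1928, 1, 1)))              # True
```

Edges where this recomputed value disagrees with the model's `temporal_valid` are good candidates for manual review.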


In [16]:
# Example: OpenAI-compatible; replace with your provider
# from langchain_openai import ChatOpenAI
# llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# Or your initialized model:
llm = init_chat_model("gemini-2.5-flash", model_provider="google_genai", temperature=0)

messages = prompt.format_messages(
    title=docs[0].metadata["title"],  # pass the real article title instead of a placeholder
    context="Historical cause-effect extraction.",
    doc=docs[0].page_content,
    format_instructions=format_instructions
)

raw = llm.invoke(messages)

# Parse with Pydantic
try:
    cy: CyElements = parser.parse(raw.content)
except Exception as e:
    # Optional: retry with a corrective system note or simpler schema
    raise
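
The retry mentioned in the comment above could look roughly like the sketch below. It is provider-agnostic on purpose: `generate` and `parse` are hypothetical callables passed in by the caller, and the demo uses a stub generator and `json.loads` in place of a real model and the Pydantic parser.

```python
import json
from typing import Callable, TypeVar

T = TypeVar("T")


def parse_with_retry(
    generate: Callable[[str], str],
    parse: Callable[[str], T],
    max_attempts: int = 3,
) -> T:
    """Call `generate`, try to `parse` its output, and feed the error back on failure."""
    feedback = ""  # empty on the first attempt
    last_error = None
    for _ in range(max_attempts):
        text = generate(feedback)
        try:
            return parse(text)
        except Exception as exc:
            last_error = exc
            feedback = f"Your previous output was rejected: {exc}. Return only valid JSON."
    raise RuntimeError(f"no valid output after {max_attempts} attempts") from last_error


# Stub generator: fails once, then returns valid JSON.
replies = iter(["not json", '{"nodes": [], "edges": []}'])
result = parse_with_retry(lambda feedback: next(replies), json.loads)
print(result)  # {'nodes': [], 'edges': []}
```

In the notebook, `generate` might wrap `llm.invoke` with the feedback appended as an extra message, and `parse` might be `parser.parse`; both are left to the reader since the right wiring depends on the provider.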


In [21]:
import json
from datetime import datetime, timezone

def save_cyjs(cy_elements, path, app_name="cause_effect_extractor"):
    """Write the extracted nodes and edges to a Cytoscape .cyjs file."""
    payload = {
        "format_version": "1.0",
        "generated_by": app_name,
        "target_cytoscape_version": "~3.9",
        # datetime.utcnow() is deprecated since Python 3.12; use an aware UTC timestamp
        "creationTime": datetime.now(timezone.utc).isoformat().replace("+00:00", "Z"),
        "elements": {
            "nodes": [n.model_dump(mode="json") for n in cy_elements.nodes],
            "edges": [e.model_dump(mode="json") for e in cy_elements.edges],
        },
    }
    with open(path, "w", encoding="utf-8") as f:
        json.dump(payload, f, ensure_ascii=False, indent=2)



save_cyjs(cy, "brotherhood.cyjs")
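
Before importing the file into Cytoscape, it may be worth checking that every edge points at nodes that actually exist, since the model can invent a `source` or `target` id. The sketch below assumes the payload shape written by `save_cyjs` (`elements.nodes[].data.id`, `elements.edges[].data.source/target`); the helper name `dangling_edges` and the demo ids are made up for illustration.

```python
def dangling_edges(payload: dict) -> list[str]:
    """Return ids of edges whose source or target is not among the node ids."""
    elements = payload["elements"]
    node_ids = {n["data"]["id"] for n in elements["nodes"]}
    return [
        e["data"]["id"]
        for e in elements["edges"]
        if e["data"]["source"] not in node_ids or e["data"]["target"] not in node_ids
    ]


# In-memory stand-in for a saved .cyjs payload.
demo = {
    "elements": {
        "nodes": [{"data": {"id": "evt_a"}}, {"data": {"id": "evt_b"}}],
        "edges": [
            {"data": {"id": "e_evt_a_evt_b_causes", "source": "evt_a", "target": "evt_b"}},
            {"data": {"id": "e_evt_a_evt_x_causes", "source": "evt_a", "target": "evt_x"}},
        ],
    }
}
print(dangling_edges(demo))  # ['e_evt_a_evt_x_causes']
```

To run it on the saved file, load it first with `json.load(open("brotherhood.cyjs", encoding="utf-8"))` and pass the result in.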
