# Week 2 - Part 2: Summary Agent (LangChain 1.0 + OpenAI) - Local Summary Search

This notebook uses **LangChain 1.0's `create_agent` API** and **OpenAI** to implement an agent with three tools:
1. `fetch_web_page(url)` - fetches and cleans page text.
2. `save_summary(url, summary)` - validates (Pydantic) and saves a summary to `summaries/` as JSON.
3. `search_summaries(query, k)` - searches previously saved summaries (no web calls) and returns the most relevant records.

## Question 4 - Framework and LLM Provider

- Framework: **LangChain 1.0** (`create_agent`)
- LLM Provider: **OpenAI** via `langchain-openai`

We define the tools, write a system prompt with the requested rules, and enforce structured output (Pydantic) for final answers.

In [None]:
import os, re, json, pathlib, datetime
from typing import List, Dict, Optional

import requests
from bs4 import BeautifulSoup
import itertools

from pydantic import BaseModel, HttpUrl, Field
from pydantic import field_validator as pyd_field_validator 

from langchain_openai import ChatOpenAI
from langchain.agents import create_agent
from langchain.tools import tool
from langchain.agents.structured_output import ProviderStrategy

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")

# Create summaries directory
SUM_DIR = pathlib.Path("summaries")
SUM_DIR.mkdir(exist_ok=True)

# Utility: safe filename from URL
def slugify_url(url: str) -> str:
    slug = re.sub(r'[^a-zA-Z0-9\-_.]+', '_', url.strip())
    return slug[:180]

# Log tool calls
TOOL_CALLS: List[str] = []
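
The filename rule in `slugify_url` is lossy: any run of disallowed characters becomes a single `_`, so two different URLs can map to the same file and the later summary overwrites the earlier one. The sketch below reproduces the notebook's rule and shows one possible way around the collision. `slugify_url_unique` and the `example.org` URLs are hypothetical and not part of the notebook.

```python
import hashlib
import re

def slugify_url(url: str) -> str:
    # Same rule as the notebook: collapse disallowed runs into "_", cap at 180 chars.
    slug = re.sub(r"[^a-zA-Z0-9\-_.]+", "_", url.strip())
    return slug[:180]

def slugify_url_unique(url: str) -> str:
    # Hypothetical variant: append a short hash of the full URL so that
    # distinct URLs are very unlikely to share a filename.
    digest = hashlib.sha256(url.strip().encode("utf-8")).hexdigest()[:8]
    return f"{slugify_url(url)[:170]}-{digest}"

a = "https://example.org/a?b"
b = "https://example.org/a/b"

print(slugify_url("https://en.wikipedia.org/wiki/Capybara"))
print(slugify_url(a) == slugify_url(b))                # the plain slugs collide
print(slugify_url_unique(a) == slugify_url_unique(b))  # the hashed slugs differ
```

For the Wikipedia pages indexed in this notebook the plain slug is sufficient; the hashed variant only matters once query strings or fragments are involved.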

## Pydantic Models
We validate saved summaries (80-240 words, truncated if longer) and the format of the agent's final answer.

In [None]:
class SummaryRecord(BaseModel):
    url: HttpUrl
    title: Optional[str] = None
    summary: str = Field(..., min_length=20, max_length=10000)
    timestamp: str

    @pyd_field_validator("summary")
    @classmethod
    def check_word_count(cls, v: str) -> str:
        # Truncate over-long summaries to 240 words; reject anything under 80.
        words = v.split()
        if len(words) > 240:
            words = words[:240]
            v = " ".join(words)
        if len(words) < 80:
            raise ValueError("Summary must be at least 80 words (target 80-240).")
        return v

class SearchHit(BaseModel):
    url: HttpUrl
    title: Optional[str] = None
    summary: str
    score: float

class QAResponse(BaseModel):
    answer: str = Field(..., min_length=10)
    sources: List[str] = Field(default_factory=list, description="List of URLs used as sources")
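
The word-count rule in `SummaryRecord` is asymmetric: a draft that is too long is truncated to 240 words without an error, while a draft that is too short is rejected. The sketch below mirrors that logic in plain Python so it can be tried without Pydantic. `clamp_summary` is a hypothetical stand-in for the validator, not a function from the notebook.

```python
def clamp_summary(text: str, min_words: int = 80, max_words: int = 240) -> str:
    """Hypothetical stand-alone mirror of SummaryRecord.check_word_count."""
    words = text.split()
    if len(words) > max_words:
        words = words[:max_words]   # long drafts are cut, not rejected
        text = " ".join(words)
    if len(words) < min_words:
        raise ValueError(f"Summary must be at least {min_words} words.")
    return text

long_draft = " ".join(["word"] * 300)
short_draft = " ".join(["word"] * 79)

print(len(clamp_summary(long_draft).split()))  # truncated to the 240-word cap

try:
    clamp_summary(short_draft)
except ValueError as exc:
    print("rejected:", exc)
```

Because truncation is silent, a saved summary may end mid-sentence; the system prompt asks the model to shorten drafts itself so this path is rarely hit.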

## Tools
We define three tools using the `@tool` decorator and log tool usage.

In [None]:
def _log_tool(name: str):
    TOOL_CALLS.append(name)

def _clean_page_text(html: str) -> str:
    soup = BeautifulSoup(html, "lxml")
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()
    text = soup.get_text(separator="\n")
    text = re.sub(r"\n{2,}", "\n\n", text).strip()
    return text[:20000]

@tool
def fetch_web_page(url: str) -> str:
    """Fetch and clean the content of a web page at a given URL. Use when the user provides a URL you should index."""
    _log_tool("fetch_web_page")
    headers = {"User-Agent": "Mozilla/5.0 (HomeworkAgent/1.0)"}
    r = requests.get(url, headers=headers, timeout=20)
    r.raise_for_status()
    return _clean_page_text(r.text)

@tool
def save_summary(url: str, summary: str) -> str:
    """Validate and save a concise summary (80-240 words) for a URL to the local summaries/ folder as JSON."""
    _log_tool("save_summary")
    # Try to capture title
    title = None
    try:
        headers = {"User-Agent": "Mozilla/5.0 (HomeworkAgent/1.0)"}
        r = requests.get(url, headers=headers, timeout=20)
        r.raise_for_status()
        soup = BeautifulSoup(r.text, "lxml")
        t = soup.find("title")
        if t and t.text:
            title = t.text.strip()
    except Exception:
        pass  # The title is optional; save the summary even if the title lookup fails.

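    payload = {
        "url": url,
        "title": title,
        "summary": summary.strip(),
        # Timezone-aware UTC timestamp; datetime.utcnow() is deprecated since Python 3.12.
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat().replace("+00:00", "Z"),
    }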
    record = SummaryRecord(**payload)  # Pydantic validation

    out_path = SUM_DIR / f"{slugify_url(url)}.json"
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(record.model_dump(mode="json"), f, ensure_ascii=False, indent=2)
    return f"Saved: {out_path.as_posix()}"

def _load_all_summaries() -> List[SummaryRecord]:
    items: List[SummaryRecord] = []
    for p in SUM_DIR.glob("*.json"):
        try:
            with open(p, "r", encoding="utf-8") as f:
                data = json.load(f)
            items.append(SummaryRecord(**data))
        except Exception:
            continue
    return items

def _score(text: str, query: str) -> float:
    q_tokens = [t for t in re.split(r"\W+", query.lower()) if t]
    t_tokens = [t for t in re.split(r"\W+", text.lower()) if t]
    if not q_tokens or not t_tokens:
        return 0.0
    hits = sum(t_tokens.count(qt) for qt in q_tokens)
    return float(hits) / (len(t_tokens) + 1)

@tool
def search_summaries(query: str, k: int = 5) -> str:
    """Search previously saved summaries for the most relevant items and return top-k as JSON."""
    _log_tool("search_summaries")
    records = _load_all_summaries()
    scored = []
    for rec in records:
        score = _score((rec.title or "") + " " + rec.summary, query)
        if score > 0:
            scored.append(SearchHit(url=rec.url, title=rec.title, summary=rec.summary, score=score))
    scored.sort(key=lambda x: x.score, reverse=True)
    top = scored[: max(1, min(k, 10))]
    return json.dumps([h.model_dump(mode="json") for h in top], ensure_ascii=False, indent=2)

TOOLS = [fetch_web_page, save_summary, search_summaries]
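
To make the ranking behind `search_summaries` concrete, here is a minimal stand-alone sketch of the same formula: total occurrences of the query terms divided by the token count plus one. It swaps the repeated `list.count` calls for a `collections.Counter`, which gives the same numbers in a single pass. The names `score` and `clamp_k` and the sample documents are made up for illustration.

```python
import re
from collections import Counter

def score(text: str, query: str) -> float:
    # Same formula as _score: query-term occurrences / (token count + 1).
    q_tokens = [t for t in re.split(r"\W+", query.lower()) if t]
    t_tokens = [t for t in re.split(r"\W+", text.lower()) if t]
    if not q_tokens or not t_tokens:
        return 0.0
    counts = Counter(t_tokens)  # one pass instead of list.count per query term
    hits = sum(counts[qt] for qt in q_tokens)
    return hits / (len(t_tokens) + 1)

def clamp_k(k: int) -> int:
    # Same bounds as search_summaries: at least 1 result, at most 10.
    return max(1, min(k, 10))

docs = {
    "short": "capybara threats include hunting",
    "long": "capybara " + "filler " * 20 + "hunting",
    "none": "unrelated text about rivers",
}
query = "capybara hunting"

ranked = sorted(docs, key=lambda name: score(docs[name], query), reverse=True)
for name in ranked:
    print(name, round(score(docs[name], query), 3))
print(clamp_k(0), clamp_k(5), clamp_k(50))
```

Dividing by the length means a short summary with two matches outranks a long one with the same two matches. The score is purely lexical, so a query for "threats" will not match a summary that only says "hunting pressures".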

## Agent Instructions
The rules make the agent index URLs one at a time and answer questions only from locally saved summaries.

In [None]:
AGENT_INSTRUCTIONS = """
You are a focused research assistant that uses tools to index web pages and answer questions from saved summaries.

RULES:
1) If the user asks to index/save pages (one or more URLs present), process them SEQUENTIALLY:
   For each URL:
   - Call fetch_web_page(url).
   - Write an 80-240 word summary.
   - Call save_summary(url, summary). If it fails validation, revise the summary and try ONCE more, then continue to the next URL.
2) If the user asks a question (no new URLs to index), first call search_summaries(query, k=5) to retrieve the most relevant saved summaries, then answer ONLY from those summaries. Cite the specific URLs you used.
3) Keep saved summaries 80-240 words. If your draft exceeds 240 words, shorten to ~200–240 words.
4) Be concise, neutral, and avoid speculation.

TOOLS:
- fetch_web_page(url: str) -> str
- save_summary(url: str, summary: str) -> str
- search_summaries(query: str, k: int = 5) -> str

OUTPUT:
- Return a structured response with fields: answer (string) and sources (list of URLs).
"""
print(AGENT_INSTRUCTIONS)


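
Rule 1 describes a control flow that the model is expected to follow on its own: fetch, summarize, save, retry once on a validation failure, then move on. In the notebook that loop lives in the prompt, not in Python. The sketch below spells out the same flow as ordinary code, purely to make the intended behavior explicit. `index_urls` and the `fake_*` stubs are hypothetical stand-ins for the model and the tools.

```python
from typing import Callable, List

def index_urls(
    urls: List[str],
    fetch: Callable[[str], str],
    summarize: Callable[[str, int], str],
    save: Callable[[str, str], None],
) -> List[str]:
    """Process URLs one at a time; retry a failed save once, then move on."""
    log: List[str] = []
    for url in urls:
        text = fetch(url)
        for attempt in (1, 2):  # first try plus ONE retry
            summary = summarize(text, attempt)
            try:
                save(url, summary)
            except ValueError:
                if attempt == 2:
                    log.append(f"skipped {url}")
                continue
            log.append(f"saved {url} on attempt {attempt}")
            break
    return log

def fake_fetch(url: str) -> str:
    return f"page text for {url}"

def fake_summarize(text: str, attempt: int) -> str:
    # The first draft is too short; the revision meets the 80-word minimum.
    return " ".join(["word"] * (40 if attempt == 1 else 100))

def fake_save(url: str, summary: str) -> None:
    if len(summary.split()) < 80:
        raise ValueError("Summary must be at least 80 words")

urls = ["https://example.org/a", "https://example.org/b"]
log = index_urls(urls, fake_fetch, fake_summarize, fake_save)
for line in log:
    print(line)
```

Since the real loop is enforced only by instructions, the agent can deviate from it; checking the tool log after a run is the practical way to confirm the order.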

## Build the Agent

In [None]:
model = ChatOpenAI(model="gpt-4o-mini", temperature=0)

agent = create_agent(
    model=model,
    tools=TOOLS,
    system_prompt=AGENT_INSTRUCTIONS,
    response_format=ProviderStrategy(QAResponse)
)

def extract_tool_calls(messages) -> List[str]:
    names = []
    for msg in messages or []:
        tc = getattr(msg, "tool_calls", None)
        if tc:
            for call in tc:
                name = call.get("name") if isinstance(call, dict) else getattr(call, "name", None)
                if name:
                    names.append(name)
    return names

def run_agent_user_prompt(text: str):
    TOOL_CALLS.clear()
    try:
        result = agent.invoke(
            {"messages": [{"role": "user", "content": text}]},
            config={"recursion_limit": 60}  # headroom
        )
    except Exception as e:
        print("Agent error:", e)
        if TOOL_CALLS:
            print("Tools used before error:", TOOL_CALLS)
        raise
    messages = result.get("messages") if isinstance(result, dict) else None
    final_text = ""
    if messages:
        last = messages[-1]
        final_text = getattr(last, "content", str(last))
    structured = result.get("structured_response") if isinstance(result, dict) else None
    tools_from_messages = extract_tool_calls(messages or [])
    tools_used = list(dict.fromkeys(TOOL_CALLS + tools_from_messages))
    return final_text, tools_used, structured
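
`run_agent_user_prompt` reports each tool name once, in first-use order, because it passes the combined call list through `dict.fromkeys`. The sketch below exercises that logic on stand-in message objects built with `types.SimpleNamespace`. The messages are invented for illustration and only mimic the `tool_calls` attribute that the helper looks for.

```python
from types import SimpleNamespace
from typing import List

def extract_tool_calls(messages) -> List[str]:
    # Same logic as the notebook helper, condensed slightly.
    names = []
    for msg in messages or []:
        for call in getattr(msg, "tool_calls", None) or []:
            name = call.get("name") if isinstance(call, dict) else getattr(call, "name", None)
            if name:
                names.append(name)
    return names

messages = [
    SimpleNamespace(content="index these pages"),  # no tool_calls attribute
    SimpleNamespace(content="", tool_calls=[{"name": "fetch_web_page", "args": {}}]),
    SimpleNamespace(content="", tool_calls=[{"name": "save_summary"}, {"name": "fetch_web_page"}]),
]

called = extract_tool_calls(messages)
unique = list(dict.fromkeys(called))  # drop repeats, keep first-use order

print(called)
print(unique)
```

This is why a run that indexes several pages still lists `fetch_web_page` and `save_summary` only once each. To count calls per tool, keep the raw list instead of deduplicating it.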


## Question 5 - Capybara Page

In [None]:
q5_user_prompt = "What is this page about? https://en.wikipedia.org/wiki/Capybara"
final_text, tools_used, structured = run_agent_user_prompt(q5_user_prompt)

print("\n=== Q5: Tools Used ===")
for t in tools_used:
    print(" -", t)

print("\n=== Q5: Structured Answer ===")
print(structured if structured else final_text)

print("\nSaved summaries:")
for p in sorted(SUM_DIR.glob("*.json")):
    print(" -", p.as_posix())


=== Q5: Tools Used ===
 - fetch_web_page
 - save_summary

=== Q5: Structured Answer ===
answer='The capybara (Hydrochoerus hydrochaeris) is the largest rodent species, native to South America. It inhabits savannas and dense forests near water bodies, living in social groups of 10-20 individuals, sometimes up to 100. Capybaras are herbivores, primarily grazing on grasses and aquatic plants, and are known for their excellent swimming abilities. They communicate through various vocalizations and have a unique social structure, with dominant males protecting females. While not currently threatened, they face hunting pressures in some regions. Capybaras have adapted well to urban environments and are often found in zoos. Their meat is consumed in some cultures, particularly during Lent in Venezuela.' sources=['https://en.wikipedia.org/wiki/Capybara']

Saved summaries:
 - summaries/https_en.wikipedia.org_wiki_Capybara.json


## Question 6 - Index Related Pages + Search Summaries

In [None]:
pages = [
    "https://en.wikipedia.org/wiki/Lesser_capybara",
    "https://en.wikipedia.org/wiki/Hydrochoerus",
    "https://en.wikipedia.org/wiki/Neochoerus",
    "https://en.wikipedia.org/wiki/Caviodon",
    "https://en.wikipedia.org/wiki/Neochoerus_aesopi",
]

q6_user_prompt = (
    "Index the following pages by fetching each and then saving a concise summary for each:\n"
    + "\n".join(pages)
    + "\n\nThen answer this:\n"
    "What are threats to capybara populations? Provide a short synthesis and cite the specific pages you used."
)

final_text, tools_used, structured = run_agent_user_prompt(q6_user_prompt)

print("\n=== Q6: Tools Used ===")
for t in tools_used:
    print(" -", t)

print("\n=== Q6: Structured Answer ===")
print(structured if structured else final_text)

print("\nSaved summaries now:")
for p in sorted(SUM_DIR.glob("*.json")):
    print(" -", p.as_posix())


=== Q6: Tools Used ===
 - fetch_web_page
 - save_summary
 - search_summaries

=== Q6: Structured Answer ===
answer='Capybara populations face several threats, primarily from hunting and habitat loss. While the capybara (Hydrochoerus hydrochaeris) is not currently classified as threatened, it experiences hunting pressures in certain regions, particularly for its meat, which is consumed in various cultures, especially during Lent in Venezuela. Additionally, habitat destruction due to agricultural expansion and urban development poses significant risks to their natural environments. The Lesser Capybara (Hydrochoerus isthmius) is classified as Data Deficient by the IUCN, indicating insufficient data on its population status, which may also reflect underlying threats to its survival. Overall, while capybaras have adapted well to some urban environments, ongoing threats from hunting and habitat degradation remain critical concerns for their populations.' sources=['https://en.wikipedia.org/w

## Inspect Saved Summaries

In [None]:
def preview(path, n=40):
    # Print the file name and roughly the first n lines (n * 80 characters).
    with open(path, "r", encoding="utf-8") as f:
        txt = f.read()
    print(path.name, "\n", txt[: n * 80], "\n")

all_jsons = sorted(SUM_DIR.glob("*.json"))
for p in itertools.islice(all_jsons, 0, 5):
    preview(p)

https_en.wikipedia.org_wiki_Capybara.json 
 {
  "url": "https://en.wikipedia.org/wiki/Capybara",
  "title": "Capybara - Wikipedia",
  "summary": "The capybara (Hydrochoerus hydrochaeris) is the largest rodent species, native to South America. It inhabits savannas and dense forests near water bodies, living in social groups of 10-20 individuals, sometimes up to 100. Capybaras are herbivores, primarily grazing on grasses and aquatic plants, and are known for their excellent swimming abilities. They communicate through various vocalizations and have a unique social structure, with dominant males protecting females. While not currently threatened, they face hunting pressures in some regions. Capybaras have adapted well to urban environments and are often found in zoos. Their meat is consumed in some cultures, particularly during Lent in Venezuela.",
  "timestamp": "2025-10-28T03:29:53.725205Z"
} 

https_en.wikipedia.org_wiki_Caviodon.json 
 {
  "url": "https://en.wikipedia.org/wiki/Caviodo