# pdf_agent examples demo

This notebook combines three small demos used in the blog series:
1. Ingest PDFs into the vector index
2. Run a retrieval query and save the JSON result
3. Run a comprehensive `analyze_all` job and save the long-form output

Notes:
- The notebook uses the repository's `PDFAgent` class (from `agents/pdf_agent.py`).
- Make sure your environment is configured, with `system_config.json` or the relevant environment variables set for any cloud providers you plan to use.
- Results are written to `outputs/`. The setup cell below creates this directory if it does not already exist.
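
Before running the cells, it can help to confirm which configuration source is actually present. The following is a minimal sketch and not part of the repository: the helper name `describe_config` is hypothetical, and it only checks that `system_config.json` exists and parses as JSON. It makes no assumption about which keys `PDFAgent` expects.

```python
import json
from pathlib import Path


def describe_config(path="system_config.json"):
    """Report whether a JSON config file exists and parses (hypothetical helper)."""
    cfg = Path(path)
    if not cfg.exists():
        return f"{cfg} not found; relying on environment variables"
    try:
        data = json.loads(cfg.read_text(encoding="utf-8"))
    except json.JSONDecodeError as exc:
        return f"{cfg} is not valid JSON: {exc.msg}"
    # Only report top-level key names; values may hold secrets.
    keys = sorted(data) if isinstance(data, dict) else []
    return f"{cfg} loaded with top-level keys: {keys}"


print(describe_config())
```

The helper prints key names only, so the output is safe to leave in a shared notebook.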

In [None]:
# Setup: imports and agent initialization
from pathlib import Path
import json
import os

# Import the agent (uses repo code)
from agents.pdf_agent import PDFAgent

# Ensure outputs directory exists
Path('outputs').mkdir(parents=True, exist_ok=True)

# Instantiate the agent (may print logs)
agent = PDFAgent()
print('PDFAgent initialized. Use `agent` to run ingestion, retrieval and analyze_all.')

## Demo 1 — Ingest a folder of PDFs

This cell will parse and index PDF files found in `downloads/search_results` by default. Update `folder` to point to your PDF directory. For small test sets (2–3 PDFs) this is quick; for hundreds of PDFs it may take much longer.
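
Because ingest time grows with the number of files, a quick count before calling `process_folder` can set expectations. This is a small sketch using only the standard library. The helper name `list_pdfs` is hypothetical, and whether `process_folder` itself descends into subfolders is not stated here, so the `recursive` flag is an assumption to adjust.

```python
from pathlib import Path


def list_pdfs(folder, recursive=False):
    """Return sorted PDF paths under `folder` (hypothetical helper)."""
    folder = Path(folder)
    if not folder.is_dir():
        return []
    candidates = folder.rglob("*") if recursive else folder.glob("*")
    # Compare suffixes case-insensitively so `.PDF` files are counted too.
    return sorted(p for p in candidates if p.is_file() and p.suffix.lower() == ".pdf")


pdfs = list_pdfs("downloads/search_results")
total_mb = sum(p.stat().st_size for p in pdfs) / 1e6
print(f"{len(pdfs)} PDF(s), {total_mb:.1f} MB total")
```

If the count is in the hundreds, consider pointing `folder` at a small subset first to check that ingestion works end to end.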

In [None]:
# Ingest demo (adjust `folder` as needed)
folder = Path('downloads/search_results')

if not folder.exists():
    print(f'Folder not found: {folder.resolve()}')
    print('Update the `folder` variable to the path where your PDFs are stored.')
else:
    print(f'Ingesting PDFs from: {folder.resolve()}')
    result = agent.process_folder(folder)
    print('Ingest result:')
    print(json.dumps(result, indent=2))

## Demo 2 — Retrieval (RAG)

This cell runs a retrieval query using the agent's `search` API (mode: `enhanced` by default). The returned result will be saved to `outputs/retrieve_result.json`. If the index is empty, run the ingest cell first.
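
Once `outputs/retrieve_result.json` has been written, it can be reloaded and inspected without re-running the query. The sketch below is illustrative only: `summarize_result` is a hypothetical helper, and it assumes the saved object is a JSON dictionary with an `answer` key and, optionally, a `sources` list. Other keys returned by `search` are not documented here and are ignored.

```python
import json
from pathlib import Path


def summarize_result(path, preview_chars=200):
    """Load a saved result and return a short summary (hypothetical helper)."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    answer = str(data.get("answer", "(no answer)"))
    # Assumption: `sources` is a list when present; treat anything else as empty.
    sources = data.get("sources", [])
    n_sources = len(sources) if isinstance(sources, list) else 0
    return {"preview": answer[:preview_chars], "n_sources": n_sources}


saved = Path("outputs/retrieve_result.json")
if saved.exists():
    print(summarize_result(saved))
else:
    print(f"{saved} not found; run the retrieval cell first.")
```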

In [None]:
# Retrieval demo
query = 'What are agentic workflows?'
mode = 'enhanced'
print(f'Running query (mode={mode}): {query}')
try:
    res = agent.search(query, mode=mode)
except RuntimeError:
    print('Search not ready: index may be empty. Run ingestion first.')
    raise

print('\n--- ANSWER ---')
print(res.get('answer', '(no answer)'))

out_path = Path('outputs/retrieve_result.json')
with out_path.open('w') as fh:
    json.dump(res, fh, indent=2)

print(f'Query result saved to: {out_path.resolve()}')

## Demo 3 — Comprehensive analysis (`analyze_all`)

The `analyze_all` mode performs a batched analysis across the entire indexed corpus and can produce long outputs. Use a short query and expect longer runtimes. The notebook saves the final text to `outputs/analyze_all_result.txt`.
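
Since runtimes vary with corpus size, measuring a run on a small corpus gives a rough baseline before scaling up. The following is a generic timing sketch, not something provided by `PDFAgent`: the `timed` context manager is a hypothetical helper, and the `time.sleep` call is a stand-in for the real `agent.search(...)` call.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label):
    """Print the wall-clock duration of the enclosed block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        print(f"{label} took {elapsed:.1f}s")


# Stand-in workload; in the notebook this would wrap the `agent.search(...)` call.
with timed("analyze_all"):
    time.sleep(0.1)
```

The `finally` clause means the duration is still printed if the wrapped call raises.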

In [None]:
# Analyze-all demo (may take significant time on large corpora)
query = 'Summarize recent trends in agentic AI'
print('Starting analyze_all; this may take a while depending on corpus size...')
result = agent.search(query, mode='analyze_all')
answer = result.get('answer', '(no answer)')
out_file = Path('outputs/analyze_all_result.txt')
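# Write as UTF-8 explicitly so non-ASCII characters do not depend on the platform default
out_file.write_text(answer, encoding='utf-8')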
print(f'Analyze_all output saved to: {out_file.resolve()}')
print(f'Documents analyzed: {len(result.get("sources", []))}')

## Next steps

- If you plan to use cloud providers (Azure/Poe), verify `system_config.json` and your environment variables before ingesting large corpora.
- For privacy-sensitive data, consider running Ollama locally and selecting it as the embedding/LLM provider in settings.
- Use the `examples/` CLI scripts if you prefer non-interactive runs.
