## DocumentSynthesizer example (with ContextGenerator)

This notebook demonstrates the refactored `DocumentSynthesizer`, which:
- Extracts text from documents
- Generates prompt-ready contexts using `ContextGenerator` (semantic, markdown-aware chunking)
- Delegates test generation to `PromptSynthesizer`, one context at a time
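
To make "markdown-aware chunking" concrete, here is a minimal sketch of the general idea: split on headings, then pack consecutive sections under a token budget. It is not the `ContextGenerator` implementation. The function names `split_sections` and `pack_sections` are hypothetical, and tokens are approximated by whitespace-separated words.

```python
import re

HEADING = re.compile(r"#{1,6} ")  # ATX-style markdown heading prefix


def split_sections(text: str) -> list[str]:
    """Split markdown text into sections, each starting at a heading line."""
    sections, current = [], []
    for line in text.splitlines():
        # A heading closes the section collected so far.
        if HEADING.match(line) and current:
            sections.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current).strip())
    return [s for s in sections if s]


def pack_sections(sections: list[str], max_tokens: int = 1000) -> list[str]:
    """Greedily merge consecutive sections while staying within max_tokens.

    Assumption: a "token" is a whitespace-separated word.
    A single section larger than the budget is kept whole, not split further.
    """
    chunks, current, current_len = [], [], 0
    for section in sections:
        n = len(section.split())
        # Start a new chunk when adding this section would exceed the budget.
        if current and current_len + n > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(section)
        current_len += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks


sample = "# A\none two\n## B\nthree four five\n## C\nsix"
contexts = pack_sections(split_sections(sample), max_tokens=5)
```

With the small budget used here, each heading ends up in its own context. A larger budget merges neighbouring sections into one.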

You can configure:

Initialization Parameters (when creating the synthesizer):
- `prompt`: Generation prompt for test cases (required)
- `batch_size`: Maximum tests per LLM call (optional)
- `system_prompt`: Custom system prompt template (optional)
- `max_context_tokens`: Token limit per context (default: 1000)
- `strategy`: Context selection strategy - "sequential" or "random" (default: "random")
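
The two `strategy` values can be pictured with the sketch below. This is an illustration of what "sequential" and "random" selection plausibly mean, not the library's code. The helper `select_contexts` and its `seed` argument are hypothetical.

```python
import random


def select_contexts(contexts, k, strategy="random", seed=None):
    """Pick up to k contexts, either in document order or sampled at random."""
    k = min(k, len(contexts))
    if strategy == "sequential":
        # Take the first k contexts in their original order.
        return list(contexts[:k])
    if strategy == "random":
        # Sample k distinct contexts; a seed makes the choice reproducible.
        return random.Random(seed).sample(list(contexts), k)
    raise ValueError(f"Unknown strategy: {strategy!r}")


all_contexts = ["ctx-0", "ctx-1", "ctx-2", "ctx-3"]
first_two = select_contexts(all_contexts, 2, strategy="sequential")
random_two = select_contexts(all_contexts, 2, strategy="random", seed=0)
```

Sequential selection favours the start of a document. Random selection spreads tests across it, at the cost of reproducibility unless a seed is fixed.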

Generation Parameters (when calling .generate()):
- `documents`: List of document dictionaries (required for document-based generation)
  Each document should contain:
  - `name` (str): Document identifier/filename
  - `description` (str): Brief description of document content
  - `path` (str): File path to the document, or
  - `content` (str): Raw text content (if provided, it takes precedence over `path`)
- `num_tests`: Total number of tests to generate across all contexts (default: 5)
- `tests_per_context`: Target number of tests per context; the total is still capped at `num_tests` (optional)
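
One way to read the interaction between `num_tests` and `tests_per_context` is as a budget spread over the available contexts. The sketch below is an assumption about that behaviour, not the synthesizer's actual allocation logic, and `plan_tests` is a hypothetical helper.

```python
def plan_tests(num_contexts, num_tests=5, tests_per_context=None):
    """Return how many tests to request from each context.

    Without tests_per_context, num_tests is split as evenly as possible.
    With tests_per_context, each context gets that many until the
    num_tests budget runs out.
    """
    if num_contexts <= 0:
        return []
    if tests_per_context is None:
        base, extra = divmod(num_tests, num_contexts)
        # The first `extra` contexts receive one additional test.
        return [base + (1 if i < extra else 0) for i in range(num_contexts)]
    plan, remaining = [], num_tests
    for _ in range(num_contexts):
        n = min(tests_per_context, remaining)
        plan.append(n)
        remaining -= n
    return plan


even_split = plan_tests(3, num_tests=10)
capped = plan_tests(3, num_tests=5, tests_per_context=2)
```

In both cases the sum of the plan never exceeds `num_tests`, which matches the "caps total" wording above.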

Each generated test includes metadata mapping it back to its source context and documents.

### Example 1: Using direct content (no file paths needed)

In [None]:
from rhesis.sdk.synthesizers.document_synthesizer import DocumentSynthesizer
from rhesis.sdk.types import Document


prompt = "Generate diverse test cases for insurance claims handling."

doc_synth = DocumentSynthesizer(
    prompt=prompt,
)

documents = [
    Document(
        name="policy_terms.md",
        description="Insurance policy terms and coverage",
        content="""
# Insurance Policy Terms

## Coverage
- Medical emergencies
- Theft and loss

## Exclusions
- Intentional damage
- Pre-existing conditions

---

## Claims Process
1. Report incident within 48 hours
2. Provide documentation
3. Await assessment
        """,
    ),
    Document(
        name="claims_guidelines.md",
        description="Guidelines for handling claims",
        content="""
# Claims Handling Guidelines

Claims should be processed within 14 days. Fraud indicators include inconsistent dates and unverifiable receipts.
        """,
    ),
]

result = doc_synth.generate(documents=documents, num_tests=10)

# Show the number of generated tests and the result-level metadata
len(result.tests), result.metadata


In [None]:
# Inspect first test and its enhanced metadata
first = result.tests[0]
{
  "prompt": first["prompt"]["content"],
  "behavior": first["behavior"],
  "category": first["category"],
  "topic": first["topic"],
  "metadata_keys": list(first["metadata"].keys()),
  "context_index": first["metadata"]["context_index"],
  "context_length": first["metadata"]["context_length"],
  "source_document": first["metadata"]["sources"][0]["source"],
  "source_name": first["metadata"]["sources"][0]["name"],
  "source_description": first["metadata"]["sources"][0]["description"],
  "context_preview": first["metadata"]["sources"][0]["content"][:160] + "...",
  "generated_by": first["metadata"]["generated_by"],
}
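
Because each test carries its sources in `metadata`, tests can be grouped by originating document. The sketch below assumes tests are dictionaries shaped like the one inspected above. `group_prompts_by_source` is a hypothetical helper, and `sample_tests` is hand-written stand-in data, not real synthesizer output.

```python
from collections import defaultdict


def group_prompts_by_source(tests):
    """Map each source document name to the prompts generated from it."""
    grouped = defaultdict(list)
    for test in tests:
        for source in test["metadata"]["sources"]:
            grouped[source["name"]].append(test["prompt"]["content"])
    return dict(grouped)


# Stand-in data mirroring the keys used in the inspection cell above.
sample_tests = [
    {
        "prompt": {"content": "Is theft covered?"},
        "metadata": {"sources": [{"name": "policy_terms.md"}]},
    },
    {
        "prompt": {"content": "How fast are claims processed?"},
        "metadata": {"sources": [{"name": "claims_guidelines.md"}]},
    },
    {
        "prompt": {"content": "Are pre-existing conditions excluded?"},
        "metadata": {"sources": [{"name": "policy_terms.md"}]},
    },
]

by_source = group_prompts_by_source(sample_tests)
```

With real output, passing `result.tests` in place of `sample_tests` should work as long as the tests expose the same keys.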

### Example 2: Using file paths

In [None]:
# Path to a local document; replace it with a file that exists on your machine
doc_path = "/Users/emanuelederossi/Downloads/15227EN_MV_GIC_10.2021 copia 2.pdf"

documents = [
    Document(
        name="Sample Document",
        description="Example document for testing",
        path=doc_path,
    )
]

prompt = "Generate test cases about this document to check if the information is correct. Always say: given that the document says: (literal content of the document), why ..."

doc_synth = DocumentSynthesizer(
    prompt=prompt,
    max_context_tokens=1500,
)

result = doc_synth.generate(documents=documents, num_tests=10)

print(result)

In [None]:
# Inspect first test and its enhanced metadata
first = result.tests[0]
{
  "prompt": first["prompt"]["content"],
  "behavior": first["behavior"],
  "category": first["category"],
  "topic": first["topic"],
  "metadata_keys": list(first["metadata"].keys()),
  "context_index": first["metadata"]["context_index"],
  "context_length": first["metadata"]["context_length"],
  "source_document": first["metadata"]["sources"][0]["source"],
  "source_name": first["metadata"]["sources"][0]["name"],
  "source_description": first["metadata"]["sources"][0]["description"],
  "context_preview": first["metadata"]["sources"][0]["content"][:160] + "...",
  "generated_by": first["metadata"]["generated_by"],
}