# L10: Putting It All Together

## Building and Evaluating a Blog-ification Agent

In this exercise, you will:
1. Build an agent that transforms technical articles into blog posts
2. Design an evaluation framework for subjective outputs
3. Rate outputs (yours and your peers')
4. Discover just how noisy human judgment really is

---
## Setup

In [None]:
import anthropic
import json
from pathlib import Path

# Initialize the client
client = anthropic.Anthropic()

# Load articles and guidelines
article_1 = Path("articles/article_1.md").read_text(encoding="utf-8")
article_2 = Path("articles/article_2.md").read_text(encoding="utf-8")
guidelines_v1 = Path("guidelines_v1.md").read_text(encoding="utf-8")
guidelines_v2 = Path("guidelines_v2.md").read_text(encoding="utf-8")

print("✓ Setup complete")
print(f"Article 1: {len(article_1)} characters")
print(f"Article 2: {len(article_2)} characters")

---
## Part 1: Build Your Blog-ification Agent

Design a prompt (or prompt chain) that transforms technical articles into blog posts.

**Tips:**
- Start simple: a single well-crafted prompt often beats a complex chain
- Consider: How do you convey the guidelines effectively?
- Consider: Single-shot or multi-step (outline → draft → refine)?
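
The multi-step option in the last tip can be sketched as three chained calls, each feeding the next. This is a minimal sketch, not a recommended design: `blogify_chain` and the `complete` parameter are hypothetical names, and `complete` stands for any function that takes a system prompt and a user prompt and returns the model's text. The prompt wording is a placeholder only.

```python
from typing import Callable

# Hypothetical type: (system_prompt, user_prompt) -> model text
Complete = Callable[[str, str], str]


def blogify_chain(article: str, guidelines: str, complete: Complete) -> str:
    """Outline -> draft -> refine, with each step seeing the previous output."""
    system = "You turn technical articles into blog posts.\n\nGuidelines:\n" + guidelines

    # Step 1: plan the post before writing it
    outline = complete(system, f"Outline a blog post based on this article:\n\n{article}")

    # Step 2: write the draft from the outline, keeping the article for facts
    draft = complete(
        system,
        f"Article:\n{article}\n\nOutline:\n{outline}\n\nWrite the full blog post.",
    )

    # Step 3: revise the draft against the guidelines
    return complete(
        system,
        f"Draft:\n{draft}\n\nRevise the draft so it follows the guidelines. "
        "Return only the revised post.",
    )
```

Passing `complete` in as an argument keeps the chain testable without API calls; in the notebook it could be a thin wrapper around the `client` created in Setup. A chain triples the number of calls, so compare its output against a single-prompt version before adopting it.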

In [None]:
# TODO: Design your blog-ification prompt
# This is the core of your agent, so take your time!

def blogify(article: str, guidelines: str) -> str:
    """
    Transform a technical article into a blog post following the given guidelines.
    
    Args:
        article: The source article text
        guidelines: The blog writing guidelines
    
    Returns:
        The generated blog post
    """
    
    # YOUR PROMPT DESIGN HERE
    system_prompt = """
    # TODO: Write your system prompt
    """
    
    user_prompt = f"""
    # TODO: Write your user prompt
    # You have access to: {article} and {guidelines}
    """
    
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=2000,
        system=system_prompt,
        messages=[{"role": "user", "content": user_prompt}]
    )
    
    return response.content[0].text
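
`blogify` caps the response at `max_tokens=2000`, so a long post can end mid-sentence without raising any error. The Messages API reports this case through the response's `stop_reason` field. A minimal sketch of a guard follows; `text_or_raise` is a hypothetical helper name, and the `SimpleNamespace` object is only a stand-in shaped like a real response so the helper can be tried offline.

```python
from types import SimpleNamespace


def text_or_raise(response) -> str:
    """Return the text of a Messages API response, failing loudly if it was truncated."""
    if response.stop_reason == "max_tokens":
        raise RuntimeError("Output hit max_tokens and is likely cut off; raise the limit.")
    # Join all text blocks rather than assuming there is exactly one
    return "".join(block.text for block in response.content if block.type == "text")


# Stand-in object shaped like a response (an assumption for illustration)
fake = SimpleNamespace(
    stop_reason="end_turn",
    content=[SimpleNamespace(type="text", text="Hello, blog!")],
)
print(text_or_raise(fake))
```

Inside `blogify`, the final line could become `return text_or_raise(response)` if you prefer an error over a silently truncated post.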

In [None]:
# Test your agent on a short excerpt first
test_excerpt = article_1[:1500]  # First ~1500 chars
test_output = blogify(test_excerpt, guidelines_v1)
print(test_output)

---
## Part 2: Generate Blog Posts

Run your agent on both articles using Guidelines V1.

In [None]:
# Generate blog for Article 1
blog_1 = blogify(article_1, guidelines_v1)
print("=" * 50)
print("BLOG 1: World Models")
print("=" * 50)
print(blog_1)
print(f"\nWord count: {len(blog_1.split())}")

In [None]:
# Generate blog for Article 2
blog_2 = blogify(article_2, guidelines_v1)
print("=" * 50)
print("BLOG 2: Claude's Constitution")
print("=" * 50)
print(blog_2)
print(f"\nWord count: {len(blog_2.split())}")

In [None]:
# Save your outputs
my_outputs = {
    "blog_1": blog_1,
    "blog_2": blog_2,
    "prompt_system": "YOUR SYSTEM PROMPT HERE",  # TODO: Copy your actual prompt
    "prompt_user": "YOUR USER PROMPT HERE",      # TODO: Copy your actual prompt
}

# Save locally (you'll also submit via Google Form)
with open("my_outputs.json", "w") as f:
    json.dump(my_outputs, f, indent=2)
    
print("✓ Outputs saved to my_outputs.json")

---
## Part 3: Design Your Evaluation Framework

Now the hard part: **How do you know if a blog post is good?**

Design a rubric that you'll use to rate blog posts. Think carefully about:
- What dimensions of quality matter?
- How will you score each dimension?
- What does each score level mean?

**Warning**: There is no "correct" rubric. This is intentional.

In [None]:
# TODO: Design your evaluation rubric
# Be specific! Vague criteria lead to noisy ratings.

my_rubric = """
# My Blog Evaluation Rubric

## Dimension 1: [NAME]
**Definition**: [What are you measuring?]
**Scale**: 1-5
- 1: [What does this score mean?]
- 2: [What does this score mean?]
- 3: [What does this score mean?]
- 4: [What does this score mean?]
- 5: [What does this score mean?]

## Dimension 2: [NAME]
**Definition**: [What are you measuring?]
**Scale**: 1-5
- 1: ...
- 5: ...

## Dimension 3: [NAME]
...

## Overall Score
**Scale**: 1-10
**How to compute**: [Weighted average? Holistic judgment? Explain.]
"""

print(my_rubric)
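
If you choose a weighted average for the overall score, the 1-5 dimension scores need to be mapped onto the 1-10 overall scale. One possible way to do that is a linear rescale, sketched below. The function name `overall_score`, the dimension names, and the weights are all hypothetical; a holistic judgment is an equally valid choice.

```python
def overall_score(scores: dict[str, int], weights: dict[str, float]) -> float:
    """Weighted average of 1-5 dimension scores, rescaled linearly to 1-10."""
    if scores.keys() != weights.keys():
        raise ValueError("scores and weights must cover the same dimensions")
    total_weight = sum(weights.values())
    avg = sum(scores[d] * weights[d] for d in scores) / total_weight  # still on 1-5
    # Linear map: 1 -> 1 and 5 -> 10
    return round(1 + (avg - 1) * 9 / 4, 1)


# Hypothetical dimensions and weights, for illustration only
weights = {"accuracy": 0.5, "clarity": 0.3, "engagement": 0.2}
print(overall_score({"accuracy": 4, "clarity": 3, "engagement": 5}, weights))
```

Writing the formula down forces you to commit to weights before you see any blog posts, which makes your ratings easier to compare later.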

---
## Part 4: Rate Blog Posts

You will rate:
1. Your own 2 blog posts
2. 4 blog posts from peers (provided via Google Form)

For each, record:
- Scores for each dimension in your rubric
- Overall score (1-10)
- Confidence level (high/medium/low)
- Brief notes (optional, but helpful for later reflection)

In [None]:
# Template for recording your ratings
# You can use this structure or adapt it

my_ratings = {
    "my_blog_1": {
        "dimension_scores": {
            # TODO: Add your dimensions
            "dimension_1": None,
            "dimension_2": None,
            "dimension_3": None,
        },
        "overall_score": None,  # 1-10
        "confidence": None,     # "high", "medium", or "low"
        "notes": ""
    },
    "my_blog_2": {
        "dimension_scores": {},
        "overall_score": None,
        "confidence": None,
        "notes": ""
    },
    "peer_blog_A": {
        "dimension_scores": {},
        "overall_score": None,
        "confidence": None,
        "notes": ""
    },
    "peer_blog_B": {
        "dimension_scores": {},
        "overall_score": None,
        "confidence": None,
        "notes": ""
    },
    "peer_blog_C": {
        "dimension_scores": {},
        "overall_score": None,
        "confidence": None,
        "notes": ""
    },
    "peer_blog_D": {
        "dimension_scores": {},
        "overall_score": None,
        "confidence": None,
        "notes": ""
    }
}
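
Before submitting, it can help to check that every record is filled in and within range. The sketch below assumes the scales from the rubric template (dimensions on 1-5, overall on 1-10); `rating_problems` and `VALID_CONFIDENCE` are hypothetical names, and the 1-5 bound should be adjusted if your rubric uses a different scale.

```python
VALID_CONFIDENCE = {"high", "medium", "low"}


def rating_problems(name: str, rating: dict) -> list[str]:
    """Return human-readable problems for one rating record (empty list = OK)."""
    problems = []
    dims = rating.get("dimension_scores", {})
    if not dims:
        problems.append(f"{name}: no dimension scores")
    for dim, score in dims.items():
        # Assumes the 1-5 scale from the rubric template
        if not isinstance(score, int) or not 1 <= score <= 5:
            problems.append(f"{name}: {dim} must be an integer from 1 to 5")
    overall = rating.get("overall_score")
    if not isinstance(overall, (int, float)) or not 1 <= overall <= 10:
        problems.append(f"{name}: overall_score must be between 1 and 10")
    if rating.get("confidence") not in VALID_CONFIDENCE:
        problems.append(f"{name}: confidence must be high, medium, or low")
    return problems


# Made-up record, for illustration only
example = {"dimension_scores": {"clarity": 4}, "overall_score": 7,
           "confidence": "high", "notes": ""}
print(rating_problems("example", example))  # prints []
```

In the notebook, looping over `my_ratings.items()` and printing any non-empty result would flag records that are still unfilled.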

---
## Part 5: Submit Your Work

Submit via the Google Form (link from instructor):
1. Your agent prompts
2. Your 2 generated blog posts
3. Your evaluation rubric
4. Your ratings for all 6 blog posts

In [None]:
# Final submission package
submission = {
    "outputs": my_outputs,
    "rubric": my_rubric,
    "ratings": my_ratings
}

# Pretty print for copying to Google Form
print(json.dumps(submission, indent=2))

---
## 🔮 Preview: What Comes Next (Lab 10b)

In the next session, we'll analyze the class data and discover:

1. **The Disagreement Wall**: How much do your ratings differ from your peers'?
2. **The Blind Duplicate**: Did you rate the same blog consistently?
3. **The Phantom Winner**: When we aggregate ratings, which blog "wins"? And how confident can we be?

Come prepared to be surprised.
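
To make the first question concrete, one simple way to quantify disagreement between two raters is the mean absolute difference in their overall scores on the blogs both of them rated. This is only an illustrative sketch, not necessarily the analysis used in the next session; `mean_abs_diff` is a hypothetical name and the scores are made up.

```python
from statistics import mean


def mean_abs_diff(mine: dict[str, float], theirs: dict[str, float]) -> float:
    """Average absolute gap in overall score across blogs both raters scored."""
    shared = mine.keys() & theirs.keys()
    if not shared:
        raise ValueError("no blogs in common")
    return mean(abs(mine[b] - theirs[b]) for b in shared)


# Made-up scores, for illustration only
rater_a = {"blog_A": 7, "blog_B": 4, "blog_C": 9}
rater_b = {"blog_A": 5, "blog_B": 6, "blog_C": 8}
print(mean_abs_diff(rater_a, rater_b))
```

Here the gaps are 2, 2, and 1 points, so the two raters differ by about 1.7 points on average on a 10-point scale.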

---

*"Wherever there is judgment, there is noise, and more of it than you think."*  
– Daniel Kahneman