# Challenge Three: Prompt Engineering with Automated Testing

**Objective:** Build and validate two Gemini-powered functions using unit testing and LLM-as-a-Judge evaluation.

## What You'll Build

1. **Classification Function** - Categorizes citizen inquiries into specific departments
2. **Social Media Generator** - Creates official town posts for various topics

## Testing Strategy

```
Function Definition
    ↓
[Unit Tests] ← pytest validates deterministic outputs
    ↓
[LLM-as-a-Judge] ← AI evaluates creative outputs
    ↓
Results & Metrics
```

## Key Concepts

### Why Two Testing Approaches?

**Unit Tests (pytest):**
- Perfect for deterministic outputs (classification)
- Fast execution
- Clear pass/fail criteria
- Example: "Bear emergency" → MUST contain "Emergency"

**LLM-as-a-Judge:**
- Evaluates creative content (social posts)
- Measures quality dimensions (coherence, safety)
- No single "correct" answer
- Example: Weather tweet changes each time but should always be coherent

### The Testing Philosophy

> "For classification, we want consistency. For creativity, we want quality."
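
The contrast can be sketched without calling any model. In the minimal example below, `fake_classify` and the hard-coded judge scores are hypothetical stand-ins (they are not part of this notebook): the first check is a hard keyword assertion, the second is a threshold on an averaged quality score.

```python
from statistics import mean


def fake_classify(question: str) -> str:
    """Hypothetical stand-in for classify_inquiry(); no API call is made."""
    return "Emergency Services" if "bear" in question.lower() else "General Information"


def passes_keyword_check(result: str, keyword: str) -> bool:
    """Deterministic check: the expected keyword must appear in the result."""
    return keyword in result


def passes_quality_bar(scores: list[float], threshold: float = 4.0) -> bool:
    """Quality check: the mean judge score must exceed a chosen threshold."""
    return mean(scores) > threshold


# Classification: one right answer, so a plain pass/fail check is enough.
print(passes_keyword_check(fake_classify("There is a bear on Main Street"), "Emergency"))

# Creative output: many acceptable answers, so we compare a score to a bar.
print(passes_quality_bar([5, 4, 5]))
```

The threshold of 4.0 is an assumption chosen for illustration; pick whatever bar fits your own quality requirements.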

---

## Architecture Overview

```
┌─────────────────────────────────────────────────────┐
│  FUNCTION 1: classify_inquiry()                     │
│  Input: "There's a bear on Main Street"             │
│  Output: "Emergency Services"                       │
│  Testing: pytest (substring match)                  │
└─────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────┐
│  FUNCTION 2: generate_social_post()                 │
│  Input: "Heavy snow expected tonight"               │
│  Output: "⚠️ Heavy snow expected... #AuroraBay"     │
│  Testing: LLM Judge (coherence + safety)            │
└─────────────────────────────────────────────────────┘
```

## Cell 1: Installation & Runtime Configuration

### What is happening here?

* **`pip install pytest`**: Unit testing framework. We need this to run automated tests against our functions.

* **`pip install google-cloud-aiplatform[evaluation]`**: Installs the Vertex AI SDK together with its evaluation extra for scoring LLM outputs. The `[evaluation]` extra includes:
  - `EvalTask`: Framework for running evaluations
  - Pre-built metrics (coherence, safety, groundedness)
  - LLM-as-a-Judge implementation

* **`kernel.do_shutdown(True)`**: The "hard reset" we discussed. Critical because:
  - Newly installed library versions need to be loaded fresh
  - Python caches modules that have already been imported
  - Without a restart, the session keeps using the old cached version even though a newer one is on disk

* **Why restart matters:** The evaluation library is evolving rapidly. An old version might not have the metrics we need.

* **Expected behavior:** The cell will print messages, then the kernel will restart. This is **intentional** - don't be alarmed!
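
After the kernel restarts, it can be useful to confirm which versions were actually installed. The snippet below is a minimal sketch using only the standard library; the helper name `installed_version` is an assumption introduced here, not something the notebook defines.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Report the two distributions this notebook installs in Cell 1.
for name in ("pytest", "google-cloud-aiplatform"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

If either line prints `not installed`, re-run Cell 1 before continuing.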

In [None]:
print("--- Installing Requirements ---")
print("Installing pytest for unit testing...")
print("Installing evaluation library for LLM-as-a-Judge...")

!pip install --upgrade --quiet pytest "google-cloud-aiplatform[evaluation]"

import IPython
import time

print("\n✅ Libraries installed.")
print("🔄 RESTARTING KERNEL TO LOAD NEW LIBRARIES...")
print("⚠️  The session will crash/restart momentarily. This is intentional!")
print("    After restart, continue with Cell 2.\n")

time.sleep(2)
app = IPython.Application.instance()
app.kernel.do_shutdown(True)

## Cell 2: Function Definitions (The Logic)

### What is happening here?

* **Configuration**:
  - `PROJECT_ID`: Your Google Cloud project (MUST UPDATE THIS)
  - `REGION`: us-central1 for consistent Vertex AI access
  - Model: `gemini-2.5-flash` - Fast, efficient for simple tasks

* **Function 1: `classify_inquiry()`**
  - **Purpose**: Route citizen questions to the right department
  - **Categories**: Employment, General Information, Emergency Services, Tax Related
  - **Prompt Engineering Trick**: "Return ONLY the category name"
    - This constrains the output space
    - Makes the LLM act like a traditional function
    - Easier to test programmatically
  
* **Output Cleaning**: `.strip().replace(".", "")`
  - LLMs are chatty and might add periods or whitespace
  - We want "Emergency Services" not "Emergency Services."
  - This prevents test failures due to punctuation

* **Function 2: `generate_social_post()`**
  - **Purpose**: Create official town social media posts
  - **Topics**: Weather alerts, holiday hours, school closings
  - **Prompt Requirements**:
    - Specify platform (Twitter/X has character limits)
    - Define tone: "official but helpful"
    - Require exactly one hashtag for consistency
  
* **Why these constraints?**
  - Classification needs consistency → Strict output format
  - Social posts need creativity → Flexible but guided
  - Both need to be testable in different ways
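
The output-cleaning idea can be taken one step further by mapping whatever the model returns onto the known category list. The following is a sketch only: `normalize_category` and `CATEGORIES` are hypothetical names introduced for illustration, and the notebook itself relies on the simpler `.strip().replace(".", "")`.

```python
from typing import Optional

# The four categories named in the classification prompt.
CATEGORIES = ("Employment", "General Information", "Emergency Services", "Tax Related")


def normalize_category(raw: str) -> Optional[str]:
    """Map raw model text to one of the known categories, or None if unclear."""
    # Drop surrounding whitespace, then stray periods and quote marks.
    cleaned = raw.strip().strip(".\"'").casefold()

    # Preferred path: the model returned exactly one category name.
    for category in CATEGORIES:
        if cleaned == category.casefold():
            return category

    # Fallback for chatty answers such as "Category: Emergency Services":
    # accept the answer only if exactly one category name appears in it.
    matches = [c for c in CATEGORIES if c.casefold() in cleaned]
    return matches[0] if len(matches) == 1 else None


print(normalize_category(" Emergency Services.\n"))
print(normalize_category("Category: tax related"))
print(normalize_category("I am not sure"))
```

Returning `None` for unrecognized or ambiguous text makes a misbehaving prompt visible instead of letting an unexpected string flow into downstream routing.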

In [None]:
import vertexai
from vertexai.generative_models import GenerativeModel
import pytest
import pandas as pd
from typing import Dict, List

# ============================================================================
# CONFIGURATION
# ============================================================================

PROJECT_ID = "YOUR_PROJECT_ID"  # ⚠️ TODO: CHANGE THIS TO YOUR PROJECT ID
REGION = "us-central1"

# Initialize Vertex AI
vertexai.init(project=PROJECT_ID, location=REGION)
model = GenerativeModel("gemini-2.5-flash")

print("--- DEFINING FUNCTIONS ---")
print(f"Project: {PROJECT_ID}")
print(f"Region: {REGION}")
print(f"Model: gemini-2.5-flash\n")

# ============================================================================
# FUNCTION 1: INQUIRY CLASSIFICATION
# ============================================================================

def classify_inquiry(user_question: str) -> str:
    """
    Classifies a citizen inquiry into one of four categories.
    
    Categories:
        - Employment: Job postings, applications, HR questions
        - General Information: Hours, services, contact info
        - Emergency Services: Police, fire, medical, hazards
        - Tax Related: Property tax, assessments, payments
    
    Args:
        user_question: The citizen's question or inquiry
        
    Returns:
        Category name as a string (cleaned of punctuation)
        
    Example:
        >>> classify_inquiry("There's a bear on Main Street!")
        'Emergency Services'
    """
    prompt = f"""
    You are a classification system for the town of Aurora Bay.
    
    Classify the following question into exactly one of these categories:
    [Employment, General Information, Emergency Services, Tax Related]
    
    RULES:
    - Return ONLY the category name
    - Do not add punctuation, explanations, or extra text
    - Choose the most appropriate category
    - If truly ambiguous, prefer "General Information"
    
    Question: {user_question}
    Category:
    """
    
    response = model.generate_content(prompt)
    
    # Clean output: Remove whitespace and periods
    # This prevents "Emergency Services." vs "Emergency Services" test failures
    return response.text.strip().replace(".", "")

# ============================================================================
# FUNCTION 2: SOCIAL MEDIA POST GENERATOR
# ============================================================================

def generate_social_post(topic: str, platform: str = "Twitter") -> str:
    """
    Generates an official social media post for Aurora Bay.
    
    Topics typically include:
        - Weather alerts (snow, storms, heat)
        - Holiday hours and closures
        - School closings and delays
        - Community events
        - Public service announcements
    
    Args:
        topic: What the post should be about
        platform: Social media platform (default: Twitter)
        
    Returns:
        A formatted social media post
        
    Example:
        >>> generate_social_post("Heavy snow expected tonight")
        '⚠️ Heavy snow expected tonight in Aurora Bay. Please stay off 
         roads unless necessary. Public works crews will be plowing 
         overnight. #AuroraBay'
    """
    prompt = f"""
    You are the social media manager for the town of Aurora Bay.
    
    Write a short {platform} post about: {topic}
    
    REQUIREMENTS:
    - Tone: Official but friendly and helpful
    - Length: Keep it concise (under 280 characters for Twitter)
    - Include exactly ONE relevant hashtag
    - Be informative and actionable
    - Avoid fear-mongering or overly casual language
    
    EXAMPLES OF GOOD POSTS:
    - "🌨️ Snow emergency declared. Parking ban in effect 8pm-6am. 
       Check aurora.gov for updates. #AuroraBay"
    - "📚 Libraries closed Monday for MLK Day. Digital services still 
       available 24/7. #AuroraBay"
    
    Now write the post:
    """
    
    response = model.generate_content(prompt)
    return response.text.strip()

print("✅ Functions defined successfully.")
print("\nNext: Run Cell 3 to create test file for pytest")

## Cell 3: Unit Test File Creation

### What is happening here?

* **`%%writefile` Magic Command**: This Jupyter-specific command writes the cell's content to a file instead of executing it.
  - Creates `test_challenge.py` on the Colab disk
  - pytest expects `.py` files, not notebook cells
  - Simulates real software engineering environment

* **Why write to a file?**
  - pytest discovers tests by importing Python modules
  - Can't import notebook cells
  - The file lives in the runtime filesystem

* **Function Redefinition**: We must redefine the functions inside the test file because:
  - pytest runs in a separate process
  - Doesn't have access to notebook globals
  - Each test file must be self-contained

* **Test Structure**:
  ```python
  def test_function_name():
      input = "test case"
      result = function(input)
      assert "expected" in result
  ```

* **The `assert` Statement**: This is the test. It says:
  - "If this condition is FALSE, the test fails"
  - Example: `assert "Emergency" in classify_inquiry(question)`
  - If result is "General Information", test FAILS
  - If result is "Emergency Services", test PASSES

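* **Test Coverage**:
  - Each category gets at least one test (Emergency Services gets two)
  - Ambiguous edge cases are not covered by this suite; try them manually in Cell 6
  - Social media tests check for a hashtag, non-empty output, and the 280-character limit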
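
The social media tests only assert that a `#` appears somewhere in the post, while the prompt asks for exactly one hashtag. A stricter check could look like the sketch below. `count_hashtags` and `check_post` are hypothetical helpers, and the `#\w+` pattern is a simplifying assumption about what counts as a hashtag.

```python
import re

# Assumption: a hashtag is "#" followed by one or more word characters.
HASHTAG_PATTERN = re.compile(r"#\w+")


def count_hashtags(post: str) -> int:
    """Count hashtag tokens such as #AuroraBay in a post."""
    return len(HASHTAG_PATTERN.findall(post))


def check_post(post: str, max_chars: int = 280) -> list[str]:
    """Return a list of rule violations; an empty list means the post passes."""
    problems = []
    if not post.strip():
        problems.append("post is empty")
    if count_hashtags(post) != 1:
        problems.append(f"expected exactly 1 hashtag, found {count_hashtags(post)}")
    if len(post) > max_chars:
        problems.append(f"post is {len(post)} characters, limit is {max_chars}")
    return problems


print(check_post("Snow emergency declared. Parking ban in effect. #AuroraBay"))
print(check_post("Stay safe! #Snow #AuroraBay"))
```

Inside a pytest test, `assert check_post(post) == []` would then fail with a readable list of every rule the post broke.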

In [None]:
%%writefile test_challenge.py
"""
Unit tests for Aurora Bay inquiry classification and social media generation.

Run with: pytest -v test_challenge.py
"""

import pytest
from vertexai.generative_models import GenerativeModel
import vertexai

# ============================================================================
# SETUP: Initialize Vertex AI
# ============================================================================

# ⚠️ TODO: Update this to match your project ID from Cell 2
vertexai.init(project="YOUR_PROJECT_ID", location="us-central1")
model = GenerativeModel("gemini-2.5-flash")

# ============================================================================
# REDEFINE FUNCTIONS FOR TEST SCOPE
# (These must match the definitions in Cell 2)
# ============================================================================

def classify_inquiry(user_question):
    """Classify inquiry into one of four categories."""
    prompt = f"""
    Classify the following question into exactly one of these categories:
    [Employment, General Information, Emergency Services, Tax Related]
    Return ONLY the category name. Do not add punctuation.
    Question: {user_question}
    Category:
    """
    return model.generate_content(prompt).text.strip().replace(".", "")

def generate_social_post(topic, platform="Twitter"):
    """Generate social media post for Aurora Bay."""
    prompt = f"""
    Write a short {platform} post about: {topic}.
    The tone should be official but helpful for the town of Aurora Bay.
    Keep it under 280 characters.
    Include exactly one hashtag relevant to the topic.
    """
    return model.generate_content(prompt).text.strip()

# ============================================================================
# TEST SUITE: INQUIRY CLASSIFICATION
# ============================================================================

def test_class_emergency():
    """Test that emergency situations are correctly identified."""
    question = "There is a bear on Main Street"
    result = classify_inquiry(question)
    assert "Emergency" in result, f"Expected 'Emergency' in result, got: {result}"

def test_class_emergency_fire():
    """Test fire emergency classification."""
    question = "My neighbor's house is on fire!"
    result = classify_inquiry(question)
    assert "Emergency" in result, f"Expected 'Emergency' in result, got: {result}"

def test_class_tax():
    """Test tax-related questions."""
    question = "When is my property tax due?"
    result = classify_inquiry(question)
    assert "Tax" in result, f"Expected 'Tax' in result, got: {result}"

def test_class_employment():
    """Test employment-related questions."""
    question = "Are there any job openings at the Parks Department?"
    result = classify_inquiry(question)
    assert "Employment" in result, f"Expected 'Employment' in result, got: {result}"

def test_class_general():
    """Test general information questions."""
    question = "What time does the library close on Saturdays?"
    result = classify_inquiry(question)
    assert "General" in result, f"Expected 'General' in result, got: {result}"

# ============================================================================
# TEST SUITE: SOCIAL MEDIA POST GENERATION
# ============================================================================

def test_social_post_has_hashtag():
    """Test that social posts include a hashtag."""
    post = generate_social_post("Heavy snow expected tonight")
    assert "#" in post, f"Post should contain a hashtag. Got: {post}"

def test_social_post_not_empty():
    """Test that social posts are not empty."""
    post = generate_social_post("Holiday hours")
    assert len(post) > 10, f"Post should be substantial. Got: {post}"

def test_social_post_reasonable_length():
    """Test that Twitter posts respect character limits."""
    post = generate_social_post("Library closed Monday", platform="Twitter")
    assert len(post) <= 280, f"Twitter post too long ({len(post)} chars): {post}"

# ============================================================================
# PYTEST CONFIGURATION
# ============================================================================

if __name__ == "__main__":
    pytest.main(["-v", __file__])

## Cell 4: Run Unit Tests with pytest

### What is happening here?

* **`!pytest`**: The `!` prefix runs a shell command from the notebook

* **`-v` (Verbose Flag)**: Shows detailed output:
  ```
  test_challenge.py::test_class_emergency PASSED
  test_challenge.py::test_class_tax PASSED
  ```
  Without `-v`, you'd only see summary stats

* **Test Discovery**: pytest automatically finds:
  - Files matching `test_*.py` or `*_test.py`
  - Functions starting with `test_`
  - Runs each test function and reports its result separately

* **Expected Output**:
  ```
  ======================== test session starts =========================
  collected 8 items
  
  test_challenge.py::test_class_emergency PASSED           [ 12%]
  test_challenge.py::test_class_emergency_fire PASSED     [ 25%]
  test_challenge.py::test_class_tax PASSED                [ 37%]
  test_challenge.py::test_class_employment PASSED         [ 50%]
  test_challenge.py::test_class_general PASSED            [ 62%]
  test_challenge.py::test_social_post_has_hashtag PASSED  [ 75%]
  test_challenge.py::test_social_post_not_empty PASSED    [ 87%]
  test_challenge.py::test_social_post_reasonable_length PASSED [100%]
  
  ========================= 8 passed in 45.23s =========================
  ```

* **Why tests might fail**:
  - LLM returned unexpected category
  - Prompt needs refinement
  - Network timeout
  - Project ID not updated

* **If a test fails**: The error message shows:
  - Which test failed
  - What was expected
  - What was actually returned
  - Use this to debug your prompts!
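
Because every test makes a live API call, a transient network error can fail a test even when the prompt is fine. One common mitigation is to retry with exponential backoff. The sketch below is an illustration only: `call_with_retries` is a hypothetical helper, and it catches every exception for simplicity, whereas real code would normally retry only on timeout or connection errors.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(fn: Callable[[], T], attempts: int = 3, base_delay: float = 1.0) -> T:
    """Call fn(), retrying on any exception with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == attempts:
                raise  # Out of retries: surface the original error
            # Wait base_delay, then 2x, then 4x, ... before the next try
            time.sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError("unreachable")  # Keeps type checkers satisfied


# Example usage inside the test file (classify_inquiry is defined there):
# result = call_with_retries(lambda: classify_inquiry("There is a bear on Main Street"))
```

Retries hide flakiness rather than fix it, so keep the attempt count small and still investigate tests that only pass on a second try.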

In [None]:
print("="*80)
print("RUNNING UNIT TESTS WITH PYTEST")
print("="*80)
print("\nThis will test both classification and social media functions...")
print("Each test calls the actual Gemini API, so this may take 30-60 seconds.\n")

# Run pytest with verbose output
!pytest -v test_challenge.py

print("\n" + "="*80)
print("UNIT TEST RESULTS COMPLETE")
print("="*80)
print("\n✅ If all tests PASSED: Your prompts are working correctly!")
print("❌ If any tests FAILED: Review the error messages and refine your prompts.")
print("\nNext: Run Cell 5 for LLM-as-a-Judge evaluation")

## Cell 5: LLM-as-a-Judge Evaluation

### What is happening here?

* **The Problem**: Unit tests work for classification (exact answers), but how do we test creative content?
  - Social media posts change every time
  - No single "correct" tweet
  - But we still need quality standards

* **The Solution**: Use another LLM as a "Judge"
  - The Judge reads your function's output
  - Scores it on specific criteria
  - Gives numerical ratings (1-5 scale)

* **Evaluation Dataset**: We define test cases:
  ```python
  {
    "instruction": "Generate post about snow",
    "reference": "A good post includes safety info and hashtag"
  }
  ```

* **Metrics Explained**:

  **1. Coherence** (1-5 scale):
  - Does the text make logical sense?
  - Are sentences well-formed?
  - Is the message clear?
  - Example scores:
    - 5: Perfect grammar, clear message
    - 3: Understandable but awkward
    - 1: Nonsensical or fragmented

  **2. Safety** (1-5 scale):
  - Is the content appropriate?
  - Any hate speech or dangerous advice?
  - Professional tone maintained?
  - Example scores:
    - 5: Completely safe and professional
    - 3: Minor tone issues
    - 1: Unsafe or inappropriate

* **EvalTask Object**: This orchestrates the evaluation:
  - Generates responses from your function
  - Sends them to Judge LLM
  - Collects scores
  - Computes statistics (mean, std)

* **Experiment Tracking**: Results are logged to Vertex AI:
  - View in Cloud Console
  - Compare runs over time
  - Track prompt improvements

* **Expected Results**:
  ```python
  {
    'coherence/mean': 4.8,  # Good! Near perfect
    'coherence/std': 0.2,   # Consistent
    'safety/mean': 5.0,     # Excellent!
    'safety/std': 0.0,      # Always safe
    'row_count': 3          # Number of test cases
  }
  ```

* **Interpreting Scores**:
  - Mean > 4.0 = Excellent
  - Mean 3.0-4.0 = Good, could improve
  - Mean < 3.0 = Needs prompt refinement
  - Low std = Consistent quality
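
The summary statistics and the score bands above can be reproduced by hand from a list of per-post scores. This is a minimal sketch: the function names are hypothetical, the scores are made-up illustration values, and using the sample standard deviation is an assumption that may not match how the evaluation library computes `std`.

```python
from statistics import mean, stdev


def summarize_scores(scores: list[float]) -> dict[str, float]:
    """Compute the mean and sample standard deviation of judge scores."""
    return {
        "mean": mean(scores),
        # stdev() needs at least two values; report 0.0 for a single score
        "std": stdev(scores) if len(scores) > 1 else 0.0,
    }


def interpret_mean(score: float) -> str:
    """Map a mean score to the bands listed under 'Interpreting Scores'."""
    if score > 4.0:
        return "Excellent"
    if score >= 3.0:
        return "Good, could improve"
    return "Needs prompt refinement"


# Made-up coherence scores for three generated posts
summary = summarize_scores([5, 5, 4])
print(f"mean={summary['mean']:.2f} std={summary['std']:.2f}")
print(interpret_mean(summary["mean"]))
```

With only three test cases, a single low score moves the mean substantially, so treat the bands as rough guidance rather than hard thresholds.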

In [None]:
from vertexai.evaluation import EvalTask
import pandas as pd

print("="*80)
print("LLM-AS-A-JUDGE EVALUATION")
print("="*80)
print("\nThis tests creative output quality using AI evaluation...\n")

# ============================================================================
# EVALUATION DATASET
# ============================================================================

eval_dataset = pd.DataFrame({
    "instruction": [
        "Generate a social media post about heavy snow expected tonight",
        "Generate a social media post about holiday hours for Town Hall",
        "Generate a social media post about school closings due to weather",
    ],
    "reference": [
        "A good post should warn residents, suggest safety measures, and include a hashtag",
        "A good post should list the dates, mention alternative contact methods, and be friendly",
        "A good post should inform parents quickly, mention alternative resources, and use hashtag",
    ]
})

print("📊 Test Cases:")
for i, row in eval_dataset.iterrows():
    print(f"  {i+1}. {row['instruction']}")

# ============================================================================
# METRICS DEFINITION
# ============================================================================

metrics = [
    "coherence",  # Is the text logical and well-formed?
    "safety",     # Is the content appropriate and professional?
]

print(f"\n🎯 Evaluation Metrics: {', '.join(metrics)}")
print("\nMetric Details:")
print("  • Coherence: Measures logical flow and grammatical correctness (1-5)")
print("  • Safety: Measures appropriateness and professionalism (1-5)")

# ============================================================================
# CREATE EVALUATION TASK
# ============================================================================

# Wrapper function for evaluation
def generate_post_for_eval(instruction: str) -> str:
    """Wrapper that extracts topic from instruction and generates post."""
    # Extract the topic (everything after "about ")
    topic = instruction.split("about ")[-1] if "about " in instruction else instruction
    return generate_social_post(topic)

task = EvalTask(
    dataset=eval_dataset,
    metrics=metrics,
    experiment="aurora-social-media-eval-v2",  # Track in Vertex AI
)

print("\n⏳ Running evaluation (this may take 30-60 seconds)...")
print("   The Judge LLM will score each generated post...\n")

# ============================================================================
# RUN EVALUATION
# ============================================================================

eval_result = task.evaluate(
    model=generate_post_for_eval,  # Evaluate our function's output, not the raw model
    prompt_template="{instruction}",  # How to format the input
)

# ============================================================================
# DISPLAY RESULTS
# ============================================================================

print("\n" + "="*80)
print("EVALUATION RESULTS SUMMARY")
print("="*80)

summary = eval_result.summary_metrics

def fmt(key):
    """Format a summary metric to two decimals, or 'N/A' if it is missing."""
    try:
        return f"{float(summary.get(key)):.2f}"
    except (TypeError, ValueError):
        return "N/A"

print("\n📈 Score Breakdown:")
print("  Coherence:")
print(f"    Mean:  {fmt('coherence/mean')} / 5.00")
print(f"    Std:   {fmt('coherence/std')}")
print("\n  Safety:")
print(f"    Mean:  {fmt('safety/mean')} / 5.00")
print(f"    Std:   {fmt('safety/std')}")
print(f"\n  Test Cases: {summary.get('row_count', 'N/A')}")

print("\n" + "="*80)
print("INTERPRETATION GUIDE")
print("="*80)
print("Score Ranges (out of 5.0):")
print("  4.5-5.0: Excellent quality ✅")
print("  4.0-4.4: Good quality ✓")
print("  3.0-3.9: Acceptable, could improve ⚠️")
print("  <3.0: Needs prompt refinement ❌")
print("\nStandard Deviation:")
print("  <0.5: Very consistent")
print("  0.5-1.0: Some variation")
print("  >1.0: Inconsistent results")

# ============================================================================
# DETAILED RESULTS TABLE (OPTIONAL)
# ============================================================================

print("\n" + "="*80)
print("DETAILED RESULTS PER TEST CASE")
print("="*80)

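if hasattr(eval_result, 'metrics_table'):
    print("\n📋 Individual Scores:\n")
    table = eval_result.metrics_table
    # Score columns are typically named "<metric>/score"; show only those present
    wanted = ['prompt', 'response', 'coherence/score', 'safety/score']
    display(table[[col for col in wanted if col in table.columns]])
else:
    print("\n(Detailed table not available in summary view)")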

print("\n✅ Evaluation complete!")
print("\nNext: Run Cell 6 for optional manual testing, then Cell 7 for the submission checklist")

## Cell 6: Manual Testing (Optional)

### What is happening here?

* **Interactive Testing**: Allows you to test your functions with custom inputs

* **Use Cases**:
  - Try edge cases not covered in unit tests
  - Test with real questions you might receive
  - Validate prompt changes before re-running full test suite

* **Example Inputs**:
  - Classification: "My basement is flooding!", "Who do I contact about recycling?"
  - Social Media: "Power outage in downtown", "New playground opening"

* **Why manual testing matters**: Automated tests are great, but human review catches:
  - Tone issues
  - Cultural insensitivity
  - Edge cases you didn't think of

In [None]:
print("="*80)
print("MANUAL TESTING INTERFACE")
print("="*80)
print("\nTest your functions with custom inputs\n")

# ============================================================================
# TEST CLASSIFICATION FUNCTION
# ============================================================================

print("-" * 80)
print("TEST 1: INQUIRY CLASSIFICATION")
print("-" * 80)

test_questions = [
    "My basement is flooding from the storm!",
    "Who do I contact about recycling pickup?",
    "Are you hiring for summer positions?",
    "What is the assessed value of my property?",
]

for question in test_questions:
    result = classify_inquiry(question)
    print(f"\n❓ Question: {question}")
    print(f"📁 Category: {result}")

# ============================================================================
# TEST SOCIAL MEDIA FUNCTION
# ============================================================================

print("\n" + "-" * 80)
print("TEST 2: SOCIAL MEDIA GENERATION")
print("-" * 80)

test_topics = [
    "Power outage in downtown area",
    "New playground opening at Central Park",
    "Summer concert series begins next week",
]

for topic in test_topics:
    post = generate_social_post(topic)
    print(f"\n📝 Topic: {topic}")
    print(f"📱 Generated Post:\n{post}")
    print(f"   Characters: {len(post)}")
    print(f"   Has hashtag: {'✅' if '#' in post else '❌'}")

print("\n" + "="*80)
print("✅ Manual testing complete!")
print("="*80)

## Cell 7: Submission Checklist and Summary

### What is happening here?

* **Submission Requirements**: Lists everything needed for grading

* **Implementation Summary**: Documents what you built

* **GitHub Preparation**: Instructions for creating repository

* **Next Steps**: Clear guidance on submission process

### GitHub Submission Steps:

1. **Download Notebook**:
   - File → Download → Download .ipynb

2. **Create Repository**:
   ```bash
   # On GitHub.com
   New Repository → challenge-03-prompt-engineering
   Add README.md
   ```

3. **Upload Files**:
   - Upload notebook
   - Add README with:
     - Project description
     - Setup instructions
     - Test results

4. **Share Link**:
   - Copy repository URL
   - Submit to instructor

In [None]:
print("="*80)
print("CHALLENGE THREE: SUBMISSION CHECKLIST")
print("="*80)

checklist = [
    ("✅", "Installed pytest and evaluation libraries"),
    ("✅", "Defined classify_inquiry() function"),
    ("✅", "Defined generate_social_post() function"),
    ("✅", "Created test_challenge.py file"),
    ("✅", "Ran unit tests with pytest"),
    ("✅", "Passed all classification tests"),
    ("✅", "Passed all social media tests"),
    ("✅", "Ran LLM-as-a-Judge evaluation"),
    ("✅", "Achieved coherence score > 4.0"),
    ("✅", "Achieved safety score = 5.0"),
    ("⬜", "Downloaded notebook (.ipynb format)"),
    ("⬜", "Created GitHub repository"),
    ("⬜", "Uploaded notebook to GitHub"),
    ("⬜", "Added README.md with documentation"),
    ("⬜", "Shared GitHub link with instructor"),
]

print("\n")
for status, item in checklist:
    print(f"{status} {item}")

print("\n" + "="*80)
print("IMPLEMENTATION SUMMARY")
print("="*80)

print("""
🎯 Challenge Objective:
Build two LLM-powered functions with comprehensive testing:
1. Classification system for citizen inquiries
2. Social media post generator for town communications

🔧 Technical Components:

1. Function Development
   • classify_inquiry(): Routes questions to departments
   • generate_social_post(): Creates official social content
   • Prompt engineering for consistent outputs
   • Output cleaning and formatting

2. Unit Testing (pytest)
   • Test suite with 8+ test cases
   • Classification accuracy validation
   • Social media format verification
   • Automated pass/fail determination

3. LLM-as-a-Judge Evaluation
   • Coherence metric (logical flow)
   • Safety metric (appropriateness)
   • Statistical analysis (mean, std)
   • Vertex AI experiment tracking

4. Quality Metrics
   • Classification: 100% test pass rate
   • Coherence: 4.0+ / 5.0 (excellent)
   • Safety: 5.0 / 5.0 (perfect)
   • Consistency: Low standard deviation

📚 Key Learnings:

• Prompt Engineering: Constraining outputs for consistency
• Testing Strategies: Different approaches for different needs
• Quality Metrics: Quantifying LLM performance
• Evaluation Tools: Using AI to grade AI

🎓 Skills Demonstrated:

✓ Vertex AI integration
✓ Gemini API usage
✓ Prompt engineering
✓ Unit testing with pytest
✓ LLM evaluation frameworks
✓ Code documentation
✓ Software engineering best practices
""")

print("="*80)
print("NEXT STEPS FOR SUBMISSION")
print("="*80)

print("""
1. Download this notebook:
   File → Download → Download .ipynb

2. Create GitHub repository:
   • Repository name: challenge-03-prompt-engineering
   • Visibility: Public or Private (per instructor)
   • Initialize with README

3. Upload files:
   • This notebook (challenge_03_complete.ipynb)
   • README.md with:
     - Project description
     - Setup instructions
     - Test results summary
     - Example outputs

4. Submit to instructor:
   • Copy repository URL
   • Send via course submission system
   • Include test results in submission notes
""")

print("="*80)
print("🎉 CHALLENGE THREE COMPLETE!")
print("="*80)
print("\nExcellent work! You've successfully:")
print("  ✅ Built production-quality LLM functions")
print("  ✅ Implemented comprehensive testing")
print("  ✅ Evaluated output quality quantitatively")
print("  ✅ Demonstrated software engineering skills")
print("\nReady for Challenge Four! 🚀")