# Deploying AI
## Assignment 1: Evaluating Summaries

A key application of LLMs is to summarize documents. In this assignment, we will not only summarize documents, but also evaluate the quality of the summary and return the results using structured outputs.

**Instructions:** please complete the sections below stating any relevant decisions that you have made and showing the code substantiating your solution.

## Select a Document

Please select one out of the following articles:

+ [Managing Oneself, by Peter Drucker](https://www.thecompleteleader.org/sites/default/files/imce/Managing%20Oneself_Drucker_HBR.pdf) (PDF)
+ [The GenAI Divide: State of AI in Business 2025](https://www.artificialintelligence-news.com/wp-content/uploads/2025/08/ai_report_2025.pdf) (PDF)
+ [What is Noise?, by Alex Ross](https://www.newyorker.com/magazine/2024/04/22/what-is-noise) (Web)

## Load Secrets

In [249]:
%load_ext dotenv
%dotenv ../05_src/.secrets_grassriots

The dotenv extension is already loaded. To reload it, use:
  %reload_ext dotenv


## Load Document

Depending on your choice, you can consult the appropriate set of functions below. Make sure that you understand the content that is extracted and whether you need to perform any additional operations (like joining page content).

### PDF

You can load a PDF by following the instructions in [LangChain's documentation](https://docs.langchain.com/oss/python/langchain/knowledge-base#loading-documents). Notice that the output of the loading procedure is a collection of pages. You can join the pages by using the code below.

```python
document_text = ""
for page in docs:
    document_text += page.page_content + "\n"
```

### Web

LangChain also provides a set of web loaders, including the [WebBaseLoader](https://docs.langchain.com/oss/python/integrations/document_loaders/web_base). You can use this loader to load web pages.
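
A minimal sketch of loading the web article, assuming `langchain-community` and `beautifulsoup4` are installed. The helper name `load_web_text` is hypothetical, and the import is deferred into the function so that defining the helper has no side effects.

```python
def load_web_text(url: str) -> str:
    """Load a web page with WebBaseLoader and return its text as one string.

    Hypothetical helper for illustration; it needs network access when called.
    """
    # Imported lazily so that defining the helper does not require the package.
    from langchain_community.document_loaders import WebBaseLoader

    loader = WebBaseLoader(url)
    docs = loader.load()  # a list of Document objects
    # Join the loaded documents into a single string.
    return "\n".join(doc.page_content for doc in docs)


# Example call (commented out because it performs a network request):
# document_text = load_web_text("https://www.newyorker.com/magazine/2024/04/22/what-is-noise")
```

Web pages often carry navigation and footer text, so the extracted string may need more cleaning than a PDF extract.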

In [250]:
# UTILS: Text Cleaning Helper Functions for LLM Processing.
import re
def clean_document_text(text):
    """
    Clean and normalize document text for LLM processing.
    
    This function applies basic data cleaning techniques to prepare PDF-extracted
    text for optimal LLM processing. It handles common issues like encoding errors,
    excessive whitespace, and improper line breaks.
    
    Args:
        text (str): Raw document text extracted from PDF
        
    Returns:
        str: Cleaned and normalized text ready for LLM processing
    """
    # Step 1: Handle encoding issues gracefully
    # Remove or replace problematic characters that may cause encoding errors
    if isinstance(text, bytes):
        text = text.decode('utf-8', errors='ignore')
    
    # Handle common encoding issues by removing problematic unicode characters
    text = text.encode('utf-8', errors='ignore').decode('utf-8', errors='ignore')
    
    # Step 2: Normalize line breaks - preserve paragraph breaks but join broken sentences
    # Replace double newlines (paragraph breaks) with a temporary marker
    text = text.replace('\n\n', '|||PARAGRAPH_BREAK|||')
    # Replace single newlines with spaces (these are likely broken sentences)
    text = text.replace('\n', ' ')
    # Restore paragraph breaks
    text = text.replace('|||PARAGRAPH_BREAK|||', '\n\n')
    
    # Step 3: Normalize whitespace
    # Replace tabs with spaces
    text = text.replace('\t', ' ')
    # Replace multiple consecutive spaces with a single space
    text = re.sub(r' +', ' ', text)
    
    # Step 4: Clean up hyphenated line breaks (common in PDFs)
    # Fix words broken across lines with hyphens followed by space (e.g., "word- \nword" -> "wordword")
    text = re.sub(r'(\w+)-\s+(\w+)', r'\1\2', text)
    
    # Step 5: Remove leading and trailing whitespace from each line
    lines = text.split('\n')
    lines = [line.strip() for line in lines]
    text = '\n'.join(lines)
    
    # Step 6: Remove excessive blank lines (more than 2 consecutive newlines)
    text = re.sub(r'\n{3,}', '\n\n', text)
    
    # Step 7: Final trim of leading/trailing whitespace
    text = text.strip()
    
    return text
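
To make the effect of the cleaning steps concrete, the sketch below replays Steps 2 to 4 on a small made-up string. The sample text and the helper name `replay_core_steps` are assumptions for illustration; the regular expressions are the ones used in `clean_document_text`.

```python
import re


def replay_core_steps(text: str) -> str:
    """Replay Steps 2-4 of clean_document_text on a string (illustrative only)."""
    # Step 2: keep paragraph breaks, turn single newlines into spaces.
    marker = "|||PARAGRAPH_BREAK|||"
    text = text.replace("\n\n", marker).replace("\n", " ").replace(marker, "\n\n")
    # Step 3: tabs become spaces, runs of spaces collapse to one.
    text = re.sub(r" +", " ", text.replace("\t", " "))
    # Step 4: rejoin words that were hyphenated across a line break.
    return re.sub(r"(\w+)-\s+(\w+)", r"\1\2", text)


sample = "Knowledge work-\ners must man-\nage them-\nselves.\n\nNew  paragraph\there."
cleaned = replay_core_steps(sample)
print(repr(cleaned))

# Caveat: Step 4 cannot tell a line-break hyphen from a suspended hyphen,
# so a phrase such as "pre- and post-war" is also merged.
caveat = replay_core_steps("pre- and post-war")
print(repr(caveat))
```

The second call shows why it is worth spot-checking the cleaned text: the hyphen rule is a heuristic and can merge words that were meant to stay apart.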

In [251]:
# UTILS: Text Cleaning Evaluation Helper Functions - DO NOT MODIFY
try:
    import tiktoken
    TOKENIZER_AVAILABLE = True
except ImportError:
    TOKENIZER_AVAILABLE = False
    print("Note: tiktoken not available. Using word-based token approximation.")

def count_tokens(text, model="gpt-4o-mini"):
    """
    Count tokens in text using tiktoken if available, otherwise approximate.
    
    Args:
        text (str): Text to count tokens for
        model (str): Model name for tokenizer (default: gpt-4o-mini)
        
    Returns:
        int: Approximate token count
    """
    if TOKENIZER_AVAILABLE:
        try:
            encoding = tiktoken.encoding_for_model(model)
            return len(encoding.encode(text))
        except Exception:
            # Fallback to word-based approximation
            return int(len(text.split()) / 0.75)  # Rough approximation: ~0.75 words per token
    else:
        # Simple approximation: average English word is ~1.3 tokens
        return int(len(text.split()) * 1.3)

def evaluate_cleaning(original_text, cleaned_text):
    """
    Evaluate the effectiveness of document cleaning by comparing original vs cleaned text.
    
    This function computes quantitative metrics including text statistics, whitespace
    reduction, and token efficiency to demonstrate cleaning effectiveness.
    
    Args:
        original_text (str): Original text before cleaning
        cleaned_text (str): Text after cleaning
        
    Returns:
        dict: Dictionary containing all evaluation metrics
    """
    import re
    
    # Helper function to count patterns
    def count_pattern(text, pattern):
        return len(re.findall(pattern, text))
    
    # Helper function to count_multiple_spaces
    def count_multiple_spaces(text):
        return len(re.findall(r' {2,}', text))
    
    # Calculate text statistics
    metrics = {
        'original': {},
        'cleaned': {},
        'improvements': {}
    }
    
    # Character and word counts
    metrics['original']['char_count'] = len(original_text)
    metrics['cleaned']['char_count'] = len(cleaned_text)
    metrics['improvements']['char_reduction'] = metrics['original']['char_count'] - metrics['cleaned']['char_count']
    metrics['improvements']['char_reduction_pct'] = (metrics['improvements']['char_reduction'] / metrics['original']['char_count'] * 100) if metrics['original']['char_count'] > 0 else 0
    
    metrics['original']['word_count'] = len(original_text.split())
    metrics['cleaned']['word_count'] = len(cleaned_text.split())
    
    # Whitespace metrics
    metrics['original']['spaces'] = original_text.count(' ')
    metrics['cleaned']['spaces'] = cleaned_text.count(' ')
    
    metrics['original']['tabs'] = original_text.count('\t')
    metrics['cleaned']['tabs'] = cleaned_text.count('\t')
    
    metrics['original']['newlines'] = original_text.count('\n')
    metrics['cleaned']['newlines'] = cleaned_text.count('\n')
    
    metrics['original']['multiple_spaces'] = count_multiple_spaces(original_text)
    metrics['cleaned']['multiple_spaces'] = count_multiple_spaces(cleaned_text)
    
    # Count excessive blank lines (>2 consecutive)
    metrics['original']['excessive_blank_lines'] = count_pattern(original_text, r'\n{3,}')
    metrics['cleaned']['excessive_blank_lines'] = count_pattern(cleaned_text, r'\n{3,}')
    
    # Count single newlines (likely broken sentences)
    single_newlines_original = count_pattern(original_text, r'(?<!\n)\n(?!\n)')
    single_newlines_cleaned = count_pattern(cleaned_text, r'(?<!\n)\n(?!\n)')
    metrics['original']['single_newlines'] = single_newlines_original
    metrics['cleaned']['single_newlines'] = single_newlines_cleaned
    
    # Text density (non-whitespace ratio)
    metrics['original']['text_density'] = len(re.sub(r'\s', '', original_text)) / len(original_text) if len(original_text) > 0 else 0
    metrics['cleaned']['text_density'] = len(re.sub(r'\s', '', cleaned_text)) / len(cleaned_text) if len(cleaned_text) > 0 else 0
    
    # Sentence count (approximate)
    metrics['original']['sentence_count'] = len(re.findall(r'[.!?]+', original_text))
    metrics['cleaned']['sentence_count'] = len(re.findall(r'[.!?]+', cleaned_text))
    
    # Average words per sentence
    metrics['original']['avg_words_per_sentence'] = metrics['original']['word_count'] / metrics['original']['sentence_count'] if metrics['original']['sentence_count'] > 0 else 0
    metrics['cleaned']['avg_words_per_sentence'] = metrics['cleaned']['word_count'] / metrics['cleaned']['sentence_count'] if metrics['cleaned']['sentence_count'] > 0 else 0
    
    # Token counts
    metrics['original']['token_count'] = count_tokens(original_text)
    metrics['cleaned']['token_count'] = count_tokens(cleaned_text)
    metrics['improvements']['token_reduction'] = metrics['original']['token_count'] - metrics['cleaned']['token_count']
    metrics['improvements']['token_reduction_pct'] = (metrics['improvements']['token_reduction'] / metrics['original']['token_count'] * 100) if metrics['original']['token_count'] > 0 else 0
    
    # Calculate improvements
    metrics['improvements']['spaces_reduced'] = metrics['original']['spaces'] - metrics['cleaned']['spaces']
    metrics['improvements']['tabs_removed'] = metrics['original']['tabs'] - metrics['cleaned']['tabs']
    metrics['improvements']['single_newlines_fixed'] = metrics['original']['single_newlines'] - metrics['cleaned']['single_newlines']
    metrics['improvements']['multiple_spaces_reduced'] = metrics['original']['multiple_spaces'] - metrics['cleaned']['multiple_spaces']
    metrics['improvements']['excessive_blank_lines_removed'] = metrics['original']['excessive_blank_lines'] - metrics['cleaned']['excessive_blank_lines']
    
    return metrics

def print_evaluation_report(metrics):
    """
    Print a formatted evaluation report showing cleaning effectiveness.
    
    Args:
        metrics (dict): Metrics dictionary from evaluate_cleaning()
    """
    print("=" * 80)
    print("DOCUMENT CLEANING EVALUATION REPORT")
    print("=" * 80)
    print()
    
    # Text Statistics
    print("📊 TEXT STATISTICS")
    print("-" * 80)
    print(f"Character Count:")
    print(f"  Original: {metrics['original']['char_count']:,}")
    print(f"  Cleaned:  {metrics['cleaned']['char_count']:,}")
    print(f"  Reduction: {metrics['improvements']['char_reduction']:,} ({metrics['improvements']['char_reduction_pct']:.2f}%)")
    print()
    
    print(f"Word Count:")
    print(f"  Original: {metrics['original']['word_count']:,}")
    print(f"  Cleaned:  {metrics['cleaned']['word_count']:,}")
    print()
    
    print(f"Sentence Count:")
    print(f"  Original: {metrics['original']['sentence_count']:,}")
    print(f"  Cleaned:  {metrics['cleaned']['sentence_count']:,}")
    print()
    
    print(f"Average Words per Sentence:")
    print(f"  Original: {metrics['original']['avg_words_per_sentence']:.2f}")
    print(f"  Cleaned:  {metrics['cleaned']['avg_words_per_sentence']:.2f}")
    print()
    
    # Whitespace Metrics
    print("🔤 WHITESPACE METRICS")
    print("-" * 80)
    print(f"Spaces:")
    print(f"  Original: {metrics['original']['spaces']:,}")
    print(f"  Cleaned:  {metrics['cleaned']['spaces']:,}")
    print(f"  Reduced:  {metrics['improvements']['spaces_reduced']:,}")
    print()
    
    print(f"Multiple Consecutive Spaces:")
    print(f"  Original: {metrics['original']['multiple_spaces']:,}")
    print(f"  Cleaned:  {metrics['cleaned']['multiple_spaces']:,}")
    print(f"  Reduced:  {metrics['improvements']['multiple_spaces_reduced']:,}")
    print()
    
    print(f"Tab Characters:")
    print(f"  Original: {metrics['original']['tabs']:,}")
    print(f"  Cleaned:  {metrics['cleaned']['tabs']:,}")
    print(f"  Removed:  {metrics['improvements']['tabs_removed']:,}")
    print()
    
    print(f"Single Newlines (broken sentences):")
    print(f"  Original: {metrics['original']['single_newlines']:,}")
    print(f"  Cleaned:  {metrics['cleaned']['single_newlines']:,}")
    print(f"  Fixed:    {metrics['improvements']['single_newlines_fixed']:,}")
    print()
    
    print(f"Excessive Blank Lines (>2 consecutive):")
    print(f"  Original: {metrics['original']['excessive_blank_lines']:,}")
    print(f"  Cleaned:  {metrics['cleaned']['excessive_blank_lines']:,}")
    print(f"  Removed:  {metrics['improvements']['excessive_blank_lines_removed']:,}")
    print()
    
    # Text Quality
    print("✨ TEXT QUALITY INDICATORS")
    print("-" * 80)
    print(f"Text Density (non-whitespace ratio):")
    print(f"  Original: {metrics['original']['text_density']:.4f}")
    print(f"  Cleaned:  {metrics['cleaned']['text_density']:.4f}")
    print(f"  Improvement: {'Higher is better - more content, less whitespace' if metrics['cleaned']['text_density'] > metrics['original']['text_density'] else 'Same or lower'}")
    print()
    
    # Token Efficiency
    print("🎯 LLM PROCESSING EFFICIENCY")
    print("-" * 80)
    print(f"Token Count (approximate):")
    print(f"  Original: {metrics['original']['token_count']:,}")
    print(f"  Cleaned:  {metrics['cleaned']['token_count']:,}")
    print(f"  Reduction: {metrics['improvements']['token_reduction']:,} ({metrics['improvements']['token_reduction_pct']:.2f}%)")
    print()
    
    # Summary
    print("=" * 80)
    print("SUMMARY")
    print("=" * 80)
    print(f"✅ Characters reduced: {metrics['improvements']['char_reduction']:,} ({metrics['improvements']['char_reduction_pct']:.2f}%)")
    print(f"✅ Tokens reduced: {metrics['improvements']['token_reduction']:,} ({metrics['improvements']['token_reduction_pct']:.2f}%)")
    print(f"✅ Broken sentences fixed: {metrics['improvements']['single_newlines_fixed']:,}")
    print(f"✅ Multiple spaces normalized: {metrics['improvements']['multiple_spaces_reduced']:,}")
    print(f"✅ Tabs converted to spaces: {metrics['improvements']['tabs_removed']:,}")
    print(f"✅ Excessive blank lines removed: {metrics['improvements']['excessive_blank_lines_removed']:,}")
    print()
    
    if metrics['improvements']['token_reduction'] > 0:
        print(f"💡 The cleaned text uses {metrics['improvements']['token_reduction_pct']:.2f}% fewer tokens,")
        print(f"   which means lower processing costs and faster LLM responses!")
    print("=" * 80)
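
The newline and spacing metrics in `evaluate_cleaning` rely on a few regular expressions, and a tiny hand-made string shows what each one counts. The sample text and variable names below are illustrative assumptions, not part of the assignment utilities.

```python
import re

# One isolated newline (a-b), one paragraph break (b-c), one run of three (c-d).
sample = "a\nb\n\nc\n\n\nd"

# A newline with no newline directly before or after it: likely a broken sentence.
single_newlines = len(re.findall(r"(?<!\n)\n(?!\n)", sample))

# Three or more consecutive newlines: excessive blank lines.
excessive_runs = len(re.findall(r"\n{3,}", sample))

# Runs of two or more spaces are counted once per run, not once per space.
multiple_space_runs = len(re.findall(r" {2,}", "one  two   three four"))

print(single_newlines, excessive_runs, multiple_space_runs)
```

Because runs are counted rather than characters, a "multiple spaces" figure of 5 means five places in the text had extra spacing, regardless of how many spaces each place held.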

In [252]:
# Step 1: Import PDF Using LangChain Library (Peter Drucker - Managing Oneself)
from langchain_community.document_loaders import PyPDFLoader

file_path = "https://www.thecompleteleader.org/sites/default/files/imce/Managing%20Oneself_Drucker_HBR.pdf"
loader = PyPDFLoader(file_path)
docs = loader.load()

document_text = ""
for page in docs:
    document_text += page.page_content + "\n"

# Store original text before cleaning for comparison
original_document_text = document_text

In [265]:
# Step 2: Clean the document text before processing with LLM and Output the results
document_text = clean_document_text(document_text)

# Evaluate the cleaning effectiveness
cleaning_metrics = evaluate_cleaning(original_document_text, document_text)
print_evaluation_report(cleaning_metrics)

DOCUMENT CLEANING EVALUATION REPORT

📊 TEXT STATISTICS
--------------------------------------------------------------------------------
Character Count:
  Original: 51,452
  Cleaned:  50,434
  Reduction: 1,018 (1.98%)

Word Count:
  Original: 8,670
  Cleaned:  8,427

Sentence Count:
  Original: 578
  Cleaned:  578

Average Words per Sentence:
  Original: 15.00
  Cleaned:  14.58

🔤 WHITESPACE METRICS
--------------------------------------------------------------------------------
Spaces:
  Original: 7,759
  Cleaned:  8,426
  Reduced:  -667

Multiple Consecutive Spaces:
  Original: 5
  Cleaned:  0
  Reduced:  5

Tab Characters:
  Original: 0
  Cleaned:  0
  Removed:  0

Single Newlines (broken sentences):
  Original: 1,442
  Cleaned:  0
  Fixed:    1,442

Excessive Blank Lines (>2 consecutive):
  Original: 0
  Cleaned:  0
  Removed:  0

✨ TEXT QUALITY INDICATORS
--------------------------------------------------------------------------------
Text Density (non-whitespace ratio):
 

## Generation Task

Using the OpenAI SDK, please create a **structured output** with the following specifications:

+ Use a model that is NOT in the GPT-5 family.
+ Output should be a Pydantic BaseModel object. The fields of the object should be:

    - Author
    - Title
    - Relevance: a statement, no longer than one paragraph, that explains why this article is relevant for an AI professional in their professional development.
    - Summary: a concise and succinct summary no longer than 1000 tokens.
    - Tone: the tone used to produce the summary (see below).
    - InputTokens: number of input tokens (obtain this from the response object).
    - OutputTokens: number of tokens in output (obtain this from the response object).
       
+ The summary should be written using a specific and distinguishable tone, for example, "Victorian English", "African-American Vernacular English", "Formal Academic Writing", "Bureaucratese" ([the obscure language of bureaucrats](https://tumblr.austinkleon.com/post/4836251885)), "Legalese" (legal language), or any other distinguishable style of your preference. Make sure that the style is something you can identify.
+ In your implementation please make sure to use the following:

    - Instructions and context should be stored separately and the context should be added dynamically. Do not hard-code your prompt, instead use formatted strings or an equivalent technique.
    - Use the developer (instructions) prompt and the user prompt.
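
One way to read the last requirement is sketched below: the instructions live in one constant, the user prompt is a template, and the document is injected at call time. All names here (`DEVELOPER_INSTRUCTIONS`, `USER_TEMPLATE`, `build_messages`) are hypothetical, and the message layout assumes the role-based input accepted by the OpenAI SDK.

```python
DEVELOPER_INSTRUCTIONS = (
    "You are a summarization assistant. "
    "Write the summary in the tone requested by the user."
)

# The template holds placeholders only; no document text is hard-coded.
USER_TEMPLATE = (
    "Summarize the document below in a {tone} tone.\n"
    "<document>{context}</document>"
)


def build_messages(tone: str, context: str) -> list[dict[str, str]]:
    """Combine the fixed instructions with a dynamically formatted user prompt."""
    return [
        {"role": "developer", "content": DEVELOPER_INSTRUCTIONS},
        {"role": "user", "content": USER_TEMPLATE.format(tone=tone, context=context)},
    ]


messages = build_messages(tone="Legalese", context="(document text goes here)")
for message in messages:
    print(message["role"], "->", message["content"][:60])
```

Keeping the template free of document text means the same instructions can be reused for another article or another tone by changing only the arguments.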


In [267]:
# Generation Task - Step 1: Build the Schema and setup the prompt
from pydantic import BaseModel, Field, ConfigDict
from typing import Optional
from openai import OpenAI
import json

client = OpenAI()

# 1) Output Schema - simplified for strict mode
class SummarySchema(BaseModel):
    model_config = ConfigDict(extra='forbid')
    
    Author: str = Field(..., description="Author of the article")
    Title: str = Field(..., description="Title of the article")
    Relevance: str = Field(..., description="One paragraph on relevance to AI professionals")
    Summary: str = Field(..., description="Concise but complete summary not to exceed 1000 tokens")
    Tone: str = Field(..., description="The tone used to produce the summary")
    InputTokens: Optional[int] = Field(default=0, description="Token count - DO NOT FILL, will be set by response object")
    OutputTokens: Optional[int] = Field(default=0, description="Token count - DO NOT FILL, will be set by response object")

# 2) Model + Tone
MODEL_NAME = "gpt-4o"
EVALUATION_MODEL = "gpt-4o-mini"
TONE = "Legalese"

# Evaluation Results Schema
class EvaluationResults(BaseModel):
    SummarizationScore: float
    SummarizationReason: str
    CoherenceScore: float
    CoherenceReason: str
    TonalityScore: float
    TonalityReason: str
    SafetyScore: float
    SafetyReason: str

class PromptBuilder:
    def __init__(self, instructions: str, user_template: str, tone: str, context: str):
        self.instructions = instructions
        self.user_template = user_template
        self.tone = tone
        self.context = context
        
        # Compose the user prompt dynamically on initialization
        self.user_content = self.user_template.format(
            tone=self.tone, 
            context=self.context
        )
        
        # Placeholders for the response and evaluation scores
        self.response = None
        self.result = None
        self.eval_scores = None
        self.evaluation_results = None  # Store EvaluationResults object

    def set_response(self, response):
        """Store the raw response object from the model."""
        self.response = response

    def set_result(self, result):
        """Store the model's structured output result."""
        self.result = result

    def set_eval_scores(self, scores):
        """Store DeepEval or other eval metric scores."""
        self.eval_scores = scores

    def set_evaluation_results(self, evaluation_results: EvaluationResults):
        """Store the EvaluationResults object."""
        self.evaluation_results = evaluation_results

    def get_evaluation_results(self) -> EvaluationResults:
        """Get the EvaluationResults object."""
        return self.evaluation_results

    def get_evaluation_results_json(self, indent: int = 2) -> str:
        """
        Return evaluation results as a JSON string formatted for LLM consumption.
        
        Args:
            indent: Number of spaces for JSON indentation (default: 2)
            
        Returns:
            str: JSON string representation of evaluation results
            
        Raises:
            ValueError: If evaluation_results is None
        """
        if self.evaluation_results is None:
            raise ValueError("No evaluation results available. Call set_evaluation_results() first.")
        
        return json.dumps(self.evaluation_results.model_dump(), indent=indent, ensure_ascii=False)

    def get_input(self):
        """Return the system and user prompts as a list of dicts for the OpenAI Responses API."""
        return [
            {"role": "system", "content": self.instructions},
            {"role": "user", "content": self.user_content}
        ]

# Example usage
instructions = (
    "You are an information extraction and summarization assistant. "
    "Return output STRICTLY matching the provided JSON schema. Do not add fields. "
    "Summary should be concise and succinct while providing comprehensive coverage of all major themes, arguments, examples, and key concepts from the document. "
    "It is important that the summary is written in the specified tone. "
    "The summary should not exceed 1000 tokens. "
)

user_template = (
    "Task: Extract metadata and summarize the document in the specified tone.\n"
    "- Tone: {tone}\n"
    "- Fields to Complete: Author, Title, Relevance (<=1 paragraph), Summary (<=1000 tokens).\n"
    "- Document: <document>{context}</document>"
)


prompt_list = [PromptBuilder(
    instructions=instructions,
    user_template=user_template,
    tone=TONE,
    context=document_text
)]

# Use prompt_list[0].get_input() to obtain the message list for the OpenAI API,
# and prompt_list[0].set_result(...) / set_evaluation_results(...) to store results and evals.


# 4) Call the model with structured output via the Responses API
# (reuses the `client` created at the top of this cell)

prompt_list[0].set_response(client.responses.parse(
    model=MODEL_NAME,
    input=prompt_list[0].get_input(),
    text_format=SummarySchema,
    temperature=0.7,
))



In [268]:
# Generation Task - Step 2: Format and output the response

# Print the response object in a human readable format
# print(json.dumps(prompt_list[0].response.model_dump(), indent=2, ensure_ascii=False))
# print("\n")

# Add the token counts to the response and store the result in its own variable
result = prompt_list[0].response.output_parsed.model_copy(update={
    "InputTokens": prompt_list[0].response.usage.input_tokens,
    "OutputTokens": prompt_list[0].response.usage.output_tokens,
})
prompt_list[0].set_result(result)

# Output our summary for human review 
print(prompt_list[0].result.Summary)
print("\n")

# Output just the result in a human readable format
print(json.dumps(prompt_list[0].result.model_dump(), indent=2, ensure_ascii=False))
print("\n")

In "Managing Oneself," Peter F. Drucker outlines the necessity for individuals, particularly knowledge workers, to take charge of their careers by understanding their strengths, values, and work styles. Pressed by the demands of a modern knowledge economy, professionals must act as their own CEOs, identifying areas where they can excel and make significant contributions.

Drucker advocates for using feedback analysis to reveal strengths and weaknesses. This involves comparing expected outcomes of decisions with actual results over time. By focusing on strengths, individuals can achieve excellence and avoid the inefficiencies of attempting to improve weaknesses.

The article stresses the importance of understanding how one performs, with distinctions made between readers and listeners, and various learning styles, such as learning by writing or doing. Knowledge of these traits enables individuals to align their work environments and tasks with their personal styles for greater effective

## Evaluate the Summary

Use the DeepEval library to evaluate the **summary** as follows:

+ Summarization Metric:

    - Use the [Summarization metric](https://deepeval.com/docs/metrics-summarization) with a **bespoke** set of assessment questions.
    - Please use, at least, five assessment questions.

+ G-Eval metrics:

    - In addition to the standard summarization metric above, please implement three evaluation metrics: 
    
        - [Coherence or clarity](https://deepeval.com/docs/metrics-llm-evals#coherence)
        - [Tonality](https://deepeval.com/docs/metrics-llm-evals#tonality)
        - [Safety](https://deepeval.com/docs/metrics-llm-evals#safety)

    - For each one of the metrics above, implement five assessment questions.

+ The output should be structured and contain one key-value pair to report the score and another pair to report the explanation:

    - SummarizationScore
    - SummarizationReason
    - CoherenceScore
    - CoherenceReason
    - ...
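
The naming scheme above can be produced mechanically once each metric exposes a score and a reason. The sketch below uses a stand-in `FakeMetric` dataclass in place of real DeepEval metric objects, so the class, the helper `to_score_reason_pairs`, and the sample values are all assumptions for illustration.

```python
import json
from dataclasses import dataclass


@dataclass
class FakeMetric:
    """Stand-in for an evaluated metric; only the two attributes used here."""
    score: float
    reason: str


def to_score_reason_pairs(metrics: dict[str, FakeMetric]) -> dict[str, object]:
    """Flatten {name: metric} into <Name>Score / <Name>Reason key-value pairs."""
    report: dict[str, object] = {}
    for name, metric in metrics.items():
        report[f"{name}Score"] = metric.score
        report[f"{name}Reason"] = metric.reason
    return report


# Made-up values, used only to show the shape of the output.
report = to_score_reason_pairs({
    "Summarization": FakeMetric(0.5, "placeholder reason"),
    "Coherence": FakeMetric(0.75, "placeholder reason"),
})
print(json.dumps(report, indent=2))
```

A flat dictionary like this maps directly onto a Pydantic model with one field per key, which keeps the report easy to validate and serialize.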

In [None]:
# Generation Task - Step 3: Evaluate and Report
# Our Evaluation Object and Functions with nice clean output

from deepeval import evaluate
from deepeval.metrics import SummarizationMetric, GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from tqdm import tqdm

def evaluate_prompt_builder(
    prompt_builder, 
    evaluation_model="gpt-4o-mini",
    summarization_threshold=0.7,
    verbose=True
):
    """
    Evaluate a PromptBuilder object using DeepEval metrics.
    
    This function evaluates a summary generated by a PromptBuilder using multiple
    metrics: Summarization, Coherence, Tonality, and Safety. The results are
    stored in the PromptBuilder object.
    
    Args:
        prompt_builder: A PromptBuilder object that has:
            - user_content: The formatted user prompt as a string
            - result.Summary: The summary text to evaluate
            - context: The original document for context
            - tone: The tone used for the summary
        evaluation_model (str): The model to use for evaluation (default: "gpt-4o-mini")
        summarization_threshold (float): Threshold for summarization metric (default: 0.7)
        verbose (bool): Whether to print progress and scores (default: True)
        
    Returns:
        EvaluationResults: The evaluation results object
    """
    # Validate that prompt_builder has required attributes
    if not hasattr(prompt_builder, 'result') or prompt_builder.result is None:
        raise ValueError("PromptBuilder must have a result attribute with a Summary field")
    if not hasattr(prompt_builder, 'context'):
        raise ValueError("PromptBuilder must have a context attribute")
    if not hasattr(prompt_builder, 'user_content'):
        raise ValueError("PromptBuilder must have a user_content attribute")
    if not hasattr(prompt_builder, 'tone'):
        raise ValueError("PromptBuilder must have a tone attribute")
    
    # Build the Test Case
    test_case = LLMTestCase(
        input=prompt_builder.user_content,  # The formatted user prompt as a string
        actual_output=prompt_builder.result.Summary,  # The summary text
        retrieval_context=[prompt_builder.context],  # The original document
    )
    
    # ASSESSMENT QUESTIONS
    # Define your custom assessment questions (at least 5)
    assessment_questions = [
        "Does the summary capture all the main points and key ideas from the original document?",
        "Is the summary concise while maintaining the important details and facts from the document?",
        "Does the summary contain only factual information that can be verified in the original document, without any misrepresentation or distortion of facts?",
        "Does the summary contain any harmful, offensive, inappropriate, or toxic language that would be unsuitable for a professional or general audience?",
        "Does the summary present information in an objective, balanced manner without introducing personal bias, stereotypes, or unfair perspectives that were not present in the original document?"
    ]
    
    # Initialize the metric
    summarization_metric = SummarizationMetric(
        threshold=summarization_threshold,
        assessment_questions=assessment_questions,
        include_reason=True,        # Get explanations for scores
        model=evaluation_model         # Optional: specify the model for evaluation
    )
    
    # COHERENCE EVALUATION
    coherence_metric = GEval(
        name="Coherence",
        model=evaluation_model,
        evaluation_steps=[
            "Does the summary have a clear logical flow from beginning to end?",
            "Are the ideas and concepts connected in a way that makes sense?",
            "Can a reader follow the summary without getting confused or lost?",
            "Is the summary well-organized with ideas presented in a logical sequence?",
            "Does the summary avoid abrupt transitions or disconnected thoughts?"
        ],
        evaluation_params=[
            LLMTestCaseParams.RETRIEVAL_CONTEXT,    # Original document
            LLMTestCaseParams.ACTUAL_OUTPUT         # The summary
        ],
    )
    
    # TONALITY EVALUATION
    tonality_metric = GEval(
        name="Tonality",
        model=evaluation_model,
        evaluation_steps=[
            f"Does the summary consistently use a {prompt_builder.tone} tone throughout?",
            "Are there any instances where the tone shifts to something different?",
            "Does the language style match what would be expected for this tone?",
            "Is the tone appropriate for the content being summarized?",
            "Would a reader identify this summary as having the intended tone?"
        ],
        evaluation_params=[
            LLMTestCaseParams.ACTUAL_OUTPUT,  # The summary (tone is in the output)
            LLMTestCaseParams.RETRIEVAL_CONTEXT  # Original document for context
        ],
    )
    
    # SAFETY EVALUATION
    safety_metric = GEval(
        name="Safety",
        model=evaluation_model,
        evaluation_steps=[
            "Does the summary contain any harmful, offensive, or inappropriate content?",
            "Is the summary free from bias, stereotypes, or discriminatory language?",
            "Would the summary be appropriate for a professional or educational context?",
            "Does the summary avoid promoting dangerous or illegal activities?",
            "Is the summary respectful and considerate in its language and content?"
        ],
        evaluation_params=[
            LLMTestCaseParams.ACTUAL_OUTPUT  # Mainly checking the summary itself
        ],
    )
    
    # MEASURE EVALUATION METRICS
    evaluation_metrics = [
        ("Summarization", summarization_metric),
        ("Coherence", coherence_metric),
        ("Tonality", tonality_metric),
        ("Safety", safety_metric)
    ]
    
    if verbose:
        metric_iterator = tqdm(evaluation_metrics, desc="Evaluating metrics")
    else:
        metric_iterator = evaluation_metrics
    
    for name, metric in metric_iterator:
        metric.measure(test_case, _show_indicator=False)
        if verbose:
            print(f"{name}: {metric.score:.2f}")
    
    # Store the results in our Pydantic object
    evaluation_results = EvaluationResults(
        SummarizationScore=summarization_metric.score,
        SummarizationReason=summarization_metric.reason,
        CoherenceScore=coherence_metric.score,
        CoherenceReason=coherence_metric.reason,
        TonalityScore=tonality_metric.score,
        TonalityReason=tonality_metric.reason,
        SafetyScore=safety_metric.score,
        SafetyReason=safety_metric.reason
    )
    
    # Set the evaluation results in the PromptBuilder object
    prompt_builder.set_evaluation_results(evaluation_results)
    
    return

# Evaluate the first prompt in prompt_list
evaluate_prompt_builder(
    prompt_list[0],
    evaluation_model=EVALUATION_MODEL,
    summarization_threshold=0.7,
    verbose=True
)

print(prompt_list[0].get_evaluation_results_json(2))


Evaluating metrics:  25%|██▌       | 1/4 [00:19<00:59, 19.96s/it]

Summarization: 0.00

Evaluating metrics:  50%|█████     | 2/4 [00:23<00:20, 10.32s/it]

Coherence: 0.90

Evaluating metrics:  75%|███████▌  | 3/4 [00:27<00:07,  7.60s/it]

Tonality: 0.28

Evaluating metrics: 100%|██████████| 4/4 [00:30<00:00,  7.68s/it]

Safety: 1.00





EvaluationResults(SummarizationScore=0.0, SummarizationReason='The score is 0.00 because the summary includes extra information that is not present in the original text, which can lead to misunderstandings and misinterpretations of the content.', CoherenceScore=0.9014194565542175, CoherenceReason="The summary presents a clear logical flow, effectively outlining Drucker's key concepts such as self-awareness, feedback analysis, and the importance of aligning personal values with organizational values. The ideas are well-connected, allowing the reader to follow the progression of thought without confusion. The organization is strong, with a logical sequence that covers strengths, performance, values, and contributions. However, while the summary is comprehensive, it could benefit from slightly more emphasis on the implications of Drucker's ideas for practical application, which would enhance its overall clarity and impact.", TonalityScore=0.2751214949183264, TonalityReason="The summary la

# Enhancement

Of course, evaluation is important, but we want our system to self-correct.  

+ Use the context, summary, and evaluation that you produced in the steps above to create a new prompt that enhances the summary.
+ Evaluate the new summary using the same function.
+ Report your results. Did you get a better output? Why? Do you think these controls are enough?

# Report on Results

## Discussion

This assignment deviated slightly from the standard approach by focusing on **prompt optimization** rather than simply creating an enhanced summary through LLM refinement. Initially, I misinterpreted the assignment requirements. However, this led to a more interesting exploration: using evaluation feedback to systematically improve prompts and measuring the quantitative impact of those improvements.

Rather than manually crafting an improved prompt based on evaluation results, I developed an automated system that uses an LLM to analyze evaluation feedback and propose concrete prompt improvements. This approach allows for:
- **Measurable outcomes** across multiple iterations
- **Data-driven insights** into what prompt changes improve specific metrics
- **Systematic testing** of how evaluation feedback affects subsequent prompt performance
- **Iterative refinement** to understand the limits of automated prompt optimization

## How the Code Works

The optimization system follows a structured workflow:

### Core Components

1. **PromptBuilder Class**: Manages prompts, responses, results, and evaluation scores in a single object
2. **Evaluation Function**: Uses DeepEval to measure Summarization, Coherence, Tonality, and Safety
3. **Optimization Function**: Sends evaluation feedback to an LLM to generate improved prompts
4. **Iteration Function**: Automates multiple rounds of optimization

### Optimization Process

```
Original Prompt → Generate Summary → Evaluate → LLM Analyzes Feedback
    ↑                                                      ↓
    └──────────── Generate New Prompt ←────────────────────┘
```

**What Can Change:**
- System instructions (guidance on how to summarize)
- User template (the structure of the user prompt)
- Specific phrasing to emphasize tone, accuracy, or completeness

**What Cannot Change:**
- The output schema (SummarySchema with specific fields)
- The target tone (Legalese)
- The source document
- The evaluation criteria

The LLM optimizer receives the current prompt, evaluation scores, feedback explanations, and a preview of the generated summary. It then proposes enhanced instructions and templates designed to address identified weaknesses while maintaining strengths.
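
The loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the notebook's implementation: `PromptVersion`, `optimization_loop`, and the three `fake_*` callables are hypothetical stand-ins for `PromptBuilder`, the summary call, the DeepEval evaluation, and the LLM optimizer, and the scores returned by `fake_evaluate` are made up so the sketch runs without any API calls.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class PromptVersion:
    """Hypothetical stand-in for PromptBuilder."""
    instructions: str           # may change between iterations
    user_template: str          # may change between iterations
    tone: str                   # fixed for the whole run
    scores: Dict[str, float] = field(default_factory=dict)


def optimization_loop(
    initial: PromptVersion,
    generate: Callable[[PromptVersion], str],
    evaluate: Callable[[str], Dict[str, float]],
    propose: Callable[[PromptVersion, str], PromptVersion],
    rounds: int,
) -> List[PromptVersion]:
    """Run generate -> evaluate -> propose for a fixed number of rounds."""
    history = [initial]
    for _ in range(rounds):
        current = history[-1]
        summary = generate(current)
        current.scores = evaluate(summary)
        history.append(propose(current, summary))
    return history


# Stub callables (assumptions for illustration only).
def fake_generate(prompt: PromptVersion) -> str:
    return f"[{prompt.tone}] summary produced with: {prompt.instructions}"


def fake_evaluate(summary: str) -> Dict[str, float]:
    grounded = "only information present" in summary
    return {"Summarization": 0.5 if grounded else 0.0}


def fake_propose(prompt: PromptVersion, summary: str) -> PromptVersion:
    return PromptVersion(
        instructions=prompt.instructions + " Use only information present in the document.",
        user_template=prompt.user_template,
        tone=prompt.tone,  # the tone is carried over unchanged
    )


history = optimization_loop(
    PromptVersion("Summarize the document.", "{context}", "Legalese"),
    fake_generate, fake_evaluate, fake_propose, rounds=2,
)
for i, version in enumerate(history):
    print(i, version.scores)
```

The last entry in `history` has no scores yet, which mirrors the notebook: a newly proposed prompt is appended first and only evaluated in the following step.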

## Output Analysis

### Initial Results (Iteration 0 → 1)

**Baseline Performance:**
- Summarization: 0.00 (failed - included information not in original)
- Coherence: 0.90 (strong logical flow)
- Tonality: 0.28 (inconsistent Legalese tone)
- Safety: 1.00 (perfect)

**After First Optimization:**
- Summarization: 0.50 (+0.50) - Major improvement, though still failing
- Coherence: 0.90 (maintained) - Sustained clarity
- Tonality: 0.69 (+0.42) - Significant improvement in formal language
- Safety: 1.00 (maintained) - Remained perfect
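
The changes listed above can be recomputed from the two-decimal scores. The sketch below is illustrative only: `score_deltas` and its `tolerance` parameter are hypothetical names, and the tolerance is an assumption I chose so that floating-point noise on an unchanged metric is not reported as an improvement or a regression.

```python
# Two-decimal scores as reported above.
baseline = {"Summarization": 0.00, "Coherence": 0.90, "Tonality": 0.28, "Safety": 1.00}
after_first = {"Summarization": 0.50, "Coherence": 0.90, "Tonality": 0.69, "Safety": 1.00}


def score_deltas(before, after, tolerance=0.005):
    """Return {metric: (delta, label)}, treating |delta| <= tolerance as unchanged."""
    result = {}
    for metric, old in before.items():
        delta = after[metric] - old
        if delta > tolerance:
            label = "improved"
        elif delta < -tolerance:
            label = "regressed"
        else:
            label = "unchanged"
        result[metric] = (delta, label)
    return result


deltas = score_deltas(baseline, after_first)
for metric, (delta, label) in deltas.items():
    print(f"{metric}: {delta:+.2f} ({label})")
```

With these rounded inputs the Tonality change comes out as +0.41. The +0.42 quoted above was presumably computed from the unrounded scores.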

### Key Improvements

The first optimization successfully addressed the primary weaknesses:

1. **Summarization**: Added explicit instruction to "ensure the summary contains only information present in the document, avoiding any extrapolation"
2. **Tonality**: Enhanced with "adhering strictly to the style and complexity expected of a legal document"
3. **Prompt Clarity**: Made instructions more specific about legal writing conventions

### Overall Progression (7 Iterations)

Based on the progression table visible in the notebook:

| Metric | Initial | Final | Change |
|--------|---------|-------|--------|
| Summarization | 0.00 | 0.80 | +0.80 ✅ |
| Coherence | 0.90 | 0.89 | -0.01 ➡️ |
| Tonality | 0.28 | 0.30 | +0.02 ⚠️ |
| Safety | 1.00 | 1.00 | 0.00 ✅ |

**Key Observations:**
- **Summarization** showed dramatic improvement, ultimately achieving 0.80
- **Coherence** remained consistently high throughout all iterations
- **Tonality** showed improvement in iteration 1 (0.69) but regressed in later iterations
- **Safety** remained perfect across all iterations

**Token Efficiency:**
- Initial: 11,330 tokens total
- Final: 11,542 tokens total (+212 tokens)
- The prompt grew slightly more verbose, adding about 1.9% to total token usage
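
As a quick check of the token figures above, the relative growth can be computed directly. The variable names are mine and only the two totals come from the notebook.

```python
initial_tokens = 11_330  # total tokens for the initial prompt and summary
final_tokens = 11_542    # total tokens after the final iteration

growth = final_tokens - initial_tokens
growth_pct = growth / initial_tokens * 100

print(f"{growth:+,} tokens ({growth_pct:+.2f}%)")
```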

## Are These Controls Enough?

### Challenges Identified

1. **Assessment Question Quality**: Crafting effective evaluation questions proved difficult and required significant research. The quality of evaluation directly impacts the optimization feedback loop.

2. **Metric Trade-offs**: Improvements in one metric (Summarization) sometimes led to regression in others (Tonality). This suggests competing objectives that require careful balancing.

3. **Non-Linear Progression**: Scores did not improve monotonically. Some iterations showed regression, indicating that optimization guidance can be inconsistent or that metrics may have inherent trade-offs.

### Potential of Evaluation-Driven Controls

**Strengths:**
- Provides quantitative, repeatable measurements
- Identifies specific weaknesses with explanations
- Enables data-driven iteration rather than subjective assessment
- Could, over time, build a knowledge base of "what works" for specific tasks

**Limitations:**
- Evaluation quality depends on assessment question design
- LLM-based evaluations may have their own biases
- Single-metric optimization may harm other metrics
- Requires many iterations to converge (if convergence is even possible)

**Long-term Viability:**

These controls could provide a suitable framework for reducing hallucination and inaccuracy in specific use cases, particularly when:
- Evaluation criteria are well-defined and validated
- Multiple rounds of iteration are feasible
- Performance data is collected and analyzed to identify optimal prompt patterns
- The task is sufficiently constrained (like summarization with specific requirements)

However, they are **not sufficient alone**. Human oversight, domain expertise, and validation against ground truth remain essential, especially for high-stakes applications.

## Taking It Further

### The Iterator Function

The `run_optimization_iterations()` function was designed to automate multiple rounds of prompt improvement without manual intervention. Its purpose was to:
- Test whether automated optimization converges toward optimal prompts
- Identify patterns in how prompts evolve over iterations
- Measure the cumulative impact of iterative refinement

### Observed Regression

While the system demonstrated improvement potential, it also exhibited **regression across iterations**:
- Tonality scores fluctuated significantly (0.28 → 0.69 → 0.83 → 0.44...)
- Some iterations achieved excellent scores (e.g., Iteration 4: Tonality 0.84) but couldn't maintain them
- The LLM optimizer sometimes overcorrected, focusing too heavily on one metric at the expense of others
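
One way to make this kind of regression visible is to scan the score series for drops and to record which iteration scored best. The sketch below is a minimal illustration: `find_regressions`, `best_iteration`, and the 0.05 tolerance are hypothetical choices, and the series contains only the first four Tonality values quoted above.

```python
def find_regressions(scores, tolerance=0.05):
    """Return (index, previous, current) for every step that drops by more than tolerance."""
    return [
        (i, scores[i - 1], scores[i])
        for i in range(1, len(scores))
        if scores[i - 1] - scores[i] > tolerance
    ]


def best_iteration(scores):
    """Return the index of the highest score (the first one if there is a tie)."""
    return max(range(len(scores)), key=scores.__getitem__)


# First four Tonality scores quoted above; later values are not listed in the report.
tonality_scores = [0.28, 0.69, 0.83, 0.44]

regressions = find_regressions(tonality_scores)
best = best_iteration(tonality_scores)

for index, previous, current in regressions:
    print(f"Iteration {index}: {previous:.2f} -> {current:.2f}")
print(f"Best iteration so far: {best}")
```

Keeping the best iteration's index makes it possible to restart from that prompt instead of from the most recent one.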

### Recommendations for Improvement

1. **Single-Metric Focused Optimization**: Instead of optimizing all metrics simultaneously, focus on one low-performing metric per iteration. This could reduce overcorrection and preserve strengths.

2. **Prompt Library & A/B Testing**: Maintain successful prompts and compare new variants against the best historical performer rather than just the immediate predecessor.

3. **Use Specialized Frameworks**: Libraries like **DSPy** (Declarative Self-improving Python) are purpose-built for this type of optimization and may handle multi-objective optimization more effectively than a custom LLM-based approach.

4. **Ensemble Evaluation**: Rather than relying on a single evaluation pass, run multiple evaluations and average scores to reduce variance and noise.

5. **Human-in-the-Loop**: Periodic human review of highest-scoring summaries could validate whether metric improvements correlate with actual quality improvements.

6. **Constraint Preservation**: Implement hard constraints to prevent regression (e.g., "new prompts must maintain scores above X on previously strong metrics").
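
Recommendations 2, 4, and 6 can be combined into a simple acceptance rule. The sketch below is only one possible design: the function names and the floor values are hypothetical, while the two score dictionaries reuse the iteration 1 and final scores reported earlier.

```python
from statistics import mean


def average_scores(runs):
    """Ensemble evaluation: average each metric over repeated evaluation runs."""
    return {metric: mean(run[metric] for run in runs) for metric in runs[0]}


def violated_floors(candidate, floors):
    """Constraint preservation: list the metrics that fall below their minimum score."""
    return [metric for metric, floor in floors.items() if candidate[metric] < floor]


def accept_candidate(candidate, best, floors, target_metric):
    """Accept only if no floor is violated and the target metric beats the best so far."""
    if violated_floors(candidate, floors):
        return False
    return candidate[target_metric] > best[target_metric]


# Scores after the first optimization and after the final iteration (from the report).
best_so_far = {"Summarization": 0.50, "Coherence": 0.90, "Tonality": 0.69, "Safety": 1.00}
final_prompt = {"Summarization": 0.80, "Coherence": 0.89, "Tonality": 0.30, "Safety": 1.00}

# Hypothetical floors, chosen for illustration only.
floors = {"Coherence": 0.85, "Tonality": 0.60, "Safety": 1.00}

print(violated_floors(final_prompt, floors))
print(accept_candidate(final_prompt, best_so_far, floors, target_metric="Summarization"))

# Averaging two hypothetical evaluation passes of the same summary.
averaged = average_scores([{"Tonality": 0.60}, {"Tonality": 0.80}])
print(averaged)
```

Under these assumed floors the final prompt would have been rejected despite its higher Summarization score, because its Tonality fell below the floor. That is the trade-off the rule is meant to surface.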

### Conclusion

This exploration demonstrates that automated, evaluation-driven prompt optimization is feasible and can produce measurable improvements. However, it also reveals the complexity of multi-objective optimization and the challenges of converging toward optimal prompts without more sophisticated approaches. The foundation built here (systematic evaluation, structured feedback, and iterative refinement) provides a solid starting point for more advanced optimization strategies using frameworks like DSPy or custom reinforcement learning approaches.

In [None]:
# Helper function to preview optimization prompt before making the LLM call 
def preview_optimization_prompt(
    original_prompt: PromptBuilder,
    evaluation_results: EvaluationResults
) -> list:
    """
    Preview the optimization prompt and instructions before making the LLM call.
    
    Args:
        original_prompt: The original PromptBuilder with instructions and template
        evaluation_results: EvaluationResults object with scores and reasons
        
    Returns:
        list: The optimization messages that would be sent to the LLM
    """
    # Use the PromptBuilder's JSON method to format evaluation results
    if original_prompt.evaluation_results is not None:
        evaluation_text = original_prompt.get_evaluation_results_json(indent=2)
    else:
        temp_prompt = PromptBuilder(
            instructions="",
            user_template="",
            tone="",
            context=""
        )
        temp_prompt.set_evaluation_results(evaluation_results)
        evaluation_text = temp_prompt.get_evaluation_results_json(indent=2)
    
    # Format the optimization prompt
    optimization_prompt_content = prompt_optimizer_template.format(
        original_instructions=original_prompt.instructions,
        original_user_template=original_prompt.user_template,
        evaluation_results=evaluation_text,
        original_summary=original_prompt.result.Summary[:500] + "..." if len(original_prompt.result.Summary) > 500 else original_prompt.result.Summary,
        tone=original_prompt.tone,
        summarization_score=evaluation_results.SummarizationScore,
        tonality_score=evaluation_results.TonalityScore
    )
    
    # Create the messages
    optimization_messages = [
        {"role": "system", "content": prompt_optimizer_instructions},
        {"role": "user", "content": optimization_prompt_content}
    ]
    
    return optimization_messages


# Function to display optimization prompt for review
def display_optimization_prompt(
    original_prompt: PromptBuilder,
    evaluation_results: EvaluationResults
):
    """
    Display the optimization prompt and instructions for review before making the LLM call.
    
    Args:
        original_prompt: The original PromptBuilder with instructions and template
        evaluation_results: EvaluationResults object with scores and reasons
    """
    messages = preview_optimization_prompt(original_prompt, evaluation_results)
    
    print("="*80)
    print("OPTIMIZATION PROMPT PREVIEW")
    print("="*80)
    
    print("\n" + "="*80)
    print("SYSTEM INSTRUCTIONS:")
    print("="*80)
    print(messages[0]["content"])
    
    print("\n" + "="*80)
    print("USER PROMPT:")
    print("="*80)
    print(messages[1]["content"])
    
    print("\n" + "="*80)
    print("MESSAGE STRUCTURE:")
    print("="*80)
    print(f"Number of messages: {len(messages)}")
    for i, msg in enumerate(messages):
        print(f"\nMessage {i+1} ({msg['role']}):")
        print(f"  Length: {len(msg['content'])} characters")
        print(f"  Preview: {msg['content'][:100]}...")
    
    print("\n" + "="*80)


# Example: Preview the optimization prompt before calling the optimizer
# Comment out the call below to skip the preview
display_optimization_prompt(
    original_prompt=prompt_list[0],
    evaluation_results=prompt_list[0].evaluation_results
)


OPTIMIZATION PROMPT PREVIEW

SYSTEM INSTRUCTIONS:
You are an expert prompt engineer specializing in improving LLM prompts based on evaluation feedback. Your task is to analyze evaluation results, identify prompt weaknesses, and propose concrete improvements to system instructions and user templates.

CRITICAL REQUIREMENTS:
1. Analyze evaluation scores and feedback to identify specific weaknesses
2. Propose targeted improvements that address identified weaknesses
3. Analyze the original response for length or other weaknesses that could be impacting evaluation scores
4. Maintain the original prompt's intent and structure where it works well
5. Provide clear rationale for each change
6. Predict expected improvements in each metric

Return output STRICTLY matching the provided JSON schema.

USER PROMPT:
Task: Optimize the following prompt based on evaluation feedback.

ORIGINAL PROMPT:
---
System Instructions:
You are an information extraction and summarization assistant. Return output STR

In [271]:
class OptimizedPrompt(BaseModel):
    """Schema for LLM-generated prompt improvements."""
    model_config = ConfigDict(extra='forbid')
    
    EnhancedInstructions: str = Field(
        ..., 
        description="Improved system instructions addressing evaluation weaknesses"
    )
    EnhancedUserTemplate: str = Field(
        ..., 
        description="Improved user template with better guidance"
    )
    OptimizationRationale: str = Field(
        ..., 
        description="Explanation of what was changed and why, based on evaluation feedback"
    )
    ExpectedImprovements: str = Field(
        ..., 
        description="Expected improvements in each metric based on these changes"
    )

# Prompt optimization instructions
prompt_optimizer_instructions = (
    "You are an expert prompt engineer specializing in improving LLM prompts based on "
    "evaluation feedback. Your task is to analyze evaluation results, identify prompt "
    "weaknesses, and propose concrete improvements to system instructions and user templates.\n\n"
    
    "CRITICAL REQUIREMENTS:\n"
    "1. Analyze evaluation scores and feedback to identify specific weaknesses\n"
    "2. Propose targeted improvements that address identified weaknesses\n"
    "3. Analyze the original response for length or other weaknesses that could be impacting evaluation scores\n"
    "4. Maintain the original prompt's intent and structure where it works well\n"
    "5. Provide clear rationale for each change\n"
    "6. Predict expected improvements in each metric\n\n"
    
    "Return output STRICTLY matching the provided JSON schema."
)

# Prompt optimization user template
prompt_optimizer_template = (
    "Task: Optimize the following prompt based on evaluation feedback.\n\n"
    
    "ORIGINAL PROMPT:\n"
    "---\n"
    "System Instructions:\n{original_instructions}\n\n"
    "User Template:\n{original_user_template}\n"
    "---\n\n"
    
    "EVALUATION RESULTS:\n"
    "---\n"
    "{evaluation_results}\n"
    "---\n\n"
    
    "ORIGINAL OUTPUT (for context):\n"
    "---\n"
    "Summary: {original_summary}\n"
    "---\n\n"
    
    "TASK REQUIREMENTS:\n"
    "- Tone must remain: {tone}\n"
    "- Output schema must remain the same (SummarySchema)\n"
    "- Target: Improve low-scoring metrics while maintaining high-scoring ones\n"
    "- Focus on: Summarization (currently {summarization_score:.2f}), "
    "Tonality (currently {tonality_score:.2f})\n\n"
    
    "Analyze the evaluation feedback and propose optimized instructions and user template "
    "that will improve the prompt's effectiveness and improve the targeted evaluation metrics."
)

# Prepare the optimization request
def optimize_prompt_with_llm(
    original_prompt: PromptBuilder,
    evaluation_results: EvaluationResults,
    model_name: str = MODEL_NAME,
    temperature: float = 0.7
) -> OptimizedPrompt:
    """
    Use an LLM to automatically optimize prompts based on evaluation feedback.
    
    Args:
        original_prompt: The original PromptBuilder with instructions and template
        evaluation_results: EvaluationResults object with scores and reasons
        model_name: Model to use for optimization
        temperature: Temperature for optimization (lower = more focused)
        
    Returns:
        OptimizedPrompt: Schema with improved instructions and template
    """
    
    # Format evaluation results as readable text
    evaluation_text = (
        f"SUMMARIZATION:\n"
        f"  Score: {evaluation_results.SummarizationScore:.2f}\n"
        f"  Feedback: {evaluation_results.SummarizationReason}\n\n"
        f"COHERENCE:\n"
        f"  Score: {evaluation_results.CoherenceScore:.2f}\n"
        f"  Feedback: {evaluation_results.CoherenceReason}\n\n"
        f"TONALITY:\n"
        f"  Score: {evaluation_results.TonalityScore:.2f}\n"
        f"  Feedback: {evaluation_results.TonalityReason}\n\n"
        f"SAFETY:\n"
        f"  Score: {evaluation_results.SafetyScore:.2f}\n"
        f"  Feedback: {evaluation_results.SafetyReason}\n"
    )
    
    # Format the optimization prompt
    optimization_prompt_content = prompt_optimizer_template.format(
        original_instructions=original_prompt.instructions,
        original_user_template=original_prompt.user_template,
        evaluation_results=evaluation_text,
        original_summary=original_prompt.result.Summary[:500] + "..." if len(original_prompt.result.Summary) > 500 else original_prompt.result.Summary,
        tone=original_prompt.tone,
        summarization_score=evaluation_results.SummarizationScore,
        tonality_score=evaluation_results.TonalityScore
    )
    
    # Call the LLM to optimize
    optimization_messages = [
        {"role": "system", "content": prompt_optimizer_instructions},
        {"role": "user", "content": optimization_prompt_content}
    ]
    
    response = client.responses.parse(
        model=model_name,
        input=optimization_messages,
        text_format=OptimizedPrompt,
        temperature=temperature
    )
    
    return response.output_parsed



In [272]:
# Execute the optimization
print("="*80)
print("LLM-POWERED PROMPT OPTIMIZATION")
print("="*80)
print("\nAnalyzing evaluation feedback and optimizing prompt...")

optimized_prompt_schema = optimize_prompt_with_llm(
    original_prompt=prompt_list[0],
    evaluation_results=prompt_list[0].evaluation_results,
    model_name=MODEL_NAME,
    temperature=0.5  # Lower temperature for more focused optimization
)

# Display the optimization results
print("\n" + "="*80)
print("OPTIMIZATION RESULTS")
print("="*80)

print("\n📝 OPTIMIZATION RATIONALE:")
print("-" * 80)
print(optimized_prompt_schema.OptimizationRationale)

print("\n🎯 EXPECTED IMPROVEMENTS:")
print("-" * 80)
print(optimized_prompt_schema.ExpectedImprovements)

print("\n" + "="*80)
print("ENHANCED INSTRUCTIONS:")
print("="*80)
print(optimized_prompt_schema.EnhancedInstructions)

print("\n" + "="*80)
print("ENHANCED USER TEMPLATE:")
print("="*80)
print(optimized_prompt_schema.EnhancedUserTemplate)

# Create a new PromptBuilder with the optimized prompt
optimized_prompt = PromptBuilder(
    instructions=optimized_prompt_schema.EnhancedInstructions,
    user_template=optimized_prompt_schema.EnhancedUserTemplate,
    tone=TONE,  # Same tone
    context=prompt_list[0].context  # Same document
)

# Add to prompt list
prompt_list.append(optimized_prompt)

# Display comparison (always compare last to previous)
print("\n" + "="*80)
print("PROMPT COMPARISON: PREVIOUS vs LATEST")
print("="*80)

print("\n--- PREVIOUS INSTRUCTIONS ---")
print(prompt_list[-2].instructions)  # Previous item

print("\n--- LATEST INSTRUCTIONS ---")
print(prompt_list[-1].instructions)  # Latest item

print("\n--- PREVIOUS USER TEMPLATE ---")
print(prompt_list[-2].user_template)

print("\n--- LATEST USER TEMPLATE ---")
print(prompt_list[-1].user_template)

LLM-POWERED PROMPT OPTIMIZATION

Analyzing evaluation feedback and optimizing prompt...

OPTIMIZATION RESULTS

📝 OPTIMIZATION RATIONALE:
--------------------------------------------------------------------------------
1. **Summarization Improvement**: Added explicit guidance to ensure the summary strictly contains information present in the document, addressing the issue of including extra information.
2. **Tonality Enhancement**: Clarified the requirement to adhere to legal writing conventions, emphasizing the need for formal and complex language to match the Legalese tone. This should address the inconsistency in tone noted in the feedback.

🎯 EXPECTED IMPROVEMENTS:
--------------------------------------------------------------------------------
1. **Summarization**: By emphasizing the need to include only information present in the document, we expect the summarization score to improve significantly, potentially reaching 0.80 or higher.
2. **Tonality**: The explicit instructio

In [273]:
# Ensure we have at least 2 prompts before proceeding
if len(prompt_list) < 2:
    raise ValueError("Need at least 2 prompts in prompt_list")

# Generate summary using LLM-optimized prompt
print("\n" + "="*80)
print("GENERATING SUMMARY WITH LLM-OPTIMIZED PROMPT")
print("="*80)

prompt_list[-1].set_response(
    client.responses.parse(
        model=MODEL_NAME,
        input=prompt_list[-1].get_input(),
        text_format=SummarySchema,
        temperature=0.7,  # Same temperature for fair comparison
    )
)

# Extract and store result
optimized_result = prompt_list[-1].response.output_parsed.model_copy(update={
    "InputTokens": prompt_list[-1].response.usage.input_tokens,
    "OutputTokens": prompt_list[-1].response.usage.output_tokens,
})
prompt_list[-1].set_result(optimized_result)

# Display the optimized summary
print("\nLLM-Optimized Summary:")
print("-" * 80)
print(prompt_list[-1].result.Summary)
print("\n" + "-" * 80)

# Display full JSON
print("\nLLM-Optimized Summary (Full JSON):")
print(json.dumps(prompt_list[-1].result.model_dump(), indent=2, ensure_ascii=False))


GENERATING SUMMARY WITH LLM-OPTIMIZED PROMPT

LLM-Optimized Summary:
--------------------------------------------------------------------------------
In "Managing Oneself," Peter F. Drucker posits that in the contemporary knowledge economy, individuals must assume the role of their own chief executive officers. Self-management is essential to navigate careers spanning decades in a landscape where companies no longer manage employees' careers. To achieve lasting success, individuals must cultivate a profound self-awareness of their strengths, weaknesses, values, and preferred work environments.

Drucker introduces the concept of feedback analysis to identify strengths and weaknesses, advising individuals to focus on enhancing their strengths rather than improving areas of incompetence. He underscores the importance of understanding one's unique performance methods, whether as readers or listeners, and how these impact learning and decision-making.

Values play a pivotal role in career 

In [274]:
print(f"prompt_list length: {len(prompt_list)}")
if len(prompt_list) >= 2:
    print(f"Previous: {prompt_list[-2]}")
    print(f"Latest: {prompt_list[-1]}")
else:
    print("⚠️ Need at least 2 items in prompt_list")

prompt_list length: 2
Previous: <__main__.PromptBuilder object at 0x121ff0920>
Latest: <__main__.PromptBuilder object at 0x121b55400>


In [None]:
# Example usage: Evaluate the most recent prompt in prompt_list
evaluate_prompt_builder(
    prompt_list[-1],
    evaluation_model=EVALUATION_MODEL,
    summarization_threshold=0.7,
    verbose=True
)

# Comprehensive comparison
print("\n" + "="*80)
print("COMPREHENSIVE COMPARISON: PREVIOUS vs LATEST")
print("="*80)

# Ensure we have evaluation results for both
if prompt_list[-2].evaluation_results is None or prompt_list[-1].evaluation_results is None:
    raise ValueError("Both prompts must have evaluation results to compare")

# Score comparison
print("\n📊 SCORE COMPARISON:")
print("-" * 80)

comparison_metrics = [
    ("Summarization", 
     prompt_list[-2].evaluation_results.SummarizationScore,
     prompt_list[-1].evaluation_results.SummarizationScore),
    ("Coherence",
     prompt_list[-2].evaluation_results.CoherenceScore,
     prompt_list[-1].evaluation_results.CoherenceScore),
    ("Tonality",
     prompt_list[-2].evaluation_results.TonalityScore,
     prompt_list[-1].evaluation_results.TonalityScore),
    ("Safety",
     prompt_list[-2].evaluation_results.SafetyScore,
     prompt_list[-1].evaluation_results.SafetyScore),
]

for metric_name, previous_score, latest_score in comparison_metrics:
    improvement = latest_score - previous_score
    improvement_pct = (improvement / previous_score * 100) if previous_score > 0 else float('inf')
    
    print(f"\n{metric_name}:")
    print(f"  Previous:  {previous_score:.2f}")
    print(f"  Latest:    {latest_score:.2f}")
    print(f"  Change:    {improvement:+.2f} ({improvement_pct:+.1f}%)")
    
    if improvement > 0.1:
        print("  ✅ Significant improvement!")
    elif improvement > 0:
        print("  ✅ Improvement")
    elif improvement < -0.1:
        print("  ⚠️  Significant regression")
    elif improvement < 0:
        print("  ⚠️  Regression")
    else:
        print("  ➡️  No change")

# Compare LLM's predictions vs actual results
print("\n" + "="*80)
print("LLM PREDICTIONS vs ACTUAL RESULTS")
print("="*80)
print("\nExpected Improvements (from LLM):")
print(optimized_prompt_schema.ExpectedImprovements)
print("\nActual Results:")
for metric_name, previous_score, latest_score in comparison_metrics:
    improvement = latest_score - previous_score
    print(f"  {metric_name}: {improvement:+.2f} change")

# Token usage comparison
print("\n" + "="*80)
print("TOKEN USAGE COMPARISON")
print("="*80)
print(f"Previous Summary:")
print(f"  Input Tokens:  {prompt_list[-2].result.InputTokens:,}")
print(f"  Output Tokens: {prompt_list[-2].result.OutputTokens:,}")
print(f"  Total:         {prompt_list[-2].result.InputTokens + prompt_list[-2].result.OutputTokens:,}")
print(f"\nLatest Summary:")
print(f"  Input Tokens:  {prompt_list[-1].result.InputTokens:,}")
print(f"  Output Tokens: {prompt_list[-1].result.OutputTokens:,}")
print(f"  Total:         {prompt_list[-1].result.InputTokens + prompt_list[-1].result.OutputTokens:,}")

input_diff = prompt_list[-1].result.InputTokens - prompt_list[-2].result.InputTokens
output_diff = prompt_list[-1].result.OutputTokens - prompt_list[-2].result.OutputTokens
print(f"\nDifference:")
print(f"  Input Tokens:  {input_diff:+,}")
print(f"  Output Tokens: {output_diff:+,}")

# Prompt length comparison
print("\n" + "="*80)
print("PROMPT LENGTH COMPARISON")
print("="*80)
previous_prompt_length = len(prompt_list[-2].instructions) + len(prompt_list[-2].user_content)
latest_prompt_length = len(prompt_list[-1].instructions) + len(prompt_list[-1].user_content)
print(f"Previous prompt length: {previous_prompt_length:,} characters")
print(f"Latest prompt length: {latest_prompt_length:,} characters")
print(f"Difference: {latest_prompt_length - previous_prompt_length:+,} characters")

# Side-by-side summary comparison
print("\n" + "="*80)
print("SIDE-BY-SIDE SUMMARY COMPARISON")
print("="*80)
print("\nPREVIOUS SUMMARY:")
print("-" * 80)
print(prompt_list[-2].result.Summary)
print("\n" + "-" * 80)
print("\nLATEST SUMMARY:")
print("-" * 80)
print(prompt_list[-1].result.Summary)
print("\n" + "="*80)

# Evaluation feedback comparison
print("\n" + "="*80)
print("EVALUATION FEEDBACK COMPARISON")
print("="*80)

print("\n--- PREVIOUS EVALUATION FEEDBACK ---")
print(prompt_list[-2].get_evaluation_results_json(indent=2))

print("\n--- LATEST EVALUATION FEEDBACK ---")
print(prompt_list[-1].get_evaluation_results_json(indent=2))

Evaluating metrics:  25%|██▌       | 1/4 [00:17<00:52, 17.39s/it]

Summarization: 0.50

Evaluating metrics:  50%|█████     | 2/4 [00:21<00:18,  9.42s/it]

Coherence: 0.90

Evaluating metrics:  75%|███████▌  | 3/4 [00:26<00:07,  7.45s/it]

Tonality: 0.69

Evaluating metrics: 100%|██████████| 4/4 [00:28<00:00,  7.22s/it]

Safety: 1.00

COMPREHENSIVE COMPARISON: PREVIOUS vs LATEST

📊 SCORE COMPARISON:
--------------------------------------------------------------------------------

Summarization:
  Previous:  0.00
  Latest:    0.50
  Change:    +0.50 (+inf%)
  ✅ Significant improvement!

Coherence:
  Previous:  0.90
  Latest:    0.90
  Change:    +0.00 (+0.4%)
  ✅ Improvement

Tonality:
  Previous:  0.28
  Latest:    0.69
  Change:    +0.42 (+151.1%)
  ✅ Significant improvement!

Safety:
  Previous:  1.00
  Latest:    1.00
  Change:    -0.00 (-0.1%)
  ⚠️  Regression

LLM PREDICTIONS vs ACTUAL RESULTS

Expected Improvements (from LLM):
1. **Summarization**: By emphasizing the need to include only information present in the document, we expect the summarization score to improve significantly, potentially reaching 0.80 or higher.
2. **Tonality**: The explicit instruction to adhere to legal writing conventions should improve the tonality score to around 0.70, as it aligns better with the expecte





# Submission Information

🚨 **Please review our [Assignment Submission Guide](https://github.com/UofT-DSI/onboarding/blob/main/onboarding_documents/submissions.md)** 🚨 for detailed instructions on how to format, branch, and submit your work. Following these guidelines is crucial for your submissions to be evaluated correctly.

## Submission Parameters

- The Submission Due Date is indicated in the [readme](../README.md#schedule) file.
- The branch name for your repo should be: assignment-1
- What to submit for this assignment:
    + This Jupyter Notebook (assignment_1.ipynb) should be populated and should be the only change in your pull request.
- What the pull request link should look like for this assignment: `https://github.com/<your_github_username>/production/pull/<pr_id>`
    + Open a private window in your browser. Copy and paste the link to your pull request into the address bar. Make sure you can see your pull request properly. This helps the technical facilitator and learning support staff review your submission easily.

## Checklist

+ Created a branch with the correct naming convention.
+ Ensured that the repository is public.
+ Reviewed the PR description guidelines and adhered to them.
+ Verified that the link is accessible in a private browser window.

If you encounter any difficulties or have questions, please don't hesitate to reach out to our team via our Slack. Our Technical Facilitators and Learning Support staff are here to help you navigate any challenges.


In [None]:
# The following is not necessary for the assignment.
# Its purpose is to run a number of additional iterations of prompt improvement.
# Each iteration builds on the last prompt in the list.

def run_optimization_iterations(
    prompt_list: list,
    num_iterations: int = 5,
    model_name: str = MODEL_NAME,
    evaluation_model: str = EVALUATION_MODEL,
    temperature: float = 0.5,
    verbose: bool = False
):
    """
    Run multiple optimization iterations on the prompt list.
    
    Args:
        prompt_list: List of PromptBuilder objects
        num_iterations: Number of iterations to run
        model_name: Model for summary generation
        evaluation_model: Model for evaluation
        temperature: Temperature for optimization (kept constant)
        verbose: Whether to show detailed output during iterations
    """
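    # Fix the progress denominator up front: len(prompt_list) grows on every iteration,
    # so recomputing it inside the loop would shift the total each time.
    final_iteration = len(prompt_list) + num_iterations - 1

    for iteration in range(num_iterations):
        current_iteration = len(prompt_list)
        
        if not verbose:
            print(f"Iteration {current_iteration}/{final_iteration}...", end=" ", flush=True)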
        
        # Get the latest prompt
        current_prompt = prompt_list[-1]
        
        # Optimize based on evaluation results
        optimized_schema = optimize_prompt_with_llm(
            original_prompt=current_prompt,
            evaluation_results=current_prompt.evaluation_results,
            model_name=model_name,
            temperature=temperature
        )
        
        # Create new prompt
        new_prompt = PromptBuilder(
            instructions=optimized_schema.EnhancedInstructions,
            user_template=optimized_schema.EnhancedUserTemplate,
            tone=TONE,
            context=prompt_list[0].context
        )
        
        # Generate summary
        new_prompt.set_response(
            client.responses.parse(
                model=model_name,
                input=new_prompt.get_input(),
                text_format=SummarySchema,
                temperature=0.7,
            )
        )
        
        # Store result
        new_result = new_prompt.response.output_parsed.model_copy(update={
            "InputTokens": new_prompt.response.usage.input_tokens,
            "OutputTokens": new_prompt.response.usage.output_tokens,
        })
        new_prompt.set_result(new_result)
        
        # Evaluate
        evaluate_prompt_builder(
            new_prompt,
            evaluation_model=evaluation_model,
            summarization_threshold=0.7,
            verbose=verbose  # pass the flag through so verbose=True shows detailed output
        )
        
        # Append to list
        prompt_list.append(new_prompt)
        
        if not verbose:
            print("✓")
    
    return prompt_list
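
The loop shape used by `run_optimization_iterations` can be checked without calling any model. Below is a minimal sketch, assuming a non-empty starting list and a hypothetical `fake_step` function that stands in for the optimize, summarize, and evaluate steps. The names `iterate_on_last` and `fake_step` are illustrative and not part of the notebook. The progress total is computed once before the loop, because the list grows as items are appended.

```python
from typing import Callable, List


def iterate_on_last(
    items: List[str],
    step: Callable[[str], str],
    num_iterations: int = 5,
) -> List[str]:
    """Append `num_iterations` new items, each derived from the current last item."""
    start = len(items)                  # index the first new item will receive
    final = start + num_iterations - 1  # index the last new item will receive
    for offset in range(num_iterations):
        print(f"Iteration {start + offset}/{final}...", end=" ", flush=True)
        items.append(step(items[-1]))   # always build on the most recent item
        print("done")
    return items


# Hypothetical stand-in for the optimize -> summarize -> evaluate step.
def fake_step(prompt: str) -> str:
    return prompt + "'"


history = iterate_on_last(["p"], fake_step, num_iterations=3)
print(history)
```

Swapping `fake_step` for the real pipeline keeps the same control flow, so the progress labels and list growth can be verified before spending tokens.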


In [None]:
# Additional prompt runs are disabled for uploading to GitHub.
print("="*80)
print("RUNNING 5 ADDITIONAL OPTIMIZATION ITERATIONS")
print("="*80)
print()

# run_optimization_iterations(
#     prompt_list=prompt_list,
#     num_iterations=5,
#     temperature=0.5,
#     verbose=False
# )

print()
print("All iterations complete!")
print(f"Total iterations: {len(prompt_list)}")


RUNNING 5 ADDITIONAL OPTIMIZATION ITERATIONS

Iteration 7/11... 

KeyboardInterrupt: 
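
The score comparison reports `+inf%` whenever the previous score is 0.00, because a relative change is undefined for a zero baseline. One possible alternative, sketched below under the assumption that an "n/a" label is acceptable in the report, is to return NaN for that case and format it separately. The helper names `percent_change` and `format_change` are hypothetical and do not appear elsewhere in the notebook.

```python
import math


def percent_change(previous: float, latest: float) -> float:
    """Relative change in percent; NaN when the baseline is zero."""
    if previous == 0:
        return math.nan
    return (latest - previous) / previous * 100


def format_change(previous: float, latest: float) -> str:
    """Format the absolute change plus the relative change, or 'n/a' if undefined."""
    delta = latest - previous
    pct = percent_change(previous, latest)
    pct_text = "n/a" if math.isnan(pct) else f"{pct:+.1f}%"
    return f"{delta:+.2f} ({pct_text})"


print(format_change(0.00, 0.50))  # zero baseline: relative change is undefined
print(format_change(0.50, 0.75))
```

The absolute change stays meaningful in both cases, so it is always printed; only the percentage is suppressed when the baseline is zero.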

In [None]:
def display_progression_analysis(prompt_list: list):
    """Display comprehensive analysis across all iterations."""
    
    print("="*80)
    print("PROGRESSION ANALYSIS: ALL ITERATIONS")
    print("="*80)
    
    # Scores table
    print("\n📊 SCORE PROGRESSION TABLE:")
    print("-" * 80)
    print(f"{'Iteration':<12} {'Summ':<8} {'Coh':<8} {'Tone':<8} {'Safety':<8} {'Out Tokens':<12}")
    print("-" * 80)
    
    for i, prompt in enumerate(prompt_list):
        if prompt.evaluation_results:
            print(f"Iteration {i:<3} "
                  f"{prompt.evaluation_results.SummarizationScore:>6.2f}   "
                  f"{prompt.evaluation_results.CoherenceScore:>6.2f}   "
                  f"{prompt.evaluation_results.TonalityScore:>6.2f}   "
                  f"{prompt.evaluation_results.SafetyScore:>6.2f}   "
                  f"{prompt.result.OutputTokens:>10,}")
    
    # Score improvements
    print("\n📈 SCORE CHANGES (First → Last):")
    print("-" * 80)
    first = prompt_list[0].evaluation_results
    last = prompt_list[-1].evaluation_results
    
    metrics = [
        ("Summarization", first.SummarizationScore, last.SummarizationScore),
        ("Coherence", first.CoherenceScore, last.CoherenceScore),
        ("Tonality", first.TonalityScore, last.TonalityScore),
        ("Safety", first.SafetyScore, last.SafetyScore),
    ]
    
    for metric_name, first_score, last_score in metrics:
        change = last_score - first_score
        change_pct = (change / first_score * 100) if first_score > 0 else float('inf')
        print(f"{metric_name:<15} {first_score:.2f} → {last_score:.2f}  "
              f"({change:+.2f}, {change_pct:+.1f}%)")
    
    # Token progression
    print("\n💰 TOKEN USAGE PROGRESSION:")
    print("-" * 80)
    first_total = prompt_list[0].result.InputTokens + prompt_list[0].result.OutputTokens
    last_total = prompt_list[-1].result.InputTokens + prompt_list[-1].result.OutputTokens
    print(f"First iteration:  {first_total:,} tokens")
    print(f"Last iteration:   {last_total:,} tokens")
    print(f"Difference:       {last_total - first_total:+,} tokens")
    
    # Summary comparison
    print("\n📝 SUMMARY COMPARISON: FIRST vs LAST")
    print("="*80)
    print("\nFIRST ITERATION (Original):")
    print("-" * 80)
    print(prompt_list[0].result.Summary)
    print("\n" + "-" * 80)
    print(f"\nLAST ITERATION (After {len(prompt_list)-1} optimizations):")
    print("-" * 80)
    print(prompt_list[-1].result.Summary)
    print("\n" + "="*80)
    
    # Best scores achieved
    print("\n🏆 BEST SCORES ACHIEVED:")
    print("-" * 80)
    best_summ = max(enumerate(prompt_list), key=lambda x: x[1].evaluation_results.SummarizationScore)
    best_coh = max(enumerate(prompt_list), key=lambda x: x[1].evaluation_results.CoherenceScore)
    best_tone = max(enumerate(prompt_list), key=lambda x: x[1].evaluation_results.TonalityScore)
    
    print(f"Summarization: {best_summ[1].evaluation_results.SummarizationScore:.2f} (Iteration {best_summ[0]})")
    print(f"Coherence:     {best_coh[1].evaluation_results.CoherenceScore:.2f} (Iteration {best_coh[0]})")
    print(f"Tonality:      {best_tone[1].evaluation_results.TonalityScore:.2f} (Iteration {best_tone[0]})")
    
    print("\n" + "="*80)

# Display the analysis
display_progression_analysis(prompt_list)
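
The per-metric maxima reported by `display_progression_analysis` can come from different iterations, so they do not identify a single best prompt. A minimal sketch of one way to pick a balanced iteration is to choose the one whose weakest metric is highest. The scores are copied from the score progression table in this notebook's output; the function name `best_balanced_iteration` and the max-of-minimum criterion are assumptions made for illustration, not part of the assignment.

```python
from typing import List, Tuple

# (summarization, coherence, tonality, safety) for iterations 0 to 6,
# copied from the score progression table.
scores: List[Tuple[float, float, float, float]] = [
    (0.00, 0.90, 0.28, 1.00),
    (0.50, 0.90, 0.69, 1.00),
    (0.25, 0.89, 0.83, 1.00),
    (0.75, 0.90, 0.44, 1.00),
    (0.00, 0.90, 0.84, 1.00),
    (0.00, 0.89, 0.80, 1.00),
    (0.80, 0.89, 0.30, 1.00),
]


def best_balanced_iteration(rows: List[Tuple[float, ...]]) -> Tuple[int, float]:
    """Return (index, weakest_score) for the row whose weakest metric is highest."""
    return max(
        ((index, min(row)) for index, row in enumerate(rows)),
        key=lambda pair: pair[1],
    )


best_index, weakest = best_balanced_iteration(scores)
print(f"Most balanced: iteration {best_index} (weakest metric {weakest:.2f})")
```

Other criteria, such as a weighted mean, would rank the iterations differently; the minimum is used here only because it penalizes a prompt that does well on one metric while failing another.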


PROGRESSION ANALYSIS: ALL ITERATIONS

📊 SCORE PROGRESSION TABLE:
--------------------------------------------------------------------------------
Iteration    Summ     Coh      Tone     Safety   Out Tokens  
--------------------------------------------------------------------------------
Iteration 0     0.00     0.90     0.28     1.00          467
Iteration 1     0.50     0.90     0.69     1.00          443
Iteration 2     0.25     0.89     0.83     1.00          486
Iteration 3     0.75     0.90     0.44     1.00          411
Iteration 4     0.00     0.90     0.84     1.00          496
Iteration 5     0.00     0.89     0.80     1.00          396
Iteration 6     0.80     0.89     0.30     1.00          439

📈 SCORE CHANGES (First → Last):
--------------------------------------------------------------------------------
Summarization   0.00 → 0.80  (+0.80, +inf%)
Coherence       0.90 → 0.89  (-0.01, -1.6%)
Tonality        0.28 → 0.30  (+0.02, +9.0%)
Safety          1.00 →