# Deploying AI
## Assignment 1: Evaluating Summaries

A key application of LLMs is to summarize documents. In this assignment, we will not only summarize documents, but also evaluate the quality of the summary and return the results using structured outputs.

**Instructions:** Please complete the sections below, stating any relevant decisions that you have made and showing the code that substantiates your solution.

## Select a Document

Please select one of the following articles:

+ [Managing Oneself, by Peter Drucker](https://www.thecompleteleader.org/sites/default/files/imce/Managing%20Oneself_Drucker_HBR.pdf) (PDF)
+ [The GenAI Divide: State of AI in Business 2025](https://www.artificialintelligence-news.com/wp-content/uploads/2025/08/ai_report_2025.pdf) (PDF)
+ [What is Noise?, by Alex Ross](https://www.newyorker.com/magazine/2024/04/22/what-is-noise) (Web)

# Load Secrets

In [29]:
%load_ext dotenv
%dotenv ../05_src/.secrets

The dotenv extension is already loaded. To reload it, use:
  %reload_ext dotenv


## Load Document

Depending on your choice, you can consult the appropriate set of functions below. Make sure that you understand the content that is extracted and whether you need to perform any additional operations (such as joining page content).

### PDF

You can load a PDF by following the instructions in [LangChain's documentation](https://docs.langchain.com/oss/python/langchain/knowledge-base#loading-documents). Note that the output of the loading procedure is a collection of pages. You can join the pages by using the code below.

```python
document_text = ""
for page in docs:
    document_text += page.page_content + "\n"
```
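
The same join can also be written as a single `str.join` call. The sketch below is only illustrative: `Page` is a hypothetical stand-in for the page objects returned by a loader (only the `page_content` attribute is assumed), so the snippet runs without a PDF on disk.

```python
from dataclasses import dataclass


@dataclass
class Page:
    """Hypothetical stand-in for a loaded page; only page_content is assumed."""
    page_content: str


def join_pages(pages) -> str:
    """Concatenate page contents, separating consecutive pages with a newline."""
    return "\n".join(page.page_content for page in pages)


# Illustrative pages; a real run would pass the list returned by the loader.
docs = [Page("First page text."), Page("Second page text.")]
document_text = join_pages(docs)
print(document_text)
```

Unlike the loop above, `str.join` does not leave a trailing newline after the last page, which is usually harmless for summarization.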

### Web

LangChain also provides a set of web loaders, including the [WebBaseLoader](https://docs.langchain.com/oss/python/integrations/document_loaders/web_base). You can use this class to load web pages.

In [30]:
from langchain_community.document_loaders import WebBaseLoader

# Load the document from the web
# Using "What is Noise?" by Alex Ross
try:
    loader = WebBaseLoader("https://www.newyorker.com/magazine/2024/04/22/what-is-noise")
    docs = loader.load()
    
    # Combine content from all loaded documents
    document_text = "\n".join([doc.page_content for doc in docs])
    
    print(f"Loaded {len(docs)} documents.")
    print(f"Content length: {len(document_text)} characters.")
    print(f"First 500 characters:\n{document_text[:500]}...")
except Exception as e:
    print(f"Error loading document: {e}")


Loaded 1 documents.
Content length: 35232 characters.
First 500 characters:
What Is Noise? | The New YorkerSkip to main contentNewsletterSearchSearchThe LatestNewsBooks & CultureFiction & PoetryHumor & CartoonsMagazinePuzzles & GamesVideoPodcastsGoings OnShop100th AnniversaryOpen Navigation MenuMenuAnnals of SoundWhat Is Noise?Sometimes we embrace it, sometimes we hate it—and everything depends on who is making it.By Alex RossApril 15, 2024Noise has come to mean an engulfing barrage of data—less an event than a condition.Illustration by Petra PéterffySave this storySave...
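
The extracted text starts with site navigation rather than the article itself, so one possible additional operation is to trim everything before a known marker. The sketch below is an assumption-laden illustration: `trim_leading_boilerplate` is a hypothetical helper, the marker `"Annals of Sound"` is chosen by inspecting the output above, and `sample` is a shortened copy of that output.

```python
def trim_leading_boilerplate(text: str, start_marker: str) -> str:
    """Return text starting at the first start_marker; unchanged if it is absent."""
    index = text.find(start_marker)
    return text if index == -1 else text[index:]


# Shortened sample of the extracted page text, for illustration only.
sample = (
    "What Is Noise? | The New YorkerSkip to main contentNewsletter"
    "Annals of SoundWhat Is Noise?Sometimes we embrace it"
)
cleaned = trim_leading_boilerplate(sample, "Annals of Sound")
print(cleaned)
```

A marker-based trim is brittle if the page layout changes, so it is worth printing the first few hundred characters after cleaning to confirm the result.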


## Generation Task

Using the OpenAI SDK, please create a **structured output** with the following specifications:

+ Use a model that is NOT in the GPT-5 family.
+ Output should be a Pydantic BaseModel object. The fields of the object should be:

    - Author
    - Title
    - Relevance: a statement, no longer than one paragraph, that explains why this article is relevant for an AI professional in their professional development.
    - Summary: a concise and succinct summary no longer than 1000 tokens.
    - Tone: the tone used to produce the summary (see below).
    - InputTokens: number of input tokens (obtain this from the response object).
    - OutputTokens: number of tokens in output (obtain this from the response object).
       
+ The summary should be written using a specific and distinguishable tone, for example, "Victorian English", "African-American Vernacular English", "Formal Academic Writing", "Bureaucratese" ([the obscure language of bureaucrats](https://tumblr.austinkleon.com/post/4836251885)), "Legalese" (legal language), or any other distinguishable style of your preference. Make sure that the style is something you can identify.
+ In your implementation please make sure to use the following:

    - Instructions and context should be stored separately and the context should be added dynamically. Do not hard-code your prompt, instead use formatted strings or an equivalent technique.
    - Use the developer (instructions) prompt and the user prompt.
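
One way to keep instructions and context separate is to hold each in its own template and fill them in at call time. The sketch below is a minimal illustration, not a required structure: the template wording, the `build_messages` helper, and the example inputs are all assumptions, and whether a given model or endpoint accepts the `developer` role (rather than the older `system` role) should be checked before relying on it.

```python
# Instructions (developer prompt) and context (user prompt) live in separate templates.
INSTRUCTIONS_TEMPLATE = (
    "You are an expert summarizer. Summarize the provided document "
    "in a distinct {tone} style."
)
CONTEXT_TEMPLATE = "Here is the document to summarize:\n\n{document_text}"


def build_messages(tone: str, document_text: str) -> list[dict]:
    """Fill both templates and return a chat-style message list."""
    return [
        {"role": "developer", "content": INSTRUCTIONS_TEMPLATE.format(tone=tone)},
        {"role": "user", "content": CONTEXT_TEMPLATE.format(document_text=document_text)},
    ]


# Illustrative inputs; a real run would pass the loaded article text.
messages = build_messages("Victorian English", "Example article text.")
for message in messages:
    print(message["role"], "->", message["content"][:60])
```

Because the document text is passed as a value to `str.format` and is not part of the template, braces inside the article do not interfere with the formatting.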


In [31]:
from pydantic import BaseModel, Field
from openai import OpenAI
import os

# Define the Pydantic model for structured output
class SummaryOutput(BaseModel):
    Author: str = Field(description="The author of the article")
    Title: str = Field(description="The title of the article")
    Relevance: str = Field(description="A statement explaining why this article is relevant for an AI professional")
    Summary: str = Field(description="A concise summary of the article, no longer than 1000 tokens")
    Tone: str = Field(description="The tone used to produce the summary")
    InputTokens: int = Field(description="Number of input tokens used")
    OutputTokens: int = Field(description="Number of output tokens generated")

# Initialize OpenAI client
client = OpenAI(base_url='https://k7uffyg03f.execute-api.us-east-1.amazonaws.com/prod/openai/v1', 
                #api_key=OPENAI_API_KEY,
                default_headers={"x-api-key": os.getenv('OPENAI_API_KEY')})

# Define the tone
tone = "Victorian English"

# Instructions and Context
system_instruction = f"You are an expert summarizer with a penchant for {tone}. Your task is to summarize the provided document in a distinct {tone} style. Analyze the document and provide the Author, Title, Relevance, and Summary. Leave InputTokens and OutputTokens as 0; they will be filled in later."

user_prompt = f"Here is the document to summarize:\n\n{document_text}"

try:
    completion = client.beta.chat.completions.parse(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system_instruction},
            {"role": "user", "content": user_prompt}
        ],
        response_format=SummaryOutput,
    )

    result = completion.choices[0].message.parsed
    
    # Obtain token usage from the response object
    result.InputTokens = completion.usage.prompt_tokens
    result.OutputTokens = completion.usage.completion_tokens
    
    print(result.model_dump_json(indent=2))
    
    # Store result for next steps
    summary_result = result

except Exception as e:
    print(f"Error generating summary: {e}")


{
  "Author": "Alex Ross",
  "Title": "What Is Noise?",
  "Relevance": "This article is pertinent for AI professionals as it explores the concept of noise, not only in the acoustic sense but also as a metaphor for data and information overload—an essential consideration in the realms of data science, machine learning, and artificial intelligence.",
  "Summary": "In a fascinating exposition, Alex Ross delves into the multifaceted nature of \"noise,\" a term with etymological roots suggesting nuisance and madness, yet also encompassing joyful sounds and musical accompaniment. From the tumult of urban life to the subtle nuances of music, noise embodies both chaos and artistic expression. Languages across the globe portray noise with varying degrees of intensity and meaning, reflecting a spectrum from brutality to beauty. The article highlights personal experiences with noise, illustrating the emotional battle between controlled sound and the imposition of unwanted noise, a theme common 

# Evaluate the Summary

Use the DeepEval library to evaluate the **summary** as follows:

+ Summarization Metric:

    - Use the [Summarization metric](https://deepeval.com/docs/metrics-summarization) with a **bespoke** set of assessment questions.
    - Please use at least five assessment questions.

+ G-Eval metrics:

    - In addition to the standard summarization metric above, please implement three evaluation metrics: 
    
        - [Coherence or clarity](https://deepeval.com/docs/metrics-llm-evals#coherence)
        - [Tonality](https://deepeval.com/docs/metrics-llm-evals#tonality)
        - [Safety](https://deepeval.com/docs/metrics-llm-evals#safety)

    - For each one of the metrics above, implement five assessment questions.

+ The output should be structured and contain one key-value pair to report the score and another pair to report the explanation:

    - SummarizationScore
    - SummarizationReason
    - CoherenceScore
    - CoherenceReason
    - ...
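
The key naming above can be produced mechanically from a mapping of metric names to results. The sketch below assumes only that each result exposes a `score` and a `reason`; `MetricResult` is a hypothetical stand-in for a measured metric, and the scores and reasons are placeholders rather than real evaluation results.

```python
from dataclasses import dataclass
import json


@dataclass
class MetricResult:
    """Hypothetical stand-in for a measured metric (score and reason only)."""
    score: float
    reason: str


def build_report(results: dict) -> dict:
    """Flatten {name: result} into <Name>Score / <Name>Reason key-value pairs."""
    report = {}
    for name, result in results.items():
        report[f"{name}Score"] = result.score
        report[f"{name}Reason"] = result.reason
    return report


# Placeholder values for illustration only.
report = build_report({
    "Summarization": MetricResult(0.5, "Placeholder reason."),
    "Coherence": MetricResult(0.8, "Placeholder reason."),
})
print(json.dumps(report, indent=2))
```

Wrapping this in a function also makes it easy to evaluate a second summary with exactly the same reporting format.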

In [32]:
from deepeval import evaluate
from deepeval.metrics import SummarizationMetric, GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.models.base_model import DeepEvalBaseLLM
import json

# Define Custom LLM Wrapper for DeepEval to use the existing client
class CustomOpenAI(DeepEvalBaseLLM):
    def __init__(self, client):
        self.client = client

    def load_model(self):
        return self.client

    def generate(self, prompt: str) -> str:
        chat_completion = self.client.chat.completions.create(
            messages=[{"role": "user", "content": prompt}],
            model="gpt-4o-mini",
        )
        return chat_completion.choices[0].message.content

    async def a_generate(self, prompt: str) -> str:
        # Wrapping synchronous call for simplicity as client is sync
        return self.generate(prompt)

    def get_model_name(self):
        return "gpt-4o-mini"

# Initialize the custom model wrapper with your existing client
custom_model = CustomOpenAI(client)

# Assessment Questions for Summarization
summarization_questions = [
    "Does the summary identify the author correctly?",
    "Does the summary mention the title of the article?",
    "Does the summary capture the main argument about noise?",
    "Is the summary concise?",
    "Does the summary reflect the requested tone?"
]

# Evaluation Steps for GEval
coherence_questions = [
    "Check if the summary is logically organized.",
    "Check if the summary flows smoothly between paragraphs.",
    "Check if the sentences are well-constructed.",
    "Check if the summary is easy to follow.",
    "Check if the summary avoids contradictions."
]

tonality_questions = [
    "Check if the summary uses Victorian English style.",
    "Check if the vocabulary is consistent with the requested tone.",
    "Check if the summary sounds like it was written in the 19th century.",
    "Check if the tone is formal and academic.",
    "Check if the summary avoids modern slang."
]

safety_questions = [
    "Check if the summary avoids harmful content.",
    "Check if the summary is free of bias.",
    "Check if the summary avoids PII.",
    "Check if the summary is respectful.",
    "Check if the summary avoids hallucinations."
]

# Define Metrics with the custom model
summarization_metric = SummarizationMetric(
    assessment_questions=summarization_questions,
    model=custom_model
)

coherence_metric = GEval(
    name="Coherence",
    criteria="Coherence - determine if the summary is coherent and logical.",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    evaluation_steps=coherence_questions,
    model=custom_model
)

tonality_metric = GEval(
    name="Tonality",
    criteria="Tonality - determine if the summary matches the requested Victorian English tone.",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    evaluation_steps=tonality_questions,
    model=custom_model
)

safety_metric = GEval(
    name="Safety",
    criteria="Safety - determine if the summary is safe and harmless.",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    evaluation_steps=safety_questions,
    model=custom_model
)

# Create Test Case
test_case = LLMTestCase(
    input=user_prompt,
    actual_output=summary_result.Summary,
    context=[document_text]
)

print("Running evaluation...")
try:
    # Measure metrics
    summarization_metric.measure(test_case)
    coherence_metric.measure(test_case)
    tonality_metric.measure(test_case)
    safety_metric.measure(test_case)

    # Create Structured Output for Evaluation
    evaluation_output = {
        "SummarizationScore": summarization_metric.score,
        "SummarizationReason": summarization_metric.reason,
        "CoherenceScore": coherence_metric.score,
        "CoherenceReason": coherence_metric.reason,
        "TonalityScore": tonality_metric.score,
        "TonalityReason": tonality_metric.reason,
        "SafetyScore": safety_metric.score,
        "SafetyReason": safety_metric.reason
    }

    print(json.dumps(evaluation_output, indent=2))
except Exception as e:
    print(f"Error during evaluation: {e}")

Running evaluation...


{
  "SummarizationScore": 0.3076923076923077,
  "SummarizationReason": "The score is 0.31 because the summary contains significant contradictions to the original text regarding the etymological roots of 'noise,' incorrectly attributing meanings not found in the source. Additionally, it introduces a substantial amount of extra information that deviates from the original focus and themes, which detracts from the overall coherence and accuracy of the summary.",
  "CoherenceScore": 0.8,
  "CoherenceReason": "The summary is logically organized and flows smoothly, transitioning from the definition and implications of noise to personal experiences and historical insights. Sentences are well-constructed and maintain clarity throughout, making it easy to follow. However, the summary could be slightly more concise, as some ideas could be streamlined for an increased impact. Overall, it effectively avoids contradictions.",
  "TonalityScore": 0.1,
  "TonalityReason": "The response lacks the distin

# Enhancement

Of course, evaluation is important, but we want our system to self-correct.  

+ Use the context, summary, and evaluation that you produced in the steps above to create a new prompt that enhances the summary.
+ Evaluate the new summary using the same function.
+ Report your results. Did you get a better output? Why? Do you think these controls are enough?
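
The first step asks for the context, the summary, and the evaluation to be combined into one prompt. A minimal sketch of such a prompt builder follows; the template wording and the helper names `format_feedback` and `build_enhancement_prompt` are assumptions, and the inputs are placeholders. Including the source document lets the model check its rewrite against the original text instead of relying only on the previous summary.

```python
ENHANCEMENT_TEMPLATE = (
    "Rewrite the summary below so that it addresses the evaluation feedback.\n"
    "Use only facts found in the source document and keep the {tone} tone.\n\n"
    "Source document:\n{document_text}\n\n"
    "Current summary:\n{summary}\n\n"
    "Evaluation feedback:\n{feedback}"
)


def format_feedback(evaluation: dict) -> str:
    """Turn <Name>Score / <Name>Reason pairs into one bullet line per metric."""
    lines = []
    for key, score in evaluation.items():
        if key.endswith("Score"):
            name = key[: -len("Score")]
            reason = evaluation.get(f"{name}Reason", "")
            lines.append(f"- {name} (score {score}): {reason}")
    return "\n".join(lines)


def build_enhancement_prompt(document_text: str, summary: str,
                             evaluation: dict, tone: str) -> str:
    """Combine context, summary, and evaluation feedback into a single prompt."""
    return ENHANCEMENT_TEMPLATE.format(
        tone=tone,
        document_text=document_text,
        summary=summary,
        feedback=format_feedback(evaluation),
    )


# Placeholder inputs for illustration only.
prompt = build_enhancement_prompt(
    document_text="Example article text.",
    summary="Example summary.",
    evaluation={"TonalityScore": 0.1, "TonalityReason": "Placeholder reason."},
    tone="Victorian English",
)
print(prompt)
```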

In [33]:
# Create feedback string from evaluation
try:
    feedback = f"""
    Summary Feedback:
    - Summarization: {summarization_metric.reason} (Score: {summarization_metric.score})
    - Coherence: {coherence_metric.reason} (Score: {coherence_metric.score})
    - Tonality: {tonality_metric.reason} (Score: {tonality_metric.score})
    - Safety: {safety_metric.reason} (Score: {safety_metric.score})
    """

    enhancement_prompt = f"""
    I have a summary that needs improvement based on the following feedback:
    {feedback}

    Original Summary:
    {summary_result.Summary}

    Please rewrite the summary to address the feedback and improve the score. 
    Maintain the {tone} tone.
    Return the result in the same structured format.
    """

    # Call OpenAI again
    completion_enhanced = client.beta.chat.completions.parse(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system_instruction},
            {"role": "user", "content": enhancement_prompt}
        ],
        response_format=SummaryOutput,
    )

    result_enhanced = completion_enhanced.choices[0].message.parsed
    
    # Capture token usage manually (as done in Generation task)
    result_enhanced.InputTokens = completion_enhanced.usage.prompt_tokens
    result_enhanced.OutputTokens = completion_enhanced.usage.completion_tokens
    
    print("Enhanced Summary Generated.")
    print(result_enhanced.model_dump_json(indent=2))
    
    # Re-evaluate
    test_case_enhanced = LLMTestCase(
        input=user_prompt,
        actual_output=result_enhanced.Summary,
        context=[document_text]
    )

    print("\nRunning evaluation on enhanced summary...")
    summarization_metric.measure(test_case_enhanced)
    coherence_metric.measure(test_case_enhanced)
    tonality_metric.measure(test_case_enhanced)
    safety_metric.measure(test_case_enhanced)
    
    evaluation_output_enhanced = {
        "SummarizationScore": summarization_metric.score,
        "SummarizationReason": summarization_metric.reason,
        "CoherenceScore": coherence_metric.score,
        "CoherenceReason": coherence_metric.reason,
        "TonalityScore": tonality_metric.score,
        "TonalityReason": tonality_metric.reason,
        "SafetyScore": safety_metric.score,
        "SafetyReason": safety_metric.reason
    }
    
    print("Enhanced Evaluation Results:")
    print(json.dumps(evaluation_output_enhanced, indent=2))
    
    # Comparison
    print("\nComparison:")
    print(f"Original Summarization Score: {evaluation_output['SummarizationScore']} -> Enhanced: {evaluation_output_enhanced['SummarizationScore']}")
    print(f"Original Coherence Score: {evaluation_output['CoherenceScore']} -> Enhanced: {evaluation_output_enhanced['CoherenceScore']}")
    print(f"Original Tonality Score: {evaluation_output['TonalityScore']} -> Enhanced: {evaluation_output_enhanced['TonalityScore']}")
    print(f"Original Safety Score: {evaluation_output['SafetyScore']} -> Enhanced: {evaluation_output_enhanced['SafetyScore']}")
    
    print("\nAnalysis:")
    if evaluation_output_enhanced['SummarizationScore'] > evaluation_output['SummarizationScore']:
        print("The summary improved based on the feedback.")
    else:
        print("The summarization score did not improve; the feedback alone may not have been sufficient to correct the summary.")

except Exception as e:
    print(f"Error in enhancement step: {e}")

Enhanced Summary Generated.
{
  "Author": "Alex Ross",
  "Title": "The Multifaceted Nature of Noise",
  "Relevance": "This article provides invaluable insights into the multifarious concept of noise, offering reflections on its historical and social implications, thus enriching an AI professional's understanding of human perception and communication challenges.",
  "Summary": "In a most captivating discourse, the esteemed Alex Ross embarks upon an exploration of the intriguing concept of \"noise,\" a term whose etymological origins connote a blend of vexation and disarray, whilst simultaneously embracing the delightful harmonies of life. Within the clamor of urban existence, as well as in the nuanced expressions of musicality, noise emerges as both a harbinger of tumult and an agent of artistic flourish. Cultures worldwide convey the essence of noise through diverse interpretations, thereby establishing a continuum that traverses the realms of anguish and beauty. Mr. Ross artfully eluc

Enhanced Evaluation Results:
{
  "SummarizationScore": 0.0,
  "SummarizationReason": "The score is 0.00 because the summary contains several contradictions to the original text, including misrepresentations of concepts related to noise and their implications, and it introduces extra information not present in the original text. This leads to a significant disconnect between the summary and the original content, making it ineffective.",
  "CoherenceScore": 0.9,
  "CoherenceReason": "The summary is logically organized, exploring the multifaceted nature of 'noise' in a coherent manner. It flows smoothly between various themes, from cultural interpretations to personal encounters and technological advancements. Sentences are well-constructed and contribute to an easy-to-follow narrative. However, a minor shortcoming is that some sections could benefit from clearer transitions to enhance flow further, but overall, it avoids contradictions effectively.",
  "TonalityScore": 0.8,
  "TonalityRe

Please, do not forget to add your comments.
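
When writing those comments, it can help to look at the change in every metric rather than the summarization score alone. The sketch below uses a hypothetical `score_deltas` helper and the scores visible in the two evaluation outputs above; the safety scores are omitted because the printed outputs are cut off before them.

```python
def score_deltas(before: dict, after: dict) -> dict:
    """Return after - before for each metric present in both runs, rounded to 2 places."""
    return {
        name: round(after[name] - before[name], 2)
        for name in before
        if name in after
    }


# Scores copied from the original and enhanced evaluation outputs.
before = {"Summarization": 0.3076923076923077, "Coherence": 0.8, "Tonality": 0.1}
after = {"Summarization": 0.0, "Coherence": 0.9, "Tonality": 0.8}

deltas = score_deltas(before, after)
for name, delta in deltas.items():
    print(f"{name}: {delta:+.2f}")
```

In this run the tonality and coherence scores rose while the summarization score fell, so a single pass or fail message based on one metric would hide the trade-off.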


# Submission Information

🚨 **Please review our [Assignment Submission Guide](https://github.com/UofT-DSI/onboarding/blob/main/onboarding_documents/submissions.md)** 🚨 for detailed instructions on how to format, branch, and submit your work. Following these guidelines is crucial for your submissions to be evaluated correctly.

## Submission Parameters

- The Submission Due Date is indicated in the [readme](../README.md#schedule) file.
- The branch name for your repo should be: assignment-1
- What to submit for this assignment:
    + This Jupyter Notebook (assignment_1.ipynb) should be populated and should be the only change in your pull request.
- What the pull request link should look like for this assignment: `https://github.com/<your_github_username>/production/pull/<pr_id>`
    + Open a private window in your browser. Copy and paste the link to your pull request into the address bar. Make sure you can see your pull request properly. This helps the technical facilitator and learning support staff review your submission easily.

## Checklist

+ Created a branch with the correct naming convention.
+ Ensured that the repository is public.
+ Reviewed the PR description guidelines and adhered to them.
+ Verified that the link is accessible in a private browser window.

If you encounter any difficulties or have questions, please don't hesitate to reach out to our team via our Slack. Our Technical Facilitators and Learning Support staff are here to help you navigate any challenges.
