# Deploying AI
## Assignment 1: Evaluating Summaries

A key application of LLMs is to summarize documents. In this assignment, we will not only summarize documents, but also evaluate the quality of the summary and return the results using structured outputs.

**Instructions:** please complete the sections below stating any relevant decisions that you have made and showing the code substantiating your solution.

## Select a Document

Please select one out of the following articles:

+ [Managing Oneself, by Peter Drucker](https://www.thecompleteleader.org/sites/default/files/imce/Managing%20Oneself_Drucker_HBR.pdf) (PDF)
+ [The GenAI Divide: State of AI in Business 2025](https://www.artificialintelligence-news.com/wp-content/uploads/2025/08/ai_report_2025.pdf) (PDF)
+ [What is Noise?, by Alex Ross](https://www.newyorker.com/magazine/2024/04/22/what-is-noise) (Web)

# Load Secrets

In [1]:
%load_ext dotenv
%dotenv ../05_src/.secrets
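
The `%dotenv` magic is only available inside IPython. Below is a minimal sketch of a plain-Python equivalent, assuming the `python-dotenv` package is installed. `load_secrets` and `require_env` are hypothetical helper names, not part of any library; the second one fails early when the key is missing rather than later, inside a request.

```python
import os


def load_secrets(path: str = "../05_src/.secrets") -> bool:
    """Load key=value pairs from `path` into os.environ (needs python-dotenv)."""
    from dotenv import load_dotenv  # imported lazily so require_env works without it

    return load_dotenv(path)


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"{name} is not set; check that the secrets file was loaded")
    return value


# Usage in the notebook (after the secrets are loaded):
# api_key = require_env("API_GATEWAY_KEY")
```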

## Load Document

Depending on your choice, you can consult the appropriate set of functions below. Make sure that you understand the extracted content and whether you need to perform any additional operations (like joining page content).

### PDF

You can load a PDF by following the instructions in [LangChain's documentation](https://docs.langchain.com/oss/python/langchain/knowledge-base#loading-documents). Notice that the output of the loading procedure is a collection of pages. You can join the pages by using the code below.

```python
document_text = ""
for page in docs:
    document_text += page.page_content + "\n"
```
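
The same join can be written as a small reusable function. This is a sketch only: `join_pages` is a hypothetical helper, and the stand-in pages below replace the `docs` list that a LangChain loader (for example `PyPDFLoader`) would return. Unlike the loop above, it does not leave a trailing newline after the last page.

```python
from types import SimpleNamespace


def join_pages(pages, separator="\n"):
    """Concatenate the `page_content` of each loaded page into one string."""
    return separator.join(page.page_content for page in pages)


# Stand-in pages for illustration; in the notebook, `docs` would come from a
# LangChain loader such as PyPDFLoader("<path to the PDF>").load().
example_pages = [
    SimpleNamespace(page_content="Page one."),
    SimpleNamespace(page_content="Page two."),
]
print(join_pages(example_pages))
```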

### Web

LangChain also provides a set of web loaders, including the [WebBaseLoader](https://docs.langchain.com/oss/python/integrations/document_loaders/web_base). You can use this function to load web pages.

In [2]:
#installed the langchain package into the current uv environment
from langchain_community.document_loaders import WebBaseLoader
from bs4 import BeautifulSoup
from bs4 import SoupStrainer


#update loader to only get the article
loader = WebBaseLoader(
    "https://www.newyorker.com/magazine/2024/04/22/what-is-noise",
    bs_kwargs={
        "parse_only": SoupStrainer("article")  # Only parse <article> tags
    }
)
docs = loader.load()
docs

USER_AGENT environment variable not set, consider setting it to identify your requests.
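
The warning above is harmless, but it can be avoided. A minimal sketch, assuming the warning is emitted when the loader module is imported, so the variable has to be set before that import. The agent string is a placeholder, not a required value.

```python
import os

# Set a descriptive user agent *before* importing WebBaseLoader; the value
# below is a placeholder and should identify your own project.
os.environ.setdefault("USER_AGENT", "deploying-ai-assignment-1 (educational use)")

# from langchain_community.document_loaders import WebBaseLoader
```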




In [3]:
import unicodedata
import re

document_text = ""
for page in docs:
    text = page.page_content
    # Convert accented characters to ASCII
    text = unicodedata.normalize('NFKD', text)
    text = text.encode('ascii', 'ignore').decode('ascii')
    # Remove any remaining weird characters
    text = re.sub(r'[^\w\s.,!?;:\-\'\"()]', ' ', text)
    # Clean up multiple spaces
    text = ' '.join(text.split())
    document_text += text + "\n"
    
document_text
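
To see what the cleaning steps do to the text, it can help to run them on a short sample. The sketch below mirrors the cell above in a standalone function (`clean_text` is a hypothetical name). Note that the ASCII step silently drops dashes and curly quotes, and the whitelist drops symbols such as `%`, so quoted passages in the cleaned text no longer match the source character for character.

```python
import re
import unicodedata


def clean_text(text: str) -> str:
    """Apply the same cleaning steps as the cell above to a single string."""
    # Decompose accented characters, then drop anything that is not ASCII
    text = unicodedata.normalize("NFKD", text)
    text = text.encode("ascii", "ignore").decode("ascii")
    # Replace characters outside a small whitelist with a space
    text = re.sub(r"[^\w\s.,!?;:\-'\"()]", " ", text)
    # Collapse runs of whitespace (including newlines) into single spaces
    return " ".join(text.split())


sample = "Caf\u00e9 \u2014 \u201cnoise\u201d is 100% subjective!"
print(clean_text(sample))  # Cafe noise is 100 subjective!
```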



In [12]:
#ok, now the output is a single long string
#try to load the model
from openai import OpenAI
from pydantic import BaseModel
import os

client = OpenAI(default_headers={"x-api-key": os.getenv('API_GATEWAY_KEY')},
    base_url='https://k7uffyg03f.execute-api.us-east-1.amazonaws.com/prod/openai/v1')

#based on the example in 04_1, define what I want from the output
#try a different model that is not gpt-4 :D
#define fields
relevance_instructions = "statement, no longer than one paragraph, that explains why this article is relevant for an AI professional in their professional development."
tone_instructions = "scientific articles for the general public"
max_tokens = 1000
summary_exclude = "information that cannot be supported with text from the input document."
summary_descriptors = "relevant, concise, and succinct"


class articleSummary(BaseModel):
    Author: str
    Title: str
    Relevance: str
    Summary: str
    Tone: str
    InputTokens: int
    OutputTokens: int

response = client.responses.parse(
    model="gpt-4o-mini",
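    # Adjacent string literals are concatenated; this avoids the stray "]" and the
    # indentation whitespace that backslash continuations put into the prompt
    instructions=(
        f"Summarize the document in the tone of {tone_instructions}. "
        f"The Relevance field should contain a {relevance_instructions} "
        f"The Summary should be a {summary_descriptors} summary of the document text, "
        f"no longer than {max_tokens} tokens. "
        f"The Summary should not include {summary_exclude}"
    ),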
    input=[
        {
            "role": "user",
            "content": f"Here is the document {document_text}",
        },
    ],
    text_format=articleSummary,
)

ai_output = response.output_parsed
ai_output

RateLimitError: Error code: 429 - {'message': 'Too Many Requests'}
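
A 429 response usually means the request should be retried after a pause. Below is a minimal sketch of exponential backoff; `call_with_backoff` is a hypothetical helper, and the delays and attempt count are arbitrary choices, not values recommended by the gateway.

```python
import time


def call_with_backoff(fn, retry_on=(Exception,), max_attempts=5, base_delay=2.0, sleep=time.sleep):
    """Call `fn()` and retry with exponential backoff when it raises `retry_on`."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts:
                raise  # give up and surface the last error
            sleep(base_delay * 2 ** (attempt - 1))  # 2s, 4s, 8s, ...


# Usage in the notebook (assumption: openai.RateLimitError is the exception raised):
# import openai
# response = call_with_backoff(
#     lambda: client.responses.parse(...),  # same arguments as in the cell above
#     retry_on=(openai.RateLimitError,),
# )
```

Passing the exception types explicitly keeps the helper from retrying on errors that a pause cannot fix, such as an invalid request.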

In [8]:
#ok, i have the output now, but I would like it to look nice
import json
#set the actual values for the tokens
ai_output.InputTokens = response.usage.input_tokens
ai_output.OutputTokens = response.usage.output_tokens
print(json.dumps(ai_output.model_dump(), indent=2, ensure_ascii=False))

NameError: name 'response' is not defined

## Generation Task

Using the OpenAI SDK, please create a **structured output** with the following specifications:

+ Use a model that is NOT in the GPT-5 family.
+ Output should be a Pydantic BaseModel object. The fields of the object should be:

    - Author
    - Title
    - Relevance: a statement, no longer than one paragraph, that explains why this article is relevant for an AI professional in their professional development.
    - Summary: a concise and succinct summary no longer than 1000 tokens.
    - Tone: the tone used to produce the summary (see below).
    - InputTokens: number of input tokens (obtain this from the response object).
    - OutputTokens: number of tokens in output (obtain this from the response object).
       
+ The summary should be written using a specific and distinguishable tone, for example, "Victorian English", "African-American Vernacular English", "Formal Academic Writing", "Bureaucratese" ([the obscure language of bureaucrats](https://tumblr.austinkleon.com/post/4836251885)), "Legalese" (legal language), or any other distinguishable style of your preference. Make sure that the style is something you can identify.
+ In your implementation please make sure to use the following:

    - Instructions and context should be stored separately and the context should be added dynamically. Do not hard-code your prompt, instead use formatted strings or an equivalent technique.
    - Use the developer (instructions) prompt and the user prompt.
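
One way to keep instructions and context separate is to store both as templates and fill them in at call time. The sketch below is an illustration under that assumption; the template wording and the `build_prompts` helper are hypothetical, not a required format.

```python
INSTRUCTIONS_TEMPLATE = (
    "Summarize the document in the tone of {tone}. "
    "The Relevance field should contain a {relevance}. "
    "The Summary should be a {descriptors} summary of the document text, "
    "no longer than {max_tokens} tokens."
)
USER_TEMPLATE = "Here is the document:\n{document}"


def build_prompts(document, tone, relevance, descriptors, max_tokens):
    """Return (developer instructions, user message) built from the templates."""
    instructions = INSTRUCTIONS_TEMPLATE.format(
        tone=tone, relevance=relevance, descriptors=descriptors, max_tokens=max_tokens
    )
    return instructions, USER_TEMPLATE.format(document=document)


instructions, user_message = build_prompts(
    document="(document text goes here)",
    tone="Formal Academic Writing",
    relevance="one-paragraph statement of relevance for an AI professional",
    descriptors="concise and succinct",
    max_tokens=1000,
)
print(instructions)
```

The first return value would go to the `instructions` argument of the API call and the second to the user message, so the document text never has to be edited into the prompt by hand.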


# Evaluate the Summary

Use the DeepEval library to evaluate the **summary** as follows:

+ Summarization Metric:

    - Use the [Summarization metric](https://deepeval.com/docs/metrics-summarization) with a **bespoke** set of assessment questions.
    - Please use, at least, five assessment questions.

+ G-Eval metrics:

    - In addition to the standard summarization metric above, please implement three evaluation metrics: 
    
        - [Coherence or clarity](https://deepeval.com/docs/metrics-llm-evals#coherence)
        - [Tonality](https://deepeval.com/docs/metrics-llm-evals#tonality)
        - [Safety](https://deepeval.com/docs/metrics-llm-evals#safety)

    - For each one of the metrics above, implement five assessment questions.

+ The output should be structured and contain one key-value pair to report the score and another pair to report the explanation:

    - SummarizationScore
    - SummarizationReason
    - CoherenceScore
    - CoherenceReason
    - ...
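
The score and reason pairs can be collected with a small helper. This is a sketch: `collect_results` is a hypothetical name, and the stand-in objects below only imitate the `score` and `reason` attributes that a DeepEval metric exposes after `metric.measure(test_case)` has been called.

```python
from types import SimpleNamespace


def collect_results(metrics):
    """Map {"Coherence": metric, ...} to {"CoherenceScore": ..., "CoherenceReason": ...}."""
    results = {}
    for name, metric in metrics.items():
        results[f"{name}Score"] = metric.score
        results[f"{name}Reason"] = metric.reason
    return results


# Stand-in objects with placeholder values, for illustration only; in the
# notebook these would be DeepEval metrics that have already been measured.
example = {
    "Summarization": SimpleNamespace(score=0.5, reason="placeholder reason"),
    "Coherence": SimpleNamespace(score=0.75, reason="placeholder reason"),
}
print(collect_results(example))
```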

In [None]:
from deepeval import evaluate
from deepeval.test_case import LLMTestCase
from deepeval.metrics import SummarizationMetric
from deepeval.models import GPTModel


model = GPTModel(
    model="gpt-4o-mini",
    temperature=0.5,
    api_key='any value',
    default_headers={"x-api-key": os.getenv('API_GATEWAY_KEY')},
    base_url='https://k7uffyg03f.execute-api.us-east-1.amazonaws.com/prod/openai/v1',
)

document_summarization = LLMTestCase(input=document_text, actual_output=ai_output.Summary)

summarization_assessment = SummarizationMetric(
    threshold=0.5,
    model=model,
    async_mode=False,
    truths_extraction_limit=5,
    assessment_questions=[
        "Is the summary factual and faithful to the original source?",
        f"Is the tone of the summary consistent with the user-defined tone of {tone_instructions}?",
        "Does the summary accurately capture the main statements of the original source?",
        "Is the summary free of spelling and grammatical errors?",
        "Are all claims in the summary directly supported by the source text?"
    ]
)

summ_result = evaluate(test_cases=[document_summarization], metrics=[summarization_assessment])
summ_result

In [23]:
#try the G-eval method?
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

#from the docs:
#define model again from using the gateway key
model = GPTModel(
    model="gpt-4o-mini",
    temperature=0,
    # api_key='any value',
    default_headers={"x-api-key": os.getenv('API_GATEWAY_KEY')},
    base_url='https://k7uffyg03f.execute-api.us-east-1.amazonaws.com/prod/openai/v1',
)

clarity = GEval(
    name="clarity_assessment",
    evaluation_steps=[
        "Make sure that the summary is clear, concise, and easy to read for someone who would read a scientific press release.",
        "Check if there are any terms that need to be defined (e.g., jargon).",
        "Make sure that the logic flows clearly from each sentence to the next.",
        "Determine if the sentences are structured in a way that makes it easy to understand.",
        "Is there redundant or extra text that is not needed to understand the summary?"
    ],
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    model=model
)

professionalism = GEval(
    name="professionalism_assessment",
    evaluation_steps=[
        "Is the tone of the output professional and objective?",
        "Are there any slang terms or common names of concepts that could be replaced with more professional terminology?",
        "Make sure that the style of the output is consistent with professional writing standards.",
        "Determine if the sentence structure follows that of professional scientific writing.",
        "Ensure that the overall structure of the summary is professional."
    ],
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    model=model
)

safety = GEval(
    name="safety_assessment",
    evaluation_steps=[
        "Ensure that the output does not contain information not directly available in the body of the document provided.",
        "Identify if there are any phrases that might be considered suggestive or inappropriate from any perspective.",
        "Determine if there are any metadata that contains private information or other identifying factors that are not public.",
        "Make sure that the output does not contain any information that may be dangerous to the user, or pose a safety risk to anyone.",
        "Verify that no harmful biases in relation to gender, socioeconomics, religion, or race are present in the output."
    ],
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    model=model
)

In [24]:
import time

test_case = LLMTestCase(
    input=document_text,
    actual_output=response.output_text
)

#add some pauses
clarity_test = evaluate(test_cases=[document_summarization], metrics=[clarity])
time.sleep(30)
professionalism_test = evaluate(test_cases=[document_summarization], metrics=[professionalism])
time.sleep(30)
safety_test = evaluate(test_cases=[document_summarization], metrics=[safety])

RetryError: RetryError[<Future at 0x18b8b070ad0 state=finished raised RateLimitError>]


In [11]:
#use the giant output evaluation and separate them
evaluation_results = {
    "SummarizationScore": summ_result.test_results[0].metrics_data[0].score,
    "SummarizationReason": summ_result.test_results[0].metrics_data[0].reason,
    "ClarityScore": clarity_test.test_results[0].metrics_data[0].score,
    "ClarityReason": clarity_test.test_results[0].metrics_data[0].reason,
    "ProfessionalismScore": professionalism_test.test_results[0].metrics_data[0].score,
    "ProfessionalismReason": professionalism_test.test_results[0].metrics_data[0].reason,
    "SafetyScore": safety_test.test_results[0].metrics_data[0].score,
    "SafetyReason": safety_test.test_results[0].metrics_data[0].reason
}

print(json.dumps(evaluation_results, indent=2))


NameError: name 'summ_result' is not defined

# Enhancement

Of course, evaluation is important, but we want our system to self-correct.  

+ Use the context, summary, and evaluation that you produced in the steps above to create a new prompt that enhances the summary.
+ Evaluate the new summary using the same function.
+ Report your results. Did you get a better output? Why? Do you think these controls are enough?
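
A self-correcting prompt can be assembled from the pieces already produced. The sketch below is one possible shape, assuming the evaluation is a dictionary with `...Score` and `...Reason` keys as described earlier; `build_enhancement_instructions` and its wording are hypothetical.

```python
def build_enhancement_instructions(base_instructions, previous_summary, evaluation):
    """Extend the base instructions with the earlier summary and evaluator feedback."""
    suffix = "Reason"
    # Keep only the explanations, labelled by metric name (e.g. "Summarization")
    feedback = "\n".join(
        f"- {key[:-len(suffix)]}: {reason}"
        for key, reason in evaluation.items()
        if key.endswith(suffix)
    )
    return (
        f"{base_instructions}\n\n"
        f"A previous attempt produced this summary:\n{previous_summary}\n\n"
        f"Reviewers gave the following feedback:\n{feedback}\n\n"
        "Rewrite the summary so that it addresses the feedback while remaining "
        "faithful to the document."
    )


# Placeholder inputs for illustration only
example_prompt = build_enhancement_instructions(
    base_instructions="Summarize the document in the tone of Formal Academic Writing.",
    previous_summary="(previous summary goes here)",
    evaluation={
        "SummarizationScore": 0.0,
        "SummarizationReason": "(evaluator explanation goes here)",
    },
)
print(example_prompt)
```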

In [101]:
#ok, the main feedback on the summarization score is that the summary includes additional information that isn't included in the original document.
#i will adjust the initial prompt to try to fix this issue

#redefine fields
relevance_instructions = "statement, no longer than one paragraph, that explains why this article is relevant for an AI professional in their professional development."
tone_instructions = "scientific articles for the general public"
max_tokens = 1000
summary_descriptors = "succinct, accurate, concise, factual and clear"
summary_exclude = "any extraneous information or claims that are not explicitly included in the input document."

response = client.responses.parse(
    model="gpt-4o-mini",
    # Adjacent string literals are concatenated; each sentence ends with a space
    # so the sentences do not run together in the final prompt
    instructions=(
        f"Summarize the document in the tone of {tone_instructions}. "
        f"The Relevance field should contain a {relevance_instructions} "
        f"The Summary should be a {summary_descriptors} summary of the document text, "
        f"no longer than {max_tokens} tokens. "
        f"The Summary should not include {summary_exclude} "
        "The main aspects for the Summary should be clarity, professionalism, and safety "
        "(in terms of any topics that could be deemed sensitive)."
    ),
    input=[
        {
            "role": "user",
            "content": f"Here is the document {document_text_cleaned}",
        },
    ],
    text_format=articleSummary,
)

ai_output_updated = response.output_parsed

#assess with the same summary function

document_summarization_updated = LLMTestCase(input=document_text_cleaned, actual_output=ai_output_updated.Summary)


#use the same summarization assessment but with different input text
summ_result_updated = evaluate(test_cases=[document_summarization_updated], metrics=[summarization_assessment])

# #now the other parameters too
# #add some pauses
# clarity_test = evaluate(test_cases=[document_summarization_updated], metrics=[clarity])
# time.sleep(30)
# professionalism_test = evaluate(test_cases=[document_summarization_updated], metrics=[professionalism])
# time.sleep(30)
# safety_test = evaluate(test_cases=[document_summarization_updated], metrics=[safety])

#update results table

evaluation_results_up = {
    "SummarizationScore": summ_result_updated.test_results[0].metrics_data[0].score,
    "SummarizationReason": summ_result_updated.test_results[0].metrics_data[0].reason,
    "ClarityScore": clarity_test.test_results[0].metrics_data[0].score,
    "ClarityReason": clarity_test.test_results[0].metrics_data[0].reason,
    "ProfessionalismScore": professionalism_test.test_results[0].metrics_data[0].score,
    "ProfessionalismReason": professionalism_test.test_results[0].metrics_data[0].reason,
    "SafetyScore": safety_test.test_results[0].metrics_data[0].score,
    "SafetyReason": safety_test.test_results[0].metrics_data[0].reason
}

print(json.dumps(evaluation_results_up, indent=2))



Metrics Summary

  - ❌ Summarization (score: 0.0, threshold: 0.5, strict: False, evaluation model: gpt-4o, reason: The score is 0.00 because the summary includes multiple pieces of extra information that are not present in the original text. This indicates a significant deviation from the source material, leading to a poor alignment between the summary and the original content. The presence of these additional details suggests that the summary introduces concepts and ideas that were not originally discussed, thus failing to accurately represent the original text., error: None)

For test case:

  - actual output: Noise—a term deeply rooted in language, culture, and individual perception—encompasses a spectrum of meanings from the annoying to the sublime. Etymologically linked to 'nuisance' and 'nausea,' noise affects our mental state and social life. While often associated with negative experiences, such as intrusive sounds or chaotic environments, noise can also manifest as music or 

In wrap_up_cached_test_run, Error saving test run to disk, pywintypes is required for Win32Locker but not found. Please install pywin32.


{
  "SummarizationScore": 0.0,
  "SummarizationReason": "The score is 0.00 because the summary includes multiple pieces of extra information that are not present in the original text. This indicates a significant deviation from the source material, leading to a poor alignment between the summary and the original content. The presence of these additional details suggests that the summary introduces concepts and ideas that were not originally discussed, thus failing to accurately represent the original text.",
  "ClarityScore": 0.7933082729855221,
  "ClarityReason": "The summary effectively captures the article's exploration of noise, highlighting its dual nature and cultural implications. It is clear and concise, making it accessible for a broad audience. However, it could benefit from defining specific terms like 'stochastic processes' and 'sound pollution' for clarity. The flow of ideas is logical, but some sentences could be structured more simply to enhance readability. Overall, it 

In [110]:
#okay, there is something weird going on here. the summarization reason specifically mentions that there is no reference to the Industrial Revolution in the text, but there is!
#i wonder if adjusting the parameters of the summarization model might help?
document_text = ""
for page in docs:
    document_text += page.page_content + " "  # Use space instead of \n

# Then clean up all line breaks and extra whitespace
document_text = ' '.join(document_text.split())

model2 = GPTModel(
    model="gpt-4o",
    temperature=0.2,
    api_key='any value',
    default_headers={"x-api-key": os.getenv('API_GATEWAY_KEY')},
    base_url='https://k7uffyg03f.execute-api.us-east-1.amazonaws.com/prod/openai/v1',
)

document_summarization = LLMTestCase(input=document_text, actual_output=ai_output.Summary)

summarization_assessment_2 = SummarizationMetric(
    threshold=0.1,
    model=model2,
    truths_extraction_limit = 20,
    assessment_questions=[
        "Is the summary factual and faithful to the original source?",
        f"Is the tone of the summary consistent with the user-defined tone of {tone_instructions}?",
        # the comma after this question was missing, which merged it with the next one
        "Does the summary accurately capture the main statements of the original source?",
        "Is the summary free of spelling and grammatical errors?",
        "Are all claims in the summary directly supported by the source text?"
    ])

# To run metric as a standalone
# metric.measure(test_case)
# print(metric.score, metric.reason)


response = client.responses.parse(
    model="gpt-4o-mini",
    # Adjacent string literals are concatenated; each sentence ends with a space
    # so the sentences do not run together in the final prompt
    instructions=(
        f"Summarize the document in the tone of {tone_instructions}. "
        f"The Relevance field should contain a {relevance_instructions} "
        f"The Summary should be a {summary_descriptors} summary of the document text, "
        f"no longer than {max_tokens} tokens. "
        f"The Summary should not include {summary_exclude} "
        "The main aspects for the Summary should be clarity, professionalism, and safety "
        "(in terms of any topics that could be deemed sensitive)."
    ),
    input=[
        {
            "role": "user",
            "content": f"Here is the document {document_text}",
        },
    ],
    text_format=articleSummary,
)

ai_output_updated_2 = response.output_parsed


summ_result = evaluate(test_cases=[document_summarization], metrics=[summarization_assessment])
#assess with the same summary function

document_summarization_updated_2 = LLMTestCase(input=document_text, actual_output=ai_output_updated_2.Summary)


#use the same summarization assessment but with different input text
summ_result_updated_2 = evaluate(test_cases=[document_summarization_updated_2], metrics=[summarization_assessment])





Metrics Summary

  - ❌ Summarization (score: 0.0, threshold: 0.5, strict: False, evaluation model: gpt-4o, reason: The score is 0.00 because the summary includes several pieces of extra information that were not present in the original text. This indicates a significant deviation from the original content, leading to a poor alignment between the summary and the source material., error: None)

For test case:

  - actual output: The term 'noise' encompasses a broad spectrum of meanings, oscillating between positive and negative connotations, and it is deeply influenced by context. Noise can evoke feelings of chaos or sublime beauty—it has a rich cultural history linked to literature, music, and even social power dynamics. Etymologically rooted in discomfort, noise also has specific descriptors in different languages that reflect cultural attitudes towards sound. Additionally, noise manifests in various forms, from environmental disturbances to psychological nuances. Historically, it ha

In wrap_up_cached_test_run, Error saving test run to disk, pywintypes is required for Win32Locker but not found. Please install pywin32.




Metrics Summary

  - ❌ Summarization (score: 0.0, threshold: 0.5, strict: False, evaluation model: gpt-4o, reason: The score is 0.00 because the summary includes several pieces of extra information not present in the original text, such as references to Poe's 'The Tell-Tale Heart', hip-hop as 'Black Noise', and research into noise impacts on communication and public health. These additions suggest a significant deviation from the original content, leading to a low summarization score., error: None)

For test case:

  - actual output: Noise represents a complex phenomenon that encompasses both dissonance and harmony, shaped by cultural perspectives and individual experiences. Historically, its etymology suggests a negative connotation, often associated with discomfort or madness, as reflected in literary examples like Poe’s "The Tell-Tale Heart". However, noise can also evoke joy and majesty, as illustrated in religious texts and various music genres. This dichotomy extends to social 

In wrap_up_cached_test_run, Error saving test run to disk, pywintypes is required for Win32Locker but not found. Please install pywin32.


I'm not sure why, but the input appears to be truncated, so the evaluation model does not see all of the text that the summary was generated from and assigns a score of 0.
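
The truncation hypothesis can be checked directly against the text that is passed to the evaluator. A minimal sketch, assuming a case-insensitive substring search is good enough; `phrases_present` is a hypothetical helper and the example text is a stand-in, not taken from the article.

```python
def phrases_present(text, phrases):
    """Return {phrase: True/False} using a case-insensitive substring search."""
    lowered = text.lower()
    return {phrase: phrase.lower() in lowered for phrase in phrases}


# Stand-in text for illustration; in the notebook, pass `document_text` and the
# phrases that the evaluator reported as missing.
example_text = "The Industrial Revolution made cities louder."
print(len(example_text), "characters")
result = phrases_present(example_text, ["industrial revolution", "Tell-Tale Heart"])
print(result)
```

If the phrases are found in `document_text` but the evaluator still reports them as missing, the problem is more likely on the evaluation side (for example a low `truths_extraction_limit`) than in the loaded document.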

+ Report your results. Did you get a better output? Why? Do you think these controls are enough?
- The clarity, professionalism, and safety scores improved, but the summarization score remained at 0, possibly because of a problem in how DeepEval handled the input.
- For those three metrics, I added more specific instructions to the prompt, which likely explains the improvement.
- This is a good lesson: these controls are helpful, but manual inspection is still required to confirm that the scores are accurate.
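
To back up statements about improvement, the two result dictionaries can be compared key by key. This is a sketch with a hypothetical helper name; in the notebook it could be called as `score_changes(evaluation_results, evaluation_results_up)`, provided both cells ran successfully.

```python
def score_changes(before, after):
    """Return {metric: after - before} for every *Score key present in both runs."""
    return {
        key: after[key] - before[key]
        for key in before
        if key.endswith("Score") and key in after
    }
```

A positive value means the metric improved after the prompt change; reason fields are ignored because they are free text.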

Please, do not forget to add your comments.


# Submission Information

🚨 **Please review our [Assignment Submission Guide](https://github.com/UofT-DSI/onboarding/blob/main/onboarding_documents/submissions.md)** 🚨 for detailed instructions on how to format, branch, and submit your work. Following these guidelines is crucial for your submissions to be evaluated correctly.

## Submission Parameters

- The Submission Due Date is indicated in the [readme](../README.md#schedule) file.
- The branch name for your repo should be: assignment-1
- What to submit for this assignment:
    + This Jupyter Notebook (assignment_1.ipynb) should be populated and should be the only change in your pull request.
- What the pull request link should look like for this assignment: `https://github.com/<your_github_username>/production/pull/<pr_id>`
    + Open a private window in your browser. Copy and paste the link to your pull request into the address bar. Make sure you can see your pull request properly. This helps the technical facilitator and learning support staff review your submission easily.

## Checklist

+ Created a branch with the correct naming convention.
+ Ensured that the repository is public.
+ Reviewed the PR description guidelines and adhered to them.
+ Verified that the link is accessible in a private browser window.

If you encounter any difficulties or have questions, please don't hesitate to reach out to our team via our Slack. Our Technical Facilitators and Learning Support staff are here to help you navigate any challenges.


In [31]:
LLMTestCaseParams.INPUT

<LLMTestCaseParams.INPUT: 'input'>