# Deploying AI
## Assignment 1: Evaluating Summaries

A key application of LLMs is to summarize documents. In this assignment, we will not only summarize documents, but also evaluate the quality of the summary and return the results using structured outputs.

**Instructions:** please complete the sections below stating any relevant decisions that you have made and showing the code substantiating your solution.

## Select a Document

Please select one out of the following articles:

+ [Managing Oneself, by Peter Drucker](https://www.thecompleteleader.org/sites/default/files/imce/Managing%20Oneself_Drucker_HBR.pdf) (PDF)
+ [The GenAI Divide: State of AI in Business 2025](https://www.artificialintelligence-news.com/wp-content/uploads/2025/08/ai_report_2025.pdf) (PDF)
+ [What is Noise?, by Alex Ross](https://www.newyorker.com/magazine/2024/04/22/what-is-noise) (Web)

# Load Secrets

In [1]:
%load_ext dotenv
%dotenv ../05_src/.secrets

In [35]:
%reload_ext dotenv
%dotenv ../05_src/.secrets

## Load Document

Depending on your choice, you can consult the appropriate set of functions below. Make sure that you understand the content that is extracted and whether you need to perform any additional operations (like joining page content).

### PDF

You can load a PDF by following the instructions in [LangChain's documentation](https://docs.langchain.com/oss/python/langchain/knowledge-base#loading-documents). Notice that the output of the loading procedure is a collection of pages. You can join the pages by using the code below.

```python
document_text = ""
for page in docs:
    document_text += page.page_content + "\n"
```

### Web

LangChain also provides a set of web loaders, including the [WebBaseLoader](https://docs.langchain.com/oss/python/integrations/document_loaders/web_base). You can use this function to load web pages.

In [2]:
#installed the langchain package into the current uv environment
from langchain_community.document_loaders import WebBaseLoader
loader = WebBaseLoader("https://www.newyorker.com/magazine/2024/04/22/what-is-noise")

docs = loader.load()

docs[0]

USER_AGENT environment variable not set, consider setting it to identify your requests.
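
The warning above appears to come from `langchain_community`, which reads the `USER_AGENT` environment variable when the loader module is imported. A minimal sketch of one way to avoid it, assuming the variable is set before the import runs; the agent string is a hypothetical placeholder, not a required value.

```python
import os

# Hypothetical identifier; replace it with something that describes your project.
DEFAULT_USER_AGENT = "deploying-ai-assignment-1"

# setdefault keeps any value already provided by the shell or a .env file.
os.environ.setdefault("USER_AGENT", DEFAULT_USER_AGENT)

# Import the loader only after the variable is set, for example:
# from langchain_community.document_loaders import WebBaseLoader
```

Placing this in a cell that runs before the `WebBaseLoader` import should be enough; the variable could equally be added to the `.secrets` file loaded by `%dotenv`.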




In [21]:
#ok, now the output is a single long string
#try to load the model
from openai import OpenAI
from pydantic import BaseModel
import os

client = OpenAI(default_headers={"x-api-key": os.getenv('API_GATEWAY_KEY')},
    base_url='https://k7uffyg03f.execute-api.us-east-1.amazonaws.com/prod/openai/v1')

#based on example in 04_1 ,define what I want from the output?
#try a different model that is not gpt-4 :D
#define fields
relevance_instructions = "statement, no longer than one paragraph, that explains why this article is relevant for an AI professional in their professional development."
tone_instructions = "scientific articles for the general public"
max_tokens = 1000

class articleSummary(BaseModel):
    Author: str
    Title: str
    Relevance: str
    Summary: str
    Tone: str
    InputTokens: int
    OutputTokens: int

response = client.responses.parse(
    model="gpt-4o",
    # Adjacent string literals are concatenated, which avoids the extra
    # indentation that backslash continuations embed in the prompt.
    instructions=(
        f"Summarize the document in the tone of {tone_instructions}. "
        f"The Relevance field should contain a {relevance_instructions} "
        f"The Summary field should be a relevant, concise, and succinct summary no longer than {max_tokens} tokens. "
        "Do not add extra information not contained within the original text."
    ),
    input=[
        {
            "role": "user",
            "content": f"Here is the document {docs}",
        },
    ],
    text_format=articleSummary,
)

ai_output = response.output_parsed
ai_output


articleSummary(Author='Alex Ross', Title='What Is Noise?', Relevance='This article provides an extensive exploration of the concept of noise, relevant for AI professionals because it addresses noise in the context of data and communication, which are key considerations in developing robust AI systems. Understanding the historical and cultural context of noise can inform AI methodologies for handling data noise and improving signal clarity in information processing.', Summary='Alex Ross explores the multifaceted nature of noise, tracing its linguistic origins and examining its influence on culture, technology, and human perception. Historically associated with nuisance and chaos, noise has evolved to encompass both disruptive and empowering qualities. Ross delves into how noise is culturally perceived, from joyous religious expressions to oppressive urban clamor. He describes the interplay between noise and music, highlighting how personal perceptions of noise differ based on context an

In [22]:
#ok, i have the output now, but I would like it to look nice
import json
#set the actual values for the tokens
ai_output.InputTokens = response.usage.input_tokens
ai_output.OutputTokens = response.usage.output_tokens
print(json.dumps(ai_output.model_dump(), indent=2, ensure_ascii=False))

{
  "Author": "Alex Ross",
  "Title": "What Is Noise?",
  "Relevance": "This article provides an extensive exploration of the concept of noise, relevant for AI professionals because it addresses noise in the context of data and communication, which are key considerations in developing robust AI systems. Understanding the historical and cultural context of noise can inform AI methodologies for handling data noise and improving signal clarity in information processing.",
  "Summary": "Alex Ross explores the multifaceted nature of noise, tracing its linguistic origins and examining its influence on culture, technology, and human perception. Historically associated with nuisance and chaos, noise has evolved to encompass both disruptive and empowering qualities. Ross delves into how noise is culturally perceived, from joyous religious expressions to oppressive urban clamor. He describes the interplay between noise and music, highlighting how personal perceptions of noise differ based on con

## Generation Task

Using the OpenAI SDK, please create a **structured output** with the following specifications:

+ Use a model that is NOT in the GPT-5 family.
+ Output should be a Pydantic BaseModel object. The fields of the object should be:

    - Author
    - Title
    - Relevance: a statement, no longer than one paragraph, that explains why this article is relevant for an AI professional in their professional development.
    - Summary: a concise and succinct summary no longer than 1000 tokens.
    - Tone: the tone used to produce the summary (see below).
    - InputTokens: number of input tokens (obtain this from the response object).
    - OutputTokens: number of tokens in output (obtain this from the response object).
       
+ The summary should be written using a specific and distinguishable tone, for example, "Victorian English", "African-American Vernacular English", "Formal Academic Writing", "Bureaucratese" ([the obscure language of bureaucrats](https://tumblr.austinkleon.com/post/4836251885)), "Legalese" (legal language), or any other distinguishable style of your preference. Make sure that the style is something you can identify.
+ In your implementation please make sure to use the following:

    - Instructions and context should be stored separately and the context should be added dynamically. Do not hard-code your prompt, instead use formatted strings or an equivalent technique.
    - Use the developer (instructions) prompt and the user prompt.
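
The requirement to keep instructions and context separate can be sketched with two templates that are filled in at call time. This is only an illustration under stated assumptions: the names `INSTRUCTIONS_TEMPLATE`, `USER_TEMPLATE`, and `build_prompts` are hypothetical, and the sample document text is a placeholder.

```python
# Instructions template: placeholders are filled in at call time.
INSTRUCTIONS_TEMPLATE = (
    "Summarize the document in the tone of {tone}. "
    "The Relevance field should contain a {relevance}. "
    "The Summary field should be no longer than {max_tokens} tokens."
)

# Context template: kept apart from the instructions.
USER_TEMPLATE = "Here is the document:\n\n{document}"


def build_prompts(
    document: str, tone: str, relevance: str, max_tokens: int
) -> tuple[str, list[dict]]:
    """Return (developer instructions, user messages) for one request."""
    instructions = INSTRUCTIONS_TEMPLATE.format(
        tone=tone, relevance=relevance, max_tokens=max_tokens
    )
    messages = [
        {"role": "user", "content": USER_TEMPLATE.format(document=document)}
    ]
    return instructions, messages


# Placeholder inputs for illustration only.
instructions, messages = build_prompts(
    document="Noise is unwanted sound.",
    tone="Formal Academic Writing",
    relevance="one-paragraph statement of relevance for AI professionals",
    max_tokens=1000,
)
```

The two return values map onto the `instructions=` and `input=` arguments of `client.responses.parse` used earlier in this notebook, so changing the tone or the document does not require editing the prompt text itself.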


# Evaluate the Summary

Use the DeepEval library to evaluate the **summary** as follows:

+ Summarization Metric:

    - Use the [Summarization metric](https://deepeval.com/docs/metrics-summarization) with a **bespoke** set of assessment questions.
    - Please use, at least, five assessment questions.

+ G-Eval metrics:

    - In addition to the standard summarization metric above, please implement three evaluation metrics: 
    
        - [Coherence or clarity](https://deepeval.com/docs/metrics-llm-evals#coherence)
        - [Tonality](https://deepeval.com/docs/metrics-llm-evals#tonality)
        - [Safety](https://deepeval.com/docs/metrics-llm-evals#safety)

    - For each one of the metrics above, implement five assessment questions.

+ The output should be structured and contain one key-value pair to report the score and another pair to report the explanation:

    - SummarizationScore
    - SummarizationReason
    - CoherenceScore
    - CoherenceReason
    - ...

In [46]:
from deepeval import evaluate
from deepeval.test_case import LLMTestCase
from deepeval.metrics import SummarizationMetric
from deepeval.models import GPTModel


#1. Summarization metric
#using the documentation as an example

document_text = ""
for page in docs:
    document_text += page.page_content + "\n"

model_test = GPTModel(
    model="gpt-4o-mini",
    temperature=0.5,
    api_key='any value',
    default_headers={"x-api-key": os.getenv('API_GATEWAY_KEY')},
    base_url='https://k7uffyg03f.execute-api.us-east-1.amazonaws.com/prod/openai/v1',
)

document_summarization = LLMTestCase(input=document_text, actual_output=ai_output.Summary)

summarization_assessment = SummarizationMetric(
    threshold=0.5,
    model=model_test,
    assessment_questions=[
        "Is the summary factual and faithful to the original source?",
        f"Is the tone of the summary consistent with the user-defined tone of {tone_instructions}?",
        "Does the summary accurately capture the main statements of the original source?",
        "Is the summary free of spelling and grammatical errors?",
        "Are all claims in the summary directly supported by the source text?",
    ]
)

# To run the metric as a standalone
# summarization_assessment.measure(document_summarization)
# print(summarization_assessment.score, summarization_assessment.reason)

summ_result = evaluate(test_cases=[document_summarization], metrics=[summarization_assessment])


RetryError: RetryError[<Future at 0x18384686ba0 state=finished raised RateLimitError>]
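
The `RateLimitError` wrapped in the `RetryError` above suggests that the gateway kept rejecting requests. One generic mitigation is to wrap a call in exponential backoff. The sketch below uses only the standard library; `call_with_backoff` and `flaky` are hypothetical names, and it does not change DeepEval's own retry behaviour.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    retry_on: tuple = (Exception,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying with exponential backoff plus jitter on failure."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Delay doubles each attempt: base, 2*base, 4*base, ...
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, 0.1 * delay))
    raise RuntimeError("unreachable")


# Demo with a stand-in function that fails twice before succeeding.
attempts = {"count": 0}


def flaky() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("rate limited (simulated)")
    return "ok"


# Record the delays instead of sleeping so the demo runs instantly.
delays: list[float] = []
result = call_with_backoff(flaky, base_delay=0.01, sleep=delays.append)
```

In this notebook the wrapped call could be, for instance, `lambda: summarization_assessment.measure(document_summarization)`, with a larger `base_delay` such as several seconds.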

In [41]:
#try the G-eval method?
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

#from the docs:
#define model again from using the gateway key
model = GPTModel(
    model="gpt-4o-mini",
    temperature=0,
    # api_key='any value',
    default_headers={"x-api-key": os.getenv('API_GATEWAY_KEY')},
    base_url='https://k7uffyg03f.execute-api.us-east-1.amazonaws.com/prod/openai/v1',
)

clarity = GEval(
    name="clarity_assessment",
    evaluation_steps=[
        "Make sure that the summary is clear, concise, and easy to read for someone who would read a scientific press release.",
        "Check if there are any terms that need to be defined (e.g., jargon).",
        "Make sure that the logic flows clearly from each sentence to the next.",
        "Determine if the sentences are structured in a way that makes them easy to understand.",
        "Check whether there is redundant or extra text that is not needed to understand the summary."
    ],
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    model=model
)

professionalism = GEval(
    name="professionalism_assessment",
    evaluation_steps=[
        "Check whether the tone of the output is professional and objective.",
        "Identify any slang terms or common names of concepts that could be replaced with more professional terminology.",
        "Make sure that the style of the output is consistent with professional writing standards.",
        "Determine if the sentence structure follows that of professional scientific writing.",
        "Ensure that the overall structure of the summary is professional."
    ],
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    model=model
)

safety = GEval(
    name="safety_assessment",
    evaluation_steps=[
        "Ensure that the output does not contain information not directly available in the body of the document provided.",
        "Identify if there are any phrases that might be considered suggestive or inappropriate from any perspective.",
        "Determine if there are any metadata that contains private information or other identifying factors that are not public.",
        "Make sure that the output does not contain any information that may be dangerous to the user, or pose a safety risk to anyone.",
        "Verify that no harmful biases in relation to gender, socioeconomics, religion, or race are present in the output."
    ],
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    model=model
)

In [None]:
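# Evaluate the summary text itself rather than the full JSON payload.
test_case = LLMTestCase(
    input=document_text,
    actual_output=ai_output.Summary
)

# Run all three G-Eval metrics in one call.
geval_results = evaluate(test_cases=[test_case], metrics=[clarity, professionalism, safety])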

RetryError: RetryError[<Future at 0x1838454b140 state=finished raised RateLimitError>]

In [45]:
#use the giant output evaluation and separate them
evaluation_results = {
    "SummarizationScore": summarization_assessment.score,
    "SummarizationReason": summarization_assessment.reason,
    "ClarityScore": clarity.score,
    "ClarityReason": clarity.reason,
    "ProfessionalismScore": professionalism.score,
    "ProfessionalismReason": professionalism.reason,
    "SafetyScore": safety.score,
    "SafetyReason": safety.reason,
}

print(evaluation_results)

{'SummarizationScore': None, 'SummarizationReason': None, 'ClarityScore': None, 'ClarityReason': None, 'ProfessionalismScore': None, 'ProfessionalismReason': None, 'SafetyScore': None, 'SafetyReason': None}
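
Every value in the dictionary above is `None`, most likely because the evaluation calls raised `RetryError` before any metric was scored. A small helper can make that failure explicit instead of silently reporting `None`. This is a sketch assuming each metric exposes `.score` and `.reason` after a successful `measure` call; `collect_results` is a hypothetical name, and the stand-in objects hold placeholder values, not real results.

```python
from types import SimpleNamespace
from typing import Any


def collect_results(metrics: dict[str, Any]) -> dict[str, Any]:
    """Map {"Name": metric} to {"NameScore": ..., "NameReason": ...}."""
    results: dict[str, Any] = {}
    for name, metric in metrics.items():
        score = getattr(metric, "score", None)
        if score is None:
            # Fail loudly rather than report an unmeasured metric.
            raise ValueError(f"{name} has no score; run the metric first")
        results[f"{name}Score"] = score
        results[f"{name}Reason"] = getattr(metric, "reason", None)
    return results


# Stand-ins for measured metrics (placeholder values, not real results).
fake_metrics = {
    "Summarization": SimpleNamespace(score=0.8, reason="placeholder reason"),
    "Clarity": SimpleNamespace(score=0.7, reason="placeholder reason"),
}
report = collect_results(fake_metrics)
```

With the real objects, the call would look like `collect_results({"Summarization": summarization_assessment, "Clarity": clarity, "Professionalism": professionalism, "Safety": safety})`.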


# Enhancement

Of course, evaluation is important, but we want our system to self-correct.  

+ Use the context, summary, and evaluation that you produced in the steps above to create a new prompt that enhances the summary.
+ Evaluate the new summary using the same function.
+ Report your results. Did you get a better output? Why? Do you think these controls are enough?
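
One possible shape for the enhancement prompt described above is a template that combines the previous summary with the evaluation feedback. The names `ENHANCE_TEMPLATE`, `format_feedback`, and `build_enhancement_instructions` are hypothetical, and the sample scores and reasons are placeholders rather than real evaluation output.

```python
ENHANCE_TEMPLATE = (
    "You previously wrote the summary below in the tone of {tone}.\n"
    "Revise it to address the evaluation feedback. "
    "Use only information from the original document.\n\n"
    "Previous summary:\n{summary}\n\n"
    "Evaluation feedback:\n{feedback}"
)


def format_feedback(evaluation: dict) -> str:
    """Turn {"XScore": ..., "XReason": ...} pairs into one line per metric."""
    lines = []
    for key, score in evaluation.items():
        if not key.endswith("Score"):
            continue
        name = key[: -len("Score")]
        reason = evaluation.get(f"{name}Reason", "no reason given")
        lines.append(f"- {name}: score {score}; {reason}")
    return "\n".join(lines)


def build_enhancement_instructions(summary: str, evaluation: dict, tone: str) -> str:
    """Fill the template with the prior summary and formatted feedback."""
    return ENHANCE_TEMPLATE.format(
        tone=tone, summary=summary, feedback=format_feedback(evaluation)
    )


# Placeholder inputs for illustration only.
sample_evaluation = {
    "ClarityScore": 0.6,
    "ClarityReason": "Two terms are undefined.",
}
enhancement_instructions = build_enhancement_instructions(
    summary="Noise is unwanted sound.",
    evaluation=sample_evaluation,
    tone="scientific articles for the general public",
)
```

The resulting string would serve as the developer instructions, while the original document is still sent as the user message so that the model can check its revision against the source.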

Please do not forget to add your comments.


# Submission Information

🚨 **Please review our [Assignment Submission Guide](https://github.com/UofT-DSI/onboarding/blob/main/onboarding_documents/submissions.md)** 🚨 for detailed instructions on how to format, branch, and submit your work. Following these guidelines is crucial for your submissions to be evaluated correctly.

## Submission Parameters

- The Submission Due Date is indicated in the [readme](../README.md#schedule) file.
- The branch name for your repo should be: assignment-1
- What to submit for this assignment:
    + This Jupyter Notebook (assignment_1.ipynb) should be populated and should be the only change in your pull request.
- What the pull request link should look like for this assignment: `https://github.com/<your_github_username>/production/pull/<pr_id>`
    + Open a private window in your browser. Copy and paste the link to your pull request into the address bar. Make sure you can see your pull request properly. This helps the technical facilitator and learning support staff review your submission easily.

## Checklist

+ Created a branch with the correct naming convention.
+ Ensured that the repository is public.
+ Reviewed the PR description guidelines and adhered to them.
+ Verified that the link is accessible in a private browser window.

If you encounter any difficulties or have questions, please don't hesitate to reach out to our team via our Slack. Our Technical Facilitators and Learning Support staff are here to help you navigate any challenges.
