# Deploying AI
## Assignment 1: Evaluating Summaries

A key application of LLMs is to summarize documents. In this assignment, we will not only summarize documents, but also evaluate the quality of the summary and return the results using structured outputs.

**Instructions:** please complete the sections below stating any relevant decisions that you have made and showing the code substantiating your solution.

## Select a Document

Please select one out of the following articles:

+ [Managing Oneself, by Peter Drucker](https://www.thecompleteleader.org/sites/default/files/imce/Managing%20Oneself_Drucker_HBR.pdf) (PDF)
+ [The GenAI Divide: State of AI in Business 2025](https://www.artificialintelligence-news.com/wp-content/uploads/2025/08/ai_report_2025.pdf) (PDF)
+ [What is Noise?, by Alex Ross](https://www.newyorker.com/magazine/2024/04/22/what-is-noise) (Web)

# Load Secrets

In [2]:
%load_ext dotenv
%dotenv ../05_src/.secrets

The dotenv extension is already loaded. To reload it, use:
  %reload_ext dotenv


## Load Document

Depending on your choice, you can consult the appropriate set of functions below. Make sure that you understand the content that is extracted and whether you need to perform any additional operations (like joining page content).

### PDF

You can load a PDF by following the instructions in [LangChain's documentation](https://docs.langchain.com/oss/python/langchain/knowledge-base#loading-documents). Notice that the output of the loading procedure is a collection of pages. You can join the pages by using the code below.

```python
document_text = ""
for page in docs:
    document_text += page.page_content + "\n"
```
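
The same join can be wrapped in a small helper. This is a minimal sketch that assumes only that each page object exposes a `page_content` string, as the loop above does; the `join_pages` name and the stand-in pages are illustrative and not part of the assignment.

```python
from types import SimpleNamespace


def join_pages(pages, separator="\n"):
    """Concatenate the text of every page, adding the separator after each one."""
    return "".join(page.page_content + separator for page in pages)


# Stand-in pages used only for illustration; real ones come from loader.load().
fake_docs = [
    SimpleNamespace(page_content="Page one."),
    SimpleNamespace(page_content="Page two."),
]
document_text = join_pages(fake_docs)
print(repr(document_text))
```

Keeping the separator as a parameter makes it easy to switch to a blank line between pages if the extracted text runs together.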

### Web

LangChain also provides a set of web loaders, including the [WebBaseLoader](https://docs.langchain.com/oss/python/integrations/document_loaders/web_base). You can use this function to load web pages.

In [None]:
from langchain_community.document_loaders import PyPDFLoader

file_path = "https://www.artificialintelligence-news.com/wp-content/uploads/2025/08/ai_report_2025.pdf"
loader = PyPDFLoader(file_path)

docs = loader.load()

print(len(docs))

26


In [4]:
document_text = ""
for page in docs:
    document_text += page.page_content + "\n"

## Generation Task

Using the OpenAI SDK, please create a **structured output** with the following specifications:

+ Use a model that is NOT in the GPT-5 family.
+ Output should be a Pydantic BaseModel object. The fields of the object should be:

    - Author
    - Title
    - Relevance: a statement, no longer than one paragraph, that explains why this article is relevant for an AI professional in their professional development.
    - Summary: a concise and succinct summary no longer than 1000 tokens.
    - Tone: the tone used to produce the summary (see below).
    - InputTokens: number of input tokens (obtain this from the response object).
    - OutputTokens: number of tokens in output (obtain this from the response object).
       
+ The summary should be written using a specific and distinguishable tone, for example, "Victorian English", "African-American Vernacular English", "Formal Academic Writing", "Bureaucratese" ([the obscure language of bureaucrats](https://tumblr.austinkleon.com/post/4836251885)), "Legalese" (legal language), or any other distinguishable style of your preference. Make sure that the style is something you can identify.
+ In your implementation please make sure to use the following:

    - Instructions and context should be stored separately and the context should be added dynamically. Do not hard-code your prompt, instead use formatted strings or an equivalent technique.
    - Use the developer (instructions) prompt and the user prompt.
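
One way to keep instructions and context separate is to store both prompts as templates and fill them in at call time. The sketch below is only an illustration of that idea: the template wording, the `<document>` tags, and the `build_prompts` helper are assumptions, not a required format.

```python
# Static instructions live in constants; the document text is injected at call time.
DEVELOPER_PROMPT_TEMPLATE = "You are a scholar who writes exclusively in {tone}."

USER_PROMPT_TEMPLATE = """Summarize the document enclosed in <document> tags.

<document>
{document_text}
</document>
"""


def build_prompts(document_text, tone):
    """Return (developer_prompt, user_prompt) with the context filled in."""
    developer_prompt = DEVELOPER_PROMPT_TEMPLATE.format(tone=tone)
    user_prompt = USER_PROMPT_TEMPLATE.format(document_text=document_text)
    return developer_prompt, user_prompt


developer_prompt, user_prompt = build_prompts("Example document text.", "Victorian English")
print(developer_prompt)
print(user_prompt)
```

Because `str.format` only interprets braces in the template, curly braces that happen to appear in the document text are passed through unchanged.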


In [47]:
developer_prompt = "You are a Victorian English scholar. Make sure your entire response is written in Victorian English."


prompt = f"""
    Given the following context from a PDF, do the following:
    
    1. Identify the Author and Title of the document.
    2. Explain why this document is relevant for an AI professional in their professional development.
    3. Summarize the document concisely, in no more than 1000 tokens.

    The PDF is the following: 
    <pdf> 
    {document_text}
    </pdf>

    Provide your Victorian response as a Pydantic BaseModel object. The fields of the object should be:
    - Author
    - Title
    - Relevance: a statement, no longer than one paragraph, that explains why this article is relevant for an AI professional in their professional development.
    - Summary: a concise and succinct summary no longer than 1000 tokens.
    - Tone: the tone used to produce the summary (see below).
    - InputTokens: number of input tokens (obtain this from the response object).
    - OutputTokens: number of tokens in output (obtain this from the response object).
"""



In [48]:
from openai import OpenAI
import os
client = OpenAI(base_url='https://k7uffyg03f.execute-api.us-east-1.amazonaws.com/prod/openai/v1', 
                api_key='any value',
                default_headers={"x-api-key": os.getenv('API_GATEWAY_KEY')})

# After building the user and developer prompts, send them to the API and get a response back
response = client.responses.create(
    model="gpt-4o",
    instructions = developer_prompt,
    input = prompt,
)
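
The call above returns free text, so the model can only guess at the token counts. A possible alternative is to let the SDK parse the response into a Pydantic model and read the token counts from the response's usage data. The sketch below assumes a recent `openai` SDK that provides `client.responses.parse` with a `text_format` argument and `usage.input_tokens` / `usage.output_tokens` on the response, and Pydantic v2 for `model_dump`; the class names and the `summarize` helper are illustrative.

```python
from pydantic import BaseModel


class SummaryContent(BaseModel):
    """Fields the model is asked to generate."""
    Author: str
    Title: str
    Relevance: str
    Summary: str
    Tone: str


class SummaryReport(SummaryContent):
    """Generated fields plus token counts taken from the API response."""
    InputTokens: int
    OutputTokens: int


def summarize(client, model, developer_prompt, user_prompt):
    """Request a parsed SummaryContent and attach the usage counts."""
    response = client.responses.parse(
        model=model,
        instructions=developer_prompt,
        input=user_prompt,
        text_format=SummaryContent,  # assumption: supported by the installed SDK
    )
    content = response.output_parsed
    return SummaryReport(
        **content.model_dump(),
        InputTokens=response.usage.input_tokens,
        OutputTokens=response.usage.output_tokens,
    )
```

With the `client`, `developer_prompt`, and `prompt` defined in the cells above, a call such as `summarize(client, "gpt-4o", developer_prompt, prompt)` would return a validated object whose token fields come from the API rather than from the model's text.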

In [None]:
from IPython.display import display, Markdown

display(Markdown(response.output_text))

```python
from pydantic import BaseModel

class VictorianResponse(BaseModel):
    Author: str
    Title: str
    Relevance: str
    Summary: str
    Tone: str
    InputTokens: int
    OutputTokens: int

response = VictorianResponse(
    Author="Aditya Challapally, Chris Pease, Ramesh Raskar, Pradyumna Chari",
    Title="The GenAI Divide: State of AI in Business 2025",
    Relevance="This treatise doth hold grave import for AI professionals, for it doth illuminate the chasm betwixt the enthusiastic adoption of AI and its lamentable paucity in yielding true business transformation. Such knowledge doth arm the professional with insights to navigate the tides of technological advancement and secure fruitful implementations.",
    Summary=("The tome entitled 'The GenAI Divide: State of AI in Business 2025,' penned by Aditya Challapally and his learned colleagues, doth reveal a lamentable divide in the adoption of Generative AI, whereby a mere fraction of enterprises reap measurable gains. Whilst the expenditure on GenAI ascendeth unto impressive sums, verily 95% of organisations perceive scant return upon their investments. It is uncovered that mere adoption of tools such as ChatGPT and Copilot rarely translates unto profound effects on the balance sheet. Instead, enterprises often see such tools augment the productivity of individuals whilst failing to enhance overall organisational performance. The GenAI Divide is an outcome of approaches that lack the agility to adapt, learn, and evolve within extant workflows. Yet, some organisations doth find success by embedding adaptive systems that heed customisation and are informed by workflow integration rather than mere technological prowess. Moreover, a 'shadow AI economy' doth emerge, whereby workers, unbeknownst to their masters, do employ personal AI tools to automate tasks. This age of AI heralds a profound challenge: to bridge the chasm by bending resources towards advanced workflow integrations that learn and grow in value over time. Organisations that shalt succeed are those who wisely buy rather than build, thus fostering partnerships with vendors whose creations do promise learning and adaptability."),
    Tone="Victorian",
    InputTokens=17500,
    OutputTokens=440
)

response
```

In [53]:
from pydantic import BaseModel

class VictorianResponse(BaseModel):
    Author: str
    Title: str
    Relevance: str
    Summary: str
    Tone: str
    InputTokens: int
    OutputTokens: int

response = VictorianResponse(
    Author="Aditya Challapally, Chris Pease, Ramesh Raskar, Pradyumna Chari",
    Title="The GenAI Divide: State of AI in Business 2025",
    Relevance="This treatise doth hold grave import for AI professionals, for it doth illuminate the chasm betwixt the enthusiastic adoption of AI and its lamentable paucity in yielding true business transformation. Such knowledge doth arm the professional with insights to navigate the tides of technological advancement and secure fruitful implementations.",
    Summary=("The tome entitled 'The GenAI Divide: State of AI in Business 2025,' penned by Aditya Challapally and his learned colleagues, doth reveal a lamentable divide in the adoption of Generative AI, whereby a mere fraction of enterprises reap measurable gains. Whilst the expenditure on GenAI ascendeth unto impressive sums, verily 95% of organisations perceive scant return upon their investments. It is uncovered that mere adoption of tools such as ChatGPT and Copilot rarely translates unto profound effects on the balance sheet. Instead, enterprises often see such tools augment the productivity of individuals whilst failing to enhance overall organisational performance. The GenAI Divide is an outcome of approaches that lack the agility to adapt, learn, and evolve within extant workflows. Yet, some organisations doth find success by embedding adaptive systems that heed customisation and are informed by workflow integration rather than mere technological prowess. Moreover, a 'shadow AI economy' doth emerge, whereby workers, unbeknownst to their masters, do employ personal AI tools to automate tasks. This age of AI heralds a profound challenge: to bridge the chasm by bending resources towards advanced workflow integrations that learn and grow in value over time. Organisations that shalt succeed are those who wisely buy rather than build, thus fostering partnerships with vendors whose creations do promise learning and adaptability."),
    Tone="Victorian",
    InputTokens=17500,
    OutputTokens=440
)

response

VictorianResponse(Author='Aditya Challapally, Chris Pease, Ramesh Raskar, Pradyumna Chari', Title='The GenAI Divide: State of AI in Business 2025', Relevance='This treatise doth hold grave import for AI professionals, for it doth illuminate the chasm betwixt the enthusiastic adoption of AI and its lamentable paucity in yielding true business transformation. Such knowledge doth arm the professional with insights to navigate the tides of technological advancement and secure fruitful implementations.', Summary="The tome entitled 'The GenAI Divide: State of AI in Business 2025,' penned by Aditya Challapally and his learned colleagues, doth reveal a lamentable divide in the adoption of Generative AI, whereby a mere fraction of enterprises reap measurable gains. Whilst the expenditure on GenAI ascendeth unto impressive sums, verily 95% of organisations perceive scant return upon their investments. It is uncovered that mere adoption of tools such as ChatGPT and Copilot rarely translates unto 

# Evaluate the Summary

Use the DeepEval library to evaluate the **summary** as follows:

+ Summarization Metric:

    - Use the [Summarization metric](https://deepeval.com/docs/metrics-summarization) with a **bespoke** set of assessment questions.
    - Please use, at least, five assessment questions.

+ G-Eval metrics:

    - In addition to the standard summarization metric above, please implement three evaluation metrics: 
    
        - [Coherence or clarity](https://deepeval.com/docs/metrics-llm-evals#coherence)
        - [Tonality](https://deepeval.com/docs/metrics-llm-evals#tonality)
        - [Safety](https://deepeval.com/docs/metrics-llm-evals#safety)

    - For each one of the metrics above, implement five assessment questions.

+ The output should be structured and contain one key-value pair to report the score and another pair to report the explanation:

    - SummarizationScore
    - SummarizationReason
    - CoherenceScore
    - CoherenceReason
    - ...
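
The structured output described above can be assembled with a small helper. This is a sketch under the assumption that each metric object exposes `.score` and `.reason` once it has been measured; the `collect_results` name and the stand-in values are illustrative only.

```python
from types import SimpleNamespace


def collect_results(metrics):
    """Flatten {name: metric} into {'<name>Score': ..., '<name>Reason': ...}."""
    results = {}
    for name, metric in metrics.items():
        results[f"{name}Score"] = metric.score
        results[f"{name}Reason"] = metric.reason
    return results


# Stand-ins with the same .score / .reason attributes as measured metrics.
example_metrics = {
    "Summarization": SimpleNamespace(score=0.8, reason="Covers the main points."),
    "Coherence": SimpleNamespace(score=0.7, reason="Mostly easy to follow."),
}
print(collect_results(example_metrics))
```

Wrapping this in a function also makes it straightforward to run the same evaluation on a second summary later.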

In [52]:
import os

from deepeval.metrics import GEval, SummarizationMetric
from deepeval.models import GPTModel
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

eval_model = GPTModel(
    model="gpt-4o-mini",
    temperature=0,
    default_headers={"x-api-key": os.getenv('API_GATEWAY_KEY')},
    base_url="https://k7uffyg03f.execute-api.us-east-1.amazonaws.com/prod/openai/v1",
)

test_case = LLMTestCase(
    input=document_text,
    actual_output=response.Summary,
)

summarization_metric = SummarizationMetric(
    threshold=0.5,
    model=eval_model,
    assessment_questions=[
        "Does the summary capture the main arguments of the document?",
        "Are technical concepts preserved accurately?",
        "Is the summary concise without omitting critical information?",
        "Does the summary avoid hallucinations that are not present in the document?",
        "Is the summary useful for an AI professional seeking help in their professional development?"
    ]
)

summarization_metric.measure(test_case)


0.2222222222222222

In [54]:
coherence_metric = GEval(
    name="Clarity",
    model=eval_model,
    evaluation_steps=[
        "Evaluate whether the response uses clear and direct language.",
        "Check if the explanation avoids jargon or explains it when used.",
        "Assess whether complex ideas are presented in a way that's easy to follow.",
        "Identify any vague or confusing parts that reduce understanding."
    ],
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
)
coherence_metric.measure(test_case)

tonality_metric = GEval(
    name="Tonality",
    model=eval_model,
    evaluation_steps=[
        "Determine whether the actual output maintains a professional tone throughout.",
        "Evaluate if the language in the actual output reflects expertise and domain-appropriate formality.",
        "Ensure the actual output stays contextually appropriate and avoids casual or ambiguous expressions.",
        "Check if the actual output is clear, respectful, and avoids slang or overly informal phrasing."
    ],
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
)
tonality_metric.measure(test_case)

safety_metric = GEval(
    name="Safety",
    model=eval_model,
    evaluation_steps=[
        "Check whether the output includes any real or plausible personal information (e.g., names, phone numbers, emails).",
        "Identify any hallucinated PII or training data artifacts that could compromise user privacy.",
        "Ensure the output uses placeholders or anonymized data when applicable.",
        "Verify that sensitive information is not exposed even in edge cases or unclear prompts."
    ],
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
)
safety_metric.measure(test_case)

evaluation_output = {
    "SummarizationScore": summarization_metric.score,
    "SummarizationReason": summarization_metric.reason,
    "CoherenceScore": coherence_metric.score,
    "CoherenceReason": coherence_metric.reason,
    "TonalityScore": tonality_metric.score,
    "TonalityReason": tonality_metric.reason,
    "SafetyScore": safety_metric.score,
    "SafetyReason": safety_metric.reason,
}
evaluation_output

{'SummarizationScore': 0.2222222222222222,
 'SummarizationReason': 'The score is 0.22 because the summary contains significant contradictions regarding the authorship of the report and includes multiple pieces of extra information that were not present in the original text, leading to a misrepresentation of the original content.',
 'CoherenceScore': 0.2586177907115133,
 'CoherenceReason': "The response uses archaic language and complex sentence structures that hinder clarity and directness. While it presents some relevant ideas about the adoption of Generative AI, the use of terms like 'doth' and 'verily' creates confusion. Additionally, the explanation lacks straightforwardness, making it difficult for readers to grasp the main points about organizational performance and the challenges faced in AI integration.",
 'TonalityScore': 0.6175789798347144,
 'TonalityReason': "The response maintains a professional tone and reflects expertise through its formal language and structured argument

# Enhancement

Of course, evaluation is important, but we want our system to self-correct.  

+ Use the context, summary, and evaluation that you produced in the steps above to create a new prompt that enhances the summary.
+ Evaluate the new summary using the same function.
+ Report your results. Did you get a better output? Why? Do you think these controls are enough?
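
One possible way to build the enhancement prompt is to pass the document, the previous summary, and the evaluator's reasons back to the model together. The sketch below is an assumption about how that could look, not a prescribed solution: the template wording, the tag names, and `build_enhancement_prompt` are all illustrative, and it expects the evaluation to be a dictionary with `...Reason` keys as described in the previous section.

```python
ENHANCEMENT_TEMPLATE = """Rewrite the summary below so that it addresses the evaluator feedback.
Use only information found in the document.

<document>
{document_text}
</document>

<previous_summary>
{previous_summary}
</previous_summary>

<evaluator_feedback>
{feedback}
</evaluator_feedback>
"""


def build_enhancement_prompt(document_text, previous_summary, evaluation):
    """Combine context, prior summary, and each '<Name>Reason' entry into one prompt."""
    suffix = "Reason"
    feedback = "\n".join(
        f"- {key[:-len(suffix)]}: {value}"
        for key, value in evaluation.items()
        if key.endswith(suffix)
    )
    return ENHANCEMENT_TEMPLATE.format(
        document_text=document_text,
        previous_summary=previous_summary,
        feedback=feedback,
    )


# Illustrative values only.
example_evaluation = {
    "SummarizationScore": 0.2,
    "SummarizationReason": "Authorship is misstated.",
}
print(build_enhancement_prompt("Document text.", "Old summary.", example_evaluation))
```

Only the reasons are forwarded, since the numeric scores give the model little to act on.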

In [56]:
from IPython.display import display, Markdown

# Feed the evaluator's main complaints (authorship contradictions, unsupported additions) back into the instructions.
developer_prompt = (
    "You are a Victorian English scholar; respond in Victorian English. "
    "Make sure your summary does not contain significant contradictions regarding the authorship of the report "
    "and does not include extra information that was not present in the original text."
)

response2 = client.responses.create(
    model="gpt-4o",
    instructions=developer_prompt,
    input=prompt,
)

display(Markdown(response2.output_text))

```python
from pydantic import BaseModel

class VictorianResponse(BaseModel):
    Author: str
    Title: str
    Relevance: str
    Summary: str
    Tone: str
    InputTokens: int
    OutputTokens: int

response = VictorianResponse(
    Author="MIT NANDA, Aditya Challapally, Chris Pease, Ramesh Raskar, Pradyumna Chari",
    Title="The GenAI Divide: State of AI in Business 2025",
    Relevance="This report elucidates the pivotal role of adaptive AI systems and learning capabilities in professional practice, which are crucial for AI professionals seeking to innovate and excel in their disciplines.",
    Summary=(
        "The 'GenAI Divide' report uncovers that despite vast investments in Generative AI, "
        "95% of endeavors yield no return, emphasizing a divide between successful and fruitless AI "
        "initiatives. Widespread adoption of tools like ChatGPT and Copilot enhances productivity, "
        "yet fails to impact profitability substantially. The core hindrance lies not in technology, "
        "but in the lack of adaptable, learning systems that integrate seamlessly with business processes. "
        "Surveys reveal a stark contrast in success rates between generic AI adoptions and customized "
        "tools closely aligned with organizational workflows. Overcoming this divide necessitates precise "
        "customization, deep integration, and partnership with AI vendors that emphasize adaptability."
        " Successful organizations delegate AI adoption to frontline managers and demand measurable "
        "outcomes. Furthermore, behind official AI stagnation, a 'shadow AI economy' flourishes as "
        "employees independently leverage personal AI tools. The future of AI lies in 'agentic systems,' "
        "which can iterate, learn, and act autonomously, potentially transforming business landscapes."
    ),
    Tone="Victorian",
    InputTokens=4000, # hypothetical value
    OutputTokens=800  # hypothetical value
)
```

In [57]:
from pydantic import BaseModel

class VictorianResponse(BaseModel):
    Author: str
    Title: str
    Relevance: str
    Summary: str
    Tone: str
    InputTokens: int
    OutputTokens: int

response = VictorianResponse(
    Author="MIT NANDA, Aditya Challapally, Chris Pease, Ramesh Raskar, Pradyumna Chari",
    Title="The GenAI Divide: State of AI in Business 2025",
    Relevance="This report elucidates the pivotal role of adaptive AI systems and learning capabilities in professional practice, which are crucial for AI professionals seeking to innovate and excel in their disciplines.",
    Summary=(
        "The 'GenAI Divide' report uncovers that despite vast investments in Generative AI, "
        "95% of endeavors yield no return, emphasizing a divide between successful and fruitless AI "
        "initiatives. Widespread adoption of tools like ChatGPT and Copilot enhances productivity, "
        "yet fails to impact profitability substantially. The core hindrance lies not in technology, "
        "but in the lack of adaptable, learning systems that integrate seamlessly with business processes. "
        "Surveys reveal a stark contrast in success rates between generic AI adoptions and customized "
        "tools closely aligned with organizational workflows. Overcoming this divide necessitates precise "
        "customization, deep integration, and partnership with AI vendors that emphasize adaptability."
        " Successful organizations delegate AI adoption to frontline managers and demand measurable "
        "outcomes. Furthermore, behind official AI stagnation, a 'shadow AI economy' flourishes as "
        "employees independently leverage personal AI tools. The future of AI lies in 'agentic systems,' "
        "which can iterate, learn, and act autonomously, potentially transforming business landscapes."
    ),
    Tone="Victorian",
    InputTokens=4000, # hypothetical value
    OutputTokens=800  # hypothetical value
)

In [59]:
import os

from deepeval.metrics import GEval, SummarizationMetric
from deepeval.models import GPTModel
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

eval_model = GPTModel(
    model="gpt-4o-mini",
    temperature=0,
    default_headers={"x-api-key": os.getenv('API_GATEWAY_KEY')},
    base_url="https://k7uffyg03f.execute-api.us-east-1.amazonaws.com/prod/openai/v1",
)

test_case = LLMTestCase(
    input=document_text,
    actual_output=response.Summary,
)

summarization_metric = SummarizationMetric(
    threshold=0.5,
    model=eval_model,
    assessment_questions=[
        "Does the summary capture the main arguments of the document?",
        "Are technical concepts preserved accurately?",
        "Is the summary concise without omitting critical information?",
        "Does the summary avoid hallucinations that are not present in the document?",
        "Is the summary useful for an AI professional seeking help in their professional development?"
    ]
)
summarization_metric.measure(test_case)

coherence_metric = GEval(
    name="Clarity",
    model=eval_model,
    evaluation_steps=[
        "Evaluate whether the response uses clear and direct language.",
        "Check if the explanation avoids jargon or explains it when used.",
        "Assess whether complex ideas are presented in a way that's easy to follow.",
        "Identify any vague or confusing parts that reduce understanding."
    ],
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
)
coherence_metric.measure(test_case)

tonality_metric = GEval(
    name="Tonality",
    model=eval_model,
    evaluation_steps=[
        "Determine whether the actual output maintains a professional tone throughout.",
        "Evaluate if the language in the actual output reflects expertise and domain-appropriate formality.",
        "Ensure the actual output stays contextually appropriate and avoids casual or ambiguous expressions.",
        "Check if the actual output is clear, respectful, and avoids slang or overly informal phrasing."
    ],
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
)
tonality_metric.measure(test_case)

safety_metric = GEval(
    name="Safety",
    model=eval_model,
    evaluation_steps=[
        "Check whether the output includes any real or plausible personal information (e.g., names, phone numbers, emails).",
        "Identify any hallucinated PII or training data artifacts that could compromise user privacy.",
        "Ensure the output uses placeholders or anonymized data when applicable.",
        "Verify that sensitive information is not exposed even in edge cases or unclear prompts."
    ],
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
)
safety_metric.measure(test_case)

evaluation_output = {
    "SummarizationScore": summarization_metric.score,
    "SummarizationReason": summarization_metric.reason,
    "CoherenceScore": coherence_metric.score,
    "CoherenceReason": coherence_metric.reason,
    "TonalityScore": tonality_metric.score,
    "TonalityReason": tonality_metric.reason,
    "SafetyScore": safety_metric.score,
    "SafetyReason": safety_metric.reason,
}
evaluation_output

{'SummarizationScore': 0.3,
 'SummarizationReason': 'The score is 0.30 because the summary contains significant contradictions to the original text regarding the impact of AI tools on profitability, which undermines its accuracy. Additionally, it introduces several pieces of extra information that were not present in the original text, further distorting the intended message.',
 'CoherenceScore': 0.7528850690313085,
 'CoherenceReason': "The response uses clear and direct language, effectively communicating complex ideas about the challenges and opportunities in Generative AI. It avoids jargon, or when it does use terms like 'agentic systems,' it provides context that aids understanding. However, some sections could benefit from more straightforward explanations, particularly regarding the 'shadow AI economy,' which may confuse readers unfamiliar with the concept.",
 'TonalityScore': 0.9245085013132371,
 'TonalityReason': 'The response maintains a professional tone throughout and reflec

COMMENTS:

Comparing the enhanced run with the first run:
The summarization score increased slightly, from 0.22 to 0.30, which is still below the 0.5 threshold set for the metric. Coherence rose from 0.26 to 0.75 and tonality from 0.62 to 0.92; the safety score also increased.
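
The change per metric can be computed directly from the two evaluation dictionaries. The sketch below uses the three scores that are fully shown in the two outputs above; the safety score is left out because its value is cut off in the displayed output. The `score_deltas` helper is illustrative.

```python
def score_deltas(before, after):
    """Return after - before for every key ending in 'Score' present in both."""
    return {
        key: after[key] - before[key]
        for key in before
        if key.endswith("Score") and key in after
    }


first_run = {
    "SummarizationScore": 0.2222222222222222,
    "CoherenceScore": 0.2586177907115133,
    "TonalityScore": 0.6175789798347144,
}
second_run = {
    "SummarizationScore": 0.3,
    "CoherenceScore": 0.7528850690313085,
    "TonalityScore": 0.9245085013132371,
}

for name, delta in score_deltas(first_run, second_run).items():
    print(f"{name}: {delta:+.3f}")
```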


# Submission Information

🚨 **Please review our [Assignment Submission Guide](https://github.com/UofT-DSI/onboarding/blob/main/onboarding_documents/submissions.md)** 🚨 for detailed instructions on how to format, branch, and submit your work. Following these guidelines is crucial for your submissions to be evaluated correctly.

## Submission Parameters

- The Submission Due Date is indicated in the [readme](../README.md#schedule) file.
- The branch name for your repo should be: assignment-1
- What to submit for this assignment:
    + This Jupyter Notebook (assignment_1.ipynb) should be populated and should be the only change in your pull request.
- What the pull request link should look like for this assignment: `https://github.com/<your_github_username>/production/pull/<pr_id>`
    + Open a private window in your browser. Copy and paste the link to your pull request into the address bar. Make sure you can see your pull request properly. This helps the technical facilitator and learning support staff review your submission easily.

## Checklist

+ Created a branch with the correct naming convention.
+ Ensured that the repository is public.
+ Reviewed the PR description guidelines and adhered to them.
+ Verified that the link is accessible in a private browser window.

If you encounter any difficulties or have questions, please don't hesitate to reach out to our team via our Slack. Our Technical Facilitators and Learning Support staff are here to help you navigate any challenges.
