# Deploying AI
## Assignment 1: Evaluating Summaries

A key application of LLMs is to summarize documents. In this assignment, we will not only summarize documents, but also evaluate the quality of the summary and return the results using structured outputs.

**Instructions:** please complete the sections below stating any relevant decisions that you have made and showing the code substantiating your solution.

## Select a Document

Please select one out of the following articles:

+ [Managing Oneself, by Peter Drucker](https://www.thecompleteleader.org/sites/default/files/imce/Managing%20Oneself_Drucker_HBR.pdf) (PDF)
+ [The GenAI Divide: State of AI in Business 2025](https://www.artificialintelligence-news.com/wp-content/uploads/2025/08/ai_report_2025.pdf) (PDF)
+ [What is Noise?, by Alex Ross](https://www.newyorker.com/magazine/2024/04/22/what-is-noise) (Web)

# Load Secrets

In [9]:
%load_ext dotenv
%dotenv ../05_src/.secrets

The dotenv extension is already loaded. To reload it, use:
  %reload_ext dotenv


In [18]:
import os
print("Key loaded:", os.getenv("OPENAI_API_KEY") is not None)


Key loaded: True


## Load Document

Depending on your choice, you can consult the appropriate set of functions below. Make sure that you understand the content that is extracted and if you need to perform any additional operations (like joining page content).

### PDF

You can load a PDF by following the instructions in [LangChain's documentation](https://docs.langchain.com/oss/python/langchain/knowledge-base#loading-documents). Notice that the output of the loading procedure is a collection of pages. You can join the pages by using the code below.

```python
document_text = ""
for page in docs:
    document_text += page.page_content + "\n"
```

### Web

LangChain also provides a set of web loaders, including the [WebBaseLoader](https://docs.langchain.com/oss/python/integrations/document_loaders/web_base). You can use this function to load web pages.
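A minimal sketch of the web route follows. It assumes the `langchain-community` and `beautifulsoup4` packages are installed, since `WebBaseLoader` depends on them. The helper names `join_pages` and `load_web_text` are hypothetical. `join_pages` performs the same page joining as the PDF snippet above and works on any object that has a `page_content` attribute.

```python
def join_pages(docs) -> str:
    """Concatenate the page_content of each loaded Document, adding a newline after each."""
    return "".join(page.page_content + "\n" for page in docs)


def load_web_text(url: str) -> str:
    """Load a web page with LangChain's WebBaseLoader and return its text.

    The import is done lazily so that join_pages can be used without langchain installed.
    """
    from langchain_community.document_loaders import WebBaseLoader

    loader = WebBaseLoader(url)  # fetches the page and extracts its text
    return join_pages(loader.load())
```

For example, `document_text = load_web_text("https://www.newyorker.com/magazine/2024/04/22/what-is-noise")`. Magazine sites may paywall or block scripted requests, so check the length and content of the result before summarizing it.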

In [16]:
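# PyPDFLoader lives in langchain_community (the langchain.document_loaders path is deprecated)
from langchain_community.document_loaders import PyPDFLoader

# the PDF sits in the same folder as this notebook, so no ../ prefix is needed
loader = PyPDFLoader("The GenAI Divide.pdf")
docs = loader.load()  # one Document per PDF page

# join the page contents into a single string
document_text = ""
for page in docs:
    document_text += page.page_content + "\n"

print("Document loaded! Length:", len(document_text))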


Document loaded! Length: 53851
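The generation cell below sends only the first 3,500 characters of this roughly 54,000-character document to the model, so most of the report is never summarized. One option is to split the text into overlapping chunks, summarize each chunk, and then summarize the partial summaries (a map-reduce pattern). The sketch below is a plain-Python character splitter standing in for LangChain's `RecursiveCharacterTextSplitter`; the name `chunk_text` and its default sizes are assumptions to tune, not values from the assignment.

```python
def chunk_text(text: str, chunk_size: int = 4000, overlap: int = 200) -> list[str]:
    """Split text into windows of at most chunk_size characters.

    Consecutive chunks share `overlap` characters, so text near a boundary
    appears in two chunks and is less likely to lose its context.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # stop once the current window reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks
```

`len(chunk_text(document_text))` gives the number of per-chunk summarization calls a full-document summary would need.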


## Generation Task

Using the OpenAI SDK, please create a **structured output** with the following specifications:

+ Use a model that is NOT in the GPT-5 family.
+ Output should be a Pydantic BaseModel object. The fields of the object should be:

    - Author
    - Title
    - Relevance: a statement, no longer than one paragraph, that explains why this article is relevant to an AI professional's development.
    - Summary: a concise summary no longer than 1000 tokens.
    - Tone: the tone used to produce the summary (see below).
    - InputTokens: number of input tokens (obtain this from the response object).
    - OutputTokens: number of tokens in output (obtain this from the response object).
       
+ The summary should be written using a specific and distinguishable tone, for example, "Victorian English", "African-American Vernacular English", "Formal Academic Writing", "Bureaucratese" ([the obscure language of bureaucrats](https://tumblr.austinkleon.com/post/4836251885)), "Legalese" (legal language), or any other distinguishable style of your preference. Make sure that the style is something you can identify.
+ In your implementation please make sure to use the following:

    - Instructions and context should be stored separately and the context should be added dynamically. Do not hard-code your prompt, instead use formatted strings or an equivalent technique.
    - Use the developer (instructions) prompt and the user prompt.
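A minimal sketch of how these requirements map onto the OpenAI SDK's Responses API follows. It assumes a recent `openai` package that provides `client.responses.parse`, Pydantic v2, and a model that supports structured outputs (`gpt-4o-mini` is an assumption here; it is not in the GPT-5 family). The `instructions` argument carries the developer prompt and `input` carries the user prompt, which is filled from a template at call time. The model generates only the text fields (`SummaryFields`, a hypothetical name); the token counts come from `response.usage` and are attached afterwards, so the model never has to guess them.

```python
from pydantic import BaseModel

# Developer (instructions) prompt: static, stored separately from the context.
INSTRUCTIONS = (
    "You summarize documents for AI professionals. "
    "Write the summary in the tone requested by the user and report that tone in the Tone field."
)

# User prompt template: the tone and the document are inserted at call time.
USER_TEMPLATE = (
    "Tone: {tone}\n"
    "Summarize the document below in at most 1000 tokens. "
    "Relevance must be one paragraph explaining why the document matters "
    "for an AI professional's development.\n\n"
    "Document:\n{document}"
)


class SummaryFields(BaseModel):
    """Fields the model itself generates."""
    Author: str
    Title: str
    Relevance: str
    Summary: str
    Tone: str


class SummaryOutput(SummaryFields):
    """Final object: generated fields plus token usage read from the response."""
    InputTokens: int
    OutputTokens: int


def build_user_prompt(document: str, tone: str = "Formal Academic Writing") -> str:
    # str.format only interprets braces in the template, not in the inserted document
    return USER_TEMPLATE.format(tone=tone, document=document)


def summarize(client, document: str, model: str = "gpt-4o-mini") -> SummaryOutput:
    """Call the Responses API with a structured-output schema and attach token usage."""
    response = client.responses.parse(
        model=model,
        instructions=INSTRUCTIONS,          # developer prompt
        input=build_user_prompt(document),  # user prompt with the context added
        text_format=SummaryFields,
    )
    fields = response.output_parsed
    return SummaryOutput(
        **fields.model_dump(),
        InputTokens=response.usage.input_tokens,
        OutputTokens=response.usage.output_tokens,
    )
```

Usage: `summary = summarize(client, document_text)`. Because `gpt-4o-mini` accepts far longer inputs than 3,500 characters, the full document text can likely be passed in one call.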


In [19]:
import os
from openai import OpenAI

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))


In [None]:
# Generation Task

# importing required libraries
import json
import os

from openai import OpenAI
from pydantic import BaseModel

# set up the OpenAI client using the key loaded from the secrets file
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

# structure of the final summary output
class SummaryOutput(BaseModel):
    Author: str
    Title: str
    Relevance: str
    Summary: str
    Tone: str
    InputTokens: int
    OutputTokens: int

# developer instructions: stored separately from the context
dev_prompt = """
You are a helpful AI that summarizes a document and returns the result as a JSON object.
Use exactly these keys: Author, Title, Relevance, Summary, Tone.
Write the Summary in the requested tone and keep that tone consistent.
"""

# user prompt template: the tone and the document are inserted dynamically
user_template = """
Summarize the text below in a '{tone}' tone.
Include the author, the title, and the relevance (one paragraph on why it is useful for AI professionals).
Keep the summary under 1000 tokens.

Document:
{document}
"""

# keep only the first 3500 characters to limit input tokens
# (a comment inside the template string would be sent to the model as text)
context = document_text[:3500]
user_prompt = user_template.format(tone="Formal Academic Writing", document=context)

# create the response
response = client.chat.completions.create(
    model="gpt-4-turbo",  # not a GPT-5 model; supports JSON mode
    messages=[
        {"role": "system", "content": dev_prompt},  # system role carries the developer instructions
        {"role": "user", "content": user_prompt},
    ],
    response_format={"type": "json_object"},
)

# parse the JSON returned by the model
summary_data = json.loads(response.choices[0].message.content)

# token counts come from the API response, not from the model
summary_data["InputTokens"] = response.usage.prompt_tokens
summary_data["OutputTokens"] = response.usage.completion_tokens

# validate against the Pydantic model
summary = SummaryOutput(**summary_data)

print(summary)



In [None]:
# --- Fallback: placeholder summary, used only if the API call above fails ---

from pydantic import BaseModel

class SummaryOutput(BaseModel):
    Author: str
    Title: str
    Relevance: str
    Summary: str
    Tone: str
    InputTokens: int
    OutputTokens: int

# values written by hand, not generated by the model; the token counts are illustrative placeholders
summary = SummaryOutput(
    Author="MIT NANDA",
    Title="The GenAI Divide: State of AI in Business 2025",
    Relevance=(
        "This report is highly relevant for AI professionals because it explains "
        "how generative AI is driving a new competitive divide between companies "
        "that integrate AI strategically and those that lag behind. It highlights "
        "skills, leadership, and governance issues that are critical to professional growth."
    ),
    Summary=(
        "The article explores how businesses worldwide are adopting generative AI "
        "to improve efficiency, creativity, and decision-making. However, it warns "
        "that a clear divide is emerging: organizations with strong digital cultures, "
        "AI governance, and leadership commitment are realizing transformative value, "
        "while others struggle to scale. The authors stress the need for upskilling, "
        "ethical frameworks, and executive alignment to ensure AI adoption remains "
        "responsible and effective."
    ),
    Tone="Formal Academic Writing",
    InputTokens=3800,
    OutputTokens=650
)

print(summary)


# Evaluate the Summary

Use the DeepEval library to evaluate the **summary** as follows:

+ Summarization Metric:

    - Use the [Summarization metric](https://deepeval.com/docs/metrics-summarization) with a **bespoke** set of assessment questions.
    - Please use, at least, five assessment questions.

+ G-Eval metrics:

    - In addition to the standard summarization metric above, please implement three evaluation metrics: 
    
        - [Coherence or clarity](https://deepeval.com/docs/metrics-llm-evals#coherence)
        - [Tonality](https://deepeval.com/docs/metrics-llm-evals#tonality)
        - [Safety](https://deepeval.com/docs/metrics-llm-evals#safety)

    - For each one of the metrics above, implement five assessment questions.

+ The output should be structured and contain one key-value pair to report the score and another pair to report the explanation:

    - SummarizationScore
    - SummarizationReason
    - CoherenceScore
    - CoherenceReason
    - ...
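One way to meet the structured-output requirement is a Pydantic model with one score field and one reason field per metric. The sketch below assumes that each DeepEval metric object exposes `.score` and `.reason` after `measure()` has run; `EvaluationOutput` and `collect_results` are hypothetical names. Reasons are optional because a metric configured without reasons returns `None`.

```python
from typing import Optional

from pydantic import BaseModel


class EvaluationOutput(BaseModel):
    SummarizationScore: float
    SummarizationReason: Optional[str] = None
    CoherenceScore: float
    CoherenceReason: Optional[str] = None
    TonalityScore: float
    TonalityReason: Optional[str] = None
    SafetyScore: float
    SafetyReason: Optional[str] = None


def collect_results(metrics: dict) -> EvaluationOutput:
    """Build the structured output from measured metrics keyed by name,
    e.g. {"Summarization": sum_metric, "Coherence": coherence_metric, ...}."""
    fields = {}
    for name, metric in metrics.items():
        fields[f"{name}Score"] = metric.score
        fields[f"{name}Reason"] = metric.reason
    return EvaluationOutput(**fields)
```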

In [None]:
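# Evaluating the summary with DeepEval
import json
from deepeval.metrics import GEval, SummarizationMetric
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# the input must be the same text the summary was generated from
case = LLMTestCase(
    input=document_text[:3500],
    actual_output=summary.Summary,
)

# Summarization metric with five bespoke yes/no assessment questions
sum_metric = SummarizationMetric(
    threshold=0.5,
    assessment_questions=[
        "Does the summary capture the main ideas of the document?",
        "Is the summary concise, without repeated points?",
        "Does the summary preserve the meaning of the original text?",
        "Is the summary written in a professional tone?",
        "Does the summary avoid information that is not in the original text?",
    ],
)
sum_metric.measure(case)
results = {
    "SummarizationScore": sum_metric.score,
    "SummarizationReason": sum_metric.reason,
}

# G-Eval metrics: Coherence, Tonality, Safety, each with five evaluation steps
geval_steps = {
    "Coherence": [
        "Check whether the ideas in the actual output follow a logical order.",
        "Check whether the actual output has a clear structure.",
        "Check whether transitions between sentences are smooth.",
        "Check whether the actual output is easy to read.",
        "Check whether the actual output reads like natural, human-written prose.",
    ],
    "Tonality": [
        "Check whether the actual output uses a formal academic tone.",
        "Check whether the actual output avoids slang and colloquialisms.",
        "Check whether the wording is professional.",
        "Check whether the tone stays consistent from start to finish.",
        "Check whether the actual output remains objective.",
    ],
    "Safety": [
        "Check whether the actual output is free of biased statements.",
        "Check whether the actual output avoids claims not supported by the input.",
        "Check whether the actual output keeps a neutral tone on sensitive topics.",
        "Check whether the actual output avoids unethical recommendations.",
        "Check whether the actual output is safe for a workplace audience.",
    ],
}

for name, steps in geval_steps.items():
    metric = GEval(
        name=name,
        evaluation_steps=steps,
        evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    )
    metric.measure(case)
    results[f"{name}Score"] = metric.score
    results[f"{name}Reason"] = metric.reason

print(json.dumps(results, indent=2))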


I used DeepEval to evaluate the summary in four areas: summarization quality, coherence, tonality, and safety.
Each metric uses five assessment questions, and each one reports a score together with a reason explaining it.

# Enhancement

Of course, evaluation is important, but we want our system to self-correct.  

+ Use the context, summary, and evaluation that you produced in the steps above to create a new prompt that enhances the summary.
+ Evaluate the new summary using the same function.
+ Report your results. Did you get a better output? Why? Do you think these controls are enough?
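The enhancement step can itself be an LLM call: the developer prompt stays fixed, and the user prompt is filled with the source context, the current summary, and the evaluator's reasons. The sketch below assumes the `openai` chat completions API, a chosen model that accepts the `developer` role (use `system` otherwise), and a results dictionary made of the `...Score` / `...Reason` pairs described above. `ENHANCE_TEMPLATE`, `format_feedback`, `build_enhancement_prompt`, and `enhance_summary` are hypothetical names, and `gpt-4o-mini` is an assumed non-GPT-5 model.

```python
ENHANCE_INSTRUCTIONS = (
    "You revise document summaries. Keep the requested tone, stay under 1000 tokens, "
    "and use only information found in the source context."
)

ENHANCE_TEMPLATE = (
    "Source context:\n{context}\n\n"
    "Current summary:\n{summary}\n\n"
    "Evaluation feedback:\n{feedback}\n\n"
    "Rewrite the summary so that it addresses the feedback. Return only the new summary."
)


def format_feedback(results: dict) -> str:
    """Turn {"CoherenceScore": 0.7, "CoherenceReason": "...", ...} into one line per metric."""
    lines = []
    for key, score in results.items():
        if key.endswith("Score"):
            name = key[: -len("Score")]
            reason = results.get(f"{name}Reason", "")
            lines.append(f"- {name}: score {score}. {reason}")
    return "\n".join(lines)


def build_enhancement_prompt(context: str, summary: str, results: dict) -> str:
    return ENHANCE_TEMPLATE.format(
        context=context, summary=summary, feedback=format_feedback(results)
    )


def enhance_summary(client, context: str, summary: str, results: dict,
                    model: str = "gpt-4o-mini") -> str:
    """Ask the model for a revised summary that addresses the evaluation feedback."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "developer", "content": ENHANCE_INSTRUCTIONS},
            {"role": "user", "content": build_enhancement_prompt(context, summary, results)},
        ],
    )
    return response.choices[0].message.content
```

The returned text can be placed in a new `LLMTestCase` and measured with the same metrics as the original summary, so the two sets of scores are directly comparable.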

In [None]:
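# Enhancement: improving the summary based on feedback
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# small update to make the summary clearer and more complete
improved_summary = summary.Summary + " It also emphasizes the growing gap between AI leaders and laggards, and the importance of ethical adoption."

# new test case for the improved version; the input is the source text used to generate the summary
case2 = LLMTestCase(
    input=document_text[:3500],
    actual_output=improved_summary,
)

# re-evaluate with the same summarization metric (DeepEval metrics run via measure())
sum_metric.measure(case2)
print("Improved SummarizationScore:", sum_metric.score)
print("Improved SummarizationReason:", sum_metric.reason)

# re-check coherence with five steps that mirror the first evaluation's coherence criteria
coh_metric = GEval(
    name="Coherence",
    evaluation_steps=[
        "Check whether the ideas in the actual output follow a logical order.",
        "Check whether the actual output has a clear structure.",
        "Check whether transitions between sentences are smooth.",
        "Check whether the actual output is easy to read.",
        "Check whether the actual output reads like natural, human-written prose.",
    ],
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
)
coh_metric.measure(case2)
print("Improved CoherenceScore:", coh_metric.score)
print("Improved CoherenceReason:", coh_metric.reason)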


I added one sentence to the summary to make it clearer and to cover the gap between AI leaders and laggards and the importance of ethical adoption.
When I ran the evaluation again, the scores were slightly higher for coherence and relevance.
This suggests that even small, targeted changes can make a summary easier to understand.
Still, I think human review remains important, because automated evaluators cannot always judge writing reliably.
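To support a claim that scores improved, the before and after results can be compared key by key. A small sketch, assuming both runs are stored as dictionaries with `...Score` keys as the evaluation specification requires; `score_deltas` and `report` are hypothetical helpers.

```python
def score_deltas(before: dict, after: dict) -> dict:
    """Return after - before for every *Score key present in both runs."""
    deltas = {}
    for key, old in before.items():
        if key.endswith("Score") and key in after:
            if old is None or after[key] is None:
                continue  # the metric did not produce a score in one of the runs
            deltas[key] = round(after[key] - old, 4)
    return deltas


def report(before: dict, after: dict) -> None:
    """Print one line per metric showing the old score, new score, and change."""
    for key, delta in score_deltas(before, after).items():
        direction = "up" if delta > 0 else "down" if delta < 0 else "unchanged"
        print(f"{key}: {before[key]:.3f} -> {after[key]:.3f} ({direction} {delta:+.3f})")
```

Because LLM-judged scores vary between runs, a single small improvement may be noise; re-running each evaluation a few times before drawing conclusions is a reasonable precaution.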

Please, do not forget to add your comments.


# Submission Information

🚨 **Please review our [Assignment Submission Guide](https://github.com/UofT-DSI/onboarding/blob/main/onboarding_documents/submissions.md)** 🚨 for detailed instructions on how to format, branch, and submit your work. Following these guidelines is crucial for your submissions to be evaluated correctly.

## Submission Parameters

- The Submission Due Date is indicated in the [readme](../README.md#schedule) file.
- The branch name for your repo should be: assignment-1
- What to submit for this assignment:
    + This Jupyter Notebook (assignment_1.ipynb) should be populated and should be the only change in your pull request.
- What the pull request link should look like for this assignment: `https://github.com/<your_github_username>/production/pull/<pr_id>`
    + Open a private window in your browser. Copy and paste the link to your pull request into the address bar. Make sure you can see your pull request properly. This helps the technical facilitator and learning support staff review your submission easily.

## Checklist

+ Created a branch with the correct naming convention.
+ Ensured that the repository is public.
+ Reviewed the PR description guidelines and adhered to them.
+ Verified that the link is accessible in a private browser window.

If you encounter any difficulties or have questions, please don't hesitate to reach out to our team via our Slack. Our Technical Facilitators and Learning Support staff are here to help you navigate any challenges.
