# Deploying AI
## Assignment 1: Evaluating Summaries

A key application of LLMs is to summarize documents. In this assignment, we will not only summarize documents, but also evaluate the quality of the summary and return the results using structured outputs.

**Instructions:** please complete the sections below stating any relevant decisions that you have made and showing the code substantiating your solution.

## Select a Document

Please select one out of the following articles:

+ [Managing Oneself, by Peter Drucker](https://www.thecompleteleader.org/sites/default/files/imce/Managing%20Oneself_Drucker_HBR.pdf) (PDF)
+ [The GenAI Divide: State of AI in Business 2025](https://www.artificialintelligence-news.com/wp-content/uploads/2025/08/ai_report_2025.pdf) (PDF)
+ [What is Noise?, by Alex Ross](https://www.newyorker.com/magazine/2024/04/22/what-is-noise) (Web)

# Load Secrets

In [1]:
%load_ext dotenv
%dotenv ../05_src/.secrets

## Load Document

Depending on your choice, consult the appropriate set of functions below. Make sure that you understand the extracted content and whether you need to perform any additional operations (like joining page content).

### PDF

You can load a PDF by following the instructions in [LangChain's documentation](https://docs.langchain.com/oss/python/langchain/knowledge-base#loading-documents). Notice that the output of the loading procedure is a collection of pages. You can join the pages by using the code below.

```python
document_text = ""
for page in docs:
    document_text += page.page_content + "\n"
```
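
Before joining, it can help to check what was actually extracted. The sketch below is only illustrative: `Page` is a hypothetical stand-in that mimics the `page_content` attribute of a loaded page, and `describe_pages` is a helper name introduced here, not part of LangChain.

```python
from dataclasses import dataclass


@dataclass
class Page:
    """Hypothetical stand-in for a loaded page; only page_content is mimicked."""
    page_content: str


def describe_pages(pages: list[Page]) -> dict[str, int]:
    """Summarize how much text was extracted and how many pages came back empty."""
    lengths = [len((page.page_content or "").strip()) for page in pages]
    return {
        "pages": len(lengths),
        "empty_pages": sum(1 for n in lengths if n == 0),
        "characters": sum(lengths),
    }


stats = describe_pages([Page("First page."), Page("   "), Page("Second page.")])
print(stats)
```

A high share of empty pages usually suggests scanned images rather than extractable text, in which case joining the pages alone would not be enough.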

### Web

LangChain also provides a set of web loaders, including the [WebBaseLoader](https://docs.langchain.com/oss/python/integrations/document_loaders/web_base). You can use this function to load web pages.

In [2]:
from langchain_community.document_loaders import PyPDFLoader

PDF_PATH = "./documents/managing_oneself.pdf"
loader = PyPDFLoader(PDF_PATH)
docs = loader.load()

document_text = ""
for i, page in enumerate(docs, start=1):
    content = (page.page_content or "").strip()
    if content:
        document_text += f"\n\n--- Page {i} ---\n{content}"

## Generation Task

Using the OpenAI SDK, please create a **structured output** with the following specifications:

+ Use a model that is NOT in the GPT-5 family.
+ Output should be a Pydantic BaseModel object. The fields of the object should be:

    - Author
    - Title
    - Relevance: a statement, no longer than one paragraph, that explains why this article is relevant for an AI professional in their professional development.
    - Summary: a concise and succinct summary no longer than 1000 tokens.
    - Tone: the tone used to produce the summary (see below).
    - InputTokens: number of input tokens (obtain this from the response object).
    - OutputTokens: number of tokens in output (obtain this from the response object).
       
+ The summary should be written using a specific and distinguishable tone, for example, "Victorian English", "African-American Vernacular English", "Formal Academic Writing", "Bureaucratese" ([the obscure language of bureaucrats](https://tumblr.austinkleon.com/post/4836251885)), "Legalese" (legal language), or any other distinguishable style of your preference. Make sure that the style is something you can identify.
+ In your implementation please make sure to use the following:

    - Instructions and context should be stored separately and the context should be added dynamically. Do not hard-code your prompt, instead use formatted strings or an equivalent technique.
    - Use the developer (instructions) prompt and the user prompt.
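
The separation of instructions and context described above can be sketched as follows. This is a minimal illustration using the standard library's `string.Template`; the template wording and the `build_messages` helper are assumptions introduced here, and the resulting list is only the shape one might pass as `input` to the SDK call.

```python
from string import Template

# Instructions (developer prompt) and context (user prompt) live in separate templates.
INSTRUCTIONS_TEMPLATE = Template(
    "You are a careful summarizer. Write the summary in this tone: $tone."
)
USER_TEMPLATE = Template(
    "DOCUMENT:\n$document\n\nExtract the metadata and write the summary."
)


def build_messages(document: str, tone: str) -> list[dict[str, str]]:
    """Return developer and user messages with the context filled in at call time."""
    return [
        {"role": "developer", "content": INSTRUCTIONS_TEMPLATE.substitute(tone=tone)},
        {"role": "user", "content": USER_TEMPLATE.substitute(document=document)},
    ]


messages = build_messages("Knowledge workers must manage themselves.", "Legalese")
for message in messages:
    print(message["role"], "->", message["content"][:60])
```

Because substitution happens once, at call time, braces or dollar signs inside the document text are passed through untouched.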


In [3]:
from pydantic import BaseModel
class Summary(BaseModel):
    Author: str
    Title: str
    Relevance: str
    Summary: str
    Tone: str
    Input_Tokens: int
    Output_Tokens: int

In [5]:
from openai import OpenAI
import os
from IPython.display import display, Markdown

client = OpenAI(
    base_url="https://k7uffyg03f.execute-api.us-east-1.amazonaws.com/prod/openai/v1",
    api_key="dummy",
    default_headers={"x-api-key": os.getenv("API_GATEWAY_KEY")}
)

tone =  "Academic and technical, with a focus on clarity and precision."
Model = "gpt-4o-mini"

system_prompt =  f"""
You are a precise academic summarizer who attends to technical detail.
Your output MUST be a valid Pydantic BaseModel JSON matching the provided schema.
Write the summary in tone: {tone}
""".strip()

user_prompt = f"""
DOCUMENT:
{document_text}

Extract metadata and write summary following the instructions.
""".strip()
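# Call the Responses API with the developer and user prompts kept separate.
# user_prompt is already an f-string, so it is not passed through .format() again:
# any braces in the document text would raise an error there.
# The instructions are sent once, as the developer message.
response = client.responses.parse(
    model=Model,
    input=[
        {"role": "developer", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ],
    text_format=Summary,
)
event = response.output_parsed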

display(Markdown(response.output_text))

{"Author":"Peter F. Drucker","Title":"Managing Oneself","Relevance":"Understanding self-management principles for knowledge workers.","Summary":"Peter Drucker discusses the necessity for individuals, particularly knowledge workers, to manage their own careers by cultivating self-awareness regarding their strengths, weaknesses, values, and preferred working styles. He emphasizes that each person should act as their own chief executive officer, utilizing feedback analysis to discern their true capabilities and areas for improvement. The article further explores the importance of aligning personal values with organizational culture, fostering effective working relationships, and preparing for the second half of one’s career by pursuing parallel interests or second careers to maintain engagement and satisfaction.","Tone":"Academic and technical","Input_Tokens":2207,"Output_Tokens":163}

In [7]:
event.Summary

'Peter Drucker discusses the necessity for individuals, particularly knowledge workers, to manage their own careers by cultivating self-awareness regarding their strengths, weaknesses, values, and preferred working styles. He emphasizes that each person should act as their own chief executive officer, utilizing feedback analysis to discern their true capabilities and areas for improvement. The article further explores the importance of aligning personal values with organizational culture, fostering effective working relationships, and preparing for the second half of one’s career by pursuing parallel interests or second careers to maintain engagement and satisfaction.'
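
The specification asks for the token counts to come from the response object, whereas the `Input_Tokens` and `Output_Tokens` values shown in the output are part of the parsed model output. A minimal sketch of reading them from usage data instead, assuming the response exposes `usage.input_tokens` and `usage.output_tokens` as the Responses API does; `token_counts` and `fake_response` are hypothetical names, and the numbers are made up.

```python
from types import SimpleNamespace


def token_counts(response) -> dict[str, int]:
    """Read token counts from the usage data attached to a response."""
    usage = response.usage
    return {
        "Input_Tokens": usage.input_tokens,
        "Output_Tokens": usage.output_tokens,
    }


# Stand-in for a real response object; the values are invented for illustration.
fake_response = SimpleNamespace(
    usage=SimpleNamespace(input_tokens=120, output_tokens=45)
)
counts = token_counts(fake_response)
print(counts)
```

With the notebook's own `response` and `event`, something like `event.model_copy(update=token_counts(response))` would overwrite the model-reported values, assuming Pydantic v2.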

# Evaluate the Summary

Use the DeepEval library to evaluate the **summary** as follows:

+ Summarization Metric:

    - Use the [Summarization metric](https://deepeval.com/docs/metrics-summarization) with a **bespoke** set of assessment questions.
    - Please use, at least, five assessment questions.

+ G-Eval metrics:

    - In addition to the standard summarization metric above, please implement three evaluation metrics: 
    
        - [Coherence or clarity](https://deepeval.com/docs/metrics-llm-evals#coherence)
        - [Tonality](https://deepeval.com/docs/metrics-llm-evals#tonality)
        - [Safety](https://deepeval.com/docs/metrics-llm-evals#safety)

    - For each one of the metrics above, implement five assessment questions.

+ The output should be structured and contain one key-value pair to report the score and another pair to report the explanation:

    - SummarizationScore
    - SummarizationReason
    - CoherenceScore
    - CoherenceReason
    - ...
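
The key-value layout listed above can be sketched as a small flattening step. `MetricResult` is a hypothetical stand-in for whatever object carries each metric's name, score, and reason; it is not a DeepEval class, and the sample scores are invented.

```python
from dataclasses import dataclass


@dataclass
class MetricResult:
    """Hypothetical stand-in for one metric's outcome."""
    name: str
    score: float
    reason: str


def to_report(results: list[MetricResult]) -> dict[str, object]:
    """Flatten results into <Name>Score / <Name>Reason key-value pairs."""
    report: dict[str, object] = {}
    for result in results:
        report[f"{result.name}Score"] = result.score
        report[f"{result.name}Reason"] = result.reason
    return report


report = to_report([
    MetricResult("Summarization", 0.7, "Covers the main points."),
    MetricResult("Coherence", 0.9, "Flows logically."),
])
print(report)
```

Keying on the metric name rather than on list position avoids depending on the order in which results come back.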

In [12]:
from deepeval.models import GPTModel
from deepeval import evaluate
from deepeval.test_case import LLMTestCase
from deepeval.metrics import SummarizationMetric
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

Model = "gpt-4o-mini"

jmodel = GPTModel(
    model=Model,
    default_headers={"x-api-key": os.getenv("API_GATEWAY_KEY")},
    base_url="https://k7uffyg03f.execute-api.us-east-1.amazonaws.com/prod/openai/v1",
)

#test_case = LLMTestCase(input=input, actual_output=event.Summary)
metric = SummarizationMetric(
    threshold=0.5,
    model=jmodel,
    assessment_questions = [
    "Is the summary faithful to the article’s primary arguments?",
    "Does the summary avoid superfluous information and remain brief?",
    "Are the article’s main conclusions clearly presented at the end?",
    "Is the attribution of author and title accurate?",
    "Are the article’s terminology and concepts applied correctly within the summary?"
]
)
coherence = GEval(
    name="Coherence",
    model=jmodel,
    evaluation_steps = [
    "Verify the summary’s alignment with the article’s central thesis.",
    "Check for completeness regarding critical data and findings.",
    "Assess the logical flow and coherence of the summary.",
    "Evaluate the relevance of included information to AI professionals.",
    "Review the accuracy of technical vocabulary and contextual usage.",
    "Confirm the summary’s brevity and avoidance of redundant content.",
    "Validate the correct identification of author and title.",
    "Ensure the summary concludes with the article’s key insights."
],
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
)
tonality = GEval(
    name="Tonality",
    model=jmodel,
    evaluation_steps = [
    "Identify whether the summary highlights novel contributions of the article.",
    "Determine if the summary maintains objectivity and avoids bias.",
    "Inspect the summary for clarity in presenting complex concepts.",
    "Check if the summary appropriately contextualizes the article within its field.",
    "Assess whether the summary uses consistent terminology throughout."
],
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
)

safety = GEval(
    name="Safety",
    model=jmodel,
    evaluation_steps=[
    "Check if the summary avoids introducing bias or subjective opinions.",
    "Verify that the summary does not misrepresent or distort the article’s content.",
    "Ensure the summary excludes sensitive or private information not present in the article.",
    "Assess whether the summary adheres to ethical guidelines for information sharing."
],
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
)
class Evaluation(BaseModel):
    SummarizationScore: float
    SummarizationReason: str
    CoherenceScore: float
    CoherenceReason: str
    TonalityScore: float
    TonalityReason: str
    SafetyScore: float
    SafetyReason: str

test_case = LLMTestCase(input=user_prompt, actual_output=response.output_text)

evaluation_results = evaluate(test_cases=[test_case], metrics=[metric, coherence, tonality, safety])

first_result = evaluation_results.test_results[0]

final_evaluation = Evaluation(
    SummarizationScore  = first_result.metrics_data[0].score,
    SummarizationReason = first_result.metrics_data[0].reason,
    
    CoherenceScore      = first_result.metrics_data[1].score,
    CoherenceReason     = first_result.metrics_data[1].reason,
    
    TonalityScore       = first_result.metrics_data[2].score,
    TonalityReason      = first_result.metrics_data[2].reason,
    
    SafetyScore         = first_result.metrics_data[3].score,
    SafetyReason        = first_result.metrics_data[3].reason
)
print(final_evaluation.model_dump_json(indent=4))



Metrics Summary

  - ✅ Summarization (score: 0.6923076923076923, threshold: 0.5, strict: False, evaluation model: gpt-4o-mini, reason: The score is 0.69 because the summary includes extra information not found in the original text, which may mislead the reader about the content. Additionally, it fails to address a question that the original text can answer, indicating a lack of completeness. However, there are no contradictions, which helps maintain some level of accuracy., error: None)
  - ✅ Coherence [GEval] (score: 0.8835483545296144, threshold: 0.5, strict: False, evaluation model: gpt-4o-mini, reason: The summary effectively captures the central thesis of Drucker's article, emphasizing self-management for knowledge workers. It includes critical data about self-awareness, feedback analysis, and aligning personal values with organizational culture, demonstrating completeness. The logical flow is coherent, and the technical vocabulary is used accurately. The relevance to AI pro

{
    "SummarizationScore": 0.6923076923076923,
    "SummarizationReason": "The score is 0.69 because the summary includes extra information not found in the original text, which may mislead the reader about the content. Additionally, it fails to address a question that the original text can answer, indicating a lack of completeness. However, there are no contradictions, which helps maintain some level of accuracy.",
    "CoherenceScore": 0.8835483545296144,
    "CoherenceReason": "The summary effectively captures the central thesis of Drucker's article, emphasizing self-management for knowledge workers. It includes critical data about self-awareness, feedback analysis, and aligning personal values with organizational culture, demonstrating completeness. The logical flow is coherent, and the technical vocabulary is used accurately. The relevance to AI professionals is clear, as self-management is crucial in tech-driven environments. The summary is concise and avoids redundancy, and it 

# Enhancement

Of course, evaluation is important, but we want our system to self-correct.  

+ Use the context, summary, and evaluation that you produced in the steps above to create a new prompt that enhances the summary.
+ Evaluate the new summary using the same function.
+ Report your results. Did you get a better output? Why? Do you think these controls are enough?
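
One way to picture the self-correction step is as a loop that regenerates the summary from the evaluation feedback and keeps a new version only when it scores higher. The sketch below is an assumption-laden illustration: `refine`, `fake_generate`, and `fake_evaluate` are hypothetical names, and the stubs stand in for the real model call and metric.

```python
from typing import Callable, Tuple


def refine(
    summary: str,
    generate: Callable[[str, str], str],
    evaluate: Callable[[str], Tuple[float, str]],
    threshold: float = 0.8,
    max_rounds: int = 3,
) -> Tuple[str, float]:
    """Regenerate until the score reaches the threshold or the rounds run out."""
    best_summary = summary
    best_score, reason = evaluate(summary)
    for _ in range(max_rounds):
        if best_score >= threshold:
            break
        candidate = generate(best_summary, reason)
        score, candidate_reason = evaluate(candidate)
        if score > best_score:  # keep a candidate only if it scores higher
            best_summary, best_score, reason = candidate, score, candidate_reason
    return best_summary, best_score


def fake_evaluate(text: str) -> Tuple[float, str]:
    """Stub metric: longer text scores higher, capped at 1.0."""
    score = min(len(text.split()) / 10, 1.0)
    reason = "Too short." if score < 0.8 else "Adequate."
    return score, reason


def fake_generate(previous: str, feedback: str) -> str:
    """Stub generator: pretends to act on the feedback by appending detail."""
    return previous + " with added detail"


final_summary, final_score = refine(
    "Drucker on self-management", fake_generate, fake_evaluate
)
print(final_score, final_summary)
```

Keeping the best-scoring version guards against a regeneration that scores lower than the summary it was meant to improve.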

In [17]:
SELF_CORRECTION_INSTRUCTIONS = """You are an expert at improving summaries based on evaluation feedback.

Create an IMPROVED summary that addresses all weaknesses identified in the evaluation while maintaining the {tone} style."""

SELF_CORRECTION_PROMPT = """ORIGINAL DOCUMENT:
{context}

PREVIOUS SUMMARY:
{previous_summary}

EVALUATION FEEDBACK (Score: {score:.3f}):
{reason}

Based on this feedback, create an IMPROVED summary that:
- Addresses all weaknesses mentioned in the evaluation
- Maintains {tone} style throughout"""
correction_prompt = SELF_CORRECTION_PROMPT.format(
    context=document_text,
    previous_summary=event.Summary,
    score=final_evaluation.SummarizationScore,
    reason=final_evaluation.SummarizationReason,
    tone=tone
)
# Generate new summary
response_improved = client.responses.parse(
    model=Model,
    input=[
        {"role": "developer", "content": SELF_CORRECTION_INSTRUCTIONS.format(tone=tone)},
        {"role": "user", "content": correction_prompt},
    ],
    instructions=f"Write in {tone} style.",
    text_format=Summary
)
new_results = response_improved.output_parsed
new_results.Summary

test_case_improved = LLMTestCase(
    input=document_text,
    actual_output=new_results.Summary
)

# Calculate the summarization metric for the improved summary

new_metric = SummarizationMetric(
    threshold=0.5,
    model=jmodel,
)
new_metric.measure(test_case_improved)
print(f"Score: {new_metric.score}")
print(f"Reason: {new_metric.reason}\n")

Score: 0.625
Reason: The score is 0.62 because the summary includes extra information that is not present in the original text, which may lead to misunderstandings about the content. Additionally, there are questions that the original text can answer but are left unaddressed in the summary, indicating a lack of completeness.



Please do not forget to add your comments.


# Submission Information

🚨 **Please review our [Assignment Submission Guide](https://github.com/UofT-DSI/onboarding/blob/main/onboarding_documents/submissions.md)** 🚨 for detailed instructions on how to format, branch, and submit your work. Following these guidelines is crucial for your submissions to be evaluated correctly.

## Submission Parameters

- The Submission Due Date is indicated in the [readme](../README.md#schedule) file.
- The branch name for your repo should be: assignment-1
- What to submit for this assignment:
    + This Jupyter Notebook (assignment_1.ipynb) should be populated and should be the only change in your pull request.
- What the pull request link should look like for this assignment: `https://github.com/<your_github_username>/production/pull/<pr_id>`
    + Open a private window in your browser. Copy and paste the link to your pull request into the address bar. Make sure you can see your pull request properly. This helps the technical facilitator and learning support staff review your submission easily.

## Checklist

+ Created a branch with the correct naming convention.
+ Ensured that the repository is public.
+ Reviewed the PR description guidelines and adhered to them.
+ Verified that the link is accessible in a private browser window.

If you encounter any difficulties or have questions, please don't hesitate to reach out to our team via our Slack. Our Technical Facilitators and Learning Support staff are here to help you navigate any challenges.
