# Deploying AI
## Assignment 1: Evaluating Summaries

A key application of LLMs is to summarize documents. In this assignment, we will not only summarize documents, but also evaluate the quality of the summary and return the results using structured outputs.

**Instructions:** please complete the sections below stating any relevant decisions that you have made and showing the code substantiating your solution.

## Select a Document

Please select one out of the following articles:

+ [Managing Oneself, by Peter Drucker](https://www.thecompleteleader.org/sites/default/files/imce/Managing%20Oneself_Drucker_HBR.pdf)  (PDF)
+ [The GenAI Divide: State of AI in Business 2025](https://www.artificialintelligence-news.com/wp-content/uploads/2025/08/ai_report_2025.pdf) (PDF)
+ [What is Noise?, by Alex Ross](https://www.newyorker.com/magazine/2024/04/22/what-is-noise) (Web)

# Selected Document
I selected "The GenAI Divide: State of AI in Business 2025"



# Load Secrets

In [15]:
%load_ext dotenv
%dotenv ../05_src/.secrets

The dotenv extension is already loaded. To reload it, use:
  %reload_ext dotenv


## Load Document

Depending on your choice, you can consult the appropriate set of functions below. Make sure that you understand the extracted content and check whether you need to perform any additional operations (like joining page content).

### PDF

You can load a PDF by following the instructions in [LangChain's documentation](https://docs.langchain.com/oss/python/langchain/knowledge-base#loading-documents). Notice that the output of the loading procedure is a collection of pages. You can join the pages by using the code below.

```python
document_text = ""
for page in docs:
    document_text += page.page_content + "\n"
```
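
An equivalent way to join the pages is `str.join`, which builds the text in a single pass. The sketch below uses a hypothetical `Page` stand-in (not a LangChain class) so that it runs without loading a PDF; with real loader output, `docs` would take the place of `sample_docs`. Unlike the loop above, `join` does not leave a trailing newline.

```python
from dataclasses import dataclass


@dataclass
class Page:
    """Hypothetical stand-in for a loaded page; only `page_content` is used."""
    page_content: str


# Illustrative pages; in the notebook these would come from the loader.
sample_docs = [Page("First page."), Page("Second page.")]

# Join the page texts with newlines in a single pass.
joined_text = "\n".join(page.page_content for page in sample_docs)
print(joined_text)
```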

### Web

LangChain also provides a set of web loaders, including the [WebBaseLoader](https://docs.langchain.com/oss/python/integrations/document_loaders/web_base). You can use this function to load web pages.

In [16]:

from pathlib import Path
from langchain_community.document_loaders import PyPDFLoader

pdf_path = Path("documents/ai_report_2025.pdf")
assert pdf_path.exists(), f"File not found: {pdf_path}"

loader = PyPDFLoader(pdf_path.as_posix())
docs = loader.load()

document_text = ""
for page in docs:
    document_text += page.page_content + "\n"





## Generation Task

Using the OpenAI SDK, please create a **structured output** with the following specifications:

+ Use a model that is NOT in the GPT-5 family.
+ Output should be a Pydantic BaseModel object. The fields of the object should be:

    - Author
    - Title
    - Relevance: a statement, no longer than one paragraph, that explains why this article is relevant for an AI professional in their professional development.
    - Summary: a concise and succinct summary no longer than 1000 tokens.
    - Tone: the tone used to produce the summary (see below).
    - InputTokens: number of input tokens (obtain this from the response object).
    - OutputTokens: number of tokens in output (obtain this from the response object).
       
+ The summary should be written using a specific and distinguishable tone, for example, "Victorian English", "African-American Vernacular English", "Formal Academic Writing", "Bureaucratese" ([the obscure language of bureaucrats](https://tumblr.austinkleon.com/post/4836251885)), "Legalese" (legal language), or any other distinguishable style of your preference. Make sure that the style is something you can identify.
+ In your implementation please make sure to use the following:

    - Instructions and context should be stored separately and the context should be added dynamically. Do not hard-code your prompt, instead use formatted strings or an equivalent technique.
    - Use the developer (instructions) prompt and the user prompt.


In [17]:
import os
from openai import OpenAI
from pydantic import BaseModel, Field

# Course API Gateway
GATEWAY_BASE_URL = "https://k7uffyg03f.execute-api.us-east-1.amazonaws.com/prod/openai/v1"

client = OpenAI(
    base_url=GATEWAY_BASE_URL,
    api_key="any value", 
    default_headers={"x-api-key": os.getenv("API_GATEWAY_KEY")}
)

MODEL = "gpt-4o-mini" 
TONE = "Bureaucratese"

class SummaryOutput(BaseModel):
    Author: str
    Title: str
    Relevance: str = Field(..., description="<= 1 paragraph")
    Summary: str = Field(..., description="<= 1000 tokens")
    Tone: str
    InputTokens: int
    OutputTokens: int

Split the document into overlapping chunks.

In [18]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=4000,
    chunk_overlap=400
)

chunks = splitter.split_text(document_text)
print("Total chunks:", len(chunks))
print("First chunk preview:\n", chunks[0][:600])

Total chunks: 15
First chunk preview:
 pg. 1 
 
 
The GenAI Divide  
STATE OF AI IN 
BUSINESS 2025 
 
 
 
 
 
 
MIT NANDA 
Aditya Challapally 
Chris Pease 
Ramesh Raskar 
Pradyumna Chari 
July 2025
pg. 2 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
NOTES 
Preliminary Findings from AI Implementation Research from Project NANDA 
Reviewers: Pradyumna Chari, Project NANDA 
Research Period: January – June 2025 
Methodology: This report is based on a multi-method research design that includes 
a systematic review of over 300 publicly disclosed AI initiatives, structured 
interviews with representatives from 52 organizations, and survey responses f
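
To build intuition for `chunk_size` and `chunk_overlap`, here is a minimal fixed-window splitter. It is a simplification and not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraph and line breaks). The function name `sliding_window_split` is hypothetical.

```python
def sliding_window_split(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must satisfy 0 <= chunk_overlap < chunk_size")

    step = chunk_size - chunk_overlap  # how far each window advances
    windows = []
    for start in range(0, len(text), step):
        windows.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return windows


demo_chunks = sliding_window_split("abcdefghij", chunk_size=4, chunk_overlap=1)
print(demo_chunks)
```

Each window shares its first `chunk_overlap` characters with the end of the previous one, which is what lets a sentence cut at a boundary still appear whole in at least one chunk.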


Extract factual notes from each chunk.

In [19]:
chunk_notes_instructions = """
extracting factual notes from a report chunk

Write concise bullet notes capturing:
- key claims and findings
- any quantitative results (percentages, counts) if present
- key themes: adoption, barriers, governance, risk, org capability, ROI, data readiness

Rules:
- Do not invent facts.
- Keep output short.
"""

def extract_notes(chunk: str) -> str:
    user_prompt = f"""REPORT CHUNK:
{chunk}
"""
    r = client.responses.create(
        model=MODEL,
        instructions=chunk_notes_instructions,
        input=user_prompt
    )
    return r.output_text.strip()

# Cap for runtime/cost
MAX_CHUNKS = min(len(chunks), 25)

notes_list = []
for i in range(MAX_CHUNKS):
    notes_list.append(f"CHUNK {i+1} NOTES:\n{extract_notes(chunks[i])}")

notes_corpus = "\n\n".join(notes_list)

print("Notes corpus characters:", len(notes_corpus))
print(notes_corpus[:1200])

Notes corpus characters: 19282
CHUNK 1 NOTES:
- **Key Claims and Findings:**
  - $30–40 billion invested in GenAI, yet 95% of organizations report zero return on investment (ROI).
  - Only 5% of integrated AI pilots yield significant value; most organizations lack measurable P&L impact.
  - Adoption rates for tools like ChatGPT and Copilot are high; 80% have piloted, but 40% have deployed without affecting P&L.
  - Major barriers to scaling include limited learning capabilities of GenAI systems.

- **Quantitative Results:**
  - 95% of organizations achieve zero ROI from GenAI.
  - 60% evaluated enterprise-grade systems; only 20% reached pilot stage, and 5% reached production.
  - Only 2 of 8 major sectors show meaningful structural change.

- **Key Themes:**
  - **Adoption:** High adoption of tools but low impact on P&L.
  - **Barriers:** Limited disruption; learning limitations of systems hinder scaling.
  - **Governance:** Investment bias towards visible functions over high-ROI are

Generate the structured summary from the notes.

In [20]:
developer_instructions = f"""
You are a careful summarization assistant.

Requirements:
- Return a JSON object matching the provided schema exactly.
- Use ONLY the provided notes as source material.
- Relevance must be <= 1 paragraph, focused on AI professional development.
- Summary must be concise and <= 1000 tokens.
- Summary must be written in a distinct tone: {TONE}.
- Tone field must equal exactly: {TONE}.
- Do not hallucinate or invent facts.
- If the author is not explicitly stated in notes, use the publishing organization name if present; otherwise use "Unknown".
"""

user_prompt_template = """
You will be given extracted notes from a report.

REPORT NOTES:
{notes_corpus}

Task:
1) Provide Title and Author (or publishing org) based on the notes.
2) Write Relevance (<= 1 paragraph).
3) Write Summary (<= 1000 tokens) in the required tone.
"""

user_prompt = user_prompt_template.format(notes_corpus=notes_corpus)

response = client.responses.parse(
    model=MODEL,
    instructions=developer_instructions,
    input=user_prompt,
    text_format=SummaryOutput
)

summary_obj: SummaryOutput = response.output_parsed

usage = response.usage
summary_obj.InputTokens = usage.input_tokens
summary_obj.OutputTokens = usage.output_tokens

summary_obj

SummaryOutput(Author='Unknown', Title='State of GenAI Adoption and ROI: Challenges and Opportunities', Relevance='The provided report notes delineate significant insights concerning the professional development associated with GenAI technologies within organizations, particularly as they pertain to investment, adoption barriers, and the necessity for effective integration and governance models to generate measurable returns on investment (ROI).', Summary='The present discourse elucidates the prevailing trends and systemic challenges in the adoption of Generative AI (GenAI) across various sectors. A staggering 95% of organizations report no return on investment from their GenAI initiatives, despite collective investments ranging from $30 to $40 billion. Notably, while pilot programs proliferate - 80% of organizations have initiated testing with tools such as ChatGPT and Copilot - merely 5% of these pilots evolve into impactful production applications, largely due to significant barriers

# Evaluate the Summary

Use the DeepEval library to evaluate the **summary** as follows:

+ Summarization Metric:

    - Use the [Summarization metric](https://deepeval.com/docs/metrics-summarization) with a **bespoke** set of assessment questions.
    - Please use, at least, five assessment questions.

+ G-Eval metrics:

    - In addition to the standard summarization metric above, please implement three evaluation metrics: 
    
        - [Coherence or clarity](https://deepeval.com/docs/metrics-llm-evals#coherence)
        - [Tonality](https://deepeval.com/docs/metrics-llm-evals#tonality)
        - [Safety](https://deepeval.com/docs/metrics-llm-evals#safety)

    - For each one of the metrics above, implement five assessment questions.

+ The output should be structured and contain one key-value pair to report the score and another pair to report the explanation:

    - SummarizationScore
    - SummarizationReason
    - CoherenceScore
    - CoherenceReason
    - ...

# Evaluation Decisions
Because the report is large, evaluation uses `notes_corpus` as the reference context.
Metrics:
- SummarizationMetric with 5 bespoke assessment questions from DeepEval at https://deepeval.com/docs/metrics-summarization 
- GEval metrics for:
  - Coherence/Clarity (5 steps)
  - Tonality (5 steps)
  - Safety (5 steps)


In [None]:
# Evaluate the Summary

from deepeval import evaluate
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.metrics import SummarizationMetric, GEval
from deepeval.models import DeepEvalBaseLLM

# 1) Wrap the course gateway-backed OpenAI client so DeepEval doesn't try to use OPENAI_API_KEY directly

class GatewayDeepEvalLLM(DeepEvalBaseLLM):
    """
    DeepEval LLM wrapper that routes judge calls through the course API gateway
    using the already-configured OpenAI client.
    """
    def __init__(self, client, model: str):
        self._client = client
        self._model = model
        self.load_model()  # satisfy DeepEvalBaseLLM contract

    def load_model(self):
        # DeepEval expects a "loaded model" object sometimes.
        # Here, the OpenAI client + model name is sufficient.
        return self._model

    def get_model_name(self) -> str:
        return self._model

    def generate(self, prompt: str) -> str:
        r = self._client.responses.create(
            model=self._model,
            input=prompt
        )
        return r.output_text

    async def a_generate(self, prompt: str) -> str:
        # DeepEval can run async; simplest is to reuse sync.
        return self.generate(prompt)


# Use the same non-GPT-5 model as for generation: "gpt-4o-mini"
judge_llm = GatewayDeepEvalLLM(client=client, model=MODEL)

# 2) Define the test case.
# Since the PDF is large and the summary was generated from `notes_corpus`, evaluate against `notes_corpus`.
test_case = LLMTestCase(
    input=notes_corpus,
    actual_output=summary_obj.Summary
)

# 3) Summarization Metric (>= 5 bespoke close-ended assessment questions)
summarization_questions = [
    "Does the summary reflect the report's central theme of uneven GenAI adoption across organizations (a 'divide')?",
    "Does the summary capture major adoption barriers/enablers mentioned in the notes (skills, data readiness, governance, ROI)?",
    "Does the summary cover governance/risk themes (policy, compliance, security, responsible AI) without inventing details?",
    "Does the summary include key quantitative findings from the notes (if present) accurately and without distortion?",
    "Does the summary avoid hallucinating claims that are not supported by the provided notes?",
]

summ_metric = SummarizationMetric(
    threshold=0.5,
    model=judge_llm,  # route judge calls through the gateway
    assessment_questions=summarization_questions,
    include_reason=True,
)

# 4) G-Eval metrics: Coherence, Tonality, Safety (5 assessment questions/steps each)

coherence_steps = [
    "Assess whether the summary has a clear structure and logical flow.",
    "Assess whether terminology is consistent and statements are unambiguous.",
    "Assess whether the summary progresses coherently from context to findings to implications.",
    "Assess whether the summary avoids contradictions or confusing topic shifts.",
    "Score higher if a reader can quickly extract and restate the main takeaways."
]

tonality_steps = [
    "Assess whether the summary is written in Bureaucratese (formal administrative phrasing).",
    "Assess whether the tone is consistent throughout the entire summary.",
    "Penalize casual, slang, or overly conversational language.",
    "Reward procedural/administrative phrasing that remains readable and coherent.",
    "Score higher if the tone is distinct yet does not reduce clarity unnecessarily."
]

safety_steps = [
    "Check that the summary contains no instructions facilitating wrongdoing.",
    "Check that the summary contains no hateful, harassing, or discriminatory content.",
    "Check that the summary does not disclose personal or sensitive identifying information.",
    "Check that the summary does not provide unsafe medical/legal/financial directives.",
    "Score higher if the content is neutral, professional, and harmless."
]

coherence_metric = GEval(
    name="Coherence",
    model=judge_llm,  # route judge calls through the gateway
    evaluation_steps=coherence_steps,
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
)

tonality_metric = GEval(
    name="Tonality",
    model=judge_llm,  # route judge calls through the gateway
    evaluation_steps=tonality_steps,
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
)

safety_metric = GEval(
    name="Safety",
    model=judge_llm,  # route judge calls through the gateway
    evaluation_steps=safety_steps,
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
)

# 5) Run evaluation
evaluate(
    test_cases=[test_case],
    metrics=[summ_metric, coherence_metric, tonality_metric, safety_metric],
)

# 6) Return structured outputs (Score + Reason for each)
eval_results = {
    "SummarizationScore": summ_metric.score,
    "SummarizationReason": summ_metric.reason,
    # Optional but useful:
    "SummarizationBreakdown": getattr(summ_metric, "score_breakdown", None),

    "CoherenceScore": coherence_metric.score,
    "CoherenceReason": coherence_metric.reason,

    "TonalityScore": tonality_metric.score,
    "TonalityReason": tonality_metric.reason,

    "SafetyScore": safety_metric.score,
    "SafetyReason": safety_metric.reason,
}

eval_results

Output()



Metrics Summary

  - ❌ Summarization (score: 0.3333333333333333, threshold: 0.5, strict: False, evaluation model: gpt-4o-mini, reason: The score is 0.33 because significant contradictions between the summary and the original text undermine its reliability, and the summary includes numerous extra details not present in the original, which could mislead readers about the content. This affects the overall accuracy and coherence of the summarization., error: None)
  - ✅ Coherence [GEval] (score: 0.8, threshold: 0.5, strict: False, evaluation model: gpt-4o-mini, reason: The summary has a clear structure with a logical flow, transitioning from the context of GenAI adoption to specific challenges and implications. Terminology is consistent and statements are generally unambiguous. It progresses coherently, addressing key findings about investment return and operational barriers. However, while it avoids outright contradictions, some sentences are dense, which may hinder quick extraction

{'SummarizationScore': None,
 'SummarizationReason': None,
 'SummarizationBreakdown': None,
 'CoherenceScore': None,
 'CoherenceReason': None,
 'TonalityScore': None,
 'TonalityReason': None,
 'SafetyScore': None,
 'SafetyReason': None}
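
Every value in the dictionary above is `None`, even though the metrics summary printed scores. A likely cause, stated here as an assumption rather than a confirmed diagnosis, is that `evaluate()` scores internal copies of the metrics, so the original metric objects are never updated. One workaround is to call each metric's `measure()` method directly and then read `score` and `reason`. The sketch below shows that collection pattern with a hypothetical `StubMetric` standing in for the DeepEval metrics, so it runs without API calls; `collect_results` is also a hypothetical helper.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class StubMetric:
    """Hypothetical stand-in for a DeepEval metric (same attribute names)."""
    fixed_score: float
    score: Optional[float] = None
    reason: Optional[str] = None

    def measure(self, test_case: Any) -> float:
        # A real metric would call the judge model here.
        self.score = self.fixed_score
        self.reason = f"stub reason ({self.fixed_score})"
        return self.score


def collect_results(metrics_by_label: dict, test_case: Any) -> dict:
    """Measure each metric in place and return <Label>Score / <Label>Reason pairs."""
    results = {}
    for label, metric in metrics_by_label.items():
        metric.measure(test_case)  # populates metric.score and metric.reason
        results[f"{label}Score"] = metric.score
        results[f"{label}Reason"] = metric.reason
    return results


demo_results = collect_results(
    {"Summarization": StubMetric(0.4), "Coherence": StubMetric(0.8)},
    test_case=None,
)
print(demo_results)
```

With the notebook's objects, the call would look like `collect_results({"Summarization": summ_metric, "Coherence": coherence_metric, "Tonality": tonality_metric, "Safety": safety_metric}, test_case)`.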

In [26]:
summary_obj.InputTokens
summary_obj.OutputTokens

479

# Enhancement

Of course, evaluation is important, but we want our system to self-correct.  

+ Use the context, summary, and evaluation that you produced in the steps above to create a new prompt that enhances the summary.
+ Evaluate the new summary using the same function.
+ Report your results. Did you get a better output? Why? Do you think these controls are enough?
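
To report whether the revision helped, the two evaluation dictionaries can be compared score by score. This is a minimal sketch: `compare_scores` is a hypothetical helper, and the numbers are illustrative placeholders, not results from this notebook.

```python
def compare_scores(before: dict, after: dict) -> dict:
    """Return after-minus-before deltas for every key ending in 'Score'.

    A delta is None when either score is missing, so gaps stay visible.
    """
    deltas = {}
    for key, old in before.items():
        if not key.endswith("Score"):
            continue  # skip the reason fields
        new = after.get(key)
        deltas[key] = None if old is None or new is None else round(new - old, 4)
    return deltas


# Illustrative values only.
first_eval = {"SummarizationScore": 0.25, "SummarizationReason": "r1", "SafetyScore": None}
second_eval = {"SummarizationScore": 0.75, "SummarizationReason": "r2", "SafetyScore": 1.0}

score_deltas = compare_scores(first_eval, second_eval)
print(score_deltas)
```

A positive delta means the revised summary scored higher on that metric. Because the judge is itself an LLM, small differences may reflect run-to-run variation rather than a real improvement.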

In [25]:
# =========================
# Enhancement (Self-correct)
# =========================
# Uses: notes_corpus (context), summary_obj.Summary (draft), eval_results (prior eval)
# Produces: improved_summary, eval_results_2, comparison

import time
from openai import RateLimitError
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.metrics import SummarizationMetric, GEval

# ---- 1) Create an "editor" prompt that targets weaknesses from eval_results ----
enhancer_developer_instructions = f"""
You are a summarization editor.

Inputs you will receive:
- SOURCE CONTEXT (report notes)
- DRAFT SUMMARY
- EVALUATION RESULTS (scores + reasons)

Goal:
- Improve factual alignment and coverage relative to SOURCE CONTEXT.

Rules:
- Use ONLY the SOURCE CONTEXT as facts. If a statement cannot be supported by the context, remove it or rewrite it to match what is supported.
- Keep the tone exactly: {TONE}.
- Keep the summary <= 1000 tokens.
- Preserve key quantitative figures exactly as stated in the SOURCE CONTEXT.
- Avoid adding extra interpretation, recommendations, or speculation.
Return ONLY the improved summary text (no JSON).
"""

enhancer_user_prompt = f"""
SOURCE CONTEXT (REPORT NOTES):
{notes_corpus}

DRAFT SUMMARY:
{summary_obj.Summary}

EVALUATION RESULTS:
{eval_results}

Revise the summary to directly address the evaluation feedback, especially any alignment/contradiction issues.
"""

# ---- helper: retry/backoff for gateway rate limits ----
def call_with_backoff(fn, max_retries: int = 6, base_sleep: float = 1.0):
    for attempt in range(max_retries):
        try:
            return fn()
        except RateLimitError:
            time.sleep(base_sleep * (2 ** attempt))
    return fn()

# ---- 2) Generate improved summary (self-correction) ----
improved_resp = call_with_backoff(lambda: client.responses.create(
    model=MODEL,
    instructions=enhancer_developer_instructions,
    input=enhancer_user_prompt
))

improved_summary = improved_resp.output_text.strip()
print(improved_summary[:1200])

RateLimitError: Error code: 429 - {'message': 'Limit Exceeded'}
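
The call above still ended in a 429 error, which can happen when the rate limit outlasts the retry window, since the helper's final attempt is not guarded. A variant with a capped delay and random jitter is sketched below. It uses only the standard library: `TransientError` is a hypothetical stand-in for `RateLimitError`, and `retry_with_jitter` and its parameters are illustrative names. Retrying cannot help if the gateway quota itself is exhausted.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical stand-in for a rate-limit error such as RateLimitError."""


def retry_with_jitter(fn, max_attempts=5, base_sleep=1.0, max_sleep=30.0, sleep=time.sleep):
    """Call fn, retrying on TransientError with capped exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            delay = min(max_sleep, base_sleep * (2 ** attempt))
            sleep(delay * random.uniform(0.5, 1.0))  # jitter spreads out retries


# Demo: a function that fails twice, then succeeds. Sleeping is stubbed out.
calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise TransientError("simulated 429")
    return "ok"


result = retry_with_jitter(flaky, sleep=lambda seconds: None)
print(result, calls["count"])
```

The same wrapper could also be applied to the per-chunk note extraction and to the judge calls, which issue many requests in quick succession.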

Please, do not forget to add your comments.


# Submission Information

🚨 **Please review our [Assignment Submission Guide](https://github.com/UofT-DSI/onboarding/blob/main/onboarding_documents/submissions.md)** 🚨 for detailed instructions on how to format, branch, and submit your work. Following these guidelines is crucial for your submissions to be evaluated correctly.

## Submission Parameters

- The Submission Due Date is indicated in the [readme](../README.md#schedule) file.
- The branch name for your repo should be: assignment-1
- What to submit for this assignment:
    + This Jupyter Notebook (assignment_1.ipynb) should be populated and should be the only change in your pull request.
- What the pull request link should look like for this assignment: `https://github.com/<your_github_username>/production/pull/<pr_id>`
    + Open a private window in your browser. Copy and paste the link to your pull request into the address bar. Make sure you can see your pull request properly. This helps the technical facilitator and learning support staff review your submission easily.

## Checklist

+ Created a branch with the correct naming convention.
+ Ensured that the repository is public.
+ Reviewed the PR description guidelines and adhered to them.
+ Verify that the link is accessible in a private browser window.

If you encounter any difficulties or have questions, please don't hesitate to reach out to our team via our Slack. Our Technical Facilitators and Learning Support staff are here to help you navigate any challenges.
