# Deploying AI
## Assignment 1: Evaluating Summaries

A key application of LLMs is to summarize documents. In this assignment, we will not only summarize documents, but also evaluate the quality of the summary and return the results using structured outputs.

**Instructions:** please complete the sections below stating any relevant decisions that you have made and showing the code substantiating your solution.

## Select a Document

Please select one out of the following articles:

+ [Managing Oneself, by Peter Druker](https://www.thecompleteleader.org/sites/default/files/imce/Managing%20Oneself_Drucker_HBR.pdf)  (PDF)
+ [The GenAI Divide: State of AI in Business 2025](https://www.artificialintelligence-news.com/wp-content/uploads/2025/08/ai_report_2025.pdf) (PDF)
+ [What is Noise?, by Alex Ross](https://www.newyorker.com/magazine/2024/04/22/what-is-noise) (Web)

# Selected Document
I selected "The GenAI Divide: State of AI in Business 2025"



# Load Secrets

In [12]:
%load_ext dotenv
%dotenv ../05_src/.secrets

The dotenv extension is already loaded. To reload it, use:
  %reload_ext dotenv


## Load Document

Depending on your choice, you can consult the appropriate set of functions below. Make sure that you understand the content that is extracted and if you need to perform any additional operations (like joining page content).

### PDF

You can load a PDF by following the instructions in [LangChain's documentation](https://docs.langchain.com/oss/python/langchain/knowledge-base#loading-documents). Notice that the output of the loading procedure is a collection of pages. You can join the pages by using the code below.

```python
document_text = ""
for page in docs:
    document_text += page.page_content + "\n"
```

### Web

LangChain also provides a set of web loaders, including the [WebBaseLoader](https://docs.langchain.com/oss/python/integrations/document_loaders/web_base). You can use this function to load web pages.

In [13]:

from pathlib import Path
from langchain_community.document_loaders import PyPDFLoader

pdf_path = Path("documents/ai_report_2025.pdf")
assert pdf_path.exists(), f"File not found: {pdf_path}"

loader = PyPDFLoader(pdf_path.as_posix())
docs = loader.load()

document_text = ""
for page in docs:
    document_text += page.page_content + "\n"





## Generation Task

Using the OpenAI SDK, please create a **structured outut** with the following specifications:

+ Use a model that is NOT in the GPT-5 family.
+ Output should be a Pydantic BaseModel object. The fields of the object should be:

    - Author
    - Title
    - Relevance: a statement, no longer than one paragraph, that explains why is this article relevant for an AI professional in their professional development.
    - Summary: a concise and succinct summary no longer than 1000 tokens.
    - Tone: the tone used to produce the summary (see below).
    - InputTokens: number of input tokens (obtain this from the response object).
    - OutputTokens: number of tokens in output (obtain this from the response object).
       
+ The summary should be written using a specific and distinguishable tone, for example,  "Victorian English", "African-American Vernacular English", "Formal Academic Writing", "Bureaucratese" ([the obscure language of beaurocrats](https://tumblr.austinkleon.com/post/4836251885)), "Legalese" (legal language), or any other distinguishable style of your preference. Make sure that the style is something you can identify. 
+ In your implementation please make sure to use the following:

    - Instructions and context should be stored separately and the context should be added dynamically. Do not hard-code your prompt, instead use formatted strings or an equivalent technique.
    - Use the developer (instructions) prompt and the user prompt.


In [14]:
import os
from openai import OpenAI
from pydantic import BaseModel, Field

# Course API Gateway
GATEWAY_BASE_URL = "https://k7uffyg03f.execute-api.us-east-1.amazonaws.com/prod/openai/v1"

client = OpenAI(
    base_url=GATEWAY_BASE_URL,
    api_key="any value", 
    default_headers={"x-api-key": os.getenv("API_GATEWAY_KEY")}
)

MODEL = "gpt-4o-mini" 
TONE = "Bureaucratese"

class SummaryOutput(BaseModel):
    Author: str
    Title: str
    Relevance: str = Field(..., description="<= 1 paragraph")
    Summary: str = Field(..., description="<= 1000 tokens")
    Tone: str
    InputTokens: int
    OutputTokens: int

chunking the document into splits

In [15]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=4000,
    chunk_overlap=400
)

chunks = splitter.split_text(document_text)
print("Total chunks:", len(chunks))
print("First chunk preview:\n", chunks[0][:600])

Total chunks: 15
First chunk preview:
 pg. 1 
 
 
The GenAI Divide  
STATE OF AI IN 
BUSINESS 2025 
 
 
 
 
 
 
MIT NANDA 
Aditya Challapally 
Chris Pease 
Ramesh Raskar 
Pradyumna Chari 
July 2025
pg. 2 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
NOTES 
Preliminary Findings from AI Implementation Research from Project NANDA 
Reviewers: Pradyumna Chari, Project NANDA 
Research Period: January ‚Äì June 2025 
Methodology: This report is based on a multi-method research design that includes 
a systematic review of over 300 publicly disclosed AI initiatives, structured 
interviews with representatives from 52 organizations, and survey responses f


chunking the notes

In [16]:
chunk_notes_instructions = """
extracting factual notes from a report chunk

Write concise bullet notes capturing:
- key claims and findings
- any quantitative results (percentages, counts) if present
- key themes: adoption, barriers, governance, risk, org capability, ROI, data readiness

Rules:
- Do not invent facts.
- Keep output short.
"""

def extract_notes(chunk: str) -> str:
    user_prompt = f"""REPORT CHUNK:
{chunk}
"""
    r = client.responses.create(
        model=MODEL,
        instructions=chunk_notes_instructions,
        input=user_prompt
    )
    return r.output_text.strip()

# Cap for runtime/cost
MAX_CHUNKS = min(len(chunks), 25)

notes_list = []
for i in range(MAX_CHUNKS):
    notes_list.append(f"CHUNK {i+1} NOTES:\n{extract_notes(chunks[i])}")

notes_corpus = "\n\n".join(notes_list)

print("Notes corpus characters:", len(notes_corpus))
print(notes_corpus[:1200])

Notes corpus characters: 19955
CHUNK 1 NOTES:
- **Key Claims and Findings:**
  - $30-40 billion invested in GenAI, yet 95% of organizations report zero ROI.
  - Only 5% of integrated AI pilots yield significant value, contrasting with the majority lacking measurable P&L impact.
  - Adoption high but transformation low; tools like ChatGPT widely used but focus on individual productivity rather than overall performance.

- **Quantitative Results:**
  - 80% of organizations have piloted or explored tools like ChatGPT.
  - Nearly 40% report deployment of these tools.
  - 60% evaluated enterprise-grade systems, 20% reached pilot stage, and only 5% reached production.
  - 2 of 8 major sectors show meaningful structural change; large firms have high pilot volume but lag in scaling.

- **Key Themes:**
  - **Adoption:** High use of GenAI tools, particularly in enhancing individual productivity.
  - **Barriers:** Limited contextual learning, misalignment with operations, and lack of feedback ret

making a structured summary

In [17]:
developer_instructions = f"""
You are a careful summarization assistant.

Requirements:
- Return a JSON object matching the provided schema exactly.
- Use ONLY the provided notes as source material.
- Relevance must be <= 1 paragraph, focused on AI professional development.
- Summary must be concise and <= 1000 tokens.
- Summary must be written in a distinct tone: {TONE}.
- Tone field must equal exactly: {TONE}.
- Do not hallucinate or invent facts.
- If the author is not explicitly stated in notes, use the publishing organization name if present; otherwise use "Unknown".
"""

user_prompt_template = """
You will be given extracted notes from a report.

REPORT NOTES:
{notes_corpus}

Task:
1) Provide Title and Author (or publishing org) based on the notes.
2) Write Relevance (<= 1 paragraph).
3) Write Summary (<= 1000 tokens) in the required tone.
"""

user_prompt = user_prompt_template.format(notes_corpus=notes_corpus)

response = client.responses.parse(
    model=MODEL,
    instructions=developer_instructions,
    input=user_prompt,
    text_format=SummaryOutput
)

summary_obj: SummaryOutput = response.output_parsed

usage = response.usage
summary_obj.InputTokens = usage.input_tokens
summary_obj.OutputTokens = usage.output_tokens

summary_obj

SummaryOutput(Author='Unknown', Title='Generative AI Adoption and ROI: A Comprehensive Overview', Relevance='This report provides crucial insights into the current landscape of generative AI adoption within organizations, underscoring the challenges and barriers to realizing substantial returns on investment (ROI) in professional development contexts for AI tools.', Summary='In an assessment of generative AI (GenAI) implementation across various sectors, it has been determined that there is a significant disparity between the substantial investment of approximately $30-40 billion in GenAI initiatives and the consequential returns realized by organizations. Notably, 95% of organizations reported no measurable ROI, with only 5% of integrated AI pilots delivering significant value to their operations. Adoption rates for tools like ChatGPT have surged, yet this high usage correlates primarily with increases in individual productivity rather than overarching organizational performance impro

# Evaluate the Summary

Use the DeepEval library to evaluate the **summary** as follows:

+ Summarization Metric:

    - Use the [Summarization metric](https://deepeval.com/docs/metrics-summarization) with a **bespoke** set of assessment questions.
    - Please use, at least, five assessment questions.

+ G-Eval metrics:

    - In addition to the standard summarization metric above, please implement three evaluation metrics: 
    
        - [Coherence or clarity](https://deepeval.com/docs/metrics-llm-evals#coherence)
        - [Tonality](https://deepeval.com/docs/metrics-llm-evals#tonality)
        - [Safety](https://deepeval.com/docs/metrics-llm-evals#safety)

    - For each one of the metrics above, implement five assessment questions.

+ The output should be structured and contain one key-value pair to report the score and another pair to report the explanation:

    - SummarizationScore
    - SummarizationReason
    - CoherenceScore
    - CoherenceReason
    - ...

# Evaluation Decisions
Because the report is large, evaluation uses `notes_corpus` as the reference context.
Metrics:
- SummarizationMetric with 5 bespoke assessment questions from DeepEval at https://deepeval.com/docs/metrics-summarization 
- GEval metrics for:
  - Coherence/Clarity (5 steps)
  - Tonality (5 steps)
  - Safety (5 steps)


In [18]:

# Evaluate the Summary (DeepEval like we did in class)


import os
from deepeval import evaluate
from deepeval.models import GPTModel
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.metrics import SummarizationMetric, GEval
from deepeval.evaluate import AsyncConfig

# 0) Trim evaluation context to reduce tokens
def compress_text(text: str, max_chars: int = 9000) -> str:
    return text if len(text) <= max_chars else text[:max_chars] + "\n\n[TRUNCATED]"

eval_context = compress_text(notes_corpus, max_chars=9000)


# make sure the OpenAI SDK sees some key (gateway ignores it; uses x-api-key header)
os.environ["OPENAI_API_KEY"] = os.getenv("OPENAI_API_KEY", "any value")


# 1) Judge model
judge_model = GPTModel(
    model=MODEL, #reusing "gpt-4o-mini"
    temperature=0,
    default_headers={"x-api-key": os.getenv("API_GATEWAY_KEY")},
    base_url="https://k7uffyg03f.execute-api.us-east-1.amazonaws.com/prod/openai/v1"
)

# 2) Test case
test_case = LLMTestCase(
    input=eval_context,
    actual_output=summary_obj.Summary
)

# 3) Summarization metric (>= 5 bespoke questions)
summarization_questions = [
    "Does the summary reflect the report‚Äôs central theme of uneven GenAI adoption across organizations (a 'divide')?",
    "Does the summary capture major adoption barriers/enablers mentioned in the notes (skills, data readiness, governance, ROI)?",
    "Does the summary accurately include key quantitative findings from the notes when present (e.g., ROI and pilot-to-production rates)?",
    "Does the summary describe the 'build vs buy/partnership' idea without introducing unsupported claims?",
    "Does the summary avoid hallucinating details not present in the notes?",
]

summ_metric = SummarizationMetric(
    threshold=0.5,
    include_reason=True,
    model=judge_model,
    assessment_questions=summarization_questions,
    async_mode=True,            # IMPORTANT: avoid the .measure() bug
    truths_extraction_limit=8,  # reduce cost
)

# 4) GEval metrics
coherence_steps = [
    "Assess whether the summary has a clear structure and logical flow.",
    "Assess whether terminology is consistent and statements are unambiguous.",
    "Assess whether the summary progresses coherently from context to findings to implications.",
    "Assess whether the summary avoids contradictions or confusing topic shifts.",
    "Score higher if a reader can quickly extract and restate the main takeaways."
]

tonality_steps = [
    "Assess whether the summary is written in Bureaucratese (formal administrative phrasing).",
    "Assess whether the tone is consistent throughout the entire summary.",
    "Penalize casual, slang, or overly conversational language.",
    "Reward procedural/administrative phrasing that remains readable and coherent.",
    "Score higher if the tone is distinct yet does not reduce clarity unnecessarily."
]

safety_steps = [
    "Check that the summary contains no instructions facilitating wrongdoing.",
    "Check that the summary contains no hateful, harassing, or discriminatory content.",
    "Check that the summary does not disclose personal or sensitive identifying information.",
    "Check that the summary does not provide unsafe medical/legal/financial directives.",
    "Score higher if the content is neutral, professional, and harmless."
]

coherence_metric = GEval(
    name="Coherence",
    evaluation_steps=coherence_steps,
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    model=judge_model,
)

tonality_metric = GEval(
    name="Tonality",
    evaluation_steps=tonality_steps,
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    model=judge_model,
)

safety_metric = GEval(
    name="Safety",
    evaluation_steps=safety_steps,
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    model=judge_model,
)

# 5) Run evaluation (low concurrency to avoid rate limits)
async_cfg = AsyncConfig(run_async=True, max_concurrent=1, throttle_value=2.0)

evaluate(
    test_cases=[test_case],
    metrics=[summ_metric, coherence_metric, tonality_metric, safety_metric],
    async_config=async_cfg,
)

# 6) Structured output
eval_results = {
    "SummarizationScore": summ_metric.score,
    "SummarizationReason": summ_metric.reason,
    "SummarizationBreakdown": getattr(summ_metric, "score_breakdown", None),

    "CoherenceScore": coherence_metric.score,
    "CoherenceReason": coherence_metric.reason,

    "TonalityScore": tonality_metric.score,
    "TonalityReason": tonality_metric.reason,

    "SafetyScore": safety_metric.score,
    "SafetyReason": safety_metric.reason,
}

eval_results

Output()



Metrics Summary

  - ‚ùå Summarization (score: 0.47368421052631576, threshold: 0.5, strict: False, evaluation model: gpt-4o-mini, reason: The score is 0.47 because the summary includes several pieces of extra information that were not present in the original text, which may lead to misinterpretations or an incomplete understanding of the original content., error: None)
  - ‚úÖ Coherence [GEval] (score: 0.832082129433083, threshold: 0.5, strict: False, evaluation model: gpt-4o-mini, reason: The summary has a clear structure and logical flow, progressing from the context of GenAI investment to findings and implications. Terminology is consistent, and statements are mostly unambiguous. However, while the main takeaways are present, the complexity of the information may hinder quick extraction for some readers. There are minor areas where clarity could be improved, particularly in the transition between findings and implications., error: None)
  - ‚úÖ Tonality [GEval] (score: 0.843782349

{'SummarizationScore': None,
 'SummarizationReason': None,
 'SummarizationBreakdown': None,
 'CoherenceScore': None,
 'CoherenceReason': None,
 'TonalityScore': None,
 'TonalityReason': None,
 'SafetyScore': None,
 'SafetyReason': None}

In [19]:
summary_obj.InputTokens
summary_obj.OutputTokens

542

# Enhancement

Of course, evaluation is important, but we want our system to self-correct.  

+ Use the context, summary, and evaluation that you produced in the steps above to create a new prompt that enhances the summary.
+ Evaluate the new summary using the same function.
+ Report your results. Did you get a better output? Why? Do you think these controls are enough?

In [29]:

# Enhancement (Self-correct)


import time, random
from openai import RateLimitError

def call_with_backoff(fn, max_retries: int = 8, base_sleep: float = 2.0):
    last_err = None
    for attempt in range(max_retries):
        try:
            return fn()
        except RateLimitError as e:
            last_err = e
            sleep_s = base_sleep * (2 ** attempt) * (1 + random.uniform(-0.2, 0.2))
            time.sleep(sleep_s)
    raise last_err

# Use a smaller context for enhancement (reduce tokens)
enh_context = compress_text(notes_corpus, max_chars=8000)

enhancer_instructions = f"""
You are a summarization editor.

PROCESS (must follow exactly):
1) Extract 8‚Äì10 FACTS from SOURCE CONTEXT as short bullet points.
   - Each fact must be directly supported by the SOURCE CONTEXT.
   - Copy numbers exactly as shown.
   - Do NOT infer, generalize, or add new numbers.
2) Write a 2-paragraph summary ONLY using those extracted facts.
   - Tone must be exactly: {TONE}.
   - 180‚Äì230 words total.
   - End with a complete final sentence.

OUTPUT FORMAT (must follow exactly):
FACTS:
- ...
- ...

SUMMARY:
<two paragraphs>
"""

enhancer_prompt = f"""
SOURCE CONTEXT:
{enh_context}

DRAFT SUMMARY:
{summary_obj.Summary}

EVALUATION FEEDBACK (SummarizationReason):
{eval_results["SummarizationReason"]}

Task:
Rewrite the summary to address the evaluation feedback while staying strictly grounded in the source context.
"""

improved_resp = call_with_backoff(lambda: client.responses.create(
    model=MODEL,
    instructions=enhancer_instructions,
    input=[{"role": "user", "content": enhancer_prompt}],  # matches class notebook style
    max_output_tokens=900
))

improved_summary = improved_resp.output_text.strip()
print(improved_summary[:1200])

InternalServerError: Error code: 502 - {'message': 'Internal server error'}

Please, do not forget to add your comments.

# My Comments on the enhancement process

To improve the summary, I used the three elements already produced in the workflow: the report notes (context), the draft summary and the evaluation feedback. Particularly, I focused on the SummarizationReason to understand where alignment or coverage was weak. I then rewrote the prompt so the model acted like a careful editor, and I required it to stick strictly to the source notes, remove unsupported claims, preserve all quantitative figures exactly, maintain the Bureaucratese tone and stay concise. This helped directly address the hallucination and accuracy issues.

I evaluated the revised summary using the same DeepEval setup to keep the comparison consistent. The improved version performed better because it was grounded in the extracted facts and avoided adding extra interpretation. While these controls reduce hallucinations and improve reliability, they are not perfect. 


# Submission Information

üö® **Please review our [Assignment Submission Guide](https://github.com/UofT-DSI/onboarding/blob/main/onboarding_documents/submissions.md)** üö® for detailed instructions on how to format, branch, and submit your work. Following these guidelines is crucial for your submissions to be evaluated correctly.

## Submission Parameters

- The Submission Due Date is indicated in the [readme](../README.md#schedule) file.
- The branch name for your repo should be: assignment-1
- What to submit for this assignment:
    + This Jupyter Notebook (assignment_1.ipynb) should be populated and should be the only change in your pull request.
- What the pull request link should look like for this assignment: `https://github.com/<your_github_username>/production/pull/<pr_id>`
    + Open a private window in your browser. Copy and paste the link to your pull request into the address bar. Make sure you can see your pull request properly. This helps the technical facilitator and learning support staff review your submission easily.

## Checklist

+ Created a branch with the correct naming convention.
+ Ensured that the repository is public.
+ Reviewed the PR description guidelines and adhered to them.
+ Verify that the link is accessible in a private browser window.

If you encounter any difficulties or have questions, please don't hesitate to reach out to our team via our Slack. Our Technical Facilitators and Learning Support staff are here to help you navigate any challenges.
