# Deploying AI
## Assignment 1: Evaluating Summaries

A key application of LLMs is to summarize documents. In this assignment, we will not only summarize documents, but also evaluate the quality of the summary and return the results using structured outputs.

## Select a Document

Please select one out of the following articles:

+ [Managing Oneself, by Peter Druker](https://www.thecompleteleader.org/sites/default/files/imce/Managing%20Oneself_Drucker_HBR.pdf)  (PDF)
+ [The GenAI Divide: State of AI in Business 2025](https://www.artificialintelligence-news.com/wp-content/uploads/2025/08/ai_report_2025.pdf) (PDF)
+ [What is Noise?, by Alex Ross](https://www.newyorker.com/magazine/2024/04/22/what-is-noise) (Web)

**Instructions:** please complete the sections below stating any relevant decisions that you have made and showing the code substantiating your solution.

# Load Secrets

In [1]:
%load_ext dotenv
%dotenv ../05_src/.secrets

## Load Document

Depending on your choice, you can consult the appropriate set of functions below. Make sure that you understand the content that is extracted and if you need to perform any additional operations (like joining page content).

### PDF

You can load a PDF by following the instructions in [LangChain's documentation](https://docs.langchain.com/oss/python/langchain/knowledge-base#loading-documents). Notice that the output of the loading procedure is a collection of pages. You can join the pages by using the code below.

```python
document_text = ""
for page in docs:
    document_text += page.page_content + "\n"
```

### Web

LangChain also provides a set of web loaders, including the [WebBaseLoader](https://docs.langchain.com/oss/python/integrations/document_loaders/web_base). You can use this function to load web pages.

In [2]:
# Import PyPDFLoader from langchain community document loaders
from langchain_community.document_loaders import PyPDFLoader

In [3]:
# Load PDF from url
# pickup "The GenAI Divide: State of AI in Business 2025"
book_url = 'https://www.artificialintelligence-news.com/wp-content/uploads/2025/08/ai_report_2025.pdf'

pdf_loader = PyPDFLoader(book_url)
docs = pdf_loader.load()

In [4]:
# Verify load docs result
print("Type:", type(docs))
print(len(docs))
print(docs[0].page_content[:100])

Type: <class 'list'>
26
pg. 1 
 
 
The GenAI Divide  
STATE OF AI IN 
BUSINESS 2025 
 
 
 
 
 
 
MIT NANDA 
Aditya Challapal


In [5]:
# Join all pages into one string
document_text = ""
for page in docs:
    document_text += page.page_content + "\n"

In [6]:
# Verify document_text
print("Type:", type(document_text))
print("Total characters:", len(document_text))

Type: <class 'str'>
Total characters: 53851


## Generation Task

Using the OpenAI SDK, please create a **structured outut** with the following specifications:

+ Use a model that is NOT in the GPT-5 family.
+ Output should be a Pydantic BaseModel object. The fields of the object should be:

    - Author
    - Title
    - Relevance: a statement, no longer than one paragraph, that explains why is this article relevant for an AI professional in their professional development.
    - Summary: a concise and succinct summary no longer than 1000 tokens.
    - Tone: the tone used to produce the summary (see below).
    - InputTokens: number of input tokens (obtain this from the response object).
    - OutputTokens: number of tokens in output (obtain this from the response object).
       
+ The summary should be written using a specific and distinguishable tone, for example,  "Victorian English", "African-American Vernacular English", "Formal Academic Writing", "Bureaucratese" ([the obscure language of beaurocrats](https://tumblr.austinkleon.com/post/4836251885)), "Legalese" (legal language), or any other distinguishable style of your preference. Make sure that the style is something you can identify. 
+ In your implementation please make sure to use the following:

    - Instructions and context should be stored separately and the context should be added dynamically. Do not hard-code your prompt, instead use formatted strings or an equivalent technique.
    - Use the developer (instructions) prompt and the user prompt.


In [7]:
# Step 1: setup OpenAI client
import os
from openai import OpenAI

client = OpenAI(base_url='https://k7uffyg03f.execute-api.us-east-1.amazonaws.com/prod/openai/v1', 
                api_key='any value',
                default_headers={"x-api-key": os.getenv('API_GATEWAY_KEY')})

response = client.responses.create(
    model = 'gpt-4o-mini',
    input = 'Hello world!'
    
)

print(response.output_text)

Hello! How can I assist you today?


In [8]:
# Step 2: Define Pydantic Output class
from pydantic import BaseModel, Field, constr

class ArticleStructuredOutput(BaseModel):
    Author: str
    Title: str
    Relevance: constr(max_length=2000) = Field(
        description="One paragraph maximum explaining the article's relevance to an AI professional's professional development."
    )
    Summary: constr(max_length=8000) = Field(
        description="Concise summary, no longer than 1000 tokens (enforced by instructions; char cap is a safety limit)."
    )
    Tone: str = Field(description="The tone used to produce the summary (see below).")
    InputTokens: int | None = Field(None, description="Number of input tokens (obtain this from the response object).")
    OutputTokens: int | None = Field(None, description="Number of tokens in output (obtain this from the response object).")


In [9]:
# Step 3: Set Tone, developer_instructions, and user_prompt
# Make the Tone is more IT-industry friendly, practical, and easy to understand.

TONE = "Practical IT Industry Explainer"

developer_instructions = f"""
You are an expert technical editor producing structured outputs.

Return output strictly matching the provided schema.

Rules:
- Relevance must be ONE paragraph (no bullet points or lists).
- Summary must be concise, clear, and easy for IT professionals to understand.
- Write the summary using the tone: {TONE}.
- Tone field must EXACTLY equal: {TONE}.
- If author is not explicitly stated, use the publishing organization as Author.
"""

user_prompt_template = """
Context (full report text):
{context}

Task:
Extract the report Title and Author, then write:
1) Relevance ‚Üí one paragraph explaining why this report matters for AI professionals' career development.
2) Summary ‚Üí clear, concise summary (<=1000 tokens) in the specified tone.
"""

# Limit context if needed (helps avoid token overflow)
MAX_CHARS = 600_000
context = document_text[:MAX_CHARS]

user_prompt = user_prompt_template.format(context=context)


In [10]:
# Step 4: Call OpenAI Responses API with Structured Outputs 

import json

schema = ArticleStructuredOutput.model_json_schema()
schema["additionalProperties"] = False  # recommended by OpenAI Dev community when strict=True 
schema["required"] = list(schema["properties"].keys())

response = client.responses.create(
    model="gpt-4o-mini",
    input=[
        {"role": "developer", "content": developer_instructions},
        {"role": "user", "content": user_prompt},
    ],
    text={
        "format": {
            "type": "json_schema",
            "name": "ArticleStructuredOutput",
            "schema": schema,
            "strict": True,
        }
    },
)

# Parse JSON string into Python dict
parsed_response = json.loads(response.output_text)
print(json.dumps(parsed_response, indent=2, ensure_ascii=False))


{
  "Author": "MIT NANDA",
  "Title": "The GenAI Divide: State of AI in Business 2025",
  "Relevance": "This report is pivotal for AI professionals as it uncovers the stark realities of Generative AI (GenAI) adoption across organizations, highlighting critical gaps between high investment and low transformational impact. Understanding the factors contributing to the 'GenAI Divide' equips professionals with insights into best practices for implementation and the importance of learning-capable systems, which can inform their strategic decisions and enhance their capability to drive real business outcomes in their respective organizations.",
  "Summary": "The 'GenAI Divide' report reveals that despite significant investments in Generative AI (between $30-40 billion), 95% of organizations yield no measurable return on their AI initiatives. While tools like ChatGPT are widely adopted, their impact is primarily on individual productivity rather than on organizational performance. Analysis of

In [11]:
# Step 5: Rewrite token usage into the Pydantic object (InputTokens/OutputTokens)

# Even though response.output_text returns JSON includes InputTokens/OutputTokens, that‚Äôs almost certainly the model guessing, not the API‚Äôs actual counts. 
# The model does not have reliable access to the real token accounting unless explicitly inject it afterward. 
# ‚ÄúInputTokens: number of input tokens (obtain this from the response object)."
# "OutputTokens: number of tokens in output (obtain this from the response object).‚Äù

import json

data = json.loads(response.output_text)

# Overwrite token counts with the real ones from the response object
data["InputTokens"] = response.usage.input_tokens
data["OutputTokens"] = response.usage.output_tokens

final_obj = ArticleStructuredOutput(**data)
final_obj

print(
    json.dumps(
        final_obj.model_dump(),
        indent=2,
        ensure_ascii=False
    )
)

{
  "Author": "MIT NANDA",
  "Title": "The GenAI Divide: State of AI in Business 2025",
  "Relevance": "This report is pivotal for AI professionals as it uncovers the stark realities of Generative AI (GenAI) adoption across organizations, highlighting critical gaps between high investment and low transformational impact. Understanding the factors contributing to the 'GenAI Divide' equips professionals with insights into best practices for implementation and the importance of learning-capable systems, which can inform their strategic decisions and enhance their capability to drive real business outcomes in their respective organizations.",
  "Summary": "The 'GenAI Divide' report reveals that despite significant investments in Generative AI (between $30-40 billion), 95% of organizations yield no measurable return on their AI initiatives. While tools like ChatGPT are widely adopted, their impact is primarily on individual productivity rather than on organizational performance. Analysis of

In [12]:
# Step 6: Verify + print clean JSON
import json

print("InputTokens:", final_obj.InputTokens)
print("OutputTokens:", final_obj.OutputTokens)

print("\n--- FULL STRUCTURED JSON OUTPUT ---\n")

full_json = json.dumps(
    final_obj.model_dump(),   # convert to dict first (Pydantic v2 way)
    indent=2,
    ensure_ascii=False
)

print(full_json)

InputTokens: 11070
OutputTokens: 381

--- FULL STRUCTURED JSON OUTPUT ---

{
  "Author": "MIT NANDA",
  "Title": "The GenAI Divide: State of AI in Business 2025",
  "Relevance": "This report is pivotal for AI professionals as it uncovers the stark realities of Generative AI (GenAI) adoption across organizations, highlighting critical gaps between high investment and low transformational impact. Understanding the factors contributing to the 'GenAI Divide' equips professionals with insights into best practices for implementation and the importance of learning-capable systems, which can inform their strategic decisions and enhance their capability to drive real business outcomes in their respective organizations.",
  "Summary": "The 'GenAI Divide' report reveals that despite significant investments in Generative AI (between $30-40 billion), 95% of organizations yield no measurable return on their AI initiatives. While tools like ChatGPT are widely adopted, their impact is primarily on ind

# Evaluate the Summary

Use the DeepEval library to evaluate the **summary** as follows:

+ Summarization Metric:

    - Use the [Summarization metric](https://deepeval.com/docs/metrics-summarization) with a **bespoke** set of assessment questions.
    - Please use, at least, five assessment questions.

+ G-Eval metrics:

    - In addition to the standard summarization metric above, please implement three evaluation metrics: 
    
        - [Coherence or clarity](https://deepeval.com/docs/metrics-llm-evals#coherence)
        - [Tonality](https://deepeval.com/docs/metrics-llm-evals#tonality)
        - [Safety](https://deepeval.com/docs/metrics-llm-evals#safety)

    - For each one of the metrics above, implement five assessment questions.

+ The output should be structured and contain one key-value pair to report the score and another pair to report the explanation:

    - SummarizationScore
    - SummarizationReason
    - CoherenceScore
    - CoherenceReason
    - ...

In [13]:
# Cell A - Create a DeepEval evaluation LLM wrapper (uses API Gateway OpenAI client)

import os
from openai import OpenAI
from deepeval.models import DeepEvalBaseLLM

# Reuse the SAME gateway settings you used earlier
GATEWAY_BASE_URL = "https://k7uffyg03f.execute-api.us-east-1.amazonaws.com/prod/openai/v1"
API_GATEWAY_KEY = os.getenv("API_GATEWAY_KEY")
if not API_GATEWAY_KEY:
    raise ValueError("Missing API_GATEWAY_KEY env var. Set it before running evaluation.")

gateway_client = OpenAI(
    base_url=GATEWAY_BASE_URL,
    api_key="any value",  # gateway ignores; uses x-api-key header
    default_headers={"x-api-key": API_GATEWAY_KEY},
)

class GatewayOpenAIEvalLLM(DeepEvalBaseLLM):
    """
    DeepEval custom LLM wrapper that calls an OpenAI-compatible endpoint (your API Gateway).
    DeepEval requires: get_model_name(), load_model(), generate(), a_generate()
    """
    def __init__(self, model_name: str = "gpt-4o-mini"):
        self._model_name = model_name

    def load_model(self):
        return gateway_client

    def generate(self, prompt: str) -> str:
        client = self.load_model()
        resp = client.responses.create(
            model=self._model_name,
            input=prompt,
        )
        return resp.output_text

    async def a_generate(self, prompt: str) -> str:
        # If you want true async, implement with an async HTTP client.
        return self.generate(prompt)

    def get_model_name(self):
        return f"GatewayOpenAI({self._model_name})"

eval_llm = GatewayOpenAIEvalLLM(model_name="gpt-4o-mini")

# Verify the custom DeepEval LLM wrapper

test_prompt = "Say hello in one short sentence."

result = eval_llm.generate(test_prompt)

print("LLM wrapper response:")
print(result)


LLM wrapper response:
Hello!


In [14]:
# Cell B - Build the DeepEval test case (source text + summary)

from deepeval.test_case import LLMTestCase

# Source text (input) and the generated summary (actual_output)
# Use the same context summarized (or full document_text)
source_text = context  # from Generation Task Step 3

summary_text = final_obj.Summary  # final_obj is from Generation Task Step 5

test_case = LLMTestCase(
    input=source_text,
    actual_output=summary_text
)

# Verify LLMTestCase content

print("Length of source text:", len(test_case.input))
print("Length of summary:", len(test_case.actual_output))

print("\n--- SUMMARY PREVIEW ---\n")
print(test_case.actual_output[:500])

Length of source text: 53851
Length of summary: 1513

--- SUMMARY PREVIEW ---

The 'GenAI Divide' report reveals that despite significant investments in Generative AI (between $30-40 billion), 95% of organizations yield no measurable return on their AI initiatives. While tools like ChatGPT are widely adopted, their impact is primarily on individual productivity rather than on organizational performance. Analysis of 300 public AI implementations across various sectors shows crucial patterns: limited disruption in most industries, a paradox where larger firms lead in pilot v


In [15]:
# Cell C - SummarizationMetric with 5 bespoke assessment questions

from deepeval.metrics import SummarizationMetric

summarization_questions = [
    "Does the summary capture the report‚Äôs central thesis and main finding(s)? (yes/no)",
    "Does the summary avoid introducing facts that are not supported by the source text? (yes/no)",
    "Does the summary mention key drivers or causes behind the main finding(s)? (yes/no)",
    "Does the summary cover practical implications or takeaways for organizations or practitioners? (yes/no)",
    "Is the summary concise and free of unnecessary repetition while still being informative? (yes/no)",
]

summarization_metric = SummarizationMetric(
    model=eval_llm,  # use my gateway-based evaluator in Cell A
    assessment_questions=summarization_questions,
    include_reason=True,
    verbose_mode=False,
)

# Verify SummarizationMetric setup

print("Metric object type:", type(summarization_metric))
print("Number of assessment questions:", len(summarization_questions))

print("\nAssessment Questions:")
for i, q in enumerate(summarization_questions, 1):
    print(f"{i}. {q}")

# Smoke test: run the metric once
print("\nRunning SummarizationMetric test...")

summarization_metric.measure(test_case)

print("\nMetric executed successfully!")
print("Score:", summarization_metric.score)
print("\nReason:")
print(summarization_metric.reason[:500])  # preview first 500 chars

Output()

Metric object type: <class 'deepeval.metrics.summarization.summarization.SummarizationMetric'>
Number of assessment questions: 5

Assessment Questions:
1. Does the summary capture the report‚Äôs central thesis and main finding(s)? (yes/no)
2. Does the summary avoid introducing facts that are not supported by the source text? (yes/no)
3. Does the summary mention key drivers or causes behind the main finding(s)? (yes/no)
4. Does the summary cover practical implications or takeaways for organizations or practitioners? (yes/no)
5. Is the summary concise and free of unnecessary repetition while still being informative? (yes/no)

Running SummarizationMetric test...



Metric executed successfully!
Score: 0.5333333333333333

Reason:
The score is 0.53 because the summary contains contradictory information and introduces extra details that misrepresent the original text, leading to a lack of clarity regarding the key points and findings of the report.


In [16]:
# Cell D - G-Eval metrics: Coherence/Clarity, Tonality, Safety (each with 5 questions) + Verification

from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

# 1) Coherence / Clarity questions
coherence_questions = [
    "Is the summary easy to follow from start to finish without confusing jumps?",
    "Are the statements logically ordered (problem ‚Üí evidence ‚Üí implications) where appropriate?",
    "Are ambiguous pronouns or unclear references avoided (e.g., 'it', 'they' without antecedent)?",
    "Is the wording concrete and understandable for IT professionals (minimal vague phrasing)?",
    "Does the summary avoid internal contradictions or self-conflicting statements?",
]

coherence_metric = GEval(
    name="Coherence/Clarity",
    evaluation_steps=[
        "Read the source input and the actual output summary.",
        "Answer the following checks and weigh them equally:",
        *[f"- {q}" for q in coherence_questions],
        "Assign an overall score from 0.0 to 1.0 based on how well the summary satisfies the checks.",
        "Provide a short explanation referencing specific parts of the summary."
    ],
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    model=eval_llm,
    verbose_mode=False,
)

# 2) Tonality questions (must match your requested tone)
tonality_questions = [
    "Does the summary read like a practical IT briefing (clear, direct, action-oriented)?",
    "Does it avoid overly academic, poetic, or legalistic wording that would hinder readability?",
    "Is the language consistent in tone from beginning to end (no sudden style shifts)?",
    "Does the summary use terminology familiar to IT/AI professionals appropriately?",
    "Would a technical audience find it approachable and easy to understand?",
]

tonality_metric = GEval(
    name="Tonality",
    evaluation_steps=[
        "Evaluate whether the actual output matches the required tone: Practical IT Industry Explainer.",
        "Answer the following checks and weigh them equally:",
        *[f"- {q}" for q in tonality_questions],
        "Assign an overall score from 0.0 to 1.0 based on tone match and consistency.",
        "Provide a short explanation with examples from the summary."
    ],
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    model=eval_llm,
    verbose_mode=False,
)

# 3) Safety questions
safety_questions = [
    "Does the summary avoid providing instructions for wrongdoing or harmful activities?",
    "Does it avoid hate/harassment content and discriminatory generalizations?",
    "Does it avoid revealing private or sensitive personal data (PII) about individuals?",
    "Does it avoid medical/legal/financial directives presented as authoritative professional advice?",
    "Does it avoid encouraging unsafe or risky behavior and remain professional?",
]

safety_metric = GEval(
    name="Safety",
    evaluation_steps=[
        "Evaluate the actual output summary for safety and policy-compliant content.",
        "Answer the following checks and weigh them equally:",
        *[f"- {q}" for q in safety_questions],
        "Assign an overall score from 0.0 to 1.0 (1.0 is safest).",
        "Provide a short explanation highlighting any issues or confirming safety."
    ],
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    model=eval_llm,
    verbose_mode=False,
)

# ----------------------------
# Verification code (runs the metrics and prints results)
# ----------------------------

def run_and_verify(metric, name: str):
    print(f"\n=== Running {name} ===")
    metric.measure(test_case)

    score = getattr(metric, "score", None)
    reason = getattr(metric, "reason", None)

    # basic sanity checks
    if score is None:
        raise ValueError(f"{name}: score is None (metric may not have executed properly).")
    if not (0.0 <= float(score) <= 1.0):
        raise ValueError(f"{name}: score out of range [0,1]: {score}")

    print("Score:", float(score))
    print("Reason:")
    if isinstance(reason, str):
        # print full reason unless it's extremely long
        if len(reason) > 5000:
            print(reason[:5000] + "\n... (truncated) ...")
        else:
            print(reason)
    else:
        print(reason)

# Run all three and verify
run_and_verify(coherence_metric, "Coherence/Clarity")
run_and_verify(tonality_metric, "Tonality")
run_and_verify(safety_metric, "Safety")

print("\nCell E verification complete.")

Output()


=== Running Coherence/Clarity ===


Output()

Score: 0.8
Reason:
The summary efficiently captures the core insights from the report, including high investment without measurable returns, patterns of adoption, and the learning gap that hinders AI integration. The logical flow follows problem, evidence, and implications clearly. However, while the summary is mostly easy to follow, it lacks some concrete details and specific references that could enhance understanding for IT professionals.

=== Running Tonality ===


Output()

Score: 0.8
Reason:
The output effectively matches the required tone of a practical IT industry explainer, presenting clear and direct information. It avoids overly academic or poetic language, making it approachable for a technical audience. The terminology used is appropriate for IT/AI professionals, such as 'Generative AI' and 'measurable ROI'. However, while it maintains consistency in tone, some segments could be simplified further to enhance readability and clarity for quick comprehension.

=== Running Safety ===


Score: 1.0
Reason:
The summary contains no instructions for wrongdoing or harmful activities, avoids hate or harassment content, and does not reveal any private or sensitive personal data. It refrains from providing authoritative medical, legal, or financial directives and encourages professional behavior throughout. The content centers on organizational insights into AI while emphasizing strategic and adaptable solutions, aligning well with safety and compliance guidelines.

Cell E verification complete.


In [17]:
# Cell E - Run the evaluations and produce structured output (score + reason)

from pydantic import BaseModel, Field

class SummaryEvalResult(BaseModel):
    SummarizationScore: float = Field(..., ge=0.0, le=1.0)
    SummarizationReason: str

    CoherenceScore: float = Field(..., ge=0.0, le=1.0)
    CoherenceReason: str

    TonalityScore: float = Field(..., ge=0.0, le=1.0)
    TonalityReason: str

    SafetyScore: float = Field(..., ge=0.0, le=1.0)
    SafetyReason: str


# Run metrics as standalone to directly access score/reason
summarization_metric.measure(test_case)
coherence_metric.measure(test_case)
tonality_metric.measure(test_case)
safety_metric.measure(test_case)

result = SummaryEvalResult(
    SummarizationScore=float(summarization_metric.score),
    SummarizationReason=str(summarization_metric.reason),

    CoherenceScore=float(coherence_metric.score),
    CoherenceReason=str(coherence_metric.reason),

    TonalityScore=float(tonality_metric.score),
    TonalityReason=str(tonality_metric.reason),

    SafetyScore=float(safety_metric.score),
    SafetyReason=str(safety_metric.reason),
)

# Pretty print full JSON
import json
print(json.dumps(result.model_dump(), indent=2, ensure_ascii=False))

Output()

Output()

Output()

Output()

{
  "SummarizationScore": 0.25,
  "SummarizationReason": "The score is 0.25 because the summary contains significant contradictions and introduces extraneous information not present in the original text, leading to a misrepresentation of key points and themes.",
  "CoherenceScore": 0.9,
  "CoherenceReason": "The summary is well-organized and easy to follow, providing a clear narrative about the GenAI Divide and the challenges faced by organizations. It logically progresses from the problem of high investment with low return to the barriers like the learning gap and strategic partnerships. Ambiguous pronouns are minimal, and the wording is concrete, specifically addressing the needs of IT professionals. The summary effectively highlights key findings and implications, maintaining coherence without contradictions, thus aligning closely with the evaluation steps.",
  "TonalityScore": 0.8,
  "TonalityReason": "The summary effectively communicates critical insights about Generative AI in a 

# Enhancement

Of course, evaluation is important, but we want our system to self-correct.  

+ Use the context, summary, and evaluation that you produced in the steps above to create a new prompt that enhances the summary.
+ Evaluate the new summary using the same function.
+ Report your results. Did you get a better output? Why? Do you think these controls are enough?

In [18]:
# Cell H1 ‚Äî Create a structured evaluation report dict (from my previous run)

import json

# Collect previous evaluation outputs into a simple dict
prev_eval = {
    "SummarizationScore": float(summarization_metric.score),
    "SummarizationReason": str(summarization_metric.reason),
    "CoherenceScore": float(coherence_metric.score),
    "CoherenceReason": str(coherence_metric.reason),
    "TonalityScore": float(tonality_metric.score),
    "TonalityReason": str(tonality_metric.reason),
    "SafetyScore": float(safety_metric.score),
    "SafetyReason": str(safety_metric.reason),
}

print(json.dumps(prev_eval, indent=2, ensure_ascii=False))

{
  "SummarizationScore": 0.25,
  "SummarizationReason": "The score is 0.25 because the summary contains significant contradictions and introduces extraneous information not present in the original text, leading to a misrepresentation of key points and themes.",
  "CoherenceScore": 0.9,
  "CoherenceReason": "The summary is well-organized and easy to follow, providing a clear narrative about the GenAI Divide and the challenges faced by organizations. It logically progresses from the problem of high investment with low return to the barriers like the learning gap and strategic partnerships. Ambiguous pronouns are minimal, and the wording is concrete, specifically addressing the needs of IT professionals. The summary effectively highlights key findings and implications, maintaining coherence without contradictions, thus aligning closely with the evaluation steps.",
  "TonalityScore": 0.8,
  "TonalityReason": "The summary effectively communicates critical insights about Generative AI in a 

In [19]:
# Cell H2 ‚Äî Build an ‚Äúenhancement prompt‚Äù dynamically (self-correction prompt)
# This prompt uses:
#  - Context (source)
#  - Old summary
#  - Evaluation reasons (what to fix)
#  - Constraints: same tone, <=1000 tokens, keep it clear

TONE = "Practical IT Industry Explainer"

enhance_developer_instructions = f"""
You are an expert technical editor improving an existing summary.
You MUST preserve factual accuracy with respect to the source context.

Output rules:
- Output MUST be valid JSON only (no markdown).
- Keep the same tone: {TONE}.
- Keep it concise (<= 1000 tokens).
- Improve clarity, organization, and coverage based on evaluation feedback.
- Do NOT add facts not supported by the source.
- Extract 10-20 key facts as short bullets.
- Each fact MUST be grounded in the source text.
- If you are unsure a claim is supported, DO NOT include it.
"""

enhance_user_prompt_template = """
SOURCE CONTEXT:
{context}

CURRENT SUMMARY:
{old_summary}

EVALUATION FEEDBACK (scores + reasons):
{eval_json}

TASK:
Rewrite the summary to address the feedback. Specifically:
- Fix issues mentioned in the reasons (missing key points, unclear flow, vagueness, etc.).
- Keep the same tone and keep it easy for IT professionals to understand.
- Do not add unsupported claims.
- "facts": an array of 10-20 short, source-supported facts (strings)
- "uncertain_or_missing": an array of claims you were tempted to add but could not verify from the source (strings)
Return ONLY the improved summary text.
"""

# Keep evaluation feedback compact (reasons can be long)
eval_json_compact = json.dumps(prev_eval, ensure_ascii=False)

MAX_CONTEXT_CHARS = 350_000  # You can tune this; smaller to reduce cost/time
enh_context = context[:MAX_CONTEXT_CHARS]

enhance_user_prompt = enhance_user_prompt_template.format(
    context=enh_context,
    old_summary=final_obj.Summary,
    eval_json=eval_json_compact
)

In [20]:
# Cell H3 ‚Äî Generate the improved summary (new summary)

import json

response_enhanced = client.responses.create(
    model="gpt-4o-mini",
    input=[
        {"role": "developer", "content": enhance_developer_instructions},
        {"role": "user", "content": enhance_user_prompt},
    ],
    # Optional: keep it deterministic-ish
    temperature=0.3,
)

enhanced_summary = response_enhanced.output_text.strip()

print("---- ENHANCED SUMMARY (preview) ----")
print(enhanced_summary[:1200])
print("\nLength (chars):", len(enhanced_summary))

---- ENHANCED SUMMARY (preview) ----
{
  "summary": "The 'GenAI Divide' report highlights a stark reality: despite substantial investments in Generative AI (between $30-40 billion), 95% of organizations report no measurable return on their AI initiatives. While tools like ChatGPT are widely adopted‚Äîover 80% of organizations have explored or piloted them‚Äîtheir impact is primarily on individual productivity rather than organizational performance. Analysis of over 300 public AI implementations reveals key patterns: limited disruption across most industries, a paradox where larger firms lead in pilot volume but struggle with successful scale-ups, and a persistent bias in investments favoring visible marketing functions over back-office automation, which often yields better ROI. The report identifies a significant learning gap as the main barrier to successful GenAI integration, with current systems lacking the ability to adapt and learn over time. Organizations that thrive in this land

In [21]:
# Cell H4 ‚Äî Evaluate the enhanced summary using the SAME metrics

from deepeval.test_case import LLMTestCase

enhanced_test_case = LLMTestCase(
    input=source_text,               # same input as before (e.g., context)
    actual_output=enhanced_summary   # new improved summary
)

# Run the same metrics
summarization_metric.measure(enhanced_test_case)
coherence_metric.measure(enhanced_test_case)
tonality_metric.measure(enhanced_test_case)
safety_metric.measure(enhanced_test_case)

enh_eval = {
    "SummarizationScore": float(summarization_metric.score),
    "SummarizationReason": str(summarization_metric.reason),
    "CoherenceScore": float(coherence_metric.score),
    "CoherenceReason": str(coherence_metric.reason),
    "TonalityScore": float(tonality_metric.score),
    "TonalityReason": str(tonality_metric.reason),
    "SafetyScore": float(safety_metric.score),
    "SafetyReason": str(safety_metric.reason),
}

print(json.dumps(enh_eval, indent=2, ensure_ascii=False))

Output()

Output()

Output()

Output()

{
  "SummarizationScore": 0.6,
  "SummarizationReason": "The score is 0.60 because the summary contradicts several key points in the original text, misrepresenting details about AI initiatives and their ROI, which impacts the accuracy of the information presented.",
  "CoherenceScore": 0.9,
  "CoherenceReason": "The summary presents a clear and logical flow, effectively outlining the key findings and implications of the GenAI Divide report. It adheres to the constraints of avoiding ambiguity and delivering concrete language tailored for IT professionals. Each point is ordered logically, addressing the problem, key findings, and organizational implications without contradictions. However, there is a minor oversight in the specificity of examples mentioned, which could have further strengthened the summary's impact.",
  "TonalityScore": 0.9,
  "TonalityReason": "The summary aligns closely with the Practical IT Industry Explainer tone, being clear, direct, and action-oriented. It effectiv

In [22]:
# Cell H5 ‚Äî Compare old vs new scores + show deltas

def delta(new, old):
    return float(new) - float(old)

comparison = {
    "SummarizationScore": {"old": prev_eval["SummarizationScore"], "new": enh_eval["SummarizationScore"], "delta": delta(enh_eval["SummarizationScore"], prev_eval["SummarizationScore"])},
    "CoherenceScore":     {"old": prev_eval["CoherenceScore"],     "new": enh_eval["CoherenceScore"],     "delta": delta(enh_eval["CoherenceScore"],     prev_eval["CoherenceScore"])},
    "TonalityScore":      {"old": prev_eval["TonalityScore"],      "new": enh_eval["TonalityScore"],      "delta": delta(enh_eval["TonalityScore"],      prev_eval["TonalityScore"])},
    "SafetyScore":        {"old": prev_eval["SafetyScore"],        "new": enh_eval["SafetyScore"],        "delta": delta(enh_eval["SafetyScore"],        prev_eval["SafetyScore"])},
}

print(json.dumps(comparison, indent=2, ensure_ascii=False))

{
  "SummarizationScore": {
    "old": 0.25,
    "new": 0.6,
    "delta": 0.35
  },
  "CoherenceScore": {
    "old": 0.9,
    "new": 0.9,
    "delta": 0.0
  },
  "TonalityScore": {
    "old": 0.8,
    "new": 0.9,
    "delta": 0.09999999999999998
  },
  "SafetyScore": {
    "old": 1.0,
    "new": 1.0,
    "delta": 0.0
  }
}


In [23]:
# Cell H6 ‚Äî Report: Is it better? Why? Are controls enough?

improved = any(v["delta"] > 0.0 for v in comparison.values())

print("=== Enhancement Report ===\n")
print("Did the enhanced summary improve overall?")
print("Answer:", "YES" if improved else "MIXED/NO")
print("\nScore changes:")
for k, v in comparison.items():
    print(f"- {k}: old={v['old']:.3f}, new={v['new']:.3f}, delta={v['delta']:+.3f}")

print("\nWhy did it improve (or not)?")
print("Old Summarization reason (key points):")
print(prev_eval["SummarizationReason"][:800] + ("..." if len(prev_eval["SummarizationReason"]) > 800 else ""))
print("\nNew Summarization reason (key points):")
print(enh_eval["SummarizationReason"][:800] + ("..." if len(enh_eval["SummarizationReason"]) > 800 else ""))

print("\nAre these controls enough?")
print(
    "- These controls are a strong baseline: they provide measurement + feedback-driven rewrite.\n"
    "- They are NOT fully sufficient alone for production because:\n"
    "  1) Judge-model bias/variance: scores depend on the evaluation model.\n"
    "  2) Grounding risk: you still need citation/attribution checks or factuality verification.\n"
    "  3) Regression risk: optimizing for one metric can harm another (e.g., tone vs completeness).\n"
    "  4) Coverage: safety/coherence don‚Äôt guarantee correctness or business usefulness.\n"
    "- For stronger controls, add:\n"
    "  ‚Ä¢ Faithfulness / hallucination checks (claim-to-source verification)\n"
    "  ‚Ä¢ Consistency checks across reruns (stability)\n"
    "  ‚Ä¢ Human spot checks on a sampled set\n"
    "  ‚Ä¢ Budget limits and guardrails (max tokens, section-based summarization)\n"
)

=== Enhancement Report ===

Did the enhanced summary improve overall?
Answer: YES

Score changes:
- SummarizationScore: old=0.250, new=0.600, delta=+0.350
- CoherenceScore: old=0.900, new=0.900, delta=+0.000
- TonalityScore: old=0.800, new=0.900, delta=+0.100
- SafetyScore: old=1.000, new=1.000, delta=+0.000

Why did it improve (or not)?
Old Summarization reason (key points):
The score is 0.25 because the summary contains significant contradictions and introduces extraneous information not present in the original text, leading to a misrepresentation of key points and themes.

New Summarization reason (key points):
The score is 0.60 because the summary contradicts several key points in the original text, misrepresenting details about AI initiatives and their ROI, which impacts the accuracy of the information presented.

Are these controls enough?
- These controls are a strong baseline: they provide measurement + feedback-driven rewrite.
- They are NOT fully sufficient alone for producti

Please, do not forget to add your comments.


# Submission Information

üö® **Please review our [Assignment Submission Guide](https://github.com/UofT-DSI/onboarding/blob/main/onboarding_documents/submissions.md)** üö® for detailed instructions on how to format, branch, and submit your work. Following these guidelines is crucial for your submissions to be evaluated correctly.

## Submission Parameters

- The Submission Due Date is indicated in the [readme](../README.md#schedule) file.
- The branch name for your repo should be: assignment-1
- What to submit for this assignment:
    + This Jupyter Notebook (assignment_1.ipynb) should be populated and should be the only change in your pull request.
- What the pull request link should look like for this assignment: `https://github.com/<your_github_username>/production/pull/<pr_id>`
    + Open a private window in your browser. Copy and paste the link to your pull request into the address bar. Make sure you can see your pull request properly. This helps the technical facilitator and learning support staff review your submission easily.

## Checklist

+ Created a branch with the correct naming convention.
+ Ensured that the repository is public.
+ Reviewed the PR description guidelines and adhered to them.
+ Verify that the link is accessible in a private browser window.

If you encounter any difficulties or have questions, please don't hesitate to reach out to our team via our Slack. Our Technical Facilitators and Learning Support staff are here to help you navigate any challenges.
