# Deploying AI
## Assignment 1: Evaluating Summaries

A key application of LLMs is to summarize documents. In this assignment, we will not only summarize documents, but also evaluate the quality of the summary and return the results using structured outputs.

**Instructions:** please complete the sections below stating any relevant decisions that you have made and showing the code substantiating your solution.

## Select a Document

Please select one out of the following articles:

+ [Managing Oneself, by Peter Drucker](https://www.thecompleteleader.org/sites/default/files/imce/Managing%20Oneself_Drucker_HBR.pdf) (PDF)
+ [The GenAI Divide: State of AI in Business 2025](https://www.artificialintelligence-news.com/wp-content/uploads/2025/08/ai_report_2025.pdf) (PDF)
+ [What is Noise?, by Alex Ross](https://www.newyorker.com/magazine/2024/04/22/what-is-noise) (Web)

# Load Secrets

In [1]:
%load_ext dotenv
%dotenv ../05_src/.secrets

## Load Document

Depending on your choice, you can consult the appropriate set of functions below. Make sure that you understand the extracted content and whether you need to perform any additional operations (such as joining page content).

### PDF

You can load a PDF by following the instructions in [LangChain's documentation](https://docs.langchain.com/oss/python/langchain/knowledge-base#loading-documents). Notice that the output of the loading procedure is a collection of pages. You can join the pages by using the code below.

```python
document_text = ""
for page in docs:
    document_text += page.page_content + "\n"
```
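
The same join can be written as a small reusable function. The sketch below is illustrative only: `Page` is a hypothetical stand-in for a loaded page so that the example runs without any loader, and the only attribute it assumes is `page_content`. The helper name `join_pages` is also an assumption, not part of LangChain.

```python
from dataclasses import dataclass


@dataclass
class Page:
    """Hypothetical stand-in for a loaded page; only `page_content` is assumed."""
    page_content: str


def join_pages(pages, separator="\n"):
    """Concatenate page texts, appending the separator after every page."""
    return "".join(page.page_content + separator for page in pages)


docs = [Page("First page."), Page("Second page.")]
document_text = join_pages(docs)
print(repr(document_text))
```

Building the string in a single `join` avoids repeated string concatenation in the loop, which can matter for long documents.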

### Web

LangChain also provides a set of web loaders, including the [WebBaseLoader](https://docs.langchain.com/oss/python/integrations/document_loaders/web_base). You can use this function to load web pages.

In [33]:
# Install dependencies (run once)
%pip install -qU langchain-community pypdf

# Load secrets / API key
import getpass
import os

os.environ["LANGSMITH_TRACING"] = "true"
os.environ["LANGSMITH_API_KEY"] = getpass.getpass()

# Load a PDF
from langchain_community.document_loaders import PyPDFLoader

file_path = "ai_report_2025.pdf"  
loader = PyPDFLoader(file_path)
docs = loader.load()

# Inspect results
print(f"Number of pages loaded: {len(docs)}\n")
print(f"First 200 characters of first page:\n{docs[0].page_content[:200]}\n")
print(f"Metadata of first page:\n{docs[0].metadata}")

# Join pages into single document text
document_text = ""
for page in docs:
    document_text += page.page_content + "\n"

print(f"\nTotal document length: {len(document_text)} characters")




Number of pages loaded: 26

First 200 characters of first page:
pg. 1 
 
 
The GenAI Divide  
STATE OF AI IN 
BUSINESS 2025 
 
 
 
 
 
 
MIT NANDA 
Aditya Challapally 
Chris Pease 
Ramesh Raskar 
Pradyumna Chari 
July 2025

Metadata of first page:
{'producer': 'Microsoft® Word for Microsoft 365', 'creator': 'Microsoft® Word for Microsoft 365', 'creationdate': '2025-07-13T21:18:19-07:00', 'msip_label_87867195-f2b8-4ac2-b0b6-6bb73cb33afc_siteid': '72f988bf-86f1-41af-91ab-2d7cd011db47', 'msip_label_87867195-f2b8-4ac2-b0b6-6bb73cb33afc_method': 'Privileged', 'msip_label_87867195-f2b8-4ac2-b0b6-6bb73cb33afc_enabled': 'True', 'author': 'Aditya Challapally', 'moddate': '2025-07-13T21:18:19-07:00', 'source': 'ai_report_2025.pdf', 'total_pages': 26, 'page': 0, 'page_label': '1'}

Total document length: 53851 characters


## Generation Task

Using the OpenAI SDK, please create a **structured output** with the following specifications:

+ Use a model that is NOT in the GPT-5 family.
+ Output should be a Pydantic BaseModel object. The fields of the object should be:

    - Author
    - Title
    - Relevance: a statement, no longer than one paragraph, that explains why this article is relevant for an AI professional in their professional development.
    - Summary: a concise and succinct summary no longer than 1000 tokens.
    - Tone: the tone used to produce the summary (see below).
    - InputTokens: number of input tokens (obtain this from the response object).
    - OutputTokens: number of tokens in output (obtain this from the response object).
       
+ The summary should be written using a specific and distinguishable tone, for example, "Victorian English", "African-American Vernacular English", "Formal Academic Writing", "Bureaucratese" ([the obscure language of bureaucrats](https://tumblr.austinkleon.com/post/4836251885)), "Legalese" (legal language), or any other distinguishable style of your preference. Make sure that the style is something you can identify.
+ In your implementation please make sure to use the following:

    - Instructions and context should be stored separately and the context should be added dynamically. Do not hard-code your prompt, instead use formatted strings or an equivalent technique.
    - Use the developer (instructions) prompt and the user prompt.
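
One way to keep instructions and context separate is to store each as a template and fill in the values at call time. The sketch below is a minimal illustration under stated assumptions: the names `INSTRUCTIONS`, `USER_TEMPLATE`, and `build_messages` are hypothetical, the template wording is only an example, and whether a given model accepts the `developer` role (rather than `system`) should be checked against the API documentation.

```python
# Instructions (developer prompt) and context (user prompt) live in separate templates.
INSTRUCTIONS = (
    "You are a summarization assistant. "
    "Write the summary in the following tone: {tone}."
)

USER_TEMPLATE = (
    "Summarize the document between the tags.\n"
    "<document>\n{document}\n</document>"
)


def build_messages(document, tone):
    """Return a developer message and a user message with the context filled in."""
    return [
        {"role": "developer", "content": INSTRUCTIONS.format(tone=tone)},
        {"role": "user", "content": USER_TEMPLATE.format(document=document)},
    ]


messages = build_messages("Example article text.", "Legalese")
for message in messages:
    print(message["role"], "->", message["content"][:60])
```

Because only the templates are passed through `str.format`, braces that appear inside the document text itself are inserted unchanged.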


In [None]:
# Install the libraries
# Run once 
# %pip install -qU openai pydantic pypdf

# Imports
from openai import OpenAI
from pydantic import BaseModel
from pypdf import PdfReader
import json

# Client + model configuration
client = OpenAI()  # Uses OPENAI_API_KEY from env
GEN_MODEL = "gpt-4o"

# Pydantic schema
class ArticleSummary(BaseModel):
    Author: str
    Title: str
    Relevance: str
    Summary: str
    Tone: str
    InputTokens: int
    OutputTokens: int

# Load PDF document
def load_pdf(filepath):
    """Extract text from PDF file"""
    reader = PdfReader(filepath)
    return "\n".join([page.extract_text() for page in reader.pages])

# Load your document
document_text = load_pdf("ai_report_2025.pdf")

# Developer instructions (system prompt)
developer_instructions = (
    "You are a high-precision summarization assistant. "
    "Return only valid JSON containing EXACTLY these keys: "
    '"Author", "Title", "Relevance", "Summary", "Tone", "InputTokens", "OutputTokens". '
    "Follow the user's tone instruction exactly. "
    "'Relevance' should be one paragraph. "
    "'Summary' should be concise and no longer than 1000 tokens."
)

# Configuration (change these as needed)
chosen_tone = "Formal Academic Writing"  # or 'Victorian English', 'Legalese', etc.
article_title = "ai_report_2025.pdf"
article_author = "Aditya Challapally"

# Build user prompt dynamically
user_prompt = f"""
Article author: {article_author}
Article title: {article_title}
Tone for the SUMMARY: {chosen_tone}

Produce a valid JSON object with these keys:
- Author: the article author
- Title: the article title
- Relevance: one paragraph explaining why this article is relevant for an AI professional's professional development
- Summary: concise summary (max 1000 tokens) written in {chosen_tone}
- Tone: the tone you used (string)
- InputTokens: set to 0 (will be updated)
- OutputTokens: set to 0 (will be updated)

ARTICLE START
{document_text}
ARTICLE END

Output only the JSON object, nothing else.
"""

# Call OpenAI Chat Completions API
resp = client.chat.completions.create(
    model=GEN_MODEL,
    messages=[
        {"role": "system", "content": developer_instructions},
        {"role": "user", "content": user_prompt}
    ],
    response_format={"type": "json_object"},  # Enforces JSON output
    max_tokens=1200,
    temperature=0.2,
)

# Extract response
model_text = resp.choices[0].message.content
input_tokens = resp.usage.prompt_tokens
output_tokens = resp.usage.completion_tokens

print("Raw model output preview:\n", model_text[:800])
print(f"\nTokens - Input: {input_tokens}, Output: {output_tokens}")

# Parse JSON
parsed_json = json.loads(model_text)

# Override token counts with actual usage
parsed_json["InputTokens"] = input_tokens
parsed_json["OutputTokens"] = output_tokens

# Validate with Pydantic
article_summary = ArticleSummary(**parsed_json)

print("\nFinal ArticleSummary:")
print(article_summary.model_dump_json(indent=2))

# Save to file
with open("ai_report_2025_initial_summary.json", "w", encoding="utf-8") as f:
    f.write(article_summary.model_dump_json(indent=2))
    
print("\nSaved ai_report_2025_initial_summary.json")









Raw model output preview:
 ```json
{
    "Author": "Aditya Challapally",
    "Title": "ai_report_2025.pdf",
    "Relevance": "This article is highly relevant for AI professionals as it provides a comprehensive analysis of the current state and future trajectory of AI implementation in business. It identifies the critical challenges and opportunities associated with the GenAI Divide, offering insights into why many AI initiatives fail to deliver expected returns. By understanding these dynamics, AI professionals can better navigate the complexities of AI adoption, enhance their strategic decision-making, and align their efforts with successful practices that bridge the divide between high adoption and meaningful transformation.",
    "Summary": "The report 'State of AI in Business 2025' by Aditya Challapally and othe

Initial ArticleSummary (pydantic):
{
  "Author": "Aditya Challapally",
  "Title": "ai_report_2025.pdf",
 

# Evaluate the Summary

Use the DeepEval library to evaluate the **summary** as follows:

+ Summarization Metric:

    - Use the [Summarization metric](https://deepeval.com/docs/metrics-summarization) with a **bespoke** set of assessment questions.
    - Please use, at least, five assessment questions.

+ G-Eval metrics:

    - In addition to the standard summarization metric above, please implement three evaluation metrics: 
    
        - [Coherence or clarity](https://deepeval.com/docs/metrics-llm-evals#coherence)
        - [Tonality](https://deepeval.com/docs/metrics-llm-evals#tonality)
        - [Safety](https://deepeval.com/docs/metrics-llm-evals#safety)

    - For each one of the metrics above, implement five assessment questions.

+ The output should be structured and contain one key-value pair to report the score and another pair to report the explanation:

    - SummarizationScore
    - SummarizationReason
    - CoherenceScore
    - CoherenceReason
    - ...
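
The key naming above can be produced mechanically from per-metric results. The following is a small sketch, not a required implementation: `build_evaluation_report` is a hypothetical helper, and the scores and reasons in the example are placeholders rather than real evaluation results.

```python
def build_evaluation_report(results):
    """Flatten {metric_name: (score, reason)} into <Name>Score / <Name>Reason pairs."""
    report = {}
    for name, (score, reason) in results.items():
        report[f"{name}Score"] = float(score)
        report[f"{name}Reason"] = str(reason)
    return report


# Placeholder values for illustration only, not real evaluation results.
example = build_evaluation_report({
    "Summarization": (0.5, "placeholder reason"),
    "Coherence": (0.75, "placeholder reason"),
})
print(example)
```

Deriving the keys from the metric name keeps the report consistent when a metric is added or renamed.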

In [34]:
# Import required libraries
from deepeval.metrics import SummarizationMetric, GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.models.base_model import DeepEvalBaseLLM
from openai import OpenAI
import json
import os

# Configure DeepEval to use gpt-4o as the judge model
class GPT4OModel(DeepEvalBaseLLM):
    def __init__(self, model="gpt-4o"):
        self.model = model
        self.client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))
    
    def load_model(self):
        return self.client
    
    def generate(self, prompt: str) -> str:
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0.2
        )
        return response.choices[0].message.content
    
    async def a_generate(self, prompt: str) -> str:
        return self.generate(prompt)
    
    def get_model_name(self):
        return self.model

# Initialize the custom model for DeepEval
judge_model = GPT4OModel(model="gpt-4o")

# Load the initial summary created in generation phase
with open("ai_report_2025_initial_summary.json", "r", encoding="utf-8") as f:
    summary_data = json.load(f)

print("Loaded initial summary:")
print(f"Author: {summary_data['Author']}")
print(f"Title: {summary_data['Title']}")
print(f"Tone: {summary_data['Tone']}")
print(f"Input Tokens: {summary_data['InputTokens']}")
print(f"Output Tokens: {summary_data['OutputTokens']}")
print(f"\nSummary preview (first 300 chars):\n{summary_data['Summary'][:300]}...\n")

# Define custom assessment questions for Summarization Metric
# At least 5 questions as required by the assignment
summarization_questions = [
    "Does the summary capture the main themes and key findings of the original document?",
    "Are all critical statistics, data points, and research findings accurately represented?",
    "Does the summary maintain the logical flow and structure of the original article?",
    "Are the core arguments and conclusions of the original document preserved?",
    "Does the summary avoid including irrelevant details or tangential information?"
]


print("Evaluation Metrics Configured:")
print(f"1. Summarization Metric: {len(summarization_questions)} assessment questions")
print("2. Coherence Metric (G-Eval): 5 evaluation steps")
print("3. Tonality Metric (G-Eval): 5 evaluation steps")
print("4. Safety Metric (G-Eval): 5 evaluation steps")


# Create test case for summarization evaluation
summarization_test_case = LLMTestCase(
    input=document_text,  # The full article text from generation phase
    actual_output=summary_data['Summary']
)

# Initialize and measure Summarization Metric
print("Metric 1: Evaluating Summarization Quality")


summarization_metric = SummarizationMetric(
    threshold=0.5,
    model=judge_model,  # Using our custom gpt-4o model
    assessment_questions=summarization_questions
)

print("Running evaluation... (this may take a minute)")
summarization_metric.measure(summarization_test_case)

summarization_score = summarization_metric.score
summarization_reason = summarization_metric.reason

print(f"\nSummarization Score: {summarization_score:.3f}")
print(f"\nSummarization Reason:")
print(summarization_reason)


# Initialize and measure Coherence metric (G-Eval)

print("Metric 2: Evaluating Coherence/Clarity")


coherence_metric = GEval(
    name="Coherence",
    criteria="Coherence and clarity - determine if the summary is logically structured, clear, and easy to understand",
    evaluation_steps=[
        "Assess the logical flow and organization of ideas",
        "Check for smooth transitions between concepts",
        "Evaluate clarity of language and terminology",
        "Determine if technical concepts are appropriately explained",
        "Verify overall narrative consistency"
    ],
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    model=judge_model,
    threshold=0.5
)

coherence_test_case = LLMTestCase(
    input=document_text,
    actual_output=summary_data['Summary']
)

print("Running evaluation... (this may take a minute)")
coherence_metric.measure(coherence_test_case)

coherence_score = coherence_metric.score
coherence_reason = coherence_metric.reason

print(f"\nCoherence Score: {coherence_score:.3f}")
print(f"\nCoherence Reason:")
print(coherence_reason)


# Initialize and measure Tonality metric (G-Eval)
print("Metric 3: Evaluating Tonality")

tonality_metric = GEval(
    name="Tonality",
    criteria=f"Tonality consistency - determine if the summary consistently maintains {summary_data['Tone']} as specified",
    evaluation_steps=[
        f"Verify the summary uses {summary_data['Tone']} consistently",
        "Check vocabulary and word choice alignment with the specified tone",
        "Assess sentence structure and complexity appropriate to the tone",
        "Identify any tone shifts or inconsistencies",
        "Evaluate overall tone authenticity and appropriateness"
    ],
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    model=judge_model,
    threshold=0.5
)

tonality_test_case = LLMTestCase(
    input=f"The tone should be: {summary_data['Tone']}",
    actual_output=summary_data['Summary']
)

print("Running evaluation... (this may take a minute)")
tonality_metric.measure(tonality_test_case)

tonality_score = tonality_metric.score
tonality_reason = tonality_metric.reason

print(f"\nTonality Score: {tonality_score:.3f}")
print(f"\nTonality Reason:")
print(tonality_reason)


# Initialize and measure Safety metric (G-Eval)
print("Metric 4: Evaluating Safety")


safety_metric = GEval(
    name="Safety",
    criteria="Safety and accuracy - determine if the summary is factually accurate, avoids unsupported claims, and handles information ethically",
    evaluation_steps=[
        "Verify all claims are supported by the original document",
        "Check for factual accuracy against the source",
        "Identify any biased or potentially harmful language",
        "Assess ethical handling of sensitive information",
        "Confirm the author's intent is not misrepresented"
    ],
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    model=judge_model,
    threshold=0.5
)

safety_test_case = LLMTestCase(
    input=document_text,
    actual_output=summary_data['Summary']
)

print("Running evaluation... (this may take a minute)")
safety_metric.measure(safety_test_case)

safety_score = safety_metric.score
safety_reason = safety_metric.reason

print(f"\nSafety Score: {safety_score:.3f}")
print(f"\nSafety Reason:")
print(safety_reason)


# Create structured evaluation results as required by assignment
evaluation_results = {
    "SummarizationScore": summarization_score,
    "SummarizationReason": summarization_reason,
    "CoherenceScore": coherence_score,
    "CoherenceReason": coherence_reason,
    "TonalityScore": tonality_score,
    "TonalityReason": tonality_reason,
    "SafetyScore": safety_score,
    "SafetyReason": safety_reason
}

# Save evaluation results
with open("ai_report_2025_evaluation_results.json", "w", encoding="utf-8") as f:
    json.dump(evaluation_results, f, indent=2)

# Display summary of results
print("Evaluation Summary")

print(f"Summarization Score: {summarization_score:.3f}")
print(f"Coherence Score: {coherence_score:.3f}")
print(f"Tonality Score: {tonality_score:.3f}")
print(f"Safety Score: {safety_score:.3f}")
average_score = (summarization_score + coherence_score + tonality_score + safety_score) / 4
print(f"\nAverage Score: {average_score:.3f}")
print("\nEvaluation results saved to: ai_report_2025_evaluation_results.json")

# Display detailed evaluation reasons

print("Evaluation Reason")


print("\nSummarization:")
print(summarization_reason)

print("\nCoherence:")
print(coherence_reason)

print("\nTonality:")
print(tonality_reason)

print("\nSafety:")
print(safety_reason)




Loaded initial summary:
Author: Aditya Challapally
Title: ai_report_2025.pdf
Tone: Formal Academic Writing
Input Tokens: 10000
Output Tokens: 1000

Summary preview (first 300 chars):
The report, authored by Aditya Challapally and others, explores the 'GenAI Divide' in business by 2025, where despite $30-40 billion in investments, 95% of organizations see no return from AI initiatives. The divide is not due to model quality or regulation but rather the approach to AI adoption. To...

Evaluation Metrics Configured:
1. Summarization Metric: 5 assessment questions
2. Coherence Metric (G-Eval): 5 evaluation steps
3. Tonality Metric (G-Eval): 5 evaluation steps
4. Safety Metric (G-Eval): 5 evaluation steps
Metric 1: Evaluating Summarization Quality
Running evaluation... (this may take a minute)



Summarization Score: 0.400

Summarization Reason:
The score is 0.40 because the summary contains significant contradictions and extra information not present in the original text, such as mentions of Aditya Challapally, the year 2025, and various factors like model quality and regulation. Additionally, it fails to accurately represent critical statistics, data points, and the logical flow of the original article.
Metric 2: Evaluating Coherence/Clarity
Running evaluation... (this may take a minute)



Coherence Score: 0.900

Coherence Reason:
The response demonstrates a strong logical flow and organization, clearly outlining the key findings of the report. Transitions between concepts are smooth, with each point building on the previous one. Language and terminology are clear and appropriate for the subject matter. Technical concepts, such as the 'GenAI Divide' and 'shadow AI economy,' are well-explained. The narrative is consistent, maintaining focus on the central theme of AI adoption challenges and solutions.
Metric 3: Evaluating Tonality
Running evaluation... (this may take a minute)



Tonality Score: 0.900

Tonality Reason:
The summary consistently uses formal academic writing, with appropriate vocabulary and complex sentence structures suitable for the tone. It maintains a consistent tone throughout, effectively conveying the report's findings and conclusions. The vocabulary aligns well with the formal tone, using terms like 'integration issues,' 'enterprise paradox,' and 'investment patterns.' There are no noticeable tone shifts or inconsistencies, and the overall tone is authentic and appropriate for an academic summary. The only minor shortcoming is the lack of explicit introductory or concluding sentences that could enhance the formal structure.
Metric 4: Evaluating Safety
Running evaluation... (this may take a minute)



Safety Score: 0.900

Safety Reason:
The response aligns well with the evaluation steps. It accurately summarizes the report's findings on the GenAI Divide, including the lack of ROI despite significant investment, the importance of approach over model quality, and the patterns identified in AI adoption. It also correctly highlights the learning gap as a core barrier and the effectiveness of personal AI tools in the shadow economy. The response avoids biased or harmful language and ethically handles sensitive information. However, it could slightly improve by explicitly mentioning the authors' intent as stated in the report's conclusion.
Evaluation Summary
Summarization Score: 0.400
Coherence Score: 0.900
Tonality Score: 0.900
Safety Score: 0.900

Average Score: 0.775

Evaluation results saved to: ai_report_2025_evaluation_results.json
Evaluation Reason

Summarization:
The score is 0.40 because the summary contains significant contradictions and extra information not present in the ori

# Enhancement

Of course, evaluation is important, but we want our system to self-correct.  

+ Use the context, summary, and evaluation that you produced in the steps above to create a new prompt that enhances the summary.
+ Evaluate the new summary using the same function.
+ Report your results. Did you get a better output? Why? Do you think these controls are enough?

Please do not forget to add your comments.
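
To report whether the enhanced summary improved, it can help to compare the two evaluation reports metric by metric. The sketch below assumes both reports use the `<Name>Score` key convention from the evaluation step; `compare_scores` is a hypothetical helper and the numbers are placeholders, not results from this notebook.

```python
def compare_scores(before, after):
    """Return per-metric deltas (after - before) for the *Score keys both runs share."""
    deltas = {}
    for key, old in before.items():
        if key.endswith("Score") and key in after:
            deltas[key] = round(after[key] - old, 3)
    return deltas


# Placeholder numbers for illustration only.
before = {"SummarizationScore": 0.25, "SummarizationReason": "...", "CoherenceScore": 0.75}
after = {"SummarizationScore": 0.5, "CoherenceScore": 0.5}

deltas = compare_scores(before, after)
for key, delta in deltas.items():
    print(f"{key}: {delta:+.3f}")
```

A positive delta suggests an improvement on that metric, although judge scores vary between runs, so small differences may not be meaningful on their own.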

In [None]:
# Import required libraries
from openai import OpenAI
from pydantic import BaseModel, ValidationError
import json
import os
from deepeval.metrics import SummarizationMetric, GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.models.base_model import DeepEvalBaseLLM

# Configure DeepEval to use gpt-4o as the judge model
class GPT4OModel(DeepEvalBaseLLM):
    def __init__(self, model="gpt-4o"):
        self.model = model
        self.client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))
    
    def load_model(self):
        return self.client
    
    def generate(self, prompt: str) -> str:
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0.2
        )
        return response.choices[0].message.content
    
    async def a_generate(self, prompt: str) -> str:
        return self.generate(prompt)
    
    def get_model_name(self):
        return self.model

# Initialize the custom model for DeepEval
judge_model = GPT4OModel(model="gpt-4o")

# Load previous results from Evaluation phase
print("Loading Initial Results from Evaluation Phase")

with open("ai_report_2025_initial_summary.json", "r", encoding="utf-8") as f:
    initial_summary = json.load(f)

with open("ai_report_2025_evaluation_results.json", "r", encoding="utf-8") as f:
    evaluation_results = json.load(f)

print("\nInitial Summary Statistics:")
print(f"Author: {initial_summary['Author']}")
print(f"Title: {initial_summary['Title']}")
print(f"Tone: {initial_summary['Tone']}")
print("\nInitial Evaluation Scores:")
print(f"  Summarization: {evaluation_results['SummarizationScore']:.3f}")
print(f"  Coherence:     {evaluation_results['CoherenceScore']:.3f}")
print(f"  Tonality:      {evaluation_results['TonalityScore']:.3f}")
print(f"  Safety:        {evaluation_results['SafetyScore']:.3f}")
initial_avg = (evaluation_results['SummarizationScore'] + 
               evaluation_results['CoherenceScore'] + 
               evaluation_results['TonalityScore'] + 
               evaluation_results['SafetyScore']) / 4
print(f"  Average:       {initial_avg:.3f}")

# Pydantic model for enhanced summary
class ArticleSummary(BaseModel):
    Author: str
    Title: str
    Relevance: str
    Summary: str
    Tone: str
    InputTokens: int
    OutputTokens: int

# Create enhancement prompt using evaluation feedback
print("\nStep 1: Creating Enhancement Prompt")

client = OpenAI()
GEN_MODEL = "gpt-4o"

enhancement_developer_instructions = (
    "You are an expert summarization improvement assistant. "
    "You will receive an original document, a previous summary, and detailed evaluation feedback. "
    "Your task is to create an IMPROVED summary that addresses all weaknesses identified in the evaluation. "
    "Return only a single JSON object containing EXACTLY these keys: "
    '"Author", "Title", "Relevance", "Summary", "Tone", "InputTokens", "OutputTokens". '
    "Do not output any commentary, markdown, or extra fields."
)

enhancement_user_prompt = f"""
ORIGINAL DOCUMENT:
{document_text}

PREVIOUS SUMMARY:
{initial_summary['Summary']}

EVALUATION FEEDBACK:

1. SUMMARIZATION (Score: {evaluation_results['SummarizationScore']:.3f}):
{evaluation_results['SummarizationReason']}

2. COHERENCE (Score: {evaluation_results['CoherenceScore']:.3f}):
{evaluation_results['CoherenceReason']}

3. TONALITY (Score: {evaluation_results['TonalityScore']:.3f}):
{evaluation_results['TonalityReason']}

4. SAFETY (Score: {evaluation_results['SafetyScore']:.3f}):
{evaluation_results['SafetyReason']}

TASK:
Based on the evaluation feedback above, create an ENHANCED summary that:
- Addresses all weaknesses identified in the evaluation
- Improves upon low-scoring dimensions
- Maintains or enhances high-scoring dimensions
- Captures all key themes and findings more accurately
- Ensures better coherence and logical flow
- Maintains consistent {initial_summary['Tone']} tone throughout
- Ensures all claims are supported by the original document

Return a single strict JSON object with these keys:
- Author: {initial_summary['Author']}
- Title: {initial_summary['Title']}
- Relevance: Enhanced one-paragraph explanation of relevance for AI professionals
- Summary: Enhanced summary (concise, ~1000 tokens, in {initial_summary['Tone']} tone)
- Tone: {initial_summary['Tone']}
- InputTokens: 0 (will be filled by client)
- OutputTokens: 0 (will be filled by client)

Output must be valid JSON only. No explanation or commentary.
"""

print(f"Enhancement prompt created with:")
print(f"  - Original document ({len(document_text)} chars)")
print(f"  - Previous summary ({len(initial_summary['Summary'])} chars)")
print(f"  - Evaluation feedback (4 metrics)")

# Generate enhanced summary
print("\nStep 2: Generating Enhanced Summary")

resp_enhanced = client.responses.create(
    model=GEN_MODEL,
    instructions=enhancement_developer_instructions,
    input=enhancement_user_prompt,
    max_output_tokens=1500,
    temperature=0.2,
)

# Extract model text
try:
    enhanced_model_text = resp_enhanced.output_text
except Exception:
    enhanced_model_text = ""
    out = getattr(resp_enhanced, "output", []) or []
    for block in out:
        if isinstance(block, dict):
            for c in block.get("content", []):
                if c.get("type") == "output_text":
                    enhanced_model_text += c.get("text", "")
        else:
            enhanced_model_text += str(block)

# Extract token usage
usage = getattr(resp_enhanced, "usage", None)
enhanced_input_tokens = -1
enhanced_output_tokens = -1
if usage:
    u = usage if isinstance(usage, dict) else usage.__dict__
    enhanced_input_tokens = u.get("input_tokens") or u.get("prompt_tokens") or -1
    enhanced_output_tokens = u.get("output_tokens") or u.get("completion_tokens") or -1
    try:
        enhanced_input_tokens = int(enhanced_input_tokens)
        enhanced_output_tokens = int(enhanced_output_tokens)
    except (TypeError, ValueError):
        # Leave the values as reported if they cannot be converted to integers
        pass

print(f"Enhanced summary generated!")
print(f"   Input tokens: {enhanced_input_tokens}")
print(f"   Output tokens: {enhanced_output_tokens}")

# Parse enhanced JSON using safe extraction function
def safe_extract_json(text, defaults):
    """Extract JSON from model output with fallback parsing"""
    first = text.find("{")
    last = text.rfind("}")
    if first != -1 and last != -1 and last > first:
        try:
            return json.loads(text[first:last+1])
        except Exception:
            pass
    
    # Fallback: regex-based extraction
    parsed = defaults.copy()
    import re
    
    def find_field(k):
        m = re.search(rf'"{re.escape(k)}"\s*:\s*"(?P<v>.*?)"', text, re.S)
        if m:
            return m.group("v").strip()
        m2 = re.search(rf'{re.escape(k)}\s*[:\-]\s*(.+?)(?=\n[A-Z][a-zA-Z ]+\s*[:\-"]|$)', text, re.S)
        if m2:
            return m2.group(1).strip().strip('",')
        return ""
    
    for key in defaults.keys():
        val = find_field(key)
        if val:
            if key in ("InputTokens", "OutputTokens"):
                # Keep only the digits; fall back to the default if none parse
                digits = ''.join(c for c in val if c.isdigit())
                try:
                    parsed[key] = int(digits) if digits else parsed[key]
                except ValueError:
                    pass
            else:
                parsed[key] = val
    return parsed

enhanced_defaults = {
    "Author": initial_summary['Author'],
    "Title": initial_summary['Title'],
    "Relevance": "",
    "Summary": "",
    "Tone": initial_summary['Tone'],
    "InputTokens": enhanced_input_tokens,
    "OutputTokens": enhanced_output_tokens
}

enhanced_parsed_json = safe_extract_json(enhanced_model_text, enhanced_defaults)

if enhanced_input_tokens >= 0:
    enhanced_parsed_json["InputTokens"] = enhanced_input_tokens
if enhanced_output_tokens >= 0:
    enhanced_parsed_json["OutputTokens"] = enhanced_output_tokens

# Create Pydantic object for enhanced summary
enhanced_summary = ArticleSummary(
    Author=enhanced_parsed_json.get("Author") or initial_summary['Author'],
    Title=enhanced_parsed_json.get("Title") or initial_summary['Title'],
    Relevance=enhanced_parsed_json.get("Relevance") or "",
    Summary=enhanced_parsed_json.get("Summary") or "",
    Tone=enhanced_parsed_json.get("Tone") or initial_summary['Tone'],
    InputTokens=int(enhanced_parsed_json.get("InputTokens", enhanced_input_tokens)),
    OutputTokens=int(enhanced_parsed_json.get("OutputTokens", enhanced_output_tokens))
)

# Save enhanced summary
with open("ai_report_2025_enhanced_summary.json", "w", encoding="utf-8") as f:
    f.write(enhanced_summary.model_dump_json(indent=2))

print(f"\nEnhanced summary saved to: ai_report_2025_enhanced_summary.json")
print(f"   Summary length: {len(enhanced_summary.Summary)} chars")

# Re-evaluate enhanced summary - Summarization
print("\nStep 3: Re-Evaluating Enhanced Summary")

print("\n1. Evaluating Summarization...")
summarization_questions = [
    "Does the summary capture the main themes and key findings of the original document?",
    "Are all critical statistics, data points, and research findings accurately represented?",
    "Does the summary maintain the logical flow and structure of the original article?",
    "Are the core arguments and conclusions of the original document preserved?",
    "Does the summary avoid including irrelevant details or tangential information?"
]

enhanced_summ_metric = SummarizationMetric(
    threshold=0.5,
    model=judge_model,
    assessment_questions=summarization_questions
)

enhanced_summ_test = LLMTestCase(
    input=document_text,
    actual_output=enhanced_summary.Summary
)

enhanced_summ_metric.measure(enhanced_summ_test)
enhanced_summ_score = enhanced_summ_metric.score
enhanced_summ_reason = enhanced_summ_metric.reason

print(f"   Score: {enhanced_summ_score:.3f} (Previous: {evaluation_results['SummarizationScore']:.3f})")

# Re-evaluate - Coherence
print("\n2. Evaluating Coherence...")
enhanced_coh_metric = GEval(
    name="Coherence",
    criteria="Coherence and clarity - determine if the summary is logically structured, clear, and easy to understand",
    evaluation_steps=[
        "Assess the logical flow and organization of ideas",
        "Check for smooth transitions between concepts",
        "Evaluate clarity of language and terminology",
        "Determine if technical concepts are appropriately explained",
        "Verify overall narrative consistency"
    ],
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    model=judge_model,
    threshold=0.5
)

enhanced_coh_test = LLMTestCase(
    input=document_text,
    actual_output=enhanced_summary.Summary
)

enhanced_coh_metric.measure(enhanced_coh_test)
enhanced_coh_score = enhanced_coh_metric.score
enhanced_coh_reason = enhanced_coh_metric.reason

print(f"   Score: {enhanced_coh_score:.3f} (Previous: {evaluation_results['CoherenceScore']:.3f})")

# Re-evaluate - Tonality
print("\n3. Evaluating Tonality...")
enhanced_tone_metric = GEval(
    name="Tonality",
    criteria=f"Tonality consistency - determine if the summary consistently maintains {enhanced_summary.Tone} as specified",
    evaluation_steps=[
        f"Verify the summary uses {enhanced_summary.Tone} consistently",
        "Check vocabulary and word choice alignment with the specified tone",
        "Assess sentence structure and complexity appropriate to the tone",
        "Identify any tone shifts or inconsistencies",
        "Evaluate overall tone authenticity and appropriateness"
    ],
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    model=judge_model,
    threshold=0.5
)

enhanced_tone_test = LLMTestCase(
    input=f"The tone should be: {enhanced_summary.Tone}",
    actual_output=enhanced_summary.Summary
)

enhanced_tone_metric.measure(enhanced_tone_test)
enhanced_tone_score = enhanced_tone_metric.score
enhanced_tone_reason = enhanced_tone_metric.reason

print(f"   Score: {enhanced_tone_score:.3f} (Previous: {evaluation_results['TonalityScore']:.3f})")

# Re-evaluate - Safety
print("\n4. Evaluating Safety...")
enhanced_safety_metric = GEval(
    name="Safety",
    criteria="Safety and accuracy - determine if the summary is factually accurate, avoids unsupported claims, and handles information ethically",
    evaluation_steps=[
        "Verify all claims are supported by the original document",
        "Check for factual accuracy against the source",
        "Identify any biased or potentially harmful language",
        "Assess ethical handling of sensitive information",
        "Confirm the author's intent is not misrepresented"
    ],
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    model=judge_model,
    threshold=0.5
)

enhanced_safety_test = LLMTestCase(
    input=document_text,
    actual_output=enhanced_summary.Summary
)

enhanced_safety_metric.measure(enhanced_safety_test)
enhanced_safety_score = enhanced_safety_metric.score
enhanced_safety_reason = enhanced_safety_metric.reason

print(f"   Score: {enhanced_safety_score:.3f} (Previous: {evaluation_results['SafetyScore']:.3f})")

# Create and save enhanced evaluation results
enhanced_evaluation_results = {
    "SummarizationScore": enhanced_summ_score,
    "SummarizationReason": enhanced_summ_reason,
    "CoherenceScore": enhanced_coh_score,
    "CoherenceReason": enhanced_coh_reason,
    "TonalityScore": enhanced_tone_score,
    "TonalityReason": enhanced_tone_reason,
    "SafetyScore": enhanced_safety_score,
    "SafetyReason": enhanced_safety_reason
}

with open("ai_report_2025_enhanced_evaluation.json", "w", encoding="utf-8") as f:
    json.dump(enhanced_evaluation_results, f, indent=2)

print("\nEnhanced evaluation saved to: ai_report_2025_enhanced_evaluation.json")

# Comparison report
print("\nStep 4: Comparison Report")

metrics = ["Summarization", "Coherence", "Tonality", "Safety"]
initial_scores = [
    evaluation_results['SummarizationScore'],
    evaluation_results['CoherenceScore'],
    evaluation_results['TonalityScore'],
    evaluation_results['SafetyScore']
]
enhanced_scores = [
    enhanced_summ_score,
    enhanced_coh_score,
    enhanced_tone_score,
    enhanced_safety_score
]

print(f"\n{'Metric':<20} {'Initial':<12} {'Enhanced':<12} {'Change':<12} {'Status'}")
print("-" * 75)

for metric, initial, enhanced in zip(metrics, initial_scores, enhanced_scores):
    change = enhanced - initial
    change_str = f"{change:+.3f}"  # the "+" flag always prints an explicit sign
    # Changes within +/-0.05 are reported as slight movements
    if change > 0.05:
        status = "IMPROVED"
    elif change > 0:
        status = "Slight up"
    elif change == 0:
        status = "No change"
    elif change > -0.05:
        status = "Slight down"
    else:
        status = "WORSE"
    print(f"{metric:<20} {initial:<12.3f} {enhanced:<12.3f} {change_str:<12} {status}")

enhanced_avg = sum(enhanced_scores) / len(enhanced_scores)
avg_change = enhanced_avg - initial_avg

print("-" * 75)
print(f"{'AVERAGE':<20} {initial_avg:<12.3f} {enhanced_avg:<12.3f} {avg_change:+.3f}")

# Analysis and conclusions
print("\nStep 5: Analysis and Reflection")

# Count how many metrics moved in each direction
score_pairs = list(zip(initial_scores, enhanced_scores))
improvement_count = sum(1 for before, after in score_pairs if after > before)
deterioration_count = sum(1 for before, after in score_pairs if after < before)
unchanged_count = sum(1 for before, after in score_pairs if after == before)

print(f"\nMetrics improved: {improvement_count}/{len(metrics)}")
print(f"Metrics deteriorated: {deterioration_count}/{len(metrics)}")
print(f"Metrics unchanged: {unchanged_count}/{len(metrics)}")
print(f"\nOverall average change: {avg_change:+.3f}")

print("\nDid We Get Better Output?")

if avg_change > 0.05:
    print("\nYES - The enhanced summary shows SIGNIFICANT improvement.")
elif avg_change > 0:
    print("\nYES - The enhanced summary shows MODEST improvement.")
elif avg_change == 0:
    print("\nNEUTRAL - The enhanced summary shows NO change.")
else:
    print("\nNO - The enhanced summary shows DETERIORATION.")

print("\nWhy Did This Happen?")

if avg_change > 0:
    print("""
The self-correction approach worked because:

1. EXPLICIT FEEDBACK: The enhancement prompt included specific evaluation 
   scores and detailed reasons for each dimension, giving the model concrete
   targets for improvement.

2. CONTEXT PRESERVATION: The model had access to both the original document
   and the previous summary, allowing it to understand what was missing or
   incorrect.

3. STRUCTURED GUIDANCE: The prompt explicitly instructed the model to address
   weaknesses in low-scoring dimensions while maintaining high-scoring ones.

4. SAME EVALUATION CRITERIA: Using identical metrics for re-evaluation ensures
   fair comparison and validates the improvement approach.
""")
else:
    print("""
The self-correction did not improve results because:

1. The initial summary may have already been near-optimal for these metrics.

2. The evaluation criteria may be subjective and variable between runs.

3. The model may have over-corrected in some areas while under-performing
   in others.

4. The enhancement prompt may need refinement to better guide improvements.
""")

print("\nAre These Controls Enough?")

print("""
These controls are a GOOD START but have limitations:

STRENGTHS:
+ Automated evaluation provides quantitative feedback
+ Self-correction can improve summaries iteratively
+ Multiple metrics capture different quality dimensions
+ Structured outputs ensure consistency

LIMITATIONS:
- Single-iteration enhancement may not address all issues
- LLM-based evaluation can be inconsistent and subjective
- No human verification of factual accuracy
- Tone evaluation depends on judge model's understanding
- May require multiple iterations with diminishing returns
- No external fact-checking against trusted sources

RECOMMENDATIONS FOR PRODUCTION:
1. Implement iterative enhancement (2-3 cycles with stopping criteria)
2. Add human-in-the-loop verification for critical content
3. Use ensemble evaluation (multiple judge models, average scores)
4. Implement external fact-checking against knowledge bases
5. Add domain-specific evaluation criteria
6. Track evaluation consistency across multiple runs
7. Set minimum score thresholds before accepting summaries
8. Include edge case testing (very long/short docs, technical content)
""")

print("\nFiles Generated:")
print("- ai_report_2025_enhanced_summary.json")
print("- ai_report_2025_enhanced_evaluation.json")

Loading Initial Results from Evaluation Phase

Initial Summary Statistics:
Author: Aditya Challapally
Title: ai_report_2025.pdf
Tone: Formal Academic Writing

Initial Evaluation Scores:
  Summarization: 0.000
  Coherence:     0.900
  Tonality:      0.900
  Safety:        0.900
  Average:       0.675

Step 1: Creating Enhancement Prompt
Enhancement prompt created with:
  - Original document (53850 chars)
  - Previous summary (1264 chars)
  - Evaluation feedback (4 metrics)

Step 2: Generating Enhanced Summary
Enhanced summary generated!
   Input tokens: 11650
   Output tokens: 382

Enhanced summary saved to: ai_report_2025_enhanced_summary.json
   Summary length: 1417 chars

Step 3: Re-Evaluating Enhanced Summary

1. Evaluating Summarization...


   Score: 0.000 (Previous: 0.000)

2. Evaluating Coherence...


   Score: 0.900 (Previous: 0.900)

3. Evaluating Tonality...


   Score: 0.900 (Previous: 0.900)

4. Evaluating Safety...


   Score: 0.900 (Previous: 0.900)

Enhanced evaluation saved to: ai_report_2025_enhanced_evaluation.json

Step 4: Comparison Report

Metric               Initial      Enhanced     Change       Status
---------------------------------------------------------------------------
Summarization        0.000        0.000        +0.000       No change
Coherence            0.900        0.900        +0.000       No change
Tonality             0.900        0.900        +0.000       No change
Safety               0.900        0.900        +0.000       No change
---------------------------------------------------------------------------
AVERAGE              0.675        0.675        +0.000

Step 5: Analysis and Reflection

Metrics improved: 0/4
Metrics deteriorated: 0/4
Metrics unchanged: 4/4

Overall average change: +0.000

Did We Get Better Output?

NEUTRAL - The enhanced summary shows NO change.

Why Did This Happen?

The self-correction did not improve results because:

1. The initial summary m

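The first production recommendation in the reflection above (iterative enhancement with stopping criteria) could be sketched roughly as follows. This is a minimal sketch, not the notebook's implementation: `iterative_enhancement`, `enhance`, `evaluate`, and the thresholds `max_cycles`, `min_gain`, and `target` are hypothetical names and values chosen for illustration. The toy stand-ins at the bottom replace the real LLM and DeepEval calls so that the block runs on its own.

```python
from typing import Callable, Dict, List, Tuple

Scores = Dict[str, float]


def average(scores: Scores) -> float:
    """Mean of all metric scores."""
    return sum(scores.values()) / len(scores)


def iterative_enhancement(
    summary: str,
    enhance: Callable[[str, Scores], str],  # hypothetical: rewrites a summary given feedback
    evaluate: Callable[[str], Scores],      # hypothetical: returns metric name -> score
    max_cycles: int = 3,
    min_gain: float = 0.01,
    target: float = 0.9,
) -> Tuple[str, Scores, List[float]]:
    """Repeat enhance -> evaluate until a stopping criterion is met.

    Stops when the average score reaches `target`, when a cycle gains less
    than `min_gain`, or after `max_cycles` cycles. Returns the best summary
    seen, its scores, and the history of average scores.
    """
    best_summary, best_scores = summary, evaluate(summary)
    history = [average(best_scores)]

    for _ in range(max_cycles):
        if history[-1] >= target:
            break  # good enough: stop spending tokens
        candidate = enhance(best_summary, best_scores)
        candidate_scores = evaluate(candidate)
        gain = average(candidate_scores) - average(best_scores)
        history.append(average(candidate_scores))
        if gain < min_gain:
            break  # no meaningful improvement: keep the previous best
        best_summary, best_scores = candidate, candidate_scores

    return best_summary, best_scores, history


# Toy stand-ins (assumptions, not real model calls): each "enhancement"
# appends a marker, and the score grows with the number of markers.
def toy_enhance(text: str, scores: Scores) -> str:
    return text + "+"


def toy_evaluate(text: str) -> Scores:
    return {"Summarization": min(1.0, 0.25 * text.count("+"))}


best, scores, history = iterative_enhancement(
    "draft", toy_enhance, toy_evaluate, max_cycles=5, target=0.75
)
print(best, history)
```

In the run recorded above, every score was unchanged after one cycle, so a loop like this would stop after the first cycle under the `min_gain` rule instead of spending more tokens.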

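Recommendations 3 and 6 above (ensemble evaluation and tracking consistency across runs) both come down to aggregating several score dictionaries for the same summary. A minimal sketch, assuming each judge or repeated run yields a dictionary of metric name to score: `aggregate_runs`, `is_real_change`, and the `margin` heuristic are hypothetical helpers for illustration, and the numbers in the usage lines are made up, not results from this notebook.

```python
from statistics import mean, pstdev
from typing import Dict, List

Report = Dict[str, Dict[str, float]]


def aggregate_runs(runs: List[Dict[str, float]]) -> Report:
    """Combine score dictionaries from several judges or repeated runs.

    For each metric, report the mean, the spread (population standard
    deviation), and the min/max across runs.
    """
    report: Report = {}
    for name in runs[0]:
        values = [run[name] for run in runs]
        report[name] = {
            "mean": mean(values),
            "spread": pstdev(values),
            "min": min(values),
            "max": max(values),
        }
    return report


def is_real_change(before: Report, after: Report, metric: str, margin: float = 2.0) -> bool:
    """Heuristic (an assumption, not a statistical test): treat a change as
    real only if it exceeds `margin` times the larger of the two spreads."""
    noise = max(before[metric]["spread"], after[metric]["spread"])
    difference = abs(after[metric]["mean"] - before[metric]["mean"])
    return difference > margin * noise


# Illustrative numbers only: two judges scoring the initial and enhanced summaries.
initial_report = aggregate_runs([{"Coherence": 0.5}, {"Coherence": 1.0}])
enhanced_report = aggregate_runs([{"Coherence": 1.0}, {"Coherence": 1.0}])
print(initial_report["Coherence"])
print(is_real_change(initial_report, enhanced_report, "Coherence"))
```

Reporting the spread next to the mean shows whether a difference between the initial and enhanced scores is larger than the judges' own disagreement.
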
# Submission Information

🚨 **Please review our [Assignment Submission Guide](https://github.com/UofT-DSI/onboarding/blob/main/onboarding_documents/submissions.md)** 🚨 for detailed instructions on how to format, branch, and submit your work. Following these guidelines is crucial for your submissions to be evaluated correctly.

## Submission Parameters

- The Submission Due Date is indicated in the [readme](../README.md#schedule) file.
- The branch name for your repo should be: assignment-1
- What to submit for this assignment:
    + This Jupyter Notebook (assignment_1.ipynb) should be populated and should be the only change in your pull request.
- What the pull request link should look like for this assignment: `https://github.com/<your_github_username>/production/pull/<pr_id>`
    + Open a private window in your browser. Copy and paste the link to your pull request into the address bar. Make sure you can see your pull request properly. This helps the technical facilitator and learning support staff review your submission easily.

## Checklist

+ Created a branch with the correct naming convention.
+ Ensured that the repository is public.
+ Reviewed the PR description guidelines and adhered to them.
+ Verified that the link is accessible in a private browser window.

If you encounter any difficulties or have questions, please don't hesitate to reach out to our team via our Slack. Our Technical Facilitators and Learning Support staff are here to help you navigate any challenges.
