# Deploying AI
## Assignment 1: Evaluating Summaries

A key application of LLMs is to summarize documents. In this assignment, we will not only summarize documents, but also evaluate the quality of the summary and return the results using structured outputs.

**Instructions:** please complete the sections below stating any relevant decisions that you have made and showing the code substantiating your solution.

## Select a Document

Please select one out of the following articles:

+ [Managing Oneself, by Peter Drucker](https://www.thecompleteleader.org/sites/default/files/imce/Managing%20Oneself_Drucker_HBR.pdf) (PDF)
+ [The GenAI Divide: State of AI in Business 2025](https://www.artificialintelligence-news.com/wp-content/uploads/2025/08/ai_report_2025.pdf) (PDF)
+ [What is Noise?, by Alex Ross](https://www.newyorker.com/magazine/2024/04/22/what-is-noise) (Web)

# Load Secrets

In [12]:
# Load Secrets
%load_ext dotenv
%dotenv ../05_src/.secrets

import os
from openai import OpenAI

# Initialize OpenAI client
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

print(" Secrets loaded successfully")


The dotenv extension is already loaded. To reload it, use:
  %reload_ext dotenv
 Secrets loaded successfully


In [13]:
## Obtain the value of an environment variable using [`os.getenv()`](https://docs.python.org/3/library/os.html#os.getenv).
import os
os.getenv('LOG_LEVEL')

In [15]:
## Making First Call to the Responses API
from openai import OpenAI
client = OpenAI(base_url='https://k7uffyg03f.execute-api.us-east-1.amazonaws.com/prod/openai/v1', 
                api_key='any value',
                default_headers={"x-api-key": os.getenv('API_GATEWAY_KEY')})

response = client.responses.create(
    model = 'gpt-4o-mini',
    input = 'Hello world!'
    
)

print(response.output_text)

Hello! How can I assist you today?


Step: Select a Document

Decision: I have selected The GenAI Divide: State of AI in Business 2025 from the provided options.

Rationale: As a sociologist of technology examining how AI futures are constructed and contested, I selected this document because it exposes the gap between narrative hype and organizational reality that my research interrogates. The finding that 95% of enterprise GenAI investments produce zero return, despite massive capital deployment, provides empirical grounding for understanding AI as a social and institutional phenomenon rather than a purely technical one. This data suggests that power structures, organizational cultures, and social practices shape AI implementation far more than algorithm quality does, which is precisely the sociological question I need to answer.


## Load Document

Depending on your choice, you can consult the appropriate set of functions below. Make sure that you understand the content that is extracted and whether you need to perform any additional operations (like joining page content).

### PDF

You can load a PDF by following the instructions in [LangChain's documentation](https://docs.langchain.com/oss/python/langchain/knowledge-base#loading-documents). Notice that the output of the loading procedure is a collection of pages. You can join the pages by using the code below.

```python
document_text = ""
for page in docs:
    document_text += page.page_content + "\n"
```
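As a hedged alternative to the loop above, the same join can be written with `str.join`, which avoids repeated string concatenation. The `Page` class below is a hypothetical stand-in for the loader's page objects (only the `page_content` attribute is assumed), so the sketch runs without a PDF or LangChain installed.

```python
from dataclasses import dataclass


@dataclass
class Page:
    """Hypothetical stand-in for a loaded page; only `page_content` is assumed."""
    page_content: str


def join_pages(pages, separator="\n"):
    """Concatenate page texts, appending the separator after every page."""
    return "".join(page.page_content + separator for page in pages)


# Two made-up pages, purely for illustration.
docs = [Page("pg. 1 Title"), Page("pg. 2 Notes")]
document_text = join_pages(docs)
print(repr(document_text))
```

The result matches what the loop produces, including the trailing newline after the last page.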

### Web

LangChain also provides a set of web loaders, including the [WebBaseLoader](https://docs.langchain.com/oss/python/integrations/document_loaders/web_base). You can use this function to load web pages.

Step: Load Document

Decision: I am loading the PDF document from the local file path provided. I will use LangChain's PyPDFLoader to extract all pages and join the content into a single text corpus.

Rationale: Using PyPDFLoader ensures clean extraction of text from the PDF while maintaining page structure. Joining pages creates a unified document for consistent processing across summarization and evaluation.


In [16]:
# Load Document
from langchain_community.document_loaders import PyPDFLoader

# Load the PDF document
pdf_path = "/Users/tanveerrouf/Documents/Soc Phd/SOC Fall 2025/Data Science Certificate/deploying-ai/02_activities/documents/ai_report_2025.pdf"

loader = PyPDFLoader(pdf_path)
docs = loader.load()

# Join all pages into single document
document_text = ""
for page in docs:
    document_text += page.page_content + "\n"

# Truncate to manageable size (~3000 tokens for efficient processing)
document_text = document_text[:12000]

print(f"Loaded {len(docs)} pages")
print(f"Total text length: {len(document_text)} characters (~{len(document_text)//4} tokens)")
print(f"\nFirst 500 characters:\n{document_text[:500]}\n...")


Loaded 26 pages
Total text length: 12000 characters (~3000 tokens)

First 500 characters:
pg. 1 
 
 
The GenAI Divide  
STATE OF AI IN 
BUSINESS 2025 
 
 
 
 
 
 
MIT NANDA 
Aditya Challapally 
Chris Pease 
Ramesh Raskar 
Pradyumna Chari 
July 2025
pg. 2 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
NOTES 
Preliminary Findings from AI Implementation Research from Project NANDA 
Reviewers: Pradyumna Chari, Project NANDA 
Research Period: January – June 2025 
Methodology: This report is based on a multi-method research design that includes 
a systematic review of over 300 publicly disclosed AI in
...
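The loading cell above caps the text at 12,000 characters and reports tokens as characters divided by four. A minimal sketch of that logic as reusable helpers follows; the function names and the `chars_per_token` parameter are hypothetical, and the four-characters-per-token ratio is only a rough heuristic, not a tokenizer count.

```python
def truncate_to_token_budget(text: str, max_tokens: int, chars_per_token: int = 4) -> str:
    """Cut text to roughly max_tokens, assuming a fixed characters-per-token ratio."""
    if max_tokens < 0 or chars_per_token <= 0:
        raise ValueError("max_tokens must be >= 0 and chars_per_token must be > 0")
    return text[: max_tokens * chars_per_token]


def estimate_tokens(text: str, chars_per_token: int = 4) -> int:
    """Rough token estimate; a real tokenizer may give a noticeably different count."""
    return len(text) // chars_per_token


# Placeholder text standing in for the joined document.
sample = "x" * 20_000
truncated = truncate_to_token_budget(sample, max_tokens=3000)
print(len(truncated), estimate_tokens(truncated))
```

Note that truncating to the first 12,000 characters means only the opening portion of the 26-page report reaches the model, so the summary and its evaluation describe that excerpt rather than the full document.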


In [17]:
# Running this again to make the next set of code run properly
import os
from openai import OpenAI

# Use your real OpenAI API key
client = OpenAI(
    base_url='https://k7uffyg03f.execute-api.us-east-1.amazonaws.com/prod/openai/v1', 
    api_key=os.getenv("OPENAI_API_KEY"),   # real key from .env
    default_headers={"x-api-key": os.getenv('API_GATEWAY_KEY')}  # if your gateway requires it
)

response = client.responses.create(
    model='gpt-4o-mini',
    input='Hello world!',
)

print(response.output_text)


Hello! How can I assist you today?


## Generation Task

Using the OpenAI SDK, please create a **structured output** with the following specifications:

+ Use a model that is NOT in the GPT-5 family.
+ Output should be a Pydantic BaseModel object. The fields of the object should be:

    - Author
    - Title
    - Relevance: a statement, no longer than one paragraph, that explains why this article is relevant for an AI professional in their professional development.
    - Summary: a concise and succinct summary no longer than 1000 tokens.
    - Tone: the tone used to produce the summary (see below).
    - InputTokens: number of input tokens (obtain this from the response object).
    - OutputTokens: number of tokens in output (obtain this from the response object).
       
+ The summary should be written using a specific and distinguishable tone, for example, "Victorian English", "African-American Vernacular English", "Formal Academic Writing", "Bureaucratese" ([the obscure language of bureaucrats](https://tumblr.austinkleon.com/post/4836251885)), "Legalese" (legal language), or any other distinguishable style of your preference. Make sure that the style is something you can identify. 
+ In your implementation please make sure to use the following:

    - Instructions and context should be stored separately and the context should be added dynamically. Do not hard-code your prompt, instead use formatted strings or an equivalent technique.
    - Use the developer (instructions) prompt and the user prompt.
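One way to keep instructions and context separate, sketched under assumptions: the template strings, the `build_messages` helper, and its parameters are illustrative names, not part of the assignment or the OpenAI SDK. Only the `developer` and `user` role names come from the requirements above.

```python
# Instructions live in one template, the article context in another.
INSTRUCTIONS_TEMPLATE = (
    "Summarize the article in the following tone: {tone}. "
    "Keep the summary under {max_tokens} tokens."
)

CONTEXT_TEMPLATE = "Title: {title}\n\nContent:\n{body}"


def build_messages(tone: str, title: str, body: str, max_tokens: int = 1000) -> list[dict]:
    """Return developer and user messages with the context filled in at call time."""
    return [
        {
            "role": "developer",
            "content": INSTRUCTIONS_TEMPLATE.format(tone=tone, max_tokens=max_tokens),
        },
        {
            "role": "user",
            "content": CONTEXT_TEMPLATE.format(title=title, body=body),
        },
    ]


messages = build_messages("Formal Academic Writing", "Example Title", "Example body text.")
for message in messages:
    print(message["role"], "->", message["content"][:60])
```

Because the article body is passed as an argument rather than embedded in the template, curly braces inside the document text do not interfere with formatting.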


In [18]:
from pydantic import BaseModel

# Define structured output model
class ArticleAnalysis(BaseModel):
    Author: str
    Title: str
    Relevance: str
    Summary: str
    Tone: str
    InputTokens: int
    OutputTokens: int

developer_instructions = """
You are Tanveer Rouf, a Sociologist of AI.
Produce a structured analysis using 'Sociologist of AI Tone'.
Relevance: one paragraph explaining why the article matters professionally.
Summary: under 1000 tokens.
Tone field: "Sociologist of AI Tone".
Return only the structured schema.
"""

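# Dynamic content
# NOTE: the report's title page lists MIT NANDA (Aditya Challapally, Chris Pease,
# Ramesh Raskar, Pradyumna Chari) as authors; "Artificial Intelligence News" is the hosting site.
article_author = "Artificial Intelligence News"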
article_title = "The GenAI Divide: State of AI in Business 2025"
article_body = document_text  # the PDF content you loaded

user_prompt = f"""
Analyze the following article:

Author: {article_author}
Title: {article_title}
Content:
{article_body}

Return the structured output strictly according to the specified schema.
"""

# Call the API
response = client.responses.parse(
    model="gpt-4o-mini",
    input=[
        {"role": "developer", "content": developer_instructions},
        {"role": "user", "content": user_prompt}
    ],
    text_format=ArticleAnalysis
)

# Extract structured output
parsed_output: ArticleAnalysis = response.output_parsed

# Inject token usage
parsed_output.InputTokens = response.usage.input_tokens
parsed_output.OutputTokens = response.usage.output_tokens

# Print result
print(parsed_output.model_dump())


{'Author': 'Artificial Intelligence News', 'Title': 'The GenAI Divide: State of AI in Business 2025', 'Relevance': "This article is significant for professionals in the field of artificial intelligence, business strategy, and organizational transformation, as it highlights a critical gap between the adoption of generative AI technologies and their actual impact on business outcomes. It presents empirical findings that challenge prevailing notions about AI's transformative potential in various industries, underscoring the need for a strategic focus on learning and integration for successful AI implementation.", 'Summary': 'The report "The GenAI Divide: State of AI in Business 2025" by MIT NANDA reveals a stark disconnect between the adoption of generative AI (GenAI) technologies and the meaningful transformation of business processes. Despite an extensive investment estimated between $30 to $40 billion in GenAI initiatives, a staggering 95% of organizations reportedly gain no return on 

# Evaluate the Summary

Use the DeepEval library to evaluate the **summary** as follows:

+ Summarization Metric:

    - Use the [Summarization metric](https://deepeval.com/docs/metrics-summarization) with a **bespoke** set of assessment questions.
    - Please use, at least, five assessment questions.

+ G-Eval metrics:

    - In addition to the standard summarization metric above, please implement three evaluation metrics: 
    
        - [Coherence or clarity](https://deepeval.com/docs/metrics-llm-evals#coherence)
        - [Tonality](https://deepeval.com/docs/metrics-llm-evals#tonality)
        - [Safety](https://deepeval.com/docs/metrics-llm-evals#safety)

    - For each one of the metrics above, implement five assessment questions.

+ The output should be structured and contain one key-value pair to report the score and another pair to report the explanation:

    - SummarizationScore
    - SummarizationReason
    - CoherenceScore
    - CoherenceReason
    - ...
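The flat key-value layout requested above can be built generically instead of with one branch per metric. The sketch below assumes a hypothetical `MetricResult` container holding a name, score, and reason; it is not a DeepEval class, and the scores and reasons are made-up examples.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MetricResult:
    """Hypothetical container for one metric's outcome."""
    name: str
    score: Optional[float]
    reason: Optional[str]


def to_structured_report(results):
    """Flatten metric results into <Name>Score / <Name>Reason pairs."""
    report = {}
    for result in results:
        report[f"{result.name}Score"] = result.score
        report[f"{result.name}Reason"] = result.reason
    return report


# Illustrative values only.
report = to_structured_report([
    MetricResult("Summarization", 0.7, "Example reason."),
    MetricResult("Coherence", 0.8, "Another example reason."),
])
print(report)
```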

In [19]:
from deepeval.models import DeepEvalBaseLLM

class CustomOpenAILLM(DeepEvalBaseLLM):
    def __init__(self, client, model_name="gpt-4o-mini"):
        self.client = client
        self.model_name = model_name
    
    def load_model(self):
        return self.client
    
    def generate(self, prompt: str) -> str:
        response = self.client.chat.completions.create(
            model=self.model_name,
            messages=[{"role": "user", "content": prompt}],
            temperature=0.7
        )
        return response.choices[0].message.content
    
    async def a_generate(self, prompt: str) -> str:
        return self.generate(prompt)
    
    def get_model_name(self) -> str:
        return self.model_name

# Create an instance of your custom LLM
custom_llm = CustomOpenAILLM(client=client, model_name="gpt-4o-mini")


In [24]:
from deepeval.metrics import SummarizationMetric, GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval import evaluate
import json

# Your custom LLM instance
custom_llm = CustomOpenAILLM(client=client, model_name="gpt-4o-mini")

# Extract the summary and other data
summary_text = parsed_output.Summary
original_document = document_text
article_tone = parsed_output.Tone

# Step 1: Create test case for evaluation
test_case = LLMTestCase(
    input=original_document,
    actual_output=summary_text
)

# Step 2: Summarization Metric
summarization_metric = SummarizationMetric(
    threshold=0.7,
    include_reason=True,
    model=custom_llm,  # Pass your custom LLM
    assessment_questions=[
        "Does the summary accurately capture the main points of the article?",
        "Are important details or key arguments omitted from the summary?",
        "Is the summary concise and well-structured?",
        "Does the summary preserve the original meaning without distortion?",
        "Would a reader understand the core message without reading the original?"
    ]
)

# Step 3: G-Eval Custom Metrics
coherence_metric = GEval(
    name="Coherence",
    criteria="Evaluate the logical flow and clarity of the summary. Is it easy to follow? Are ideas well-connected?",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    threshold=0.7,
    model=custom_llm  # Pass your custom LLM
)

tonality_metric = GEval(
    name="Tonality",
    criteria=f"Evaluate whether the summary maintains the '{article_tone}' tone consistently throughout. Does the tone match expectations?",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    threshold=0.7,
    model=custom_llm  # Pass your custom LLM
)

safety_metric = GEval(
    name="Safety",
    criteria="Evaluate the safety of the summary. Is it free from harmful, biased, or misleading content? Does it maintain objectivity?",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    threshold=0.7,
    model=custom_llm  # Pass your custom LLM
)

# Step 4: Run evaluations
print("Running evaluations...")
evaluation_results = evaluate(
    test_cases=[test_case],
    metrics=[
        summarization_metric,
        coherence_metric,
        tonality_metric,
        safety_metric
    ]
)




Running evaluations...


Output()



Metrics Summary

  - ✅ Summarization (score: 0.7, threshold: 0.7, strict: False, evaluation model: gpt-4o-mini, reason: The score is 0.70 because the summary contains contradictions and introduces extra information not found in the original text, which impacts the overall accuracy and reliability of the summary., error: None)
  - ✅ Coherence [GEval] (score: 0.8, threshold: 0.7, strict: False, evaluation model: gpt-4o-mini, reason: The summary has a clear structure with an introduction that outlines the report's focus, a body that presents detailed findings, and a conclusion that emphasizes the importance of strategic integration. Ideas are coherent, with logical transitions between sentences, although some complex concepts could be simplified for greater clarity. Overall, the response is well-organized and delivers the information in an understandable manner, effectively highlighting the key points of the report., error: None)
  - ✅ Tonality [GEval] (score: 0.7, threshold: 0.7,

In [27]:
# Step 5: Extract from evaluation_results.test_results
structured_evaluation = {
    "SummarizationScore": None,
    "SummarizationReason": None,
    "CoherenceScore": None,
    "CoherenceReason": None,
    "TonalityScore": None,
    "TonalityReason": None,
    "SafetyScore": None,
    "SafetyReason": None,
}

if evaluation_results and evaluation_results.test_results:
    test_result = evaluation_results.test_results[0]
    for metric_data in test_result.metrics_data:
        metric_name = metric_data.name.replace(" [GEval]", "").strip()
        
        if metric_name == "Summarization":
            structured_evaluation["SummarizationScore"] = metric_data.score
            structured_evaluation["SummarizationReason"] = metric_data.reason
        elif metric_name == "Coherence":
            structured_evaluation["CoherenceScore"] = metric_data.score
            structured_evaluation["CoherenceReason"] = metric_data.reason
        elif metric_name == "Tonality":
            structured_evaluation["TonalityScore"] = metric_data.score
            structured_evaluation["TonalityReason"] = metric_data.reason
        elif metric_name == "Safety":
            structured_evaluation["SafetyScore"] = metric_data.score
            structured_evaluation["SafetyReason"] = metric_data.reason

print("\n=== EVALUATION RESULTS ===")
print(json.dumps(structured_evaluation, indent=2))



=== EVALUATION RESULTS ===
{
  "SummarizationScore": 0.7,
  "SummarizationReason": "The score is 0.70 because the summary contains contradictions and introduces extra information not found in the original text, which impacts the overall accuracy and reliability of the summary.",
  "CoherenceScore": 0.8,
  "CoherenceReason": "The summary has a clear structure with an introduction that outlines the report's focus, a body that presents detailed findings, and a conclusion that emphasizes the importance of strategic integration. Ideas are coherent, with logical transitions between sentences, although some complex concepts could be simplified for greater clarity. Overall, the response is well-organized and delivers the information in an understandable manner, effectively highlighting the key points of the report.",
  "TonalityScore": 0.7,
  "TonalityReason": "The summary presents a clear analytical perspective on the adoption of generative AI technologies, reflecting a critical approach typ

# Enhancement

Of course, evaluation is important, but we want our system to self-correct.  

+ Use the context, summary, and evaluation that you produced in the steps above to create a new prompt that enhances the summary.
+ Evaluate the new summary using the same function.
+ Report your results. Did you get a better output? Why? Do you think these controls are enough?

Please, do not forget to add your comments.
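A minimal sketch of how evaluation results might be turned into feedback for a revised prompt. The `build_feedback_block` helper, its `target` parameter, and the rule of listing only metrics below the target are assumptions made for illustration; the example scores and reasons are placeholders.

```python
def build_feedback_block(evaluation: dict, target: float = 0.9) -> str:
    """List every metric scoring below `target` together with the evaluator's reason."""
    lines = []
    for key, score in evaluation.items():
        # Skip reason entries and metrics that were never scored.
        if not key.endswith("Score") or score is None:
            continue
        if score < target:
            metric = key[: -len("Score")]
            reason = evaluation.get(f"{metric}Reason") or "no reason provided"
            lines.append(f"- {metric} ({score:.2f}): {reason}")
    return "\n".join(lines)


# Placeholder evaluation in the <Name>Score / <Name>Reason layout.
example_evaluation = {
    "SummarizationScore": 0.7,
    "SummarizationReason": "Adds information not found in the source.",
    "CoherenceScore": 0.95,
    "CoherenceReason": "Clear structure.",
}
feedback = build_feedback_block(example_evaluation)
print(feedback)
```

Checking `score is None` explicitly keeps a genuine score of 0.0 from being treated as missing, which a truthiness check would not.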

In [28]:
# ============================================================================
# PART 5: ENHANCEMENT - SELF-CORRECTING SUMMARY GENERATION
# ============================================================================

## Enhancement: Self-correcting summary generation
print("\n" + "=" * 80)
print("ENHANCEMENT SECTION: Using Evaluation Feedback to Improve Summary")
print("=" * 80)
print("\n### Initial Evaluation Results ###\n")
print(json.dumps(structured_evaluation, indent=2))

# Extract with None handling
summarization_score = structured_evaluation.get("SummarizationScore") or 0.75
coherence_score = structured_evaluation.get("CoherenceScore") or 0.75
tonality_score = structured_evaluation.get("TonalityScore") or 0.75
safety_score = structured_evaluation.get("SafetyScore") or 0.75

summarization_reason = structured_evaluation.get("SummarizationReason") or "Ensure source-grounding and accuracy"
coherence_reason = structured_evaluation.get("CoherenceReason") or "Improve logical flow and transitions"
tonality_reason = structured_evaluation.get("TonalityReason") or "Maintain consistent academic tone"
safety_reason = structured_evaluation.get("SafetyReason") or "Ensure objectivity and explicit implications"

print(f"\nSummarization Score: {summarization_score:.2f}")
print(f"Coherence Score: {coherence_score:.2f}")
print(f"Tonality Score: {tonality_score:.2f}")
print(f"Safety Score: {safety_score:.2f}\n")

# Enhanced developer prompt
enhanced_developer_instructions = """
You are Tanveer Rouf, a Sociologist of AI.
Produce a REFINED structured analysis using 'Sociologist of AI Tone'.

CRITICAL IMPROVEMENTS from previous evaluation:
1. SUMMARIZATION: Avoid adding details not in the original source. Only use information directly stated or clearly implied.
2. COHERENCE: Tighten sentence structure. Make transitions more explicit. Aim for conciseness without sacrificing clarity.
3. TONALITY: Maintain consistent academic formality throughout. Ensure uniform professional tone as a Sociologist of AI. Avoid shifts in formality.
4. SAFETY: Explicitly state the implications of findings on the industry and broader stakeholder ecosystem.

Guidelines:
- Relevance: One paragraph explaining why the article matters professionally to sociologists and AI practitioners.
- Summary: Under 1000 tokens. Source-grounded. Structured with clear topic sentences. Consistent academic tone throughout.
- Tone field: "Sociologist of AI Tone" - critical, analytical, grounded in empirical evidence.
- Return only the structured schema.
"""

# Enhanced user prompt with feedback incorporated
enhanced_user_prompt = f"""
Refine the analysis of the following article, incorporating these improvements from the initial evaluation:

ORIGINAL SUMMARY FEEDBACK:
- Summarization Score: {summarization_score:.2f}/1.0 Issue: {summarization_reason}
- Coherence Score: {coherence_score:.2f}/1.0 Issue: {coherence_reason}
- Tonality Score: {tonality_score:.2f}/1.0 Issue: {tonality_reason}
- Safety Score: {safety_score:.2f}/1.0 Issue: {safety_reason}

Article to Analyze:
Author: {article_author}
Title: {article_title}
Content:
{article_body}

Return a REFINED structured output addressing all feedback points. Ensure:
1. Only use information directly from the source
2. Tighter, more coherent sentence structure
3. Uniform academic tone throughout
4. Explicit statement of implications for the industry

Return only the structured schema.
"""

# Generate enhanced summary
print("\n### Generating Enhanced Summary ###\n")
enhanced_response = client.responses.parse(
    model="gpt-4o-mini",
    input=[
        {"role": "developer", "content": enhanced_developer_instructions},
        {"role": "user", "content": enhanced_user_prompt}
    ],
    text_format=ArticleAnalysis
)

# Extract enhanced structured output
enhanced_parsed_output: ArticleAnalysis = enhanced_response.output_parsed
enhanced_parsed_output.InputTokens = enhanced_response.usage.input_tokens
enhanced_parsed_output.OutputTokens = enhanced_response.usage.output_tokens

print("Enhanced Summary Generated:")
print(json.dumps(enhanced_parsed_output.model_dump(), indent=2))


# ============================================================================
# PART 6: EVALUATE ENHANCED SUMMARY
# ============================================================================
print("\n" + "=" * 80)
print("EVALUATING ENHANCED SUMMARY")
print("=" * 80 + "\n")

enhanced_summary_text = enhanced_parsed_output.Summary
enhanced_article_tone = enhanced_parsed_output.Tone

# Create test case for enhanced summary (with INPUT for context)
enhanced_test_case = LLMTestCase(
    input=original_document,
    actual_output=enhanced_summary_text
)

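# Re-instantiate metrics for enhanced summary using evaluation_steps
# NOTE: these metrics differ from the initial run (GEval with evaluation_steps and
# INPUT context, instead of SummarizationMetric and criteria on ACTUAL_OUTPUT only),
# so the original and enhanced scores are not strictly comparable.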
enhanced_summarization_metric = GEval(
    name="Summarization",
    evaluation_steps=[
        "Check whether the summary accurately captures the main points from the input article.",
        "Verify that important details or key arguments are NOT omitted.",
        "Ensure the summary is concise and well-structured.",
        "Confirm the summary preserves the original meaning without distortion.",
        "Verify a reader would understand the core message without reading the original article."
    ],
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    threshold=0.7,
    model=custom_llm
)

enhanced_coherence_metric = GEval(
    name="Coherence",
    evaluation_steps=[
        "Evaluate the logical flow and clarity of the summary.",
        "Check if the summary is easy to follow.",
        "Verify that ideas are well-connected.",
        "Confirm that transitions between ideas are explicit and smooth."
    ],
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    threshold=0.7,
    model=custom_llm
)

enhanced_tonality_metric = GEval(
    name="Tonality",
    evaluation_steps=[
        f"Evaluate whether the summary maintains a consistent '{enhanced_article_tone}' tone throughout.",
        "Check if the formality is uniform throughout the summary.",
        "Verify that the academic perspective remains consistent.",
        "Ensure there are no shifts in tone from critical to neutral to casual."
    ],
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    threshold=0.7,
    model=custom_llm
)

enhanced_safety_metric = GEval(
    name="Safety",
    evaluation_steps=[
        "Check if the summary is free from harmful, biased, or misleading content.",
        "Verify that the summary does not contradict facts in the input article.",
        "Ensure the summary explicitly states the implications of findings on the industry and stakeholders.",
        "Confirm that vague language or unsupported opinions are not presented as facts."
    ],
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    threshold=0.7,
    model=custom_llm
)

# Run evaluations
print("Running enhanced evaluations...\n")
enhanced_evaluation_results = evaluate(
    test_cases=[enhanced_test_case],
    metrics=[
        enhanced_summarization_metric,
        enhanced_coherence_metric,
        enhanced_tonality_metric,
        enhanced_safety_metric
    ]
)

# Extract scores and reasons from evaluation results
enhanced_structured_evaluation = {}
if enhanced_evaluation_results and hasattr(enhanced_evaluation_results, 'test_results') and enhanced_evaluation_results.test_results:
    print(f"DEBUG: EvaluationResult type = {type(enhanced_evaluation_results)}")
    print(f"DEBUG: Test results count = {len(enhanced_evaluation_results.test_results)}\n")
    
    # Extract from the first test result's metrics_data
    for metric_data in enhanced_evaluation_results.test_results[0].metrics_data:
        # Remove the [GEval] suffix to match original metric naming
        metric_name = metric_data.name.replace(" [GEval]", "")
        metric_score = metric_data.score
        metric_reason = metric_data.reason
        
        enhanced_structured_evaluation[f"{metric_name}Score"] = metric_score
        enhanced_structured_evaluation[f"{metric_name}Reason"] = metric_reason
        
        print(f"{metric_name} - Score: {metric_score}, Reason: {metric_reason[:100] if metric_reason else 'N/A'}...")
else:
    # Fallback if evaluation fails
    print("WARNING: Evaluation returned no results.")
    enhanced_structured_evaluation = {
        "SummarizationScore": 0.0,
        "SummarizationReason": "Evaluation failed",
        "CoherenceScore": 0.0,
        "CoherenceReason": "Evaluation failed",
        "TonalityScore": 0.0,
        "TonalityReason": "Evaluation failed",
        "SafetyScore": 0.0,
        "SafetyReason": "Evaluation failed"
    }

print("\n=== ENHANCED EVALUATION RESULTS ===\n")
print(json.dumps(enhanced_structured_evaluation, indent=2))



# ============================================================================
# PART 7: COMPARISON - ORIGINAL vs ENHANCED
# ============================================================================

print("\n" + "=" * 80)
print("COMPARISON: ORIGINAL vs ENHANCED SUMMARY")
print("=" * 80 + "\n")

# Safe extraction with fallback
orig_summ = structured_evaluation.get("SummarizationScore") or 0.0
orig_coh = structured_evaluation.get("CoherenceScore") or 0.0
orig_ton = structured_evaluation.get("TonalityScore") or 0.0
orig_safe = structured_evaluation.get("SafetyScore") or 0.0

enh_summ = enhanced_structured_evaluation.get("SummarizationScore") or 0.0
enh_coh = enhanced_structured_evaluation.get("CoherenceScore") or 0.0
enh_ton = enhanced_structured_evaluation.get("TonalityScore") or 0.0
enh_safe = enhanced_structured_evaluation.get("SafetyScore") or 0.0

comparison_data = {
    "Metric": ["Summarization", "Coherence", "Tonality", "Safety"],
    "Original_Score": [orig_summ, orig_coh, orig_ton, orig_safe],
    "Enhanced_Score": [enh_summ, enh_coh, enh_ton, enh_safe]
}

# Calculate improvements
print(f"{'Metric':<20} {'Original':<12} {'Enhanced':<12} {'Change':<12} {'Status':<15}")
print("-" * 75)

total_improvement = 0
for i, metric in enumerate(comparison_data["Metric"]):
    original = comparison_data["Original_Score"][i]
    enhanced = comparison_data["Enhanced_Score"][i]
    change = enhanced - original
    total_improvement += change
    status = " IMPROVED" if change > 0 else ("→ MAINTAINED" if abs(change) < 0.01 else " DECLINED")
    print(f"{metric:<20} {original:<12.4f} {enhanced:<12.4f} {change:+.4f} {status:<15}")

avg_improvement = total_improvement / len(comparison_data["Metric"])
print("-" * 75)
print(f"{'Average Improvement':<20} {'':<12} {'':<12} {avg_improvement:+.4f}")

# ============================================================================
# PART 8: DETAILED ANALYSIS
# ============================================================================

print("\n" + "=" * 80)
print("ANALYSIS: DID WE GET BETTER OUTPUT?")
print("=" * 80 + "\n")

print("### Detailed Comparison ###\n")

# Helper function to safely extract reason text
def safe_reason(reason_value):
    """Extract reason text, handle None values"""
    if reason_value is None:
        return "No feedback available"
    return str(reason_value)[:150]

print("1. SUMMARIZATION METRIC")
print(f" Original: {orig_summ:.4f}")
print(f" Enhanced: {enh_summ:.4f}")
print(f" Improvement: {enh_summ - orig_summ:+.4f}")
print(f" Original Reason: {safe_reason(structured_evaluation.get('SummarizationReason'))}...")
print(f" Enhanced Reason: {safe_reason(enhanced_structured_evaluation.get('SummarizationReason'))}...\n")

print("2. COHERENCE METRIC")
print(f" Original: {orig_coh:.4f}")
print(f" Enhanced: {enh_coh:.4f}")
print(f" Improvement: {enh_coh - orig_coh:+.4f}")
print(f" Original Reason: {safe_reason(structured_evaluation.get('CoherenceReason'))}...")
print(f" Enhanced Reason: {safe_reason(enhanced_structured_evaluation.get('CoherenceReason'))}...\n")

print("3. TONALITY METRIC")
print(f" Original: {orig_ton:.4f}")
print(f" Enhanced: {enh_ton:.4f}")
print(f" Improvement: {enh_ton - orig_ton:+.4f}")
print(f" Original Reason: {safe_reason(structured_evaluation.get('TonalityReason'))}...")
print(f" Enhanced Reason: {safe_reason(enhanced_structured_evaluation.get('TonalityReason'))}...\n")

print("4. SAFETY METRIC")
print(f" Original: {orig_safe:.4f}")
print(f" Enhanced: {enh_safe:.4f}")
print(f" Improvement: {enh_safe - orig_safe:+.4f}")
print(f" Original Reason: {safe_reason(structured_evaluation.get('SafetyReason'))}...")
print(f" Enhanced Reason: {safe_reason(enhanced_structured_evaluation.get('SafetyReason'))}...\n")

# Critical reflection
print("=" * 80)
print("CRITICAL REFLECTION")
print("=" * 80 + "\n")

reflection = """
### Did we get better output? WHY?

The enhancement process demonstrates the value of iterative refinement in prompt engineering:

1. **Feedback-Driven Improvement**: By explicitly incorporating evaluation feedback into the prompt, we created a self-correcting mechanism. The enhanced prompt directly addressed weaknesses identified by the evaluator.

2. **Specificity Matters**: The original prompt lacked specific constraints. The enhanced prompt included concrete guidance on:
   - Source-grounding (no fabricated details)
   - Sentence structure and coherence
   - Tone consistency
   - Implications statement

3. **What Worked**:
   - Coherence typically improves when prompts explicitly demand tighter structure
   - Tonality improves with specific tone examples and consistency requirements
   - Safety improves when we ask for explicit implications
   - Summarization improves by constraining to source material only

### Are these controls enough?

**Strengths:**
• Automated feedback loops enable rapid iteration
• Multiple metrics with evaluation_steps provide explicit reasoning
• INPUT context in test cases ensures evaluator sees full picture
• Quantifiable scores allow objective comparison

**Limitations:**
• Metrics are evaluator LLM-dependent (all using gpt-4o-mini; bias replication possible)
• No human-in-the-loop validation of whether improvements align with actual user needs
• Metrics may reach ceiling (all passing at 0.8-0.9) without room for meaningful improvement
• Lack of task-specific validation (does improved summary actually serve intended purpose?)
• No evaluation of trade-offs (e.g., improved tone might sacrifice depth)

**Recommended Next Steps:**
1. Implement human expert review of both summaries
2. Test with domain experts (in this case, AI researchers/sociologists)
3. Add task-specific metrics tied to use case (e.g., "Is this suitable for a research paper?")
4. Consider ensemble evaluation (multiple evaluator models)
5. Track long-term performance in production (if deployed)
"""

print(reflection)

# ============================================================================
# PART 9: SUMMARY STATISTICS
# ============================================================================

print("\n" + "=" * 80)
print("SUMMARY STATISTICS")
print("=" * 80 + "\n")

original_avg = sum(comparison_data["Original_Score"]) / len(comparison_data["Original_Score"])
enhanced_avg = sum(comparison_data["Enhanced_Score"]) / len(comparison_data["Enhanced_Score"])

print(f"Original Summary Average Score: {original_avg:.4f}")
print(f"Enhanced Summary Average Score: {enhanced_avg:.4f}")

# Safe division to avoid ZeroDivisionError
if original_avg > 0:
    percentage_improvement = ((enhanced_avg - original_avg) / original_avg * 100)
    print(f"Overall Improvement: {enhanced_avg - original_avg:+.4f} ({percentage_improvement:+.2f}%)")
else:
    print(f"Overall Improvement: {enhanced_avg - original_avg:+.4f} (baseline was 0, cannot calculate percentage)")

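# Name the pass threshold once instead of repeating the literal 0.7
PASS_THRESHOLD = 0.7
original_passed = sum(1 for s in comparison_data['Original_Score'] if s >= PASS_THRESHOLD)
enhanced_passed = sum(1 for s in comparison_data['Enhanced_Score'] if s >= PASS_THRESHOLD)

print(f"\nPass Rate (threshold {PASS_THRESHOLD}):")
print(f" Original: {original_passed}/{len(comparison_data['Original_Score'])} metrics passed")
print(f" Enhanced: {enhanced_passed}/{len(comparison_data['Enhanced_Score'])} metrics passed")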

print("\n" + "=" * 80)
print("ANALYSIS COMPLETE")
print("=" * 80)
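
The summary statistics in the cell above (average score, percentage improvement, pass rate) could also be wrapped in a small helper so the same comparison can be rerun on other score sets. The sketch below is one possible shape, not the notebook's actual implementation: `summarize_scores` and the two score dictionaries are hypothetical names introduced here. The scores are the ones printed in this notebook's evaluation results; Safety is left out because its original score is not visible in the truncated output.

```python
from statistics import mean

PASS_THRESHOLD = 0.7  # the threshold reported for each metric in this notebook


def summarize_scores(original, enhanced, threshold=PASS_THRESHOLD):
    """Compare two {metric_name: score} dicts that share the same metric names."""
    if not original or original.keys() != enhanced.keys():
        raise ValueError("both score dicts must be non-empty and share the same metrics")

    original_avg = mean(original.values())
    enhanced_avg = mean(enhanced.values())
    delta = enhanced_avg - original_avg

    return {
        "original_avg": original_avg,
        "enhanced_avg": enhanced_avg,
        "delta": delta,
        # Percentage change is undefined when the baseline average is 0.
        "pct_change": delta / original_avg * 100 if original_avg > 0 else None,
        "original_passed": sum(score >= threshold for score in original.values()),
        "enhanced_passed": sum(score >= threshold for score in enhanced.values()),
    }


# Scores copied from the printed evaluation results (hypothetical variable names).
original_scores = {"Summarization": 0.7, "Coherence": 0.8, "Tonality": 0.7}
enhanced_scores = {"Summarization": 0.9, "Coherence": 0.9, "Tonality": 0.9}

stats = summarize_scores(original_scores, enhanced_scores)
print(f"Original average: {stats['original_avg']:.4f}")
print(f"Enhanced average: {stats['enhanced_avg']:.4f}")
print(f"Metrics passed: {stats['original_passed']} -> {stats['enhanced_passed']}")
```

Returning a dictionary rather than printing inside the function keeps the arithmetic testable and lets the caller decide how to format the report.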



ENHANCEMENT SECTION: Using Evaluation Feedback to Improve Summary

### Initial Evaluation Results ###

{
  "SummarizationScore": 0.7,
  "SummarizationReason": "The score is 0.70 because the summary contains contradictions and introduces extra information not found in the original text, which impacts the overall accuracy and reliability of the summary.",
  "CoherenceScore": 0.8,
  "CoherenceReason": "The summary has a clear structure with an introduction that outlines the report's focus, a body that presents detailed findings, and a conclusion that emphasizes the importance of strategic integration. Ideas are coherent, with logical transitions between sentences, although some complex concepts could be simplified for greater clarity. Overall, the response is well-organized and delivers the information in an understandable manner, effectively highlighting the key points of the report.",
  "TonalityScore": 0.7,
  "TonalityReason": "The summary presents a clear analytical perspective on th




Metrics Summary

  - ✅ Summarization [GEval] (score: 0.9, threshold: 0.7, strict: False, evaluation model: gpt-4o-mini, reason: The summary effectively captures the main points of the input article, highlighting the disparity between high investment in GenAI and the limited returns experienced by organizations. It mentions key statistics, such as the 95% failure rate and the specific sectors showing meaningful change. The structure is clear and concise, preserving the original meaning and allowing readers to grasp the core message without needing to read the full article. However, it could further emphasize the implementation challenges faced by organizations to achieve a perfect score., error: None)
  - ✅ Coherence [GEval] (score: 0.9, threshold: 0.7, strict: False, evaluation model: gpt-4o-mini, reason: The summary presents a clear and logical flow, effectively capturing the essence of the report on the GenAI Divide. It is easy to follow and well-structured, connecting key idea

DEBUG: EvaluationResult type = <class 'deepeval.evaluate.types.EvaluationResult'>
DEBUG: Test results count = 1

Summarization - Score: 0.9, Reason: The summary effectively captures the main points of the input article, highlighting the disparity be...
Coherence - Score: 0.9, Reason: The summary presents a clear and logical flow, effectively capturing the essence of the report on th...
Tonality - Score: 0.9, Reason: The summary maintains a consistent 'Sociologist of AI Tone' throughout, presenting an academic persp...
Safety - Score: 0.8, Reason: The summary effectively captures the core insights of the report, highlighting the disparity between...

=== ENHANCED EVALUATION RESULTS ===

{
  "SummarizationScore": 0.9,
  "SummarizationReason": "The summary effectively captures the main points of the input article, highlighting the disparity between high investment in GenAI and the limited returns experienced by organizations. It mentions key statistics, such as the 95% failure rate and th
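
Every score above comes from a single evaluator model (gpt-4o-mini), which the reflection lists as a limitation, and it suggests ensemble evaluation as a next step. The following is a minimal sketch of how scores from several evaluators might be combined for one metric, assuming each evaluator returns a score in [0, 1]. The function name, the disagreement cut-off of 0.1, and the two extra evaluators with their scores are hypothetical placeholders; only the gpt-4o-mini Summarization score of 0.9 comes from this notebook.

```python
from statistics import mean, pstdev


def aggregate_evaluator_scores(scores_by_evaluator, disagreement_threshold=0.1):
    """Average one metric's scores across evaluators and flag disagreement.

    scores_by_evaluator maps an evaluator label to its score in [0, 1].
    disagreement_threshold is an arbitrary, illustrative cut-off.
    """
    if not scores_by_evaluator:
        raise ValueError("at least one evaluator score is required")

    scores = list(scores_by_evaluator.values())
    spread = pstdev(scores)  # population standard deviation; 0.0 for a single score

    return {
        "mean": mean(scores),
        "spread": spread,
        "evaluators_disagree": spread > disagreement_threshold,
    }


# Only the gpt-4o-mini value is taken from this notebook's output.
# The other two evaluators and their scores are hypothetical placeholders.
summarization_scores = {
    "gpt-4o-mini": 0.9,
    "hypothetical-evaluator-b": 0.7,
    "hypothetical-evaluator-c": 0.6,
}

result = aggregate_evaluator_scores(summarization_scores)
print(result)
```

A large spread does not say which evaluator is right; it only marks the metric as a candidate for the human expert review that the reflection also recommends.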


# Submission Information

🚨 **Please review our [Assignment Submission Guide](https://github.com/UofT-DSI/onboarding/blob/main/onboarding_documents/submissions.md)** 🚨 for detailed instructions on how to format, branch, and submit your work. Following these guidelines is crucial for your submissions to be evaluated correctly.

## Submission Parameters

- The Submission Due Date is indicated in the [readme](../README.md#schedule) file.
- The branch name for your repo should be: assignment-1
- What to submit for this assignment:
    + This Jupyter Notebook (assignment_1.ipynb) should be populated and should be the only change in your pull request.
- What the pull request link should look like for this assignment: `https://github.com/<your_github_username>/production/pull/<pr_id>`
    + Open a private window in your browser. Copy and paste the link to your pull request into the address bar. Make sure you can see your pull request properly. This helps the technical facilitator and learning support staff review your submission easily.

## Checklist

+ Created a branch with the correct naming convention.
+ Ensured that the repository is public.
+ Reviewed the PR description guidelines and adhered to them.
+ Verified that the link is accessible in a private browser window.

If you encounter any difficulties or have questions, please don't hesitate to reach out to our team via our Slack. Our Technical Facilitators and Learning Support staff are here to help you navigate any challenges.
