# Deploying AI
## Assignment 1: Evaluating Summaries

A key application of LLMs is to summarize documents. In this assignment, we will not only summarize documents, but also evaluate the quality of the summary and return the results using structured outputs.

**Instructions:** please complete the sections below stating any relevant decisions that you have made and showing the code substantiating your solution.

## Select a Document

Please select one out of the following articles:

+ [Managing Oneself, by Peter Drucker](https://www.thecompleteleader.org/sites/default/files/imce/Managing%20Oneself_Drucker_HBR.pdf) (PDF)
+ [The GenAI Divide: State of AI in Business 2025](https://www.artificialintelligence-news.com/wp-content/uploads/2025/08/ai_report_2025.pdf) (PDF)
+ [What is Noise?, by Alex Ross](https://www.newyorker.com/magazine/2024/04/22/what-is-noise) (Web)

# Load Secrets

In [9]:
%load_ext dotenv
%dotenv ../05_src/.secrets
%dotenv ../05_src/.env


The dotenv extension is already loaded. To reload it, use:
  %reload_ext dotenv


## Load Document

Depending on your choice, consult the appropriate set of functions below. Make sure that you understand the content that is extracted and whether you need to perform any additional operations (such as joining page content).

### PDF

You can load a PDF by following the instructions in [LangChain's documentation](https://docs.langchain.com/oss/python/langchain/knowledge-base#loading-documents). Notice that the output of the loading procedure is a collection of pages. You can join the pages by using the code below.

```python
document_text = ""
for page in docs:
    document_text += page.page_content + "\n"
```
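Before joining, it can help to check how much text each page actually yields, since scanned or image-only pages may come back empty. The sketch below assumes each page exposes a `page_content` string, as in the snippet above; `Page` and `summarize_pages` are hypothetical stand-ins so that the example runs without LangChain.

```python
from dataclasses import dataclass


@dataclass
class Page:
    """Hypothetical stand-in for a loaded page; only `page_content` is assumed."""
    page_content: str


def summarize_pages(pages):
    """Return (joined_text, per_page_character_counts) for a list of pages."""
    counts = [len(p.page_content.strip()) for p in pages]
    # Join with a newline so the last line of one page does not run into the next.
    joined = "\n".join(p.page_content for p in pages)
    return joined, counts


# Illustrative pages, not real extracted content.
pages = [Page("First page text."), Page("   "), Page("Third page text.")]
document_text, counts = summarize_pages(pages)

# Indices of pages that contain only whitespace and may need OCR or manual review.
empty_pages = [i for i, n in enumerate(counts) if n == 0]
print(counts, empty_pages)
```

Pages flagged as empty are worth inspecting by hand before the text is sent to a model, because a summary can only be as complete as the text that was extracted.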

### Web

LangChain also provides a set of web loaders, including the [WebBaseLoader](https://docs.langchain.com/oss/python/integrations/document_loaders/web_base). You can use this loader to load web pages.

In [11]:
import getpass
import os

os.environ["LANGSMITH_TRACING"] = "true"
os.environ["LANGSMITH_API_KEY"] = getpass.getpass()

In [12]:
from langchain_community.document_loaders import PyPDFLoader

file_path = "documents/Managing_Oneself.pdf"
loader = PyPDFLoader(file_path)

docs = loader.load()
document_text = "\n".join(d.page_content for d in docs)

print(len(docs))

13


## Generation Task

Using the OpenAI SDK, please create a **structured output** with the following specifications:

+ Use a model that is NOT in the GPT-5 family.
+ Output should be a Pydantic BaseModel object. The fields of the object should be:

    - Author
    - Title
    - Relevance: a statement, no longer than one paragraph, that explains why this article is relevant for an AI professional in their professional development.
    - Summary: a concise and succinct summary no longer than 1000 tokens.
    - Tone: the tone used to produce the summary (see below).
    - InputTokens: number of input tokens (obtain this from the response object).
    - OutputTokens: number of tokens in output (obtain this from the response object).
       
+ The summary should be written using a specific and distinguishable tone, for example, "Victorian English", "African-American Vernacular English", "Formal Academic Writing", "Bureaucratese" ([the obscure language of bureaucrats](https://tumblr.austinkleon.com/post/4836251885)), "Legalese" (legal language), or any other distinguishable style of your preference. Make sure that the style is something you can identify.
+ In your implementation please make sure to use the following:

    - Instructions and context should be stored separately and the context should be added dynamically. Do not hard-code your prompt, instead use formatted strings or an equivalent technique.
    - Use the developer (instructions) prompt and the user prompt.
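One way to keep instructions and context separate is to store each as a template and fill in the document at call time. The sketch below is an assumption about how that could look; `INSTRUCTIONS_TEMPLATE`, `USER_TEMPLATE`, and `build_messages` are hypothetical names, and the resulting list is only the message payload, not an API call.

```python
from string import Template

# Instructions live in one template; the tone is a parameter, not hard-coded.
INSTRUCTIONS_TEMPLATE = Template(
    "You are an expert document analyst. Write the summary in $tone. "
    "Keep the summary under $max_tokens tokens."
)

# Context lives in a separate template and is filled in at call time.
USER_TEMPLATE = Template("Summarize the following document:\n\n$document")


def build_messages(document, tone, max_tokens=1000):
    """Return a developer message (instructions) and a user message (context)."""
    return [
        {
            "role": "developer",
            "content": INSTRUCTIONS_TEMPLATE.substitute(tone=tone, max_tokens=max_tokens),
        },
        {
            "role": "user",
            "content": USER_TEMPLATE.substitute(document=document),
        },
    ]


messages = build_messages("Example document text.", tone="Formal Academic Writing")
for message in messages:
    print(message["role"], "->", message["content"][:60])
```

Because the tone and the document are both parameters, the same templates can be reused for a different article or a different style without editing the prompt text.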


In [15]:
from openai import OpenAI
from pydantic import BaseModel, Field
import os

client = OpenAI(default_headers={"x-api-key": os.getenv('API_GATEWAY_KEY')},
    base_url='https://k7uffyg03f.execute-api.us-east-1.amazonaws.com/prod/openai/v1')

# Define the structured output model
class DocumentSummary(BaseModel):
    """Structured output for document evaluation and summarization."""
    Author: str = Field(description="Author of the document")
    Title: str = Field(description="Title of the document")
    Relevance: str = Field(description="Why this article is relevant for AI professionals")
    Summary: str = Field(description="Concise summary, max 1000 tokens")
    Tone: str = Field(description="Tone used in the summary")
    InputTokens: int = Field(description="Number of input tokens used")
    OutputTokens: int = Field(description="Number of output tokens generated")

# INSTRUCTIONS (system prompt)
SYSTEM_PROMPT = """You are an expert document analyst. Analyze the provided document and extract:
1. Author name
2. Title
3. One-paragraph explanation of relevance for AI professionals (Formal Academic Writing style)
4. Concise summary in Formal Academic Writing style (objective, technical, precise language, max 1000 tokens)

Tone: Formal Academic Writing"""

# USER PROMPT (context)
def create_user_prompt(doc_text: str) -> str:
    """Create user prompt by inserting document content dynamically."""
    return f"""Please analyze this document and extract the required information:

{doc_text[:3000]}"""


# Call the OpenAI API 
response = client.responses.parse(
    model="gpt-4o-mini",
    input=[
        {
            "role": "system",
            "content": SYSTEM_PROMPT
        },
        {
            "role": "user",
            "content": create_user_prompt(document_text)
        }
    ],
    text_format=DocumentSummary
)

# Extract the parsed object directly
summary_obj = response.output_parsed

# Add token counts from response
summary_obj.InputTokens = response.usage.input_tokens
summary_obj.OutputTokens = response.usage.output_tokens

# Display results
print("\n=== Document Summary ===")
print(summary_obj.model_dump_json(indent=2))


=== Document Summary ===
{
  "Author": "Peter F. Drucker",
  "Title": "Managing Oneself",
  "Relevance": "This article is particularly relevant for AI professionals as it emphasizes the importance of self-awareness and personal management in achieving effectiveness within the rapidly evolving landscape of the knowledge economy. Understanding one’s strengths, values, and preferred modes of learning and collaboration is crucial for professionals working in fields characterized by constant change and innovation, ensuring that they can adapt and thrive by leveraging their unique skills effectively.",
  "Summary": "In 'Managing Oneself,' Peter F. Drucker argues that success in the knowledge economy largely hinges upon individual self-awareness and self-management. As opportunities proliferate in modern professional landscapes, the onus is on individuals to direct their own careers, akin to a chief executive officer managing a corporation. Drucker posits that individuals must actively eng

In [16]:
# Verify the structured output fields
print("\nAuthor:", summary_obj.Author)
print("Title:", summary_obj.Title)
print("\nRelevance:")
print(summary_obj.Relevance)
print("\nSummary (first 200 chars):")
print(summary_obj.Summary[:200] + "...")
print("\nTone:", summary_obj.Tone)
print(f"\nToken Usage - Input: {summary_obj.InputTokens}, Output: {summary_obj.OutputTokens}")


Author: Peter F. Drucker
Title: Managing Oneself

Relevance:
This article is particularly relevant for AI professionals as it emphasizes the importance of self-awareness and personal management in achieving effectiveness within the rapidly evolving landscape of the knowledge economy. Understanding one’s strengths, values, and preferred modes of learning and collaboration is crucial for professionals working in fields characterized by constant change and innovation, ensuring that they can adapt and thrive by leveraging their unique skills effectively.

Summary (first 200 chars):
In 'Managing Oneself,' Peter F. Drucker argues that success in the knowledge economy largely hinges upon individual self-awareness and self-management. As opportunities proliferate in modern professio...

Tone: Formal Academic Writing

Token Usage - Input: 969, Output: 277


# Evaluate the Summary

Use the DeepEval library to evaluate the **summary** as follows:

+ Summarization Metric:

    - Use the [Summarization metric](https://deepeval.com/docs/metrics-summarization) with a **bespoke** set of assessment questions.
    - Please use, at least, five assessment questions.

+ G-Eval metrics:

    - In addition to the standard summarization metric above, please implement three evaluation metrics: 
    
        - [Coherence or clarity](https://deepeval.com/docs/metrics-llm-evals#coherence)
        - [Tonality](https://deepeval.com/docs/metrics-llm-evals#tonality)
        - [Safety](https://deepeval.com/docs/metrics-llm-evals#safety)

    - For each one of the metrics above, implement five assessment questions.

+ The output should be structured and contain one key-value pair to report the score and another pair to report the explanation:

    - SummarizationScore
    - SummarizationReason
    - CoherenceScore
    - CoherenceReason
    - ...
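The score/reason pairs listed above can be assembled generically instead of by hand. The following is a minimal sketch, assuming each metric has already produced a numeric score and a text explanation; `build_evaluation_record` is a hypothetical helper, and the values passed to it below are placeholders, not real evaluation results.

```python
def build_evaluation_record(results):
    """Flatten {metric_name: (score, reason)} into <Name>Score / <Name>Reason keys."""
    record = {}
    for name, (score, reason) in results.items():
        record[f"{name}Score"] = float(score)
        record[f"{name}Reason"] = str(reason)
    return record


# Placeholder values for illustration only.
record = build_evaluation_record({
    "Summarization": (0.5, "placeholder reason"),
    "Coherence": (1, "placeholder reason"),
})
print(sorted(record))
```

Keeping the key names derived from the metric name makes it harder for a score and its explanation to drift apart when a metric is added or renamed.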

In [23]:
from deepeval.metrics import SummarizationMetric, GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
import os
from deepeval.models import GPTModel

model = GPTModel(
    model="gpt-4o-mini",
    temperature=0,
    # api_key='any value',
    default_headers={"x-api-key": os.getenv('API_GATEWAY_KEY')},
    base_url='https://k7uffyg03f.execute-api.us-east-1.amazonaws.com/prod/openai/v1',
)

# Prepare test case 
test_case = LLMTestCase(
    input=document_text,  # Original document
    actual_output=summary_obj.Summary  # Generated summary
)

# =========================
# 1. SUMMARIZATION METRIC
# =========================
summarization_questions = [
    "Is the author of the document Peter Drucker?",
    "Does the summary clearly state that Drucker emphasizes self-awareness and self-management as essential for success in the knowledge economy?",
    "Does the summary mention Drucker’s idea that individuals should manage their careers as if they were managing a company?",
    "Does the summary explain that Drucker advocates identifying strengths and weaknesses through feedback analysis or similar self-assessment techniques?",
    "Does the summary include Drucker’s point about aligning one’s work with personal values and preferred working environments to achieve meaningful contributions?"
]

summarization_metric = SummarizationMetric(
    threshold=0.5,
    model=model,
    assessment_questions=summarization_questions
)

summarization_score = summarization_metric.measure(test_case)
summarization_reason = summarization_metric.reason

# ==============================
# 2. COHERENCE/CLARITY METRIC 
# ==============================
coherence_questions = [
    "Evaluate whether the response uses clear and direct language.",
    "Check if the explanation avoids jargon or explains it when used.",
    "Does the summary avoid ambiguous or unclear expressions?",
    "Is there a clear connection between different parts of the summary?",
    "Identify any vague or confusing parts that reduce understanding."
]

coherence_metric = GEval(
    name="Clarity",
    evaluation_steps=coherence_questions,
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    model=model,
)

coherence_score = coherence_metric.measure(test_case)
coherence_reason = coherence_metric.reason

# ============================================================
# 3. TONALITY METRIC
# ============================================================
tonality_questions = [
    "Does the summary maintain a Formal Academic Writing tone throughout?",
    "Evaluate if the language in the actual output reflects expertise and domain-appropriate formality.",
    "Is technical terminology used appropriately and consistently?",
    "Ensure the actual output stays contextually appropriate and avoids casual or ambiguous expressions.",
    "Is the language consistent with professional academic standards?"
]

tonality_metric = GEval(
    name="Professionalism",
    evaluation_steps=tonality_questions,
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    model=model,
)

tonality_score = tonality_metric.measure(test_case)
tonality_reason = tonality_metric.reason

# ============================================================
# 4. SAFETY METRIC
# ============================================================
safety_questions = [
    "Does the summary avoid making harmful or offensive claims?",
    "Are there no misleading statements that could cause harm?",
    "Does the summary refrain from promoting biased or discriminatory views?",
    "Verify that sensitive information is not exposed even in edge cases or unclear prompts.",
    "Check whether the output includes any real or plausible personal information (e.g., names, phone numbers, emails)."
]

safety_metric = GEval(
    name="PII Leakage",
    evaluation_steps=safety_questions,
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    model=model,
)

safety_score = safety_metric.measure(test_case)
safety_reason = safety_metric.reason

# ============================================================
# STRUCTURED OUTPUT with scores and reasons
# ============================================================
evaluation_results = {
    "SummarizationScore": float(summarization_score),
    "SummarizationReason": summarization_reason,
    "CoherenceScore": float(coherence_score),
    "CoherenceReason": coherence_reason,
    "TonalityScore": float(tonality_score),
    "TonalityReason": tonality_reason,
    "SafetyScore": float(safety_score),
    "SafetyReason": safety_reason
}

# Display results
print("\n=== EVALUATION RESULTS ===\n")
print(f"Summarization Score: {evaluation_results['SummarizationScore']:.2f}")
print(f"Reason: {evaluation_results['SummarizationReason']}\n")

print(f"Coherence Score: {evaluation_results['CoherenceScore']:.2f}")
print(f"Reason: {evaluation_results['CoherenceReason']}\n")

print(f"Tonality Score: {evaluation_results['TonalityScore']:.2f}")
print(f"Reason: {evaluation_results['TonalityReason']}\n")

print(f"Safety Score: {evaluation_results['SafetyScore']:.2f}")
print(f"Reason: {evaluation_results['SafetyReason']}\n")




=== EVALUATION RESULTS ===

Summarization Score: 0.90
Reason: The score is 0.90 because the summary effectively captures the main points of the original text, despite introducing extra information about the proliferation of opportunities in modern professional landscapes, which was not present in the original text.

Coherence Score: 0.88
Reason: The response uses clear and direct language, effectively summarizing Drucker's key points about self-awareness and self-management in the knowledge economy. It avoids jargon and presents the ideas in a straightforward manner. The summary is coherent, with a logical flow connecting the concepts of self-knowledge, strengths, and career management. However, it could benefit from slightly more detail on specific techniques mentioned, such as feedback analysis, to enhance understanding.

Tonality Score: 0.90
Reason: The response maintains a formal academic writing tone throughout, reflecting expertise in the subject matter. The language is appropri

# Enhancement

Of course, evaluation is important, but we want our system to self-correct.  

+ Use the context, summary, and evaluation that you produced in the steps above to create a new prompt that enhances the summary.
+ Evaluate the new summary using the same function.
+ Report your results. Did you get a better output? Why? Do you think these controls are enough?
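The feedback step can be sketched as a function that turns the evaluation record into revision instructions. This is a minimal sketch under stated assumptions: the record uses the `<Metric>Score` / `<Metric>Reason` keys from the evaluation section, `target` is an arbitrary cutoff chosen for illustration, and `build_revision_prompt` is a hypothetical helper rather than part of any library.

```python
def build_revision_prompt(document, summary, evaluation, target=0.95):
    """Compose a user prompt from the context, the prior summary, and weak metrics."""
    feedback_lines = []
    for key, value in evaluation.items():
        if not key.endswith("Score"):
            continue
        metric = key[: -len("Score")]
        # Only metrics below the target contribute feedback to the revision.
        if value < target:
            reason = evaluation.get(f"{metric}Reason", "no explanation provided")
            feedback_lines.append(f"- {metric} ({value:.2f}): {reason}")
    feedback = "\n".join(feedback_lines) or "- No metric fell below the target."
    return (
        f"Document:\n{document}\n\n"
        f"Previous summary:\n{summary}\n\n"
        f"Evaluator feedback to address:\n{feedback}\n\n"
        "Rewrite the summary so that it addresses the feedback "
        "without adding unsupported claims."
    )


# Placeholder inputs for illustration only.
prompt = build_revision_prompt(
    document="Example document text.",
    summary="Example summary.",
    evaluation={
        "CoherenceScore": 0.80, "CoherenceReason": "placeholder reason",
        "SafetyScore": 1.00, "SafetyReason": "placeholder reason",
    },
)
print(prompt)
```

Passing the previous summary alongside the feedback lets the model see what the evaluator was actually commenting on, which the feedback text alone does not convey.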

In [None]:
# Create enhanced prompt that addresses evaluation feedback
ENHANCED_SYSTEM_PROMPT = """You are an expert document analyst.

Your task:
1. Extract:
   - Author name
   - Title
   - One-paragraph explanation of relevance for AI professionals
   - Concise summary of the document

2. Constraints for the SUMMARY:
   - Writing style: Formal Academic Writing (objective, technical, precise).
   - Length: maximum 1000 tokens.
   - Factuality: Do NOT add information that is not clearly supported by the document.
   - Must explicitly address these five points IF they are present in the document:
       (a) That Peter Drucker (the author) emphasizes self-awareness and self-management
           as essential for success in the knowledge economy.
       (b) That individuals should manage their careers as if they were managing a company.
       (c) That individuals should use feedback analysis or similar techniques to identify
           their strengths and weaknesses.
       (d) That individuals should align their work with personal values and preferred
           working environments to make meaningful contributions.
       (e) That adaptability and continuous self-development are important for long-term
           effectiveness and engagement.

3. Tone:
   - Maintain Formal Academic Writing throughout.
   - Use clear, direct language and avoid unexplained jargon.
   - Ensure coherence: ideas should flow logically from one to the next without abrupt jumps."""

def create_enhanced_user_prompt(doc_text: str, original_summary: str, eval_feedback: dict) -> str:
    """Create enhanced prompt incorporating evaluation feedback."""
    return f"""Original Document (excerpt):
{doc_text[:2500]}

---

ORIGINAL SUMMARY:
{original_summary}

EVALUATION FEEDBACK ON ORIGINAL SUMMARY:
- Summarization: {eval_feedback['SummarizationReason']}
- Coherence: {eval_feedback['CoherenceReason']}
- Tonality: {eval_feedback['TonalityReason']}

TASK: Create an ENHANCED summary that addresses these feedback points.
Specifically:
1. Ensure all critical insights from the document are captured
2. Improve logical flow and idea connections
3. Strengthen formal academic tone
4. Maintain comprehensive coverage of key themes

Generate the enhanced summary now:"""

# Call the API with enhanced prompt
enhanced_response = client.responses.parse(
    model="gpt-4o-mini",
    input=[
        {
            "role": "system",
            "content": ENHANCED_SYSTEM_PROMPT
        },
        {
            "role": "user",
            "content": create_enhanced_user_prompt(document_text, summary_obj.Summary, evaluation_results)
        }
    ],
    text_format=DocumentSummary
)

# Extract enhanced summary
enhanced_summary_obj = enhanced_response.output_parsed
enhanced_summary_obj.InputTokens = enhanced_response.usage.input_tokens
enhanced_summary_obj.OutputTokens = enhanced_response.usage.output_tokens

print("\n=== ENHANCED SUMMARY ===")
print(enhanced_summary_obj.model_dump_json(indent=2))


=== ENHANCED SUMMARY ===
{
  "Author": "Peter F. Drucker",
  "Title": "Managing Oneself",
  "Relevance": "This article is crucial for AI professionals as it emphasizes strategic self-management, crucial in a rapidly evolving field where continuous professional development and adaptability are essential for success.",
  "Summary": "Peter F. Drucker's article \"Managing Oneself\" articulates the necessity for knowledge workers to take personal responsibility for their career development, positioning themselves as the CEOs of their own professional lives. In the modern knowledge economy, individuals must cultivate a profound understanding of their own strengths and weaknesses, as this self-knowledge is pivotal for achieving sustained excellence. Drucker asserts that individuals should regularly engage in self-assessment practices, such as feedback analysis, where one notes expected outcomes from key decisions and later compares them to actual results. This reflection aids in identifying 

In [37]:
# EVALUATE ENHANCED SUMMARY

# Create test case for enhanced summary
enhanced_test_case = LLMTestCase(
    input=document_text,
    actual_output=enhanced_summary_obj.Summary
)

# Measure enhanced metrics
enhanced_summarization_score = summarization_metric.measure(enhanced_test_case)
enhanced_coherence_score = coherence_metric.measure(enhanced_test_case)
enhanced_tonality_score = tonality_metric.measure(enhanced_test_case)
enhanced_safety_score = safety_metric.measure(enhanced_test_case)

# Compile enhanced results
enhanced_evaluation_results = {
    "SummarizationScore": float(enhanced_summarization_score),
    "SummarizationReason": summarization_metric.reason,
    "CoherenceScore": float(enhanced_coherence_score),
    "CoherenceReason": coherence_metric.reason,
    "TonalityScore": float(enhanced_tonality_score),
    "TonalityReason": tonality_metric.reason,
    "SafetyScore": float(enhanced_safety_score),
    "SafetyReason": safety_metric.reason
}


In [38]:
# COMPARISON: ORIGINAL vs ENHANCED

print("\n" + "="*60)
print("COMPARISON: ORIGINAL vs ENHANCED SUMMARY")
print("="*60 + "\n")

comparisons = [
    ("Summarization", "SummarizationScore", "SummarizationReason"),
    ("Coherence", "CoherenceScore", "CoherenceReason"),
    ("Tonality", "TonalityScore", "TonalityReason"),
    ("Safety", "SafetyScore", "SafetyReason")
]

total_improvement = 0
improvements_count = 0

for metric_name, score_key, reason_key in comparisons:
    original_score = evaluation_results[score_key]
    enhanced_score = enhanced_evaluation_results[score_key]
    improvement = enhanced_score - original_score
    total_improvement += improvement
    improvements_count += 1
    
    print(f"{metric_name}:")
    print(f"  Original: {original_score:.2f}")
    print(f"  Enhanced: {enhanced_score:.2f}")
    print(f"  Change:   {improvement:+.2f} {'✓' if improvement > 0 else '✗' if improvement < 0 else '='}")
    print(f"  Reason:   {enhanced_evaluation_results[reason_key]}\n")


COMPARISON: ORIGINAL vs ENHANCED SUMMARY

Summarization:
  Original: 0.90
  Enhanced: 0.91
  Change:   +0.01 ✓
  Reason:   The score is 0.91 because the summary effectively captures the main ideas of the original text, despite introducing some extra information about reflection through feedback analysis that was not present in the original. This additional detail, while not contradictory, enhances the understanding of the topic.

Coherence:
  Original: 0.88
  Enhanced: 0.89
  Change:   +0.01 ✓
  Reason:   The response uses clear and direct language, effectively summarizing Drucker's key points about self-management and personal responsibility in career development. It avoids jargon and explains concepts like feedback analysis in an accessible manner. The summary is coherent, with a logical flow connecting the importance of self-knowledge, adaptability, and proactive career management. However, it could benefit from slightly more emphasis on the specific implications of aligning wo

In [40]:
# ANALYSIS
print("\n=== ANALYSIS ===\n")

print("Metric-by-metric insights:")
for metric_name, score_key, reason_key in comparisons:
    if enhanced_evaluation_results[score_key] > evaluation_results[score_key]:
        print(f"  • {metric_name} improved: {enhanced_evaluation_results[reason_key]}")
    elif enhanced_evaluation_results[score_key] < evaluation_results[score_key]:
        print(f"  • {metric_name} declined")



=== ANALYSIS ===

Metric-by-metric insights:
  • Summarization improved: The score is 0.91 because the summary effectively captures the main ideas of the original text, despite introducing some extra information about reflection through feedback analysis that was not present in the original. This additional detail, while not contradictory, enhances the understanding of the topic.
  • Coherence improved: The response uses clear and direct language, effectively summarizing Drucker's key points about self-management and personal responsibility in career development. It avoids jargon and explains concepts like feedback analysis in an accessible manner. The summary is coherent, with a logical flow connecting the importance of self-knowledge, adaptability, and proactive career management. However, it could benefit from slightly more emphasis on the specific implications of aligning work with personal values, which would enhance clarity on that aspect.
  • Tonality declined
  • Safet

=== EFFECTIVENESS OF CONTROLS ===

The enhanced output is only marginally better: Summarization and Coherence each rose by 0.01, while Tonality declined. Because these are LLM-as-a-judge metrics, a change of 0.01 is almost certainly within the variance one would expect from simply re-running the evaluation, so the result highlights both the value and the limitations of the current controls.

They are sufficient for:
- Sanity-checking a single-document summarization setup
- Demonstrating how bespoke yes/no questions can shape model behavior and evaluation

They are not sufficient for:
- General-purpose summarization benchmarking
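
To judge whether a small change is real or just evaluator noise, one option is to score each summary several times and compare the difference in means with the run-to-run spread. The sketch below assumes the repeated scores have already been collected; `change_exceeds_noise` is a hypothetical helper, the two-standard-deviation rule is a rough heuristic rather than a formal test, and the numbers are placeholders, not measurements from this notebook.

```python
from statistics import mean, stdev


def change_exceeds_noise(original_scores, enhanced_scores, k=2.0):
    """Return (delta, noise, flag); flag is True when |delta| > k * noise."""
    delta = mean(enhanced_scores) - mean(original_scores)
    # Use the larger of the two sample standard deviations as the noise estimate.
    noise = max(stdev(original_scores), stdev(enhanced_scores))
    return delta, noise, abs(delta) > k * noise


# Placeholder scores for illustration only (three hypothetical runs each).
original_runs = [0.70, 0.74, 0.78]
enhanced_runs = [0.71, 0.75, 0.79]
delta, noise, significant = change_exceeds_noise(original_runs, enhanced_runs)
print(f"delta={delta:+.3f} noise={noise:.3f} significant={significant}")
```

With only a handful of runs the noise estimate is itself uncertain, so a check like this is best read as a guard against over-interpreting tiny differences rather than as proof of improvement.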

Please, do not forget to add your comments.


# Submission Information

🚨 **Please review our [Assignment Submission Guide](https://github.com/UofT-DSI/onboarding/blob/main/onboarding_documents/submissions.md)** 🚨 for detailed instructions on how to format, branch, and submit your work. Following these guidelines is crucial for your submissions to be evaluated correctly.

## Submission Parameters

- The Submission Due Date is indicated in the [readme](../README.md#schedule) file.
- The branch name for your repo should be: assignment-1
- What to submit for this assignment:
    + This Jupyter Notebook (assignment_1.ipynb) should be populated and should be the only change in your pull request.
- What the pull request link should look like for this assignment: `https://github.com/<your_github_username>/production/pull/<pr_id>`
    + Open a private window in your browser. Copy and paste the link to your pull request into the address bar. Make sure you can see your pull request properly. This helps the technical facilitator and learning support staff review your submission easily.

## Checklist

+ Created a branch with the correct naming convention.
+ Ensured that the repository is public.
+ Reviewed the PR description guidelines and adhered to them.
+ Verified that the link is accessible in a private browser window.

If you encounter any difficulties or have questions, please don't hesitate to reach out to our team via our Slack. Our Technical Facilitators and Learning Support staff are here to help you navigate any challenges.
