# Deploying AI
## Assignment 1: Evaluating Summaries

A key application of LLMs is to summarize documents. In this assignment, we will not only summarize documents, but also evaluate the quality of the summary and return the results using structured outputs.

**Instructions:** Please complete the sections below, stating any relevant decisions that you have made and showing the code that substantiates your solution.

## Select a Document

Please select one out of the following articles:

+ [Managing Oneself, by Peter Drucker](https://www.thecompleteleader.org/sites/default/files/imce/Managing%20Oneself_Drucker_HBR.pdf) (PDF)
+ [The GenAI Divide: State of AI in Business 2025](https://www.artificialintelligence-news.com/wp-content/uploads/2025/08/ai_report_2025.pdf) (PDF)
+ [What is Noise?, by Alex Ross](https://www.newyorker.com/magazine/2024/04/22/what-is-noise) (Web)

# Load Secrets

In [1]:
%load_ext dotenv
%dotenv ../05_src/.secrets

## Load Document

Depending on your choice, you can consult the appropriate set of functions below. Make sure that you understand the extracted content and whether you need to perform any additional operations (like joining page content).
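
As a quick way to check what a loader actually extracted, the sketch below measures how much of a string is readable text. The helper `readable_ratio` and the sample string are illustrative assumptions and are not part of LangChain; a low ratio would suggest that binary content (for example, raw PDF bytes) was read as if it were text.

```python
def readable_ratio(text: str) -> float:
    """Return the fraction of characters that look like readable text.

    Hypothetical helper for illustration: counts printable characters and
    common whitespace, and treats the Unicode replacement character as
    unreadable because it usually marks a decoding failure.
    """
    if not text:
        return 0.0
    readable = sum(
        1
        for ch in text
        if ch != "\ufffd" and (ch.isprintable() or ch in "\n\r\t")
    )
    return readable / len(text)


# Illustrative sample only; in the notebook you would pass `document_text`.
sample = "pg. 1\nThe GenAI Divide"
print(f"Readable ratio: {readable_ratio(sample):.2f}")
```

Any threshold for "too low" is a judgment call; inspecting the first few hundred characters of the extracted text remains the most direct check.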

### PDF

You can load a PDF by following the instructions in [LangChain's documentation](https://docs.langchain.com/oss/python/langchain/knowledge-base#loading-documents). Notice that the output of the loading procedure is a collection of pages. You can join the pages by using the code below.

```python
document_text = ""
for page in docs:
    document_text += page.page_content + "\n"
```

### Web

LangChain also provides a set of web loaders, including the [WebBaseLoader](https://docs.langchain.com/oss/python/integrations/document_loaders/web_base). You can use this loader to load web pages.

In [8]:
# Loading pdf using WebBaseLoader
from langchain_community.document_loaders import WebBaseLoader

url = "https://www.artificialintelligence-news.com/wp-content/uploads/2025/08/ai_report_2025.pdf"
loader = WebBaseLoader(url)
docs = loader.load()

print(f"\nTotal documents loaded: {len(docs)}")

document_text = ""
for doc in docs:
    document_text += doc.page_content + "\n"

print(f"\nTotal text length: {len(document_text)} characters")



Total documents loaded: 1

Total text length: 860728 characters


In [9]:
# Loading pdf from the downloaded file
from langchain_community.document_loaders.pdf import PyPDFLoader

# Note: ai_report_2025 refers to 'The GenAI Divide: State of AI in Business 2025'
file_path = "ai_report_2025.pdf"        
loader = PyPDFLoader(file_path)
docs = loader.load()

print(f"\nTotal pages loaded: {len(docs)}")


Total pages loaded: 26


In [11]:
# Concatenation of the pages
document_text = ""
for page in docs:
    document_text += page.page_content + "\n"

print("\nFirst page content:")
print(docs[0].page_content[:600])


First page content:
pg. 1 
 
 
The GenAI Divide  
STATE OF AI IN 
BUSINESS 2025 
 
 
 
 
 
 
MIT NANDA 
Aditya Challapally 
Chris Pease 
Ramesh Raskar 
Pradyumna Chari 
July 2025


## Generation Task

Using the OpenAI SDK, please create a **structured output** with the following specifications:

+ Use a model that is NOT in the GPT-5 family.
+ Output should be a Pydantic BaseModel object. The fields of the object should be:

    - Author
    - Title
    - Relevance: a statement, no longer than one paragraph, that explains why this article is relevant for an AI professional in their professional development.
    - Summary: a concise and succinct summary no longer than 1000 tokens.
    - Tone: the tone used to produce the summary (see below).
    - InputTokens: number of input tokens (obtain this from the response object).
    - OutputTokens: number of tokens in output (obtain this from the response object).
       
+ The summary should be written using a specific and distinguishable tone, for example, "Victorian English", "African-American Vernacular English", "Formal Academic Writing", "Bureaucratese" ([the obscure language of bureaucrats](https://tumblr.austinkleon.com/post/4836251885)), "Legalese" (legal language), or any other distinguishable style of your preference. Make sure that the style is something you can identify.
+ In your implementation please make sure to use the following:

    - Instructions and context should be stored separately and the context should be added dynamically. Do not hard-code your prompt, instead use formatted strings or an equivalent technique.
    - Use the developer (instructions) prompt and the user prompt.
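
The separation of instructions and context described above can be sketched with templates that are filled in at call time. This is a minimal illustration, assuming hypothetical names (`INSTRUCTIONS_TEMPLATE`, `USER_TEMPLATE`, `build_messages`) and placeholder prompt wording; it only builds the message list and does not call any API. `string.Template` is used here because document text that contains braces or dollar signs is inserted verbatim.

```python
from string import Template

# Hypothetical templates; the wording is a placeholder, not a required prompt.
INSTRUCTIONS_TEMPLATE = Template(
    "You are an analyst who summarizes reports. Write the summary in $tone."
)
USER_TEMPLATE = Template(
    "Please analyze the following document.\n\nDOCUMENT CONTENT:\n$document"
)


def build_messages(document: str, tone: str) -> list:
    """Return developer and user messages with the context added dynamically."""
    return [
        {"role": "developer", "content": INSTRUCTIONS_TEMPLATE.substitute(tone=tone)},
        {"role": "user", "content": USER_TEMPLATE.substitute(document=document)},
    ]


# Usage with placeholder values.
messages = build_messages(document="(document text goes here)", tone="Legalese")
for message in messages:
    print(message["role"], "->", message["content"][:60])
```

Keeping the templates as module-level constants means the tone or the document can change without editing the prompt text itself.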


In [None]:
from openai import OpenAI
from pydantic import BaseModel, Field
from typing import Literal
import os
import json

client = OpenAI()

## Define the Pydantic Base Model

class ArticleAnalysis(BaseModel):
    Author: str = Field(description="The author(s) of the document")
    Title: str = Field(description="The title of the document")
    Relevance: str
    Summary: str = Field(description="A concise summary of the document (max 1000 tokens)")
    Tone: str
    InputTokens: int = Field(description="Number of input tokens used")
    OutputTokens: int = Field(description="Number of output tokens used")


## Define Developer Instructions

developer_instructions = """You are an expert AI analyst specializing in summarizing technical reports for AI professionals.

Your task is to analyze the provided document and create a structured output with the following:
1. Extract the author and title
2. Explain the document's relevance for AI professionals (max one paragraph)
3. Write a concise summary (max 1000 tokens) in the specified tone
4. The summary should be written in {tone_style}

Ensure your summary captures key insights, trends, and actionable information while maintaining the specified tone throughout."""

tone = "Victorian English with formal eloquence"
b_tone = "Bureaucratese with excessive jargon and redundancy"

# Define the Formatted instructions with the tone
formatted_instructions = developer_instructions.format(tone_style=tone)

# Define the user prompt template (placeholders are filled in below)
user_prompt = """Please analyze the following document and provide a structured output:

DOCUMENT CONTENT:
{document_content}

Remember to:
- Extract accurate author and title information
- Explain relevance for AI professionals in one paragraph
- Write the summary in {tone_style}
- Keep the summary under 1000 tokens
- Capture key insights and actionable information"""

# Format the user prompt with the document and the tone
formatted_user_prompt = user_prompt.format(
    document_content=document_text,
    tone_style=tone
)

# Calling the OpenAI API with structured output
print("Calling OpenAI API with structured output...")
print(f"Using tone: {tone}\n")

# Request a response parsed into the ArticleAnalysis model
response = client.chat.completions.parse(
    model="gpt-4o",
    messages=[
        {"role": "developer", "content": formatted_instructions},
        {"role": "user", "content": formatted_user_prompt}
    ],
    response_format=ArticleAnalysis,
    temperature=0.5
)

# Extract the structured output
analysis = response.choices[0].message.parsed

# Getting token counts from the API response
analysis.InputTokens = response.usage.prompt_tokens # type: ignore
analysis.OutputTokens = response.usage.completion_tokens # type: ignore

# Display results
print("STRUCTURED OUTPUT RESULT")
print("-"*90)
print(f"\nAuthor: {analysis.Author}") # type: ignore
print(f"\nTitle: {analysis.Title}") # type: ignore
print(f"\nTone Used: {analysis.Tone}") # type: ignore
print(f"\nRelevance:\n{analysis.Relevance}") # type: ignore
print(f"\nSummary:\n{analysis.Summary}") # type: ignore

print(f"\nToken Usage:")
print(f"  - Input Tokens: {analysis.InputTokens}") # type: ignore
print(f"  - Output Tokens: {analysis.OutputTokens}") # type: ignore
print(f"  - Total Tokens: {analysis.InputTokens + analysis.OutputTokens}") # type: ignore
print("-"*90)

# Display JSON (model_dump_json returns a JSON string, not a dict)
output_json = analysis.model_dump_json() # type: ignore

print("\nOutput JSON response:")

print(output_json)

Calling OpenAI API with structured output...
Using tone: Victorian English with formal eloquence

STRUCTURED OUTPUT RESULT
------------------------------------------------------------------------------------------

Author: Aditya Challapally, Chris Pease, Ramesh Raskar, Pradyumna Chari

Title: The GenAI Divide: State of AI in Business 2025

Tone Used: Victorian English with formal eloquence

Relevance:
This document is of paramount significance to AI professionals as it elucidates the stark dichotomy observed in the adoption and integration of Generative AI (GenAI) within business enterprises. Despite substantial investments, the report highlights a pervasive GenAI Divide, where a mere fraction of organizations reap substantial benefits, while the majority languish in unproductive pilot projects. Understanding the barriers and strategies delineated in this report can guide AI practitioners in bridging this divide, fostering the development of adaptive, learning-capable AI systems that 

# Evaluate the Summary

Use the DeepEval library to evaluate the **summary** as follows:

+ Summarization Metric:

    - Use the [Summarization metric](https://deepeval.com/docs/metrics-summarization) with a **bespoke** set of assessment questions.
    - Please use at least five assessment questions.

+ G-Eval metrics:

    - In addition to the standard summarization metric above, please implement three evaluation metrics: 
    
        - [Coherence or clarity](https://deepeval.com/docs/metrics-llm-evals#coherence)
        - [Tonality](https://deepeval.com/docs/metrics-llm-evals#tonality)
        - [Safety](https://deepeval.com/docs/metrics-llm-evals#safety)

    - For each one of the metrics above, implement five assessment questions.

+ The output should be structured and contain one key-value pair to report the score and another pair to report the explanation:

    - SummarizationScore
    - SummarizationReason
    - CoherenceScore
    - CoherenceReason
    - ...
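
One way to produce the key-value pairs listed above is to loop over the metrics and collect each score and reason into a flat dictionary. The sketch below assumes only that each metric exposes `measure(test_case)`, `score`, and `reason`, as the metrics in this notebook do; `StubMetric` and `evaluate_summary` are hypothetical names, and the stub values are placeholders rather than real evaluation results.

```python
class StubMetric:
    """Hypothetical stand-in with the measure/score/reason interface."""

    def __init__(self, score: float, reason: str):
        self._score = score
        self._reason = reason
        self.score = None
        self.reason = None

    def measure(self, test_case) -> float:
        # A real metric would call an LLM here; the stub returns fixed values.
        self.score = self._score
        self.reason = self._reason
        return self.score


def evaluate_summary(metrics: dict, test_case) -> dict:
    """Run every metric and return '<Name>Score' / '<Name>Reason' pairs."""
    report = {}
    for name, metric in metrics.items():
        metric.measure(test_case)
        report[f"{name}Score"] = metric.score
        report[f"{name}Reason"] = metric.reason
    return report


# Usage with placeholder metrics and no real test case.
report = evaluate_summary(
    {
        "Summarization": StubMetric(0.5, "placeholder reason"),
        "Coherence": StubMetric(0.8, "placeholder reason"),
    },
    test_case=None,
)
print(report)
```

Wrapping the evaluation in a single function also makes it straightforward to apply the same evaluation to a revised summary later.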

In [21]:
from deepeval import evaluate
from deepeval.metrics import SummarizationMetric, GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from pydantic import BaseModel, Field

# Define bespoke assessment questions for Summarization
summarization_questions = [
    "Does the summary accurately capture the main themes about AI adoption and business transformation discussed in the document?",
    "Are the key statistics and data points about the 'GenAI Divide' properly represented in the summary?",
    "Does the summary include the critical challenges and opportunities identified for AI professionals?",
    "Are the strategic recommendations and best practices from the original document reflected in the summary?",
    "Does the summary maintain the appropriate balance between technical insights and business implications?"
]

# Define assessment questions for Coherence
coherence_questions = [
    "Are the ideas in the summary presented in a logical and well-organized sequence?",
    "Do the sentences and paragraphs flow smoothly from one to another without abrupt transitions?",
    "Is the language clear, precise, and easy to understand for the target audience of AI professionals?",
    "Are technical terms and concepts explained or contextualized appropriately?",
    "Does the summary avoid contradictions and maintain internal consistency throughout?"
]

# Define assessment questions for Tonality
tonality_questions = [
    "Does the summary consistently maintain the specified Victorian English tone with elaborate prose?",
    "Is the formal and eloquent style appropriate and effective for conveying technical AI content?",
    "Does the tone enhance or detract from the professional credibility of the summary?",
    "Are the linguistic choices (vocabulary, sentence structure) consistent with the intended tone throughout?",
    "Does the tone strike an appropriate balance between stylistic flair and information clarity?"
]

# Define assessment questions for Safety
safety_questions = [
    "Is the summary free from harmful, biased, or discriminatory language regarding AI technology or its applications?",
    "Does the content avoid making unsubstantiated or misleading claims about AI capabilities?",
    "Are the potential risks and ethical considerations of AI presented in a balanced and responsible manner?",
    "Does the summary avoid promoting unsafe practices or misuse of AI technology?",
    "Is the content appropriate and professional for all audiences without offensive or sensitive material?"
]

# Summarization Metric with bespoke questions
summarization_metric = SummarizationMetric(
    threshold=0.7,
    model="gpt-4o",
    assessment_questions=summarization_questions,
    include_reason=True,
    verbose_mode=False
)

# Coherence Metric (G-Eval)
coherence_metric = GEval(
    name="Coherence",
    criteria="Coherence - Evaluate the logical flow, organization, and clarity of the summary",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    evaluation_steps=coherence_questions,
    threshold=0.7,
    model="gpt-4o",
    verbose_mode=False
)

# Tonality Metric
tonality_metric = GEval(
    name="Tonality",
    criteria=f"Tonality - Evaluate how well the summary maintains the specified tone: {tone}",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    evaluation_steps=tonality_questions,
    threshold=0.7,
    model="gpt-4o",
    verbose_mode=False
)

# Safety Metric
safety_metric = GEval(
    name="Safety",
    criteria="Safety - Evaluate whether the content is safe, unbiased, accurate, and ethically responsible",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    evaluation_steps=safety_questions,
    threshold=0.8,
    model="gpt-4o",
    verbose_mode=False
)

# Generate the test_case
print("Creating test case...")
test_case = LLMTestCase(
    input=formatted_user_prompt,  # The original prompt
    actual_output=analysis.Summary,  # The generated summary # type: ignore
    context=[document_text]  # The original document for reference
)

# Displaying Summarization
summarization_metric.measure(test_case)
summarization_score = summarization_metric.score
summarization_reason = summarization_metric.reason

print("Evaluating Summarization Quality...")
print(f"✓ Summarization Score: {summarization_score}")
print(f"✓ Summarization Reason: {summarization_reason}\n")


# Displaying Coherence
coherence_metric.measure(test_case)
coherence_score = coherence_metric.score
coherence_reason = coherence_metric.reason

print("Evaluating Coherence ...")
print(f"✓ Coherence Score: {coherence_score}")
print(f"✓ Reason: {coherence_reason}\n")


# Displaying Tonality
tonality_metric.measure(test_case)
tonality_score = tonality_metric.score
tonality_reason = tonality_metric.reason

print("Evaluating Tonality...")
print(f"✓ Tonality Score: {tonality_score}")
print(f"✓ Reason: {tonality_reason}\n")


# Displaying Safety
safety_metric.measure(test_case)
safety_score = safety_metric.score
safety_reason = safety_metric.reason

print("Evaluating Safety...")
print(f"✓ Safety Score: {safety_score}")
print(f"✓ Reason: {safety_reason}\n")

Creating test case...


Evaluating Summarization Quality...
✓ Summarization Score: 0.5714285714285714
✓ Summarization Reason: The score is 0.57 because the summary contains significant contradictions and extra information not present in the original text. Key details, such as the authorship by MIT NANDA and the number of sectors discussed, are inaccurately represented. Additionally, the summary introduces new concepts and suggestions not found in the original text, leading to a misalignment between the two.



Evaluating Coherence ...
✓ Coherence Score: 0.847147240811382
✓ Reason: The summary is well-organized and presents ideas logically, with a clear sequence from the introduction of the GenAI Divide to the conclusion. Sentences and paragraphs flow smoothly, maintaining coherence. The language is clear and precise, suitable for AI professionals, though the archaic style may slightly detract from clarity. Technical terms are contextualized, and the summary maintains internal consistency without contradictions. However, the use of archaic language could be seen as a stylistic choice that might not align perfectly with the target audience's expectations.



Evaluating Tonality...
✓ Tonality Score: 0.8939913349405211
✓ Reason: The summary consistently maintains a Victorian English tone with elaborate prose, effectively using formal and eloquent language to convey technical AI content. The tone enhances the professional credibility of the summary, with linguistic choices such as 'hath', 'doth', and 'beseech' aligning with the intended style. The balance between stylistic flair and information clarity is well-maintained, though the complexity of the language may slightly detract from immediate clarity for all readers.



Evaluating Safety...
✓ Safety Score: 0.8842787382128261
✓ Reason: The summary is free from harmful, biased, or discriminatory language regarding AI technology, aligning well with the first evaluation step. It avoids making unsubstantiated claims about AI capabilities, focusing instead on specific challenges and outcomes, which addresses the second step. The potential risks and ethical considerations, such as the 'shadow AI economy', are presented in a balanced manner, fulfilling the third step. The content does not promote unsafe practices, adhering to the fourth step. Lastly, the language is professional and appropriate for all audiences, meeting the fifth step. The only minor shortcoming is the use of archaic language, which might slightly detract from clarity for some readers.



# Enhancement

Of course, evaluation is important, but we want our system to self-correct.  

+ Use the context, summary, and evaluation that you produced in the steps above to create a new prompt that enhances the summary.
+ Evaluate the new summary using the same function.
+ Report your results. Did you get a better output? Why? Do you think these controls are enough?

Please, do not forget to add your comments.
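
The enhancement prompt described above can be assembled from the three ingredients already available: the document, the first summary, and the reasons reported by the metrics. This is a sketch under stated assumptions: `ENHANCEMENT_TEMPLATE` and `build_enhancement_prompt` are hypothetical names, the template wording is a placeholder, and the example inputs are dummy strings rather than real results.

```python
from string import Template

# Hypothetical template; adapt the wording to your own instructions.
ENHANCEMENT_TEMPLATE = Template(
    "Revise the summary below so that it addresses the evaluation feedback.\n\n"
    "EVALUATION FEEDBACK:\n$feedback\n\n"
    "CURRENT SUMMARY:\n$summary\n\n"
    "DOCUMENT CONTENT:\n$document"
)


def build_enhancement_prompt(document: str, summary: str, reasons: dict) -> str:
    """Combine context, summary, and metric reasons into one user prompt."""
    feedback = "\n".join(f"- {name}: {reason}" for name, reason in reasons.items())
    return ENHANCEMENT_TEMPLATE.substitute(
        feedback=feedback, summary=summary, document=document
    )


# Usage with dummy values; in the notebook these would come from earlier cells.
example_prompt = build_enhancement_prompt(
    document="(full document text)",
    summary="(first summary)",
    reasons={"Summarization": "(reason reported by the metric)"},
)
print(example_prompt)
```

The resulting string would be sent as the user prompt in a second generation call, and the new summary evaluated with the same metrics so that the scores are comparable.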


# Submission Information

🚨 **Please review our [Assignment Submission Guide](https://github.com/UofT-DSI/onboarding/blob/main/onboarding_documents/submissions.md)** 🚨 for detailed instructions on how to format, branch, and submit your work. Following these guidelines is crucial for your submissions to be evaluated correctly.

## Submission Parameters

- The Submission Due Date is indicated in the [readme](../README.md#schedule) file.
- The branch name for your repo should be: assignment-1
- What to submit for this assignment:
    + This Jupyter Notebook (assignment_1.ipynb) should be populated and should be the only change in your pull request.
- What the pull request link should look like for this assignment: `https://github.com/<your_github_username>/production/pull/<pr_id>`
    + Open a private window in your browser. Copy and paste the link to your pull request into the address bar. Make sure you can see your pull request properly. This helps the technical facilitator and learning support staff review your submission easily.

## Checklist

+ Created a branch with the correct naming convention.
+ Ensured that the repository is public.
+ Reviewed the PR description guidelines and adhered to them.
+ Verified that the link is accessible in a private browser window.

If you encounter any difficulties or have questions, please don't hesitate to reach out to our team via our Slack. Our Technical Facilitators and Learning Support staff are here to help you navigate any challenges.
