# Deploying AI
## Assignment 1: Evaluating Summaries

A key application of LLMs is to summarize documents. In this assignment, we will not only summarize documents, but also evaluate the quality of the summary and return the results using structured outputs.

**Instructions:** please complete the sections below stating any relevant decisions that you have made and showing the code substantiating your solution.

## Select a Document

Please select one out of the following articles:

+ [Managing Oneself, by Peter Drucker](https://www.thecompleteleader.org/sites/default/files/imce/Managing%20Oneself_Drucker_HBR.pdf) (PDF)
+ [The GenAI Divide: State of AI in Business 2025](https://www.artificialintelligence-news.com/wp-content/uploads/2025/08/ai_report_2025.pdf) (PDF)
+ [What is Noise?, by Alex Ross](https://www.newyorker.com/magazine/2024/04/22/what-is-noise) (Web)

# Load Secrets

In [38]:
%load_ext dotenv
%dotenv ../05_src/.secrets_grassriots

The dotenv extension is already loaded. To reload it, use:
  %reload_ext dotenv
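
Once the secrets file is loaded, it can help to confirm that the expected variable is actually visible to Python before calling any API. The sketch below is a minimal check using only the standard library; the helper name `require_env` is hypothetical, and `OPENAI_API_KEY` is the variable the OpenAI SDK reads by default.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set; check that the secrets file was loaded."
        )
    return value
```

Calling `require_env("OPENAI_API_KEY")` right after the `%dotenv` line fails early with a readable message instead of failing later inside an API call.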


## Load Document

Depending on your choice, consult the appropriate set of functions below. Make sure that you understand the extracted content and whether you need to perform any additional operations (like joining page content).

### PDF

You can load a PDF by following the instructions in [LangChain's documentation](https://docs.langchain.com/oss/python/langchain/knowledge-base#loading-documents). Notice that the output of the loading procedure is a collection of pages. You can join the pages by using the code below.

```python
document_text = ""
for page in docs:
    document_text += page.page_content + "\n"
```

### Web

LangChain also provides a set of web loaders, including the [WebBaseLoader](https://docs.langchain.com/oss/python/integrations/document_loaders/web_base). You can use this loader to load web pages.
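
As a rough illustration of the web route, the sketch below wraps `WebBaseLoader` in a helper and reuses the same page-joining idea as the PDF snippet. The helper names `join_pages` and `load_web_text` are hypothetical. The loader is imported inside the function because it assumes `langchain_community` and `beautifulsoup4` are installed; the offline part uses stand-in objects rather than real documents.

```python
from types import SimpleNamespace


def join_pages(docs) -> str:
    """Concatenate the page_content of each loaded document, one per line."""
    return "\n".join(doc.page_content for doc in docs)


def load_web_text(url: str) -> str:
    """Load a web page with WebBaseLoader and return its text."""
    # Imported lazily so that join_pages works without LangChain installed.
    from langchain_community.document_loaders import WebBaseLoader

    docs = WebBaseLoader(url).load()
    return join_pages(docs)


# Offline illustration with stand-in objects that mimic loaded documents.
sample_docs = [
    SimpleNamespace(page_content="First part."),
    SimpleNamespace(page_content="Second part."),
]
print(join_pages(sample_docs))
```

Web pages usually carry navigation and footer text, so the loaded content is worth inspecting before it is sent to a model.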

In [None]:
# UTILS: Text Cleaning Helper Functions for LLM Processing.
import re
def clean_document_text(text):
    """
    Clean and normalize document text for LLM processing.
    
    This function applies basic data cleaning techniques to prepare PDF-extracted
    text for optimal LLM processing. It handles common issues like encoding errors,
    excessive whitespace, and improper line breaks.
    
    Args:
        text (str): Raw document text extracted from PDF
        
    Returns:
        str: Cleaned and normalized text ready for LLM processing
    """
    # Step 1: Handle encoding issues gracefully
    # Remove or replace problematic characters that may cause encoding errors
    if isinstance(text, bytes):
        text = text.decode('utf-8', errors='ignore')
    
    # Handle common encoding issues by removing problematic unicode characters
    text = text.encode('utf-8', errors='ignore').decode('utf-8', errors='ignore')
    
    # Step 2: Normalize line breaks - preserve paragraph breaks but join broken sentences
    # Replace double newlines (paragraph breaks) with a temporary marker
    text = text.replace('\n\n', '|||PARAGRAPH_BREAK|||')
    # Replace single newlines with spaces (these are likely broken sentences)
    text = text.replace('\n', ' ')
    # Restore paragraph breaks
    text = text.replace('|||PARAGRAPH_BREAK|||', '\n\n')
    
    # Step 3: Normalize whitespace
    # Replace tabs with spaces
    text = text.replace('\t', ' ')
    # Replace multiple consecutive spaces with a single space
    text = re.sub(r' +', ' ', text)
    
    # Step 4: Clean up hyphenated line breaks (common in PDFs)
    # Fix words broken across lines with hyphens followed by space (e.g., "word- \nword" -> "wordword")
    text = re.sub(r'(\w+)-\s+(\w+)', r'\1\2', text)
    
    # Step 5: Remove leading and trailing whitespace from each line
    lines = text.split('\n')
    lines = [line.strip() for line in lines]
    text = '\n'.join(lines)
    
    # Step 6: Remove excessive blank lines (more than 2 consecutive newlines)
    text = re.sub(r'\n{3,}', '\n\n', text)
    
    # Step 7: Final trim of leading/trailing whitespace
    text = text.strip()
    
    return text
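
One caveat with Step 4 above: because it runs after newlines have been turned into spaces, the pattern `(\w+)-\s+(\w+)` also joins legitimate constructs such as "pre- and post-processing". The sketch below shows a narrower alternative that only joins a hyphen directly followed by a line break, so it would have to run on the raw text before Step 2. The function name `dehyphenate_line_breaks` is hypothetical, and genuine compounds that happen to break at a line end (for example "well-" / "known") would still be merged.

```python
import re


def dehyphenate_line_breaks(raw_text: str) -> str:
    """Join words that were split by a hyphen at the end of a line."""
    # Remove the hyphen and the line break only when both sides are word
    # characters; spaces or tabs around the break are dropped as well.
    return re.sub(r"(?<=\w)-[ \t]*\n[ \t]*(?=\w)", "", raw_text)


broad_pattern = r"(\w+)-\s+(\w+)"  # the pattern used in Step 4

# The broad pattern merges a suspended hyphen with the next word.
print(re.sub(broad_pattern, r"\1\2", "pre- and post-processing"))
# The narrower pattern leaves it alone and still repairs real line breaks.
print(dehyphenate_line_breaks("pre- and post-processing"))
print(dehyphenate_line_breaks("an exam-\nple of broken text"))
```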

In [None]:
# UTILS: Text Cleaning Evaluation Helper Functions - DO NOT MODIFY
try:
    import tiktoken
    TOKENIZER_AVAILABLE = True
except ImportError:
    TOKENIZER_AVAILABLE = False
    print("Note: tiktoken not available. Using word-based token approximation.")

def count_tokens(text, model="gpt-4o-mini"):
    """
    Count tokens in text using tiktoken if available, otherwise approximate.
    
    Args:
        text (str): Text to count tokens for
        model (str): Model name for tokenizer (default: gpt-4o-mini)
        
    Returns:
        int: Approximate token count
    """
    if TOKENIZER_AVAILABLE:
        try:
            encoding = tiktoken.encoding_for_model(model)
            return len(encoding.encode(text))
        except Exception:
            # Fallback to word-based approximation
            return int(len(text.split()) / 0.75)  # Rough approximation: ~0.75 words per token
    else:
        # Simple approximation: average English word is ~1.3 tokens
        return int(len(text.split()) * 1.3)

def evaluate_cleaning(original_text, cleaned_text):
    """
    Evaluate the effectiveness of document cleaning by comparing original vs cleaned text.
    
    This function computes quantitative metrics including text statistics, whitespace
    reduction, and token efficiency to demonstrate cleaning effectiveness.
    
    Args:
        original_text (str): Original text before cleaning
        cleaned_text (str): Text after cleaning
        
    Returns:
        dict: Dictionary containing all evaluation metrics
    """
    import re
    
    # Helper function to count patterns
    def count_pattern(text, pattern):
        return len(re.findall(pattern, text))
    
    # Helper function to count_multiple_spaces
    def count_multiple_spaces(text):
        return len(re.findall(r' {2,}', text))
    
    # Calculate text statistics
    metrics = {
        'original': {},
        'cleaned': {},
        'improvements': {}
    }
    
    # Character and word counts
    metrics['original']['char_count'] = len(original_text)
    metrics['cleaned']['char_count'] = len(cleaned_text)
    metrics['improvements']['char_reduction'] = metrics['original']['char_count'] - metrics['cleaned']['char_count']
    metrics['improvements']['char_reduction_pct'] = (metrics['improvements']['char_reduction'] / metrics['original']['char_count'] * 100) if metrics['original']['char_count'] > 0 else 0
    
    metrics['original']['word_count'] = len(original_text.split())
    metrics['cleaned']['word_count'] = len(cleaned_text.split())
    
    # Whitespace metrics
    metrics['original']['spaces'] = original_text.count(' ')
    metrics['cleaned']['spaces'] = cleaned_text.count(' ')
    
    metrics['original']['tabs'] = original_text.count('\t')
    metrics['cleaned']['tabs'] = cleaned_text.count('\t')
    
    metrics['original']['newlines'] = original_text.count('\n')
    metrics['cleaned']['newlines'] = cleaned_text.count('\n')
    
    metrics['original']['multiple_spaces'] = count_multiple_spaces(original_text)
    metrics['cleaned']['multiple_spaces'] = count_multiple_spaces(cleaned_text)
    
    # Count excessive blank lines (>2 consecutive)
    metrics['original']['excessive_blank_lines'] = count_pattern(original_text, r'\n{3,}')
    metrics['cleaned']['excessive_blank_lines'] = count_pattern(cleaned_text, r'\n{3,}')
    
    # Count single newlines (likely broken sentences)
    single_newlines_original = count_pattern(original_text, r'(?<!\n)\n(?!\n)')
    single_newlines_cleaned = count_pattern(cleaned_text, r'(?<!\n)\n(?!\n)')
    metrics['original']['single_newlines'] = single_newlines_original
    metrics['cleaned']['single_newlines'] = single_newlines_cleaned
    
    # Text density (non-whitespace ratio)
    metrics['original']['text_density'] = len(re.sub(r'\s', '', original_text)) / len(original_text) if len(original_text) > 0 else 0
    metrics['cleaned']['text_density'] = len(re.sub(r'\s', '', cleaned_text)) / len(cleaned_text) if len(cleaned_text) > 0 else 0
    
    # Sentence count (approximate)
    metrics['original']['sentence_count'] = len(re.findall(r'[.!?]+', original_text))
    metrics['cleaned']['sentence_count'] = len(re.findall(r'[.!?]+', cleaned_text))
    
    # Average words per sentence
    metrics['original']['avg_words_per_sentence'] = metrics['original']['word_count'] / metrics['original']['sentence_count'] if metrics['original']['sentence_count'] > 0 else 0
    metrics['cleaned']['avg_words_per_sentence'] = metrics['cleaned']['word_count'] / metrics['cleaned']['sentence_count'] if metrics['cleaned']['sentence_count'] > 0 else 0
    
    # Token counts
    metrics['original']['token_count'] = count_tokens(original_text)
    metrics['cleaned']['token_count'] = count_tokens(cleaned_text)
    metrics['improvements']['token_reduction'] = metrics['original']['token_count'] - metrics['cleaned']['token_count']
    metrics['improvements']['token_reduction_pct'] = (metrics['improvements']['token_reduction'] / metrics['original']['token_count'] * 100) if metrics['original']['token_count'] > 0 else 0
    
    # Calculate improvements
    metrics['improvements']['spaces_reduced'] = metrics['original']['spaces'] - metrics['cleaned']['spaces']
    metrics['improvements']['tabs_removed'] = metrics['original']['tabs'] - metrics['cleaned']['tabs']
    metrics['improvements']['single_newlines_fixed'] = metrics['original']['single_newlines'] - metrics['cleaned']['single_newlines']
    metrics['improvements']['multiple_spaces_reduced'] = metrics['original']['multiple_spaces'] - metrics['cleaned']['multiple_spaces']
    metrics['improvements']['excessive_blank_lines_removed'] = metrics['original']['excessive_blank_lines'] - metrics['cleaned']['excessive_blank_lines']
    
    return metrics

def print_evaluation_report(metrics):
    """
    Print a formatted evaluation report showing cleaning effectiveness.
    
    Args:
        metrics (dict): Metrics dictionary from evaluate_cleaning()
    """
    print("=" * 80)
    print("DOCUMENT CLEANING EVALUATION REPORT")
    print("=" * 80)
    print()
    
    # Text Statistics
    print("📊 TEXT STATISTICS")
    print("-" * 80)
    print(f"Character Count:")
    print(f"  Original: {metrics['original']['char_count']:,}")
    print(f"  Cleaned:  {metrics['cleaned']['char_count']:,}")
    print(f"  Reduction: {metrics['improvements']['char_reduction']:,} ({metrics['improvements']['char_reduction_pct']:.2f}%)")
    print()
    
    print(f"Word Count:")
    print(f"  Original: {metrics['original']['word_count']:,}")
    print(f"  Cleaned:  {metrics['cleaned']['word_count']:,}")
    print()
    
    print(f"Sentence Count:")
    print(f"  Original: {metrics['original']['sentence_count']:,}")
    print(f"  Cleaned:  {metrics['cleaned']['sentence_count']:,}")
    print()
    
    print(f"Average Words per Sentence:")
    print(f"  Original: {metrics['original']['avg_words_per_sentence']:.2f}")
    print(f"  Cleaned:  {metrics['cleaned']['avg_words_per_sentence']:.2f}")
    print()
    
    # Whitespace Metrics
    print("🔤 WHITESPACE METRICS")
    print("-" * 80)
    print(f"Spaces:")
    print(f"  Original: {metrics['original']['spaces']:,}")
    print(f"  Cleaned:  {metrics['cleaned']['spaces']:,}")
    print(f"  Reduced:  {metrics['improvements']['spaces_reduced']:,}")
    print()
    
    print(f"Multiple Consecutive Spaces:")
    print(f"  Original: {metrics['original']['multiple_spaces']:,}")
    print(f"  Cleaned:  {metrics['cleaned']['multiple_spaces']:,}")
    print(f"  Reduced:  {metrics['improvements']['multiple_spaces_reduced']:,}")
    print()
    
    print(f"Tab Characters:")
    print(f"  Original: {metrics['original']['tabs']:,}")
    print(f"  Cleaned:  {metrics['cleaned']['tabs']:,}")
    print(f"  Removed:  {metrics['improvements']['tabs_removed']:,}")
    print()
    
    print(f"Single Newlines (broken sentences):")
    print(f"  Original: {metrics['original']['single_newlines']:,}")
    print(f"  Cleaned:  {metrics['cleaned']['single_newlines']:,}")
    print(f"  Fixed:    {metrics['improvements']['single_newlines_fixed']:,}")
    print()
    
    print(f"Excessive Blank Lines (>2 consecutive):")
    print(f"  Original: {metrics['original']['excessive_blank_lines']:,}")
    print(f"  Cleaned:  {metrics['cleaned']['excessive_blank_lines']:,}")
    print(f"  Removed:  {metrics['improvements']['excessive_blank_lines_removed']:,}")
    print()
    
    # Text Quality
    print("✨ TEXT QUALITY INDICATORS")
    print("-" * 80)
    print(f"Text Density (non-whitespace ratio):")
    print(f"  Original: {metrics['original']['text_density']:.4f}")
    print(f"  Cleaned:  {metrics['cleaned']['text_density']:.4f}")
    print(f"  Improvement: {'Higher is better - more content, less whitespace' if metrics['cleaned']['text_density'] > metrics['original']['text_density'] else 'Same or lower'}")
    print()
    
    # Token Efficiency
    print("🎯 LLM PROCESSING EFFICIENCY")
    print("-" * 80)
    print(f"Token Count (approximate):")
    print(f"  Original: {metrics['original']['token_count']:,}")
    print(f"  Cleaned:  {metrics['cleaned']['token_count']:,}")
    print(f"  Reduction: {metrics['improvements']['token_reduction']:,} ({metrics['improvements']['token_reduction_pct']:.2f}%)")
    print()
    
    # Summary
    print("=" * 80)
    print("SUMMARY")
    print("=" * 80)
    print(f"✅ Characters reduced: {metrics['improvements']['char_reduction']:,} ({metrics['improvements']['char_reduction_pct']:.2f}%)")
    print(f"✅ Tokens reduced: {metrics['improvements']['token_reduction']:,} ({metrics['improvements']['token_reduction_pct']:.2f}%)")
    print(f"✅ Broken sentences fixed: {metrics['improvements']['single_newlines_fixed']:,}")
    print(f"✅ Multiple spaces normalized: {metrics['improvements']['multiple_spaces_reduced']:,}")
    print(f"✅ Tabs converted to spaces: {metrics['improvements']['tabs_removed']:,}")
    print(f"✅ Excessive blank lines removed: {metrics['improvements']['excessive_blank_lines_removed']:,}")
    print()
    
    if metrics['improvements']['token_reduction'] > 0:
        print(f"💡 The cleaned text uses {metrics['improvements']['token_reduction_pct']:.2f}% fewer tokens,")
        print(f"   which means lower processing costs and faster LLM responses!")
    print("=" * 80)

In [66]:
# Step 1: Import PDF Using LangChain Library (Peter Drucker - Managing Oneself)
from langchain_community.document_loaders import PyPDFLoader

file_path = "https://www.thecompleteleader.org/sites/default/files/imce/Managing%20Oneself_Drucker_HBR.pdf"
loader = PyPDFLoader(file_path)
docs = loader.load()

document_text = ""
for page in docs:
    document_text += page.page_content + "\n"

# Store original text before cleaning for comparison
original_document_text = document_text

In [69]:
# Step 2: Clean the document text before processing with LLM and Output the results
document_text = clean_document_text(document_text)

# Evaluate the cleaning effectiveness
cleaning_metrics = evaluate_cleaning(original_document_text, document_text)
print_evaluation_report(cleaning_metrics)

DOCUMENT CLEANING EVALUATION REPORT

📊 TEXT STATISTICS
--------------------------------------------------------------------------------
Character Count:
  Original: 51,452
  Cleaned:  50,434
  Reduction: 1,018 (1.98%)

Word Count:
  Original: 8,670
  Cleaned:  8,427

Sentence Count:
  Original: 578
  Cleaned:  578

Average Words per Sentence:
  Original: 15.00
  Cleaned:  14.58

🔤 WHITESPACE METRICS
--------------------------------------------------------------------------------
Spaces:
  Original: 7,759
  Cleaned:  8,426
  Reduced:  -667

Multiple Consecutive Spaces:
  Original: 5
  Cleaned:  0
  Reduced:  5

Tab Characters:
  Original: 0
  Cleaned:  0
  Removed:  0

Single Newlines (broken sentences):
  Original: 1,442
  Cleaned:  0
  Fixed:    1,442

Excessive Blank Lines (>2 consecutive):
  Original: 0
  Cleaned:  0
  Removed:  0

✨ TEXT QUALITY INDICATORS
--------------------------------------------------------------------------------
Text Density (non-whitespace ratio):
 

## Generation Task

Using the OpenAI SDK, please create a **structured output** with the following specifications:

+ Use a model that is NOT in the GPT-5 family.
+ Output should be a Pydantic BaseModel object. The fields of the object should be:

    - Author
    - Title
    - Relevance: a statement, no longer than one paragraph, that explains why this article is relevant for an AI professional in their professional development.
    - Summary: a concise and succinct summary no longer than 1000 tokens.
    - Tone: the tone used to produce the summary (see below).
    - InputTokens: number of input tokens (obtain this from the response object).
    - OutputTokens: number of tokens in output (obtain this from the response object).
       
+ The summary should be written using a specific and distinguishable tone, for example, "Victorian English", "African-American Vernacular English", "Formal Academic Writing", "Bureaucratese" ([the obscure language of bureaucrats](https://tumblr.austinkleon.com/post/4836251885)), "Legalese" (legal language), or any other distinguishable style of your preference. Make sure that the style is something you can identify.
+ In your implementation please make sure to use the following:

    - Instructions and context should be stored separately and the context should be added dynamically. Do not hard-code your prompt, instead use formatted strings or an equivalent technique.
    - Use the developer (instructions) prompt and the user prompt.


In [138]:
from pydantic import BaseModel, Field, ConfigDict
from typing import Optional
from openai import OpenAI
import json

client = OpenAI()

# 1) Output Schema - simplified for strict mode
class SummarySchema(BaseModel):
    model_config = ConfigDict(extra='forbid')
    
    Author: str = Field(..., description="Author of the article")
    Title: str = Field(..., description="Title of the article")
    Relevance: str = Field(..., description="One paragraph on relevance to AI professionals")
    Summary: str = Field(..., description="Concise summary, <=1000 tokens")
    Tone: str = Field(..., description="The tone used to produce the summary")
    InputTokens: Optional[int] = Field(default=0, description="Token count - DO NOT FILL, will be set by response object")
    OutputTokens: Optional[int] = Field(default=0, description="Token count - DO NOT FILL, will be set by response object")

# 2) Model + Tone
MODEL_NAME = "gpt-4o"
TONE = "Humorous"

# 3) Prompts (separated) + dynamic context injection
instructions = (
    "You are an information extraction and summarization assistant. "
    "Return output STRICTLY matching the provided JSON schema. Do not add fields. "
    "Summary should be concise and succinct, with limited commentary. "
    "Write the Summary in the specified Tone and keep it under 1000 tokens. "
    "Use only facts from the provided document; avoid speculation or hallucinations."
)

user_template = (
    "Task: Extract metadata and summarize the document in the specified tone.\n"
    "- Tone: {tone}\n"
    "- Fields to fill: Author, Title, Relevance (<=1 paragraph), Summary (<=1000 tokens).\n"
    "Document follows between <<< >>>. Use only its content.\n"
    "<<<\n{context}\n>>>"
)
user_content = user_template.format(tone=TONE, context=document_text)

# 4) Call the model with structured output via the Responses API
#    (the client created at the top of this cell is reused)
response = client.responses.parse(
    model=MODEL_NAME,
    input=[
        # The assignment asks for a developer (instructions) prompt
        {"role": "developer", "content": instructions},
        {"role": "user", "content": user_content},
    ],
    text_format=SummarySchema,  # Direct Pydantic model
    # max_output_tokens=1200,
)



In [136]:
# Print the response object in a human readable format
# print(f"The Complete Response Object (response):\n{json.dumps(response.model_dump(), indent=2, ensure_ascii=False)}")

# Add the token counts to the response and store the result in its own variable
result = response.output_parsed.model_copy(update={
    "InputTokens": response.usage.input_tokens,
    "OutputTokens": response.usage.output_tokens,
})

# Output our summary for human review 
print(result.Summary)
print("\n")
# Output just the result in a human readable format
print(f"Our Completed Schema Output (result):\n{json.dumps(result.model_dump(), indent=2, ensure_ascii=False)}")



Ah, self-management! The task of being your own boss while resisting the urge to eat donuts at 3 PM. Peter F. Drucker, in his wise (and amusing) article "Managing Oneself," argues that in the knowledge economy, we must all be our own CEO. Knowledge workers like us need to peel back the layers of the metaphorical onion (sorry, not sorry) to truly understand our strengths, weaknesses, and what environment makes us shine brighter than a freshly polished Apple product. The gist? Don’t waste time obsessing over weaknesses. Instead, double down on what you do best - like a cat that has mastered the art of napping, or a dog that can valiantly chase its tail for hours. Drucker emphasizes the importance of feedback analysis - write down your expectations, then see if reality gives you a thumbs-up or slaps your face with a metaphorical pie. Predetermined roles are out; choose your adventures based on the skills you’ve honed - whether you read or listen better is critical, as Eisenhower's pre
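
The 1000-token ceiling on the Summary field is only requested in the prompt, so it may be worth checking after generation. The sketch below is a rough check that reuses the word-based approximation from the helper cell (about 1.3 tokens per English word); the function names are hypothetical, and an exact count would need a real tokenizer such as the `count_tokens` helper with `tiktoken` installed.

```python
def approx_token_count(text: str, tokens_per_word: float = 1.3) -> int:
    """Estimate the token count from the number of whitespace-separated words."""
    return int(len(text.split()) * tokens_per_word)


def within_token_budget(summary: str, max_tokens: int = 1000) -> bool:
    """Return True when the estimated token count does not exceed the budget."""
    return approx_token_count(summary) <= max_tokens


short_summary = "word " * 100   # about 130 estimated tokens
long_summary = "word " * 1000   # about 1300 estimated tokens
print(within_token_budget(short_summary), within_token_budget(long_summary))
```

With the notebook's own variables this would be called as `within_token_budget(result.Summary)`.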

# Evaluate the Summary

Use the DeepEval library to evaluate the **summary** as follows:

+ Summarization Metric:

    - Use the [Summarization metric](https://deepeval.com/docs/metrics-summarization) with a **bespoke** set of assessment questions.
    - Please use, at least, five assessment questions.

+ G-Eval metrics:

    - In addition to the standard summarization metric above, please implement three evaluation metrics: 
    
        - [Coherence or clarity](https://deepeval.com/docs/metrics-llm-evals#coherence)
        - [Tonality](https://deepeval.com/docs/metrics-llm-evals#tonality)
        - [Safety](https://deepeval.com/docs/metrics-llm-evals#safety)

    - For each one of the metrics above, implement five assessment questions.

+ The output should be structured and contain one key-value pair to report the score and another pair to report the explanation:

    - SummarizationScore
    - SummarizationReason
    - CoherenceScore
    - CoherenceReason
    - ...
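
One possible shape for this evaluation is sketched below, assuming DeepEval is installed and an API key is available for the judge model. `evaluate_summary`, `to_structured`, and the `judge_model` default are hypothetical names and choices, and the example questions and steps are placeholders written for the Drucker article, not a complete solution. The DeepEval imports are kept inside the function so that the flattening helper can be run on its own.

```python
def to_structured(results: dict) -> dict:
    """Flatten {"Coherence": (score, reason)} into CoherenceScore/CoherenceReason pairs."""
    structured = {}
    for name, (score, reason) in results.items():
        structured[f"{name}Score"] = score
        structured[f"{name}Reason"] = reason
    return structured


def evaluate_summary(document_text, summary, assessment_questions, geval_steps,
                     judge_model="gpt-4o-mini"):
    """Score a summary with DeepEval and return Score/Reason key-value pairs.

    assessment_questions: closed-ended (yes/no) questions for SummarizationMetric.
    geval_steps: maps a metric name (e.g. "Coherence") to its evaluation steps.
    """
    # Imported lazily: these calls need DeepEval and a judge-model API key.
    from deepeval.metrics import GEval, SummarizationMetric
    from deepeval.test_case import LLMTestCase, LLMTestCaseParams

    test_case = LLMTestCase(input=document_text, actual_output=summary)

    metrics = {
        "Summarization": SummarizationMetric(
            model=judge_model,
            assessment_questions=assessment_questions,
        )
    }
    for name, steps in geval_steps.items():
        metrics[name] = GEval(
            name=name,
            evaluation_steps=steps,
            evaluation_params=[
                LLMTestCaseParams.INPUT,
                LLMTestCaseParams.ACTUAL_OUTPUT,
            ],
            model=judge_model,
        )

    results = {}
    for name, metric in metrics.items():
        metric.measure(test_case)
        results[name] = (metric.score, metric.reason)
    return to_structured(results)


# Placeholder inputs: the assignment asks for five questions or steps per metric.
example_questions = [
    "Does the summary name Peter Drucker as the author?",
    "Does the summary state that people should build on their strengths?",
    "Does the summary describe feedback analysis?",
    "Does the summary mention the difference between readers and listeners?",
    "Does the summary refer to knowledge workers?",
]
example_steps = {
    "Coherence": ["Check whether each sentence follows logically from the previous one."],
    "Tonality": ["Check whether the summary keeps the requested tone throughout."],
    "Safety": ["Check whether the summary contains harmful or offensive content."],
}
```

A call such as `evaluate_summary(document_text, result.Summary, example_questions, example_steps)` would then return a flat dictionary with the keys listed above.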

# Enhancement

Of course, evaluation is important, but we want our system to self-correct.  

+ Use the context, summary, and evaluation that you produced in the steps above to create a new prompt that enhances the summary.
+ Evaluate the new summary using the same function.
+ Report your results. Did you get a better output? Why? Do you think these controls are enough?

Please do not forget to add your comments.
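
The self-correction step can be sketched as a prompt builder that combines the three inputs named above: the context, the current summary, and the evaluation. The template wording and the name `build_enhancement_prompt` are hypothetical; the evaluation is assumed to be a flat dictionary of Score/Reason pairs like the one requested in the previous section.

```python
ENHANCEMENT_TEMPLATE = (
    "Revise the summary below so that it addresses the evaluation feedback.\n"
    "Keep the tone: {tone}. Use only facts from the document.\n\n"
    "Evaluation feedback:\n{feedback}\n\n"
    "Current summary:\n{summary}\n\n"
    "Document follows between <<< >>>.\n<<<\n{context}\n>>>"
)


def build_enhancement_prompt(context: str, summary: str, evaluation: dict, tone: str) -> str:
    """Fill the template with the document, the summary, and the evaluation results."""
    # One bullet per key-value pair, e.g. "- CoherenceScore: 0.4".
    feedback = "\n".join(f"- {key}: {value}" for key, value in evaluation.items())
    return ENHANCEMENT_TEMPLATE.format(
        tone=tone, feedback=feedback, summary=summary, context=context
    )


demo_prompt = build_enhancement_prompt(
    context="DOC",
    summary="SUM",
    evaluation={"CoherenceScore": 0.4, "CoherenceReason": "Jumps between ideas."},
    tone="Humorous",
)
print(demo_prompt)
```

The resulting string would be sent as the user prompt in a second generation call, and the new summary passed through the same evaluation function so that the two sets of scores are comparable.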


# Submission Information

🚨 **Please review our [Assignment Submission Guide](https://github.com/UofT-DSI/onboarding/blob/main/onboarding_documents/submissions.md)** 🚨 for detailed instructions on how to format, branch, and submit your work. Following these guidelines is crucial for your submissions to be evaluated correctly.

## Submission Parameters

- The Submission Due Date is indicated in the [readme](../README.md#schedule) file.
- The branch name for your repo should be: assignment-1
- What to submit for this assignment:
    + This Jupyter Notebook (assignment_1.ipynb) should be populated and should be the only change in your pull request.
- What the pull request link should look like for this assignment: `https://github.com/<your_github_username>/production/pull/<pr_id>`
    + Open a private window in your browser. Copy and paste the link to your pull request into the address bar. Make sure you can see your pull request properly. This helps the technical facilitator and learning support staff review your submission easily.

## Checklist

+ Created a branch with the correct naming convention.
+ Ensured that the repository is public.
+ Reviewed the PR description guidelines and adhered to them.
+ Verified that the link is accessible in a private browser window.

If you encounter any difficulties or have questions, please don't hesitate to reach out to our team via our Slack. Our Technical Facilitators and Learning Support staff are here to help you navigate any challenges.
