# Computer Topic Classifier Evaluation

This notebook demonstrates how to evaluate a simple LLM-based classifier that decides whether a piece of text is about a computer-related topic. We use `pytest-evals` to run the evaluation and analyze the results.

## Setup
First, we load the `pytest_evals` IPython extension, which makes the `%%ipytest_evals` cell magic used later in this notebook available.

In [1]:
%load_ext pytest_evals

## Classifier Implementation

Below is our classifier implementation, which uses OpenAI's `gpt-4o-mini` model to determine whether text is computer-related. The classifier returns a boolean value:
- `True`: Text is computer-related
- `False`: Text is not computer-related

In [2]:
import openai


def classify(text: str) -> bool:
    """Classify text as computer-related or not using gpt-4o-mini.

    Args:
        text (str): The input text to classify

    Returns:
        bool: True if the text is computer-related, False otherwise
    """
    resp = openai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": "Is this text about a computer-related subject? "
                "Reply ONLY with either true or false.",
            },
            {"role": "user", "content": text},
        ],
    )
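    # The reply content may be None or padded with whitespace, so normalize it first
    reply = (resp.choices[0].message.content or "").strip().lower()
    return reply == "true"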
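
Comparing the raw reply to `"true"` treats every unexpected answer as `False`. A minimal sketch of a stricter parser is shown below; the helper name `parse_bool_reply` is hypothetical and not part of the original notebook or of the OpenAI client. It returns `None` when the reply is neither `true` nor `false`, so the caller can decide how to handle it.

```python
def parse_bool_reply(reply):
    """Map a model reply to True/False, or None if it is neither.

    Hypothetical helper: tolerates surrounding whitespace, quotes,
    trailing punctuation, and a missing (None) reply.
    """
    normalized = (reply or "").strip().strip(".!\"'").lower()
    if normalized == "true":
        return True
    if normalized == "false":
        return False
    return None  # unexpected reply; the caller decides what to do


print(parse_bool_reply(" True.\n"))
print(parse_bool_reply("maybe"))
```

Returning `None` instead of `False` keeps malformed replies from being silently counted as negative predictions.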

## Test Data

We define a set of test cases to evaluate our classifier. Each test case contains:
- `text`: The input text to classify
- `label`: The expected classification (True for computer-related, False otherwise)

In [3]:
TEST_DATA = [
    {"text": "I need to debug this Python code", "label": True},
    {"text": "The cat jumped over the lazy dog", "label": False},
    {"text": "My monitor keeps flickering", "label": True},
]
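
Because every test case must carry a `text` string and a boolean `label`, it can help to check the data before spending API calls on it. The following is a minimal sketch under that assumption; `validate_cases` is a hypothetical helper, not part of `pytest-evals`. It also reports how many cases fall under each label.

```python
from collections import Counter

TEST_DATA = [
    {"text": "I need to debug this Python code", "label": True},
    {"text": "The cat jumped over the lazy dog", "label": False},
    {"text": "My monitor keeps flickering", "label": True},
]


def validate_cases(cases):
    """Check each case's shape and return a count of cases per label."""
    for i, case in enumerate(cases):
        text = case.get("text")
        if not isinstance(text, str) or not text.strip():
            raise ValueError(f"case {i}: 'text' must be a non-empty string")
        if not isinstance(case.get("label"), bool):
            raise ValueError(f"case {i}: 'label' must be a bool")
    return Counter(case["label"] for case in cases)


counts = validate_cases(TEST_DATA)
print(counts)
```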

## Evaluation Tests

We use pytest-evals to:
1. Run individual test cases and collect results
2. Analyze the overall performance of our classifier

The evaluation enforces two checks:
- Each test case asserts that the prediction matches its expected label
- The analysis step asserts an overall accuracy of at least 70%

In [4]:
%%ipytest_evals
import pytest

@pytest.mark.eval(name="computer_classifier")
@pytest.mark.parametrize("case", TEST_DATA)
def test_classifier(case: dict, eval_bag):
    """Test individual classification cases.
    
    Args:
        case (dict): Test case containing text and expected label
        eval_bag: Container for test results
    """
    # Store inputs and results in eval_bag for analysis
    eval_bag.input_text = case["text"]
    eval_bag.label = case["label"]
    eval_bag.prediction = classify(case["text"])

    # Log results for visibility
    print(f"Input: {eval_bag.input_text}")
    print(f"Prediction: {eval_bag.prediction}")

    assert eval_bag.prediction == eval_bag.label


@pytest.mark.eval_analysis(name="computer_classifier")
def test_analysis(eval_results):
    """Analyze overall classifier performance.
    
    Args:
        eval_results: Collection of all test results
    """
    total = len(eval_results)
    assert total > 0, "No evaluation results were collected"  # avoid dividing by zero
    correct = sum(1 for r in eval_results if r.result.prediction == r.result.label)
    accuracy = correct / total

    print(f"Accuracy: {accuracy:.2%}")
    assert accuracy >= 0.7  # Require at least 70% accuracy


t_fe596c0d68894784969f18775cec634a.py::test_classifier[case0] Input: I need to debug this Python code
Prediction: True
PASSED
t_fe596c0d68894784969f18775cec634a.py::test_classifier[case1] Input: The cat jumped over the lazy dog
Prediction: False
PASSED
t_fe596c0d68894784969f18775cec634a.py::test_classifier[case2] Input: My monitor keeps flickering
Prediction: True
PASSED
t_fe596c0d68894784969f18775cec634a.py::test_analysis Accuracy: 100.00%
PASSED
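
The accuracy figure reported above can be reproduced outside pytest. The sketch below assumes the results are available as plain `(prediction, label)` pairs; the `accuracy` helper is hypothetical and only mirrors the arithmetic in `test_analysis`, not the `pytest-evals` result objects.

```python
def accuracy(pairs):
    """Return the fraction of (prediction, label) pairs that agree."""
    if not pairs:
        raise ValueError("no results to score")
    return sum(pred == label for pred, label in pairs) / len(pairs)


# The three outcomes from the run above, written as (prediction, label)
results = [(True, True), (False, False), (True, True)]
score = accuracy(results)
print(f"Accuracy: {score:.2%}")  # prints "Accuracy: 100.00%"
```

With only three cases, a single misclassification would drop accuracy to about 67%, below the 70% threshold, so on this dataset the analysis check passes only when every case is correct.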

