Skip to content

feat: implement evaluation module#6

Merged
rparkr merged 2 commits into
mainfrom
feat/add-eval-module
Apr 24, 2026
Merged

feat: implement evaluation module#6
rparkr merged 2 commits into
mainfrom
feat/add-eval-module

Conversation

@rparkr
Copy link
Copy Markdown
Owner

@rparkr rparkr commented Apr 24, 2026

Summary

Add modules for evaluating a model's performance on the benchmark datasets.

Changes

  • Implement two interfaces for model evaluation: an evaluator that works with any model with an OpenAI-compatible API, and an evaluator that works with Hugging Face transformers models.
  • Process the evaluation datasets to prepare for execution in the sandbox
  • Checkpoint evaluation results (after each batch) with automatic retries for any failed tasks (i.e., a system failure like an unresponsive API, not a model failure on the task)
  • Execute evaluation in two stages in parallel: GPU-based LLM inference for completions, and CPU-based sandbox evaluation of the tests using the LLM-written solution
  • Add tests to verify the correct functionality of the eval.py module
  • Various other fixes and updates

Motivation

Measuring model performance requires evaluation. The HumanEval and MBPP datasets are well-known (though slightly dated) benchmarks that test a model's ability to write Python functions to solve tasks. This commit implements the full evaluation suite to be used at regular checkpoints during model training and to establish peformance baselines with other models.

# Summary
Add modules for evaluating a model's performance on the benchmark datasets.

# Changes
- Implement two interfaces for model evaluation: an evaluator that works with any model with an OpenAI-compatible API, and an evaluator that works with Hugging Face transformers models.
- Process the evaluation datasets to prepare for execution in the sandbox
- Checkpoint evaluation results (after each batch) with automatic retries for any failed tasks (i.e., a system failure like an unresponsive API, not a model failure on the task)
- Execute evaluation in two stages in parallel: GPU-based LLM inference for completions, and CPU-based sandbox evaluation of the tests using the LLM-written solution
- Add tests to verify the correct functionality of the eval.py module
- Various other fixes and updates

# Motivation
Measuring model performance requires evaluation. The HumanEval and MBPP datasets are well-known (though slightly dated) benchmarks that test a model's ability to write Python functions to solve tasks. This commit implements the full evaluation suite to be used at regular checkpoints during model training and to establish peformance baselines with other models.
Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a comprehensive evaluation module for measuring model performance on HumanEvalPlus and MBPPPlus benchmarks. It includes a base evaluator class with checkpointing and resume capabilities, along with specific implementations for Hugging Face transformers and OpenAI-compatible APIs. Key updates also include enhanced logging with JSON formatting, improved dataset processing for MBPPPlus to capture test results, and batching support for sandbox verification. Feedback focused on correcting a module import for Path, improving sandbox security by disabling network access, and increasing the robustness of code extraction regex.

Comment thread src/lfm_coder/evals/types.py Outdated
import datetime
from dataclasses import dataclass, field

from PIL.ImagePath import Path
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The Path class is incorrectly imported from PIL.ImagePath. It should be imported from the standard pathlib module to ensure compatibility with other parts of the codebase that expect a pathlib.Path object.

Suggested change
from PIL.ImagePath import Path
from pathlib import Path

self.sandbox = Sandbox(
sandbox_type=sandbox_type,
use_cache=True,
disable_network=False,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

security-medium medium

Setting disable_network=False allows model-generated code to access the network during evaluation. For security and to ensure benchmark integrity (preventing data leakage or external dependencies), it is generally recommended to keep the sandbox isolated by disabling network access, especially since HumanEval and MBPP do not require it.

Suggested change
disable_network=False,
disable_network=True,

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I allowed network access to automatically download required dependencies, but those are now cached on my machine, so I can remove network access and still allow the sandbox to install the packages it needs for execution.

Tuple of (extracted_code, correct_format)
"""
for pattern, correct_format in zip(
[r"```python\n(.*?)\n```", r"```\n(.*?)\n```"],
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The current regex for code extraction is quite strict regarding newlines and whitespace around the fences. If a model omits the newline before the closing fence or adds trailing spaces after the language identifier, extraction will fail. Using a more flexible regex that accounts for optional whitespace and newlines would improve robustness.

Suggested change
[r"```python\n(.*?)\n```", r"```\n(.*?)\n```"],
[r"```python\s*\n?(.*?)\n?\s*```", r"```\s*\n?(.*?)\n?\s*```"],

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good suggestion. I'm also considering extracting and concatenating all code blocks since sometimes the final code block is not the model's solution but rather is a demonstration or script to show how the model's (previous) solution works.

# Summary
Correct an import statement.

# Motivation
There was a mistaken auto-import of the Path class from the PIL library rather than from the built-in module `pathlib`.

# Changes made
Changed the import to `from pathlib import Path`.
@rparkr rparkr merged commit c45600a into main Apr 24, 2026
2 checks passed
@rparkr rparkr deleted the feat/add-eval-module branch April 24, 2026 22:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant