feat: implement evaluation module by rparkr · Pull Request #6 · rparkr/lfm-coder

rparkr · 2026-04-24T22:09:05Z

Summary

Add modules for evaluating a model's performance on the benchmark datasets.

Changes

Implement two interfaces for model evaluation: an evaluator that works with any model with an OpenAI-compatible API, and an evaluator that works with Hugging Face transformers models.
Process the evaluation datasets to prepare for execution in the sandbox
Checkpoint evaluation results (after each batch) with automatic retries for any failed tasks (i.e., a system failure like an unresponsive API, not a model failure on the task)
Execute evaluation in two stages in parallel: GPU-based LLM inference for completions, and CPU-based sandbox evaluation of the tests using the LLM-written solution
Add tests to verify the correct functionality of the eval.py module
Various other fixes and updates
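
The two-stage execution described above could be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `generate_batch` and `run_tests` are hypothetical stand-ins for GPU inference and sandbox execution, and the queue size and worker count are arbitrary assumptions.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor

_DONE = object()  # sentinel that tells the consumer no more batches are coming


def generate_batch(prompts):
    """Hypothetical stand-in for GPU-based LLM inference."""
    return [f"def solution():\n    return {p!r}" for p in prompts]


def run_tests(task_id, code):
    """Hypothetical stand-in for CPU-based sandbox evaluation."""
    return {"task_id": task_id, "passed": "return" in code}


def evaluate(batches, n_workers=4):
    """Overlap stage 1 (inference) with stage 2 (test execution)."""
    pending = queue.Queue(maxsize=2)  # bounded so inference cannot run far ahead
    results = []

    def producer():
        try:
            for batch in batches:
                completions = generate_batch([prompt for _, prompt in batch])
                pending.put(
                    [(tid, code) for (tid, _), code in zip(batch, completions)]
                )
        finally:
            pending.put(_DONE)  # always unblock the consumer, even on error

    thread = threading.Thread(target=producer)
    thread.start()
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # While this loop tests batch N, the producer is generating batch N+1.
        while (item := pending.get()) is not _DONE:
            results.extend(pool.map(lambda pair: run_tests(*pair), item))
    thread.join()
    return results
```

Threads are enough for this toy version because the stand-in functions are trivial. A real sandbox stage would more likely rely on subprocesses or containers for the CPU-bound work.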

Motivation

Measuring model performance requires evaluation. The HumanEval and MBPP datasets are well-known (though slightly dated) benchmarks that test a model's ability to write Python functions to solve tasks. This commit implements the full evaluation suite to be used at regular checkpoints during model training and to establish performance baselines against other models.


gemini-code-assist

Code Review

This pull request introduces a comprehensive evaluation module for measuring model performance on HumanEvalPlus and MBPPPlus benchmarks. It includes a base evaluator class with checkpointing and resume capabilities, along with specific implementations for Hugging Face transformers and OpenAI-compatible APIs. Key updates also include enhanced logging with JSON formatting, improved dataset processing for MBPPPlus to capture test results, and batching support for sandbox verification. Feedback focused on correcting a module import for Path, improving sandbox security by disabling network access, and increasing the robustness of code extraction regex.
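
For readers unfamiliar with the checkpoint-and-resume pattern mentioned in this summary, here is a minimal sketch under stated assumptions. The JSONL file layout, the `SystemFailure` exception, and the `evaluate_with_checkpoint` function are all hypothetical and are not taken from this PR. The idea is that results are appended after each batch, completed tasks are skipped on resume, and only system-level failures are retried.

```python
import json
from pathlib import Path


class SystemFailure(Exception):
    """Hypothetical: an infrastructure error (e.g. unresponsive API), not a wrong answer."""


def load_checkpoint(path: Path) -> dict:
    """Read previously saved results, keyed by task id."""
    done = {}
    if path.exists():
        for line in path.read_text().splitlines():
            if line.strip():
                rec = json.loads(line)
                done[rec["task_id"]] = rec
    return done


def evaluate_with_checkpoint(tasks, run_task, path, batch_size=2, max_retries=3):
    done = load_checkpoint(path)
    todo = [t for t in tasks if t not in done]  # resume: skip finished tasks
    for start in range(0, len(todo), batch_size):
        records = []
        for task_id in todo[start:start + batch_size]:
            for attempt in range(1, max_retries + 1):
                try:
                    passed = run_task(task_id)
                except SystemFailure:
                    continue  # retry system failures only
                records.append(
                    {"task_id": task_id, "passed": passed, "attempts": attempt}
                )
                break
            # A task that exhausts its retries is left unrecorded,
            # so the next run picks it up again.
        with path.open("a") as f:  # checkpoint after each batch
            for rec in records:
                f.write(json.dumps(rec) + "\n")
        done.update({rec["task_id"]: rec for rec in records})
    return done
```

A model that simply fails the task yields `passed=False` and is recorded like any other result. Only the exception path triggers a retry.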

gemini-code-assist · 2026-04-24T22:10:52Z

+import datetime
+from dataclasses import dataclass, field
+
+from PIL.ImagePath import Path


The Path class is incorrectly imported from PIL.ImagePath. It should be imported from the standard pathlib module to ensure compatibility with other parts of the codebase that expect a pathlib.Path object.

Suggested change

-from PIL.ImagePath import Path
+from pathlib import Path

gemini-code-assist · 2026-04-24T22:10:52Z

+        self.sandbox = Sandbox(
+            sandbox_type=sandbox_type,
+            use_cache=True,
+            disable_network=False,


Setting disable_network=False allows model-generated code to access the network during evaluation. For security and to ensure benchmark integrity (preventing data leakage or external dependencies), it is generally recommended to keep the sandbox isolated by disabling network access, especially since HumanEval and MBPP do not require it.

Suggested change

-            disable_network=False,
+            disable_network=True,

rparkr · 2026-04-24

Good point. I allowed network access to automatically download required dependencies, but those are now cached on my machine, so I can remove network access and still allow the sandbox to install the packages it needs for execution.

gemini-code-assist · 2026-04-24T22:10:52Z

+            Tuple of (extracted_code, correct_format)
+        """
+        for pattern, correct_format in zip(
+            [r"```python\n(.*?)\n```", r"```\n(.*?)\n```"],


The current regex for code extraction is quite strict regarding newlines and whitespace around the fences. If a model omits the newline before the closing fence or adds trailing spaces after the language identifier, extraction will fail. Using a more flexible regex that accounts for optional whitespace and newlines would improve robustness.

Suggested change

-            [r"```python\n(.*?)\n```", r"```\n(.*?)\n```"],
+            [r"```python\s*\n?(.*?)\n?\s*```", r"```\s*\n?(.*?)\n?\s*```"],
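
To make the difference concrete, the snippet below compares the strict and the suggested `python` patterns on a few inputs. It is only a sketch. It assumes the patterns are applied with `re.DOTALL` (needed for `.*?` to span multiple lines), which is not visible in the diff. The `extract` helper is hypothetical, and the fence is built from a `FENCE` constant purely to avoid literal triple backticks inside the example.

```python
import re

FENCE = "`" * 3  # three backticks, built indirectly to keep this example readable

STRICT = FENCE + r"python\n(.*?)\n" + FENCE
FLEXIBLE = FENCE + r"python\s*\n?(.*?)\n?\s*" + FENCE


def extract(pattern, text):
    """Return the first captured code block, or None if nothing matches."""
    match = re.search(pattern, text, flags=re.DOTALL)
    return match.group(1) if match else None


# Well-formed reply: both patterns agree.
clean = FENCE + "python\ndef f():\n    return 1\n" + FENCE

# Sloppy reply: trailing space after "python", no newline before the closing fence.
sloppy = FENCE + "python \ndef f():\n    return 1" + FENCE

# Reply whose first code line is indented.
indented = FENCE + "python\n    x = 1\n" + FENCE

print(extract(STRICT, clean) == extract(FLEXIBLE, clean))  # True
print(extract(STRICT, sloppy))                             # None
print(repr(extract(FLEXIBLE, sloppy)))                     # 'def f():\n    return 1'
print(repr(extract(STRICT, indented)))                     # '    x = 1'
print(repr(extract(FLEXIBLE, indented)))                   # 'x = 1'
```

One trade-off worth noting: because `\s*` also matches newlines and spaces, the flexible pattern consumes leading indentation on the first code line, as the last case shows. Restricting the opening whitespace to `[ \t]*` before the newline would avoid that if it matters.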

rparkr · 2026-04-24

That's a good suggestion. I'm also considering extracting and concatenating all code blocks, since sometimes the final code block is not the model's solution but rather a demonstration or script showing how the model's (previous) solution works.
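
The idea of concatenating every code block could look something like the sketch below. It is an assumption about one possible approach, not what the PR implements. `extract_all_code_blocks` and `BLOCK` are hypothetical names, and joining blocks with a blank line is an arbitrary choice.

```python
import re

FENCE = "`" * 3  # three backticks, built indirectly to keep this example readable

# Optional "python" tag, optional trailing spaces/tabs, then the block body.
BLOCK = re.compile(
    FENCE + r"(?:python)?[ \t]*\n(.*?)\n?[ \t]*" + FENCE,
    flags=re.DOTALL,
)


def extract_all_code_blocks(text: str) -> str:
    """Concatenate every fenced block in order, separated by a blank line."""
    return "\n\n".join(block.strip("\n") for block in BLOCK.findall(text))


reply = (
    "Here is the solution:\n"
    + FENCE + "python\ndef add(a, b):\n    return a + b\n" + FENCE
    + "\nAnd a quick demo:\n"
    + FENCE + "python\nprint(add(1, 2))\n" + FENCE + "\n"
)
print(extract_all_code_blocks(reply))
```

With this approach the demonstration block runs alongside the solution, so any side effects in the demo (prints, failing calls) would also execute in the sandbox. Whether that is acceptable depends on how the tests are run.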


gemini-code-assist Bot reviewed Apr 24, 2026


fix(evals): correct import

786e1bc

Summary

Correct an import statement.

Motivation

There was a mistaken auto-import of the Path class from the PIL library rather than from the built-in module `pathlib`.

Changes made

Changed the import to `from pathlib import Path`.

rparkr merged commit c45600a into main Apr 24, 2026
2 checks passed

rparkr deleted the feat/add-eval-module branch April 24, 2026 22:21

rparkr merged 2 commits into main from feat/add-eval-module
