feat: implement evaluation module#6
Conversation
# Summary Add modules for evaluating a model's performance on the benchmark datasets. # Changes - Implement two interfaces for model evaluation: an evaluator that works with any model with an OpenAI-compatible API, and an evaluator that works with Hugging Face transformers models. - Process the evaluation datasets to prepare for execution in the sandbox - Checkpoint evaluation results (after each batch) with automatic retries for any failed tasks (i.e., a system failure like an unresponsive API, not a model failure on the task) - Execute evaluation in two stages in parallel: GPU-based LLM inference for completions, and CPU-based sandbox evaluation of the tests using the LLM-written solution - Add tests to verify the correct functionality of the eval.py module - Various other fixes and updates # Motivation Measuring model performance requires evaluation. The HumanEval and MBPP datasets are well-known (though slightly dated) benchmarks that test a model's ability to write Python functions to solve tasks. This commit implements the full evaluation suite to be used at regular checkpoints during model training and to establish peformance baselines with other models.
There was a problem hiding this comment.
Code Review
This pull request introduces a comprehensive evaluation module for measuring model performance on HumanEvalPlus and MBPPPlus benchmarks. It includes a base evaluator class with checkpointing and resume capabilities, along with specific implementations for Hugging Face transformers and OpenAI-compatible APIs. Key updates also include enhanced logging with JSON formatting, improved dataset processing for MBPPPlus to capture test results, and batching support for sandbox verification. Feedback focused on correcting a module import for Path, improving sandbox security by disabling network access, and increasing the robustness of code extraction regex.
| import datetime | ||
| from dataclasses import dataclass, field | ||
|
|
||
| from PIL.ImagePath import Path |
There was a problem hiding this comment.
| self.sandbox = Sandbox( | ||
| sandbox_type=sandbox_type, | ||
| use_cache=True, | ||
| disable_network=False, |
There was a problem hiding this comment.
Setting disable_network=False allows model-generated code to access the network during evaluation. For security and to ensure benchmark integrity (preventing data leakage or external dependencies), it is generally recommended to keep the sandbox isolated by disabling network access, especially since HumanEval and MBPP do not require it.
| disable_network=False, | |
| disable_network=True, |
There was a problem hiding this comment.
Good point. I allowed network access to automatically download required dependencies, but those are now cached on my machine, so I can remove network access and still allow the sandbox to install the packages it needs for execution.
| Tuple of (extracted_code, correct_format) | ||
| """ | ||
| for pattern, correct_format in zip( | ||
| [r"```python\n(.*?)\n```", r"```\n(.*?)\n```"], |
There was a problem hiding this comment.
The current regex for code extraction is quite strict regarding newlines and whitespace around the fences. If a model omits the newline before the closing fence or adds trailing spaces after the language identifier, extraction will fail. Using a more flexible regex that accounts for optional whitespace and newlines would improve robustness.
| [r"```python\n(.*?)\n```", r"```\n(.*?)\n```"], | |
| [r"```python\s*\n?(.*?)\n?\s*```", r"```\s*\n?(.*?)\n?\s*```"], |
There was a problem hiding this comment.
That's a good suggestion. I'm also considering extracting and concatenating all code blocks since sometimes the final code block is not the model's solution but rather is a demonstration or script to show how the model's (previous) solution works.
# Summary Correct an import statement. # Motivation There was a mistaken auto-import of the Path class from the PIL library rather than from the built-in module `pathlib`. # Changes made Changed the import to `from pathlib import Path`.
Summary
Add modules for evaluating a model's performance on the benchmark datasets.
Changes
Motivation
Measuring model performance requires evaluation. The HumanEval and MBPP datasets are well-known (though slightly dated) benchmarks that test a model's ability to write Python functions to solve tasks. This commit implements the full evaluation suite to be used at regular checkpoints during model training and to establish peformance baselines with other models.