Bug: parse_results includes booleans as numeric metrics (bool is subclass of int) #29

@sacredvoid

Description

In `eval.py:31`:
```python
benchmarks[task_name] = {k: v for k, v in metrics.items() if isinstance(v, (int, float))}
```

In Python, `bool` is a subclass of `int`, so `isinstance(True, (int, float))` returns `True`. If lm-eval returns boolean metadata in its results dict (e.g. config flags), they leak through as metric values 1/0, polluting benchmark comparisons.
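A quick demonstration of the subclass behavior that causes the leak:

```python
# bool is a subclass of int, so booleans pass any numeric isinstance check
assert issubclass(bool, int)
assert isinstance(True, (int, float))

# and booleans compare equal to 1/0, so they masquerade as metric values
assert True == 1
assert False == 0
```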

Fix

Add an explicit `not isinstance(v, bool)` guard.
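A minimal sketch of the guarded comprehension (the `metrics` dict and task name here are hypothetical, for illustration only):

```python
# Hypothetical lm-eval results: numeric metrics mixed with a boolean config flag
metrics = {"acc": 0.91, "num_fewshot": 5, "bypass": True}

benchmarks = {}
task_name = "example_task"

# Check bool before accepting int/float, since bool subclasses int
benchmarks[task_name] = {
    k: v
    for k, v in metrics.items()
    if isinstance(v, (int, float)) and not isinstance(v, bool)
}
# → {"acc": 0.91, "num_fewshot": 5}; the boolean flag is filtered out
```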
