Description
In `eval.py:31`:
```python
benchmarks[task_name] = {k: v for k, v in metrics.items() if isinstance(v, (int, float))}
```
In Python, `bool` is a subclass of `int`, so `isinstance(True, (int, float))` returns `True`. If lm-eval returns boolean metadata in its results dict (e.g. config flags), those booleans leak through as metric values (`True` → 1, `False` → 0), polluting benchmark comparisons.
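A minimal repro of the leak (the dict contents here are illustrative, not actual lm-eval output):

```python
# Hypothetical results dict mixing real metrics with a boolean config flag.
metrics = {"acc": 0.82, "acc_stderr": 0.01, "bypass": True}

# The current filter: the bool passes, since bool is a subclass of int.
kept = {k: v for k, v in metrics.items() if isinstance(v, (int, float))}
print(kept)  # {'acc': 0.82, 'acc_stderr': 0.01, 'bypass': True}
```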
Fix
Add an explicit `not isinstance(v, bool)` guard to the comprehension.
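A sketch of the corrected line (task name and metric keys are placeholders):

```python
# Hypothetical results: two numeric metrics plus a boolean flag to exclude.
metrics = {"acc": 0.82, "num_fewshot": 5, "bypass": True}

benchmarks = {}
task_name = "example_task"
# Check bool first: isinstance(True, int) is True, so the extra guard is required.
benchmarks[task_name] = {
    k: v
    for k, v in metrics.items()
    if isinstance(v, (int, float)) and not isinstance(v, bool)
}
print(benchmarks[task_name])  # {'acc': 0.82, 'num_fewshot': 5}
```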