Description
I noticed this issue when bringing up tinygrad v0.11.0 in conda-forge; I've since verified that the issue persists on master (as of f509019).
Basically, I'm running examples/beautiful_mnist.py as a kind of integration test, in addition to the test suite (pytest test/, aside from a handful of extra skips1). Interestingly, despite the test suite passing on OSX, beautiful_mnist fails if I set TARGET_EVAL_ACC_PCT=90:
```
│ 0%| | 0/25 [00:00<?, ?it/s]
│ loss: 2.55 test_accuracy: 11.35%: 4%|▎ | 1/25 [00:58<23:16, 0.02it/s]
│ loss: 1.54 test_accuracy: 11.35%: 8%|▋ | 2/25 [01:35<18:13, 0.02it/s]
│ loss: 1.11 test_accuracy: 75.00%: 12%|▉ | 3/25 [01:38<12:05, 0.03it/s]
│ loss: 0.83 test_accuracy: 75.00%: 16%|█▎ | 4/25 [01:42<09:00, 0.04it/s]
│ loss: 0.72 test_accuracy: 75.00%: 20%|█▌ | 5/25 [01:46<07:07, 0.05it/s]
│ loss: 0.57 test_accuracy: 75.00%: 24%|█▉ | 6/25 [01:50<05:50, 0.05it/s]
│ loss: 0.49 test_accuracy: 75.00%: 28%|██▏ | 7/25 [01:54<04:54, 0.06it/s]
│ loss: 0.45 test_accuracy: 75.00%: 32%|██▌ | 8/25 [01:58<04:12, 0.07it/s]
│ loss: 0.32 test_accuracy: 75.00%: 36%|██▉ | 9/25 [02:02<03:37, 0.07it/s]
│ loss: 0.32 test_accuracy: 75.00%: 40%|██▊ | 10/25 [02:06<03:09, 0.08it/s]
│ loss: 0.31 test_accuracy: 75.00%: 44%|███ | 11/25 [02:10<02:45, 0.08it/s]
│ loss: 0.30 test_accuracy: 75.00%: 48%|███▎ | 12/25 [02:14<02:25, 0.09it/s]
│ loss: 0.25 test_accuracy: 75.00%: 52%|███▋ | 13/25 [02:18<02:07, 0.09it/s]
│ loss: 0.26 test_accuracy: 75.00%: 56%|███▉ | 14/25 [02:21<01:51, 0.10it/s]
│ loss: 0.26 test_accuracy: 75.00%: 60%|████▏ | 15/25 [02:25<01:37, 0.10it/s]
│ loss: 0.23 test_accuracy: 75.00%: 64%|████▍ | 16/25 [02:29<01:23, 0.11it/s]
│ loss: 0.25 test_accuracy: 75.00%: 68%|████▊ | 17/25 [02:33<01:12, 0.11it/s]
│ loss: 0.17 test_accuracy: 75.00%: 72%|█████ | 18/25 [02:36<01:01, 0.11it/s]
│ loss: 0.23 test_accuracy: 75.00%: 76%|█████▎ | 19/25 [02:40<00:50, 0.12it/s]
│ loss: 0.18 test_accuracy: 75.00%: 80%|█████▌ | 20/25 [02:44<00:41, 0.12it/s]
│ loss: 0.17 test_accuracy: 75.00%: 84%|█████▉ | 21/25 [02:48<00:32, 0.12it/s]
│ loss: 0.13 test_accuracy: 75.00%: 88%|██████▏| 22/25 [02:52<00:23, 0.13it/s]
│ loss: 0.15 test_accuracy: 75.00%: 92%|██████▍| 23/25 [02:56<00:15, 0.13it/s]
│ loss: 0.16 test_accuracy: 75.00%: 96%|██████▋| 24/25 [02:59<00:07, 0.13it/s]
│ loss: 0.12 test_accuracy: 75.00%: 100%|███████| 25/25 [03:03<00:00, 0.14it/s]
│ loss: 0.12 test_accuracy: 75.00%: 100%|███████| 25/25 [03:03<00:00, 0.14it/s]
```
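For context, the failure comes from the eval-accuracy check at the end of the example, which compares the final test accuracy against the TARGET_EVAL_ACC_PCT environment variable. Paraphrased (the exact code in the script may differ slightly), it behaves roughly like this:

```python
# Paraphrased sketch of the eval-accuracy check at the end of
# examples/beautiful_mnist.py; the exact code in the script may differ.
from tinygrad.helpers import getenv

test_acc = 75.0  # stand-in for the final accuracy computed in the training loop
if (target := getenv("TARGET_EVAL_ACC_PCT", 0.0)) != 0.0:
  assert test_acc >= target, f"test_acc={test_acc} < {target}"
```

On this runner the check trips because the accuracy plateaus well below the target.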
Note that the loss is decreasing and matches the overall convergence seen on other platforms (Linux, Windows), where the target accuracy is reached easily. Note also that I wanted to see the accuracy update at every step, so I'm carrying a trivial patch that does:
```diff
--- a/examples/beautiful_mnist.py
+++ b/examples/beautiful_mnist.py
@@ -38,7 +38,7 @@ if __name__ == "__main__":
   for i in (t:=trange(getenv("STEPS", 70))):
     GlobalCounters.reset() # NOTE: this makes it nice for DEBUG=2 timing
     loss = train_step()
-    if i%10 == 9: test_acc = get_test_acc().item()
+    test_acc = get_test_acc().item()
     t.set_description(f"loss: {loss.item():6.2f} test_accuracy: {test_acc:5.2f}%")
 
   # verify eval acc
```

At first, my guess was that this might be due to the JIT going wrong somehow, i.e. in `get_test_acc` (examples/beautiful_mnist.py, lines 34 to 35 in f509019):

```python
@TinyJit
def get_test_acc() -> Tensor: return (model(X_test).argmax(axis=1) == Y_test).mean()*100
```

But even if I remove the decorator, the same situation (loss decreases, accuracy stagnates) persists:
```
│ loss: 2.56 test_accuracy: 10.09%: 4%|▎ | 1/25 [00:44<17:53, 0.02it/s]
│ loss: 1.66 test_accuracy: 10.09%: 8%|▋ | 2/25 [01:13<14:08, 0.03it/s]
│ loss: 1.18 test_accuracy: 11.35%: 12%|▉ | 3/25 [01:18<09:33, 0.04it/s]
│ loss: 0.90 test_accuracy: 11.35%: 16%|█▎ | 4/25 [01:22<07:12, 0.05it/s]
│ loss: 0.78 test_accuracy: 11.35%: 20%|█▌ | 5/25 [01:26<05:45, 0.06it/s]
│ loss: 0.71 test_accuracy: 11.35%: 24%|█▉ | 6/25 [01:30<04:46, 0.07it/s]
│ loss: 0.57 test_accuracy: 11.35%: 28%|██▏ | 7/25 [01:34<04:03, 0.07it/s]
│ loss: 0.53 test_accuracy: 11.35%: 32%|██▌ | 8/25 [01:38<03:29, 0.08it/s]
│ loss: 0.45 test_accuracy: 11.35%: 36%|██▉ | 9/25 [01:42<03:02, 0.09it/s]
│ loss: 0.45 test_accuracy: 11.35%: 40%|██▊ | 10/25 [01:47<02:40, 0.09it/s]
│ loss: 0.39 test_accuracy: 11.35%: 44%|███ | 11/25 [01:51<02:21, 0.10it/s]
│ loss: 0.34 test_accuracy: 11.35%: 48%|███▎ | 12/25 [01:55<02:04, 0.10it/s]
│ loss: 0.30 test_accuracy: 11.35%: 52%|███▋ | 13/25 [01:59<01:50, 0.11it/s]
│ loss: 0.29 test_accuracy: 11.35%: 56%|███▉ | 14/25 [02:03<01:36, 0.11it/s]
│ loss: 0.27 test_accuracy: 11.35%: 60%|████▏ | 15/25 [02:07<01:24, 0.12it/s]
│ loss: 0.27 test_accuracy: 11.35%: 64%|████▍ | 16/25 [02:11<01:13, 0.12it/s]
│ loss: 0.24 test_accuracy: 11.35%: 68%|████▊ | 17/25 [02:15<01:03, 0.13it/s]
│ loss: 0.20 test_accuracy: 11.35%: 72%|█████ | 18/25 [02:19<00:54, 0.13it/s]
│ loss: 0.20 test_accuracy: 11.35%: 76%|█████▎ | 19/25 [02:23<00:45, 0.13it/s]
│ loss: 0.20 test_accuracy: 11.35%: 80%|█████▌ | 20/25 [02:27<00:36, 0.14it/s]
│ loss: 0.19 test_accuracy: 11.35%: 84%|█████▉ | 21/25 [02:32<00:28, 0.14it/s]
│ loss: 0.16 test_accuracy: 11.35%: 88%|██████▏| 22/25 [02:36<00:21, 0.14it/s]
│ loss: 0.19 test_accuracy: 11.35%: 92%|██████▍| 23/25 [02:40<00:13, 0.14it/s]
│ loss: 0.20 test_accuracy: 11.35%: 96%|██████▋| 24/25 [02:44<00:06, 0.15it/s]
│ loss: 0.14 test_accuracy: 11.35%: 100%|███████| 25/25 [02:48<00:00, 0.15it/s]
│ loss: 0.14 test_accuracy: 11.35%: 100%|███████| 25/25 [02:48<00:00, 0.15it/s]
```
Looking at the DEBUG=2 output, I was somewhat surprised to see that it is using METAL; I'm not sure how well Azure Pipelines supports that. On the other hand, since the test suite passes, I'm assuming it covers the "default device" on that platform quite well. We're running on vanilla images from https://github.com/actions/runner-images, in this case macOS-15, but on an osx-64 host (Azure Pipelines still hasn't delivered osx-arm64 runners at scale; in the meantime we cross-compile osx-arm64 from osx-64, as we cannot yet use GHA for other reasons). Since there's no emulation framework like QEMU for macOS, we only run the tests natively, i.e. only on osx-64.
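As a quick sanity check of which backend gets picked on those images, a minimal snippet along these lines (just the public `Device.DEFAULT` attribute, nothing specific to the example) agrees with what the DEBUG=2 output shows:

```python
# Minimal sketch: print the backend tinygrad selects by default on this runner;
# this is the same selection the examples go through when no device is forced.
from tinygrad import Device

print(Device.DEFAULT)  # "METAL" on the macOS-15 images in these runs
```

(For a one-off run, I believe an environment override like CPU=1 would also switch the backend, but that's not what I used below.)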
To test the hypothesis that this is somehow related to METAL, I forced the builds to run on CPU using:
```diff
--- a/tinygrad/device.py
+++ b/tinygrad/device.py
@@ -11,7 +11,7 @@ from tinygrad.renderer import Renderer
 
 # **************** Device ****************
 
-ALL_DEVICES = ["METAL", "AMD", "NV", "CUDA", "QCOM", "CL", "CPU", "DSP", "WEBGPU"]
+ALL_DEVICES = ["CPU", "DSP", "WEBGPU"]
 class _Device:
   def __init__(self) -> None:
     self._devices = [x.stem[len("ops_"):].upper() for x in (pathlib.Path(__file__).parent/"runtime").iterdir() if x.stem.startswith("ops_")]
```

With that patch (on top of the others), the test suite continues to pass almost2 unchanged, and the accuracy in beautiful_mnist.py converges again:
```
│ 0%| | 0/25 [00:00<?, ?it/s]
│ loss: 2.54 test_accuracy: 25.89%: 4%|▎ | 1/25 [00:17<06:52, 0.06it/s]
│ loss: 1.55 test_accuracy: 25.96%: 8%|▋ | 2/25 [00:27<05:18, 0.07it/s]
│ loss: 1.16 test_accuracy: 27.18%: 12%|▉ | 3/25 [00:33<04:07, 0.09it/s]
│ loss: 0.95 test_accuracy: 44.30%: 16%|█▎ | 4/25 [00:38<03:23, 0.10it/s]
│ loss: 0.75 test_accuracy: 60.55%: 20%|█▌ | 5/25 [00:44<02:57, 0.11it/s]
│ loss: 0.67 test_accuracy: 72.00%: 24%|█▉ | 6/25 [00:50<02:38, 0.12it/s]
│ loss: 0.51 test_accuracy: 76.94%: 28%|██▏ | 7/25 [00:55<02:22, 0.13it/s]
│ loss: 0.45 test_accuracy: 80.07%: 32%|██▌ | 8/25 [01:00<02:09, 0.13it/s]
│ loss: 0.40 test_accuracy: 83.90%: 36%|██▉ | 9/25 [01:06<01:57, 0.14it/s]
│ loss: 0.39 test_accuracy: 85.94%: 40%|██▊ | 10/25 [01:12<01:48, 0.14it/s]
│ loss: 0.37 test_accuracy: 87.67%: 44%|███ | 11/25 [01:18<01:39, 0.14it/s]
│ loss: 0.30 test_accuracy: 88.41%: 48%|███▎ | 12/25 [01:24<01:31, 0.14it/s]
│ loss: 0.30 test_accuracy: 89.23%: 52%|███▋ | 13/25 [01:30<01:23, 0.14it/s]
│ loss: 0.31 test_accuracy: 89.92%: 56%|███▉ | 14/25 [01:38<01:17, 0.14it/s]
│ loss: 0.26 test_accuracy: 90.91%: 60%|████▏ | 15/25 [01:43<01:09, 0.14it/s]
│ loss: 0.26 test_accuracy: 91.80%: 64%|████▍ | 16/25 [01:51<01:02, 0.14it/s]
│ loss: 0.24 test_accuracy: 92.54%: 68%|████▊ | 17/25 [01:59<00:56, 0.14it/s]
│ loss: 0.23 test_accuracy: 93.04%: 72%|█████ | 18/25 [02:06<00:49, 0.14it/s]
│ loss: 0.19 test_accuracy: 93.83%: 76%|█████▎ | 19/25 [02:14<00:42, 0.14it/s]
│ loss: 0.17 test_accuracy: 94.53%: 80%|█████▌ | 20/25 [02:22<00:35, 0.14it/s]
│ loss: 0.20 test_accuracy: 94.97%: 84%|█████▉ | 21/25 [02:28<00:28, 0.14it/s]
│ loss: 0.18 test_accuracy: 95.19%: 88%|██████▏| 22/25 [02:36<00:21, 0.14it/s]
│ loss: 0.16 test_accuracy: 95.31%: 92%|██████▍| 23/25 [02:43<00:14, 0.14it/s]
│ loss: 0.21 test_accuracy: 95.47%: 96%|██████▋| 24/25 [02:51<00:07, 0.14it/s]
│ loss: 0.14 test_accuracy: 95.68%: 100%|███████| 25/25 [02:57<00:00, 0.14it/s]
│ loss: 0.14 test_accuracy: 95.68%: 100%|███████| 25/25 [02:57<00:00, 0.14it/s]
│ test_acc=95.68000030517578 >= 90.0
```
More detailed logs (including DEBUG=2 output that I don't really know how to interpret) can be found in conda-forge/tinygrad-feedstock#9. If desired, I can prepare downloadable artefacts that correspond 1:1 to the failing builds (i.e. to what would end up being published).
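In case it helps with narrowing this down, a comparison along the following lines (an untested sketch; random data, so the absolute numbers are meaningless) should show whether the argmax/==/mean chain used by `get_test_acc` already disagrees between METAL and CPU, or whether the divergence happens earlier, in the model inference itself:

```python
# Untested sketch: run the same accuracy-style reduction on METAL and CPU
# with identical inputs; the two printed values should agree if the
# argmax/==/mean chain itself is not the culprit.
import numpy as np
from tinygrad import Tensor

logits_np = np.random.randn(10000, 10).astype(np.float32)
labels_np = np.random.randint(0, 10, size=10000).astype(np.int32)

for dev in ["METAL", "CPU"]:
  logits, labels = Tensor(logits_np, device=dev), Tensor(labels_np, device=dev)
  acc = ((logits.argmax(axis=1) == labels).mean() * 100).item()
  print(f"{dev}: {acc:.2f}%")
```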