Description
I noticed this issue when bringing up tinygrad v0.11.0 in conda-forge; I've since verified that the issue persists on master (as of f509019).
Basically, I'm running examples/beautiful_mnist.py as a kind of integration test, in addition to the test suite (pytest test/, aside from a handful of extra skips1). Interestingly, despite the test suite passing on OSX, beautiful_mnist fails if I set TARGET_EVAL_ACC_PCT=90:
```
│ 0%| | 0/25 [00:00<?, ?it/s]
│ loss: 2.55 test_accuracy: 11.35%: 4%|▎ | 1/25 [00:58<23:16, 0.02it/s]
│ loss: 1.54 test_accuracy: 11.35%: 8%|▋ | 2/25 [01:35<18:13, 0.02it/s]
│ loss: 1.11 test_accuracy: 75.00%: 12%|▉ | 3/25 [01:38<12:05, 0.03it/s]
│ loss: 0.83 test_accuracy: 75.00%: 16%|█▎ | 4/25 [01:42<09:00, 0.04it/s]
│ loss: 0.72 test_accuracy: 75.00%: 20%|█▌ | 5/25 [01:46<07:07, 0.05it/s]
│ loss: 0.57 test_accuracy: 75.00%: 24%|█▉ | 6/25 [01:50<05:50, 0.05it/s]
│ loss: 0.49 test_accuracy: 75.00%: 28%|██▏ | 7/25 [01:54<04:54, 0.06it/s]
│ loss: 0.45 test_accuracy: 75.00%: 32%|██▌ | 8/25 [01:58<04:12, 0.07it/s]
│ loss: 0.32 test_accuracy: 75.00%: 36%|██▉ | 9/25 [02:02<03:37, 0.07it/s]
│ loss: 0.32 test_accuracy: 75.00%: 40%|██▊ | 10/25 [02:06<03:09, 0.08it/s]
│ loss: 0.31 test_accuracy: 75.00%: 44%|███ | 11/25 [02:10<02:45, 0.08it/s]
│ loss: 0.30 test_accuracy: 75.00%: 48%|███▎ | 12/25 [02:14<02:25, 0.09it/s]
│ loss: 0.25 test_accuracy: 75.00%: 52%|███▋ | 13/25 [02:18<02:07, 0.09it/s]
│ loss: 0.26 test_accuracy: 75.00%: 56%|███▉ | 14/25 [02:21<01:51, 0.10it/s]
│ loss: 0.26 test_accuracy: 75.00%: 60%|████▏ | 15/25 [02:25<01:37, 0.10it/s]
│ loss: 0.23 test_accuracy: 75.00%: 64%|████▍ | 16/25 [02:29<01:23, 0.11it/s]
│ loss: 0.25 test_accuracy: 75.00%: 68%|████▊ | 17/25 [02:33<01:12, 0.11it/s]
│ loss: 0.17 test_accuracy: 75.00%: 72%|█████ | 18/25 [02:36<01:01, 0.11it/s]
│ loss: 0.23 test_accuracy: 75.00%: 76%|█████▎ | 19/25 [02:40<00:50, 0.12it/s]
│ loss: 0.18 test_accuracy: 75.00%: 80%|█████▌ | 20/25 [02:44<00:41, 0.12it/s]
│ loss: 0.17 test_accuracy: 75.00%: 84%|█████▉ | 21/25 [02:48<00:32, 0.12it/s]
│ loss: 0.13 test_accuracy: 75.00%: 88%|██████▏| 22/25 [02:52<00:23, 0.13it/s]
│ loss: 0.15 test_accuracy: 75.00%: 92%|██████▍| 23/25 [02:56<00:15, 0.13it/s]
│ loss: 0.16 test_accuracy: 75.00%: 96%|██████▋| 24/25 [02:59<00:07, 0.13it/s]
│ loss: 0.12 test_accuracy: 75.00%: 100%|███████| 25/25 [03:03<00:00, 0.14it/s]
│ loss: 0.12 test_accuracy: 75.00%: 100%|███████| 25/25 [03:03<00:00, 0.14it/s]
```
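For context, the failure comes from the eval-accuracy check at the end of the example, which compares the final test accuracy against the TARGET_EVAL_ACC_PCT environment variable. Paraphrased (the exact code in the script may differ slightly), it behaves roughly like this:

```python
# Paraphrased sketch of the eval-accuracy check at the end of
# examples/beautiful_mnist.py; the exact code in the script may differ.
from tinygrad.helpers import getenv

test_acc = 75.0  # stand-in for the final accuracy computed in the training loop
if (target := getenv("TARGET_EVAL_ACC_PCT", 0.0)) != 0.0:
  assert test_acc >= target, f"test_acc={test_acc} < {target}"
```

On this runner the check trips because the accuracy plateaus well below the target.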
Note that the loss is decreasing and matches the overall convergence seen on other platforms (Linux, Windows), where the target accuracy is reached easily. Note also that I wanted to see the accuracy update at every step, so I'm carrying a trivial patch that does:
```diff
--- a/examples/beautiful_mnist.py
+++ b/examples/beautiful_mnist.py
@@ -38,7 +38,7 @@ if __name__ == "__main__":
   for i in (t:=trange(getenv("STEPS", 70))):
     GlobalCounters.reset() # NOTE: this makes it nice for DEBUG=2 timing
     loss = train_step()
-    if i%10 == 9: test_acc = get_test_acc().item()
+    test_acc = get_test_acc().item()
     t.set_description(f"loss: {loss.item():6.2f} test_accuracy: {test_acc:5.2f}%")
 
   # verify eval acc
```

At first, my guess was that this might be due to the JIT going wrong somehow, i.e. in `get_test_acc` (examples/beautiful_mnist.py, lines 34 to 35 in f509019):

```python
@TinyJit
def get_test_acc() -> Tensor: return (model(X_test).argmax(axis=1) == Y_test).mean()*100
```

But even if I remove the decorator, the same situation (loss decreases, accuracy stagnates) persists:
```
│ loss: 2.56 test_accuracy: 10.09%: 4%|▎ | 1/25 [00:44<17:53, 0.02it/s]
│ loss: 1.66 test_accuracy: 10.09%: 8%|▋ | 2/25 [01:13<14:08, 0.03it/s]
│ loss: 1.18 test_accuracy: 11.35%: 12%|▉ | 3/25 [01:18<09:33, 0.04it/s]
│ loss: 0.90 test_accuracy: 11.35%: 16%|█▎ | 4/25 [01:22<07:12, 0.05it/s]
│ loss: 0.78 test_accuracy: 11.35%: 20%|█▌ | 5/25 [01:26<05:45, 0.06it/s]
│ loss: 0.71 test_accuracy: 11.35%: 24%|█▉ | 6/25 [01:30<04:46, 0.07it/s]
│ loss: 0.57 test_accuracy: 11.35%: 28%|██▏ | 7/25 [01:34<04:03, 0.07it/s]
│ loss: 0.53 test_accuracy: 11.35%: 32%|██▌ | 8/25 [01:38<03:29, 0.08it/s]
│ loss: 0.45 test_accuracy: 11.35%: 36%|██▉ | 9/25 [01:42<03:02, 0.09it/s]
│ loss: 0.45 test_accuracy: 11.35%: 40%|██▊ | 10/25 [01:47<02:40, 0.09it/s]
│ loss: 0.39 test_accuracy: 11.35%: 44%|███ | 11/25 [01:51<02:21, 0.10it/s]
│ loss: 0.34 test_accuracy: 11.35%: 48%|███▎ | 12/25 [01:55<02:04, 0.10it/s]
│ loss: 0.30 test_accuracy: 11.35%: 52%|███▋ | 13/25 [01:59<01:50, 0.11it/s]
│ loss: 0.29 test_accuracy: 11.35%: 56%|███▉ | 14/25 [02:03<01:36, 0.11it/s]
│ loss: 0.27 test_accuracy: 11.35%: 60%|████▏ | 15/25 [02:07<01:24, 0.12it/s]
│ loss: 0.27 test_accuracy: 11.35%: 64%|████▍ | 16/25 [02:11<01:13, 0.12it/s]
│ loss: 0.24 test_accuracy: 11.35%: 68%|████▊ | 17/25 [02:15<01:03, 0.13it/s]
│ loss: 0.20 test_accuracy: 11.35%: 72%|█████ | 18/25 [02:19<00:54, 0.13it/s]
│ loss: 0.20 test_accuracy: 11.35%: 76%|█████▎ | 19/25 [02:23<00:45, 0.13it/s]
│ loss: 0.20 test_accuracy: 11.35%: 80%|█████▌ | 20/25 [02:27<00:36, 0.14it/s]
│ loss: 0.19 test_accuracy: 11.35%: 84%|█████▉ | 21/25 [02:32<00:28, 0.14it/s]
│ loss: 0.16 test_accuracy: 11.35%: 88%|██████▏| 22/25 [02:36<00:21, 0.14it/s]
│ loss: 0.19 test_accuracy: 11.35%: 92%|██████▍| 23/25 [02:40<00:13, 0.14it/s]
│ loss: 0.20 test_accuracy: 11.35%: 96%|██████▋| 24/25 [02:44<00:06, 0.15it/s]
│ loss: 0.14 test_accuracy: 11.35%: 100%|███████| 25/25 [02:48<00:00, 0.15it/s]
│ loss: 0.14 test_accuracy: 11.35%: 100%|███████| 25/25 [02:48<00:00, 0.15it/s]
```
Looking at the DEBUG=2 output, I was somewhat surprised to see that it is using METAL; I'm not sure how well Azure Pipelines supports that. On the other hand, since the test suite passes, I'm assuming it covers the "default device" on that platform quite well. We're running on vanilla images from https://github.com/actions/runner-images, in this case macOS-15, but on an osx-64 host (Azure Pipelines still hasn't delivered osx-arm64 runners at scale; in the meantime we cross-compile osx-arm64 from osx-64, as we cannot yet use GHA for other reasons). Since there's no emulation framework like QEMU for macOS, we only run the tests natively, i.e. only on osx-64.
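As a quick sanity check of which backend gets picked on those images, a minimal snippet along these lines (just the public `Device.DEFAULT` attribute, nothing specific to the example) agrees with what the DEBUG=2 output shows:

```python
# Minimal sketch: print the backend tinygrad selects by default on this runner;
# this is the same selection the examples go through when no device is forced.
from tinygrad import Device

print(Device.DEFAULT)  # "METAL" on the macOS-15 images in these runs
```

(For a one-off run, I believe an environment override like CPU=1 would also switch the backend, but that's not what I used below.)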
To test the hypothesis that this is somehow related to METAL, I forced the builds to run on CPU using:
```diff
--- a/tinygrad/device.py
+++ b/tinygrad/device.py
@@ -11,7 +11,7 @@ from tinygrad.renderer import Renderer
 
 # **************** Device ****************
 
-ALL_DEVICES = ["METAL", "AMD", "NV", "CUDA", "QCOM", "CL", "CPU", "DSP", "WEBGPU"]
+ALL_DEVICES = ["CPU", "DSP", "WEBGPU"]
 class _Device:
   def __init__(self) -> None:
     self._devices = [x.stem[len("ops_"):].upper() for x in (pathlib.Path(__file__).parent/"runtime").iterdir() if x.stem.startswith("ops_")]
```

With that patch (on top of the others), the test suite continues to pass almost2 unchanged, and the accuracy in beautiful_mnist.py converges again:
```
│ 0%| | 0/25 [00:00<?, ?it/s]
│ loss: 2.54 test_accuracy: 25.89%: 4%|▎ | 1/25 [00:17<06:52, 0.06it/s]
│ loss: 1.55 test_accuracy: 25.96%: 8%|▋ | 2/25 [00:27<05:18, 0.07it/s]
│ loss: 1.16 test_accuracy: 27.18%: 12%|▉ | 3/25 [00:33<04:07, 0.09it/s]
│ loss: 0.95 test_accuracy: 44.30%: 16%|█▎ | 4/25 [00:38<03:23, 0.10it/s]
│ loss: 0.75 test_accuracy: 60.55%: 20%|█▌ | 5/25 [00:44<02:57, 0.11it/s]
│ loss: 0.67 test_accuracy: 72.00%: 24%|█▉ | 6/25 [00:50<02:38, 0.12it/s]
│ loss: 0.51 test_accuracy: 76.94%: 28%|██▏ | 7/25 [00:55<02:22, 0.13it/s]
│ loss: 0.45 test_accuracy: 80.07%: 32%|██▌ | 8/25 [01:00<02:09, 0.13it/s]
│ loss: 0.40 test_accuracy: 83.90%: 36%|██▉ | 9/25 [01:06<01:57, 0.14it/s]
│ loss: 0.39 test_accuracy: 85.94%: 40%|██▊ | 10/25 [01:12<01:48, 0.14it/s]
│ loss: 0.37 test_accuracy: 87.67%: 44%|███ | 11/25 [01:18<01:39, 0.14it/s]
│ loss: 0.30 test_accuracy: 88.41%: 48%|███▎ | 12/25 [01:24<01:31, 0.14it/s]
│ loss: 0.30 test_accuracy: 89.23%: 52%|███▋ | 13/25 [01:30<01:23, 0.14it/s]
│ loss: 0.31 test_accuracy: 89.92%: 56%|███▉ | 14/25 [01:38<01:17, 0.14it/s]
│ loss: 0.26 test_accuracy: 90.91%: 60%|████▏ | 15/25 [01:43<01:09, 0.14it/s]
│ loss: 0.26 test_accuracy: 91.80%: 64%|████▍ | 16/25 [01:51<01:02, 0.14it/s]
│ loss: 0.24 test_accuracy: 92.54%: 68%|████▊ | 17/25 [01:59<00:56, 0.14it/s]
│ loss: 0.23 test_accuracy: 93.04%: 72%|█████ | 18/25 [02:06<00:49, 0.14it/s]
│ loss: 0.19 test_accuracy: 93.83%: 76%|█████▎ | 19/25 [02:14<00:42, 0.14it/s]
│ loss: 0.17 test_accuracy: 94.53%: 80%|█████▌ | 20/25 [02:22<00:35, 0.14it/s]
│ loss: 0.20 test_accuracy: 94.97%: 84%|█████▉ | 21/25 [02:28<00:28, 0.14it/s]
│ loss: 0.18 test_accuracy: 95.19%: 88%|██████▏| 22/25 [02:36<00:21, 0.14it/s]
│ loss: 0.16 test_accuracy: 95.31%: 92%|██████▍| 23/25 [02:43<00:14, 0.14it/s]
│ loss: 0.21 test_accuracy: 95.47%: 96%|██████▋| 24/25 [02:51<00:07, 0.14it/s]
│ loss: 0.14 test_accuracy: 95.68%: 100%|███████| 25/25 [02:57<00:00, 0.14it/s]
│ loss: 0.14 test_accuracy: 95.68%: 100%|███████| 25/25 [02:57<00:00, 0.14it/s]
│ test_acc=95.68000030517578 >= 90.0
```
More detailed logs (including DEBUG=2 output that I don't really know how to interpret) can be found in conda-forge/tinygrad-feedstock#9. If desired, I can prepare downloadable artefacts that correspond 1:1 to the failing builds (i.e. to what would end up being published).
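In case it helps with narrowing this down, a comparison along the following lines (an untested sketch; random data, so the absolute numbers are meaningless) should show whether the argmax/==/mean chain used by `get_test_acc` already disagrees between METAL and CPU, or whether the divergence happens earlier, in the model inference itself:

```python
# Untested sketch: run the same accuracy-style reduction on METAL and CPU
# with identical inputs; the two printed values should agree if the
# argmax/==/mean chain itself is not the culprit.
import numpy as np
from tinygrad import Tensor

logits_np = np.random.randn(10000, 10).astype(np.float32)
labels_np = np.random.randint(0, 10, size=10000).astype(np.int32)

for dev in ["METAL", "CPU"]:
  logits, labels = Tensor(logits_np, device=dev), Tensor(labels_np, device=dev)
  acc = ((logits.argmax(axis=1) == labels).mean() * 100).item()
  print(f"{dev}: {acc:.2f}%")
```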