
tinygrad 0.9.0

@wozeparrot released this 28 May 18:48

The codebase is close to its new limit of 8000 lines, sitting at 7958.
tinygrad is much more usable now.

Just over 1200 commits since 0.8.0.

Release Highlights

  • New documentation at https://docs.tinygrad.org.
  • gpuctypes has been brought in tree and is no longer an external dependency. [#3253]
  • Experimental AMD=1 and NV=1 backends that do not require any userspace runtime components such as ROCm or CUDA.
    • These backends should reduce Python overhead, especially in multi-GPU use cases.
  • PTX=1 for rendering directly to PTX instead of CUDA source. [#3139] [#3623] [#3775]
  • Nvidia tensor core support. [#3544]
  • THREEFRY=1 for numpy-free random number generation using threefry2x32 (see the first sketch after this list). [#2601] [#3785]
  • A more stable multi-tensor API (sharding sketch after this list).
  • Core tinygrad has been refactored into 4 pieces; read more about it in the documentation.
  • The linearizer and codegen now support generating kernels with multiple outputs.
  • Lots of progress towards greater kernel fusion in the scheduler.
    • Fusing of ReduceOps with their elementwise children. This trains mnist and gpt2 with ~20% fewer kernels and makes llama inference faster.
    • New LoadOps.ASSIGN allows fusing optimizer updates with the gradients (assign sketch after this list).
    • Kernels are now scheduled in BFS order. This improves resnet and llama speed.
    • W.I.P. for fusing multiple reduces: [#4259] [#4208]
  • MLPerf ResNet and BERT, with a W.I.P. UNet3D.
  • Llama 3 support, with a new example that provides an OpenAI-compatible API. [#4576]
  • NF4 quantization support in Llama examples. [#4540]
  • label_smoothing has been added to sparse_categorical_crossentropy (usage sketch below). [#3568]
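
The new backends and the RNG switch are controlled by environment variables. Below is a minimal sketch of THREEFRY=1; setting the variable from Python before importing tinygrad is an assumption about when the flag is read, and running the script from the shell as THREEFRY=1 python script.py works the same way.

```python
import os
os.environ["THREEFRY"] = "1"  # opt in to the threefry2x32 RNG (assumed to be read before the first random op)

from tinygrad import Tensor

x = Tensor.rand(2, 3)  # random values generated on-device, without numpy
print(x.numpy())
```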
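
A minimal sketch of the multi-tensor API via Tensor.shard; the two-device GPUS tuple is a hypothetical setup, and device names depend on your backend:

```python
from tinygrad import Tensor, Device

GPUS = tuple(f"{Device.DEFAULT}:{i}" for i in range(2))  # hypothetical two-device setup

x = Tensor.rand(256, 64).shard(GPUS, axis=0)  # split the batch dimension across devices
w = Tensor.rand(64, 32).shard(GPUS)           # no axis: replicate on every device
y = (x @ w).numpy()                           # each device computes its shard of the matmul
print(y.shape)
```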
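
A sketch of the assign pattern that LoadOps.ASSIGN enables; whether the update actually fuses with the kernel producing the gradient depends on the schedule, and the 0.01 learning rate is illustrative:

```python
from tinygrad import Tensor

w = Tensor.rand(10, 10).realize()  # a realized parameter buffer
g = Tensor.rand(10, 10)            # stand-in for a computed gradient
w.assign(w - 0.01 * g)             # in-place update targeting w's buffer
w.realize()                        # the scheduler may fuse the update with the kernel producing g
```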
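
And a usage sketch for label_smoothing in sparse_categorical_crossentropy; the shapes and the smoothing value are illustrative:

```python
from tinygrad import Tensor

logits = Tensor.randn(4, 10)   # batch of 4, 10 classes
labels = Tensor([1, 3, 5, 7])  # integer class targets
loss = logits.sparse_categorical_crossentropy(labels, label_smoothing=0.1)
print(loss.item())
```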

Known Issues

  • Using tinygrad in a conda env on macOS is known to cause problems with the METAL backend. See #2226.

See the full changelog: v0.8.0...v0.9.0

Join the Discord!