rawcpu #365

Merged
merged 6 commits into master from rawcpu on Aug 17, 2022

Conversation

geohot (Collaborator) commented Aug 16, 2022

Write operations in portable C++, a step toward removing the NumPy requirement.
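For illustration, here's a minimal sketch of what an elementwise op over raw CPU buffers in portable C++ could look like; the `binary_op` helper and all names are hypothetical, not the PR's actual code, and only convey the general idea of operating on plain float buffers without NumPy:

```cpp
// Hypothetical sketch (not the PR's code): an elementwise binary op
// applied over raw float buffers in portable C++, no NumPy required.
#include <cstddef>
#include <cstdio>
#include <vector>

// Apply op to each pair of elements from two same-sized raw buffers.
template <typename Op>
void binary_op(const float* a, const float* b, float* out, size_t n, Op op) {
  for (size_t i = 0; i < n; i++) out[i] = op(a[i], b[i]);
}

int main() {
  std::vector<float> a = {1, 2, 3}, b = {4, 5, 6}, out(3);
  binary_op(a.data(), b.data(), out.data(), a.size(),
            [](float x, float y) { return x + y; });  // e.g. an ADD op
  for (float v : out) std::printf("%g ", v);          // prints: 5 7 9
  std::printf("\n");
  return 0;
}
```

Presumably, dispatching from each op to a small kernel along these lines is what replaces the corresponding NumPy calls.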

geohot merged commit 783c120 into master on Aug 17, 2022
geohot deleted the rawcpu branch on August 17, 2022 at 09:33
Johnmiicheal added a commit to Johnmiicheal/tinygrad that referenced this pull request on Nov 12, 2022:
* bugfixes

* get_movementroot

* add PAD movementop

* fix permute stacking

* some permutes are reshapes

* SLICE -> PAD,SHRINK

* test opencl, commit to removing the crap conv code from GPU

* testopencl

* fixup tests

* opencl not imported

* we need that opt to make gpu decent speed

* w/e, that's a later prob

* fix bug caused by rounding

* fix opencl bug, no training on opencl

* more crap to remove without convs

* join expands

* default opt level 2

* prune graph

* don't shuffle if there's children involved

* disable opencl tests

* tests maybe

* output file to disk

* fix row pitch

* inputs and outputs

* outputs with size

* buffer_id is 8 bytes

* Fold reduce (tinygrad#362)

* folding reduce

* fold through movementops

* fixup shapes

* was too aggressive

* i knew we needed that

* don't recompute reduce

* working

* fix openpilot compile

* prunegraph openpilot

* types and reduce_shape

* refactor

* cleanups

* neater

* 1009

* 1004

* clean up reduce for 998

* touchups

* adam in benchmark_train_efficientnet

* REQUIRES_SIMPLE_REDUCE

* save weights

* zero out the buffer

* needs_load in image correct

* float16 fixups

* that should be right

* fix options on old pyopencl

* fix that soon

* remove useless init, add ops counter

* fix op estimate

* add gflop estimate

* add time sum

* notes

* fix ane on new mac os x

* amfi note

* docs

* notes

* update readme

* broken amfi patch

* ane: procPath issue. don't waste more time with this, focus on core tinygrad

* rawcpu (tinygrad#365)

* rawcpu

* add should work when we respect shapetracker

* now that's true

* still have to handle shapetracker

* copyin

* Fix mypy

* tinygrad.nn (tinygrad#367)

* tinygrad.nn

* flake8

* working on pylint

* more pylint

* more pylint

* pylint passes

* networkx

* mypy can't infer that type

* junk

* fixup run thneed

* run_onnx_torch

* reduce axis at the end

* much simpler reduce

* hmm, with the new reduce, we have to opt 3 for memory usage

* maybe that's a better way to do this

* less needless reshaping

* 2 stage reduce

* tune inter_red

* t.assign in optim

* add openpilot tests to tinygrad

* enable the openpilot test

* fix cpu thneed running

* use functools.partialmethod (tinygrad#369)

Co-authored-by: Kyle <kposborne@gmail.com>

* run_thneed with test

* fix test maybe

* opencl can't optimize that

* refactor getters

* remove from_image

* image input works

* fix typing

* native_exp is way faster on qcom

* hmm, the native exp/log breaks it too much

* float32 in image desc

* thneed run float32

* oops, compare with abs

* flip that

* print inputs

* no torch test if no torch

* add reciprocal

* still broken

* line count

* Rewrote Tensor.cat to be shorter and (hopefully) clearer (tinygrad#372)

* Rewrote Tensor.cat to be shorter and (hopefully) clearer

* Use cumsum[-1] instead of separate sum

* typos

* fix cl import error

* fix wrong size input

* TEST_ENET for openpilot compiler

* fix batchnorm folding in openpilot compile

* don't save input buffers

* save free 200ms

* stable diffusion start

* fix tests hopefully, more stable diffusion

* stable_diffusion: add attn and layernorm

* torch bs

* found tinygrad bug

* yolo

* fix check

* stable diffusion works

* remove ugly parens

* cleanups for Mid

* easier to read

* more readable actually

* one liner that's more clear

* from_number_like to fix div issue

* better idea for numbers, do the division in python

* work

* stable diffusion compiles (add no_init)

* runs on torch cpu

* Make creation helpers use fp32 by default (tinygrad#374)

* Make creation helpers use fp32 by default

half the big = twice the fast

* Fix flake8 with an extra multiply

* clip model is running

* fix transformer bugs

* fix last bug in unet probz

* all models match

* brown img

* it renders something

* cat horse winning ❗

* other prompt example

* better alphas

* stable diffusion cleanups

* stable diffusion in readme

* improve opencl, why is it OOMing

* bring back native exp log

* works at work

* 1100 lines, but sane linter rules

* fix stupid OPENCL=1 OOM

* broadcast from right to left (tinygrad#375)

* broadcast from right to left

* add another broadcasted add test

* fix sd with TORCH=1

* hmm, need this with broadcast change

* simpler movement op

* add div to operators

* fix slice one multi, and linear can be simpler with new broadcasting

* make gpu code readable

* cpu line savings and cleaner

* have to ignore that type

* add Linear to tinygrad.nn

* relax mnist test a tiny bit

* support more onnx ops (tinygrad#376)

* broadcast from right to left

* add another broadcasted add test

* more onnx ops

* use float32 range in clip

* change default opt to 2

* Revert "change default opt to 2"

This reverts commit 726f4e9.

* update serious_mnist.py (tinygrad#380)

* Added standalone CLIP tokenizer (tinygrad#382)

* Added standalone CLIP tokenizer.

* Fixed empty phrase.

* Truncating long prompts.

* Keeping two slots for the start and end token.

* Fixed empty phrase.

* Using tokenizer for empty phrase.

* Typo.

* cleanup clip tokenizer

* forgot a few

* test_matmul

* simple on device failing test

* fix test failure on MATMUL=1 backward pass

* fix matmul kernel and tests

* add barrier

* support float16 onnx weights (tinygrad#384)

* add min support

* that's simpler

* import tests from CL metal texture fix

* set requires_grad to None (tinygrad#387)

* set requires_grad to None

* some things need gradients

* hmm, why was get_parameters filtering

* clipnorm support

* Reshape dataset from fetch_mnist (tinygrad#390)

* fix mnist load from other dirs

* move get_parameters to optim.py

* Fix weight init: this work? (tinygrad#391)

* this work?

* glorot uniform

* requires_grad broke

* propagate the None correctly

* so this weight init works

* ahh, i think it's this

* can't beat this

* glorot is best for ae

* remove comments

* layernorm is all axis but the first

* revert layernorm to have axis param

* fix efficientnet

* fix bn folding issue, add new test

* fix tests

* Device.GPU isn't defined

* ugh, global state

* should this be 10?

* notrain test

* external_test_opt

* Fix OpenCL Metal texture issues (tinygrad#378)

* Fix OpenCL Metal texture issues

Tile CL images when needed, to fit into the 16384 max Metal image size;
gets me to ~4.8s/iteration for SD on M1 Pro with OPENCL=1 FLOAT16=1.

* Minor cleanup

* Fix mish in CI, or no-op?

* Is mish being framed?

* It would help if any of this reproduced locally

* ???

* OPT is reverted; use original mish

* Cleanup post-review

* Fix some shape usage

* Tiler tests, shouldn't oom or overflow either

* Can't CL if there's no CL?

* Run tiler tests even if GPU=1

* relu6 segfault binary chop; revert test

* relu6 segfault binary chop; revert accel

* relu6 segfault binary chop; revert . (???)

* end relu6 segfault binary chop; repo's haunted

* some args for stable diffusion

* test_sd_big_conv

* always MATMUL, test the ops in OPENCL

* ugh, why did that fail

* Fix GPU 2**31 virtual size limit (tinygrad#392)

* in progress

* big conv test works

* that's unneeded

* fix opencl with reduce

* rewrite contiguous_view_constant_fold

* clean up mids in loop code

* subidx

* print cl kernel before run

* no reduce, no loop

* Revert "no reduce, no loop"

This reverts commit 92777e4.

* measure speed vs torch

* touchup

* remove redundant list comprehension from inside all. (tinygrad#397)

remove explicit inheritance from object.

* enable tests in test_ops.py that are disabled but now work. (tinygrad#396)

remove custom tolerances that don't appear to be needed.

* openpilot: new models and onnx ops (tinygrad#401)

* ngrl stuff

* fngrl

* fix typo in compile script

* workflow dispatch

* new models in tests

* don't need to up this threshold

Co-authored-by: HaraldSchafer <harald.the.engineer@gmail.com>

* fix openpilot test

* refactoring thneed (tinygrad#400)

* refactoring thneed

* continue

* minor update

* looks like it's working

* big refactor

* confirm thneed got the right output

* code is there but it's broken

* works now

* always OPTWG, input -> dat

* fix type issue

* ReduceSum

* fix thneed self test

* read input shapes and break down the layers

* rerun

* zero out the inputs

* remove useless buffer

* add assert to catch issue in attention

* safe_numpy and warning for broken matmul

* add CONTIGUOUS loadop

* don't recopy backing

* might fix tests

* raise, don't assert

* fix nonstatic weights

* really dumb bug

* remove run_thneed dead code

* replace networkx with defaultdict

* move ops.py into lazy.py (tinygrad#402)

* move ops.py into lazy.py

* fix graph and linter

* ugh, didn't add

* relu simpler backward pass

* more imports from llvm branch

* LLVM Backend take 2 (tinygrad#403)

* take 2 llvm

* get_lazybuffers -> get_buffers

* llvm tests pass

* fix type issues and refactor LLVM

* Exec AST (tinygrad#404)

* working exec ast

* exec_ast is staticmethod

* GenericExecAST

* fold that sometimes

* ExplicitExecAST

* exec_ast for GPU

* gpu working

* get_lazyop_shape

* now gpubuffer is ExplicitExecAST

* dedup

* add a type

* RESHAPE in opencl code

* fix linter

* that too for linter

* cleanups

* remove dead code

* GenericShape is less lines

* add ALLOWED_KERNEL_COUNT to tests

* fix mypy

* that's gotta be recursive

* fix opencl shape processing

* remove unneeded lambda

* cleanups, remove E701

* can we lose the lines with E701 still there?

* lazy cleanups

* move into graph.py

* fix flake8

* fix graph in openpilot/compile.py

* hasattr and DeviceBuffer type fixups

* clean up movement_op in cpu and torch

* very minor

* test speed w/o bias

* more test opt

* no RESHAPEs in the AST

* MovementOps is unused

* one more opt test

* accurate flop estimation

* llvm doesn't vectorize

* vectorization

* gemm is 1.7 TFLOPS on a single M1 core

* more amx notes

* oops, remove while(1)

* separate STRIDED and EXPAND

* fix llvm vectorization by add analysis passes from the target machine

* that was in there twice, DEBUG>=4 to see loop opt

* rewrite some strideds into reshapes

* fix bug in ops test, it was cheating somehow

* stop blowing up floats

* comments and readability in lazy.py

* fix type error

* 1s are always mergable

* Gemm (tinygrad#416)

* gemm

* off by factor of 5

* 50 GFLOPS

* works

* 91 gflops

* working at 50G

* works

* iy

* 150 GFLOPS

* 150 GFLOPS

* N=2048 is still fast

* threading soon

* multithread

* pinning

* throttling is sad

* Align matrices to cacheline width (tinygrad#361)

Co-authored-by: cloud <Cloud11665@gmail.com>

* updates from the chonker branch

* fix termcolor import

* ugh, that too

* rename test functions to helper_

* bump version to 0.4.0

* Create python-publish.yml (tinygrad#163)

* Fix issue where batch_invstd not being set (tinygrad#421)

batch_invstd can be falsely assumed to be set even though it is None,
since hasattr will not return False in this case.
In BatchNorm2D a reshape will then be attempted, which causes an exception.

* Basic editorconfig support (tinygrad#422)

Almost every IDE or text editor supports
[editorconfig](https://editorconfig.org/).
I've set it up to just enforce the 2-space Python indents for now.

* contributing

* more that

* contrib more

* Reduce line count (tinygrad#424)

* save a line, save a life

* save a line, save a life

* change order of tern

* factorizing shapetracker from chonker

* contiguous, and no strided for matmul

* Simple chonker (tinygrad#431)

* chonker will make llvm fast

* work

* better speed tests, we will make them fast

* with the cache add is the same speed

* relu and neg are fast

* fix sum speed

* maximum maxnum?

* hack for gemm opt

* gemm very slow

* zeros like

* test_permute

* shapetracker returns self

* fix shapetracker factorization

* err, int strides

* permutes are faster now in tinygrad than pytorch

* support -1 in expand

* gemm unrolled

* improve final test case

* WIP GEMM

* why isn't GEMM fast?

* revert cache dim

* ffp contract works on clang, not llvm?

* ignore llvm ir

* this makes fma work at least, but no faster

* USE_4x4

* 63 GFLOPS

* 87 GFLOPS

* that wasn't matmul, 44 GFLOPS now

* 82 GFLOPS permuted

* this permute too

* a little speed for the convs

* 45 GFLOPS

* speed tests pass again

* clean up prints

* fix FMA WHAT A WASTE OF TIME

* colors

* moar fair

* GPU

* useless on chonker

* cleanups

* improve factorized shapetracker

* better threshold

* label conv

* work

* ops test pass again

* hot load the index

* run the last view, no need to create

* ZeroView needs a repr for the key to work

* fix segfault on out of bounds

* one more test

* start amx, and llvm.initialize_native_asmparser

* amx works

* nice AMX class

* nicer AMX class

* refactor get_idxs

* amx working

* is slower...

* useless flip

* cache

* SZ_X

* AMX_SZ_X/Y work alone

* Contiguous mlop

* test gemm packed

* PREPARE in packed

* use_amx factor

* prefetch isn't faster

* loop

* same 3ms

* 2.24 ms

* allow double on store in TG

* amx reduce is the same speed as non amx reduce

* include memory bandwidth

* clean up shapetracker

* flip returns stride

* prepare for upstream

* Update ops_llvm.py (tinygrad#426)

* permutes are yellow and green now

* faster conv

* llvm cleanups

* Show optimised IR under debug 4 (tinygrad#428)

* ASTKernel class

* Make tinygrad work with older python version (tinygrad#427)

* Make tinygrad work with older python version

* Use partialmethod instead of partial

* simple chonker is chonking

* remove junk from test speed vs torch

* fix linker and types

* AMX is only here now

* add LLVM tests, it's a valid backend now

* oops, run llvm test

* contiguous_op

* fix loadops compare

* dedup reduceops

Co-authored-by: calledit <1573053+calledit@users.noreply.github.com>

* s/contiguous_op/contiguous

* the speedy chonker is going to replace the old chonker (tinygrad#432)

* bringing back reshape and permute

* done with E701

* 4x4 works in generic way

* max and sum not vectorizing...

* special case single float

* support comparing to MPS

* improve matmul speed, consider generic principles

* GlobalCounter

* fix op tracking

* faster

* comment that out for now

* err, it needs that

* fix minor issues

* fix global_mem

Co-authored-by: George Hotz <geohot@gmail.com>
Co-authored-by: Comma Device <device@comma.ai>
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
Co-authored-by: George Hotz <george@comma.ai>
Co-authored-by: kposborne2 <53231580+kposborne2@users.noreply.github.com>
Co-authored-by: Kyle <kposborne@gmail.com>
Co-authored-by: Mitchell Goff <mitchellgoffpc@gmail.com>
Co-authored-by: Ollin Boer Bohan <madebyollin@gmail.com>
Co-authored-by: YassineYousfi <yyousfi1@binghamton.edu>
Co-authored-by: David Redmon <85855920+redmonmd@users.noreply.github.com>
Co-authored-by: Fernand Pajot <accounts@epigram.me>
Co-authored-by: Jacky Lee <39754370+jla524@users.noreply.github.com>
Co-authored-by: Drew Hintz <dhintz@squareup.com>
Co-authored-by: HaraldSchafer <harald.the.engineer@gmail.com>
Co-authored-by: cloud <Cloud11665@gmail.com>
Co-authored-by: Liam <3579535@myuwc.ac.za>
Co-authored-by: marcojob <44396071+marcojob@users.noreply.github.com>
Co-authored-by: Daniel Davis <dan@dandavis.dev>
Co-authored-by: calledit <1573053+calledit@users.noreply.github.com>