
Non-ggml backend #31

Open · philpax opened this issue Mar 17, 2023 · 28 comments

Labels: issue:enhancement (New feature or request) · meta:help-wanted (Extra attention is needed) · topic:backend-support (Support for alternate non-GGML backends, or for particular GGML backend features)

Comments

@philpax (Collaborator) commented Mar 17, 2023

This has been a topic of some discussion in #4 and on the Discord, so I figured I'd document our initial findings so far.

We would like to switch away from ggml at some point so that we can remove the C compiler dependency, and enable running on other types of devices (namely the GPU).

Our primary candidate for a Rust-native ML/tensor backend is burn, which is a flexible deep learning framework that supports multiple backends (including ndarray and torch).

Unfortunately, it doesn't support the two formats we need: f16 (original weights) and q4_0/q4_1 (quantized weights). Adding these to the ndarray backend should be viable, but getting it right and working optimally (i.e. similar to ggml's optimisations for those datatypes) will take some time.
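
For concreteness, here is a minimal sketch of the "store f16, widen to f32 per operation" approach (this is not burn's or ndarray's API, just the half crate; the function name is illustrative):

// Minimal sketch (not burn/ndarray API): store weights as f16 and widen to
// f32 per operation, trading some throughput for halved memory use.
use half::f16;

fn dot_f16(a: &[f16], b: &[f16]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x.to_f32() * y.to_f32()).sum()
}

fn main() {
    let a: Vec<f16> = (0..4).map(|i| f16::from_f32(i as f32)).collect();
    let b: Vec<f16> = vec![f16::from_f32(0.5); 4];
    assert_eq!(dot_f16(&a, &b), 3.0); // 0*0.5 + 1*0.5 + 2*0.5 + 3*0.5
}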

Torch does support f16 on the GPU only, and burn's Torch backend supports it. The main problem there is actually just testing: the 7B weights are 14GB, which is difficult to make work with most consumer GPUs.

So we're in a bit of a pickle - there are three options available, all of which will require some work, and all of which have individual drawbacks:

  1. Quantize the model to standard uint8 and use the ndarray/torch backends. This is the least work (at least in theory), but uint8 quantization performs worse than either f16 or q4, from what I've heard.
  2. Add support for f16 to burn's ndarray backend. The torch backend should already work, but it will be very hard to test on most of our machines. Adding support to ndarray for CPU inference shouldn't be impossible either (especially if we just convert to f32 for every operation), but it will be difficult to make it performance-optimal.
  3. Add support for q4_0/q4_1 to burn's ndarray backend. This is the option that will give us the most parity with the current implementation (assuming the majority of our users are using q4 weights), but it has the same performance-optimality issue as f16 on the CPU (every cumulative operation, like matrix multiplication and such, will need to be specialised). Additionally, there is no way to natively store a 4-bit element, so there's no guarantee that this will be space-optimal (e.g. we can't assume that ndarray and rustc will remap [[bool; 4]; N] to [u8; N/2]); see the packing sketch below the table.

This is summarised in the following table:

|         | uint8 | f16 | q4 |
|---------|-------|-----|----|
| ndarray | Yes, but at noticeable quality loss | Requires semi-significant implementation work | Requires significant implementation work |
| torch   | Yes, but at noticeable quality loss (GPU, CPU) | Yes, but is GPU-only | Unknown - should be possible, but likely requires custom code |
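
To illustrate the storage point in option 3: there is no native 4-bit element type, so packing into [u8; N/2] has to be done by hand. A minimal sketch (pack_q4/unpack_q4 are just illustrative names, not any crate's API):

// Minimal sketch: manually packing two 4-bit values per byte, since Rust has
// no native 4-bit element type.
fn pack_q4(vals: &[u8]) -> Vec<u8> {
    assert!(vals.len() % 2 == 0);
    vals.chunks_exact(2)
        .map(|pair| (pair[0] & 0x0F) | ((pair[1] & 0x0F) << 4))
        .collect()
}

fn unpack_q4(packed: &[u8]) -> Vec<u8> {
    packed.iter().flat_map(|b| [b & 0x0F, b >> 4]).collect()
}

fn main() {
    let vals = vec![1u8, 2, 3, 15];
    let packed = pack_q4(&vals); // two bytes instead of four
    assert_eq!(unpack_q4(&packed), vals);
}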

An idea that I briefly floated was porting ggml itself to Rust using c2rust and some cleanup work, but that's likely to be quite time-consuming and it locks us out of the relatively-free improvements we get from people making PRs against llama.cpp's ggml implementation. The gain from having pure Rust would be outweighed by the maintenance burden we'd put on ourselves.


I believe the other Rust ML crates also do not support f16 or q4, but that's from a cursory exploration. Happy to be proven wrong!

@katopz (Contributor) commented Mar 18, 2023

Not sure if this one is related (as opposed to pickle)? https://github.com/huggingface/safetensors#yet-another-format-
Sorry if it's not; I'm trying to catch up here and not quite on the same page yet 🤯.

@philpax (Collaborator, Author) commented Mar 18, 2023

safetensors are cool, but that's probably more applicable for #21. (Note that the weights would still need to be converted to safetensors format.)

@katopz (Contributor) commented Mar 19, 2023

So it looks something like this, am I right? Maybe we should add this to some md file for newcomers (like me!). 🤔

graph TD;
  A("PyTorch") --"<pre>1️⃣/2️⃣&nbsp;export_state_dict_checkpoint.py</pre>PyTorch model checkpoints (pth)"--> B(Python) --"<pre>3️⃣&nbsp;convert-pth-to-ggml.py</pre>Geometric Deep Learning Markup Language (ggml)"--> C(C++)--"<pre>4️⃣&nbsp;quantize.cpp</pre>Quantized ggml (bin)"-->D(Rust);

1️⃣ tloen/alpaca-lora/export_state_dict_checkpoint.py (llama-7b-hf)
2️⃣ jankais3r/LLaMA_MPS/export_state_dict_checkpoint.py (llama-13b-hf)
3️⃣ llama.cpp/convert-pth-to-ggml.py
4️⃣ llama.cpp/quantize.cpp

@philpax (Collaborator, Author) commented Mar 26, 2023

Also worth keeping an eye on: @Narsil's https://github.com/Narsil/smelte-rs.

@KerfuffleV2 (Contributor)

This is another one that could possibly be worth looking at: https://github.com/coreylowman/dfdx

One thing about it is that it seems pretty hard to load models where stuff like the array dimensions or structure is dynamic.

I looked at smelte for other stuff too, but one big con at the moment is that it says it's single-threaded. So I don't think it would even be able to get close to the current approach, on CPU at least.

@Narsil commented Mar 27, 2023

So I don't think it would even be able to get close to the current approach, on CPU at least.

You'd be surprised :) matmul is still linked against mkl, which is multi-threaded and makes the overall thing fast enough. Even ggml uses threading only for a few select ops, not for all of them.
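
A minimal sketch of what "thread only the matmul" can look like (illustrative only, not ggml's or mkl's code; assumes the rayon crate and parallelises over output rows, leaving every other op single-threaded):

// Illustrative only: parallelise the matmul over output rows with rayon.
use rayon::prelude::*;

/// C (m x n) = A (m x k) * B (k x n), all row-major f32 slices.
fn matmul(a: &[f32], b: &[f32], c: &mut [f32], m: usize, k: usize, n: usize) {
    assert_eq!(c.len(), m * n);
    c.par_chunks_mut(n).enumerate().for_each(|(i, c_row)| {
        for j in 0..n {
            let mut acc = 0.0f32;
            for l in 0..k {
                acc += a[i * k + l] * b[l * n + j];
            }
            c_row[j] = acc;
        }
    });
}

fn main() {
    let (m, k, n) = (2, 3, 2);
    let a = vec![1.0f32; m * k];
    let b = vec![1.0f32; k * n];
    let mut c = vec![0.0f32; m * n];
    matmul(&a, &b, &mut c, m, k, n);
    assert!(c.iter().all(|&x| x == 3.0));
}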

@jasonviipers

Look into Rust's tch crate, which is a high-level deep learning library built on top of PyTorch. PyTorch has built-in support for f16 and q4, so tch may be able to support those formats.

@philpax (Collaborator, Author) commented Mar 31, 2023

tch works great, but it requires Torch to be installed at the system level, which is non-ideal for us (we want using llama-rs to be as easy as any other Rust crate).

@Narsil commented Mar 31, 2023

Hey, I've started seeing if the code from ggml couldn't be done in pure Rust; here's the first draft:

https://github.com/Narsil/rblas

It's x86_64, AVX-only right now, and I'm getting 2x slower than intel-mkl on my old personal computer.

RUSTFLAGS="-C target-cpu=native" cargo +nightly bench
# don't forget to build for native to get all the AVX things supported.

Not sure if I screwed something up in the translation, if the f32 matmul of ggml isn't as good as intel-mkl, or if my threading policy sucks (using a simple threadpool, which isn't using spinlocks under the hood afaik).

Also, threadpool and num_cpus can be removed as dependencies; they just make my life and the code easier.

Still, sharing it in case people find it interesting to work on.
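
For readers unfamiliar with the kind of kernel involved, a rough sketch of an AVX f32 dot product using std::arch (illustrative only, not rblas's actual code; assumes an x86_64 CPU with AVX/FMA and a target-cpu=native build as above):

// Rough sketch of an AVX dot-product kernel. The intrinsics are unsafe and
// require the CPU to actually have AVX/FMA (e.g. via -C target-cpu=native).
#[cfg(target_arch = "x86_64")]
fn dot_avx(a: &[f32], b: &[f32]) -> f32 {
    use std::arch::x86_64::*;
    assert_eq!(a.len(), b.len());
    let mut sum = 0.0f32;
    unsafe {
        let mut acc = _mm256_setzero_ps();
        let chunks = a.len() / 8;
        for i in 0..chunks {
            let va = _mm256_loadu_ps(a.as_ptr().add(i * 8));
            let vb = _mm256_loadu_ps(b.as_ptr().add(i * 8));
            // acc += va * vb (uses FMA; plain AVX would need mul + add)
            acc = _mm256_fmadd_ps(va, vb, acc);
        }
        // horizontal sum of the 8 accumulator lanes
        let mut lanes = [0.0f32; 8];
        _mm256_storeu_ps(lanes.as_mut_ptr(), acc);
        sum += lanes.iter().sum::<f32>();
    }
    // scalar tail for the remaining elements
    for i in (a.len() / 8) * 8..a.len() {
        sum += a[i] * b[i];
    }
    sum
}

#[cfg(target_arch = "x86_64")]
fn main() {
    let a: Vec<f32> = (0..17).map(|i| i as f32).collect();
    let b = vec![1.0f32; 17];
    assert_eq!(dot_avx(&a, &b), 136.0); // 0 + 1 + ... + 16
}

#[cfg(not(target_arch = "x86_64"))]
fn main() {}

A production kernel would also gate on runtime feature detection (is_x86_feature_detected!) rather than assuming the build flags match the host.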

@KerfuffleV2 (Contributor)

I'm not sure if it's the same for GPT (I assume it would be), but at least with RWKV the vast, vast majority of the time was spent just in the matrix multiplication. The rest was basically insignificant.

You can probably just do simple/unoptimized versions of the other ops and come very close to equal performance, as long as the matmul part is fast.

@Narsil commented Mar 31, 2023

the vast, vast majority of the time was spent just in the matrix multiplication. The rest was basically insignificant.

The softmax and layer norm can start to take up some time when not threaded.
Still not horrifying, and still beating torch, but much more significant than one would expect.
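
For reference, the single-threaded scalar version of the op in question is tiny; a minimal softmax sketch (subtracting the max for numerical stability):

// Minimal single-threaded softmax over one row.
fn softmax(xs: &mut [f32]) {
    let max = xs.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let mut sum = 0.0f32;
    for x in xs.iter_mut() {
        *x = (*x - max).exp();
        sum += *x;
    }
    for x in xs.iter_mut() {
        *x /= sum;
    }
}

fn main() {
    let mut row = [1.0f32, 2.0, 3.0];
    softmax(&mut row);
    assert!((row.iter().sum::<f32>() - 1.0).abs() < 1e-6);
}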

Not sure if I screwed something up in the translation, if the f32 matmul of ggml isn't as good as intel-mkl, or if my threading policy sucks (using a simple threadpool, which isn't using spinlocks under the hood afaik)

I'm beating cblas by ~30% using this code on M1... I guess it's not that bad.

@kazord commented May 1, 2023

Did you take a look at how https://github.com/Noeda/rllama works?
It doesn't have q4 yet, but I've got decent results with CPU and OpenCL on my Radeon.
It supports splitting the job between CPU and GPU (but his implementation has too many barriers, so I had to rewrite it: https://github.com/kazord/rllama/tree/oclsplit2; currently stuck at 200ms/token on f16, with too much memory movement between CPU and GPU, costing 18ms each).

@mert-kurttutan commented May 8, 2023

Just tossing another idea around: another choice of backend would be to use faer-core.

On my Linux machine, it gives the same performance (both speed and CPU utilization) as intel-mkl, which is surprising enough that I kind of doubt my result, but I checked several times whether anything was wrong.

One thing is that it was not able to run on a Mac M1 machine.

@Narsil commented May 9, 2023

On my Linux machine, it gives the same performance (both speed and CPU utilization) as intel-mkl, which is surprising enough that I kind of doubt my result, but I checked several times whether anything was wrong.

This is indeed true, and a testament to its creator.
However, it isn't as fast at matmul for non-contiguous BLAS calls (like A.matmul(B.T)).

@mert-kurttutan commented May 9, 2023

On my Linux machine, it gives the same performance (both speed and CPU utilization) as intel-mkl, which is surprising enough that I kind of doubt my result, but I checked several times whether anything was wrong.

This is indeed true, and a testament to its creator. However, it isn't as fast at matmul for non-contiguous BLAS calls (like A.matmul(B.T)).

@Narsil Are you sure about the statement on non-contiguous calls? In my experiments, which use your ggblas bench script but with faer-core version 0.8.0 and a change in matrix dims, faer-core still gives the same performance as intel-mkl for both matmul and matmul_t.

The resulting logs:

test bench_faer_rs_n ... bench:   3,245,234 ns/iter (+/- 815,495)
test bench_faer_rs_t ... bench:   3,393,976 ns/iter (+/- 785,863)
test bench_mkl_n     ... bench:   3,042,964 ns/iter (+/- 677,965)
test bench_mkl_t     ... bench:   3,573,910 ns/iter (+/- 894,360)

I also got similar results for the gemm backend (instead of the faer-core backend).

I also checked that they are indeed calculating AB^T.
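
(A naive row-major reference for C = A·Bᵀ that the fast backends can be checked against; this is illustrative only, not the ggblas bench code itself.)

// Naive reference for C = A * B^T, where A is m x k, B is n x k, C is m x n.
fn matmul_t_ref(a: &[f32], b: &[f32], m: usize, k: usize, n: usize) -> Vec<f32> {
    let mut c = vec![0.0f32; m * n];
    for i in 0..m {
        for j in 0..n {
            let mut acc = 0.0f32;
            for l in 0..k {
                acc += a[i * k + l] * b[j * k + l];
            }
            c[i * n + j] = acc;
        }
    }
    c
}

fn main() {
    // 2x2 identity times its own transpose is still the identity
    let eye = vec![1.0, 0.0, 0.0, 1.0f32];
    assert_eq!(matmul_t_ref(&eye, &eye, 2, 2, 2), eye);
}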

@Narsil commented May 9, 2023

Interesting numbers (they seem pretty high; are you modifying the shapes?)

test bench_faer_rs_n ... bench:     432,096 ns/iter (+/- 73,060)
test bench_faer_rs_t ... bench:     721,426 ns/iter (+/- 200,362)
test bench_ggblas_n  ... bench:     571,520 ns/iter (+/- 199,552)
test bench_ggblas_t  ... bench:     393,694 ns/iter (+/- 200,837)

Interestingly, mkl is performing much worse today on my computer (not sure if I updated it since).

In any case, I thought there was potentially some nice upgrade possible on faer-rs to get the best of both worlds, ideally.

@mert-kurttutan commented May 9, 2023

Yep, I did modify the dims:

const M: usize = 128;
const N: usize = 768 * 3;
const K: usize = 768;

This seemed to decrease the variance/mean ratio (though not by much).
I think a more reliable/realistic, smaller-variance test would be to try it as the matmul backend of some NN model.
I will try it in your smelte-rs project to see how it behaves

@Narsil commented May 9, 2023

I will try it in your smelte-rs project to see how it behaves

Thanks. I don't have much time to add new features to it. In general, mkl used to be the best runtime for all the engines I tried.

@coreylowman commented May 9, 2023

Hey, this is the dfdx author. I recently added f16 support for both CPU & GPU (on main, currently waiting for a new release of half-rs). I actually just fully implemented llama using dfdx, which you can find here: https://github.com/coreylowman/llama-dfdx. Notably, it supports GPU, and can lazily load tensors stored on disk at runtime when they are needed.

I also moved the CPU matrix multiplication to using gemm (which is the backend of faer-rs) literally this morning 😅

Happy to add anything that you guys would need to use it!

Edit: I was getting ~30ms/token on an A10 GPU with llama-7b

@philpax (Collaborator, Author) commented May 10, 2023

Awesome! The main things we'd want that I don't believe dfdx has are memory-mapped tensors (so not just lazy loading, but no copying at all) and 4/5/8-bit quantisation support.

Our current thinking on this is to implement a computation graph abstraction (#137), and then have that shell out to alternate backends as required or as available. I'd love to see dfdx as either the provider of the abstraction or a backend in itself :)
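
As a purely hypothetical sketch of what "shell out to alternate backends" could look like (none of these names come from llm or #137; it's just one possible shape for a pluggable backend trait):

// Hypothetical only: one possible shape for a pluggable backend trait.
pub trait TensorBackend {
    type Tensor;

    fn matmul(&self, a: &Self::Tensor, b: &Self::Tensor) -> Self::Tensor;
    fn softmax(&self, x: &Self::Tensor) -> Self::Tensor;
    /// Load a (possibly quantised or memory-mapped) tensor from raw bytes.
    fn load(&self, data: &[u8], shape: &[usize]) -> Self::Tensor;
}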

@Narsil commented May 10, 2023

a computation graph abstraction

I feel forced to say that this approach has major drawbacks, the biggest of all being that it's hard to implement efficient runtimes.
onnxruntime has fixed primitives to rewrite graphs on the fly for gpt2, and those do not work for most other models (which use slightly different attention code). And in order to implement anything of the kind you're basically implementing a graph compiler.
And no one so far has managed to pull it off. (PyTorch has 10 different kinds, and it doesn't work most of the time; XLA has very severe limitations in terms of control; onnxruntime is also limited.)
And what seems to happen in the optimization space right now is that users will try 10 different "out-of-the-box" compiling solutions and keep the most performant (even if it's super bad and just happens to be the best). And usually there is no consistent winner across the board for different hardware and different models.

And if someone has a super clever idea to do computations more efficiently, well, it's much harder to implement in those graphs, since you have to speak an entirely new language (the graph structure).

My very personal opinion is that we shouldn't have "graph computation" models, but real code, as real code is a full descriptor of your graph (no need to reimplement the wheel there). It's fully qualified already, there are already great ways to modify any part you want without having to understand the graph structure.

Like if I want to reimplement a given model with new kernel X that can replace existing ops, there's an obvious way to do it (rewrite the corresponding code). It's very much not easy to do on computation graphs.

Of course, for training, in order to get automatic differentiation, we need a computation graph. PyTorch seems to have made it work correctly without having a "real computation graph"; it ends up being classic Python code, which is where it wins imo.

@coreylowman

Yeah, both of those are accurate.

4/5/8-bit quantisation support.

We have a tracking issue for 4-bit quantization, and someone was actually working on a draft PR for this, but it has kind of stalled out given how specific a use case it is. So there is a fairly complex way forward, but it'll take a not-insignificant amount of time. Luckily, llama doesn't use all the possible tensor operations, so the MVP is probably just implementing kernels for the specific ops we'd need.
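
For a rough idea of what such a CPU kernel involves, here's a simplified q4_0-style block dequantise (32 weights per block, one f32 scale, 16 bytes of packed nibbles mapped back as (q - 8) * scale); the real ggml layout and nibble ordering differ in detail, so treat this as illustrative only:

// Simplified q4_0-style block dequantise; not ggml's actual layout.
const BLOCK: usize = 32;

fn dequantize_block(scale: f32, quants: &[u8; BLOCK / 2], out: &mut [f32; BLOCK]) {
    for (i, byte) in quants.iter().enumerate() {
        let lo = (byte & 0x0F) as i32 - 8;
        let hi = (byte >> 4) as i32 - 8;
        out[2 * i] = lo as f32 * scale;
        out[2 * i + 1] = hi as f32 * scale;
    }
}

fn main() {
    let quants = [0x98u8; 16]; // low nibble 8 -> 0 steps, high nibble 9 -> +1 step
    let mut out = [0.0f32; 32];
    dequantize_block(0.5, &quants, &mut out);
    assert_eq!(out[0], 0.0);
    assert_eq!(out[1], 0.5);
}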

Has anyone done 4-bit quantization on CUDA? Or is this specifically for CPU optimizations?

memory-mapped tensors (so not just lazy loading, but no copying at all)

I was thinking about memory-mapped tensors yesterday (probably for a similar use case, where CPU tensors for llama can just use memory mapping for data storage). There might be a way to do this on top of dfdx, similar to how I did the lazy tensor stuff; however, it'd be really unsafe/sketchy. Basically, we'd have to construct a Vec<T> from the mmap'd &[u8] and then ensure that we can mem::forget the tensor so the Rust allocator doesn't try to free the vec. This may not be possible without causing undefined behavior. I'll experiment with this and let y'all know.
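
A sketch of the borrow-as-slice variant (not dfdx's code; assumes the memmap2 crate, a hypothetical weights.bin of raw little-endian f32s, and suitable alignment), which sidesteps the allocator/mem::forget problem by never building a Vec at all:

// Sketch: borrow the mapped bytes as &[f32] instead of constructing a Vec<T>.
use std::fs::File;
use memmap2::Mmap;

fn main() -> std::io::Result<()> {
    let file = File::open("weights.bin")?; // hypothetical raw f32 tensor file
    let mmap = unsafe { Mmap::map(&file)? };

    let bytes: &[u8] = &mmap;
    assert_eq!(bytes.len() % 4, 0);
    assert_eq!(bytes.as_ptr() as usize % std::mem::align_of::<f32>(), 0);

    // Reinterpret the mapping as f32s; the slice only lives as long as `mmap`.
    let floats: &[f32] = unsafe {
        std::slice::from_raw_parts(bytes.as_ptr() as *const f32, bytes.len() / 4)
    };

    println!("mapped {} parameters without copying", floats.len());
    Ok(())
}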

a computation graph abstraction

+1 on narsil's response.

This seems like a very complex way to gain access to running on multiple devices. The easiest/hackiest way to do it would be with feature flags, something like:

#[cfg(feature = "cuda")]
// call e.g. dfdx backend
#[cfg(feature = "cpu")]
// call existing ggml backend

While you have to maintain two separate pieces of code that do the same thing, I think it's probably simpler than creating/impl'ing a graph abstraction.

Thoughts?

@Narsil commented May 10, 2023

Has anyone done 4-bit quantization on CUDA? Or is this specifically for CPU optimizations?

GPTQ does it: https://github.com/qwopqwop200/GPTQ-for-LLaMa (Triton-backed, so you could steal their PTX file!)

@coreylowman commented May 10, 2023

Also, I tried out some sketchy mmap stuff, and it seems like you can create a Vec from an mmap buffer. I have no idea how safe it is, but it seems to work (it produces the same generations as the regular copy version) 🤷 I was able to "load" all 13GB of weights in 10ms on my dev laptop:

Detected model folder as LLaMa 7b.
Model size: 13476 MB
13476 MB of model parameters will be held in RAM.
Loaded weights in 9.5947ms

PR is here coreylowman/llama-dfdx#15

@philpax (Collaborator, Author) commented May 11, 2023

My very personal opinion is that we shouldn't have "graph computation" models, but real code, as real code is a full descriptor of your graph (no need to reimplement the wheel there). It's fully qualified already, there are already great ways to modify any part you want without having to understand the graph structure.

My thinking on this is that we already use a computation graph, through ggml: https://github.com/rustformers/llm/blob/main/crates/models/llama/src/lib.rs#L141-L326

Replicating this graph would be no worse than the current state of affairs, and it would allow us to directly "compile" our graph to the GGML graph in a way that would let us maintain compatibility.

I would need to read more into the state of affairs here before making a decision.
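
Purely as a hypothetical illustration (not the actual #137 design), the kind of tiny graph IR being discussed could look something like this, and could then either be lowered to ggml's graph or walked by an alternate backend:

// Hypothetical sketch only: a tiny, topologically ordered graph IR.
#[allow(dead_code)]
enum Node {
    Input { name: String, shape: Vec<usize> },
    MatMul { a: usize, b: usize }, // indices of earlier nodes
    SoftMax { input: usize },
    Add { a: usize, b: usize },
}

#[allow(dead_code)]
struct Graph {
    nodes: Vec<Node>, // node i may only reference nodes j < i
}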

While you have to maintain two separate pieces of code that do the same thing, I think it's probably simpler than creating/impl'ing a graph abstraction.

At present, we have five models (with a sixth hopefully coming soon). Multiplying the maintenance work by the number of backends seems intractable over time. I'd like to avoid that as much as possible, for as long as possible.

Also, I tried out some sketchy mmap stuff, and it seems like you can create a Vec from an mmap buffer. I have no idea how safe it is, but it seems to work (it produces the same generations as the regular copy version) 🤷 I was able to "load" all 13GB of weights in 10ms on my dev laptop

Very cool! I'll have to give this more of a look soon 🙂

@hhamud added the meta:help-wanted (Extra attention is needed) label on May 13, 2023
@philpax (Collaborator, Author) commented May 17, 2023

I've opened an issue with wonnx regarding GPU inference: webonnx/wonnx#169

I imagine it will be non-trivial for them to implement a more freeform interface (if they're interested in doing so), so it may not be done/could take a long time. That being said, I would love to see non-CUDA GPU inference!

@philpax (Collaborator, Author) commented May 22, 2023

Just listing all potential backends that come to mind, feel free to suggest more:

  • ggml
  • burn
  • dfdx
  • wonnx
  • smelte-rs
  • faer
  • onnxruntime
  • MLC
  • cuDNN
  • OpenVINO
  • ROCm
  • Torch

Note that some of these overlap and/or are at different abstraction levels. I'm just listing them out for general reference.

@wsxiaoys commented May 22, 2023

https://github.com/OpenNMT/CTranslate2 is another solid choice (CPU/GPU (CUDA) support, and a wide model support matrix).

@philpax added the topic:backend-support (Support for alternate non-GGML backends, or for particular GGML backend features) label on Jun 15, 2023