You're correct that we require a static graph to run on device, and that we use per-tensor quantization for activations (including the KV cache). Typically, we keep KV cache activations at 8 bits and the rest of the model's activations at 16 bits. We've found that this can get you quite close to matching the floating-point performance of the model, depending on the technique you use to quantize your weights.
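To make the 8-bit versus 16-bit trade-off concrete, here is a minimal sketch of per-tensor quantization in plain Python. It assumes an asymmetric uniform scheme with a single scale and zero point shared by the whole tensor. The function names (`quantize_per_tensor`, `max_abs_error`) are hypothetical and are not taken from any library, and this is an illustration of the idea rather than the on-device implementation.

```python
import random


def quantize_per_tensor(values, num_bits):
    """Quantize then dequantize a flat list of floats.

    One scale and one zero point are shared by every element,
    which is what "per-tensor" means here (assumed asymmetric scheme).
    """
    qmin, qmax = 0, (1 << num_bits) - 1
    # Extend the range to include zero so 0.0 maps to an exact integer.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    # Fall back to 1.0 for an all-zero tensor to avoid dividing by zero.
    scale = (hi - lo) / (qmax - qmin) or 1.0
    zero_point = round(-lo / scale)

    dequantized = []
    for v in values:
        q = round(v / scale) + zero_point
        q = max(qmin, min(qmax, q))  # clamp to the integer grid
        dequantized.append((q - zero_point) * scale)
    return dequantized


def max_abs_error(reference, approx):
    """Largest absolute element-wise difference between two lists."""
    return max(abs(r - a) for r, a in zip(reference, approx))


random.seed(0)
# Stand-in for an activation tensor; real activations will differ.
activations = [random.gauss(0.0, 1.0) for _ in range(1024)]

err8 = max_abs_error(activations, quantize_per_tensor(activations, 8))
err16 = max_abs_error(activations, quantize_per_tensor(activations, 16))

print(f"8-bit max error:  {err8:.6f}")
print(f"16-bit max error: {err16:.6f}")
```

Because the step size is the tensor's range divided by the number of integer levels, going from 8 to 16 bits shrinks the rounding error by roughly a factor of 256 for the same tensor. Which tensors tolerate the coarser 8-bit grid depends on the model.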

Answer selected by quic-ashvkuma