I have a question regarding quantization workflows. Does applying QAT before converting to GGUF format (e.g. using Q4, Q4_K_M) result in better quality compared to directly quantizing with GGUF alone?
I'm planning to serve my model with llama.cpp, so converting to GGUF is required. I've noticed a clear quality drop when using the quantization methods llama.cpp provides, so I'm considering QAT to mitigate it.
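For context on why QAT might help here, the core idea is "fake quantization" during training: the forward pass rounds weights to the low-bit grid and maps them back to float, so the model learns weights that survive the rounding that post-training GGUF quantization applies. Below is a minimal, hypothetical sketch of that round-trip for symmetric 4-bit values; it is a simplification for illustration, not llama.cpp's actual Q4_K_M scheme (which uses blocks, per-block scales, and mins).

```python
# Simplified fake-quantization forward pass, as used conceptually in QAT.
# NOTE: illustrative only; not the Q4_K_M format used by llama.cpp.

def fake_quant_4bit(w, scale):
    """Symmetric 4-bit round trip: snap to an integer in [-8, 7],
    then map back to float. The residual error is what QAT lets
    training compensate for (gradients pass through via the
    straight-through estimator, not shown here)."""
    q = max(-8, min(7, round(w / scale)))
    return q * scale

weights = [0.83, -0.41, 0.07, -1.20]
scale = max(abs(w) for w in weights) / 7  # one shared scale, simplified

deq = [fake_quant_4bit(w, scale) for w in weights]
errors = [abs(w - d) for w, d in zip(weights, deq)]
print(deq)
print(max(errors))  # worst-case round-trip error for this group
```

Plain post-training quantization (what `llama-quantize` does) only minimizes this error after the fact, whereas QAT exposes the model to it during training, which is why it can close some of the quality gap at Q4-level bit widths.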
Has anyone experimented with this approach or have any insights to share?