# GGML or GGUF in the world of Large Language Models
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
# Load LLM and Tokenizer
model_id = "TheBloke/zephyr-7B-beta-GPTQ"
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
model_id,
device_map="auto",
trust_remote_code=False,
revision="main"
)
# Create a pipeline
pipe = pipeline(model=model, tokenizer=tokenizer, task='text-generation')
###############################################
# AWQ: Activation-aware Weight Quantization
!pip install vllm
from vllm import LLM, SamplingParams
# Load the LLM
sampling_params = SamplingParams(temperature=0.0, top_p=1.0, max_tokens=256)
llm = LLM(
model="TheBloke/zephyr-7B-beta-AWQ",
quantization='awq',
dtype='half',
gpu_memory_utilization=.95,
max_model_len=4096
)
GGUF / GGML are file formats for quantized models
GGUF is a new format introduced by the llama.cpp team on August 21st 2023. It is a replacement for GGML.
Basically, GGUF (i.e. "GPT-Generated Unified Format"), previously GGML, is a quantization method that allows users to use the CPU to run an LLM but also offload some of its layers to the GPU for a speed up.
📌 GGML is a C++ Tensor library designed for machine learning, facilitating the running of LLMs either on a CPU alone or in tandem with a GPU.
💡 GGUF (new)
💡 GGML (old)
Llama.cpp has dropped support for the GGML format and now only supports GGUF
-
GGUF contains all the metadata it needs in the model file (no need for other files like tokenizer_config.json) except the prompt template
-
llama.cpp has a script to convert
*.safetensors
model files into*.gguf
-
Transformers & Llama.cpp support both CPU, GPU and MPU inference
Being compiled in C++, with GGUF the inference is multithreaded.
↪️ GGML format recently changed to GGUF which is designed to be extensible, so that new features shouldn’t break compatibility with existing models. It also centralizes all the metadata in one file, such as special tokens, RoPE scaling parameters, etc. In short, it answers a few historical pain points and should be future-proof.
📌 GGUF (GGML) vs GPTQ
##########################################
If you just started playing with Quantized LLMs on local machine and are confused about which model format to download.
First noting that the same model is available in many formats and quantizations.
📌 You'll see things like Q4, 4bpw, GGUF, EXL2, and a few others.
📌 These are put into the name for easy identification and represent the technique used to quantize them.
📌 The Q and bpw numbers stand for how much a model has been compressed. Typically, q4, or 4bpw, is considered the sweet spot between quality and size. Smaller than that, and the models rapidly degrade.
📌 GGUF, EXL2, AWQ, GPTQ, and others refer to the format.
The format of the model will depend on the hardware you have.
📌 EXL2 is the best option if you have enough VRAM to load the entire model onto your GPU. But, it only works if the whole thing will fit. The backend, Exllama, is not able to use your system ram.
📌 GGUF is the best if the model does not fit inside the VRAM. Llama.cpp, which uses the GGUF format, is able to use your CPU, system ram, GPU, and VRAM all at the same time. This will allow you to load much larger models by splitting it between your GPU and CPU, but at the cost of speed.
But always be ready to fix issues. For example sometime some model, may be incredibly inference stack or hardware dependent for some strange issues.
Recently e.g. few people were facing issue with "Mixtral Hermes" model. There was a hint that "Mixtral Hermes" vocab size being 32002 instead of mixtral's standard one of 32000 played a role, as that was the cause of Together's initial issue (which they fixed).
The way GGML quantizes weights is not as sophisticated as GPTQ’s. Basically, it groups blocks of values and rounds them to a lower precision. Some techniques, implement a higher precision for critical layers. In this case, every weight is stored in 4-bit precision, with the exception of half of the attention.wv and feed_forward.w2 tensors. Experimentally, this mixed precision proves to be a good tradeoff between accuracy and resource usage.
Look at the source code of ggml.c file, we can see how the blocks are defined. For example, the block_q4_0 structure is defined as:
#define QK4_0 32
typedef struct {
ggml_fp16_t d; // delta
uint8_t qs[QK4_0 / 2]; // nibbles / quants
} block_q4_0;
In GGML, weights are processed in blocks, each consisting of 32 values. For each block, a scale factor (delta) is derived from the largest weight value. All weights in the block are then scaled, quantized, and packed efficiently for storage (nibbles). This approach significantly reduces the storage requirements while allowing for a relatively simple and deterministic conversion between the original and quantized weights.
AWQ is a relatively new format on the block which is a quantization method similar to GPTQ. There are many differences between AWQ and GPTQ as methods but the crucial one is that AWQ assumes that not all weights are equally important for an LLM’s performance.
So with AWQ, there is a small fraction of weights that will be skipped during quantization which helps with the quantization loss.
As a result, we get a significant speed-up compared to GPTQ whilst keeping similar performance.
For AWQ, its better to use vLLM package as its well-supported with vLLM