This notebook shows how to quantize and run LLMs with mixed precision using ExLlamaV2.

More details in this article: [Run Llama 2 70B on Your GPU with ExLlamaV2](https://kaitchup.substack.com/p/run-llama-2-70b-on-your-gpu-with)

Note that when I wrote this notebook, ExLlamaV2 was still a very young project. If you find that it doesn't work anymore, please leave a comment in the article above. I'll update the notebook.

In [None]:
!pip install transformers
!git clone https://github.com/turboderp/exllamav2
%cd exllamav2
!pip install -r requirements.txt

Collecting transformers
  Downloading transformers-4.33.2-py3-none-any.whl (7.6 MB)
Collecting huggingface-hub<1.0,>=0.15.1 (from transformers)
  Downloading huggingface_hub-0.17.3-py3-none-any.whl (295 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Collecting safetensors>=0.3.1 (from transformers)
  Downloading safetensors-0.3.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)

We will download Llama 2 from the Hugging Face Hub. Llama 2 is a gated model, so we must log in first with an account that has been granted access.

In [None]:
from huggingface_hub import notebook_login

notebook_login()

ExLlamaV2 doesn't integrate well with Hugging Face transformers, so you first need to download the model locally. ExLlamaV2 reads safetensors, so we only need the safetensors version and can skip the ".bin" files.

In [None]:
# The directory where we want to store the model must exist.
!mkdir -p ./Llama-2-13b-hf/

In [None]:
from huggingface_hub import snapshot_download
snapshot_download(repo_id="meta-llama/Llama-2-13b-hf", ignore_patterns=["*.bin"], local_dir="./Llama-2-13b-hf/", local_dir_use_symlinks=False)

'/content/exllamav2'
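
To see what `ignore_patterns=["*.bin"]` actually filters out, here is a minimal sketch using Python's `fnmatch`. It assumes that `huggingface_hub` applies shell-style wildcard matching to file names. The file names below are illustrative, not read from the repository.

```python
from fnmatch import fnmatch

# Illustrative file names (assumed, not fetched from the Hub).
files = [
    "model-00001-of-00003.safetensors",
    "pytorch_model-00001-of-00003.bin",
    "pytorch_model.bin.index.json",
    "config.json",
]

def kept_files(names, ignore_patterns):
    """Return the names that match none of the ignore patterns."""
    return [n for n in names if not any(fnmatch(n, p) for p in ignore_patterns)]

kept = kept_files(files, ["*.bin"])
print(kept)
```

Only names ending in `.bin` are skipped, so a small index file such as `pytorch_model.bin.index.json` would still pass this filter. That is harmless here, since we only need the safetensors weights.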

We need a dataset to calibrate the quantization. I use the WikiText-2 test set. Again, we need it locally, so I download the .parquet file.

In [None]:
!wget https://huggingface.co/datasets/wikitext/resolve/refs%2Fconvert%2Fparquet/wikitext-2-v1/test/0000.parquet

--2023-09-25 20:27:50--  https://huggingface.co/datasets/wikitext/resolve/refs%2Fconvert%2Fparquet/wikitext-2-v1/test/0000.parquet
Resolving huggingface.co (huggingface.co)... 13.35.166.69, 13.35.166.36, 13.35.166.50, ...
Connecting to huggingface.co (huggingface.co)|13.35.166.69|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 685430 (669K)
Saving to: ‘0000.parquet’


2023-09-25 20:27:51 (1.69 MB/s) - ‘0000.parquet’ saved [685430/685430]
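
Before starting a conversion that runs for hours, it can be worth checking that the download really is a Parquet file and not, say, an HTML error page. The sketch below relies only on the fact that Parquet files begin and end with the 4-byte marker `PAR1`. The helper name `looks_like_parquet` is hypothetical, and the check does not validate the contents.

```python
from pathlib import Path

PARQUET_MAGIC = b"PAR1"  # Parquet files begin and end with these 4 bytes

def looks_like_parquet(path):
    """Cheap sanity check: the file starts and ends with the Parquet magic bytes."""
    data_path = Path(path)
    if not data_path.is_file() or data_path.stat().st_size < 2 * len(PARQUET_MAGIC):
        return False
    with data_path.open("rb") as f:
        head = f.read(len(PARQUET_MAGIC))
        f.seek(-len(PARQUET_MAGIC), 2)  # 2 = seek relative to the end of the file
        tail = f.read(len(PARQUET_MAGIC))
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC

print(looks_like_parquet("0000.parquet"))
```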



The quantization of Llama 2 13B is done with convert.py. It takes around 2 hours.

In [None]:
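# -i: source model directory, -o: working directory for temporary files,
# -c: calibration dataset (Parquet), -cf: directory for the final quantized model,
# -b: target average bits per weight (bpw)
!mkdir -p ./Llama-2-13b-hf/temp/
!python convert.py \
    -i ./Llama-2-13b-hf/ \
    -o ./Llama-2-13b-hf/temp/ \
    -c 0000.parquet \
    -cf ./Llama-2-13b-hf/3.0bpw/ \
    -b 3.0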

Streaming output truncated to the last 5000 lines.
 -- 1.0:4b 32g s4                  4.13 bpw    rfn_error: 0.02717
 -- 0.1:5b/0.9:4b 32g s4           4.23 bpw    rfn_error: 0.02606
 -- 0.1:6b/0.9:4b 32g s4           4.33 bpw    rfn_error: 0.02580
 -- 1.0:5b 128g s4                 5.03 bpw    rfn_error: 0.01565
 -- 0.1:6b/0.9:5b 32g s4           5.23 bpw    rfn_error: 0.01348
 -- 0.05:8b/0.05:6b/0.9:5b 32g s4  5.33 bpw    rfn_error: 0.01338
 -- 0.4:6b/0.6:5b 32g s4           5.53 bpw    rfn_error: 0.01240
 -- 0.1:8b/0.3:6b/0.6:5b 32g s4    5.73 bpw    rfn_error: 0.01224
 -- 1.0:6b 128g s4                 6.03 bpw    rfn_error: 0.00823
 -- 1.0:6b 32g s4                  6.13 bpw    rfn_error: 0.00832
 -- 0.1:8b/0.9:6b 128g s4          6.23 bpw    rfn_error: 0.00786
 -- 1.0:8b 32g s4                  8.13 bpw    rfn_error: 0.00562
 -- Time: 8.82 seconds
 -- Linear: model.layers.16.self_attn.k_proj
 -- 0.05:3b/0.95:2b 32g s4         2.18 bpw    r
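
The log lines above can be read as candidate mixed-precision settings for one layer. The sketch below is my interpretation of the notation, not taken from the ExLlamaV2 source: `0.1:5b/0.9:4b` is assumed to mean 10% of the weights at 5 bits and 90% at 4 bits, `32g` a group size of 32 weights, and `s4` one 4-bit scale per group. The function name `estimate_bpw` is hypothetical.

```python
def estimate_bpw(setting):
    """Estimate bits per weight for a setting such as '0.1:5b/0.9:4b 32g s4'."""
    mix, group, scale = setting.split()
    # Weighted average of the bit widths, e.g. 0.1 * 5 + 0.9 * 4.
    weight_bits = 0.0
    for part in mix.split("/"):
        fraction, bits = part.split(":")
        weight_bits += float(fraction) * int(bits.rstrip("b"))
    # Assumed overhead: one scale of `s` bits shared by each group of `g` weights.
    group_size = int(group.rstrip("g"))
    scale_bits = int(scale.lstrip("s"))
    return weight_bits + scale_bits / group_size

for setting in ["1.0:4b 32g s4", "0.1:5b/0.9:4b 32g s4",
                "1.0:5b 128g s4", "0.05:3b/0.95:2b 32g s4"]:
    print(f"{setting:28s} {estimate_bpw(setting):.3f} bpw")
```

Under this reading, the four settings come out at about 4.125, 4.225, 5.031, and 2.175 bpw, which is within rounding of the 4.13, 4.23, 5.03, and 2.18 bpw printed by convert.py. The lower the bpw, the higher the reported `rfn_error` tends to be, which is the trade-off the converter balances to reach the target average.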

You can use test_inference.py to try the model. For the demonstration in the remainder of this notebook, I use Llama 2 70B quantized to an average of 2.55 bpw. This model was created by the ExLlamaV2 authors.
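
As a rough guide to what an average bpw means for memory, the size of the quantized weights is approximately the number of parameters times the bpw, divided by 8 bits per byte. The sketch below uses nominal parameter counts (13e9 and 70e9), which are assumptions and differ slightly from the real counts. The function name `quantized_size_gb` is hypothetical.

```python
def quantized_size_gb(n_params, bpw):
    """Approximate size of the quantized weights in GB (10^9 bytes)."""
    return n_params * bpw / 8 / 1e9

# Nominal parameter counts (assumptions; the real counts differ slightly).
print(f"Llama 2 13B at 3.0 bpw:  {quantized_size_gb(13e9, 3.0):.1f} GB")
print(f"Llama 2 70B at 2.55 bpw: {quantized_size_gb(70e9, 2.55):.1f} GB")
print(f"Llama 2 70B at 16 bpw:   {quantized_size_gb(70e9, 16):.1f} GB")
```

The estimate covers the weights only. The cache and activations need additional GPU memory at inference time.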

In [None]:
!pip install transformers

In [None]:
!mkdir ./Llama-2-70b-2.5bpw/

In [None]:
from huggingface_hub import snapshot_download

snapshot_download(repo_id="turboderp/Llama2-70B-exl2", ignore_patterns=["*.bin"], revision="2.5bpw", local_dir="./Llama-2-70b-2.5bpw/", local_dir_use_symlinks=False)

'/content/drive/MyDrive/Llama-2-70b-2.5bpw'

In [None]:
!python test_inference.py -m ./Llama-2-70b-2.5bpw/ -p "Once upon a time,"

 -- Model: /content/drive/MyDrive//Llama-2-70b-2.5bpw/
 -- Options: ['rope_scale 1.0', 'rope_alpha 1.0']
 -- Loading model...
 -- Loading tokenizer...
 -- Warmup...
 -- Generating (greedy sampling)...

Once upon a time, there was a little girl who lived in a castle. She had everything a princess could want, and she was the most beautiful girl in the world.
One day, she looked in the mirror and said, “Mirror, mirror, on the wall, who is the fairest of them all?”
The mirror replied, “You are the most beautiful girl in the world.”
But the princess wasn’t satisfied. She wanted to be more beautiful. So she went to a witch and asked her to make her even more beautiful.
The witch gave her a potion, and the princess d

Prompt processed in 0.08 seconds, 5 tokens, 65.70 tokens/second
Response generated in 7.43 seconds, 128 tokens, 17.23 tokens/second
