<a href="https://colab.research.google.com/github/tuhinmallick/AI-for-Fashion/blob/main/Neural_Speed_Fast_Inference_for_4_bit_LLMs_on_CPU.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook shows how to use Neural Speed and benchmarks its inference speed with a 7B model quantized with Intel's Neural Compressor and with llama.cpp.

First, you need to install the following libraries:

In [None]:
!pip install neural-speed intel-extension-for-transformers accelerate datasets

Collecting neural-speed
  Downloading neural_speed-1.0-cp310-cp310-manylinux_2_28_x86_64.whl (23.2 MB)
Collecting intel-extension-for-transformers
  Downloading intel_extension_for_transformers-1.4-cp310-cp310-manylinux_2_28_x86_64.whl (44.8 MB)
Collecting accelerate
  Downloading accelerate-0.29.2-py3-none-any.whl (297 kB)
Collecting datasets
  Downloading datasets-2.18.0-py3-none-any.whl (510 kB)
Collecting schema (from intel-extension-for-transformers)
  Downloading schema-0.7.5-py2.py3-
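After the installation, it can be useful to confirm which versions actually ended up in the environment, since the benchmark results depend on them. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical and not part of any of the installed packages.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Dict, Iterable, Optional


def installed_versions(packages: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if missing."""
    found: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # The package is not installed in the current environment.
            found[name] = None
    return found


# Distribution names taken from the pip command above.
print(installed_versions([
    "neural-speed",
    "intel-extension-for-transformers",
    "accelerate",
    "datasets",
]))
```

A `None` value for any of the four names means that the corresponding `pip install` did not succeed in this runtime.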

To use Neural Speed, the important line is

```
from intel_extension_for_transformers.transformers import AutoModelForCausalLM
```

Also, set load_in_4bit=True when loading the model.

The code below quantizes the model and benchmarks it. It runs the same prompt 10 times, without batching, and reports the average speed in tokens per second.

In [None]:
import time
from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

model_name = "kaitchup/Mayonnaise-4in1-02"

# Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
tokenizer.pad_token = tokenizer.eos_token
p = "You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.\n\nIf a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information. \n\n Tell me about gravity."

# Load the model; load_in_4bit=True applies 4-bit weight-only quantization
loading_start = time.time()
model = AutoModelForCausalLM.from_pretrained(
    model_name, load_in_4bit=True
)
print("--- Loading model time: %s seconds ---" % (time.time() - loading_start))

# Run the same prompt 10 times, one sequence at a time (no batching)
total_tokens = 0
total_duration = 0
for b in range(10):
  inputs = tokenizer(p, return_tensors="pt")
  generation_start = time.time()
  outputs = model.generate(**inputs, pad_token_id=tokenizer.eos_token_id, max_new_tokens=300)
  duration = time.time() - generation_start
  total_duration += duration

  for output in outputs:
    result = tokenizer.decode(output)
    # Count the token IDs in the returned sequence;
    # len(result) would count the characters of the decoded text instead
    nb_tokens = len(output)
    total_tokens += nb_tokens
  print("--- Speed: %s tokens/second ---" % (round(nb_tokens/duration,2)))
print("--- Average speed: %s tokens/second ---" % (round(total_tokens/total_duration,2)))

2024-04-14 00:23:26 [INFO] cpu device is used.
2024-04-14 00:23:26 [INFO] Applying Weight Only Quantization.
2024-04-14 00:23:26 [INFO] Using Neural Speed.


cmd: ['python', PosixPath('/usr/local/lib/python3.10/dist-packages/neural_speed/convert/convert_mistral.py'), '--outfile', 'runtime_outs/ne_mistral_f32.bin', '--outtype', 'f32', '--model_hub', 'huggingface', 'kaitchup/Mayonnaise-4in1-02']
--- Loading model time: 767.4104936122894 seconds ---
--- Speed: 30.12 tokens/second ---
--- Speed: 32.55 tokens/second ---
--- Speed: 32.61 tokens/second ---
--- Speed: 33.48 tokens/second ---
--- Speed: 32.79 tokens/second ---
--- Speed: 32.64 tokens/second ---
--- Speed: 33.5 tokens/second ---
--- Speed: 32.81 tokens/second ---
--- Speed: 32.51 tokens/second ---
--- Speed: 32.63 tokens/second ---
--- Average speed: 32.54 tokens/second ---
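The average on the last line is the total number of tokens divided by the total generation time, which is not the same as the mean of the per-run speeds when runs differ in duration. The sketch below illustrates the difference on made-up measurements; the helper name `summarize_runs` and the sample numbers are hypothetical and are not taken from the benchmark.

```python
from typing import List, Sequence, Tuple


def summarize_runs(runs: Sequence[Tuple[int, float]]) -> Tuple[List[float], float]:
    """Return (per-run tokens/s, overall tokens/s) for (n_tokens, seconds) pairs."""
    per_run = [n_tokens / seconds for n_tokens, seconds in runs]
    total_tokens = sum(n_tokens for n_tokens, _ in runs)
    total_seconds = sum(seconds for _, seconds in runs)
    # Overall throughput weights each run by how long it took.
    return per_run, total_tokens / total_seconds


# Two illustrative runs: 300 tokens in 10 s and 300 tokens in 5 s.
per_run, overall = summarize_runs([(300, 10.0), (300, 5.0)])
print(per_run)                      # [30.0, 60.0]
print(overall)                      # 40.0
print(sum(per_run) / len(per_run))  # 45.0, the unweighted mean of the per-run speeds
```

Dividing totals, as the benchmark cell does, gives slower runs more weight, so the result reflects the throughput over the whole session.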


The following cell does the same as the previous one but with the model in the GGUF format. The model is already quantized so we don't need to set load_in_4bit.

In [None]:
import time
from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

model_name = "kaitchup/Mayonnaise-4in1-02"

# Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
tokenizer.pad_token = tokenizer.eos_token
p = "You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.\n\nIf a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information. \n\n Tell me about gravity."

# Load the already quantized GGUF file from the same repository
loading_start = time.time()
model = AutoModelForCausalLM.from_pretrained(
    model_name, model_file="Q4_0.gguf"
)
print("--- Loading model time: %s seconds ---" % (time.time() - loading_start))

# Run the same prompt 10 times, one sequence at a time (no batching)
total_tokens = 0
total_duration = 0
for b in range(10):
  inputs = tokenizer(p, return_tensors="pt")
  generation_start = time.time()
  outputs = model.generate(**inputs, pad_token_id=tokenizer.eos_token_id, max_new_tokens=300)
  duration = time.time() - generation_start
  total_duration += duration

  for output in outputs:
    result = tokenizer.decode(output)
    # Count the token IDs in the returned sequence;
    # len(result) would count the characters of the decoded text instead
    nb_tokens = len(output)
    total_tokens += nb_tokens
  print("--- Speed: %s tokens/second ---" % (round(nb_tokens/duration,2)))
print("--- Average speed: %s tokens/second ---" % (round(total_tokens/total_duration,2)))

2024-04-10 05:32:31 [INFO] Using Neural Speed to load the GGUF model...


Q4_0.gguf:   0%|          | 0.00/4.11G [00:00<?, ?B/s]

config.json:   0%|          | 0.00/625 [00:00<?, ?B/s]

2024-04-10 05:34:01 [INFO] The model_type is mistral


--- Loading model time: 93.10519242286682 seconds ---
--- Speed: 44.45 tokens/second ---
--- Speed: 43.62 tokens/second ---
--- Speed: 44.93 tokens/second ---
--- Speed: 43.86 tokens/second ---
--- Speed: 43.54 tokens/second ---
--- Speed: 44.05 tokens/second ---
--- Speed: 44.83 tokens/second ---
--- Speed: 44.98 tokens/second ---
--- Speed: 44.6 tokens/second ---
--- Speed: 43.06 tokens/second ---
--- Average speed: 44.18 tokens/second ---
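To put the two runs side by side, the recorded loading times and average speeds can be turned into ratios. This is only arithmetic on the numbers printed above. The two cells were executed on different dates, the first loading time includes the conversion and quantization of the model, and the second includes the download of the 4.11G GGUF file, so the ratios are indicative rather than a controlled comparison.

```python
# Values copied from the outputs recorded in this notebook.
neural_compressor_run = {"load_s": 767.4104936122894, "speed": 32.54}
gguf_run = {"load_s": 93.10519242286682, "speed": 44.18}

# Ratio of the reported average speeds (GGUF over on-the-fly quantization).
speed_ratio = gguf_run["speed"] / neural_compressor_run["speed"]

# Ratio of the reported loading times (on-the-fly quantization over GGUF).
load_ratio = neural_compressor_run["load_s"] / gguf_run["load_s"]

print(f"Reported speed ratio: {speed_ratio:.2f}x")
print(f"Reported loading time ratio: {load_ratio:.2f}x")
```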


# Appendix: Benchmark with llama.cpp

For comparison, I run the same model with the same prompt using llama.cpp.

In [None]:
!git clone https://github.com/ggerganov/llama.cpp.git
!cd llama.cpp && make -j4

Cloning into 'llama.cpp'...
remote: Enumerating objects: 22268, done.
remote: Counting objects: 100% (10019/10019), done.
remote: Compressing objects: 100% (742/742), done.
remote: Total 22268 (delta 9639), reused 9427 (delta 9274), pack-reused 12249
Receiving objects: 100% (22268/22268), 26.58 MiB | 25.65 MiB/s, done.
Resolving deltas: 100% (15755/15755), done.
I ccache not found. Consider installing it for faster compilation.
I llama.cpp build info: 
I UNAME_S:   Linux
I UNAME_P:   x86_64
I UNAME_M:   x86_64
I CFLAGS:    -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG  -std=c11   -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -march=native -mtune=native -Wdouble-promotion 
I CXXFLAGS:  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthrea

Get the model file from the HF hub:

In [None]:
from huggingface_hub import hf_hub_download
hf_hub_download(repo_id="kaitchup/Mayonnaise-4in1-02", filename="Q4_0.gguf")

Q4_0.gguf:   0%|          | 0.00/4.11G [00:00<?, ?B/s]

'/root/.cache/huggingface/hub/models--kaitchup--Mayonnaise-4in1-02/snapshots/243063dab3eb237d6e2138f0233363d4fdaffccf/Q4_0.gguf'
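`hf_hub_download` returns the local path of the downloaded file, so the path can be kept in a variable and passed to llama.cpp's `-m` flag instead of being copied by hand. The sketch below only builds the argument list; the helper name `build_llama_cpp_command` and the placeholder path are hypothetical, and the flags are the ones used in the next cell.

```python
from typing import List


def build_llama_cpp_command(model_path: str, prompt: str, n_predict: int = 300) -> List[str]:
    """Build the argument list for the llama.cpp `main` binary."""
    return [
        "./llama.cpp/main",
        "-m", model_path,        # path to the GGUF file
        "-p", prompt,            # prompt text
        "-n", str(n_predict),    # number of tokens to generate
        "-e",                    # process escape sequences such as \n in the prompt
    ]


# Placeholder path for illustration; in the notebook it would be the value
# returned by hf_hub_download.
cmd = build_llama_cpp_command("/path/to/Q4_0.gguf", "Tell me about gravity.")
print(cmd)
```

In the notebook, `model_path = hf_hub_download(repo_id="kaitchup/Mayonnaise-4in1-02", filename="Q4_0.gguf")` would supply the first argument, and the resulting list could be run with `subprocess.run(cmd, check=True)`.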

In [None]:
!./llama.cpp/main -m /root/.cache/huggingface/hub/models--kaitchup--Mayonnaise-4in1-02/snapshots/243063dab3eb237d6e2138f0233363d4fdaffccf/Q4_0.gguf \
  -p "You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.\n\nIf a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information. \n\n Tell me about gravity." \
   -n 300 -e

Log start
main: build = 2640 (ba5e134e)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed  = 1712729280
llama_model_loader: loaded meta data with 21 key-value pairs and 291 tensors from /root/.cache/huggingface/hub/models--kaitchup--Mayonnaise-4in1-02/snapshots/243063dab3eb237d6e2138f0233363d4fdaffccf/Q4_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = original_model
llama_model_loader: - kv   2:                          llama.block_count u32              = 32
llama_model_loader: - kv   3:                       llama.context_length u32              = 32768
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   