# HuggingFace LLM inference experiments

- Compare inference options for quantisation and flash-attention
- As of 01/2024, torch's built-in SDPA attention implementation includes flash-attention without the need for the external `flash-attn` lib from Tri Dao (which can be a pain to install). But it's not currently implemented within HF for all models; in particular, there's no support for Mistral
- So for now the `flash-attn` lib is still needed to get flash-attention v2
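
A minimal sketch of how the attention backend could be chosen at load time, assuming the external library is importable as `flash_attn`. The helper name `pick_attn_implementation` is hypothetical; `"flash_attention_2"` and `"eager"` are values accepted by the `attn_implementation` argument of `from_pretrained`. It only checks that the package is installed, not that the GPU supports it.

```python
import importlib.util


def pick_attn_implementation() -> str:
    """Return a value for `attn_implementation` (hypothetical helper).

    Picks flash-attention v2 if the external `flash_attn` package can be
    found, otherwise falls back to the plain eager attention.
    """
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "eager"


attn_implementation = pick_attn_implementation()
print(attn_implementation)

# Intended use (not run here, needs transformers and a GPU):
# model = AutoModelForCausalLM.from_pretrained(
#     "mistralai/Mistral-7B-Instruct-v0.2",
#     attn_implementation=attn_implementation,
# )
```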

In [4]:
import json
import os
import sys
import time
from typing import List, Dict, Any

In [5]:
sys.path.append(os.getcwd())

In [3]:
from dotenv import load_dotenv
load_dotenv()

True

In [9]:
# !huggingface-cli login --token $HUGGINGFACE_TOKEN

Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /home/dom/.cache/huggingface/token
Login successful


First, just load the Mistral model with no additional options. It would be just as easy to use Mistral's reference implementation, but HF provides all the other options (quantisation etc.), so go with this.

In [6]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

device = "cuda" # the device to load the model onto

model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2", torch_dtype=torch.float16, device_map=device)
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

  from .autonotebook import tqdm as notebook_tqdm
config.json: 100%|██████████| 596/596 [00:00<00:00, 3.11MB/s]
model.safetensors.index.json: 100%|██████████| 25.1k/25.1k [00:00<00:00, 52.8MB/s]
model-00001-of-00003.safetensors: 100%|██████████| 4.94G/4.94G [00:21<00:00, 233MB/s]
model-00002-of-00003.safetensors: 100%|██████████| 5.00G/5.00G [00:22<00:00, 225MB/s]
model-00003-of-00003.safetensors: 100%|██████████| 4.54G/4.54G [00:20<00:00, 217MB/s]
Downloading shards: 100%|██████████| 3/3 [01:04<00:00, 21.65s/it]
Loading checkpoint shards: 100%|██████████| 3/3 [00:06<00:00,  2.17s/it]
generation_config.json: 100%|██████████| 111/111 [00:00<00:00, 235kB/s]
tokenizer_config.json: 100%|██████████| 1.46k/1.46k [00:00<00:00, 2.54MB/s]
tokenizer.model: 100%|██████████| 493k/493k [00:00<00:00, 396MB/s]
tokenizer.json: 100%|██████████| 1.80M/1.80M [00:00<00:00, 16.5MB/s]
special_tokens_map.json: 100%|██████████| 72.0/72.0 [00:00<00:00, 313kB/s]


In [22]:
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

In [11]:
prompt = "[INST] What is your favourite condiment? [/INST] "
inputs = tokenizer(prompt, return_tensors="pt").to(device)
print(inputs)
generated_ids = model.generate(**inputs, max_new_tokens=256, do_sample=True)
print(generated_ids)
outputs = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
print(outputs)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


{'input_ids': tensor([[    1,   733, 16289, 28793,  1824,   349,   574, 16020,  2076,  2487,
         28804,   733, 28748, 16289, 28793, 28705]], device='cuda:0'), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], device='cuda:0')}
tensor([[    1,   733, 16289, 28793,  1824,   349,   574, 16020,  2076,  2487,
         28804,   733, 28748, 16289, 28793, 28705,   315,   949, 28742, 28707,
           506,   264,  5277,  2187,   442,  3327, 22731, 28723,  2993, 28725,
           315,   541,  1912,   368,   369,  1287,   905,  1401,   272,  1526,
          3555,  4118,  2076,  8447, 10085,   356,   652,  8932,  5414, 28713,
           304,  3327, 26918, 28723,  2909,  3298, 10392,  3387,  3024,   446,
          4455,   715, 28725,  1580,   488, 28725,   579, 28724, 16042, 28725,
          3296, 16042, 28725,   993,  7136,   864, 28725,   304,  1016,   789,
         28723, 19576,  9807, 28725,   272,  1489,  2076,  2487,   349,   264,
          3209,   302,  3327, 

In [12]:
messages = [
    {"role": "user", "content": "How were the samples in the sp1200 encoded?"},
    {"role": "assistant", "content": "The samples were encoded in 12-bit PCM."},
    {"role": "user", "content": "What could be done to improve the drum sounds?"},
]

`apply_chat_template` doesn't return an `attention_mask`, which isn't really a problem here, but we can either generate the mask manually or first render the template to a string and then tokenize that string. The second route is advised against, but under the hood it is probably what `apply_chat_template` is doing anyway...
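
For a single unpadded sequence the manual mask is simply all ones. A batch of different-length prompts is where it gets more interesting, so here is a small sketch of left-padding with a matching mask, using plain Python lists rather than tensors. The function name `left_pad_with_mask` and the toy ids are illustrative assumptions. The mask is derived from sequence lengths rather than from `id != pad_id`, because this notebook sets the pad token to EOS and a real EOS inside a chat would otherwise be masked out.

```python
from typing import List, Tuple


def left_pad_with_mask(
    sequences: List[List[int]], pad_id: int
) -> Tuple[List[List[int]], List[List[int]]]:
    """Left-pad token id lists to equal length and build matching masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append([pad_id] * n_pad + seq)
        # The mask comes from the length, not from comparing ids with pad_id.
        attention_mask.append([0] * n_pad + [1] * len(seq))
    return input_ids, attention_mask


# Toy ids; 2 doubles as both EOS and padding, as in this notebook.
ids, mask = left_pad_with_mask([[1, 733, 2, 1824], [1, 733]], pad_id=2)
print(ids)
print(mask)
```

Left padding is used because decoder-only models continue from the last position, so the real tokens need to sit at the end of each row.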

For Llama and Mistral, it's easy to use their own implementation of the sentencepiece tokenizer (see llm-utils for utility funcs).
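
As an illustration of the string-first route, a hand-rolled version of the `[INST]` format could look like the sketch below. The function name `build_mistral_prompt` is hypothetical, and the exact whitespace around the EOS token is an assumption inferred from the decoded outputs in this notebook; the tokenizer's own chat template remains the authoritative source. The BOS token is deliberately left out so that the tokenizer adds it exactly once.

```python
from typing import Dict, List


def build_mistral_prompt(messages: List[Dict[str, str]], eos_token: str = "</s>") -> str:
    """Render alternating user/assistant messages in an [INST] style format."""
    parts = []
    for message in messages:
        if message["role"] == "user":
            parts.append(f"[INST] {message['content']} [/INST]")
        elif message["role"] == "assistant":
            # Assumed layout: assistant text, then EOS, then a space.
            parts.append(f"{message['content']}{eos_token} ")
        else:
            raise ValueError(f"unsupported role: {message['role']}")
    return "".join(parts)


example_messages = [
    {"role": "user", "content": "How were the samples in the sp1200 encoded?"},
    {"role": "assistant", "content": "The samples were encoded in 12-bit PCM."},
    {"role": "user", "content": "What could be done to improve the drum sounds?"},
]
example_prompt = build_mistral_prompt(example_messages)
print(example_prompt)
```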

In [13]:
# tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = tokenizer(prompt, return_tensors="pt").to(device)
inputs

{'input_ids': tensor([[    1,     1,   733, 16289, 28793,  1602,   654,   272, 11714,   297,
           272,   668, 28740, 28750, 28734, 28734, 23174, 28804,   733, 28748,
         16289, 28793,  1014, 11714,   654, 23174,   297, 28705, 28740, 28750,
         28733,  2581,  9596, 28755, 28723,     2,   733, 16289, 28793,  1824,
           829,   347,  2203,   298,  4916,   272, 16049,  7258, 28804,   733,
         28748, 16289, 28793]], device='cuda:0'), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1]], device='cuda:0')}

In [15]:
# encodeds = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
encodeds = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt", ).to(device)
print(encodeds)
attention_mask = torch.ones_like(encodeds)
generated_ids = model.generate(encodeds, attention_mask=attention_mask , max_new_tokens=256, do_sample=True)
outputs = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
print(outputs[0])

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


tensor([[    1,   733, 16289, 28793,  1602,   654,   272, 11714,   297,   272,
           668, 28740, 28750, 28734, 28734, 23174, 28804,   733, 28748, 16289,
         28793,  1014, 11714,   654, 23174,   297, 28705, 28740, 28750, 28733,
          2581,  9596, 28755, 28723,     2,   733, 16289, 28793,  1824,   829,
           347,  2203,   298,  4916,   272, 16049,  7258, 28804,   733, 28748,
         16289, 28793]], device='cuda:0')
[INST] How were the samples in the sp1200 encoded? [/INST]The samples were encoded in 12-bit PCM. [INST] What could be done to improve the drum sounds? [/INST]There are several ways to improve the drum sounds from an SP-1200:

1. Use higher quality samples: Samples with better recordings, more detail and less noise can improve the overall sound quality.
2. Editing and processing: The drum sounds can be edited and processed using various techniques such as compressing, equalizing, or adding reverb and delay to enhance their characteristics.
3. Velocity layer

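Same as above, but rendering the chat template to a string and tokenizing that string separately, rather than tokenizing directly with the chat template. This makes a minor difference: the printed `input_ids` start with two `1` (BOS) tokens instead of one, which can in turn affect the output. It can be more flexible, though.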

In [18]:
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = tokenizer(prompt, return_tensors="pt").to(device)
print(inputs)
generated_ids = model.generate(**inputs, max_new_tokens=256, do_sample=True)
outputs = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
print(outputs[0])

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


{'input_ids': tensor([[    1,     1,   733, 16289, 28793,  1602,   654,   272, 11714,   297,
           272,   668, 28740, 28750, 28734, 28734, 23174, 28804,   733, 28748,
         16289, 28793,  1014, 11714,   654, 23174,   297, 28705, 28740, 28750,
         28733,  2581,  9596, 28755, 28723,     2,   733, 16289, 28793,  1824,
           829,   347,  2203,   298,  4916,   272, 16049,  7258, 28804,   733,
         28748, 16289, 28793]], device='cuda:0'), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1]], device='cuda:0')}
[INST] How were the samples in the sp1200 encoded? [/INST]The samples were encoded in 12-bit PCM. [INST] What could be done to improve the drum sounds? [/INST]There are several ways to improve drum sounds in a SP-1200 or any other sampler:

1. Use high-quality source material: Always start with the best possible rec
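
The `input_ids` printed for the string-first route begin with `1, 1`, while the direct route begins with a single `1`: the rendered template already contains the BOS token and the tokenizer then prepends another. One way to avoid this is to pass `add_special_tokens=False` to the tokenizer call. Alternatively, the ids can be cleaned up afterwards, as in this small sketch (the helper name `drop_duplicate_bos` is hypothetical, and an attention mask would need the same trimming):

```python
from typing import List


def drop_duplicate_bos(input_ids: List[int], bos_id: int = 1) -> List[int]:
    """Collapse a repeated leading BOS id into a single one."""
    start = 0
    # Advance while the current and the next id are both BOS.
    while (
        start + 1 < len(input_ids)
        and input_ids[start] == bos_id
        and input_ids[start + 1] == bos_id
    ):
        start += 1
    return input_ids[start:]


# First few ids from the string-first route above.
ids_with_double_bos = [1, 1, 733, 16289, 28793, 1602]
cleaned_ids = drop_duplicate_bos(ids_with_double_bos)
print(cleaned_ids)
```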

In [None]:
%timeit -r 2 model.generate(**inputs, max_new_tokens=256, do_sample=True)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


16.2 s ± 304 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)


In [33]:
import time

n_repeats = 2
total_duration = 0
total_tps = 0
for _ in range(n_repeats):
    start = time.time()
    generated_ids = model.generate(**inputs, max_new_tokens=256, do_sample=True)
    end = time.time()
    duration = end - start
    total_duration += duration
    total_tps += len(generated_ids[0]) / duration
    print(f"duration: {duration:.2f}, t/s: {len(generated_ids[0]) / duration:.2f}")
print(f"Averages for prompt and generated | duration: {total_duration / n_repeats:.2f}, t/s: {total_tps / n_repeats:.2f}")

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


duration: 16.44 t/s: 18.80
duration: 16.49 t/s: 18.74
Averages for prompt and generated | duration: 16.47 t/s: 18.77


In [19]:
# Note: this ignores the difference between prompt and generated tokens, but as long as we're comparing like for like
# it's good enough for these purposes. Should find the time for each part separately for a more accurate comparison.
def calculate_average_duration(inputs, model, n_repeats=2, max_new_tokens=256):
    total_duration = 0
    total_tps = 0
    for _ in range(n_repeats):
        start = time.time()
        generated_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=True)
        end = time.time()
        duration = end - start
        total_duration += duration
        total_tps += len(generated_ids[0]) / duration
        print(f"duration: {duration:.2f}, t/s: {len(generated_ids[0]) / duration:.2f}")
    print(f"Averages for prompt and generated | duration: {total_duration / n_repeats:.2f}, t/s: {total_tps / n_repeats:.2f}")

In [None]:
calculate_average_duration(inputs, model)

Quick test for 2 repeats on the inputs defined above gives:

- duration: 16.44 t/s: 18.80
- duration: 16.49 t/s: 18.74
- Averages for prompt and generated | duration: 16.47 t/s: 18.77
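
The t/s figures above count prompt and generated tokens together. A rough sketch of reporting generated tokens separately is shown below. The name `time_generation` and the stand-in `fake_generate` are illustrative assumptions so that the block runs without a model. It still measures one wall-clock duration per call, so it does not split prefill time from decode time; timing a separate run with `max_new_tokens=1` would be one way to approximate the prefill cost.

```python
import time
from typing import Callable, Dict, List, Sequence


def time_generation(
    generate_fn: Callable[[], Sequence[int]], prompt_len: int, n_repeats: int = 2
) -> List[Dict[str, float]]:
    """Time repeated calls and report throughput with and without the prompt."""
    results = []
    for _ in range(n_repeats):
        start = time.perf_counter()
        output_ids = generate_fn()
        duration = time.perf_counter() - start
        new_tokens = len(output_ids) - prompt_len
        results.append(
            {
                "duration": duration,
                "total_tps": len(output_ids) / duration,  # prompt + generated
                "new_tps": new_tokens / duration,  # generated tokens only
            }
        )
    return results


def fake_generate() -> List[int]:
    """Stand-in for model.generate: sleeps briefly, returns 53 + 256 dummy ids."""
    time.sleep(0.01)
    return list(range(53 + 256))


for run in time_generation(fake_generate, prompt_len=53):
    print(f"duration: {run['duration']:.3f}, total t/s: {run['total_tps']:.1f}, new t/s: {run['new_tps']:.1f}")
```

With the real model, `generate_fn` could be something like `lambda: model.generate(**inputs, max_new_tokens=256, do_sample=True)[0]` with `prompt_len=inputs["input_ids"].shape[1]`.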


## Flash Attention v2

In [20]:
# check that the variable `model` exists and if it does, delete it
if "model" in locals():
    model = model.to("cpu")
    del model
    torch.cuda.empty_cache()

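# Reload the same checkpoint with flash-attention v2 (requires the flash-attn lib).
# Note: the baseline above was loaded with torch.float16, whereas this uses
# torch.bfloat16, so the timings below are not a strict like-for-like comparison.
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2",
    device_map=device,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)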

Loading checkpoint shards: 100%|██████████| 3/3 [01:06<00:00, 22.01s/it]


In [21]:
generated_ids = model.generate(**inputs, max_new_tokens=256, do_sample=True)
outputs = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
print(outputs[0])

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


[INST] How were the samples in the sp1200 encoded? [/INST]The samples were encoded in 12-bit PCM. [INST] What could be done to improve the drum sounds? [/INST] There are several ways to improve drum sounds using the SP-1200 or any other digital sampling workstation:

1. Sample selection: Choosing high-quality source material to begin with is crucial. High-end microphones, pre-amps, and interfaces can make a significant difference in the final sound.

2. Sample editing: Manipulating the samples to tailor them to your needs. This includes trimming, pitch shifting, reverse, and time-stretching.

3. Sample processing: Applying various effects to enhance or change the sound. Equalization, compression, and reverb are some of the most common effects.

4. Sequencing: Arranging the sounds in a pleasing way to create a groove or rhythm.

5. Layering: Combining multiple samples to create complex drum sounds and textures.

6. Tuning and timing: Making sure the drum sounds are in tune and in time.


In [24]:
calculate_average_duration(inputs, model)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


duration: 20.21, t/s: 15.29
duration: 20.23, t/s: 15.27
Averages for prompt and generated | duration: 20.22, t/s: 15.28
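
To put the two averaged results side by side, the arithmetic can be done directly on the numbers printed above. This is only a sketch of the comparison, not an explanation of the gap: both runs used 2 repeats on a single short prompt, and the dtype changed from float16 to bfloat16 between them, so the difference cannot be attributed to flash-attention alone.

```python
# Averages copied from the two timing runs in this notebook.
baseline = {"duration": 16.47, "tps": 18.77}  # float16, no attn_implementation given
flash = {"duration": 20.22, "tps": 15.28}  # bfloat16, flash_attention_2

tps_ratio = flash["tps"] / baseline["tps"]
duration_ratio = flash["duration"] / baseline["duration"]

print(f"throughput ratio (flash / baseline): {tps_ratio:.2f}")
print(f"duration ratio (flash / baseline): {duration_ratio:.2f}")
```

In this run the flash-attention configuration comes out at roughly 0.81x the baseline throughput, i.e. about 1.23x the duration.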
