# Most popular open weights LLMs - December 2023

## Flash attention 2 support

Documentation and installation:

https://huggingface.co/docs/transformers/perf_infer_gpu_one#flashattention-2

- the flash_attn package must be version 2.1.0 or later
- pass the argument attn_implementation="flash_attention_2" to from_pretrained()
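
The version requirement above can be checked before loading a model. The following is a minimal sketch using only the standard library; `parse_release` and `flash_attn_is_recent_enough` are hypothetical helper names, and the parsing is deliberately simplistic (it stops at the first non-numeric component, so a pre-release such as `2.1.0rc1` is treated as too old).

```python
from importlib.metadata import PackageNotFoundError, version

MINIMUM_FLASH_ATTN = (2, 1, 0)  # minimum release stated in the notes above


def parse_release(version_string: str) -> tuple:
    """Return the leading purely numeric components, e.g. '2.3.6' -> (2, 3, 6)."""
    numbers = []
    for piece in version_string.split("."):
        if not piece.isdigit():
            break  # stop at suffixes such as 'post1' or '0rc1'
        numbers.append(int(piece))
    return tuple(numbers)


def flash_attn_is_recent_enough(installed_version: str) -> bool:
    """Compare a version string against the minimum release."""
    return parse_release(installed_version) >= MINIMUM_FLASH_ATTN


def installed_flash_attn_version():
    """Return the installed flash-attn version, or None when it is missing."""
    try:
        return version("flash-attn")
    except PackageNotFoundError:
        return None


found = installed_flash_attn_version()
if found is None:
    print("flash-attn is not installed")
else:
    print(f"flash-attn {found}: recent enough = {flash_attn_is_recent_enough(found)}")
```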

Configuration:

https://github.com/huggingface/transformers/blob/3cefac1d974db5e2825a0cb2b842883a628be7a0/src/transformers/modeling_utils.py#L1393

- torch_dtype must be torch.float16 or torch.bfloat16: Flash Attention 2.0 only supports these two dtypes
- the full model must be loaded on a GPU (no partial offloading to CPU or disk)
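
The two conditions above can be verified on a loaded model. This is a sketch with a hypothetical helper, `flash_attn_2_problems`, and is not the check that transformers itself performs. It assumes the caller passes `str(model.dtype)` and the `model.hf_device_map` dictionary that is set when a model is loaded with a `device_map`, where offloaded modules are mapped to `"cpu"` or `"disk"`.

```python
SUPPORTED_DTYPES = {"torch.float16", "torch.bfloat16"}


def flash_attn_2_problems(dtype_name: str, device_map: dict) -> list:
    """Return human-readable reasons why Flash Attention 2 would not be usable."""
    problems = []
    if dtype_name not in SUPPORTED_DTYPES:
        problems.append(f"unsupported dtype {dtype_name}")
    # Any module placed on "cpu" or "disk" means the model is only partially on the GPU.
    offloaded = sorted(
        name for name, device in device_map.items() if device in ("cpu", "disk")
    )
    if offloaded:
        problems.append(f"modules offloaded: {', '.join(offloaded)}")
    return problems


# Whole model on GPU 0 in float16: no problem reported.
print(flash_attn_2_problems("torch.float16", {"": 0}))

# float32 weights with the output head offloaded to CPU: two problems reported.
print(flash_attn_2_problems("torch.float32", {"model.embed_tokens": 0, "lm_head": "cpu"}))
```

With a real model, the call would look like `flash_attn_2_problems(str(model.dtype), model.hf_device_map)`.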

Implementation: 

https://github.com/huggingface/transformers/blob/3cefac1d974db5e2825a0cb2b842883a628be7a0/src/transformers/models/mistral/modeling_mistral.py#L715

- supported on model classes that set the class attribute _supports_flash_attn_2 = True
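
A quick way to test a class for this flag is sketched below. The helper name `supports_flash_attention_2` is hypothetical, and `_supports_flash_attn_2` is a private attribute that may change between transformers releases. Stand-in classes are used so that the snippet runs without transformers installed; with the library available, a model class such as `transformers.MistralForCausalLM` could be passed instead.

```python
def supports_flash_attention_2(model_class) -> bool:
    """True when the class opts in through the private _supports_flash_attn_2 flag."""
    return bool(getattr(model_class, "_supports_flash_attn_2", False))


# Stand-in classes (assumptions for illustration only).
class WithFlashAttention:
    _supports_flash_attn_2 = True


class WithoutFlashAttention:
    pass


print(supports_flash_attention_2(WithFlashAttention))     # True
print(supports_flash_attention_2(WithoutFlashAttention))  # False
```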

Example for Mistral:

```python
MISTRAL_ATTENTION_CLASSES = {
    "eager": MistralAttention,
    "flash_attention_2": MistralFlashAttention2,
    "sdpa": MistralSdpaAttention,
}

# Excerpt: the decoder layer picks its attention class from the model config
class MistralDecoderLayer(nn.Module):
    def __init__(self, config: MistralConfig, layer_idx: int):
        super().__init__()
        self.self_attn = MISTRAL_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx)

# https://github.com/huggingface/transformers/blob/3cefac1d974db5e2825a0cb2b842883a628be7a0/src/transformers/models/mistral/modeling_mistral.py#L321
```

Usage:

- FlashAttention-2 does not support computing attention scores on padding tokens
- for batched inference on sequences that contain padding tokens, the attention scores must be manually padded/unpadded, which significantly slows down batched generation
- to avoid this during training, use FlashAttention-2 on sequences without padding tokens (pack the dataset, or concatenate sequences until the maximum sequence length is reached)
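
The packing approach mentioned in the last point can be sketched in plain Python. This is an illustrative helper (`pack_sequences` is a hypothetical name, and the token ids are made up), not the method of any particular training library: tokenized sequences are concatenated with an end-of-sequence separator and cut into blocks of exactly `max_length` tokens, so no padding is needed. Note that, without extra masking, tokens in a packed block can attend across document boundaries.

```python
def pack_sequences(token_sequences, max_length, eos_token_id):
    """Concatenate tokenized sequences and cut them into fixed-size blocks.

    Returns (blocks, leftover): every block has exactly max_length tokens,
    leftover holds the trailing tokens that did not fill a complete block.
    """
    buffer = []
    blocks = []
    for tokens in token_sequences:
        buffer.extend(tokens)
        buffer.append(eos_token_id)  # mark the boundary between documents
        while len(buffer) >= max_length:
            blocks.append(buffer[:max_length])
            buffer = buffer[max_length:]
    return blocks, buffer


blocks, leftover = pack_sequences([[1, 2, 3], [4, 5], [6, 7, 8, 9]], max_length=4, eos_token_id=0)
print(blocks)    # [[1, 2, 3, 0], [4, 5, 0, 6], [7, 8, 9, 0]]
print(leftover)  # []
```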

List of supported models as of 12/30/2023:
- mbart
- bark
- distilbert
- falcon
- gpt_bigcode
- gpt_neo / gpt_neox
- llama / 
- llava / vipllava
- mistral / mixtral
- opt
- phi
- whisper

In [None]:
pip install flash-attn --no-build-isolation

In [2]:
import flash_attn
flash_attn.__version__

'2.3.6'

## Load models to GPU

In [None]:
pip install --upgrade transformers

In [10]:
import transformers
transformers.__version__

'4.36.2'

In [1]:
models = { 
    "redpajama_3b" : "togethercomputer/RedPajama-INCITE-Base-3B-v1", # 5.30 GB
    "btlm_3b" : "cerebras/btlm-3b-8k-base", #  4.93 GB
    "openllama2_3b" : "openlm-research/open_llama_3b_v2", #  6.38 GB
    "stablelm_3b" : "stabilityai/stablelm-3b-4e1t", # 5.21 GB
    "phi2_3b" : "microsoft/phi-2", # 5.18 GB

    "bloomz_7b" : "bigscience/bloomz-7b1-mt", # 13.18 GB
    "falcon_7b" : "tiiuae/falcon-7b", # 13.45 GB       
    "redpajama_7b" : "togethercomputer/RedPajama-INCITE-7B-Base", # 12.90 GB
    "mpt_7b" : "mosaicml/mpt-7b", # 12.39 GB
    "mpt_7b_8k" : "mosaicml/mpt-7b-8k", # 12.39 GB
    "openllama2_7b" : "openlm-research/open_llama_7b_v2", # 12.55 GB
    "llama2_7b" : "meta-llama/Llama-2-7b-hf", # 12.55 GB
    "llama2_7b_32k" : "togethercomputer/LLaMA-2-7B-32K", # 12.55 GB
    "mistral_7b" : "mistralai/Mistral-7B-v0.1", # 13.49 GB
    "qwen_7b" : "Qwen/Qwen-7B", # 14.38 GB
    "yi_6b" : "01-ai/Yi-6B", # 11.29 GB
    "decilm_7b" : "Deci/DeciLM-7B", # 13.12 GB
    
    "openllama1_13b" : "openlm-research/open_llama_13b", # 24.24 GB
    "llama2_13b" : "meta-llama/Llama-2-13b-hf", # 24.25 GB
    "qwen_14b" : "Qwen/Qwen-14B", # 26.39 GB
    "solar_10b" : "upstage/SOLAR-10.7B-v1.0", # 19.99 GB
    
    "llama1_33b" : "TheBloke/WizardLM-33B-V1.0-Uncensored-GPTQ", # 15.78 GB https://huggingface.co/alexl83/LLaMA-33B-HF
    "falcon_40b" : "TheBloke/falcon-40b-instruct-GPTQ", # 21.00 GB https://huggingface.co/tiiuae/falcon-40b
    "mpt_30b" : "abhinavkulkarni/mosaicml-mpt-30b-instruct-w4-g128-awq", # 15.00 GB https://huggingface.co/mosaicml/mpt-30b
    "codellama_34b" : "TheBloke/CodeLlama-34B-Instruct-GPTQ", # 17.07 GB https://huggingface.co/codellama/CodeLlama-34b-hf
    "yi_34b" : "TheBloke/Yi-34B-GPTQ", # 17.33 GB https://huggingface.co/01-ai/Yi-34B    
    "mixtral_8x7B" : "TheBloke/Mixtral-8x7B-v0.1-GPTQ" # 22.18 GB https://huggingface.co/mistralai/Mixtral-8x7B-v0.1
}

In [2]:
from time import perf_counter_ns

time_unit_µs = 1000
time_unit_ms = 1000*1000
time_unit_s = 1000*1000*1000

In [6]:
import torch

memory_unit_mb = 1024*1024
memory_unit_gb = 1024*1024*1024

total_memory = torch.cuda.get_device_properties(0).total_memory

def display_gpu_memory():
    print(torch.cuda.get_device_name(0))
    print(f"Total    : {(total_memory/memory_unit_mb):8,.1f} MB")
    print("------------------------------")
    free_memory = torch.cuda.mem_get_info()[0]
    reserved_memory = torch.cuda.memory_reserved(0)
    used_memory = torch.cuda.memory_allocated(0)    
    max_used_memory = torch.cuda.max_memory_allocated(0)
    overhead_memory = total_memory - free_memory - reserved_memory
    print(f"Overhead : {(overhead_memory/memory_unit_mb):8,.1f} MB - {int(overhead_memory/total_memory*100):3} %")
    print(f"Reserved : {(reserved_memory/memory_unit_mb):8,.1f} MB - {int(reserved_memory/total_memory*100):3} %")
    print(f"Free     : {(free_memory/memory_unit_mb):8,.1f} MB - {int(free_memory/total_memory*100):3} %")
    print("------------------------------")
    print(f"Used (total): {(used_memory/memory_unit_mb):8,.1f} MB - {int(used_memory/total_memory*100):3} %")
    print(f"Used (model): {((used_memory-overhead_memory)/memory_unit_mb):8,.1f} MB - {int((used_memory-overhead_memory)/total_memory*100):3} %")
    print(f"Max used    : {(max_used_memory/memory_unit_mb):8,.1f} MB - {int(max_used_memory/total_memory*100):3} %")

In [27]:
import os
from transformers.utils.hub import cached_file

def get_model_path_and_size_on_disk(model_name):    
    model_config_file = cached_file(model_name, "config.json", local_files_only=True)
    model_directory = os.path.dirname(model_config_file)
    
    total_size = 0
    for entry in os.listdir(model_directory):
        full_entry_path = os.path.join(model_directory, entry)
        if os.path.isfile(full_entry_path):
            file_size = os.path.getsize(full_entry_path)
            if file_size > 1024*1024:
                print(f"- {entry} = {file_size/memory_unit_mb:.1f} MB")
            total_size += file_size
    return model_directory,total_size

In [48]:
idx = 2
model_name = list(models.values())[idx]
idx,model_name

(2, 'openlm-research/open_llama_3b_v2')

In [49]:
print("Files size on disk:")
path,size = get_model_path_and_size_on_disk(model_name)
print(f"Total size = {size/memory_unit_mb:.1f} MB")

Files size on disk:
- pytorch_model.bin = 6535.6 MB
Total size = 6536.1 MB


In [50]:
filename = os.path.join(path, "pytorch_model.bin")

In [51]:
before_time_ns = perf_counter_ns()
with open(filename, mode='rb') as file:
    fileContent = file.read()
duration_ns = perf_counter_ns() - before_time_ns
print(f"- disk load time (cold) = {(duration_ns/time_unit_s):.3f} s")

- disk load time (cold) = 34.310 s


In [52]:
before_time_ns = perf_counter_ns()
with open(filename, mode='rb') as file: # b is important -> binary
    fileContent = file.read()
duration_ns = perf_counter_ns() - before_time_ns
print(f"- disk load time (hot) = {(duration_ns/time_unit_s):.3f} s")

- disk load time (hot) = 7.952 s


In [54]:
%%timeit
with open(filename, mode='rb') as file: # b is important -> binary
    fileContent = file.read()

1.52 s ± 357 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
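
To make the three measurements comparable, they can be converted to read throughput. The sketch below only reuses the file size and timings printed above; it gives roughly 190 MB/s for the cold read, 820 MB/s for the second read and 4.3 GB/s for the `timeit` mean, which presumably reflects the operating system's page cache serving the repeated reads.

```python
file_size_mb = 6535.6  # size of pytorch_model.bin reported above

# Wall-clock read times measured in the previous cells, in seconds.
read_times_s = {
    "cold": 34.310,
    "hot": 7.952,
    "timeit mean": 1.52,
}


def throughput_mb_per_s(size_mb: float, seconds: float) -> float:
    """Average read throughput in MB/s."""
    return size_mb / seconds


for label, seconds in read_times_s.items():
    print(f"- {label:12s}: {throughput_mb_per_s(file_size_mb, seconds):8,.1f} MB/s")
```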


In [56]:
temptensor = torch.rand(int(6536/2),1024,1024,dtype=torch.float16)

In [57]:
display_gpu_memory()
print()

before_time_ns = perf_counter_ns()
temptensor.to('cuda')
duration_ns = perf_counter_ns() - before_time_ns
print(f"- gpu transfer time (cold) = {(duration_ns/time_unit_s):.3f} s")

print()
display_gpu_memory()

NVIDIA GeForce RTX 4090
Total    : 24,563.5 MB
------------------------------
Overhead :  1,555.5 MB -   6 %
Reserved : 11,026.0 MB -  44 %
Free     : 11,982.0 MB -  48 %
------------------------------
Used (total):  5,458.6 MB -  22 %
Used (model):  3,903.1 MB -  15 %
Max used    : 10,917.1 MB -  44 %

- gpu transfer time (cold) = 0.984 s

NVIDIA GeForce RTX 4090
Total    : 24,563.5 MB
------------------------------
Overhead :  1,696.5 MB -   6 %
Reserved : 17,562.0 MB -  71 %
Free     :  5,305.0 MB -  21 %
------------------------------
Used (total):  5,458.6 MB -  22 %
Used (model):  3,762.1 MB -  15 %
Max used    : 11,994.6 MB -  48 %


In [61]:
before_time_ns = perf_counter_ns()
temptensor.to('cuda')
duration_ns = perf_counter_ns() - before_time_ns
print(f"- gpu transfer time (hot) = {(duration_ns/time_unit_s):.3f} s")

print()
display_gpu_memory()

- gpu transfer time (hot) = 0.663 s

NVIDIA GeForce RTX 4090
Total    : 24,563.5 MB
------------------------------
Overhead :  1,696.5 MB -   6 %
Reserved : 17,562.0 MB -  71 %
Free     :  5,305.0 MB -  21 %
------------------------------
Used (total):  5,458.6 MB -  22 %
Used (model):  3,762.1 MB -  15 %
Max used    : 11,994.6 MB -  48 %
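
CUDA work can be queued asynchronously, so a common precaution when timing GPU operations is to synchronize the device before reading the clock. The helper below is a minimal sketch (`time_call` is a hypothetical name): the synchronization function is passed in as an optional argument so that the snippet also runs on a machine without a GPU. It returns the result of the operation as well; keeping that result holds the GPU copy alive, whereas the cells above discard it, which is consistent with the reserved memory growing while the used memory stays unchanged.

```python
from time import perf_counter_ns


def time_call(operation, synchronize=None):
    """Run operation() and return (result, elapsed seconds).

    synchronize is an optional callable, for example torch.cuda.synchronize,
    invoked before starting and before stopping the clock.
    """
    if synchronize is not None:
        synchronize()  # drain work queued before the measurement
    start_ns = perf_counter_ns()
    result = operation()
    if synchronize is not None:
        synchronize()  # wait until the operation has actually finished
    elapsed_s = (perf_counter_ns() - start_ns) / 1_000_000_000
    return result, elapsed_s


# Runs anywhere: time a plain Python operation without synchronization.
total, seconds = time_call(lambda: sum(range(1_000_000)))
print(f"- sum computed in {seconds:.6f} s")

# On a CUDA machine (assumption: torch is imported and temptensor exists):
# gpu_tensor, seconds = time_call(lambda: temptensor.to("cuda"), synchronize=torch.cuda.synchronize)
```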


In [13]:
from transformers import AutoModelForCausalLM, AutoTokenizer


print(f"Loading {model_name} from disk to GPU:")

before_time_ns = perf_counter_ns()
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", torch_dtype="auto", attn_implementation="flash_attention_2") 
duration_ns = perf_counter_ns() - before_time_ns
print(f"- load time = {(duration_ns/time_unit_s):.3f} s")
free_memory = torch.cuda.mem_get_info()[0]
reserved_memory = torch.cuda.memory_reserved(0)
overhead_memory = total_memory - free_memory - reserved_memory
used_memory = torch.cuda.memory_allocated(0)    
model_used_memory = used_memory - overhead_memory
print(f"- GPU memory = {(model_used_memory/memory_unit_mb):.1f} MB")
print("")

print("GPU memory after")
display_gpu_memory()

Loading togethercomputer/RedPajama-INCITE-Base-3B-v1 from disk to GPU:
- load time = 4.758 s
- GPU memory = 3903.1 MB

GPU memory after
NVIDIA GeForce RTX 4090
Total    : 24,563.5 MB
------------------------------
Overhead :  1,555.5 MB -   6 %
Reserved : 11,026.0 MB -  44 %
Free     : 11,982.0 MB -  48 %
------------------------------
Used (total):  5,458.6 MB -  22 %
Used (model):  3,903.1 MB -  15 %
Max used    : 10,917.1 MB -  44 %


In [None]:
display_gpu_memory()