# Performance Implications of Various Model Loading Options

Version: 2024-4-18

In this notebook, we will explore the performance implications of various model loading options, including the precision type and model loading mechanism.

TL;DR: 
- Use `vLLM` if the model you want to use is compatible with the library. Use it with AWQ quantization if possible.
- If you must load the model with the `transformers` library:
    - For small models, manually specify the correct 16-bit format.
    - For large models, load the model in 4-bit.

First we write a class that loads a model and runs inference, incorporating a few useful features such as GPU detection and memory release.

In [3]:
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.utils import logging
import torch
import time
import humanize
import gc

# Hide all but the most critical log messages
logging.set_verbosity(logging.FATAL)

class LoadTest():
    
    def __init__(self,
                 model_path="mistralai/Mistral-7B-Instruct-v0.1",
                 **kwargs):
        """
        Initializes the model
        """
        
        # Is GPU available?
        if torch.cuda.is_available():
            self.device = "cuda"
        else:
            self.device = "cpu"
            
        if "device_map" not in kwargs:
            kwargs["device_map"] = self.device
                          
        start = time.time()
        
        # Load model
        self.model = AutoModelForCausalLM.from_pretrained(model_path,**kwargs)
        
        # Load tokenizer
        self.tokenizer = AutoTokenizer.from_pretrained(model_path)

        print("Loading time:",round(time.time() - start,2),'s')
        print("Memory footprint:",humanize.naturalsize(self.model.get_memory_footprint()))

    def generate(self,messages=None,**kwargs):
        """
        Generate a response to a message
        """
        
        if messages is None:
            # Default message
            messages = [
                {"role": "user", "content": "What is your favourite condiment?"},
                {"role": "assistant", "content": "Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavour to whatever I'm cooking up in the kitchen!"},
                {"role": "user", "content": "Do you have mayonnaise recipes?"}
            ]
            
        # Some default settings:
        settings_list = [("max_new_tokens",1000),("do_sample",True)]
        for s in settings_list:
            if not s[0] in kwargs:
                kwargs[s[0]] = s[1]
            
        #start =time.time()
        
        # Encode message, move to device, run through model and decode
        encodeds = self.tokenizer.apply_chat_template(messages, return_tensors="pt")
        model_inputs = encodeds.to(self.device)
        generated_ids = self.model.generate(model_inputs,**kwargs)
        decoded = self.tokenizer.batch_decode(generated_ids)
        
        #print("Inference time:",round(time.time() - start,2),'s')
        
        return decoded[0]
    
    def clear(self):
        """
        Clear memory occupied by the model
        """
        self.model = None
        gc.collect()
        torch.cuda.empty_cache()

## A. Default Settings

The default settings load models in 32-bit precision. In this mode,
a 1-billion parameter model requires approximately 4GB of memory.

In [2]:
model = LoadTest()
%timeit -r 3 -n 1 model.generate()
model.clear()

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Loading time: 39.37 s
Memory footprint: 30.0 GB
6.45 s ± 903 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)
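
The rule of thumb above (4 bytes per parameter at 32-bit precision) extends to the lower precisions covered in later sections. The following is a minimal sketch of that arithmetic; the helper name `estimate_weight_memory_gb` is hypothetical, and the estimate covers weights only, so measured footprints can differ.

```python
# Approximate bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {
    "float32": 4.0,
    "float16": 2.0,
    "bfloat16": 2.0,
    "int8": 1.0,
    "4bit": 0.5,
}

def estimate_weight_memory_gb(n_params: float, precision: str) -> float:
    """Rough weight-only memory estimate in GB (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[precision] / 1e9

# Estimates for a 1-billion-parameter model
for precision in BYTES_PER_PARAM:
    print(f"{precision:>9}: {estimate_weight_memory_gb(1e9, precision):.1f} GB")
```

Multiplying by the actual parameter count of a model gives a first guess at whether it fits on a given GPU before any loading is attempted.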


## B. GPU/CPU Placement with `Accelerate`
`Accelerate` can automatically determine the best way to place the model on the device(s)
we have access to. This is particularly useful if we have more than one GPU.
With only one GPU, the performance should be the same as with the default settings.

In [3]:
model = LoadTest(device_map="auto")
%timeit -r 3 -n 1 model.generate()
model.clear()

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Loading time: 37.91 s
Memory footprint: 30.0 GB
6.78 s ± 646 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)


## C. Auto-Detect Parameter Precision

If we specify `torch_dtype='auto'`, `AutoModelForCausalLM` will attempt
to detect the correct data type to use based on how the pretrained model was saved.
The detection can be incorrect, however, which leads to unnecessary conversion of
parameters and a very long loading time (163.68 s in the run below, versus 39.37 s with the default settings).

In [4]:
model = LoadTest(torch_dtype="auto",
                 device_map="auto",
                 )
%timeit -r 3 -n 1 model.generate()
model.clear()

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Loading time: 163.68 s
Memory footprint: 15.0 GB
6.88 s ± 532 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)
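
To see what `torch_dtype='auto'` is likely to pick before committing to a slow load, one option is to read the `torch_dtype` entry of the model's `config.json`, which the `transformers` documentation describes as the first place the loader looks. The sketch below parses a config supplied as text; the helper name `declared_dtype` and the sample config are hypothetical and not copied from any model repository.

```python
import json

def declared_dtype(config_text: str):
    """Return the `torch_dtype` entry of a config.json document, or None if absent."""
    config = json.loads(config_text)
    return config.get("torch_dtype")

# Illustrative config fragment (an assumption made for this example only)
sample_config = '{"model_type": "example", "torch_dtype": "bfloat16"}'
print(declared_dtype(sample_config))
```

If the declared type turns out to load slowly on your hardware, passing an explicit `torch_dtype` is the alternative.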


## D. Manually Specifying Parameter Precision

Instead of setting `torch_dtype` to `'auto'`, we can specify a data type manually.
Here we try the two 16-bit data types, `float16` and `bfloat16`.
You will see that, on this setup, Mistral-7B loads much faster with the former
(34.68 s versus 161.88 s).

When a model is loaded in 16-bit precision, a 1-billion-parameter model requires approximately 2 GB of memory.

In [5]:
model = LoadTest(torch_dtype=torch.float16,
                 device_map="auto",
                 )
%timeit -r 3 -n 1 model.generate()
model.clear()

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Loading time: 34.68 s
Memory footprint: 15.0 GB
7.29 s ± 1.16 s per loop (mean ± std. dev. of 3 runs, 1 loop each)


In [6]:
model = LoadTest(torch_dtype=torch.bfloat16,
                 device_map="auto",
                 )
%timeit -r 3 -n 1 model.generate()
model.clear()

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Loading time: 161.88 s
Memory footprint: 15.0 GB
6.78 s ± 1.17 s per loop (mean ± std. dev. of 3 runs, 1 loop each)


In contrast, there is no such difference for the Falcon-7B model:

In [7]:
model = LoadTest("tiiuae/falcon-7b-instruct",
                 torch_dtype=torch.float16,
                 device_map="auto",
                 )
%timeit -r 3 -n 1 model.generate()
model.clear()

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Loading time: 9.68 s
Memory footprint: 13.9 GB
The slowest run took 4.45 times longer than the fastest. This could mean that an intermediate result is being cached.
13.8 s ± 7.43 s per loop (mean ± std. dev. of 3 runs, 1 loop each)


In [8]:
model = LoadTest("tiiuae/falcon-7b-instruct",
                 torch_dtype=torch.bfloat16,
                 device_map="auto",
                 )
%timeit -r 3 -n 1 model.generate()
model.clear()

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Loading time: 8.18 s
Memory footprint: 13.9 GB
9.99 s ± 4.27 s per loop (mean ± std. dev. of 3 runs, 1 loop each)


## E. Load in 8-bit Precision
By specifying `load_in_8bit=True`, we can load the model in 8-bit precision,
which reduces the memory requirement to 1/4 of the default.
In this mode, a 1-billion-parameter model requires approximately 1 GB of memory.
This option relies on the `bitsandbytes` library, and in this test inference is much slower
than at any other precision (44.5 s per run, versus about 7 s in 16-bit).

In [9]:
model = LoadTest(load_in_8bit=True,
                 device_map="auto",
                 )
%timeit -r 3 -n 1 model.generate()
model.clear()

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Loading time: 36.12 s
Memory footprint: 8.0 GB
44.5 s ± 7.9 s per loop (mean ± std. dev. of 3 runs, 1 loop each)


## F. Load in 4-bit Precision
We can lower the memory requirement further by specifying `load_in_4bit=True`.
At 4-bit precision, the memory requirement is 1/8 of the default.
In this mode, a 1-billion-parameter model requires approximately 500 MB of memory.
This option also relies on the `bitsandbytes` library. Inference is slower than in 16-bit
but much faster than in 8-bit.

In [7]:
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16
)

model = LoadTest(quantization_config=bnb_config,
                 device_map="auto",
                 )
%timeit -r 3 -n 1 model.generate()
model.clear()

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Loading time: 32.08 s
Memory footprint: 4.6 GB
8.75 s ± 1.05 s per loop (mean ± std. dev. of 3 runs, 1 loop each)


## G. Flash Attention 2
Flash Attention 2 provides a drop-in replacement for PyTorch's self-attention mechanism.
Some models already have efficient self-attention built in,
in which case we will not see any speedup.

In [11]:
model = LoadTest(torch_dtype=torch.float16,
                 device_map="auto",
                 attn_implementation="flash_attention_2",
                 )
%timeit -r 3 -n 1 model.generate()
model.clear()

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Loading time: 35.69 s
Memory footprint: 15.0 GB
8.43 s ± 398 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)


In [12]:
model = LoadTest("tiiuae/falcon-7b-instruct",
                 torch_dtype=torch.bfloat16,
                 device_map="auto",
                 attn_implementation="flash_attention_2",
                 )
%timeit -r 3 -n 1 model.generate()
model.clear()

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Loading time: 8.13 s
Memory footprint: 13.9 GB
7.2 s ± 1.54 s per loop (mean ± std. dev. of 3 runs, 1 loop each)
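
`attn_implementation="flash_attention_2"` depends on the separately installed `flash-attn` package and on half-precision weights (`float16` or `bfloat16`). The following is a minimal sketch of a guard that falls back to `"sdpa"`, PyTorch's built-in scaled dot-product attention, when either condition is not met. The helper name `pick_attn_implementation` is hypothetical, and the check does not verify that the GPU itself is supported.

```python
import importlib.util

def pick_attn_implementation(dtype_name: str) -> str:
    """Return "flash_attention_2" only when its basic prerequisites appear to be met."""
    # The flash-attn package is imported under the module name `flash_attn`.
    has_flash_attn = importlib.util.find_spec("flash_attn") is not None
    half_precision = dtype_name in ("float16", "bfloat16")
    if has_flash_attn and half_precision:
        return "flash_attention_2"
    return "sdpa"

print(pick_attn_implementation("float16"))
```

The returned string can be passed as the `attn_implementation` argument used in the cells above.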


## vLLM

vLLM is a library for fast LLM inference. It is currently the recommended way
to run inference if the model you intend to use is
[supported](https://docs.vllm.ai/en/latest/models/supported_models.html).

We first modify our custom class to utilize vLLM:

In [1]:
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.utils import logging
import torch
import time
import humanize
import gc
import os  # needed for os.environ below
from vllm import LLM, SamplingParams
from vllm.distributed.parallel_state import destroy_model_parallel

# vLLM needs this to find NCCL
os.environ["VLLM_NCCL_SO_PATH"] = "/usr/lib/x86_64-linux-gnu/libnccl.so.2"

class LoadTestvLLM():
    
    def __init__(self,
                 model_path="mistralai/Mistral-7B-Instruct-v0.1",
                 **kwargs):
        """
        Initializes the model
        """
                      
        start = time.time()
        
        if "dtype" not in kwargs:
            kwargs["dtype"] = 'float16'
        
        # Load model
        self.model = LLM(model=model_path,
                         **kwargs)
        
        # Load tokenizer
        self.tokenizer = AutoTokenizer.from_pretrained(model_path)

        print("Loading time:",round(time.time() - start,2),'s')
        
        # Approx. memory usage, buffer not included
        model_params = self.model.llm_engine.model_executor.driver_worker.model_runner.model.parameters()
        mem_params = sum([param.nelement()*param.element_size() for param in model_params])
        print("Approx. memory footprint:",humanize.naturalsize(mem_params))   

    def generate(self,messages=None,**kwargs):
        """
        Generate a response to a message
        """
        
        if messages is None:
            # Default message
            messages = [
                {"role": "user", "content": "What is your favourite condiment?"},
                {"role": "assistant", "content": "Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavour to whatever I'm cooking up in the kitchen!"},
                {"role": "user", "content": "Do you have mayonnaise recipes?"}
            ]
            
        # Some default settings:
        settings_list = [("max_tokens",1000)]
        for s in settings_list:
            if not s[0] in kwargs:
                kwargs[s[0]] = s[1]
                        
            
        sampling_params = SamplingParams(**kwargs)
                  
        # Render the chat as a prompt string, then let vLLM tokenize and generate
        prompt = self.tokenizer.apply_chat_template(messages, tokenize=False)
        outputs = self.model.generate(prompt, sampling_params)
        
        return outputs
    
    def clear(self):
        """
        Clear memory occupied by the model
        """
        destroy_model_parallel()
        del self.model.llm_engine.model_executor.driver_worker
        self.model = None
        gc.collect()
        torch.cuda.empty_cache()

Default settings:

In [2]:
model = LoadTestvLLM()
%timeit -r 3 -n 1 model.generate()
model.clear()

INFO 12-29 12:55:48 llm_engine.py:73] Initializing an LLM engine with config: model='mistralai/Mistral-7B-Instruct-v0.1', tokenizer='mistralai/Mistral-7B-Instruct-v0.1', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=None, seed=0)
INFO 12-29 12:56:23 llm_engine.py:222] # GPU blocks: 27184, # CPU blocks: 2048
Loading time: 36.54 s
Approx. memory footprint: 14.5 GB


Processed prompts: 100%|██████████| 1/1 [00:03<00:00,  3.34s/it]
Processed prompts: 100%|██████████| 1/1 [00:02<00:00,  2.33s/it]
Processed prompts: 100%|██████████| 1/1 [00:02<00:00,  2.80s/it]

2.84 s ± 431 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)





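vLLM is compatible with models quantized in the AWQ format.
AWQ versions of many models can be found [here](https://huggingface.co/TheBloke).
Note that the run below uses an AWQ build of Mistral-7B-Instruct-v0.2, so it is not an exact like-for-like comparison with the v0.1 runs above.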

In [2]:
model = LoadTestvLLM(model_path="TheBloke/Mistral-7B-Instruct-v0.2-AWQ")
%timeit -r 3 -n 1 model.generate()
model.clear()

INFO 12-29 13:50:24 llm_engine.py:73] Initializing an LLM engine with config: model='TheBloke/Mistral-7B-Instruct-v0.2-AWQ', tokenizer='TheBloke/Mistral-7B-Instruct-v0.2-AWQ', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=awq, seed=0)
INFO 12-29 13:51:22 llm_engine.py:222] # GPU blocks: 22843, # CPU blocks: 2048
Loading time: 60.11 s
Approx. memory footprint: 4.2 GB


Processed prompts: 100%|██████████| 1/1 [00:02<00:00,  2.42s/it]
Processed prompts: 100%|██████████| 1/1 [00:03<00:00,  3.07s/it]
Processed prompts: 100%|██████████| 1/1 [00:02<00:00,  2.34s/it]

2.63 s ± 322 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)
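
The Mistral-7B measurements reported in this notebook can be gathered into one table for comparison. The sketch below only restates the numbers printed in the cells above and adds a speedup column relative to the default `transformers` run. Because sampling is enabled and response lengths vary, the generation times are per-call figures rather than per-token throughput, so the ratios are indicative only.

```python
# (configuration, loading time [s], memory footprint [GB], mean generation time [s])
results = [
    ("transformers, default (32-bit)", 39.37, 30.0, 6.45),
    ("transformers, device_map='auto'", 37.91, 30.0, 6.78),
    ("transformers, torch_dtype='auto'", 163.68, 15.0, 6.88),
    ("transformers, float16", 34.68, 15.0, 7.29),
    ("transformers, bfloat16", 161.88, 15.0, 6.78),
    ("transformers, 8-bit", 36.12, 8.0, 44.5),
    ("transformers, 4-bit", 32.08, 4.6, 8.75),
    ("transformers, float16 + Flash Attention 2", 35.69, 15.0, 8.43),
    ("vLLM, float16", 36.54, 14.5, 2.84),
    ("vLLM, AWQ (v0.2 model)", 60.11, 4.2, 2.63),
]

baseline_gen_s = results[0][3]  # default transformers run

print(f"{'configuration':<42} {'load s':>8} {'mem GB':>8} {'gen s':>8} {'speedup':>8}")
for name, load_s, mem_gb, gen_s in results:
    speedup = baseline_gen_s / gen_s
    print(f"{name:<42} {load_s:>8.2f} {mem_gb:>8.1f} {gen_s:>8.2f} {speedup:>7.2f}x")
```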



