# Most popular open weights LLMs - December 2023

## Huggingface "from_pretrained" model loading

Analyzing code from main branch on 30/12/2023:

https://github.com/huggingface/transformers/blob/3cefac1d974db5e2825a0cb2b842883a628be7a0/src/transformers/modeling_utils.py#L2514

Goal: loading a configuration (from a json file) and state dictionary (from a single or sharded archive file).

3 use cases:
- from a model repo on huggingface.co
- from a local directory
- from function arguments config and state_dict

We will analyze only the first use case.

Additional libraries used:
- safetensors: https://huggingface.co/docs/safetensors/
- accelerate: https://huggingface.co/docs/accelerate/
- bitsandbytes: https://github.com/TimDettmers/bitsandbytes
- flash-attn: https://github.com/Dao-AILab/flash-attention
- optimum: https://huggingface.co/docs/optimum/
- autogptq: https://github.com/PanQiWei/AutoGPTQ
- autoawq: https://github.com/casper-hansen/AutoAWQ

Safetensors should be much faster than PyTorch at loading the weights on CPU or GPU (no copy).

We test this below on this machine:
- 200x faster to load 5 GB weights on CPU
- 3x faster to load 5 GB weights on GPU
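
One reason the format can be read with so little overhead is its simple layout: an 8-byte little-endian header size, a JSON header describing each tensor, then the raw tensor bytes. The sketch below uses only the standard library to write a tiny file with that layout and read its header back without touching the data. The helper names and the tensor name `weight` are made up for illustration; this is not how the safetensors library itself is implemented.

```python
import json
import os
import struct
import tempfile


def write_tiny_safetensors(path):
    """Write a minimal file following the safetensors layout (illustration only)."""
    payload = struct.pack("<4f", 1.0, 2.0, 3.0, 4.0)  # four float32 values, 16 bytes
    header = {
        "weight": {"dtype": "F32", "shape": [2, 2], "data_offsets": [0, len(payload)]}
    }
    header_bytes = json.dumps(header).encode("utf-8")
    with open(path, "wb") as f:
        f.write(struct.pack("<Q", len(header_bytes)))  # 8-byte little-endian header size
        f.write(header_bytes)                          # JSON description of the tensors
        f.write(payload)                               # raw tensor bytes


def read_safetensors_header(path):
    """Read only the JSON header: tensor names, dtypes, shapes and byte offsets."""
    with open(path, "rb") as f:
        (header_size,) = struct.unpack("<Q", f.read(8))
        return json.loads(f.read(header_size))


with tempfile.TemporaryDirectory() as tmp:
    tiny_path = os.path.join(tmp, "tiny.safetensors")
    write_tiny_safetensors(tiny_path)
    header = read_safetensors_header(tiny_path)

print(header["weight"]["shape"])
```

Because the offsets of every tensor are known from the header alone, a loader can map or copy each tensor directly from the file instead of unpickling Python objects.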

**Finding 1**: always use safetensors! See the details, code and experiments below.

**Finding 2**: always try to pass use_safetensors=True to the from_pretrained method => if there is no safetensors file in the main branch of the repo, the transformers library will try to load the result of an automatic conversion to safetensors in a PR created by a HF bot.

**Finding 3**: always explicitly pass torch_dtype="auto" to load the weights in the model's native data type (read from the config file or from the first weight in the state dict), or torch_dtype=torch.float16 if you want to reduce it to half precision, because otherwise the default dtype will be float32!

**Finding 4**: never pass trust_remote_code=True (unless you get an error when loading the model), because you will get a stale and out of date version of the model code from the original repo and not the HF official implementation with the latest features (like FlashAttention).

**Finding 5**: all supported quantization algorithms now support serialization => you should always try to load an already quantized model instead of quantizing it yourself.

**Finding 6**: use transformers>4.6.0 and make sure the param "_fast_init=True" is set to disable random weight initialization before loading, so that we don't spend an extraordinary amount of time doing useless work. See [pull request 11471](https://github.com/huggingface/transformers/pull/11471) for more information.

**Finding 7**: always try to pass the argument attn_implementation="flash_attention_2" to from_pretrained(), after installing the flash-attn package, to check whether flash attention 2 is supported by the model architecture, which should greatly speed up operations if available (see details below).
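
Findings 2, 3, 4 and 7 can be combined into one default set of arguments. The sketch below collects them in a helper; `recommended_load_kwargs` and `load_model` are hypothetical names introduced here for illustration, and the argument names are the ones quoted in the findings.

```python
def recommended_load_kwargs(use_flash_attention=False):
    """Collect the from_pretrained() arguments suggested by the findings."""
    kwargs = {
        "use_safetensors": True,  # Finding 2: prefer safetensors files
        "torch_dtype": "auto",    # Finding 3: keep the model's native dtype
    }
    # Finding 4: trust_remote_code is deliberately left out
    if use_flash_attention:
        # Finding 7: requires the flash-attn package and a supported architecture
        kwargs["attn_implementation"] = "flash_attention_2"
    return kwargs


def load_model(model_name, **overrides):
    """Hypothetical wrapper: load a causal LM with the recommended arguments."""
    # Imported lazily so that building the arguments does not need transformers
    from transformers import AutoModelForCausalLM

    kwargs = {**recommended_load_kwargs(), **overrides}
    return AutoModelForCausalLM.from_pretrained(model_name, **kwargs)


print(recommended_load_kwargs(use_flash_attention=True))
```

Calling `load_model(model_name, attn_implementation="flash_attention_2")` would then be one way to try flash attention on a given model and fall back to the defaults if it raises.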


Step 1: download model config file "config.json"

```python
# We make a call to the config file first (which may be absent)

CONFIG_NAME = "config.json"

resolved_config_file = cached_file(
    pretrained_model_name_or_path,
    CONFIG_NAME,
    cache_dir=cache_dir,
    force_download=force_download,
    resume_download=resume_download,
    proxies=proxies,
    local_files_only=local_files_only,
    token=token,
    revision=revision,
    subfolder=subfolder,
    _raise_exceptions_for_missing_entries=False,
    _raise_exceptions_for_connection_errors=False,
)
```

Step 2: check and set memory loading parameters

- device_map & low_cpu_mem_usage

```python
if device_map is not None:
    low_cpu_mem_usage = True
        
    if not is_accelerate_available():
        raise ImportError("Using `low_cpu_mem_usage=True` or a `device_map` requires Accelerate")
```

- quantization_config

```python
if load_in_8bit or load_in_4bit:
    if not torch.cuda.is_available():
        raise RuntimeError("No GPU found. A GPU is needed for quantization.")
    if not (is_accelerate_available() and is_bitsandbytes_available()):
        raise ImportError("Using `load_in_8bit=True` requires Accelerate and the latest version of bitsandbytes")
        
    if torch_dtype is None:
        # We force the `dtype` to be float16, this is a requirement from `bitsandbytes`
        torch_dtype = torch.float16
    if device_map is None:
        device_map = {"": torch.cuda.current_device()}
```

Step 3: instantiate config object

```python
config, model_kwargs = cls.config_class.from_pretrained(
    config_path,
    cache_dir=cache_dir,
    return_unused_kwargs=True,
    force_download=force_download,
    resume_download=resume_download,
    proxies=proxies,
    local_files_only=local_files_only,
    token=token,
    revision=revision,
    subfolder=subfolder,
    _from_auto=from_auto_class,
    _from_pipeline=from_pipeline,
    **kwargs,
)
```

Note: this config object can itself contain a quantization config object.

Step 4: check and set quantization config (from config or/and from params)

- GPTQ

```python
if quantization_method == QuantizationMethod.GPTQ:
    if not gptq_supports_cpu and not torch.cuda.is_available():
        raise RuntimeError("GPU is required to quantize or run quantize model.")
    elif not (is_optimum_available() and is_auto_gptq_available()):
        raise ImportError("Loading a GPTQ quantized model requires optimum and auto-gptq library.")
    elif version.parse(importlib.metadata.version("auto_gptq")) < version.parse("0.4.2"):
        raise ImportError("You need a version of auto_gptq >= 0.4.2 to use GPTQ: `pip install --upgrade auto-gptq`")

    if torch_dtype is None:
        torch_dtype = torch.float16
    else:
        logger.info("We suggest you to set `torch_dtype=torch.float16` for better efficiency with GPTQ.")
```

- AWQ

```python
if quantization_method_from_config == QuantizationMethod.AWQ:
    if not torch.cuda.is_available():
        raise RuntimeError("GPU is required to run AWQ quantized model.")
    if not is_auto_awq_available():
        raise ImportError("Loading an AWQ quantized model requires auto-awq library")

    if not is_accelerate_available():
        raise ImportError("Loading an AWQ quantized model requires accelerate")

    if torch_dtype is None:
        torch_dtype = torch.float16
    else:
        logger.info("We suggest you to set `torch_dtype=torch.float16` for better efficiency with AWQ.")
```

- BITS_AND_BYTES (see above)

Step 5: locate and load the weights

1. Locate archive files

```python
SAFE_WEIGHTS_NAME = "model.safetensors"
SAFE_WEIGHTS_INDEX_NAME = "model.safetensors.index.json"

WEIGHTS_NAME = "pytorch_model.bin"
WEIGHTS_INDEX_NAME = "pytorch_model.bin.index.json"

if use_safetensors is not False:
    filename = _add_variant(SAFE_WEIGHTS_NAME, variant)
else:
    filename = _add_variant(WEIGHTS_NAME, variant)

resolved_archive_file = cached_file(pretrained_model_name_or_path, filename, **cached_file_kwargs)

if resolved_archive_file is None and filename == _add_variant(SAFE_WEIGHTS_NAME, variant):
    resolved_archive_file = cached_file(pretrained_model_name_or_path,
                            _add_variant(SAFE_WEIGHTS_INDEX_NAME, variant),
                            **cached_file_kwargs)
    if resolved_archive_file is not None:
        is_sharded = True
    elif use_safetensors:
        if revision == "main":
            # Here auto_conversion() tries to load the safetensors file from a PR
            # pr.author = "SFConvertBot"
            # pr_title = "Adding `safetensors` variant of this model"
            resolved_archive_file, revision, is_sharded = auto_conversion(
                pretrained_model_name_or_path, **cached_file_kwargs
            )
    else:
        # This repo has no safetensors file of any kind, we switch to PyTorch.
        filename = _add_variant(WEIGHTS_NAME, variant)
        resolved_archive_file = cached_file(
            pretrained_model_name_or_path, filename, **cached_file_kwargs
        )

if resolved_archive_file is None and filename == _add_variant(WEIGHTS_NAME, variant):
    # Maybe the checkpoint is sharded, we try to grab the index name in this case.
    resolved_archive_file = cached_file(
        pretrained_model_name_or_path,
        _add_variant(WEIGHTS_INDEX_NAME, variant),
        **cached_file_kwargs,
    )
    if resolved_archive_file is not None:
        is_sharded = True
            
# We'll need to download and cache each checkpoint shard if the checkpoint is sharded.
if is_sharded:
    # resolved_archive_file becomes a list of files that point to the different checkpoint shards in this case.
    resolved_archive_file, sharded_metadata = get_checkpoint_shard_files(
        pretrained_model_name_or_path, ....)
```
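
The lookup order above can be summarized as a pure function over the set of file names present in a repo. This is a simplified sketch, not the transformers code: it ignores `variant`, does not download anything, and only signals the point where the automatic conversion PR would be tried. `resolve_weights_file` is a hypothetical name.

```python
from typing import Optional, Set, Tuple

SAFE_WEIGHTS_NAME = "model.safetensors"
SAFE_WEIGHTS_INDEX_NAME = "model.safetensors.index.json"
WEIGHTS_NAME = "pytorch_model.bin"
WEIGHTS_INDEX_NAME = "pytorch_model.bin.index.json"


def resolve_weights_file(
    repo_files: Set[str], use_safetensors: Optional[bool] = None
) -> Tuple[Optional[str], bool]:
    """Return (file to fetch first, is_sharded), following the order described above."""
    if use_safetensors is not False:
        if SAFE_WEIGHTS_NAME in repo_files:
            return SAFE_WEIGHTS_NAME, False
        if SAFE_WEIGHTS_INDEX_NAME in repo_files:
            return SAFE_WEIGHTS_INDEX_NAME, True
        if use_safetensors:
            # Explicit use_safetensors=True: this is where the bot's
            # conversion PR would be tried (not modeled in this sketch)
            return None, False
    # use_safetensors is False, or None with no safetensors file: PyTorch files
    if WEIGHTS_NAME in repo_files:
        return WEIGHTS_NAME, False
    if WEIGHTS_INDEX_NAME in repo_files:
        return WEIGHTS_INDEX_NAME, True
    return None, False


print(resolve_weights_file({"config.json", "pytorch_model.bin"}))
print(resolve_weights_file({"config.json", "model.safetensors.index.json"}))
```

With the default `use_safetensors=None`, a repo that only ships `pytorch_model.bin` silently falls back to the PyTorch file, which is why Finding 2 recommends passing `use_safetensors=True` explicitly.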

2. Load state dict

```python
if not is_sharded and state_dict is None:
    # Time to load the checkpoint
    state_dict = load_state_dict(resolved_archive_file)

if is_sharded:
    loaded_state_dict_keys = sharded_metadata["all_checkpoint_keys"]
else:
    loaded_state_dict_keys = list(state_dict.keys())
  
if low_cpu_mem_usage:
    state_dict = None
    
def load_state_dict(checkpoint_file: Union[str, os.PathLike]):
    if checkpoint_file.endswith(".safetensors") and is_safetensors_available():
        # Check format of the archive
        return safe_load_file(checkpoint_file)
    else:
        map_location = "cpu"
        return torch.load(checkpoint_file, map_location=map_location, weights_only=True)
```

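=> when low_cpu_mem_usage is set, the state dict of a non-sharded checkpoint is loaded only to list its keys and is then discarded (state_dict = None): do we load the state dict once for nothing?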
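
For sharded checkpoints the situation is different: the list of keys comes from the index file, so no tensor has to be loaded to obtain it. The sketch below assumes the usual index layout, a JSON object with a `weight_map` from parameter name to shard file; the parameter names and `read_shard_index` are made up for illustration.

```python
import json


def read_shard_index(index_text):
    """Return (shard files, parameter names) from the text of an index file."""
    index = json.loads(index_text)
    weight_map = index["weight_map"]
    shard_files = sorted(set(weight_map.values()))  # each shard listed once
    all_keys = list(weight_map.keys())              # every parameter name
    return shard_files, all_keys


# Hypothetical two-shard index, written inline instead of read from disk
example_index = json.dumps(
    {
        "metadata": {"total_size": 16},
        "weight_map": {
            "embed.weight": "model-00001-of-00002.safetensors",
            "lm_head.weight": "model-00002-of-00002.safetensors",
        },
    }
)

shard_files, all_keys = read_shard_index(example_index)
print(shard_files)
print(all_keys)
```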

3. Compute torch_dtype

```python
if torch_dtype == "auto":
    if hasattr(config, "torch_dtype") and config.torch_dtype is not None:
        torch_dtype = config.torch_dtype
    else:
        torch_dtype = get_state_dict_dtype(state_dict)
```       
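
The resolution of the data type can be sketched without torch by using dtype names as plain strings. This is a stand-in for illustration, under two assumptions: that the state dict lookup returns the first floating-point dtype it finds, and that it falls back to the first entry when there is none. `resolve_dtype` is a hypothetical name, and the quantization cases from steps 2 and 4 are left out.

```python
from typing import List, Optional

FLOATING_DTYPES = {"float16", "bfloat16", "float32", "float64"}


def resolve_dtype(
    requested: Optional[str],
    config_dtype: Optional[str],
    state_dict_dtypes: List[str],
) -> str:
    """Mimic the torch_dtype resolution with dtype names instead of torch dtypes."""
    if requested is None:
        return "float32"  # default when nothing is specified (see Finding 3)
    if requested != "auto":
        return requested  # an explicit dtype always wins
    if config_dtype is not None:
        return config_dtype  # "auto": the config file comes first
    for dtype in state_dict_dtypes:
        if dtype in FLOATING_DTYPES:
            return dtype  # "auto": otherwise look at the weights
    # Assumption: no floating-point tensor at all, use the first dtype found
    return state_dict_dtypes[0]


print(resolve_dtype("auto", "bfloat16", ["float16"]))
print(resolve_dtype("auto", None, ["int64", "float16"]))
print(resolve_dtype(None, "bfloat16", ["float16"]))
```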

Step 6: Instantiate model.

```python
init_contexts = [no_init_weights(_enable=_fast_init)]

if load_in_8bit or load_in_4bit or low_cpu_mem_usage:
    init_contexts.append(init_empty_weights())

config = cls._autoset_attn_implementation(config, use_flash_attention_2=use_flash_attention_2, torch_dtype=torch_dtype, device_map=device_map)

with ContextManagers(init_contexts):
    # Let's make sure we don't run the init function of buffer modules
    model = cls(config, *model_args, **model_kwargs)
    


``` 

## Flash attention 2 support

Documentation and installation:

https://huggingface.co/docs/transformers/perf_infer_gpu_one#flashattention-2

- the flash_attn package version must be greater than or equal to 2.1.0
- pass the argument attn_implementation="flash_attention_2" to from_pretrained()

Configuration:

https://github.com/huggingface/transformers/blob/3cefac1d974db5e2825a0cb2b842883a628be7a0/src/transformers/modeling_utils.py#L1393

- torch_dtype must be torch.float16 or torch.bfloat16: Flash Attention 2.0 supports only these dtypes
- the full model must be loaded on a GPU (no partial offloading to CPU or disk)
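
The requirements listed so far can be checked before calling from_pretrained(). The sketch below is a hypothetical pre-flight helper, not the transformers implementation: it takes the dtype as a string, expects `device_map` as a dict, and parses only plain `major.minor.patch` version strings.

```python
from typing import Dict, List, Optional, Union


def flash_attention_2_blockers(
    flash_attn_version: str,
    dtype_name: str,
    device_map: Optional[Dict[str, Union[int, str]]] = None,
) -> List[str]:
    """Return the reasons why Flash Attention 2 could not be enabled (empty if none)."""
    blockers = []
    # Simplification: only the first three numeric components are compared
    version = tuple(int(part) for part in flash_attn_version.split(".")[:3])
    if version < (2, 1, 0):
        blockers.append("flash_attn must be at least version 2.1.0")
    if dtype_name not in ("float16", "bfloat16"):
        blockers.append("torch_dtype must be float16 or bfloat16")
    if device_map is not None and any(
        target in ("cpu", "disk") for target in device_map.values()
    ):
        blockers.append("the full model must be on GPU (no CPU or disk offloading)")
    return blockers


print(flash_attention_2_blockers("2.4.2", "float16", {"": 0}))
print(flash_attention_2_blockers("2.0.9", "float32", {"model.layers.0": 0, "lm_head": "cpu"}))
```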

Implementation: 

https://github.com/huggingface/transformers/blob/3cefac1d974db5e2825a0cb2b842883a628be7a0/src/transformers/models/mistral/modeling_mistral.py#L715

- supported on model classes with the property: cls._supports_flash_attn_2=True

Example for Mistral:

```python
MISTRAL_ATTENTION_CLASSES = {
    "eager": MistralAttention,
    "flash_attention_2": MistralFlashAttention2,
    "sdpa": MistralSdpaAttention,
}

class MistralDecoderLayer(nn.Module):
    def __init__(self, config: MistralConfig, layer_idx: int):
        self.self_attn = MISTRAL_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx)
        
# https://github.com/huggingface/transformers/blob/3cefac1d974db5e2825a0cb2b842883a628be7a0/src/transformers/models/mistral/modeling_mistral.py#L321
```

Usage:

- FlashAttention-2 does not support computing attention scores with padding tokens: this leads to a significant slowdown for batched generations with padding tokens
- you must manually pad/unpad the attention scores for batched inference when the sequence contains padding tokens
- to overcome this, you should use FlashAttention-2 without padding tokens in the sequence during training (by packing a dataset or concatenating sequences until reaching the maximum sequence length)
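
Packing, as mentioned in the last point, can be sketched in a few lines: tokenized sequences are concatenated with a separator token and cut into blocks of the maximum length, so no padding token is needed. This is a minimal illustration with made-up token ids; `pack_sequences` is a hypothetical helper, and it drops the incomplete last block, which is one possible choice among others.

```python
from typing import Iterable, List


def pack_sequences(
    sequences: Iterable[List[int]], max_length: int, eos_token_id: int
) -> List[List[int]]:
    """Concatenate token sequences and split them into full blocks of max_length."""
    stream: List[int] = []
    for tokens in sequences:
        stream.extend(tokens)
        stream.append(eos_token_id)  # mark the boundary between two sequences
    # Keep only complete blocks; the remainder is dropped in this sketch
    usable = len(stream) - len(stream) % max_length
    return [stream[i : i + max_length] for i in range(0, usable, max_length)]


packed = pack_sequences([[1, 2, 3], [4, 5], [6]], max_length=4, eos_token_id=0)
print(packed)  # [[1, 2, 3, 0], [4, 5, 0, 6]]
```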

List of supported models as of 30/12/2023:
- mbart
- bark
- distilbert
- falcon
- gpt_bigcode
- gpt_neo / gpt_neox
- llama
- llava / vipllava
- mistral / mixtral
- opt
- phi
- whisper

## Load models from local cache

In [None]:
pip install --upgrade transformers

In [None]:
pip install --upgrade safetensors

In [None]:
pip install --upgrade accelerate

In [None]:
pip install --upgrade bitsandbytes

In [None]:
pip install --upgrade flash-attn --no-build-isolation

In [30]:
import importlib.metadata
importlib.metadata.version("transformers")


'4.36.2'

In [31]:
importlib.metadata.version("safetensors")

'0.4.1'

In [32]:
importlib.metadata.version("accelerate")

'0.25.0'

In [33]:
importlib.metadata.version("bitsandbytes")

'0.41.3.post2'

In [36]:
importlib.metadata.version("flash-attn")

'2.4.2'

In [1]:
models = { 
    "redpajama_3b" : "togethercomputer/RedPajama-INCITE-Base-3B-v1", # 5.30 GB
    "btlm_3b" : "cerebras/btlm-3b-8k-base", #  4.93 GB
    "openllama2_3b" : "openlm-research/open_llama_3b_v2", #  6.38 GB
    "stablelm_3b" : "stabilityai/stablelm-3b-4e1t", # 5.21 GB
    "phi2_3b" : "microsoft/phi-2", # 5.18 GB

    "bloomz_7b" : "bigscience/bloomz-7b1-mt", # 13.18 GB
    "falcon_7b" : "tiiuae/falcon-7b", # 13.45 GB       
    "redpajama_7b" : "togethercomputer/RedPajama-INCITE-7B-Base", # 12.90 GB
    "mpt_7b" : "mosaicml/mpt-7b", # 12.39 GB
    "mpt_7b_8k" : "mosaicml/mpt-7b-8k", # 12.39 GB
    "openllama2_7b" : "openlm-research/open_llama_7b_v2", # 12.55 GB
    "llama2_7b" : "meta-llama/Llama-2-7b-hf", # 12.55 GB
    "llama2_7b_32k" : "togethercomputer/LLaMA-2-7B-32K", # 12.55 GB
    "mistral_7b" : "mistralai/Mistral-7B-v0.1", # 13.49 GB
    "qwen_7b" : "Qwen/Qwen-7B", # 14.38 GB
    "yi_6b" : "01-ai/Yi-6B", # 11.29 GB
    "decilm_7b" : "Deci/DeciLM-7B", # 13.12 GB
    
    "openllama1_13b" : "openlm-research/open_llama_13b", # 24.24 GB
    "llama2_13b" : "meta-llama/Llama-2-13b-hf", # 24.25 GB
    "qwen_14b" : "Qwen/Qwen-14B", # 26.39 GB
    "solar_10b" : "upstage/SOLAR-10.7B-v1.0", # 19.99 GB
    
    "llama1_33b" : "TheBloke/WizardLM-33B-V1.0-Uncensored-GPTQ", # 15.78 GB https://huggingface.co/alexl83/LLaMA-33B-HF
    "falcon_40b" : "TheBloke/falcon-40b-instruct-GPTQ", # 21.00 GB https://huggingface.co/tiiuae/falcon-40b
    "mpt_30b" : "abhinavkulkarni/mosaicml-mpt-30b-instruct-w4-g128-awq", # 15.00 GB https://huggingface.co/mosaicml/mpt-30b
    "codellama_34b" : "TheBloke/CodeLlama-34B-Instruct-GPTQ", # 17.07 GB https://huggingface.co/codellama/CodeLlama-34b-hf
    "yi_34b" : "TheBloke/Yi-34B-GPTQ", # 17.33 GB https://huggingface.co/01-ai/Yi-34B    
    "mixtral_8x7B" : "TheBloke/Mixtral-8x7B-v0.1-GPTQ" # 22.18 GB https://huggingface.co/mistralai/Mixtral-8x7B-v0.1
}

In [2]:
from time import perf_counter_ns

time_unit_µs = 1000
time_unit_ms = 1000*1000
time_unit_s = 1000*1000*1000

In [3]:
import torch

memory_unit_mb = 1024*1024
memory_unit_gb = 1024*1024*1024

total_memory = torch.cuda.get_device_properties(0).total_memory

def display_gpu_memory():
    print(torch.cuda.get_device_name(0))
    print(f"Total    : {(total_memory/memory_unit_mb):8,.1f} MB")
    print("------------------------------")
    free_memory = torch.cuda.mem_get_info()[0]
    reserved_memory = torch.cuda.memory_reserved(0)
    used_memory = torch.cuda.memory_allocated(0)    
    max_used_memory = torch.cuda.max_memory_allocated(0)
    overhead_memory = total_memory - free_memory - reserved_memory
    print(f"Overhead : {(overhead_memory/memory_unit_mb):8,.1f} MB - {int(overhead_memory/total_memory*100):3} %")
    print(f"Reserved : {(reserved_memory/memory_unit_mb):8,.1f} MB - {int(reserved_memory/total_memory*100):3} %")
    print(f"Free     : {(free_memory/memory_unit_mb):8,.1f} MB - {int(free_memory/total_memory*100):3} %")
    print("------------------------------")
    print(f"Used (total): {(used_memory/memory_unit_mb):8,.1f} MB - {int(used_memory/total_memory*100):3} %")
    print(f"Used (model): {((used_memory-overhead_memory)/memory_unit_mb):8,.1f} MB - {int((used_memory-overhead_memory)/total_memory*100):3} %")
    print(f"Max used    : {(max_used_memory/memory_unit_mb):8,.1f} MB - {int(max_used_memory/total_memory*100):3} %")

In [4]:
import os
from transformers.utils.hub import cached_file

def get_model_path_and_size_on_disk(model_name):    
    model_config_file = cached_file(model_name, "config.json", local_files_only=True)
    model_directory = os.path.dirname(model_config_file)
    
    total_size = 0
    for entry in os.listdir(model_directory):
        full_entry_path = os.path.join(model_directory, entry)
        if os.path.isfile(full_entry_path):
            file_size = os.path.getsize(full_entry_path)
            if file_size > 1024*1024:
                print(f"- {entry} = {file_size/memory_unit_mb:.1f} MB")
            total_size += file_size
    return model_directory,total_size



### Benchmark 1 - PyTorch vs safetensors weights loading on CPU

In [5]:
idx = 0
model_name = list(models.values())[idx]
idx,model_name

(0, 'togethercomputer/RedPajama-INCITE-Base-3B-v1')

In [6]:
print("Files size on disk:")
path,size = get_model_path_and_size_on_disk(model_name)
print(f"Total size = {size/memory_unit_mb:.1f} MB")

Files size on disk:
- pytorch_model.bin = 5422.7 MB
- tokenizer.json = 2.0 MB
Total size = 5424.7 MB


In [7]:
pt_filename = os.path.join(path, "pytorch_model.bin")

In [8]:
before_time_ns = perf_counter_ns()
weights = torch.load(pt_filename, map_location="cpu")
duration_pt_cold_ns = perf_counter_ns() - before_time_ns
print(f"- disk load time with Pytorch (cold) = {(duration_pt_cold_ns/time_unit_s):.3f} s")

- disk load time with Pytorch (cold) = 3.150 s


In [9]:
before_time_ns = perf_counter_ns()
weights = torch.load(pt_filename, map_location="cpu")
duration_pt_hot_ns = perf_counter_ns() - before_time_ns
print(f"- disk load time with Pytorch (hot) = {(duration_pt_hot_ns/time_unit_s):.3f} s")

- disk load time with Pytorch (hot) = 4.658 s


In [10]:
idx = 3
model_name = list(models.values())[idx]
idx,model_name

(3, 'stabilityai/stablelm-3b-4e1t')

In [11]:
print("Files size on disk:")
path,size = get_model_path_and_size_on_disk(model_name)
print(f"Total size = {size/memory_unit_mb:.1f} MB")

Files size on disk:
- model.safetensors = 5331.9 MB
- tokenizer.json = 2.0 MB
Total size = 5334.0 MB


In [12]:
sf_filename = os.path.join(path, "model.safetensors")

In [13]:
from safetensors.torch import load_file

before_time_ns = perf_counter_ns()
weights = load_file(sf_filename, device="cpu")
duration_sf_cold_ns = perf_counter_ns() - before_time_ns
print(f"- disk load time with safetensors (cold) = {(duration_sf_cold_ns/time_unit_s):.3f} s")

- disk load time with safetensors (cold) = 0.029 s


In [14]:
before_time_ns = perf_counter_ns()
weights = load_file(sf_filename, device="cpu")
duration_sf_hot_ns = perf_counter_ns() - before_time_ns
print(f"- disk load time with safetensors (hot) = {(duration_sf_hot_ns/time_unit_s):.3f} s")

- disk load time with safetensors (hot) = 0.009 s


In [15]:
print(f"on CPU, safetensors is faster than pytorch by: {(duration_pt_cold_ns+duration_pt_hot_ns)/(duration_sf_cold_ns+duration_sf_hot_ns):.1f} X")

on CPU, safetensors is faster than pytorch by: 204.7 X


### Benchmark 2 - PyTorch vs safetensors weights loading on GPU

In [17]:
# **** IMPORTANT INIT ***

# This is required because this feature hasn't been fully verified yet, but 
# it's been tested on many different environments
os.environ["SAFETENSORS_FAST_GPU"] = "1"

# CUDA startup out of the measurement
torch.zeros((2, 2)).cuda()

tensor([[0., 0.],
        [0., 0.]], device='cuda:0')

In [21]:
before_time_ns = perf_counter_ns()
weights = torch.load(pt_filename, map_location="cuda:0")
duration_pt_cold_ns = perf_counter_ns() - before_time_ns
print(f"- disk load time with Pytorch (cold) = {(duration_pt_cold_ns/time_unit_s):.3f} s")

- disk load time with Pytorch (cold) = 3.566 s


In [22]:
before_time_ns = perf_counter_ns()
weights = torch.load(pt_filename, map_location="cuda:0")
duration_pt_hot_ns = perf_counter_ns() - before_time_ns
print(f"- disk load time with Pytorch (hot) = {(duration_pt_hot_ns/time_unit_s):.3f} s")

- disk load time with Pytorch (hot) = 2.777 s


In [23]:
before_time_ns = perf_counter_ns()
weights = load_file(sf_filename, device="cuda:0")
duration_sf_cold_ns = perf_counter_ns() - before_time_ns
print(f"- disk load time with safetensors (cold) = {(duration_sf_cold_ns/time_unit_s):.3f} s")

- disk load time with safetensors (cold) = 1.163 s


In [24]:
before_time_ns = perf_counter_ns()
weights = load_file(sf_filename, device="cuda:0")
duration_sf_hot_ns = perf_counter_ns() - before_time_ns
print(f"- disk load time with safetensors (hot) = {(duration_sf_hot_ns/time_unit_s):.3f} s")

- disk load time with safetensors (hot) = 0.896 s


In [25]:
print(f"on GPU, safetensors is faster than pytorch by: {(duration_pt_cold_ns+duration_pt_hot_ns)/(duration_sf_cold_ns+duration_sf_hot_ns):.1f} X")

on GPU, safetensors is faster than pytorch by: 3.1 X
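
Both benchmarks follow the same pattern: time a cold run and a hot run, then divide the summed PyTorch durations by the summed safetensors durations. The sketch below factors this out; `time_cold_and_hot` and `speedup` are hypothetical helpers, and the inputs are the rounded durations printed above, so the CPU ratio comes out near 205 rather than exactly the 204.7 computed from the unrounded timings.

```python
from time import perf_counter_ns


def time_cold_and_hot(load_fn):
    """Call load_fn twice and return both durations in seconds (cold, then hot)."""
    durations = []
    for _ in range(2):
        start = perf_counter_ns()
        load_fn()
        durations.append((perf_counter_ns() - start) / 1e9)
    return durations


def speedup(baseline_s, candidate_s):
    """Ratio of total baseline time to total candidate time."""
    return sum(baseline_s) / sum(candidate_s)


# Rounded durations copied from the cell outputs above
cpu_speedup = speedup([3.150, 4.658], [0.029, 0.009])
gpu_speedup = speedup([3.566, 2.777], [1.163, 0.896])
print(f"CPU: {cpu_speedup:.0f} X")
print(f"GPU: {gpu_speedup:.1f} X")
```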


### Test: convert local weights from PyTorch to safetensors format

Let's find out how we can convert models downloaded in the Pytorch format: 
https://github.com/huggingface/safetensors/blob/main/bindings/python/convert.py.

In [None]:
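import json
from typing import List, Optional, Tuple

from huggingface_hub import CommitOperationAdd, hf_hub_download
from safetensors.torch import load_file, save_file

# ConversionResult, rename, check_file_size and _remove_duplicate_names
# are helpers defined in the convert.py script linked above.

def convert(local_model_name):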
    model_directory = os.path.dirname(cached_file(local_model_name, "config.json", local_files_only=True))
    filenames = list(os.listdir(model_directory))
    
    if any(filename.endswith(".safetensors") for filename in filenames):
        print(f"Model {local_model_name} is already converted: skipping")
        return
    elif "pytorch_model.bin" in filenames:
        operations, errors = convert_single()
    elif "pytorch_model.bin.index.json" in filenames:
        operations, errors = convert_multi()
    else:
        print(f"Model {local_model_name} doesn't seem to be a valid pytorch model: cannot convert")
        return

    print(operations)
    print(errors)
    
def convert_multi(
    model_id: str, *, revision: Optional[str], folder: str, token: Optional[str], discard_names: List[str]
) -> ConversionResult:
    filename = hf_hub_download(
        repo_id=model_id, revision=revision, filename="pytorch_model.bin.index.json", token=token, cache_dir=folder
    )
    with open(filename, "r") as f:
        data = json.load(f)

    filenames = set(data["weight_map"].values())
    local_filenames = []
    for filename in filenames:
        pt_filename = hf_hub_download(repo_id=model_id, filename=filename, token=token, cache_dir=folder)

        sf_filename = rename(pt_filename)
        sf_filename = os.path.join(folder, sf_filename)
        convert_file(pt_filename, sf_filename, discard_names=discard_names)
        local_filenames.append(sf_filename)

    index = os.path.join(folder, "model.safetensors.index.json")
    with open(index, "w") as f:
        newdata = {k: v for k, v in data.items()}
        newmap = {k: rename(v) for k, v in data["weight_map"].items()}
        newdata["weight_map"] = newmap
        json.dump(newdata, f, indent=4)
    local_filenames.append(index)

    operations = [
        CommitOperationAdd(path_in_repo=os.path.basename(local), path_or_fileobj=local) for local in local_filenames
    ]
    errors: List[Tuple[str, "Exception"]] = []

    return operations, errors


def convert_single(
    model_id: str, *, revision: Optional[str], folder: str, token: Optional[str], discard_names: List[str]
) -> ConversionResult:
    pt_filename = hf_hub_download(
        repo_id=model_id, revision=revision, filename="pytorch_model.bin", token=token, cache_dir=folder
    )

    sf_name = "model.safetensors"
    sf_filename = os.path.join(folder, sf_name)
    convert_file(pt_filename, sf_filename, discard_names)
    operations = [CommitOperationAdd(path_in_repo=sf_name, path_or_fileobj=sf_filename)]
    errors: List[Tuple[str, "Exception"]] = []
    return operations, errors


def convert_file(
    pt_filename: str,
    sf_filename: str,
    discard_names: List[str],
):
    loaded = torch.load(pt_filename, map_location="cpu")
    if "state_dict" in loaded:
        loaded = loaded["state_dict"]
    to_removes = _remove_duplicate_names(loaded, discard_names=discard_names)

    metadata = {"format": "pt"}
    for kept_name, to_remove_group in to_removes.items():
        for to_remove in to_remove_group:
            if to_remove not in metadata:
                metadata[to_remove] = kept_name
            del loaded[to_remove]
    # Force tensors to be contiguous
    loaded = {k: v.contiguous() for k, v in loaded.items()}

    dirname = os.path.dirname(sf_filename)
    os.makedirs(dirname, exist_ok=True)
    save_file(loaded, sf_filename, metadata=metadata)
    check_file_size(sf_filename, pt_filename)
    reloaded = load_file(sf_filename)
    for k in loaded:
        pt_tensor = loaded[k]
        sf_tensor = reloaded[k]
        if not torch.equal(pt_tensor, sf_tensor):
            raise RuntimeError(f"The output tensors do not match for key {k}")

In [56]:
temptensor = torch.rand(int(6536/2),1024,1024,dtype=torch.float16)

In [57]:
display_gpu_memory()
print()

before_time_ns = perf_counter_ns()
temptensor.to('cuda')
duration_ns = perf_counter_ns() - before_time_ns
print(f"- gpu transfer time (cold) = {(duration_ns/time_unit_s):.3f} s")

print()
display_gpu_memory()

NVIDIA GeForce RTX 4090
Total    : 24,563.5 MB
------------------------------
Overhead :  1,555.5 MB -   6 %
Reserved : 11,026.0 MB -  44 %
Free     : 11,982.0 MB -  48 %
------------------------------
Used (total):  5,458.6 MB -  22 %
Used (model):  3,903.1 MB -  15 %
Max used    : 10,917.1 MB -  44 %

- gpu transfer time (cold) = 0.984 s

NVIDIA GeForce RTX 4090
Total    : 24,563.5 MB
------------------------------
Overhead :  1,696.5 MB -   6 %
Reserved : 17,562.0 MB -  71 %
Free     :  5,305.0 MB -  21 %
------------------------------
Used (total):  5,458.6 MB -  22 %
Used (model):  3,762.1 MB -  15 %
Max used    : 11,994.6 MB -  48 %


In [61]:
before_time_ns = perf_counter_ns()
temptensor.to('cuda')
duration_ns = perf_counter_ns() - before_time_ns
print(f"- gpu transfer time (hot) = {(duration_ns/time_unit_s):.3f} s")

print()
display_gpu_memory()

- gpu transfer time (hot) = 0.663 s

NVIDIA GeForce RTX 4090
Total    : 24,563.5 MB
------------------------------
Overhead :  1,696.5 MB -   6 %
Reserved : 17,562.0 MB -  71 %
Free     :  5,305.0 MB -  21 %
------------------------------
Used (total):  5,458.6 MB -  22 %
Used (model):  3,762.1 MB -  15 %
Max used    : 11,994.6 MB -  48 %


In [13]:
from transformers import AutoModelForCausalLM, AutoTokenizer


print(f"Loading {model_name} from disk to GPU:")

before_time_ns = perf_counter_ns()
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", torch_dtype="auto", attn_implementation="flash_attention_2") 
duration_ns = perf_counter_ns() - before_time_ns
print(f"- load time = {(duration_ns/time_unit_s):.3f} s")
free_memory = torch.cuda.mem_get_info()[0]
reserved_memory = torch.cuda.memory_reserved(0)
overhead_memory = total_memory - free_memory - reserved_memory
used_memory = torch.cuda.memory_allocated(0)    
model_used_memory = used_memory - overhead_memory
print(f"- GPU memory = {((used_memory-overhead_memory)/memory_unit_mb):.1f} MB")
print("")

print("GPU memory after")
display_gpu_memory()

Loading togethercomputer/RedPajama-INCITE-Base-3B-v1 from disk to GPU:
- load time = 4.758 s
- GPU memory = 3903.1 MB

GPU memory after
NVIDIA GeForce RTX 4090
Total    : 24,563.5 MB
------------------------------
Overhead :  1,555.5 MB -   6 %
Reserved : 11,026.0 MB -  44 %
Free     : 11,982.0 MB -  48 %
------------------------------
Used (total):  5,458.6 MB -  22 %
Used (model):  3,903.1 MB -  15 %
Max used    : 10,917.1 MB -  44 %


In [None]:
display_gpu_memory()