This notebook demonstrates how to save and load models in the safetensors format, using Llama 2 7B as an example.
More details about safetensors are in this article: [Safe, Fast, and Memory Efficient Loading of LLMs with Safetensors](https://kaitchup.substack.com/p/safe-fast-and-memory-efficient-loading)

*You will need a GPU with at least 29 GB of VRAM to run this notebook (e.g., Google Colab Pro A100).*
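
The VRAM requirement is consistent with a back-of-the-envelope estimate, assuming the weights are loaded as 32-bit floats (4 bytes per parameter). The sketch below uses an approximate parameter count of 6.74 billion for Llama 2 7B, which is an assumption and not a number taken from this notebook; `weights_size_mib` is a hypothetical helper name.

```python
# Rough size of the model weights in a given precision.
# ASSUMPTION: ~6.74 billion parameters for Llama 2 7B (approximate figure).
N_PARAMS = 6.74e9

BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2}

def weights_size_mib(n_params: float, dtype: str = "float32") -> float:
    """Return the size of the weights alone, in MiB (1 MiB = 1024**2 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1024**2

for dtype in ("float32", "float16"):
    print(f"{dtype}: ~{weights_size_mib(N_PARAMS, dtype):,.0f} MiB")
```

The float32 estimate lands somewhat below the 26709 MB measured later in this notebook; the remainder is plausibly CUDA context and allocator overhead.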




In [None]:
!pip install transformers accelerate
!pip install nvidia-ml-py3

Collecting transformers
  Downloading transformers-4.33.2-py3-none-any.whl (7.6 MB)
Collecting accelerate
  Downloading accelerate-0.23.0-py3-none-any.whl (258 kB)
Collecting huggingface-hub<1.0,>=0.15.1 (from transformers)
  Downloading huggingface_hub-0.17.1-py3-none-any.whl (294 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Collecting safetensors>=0.3.1 (from transformers)
  Dow

Import the required libraries. I use pynvml (provided by the nvidia-ml-py3 package installed above) to measure VRAM consumption.

In [None]:
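from transformers import AutoModelForCausalLM
from pynvml import nvmlInit, nvmlDeviceGetHandleByIndex, nvmlDeviceGetMemoryInfo
import time

def print_gpu_utilization():
    # Report the memory currently in use on GPU 0 (bytes // 1024**2).
    nvmlInit()
    handle = nvmlDeviceGetHandleByIndex(0)
    info = nvmlDeviceGetMemoryInfo(handle)
    print(f"GPU memory occupied: {info.used//1024**2} MB.")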

model_id = "meta-llama/Llama-2-7b-chat-hf"

In [None]:
from huggingface_hub import notebook_login

notebook_login()


Benchmark the loading time and GPU memory consumption when loading Llama 2 7B from safetensors. By default, transformers loads the safetensors version of a checkpoint if it exists. Loading should consume less than 4 GB of CPU RAM.
*Note: For a fair benchmark, download the model first: run the cell once, then restart the runtime and run it again.*


In [None]:
start_time = time.time()
model = AutoModelForCausalLM.from_pretrained(
          model_id, device_map={"": 0}
)
duration = float(time.time() - start_time)
print("--- %s seconds ---" % (round(duration,3)))
print_gpu_utilization()


--- 16.745 seconds ---
GPU memory occupied: 26709 MB.
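
The same start/stop timing pattern is repeated in every benchmark cell of this notebook. As a minimal sketch, it could be wrapped in a small context manager; `timed` is a hypothetical helper name, not something the notebook or transformers provides, and the `time.sleep` call is only a stand-in workload.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label: str = ""):
    """Print the wall-clock time spent inside the `with` block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        duration = time.perf_counter() - start
        print(f"--- {label} {duration:.3f} seconds ---")

# Stand-in workload; in this notebook it would be a from_pretrained
# or save_pretrained call instead.
with timed("sleep"):
    time.sleep(0.1)
```

`time.perf_counter()` is used here because it is a monotonic clock intended for measuring intervals, whereas `time.time()` can jump if the system clock is adjusted.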


Benchmark the loading time and GPU memory consumption when loading Llama 2 7B without safetensors, i.e., from the PyTorch pickle checkpoint (`use_safetensors=False`). *Note: Restart the runtime before running this cell.*

In [None]:
start_time = time.time()
model = AutoModelForCausalLM.from_pretrained(
          model_id, device_map={"": 0}, use_safetensors=False
)
duration = float(time.time() - start_time)
print("--- %s seconds ---" % (round(duration,3)))
print_gpu_utilization()


--- 23.311 seconds ---
GPU memory occupied: 26709 MB.
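
To put the two loading times side by side, here is a small arithmetic sketch. The numbers are copied from the two cell outputs above; each comes from a single run, so the ratio should be read as indicative only.

```python
# Timings copied from the two cell outputs above (single runs, in seconds).
load_safetensors_s = 16.745
load_pickle_s = 23.311

speedup = load_pickle_s / load_safetensors_s
time_saved_pct = 100 * (load_pickle_s - load_safetensors_s) / load_pickle_s

print(f"safetensors loads {speedup:.2f}x faster ({time_saved_pct:.1f}% less time)")
```

In this run, loading from safetensors was about 1.39x faster, while the GPU memory occupied after loading was identical (26709 MB) for both formats.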


Save the model in the PyTorch pickle format.

In [None]:
start_time = time.time()
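# Newer transformers releases default to safetensors, so request pickle explicitly.
model.save_pretrained("llama2_PyTorch_pickle", safe_serialization=False)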
duration = float(time.time() - start_time)
print("--- %s seconds ---" % (round(duration,3)))
print_gpu_utilization()

--- 79.358 seconds ---
GPU memory occupied: 26709 MB.


Save the model in the safetensors format.

In [None]:
start_time = time.time()
model.save_pretrained("llama2_safetensors", safe_serialization=True)
duration = float(time.time() - start_time)
print("--- %s seconds ---" % (round(duration,3)))
print_gpu_utilization()

--- 109.432 seconds ---
GPU memory occupied: 26709 MB.
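
To check what was written, the header of a `.safetensors` file can be read without loading any tensor data. The sketch below relies only on the published file layout (an 8-byte little-endian header length followed by a JSON header) and the Python standard library; `read_safetensors_header` is a hypothetical helper name, and the shard file names inside `llama2_safetensors` are not assumed, only the `.safetensors` extension.

```python
import json
import struct
from pathlib import Path

def read_safetensors_header(path):
    """Return the JSON header of a .safetensors file without reading tensor data."""
    with open(path, "rb") as f:
        # First 8 bytes: header length as an unsigned little-endian 64-bit integer.
        (header_len,) = struct.unpack("<Q", f.read(8))
        # Next header_len bytes: UTF-8 JSON mapping each tensor name to its
        # dtype, shape, and data_offsets into the byte buffer that follows.
        return json.loads(f.read(header_len).decode("utf-8"))

# ASSUMPTION: the shards sit directly in the directory passed to save_pretrained.
for shard in sorted(Path("llama2_safetensors").glob("*.safetensors")):
    header = read_safetensors_header(shard)
    header.pop("__metadata__", None)  # optional free-form metadata entry
    print(shard.name, "->", len(header), "tensors")
```

Because the header is plain JSON and the rest of the file is raw tensor bytes, reading it executes no code from the checkpoint, which is not the case when unpickling a PyTorch checkpoint.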
