# L4-D - Building your own Quantizer: Load your Quantized Weights from Hugging Face Hub

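In this lesson, you will learn how to load a quantized model in a memory-efficient way: save the quantized weights, upload them to the Hugging Face Hub, then rebuild the model on PyTorch's `meta` device and load the quantized weights directly into it.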

Run the next cell to import the functions from the previous lessons of `Building your own Quantizer` so you can follow along with the video.

- To access the `helper.py` file, click `File --> Open...` in the top-left corner.

In [1]:
import torch
from helpers_extended import W8A16LinearLayer, replace_linear_with_target_and_quantize, replace_linear_with_target

## Memory Efficient Model Loading

- Load [facebook/opt-125m](https://huggingface.co/facebook/opt-125m)

In [4]:
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "facebook/opt-125m"

model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, low_cpu_mem_usage=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

In [5]:
replace_linear_with_target_and_quantize(model, W8A16LinearLayer, ["lm_head"])

In [6]:
model

OPTForCausalLM(
  (model): OPTModel(
    (decoder): OPTDecoder(
      (embed_tokens): Embedding(50272, 768, padding_idx=1)
      (embed_positions): OPTLearnedPositionalEmbedding(2050, 768)
      (final_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (layers): ModuleList(
        (0-11): 12 x OPTDecoderLayer(
          (self_attn): OPTAttention(
            (k_proj): W8A16LinearLayer()
            (v_proj): W8A16LinearLayer()
            (q_proj): W8A16LinearLayer()
            (out_proj): W8A16LinearLayer()
          )
          (activation_fn): ReLU()
          (self_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (fc1): W8A16LinearLayer()
          (fc2): W8A16LinearLayer()
          (final_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        )
      )
    )
  )
  (lm_head): Linear(in_features=768, out_features=50272, bias=False)
)

In [24]:
quantized_state_dict = model.state_dict()
model_path = "../models/quantized/fb_125m_quantized_state_dict2.pth"
torch.save(quantized_state_dict, model_path)
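A quick sanity check before uploading is to reload the saved file and compare it with the in-memory state dict. The sketch below does this on a tiny stand-in model rather than on OPT; the `toy` model and the temporary file name are assumptions made for illustration only.

```python
import os
import tempfile

import torch
import torch.nn as nn

# Tiny stand-in model (hypothetical), used instead of the quantized OPT model
toy = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
state_dict = toy.state_dict()

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_state_dict.pth")
    torch.save(state_dict, path)

    # weights_only=True restricts unpickling to tensors and plain containers
    reloaded = torch.load(path, map_location="cpu", weights_only=True)

# Same keys and same values mean the file is a faithful copy of the state dict
same_keys = list(reloaded.keys()) == list(state_dict.keys())
same_values = all(torch.equal(reloaded[k], state_dict[k]) for k in state_dict)
print(same_keys, same_values)
```

The same comparison should carry over to `quantized_state_dict` and `model_path`, provided every entry in the state dict is a plain tensor.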

- The code below is for demonstration purposes only.
- You'll need your own Hugging Face username in order to run it.
- Add your username in `YOUR_HF_USERNAME = ""`.

```Python
from huggingface_hub import HfApi, create_repo

YOUR_HF_USERNAME = ""
your_repo_id = f"{YOUR_HF_USERNAME}/opt-125m-quantized-dlai"

api = HfApi()

# create_repo(your_repo_id)

api.upload_file(
 path_or_fileobj="quantized_state_dict.pth",
 path_in_repo="quantized_state_dict.pth",
 repo_id=your_repo_id
)
```

In [25]:
import torch
from helpers import W8A16LinearLayer, W8A16LinearLayerDtype, replace_linear_with_target_and_quantize, replace_linear_with_target
from transformers import AutoModelForCausalLM, AutoTokenizer
from dotenv import load_dotenv
from huggingface_hub import HfApi, create_repo
import os

# load the environment variables
load_dotenv()

# Read the Hugging Face username and API token from the environment variables
USERNAME = os.getenv("USERNAME")
api_token = os.getenv("HUGGINGFACE_HUB_TOKEN")

if not api_token:
  raise ValueError("No API token found. Please set HUGGINGFACE_HUB_TOKEN in your .env file.")

if not USERNAME:
  raise ValueError("No username found. Please set USERNAME in your .env file.")


repo_id = f"{USERNAME}/opt-125m-quantized-deeplearningai"

# instantiate the HfApi class
api = HfApi()

# create a new repository if not already created
create_repo(repo_id=repo_id, token=api_token, repo_type="model", exist_ok=True)

api.upload_file(
  path_or_fileobj=model_path,
  path_in_repo="opt_125m_quantized_state_dict2.pth",
  repo_id=repo_id,
  token=api_token
)

fb_125m_quantized_state_dict2.pth: 100%|██████████| 166M/166M [00:43<00:00, 3.82MB/s] 


CommitInfo(commit_url='https://huggingface.co/tew9/opt-125m-quantized-deeplearningai/commit/420f7dcfa27471447a13b8f38cc846f5d00e7433', commit_message='Upload opt_125m_quantized_state_dict2.pth with huggingface_hub', commit_description='', oid='420f7dcfa27471447a13b8f38cc846f5d00e7433', pr_url=None, pr_revision=None, pr_num=None)
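The 166M reported by the progress bar can be roughly cross-checked from the layer shapes in the model printout. The arithmetic below is a back-of-the-envelope sketch under several assumptions that this notebook does not confirm: each quantized layer stores int8 weights plus one bf16 scale and one bf16 bias value per output feature, the unquantized `lm_head` shares its storage with `embed_tokens` and is therefore saved only once, and everything else is kept in bf16.

```python
# Shapes come from the model printout; the storage layout is an assumption.
HIDDEN, FFN, LAYERS = 768, 3072, 12
VOCAB_ROWS, POSITION_ROWS = 50272, 2050
INT8_BYTES, BF16_BYTES = 1, 2

# (in_features, out_features) of the six quantized linear layers per decoder layer
linear_shapes = [(HIDDEN, HIDDEN)] * 4 + [(HIDDEN, FFN), (FFN, HIDDEN)]

int8_weight_bytes = sum(i * o for i, o in linear_shapes) * INT8_BYTES
# Assumed: one bf16 scale and one bf16 bias value per output feature
scale_and_bias_bytes = sum(2 * o for _, o in linear_shapes) * BF16_BYTES
# Two LayerNorms per decoder layer, each with a weight and a bias vector
layer_norm_bytes = 2 * 2 * HIDDEN * BF16_BYTES

per_layer_bytes = int8_weight_bytes + scale_and_bias_bytes + layer_norm_bytes

# Token and position embeddings stay in bf16 (lm_head assumed tied, counted once)
embedding_bytes = (VOCAB_ROWS + POSITION_ROWS) * HIDDEN * BF16_BYTES
final_layer_norm_bytes = 2 * HIDDEN * BF16_BYTES

total_bytes = LAYERS * per_layer_bytes + embedding_bytes + final_layer_norm_bytes
print(f"{total_bytes / 1e6:.1f} MB")
```

Under these assumptions the estimate comes to about 165.7 MB, close to the uploaded file size. Nearly half of that is the bf16 embedding table, which the quantizer leaves untouched.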

### Load the Model in the Meta Device

In [42]:
from transformers import OPTForCausalLM, AutoTokenizer, AutoConfig

model_id = "facebook/opt-125m"
config = AutoConfig.from_pretrained(model_id)

with torch.device("meta"):
  model_loaded = OPTForCausalLM(config)

tokenizer = AutoTokenizer.from_pretrained(model_id)

In [44]:
for param in model_loaded.parameters():
  print(param)

Parameter containing:
tensor(..., device='meta', size=(50272, 768), requires_grad=True)
Parameter containing:
tensor(..., device='meta', size=(2050, 768), requires_grad=True)
Parameter containing:
tensor(..., device='meta', size=(768,), requires_grad=True)
Parameter containing:
tensor(..., device='meta', size=(768,), requires_grad=True)
Parameter containing:
tensor(..., device='meta', size=(768, 768), requires_grad=True)
Parameter containing:
tensor(..., device='meta', size=(768,), requires_grad=True)
Parameter containing:
tensor(..., device='meta', size=(768, 768), requires_grad=True)
Parameter containing:
tensor(..., device='meta', size=(768,), requires_grad=True)
Parameter containing:
tensor(..., device='meta', size=(768, 768), requires_grad=True)
Parameter containing:
tensor(..., device='meta', size=(768,), requires_grad=True)
Parameter containing:
tensor(..., device='meta', size=(768, 768), requires_grad=True)
Parameter containing:
tensor(..., device='meta', size=(768,), requires_
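The printout shows `...` in place of values because a meta tensor carries only metadata (shape, dtype, device) and no storage. A minimal sketch on a single layer, assuming PyTorch 2.x and the default float32 dtype; the layer sizes are chosen only to mirror `fc1`:

```python
import torch
import torch.nn as nn

# Modules created under this context get their parameters on the meta device
with torch.device("meta"):
    layer = nn.Linear(768, 3072)

weight = layer.weight
print(weight.device, tuple(weight.shape), weight.dtype)

# Bytes the weight would occupy once materialized in its current dtype
would_need_bytes = weight.numel() * weight.element_size()
print(weight.is_meta, would_need_bytes)
```

Because nothing is allocated, building the full model skeleton this way costs almost no memory regardless of model size.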

In [28]:
model

OPTForCausalLM(
  (model): OPTModel(
    (decoder): OPTDecoder(
      (embed_tokens): Embedding(50272, 768, padding_idx=1)
      (embed_positions): OPTLearnedPositionalEmbedding(2050, 768)
      (final_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (layers): ModuleList(
        (0-11): 12 x OPTDecoderLayer(
          (self_attn): OPTAttention(
            (k_proj): Linear(in_features=768, out_features=768, bias=True)
            (v_proj): Linear(in_features=768, out_features=768, bias=True)
            (q_proj): Linear(in_features=768, out_features=768, bias=True)
            (out_proj): Linear(in_features=768, out_features=768, bias=True)
          )
          (activation_fn): ReLU()
          (self_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (fc1): Linear(in_features=768, out_features=3072, bias=True)
          (fc2): Linear(in_features=3072, out_features=768, bias=True)
          (final_layer_norm): LayerNorm((768,), ep

In [54]:
replace_linear_with_target(model_loaded, W8A16LinearLayer, ["lm_head"])

In [57]:
model_loaded

OPTForCausalLM(
  (model): OPTModel(
    (decoder): OPTDecoder(
      (embed_tokens): Embedding(50272, 768, padding_idx=1)
      (embed_positions): OPTLearnedPositionalEmbedding(2050, 768)
      (final_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (layers): ModuleList(
        (0-11): 12 x OPTDecoderLayer(
          (self_attn): OPTAttention(
            (k_proj): W8A16LinearLayer()
            (v_proj): W8A16LinearLayer()
            (q_proj): W8A16LinearLayer()
            (out_proj): W8A16LinearLayer()
          )
          (activation_fn): ReLU()
          (self_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (fc1): W8A16LinearLayer()
          (fc2): W8A16LinearLayer()
          (final_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        )
      )
    )
  )
  (lm_head): Linear(in_features=768, out_features=50272, bias=False)
)
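The source of `replace_linear_with_target` lives in the helper file and is not shown in this notebook. As a rough sketch of the idea only, the code below walks a module tree and swaps each `nn.Linear` for a placeholder layer without quantizing anything, skipping excluded names. `PlaceholderInt8Linear`, `swap_linear_layers`, and the buffer names are hypothetical and are not the course implementation.

```python
import torch
import torch.nn as nn


class PlaceholderInt8Linear(nn.Module):
    """Hypothetical stand-in for a W8A16-style layer: int8 weights plus scales."""

    def __init__(self, in_features, out_features, bias=True, dtype=torch.bfloat16):
        super().__init__()
        self.register_buffer(
            "int8_weights",
            torch.zeros((out_features, in_features), dtype=torch.int8),
        )
        self.register_buffer("scales", torch.zeros(out_features, dtype=dtype))
        if bias:
            self.register_buffer("bias", torch.zeros((1, out_features), dtype=dtype))
        else:
            self.bias = None


def swap_linear_layers(module, target_cls, exclude):
    """Recursively replace nn.Linear children whose name is not in `exclude`."""
    for name, child in module.named_children():
        if isinstance(child, nn.Linear) and name not in exclude:
            new_layer = target_cls(
                child.in_features, child.out_features, bias=child.bias is not None
            )
            setattr(module, name, new_layer)
        else:
            swap_linear_layers(child, target_cls, exclude)


# Toy model (hypothetical) with one layer to replace and one to keep
toy = nn.ModuleDict(
    {"proj": nn.Linear(8, 4), "lm_head": nn.Linear(4, 16, bias=False)}
)
swap_linear_layers(toy, PlaceholderInt8Linear, exclude=["lm_head"])
print(toy)
```

Swapping only the structure is enough at this stage, since the actual int8 weights and scales are expected to come from the saved state dict.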

In [58]:
# Try to move the model to CPU or GPU.
# This fails while parameters are still on the meta device: there is no data to copy.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_loaded.to(device)

NotImplementedError: Cannot copy out of meta tensor; no data!
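The error arises because `.to()` copies existing values, and a meta tensor has none. The sketch below reproduces it on a toy layer and shows `to_empty()`, which allocates storage on the target device without copying. It assumes PyTorch 2.x, and the toy layer is for illustration only.

```python
import torch
import torch.nn as nn

with torch.device("meta"):
    layer = nn.Linear(4, 2)

# .to() must copy existing values, and a meta tensor has none
try:
    layer.to("cpu")
    copy_failed = False
except NotImplementedError:
    copy_failed = True
print(copy_failed)

# to_empty() allocates storage on the target device without copying;
# the values are uninitialized until real weights are loaded
layer.to_empty(device="cpu")
print(layer.weight.device, layer.weight.is_meta)
```

A module materialized with `to_empty()` still needs real weights before it is useful, so on its own it is not a substitute for loading a state dict.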

In [52]:
from huggingface_hub import hf_hub_download

# Download the quantized state dict and cache it
state_dict_cache_path = hf_hub_download(
    repo_id, "opt_125m_quantized_state_dict2.pth"
)

state_dict = torch.load(state_dict_cache_path)
model_loaded.load_state_dict(state_dict, strict=True, assign=True)

<All keys matched successfully>

- Test your model.
- **Note:** Your generated text may differ from what you see in the video.

In [51]:
from transformers import pipeline

pipe = pipeline("text-generation", model=model_loaded, tokenizer=tokenizer)
pipe("Hello today I am", max_new_tokens=40)

NotImplementedError: Cannot copy out of meta tensor; no data!

In [41]:
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, device="meta")
pipe("Hello today I am giving a course about", max_new_tokens=10)

NotImplementedError: aten::_local_scalar_dense: attempted to run this operator with Meta tensors, but there was no abstract impl or Meta kernel registered. You may have run into this message while using an operator with PT2 compilation APIs (torch.compile/torch.export); in order to use this operator with those APIs you'll need to add an abstract impl.Please see the following doc for next steps: https://docs.google.com/document/d/1_W62p8WJOQQUzPsJYa7s701JXt0qf2OfLub2sbkHOaU/edit