<a href="https://colab.research.google.com/github/rahiakela/small-language-models-fine-tuning/blob/main/domain-specific-small-language-models/advanced-quantization-techniques/01_offloading_model_with_flexgen.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Using FlexGen to Offload OPT Models' Weights to RAM and Disk


This notebook runs inference with Meta AI's OPT models by offloading part of the model weights from VRAM to RAM and/or disk, using the [FlexGen](https://github.com/FMInference/FlexLLMGen/) generation engine programmatically. The code uses the [OPT 1.3B](https://huggingface.co/facebook/opt-1.3b) model, but the same approach applies to any other model in the OPT family. Running it requires hardware acceleration.
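
Because the notebook needs hardware acceleration, it can help to check for a GPU before installing anything. The sketch below is only a heuristic: it looks for the `nvidia-smi` executable on the `PATH`. The helper name `has_nvidia_gpu_tooling` is hypothetical, and a runtime without that tool may still have a usable accelerator.

```python
import shutil


def has_nvidia_gpu_tooling() -> bool:
    """Return True if the nvidia-smi executable is found on PATH (a heuristic only)."""
    return shutil.which("nvidia-smi") is not None


if has_nvidia_gpu_tooling():
    print("nvidia-smi found: a CUDA GPU is probably available.")
else:
    print("nvidia-smi not found: enable a GPU runtime before continuing.")
```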

Install FlexGen from source.

In [None]:
!git clone https://github.com/FMInference/FlexLLMGen.git
%cd FlexLLMGen
!pip install -e .
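
After an editable install it is worth confirming that the package can actually be found by the current kernel. The following is a minimal sketch using only the standard library; the helper name `is_importable` is hypothetical, and in some notebook runtimes a kernel restart may be needed before the check succeeds.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if Python can locate the given top-level module."""
    return importlib.util.find_spec(module_name) is not None


print("flexllmgen importable:", is_importable("flexllmgen"))
```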

## Evaluate FlexGen

In [2]:
!python -m flexllmgen.flex_opt --model facebook/opt-1.3b


In [None]:
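# --percent expects six values: weights GPU/CPU, KV cache GPU/CPU, activations GPU/CPU
!python -m flexllmgen.flex_opt --model facebook/opt-6.7b --percent 50 50 100 0 100 0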
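
FlexGen's `--percent` option describes where the weights, the attention (KV) cache, and the activations are placed, as a GPU and a CPU percentage for each. As far as the FlexGen documentation describes it, whatever is left over goes to disk. The sketch below makes that arithmetic explicit; the function name `placement` and the dictionary layout are hypothetical and not part of FlexGen.

```python
def placement(percent):
    """Expand six GPU/CPU percentages into per-tensor GPU/CPU/disk shares.

    Assumed order: weights GPU, weights CPU, cache GPU, cache CPU,
    activations GPU, activations CPU. The disk share is the remainder.
    """
    if len(percent) != 6:
        raise ValueError("expected six percentages")
    result = {}
    for i, name in enumerate(["weights", "kv_cache", "activations"]):
        gpu, cpu = percent[2 * i], percent[2 * i + 1]
        disk = 100 - gpu - cpu
        if min(gpu, cpu, disk) < 0:
            raise ValueError(f"{name}: percentages must be >= 0 and sum to at most 100")
        result[name] = {"gpu": gpu, "cpu": cpu, "disk": disk}
    return result


print(placement([50, 50, 100, 0, 100, 0]))
```

For example, `placement([0, 50, 100, 0, 100, 0])` would leave half of the weights on disk.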

## Using FlexGen and the Transformers Library Programmatically

Import the required FlexGen classes.

In [4]:
from flexllmgen.flex_opt import (Policy, OptLM, ExecutionEnv, CompressionConfig, str2bool)

Download the OPT 1.3B tokenizer from the Hugging Face Hub.

In [None]:
from transformers import AutoTokenizer

model_id = "facebook/opt-1.3b"
tokenizer = AutoTokenizer.from_pretrained(model_id, padding_side="left")
tokenizer.add_bos_token = False
stop = tokenizer("\n").input_ids[0]
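
The tokenizer is created with `padding_side="left"`, which keeps the real tokens at the end of each row so that newly generated tokens follow the prompt directly. The sketch below illustrates the idea on plain lists; the token ids and the pad id are made up for illustration, and `left_pad` is a hypothetical helper rather than a Transformers function.

```python
def left_pad(token_ids, max_length, pad_id):
    """Pad a list of token ids on the left up to max_length."""
    if len(token_ids) > max_length:
        raise ValueError("sequence is longer than max_length")
    return [pad_id] * (max_length - len(token_ids)) + list(token_ids)


batch = [[5, 6, 7], [8, 9]]  # illustrative token ids only
padded = [left_pad(ids, max_length=4, pad_id=1) for ids in batch]
print(padded)  # [[1, 5, 6, 7], [1, 1, 8, 9]]
```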

Set up the FlexGen execution environment.

In [6]:
offload_dir = './flexgen_offload'
env = ExecutionEnv.create(offload_dir)
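
When part of a model is offloaded to disk, the offload directory needs enough free space to hold it. The sketch below, which uses only the standard library, creates the directory if it is missing and reports the free space on its volume; `free_gib` is a hypothetical helper and not part of FlexGen.

```python
import os
import shutil


def free_gib(directory: str) -> float:
    """Create the directory if needed and return the free space on its volume in GiB."""
    os.makedirs(directory, exist_ok=True)
    return shutil.disk_usage(directory).free / 2**30


print(f"Free space for offloading: {free_gib('./flexgen_offload'):.1f} GiB")
```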

Prepare a list of prompts for batch inference.

In [7]:
prompts = [
    "Question: Where were the 2004 Olympics held?\n"
    "Answer: Athens, Greece\n"
    "Question: What is the longest river on the earth?\n"
    "Answer:",

    "Extract the airport codes from this text.\n"
    "Text: \"I want a flight from New York to San Francisco.\"\n"
    "Airport codes: JFK, SFO.\n"
    "Text: \"I want you to book a flight from Phoenix to Las Vegas.\"\n"
    "Airport codes:",
]

Set up an offloading policy.

In [8]:
policy = Policy(len(prompts), 1,         # GPU batch size, number of GPU batches
                70, 30, 70, 30, 100, 0,  # weights, KV cache, activations: GPU %, CPU % each
                overlap=True, sep_layer=True, pin_weight=True,
                cpu_cache_compute=True, attn_sparsity=1.0,
                compress_weight=True,    # store weights with 4-bit group-wise compression
                comp_weight_config=CompressionConfig(
                    num_bits=4, group_size=64,
                    group_dim=0, symmetric=False),
                compress_cache=False,    # leave the KV cache uncompressed
                comp_cache_config=CompressionConfig(
                    num_bits=4, group_size=64,
                    group_dim=2, symmetric=False)
                )
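
The `CompressionConfig` values above (`num_bits=4`, `group_size=64`, `symmetric=False`) describe group-wise asymmetric quantization. The sketch below shows the general technique on a flat Python list using min-max scaling per group. It is an illustration under that assumption, not FlexGen's actual implementation, which operates on tensors along `group_dim`; all function names here are hypothetical.

```python
def quantize_group(values, num_bits=4):
    """Asymmetric min-max quantization of one group to num_bits integer codes."""
    levels = 2 ** num_bits - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels or 1.0  # fall back to 1.0 for constant groups
    codes = [round((v - lo) / scale) for v in values]
    return codes, lo, scale


def dequantize_group(codes, lo, scale):
    """Map integer codes back to approximate floating-point values."""
    return [lo + c * scale for c in codes]


def roundtrip(values, group_size=64, num_bits=4):
    """Quantize and dequantize a flat list group by group."""
    restored = []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        restored.extend(dequantize_group(*quantize_group(group, num_bits)))
    return restored


# Synthetic "weights" in [-1, 1], used only for the demonstration.
weights = [((i * 37) % 101) / 50.0 - 1.0 for i in range(128)]
approx = roundtrip(weights)
max_error = max(abs(a - b) for a, b in zip(weights, approx))
print(f"max reconstruction error: {max_error:.4f}")
```

Each group stores its own minimum and scale, so the reconstruction error is bounded by half a quantization step of that group rather than of the whole tensor.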

Prepare the model for execution through the FlexGen inference engine, according to the offloading policy defined above. This step also downloads the model checkpoint from the Hugging Face Hub and handles the weight conversion.

In [9]:
path = '~/opt_weights'
model = OptLM(model_id, env, path, policy)
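
A rough estimate of the weight footprint helps when choosing how much to offload. The sketch below assumes 16-bit weights for the uncompressed case and takes the nominal 1.3 billion parameters from the model name; it ignores the per-group metadata that 4-bit compression adds, so treat the numbers as approximations. `weight_gib` is a hypothetical helper.

```python
def weight_gib(num_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in GiB, ignoring quantization metadata."""
    return num_params * bits_per_param / 8 / 2**30


params = 1.3e9  # nominal parameter count implied by the model name (assumption)
print(f"16-bit: {weight_gib(params, 16):.2f} GiB")
print(f"4-bit : {weight_gib(params, 4):.2f} GiB (plus per-group scale/offset overhead)")
```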

Generate text for the given set of prompts and then display the generated result for each one.

In [10]:
print("Generate...")
inputs = tokenizer(prompts, padding="max_length", max_length=128)
output_ids = model.generate(
    inputs.input_ids,
    do_sample=True,
    temperature=0.7,
    max_new_tokens=32,
    stop=stop)
outputs = tokenizer.batch_decode(output_ids, skip_special_tokens=True)
print("Outputs:\n" + 70 * '-')
for i, output in enumerate(outputs):
    print(f"{i}: {output}")
    print("-" * 70)

Outputs:
----------------------------------------------------------------------
0: Question: Where were the 2004 Olympics held?
Answer: Athens, Greece
Question: What is the longest river on the earth?
Answer: The Nile
Question: What is the number of Grecian tigers?
Answer: 10,000
Question: What is the capital of Macedonia?

----------------------------------------------------------------------
1: Extract the airport codes from this text.
Text: "I want a flight from New York to San Francisco."
Airport codes: JFK, SFO.
Text: "I want you to book a flight from Phoenix to Las Vegas."
Airport codes: PHX, LVG.

Text: I want to book a flight from New York to San Francisco.
Airport codes: JFK, SFO
----------------------------------------------------------------------
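
The decoded outputs repeat the prompt and may continue past the first answer. A small post-processing step can isolate the answer itself. The sketch below assumes the decoded text starts with the original prompt, as in the outputs shown here; `first_answer` is a hypothetical helper.

```python
def first_answer(prompt: str, decoded: str) -> str:
    """Return the first generated line that follows the prompt."""
    completion = decoded[len(prompt):] if decoded.startswith(prompt) else decoded
    return completion.strip().split("\n", 1)[0].strip()


prompt = "Question: What is the longest river on the earth?\nAnswer:"
decoded = prompt + " The Nile\nQuestion: What is the capital of Macedonia?\n"
print(first_answer(prompt, decoded))  # The Nile
```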


Shut down the FlexGen execution environment when done.

In [11]:
print("Shutting down...")
env.close_copy_threads()

Shutting down...
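
If generation raises an exception, the shutdown call above is never reached. One way to guard against that is to wrap the environment in a context manager so that `close_copy_threads()` always runs. The sketch below demonstrates the pattern with a stand-in class; `managed_env` and `DummyEnv` are hypothetical names, and with FlexGen the object passed in would be the `env` created earlier.

```python
from contextlib import contextmanager


@contextmanager
def managed_env(env):
    """Yield env and always call env.close_copy_threads() on exit."""
    try:
        yield env
    finally:
        env.close_copy_threads()


class DummyEnv:
    """Stand-in for a FlexGen ExecutionEnv, used only to demonstrate the pattern."""

    def __init__(self):
        self.closed = False

    def close_copy_threads(self):
        self.closed = True


dummy = DummyEnv()
with managed_env(dummy):
    pass  # build the model and generate here
print("closed:", dummy.closed)  # closed: True
```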
