### 1. HuggingFace
- The simplest, most basic way to load an LLM
- Uses the Transformers library
- Loads the model without quantization or compression

In [1]:
from torch import bfloat16
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="HuggingFaceH4/zephyr-7b-beta",
    torch_dtype=bfloat16,
    device_map="auto",
    model_kwargs={
        "cache_dir": "/media/shin/T7/huggingface/models",
    }
)

config.json: 100%|██████████| 638/638 [00:00<00:00, 1.48MB/s]
model.safetensors.index.json: 100%|██████████| 23.9k/23.9k [00:00<00:00, 22.9MB/s]
model-00001-of-00008.safetensors: 100%|██████████| 1.89G/1.89G [00:22<00:00, 84.3MB/s]
model-00002-of-00008.safetensors: 100%|██████████| 1.95G/1.95G [00:23<00:00, 81.8MB/s]
model-00003-of-00008.safetensors: 100%|██████████| 1.98G/1.98G [00:23<00:00, 84.4MB/s]
model-00004-of-00008.safetensors: 100%|██████████| 1.95G/1.95G [00:23<00:00, 83.6MB/s]
model-00005-of-00008.safetensors: 100%|██████████| 1.98G/1.98G [00:23<00:00, 83.2MB/s]
model-00006-of-00008.safetensors: 100%|██████████| 1.95G/1.95G [00:23<00:00, 82.2MB/s]
model-00007-of-00008.safetensors: 100%|██████████| 1.98G/1.98G [00:24<00:00, 82.2MB/s]
model-00008-of-00008.safetensors: 100%|██████████| 816M/816M [00:10<00:00, 81.4MB/s]
Downloading shards: 100%|██████████| 8/8 [02:58<00:00, 22.25s/it]
Loading checkpoint shards: 100%|██████████| 8
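
The shard sizes in the download log above give a rough idea of how much memory the unquantized weights need. The sketch below sums them and, assuming 2 bytes per parameter for bfloat16 and decimal gigabytes in the progress bars, backs out an approximate parameter count and an idealized 4-bit footprint. It ignores activations, the KV cache, and quantization metadata, so the results are rough lower bounds rather than measurements.

```python
# Shard sizes in GB, copied from the download log above (816M -> 0.816).
shard_sizes_gb = [1.89, 1.95, 1.98, 1.95, 1.98, 1.95, 1.98, 0.816]

BYTES_PER_PARAM_BF16 = 2    # bfloat16 stores each weight in 16 bits
BYTES_PER_PARAM_4BIT = 0.5  # idealized 4-bit storage, no metadata (assumption)

total_gb = sum(shard_sizes_gb)
# Approximate parameter count in billions: GB divided by bytes per parameter.
params_billion = total_gb / BYTES_PER_PARAM_BF16
four_bit_gb = params_billion * BYTES_PER_PARAM_4BIT

print(f"bf16 weights: {total_gb:.2f} GB")
print(f"approx. parameters: {params_billion:.2f} B")
print(f"idealized 4-bit weights: {four_bit_gb:.2f} GB")
```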

In [2]:
messages = [
    {
        "role": "system",
        "content": "You are a friendly chatbot.",
    },
    {
        "role": "user", 
        "content": "Tell me a funny joke about Large Language Models."
    },
]
prompt = pipe.tokenizer.apply_chat_template(
    messages, 
    tokenize=False, 
    add_generation_prompt=True
)

In [3]:
print(prompt)

<|system|>
You are a friendly chatbot.</s>
<|user|>
Tell me a funny joke about Large Language Models.</s>
<|assistant|>
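
To make the layout of the printed prompt concrete, here is a small pure-Python sketch that reproduces it. The function name `format_zephyr_prompt` is hypothetical, and the format is inferred only from the output shown above; the real template ships with the tokenizer and may differ in details.

```python
def format_zephyr_prompt(messages, add_generation_prompt=True, eos_token="</s>"):
    """Hypothetical re-implementation of the layout seen in the printed prompt."""
    parts = []
    for message in messages:
        # Each turn: a role tag on its own line, the content, then the EOS token.
        parts.append(f"<|{message['role']}|>\n{message['content']}{eos_token}\n")
    if add_generation_prompt:
        # An open assistant tag cues the model to start its reply.
        parts.append("<|assistant|>\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are a friendly chatbot."},
    {"role": "user", "content": "Tell me a funny joke about Large Language Models."},
]
prompt = format_zephyr_prompt(messages)
print(prompt)
```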



In [4]:
outputs = pipe(
    prompt,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.1,
    top_p=0.95
)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


In [6]:
print(outputs[0]["generated_text"])

<|system|>
You are a friendly chatbot.</s>
<|user|>
Tell me a funny joke about Large Language Models.</s>
<|assistant|>
Why did the Large Language Model go to the party?

To impress everyone with its vocabulary!

The host was amazed as the LLM effortlessly chatted with guests, using words that left them in awe. The partygoers couldn't believe how intelligent and witty the LLM was, and some even mistook it for a human!

But as the night wore on, the LLM started to run out of steam. It struggled to keep up with the conversation, repeating itself and making grammatical errors.

The host, feeling sorry for the LLM, suggested they take a break and recharge. The LLM gratefully accepted, promising to come back stronger and more intelligent than ever before.

From that night on, the LLM became a regular at the party, impressing everyone with its vast knowledge and quick wit. But it also learned to pace itself, knowing that being a Large Language Model was a big responsibility, and it needed to
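
As the output above shows, `generated_text` echoes the prompt before the model's reply. A minimal way to keep only the reply is to slice off the prompt prefix. The helper name `strip_prompt` is hypothetical, and the strings below are shortened stand-ins for the real prompt and output.

```python
def strip_prompt(generated_text: str, prompt: str) -> str:
    """Return only the newly generated text (hypothetical helper)."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):]
    # If the prompt is not a prefix, return the text unchanged.
    return generated_text


prompt = "<|user|>\nTell me a joke.</s>\n<|assistant|>\n"
generated_text = prompt + "Why did the Large Language Model go to the party?"
reply = strip_prompt(generated_text, prompt)
print(reply)
```

Passing `return_full_text=False` to the text-generation pipeline call should have a similar effect without the manual slicing.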

### 2. Sharding
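- Split the model into multiple pieces (shards).
- Distribute the shards across different devices to work around GPU memory limits.
- Implemented through the Accelerate package
    - device_map="auto": automatically loads onto the GPU first, then the CPU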
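
The idea of cutting a model into size-limited shards can be sketched as a greedy grouping of tensors. This is a simplified illustration, not Accelerate's actual implementation; `plan_shards`, the tensor names, and the sizes are all made up for the example.

```python
def plan_shards(tensor_sizes, max_shard_size):
    """Greedily group tensors, in order, into shards of at most max_shard_size bytes."""
    shards, current, current_size = [], [], 0
    for name, size in tensor_sizes.items():
        # Start a new shard when adding this tensor would exceed the limit.
        if current and current_size + size > max_shard_size:
            shards.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        shards.append(current)
    return shards


# Hypothetical tensors with made-up sizes in bytes.
tensor_sizes = {"embed": 3, "layer.0": 2, "layer.1": 2, "layer.2": 2, "lm_head": 3}
shards = plan_shards(tensor_sizes, max_shard_size=5)
print(shards)
```

A smaller `max_shard_size` yields more, smaller files, which is what lets each piece fit on a memory-constrained device.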

In [11]:
from accelerate import Accelerator

accelerator = Accelerator()
accelerator.save_model(
    model=pipe.model,
    save_directory="./model",
    max_shard_size="4GB"
)
# ./model
    # ├── model-00001-of-00004.safetensors
    # ├── model-00002-of-00004.safetensors
    # ├── model-00003-of-00004.safetensors
    # ├── model-00004-of-00004.safetensors
    # └── model.safetensors.index.json

### 3. Quantize with Bitsandbytes
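- NF4 (4-bit NormalFloat)
    1. Normalization  
        Weights are normalized to a specific range.
    2. Quantization  
        The normalized weights are quantized to 4-bit.
    3. Dequantization  
        Weights stay stored in 4-bit and are dequantized to the compute dtype during computation -> performance boost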
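
The three steps above can be sketched in plain Python for a single block of weights. This is a toy under stated assumptions: it uses 16 evenly spaced levels as a stand-in for the real NF4 codebook (which is derived from a normal distribution), and the function names are hypothetical rather than part of bitsandbytes.

```python
def quantize_block(weights, levels):
    """Normalize by absmax, then map each weight to the index of the nearest level."""
    scale = max(abs(w) for w in weights) or 1.0  # avoid division by zero
    indices = [
        min(range(len(levels)), key=lambda i: abs(levels[i] - w / scale))
        for w in weights
    ]
    return indices, scale


def dequantize_block(indices, scale, levels):
    """Look up each level and undo the normalization."""
    return [levels[i] * scale for i in indices]


# 16 evenly spaced levels in [-1, 1]: a stand-in for the NF4 codebook (assumption).
levels = [-1 + 2 * i / 15 for i in range(16)]

weights = [0.02, -0.35, 0.7, -0.11]  # made-up values
indices, scale = quantize_block(weights, levels)
restored = dequantize_block(indices, scale, levels)
print(indices, scale)
print([round(w, 3) for w in restored])
```

Each index fits in 4 bits, so only the indices and one scale per block need to be stored; the restored values are approximations of the originals.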

In [1]:
from transformers import BitsAndBytesConfig
from torch import bfloat16

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4", 
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=bfloat16
)



In [2]:
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

tokenizer = AutoTokenizer.from_pretrained(
    "HuggingFaceH4/zephyr-7b-beta", 
    cache_dir="/media/shin/T7/huggingface/tokenizers"
)
model = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceH4/zephyr-7b-beta",
    quantization_config=bnb_config,
    cache_dir="/media/shin/T7/huggingface/models",
    device_map="auto"
)

config.json: 100%|██████████| 638/638 [00:00<00:00, 1.90MB/s]
config.json: 100%|██████████| 638/638 [00:00<00:00, 2.16MB/s]
model.safetensors.index.json: 100%|██████████| 23.9k/23.9k [00:00<00:00, 34.6MB/s]
model-00001-of-00008.safetensors: 100%|██████████| 1.89G/1.89G [00:23<00:00, 78.8MB/s]
model-00002-of-00008.safetensors: 100%|██████████| 1.95G/1.95G [00:30<00:00, 63.0MB/s]
model-00003-of-00008.safetensors: 100%|██████████| 1.98G/1.98G [00:30<00:00, 65.9MB/s]
model-00004-of-00008.safetensors: 100%|██████████| 1.95G/1.95G [00:30<00:00, 63.9MB/s]
model-00005-of-00008.safetensors: 100%|██████████| 1.98G/1.98G [00:29<00:00, 66.7MB/s]
model-00006-of-00008.safetensors: 100%|██████████| 1.95G/1.95G [00:31<00:00, 62.6MB/s]
model-00007-of-00008.safetensors: 100%|██████████| 1.98G/1.98G [00:24<00:00, 79.9MB/s]
model-00008-of-00008.safetensors: 100%|██████████| 816M/816M [00:14<00:00, 56.9MB/s]
Downloading shards: 100%|██████████| 8/8 [03:41<00:00, 27.67s/it]
Loading checkpoint shards: 100%|█

In [8]:
pipe = pipeline(
    model=model,
    tokenizer=tokenizer,
    task='text-generation'
)

In [9]:
messages = [
    {
        "role": "system",
        "content": "You are a friendly chatbot.",
    },
    {
        "role": "user", 
        "content": "Tell me a funny joke about Large Language Models."
    },
]
prompt = pipe.tokenizer.apply_chat_template(
    messages, 
    tokenize=False, 
    add_generation_prompt=True
)

In [10]:
outputs = pipe(
    prompt, 
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,
    top_p=0.95
)
print(outputs[0]["generated_text"])

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


<|system|>
You are a friendly chatbot.</s>
<|user|>
Tell me a funny joke about Large Language Models.</s>
<|assistant|>
Why did the Large Language Model go to the party?

To mingle with the crowd and spread its wit and humor! But unfortunately, it got so caught up in its own jokes that it forgot to introduce itself to anyone, and left the party feeling a little lonely.

The moral of the story? Even the smartest and most entertaining among us need to remember to connect with others and build relationships, no matter how eloquent our words may be.


### 4. Pre-Quantization (GPTQ vs AWQ vs GGUF)
- Many already-quantized models are available on the Hugging Face Hub.
- GPTQ, AWQ, and GGUF are the most popular formats.

#### GPTQ
- A post-training quantization (PTQ) method
- Quantizes weights to 4-bit while minimizing the mean squared error introduced by quantization
- During inference, weights are dynamically dequantized to float16  
    -> improves performance while keeping memory usage low
- Optimized for GPU use
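
To illustrate what "4-bit while minimizing mean squared error" means, the toy below searches for the rounding scale that gives the lowest MSE on a handful of made-up weights. It is deliberately not the GPTQ algorithm, which works layer by layer and compensates for quantization error using second-order information; all function names here are hypothetical.

```python
def quantize_int4(weights, scale):
    """Round to signed 4-bit integers in [-8, 7] and map back to floats."""
    return [max(-8, min(7, round(w / scale))) * scale for w in weights]


def mse(a, b):
    """Mean squared error between two equally long lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def best_scale(weights, candidates=100):
    """Grid-search the scale that minimizes the MSE after 4-bit rounding."""
    absmax = max(abs(w) for w in weights)
    # The last candidate equals the naive absmax / 7 scale.
    scales = [absmax / 7 * (k / candidates) for k in range(1, candidates + 1)]
    return min(scales, key=lambda s: mse(weights, quantize_int4(weights, s)))


weights = [0.01, -0.02, 0.03, 0.015, -0.01, 0.5]  # one outlier (made-up values)
naive = max(abs(w) for w in weights) / 7
tuned = best_scale(weights)
naive_mse = mse(weights, quantize_int4(weights, naive))
tuned_mse = mse(weights, quantize_int4(weights, tuned))
print("naive MSE:", naive_mse)
print("tuned MSE:", tuned_mse)
```

Because the naive scale is one of the candidates, the tuned error can never be worse than the naive one.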

In [1]:
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_id = "TheBloke/zephyr-7B-beta-GPTQ"
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True, cache_dir="/media/shin/T7/huggingface/tokenizers")
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    trust_remote_code=False,
    revision="main",
    cache_dir="/media/shin/T7/huggingface/models"
)

pipe = pipeline(model=model, tokenizer=tokenizer, task='text-generation')

Using `disable_exllama` is deprecated and will be removed in version 4.37. Use `use_exllama` instead and specify the version with `exllama_config`.The value of `use_exllama` will be overwritten by `disable_exllama` passed in `GPTQConfig` or stored in your config file.


In [3]:
messages = [
    {
        "role": "system",
        "content": "You are a friendly chatbot.",
    },
    {
        "role": "user", 
        "content": "Tell me a funny joke about Large Language Models."
    },
]
prompt = pipe.tokenizer.apply_chat_template(
    messages, 
    tokenize=False, 
    add_generation_prompt=True
)

In [4]:
outputs = pipe(
    prompt,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.1,
    top_p=0.95
)
print(outputs[0]["generated_text"])

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


<|system|>
You are a friendly chatbot.</s>
<|user|>
Tell me a funny joke about Large Language Models.</s>
<|assistant|>
Why did the Large Language Model go to the party?

To show off its wit and charm, of course! But unfortunately, it got stuck in a corner and couldn't seem to break the ice with anyone. It kept trying to make small talk, but all it could muster were generic responses and clichés. It felt like a total party pooper, but at least it knew it had a good excuse - it was just too large for the room!


#### GGUF
- A quantized format that can run on the CPU
- Can offload some of the layers to the GPU

In [4]:
from ctransformers import AutoModelForCausalLM
from transformers import AutoTokenizer, pipeline

model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/zephyr-7B-beta-GGUF",
    model_file="zephyr-7b-beta.Q4_K_M.gguf",
    model_type="mistral", 
    gpu_layers=50,
    hf=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "HuggingFaceH4/zephyr-7b-beta", use_fast=True
)

pipe = pipeline(model=model, tokenizer=tokenizer, task='text-generation')

Fetching 1 files: 100%|██████████| 1/1 [00:00<00:00, 7084.97it/s]
zephyr-7b-beta.Q4_K_M.gguf: 100%|██████████| 4.37G/4.37G [02:41<00:00, 27.0MB/s]
Fetching 1 files: 100%|██████████| 1/1 [02:42<00:00, 162.55s/it]
tokenizer_config.json: 100%|██████████| 1.43k/1.43k [00:00<00:00, 3.58MB/s]
tokenizer.model: 100%|██████████| 493k/493k [00:00<00:00, 11.1MB/s]
tokenizer.json: 100%|██████████| 1.80M/1.80M [00:00<00:00, 2.42MB/s]
added_tokens.json: 100%|██████████| 42.0/42.0 [00:00<00:00, 111kB/s]
special_tokens_map.json: 100%|██████████| 168/168 [00:00<00:00, 420kB/s]


In [5]:
messages = [
    {
        "role": "system",
        "content": "You are a friendly chatbot.",
    },
    {
        "role": "user", 
        "content": "Tell me a funny joke about Large Language Models."
    },
]
prompt = pipe.tokenizer.apply_chat_template(
    messages, 
    tokenize=False, 
    add_generation_prompt=True
)

In [6]:
outputs = pipe(prompt, max_new_tokens=256)
print(outputs[0]["generated_text"])

<|system|>
You are a friendly chatbot.</s>
<|user|>
Tell me a funny joke about Large Language Models.</s>
<|assistant|>
Why did the Large Language Model go to the party?

To impress everyone with its vocabulary!

But unfortunately, it kept repeating the same jokes over and over again, making everyone groan and roll their eyes. The partygoers soon realized that the Large Language Model was more of a party pooper than a party animal.

Moral of the story: Just because a Large Language Model can generate a lot of words, doesn't mean it knows how to be funny or entertaining. Sometimes, less is more!


#### AWQ (Activation-aware Weight Quantization)
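- Starts from the observation that not all weights in an LLM are equally important
- A small fraction of the weights is excluded from quantization
- Claimed to beat GPTQ in both output quality and speed
- Can be run with the vLLM package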
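
The "skip a few important weights" idea can be sketched as follows: rank input channels by how large their activations are on calibration data, protect the top fraction, and quantize the rest. This is a toy with made-up numbers and hypothetical function names; the published AWQ method is understood to rescale salient channels before quantizing rather than leave them in full precision, so treat this only as an illustration of activation-aware selection.

```python
def salient_channels(activations, fraction=0.25):
    """Rank input channels by mean absolute activation and return the top fraction."""
    n_channels = len(activations[0])
    mean_abs = [
        sum(abs(row[c]) for row in activations) / len(activations)
        for c in range(n_channels)
    ]
    keep = max(1, int(n_channels * fraction))
    ranked = sorted(range(n_channels), key=lambda c: mean_abs[c], reverse=True)
    return set(ranked[:keep])


def quantize_except(weight_row, protected, scale=0.1):
    """Round weights to a coarse grid, leaving protected channels untouched."""
    return [
        w if c in protected else round(w / scale) * scale
        for c, w in enumerate(weight_row)
    ]


# Made-up calibration activations: 3 samples x 4 input channels.
activations = [[0.1, 5.0, 0.2, 0.1], [0.2, 4.0, 0.1, 0.3], [0.1, 6.0, 0.3, 0.2]]
weight_row = [0.13, 0.27, -0.48, 0.31]

protected = salient_channels(activations)
quantized_row = quantize_except(weight_row, protected)
print(protected)
print(quantized_row)
```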

In [1]:
from vllm import LLM, SamplingParams

sampling_params = SamplingParams(
    temperature=0.0,
    top_p=1.0,
    max_tokens=256
)
llm = LLM(
    model="TheBloke/zephyr-7B-beta-AWQ",
    quantization='awq',
    dtype='half',
    gpu_memory_utilization=0.8,
    max_model_len=4096
)

2024-02-27 21:09:44,270	INFO util.py:154 -- Missing packages: ['ipywidgets']. Run `pip install -U ipywidgets`, then restart the notebook server for rich notebook output.


INFO 02-27 21:09:44 llm_engine.py:79] Initializing an LLM engine with config: model='TheBloke/zephyr-7B-beta-AWQ', tokenizer='TheBloke/zephyr-7B-beta-AWQ', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=awq, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, seed=0)
INFO 02-27 21:09:46 weight_utils.py:163] Using model weights format ['*.safetensors']
INFO 02-27 21:09:48 llm_engine.py:337] # GPU blocks: 2367, # CPU blocks: 2048
INFO 02-27 21:09:48 model_runner.py:676] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 02-27 21:09:48 model_runner.py:680] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, c

In [5]:
prompt = """
<|system|>
You are a friendly chatbot.</s>
<|user|>
Tell me a funny joke about Large Language Models.</s>
<|assistant|>
"""

In [6]:
output = llm.generate(prompt, sampling_params)
print(output[0].outputs[0].text)

Processed prompts: 100%|██████████| 1/1 [00:02<00:00,  2.60s/it]

Why did the Large Language Model go to the party?

To network and expand its vocabulary!

Why did the Large Language Model blush?

Because it overheard another model saying it was a little too wordy!

Why did the Large Language Model get kicked out of the library?

It was being too loud and kept interrupting other models' conversations with its endless chatter!

Why did the Large Language Model get a standing ovation at the comedy club?

Because it told some really punny jokes!

Why did the Large Language Model get a job as a writer?

Because it was the most wordy model in the room!

Why did the Large Language Model get a job as a librarian?

Because it knew all the right words to shelve books in the right place!

Why did the Large Language Model get a job as a teacher?

Because it knew all the right words to help students learn and grow!

Why did the Large Language Model get a job as a lawyer?

Because it knew all the right words to argue a case in court!

Why did the Large Language M




### Reference
https://towardsdatascience.com/which-quantization-method-is-right-for-you-gptq-vs-gguf-vs-awq-c4cd9d77d5be