## Introduction

@Author: Yingding Wang\
@Created: 10.Jan 2024\
@Updated: 22.Jan 2024\
@Version: 1

This notebook demonstrates Llama 2 quantization with bitsandbytes (bnb). Because bnb depends on the CUDA toolkit, a custom PyTorch cu118 image is built and used for this example. The custom image extends the Kubeflow PyTorch CUDA base image.

Reference:
* https://www.youtube.com/watch?v=ypzmPwLH_Q4
* https://github.com/pinecone-io/examples/blob/master/learn/generation/llm-field-guide/llama-2/llama-2-13b-retrievalqa.ipynb

In [1]:
import sys

In [2]:
# examining the kfp Python SDK version inside Kubeflow v1.5.1
!{sys.executable} -m pip list | grep kfp

kfp                       1.8.22
kfp-pipeline-spec         0.1.16
kfp-server-api            1.8.5


In [3]:
# !{sys.executable} -m pip list

In [4]:
# bitsandbytes quantization does not work with the macOS MPS GPU backend
quantization_enabled = True
# quantization_enabled = False

### LD_LIBRARY_PATH

bitsandbytes should pick up `LD_LIBRARY_PATH` automatically. In this image it is set to:
```console
/usr/local/cuda-11.8/lib64:/usr/local/cuda-11/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
```

In [5]:
import os
ld_library_path = os.environ['LD_LIBRARY_PATH']
print(ld_library_path)

/usr/local/cuda-11.8/lib64:/usr/local/cuda-11/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
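
If bitsandbytes cannot locate the CUDA runtime, a quick check is to see which entries of `LD_LIBRARY_PATH` actually exist on disk. The sketch below is a hypothetical helper (`split_library_path` is not part of bitsandbytes); it only splits the variable and tests each directory.

```python
import os

def split_library_path(value):
    """Split an LD_LIBRARY_PATH-style string into (directory, exists) pairs."""
    return [(d, os.path.isdir(d)) for d in value.split(os.pathsep) if d]

# Fall back to an empty string when the variable is not set.
for directory, exists in split_library_path(os.environ.get("LD_LIBRARY_PATH", "")):
    status = "ok" if exists else "missing"
    print(f"{status:8s}{directory}")
```

A directory reported as missing is not necessarily a problem, as long as one of the remaining entries contains the CUDA libraries.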


In [6]:
#!{sys.executable} -m pip install --upgrade pip
#!{sys.executable} -m pip install --user --upgrade -r ./requirements.txt --extra-index-url https://download.pytorch.org/whl/cu118 --trusted-host download.pytorch.org

In [7]:
#!{sys.executable} -m pip list

## (optional) restart kernel

### (optional) Add the Hugging Face CLI to PATH in the terminal

```shell
PATH=${PATH}:/home/jupyter/.local/bin
```

In [8]:
# (optional) uncomment the following lines to set path in python notebook cell for notebook session 
# PATH=%env PATH
# %env PATH={PATH}:/home/jupyter/.local/bin

### Extract the GPU Accelerator MIG UUIDs

* Extract with re.search and group: https://note.nkmk.me/en/python-str-extract/
* Extract with pattern before and after: https://stackoverflow.com/questions/4666973/how-to-extract-the-substring-between-two-markers
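
The pattern-based extraction described in the links above can be sketched as follows. The sample text only imitates the style of `nvidia-smi -L` output and the GPU UUID in it is a placeholder; how `AcceleratorHelper` obtains and filters the UUIDs is not shown in this notebook, so the helper below is an assumption for illustration.

```python
import re

# Illustrative sample in the style of `nvidia-smi -L`; real output may differ.
SAMPLE = """GPU 0: NVIDIA A100 80GB PCIe (UUID: GPU-00000000-0000-0000-0000-000000000000)
  MIG 3g.40gb     Device  0: (UUID: MIG-9579f618-ddae-5958-9285-3207382f0b36)
"""

# Capture the text between "(UUID: " and ")" when it starts with "MIG-".
MIG_UUID = re.compile(r"\(UUID:\s*(MIG-[0-9a-fA-F-]+)\)")

def extract_mig_uuids(text):
    """Return every MIG UUID found in the given text."""
    return MIG_UUID.findall(text)

print(extract_mig_uuids(SAMPLE))
```

`findall` is used instead of `re.search(...).group(1)` so that hosts with several MIG slices return all UUIDs at once.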

In [9]:
from util.accelerator_utils import (
    AcceleratorHelper, 
    DIR_MODE_MAP, 
    DirectorySetting,
    TokenHelper as th
)
import os

dir_setting = DIR_MODE_MAP.get("kf_notebook")
gpu_helper = AcceleratorHelper()
UUIDs = gpu_helper.nvidia_device_uuids_filtered_by(is_mig=True, log_output=False)
gpu_helper.init_cuda_torch(UUIDs, dir_setting)

os.environ['XDG_CACHE_HOME']

'/home/jovyan/llm-models/core-kind/yinwang/models'

### PyTorch distributed with device UUID
* https://discuss.pytorch.org/t/world-size-and-rank-torch-distributed-init-process-group/57438

In [10]:
import os, time, sys
from platform import python_version

# os.environ["CUDA_LAUNCH_BLOCKING"]="1" # for debugging
# os.environ["TOKENIZERS_PARALLELISM"]="false"

print(os.environ["CUDA_VISIBLE_DEVICES"])
print(python_version())

MIG-9579f618-ddae-5958-9285-3207382f0b36
3.8.10
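
The internals of `init_cuda_torch` are not shown in this notebook. A minimal sketch of the usual way to restrict a process to specific devices, assuming the helper does something similar, is to join the UUIDs into `CUDA_VISIBLE_DEVICES`. The function name below is hypothetical.

```python
import os

def visible_devices_value(uuids):
    """Join device UUIDs into a comma-separated CUDA_VISIBLE_DEVICES value."""
    return ",".join(uuids)

uuids = ["MIG-9579f618-ddae-5958-9285-3207382f0b36"]
# The variable has to be set before the first CUDA call in the process.
os.environ["CUDA_VISIBLE_DEVICES"] = visible_devices_value(uuids)
print(os.environ["CUDA_VISIBLE_DEVICES"])
```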


#### CUDA MIG memory notice
The following Python snippet shows the available MIG memory:
```python
import torch

print(torch.cuda.mem_get_info())
for e in torch.cuda.mem_get_info():
    print(e/1024**3)
```
The first element of the tuple is the free MIG CUDA memory in bytes and the second is the total. If the free memory drops to zero although no process is attached, a CUDA process is hanging.
```console
(20748107776, 20937965568)
19.32318115234375
19.5
```
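
The conversion from bytes to GiB in the output above can be reproduced without a GPU. The helper names are hypothetical; the byte values are copied from the output.

```python
GIB = 1024 ** 3

def to_gib(num_bytes):
    """Convert a byte count to GiB."""
    return num_bytes / GIB

def free_fraction(free_bytes, total_bytes):
    """Share of the device memory that is still free."""
    return free_bytes / total_bytes

free_bytes, total_bytes = 20748107776, 20937965568
print(to_gib(free_bytes), to_gib(total_bytes))
print(f"{free_fraction(free_bytes, total_bytes):.1%} free")
```

A free fraction close to zero with no attached process is the symptom of a hanging CUDA process described above.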

To terminate a CUDA process, log in to the GPU host:
```shell
nvidia-smi # look up the PID, e.g. 830333
sudo kill -9 <PID>
```

In [11]:
model_map = {
    "llama7B-chat":     "meta-llama/Llama-2-7b-chat-hf",
    "llama13B-chat" :   "meta-llama/Llama-2-13b-chat-hf",
    "llama70B-chat" :   "meta-llama/Llama-2-70b-chat-hf",
    # "70B" : "meta-llama/Llama-2-70b-hf"
    "mistral7B-01":     "mistralai/Mistral-7B-v0.1",
    "mistral7B-inst02": "mistralai/Mistral-7B-Instruct-v0.2",
    "mixtral8x7B-01":   "mistralai/Mixtral-8x7B-v0.1",
    "mixtral8x7B-inst01":   "mistralai/Mixtral-8x7B-Instruct-v0.1", 
}

default_model_type = "mistral7B-01"
default_dir_mode = "mac_local"

In [12]:
import transformers
import torch

from transformers import AutoModelForCausalLM, AutoTokenizer
print(transformers.__version__)
print(torch.__version__)

4.36.2
2.1.2+cu118


In [13]:
# model_type = default_model_type
# model_type = "mistral7B-inst02"
# model_type = "mixtral8x7B-01"
# model_type = "mixtral8x7B-inst01"
model_type = "llama13B-chat"

# model_type = "llama70B-chat" # 37GB GPU MEM is too small for 4bit quantization

In [14]:
"""
Load the token
"""
def gen_token_kwargs():
    # call method from token helper class
    if th.need_token(model_type):
        # kwargs = {"use_auth_token": get_token(dir_setting)}
        token_kwargs = {
            "token": th.get_token(dir_setting),
            # "truncation_side": "left",
            # "return_tensors": "pt",
        }
        print("huggingface token loaded")
    else:
        token_kwargs = {}
        print("huggingface token is NOT needed")
    return token_kwargs

token_kwargs = gen_token_kwargs()

huggingface token loaded


In [15]:
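# fall back to the default model, since "7B" alone is not a valid repository id
model_name = model_map.get(model_type, model_map[default_model_type])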

print(model_name)

meta-llama/Llama-2-13b-chat-hf


In [16]:
from util.accelerator_utils import AcceleratorStatus

gpu_status = AcceleratorStatus.create_accelerator_status()
gpu_status.gpu_usage()

num_of_gpus: 1
--------------------
Device name      : NVIDIA A100 80GB PCIe MIG 3g.40gb 
Device idx       : 0 
No. of processors: 42
Physical  memory : 39.250000 GB
Reserved  memory : 0.000000 GB
Allocated memory : 0.000000 GB
Free      memory : 0.000000 GB
--------------------


In [17]:
f'{int(torch.cuda.mem_get_info()[0]/1024**3)-2}GB'

'37GB'
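
The commented-out `max_memory` argument in the loading cells passes this string directly. As far as the Accelerate documentation describes it, `max_memory` is expected to be a dictionary that maps a device identifier to a limit, so a sketch of building such a mapping could look like this (`build_max_memory` is a hypothetical helper, and the 2 GiB headroom is the value used in the cell above):

```python
GIB = 1024 ** 3

def build_max_memory(free_bytes, device=0, headroom_gib=2):
    """Map a device index to a memory cap, leaving some headroom."""
    cap = int(free_bytes / GIB) - headroom_gib
    return {device: f"{cap}GB"}

# 39.25 GiB is the physical memory reported for the MIG slice above.
print(build_max_memory(int(39.25 * GIB)))
```

In the notebook, `torch.cuda.mem_get_info()[0]` would be passed as `free_bytes`.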

## Loading the Llama 2 model with bitsandbytes
The loading code below follows this approach:
* https://github.com/pinecone-io/examples/blob/master/learn/generation/llm-field-guide/llama-2/llama-2-13b-retrievalqa.ipynb

## 4bit quantization

<table>
    <!-- row 1-->
<tr>
<th>
Llama-2-13b-chat-hf
</th>
<th>
Mixtral-8x7B-v0.1
</th>
</tr>
    <!-- row 2 -->
<tr>

<td>
<pre>
num_of_gpus: 1
--------------------
Device name      : NVIDIA A100 80GB PCIe MIG 3g.40gb 
Device idx       : 0 
No. of processors: 42
Physical  memory : 39.250000 GB
Reserved  memory : 7.085938 GB
Allocated memory : 6.809501 GB
Free      memory : 0.276437 GB
--------------------
</pre>
</td>

<td>
<pre>
num_of_gpus: 1
--------------------
Device name      : NVIDIA A100 80GB PCIe MIG 3g.40gb 
Device idx       : 0 
No. of processors: 42
Physical  memory : 39.250000 GB
Reserved  memory : 23.496094 GB
Allocated memory : 23.303491 GB
Free      memory : 0.192603 GB
</pre>
</td>

</tr>
</table>
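
The allocated memory for Llama-2-13b-chat-hf can be sanity-checked with a rough estimate from the layer shapes that `model.eval()` prints further down. This is a sketch under stated assumptions: only the linear layers inside the decoder blocks are quantized, the embedding and an output head of the same shape stay in 16 bit, and quantization constants and norm weights are ignored.

```python
GIB = 1024 ** 3

# Shapes as printed by model.eval() for Llama-2-13b-chat-hf.
HIDDEN, INTERMEDIATE, LAYERS, VOCAB = 5120, 13824, 40, 32000

def linear_params():
    """Number of weights in the quantized linear layers."""
    attention = 4 * HIDDEN * HIDDEN       # q_proj, k_proj, v_proj, o_proj
    mlp = 3 * HIDDEN * INTERMEDIATE       # gate_proj, up_proj, down_proj
    return LAYERS * (attention + mlp)

def estimate_gib(bits_per_weight):
    """Rough memory footprint in GiB for a given weight width."""
    quantized = linear_params() * bits_per_weight / 8
    # Assumption: embedding and output head are kept at 2 bytes per weight.
    unquantized = 2 * VOCAB * HIDDEN * 2
    return (quantized + unquantized) / GIB

print(f"4-bit: {estimate_gib(4):.2f} GiB, 8-bit: {estimate_gib(8):.2f} GiB")
```

The estimates come out at roughly 6.5 GiB and 12.4 GiB, slightly below the 6.81 GB and 12.57 GB reported as allocated in the 4-bit and 8-bit tables, which is plausible given the ignored overhead.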

In [18]:
import transformers
from transformers import AutoModelForCausalLM

print(transformers.__version__)

4.36.2


In [19]:
from torch import bfloat16
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=bfloat16
)

model = AutoModelForCausalLM.from_pretrained(
  model_name, #'decapoda-research/llama-7b-hf',
  device_map='auto',
  quantization_config=bnb_config,
  # max_memory=f'{int(torch.cuda.mem_get_info()[0]/1024**3)-2}GB',
  **token_kwargs,  
)

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]
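
To give an intuition for what block-wise 4-bit quantization does to the weights, here is a deliberately simplified sketch. It uses uniform signed 4-bit levels with absmax scaling, not the NF4 data type or the double quantization configured above, and it is not the bitsandbytes implementation.

```python
def quantize_block(values, bits=4):
    """Absmax-quantize one block of floats to signed integer codes."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4 bits
    absmax = max(abs(v) for v in values) or 1.0     # avoid a zero scale
    scale = absmax / qmax
    codes = [round(v / scale) for v in values]
    return codes, scale

def dequantize_block(codes, scale):
    """Map integer codes back to approximate float values."""
    return [c * scale for c in codes]

weights = [0.12, -0.50, 0.33, 0.07, -0.21, 0.44, -0.02, 0.29]
codes, scale = quantize_block(weights)
restored = dequantize_block(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes)
print(f"scale={scale:.4f}, max_error={max_error:.4f}")
```

Each block stores one scale plus small integer codes, which is where the memory saving comes from; the rounding error per weight is bounded by half the scale.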

## 8bit quantization

<table>
    <!-- row 1-->
<tr>
<th>
Llama-2-13b-chat-hf
</th>
<th>
Mixtral-8x7B-v0.1
</th>
</tr>
    <!-- row 2 -->
<tr>

<td>
<pre>
num_of_gpus: 1
--------------------
Device name      : NVIDIA A100 80GB PCIe MIG 3g.40gb 
Device idx       : 0 
No. of processors: 42
Physical  memory : 39.250000 GB
Reserved  memory : 13.185547 GB
Allocated memory : 12.572203 GB
Free      memory : 0.613344 GB
</pre>
</td>

<td>
<pre>

n.a

</pre>
</td>

</tr>
</table>

In [20]:
#model = AutoModelForCausalLM.from_pretrained(
#  model_name, #'decapoda-research/llama-7b-hf',
#  device_map='auto',
#  load_in_8bit=True,
#  # max_memory=f'{int(torch.cuda.mem_get_info()[0]/1024**3)-2}GB',
#  **token_kwargs,  
#)

In [22]:
gpu_status.gpu_usage()

num_of_gpus: 1
--------------------
Device name      : NVIDIA A100 80GB PCIe MIG 3g.40gb 
Device idx       : 0 
No. of processors: 42
Physical  memory : 39.250000 GB
Reserved  memory : 7.085938 GB
Allocated memory : 6.809501 GB
Free      memory : 0.276437 GB
--------------------


In [23]:
from torch import cuda
device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'
device

'cuda:0'

In [24]:
print(model.eval())
print(f"Model loaded on {device}")

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 5120)
    (layers): ModuleList(
      (0-39): 40 x LlamaDecoderLayer(
        (self_attn): LlamaSdpaAttention(
          (q_proj): Linear4bit(in_features=5120, out_features=5120, bias=False)
          (k_proj): Linear4bit(in_features=5120, out_features=5120, bias=False)
          (v_proj): Linear4bit(in_features=5120, out_features=5120, bias=False)
          (o_proj): Linear4bit(in_features=5120, out_features=5120, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear4bit(in_features=5120, out_features=13824, bias=False)
          (up_proj): Linear4bit(in_features=5120, out_features=13824, bias=False)
          (down_proj): Linear4bit(in_features=13824, out_features=5120, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm()
        (post_attention_layernorm): LlamaRMSNorm()
      )
    )
    (norm): Lla

### Issue with `probability tensor contains either inf, nan or element < 0`
* https://github.com/facebookresearch/llama/issues/380#issuecomment-1716832417
* https://github.com/facebookresearch/llama/issues/380

The bitsandbytes quantized model might be corrupt.
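
The error message names three conditions under which a probability vector cannot be sampled from. A minimal pure-Python sketch of that check is shown below; the helper name is hypothetical, and in practice the same test would be applied to the model's probability tensor.

```python
import math

def invalid_probabilities(probs):
    """Return (index, value) pairs that are inf, nan or negative."""
    return [
        (i, p) for i, p in enumerate(probs)
        if math.isnan(p) or math.isinf(p) or p < 0
    ]

probs = [0.7, float("nan"), -0.1, 0.4, float("inf")]
bad = invalid_probabilities(probs)
print([i for i, _ in bad])
```

An empty result means the vector passes all three conditions, so the cause of the error would have to be looked for elsewhere.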

In [25]:
tokenizer = transformers.AutoTokenizer.from_pretrained(
    model_name,
    device='cpu',
    # device=device,
    **token_kwargs,
)
# tokenizer.pad_token = "[PAD]"
# tokenizer.padding_side = "left"

In [26]:
generator = transformers.pipeline(
    model=model, 
    tokenizer=tokenizer,
    return_full_text=True,  # langchain expects the full text
    task='text-generation',
    # we pass model parameters here too
    temperature=0.01,  # sampling temperature; lower values make the output more deterministic
    max_new_tokens=80,  # max number of tokens to generate in the output
    repetition_penalty=1.1  # without this output begins repeating
)
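
The effect of `temperature` and `repetition_penalty` can be illustrated on a toy logit vector. This is a sketch, not the pipeline's code: the penalty rule below (divide positive logits, multiply negative ones) is the commonly used formulation and is assumed here to match what the pipeline applies internally.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over logits divided by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                        # subtract the max for stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def penalize_repeats(logits, seen_ids, penalty=1.1):
    """Lower the score of token ids that were already generated."""
    out = list(logits)
    for i in set(seen_ids):
        out[i] = out[i] / penalty if out[i] > 0 else out[i] * penalty
    return out

logits = [2.0, 1.0, 0.5]
print(softmax(logits, temperature=1.0))
print(softmax(logits, temperature=0.01))    # nearly all mass on the first token
print(penalize_repeats(logits, seen_ids=[0]))
```

With a temperature as low as 0.01 the distribution is almost one-hot, which is why the outputs in this notebook are close to deterministic.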

In [27]:
res = generator("Explain to me the difference between nuclear fission and fusion.")
print(res[0]["generated_text"])

Explain to me the difference between nuclear fission and fusion.

Nuclear fission is a process in which an atomic nucleus splits into two or more smaller nuclei, releasing a large amount of energy in the process. This process typically occurs when an atom is bombarded with a high-energy particle, such as a neutron. When the nucleus splits, it releases a large amount of energy in the form of kin


In [28]:
gpu_status.gpu_usage()

num_of_gpus: 1
--------------------
Device name      : NVIDIA A100 80GB PCIe MIG 3g.40gb 
Device idx       : 0 
No. of processors: 42
Physical  memory : 39.250000 GB
Reserved  memory : 7.509766 GB
Allocated memory : 6.817435 GB
Free      memory : 0.692330 GB
--------------------


In [29]:
import gc
def clear_cuda_memory(
    generator: transformers.pipelines.text_generation.TextGenerationPipeline, 
    tokenizer: transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast,
    gpu_status: AcceleratorStatus
):
    """Release the CUDA memory held by the pipeline."""
    if tokenizer is not None:
        # the tokenizer is loaded on the CPU, so there is nothing to move
        # tokenizer.model.cpu()
        del tokenizer
    if generator is not None:
        # move the model to the CPU before deleting the local reference
        generator.model.cpu()
        del generator
    gc.collect()
    torch.cuda.empty_cache()
    # report the GPU usage
    gpu_status.gpu_usage()
    
# clear_cuda_memory(generator, tokenizer, gpu_status)
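
One caveat with this helper: `del` inside a function removes only the function's local name. As long as the notebook still holds `generator`, `model` or `tokenizer` as global variables, the objects stay alive. The following sketch shows this behaviour with a stand-in class (`Payload` is hypothetical) under CPython's reference counting.

```python
import gc
import weakref

class Payload:
    """Stand-in for a large object such as a model."""

def drop_local(obj):
    del obj          # removes only the local name inside this function
    gc.collect()

payload = Payload()
ref = weakref.ref(payload)

drop_local(payload)
alive_after_local_del = ref() is not None    # the caller still holds a reference

del payload                                   # drop the caller's reference too
gc.collect()
alive_after_caller_del = ref() is not None

print(alive_after_local_del, alive_after_caller_del)
```

To release the memory completely, the global names in the notebook would have to be deleted as well before calling `gc.collect()` and `torch.cuda.empty_cache()`.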

## HuggingFace Pipeline doesn't seem to work with Bits and Bytes

In [30]:
from torch import cuda, bfloat16
device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'
device

'cuda:0'

In [31]:
tokenizer

LlamaTokenizerFast(name_or_path='meta-llama/Llama-2-13b-chat-hf', vocab_size=32000, model_max_length=1000000000000000019884624838656, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>'}, clean_up_tokenization_spaces=False),  added_tokens_decoder={
	0: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	1: AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	2: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}

In [32]:
# quantization_enabled = True
# bitsandbytes quantization does not work with MPS 
# quantization_enabled = False

#if quantization_enabled:
#    compression_kwargs = {
#        "load_in_8bit": True,
#        # "load_in_4bit": True,
#    }
#else:
#    compression_kwargs = {
#        "torch_dtype": torch.float16
#    }

#generator = transformers.pipeline(
#    "text-generation",
#    model=model_name,
#    tokenizer=tokenizer, # optional
#    # torch_dtype=torch.float16, #bfloat16 is not supported on MPS backend
#    # torch_dtype=torch.float32,
#    device_map="auto",
#    # max_length=MAX_LENGTH,
#    max_length=None, # do not cap the total length (prompt plus response)
#    max_new_tokens=100, # number of newly generated tokens; counted in tokens, not in characters
#    use_fast = True,
#    **token_kwargs,
#    **compression_kwargs,
#)

In [33]:
generator.model

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 5120)
    (layers): ModuleList(
      (0-39): 40 x LlamaDecoderLayer(
        (self_attn): LlamaSdpaAttention(
          (q_proj): Linear4bit(in_features=5120, out_features=5120, bias=False)
          (k_proj): Linear4bit(in_features=5120, out_features=5120, bias=False)
          (v_proj): Linear4bit(in_features=5120, out_features=5120, bias=False)
          (o_proj): Linear4bit(in_features=5120, out_features=5120, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear4bit(in_features=5120, out_features=13824, bias=False)
          (up_proj): Linear4bit(in_features=5120, out_features=13824, bias=False)
          (down_proj): Linear4bit(in_features=13824, out_features=5120, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm()
        (post_attention_layernorm): LlamaRMSNorm()
      )
    )
    (norm): Lla

In [34]:
def chat_gen(
    generator: transformers.pipelines.text_generation.TextGenerationPipeline, 
    tokenizer: transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast,
    gpu_status: AcceleratorStatus
):    
    def local(input_prompts: list=[], temperature: float=0.1, max_new_tokens: int=200, verbose: bool=True) -> list:
        """
        do_sample, top_k, num_return_sequences and eos_token_id are settings
        of the TextGenerationPipeline.
        
        Reference:
        https://huggingface.co/docs/transformers/generation_strategies#customize-text-generation
        """
        start = time.time()
        sequences = generator(
            input_prompts,
            do_sample=True,
            top_k=10,
            num_return_sequences=1,
            # pad_token_id=tokenizer.eos_token_id, # for mistral
            eos_token_id=tokenizer.eos_token_id,
            # max_length=200,
            max_new_tokens= max_new_tokens, # 200 # max number of tokens to generate in the output
            temperature=temperature,
            repetition_penalty=1.1  # without this output begins repeating
        )
        # for seq in sequences:
        #     print(f"Result: \n{seq['generated_text']}")
        
        batch_result = []
        for prompt_result in sequences:  # one entry per prompt in the input list
            result = []
            for seq in prompt_result:  # one entry per returned sequence
                result.append(f"Result: \n{seq['generated_text']}")
            batch_result.append(result)
            
        end = time.time()
        duration = end - start
        
        if verbose:
            for prompt_result in batch_result:
                for result in prompt_result:
                    print("promt-response")
                    print(result)
            print("-"*20)
            print(f"walltime: {duration} in secs.")
            gpu_status.gpu_usage()
            
        return batch_result   
    return local
    
chat = chat_gen(generator, tokenizer, gpu_status)
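
What `do_sample=True` together with `top_k` means can be sketched on a toy logit vector: only the `k` highest-scoring tokens remain candidates, and one of them is drawn at random according to its softmax weight. The function below is a hypothetical illustration, not the code used by `TextGenerationPipeline`.

```python
import math
import random

def top_k_sample(logits, k=10, temperature=1.0, rng=random):
    """Sample a token id from the k highest-scoring logits."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    scaled = [logits[i] / temperature for i in top]
    peak = max(scaled)                         # subtract the max for stability
    weights = [math.exp(x - peak) for x in scaled]
    return rng.choices(top, weights=weights, k=1)[0]

rng = random.Random(0)
logits = [0.1, 3.0, 2.5, -1.0, 0.3]
samples = [top_k_sample(logits, k=2, rng=rng) for _ in range(100)]
print(sorted(set(samples)))
```

With `k=2` only the token ids 1 and 2 can ever be drawn, and a very low temperature makes the choice collapse onto the single best token.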

In [35]:
system_message="""[INST]<<SYS>>\nYou are a helpful, respectful and honest assistant.
Always answer as helpfully as possible using the context text provided.
Your answers should only answer the question once and not have any text after the answer is done.\n\n
If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct.
If you don't know the answer to a question, please don't share false information.\n<</SYS>>\n\n
"""

# testing prompt
inputs=['Q: Roger has 3 tennis balls. He buys 2 more cans of tennis balls. Each can has 4 tennis balls. How many tennis balls does he have now?\nA: Roger started with 3 balls. 2 cans of 4 tennis balls each is 8 tennis balls. 3 + 8 = 11. The answer is 11.\nQ: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?\n']

def get_inputs(idx):   
    return f"{system_message}{inputs[idx]}"

print(get_inputs(0))

[INST]<<SYS>>
You are a helpful, respectful and honest assistant.
Always answer as helpfully as possible using the context text provided.
Your answers should only answer the question once and not have any text after the answer is done.


If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct.
If you don't know the answer to a question, please don't share false information.
<</SYS>>


Q: Roger has 3 tennis balls. He buys 2 more cans of tennis balls. Each can has 4 tennis balls. How many tennis balls does he have now?
A: Roger started with 3 balls. 2 cans of 4 tennis balls each is 8 tennis balls. 3 + 8 = 11. The answer is 11.
Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?
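
The printed prompt opens an `[INST]` block but never closes it. The published Llama 2 chat format wraps the system message in `<<SYS>>` markers and ends the user turn with `[/INST]`. A minimal sketch of such a builder, with a hypothetical function name and a shortened system message:

```python
def build_llama2_prompt(system, user):
    """Wrap a system and a user message in the Llama 2 chat markers."""
    return f"[INST] <<SYS>>\n{system.strip()}\n<</SYS>>\n\n{user.strip()} [/INST]"

prompt = build_llama2_prompt(
    "You are a helpful, respectful and honest assistant.",
    "The cafeteria had 23 apples. If they used 20 to make lunch and "
    "bought 6 more, how many apples do they have?",
)
print(prompt)
```

Note that the `chat(inputs, ...)` call in the next cell passes the raw `inputs` list, so the system message is not part of the prompt that the model actually receives there.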



In [36]:
verbose = True
batch_answers = chat(inputs, temperature=0.001, max_new_tokens = 80, verbose=verbose)
# batch_answers = chat(inputs, temperature=0.1, max_new_tokens = 80, verbose=verbose)
if not verbose:
    prompt_0_results = batch_answers[0]
    print(prompt_0_results[0])

promt-response
Result: 
Q: Roger has 3 tennis balls. He buys 2 more cans of tennis balls. Each can has 4 tennis balls. How many tennis balls does he have now?
A: Roger started with 3 balls. 2 cans of 4 tennis balls each is 8 tennis balls. 3 + 8 = 11. The answer is 11.
Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?
A: They started with 23 apples. Used 20 for lunch, so they have 23 - 20 = 3 apples left. Then they bought 6 more, so they have 3 + 6 = 9 apples.
--------------------
walltime: 4.174436569213867 in secs.
num_of_gpus: 1
--------------------
Device name      : NVIDIA A100 80GB PCIe MIG 3g.40gb 
Device idx       : 0 
No. of processors: 42
Physical  memory : 39.250000 GB
Reserved  memory : 7.509766 GB
Allocated memory : 6.817435 GB
Free      memory : 0.692330 GB
--------------------


In [37]:
#def print_answer(answer: list)-> None:
#    if DEBUG:
#        print("-"*10)
#        print(answer[0])
#        print("-"*10)
#        print(answer[0].split("\n")[-1])

#### Free pytorch gpu memory
* https://discuss.pytorch.org/t/how-to-delete-a-tensor-in-gpu-to-free-up-memory/48879/5
* https://discuss.huggingface.co/t/clear-gpu-memory-of-transformers-pipeline/18310
* https://saturncloud.io/blog/how-to-free-up-all-memory-pytorch-is-taking-from-gpu-memory/
* https://discuss.pytorch.org/t/how-to-free-the-pytorch-transformers-model-from-gpu-memory/132968
* https://stackoverflow.com/questions/70508960/how-to-free-gpu-memory-in-pytorch

#### Huggingface pipelines
* https://huggingface.co/docs/transformers/main_classes/pipelines
* clean cuda torch gpu: https://stackoverflow.com/questions/55322434/how-to-clear-cuda-memory-in-pytorch

In [38]:
clear_cuda_memory(generator, tokenizer, gpu_status)

num_of_gpus: 1
--------------------
Device name      : NVIDIA A100 80GB PCIe MIG 3g.40gb 
Device idx       : 0 
No. of processors: 42
Physical  memory : 39.250000 GB
Reserved  memory : 5.746094 GB
Allocated memory : 0.212502 GB
Free      memory : 5.533591 GB
--------------------
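
The "Free memory" line in these reports appears to be the reserved memory minus the allocated memory, that is, memory held by PyTorch's caching allocator but not currently used by tensors, rather than free memory on the device. The implementation of `AcceleratorStatus` is not shown in this notebook, so this reading is an assumption; the arithmetic on the reported values is consistent with it.

```python
def cached_free_gib(reserved_gib, allocated_gib):
    """Memory held by the caching allocator but not used by tensors."""
    return reserved_gib - allocated_gib

# (reserved, allocated, reported free) copied from the reports in this notebook
reports = [
    (7.085938, 6.809501, 0.276437),
    (7.509766, 6.817435, 0.692330),
    (5.746094, 0.212502, 5.533591),
]
for reserved, allocated, reported in reports:
    computed = cached_free_gib(reserved, allocated)
    print(f"computed {computed:.6f} GB, reported {reported:.6f} GB")
```

The device-level free memory is better read from `torch.cuda.mem_get_info()`, as used earlier in this notebook.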
