## Intro

* https://www.youtube.com/watch?v=ypzmPwLH_Q4
* https://github.com/pinecone-io/examples/blob/master/learn/generation/llm-field-guide/llama-2/llama-2-13b-retrievalqa.ipynb

# Install cuda toolkit
* https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#conda-installation

Run the following commands in a terminal:
```shell
conda install -y cuda -c nvidia/label/cuda-11.8.0
python -m bitsandbytes
# conda install -y cuda-toolkit-11.8.0 -c nvidia/label/cuda-11.8.0
```

In [1]:
# !conda install -y cuda -c nvidia/label/cuda-11.8.0

In [2]:
import sys

In [3]:
import os
# find / -name libcuda.so 2>/dev/null
# os.environ['LD_LIBRARY_PATH']='/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/lib/x86_64-linux-gnu'
os.environ['LD_LIBRARY_PATH']='/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/opt/conda/lib/:/opt/conda/pkgs/cuda-cudart-dev-11.8.89-0/lib/'
os.environ['LD_LIBRARY_PATH']

'/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/opt/conda/lib/:/opt/conda/pkgs/cuda-cudart-dev-11.8.89-0/lib/'
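To confirm that the directories listed in `LD_LIBRARY_PATH` actually contain a CUDA runtime library, they can be scanned from Python. This is a minimal sketch using only the standard library; the helper name `find_libs_on_path` and the default pattern `libcudart.so*` are assumptions made for illustration and are not part of bitsandbytes.

```python
import glob
import os
from typing import Dict, List


def find_libs_on_path(path_value: str, pattern: str = "libcudart.so*") -> Dict[str, List[str]]:
    """Map each existing directory of a path-style string to its matching files."""
    found = {}
    for directory in path_value.split(os.pathsep):
        if not directory or not os.path.isdir(directory):
            continue  # skip empty entries and directories that do not exist
        matches = sorted(glob.glob(os.path.join(directory, pattern)))
        if matches:
            found[directory] = matches
    return found


# An empty dict means that no directory on the path contains a matching library.
print(find_libs_on_path(os.environ.get("LD_LIBRARY_PATH", "")))
```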

In [4]:
# from bitsandbytes.cextension import CUDASetup
# import torch
# lib = CUDASetup.get_instance().lib
# lib.cadam32bit_g32

In [5]:
#!cat ./requirements.txt

In [6]:
# !{sys.executable} -m pip install --upgrade pip

In [7]:
# !{sys.executable} -m pip install --user --upgrade -r ./requirements.txt --extra-index-url https://download.pytorch.org/whl/cu118 --trusted-host https://download.pytorch.org/whl/cu118 

In [8]:
# !{sys.executable} -m pip list

#### Useful installation for KF notebook 1.7.0 cu111 drivers

```shell
#!{sys.executable} -m pip install --user --upgrade transformers==4.31.0
#!{sys.executable} -m pip install --user --upgrade torch==1.10.2+cu111 fastai==2.7.12 fastcore==1.5.29 fastdownload==0.0.7 torchvision==0.11.3+cu111 --extra-index-url https://download.pytorch.org/whl/cu111
#!{sys.executable} -m pip install --user --upgrade accelerate==0.20.3
```

We use the CUDA 11.8 build (`cu118`):
```shell
#!{sys.executable} -m pip install --user --upgrade torch==2.0.0+cu118 --extra-index-url https://download.pytorch.org/whl/cu118
```
`xformers==0.0.21` requires `torch==2.0.1`:
```shell
#!{sys.executable} -m pip install --user --upgrade xformers==0.0.21 torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2
```

Install ipywidgets so that JavaScript widgets (such as loading progress bars) render:
```shell
#!{sys.executable} -m pip install --user --upgrade ipywidgets==8.1.0 comm==0.1.4 jupyterlab-widgets==3.0.8 widgetsnbextension==4.0.8
```

Uninstall the packages:
```shell
#!{sys.executable} -m pip uninstall accelerate transformers xformers torch -y
```

## (optional) restart kernel

### (optional) Set huggingface cli in terminal

```shell
PATH=${PATH}:/home/jupyter/.local/bin
```

In [9]:
# (optional) uncomment the following lines to set path in python notebook cell for notebook session 
# PATH=%env PATH
# %env PATH={PATH}:/home/jupyter/.local/bin

## Introduction

Multi GPU inference: https://github.com/tloen/alpaca-lora/issues/445

Show accelerator device IDs:

```shell
nvidia-smi -L
```

Show GPU utilization:
```shell
nvidia-smi -q -g 0 -d UTILIZATION -l
```

Python library: gpustat
```shell
gpustat -cp
```

* https://stackoverflow.com/questions/8223811/a-top-like-utility-for-monitoring-cuda-activity-on-a-gpu

Check GPU info in PyTorch
* https://stackoverflow.com/questions/48152674/how-do-i-check-if-pytorch-is-using-the-gpu
* CUDA memory management https://pytorch.org/docs/stable/notes/cuda.html#cuda-memory-management
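
For monitoring from inside the notebook, `nvidia-smi` can also emit machine-readable CSV through `--query-gpu` together with `--format=csv,noheader,nounits`. The sketch below parses that output. The helper names and the sample values in the usage line are illustrative, and on MIG instances some fields may be reported as `[N/A]`, which the parser stores as `None`.

```python
import shutil
import subprocess
from typing import Dict, List, Optional

FIELDS = ["index", "utilization.gpu", "memory.used", "memory.total"]


def parse_gpu_query(csv_text: str) -> List[Dict[str, Optional[int]]]:
    """Parse the CSV rows printed by nvidia-smi for the FIELDS above."""
    rows = []
    for line in csv_text.strip().splitlines():
        values = [v.strip() for v in line.split(",")]
        # Non-numeric values such as "[N/A]" are stored as None.
        rows.append({f: int(v) if v.isdigit() else None for f, v in zip(FIELDS, values)})
    return rows


def query_gpus() -> List[Dict[str, Optional[int]]]:
    """Run nvidia-smi if it is on PATH; return an empty list otherwise."""
    if shutil.which("nvidia-smi") is None:
        return []
    cmd = ["nvidia-smi", f"--query-gpu={','.join(FIELDS)}", "--format=csv,noheader,nounits"]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_gpu_query(result.stdout)


# Parsing a made-up sample row; with `nounits` the memory columns are in MiB.
print(parse_gpu_query("0, 35, 24439, 40192\n"))
```

On a GPU host, `query_gpus()` returns one dictionary per device.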

### Extract the GPU Accelerator MIG UUIDs

* Extract with re.search and group: https://note.nkmk.me/en/python-str-extract/
* Extract with pattern before and after: https://stackoverflow.com/questions/4666973/how-to-extract-the-substring-between-two-markers
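
The `re`-based extraction described in the links above can be sketched as follows. This is not the implementation of `AcceleratorHelper`; the sample listing is illustrative (the GPU UUID is a placeholder), and the pattern assumes the `MIG-<uuid>` form that appears in `CUDA_VISIBLE_DEVICES` in this notebook. Older drivers may print a different form that this pattern would not match.

```python
import re
from typing import List

# Illustrative `nvidia-smi -L` style listing; the GPU UUID is a placeholder.
SAMPLE = """\
GPU 0: NVIDIA A100 80GB PCIe (UUID: GPU-00000000-0000-0000-0000-000000000000)
  MIG 3g.40gb     Device  0: (UUID: MIG-9579f618-ddae-5958-9285-3207382f0b36)
"""

# Capture only UUIDs that start with "MIG-" inside "(UUID: ...)".
MIG_UUID = re.compile(r"\(UUID:\s*(MIG-[0-9a-fA-F-]+)\)")


def extract_mig_uuids(listing: str) -> List[str]:
    """Return every MIG UUID found in the listing, in order of appearance."""
    return MIG_UUID.findall(listing)


uuids = extract_mig_uuids(SAMPLE)
# A comma-separated string is the format expected by CUDA_VISIBLE_DEVICES.
print(",".join(uuids))
```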

In [10]:
from util.accelerator_utils import (
    AcceleratorHelper, 
    DIR_MODE_MAP, 
    DirectorySetting,
    TokenHelper as th
)
import os

dir_setting = DIR_MODE_MAP.get("kf_notebook")
gpu_helper = AcceleratorHelper()
UUIDs = gpu_helper.nvidia_device_uuids_filtered_by(is_mig=True, log_output=False)
gpu_helper.init_cuda_torch(UUIDs, dir_setting)

os.environ['XDG_CACHE_HOME']

'/home/jovyan/llm-models/core-kind/yinwang/models'

### PyTorch distributed with device UUID
* https://discuss.pytorch.org/t/world-size-and-rank-torch-distributed-init-process-group/57438

In [11]:
import os, time, sys
from platform import python_version

# os.environ["CUDA_LAUNCH_BLOCKING"]="1" # for debugging
# os.environ["TOKENIZERS_PARALLELISM"]="false"

print(os.environ["CUDA_VISIBLE_DEVICES"])
print(python_version())

MIG-9579f618-ddae-5958-9285-3207382f0b36
3.8.10


#### CUDA MIG memory notice
The following Python snippet shows the available MIG memory:
```python
import torch

print(torch.cuda.mem_get_info())
for e in torch.cuda.mem_get_info():
    print(e/1024**3)
```
The first element of the tuple is the free CUDA memory of the MIG instance in bytes; the second is the total.
If the free value drops to zero while no process is attached, a CUDA process has hung.
```console
(20748107776, 20937965568)
19.32318115234375
19.5
```
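
The same conversion can be wrapped in a small helper that also flags the "free memory near zero" condition. This is a sketch that works on the `(free, total)` tuple only, so it runs without a GPU; the function names and the 1% threshold are assumptions chosen for illustration.

```python
from typing import Tuple

GIB = 1024 ** 3


def mem_info_gib(mem_info: Tuple[int, int]) -> Tuple[float, float]:
    """Convert a (free_bytes, total_bytes) tuple to GiB."""
    free_bytes, total_bytes = mem_info
    return free_bytes / GIB, total_bytes / GIB


def looks_exhausted(mem_info: Tuple[int, int], min_free_fraction: float = 0.01) -> bool:
    """True when less than `min_free_fraction` of the memory is free."""
    free_bytes, total_bytes = mem_info
    return free_bytes < min_free_fraction * total_bytes


sample = (20748107776, 20937965568)  # values from the console output above
free_gib, total_gib = mem_info_gib(sample)
print(f"free {free_gib:.2f} GiB of {total_gib:.2f} GiB, exhausted: {looks_exhausted(sample)}")
```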

To terminate a hung CUDA process, log in to the GPU host:
```shell
nvidia-smi # find out the PID something like 830333
sudo kill -9 PID
```

In [12]:
model_map = {
    "llama7B-chat":     "meta-llama/Llama-2-7b-chat-hf",
    "llama13B-chat" :   "meta-llama/Llama-2-13b-chat-hf",
    "llama70B-chat" :   "meta-llama/Llama-2-70b-chat-hf",
    # "70B" : "meta-llama/Llama-2-70b-hf"
    "mistral7B-01":     "mistralai/Mistral-7B-v0.1",
    "mistral7B-inst02": "mistralai/Mistral-7B-Instruct-v0.2",
    "mistral8x7B-01":   "mistralai/Mixtral-8x7B-v0.1",
}

default_model_type = "mistral7B-01"
default_dir_mode = "mac_local"

In [13]:
import transformers
import torch

from transformers import AutoModelForCausalLM, AutoTokenizer
print(transformers.__version__)
print(torch.__version__)

4.35.2
2.1.2+cu118


In [14]:
# model_type = default_model_type
# model_type = "mistral7B-inst02"
model_type = "llama13B-chat"

In [15]:
"""
Load the token
"""
def gen_token_kwargs():
    # call method from token helper class
    if th.need_token(model_type):
        # kwargs = {"use_auth_token": get_token(dir_setting)}
        token_kwargs = {
            "token": th.get_token(dir_setting),
            # "truncation_side": "left",
            # "return_tensors": "pt",            
                        }
        print("huggingface token loaded")
    else:
        token_kwargs = {}
        print("huggingface token is NOT needed")
    return token_kwargs

token_kwargs = gen_token_kwargs()

huggingface token loaded


In [16]:
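# fall back to the default model when the key is unknown ("7B" is not a valid repository id)
model_name = model_map.get(model_type, model_map[default_model_type])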

print(model_name)

meta-llama/Llama-2-13b-chat-hf


## HuggingFace Pipeline doesn't seem to work with Bits and Bytes

In [17]:
from torch import cuda, bfloat16
device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'
device

'cuda:0'

In [18]:
# set quantization configuration to load large model with less GPU memory
# this requires the `bitsandbytes` library
#bnb_config = transformers.BitsAndBytesConfig(
#    load_in_4bit=True,
#    bnb_4bit_quant_type='nf4',
#    bnb_4bit_use_double_quant=True,
#    bnb_4bit_compute_dtype=bfloat16
#)

bnb_config = transformers.BitsAndBytesConfig(
    #load_in_8bit=True,
    # bnb_4bit_quant_type='nf4',
    # bnb_4bit_use_double_quant=True,
    # bnb_4bit_compute_dtype=bfloat16
)

model_config = transformers.AutoConfig.from_pretrained(
    model_name,
    **token_kwargs,
)

print(bnb_config)
print(model_config)

BitsAndBytesConfig {
  "bnb_4bit_compute_dtype": "float32",
  "bnb_4bit_quant_type": "fp4",
  "bnb_4bit_use_double_quant": false,
  "llm_int8_enable_fp32_cpu_offload": false,
  "llm_int8_has_fp16_weight": false,
  "llm_int8_skip_modules": null,
  "llm_int8_threshold": 6.0,
  "load_in_4bit": false,
  "load_in_8bit": false,
  "quant_method": "bitsandbytes"
}

LlamaConfig {
  "_name_or_path": "meta-llama/Llama-2-13b-chat-hf",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "hidden_act": "silu",
  "hidden_size": 5120,
  "initializer_range": 0.02,
  "intermediate_size": 13824,
  "max_position_embeddings": 4096,
  "model_type": "llama",
  "num_attention_heads": 40,
  "num_hidden_layers": 40,
  "num_key_value_heads": 40,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "rope_theta": 10000.0,
  "tie_word_embeddings": false,
  "torch_dtype": "float16",
  "transformers_version": "4.35.2",
  "use_cach

In [19]:
#model = transformers.AutoModelForCausalLM.from_pretrained(
#    model_name,
#    trust_remote_code=True,
#    config=model_config,
#    quantization_config=bnb_config,
#    device_map='auto',
#    **token_kwargs,
#)

In [20]:
from util.accelerator_utils import AcceleratorStatus

gpu_status = AcceleratorStatus.create_accelerator_status()
gpu_status.gpu_usage()

num_of_gpus: 1
--------------------
Device name      : NVIDIA A100 80GB PCIe MIG 3g.40gb 
Device idx       : 0 
No. of processors: 42
Physical  memory : 39.250000 GB
Reserved  memory : 0.000000 GB
Allocated memory : 0.000000 GB
Free      memory : 0.000000 GB
--------------------


In [21]:
# from bitsandbytes.cextension import CUDASetup
# import torch
# lib = CUDASetup.get_instance().lib
# lib.cadam32bit_g32

In [22]:
tokenizer = AutoTokenizer.from_pretrained(
    model_name, 
    device="cpu",
    trust_remote_code=True,
    **token_kwargs,
)

In [23]:
tokenizer

LlamaTokenizerFast(name_or_path='meta-llama/Llama-2-13b-chat-hf', vocab_size=32000, model_max_length=1000000000000000019884624838656, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>'}, clean_up_tokenization_spaces=False),  added_tokens_decoder={
	0: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	1: AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	2: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}

In [24]:
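%time
# Note: `%time` on its own line times an empty statement, so the timing printed
# below does not cover the pipeline load. Put `%%time` on the first line of the
# cell to time the whole cell.

# quantization_enabled = True
# bitsandbytes quantization does not work on the MPS backend
quantization_enabled = False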

if quantization_enabled:
    compression_kwargs = {
        "load_in_8bit": True,
        # "load_in_4bit": True,
    }
else:
    compression_kwargs = {
        "torch_dtype": torch.float16
    }

generator = transformers.pipeline(
    "text-generation",
    model=model_name,
    tokenizer=tokenizer, # optional
    # torch_dtype=torch.float16, #bfloat16 is not supported on MPS backend
    # torch_dtype=torch.float32,
    device_map="auto",
    # max_length=MAX_LENGTH,
    max_length=None, # no cap on the total (prompt + response) length
    max_new_tokens=100, # cap on newly generated tokens; counted in tokens, not characters
    use_fast = True,
    **token_kwargs,
    **compression_kwargs,
)

CPU times: user 1e+03 ns, sys: 1 µs, total: 2 µs
Wall time: 4.05 µs
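
A wall time of a few microseconds suggests that the magic above timed an empty statement rather than the pipeline load. As an alternative that does not depend on IPython magics, a block can be timed with a small context manager. This is a sketch; `wall_timer` is a hypothetical helper name, and `time.sleep` stands in for the slow call.

```python
import time
from contextlib import contextmanager


@contextmanager
def wall_timer(label: str, results: dict):
    """Store the wall-clock duration of the `with` block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


timings = {}
with wall_timer("load", timings):
    time.sleep(0.05)  # stand-in for the pipeline load

print(f"load: {timings['load']:.3f} s")
```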


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]



In [25]:
from util.accelerator_utils import AcceleratorStatus

gpu_status = AcceleratorStatus.create_accelerator_status()
gpu_status.gpu_usage()

num_of_gpus: 1
--------------------
Device name      : NVIDIA A100 80GB PCIe MIG 3g.40gb 
Device idx       : 0 
No. of processors: 42
Physical  memory : 39.250000 GB
Reserved  memory : 24.443359 GB
Allocated memory : 24.439268 GB
Free      memory : 0.004091 GB
--------------------
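
The roughly 24.4 GB allocated after loading is close to what the float16 weights alone require. As a back-of-the-envelope sketch, the parameter count can be derived from the `LlamaConfig` values printed earlier (`hidden_size`, `intermediate_size`, `num_hidden_layers`) and the tokenizer's `vocab_size`. The estimate assumes untied embeddings and as many key/value heads as attention heads, as in that config, and it ignores buffers, activations, the KV cache and any quantization overhead.

```python
GIB = 1024 ** 3


def llama_param_count(vocab_size: int, hidden_size: int,
                      intermediate_size: int, num_layers: int) -> int:
    """Count the weights of a Llama-style decoder without grouped-query attention."""
    embed = vocab_size * hidden_size            # input embeddings
    lm_head = vocab_size * hidden_size          # untied output projection
    attn = 4 * hidden_size * hidden_size        # q, k, v, o projections (no bias)
    mlp = 3 * hidden_size * intermediate_size   # gate, up, down projections
    norms = 2 * hidden_size                     # two RMSNorm weights per layer
    final_norm = hidden_size
    return embed + lm_head + num_layers * (attn + mlp + norms) + final_norm


params = llama_param_count(vocab_size=32000, hidden_size=5120,
                           intermediate_size=13824, num_layers=40)
print(f"parameters: {params:,}")
for name, bytes_per_param in [("float16", 2), ("int8", 1), ("4-bit", 0.5)]:
    print(f"{name:8s}: {params * bytes_per_param / GIB:6.2f} GiB")
```

The int8 and 4-bit rows are lower bounds for the weights only; they indicate why the quantized configurations are attractive on a 40 GB MIG slice.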


In [26]:
generator.model

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 5120)
    (layers): ModuleList(
      (0-39): 40 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear(in_features=5120, out_features=5120, bias=False)
          (k_proj): Linear(in_features=5120, out_features=5120, bias=False)
          (v_proj): Linear(in_features=5120, out_features=5120, bias=False)
          (o_proj): Linear(in_features=5120, out_features=5120, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=5120, out_features=13824, bias=False)
          (up_proj): Linear(in_features=5120, out_features=13824, bias=False)
          (down_proj): Linear(in_features=13824, out_features=5120, bias=False)
          (act_fn): SiLUActivation()
        )
        (input_layernorm): LlamaRMSNorm()
        (post_attention_layernorm): LlamaRMSNorm()
      )
    )
    (norm): LlamaRMSNorm()
  )
  (lm_

In [27]:
def chat_gen(
    generator: transformers.pipelines.text_generation.TextGenerationPipeline, 
    tokenizer: transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast,
    gpu_status: AcceleratorStatus
):    
    def local(input_prompts: list=[], temperature: float=0.1, max_new_tokens: int=200, verbose: bool=True) -> list:
        """
        do_sample, top_k, num_return_sequences and eos_token_id are generation
        settings passed to the TextGenerationPipeline.
        
        Reference:
        https://huggingface.co/docs/transformers/generation_strategies#customize-text-generation
        """
        start = time.time()
        sequences = generator(
            input_prompts,
            do_sample=True,
            top_k=10,
            num_return_sequences=1,
            # pad_token_id=tokenizer.eos_token_id, # for mistral
            eos_token_id=tokenizer.eos_token_id,
            # max_length=200,
            max_new_tokens= max_new_tokens, # 200 # max number of tokens to generate in the output
            temperature=temperature,
            repetition_penalty=1.1  # without this output begins repeating
        )
        # for seq in sequences:
        #     print(f"Result: \n{seq['generated_text']}")
        
        batch_result = []
        for prompt_result in sequences: # one entry per input prompt
            result = []
            for seq in prompt_result: # one entry per returned sequence
                result.append(f"Result: \n{seq['generated_text']}")
            batch_result.append(result)
            
        end = time.time()
        duration = end - start
        
        if verbose:
            for prompt_result in batch_result:
                for result in prompt_result:
                    print("promt-response")
                    print(result)
            print("-"*20)
            print(f"walltime: {duration} in secs.")
            gpu_status.gpu_usage()
            
        return batch_result   
    return local
    
chat = chat_gen(generator, tokenizer, gpu_status)
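
`chat` samples with `top_k` and `temperature`. To see how these two settings shape the next-token distribution, the sketch below applies both to a toy list of logits using plain Python. It illustrates the general technique and is not the Transformers implementation; the function name and the logits are made up for the example.

```python
import math
from typing import List


def top_k_temperature_probs(logits: List[float], top_k: int, temperature: float) -> List[float]:
    """Keep the top_k logits, divide them by temperature, and apply softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    k = min(top_k, len(logits))
    threshold = sorted(logits, reverse=True)[k - 1]
    # Logits below the k-th largest are masked out (ties at the threshold are kept).
    scaled = [x / temperature if x >= threshold else -math.inf for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]  # subtract the max for stability
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [2.0, 1.0, 0.5, -1.0]
warm = top_k_temperature_probs(toy_logits, top_k=2, temperature=1.0)
cold = top_k_temperature_probs(toy_logits, top_k=2, temperature=0.001)
print([round(p, 3) for p in warm])
print([round(p, 3) for p in cold])
```

With a very small temperature nearly all probability mass sits on the largest logit, so sampling behaves almost like greedy decoding.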

In [28]:
system_message="""[INST]<<SYS>>\nYou are a helpful, respectful and honest assistant.
Always answer as helpfully as possible using the context text provided.
Your answers should only answer the question once and not have any text after the answer is done.\n\n
If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct.
If you don't know the answer to a question, please don't share false information.\n<</SYS>>\n\n
"""

# testing prompt
inputs=['Q: Roger has 3 tennis balls. He buys 2 more cans of tennis balls. Each can has 4 tennis balls. How many tennis balls does he have now?\nA: Roger started with 3 balls. 2 cans of 4 tennis balls each is 8 tennis balls. 3 + 8 = 11. The answer is 11.\nQ: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?\n']

def get_inputs(idx):   
    return f"{system_message}{inputs[idx]}"

print(get_inputs(0))

[INST]<<SYS>>
You are a helpful, respectful and honest assistant.
Always answer as helpfully as possible using the context text provided.
Your answers should only answer the question once and not have any text after the answer is done.


If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct.
If you don't know the answer to a question, please don't share false information.
<</SYS>>


Q: Roger has 3 tennis balls. He buys 2 more cans of tennis balls. Each can has 4 tennis balls. How many tennis balls does he have now?
A: Roger started with 3 balls. 2 cans of 4 tennis balls each is 8 tennis balls. 3 + 8 = 11. The answer is 11.
Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?
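
The prompt printed above opens an `[INST]` block but never closes it. Llama 2 chat checkpoints are commonly documented as expecting the system text wrapped in `<<SYS>>` markers and the user turn closed with `[/INST]`. A minimal sketch of that layout follows; `build_llama2_prompt` is a hypothetical helper, and the BOS token is left to the tokenizer.

```python
def build_llama2_prompt(system: str, user: str) -> str:
    """Wrap a system message and one user turn in Llama 2 chat markers."""
    return f"[INST] <<SYS>>\n{system.strip()}\n<</SYS>>\n\n{user.strip()} [/INST]"


prompt = build_llama2_prompt(
    "You are a helpful, respectful and honest assistant.",
    "The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, "
    "how many apples do they have?",
)
print(prompt)
```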



In [29]:
verbose = True
batch_answers = chat(inputs, temperature=0.001, max_new_tokens = 80, verbose=verbose)
# batch_answers = chat(inputs, temperature=0.1, max_new_tokens = 80, verbose=verbose)
if not verbose:
    prompt_0_results = batch_answers[0]
    print(prompt_0_results[0])

promt-response
Result: 
Q: Roger has 3 tennis balls. He buys 2 more cans of tennis balls. Each can has 4 tennis balls. How many tennis balls does he have now?
A: Roger started with 3 balls. 2 cans of 4 tennis balls each is 8 tennis balls. 3 + 8 = 11. The answer is 11.
Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?
A: They started with 23 apples. 20 were used for lunch, leaving 3. Then they bought 6 more, so they have 3 + 6 = 9 apples.
--------------------
walltime: 2.6894333362579346 in secs.
num_of_gpus: 1
--------------------
Device name      : NVIDIA A100 80GB PCIe MIG 3g.40gb 
Device idx       : 0 
No. of processors: 42
Physical  memory : 39.250000 GB
Reserved  memory : 24.734375 GB
Allocated memory : 24.447203 GB
Free      memory : 0.287172 GB
--------------------
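
As the output shows, `generated_text` repeats the prompt before the new answer. One way to keep only the continuation is to strip the prompt prefix afterwards, sketched below with a hypothetical helper `strip_prompt`. The text-generation pipeline also accepts `return_full_text=False`, which should make this post-processing unnecessary.

```python
def strip_prompt(prompt: str, generated_text: str) -> str:
    """Return only the text generated after the prompt."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].strip()
    # Fall back to the full text when the prompt is not echoed verbatim.
    return generated_text.strip()


example_prompt = "Q: The cafeteria had 23 apples.\n"
example_output = example_prompt + "A: They have 9 apples."
print(strip_prompt(example_prompt, example_output))
```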


In [30]:
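DEBUG = False  # set to True to print the answers

def print_answer(answer: list) -> None:
    """Print the first answer and its last line when DEBUG is enabled."""
    if DEBUG:
        print("-"*10)
        print(answer[0])
        print("-"*10)
        print(answer[0].split("\n")[-1])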

#### Free pytorch gpu memory
* https://discuss.pytorch.org/t/how-to-delete-a-tensor-in-gpu-to-free-up-memory/48879/5
* https://discuss.huggingface.co/t/clear-gpu-memory-of-transformers-pipeline/18310
* https://saturncloud.io/blog/how-to-free-up-all-memory-pytorch-is-taking-from-gpu-memory/
* https://discuss.pytorch.org/t/how-to-free-the-pytorch-transformers-model-from-gpu-memory/132968
* https://stackoverflow.com/questions/70508960/how-to-free-gpu-memory-in-pytorch

#### Huggingface pipelines
* https://huggingface.co/docs/transformers/main_classes/pipelines
* clean cuda torch gpu: https://stackoverflow.com/questions/55322434/how-to-clear-cuda-memory-in-pytorch

In [33]:
import gc
def clear_cuda_memory(
    generator: transformers.pipelines.text_generation.TextGenerationPipeline, 
    tokenizer: transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast,
    gpu_status: AcceleratorStatus
):
    """Clear the CUDA memory held by the pipeline."""
    if tokenizer is not None:
        # the tokenizer lives on the CPU, so there is nothing to move
        # tokenizer.model.cpu()
        del tokenizer
    if generator is not None:
        # move the model to the CPU before deleting it
        generator.model.cpu()
        del generator
    gc.collect()
    torch.cuda.empty_cache()
    # report the GPU usage
    gpu_status.gpu_usage()
    
clear_cuda_memory(generator, tokenizer, gpu_status)

num_of_gpus: 1
--------------------
Device name      : NVIDIA A100 80GB PCIe MIG 3g.40gb 
Device idx       : 0 
No. of processors: 42
Physical  memory : 39.250000 GB
Reserved  memory : 0.019531 GB
Allocated memory : 0.007935 GB
Free      memory : 0.011597 GB
--------------------
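
One subtlety: `del generator` inside `clear_cuda_memory` removes only the function's local name, while the notebook's global `generator` still refers to the same object. The drop in GPU usage above most likely comes from `generator.model.cpu()` moving the weights off the device. The sketch below demonstrates the reference behaviour with a stand-in class and `weakref`; `FakeModel` and `drop_local` are made-up names, and the result relies on CPython's reference counting.

```python
import gc
import weakref


class FakeModel:
    """Stand-in for a large model object."""


def drop_local(obj) -> None:
    del obj        # removes only the local name inside this function
    gc.collect()


model = FakeModel()
probe = weakref.ref(model)  # lets us check whether the object still exists

drop_local(model)
alive_after_local_del = probe() is not None   # the outer name still holds the object

del model
gc.collect()
alive_after_global_del = probe() is not None  # no references are left

print(alive_after_local_del, alive_after_global_del)
```

To release the host memory as well, the notebook-level names (`generator`, `chat`, and anything else that captured the pipeline) would need to be deleted too.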


In [None]:
#import gc
#def free_memory_gen(
#    generator: transformers.pipelines.text_generation.TextGenerationPipeline, 
#    tokenizer: transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast):
#    """
#    """
#    def local():
#        l_generator = generator
#        l_tokenizer = tokenizer
#        #l_generator.cpu()
#        #l_tokenizer.cpu()
#        # model.cpu()
#        
#        del l_tokenizer, l_generator
#        gc.collect()
#        torch.cuda.empty_cache()
#        #for device_idx in range(torch.cuda.device_count()):
#        #    print(device_idx)
#        #    device = torch.device(f"cuda:{device_idx}")
#        #    device.reset()
#    return local
#
# free_memory = free_memory_gen(generator, tokenizer)    