## Introduction

@author: Yingding Wang\
@created: 24.11.2023\
@updated: 27.11.2023\
@version: 1

This notebook contains examples that use transformers, PyTorch, Llama 2, and LangChain to perform entity extraction with engineered prompts.



In [1]:
import sys

In [2]:
# !{sys.executable} -m pip install --upgrade --user jupyterlab==3.4.3 # for the KF 1.7.0 release
# !{sys.executable} -m pip install --upgrade --user jupyterlab==3.6.6 # custom upgrade

Collecting jupyterlab==3.6.6
  Downloading jupyterlab-3.6.6-py3-none-any.whl (8.9 MB)
Collecting tomli
  Downloading tomli-2.0.1-py3-none-any.whl (12 kB)
Collecting jupyter-server-ydoc~=0.8.0
  Downloading jupyter_server_ydoc-0.8.0-py3-none-any.whl (11 kB)
Collecting jupyter-ydoc~=0.2.4
  Downloading jupyter_ydoc-0.2.5-py3-none-any.whl (6.2 kB)
Collecting ypy-websocket<0.9.0,>=0.8.2
  Downloading ypy_websocket-0.8.4-py3-none-any.whl (10 kB)
Collecting jupyter-server-fileid<1,>=0.6.0
  Downloading jupyter_server_fileid-0.9.0-py3-none-any.whl (15 kB)
Collecting y-py<0.7.0,>=0.6.0
  Downloading y_py-0.6.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.7 MB)
Collecting jupyter-events>=0.5.0
  Downloading ju

In [3]:
!{sys.executable} -m pip show jupyterlab

Name: jupyterlab
Version: 3.6.6
Summary: JupyterLab computational environment
Home-page: https://jupyter.org
Author: Jupyter Development Team
Author-email: jupyter@googlegroups.com
License: 
Location: /home/jovyan/.local/lib/python3.8/site-packages
Requires: ipython, jinja2, jupyter-core, jupyter-server, jupyter-server-ydoc, jupyter-ydoc, jupyterlab-server, nbclassic, notebook, packaging, tomli, tornado
Required-by: 


In [2]:
#!{sys.executable} -m pip install --upgrade pip
#!{sys.executable} -m pip install --user --upgrade kfp==1.8.22

In [3]:
#!cat ./requirements.txt

## Use the CUDA 11.8 build of torch 2.1.0

In [4]:
# !{sys.executable} -m pip install --user --upgrade -r ./requirements.txt --extra-index-url https://download.pytorch.org/whl/cu118

In [5]:
# cuda 11.7 version
# !{sys.executable} -m pip install --user --upgrade -r ./requirements.txt --extra-index-url https://download.pytorch.org/whl/cu117

In [6]:
# !{sys.executable} -m pip list

## Additional technical information
#### Useful installation for KF notebook 1.7.0 cu111 drivers

```shell
#!{sys.executable} -m pip install --user --upgrade transformers==4.31.0
#!{sys.executable} -m pip install --user --upgrade torch==1.10.2+cu111 fastai==2.7.12 fastcore==1.5.29 fastdownload==0.0.7 torchvision==0.11.3+cu111 --extra-index-url https://download.pytorch.org/whl/cu111
#!{sys.executable} -m pip install --user --upgrade accelerate==0.20.3
```
cuda118
```shell
#!{sys.executable} -m pip install --user --upgrade torch==2.0.0+cu118 --extra-index-url https://download.pytorch.org/whl/cu118
```
`xformers==0.0.21` needs `torch==2.0.1`:

```shell
#!{sys.executable} -m pip install --user --upgrade xformers==0.0.21 torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2
```

show js loading with ipywidgets
```shell
#!{sys.executable} -m pip install --user --upgrade ipywidgets==8.1.0 comm==0.1.4 jupyterlab-widgets==3.0.8 widgetsnbextension==4.0.8
```

uninstall
```shell
#!{sys.executable} -m pip uninstall accelerate transformers xformers torch -y
```

## (optional) restart kernel

### (optional) Set huggingface cli in terminal

```shell
PATH=${PATH}:/home/jupyter/.local/bin
```

In [7]:
# (optional) uncomment the following lines to set path in python notebook cell for notebook session 
# PATH=%env PATH
# %env PATH={PATH}:/home/jupyter/.local/bin

#### Basics of GPU

Multi GPU inference: https://github.com/tloen/alpaca-lora/issues/445

Show accelerator device IDs:

```shell
nvidia-smi -L
```

Nvidia usage
```shell
nvidia-smi -q -g 0 -d UTILIZATION -l
```

Python library `gpustat` (run from a shell):
```shell
gpustat -cp
```

* https://stackoverflow.com/questions/8223811/a-top-like-utility-for-monitoring-cuda-activity-on-a-gpu

Check GPU info in PyTorch
* https://stackoverflow.com/questions/48152674/how-do-i-check-if-pytorch-is-using-the-gpu
* CUDA memory management https://pytorch.org/docs/stable/notes/cuda.html#cuda-memory-management

#### Extract the GPU Accelerator MIG UUIDs

* Extract with re.search and group: https://note.nkmk.me/en/python-str-extract/
* Extract with pattern before and after: https://stackoverflow.com/questions/4666973/how-to-extract-the-substring-between-two-markers
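
The `re`-based extraction can be sketched as follows. This is a minimal sketch, assuming `nvidia-smi -L` lists MIG devices in the form `MIG <profile> Device <n>: (UUID: MIG-...)`. The sample listing is illustrative (the GPU UUID is a made-up placeholder), and `extract_mig_uuids` is a hypothetical helper, not the implementation used by `AcceleratorHelper`.

```python
import re

# Illustrative `nvidia-smi -L` output; the GPU UUID is a made-up placeholder.
SAMPLE_OUTPUT = """\
GPU 0: NVIDIA A100 80GB PCIe (UUID: GPU-00000000-0000-0000-0000-000000000000)
  MIG 2g.20gb     Device  0: (UUID: MIG-0efc9f06-6dca-5886-98af-0273ca7fde51)
"""

# Capture only UUIDs that start with "MIG-" between "(UUID: " and ")".
MIG_UUID_PATTERN = re.compile(r"\(UUID:\s*(MIG-[0-9a-fA-F-]+)\)")


def extract_mig_uuids(listing: str) -> list:
    """Return every MIG UUID found in the device listing (hypothetical helper)."""
    return MIG_UUID_PATTERN.findall(listing)


mig_uuids = extract_mig_uuids(SAMPLE_OUTPUT)
print(mig_uuids)
```

The resulting list can be joined with commas and assigned to `CUDA_VISIBLE_DEVICES` before torch initializes CUDA.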

#### PyTorch distributed with device UUID
* https://discuss.pytorch.org/t/world-size-and-rank-torch-distributed-init-process-group/57438

#### CUDA MIG memory notice
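The following Python snippet shows the free and total MIG memory:
```python
import torch
# mem_get_info() returns (free_bytes, total_bytes) for the current device
print(torch.cuda.mem_get_info())
for e in torch.cuda.mem_get_info():
    print(e / 1024**3)  # convert bytes to GiB
```
The first element of the tuple is the free MIG CUDA memory in bytes and the second is the total. If the free value drops to zero while no process is attached, a CUDA process has hung.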
```console
(20748107776, 20937965568)
19.32318115234375
19.5
```
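
The byte-to-GiB arithmetic behind these numbers can be reproduced without a GPU. The sketch below is only an illustration: `describe_mem_info` is a hypothetical helper name, and the input tuple is copied from the console output above.

```python
GIB = 1024 ** 3  # number of bytes in one GiB


def describe_mem_info(mem_info: tuple) -> dict:
    """Convert a (free_bytes, total_bytes) tuple into GiB values (hypothetical helper)."""
    free_bytes, total_bytes = mem_info
    return {
        "free_gib": free_bytes / GIB,
        "total_gib": total_bytes / GIB,
        "used_gib": (total_bytes - free_bytes) / GIB,
    }


# Values copied from the console output above.
summary = describe_mem_info((20748107776, 20937965568))
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```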

To terminate a hung CUDA process, log in to the GPU host:
```shell
nvidia-smi # find out the PID something like 830333
sudo kill -9 PID
```

In [8]:
from platform import python_version

print(python_version())

3.8.10


In [9]:
import os, time, sys
from util.accelerator_utils import AcceleratorStatus, AcceleratorHelper

# data volume mounted in kubeflow notebook
MODEL_ROOT="/home/jovyan/llm-models"
MODEL_SUB_PATH = "core-kind/yinwang"
# the cache dir for huggingface models
MODEL_CACHE_DIR = f"{MODEL_ROOT}/{MODEL_SUB_PATH}"

gpu_status = AcceleratorStatus()
gpu_status.gpu_usage()

num_of_gpus: 1
--------------------
Device name      : NVIDIA A100 80GB PCIe MIG 2g.20gb 
Device idx       : 0 
No. of processors: 28
Physical  memory : 19.500000 GB
Reserved  memory : 0.000000 GB
Allocated memory : 0.000000 GB
Free      memory : 0.000000 GB
--------------------


In [10]:
gpu_helper = AcceleratorHelper()
# dynamically fetch attached accelerator devices
UUIDs = gpu_helper.nvidia_device_uuids_filtered_by(is_mig=True, log_output=False)

In [11]:
# init all the cuda torch env and model download cache directory
gpu_helper.init_cuda_torch(UUIDs, MODEL_CACHE_DIR)

print(os.environ["CUDA_VISIBLE_DEVICES"])
print(os.environ["XDG_CACHE_HOME"])

MIG-0efc9f06-6dca-5886-98af-0273ca7fde51
/home/jovyan/llm-models/core-kind/yinwang/models


In [12]:
model_map = {
        "7B": "meta-llama/Llama-2-7b-chat-hf",
        "13B" : "meta-llama/Llama-2-13b-chat-hf",
        "70B" : "meta-llama/Llama-2-70b-chat-hf",
        # "70B" : "meta-llama/Llama-2-70b-hf" 
}

import transformers
import torch
from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer
print(transformers.__version__)
print(torch.__version__)

4.35.2
2.1.0+cu118


In [13]:
"""
Load the huggingface hub token
"""
token_sub_path = ".cache/huggingface/token"
token_file_path = f"{MODEL_CACHE_DIR}/{token_sub_path}"
# strip the leading and trailing EOL characters
# https://stackoverflow.com/questions/275018/how-can-i-remove-a-trailing-newline/275025#275025
with open(token_file_path, "r") as file:
    # the token file may end with a newline; remove it
    # token = file.read().replace('\n', '')
    token = file.read().strip()

# print the raw string to see if there is new line in the token
# print(r'{}'.format(token))

In [14]:
# model_type = "13B"
model_type = "7B"
model_name = model_map.get(model_type, model_map["7B"])  # fall back to the 7B model name, not the key "7B"

print(model_name)

meta-llama/Llama-2-7b-chat-hf


In [15]:
tokenizer = AutoTokenizer.from_pretrained(
    model_name, 
    token=token,  # transformers>=4.32.1
    device_map="auto",  # note: tokenizers run on the CPU, so this argument is not needed here
    # use_auth_token=token,  # transformers==4.31.0
)

In [16]:
# %time
# not loading to the GPU with accelerator
# model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", token=token)

In [17]:
# # will call the AutoModelForCausalLM automatically
# generator = pipeline(
#     "text-generation",
#     model=model_name,
#     torch_dtype=torch.float16,
#     device_map="auto",
#     token=token, #transformer>=4.32.1
#     #use_auth_token=token, #transformer==4.31.0
# )

In [18]:
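%time
# note: a bare `%time` line times an empty statement (hence the microsecond timings in the output);
# put `%%time` on the first line of the cell to time the whole cell
# transformers>=4.32.1 uses the "token" parameter
# transformers 4.30.x uses the "use_auth_token" parameter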
# with torch.no_grad():
generator = pipeline(
    "text-generation",
    # model=model,
    model=model_name,
    tokenizer=tokenizer, # optional
    torch_dtype=torch.float16,
    device_map="auto",
    token=token, #transformer>=4.32.1
    #use_auth_token=token, #transformer==4.31.0
)

CPU times: user 2 µs, sys: 4 µs, total: 6 µs
Wall time: 11.2 µs





In [19]:
gpu_status.gpu_usage()

num_of_gpus: 1
--------------------
Device name      : NVIDIA A100 80GB PCIe MIG 2g.20gb 
Device idx       : 0 
No. of processors: 28
Physical  memory : 19.500000 GB
Reserved  memory : 12.615234 GB
Allocated memory : 12.613792 GB
Free      memory : 0.001442 GB
--------------------


## Passing temperature to the generator for each prompt

https://discuss.huggingface.co/t/how-to-set-generation-parameters-for-transformers-pipeline/48837

Llama 2 chat agent:
https://github.com/pinecone-io/examples/blob/master/learn/generation/llm-field-guide/llama-2/llama-2-70b-chat-agent.ipynb
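
As background on what the `temperature` parameter does: during sampling, the logits are divided by the temperature before the softmax, so a low value such as 0.1 concentrates almost all probability on the top token. The sketch below illustrates that general mechanism in plain Python. It is not the transformers implementation, and the logits are made-up numbers.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax over logits / temperature (illustrative, not library code)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    largest = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # made-up example logits
for t in (1.0, 0.1):
    probs = softmax_with_temperature(logits, t)
    print(f"temperature={t}: {[round(p, 3) for p in probs]}")
```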

In [20]:
def chat_gen(
    generator: transformers.pipelines.text_generation.TextGenerationPipeline, 
    tokenizer: transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast,
    gpu_status: AcceleratorStatus
):    
    def local(input_prompts: list = None, temperature: float = 0.1, max_new_tokens: int = 200, verbose: bool = True) -> list:
        input_prompts = input_prompts or []  # avoid a shared mutable default argument
        start = time.time()
        sequences = generator(
            input_prompts,
            do_sample=True,
            top_k=10,
            num_return_sequences=1,
            eos_token_id=tokenizer.eos_token_id,
            # max_length=200,
            max_new_tokens= max_new_tokens, # 200 # max number of tokens to generate in the output
            temperature=temperature,
            repetition_penalty=1.1  # without this output begins repeating
        )
        # for seq in sequences:
        #     print(f"Result: \n{seq['generated_text']}")
        
        batch_result = []
        for prompt_result in sequences:  # one entry per input prompt
            result = []
            for seq in prompt_result:  # one entry per returned sequence
                result.append(f"Result: \n{seq['generated_text']}")
            batch_result.append(result)
            
        end = time.time()
        duration = end - start
        
        if verbose:
            for prompt_result in batch_result:
                for result in prompt_result:
                    print("promt-response")
                    print(result)
            print("-"*20)
            print(f"walltime: {duration} in secs.")
            gpu_status.gpu_usage()
            
        return batch_result   
    return local
    
chat = chat_gen(generator, tokenizer, gpu_status)
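
By default, `generated_text` from a text-generation pipeline contains the prompt followed by the continuation. A minimal sketch of separating the two is shown here, using a stub in place of the real pipeline so that it runs without a GPU. `strip_prompt` and `fake_generator` are hypothetical names introduced only for this illustration.

```python
def strip_prompt(prompt: str, generated_text: str) -> str:
    """Return only the continuation when the output starts with the prompt."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].strip()
    return generated_text.strip()


def fake_generator(prompts):
    """Hypothetical stand-in mimicking the nested list returned for a list of prompts."""
    return [[{"generated_text": p + "A: The answer is 9."}] for p in prompts]


prompts = ["Q: 23 - 20 + 6 = ?\n"]
sequences = fake_generator(prompts)

# Pair each prompt with its results and keep only the newly generated text.
continuations = [
    [strip_prompt(prompt, seq["generated_text"]) for seq in prompt_result]
    for prompt, prompt_result in zip(prompts, sequences)
]
print(continuations)
```

The pipeline call also accepts `return_full_text=False`, which may make the manual stripping unnecessary.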

In [21]:
# set DEBUG to false to remove all the llm answer outputs
# DEBUG=True
DEBUG=False

In [22]:
# def print_answer(answer: list)-> None:
#     if DEBUG:
#         print("-"*10)
#         print(answer[0])
#         print("-"*10)
#         print(answer[0].split("\n")[-1])   

#### Free pytorch gpu memory
* https://discuss.pytorch.org/t/how-to-delete-a-tensor-in-gpu-to-free-up-memory/48879/5
* https://discuss.huggingface.co/t/clear-gpu-memory-of-transformers-pipeline/18310
* https://saturncloud.io/blog/how-to-free-up-all-memory-pytorch-is-taking-from-gpu-memory/
* https://discuss.pytorch.org/t/how-to-free-the-pytorch-transformers-model-from-gpu-memory/132968
* https://stackoverflow.com/questions/70508960/how-to-free-gpu-memory-in-pytorch

#### Huggingface pipelines
* https://huggingface.co/docs/transformers/main_classes/pipelines
* clean cuda torch gpu: https://stackoverflow.com/questions/55322434/how-to-clear-cuda-memory-in-pytorch

In [23]:
# import gc
# def free_memory_gen(
#     generator: transformers.pipelines.text_generation.TextGenerationPipeline, 
#     tokenizer: transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast):
#     """
#     """
#     def local():
#         l_generator = generator
#         l_tokenizer = tokenizer
#         #l_generator.cpu()
#         #l_tokenizer.cpu()
#         # model.cpu()
        
#         del l_tokenizer, l_generator
#         gc.collect()
#         torch.cuda.empty_cache()
#         #for device_idx in range(torch.cuda.device_count()):
#         #    print(device_idx)
#         #    device = torch.device(f"cuda:{device_idx}")
#         #    device.reset()
#     return local    

# free_memory = free_memory_gen(generator, tokenizer)    
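
One likely reason the commented-out helper above does not release memory is that `del` inside the closure removes only the local aliases, while the notebook globals (`generator`, `llm`, `chat`) still hold the pipeline. The sketch below demonstrates the reference behaviour with a dummy object and `weakref`. `DummyModel` and `drop_local_alias` are hypothetical names, and no GPU is involved.

```python
import gc
import weakref


class DummyModel:
    """Stand-in for a large model object; purely illustrative."""


model = DummyModel()
model_ref = weakref.ref(model)  # lets us observe whether the object is still alive


def drop_local_alias(obj):
    alias = obj
    del alias  # removes only the local name, not the object
    gc.collect()


drop_local_alias(model)
alive_after_local_del = model_ref() is not None

del model  # remove the last strong reference
gc.collect()
alive_after_global_del = model_ref() is not None

print(alive_after_local_del, alive_after_global_del)
```

For the real pipeline, every strong reference would need to be deleted before `gc.collect()` and `torch.cuda.empty_cache()` can return memory to the device.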

In [24]:
# chain-of-thought prompting

# testing prompt
inputs=['Q: Roger has 3 tennis balls. He buys 2 more cans of tennis balls. Each can has 4 tennis balls. How many tennis balls does he have now?\nA: Roger started with 3 balls. 2 cans of 4 tennis balls each is 8 tennis balls. 3 + 8 = 11. The answer is 11.\nQ: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?\n']
print(inputs[0])

Q: Roger has 3 tennis balls. He buys 2 more cans of tennis balls. Each can has 4 tennis balls. How many tennis balls does he have now?
A: Roger started with 3 balls. 2 cans of 4 tennis balls each is 8 tennis balls. 3 + 8 = 11. The answer is 11.
Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?



In [25]:
verbose = True
batch_answers = chat(inputs, temperature=0.1, max_new_tokens = 80, verbose=verbose)
if not verbose:
    prompt_0_results = batch_answers[0]
    print(prompt_0_results[0])
    
# note: the expected answer is 9    

promt-response
Result: 
Q: Roger has 3 tennis balls. He buys 2 more cans of tennis balls. Each can has 4 tennis balls. How many tennis balls does he have now?
A: Roger started with 3 balls. 2 cans of 4 tennis balls each is 8 tennis balls. 3 + 8 = 11. The answer is 11.
Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?
A: First, the cafeteria had 23 apples. Then, they used 20 to make lunch, leaving 3 apples. Finally, they bought 6 more, so they have 3 + 6 = 9 apples. The answer is 9.
--------------------
walltime: 2.7993807792663574 in secs.
num_of_gpus: 1
--------------------
Device name      : NVIDIA A100 80GB PCIe MIG 2g.20gb 
Device idx       : 0 
No. of processors: 28
Physical  memory : 19.500000 GB
Reserved  memory : 12.955078 GB
Allocated memory : 12.621727 GB
Free      memory : 0.333351 GB
--------------------
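
To compare the response against the expected answer programmatically, the final "The answer is N" statement can be pulled out with a regular expression. This is a minimal sketch: `extract_final_answer` is a hypothetical helper, and it takes the last match because the prompt itself already contains the exemplar answer 11.

```python
import re
from typing import Optional

# Matches the number that follows "The answer is".
ANSWER_PATTERN = re.compile(r"The answer is\s+(-?\d+(?:\.\d+)?)")


def extract_final_answer(generated_text: str) -> Optional[str]:
    """Return the last 'The answer is N' value, or None if there is none."""
    matches = ANSWER_PATTERN.findall(generated_text)
    return matches[-1] if matches else None


# Shortened from the response shown above.
sample = (
    "A: Roger started with 3 balls. 3 + 8 = 11. The answer is 11.\n"
    "A: First, the cafeteria had 23 apples. 3 + 6 = 9 apples. The answer is 9."
)
final_answer = extract_final_answer(sample)
print(final_answer)
```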


## Huggingface with Local LLM

https://python.langchain.com/docs/integrations/llms/huggingface_pipelines

In [26]:
#!{sys.executable} -m pip install --user --upgrade langchain==0.0.340
#!{sys.executable} -m pip install --user --upgrade langchain==0.0.313 

# HuggingFacePipeline broken above version 0.0.313
# HuggingFacePipeline works in version 0.0.312
# !{sys.executable} -m pip install --user --upgrade langchain==0.0.312

In [27]:
import langchain
from langchain.llms.huggingface_pipeline import HuggingFacePipeline

print(langchain.__version__)

0.0.312


### Init a HuggingFacePipeline with pipeline_kwargs

https://github.com/langchain-ai/langchain/issues/8280#issuecomment-1652085694

In [28]:
# from langchain.llms import HuggingFacePipeline
# from transformers import AutoModelForCausalLM, AutoTokenizer

# model_id  = "TheBloke/wizardLM-7B-HF"
# tokenizer = AutoTokenizer.from_pretrained(model_id)
# model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

# hf = HuggingFacePipeline.from_model_id(
#     model_id=model_id,
#     task="text-generation",
#     model_kwargs={"trust_remote_code": True},
#     pipeline_kwargs={
#         "model": model,
#         "tokenizer": tokenizer,
#         "device_map": "auto",
#         "max_new_tokens": 1200,
#         "temperature": 0.3,
#         "top_p": 0.95,
#         "repetition_penalty": 1.15,
#     },
# )
# print(hf)

In [29]:
llm = HuggingFacePipeline(
    pipeline=generator 
)
print(llm.pipeline.model)

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 4096)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (v_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=4096, out_features=11008, bias=False)
          (up_proj): Linear(in_features=4096, out_features=11008, bias=False)
          (down_proj): Linear(in_features=11008, out_features=4096, bias=False)
          (act_fn): SiLUActivation()
        )
        (input_layernorm): LlamaRMSNorm()
        (post_attention_layernorm): LlamaRMSNorm()
      )
    )
    (norm): LlamaRMSNorm()
  )
  (lm_

In [30]:
# there is a bug: HuggingFacePipeline does not receive the parameters directly
# https://github.com/langchain-ai/langchain/issues/8280

# this must be set for the generator (HuggingFacePipeline) to work
llm.model_id = model_name
pipeline_kwargs_config = {
    # "do_sample": True,  # also causes trouble with langchain
    # "top_k": 10,  # this parameter causes trouble with langchain
    # "num_return_sequences": 1,  # optional
    # "eos_token_id": tokenizer.eos_token_id,  # also causes trouble
    "device_map": "auto",
    "max_length": 200,
    "max_new_tokens": 80,  # does the model ignore this?
    "temperature": 0.1,
    # "top_p": 0.95,  # nucleus sampling: sample only from the smallest token set whose cumulative probability reaches top_p
    "repetition_penalty": 1.15,
}
# llm.model_kwargs = {"trust_remote_code": True}
llm.pipeline_kwargs = pipeline_kwargs_config

In [31]:
gpu_status.gpu_usage()

num_of_gpus: 1
--------------------
Device name      : NVIDIA A100 80GB PCIe MIG 2g.20gb 
Device idx       : 0 
No. of processors: 28
Physical  memory : 19.500000 GB
Reserved  memory : 12.955078 GB
Allocated memory : 12.621727 GB
Free      memory : 0.333351 GB
--------------------


In [32]:
print(llm.pipeline.model.name_or_path)
print(llm.model_id)
print(llm.model_kwargs)
print(llm.pipeline_kwargs)

meta-llama/Llama-2-7b-chat-hf
meta-llama/Llama-2-7b-chat-hf
None
{'device_map': 'auto', 'max_length': 200, 'max_new_tokens': 80, 'temperature': 0.1, 'repetition_penalty': 1.15}


## Sequential Doc Chain

https://github.com/langchain-ai/langchain/discussions/8383


In [33]:
llm

HuggingFacePipeline(pipeline=<transformers.pipelines.text_generation.TextGenerationPipeline object at 0x7febedc41e20>, model_id='meta-llama/Llama-2-7b-chat-hf', pipeline_kwargs={'device_map': 'auto', 'max_length': 200, 'max_new_tokens': 80, 'temperature': 0.1, 'repetition_penalty': 1.15})

In [34]:
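%time
# note: a bare `%time` line times nothing useful; use `%%time` as the first line of the cell to time the whole cell
for _ in range(1):  # is there a CPU bottleneck here? For some reason, when called twice, the model loses the context and hallucinates.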
    print(llm(inputs[0]))
    gpu_status.gpu_usage()

CPU times: user 6 µs, sys: 1e+03 ns, total: 7 µs
Wall time: 17.9 µs
A: The cafeteria started with 23 apples. They used 20 to make lunch, so now they have 23 - 20 = 3 apples left. Then, they bought 6 more apples, so now they have 3 + 6 = 9 apples. The answer is 9.
num_of_gpus: 1
--------------------
Device name      : NVIDIA A100 80GB PCIe MIG 2g.20gb 
Device idx       : 0 
No. of processors: 28
Physical  memory : 19.500000 GB
Reserved  memory : 12.974609 GB
Allocated memory : 12.621727 GB
Free      memory : 0.352882 GB
--------------------


In [35]:
#gpu_status.gpu_usage()

## Loading documents from s3 bucket source

In [36]:
# TODO

In [37]:
context = ""

#### zero shot prompt

In [38]:
#name
input=f"Can you tell me the name of the patient from the following doctor's letter?\nLetter:\n{context}\nAnswer: "

In [39]:
#len(input)
# 6810

In [40]:
# answer=chat(input, print_mode=False)
# print_answer(answer)

In [41]:
#age
input=f"Can you tell me the age of the patient from the following doctor's letter?\nLetter:\n{context}\nAnswer: "

In [42]:
# answer=chat(input, print_mode=False)
# print_answer(answer)

In [43]:
#diagnosis
input=f"Can you tell me the diagnosis of the patient from the following doctor's letter?\nLetter:\n{context}\nAnswer: "

In [44]:
# answer=chat(input, print_mode=False)
# print_answer(answer)
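
The three zero-shot prompts above differ only in the requested field, so they can be generated from one template. A minimal sketch follows. `build_zero_shot_prompt` is a hypothetical helper, and the letter text is a placeholder rather than real data.

```python
ZERO_SHOT_TEMPLATE = (
    "Can you tell me the {field} of the patient from the following doctor's letter?\n"
    "Letter:\n{context}\nAnswer: "
)


def build_zero_shot_prompt(field: str, context: str) -> str:
    """Fill the template for one entity type (hypothetical helper)."""
    return ZERO_SHOT_TEMPLATE.format(field=field, context=context)


letter = "Patient: Fried is a 34-year-old patient."  # placeholder letter text
prompts = [build_zero_shot_prompt(f, letter) for f in ("name", "age", "diagnosis")]
for p in prompts:
    print(p)
    print("-" * 20)
```

Because `chat` expects a list of prompts, a list built this way could be passed to it in a single batch, for example `chat(prompts, verbose=False)`.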

#### Chain-of-thought prompt

In [45]:
# name prompt
input = f"Context: Patient: Fried\nQuestion: what is the name of the patient? \nAnswer: Name of the patient is Fried\nContext: {context}\nQuestion: what is the name of the patient?\nAnswer: the name of patient is"
#print(input)

In [46]:
# answer=chat(input, print_mode=False)
# print_answer(answer)

In [47]:
# age prompt
input = f"Context:\nPatient: Fried is a 34-year-old patient\nQuestion:\nhow old is the patient? \nAnswer:\nFried is a patient, 34 year-old, the answer is 34\nContext:\n{context}\nQuestion:\nhow old is the patient?\nAnswer: "
# print(input)

In [48]:
# age prompt
#len(input)
# > 6913 characters (len() counts characters, not tokens)

In [49]:
# answer=chat(input, print_mode=False)
# print_answer(answer)

In [50]:
# diagnosis prompt
input=f"Context:\nPatient: Fried is a 34-year-old patient, Diagnoses: Influenza (J09.X2) \nQuestion:\nWhat diagnoses does the patient have? \nAnswer:\nFried is a patient, 34 year-old, has diagnoses Influenza (J09.X2). The answer is Influenza (J09.X2)\nContext:\n{context}\nQuestion:\nWhat diagnoses does the patient have?\nAnswer: "

In [51]:
# answer=chat(input, print_mode=False)
# print_answer(answer)
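
The exemplar-based prompts above all follow the same Context / Question / Answer layout, with one worked example followed by the real context. A minimal sketch of assembling them from parts is shown here. `build_few_shot_prompt` is a hypothetical helper, and the example strings are adapted from the prompts above.

```python
def build_few_shot_prompt(examples, context: str, question: str) -> str:
    """Join worked examples and the final open question into one prompt.

    `examples` is a list of (context, question, answer) tuples.
    """
    blocks = [
        f"Context:\n{ex_context}\nQuestion:\n{ex_question}\nAnswer:\n{ex_answer}"
        for ex_context, ex_question, ex_answer in examples
    ]
    # The last block leaves the answer open for the model to complete.
    blocks.append(f"Context:\n{context}\nQuestion:\n{question}\nAnswer: ")
    return "\n".join(blocks)


age_examples = [
    (
        "Patient: Fried is a 34-year-old patient",
        "how old is the patient?",
        "Fried is a patient, 34 year-old, the answer is 34",
    )
]
few_shot_prompt = build_few_shot_prompt(
    age_examples,
    context="(doctor's letter goes here)",  # placeholder context
    question="how old is the patient?",
)
print(few_shot_prompt)
```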

In [52]:
# gpu_status.gpu_usage()

In [53]:
# free_memory()