## Introduction

@author: Yingding Wang\
@created: 24.11.2023\
@updated: 28.11.2023\
@version: 2

This notebook comprises examples that use transformers, PyTorch, Llama 2, and LangChain to achieve entity extraction with engineered prompts.



In [1]:
import sys

In [2]:
# pydantic_core 2.x does not work with docarray 0.39.1
# !{sys.executable} -m pip uninstall pydantic_core -y

In [3]:
#!{sys.executable} -m pip install --upgrade pip
#!{sys.executable} -m pip install --user --upgrade kfp==1.8.22

In [4]:
#!cat ./requirements.txt

## Use the CUDA 11.8 and torch 2.1.0 versions

In [5]:
#!{sys.executable} -m pip install --user --upgrade -r ./requirements.txt --extra-index-url https://download.pytorch.org/whl/cu11

In [6]:
#!{sys.executable} -m pip list

## Additional technical information
#### Useful installation for KF notebook 1.7.0 cu111 drivers

```shell
#!{sys.executable} -m pip install --user --upgrade transformers==4.31.0
#!{sys.executable} -m pip install --user --upgrade torch==1.10.2+cu111 fastai==2.7.12 fastcore==1.5.29 fastdownload==0.0.7 torchvision==0.11.3+cu111 --extra-index-url https://download.pytorch.org/whl/cu111
#!{sys.executable} -m pip install --user --upgrade accelerate==0.20.3
```
cuda118
```shell
#!{sys.executable} -m pip install --user --upgrade torch==2.0.0+cu118 --extra-index-url https://download.pytorch.org/whl/cu118
```
`xformers==0.0.21` needs `torch==2.0.1`

```shell
#!{sys.executable} -m pip install --user --upgrade xformers==0.0.21 torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2
```

show js loading with ipywidgets
```shell
#!{sys.executable} -m pip install --user --upgrade ipywidgets==8.1.0 comm==0.1.4 jupyterlab-widgets==3.0.8 widgetsnbextension==4.0.8
```

uninstall
```shell
#!{sys.executable} -m pip uninstall accelerate transformers xformers torch -y
```

## (optional) restart kernel

### (optional) Set huggingface cli in terminal

```shell
PATH=${PATH}:/home/jupyter/.local/bin
```

In [7]:
# (optional) uncomment the following lines to set path in python notebook cell for notebook session 
# PATH=%env PATH
# %env PATH={PATH}:/home/jupyter/.local/bin

#### Basics of GPU

Multi GPU inference: https://github.com/tloen/alpaca-lora/issues/445

Show accelerator device IDs:

```shell
nvidia-smi -L
```

Nvidia usage
```shell
nvidia-smi -q -g 0 -d UTILIZATION -l
```

python lib: gpustat
```shell
gpustat -cp
```

* https://stackoverflow.com/questions/8223811/a-top-like-utility-for-monitoring-cuda-activity-on-a-gpu

Check GPU info in PyTorch
* https://stackoverflow.com/questions/48152674/how-do-i-check-if-pytorch-is-using-the-gpu
* CUDA memory management https://pytorch.org/docs/stable/notes/cuda.html#cuda-memory-management

#### Extract the GPU Accelerator MIG UUIDs

* Extract with re.search and group: https://note.nkmk.me/en/python-str-extract/
* Extract with pattern before and after: https://stackoverflow.com/questions/4666973/how-to-extract-the-substring-between-two-markers
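
A minimal sketch of the extraction, assuming `nvidia-smi -L` prints each device identifier in a `(UUID: ...)` suffix. The sample listing and the function name `extract_uuids` are illustrative; this is not the implementation of the notebook's `AcceleratorHelper`.

```python
import re

# Illustrative listing: the GPU UUID is made up, the MIG UUID is the one printed later in this notebook.
SAMPLE_LISTING = """\
GPU 0: NVIDIA A100 80GB PCIe (UUID: GPU-11111111-2222-3333-4444-555555555555)
  MIG 2g.20gb     Device  0: (UUID: MIG-0efc9f06-6dca-5886-98af-0273ca7fde51)
"""


def extract_uuids(listing: str, is_mig: bool = True) -> list:
    """Return the MIG (or full-GPU) UUIDs found in an `nvidia-smi -L` style listing."""
    prefix = "MIG" if is_mig else "GPU"
    pattern = re.compile(r"\(UUID:\s*(" + prefix + r"-[0-9a-fA-F\-]+)\)")
    uuids = []
    for line in listing.splitlines():
        match = pattern.search(line)
        if match:
            uuids.append(match.group(1))  # group(1) is the text between the two markers
    return uuids


print(extract_uuids(SAMPLE_LISTING))
```

Filtering on the `MIG-` or `GPU-` prefix keeps the two device kinds apart, since a MIG-enabled GPU lists both its own UUID and those of its slices.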

#### PyTorch distributed with device UUID
* https://discuss.pytorch.org/t/world-size-and-rank-torch-distributed-init-process-group/57438

#### CUDA MIG memory notice
The following Python snippet shows the available MIG memory:
```python
print(torch.cuda.mem_get_info())
for e in torch.cuda.mem_get_info():
    print(e/1024**3)
```
The first element of the tuple is the available (free) MIG CUDA memory in bytes, the second is the total.
If the free memory drops to zero while no process is attached, a CUDA process is hanging.
```console
(20748107776, 20937965568)
19.32318115234375
19.5
```
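
The conversion above can be wrapped in a small helper. This is a sketch that works on the `(free, total)` byte tuple returned by `torch.cuda.mem_get_info()`; the name `mem_info_gib` is hypothetical, and the sample values are the ones from the console output above.

```python
def mem_info_gib(mem_info: tuple) -> dict:
    """Convert a (free_bytes, total_bytes) tuple into GiB plus the free fraction."""
    free_bytes, total_bytes = mem_info
    gib = 1024 ** 3
    return {
        "free_gib": free_bytes / gib,
        "total_gib": total_bytes / gib,
        "free_fraction": free_bytes / total_bytes,
    }


# values copied from the console output above, so no GPU is needed to run this
info = mem_info_gib((20748107776, 20937965568))
print(info)
```

A free fraction close to zero on an otherwise idle device would be the signal to look for a hanging process.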

To terminate a hanging CUDA process, log in to the GPU host:
```shell
nvidia-smi # look up the PID of the process, e.g. 830333
sudo kill -9 <PID>
```

In [8]:
from platform import python_version

print(python_version())

3.8.10


In [9]:
import os, time, sys
from util.accelerator_utils import AcceleratorStatus, AcceleratorHelper

# data volume mounted in kubeflow notebook
MODEL_ROOT="/home/jovyan/llm-models"
MODEL_SUB_PATH = "core-kind/yinwang"
# the cache dir for huggingface models
MODEL_CACHE_DIR = f"{MODEL_ROOT}/{MODEL_SUB_PATH}"

gpu_status = AcceleratorStatus()
gpu_status.gpu_usage()

num_of_gpus: 1
--------------------
Device name      : NVIDIA A100 80GB PCIe MIG 2g.20gb 
Device idx       : 0 
No. of processors: 28
Physical  memory : 19.500000 GB
Reserved  memory : 0.000000 GB
Allocated memory : 0.000000 GB
Free      memory : 0.000000 GB
--------------------


In [10]:
gpu_helper = AcceleratorHelper()
# dynamically fetch attached accelerator devices
UUIDs = gpu_helper.nvidia_device_uuids_filtered_by(is_mig=True, log_output=False)

In [11]:
# init all the cuda torch env and model download cache directory
gpu_helper.init_cuda_torch(UUIDs, MODEL_CACHE_DIR)

print(os.environ["CUDA_VISIBLE_DEVICES"])
print(os.environ["XDG_CACHE_HOME"])

MIG-0efc9f06-6dca-5886-98af-0273ca7fde51
/home/jovyan/llm-models/core-kind/yinwang/models
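
Judging from the two printed values, the helper presumably exports the device UUIDs and a cache directory as environment variables. The sketch below is an assumption about that behavior, not the code of `util.accelerator_utils`; `init_env` is a hypothetical name. `CUDA_VISIBLE_DEVICES` has to be set before CUDA is initialized in the process for it to take effect.

```python
import os


def init_env(uuids: list, cache_dir: str) -> None:
    """Expose only the given devices to CUDA and point the cache to <cache_dir>/models."""
    # several devices are passed as a comma-separated list
    os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(uuids)
    # assumption: the cache location is derived from the printed XDG_CACHE_HOME value
    os.environ["XDG_CACHE_HOME"] = f"{cache_dir}/models"


init_env(
    ["MIG-0efc9f06-6dca-5886-98af-0273ca7fde51"],
    "/home/jovyan/llm-models/core-kind/yinwang",
)
print(os.environ["CUDA_VISIBLE_DEVICES"])
print(os.environ["XDG_CACHE_HOME"])
```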


In [12]:
model_map = {
        "7B": "meta-llama/Llama-2-7b-chat-hf",
        "13B" : "meta-llama/Llama-2-13b-chat-hf",
        "70B" : "meta-llama/Llama-2-70b-chat-hf",
        # "70B" : "meta-llama/Llama-2-70b-hf" 
}

import transformers
import torch
# from transformers import pipeline
from transformers import AutoModelForCausalLM, AutoTokenizer
print(transformers.__version__)
print(torch.__version__)

4.35.2
2.1.0+cu118


In [13]:
"""
Load the huggingface hub token
"""
token_sub_path = ".cache/huggingface/token"
token_file_path = f"{MODEL_CACHE_DIR}/{token_sub_path}"
# strip the leading and trailing EOL chars
# https://stackoverflow.com/questions/275018/how-can-i-remove-a-trailing-newline/275025#275025
with open(token_file_path, "r") as file:
    # the token file may end with a newline; remove it
    # token = file.read().replace('\n', '')    
    token = file.read().strip()

# print the raw string to see if there is new line in the token
# print(r'{}'.format(token))

In [14]:
# model_type = "13B"
model_type = "7B"
# fall back to the 7B model name (not the key "7B") if model_type is unknown
model_name = model_map.get(model_type, model_map["7B"])

print(model_name)

meta-llama/Llama-2-7b-chat-hf


In [15]:
tokenizer = AutoTokenizer.from_pretrained(
    model_name, 
    token=token, #transformer>=4.32.1
    device='cpu',
    # device_map="auto", # put to GPU
    # use_auth_token=token, #transformer==4.31.0
)

In [16]:
tokenizer

LlamaTokenizerFast(name_or_path='meta-llama/Llama-2-7b-chat-hf', vocab_size=32000, model_max_length=1000000000000000019884624838656, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>'}, clean_up_tokenization_spaces=False),  added_tokens_decoder={
	0: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	1: AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	2: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}

#### Testing tokenizer
https://huggingface.co/docs/tokenizers/pipeline

In [17]:
inputs=['Q: Roger has 3 tennis balls. He buys 2 more cans of tennis balls. Each can has 4 tennis balls. How many tennis balls does he have now?\nA: Roger started with 3 balls. 2 cans of 4 tennis balls each is 8 tennis balls. 3 + 8 = 11. The answer is 11.\nQ: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?\n']

In [18]:
input_test_encoded = tokenizer.encode(inputs[0])
print(input_test_encoded)

[1, 660, 29901, 14159, 756, 29871, 29941, 22556, 26563, 29889, 940, 1321, 952, 29871, 29906, 901, 508, 29879, 310, 22556, 26563, 29889, 7806, 508, 756, 29871, 29946, 22556, 26563, 29889, 1128, 1784, 22556, 26563, 947, 540, 505, 1286, 29973, 13, 29909, 29901, 14159, 4687, 411, 29871, 29941, 26563, 29889, 29871, 29906, 508, 29879, 310, 29871, 29946, 22556, 26563, 1269, 338, 29871, 29947, 22556, 26563, 29889, 29871, 29941, 718, 29871, 29947, 353, 29871, 29896, 29896, 29889, 450, 1234, 338, 29871, 29896, 29896, 29889, 13, 29984, 29901, 450, 274, 2142, 1308, 423, 750, 29871, 29906, 29941, 623, 793, 29889, 960, 896, 1304, 29871, 29906, 29900, 304, 1207, 301, 3322, 322, 18093, 29871, 29953, 901, 29892, 920, 1784, 623, 793, 437, 896, 505, 29973, 13]


In [19]:
response_test_decoded = tokenizer.decode(input_test_encoded)
print(response_test_decoded)

<s> Q: Roger has 3 tennis balls. He buys 2 more cans of tennis balls. Each can has 4 tennis balls. How many tennis balls does he have now?
A: Roger started with 3 balls. 2 cans of 4 tennis balls each is 8 tennis balls. 3 + 8 = 11. The answer is 11.
Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?



In [20]:
# %time
# not loading to the GPU with accelerator
# model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", token=token)

In [21]:
# # will call the AutoModelForCausalLM automatically
# generator = pipeline(
#     "text-generation",
#     model=model_name,
#     torch_dtype=torch.float16,
#     device_map="auto",
#     token=token, #transformer>=4.32.1
#     #use_auth_token=token, #transformer==4.31.0
# )

### Inference with transformers pipeline

Reference:
* https://huggingface.co/docs/transformers/pipeline_tutorial

Note:
* batching is not enabled by default, and batching is not necessarily faster for `transformers.pipeline`
* `max_new_tokens` set at pipeline initialization works with `langchain.llms.HuggingFacePipeline`, but not as a parameter for the `TextGenerationPipeline`

max_new_tokens https://github.com/huggingface/transformers/issues/19358

In [22]:
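%time
# note: a bare `%time` line times an empty statement, hence the microsecond timings in the output;
# put `%%time` on the first line of the cell to time the whole cell
# transformers>=4.32.1 expects the "token" parameter
# transformers 4.30.x expects the "use_auth_token" parameter
# with torch.no_grad():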
generator = transformers.pipeline(
    "text-generation",
    # model=model,
    model=model_name,
    tokenizer=tokenizer, # optional
    torch_dtype=torch.bfloat16,
    #torch_dtype=torch.float16,
    device_map="auto",
    max_length=None, # unset the limit on the total length (prompt + generated tokens)
    max_new_tokens=100, # maximum number of newly generated tokens, counted in tokens and not in characters (alternative: 200)
    token=token, #transformer>=4.32.1
    #use_auth_token=token, #transformer==4.31.0
)

CPU times: user 2 µs, sys: 2 µs, total: 4 µs
Wall time: 8.82 µs


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]



In [23]:
type(generator)

transformers.pipelines.text_generation.TextGenerationPipeline

In [24]:
gpu_status.gpu_usage()

num_of_gpus: 1
--------------------
Device name      : NVIDIA A100 80GB PCIe MIG 2g.20gb 
Device idx       : 0 
No. of processors: 28
Physical  memory : 19.500000 GB
Reserved  memory : 12.615234 GB
Allocated memory : 12.613792 GB
Free      memory : 0.001442 GB
--------------------


## Passing temperature to the generator for each prompt

https://discuss.huggingface.co/t/how-to-set-generation-parameters-for-transformers-pipeline/48837

LLama2 chat agent
https://github.com/pinecone-io/examples/blob/master/learn/generation/llm-field-guide/llama-2/llama-2-70b-chat-agent.ipynb

Only one of `max_length` and `max_new_tokens` needs to be set
https://github.com/huggingface/transformers/issues/19358
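
To make the effect of `temperature` and `top_k` concrete, here is a conceptual sketch in plain Python. It is not how `transformers` implements sampling internally; the logits are made-up numbers and `sample_distribution` is a hypothetical helper. Dividing the logits by a temperature below 1 sharpens the distribution, which is consistent with a low value such as `temperature=0.1` giving fairly stable answers.

```python
import math


def sample_distribution(logits: list, temperature: float = 1.0, top_k: int = 0) -> list:
    """Return token probabilities after temperature scaling and optional top-k filtering."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    if top_k > 0:
        # keep the top_k largest logits (ties at the threshold are kept), mask out the rest
        threshold = sorted(scaled, reverse=True)[min(top_k, len(scaled)) - 1]
        scaled = [x if x >= threshold else float("-inf") for x in scaled]
    peak = max(scaled)  # subtract the maximum for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    return [w / total for w in weights]


logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for four candidate tokens
for t in (1.0, 0.1):
    probs = sample_distribution(logits, temperature=t, top_k=3)
    print(t, [round(p, 4) for p in probs])
```

With `top_k=3` the fourth token always gets probability zero, and the lower temperature moves almost all remaining probability mass to the first token.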

In [25]:
def chat_gen(
    generator: transformers.pipelines.text_generation.TextGenerationPipeline, 
    tokenizer: transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast,
    gpu_status: AcceleratorStatus
):    
    def local(input_prompts: list=[], temperature: float=0.1, max_new_tokens: int=200, verbose: bool=True) -> list:
        """
        do_sample, top_k, num_return_sequences, eos_token_id are settings of
        the TextGenerationPipeline
        
        Reference:
        https://huggingface.co/docs/transformers/generation_strategies#customize-text-generation
        """
        start = time.time()
        sequences = generator(
            input_prompts,
            do_sample=True,
            top_k=10,
            num_return_sequences=1,
            eos_token_id=tokenizer.eos_token_id,
            # max_length=200,
            max_new_tokens= max_new_tokens, # 200 # max number of tokens to generate in the output
            temperature=temperature,
            repetition_penalty=1.1  # without this output begins repeating
        )
        # for seq in sequences:
        #     print(f"Result: \n{seq['generated_text']}")
        
        batch_result = []
        for prompt_result in sequences: # passed a list of prompt
            result = []
            for seq in prompt_result: # 
                result.append(f"Result: \n{seq['generated_text']}")
            batch_result.append(result)
            
        end = time.time()
        duration = end - start
        
        if verbose:
            for prompt_result in batch_result:
                for result in prompt_result:
                    print("promt-response")
                    print(result)
            print("-"*20)
            print(f"walltime: {duration} in secs.")
            gpu_status.gpu_usage()
            
        return batch_result   
    return local
    
chat = chat_gen(generator, tokenizer, gpu_status)

In [26]:
# set DEBUG to false to remove all the llm answer outputs
# DEBUG=True
DEBUG=False

In [27]:
# def print_answer(answer: list)-> None:
#     if DEBUG:
#         print("-"*10)
#         print(answer[0])
#         print("-"*10)
#         print(answer[0].split("\n")[-1])   

#### Free pytorch gpu memory
* https://discuss.pytorch.org/t/how-to-delete-a-tensor-in-gpu-to-free-up-memory/48879/5
* https://discuss.huggingface.co/t/clear-gpu-memory-of-transformers-pipeline/18310
* https://saturncloud.io/blog/how-to-free-up-all-memory-pytorch-is-taking-from-gpu-memory/
* https://discuss.pytorch.org/t/how-to-free-the-pytorch-transformers-model-from-gpu-memory/132968
* https://stackoverflow.com/questions/70508960/how-to-free-gpu-memory-in-pytorch

#### Huggingface pipelines
* https://huggingface.co/docs/transformers/main_classes/pipelines
* clean cuda torch gpu: https://stackoverflow.com/questions/55322434/how-to-clear-cuda-memory-in-pytorch

In [28]:
# import gc
# def free_memory_gen(
#     generator: transformers.pipelines.text_generation.TextGenerationPipeline, 
#     tokenizer: transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast):
#     """
#     """
#     def local():
#         l_generator = generator
#         l_tokenizer = tokenizer
#         #l_generator.cpu()
#         #l_tokenizer.cpu()
#         # model.cpu()
        
#         del l_tokenizer, l_generator
#         gc.collect()
#         torch.cuda.empty_cache()
#         #for device_idx in range(torch.cuda.device_count()):
#         #    print(device_idx)
#         #    device = torch.device(f"cuda:{device_idx}")
#         #    device.reset()
#     return local    

# free_memory = free_memory_gen(generator, tokenizer)    

In [29]:
# chain-of-thought prompting

# testing prompt
inputs=['Q: Roger has 3 tennis balls. He buys 2 more cans of tennis balls. Each can has 4 tennis balls. How many tennis balls does he have now?\nA: Roger started with 3 balls. 2 cans of 4 tennis balls each is 8 tennis balls. 3 + 8 = 11. The answer is 11.\nQ: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?\n']
print(inputs[0])

Q: Roger has 3 tennis balls. He buys 2 more cans of tennis balls. Each can has 4 tennis balls. How many tennis balls does he have now?
A: Roger started with 3 balls. 2 cans of 4 tennis balls each is 8 tennis balls. 3 + 8 = 11. The answer is 11.
Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?



In [30]:
verbose = True
batch_answers = chat(inputs, temperature=0.1, max_new_tokens = 80, verbose=verbose)
if not verbose:
    prompt_0_results = batch_answers[0]
    print(prompt_0_results[0])
    
# note: the expected answer is 9    

promt-response
Result: 
Q: Roger has 3 tennis balls. He buys 2 more cans of tennis balls. Each can has 4 tennis balls. How many tennis balls does he have now?
A: Roger started with 3 balls. 2 cans of 4 tennis balls each is 8 tennis balls. 3 + 8 = 11. The answer is 11.
Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?
A: First, the cafeteria had 23 apples. Then, they used 20 to make lunch, leaving 3 apples. Finally, they bought 6 more, so they have 3 + 6 = 9 apples. The answer is 9.
--------------------
walltime: 3.6301612854003906 in secs.
num_of_gpus: 1
--------------------
Device name      : NVIDIA A100 80GB PCIe MIG 2g.20gb 
Device idx       : 0 
No. of processors: 28
Physical  memory : 19.500000 GB
Reserved  memory : 12.955078 GB
Allocated memory : 12.621727 GB
Free      memory : 0.333351 GB
--------------------


## Huggingface with Local LLM

* https://python.langchain.com/docs/integrations/llms/huggingface_pipelines

HuggingFacePipeline from langchain needs pydantic>=1.10.13

```python
import pydantic
print(pydantic.__version__)
```
* https://stackoverflow.com/questions/76313592/import-langchain-error-typeerror-issubclass-arg-1-must-be-a-class

In [31]:
# HuggingFacePipeline version 0.0.313 needs pydantic >= 1.10.13
# HuggingFacePipeline version 0.0.312 works with pydantic <= 1.10.2

# !{sys.executable} -m pip install --user --upgrade langchain==0.0.341
# !{sys.executable} -m pip install --user --upgrade langchain==0.0.312

In [32]:
import pydantic
pydantic.__version__

'1.10.13'

In [33]:
# import os
# !{sys.executable} -m pip install --user --upgrade pydantic==1.10.13

In [34]:
import langchain
from langchain.llms.huggingface_pipeline import HuggingFacePipeline

print(langchain.__version__)

0.0.349


### Init a HuggingFacePipeline with pipeline_kwargs

https://github.com/langchain-ai/langchain/issues/8280#issuecomment-1652085694

In [35]:
# from langchain.llms import HuggingFacePipeline
# from transformers import AutoModelForCausalLM, AutoTokenizer

# model_id  = "TheBloke/wizardLM-7B-HF"
# tokenizer = AutoTokenizer.from_pretrained(model_id)
# model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

# hf = HuggingFacePipeline.from_model_id(
#     model_id=model_id,
#     task="text-generation",
#     model_kwargs={"trust_remote_code": True},
#     pipeline_kwargs={
#         "model": model,
#         "tokenizer": tokenizer,
#         "device_map": "auto",
#         "max_new_tokens": 1200,
#         "temperature": 0.3,
#         "top_p": 0.95,
#         "repetition_penalty": 1.15,
#     },
# )
# print(hf)

In [36]:
"""
this hack of the partial function doesn't work, since the partial returns a Partial obj and not a Pipeline obj.
"""
# from functools import partial
# hg_pipeline = partial(generator, max_new_tokens=80, temperature=0.1, repetition_penalty=1.1, device_map="auto")
# llm = HuggingFacePipeline(
#     pipeline=hg_pipeline 
# )

"\nthis hack of the partial function doesn't work, since the partial returns a Partial obj and not a Pipeline obj.\n"

In [37]:
llm = HuggingFacePipeline(
    pipeline=generator 
)

print(llm)
print(llm.pipeline.model)

HuggingFacePipeline
Params: {'model_id': 'gpt2', 'model_kwargs': None, 'pipeline_kwargs': None}
LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 4096)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (v_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=4096, out_features=11008, bias=False)
          (up_proj): Linear(in_features=4096, out_features=11008, bias=False)
          (down_proj): Linear(in_features=11008, out_features=4096, bias=False)
          (act_fn): SiLUActivation()
        )
        (input_layernorm): LlamaRMSNorm(

In [38]:
# there is a bug: HuggingFacePipeline does not pick up the parameters directly
# https://github.com/langchain-ai/langchain/issues/8280

# model_id must be set for the generator (HuggingFacePipeline) to work
llm.model_id = model_name
pipeline_kwargs_config = {
    # "do_sample": True, # also making trouble with langchain (optional)
    # "top_k": 10, # this param result in trouble with langchain (optional)
    # "num_return_sequences": 1, # (optional)
    # "eos_token_id": tokenizer.eos_token_id, # also making trouble (optional)
    "device_map": "auto",
    "max_length": None, # deactivate to use max_new_tokens
    "max_new_tokens": 100, # this is not taken by the model ?
    "temperature": 0.1,
    # "top_p": 0.95, # what is this?
    "repetition_penalty": 1.1, # 1.15,
}
model_kwargs_config = {
    "do_sample": True, # also making trouble with langchain (optional)
    "top_k": 5, # this param result in trouble with langchain (optional)
    "num_return_sequences": 1, # (optional)
    "eos_token_id": tokenizer.eos_token_id, # also making trouble (optional)
    # "device_map": "auto",
    "max_length": None, # deactivate to use max_new_tokens
    "max_new_tokens": 100, # this is not taken by the model ?
    "temperature": 0.1,
    # "top_p": 0.95, # what is this?
    "repetition_penalty": 1.1, # 1.15,
}
llm.model_kwargs = model_kwargs_config
llm.model_kwargs["trust_remote_code"] = True
llm.pipeline_kwargs = pipeline_kwargs_config

In [39]:
gpu_status.gpu_usage()

num_of_gpus: 1
--------------------
Device name      : NVIDIA A100 80GB PCIe MIG 2g.20gb 
Device idx       : 0 
No. of processors: 28
Physical  memory : 19.500000 GB
Reserved  memory : 12.955078 GB
Allocated memory : 12.621727 GB
Free      memory : 0.333351 GB
--------------------


In [40]:
# print(llm.pipeline.model.name_or_path)
# print(llm.model_id)
# print(llm.model_kwargs)
# print(llm.pipeline_kwargs)

## Simple local LLM call from langchain API

This section tests calling a local TextGenerationPipeline through the langchain API.

https://github.com/langchain-ai/langchain/discussions/8383


In [41]:
llm

HuggingFacePipeline(pipeline=<transformers.pipelines.text_generation.TextGenerationPipeline object at 0x7fbfe7fb1c10>, model_id='meta-llama/Llama-2-7b-chat-hf', model_kwargs={'do_sample': True, 'top_k': 5, 'num_return_sequences': 1, 'eos_token_id': 2, 'max_length': None, 'max_new_tokens': 100, 'temperature': 0.1, 'repetition_penalty': 1.1, 'trust_remote_code': True}, pipeline_kwargs={'device_map': 'auto', 'max_length': None, 'max_new_tokens': 100, 'temperature': 0.1, 'repetition_penalty': 1.1})

In [42]:
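from functools import wraps
from typing import Callable


def time_func(f: Callable):
    """Decorator that prints the wall time of each call and passes the result through."""
    @wraps(f)
    def inner(*args, **kwargs):
        start = time.time()
        result = f(*args, **kwargs)
        duration = time.time() - start
        print("="*20)
        print(f"walltime: {duration} in secs.")
        print("="*20)
        return result  # the original wrapper dropped the return value
    return inner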


@time_func
def chat_llm(prompt: str):
    print(llm(prompt))
    # gpu_status.gpu_usage()

In [43]:
# %time
"""
When the same math question is asked several times, the LLM occasionally answers incorrectly.

Example of wrong answer:
A: The cafeteria started with 23 apples. They used 20 to make lunch, leaving 3 apples. \n
Then, they bought 6 more, bringing the total to 23 + 6 = 29 apples. The answer is 29.
"""

# repeat = 10 
repeat = 5
for _ in range(repeat): # open question: is there a CPU bottleneck here? When called repeatedly, the model sometimes loses the context and hallucinates.
    chat_llm(prompt=inputs[0])

A: The cafeteria started with 23 apples. They used 20 to make lunch, leaving 3 apples. Then, they bought 6 more, so now they have 3 + 6 = 9 apples. The answer is 9.
Q: If a book costs $15 and a pencil costs $0.50, how much will 3 books and 5 pencils cost in total?
A
walltime: 4.204752445220947 in secs.
A: The cafeteria had 23 apples to start. They used 20 to make lunch, leaving 3 apples. They then bought 6 more, bringing the total to 9 apples. The answer is 9.
Q: Sarah has 15 marbles in her pocket. If she gives 3 to her friend, how many marbles does she have left?
A: Sarah had 15 marbles to start. She
walltime: 4.158351182937622 in secs.
A: The cafeteria started with 23 apples. They used 20 to make lunch, so they have 23 - 20 = 3 apples left. Then they bought 6 more, so they have 3 + 6 = 9 apples. The answer is 9.
walltime: 2.9075541496276855 in secs.
A: The cafeteria had 23 apples to start with. They used 20 apples to make lunch, leaving 23 - 20 = 3 apples. Then they bought 6 more app

## Sequential Doc Chain

https://github.com/langchain-ai/langchain/discussions/8383

In [44]:
from langchain.chains import RetrievalQA
from langchain.document_loaders import S3DirectoryLoader, S3FileLoader
from langchain.vectorstores import DocArrayInMemorySearch
from langchain.indexes import VectorstoreIndexCreator
from langchain.embeddings import HuggingFaceEmbeddings, HuggingFaceInstructEmbeddings
# from langchain.text_splitter import TextSplitter
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_core.documents.base import Document
from util.objectstore_utils import S3PdfObjHelper
from langchain.prompts import PromptTemplate

from typing import List

In [45]:
import boto3
print(boto3.__version__)

1.29.0


In [46]:
bucket_name = "scivias-medreports"
file_prefix = "KK-SCIVIAS"
prefix = f"{S3PdfObjHelper.DataContract.key_lead}/{file_prefix}"
access_key_id = os.environ.get('AWS_ACCESS_KEY_ID')
secret_access_key = os.environ.get('AWS_SECRET_ACCESS_KEY')
s3_endpoint = os.environ.get('S3_ENDPOINT')
# VERIFY = False
VERIFY = True
print(prefix)

trans2en/KK-SCIVIAS


In [47]:
loader = S3DirectoryLoader(bucket=bucket_name,
                           prefix=prefix, 
                           aws_access_key_id=access_key_id, 
                           aws_secret_access_key=secret_access_key,
                           endpoint_url=s3_endpoint,
                           verify = VERIFY,
                           use_ssl = True)

In [48]:
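%time
# note: a bare `%time` line does not time loader.load(); use `%%time` on the first line of the cell for that
# this is a synchronous call
# a custom call with an iterator is needed to load asynchronously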
data = loader.load()

CPU times: user 8 µs, sys: 3 µs, total: 11 µs
Wall time: 20.3 µs


In [49]:
class DocMetaInfo():
    def __init__(self, doc: Document):
        self.read_meta(doc)
    
    
    def read_meta(self, doc: Document):
        file_content = doc.page_content
        self.source = doc.metadata['source']
        self.name = self.source.split("/")[-1]
        self.token_size = len(file_content.split())
        self.character_size = len(file_content)
        
        
    def __str__(self):
        """call by print"""
        return f"source:{self.source}\nname:{self.name}\ntokens:{self.token_size}\ncharacters:{self.character_size}"
    
    
    def __repr__(self):
        """convert obj to string, called by jupyter cell"""
        return self.__str__()

        
def print_s3_obj_info(data: List[Document], idx: int, show_content: bool = False):
    if (data is not None):
        n = len(data)
        print(f"total objects: {n}")
        print("="*20)
        if -n <= idx < n: # in range of list idx
            # use the idx parameter instead of the global file_idx
            meta_info = DocMetaInfo(data[idx])
            file_content = data[idx].page_content
            
            print(f"s3 key     :{meta_info.source}")
            print(f"obj name   :{meta_info.name}")
            print(f"token size :{meta_info.token_size}")
            print(f"char. size :{meta_info.character_size}")
            if show_content:
                print("-"*20)
                print(file_content)

In [50]:
docs_meta_list = [DocMetaInfo(doc) for doc in data]

In [51]:
# enumerate yields (index, element) tuples; x[1] is the DocMetaInfo(doc)
# https://stackoverflow.com/questions/16945518/finding-the-index-of-the-value-which-is-the-min-or-max-in-python/16945868#16945868
idx_of_max_token, doc_meta = max(enumerate(docs_meta_list), key=lambda x: x[1].token_size)
print(doc_meta)

source:s3://scivias-medreports/trans2en/KK-SCIVIAS-00070^0054672400^2021-03-01^KIIS4.txt
name:KK-SCIVIAS-00070^0054672400^2021-03-01^KIIS4.txt
tokens:4568
characters:29060


In [52]:
docs_meta_list[idx_of_max_token]

source:s3://scivias-medreports/trans2en/KK-SCIVIAS-00070^0054672400^2021-03-01^KIIS4.txt
name:KK-SCIVIAS-00070^0054672400^2021-03-01^KIIS4.txt
tokens:4568
characters:29060

In [53]:
min(enumerate(docs_meta_list), key=lambda x: x[1].token_size)

(230,
 source:s3://scivias-medreports/trans2en/KK-SCIVIAS-00182^0054877490^2021-06-28^KIIHORMO.txt
 name:KK-SCIVIAS-00182^0054877490^2021-06-28^KIIHORMO.txt
 tokens:516
 characters:3496)

In [54]:
# file_idx = 1
file_idx = idx_of_max_token
show_content = False
# show_content = True

In [55]:
print_s3_obj_info(data, file_idx, show_content=show_content)

total objects: 250
s3 key     :s3://scivias-medreports/trans2en/KK-SCIVIAS-00070^0054672400^2021-03-01^KIIS4.txt
obj name   :KK-SCIVIAS-00070^0054672400^2021-03-01^KIIS4.txt
token size :4568
char. size :29060


### Langchain text splitter

* https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/recursive_text_splitter
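
The two key parameters are `chunk_size` and `chunk_overlap`, both measured with `length_function` (characters when it is `len`). As a simplified illustration only, the sketch below cuts fixed-size windows with overlap; `RecursiveCharacterTextSplitter` additionally tries to break at separators such as paragraph and line breaks, so its chunks usually end up shorter than `chunk_size`. `sliding_chunks` is a hypothetical helper.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list:
    """Split text into windows of at most chunk_size characters sharing chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


sample = "".join(str(i % 10) for i in range(25))  # 25 characters of digits
for chunk in sliding_chunks(sample, chunk_size=10, chunk_overlap=2):
    print(len(chunk), chunk)
```

The overlap repeats the tail of one chunk at the head of the next, so a sentence cut at a boundary still appears with some context in the following chunk.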

In [56]:
text_splitter = RecursiveCharacterTextSplitter(
    # chunk_size and chunk_overlap are measured in characters, since length_function is len
    chunk_size = 3000,
    chunk_overlap  = 20,
    length_function = len,
    is_separator_regex = False,
)

In [57]:
# Optional test of RecursiveCharacterTextSplitter on \n and other chars
test_text = data[file_idx].page_content
text_split_list = text_splitter.split_text(test_text)
print(len(text_split_list))
for seg in text_split_list:
    print(len(seg))
# print(text_split_list[-1])    

11
2354
2845
2800
2845
2568
2889
2753
2567
2415
2223
2781


### Langchain embeddings

use sentence-transformers  

* all-MiniLM-L12-v2 : 134MB https://huggingface.co/sentence-transformers/all-MiniLM-L12-v2 
* all-MiniLM-L6-v2 : 90MB https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2/tree/main

Llama2 does not support document embedding by default
* https://stackoverflow.com/questions/77037897/how-to-create-an-embeddings-model-in-langchain

HuggingFaceEmbedding embed_documents example
* https://python.langchain.com/docs/modules/data_connection/text_embedding/

In-memory vectorstore needs DocArray
* https://python.langchain.com/docs/integrations/providers/docarray

TextEmbeddings in LangChain
* https://python.langchain.com/docs/modules/data_connection/text_embedding/

Sentence-transformers
* https://medium.com/@madhur.prashant7/demo-langchain-rag-demo-on-llama-2-7b-embeddings-model-using-chainlit-559c10ce3fbf
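
Embeddings are compared by vector similarity. When `normalize_embeddings` is `False`, the vectors are not unit length, so cosine similarity is a safer comparison than a raw dot product. A minimal sketch in plain Python follows; the three-dimensional vectors are made-up stand-ins for the 384-dimensional MiniLM embeddings, and `cosine_similarity` is a hypothetical helper.

```python
import math


def cosine_similarity(a: list, b: list) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


hello = [0.9, 0.1, 0.0]      # made-up embedding of one greeting
greeting = [1.8, 0.2, 0.0]   # same direction, twice the length
question = [0.0, 0.2, 0.9]   # made-up embedding of an unrelated question

print(cosine_similarity(hello, greeting))
print(cosine_similarity(hello, question))
```

The first pair scores close to 1 despite the different vector lengths, which is the property that makes cosine similarity independent of normalization.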

In [58]:
# import sys
# !{sys.executable} -m pip install sentence-transformers==2.2.2

In [59]:
embed_model_name = "sentence-transformers/all-MiniLM-L12-v2"
print(embed_model_name)

sentence-transformers/all-MiniLM-L12-v2


In [60]:
# MODEL_CACHE_DIR

In [61]:
model_kwargs = {'device': 'cpu'}
# model_kwargs = {'device_map': "auto",}
encode_kwargs = {'normalize_embeddings': False}

# the model is downloaded to the "{MODEL_CACHE_DIR}/models/torch/sentence_transformer" folder
embed_model = HuggingFaceEmbeddings(
    model_name=embed_model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs
)

In [62]:
test_docs_list = [
        "Hi there!",
        "Oh, hello!",
        "What's your name?",
        "My friends call me World",
        "Hello World!"
    ]

def embed_vec_dim(embeddings):
    return len(embeddings), len(embeddings[0])

def embed_docs_test(model: HuggingFaceEmbeddings, docs_list: list):
    """Embed a list of texts and return one vector per text."""
    embeddings = model.embed_documents(
        docs_list
    )
    return embeddings

embeddings = embed_docs_test(embed_model, test_docs_list)
print(embed_vec_dim(embeddings))

(5, 384)


In [63]:
# print(text_split_list)

embeddings = embed_docs_test(embed_model, text_split_list)
print(embed_vec_dim(embeddings))
# print(embeddings[0])
# print(embeddings[-1])

(11, 384)


## Langchain local LLM RAG

Langchain Vectorstore and RAG approach differences:
* https://github.com/langchain-ai/langchain/issues/5328

Langchain RetrievalQA 
* https://python.langchain.com/docs/use_cases/question_answering/local_retrieval_qa

DocArray
* https://python.langchain.com/docs/integrations/providers/docarray

Llama 2 doesn't support document embedding
* https://stackoverflow.com/questions/77037897/how-to-create-an-embeddings-model-in-langchain


In [64]:
# RAG one document
index = VectorstoreIndexCreator(
    vectorstore_cls=DocArrayInMemorySearch,
    embedding=embed_model,
    text_splitter=text_splitter,
    ).from_documents([data[file_idx]])

In [65]:
retriever = index.vectorstore.as_retriever()

In [66]:
# db = DocArrayInMemorySearch.from_documents(
#     [data[file_idx]], embed_model)

# retriever = db.as_retriever()

#### Set the custom template

Set the prompt on the chain object's attribute instead of passing it through kwargs:
* https://github.com/langchain-ai/langchain/issues/6635#issuecomment-1659343109

The `reduce_prompt_template` can be assigned directly:
```python
qa_chain.combine_documents_chain.reduce_documents_chain.combine_documents_chain.llm_chain.prompt = reduce_prompt_template
```

In [67]:
template = """
Given the following extracted parts of a long document and a question, create a final answer.\n
If you don't know the answer, just say that you don't know. Don't try to make up an answer.\n\n\n
=========\n
QUESTION: {question}\n
=========\n
{summaries}\n
=========\n
FINAL ANSWER:"""

reduce_prompt_template = PromptTemplate(template=template, input_variables=['question', 'summaries'])
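To see what a filled-in prompt looks like before it reaches the model, the placeholders can be substituted by hand. The sketch below uses Python's built-in `str.format` as a stand-in for `PromptTemplate`; the shortened `demo_template`, the helper `fill_template`, and the two summaries are hypothetical and only mirror the `{question}` and `{summaries}` placeholders used above.

```python
# Same placeholders as the notebook's template; the wording is shortened.
demo_template = (
    "Given the following extracted parts of a long document and a question, "
    "create a final answer.\n"
    "=========\n"
    "QUESTION: {question}\n"
    "=========\n"
    "{summaries}\n"
    "=========\n"
    "FINAL ANSWER:"
)

def fill_template(template, question, summaries):
    """Substitute the question and the joined summaries into the template."""
    return template.format(question=question, summaries="\n".join(summaries))

demo_prompt = fill_template(
    demo_template,
    "What is the name of the patient?",
    ["Hypothetical summary 1", "Hypothetical summary 2"],  # made-up summaries
)
print(demo_prompt)
```

Printing the filled prompt this way also makes stray blank lines visible, for example when a triple-quoted template contains both literal line breaks and `\n` escapes.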

In [68]:
# https://python.langchain.com/docs/use_cases/question_answering/local_retrieval_qa
# "stuff", "map_reduce", "refine", "map_rerank"

chain_type = "map_reduce"
# chain_type = "refine" 
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type=chain_type,
    retriever=retriever,
    # combine_docs_chain_kwargs={'prompt': reduce_prompt_template},
    verbose=True
    )
# set the prompt template manually
# keep the original prompt template for the summary step: the custom template lacks the one-shot summary example and only provides the right format.
# qa_chain.combine_documents_chain.reduce_documents_chain.combine_documents_chain.llm_chain.prompt = reduce_prompt_template
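The idea behind the `map_reduce` chain type can be illustrated without a model: each retrieved chunk is processed on its own (map), and the partial results are then combined into one answer (reduce). The following is a rough sketch of that control flow only, not LangChain's implementation; `stub_map` and `stub_reduce` are hypothetical stand-ins for the LLM calls, and the chunks are made up.

```python
def map_reduce_answer(chunks, question, map_fn, reduce_fn):
    """Apply map_fn to every chunk, then combine the partial results."""
    partial_answers = [map_fn(chunk, question) for chunk in chunks]
    return reduce_fn(partial_answers, question)

def stub_map(chunk, question):
    # Hypothetical stand-in for an LLM call: keep lines mentioning "Patient".
    return [line for line in chunk.splitlines() if "Patient" in line]

def stub_reduce(partials, question):
    # Hypothetical stand-in for the final LLM call: pick the first hit.
    merged = [line for part in partials for line in part]
    return merged[0] if merged else "I don't know"

demo_chunks = [
    "Date: 01.01.2023\nPatient: Fried",
    "Diagnoses: Influenza (J09.X2)",
]
demo_answer = map_reduce_answer(
    demo_chunks, "What is the name of the patient?", stub_map, stub_reduce
)
print(demo_answer)
```

In the real chain both steps are prompts sent to the LLM, which is why the reduce prompt can be customized separately.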

In [69]:
query = "What is the name of the patient?"
# query = "What is the weight of the patient?"

In [70]:
# DEBUG=True
DEBUG=False

In [71]:
#qa_chain

In [72]:
if DEBUG:
    langchain.debug = True
response = qa_chain({"query": query})
if DEBUG:
    langchain.debug = False



> Entering new RetrievalQA chain...


Token indices sequence length is longer than the specified maximum sequence length for this model (1473 > 1024). Running this sequence through the model will result in indexing errors



> Finished chain.
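The warning in the output reports a sequence of 1473 tokens against a limit of 1024. One generic way to keep inputs under such a limit is to split a token sequence into overlapping windows. The sketch below only illustrates the arithmetic: `split_into_windows` is a hypothetical helper, the token IDs are placeholder integers, and the overlap of 64 is an arbitrary assumption.

```python
def split_into_windows(tokens, max_len, overlap=0):
    """Split a token list into windows of at most max_len tokens.

    Consecutive windows share `overlap` tokens.
    """
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must satisfy 0 <= overlap < max_len")
    step = max_len - overlap
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # the last window already reaches the end
    return windows

# Placeholder token IDs matching the lengths reported in the warning.
demo_tokens = list(range(1473))
demo_windows = split_into_windows(demo_tokens, max_len=1024, overlap=64)
print([len(w) for w in demo_windows])
```

In practice the tokens would come from the model's own tokenizer, and the chunk size of the text splitter would be chosen so that each chunk plus the prompt stays below the limit.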


In [73]:
if DEBUG:
    print(f"Response: {response['result']}")
    print('-'*20)
    print(data[file_idx])

In [74]:
# response = index.query_with_sources(query, llm=llm, retriever_kwargs={"chain_type":"map_reduce"})

### (optional) Additional Read

GPT4All
* https://python.langchain.com/docs/integrations/llms/gpt4all

## Example prompt

In [75]:
context = ""

#### zero shot prompt

In [76]:
#name
input=f"Can you tell me the name of the patient from the following doctor's letter?\nLetter:\n{context}\nAnswer: "

In [77]:
#len(input)
# 6810

In [78]:
# answer=chat(input, print_mode=False)
# print_answer(answer)

In [79]:
#age
input=f"Can you tell me the age of the patient from the following doctor's letter?\nLetter:\n{context}\nAnswer: "

In [80]:
# answer=chat(input, print_mode=False)
# print_answer(answer)

In [81]:
#diagnosis
input=f"Can you tell me the diagnosis of the patient from the following doctor's letter?\nLetter:\n{context}\nAnswer: "

In [82]:
# answer=chat(input, print_mode=False)
# print_answer(answer)
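The three zero-shot prompts above differ only in the requested field, so they can be generated from one helper. This is a small sketch; `build_zero_shot_prompt` and the one-line `demo_letter` are hypothetical and stand in for the notebook's `context`.

```python
def build_zero_shot_prompt(field, letter):
    """Build a zero-shot extraction prompt for one field of a doctor's letter."""
    return (
        f"Can you tell me the {field} of the patient from the following "
        f"doctor's letter?\nLetter:\n{letter}\nAnswer: "
    )

# Hypothetical letter text used only for demonstration.
demo_letter = "Patient: Fried is a 34-year-old patient."

zero_shot_prompts = {
    field: build_zero_shot_prompt(field, demo_letter)
    for field in ("name", "age", "diagnosis")
}
print(zero_shot_prompts["age"])
```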

#### Chain-of-thought prompt

In [83]:
# name prompt
input = f"Context: Patient: Fried\nQuestion: what is the name of the patient? \nAnswer: Name of the patient is Fried\nContext: {context}\nQuestion: what is the name of the patient?\nAnswer: the name of the patient is"
#print(input)

In [84]:
# answer=chat(input, print_mode=False)
# print_answer(answer)

In [85]:
# age prompt
input = f"Context:\nPatient: Fried is a 34-year-old patient\nQuestion:\nhow old is the patient? \nAnswer:\nFried is a patient, 34 years old, the answer is 34\nContext:\n{context}\nQuestion:\nhow old is the patient?\nAnswer: "
# print(input)

In [86]:
# age prompt
#len(input)
# > 6913 characters (len() counts characters, not tokens)

In [87]:
# answer=chat(input, print_mode=False)
# print_answer(answer)

In [88]:
# diagnosis prompt
input=f"Context:\nPatient: Fried is a 34-year-old patient, Diagnoses: Influenza (J09.X2) \nQuestion:\nWhat diagnoses does the patient have? \nAnswer:\nFried is a patient, 34 years old, has diagnoses Influenza (J09.X2). The answer is Influenza (J09.X2)\nContext:\n{context}\nQuestion:\nWhat diagnoses does the patient have?\nAnswer: "

In [89]:
# answer=chat(input, print_mode=False)
# print_answer(answer)
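The prompts in this section share one pattern: a worked example (context, question, answer with a short reasoning step) followed by the real context and the same question. A helper can make that pattern explicit. This is a minimal sketch; `build_one_shot_prompt` and `demo_context` are hypothetical names, and the example values are taken from the prompts above.

```python
def build_one_shot_prompt(example_context, question, example_answer, context):
    """Prepend one worked example to the real context and question."""
    return (
        f"Context:\n{example_context}\n"
        f"Question:\n{question}\n"
        f"Answer:\n{example_answer}\n"
        f"Context:\n{context}\n"
        f"Question:\n{question}\n"
        "Answer: "
    )

demo_context = ""  # the notebook fills this with the doctor's letter
age_prompt = build_one_shot_prompt(
    example_context="Patient: Fried is a 34-year-old patient",
    question="how old is the patient?",
    example_answer="Fried is a patient, 34 years old, the answer is 34",
    context=demo_context,
)
print(age_prompt)
```

Keeping the worked example and the question as separate arguments makes it easy to swap in other fields, such as the name or the diagnoses, without retyping the scaffold.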

In [90]:
# gpu_status.gpu_usage()

In [91]:
# free_memory()