## Introduction

@author: Yingding Wang\
@created: 24.11.2023\
@updated: 29.01.2024\
@version: 5

This notebook contains examples that use transformers, PyTorch, Llama 2, and LangChain to achieve entity extraction with engineered prompts.

Note:\
`RuntimeError: NVML_SUCCESS ...` in PyTorch when using Llama2 13B means OOM on a 20GB MIG slice:
https://github.com/pytorch/pytorch/issues/112377

Using `nvidia.com/mig-3g.40gb` instead of `nvidia.com/mig-2g.20gb` solves the issue.

In the map-reduce RAG pattern, the reduce step ran into recency bias, caused by hallucinations from the preceding map steps.\
Possible solutions:
* add an end token in the map steps
* use the rerank pattern
* use better RAG retrieval and a better embedding model
* use larger RAG chunks
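
The first idea in the list above, an end token in the map steps, can be approximated without touching the model: ask each map prompt to finish with a marker and cut everything after it before the reduce step. The following is a minimal sketch under that assumption. The marker `END_MARKER` and the helpers `truncate_at_marker` and `build_reduce_context` are hypothetical names for illustration, not part of transformers or langchain, and the sample strings are made up.

```python
from typing import List

# Hypothetical marker that the map prompt asks the model to emit when it is done.
END_MARKER = "<END>"


def truncate_at_marker(text: str, marker: str = END_MARKER) -> str:
    """Keep only the text before the first marker; keep everything if it is absent."""
    head, _, _ = text.partition(marker)
    return head.strip()


def build_reduce_context(map_outputs: List[str], marker: str = END_MARKER) -> str:
    """Clean every map output and join the non-empty ones for the reduce prompt."""
    cleaned = [truncate_at_marker(out, marker) for out in map_outputs]
    return "\n".join(c for c in cleaned if c)


# Made-up map outputs: the first one keeps generating after the marker.
map_outputs = [
    "Patient age: 54 <END> Q: What is the next question?",
    "<END>",
    "Diagnosis: hypertension",
]
print(build_reduce_context(map_outputs))
```

Trailing text after the marker, which is where runaway continuations tend to appear, never reaches the reduce prompt, and map steps that produced nothing are dropped instead of adding empty lines.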

Question:
* Is there an "end" special token for Llama2? https://www.reddit.com/r/LocalLLaMA/comments/167v3cd/llama_2_tokenizer_and_the_special_tokens/


In [1]:
import sys

In [2]:
#!{sys.executable} -m pip install --upgrade pip
#!{sys.executable} -m pip install --user --upgrade kfp==1.8.22

In [3]:
#!cat ./requirements.txt

## Use the cuda 118 and torch 2.1.0 version

In [4]:
# !{sys.executable} -m pip install --user --upgrade -r ./requirements.txt --extra-index-url https://download.pytorch.org/whl/cu118

In [5]:
#!{sys.executable} -m pip list

## Additional technical information
#### Useful installation for KF notebook 1.7.0 cu118 drivers

cuda118
```shell
#!{sys.executable} -m pip install --user --upgrade torch==2.0.0+cu118 --extra-index-url https://download.pytorch.org/whl/cu118
```
`xformers==0.0.21` needs `torch==2.0.1`

```shell
#!{sys.executable} -m pip install --user --upgrade xformers==0.0.21 torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2
```

show js loading with ipywidgets
```shell
#!{sys.executable} -m pip install --user --upgrade ipywidgets==8.1.0 comm==0.1.4 jupyterlab-widgets==3.0.8 widgetsnbextension==4.0.8
```

uninstall
```shell
#!{sys.executable} -m pip uninstall accelerate transformers xformers torch -y
```

## (optional) restart kernel

### (optional) Set huggingface cli in terminal

```shell
PATH=${PATH}:/home/jupyter/.local/bin
```

In [6]:
# (optional) uncomment the following lines to set path in python notebook cell for notebook session 
# PATH=%env PATH
# %env PATH={PATH}:/home/jupyter/.local/bin

#### Basics of GPU

Multi GPU inference: https://github.com/tloen/alpaca-lora/issues/445

Show accelerator device IDs:

```shell
nvidia-smi -L
```

Nvidia usage
```shell
nvidia-smi -q -g 0 -d UTILIZATION -l
```

python lib: gpustat
```shell
gpustat -cp
```

* https://stackoverflow.com/questions/8223811/a-top-like-utility-for-monitoring-cuda-activity-on-a-gpu

Check GPU info in PyTorch
* https://stackoverflow.com/questions/48152674/how-do-i-check-if-pytorch-is-using-the-gpu
* CUDA memory management https://pytorch.org/docs/stable/notes/cuda.html#cuda-memory-management

#### Extract the GPU Accelerator MIG UUIDs

* Extract with re.search and group: https://note.nkmk.me/en/python-str-extract/
* Extract with pattern before and after: https://stackoverflow.com/questions/4666973/how-to-extract-the-substring-between-two-markers
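
As a rough illustration of the `re`-based extraction referenced above, the sketch below pulls MIG UUIDs out of an `nvidia-smi -L` style listing. The implementation of the notebook's own `AcceleratorHelper` is not shown here, so this is a stand-in rather than that code. The listing layout is an assumption, the GPU UUID is a placeholder, and `extract_uuids` is a hypothetical helper name.

```python
import re
from typing import List

# Illustrative `nvidia-smi -L` listing; the line layout is assumed and the
# GPU UUID is a placeholder value.
SAMPLE_LISTING = """\
GPU 0: NVIDIA A100 80GB PCIe (UUID: GPU-00000000-0000-0000-0000-000000000000)
  MIG 3g.40gb     Device  0: (UUID: MIG-9579f618-ddae-5958-9285-3207382f0b36)
"""

# Capture whatever sits between "(UUID: " and ")" when it starts with GPU- or MIG-.
UUID_PATTERN = re.compile(r"\(UUID:\s*((?:GPU|MIG)-[0-9a-fA-F-]+)\)")


def extract_uuids(listing: str, is_mig: bool = True) -> List[str]:
    """Return the MIG UUIDs (or the full-GPU UUIDs when is_mig is False)."""
    prefix = "MIG-" if is_mig else "GPU-"
    return [uuid for uuid in UUID_PATTERN.findall(listing) if uuid.startswith(prefix)]


print(extract_uuids(SAMPLE_LISTING))
```

The resulting list can be joined with commas for `CUDA_VISIBLE_DEVICES`, which accepts MIG UUIDs as device identifiers.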

#### PyTorch distributed with device UUID
* https://discuss.pytorch.org/t/world-size-and-rank-torch-distributed-init-process-group/57438

#### CUDA MIG memory notice
The following Python snippet shows the available MIG memory:
```python
import torch

print(torch.cuda.mem_get_info())  # (free_bytes, total_bytes)
for e in torch.cuda.mem_get_info():
    print(e / 1024**3)  # bytes -> GiB
```
The first element of the tuple is the free MIG CUDA memory and the second is the total, both in bytes. If the free memory drops to zero while no process is attached, a CUDA process is hanging.
```console
(20748107776, 20937965568)
19.32318115234375
19.5
```

To terminate a CUDA process, log in to the GPU host:
```shell
nvidia-smi # find the PID, e.g. 830333
sudo kill -9 PID
```

In [7]:
from platform import python_version

print(python_version())

3.8.10


In [8]:
import os, time, sys
from util.accelerator_utils import (
    AcceleratorStatus, 
    AcceleratorHelper, 
    DIR_MODE_MAP,
    TokenHelper as th
)

dir_setting = DIR_MODE_MAP.get("kf_notebook")

gpu_status = AcceleratorStatus()
gpu_status.gpu_usage()

In [9]:
gpu_helper = AcceleratorHelper()
# dynamically fetch attached accelerator devices
UUIDs = gpu_helper.nvidia_device_uuids_filtered_by(is_mig=True, log_output=False)

In [10]:
# init all the cuda torch env and model download cache directory
gpu_helper.init_cuda_torch(UUIDs, dir_setting)

print(os.environ["CUDA_VISIBLE_DEVICES"])
print(os.environ["XDG_CACHE_HOME"])

MIG-9579f618-ddae-5958-9285-3207382f0b36
/home/jovyan/llm-models/core-kind/yinwang/models


In [11]:
model_map = {
    "llama7B-chat":     "meta-llama/Llama-2-7b-chat-hf",
    "llama13B-chat" :   "meta-llama/Llama-2-13b-chat-hf",
    "llama70B-chat" :   "meta-llama/Llama-2-70b-chat-hf",
    # "70B" : "meta-llama/Llama-2-70b-hf"
    "mistral7B-01":     "mistralai/Mistral-7B-v0.1",
    "mistral7B-inst02": "mistralai/Mistral-7B-Instruct-v0.2",
    "mixtral8x7B-01":   "mistralai/Mixtral-8x7B-v0.1",
    "mixtral8x7B-inst01":   "mistralai/Mixtral-8x7B-Instruct-v0.1", 
}

default_model_type = "mistral7B-01"
default_dir_mode = "mac_local"

In [12]:
# model_type = default_model_type
# model_type = "mistral7B-inst02"
# model_type = "mixtral8x7B-01"
model_type = "mixtral8x7B-inst01"
# model_type = "llama13B-chat"

# fall back to the default model; "7B" is not a valid model id
model_name = model_map.get(model_type, model_map[default_model_type])

print(model_name)

mistralai/Mixtral-8x7B-Instruct-v0.1


In [13]:
import transformers
import torch
# from transformers import pipeline
from transformers import AutoModelForCausalLM, AutoTokenizer
print(transformers.__version__)
print(torch.__version__)

4.36.2
2.1.2+cu118


In [14]:
"""
Load the token
"""
def gen_token_kwargs():
    # call method from token helper class
    if th.need_token(model_type):
        # kwargs = {"use_auth_token": get_token(dir_setting)}
        token_kwargs = {
            "token": th.get_token(dir_setting),
            # "truncation_side": "left",
            # "return_tensors": "pt",            
                        }
        print("huggingface token loaded")
    else:
        token_kwargs = {}
        print("huggingface token is NOT needed")
    return token_kwargs

token_kwargs = gen_token_kwargs()

huggingface token is NOT needed


In [15]:
from util.accelerator_utils import AcceleratorStatus

gpu_status = AcceleratorStatus.create_accelerator_status()
gpu_status.gpu_usage()

num_of_gpus: 1
--------------------
Device name      : NVIDIA A100 80GB PCIe MIG 3g.40gb 
Device idx       : 0 
No. of processors: 42
Physical  memory : 39.250000 GB
Reserved  memory : 0.000000 GB
Allocated memory : 0.000000 GB
Free      memory : 0.000000 GB
--------------------


In [16]:
f'{int(torch.cuda.mem_get_info()[0]/1024**3)-2}GB'

'37GB'
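
The expression above reserves 2 GB of headroom below the free device memory. A small sketch of the same arithmetic as a reusable helper follows. `max_memory_budget` is a hypothetical name, and the dict shape (device index mapped to a size string) follows how `max_memory` is usually passed to `from_pretrained`; treat that as an assumption to check against the installed transformers and accelerate versions. The example feeds in the 39.25 GB physical memory reported earlier as a stand-in for the free byte count.

```python
from typing import Dict

GIB = 1024 ** 3


def max_memory_budget(free_bytes: int, headroom_gib: int = 2, device: int = 0) -> Dict[int, str]:
    """Round free memory down to whole GiB and subtract a safety headroom."""
    budget_gib = int(free_bytes / GIB) - headroom_gib
    if budget_gib <= 0:
        raise ValueError("not enough free memory for the requested headroom")
    return {device: f"{budget_gib}GB"}


# 39.25 GiB free -> int() gives 39, minus 2 GiB headroom -> 37
print(max_memory_budget(int(39.25 * GIB)))
```

The headroom leaves space for activations and the KV cache, which are allocated on top of the model weights during generation.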

## Following this approach to load llama2 model with bitsandbytes
* https://github.com/pinecone-io/examples/blob/master/learn/generation/llm-field-guide/llama-2/llama-2-13b-retrievalqa.ipynb

## 4-bit quantization

Load pretrained model first, then the tokenizer.

<table>
    <!-- row 1-->
<tr>
<th>
Llama-2-13b-chat-hf
</th>
<th>
Mixtral-8x7B-v0.1
</th>

<th>
Mixtral-8x7B-Instruct-v0.1
</th>
</tr>
    <!-- row 2 -->
<tr>

<td>
<pre>
num_of_gpus: 1
--------------------
Device name      : NVIDIA A100 80GB PCIe MIG 3g.40gb 
Device idx       : 0 
No. of processors: 42
Physical  memory : 39.250000 GB
Reserved  memory : 7.085938 GB
Allocated memory : 6.809501 GB
Free      memory : 0.276437 GB
--------------------
</pre>
</td>

<td>
<pre>
num_of_gpus: 1
--------------------
Device name      : NVIDIA A100 80GB PCIe MIG 3g.40gb 
Device idx       : 0 
No. of processors: 42
Physical  memory : 39.250000 GB
Reserved  memory : 23.496094 GB
Allocated memory : 23.303491 GB
Free      memory : 0.192603 GB
--------------------
</pre>
</td>

<td>
<pre>
num_of_gpus: 1
--------------------
Device name      : NVIDIA A100 80GB PCIe MIG 3g.40gb 
Device idx       : 0 
No. of processors: 42
Physical  memory : 39.250000 GB
Reserved  memory : 23.496094 GB
Allocated memory : 23.303491 GB
Free      memory : 0.192603 GB
--------------------
</pre>
</td>

</tr>
</table>
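
The allocated memory in the table can be sanity-checked with back-of-envelope arithmetic: at 4 bits per weight, each parameter costs half a byte. The sketch below assumes approximate total parameter counts (about 13.0B for Llama-2-13B and about 46.7B for Mixtral-8x7B), which are assumptions and not values measured in this notebook. It also assumes the table reports memory in GiB. `weight_gib` is a hypothetical helper name.

```python
GIB = 1024 ** 3

# Approximate total parameter counts; assumptions for this estimate only.
APPROX_PARAMS = {
    "Llama-2-13b-chat-hf": 13.0e9,
    "Mixtral-8x7B-Instruct-v0.1": 46.7e9,
}


def weight_gib(num_params: float, bits_per_param: float = 4.0) -> float:
    """Memory for the weights alone: params * bits / 8 bytes, expressed in GiB."""
    return num_params * bits_per_param / 8 / GIB


for name, num_params in APPROX_PARAMS.items():
    print(f"{name}: ~{weight_gib(num_params):.1f} GiB for 4-bit weights")
```

The estimates land somewhat below the allocated figures in the table (6.81 GB and 23.30 GB). That gap is plausible because some tensors, such as embeddings and normalization weights, typically stay in higher precision and the quantization constants add their own overhead, so the estimate is best read as a lower bound.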

In [17]:
%time
# transformers 4.32.1 needs the "token" parameter
# transformers 4.30.x needs the "use_auth_token" parameter
# with torch.no_grad():

from torch import bfloat16
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=bfloat16
)

model = AutoModelForCausalLM.from_pretrained(
  model_name, #'decapoda-research/llama-7b-hf',
  device_map='auto',
  quantization_config=bnb_config,
  # max_memory=f'{int(torch.cuda.mem_get_info()[0]/1024**3)-2}GB',
  **token_kwargs,  
)

CPU times: user 0 ns, sys: 1e+03 ns, total: 1e+03 ns
Wall time: 3.58 µs


Loading checkpoint shards:   0%|          | 0/19 [00:00<?, ?it/s]

In [18]:
gpu_status.gpu_usage()

num_of_gpus: 1
--------------------
Device name      : NVIDIA A100 80GB PCIe MIG 3g.40gb 
Device idx       : 0 
No. of processors: 42
Physical  memory : 39.250000 GB
Reserved  memory : 23.496094 GB
Allocated memory : 23.303491 GB
Free      memory : 0.192603 GB
--------------------


### Llama2 max position embeddings
The default is 2048.
* https://huggingface.co/docs/transformers/model_doc/llama2#transformers.LlamaConfig.max_position_embeddings

Set the max_length for the tokenizer; related transformers issues:
* https://github.com/huggingface/transformers/issues/1791#issuecomment-553397054
* https://github.com/huggingface/transformers/pull/1833


In [19]:
MAX_POSITION_EMBEDDINGS = 3072
MAX_LENGTH = 4096

tokenizer = AutoTokenizer.from_pretrained(
    model_name, 
    device='cpu',
    max_position_embeddings=MAX_LENGTH,
    max_length=MAX_LENGTH,
    **token_kwargs,
    # device_map="auto", # put to GPU
    # use_auth_token=token, #transformer==4.31.0
)

In [20]:
tokenizer

LlamaTokenizerFast(name_or_path='mistralai/Mixtral-8x7B-Instruct-v0.1', vocab_size=32000, model_max_length=1000000000000000019884624838656, is_fast=True, padding_side='left', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>'}, clean_up_tokenization_spaces=False),  added_tokens_decoder={
	0: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	1: AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	2: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}

#### Testing tokenizer
https://huggingface.co/docs/tokenizers/pipeline

In [21]:
inputs=["""Q: Roger has 3 tennis balls. He buys 2 more cans of tennis balls. Each can has 4 tennis balls. How many tennis balls does he have now?
A: Roger started with 3 balls. 2 cans of 4 tennis balls each is 8 tennis balls. 3 + 8 = 11. The answer is 11.
Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?
\n"""]

In [22]:
input_test_encoded = tokenizer.encode(inputs[0])
print(input_test_encoded)

[1, 1186, 28747, 14115, 659, 28705, 28770, 19552, 16852, 28723, 650, 957, 846, 28705, 28750, 680, 277, 509, 302, 19552, 16852, 28723, 7066, 541, 659, 28705, 28781, 19552, 16852, 28723, 1602, 1287, 19552, 16852, 1235, 400, 506, 1055, 28804, 13, 28741, 28747, 14115, 2774, 395, 28705, 28770, 16852, 28723, 28705, 28750, 277, 509, 302, 28705, 28781, 19552, 16852, 1430, 349, 28705, 28783, 19552, 16852, 28723, 28705, 28770, 648, 28705, 28783, 327, 28705, 28740, 28740, 28723, 415, 4372, 349, 28705, 28740, 28740, 28723, 13, 28824, 28747, 415, 18302, 1623, 515, 553, 28705, 28750, 28770, 979, 2815, 28723, 1047, 590, 1307, 28705, 28750, 28734, 298, 1038, 9957, 304, 7620, 28705, 28784, 680, 28725, 910, 1287, 979, 2815, 511, 590, 506, 28804, 13, 13]


In [23]:
response_test_decoded = tokenizer.decode(input_test_encoded)
print(response_test_decoded)

<s> Q: Roger has 3 tennis balls. He buys 2 more cans of tennis balls. Each can has 4 tennis balls. How many tennis balls does he have now?
A: Roger started with 3 balls. 2 cans of 4 tennis balls each is 8 tennis balls. 3 + 8 = 11. The answer is 11.
Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?




### Inference with transformers pipeline

Reference:
* https://huggingface.co/docs/transformers/pipeline_tutorial

Note:
* batching is not enabled by default, and batching is not necessarily faster for `transformers.pipeline`
* the max_new_tokens set in the pipeline initialization works with langchain.llms.HuggingFacePipeline, but not as a param for the TextGenerationPipeline 

max_new_tokens https://github.com/huggingface/transformers/issues/19358

In [24]:
generator = transformers.pipeline(
    task="text-generation",
    model=model,
    # model=model_name,
    tokenizer=tokenizer, # optional
    # torch_dtype=torch.bfloat16,
    # torch_dtype=torch.float16,
    device_map="auto",
    # max_length=MAX_LENGTH,
    max_length=None, # unset the cap on the total length (prompt + generated tokens)
    max_new_tokens=100, # cap on newly generated tokens (e.g. 200); counted in tokens, not characters
    **token_kwargs,
)

In [25]:
type(generator)

transformers.pipelines.text_generation.TextGenerationPipeline

In [26]:
gpu_status.gpu_usage()

num_of_gpus: 1
--------------------
Device name      : NVIDIA A100 80GB PCIe MIG 3g.40gb 
Device idx       : 0 
No. of processors: 42
Physical  memory : 39.250000 GB
Reserved  memory : 23.496094 GB
Allocated memory : 23.303491 GB
Free      memory : 0.192603 GB
--------------------


## Passing temperature to the generator for each prompt

https://discuss.huggingface.co/t/how-to-set-generation-parameters-for-transformers-pipeline/48837

Llama2 chat agent
https://github.com/pinecone-io/examples/blob/master/learn/generation/llm-field-guide/llama-2/llama-2-70b-chat-agent.ipynb

Only one of `max_length` and `max_new_tokens` needs to be set
https://github.com/huggingface/transformers/issues/19358
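
To make the difference between the two limits concrete: `max_length` caps the prompt plus the generated tokens, while `max_new_tokens` caps only the generated tokens and takes precedence when both are given. The sketch below mirrors those documented semantics in plain Python. It is an illustration, not the transformers implementation, and `effective_total_length` and `new_token_room` are hypothetical helper names.

```python
from typing import Optional


def effective_total_length(
    prompt_tokens: int,
    max_length: Optional[int] = None,
    max_new_tokens: Optional[int] = None,
) -> int:
    """Upper bound on prompt + generated tokens for the given settings."""
    if max_new_tokens is not None:
        # max_new_tokens counts generated tokens only, so the prompt is added on top.
        return prompt_tokens + max_new_tokens
    if max_length is not None:
        # max_length already includes the prompt.
        return max_length
    raise ValueError("set max_length or max_new_tokens")


def new_token_room(prompt_tokens: int, max_length: int) -> int:
    """Tokens left for generation when only max_length is set."""
    return max(max_length - prompt_tokens, 0)


# Hypothetical 150-token prompt.
print(effective_total_length(150, max_new_tokens=100))
print(new_token_room(150, max_length=200))
print(new_token_room(150, max_length=100))
```

The last call shows why `max_length` alone is fragile for long prompts: once the prompt is longer than `max_length`, no room is left for the answer.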

In [27]:
def chat_gen(
    generator: transformers.pipelines.text_generation.TextGenerationPipeline, 
    tokenizer: transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast,
    gpu_status: AcceleratorStatus
):    
    def local(input_prompts: list=[], temperature: float=0.01, max_new_tokens: int=200, verbose: bool=True) -> list:
        """
        do_sample, top_k, num_return_sequences, eos_token_id are generation settings
        of the TextGenerationPipeline
        
        Reference:
        https://huggingface.co/docs/transformers/generation_strategies#customize-text-generation
        """
        start = time.time()
        sequences = generator(
            input_prompts,
            do_sample=True,
            top_k=10,
            num_return_sequences=1,
            eos_token_id=tokenizer.eos_token_id,
            # max_length=200,
            max_new_tokens= max_new_tokens, # 200 # max number of tokens to generate in the output
            temperature=temperature,
            repetition_penalty=1.2  # without this output begins repeating
        )
        # for seq in sequences:
        #     print(f"Result: \n{seq['generated_text']}")
        
        batch_result = []
        for prompt_result in sequences: # passed a list of prompt
            result = []
            for seq in prompt_result: # 
                result.append(f"Result: \n{seq['generated_text']}")
            batch_result.append(result)
            
        end = time.time()
        duration = end - start
        
        if verbose:
            for prompt_result in batch_result:
                for result in prompt_result:
                    print("promt-response")
                    print(result)
            print("-"*20)
            print(f"walltime: {duration} in secs.")
            gpu_status.gpu_usage()
            
        return batch_result   
    return local
    
chat = chat_gen(generator, tokenizer, gpu_status)

In [28]:
# DEBUG = True
# def print_answer(answer: list)-> None:
#     if DEBUG:
#         print("-"*10)
#         print(answer[0])
#         print("-"*10)
#         print(answer[0].split("\n")[-1])   

#### Free pytorch gpu memory
* https://discuss.pytorch.org/t/how-to-delete-a-tensor-in-gpu-to-free-up-memory/48879/5
* https://discuss.huggingface.co/t/clear-gpu-memory-of-transformers-pipeline/18310
* https://saturncloud.io/blog/how-to-free-up-all-memory-pytorch-is-taking-from-gpu-memory/
* https://discuss.pytorch.org/t/how-to-free-the-pytorch-transformers-model-from-gpu-memory/132968
* https://stackoverflow.com/questions/70508960/how-to-free-gpu-memory-in-pytorch

#### Huggingface pipelines
* https://huggingface.co/docs/transformers/main_classes/pipelines
* clean cuda torch gpu: https://stackoverflow.com/questions/55322434/how-to-clear-cuda-memory-in-pytorch

In [29]:
# import gc
# def free_memory_gen(
#     generator: transformers.pipelines.text_generation.TextGenerationPipeline, 
#     tokenizer: transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast):
#     """
#     """
#     def local():
#         l_generator = generator
#         l_tokenizer = tokenizer
#         #l_generator.cpu()
#         #l_tokenizer.cpu()
#         # model.cpu()
        
#         del l_tokenizer, l_generator
#         gc.collect()
#         torch.cuda.empty_cache()
#         #for device_idx in range(torch.cuda.device_count()):
#         #    print(device_idx)
#         #    device = torch.device(f"cuda:{device_idx}")
#         #    device.reset()
#     return local    

# free_memory = free_memory_gen(generator, tokenizer)    

In [30]:
class PromptHelper():
    """
    mistral instruction example:
    https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2
    llama2 instruction examples:
    https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF
    """
    meta = "meta-llama"
    mistral = "mistralai"
    INST_MSG_MAP = {
        mistral: """<s>[INST] You are a helpful, respectful and honest assistant.
Always answer as helpfully as possible using the context text provided.
Your answers should only answer the question once and not have any text after the answer is done.\n
If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct.
If you don't know the answer to a question, please don't share false information. Just return \"</s>\"
""",
        meta: """[INST]<<SYS>>You are a helpful, respectful and honest assistant.
Always answer as helpfully as possible using the context text provided.
Your answers should only answer the question once and not have any text after the answer is done.\n
If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct.
If you don't know the answer to a question, please don't share false information.<</SYS>>
"""
    }

    def __init__(self, model_name):
        self.model_name = model_name

    
    def gen_prompt(self, query: str) -> str:
        # use the instance's model name, not the notebook-global `model_name`
        if self.model_name.startswith(self.meta):
            inst_msg = self.INST_MSG_MAP[self.meta]
        elif self.model_name.startswith(self.mistral):
            inst_msg = self.INST_MSG_MAP[self.mistral]
        else:
            inst_msg = ""

        # `and` (not `or`) so that query=None skips len() instead of raising a TypeError
        prompt = f"""{inst_msg}\n{query}\n[/INST]""" if query is not None and len(query) > 0 else f"""{inst_msg}\n[/INST]"""
        return prompt
    
    def get_inst_msg(self) -> str:
        return self.gen_prompt(None)
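
For comparison, here is a minimal sketch of the instruction layouts as I understand them from the model cards linked in the docstring: Llama 2 chat wraps the system text in `<<SYS>>` tags inside the first `[INST]` block, while Mistral instruct models take a single `[INST] ... [/INST]` block without a separate system section. The function names are hypothetical, and the BOS token `<s>` is deliberately left out on the assumption that the tokenizer adds it during encoding.

```python
def llama2_chat_prompt(system: str, user: str) -> str:
    """Single-turn Llama 2 chat layout with a system block."""
    return f"[INST] <<SYS>>\n{system}\n<</SYS>>\n\n{user} [/INST]"


def mistral_instruct_prompt(system: str, user: str) -> str:
    """Single-turn Mistral instruct layout; system text is prepended to the user text."""
    body = f"{system}\n\n{user}" if system else user
    return f"[INST] {body} [/INST]"


system_msg = "Answer using only the provided context."
question = "How many apples does the cafeteria have?"
print(llama2_chat_prompt(system_msg, question))
print(mistral_instruct_prompt(system_msg, question))
```

Keeping the closing `[/INST]` on the same line as the question and leaving BOS to the tokenizer avoids a doubled `<s>`, which the decoded test prompt earlier shows the tokenizer already prepends.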

In [31]:
from functools import partial

prompt_helper = PromptHelper(model_name)

def get_inputs_by_model(idx, inputs, prompt_helper):
    print(prompt_helper.model_name)
    # generate a model dependent prompt with appropriate sys instruction message
    return prompt_helper.gen_prompt(inputs[idx])

get_inputs = partial(get_inputs_by_model, inputs=inputs, prompt_helper=prompt_helper)
print(get_inputs(0))

mistralai/Mixtral-8x7B-Instruct-v0.1
<s>[INST] You are a helpful, respectful and honest assistant.
Always answer as helpfully as possible using the context text provided.
Your answers should only answer the question once and not have any text after the answer is done.

If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct.
If you don't know the answer to a question, please don't share false information. Just return "</s>"

Q: Roger has 3 tennis balls. He buys 2 more cans of tennis balls. Each can has 4 tennis balls. How many tennis balls does he have now?
A: Roger started with 3 balls. 2 cans of 4 tennis balls each is 8 tennis balls. 3 + 8 = 11. The answer is 11.
Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?


[/INST]


In [32]:
verbose = True
batch_answers = chat(inputs, temperature=0.01, max_new_tokens = 80, verbose=verbose)
if not verbose:
    prompt_0_results = batch_answers[0]
    print(prompt_0_results[0])
    
# note: the expected answer is 9

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


promt-response
Result: 
Q: Roger has 3 tennis balls. He buys 2 more cans of tennis balls. Each can has 4 tennis balls. How many tennis balls does he have now?
A: Roger started with 3 balls. 2 cans of 4 tennis balls each is 8 tennis balls. 3 + 8 = 11. The answer is 11.
Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?

A: They started with 23 apples. Using 20 leaves them with 3 apples left over. Adding the 6 new apples gives a total of 9 apples.
--------------------
walltime: 6.0250701904296875 in secs.
num_of_gpus: 1
--------------------
Device name      : NVIDIA A100 80GB PCIe MIG 3g.40gb 
Device idx       : 0 
No. of processors: 42
Physical  memory : 39.250000 GB
Reserved  memory : 23.853516 GB
Allocated memory : 23.311425 GB
Free      memory : 0.542090 GB
--------------------


## Huggingface with Local LLM

* https://python.langchain.com/docs/integrations/llms/huggingface_pipelines

HuggingFacePipeline from langchain needs pydantic>=1.10.13

```python
import pydantic
print(pydantic.__version__)
```
* https://stackoverflow.com/questions/76313592/import-langchain-error-typeerror-issubclass-arg-1-must-be-a-class

In [33]:
# HuggingFacePipeline version 0.0.313 needs pydantic >= 1.10.13
# HuggingFacePipeline works in version 0.0.312 with pydantic <= 1.10.2

# !{sys.executable} -m pip install --user --upgrade langchain==0.0.341
# !{sys.executable} -m pip install --user --upgrade langchain==0.0.312

In [34]:
import pydantic
pydantic.__version__

'1.10.13'

In [35]:
# import os
# !{sys.executable} -m pip install --user --upgrade pydantic==1.10.13

In [36]:
import langchain
# till 0.0.350
# from langchain.llms.huggingface_pipeline import HuggingFacePipeline
# from 0.0.354
from langchain_community.llms.huggingface_pipeline import HuggingFacePipeline

print(langchain.__version__)

0.1.1


### Init a HuggingFacePipeline with pipeline_kwargs

https://github.com/langchain-ai/langchain/issues/8280#issuecomment-1652085694

In [37]:
# from langchain.llms import HuggingFacePipeline
# from transformers import AutoModelForCausalLM, AutoTokenizer

# model_id  = "TheBloke/wizardLM-7B-HF"
# tokenizer = AutoTokenizer.from_pretrained(model_id)
# model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

# hf = HuggingFacePipeline.from_model_id(
#     model_id=model_id,
#     task="text-generation",
#     model_kwargs={"trust_remote_code": True},
#     pipeline_kwargs={
#         "model": model,
#         "tokenizer": tokenizer,
#         "device_map": "auto",
#         "max_new_tokens": 1200,
#         "temperature": 0.3,
#         "top_p": 0.95,
#         "repetition_penalty": 1.15,
#     },
# )
# print(hf)

In [38]:
"""
this hack of the partial function doesn't work, since the partial returns a Partial obj and not a Pipeline obj.
"""
# from functools import partial
# hg_pipeline = partial(generator, max_new_tokens=80, temperature=0.1, repetition_penalty=1.1, device_map="auto")
# llm = HuggingFacePipeline(
#     pipeline=hg_pipeline 
# )

"\nthis hack of the partial function doesn't work, since the partial returns a Partial obj and not a Pipeline obj.\n"

In [39]:
llm = HuggingFacePipeline(
    pipeline=generator 
)

print(llm)
print(llm.pipeline.model)

HuggingFacePipeline
Params: {'model_id': 'gpt2', 'model_kwargs': None, 'pipeline_kwargs': None}
MixtralForCausalLM(
  (model): MixtralModel(
    (embed_tokens): Embedding(32000, 4096)
    (layers): ModuleList(
      (0-31): 32 x MixtralDecoderLayer(
        (self_attn): MixtralAttention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): MixtralRotaryEmbedding()
        )
        (block_sparse_moe): MixtralSparseMoeBlock(
          (gate): Linear4bit(in_features=4096, out_features=8, bias=False)
          (experts): ModuleList(
            (0-7): 8 x MixtralBLockSparseTop2MLP(
              (w1): Linear4bit(in_features=4096, out_features=14336, bias=False)
              (w2): Linear4bit(in_

In [40]:
# there is a bug: HuggingFacePipeline does not pick up the params directly
# https://github.com/langchain-ai/langchain/issues/8280

# this must be set for the generator (HuggingFacePipeline) to work
llm.model_id = model_name
pipeline_kwargs_config = {
    # "do_sample": True, # also making trouble with langchain (optional)
    # "top_k": 10, # this param result in trouble with langchain (optional)
    # "num_return_sequences": 1, # (optional)
    # "eos_token_id": tokenizer.eos_token_id, # also making trouble (optional)
    "device_map": "auto",
    "max_length": MAX_LENGTH, # deactivate to use max_new_tokens
    # "max_length": None, # deactivate to use max_new_tokens
    "max_new_tokens": 80, # this is not taken by the model ?
    "temperature": 0.01,
    # "top_p": 0.95, # nucleus sampling: sample only from the smallest token set whose cumulative probability reaches top_p
    "repetition_penalty": 1.15, # 1.15,
}
model_kwargs_config = {
    "do_sample": True, # also making trouble with langchain (optional)
    "top_k": 3, # this param result in trouble with langchain (optional)
    "num_return_sequences": 1, # (optional)
    "eos_token_id": tokenizer.eos_token_id, # also making trouble (optional)
    # "device_map": "auto",
    "max_length": MAX_LENGTH, # deactivate to use max_new_tokens
    # "max_length": None, # deactivate to use max_new_tokens
    "max_new_tokens": 80, # this is not taken by the model ?
    "temperature": 0.01,
    "top_p": 0.8, # 0.95; nucleus sampling: sample only from the smallest token set whose cumulative probability reaches top_p
    "repetition_penalty": 1.15, # 1.15,
}
llm.model_kwargs = model_kwargs_config
llm.model_kwargs["trust_remote_code"] = True
llm.pipeline_kwargs = pipeline_kwargs_config

In [41]:
gpu_status.gpu_usage()

num_of_gpus: 1
--------------------
Device name      : NVIDIA A100 80GB PCIe MIG 3g.40gb 
Device idx       : 0 
No. of processors: 42
Physical  memory : 39.250000 GB
Reserved  memory : 23.853516 GB
Allocated memory : 23.311425 GB
Free      memory : 0.542090 GB
--------------------


In [42]:
# print(llm.pipeline.model.name_or_path)
# print(llm.model_id)
# print(llm.model_kwargs)
# print(llm.pipeline_kwargs)

## Simple local LLM call from langchain API

This section tests calling a local TextGenerationPipeline through the langchain API.

https://github.com/langchain-ai/langchain/discussions/8383


In [43]:
llm

HuggingFacePipeline(pipeline=<transformers.pipelines.text_generation.TextGenerationPipeline object at 0x7f7822d8ba90>, model_id='mistralai/Mixtral-8x7B-Instruct-v0.1', model_kwargs={'do_sample': True, 'top_k': 3, 'num_return_sequences': 1, 'eos_token_id': 2, 'max_length': 4096, 'max_new_tokens': 80, 'temperature': 0.01, 'top_p': 0.8, 'repetition_penalty': 1.15, 'trust_remote_code': True}, pipeline_kwargs={'device_map': 'auto', 'max_length': 4096, 'max_new_tokens': 80, 'temperature': 0.01, 'repetition_penalty': 1.15})

In [44]:
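from functools import wraps
from typing import Callable


def time_func(f: Callable):
    """Decorator that prints the wall time of each call to f."""
    @wraps(f)
    def inner(*args, **kwargs):
        start = time.time()
        result = f(*args, **kwargs)
        end = time.time()
        duration = end - start
        print("="*20)
        print(f"walltime: {duration} in secs.")
        print("="*20)
        return result  # pass the wrapped function's return value through
    return inner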


@time_func
def chat_llm(prompt: str):
    print(llm(prompt))
    # gpu_status.gpu_usage()

In [45]:
# %time
"""
When the same math question is asked several times, the LLM occasionally gets it wrong.

Example of wrong answer:
A: The cafeteria started with 23 apples. They used 20 to make lunch, leaving 3 apples. \n
Then, they bought 6 more, bringing the total to 23 + 6 = 29 apples. The answer is 29.

A: 23 - 20 = 3. They have 3 apples.
Q: A bookshelf has 12 books. If they put 4 more books on it, how many books are on the shelf?
A: 12 + 4 = 16. There are 16 books on the shelf.
"""

# repeat = 10 
repeat = 5
for _ in range(repeat): # is there a CPU bottleneck here? for some reason, when called twice, the model loses the context and hallucinates.
    # chat_llm(prompt=inputs[0])
    chat_llm(prompt=get_inputs(0))

mistralai/Mixtral-8x7B-Instruct-v0.1


  warn_deprecated(
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


 The cafeteria started with 23 apples. They used 20, so we subtract those from the original 23, which is 3. Then they bought 6 more, so we add those to the remaining 3, which is 9. The answer is 9.
walltime: 8.301583290100098 in secs.
mistralai/Mixtral-8x7B-Instruct-v0.1


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


 The cafeteria started with 23 apples. They used 20, so we subtract those from the original 23, which is 3. Then they bought 6 more, so we add those to the remaining 3, which is 9. The answer is 9.
walltime: 8.119031429290771 in secs.
mistralai/Mixtral-8x7B-Instruct-v0.1


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


 The cafeteria started with 23 apples. They used 20, so we subtract those from the original 23, which is 3. Then they bought 6 more, so we add those to the remaining 3, which is 9. The answer is 9.
walltime: 8.278198003768921 in secs.
mistralai/Mixtral-8x7B-Instruct-v0.1


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


 The cafeteria started with 23 apples. They used 20, so we subtract those from the original 23, which is 3. Then they bought 6 more, so we add those to the remaining 3, which is 9. The answer is 9.
walltime: 8.00636625289917 in secs.
mistralai/Mixtral-8x7B-Instruct-v0.1
 The cafeteria started with 23 apples. They used 20, so we subtract those from the original 23, which is 3. Then they bought 6 more, so we add those to the remaining 3, which is 9. The answer is 9.
walltime: 8.085177898406982 in secs.


## Sequential Doc Chain

https://github.com/langchain-ai/langchain/discussions/8383

In [46]:
from langchain.chains import RetrievalQA
from langchain.document_loaders import S3DirectoryLoader, S3FileLoader
from langchain.vectorstores import DocArrayInMemorySearch
from langchain.indexes import VectorstoreIndexCreator
from langchain.embeddings import HuggingFaceEmbeddings, HuggingFaceInstructEmbeddings
# from langchain.text_splitter import TextSplitter
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_core.documents.base import Document
from util.objectstore_utils import S3PdfObjHelper
from langchain.prompts import PromptTemplate

from typing import List

In [47]:
import boto3
print(boto3.__version__)

1.34.14


In [48]:
bucket_name = "scivias-medreports"
file_prefix = "KK-SCIVIAS"
prefix = f"{S3PdfObjHelper.DataContract.key_lead}/{file_prefix}"
access_key_id = os.environ.get('AWS_ACCESS_KEY_ID')
secret_access_key = os.environ.get('AWS_SECRET_ACCESS_KEY')
s3_endpoint = os.environ.get('S3_ENDPOINT')
# VERIFY = False
VERIFY = True
print(prefix)

trans2en/KK-SCIVIAS


In [49]:
loader = S3DirectoryLoader(bucket=bucket_name,
                           prefix=prefix, 
                           aws_access_key_id=access_key_id, 
                           aws_secret_access_key=secret_access_key,
                           endpoint_url=s3_endpoint,
                           verify = VERIFY,
                           use_ssl = True)

In [50]:
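%time
# NOTE: `%time` on its own line only times an empty statement (hence the microsecond timings below);
# put `%%time` on the first line of the cell to time the whole cell.
# loader.load() is a synchronous call;
# a custom call using an iterator would be needed to load asynchronously
# this will download nltk for tokenizers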
data = loader.load()

CPU times: user 7 µs, sys: 3 µs, total: 10 µs
Wall time: 22.2 µs


In [51]:
class DocMetaInfo():
    def __init__(self, doc: Document):
        self.read_meta(doc)
    
    
    def read_meta(self, doc: Document):
        file_content = doc.page_content
        self.source = doc.metadata['source']
        self.name = self.source.split("/")[-1]
        # whitespace-separated words, a rough proxy for the model token count
        self.token_size = len(file_content.split())
        self.character_size = len(file_content)
        
        
    def __str__(self):
        """call by print"""
        return f"source:{self.source}\nname:{self.name}\ntokens:{self.token_size}\ncharacters:{self.character_size}"
    
    
    def __repr__(self):
        """convert obj to string, called by jupyter cell"""
        return self.__str__()

        
def print_s3_obj_info(data: List[Document], idx: int, show_content: bool = False):
    if (data is not None):
        n = len(data)
        print(f"total objects: {n}")
        print("="*20)
        if -n <= idx < n: # in range of list idx
            # use the idx parameter instead of the global file_idx
            meta_info = DocMetaInfo(data[idx])
            file_content = data[idx].page_content
            
            print(f"s3 key     :{meta_info.source}")
            print(f"obj name   :{meta_info.name}")
            print(f"token size :{meta_info.token_size}")
            print(f"char. size :{meta_info.character_size}")
            if show_content:
                print("-"*20)
                print(file_content)

In [52]:
docs_meta_list = [DocMetaInfo(doc) for doc in data]

In [53]:
# enumerate yields (index, element) tuples; x[1] is the DocMetaInfo(doc)
# https://stackoverflow.com/questions/16945518/finding-the-index-of-the-value-which-is-the-min-or-max-in-python/16945868#16945868
idx_of_max_token, doc_meta = max(enumerate(docs_meta_list), key=lambda x: x[1].token_size)
print(doc_meta)

source:s3://scivias-medreports/trans2en/KK-SCIVIAS-00070^0054672400^2021-03-01^KIIS4.txt
name:KK-SCIVIAS-00070^0054672400^2021-03-01^KIIS4.txt
tokens:4568
characters:29060


In [54]:
docs_meta_list[idx_of_max_token]

source:s3://scivias-medreports/trans2en/KK-SCIVIAS-00070^0054672400^2021-03-01^KIIS4.txt
name:KK-SCIVIAS-00070^0054672400^2021-03-01^KIIS4.txt
tokens:4568
characters:29060

In [55]:
min(enumerate(docs_meta_list), key=lambda x: x[1].token_size)

(226,
 source:s3://scivias-medreports/trans2en/KK-SCIVIAS-00182^0054877490^2021-06-28^KIIHORMO.txt
 name:KK-SCIVIAS-00182^0054877490^2021-06-28^KIIHORMO.txt
 tokens:516
 characters:3496)

In [56]:
file_idx = 0 # ID 0003 has weight 43.2 kg
# file_idx = 1
# file_idx = idx_of_max_token
show_content = False
# show_content = True

In [57]:
print_s3_obj_info(data, file_idx, show_content=show_content)

total objects: 246
s3 key     :s3://scivias-medreports/trans2en/KK-SCIVIAS-00003^0053360847^2018-09-28^KIIGAS.txt
obj name   :KK-SCIVIAS-00003^0053360847^2018-09-28^KIIGAS.txt
token size :1030
char. size :6977


### Langchain text splitter

* https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/recursive_text_splitter

In [58]:
CHUNK_SIZE = (MAX_POSITION_EMBEDDINGS // 1000) * 1000
print(CHUNK_SIZE)

3000


In [59]:
text_splitter = RecursiveCharacterTextSplitter(
    # chunk size is counted in characters here (length_function = len), not in tokens
    chunk_size = CHUNK_SIZE,
    chunk_overlap  = 20,
    length_function = len,
    is_separator_regex = False,
)

In [60]:
# Optional test of RecursiveCharacterTextSplitter on \n and other chars
test_text = data[file_idx].page_content
text_split_list = text_splitter.split_text(test_text)
print(len(text_split_list))
for seg in text_split_list:
    print(len(seg))
# print(text_split_list[-1])    

3
2956
2904
1113
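
To illustrate what `chunk_size` and `chunk_overlap` mean, here is a minimal sketch of a fixed-width character splitter. It is not the algorithm of `RecursiveCharacterTextSplitter`, which first splits on separators such as paragraphs and lines and then merges the pieces, so its chunks usually end up shorter than `chunk_size` (as in the output above). The helper `chunk_text` and the sample string are hypothetical.

```python
from typing import List


def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # stop once the current window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


# made-up sample text of 50 characters
sample = "".join(str(i % 10) for i in range(50))
chunks = chunk_text(sample, chunk_size=20, chunk_overlap=5)
chunk_lengths = [len(c) for c in chunks]
print(chunk_lengths)
```

The overlap keeps a little shared context between neighbouring chunks, so a sentence cut at a chunk border is still partly visible in the next chunk.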


### Langchain embeddings

use sentence-transformers  

* all-MiniLM-L12-v2 : 134MB https://huggingface.co/sentence-transformers/all-MiniLM-L12-v2 
* all-MiniLM-L6-v2 : 90MB https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2/tree/main

Llama2 does not support document embedding by default
* https://stackoverflow.com/questions/77037897/how-to-create-an-embeddings-model-in-langchain

HuggingFaceEmbedding embed_documents example
* https://python.langchain.com/docs/modules/data_connection/text_embedding/

The in-memory vector store needs DocArray
* https://python.langchain.com/docs/integrations/providers/docarray

TextEmbeddings in LangChain
* https://python.langchain.com/docs/modules/data_connection/text_embedding/

Sentence-transformers
* https://medium.com/@madhur.prashant7/demo-langchain-rag-demo-on-llama-2-7b-embeddings-model-using-chainlit-559c10ce3fbf

In [61]:
# import sys
# !{sys.executable} -m pip install sentence-transformers==2.2.2

In [62]:
embed_model_map = {
    "sentence-transformers": "sentence-transformers/all-MiniLM-L12-v2", # 384
    "baai" : "BAAI/bge-base-en-v1.5" # 768 embedding dims
}

In [63]:
# embed_model_vendor = "sentence-transformers"
embed_model_vendor = "baai"

In [64]:
embed_model_name = embed_model_map[embed_model_vendor]
print(embed_model_name)

BAAI/bge-base-en-v1.5


In [65]:
embed_model_name 

'BAAI/bge-base-en-v1.5'

In [66]:
# MODEL_CACHE_DIR

In [67]:
model_kwargs = {'device': 'cpu'}
# model_kwargs = {'device_map': "auto",}
# encode_kwargs = {'normalize_embeddings': False}
encode_kwargs = {'normalize_embeddings': True} # for the cosine similarity search

# the model is downloaded to the "{MODEL_CACHE_DIR}/models/torch/sentence_transformer" folder
embed_model = HuggingFaceEmbeddings(
    model_name=embed_model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs
)

In [68]:
test_docs_list = [
        "Hi there!",
        "Oh, hello!",
        "What's your name?",
        "My friends call me World",
        "Hello World!"
    ]

def embed_vec_dim(embeddings):
    return len(embeddings), len(embeddings[0])

def embed_docs_test(model: HuggingFaceEmbeddings, docs_list: list):
    """Embed a list of texts and return one embedding vector per text."""
    embeddings = model.embed_documents(
        docs_list
    )
    return embeddings

embeddings = embed_docs_test(embed_model, test_docs_list)
print(embed_vec_dim(embeddings))

(5, 768)


In [69]:
# print(text_split_list)

embeddings = embed_docs_test(embed_model, text_split_list)
print(embed_vec_dim(embeddings))
# print(embeddings[0])
# print(embeddings[-1])

(3, 768)
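
The `normalize_embeddings: True` setting matters for the similarity search because, for unit-length vectors, the dot product equals the cosine similarity. The following sketch checks this with two small made-up vectors and the standard library only; the helper names are hypothetical and not part of LangChain or sentence-transformers.

```python
import math
from typing import List


def dot(a: List[float], b: List[float]) -> float:
    """Plain dot product of two vectors of equal length."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec: List[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# made-up two-dimensional "embeddings"
a = [3.0, 4.0]
b = [4.0, 3.0]

cos_ab = cosine_similarity(a, b)
dot_normalized = dot(l2_normalize(a), l2_normalize(b))
print(cos_ab, dot_normalized)
```

With normalized embeddings, a vector store can therefore rank documents with a plain dot product and still obtain the cosine ranking.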


## Langchain local LLM RAG

Langchain Vectorstore and RAG approach differences:
* https://github.com/langchain-ai/langchain/issues/5328

Langchain RetrievalQA 
* https://python.langchain.com/docs/use_cases/question_answering/local_retrieval_qa

DocArray
* https://python.langchain.com/docs/integrations/providers/docarray

LLama2 doesn't support Doc Embedding
* https://stackoverflow.com/questions/77037897/how-to-create-an-embeddings-model-in-langchain


In [70]:
# RAG one document
index = VectorstoreIndexCreator(
    vectorstore_cls=DocArrayInMemorySearch,
    embedding=embed_model,
    text_splitter=text_splitter,
    ).from_documents([data[file_idx]])

In [71]:
RETRIEVER_K = 3 # number of chunks to retrieve; with only two chunks there are few "I don't know" answers
retriever = index.vectorstore.as_retriever(search_kwargs={'k': RETRIEVER_K})

In [72]:
# db = DocArrayInMemorySearch.from_documents(
#     [data[file_idx]], embed_model)

# retriever = db.as_retriever()

#### Set the custom template

Set the prompt on the chain object's attribute instead of passing it through kwargs:
https://github.com/langchain-ai/langchain/issues/6635#issuecomment-1659343109

For the `map_reduce` chain type, the reduce prompt can be set with:
```python
qa_chain.combine_documents_chain.reduce_documents_chain.combine_documents_chain.llm_chain.prompt = reduce_prompt_template
```
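
The map-reduce pattern behind these prompts can be sketched in plain Python. This is an illustration under assumptions, not LangChain's implementation: the two `fake_*_llm` functions stand in for real model calls, the patient data is made up, and filtering out `"</s>"` answers before the reduce step is only one possible way to keep "no answer" map results from biasing the final answer.

```python
from typing import Callable, List

END_MARKER = "</s>"  # marker the map prompt asks the model to return when it has no answer


def map_reduce_answer(
    chunks: List[str],
    question: str,
    map_llm: Callable[[str, str], str],
    reduce_llm: Callable[[str, str], str],
) -> str:
    """Ask the question per chunk (map), then combine the useful answers (reduce)."""
    partial_answers = [map_llm(chunk, question).strip() for chunk in chunks]
    # assumption: drop empty answers and end markers before the reduce step
    useful = [a for a in partial_answers if a and a != END_MARKER]
    if not useful:
        return ""
    return reduce_llm("\n".join(useful), question).strip()


def fake_map_llm(chunk: str, question: str) -> str:
    """Hypothetical stand-in for the map LLM call."""
    if "Jane Doe" in chunk:
        return "The name of the patient is Jane Doe."
    return END_MARKER


def fake_reduce_llm(summaries: str, question: str) -> str:
    """Hypothetical stand-in for the reduce LLM call: keep the first summary line."""
    return summaries.splitlines()[0]


# made-up document chunks
chunks = [
    "Patient: Jane Doe, admitted on Monday.",
    "Laboratory values within normal range.",
    "Discharge letter signed by the attending physician.",
]
answer = map_reduce_answer(
    chunks, "What is the name of the patient?", fake_map_llm, fake_reduce_llm
)
no_answer = map_reduce_answer(
    chunks[1:], "What is the name of the patient?", fake_map_llm, fake_reduce_llm
)
print(answer)
```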

In [73]:
# template = """
# Given the following extracted parts of a long document and a question, create a final answer.\n
# If you don't know the answer, just say that you don't know. Don't try to make up an answer.\n\n\n
# =========\n
# QUESTION: {question}\n
# =========\n
# {summaries}\n
# =========\n
# FINAL ANSWER:"""

# reduce_prompt_template = PromptTemplate(template=template, input_variables=['question', 'summaries'])

In [74]:
# map_template = """[INST]<<SYS>>\nYou are a helpful, respectful and honest assistant.
# Always answer as helpfully as possible using the context text provided.
# Your answers should only answer the question once and not have any text after the answer is done.\n\n
# If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct.
# If you don't know the answer to a question, just reply with an empty string '' and please don't share false information. \n<</SYS>>\n\n

# CONTEXT:/n/n {context}/n/n/n

# Question: {question}/n/n

# Only return the helpful answer below and nothing else.
# Helpful answer:
# [/INST]"""

# 
# chatgpt3_end_token = "<|im_end|>"

# llama_end_token = "<|end|>"

#map_template = """[INST]<<SYS>>\nYou are a helpful, respectful and honest assistant.
#Always answer as helpfully as possible using the context text provided.
#Your answers should only answer the question once and not have any text after the answer is done.\n\n
#If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct.
#If you don't know the answer to a question, please don't share false information, just reply with "<|end|>"\n<</SYS>>\n\n
#
#CONTEXT:/n/n {context}/n/n/n
#
#Question: {question}/n/n
#
#Only return the helpful answer below and nothing else.
#Helpful answer:
#[/INST]"""

query_template = """
Context:
{context}

Question: {question}

Only return the helpful answer below and nothing else.
Helpful answer:
"""

map_template = prompt_helper.gen_prompt(query_template)


map_prompt_template = PromptTemplate.from_template(map_template)
map_prompt_template
# Relevant text, if any:

PromptTemplate(input_variables=['context', 'question'], template='<s>[INST] You are a helpful, respectful and honest assistant.\nAlways answer as helpfully as possible using the context text provided.\nYour answers should only answer the question once and not have any text after the answer is done.\n\nIf a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct.\nIf you don\'t know the answer to a question, please don\'t share false information. Just return "</s>"\n\n\nContext:\n{context}\n\nQuestion: {question}\n\nOnly return the helpful answer below and nothing else.\nHelpful answer:\n\n[/INST]')

In [75]:
# reduce_template = """[INST]<<SYS>>\nYou are a helpful, respectful and honest assistant.
# Always answer as helpfully as possible using the context text provided.
# Your answers should only answer the question once and not have any text after the answer is done.\n\n
# If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct.\n
# Ignore "I don't know" or "not provided" context text provided, do not use these as answer.\n
# If there are multiple relevant information in the context text provided, chose the majority of the relevant information as answer.\n
# If you know any answer, which is not "I don't know" or "not provided", chose the relevant information as answer.\n
# If you don't know the answer to a question, please don't share false information. \n<</SYS>>\n\n

# CONTEXT:/n/n {summaries}/n/n/n

# Question: {question}/n/n

# Only return the helpful answer below and nothing else.
# Helpful answer:
# [/INST]"""


# reduce_template = """[INST]<<SYS>>\nYou are a helpful, respectful and honest assistant.
# Always answer as helpfully as possible using the context text provided.
# Your answers should only answer the question once and not have any text after the answer is done.\n\n
# If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct.\n
# If there are multiple information, please summarize and find any information relevant and useful to answer the question.\n
# If you don't know the answer to a question, please don't share false information. \n<</SYS>>\n\n

# CONTEXT:/n/n {summaries}/n/n/n

# Question: {question}/n/n

# Only return the helpful answer below and nothing else.
# Helpful answer:
# [/INST]"""

#reduce_template = """[INST]<<SYS>>\nYou are a helpful, respectful and honest assistant.
#Always answer as helpfully as possible using the context text provided.
#Always summarise the context text provided.
#Your answers should only answer the question once and not have any text after the answer is done.\n\n
#If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct.\n
#If there are multiple information, please summarize and find any information relevant and useful to answer the question.\n
#If you don't know the answer to a question, please don't share false information. \n<</SYS>>\n\n
#
#CONTEXT:/n/n {summaries}/n/n/n
#
#Question: {question}/n/n
#
#Only return the summarised answer below and nothing else.
#Summarised answer:
#[/INST]"""


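# NOTE: "/n" in this template is a literal slash followed by "n", not a newline ("\n" was probably intended).
# It is left unchanged so that the template matches the recorded output below.
reduce_template = """[INST]<<SYS>>\nYou are a helpful, respectful and honest assistant.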
Always answer as helpfully as possible using the context text provided.
Always summarise the context text provided.
Your answers should only answer the question once and not have any text after the answer is done.\n\n
If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct.\n
If there are multiple information, please summarize and find any information relevant and useful to answer the question.\n
If you don't know the answer to a question, please don't share false information just reply with "<|end|>"\n<</SYS>>\n\n

CONTEXT:/n/n {summaries}/n/n/n

Question: {question}/n/n

Only return the summarised answer below and nothing else.
Summarised answer:
[/INST]"""

reduce_prompt_template = PromptTemplate.from_template(reduce_template)
reduce_prompt_template

PromptTemplate(input_variables=['question', 'summaries'], template='[INST]<<SYS>>\nYou are a helpful, respectful and honest assistant.\nAlways answer as helpfully as possible using the context text provided.\nAlways summarise the context text provided.\nYour answers should only answer the question once and not have any text after the answer is done.\n\n\nIf a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct.\n\nIf there are multiple information, please summarize and find any information relevant and useful to answer the question.\n\nIf you don\'t know the answer to a question, please don\'t share false information just reply with "<|end|>"\n<</SYS>>\n\n\n\nCONTEXT:/n/n {summaries}/n/n/n\n\nQuestion: {question}/n/n\n\nOnly return the summarised answer below and nothing else.\nSummarised answer:\n[/INST]')

In [76]:
#refine_init_template = """[INST]<<SYS>>\nYou are a helpful, respectful and honest assistant.
#Always answer as helpfully as possible using the context text provided.
#Your answers should only answer the question once and not have any text after the answer is done.\n\n
#If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct.
#If you don't know the answer to a question, please don't share false information, just reply with "<|end|>"\n<</SYS>>\n\n
#
#Context:/n/n {context_str}/n/n/n
#
#Question: {question}/n/n
#
#Only return the helpful answer below and nothing else.
#Helpful answer:
#[/INST]"""

# <|end|> for llama
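# NOTE: "<|end|>" acts only as a plain-text "no answer" marker here; the end-of-sequence token of Llama 2 is "</s>".
# NOTE: "/n" in this template is a literal slash followed by "n", not a newline ("\n" was probably intended).
refine_init_template = """[INST]<<SYS>>\nYou are a helpful, respectful and honest assistant.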
Always answer as helpfully as possible using the context text provided.
Your answers should only answer the question once and not have any text after the answer is done.\n\n
If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct.
If you don't know the answer to a question, please don't share false information, just reply with "<|end|>"\n<</SYS>>\n\n

Context:/n/n {context_str}/n/n/n

Question: {question}/n/n

Only return the helpful answer below and nothing else.
Helpful answer:
[/INST]"""

init_prompt_template = PromptTemplate.from_template(refine_init_template)
init_prompt_template

PromptTemplate(input_variables=['context_str', 'question'], template='[INST]<<SYS>>\nYou are a helpful, respectful and honest assistant.\nAlways answer as helpfully as possible using the context text provided.\nYour answers should only answer the question once and not have any text after the answer is done.\n\n\nIf a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct.\nIf you don\'t know the answer to a question, please don\'t share false information, just reply with "<|end|>"\n<</SYS>>\n\n\n\nContext:/n/n {context_str}/n/n/n\n\nQuestion: {question}/n/n\n\nOnly return the helpful answer below and nothing else.\nHelpful answer:\n[/INST]')

In [77]:
# https://python.langchain.com/docs/use_cases/question_answering/local_retrieval_qa
# "stuff", "map_reduce", "refine", "map_rerank"

chain_type = "map_reduce"
# chain_type = "stuff"
# chain_type = "refine" 
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type=chain_type,
    retriever=retriever,
    # combine_docs_chain_kwargs={'prompt': reduce_prompt_template},
    # chain_type_kwargs={"map_prompt": map_prompt_template},
    return_source_documents=True,
    verbose=True,
    )
# set the prompt template manually
# the original prompt template can be used for the summary step: the current custom template has the right format, but lacks a one-shot summary example.
# qa_chain.combine_documents_chain.reduce_documents_chain.combine_documents_chain.llm_chain.prompt = reduce_prompt_template

In [78]:
if chain_type == "map_reduce":
    qa_chain.combine_documents_chain.llm_chain.prompt = map_prompt_template
    qa_chain.combine_documents_chain.reduce_documents_chain.combine_documents_chain.llm_chain.prompt = reduce_prompt_template
    # raise token_max from the default of 3000 to MAX_POSITION_EMBEDDINGS
    qa_chain.combine_documents_chain.reduce_documents_chain.token_max = MAX_POSITION_EMBEDDINGS
    
    
if chain_type == "refine":
    # pass
    qa_chain.combine_documents_chain.initial_llm_chain.prompt = init_prompt_template
    # qa_chain.combine_documents_chain.refine_llm_chain.token_max = MAX_POSITION_EMBEDDINGS

### Set token max or max token for the llm
* https://github.com/langchain-ai/langchain/issues/434#issuecomment-1440312002
* https://github.com/langchain-ai/langchain/issues/9341#issuecomment-1681306494

## Set debug mode

In [79]:
# set DEBUG to False to suppress all the LLM answer outputs
# DEBUG=True
DEBUG=False
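
Later cells switch `langchain.debug` on before a chain call and off again afterwards by hand. A context manager is one way to guarantee that the flag is restored even when the call raises. The sketch below is an assumption-laden illustration: `debug_enabled` is a hypothetical helper, and a `SimpleNamespace` stands in for the `langchain` module so the example runs on its own.

```python
from contextlib import contextmanager
from types import SimpleNamespace


@contextmanager
def debug_enabled(module, enabled: bool = True):
    """Temporarily set module.debug and restore the previous value afterwards."""
    previous = getattr(module, "debug", False)
    module.debug = enabled
    try:
        yield module
    finally:
        # runs on normal exit and on exceptions alike
        module.debug = previous


# stand-in for the real langchain module (assumption for this example)
fake_langchain = SimpleNamespace(debug=False)

with debug_enabled(fake_langchain):
    inside = fake_langchain.debug
after = fake_langchain.debug
print(inside, after)
```

In the notebook this would read `with debug_enabled(langchain, DEBUG): response = qa_chain({"query": query})`, assuming `langchain` is imported.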

In [80]:
qa_chain

RetrievalQA(verbose=True, combine_documents_chain=MapReduceDocumentsChain(llm_chain=LLMChain(prompt=PromptTemplate(input_variables=['context', 'question'], template='<s>[INST] You are a helpful, respectful and honest assistant.\nAlways answer as helpfully as possible using the context text provided.\nYour answers should only answer the question once and not have any text after the answer is done.\n\nIf a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct.\nIf you don\'t know the answer to a question, please don\'t share false information. Just return "</s>"\n\n\nContext:\n{context}\n\nQuestion: {question}\n\nOnly return the helpful answer below and nothing else.\nHelpful answer:\n\n[/INST]'), llm=HuggingFacePipeline(pipeline=<transformers.pipelines.text_generation.TextGenerationPipeline object at 0x7f7822d8ba90>, model_id='mistralai/Mixtral-8x7B-Instruct-v0.1', model_kwargs={'do_sample': True, 'top_k': 3, 'num_return_s

In [81]:
# query = "What is the ICD10 diagnosis of the patient? (Remember to include 'The ICD10 diagnosis of the patient is' in your answer.)"

In [82]:
query = "What is the name of the patient? (Remember to include 'The name of the patient is' in your answer.)"

In [83]:
if DEBUG:
    langchain.debug = True
response = qa_chain({"query": query})
if DEBUG:
    langchain.debug = False





> Entering new RetrievalQA chain...


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.



> Finished chain.


In [84]:
if DEBUG:
    print(f"Response: {response['result']}")
    print('-'*20)
    print(data[file_idx])

### PromptParser

In [85]:
from langchain.prompts import ChatPromptTemplate
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain, TransformChain
from langchain.output_parsers import ResponseSchema
from langchain.output_parsers import StructuredOutputParser
# name_query_template = """\
# From the following medical report, extract the following information:

# patient_name: Extract only the name of the patient, \
# do not extract any name that is not patient.
# Answer with the patient name if you find it, "" if not or unknown.

# Format the output as JSON with the following keys:
# patient_name

# report: {text}
# """

# name_query_template = """\
# From the following text, extract the following information:

# patient_name: Extract only the name of the patient, \
# do not extract any name that is not patient.
# Answer with the patient name if you find it, "" if not or unknown.

# Format the output as JSON with the following keys:
# patient_name

# text: {text}
# """

# name_query_template = """\
# From the following text, extract the following information:\

# patient_name: Extract the name of the patient. \
# Answer only one name, answer with the name if you find it, answer None if not or unknown.

# Format the output as JSON with the following keys:
# patient_name

# text:\
# {text}

# {format_instructions}
# """

# name_query_template = """\
# From the following text, extract the following information:

# patient_name: the name of the patient, \
# Answer with the patient name, \"\" if not or unknown.

# {format_instructions}

# text:\
# {text}
# """

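# NOTE: "/n" in this template is a literal slash followed by "n", not a newline ("\n" was probably intended).
# It is left unchanged so that the template matches the recorded outputs.
query_template = """[INST]<<SYS>>\nYou are a helpful, respectful and honest assistant.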
Always answer as helpfully as possible using the context text provided.
Your answers should only answer the question once and not have any text after the answer is done.\n\n
If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct.
If you don't know the answer to a question, please don't share false information. \n<</SYS>>\n\n

CONTEXT:/n/n {text}/n/n/n

Question: {question}/n/n
{format_instructions}/n

[/INST]"""


# name_query_template = """
# retrieve one: patient name, from the following text.\n format response as following \{"patient_name": patient name\}\n Text: {text}"""

# name_schema = ResponseSchema(name="patient_name", description="patient name, \
# Answer with the patient name, \"\" if not or unknown.")
name_schema = ResponseSchema(name="patient_name", description="patient name")

response_schema = [name_schema]
output_parser = StructuredOutputParser.from_response_schemas(response_schema)

### LLama2 prompt style
* https://colab.research.google.com/drive/1hRjxdj53MrL0cv5LOn1l0VetFC98JvGR?usp=sharing#scrollTo=IrVIuygNuBVT

In [86]:
# ## Default LLaMA-2 prompt style
# B_INST, E_INST = "[INST]", "[/INST]"
# B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"
# DEFAULT_SYSTEM_PROMPT = """\
# You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.

# If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information."""

# def get_prompt(instruction, new_system_prompt=DEFAULT_SYSTEM_PROMPT ):
#     SYSTEM_PROMPT = B_SYS + new_system_prompt + E_SYS
#     prompt_template =  B_INST + SYSTEM_PROMPT + instruction + E_INST
#     return prompt_template
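
A runnable variant of the commented-out helper above, as a minimal sketch: it assembles the Llama 2 chat layout `[INST]<<SYS>> ... <</SYS>> ... [/INST]` and uses real newlines (`\n`) rather than the literal `/n` found in the templates of this notebook. The shortened system prompt and the instruction text are illustrative assumptions.

```python
# Llama 2 chat prompt delimiters, as in the commented-out cell
B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"


def get_prompt(instruction: str, system_prompt: str) -> str:
    """Wrap a system prompt and an instruction in the Llama 2 chat layout."""
    return B_INST + B_SYS + system_prompt + E_SYS + instruction + E_INST


# shortened, illustrative system prompt (assumption)
sys_prompt = "You are a helpful, respectful and honest assistant."
# "\n" is a newline escape; "/n" would be inserted literally
instruction = "CONTEXT:\n\n{context}\n\nQuestion: {question}"

prompt = get_prompt(instruction, sys_prompt)
print(prompt)
```

The placeholders `{context}` and `{question}` stay in the string, so the result can be passed to `PromptTemplate.from_template`.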

In [87]:
# sys_prompt = """You are a helpful, respectful and honest assistant. Always answer as helpfully as possible using the context text provided. Your answers should only answer the question once and not have any text after the answer is done.

# If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information. """

# instruction = """CONTEXT:/n/n {context}/n

# Question: {question}"""
# get_prompt(instruction, sys_prompt)

In [88]:
format_instructions = output_parser.get_format_instructions()
print(format_instructions)

The output should be a markdown code snippet formatted in the following schema, including the leading and trailing "```json" and "```":

```json
{
	"patient_name": string  // patient name
}
```


In [89]:
# prompt_template = PromptTemplate.from_template(get_prompt(instruction, sys_prompt))
# prompt_template

In [90]:
# prompt_template = ChatPromptTemplate.from_template(name_query_template) # ChatPromptTemplate create Human and Output in the text
name_question="retrieve one: patient name"
prompt_template = PromptTemplate.from_template(query_template)
prompt_template

PromptTemplate(input_variables=['format_instructions', 'question', 'text'], template="[INST]<<SYS>>\nYou are a helpful, respectful and honest assistant.\nAlways answer as helpfully as possible using the context text provided.\nYour answers should only answer the question once and not have any text after the answer is done.\n\n\nIf a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct.\nIf you don't know the answer to a question, please don't share false information. \n<</SYS>>\n\n\n\nCONTEXT:/n/n {text}/n/n/n\n\nQuestion: {question}/n/n\n{format_instructions}/n\n\n[/INST]")

In [91]:
input_text = response['result'].strip()

In [92]:
# messages = prompt_template.format_prompt(text=input_text, format_instructions=format_instructions)
# messages = prompt_template.format_messages(text=input_text, format_instructions=format_instructions)

In [93]:
chain = LLMChain(prompt=prompt_template, llm=llm)
chain

LLMChain(prompt=PromptTemplate(input_variables=['format_instructions', 'question', 'text'], template="[INST]<<SYS>>\nYou are a helpful, respectful and honest assistant.\nAlways answer as helpfully as possible using the context text provided.\nYour answers should only answer the question once and not have any text after the answer is done.\n\n\nIf a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct.\nIf you don't know the answer to a question, please don't share false information. \n<</SYS>>\n\n\n\nCONTEXT:/n/n {text}/n/n/n\n\nQuestion: {question}/n/n\n{format_instructions}/n\n\n[/INST]"), llm=HuggingFacePipeline(pipeline=<transformers.pipelines.text_generation.TextGenerationPipeline object at 0x7f7822d8ba90>, model_id='mistralai/Mixtral-8x7B-Instruct-v0.1', model_kwargs={'do_sample': True, 'top_k': 3, 'num_return_sequences': 1, 'eos_token_id': 2, 'max_length': 4096, 'max_new_tokens': 80, 'temperature': 0.01, 'top_p': 

In [94]:
if DEBUG:
    langchain.debug = True 
# parser_response = chain.run(context=input_text, question=question, temperature=0.0)
parser_response = chain.run(text=input_text, format_instructions=format_instructions, question=name_question, temperature=0.0)
# parser_response = chain.run(text=input_text, format_instructions=format_instructions, temperature=0.0)
if DEBUG:
    langchain.debug = False

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


In [95]:
import re

def parse_response_dict(parser_response: str, output_parser, verbose: bool=False) -> dict:
    """Extract the fenced block from the LLM response and parse it with output_parser; return {} on failure."""
    # https://stackoverflow.com/questions/24667065/python-regex-difference-between-and/24667099#24667099
    # https://stackoverflow.com/questions/33312175/matching-any-character-including-newlines-in-a-python-regex-subexpression-not-g/33312193#33312193
    # [\s\S]+ is greedy and matches up to the last ```; [\s\S]+? would stop at the first closing ```
    try:
        match = re.search(r"```[\s\S]+```", parser_response)
        if match is None:
            raise ValueError("no fenced code block found in the response")
        post_processed_response = match.group(0)
        if verbose:
            print(post_processed_response)
        output_dict = output_parser.parse(post_processed_response)
        key = output_parser.response_schemas[0].name
        if output_dict.get(key) is None:
            output_dict[key] = ""
    except Exception as e:
        print(e)
        output_dict = {}
    return output_dict

output_parser

patient_name = parse_response_dict(parser_response, output_parser).get("patient_name", "").strip() 
if DEBUG:
    print(f"str response: {parser_response}")
    print(f"patient_name is: {patient_name}")
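
The difference between the greedy and the non-greedy pattern mentioned in the comments can be seen on a response that contains two fenced blocks. This is a standard-library sketch with a made-up response string; it parses the JSON with `json.loads` instead of `StructuredOutputParser`, and the fence is built programmatically only so that the example can sit inside a fenced block itself.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built programmatically

# made-up LLM response with two fenced blocks
response = (
    "Here is the result:\n"
    + FENCE + "json\n"
    + '{"patient_name": "Jane Doe"}\n'
    + FENCE + "\n"
    + "An unrelated second block:\n"
    + FENCE + "\nignored\n" + FENCE + "\n"
)

# greedy: runs from the first fence to the LAST fence, so both blocks are captured
greedy = re.search(FENCE + r"[\s\S]+" + FENCE, response).group(0)

# non-greedy: stops at the first closing fence, so only the JSON block is captured
lazy = re.search(FENCE + r"(?:json)?\s*([\s\S]+?)" + FENCE, response).group(1)
parsed = json.loads(lazy)
print(parsed)
```

When the model is expected to return exactly one fenced block, both patterns give the same result; the non-greedy form is the safer choice if extra blocks may follow.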

### Age question

In [96]:
patient_name=f"{patient_name}" if patient_name is not None else ""
query = f"What is the age of the patient {patient_name}? (Remember to include 'The age of the patient is' in your answer.)"
if DEBUG:
    print(query)

In [97]:
chain_type = "map_reduce"
# chain_type = "stuff"
# chain_type = "refine" 

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type=chain_type,
    retriever=retriever,
    # combine_docs_chain_kwargs={'prompt': reduce_prompt_template},
    # chain_type_kwargs={"map_prompt": map_prompt_template},
    return_source_documents=True,
    verbose=True,
    )

In [98]:
if chain_type == "map_reduce":
    qa_chain.combine_documents_chain.llm_chain.prompt = map_prompt_template
    qa_chain.combine_documents_chain.reduce_documents_chain.combine_documents_chain.llm_chain.prompt = reduce_prompt_template
    # raise token_max from the default of 3000 to MAX_POSITION_EMBEDDINGS
    qa_chain.combine_documents_chain.reduce_documents_chain.token_max = MAX_POSITION_EMBEDDINGS
    
    
if chain_type == "refine":
    # pass
    qa_chain.combine_documents_chain.initial_llm_chain.prompt = init_prompt_template
    # qa_chain.combine_documents_chain.refine_llm_chain.token_max = MAX_POSITION_EMBEDDINGS

In [99]:
qa_chain

RetrievalQA(verbose=True, combine_documents_chain=MapReduceDocumentsChain(llm_chain=LLMChain(prompt=PromptTemplate(input_variables=['context', 'question'], template='<s>[INST] You are a helpful, respectful and honest assistant.\nAlways answer as helpfully as possible using the context text provided.\nYour answers should only answer the question once and not have any text after the answer is done.\n\nIf a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct.\nIf you don\'t know the answer to a question, please don\'t share false information. Just return "</s>"\n\n\nContext:\n{context}\n\nQuestion: {question}\n\nOnly return the helpful answer below and nothing else.\nHelpful answer:\n\n[/INST]'), llm=HuggingFacePipeline(pipeline=<transformers.pipelines.text_generation.TextGenerationPipeline object at 0x7f7822d8ba90>, model_id='mistralai/Mixtral-8x7B-Instruct-v0.1', model_kwargs={'do_sample': True, 'top_k': 3, 'num_return_s

In [100]:
# qa_chain.combine_documents_chain.initial_llm_chain.prompt
# qa_chain.combine_documents_chain.refine_llm_chain.prompt

In [101]:
if DEBUG:
    langchain.debug = True
response = qa_chain({"query": query})
if DEBUG:
    langchain.debug = False



> Entering new RetrievalQA chain...


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.



> Finished chain.


In [102]:
if DEBUG:
    print(f"Response: {response['result']}")
    print('-'*20)
    print(data[file_idx])

In [103]:
# use the default string type; the typed variant is kept for reference
# age_schema = ResponseSchema(name="patient_age", description="patient age", type="int")
age_schema = ResponseSchema(name="patient_age", description="patient age")

# response_schema = [age_schema]
age_output_parser = StructuredOutputParser.from_response_schemas([age_schema])
# age_output_parser

In [104]:
age_question="retrieve one: patient age"
format_instructions = age_output_parser.get_format_instructions()
print(format_instructions)

The output should be a markdown code snippet formatted in the following schema, including the leading and trailing "```json" and "```":

```json
{
	"patient_age": string  // patient age
}
```
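
The format instructions above ask the model to wrap its answer in a fenced `json` snippet. As a rough illustration of what parsing such a reply involves, here is a standard-library sketch. `extract_json_block` is a hypothetical helper, not part of LangChain, and the sample reply is made up for the example.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built at runtime to keep this example readable
_JSON_BLOCK = re.compile(FENCE + r"json\s*(.*?)\s*" + FENCE, re.DOTALL)


def extract_json_block(text: str) -> dict:
    """Return the first fenced json object found in `text`, or {} if none parses."""
    match = _JSON_BLOCK.search(text)
    if match is None:
        return {}
    try:
        parsed = json.loads(match.group(1))
    except json.JSONDecodeError:
        return {}
    # Only a JSON object matches the requested schema.
    return parsed if isinstance(parsed, dict) else {}


# Made-up model reply for illustration.
reply = "Sure.\n" + FENCE + 'json\n{"patient_age": "34"}\n' + FENCE
print(extract_json_block(reply))
```

Returning an empty dict on failure keeps a later `.get("patient_age", "")` lookup safe when the model ignores the format instructions.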


In [105]:
input_text = response['result']

In [106]:
#input_text

In [107]:
if DEBUG:
    langchain.debug = True 
parser_response = chain.run(text=input_text, format_instructions=format_instructions, question=age_question, temperature=0.0)
if DEBUG:
    langchain.debug = False

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


In [108]:
# print(parser_response)

In [109]:
patient_age_obj = parse_response_dict(parser_response, age_output_parser, DEBUG).get("patient_age", "")
# print(patient_age_obj)
# fallback value, so patient_age is always defined even for unexpected types
patient_age = -1
try:
    if isinstance(patient_age_obj, str):
        patient_age = int(patient_age_obj.strip())
    elif isinstance(patient_age_obj, int):
        patient_age = patient_age_obj
except ValueError as e:
    # the parsed string is not a plain integer, keep the fallback
    print(e)

if DEBUG:
    print(f"str response: {parser_response}")
    print(f"patient_age is: {patient_age}")
    print(f"patient_age has type: {type(patient_age)}")
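
The age conversion in the cell above could also be factored into a small helper. The sketch below is one possible variant and not the notebook's original logic: `coerce_age` is a hypothetical name, and unlike the cell above it also accepts replies such as `"34 years"` by taking the first run of digits.

```python
import re
from typing import Any


def coerce_age(value: Any, default: int = -1) -> int:
    """Convert a parsed `patient_age` value to int, or return `default`."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(value, bool):
        return default
    if isinstance(value, int):
        return value
    if isinstance(value, str):
        # Assumption: the first run of digits in the string is the age.
        match = re.search(r"\d+", value)
        if match:
            return int(match.group())
    return default


print(coerce_age(" 34 "), coerce_age("34 years"), coerce_age(None))
```

Using `-1` as the default mirrors the sentinel value already used in this notebook for an age that could not be parsed.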

In [110]:
if DEBUG:
    print(input_text)
    print(parser_response)
    print(f"{patient_name} is {patient_age}")

In [138]:
# print(f"{patient_name} is {patient_age}")

In [112]:
# map_prompt_template = PromptTemplate(template="""
# Use the following portion of a long document to see if any of the text is relevant to answer the question. \n
# Return any relevant text verbatim.\n
# {context}\n
# Question: {question}\n
# Relevant text, if any:""",
# input_variables=['context', 'question'])
# qa_chain.combine_documents_chain.llm_chain.prompt = map_prompt_template

In [113]:
# stuff_llm_template = """Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.\n\n
# {context}\n\n
# Question: {question}\nHelpful Answer:"""

In [114]:
# qa_chain

In [115]:
# query = "What is the weight of the patient in kg? (Remember to include 'The weight of the patient is' in your answer.)"

In [116]:
# if DEBUG:
#     langchain.debug = True
# response = qa_chain({"query": query})
# #response = qa_chain({"query": query})
# if DEBUG:
#     langchain.debug = False

In [117]:
# if DEBUG:
#     print(f"Response: {response['result']}")
#     print('-'*20)
#     print(data[file_idx])

In [118]:
# response = index.query_with_sources(query, llm=llm, retriever_kwargs={"chain_type":"map_reduce"})

### (optional) Additional Read

GPT4All
* https://python.langchain.com/docs/integrations/llms/gpt4all

## Example prompt

In [119]:
context = ""

#### zero shot prompt

In [120]:
# name
input=f"Can you tell me the name of the patient from the following doctor's letter?\nLetter:\n{context}\nAnswer: "

In [121]:
#len(input)
# 6810

In [122]:
# answer=chat(input, print_mode=False)
# print_answer(answer)

In [123]:
#age
input=f"Can you tell me the age of the patient from the following doctor's letter?\nLetter:\n{context}\nAnswer: "

In [124]:
# answer=chat(input, print_mode=False)
# print_answer(answer)

In [125]:
#diagnosis
input=f"Can you tell me the diagnosis of the patient from the following doctor's letter?\nLetter:\n{context}\nAnswer: "

In [126]:
# answer=chat(input, print_mode=False)
# print_answer(answer)

#### Chain-of-thought prompt

In [127]:
# name prompt
input = f"Context: Patient: Fried\nQuestion: what is the name of the patient? \nAnswer: Name of the patient is Fried\nContext: {context}\nQuestion: what is the name of the patient?\nAnswer: the name of patient is"
#print(input)

In [128]:
# answer=chat(input, print_mode=False)
# print_answer(answer)

In [129]:
# age prompt
input = f"Context:\nPatient: Fried is a 34-year-old patient\nQuestion:\nhow old is the patient? \nAnswer:\nFried is a patient, 34 years old, the answer is 34\nContext:\n{context}\nQuestion:\nhow old is the patient?\nAnswer: "
# print(input)

In [130]:
# age prompt
#len(input)
# > 6913 characters (len() counts characters, not tokens)

In [131]:
# answer=chat(input, print_mode=False)
# print_answer(answer)

In [132]:
# diagnosis prompt
input=f"Context:\nPatient: Fried is a 34-year-old patient, Diagnoses: Influenza (J09.X2) \nQuestion:\nWhat diagnoses does the patient have? \nAnswer:\nFried is a patient, 34 years old, with the diagnosis Influenza (J09.X2). The answer is Influenza (J09.X2)\nContext:\n{context}\nQuestion:\nWhat diagnoses does the patient have?\nAnswer: "

In [133]:
# answer=chat(input, print_mode=False)
# print_answer(answer)

In [134]:
# gpu_status.gpu_usage()

In [135]:
# free_memory()