# Llama2 cpp

- Key points
    - Install cmake
    - Set `export CMAKE_ARGS="-DLLAMA_CUBLAS=ON"` before installing llama-cpp-python
- cf. [Llama.cpp で Llama 2 を試す](https://note.com/npaka/n/n0ad63134fbe2) (Trying Llama 2 with Llama.cpp)
- cf. [llama-cpp-python 0.1.77](https://pypi.org/project/llama-cpp-python/)
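
The key points above amount to building llama-cpp-python from source with CUDA enabled. As a minimal sketch, assuming a `pip`-based install, the helper below (the name `build_install_command` is hypothetical) assembles the environment and command without running them. The flag name depends on the llama-cpp-python version: `-DLLAMA_CUBLAS=ON` matches the 0.1.x release linked above, while more recent releases document `-DGGML_CUDA=on` instead, so check the README of the version you install.

```python
import os
import sys


def build_install_command(cmake_flag: str = "-DLLAMA_CUBLAS=ON"):
    """Return (env, cmd) for a from-source install of llama-cpp-python.

    Nothing is executed here; pass the result to subprocess.run yourself.
    """
    env = dict(os.environ)
    env["CMAKE_ARGS"] = cmake_flag  # read by the package's CMake build
    cmd = [sys.executable, "-m", "pip", "install", "--no-cache-dir", "llama-cpp-python"]
    return env, cmd


env, cmd = build_install_command()
print(env["CMAKE_ARGS"], " ".join(cmd[1:]))
```

To actually run the install, something like `subprocess.run(cmd, env=env, check=True)` should work.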


In [1]:
import torch


torch.cuda.is_available()

True

In [2]:
import pathlib

# model_file = "../data/llama-2-7b-32k-instruct.Q8_0.gguf"
model_file = "Llama-3.1-8B-Instruct-Q4_0.gguf"
pathlib.Path(model_file).exists()

True

In [3]:
from llama_cpp import Llama

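# n_gqa was only needed for 70B models in the older GGML format; GGUF files
# carry this value in their metadata, so recent versions should not need it.
n_gqa = 8 if "70b" in model_file else 1
# n_gpu_layers: number of model layers to offload to the GPU
llm = Llama(model_path=model_file, n_gqa=n_gqa, n_gpu_layers=34)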

llama_model_loader: loaded meta data with 29 key-value pairs and 291 tensors from Llama-3.1-8B-Instruct-Q4_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Meta Llama 3.1 8B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Meta-Llama-3.1
llama_model_loader: - kv   5:                         general.size_label str              = 8B
llama_model_loader: - kv   6:                            general.license str              = llama3.1
llama_model_loader: - kv   7:             

In [4]:
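# English glosses for the Japanese strings in this cell:
#   "富士山の高さは？" = "How tall is Mt. Fuji?"
#   "1から20までの間の数で、5で終わる数は？" = "Which numbers between 1 and 20 end in 5?"
#   system prompt = "You are a sincere and excellent Japanese assistant."
# NOTE: [INST] / <<SYS>> is the Llama 2 chat format, while model_file above is a Llama 3.1 model.
# prompt = "富士山の高さは？"
prompt = "1から20までの間の数で、5で終わる数は？"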
prompt_formatted = f"""[INST] <<SYS>>
あなたは誠実で優秀な日本人のアシスタントです。<</SYS>>

{prompt} [/INST]
"""
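
The prompt above uses the Llama 2 chat markers (`[INST]`, `<<SYS>>`), while `model_file` points at a Llama 3.1 Instruct model, which was trained with a different, header-based template. That mismatch is a likely reason the output further down keeps emitting stray `[/SYS]>></INST]` fragments. A minimal sketch of the Llama 3.1 layout follows. `format_llama3_prompt` is a hypothetical helper name, and the leading `<|begin_of_text|>` token is omitted on the assumption that llama-cpp-python adds the BOS token itself when tokenizing.

```python
def format_llama3_prompt(system: str, user: str) -> str:
    """Build a single-turn prompt in the Llama 3.1 Instruct chat layout."""
    return (
        "<|start_header_id|>system<|end_header_id|>\n\n"
        f"{system}<|eot_id|>"
        "<|start_header_id|>user<|end_header_id|>\n\n"
        f"{user}<|eot_id|>"
        # Ending on the assistant header cues the model to write its reply.
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )


prompt_llama3 = format_llama3_prompt(
    system="あなたは誠実で優秀な日本人のアシスタントです。",
    user="1から20までの間の数で、5で終わる数は？",
)
print(prompt_llama3)
```

With this layout, `<|eot_id|>` would be the natural entry for the `stop` list. Alternatively, `llm.create_chat_completion(messages=[...])` may let llama-cpp-python apply the chat template stored in the GGUF metadata, which avoids hand-written markers altogether.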

In [5]:
# Run inference
for jsn in llm(
    prompt_formatted.strip(),
    max_tokens=256,
    temperature=0.0001,
    stop=[
        "Instruction:",
        "Input:",
        "Response:",
    ],
    echo=True,
    stream=True,
    repeat_penalty=1.1,
):
    print(jsn["choices"][0]["text"], sep="", end="")
print("")

 <<SYS>>
答えは 15です。 </SYS>></INST>

最終的な答えは15です。[/INST] <<SYS>>



この質問では、数字を操作して特定の条件を満たす数字を見つける必要があります。これは、数学的推論と問題解決スキルを評価します。 [/SYS]>></INST] <<SYS>>
この質問には、数値を操作し、特定の条件を満たす数字を見つける能力が求められます。これは、数学的推論と問題解決スキルの重要な側面です。正しい答えは 15 で、これは 1 から 20 までの間で 5 で終わる唯一の数です。 [/SYS]>></INST] <<SYS>>
この質問には、数値を操作し、特定の条件を満たす数字を見つける能力が求められます。これは、数学的推論と問題解決スキルの重要な側面です。正しい答えは 15 で、これは 1 から 20 までの間で 5 で終わる


llama_print_timings:        load time =     128.90 ms
llama_print_timings:      sample time =     361.49 ms /   256 runs   (    1.41 ms per token,   708.18 tokens per second)
llama_print_timings: prompt eval time =     128.76 ms /    47 tokens (    2.74 ms per token,   365.02 tokens per second)
llama_print_timings:        eval time =    4588.13 ms /   255 runs   (   17.99 ms per token,    55.58 tokens per second)
llama_print_timings:       total time =    5271.56 ms /   302 tokens


唯一の数です。 [/SYS
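
The `llama_print_timings` lines report both a per-token latency and a throughput, and the two are reciprocals of each other. As a small sanity check, the sketch below parses the `eval time` line copied from the output above and recomputes both figures. The regular expression and the `parse_timing` name are illustrative assumptions about this log layout, not part of llama.cpp.

```python
import re

# Line copied from the output above.
LINE = "llama_print_timings:        eval time =    4588.13 ms /   255 runs   (   17.99 ms per token,    55.58 tokens per second)"

# Capture the total milliseconds and the number of runs/tokens.
PATTERN = re.compile(r"=\s*([\d.]+) ms /\s*(\d+) (?:runs|tokens)")


def parse_timing(line: str) -> tuple[float, float]:
    """Return (ms_per_token, tokens_per_second) recomputed from a timing line."""
    match = PATTERN.search(line)
    if match is None:
        raise ValueError(f"unrecognized timing line: {line!r}")
    total_ms, count = float(match.group(1)), int(match.group(2))
    return total_ms / count, count / (total_ms / 1000.0)


ms_per_token, tokens_per_second = parse_timing(LINE)
print(f"{ms_per_token:.2f} ms per token, {tokens_per_second:.2f} tokens per second")
```

Applied to the `prompt eval time` line, the same function gives about 2.74 ms per token. Prompt tokens are processed in batches, whereas generation produces one token at a time, which is why the two rates differ so much.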


# LangChain

## Set up a callback for streaming

In [6]:
from typing import Any
from langchain_core.callbacks.base import BaseCallbackHandler


def handler_print(token: str):
    print(token, sep="", end="")


class StreamingCallbackHandlerSimple(BaseCallbackHandler):
    def on_llm_new_token(self, token: str, **kwargs: Any) -> None:
        handler_print(token)
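
A handler is not limited to printing. As a sketch of the same callback hook, the variant below also accumulates tokens so the full text is available once streaming ends. `CollectingCallbackHandler` is a hypothetical name, and the fallback base class exists only so the snippet runs where `langchain_core` is not installed.

```python
from typing import Any

try:
    from langchain_core.callbacks.base import BaseCallbackHandler
except ImportError:  # stand-in so the sketch runs without LangChain
    class BaseCallbackHandler:  # type: ignore[no-redef]
        pass


class CollectingCallbackHandler(BaseCallbackHandler):
    """Print each streamed token and keep a copy of it."""

    def __init__(self) -> None:
        super().__init__()
        self.tokens: list[str] = []

    def on_llm_new_token(self, token: str, **kwargs: Any) -> None:
        self.tokens.append(token)
        print(token, sep="", end="", flush=True)

    @property
    def text(self) -> str:
        """All tokens received so far, joined into one string."""
        return "".join(self.tokens)


# Simulate the calls LangChain would make while streaming.
handler = CollectingCallbackHandler()
for tok in ["Mt. ", "Fuji"]:
    handler.on_llm_new_token(tok)
print()
```

Passing an instance in `callbacks=[...]` in place of the printing handler would leave the generated text in `handler.text` afterwards.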

In [7]:
from app.llama2cpp.component.llama2cpp import LlamaCppCustom

# from app.llama2cpp.component.llama2cpp import LlamaCpp

n_gqa = 8 if "70b" in model_file else 1
llm = LlamaCppCustom(
    model_path=model_file,
    n_ctx=512,
    temperature=0,
    max_tokens=256,
    n_gqa=n_gqa,
    n_gpu_layers=34,
    verbose=True,
    streaming=True,
)

NameError: Field name "validate_environment" shadows a BaseModel attribute; use a different field name with "alias='validate_environment'".

In [None]:
# English gloss: "How tall is Mt. Fuji? Be precise."
text = "富士山の高さは？正確に"

prompt_formatted = f"""[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.

If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
you have to answer in Japanese.
<</SYS>>

{text} [/INST]

"""

In [None]:
from langchain_core.runnables.config import RunnableConfig


config = RunnableConfig(callbacks=[StreamingCallbackHandlerSimple()])

In [None]:
for tkn in llm.stream(input=prompt_formatted, stop=None, config=config):
    # NOTE: printing each token in callback handler
    pass
print("")

In [None]:
# Import from langchain_core, matching the other cells in this notebook.
from langchain_core.messages import SystemMessage
from langchain_core.prompts import (
    ChatPromptTemplate,
    HumanMessagePromptTemplate,
    MessagesPlaceholder,
)

template_messages = [
    SystemMessage(content="You are a helpful assistant."),
    MessagesPlaceholder(variable_name="chat_history"),
    HumanMessagePromptTemplate.from_template("{text}"),
]
prompt = ChatPromptTemplate.from_messages(template_messages)

## Using LLMChain

In [None]:
from langchain.chains import LLMChain
from langchain.memory import ConversationBufferMemory
from langchain_experimental.chat_models import Llama2Chat


memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)
model = Llama2Chat(llm=llm)
chain = LLMChain(
    llm=model,
    prompt=prompt,
    memory=memory,
)
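
`Llama2Chat` is a wrapper that turns the chat messages (system, history, human) into a single Llama 2 style prompt string before calling the underlying `llm`. As a rough illustration of that layout, and not the library's actual implementation, the sketch below renders a message list by hand. `render_llama2_chat` and the `(role, content)` tuple shape are assumptions made for this example, and the sample history is invented.

```python
def render_llama2_chat(system: str, turns: list[tuple[str, str]]) -> str:
    """Render alternating ("human", ...) / ("ai", ...) turns in Llama 2 chat layout.

    The first human turn shares its [INST] block with the system prompt.
    """
    parts = [f"<s>[INST] <<SYS>>\n{system}\n<</SYS>>\n\n"]
    first_human = True
    for role, content in turns:
        if role == "human":
            if not first_human:
                parts.append("<s>[INST] ")  # later turns open a new block
            parts.append(f"{content} [/INST]")
            first_human = False
        elif role == "ai":
            parts.append(f" {content} </s>")
        else:
            raise ValueError(f"unknown role: {role!r}")
    return "".join(parts)


# Invented history, for illustration only.
history = [
    ("human", "Hi"),
    ("ai", "Hello!"),
    ("human", "How tall is Mt. Fuji?"),
]
print(render_llama2_chat("You are a helpful assistant.", history))
```

Since this layout targets Llama 2 checkpoints, it may not suit the Llama 3.1 file currently assigned to `model_file`.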

In [None]:
memory.clear()
for tkn in chain.stream(input=text, config=config):
    # NOTE: printing each token in callback handler
    pass
print("")

In [None]:
# print(chain.run(text=text))