## Introduction
The plan:
- load a Llama 2 model with llama-cpp and use it through LangChain
- add a custom CallbackHandler to track token usage

### Imports and installs

In [1]:
%pip install -qqq langchain==0.0.304 --progress-bar off
%pip install -qqq llama-cpp-python==0.2.7 --progress-bar off

Note: you may need to restart the kernel to use updated packages.


In [2]:
import os
import sys
import warnings

warnings.simplefilter("ignore")

from huggingface_hub import hf_hub_download
from langchain.llms import LlamaCpp
from langchain import PromptTemplate, LLMChain

from langchain.callbacks.manager import CallbackManager
from langchain.callbacks import get_openai_callback
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler  # streams generated tokens to stdout in the notebook

***
## Load Llama2-13b

In [None]:
model_name_or_path = "TheBloke/Llama-2-13B-chat-GGUF"
model_basename = "llama-2-13b-chat.Q4_K_M.gguf"
model_path = hf_hub_download(repo_id=model_name_or_path, filename=model_basename)


callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])

n_gpu_layers = 32  # Change this value based on your model and your GPU VRAM pool.
n_batch = 512  # Should be between 1 and n_ctx, consider the amount of VRAM in your GPU.


llm = LlamaCpp(
    model_path=model_path,
    max_tokens=2500,
    n_gpu_layers=n_gpu_layers,
    n_batch=n_batch,
    callback_manager=callback_manager,
    n_ctx=2500, # Context window
    verbose=True, # Verbose is required to pass to the callback manager
)
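
Note that `max_tokens` and `n_ctx` are both set to 2500 above. The context window has to hold the prompt and the generated tokens together, so a completion cannot actually reach 2500 tokens once the prompt takes its share. A minimal sketch of that arithmetic follows. `completion_budget` is a hypothetical helper (not part of LangChain or llama-cpp-python), and the 25-token prompt length is only an illustrative value.

```python
def completion_budget(n_ctx: int, prompt_tokens: int, max_tokens: int) -> int:
    """Return how many tokens can still be generated inside the context window."""
    if prompt_tokens >= n_ctx:
        raise ValueError("prompt alone fills the context window")
    # The window is shared: whatever the prompt uses is unavailable for output.
    return min(max_tokens, n_ctx - prompt_tokens)


budget = completion_budget(n_ctx=2500, prompt_tokens=25, max_tokens=2500)
print(budget)  # 2475
```

Setting `max_tokens` somewhat below `n_ctx` makes the intended split between prompt and answer explicit.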

## Create prompt template and run chain

In [4]:
template = """Question: {question}

Answer: Answer briefly in a sentence!"""

prompt = PromptTemplate(template=template, input_variables=["question"])
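
To see the text the model receives, the template can be rendered by hand. The sketch below uses plain `str.format` as a stand-in, on the assumption that `PromptTemplate` in its default f-string mode substitutes `{question}` the same way. `rendered` is an illustrative name.

```python
template = """Question: {question}

Answer: Answer briefly in a sentence!"""

# Substitute the placeholder the same way the chain would before calling the LLM.
rendered = template.format(question="When was Einstein born? Give year, month day.")
print(rendered)
```

In the notebook itself, `prompt.format(question=...)` should give the same string.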

In [5]:
chain = LLMChain(prompt=prompt, llm=llm)

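# get_openai_callback reads OpenAI-style usage metadata, which LlamaCpp does not appear
# to return, so cb's token counters are expected to stay at 0 for this model.
with get_openai_callback() as cb:
    result = chain.run({"question": "When was Einstein born? Give year, month day."})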



Albert Einstein was born on March 14, 1879, in the town of Ulm, in the Kingdom of Württemberg, German Empire.


llama_print_timings:        load time = 17844.41 ms
llama_print_timings:      sample time =    24.31 ms /    38 runs   (    0.64 ms per token,  1563.34 tokens per second)
llama_print_timings: prompt eval time = 17844.25 ms /    25 tokens (  713.77 ms per token,     1.40 tokens per second)
llama_print_timings:        eval time = 31662.92 ms /    37 runs   (  855.75 ms per token,     1.17 tokens per second)
llama_print_timings:       total time = 49710.74 ms
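
The plan calls for a custom CallbackHandler to track token usage. One possible shape is sketched below, as an assumption rather than the notebook's actual implementation. `TokenCountHandler` is a hypothetical name. It relies on LangChain's `on_llm_start` and `on_llm_new_token` hooks and treats every streamed chunk as one generated token. The `try`/`except` fallback only exists so the sketch runs without LangChain installed, and the chunks in the simulated run are made up.

```python
try:
    from langchain.callbacks.base import BaseCallbackHandler
except ImportError:
    # Fallback so the sketch runs without LangChain installed.
    class BaseCallbackHandler:
        pass


class TokenCountHandler(BaseCallbackHandler):
    """Hypothetical handler: counts streamed chunks as generated tokens."""

    def __init__(self) -> None:
        self.prompts = 0
        self.completion_tokens = 0

    def on_llm_start(self, serialized, prompts, **kwargs) -> None:
        # Called once per LLM invocation with the list of rendered prompts.
        self.prompts += len(prompts)

    def on_llm_new_token(self, token: str, **kwargs) -> None:
        # Called for every chunk the model streams back.
        self.completion_tokens += 1


# Simulated run with made-up chunks; no model is loaded here.
handler = TokenCountHandler()
handler.on_llm_start({}, ["Question: ..."])
for chunk in ["Albert", " Einstein", " was", " born"]:
    handler.on_llm_new_token(chunk)
print(handler.prompts, handler.completion_tokens)
```

In the notebook, an instance could be added to the list passed to `CallbackManager` next to `StreamingStdOutCallbackHandler()`, and `handler.completion_tokens` read after `chain.run`. The count can then be compared with the `runs` figures in the `llama_print_timings` output. Prompt tokens are not counted by this sketch and would need a separate tokenizer call.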
