<a href="https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/docs/examples/llm/llama_2_llama_cpp.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# LlamaCPP 

This short notebook shows how to use the [llama-cpp-python](https://github.com/abetlen/llama-cpp-python) library with LlamaIndex.

We use the [`Qwen/Qwen2.5-7B-Instruct-GGUF`](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct-GGUF) model, along with the prompt formatting it expects.

By default, if `model_path` and `model_url` are both blank, the `LlamaCPP` module will load llama2-chat-13B.

## Installation

To get the best performance out of `LlamaCPP`, it is recommended to install the package so that it is compiled with GPU support. A full guide for installing this way is [here](https://github.com/abetlen/llama-cpp-python#installation-with-openblas--cublas--clblast--metal).

Full macOS instructions are also available [here](https://llama-cpp-python.readthedocs.io/en/latest/install/macos/).

In general:
- Use `CuBLAS` if you have CUDA and an NVIDIA GPU
- Use `METAL` if you are running on an M1/M2 MacBook
- Use `CLBLAST` if you are running on an AMD/Intel GPU

For example, on a Mac I need to install the `metal` backend:

```bash
CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python
```


Then you can install the required llama-index packages:

In [None]:
%pip install llama-index-embeddings-huggingface
%pip install llama-index-llms-llama-cpp

## Setup LLM

The `LlamaCPP` LLM is highly configurable. Depending on the model being used, you'll want to pass in `messages_to_prompt` and `completion_to_prompt` functions to help format the model inputs.
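To make the role of these two functions concrete, here is a minimal hand-written sketch that assumes a ChatML-style format (`<|im_start|>` / `<|im_end|>` markers), which is the style Qwen models use. The `Message` class and both function names are hypothetical stand-ins for illustration. The output may differ from what the model's own chat template produces (for example, a template can insert a default system message), so prefer the tokenizer's template when it is available.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Message:
    """Hypothetical stand-in for a chat message with a plain-string role."""

    role: str
    content: str


def chatml_messages_to_prompt(messages: Sequence[Message]) -> str:
    """Render messages in a ChatML-style layout and open an assistant turn."""
    parts = [
        f"<|im_start|>{m.role}\n{m.content}<|im_end|>\n" for m in messages
    ]
    # The trailing open turn tells the model to answer as the assistant.
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)


def chatml_completion_to_prompt(completion: str) -> str:
    """Wrap a bare completion string as a single user message."""
    return chatml_messages_to_prompt([Message("user", completion)])


prompt = chatml_completion_to_prompt("Hello!")
print(prompt)
```

Both functions return a plain string, which is the shape `LlamaCPP` expects from `messages_to_prompt` and `completion_to_prompt`.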

For any kwargs that need to be passed in during initialization, set them in `model_kwargs`. A full list of available model kwargs is available in the [LlamaCPP docs](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.llama.Llama.__init__).

For any kwargs that need to be passed in during inference, you can set them in `generate_kwargs`. See the full list of [generate kwargs here](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.llama.Llama.__call__).
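As a rough illustration of the split between the two dictionaries, the sketch below separates load-time options from per-call sampling options. The specific values are illustrative assumptions, not recommendations; check the linked llama-cpp-python references for the options your installed version accepts.

```python
# Load-time options, forwarded to the underlying model's __init__().
model_kwargs = {
    "n_gpu_layers": -1,  # -1 offloads all layers to the GPU
    "n_batch": 512,  # prompt-processing batch size (illustrative value)
}

# Per-call options, forwarded to the underlying model's __call__().
generate_kwargs = {
    "top_p": 0.9,  # nucleus sampling threshold (illustrative value)
    "top_k": 40,  # sample only from the k most likely tokens
    "repeat_penalty": 1.1,  # discourage repeated tokens
}

# These would then be passed when constructing the LLM, for example:
# llm = LlamaCPP(
#     model_url=model_url,
#     model_kwargs=model_kwargs,
#     generate_kwargs=generate_kwargs,
# )
print(sorted(model_kwargs), sorted(generate_kwargs))
```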

In general, the defaults are a great starting point. The example below shows configuration with all defaults.

If you're opening this notebook on Colab, you will probably need to install LlamaIndex 🦙.

In [None]:
model_url = "https://huggingface.co/Qwen/Qwen2.5-7B-Instruct-GGUF/resolve/main/qwen2.5-7b-instruct-q3_k_m.gguf"

In [None]:
from llama_index.llms.llama_cpp import LlamaCPP
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")


def messages_to_prompt(messages):
    messages = [{"role": m.role.value, "content": m.content} for m in messages]
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    return prompt


def completion_to_prompt(completion):
    messages = [{"role": "user", "content": completion}]
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    return prompt


llm = LlamaCPP(
    # You can pass in the URL to a GGUF model to download it automatically
    model_url=model_url,
    # optionally, you can set the path to a pre-downloaded model instead of model_url
    model_path=None,
    temperature=0.1,
    max_new_tokens=256,
    # Qwen2.5 supports a longer context window, but we set it lower to reduce memory use
    context_window=16384,
    # kwargs to pass to __call__()
    generate_kwargs={},
    # kwargs to pass to __init__()
    # set n_gpu_layers to at least 1 to use the GPU; -1 offloads all layers
    model_kwargs={"n_gpu_layers": -1},
    # transform inputs into the model's chat format
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
    verbose=True,
)

llama_model_load_from_file: using device Metal (Apple M2 Max) - 16584 MiB free
llama_model_loader: loaded meta data with 26 key-value pairs and 339 tensors from /Users/loganmarkewich/Library/Caches/llama_index/models/qwen2.5-7b-instruct-q3_k_m.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = qwen2.5-7b-instruct
llama_model_loader: - kv   3:                            general.version str              = v0.1
llama_model_loader: - kv   4:                           general.finetune str              = qwen2.5-7b-instruct
llama_model_loader: - kv   5:                         general.size_label str              = 7.6B
llama_model_

We can tell from the logging that the model is using `metal` and our GPU:

```
offloaded 29/29 layers to GPU
```

## Start using our `LlamaCPP` LLM abstraction!

We can simply use the `complete` method of our `LlamaCPP` LLM abstraction to generate completions given a prompt.

In [None]:
response = llm.complete("Hello! Can you tell me a poem about cats and dogs?")
print(response.text)

llama_perf_context_print:        load time =     699.52 ms
llama_perf_context_print: prompt eval time =       0.00 ms /    42 tokens (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:        eval time =       0.00 ms /   170 runs   (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:       total time =    4230.39 ms /   212 tokens


Certainly! Here's a short poem about cats and dogs, blending their characteristics and interactions:

In the quiet of the evening,
Where shadows play and light,
Two creatures, one in each corner,
Of the world, they find their might.

The cat, with grace and stealth,
In the dark, she pounces,
While the dog, with loyal teeth,
Protects and never fumbles.

One with fur as soft as silk,
The other with fur like flax,
Yet in their hearts, a bond is built,
A friendship that will never lack.

Though their paths may seem apart,
In their hearts, they find a way,
To coexist, in peace and heart,
A tale of two, in harmony's sway.

This poem captures the essence of both cats and dogs, highlighting their unique traits and the potential for friendship between them.


We can use the `stream_complete` method to stream the response as it's being generated rather than waiting for the entire response.

In [None]:
response_iter = llm.stream_complete("Can you write me a poem about fast cars?")
for response in response_iter:
    print(response.delta, end="", flush=True)

Llama.generate: 24 prefix-match hit, remaining 15 prompt tokens to eval

Sure, here's a poem about fast cars:

In the glow of twilight's soft embrace,
Rumble of engines, a deep, steady thrum,
Metal beasts, sleek and poised to race,
Through the night, their shadows run.

Wind whispers secrets as it streams,
Past the curves, the corners, the bends,
Adrenaline pulses, hearts in dreams,
As they chase the horizon, the open ends.

Neons flicker, red and blue,
In the darkness, a vibrant hue,
Each car a story, a tale to chew,
On the thrill of speed, the rush of new.

Steel and glass, a silent crew,
In the silence, a deafening sound,
Of the road, of the wind, of the crowd,
In the moment, time seems to be bound.

But the night is long, the journey vast,
And the dawn will come, the race will pass,
Yet in memory, the thrill will last,
A fleeting dream, a fast car's class.

llama_perf_context_print:        load time =     699.52 ms
llama_perf_context_print: prompt eval time =       0.00 ms /    15 tokens (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:        eval time =       0.00 ms /   201 runs   (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:       total time =    4365.20 ms /   216 tokens
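If you also need the full text after streaming, one option is to collect the deltas while printing them. The sketch below uses a hypothetical `Chunk` class and `fake_stream` generator as stand-ins for the streamed responses, so it runs without a model; with a real LLM you would pass the iterator returned by `stream_complete` instead.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass
class Chunk:
    """Hypothetical stand-in for a streamed response that carries a .delta."""

    delta: Optional[str]


def collect_stream(chunks: Iterable[Chunk]) -> str:
    """Print each delta as it arrives and return the concatenated text."""
    pieces = []
    for chunk in chunks:
        delta = chunk.delta or ""  # tolerate chunks without new text
        print(delta, end="", flush=True)
        pieces.append(delta)
    print()
    return "".join(pieces)


def fake_stream() -> Iterator[Chunk]:
    """Yield a few chunks to simulate a streaming response."""
    for piece in ["Sure, ", "here's ", None, "a poem."]:
        yield Chunk(delta=piece)


full_text = collect_stream(fake_stream())
```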


## Query engine set up with LlamaCPP

We can simply pass in the `LlamaCPP` LLM abstraction to the `LlamaIndex` query engine as usual.

But first, let's change the global tokenizer to match our LLM.

In [None]:
from llama_index.core import set_global_tokenizer
from transformers import AutoTokenizer

set_global_tokenizer(
    AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct").encode
)
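The global tokenizer is a callable that takes a string and returns a list of tokens, and it is used for token counting. As a rough sketch of why matching it to the LLM matters, the hypothetical helper below checks whether a prompt plus the generation budget fits in the context window. `fits_in_context` and `whitespace_tokenize` are illustrative names, and the whitespace tokenizer is only a stand-in so the example runs without downloading a model; in practice you would pass the real tokenizer's `encode` function.

```python
from typing import Callable, List


def fits_in_context(
    prompt: str,
    tokenize: Callable[[str], List],
    context_window: int = 16384,
    max_new_tokens: int = 256,
) -> bool:
    """Return True if the prompt tokens plus the output budget fit the window."""
    return len(tokenize(prompt)) + max_new_tokens <= context_window


def whitespace_tokenize(text: str) -> List[str]:
    """Stand-in tokenizer: splits on whitespace (not a real model tokenizer)."""
    return text.split()


short_ok = fits_in_context("What did the author do growing up?", whitespace_tokenize)
long_ok = fits_in_context("word " * 20000, whitespace_tokenize)
print(short_ok, long_ok)
```

A tokenizer that does not match the model can undercount or overcount tokens, so a check like this is only as reliable as the `tokenize` callable passed in.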

In [None]:
# use Huggingface embeddings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

In [None]:
from llama_index.core import SimpleDirectoryReader

# load documents
documents = SimpleDirectoryReader("../data/paul_graham/").load_data()

In [None]:
from llama_index.core import VectorStoreIndex

# create vector store index
index = VectorStoreIndex.from_documents(documents, embed_model=embed_model)

In [None]:
# set up query engine
query_engine = index.as_query_engine(llm=llm)

In [None]:
response = query_engine.query("What did the author do growing up?")
print(response)

Llama.generate: 24 prefix-match hit, remaining 1983 prompt tokens to eval
llama_perf_context_print:        load time =     699.52 ms
llama_perf_context_print: prompt eval time =       0.00 ms /  1983 tokens (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:        eval time =       0.00 ms /   172 runs   (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:       total time =    8297.83 ms /  2155 tokens


Growing up, the author focused on writing and programming. Specifically, before college, he wrote short stories as a beginning writer, which he found to be lacking in plot but rich in character emotions. He also spent time programming on an IBM 1401 computer in his school district's basement, using an early version of Fortran. He had to input programs on punch cards and run them on a noisy printer. This experience was limited and not very productive, as he couldn't find much to do with the machine without input data or the ability to perform more complex calculations. The author then shifted to using microcomputers, which allowed for more interactive programming and writing, and he began to write essays again. Later, in March 2015, he started working on Lisp again, attracted by its unique characteristics and power as a programming language.
