## Using Llama_cpp_python to run LLMs

Much like Transformers, `LLaMa.cpp` (or LLaMa C++) is a framework for running open-source LLMs. We'll demonstrate how to use its Python wrapper: <font color='purple'>__llama_cpp_python__</font>.


### <font color='purple'>_Llama_cpp_python vs. Transformers_</font>

Both <font color='purple'>__Transformers__</font> and <font color='purple'>__llama_cpp_python__</font> are tools you can use to run open-source LLMs. Since the LLaMA models behind `LLaMa.cpp` are also derived from the Transformer architecture, they follow similar processing steps. However, LLaMA applies some improvements that make it slightly more efficient and easier to run. The diagram below details the differences between the LLaMA and Transformer architectures.

<!-- <div style="text-align:center">
    <img src="./images/architectures.png" width="700">
</div> -->

![architects](./images/architectures.png)  

The upshot of these tweaks is:

- __Better training__: LLaMA uses a technique called __pre-normalization__ to improve training stability by normalizing the input to each layer before processing it. Consider this akin to tuning up your guitar before each song to ensure the strings are at the right tension for consistent sound.
<!-- This is similar to how a chef might preheat an oven for even cooking. -->

- __Faster calculations__: LLaMA utilizes the __SwiGLU activation function__ to help the model compute information faster. This is like using a pick with a special grip that makes it easier and faster to strum chords or play single notes.
 <!-- This is like a sharper knife for quicker chopping. -->

- __Simpler positioning__: LLaMA incorporates __rotary embeddings__ which track word order in a more efficient way compared to absolute positional embeddings. Imagine that the frets at different positions of your guitar neck were color-coded instead of just metal bars. This makes it quicker to find the right notes on the fretboard.
 <!-- Imagine using color-coded ingredients (rotary) instead of labels (absolute) to keep track in a recipe-->
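
The three tweaks above can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions, not the code that LLaMA or `LLaMa.cpp` actually runs: the function names are made up for this example, RMS normalization stands in for the pre-normalization step, and the rotary function pairs neighbouring vector entries (real implementations may pair entries differently).

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """Pre-normalization step: rescale x so its root-mean-square is about 1."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for v, w in zip(x, weight)]

def silu(v):
    """SiLU (swish) activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))

def swiglu(gate, up):
    """SwiGLU gating: SiLU(gate) multiplied element-wise with up."""
    return [silu(g) * u for g, u in zip(gate, up)]

def rotary(x, position, base=10000.0):
    """Rotate each consecutive pair of x by an angle that grows with position."""
    dim = len(x)
    out = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        out += [x[i] * c - x[i + 1] * s, x[i] * s + x[i + 1] * c]
    return out

def dot(a, b):
    """Plain dot product, used here as a stand-in for an attention score."""
    return sum(p * q for p, q in zip(a, b))

hidden = [0.5, -1.5, 2.0, 0.25]
normed = rms_norm(hidden, weight=[1.0] * 4)
print("normalized:", [round(v, 3) for v in normed])
print("swiglu:", [round(v, 3) for v in swiglu(normed, hidden)])

# Rotary embeddings encode relative position: both query/key pairs below are
# 2 positions apart, so the two scores agree up to floating-point rounding.
q, k = [1.0, 0.0, 0.5, -0.5], [0.2, 0.8, -0.3, 0.1]
print(dot(rotary(q, 3), rotary(k, 1)), dot(rotary(q, 5), rotary(k, 3)))
```

The last line is the point of the "simpler positioning" bullet: the score between two tokens depends on how far apart they are, not on their absolute positions.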

### <font color='purple'>_Basic Workflow for __llama_cpp_python___</font>

In addition to the architectural improvements, <font color='purple'>__llama_cpp_python__</font> is simpler to use than Transformers. Unlike Transformers, you don't need to be familiar with libraries like PyTorch or the intricacies of token conversion and decoding. <font color='purple'>__llama_cpp_python__</font> takes care of these technical details under the hood, allowing you to focus on feeding it text and getting results directly.

**1. Load the model parameters and weights:**
Load the pre-trained model from a file in <font color='purple'>__GGUF__</font> format.

**2. Send your prompt:** 
Either feed it a plain text prompt or place the query in the appropriate model prompt syntax.

**3. Get text response:** 
Retrieve the text response with no decoding necessary.
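
Step 2 mentions two ways to send a prompt. A minimal sketch of the difference is shown here, assuming the chat formats published for Gemma instruct models and Llama 2 chat models. The dictionary keys and the `build_prompt` helper are hypothetical names for this example, not part of `llama_cpp_python`.

```python
# Prompt templates keyed by a hypothetical model-family label.
PROMPT_TEMPLATES = {
    # Plain text: the query is passed through unchanged.
    "plain": "{query}",
    # Gemma instruct models: each turn is wrapped in start/end markers.
    "gemma-it": "<start_of_turn>user\n{query}<end_of_turn>\n<start_of_turn>model\n",
    # Llama 2 chat models: the user message sits between [INST] tags.
    "llama2-chat": "[INST] {query} [/INST]",
}

def build_prompt(query: str, family: str = "plain") -> str:
    """Place the query inside the template for the chosen model family."""
    return PROMPT_TEMPLATES[family].format(query=query)

question = "What kind of pickups are on an ESP LTD Alexi Ripped?"
print(build_prompt(question))
print(build_prompt(question, family="gemma-it"))
```

Instruction-tuned models are trained on text in their own template, so wrapping the query this way usually gives more reliable answers than the plain prompt.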


#### <font color='purple'>__Quantized Models - What is GGUF?__</font>

In order to run an LLM, <font color='purple'>__llama_cpp_python__</font> requires the model's parameters and weights – the building blocks of its knowledge. These are stored as complex, multidimensional arrays called __tensors__. The data type chosen for these tensors (like Float64, Float16, or even integers) impacts both accuracy and memory usage. Think of the data type as the number of "digits" used to represent the information in memory:

- __Higher precision data types (e.g., Float64)__: Offer greater accuracy and stability during training, but require more memory and computational resources.

- __Lower precision data types (e.g., Float16)__: Reduce memory requirements and potentially speed up computations, but might introduce slight accuracy trade-offs.
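
To make the trade-off concrete, the sketch below stores the same number at three precisions using Python's standard `struct` module and reads it back. The `round_trip` helper is a name made up for this example.

```python
import struct

def round_trip(value: float, fmt: str) -> float:
    """Store value in the given binary format and read it back."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]

value = 0.1
# "<d" = 64-bit float, "<f" = 32-bit float, "<e" = 16-bit (half-precision) float
for name, fmt in [("Float64", "<d"), ("Float32", "<f"), ("Float16", "<e")]:
    stored = round_trip(value, fmt)
    size = struct.calcsize(fmt)
    print(f"{name}: {size} bytes, stored as {stored!r}, error {abs(stored - value):.2e}")
```

Each halving of the size cuts memory use in half, and the stored value drifts a little further from the original.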

Llama_cpp_python simplifies storage of the model's parameters and weights by using a single file format called <font color='purple'>__GGUF (GPT-Generated Unified Format)__</font>. <font color='purple'>__GGUF__</font> can leverage a technique called <font color='purple'>__quantization__</font> to further reduce storage. <font color='purple'>__Quantization__</font> essentially "compresses" the data by representing the model weights using less precise data types. The image below demonstrates how <font color='purple'>__quantization__</font> works:

<!--<div style="text-align:center">
    <img src="./images/quantize.gif" width="800">
</div> -->

![quantize](./images/quantize.gif)  
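
The idea can also be sketched in code. The example below uses simple "absmax" quantization with one shared scale for the whole list of weights. This is an assumption made for clarity: the quantization types stored in real GGUF files are more elaborate (for example, they work on small blocks of weights with their own scales), and the function names here are hypothetical.

```python
def quantize_absmax(weights, bits=8):
    """Map floats to signed integers using one shared scale factor."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8 bits, 7 for 4 bits
    scale = max(abs(w) for w in weights) / qmax
    if scale == 0.0:
        scale = 1.0  # all-zero input: any scale works
    return [round(w / scale) for w in weights], scale

def dequantize(q, scale):
    """Recover approximate floats from the stored integers."""
    return [v * scale for v in q]

weights = [0.12, -0.98, 0.45, 0.003, -0.27]
for bits in (8, 4):
    q, scale = quantize_absmax(weights, bits=bits)
    restored = dequantize(q, scale)
    worst = max(abs(w - r) for w, r in zip(weights, restored))
    print(f"{bits}-bit integers: {q}, worst error {worst:.4f}")
```

Only the small integers and the scale need to be stored, which is where the space saving comes from. Fewer bits mean a coarser grid and a larger worst-case error.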

By using <font color='purple'>__quantization__</font>, <font color='purple'>__llama_cpp_python__</font> makes it easier to work with LLMs on devices with limited memory or processing power. Here you see how quantization reduces the size of the llama2 models you can call, making it feasible to run these models on a single GPU or even on CPUs.

<!--<div style="text-align:center">
    <img src="./images/llama_size.png" width="800">
</div>-->

![llama](./images/llama_size.png) 
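
A rough version of this size calculation is parameters times bits per weight. The sketch below assumes a nominal 7 billion parameters and ignores the extra metadata and scale factors that real quantized files carry, so actual file sizes will be somewhat larger than the low-bit estimates. The `model_size_gib` helper is a name chosen for this example.

```python
def model_size_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GiB: params * bits / 8 bytes / 2**30."""
    return n_params * bits_per_weight / 8 / 2**30

n_params = 7e9  # nominal parameter count for a "7B" model (assumption)
for label, bits in [("Float32", 32), ("Float16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label:>7}: {model_size_gib(n_params, bits):6.2f} GiB")
```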

While you can create your own <font color='purple'>__GGUF__</font> files for most model parameters and weights, it's also possible to download these files from Hugging Face. The <font color='purple'>__GGUF__</font> files for llama2 can be found here: https://huggingface.co/TheBloke/Llama-2-7B-GGUF.
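
One possible way to fetch a file from that repository is the `hf_hub_download` function from the separate `huggingface_hub` package. The sketch below assumes that package is installed; the file name and the `models` target directory are assumptions, so check the repository's file list for the quantization level you want.

```python
def gguf_download_url(repo_id: str, filename: str, revision: str = "main") -> str:
    """Build the direct-download URL Hugging Face serves for a file in a repo."""
    return f"https://huggingface.co/{repo_id}/resolve/{revision}/{filename}"

def download_gguf(repo_id: str, filename: str, local_dir: str = "models") -> str:
    """Download one GGUF file and return its local path."""
    # Imported here so the URL helper works without huggingface_hub installed.
    from huggingface_hub import hf_hub_download
    return hf_hub_download(repo_id=repo_id, filename=filename, local_dir=local_dir)

repo_id = "TheBloke/Llama-2-7B-GGUF"
filename = "llama-2-7b.Q4_K_M.gguf"  # assumed file name; verify in the repo's file list
print(gguf_download_url(repo_id, filename))

# To download (several GB), call:
# model_path = download_gguf(repo_id, filename)
```

The returned path can then be passed as `model_path` when loading the model.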

### <font color='purple'>_Tradeoffs between Tools_</font>

Even though the architecture and basic steps suggest that <font color='purple'>__llama_cpp_python__</font> might be easier to run and more efficient than Transformers, the improvements come with some tradeoffs.

* __Inference only__: llama_cpp_python is primarily for running inference (using the model to generate text). You can't fine-tune the model (further train it on a specific task) or use techniques like RAG (using a retrieval model to improve factual accuracy) through llama_cpp_python. Transformers offer more comprehensive functionality for model development and experimentation.

* __Slower for small models__: llama_cpp_python might be slower for very small GGUF files. The overhead of using llama_cpp_python might outweigh the benefits for tiny models. Transformers might be a better choice in such cases.

* __Precision loss__: Quantization (converting high-precision numbers to lower-precision formats) does introduce some loss in precision. This might be negligible for many tasks, but for applications requiring very high accuracy, Transformers with full-precision computation could be preferable.

<font color='purple'>_Choose __Transformers__ for:_</font>

* Fine-tuning models
* Using RAG or other advanced techniques
* Working with very small models
* When high precision is crucial

<font color='purple'>_Choose __llama_cpp_python__ for:_</font>

* Faster inference on large models
* Deploying models on CPUs

### <font color='purple'>_Run Gemma model_</font>

Here is some sample code for a Gemma 7B instruct model:

<font color='purple'>__Python script__</font>

```python
###################################
# Gemma model with llama_cpp_python
###################################
# libraries
from llama_cpp import Llama

#########
# Inputs
model_path = "/model/gemma-7b-it.gguf"
context_size = 512
max_tokens_select = 1000
temperature_select: float = 0
top_p_select: float = 0.9
top_k_select: int = 0

########
# Prompt
#prompt = "Can you provide a summary of Guy Debord's Societe du Spectacle?"
prompt = "What kind of pickups are on an ESP LTD Alexi Ripped?"

# prompt written in gemma prompt syntax
# (pass prompt_syntax to llm() instead of prompt to use the chat template)
prompt_syntax = "<start_of_turn>user\n" + prompt + "<end_of_turn>\n" + "<start_of_turn>model\n"

############
# load model
llm = Llama(
    model_path=model_path,  # The path to the model file
    n_ctx=context_size,  # The max sequence length to use - adjust based on your model's requirements
    n_threads=1,  # The number of CPU threads to use
    n_gpu_layers=-1,  # Layers to offload to the GPU: 0 for CPU only, -1 to offload all layers
)

##############
# send prompt
response = llm(
    prompt,
    max_tokens=max_tokens_select,
    temperature=temperature_select,
    top_p=top_p_select,
    top_k=top_k_select,
    echo=False,
)

##############
# get response
response_text = response['choices'][0]['text']
print(response_text)
```

Breaking down the parameter options:

__At the model load step__:
- `model_path` is the path to the model file being used.
- `n_ctx` is the size of the context window in tokens, which covers the prompt plus the generated response; the default is 512.
- `n_threads` is the number of CPU threads to use.
- `n_gpu_layers` is the number of model layers to offload to the GPU; set it to 0 to use the CPU only and to -1 to offload all layers.



__At the model output step__:

- `prompt` is the input prompt for the model. Under the hood, the text is tokenized and passed to the model.
- `max_tokens` is the maximum number of tokens to be generated in the model's response
- `temperature` controls randomness and is typically set between 0 and 1. The lower the value, the more deterministic the end result. A higher value leads to more randomness.
- `top_p` controls the diversity of the predictions: sampling is restricted to the smallest set of most probable tokens whose cumulative probability reaches the given threshold. Values range from 0 to 1, and a higher value admits more candidate tokens.
- `top_k` restricts sampling to the `k` most probable tokens; a value of 0, as in the script above, turns this filter off.
- `echo` specifies a boolean used to determine whether the model includes the original prompt at the beginning (True) or does not include it (False).
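
To see how the sampling parameters interact, here is a toy sketch that applies them to four made-up token scores. It illustrates the general technique only: the function name is hypothetical, and the order and details of the filters inside llama.cpp may differ from this example.

```python
import math
import random

def sampling_distribution(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return {token_index: probability} after temperature, top-k and top-p filtering."""
    if temperature <= 0:
        # Treat temperature 0 as greedy decoding: always pick the best token.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return {best: 1.0}
    # Temperature rescales the scores before they become probabilities.
    scaled = [score / temperature for score in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    ranked = sorted(((e / total, i) for i, e in enumerate(exps)), reverse=True)
    # top_k keeps only the k most probable tokens (0 disables the filter).
    if top_k > 0:
        ranked = ranked[:top_k]
    # top_p keeps the smallest prefix whose cumulative probability reaches the threshold.
    kept, cumulative = [], 0.0
    for prob, index in ranked:
        kept.append((prob, index))
        cumulative += prob
        if cumulative >= top_p:
            break
    norm = sum(prob for prob, _ in kept)
    return {index: prob / norm for prob, index in kept}

logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for four candidate tokens
for settings in [dict(temperature=0), dict(temperature=1.0),
                 dict(temperature=1.0, top_k=2), dict(temperature=1.0, top_p=0.5)]:
    dist = sampling_distribution(logits, **settings)
    print(settings, {i: round(p, 3) for i, p in dist.items()})

# Draw one token index from the filtered distribution.
dist = sampling_distribution(logits, temperature=1.0, top_p=0.9)
print(random.choices(list(dist), weights=list(dist.values()))[0])
```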

_For more information about these parameters and additional ones, please see: https://llama-cpp-python.readthedocs.io/en/latest/api-reference/._


Note that this file is saved as __gemma_test.py__ on KLC here:
```
/kelloggs/software/llama_cpp/code
```

This directory is bound to our llama-cpp-python module and can be accessed as:

```
/code
```


<font color='purple'>__SLURM script__</font>

You can run this file with the following SLURM script:




```bash
#!/bin/bash

#SBATCH --account=e32337
#SBATCH --partition gengpu
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:a100:1
#SBATCH --time 0:30:00
#SBATCH --mem=40G

module purge
module use /kellogg/software/Modules/modulefiles
module load llama_cpp/2.38

python3 /code/gemma_test.py
```

__Output:__ 
```
ggml_init_cublas: no CUDA devices found, CUDA will be disabled
llama_model_loader: loaded meta data with 19 key-value pairs and 254 tensors from /model/gemma-7b-it.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = gemma
llama_model_loader: - kv   1:                               general.name str              = gemma-7b-it
llama_model_loader: - kv   2:                       gemma.context_length u32              = 8192
llama_model_loader: - kv   3:                          gemma.block_count u32              = 28
llama_model_loader: - kv   4:                     gemma.embedding_length u32              = 3072
llama_model_loader: - kv   5:                  gemma.feed_forward_length u32              = 24576
llama_model_loader: - kv   6:                 gemma.attention.head_count u32              = 16
llama_model_loader: - kv   7:              gemma.attention.head_count_kv u32              = 16
llama_model_loader: - kv   8:                 gemma.attention.key_length u32              = 256
llama_model_loader: - kv   9:               gemma.attention.value_length u32              = 256
llama_model_loader: - kv  10:     gemma.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  11:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  12:                tokenizer.ggml.bos_token_id u32              = 2
llama_model_loader: - kv  13:                tokenizer.ggml.eos_token_id u32              = 1
llama_model_loader: - kv  14:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  15:            tokenizer.ggml.unknown_token_id u32              = 3
llama_model_loader: - kv  16:                      tokenizer.ggml.tokens arr[str,256128]  = ["<pad>", "<eos>", "<bos>", "<unk>", ...
llama_model_loader: - kv  17:                      tokenizer.ggml.scores arr[f32,256128]  = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  18:                  tokenizer.ggml.token_type arr[i32,256128]  = [3, 3, 3, 2, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - type  f32:  254 tensors
llm_load_vocab: mismatch in special tokens definition ( 544/256128 vs 388/256128 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = gemma
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 256128
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 3072
llm_load_print_meta: n_head           = 16
llm_load_print_meta: n_head_kv        = 16
llm_load_print_meta: n_layer          = 28
llm_load_print_meta: n_rot            = 192
llm_load_print_meta: n_embd_head_k    = 256
llm_load_print_meta: n_embd_head_v    = 256
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 4096
llm_load_print_meta: n_embd_v_gqa     = 4096
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 24576
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = all F32 (guessed)
llm_load_print_meta: model params     = 8.54 B
llm_load_print_meta: model size       = 31.81 GiB (32.00 BPW) 
llm_load_print_meta: general.name     = gemma-7b-it
llm_load_print_meta: BOS token        = 2 '<bos>'
llm_load_print_meta: EOS token        = 1 '<eos>'
llm_load_print_meta: UNK token        = 3 '<unk>'
llm_load_print_meta: PAD token        = 0 '<pad>'
llm_load_print_meta: LF token         = 227 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.10 MiB
llm_load_tensors: offloading 28 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 29/29 layers to GPU
llm_load_tensors:        CPU buffer size = 32570.17 MiB
............................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
WARNING: failed to allocate 224.00 MB of pinned memory: no CUDA-capable device is detected
llama_kv_cache_init:        CPU KV buffer size =   224.00 MiB
llama_new_context_with_model: KV self size  =  224.00 MiB, K (f16):  112.00 MiB, V (f16):  112.00 MiB
WARNING: failed to allocate 8.01 MB of pinned memory: no CUDA-capable device is detected
llama_new_context_with_model:        CPU input buffer size   =     8.01 MiB
WARNING: failed to allocate 506.25 MB of pinned memory: no CUDA-capable device is detected
llama_new_context_with_model:  CUDA_Host compute buffer size =   506.25 MiB
llama_new_context_with_model: graph splits (measure): 1
AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | 
Model metadata: {'tokenizer.ggml.unknown_token_id': '3', 'tokenizer.ggml.padding_token_id': '0', 'tokenizer.ggml.eos_token_id': '1', 'general.architecture': 'gemma', 'gemma.feed_forward_length': '24576', 'gemma.attention.head_count': '16', 'general.name': 'gemma-7b-it', 'gemma.context_length': '8192', 'gemma.block_count': '28', 'gemma.embedding_length': '3072', 'gemma.attention.head_count_kv': '16', 'gemma.attention.key_length': '256', 'tokenizer.ggml.model': 'llama', 'gemma.attention.value_length': '256', 'gemma.attention.layer_norm_rms_epsilon': '0.000001', 'tokenizer.ggml.bos_token_id': '2'}

llama_print_timings:        load time =    1047.55 ms
llama_print_timings:      sample time =     111.74 ms /    34 runs   (    3.29 ms per token,   304.29 tokens per second)
llama_print_timings: prompt eval time =    1047.37 ms /    15 tokens (   69.82 ms per token,    14.32 tokens per second)
llama_print_timings:        eval time =  136433.05 ms /    33 runs   ( 4134.33 ms per token,     0.24 tokens per second)
llama_print_timings:       total time =  138373.07 ms /    48 tokens


The Esp Ltd.AlexI Ripper has ceramic humbuckers with coil tapping capabilities, which give you a wide range from clean to overdrive tones and everything in between

```
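
Several figures in this log can be reproduced with simple arithmetic, which helps when estimating resources for other models. The sketch below copies its inputs from the log lines above; the key-cache formula is an assumption that happens to match the reported value. Note that the log reports no CUDA device was found, so this particular run executed on the CPU, which is consistent with the low generation speed.

```python
MIB = 2**20
GIB = 2**30

# Values copied from the log above
n_params = 8.54e9        # "model params     = 8.54 B"
bits_per_weight = 32.0   # "model size       = 31.81 GiB (32.00 BPW)"
n_layer = 28             # "n_layer          = 28"
n_ctx = 512              # "n_ctx      = 512"
n_embd_k_gqa = 4096      # "n_embd_k_gqa     = 4096"
bytes_per_f16 = 2        # the key cache is stored as f16
eval_ms = 136433.05      # "eval time =  136433.05 ms /    33 runs"
eval_runs = 33

# Weight storage: parameters * bits per weight, converted to GiB
weights_gib = n_params * bits_per_weight / 8 / GIB

# Key cache (assumed formula): one key vector per layer per context position
k_cache_mib = n_layer * n_ctx * n_embd_k_gqa * bytes_per_f16 / MIB

# Generation throughput: tokens generated divided by elapsed seconds
tokens_per_second = eval_runs / (eval_ms / 1000)

print(f"weights: {weights_gib:.2f} GiB")                  # log reports 31.81 GiB
print(f"K cache: {k_cache_mib:.2f} MiB")                  # log reports 112.00 MiB
print(f"speed:   {tokens_per_second:.2f} tokens/second")  # log reports 0.24
```

The value cache is the same size as the key cache, which gives the 224 MiB total that the log reports for `KV self size`.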

### <font color='purple'>_Run Mistral model_</font>

Check out [scripts/llama_cpp_python/mistral_test.py](https://github.com/rs-kellogg/krs-openllm-cookbook/blob/main/scripts/llama_cpp_python/mistral_test.py).

### <font color='purple'>_Run llama2 model_</font>

Check out [scripts/llama_cpp_python/llama2_test.py](https://github.com/rs-kellogg/krs-openllm-cookbook/blob/main/scripts/llama_cpp_python/llama2_test.py).  

### <font color='purple'>_Reference Sources_</font>

- [Llama.cpp Git Repo](https://github.com/ggerganov/llama.cpp)
- [llama-cpp-python Git Repo](https://github.com/abetlen/llama-cpp-python)
- [Llama.cpp Tutorial](https://www.datacamp.com/tutorial/llama-cpp-tutorial)
- [LLM By Examples — Use GGUF Quantization](https://medium.com/@metechsolutions/llm-by-examples-use-gguf-quantization-3e2272b66343)
- [What are Quantized LLMs](https://www.tensorops.ai/post/what-are-quantized-llms)
- [Quantize Llama Models with GGUF and llama.cpp](https://mlabonne.github.io/blog/posts/Quantize_Llama_2_models_using_ggml.html)
- [Quantization of LLMs with llama.cpp](https://medium.com/@ingridwickstevens/quantization-of-llms-with-llama-cpp-9bbf59deda35)
- [Prompt Engineering Guide](https://www.promptingguide.ai/)