## Using llama_cpp_python to run LLMs
`llama_cpp_python` is a Python wrapper for llama.cpp, the C/C++ implementation of Meta's LLaMA architecture. It provides access to open-source large language models (LLMs) that llama.cpp can run, enabling tasks like text generation and translation. For more information, see the llama.cpp repo (https://github.com/ggerganov/llama.cpp), the Python wrapper repo (https://github.com/abetlen/llama-cpp-python), or this overview (https://www.datacamp.com/tutorial/llama-cpp-tutorial).

To run llama_cpp_python, you can download any compatible model and its weights. We provide sample code for three models:

1. Mistral
2. Gemma
3. Llama2

Here is some sample code for the Gemma 7B instruct model:

```python
#################################
# Gemma test code with llama_cpp
#################################
# libraries
from llama_cpp import Llama

# Inputs
model_path = "/model/gemma-7b-it.gguf"
CONTEXT_SIZE = 512
temperature = 0.0

# basic prompt
#prompt = "Can you provide a summary of Guy Debord's Societe du Spectacle?"

# prompt written in gemma prompt syntax
prompt = """<start_of_turn>user
Can you provide a summary of Guy Debord's Societe du Spectacle?<end_of_turn>
<start_of_turn>model
"""

# LOAD THE MODEL
llm = Llama(
    model_path=model_path,  # The path to the model file
    n_ctx=CONTEXT_SIZE,  # The max sequence length (prompt plus generated tokens) - adjust based on your model's requirements
    n_threads=1,  # The number of CPU threads to use
    n_gpu_layers=-1,  # Set to 0 to use the CPU only, or -1 to offload all model layers to the GPU
)

# send the prompt (the prompt and the generated tokens together must fit within n_ctx)
response = llm(prompt, max_tokens=1000, temperature=temperature)
response_text = response['choices'][0]['text']
print(response_text)
```
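
The Gemma prompt syntax and the response lookup used above can be wrapped in two small helpers. This is a minimal sketch: the function names `build_gemma_prompt` and `extract_text` are hypothetical and not part of `llama_cpp`, and `fake_response` is a stand-in dictionary that only mimics the `choices[0]['text']` shape read by the sample code, so the snippet runs without a model.

```python
def build_gemma_prompt(user_message: str) -> str:
    """Wrap a single user message in Gemma's turn markers."""
    return (
        "<start_of_turn>user\n"
        f"{user_message.strip()}<end_of_turn>\n"
        "<start_of_turn>model\n"
    )


def extract_text(response: dict) -> str:
    """Return the generated text of the first choice in a completion response."""
    return response["choices"][0]["text"].strip()


# Stand-in for a real completion response (hypothetical content).
fake_response = {"choices": [{"text": " The book argues ... "}]}

prompt = build_gemma_prompt(
    "Can you provide a summary of Guy Debord's Societe du Spectacle?"
)
print(prompt)
print(extract_text(fake_response))
```

With a loaded model, the same helpers would be used as `extract_text(llm(build_gemma_prompt(question), max_tokens=..., temperature=...))`.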

Note that this file is saved on KLC as `gemma_test.py` here:
```
/kelloggs/software/llama_cpp/code
```

This directory is bound to our llama-cpp-python module and can be accessed as:

```
/code
```


#### <font color='purple'>__SLURM Script to run llama_cpp_python__</font>

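You can run this file with the following SLURM script:

```bash
#!/bin/bash

# Request one A100 GPU on the gengpu partition: 1 node, 1 task, 30 minutes, 40 GB of memory
#SBATCH -A your_quest_allocation_account
#SBATCH -p gengpu
#SBATCH --gres=gpu:a100:1
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -t 0:30:00
#SBATCH --mem=40G

# Start from a clean environment, then load the llama_cpp module
module purge
module use modulefiles
module load llama_cpp/2.38

# Run the Gemma sample script from the bound /code directory
python3 /code/gemma_test.py
```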


Likewise, you can launch similar code for the Mistral and Llama2 models using the sample `mistral_test.py` and `llama2_test.py` files, respectively. Both files are also in the code subfolder.
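
Each model family expects its own prompt syntax, so switching models usually means switching the prompt wrapper as well. The sketch below is illustrative only: the Gemma template is the one used in the sample code above, while the Mistral and Llama2 templates are assumptions based on the commonly documented `[INST] ... [/INST]` instruct format and may not match what `mistral_test.py` and `llama2_test.py` actually use. `PROMPT_TEMPLATES` and `format_prompt` are hypothetical names; check each model card before relying on a template.

```python
# Assumed instruct-style templates; verify against each model's documentation.
PROMPT_TEMPLATES = {
    "gemma": "<start_of_turn>user\n{message}<end_of_turn>\n<start_of_turn>model\n",
    "mistral": "[INST] {message} [/INST]",  # assumption, not taken from mistral_test.py
    "llama2": "[INST] {message} [/INST]",   # assumption, not taken from llama2_test.py
}


def format_prompt(model_name: str, message: str) -> str:
    """Wrap a user message in the template registered for model_name."""
    try:
        template = PROMPT_TEMPLATES[model_name]
    except KeyError:
        raise ValueError(f"No prompt template for model {model_name!r}") from None
    return template.format(message=message.strip())


for name in PROMPT_TEMPLATES:
    print(repr(format_prompt(name, "Summarize Societe du Spectacle.")))
```

Keeping the templates in one dictionary means the rest of a test script (model loading, generation, printing) can stay identical across models.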