# llama.cpp Fine-Tuning Template





llama.cpp enables LLM inference with minimal setup and state-of-the-art performance on a wide variety of hardware. It runs models packaged as `.gguf` files, which work efficiently in CPU-only and mixed CPU/GPU environments.

`I prepared this llama.cpp fine-tuning template for my use case, but you could change it to suit your requirements.`



To view my accounts:

* [Hugging Face](https://huggingface.co/santhoshmlops)

* [GitHub](https://github.com/santhoshmlops)

To view my other fine-tuning templates:

* [Fine Tuning Template](https://github.com/santhoshmlops/MyHF_LLM_FineTuning/FineTuningTemplate)


To view my model fine-tuning notebooks:

* [MY HF LLM Fine-Tuning](https://github.com/santhoshmlops/MyHF_LLM_FineTuning)



## Setting Up on Google Colab
Google Colab provides a convenient, cloud-based environment with access to powerful GPUs like the `T4`. If you choose Colab for this tutorial, make sure to select a GPU runtime by going to `Runtime > Change runtime type > T4 GPU`. This ensures that your notebook has access to the necessary computational resources.
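
To confirm that the runtime really has a GPU attached, a quick check along these lines may help. This is a minimal sketch: the helper name `gpu_name` is hypothetical, and it assumes an NVIDIA GPU whose driver ships the `nvidia-smi` command, as Colab's T4 runtime does.

```python
import shutil
import subprocess
from typing import Optional


def gpu_name() -> Optional[str]:
    """Return the GPU name reported by nvidia-smi, or None if no GPU is visible.

    Hypothetical helper for illustration; it is not part of llama.cpp or Colab.
    """
    # nvidia-smi is absent on CPU-only runtimes.
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return None
    # An empty answer is treated the same as "no GPU".
    return result.stdout.strip() or None
```

If `gpu_name()` returns `None`, switch the runtime type before building llama.cpp with GPU support.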

## Setting Up Hugging Face Authentication

On Google Colab, you can safely store your Hugging Face token by using Colab's "Secrets" feature. This can be done by clicking on the "Key" icon in the sidebar, selecting "`Secrets`", and adding a new secret with the name `HF_TOKEN` and your Hugging Face token as the value. This method ensures that your token remains secure and is not exposed in your notebook's code.
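
The stored secret can then be read from code without pasting the token into a cell. The sketch below is one possible way to do it: `get_hf_token` is a hypothetical helper name, and the fallback to an `HF_TOKEN` environment variable is an assumption added so the same code also works outside Colab.

```python
import os


def get_hf_token():
    """Return the Hugging Face token from Colab Secrets, else from the environment.

    Hypothetical helper for illustration. Returns None if no token is found.
    """
    try:
        # google.colab is only importable inside a Colab runtime.
        from google.colab import userdata
    except ImportError:
        return os.environ.get("HF_TOKEN")
    try:
        return userdata.get("HF_TOKEN")
    except Exception:
        # The secret is missing or notebook access was not granted.
        return os.environ.get("HF_TOKEN")
```

The returned value can be passed as the `token` argument of `huggingface_hub` functions when automatic detection does not pick it up.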

# Things to change in this template



* Model Name
 - Check the llama.cpp GitHub page for the list of supported models before you proceed: [GitHub page](https://github.com/ggerganov/llama.cpp)
* Quantization Methods
* Hugging Face User Name

## Quantization Methods

The names of the quantization methods follow the naming convention: "q" + the number of bits + the variant used (detailed below). Here is a list of all the possible quant methods and their corresponding use cases, based on model cards made by [TheBloke](https://huggingface.co/TheBloke/):

* `q2_k`: Uses Q4_K for the attention.wv and feed_forward.w2 tensors, Q2_K for the other tensors.
* `q3_k_l`: Uses Q5_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else Q3_K
* `q3_k_m`: Uses Q4_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else Q3_K
* `q3_k_s`: Uses Q3_K for all tensors
* `q4_0`: Original quant method, 4-bit.
* `q4_1`: Higher accuracy than q4_0 but not as high as q5_0. However, inference is quicker than with q5 models.
* `q4_k_m`: Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K
* `q4_k_s`: Uses Q4_K for all tensors
* `q5_0`: Higher accuracy, higher resource usage and slower inference.
* `q5_1`: Even higher accuracy, resource usage and slower inference.
* `q5_k_m`: Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K
* `q5_k_s`:  Uses Q5_K for all tensors
* `q6_k`: Uses Q8_K for all tensors
* `q8_0`: Almost indistinguishable from float16. High resource use and slow. Not recommended for most users.

As a rule of thumb, **I recommend using Q5_K_M** as it preserves most of the model's performance. Alternatively, you can use Q4_K_M if you want to save some memory. In general, K_M versions are better than K_S versions. I cannot recommend Q2_K or Q3_* versions, as they drastically decrease model performance.
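
The naming convention described above ("q" + the number of bits + the variant) can be sketched as a small parser. The function name `parse_quant_name` and the regular expression are illustrative assumptions, not part of llama.cpp.

```python
import re

# "q" + number of bits + "_" + variant, e.g. q5_k_m -> bits=5, variant="k_m"
_QUANT_RE = re.compile(r"^q(?P<bits>\d+)_(?P<variant>[a-z0-9_]+)$")


def parse_quant_name(name):
    """Split a quantization method name into (bits, variant).

    Hypothetical helper for illustration; matching is case-insensitive.
    """
    match = _QUANT_RE.match(name.lower())
    if match is None:
        raise ValueError(f"not a recognised quantization method name: {name!r}")
    return int(match["bits"]), match["variant"]
```

For example, `parse_quant_name("q5_k_m")` yields `(5, "k_m")` and `parse_quant_name("Q4_0")` yields `(4, "0")`.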

# Step 1 - Clone the llama.cpp Git Repository

In [None]:
!git clone https://github.com/ggerganov/llama.cpp

# Step 2 - Install the Requirements

In [None]:
!cd llama.cpp && LLAMA_CUBLAS=1 make && pip install -r requirements/requirements-convert-hf-to-gguf.txt

# Step 3 - Initialize the Model Name and its Method to Quantize

In [None]:
from huggingface_hub import snapshot_download

model_name = ""  # Hugging Face model ID to convert, e.g. "microsoft/phi-1_5"
quantization_methods = ['']  # Quantization methods to build, e.g. ['q5_k_m']
hf_user_name = ""  # Your Hugging Face user name, e.g. "santhoshmlops"
base_model = "./original_model/"  # Where the original model is downloaded
quantized_path = "./quantized_model/"  # Where the GGUF files are written
qtype = f"{quantized_path}{quantization_methods[0].upper()}.gguf"
original_model = f"{quantized_path}FP16.gguf"  # Unquantized FP16 GGUF file
snapshot_download(repo_id=model_name, local_dir=base_model, local_dir_use_symlinks=False)
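
Because the template ships with empty placeholders, it may be worth checking the three values before starting a long download. The sketch below is one way to do that. `validate_config` and `SUPPORTED_METHODS` are hypothetical names, and the allowed set is simply the list from the "Quantization Methods" section, not an authoritative list from llama.cpp.

```python
# Methods listed in the "Quantization Methods" section of this template.
SUPPORTED_METHODS = {
    "q2_k", "q3_k_l", "q3_k_m", "q3_k_s", "q4_0", "q4_1", "q4_k_m",
    "q4_k_s", "q5_0", "q5_1", "q5_k_m", "q5_k_s", "q6_k", "q8_0",
}


def validate_config(model_name, quantization_methods, hf_user_name):
    """Return a list of human-readable problems; an empty list means OK.

    Hypothetical helper for illustration.
    """
    problems = []
    if not model_name.strip():
        problems.append("model_name is empty")
    if not quantization_methods:
        problems.append("quantization_methods is empty")
    for method in quantization_methods:
        if method.lower() not in SUPPORTED_METHODS:
            problems.append(f"unknown quantization method: {method!r}")
    if not hf_user_name.strip():
        problems.append("hf_user_name is empty")
    return problems
```

Calling `validate_config(model_name, quantization_methods, hf_user_name)` on the untouched template reports all three placeholders, which makes a forgotten edit easy to spot.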

# Step 4 - Make a Directory for the Quantized Model

In [None]:
!mkdir -p ./quantized_model/
!python llama.cpp/convert-hf-to-gguf.py ./original_model/ --outtype f16 --outfile ./quantized_model/FP16.gguf
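
After the conversion finishes, a cheap sanity check is to confirm that the output file exists and starts with the four-byte `GGUF` magic. This is a minimal sketch with a hypothetical helper name, `looks_like_gguf`. It only inspects the first bytes and does not validate the rest of the file.

```python
from pathlib import Path

GGUF_MAGIC = b"GGUF"  # the first four bytes of a GGUF file


def looks_like_gguf(path):
    """Return True if the file exists and begins with the GGUF magic bytes.

    Hypothetical helper for illustration; it does not parse the header.
    """
    file_path = Path(path)
    if not file_path.is_file():
        return False
    with file_path.open("rb") as handle:
        return handle.read(4) == GGUF_MAGIC
```

For example, `looks_like_gguf("./quantized_model/FP16.gguf")` should be true before moving on to quantization.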

# Step 5 - Build the Quantized Model

`Note:` The last command in this cell starts an interactive chat with Bob the Assistant. Once you are satisfied with the model's replies, you can stop the cell.

In [None]:
import os
for m in quantization_methods:
    # qtype keeps the path of the last quantized file; later steps upload that file
    qtype = f"{quantized_path}{m.upper()}.gguf"
    os.system(f"./llama.cpp/quantize {quantized_path}FP16.gguf {qtype} {m}")

! ./llama.cpp/main -m {qtype} -n 90 --repeat_penalty 1.0 --color -i -r "User:" -f llama.cpp/prompts/chat-with-bob.txt
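
The quantization loop above ignores the exit status of `os.system`, so a failed run can go unnoticed. A possible alternative, sketched here, builds the same command as an argument list and runs it with `subprocess.run(..., check=True)`, which raises on a non-zero exit code. The function names are hypothetical, and the binary path `./llama.cpp/quantize` is taken from the cell above.

```python
import subprocess
from pathlib import Path


def build_quantize_command(quantized_dir, method, binary="./llama.cpp/quantize"):
    """Return the argument list: binary, FP16 input, quantized output, method.

    Hypothetical helper for illustration.
    """
    out_dir = Path(quantized_dir)
    source = out_dir / "FP16.gguf"
    target = out_dir / f"{method.upper()}.gguf"
    return [binary, str(source), str(target), method]


def quantize_all(quantized_dir, methods):
    """Run the quantize binary once per method and return the output paths."""
    outputs = []
    for method in methods:
        command = build_quantize_command(quantized_dir, method)
        # check=True raises CalledProcessError if quantization fails.
        subprocess.run(command, check=True)
        outputs.append(command[2])
    return outputs
```

With the template's variables this would be called as `quantize_all(quantized_path, quantization_methods)`.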

# Step 6 - Login to your Hugging Face Hub
`Note:` If you have already set the `HF_TOKEN` secret, you can skip this step.

In [None]:
from huggingface_hub import notebook_login
notebook_login()

# Step 7 - Initialize the Model Path and Repository Name

In [None]:
from huggingface_hub import HfApi, create_repo
model_path = qtype
user_name = hf_user_name
repo_name = model_name.split("/")[-1]+"-GGUF"
repo_path = model_name.split("/")[-1].lower()+"."+model_path.split("/")[-1]
repo_id = user_name+"/"+repo_name
repo_type = "model"
# Create the repository under the same ID that Step 8 uploads to; exist_ok avoids an error on reruns
repo_url = create_repo(repo_id, private=False, exist_ok=True)
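
To make the naming in this step easier to follow, the same string manipulation can be wrapped in a function and tried on sample values. `derive_repo_names` is a hypothetical name. The logic mirrors the cell above and makes no network calls.

```python
def derive_repo_names(model_name, hf_user_name, gguf_path):
    """Derive the Hub repository ID and the in-repo file name.

    Hypothetical helper for illustration; mirrors the cell above.
    """
    base = model_name.split("/")[-1]      # "microsoft/phi-1_5" -> "phi-1_5"
    file_name = gguf_path.split("/")[-1]  # ".../Q5_K_M.gguf" -> "Q5_K_M.gguf"
    return {
        "repo_id": f"{hf_user_name}/{base}-GGUF",
        "path_in_repo": f"{base.lower()}.{file_name}",
    }
```

For `microsoft/phi-1_5`, user `santhoshmlops`, and `./quantized_model/Q5_K_M.gguf`, this gives the repository `santhoshmlops/phi-1_5-GGUF` and the file name `phi-1_5.Q5_K_M.gguf`.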

# Step 8 - Push the Quantized model to Hub

In [None]:
api = HfApi()
api.upload_file(
    path_or_fileobj = model_path,
    path_in_repo = repo_path,
    repo_id = repo_id,
    repo_type = repo_type,
)
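
The cell above uploads only the file in `model_path`, which is the last method in `quantization_methods`. If several methods were built, a loop along these lines could push all of them. This is a sketch under assumptions: `plan_uploads` and `upload_all` are hypothetical helpers, and `api` is expected to be an `HfApi` instance like the one created above.

```python
import os


def plan_uploads(model_name, quantized_dir, methods):
    """Return (local_path, path_in_repo) pairs, one per quantization method.

    Hypothetical helper for illustration; it does not touch the network.
    """
    base = model_name.split("/")[-1].lower()
    plan = []
    for method in methods:
        file_name = f"{method.upper()}.gguf"
        plan.append((os.path.join(quantized_dir, file_name), f"{base}.{file_name}"))
    return plan


def upload_all(api, repo_id, plan, repo_type="model"):
    """Upload every planned file with api.upload_file, as in the cell above."""
    for local_path, path_in_repo in plan:
        api.upload_file(
            path_or_fileobj=local_path,
            path_in_repo=path_in_repo,
            repo_id=repo_id,
            repo_type=repo_type,
        )
```

With the template's variables, this would be used as `upload_all(api, repo_id, plan_uploads(model_name, quantized_path, quantization_methods))`.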