# Module 2 Project 3 Part 1: Finetuning

Finetune a model with LoRA on some sample data using llama.cpp, then confirm it worked by displaying some of the model's output.
The finetuning is unsupervised: the model trains directly on raw text.

## STEP 1: GET THE MODEL
- Get the model from Hugging Face
- I chose Wizard-Vicuna-13B-Uncensored-SuperHOT-8K for this project
- You can use whatever model you want for this

In [None]:
wget https://huggingface.co/JohanAR/Wizard-Vicuna-13B-Uncensored-SuperHOT-8K-GGUF/resolve/main/wizard-vicuna-13b-uncensored-superhot-8k.q4_K_M.gguf
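
A multi-gigabyte download can fail quietly (for example by saving an HTML error page under the model's name), so a quick sanity check before finetuning can save time. The sketch below relies only on the fact that GGUF files begin with the four ASCII bytes `GGUF`; the helper name `looks_like_gguf` is made up for this example, and the check says nothing about whether the rest of the file is intact.

```python
from pathlib import Path

GGUF_MAGIC = b"GGUF"  # GGUF files start with these four bytes


def looks_like_gguf(path):
    """Return True if the file at `path` starts with the GGUF magic bytes."""
    with open(path, "rb") as f:
        return f.read(4) == GGUF_MAGIC


if __name__ == "__main__":
    # File name taken from the wget command above; adjust it if you saved the model elsewhere.
    model_file = Path("wizard-vicuna-13b-uncensored-superhot-8k.q4_K_M.gguf")
    if model_file.exists():
        print(model_file, "looks like GGUF:", looks_like_gguf(model_file))
    else:
        print(model_file, "not found")
```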

## STEP 2: GET THE DATA
- We want to download some sample data to finetune on
- I randomly picked a book from Project Gutenberg for this; ideally we'd want much more data
- The goal here is just to make sure the finetuning process works

In [None]:
wget https://www.gutenberg.org/cache/epub/4300/pg4300.txt -O data.txt
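
Project Gutenberg plain-text files usually wrap the book in license text, delimited by marker lines that start with `*** START OF` and `*** END OF`. Since the finetuning trains on whatever is in `data.txt`, it may be worth looking at how much of the file is boilerplate. This is a minimal sketch that assumes the downloaded file follows that marker convention; `strip_gutenberg_boilerplate` is a hypothetical helper, and it returns the text unchanged if no markers are found.

```python
import re
from pathlib import Path

# Marker lines such as "*** START OF THE PROJECT GUTENBERG EBOOK ... ***"
START_RE = re.compile(r"^\*\*\* ?START OF .*$", re.MULTILINE)
END_RE = re.compile(r"^\*\*\* ?END OF .*$", re.MULTILINE)


def strip_gutenberg_boilerplate(text):
    """Return the text between the START and END marker lines, if present."""
    start = START_RE.search(text)
    begin = start.end() if start else 0
    # Only look for the END marker after the START marker.
    end = END_RE.search(text, begin)
    stop = end.start() if end else len(text)
    return text[begin:stop].strip()


if __name__ == "__main__":
    data_file = Path("data.txt")  # name used by the wget command above
    if data_file.exists():
        raw = data_file.read_text(encoding="utf-8", errors="replace")
        body = strip_gutenberg_boilerplate(raw)
        print(f"raw: {len(raw)} characters, body only: {len(body)} characters")
```

Writing the stripped text back to `data.txt` is optional; for a smoke test of the pipeline the raw file works too.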

## STEP 3: FINETUNE WITH LLAMA.CPP
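- Run the `finetune` program from llama.cpp
- Note the checkpoint outputs and the `--lora-out` parameter: the `ITERATION` placeholder in the output file names is replaced with the iteration number, and with `LATEST` for the most recent output
- We use a batch size of 4 and a context window of 64 with 6 threads, run 30 Adam iterations, and save every 10 iterations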

In [None]:
# Assumes the q4_K_M model downloaded in Step 1 has been moved into models/.
# If the --checkpoint-in file does not exist yet, training starts from scratch.
./finetune \
        --model-base models/wizard-vicuna-13b-uncensored-superhot-8k.q4_K_M.gguf \
        --checkpoint-in  chk-lora-wizard-vicuna-13b-uncensored-superhot-8k.q4_K_M-LATEST.gguf \
        --checkpoint-out chk-lora-wizard-vicuna-13b-uncensored-superhot-8k.q4_K_M-ITERATION.gguf \
        --lora-out lora-wizard-vicuna-13b-uncensored-superhot-8k.q4_K_M-ITERATION.bin \
        --train-data "data.txt" \
        --save-every 10 \
        --threads 6 --adam-iter 30 --batch 4 --ctx 64 \
        --use-checkpointing
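
To get a feel for how small this run is, the arithmetic behind the flags can be sketched as follows. It assumes each Adam iteration processes one batch of `--batch` samples with `--ctx` tokens each (that is, no gradient accumulation) and that `--save-every 10` writes outputs at every tenth iteration; `training_budget` is a name made up for this example.

```python
def training_budget(batch, ctx, adam_iter, save_every):
    """Rough token budget for a run, assuming one batch per Adam iteration."""
    tokens_per_iteration = batch * ctx
    return {
        "tokens_per_iteration": tokens_per_iteration,
        "total_tokens": tokens_per_iteration * adam_iter,
        # Iterations at which checkpoint and LoRA files are expected to be written.
        "save_points": list(range(save_every, adam_iter + 1, save_every)),
    }


# Values from the finetune command above.
budget = training_budget(batch=4, ctx=64, adam_iter=30, save_every=10)
print(budget)
# {'tokens_per_iteration': 256, 'total_tokens': 7680, 'save_points': [10, 20, 30]}
```

Under these assumptions the run sees fewer than 8,000 tokens, a small fraction of a full-length book, which is fine for checking that the pipeline works but not for expecting a visible change in style.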


## STEP 4: TEST THE FINETUNED MODEL
- After finetuning, display some output from the model
- The code below targets macOS with Metal (tested on an M1)
- A context window of 4096 is used here to max out the RAM on the M1 chip

In [None]:
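from langchain.llms import LlamaCpp
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler

n_gpu_layers = 1  # Metal set to 1 is enough.
n_batch = 4096  # Should be between 1 and n_ctx, consider the amount of RAM of your Apple Silicon Chip.
callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])

# Placeholder path: point this at the adapter file written via --lora-out in Step 3.
# Without it, only the base model is loaded and the finetuning has no effect on the output.
lora_path = "full_path_to_lora/your_lora_adapter.bin"

language_model = LlamaCpp(
    model_path="full_path_to_model/wizard-vicuna-13b-uncensored-superhot-8k.q4_K_M.gguf",
    lora_path=lora_path,  # apply the LoRA adapter on top of the base model
    n_gpu_layers=n_gpu_layers,
    n_batch=n_batch,
    n_ctx=4096,
    f16_kv=True,
    callback_manager=callback_manager,
    verbose=False,
    echo=False
)

result = language_model("What is the meaning of life?")
print(result)  # full completion; tokens may also have been streamed to stdout already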
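
Step 3 writes LoRA adapter files named after the `--lora-out` pattern, so before testing it helps to know which adapter file is the newest. The helper below is a small sketch for that: `latest_adapter` is a hypothetical name, and the `lora-*.bin` glob is an assumption based on the `--lora-out` value used above. It simply picks the matching file with the most recent modification time.

```python
from pathlib import Path


def latest_adapter(directory=".", pattern="lora-*.bin"):
    """Return the most recently modified file matching `pattern`, or None."""
    candidates = sorted(
        Path(directory).glob(pattern),
        key=lambda p: p.stat().st_mtime,  # oldest first, newest last
    )
    return candidates[-1] if candidates else None


if __name__ == "__main__":
    # Looks in the current directory, where the finetune command above writes its outputs.
    print("newest adapter:", latest_adapter())
```

The returned path can then be used wherever the test code expects the adapter file.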