## Fine-tuning local LLMs with different quantization techniques (`bitsandbytes` and `gptq`)

This notebook gives a quick overview of using different quantization techniques to fine-tune LLMs on memory-constrained commodity hardware. On a free-tier Colab GPU with 16 GiB of memory in particular, fine-tuning even a small (7B) LLM requires techniques such as 4-bit quantization (`bitsandbytes`) or GPTQ to avoid out-of-memory errors at long sequence lengths.
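
To see why quantization matters on a 16 GiB card, here is a rough back-of-the-envelope sketch of the memory needed for the weights alone. It assumes a nominal 7e9 parameters and ignores quantization metadata (scales, zero points), activations, gradients, and optimizer state, so real usage is higher. The helper name `weight_memory_gib` is made up for this illustration.

```python
GIB = 1024 ** 3  # bytes per GiB

def weight_memory_gib(n_params: float, bits_per_param: float) -> float:
    """Memory needed to hold the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / GIB

n_params = 7e9  # nominal parameter count of a "7B" model (assumption)

# Prints about 13.04, 6.52, and 3.26 GiB respectively.
for name, bits in [("fp16/bf16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{name:>9}: {weight_memory_gib(n_params, bits):5.2f} GiB")
```

In 16-bit precision the weights already take most of the 16 GiB, which leaves little room for long sequences. At 4 bits the weights fit in roughly a quarter of that.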

Install the prerequisite packages.

In [None]:
!git clone https://github.com/taprosoft/llm_finetuning/
%cd llm_finetuning
!pip install -r requirements.txt
!pip install -r cuda_quant_requirements.txt
!wandb disabled

Download model weights from the Hugging Face [model hub](https://huggingface.co/models) using the `download_model.py` script.

In [None]:
!mkdir models
# download a 7B GPTQ base model
!python download_model.py TheBloke/open-llama-7b-open-instruct-GPTQ
# download a regular (non-GPTQ) 7B model; a sharded checkpoint is needed because of Colab's memory limit
!python download_model.py CleverShovel/vicuna-7b-v1.3-sharded-bf16
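
The later cells refer to the downloaded weights by local path. Judging from the paths used in this notebook, the local directory appears to be the hub repository id with `/` replaced by `_`, placed under `models/`. The sketch below encodes that inferred convention. The helper `local_model_dir` is hypothetical and not part of `download_model.py`, whose actual behavior has not been checked here.

```python
def local_model_dir(repo_id: str, root: str = "models") -> str:
    """Map a hub repo id to the local directory name used in this notebook.

    Convention inferred from the paths passed to --base_model below
    (an assumption, not verified against download_model.py).
    """
    return f"{root}/{repo_id.replace('/', '_')}"

print(local_model_dir("TheBloke/open-llama-7b-open-instruct-GPTQ"))
print(local_model_dir("CleverShovel/vicuna-7b-v1.3-sharded-bf16"))
```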

Use the `finetune.py` script to run training and inference. We first evaluate the downloaded models on a public instruction-tuning dataset.

To understand the format of the dataset, take a look at [alpaca-cleaned](https://huggingface.co/datasets/yahma/alpaca-cleaned) or the guidelines in the [README](https://github.com/taprosoft/llm_finetuning).

It looks something like this:

```json
[
    {
        "instruction": "do something with the input",
        "input": "input string",
        "output": "output string"
    }
]
```
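
If you prepare your own dataset in this format, a quick sanity check before training can save a failed run. The following is a minimal sketch that only checks for the three string fields shown above. The function `validate_records` is a hypothetical helper, not something provided by the repository.

```python
import json

REQUIRED_KEYS = {"instruction", "input", "output"}

def validate_records(records):
    """Return the indices of records that lack a required key or hold non-string values."""
    bad = []
    for i, rec in enumerate(records):
        if not isinstance(rec, dict) or not REQUIRED_KEYS <= rec.keys():
            bad.append(i)
        elif not all(isinstance(rec[k], str) for k in REQUIRED_KEYS):
            bad.append(i)
    return bad

sample = json.loads("""
[
    {"instruction": "do something with the input", "input": "input string", "output": "output string"},
    {"instruction": "missing the output field", "input": ""}
]
""")
print(validate_records(sample))  # [1]
```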

We start with the 7B model in 4-bit quantization mode from `bitsandbytes` (`--mode 4`). Take note of the output loss and the processing time per step.

In [None]:
!python finetune.py \
    --base_model 'models/CleverShovel_vicuna-7b-v1.3-sharded-bf16' \
    --data_path 'yahma/alpaca-cleaned' \
    --output_dir 'output_lora' \
    --batch_size 32 \
    --micro_batch_size 1 \
    --train_on_inputs True \
    --num_epochs 1 \
    --learning_rate 2e-4 \
    --cutoff_len 1600 \
    --group_by_length \
    --val_set_size 0.05 \
    --eval_steps 0 \
    --logging_steps 5 \
    --save_steps 5 \
    --gradient_checkpointing 1 \
    --mode 4 \
    --eval
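
A note on `--batch_size 32` with `--micro_batch_size 1`: in many LoRA fine-tuning scripts, the effective batch is reached through gradient accumulation, with only one micro-batch in GPU memory at a time. The sketch below assumes that convention (accumulation steps = `batch_size // micro_batch_size`); it has not been checked against `finetune.py`. The training-set size of 50,000 is a placeholder, not the real size of the dataset.

```python
import math

def accumulation_steps(batch_size: int, micro_batch_size: int) -> int:
    """Number of micro-batches accumulated before each optimizer update (assumed convention)."""
    if batch_size % micro_batch_size != 0:
        raise ValueError("batch_size must be a multiple of micro_batch_size")
    return batch_size // micro_batch_size

def optimizer_steps_per_epoch(n_train_examples: int, batch_size: int) -> int:
    """Approximate number of optimizer updates in one pass over the data."""
    return math.ceil(n_train_examples / batch_size)

print(accumulation_steps(32, 1))              # 32
print(optimizer_steps_per_epoch(50_000, 32))  # 1563 (50_000 is a placeholder size)
```

Under this assumption, memory use is governed by the micro-batch size and `--cutoff_len`, while the optimizer sees an effective batch of 32.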

Now we run the same script in GPTQ quantization mode (`--mode gptq`). This method needs compatible model weights (look for `gptq` in the model name). You should see a significant difference in processing time between the two quantization methods.

In [None]:
# hotfix for a peft compatibility issue on Colab
!pip install peft==0.3.0
!python finetune.py \
    --base_model 'models/TheBloke_open-llama-7b-open-instruct-GPTQ' \
    --data_path 'yahma/alpaca-cleaned' \
    --output_dir 'output_lora' \
    --batch_size 32 \
    --micro_batch_size 1 \
    --train_on_inputs True \
    --num_epochs 1 \
    --learning_rate 2e-4 \
    --cutoff_len 1600 \
    --group_by_length \
    --val_set_size 0.05 \
    --eval_steps 0 \
    --logging_steps 5 \
    --save_steps 5 \
    --gradient_checkpointing 1 \
    --mode gptq \
    --eval

The evaluation loop only reports the loss and run time. To see the model output as text, use the `inference.py` script. Note that inference (generation) takes much longer than the evaluation loop because of the additional overhead of the token generation steps. We use the `exllama` inference backend to speed up inference.

In [None]:
# works around a Colab install issue with ExLlama
!git clone https://github.com/taprosoft/exllama.git
!cd exllama && pip install -e .

In [None]:
!python inference.py \
    --base models/TheBloke_open-llama-7b-open-instruct-GPTQ \
    --mode exllama \
    --data 'yahma/alpaca-cleaned' \
    --selected_ids [0,1,2,3]
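
The gap between evaluation and generation time comes largely from the number of forward passes. As a simplified sketch, assuming batch size 1 and an arbitrary 256 generated tokens per example: evaluation scores all tokens of an example in a single teacher-forced pass, while autoregressive generation needs one pass per new token. Both function names are made up for this illustration, and the count ignores that cached generation passes are individually cheaper.

```python
def forward_passes_eval(n_examples: int) -> int:
    """Teacher-forced evaluation: one forward pass scores every token of an example."""
    return n_examples

def forward_passes_generate(n_examples: int, new_tokens_per_example: int) -> int:
    """Autoregressive decoding: one forward pass per generated token."""
    return n_examples * new_tokens_per_example

print(forward_passes_eval(4))           # 4
print(forward_passes_generate(4, 256))  # 1024 (256 tokens is a placeholder)
```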

Now we can start training. On a relatively old GPU such as the T4, training on the Alpaca dataset can take about 20-30 hours. Checkpoints are written to `output_lora` at regular intervals, so you can stop early if needed.

In [None]:
!python finetune.py \
    --base_model 'models/TheBloke_open-llama-7b-open-instruct-GPTQ' \
    --data_path 'yahma/alpaca-cleaned' \
    --output_dir 'output_lora' \
    --batch_size 32 \
    --micro_batch_size 1 \
    --train_on_inputs True \
    --num_epochs 1 \
    --learning_rate 2e-4 \
    --cutoff_len 1600 \
    --group_by_length \
    --val_set_size 0.05 \
    --eval_steps 0 \
    --logging_steps 5 \
    --save_steps 5 \
    --gradient_checkpointing 1 \
    --mode gptq
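
If you stop the run early, you will want the most recent checkpoint. The sketch below assumes the script relies on the Hugging Face `Trainer`, which names checkpoint directories `checkpoint-<step>`; if `finetune.py` uses a different layout, adjust the pattern. `latest_checkpoint` is a hypothetical helper written for this notebook.

```python
import re
from pathlib import Path
from typing import Optional

def latest_checkpoint(output_dir: str) -> Optional[Path]:
    """Return the checkpoint-<step> directory with the highest step, or None if there is none."""
    root = Path(output_dir)
    if not root.is_dir():
        return None
    pattern = re.compile(r"^checkpoint-(\d+)$")
    candidates = []
    for path in root.iterdir():
        match = pattern.match(path.name)
        if match and path.is_dir():
            candidates.append((int(match.group(1)), path))
    if not candidates:
        return None
    # Compare steps numerically so that checkpoint-100 sorts after checkpoint-15.
    return max(candidates, key=lambda item: item[0])[1]

print(latest_checkpoint("output_lora"))
```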