# A guide to fine-tuning Code Llama

**In this guide I show you how to fine-tune Code Llama to become a beast of an SQL developer. For coding tasks, you can generally get much better performance out of Code Llama than Llama 2, especially when you specialise the model on a particular task:**

- I use the [b-mc2/sql-create-context](https://huggingface.co/datasets/b-mc2/sql-create-context) dataset, which is a collection of natural-language questions and their corresponding SQL queries
- A LoRA approach: quantizing the base model to int8, freezing its weights and only training an adapter
- Much of the code is borrowed from [alpaca-lora](https://github.com/tloen/alpaca-lora), but I refactored it quite a bit for this


### 2. Pip installs


In [1]:
!pip install datasets peft
!pip install transformers bitsandbytes accelerate
!pip install zstandard
!pip install wandb

import locale # colab workaround
locale.getpreferredencoding = lambda: "UTF-8" # colab workaround


I used an A100 GPU machine with Python 3.10 and CUDA 11.8 to run this notebook. It took about an hour to run.

### Loading libraries


In [2]:
from datetime import datetime
import os
import sys

import torch
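from peft import (
    LoraConfig,
    get_peft_model,
    get_peft_model_state_dict,
    prepare_model_for_kbit_training,
    set_peft_model_state_dict,
)

# Newer PEFT releases dropped prepare_model_for_int8_training; keep the old name working below
prepare_model_for_int8_training = prepare_model_for_kbit_training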
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments, Trainer, DataCollatorForSeq2Seq


(If you have import errors, try restarting your Jupyter kernel)


### Load dataset


In [9]:
# from datasets import load_dataset
# dataset = load_dataset("b-mc2/sql-create-context", split="train")
# split = dataset.train_test_split(test_size=0.1, seed=42)  # split once so train and test cannot overlap
# train_dataset, eval_dataset = split["train"], split["test"]

import json
import random

file_path = 'sql_create_context_v4.json'
# Load the dataset
with open(file_path, 'r') as file:
    dataset = json.load(file)

# Display the first few entries
for i in range(5):
    print(dataset[i])


random.shuffle(dataset)

# Define the split ratios
train_ratio = 0.8
validation_ratio = 0.1
test_ratio = 0.1

# Calculate split indices
total_size = len(dataset)
train_end = int(total_size * train_ratio)
validation_end = train_end + int(total_size * validation_ratio)

# Split the dataset
train_dataset = dataset[:train_end]
validation_dataset = dataset[train_end:validation_end]
eval_dataset = dataset[validation_end:]

# Print the sizes of each split to verify
print(f'Train size: {len(train_dataset)}')
print(f'Validation size: {len(validation_dataset)}')
print(f'Test size: {len(eval_dataset)}')


{'question': 'How many heads of the departments are older than 56 ?', 'context': 'CREATE TABLE head (age INTEGER)', 'answer': 'SELECT COUNT(*) FROM head WHERE age > 56'}
{'question': 'List the name, born state and age of the heads of departments ordered by age.', 'context': 'CREATE TABLE head (name VARCHAR, born_state VARCHAR, age VARCHAR)', 'answer': 'SELECT name, born_state, age FROM head ORDER BY age'}
{'question': 'List the creation year, name and budget of each department.', 'context': 'CREATE TABLE department (creation VARCHAR, name VARCHAR, budget_in_billions VARCHAR)', 'answer': 'SELECT creation, name, budget_in_billions FROM department'}
{'question': 'What are the maximum and minimum budget of the departments?', 'context': 'CREATE TABLE department (budget_in_billions INTEGER)', 'answer': 'SELECT MAX(budget_in_billions), MIN(budget_in_billions) FROM department'}
{'question': 'What is the average number of employees of the departments whose rank is between 10 and 15?', 'context'

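The above loads the dataset from a local JSON file, shuffles it and splits it 80/10/10 into training, validation and test sets, so there is held-out data to check how well the model is doing through training. The commented-out lines at the top pull the same dataset from the Hugging Face Hub instead and hold out 10% of it for evaluation. If you want to load your own dataset do this: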

```
from datasets import load_dataset

train_dataset = load_dataset('json', data_files='train_set.jsonl', split='train')
eval_dataset = load_dataset('json', data_files='validation_set.jsonl', split='train')
```

And if you want to view any samples in the dataset just do something like:


In [11]:
print(train_dataset[3])

{'question': 'What discs has a region 1 release date of January 22, 2008?', 'context': 'CREATE TABLE table_name_90 (discs VARCHAR, region_1_release VARCHAR)', 'answer': 'SELECT discs FROM table_name_90 WHERE region_1_release = "january 22, 2008"'}


Each entry is made up of a natural-language 'question', the SQL table 'context' (a CREATE TABLE statement) and the 'answer' (the target SQL query).
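
As a rough sketch of the shuffle-and-split step applied to records of this shape, the helper below checks that every record has the three keys and splits with a fixed seed so the split is repeatable. The function name `split_records`, the seed value and the toy records are my own assumptions for illustration, not part of the notebook.

```python
import random

REQUIRED_KEYS = {"question", "context", "answer"}


def split_records(records, train_ratio=0.8, validation_ratio=0.1, seed=42):
    """Validate, shuffle and split records into train/validation/test lists."""
    for i, record in enumerate(records):
        missing = REQUIRED_KEYS - set(record)
        if missing:
            raise ValueError(f"record {i} is missing keys: {sorted(missing)}")

    shuffled = list(records)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)  # seeded RNG makes the split repeatable

    train_end = int(len(shuffled) * train_ratio)
    validation_end = train_end + int(len(shuffled) * validation_ratio)
    return shuffled[:train_end], shuffled[train_end:validation_end], shuffled[validation_end:]


# Toy records (hypothetical) in the same shape as the dataset entries
toy = [
    {"question": f"q{i}", "context": "CREATE TABLE t (x INTEGER)", "answer": "SELECT x FROM t"}
    for i in range(10)
]
train, validation, test = split_records(toy)
print(len(train), len(validation), len(test))
```

Using a seeded `random.Random` instance rather than the global `random.shuffle` means rerunning the cell gives the same split, which keeps evaluation numbers comparable between runs.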

### Load model
I load Code Llama from Hugging Face in int8, which is standard for LoRA:

In [13]:
from transformers import BitsAndBytesConfig

base_model = "codellama/CodeLlama-7b-hf"
# Pass quantization settings through BitsAndBytesConfig; the bare load_in_8bit argument is deprecated
quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_enable_fp32_cpu_offload=True,
)
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    quantization_config=quantization_config,
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(base_model)



torch_dtype=torch.float16 means computations are performed using a float16 representation, even though the values themselves are 8 bit ints.
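
To make that concrete, here is a toy absmax quantization sketch in plain Python: the weights are stored as 8-bit integers plus one scale factor, and converted back to floats before any arithmetic happens. This is a simplified illustration under my own assumptions, not the actual bitsandbytes algorithm, and the function names are hypothetical.

```python
def quantize_absmax(weights):
    """Map floats to integers in [-127, 127] using a single scale factor."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # avoid a zero scale for all-zero input
    return [round(w / scale) for w in weights], scale


def dequantize(quantized, scale):
    """Recover approximate float values for use in computation."""
    return [q * scale for q in quantized]


weights = [0.12, -0.5, 0.33, 0.01]
quantized, scale = quantize_absmax(weights)
restored = dequantize(quantized, scale)

print(quantized)  # small integers that fit in one byte each
print(restored)   # close to the original weights, with a small rounding error
```

The storage cost is one byte per weight, while the arithmetic runs on the restored floating-point values.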

If you get the error "ValueError: Tokenizer class CodeLlamaTokenizer does not exist or is not currently imported.", make sure your transformers version is at least 4.33.0.dev0 and accelerate is >=0.20.3.


### 3. Check base model
It's good practice to first check whether the base model can already do the task at hand. Fine-tuning costs time and compute, so it's something you want to avoid if you can:


In [14]:
eval_prompt = """You are a powerful text-to-SQL model. Your job is to answer questions about a database. You are given a question and context regarding one or more tables.

You must output the SQL query that answers the question.
### Input:
Which Class has a Frequency MHz larger than 91.5, and a City of license of hyannis, nebraska?

### Context:
CREATE TABLE table_name_12 (class VARCHAR, frequency_mhz VARCHAR, city_of_license VARCHAR)

### Response:
"""
# {'question': 'Name the comptroller for office of prohibition', 'context': 'CREATE TABLE table_22607062_1 (comptroller VARCHAR, ticket___office VARCHAR)', 'answer': 'SELECT comptroller FROM table_22607062_1 WHERE ticket___office = "Prohibition"'}
model_input = tokenizer(eval_prompt, return_tensors="pt").to("cuda")

model.eval()
with torch.no_grad():
    print(tokenizer.decode(model.generate(**model_input, max_new_tokens=100)[0], skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


You are a powerful text-to-SQL model. Your job is to answer questions about a database. You are given a question and context regarding one or more tables.

You must output the SQL query that answers the question.
### Input:
Which Class has a Frequency MHz larger than 91.5, and a City of license of hyannis, nebraska?

### Context:
CREATE TABLE table_name_12 (class VARCHAR, frequency_mhz VARCHAR, city_of_license VARCHAR)

### Response:
SELECT class FROM table_name_12 WHERE frequency_ mhz > 91 AND city_ license = 'hyannis nebraska';


I get the output:
```
SELECT * FROM table_name_12 WHERE class > 91.5 AND city_of_license = 'hyannis, nebraska'
```
which is clearly wrong: the question asks for just `class`, and the comparison with 91.5 should be on `frequency_mhz`, not `class`!
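
Because `generate` returns the prompt followed by the completion, the decoded text repeats the whole prompt before the SQL. A small sketch for pulling out just the generated part is shown here; `extract_response` is a hypothetical helper and the decoded text is an abbreviated stand-in for real model output.

```python
RESPONSE_MARKER = "### Response:"


def extract_response(decoded_text):
    """Return only the text generated after the last response marker."""
    _, found, tail = decoded_text.rpartition(RESPONSE_MARKER)
    if not found:
        return decoded_text.strip()  # marker missing: fall back to the whole text
    return tail.strip()


decoded = """### Input:
Which Class has a Frequency MHz larger than 91.5?

### Response:
SELECT * FROM table_name_12 WHERE class > 91.5"""
print(extract_response(decoded))
```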

### 4. Tokenization
Set up some tokenization settings, like left padding, because it makes [training use less memory](https://ai.stackexchange.com/questions/41485/while-fine-tuning-a-decoder-only-llm-like-llama-on-chat-dataset-what-kind-of-pa):

In [15]:
tokenizer.add_eos_token = True
tokenizer.pad_token_id = 0
tokenizer.padding_side = "left"
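
To show what left padding does to a batch, here is a plain-Python sketch that pads lists of token ids and builds the matching attention masks. `pad_batch` is a hypothetical helper and the token ids are made up; in the notebook the tokenizer and data collator do this work.

```python
PAD_TOKEN_ID = 0  # same pad id as set on the tokenizer above


def pad_batch(sequences, side="left", pad_id=PAD_TOKEN_ID):
    """Pad token-id lists to equal length and build matching attention masks."""
    width = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        padding = [pad_id] * (width - len(seq))
        mask_pad = [0] * len(padding)
        real = [1] * len(seq)
        if side == "left":
            input_ids.append(padding + seq)
            attention_mask.append(mask_pad + real)
        else:
            input_ids.append(seq + padding)
            attention_mask.append(real + mask_pad)
    return input_ids, attention_mask


batch = [[5, 6, 7], [8, 9]]  # made-up token ids
ids, mask = pad_batch(batch, side="left")
print(ids)   # [[5, 6, 7], [0, 8, 9]]
print(mask)  # [[1, 1, 1], [0, 1, 1]]
```

With left padding the real tokens of every sequence end at the same position, so the last column of the batch always holds a real token.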

Set up the tokenize function to make labels and input_ids the same. This is basically what [self-supervised fine-tuning](https://neptune.ai/blog/self-supervised-learning) is:

In [16]:
def tokenize(prompt):
    result = tokenizer(
        prompt,
        truncation=True,
        max_length=512,
        padding=False,
        return_tensors=None,
    )

    # "self-supervised learning" means the labels are also the inputs:
    result["labels"] = result["input_ids"].copy()

    return result
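
Setting the labels equal to the input_ids works because causal language models shift the labels by one position when computing the loss, so each token is predicted from the tokens before it. A minimal plain-Python sketch of that pairing, with made-up tokens and a hypothetical helper name:

```python
def next_token_pairs(tokens):
    """Pair each prefix with the token the model is trained to predict next."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


tokens = ["SELECT", "name", "FROM", "head"]  # made-up tokens for illustration
for context, target in next_token_pairs(tokens):
    print(context, "->", target)
```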

And convert each data_point into a prompt, using a template that I found online that works quite well:

In [17]:
def generate_and_tokenize_prompt(data_point):
    full_prompt =f"""You are a powerful text-to-SQL model. Your job is to answer questions about a database. You are given a question and context regarding one or more tables.

You must output the SQL query that answers the question.

### Input:
{data_point["question"]}

### Context:
{data_point["context"]}

### Response:
{data_point["answer"]}
"""
    return tokenize(full_prompt)
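
At inference time the prompt should match this training template exactly, just without the answer. One way to keep the two in sync is to build both from a single template, sketched here; `PROMPT_TEMPLATE` and `build_prompt` are hypothetical names, not part of the notebook.

```python
PROMPT_TEMPLATE = """You are a powerful text-to-SQL model. Your job is to answer questions about a database. You are given a question and context regarding one or more tables.

You must output the SQL query that answers the question.

### Input:
{question}

### Context:
{context}

### Response:
{answer}"""


def build_prompt(question, context, answer=""):
    """Fill the template; leave answer empty to get an inference prompt."""
    prompt = PROMPT_TEMPLATE.format(question=question, context=context, answer=answer)
    return prompt + "\n" if answer else prompt


sample = {
    "question": "How many heads of the departments are older than 56 ?",
    "context": "CREATE TABLE head (age INTEGER)",
    "answer": "SELECT COUNT(*) FROM head WHERE age > 56",
}
print(build_prompt(sample["question"], sample["context"]))
```

Passing the answer as well gives the training prompt, so the inference prompt is always an exact prefix of what the model saw during training.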

Reformat to prompt and tokenize each sample:

In [18]:
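from datasets import Dataset

# The splits above are plain Python lists, so wrap them in Dataset objects before calling .map.
# validation_dataset is used for evaluation during training; eval_dataset stays held out as the test set.
tokenized_train_dataset = Dataset.from_list(train_dataset).map(generate_and_tokenize_prompt)
tokenized_val_dataset = Dataset.from_list(validation_dataset).map(generate_and_tokenize_prompt)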

### 5. Setup Lora

In [None]:
model.train() # put model back into training mode
model = prepare_model_for_int8_training(model)

config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
    ],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
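
To get a feel for how small the adapter is, here is a back-of-the-envelope count of the trainable parameters this config adds. The model dimensions (hidden size 4096, 32 layers, square attention projections) are my assumptions for a 7B Llama-style model and are not taken from the notebook, so treat the total as an estimate.

```python
def lora_param_count(r, in_features, out_features):
    """A LoRA adapter adds two matrices: A (r x in) and B (out x r)."""
    return r * in_features + out_features * r


# Assumed model dimensions for a 7B Llama-style model (not taken from the notebook)
HIDDEN_SIZE = 4096
NUM_LAYERS = 32
TARGET_MODULES = ["q_proj", "k_proj", "v_proj", "o_proj"]

r, lora_alpha = 16, 16
per_module = lora_param_count(r, HIDDEN_SIZE, HIDDEN_SIZE)
total = per_module * len(TARGET_MODULES) * NUM_LAYERS
scaling = lora_alpha / r  # the adapter output is multiplied by this factor

print(f"per module: {per_module:,}")
print(f"total trainable: {total:,}")
print(f"scaling: {scaling}")
```

Under these assumptions the adapter has roughly 17 million parameters, a small fraction of the roughly 7 billion frozen base weights.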

To resume from a checkpoint, set resume_from_checkpoint to the path of the adapter_model.bin you want to resume from. This code will replace the weights of the LoRA adapter attached to the model:

In [None]:
resume_from_checkpoint = "" # set this to the adapter_model.bin file you want to resume from

if resume_from_checkpoint:
    if os.path.exists(resume_from_checkpoint):
        print(f"Restarting from {resume_from_checkpoint}")
        adapters_weights = torch.load(resume_from_checkpoint)
        set_peft_model_state_dict(model, adapters_weights)
    else:
        print(f"Checkpoint {resume_from_checkpoint} not found")
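
The Trainer writes its checkpoints into `checkpoint-<step>` subfolders of the output directory, so finding the most recent one is a matter of comparing step numbers. A small sketch, where `latest_checkpoint_dir` is a hypothetical helper and the demo uses a throwaway directory rather than a real training run:

```python
import os
import re
import tempfile

CHECKPOINT_PATTERN = re.compile(r"^checkpoint-(\d+)$")


def latest_checkpoint_dir(output_dir):
    """Return the checkpoint-<step> subfolder with the highest step, or None."""
    best_step, best_path = -1, None
    if not os.path.isdir(output_dir):
        return None
    for name in os.listdir(output_dir):
        match = CHECKPOINT_PATTERN.match(name)
        path = os.path.join(output_dir, name)
        if match and os.path.isdir(path) and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path


# Demo on a throwaway directory so the sketch runs without a real training run
with tempfile.TemporaryDirectory() as demo_dir:
    for step in (20, 100, 40):
        os.makedirs(os.path.join(demo_dir, f"checkpoint-{step}"))
    print(os.path.basename(latest_checkpoint_dir(demo_dir)))  # checkpoint-100
```

Comparing the steps as integers matters here: sorting the folder names as strings would put `checkpoint-40` after `checkpoint-100`.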

Optional stuff to set up Weights and Biases to view training graphs:

In [None]:
wandb_project = "sql-try2-coder"
if len(wandb_project) > 0:
    os.environ["WANDB_PROJECT"] = wandb_project


In [None]:
if torch.cuda.device_count() > 1:
    # keeps Trainer from trying its own DataParallelism when more than 1 gpu is available
    model.is_parallelizable = True
    model.model_parallel = True

### 6. Training arguments
If you run out of GPU memory, reduce per_device_train_batch_size. Because gradient_accumulation_steps is derived from it, the effective batch size stays at batch_size, so this shouldn't affect batch dynamics during the training run. All the other variables are standard stuff that I wouldn't recommend messing with:

In [None]:
batch_size = 128
per_device_train_batch_size = 32
gradient_accumulation_steps = batch_size // per_device_train_batch_size
output_dir = "sql-code-llama"

training_args = TrainingArguments(
        per_device_train_batch_size=per_device_train_batch_size,
        gradient_accumulation_steps=gradient_accumulation_steps,
        warmup_steps=100,
        max_steps=400,
        learning_rate=3e-4,
        fp16=True,
        logging_steps=10,
        optim="adamw_torch",
        evaluation_strategy="steps", # if val_set_size > 0 else "no",
        save_strategy="steps",
        eval_steps=20,
        save_steps=20,
        output_dir=output_dir,
        # save_total_limit=3,
        load_best_model_at_end=False,
        # ddp_find_unused_parameters=False if ddp else None,
        group_by_length=True, # group sequences of roughly the same length together to speed up training
        report_to="wandb", # if use_wandb else "none",
        run_name=f"codellama-{datetime.now().strftime('%Y-%m-%d-%H-%M')}", # if use_wandb else None,
    )

trainer = Trainer(
    model=model,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_val_dataset,
    args=training_args,
    data_collator=DataCollatorForSeq2Seq(
        tokenizer, pad_to_multiple_of=8, return_tensors="pt", padding=True
    ),
)
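
The relationship between per_device_train_batch_size and gradient_accumulation_steps can be checked with a few lines of arithmetic. This is a sketch with a hypothetical helper name; the `num_devices` parameter is my addition for the multi-GPU case and is not used in the notebook.

```python
def accumulation_steps(batch_size, per_device_batch_size, num_devices=1):
    """Gradient accumulation steps needed to reach the target effective batch size."""
    per_step = per_device_batch_size * num_devices
    if batch_size % per_step != 0:
        raise ValueError("batch_size must be a multiple of per_device_batch_size * num_devices")
    return batch_size // per_step


for per_device in (32, 16, 8):
    steps = accumulation_steps(128, per_device)
    print(per_device, steps, per_device * steps)  # effective batch stays at 128
```

Halving the per-device batch size doubles the accumulation steps, so each optimizer update still sees 128 sequences; it just takes more forward and backward passes to get there.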

Then we do some PyTorch-related optimisations (which just make training faster but don't affect accuracy):

In [None]:
model.config.use_cache = False

old_state_dict = model.state_dict
model.state_dict = (lambda self, *_, **__: get_peft_model_state_dict(self, old_state_dict())).__get__(
    model, type(model)
)
if int(torch.__version__.split(".")[0]) >= 2 and sys.platform != "win32":  # compare the major version as a number, not a string
    print("compiling the model")
    model = torch.compile(model)

In [None]:
trainer.train()

### Load the final checkpoint
Now for the moment of truth! Has our work paid off...?

In [None]:
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, AutoTokenizer

base_model = "codellama/CodeLlama-7b-hf"
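model = AutoModelForCausalLM.from_pretrained(
    base_model,
    # use the BitsAndBytesConfig imported above instead of the deprecated load_in_8bit argument
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(base_model)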

To load a fine-tuned LoRA/QLoRA adapter use PeftModel.from_pretrained. `output_dir` should be a directory containing an adapter_config.json and adapter_model.bin (newer PEFT versions save adapter_model.safetensors instead):

In [None]:
from peft import PeftModel
model = PeftModel.from_pretrained(model, output_dir)
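
Before loading, it can save a confusing error to check that the directory actually holds an adapter. A minimal sketch, assuming the file names mentioned above; `looks_like_adapter_dir` is a hypothetical helper and the demo runs on a temporary folder with empty placeholder files.

```python
import tempfile
from pathlib import Path

# adapter_model.safetensors is what newer PEFT versions write instead of adapter_model.bin
WEIGHT_FILES = ("adapter_model.bin", "adapter_model.safetensors")


def looks_like_adapter_dir(path):
    """True if path holds an adapter config plus at least one weights file."""
    folder = Path(path)
    has_config = (folder / "adapter_config.json").is_file()
    has_weights = any((folder / name).is_file() for name in WEIGHT_FILES)
    return has_config and has_weights


with tempfile.TemporaryDirectory() as demo_dir:
    print(looks_like_adapter_dir(demo_dir))  # False: empty folder
    Path(demo_dir, "adapter_config.json").write_text("{}")
    Path(demo_dir, "adapter_model.bin").write_bytes(b"")
    print(looks_like_adapter_dir(demo_dir))  # True
```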

Try the same prompt as before:

In [None]:
eval_prompt = """You are a powerful text-to-SQL model. Your job is to answer questions about a database. You are given a question and context regarding one or more tables.

You must output the SQL query that answers the question.
### Input:
Which Class has a Frequency MHz larger than 91.5, and a City of license of hyannis, nebraska?

### Context:
CREATE TABLE table_name_12 (class VARCHAR, frequency_mhz VARCHAR, city_of_license VARCHAR)

### Response:
"""

model_input = tokenizer(eval_prompt, return_tensors="pt").to("cuda")

model.eval()
with torch.no_grad():
    print(tokenizer.decode(model.generate(**model_input, max_new_tokens=100)[0], skip_special_tokens=True))


And the model outputs:
```
SELECT class FROM table_name_12 WHERE frequency_mhz > 91.5 AND city_of_license = "hyannis, nebraska"
```
So it works! If you want to convert this adapter to a Llama.cpp model to run locally, follow my other [guide](https://ragntune.com/blog/A-guide-to-running-Llama-2-qlora-loras-on-Llama.cpp). If you have any questions, shoot me a message on [Elon Musk's website](https://twitter.com/samlhuillier_).
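
If you want a quick automated sanity check on generated queries, one option is to run them against an empty in-memory SQLite database built from the 'context'. This is only a sketch: `runs_against_schema` is a hypothetical helper, the string literal is written with single quotes (the standard SQL form), and a query that executes is not necessarily a query that answers the question.

```python
import sqlite3


def runs_against_schema(create_statement, query):
    """Return (True, None) if the query executes on an empty table, else (False, error)."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(create_statement)
        conn.execute(query).fetchall()
        return True, None
    except sqlite3.Error as exc:
        return False, str(exc)
    finally:
        conn.close()


schema = "CREATE TABLE table_name_12 (class VARCHAR, frequency_mhz VARCHAR, city_of_license VARCHAR)"
good = "SELECT class FROM table_name_12 WHERE frequency_mhz > 91.5 AND city_of_license = 'hyannis, nebraska'"
broken = "SELECT class FROM table_name_12 WHERE frequency_ mhz > 91"

print(runs_against_schema(schema, good))
print(runs_against_schema(schema, broken))
```

This catches malformed SQL and references to columns that don't exist, which makes it a cheap first filter before comparing outputs against the reference answers.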
