# A guide to fine-tuning Code Llama

**In this guide I show you how to fine-tune Code Llama to become a beast of an SQL developer. For coding tasks, you can generally get much better performance out of Code Llama than Llama 2, especially when you specialise the model on a particular task:**

- I use the [b-mc2/sql-create-context](https://huggingface.co/datasets/b-mc2/sql-create-context) dataset, which is a bunch of text questions and their corresponding SQL queries
- A LoRA approach: quantizing the base model to int8, freezing its weights and only training an adapter
- Much of the code is borrowed from [alpaca-lora](https://github.com/tloen/alpaca-lora), but I refactored it quite a bit for this


### 2. Pip installs


In [1]:
!pip install git+https://github.com/huggingface/transformers.git@main bitsandbytes accelerate==0.20.3  # we need latest transformers for this
!pip install git+https://github.com/huggingface/peft.git@e536616888d51b453ed354a6f1e243fecb02ea08
!pip install datasets==2.10.1
import locale # colab workaround
locale.getpreferredencoding = lambda: "UTF-8" # colab workaround
!pip install wandb


I used an A100 GPU machine with Python 3.10 and CUDA 11.8 to run this notebook. It took about an hour to run.

### Loading libraries


In [1]:
from datetime import datetime
import os
import sys

import torch
from peft import (
    LoraConfig,
    get_peft_model,
    get_peft_model_state_dict,
    prepare_model_for_int8_training,
    set_peft_model_state_dict,
)
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments, Trainer, DataCollatorForSeq2Seq


(If you have import errors, try restarting your Jupyter kernel)


### Load dataset


In [2]:
from datasets import load_dataset
train_dataset = load_dataset('json', data_files='/content/train_llama_format.json', split='train')
eval_dataset = load_dataset('json', data_files='/content/validation_llama_format.json', split='train')



The above loads pre-formatted training and validation sets from local JSON files (here under /content); the validation set is used to check how well the model is doing through training. If you want to load your own dataset, point data_files at your own files:

```
train_dataset = load_dataset('json', data_files='train_set.jsonl', split='train')
eval_dataset = load_dataset('json', data_files='validation_set.jsonl', split='train')
```

And if you want to view any samples in the dataset, just do something like:


In [3]:
print(train_dataset[3])
train_dataset

{'input': 'You are a powerful text-to-SQL model. Here is a database schema:\ndepartment :\nDepartment_ID [ INT ] primary_key\nName [ TEXT ]\nCreation [ TEXT ]\nRanking [ INT ]\nBudget_in_Billions [ INT ]\nNum_Employees [ INT ]\n\nhead :\nhead_ID [ INT ] primary_key\nname [ TEXT ]\nborn_state [ TEXT ]\nage [ INT ]\n\nmanagement :\ndepartment_ID [ INT ] primary_key management.department_ID = department.Department_ID\nhead_ID [ INT ] management.head_ID = head.head_ID\ntemporary_acting [ TEXT ]\n\nWrite an SQL query that answers the following: What are the maximum and minimum budget of the departments? \n### Response:', 'output': 'SELECT max(budget_in_billions) ,  min(budget_in_billions) FROM department;'}


Dataset({
    features: ['input', 'output'],
    num_rows: 7000
})

Each entry is made up of an 'input' (the instruction, the database schema and the text question) and an 'output' (the SQL query that answers it).
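If your own data is not yet in this shape, here is a minimal sketch of how records with 'input' and 'output' fields could be written to a JSONL file. The helper names `make_record` and `write_jsonl`, the toy schema and the temporary file path are all assumptions for illustration; the prompt wording simply mirrors the sample printed above.

```python
import json
import os
import tempfile


def make_record(schema: str, question: str, sql: str) -> dict:
    """Build one training record (hypothetical helper, not part of the notebook)."""
    prompt = (
        "You are a powerful text-to-SQL model. Here is a database schema:\n"
        f"{schema}\n\n"
        f"Write an SQL query that answers the following: {question} \n"
        "### Response:"
    )
    return {"input": prompt, "output": sql}


def write_jsonl(records, path):
    """Write one JSON object per line, the layout load_dataset('json', ...) reads."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


# Toy schema and question, made up for this example.
schema = "singer :\nSinger_ID [ INT ] primary_key\nName [ TEXT ]"
records = [make_record(schema, "How many singers do we have?", "SELECT count(*) FROM singer;")]

path = os.path.join(tempfile.mkdtemp(), "train_set.jsonl")
write_jsonl(records, path)

# Read the file back to confirm the round trip.
with open(path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]
print(sorted(loaded[0].keys()))
```

A file written this way can then be passed as `data_files` in the `load_dataset` call shown earlier.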

### Load model
I load Code Llama from Hugging Face in int8, which is standard for LoRA:

In [4]:
base_model = "codellama/CodeLlama-7b-hf"
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    load_in_8bit=True,
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("codellama/CodeLlama-7b-hf")


`torch_dtype=torch.float16` means computations are performed using a float16 representation, even though the values themselves are stored as 8-bit ints.
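To make the "stored as ints, computed as floats" idea concrete, here is a toy sketch of absmax quantization in plain Python. It is only an illustration of the principle: the scheme bitsandbytes actually uses is more involved (it scales per vector and treats outlier values separately), and the function names and weights below are made up.

```python
def quantize_absmax(weights):
    """Map floats to integers in [-127, 127] using a single scale factor.

    Assumes at least one weight is non-zero.
    """
    scale = max(abs(w) for w in weights) / 127.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values for use in computation."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.02, 0.9]  # made-up values
q, scale = quantize_absmax(weights)
restored = dequantize(q, scale)

# The rounding error per weight is at most half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, max_error)
```

The stored values fit in one byte each, while anything computed from `restored` is ordinary floating-point arithmetic.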

If you get the error "ValueError: Tokenizer class CodeLlamaTokenizer does not exist or is not currently imported.", make sure your transformers version is 4.33.0.dev0 and accelerate is >=0.20.3.


### 3. Check base model
A very good common practice is to check whether a model can already do the task at hand. Fine-tuning is something you want to try to avoid at all cost:


In [5]:
eval_prompt = eval_dataset[0]['input']
print(eval_prompt)
# first validation example: a database schema followed by the question "How many singers do we have?"
model_input = tokenizer(eval_prompt, return_tensors="pt").to("cuda")

model.eval()
with torch.no_grad():
    print(tokenizer.decode(model.generate(**model_input, max_new_tokens=50)[0], skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


You are a powerful text-to-SQL model. Here is a database schema:
stadium :
Stadium_ID [ INT ] primary_key
Location [ TEXT ]
Name [ TEXT ]
Capacity [ INT ]
Highest [ INT ]
Lowest [ INT ]
Average [ INT ]

singer :
Singer_ID [ INT ] primary_key
Name [ TEXT ]
Country [ TEXT ]
Song_Name [ TEXT ]
Song_release_year [ TEXT ]
Age [ INT ]
Is_male [ bool ]

concert :
concert_ID [ INT ] primary_key
concert_Name [ TEXT ]
Theme [ TEXT ]
Stadium_ID [ TEXT ] concert.Stadium_ID = stadium.Stadium_ID
Year [ TEXT ]

singer_in_concert :
concert_ID [ INT ] primary_key singer_in_concert.concert_ID = concert.concert_ID
Singer_ID [ TEXT ] singer_in_concert.Singer_ID = singer.Singer_ID

Write an SQL query that answers the following: How many singers do we have? 
### Response:
You are a powerful text-to-SQL model. Here is a database schema:
stadium :
Stadium_ID [ INT ] primary_key
Location [ TEXT ]
Name [ TEXT ]
Capacity [ INT ]
Highest [ INT ]
Lowest [ INT ]
Average [ INT ]

singer :
Singer_ID [ INT ] primary_k

I get the output:
```
SELECT * FROM table_name_12 WHERE class > 91.5 AND city_of_license = 'hyannis, nebraska'
```
which is clearly wrong, since the input asks for just the class!
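Eyeballing one output works for a quick check. If you want to score several examples, a rough sketch of a normalized exact-match comparison is below. The helper names are hypothetical, and the normalization is deliberately crude: lowercasing also changes string literals inside the query, and the decoded generation may not always start with the exact prompt text, so treat this as a starting point rather than a proper evaluation.

```python
def extract_response(generated: str, prompt: str) -> str:
    """Return only the text the model produced after the prompt."""
    if generated.startswith(prompt):
        generated = generated[len(prompt):]
    return generated.strip()


def normalize_sql(sql: str) -> str:
    """Lowercase, drop a trailing semicolon and collapse whitespace."""
    return " ".join(sql.strip().rstrip(";").lower().split())


def is_match(generated: str, prompt: str, expected: str) -> bool:
    """Compare the generated continuation with the reference query."""
    return normalize_sql(extract_response(generated, prompt)) == normalize_sql(expected)


# Shortened prompt and hand-written "generations" for illustration only.
prompt = "Write an SQL query that answers the following: How many singers do we have? \n### Response:"
good = prompt + "\nSELECT COUNT(*)  FROM singer"
bad = prompt + "\nYou are a powerful text-to-SQL model."
expected = "SELECT count(*) FROM singer;"

print(is_match(good, prompt, expected), is_match(bad, prompt, expected))
```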

### 4. Tokenization
Set up some tokenization settings, such as left padding, because it makes [training use less memory](https://ai.stackexchange.com/questions/41485/while-fine-tuning-a-decoder-only-llm-like-llama-on-chat-dataset-what-kind-of-pa):

In [6]:
tokenizer.add_eos_token = True
tokenizer.pad_token_id = 0
tokenizer.padding_side = "left"
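As a plain-Python illustration of what these settings mean for a batch, the sketch below pads two sequences of made-up token ids on the left with pad id 0 and builds the matching attention mask. `pad_batch` is a hypothetical helper written for this example, not the tokenizer's own implementation.

```python
PAD_TOKEN_ID = 0  # same value as tokenizer.pad_token_id in the cell above


def pad_batch(sequences, side="left", pad_id=PAD_TOKEN_ID):
    """Pad every sequence to the longest one and mark real tokens with 1."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        padding = [pad_id] * (max_len - len(seq))
        mask_pad = [0] * len(padding)
        if side == "left":
            input_ids.append(padding + list(seq))
            attention_mask.append(mask_pad + [1] * len(seq))
        else:
            input_ids.append(list(seq) + padding)
            attention_mask.append([1] * len(seq) + mask_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([[5, 6, 7], [8, 9]], side="left")
print(batch["input_ids"])       # [[5, 6, 7], [0, 8, 9]]
print(batch["attention_mask"])  # [[1, 1, 1], [0, 1, 1]]
```

With left padding, the last position of every row is a real token, so the text each sequence ends with lines up at the right edge of the batch.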

Set up the tokenize function to make labels and input_ids the same. This is basically what [self-supervised fine-tuning](https://neptune.ai/blog/self-supervised-learning) is:

In [7]:
def tokenize(prompt):
    result = tokenizer(
        prompt,
        truncation=True,
        max_length=512,
        padding=False,
        return_tensors=None,
    )

    # "self-supervised learning" means the labels are also the inputs:
    result["labels"] = result["input_ids"].copy()

    return result
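Why can the labels simply be a copy of the inputs? In a causal language model the prediction at each position is scored against the next token, and the model shifts the labels by one position internally. The sketch below spells out the resulting (context, target) pairs for a toy sequence; the token ids are made up and `next_token_pairs` is a hypothetical helper for illustration.

```python
def next_token_pairs(input_ids):
    """Pair each prefix of the sequence with the token that should come next."""
    labels = list(input_ids)  # same copy step as in tokenize()
    # Position i is scored against labels[i + 1], so the first label and the
    # last input position have no partner.
    return [(input_ids[: i + 1], labels[i + 1]) for i in range(len(input_ids) - 1)]


toy_ids = [1, 42, 7, 2]  # made-up ids standing in for <s>, two tokens, </s>
pairs = next_token_pairs(toy_ids)
for context, target in pairs:
    print(context, "->", target)
```

Because every token is a label here, the loss covers the prompt tokens as well as the SQL answer.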

Then convert each data_point into a prompt, using a format I found online that works quite well:

In [8]:
def generate_and_tokenize_prompt(data_point):
    full_prompt =f"""{data_point["input"]}
{data_point["output"]}
"""
    print(full_prompt)
    return tokenize(full_prompt)

Reformat to prompt and tokenize each sample:

In [9]:
tokenized_train_dataset = train_dataset.map(generate_and_tokenize_prompt)
tokenized_val_dataset = eval_dataset.map(generate_and_tokenize_prompt)

You are a powerful text-to-SQL model. Here is a database schema:
stadium :
Stadium_ID [ INT ] primary_key
Location [ TEXT ]
Name [ TEXT ]
Capacity [ INT ]
Highest [ INT ]
Lowest [ INT ]
Average [ INT ]

singer :
Singer_ID [ INT ] primary_key
Name [ TEXT ]
Country [ TEXT ]
Song_Name [ TEXT ]
Song_release_year [ TEXT ]
Age [ INT ]
Is_male [ bool ]

concert :
concert_ID [ INT ] primary_key
concert_Name [ TEXT ]
Theme [ TEXT ]
Stadium_ID [ TEXT ] concert.Stadium_ID = stadium.Stadium_ID
Year [ TEXT ]

singer_in_concert :
concert_ID [ INT ] primary_key singer_in_concert.concert_ID = concert.concert_ID
Singer_ID [ TEXT ] singer_in_concert.Singer_ID = singer.Singer_ID

Write an SQL query that answers the following: How many singers do we have? 
### Response:
SELECT count(*) FROM singer;


### 5. Setup Lora

In [10]:
model.train() # put model back into training mode
model = prepare_model_for_int8_training(model)

config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
    ],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
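To get a feel for how small the adapter is, here is a back-of-the-envelope count of the trainable parameters this config adds. The hidden size of 4096 and the 32 layers are assumptions about the 7B model's shape (they are not stated in this guide), and the count assumes all four targeted projections are square matrices of that size.

```python
def lora_params(in_features, out_features, r):
    """Parameters in one LoRA adapter: A is (r x in), B is (out x r)."""
    return r * in_features + out_features * r


# Assumed model shape for a 7B Llama-style model (not taken from this guide).
HIDDEN_SIZE = 4096
NUM_LAYERS = 32

# Values from the LoraConfig above.
TARGET_MODULES = ["q_proj", "k_proj", "v_proj", "o_proj"]
R, LORA_ALPHA = 16, 16

per_module = lora_params(HIDDEN_SIZE, HIDDEN_SIZE, R)
total = per_module * len(TARGET_MODULES) * NUM_LAYERS
scaling = LORA_ALPHA / R  # factor applied to the adapter's output

print(per_module, total, scaling)  # 131072 16777216 1.0
```

Under these assumptions the adapter has roughly 16.8 million trainable parameters, a small fraction of the frozen base model. You can compare this against what `model.print_trainable_parameters()` reports.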

To resume from a checkpoint, set resume_from_checkpoint to the path of the adapter_model.bin you want to resume from. This code will replace the LoRA adapter weights attached to the model:

In [11]:
resume_from_checkpoint = "" # set this to the adapter_model.bin file you want to resume from

if resume_from_checkpoint:
    if os.path.exists(resume_from_checkpoint):
        print(f"Restarting from {resume_from_checkpoint}")
        adapters_weights = torch.load(resume_from_checkpoint)
        set_peft_model_state_dict(model, adapters_weights)
    else:
        print(f"Checkpoint {resume_from_checkpoint} not found")
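If you are not sure which checkpoint is the most recent, a small helper like the one below can pick it for you. It assumes checkpoints are saved as `checkpoint-<step>` folders inside the output directory, which is the Trainer's usual naming; `latest_checkpoint` is a hypothetical helper, and the demo folders are created in a temporary directory purely for illustration.

```python
import os
import re
import tempfile


def latest_checkpoint(output_dir):
    """Return the checkpoint-<step> folder with the highest step, or "" if none."""
    pattern = re.compile(r"^checkpoint-(\d+)$")
    best_step, best_path = -1, ""
    for name in os.listdir(output_dir):
        match = pattern.match(name)
        path = os.path.join(output_dir, name)
        if match and os.path.isdir(path) and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path


# Fake checkpoint folders standing in for a real training run.
demo_dir = tempfile.mkdtemp()
for step in (20, 40, 100):
    os.makedirs(os.path.join(demo_dir, f"checkpoint-{step}"))

found = latest_checkpoint(demo_dir)
print(os.path.basename(found))  # checkpoint-100
```

Steps are compared as numbers rather than strings, so `checkpoint-100` correctly beats `checkpoint-40`. Which weight files sit inside each folder depends on your library versions, so check for the adapter file before pointing resume_from_checkpoint at it.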

Optional stuff to set up Weights and Biases to view training graphs:

In [12]:
wandb_project = "spider-llama7B-test1"
if len(wandb_project) > 0:
    os.environ["WANDB_PROJECT"] = wandb_project


In [13]:
if torch.cuda.device_count() > 1:
    # keeps Trainer from trying its own DataParallelism when more than 1 gpu is available
    model.is_parallelizable = True
    model.model_parallel = True

### 6. Training arguments
If you run out of GPU memory, reduce per_device_train_batch_size. Because gradient_accumulation_steps is derived from it, the effective batch size stays at batch_size, so this shouldn't affect batch dynamics during the training run. All the other variables are standard stuff that I wouldn't recommend messing with:

In [14]:
batch_size = 128
per_device_train_batch_size = 64
gradient_accumulation_steps = batch_size // per_device_train_batch_size
output_dir = "spider-7B-test-1"

training_args = TrainingArguments(
        per_device_train_batch_size=per_device_train_batch_size,
        gradient_accumulation_steps=gradient_accumulation_steps,
        warmup_steps=100,
        max_steps=400,
        learning_rate=3e-4,
        fp16=True,
        logging_steps=10,
        optim="adamw_torch",
        evaluation_strategy="steps", # if val_set_size > 0 else "no",
        save_strategy="steps",
        eval_steps=20,
        save_steps=20,
        output_dir=output_dir,
        # save_total_limit=3,
        load_best_model_at_end=False,
        # ddp_find_unused_parameters=False if ddp else None,
        group_by_length=True, # group sequences of roughly the same length together to speed up training
        report_to="wandb", # if use_wandb else "none",
        run_name=f"codellama-{datetime.now().strftime('%Y-%m-%d-%H-%M')}", # if use_wandb else None,
    )

trainer = Trainer(
    model=model,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_val_dataset,
    args=training_args,
    data_collator=DataCollatorForSeq2Seq(
        tokenizer, pad_to_multiple_of=8, return_tensors="pt", padding=True
    ),
)
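The relationship between the batch settings can be checked with a little arithmetic. The sketch below uses the values from the cell above plus the 7000 training rows shown earlier; `accumulation_steps` is a hypothetical helper, and the sketch assumes a single data-parallel process.

```python
def accumulation_steps(batch_size, per_device_batch_size):
    """Micro-batches to accumulate so each optimizer step sees batch_size examples."""
    if batch_size % per_device_batch_size != 0:
        raise ValueError("batch_size must be a multiple of per_device_batch_size")
    return batch_size // per_device_batch_size


BATCH_SIZE, MAX_STEPS, TRAIN_ROWS = 128, 400, 7000

# Halving the per-device batch doubles the accumulation steps,
# leaving the effective batch size unchanged.
for per_device in (64, 32, 16):
    steps = accumulation_steps(BATCH_SIZE, per_device)
    print(per_device, steps, per_device * steps)

# Total examples processed over the run, and how many passes over the data that is.
examples_seen = BATCH_SIZE * MAX_STEPS
epochs = examples_seen / TRAIN_ROWS
print(examples_seen, round(epochs, 2))  # 51200 7.31
```

So with max_steps=400 the run makes a little over seven passes over the training set, whichever per-device batch size fits on your GPU.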

Then we do some PyTorch-related optimisations (which just make training faster but don't affect accuracy):

In [15]:
model.config.use_cache = False

old_state_dict = model.state_dict
model.state_dict = (lambda self, *_, **__: get_peft_model_state_dict(self, old_state_dict())).__get__(
    model, type(model)
)
if torch.__version__ >= "2" and sys.platform != "win32":
    print("compiling the model")
    model = torch.compile(model)

compiling the model


In [None]:
trainer.train()


| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 20   | 1.0672        | 1.160535        |
| 40   | 0.9054        | 0.722871        |


### Load the final checkpoint
Now for the moment of truth! Has our work paid off...?

In [None]:
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, AutoTokenizer

base_model = "codellama/CodeLlama-7b-hf"
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    load_in_8bit=True,
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("codellama/CodeLlama-7b-hf")

To load a fine-tuned LoRA/QLoRA adapter, use PeftModel.from_pretrained. `output_dir` should be a directory containing an adapter_config.json and adapter_model.bin:

In [None]:
from peft import PeftModel
model = PeftModel.from_pretrained(model, output_dir)

Try the same prompt as before:

In [None]:
eval_prompt = """You are a powerful text-to-SQL model. Your job is to answer questions about a database. You are given a question and context regarding one or more tables.

You must output the SQL query that answers the question.
### Input:
Which Class has a Frequency MHz larger than 91.5, and a City of license of hyannis, nebraska?

### Context:
CREATE TABLE table_name_12 (class VARCHAR, frequency_mhz VARCHAR, city_of_license VARCHAR)

### Response:
"""

model_input = tokenizer(eval_prompt, return_tensors="pt").to("cuda")

model.eval()
with torch.no_grad():
    print(tokenizer.decode(model.generate(**model_input, max_new_tokens=100)[0], skip_special_tokens=True))


And the model outputs:
```
SELECT class FROM table_name_12 WHERE frequency_mhz > 91.5 AND city_of_license = "hyannis, nebraska"
```
So it works! If you want to convert this adapter to a Llama.cpp model to run locally, follow my other [guide](https://ragntune.com/blog/A-guide-to-running-Llama-2-qlora-loras-on-Llama.cpp). If you have any questions, shoot me a message on [Elon Musk's website](https://twitter.com/samlhuillier_).
