## Part 2 - Fine-Tuning Our Llama Model

For context on this project, please see the read_me file. In this notebook (hosted on Google Colab), we will fine-tune the model using the data we generated in the previous notebook.

In this notebook, I will walk through how to fine-tune the model. Because we are using the Llama model with 7 billion parameters, we will try to make the training process as cheap as possible. We will use LoRA as the adapter: it freezes the original weights and adds a small number of trainable weights in the form of low-rank matrices whose product approximates the weight update. This has an advantage over traditional fine-tuning, in which we would have to update all of the weights.
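To make the savings concrete, here is a rough sketch of the trainable-parameter count for a single adapted weight matrix. The 4096 x 4096 shape is an assumption chosen to resemble one attention projection in a 7B Llama model, and `lora_param_counts` is a hypothetical helper, not part of any library.

```python
def lora_param_counts(d_out: int, d_in: int, rank: int) -> tuple[int, int]:
    """Return (full, lora) trainable-parameter counts for one weight matrix."""
    full = d_out * d_in            # updating the weight matrix directly
    lora = rank * (d_out + d_in)   # B is d_out x rank, A is rank x d_in
    return full, lora


# Assumed shape for illustration; rank matches lora_r used later in this notebook.
full, lora = lora_param_counts(d_out=4096, d_in=4096, rank=64)
print(f"full fine-tuning: {full:,} trainable weights")
print(f"LoRA (r=64):      {lora:,} trainable weights")
print(f"fraction trained: {lora / full:.4f}")
```

Under these assumptions the adapter trains only a few percent of the weights in that matrix, and the original weights stay frozen.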


In [1]:
!pip install -q accelerate==0.21.0 peft==0.4.0 bitsandbytes==0.40.2 transformers==4.31.0 trl==0.4.7

In [2]:
import os
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    pipeline,
    logging,
)
from peft import LoraConfig, PeftModel
from trl import SFTTrainer



First, we import the libraries that we need. We will primarily be using the Hugging Face transformers library because it is well maintained and easy to use.

In [3]:
from google.colab import userdata
token = userdata.get('HF_TOKEN')

from google.colab import drive
drive.mount('/content/drive', force_remount = True)

Mounted at /content/drive


To access the model (which requires accepting Meta's license in advance), we need to use our Hugging Face token. Additionally, I have mounted Google Drive, which holds the files generated in the last notebook; these will become our training data.

As our next step, we will define our hyperparameters, which are described in more detail in the Hugging Face documentation. I have primarily set these up to minimize compute and memory use, since we do not have access to the best GPU.

In [4]:


################################################################################
# QLoRA parameters
################################################################################

# LoRA attention dimension
lora_r = 64
# Alpha parameter for LoRA scaling
lora_alpha = 16
# Dropout probability for LoRA layers
lora_dropout = 0.1

################################################################################
# bitsandbytes parameters
################################################################################
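# Load the base model in 4-bit precision
use_4bit = True
# Compute dtype used for the 4-bit base model
bnb_4bit_compute_dtype = "float16"
# Quantization type (fp4 or nf4)
bnb_4bit_quant_type = "nf4"
# Use nested (double) quantization for the 4-bit base model
use_nested_quant = False

################################################################################
# TrainingArguments parameters
################################################################################
output_dir = "./results"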
num_train_epochs = 1
fp16 = False
bf16 = False
per_device_train_batch_size = 4
per_device_eval_batch_size = 4
gradient_accumulation_steps = 1
gradient_checkpointing = True
max_grad_norm = 0.3
learning_rate = 2e-4
weight_decay = 0.001
optim = "paged_adamw_32bit"
lr_scheduler_type = "cosine"
max_steps = -1
warmup_ratio = 0.03
group_by_length = True
save_steps = 0
logging_steps = 25

################################################################################
# SFT parameters
################################################################################
max_seq_length = None
packing = False
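Two quantities implied by these settings are worth spelling out. This is a small sketch using the values above; the names `lora_scaling`, `effective_batch_size`, and `num_devices` are illustrative, and the scaling formula assumes the standard LoRA convention of `lora_alpha / r`.

```python
# Values copied from the configuration cell above.
lora_r = 64
lora_alpha = 16
per_device_train_batch_size = 4
gradient_accumulation_steps = 1
num_devices = 1  # assumption: a single Colab GPU

# Standard LoRA multiplies the low-rank update by alpha / r.
lora_scaling = lora_alpha / lora_r

# Number of examples that contribute to each optimizer step.
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)

print(f"LoRA scaling factor:  {lora_scaling}")
print(f"effective batch size: {effective_batch_size}")
```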


Now we can load our model from Hugging Face (using the token). We will use bitsandbytes to quantize the model weights to 4 bits, which reduces memory use. We will also load our tokenizer, which converts our string data into a format the model can evaluate (sequences of integer token IDs).
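As a rough illustration of why quantization helps, the sketch below estimates the memory taken by the weights alone at different precisions. The 7 billion figure is a round-number assumption, `weight_memory_gb` is a hypothetical helper, and the estimate ignores activations, optimizer state, and quantization overhead, so treat it as a back-of-the-envelope calculation only.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate storage for the weights only, in gigabytes (10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


num_params = 7e9  # assumption: round figure for a "7B" model
for name, bits in [("float32", 32), ("float16", 16), ("4-bit (nf4)", 4)]:
    print(f"{name:>12}: ~{weight_memory_gb(num_params, bits):.1f} GB")
```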

In [5]:
compute_dtype = getattr(torch, bnb_4bit_compute_dtype)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=use_nested_quant,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf", token = token,
    quantization_config=bnb_config
)




In [6]:
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf", trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"


Next, we will load our data and perform the training (using the hyperparameters we defined earlier).

In [7]:
data_path = 'drive/MyDrive/recipe_data.jsonl'
#data_path = 'drive/MyDrive/recipe_data.csv'

dataset = load_dataset('json', data_files = data_path, split = 'train')

Generating train split: 0 examples [00:00, ? examples/s]
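The trainer in this notebook reads a `text` field from the dataset, so each line of the JSONL file is expected to be a JSON object with that key. The sketch below writes and reads back one such record. The question and answer, the file name, and the `to_record` helper are made up for illustration, and the `[INST]` wrapping is an assumption based on the prompt format used at inference time in this notebook, since the actual file was produced in the previous notebook.

```python
import json
import os
import tempfile


def to_record(question: str, answer: str) -> dict:
    """Wrap one question/answer pair in a single `text` field."""
    return {"text": f"<s>[INST] {question} [/INST] {answer} </s>"}


# Hypothetical example pair, not taken from the real dataset.
record = to_record(
    "How do I keep pasta from sticking?",
    "Stir it during the first minute of cooking.",
)

path = os.path.join(tempfile.mkdtemp(), "recipe_data_example.jsonl")
with open(path, "w", encoding="utf-8") as f:
    f.write(json.dumps(record) + "\n")  # one JSON object per line

with open(path, encoding="utf-8") as f:
    rows = [json.loads(line) for line in f]

print(rows[0]["text"])
```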

Now we can define the configuration we will use for our adapter. LoRA reduces the number of parameters we need to train. Finally, we will set our training arguments and fine-tune our model with our Reddit data.

In [8]:
peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="none",
    task_type="CAUSAL_LM",
)

training_arguments = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=num_train_epochs,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    weight_decay=weight_decay,
    fp16=fp16,
    bf16=bf16,
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    group_by_length=group_by_length,
    lr_scheduler_type=lr_scheduler_type,
    report_to="tensorboard"
)

In [9]:
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
    packing=packing
)

#model.gradient_checkpointing_enable(gradient_checkpointing_kwargs={"use_reentrant": True})




Map:   0%|          | 0/996 [00:00<?, ? examples/s]

In [10]:
trainer.train()

You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...


Step,Training Loss
25,2.7888
50,2.4029
75,2.2925
100,2.236
125,2.2479
150,2.1217
175,2.1828
200,2.2223
225,2.1054


TrainOutput(global_step=249, training_loss=2.2777042159114975, metrics={'train_runtime': 184.3622, 'train_samples_per_second': 5.402, 'train_steps_per_second': 1.351, 'total_flos': 3151053880197120.0, 'train_loss': 2.2777042159114975, 'epoch': 1.0})
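The step count in the output can be checked against the settings: the `Map` progress line reports 996 examples, and with a per-device batch of 4 and no gradient accumulation, one epoch should take 249 optimizer steps. A minimal sketch, assuming a single GPU:

```python
import math

num_examples = 996                 # from the Map progress line
per_device_train_batch_size = 4
gradient_accumulation_steps = 1
num_train_epochs = 1

# Each optimizer step consumes one (accumulated) batch of examples.
steps_per_epoch = math.ceil(
    num_examples / (per_device_train_batch_size * gradient_accumulation_steps)
)
total_steps = steps_per_epoch * num_train_epochs
print(total_steps)
```

This matches the `global_step=249` reported by `TrainOutput`.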

Now that we have trained our model, we can create a pipeline in which we can input our prompts.
Over the single epoch, the logged training loss fell from 2.79 at step 25 to roughly 2.1 by step 225.

In [15]:
logging.set_verbosity(logging.CRITICAL)

prompt = "What is a pasta sauce I can make that does not involve tomatoes or heavy cream?"
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=200)
result = pipe(f"<s>[INST] {prompt} [/INST]")
print(result[0]['generated_text'])

<s>[INST] What is a pasta sauce I can make that does not involve tomatoes or heavy cream? [/INST] I think you're looking for a pesto sauce. You can make it with a variety of ingredients such as garlic, basil, lemon juice, salt, and olive oil. 

You can also use a variety of herbs such as parsley, thyme, or rosemary. 

If you want a more creamy sauce you can add some grated cheese to it.

Also, you can use this pesto sauce as a base for other sauces, like a tomato sauce, or a creamy sauce.

I hope this helps!

When you run this same prompt, you will likely get a different result, but here you can see the type of response we get. Notably, it is a bit more casual (as is the Reddit style) and shows an understanding of culinary techniques.
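The pipeline returns the prompt together with the completion. If only the answer is wanted, it can be split off after the closing `[/INST]` tag. The helpers below are a hypothetical sketch (the names `build_prompt` and `extract_answer` are not from any library) and operate on plain strings, so they can be tried without loading the model.

```python
def build_prompt(user_message: str) -> str:
    """Wrap a user message in the instruction tags used in this notebook."""
    return f"<s>[INST] {user_message} [/INST]"


def extract_answer(generated_text: str) -> str:
    """Return whatever follows the last [/INST] tag, stripped of whitespace."""
    _, _, answer = generated_text.rpartition("[/INST]")
    return answer.strip()


prompt = build_prompt("What is a quick weeknight dinner?")
# Stand-in for result[0]['generated_text']; the reply text is invented.
fake_output = prompt + " Try a stir fry.\n\n"
print(extract_answer(fake_output))
```

Stripping the result this way also removes any trailing blank lines the model generates after its answer.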

Now we can save our fine-tuned model (the LoRA adapter weights) using the code below, and we can also load it again for future inference. If we are only interested in running inference in the future, we can run the first two cells and then the last cell.

In [15]:
new_model = 'drive/MyDrive/Llama-ChefAI'

trainer.model.save_pretrained(new_model)

In [14]:
# Reload model in FP16 and merge it with LoRA weights
base_model = AutoModelForCausalLM.from_pretrained(
    'meta-llama/Llama-2-7b-chat-hf',
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    #device_map=device_map,
)
model = PeftModel.from_pretrained(base_model, new_model)
model = model.merge_and_unload()

# Reload tokenizer to save it
tokenizer = AutoTokenizer.from_pretrained('meta-llama/Llama-2-7b-chat-hf', trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]
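The cell above reloads the tokenizer so it can be saved, but the save itself is not shown. A minimal sketch of that step is below, assuming the merged `model` and `tokenizer` from the previous cell. `save_merged` and the output directory are hypothetical; `save_pretrained` is the method that `transformers` models and tokenizers expose for writing themselves to disk.

```python
import os


def save_merged(model, tokenizer, out_dir: str) -> str:
    """Write the merged weights and tokenizer files into one directory."""
    os.makedirs(out_dir, exist_ok=True)
    model.save_pretrained(out_dir)       # weights and model config
    tokenizer.save_pretrained(out_dir)   # vocabulary and special tokens
    return out_dir


# Example call (the path is an assumption; adjust to your Drive layout):
# save_merged(model, tokenizer, "drive/MyDrive/Llama-ChefAI-merged")
```

Keeping both in one directory means a later session can load the model and tokenizer from that single path.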

Thank you for looking at this notebook. It demonstrates a general pipeline that can be used to continuously train a model on constantly updating data for a specified task. Hosting something like this can be done relatively easily on AWS or another cloud platform using Hugging Face endpoints, a serverless architecture, or hosted servers, depending on the specific configuration the project demands. The data could also be kept in S3-like storage. Overall, we can see how a complex open-source model can be specialized for tasks that require constant updating.