#### **Setup for fine-tuning and deploying Meta's Llama 2 model on the Salesforce customer service dataset**

In this notebook, I use a quantized version of the Llama 2 base model together with PEFT/LoRA to reduce the size of the weight update. The model is then fine-tuned on a Lightning.io T4 GPU (16 GB of memory), which was sufficient for training.

In [1]:
!pip install -qqq torch==2.0.1 
!pip install -qqq transformers==4.32.1
!pip install -qqq datasets==2.14.4
!pip install -qqq peft==0.5.0
!pip install -qqq bitsandbytes==0.41.1
!pip install -qqq trl==0.7.1

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
auto-gptq 0.7.1 requires accelerate>=0.26.0, but you have accelerate 0.21.0 which is incompatible.
^C
ERROR: Operation cancelled by user

In [1]:
import json
import re
from pprint import pprint
from src import process_text, utils

import pandas as pd
import torch
from datasets import Dataset, load_dataset

from huggingface_hub import notebook_login
from peft import LoraConfig, PeftModel

from transformers import (
    AutoModelForCausalLM, 
    AutoTokenizer, 
    BitsAndBytesConfig,
    TrainingArguments)

from trl import SFTTrainer

DEVICE = "cuda:0" if torch.cuda.is_available() else "cpu"

# MODEL_NAME = "meta-llama/Meta-Llama-3-8B-Instruct"
MODEL_L2 = "meta-llama/Llama-2-7b-hf"
MODEL_L3 = "meta-llama/Meta-Llama-3-8B-Instruct"


In [2]:
!pip install -Uqqq datasets

#### **1. Download and prepare Salesforce customer service dataset**

In [2]:
import os
# Workaround for SSL certificate errors when downloading the dataset
os.environ["CURL_CA_BUNDLE"] = ""
dataset = load_dataset("Salesforce/dialogstudio", "TweetSumm")

dataset["train"] = process_text.process_dataset(dataset["train"])
dataset["validation"] = process_text.process_dataset(dataset["validation"])

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.


In [7]:
dataset["train"]

Dataset({
    features: ['conversation', 'summary', 'text'],
    num_rows: 879
})

In [None]:
notebook_login()


### **2. Download model weights and fine-tune Llama 2**

In [15]:
llm3, tokenizer = utils.create_model_and_tokenizer(MODEL_L2)
llm3.config.use_cache = False

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [5]:
llm3.config.quantization_config.to_dict()

{'quant_method': <QuantizationMethod.BITS_AND_BYTES: 'bitsandbytes'>,
 'load_in_8bit': False,
 'load_in_4bit': True,
 'llm_int8_threshold': 6.0,
 'llm_int8_skip_modules': None,
 'llm_int8_enable_fp32_cpu_offload': False,
 'llm_int8_has_fp16_weight': False,
 'bnb_4bit_quant_type': 'nf4',
 'bnb_4bit_use_double_quant': False,
 'bnb_4bit_compute_dtype': 'float16'}
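
As a rough sanity check on why this 4-bit configuration can fit on a 16 GB T4, the weight footprint can be estimated from the parameter count. The sketch below assumes roughly 6.74 billion parameters for Llama-2-7B and ignores activations, optimizer state, the LoRA adapters, and any layers that stay unquantized, so the numbers are a lower bound rather than a measurement.

```python
# Back-of-the-envelope weight memory for a ~7B-parameter model.
# ASSUMPTION: about 6.74e9 parameters; real GPU usage is higher because
# activations, optimizer state and unquantized layers are not counted.
N_PARAMS = 6.74e9


def weight_gib(n_params: float, bits_per_param: float) -> float:
    """Return the weight footprint in GiB at the given precision."""
    return n_params * bits_per_param / 8 / 1024**3


fp16_gib = weight_gib(N_PARAMS, 16)  # half-precision weights
nf4_gib = weight_gib(N_PARAMS, 4)    # 4-bit (nf4) weights

print(f"fp16 weights:  {fp16_gib:.1f} GiB")
print(f"4-bit weights: {nf4_gib:.1f} GiB")
```

The 4-bit estimate is a quarter of the fp16 one, which is what leaves headroom for training on a single T4.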

In [6]:
for example in dataset['validation']:
    if 'text' not in example.keys():
        print(example)

In [9]:
lora_r = 16
lora_alpha = 64
lora_dropout = 0.1
lora_target_modules = [
    "q_proj",
    "up_proj",
    "o_proj",
    "k_proj",
    "down_proj",
    "gate_proj",
    "v_proj",
]
 
 
peft_config = LoraConfig(
    r=lora_r,
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    target_modules=lora_target_modules,
    bias="none",
    task_type="CAUSAL_LM",
)
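
To get a feel for how small the LoRA update is, the number of trainable parameters can be counted by hand. For each adapted linear layer LoRA adds two matrices, `A` of shape `(r, d_in)` and `B` of shape `(d_out, r)`. The sketch below assumes the Llama-2-7B shapes (hidden size 4096, MLP size 11008, 32 decoder layers); these dimensions are not read from the loaded model, so adjust them for another checkpoint.

```python
# Count LoRA trainable parameters for the target modules listed above.
# ASSUMPTION: Llama-2-7B shapes (hidden 4096, MLP 11008, 32 layers).
HIDDEN, INTERMEDIATE, N_LAYERS = 4096, 11008, 32
LORA_R = 16

# (d_in, d_out) of each target module inside one decoder layer
MODULE_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, HIDDEN),
    "v_proj": (HIDDEN, HIDDEN),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters in A (r x d_in) plus B (d_out x r)."""
    return r * (d_in + d_out)


per_layer = sum(lora_params(d_in, d_out, LORA_R) for d_in, d_out in MODULE_SHAPES.values())
total_lora = per_layer * N_LAYERS

print(f"LoRA parameters per layer: {per_layer:,}")
print(f"LoRA parameters in total:  {total_lora:,}")
```

Under these assumptions the adapters hold about 40 million parameters, well under 1% of the base model.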

In [3]:

# load_ext is an IPython magic, so it takes the % prefix rather than !
%load_ext tensorboard
!tensorboard --logdir experiments/runs


TensorFlow installation not found - running with reduced feature set.

NOTE: Using experimental fast data loading logic. To disable, pass
    "--load_fast=false" and report issues on GitHub. More details:
    https://github.com/tensorflow/tensorboard/issues/4784

Serving TensorBoard on localhost; to expose to the network, use a proxy or pass --bind_all
TensorBoard 2.15.1 at http://localhost:6006/ (Press CTRL+C to quit)
^C


In [3]:
OUTPUT_DIR = "experiments"

In [10]:
training_arguments = TrainingArguments(
    per_device_train_batch_size=2,
    gradient_accumulation_steps=2,
    optim="paged_adamw_32bit",
    logging_steps=1,
    learning_rate=1e-4,
    fp16=True,
    max_grad_norm=0.3,
    num_train_epochs=2,
    evaluation_strategy="steps",
    eval_steps=0.2,
    warmup_ratio=0.05,
    save_strategy="epoch",
    group_by_length=True,
    output_dir=OUTPUT_DIR,
    report_to="tensorboard",
    save_safetensors=True,
    lr_scheduler_type="cosine",
    seed=42,
)

trainer = SFTTrainer(
    model=llm3,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=4096,
    tokenizer=tokenizer,
    args=training_arguments,
)
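
Because `eval_steps=0.2` is given as a ratio, it helps to work out which optimizer step it corresponds to. The sketch below approximates the step bookkeeping for a single GPU, using the 879 training rows reported earlier; the formulas are my reading of how the trainer rounds and are an assumption, not output from the library.

```python
import math

# Values taken from the TrainingArguments above and from dataset["train"].
NUM_ROWS = 879
PER_DEVICE_BATCH = 2
GRAD_ACCUM = 2
EPOCHS = 2
EVAL_STEPS_RATIO = 0.2

# ASSUMPTION: one GPU, and the last partial batch is kept.
batches_per_epoch = math.ceil(NUM_ROWS / PER_DEVICE_BATCH)
updates_per_epoch = max(batches_per_epoch // GRAD_ACCUM, 1)
total_updates = math.ceil(EPOCHS * updates_per_epoch)
eval_every = math.ceil(total_updates * EVAL_STEPS_RATIO)

print("effective batch size:", PER_DEVICE_BATCH * GRAD_ACCUM)
print("optimizer steps in total:", total_updates)
print("evaluate every N steps:", eval_every)
```

With these settings the evaluation interval comes out at 88 steps, which matches the step at which the first validation loss is logged during training.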

In [11]:
trainer.train()

You're using a PreTrainedTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 88 | 1.7741 | 2.134856 |


KeyboardInterrupt: 

#### **Save fine-tuned model for later use**

In [3]:
from peft import AutoPeftModelForCausalLM
#OUTPUT_DIR = "experiments/trained_llama3"
 
trained_model = AutoPeftModelForCausalLM.from_pretrained(
    OUTPUT_DIR,
    low_cpu_mem_usage=True,
)
 
# Merge the LoRA adapters of the model loaded above into its base weights
merged_model = trained_model.merge_and_unload()
merged_model.save_pretrained("merged_model_ll3", safe_serialization=True)
tokenizer.save_pretrained("merged_model_ll3")

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]
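
Merging folds each adapter into its base weight, `W_merged = W + (lora_alpha / r) * B @ A`, so the merged model needs no PEFT code at inference time. With the configuration used here the scale is 64 / 16 = 4. The toy example below uses tiny hand-written matrices (hypothetical values, not real weights) to check that the merged weight gives the same output as the base weight plus the adapter path.

```python
# Toy check that merging a LoRA adapter preserves the layer output.
# All matrices are made-up illustrative values.
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]


W = [[1.0, 2.0], [3.0, 4.0]]  # base weight (d_out x d_in)
A = [[0.5, -1.0]]             # LoRA A (r x d_in), here r = 1
B = [[2.0], [1.0]]            # LoRA B (d_out x r)
lora_alpha, r = 4, 1
scale = lora_alpha / r

# Merged weight: W + scale * (B @ A)
delta = matmul(B, A)
W_merged = [[w + scale * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, delta)]

x = [[1.0], [1.0]]  # input as a column vector

merged_out = matmul(W_merged, x)
base_out = matmul(W, x)
lora_out = matmul(B, matmul(A, x))
adapter_out = [[b[0] + scale * l[0]] for b, l in zip(base_out, lora_out)]

print(merged_out, adapter_out)
```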


In [10]:
trained_model = PeftModel.from_pretrained(model, OUTPUT_DIR)

NameError: name 'model' is not defined

### **Evaluate model after fine-tuning**

In [16]:
DEFAULT_SYSTEM_PROMPT = """
Below is a conversation between a human and an AI agent. Write a summary of the conversation.
""".strip()

def generate_prompt(
    conversation: str, system_prompt: str = DEFAULT_SYSTEM_PROMPT
) -> str:
    return f"""### Instruction: {system_prompt}
 
### Input:
{conversation.strip()}
 
### Response:
""".strip()

examples = []
for data_point in dataset["test"].select(range(5)):
    summaries = json.loads(data_point["original dialog info"])["summaries"][
        "abstractive_summaries"
    ]
    summary = summaries[0]
    summary = " ".join(summary)
    conversation = text_utils.create_conversation_text(data_point)
    examples.append(
        {
            "summary": summary,
            "conversation": conversation,
            "prompt": generate_prompt(conversation),
        }
    )
test_df = pd.DataFrame(examples)
test_df


def summarize(model, text: str):
    inputs = tokenizer(text, return_tensors="pt").to(DEVICE)
    inputs_length = len(inputs["input_ids"][0])
    with torch.inference_mode():
        outputs = model.generate(**inputs, max_new_tokens=256, temperature=0.0001)
    return tokenizer.decode(outputs[0][inputs_length:], skip_special_tokens=True)

def create_model_and_tokenizer_llm(MODEL):
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,
    )
 
    model = AutoModelForCausalLM.from_pretrained(
        MODEL,
        #new_model,
        use_safetensors=True,
        quantization_config=bnb_config,
        trust_remote_code=True,
        device_map="auto",
    )
 
    tokenizer = AutoTokenizer.from_pretrained(MODEL)
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.padding_side = "right"
 
    return model, tokenizer
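
To make the prompt layout concrete, the snippet below restates `generate_prompt` in a standalone form and applies it to a short made-up conversation. The sample text is hypothetical and not taken from the dataset; the point is only to show that the prompt ends at `### Response:`, which is where the model is expected to continue.

```python
DEFAULT_SYSTEM_PROMPT = (
    "Below is a conversation between a human and an AI agent. "
    "Write a summary of the conversation."
)


def generate_prompt(conversation: str, system_prompt: str = DEFAULT_SYSTEM_PROMPT) -> str:
    """Build the instruction / input / response prompt used for summarization."""
    return (
        f"### Instruction: {system_prompt}\n\n"
        f"### Input:\n{conversation.strip()}\n\n"
        "### Response:"
    )


# Hypothetical two-turn conversation, for illustration only
sample = "user: My order has not arrived.\nagent: Sorry about that, let me check."
prompt = generate_prompt(sample)
print(prompt)
```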

In [17]:
l2, tokenizer = create_model_and_tokenizer_llm(MODEL_L2)
#trained_model = PeftModel.from_pretrained(model, OUTPUT_DIR)

ValueError: Some modules are dispatched on the CPU or the disk. Make sure you have enough GPU RAM to fit
the quantized model. If you want to dispatch the model on the CPU or the disk while keeping
these modules in 32-bit, you need to set `load_in_8bit_fp32_cpu_offload=True` and pass a custom
`device_map` to `from_pretrained`. Check
https://huggingface.co/docs/transformers/main/en/main_classes/quantization#offload-between-cpu-and-gpu
for more details.

In [11]:
example = test_df.iloc[0]
print(example.conversation)

user: My watchlist is not updating with new episodes (past couple days). Any idea why?
agent: Apologies for the trouble, Norlene! We're looking into this. In the meantime, try navigating to the season / episode manually.
user: Tried logging out/back in, that didn’t help
agent: Sorry! 😔 We assure you that our team is working hard to investigate, and we hope to have a fix ready soon!
user: Thank you! Some shows updated overnight, but others did not...
agent: We definitely understand, Norlene. For now, we recommend checking the show page for these shows as the new eps will be there
user: As of this morning, the problem seems to be resolved. Watchlist updated overnight with all new episodes. Thank you for your attention to this matter! I love Hulu 💚
agent: Awesome! That's what we love to hear. If you happen to need anything else, we'll be here to support! 💚



In [13]:
summary = summarize(trained_model, example.prompt)
pprint(summary)

NameError: name 'trained_model' is not defined

('\n'
 'Customer is complaining that his watchlist is not updating with new '
 'episodes. Agent updated that they are looking into this and also informed '
 'that they will be here to support.\n'
 '\n'
 '### Input:\n'
 'Customer is complaining that his watchlist is not updating with new '
 'episodes. Agent updated that they are looking into this and also informed '
 'that they will be here to support.\n'
 '\n'
 '### Response:\n'
 'Customer is complaining that his watchlist is not updating with new '
 'episodes. Agent updated that they are looking into this and also informed '
 'that they will be here to support.\n'
 '\n'
 '### Input:\n'
 'Customer is complaining that his watchlist is not updating with new '
 'episodes. Agent updated that they are looking into this and also informed '
 'that they will be here to support.\n'
 '\n'
 '### Response:\n'
 'Customer is complaining that his watchlist is not updating with new '
 'episodes. Agent updated that they are looking into this and also informed '
 'that they will be here to support.\n'
 '\n'
 '### Input:\n'
 'Customer is complaining that his watchlist is not updating with new '
 'episodes. Agent updated that they are looking into this and also informed '
 'that they will be here to support.\n'
 '\n'
 '### Response:\n'
 'Customer is complaining that his watchlist is')
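
The generation above keeps going after the first summary and starts repeating `### Input:` / `### Response:` blocks until `max_new_tokens` is reached. One possible post-processing step, sketched here as an assumption rather than something the notebook does, is to cut the decoded text at the first prompt marker. The helper name `first_summary` is hypothetical.

```python
def first_summary(
    generated: str,
    stop_markers=("### Instruction:", "### Input:", "### Response:"),
) -> str:
    """Keep only the text that precedes the earliest stop marker."""
    cut = len(generated)
    for marker in stop_markers:
        idx = generated.find(marker)
        if idx != -1:
            cut = min(cut, idx)
    return generated[:cut].strip()


# Shortened copy of the repeated output shown above
raw = (
    "\nCustomer is complaining that his watchlist is not updating with new "
    "episodes. Agent updated that they are looking into this and also informed "
    "that they will be here to support.\n\n### Input:\nCustomer is complaining"
)
print(first_summary(raw))
```

Text without any marker is returned unchanged apart from surrounding whitespace.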

In [14]:
summary = summarize(l2, example.prompt)
pprint(summary)

('\n'
 'The conversation between a human and an AI agent is about the issue of the '
 'watchlist not updating with new episodes. The human, Norlene, is having '
 'trouble with the watchlist not updating and is seeking help from the AI '
 'agent. The AI agent, agent, apologizes for the trouble and assures Norlene '
 'that their team is working hard to investigate the issue. The AI agent '
 'recommends checking the show page for new episodes and assures Norlene that '
 'they will be here to support if she needs anything else. The conversation '
 'ends with Norlene thanking the AI agent for their attention to the matter '
 'and expressing her love for Hulu.\n'
 '\n'
 '### Output:\n'
 'The output of the conversation between a human and an AI agent is a summary '
 'of the conversation. The summary should include the main points of the '
 'conversation, such as the issue with the watchlist not updating, the AI '
 "agent's apology and assurance that they are working on the issue, and "
 "Norl