#### **Setup for fine-tuning and deploying Meta's Llama 2 model on the Salesforce customer service dataset (TweetSumm)**

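In this notebook, I use a 4-bit quantized Llama 2 7B model together with PEFT/LoRA, which trains a small set of adapter weights instead of updating the full model. The model is fine-tuned on a Lightning.io T4 GPU (16 GB of memory), where two epochs (110 steps) finished in about 45 minutes and the validation loss fell from 2.03 to 1.90.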

In [1]:
!pip install -Uqqq pip --progress-bar off
!pip install torch==2.0.1 
!pip install transformers==4.32.1
!pip install datasets==2.14.4
!pip install peft==0.5.0
!pip install bitsandbytes==0.41.1
!pip install trl==0.7.1

Collecting torch==2.0.1
  Downloading torch-2.0.1-cp310-cp310-manylinux1_x86_64.whl.metadata (24 kB)
Collecting nvidia-cuda-nvrtc-cu11==11.7.99 (from torch==2.0.1)
  Downloading nvidia_cuda_nvrtc_cu11-11.7.99-2-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu11==11.7.99 (from torch==2.0.1)
  Downloading nvidia_cuda_runtime_cu11-11.7.99-py3-none-manylinux1_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cuda-cupti-cu11==11.7.101 (from torch==2.0.1)
  Downloading nvidia_cuda_cupti_cu11-11.7.101-py3-none-manylinux1_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu11==8.5.0.96 (from torch==2.0.1)
  Downloading nvidia_cudnn_cu11-8.5.0.96-2-py3-none-manylinux1_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu11==11.10.3.66 (from torch==2.0.1)
  Downloading nvidia_cublas_cu11-11.10.3.66-py3-none-manylinux1_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cufft-cu11==10.9.0.58 (from torch==2.0.1)
  Downloading nvidia_cufft_cu11-10.9.0.58-py3-none

In [3]:
!pip install -U datasets


In [14]:
import json
import re
from pprint import pprint


import pandas as pd
import torch
from datasets import Dataset, load_dataset

from huggingface_hub import notebook_login
from peft import LoraConfig, PeftModel

from transformers import (
    AutoModelForCausalLM, 
    AutoTokenizer, 
    BitsAndBytesConfig,
    TrainingArguments)

from trl import SFTTrainer

DEVICE = "cuda:0" if torch.cuda.is_available() else "cpu"

MODEL_NAME = "meta-llama/Llama-2-7b-hf"


In [72]:
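import os

# Presumably a workaround for SSL certificate errors during download: an empty
# CURL_CA_BUNDLE is commonly used to make `requests` skip certificate verification.
# Avoid this on untrusted networks.
os.environ["CURL_CA_BUNDLE"] = ""
dataset = load_dataset("Salesforce/dialogstudio", "TweetSumm")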

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.


In [73]:
import json


def generate_text(data_point):
    """Build the conversation, summary, and training prompt for one dialog."""
    try:
        # The dialog metadata is stored as a JSON string.
        summaries = json.loads(data_point["original dialog info"])["summaries"][
            "abstractive_summaries"
        ]
        # Use the first abstractive summary, joining its sentences into one string.
        summary = " ".join(summaries[0])
        conversation_text = create_conversation_text(data_point)
        return {
            "conversation": conversation_text,
            "summary": summary,
            "text": generate_training_prompt(conversation_text, summary),
        }
    except Exception:
        # Fall back to an empty example if a record cannot be parsed.
        return {"conversation": "", "summary": "", "text": ""}

DEFAULT_SYSTEM_PROMPT = """
Below is a conversation between a human and an AI agent. Write a summary of the conversation.
""".strip()
 
 
def generate_training_prompt(
    conversation: str, summary: str, system_prompt: str = DEFAULT_SYSTEM_PROMPT
) -> str:
    return f"""### Instruction: {system_prompt}
 
### Input:
{conversation.strip()}
 
### Response:
{summary}
""".strip()

In [74]:
def create_conversation_text(data_point):
    text = ""
    for item in data_point["log"]:
        user = clean_text(item["user utterance"])
        text += f"user: {user.strip()}\n"
 
        agent = clean_text(item["system response"])
        text += f"agent: {agent.strip()}\n"
 
    return text
 
 
def clean_text(text):
    text = re.sub(r"http\S+", "", text)  # remove URLs
    text = re.sub(r"@[^\s]+", "", text)  # remove @mentions
    text = re.sub(r"\s+", " ", text)  # collapse runs of whitespace
    return re.sub(r"\^[^ ]+", "", text)  # remove tokens that start with "^"

def process_dataset(data: Dataset):
    return (
        data.shuffle(seed=42)
        .map(generate_text)
        .remove_columns(
            [
                "original dialog id",
                "new dialog id",
                "dialog index",
                "original dialog info",
                "log",
                "prompt",
            ]
        )
    )
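
To make the cleaning rules concrete, here is a small standalone sketch that copies the regular expressions from `clean_text` and applies them to a made-up tweet. The sample string is hypothetical and not taken from TweetSumm.

```python
import re


def clean_text(text: str) -> str:
    """Standalone copy of the notebook's cleaning rules, for illustration."""
    text = re.sub(r"http\S+", "", text)  # remove URLs
    text = re.sub(r"@[^\s]+", "", text)  # remove @mentions
    text = re.sub(r"\s+", " ", text)     # collapse runs of whitespace
    return re.sub(r"\^[^ ]+", "", text)  # remove tokens that start with "^"


# Hypothetical tweet, not taken from the dataset.
sample = "@SpotifyCares hey, check https://t.co/abc123 please ^MT"
cleaned = clean_text(sample)

print(repr(cleaned))          # leading and trailing spaces are still present
print(repr(cleaned.strip()))  # create_conversation_text strips them afterwards
```

Because whitespace is collapsed before the `^` tokens are removed, the result can keep a stray leading or trailing space, which is why `create_conversation_text` calls `.strip()` on each utterance.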

In [75]:
dataset["train"] = process_dataset(dataset["train"])
dataset["validation"] = process_dataset(dataset["validation"])

In [63]:
dataset["train"]

Dataset({
    features: ['conversation', 'summary', 'text'],
    num_rows: 879
})
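
Each row's `text` field holds the full training prompt, including the reference summary under `### Response:`. At inference time the same section headers can be reused with the response left open for the model to complete. The sketch below assumes a hypothetical helper, `generate_inference_prompt`, that is not part of the notebook; the whitespace between sections may need to match the training prompt exactly.

```python
DEFAULT_SYSTEM_PROMPT = (
    "Below is a conversation between a human and an AI agent. "
    "Write a summary of the conversation."
)


def generate_inference_prompt(
    conversation: str, system_prompt: str = DEFAULT_SYSTEM_PROMPT
) -> str:
    """Hypothetical helper: training-prompt headers with the response left open."""
    return f"""### Instruction: {system_prompt}

### Input:
{conversation.strip()}

### Response:
""".strip()


# Made-up conversation in the "user: ... / agent: ..." format used for training.
prompt = generate_inference_prompt(
    "user: my app keeps crashing\nagent: which version are you on?\n"
)
print(prompt)
```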

In [28]:
notebook_login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

In [35]:
# Overrides the MODEL_NAME set earlier; this is the checkpoint loaded below.
MODEL_NAME = "NousResearch/Llama-2-7b-chat-hf"

In [36]:
def create_model_and_tokenizer():
    # Load the weights in 4-bit NF4 and run computations in float16.
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,
    )

    model = AutoModelForCausalLM.from_pretrained(
        MODEL_NAME,
        use_safetensors=True,
        quantization_config=bnb_config,
        trust_remote_code=True,
        device_map="auto",
    )

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    # Use EOS as the padding token and pad on the right.
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.padding_side = "right"

    return model, tokenizer

In [37]:
model, tokenizer = create_model_and_tokenizer()
model.config.use_cache = False

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]



In [38]:
model.config.quantization_config.to_dict()

{'quant_method': <QuantizationMethod.BITS_AND_BYTES: 'bitsandbytes'>,
 'load_in_8bit': False,
 'load_in_4bit': True,
 'llm_int8_threshold': 6.0,
 'llm_int8_skip_modules': None,
 'llm_int8_enable_fp32_cpu_offload': False,
 'llm_int8_has_fp16_weight': False,
 'bnb_4bit_quant_type': 'nf4',
 'bnb_4bit_use_double_quant': False,
 'bnb_4bit_compute_dtype': 'float16'}
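
As a rough back-of-the-envelope check on why 4-bit loading matters on a 16 GB T4, the sketch below estimates the memory needed for the weights alone. The parameter count (about 6.74 billion) is an assumption about Llama 2 7B, and the estimate ignores quantization constants, any layers kept in higher precision, activations, and optimizer state.

```python
# Assumed parameter count for Llama 2 7B (approximate, not stated in the notebook).
N_PARAMS = 6.74e9


def weight_memory_gib(n_params: float, bits_per_param: float) -> float:
    """Memory needed to hold the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 2**30


fp16_gib = weight_memory_gib(N_PARAMS, 16)
nf4_gib = weight_memory_gib(N_PARAMS, 4)

print(f"float16 weights: {fp16_gib:.1f} GiB")
print(f"4-bit weights:   {nf4_gib:.1f} GiB")
```

Under these assumptions the float16 weights alone would take most of the card's memory, while the 4-bit weights leave room for activations and the LoRA optimizer state.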

In [70]:
for example in dataset['validation']:
    if 'text' not in example.keys():
        print(example)

{'original dialog id': 'caae83a2ed59e4959d814ea567980226', 'new dialog id': 'TweetSumm--val--1', 'dialog index': 1, 'original dialog info': '{"summaries": {"extractive_summaries": [[{"is_agent": false, "sentences": ["@SpotifyCares hey, any explanation why the \\"Create similar playlist\\" function doesn\'t work anymore for me?"]}, {"is_agent": false, "sentences": ["@SpotifyCares i tried and it\'s still the same... moreover, my song history is always empty, so I can\'t find songs from previous Discover playlists :("]}, {"is_agent": true, "sentences": ["@160485 Could you DM us your account\'s email address or username?"]}, {"is_agent": true, "sentences": ["We\'ll take a look backstage /MT https://t.co/ldFdZRiNAt"]}], [{"is_agent": false, "sentences": ["@SpotifyCares hey, any explanation why the \\"Create similar playlist\\" function doesn\'t work anymore for me?"]}, {"is_agent": true, "sentences": ["@160485 Hi there, the cavalry\'s here!"]}, {"is_agent": true, "sentences": ["Does logging

In [64]:
lora_r = 16
lora_alpha = 64
lora_dropout = 0.1
lora_target_modules = [
    "q_proj",
    "up_proj",
    "o_proj",
    "k_proj",
    "down_proj",
    "gate_proj",
    "v_proj",
]
 
 
peft_config = LoraConfig(
    r=lora_r,
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    target_modules=lora_target_modules,
    bias="none",
    task_type="CAUSAL_LM",
)
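
To get a feel for how small the LoRA update is, the following sketch counts the adapter parameters this configuration would add. The layer shapes (hidden size 4096, MLP size 11008, 32 decoder layers) are assumptions about Llama 2 7B and are not stated in the notebook; each adapted linear layer gets two low-rank matrices with `r * (in_features + out_features)` parameters in total.

```python
# Assumed Llama 2 7B shapes (not stated in the notebook).
HIDDEN, INTERMEDIATE, N_LAYERS = 4096, 11008, 32
LORA_R, LORA_ALPHA = 16, 64

# (in_features, out_features) of each targeted projection in one decoder layer.
shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, HIDDEN),
    "v_proj": (HIDDEN, HIDDEN),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features: int, out_features: int, r: int) -> int:
    """Parameters in the A (r x in) and B (out x r) adapter matrices."""
    return r * (in_features + out_features)


per_layer = sum(lora_params(i, o, LORA_R) for i, o in shapes.values())
total = per_layer * N_LAYERS
scaling = LORA_ALPHA / LORA_R  # factor applied to the low-rank update

print(f"adapter parameters per layer: {per_layer:,}")
print(f"adapter parameters in total:  {total:,}")
print(f"LoRA scaling (alpha / r):     {scaling}")
```

With these assumed shapes the adapters come to roughly 40 million trainable parameters, well under 1% of a 7B-parameter model.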

In [41]:
OUTPUT_DIR = "experiments"
 
%load_ext tensorboard
%tensorboard --logdir experiments/runs

/usr/bin/bash: load_ext: command not found


TensorFlow installation not found - running with reduced feature set.

NOTE: Using experimental fast data loading logic. To disable, pass
    "--load_fast=false" and report issues on GitHub. More details:
    https://github.com/tensorflow/tensorboard/issues/4784

Serving TensorBoard on localhost; to expose to the network, use a proxy or pass --bind_all
TensorBoard 2.15.1 at http://localhost:6006/ (Press CTRL+C to quit)


In [76]:
training_arguments = TrainingArguments(
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,  # effective batch size of 16 per device
    optim="paged_adamw_32bit",
    logging_steps=1,
    learning_rate=1e-4,
    fp16=True,  # mixed-precision training
    max_grad_norm=0.3,
    num_train_epochs=2,
    evaluation_strategy="steps",
    eval_steps=0.2,  # a float below 1 is read as a fraction of total training steps
    warmup_ratio=0.05,
    save_strategy="epoch",
    group_by_length=True,  # batch samples of similar length to reduce padding
    output_dir=OUTPUT_DIR,
    report_to="tensorboard",
    save_safetensors=True,
    lr_scheduler_type="cosine",
    seed=42,
)

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=4096,
    tokenizer=tokenizer,
    args=training_arguments,
)



In [77]:
trainer.train()

You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 22   | 2.0161        | 2.034687        |
| 44   | 1.8952        | 1.936691        |
| 66   | 1.7139        | 1.910567        |
| 88   | 1.8163        | 1.899833        |
| 110  | 1.6952        | 1.897427        |


TrainOutput(global_step=110, training_loss=1.9881060524420304, metrics={'train_runtime': 2723.5677, 'train_samples_per_second': 0.645, 'train_steps_per_second': 0.04, 'total_flos': 1.2937277221871616e+16, 'train_loss': 1.9881060524420304, 'epoch': 2.0})
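
The step counts in the log can be reproduced from the training arguments and the 879 training rows. The sketch below assumes a single GPU and mirrors the rounding that `Trainer` appears to apply (batches per epoch rounded up, optimizer updates per epoch rounded down); treat it as a sanity check rather than a description of the library's internals.

```python
import math

num_rows = 879          # size of dataset["train"]
per_device_batch = 4    # per_device_train_batch_size
grad_accum = 4          # gradient_accumulation_steps
epochs = 2              # num_train_epochs
eval_fraction = 0.2     # eval_steps given as a fraction of total steps

effective_batch = per_device_batch * grad_accum             # samples per optimizer update
batches_per_epoch = math.ceil(num_rows / per_device_batch)  # dataloader length
updates_per_epoch = batches_per_epoch // grad_accum         # optimizer updates per epoch
total_steps = updates_per_epoch * epochs
eval_every = math.ceil(total_steps * eval_fraction)
eval_points = list(range(eval_every, total_steps + 1, eval_every))

print(f"effective batch size: {effective_batch}")
print(f"total optimizer steps: {total_steps}")
print(f"evaluation at steps: {eval_points}")
```

The result matches the run above: 110 optimizer steps in total, with evaluation every 22 steps.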