# Finetuning a transformer model for a domain

*This notebook must be run on a GPU system with Cuda drivers.*

*For many applications, free GPU on [Colab](https://colab.research.google.com) may be enough*.

----

The aim here is to fine-tune an open-source pre-trained transformer model based on a set of text items that will attune it to perform more effectively on a specific domain. This means that we will be adapting the general capabilities of the model to better suit the needs and nuances of a particular field or area of expertise.

Maybe surprisingly, in spite of the huge size of most pre-trained models, the number of domain examples needed to fine-tune it can often be relatively small (see, for example, the widely cited paper on LLMs being ['few-shot learners'](https://arxiv.org/abs/2005.14165)). This is because the transformer models have a robust foundation of general language understanding, and the fine-tuning acts more like a focused nudge, refining what the model already knows, rather than starting training from scratch.

----

In [None]:
# check that we have cuda
!nvcc --version

Install `unsloth`. It makes finetuning of transformer models much less resource-intensive. Read more [here](https://huggingface.com/unsloth) and [here](https://unsloth.ai).

In [None]:
!pip install unsloth
# also get the latest build of unsloth from GitHub
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git

Import libraries. (The `torch` import will reveal if you have GPU in order).

In [2]:
from unsloth import FastLanguageModel
import torch
from datasets import load_dataset

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!


Check over on [unsloth's huggingface profile](https://huggingface.com/unsloth) which models they support. In this case we chose "`unsloth/mistral-7b-instruct-v0.2-bnb-4bit`".

We download the model. Some notes:

- Unsloth supports [RoPE scaling](https://paperswithcode.com/paper/roformer-enhanced-transformer-with-rotary), meaning that the `max_seq_length` can be set much higher than what the model was initially trained on (if one would want to).
- Set `dtype = None` for auto detection.
- Loading in `4bit` means that the model's weights are quantized to 4 bits, which significantly reduces the memory usage and computational requirements compared to using full precision (usually 16 or 32 bits).


In [39]:
#model_name =  "unsloth/Qwen2.5-0.5B-Instruct-bnb-4bit"
model_name = "unsloth/mistral-7b-instruct-v0.2-bnb-4bit"
max_seq_length = 2048

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = model_name,
    max_seq_length = max_seq_length,
    dtype = None,
    load_in_4bit = True
)

==((====))==  Unsloth 2024.11.10: Fast Mistral patching. Transformers:4.46.2.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu121. CUDA: 7.5. CUDA Toolkit: 12.1. Triton: 3.1.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


We now add [LoRA adapters](https://arxiv.org/abs/2106.09685) so we only need to update 1 to 10% of all parameters.

In [35]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth 2024.11.10 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.


**Finetuning data**: We need a csv file with one column, named 'text', where each row is domain text examples. The example here is 5,000 social media posts about climate change.

In [36]:
finetuning_file = "domain.csv"
dataset = load_dataset("csv", data_files=finetuning_file, split="train")

# Inspect the dataset
print(dataset)
print(dataset[0])  # View the first example

Dataset({
    features: ['text'],
    num_rows: 4999
})
{'text': 'BBC now just flagrantly blaming any bad weather on climate change.\\n\\nOutrageous.'}


In [37]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False,
    args = TrainingArguments(
        per_device_train_batch_size = 4,
        gradient_accumulation_steps = 2,
        warmup_steps = 50,
        max_steps = -1, # <-- for a full run
        learning_rate = 0.0000005,
        num_train_epochs = 1,
        lr_scheduler_type = "linear",
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 5,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        seed = 3407,
        output_dir = "fine_tuned_model",
        report_to = "none", # Use this for WandB etc
    ),
)

Map (num_proc=2):   0%|          | 0/4999 [00:00<?, ? examples/s]

During training (below) the `Training Loss` will be printed. Note that when fine-tuning for domain adaptation, as we are doing here, the goal may not be a drastic reduction in training loss. The fine-tuning process may still generate embeddings that align better with the target domain.

Especially, in the case of social media research where we often work with messy (and, in this case, also unlabeled) data, it is not necessarily a problem even if the training loss *increases* somewhat.

For tasks such as question-answering or classification, the drop in Training Loss during fine-tuning is more crucial.

In [38]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 4,999 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 4 | Gradient Accumulation steps = 2
\        /    Total batch size = 8 | Total steps = 625
 "-____-"     Number of trainable parameters = 41,943,040


Step,Training Loss
5,3.5632
10,3.5795
15,3.6719
20,3.5875
25,3.5786
30,3.4717
35,3.4919
40,3.4503
45,3.6034
50,3.3027


Save model to disk.

In [None]:
model.save_pretrained_merged("climate-mistral", tokenizer, save_method = "merged_4bit_forced",)


Unsloth: Merging 4bit and LoRA weights to 4bit...
This might take 5 minutes...
Done.
Unsloth: Saving tokenizer... Done.
Unsloth: Saving model... This might take 10 minutes for Llama-7b...

Push model to Huggingface Hub

In [None]:
model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_4bit_forced", token = "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").

### Use the fine-tuned model

Use the finetuned model for text summarization, and characterization.

We define the `big_chunk_of_text` to be summarized and described.

The `big_chunk_of_text` to be summarized and described must fit within the total token limit defined by `max_seq_length` earlier. This limit includes all tokens: the system message, user prompt, input text, and the generated output. Since the model generates additional tokens during summarization (controlled by max_new_tokens), a margin must be reserved for the output. Note that token counts differ from word counts, as tokenization can split a single word into multiple tokens.

In [None]:
# Using cop_key_clusters as test data, extract all tweets belonging to one of the clusters
df = pd.read_csv('cop_key_clusters.csv', on_bad_lines='skip', low_memory= False)
text_list = df[df['cluster'] == 799]['text'].tolist()
text_list = list(set(text_list)) # remove duplicates


In [None]:
# Define the content strings as variables
system_message = "You are a social science researcher skilled at qualitatively assessing the core meanings of text."
user_prompt = (
    "Summarize the following text with a particular focus on its key unifying themes, ideological stance, and sentiment. "
    "Your only output should be bullet points pertaining to that summary task. Here is the text to be summarized:\n"
    f"{big_chunk_of_text}"
)

In [None]:
# Prepare the summarization prompt with the variables
messages = [
    {
        "role": "system",
        "content": system_message,
    },
    {
        "role": "user",
        "content": user_prompt,
    },
]

# Generate the summary (from the rest of the code)
decoded_output = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0].strip()

# Dynamically construct the removal string
removal_string = f"system\n{system_message}\nuser\n{user_prompt}"

# Clean the output
clean_summary = decoded_output.replace(removal_string, "").strip().lstrip('assistant').strip()


# Print only the clean summary
print(clean_summary)

Key unifying themes:
- Nutritional value (tender meatballs)
- Cultural significance (traditional Swedish meatballs)
- Religious symbolism (meatball symbolism for Sweden)

Ideological stance:
- Anti-globalization
- Nordic nationalism

Sentiment:
- Appreciation for Swedish food traditions
- Joyful celebration of holiday festivities
- Gratitude towards Swedish heritage


In [None]:
import re
from transformers import AutoTokenizer, AutoModelForCausalLM

# Initialize the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("unsloth/Qwen2.5-0.5B-Instruct-bnb-4bit")
model = AutoModelForCausalLM.from_pretrained("unsloth/Qwen2.5-0.5B-Instruct-bnb-4bit")

# Define the content strings
system_message = "You are a social science researcher skilled at qualitatively assessing the core meanings of text."

# Function to prepare the prompt
def prepare_prompt(chunk):
    user_prompt = (
        "Summarize the following text with a particular focus on its key unifying themes, ideological stance, and sentiment. "
        "Your only output should be bullet points pertaining to that summary task. Here is the text to be summarized:\n"
        f"{chunk}"
    )
    return [
        {"role": "system", "content": system_message},
        {"role": "user", "content": user_prompt},
    ]

# Function to clean the output
def clean_output(decoded_output, user_prompt):
    removal_string = f"system\n{system_message}\nuser\n{user_prompt}"
    return decoded_output.replace(removal_string, "").strip().lstrip("assistant").strip()

# Function to generate summary for a chunk
def generate_summary(chunk):
    messages = prepare_prompt(chunk)
    inputs = tokenizer.apply_chat_template(
        messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
    ).to("cuda")
    outputs = model.generate(
        input_ids=inputs,
        max_new_tokens=100,  # Adjust as needed
        use_cache=True,
    )
    decoded_output = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
    return clean_output(decoded_output, messages[1]["content"])

# Function to chunk the text
def chunk_text(text, max_tokens):
    tokens = tokenizer(text, return_tensors="pt")["input_ids"][0]
    chunks = [tokens[i:i + max_tokens] for i in range(0, len(tokens), max_tokens)]
    return [tokenizer.decode(chunk, skip_special_tokens=True) for chunk in chunks]

# Process big chunk of text

max_seq_length = tokenizer.model_max_length  # Use tokenizer's max model length
# If RoPE scaling allows, manually set a higher limit
max_seq_length = 2048  # Adjust if required for your model and hardware
max_tokens_per_chunk = max_seq_length - 200  # Reserve space for prompt and output

chunks = chunk_text(big_chunk_of_text, max_tokens_per_chunk)

# Generate sub-summaries for each chunk
sub_summaries = [generate_summary(chunk) for chunk in chunks]

# Combine sub-summaries and generate overarching summary
combined_text = " ".join(sub_summaries)
final_summary = generate_summary(combined_text)

# Print the final summary
print("Final Summary:")
print(final_summary)

`low_cpu_mem_usage` was None, now default to True since model is quantized.


AttributeError: 'Qwen2ForCausalLM' object has no attribute 'max_seq_length'