This notebook fine-tunes the VAGOsolutions/SauerkrautLM-Gemma-7b model on my WhatsApp chat history with my girlfriend using LoRA. The fine-tuned model is then used to predict my responses to input prompts.
For privacy reasons, the output of all cells containing sensitive data has been deleted. This notebook is purely for educational purposes, and the results will not be optimal for two reasons: the model was only a few days old at the time of writing and still does not work very well (this will probably improve over the coming weeks), and I trained for only one epoch to minimise computation cost.

# Prepare dataframe
Because the chat history is long, the messages are grouped into conversations. A message sent more than two hours after the previous one starts a new conversation, and each conversation is stored in its own row.
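
The two-hour rule can be tried out on a handful of made-up timestamps. This is a minimal standard-library sketch, not the pandas code used in this notebook; the function name `assign_groups` and the sample times are assumptions for illustration.

```python
from datetime import datetime, timedelta

# Same timestamp layout as in the exported chat, e.g. "24.02.24, 18:05:12"
TS_FORMAT = "%d.%m.%y, %H:%M:%S"

def assign_groups(timestamps, max_gap=timedelta(hours=2)):
    """Return one conversation index per timestamp.

    A message starts a new conversation when it arrives more than
    `max_gap` after the previous message.
    """
    groups = []
    current = 0
    previous = None
    for ts in timestamps:
        t = datetime.strptime(ts, TS_FORMAT)
        if previous is not None and t - previous > max_gap:
            current += 1
        groups.append(current)
        previous = t
    return groups

# Hypothetical timestamps, not taken from the real chat
sample = [
    "24.02.24, 18:05:12",
    "24.02.24, 18:06:40",
    "24.02.24, 20:30:00",  # more than two hours after 18:06:40: new group
    "24.02.24, 22:30:00",  # exactly two hours later: stays in the same group
    "25.02.24, 09:00:00",  # next morning: new group
]
groups = assign_groups(sample)
print(groups)
```

Note that the comparison is strict, so a gap of exactly two hours does not start a new conversation.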

In [1]:
import re
import pandas as pd

# Regular expression that captures the timestamp and the rest of the line
pattern = r"\[(\d{2}\.\d{2}\.\d{2}, \d{2}:\d{2}:\d{2})\] (.*)"

times = []
messages = []

with open("clean_chat_with_ts.txt", "r", encoding="utf-8") as file:
    for line in file:
        match = re.match(pattern, line)
        if match:
            # Extract the timestamp and the rest of the line (sender name and message)
            times.append(match.group(1))
            messages.append(match.group(2))

# Create DataFrame
df = pd.DataFrame({"time": times, "message": messages})
df["time"] = pd.to_datetime(df["time"], format="%d.%m.%y, %H:%M:%S")
df.insert(1, "time_diff", df["time"].diff())
df.loc[0, "time_diff"] = pd.Timedelta(seconds=0)
df["group_nr"] = 0  # placeholder, filled in by the next cell
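
One thing worth knowing about the pattern in the cell above: it only matches lines that start with a bracketed timestamp, so any continuation line of a multi-line message is silently skipped by the `if match` check. The sketch below shows this on three made-up lines; the sample lines and the extra split into sender and text are assumptions for illustration.

```python
import re

# Same pattern as in the cell above: a bracketed timestamp, then the rest of the line
pattern = r"\[(\d{2}\.\d{2}\.\d{2}, \d{2}:\d{2}:\d{2})\] (.*)"

# Hypothetical lines in the assumed export layout
lines = [
    "[24.02.24, 18:05:12] Viliana: Hi, how was your day?",
    "[24.02.24, 18:06:40] Tobi: Good, thanks!",
    "This line continues the previous message",  # no timestamp, so no match
]

parsed = []
for line in lines:
    match = re.match(pattern, line)
    if match:
        timestamp, rest = match.groups()
        # Split the sender from the text at the first ": " only
        name, _, text = rest.partition(": ")
        parsed.append((timestamp, name, text))

for row in parsed:
    print(row)
```

Whether the dropped lines matter depends on how the chat file was cleaned beforehand. If multi-line messages are common, they could be appended to the previous message instead of being skipped.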

In [2]:
# Group messages by time: a message sent more than two hours after the
# preceding one starts a new group. The cumulative sum of the "gap" flags
# gives a running group number.
df["group_nr"] = (df["time_diff"] > pd.Timedelta(hours=2)).cumsum()

In [None]:
# Group by 'group_nr' and concatenate all messages in each group into a single string
new_df = df.groupby('group_nr')['message'].agg(lambda x: '\n'.join(x)).reset_index(name='Message')

# Drop the 'group_nr' column if it's no longer needed, leaving only 'Message'
new_df = new_df.drop(columns=['group_nr'])
print(new_df.loc[0, 'Message'])

## Optional: Transform DataFrame to the chat input format suggested by Hugging Face

In [None]:
# Function to transform a single message string into the desired format
def transform_message(row):
    system_message = {"role": "system", "content": "You are Tobi."}
    messages_list = [system_message]

    # Split the conversation into lines and map each speaker to a chat role
    for message in row.split('\n'):
        if message.startswith("Viliana:"):
            role = "user"
            # Remove only the leading name prefix, not later occurrences in the text
            content = message[len("Viliana:"):].strip()
        elif message.startswith("Tobi:"):
            role = "assistant"
            content = message[len("Tobi:"):].strip()
        else:
            continue  # Skip any messages that don't start with the expected prefixes

        messages_list.append({"role": role, "content": content})

    return {"messages": messages_list}

new_df['transformed'] = new_df['Message'].apply(transform_message)
print(new_df['transformed'].iloc[0])
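
In a chat, one person often sends several messages in a row, so the list built by `transform_message` can contain consecutive entries with the same role. Some chat templates expect strictly alternating user and assistant turns; whether the template shipped with this model does is not verified here. If it does, consecutive messages could be merged first. A minimal sketch, with `merge_consecutive_turns` as a hypothetical helper and a made-up conversation:

```python
def merge_consecutive_turns(messages):
    """Join consecutive messages with the same role into a single turn."""
    merged = []
    for message in messages:
        if merged and merged[-1]["role"] == message["role"]:
            # Same speaker as the previous entry: append the text on a new line
            merged[-1]["content"] += "\n" + message["content"]
        else:
            # Copy the dict so the input list is left unchanged
            merged.append(dict(message))
    return merged

# Hypothetical conversation in the format produced by transform_message
example = [
    {"role": "system", "content": "You are Tobi."},
    {"role": "user", "content": "Hi!"},
    {"role": "user", "content": "What are you up to?"},
    {"role": "assistant", "content": "Working."},
    {"role": "assistant", "content": "And you?"},
]
merged = merge_consecutive_turns(example)
for turn in merged:
    print(turn)
```

Joining with a newline keeps the individual messages recognisable inside one turn, which is close to how they appear in the raw chat.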

In [None]:
chat = new_df['transformed'].tolist()

In [None]:
chat[0]

# Model finetuning

In [None]:
!pip3 install -q -U bitsandbytes==0.42.0
!pip3 install -q -U peft==0.8.2
!pip3 install -q -U trl==0.7.10
!pip3 install -q -U accelerate==0.27.1
!pip3 install -q -U datasets==2.17.0
!pip3 install -q -U transformers==4.38.1
!pip3 install accelerate

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import os
from peft import LoraConfig
import transformers
from trl import SFTTrainer
from datasets import Dataset

# os.environ["WANDB_DISABLED"] = "true"

In [None]:
# Import base model
model_id = "VAGOsolutions/SauerkrautLM-Gemma-7b"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map={"":0})

In [None]:
# # use for optional chat format
# dataset = Dataset.from_dict({"chat": chat})
# dataset = dataset.map(lambda x: {"formatted_chat": tokenizer.apply_chat_template(x["chat"]["messages"], tokenize=False, add_generation_prompt=False)})
# print(dataset['formatted_chat'][0:10])

In [None]:
# Create a dataset from the pandas DataFrame (one conversation per row)
model_df = pd.DataFrame(new_df["Message"])
dataset = Dataset.from_pandas(model_df)
tokenizer.pad_token = tokenizer.eos_token
ds = dataset.map(lambda samples: tokenizer(samples["Message"], truncation=True, padding=True, max_length=512), batched=True)

Map:   0%|          | 0/501 [00:00<?, ? examples/s]

In [None]:
# Test model before finetuning
text = "Vili: Hi Tobi, was machst Du heute so?"
device = "cuda:0"
inputs = tokenizer(text, return_tensors="pt").to(device)

outputs = model.generate(**inputs, max_new_tokens=110)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Vili: Hi Tobi, was machst Du heute so?
Tobi: Ich bin gerade im Büro und arbeite.
Vili: Hast Du schon deine Mittagspause gemacht?
Tobi: Ja, gerade🤢
Vili: Was isst Du für Die Pause?
Tobi: Ich werde einfach einen Sandwich machen und vielleicht eine Tasse Tee...
Vili: Okay, das hört sich gut an ☕
Tobi: Danke, ich muss jetzt zurück ins Büro 🏃‍♂️
Vili: Alles Gute! 🙋‍♂️
Da ist Tobi gerade mit dem Mittagessen beschäftigt. Er hat V


In [None]:
print(model)

GemmaForCausalLM(
  (model): GemmaModel(
    (embed_tokens): Embedding(256000, 3072, padding_idx=0)
    (layers): ModuleList(
      (0-27): 28 x GemmaDecoderLayer(
        (self_attn): GemmaAttention(
          (q_proj): Linear4bit(in_features=3072, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=3072, out_features=4096, bias=False)
          (v_proj): Linear4bit(in_features=3072, out_features=4096, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=3072, bias=False)
          (rotary_emb): GemmaRotaryEmbedding()
        )
        (mlp): GemmaMLP(
          (gate_proj): Linear4bit(in_features=3072, out_features=24576, bias=False)
          (up_proj): Linear4bit(in_features=3072, out_features=24576, bias=False)
          (down_proj): Linear4bit(in_features=24576, out_features=3072, bias=False)
          (act_fn): GELUActivation()
        )
        (input_layernorm): GemmaRMSNorm()
        (post_attention_layernorm): GemmaRMSNorm()
      )
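
The layer shapes printed above allow a rough estimate of how much memory the quantized weights take. The sketch below counts the weights in the `Linear4bit` layers at 4 bits each and, as an assumption, the embedding matrix at 16 bits. It ignores quantization constants, the norm layers, activations, and optimizer state, so the real footprint will be higher.

```python
# Shapes copied from the print(model) output
hidden, attn, mlp = 3072, 4096, 24576
vocab, n_layers = 256000, 28

# Weights per decoder layer that are stored as Linear4bit
attention_weights = 3 * hidden * attn + attn * hidden  # q_proj, k_proj, v_proj, o_proj
mlp_weights = 2 * hidden * mlp + mlp * hidden          # gate_proj, up_proj, down_proj
quantized_weights = n_layers * (attention_weights + mlp_weights)

# The embedding matrix is not a Linear4bit layer
embedding_weights = vocab * hidden

# Assumption: 4 bits per quantized weight, 16 bits per embedding weight
quantized_gb = quantized_weights * 0.5 / 1e9
embedding_gb = embedding_weights * 2 / 1e9

print(f"quantized linear weights: {quantized_weights:,}")
print(f"embedding weights:        {embedding_weights:,}")
print(f"approx. weight memory:    {quantized_gb + embedding_gb:.1f} GB")
```

Under these assumptions the weights need roughly 5.4 GB, compared with about 17 GB for the same weights at 16 bits.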

In [None]:
# Apply LoRA adapters with rank 8 to all linear layers of the transformer blocks, as recommended in the QLoRA paper
lora_config = LoraConfig(
    r=8,
    target_modules=["q_proj", "o_proj", "k_proj", "v_proj", "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
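
With rank `r`, LoRA adds two small matrices to each targeted layer, one of shape `in_features x r` and one of shape `r x out_features`, so each layer contributes `r * (in_features + out_features)` trainable parameters. The following quick count uses the shapes from the `print(model)` output and assumes that only these adapter matrices are trained (no biases, no other modules).

```python
r = 8
n_layers = 28

# (in_features, out_features) of each targeted module, from the print(model) output
shapes = {
    "q_proj": (3072, 4096),
    "k_proj": (3072, 4096),
    "v_proj": (3072, 4096),
    "o_proj": (4096, 3072),
    "gate_proj": (3072, 24576),
    "up_proj": (3072, 24576),
    "down_proj": (24576, 3072),
}

# Each adapter consists of an (in x r) and an (r x out) matrix
lora_params_per_layer = sum(r * (n_in + n_out) for n_in, n_out in shapes.values())
lora_params = n_layers * lora_params_per_layer

print(f"{lora_params:,} trainable LoRA parameters")
```

That is about 25 million trainable parameters, a small fraction of the base model. PEFT models also expose `print_trainable_parameters()`, which should report the actual number once the adapters are attached.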

In [None]:
trainer = SFTTrainer(
    model=model,
    train_dataset=ds,
    dataset_text_field="Message",
    max_seq_length=512,
    args=transformers.TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=10,
        warmup_steps=4,
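        num_train_epochs=1,  # a single pass over all conversations
        learning_rate=2e-4,
        fp16=True,  # mixed precision; bf16=True may be a better match for the bfloat16 compute dtype on GPUs that support it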
        logging_steps=1,
        output_dir="outputs",
        optim="paged_adamw_8bit"
    ),
    peft_config=lora_config,
)

trainer.train()

| Step | Training Loss |
|------|---------------|
| 1    | 4.3676        |
| 2    | 5.0058        |
| 3    | 3.8926        |
| 4    | 4.8423        |
| 5    | 4.1221        |
| 6    | 4.0944        |
| 7    | 3.7601        |
| 8    | 3.1916        |
| 9    | 3.2988        |
| 10   | 3.2880        |


TrainOutput(global_step=25, training_loss=3.4238353538513184, metrics={'train_runtime': 120.5857, 'train_samples_per_second': 4.155, 'train_steps_per_second': 0.207, 'total_flos': 8157784853176320.0, 'train_loss': 3.4238353538513184, 'epoch': 1.0})
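
The reported `global_step=25` can be reproduced from the training arguments. The arithmetic below is a sketch that assumes a single GPU and that the trainer counts only full groups of accumulated mini-batches per epoch, which is consistent with the number above.

```python
n_examples = 501           # dataset size reported in the Map output
per_device_batch_size = 2
grad_accum_steps = 10
n_epochs = 1

# Assumption: one GPU, so the effective batch is batch size x accumulation steps
effective_batch = per_device_batch_size * grad_accum_steps

# Number of mini-batches per epoch (ceiling division, the last one may be smaller)
batches_per_epoch = -(-n_examples // per_device_batch_size)

# One optimizer step per full group of accumulated mini-batches
optimizer_steps = max(batches_per_epoch // grad_accum_steps, 1) * n_epochs

print(effective_batch, batches_per_epoch, optimizer_steps)
```

Each optimizer step therefore sees about 20 conversations, and with `warmup_steps=4` the learning rate warms up over 4 of the 25 steps.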

# Analysis

Judging by the responses below, the fine-tuned model adapts relatively well to our writing styles (though not to our grammar), but the content is pure bananas.

In [None]:
# Test model after finetuning
text = "Vili: Hi Tobi, was machst Du heute so?"
device = "cuda:0"
inputs = tokenizer(text, return_tensors="pt").to(device)

outputs = model.generate(**inputs, max_new_tokens=110)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Vili: Hi Tobi, was machst Du heute so? Wie war Dein Abend? :)
Tobi: Ich arbeite
Viliana: Oha okay, ich liebe Dich :)
Tobi: Danke:) wie war der Abend bei euch?
Tobi: Und wie hast Du Deinen Abend verbracht? :)
Tobi: Wie war das Essen?
Viliana: Wir waren bei Döner und dann Eis im Eiscafé :)
Viliana: Und das so delicious, Döner war mega gut und Eis auch super :)
Viliana: Aber nicht nur delicious, eher der Eintaler :)
Viliana


In [None]:
text = "Vili: Hi Tobi, was machst Du heute so? \n Tobi:"
device = "cuda:0"
inputs = tokenizer(text, return_tensors="pt").to(device)

outputs = model.generate(**inputs, max_new_tokens=110)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Vili: Hi Tobi, was machst Du heute so? 
 Tobi: Oh hi Viliana, ich geh jetzt aus 😂 mit dem Auto und freunde
Tobi: Hab grad die Bahn nach Köln genommen :)
Tobi: Freue mich drauf :) 
Tobi: Ich hoff, du gehst grad auch noch ins Bett 😂
Tobi: Aber was machst Du?
Tobi: Hast grad die Bahn genommen und will nach Hause, was machst du?
Viliana: Ich geh jetzt ins Bett, kann nicht schlafen gehen! 🤦🏼‍♀️
Viliana: Du weißt doch, ich bin ganz um die Ecke
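
Since the goal is to predict a single reply, the generated continuation could be trimmed to Tobi's first turn. A minimal sketch: the helper `extract_reply` and the sample text are hypothetical, and it assumes that every turn starts on a new line with a `Name:` prefix as in the training data.

```python
def extract_reply(continuation, speaker="Tobi", others=("Viliana", "Vili")):
    """Return the speaker's messages up to the first line from someone else."""
    reply_lines = []
    for line in continuation.splitlines():
        line = line.strip()
        if not line:
            continue
        if any(line.startswith(f"{name}:") for name in others):
            break  # the other person starts talking, so the reply ends here
        # Drop a leading "Tobi:" prefix if present
        if line.startswith(f"{speaker}:"):
            line = line[len(speaker) + 1:].strip()
        reply_lines.append(line)
    return "\n".join(reply_lines)

# Hypothetical continuation of a prompt ending in "Tobi:", not real model output
continuation = " Just working\nTobi: And you?\nViliana: Going to bed\nTobi: Sleep well"
reply = extract_reply(continuation)
print(reply)
```

To get only the continuation from `model.generate`, the prompt tokens can be sliced off before decoding, for example `tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)`.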
