# Fine-Tune a GPT-2 Model

Open Questions:
- Is it useful to add a \<bot> marker as a preprocessing step? -> Yes.
- Should one batch contain exactly one dialog?
- Currently the dialogs are mixed: consecutive utterances are paired as question and answer, even across dialog boundaries.
    - How to fix this?
    - Via the batches?
- Remove the entries that start with a bot answer?<br>
    From:<br>
    "\<start> Create me a unique interactive story to calm with the topic: Ocean. \<bot>:Ah, the ocean... \<end>",<br>
    "\<start> Ah, the ocean. Such a ... \<end>",<br>
    "\<start> Yes, I can feel it..."<br>
    <br>
    to:<br>
    "\<start> Create me a unique interactive story to calm with the topic: Ocean. \<bot>:Ah, the ocean... \<end>",<br>
    "\<start> Yes, I can feel it..."<br>

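One possible answer to the pairing questions above is to build the pairs inside each dialog and never across two dialogs. The following is only a sketch under assumptions: `pair_turns` and its `step` parameter are hypothetical names, and `conversations` is assumed to be a list of utterance lists like the one collected in `Dialog_Data`.

```python
def pair_turns(conversations, step=1):
    """Pair each utterance with its successor, staying inside one dialog.

    step=1 keeps every consecutive pair; step=2 keeps only the pairs that
    start on an even position, which drops entries starting with a bot answer.
    """
    examples = []
    for dialog in conversations:
        # stop one short of the end: the last turn can only be a reply
        for idx in range(0, len(dialog) - 1, step):
            examples.append(f"<start> {dialog[idx]} <bot>:{dialog[idx + 1]} <end>")
    return examples


conversations = [
    ["Topic: Ocean", "Ah, the ocean...", "Yes, I can feel it...", "Good."],
    ["Topic: Forest", "Welcome to the forest."],
]
for example in pair_turns(conversations, step=2):
    print(example)
```

With `step=2` only the user-to-bot pairs remain, which matches the "to:" variant sketched in the question list.
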

1. **Dialog-based Approach:**
   - **One Batch, One Dialog:**
     - Treat each dialog as a separate training example. This allows the model to learn the context and flow of individual conversations.
     - Helps the model focus on capturing the nuances of each conversation independently.
     - Useful if your storytelling involves short, distinct dialogs.

   - **Inclusion of the Past:**
     - You can include the past history within each dialog example. Concatenate the previous turns in the conversation to provide context.
     - This helps the model understand the context and continuity of the ongoing dialog.
     - Be mindful of the token limit: GPT-2's context window is 1024 tokens, and longer sequences get truncated.

2. **Memory and Context:**
   - GPT-2 has a fixed context window of 1024 tokens. If the conversations are longer, relevant information from earlier turns is lost.
   - Consider balancing the length of your input sequences to ensure the model can capture essential details.

3. **Dynamic Context Window:**
   - Instead of a fixed history length, you could use a sliding window approach.
   - Maintain a dynamic context window that moves along the conversation, incorporating the most recent interactions.

4. **Experiment and Evaluate:**
   - It's often beneficial to experiment with different approaches to see what works best for your specific use case.
   - Conduct thorough evaluations using validation data to ensure the model is learning effectively and providing desired responses.

5. **Training Strategies:**
   - Experiment with hyperparameters like learning rate, batch size, and the number of training epochs to fine-tune the model effectively.
   - Monitor the model's performance on both training and validation sets.
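
The sliding-window idea from the list above can be sketched as follows. This is a minimal illustration, not the notebook's method: `sliding_context` is a hypothetical helper, and the default `count_tokens` is a whitespace split that only stands in for a real tokenizer.

```python
def sliding_context(turns, max_tokens, count_tokens=lambda text: len(text.split())):
    """Return the most recent turns whose combined token count fits max_tokens.

    The newest turn is always kept, even if it alone exceeds the budget;
    truncating that single turn is left to the tokenizer.
    """
    kept = []
    used = 0
    # walk backwards so the newest turns are considered first
    for turn in reversed(turns):
        cost = count_tokens(turn)
        if kept and used + cost > max_tokens:
            break
        kept.append(turn)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


history = ["Tell me a story.", "Ah, the ocean...", "Yes, I can feel it."]
print(sliding_context(history, max_tokens=8))
```

With the GPT-2 tokenizer, `count_tokens` could be something like `lambda t: len(tokenizer(t)["input_ids"])`, and `max_tokens` would stay below 1024 to leave room for the generated reply.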

Preprocessing: handle tokenization, special tokens, and the context window.

Hint: Use the dialogs.txt file to train the model on Google Colab.

### Imports

In [2]:
!python --version

Python 3.9.1


In [4]:
#!python -m pip install torch
#!python -m pip install transformers

In [1]:
#from transformers import GPT2Tokenizer, GPT2LMHeadModel
import pandas as pd
import os

import json

import transformers
import torch
from torch.utils.data import DataLoader, Dataset
from torch.optim import Adam

### Load and Prepare the data

In [2]:
MODEL_PATH = "./model/model.pth"
MODEL_WEIGHT_PATH = "./model/model_weights.pth"
ONNX_PATH = "./model/model.onnx"
# ".pt", ".pth", ".pkl", or ".h5"

class Dialog_Data(Dataset):
    
    def __init__(self, tokenizer, data_dir_path="./data", read_one_file=False):
        self.tokenizer = tokenizer
        self.data_dir_path = data_dir_path
        self.read_data(data_dir_path, read_one_file)

    def read_data(self, data_dir_path, read_one_file, should_save_as_one_file=True):
        data = []
        conversations = []
        if read_one_file:
            with open("./dialogs.txt", "r") as f:
                raw = f.read()
            for dialog in raw.split("#/"):
                cur_conversation = []
                for sentence in dialog.split(";"):
                    data += [sentence]
                    cur_conversation += [sentence]
                conversations += [(cur_conversation)]
        else:
            for dialog in os.listdir(self.data_dir_path):
                    with open(f"{self.data_dir_path}/{dialog}", "r") as f:
                        cur_conversation = []
                        for idx, line in enumerate(f.read().split("\n")):
                            content = ":".join(line.split(":")[1:]).strip()
                            if len(content) > 0:
                                if idx == 0:
                                    data += [f"Create me a unique interactive story to calm with the topic: {content}"]
                                    cur_conversation += [f"Create me a unique interactive story to calm with the topic: {content}"]
                                else:
                                    data += [content]
                                    cur_conversation += [content]
                    conversations += [(cur_conversation)]
            if should_save_as_one_file:
                save_data = ""
                for idx_1, dialog in enumerate(conversations):
                    if idx_1 > 0:
                        save_data += "#/"

                    for idx_2, elem in enumerate(dialog):
                        if idx_2 == 0:
                            save_data += f"{elem}"
                        else:
                            save_data += f";{elem}"
                # write the combined file once, after all dialogs have been joined
                with open("./dialogs.txt", "w") as f:
                    f.write(save_data)

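        # add markers: pair every entry with the entry that follows it
        # NOTE: data is a flat list, so the last line of one dialog is also
        # paired with the first line of the next dialog
        for idx in range(0, len(data)-1):    # the last element has no successor and is dropped below
            data[idx] = f"<start> {data[idx]} <bot>:{data[idx+1]} <end>"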

        self.conversations = conversations
        self.data = data[:-1]
        self.encoded_data = self.tokenizer(self.data, truncation=True)

        self.input_ids = self.encoded_data['input_ids']
        self.attention_mask = self.encoded_data['attention_mask']
 
    def __len__(self):
        return len(self.data)
 
    def __getitem__(self, idx):
        return (self.input_ids[idx], self.attention_mask[idx])
        # conversation = self.conversations[idx]
        # inputs = self.tokenizer.encode(conversation, max_length=self.max_length, truncation=True, return_tensors="pt")

        # return {
        #     "input_ids": inputs.flatten(),
        # }

In [3]:
tokenizer = transformers.AutoTokenizer.from_pretrained("gpt2")
tokenizer.add_special_tokens({  "pad_token": "<pad>",
                                "bos_token": "<start>",
                                "eos_token": "<end>"})
tokenizer.add_tokens(["<bot>:"])

1

In [4]:
data = Dialog_Data(tokenizer=tokenizer, read_one_file=False)
data = DataLoader(data, batch_size=1)
# TODO: try a larger batch size (e.g. 64 or 128); this requires padding the sequences to equal length
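
A larger batch size only works if all sequences in a batch have the same length. One way to get there is a custom `collate_fn` that pads each batch to its longest item. The sketch below is an assumption, not part of the notebook: `pad_batch` and `make_collate_fn` are hypothetical helpers that expect the `(input_ids, attention_mask)` pairs returned by `Dialog_Data.__getitem__`.

```python
def pad_batch(batch, pad_id):
    """Right-pad (input_ids, attention_mask) pairs to the longest item in the batch."""
    longest = max(len(ids) for ids, _ in batch)
    padded_ids, padded_mask = [], []
    for ids, mask in batch:
        missing = longest - len(ids)
        padded_ids.append(list(ids) + [pad_id] * missing)
        # padding positions get mask 0 so that attention ignores them
        padded_mask.append(list(mask) + [0] * missing)
    return padded_ids, padded_mask


def make_collate_fn(pad_id):
    """Build a collate_fn for DataLoader that returns two tensors of shape [batch, seq_len]."""
    import torch  # imported lazily so pad_batch stays usable without torch

    def collate(batch):
        ids, mask = pad_batch(batch, pad_id)
        return torch.tensor(ids), torch.tensor(mask)

    return collate


ids, mask = pad_batch([([5, 6, 7], [1, 1, 1]), ([8], [1])], pad_id=0)
print(ids)
print(mask)
```

It could be plugged in as `DataLoader(dataset, batch_size=64, collate_fn=make_collate_fn(tokenizer.pad_token_id))`. Batches then arrive as tensors instead of the default collate output, so code that consumes the batches has to expect that shape. When training on padded batches, the label positions that hold padding are usually set to `-100` so that the loss skips them.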

In [6]:
# Test saved dialogs in one file
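RUN_CHECK = False  # set to True to compare the per-file data with dialogs.txt
if RUN_CHECK:
    with open("./dialogs.txt", "r") as f:
        dialogs = f.read()
    print(f"Dialogs amount: {len(os.listdir('./data'))}")
    # dialogs are separated by "#/" in dialogs.txt
    print(f"In one file dialogs amount: {len(dialogs.split('#/'))}")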

Dialogs amount: 3617
In one file dialogs amount: 3617


In [26]:
counter = 0
for i, a in data:
    counter += 1
    if counter < 2:
        print(i)
        print(a)
    
counter

[tensor([50258]), tensor([13610]), tensor([502]), tensor([257]), tensor([3748]), tensor([14333]), tensor([1621]), tensor([284]), tensor([9480]), tensor([351]), tensor([262]), tensor([7243]), tensor([25]), tensor([10692]), tensor([13]), tensor([220]), tensor([50260]), tensor([10910]), tensor([11]), tensor([262]), tensor([9151]), tensor([13]), tensor([8013]), tensor([257]), tensor([5909]), tensor([11]), tensor([384]), tensor([25924]), tensor([1295]), tensor([13]), tensor([13872]), tensor([534]), tensor([2951]), tensor([329]), tensor([257]), tensor([2589]), tensor([290]), tensor([1011]), tensor([257]), tensor([2769]), tensor([8033]), tensor([11]), tensor([34140]), tensor([262]), tensor([36021]), tensor([21212]), tensor([286]), tensor([262]), tensor([5417]), tensor([13]), tensor([2735]), tensor([11]), tensor([4286]), tensor([3511]), tensor([5055]), tensor([319]), tensor([257]), tensor([4950]), tensor([44039]), tensor([10481]), tensor([11]), tensor([262]), tensor([23125]), tensor([286]), te

28912

### Load pretrained model

In [36]:
model = transformers.GPT2LMHeadModel.from_pretrained("gpt2")
model.resize_token_embeddings(len(tokenizer))

Embedding(50261, 768)

In [9]:
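config = transformers.GPT2Config.from_pretrained("gpt2")
config.do_sample = config.task_specific_params['text-generation']['do_sample']
config.max_length = config.task_specific_params['text-generation']['max_length']
model = transformers.GPT2LMHeadModel.from_pretrained("gpt2", config=config)
# reloading replaces the resized model from the previous cell, so make room for the added tokens again
_ = model.resize_token_embeddings(len(tokenizer))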

### First test

In [12]:
prompt = "Create me an interactive story to calm me down."
encoded = tokenizer.encode(prompt, return_tensors="pt")
print(f"Encoding: {encoded}\n")
res = tokenizer.decode(model.generate(encoded)[0])
print(f"Prompt:\n{prompt}\n\nResult:\n{res}")

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


"Create me an interactive story to calm me down.\n\nI'm an experienced writer and I've been writing for a long time and it's something to take a deep breath in. You always say you can't write your own dialogue without having done"

### Fine Tune Model

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu")

In [None]:
model.to(device)

optimizer = Adam(model.parameters(), lr=1e-4)
epochs = 10

loss_hist = []
steps = 0

for cur_epoch in range(0, epochs):
    model.train()
    for input_ids, attention_mask in data:
        # the default collate function yields a list of [batch]-shaped tensors
        # (one per token position); stack them into [batch, seq_len]
        input_ids = torch.stack(input_ids, dim=1).to(device)
        attention_mask = torch.stack(attention_mask, dim=1).to(device)
        optimizer.zero_grad()
        # no manual label shift needed: GPT2LMHeadModel shifts the labels internally
        loss = model(input_ids, attention_mask=attention_mask, labels=input_ids).loss
        loss_hist += [loss.item()]
        loss.backward()
        optimizer.step()
        steps += 1
    torch.save(model.state_dict(), "./model_state.pt")
    print(f'Epoch {cur_epoch+1}/{epochs}, Training Loss: {loss.item():.4f}, Steps: {steps}')

# plot loss
import matplotlib.pyplot as plt

fig, ax = plt.subplots(1, 1, figsize=(16, 8))
ax.plot(range(len(loss_hist)), loss_hist, label='Loss')
ax.set_xlabel('Training step')
ax.set_ylabel('Loss (cross-entropy)')
ax.set_title('Loss over time')
ax.legend()
ax.grid()
plt.show()
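
With a batch size of 1 the per-step loss is noisy, so a smoothed curve is often easier to read. The loss returned by `GPT2LMHeadModel` is a mean token cross-entropy, so its exponential can be read as a perplexity. The helpers below are a hypothetical sketch (`moving_average` and `perplexity` are not part of the notebook) and work on any list of floats such as `loss_hist`.

```python
import math


def moving_average(values, window):
    """Average each value with up to window-1 of its predecessors."""
    smoothed = []
    running = 0.0
    for i, value in enumerate(values):
        running += value
        if i >= window:
            # drop the value that just left the window
            running -= values[i - window]
        smoothed.append(running / min(i + 1, window))
    return smoothed


def perplexity(losses):
    """Exponential of the mean cross-entropy loss."""
    return math.exp(sum(losses) / len(losses))


example_losses = [4.0, 3.0, 3.5, 2.5]  # made-up values for illustration only
print(moving_average(example_losses, window=2))
print(perplexity(example_losses))
```

For the plot, `moving_average(loss_hist, window=100)` could be drawn next to the raw curve; the window size is an arbitrary choice.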

### Save model

-> Probably save the model in an extra repository/branch and provide it as a Python module<br>
-> Is the model very big?

save only weights

In [None]:
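os.makedirs(os.path.dirname(MODEL_WEIGHT_PATH), exist_ok=True)  # torch.save does not create missing folders
torch.save(model.state_dict(), MODEL_WEIGHT_PATH)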

# loading
# config = transformers.GPT2Config.from_pretrained("gpt2")
# config.do_sample = config.task_specific_params['text-generation']['do_sample']
# config.max_length = config.task_specific_params['text-generation']['max_length']
# model = transformers.GPT2LMHeadModel.from_pretrained("gpt2", config=config)
# model.resize_token_embeddings(len(tokenizer))

# model.load_state_dict(torch.load(MODEL_WEIGHT_PATH))
# model.eval()

save whole model

In [None]:
torch.save(model, MODEL_PATH)

# loading
# model = torch.load(MODEL_PATH)
# model.eval()

save as ONNX

see here -> https://onnxruntime.ai/docs/get-started/with-python.html<br>
or here -> https://pytorch.org/tutorials/beginner/onnx/export_simple_model_to_onnx_tutorial.html

In [32]:
def get_example_data():
    for i, a in data:
        return i, a
i, a = get_example_data()

# or:

# text = "Text from the news article"
# text = torch.tensor(text_pipeline(text))
# offsets = torch.tensor([0])

In [None]:
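# the DataLoader yields lists of [batch]-shaped tensors; stack them into [batch, seq_len]
example_ids = torch.stack(i, dim=1).to(model.device)
example_mask = torch.stack(a, dim=1).to(model.device)

torch.onnx.export(model,                     # model being run
                  (example_ids, {"attention_mask": example_mask}),  # positional input plus keyword inputs as a trailing dict
                  ONNX_PATH,                 # where to save the model (can be a file or file-like object)
                  export_params=True,        # store the trained parameter weights inside the model file
                  opset_version=10,          # the ONNX opset version to target (GPT-2 may need a newer opset)
                  do_constant_folding=True,  # whether to execute constant folding for optimization
                  input_names = ['input_ids', 'attention_mask'],  # the model's input names
                  output_names = ['output']  # the model's output names
                    )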

# loading
# import onnx

# onnx_model = onnx.load(ONNX_PATH)
# onnx.checker.check_model(onnx_model)

# import onnxruntime as ort
# import numpy as np
# NOTE: the lines below appear to come from an AG News classification example
# and still have to be adapted to this model (input names, post-processing)
# ort_sess = ort.InferenceSession(ONNX_PATH)
# outputs = ort_sess.run(None, {'input': text.numpy(),
#                             'offsets':  torch.tensor([0]).numpy()})
# # Print Result
# result = outputs[0].argmax(axis=1)+1
# print("This is a %s news" %ag_news_label[result[0]])

### Use the model

In [None]:
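def inference(prompt:str, tokenizer):
    # wrap the prompt in the same markers that were used for the training data
    prompt = f"<start> {prompt} <bot>:"
    encoded = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(encoded["input_ids"], attention_mask=encoded["attention_mask"])
    return tokenizer.decode(output[0])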

In [None]:
def infer(inp):
    # use the same markers as in the training data
    inp = f"<start> {inp} <bot>:"
    inp = tokenizer(inp, return_tensors="pt")
    X = inp["input_ids"].to(device)  # move the tensors to the same device as the model
    a = inp["attention_mask"].to(device)

    output = model.generate(X, attention_mask=a, max_length=100, num_return_sequences=1)

    output = tokenizer.decode(output[0])

    return output