## Prefix-Tuning GPT-2 Medium Model



### Import & Setup:

In [None]:
import os, sys

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
os.chdir('drive/MyDrive/cmpt413_proj')

In [None]:
!pip install -r requirements.txt

Collecting jupyter_contrib_nbextensions (from -r requirements.txt (line 6))
  Downloading jupyter_contrib_nbextensions-0.7.0.tar.gz (23.5 MB)
  Preparing metadata (setup.py) ... done
Collecting jupyter_nbextensions_configurator (from -r requirements.txt (line 7))
  Downloading jupyter_nbextensions_configurator-0.6.3-py2.py3-none-any.whl (466 kB)
Collecting datasets~=2.12.0 (from -r requirements.txt (line 10))
  Downloading datasets-2.12.0-py3-none-any.whl (474 kB)
Collecting sacrebleu (from -r requirements.txt (line 12))
  Downloading sacrebleu-2.3.3-py3-none-any.whl (106 kB)

In [None]:
import argparse, os, string, sys
import torch
import sacrebleu
from tqdm import tqdm
from pathlib import Path
from transformers import AutoModelForCausalLM, AutoTokenizer, default_data_collator, get_linear_schedule_with_warmup
from datasets import load_dataset
from torch.utils.data import DataLoader
from peft import PrefixTuningConfig, get_peft_model, TaskType
from transformers import GPT2Tokenizer, GPT2LMHeadModel
from peft import PeftModel
import torch.nn.functional as F
import pandas as pd

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

### Form & Consolidate Dataset

In [None]:
import pandas as pd

dfs = []

file_paths = ['data/quest_dataset_up_to_kalimdor.csv', 'data/quest_dataset_kalimdor_to_profession.csv','data/quest_dataset_profession_to_end.csv', 'data/quest_dataset_wotlk.csv', 'data/quest_dataset_classic.csv']
for file_path in file_paths:
  df = pd.read_csv(file_path)
  df.columns = ["id", "name", "objective", "description"]
  dfs.append(df)
combined_df = pd.concat(dfs, ignore_index=True)
df = combined_df
df.columns = ["id", "name", "objective", "description"]

df = df.drop(columns=['id'])

rows_to_remove = []

# U+FFFD (the Unicode replacement character) marks text that failed to decode
REPLACEMENT_CHAR = "\ufffd"

# Collect the index of every row whose 'objective' or 'description' contains U+FFFD
for index, row in df.iterrows():
    if REPLACEMENT_CHAR in row['objective'] or REPLACEMENT_CHAR in row['description']:
        print("row contains a replacement character: " + str(index))
        rows_to_remove.append(index)

# Remove the identified rows from the DataFrame
df_cleaned = df.drop(rows_to_remove)

df = df_cleaned

df = df.drop_duplicates(keep='first')

# Reorder the indices of the DataFrame
df.reset_index(drop=True, inplace=True)

# Transform the dataset according to the specified format
def format_quest_input(row):
    title = f"Name: {row['name']}"
    objective = f"Objective: {row['objective']} "
    return f"{title}, {objective}"

def format_quest_output(row):
    description = f"{row['description']}"
    return f"{description}"

df['meaning_representation'] = df.apply(format_quest_input, axis=1)
df['quest_description'] = df.apply(format_quest_output, axis=1)

df

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['meaning_representation'] = df.apply(format_quest_input, axis=1)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['quest_description'] = df.apply(format_quest_output, axis=1)


| | name | objective | description | meaning_representation | quest_description |
|---|---|---|---|---|---|
| 0 | Murloc Motes | Drain 12 Temporal Motes. | An alternate timeline born from murlocs? ? !\r... | Name: Murloc Motes, Objective: Drain 12 Tempor... | An alternate timeline born from murlocs? ? !\r... |
| 1 | Deathwingurlugull! | Get in the Hopper and defeat Deathwingurlugull... | Deathwingurlugull! Deathwingurlugull! Deathwin... | Name: Deathwingurlugull! , Objective: Get in t... | Deathwingurlugull! Deathwingurlugull! Deathwin... |
| 2 | Mugurlglrlgl! | Murglgulglrrlglgul! A level 68 Azmerloth Quest. | Mrglgulglrrlglgul! Grurguglullgurl!\r\n\r\nGru... | Name: Mugurlglrlgl! , Objective: Murglgulglrrl... | Mrglgulglrrlglgul! Grurguglullgurl!\r\n\r\nGru... |
| 3 | The Storm Race Tour | Fly each of the Storm Race courses around the ... | This storm gryphon is intriguing! Whatever lan... | Name: The Storm Race Tour, Objective: Fly each... | This storm gryphon is intriguing! Whatever lan... |
| 4 | Out For Delivery | Tell Cataloger Wulferd in the nearby basecamp ... | I tried to tell Vevesi that I could pull the w... | Name: Out For Delivery, Objective: Tell Catalo... | I tried to tell Vevesi that I could pull the w... |
| ... | ... | ... | ... | ... | ... |
| 22538 | Major Mana Potion | In addition to our other supplies, we also hav... | Here you are, <name>. Be careful out there. Ou... | Name: Major Mana Potion, Objective: In additio... | Here you are, <name>. Be careful out there. Ou... |
| 22539 | <UNUSED> | Go to Tarren Mill and find out the status of t... | As a child, raised as a gladiator at Durnholde... | Name: <UNUSED>, Objective: Go to Tarren Mill a... | As a child, raised as a gladiator at Durnholde... |
| 22540 | <UNUSED> | Talk to Kelt Thomasin. | Hello there, <name>. I've heard much about you... | Name: <UNUSED>, Objective: Talk to Kelt Thomas... | Hello there, <name>. I've heard much about you... |
| 22541 | Test Kill Quest | Kill 5 Murlocs and come back to the test chara... | Got some time to spare? Kill me some Murlocs ... | Name: Test Kill Quest, Objective: Kill 5 Murlo... | Got some time to spare? Kill me some Murlocs ... |
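The row filter above tests for the replacement character with a plain `in` check, which would raise a `TypeError` if a cell were NaN (pandas reads empty CSV fields as floats). A minimal sketch of a NaN-tolerant predicate follows; the helper name `has_replacement_char` and the sample rows are illustrative assumptions, not part of the notebook.

```python
REPLACEMENT_CHAR = "\ufffd"  # U+FFFD, produced when bytes fail to decode


def has_replacement_char(value):
    """Return True only for strings containing U+FFFD; non-strings (e.g. NaN) count as clean."""
    return isinstance(value, str) and REPLACEMENT_CHAR in value


# Hypothetical rows standing in for the quest DataFrame
rows = [
    {"objective": "Drain 12 Temporal Motes.", "description": "An alternate timeline"},
    {"objective": "Kill 5 Murlocs\ufffd", "description": "Got some time to spare?"},
    {"objective": float("nan"), "description": "Hello there, <name>."},
]

bad = [
    i for i, r in enumerate(rows)
    if has_replacement_char(r["objective"]) or has_replacement_char(r["description"])
]
print(bad)  # [1]
```

With pandas, the same predicate could be applied column-wise, for example `df['objective'].apply(has_replacement_char)`, to build a boolean mask instead of looping with `iterrows()`.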


In [None]:
# Save the transformed dataset to a new CSV file
output_file_path_train = 'formatted_data_tagged_prefix_train.csv'
output_file_path_test = 'formatted_data_tagged_prefix_test.csv'

from sklearn.model_selection import train_test_split

df2 = df[['meaning_representation', 'quest_description']]
train, test = train_test_split(df2, test_size=0.1)

train.to_csv(output_file_path_train, index=False, header=True)
test.to_csv(output_file_path_test, index=False, header=True)

output_file_path_train
output_file_path_test

'formatted_data_tagged_prefix_test.csv'
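The split above is unseeded, so re-running the cell produces a different train/test partition each time. `train_test_split` accepts a `random_state` argument for this purpose; the plain-Python sketch below illustrates the same idea of a seeded 90/10 split. The function name `seeded_split` and the seed value are assumptions for illustration only.

```python
import random


def seeded_split(items, test_size=0.1, seed=42):
    """Shuffle a copy of items with a fixed seed, then split off the test fraction."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_size))
    return shuffled[n_test:], shuffled[:n_test]


train_rows, test_rows = seeded_split(range(100))
print(len(train_rows), len(test_rows))  # 90 10
```

Fixing the seed matters here because the test CSV is later used for evaluation: a fresh unseeded split could move former test rows into the training set.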

### Load Dataset for HuggingFace

In [None]:
import pandas as pd
from datasets import Dataset

# Load the dataset
dataset_path = 'formatted_data_tagged_prefix_train.csv'
df = pd.read_csv(dataset_path)

# Convert DataFrame to Hugging Face Dataset
hf_dataset = Dataset.from_pandas(df)

In [None]:
hf_dataset

Dataset({
    features: ['meaning_representation', 'quest_description'],
    num_rows: 20288
})

### Homework code

In [None]:
#@title Quest Generator (from Homework 4 - Prefix Tune & TableToText)

class QuestGenerator:

    def __init__(
            self,
            modelfile,
            modelsuffix='.pt',
            basemodel='distilgpt2',
            epochs=5,
            batchsize=4,
            lr=5e-5,
            virtualtokens=5,
            prefixprojection=False,
        ):
        # the input sentences will be handled using this object, you do not need to manually encode input sentence words
        self.tokenizer = AutoTokenizer.from_pretrained(basemodel)
        self.tokenizer_pad_token_id = self.tokenizer.eos_token_id \
            if self.tokenizer.pad_token_id is None else self.tokenizer.pad_token_id
        self.modelfile = modelfile
        self.modelsuffix = modelsuffix
        self.basemodel = basemodel
        self.epochs = epochs
        self.batchsize = batchsize
        self.lr = lr
        self.virtualtokens = virtualtokens
        self.prefixprojection = prefixprojection
        self.prompt = "Generate a description for the following video game quest: "
        self.training_data = []
        self.model = None # setup the model in self.decode() or self.train()

    def preprocess_function(self, examples):

        text_column = "meaning_representation"
        label_column = "quest_description"

        max_length = 128
        batch_size = len(examples[text_column])
        inputs = [f"{self.prompt}{x} {self.tokenizer.bos_token} " for x in examples[text_column]]
        targets = [f"{x} {self.tokenizer.eos_token}" for x in examples[label_column]]

        model_inputs = self.tokenizer(inputs)
        labels = self.tokenizer(targets)

        for i in range(batch_size):
            sample_input_ids = model_inputs["input_ids"][i]
            label_input_ids = labels["input_ids"][i] + [self.tokenizer_pad_token_id]
            model_inputs["input_ids"][i] = sample_input_ids + label_input_ids
            labels["input_ids"][i] = [-100] * len(sample_input_ids) + label_input_ids
            model_inputs["attention_mask"][i] = [1] * len(model_inputs["input_ids"][i])
        for i in range(batch_size):
            sample_input_ids = model_inputs["input_ids"][i]
            label_input_ids = labels["input_ids"][i]
            model_inputs["input_ids"][i] = [self.tokenizer_pad_token_id] * (
                    max_length - len(sample_input_ids)
            ) + sample_input_ids
            model_inputs["attention_mask"][i] = \
                [0] * \
                (max_length - len(sample_input_ids)) + \
                model_inputs["attention_mask"][i]
            labels["input_ids"][i] = [-100] * (max_length - len(sample_input_ids)) + label_input_ids
            model_inputs["input_ids"][i] = torch.tensor(model_inputs["input_ids"][i][:max_length])
            model_inputs["attention_mask"][i] = torch.tensor(model_inputs["attention_mask"][i][:max_length])
            labels["input_ids"][i] = torch.tensor(labels["input_ids"][i][:max_length])

        model_inputs["labels"] = labels["input_ids"]

        return model_inputs

    def get_data(self, hf_dataset):
        """
        Tokenizes :param hf_dataset: with self.preprocess_function and returns a shuffled
        DataLoader over the result. The dataloader batch size is defined by self.batchsize.
        """
        processed_dataset = hf_dataset.map(
            self.preprocess_function,
            batched=True,
            num_proc=1,
            load_from_cache_file=False,
            desc="Running tokenizer on dataset"
        )

        data_loader = DataLoader(
                                processed_dataset,
                                collate_fn=default_data_collator,
                                batch_size=self.batchsize,
                                pin_memory=True,
                                shuffle=True
                              )
        return data_loader

    def train(self, hf_dataset_train):

        data_loaders = {"train":self.get_data(hf_dataset_train)}

        model = AutoModelForCausalLM.from_pretrained(self.basemodel)

        prefix_config = PrefixTuningConfig(
            task_type=TaskType.CAUSAL_LM,
            num_virtual_tokens=self.virtualtokens,
            prefix_projection=self.prefixprojection
        )
        model = get_peft_model(model, prefix_config)

        model.print_trainable_parameters()

        optimizer = torch.optim.AdamW(model.parameters(), lr=self.lr)
        lr_scheduler = get_linear_schedule_with_warmup(
            optimizer=optimizer,
            num_warmup_steps=0,
            num_training_steps=(len(data_loaders["train"]) * self.epochs),
        )
        model = model.to(device)

        for epoch in range(self.epochs):
            model.train()

            # Training steps for prefix tuning: forward pass, backward pass,
            # gradient clipping, then optimizer and scheduler updates
            total_loss = 0
            for batch in tqdm(data_loaders["train"]):
              batch = {k: v.to(device) for k, v in batch.items()}

              outputs = model(**batch)
              loss = outputs.loss
              total_loss += loss.item()

              optimizer.zero_grad()
              loss.backward()
              torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
              optimizer.step()
              lr_scheduler.step()

            print(f"Epoch {epoch}: Average Loss = {total_loss / len(data_loaders['train'])}")

            if epoch == self.epochs - 1:
                epoch_str = '' # last epoch so do not use epoch number in model filename
            else:
                epoch_str = str(epoch)

            savefile = self.modelfile + epoch_str + self.modelsuffix
            model.save_pretrained(savefile)
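The core of `preprocess_function` is easier to follow on plain lists: the prompt and target ids are concatenated, the prompt positions in the labels are set to -100 so the loss ignores them, and each example is left-padded to `max_length`. The sketch below mirrors that logic under simplifying assumptions (made-up token ids, a hypothetical helper name `build_example`, and no tokenizer).

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss


def build_example(prompt_ids, target_ids, pad_id, max_length):
    """Concatenate prompt and target, mask prompt labels, then left-pad and truncate."""
    input_ids = prompt_ids + target_ids
    labels = [IGNORE_INDEX] * len(prompt_ids) + target_ids
    attention_mask = [1] * len(input_ids)

    pad = max(0, max_length - len(input_ids))
    input_ids = ([pad_id] * pad + input_ids)[:max_length]
    attention_mask = ([0] * pad + attention_mask)[:max_length]
    labels = ([IGNORE_INDEX] * pad + labels)[:max_length]
    return input_ids, attention_mask, labels


ids, mask, labels = build_example([11, 12, 13], [21, 22], pad_id=0, max_length=8)
print(ids)     # [0, 0, 0, 11, 12, 13, 21, 22]
print(mask)    # [0, 0, 0, 1, 1, 1, 1, 1]
print(labels)  # [-100, -100, -100, -100, -100, -100, 21, 22]
```

Note that truncation happens on the right, so an example longer than `max_length` loses the end of its description, including the end-of-text token.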


### Train:

In [None]:
from peft import PeftModel

model_name = 'gpt2-medium'
model_suffix = '.pt'

modelfile = os.path.join('data', 'peft_128')
if modelfile.endswith('.pt'):
    modelfile = modelfile.removesuffix('.pt')
quest_generator = QuestGenerator(
                    modelfile,
                    modelsuffix='.pt',
                    basemodel=model_name,
                    epochs=1,
                    batchsize=4,
                    lr=5e-5,
                    virtualtokens=8,
                    prefixprojection=True
                )
quest_generator.train(hf_dataset)

Running tokenizer on dataset:   0%|          | 0/20288 [00:00<?, ? examples/s]

trainable params: 51,438,592 || all params: 406,261,760 || trainable%: 12.661440742047688


100%|██████████| 5072/5072 [23:27<00:00,  3.60it/s]

Epoch 0: Average Loss = 2.986790826808579
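The numbers in the training log can be cross-checked with a little arithmetic. The sketch below uses only the values printed above (row count, batch size, parameter counts, elapsed time); the variable names are illustrative.

```python
import math

num_rows = 20288   # training examples reported by the Dataset
batch_size = 4     # batchsize passed to QuestGenerator
steps_per_epoch = math.ceil(num_rows / batch_size)
print(steps_per_epoch)  # 5072

trainable, total = 51_438_592, 406_261_760
print(f"{100 * trainable / total:.2f}%")  # 12.66%

elapsed_seconds = 23 * 60 + 27  # 23:27 from the progress bar
print(f"{steps_per_epoch / elapsed_seconds:.2f} it/s")  # 3.60 it/s
```

The step count is also what `get_linear_schedule_with_warmup` receives as `num_training_steps` for a single epoch, so the learning rate decays linearly to zero over these 5072 updates.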





### Evaluate:

In [None]:
tokenizer = AutoTokenizer.from_pretrained('gpt2-medium')
tokenizer_pad_token_id = tokenizer.eos_token_id \
  if tokenizer.pad_token_id is None else tokenizer.pad_token_id
model = GPT2LMHeadModel.from_pretrained('gpt2-medium')

# make sure you have a model file to access
modelfile = os.path.join('data', 'pt_gpt2_medium_128_length')
model_suffix = '.pt'

model = PeftModel.from_pretrained(model, modelfile + model_suffix)
model.eval()
model = model.to(device)  # Move model to the device

In [None]:
tokenizer.special_tokens_map

{'bos_token': '<|endoftext|>',
 'eos_token': '<|endoftext|>',
 'unk_token': '<|endoftext|>'}

In [None]:
sentence = f"Generate a description for the following video game quest: Name: Water Ways, Objective: Kill all the Ogres by the river, near the town of Algredon. <|endoftext|> "
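Hand-written prompts are easy to get slightly out of step with the training format. The sketch below assembles an inference prompt the same way `format_quest_input` and `preprocess_function` build training inputs; the helper name `build_prompt` is an assumption for illustration.

```python
PROMPT = "Generate a description for the following video game quest: "
SEPARATOR = "<|endoftext|>"  # GPT-2 uses this one token as both bos and eos


def build_prompt(name, objective):
    """Mirror the training input: prompt + 'Name: ..., Objective: ... ' + ' <separator> '."""
    meaning_representation = f"Name: {name}, Objective: {objective} "
    return f"{PROMPT}{meaning_representation} {SEPARATOR} "


text = build_prompt("Water Ways", "Kill all the Ogres by the river.")
print(repr(text))
```

The training format yields two spaces before the separator (one from `format_quest_input`, one from the input template), whereas the hand-written sentence above has one. Whether that difference affects generation quality is not tested here.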

In [None]:
#@title Single Prompt

import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel
import torch.nn.functional as F

# Check if GPU is available and set the device accordingly
device = 'cuda' if torch.cuda.is_available() else 'cpu'

context_tokens = tokenizer.encode(sentence, add_special_tokens=False)
context = torch.tensor(context_tokens, dtype=torch.long, device=device)  # Move tensor to the device
num_samples = 1
context = context.unsqueeze(0).repeat(num_samples, 1)
generated = context

eos_token_id = tokenizer.eos_token_id

length = 128
temperature = 0.7  # Temperature parameter
top_k = 120

with torch.no_grad():
    for _ in range(length):
        outputs = model(generated)
        next_token_logits = outputs.logits[:, -1, :] / (temperature if temperature > 0 else 1.)

        # Filter the logits to only include top-k options
        top_k_logits, top_k_indices = torch.topk(next_token_logits, top_k)
        probabilities = F.softmax(top_k_logits, dim=-1)

        # Sample from the filtered distribution
        next_token = torch.multinomial(probabilities, num_samples=1)
        next_token = top_k_indices.gather(-1, next_token)
        # Stop on either special token (`|` binds tighter than `==`, so use a membership test)
        if next_token.item() in (eos_token_id, tokenizer.bos_token_id):
            break
        generated = torch.cat((generated, next_token.to(device)), dim=1)  # Ensure next_token is on the same device

out = generated
out = out[:, len(context_tokens):].tolist()
for o in out:
    text = tokenizer.decode(o, clean_up_tokenization_spaces=False)
    print(text)


Algredon is a city of stone, filled with high-fliers and few goodly creatures. We will not risk our lives here, <name>.  

The Ogres are here to drink the river's blood.�

No, I do not mean to take them out of sight. 

I am asking you to do what only the bravest of us would do. 

Take our place and help.  
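The sampling step in the loop above (temperature scaling, top-k filtering, softmax, then a multinomial draw) can be sketched without torch to make the mechanics visible. This is a plain-Python illustration under assumed toy logits, not the notebook's implementation; `sample_top_k` is a hypothetical helper.

```python
import math
import random


def sample_top_k(logits, top_k, temperature, rng):
    """Sample one index from the top_k highest logits after temperature scaling."""
    scaled = [x / temperature for x in logits]
    # Indices of the top_k largest scaled logits
    top = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]
    # Softmax restricted to the kept logits (subtract the max for numerical stability)
    peak = max(scaled[i] for i in top)
    weights = [math.exp(scaled[i] - peak) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]


rng = random.Random(0)
logits = [2.0, 1.0, 0.5, -1.0, -3.0]
draws = [sample_top_k(logits, top_k=2, temperature=0.7, rng=rng) for _ in range(1000)]
print(sorted(set(draws)))  # [0, 1]
```

Lower temperatures sharpen the distribution toward the highest logit, while a larger `top_k` admits more low-probability tokens, which is one plausible source of the occasional stray tokens in the generated text.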


In [None]:
#@title Run Test File

dataset_test_path = 'formatted_data_tagged_prefix_test.csv'
df_test = pd.read_csv(dataset_test_path)
df_test.columns = ['sentence', 'output']

prompt = "Generate a description for the following video game quest: "

references = []
preds = []

max_step = 500

for i, sentence in enumerate(df_test['sentence']):
    p = prompt + sentence + "<|endoftext|>"
    # print("prompt: {}".format(p))
    # Encode the full prompt `p` (instruction + quest + separator), not the bare sentence
    context_tokens = tokenizer.encode(p, add_special_tokens=True)
    context = torch.tensor(context_tokens, dtype=torch.long, device=device)  # Move tensor to the device
    num_samples = 1
    context = context.unsqueeze(0).repeat(num_samples, 1)
    generated = context

    eos_token_id = tokenizer.eos_token_id

    length = 128
    temperature = 0.9  # Temperature parameter
    top_k = 100

    with torch.no_grad():
        for _ in range(length):
            outputs = model(generated)
            next_token_logits = outputs.logits[:, -1, :] / (temperature if temperature > 0 else 1.)

            # Filter the logits to only include top-k options
            top_k_logits, top_k_indices = torch.topk(next_token_logits, top_k)
            probabilities = F.softmax(top_k_logits, dim=-1)

            # Sample from the filtered distribution
            next_token = torch.multinomial(probabilities, num_samples=1)
            next_token = top_k_indices.gather(-1, next_token)
            # Stop on either special token (`|` binds tighter than `==`, so use a membership test)
            if next_token.item() in (eos_token_id, tokenizer.bos_token_id):
                break
            generated = torch.cat((generated, next_token.to(device)), dim=1)  # Ensure next_token is on the same device

    out = generated
    out = out[:, len(context_tokens):].tolist()
    for o in out:
        text = tokenizer.decode(o, clean_up_tokenization_spaces=False)

        references.append(df_test['output'][i])
        preds.append(text)

        print(text)

        break

    if i >= max_step:
      break

These are some extremely dangerous demons. 

The battle to repel them is at hand.

We must see to their immediate defense to ensure victory.

My people have developed an advanced defensive device: the "Vindicaar" - an oversized shield that will protect them from all attacks.

It is far from perfect, but we can afford to lose it if we are going to stand a chance.

I will go back to the Vindicaar and see this through.  
Now you see, it's the first time I've seen a gift from you. I have for you a single item, with a message that speaks volumes. rawdownload

The Gift of the Huntress is meant to prepare you for adventure. If the task is answered, the Gift will be consumed.

Your sacrifice will help me bring the Horde's ancient enemies to heel.

I want you to prove me right. 

I shall reward you with all of the fame that awaits you for your efforts.  
Now, the power to summon these artifacts may be found nearby you in the cavern, but there is a larger piece of ancient information h

KeyboardInterrupt: ignored

In [None]:
for i, pred in enumerate(preds):
  print("---> prompt: {}".format(df_test['sentence'][i]))
  print(pred)

---> prompt: Name: Return to the Vindicaar, Objective: Return to the Vindicaar.  
Generate a description for the following video game quest: Name: Return to the Vindicaar, Objective: Return to the Vindicaar.  <|endoftext|>There are many such things as that: the Burning Legion's presence haunts the whole of Azeroth, and the remnants of the original Horde have turned on one another.

The Burning Legion has brought it down to this, so I trust you can do something to help them.

I do not speak for the rest of the Vindicaar, but I do know they are ready to aid in any way you need them to.  
---> prompt: Name: Blingtron 7000, Objective: OUTPUT: Please accept this GIFT.  
Generate a description for the following video game quest: Name: Blingtron 7000, Objective: OUTPUT: Please accept this GIFT.  <|endoftext|>If you've ever made the pilgrimage to the Ruins of Orgrimmar, chances are you've heard of the Dragonblight. A plague which has ravaged the Horde's borders, spawning a terrible plague 

In [None]:
#@title Export

import csv

if not os.path.exists('output'):
    os.makedirs('output')

csv_file_path = 'output/formatted_dataset_tagged_prefix_test_out.csv'

# Write the list to a CSV file
with open(csv_file_path, 'w', newline='') as csvfile:
    csv_writer = csv.writer(csvfile)

    # Write each generated description as a new row
    # (`preds` holds the decoded texts; `outputs` is the last raw model output)
    for string in preds:
        csv_writer.writerow([string])

print(f"List of strings written to {csv_file_path}")
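For later scoring it can help to keep each prediction next to its prompt and reference rather than exporting the generated text alone. A minimal sketch follows, assuming the `references` and `preds` lists built in the test cell; the helper name `write_predictions` and the column names are illustrative.

```python
import csv
import io


def write_predictions(handle, prompts, references, preds):
    """Write a header row, then one CSV row per example."""
    writer = csv.writer(handle)
    writer.writerow(["prompt", "reference", "prediction"])
    for row in zip(prompts, references, preds):
        writer.writerow(row)


# An in-memory buffer stands in for the output file in this sketch
buffer = io.StringIO()
write_predictions(buffer, ["Name: A, Objective: B"], ["ref text"], ["generated text"])
print(buffer.getvalue())
```

To write to disk instead, the same function can be passed a handle from `open(csv_file_path, 'w', newline='')`, as in the export cell above.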