In [None]:
!pip install torch transformers



In [None]:
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

tokenizer("king")

{'input_ids': [3364], 'attention_mask': [1]}

In [None]:
tokenizer("queen and the king")

{'input_ids': [4188, 268, 290, 262, 5822], 'attention_mask': [1, 1, 1, 1, 1]}

In [None]:
from transformers import GPT2LMHeadModel

# Generating un-tuned text with our model
model = GPT2LMHeadModel.from_pretrained("gpt2")

test_generation = model.generate(max_length=120)
tokenizer.decode(test_generation[0])

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


'<|endoftext|>\nThe first time I saw the new version of the game, I was so excited. I was so excited to see the new version of the game, I was so excited to see the new version of the game, I was so excited to see the new version of the game, I was so excited to see the new version of the game, I was so excited to see the new version of the game, I was so excited to see the new version of the game, I was so excited to see the new version of the game, I was so excited to see the new version of'

In [None]:
input_sequence = "On a bright sunny day in the streets of Stockholm"
input_ids = tokenizer.encode(input_sequence, return_tensors='pt')

test_generation = model.generate(input_ids, max_length=300)
tokenizer.decode(test_generation[0])

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


'On a bright sunny day in the streets of Stockholm, a young man named Anders Behring Breivik, who had been living in the city for a year, was shot dead by police.\n\nThe man, who had been living in the city for a year, was shot dead by police.\n\nThe man, who had been living in the city for a year, was shot dead by police.\n\nThe man, who had been living in the city for a year, was shot dead by police.\n\nThe man, who had been living in the city for a year, was shot dead by police.\n\nThe man, who had been living in the city for a year, was shot dead by police.\n\nThe man, who had been living in the city for a year, was shot dead by police.\n\nThe man, who had been living in the city for a year, was shot dead by police.\n\nThe man, who had been living in the city for a year, was shot dead by police.\n\nThe man, who had been living in the city for a year, was shot dead by police.\n\nThe man, who had been living in the city for a year, was shot dead by police.\n\nThe man, who had been li

In [None]:
input_sequence = "Alf and Yngvi, | Eikinskjaldi"
input_ids = tokenizer.encode(input_sequence, return_tensors='pt')

test_generation = model.generate(input_ids, max_length=120)
tokenizer.decode(test_generation[0])

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


'Alf and Yngvi, | Eikinskjaldi, | Eikinskjaldi, | Eikinskjaldi, | Eikinskjaldi, | Eikinskjaldi, | Eikinskjaldi, | Eikinskjaldi, | Eikinskjaldi, | Eikinskjaldi, | Eikinskjaldi, | Eikinskjaldi, | Eikinskjaldi, | Eikinskjaldi, | Eikinskjaldi, | Eikinskjaldi, | Eikinskjaldi, |'

# Loading our dataset

In [None]:
# Download the file
!pip install wget

import wget
url = 'https://raw.githubusercontent.com/ohshitnotgood/kth-wrkshp/refs/heads/main/data.txt'
wget.download(url)



'data.txt'

In [None]:
# Open the file and read it into a single string
with open("./data.txt", encoding="utf-8") as f:
    data = f.read()
print(len(data))

16339


# Build our dataset and dataloader

In [None]:
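# First, we tokenize the entire dataset in one pass.
# The length warning below is expected here: the full sequence is never fed
# to the model directly, only the fixed-size blocks cut from it later.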
tokenized_text = tokenizer.encode(data, return_tensors="pt")

Token indices sequence length is longer than the specified maximum sequence length for this model (4914 > 1024). Running this sequence through the model will result in indexing errors


Next, we split the tokenized data into fixed-size "chunks" and feed them to the model one at a time during fine-tuning.

For each chunk, we run a forward pass, calculate the loss, backpropagate, and update the weights, then repeat with the next chunk until the data is exhausted.
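
As a quick sanity check on the chunking arithmetic, here is a minimal plain-Python sketch that needs neither torch nor the tokenizer. It assumes the 4914-token length reported in the tokenizer warning above and a block size of 128; `chunk_starts` is a hypothetical helper name used only for illustration.

```python
# Plain-Python sketch of fixed-size chunking (no torch needed).
# n_tokens = 4914 is the length reported in the tokenizer warning;
# block_size = 128 is an assumed chunk length for this illustration.

def chunk_starts(n_tokens, block_size):
    """Start offsets of every full block; a shorter tail is dropped."""
    return list(range(0, n_tokens - block_size + 1, block_size))

n_tokens = 4914
block_size = 128

starts = chunk_starts(n_tokens, block_size)
n_chunks = len(starts)
n_dropped = n_tokens - n_chunks * block_size

print(n_chunks)    # 38 full chunks
print(starts[-1])  # 4736, start offset of the last full chunk
print(n_dropped)   # 50 trailing tokens that never reach the model
```

With these numbers, one pass over the data is only 38 training steps, and the last 50 tokens are silently discarded. Padding the tail or using overlapping windows would be alternatives, but they are not what this notebook does.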

In [None]:
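examples = []
block_size = 128
# Step through the token sequence in strides of block_size; a trailing
# remainder shorter than block_size is dropped.
for i in range(0, tokenized_text.size(1) - block_size + 1, block_size):
    examples.append(tokenized_text[:, i:i + block_size])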

examples

[tensor([[   39,  6648,   314,  1265,   930,   422,   262, 11386,  9558,    11,
            198,  4863,   679,   320,    67,   439,   338, 11989,    11,   930,
           1111,  1029,   290,  1877,    26,   198,   817,   280,   266,  2326,
             11,   569,  1604,  1032,    11,   930,   326,   880,   314, 15124,
            198, 19620, 19490,   314,  3505,   930,   286,  1450,   890,  2084,
             13,   198,   198,    40,  3505,  1865,   930,   262, 20178,   286,
            331,   382,    11,   198,  8241,  2921,   502,  8509,   930,   287,
            262,  1528,  3750,   416,    26,   198, 37603, 11621,   314,  2993,
             11,   930,   262,  5193,   287,   262,  5509,   198,  3152, 18680,
          11135,   930, 11061,   262, 15936,    13,   198,   198,  5189,  1468,
            373,   262,  2479,   930,   618,   575, 10793,  5615,    26,   198,
          37567,  4249,  3608,  9813,   930,  4249,  6450,   612,   547,    26,
            198, 22840,   550,   407,   

We can package these chunks using the `Dataset` class from PyTorch, which a `DataLoader` can then iterate over.

In [None]:
import torch
from torch.utils.data import Dataset, DataLoader

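class TextDataset(Dataset):
    """Holds the tokenized text as non-overlapping blocks of block_size tokens."""

    def __init__(self, tokenized_text, block_size):
        self.examples = []
        for i in range(0, tokenized_text.size(1) - block_size + 1, block_size):
            self.examples.append(tokenized_text[:, i:i + block_size])

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, idx):
        # Drop the leading dimension of size 1 so each item has shape (block_size,)
        return self.examples[idx].squeeze(0)


dataset = TextDataset(tokenized_text, block_size)
# With the default batch_size of 1, each batch has shape (1, block_size)
dataloader = DataLoader(dataset, shuffle=True)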
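
A map-style dataset only needs `__len__` and `__getitem__`; the loader picks indices and asks the dataset for each item. The sketch below mimics that contract in plain Python so it can be run without torch. `ToyChunkDataset` and `iterate_shuffled` are hypothetical names for illustration only, and the loop is a simplified stand-in for shuffled loading with batch size 1, not PyTorch's actual implementation.

```python
import random


class ToyChunkDataset:
    """Hypothetical stand-in for TextDataset that stores plain lists."""

    def __init__(self, token_ids, block_size):
        self.examples = [
            token_ids[i:i + block_size]
            for i in range(0, len(token_ids) - block_size + 1, block_size)
        ]

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, idx):
        return self.examples[idx]


def iterate_shuffled(dataset, seed=0):
    """Yield batches of size 1 in random order, visiting each index once."""
    order = list(range(len(dataset)))
    random.Random(seed).shuffle(order)
    for idx in order:
        # The outer list plays the role of the batch dimension.
        yield [dataset[idx]]


toy = ToyChunkDataset(list(range(10)), block_size=4)
batches = list(iterate_shuffled(toy))
print(len(toy))   # 2 full chunks; tokens 8 and 9 are dropped
print(batches)    # the same two chunks, in shuffled order
```

Shuffling changes only the order in which chunks are visited within an epoch; every chunk is still seen exactly once per epoch.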

# How can we finetune our model?

In [None]:
# Define an optimizer. AdamW is imported from PyTorch because the
# version that shipped with transformers is deprecated.
from torch.optim import AdamW
optimizer = AdamW(model.parameters(), lr=3e-5)



In [None]:
# Fine-tuning loop
epochs = 3
model.train()
for epoch in range(epochs):
    print(f"Epoch {epoch + 1}/{epochs}")
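    for step, batch in enumerate(dataloader):
        # Passing the inputs as labels makes the model compute the
        # next-token prediction loss itself
        outputs = model(batch, labels=batch)
        loss = outputs.loss
        loss.backward()        # compute gradients
        optimizer.step()       # update the weights
        optimizer.zero_grad()  # reset gradients for the next batch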

        if step % 100 == 0:
            print(f"Step {step}, Loss: {loss.item():.4f}")


Epoch 1/3
Step 0, Loss: 5.5227
Epoch 2/3
Step 0, Loss: 4.1066
Epoch 3/3
Step 0, Loss: 3.1474
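
To make the logged numbers easier to read: when `labels` is passed, `GPT2LMHeadModel` shifts them internally so that the prediction at position t is scored against the token at position t + 1, and the reported loss is the mean cross-entropy over those positions. The following is a minimal plain-Python sketch of that calculation on made-up logits for a 3-token vocabulary. `log_softmax` and `next_token_loss` are illustrative helpers, not library functions.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one list of scores."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def next_token_loss(logits_per_position, token_ids):
    """Mean cross-entropy where position t is scored against token t + 1."""
    shifted_logits = logits_per_position[:-1]  # last position has no target
    shifted_labels = token_ids[1:]             # first token is never a target
    losses = [
        -log_softmax(logits)[label]
        for logits, label in zip(shifted_logits, shifted_labels)
    ]
    return sum(losses) / len(losses)


# Toy values (made up): a 3-token vocabulary and a 3-token sequence.
token_ids = [0, 2, 1]
uniform_logits = [[0.0, 0.0, 0.0]] * 3
loss = next_token_loss(uniform_logits, token_ids)
print(loss)            # equals log(3): a uniform guess over 3 tokens
print(math.exp(loss))  # perplexity, about 3

# Perplexity implied by the first and last losses logged above.
for logged in (5.5227, 3.1474):
    print(logged, math.exp(logged))
```

Read this way, the drop from 5.5227 to 3.1474 corresponds to perplexity falling from roughly 250 to roughly 23 on the training chunks. Since the same small dataset is seen every epoch, this measures fit to the training text, not generalization.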


In [None]:
# Generate our new text
def generate_text(prompt, max_length=500):
    model.eval()
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        outputs = model.generate(inputs["input_ids"], max_length=max_length, temperature=0.7, do_sample=True)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Generate text with a prompt
prompt = "Alf and Yngvi, | Eikinskjaldi"
print("Generated text:", generate_text(prompt))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generated text: Alf and Yngvi, | Eikinskjaldi, | and Fjord and Skjaldi.

The earth is well, | the men well with frost and darkness, | and the rivers well green.

Beneath the mountains high | the gods see | and Thor sees | all things well.

In Asgard | Loki, | Helvig, the serpent, the sea monster, | Loki's son, | and the moon | Loki's sister.

As Thor battles the god | with swords and spears, | and the giants fight | the gods with gold and stone, | and the seas | with fire and ice.

The sons | of the gods, | their race and age are well, | and the daughters | their race and age are well, | and the sons | of earth and fire are well,
And Thor battles in high | the mighty gods, | and Loki battles his brother, | and the giants fight their brother, | and the rivers | with fire and ice.

In Niflheim | Thor fights in high | the mighty gods, | and Loki battles his brother, | and the giants fight their brother, | and the seas | with fire and ice.

In the Old City | where Loki lies, | and Asgard s
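
The call above sets `do_sample=True` and `temperature=0.7`. With sampling enabled, the logits are divided by the temperature before the softmax, and the next token is drawn from the resulting distribution. The sketch below shows that step in plain Python on made-up logits; the helper names are hypothetical and it is a simplification of what `generate` does internally.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then normalise with softmax."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temperature, rng):
    """Draw one token index from the temperature-adjusted distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


toy_logits = [2.0, 1.0, 0.0]  # made-up scores for a 3-token vocabulary
plain = softmax_with_temperature(toy_logits, 1.0)
sharp = softmax_with_temperature(toy_logits, 0.7)
print([round(p, 3) for p in plain])
print([round(p, 3) for p in sharp])  # more mass on the top-scoring token

rng = random.Random(0)
print([sample_token(toy_logits, 0.7, rng) for _ in range(10)])
```

A temperature below 1 sharpens the distribution, so sampling stays close to the most likely tokens while still allowing variation. The earlier cells called `generate` without `do_sample`, which picks the single most likely token at every step; that is one likely reason those outputs fall into repeating loops.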

In [None]:
for each in range(0, 1000, 128):
  print(each)

  # samadder@kth.se

0
128
256
384
512
640
768
896
