<a href="https://colab.research.google.com/github/uceku95/ShakespeareLLM/blob/main/ShakespeareLLM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Install Dependencies**

In [1]:
!pip install datasets
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting datasets
  Downloading datasets-2.12.0-py3-none-any.whl (474 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m474.6/474.6 kB[0m [31m32.6 MB/s[0m eta [36m0:00:00[0m
Collecting dill<0.3.7,>=0.3.0 (from datasets)
  Downloading dill-0.3.6-py3-none-any.whl (110 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m110.5/110.5 kB[0m [31m14.1 MB/s[0m eta [36m0:00:00[0m
Collecting xxhash (from datasets)
  Downloading xxhash-3.2.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (212 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m212.5/212.5 kB[0m [31m25.6 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting multiprocess (from datasets)
  Downloading multiprocess-0.70.14-py310-none-any.whl (134 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m134.3/134.3 kB[0m [31m17.4 MB/s[0m eta [36m0:00:00[0m
Collec

**Import necessary Libraries**

In [2]:
import os
import random
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel, AdamW, get_linear_schedule_with_warmup

**Load Dataset**

In [3]:
# set random seed for reproducibility
random.seed(42)
torch.manual_seed(42)

# load text file as dataset
with open('/content/tiny_shakespeare.txt', 'r', encoding='utf-8') as f:
    text = f.read()

**Initializing GPT2 tokenizer and model**

In [4]:
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')

Downloading (…)olve/main/vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

Downloading (…)olve/main/merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/548M [00:00<?, ?B/s]

Downloading (…)neration_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

In [5]:

# set device to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# tokenize text and convert to torch tensors
input_ids = tokenizer.encode(text, return_tensors='pt', max_length=512, truncation=True).to(device)

# set training parameters
train_batch_size = 4
num_train_epochs = 3
learning_rate = 5e-5

# initialize optimizer and scheduler
optimizer = AdamW(model.parameters(), lr=learning_rate)
total_steps = len(input_ids) * num_train_epochs // train_batch_size
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=0, num_training_steps=total_steps)



**Training**

In [6]:
# train the model
model.train()
for epoch in range(num_train_epochs):
    epoch_loss = 0.0
    for i in range(0, len(input_ids)-1, train_batch_size):
        # slice the input ids tensor to get the current batch
        batch_input_ids = input_ids[i:i+train_batch_size]
        # create shifted labels for each input in the batch
        batch_labels = batch_input_ids.clone()
        batch_labels[:, :-1] = batch_labels[:, 1:]
        # set label ids to -100 for padded tokens
        batch_labels[batch_labels == tokenizer.pad_token_id] = -100
        # clear gradients
        optimizer.zero_grad()
        # forward pass
        outputs = model(input_ids=batch_input_ids, labels=batch_labels)
        loss = outputs[0]
        # backward pass
        loss.backward()
        epoch_loss += loss.item()
        # clip gradients to prevent exploding gradients problem
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        # update parameters
        optimizer.step()
        scheduler.step()
    print('Epoch: {}, Loss: {:.4f}'.format(epoch+1, epoch_loss/len(input_ids)))

Epoch: 1, Loss: 0.0000
Epoch: 2, Loss: 0.0000
Epoch: 3, Loss: 0.0000


**Saving the trained model**

In [7]:
output_dir = 'results'
if not os.path.exists(output_dir):
    os.makedirs(output_dir)
model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)

('results/tokenizer_config.json',
 'results/special_tokens_map.json',
 'results/vocab.json',
 'results/merges.txt',
 'results/added_tokens.json')

In [8]:
tokenizer = GPT2Tokenizer.from_pretrained('/content/results')
model = GPT2LMHeadModel.from_pretrained('/content/results')



In [9]:
text ="Tell me about India?"
completion = model.generate(**tokenizer(text,return_tensors="pt"), max_length=100)
print(tokenizer.decode(completion[0]))


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Tell me about India?

India is a country of about 1.5 billion people. It is the second largest economy in the world after China. It is the second largest economy in the world after the United States. It is the second largest economy in the world after the United Kingdom. It is the second largest economy in the world after the United States. It is the second largest economy in the world after the United States. It is the second largest economy in the world after the United States
