In [5]:
import torch
from torch.optim import AdamW  # transformers.AdamW is deprecated; use the PyTorch optimizer
from torch.utils.data import Dataset, DataLoader
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Set up device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Example data (Use a larger dataset for meaningful training)
text = """
Once upon a time, there was a little girl named Red Riding Hood. She loved to visit her grandmother, who lived in the woods.
One day, her mother asked her to take a basket of goodies to her grandmother. On her way through the woods, she met a big bad wolf who wanted to eat her.
"""

# Load tokenizer and model
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # Set pad token to eos_token to avoid NoneType error
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.to(device)

# Slice the tokenized text into overlapping fixed-length windows.
# No padding happens here; the collate function pads only if window lengths differ.
class TextDataset(Dataset):
    def __init__(self, text, tokenizer, max_length=50):
        self.max_length = max_length
        # Tokenize the whole text once and keep the 1-D tensor of token ids
        self.tokens = tokenizer(text, return_tensors="pt", truncation=True)["input_ids"][0]

    def __len__(self):
        # Number of window start positions (samples per epoch), never negative
        return max(len(self.tokens) - self.max_length, 0)

    def __getitem__(self, idx):
        return self.tokens[idx:idx + self.max_length]

# Custom collate function for dynamic padding
def collate_fn(batch):
    max_length = max([len(x) for x in batch])
    padded_batch = [torch.cat([x, torch.full((max_length - len(x),), tokenizer.pad_token_id)]) for x in batch]
    return torch.stack(padded_batch)

dataset = TextDataset(text, tokenizer)
dataloader = DataLoader(dataset, batch_size=2, shuffle=True, collate_fn=collate_fn)

# Training function
def train_model(epochs):
    model.train()
    optimizer = AdamW(model.parameters(), lr=3e-5)

    for epoch in range(epochs):
        for batch in dataloader:
            inputs = batch.to(device)
            labels = inputs.clone()
            outputs = model(inputs, labels=labels)
            loss = outputs.loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            print(f"Epoch {epoch+1}/{epochs}, Loss: {loss.item()}")

# Train for 20, then 60, then 70 epochs. The model is not reset between runs, so the epochs accumulate.
for epochs in [20, 60, 70]:
    print(f"Training with {epochs} epochs")
    train_model(epochs)

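# Text generation function
def generate_text(seed_text, max_length=50):
    model.eval()
    # Tokenize with the attention mask so generate() does not have to infer it
    inputs = tokenizer(seed_text, return_tensors="pt").to(device)
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_length=max_length,
        num_return_sequences=1,
        no_repeat_ngram_size=2,
        pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no dedicated pad token
    )
    return tokenizer.decode(generated_ids[0], skip_special_tokens=True)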

# Example of generating new text with seed text
seed_text = "Once upon a time"
generated_text = generate_text(seed_text)
print("Generated Text:", generated_text)
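
The windowing done by `TextDataset` can be checked without loading the model. The sketch below is a minimal pure-Python stand-in, assuming a plain list of token ids in place of the tokenizer output. `sliding_windows` and `steps_per_epoch` are hypothetical helper names, not part of the notebook, and the token count of 68 is an assumption chosen for illustration. It follows the same length rule as `__len__` (token count minus window size), which leaves out the last full window.

```python
import math

def sliding_windows(token_ids, window=50):
    """Return the overlapping windows that the dataset would serve."""
    # Start positions run from 0 to len - window - 1, matching __len__ in the notebook.
    # The final full window (starting at len - window) is therefore never produced.
    count = max(len(token_ids) - window, 0)
    return [token_ids[i:i + window] for i in range(count)]

def steps_per_epoch(num_tokens, window=50, batch_size=2):
    """Number of optimizer steps per epoch when no batch is dropped."""
    num_windows = max(num_tokens - window, 0)
    return math.ceil(num_windows / batch_size)

# Assumed token count for illustration; the real count depends on the GPT-2 tokenizer.
example_ids = list(range(68))
windows = sliding_windows(example_ids)
print(len(windows), "windows,", steps_per_epoch(len(example_ids)), "steps per epoch")
```

The training log reports nine loss values per epoch with `batch_size=2`, which is consistent with 17 or 18 windows of 50 tokens.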

Training with 20 epochs




Epoch 1/20, Loss: 3.073253631591797
Epoch 1/20, Loss: 2.7050082683563232
Epoch 1/20, Loss: 2.3932690620422363
Epoch 1/20, Loss: 2.221575975418091
Epoch 1/20, Loss: 2.0989279747009277
Epoch 1/20, Loss: 1.6237587928771973
Epoch 1/20, Loss: 1.5885716676712036
Epoch 1/20, Loss: 1.3167166709899902
Epoch 1/20, Loss: 1.1896772384643555
Epoch 2/20, Loss: 0.9906514286994934
Epoch 2/20, Loss: 0.9208289384841919
Epoch 2/20, Loss: 0.908475935459137
Epoch 2/20, Loss: 0.8454148173332214
Epoch 2/20, Loss: 0.6942786574363708
Epoch 2/20, Loss: 0.5161041617393494
Epoch 2/20, Loss: 0.4405713379383087
Epoch 2/20, Loss: 0.4624619781970978
Epoch 2/20, Loss: 0.4078707695007324
Epoch 3/20, Loss: 0.42080116271972656
Epoch 3/20, Loss: 0.2918441593647003
Epoch 3/20, Loss: 0.2592482566833496
Epoch 3/20, Loss: 0.3665432929992676
Epoch 3/20, Loss: 0.24951618909835815
Epoch 3/20, Loss: 0.1496712565422058
Epoch 3/20, Loss: 0.14253084361553192
Epoch 3/20, Loss: 0.21963432431221008
Epoch 3/20, Loss: 0.15467670559883118


Epoch 70/70, Loss: 0.0055433777160942554
Generated Text: Once upon a time, there was a little girl named Red Riding Hood. She loved to visit her grandmother, who lived in the woods.
One day, her mother asked her to take a basket of goodies to her grandparents. On her way through
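
A loss this small is easier to interpret as perplexity, the exponential of the mean per-token cross-entropy that `outputs.loss` reports. The sketch below is a minimal illustration using two loss values copied from the log above; `perplexity` is a hypothetical helper name, not part of the notebook. A perplexity close to 1 means the model assigns almost all probability to the next training token, which fits the near-verbatim reproduction of the training passage in the generated text.

```python
import math

def perplexity(mean_cross_entropy):
    """Convert a mean per-token cross-entropy (in nats) to perplexity."""
    return math.exp(mean_cross_entropy)

# Values copied from the training log
first_logged_loss = 3.073253631591797      # first step of the 20-epoch run
last_logged_loss = 0.0055433777160942554   # final step of the 70-epoch run

print(f"start: {perplexity(first_logged_loss):.2f}")
print(f"end:   {perplexity(last_logged_loss):.4f}")
```

With only two sentences of training text, this points to memorization rather than generalization, in line with the notebook's own comment that a larger dataset is needed for meaningful training.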
