
1. Install Required Libraries

In [1]:
!pip install torch transformers




2. Import Libraries

In [2]:
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer, get_linear_schedule_with_warmup
from torch.optim import AdamW  # transformers' own AdamW is deprecated in favor of the PyTorch optimizer
from torch.utils.data import Dataset, DataLoader
import numpy as np
import random


3. Sample Text Data

In [3]:
# Sample text data (replace with a larger dataset if available)
text = """Once upon a time, there was a little girl named Red Riding Hood. She loved to visit her grandmother, who lived in the woods. One day, her mother asked her to take a basket of goodies to her grandmother. On her way through the woods, she met a big bad wolf who wanted to eat her."""


4. Define Dataset Class for Tokenization

In [4]:
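class TextDataset(Dataset):
    def __init__(self, text, tokenizer, max_length=128):
        # A single string yields exactly one example of shape (1, max_length).
        # padding='max_length' requires tokenizer.pad_token to be set before this is called.
        self.examples = tokenizer(text, return_tensors='pt', truncation=True, padding='max_length', max_length=max_length)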

    def __len__(self):
        return self.examples['input_ids'].size(0)

    def __getitem__(self, i):
        return {key: val[i] for key, val in self.examples.items()}
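Because the whole text is tokenized as one string, this dataset always contains a single example. If a larger corpus is substituted, one common approach is to split the token ids into fixed-length blocks so that each block becomes its own example. The sketch below shows that idea on a plain list of ids; `chunk_token_ids` and `block_size` are hypothetical names introduced for illustration and are not part of the notebook or of `transformers`.

```python
def chunk_token_ids(token_ids, block_size):
    """Split a flat list of token ids into consecutive blocks of block_size.

    Assumption: the trailing remainder shorter than block_size is dropped,
    which avoids padding entirely.
    """
    n_full = len(token_ids) // block_size
    return [token_ids[i * block_size:(i + 1) * block_size] for i in range(n_full)]


# Stand-in ids; in practice these would come from the tokenizer.
blocks = chunk_token_ids(list(range(10)), block_size=4)
print(blocks)  # [[0, 1, 2, 3], [4, 5, 6, 7]]
```

Each block could then serve as both `input_ids` and `labels` for one training example.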


5. Initialize Model and Tokenizer

In [5]:
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")



6. Prepare Dataset and DataLoader

In [6]:
# Set the padding token for the tokenizer
tokenizer.pad_token = tokenizer.eos_token

# Prepare Dataset and DataLoader
dataset = TextDataset(text, tokenizer)
data_loader = DataLoader(dataset, batch_size=1, shuffle=True)


7. Define Training Function

In [7]:
def train(model, data_loader, optimizer, scheduler, num_epochs=3):
    model.train()
    for epoch in range(num_epochs):
        total_loss = 0
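        for batch in data_loader:
            # Passing labels=input_ids is enough: the model shifts the labels internally for next-token prediction.
            # Note: padding positions (pad == eos here) also count toward the loss unless their labels are set to -100.
            outputs = model(input_ids=batch['input_ids'], attention_mask=batch['attention_mask'], labels=batch['input_ids'])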
            loss = outputs.loss
            total_loss += loss.item()

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            scheduler.step()

        print(f"Epoch {epoch + 1}/{num_epochs} Loss: {total_loss / len(data_loader)}")
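The training loop passes `input_ids` as `labels` unchanged, so the padded tail of the 128-token example contributes to the loss. A common way to exclude padding is to replace those label positions with -100, the value PyTorch's cross-entropy loss ignores by default. The sketch below illustrates the idea on plain lists; `mask_padding_labels` is a hypothetical helper, and the first four ids are illustrative stand-ins for the prompt tokens.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def mask_padding_labels(input_ids, attention_mask, ignore_index=IGNORE_INDEX):
    """Copy input_ids into labels, replacing positions where attention_mask is 0."""
    return [tok if keep == 1 else ignore_index
            for tok, keep in zip(input_ids, attention_mask)]


ids = [7454, 2402, 257, 640, 50256, 50256]   # 50256 is the eos/pad id used in this notebook
mask = [1, 1, 1, 1, 0, 0]
print(mask_padding_labels(ids, mask))  # [7454, 2402, 257, 640, -100, -100]
```

With tensors, the equivalent would be `labels = input_ids.clone()` followed by `labels[attention_mask == 0] = -100`. One caveat: masking every padded position also removes the end-of-text target, so the model would no longer be trained to stop after the story; leaving the first eos position unmasked is one way around that.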


8. Set Up Optimizer and Scheduler, Train the Model with Multiple Epochs

In [8]:
epochs_list = [20, 60, 70]  # Three consecutive runs on the same model; weights carry over, only the optimizer and scheduler are reset
for epochs in epochs_list:
    optimizer = AdamW(model.parameters(), lr=5e-5)
    scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=0, num_training_steps=epochs * len(data_loader))
    print(f"\nTraining for {epochs} epochs")
    train(model, data_loader, optimizer, scheduler, num_epochs=epochs)





Training for 20 epochs
Epoch 1/20 Loss: 6.351659297943115
Epoch 2/20 Loss: 3.6298418045043945
Epoch 3/20 Loss: 2.0891330242156982
Epoch 4/20 Loss: 1.7046347856521606
Epoch 5/20 Loss: 1.250606656074524
Epoch 6/20 Loss: 1.174290657043457
Epoch 7/20 Loss: 1.2331775426864624
Epoch 8/20 Loss: 1.1530059576034546
Epoch 9/20 Loss: 0.9782848358154297
Epoch 10/20 Loss: 0.8526082634925842
Epoch 11/20 Loss: 0.8599589467048645
Epoch 12/20 Loss: 0.8508809804916382
Epoch 13/20 Loss: 0.80182945728302
Epoch 14/20 Loss: 0.7336181998252869
Epoch 15/20 Loss: 0.6503495573997498
Epoch 16/20 Loss: 0.6068370342254639
Epoch 17/20 Loss: 0.6382080912590027
Epoch 18/20 Loss: 0.6045262813568115
Epoch 19/20 Loss: 0.5531098246574402
Epoch 20/20 Loss: 0.6273317337036133

Training for 60 epochs




Epoch 1/60 Loss: 0.5795770287513733
Epoch 2/60 Loss: 0.4231037497520447
Epoch 3/60 Loss: 0.31841912865638733
Epoch 4/60 Loss: 0.27735501527786255
Epoch 5/60 Loss: 0.19045403599739075
Epoch 6/60 Loss: 0.14069651067256927
Epoch 7/60 Loss: 0.09830697625875473
Epoch 8/60 Loss: 0.06880036741495132
Epoch 9/60 Loss: 0.11759037524461746
Epoch 10/60 Loss: 0.10520955920219421
Epoch 11/60 Loss: 0.03253151848912239
Epoch 12/60 Loss: 0.013296842575073242
Epoch 13/60 Loss: 0.010225880891084671
Epoch 14/60 Loss: 0.016380297020077705
Epoch 15/60 Loss: 0.011239998042583466
Epoch 16/60 Loss: 0.016388488933444023
Epoch 17/60 Loss: 0.008318658918142319
Epoch 18/60 Loss: 0.016170738264918327
Epoch 19/60 Loss: 0.004188969731330872
Epoch 20/60 Loss: 0.004850511439144611
Epoch 21/60 Loss: 0.00394170917570591
Epoch 22/60 Loss: 0.0058295163325965405
Epoch 23/60 Loss: 0.06843443214893341
Epoch 24/60 Loss: 0.00371655635535717
Epoch 25/60 Loss: 0.00262631056830287
Epoch 26/60 Loss: 0.004200311377644539
Epoch 27/60
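The scheduler created in step 8 scales the base learning rate by a factor that rises linearly during warmup and then falls linearly to zero at `num_training_steps`. With `num_warmup_steps=0`, each run therefore starts at the full learning rate and ends at zero, and creating a new scheduler for the next run restarts the decay. The function below is a plain-Python sketch of that multiplier as I understand it; `linear_schedule_multiplier` is a hypothetical name, not the library function.

```python
def linear_schedule_multiplier(step, num_warmup_steps, num_training_steps):
    """Factor applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Linear ramp from 0 to 1 during warmup.
        return step / max(1, num_warmup_steps)
    # Linear decay from 1 to 0 over the remaining steps.
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


base_lr = 5e-5
total_steps = 20  # 20 epochs x 1 batch per epoch, as in the first run
for step in (0, 10, 20):
    print(step, base_lr * linear_schedule_multiplier(step, 0, total_steps))
```

Under this schedule the late epochs of each run take very small steps, which may partly explain why the loss flattens toward the end of the 20-epoch run.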

9. Define Text Generation Function

In [9]:
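def generate_text(seed_text, max_length=50):
    model.eval()
    # Calling the tokenizer directly returns input_ids and attention_mask together
    inputs = tokenizer(seed_text, return_tensors='pt')
    with torch.no_grad():
        # Passing attention_mask and pad_token_id explicitly avoids the generation warnings.
        # temperature and top_k only take effect with do_sample=True; as written, decoding is greedy.
        output = model.generate(inputs['input_ids'], attention_mask=inputs['attention_mask'], pad_token_id=tokenizer.eos_token_id, max_length=max_length, num_return_sequences=1, no_repeat_ngram_size=2, temperature=0.7, top_k=50)
    return tokenizer.decode(output[0], skip_special_tokens=True)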


10. Generate Text with Initial Seed

In [10]:
seed_text = "Once upon a time"
generated_text = generate_text(seed_text)
print("Generated Text:\n", generated_text)




Generated Text:
 Once upon a time, there was a little girl named Red Riding Hood. She loved to visit her grandmother, who lived in the woods. One day, her mother asked her to take a basket of goodies to her grandma. On her way through the
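The output reproduces the training text almost word for word, except that "grandmother" becomes "grandma" in the third sentence. A plausible cause is `no_repeat_ngram_size=2`, which forbids any next token that would repeat a 2-gram already present in the sequence: "her grandmother" appears in the second sentence, so it cannot be produced again. The sketch below shows the blocking rule using words as stand-ins for tokens; `banned_next_tokens` is a hypothetical helper, not a `transformers` function.

```python
def banned_next_tokens(tokens, n):
    """Return the tokens that would complete an n-gram already present in tokens."""
    if n <= 0 or len(tokens) < n:
        return set()
    # The last n-1 tokens form the prefix of the n-gram about to be completed.
    prefix = tuple(tokens[len(tokens) - (n - 1):])
    banned = set()
    for i in range(len(tokens) - n + 1):
        if tuple(tokens[i:i + n - 1]) == prefix:
            banned.add(tokens[i + n - 1])
    return banned


history = "She loved to visit her grandmother . take a basket to her".split()
print(banned_next_tokens(history, 2))  # {'grandmother'}
```

Setting `no_repeat_ngram_size=0` disables the rule, which would be one way to check whether it accounts for the substitution.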


Experimenting with and Improving the Model: Larger Dataset and Hyperparameter Tuning

1. Try Different Learning Rates

In [11]:
learning_rates = [5e-5, 3e-5, 1e-4]  # Learning rates to try; the model is not reloaded, so each run continues from the previous weights

for lr in learning_rates:
    print(f"\nTraining with Learning Rate: {lr}")
    optimizer = AdamW(model.parameters(), lr=lr)
    scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=0, num_training_steps=20 * len(data_loader))  # For example, 20 epochs
    train(model, data_loader, optimizer, scheduler, num_epochs=20)  # Testing with 20 epochs for each learning rate



Training with Learning Rate: 5e-05
Epoch 1/20 Loss: 0.05514654144644737
Epoch 2/20 Loss: 9.231707372236997e-05
Epoch 3/20 Loss: 0.0001888120750663802
Epoch 4/20 Loss: 8.597374835517257e-05
Epoch 5/20 Loss: 6.3963612774387e-05
Epoch 6/20 Loss: 0.00017493873019702733
Epoch 7/20 Loss: 5.5252796300919726e-05
Epoch 8/20 Loss: 0.00023886351846158504
Epoch 9/20 Loss: 0.05194152146577835
Epoch 10/20 Loss: 0.04969194158911705
Epoch 11/20 Loss: 0.00010174401541007683
Epoch 12/20 Loss: 0.050609707832336426
Epoch 13/20 Loss: 0.0504622645676136
Epoch 14/20 Loss: 0.00011266343790339306
Epoch 15/20 Loss: 0.00010168516746489331
Epoch 16/20 Loss: 5.995984611217864e-05
Epoch 17/20 Loss: 8.149826317094266e-05
Epoch 18/20 Loss: 9.905474144034088e-05
Epoch 19/20 Loss: 0.00011514206562424079
Epoch 20/20 Loss: 0.0001295665861107409

Training with Learning Rate: 3e-05
Epoch 1/20 Loss: 5.051031621405855e-05
Epoch 2/20 Loss: 0.0014993450604379177
Epoch 3/20 Loss: 6.869241769891232e-05
Epoch 4/20 Loss: 6.028463

2. Experiment with Batch Size

In [12]:
batch_sizes = [1, 2, 4]  # Batch sizes to test; the dataset holds one example, so every batch still contains a single example

for batch_size in batch_sizes:
    print(f"\nTraining with Batch Size: {batch_size}")
    data_loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    optimizer = AdamW(model.parameters(), lr=5e-5)
    scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=0, num_training_steps=20 * len(data_loader))  # For example, 20 epochs
    train(model, data_loader, optimizer, scheduler, num_epochs=20)  # Testing with 20 epochs for each batch size



Training with Batch Size: 1
Epoch 1/20 Loss: 0.00017360621131956577
Epoch 2/20 Loss: 6.684022082481533e-05
Epoch 3/20 Loss: 4.776308560394682e-05
Epoch 4/20 Loss: 4.762533353641629e-05
Epoch 5/20 Loss: 8.063619316089898e-05
Epoch 6/20 Loss: 5.077901732875034e-05
Epoch 7/20 Loss: 6.999783363426104e-05
Epoch 8/20 Loss: 6.946591747691855e-05
Epoch 9/20 Loss: 4.9184378440259025e-05
Epoch 10/20 Loss: 0.04491704702377319
Epoch 11/20 Loss: 4.607820665114559e-05
Epoch 12/20 Loss: 5.720061744796112e-05
Epoch 13/20 Loss: 0.0001354385312879458
Epoch 14/20 Loss: 8.247563528129831e-05
Epoch 15/20 Loss: 3.220672806492075e-05
Epoch 16/20 Loss: 0.040634747594594955
Epoch 17/20 Loss: 7.963016832945868e-05
Epoch 18/20 Loss: 6.0788646806031466e-05
Epoch 19/20 Loss: 5.555158350034617e-05
Epoch 20/20 Loss: 4.564332994050346e-05

Training with Batch Size: 2
Epoch 1/20 Loss: 6.785606819903478e-05
Epoch 2/20 Loss: 5.40618239028845e-05
Epoch 3/20 Loss: 5.204560875426978e-05
Epoch 4/20 Loss: 4.312064265832305e
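The losses for batch sizes 1 and 2 look alike because the dataset contains a single example, so the `DataLoader` yields one batch of one example per epoch whatever `batch_size` is requested. The sketch below computes the number of batches per epoch for a given dataset size; `num_batches` is a hypothetical helper that mirrors how a data loader's length is usually derived.

```python
import math


def num_batches(num_examples, batch_size, drop_last=False):
    """Batches per epoch for a dataset of num_examples items."""
    if drop_last:
        return num_examples // batch_size
    return math.ceil(num_examples / batch_size)


for bs in (1, 2, 4):
    print(f"batch_size={bs}: {num_batches(1, bs)} batch(es) per epoch")
```

A batch-size comparison only becomes meaningful once the dataset has at least as many examples as the largest batch size tested.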

3. Experiment with max_length in Text Generation

In [13]:
seed_text = "Once upon a time"
max_lengths = [50, 100, 150]  # Different max lengths for generation

for max_len in max_lengths:
    print(f"\nGenerating text with max_length: {max_len}")
    generated_text = generate_text(seed_text, max_length=max_len)
    print("Generated Text:\n", generated_text)




Generating text with max_length: 50



Generated Text:
 Once upon a time, there was a little girl named Red Riding Hood. She loved to visit her grandmother, who lived in the woods. One day, her mother asked her to take a basket of goodies to her grandma. On her way through the

Generating text with max_length: 100



Generated Text:
 Once upon a time, there was a little girl named Red Riding Hood. She loved to visit her grandmother, who lived in the woods. One day, her mother asked her to take a basket of goodies to her grandma. On her way through the forest, she met a big bad wolf who wanted to eat her.

Generating text with max_length: 150
Generated Text:
 Once upon a time, there was a little girl named Red Riding Hood. She loved to visit her grandmother, who lived in the woods. One day, her mother asked her to take a basket of goodies to her grandma. On her way through the forest, she met a big bad wolf who wanted to eat her.
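The outputs for `max_length` 100 and 150 are identical, which suggests generation ended before either limit was reached, most likely on an end-of-text token. `max_length` is an upper bound on prompt plus generated tokens, not a target length. The toy loop below illustrates this with a stand-in "model" that replays a memorized sequence and then emits an end marker; `greedy_decode`, `memorized_next_token`, and `EOS` are hypothetical names used only for this sketch.

```python
EOS = "<eos>"
MEMORIZED = ["Once", "upon", "a", "time", ",", "there", "was", "a", "wolf", "."]


def memorized_next_token(tokens):
    """Stand-in for an overfitted model: replay the memorized text, then stop."""
    return MEMORIZED[len(tokens)] if len(tokens) < len(MEMORIZED) else EOS


def greedy_decode(next_token, prompt, max_length):
    """Append tokens until max_length (prompt included) is reached or EOS appears."""
    tokens = list(prompt)
    while len(tokens) < max_length:
        tok = next_token(tokens)
        if tok == EOS:
            break
        tokens.append(tok)
    return tokens


prompt = MEMORIZED[:4]
for limit in (6, 100, 150):
    print(limit, " ".join(greedy_decode(memorized_next_token, prompt, limit)))
```

Only the smallest limit truncates the text; the two larger limits give the same result, matching the pattern seen in the notebook output.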
