<a href="https://www.kaggle.com/code/sharanharsoor/fine-tuning-gpt-2-on-bhagavad-gita-dataset?scriptVersionId=196484668" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Introduction.
In this notebook, I am fine-tuning a GPT-2 model on a Bhagavad Gita dataset to generate responses related to various verses. The process involves loading and tokenizing the dataset, converting the text into a format suitable for GPT-2, and then training the model with a focus on both English and Sanskrit translations. The fine-tuned model is later evaluated by generating responses to predefined queries about the Bhagavad Gita. The implementation includes steps for data processing, model training, evaluation, and ensuring memory efficiency with strategies like gradient accumulation and mixed precision. Finally, I evaluate the model by generating answers to specific Bhagavad Gita-related questions.

# Overview of the Process:

1. **Loading and Processing Data:**
   - Load the Bhagavad Gita dataset from a CSV file.
   - The dataset includes Shlokas (Sanskrit verses), transliterations, and translations in English and Hindi.
   - Data is preprocessed and displayed to ensure correctness before further steps.

2. **Tokenization:**
   - Use the GPT-2 tokenizer to encode the text data.
   - The `pad_token` is set to the `eos_token` because GPT-2 doesn’t have a padding token by default.
   - Tokenization is done for both prompts (Sanskrit verses) and responses (translations).

3. **Preparing Data for Model Training:**
   - Convert the dataset into a format compatible with the Hugging Face library.
   - The dataset is tokenized with truncation and padding, and labels are prepared by masking padding tokens.
   - This step ensures the model understands both inputs (verses) and outputs (translations).

4. **Training the GPT-2 Model:**
   - Split the dataset into training and validation sets.
   - Define training arguments with memory optimization techniques like gradient accumulation and mixed precision (FP16).
   - Log the training process using a custom callback to track loss and monitor model performance.

5. **Evaluation and Model Generation:**
   - After training, the model is evaluated using predefined queries from the Bhagavad Gita.
   - The fine-tuned model generates answers to these queries, leveraging both English translations and Sanskrit verses.

6. **Saving and Using the Model:**
   - The trained model and tokenizer are saved for future use.
   - Evaluate the model by generating responses to Bhagavad Gita-related questions, demonstrating the ability of the fine-tuned model to generate relevant answers.


## Loading of dataset

In [1]:
import pandas as pd
import numpy as np
pd.read_csv("/kaggle/input/bhagwad-gita-dataset/Bhagwad_Gita.csv")

Unnamed: 0,ID,Chapter,Verse,Shloka,Transliteration,HinMeaning,EngMeaning,WordMeaning
0,BG1.1,1,1,धृतराष्ट्र उवाच |\nधर्मक्षेत्रे कुरुक्षेत्रे स...,dhṛtarāṣṭra uvāca .\ndharmakṣetre kurukṣetre s...,।।1.1।।धृतराष्ट्र ने कहा -- हे संजय ! धर्मभूमि...,1.1 Dhritarashtra said What did my people and...,1.1 धर्मक्षेत्रे on the holy plain? कुरुक्षेत्...
1,BG1.2,1,2,सञ्जय उवाच |\nदृष्ट्वा तु पाण्डवानीकं व्यूढं द...,sañjaya uvāca .\ndṛṣṭvā tu pāṇḍavānīkaṃ vyūḍha...,।।1.2।।संजय ने कहा -- पाण्डव-सैन्य की व्यूह रच...,1.2. Sanjaya said Having seen the army of the...,1.2 दृष्ट्वा having seen? तु indeed? पाण्डवानी...
2,BG1.3,1,3,पश्यैतां पाण्डुपुत्राणामाचार्य महतीं चमूम् |\n...,paśyaitāṃ pāṇḍuputrāṇāmācārya mahatīṃ camūm .\...,।।1.3।।हे आचार्य ! आपके बुद्धिमान शिष्य द्रुपद...,"1.3. ""Behold, O Teacher! this mighty army of t...",1.3 पश्य behold? एताम् this? पाण्डुपुत्राणाम् ...
3,BG1.4,1,4,अत्र शूरा महेष्वासा भीमार्जुनसमा युधि |\nयुयुध...,atra śūrā maheṣvāsā bhīmārjunasamā yudhi .\nyu...,।।1.4।।इस सेना में महान् धनुर्धारी शूर योद्धा ...,"1.4. Here are heroes, mighty archers, eal in b...",1.4 अत्र here? शूराः heroes? महेष्वासाः mighty...
4,BG1.5,1,5,धृष्टकेतुश्चेकितानः काशिराजश्च वीर्यवान् |\nपु...,dhṛṣṭaketuścekitānaḥ kāśirājaśca vīryavān .\np...,"।।1.5।।धृष्टकेतु, चेकितान, बलवान काशिराज, पुर...","1.5. ""Dhrishtaketu, chekitana and the valiant ...",1.5 धृष्टकेतुः Dhrishtaketu? चेकितानः Chekitan...
...,...,...,...,...,...,...,...,...
696,BG18.74,18,74,सञ्जय उवाच |\nइत्यहं वासुदेवस्य पार्थस्य च महा...,sañjaya uvāca .\nityahaṃ vāsudevasya pārthasya...,।।18.74।। संजय ने कहा -- इस प्रकार मैंने भगवान...,18.74 Sanjaya said Thus I have heard this won...,18.74 इति thus? अहम् I? वासुदेवस्य of Krishna?...
697,BG18.75,18,75,व्यासप्रसादाच्छ्रुतवानेतद्गुह्यमहं परम् |\nयोग...,vyāsaprasādācchrutavānetadguhyamahaṃ param .\n...,।।18.75।। व्यास जी की कृपा से मैंने इस परम् गु...,18.75 Through the grace of Vyasa I have heard ...,18.75 व्यासप्रसादात् through the grace of Vyas...
698,BG18.76,18,76,राजन्संस्मृत्य संस्मृत्य संवादमिममद्भुतम् |\nक...,rājansaṃsmṛtya saṃsmṛtya saṃvādamimamadbhutam ...,।।18.76।। हे राजन् ! भगवान् केशव और अर्जुन के ...,"18.76 O King, remembering this wonderful and h...",18.76 राजन् O King? संस्मृत्य having remembere...
699,BG18.77,18,77,तच्च संस्मृत्य संस्मृत्य रूपमत्यद्भुतं हरेः |\...,tacca saṃsmṛtya saṃsmṛtya rūpamatyadbhutaṃ har...,।।18.77।। हे राजन ! श्री हरि के अति अद्भुत रूप...,"18.77 And, remembering again and again, also t...",18.77 तत् that? च and? संस्मृत्य having rememb...


## Tokenization and Preparing Data for Model Training

In [2]:
import pandas as pd
from datasets import Dataset
from transformers import GPT2Tokenizer

def load_and_process_data():
    # Load the Bhagwad Gita dataset from CSV
    print("Loading and processing initial data...")
    df = pd.read_csv('/kaggle/input/bhagwad-gita-dataset/Bhagwad_Gita.csv')
    print("Initial data processing completed. First few rows:")
    print(df.head())
    return df

def load_tokenizer():
    # Initialize GPT-2 tokenizer and set pad token
    print("Loading GPT-2 tokenizer...")
    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    tokenizer.pad_token = tokenizer.eos_token  # GPT-2 doesn't have a pad token by default
    return tokenizer

def prepare_gpt2_examples(examples, tokenizer):
    print("Preparing GPT-2 examples...")
    input_ids_list = []
    attention_mask_list = []
    labels_list = []

    for i in range(len(examples['Shloka'])):
        # Create prompt with the Shloka (Sanskrit verse)
        prompt = f"Shloka: {examples['Shloka'][i]}\nTransliteration: {examples['Transliteration'][i]}\nTranslation: "

        # Choose the translation based on language
        english_translation = examples['EngMeaning'][i]
        hindi_translation = examples['HinMeaning'][i]
        word_meaning = examples['WordMeaning'][i]

        # You can decide how to tokenize these translations.
        # Let's take English and Word meaning as the response.
        response = f"{english_translation}\nWordMeaning: {word_meaning}"

        # Tokenize both the prompt and response
        encoded = tokenizer(prompt + response, truncation=True, max_length=512, padding="max_length")

        # Prepare labels by copying input_ids and masking the padding tokens
        labels = encoded['input_ids'].copy()
        labels = [-100 if token == tokenizer.pad_token_id else token for token in labels]

        input_ids_list.append(encoded['input_ids'])
        attention_mask_list.append(encoded['attention_mask'])
        labels_list.append(labels)

    return {
        'input_ids': input_ids_list,
        'attention_mask': attention_mask_list,
        'labels': labels_list
    }

def process_dataset(df, tokenizer):
    print("Converting DataFrame to Hugging Face Dataset...")
    dataset = Dataset.from_pandas(df)

    print("Processing dataset...")
    processed_dataset = dataset.map(
        lambda examples: prepare_gpt2_examples(examples, tokenizer),
        batched=True,
        remove_columns=dataset.column_names
    )

    print(f"Processed dataset size: {len(processed_dataset)}")
    print("Sample processed data:")
    print(processed_dataset[0])

    return processed_dataset

def save_dataset(processed_dataset):
    print("Saving processed dataset to disk...")
    processed_dataset.save_to_disk("processed_bhagavadgita_gpt2_dataset")
    print("Processed dataset saved to disk.")

def main():
    print("Starting data processing pipeline...")
    df = load_and_process_data()
    tokenizer = load_tokenizer()
    processed_dataset = process_dataset(df, tokenizer)
    save_dataset(processed_dataset)
    print("Data processing pipeline completed.")

if __name__ == "__main__":
    main()


Starting data processing pipeline...
Loading and processing initial data...
Initial data processing completed. First few rows:
      ID  Chapter  Verse                                             Shloka  \
0  BG1.1        1      1  धृतराष्ट्र उवाच |\nधर्मक्षेत्रे कुरुक्षेत्रे स...   
1  BG1.2        1      2  सञ्जय उवाच |\nदृष्ट्वा तु पाण्डवानीकं व्यूढं द...   
2  BG1.3        1      3  पश्यैतां पाण्डुपुत्राणामाचार्य महतीं चमूम् |\n...   
3  BG1.4        1      4  अत्र शूरा महेष्वासा भीमार्जुनसमा युधि |\nयुयुध...   
4  BG1.5        1      5  धृष्टकेतुश्चेकितानः काशिराजश्च वीर्यवान् |\nपु...   

                                     Transliteration  \
0  dhṛtarāṣṭra uvāca .\ndharmakṣetre kurukṣetre s...   
1  sañjaya uvāca .\ndṛṣṭvā tu pāṇḍavānīkaṃ vyūḍha...   
2  paśyaitāṃ pāṇḍuputrāṇāmācārya mahatīṃ camūm .\...   
3  atra śūrā maheṣvāsā bhīmārjunasamā yudhi .\nyu...   
4  dhṛṣṭaketuścekitānaḥ kāśirājaśca vīryavān .\np...   

                                          HinMeaning  \
0  ।।

tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]



Converting DataFrame to Hugging Face Dataset...
Processing dataset...


Map:   0%|          | 0/701 [00:00<?, ? examples/s]

Preparing GPT-2 examples...
Processed dataset size: 701
Sample processed data:
{'input_ids': [2484, 75, 17411, 25, 28225, 100, 24231, 225, 11976, 97, 11976, 108, 48077, 11976, 115, 24231, 235, 11976, 253, 24231, 235, 11976, 108, 28225, 231, 11976, 113, 48077, 11976, 248, 930, 198, 11976, 100, 11976, 108, 24231, 235, 11976, 106, 11976, 243, 24231, 235, 11976, 115, 24231, 229, 11976, 97, 24231, 235, 11976, 108, 24231, 229, 28225, 243, 24231, 223, 11976, 108, 24231, 223, 11976, 243, 24231, 235, 11976, 115, 24231, 229, 11976, 97, 24231, 235, 11976, 108, 24231, 229, 28225, 116, 11976, 106, 11976, 113, 24231, 229, 11976, 97, 48077, 28225, 107, 24231, 223, 11976, 107, 24231, 223, 11976, 97, 24231, 235, 11976, 116, 11976, 113, 11976, 225, 930, 198, 11976, 106, 48077, 11976, 106, 11976, 243, 48077, 11976, 225, 28225, 103, 48077, 11976, 96, 24231, 235, 11976, 94, 11976, 113, 48077, 11976, 114, 24231, 235, 11976, 248, 24231, 230, 11976, 113, 28225, 243, 11976, 123, 11976, 106, 11976, 243, 24231, 

Saving the dataset (0/1 shards):   0%|          | 0/701 [00:00<?, ? examples/s]

Processed dataset saved to disk.
Data processing pipeline completed.


In [3]:
!ls

__notebook__.ipynb  processed_bhagavadgita_gpt2_dataset


## Verification of tokenization

In [4]:
import random
from datasets import load_from_disk
from transformers import GPT2Tokenizer

def display_sample(sample, tokenizer):
    print("Input IDs:", sample['input_ids'])
    print("Attention Mask:", sample['attention_mask'])
    print("Labels:", sample['labels'])

    # Decode the input IDs back into text
    decoded_input = tokenizer.decode(sample['input_ids'], skip_special_tokens=True)
    print("Decoded Input (Prompt + Response):", decoded_input)
    
    # Decode the labels back into text (ignore padding or masked tokens)
    decoded_labels = tokenizer.decode([label for label in sample['labels'] if label != -100], skip_special_tokens=True)
    print("Decoded Labels (Response):", decoded_labels)
    print("\n" + "-"*50 + "\n")

def main():
    # Load the processed dataset from disk
    print("Loading processed dataset...")
    dataset = load_from_disk("processed_bhagavadgita_gpt2_dataset")

    # Load the GPT-2 tokenizer
    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    tokenizer.pad_token = tokenizer.eos_token  # GPT-2 doesn't have a pad token by default

    # Display information about the dataset
    print(f"Dataset size: {len(dataset)}")
    print(f"Dataset features: {dataset.features}\n")

    # Display a few random samples from the dataset
    num_samples = 2  # You can change this to view more samples
    random_indices = random.sample(range(len(dataset)), num_samples)

    for idx in random_indices:
        sample = dataset[idx]
        display_sample(sample, tokenizer)

    print("Tokenization verification complete.")

if __name__ == "__main__":
    main()


Loading processed dataset...
Dataset size: 701
Dataset features: {'input_ids': Sequence(feature=Value(dtype='int32', id=None), length=-1, id=None), 'attention_mask': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None), 'labels': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None)}

Input IDs: [2484, 75, 17411, 25, 28225, 103, 11976, 114, 24231, 235, 11976, 107, 24231, 230, 11976, 97, 48077, 11976, 224, 28225, 103, 48077, 11976, 96, 24231, 235, 11976, 94, 24231, 223, 11976, 103, 24231, 223, 11976, 97, 24231, 235, 11976, 108, 48077, 11976, 96, 48077, 11976, 106, 48077, 11976, 248, 48077, 11976, 108, 24231, 235, 11976, 107, 28225, 106, 11976, 117, 11976, 97, 24231, 222, 11976, 224, 28225, 248, 11976, 106, 24231, 224, 11976, 106, 24231, 235, 930, 198, 11976, 113, 24231, 235, 11976, 107, 24231, 224, 11976, 95, 48077, 11976, 224, 28225, 99, 24231, 235, 11976, 108, 24231, 223, 11976, 103, 11976, 99, 11976, 103, 24231, 223, 11976, 97, 24231, 235, 11976, 108, 2423

In [5]:
import torch
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"Number of GPUs: {torch.cuda.device_count()}")

CUDA available: True
Number of GPUs: 2


In [6]:
import numpy as np
import matplotlib.pyplot as plt

def plot_training_loss(losses, output_file='training_loss_curve.png'):
    if not losses:
        print("No loss data available for plotting.")
        return

    # Convert losses to numpy array for easier manipulation
    losses = np.array(losses)

    # Remove any potential NaN or inf values
    losses = losses[np.isfinite(losses)]

    if len(losses) == 0:
        print("No valid loss data available for plotting after removing NaN/inf values.")
        return

    steps = range(1, len(losses) + 1)

    # Debug: print losses and steps for verification
    print(f"Number of losses: {len(losses)}")
    print(f"Losses: {losses}")
    
    plt.figure(figsize=(10, 5))
    plt.plot(steps, losses, label='Training Loss')
    plt.xlabel('Steps')
    plt.ylabel('Loss')
    plt.title('Training Loss Curve')
    plt.legend()

    # Set y-axis to logarithmic scale if the loss varies over several orders of magnitude
    if np.log10(losses.max()) - np.log10(losses.min()) > 2:
        plt.yscale('log')

    # Add grid for better readability
    plt.grid(True, which="both", ls="-", alpha=0.2)

    try:
        plt.savefig(output_file)
        print(f"Learning and loss curve has been plotted and saved to {output_file}")
    except Exception as e:
        print(f"Error saving the plot: {e}")
    finally:
        plt.close()


## Training the GPT-2 Model:  and Evaluation and Model Generation:

In [7]:
from transformers import GPT2LMHeadModel, TrainingArguments, Trainer, GPT2Tokenizer, EarlyStoppingCallback, TrainerCallback
from datasets import load_from_disk
import torch
import numpy as np
import matplotlib.pyplot as plt

# Define a callback class to log training loss
class LogCallback(TrainerCallback):
    def __init__(self):
        self.losses = []

    def on_log(self, args, state, control, logs=None, **kwargs):
        if logs.get("loss") is not None:
            self.losses.append(logs["loss"])

def main():
    # Load the processed dataset
    print("Loading processed dataset...")
    dataset = load_from_disk("processed_bhagavadgita_gpt2_dataset")

    # Split the dataset into training and validation sets
    dataset = dataset.train_test_split(test_size=0.2)

    # Initialize the model and tokenizer
    print("Initializing model and tokenizer...")
    model = GPT2LMHeadModel.from_pretrained("gpt2")
    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    tokenizer.pad_token = tokenizer.eos_token  # Set pad token to EOS for GPT-2
    
    # Define training arguments with memory optimization
    training_args = TrainingArguments(
        output_dir="./results",
        num_train_epochs=10,
        learning_rate=5e-5,
        per_device_train_batch_size=2,  
        per_device_eval_batch_size=2,  
        gradient_accumulation_steps=4,  # Simulate larger batch size via accumulation
        warmup_steps=500,
        weight_decay=0.01,
        logging_dir="./logs",
        logging_steps=200,  
        evaluation_strategy="steps",
        eval_steps=500,
        save_steps=1000,
        load_best_model_at_end=True,
        metric_for_best_model="eval_loss",
        greater_is_better=False,
        log_level="info", 
        fp16=True,  # Enable mixed precision to save memory
        save_total_limit=2,  # Limit number of saved checkpoints
        report_to="none",  
        lr_scheduler_type="cosine"
    )

    # Define a custom data collator
    def data_collator(features):
        # Extract the input_ids and attention_mask from the dataset
        input_ids = torch.tensor([f["input_ids"] for f in features])
        attention_mask = torch.tensor([f["attention_mask"] for f in features])
        labels = torch.tensor([f["labels"] for f in features])

        return {
            "input_ids": input_ids,
            "attention_mask": attention_mask,
            "labels": labels
        }

    # Initialize LogCallback to capture the loss during training
    log_callback = LogCallback()

    # Initialize Trainer
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=dataset["train"],
        eval_dataset=dataset["test"],
        data_collator=data_collator,
        callbacks=[EarlyStoppingCallback(early_stopping_patience=3), log_callback]  # Add custom log callback
    )

    # Clear CUDA cache before training to free up memory
    torch.cuda.empty_cache()

    # Train the model
    print("Training the model...")
    train_result = trainer.train()

    # Save the trained model
    print("Saving the trained model...")
    trainer.save_model("./bhagavadgita_gpt2_model")

    # After training
    print("Evaluating the model...")
    eval_results = trainer.evaluate()
    print(f"Evaluation results: {eval_results}")

    print("Running prediction...")
    # Custom prediction function to handle OOM errors
    def predict_in_batches(dataset, batch_size=8):
        all_predictions = []
        for i in range(0, len(dataset), batch_size):
            print(i)
            batch = dataset.select(range(i, min(i + batch_size, len(dataset))))
            with torch.no_grad():
                outputs = trainer.predict(batch)
            all_predictions.append(outputs.predictions)
            torch.cuda.empty_cache()  # Clear GPU memory after each batch
        return np.concatenate(all_predictions, axis=0)

    test_results = predict_in_batches(dataset["test"])
    print(f"Test results shape: {test_results.shape}")

    print("Training complete. Model saved to ./bhagavadgita_gpt2_model")
    #plot_training_loss(log_callback.losses)

    print("Learning and loss curves have been plotted and saved.")

if __name__ == "__main__":
    main()


Loading processed dataset...
Initializing model and tokenizer...


model.safetensors:   0%|          | 0.00/548M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

  self.scaler = torch.cuda.amp.GradScaler(**kwargs)
Using auto half precision backend


Training the model...


***** Running training *****
  Num examples = 560
  Num Epochs = 10
  Instantaneous batch size per device = 2
  Training with DataParallel so batch size has been adjusted to: 4
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 4
  Total optimization steps = 350
  Number of trainable parameters = 124,439,808
  with torch.cuda.device(device), torch.cuda.stream(stream), autocast(enabled=autocast_enabled):


Step,Training Loss,Validation Loss


Saving model checkpoint to ./results/checkpoint-350
Configuration saved in ./results/checkpoint-350/config.json
Configuration saved in ./results/checkpoint-350/generation_config.json
Model weights saved in ./results/checkpoint-350/model.safetensors


Training completed. Do not forget to share your model on huggingface.co/models =)


Saving model checkpoint to ./bhagavadgita_gpt2_model
Configuration saved in ./bhagavadgita_gpt2_model/config.json
Configuration saved in ./bhagavadgita_gpt2_model/generation_config.json


Saving the trained model...


Model weights saved in ./bhagavadgita_gpt2_model/model.safetensors

***** Running Evaluation *****
  Num examples = 141
  Batch size = 4


Evaluating the model...


  with torch.cuda.device(device), torch.cuda.stream(stream), autocast(enabled=autocast_enabled):



***** Running Prediction *****
  Num examples = 8
  Batch size = 4


Evaluation results: {'eval_loss': 1.555938482284546, 'eval_runtime': 6.1291, 'eval_samples_per_second': 23.005, 'eval_steps_per_second': 5.874, 'epoch': 10.0}
Running prediction...
0



***** Running Prediction *****
  Num examples = 8
  Batch size = 4


8



***** Running Prediction *****
  Num examples = 8
  Batch size = 4


16



***** Running Prediction *****
  Num examples = 8
  Batch size = 4


24



***** Running Prediction *****
  Num examples = 8
  Batch size = 4


32



***** Running Prediction *****
  Num examples = 8
  Batch size = 4


40



***** Running Prediction *****
  Num examples = 8
  Batch size = 4


48



***** Running Prediction *****
  Num examples = 8
  Batch size = 4


56



***** Running Prediction *****
  Num examples = 8
  Batch size = 4


64



***** Running Prediction *****
  Num examples = 8
  Batch size = 4


72



***** Running Prediction *****
  Num examples = 8
  Batch size = 4


80



***** Running Prediction *****
  Num examples = 8
  Batch size = 4


88



***** Running Prediction *****
  Num examples = 8
  Batch size = 4


96



***** Running Prediction *****
  Num examples = 8
  Batch size = 4


104



***** Running Prediction *****
  Num examples = 8
  Batch size = 4


112



***** Running Prediction *****
  Num examples = 8
  Batch size = 4


120



***** Running Prediction *****
  Num examples = 8
  Batch size = 4


128



***** Running Prediction *****
  Num examples = 5
  Batch size = 4


136


Test results shape: (141, 512, 50257)
Training complete. Model saved to ./bhagavadgita_gpt2_model
Learning and loss curves have been plotted and saved.


In [8]:
#! rm training_loss_curve.png
!ls

  pid, fd = os.forkpty()


__notebook__.ipynb	 processed_bhagavadgita_gpt2_dataset
bhagavadgita_gpt2_model  results


In [9]:
'''
import matplotlib.pyplot as plt
import matplotlib.image as mpimg

# Load and display the saved plot image
img = mpimg.imread('training_loss_curve.png')
plt.figure(figsize=(10, 5))
plt.imshow(img)
plt.axis('off')  # Hide axis labels for a cleaner display
plt.show()
'''

"\nimport matplotlib.pyplot as plt\nimport matplotlib.image as mpimg\n\n# Load and display the saved plot image\nimg = mpimg.imread('training_loss_curve.png')\nplt.figure(figsize=(10, 5))\nplt.imshow(img)\nplt.axis('off')  # Hide axis labels for a cleaner display\nplt.show()\n"

## Fine-tuned Model testing

In [10]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch

def evaluate_model():
    # Define some default queries based on the Bhagavad Gita
    queries = [
        "Explain the meaning of Chapter 1, Verse 1 of the Bhagavad Gita.",
        "What does Lord Krishna say in Chapter 2, Verse 47?",
        "What is the significance of Chapter 4, Verse 7 in the Bhagavad Gita?",
        "What lesson is taught in Chapter 3, Verse 16?",
        "Describe the teachings of Chapter 10, Verse 20.",
        "What does Arjuna ask Krishna in Chapter 11, Verse 32?",
        "How does Krishna explain the cycle of life and death in Chapter 8, Verse 6?",
        "Explain the concept of yoga as mentioned in Chapter 5, Verse 27.",
        "What is the essence of the Bhagavad Gita?",
        "How does the Bhagavad Gita define karma?",
        "What are the three gunas in the Bhagavad Gita?",
        "What is Lord Krishna's advice to Arjuna regarding action and inaction?",
        "What does the Gita say about the nature of the soul?",
        "How can one attain peace according to the teachings of the Bhagavad Gita?",
        "What is the importance of devotion in the Bhagavad Gita?"
    ]

    # Load the pre-trained GPT-2 model and tokenizer (non-fine-tuned)
    print("Loading the pre-trained GPT-2 model...")
    raw_model = GPT2LMHeadModel.from_pretrained("gpt2")
    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    
    # Load the fine-tuned model
    print("Loading the fine-tuned model...")
    fine_tuned_model = GPT2LMHeadModel.from_pretrained("./bhagavadgita_gpt2_model")
    
    # Set the pad token for both models
    tokenizer.pad_token = tokenizer.eos_token

    # Ensure both models are on the GPU if available
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    raw_model.to(device)
    fine_tuned_model.to(device)

    # Set both models to evaluation mode
    raw_model.eval()
    fine_tuned_model.eval()

    # Iterate through each query and get the model's generated answer
    for idx, question in enumerate(queries):
        print(f"Query {idx+1}: {question}")

        # Tokenize the input question
        inputs = tokenizer.encode(question, return_tensors="pt").to(device)

        # Generate the answer using the raw (non-fine-tuned) model
        raw_outputs = raw_model.generate(
            inputs,
            max_length=100,  # Maximum length of the generated answer
            num_return_sequences=1,  # Number of answers to generate
            no_repeat_ngram_size=2,  # Avoid repetition
            top_p=0.95,  # Top-p sampling
            temperature=0.1,  # Adjust the creativity of the generated text
            pad_token_id=tokenizer.eos_token_id
        )

        # Decode the raw model output
        raw_answer = tokenizer.decode(raw_outputs[0], skip_special_tokens=True)

        # Generate the answer using the fine-tuned model
        fine_tuned_outputs = fine_tuned_model.generate(
            inputs,
            max_length=100,  # Maximum length of the generated answer
            num_return_sequences=1,  # Number of answers to generate
            no_repeat_ngram_size=2,  # Avoid repetition
            top_p=0.95,  # Top-p sampling
            temperature=0.1,  # Adjust the creativity of the generated text
            pad_token_id=tokenizer.eos_token_id
        )

        # Decode the fine-tuned model output
        fine_tuned_answer = tokenizer.decode(fine_tuned_outputs[0], skip_special_tokens=True)

        # Print the comparison between raw and fine-tuned model answers
        
        print("Raw GPT-2 Answer (non-fine-tuned):")
        print(raw_answer)
        print("--"* 50)
        print("\nFine-Tuned GPT-2 Answer:")
        print(fine_tuned_answer)
        print("==" * 50)

if __name__ == "__main__":
    evaluate_model()


Loading the pre-trained GPT-2 model...


loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--gpt2/snapshots/607a30d783dfa663caf39e06633721c8d4cfcd7e/config.json
Model config GPT2Config {
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_inner": null,
  "n_layer": 12,
  "n_positions": 1024,
  "reorder_and_upcast_attn": false,
  "resid_pdrop": 0.1,
  "scale_attn_by_inverse_layer_idx": false,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 50
    }
  },
  "transformers_version": "4.44.0",
  "use_cach

Loading the fine-tuned model...


Generate config GenerationConfig {
  "bos_token_id": 50256,
  "eos_token_id": 50256
}

All model checkpoint weights were used when initializing GPT2LMHeadModel.

All the weights of GPT2LMHeadModel were initialized from the model checkpoint at ./bhagavadgita_gpt2_model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use GPT2LMHeadModel for predictions without further training.
loading configuration file ./bhagavadgita_gpt2_model/generation_config.json
Generate config GenerationConfig {
  "bos_token_id": 50256,
  "eos_token_id": 50256
}

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


Query 1: Explain the meaning of Chapter 1, Verse 1 of the Bhagavad Gita.
Raw GPT-2 Answer (non-fine-tuned):
Explain the meaning of Chapter 1, Verse 1 of the Bhagavad Gita.

The Bhakti-Gita is the most important of all the Givin-gita texts. It is a great source of insight into the nature of life and the way of living. The Bhavat-Bhagavan-Sriya-Vipassana-Dharma-Prakash-Krishna-Rishikesh-Nirvana
----------------------------------------------------------------------------------------------------

Fine-Tuned GPT-2 Answer:
Explain the meaning of Chapter 1, Verse 1 of the Bhagavad Gita.
निद्रवाणो न संतेषु पृशीसू कः तमैकौऽयजॉगल२४भपा |

Query 2: What does Lord Krishna say in Chapter 2, Verse 47?
Raw GPT-2 Answer (non-fine-tuned):
What does Lord Krishna say in Chapter 2, Verse 47?

"I will not give you any more than you have given me. I will give to you what you give me, and I shall give it to others.
...
,
 (1) I am the Lord, the God of the living God, who created the world, created all things,

## END 