# GPT (Graded)

Welcome to your GPT (required) programming assignment! You will build a GPT model that generates **Shakespearean-like text**. You will use the [Tiny Shakespeare](https://huggingface.co/datasets/karpathy/tiny_shakespeare) dataset, which contains 40,000 lines from a variety of Shakespeare's plays.

Imagine you have a student (the GPT-2 model) who is already quite good at writing in general. You want to teach this student to write like Shakespeare (fine-tuning). So, you provide the student with a lot of Shakespeare's work (the dataset) and have them practice writing in a similar style (training). Once the student has learned enough, you give them a starting phrase (the prompt) and ask them to write a continuation in Shakespearean style (inference). The student uses their knowledge gained during training to create new text that resembles Shakespeare's writing.


**Instructions:**
* Do not modify any of the provided code.
* Only write code where prompted. For example, some sections contain markers such as the following:
```
# YOUR CODE GOES HERE
# YOUR CODE STARTS HERE
# TODO
```
Only modify those sections of the code.

**You will learn to:**
* Explore the [Tiny Shakespeare](https://huggingface.co/datasets/karpathy/tiny_shakespeare) dataset
* Implement a GPT-2 model for text generation
* Fine-tune pre-trained models on specific datasets
* Work with the Transformers library and TensorFlow
* Manage model parameters and hyperparameters
* Balance model performance against computational resources

# Data Preparation


In this section, you will transform raw text data into a numerical format suitable for training a GPT-2 model using TensorFlow. This involves loading the dataset, tokenizing the text, and creating a TensorFlow dataset that can be fed into the model during training.

You will perform the following steps:

1. **Loading the Dataset:**

  * Begin by using the `load_dataset` function from the `datasets` library to load the "tiny_shakespeare" dataset.
  * Extract the training texts and store them in the `train_texts` attribute.
2. **Loading and Configuring Tokenizer:**

  * Load the "GPT-2" tokenizer using `GPT2Tokenizer.from_pretrained("gpt2")`.
  * Configure the tokenizer by setting the padding token to the end-of-sequence token (`eos_token`). GPT-2 does not define a padding token by default, so this lets the tokenizer pad shorter sequences, and padding is treated as the end of a sequence during training.
3. **Tokenizing the Texts:**

  * Tokenize the training texts using the loaded tokenizer.
  * Use the `truncation`, `padding`, `max_length`, and `return_tensors` parameters to control the tokenization process. Tokenization converts the text into numerical representations that the model can understand.
4. **Creating TensorFlow Dataset:**

  * Create a TensorFlow dataset from the tokenized input IDs, the attention mask, and the input IDs again (the second copy serves as the labels, since the model learns to predict the text itself). This dataset will be used for training the model.
5. **Preparing Final Training Dataset:**

  * Shuffle the dataset using `shuffle(1000)` to randomize the order of training examples.
  * Then create batches using `batch(self.batch_size)` to divide the data into smaller groups for efficient training.
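
To build intuition for steps 3 to 5, here is a minimal pure-Python sketch of truncation, padding, attention masks, shuffling, and batching. It is not the assignment solution and does not use `GPT2Tokenizer` or `tf.data`; the helper names `pad_or_truncate` and `shuffle_and_batch`, the `PAD_ID` value, and the token IDs are all hypothetical. It also shuffles the whole list at once, whereas `shuffle(1000)` shuffles through a buffer of 1000 elements.

```python
import random

def pad_or_truncate(token_ids, max_length, pad_id):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = token_ids[:max_length]                    # truncation
    n_pad = max_length - len(ids)
    attention_mask = [1] * len(ids) + [0] * n_pad   # 1 = real token, 0 = padding
    input_ids = ids + [pad_id] * n_pad              # right-padding (a choice made for this sketch)
    return input_ids, attention_mask

def shuffle_and_batch(examples, batch_size, seed=42):
    """Shuffle a copy of the examples, then split it into consecutive batches."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    return [shuffled[i:i + batch_size] for i in range(0, len(shuffled), batch_size)]

PAD_ID = 0  # hypothetical pad id, chosen only for this toy example
sequences = [[5, 6, 7], [8, 9, 10, 11, 12, 13], [14]]  # made-up token ids

encoded = [pad_or_truncate(s, max_length=4, pad_id=PAD_ID) for s in sequences]
batches = shuffle_and_batch(encoded, batch_size=2)

for input_ids, attention_mask in encoded:
    print(input_ids, attention_mask)
```

Every sequence ends up with the same length, which is what allows examples to be stacked into a batch; the attention mask records which positions are real tokens so that padding can be ignored.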




In [None]:
#TODO

from tests import *
from helpers import *

import tensorflow as tf
from transformers import GPT2Tokenizer, TFGPT2LMHeadModel
from datasets import load_dataset
import numpy as np

class ShakespeareGenerator:
    def __init__(self, seed=42, max_length=128, batch_size=8):
        self.max_length = max_length
        self.batch_size = batch_size
        tf.random.set_seed(seed)


    def prepare_data(self):
        """Data preparation phase: Load and process the dataset"""
        # TODO: 1. Load the dataset
        dataset = 
        self.train_texts = 

        # TODO: 2. Load and configure tokenizer
        self.tokenizer = 
        self.tokenizer.pad_token = 

        # TODO: 3. Tokenize the texts
        self.train_encodings = self.tokenizer(
            
        )

        # TODO: 4. Create TensorFlow dataset
        self.train_dataset = tf.data.Dataset.from_tensor_slices((

        ))

        # TODO: 5. Prepare final training dataset
        self.train_dataset = 

        # DO NOT MODIFY THIS CODE
        validate_shakespeare_generator(self)

        return self.train_dataset, self.tokenizer

# Modelling


In this section, you will load `TFGPT2LMHeadModel`, a pre-trained GPT-2 model from the `transformers` library, and fine-tune it on the Shakespeare dataset you prepared in the previous section.

You will perform the following steps:

1. **Model Setup:**

  * Load the pre-trained GPT-2 model using `TFGPT2LMHeadModel.from_pretrained("gpt2")`.
  * Configure the model for training using an Adam optimizer and `SparseCategoricalCrossentropy` loss function.
  * Compile the model with the chosen optimizer and loss function. This prepares the model for training.
2. **Training:**

  * Call the `model.fit` method with the training dataset and specified number of epochs. This starts the fine-tuning process, where the model's parameters are adjusted to better fit the Shakespeare dataset.
  * The training history is stored and can be used to monitor the model's performance during training.
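
To illustrate what the `SparseCategoricalCrossentropy` loss measures, here is a minimal pure-Python sketch for a single prediction. It assumes the loss is applied to raw logits (unnormalized scores), which is what a language-model head typically outputs; the function name and the example logits are hypothetical, and this is not the Keras implementation.

```python
import math

def sparse_categorical_crossentropy(logits, target_id):
    """Cross-entropy for one position: -log(softmax(logits)[target_id]).

    "Sparse" means the target is an integer class id rather than a one-hot vector.
    """
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target_id]

# A toy vocabulary of 4 tokens; the correct next token has id 2.
uniform_loss = sparse_categorical_crossentropy([0.0, 0.0, 0.0, 0.0], target_id=2)
confident_loss = sparse_categorical_crossentropy([0.0, 0.0, 10.0, 0.0], target_id=2)
wrong_loss = sparse_categorical_crossentropy([10.0, 0.0, 0.0, 0.0], target_id=2)

print(uniform_loss, confident_loss, wrong_loss)
```

The loss is small when the model assigns a high score to the correct next token and large when it confidently prefers a wrong one. Fine-tuning adjusts the model's parameters to reduce this value averaged over the training tokens.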


In [None]:
# TODO

def setup_model(generator, learning_rate=3e-5):
    """Model setup phase: Initialize and configure the model"""
    # TODO: 1. Load pre-trained model
    model = 

    # TODO: Configure the model for training
    optimizer = 
    loss = 
    model.compile(
        optimizer=optimizer,
        loss=[loss, *[None] * model.config.n_layer]
    )

    # DO NOT MODIFY THIS CODE
    check_model_setup(generator, model)

    return model

def train(model, train_dataset, epochs=3):
    # TODO: Fit the model
    history = 
    
    return history

# Inference

In this section, you will use the fine-tuned GPT-2 model to generate Shakespearean-like text from a given prompt.

Here are the steps to complete the inference code:


1. Take a `prompt` (the starting text) and `max_length` (the maximum length of the generated text) as input.
2. Use the `tokenizer` to encode the `prompt` into a numerical format that the model can understand.
3. Call the `model.generate` method to generate text based on the encoded `prompt` and other parameters such as `max_length`, `num_return_sequences`, and `no_repeat_ngram_size`.
4. Decode the generated output back into text using the `tokenizer`.
5. Return the generated text.
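
To show the idea behind `no_repeat_ngram_size`, here is a minimal pure-Python sketch of greedy decoding with n-gram blocking. It is a simplified illustration, not the `transformers` implementation: `toy_scores` is a hypothetical stand-in for the model, the vocabulary has only three token IDs, and in this sketch `max_length` counts the prompt tokens as well.

```python
def banned_next_tokens(generated, ngram_size):
    """Tokens that would complete an n-gram already present in `generated`."""
    if ngram_size < 1 or len(generated) < ngram_size - 1:
        return set()
    # The last (n - 1) tokens form the prefix of the n-gram about to be completed.
    prefix = tuple(generated[len(generated) - (ngram_size - 1):])
    banned = set()
    for i in range(len(generated) - ngram_size + 1):
        if tuple(generated[i:i + ngram_size - 1]) == prefix:
            banned.add(generated[i + ngram_size - 1])
    return banned

def greedy_generate(prompt_ids, score_fn, max_length, ngram_size):
    """Repeatedly append the highest-scoring token that is not banned."""
    generated = list(prompt_ids)
    while len(generated) < max_length:
        scores = score_fn(generated)  # one score per vocabulary id
        banned = banned_next_tokens(generated, ngram_size)
        candidates = [t for t in range(len(scores)) if t not in banned]
        if not candidates:
            break
        generated.append(max(candidates, key=lambda t: scores[t]))
    return generated

def toy_scores(generated):
    """Hypothetical model: always prefers token 1, then token 2, then token 0."""
    return [0.1, 0.9, 0.5]

unconstrained = greedy_generate([1], toy_scores, max_length=6, ngram_size=0)
constrained = greedy_generate([1], toy_scores, max_length=6, ngram_size=2)

print(unconstrained)
print(constrained)
```

Without the constraint, the toy model repeats its favorite token indefinitely. With `ngram_size=2`, any token that would recreate an already-seen pair is skipped, so the output is forced to vary.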



In [None]:
# TODO

def generate(prompt, model, tokenizer, max_length=100):
    """Inference phase: Generate text from prompt"""
    input_ids = 
    output = 
    
    return tokenizer.decode(output[0], skip_special_tokens=True)

# Driver code

In [None]:

def main(prompt):
    # Initialize the data generator
    shakespeare_gen = ShakespeareGenerator()

    # Data preparation
    print("Preparing data...")
    train_dataset, tokenizer = shakespeare_gen.prepare_data()

    validate_shakespeare_generator(shakespeare_gen)

    # Model setup
    print("Setting up model...")
    model = setup_model(shakespeare_gen)

    # Training
    print("Training model...")
    history = train(model, train_dataset, epochs=1)

    # Inference
    print("\nGenerating text from prompt:", prompt)
    generated_text = generate(prompt, model, tokenizer)
    print(f"\nGenerated text:\n{generated_text}")


if __name__ == '__main__':
    main(prompt="To be or not to be")