## Contents

1. Importing necessary libraries
2. Authenticating with Hugging Face Hub
3. Loading the SQuAD dataset
4. Loading the DistilGPT-2 tokenizer
5. Preprocessing the dataset
6. Tokenizing the dataset using the tokenizer
7. Grouping Tokenized Text
8. Get train and evaluation datasets
9. Fine-tuning the model
10. Evaluating the fine-tuned model
11. Push model to Hugging Face Hub

---

<a id="c1"></a> <br>
### 1) Importing necessary libraries

In [None]:
from huggingface_hub import notebook_login
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments, Trainer
import math

<a id="c2"></a> <br>
### 2) Authenticating with Hugging Face Hub
Logging in gives you access to private repositories and lets you **push**, **pull**, and **manage models** on the *Hugging Face Hub* directly from your notebook.

In [None]:
notebook_login()


<a id="c3"></a> <br>
### 3) Loading the SQuAD dataset

In [None]:
dataset = load_dataset("squad")
dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10570
    })
})

<a id="c4"></a> <br>
### 4) Loading the DistilGPT-2 tokenizer

In [None]:
model_checkpoint = 'distilgpt2'
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)

special_tokens = tokenizer.special_tokens_map
print(special_tokens)

{'bos_token': '<|endoftext|>', 'eos_token': '<|endoftext|>', 'unk_token': '<|endoftext|>'}


<a id="c5"></a> <br>
### 5) Preprocessing the dataset
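Since we use the `distilgpt2` tokenizer, each question should end with that tokenizer's end-of-text token (`<|endoftext|>`), so that question boundaries stay visible once the text is concatenated into blocks. We first drop the columns we do not need, then append the token to every question with the `map` function.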

In [None]:
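def add_end_token_to_question(input_dict):
    # For GPT-2 tokenizers bos_token and eos_token are the same string,
    # but eos_token is the one that marks the end of a sequence.
    input_dict['question'] += special_tokens['eos_token']
    return input_dict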

dataset = dataset.remove_columns(['id', 'title', 'context', 'answers'])
dataset = dataset.map(add_end_token_to_question)
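
To see why the end-of-text marker matters, here is a minimal sketch that uses plain strings in place of token IDs. The two sample questions and the `EOS` constant are illustrative assumptions; in the notebook the marker comes from `special_tokens`.

```python
# Illustrative only: plain strings stand in for tokenized text.
EOS = "<|endoftext|>"  # same string the tokenizer reports as its eos_token

# Made-up sample questions, not taken from SQuAD.
questions = ["Who wrote Hamlet?", "What is the capital of France?"]

# Append the marker to every question, as the map() call does per example.
marked = [q + EOS for q in questions]

# After concatenation (see step 7) the marker is the only boundary left.
stream = "".join(marked)
recovered = [q for q in stream.split(EOS) if q]

print(stream)
print(recovered == questions)
```

Without the marker, the concatenated stream would give the model no signal for where one question stops and the next begins.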

<a id="c6"></a> <br>
### 6) Tokenizing the dataset using the tokenizer

In [None]:
def tokenize_function(input_dict):
    return tokenizer(input_dict['question'], truncation=True)


tokenized_dataset = dataset.map(
    tokenize_function,
    batched=True,
    num_proc=4,
    remove_columns=['question'],
)
tokenized_dataset

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 10570
    })
})

<a id="c7"></a> <br>
### 7) Grouping Tokenized Text

Grouping concatenates the tokenized questions into one long sequence and splits it into fixed-length blocks of `max_block_length` tokens. Every block then has the same size, so batches need no padding, which is the usual way to prepare data for causal language modeling and text generation. Tokens left over at the end of each mapped batch that do not fill a whole block are dropped.

In [None]:
max_block_length = 128

def divide_tokenized_text(tokenized_text_dict, block_size):
    """
    Divides the tokenized text in the examples into fixed-length blocks of size block_size.

    Parameters:
    -----------
    tokenized_text_dict: dict
        A dictionary containing tokenized text as values for different keys.

    block_size: int
        The desired length of each tokenized block.

    Returns:
    -----------
        dict: A dictionary with tokenized text divided into fixed-length blocks.
    """
    concatenated_examples = {k: sum(tokenized_text_dict[k], []) for k in tokenized_text_dict.keys()}
    total_length = len(concatenated_examples[list(tokenized_text_dict.keys())[0]])
    total_length = (total_length // block_size) * block_size

    result = {
        k: [t[i: i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }

    result['labels'] = result['input_ids'].copy()
    return result


lm_dataset = tokenized_dataset.map(
    lambda tokenized_text_dict: divide_tokenized_text(tokenized_text_dict, max_block_length),
    batched=True,
    batch_size=1000,
    num_proc=4,
)
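
The effect of the grouping step is easiest to see on a toy batch. The sketch below re-implements the same idea in plain Python; the helper name `group_into_blocks` and the toy token IDs are hypothetical and chosen only for illustration.

```python
def group_into_blocks(batch, block_size):
    """Concatenate every column, then cut it into blocks of block_size."""
    # Flatten each column's list of sequences into a single list.
    flat = {key: [tok for seq in seqs for tok in seq] for key, seqs in batch.items()}
    # Keep only as many tokens as fill whole blocks; the tail is dropped.
    usable = (len(flat["input_ids"]) // block_size) * block_size
    blocks = {
        key: [toks[i:i + block_size] for i in range(0, usable, block_size)]
        for key, toks in flat.items()
    }
    # For causal language modeling the labels are a copy of the inputs.
    blocks["labels"] = [list(block) for block in blocks["input_ids"]]
    return blocks


# Three "questions" with 3, 2, and 5 tokens: 10 tokens in total.
toy_batch = {
    "input_ids": [[1, 2, 3], [4, 5], [6, 7, 8, 9, 10]],
    "attention_mask": [[1, 1, 1], [1, 1], [1, 1, 1, 1, 1]],
}
grouped = group_into_blocks(toy_batch, block_size=4)
print(grouped["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

With a block size of 4, the ten tokens yield two full blocks and the last two tokens (9 and 10) are discarded. Note also that a block can span the boundary between two questions.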

<a id="c8"></a> <br>
### 8) Get train and evaluation datasets

In [None]:
train_dataset = lm_dataset['train'].shuffle(seed=42).select(range(100))
eval_dataset = lm_dataset['validation'].shuffle(seed=42).select(range(100))

<a id="c9"></a> <br>
### 9) Fine-tuning the model

Training is controlled by `TrainingArguments`, where we set hyperparameters such as the learning rate and weight decay. The model is trained with a causal language-modeling objective on the SQuAD question text, split into training and evaluation sets (`train_dataset` and `eval_dataset`). Because the `context` and `answers` columns were removed in step 5, training optimizes the model to predict the next token of a question, so it learns to generate SQuAD-style questions rather than to answer them.

We also add a `[PAD]` token to the tokenizer, since GPT-2 tokenizers do not define a padding token by default.

Running this section yields a DistilGPT-2 model fine-tuned on **SQuAD** questions.

In [None]:
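model = AutoModelForCausalLM.from_pretrained(model_checkpoint)

# GPT-2 tokenizers have no padding token, so register one and resize the
# embedding matrix so that the new token ID has a matching row.
tokenizer.add_special_tokens({'pad_token': '[PAD]'})
model.resize_token_embeddings(len(tokenizer))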


training_args = TrainingArguments(
    f'./{model_checkpoint}-squad',
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    push_to_hub=False, # Change to True to push the model to the Hub
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
)

# Launch fine-tuning; without this call the model keeps its pretrained weights.
trainer.train()
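
As a rough guide to the size of a training run with these arguments, the number of optimizer steps can be estimated by hand. This is a sketch under stated assumptions: the batch size of 8, the 3 epochs, and the linear decay without warmup are library defaults that the notebook does not set explicitly, and a single device without gradient accumulation is assumed.

```python
import math

num_examples = 100  # train_dataset size after select(range(100))
batch_size = 8      # assumed default per_device_train_batch_size
num_epochs = 3      # assumed default num_train_epochs
base_lr = 2e-5      # learning_rate passed to TrainingArguments

# The last batch of each epoch may be smaller, hence the ceiling.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs


def linear_lr(step):
    """Learning rate after `step` updates, decaying linearly to zero (no warmup)."""
    return base_lr * max(0.0, 1 - step / total_steps)


print(steps_per_epoch, total_steps)  # 13 39
print(linear_lr(0), linear_lr(total_steps))
```

With only 39 updates at a peak learning rate of 2e-5, the weights move very little, so a run of this size is better seen as a smoke test of the pipeline than as a full fine-tune.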

<a id="c10"></a> <br>
### 10) Evaluating the fine-tuned model

In [None]:
eval_results = trainer.evaluate()
print(f'Perplexity: {math.exp(eval_results["eval_loss"]):.2f}')

You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Perplexity: 159.82
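
Perplexity is the exponential of the average per-token cross-entropy loss, which is why the cell above applies `math.exp` to `eval_loss`. The sketch below illustrates the relationship with made-up log-probabilities; the `perplexity` helper is hypothetical and not part of any library.

```python
import math


def perplexity(token_log_probs):
    """Exponential of the mean negative log-likelihood per token."""
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# A model that assigns probability 1/4 to every observed token
# is as uncertain as a uniform choice among 4 options.
uniform = [math.log(1 / 4)] * 10
print(round(perplexity(uniform), 6))  # 4.0

# Going the other way: the loss that corresponds to the reported perplexity.
print(round(math.log(159.82), 3))
```

Read this way, a perplexity of 159.82 means the model is, on average, about as uncertain as if it were choosing uniformly among roughly 160 tokens at each position.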


<a id="c11"></a> <br>
### 11) Push model to Hugging Face Hub

In [None]:
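# Save the tokenizer locally, then upload both the model and the tokenizer
# so the Hub repository can be loaded on its own.
tokenizer.save_pretrained('gpt2-squad')
model.push_to_hub('gpt2-squad')
tokenizer.push_to_hub('gpt2-squad')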