# Instruction Fine-Tuning for Spanish Newspaper Article Summarization Using FLAN-T5-Small

In this notebook, we explore **instruction fine-tuning of the FLAN-T5-Small language model to improve its ability to summarize Spanish newspaper articles**. FLAN-T5-Small is the smallest instruction-tuned variant of the T5 family and is designed to handle a variety of natural language processing tasks. Through instruction fine-tuning, we adapt this pre-trained model specifically to summarization in Spanish.

The process involves preparing a dataset of Spanish newspaper articles and their summaries, formatting the examples as instruction prompts, and fine-tuning the model on them. The goal is a model that generates concise and accurate summaries of news articles, reflecting the key points of the original content.

**This notebook walks through the necessary steps, including data preparation, model configuration, and evaluation, to obtain a well-tuned summarization model for Spanish text.**

In [1]:
!pip install transformers
!pip install sentencepiece
!pip install accelerate
!pip install datasets
!pip install evaluate
!pip install rouge_score

Collecting datasets
  Downloading datasets-2.21.0-py3-none-any.whl.metadata (21 kB)
Collecting pyarrow>=15.0.0 (from datasets)
  Downloading pyarrow-17.0.0-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (3.3 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Downloading datasets-2.21.0-py3-none-any.whl (527 kB)
Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Downloading pyarrow-17.0.0-cp310-cp310-manylinux_2_28_x86_64.whl (39.9 MB)

In [2]:
from transformers import T5Tokenizer, T5ForConditionalGeneration

In [3]:
def load_t5_model(name):
  tokenizer_T5 = T5Tokenizer.from_pretrained(name)
  model_T5 = T5ForConditionalGeneration.from_pretrained(name, device_map="auto")
  return tokenizer_T5, model_T5

This code defines a function called **load_t5_model**, which is responsible for loading a T5 (Text-To-Text Transfer Transformer) model and its corresponding tokenizer. This function accepts one argument, name, which represents the name or path of the pre-trained T5 model.

When the function is called, it performs two key actions:

- **Loading the Tokenizer:** It uses the **T5Tokenizer.from_pretrained(name)** method to load the tokenizer, which is responsible for converting text into token IDs that the model can understand.
- **Loading the Model:** It then loads the pre-trained T5 model itself using **T5ForConditionalGeneration.from_pretrained(name, device_map="auto")**. The **device_map="auto"** argument, which relies on the `accelerate` library, automatically places the model on the available hardware (for example a GPU, if one is present). Inputs passed to the model must then be moved to the same device.

Finally, the function returns both the loaded tokenizer and model, allowing them to be used together for various natural language processing tasks such as text generation or translation.

In [4]:
def generate_response_from_prompt(model, prompt, max_length=100):
  # Load the tokenizer and model for the given name or path
  tokenizer_T5, model_T5 = load_t5_model(model)
  # Tokenize the prompt and move the IDs to the device that holds the model
  prompt_tokens = tokenizer_T5(prompt, return_tensors="pt").input_ids.to(model_T5.device)
  # Generate up to max_length tokens and decode the first (only) sequence
  outputs = model_T5.generate(prompt_tokens, max_length=max_length)
  return tokenizer_T5.decode(outputs[0])

The `generate_response_from_prompt` function generates a response to a given prompt using a pre-trained T5 model.

#### Parameters:
- **`model`**: The name or path of the pre-trained T5 model.
- **`prompt`**: The input text for which the model will generate a response.
- **`max_length`**: An optional parameter (default value of 100) that controls the maximum length, in tokens, of the generated response.

#### Function Steps:
1. **Loading the Model and Tokenizer**:  
   The function first calls `load_t5_model(model)` to load the T5 model and tokenizer using the specified model name or path.

2. **Tokenizing the Prompt**:  
   The input `prompt` is tokenized using the loaded tokenizer (`tokenizer_T5`). This step converts the prompt text into token IDs that the model can understand. The resulting token IDs are moved to the same device as the model (`model_T5.device`), because `device_map="auto"` may place the model on a GPU.

3. **Generating the Response**:  
   The function calls the T5 model (`model_T5`) to generate a response based on the tokenized prompt. The `generate()` method is used, with `max_length` determining the maximum length of the generated text.

4. **Decoding the Response**:  
   The generated output is then decoded back into human-readable text using the tokenizer, and the decoded response is returned by the function.

In summary, this function allows us to input a prompt and obtain a generated response from the T5 model, with control over the output length.
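
Because `generate_response_from_prompt` calls `load_t5_model` on every invocation, the weights are loaded again for each prompt. A minimal sketch of one way to avoid this, assuming the loader can be wrapped with `functools.lru_cache`. The function `fake_load_model` is a hypothetical stand-in so the example runs without downloading anything; in the notebook the decorator would be applied to `load_t5_model` instead.

```python
from functools import lru_cache

load_calls = []  # records every real (non-cached) load, for illustration only

@lru_cache(maxsize=2)
def fake_load_model(name):
    """Hypothetical stand-in for load_t5_model: pretends to load a tokenizer and model."""
    load_calls.append(name)
    return f"tokenizer:{name}", f"model:{name}"

# Three calls with the same name trigger a single load; later calls hit the cache.
for _ in range(3):
    tokenizer_obj, model_obj = fake_load_model("google/flan-t5-small")

print(len(load_calls))  # prints 1
```

The cache is keyed on the model name, so switching between a couple of checkpoints would still reuse whatever has already been loaded.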

### Example: Generating a Summary from a Text Prompt

The following code demonstrates how to use the `generate_response_from_prompt` function to generate a summary of a given article using a pre-trained T5 model.

#### Code Explanation:

1. **Defining the Article**:  
   The variable `text` contains an article about a mysterious burst of radio waves detected by astronomers. The article explains that the burst took 8 billion years to reach Earth and is one of the most distant and energetic fast radio bursts (FRBs) ever observed.

2. **Creating the Prompt**:  
   The `prompt_template` is created by combining the instruction `"Summarize the following article:"` with the `text` variable. This creates a formatted prompt that asks the model to summarize the article.

3. **Generating the Summary**:  
   The `generate_response_from_prompt` function is called with the following parameters:
   - **Model**: `"google/flan-t5-small"` — A pre-trained T5 model.
   - **Prompt**: The `prompt_template` created earlier.
   
   The function processes the prompt and returns a summary of the article.

In [5]:
text = """Astronomers have detected a mysterious burst of radio waves that took \
8 billion years to reach Earth. The fast radio burst is one of the most distant \
and energetic ever observed. Fast radio bursts (FRBs) are intense bursts of radio \
waves lasting only a few milliseconds, and their origin is unknown. The first FRB \
was discovered in 2007, and since then, hundreds of these fast cosmic flashes \
have been detected, coming from distant points across the universe."""

prompt_template = f"Summarize the following article:\n\n{text}"

generate_response_from_prompt("google/flan-t5-small", prompt_template)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/2.54k [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/2.20k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.42M [00:00<?, ?B/s]

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


config.json:   0%|          | 0.00/1.40k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/308M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

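RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument index in method wrapper_CUDA__index_select)

This `RuntimeError` arises when the model and its inputs are on different devices: with a GPU available, `device_map="auto"` places the model on `cuda:0`, so input IDs on the CPU cannot be used. Moving the input IDs to `model_T5.device` before calling `generate()` avoids the mismatch.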

In [None]:
text = """The Industrial Revolution, which took place primarily in the 19th century, \
was a period of significant technological, cultural, and socioeconomic changes \
that transformed agrarian societies into industrial societies. During this time, \
there was a massive shift of labor from farms to factories. This was due to the \
invention of new machines that could perform tasks faster and more efficiently \
than humans or animals. This transition led to an increase in the production of \
goods, but it also had negative consequences, such as labor exploitation and \
environmental pollution."""

prompt_template = f"Summarize the following article:\n\n{text}"

generate_response_from_prompt("google/flan-t5-small", prompt_template)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


'<pad> The Industrial Revolution was a period of significant technological, cultural, and socioeconomic changes that led to the creation of agrarian societies.</s>'

In [None]:
text = """The Hubble Telescope, launched into space in 1990, has provided stunning images \
of the universe and has helped scientists gain a better understanding of cosmology."""

prompt_template = f"Summarize the following article:\n\n{text}"

generate_response_from_prompt("google/flan-t5-small", prompt_template)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


'<pad> The Hubble Telescope is a telescope that has been used by scientists to study the universe.</s>'
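
The outputs above still contain the special tokens `<pad>` and `</s>` because `decode` is called without `skip_special_tokens=True`, an argument that `tokenizer.decode` accepts to drop them. The sketch below only imitates that effect on already-decoded strings, assuming the two special tokens seen in the outputs; `strip_special_tokens` is a hypothetical helper, not part of `transformers`.

```python
def strip_special_tokens(decoded, special_tokens=("<pad>", "</s>")):
    """Remove special-token markers from an already-decoded string (illustrative helper)."""
    for token in special_tokens:
        decoded = decoded.replace(token, "")
    return decoded.strip()

raw = "<pad> The Hubble Telescope is a telescope that has been used by scientists to study the universe.</s>"
print(strip_special_tokens(raw))
```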

The following code demonstrates how to load the "mlsum" dataset using the `datasets` library from Hugging Face.

#### Code Explanation:

1. **Importing the Library**:  
   The code begins by importing the `load_dataset` function from the `datasets` library.

2. **Loading the Dataset**:  
   The `load_dataset` function is used to load the dataset. In this case, it loads the "mlsum" dataset with the configuration `'es'` for Spanish. MLSUM is a multilingual summarization dataset built from online newspaper articles, and the `'es'` configuration selects its Spanish portion.

In [None]:
from datasets import load_dataset

ds = load_dataset("mlsum", 'es')

Downloading builder script:   0%|          | 0.00/3.72k [00:00<?, ?B/s]

Downloading readme:   0%|          | 0.00/11.0k [00:00<?, ?B/s]

The repository for mlsum contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/mlsum.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N] y


Downloading data:   0%|          | 0.00/1.32G [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/55.1M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/77.5M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/266367 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/10358 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/13920 [00:00<?, ? examples/s]

In [None]:
ds

DatasetDict({
    train: Dataset({
        features: ['text', 'summary', 'topic', 'url', 'title', 'date'],
        num_rows: 266367
    })
    validation: Dataset({
        features: ['text', 'summary', 'topic', 'url', 'title', 'date'],
        num_rows: 10358
    })
    test: Dataset({
        features: ['text', 'summary', 'topic', 'url', 'title', 'date'],
        num_rows: 13920
    })
})

In [None]:
# Display an example from the training split
# (indexing the row first avoids loading the whole 'text' column into memory)
ds["train"][10]["text"]

'España ha sumado en 2009 un sexto año consecutivo de reducción de la mortalidad en las carreteras y, a falta de las cifras oficiales que confirmará hoy el ministro del Interior, Alfredo Pérez Rubalcaba, por primera vez en la historia se ha cerrado un periodo de 12 meses por debajo de 2.000 muertos. Hasta el pasado 18 de diciembre, último día del que se dispone de datos, la cifra de fallecidos es de 1.826, a los que hay que sumar los 31 de la primera fase de la operación especial de Navidad, entre el 22 y el 27 de diciembre. La estadística resulta especialmente llamativa si se compara con el primer año en el que se hizo balance, 1969, en el que 3.951 personas perdieron la vida. Desde entonces, el pico más alto se alcanzó en 1989 con 7.000 víctimas. En las cifras de mortalidad de 2009 destacan otros aspectos positivos como el hecho de que sólo se han superado los 200 muertos en un mes (agosto) y que ha habido cinco días sin víctimas mortales sobre el asfalto, el último el pasado 11 de d

In [None]:
# Display the summary corresponding to the previous example
# (indexing the row first avoids loading the whole 'summary' column into memory)
ds["train"][10]["summary"]

'2009 es el periodo con menos fallecidos en accidentes de tráfico en cuatro décadas, desde que existen datos oficiales'

### Reducing the Size of the Dataset

The following code demonstrates how to reduce the size of the previously loaded "mlsum" dataset. This is useful for quicker experimentation or to manage resource constraints by limiting the number of examples in each subset.

#### Code Explanation:

1. **Defining Subset Sizes**:  
   The number of examples to keep in each subset of the dataset is defined as follows:
   - **`NUM_EX_TRAIN`**: Number of examples to retain in the training subset (1500).
   - **`NUM_EX_VAL`**: Number of examples to retain in the validation subset (500).
   - **`NUM_EX_TEST`**: Number of examples to retain in the test subset (200).

2. **Creating Subsets**:
   - **Training Subset**:  
     The `ds['train']` subset is reduced to the first `NUM_EX_TRAIN` examples using `select(range(NUM_EX_TRAIN))`.
   - **Validation Subset**:  
     The `ds['validation']` subset is reduced to the first `NUM_EX_VAL` examples using `select(range(NUM_EX_VAL))`.
   - **Test Subset**:  
     The `ds['test']` subset is reduced to the first `NUM_EX_TEST` examples using `select(range(NUM_EX_TEST))`.
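
`select(range(N))` always keeps the first `N` rows, which can bias the subset if the dataset happens to be ordered (for example by date). One hedged alternative is to draw a reproducible random sample of indices. The sketch below shows only the index logic in plain Python; `sample_indices` and the seed value are illustrative assumptions. With the `datasets` library, a similar effect is usually obtained with `ds['train'].shuffle(seed=42).select(range(NUM_EX_TRAIN))`.

```python
import random

def sample_indices(total_rows, num_examples, seed=42):
    """Return a sorted, reproducible random sample of row indices."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(total_rows), num_examples))

first_rows = list(range(5))              # what select(range(5)) would keep
random_rows = sample_indices(266367, 5)  # 266367 = size of the MLSUM 'es' train split

print(first_rows)
print(random_rows)
```

Fixing the seed keeps the subset identical across runs, so results remain comparable.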

In [None]:
# Reduce the dataset size
NUM_EX_TRAIN = 1500
NUM_EX_VAL = 500
NUM_EX_TEST = 200

# Training subset
ds['train'] = ds['train'].select(range(NUM_EX_TRAIN))

# Validation subset
ds['validation'] = ds['validation'].select(range(NUM_EX_VAL))

# Test subset
ds['test'] = ds['test'].select(range(NUM_EX_TEST))

### Function: `parse_dataset`

The `parse_dataset` function processes individual examples from the dataset to format them according to a specific template. This preparation is typically used to adapt data for model training or evaluation.

#### Code Explanation:

1. **Function Purpose**:  
   The `parse_dataset` function takes a single example from the dataset and reformats it to fit a specific template. This is useful for creating consistent input prompts for tasks such as text summarization.

2. **Function Details**:
   - **Input**: The function receives an `example` dictionary from the dataset, which includes a key `'text'` containing the article or content to be summarized.
   - **Processing**: It constructs a new dictionary with a single key `'prompt'`. The value for `'prompt'` is a string formatted to include the instruction `"Summarize the following article:\n\n"` followed by the content from `example['text']`.
   - **Output**: The function returns this formatted dictionary, which is now ready to be used as input for a model that generates summaries.


In [None]:
def parse_dataset(example):
  """Processes the examples to adapt them to the template."""
  return {"prompt": f"Summarize the following article:\n\n{example['text']}"}

### Applying the `parse_dataset` Function to the Dataset

The following code demonstrates how to apply the `parse_dataset` function to each subset of the dataset (training, validation, and test). This step transforms the dataset examples to fit a specific format required for model input.

#### Code Explanation:

1. **Applying the Function**:
   - The `map` method is used to apply the `parse_dataset` function to each example in the dataset subsets.
   - This method processes each example and reformats it according to the function's logic, ensuring that all examples are structured consistently.

2. **Updating the Dataset**:
   - **Training Subset**:  
     `ds["train"] = ds["train"].map(parse_dataset)`  
     This line applies the `parse_dataset` function to all examples in the training subset, updating it with the new formatted structure.
   - **Validation Subset**:  
     `ds["validation"] = ds["validation"].map(parse_dataset)`  
     Similarly, this line processes the validation subset.
   - **Test Subset**:  
     `ds["test"] = ds["test"].map(parse_dataset)`  
     Finally, this line processes the test subset.
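
To make the effect of `map` concrete, the following plain-Python sketch imitates what happens to each example: the dictionary returned by the function is merged into the example, so the original columns are kept and a `prompt` column is added. `apply_to_examples` is a hypothetical helper that only approximates the behaviour of `Dataset.map` for non-batched functions, and the toy rows are invented for illustration.

```python
def parse_dataset(example):
    """Processes the examples to adapt them to the template."""
    return {"prompt": f"Summarize the following article:\n\n{example['text']}"}

def apply_to_examples(examples, fn):
    """Merge fn's output into each example, roughly as a non-batched map would."""
    return [{**example, **fn(example)} for example in examples]

toy_rows = [
    {"text": "Primer artículo de prueba.", "summary": "Resumen uno."},
    {"text": "Segundo artículo de prueba.", "summary": "Resumen dos."},
]

mapped = apply_to_examples(toy_rows, parse_dataset)
print(sorted(mapped[0].keys()))  # prints ['prompt', 'summary', 'text']
print(mapped[0]["prompt"])
```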

In [None]:
ds["train"] = ds["train"].map(parse_dataset)
ds["validation"] = ds["validation"].map(parse_dataset)
ds["test"] = ds["test"].map(parse_dataset)

Map:   0%|          | 0/1500 [00:00<?, ? examples/s]

Map:   0%|          | 0/500 [00:00<?, ? examples/s]

Map:   0%|          | 0/200 [00:00<?, ? examples/s]

In [None]:
print(ds["train"]["prompt"][10])

Summarize the following article:

España ha sumado en 2009 un sexto año consecutivo de reducción de la mortalidad en las carreteras y, a falta de las cifras oficiales que confirmará hoy el ministro del Interior, Alfredo Pérez Rubalcaba, por primera vez en la historia se ha cerrado un periodo de 12 meses por debajo de 2.000 muertos. Hasta el pasado 18 de diciembre, último día del que se dispone de datos, la cifra de fallecidos es de 1.826, a los que hay que sumar los 31 de la primera fase de la operación especial de Navidad, entre el 22 y el 27 de diciembre. La estadística resulta especialmente llamativa si se compara con el primer año en el que se hizo balance, 1969, en el que 3.951 personas perdieron la vida. Desde entonces, el pico más alto se alcanzó en 1989 con 7.000 víctimas. En las cifras de mortalidad de 2009 destacan otros aspectos positivos como el hecho de que sólo se han superado los 200 muertos en un mes (agosto) y que ha habido cinco días sin víctimas mortales sobre el asf

In [None]:
print(ds["train"]["text"][10])

España ha sumado en 2009 un sexto año consecutivo de reducción de la mortalidad en las carreteras y, a falta de las cifras oficiales que confirmará hoy el ministro del Interior, Alfredo Pérez Rubalcaba, por primera vez en la historia se ha cerrado un periodo de 12 meses por debajo de 2.000 muertos. Hasta el pasado 18 de diciembre, último día del que se dispone de datos, la cifra de fallecidos es de 1.826, a los que hay que sumar los 31 de la primera fase de la operación especial de Navidad, entre el 22 y el 27 de diciembre. La estadística resulta especialmente llamativa si se compara con el primer año en el que se hizo balance, 1969, en el que 3.951 personas perdieron la vida. Desde entonces, el pico más alto se alcanzó en 1989 con 7.000 víctimas. En las cifras de mortalidad de 2009 destacan otros aspectos positivos como el hecho de que sólo se han superado los 200 muertos en un mes (agosto) y que ha habido cinco días sin víctimas mortales sobre el asfalto, el último el pasado 11 de di

### Analyzing Token Lengths in the Dataset

The following code demonstrates how to analyze the token lengths of prompts and completions in the dataset using the tokenizer from the **"google/flan-t5-small"** model. This helps in understanding the size of text inputs and outputs relative to the model's token limits.

In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from datasets import concatenate_datasets

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-small")

# Calculate the maximum prompt size
prompts_tokens = concatenate_datasets([ds["train"], ds["validation"], ds["test"]]).map(lambda x: tokenizer(x["prompt"], truncation=True), batched=True) # Will truncate to 512, which is the maximum size for this model
max_token_len = max([len(x) for x in prompts_tokens["input_ids"]])
print(f"Maximum prompt size: {max_token_len}")

# Calculate the maximum completion size
completions_tokens = concatenate_datasets([ds["train"], ds["validation"], ds["test"]]).map(lambda x: tokenizer(x["summary"], truncation=True), batched=True)
max_completion_len = max([len(x) for x in completions_tokens["input_ids"]])
print(f"Maximum completion size: {max_completion_len}")

Map:   0%|          | 0/2200 [00:00<?, ? examples/s]

Maximum prompt size: 512


Map:   0%|          | 0/2200 [00:00<?, ? examples/s]

Maximum completion size: 242


#### Code Explanation:

1. **Importing Libraries**:  
   The code imports the `AutoTokenizer` and `AutoModelForSeq2SeqLM` classes from the `transformers` library, as well as the `concatenate_datasets` function from the `datasets` library.

2. **Loading the Tokenizer**:  
   The `AutoTokenizer.from_pretrained("google/flan-t5-small")` method loads the tokenizer for the specified model.

3. **Calculating Maximum Prompt Size**:
   - **Concatenate Datasets**:  
     The `concatenate_datasets` function combines the training, validation, and test subsets into a single dataset.
   - **Tokenizing Prompts**:  
     The `map` function is used to apply the tokenizer to each prompt in the concatenated dataset, with truncation set to `True` to ensure that tokens do not exceed the model's maximum input size (512 tokens for this model).
   - **Finding Maximum Token Length**:  
     The code calculates the length of each tokenized prompt and determines the maximum length across all prompts.
   - **Print Statement**:  
     The maximum prompt size is printed to the console.

4. **Calculating Maximum Completion Size**:
   - **Tokenizing Completions**:  
     Similar to prompts, the `map` function tokenizes each completion in the dataset and calculates token lengths.
   - **Finding Maximum Token Length**:  
     The code calculates the maximum length of the tokenized completions.
   - **Print Statement**:  
     The maximum completion size is printed to the console.
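
Because the prompts are tokenized with `truncation=True`, the reported maximum is capped at 512 and says little about how long the articles really are. A short sketch of a more informative summary, assuming the untruncated token counts have already been collected in a list; `length_summary` and the sample lengths are hypothetical.

```python
import statistics

def length_summary(lengths, limit=512):
    """Summarize token lengths and the share of examples exceeding the model limit."""
    over_limit = sum(1 for n in lengths if n > limit)
    return {
        "max": max(lengths),
        "median": statistics.median(lengths),
        "share_over_limit": over_limit / len(lengths),
    }

# Invented token counts standing in for real tokenizer output.
sample_lengths = [180, 420, 512, 760, 1340, 95, 2048, 510]
summary = length_summary(sample_lengths)
print(summary)  # {'max': 2048, 'median': 511.0, 'share_over_limit': 0.375}
```

A high share of examples above the limit would indicate that much of each article is cut off before the model sees it.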

### Function: `padding_tokenizer`

The `padding_tokenizer` function processes text data to prepare it for model training by tokenizing inputs and labels, and applying padding and truncation as needed. This ensures that all examples have a consistent format suitable for training.

In [None]:
def padding_tokenizer(data):
  # Tokenize inputs (prompts)
  model_inputs = tokenizer(data['prompt'], max_length=max_token_len, padding="max_length", truncation=True)

  # Tokenize labels (completions)
  model_labels = tokenizer(data['summary'], max_length=max_completion_len, padding="max_length", truncation=True)

  # Replace padding token in completions with -100 so it is ignored during training
  model_labels["input_ids"] = [[(l if l != tokenizer.pad_token_id else -100) for l in label] for label in model_labels["input_ids"]]

  model_inputs['labels'] = model_labels["input_ids"]

  return model_inputs

#### Code Explanation:

1. **Function Purpose**:  
   The `padding_tokenizer` function tokenizes both the input prompts and the output completions, adjusts the token lengths, and handles padding and truncation. It also modifies the tokenized completions to mark padding tokens as `-100` to be ignored during training.

2. **Tokenizing Inputs (Prompts)**:
   - The `tokenizer` processes the input prompts from the dataset using `max_length=max_token_len`, `padding="max_length"`, and `truncation=True`. This ensures that all input tokens are padded to the maximum length or truncated if they exceed it.

3. **Tokenizing Labels (Completions)**:
   - Similarly, the `tokenizer` processes the completion labels with `max_length=max_completion_len`, `padding="max_length"`, and `truncation=True`. This prepares the labels to match the length required by the model.

4. **Handling Padding in Labels**:
   - **Replacing Padding Tokens**:  
     Padding tokens in the completion labels are replaced with `-100`. This is done so that these padding tokens are ignored during the model's training process.
   - **Setting Labels**:  
     The modified completion labels (`model_labels["input_ids"]`) are added to `model_inputs` under the key `'labels'`.

5. **Returning Processed Data**:  
   The function returns the `model_inputs` dictionary, which now includes both the tokenized prompts and the adjusted labels.
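
The label handling can be illustrated on a single sequence without a tokenizer. The sketch below assumes a padding ID of 0, which is the `pad_token_id` of the T5 tokenizers, and uses invented token IDs; `pad_and_mask_labels` is a hypothetical helper that mirrors the padding and `-100` replacement steps.

```python
def pad_and_mask_labels(label_ids, max_length, pad_token_id=0, ignore_index=-100):
    """Truncate or pad label IDs to max_length, then mask padding with ignore_index."""
    padded = label_ids[:max_length]
    padded = padded + [pad_token_id] * (max_length - len(padded))
    return [tok if tok != pad_token_id else ignore_index for tok in padded]

labels = [1412, 19, 3, 9, 1]  # invented IDs; 1 is the end-of-sequence ID in T5
masked = pad_and_mask_labels(labels, max_length=8)
print(masked)  # prints [1412, 19, 3, 9, 1, -100, -100, -100]
```

The value `-100` is the default `ignore_index` of PyTorch's cross-entropy loss, which is why positions carrying it do not contribute to the training loss.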


In this step, we apply the `padding_tokenizer` function to the entire dataset. This ensures that all the data is formatted correctly for the model.

In [None]:
ds_tokens = ds.map(padding_tokenizer, batched=True, remove_columns=['text', 'summary', 'topic', 'url', 'title', 'date', 'prompt'])

Map:   0%|          | 0/1500 [00:00<?, ? examples/s]

Map:   0%|          | 0/500 [00:00<?, ? examples/s]

Map:   0%|          | 0/200 [00:00<?, ? examples/s]

#### What the Code Does:

1. **Apply the Tokenizer Function**:  
   The `map` method is used to apply the `padding_tokenizer` function to each batch of examples in the dataset. This function tokenizes and pads the inputs and labels, preparing them for model training.

2. **Batch Processing**:
   - **`batched=True`**:  
     This parameter processes the data in batches, which improves efficiency by handling multiple examples at once.

3. **Remove Unnecessary Columns**:
   - **`remove_columns`**:  
     After processing, we remove columns that are no longer needed, such as `'text'`, `'summary'`, `'topic'`, `'url'`, `'title'`, `'date'`, and `'prompt'`. This helps in keeping the dataset clean and focused on the tokenized data.

4. **Result**:
   - The result is a dataset (`ds_tokens`) that includes only the essential tokenized inputs and labels, making it ready for the fine-tuning process.
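
With `batched=True`, the function receives a dictionary of columns (each value is a list covering the whole batch) rather than one example at a time, and it must return lists of the same length. A plain-Python sketch of that layout, using a hypothetical `toy_tokenize` in place of the real tokenizer:

```python
def toy_tokenize(texts):
    """Hypothetical stand-in for a tokenizer: one 'token ID' (the word length) per word."""
    return [[len(word) for word in text.split()] for text in texts]

def batched_fn(batch):
    """Receives columns as lists and returns new columns of the same batch size."""
    return {
        "input_ids": toy_tokenize(batch["prompt"]),
        "labels": toy_tokenize(batch["summary"]),
    }

batch = {
    "prompt": ["Summarize the following article: uno dos",
               "Summarize the following article: tres"],
    "summary": ["uno", "tres cuatro"],
}

out = batched_fn(batch)
print(out["labels"])  # prints [[3], [4, 6]]
```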

In [None]:
ds_tokens

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 1500
    })
    validation: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 500
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 200
    })
})

In [None]:
ds_tokens["train"]["input_ids"][10]

[12198, 1635, 1737, 8, 826, 1108, 10, 28774, 2, 9, 4244, 4505, 9, 26, 32, 3,
 35, 2464, 73, 3, 7, 994, 235, 3, 9, 2, 32, 6900, 15, 3044, 23, 1621,
 20, 27353, 12765, 20, 50, 24301, 15644, 3, 35, 50, 7, 443, 60, 449, 9, 7,
 3, 63, 6, 3, 9, 12553, 17, 9, 20, 50, 7, 3, 31812, 7, 11343, 15,
 7, 238, 3606, 291, 2975, 3534, 63, 3, 15, 40, 3016, 6626, 20, 40, 8226, 6,
 19850, 32, 276, 154, 2638, 15612, 138, 10891, 9, 6, 5569, 21628, 9, 3, 6071, 3,
 35, 50, 3, 107, 17905, 142, 4244, 3, 2110, 19042, 73, 1059, 32, 20, 586, 140,
 2260, 5569, 20, 115, 9, 1927, 20, 3, 26756, 4035, 49, 235, 7, 5, 4498, 17,
 9, 3, 15, 40, 330, 9, 26, 32, 507, 20, 3, 26, 1294, 21388, 6, 3,
 2, 40, 2998, 32, 3, 26, 2, 9, 20, 40, 238, 142, 3, 10475, 782, 20,
 3927, 32, 7, 6, 50, 3, 31812, 20, 1590, 15, 75, 28594, 3, 15, 7, 20,
 3, 16253, 2688, 6, 3, 9,


### Evaluating Model Performance

In this part of the fine-tuning process, we set up the evaluation metrics and preprocess the text for accurate evaluation of model outputs.

In [None]:
import evaluate
import nltk
import numpy as np
from nltk.tokenize import sent_tokenize

nltk.download("punkt")

# Evaluation metric
metric = evaluate.load("rouge")

# Helper function to preprocess the text
def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [label.strip() for label in labels]

    # rougeLSum expects a new line after each sentence
    preds = ["\n".join(sent_tokenize(pred)) for pred in preds]
    labels = ["\n".join(sent_tokenize(label)) for label in labels]

    return preds, labels

def compute_metrics(eval_preds):
    preds, labels = eval_preds

    if isinstance(preds, tuple):
        preds = preds[0]

    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

    # Replace -100 in labels as it cannot be decoded
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Preprocess the text
    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)
    result = metric.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
    result = {k: round(v * 100, 4) for k, v in result.items()}
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
    result["gen_len"] = np.mean(prediction_lens)
    return result

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Downloading builder script:   0%|          | 0.00/6.27k [00:00<?, ?B/s]

#### Steps in the Code:

1. **Import Libraries**:  
   We import necessary libraries for evaluation and text processing:
   - `evaluate` for metrics computation.
   - `nltk` for natural language processing tasks.
   - `numpy` for numerical operations.
   - `sent_tokenize` from `nltk` for sentence tokenization.

2. **Download NLTK Data**:
   - **`nltk.download("punkt")`**:  
     This downloads the Punkt tokenizer models, which are used for sentence tokenization.

3. **Load Evaluation Metric**:
   - **`metric = evaluate.load("rouge")`**:  
     We load the ROUGE metric, which is commonly used for evaluating summarization models.

4. **Text Postprocessing**:
   - **`postprocess_text(preds, labels)`**:  
     This helper function strips leading and trailing whitespace from the predicted and reference texts and places each sentence on its own line, the format expected by the `rougeLsum` variant of ROUGE.

5. **Compute Metrics**:
   - **`compute_metrics(eval_preds)`**:  
     This function takes the predictions and labels from the evaluation, decodes them, and calculates the evaluation metrics:
     - **Decoding**: Converts token IDs back to text. In the labels, `-100` is first replaced with the padding token ID, since `-100` cannot be decoded.
     - **Postprocessing**: Applies `postprocess_text` to prepare the text for ROUGE evaluation.
     - **Metric Calculation**: Computes the ROUGE scores and the average generation length (the mean number of non-padding tokens per prediction).
     - **Formatting**: Converts the ROUGE scores to percentages (multiplying by 100) and rounds them to four decimal places.
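
To give an intuition for what the ROUGE scores measure, here is a simplified ROUGE-1 F-measure based on unigram overlap. It is a sketch under simplifying assumptions (lowercasing and whitespace splitting, no stemming) and is not the implementation used by `evaluate`; `rouge1_f1` is a hypothetical helper.

```python
from collections import Counter

def rouge1_f1(prediction, reference):
    """Simplified ROUGE-1: F1 over clipped unigram overlap, whitespace tokenization."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

score = rouge1_f1("el gato come", "el gato come pescado")
print(round(score * 100, 4))  # prints 85.7143
```

Here the prediction matches three of the four reference words, giving a precision of 1.0, a recall of 0.75, and an F1 of 6/7.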

### Loading the Pre-Trained Model

In this step, we load the pre-trained model from the Hugging Face Hub. This is the model whose weights will be updated during fine-tuning on our specific task.

In [None]:
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-small")

#### What the Code Does:

1. **Import the Model Class**:
   - **`AutoModelForSeq2SeqLM`**:  
     This class from the `transformers` library is used to load sequence-to-sequence models. It's suitable for tasks like text generation, translation, and summarization.

2. **Load the Pre-Trained Model**:
   - **`model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-small")`**:  
     This line initializes the model with pre-trained weights for the "google/flan-t5-small" model. The `from_pretrained` method loads the model configuration and weights from the specified model identifier.

### Purpose and Objective of the Data Collator

The `DataCollator` plays a crucial role in preparing data for training machine learning models, especially in Natural Language Processing (NLP) tasks and sequence-to-sequence (seq2seq) models. Here’s a detailed overview of its objectives and purposes:

#### 1. **Handling Padding**
   - **Objective**: Padding is required when examples in a batch have different lengths. The `DataCollator` brings all sequences in a batch to the same length by adding padding tokens.
   - **Purpose**: To facilitate efficient batch processing. The examples in a batch must share one length so they can be stacked into a single tensor and processed in parallel; different batches may still have different lengths.

#### 2. **Configuring Special Tokens**
   - **Objective**: During training, it is crucial to handle padding tokens in labels and inputs correctly.
   - **Purpose**: The `DataCollator` allows for specifying which tokens should be treated as padding, ensuring that these tokens are not included in the model's loss calculation. This is done by setting the `label_pad_token_id`, which indicates which tokens should be ignored during loss evaluation.

#### 3. **Sequence Length Alignment**
   - **Objective**: Optimize memory usage and training performance.
   - **Purpose**: The `pad_to_multiple_of` parameter pads the sequence length of each batch (not the batch size) up to the next multiple of a given value, such as 8. Tensor dimensions that are multiples of 8 can improve performance on some hardware, such as NVIDIA GPUs with Tensor Cores, mainly when training in reduced precision.

#### 4. **Preparation for Training**
   - **Objective**: Convert and organize data into a format that the model can process directly.
   - **Purpose**: The `DataCollator` prepares the data to be in the correct format for the model during training, handling padding and alignment consistently.
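
The padding behavior described above can be sketched in plain Python. This is an illustration under stated assumptions, not the `transformers` implementation: `pad_batch` is a hypothetical helper, the token IDs are made up, and padding is applied on the right, as T5 tokenizers do by default.

```python
import math

PAD_TOKEN_ID = 0      # T5 pad token ID, used for input_ids
LABEL_PAD_ID = -100   # ignored by the loss, used for labels


def pad_batch(sequences, pad_value, pad_to_multiple_of=None):
    """Right-pad every sequence to the longest length in the batch,
    rounded up to a multiple of `pad_to_multiple_of` when given."""
    target = max(len(seq) for seq in sequences)
    if pad_to_multiple_of:
        target = math.ceil(target / pad_to_multiple_of) * pad_to_multiple_of
    return [list(seq) + [pad_value] * (target - len(seq)) for seq in sequences]


# Two examples per batch; the batch size stays 2, only the sequence length grows
input_ids = pad_batch([[21, 22, 23], [31]], PAD_TOKEN_ID, pad_to_multiple_of=8)
labels = pad_batch(
    [[41, 42], [51, 52, 53, 54, 55, 56, 57, 58, 59]],
    LABEL_PAD_ID,
    pad_to_multiple_of=8,
)

print(len(input_ids), [len(row) for row in input_ids])  # 2 [8, 8]
print(len(labels), [len(row) for row in labels])        # 2 [16, 16]
```

The longest label sequence has 9 tokens, so the label length is rounded up to 16, the next multiple of 8. Inputs and labels are padded independently and with different fill values.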


### Setting Up the Data Collator for Training

In this step, we configure the `DataCollatorForSeq2Seq` to prepare the data for training. The data collator handles padding and batching, ensuring that the model receives properly formatted inputs and labels.

In [None]:
from transformers import DataCollatorForSeq2Seq

# Label positions filled with -100 are ignored when the loss is computed
label_pad_token_id = -100

# Data collator for model training
data_collator = DataCollatorForSeq2Seq(
    tokenizer,
    model=model,
    label_pad_token_id=label_pad_token_id,
    pad_to_multiple_of=8
)

#### What the Code Does:

1. **Import the Data Collator**:
   - **`DataCollatorForSeq2Seq`**:  
     This class from the `transformers` library is used to handle padding and other preprocessing tasks for sequence-to-sequence models.

2. **Define the Label Padding ID**:
   - **`label_pad_token_id = -100`**:  
     This value fills the padded positions in the labels so that they are ignored during training. The value `-100` is used because it is the default `ignore_index` of PyTorch's cross-entropy loss, so those positions contribute nothing to the loss.

3. **Create the Data Collator**:
   - **`data_collator = DataCollatorForSeq2Seq`**:  
     We initialize the data collator with the following parameters:
     - **`tokenizer`**: The tokenizer used to convert text into token IDs.
     - **`model`**: The model for which the data collator is being set up.
     - **`label_pad_token_id`**: Specifies which token ID should be treated as padding in the labels.
     - **`pad_to_multiple_of=8`**: Pads each batch so that its sequence length is a multiple of 8 (the batch size is unaffected), which can help with memory alignment and performance on some hardware.
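
To see why label positions set to `-100` do not influence training, the sketch below computes a token-level cross-entropy loss by hand and skips ignored positions. It is a simplified stand-in written for illustration, assuming a tiny two-token vocabulary and made-up logits; `masked_cross_entropy` is a hypothetical helper, not a library function.

```python
import math

IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def masked_cross_entropy(logits, labels, ignore_index=IGNORE_INDEX):
    """Mean cross-entropy over positions whose label is not `ignore_index`."""
    total, count = 0.0, 0
    for row, label in zip(logits, labels):
        if label == ignore_index:
            continue  # padded position: contributes nothing to the loss
        peak = max(row)  # subtract the max for numerical stability
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_sum_exp - row[label]
        count += 1
    # Simplification: return 0.0 when every position is ignored
    return total / count if count else 0.0


# Three positions over a vocabulary of size 2; the last position is padding
logits = [[2.0, 0.0], [0.0, 2.0], [5.0, -5.0]]
labels = [0, 1, IGNORE_INDEX]

loss_with_padding = masked_cross_entropy(logits, labels)
loss_without_padding = masked_cross_entropy(logits[:2], labels[:2])
```

Both values are identical: whatever the model predicts at a padded position, the loss and therefore the gradients are unchanged.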

### Setting Up and Running the Training

In this step, we configure the training process for fine-tuning a sequence-to-sequence model using the `Seq2SeqTrainer` from the Hugging Face library. This setup includes defining training arguments and creating the trainer instance.

In [None]:
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

REPOSITORY = "/content/flan-t5-small-fine-tuned"

# Define the training options
training_args = Seq2SeqTrainingArguments(
    # Training hyperparameters
    output_dir=REPOSITORY,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    predict_with_generate=True,
    fp16=False,  # Overflows with fp16
    learning_rate=5e-5,
    num_train_epochs=4,
    # Logging and evaluation strategies
    logging_dir=f"{REPOSITORY}/logs",
    logging_strategy="steps",
    logging_steps=500,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    save_total_limit=2,
    load_best_model_at_end=True,
)

# Create the training instance
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=ds_tokens["train"],
    eval_dataset=ds_tokens["validation"],
    compute_metrics=compute_metrics,
)



#### What the Code Does:

1. **Import Required Classes**:
   - **`Seq2SeqTrainer`** and **`Seq2SeqTrainingArguments`**:  
     These classes are used to set up and manage the training process for sequence-to-sequence models.

2. **Define Training Options**:
   - **`training_args`**:  
     The `Seq2SeqTrainingArguments` class is used to specify various hyperparameters and strategies for training:
     - **`output_dir`**: Directory where the model checkpoints and logs will be saved.
     - **`per_device_train_batch_size`** and **`per_device_eval_batch_size`**: Batch sizes for training and evaluation.
     - **`predict_with_generate`**: Makes evaluation call the model's `generate()` method, so that metrics such as ROUGE are computed on generated summaries rather than on teacher-forced outputs.
     - **`fp16`**: Set to `False` because this model overflows when trained with fp16 mixed precision.
     - **`learning_rate`**: Learning rate for the optimizer.
     - **`num_train_epochs`**: Number of training epochs.
     - **`logging_dir`**: Directory for logging training metrics.
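     - **`logging_strategy`** and **`logging_steps`**: Log training metrics every 500 optimization steps.
     - **`evaluation_strategy`** and **`save_strategy`**: Both set to `"epoch"`, so the model is evaluated and a checkpoint is saved at the end of every epoch. `load_best_model_at_end` requires the two strategies to match.
     - **`save_total_limit`**: Keeps at most two checkpoints on disk, deleting older ones.
     - **`load_best_model_at_end`**: Reloads the best checkpoint when training finishes. Because no `metric_for_best_model` is set, "best" means the lowest validation loss.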

3. **Create the Training Instance**:
   - **`trainer`**:  
     An instance of `Seq2SeqTrainer` is created with the following parameters:
     - **`model`**: The model to be fine-tuned.
     - **`args`**: The training arguments defined earlier.
     - **`data_collator`**: Handles padding and batching of the data.
     - **`train_dataset`**: The training dataset.
     - **`eval_dataset`**: The evaluation dataset.
     - **`compute_metrics`**: Function to compute evaluation metrics.
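
The interaction between batch size, epochs, and the step-based settings can be worked out with simple arithmetic. The sketch below is a rough estimate assuming a single device and no gradient accumulation; `training_schedule` is a hypothetical helper and the dataset size of 1,000 examples is made up for illustration, not taken from this notebook.

```python
import math


def training_schedule(num_examples, batch_size, num_epochs, logging_steps):
    """Estimate optimizer steps, per-epoch checkpoint steps, and step-based log entries."""
    # The last, smaller batch still counts as one step
    steps_per_epoch = math.ceil(num_examples / batch_size)
    total_steps = steps_per_epoch * num_epochs
    return {
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        # With save_strategy="epoch", a checkpoint is written after each epoch
        "checkpoint_steps": [steps_per_epoch * e for e in range(1, num_epochs + 1)],
        # Training loss is logged each time the step count hits a multiple of logging_steps
        "training_loss_logs": total_steps // logging_steps,
    }


schedule = training_schedule(num_examples=1000, batch_size=8, num_epochs=4, logging_steps=500)
print(schedule)
```

Checkpoint directories are named after the global step at which they were saved, so this hypothetical run would write `checkpoint-125`, `checkpoint-250`, and so on. With `logging_steps=500`, a short run like this one records the training loss only once, so a smaller value may be preferable on small datasets.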

### Saving the Tokenizer

The tokenizer should be saved alongside the model so that it can be reused later and inference uses exactly the same tokenization as training. Fine-tuning does not modify the tokenizer, so it can be written to disk at any point; here we save it before training starts.

In [None]:
# Save the tokenizer to disk for later use
tokenizer.save_pretrained(f"{REPOSITORY}/tokenizer")

('/content/flan-t5-small-fine-tuned/tokenizer/tokenizer_config.json',
 '/content/flan-t5-small-fine-tuned/tokenizer/special_tokens_map.json',
 '/content/flan-t5-small-fine-tuned/tokenizer/spiece.model',
 '/content/flan-t5-small-fine-tuned/tokenizer/added_tokens.json',
 '/content/flan-t5-small-fine-tuned/tokenizer/tokenizer.json')

#### What the Code Does:

1. **Save the Tokenizer**:
   - **`tokenizer.save_pretrained(f"{REPOSITORY}/tokenizer")`**:  
     This line of code saves the tokenizer to the specified directory. The `save_pretrained` method writes the tokenizer configuration and vocabulary files to disk, allowing you to reload the tokenizer later without having to reinitialize it from scratch.

### Starting the Training

With the `Seq2SeqTrainer` instance set up and configured, you can now start the training process. This step begins the actual fine-tuning of the model using the specified training arguments, dataset, and evaluation criteria.

In [None]:
# Start the training
trainer.train()



In [None]:
from transformers import T5Tokenizer, T5ForConditionalGeneration

REPOSITORY = "/content/flan-t5-small-fine-tuned"

# Load the saved tokenizer
tokenizer_FT5_FT = T5Tokenizer.from_pretrained(f"{REPOSITORY}/tokenizer")

# Load the fine-tuned model from a saved checkpoint (device_map="auto" requires accelerate)
model_FT5_FT = T5ForConditionalGeneration.from_pretrained(f"{REPOSITORY}/checkpoint-752", device_map="auto")