# Instruction Fine-Tuning for Spanish Newspaper Article Summarization Using FLAN-T5-Small

In this notebook, we walk through **instruction fine-tuning the FLAN-T5-Small language model to improve its summaries of Spanish newspaper articles**. FLAN-T5-Small is the smallest checkpoint of FLAN-T5, an instruction-tuned variant of the T5 family that handles a variety of natural language processing tasks. Instruction fine-tuning adapts this pre-trained model to one specific task, summarization in Spanish, while building on its existing ability to follow prompts and generate coherent text.

The process involves preparing a dataset of Spanish newspaper articles and their summaries, formatting each example as an instruction prompt, and fine-tuning the model on those prompts. The goal is a model that generates concise, accurate summaries reflecting the key points of the original article.

**This notebook covers the necessary steps, including data preparation, model configuration, and evaluation, to obtain a summarization model tuned for Spanish text.**

In [1]:
!pip install transformers
!pip install sentencepiece
!pip install accelerate
!pip install datasets
!pip install evaluate
!pip install rouge_score

Collecting datasets
  Downloading datasets-2.21.0-py3-none-any.whl.metadata (21 kB)
Collecting pyarrow>=15.0.0 (from datasets)
  Downloading pyarrow-17.0.0-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (3.3 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Downloading datasets-2.21.0-py3-none-any.whl (527 kB)
Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Downloading pyarrow-17.0.0-cp310-cp310-manylinux_2_28_x86_64.whl (39.9 MB)

In [49]:
from transformers import T5Tokenizer, T5ForConditionalGeneration

def load_t5_model(name):
  tokenizer_T5 = T5Tokenizer.from_pretrained(name)
  model_T5 = T5ForConditionalGeneration.from_pretrained(name, device_map="auto")
  return tokenizer_T5, model_T5

This code defines a function called **load_t5_model**, which is responsible for loading a T5 (Text-To-Text Transfer Transformer) model and its corresponding tokenizer. This function accepts one argument, name, which represents the name or path of the pre-trained T5 model.

When the function is called, it performs two key actions:

- **Loading the Tokenizer:** It uses the **T5Tokenizer.from_pretrained(name)** method to load the tokenizer, which is responsible for converting text into token IDs that the model can understand.
- **Loading the Model:** It then loads the pre-trained T5 model itself using **T5ForConditionalGeneration.from_pretrained(name, device_map="auto")**. The **device_map="auto"** argument lets the library place the model automatically on the available hardware (a GPU when one is present), and relies on the `accelerate` package installed above.

Finally, the function returns both the loaded tokenizer and model, allowing them to be used together for text-to-text tasks such as summarization or translation.

In [6]:
def generate_response_from_prompt(model, prompt, max_length=100):
  tokenizer_T5, model_T5 = load_t5_model(model)
  prompt_tokens = tokenizer_T5(prompt, return_tensors="pt").input_ids.to("cuda:0")
  outputs = model_T5.generate(prompt_tokens, max_length=max_length)
  return tokenizer_T5.decode(outputs[0])

The `generate_response_from_prompt` function is designed to generate a response based on a given prompt using a pre-trained T5 model.

#### Parameters:
- **`model`**: The name or path of the pre-trained T5 model.
- **`prompt`**: The input text for which the model will generate a response.
- **`max_length`**: An optional parameter (default value of 100) that controls the maximum length of the generated response.

#### Function Steps:
1. **Loading the Model and Tokenizer**:  
   The function first calls `load_t5_model(model)` to load the T5 model and tokenizer using the specified model name or path.

2. **Tokenizing the Prompt**:  
   The input `prompt` is tokenized using the loaded tokenizer (`tokenizer_T5`). This step converts the prompt text into token IDs that the model can understand. The resulting token IDs are moved to the GPU (`cuda:0` in this code) so that they sit on the same device as the model.

3. **Generating the Response**:  
   The function calls the T5 model (`model_T5`) to generate a response based on the tokenized prompt. The `generate()` method is used, with `max_length` determining the maximum length of the generated text.

4. **Decoding the Response**:  
   The generated output is then decoded back into human-readable text using the tokenizer, and the decoded response is returned by the function.

In summary, this function allows us to input a prompt and obtain a generated response from the T5 model, with control over the output length.
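
One cost of this design is that every call reloads the tokenizer and model. A minimal sketch of one way to avoid that is to cache the loader with `functools.lru_cache`. The loader below is a stand-in that returns placeholder strings so the block runs without downloading anything; `load_model_cached` and `load_calls` are hypothetical names, and a real version would call `load_t5_model(name)` inside the cached function.

```python
from functools import lru_cache

load_calls = []  # records each real load, for demonstration only

@lru_cache(maxsize=2)
def load_model_cached(name: str):
    """Stand-in loader: a real version would call load_t5_model(name) here."""
    load_calls.append(name)
    return (f"tokenizer:{name}", f"model:{name}")

# Two requests for the same checkpoint trigger a single load
first = load_model_cached("google/flan-t5-small")
second = load_model_cached("google/flan-t5-small")

print(len(load_calls))
print(first is second)
```

With this pattern, repeated prompts against the same checkpoint would reuse the objects already in memory.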

### Example: Generating a Summary from a Text Prompt

The following code demonstrates how to use the `generate_response_from_prompt` function to generate a summary of a given article using a pre-trained T5 model.

#### Code Explanation:

1. **Defining the Article**:  
   The variable `text` contains an article about a mysterious burst of radio waves detected by astronomers. The article explains that the burst took 8 billion years to reach Earth and is one of the most distant and energetic fast radio bursts (FRBs) ever observed.

2. **Creating the Prompt**:  
   The `prompt_template` is created by combining the instruction `"Summarize the following article:"` with the `text` variable. This creates a formatted prompt that asks the model to summarize the article.

3. **Generating the Summary**:  
   The `generate_response_from_prompt` function is called with the following parameters:
   - **Model**: `"google/flan-t5-small"`, the pre-trained FLAN-T5-Small checkpoint.
   - **Prompt**: The `prompt_template` created earlier.
   
   The function processes the prompt and returns a summary of the article.

In [7]:
text = """Astronomers have detected a mysterious burst of radio waves that took \
8 billion years to reach Earth. The fast radio burst is one of the most distant \
and energetic ever observed. Fast radio bursts (FRBs) are intense bursts of radio \
waves lasting only a few milliseconds, and their origin is unknown. The first FRB \
was discovered in 2007, and since then, hundreds of these fast cosmic flashes \
have been detected, coming from distant points across the universe."""

prompt_template = f"Summarize the following article:\n\n{text}"

generate_response_from_prompt("google/flan-t5-small", prompt_template)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


'<pad> Astronomers have detected a fast radio burst of radio waves that have been detected in the past few decades.</s>'

In [8]:
text = """The Industrial Revolution, which took place primarily in the 19th century, \
was a period of significant technological, cultural, and socioeconomic changes \
that transformed agrarian societies into industrial societies. During this time, \
there was a massive shift of labor from farms to factories. This was due to the \
invention of new machines that could perform tasks faster and more efficiently \
than humans or animals. This transition led to an increase in the production of \
goods, but it also had negative consequences, such as labor exploitation and \
environmental pollution."""

prompt_template = f"Summarize the following article:\n\n{text}"

generate_response_from_prompt("google/flan-t5-small", prompt_template)


'<pad> The Industrial Revolution was a period of significant technological, cultural, and socioeconomic changes that led to the creation of agrarian societies.</s>'

In [9]:
text = """The Hubble Telescope, launched into space in 1990, has provided stunning images \
of the universe and has helped scientists gain a better understanding of cosmology."""

prompt_template = f"Summarize the following article:\n\n{text}"

generate_response_from_prompt("google/flan-t5-small", prompt_template)


'<pad> The Hubble Telescope is a telescope that has been used by scientists to study the universe.</s>'
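
The outputs above still contain the `<pad>` and `</s>` markers because `decode` is called without `skip_special_tokens=True`; passing that argument to `tokenizer_T5.decode` drops them. As a library-free illustration of the effect, the sketch below strips the same markers from an already-decoded string. The helper `strip_special_tokens` and its hard-coded marker list are assumptions made for this example, not part of `transformers`.

```python
# Hypothetical helper: mimics the effect of decode(..., skip_special_tokens=True)
# on an already-decoded string. The marker list is an assumption for this sketch.
SPECIAL_MARKERS = ("<pad>", "</s>")

def strip_special_tokens(decoded: str) -> str:
    """Remove T5 special-token markers and surrounding whitespace."""
    for marker in SPECIAL_MARKERS:
        decoded = decoded.replace(marker, "")
    return decoded.strip()

raw = "<pad> The Hubble Telescope is a telescope that has been used by scientists to study the universe.</s>"
clean = strip_special_tokens(raw)
print(clean)
```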

### Loading the MLSum Dataset

The following code demonstrates how to load the **"mlsum"** dataset using the `datasets` library from Hugging Face. This dataset is commonly used for training and evaluating models on the task of text summarization in multiple languages, including Spanish.

#### What is MLSum?

The **MLSum** dataset is a multilingual dataset specifically designed for the task of text summarization. It contains news articles and their corresponding summaries in various languages, including Spanish (indicated by the `'es'` language specification in the code). This dataset was created as part of a larger effort to build benchmark datasets for summarization in multiple languages.

MLSum is derived from news articles published by online newspapers such as **El País** (Spanish) and **Süddeutsche Zeitung** (German). Each news article is paired with a summary, making it suitable for training models to generate summaries from longer text inputs.

- **Languages**: MLSum includes data in Spanish (`'es'`), German (`'de'`), French, Russian, and Turkish.
- **Applications**: It is widely used in Natural Language Processing (NLP) tasks focused on summarization, where the goal is to condense longer articles into shorter, informative summaries.

#### Code Explanation:

1. **Importing the Library**:  
   The code begins by importing the `load_dataset` function from the `datasets` library provided by Hugging Face. This library allows easy access to a wide variety of datasets for NLP tasks.

2. **Loading the Dataset**:  
   The `load_dataset` function is then used to load the "mlsum" dataset. The specific version being loaded is the Spanish (`'es'`) version of the dataset. The dataset includes the text of news articles and their corresponding summaries, which can be used for training, validation, and testing in a summarization task.

In [11]:
from datasets import load_dataset

ds = load_dataset("mlsum", 'es')

The repository for mlsum contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/mlsum.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N] y


Downloading data:   0%|          | 0.00/1.32G [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/55.1M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/77.5M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/266367 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/10358 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/13920 [00:00<?, ? examples/s]

In [12]:
ds

DatasetDict({
    train: Dataset({
        features: ['text', 'summary', 'topic', 'url', 'title', 'date'],
        num_rows: 266367
    })
    validation: Dataset({
        features: ['text', 'summary', 'topic', 'url', 'title', 'date'],
        num_rows: 10358
    })
    test: Dataset({
        features: ['text', 'summary', 'topic', 'url', 'title', 'date'],
        num_rows: 13920
    })
})

In [13]:
# Display an example from the training dataset subset
ds["train"]["text"][10]

'España ha sumado en 2009 un sexto año consecutivo de reducción de la mortalidad en las carreteras y, a falta de las cifras oficiales que confirmará hoy el ministro del Interior, Alfredo Pérez Rubalcaba, por primera vez en la historia se ha cerrado un periodo de 12 meses por debajo de 2.000 muertos. Hasta el pasado 18 de diciembre, último día del que se dispone de datos, la cifra de fallecidos es de 1.826, a los que hay que sumar los 31 de la primera fase de la operación especial de Navidad, entre el 22 y el 27 de diciembre. La estadística resulta especialmente llamativa si se compara con el primer año en el que se hizo balance, 1969, en el que 3.951 personas perdieron la vida. Desde entonces, el pico más alto se alcanzó en 1989 con 7.000 víctimas. En las cifras de mortalidad de 2009 destacan otros aspectos positivos como el hecho de que sólo se han superado los 200 muertos en un mes (agosto) y que ha habido cinco días sin víctimas mortales sobre el asfalto, el último el pasado 11 de d

In [14]:
# Display the summary corresponding to the previous example
ds["train"]["summary"][10]

'2009 es el periodo con menos fallecidos en accidentes de tráfico en cuatro décadas, desde que existen datos oficiales'

### Reducing the Size of the Dataset

The following code demonstrates how to reduce the size of the previously loaded "mlsum" dataset. This is useful for quicker experimentation or to manage resource constraints by limiting the number of examples in each subset.

#### Code Explanation:

1. **Defining Subset Sizes**:  
   The number of examples to keep in each subset of the dataset is defined as follows:
   - **`NUM_EX_TRAIN`**: Number of examples to retain in the training subset (1500).
   - **`NUM_EX_VAL`**: Number of examples to retain in the validation subset (500).
   - **`NUM_EX_TEST`**: Number of examples to retain in the test subset (200).

2. **Creating Subsets**:
   - **Training Subset**:  
     The `ds['train']` subset is reduced to the first `NUM_EX_TRAIN` examples using `select(range(NUM_EX_TRAIN))`.
   - **Validation Subset**:  
     The `ds['validation']` subset is reduced to the first `NUM_EX_VAL` examples using `select(range(NUM_EX_VAL))`.
   - **Test Subset**:  
     The `ds['test']` subset is reduced to the first `NUM_EX_TEST` examples using `select(range(NUM_EX_TEST))`.
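
Because `select(range(N))` keeps the first N rows in their stored order, the subsets are deterministic but not random samples. The plain-Python sketch below contrasts the two approaches, using a list of integers as a stand-in for dataset rows. With the `datasets` library, the analogous random version would be `ds['train'].shuffle(seed=42).select(range(NUM_EX_TRAIN))`. The seed and the sizes here are arbitrary choices for illustration.

```python
import random

rows = list(range(20))      # stand-in for the row indices of a dataset split
NUM_EX = 5                  # hypothetical subset size for this sketch

# Equivalent of select(range(NUM_EX)): the first NUM_EX rows, in stored order
first_rows = [rows[i] for i in range(NUM_EX)]

# Seeded random alternative: reproducible, but drawn from the whole split
rng = random.Random(42)
sampled_rows = rng.sample(rows, NUM_EX)

print(first_rows)
print(sorted(sampled_rows))
```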

In [15]:
# Reduce the dataset size
NUM_EX_TRAIN = 1500
NUM_EX_VAL = 500
NUM_EX_TEST = 200

# Training subset
ds['train'] = ds['train'].select(range(NUM_EX_TRAIN))

# Validation subset
ds['validation'] = ds['validation'].select(range(NUM_EX_VAL))

# Test subset
ds['test'] = ds['test'].select(range(NUM_EX_TEST))

### Function: `parse_dataset`

The `parse_dataset` function processes individual examples from the dataset to format them according to a specific template. This preparation is typically used to adapt data for model training or evaluation.

#### Code Explanation:

1. **Function Purpose**:  
   The `parse_dataset` function takes a single example from the dataset and reformats it to fit a specific template. This is useful for creating consistent input prompts for tasks such as text summarization.

2. **Function Details**:
   - **Input**: The function receives an `example` dictionary from the dataset, which includes a key `'text'` containing the article or content to be summarized.
   - **Processing**: It constructs a new dictionary with a single key `'prompt'`. The value for `'prompt'` is a string formatted to include the instruction `"Summarize the following article:\n\n"` followed by the content from `example['text']`.
   - **Output**: The function returns this formatted dictionary, which is now ready to be used as input for a model that generates summaries.
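
A quick way to see what the template produces is to run it on a hand-written example. The function body below mirrors the notebook's `parse_dataset`; the example dictionary is made up and only imitates the `text` and `summary` keys of an MLSum row.

```python
def parse_dataset(example):
    """Wrap the article text in the summarization instruction."""
    return {"prompt": f"Summarize the following article:\n\n{example['text']}"}

# Made-up example with the same 'text' and 'summary' keys as the MLSum rows
example = {"text": "El tren llegó tarde.", "summary": "Retraso del tren."}
formatted = parse_dataset(example)
print(formatted["prompt"])
```

Note that the returned dictionary contains only the `prompt` key; the summary is left untouched for later use as the training target.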


In [16]:
def parse_dataset(example):
  """Processes the examples to adapt them to the template."""
  return {"prompt": f"Summarize the following article:\n\n{example['text']}"}

### Applying the `parse_dataset` Function to the Dataset

The following code demonstrates how to apply the `parse_dataset` function to each subset of the dataset (training, validation, and test). This step transforms the dataset examples to fit a specific format required for model input.

#### Code Explanation:

1. **Applying the Function**:
   - The `map` method is used to apply the `parse_dataset` function to each example in the dataset subsets.
   - This method processes each example and reformats it according to the function's logic, ensuring that all examples are structured consistently.

2. **Updating the Dataset**:
   - **Training Subset**:  
     `ds["train"] = ds["train"].map(parse_dataset)`  
     This line applies the `parse_dataset` function to all examples in the training subset, updating it with the new formatted structure.
   - **Validation Subset**:  
     `ds["validation"] = ds["validation"].map(parse_dataset)`  
     Similarly, this line processes the validation subset.
   - **Test Subset**:  
     `ds["test"] = ds["test"].map(parse_dataset)`  
     Finally, this line processes the test subset.
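
`map` does not replace each example with the dictionary returned by the function; it merges the returned keys into the existing row, which is why the `text` and `summary` columns remain available after this step. The sketch below imitates that behavior on a list of dictionaries. `map_examples` is a hypothetical stand-in written for this illustration, not the `datasets` implementation.

```python
def parse_dataset(example):
    """Wrap the article text in the summarization instruction."""
    return {"prompt": f"Summarize the following article:\n\n{example['text']}"}

def map_examples(rows, fn):
    """Hypothetical stand-in for Dataset.map: merge fn's output into each row."""
    return [{**row, **fn(row)} for row in rows]

rows = [
    {"text": "Primer artículo.", "summary": "Uno."},
    {"text": "Segundo artículo.", "summary": "Dos."},
]
mapped = map_examples(rows, parse_dataset)
print(sorted(mapped[0].keys()))
```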

In [17]:
ds["train"] = ds["train"].map(parse_dataset)
ds["validation"] = ds["validation"].map(parse_dataset)
ds["test"] = ds["test"].map(parse_dataset)

Map:   0%|          | 0/1500 [00:00<?, ? examples/s]

Map:   0%|          | 0/500 [00:00<?, ? examples/s]

Map:   0%|          | 0/200 [00:00<?, ? examples/s]

In [18]:
print(ds["train"]["prompt"][10])

Summarize the following article:

España ha sumado en 2009 un sexto año consecutivo de reducción de la mortalidad en las carreteras y, a falta de las cifras oficiales que confirmará hoy el ministro del Interior, Alfredo Pérez Rubalcaba, por primera vez en la historia se ha cerrado un periodo de 12 meses por debajo de 2.000 muertos. Hasta el pasado 18 de diciembre, último día del que se dispone de datos, la cifra de fallecidos es de 1.826, a los que hay que sumar los 31 de la primera fase de la operación especial de Navidad, entre el 22 y el 27 de diciembre. La estadística resulta especialmente llamativa si se compara con el primer año en el que se hizo balance, 1969, en el que 3.951 personas perdieron la vida. Desde entonces, el pico más alto se alcanzó en 1989 con 7.000 víctimas. En las cifras de mortalidad de 2009 destacan otros aspectos positivos como el hecho de que sólo se han superado los 200 muertos en un mes (agosto) y que ha habido cinco días sin víctimas mortales sobre el asf

In [19]:
print(ds["train"]["text"][10])

España ha sumado en 2009 un sexto año consecutivo de reducción de la mortalidad en las carreteras y, a falta de las cifras oficiales que confirmará hoy el ministro del Interior, Alfredo Pérez Rubalcaba, por primera vez en la historia se ha cerrado un periodo de 12 meses por debajo de 2.000 muertos. Hasta el pasado 18 de diciembre, último día del que se dispone de datos, la cifra de fallecidos es de 1.826, a los que hay que sumar los 31 de la primera fase de la operación especial de Navidad, entre el 22 y el 27 de diciembre. La estadística resulta especialmente llamativa si se compara con el primer año en el que se hizo balance, 1969, en el que 3.951 personas perdieron la vida. Desde entonces, el pico más alto se alcanzó en 1989 con 7.000 víctimas. En las cifras de mortalidad de 2009 destacan otros aspectos positivos como el hecho de que sólo se han superado los 200 muertos en un mes (agosto) y que ha habido cinco días sin víctimas mortales sobre el asfalto, el último el pasado 11 de di

### Analyzing Token Lengths in the Dataset

The following code demonstrates how to analyze the token lengths of prompts and completions in the dataset using the tokenizer from the **"google/flan-t5-small"** model. This helps in understanding the size of text inputs and outputs relative to the model's token limits.

In [20]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from datasets import concatenate_datasets

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-small")

# Calculate the maximum prompt size
prompts_tokens = concatenate_datasets([ds["train"], ds["validation"], ds["test"]]).map(lambda x: tokenizer(x["prompt"], truncation=True), batched=True) # Will truncate to 512, which is the maximum size for this model
max_token_len = max([len(x) for x in prompts_tokens["input_ids"]])
print(f"Maximum prompt size: {max_token_len}")

# Calculate the maximum completion size
completions_tokens = concatenate_datasets([ds["train"], ds["validation"], ds["test"]]).map(lambda x: tokenizer(x["summary"], truncation=True), batched=True)
max_completion_len = max([len(x) for x in completions_tokens["input_ids"]])
print(f"Maximum completion size: {max_completion_len}")

Map:   0%|          | 0/2200 [00:00<?, ? examples/s]

Maximum prompt size: 512


Map:   0%|          | 0/2200 [00:00<?, ? examples/s]

Maximum completion size: 242


#### Code Explanation:

1. **Importing Libraries**:  
   The code imports the `AutoTokenizer` and `AutoModelForSeq2SeqLM` classes from the `transformers` library, as well as the `concatenate_datasets` function from the `datasets` library.

2. **Loading the Tokenizer**:  
   The `AutoTokenizer.from_pretrained("google/flan-t5-small")` method loads the tokenizer for the specified model.

3. **Calculating Maximum Prompt Size**:
   - **Concatenate Datasets**:  
     The `concatenate_datasets` function combines the training, validation, and test subsets into a single dataset.
   - **Tokenizing Prompts**:  
     The `map` function is used to apply the tokenizer to each prompt in the concatenated dataset, with truncation set to `True` to ensure that tokens do not exceed the model's maximum input size (512 tokens for this model).
   - **Finding Maximum Token Length**:  
     The code calculates the length of each tokenized prompt and determines the maximum length across all prompts.
   - **Print Statement**:  
     The maximum prompt size is printed to the console.

4. **Calculating Maximum Completion Size**:
   - **Tokenizing Completions**:  
     Similar to prompts, the `map` function tokenizes each completion in the dataset and calculates token lengths.
   - **Finding Maximum Token Length**:  
     The code calculates the maximum length of the tokenized completions.
   - **Print Statement**:  
     The maximum completion size is printed to the console.
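
Since truncation caps every prompt at 512 tokens, a reported maximum of 512 only shows that at least one prompt reached the cap. Counting how many examples hit the limit can be more informative. The sketch below does this with a toy whitespace tokenizer and a small cap so that it runs without downloading a model; `toy_tokenize` and `MAX_LEN` are assumptions made for illustration.

```python
MAX_LEN = 8  # hypothetical cap; the notebook's tokenizer truncates at 512

def toy_tokenize(text, max_length=MAX_LEN):
    """Whitespace 'tokenizer' with truncation, a stand-in for the real one."""
    return text.split()[:max_length]

prompts = [
    "uno dos tres",
    "uno dos tres cuatro cinco seis siete ocho nueve diez",
    "uno dos tres cuatro cinco seis siete ocho",
    "uno",
]

lengths = [len(toy_tokenize(p)) for p in prompts]
max_len = max(lengths)
# Examples whose length equals the cap were truncated or sit exactly at the limit
at_cap = sum(1 for n in lengths if n == MAX_LEN)

print(f"Lengths: {lengths}")
print(f"Maximum length: {max_len}")
print(f"Examples at the cap: {at_cap} of {len(lengths)}")
```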

### Function: `padding_tokenizer`

The `padding_tokenizer` function processes text data to prepare it for model training by tokenizing inputs and labels, and applying padding and truncation as needed. This ensures that all examples have a consistent format suitable for training.

In [21]:
def padding_tokenizer(data):
  # Tokenize inputs (prompts)
  model_inputs = tokenizer(data['prompt'], max_length=max_token_len, padding="max_length", truncation=True)

  # Tokenize labels (completions)
  model_labels = tokenizer(data['summary'], max_length=max_completion_len, padding="max_length", truncation=True)

  # Replace padding token in completions with -100 so it is ignored during training
  model_labels["input_ids"] = [[(l if l != tokenizer.pad_token_id else -100) for l in label] for label in model_labels["input_ids"]]

  model_inputs['labels'] = model_labels["input_ids"]

  return model_inputs

#### Code Explanation:

1. **Function Purpose**:  
   The `padding_tokenizer` function tokenizes both the input prompts and the output completions, adjusts the token lengths, and handles padding and truncation. It also modifies the tokenized completions to mark padding tokens as `-100` to be ignored during training.

2. **Tokenizing Inputs (Prompts)**:
   - The `tokenizer` processes the input prompts from the dataset using `max_length=max_token_len`, `padding="max_length"`, and `truncation=True`. This ensures that all input tokens are padded to the maximum length or truncated if they exceed it.

3. **Tokenizing Labels (Completions)**:
   - Similarly, the `tokenizer` processes the completion labels with `max_length=max_completion_len`, `padding="max_length"`, and `truncation=True`. This prepares the labels to match the length required by the model.

4. **Handling Padding in Labels**:
   - **Replacing Padding Tokens**:  
     Padding tokens in the completion labels are replaced with `-100`. This is done so that these padding tokens are ignored during the model's training process.
   - **Setting Labels**:  
     The modified completion labels (`model_labels["input_ids"]`) are added to `model_inputs` under the key `'labels'`.

5. **Returning Processed Data**:  
   The function returns the `model_inputs` dictionary, which now includes both the tokenized prompts and the adjusted labels.
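
The padding and masking logic can be followed on toy token IDs without loading a tokenizer. The sketch below assumes a pad token ID of 0, the value used by T5 tokenizers; `pad_or_truncate` and `mask_padding` are hypothetical helpers, and the real tokenizer also handles special tokens such as end-of-sequence, which this sketch ignores.

```python
PAD_TOKEN_ID = 0      # T5 tokenizers use 0 as the pad token ID
IGNORE_INDEX = -100   # label value skipped by the loss function

def pad_or_truncate(ids, max_length, pad_id=PAD_TOKEN_ID):
    """Cut ids to max_length, then right-pad with pad_id up to max_length."""
    ids = ids[:max_length]
    return ids + [pad_id] * (max_length - len(ids))

def mask_padding(label_ids, pad_id=PAD_TOKEN_ID):
    """Replace pad positions in the labels with IGNORE_INDEX."""
    return [(t if t != pad_id else IGNORE_INDEX) for t in label_ids]

# A prompt longer than the limit is truncated; a short label is padded and masked
input_ids = pad_or_truncate([12198, 1635, 1737, 8, 826, 1108, 10], max_length=5)
labels = mask_padding(pad_or_truncate([363, 19, 1], max_length=5))

print(input_ids)
print(labels)
```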


In this step, we apply the `padding_tokenizer` function to the entire dataset. This ensures that all the data is formatted correctly for the model.

In [22]:
ds_tokens = ds.map(padding_tokenizer, batched=True, remove_columns=['text', 'summary', 'topic', 'url', 'title', 'date', 'prompt'])

Map:   0%|          | 0/1500 [00:00<?, ? examples/s]

Map:   0%|          | 0/500 [00:00<?, ? examples/s]

Map:   0%|          | 0/200 [00:00<?, ? examples/s]

#### What the Code Does:

1. **Apply the Tokenizer Function**:  
   The `map` method is used to apply the `padding_tokenizer` function to each batch of examples in the dataset. This function tokenizes and pads the inputs and labels, preparing them for model training.

2. **Batch Processing**:
   - **`batched=True`**:  
     This parameter processes the data in batches, which improves efficiency by handling multiple examples at once.

3. **Remove Unnecessary Columns**:
   - **`remove_columns`**:  
     After processing, we remove columns that are no longer needed, such as `'text'`, `'summary'`, `'topic'`, `'url'`, `'title'`, `'date'`, and `'prompt'`. This helps in keeping the dataset clean and focused on the tokenized data.

4. **Result**:
   - The result is a dataset (`ds_tokens`) that includes only the essential tokenized inputs and labels, making it ready for the fine-tuning process.
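
With `batched=True`, the mapped function receives a dictionary of lists (one list per column) rather than a single example, and typically returns new columns as lists of the same length. The sketch below shows that shape with a made-up batch; `toy_batch_fn` is a hypothetical function that counts words instead of tokenizing.

```python
def toy_batch_fn(batch):
    """Receives columns as lists; returns new columns as equal-length lists."""
    return {"n_words": [len(text.split()) for text in batch["prompt"]]}

# A batch as the mapped function sees it when batched=True
batch = {
    "prompt": [
        "Summarize the following article:\n\nHola mundo.",
        "Summarize the following article:\n\nOtro texto breve aquí.",
    ],
    "summary": ["Saludo.", "Texto."],
}

new_columns = toy_batch_fn(batch)
print(new_columns)
```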

In [23]:
ds_tokens

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 1500
    })
    validation: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 500
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 200
    })
})

In [24]:
ds_tokens["train"]["input_ids"][10]

[12198, 1635, 1737, 8, 826, 1108, 10, 28774, 2, 9, 4244, 4505, 9, 26, 32, 3,
 35, 2464, 73, 3, 7, 994, 235, 3, 9, 2, 32, 6900, 15, 3044, 23, 1621,
 20, 27353, 12765, 20, 50, 24301, 15644, 3, 35, 50, 7, 443, 60, 449, 9, 7,
 3, 63, 6, 3, 9, 12553, 17, 9, 20, 50, 7, 3, 31812, 7, 11343, 15,
 7, 238, 3606, 291, 2975, 3534, 63, 3, 15, 40, 3016, 6626, 20, 40, 8226, 6,
 19850, 32, 276, 154, 2638, 15612, 138, 10891, 9, 6, 5569, 21628, 9, 3, 6071, 3,
 35, 50, 3, 107, 17905, 142, 4244, 3, 2110, 19042, 73, 1059, 32, 20, 586, 140,
 2260, 5569, 20, 115, 9, 1927, 20, 3, 26756, 4035, 49, 235, 7, 5, 4498, 17,
 9, 3, 15, 40, 330, 9, 26, 32, 507, 20, 3, 26, 1294, 21388, 6, 3,
 2, 40, 2998, 32, 3, 26, 2, 9, 20, 40, 238, 142, 3, 10475, 782, 20,
 3927, 32, 7, 6, 50, 3, 31812, 20, 1590, 15, 75, 28594, 3, 15, 7, 20,
 3, 16253, 2688, 6, 3, 9,


### Evaluating Model Performance

In this part of the fine-tuning process, we set up the evaluation metrics and preprocess the text for accurate evaluation of model outputs.

In [25]:
import evaluate
import nltk
import numpy as np
from nltk.tokenize import sent_tokenize

nltk.download("punkt")

# Evaluation metric
metric = evaluate.load("rouge")

# Helper function to postprocess the decoded text before scoring
def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [label.strip() for label in labels]

    # rougeLSum expects a new line after each sentence
    preds = ["\n".join(sent_tokenize(pred)) for pred in preds]
    labels = ["\n".join(sent_tokenize(label)) for label in labels]

    return preds, labels

def compute_metrics(eval_preds):
    preds, labels = eval_preds

    if isinstance(preds, tuple):
        preds = preds[0]

    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

    # Replace -100 in labels as it cannot be decoded
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Postprocess the decoded text (one sentence per line)
    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)
    result = metric.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
    result = {k: round(v * 100, 4) for k, v in result.items()}
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
    result["gen_len"] = np.mean(prediction_lens)
    return result

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Downloading builder script:   0%|          | 0.00/6.27k [00:00<?, ?B/s]

#### Steps in the Code:

1. **Import Libraries**:  
   We import necessary libraries for evaluation and text processing:
   - `evaluate` for metrics computation.
   - `nltk` for natural language processing tasks.
   - `numpy` for numerical operations.
   - `sent_tokenize` from `nltk` for sentence tokenization.

2. **Download NLTK Data**:
   - **`nltk.download("punkt")`**:  
     This downloads the Punkt tokenizer models, which are used for sentence tokenization.

3. **Load Evaluation Metric**:
   - **`metric = evaluate.load("rouge")`**:  
     We load the ROUGE metric, which is commonly used for evaluating summarization models.

4. **Text Postprocessing**:
   - **`postprocess_text(preds, labels)`**:  
     This helper function strips leading and trailing whitespace from the predicted and reference texts and puts each sentence on its own line, the format the `rougeLsum` score expects.

5. **Compute Metrics**:
   - **`compute_metrics(eval_preds)`**:  
     This function takes the predictions and labels from the evaluation, decodes them, and calculates the evaluation metrics:
     - **Decoding**: Converts token IDs back to text, first replacing the `-100` label values with the pad token ID so that they can be decoded.
     - **Postprocessing**: Applies `postprocess_text` to prepare the text for ROUGE evaluation.
     - **Metric Calculation**: Computes the ROUGE scores with stemming enabled (`use_stemmer=True`).
     - **Formatting**: Multiplies each score by 100 and rounds it to four decimal places.
     - **Generation Length**: Adds `gen_len`, the average number of non-padding tokens per prediction.
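
To make the metric less of a black box, the sketch below computes a simplified ROUGE-1 F1 score as clipped unigram overlap between a prediction and a reference. It is not the implementation used by `evaluate` or `rouge_score`: it lowercases and splits on whitespace only, with no stemming, and the function name `rouge1_f1` and the two sentences are made up for illustration.

```python
from collections import Counter

def rouge1_f1(prediction: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: clipped unigram overlap on lowercased whitespace tokens."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared token
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

reference = "menos muertos en carretera en 2009"
prediction = "menos muertos en 2009"
score = rouge1_f1(prediction, reference)

# Same scaling as compute_metrics: percentage rounded to four decimals
print(round(score * 100, 4))
```

Here every predicted word appears in the reference (precision 1.0), while the prediction covers four of the six reference words (recall of about 0.67), so the F1 score sits between the two.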

### Loading the Pre-Trained Model

In this step, we load the pre-trained model from the Hugging Face Hub using the `transformers` library. This is the model we will fine-tune on our specific task.

In [26]:
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-small")

#### What the Code Does:

1. **Import the Model Class**:
   - **`AutoModelForSeq2SeqLM`**:  
     This class from the `transformers` library is used to load sequence-to-sequence models. It's suitable for tasks like text generation, translation, and summarization.

2. **Load the Pre-Trained Model**:
   - **`model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-small")`**:  
     This line initializes the model with pre-trained weights for the "google/flan-t5-small" model. The `from_pretrained` method loads the model configuration and weights from the specified model identifier.

### Purpose and Objective of the Data Collator

The `DataCollator` plays a crucial role in **preparing data for training machine learning models**, especially in Natural Language Processing (NLP) tasks and sequence-to-sequence (seq2seq) models. Here’s a detailed overview of its objectives and purposes:

#### 1. **Handling Padding**
   - **Objective**: Padding is required when examples in a batch have different lengths. The `DataCollator` ensures that all sequences in a batch are the same length by adding padding tokens.
   - **Purpose**: To facilitate efficient batch processing, since the examples in a batch are stacked into a single tensor and must therefore all have the same length.

#### 2. **Configuring Special Tokens**
   - **Objective**: During training, it is crucial to handle padding tokens in labels and inputs correctly.
   - **Purpose**: The `DataCollator` pads the labels with a dedicated value, set through `label_pad_token_id`, so that padded positions are excluded from the model's loss calculation.

#### 3. **Padding Length Optimization**
   - **Objective**: Optimize memory usage and training performance.
   - **Purpose**: Padding sequences so that their length is a multiple of a certain value (like 8) can improve performance on some hardware, such as GPUs. This is managed through the `pad_to_multiple_of` parameter, which affects the padded sequence length, not the number of examples in a batch.

#### 4. **Preparation for Training**
   - **Objective**: Convert and organize data into a format that the model can process directly.
   - **Purpose**: The `DataCollator` prepares the data to be in the correct format for the model during training, handling padding and alignment consistently.
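
To make the effect of the label padding value concrete, here is a minimal NumPy sketch of a cross-entropy loss that skips positions labeled `-100`. It is an illustration of the idea only, not the loss code used inside `transformers` or PyTorch, and the function name `masked_cross_entropy` is hypothetical.

```python
import numpy as np


def masked_cross_entropy(logits, labels, ignore_index=-100):
    """Mean cross-entropy over the positions whose label is not ignore_index.

    logits: array of shape (seq_len, vocab_size)
    labels: array of shape (seq_len,)
    """
    logits = np.asarray(logits, dtype=np.float64)
    labels = np.asarray(labels)

    # Numerically stable log-softmax over the vocabulary axis
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

    # Keep only the positions that carry a real target token
    keep = labels != ignore_index
    picked = log_probs[keep, labels[keep]]
    return float(-picked.mean())


# Toy example: 3 positions, vocabulary of 4 tokens, last position is padding
logits = np.zeros((3, 4))
labels = np.array([1, 2, -100])
loss = masked_cross_entropy(logits, labels)
print(loss)
```

Changing the logits at the padded position leaves the loss unchanged, which is exactly why padded label positions do not influence training.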


### Setting Up the Data Collator for Training

In this step, we configure the `DataCollatorForSeq2Seq` to prepare the data for training. The data collator handles padding and batching, ensuring that the model receives properly formatted inputs and labels.

In [30]:
from transformers import DataCollatorForSeq2Seq

# Ignore padding-related tokens during the training process for prompts
label_pad_token_id = -100  # Tokens with this ID will be ignored in the loss calculation (used to pad labels)

# Data collator for model training
data_collator = DataCollatorForSeq2Seq(
    tokenizer,  # Tokenizer used for processing text data
    model=model,  # The model being fine-tuned
    label_pad_token_id=label_pad_token_id,  # Padding tokens in labels will be replaced with -100 to ignore them during loss computation
    pad_to_multiple_of=8  # Pad sequences to the nearest multiple of 8 for efficiency on certain hardware (e.g., GPUs)
)

#### What the Code Does:

1. **Import the Data Collator**:
   - **`DataCollatorForSeq2Seq`**:  
     This class from the `transformers` library is used to handle padding and other preprocessing tasks for sequence-to-sequence models.

2. **Define Padding Token ID**:
   - **`label_pad_token_id = -100`**:  
     This value is used to indicate padding tokens in the labels, which should be ignored during training. The `-100` value is chosen because it is the default `ignore_index` of PyTorch's cross-entropy loss, so positions labeled `-100` do not contribute to the loss.

3. **Create the Data Collator**:
   - **`data_collator = DataCollatorForSeq2Seq`**:  
     We initialize the data collator with the following parameters:
     - **`tokenizer`**: The tokenizer used to convert text into token IDs.
     - **`model`**: The model for which the data collator is being set up.
     - **`label_pad_token_id`**: Specifies which token ID should be treated as padding in the labels.
     - **`pad_to_multiple_of=8`**: Pads each batch so that the sequence length is a multiple of 8, which can help with memory optimization and performance on some hardware.
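
The padding behaviour described above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the implementation of `DataCollatorForSeq2Seq`: the function `pad_batch` is hypothetical, and `pad_token_id=0` assumes the T5 tokenizer's padding ID.

```python
def pad_batch(input_ids, labels, pad_token_id=0,
              label_pad_token_id=-100, pad_to_multiple_of=8):
    """Pad inputs and labels of one batch to a common length."""

    def target_length(sequences):
        longest = max(len(seq) for seq in sequences)
        if pad_to_multiple_of:
            # Round up to the next multiple of pad_to_multiple_of
            longest = -(-longest // pad_to_multiple_of) * pad_to_multiple_of
        return longest

    input_len = target_length(input_ids)
    label_len = target_length(labels)

    return {
        # Inputs are padded with the tokenizer's padding ID
        "input_ids": [seq + [pad_token_id] * (input_len - len(seq))
                      for seq in input_ids],
        # The attention mask marks real tokens with 1 and padding with 0
        "attention_mask": [[1] * len(seq) + [0] * (input_len - len(seq))
                           for seq in input_ids],
        # Labels are padded with -100 so the loss ignores those positions
        "labels": [seq + [label_pad_token_id] * (label_len - len(seq))
                   for seq in labels],
    }


# Toy batch: two inputs of length 3 and 9, two label sequences of length 2 and 4
batch = pad_batch([[5, 6, 7], [8] * 9], [[3, 1], [4, 5, 6, 1]])
print(len(batch["input_ids"][0]), len(batch["labels"][0]))
```

In this toy batch the longest input has 9 tokens, so the inputs are padded to 16, while the labels (longest 4) are padded to 8. The number of examples in the batch stays at 2.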

### Setting Up and Running the Training

In this step, we configure the training process for fine-tuning a sequence-to-sequence model using the `Seq2SeqTrainer` from the Hugging Face library. This setup includes defining training arguments and creating the trainer instance.

In [31]:
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

REPOSITORY = "/content/flan-t5-small-fine-tuned"

training_args = Seq2SeqTrainingArguments(
    # Training hyperparameters
    output_dir=REPOSITORY,  # Directory to save model checkpoints and logs
    per_device_train_batch_size=8,  # Batch size per device during training
    per_device_eval_batch_size=8,  # Batch size per device during evaluation
    predict_with_generate=True,  # Generate predictions during evaluation (useful for tasks like summarization)
    fp16=False,  # Use full precision (FP32) instead of mixed precision (FP16) to avoid overflow issues
    learning_rate=5e-5,  # Learning rate for the optimizer
    num_train_epochs=4,  # Number of complete passes through the training dataset

    # Logging and evaluation strategies
    logging_dir=f"{REPOSITORY}/logs",  # Directory to save training logs
    logging_strategy="steps",  # Log training metrics every `logging_steps`
    logging_steps=500,  # Log metrics every 500 steps
    evaluation_strategy="epoch",  # Evaluate the model at the end of each epoch
    save_strategy="epoch",  # Save a checkpoint at the end of each epoch
    save_total_limit=2,  # Keep only the 2 most recent model checkpoints
    load_best_model_at_end=True,  # Load the best model (according to validation performance) at the end of training
)

# Create the training instance
trainer = Seq2SeqTrainer(
    model=model,  # The model to be fine-tuned
    args=training_args,  # The training configuration defined above
    data_collator=data_collator,  # Function to handle padding and batching of data
    train_dataset=ds_tokens["train"],  # The training dataset
    eval_dataset=ds_tokens["validation"],  # The validation dataset
    compute_metrics=compute_metrics,  # Function to compute evaluation metrics, e.g., ROUGE
)




#### What the Code Does:

1. **Import Required Classes**:
   - **`Seq2SeqTrainer`** and **`Seq2SeqTrainingArguments`**:  
     These classes are used to set up and manage the training process for sequence-to-sequence models.

2. **Define Training Options**:
   - **`training_args`**:  
     The `Seq2SeqTrainingArguments` class is used to specify various hyperparameters and strategies for training:
     - **`output_dir`**: Directory where the model checkpoints and logs will be saved.
     - **`per_device_train_batch_size`** and **`per_device_eval_batch_size`**: Batch sizes for training and evaluation.
     - **`predict_with_generate`**: Uses the model's `generate` method to produce predictions during evaluation, which is required to compute text-based metrics such as ROUGE.
     - **`fp16`**: Set to `False` to avoid overflow issues with mixed precision training.
     - **`learning_rate`**: Learning rate for the optimizer.
     - **`num_train_epochs`**: Number of training epochs.
     - **`logging_dir`**: Directory for logging training metrics.
     - **`logging_strategy`** and **`logging_steps`**: Defines how often logs are saved.
     - **`evaluation_strategy`** and **`save_strategy`**: Specifies when to evaluate and save the model.
     - **`save_total_limit`**: Limits the number of saved checkpoints.
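     - **`load_best_model_at_end`**: Ensures that the best checkpoint is loaded at the end of training. Unless `metric_for_best_model` is set, the best checkpoint is the one with the lowest validation loss.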

3. **Create the Training Instance**:
   - **`trainer`**:  
     An instance of `Seq2SeqTrainer` is created with the following parameters:
     - **`model`**: The model to be fine-tuned.
     - **`args`**: The training arguments defined earlier.
     - **`data_collator`**: Handles padding and batching of the data.
     - **`train_dataset`**: The training dataset.
     - **`eval_dataset`**: The evaluation dataset.
     - **`compute_metrics`**: Function to compute evaluation metrics.
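
If the best checkpoint should be selected by a ROUGE score rather than by validation loss, the training arguments accept `metric_for_best_model` and `greater_is_better`. The fragment below is a sketch of the extra keyword arguments only, not a change the notebook makes. The dictionary name `best_model_kwargs` is hypothetical, and the choice of `"rougeL"` assumes that `compute_metrics` returns a key with that name.

```python
# Hypothetical extra keyword arguments for Seq2SeqTrainingArguments
best_model_kwargs = {
    "load_best_model_at_end": True,     # already set in the notebook
    "metric_for_best_model": "rougeL",  # assumption: key returned by compute_metrics
    "greater_is_better": True,          # higher ROUGE is better, unlike loss
}

for name, value in best_model_kwargs.items():
    print(f"{name}={value!r}")
```

These entries could be passed as `Seq2SeqTrainingArguments(..., **best_model_kwargs)` in place of the single `load_best_model_at_end=True` argument.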

### Saving the Tokenizer

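When fine-tuning a model, it's important to save the tokenizer alongside it so that it can be reused later. This ensures that the tokenization process at inference time is consistent with the one used during training. Here the tokenizer is saved before training starts, since training does not modify it.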

In [32]:
# Save the tokenizer to disk for later use
tokenizer.save_pretrained(f"{REPOSITORY}/tokenizer")

('/content/flan-t5-small-fine-tuned/tokenizer/tokenizer_config.json',
 '/content/flan-t5-small-fine-tuned/tokenizer/special_tokens_map.json',
 '/content/flan-t5-small-fine-tuned/tokenizer/spiece.model',
 '/content/flan-t5-small-fine-tuned/tokenizer/added_tokens.json',
 '/content/flan-t5-small-fine-tuned/tokenizer/tokenizer.json')

#### What the Code Does:

1. **Save the Tokenizer**:
   - **`tokenizer.save_pretrained(f"{REPOSITORY}/tokenizer")`**:  
     This line of code saves the tokenizer to the specified directory. The `save_pretrained` method writes the tokenizer configuration and vocabulary files to disk, allowing you to reload the tokenizer later without having to reinitialize it from scratch.

### Starting the Training

With the `Seq2SeqTrainer` instance set up and configured, you can now start the training process. This step begins the actual fine-tuning of the model using the specified training arguments, dataset, and evaluation criteria.

In [33]:
# Start the training
trainer.train()

| Epoch | Training Loss | Validation Loss | Rouge1 | Rouge2 | Rougel | Rougelsum | Gen Len |
|---|---|---|---|---|---|---|---|
| 1 | No log | 2.259114 | 15.1937 | 2.9033 | 13.1427 | 13.2281 | 18.986 |
| 2 | No log | 2.251017 | 15.543 | 2.9838 | 13.5111 | 13.5995 | 18.986 |
| 3 | 2.712200 | 2.248771 | 15.5534 | 3.0164 | 13.6785 | 13.7832 | 18.998 |
| 4 | 2.712200 | 2.245695 | 15.1986 | 2.9635 | 13.4001 | 13.4763 | 18.998 |


There were missing keys in the checkpoint model loaded: ['encoder.embed_tokens.weight', 'decoder.embed_tokens.weight'].


TrainOutput(global_step=752, training_loss=2.6751643241719996, metrics={'train_runtime': 556.1777, 'train_samples_per_second': 10.788, 'train_steps_per_second': 1.352, 'total_flos': 1115343028224000.0, 'train_loss': 2.6751643241719996, 'epoch': 4.0})

### Training Results Explanation

The following results show the performance metrics for each epoch during the training of a text generation model.

#### Columns:

- **Epoch**: The number of the epoch out of the total epochs.
- **Training Loss**: Measures how well the **model fits the training data**. Lower values indicate a better fit.
- **Validation Loss**: Evaluates how well the **model generalizes to unseen data in the validation set**. Lower values are better.
- **Rouge1**: Measures the **overlap of unigrams** (single words) between the generated text and reference summaries. Higher values indicate better performance.
- **Rouge2**: Assesses the **overlap of bigrams** (two consecutive words) between the generated text and reference summaries. Higher values indicate better performance.
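- **Rougel**: Measures the **longest common subsequence** (LCS) of words shared by the generated text and the reference summary, which rewards matching words that appear in the same order. Higher values suggest better performance.
- **Rougelsum**: Applies the same LCS-based measure sentence by sentence, treating newlines as sentence boundaries, and aggregates the result over the whole summary. Higher values indicate better performance.
- **Gen Len**: Indicates the average length, in tokens, of the generated summaries.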
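
The ROUGE columns can be illustrated with a simplified pure-Python sketch. It assumes whitespace tokenization and reports the F1 variant of each score; the `rouge_score` package used by the notebook applies its own tokenization, so the numbers it reports may differ. The function names here are hypothetical.

```python
from collections import Counter


def rouge_n_f1(prediction, reference, n=1):
    """F1 over overlapping n-grams (n=1 for Rouge1, n=2 for Rouge2)."""
    def ngrams(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    pred, ref = ngrams(prediction), ngrams(reference)
    overlap = sum((pred & ref).values())  # clipped n-gram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    previous = [0] * (len(b) + 1)
    for x in a:
        current = [0]
        for j, y in enumerate(b, start=1):
            current.append(previous[j - 1] + 1 if x == y
                           else max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def rouge_l_f1(prediction, reference):
    """F1 based on the longest common subsequence (Rougel)."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(pred), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


prediction = "la energía solar genera electricidad"
reference = "la energía solar reduce las emisiones"

print(rouge_n_f1(prediction, reference, n=1))
print(rouge_n_f1(prediction, reference, n=2))
print(rouge_l_f1(prediction, reference))
```

The table reports these scores multiplied by 100, so a Rouge1 of about 15 corresponds to an F1 of roughly 0.15.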

#### Interpretation:

1. **Training and Validation Loss**:
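   - The training loss is only logged every 500 steps, so epochs 1 and 2 show `No log` and epochs 3 and 4 both display the single value recorded at step 500 (2.7122). **The validation loss decreases slightly in every epoch, from 2.2591 to 2.2457, indicating a small improvement in generalization to unseen data.**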

2. **ROUGE Scores**:
   - ROUGE scores change only marginally between epochs. Rouge1, Rouge2, and Rougel rise from epoch 1 to a peak at epoch 3 (15.55, 3.02, and 13.68) and dip slightly at epoch 4 (15.20, 2.96, and 13.40), so the **match between generated text and reference summaries** improves only a little in terms of unigrams, bigrams, and longest common subsequence.

3. **Gen Len**:
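   - The average length of the generated text remains consistent at approximately 18.99 tokens across all epochs. This is close to the 20-token default generation limit in `transformers`, so the summaries are likely being cut off at that limit rather than ending naturally.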

In summary, the training results show a small but steady decrease in validation loss, largely flat ROUGE scores across epochs, and consistent summary length. With a Rouge1 of about 15, the overlap between generated and reference summaries remains limited, so there is still room to improve the model.
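
The `No log` entries and the checkpoint name can be checked with a little arithmetic on the values reported by `TrainOutput`. The sketch below assumes training ran on a single device. The training-set size is not stated in this section, so it is estimated from the reported throughput rather than taken as given.

```python
import math

# Values copied from the training configuration and the TrainOutput above
global_step = 752
num_train_epochs = 4
per_device_train_batch_size = 8
logging_steps = 500
train_runtime = 556.1777          # seconds
train_samples_per_second = 10.788

# Optimizer steps in one epoch
steps_per_epoch = global_step // num_train_epochs

# Epoch during which the first training-loss log (step 500) is written
first_logged_epoch = math.ceil(logging_steps / steps_per_epoch)

# Estimated number of training examples (assumption: single device)
estimated_train_size = round(
    train_samples_per_second * train_runtime / num_train_epochs
)

print(steps_per_epoch, first_logged_epoch, estimated_train_size)
```

Under these assumptions, step 500 falls inside the third epoch, which is why a training loss first appears in the epoch 3 row, and the last checkpoint is written at step 752.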

## Text Generation and Evaluation with Fine-Tuned Flan-T5

### Loading the Fine-Tuned Model and Tokenizer

This code snippet demonstrates how to load a **fine-tuned T5 model** and its corresponding tokenizer from a saved checkpoint. The model and tokenizer are retrieved from the local output directory used during training.

In [34]:
from transformers import T5Tokenizer, T5ForConditionalGeneration

REPOSITORY = "/content/flan-t5-small-fine-tuned"

# Import the tokenizer
tokenizer_FT5_FT = T5Tokenizer.from_pretrained(f"{REPOSITORY}/tokenizer")

# Import the fine-tuned model
model_FT5_FT = T5ForConditionalGeneration.from_pretrained(f"{REPOSITORY}/checkpoint-752", device_map="auto")

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


#### Code Breakdown:

1. **Importing the Necessary Classes**:
   - The code begins by importing the `T5Tokenizer` and `T5ForConditionalGeneration` classes from the Hugging Face `transformers` library. These classes are used for text generation tasks with the T5 model.

2. **Setting the Repository Path**:
   - **`REPOSITORY = "/content/flan-t5-small-fine-tuned"`**:  
     This defines the path to the directory where the fine-tuned model and tokenizer are stored. In this case, it is the same local directory that was used as `output_dir` during training.

3. **Loading the Fine-Tuned Tokenizer**:
   - **`tokenizer_FT5_FT = T5Tokenizer.from_pretrained(f"{REPOSITORY}/tokenizer")`**:  
     This line loads the tokenizer that was saved earlier. The tokenizer itself is not changed by fine-tuning. The `from_pretrained` method retrieves it from the specified directory, making it ready for use in tokenizing input data.

4. **Loading the Fine-Tuned Model**:
   - **`model_FT5_FT = T5ForConditionalGeneration.from_pretrained(f"{REPOSITORY}/checkpoint-752", device_map="auto")`**:  
     This line loads the fine-tuned T5 model from a specific checkpoint (in this case, `checkpoint-752`). The `device_map="auto"` option automatically assigns the model's layers to the available hardware (e.g., CPU, GPU) for efficient execution. This allows the fine-tuned model to be used for generating text based on the training it has undergone.

#### Purpose:

- **Tokenizer**: The saved tokenizer ensures that input text is tokenized in the same way as during training, preserving consistency in text processing.
- **Model**: The fine-tuned model is ready for inference or further evaluation, loaded from a checkpoint where it was last saved.

By loading both the tokenizer and the model, this code prepares the system for generating text or performing tasks using the fine-tuned Flan-T5 model.

### Text Generation Process

In this section, a fine-tuned language model is used to create a summary of a text about renewable energy. The text is first tokenized into a format that the model can process. The model then generates a summary from the tokenized text. Finally, the summary is decoded back into readable text and printed. This process transforms detailed content into a concise summary using advanced language processing techniques.
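
The generation step can be pictured as a loop that appends one token at a time. The sketch below is a toy version of greedy decoding under stated assumptions: it is not the `generate` implementation, `greedy_decode` and `toy_next_token` are hypothetical names, and the IDs follow the T5 convention in which the padding token (ID 0) starts the decoder sequence and `</s>` (ID 1) ends it.

```python
PAD_ID = 0  # T5 uses the padding token as the decoder start token
EOS_ID = 1  # T5's end-of-sequence token (</s>)


def greedy_decode(next_token_fn, max_length, start_id=PAD_ID, eos_id=EOS_ID):
    """Append the most likely next token until EOS or max_length is reached."""
    tokens = [start_id]
    while len(tokens) < max_length:
        next_id = next_token_fn(tokens)
        tokens.append(next_id)
        if next_id == eos_id:
            break
    return tokens


def toy_next_token(tokens):
    """Stand-in for the model: always emits the same short sequence."""
    script = [5, 6, 7, EOS_ID]
    return script[len(tokens) - 1]


print(greedy_decode(toy_next_token, max_length=300))
print(greedy_decode(toy_next_token, max_length=3))
```

This is why the printed outputs in the cells that follow begin with `<pad>` and end with `</s>`. Passing `skip_special_tokens=True` to `decode` hides these markers in the printed text.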


In [47]:
text = """La energía renovable está jugando un papel crucial en la transición hacia un futuro más sostenible. A medida que los recursos fósiles se agotan y las preocupaciones sobre el cambio climático aumentan, las fuentes de energía renovable, como la solar, eólica e hidroeléctrica, se están convirtiendo en alternativas viables y cada vez más eficientes.

La energía solar, por ejemplo, utiliza paneles fotovoltaicos para convertir la luz del sol en electricidad. Esta tecnología ha avanzado significativamente en las últimas décadas, y los costos asociados con la instalación de sistemas solares han disminuido, lo que la hace más accesible para hogares y empresas. La energía solar no solo reduce las emisiones de gases de efecto invernadero, sino que también ofrece una fuente de energía gratuita y abundante.

La energía eólica, por otro lado, aprovecha la energía cinética del viento para generar electricidad. Los aerogeneradores, que se encuentran en parques eólicos tanto terrestres como marinos, convierten el viento en energía eléctrica de manera eficiente. Esta forma de energía es especialmente útil en áreas con altos niveles de viento y puede proporcionar una parte significativa de la electricidad necesaria para alimentar redes eléctricas.

La energía hidroeléctrica es otra fuente renovable que utiliza el flujo de agua para generar electricidad. Las plantas hidroeléctricas, que suelen estar ubicadas en presas, convierten la energía del agua en movimiento en energía eléctrica. Aunque la construcción de grandes represas puede tener impactos ambientales significativos, la energía hidroeléctrica sigue siendo una fuente confiable y comprobada de energía renovable.

En conjunto, estas fuentes de energía renovable ofrecen una solución prometedora para reducir nuestra dependencia de los combustibles fósiles y mitigar los efectos del cambio climático. La transición hacia un sistema energético más sostenible no solo es crucial para el medio ambiente, sino que también puede traer beneficios económicos, como la creación de empleos y la reducción de los costos de energía a largo plazo."""

# Build the prompt according to the fine-tuning template
prompt_template = f"Summarize the following article:\n\n{text}"

# Tokenize the prompt using the tokenizer
prompt_tokens = tokenizer_FT5_FT(prompt_template, return_tensors="pt").input_ids.to("cuda")

# Generate the output tokens using the model
outputs = model_FT5_FT.generate(prompt_tokens, max_length=300)

# Decode the generated tokens back to text
print(tokenizer_FT5_FT.decode(outputs[0]))

<pad> La energ<unk> a solar aprovecha una fuente de energ<unk> a renovable para generar electricidad</s>


In [45]:
text = """La tecnología de blockchain ha emergido como una de las innovaciones más disruptivas de la última década. \
Originalmente desarrollada como la base para las criptomonedas, como Bitcoin, blockchain ha demostrado ser una \
tecnología versátil con aplicaciones potenciales en una variedad de sectores.

En el ámbito financiero, blockchain ofrece una forma segura y transparente de registrar transacciones sin necesidad de intermediarios. \
Esto no solo reduce el riesgo de fraude y errores, sino que también puede acelerar el procesamiento de transacciones y reducir costos operativos. \
La descentralización inherente a blockchain también permite una mayor inclusión financiera al proporcionar acceso a servicios bancarios para personas \
en regiones no atendidas por las instituciones financieras tradicionales.

En el sector de la cadena de suministro, blockchain está transformando la forma en que se rastrean y verifican los productos. \
Cada etapa del viaje de un producto, desde la producción hasta la entrega, se registra en un libro mayor inmutable. Esto mejora la transparencia, \
permite una trazabilidad completa y puede ayudar a combatir el fraude y la falsificación.

La tecnología de blockchain también está ganando tracción en la industria de la salud, donde puede ser utilizada para mantener registros médicos de manera segura y accesible. \
Los pacientes pueden tener un control más directo sobre su información médica, y los proveedores de atención pueden acceder a datos precisos y completos para tomar decisiones más informadas sobre el tratamiento.

A pesar de sus muchas ventajas, blockchain enfrenta varios desafíos. La escalabilidad es una preocupación importante, ya que las redes blockchain actuales pueden ser lentas y costosas de operar a gran escala. """

# Build the prompt according to the fine-tuning template
prompt_template = f"Summarize the following article:\n\n{text}"

# Tokenize the prompt using the tokenizer
prompt_tokens = tokenizer_FT5_FT(prompt_template, return_tensors="pt").input_ids.to("cuda")

# Generate the output tokens using the model
outputs = model_FT5_FT.generate(prompt_tokens, max_length=300)

# Decode the generated tokens back to text
print(tokenizer_FT5_FT.decode(outputs[0]))

<pad> La tecnolog<unk> a de blockchain emergida como una de las innovaciones más disruptivas de la <unk> ltima década</s>


In [41]:
text = """La Revolución Industrial, que tuvo lugar principalmente en el siglo XIX, \
fue un período de grandes cambios tecnológicos, culturales y socioeconómicos que \
transformó a las sociedades agrarias en sociedades industriales. Durante este tiempo, \
hubo un cambio masivo de mano de obra de las granjas a las fábricas. Esto se debió a \
la invención de nuevas máquinas que podían realizar tareas más rápido y eficientemente \
que los humanos o los animales. Esta transición llevó a un aumento en la producción de \
bienes, pero también tuvo consecuencias negativas, como la explotación laboral y la \
contaminación ambiental."""
# Build the prompt according to the fine-tuning template
prompt_template = f"Summarize the following article:\n\n{text}"

# Tokenize the prompt using the tokenizer
prompt_tokens = tokenizer_FT5_FT(prompt_template, return_tensors="pt").input_ids.to("cuda")

# Generate the output tokens using the model
outputs = model_FT5_FT.generate(prompt_tokens, max_length=300)

# Decode the generated tokens back to text
print(tokenizer_FT5_FT.decode(outputs[0]))

<pad> La Revolución Industrial es un per<unk> odo de grandes cambios tecnológicos, culturales y socioeconómicos que transformó a las sociedades agrarias en sociedades industriales</s>


### Model Evaluation and Text Generation in Batches

This code demonstrates how to evaluate a fine-tuned T5 model by generating predictions (completions) for a test dataset and calculating evaluation metrics. The process is performed in batches for efficiency, and gradient computation is disabled because only inference is needed.

In [48]:
import numpy as np  # Used below to build the arrays passed to compute_metrics
import torch

# Switch the model to evaluation mode
model_FT5_FT.eval()  # This disables dropout and other training-specific behaviors

# Define the batch size
batch_size = 8  # Number of samples to process in each batch

all_predictions = []  # List to store all generated predictions

# Disable gradient calculation and generate completions
with torch.no_grad():  # Disables gradient calculation to save memory during inference
    for i in range(0, len(ds_tokens["test"]["input_ids"]), batch_size):
        # Extract the current batch
        input_ids_batch = torch.tensor(ds_tokens["test"]["input_ids"][i:i+batch_size], device='cuda:0')  # Load the current batch of input IDs onto the GPU

        # Generate predictions from the model
        outputs = model_FT5_FT.generate(input_ids_batch)  # Generate text completions for the current batch

        # Append the predictions to the list
        all_predictions.extend(outputs)  # Store the generated completions

# Calculate metrics
labels = np.array(ds_tokens['test']['labels'])  # Convert the true labels to a NumPy array
completions = np.array([pred.cpu().numpy() for pred in all_predictions])  # Convert the generated predictions from PyTorch tensors to NumPy arrays

metrics = compute_metrics((completions, labels))  # Compute evaluation metrics (e.g., ROUGE scores)

print(metrics)  # Print the evaluation metrics




{'rouge1': 13.9667, 'rouge2': 2.2304, 'rougeL': 12.1538, 'rougeLsum': 12.2348, 'gen_len': 18.985}


#### Code Breakdown:

1. **Importing PyTorch**:
   - **`import torch`**:  
     The code imports PyTorch, which is used for tensor operations and to manage model execution on hardware (e.g., CPU or GPU).

2. **Switching the Model to Evaluation Mode**:
   - **`model_FT5_FT.eval()`**:  
     This switches the model to evaluation mode, disabling certain behaviors specific to training, such as dropout. This ensures that the model performs consistently during inference.

3. **Setting the Batch Size**:
   - **`batch_size = 8`**:  
     The batch size is set to 8, meaning the model will process 8 samples at a time. This helps balance memory usage and computational efficiency during inference.

4. **Initializing a List for Predictions**:
   - **`all_predictions = []`**:  
     An empty list is initialized to store the predictions (completions) generated by the model.

5. **Disabling Gradient Calculation**:
   - **`with torch.no_grad():`**:  
     This context manager disables gradient calculation to save memory and computation during inference, since gradients are not needed when generating predictions.

6. **Processing the Test Dataset in Batches**:
   - **`for i in range(0, len(ds_tokens["test"]["input_ids"]), batch_size):`**:  
     A loop is set up to iterate over the test dataset in batches of size 8.

7. **Extracting the Current Batch**:
   - **`input_ids_batch = torch.tensor(ds_tokens["test"]["input_ids"][i:i+batch_size], device='cuda:0')`**:  
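     This line extracts a batch of input IDs from the test dataset and creates the tensor directly on the first GPU (`cuda:0`) for faster processing. Because the device is hard-coded, a CUDA-capable GPU must be available. Building a single tensor also requires every sequence in the batch to have the same length.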

8. **Generating Predictions**:
   - **`outputs = model_FT5_FT.generate(input_ids_batch)`**:  
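     The model generates predictions (completions) for the current batch of input IDs using the `generate` method. Since no length arguments are passed, the model's default generation settings determine the maximum output length.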

9. **Storing the Predictions**:
   - **`all_predictions.extend(outputs)`**:  
     The generated predictions are added to the `all_predictions` list for later processing.

10. **Calculating Evaluation Metrics**:
    - **`labels = np.array(ds_tokens['test']['labels'])`**:  
      The true labels from the test dataset are converted to a NumPy array.
    - **`completions = np.array([pred.cpu().numpy() for pred in all_predictions])`**:  
      The generated predictions are converted from PyTorch tensors to NumPy arrays.
    - **`metrics = compute_metrics((completions, labels))`**:  
      The `compute_metrics` function is called to calculate evaluation metrics (e.g., ROUGE scores) by comparing the generated predictions to the true labels.

11. **Printing the Metrics**:
    - **`print(metrics)`**:  
      The calculated metrics are printed to provide a quantitative evaluation of the model's performance on the test dataset.

#### Purpose:

- **Batch Processing**: By processing the test dataset in batches, this code optimizes memory usage and speeds up the evaluation process.
- **Evaluation Mode**: Switching the model to evaluation mode ensures that it behaves correctly during inference, and disabling gradient calculation reduces unnecessary computation.
- **Metrics Calculation**: After generating predictions, the code calculates evaluation metrics to assess the model's performance on the summarization task.

This approach provides an efficient way to evaluate a fine-tuned model on a test dataset while generating predictions in batches.
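
The batching pattern, and one way to make the final conversion to NumPy robust, can be sketched as follows. The helpers `iter_batches` and `pad_and_stack` are hypothetical and not part of the notebook. They assume that generated sequences may differ in length between batches, in which case they must be padded to a common length before being stacked into one array.

```python
import numpy as np


def iter_batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def pad_and_stack(sequences, pad_id=0):
    """Right-pad variable-length token sequences and stack them into a 2D array."""
    max_len = max(len(seq) for seq in sequences)
    stacked = np.full((len(sequences), max_len), pad_id, dtype=np.int64)
    for row, seq in enumerate(sequences):
        stacked[row, :len(seq)] = seq
    return stacked


# Toy data: 10 examples processed in batches of 4
batch_sizes = [len(batch) for batch in iter_batches(list(range(10)), batch_size=4)]
print(batch_sizes)

# Toy predictions of different lengths (pad_id=0 assumes the T5 padding ID)
completions = pad_and_stack([[0, 5, 1], [0, 5, 6, 7, 1]])
print(completions)
```

Padding with the tokenizer's padding ID keeps the extra positions invisible to both decoding and the generation-length count.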