# Instruction Fine-Tuning for Spanish Newspaper Article Summarization Using FLAN-T5-Small

In this notebook, we **instruction fine-tune the FLAN-T5-Small language model to improve its summaries of Spanish newspaper articles**. FLAN-T5-Small is the smallest instruction-tuned variant of the T5 family, an encoder-decoder model that treats natural language processing tasks as text-to-text problems. Instruction fine-tuning adapts this pre-trained model to Spanish summarization by training it on prompts that combine an instruction with an article, using the reference summary as the target.

The process involves preparing a dataset of Spanish newspaper articles and their summaries, wrapping each article in an instruction prompt, and fine-tuning the model on the resulting pairs. The goal is a model that generates concise, accurate summaries reflecting the key points of the original article.

**This notebook walks through the necessary steps, including data preparation, model configuration, and evaluation, to obtain a fine-tuned summarization model for Spanish text.**

In [None]:
# nltk is included because the evaluation cell below imports it
!pip install transformers sentencepiece accelerate datasets evaluate rouge_score nltk

In [8]:
from transformers import T5Tokenizer, T5ForConditionalGeneration

In [9]:
def load_t5_model(name):
  tokenizer_T5 = T5Tokenizer.from_pretrained(name)
  model_T5 = T5ForConditionalGeneration.from_pretrained(name, device_map="auto")
  return tokenizer_T5, model_T5

In [15]:
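def generate_response_from_prompt(model_name, prompt, max_length=100):
  """Loads the model called `model_name` and generates a completion for `prompt`."""
  tokenizer_T5, model_T5 = load_t5_model(model_name)
  # Move the input IDs to whichever device the model was placed on
  prompt_tokens = tokenizer_T5(prompt, return_tensors="pt").input_ids.to(model_T5.device)
  outputs = model_T5.generate(prompt_tokens, max_length=max_length)
  # Special tokens (<pad>, </s>) are kept in the decoded string
  return tokenizer_T5.decode(outputs[0])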

In [16]:
text = """Astronomers have detected a mysterious burst of radio waves that took \
8 billion years to reach Earth. The fast radio burst is one of the most distant \
and energetic ever observed. Fast radio bursts (FRBs) are intense bursts of radio \
waves lasting only a few milliseconds, and their origin is unknown. The first FRB \
was discovered in 2007, and since then, hundreds of these fast cosmic flashes \
have been detected, coming from distant points across the universe."""

prompt_template = f"Summarize the following article:\n\n{text}"

generate_response_from_prompt("google/flan-t5-small", prompt_template)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


'<pad> Astronomers have detected a fast radio burst of radio waves that have been detected in the past few decades.</s>'
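
The decoded string above still carries T5's special tokens: `<pad>` is the decoder start token and `</s>` marks the end of the sequence. In `transformers`, passing `skip_special_tokens=True` to `tokenizer.decode` removes them. As a dependency-free illustration, the hypothetical helper `strip_special_tokens` below performs a similar cleanup on an already decoded string; its name and token list are assumptions made for this sketch.

```python
# Hypothetical helper: removes T5 special-token markers from a decoded string.
SPECIAL_TOKENS = ("<pad>", "</s>", "<unk>")

def strip_special_tokens(decoded: str) -> str:
    for token in SPECIAL_TOKENS:
        decoded = decoded.replace(token, "")
    # Collapse the whitespace left behind by the removed markers
    return " ".join(decoded.split())

raw = "<pad> Astronomers have detected a fast radio burst.</s>"
print(strip_special_tokens(raw))
```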

In [17]:
text = """The Industrial Revolution, which took place primarily in the 19th century, \
was a period of significant technological, cultural, and socioeconomic changes \
that transformed agrarian societies into industrial societies. During this time, \
there was a massive shift of labor from farms to factories. This was due to the \
invention of new machines that could perform tasks faster and more efficiently \
than humans or animals. This transition led to an increase in the production of \
goods, but it also had negative consequences, such as labor exploitation and \
environmental pollution."""

prompt_template = f"Summarize the following article:\n\n{text}"

generate_response_from_prompt("google/flan-t5-small", prompt_template)


'<pad> The Industrial Revolution was a period of significant technological, cultural, and socioeconomic changes that led to the creation of agrarian societies.</s>'

In [18]:
text = """The Hubble Telescope, launched into space in 1990, has provided stunning images \
of the universe and has helped scientists gain a better understanding of cosmology."""

prompt_template = f"Summarize the following article:\n\n{text}"

generate_response_from_prompt("google/flan-t5-small", prompt_template)


'<pad> The Hubble Telescope is a telescope that has been used by scientists to study the universe.</s>'

In [19]:
from datasets import load_dataset

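# 'es' selects the Spanish configuration of MLSUM. The loader runs custom code,
# so it asks for confirmation unless trust_remote_code=True is passed
ds = load_dataset("mlsum", 'es')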

The repository for mlsum contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/mlsum.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N] y



In [20]:
ds

DatasetDict({
    train: Dataset({
        features: ['text', 'summary', 'topic', 'url', 'title', 'date'],
        num_rows: 266367
    })
    validation: Dataset({
        features: ['text', 'summary', 'topic', 'url', 'title', 'date'],
        num_rows: 10358
    })
    test: Dataset({
        features: ['text', 'summary', 'topic', 'url', 'title', 'date'],
        num_rows: 13920
    })
})

In [21]:
# Display an example article from the training split
# (indexing the row first avoids materializing the whole column)
ds["train"][10]["text"]

'España ha sumado en 2009 un sexto año consecutivo de reducción de la mortalidad en las carreteras y, a falta de las cifras oficiales que confirmará hoy el ministro del Interior, Alfredo Pérez Rubalcaba, por primera vez en la historia se ha cerrado un periodo de 12 meses por debajo de 2.000 muertos. Hasta el pasado 18 de diciembre, último día del que se dispone de datos, la cifra de fallecidos es de 1.826, a los que hay que sumar los 31 de la primera fase de la operación especial de Navidad, entre el 22 y el 27 de diciembre. La estadística resulta especialmente llamativa si se compara con el primer año en el que se hizo balance, 1969, en el que 3.951 personas perdieron la vida. Desde entonces, el pico más alto se alcanzó en 1989 con 7.000 víctimas. En las cifras de mortalidad de 2009 destacan otros aspectos positivos como el hecho de que sólo se han superado los 200 muertos en un mes (agosto) y que ha habido cinco días sin víctimas mortales sobre el asfalto, el último el pasado 11 de d

In [23]:
# Display the summary corresponding to the previous example
ds["train"][10]["summary"]

'2009 es el periodo con menos fallecidos en accidentes de tráfico en cuatro décadas, desde que existen datos oficiales'

In [24]:
# Reduce the dataset size
NUM_EX_TRAIN = 1500
NUM_EX_VAL = 500
NUM_EX_TEST = 200

# Training subset
ds['train'] = ds['train'].select(range(NUM_EX_TRAIN))

# Validation subset
ds['validation'] = ds['validation'].select(range(NUM_EX_VAL))

# Test subset
ds['test'] = ds['test'].select(range(NUM_EX_TEST))
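
Taking the first N rows keeps whatever ordering the dataset was stored in, so the subset may not be representative if the rows are sorted (for example by date). A possible alternative, sketched below, draws a reproducible random set of row indices; the resulting list could then be passed to `Dataset.select` in place of `range(...)`. The helper name `sample_indices` and the seed value are assumptions for illustration.

```python
import random

def sample_indices(num_rows: int, num_samples: int, seed: int = 42) -> list:
    """Return `num_samples` distinct row indices in [0, num_rows), sorted."""
    rng = random.Random(seed)  # Local generator, so the global random state is untouched
    return sorted(rng.sample(range(num_rows), num_samples))

# 266367 is the size of the MLSUM 'es' training split reported above
train_idx = sample_indices(266367, 1500)
print(len(train_idx), train_idx[:3])

# Possible usage (not run here): ds["train"] = ds["train"].select(train_idx)
```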

In [25]:
def parse_dataset(example):
  """Processes the examples to adapt them to the template."""
  return {"prompt": f"Summarize the following article:\n\n{example['text']}"}

In [26]:
ds["train"] = ds["train"].map(parse_dataset)
ds["validation"] = ds["validation"].map(parse_dataset)
ds["test"] = ds["test"].map(parse_dataset)


In [27]:
print(ds["train"][10]["prompt"])

Summarize the following article:

España ha sumado en 2009 un sexto año consecutivo de reducción de la mortalidad en las carreteras y, a falta de las cifras oficiales que confirmará hoy el ministro del Interior, Alfredo Pérez Rubalcaba, por primera vez en la historia se ha cerrado un periodo de 12 meses por debajo de 2.000 muertos. Hasta el pasado 18 de diciembre, último día del que se dispone de datos, la cifra de fallecidos es de 1.826, a los que hay que sumar los 31 de la primera fase de la operación especial de Navidad, entre el 22 y el 27 de diciembre. La estadística resulta especialmente llamativa si se compara con el primer año en el que se hizo balance, 1969, en el que 3.951 personas perdieron la vida. Desde entonces, el pico más alto se alcanzó en 1989 con 7.000 víctimas. En las cifras de mortalidad de 2009 destacan otros aspectos positivos como el hecho de que sólo se han superado los 200 muertos en un mes (agosto) y que ha habido cinco días sin víctimas mortales sobre el asf

In [28]:
print(ds["train"][10]["text"])

España ha sumado en 2009 un sexto año consecutivo de reducción de la mortalidad en las carreteras y, a falta de las cifras oficiales que confirmará hoy el ministro del Interior, Alfredo Pérez Rubalcaba, por primera vez en la historia se ha cerrado un periodo de 12 meses por debajo de 2.000 muertos. Hasta el pasado 18 de diciembre, último día del que se dispone de datos, la cifra de fallecidos es de 1.826, a los que hay que sumar los 31 de la primera fase de la operación especial de Navidad, entre el 22 y el 27 de diciembre. La estadística resulta especialmente llamativa si se compara con el primer año en el que se hizo balance, 1969, en el que 3.951 personas perdieron la vida. Desde entonces, el pico más alto se alcanzó en 1989 con 7.000 víctimas. En las cifras de mortalidad de 2009 destacan otros aspectos positivos como el hecho de que sólo se han superado los 200 muertos en un mes (agosto) y que ha habido cinco días sin víctimas mortales sobre el asfalto, el último el pasado 11 de di

In [29]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-small")

In [30]:
from datasets import concatenate_datasets

# Tokenize all splits together to measure sequence lengths
all_examples = concatenate_datasets([ds["train"], ds["validation"], ds["test"]])

# Calculate the maximum prompt size. With truncation=True, prompts are cut to
# 512 tokens, which is the maximum size for this model
prompts_tokens = all_examples.map(lambda x: tokenizer(x["prompt"], truncation=True), batched=True)
max_token_len = max(len(x) for x in prompts_tokens["input_ids"])
print(f"Maximum prompt size: {max_token_len}")

# Calculate the maximum completion size
completions_tokens = all_examples.map(lambda x: tokenizer(x["summary"], truncation=True), batched=True)
max_completion_len = max(len(x) for x in completions_tokens["input_ids"])
print(f"Maximum completion size: {max_completion_len}")

Maximum prompt size: 512


Maximum completion size: 242
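
The maximum completion length is set by the single longest summary, and every label sequence is later padded to that length. One option is to look at a high percentile of the length distribution before settling on a limit. The sketch below uses only the standard library and made-up lengths; the helper `length_percentile` and the sample values are assumptions, not measurements from this dataset.

```python
def length_percentile(lengths, pct: int) -> int:
    """Nearest-rank percentile: smallest length that covers `pct` percent of the items."""
    ordered = sorted(lengths)
    # Integer ceiling of pct * n / 100, kept at 1 or above
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]

# Illustrative token counts (not taken from MLSUM)
toy_lengths = [18, 22, 25, 27, 30, 31, 33, 35, 40, 242]
print(length_percentile(toy_lengths, 90))   # covers 9 of the 10 examples
print(length_percentile(toy_lengths, 100))  # same as max(toy_lengths)
```

A limit chosen this way truncates the few longest summaries in exchange for much shorter padded label sequences; whether that trade-off is acceptable depends on the task.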


In [31]:
def padding_tokenizer(data):
  # Tokenize inputs (prompts)
  model_inputs = tokenizer(data['prompt'], max_length=max_token_len, padding="max_length", truncation=True)

  # Tokenize labels (completions)
  model_labels = tokenizer(data['summary'], max_length=max_completion_len, padding="max_length", truncation=True)

  # Replace padding token in completions with -100 so it is ignored during training
  model_labels["input_ids"] = [[(l if l != tokenizer.pad_token_id else -100) for l in label] for label in model_labels["input_ids"]]

  model_inputs['labels'] = model_labels["input_ids"]

  return model_inputs
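
The label masking step can be seen in isolation on toy token IDs. PyTorch's cross-entropy loss ignores targets equal to -100 by default, which is why padded label positions are replaced with that value. The IDs below are made up, and `PAD_ID = 0` is hard-coded here on the assumption that it matches the padding token ID of the T5 tokenizer.

```python
PAD_ID = 0           # Assumed padding token ID (T5 uses 0)
IGNORE_INDEX = -100  # Default ignore_index of PyTorch's cross-entropy loss

def mask_padding(batch_label_ids):
    """Replace padding IDs with IGNORE_INDEX in every label sequence."""
    return [
        [(tok if tok != PAD_ID else IGNORE_INDEX) for tok in seq]
        for seq in batch_label_ids
    ]

# Two toy label sequences, right-padded to length 5
toy_labels = [[1203, 15, 1, 0, 0], [77, 1, 0, 0, 0]]
print(mask_padding(toy_labels))
```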

In [32]:
ds_tokens = ds.map(padding_tokenizer, batched=True, remove_columns=['text', 'summary', 'topic', 'url', 'title', 'date', 'prompt'])


In [33]:
ds_tokens

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 1500
    })
    validation: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 500
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 200
    })
})

In [34]:
ds_tokens["train"][10]["input_ids"]

[12198,
 1635,
 1737,
 8,
 826,
 1108,
 10,
 28774,
 2,
 9,
 4244,
 4505,
 9,
 26,
 32,
 3,
 35,
 2464,
 73,
 3,
 7,
 994,
 235,
 3,
 9,
 2,
 32,
 6900,
 15,
 3044,
 23,
 1621,
 20,
 27353,
 12765,
 20,
 50,
 24301,
 15644,
 3,
 35,
 50,
 7,
 443,
 60,
 449,
 9,
 7,
 3,
 63,
 6,
 3,
 9,
 12553,
 17,
 9,
 20,
 50,
 7,
 3,
 31812,
 7,
 11343,
 15,
 7,
 238,
 3606,
 291,
 2975,
 3534,
 63,
 3,
 15,
 40,
 3016,
 6626,
 20,
 40,
 8226,
 6,
 19850,
 32,
 276,
 154,
 2638,
 15612,
 138,
 10891,
 9,
 6,
 5569,
 21628,
 9,
 3,
 6071,
 3,
 35,
 50,
 3,
 107,
 17905,
 142,
 4244,
 3,
 2110,
 19042,
 73,
 1059,
 32,
 20,
 586,
 140,
 2260,
 5569,
 20,
 115,
 9,
 1927,
 20,
 3,
 26756,
 4035,
 49,
 235,
 7,
 5,
 4498,
 17,
 9,
 3,
 15,
 40,
 330,
 9,
 26,
 32,
 507,
 20,
 3,
 26,
 1294,
 21388,
 6,
 3,
 2,
 40,
 2998,
 32,
 3,
 26,
 2,
 9,
 20,
 40,
 238,
 142,
 3,
 10475,
 782,
 20,
 3927,
 32,
 7,
 6,
 50,
 3,
 31812,
 20,
 1590,
 15,
 75,
 28594,
 3,
 15,
 7,
 20,
 3,
 16253,
 2688,
 6,
 3,
 9,


In [35]:
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-small")

In [36]:
import evaluate
import nltk
import numpy as np
from nltk.tokenize import sent_tokenize
nltk.download("punkt")

# Evaluation metric
metric = evaluate.load("rouge")

# Helper function to postprocess the decoded text
def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [label.strip() for label in labels]

    # rougeLsum expects a new line after each sentence; the texts are in Spanish,
    # so the Spanish Punkt model is used for sentence splitting
    preds = ["\n".join(sent_tokenize(pred, language="spanish")) for pred in preds]
    labels = ["\n".join(sent_tokenize(label, language="spanish")) for label in labels]

    return preds, labels

def compute_metrics(eval_preds):
    preds, labels = eval_preds

    if isinstance(preds, tuple):
        preds = preds[0]

    # Replace -100 in predictions and labels as it cannot be decoded
    # (predictions can be padded with -100 when batches are concatenated)
    preds = np.where(preds != -100, preds, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Postprocess the text
    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)
    # Note: use_stemmer=True applies the Porter stemmer, which targets English
    result = metric.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
    result = {k: round(v * 100, 4) for k, v in result.items()}
    # Average number of non-padding tokens per generated summary
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
    result["gen_len"] = np.mean(prediction_lens)
    return result

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
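
To give a feel for what the ROUGE scores measure, here is a simplified ROUGE-1 F1 computed by hand: the unigram overlap between a prediction and a reference. This is a sketch under simplifying assumptions (whitespace tokenization, lowercasing, no stemming), so it will not reproduce the exact numbers returned by the `rouge` metric; `rouge1_f1` is a hypothetical helper and the two sentences are made up for the example.

```python
from collections import Counter

def rouge1_f1(prediction: str, reference: str) -> float:
    """Simplified ROUGE-1 F1 based on whitespace-separated, lowercased tokens."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Multiset intersection: each token counts at most as often as it appears in both
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

reference = "2009 es el periodo con menos fallecidos en accidentes de tráfico"
prediction = "2009 es el año con menos muertos en accidentes"
print(round(rouge1_f1(prediction, reference), 4))
```

Here 7 of the 9 predicted tokens appear among the 11 reference tokens, so precision is 7/9 and recall is 7/11.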



In [37]:
from transformers import DataCollatorForSeq2Seq

# Label positions padded with -100 are ignored when computing the training loss
label_pad_token_id = -100

# Data collator for model training
data_collator = DataCollatorForSeq2Seq(
    tokenizer,
    model=model,
    label_pad_token_id=label_pad_token_id,
    pad_to_multiple_of=8
)
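
`pad_to_multiple_of=8` asks the collator to pad sequences up to a length that is a multiple of 8, a setting often recommended for efficient GPU kernels. Since the examples in this notebook are already padded to fixed lengths during tokenization, the effect here is probably small. The underlying arithmetic is a simple round-up, sketched below with the hypothetical helper `round_up_to_multiple`.

```python
def round_up_to_multiple(length: int, multiple: int = 8) -> int:
    """Smallest multiple of `multiple` that is greater than or equal to `length`."""
    return (length + multiple - 1) // multiple * multiple

# The two maximum lengths measured earlier in the notebook
for length in (242, 512):
    print(length, "->", round_up_to_multiple(length))
```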

In [38]:
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

REPOSITORY = "/content/flan-t5-small-fine-tuned"

# Define the training options
training_args = Seq2SeqTrainingArguments(
    # Training hyperparameters
    output_dir=REPOSITORY,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    predict_with_generate=True,
    fp16=False,  # Overflows with fp16
    learning_rate=5e-5,
    num_train_epochs=4,
    # Logging and evaluation strategies
    logging_dir=f"{REPOSITORY}/logs",
    logging_strategy="steps",
    logging_steps=500,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    save_total_limit=2,
    load_best_model_at_end=True,
)

# Create the training instance
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=ds_tokens["train"],
    eval_dataset=ds_tokens["validation"],
    compute_metrics=compute_metrics,
)
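
No learning-rate scheduler is set explicitly, so the `Trainer` default should apply: a linear decay from `learning_rate` to zero without warmup. The sketch below reproduces that schedule in plain Python. `TOTAL_STEPS` is an assumption derived from the settings above (1500 examples, batch size 8, 4 epochs, one device, no gradient accumulation), and `linear_lr` is a hypothetical helper.

```python
BASE_LR = 5e-5
TOTAL_STEPS = 752  # Assumption: ceil(1500 / 8) = 188 steps per epoch, times 4 epochs

def linear_lr(step: int, base_lr: float = BASE_LR, total_steps: int = TOTAL_STEPS) -> float:
    """Learning rate after `step` optimizer updates under linear decay without warmup."""
    remaining = max(0, total_steps - step)
    return base_lr * remaining / total_steps

# Learning rate at the start and at the end of each quarter of training
for step in (0, 188, 376, 564, 752):
    print(step, linear_lr(step))
```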



In [39]:
# Save the tokenizer to disk for later use
tokenizer.save_pretrained(f"{REPOSITORY}/tokenizer")

('/content/flan-t5-small-fine-tuned/tokenizer/tokenizer_config.json',
 '/content/flan-t5-small-fine-tuned/tokenizer/special_tokens_map.json',
 '/content/flan-t5-small-fine-tuned/tokenizer/spiece.model',
 '/content/flan-t5-small-fine-tuned/tokenizer/added_tokens.json',
 '/content/flan-t5-small-fine-tuned/tokenizer/tokenizer.json')

In [None]:
# Start the training
trainer.train()

Epoch,Training Loss,Validation Loss


In [None]:
from transformers import T5Tokenizer, T5ForConditionalGeneration

REPOSITORY = "/content/flan-t5-small-fine-tuned"

# Import the tokenizer
tokenizer_FT5_FT = T5Tokenizer.from_pretrained(f"{REPOSITORY}/tokenizer")

# Import the fine-tuned model
model_FT5_FT = T5ForConditionalGeneration.from_pretrained(f"{REPOSITORY}/checkpoint-752", device_map="auto")
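
The number in the checkpoint directory name is the global training step at which it was saved. With `save_strategy="epoch"`, the last checkpoint should fall on the final step, which can be cross-checked from the settings used earlier. The sketch assumes a single device and no gradient accumulation; `steps_per_epoch` is a hypothetical helper.

```python
import math

def steps_per_epoch(num_examples: int, batch_size: int) -> int:
    """Number of optimizer steps per epoch; the last batch may be smaller."""
    return math.ceil(num_examples / batch_size)

per_epoch = steps_per_epoch(1500, 8)  # NUM_EX_TRAIN and per_device_train_batch_size
total = per_epoch * 4                 # num_train_epochs
print(per_epoch, total)  # 188 752
```

Note that the final checkpoint is not necessarily the one with the lowest validation loss, so it may be worth comparing the saved checkpoints before choosing one.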