# Fine-tuning GPT

Fine-tuning lets us take a pre-existing model such as GPT-2 and adjust its weights so that it learns new relationships between tokens and even gains context for words that previously had little. For example, GPT-2 was trained at a time when chatbots and NLP tools were not widely discussed, so it may not interpret concepts such as _transformers_, _fine-tuning_, _tokens_, etc. correctly.

Through fine-tuning we can adjust the weights of the model (or of a part of it) so that it "learns" these new concepts and generates better text in this specific context.
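As a rough illustration of adjusting only part of a model, here is a minimal sketch with a toy two-weight "model" rather than GPT-2. The names `sgd_step`, `w_base` and `w_head` are hypothetical and exist only for this example; the point is that gradient descent moves the trainable weight while the frozen one keeps its pretrained value.

```python
# Toy illustration only (not GPT-2): a "model" with two scalar weights,
# y = w_head * (w_base * x), trained with squared error.
# All names here are hypothetical and chosen for this sketch.

def sgd_step(params, trainable, x, y_target, lr=0.1):
    """Apply one gradient-descent step, updating only the trainable weights."""
    hidden = params["w_base"] * x
    error = params["w_head"] * hidden - y_target
    grads = {
        "w_base": 2 * error * params["w_head"] * x,
        "w_head": 2 * error * hidden,
    }
    return {
        name: value - lr * grads[name] if name in trainable else value
        for name, value in params.items()
    }

params = {"w_base": 1.0, "w_head": 0.5}
for _ in range(50):
    # Only w_head is trainable; w_base stays frozen at its initial value.
    params = sgd_step(params, trainable={"w_head"}, x=1.0, y_target=2.0)

print(params)
```

With PyTorch models, the usual way to get the same effect is to set `requires_grad = False` on the parameters that should stay fixed.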

In [1]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer, TextDataset, DataCollatorForLanguageModeling, \
    Trainer, TrainingArguments, pipeline



In [2]:
# Only needed if you run into SSL certificate problems
import os
import certifi
os.environ['REQUESTS_CA_BUNDLE'] = certifi.where()
os.environ['HF_HOME'] = 'D:\\huggingface_cache' # Change this path to whichever you prefer

In [3]:
from transformers import GPT2Tokenizer

# Check whether we are running on CUDA or CPU, and print the device
# in use together with its name

import torch
device = 0 if torch.cuda.is_available() else -1
print("Device in use:", "cuda" if device == 0 else "cpu")
if device == 0:
    print("Device name:", torch.cuda.get_device_name(0))

Device in use: cuda
Device name: NVIDIA T1200 Laptop GPU


In [4]:
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# GPT-2 has no pad_token by default, so we reuse the eos_token for padding

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token


nlp_data = TextDataset(
    tokenizer=tokenizer,
    file_path="data\\The_Evolution_of_Natural_Language_Processing.txt",  # Change this path to whichever you prefer
    block_size=128,
)
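To make `block_size=128` more concrete, the sketch below shows one way a tokenized text can be cut into fixed-size training examples. It assumes, as I understand `TextDataset` to behave, that the token stream is split into consecutive non-overlapping blocks and that an incomplete final block is dropped. The helper `split_into_blocks` and the stand-in ids are hypothetical, not part of the library.

```python
def split_into_blocks(token_ids, block_size):
    """Split a flat list of token ids into consecutive blocks of exactly
    block_size tokens, dropping any incomplete block at the end."""
    return [
        token_ids[start:start + block_size]
        for start in range(0, len(token_ids) - block_size + 1, block_size)
    ]

# Stand-in ids (0..299) instead of real GPT-2 token ids.
token_ids = list(range(300))
blocks = split_into_blocks(token_ids, block_size=128)

print(len(blocks))                # number of complete blocks
print([len(b) for b in blocks])   # every block has block_size tokens
```

Under this assumption every example has the same length, so the training batches built from this dataset need no padding.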



In [5]:
# Let's inspect the first element of the dataset
print("First element of the dataset (token IDs):", nlp_data[0])
print("\nFirst element of the dataset (text):", tokenizer.decode(nlp_data[0]))
print("\nNumber of tokens in the first element of the dataset:", len(nlp_data[0]))

# Let's see how many examples the dataset contains
print("\nNumber of examples in the dataset:", len(nlp_data))

First element of the dataset (token IDs): tensor([28900, 34786,   420, 36876,    16,   198,    16,   198,   198, 35191,
        15547,   407,  1695,   198,   198, 16192,  1160,    11,  1160,  1954,
          198, 23839,   198,   198, 14231,   319,  1160,  1526,  1160,  1954,
          851, 12624,    12, 17513,   604,    13,    15,   851,  3740,  1378,
        34023,    13,  2398,    14,   940,    13, 18182,  3901,    14,   559,
           13,  1433,  3720,  2327, 34229,    13, 34716, 38569,  4051,    14,
           85,    16,   851,   770,   257,   662,  4798,   290,   468,   407,
          587, 12720, 11765,    13,  6060,   743,   307, 15223,    13,   198,
          198,  1212,  3188, 13692,   319, 24101,    38, 11571,    11,   257,
         3288,  3303,  7587,   357,    45, 19930,     8,  2746,  3170,   416,
          262, 47385, 17019,  3127,    13,   198,   464,  3188,  3769,   257,
         9815, 16700,   286,   262, 10959,    11,  3047,    11,   290,  3734,
           12, 28286,  

In [6]:
from transformers import set_seed, GPT2Config, GPT2LMHeadModel, pipeline
from torch import tensor, numel
from bertviz import model_view

set_seed(42)

# Eager attention with output_attentions=True so the attention weights are returned
config = GPT2Config.from_pretrained("gpt2", attn_implementation="eager", output_attentions=True)
model = GPT2LMHeadModel.from_pretrained("gpt2", config=config)
# `device` is a pipeline-style index (0 or -1); .to() cannot take -1, so map it to "cpu"
model = model.to("cuda:0" if device == 0 else "cpu")

The following generation flags are not valid and may be ignored: ['output_attentions']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


In [7]:
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=False,
)   

In [8]:
collator_example = data_collator([tokenizer('This is a random input'), tokenizer('This is another one')])

collator_example

{'input_ids': tensor([[ 1212,   318,   257,  4738,  5128],
        [ 1212,   318,  1194,   530, 50256]]), 'attention_mask': tensor([[1, 1, 1, 1, 1],
        [1, 1, 1, 1, 0]]), 'labels': tensor([[1212,  318,  257, 4738, 5128],
        [1212,  318, 1194,  530, -100]])}

In [9]:
# Our padding token is the same as the end-of-sequence token (eos_token)
tokenizer.pad_token_id

50256

In [10]:
# Likewise, the attention mask must ignore the padding tokens
collator_example['attention_mask']

tensor([[1, 1, 1, 1, 1],
        [1, 1, 1, 1, 0]])
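The collator output above can be reproduced by hand with a small sketch: pad every sequence to the longest one in the batch, mark the padded positions with 0 in the attention mask, and set their labels to -100 so the loss ignores them. The function `collate_causal_lm` is a hypothetical helper written for this illustration, not the library implementation.

```python
PAD_ID = 50256        # GPT-2 eos id, reused as the pad id in this notebook
IGNORE_INDEX = -100   # label value skipped by the cross-entropy loss

def collate_causal_lm(sequences, pad_id=PAD_ID):
    """Pad token-id lists to a common length and build mask and labels."""
    max_len = max(len(seq) for seq in sequences)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for seq in sequences:
        n_pad = max_len - len(seq)
        batch["input_ids"].append(seq + [pad_id] * n_pad)
        batch["attention_mask"].append([1] * len(seq) + [0] * n_pad)
        # For causal LM the labels are the inputs themselves; padding is ignored.
        batch["labels"].append(seq + [IGNORE_INDEX] * n_pad)
    return batch

# Token ids taken from the collator output shown above.
batch = collate_causal_lm([
    [1212, 318, 257, 4738, 5128],  # "This is a random input"
    [1212, 318, 1194, 530],        # "This is another one"
])
for key, rows in batch.items():
    print(key, rows)
```

One difference worth keeping in mind: to my understanding, `DataCollatorForLanguageModeling` masks every label equal to the pad id, so when the pad token is also the eos token, genuine end-of-sequence tokens are excluded from the loss too. This sketch only masks the positions it padded itself.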

In [11]:
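# Note: this reloads the base GPT-2 weights, replacing the model created earlier
model = GPT2LMHeadModel.from_pretrained("gpt2")
pretrained_generator = pipeline(
    'text-generation', model=model, tokenizer=tokenizer, device=device,
    # Caution: `config` is meant for a model configuration, so these generation settings
    # are probably not applied here (the output below runs well past 20 tokens).
    # They are normally passed as keyword arguments to the pipeline or to the call itself.
    config={'max_length': 20, 'do_sample': True, 'top_k': 10, 'top_p': 0.9, 'temperature': 0.7}
)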

Device set to use cuda:0
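The sampling settings named in the cell above (`temperature`, `top_k`, `top_p`) can be sketched in plain Python. This is a simplified illustration under my own assumptions about the order of operations (temperature, then top-k, then top-p), not the transformers implementation; `filter_distribution` and `sample_token` are hypothetical helpers.

```python
import math
import random

def filter_distribution(logits, temperature=0.7, top_k=10, top_p=0.9):
    """Return {token_index: probability} after temperature scaling,
    top-k truncation and top-p (nucleus) truncation."""
    # Temperature: values below 1 sharpen the distribution.
    scaled = [logit / temperature for logit in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    # Top-k: keep only the k most likely tokens, then renormalize.
    ranked = sorted(range(len(exps)), key=lambda i: exps[i], reverse=True)[:top_k]
    total = sum(exps[i] for i in ranked)
    probs = {i: exps[i] / total for i in ranked}
    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    norm = sum(probs[i] for i in kept)
    return {i: probs[i] / norm for i in kept}

def sample_token(logits, rng, **kwargs):
    """Draw one token index from the filtered distribution."""
    dist = filter_distribution(logits, **kwargs)
    indices = list(dist)
    return rng.choices(indices, weights=[dist[i] for i in indices], k=1)[0]

logits = [2.0, 1.0, 0.0, -1.0]  # made-up scores for a 4-token vocabulary
print(filter_distribution(logits))
print(sample_token(logits, random.Random(42)))
```

Lower `temperature`, smaller `top_k` and smaller `top_p` all narrow the set of tokens that can be drawn, which trades variety for more predictable text.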


In [12]:
print('--------------------')
for generated_sequence in pretrained_generator("Datasets biases are a problem in", num_return_sequences=3):
    print(generated_sequence['generated_text'])
    print('--------------------')

--------------------


Datasets biases are a problem in both the physical sciences and computer science. In the physical sciences, we have an open field that enables us to analyze and interpret complex computational models. In computer science, we have a number of closed field models where we can only do a series of observations and not even model the entire network. However, there are many applications to these data and it has been recently observed that the number of models of this kind can be greatly increased due to the fact that these data are freely available online. In this post, I will discuss this open field model and how it can be applied to real data sets.

The Open Field Model

The Open Field Model is a model of data, which is a way of categorizing data and identifying the specific variables in a continuous set. We use the term 'data' to describe a set of variables that can be grouped together in a continuous set. In the Open Field Model, we can classify data by the number of observations, the nu

In [13]:
training_args = TrainingArguments(
    output_dir="./models/gpt2-finetuned",  # Change this path to whichever you prefer
    overwrite_output_dir=True,
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    warmup_steps=len(nlp_data)//16,  # about one epoch of warmup (computed from the full dataset, not only the 80% training split)
    logging_steps=5
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=nlp_data.examples[:int(0.8*len(nlp_data))],
    eval_dataset=nlp_data.examples[int(0.8*len(nlp_data)):]
)

trainer.evaluate()

`loss_type=None` was set in the config but it is unrecognized. Using the default loss: `ForCausalLMLoss`.


{'eval_loss': 3.4060862064361572,
 'eval_model_preparation_time': 0.003,
 'eval_runtime': 1.5585,
 'eval_samples_per_second': 38.499,
 'eval_steps_per_second': 2.567}
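A common way to read this baseline `eval_loss` is to convert it to perplexity. Assuming the reported loss is the mean cross-entropy per token in natural-log units, perplexity is simply its exponential:

```python
import math

# Value reported by trainer.evaluate() before any training (see output above).
eval_loss = 3.4060862064361572

# Perplexity = exp(mean cross-entropy per token).
perplexity = math.exp(eval_loss)
print(f"Perplexity before fine-tuning: {perplexity:.1f}")
```

A perplexity of about 30 means that, on the held-out text, the base model is on average about as uncertain as if it were choosing uniformly among roughly 30 tokens at each position. Running the same conversion on an evaluation after training gives a direct before/after comparison.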

In [14]:
trainer.train()

Step,Training Loss
5,3.7868
10,3.5441
15,3.266
20,3.0992
25,2.8236
30,2.85
35,2.7694
40,2.5871
45,2.5802


TrainOutput(global_step=45, training_loss=3.034053760104709, metrics={'train_runtime': 451.0996, 'train_samples_per_second': 1.596, 'train_steps_per_second': 0.1, 'total_flos': 47032565760000.0, 'train_loss': 3.034053760104709, 'epoch': 3.0})
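The step count and the effect of `warmup_steps` can be checked with a little arithmetic. The sketch below assumes a single GPU, 240 training examples and 300 examples in total (inferred from the logged throughput and runtime, not stated in the notebook), and the Trainer defaults as I understand them: a base learning rate of 5e-5 with a linear schedule. The function `linear_warmup_decay` is a hypothetical re-implementation for illustration.

```python
import math

def linear_warmup_decay(step, warmup_steps, total_steps, base_lr=5e-5):
    """Learning rate at a given step: linear ramp-up, then linear decay to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = total_steps - step
    return base_lr * max(0.0, remaining / max(1, total_steps - warmup_steps))

# Assumptions inferred from the logs above, not stated in the notebook.
total_examples = 300
train_examples = 240   # the 80% training split
batch_size = 16
epochs = 3

steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs
warmup_steps = total_examples // batch_size  # mirrors len(nlp_data)//16

print("steps per epoch:", steps_per_epoch)
print("total steps:", total_steps)
print("warmup steps:", warmup_steps)
for step in (0, 9, 18, 30, 45):
    print(step, linear_warmup_decay(step, warmup_steps, total_steps))
```

Under these assumptions the warmup covers slightly more than the first epoch, so the learning rate is still ramping up during the first logged steps and reaches its peak early in the second epoch.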

In [16]:
# Let's see how the training went
for log in trainer.state.log_history:
    if 'loss' in log and 'epoch' in log:
        print(f"Epoch: {log['epoch']:.1f} - Training Loss: {log['loss']:.4f}")
    if 'eval_loss' in log and 'epoch' in log:
        print(f"Epoch: {log['epoch']:.1f} - Validation Loss: {log['eval_loss']:.4f}")

Epoch: 0.3 - Training Loss: 3.7868
Epoch: 0.7 - Training Loss: 3.5441
Epoch: 1.0 - Training Loss: 3.2660
Epoch: 1.3 - Training Loss: 3.0992
Epoch: 1.7 - Training Loss: 2.8236
Epoch: 2.0 - Training Loss: 2.8500
Epoch: 2.3 - Training Loss: 2.7694
Epoch: 2.7 - Training Loss: 2.5871
Epoch: 3.0 - Training Loss: 2.5802


In [18]:
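loaded_model = GPT2LMHeadModel.from_pretrained("./models/gpt2-finetuned/checkpoint-45")  # Change this path to whichever you prefer

finetuned_generator = pipeline(
    'text-generation', model=loaded_model, tokenizer=tokenizer,
    # Caution: generation settings passed through `config` are probably not applied;
    # they are normally given as keyword arguments to the pipeline or to the call itself.
    config={'max_length': 20, 'do_sample': True, 'top_k': 10, 'top_p': 0.9, 'temperature': 0.7}
)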

Device set to use cuda:0


In [19]:
# Generate with the fine-tuned model loaded from the checkpoint
print('--------------------')
for generated_sequence in finetuned_generator("Datasets biases are a problem in", num_return_sequences=3):
    print(generated_sequence['generated_text'])
    print('--------------------')

--------------------
Datasets biases are a problem in many languages.

Many languages have many features to solve many of these problems.
However, there are some limitations in language generation and the amount of support available.
This article will provide a detailed overview of many of the most common problems in language generation and development.
3.1.1 Learning and
Learning is an important aspect of language development because it is the key to understanding of the language.
Learning is important to understand the natural language and its features, such as its structure, syntax, morphology, and phonemporal structure.
When learning new concepts to a new language, there are many different approaches that can be used to help the user understand them.
This article will provide an overview of the most common use of learning and information in language development.
3.1.2 Learning and
Learning is an important aspect of the language development process.
Learning is an important aspect o