# Generation parameters in GPT

- **temperature (float)**: Rescales the logits before the softmax. Lower values sharpen the distribution, so the most probable tokens are chosen more often as the next token during autoregressive generation and the output is less random. Values above 1 flatten the distribution. It only has an effect when sampling is enabled.
- **top_k (int)**: Limits the candidates for the next token to the _k_ tokens with the highest probability.
- **top_p (float)**: Similar to _top\_k_, but the cutoff is on cumulative probability (nucleus sampling): only the smallest set of most probable tokens whose probabilities add up to at least _top\_p_ is kept.
- **num_beams (int)**: Number of candidate sequences (beams) kept at each step of beam search. Instead of committing to the single best next token, the model extends several partial texts in parallel and ranks them by the joint probability of the whole sequence, because a token that looks worse now can lead to a more probable sentence later.
- **do_sample (bool)**: Introduces randomness in the choice of the next token: it is drawn from the (filtered) distribution rather than always being the most probable one.
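
The effect of `temperature`, `top_k` and `top_p` can be sketched on a toy distribution. This is a minimal illustration in plain Python, not the code that `transformers` runs internally; the five logits and the helper names (`softmax`, `top_k_filter`, `top_p_filter`) are made up for the example.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn logits into probabilities after dividing them by the temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_filter(probs, k):
    """Keep the k most probable token ids and renormalize their probabilities."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

# Hypothetical logits for a 5-token vocabulary (token ids 0..4)
toy_logits = [2.0, 1.0, 0.5, 0.0, -1.0]

cold = softmax(toy_logits, temperature=0.3)
neutral = softmax(toy_logits, temperature=1.0)

print("max prob at T=0.3:", round(max(cold), 3))
print("max prob at T=1.0:", round(max(neutral), 3))
print("top_k=2 keeps token ids:", sorted(top_k_filter(neutral, 2)))
print("top_p=0.8 keeps token ids:", sorted(top_p_filter(neutral, 0.8)))
```

A lower temperature concentrates probability on the top token, while `top_k` keeps a fixed number of candidates and `top_p` keeps a number that depends on how peaked the distribution is.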

In [1]:
import torch
import pandas as pd

In [2]:
# Only needed if you run into SSL certificate problems
import os
import certifi
os.environ['REQUESTS_CA_BUNDLE'] = certifi.where()
os.environ['HF_HOME'] = 'D:\\huggingface_cache' # Change this path to whichever you prefer

In [3]:
from transformers import GPT2Tokenizer

# Check whether CUDA is available and print the device in use and its name.
# A torch.device is used (rather than the integers 0 / -1) so that
# model.to(device) and encoded_input.to(device) also work on CPU-only machines.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print("Device in use:", device.type)
if device.type == "cuda":
    print("Device name:", torch.cuda.get_device_name(0))

Device in use: cuda
Device name: NVIDIA T1200 Laptop GPU


In [4]:
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

phrase = "This class is really interesting. It is about language models."
encoded_input = tokenizer(phrase, return_tensors='pt')

In [5]:
from transformers import set_seed, GPT2LMHeadModel, pipeline
from torch import tensor, numel
from bertviz import model_view

set_seed(42)

from transformers import GPT2Config
config = GPT2Config.from_pretrained("gpt2", attn_implementation="eager", output_attentions=True)
model = GPT2LMHeadModel.from_pretrained("gpt2", config=config)
model = model.to(device)

The following generation flags are not valid and may be ignored: ['output_attentions']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


In [6]:
model.eval()  # important: put the model in evaluation mode

with torch.no_grad():
    response = model(
        **encoded_input.to(device),
        output_attentions=True,
        output_hidden_states=True
    )

attentions = response.attentions

print(len(attentions))


12


In [7]:
response.attentions[-1].shape  # (batch_size, num_heads, seq_len, seq_len)

torch.Size([1, 12, 12, 12])

In [8]:
encoded_input['input_ids'].shape

torch.Size([1, 12])

In [9]:
tokens = tokenizer.convert_ids_to_tokens(encoded_input['input_ids'][0])

tokens

['This',
 'Ġclass',
 'Ġis',
 'Ġreally',
 'Ġinteresting',
 '.',
 'ĠIt',
 'Ġis',
 'Ġabout',
 'Ġlanguage',
 'Ġmodels',
 '.']

In [11]:
# Inspect the attention weights at layer index 9, head 0
layer = 9
head = 0
arr = response.attentions[layer][0, head].detach().cpu()
n_digits = 3
tokens = tokenizer.convert_ids_to_tokens(encoded_input['input_ids'][0])

# Build a DataFrame labeled with the tokens and round to n_digits decimal places
attention_df = pd.DataFrame(arr.numpy(), columns=tokens, index=tokens)
attention_df = attention_df.round(n_digits)
attention_df

| | This | Ġclass | Ġis | Ġreally | Ġinteresting | . | ĠIt | Ġis | Ġabout | Ġlanguage | Ġmodels | . |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| This | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Ġclass | 0.891 | 0.109 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Ġis | 0.344 | 0.516 | 0.14 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Ġreally | 0.411 | 0.253 | 0.316 | 0.02 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Ġinteresting | 0.731 | 0.168 | 0.046 | 0.005 | 0.05 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| . | 0.643 | 0.272 | 0.025 | 0.003 | 0.03 | 0.028 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ĠIt | 0.288 | 0.298 | 0.161 | 0.013 | 0.138 | 0.04 | 0.062 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Ġis | 0.289 | 0.346 | 0.112 | 0.009 | 0.07 | 0.025 | 0.111 | 0.038 | 0.0 | 0.0 | 0.0 | 0.0 |
| Ġabout | 0.68 | 0.172 | 0.021 | 0.003 | 0.026 | 0.019 | 0.028 | 0.013 | 0.038 | 0.0 | 0.0 | 0.0 |
| Ġlanguage | 0.836 | 0.039 | 0.004 | 0.001 | 0.01 | 0.003 | 0.002 | 0.001 | 0.005 | 0.098 | 0.0 | 0.0 |
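
The attention weights above are zero to the right of the diagonal and each row sums to 1, because GPT-2 uses causal self-attention: a token can only attend to itself and to earlier tokens. The following sketch reproduces that pattern in plain Python with made-up scores. The function `causal_attention_weights` and the values in `toy_scores` are hypothetical and only illustrate the masking followed by a row-wise softmax.

```python
import math

def causal_attention_weights(scores):
    """Row-wise softmax where position i may only attend to positions 0..i."""
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]  # drop (mask) the scores of future positions
        m = max(visible)
        exps = [math.exp(s - m) for s in visible]
        total = sum(exps)
        # Future positions get a weight of exactly zero
        weights.append([e / total for e in exps] + [0.0] * (len(row) - i - 1))
    return weights

# Hypothetical raw attention scores for a 4-token sequence
toy_scores = [
    [0.2, 1.5, -0.3, 0.8],
    [1.1, 0.4, 0.9, -1.2],
    [0.0, 0.7, 0.3, 2.0],
    [-0.5, 0.1, 1.3, 0.6],
]

weights = causal_attention_weights(toy_scores)
for row in weights:
    print([round(w, 3) for w in row])
```

The first row is always `[1.0, 0.0, ...]`, as in the table: the first token has nothing else to attend to.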


In [12]:
tokens = tokenizer.convert_ids_to_tokens(encoded_input['input_ids'][0])
model_view(attentions, tokens)

<IPython.core.display.Javascript object>

In [13]:
# Look at which next tokens the model considers most likely
logits = response.logits
logits.shape  # (batch_size, seq_len, vocab_size)

torch.Size([1, 12, 50257])

In [14]:
pd.DataFrame(
    zip(tokens, tokenizer.convert_ids_to_tokens(torch.argmax(logits, dim=-1)[0])),
    columns=['Sequence', 'Predicted next token']
)

| | Sequence | Predicted next token |
|---|---|---|
| 0 | This | Ġis |
| 1 | Ġclass | Ġis |
| 2 | Ġis | Ġfor |
| 3 | Ġreally | Ġgood |
| 4 | Ġinteresting | . |
| 5 | . | ĠIt |
| 6 | ĠIt | 's |
| 7 | Ġis | Ġa |
| 8 | Ġabout | Ġthe |
| 9 | Ġlanguage | Ġdesign |
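
The table pairs each input position with the `argmax` of its logits, that is, the token the model would predict right after that position. During generation only the last position matters, and a softmax over its logits gives the probability of every candidate. Below is a small sketch of both steps with a hypothetical four-word vocabulary and made-up logits (`toy_vocab` and `toy_logits` are not taken from GPT-2).

```python
import math

# Hypothetical 4-word vocabulary and logits for a 3-token input
toy_vocab = ["the", "cat", "sat", "."]
toy_logits = [
    [0.1, 2.0, 0.3, -1.0],  # logits produced at position 0
    [0.0, 0.2, 1.7, 0.1],   # logits produced at position 1
    [1.2, -0.5, 0.0, 1.9],  # logits produced at position 2 (last position)
]

def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])

def softmax(values):
    """Convert logits into probabilities."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

# Most likely next token after every position (what the table shows)
per_position = [toy_vocab[argmax(row)] for row in toy_logits]
print(per_position)

# Candidates for the token that would actually be generated next,
# ranked by probability at the last position
last_probs = softmax(toy_logits[-1])
ranked = sorted(zip(toy_vocab, last_probs), key=lambda pair: pair[1], reverse=True)
for token, prob in ranked:
    print(f"{token!r}: {prob:.3f}")
```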


In [15]:
# Create a text-generation pipeline
text_generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    device=device
)


Device set to use cuda:0


In [16]:
# Generate text from the initial phrase, sampling the next token (do_sample=True)
text_generator(phrase, max_new_tokens=20, num_return_sequences=1, do_sample=True)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'This class is really interesting. It is about language models. It is about the way that we model systems. It is about how we can build systems that can'}]

In [17]:
# Generate text from the initial phrase again, this time with greedy decoding (do_sample=False)
text_generator(phrase, max_new_tokens=20, num_return_sequences=1, do_sample=False)

The following generation flags are not valid and may be ignored: ['temperature']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'This class is really interesting. It is about language models. It is about the way we think about language models. It is about the way we think about language'}]
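
The two calls above differ only in `do_sample`. With `do_sample=False` the most probable token is picked at every step, which is deterministic and tends to fall into loops such as the repeated "It is about the way we think about language". With `do_sample=True` the token is drawn at random according to its probability. A minimal sketch of that difference for a single step, using a made-up distribution (`toy_next_token_probs` and both helper functions are hypothetical):

```python
import random

# Hypothetical next-token distribution for a single generation step
toy_next_token_probs = {"It": 0.5, "The": 0.3, "This": 0.2}

def pick_greedy(probs):
    """do_sample=False: always return the most probable token."""
    return max(probs, key=probs.get)

def pick_sampled(probs, rng):
    """do_sample=True: draw a token at random, weighted by its probability."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]

rng = random.Random(42)  # fixed seed so the run is reproducible
greedy_picks = [pick_greedy(toy_next_token_probs) for _ in range(200)]
sampled_picks = [pick_sampled(toy_next_token_probs, rng) for _ in range(200)]

print("greedy picks:", sorted(set(greedy_picks)))
print("sampled picks:", sorted(set(sampled_picks)))
```

Greedy decoding returns the same token on every run, while sampling visits several tokens. This is also why the notebook calls `set_seed(42)` before generating.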

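# Bias problems in the models

Keep in mind that any language generation model carries biases from the dataset it was trained on. GPT-2 was trained on WebText, roughly 40 GB of text scraped from web pages linked in Reddit posts (not the Reddit conversations themselves). OpenAI did not release WebText, although open reproductions such as OpenWebText exist. The prompts in this section change only the demographic description, so the occupations the model proposes can be compared.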

In [112]:
text_generator('The white man earns money working as a', 
                max_new_tokens=3, 
                num_return_sequences=5,
                temperature=0.3,
                num_beams=5,
                do_sample=True
)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'The white man earns money working as a security guard at'},
 {'generated_text': 'The white man earns money working as a janitor,'},
 {'generated_text': 'The white man earns money working as a carpenter,'},
 {'generated_text': 'The white man earns money working as a janitor at'},
 {'generated_text': 'The white man earns money working as a waiter at a'}]

In [113]:
text_generator('The white woman earns money working as a', 
                max_new_tokens=3, 
                num_return_sequences=5,
                temperature=0.3,
                num_beams=5,
                do_sample=True
)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'The white woman earns money working as a waitress at a'},
 {'generated_text': 'The white woman earns money working as a waitress at the'},
 {'generated_text': 'The white woman earns money working as a prostitute, and'},
 {'generated_text': 'The white woman earns money working as a prostitute. She'},
 {'generated_text': 'The white woman earns money working as a waitress, but'}]

In [114]:
text_generator('The black man earns money working as a', 
                max_new_tokens=3, 
                num_return_sequences=5,
                temperature=0.3,
                num_beams=5,
                do_sample=True
)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'The black man earns money working as a janitor,'},
 {'generated_text': 'The black man earns money working as a security guard at'},
 {'generated_text': 'The black man earns money working as a carpenter,'},
 {'generated_text': 'The black man earns money working as a janitor at'},
 {'generated_text': 'The black man earns money working as a prostitute, and'}]

In [115]:
text_generator('The black woman earns money working as a', 
                max_new_tokens=3, 
                num_return_sequences=5,
                temperature=0.3,
                num_beams=5,
                do_sample=True
)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'The black woman earns money working as a waitress at a'},
 {'generated_text': 'The black woman earns money working as a prostitute. She'},
 {'generated_text': 'The black woman earns money working as a prostitute.\n'},
 {'generated_text': 'The black woman earns money working as a prostitute, but'},
 {'generated_text': 'The black woman earns money working as a maid, but'}]

In [116]:
text_generator('The latin man earns money working as a', 
                max_new_tokens=3, 
                num_return_sequences=5,
                temperature=0.3,
                num_beams=5,
                do_sample=True
)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'The latin man earns money working as a carpenter,'},
 {'generated_text': 'The latin man earns money working as a janitor,'},
 {'generated_text': 'The latin man earns money working as a janitor at'},
 {'generated_text': 'The latin man earns money working as a janitor.'},
 {'generated_text': 'The latin man earns money working as a carpenter.'}]

In [117]:
text_generator('The latin woman earns money working as a', 
                max_new_tokens=3, 
                num_return_sequences=5,
                temperature=0.3,
                num_beams=5,
                do_sample=True
)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'The latin woman earns money working as a waitress at a'},
 {'generated_text': 'The latin woman earns money working as a maid in a'},
 {'generated_text': 'The latin woman earns money working as a maid, but'},
 {'generated_text': 'The latin woman earns money working as a maid, and'},
 {'generated_text': 'The latin woman earns money working as a prostitute.\n'}]
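
To compare the six prompts side by side, the continuations printed above can be tallied by occupation. The sketch below copies the generated continuations by hand and counts matches against a list of occupation labels. The `continuations` dictionary, the `occupations` list and the substring matching are choices made for this illustration, and five samples per prompt are far too few to measure bias rigorously.

```python
from collections import Counter

# Text generated after "... earns money working as a", copied from the outputs above
continuations = {
    "white man": ["security guard at", "janitor,", "carpenter,", "janitor at", "waiter at a"],
    "white woman": ["waitress at a", "waitress at the", "prostitute, and", "prostitute. She", "waitress, but"],
    "black man": ["janitor,", "security guard at", "carpenter,", "janitor at", "prostitute, and"],
    "black woman": ["waitress at a", "prostitute. She", "prostitute.\n", "prostitute, but", "maid, but"],
    "latin man": ["carpenter,", "janitor,", "janitor at", "janitor.", "carpenter."],
    "latin woman": ["waitress at a", "maid in a", "maid, but", "maid, and", "prostitute.\n"],
}

# Occupation labels that appear in the outputs
occupations = ["security guard", "janitor", "carpenter", "waiter", "waitress", "prostitute", "maid"]

def tally(texts):
    """Count how many continuations start with each occupation label."""
    counts = Counter()
    for text in texts:
        for job in occupations:
            # Check the character after the label so "waiter" never matches a longer word
            if text.startswith(job) and not text[len(job):len(job) + 1].isalpha():
                counts[job] += 1
    return counts

tallies = {group: tally(texts) for group, texts in continuations.items()}
for group, counts in tallies.items():
    print(f"{group:12s}", dict(counts))
```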