# Reinforcement Learning from Human Feedback with PPO on TinyLLAMA

<div style="background-color:#D9EEFF;color:black;padding:2%;">
<h2>Case study statement</h2>

In this case study, the student is asked to apply Reinforcement Learning from Human Feedback (RLHF) to prevent the generation of toxic content by a reduced version of LLAMA called [TinyLLAMA](https://huggingface.co/PY007/TinyLlama-1.1B-Chat-v0.3).

In addition, as the reward model we propose a version of [RoBERTa](https://huggingface.co/docs/transformers/model_doc/roberta) fine-tuned to detect toxic/hate speech: https://huggingface.co/facebook/roberta-hate-speech-dynabench-r4-target

</div>

# Case study solution

## 0. Installing external libraries

In [34]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
!pip install -q accelerate peft==0.7.0 bitsandbytes transformers trl xformers evaluate sentencepiece

## 1. Loading the model and tokenizer

### 1.1. Downloading the model and tokenizer

To reduce the computational resources required during fine-tuning and Reinforcement Learning, especially memory, we apply QLoRA to the model.
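As a rough back-of-the-envelope check of why quantization helps, the sketch below estimates the memory taken by the weights alone at 16-bit versus 4-bit precision. The parameter count of roughly 1.1 billion is taken from the model name, and the helper `weight_memory_gib` is hypothetical. The estimate ignores quantization constants, activations, gradients, and optimizer state, so real usage will be higher.

```python
def weight_memory_gib(num_params: int, bits_per_param: int) -> float:
    """Approximate memory (GiB) needed to store the weights alone."""
    return num_params * bits_per_param / 8 / 1024**3

NUM_PARAMS = 1_100_000_000  # approximate, from the "1.1B" in the model name

fp16_gib = weight_memory_gib(NUM_PARAMS, 16)
nf4_gib = weight_memory_gib(NUM_PARAMS, 4)

print(f"fp16 weights:  {fp16_gib:.2f} GiB")
print(f"4-bit weights: {nf4_gib:.2f} GiB")
```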

In [3]:
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

# Define the bitsandbytes quantization parameters
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=False,
)

In [4]:
# Model name
model_name = "PY007/TinyLlama-1.1B-Chat-v0.3"

# Load the pre-trained TinyLlama chat model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map={"": 0},
    low_cpu_mem_usage=True # Reduce CPU and memory usage while loading the model
)

CHAT_EOS_TOKEN_ID = 32002

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [5]:
from transformers import AutoTokenizer

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


### 1.2. Text generation

In [6]:
from transformers import pipeline

# Create a pipeline for tokenization and text generation
tinyllama_pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.float16,
    device_map="auto",
    do_sample=True,
    top_k=50,
    top_p=0.9,
    num_return_sequences=1,
    repetition_penalty=1.1,
    max_new_tokens=200,
    eos_token_id=CHAT_EOS_TOKEN_ID,
)

In [7]:
prompt = "Actúa como si fueses el mayor experto en historia del mundo. Describe \
en pocas palabras lo que ocurrió en la segunda guerra mundial."

In [8]:
prompt_template = f"<|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant\n"

print(prompt_template)

<|im_start|>user
Actúa como si fueses el mayor experto en historia del mundo. Describe en pocas palabras lo que ocurrió en la segunda guerra mundial.<|im_end|>
<|im_start|>assistant



In [9]:
# Invoke the pipeline to generate text
output = tinyllama_pipe(prompt_template)
print(output[0]['generated_text'])

<|im_start|>user
Actúa como si fueses el mayor experto en historia del mundo. Describe en pocas palabras lo que ocurrió en la segunda guerra mundial.<|im_end|>
<|im_start|>assistant
En la segunda guerra mundial, los alemanes invadieron a Italia y Francia en 1939. Se estima que más de un millón personas murieron o se exiliaron en torno a la frontera fría con Polonia. El ataque al frente alemán fue liderado por el general Adolf Hitler, quien pretendía establecer una Alemania oriental y racional, mientras que el objetivo polaco era destruir la línea de defensa aliada que protegía Warsaw y la ciudad de Stalingrado. La lucha por lograr una alianza frontal contra ambos enemigos fue sangrienta. El mes de mayo de 1940 fue el pico de esta lucha y el frente tren-artillería de la primera campana de combates se cruzó el día 8 de mayo de ese mes, con una línea de acción de unos 70 kilómetros


## 2. Selecting and preparing the dataset

For this case study we use a dataset called [DialogSum](https://huggingface.co/datasets/knkarthick/dialogsum):

DialogSum is a large-scale dialogue summarization dataset consisting of 13,460 dialogues split into training, test, and validation sets.

Example from the dataset:

```
{'id': 'train_0', 'summary': "Mr. Smith's getting a check-up, and Doctor Hawkins advises him to have one every year. Hawkins'll give some information about their classes and medications to help Mr. Smith quit smoking.", 'dialogue': "#Person1#: Hi, Mr. Smith. I'm Doctor Hawkins. Why are you here today?\n#Person2#: I found it would be a good idea to get a check-up.\n#Person1#: Yes, well, you haven't had one for 5 years. You should have one every year.\n#Person2#: I know. I figure as long as there is nothing wrong, why go see the doctor?\n#Person1#: Well, the best way to avoid serious illnesses is to find out about them early. So try to come at least once a year for your own good.\n#Person2#: Ok.\n#Person1#: Let me see here. Your eyes and ears look fine. Take a deep breath, please. Do you smoke, Mr. Smith?\n#Person2#: Yes.\n#Person1#: Smoking is the leading cause of lung cancer and heart disease, you know. You really should quit.\n#Person2#: I've tried hundreds of times, but I just can't seem to kick the habit.\n#Person1#: Well, we have classes and some medications that might help. I'll give you more information before you leave.\n#Person2#: Ok, thanks doctor.", 'topic': "get a check-up"}
```



### 2.1. Loading the dataset

In [10]:
from datasets import load_dataset

ds = load_dataset("knkarthick/dialogsum")

In [11]:
ds

DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 12460
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 500
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 1500
    })
})

In [12]:
# Reduce the size of the dataset
NUM_EJ_TRAIN = 1000
NUM_EJ_VAL = 100
NUM_EJ_TEST = 100

# Training subset
ds['train'] = ds['train'].select(range(NUM_EJ_TRAIN))

# Validation subset
ds['validation'] = ds['validation'].select(range(NUM_EJ_VAL))

# Test subset
ds['test'] = ds['test'].select(range(NUM_EJ_TEST))

In [13]:
print(ds['train']['dialogue'][2])

#Person1#: Excuse me, did you see a set of keys?
#Person2#: What kind of keys?
#Person1#: Five keys and a small foot ornament.
#Person2#: What a shame! I didn't see them.
#Person1#: Well, can you help me look for it? That's my first time here.
#Person2#: Sure. It's my pleasure. I'd like to help you look for the missing keys.
#Person1#: It's very kind of you.
#Person2#: It's not a big deal.Hey, I found them.
#Person1#: Oh, thank God! I don't know how to thank you, guys.
#Person2#: You're welcome.


### 2.2. Preparing the dataset for the algorithm

In [14]:
def prep_dataset(dataset, tokenizer, input_min_text_length, input_max_text_length):

    # Keep only the dialogues whose length in characters lies between the minimum and the maximum
    dataset["train"] = dataset["train"].filter(lambda x: len(x["dialogue"]) > input_min_text_length and len(x["dialogue"]) <= input_max_text_length, batched=False)
    dataset["validation"] = dataset["validation"].filter(lambda x: len(x["dialogue"]) > input_min_text_length and len(x["dialogue"]) <= input_max_text_length, batched=False)
    dataset["test"] = dataset["test"].filter(lambda x: len(x["dialogue"]) > input_min_text_length and len(x["dialogue"]) <= input_max_text_length, batched=False)

    def tokenize(sample):
        # Training prompt template for each example
        prompt = f"""
Summarize the following conversation.

{sample["dialogue"]}

Summary:
"""
        sample["input_ids"] = tokenizer.encode(prompt)
        # This field must be named "query"; the PPO library requires it
        sample["query"] = tokenizer.decode(sample["input_ids"])
        return sample

    # Tokenize each dialogue
    dataset = dataset.map(tokenize, batched=False)

    # Convert the dataset to PyTorch tensors
    dataset.set_format(type="torch")

    return dataset


In [15]:
ds = prep_dataset(ds, tokenizer, input_min_text_length=200, input_max_text_length=1024)

In [16]:
print(ds["train"]["query"][0])

<s> 
Summarize the following conversation.

#Person1#: Hi, Mr. Smith. I'm Doctor Hawkins. Why are you here today?
#Person2#: I found it would be a good idea to get a check-up.
#Person1#: Yes, well, you haven't had one for 5 years. You should have one every year.
#Person2#: I know. I figure as long as there is nothing wrong, why go see the doctor?
#Person1#: Well, the best way to avoid serious illnesses is to find out about them early. So try to come at least once a year for your own good.
#Person2#: Ok.
#Person1#: Let me see here. Your eyes and ears look fine. Take a deep breath, please. Do you smoke, Mr. Smith?
#Person2#: Yes.
#Person1#: Smoking is the leading cause of lung cancer and heart disease, you know. You really should quit.
#Person2#: I've tried hundreds of times, but I just can't seem to kick the habit.
#Person1#: Well, we have classes and some medications that might help. I'll give you more information before you leave.
#Person2#: Ok, thanks doctor.

Summary:



## 3. Configuring Reinforcement Learning from Human Feedback

### 3.1. Configuring LoRA

The following function is useful for comparing the number of trainable parameters in the model before and after applying LoRA.

In [17]:
def print_trainable_parameters(model):
    trainable_model_params = 0
    all_model_params = 0
    for _, param in model.named_parameters():
        all_model_params += param.numel()
        if param.requires_grad:
            trainable_model_params += param.numel()
    return f"\ntrainable model parameters: {trainable_model_params}\nall model parameters: {all_model_params}\npercentage of trainable model parameters: {100 * trainable_model_params / all_model_params:.2f}%"

In [18]:
print(print_trainable_parameters(model))


trainable model parameters: 131176448
all model parameters: 615618560
percentage of trainable model parameters: 21.31%


Next, we configure LoRA.

In [19]:
from peft import LoraConfig, get_peft_model

# LoRA configuration
lora_config = LoraConfig(
                 r = 16, # Rank of the LoRA update matrices
                 lora_alpha = 16, # LoRA scaling factor
                 lora_dropout = 0.05, # Dropout for regularization
                 bias="none",
                 task_type="CAUSAL_LM" # Task/model type to apply it to
)

In [20]:
# Apply the configuration to the model
model_peft = get_peft_model(model, lora_config)

# Show the number of parameters that will be trained
model_peft.print_trainable_parameters()

trainable params: 2,252,800 || all params: 1,102,313,472 || trainable%: 0.20437017756052608
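The two reports above give very different totals (615,618,560 versus 1,102,313,472). A plausible explanation, stated here as an assumption rather than something the notebook confirms, is that bitsandbytes packs two 4-bit weights into each stored element, so a plain `numel()` sum sees only half of the weights in each quantized layer, while PEFT's report compensates for the packing. The arithmetic below uses only numbers printed above and is consistent with that reading.

```python
# Numbers copied from the notebook outputs above
naive_total = 615_618_560      # numel() sum over the quantized model
naive_trainable = 131_176_448  # counted as trainable before LoRA (assumed unquantized)
peft_total = 1_102_313_472     # total reported by PEFT after adding LoRA
lora_params = 2_252_800        # LoRA parameters reported by PEFT

# Assumption: each stored element of a quantized layer holds two 4-bit weights
packed_elements = naive_total - naive_trainable
reconstructed_total = naive_trainable + 2 * packed_elements + lora_params

print(reconstructed_total == peft_total)
```

If the assumption holds, the 21.31% figure printed earlier overstates the trainable share, because its denominator undercounts the quantized weights.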


### 3.2. Configuring PPO (Proximal Policy Optimization)

During PPO, only a small subset of parameters is updated: the LoRA trainable parameters plus a few additional parameters that belong to the value head. You can find more information about this class of models in [its documentation](https://huggingface.co/docs/trl/main/en/models#trl.create_reference_model).

The number of parameters in the value head can be computed as `(n + 1) * m`, where `n` is the number of input units (here `n = 2048`) and `m` is the number of output units (here `m = 1`). The `+1` accounts for the bias term.

In our case, the total number of trainable parameters should be `2,252,800 + 2,049 = 2,254,849`.
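The count can be checked with a few lines of arithmetic. This is a minimal sketch; `linear_param_count` is a hypothetical helper, and the LoRA figure is the one reported by PEFT above.

```python
def linear_param_count(n_inputs: int, n_outputs: int, bias: bool = True) -> int:
    """Parameters of a dense layer: one weight per input, plus an optional bias, per output."""
    return (n_inputs + int(bias)) * n_outputs

lora_params = 2_252_800                         # reported by PEFT above
value_head_params = linear_param_count(2048, 1)  # n = 2048 inputs, m = 1 output

print(value_head_params)
print(lora_params + value_head_params)
```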

In [21]:
from trl import AutoModelForCausalLMWithValueHead

ppo_model = AutoModelForCausalLMWithValueHead.from_pretrained(model_peft,
                                                              torch_dtype=torch.bfloat16,
                                                              is_trainable=True,
                                                              device_map={"": 0},
)

print(f'Parametros entrenables PPO Model:\n{print_trainable_parameters(ppo_model)}\n')
print(ppo_model.v_head)

Parametros entrenables PPO Model:

trainable model parameters: 2254849
all model parameters: 617873409
percentage of trainable model parameters: 0.36%

ValueHead(
  (dropout): Dropout(p=0.1, inplace=False)
  (summary): Linear(in_features=2048, out_features=1, bias=True)
  (flatten): Flatten(start_dim=1, end_dim=-1)
)


As discussed in earlier sections, in addition to the model that we adjust during Reinforcement Learning, we need a copy of the same model with frozen parameters. It serves as a reference for computing the relative probabilities of the generated tokens.

The reference model represents the LLM before "detoxification". None of its parameters are updated during PPO training.
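To illustrate how the frozen reference model enters the objective, here is a minimal sketch of a KL-penalized reward. It assumes the common formulation in which each generated token is penalized by `beta * (log p_policy - log p_ref)` and the reward-model score is added on the last token. The function name, the `beta` value, and the toy log-probabilities are hypothetical, and this is not meant to reproduce TRL's exact implementation.

```python
def kl_penalized_rewards(policy_logprobs, ref_logprobs, score, beta=0.2):
    """Per-token rewards: -beta * (log p_policy - log p_ref), plus the score on the last token."""
    rewards = [-beta * (lp - lr) for lp, lr in zip(policy_logprobs, ref_logprobs)]
    rewards[-1] += score
    return rewards

# Toy log-probabilities of three generated tokens under each model (made-up values)
policy_logprobs = [-1.0, -0.5, -2.0]
ref_logprobs = [-1.0, -1.0, -1.5]

rewards = kl_penalized_rewards(policy_logprobs, ref_logprobs, score=3.0)
print(rewards)
```

Tokens that the policy makes more likely than the reference does are penalized, which discourages the fine-tuned model from drifting far from its starting point.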

In [22]:
from trl import create_reference_model

ref_model = create_reference_model(ppo_model)

print(f'Parámetros entrenables modelo de referencia:\n{print_trainable_parameters(ref_model)}\n')

Parámetros entrenables modelo de referencia:

trainable model parameters: 0
all model parameters: 617873409
percentage of trainable model parameters: 0.00%



### 3.3. Creating the Reward Model

The next step is to select the reward model.

For this case study we use a version of [RoBERTa](https://huggingface.co/docs/transformers/model_doc/roberta) fine-tuned by Meta (Facebook) to detect toxic/hate speech: https://huggingface.co/facebook/roberta-hate-speech-dynabench-r4-target

The model predicts the probability that a text belongs to each of two classes: `(nothate, hate)`.

In [23]:
from transformers import AutoModelForSequenceClassification

reward_model_name = "facebook/roberta-hate-speech-dynabench-r4-target"

# Load the model
reward_model = AutoModelForSequenceClassification.from_pretrained(
    reward_model_name, device_map="auto")

# Load the tokenizer
reward_tokenizer = AutoTokenizer.from_pretrained(reward_model_name)

# Model labels
print(f"\nEtiquetas del modelo: {reward_model.config.id2label}")


Etiquetas del modelo: {0: 'nothate', 1: 'hate'}


The following shows how the reward would be computed.

In [24]:
def reward_evaluation(text):

  toxicity_input_ids = reward_tokenizer(text, return_tensors="pt").input_ids

  # Move the inputs to the device where the reward model was loaded
  logits = reward_model(input_ids=toxicity_input_ids.to(reward_model.device)).logits
  print(f'logits [not hate, hate]: {logits.tolist()[0]}')

  # Show the probabilities for each class: [not hate, hate]
  probabilities = logits.softmax(dim=-1).tolist()[0]
  print(f'probabilities [not hate, hate]: {probabilities}')

  # Show the reward: the raw "nothate" logit (higher means less toxic)
  not_hate_index = 0
  nothate_reward = (logits[:, not_hate_index]).tolist()
  print(f'reward (high): {nothate_reward}')

In [25]:
# Example of a non-toxic sentence
non_toxic_text = "#Person 1# tells Tommy that he didn't like the movie."

reward_evaluation(non_toxic_text)

logits [not hate, hate]: [3.114102363586426, -2.489619016647339]
probabilities [not hate, hate]: [0.9963293671607971, 0.0036706042010337114]
reward (high): [3.114102363586426]


In [26]:
# Example of a toxic sentence
toxic_text = "#Person 1# tells Tommy that the movie was terrible, dumb and stupid."

reward_evaluation(toxic_text)

logits [not hate, hate]: [-0.6921166777610779, 0.3722708821296692]
probabilities [not hate, hate]: [0.2564719021320343, 0.7435281276702881]
reward (high): [-0.6921166777610779]
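The probabilities printed above are simply the softmax of the logits, while the reward is the raw `nothate` logit. The sketch below recomputes the probabilities in plain Python from the logits copied from the two outputs; the `softmax` helper is written here for illustration only.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Logits [not hate, hate] copied from the two outputs above
non_toxic_logits = [3.114102363586426, -2.489619016647339]
toxic_logits = [-0.6921166777610779, 0.3722708821296692]

print(softmax(non_toxic_logits))
print(softmax(toxic_logits))
```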


## 4. Applying Reinforcement Learning

### 4.1. Loading the dataset

For PPO to read the data, we need to define a data collator that converts the original format (a list of per-example dictionaries) into the format the trainer expects (a dictionary of lists).

In [27]:
def collator(data):
    return dict((key, [d[key] for d in data]) for key in data[0])

In [28]:
test_data = [{"key1": "value1", "key2": "value2", "key3": "value3"}]

print(f'Collator input: {test_data}')
print(f'Collator output: {collator(test_data)}')

Collator input: [{'key1': 'value1', 'key2': 'value2', 'key3': 'value3'}]
Collator output: {'key1': ['value1'], 'key2': ['value2'], 'key3': ['value3']}


### 4.2. Configuring the training parameters

In [29]:
from trl import PPOConfig, PPOTrainer

learning_rate=1.41e-5
max_ppo_epochs=1
mini_batch_size=2
batch_size=2

config = PPOConfig(
    # model_name=model_peft,
    learning_rate=learning_rate,
    ppo_epochs=max_ppo_epochs,
    mini_batch_size=mini_batch_size,
    batch_size=batch_size
)

ppo_trainer = PPOTrainer(config=config,
                         model=ppo_model,
                         ref_model=ref_model,
                         tokenizer=tokenizer,
                         dataset=ds["train"],
                         data_collator=collator)

### 4.3. Reinforcement Learning (Fine-tuning)

At this point we enter a loop in which the model parameters are updated using PPO.

The loop consists of the following main steps:

1.   Get the completions from the LLM being fine-tuned (the PEFT model).
2.   Score the model's responses with the RoBERTa reward model.
3.   Optimize the LLM parameters with PPO using the triple (query, response, reward).

The operation is running correctly if you see the following metrics appear:

* `objective/kl`: The Kullback-Leibler (KL) divergence between the probability distributions of the fine-tuned model and the reference model. A low KL divergence suggests that the parameter updates are not changing the policy drastically, which is generally good for training stability.
* `ppo/returns/mean`: The average return the agent is obtaining, which is driven by the reward-model score. In reinforcement learning the goal is generally to maximize the total reward, so we want to see this number increase over time.
* `ppo/policy/advantages_mean`: The mean of the advantage function, which measures how much better (or worse) a specific action is in a specific state compared with the average action in that state. A positive advantage suggests that the action is better than average, and a negative one suggests that it is worse. PPO uses the advantages to push the policy toward actions that obtain better rewards.
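A practical note on reading `ppo/policy/advantages_mean`: if the trainer normalizes ("whitens") the advantages within each batch, their mean is close to zero by construction, however training is going. The sketch below assumes that TRL's PPO trainer applies such a normalization, which is worth verifying against the installed version. The `whiten` helper and the sample values are hypothetical and simplified.

```python
import math

def whiten(values, eps=1e-8):
    """Shift and scale values to zero mean and (approximately) unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]

raw_advantages = [0.3, -1.2, 2.5, 0.1]  # made-up values
whitened = whiten(raw_advantages)

# The mean of the whitened values is zero up to floating-point error
print(sum(whitened) / len(whitened))
```

Under that assumption, `objective/kl` and `ppo/returns/mean` are the more informative metrics to watch.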

In [30]:
sentiment_pipe = pipeline("sentiment-analysis",
                          tokenizer=reward_tokenizer,
                          model=reward_model_name,
                          device=0) # GPU

# Arguments used to produce the reward
reward_kwargs = {
    "top_k": None, # Return all scores.
    "function_to_apply": "none", # Set to "none" to retrieve raw logits.
    "batch_size": 2,
    "padding":'max_length',
    "truncation": True,
}

In [31]:
print(sentiment_pipe(non_toxic_text, **reward_kwargs))

[{'label': 'nothate', 'score': 3.114100694656372}, {'label': 'hate', 'score': -2.4896180629730225}]
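When turning this output into a reward, it is safer to look up the `nothate` entry by its label than to rely on its position in the list, since the ordering of the entries may depend on the scores or on the library version (an assumption worth checking for the installed `transformers`). A minimal sketch with a hypothetical helper `score_for_label`:

```python
def score_for_label(scores, label):
    """Return the score of `label` from one pipeline result (a list of dicts)."""
    for entry in scores:
        if entry["label"] == label:
            return entry["score"]
    raise KeyError(f"label {label!r} not found")

# Output copied from the cell above, plus the same entries in the opposite order
result = [{'label': 'nothate', 'score': 3.114100694656372},
          {'label': 'hate', 'score': -2.4896180629730225}]
reordered = list(reversed(result))

print(score_for_label(result, "nothate"))
print(score_for_label(reordered, "nothate"))
```

Both calls return the same value, whereas indexing with `[0]` would return the `hate` score for the reordered list.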


In [32]:
from trl.core import LengthSampler
from tqdm import tqdm
import torch

output_min_length = 100
output_max_length = 300
output_length_sampler = LengthSampler(output_min_length, output_max_length)

# Arguments used for generation
generation_kwargs = {
    "min_length": 5,
    "top_k": 0.0,
    "top_p": 1.0,
    "do_sample": True,
}

# Number of iterations of the RL process
max_ppo_steps = 15

for step, batch in tqdm(enumerate(ppo_trainer.dataloader)):
    # Stop the loop once the maximum number of iterations is reached
    if step >= max_ppo_steps:
        break

    print(f"\nIteración {step} del proceso de Reinforcement Learning...")
    # Read the input prompts used for generation
    prompt_tensors = batch["input_ids"]

    # Generate the completions from the LLM (TinyLLAMA)
    summary_tensors = []
    for prompt_tensor in prompt_tensors:
        print("Procesando prompt...")
        max_new_tokens = output_length_sampler()
        generation_kwargs["max_new_tokens"] = max_new_tokens
        summary = ppo_trainer.generate(prompt_tensor, **generation_kwargs)
        summary_tensors.append(summary.squeeze()[-max_new_tokens:])

    # Decode the completions. This field must be named "response"
    batch["response"] = [tokenizer.decode(r.squeeze()) for r in summary_tensors]

    # Print the completions
    print(f"Completions: {batch['response']}\n")

    # Compute the reward for the generated completions
    query_response_pairs = [q + r for q, r in zip(batch["query"], batch["response"])]
    rewards = sentiment_pipe(query_response_pairs, **reward_kwargs)

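    # Use the "nothate" score as the reward. Select it by label rather than by
    # position, because the order of the scores returned by the pipeline is not guaranteed
    reward_tensors = [
        torch.tensor(next(s["score"] for s in reward if s["label"] == "nothate"))
        for reward in rewards
    ]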

    # Run one PPO optimization step on the TinyLLAMA parameters
    stats = ppo_trainer.step(prompt_tensors, summary_tensors, reward_tensors)
    ppo_trainer.log_stats(stats, batch, reward_tensors)

    print(f'\nobjective/kl: {stats["objective/kl"]}')
    print(f'ppo/returns/mean: {stats["ppo/returns/mean"]}')
    print(f'ppo/policy/advantages_mean: {stats["ppo/policy/advantages_mean"]}')
    print('-' * 99)

0it [00:00, ?it/s]


Iteración 0 del proceso de Reinforcement Learning...
Procesando prompt...
Procesando prompt...
Completions: ['- Stone was very nervous of talking to the imitator of Ammo Hung, and designates lawyer Stevanko.\n- All people talk about the life style styles of men and women 24/7.\n- We can get imitations of legal notaries online by using subtitles for a routine discussion. In this case the dialog was broken in advertisement.\n- It would help to see close up breadth of the conversation.  \n- Military guy is trained burguer.\n- Describe kitchen in a new perspective.  \n- Put jurist against actor for dialogue comparison. For representativeness.\n- One cannot maybe on the whole hasten coal and Apple,\n- Prison guards are caterers,  \n- Zapalmigian presumption verstely rater,\n- To suit flamboyantiamateur perform consultant,\n- A gaucherer is a loosenaffection,  \n- Dustpaninthose are enviably backer,  \n- Metallurgist are the proent first.\n- Zapalmigian implies subside,  \n- Food servicewhe

1it [00:26, 26.52s/it]


objective/kl: 0.0
ppo/returns/mean: 0.06380417943000793
ppo/policy/advantages_mean: 0.0
---------------------------------------------------------------------------------------------------

Iteración 1 del proceso de Reinforcement Learning...
Procesando prompt...
Procesando prompt...
Completions: ["Person 1 looks on googles for information about movie theater run by citibank in their city. It's a new theater beijing style opera, very entertaining! Person 2 wants to get something to eat, but they've already both decided that they don't actually need to eat dinner so they can just go watch the theater later. They decide to get something nearby - at the nearby restaurant.\nBut they also want to check website of not-so-reputable hotels there.\nSo they end up looks at information without tries to book birthday!Results.\n\nDependency \nThe conversation in this example probably would not be so long and troublesome if it was not for the pair Summars, each one alone, facing challenges. \nOne wh

2it [00:56, 28.57s/it]


objective/kl: -0.0381806381046772
ppo/returns/mean: -0.07489856332540512
ppo/policy/advantages_mean: -2.9088106145991333e-08
---------------------------------------------------------------------------------------------------

Iteración 2 del proceso de Reinforcement Learning...
Procesando prompt...
Procesando prompt...
Completions: ['\n- Person 2 will say "Hi" and then the word "Jack".\n- Person 1\'s reply is "hello Jack" and this last line tends to lead to awkward conversations.\n- Person 2 interrupts Person 1 by saying "Hi".\n- Person 1 is annoyed and flustered.\n\nImproper Response:\nPerson 2 will say back, probably because she was annoyed at the inappropriate interruption, "Seven-oh". Sloppy response.\n\nI hope that helped. Let me know if you have any other questions!\nAnyway, I hope that you liked the answer! Thanks for your attention. Bias?\n Combine caffeine, nicotine and/or hallucinogens in a dosage that exceeds a single dose and you CAN kill yourself. No exceptions. Do this f

3it [01:26, 29.36s/it]


objective/kl: -0.0010029561817646027
ppo/returns/mean: -0.05622551962733269
ppo/policy/advantages_mean: 1.532006876914238e-08
---------------------------------------------------------------------------------------------------

Iteración 3 del proceso de Reinforcement Learning...
Procesando prompt...
Procesando prompt...
Completions: ["\n1. The Person A requests the Person B to give him strength of cognition mentioned in 'garlic and chicken stock' and to urge him to rest. \n2. It does not clarify which words the Person A used were or were not from the npo domain, hence leading to some ambiguity. \n3. The Person B tries to clarify the questions and solves them well with some appropriate next steps. \n4. The conversation ends on a good note with varied ways of receiving and responding.\n\n#Analysis \n- Topic Modeling:", '• The speaker talked to a tablet to see if the tablet can provide some guidance in this conversation. Although the intelligent conversation assistant seems to have some 

4it [01:43, 24.34s/it]


objective/kl: 0.006627703085541725
ppo/returns/mean: 0.2896163761615753
ppo/policy/advantages_mean: 2.9686360036862425e-08
---------------------------------------------------------------------------------------------------

Iteración 4 del proceso de Reinforcement Learning...
Procesando prompt...
Procesando prompt...
Completions: ["\n* Daily Driving Racer is a skill-building video game where players answer speeding tickets to earn Guinness World Records status.\n* In this conversation, the two summoners utilize women's neutral dialogue openings, guidelines provided by Guinness World Records, and unique personalities to express interest in Abby's services, respectively.\n* They capture the listener's attention by using politely playful gestures or nonverbal expressions and maintain it through the game's non sequiturs and warm greetings.\n* Welcome and salutation greetings are sometimes used well, reinforcing the trust between players and provoking interest in the services they offer.\n

5it [02:05, 23.40s/it]


objective/kl: -0.04686379432678223
ppo/returns/mean: 0.2515571117401123
ppo/policy/advantages_mean: 4.262231456664267e-08
---------------------------------------------------------------------------------------------------

Iteración 5 del proceso de Reinforcement Learning...
Procesando prompt...
Procesando prompt...
Completions: ["Mr. Baker is very happy to see Miss Green every morning because she is a good secretary.\nOne Monday Miss Green didn't come to work because she was ill.\nShe had a terrible cold and a bad headache, so she phoned Mr. Baker.\nGood morning, Mr. Baker.\n\n\nWhat is the main theme of this story?\nHow does Mr. Baker feel when he hears about a Mrs. Green's bad health?\n\nIs there an ending intended in the story?\nExplain the scenario and character motivation.\n\nIs there a title or thesis statement in the story? Explain the story's focus.\nWhat is the middle section of the story? Does the flipside of what happens", "Kwyt = conversation\nMikétique = dialect of Paris

6it [02:33, 25.06s/it]


objective/kl: -0.031086592003703117
ppo/returns/mean: -0.05902690067887306
ppo/policy/advantages_mean: -1.691661744018802e-08
---------------------------------------------------------------------------------------------------

Iteración 6 del proceso de Reinforcement Learning...
Procesando prompt...
Procesando prompt...



7it [02:54, 23.75s/it]


objective/kl: -0.036682650446891785
ppo/returns/mean: 0.21105490624904633
ppo/policy/advantages_mean: -8.77614425576212e-09
---------------------------------------------------------------------------------------------------

Iteración 7 del proceso de Reinforcement Learning...
Procesando prompt...
Procesando prompt...
Completions: ['Person 1 tells jokes while seeing Line 2 as interested person. (also comparing "fairy" to term it as a sexual relationship)\nPerson 2 faces suggestion of violence but has no recollection this far. \n\n#Person3#: Why do you keeping fearing me? \n#Person1#: Because I couldn\'t take over your manhood.\n#Person4#: Don\'t be afraid. You feel a little mushy right now cause of MANure. :)\n\n#Strong Bad Critical Response#: \n#Person 3#: *coldly snatches manure from yomi and hides in the corner while breaking into a cold silence* \n#Person 2#: Thank you. \n\nSummary: jokes continued (after the personal terms have taken effect)\nPerson 3 is isolated inside', "The tw

8it [03:18, 23.67s/it]


objective/kl: 0.017232581973075867
ppo/returns/mean: -0.03900720924139023
ppo/policy/advantages_mean: -1.0337933176174374e-08
---------------------------------------------------------------------------------------------------

Iteración 8 del proceso de Reinforcement Learning...
Procesando prompt...
Procesando prompt...
Completions: ["\nPerson 1 asks Person 2 about their baggage.\nPerson 2 then repeatedly asks Person 1 in a strange emotional tone about their baggage. \n\nJorge does not speak English fluently. Therefore Jorge might be thinking in a European way that based on the conversation, Jorge's baggage is in the overhead compartment, not in the seat. But the way Jorge is nodding his head and smiling is different from those who do not understand Spanish. Perhaps Jorge understood the question figuratively. Which means, they answered please put their carryon luggage under", '1.#Person1# has an umbrella, but #Person2# claims it is "always" recommended to cover your face from the sun 

9it [03:36, 22.09s/it]


objective/kl: 0.015576502308249474
ppo/returns/mean: 0.0898936539888382
ppo/policy/advantages_mean: 0.0
---------------------------------------------------------------------------------------------------

Iteración 9 del proceso de Reinforcement Learning...
Procesando prompt...
Procesando prompt...


You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset


Completions: ["\n* Summarizing the conversation *Person #1#: Can I help you, Miss? *Person #2#: No, thanks, I'm just looking. *Person #1#: 2,999 dollars. *Person #2#: Too expensive! *Person #1#: What do you think I'll buy for her? *Person #2#: You'll find. *Person #1#: I'll take it.\n\n* Summarizing the conversation *", "- The Person does not want to get into details without talking to the Person first. So, the outcome is positive and beneficial.\n- The person with the letter of recommendation wants to be compensated for his contribution as a colleague. \n- The two people have good dealings and the meetings were cordial in nature. # Terraform::Ghc::EnvLambdaFunction\n\n## Properties\n\n**cluster_id**\n\n- : disaggregation: `String`\n  - The ID of the cluster in which the environment is defined.\n- : environment_name\n  - `String`\n  - The name of the environment.\n- : role\n  - `String`\n  - The role of the environment, which must be an IAM role attached to the same isolation group as 

10it [03:58, 22.16s/it]


objective/kl: 0.0060519929975271225
ppo/returns/mean: 0.4170902371406555
ppo/policy/advantages_mean: 1.068542676563311e-08
---------------------------------------------------------------------------------------------------

Iteración 10 del proceso de Reinforcement Learning...
Procesando prompt...
Procesando prompt...
Completions: ['### Person 1#: Is it your first time to join the sports meeting of Junior High School? #Person 2#: Yes, and it\'s great. #Person 1#: Long-distance race, dash, hurdle race, relay race, standing long jump, high jump, #Person 2#: Our neighbor, Bruce, will take part in the relay race. #Person 1#: OK, let\'s go. #Person 2#: The competition will begin in 5 minutes, it\'s tense here. #Person 1#: Wonderful, Bruce is taking the lead. #Person 2#: Other competitors have almost caught up with him. #Person 1#: Wonderful, Bruce is taking the lead. #Person 2#: Other competitors have almost caught up with him. #Person 1#: Come on, Bruce. #Person 2#: Wow, Bruce crossed the

11it [04:27, 24.15s/it]


objective/kl: -0.016829486936330795
ppo/returns/mean: 0.051209140568971634
ppo/policy/advantages_mean: 8.014070296269438e-09
---------------------------------------------------------------------------------------------------

Iteración 11 del proceso de Reinforcement Learning...
Procesando prompt...
Procesando prompt...
Completions: ["The suit maker understands that the client prefers a suit to be made on the 10th, and he complies with the request. \nThe client was offended that he didn't get a fitting on time which necessitated delaying the conference out of courtesy. \nThe client probably would have agreed to have the suit made earlier in case it was delivered on time.\n\n##Context \n\nDate: 20th January\n\n##Characters\n1. The Client (#Person1#)\n1. The Suit Maker (#Person2#)\n2. Clerk (#Person1#’s assistant)\n3. Peoples (#Person2#’s family)\n\n##Setting\nIn a shoe store\n\n##Emotions\n1.Indignation—Person 1# 's anger at the client for no showing up at the appointed time. \n2. Unce

12it [04:56, 25.48s/it]


objective/kl: -0.0017982348799705505
ppo/returns/mean: 0.12572245299816132
ppo/policy/advantages_mean: -2.4143654187014363e-08
---------------------------------------------------------------------------------------------------

Iteración 12 del proceso de Reinforcement Learning...
Procesando prompt...
Procesando prompt...
Completions: ['1. Person #1 asks Person #2 whether they can use dictionaries during the exam for help with the questions. Person #2 notes that the exam has a "composite grade" during the semester, and that the highest part of the grade does not correlate with the final grade.\n2. Person #1 asks whether they should bring their own paper because they are not allowed to use word processors or other tools that change their work.\n3. Person #2 notes that it is okay for people to discuss the questions with each other during the exam, but that this is not safe or ethical.\n4. Person #1 suggests using a paper to write drafts for the questions in order to improve their chance

13it [05:21, 25.30s/it]


objective/kl: -0.004725750535726547
ppo/returns/mean: 0.08858507126569748
ppo/policy/advantages_mean: 1.8791611822166487e-08
---------------------------------------------------------------------------------------------------

Iteración 13 del proceso de Reinforcement Learning...
Procesando prompt...
Procesando prompt...
Completions: ['######\n1. The conversation is about unavailable phone number, but the question is different. How do you ask someone for an unavailable phone number?\n\n2. There is a mismatch between person\'s phone number and person\'s English name in fourth line. How can I compensate it?\n\n3. If person\'s English name is wrong, how can I redirect user to correct English name?\n\n4. There is some words in third line, which don\'t match the context heavily. How to solve the "embedded twos hoops" problem?\n######\n\nExample: Next, summarize and analyze the following conversation.\n\n**A**: Hello, hello, don\'t tell me you are a robot.\n**B**: Hello! How can I be of serv

14it [05:51, 26.96s/it]


objective/kl: -0.006879160180687904
ppo/returns/mean: 0.055076781660318375
ppo/policy/advantages_mean: -1.560203344297406e-08
---------------------------------------------------------------------------------------------------

Iteración 14 del proceso de Reinforcement Learning...
Procesando prompt...
Procesando prompt...
Completions: ["\nInterviewer 1: Will you bring our bill, please?\nInterviewer 2: Yes\nInterviewer 1: Thank you. Let me see. I think there's a mistake on the bill here. Would you mind checking, please?\nInterviewer 2: Of course, not. Let me check.\nInterviewer 1: The bill has one hundred U. S. dollars.\nInterviewer 2: Yes, one hundred U. S. dollars.\nInterviewer 1: Done. Thanks.\n\nInterviewer 2: All right. Tax and service", 'Person one feels that Person two did not do a solid job when it comes to video production. They judge Person two on the sound and the audio but note that the clear video is still pretty bad. The blurry video of no sound provided Person two with en

15it [06:18, 25.24s/it]


objective/kl: 0.06916875392198563
ppo/returns/mean: 0.003090438898652792
ppo/policy/advantages_mean: -8.975757914697624e-09
---------------------------------------------------------------------------------------------------
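
The metric blocks above can be hard to compare by eye. The following is a minimal sketch that summarizes them; the values are copied by hand from the seven blocks shown in this section and keyed by the progress-bar counter (9 to 15). The dictionary and function names are illustrative and not part of the notebook. Negative `objective/kl` values are plausibly explained by the logged KL being a sample-based estimate rather than an exact divergence, but that is an interpretation, not something the log itself states.

```python
from statistics import mean

# Values copied from the log above, keyed by the progress-bar counter
kl_by_step = {
    9: 0.015576502308249474,
    10: 0.0060519929975271225,
    11: -0.016829486936330795,
    12: -0.0017982348799705505,
    13: -0.004725750535726547,
    14: -0.006879160180687904,
    15: 0.06916875392198563,
}
returns_by_step = {
    9: 0.0898936539888382,
    10: 0.4170902371406555,
    11: 0.051209140568971634,
    12: 0.12572245299816132,
    13: 0.08858507126569748,
    14: 0.055076781660318375,
    15: 0.003090438898652792,
}

def summarize(kl, returns):
    """Aggregate the logged KL and returns into a few summary statistics."""
    return {
        "mean_kl": mean(kl.values()),
        "negative_kl_steps": [step for step, value in kl.items() if value < 0],
        "mean_returns": mean(returns.values()),
        "best_returns_step": max(returns, key=returns.get),
    }

summary_stats = summarize(kl_by_step, returns_by_step)
for name, value in summary_stats.items():
    print(f"{name}: {value}")
```

With so few steps the numbers are noisy, so a summary like this is only a rough sanity check on whether the policy is drifting away from the reference model.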





#### Saving the model to disk

In [37]:
# Save the model to disk
ppo_model.save_pretrained("/content/drive/MyDrive/TinyLLAMA-ppo")
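
Before reloading the model later, it can be useful to confirm that the save actually wrote something. The sketch below uses only the standard library and lists the contents of an output directory. It assumes that a PEFT-based save produces an `adapter_config.json` file. The helper name `check_saved_adapter` is hypothetical.

```python
from pathlib import Path

def check_saved_adapter(output_dir):
    """List the files in output_dir and report whether an adapter config is present."""
    directory = Path(output_dir)
    # A missing directory is reported as an empty file list instead of raising
    files = sorted(p.name for p in directory.iterdir()) if directory.is_dir() else []
    return {"files": files, "has_adapter_config": "adapter_config.json" in files}
```

In the notebook this could be called as `check_saved_adapter("/content/drive/MyDrive/TinyLLAMA-ppo")` right after saving.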

## 5. Text generation with TinyLLAMA after RLHF

In [38]:
# Example from the test set
print(ds["test"]["dialogue"][10])

#Person1#: What's wrong with you? Why are you scratching so much?
#Person2#: I feel itchy! I can't stand it anymore! I think I may be coming down with something. I feel lightheaded and weak.
#Person1#: Let me have a look. Whoa! Get away from me!
#Person2#: What's wrong?
#Person1#: I think you have chicken pox! You are contagious! Get away! Don't breathe on me!
#Person2#: Maybe it's just a rash or an allergy! We can't be sure until I see a doctor.
#Person1#: Well in the meantime you are a biohazard! I didn't get it when I was a kid and I've heard that you can even die if you get it as an adult!
#Person2#: Are you serious? You always blow things out of proportion. In any case, I think I'll go take an oatmeal bath.


In [39]:
# Make sure the model is on the GPU
ppo_model = ppo_model.to('cuda')

# Make sure the input tensor has the right dtype and a leading batch dimension
input_ids = torch.as_tensor(ds['test']['input_ids'][10], dtype=torch.long).unsqueeze(dim=0).to('cuda')

# Generation arguments
generation_kwargs = {
    "min_length": 5,
    "top_k": 0,        # 0 disables top-k filtering
    "top_p": 1.0,      # 1.0 disables nucleus (top-p) filtering
    "do_sample": True,
    "max_new_tokens": 150,
    "input_ids": input_ids
}

# Generate the prediction
summary = ppo_model.generate(**generation_kwargs)

# Decode the prediction
print(tokenizer.decode(summary.squeeze()))

<s> 
Summarize the following conversation.

#Person1#: What's wrong with you? Why are you scratching so much?
#Person2#: I feel itchy! I can't stand it anymore! I think I may be coming down with something. I feel lightheaded and weak.
#Person1#: Let me have a look. Whoa! Get away from me!
#Person2#: What's wrong?
#Person1#: I think you have chicken pox! You are contagious! Get away! Don't breathe on me!
#Person2#: Maybe it's just a rash or an allergy! We can't be sure until I see a doctor.
#Person1#: Well in the meantime you are a biohazard! I didn't get it when I was a kid and I've heard that you can even die if you get it as an adult!
#Person2#: Are you serious? You always blow things out of proportion. In any case, I think I'll go take an oatmeal bath.

Summary:

Dr. Frank-N-Fred has them at their very best.
He's not really being himself.
By the end of the conversation they seem unimportant, 
which is rather the opposite of the intended message. 
The error lies in the fact that intr
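
The decoded output above contains the full prompt followed by the generated text. As a minimal sketch, the generated part can be isolated by splitting on the `Summary:` marker used in the prompt. The helper `extract_summary` is hypothetical and assumes the marker appears in the prompt before any generated text.

```python
def extract_summary(decoded_text, marker="Summary:"):
    """Return the text that follows the first `marker`, without BOS/EOS tokens."""
    _, found, tail = decoded_text.partition(marker)
    if not found:
        # Marker missing: fall back to the whole text
        tail = decoded_text
    for token in ("<s>", "</s>"):
        tail = tail.replace(token, "")
    return tail.strip()
```

Two alternatives are to pass `skip_special_tokens=True` to `tokenizer.decode`, or to slice the generated tensor at `input_ids.shape[1]` before decoding so that only new tokens are decoded.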

### 5.1 Loading the model from disk

In [40]:
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

model_name = "PY007/TinyLlama-1.1B-Chat-v0.3"
adapters_name = "/content/drive/MyDrive/TinyLLAMA-ppo"

print(f"Loading model '{model_name}' into memory...")

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    #load_in_4bit=True,
    torch_dtype=torch.bfloat16,
    device_map={"": 0}
)

model = PeftModel.from_pretrained(model, adapters_name)
model = model.merge_and_unload()

print(f"Model '{model_name}' loaded successfully")

Loading model 'PY007/TinyLlama-1.1B-Chat-v0.3' into memory...
Model 'PY007/TinyLlama-1.1B-Chat-v0.3' loaded successfully


In [41]:
from transformers import AutoTokenizer

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [42]:
from transformers import pipeline

CHAT_EOS_TOKEN_ID = 32002

# Create a pipeline for tokenization and text generation
tinyllama_pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.float16,
    device_map="auto",
    do_sample=True,
    top_k=50,
    top_p=0.9,
    num_return_sequences=1,
    repetition_penalty=1.1,
    max_new_tokens=200,
    eos_token_id=CHAT_EOS_TOKEN_ID,
)
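
The pipeline above samples with `top_k=50` and `top_p=0.9`, whereas the earlier `generate` call effectively disabled both filters. To illustrate what these two parameters do, here is a toy pure-Python sketch over a small probability list. It is not the `transformers` implementation, and the function name `top_k_top_p_filter` is an assumption made for illustration.

```python
def top_k_top_p_filter(probs, top_k=50, top_p=0.9):
    """Toy top-k then top-p filtering; returns {token_index: renormalized_prob}."""
    # Rank token indices from most to least likely
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    # top-k: keep only the k most likely tokens (0 means no limit)
    if top_k > 0:
        order = order[:top_k]
    # Renormalize over the top-k survivors before applying top-p
    mass = sum(probs[i] for i in order)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i] / mass
        # top-p: stop once the kept tokens cover at least top_p of the mass
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}
```

Sampling then draws the next token only from the surviving indices, which trades some diversity for fewer low-probability continuations.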

In [43]:
prompt = """#Person1#: What's wrong with you? Why are you scratching so much?
#Person2#: I feel itchy! I can't stand it anymore! I think I may be coming down with something. I feel lightheaded and weak.
#Person1#: Let me have a look. Whoa! Get away from me!
#Person2#: What's wrong?
#Person1#: I think you have chicken pox! You are contagious! Get away! Don't breathe on me!
#Person2#: Maybe it's just a rash or an allergy! We can't be sure until I see a doctor.
#Person1#: Well in the meantime you are a biohazard! I didn't get it when I was a kid and I've heard that you can even die if you get it as an adult!
#Person2#: Are you serious? You always blow things out of proportion. In any case, I think I'll go take an oatmeal bath."""

In [44]:
prompt_template = f"""
Summarize the following conversation.

{prompt}

Summary:
"""

print(prompt_template)


Summarize the following conversation.

#Person1#: What's wrong with you? Why are you scratching so much?
#Person2#: I feel itchy! I can't stand it anymore! I think I may be coming down with something. I feel lightheaded and weak.
#Person1#: Let me have a look. Whoa! Get away from me!
#Person2#: What's wrong?
#Person1#: I think you have chicken pox! You are contagious! Get away! Don't breathe on me!
#Person2#: Maybe it's just a rash or an allergy! We can't be sure until I see a doctor.
#Person1#: Well in the meantime you are a biohazard! I didn't get it when I was a kid and I've heard that you can even die if you get it as an adult!
#Person2#: Are you serious? You always blow things out of proportion. In any case, I think I'll go take an oatmeal bath.

Summary:
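
To reuse the same template for other dialogues, for example other entries of the test set, the f-string can be wrapped in a small function. This is a minimal sketch, and the name `build_summary_prompt` is hypothetical. It reproduces the template above character for character, including the leading and trailing newlines.

```python
def build_summary_prompt(dialogue):
    """Wrap a dialogue in the summarization template used in this notebook."""
    return f"""
Summarize the following conversation.

{dialogue}

Summary:
"""
```

Calling `build_summary_prompt(prompt)` should yield the same string as `prompt_template`.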



In [None]:
# Call the pipeline to generate text
output = tinyllama_pipe(prompt_template)
print(output[0]['generated_text'])