## Introduction to the baseline
This notebook shows an example of training a GRU Seq2Seq model on our dialogue data using supervised learning. We also use it as the baseline in our challenge. You may find this baseline helpful as well if you want to train or evaluate a reinforcement learning model.

### Import the necessary packages and modules

In [None]:
import os
import torch

from dialoguefactory.environments import easy
from dialoguefactory.environments import hard

import dialoguefactory.trainers.vocab as vocab
from dialoguefactory.generation import mappers_database as mdb
import dialoguefactory.dialogue_generator as dg

import dialoguefactory.trainers.arch as arch
import dialoguefactory.trainers.baseline as baseline
import dialoguefactory.trainers.evaluation as evl

home_directory = os.path.expanduser('~')

error_path = os.path.join(home_directory, 'dialoguefactory_logs', 'error.log')
context_path = os.path.join(home_directory, 'dialoguefactory_logs', 'context.log')
os.makedirs(os.path.dirname(error_path), exist_ok=True)
os.makedirs(os.path.dirname(context_path), exist_ok=True)


### Initialize the dialogue generator and the environments

We create the training environment, which we call _easy_, and generate the training dialogues in it. The _hard_ environment is an extension of the easy environment, and we use it to see how well the agent generalizes to new environments. Later, we merge the hard environment with the easy one and evaluate the model in the extended environment.


In [None]:
easy_world = easy.build_world()
hard_world = hard.build_world()
database = mdb.create_database_all_mappers()
dia_generator = dg.DialogueGenerator(easy_world, error_path, context_path)

We choose the number of training points and the main agent to be trained. Only one agent is trained, and we selected Gretel as our main agent.


In [None]:
num_points = 175000
train_num_points = int(num_points * 0.85)
val_num_points = num_points - train_num_points
main_player = easy_world.player
other_players = [p for p in easy_world.players if p != main_player]

### Generate and prepare the data for training

The dialogues consist of a user issuing a request to an agent.

We generate the dialogues and prepare the data in the following format: (unseen_context, agent_response). 
We expect the main agent to provide a response in the dialogue based on the seen context. The unseen_context consists of sentences from the current dialogue or previous dialogues that the agent has yet to see (if a new dialogue has started). Because the GRU model carries its hidden state over from one sample to the next, the earlier context is already reflected in that state; that's why we don't use the full context as input.
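
As an illustration of the idea, the sketch below slices the unseen part out of a growing context list. The helper `take_unseen` is hypothetical and is not part of dialoguefactory. It also assumes that the running index (named `last_context_id` here to mirror the notebook's variable) can be treated as the number of sentences already observed; the library may define it differently.

```python
from typing import List, Tuple


def take_unseen(context: List[str], last_context_id: int) -> Tuple[List[str], int]:
    """Return the sentences not yet observed and the updated index.

    Assumption: last_context_id counts the sentences already observed,
    so it doubles as the position of the first unseen sentence.
    """
    unseen = context[last_context_id:]
    return unseen, len(context)


# The context grows as the simulation continues.
context = ["s0", "s1", "s2"]
unseen_a, last_id = take_unseen(context, 0)

context += ["s3", "s4"]
unseen_b, last_id = take_unseen(context, last_id)
```

Only the new sentences are returned on the second call, which is what lets a stateful model skip re-reading the earlier context.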

In the simulation, we allow the generation of dialogues between all agents in the environment. Because there are five agents in the training environment, the main agent participates in 20% of the dialogues and has to respond to a user issuing a request. In the rest, the other agents communicate with each other. This way, the main agent can learn from the actions of the other agents in the simulated world. In the simulation, we do not generate dialogues in which the main agent plays the user role.
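
The 20% figure follows from simple counting if the agent role of each dialogue is drawn uniformly at random from the players. That uniformity is an assumption made here for illustration only, and both functions below are hypothetical helpers, not part of dialoguefactory.

```python
import random


def expected_agent_share(num_players: int) -> float:
    """Share of dialogues in which one fixed player holds the agent role,
    assuming the agent is drawn uniformly from all players."""
    return 1.0 / num_players


def simulated_agent_share(num_players: int, num_dialogues: int, seed: int = 0) -> float:
    """Monte Carlo estimate of the same share under the same assumption."""
    rng = random.Random(seed)
    main_player = 0  # index standing in for the main agent
    hits = sum(rng.randrange(num_players) == main_player for _ in range(num_dialogues))
    return hits / num_dialogues


share = expected_agent_share(5)
estimate = simulated_agent_share(5, 100_000)
```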

Since the simulation is continual, we preserve the continuity of the data when generating the train, val, and test data. We do it by providing the last_context_id, the index of the last context sentence observed by the main agent.  
Inside the generate_data function, we serialize the output sentences of the main agent into the format explained in the function [serialize](https://revivegretel.com/docs/dialoguefactory.trainers.html#dialoguefactory.trainers.serializers.serialize). We represent each sentence's meaning using a [Describer](https://revivegretel.com/docs/dialoguefactory.language.html#dialoguefactory.language.components.Describer), so serialization amounts to converting the Describer's arguments into a list of strings/tokens. During serialization, we do not allow the entities in the world to be represented by their unique names (var_name). For example, when we serialize the entity Gretel, the 'name' property can be used: \['bentity', 'Gretel', 'eentity'\]. It is not allowed, however, to serialize it as \['bentity', 'player', 'eentity'\] or \['bentity', 'main', 'player', 'eentity'\]. The 'var_name' property and the ('main', 'player') attribute are not allowed in the output because we want to test whether the agent can identify the entities using properties and attributes.
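
To make the restriction concrete, here is a minimal sketch of an entity serializer that wraps a property value in the 'bentity'/'eentity' markers and rejects the forbidden forms. It is not the library's `serialize` function: the helpers `serialize_entity` and `uses_forbidden_attribute`, as well as the dictionary layout of the entity, are assumptions made for illustration.

```python
from typing import Dict, List

# Forms that must not appear in the agent's output, per the text above.
FORBIDDEN_PROPERTIES = {"var_name"}
FORBIDDEN_ATTRIBUTES = {("main", "player")}


def serialize_entity(properties: Dict[str, str], key: str) -> List[str]:
    """Wrap the value of one property in begin/end entity markers."""
    if key in FORBIDDEN_PROPERTIES:
        raise ValueError(f"property {key!r} may not be used to identify an entity")
    return ["bentity", *properties[key].split(), "eentity"]


def uses_forbidden_attribute(tokens: List[str]) -> bool:
    """Check whether the tokens between the markers form a forbidden attribute."""
    return tuple(tokens[1:-1]) in FORBIDDEN_ATTRIBUTES


# Hypothetical property dictionary for the entity Gretel.
gretel = {"name": "Gretel", "var_name": "player"}
tokens = serialize_entity(gretel, "name")
```

A check of this kind can be run on a model's output before deserialization to catch answers that identify an entity by its unique name.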

During testing, we deserialize the list of tokens into an instance of the class Sentence using the function [deserialize](https://revivegretel.com/docs/dialoguefactory.trainers.html#dialoguefactory.trainers.serializers.deserialize). Feel free to use any serializing/deserializing method as long as the output sentence contains the correct Describer and does not use the disallowed 'var_name' property or ('main', 'player') attribute.

Please note that we only use the hard environment in the following cell to fetch the words needed for the input/output vocabulary.
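
The reason for including the hard environment's words up front can be shown with a toy vocabulary. `TinyVocabulary` below is a hypothetical stand-in, not `vocab.Vocabulary`; in particular, wrapping each sentence in 'bos'/'eos' inside `to_indices` is an assumption about how such a class might behave.

```python
from typing import Iterable, List


class TinyVocabulary:
    """Toy word-to-index mapping with begin/end-of-sequence markers."""

    def __init__(self, words: Iterable[str], bos: str, eos: str):
        # dict.fromkeys drops duplicates while keeping first-seen order.
        self.words = list(dict.fromkeys(words))
        self.index = {word: i for i, word in enumerate(self.words)}
        self.bos, self.eos = bos, eos

    def __len__(self) -> int:
        return len(self.words)

    def to_indices(self, sentences: List[List[str]]) -> List[List[int]]:
        # Assumption: every sentence is wrapped in bos ... eos.
        return [[self.index[t] for t in [self.bos, *s, self.eos]] for s in sentences]


easy_words = ["go", "kitchen"]
hard_words = ["go", "cellar"]

# Vocabulary built from both environments, as in the cell that follows.
full_voc = TinyVocabulary(easy_words + hard_words + ["bos", "eos"], "bos", "eos")
indices = full_voc.to_indices([["go", "cellar"]])

# Vocabulary built from the easy environment only.
easy_voc = TinyVocabulary(easy_words + ["bos", "eos"], "bos", "eos")
```

With the easy-only vocabulary, a word that first appears in the hard environment has no index, and the size of the model's embedding and output layers would no longer match at evaluation time.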

In [None]:
train_data_x, train_data_y, last_context_id, num_train_dias = baseline.generate_data(dia_generator, 
                                                                                      train_num_points, 
                                                                                      main_player, 
                                                                                      other_players, 
                                                                                      0, 
                                                                                      100, 
                                                                                      True)
val_data_x, val_data_y, last_context_id, num_val_dias = baseline.generate_data(dia_generator, 
                                                                                val_num_points, 
                                                                                main_player,
                                                                                other_players, 
                                                                                last_context_id, 
                                                                                100, 
                                                                                True)

data_x = train_data_x+val_data_x
data_y = train_data_y+val_data_y
orig_data_y = data_y

input_voc = vocab.Vocabulary(vocab.compute_input_vocab(easy_world, hard_world)+['bos','eos'],'bos','eos')
output_voc = vocab.Vocabulary(vocab.compute_output_vocab(easy_world, hard_world)+['bos', 'eos'], 'bos', 'eos')


data_x = input_voc.to_indices(data_x)
data_y = output_voc.to_indices(data_y)

### Split the preprocessed data and configure the PyTorch model

We use batch_size = 1 since the Seq2Seq model is continuous: each sample depends on the state left by the previous one, so the model cannot process multiple input samples in parallel.

In [None]:
(train_data_x, train_data_y),  (val_data_x, val_data_y) = baseline.dataset_split(data_x, data_y, len(train_data_y))

max_len = max(map(len, data_y))
model = arch.Seq2SeqContModel(input_size = len(input_voc),
                     embed_size = 32,
                     encoder_hidden_size = 128,
                     decoder_hidden_size = 128,
                     output_size = len(output_voc),
                     num_layers = 1)

opt = torch.optim.Adam(model.parameters())
batch_size = 1

### Training the model

We train the model using backpropagation through time (BPTT) and save the encoder's state for the next iteration. 
We do not shuffle the data since the model is continuous.
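
The pattern of carrying the state forward while cutting the gradient at each step can be sketched with a toy GRU. This is not `arch.Seq2SeqContModel` or `arch.compute_loss`, whose internals are not shown in this notebook; the model, the random data, and the loss are placeholders chosen only to make the example self-contained.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Toy stateful model: a GRU followed by a linear read-out.
gru = nn.GRU(input_size=4, hidden_size=8, batch_first=True)
head = nn.Linear(8, 1)
opt = torch.optim.Adam(list(gru.parameters()) + list(head.parameters()))

# Ten consecutive chunks of one long sequence, shape (batch=1, time=5, features=4).
chunks = [torch.randn(1, 5, 4) for _ in range(10)]
targets = [torch.randn(1, 5, 1) for _ in range(10)]

hidden = None  # nn.GRU treats None as an all-zero initial state
losses = []
for x, y in zip(chunks, targets):  # fixed order, no shuffling
    out, hidden = gru(x, hidden)
    loss = nn.functional.mse_loss(head(out), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
    # Keep the state's values but drop its graph, so the next backward pass
    # stops at this chunk boundary instead of reaching into earlier chunks.
    hidden = hidden.detach()
    losses.append(loss.item())
```

Without the `detach()` call, the second `backward()` would try to traverse the graph of the first chunk, which has already been freed, and PyTorch would raise an error.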

In [None]:
from tqdm.notebook import tqdm

num_epochs = 1
for e in range(num_epochs):
    enc_hid_state = None
    progress_bar = tqdm(total=int(train_num_points/batch_size))
    for bx, by in arch.generate_batch(train_data_x,train_data_y,batch_size, shuffle=False):
        train_loss, enc_hid_state = arch.compute_loss(model, bx, by, input_voc, output_voc, max_len, enc_hid_state)
        train_loss.backward()
        opt.step()
        opt.zero_grad()
        enc_hid_state = enc_hid_state.detach()
        progress_bar.update(1)

    for bx, by in arch.generate_batch(val_data_x, val_data_y, batch_size, shuffle=False):
        val_loss, enc_hid_state = arch.compute_loss(model, bx, by, input_voc, output_voc, max_len, enc_hid_state)
        enc_hid_state = enc_hid_state.detach()

    # Losses of the last training sample and the last validation sample of this epoch
    print(train_loss.item(), val_loss.item())


### Evaluating the model

We evaluate the model in the easy environment and print all the metrics for the leaderboard. The evaluation in both the training and testing environments must be done on at least 200000 dialogues in which the main agent participates.

In [None]:
from dialoguefactory.trainers.evaluation import pretty_print_eval
new_policy = baseline.AgentPolicy(main_player, database, model, enc_hid_state, last_context_id, input_voc, output_voc, max_len)


In [None]:
dias, individual_accuracies, total_accuracy, num_agent_dias = evl.generate_and_eval(dia_generator, 1, 200000, new_policy, 100, notebook_run=True)

pretty_print_eval("baseline easy", individual_accuracies, total_accuracy, num_agent_dias, num_train_dias)

We evaluate the model in the extended environment (easy+hard). To do the evaluation, we first merge the easy environment with the hard one to preserve continuity. We unlock the locked doors in the easy environment that lead to the hard environment, and we inject into the context the information that the doors are no longer locked so the agent can observe it.

When evaluating, we chose to have our main player play the role of an agent in 20% of the dialogues, similar to the training environment. We did not need to set this parameter to 0.2 in the training environment because, with five players there, the main player already acts as the agent in 20% of the dialogues.

In [None]:
hard.merge_worlds(dia_generator, hard_world)

In [None]:
dias, individual_accuracies, total_accuracy, num_agent_dias = evl.generate_and_eval(dia_generator, 1, 200000, new_policy, 100, 0.2, notebook_run=True)

pretty_print_eval("baseline hard", individual_accuracies, total_accuracy, num_agent_dias, num_train_dias)