# T5 with value head
> A T5 model with a value head built on the `transformers` library by Hugging Face.

## Why a value head?
Optimisation through PPO requires an estimate of the current state's value. The value can be estimated by adding a second head to the T5 model which outputs a scalar for each output token.

## Detach head
I experimented with detaching the head from the body when optimizing the model. This means that only the head is trained on the value loss and its gradients are not passed through the body. Although I did not use it in the end, it is still possible to detach the head by calling `model.detach_head()`.
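
The implementation of `model.detach_head()` is not shown in this notebook. As a rough illustration of the idea, the sketch below assumes that detaching amounts to calling `.detach()` on the tensor fed into the head. The `body` and `v_head` modules are small hypothetical stand-ins, not the T5 model.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical stand-ins for the transformer body and the scalar value head.
body = nn.Linear(4, 8)
v_head = nn.Linear(8, 1)

x = torch.randn(2, 3, 4)  # (batch, sequence, features)
hidden = body(x)

# Detached head: the value loss cannot reach the body parameters.
value_detached = v_head(hidden.detach()).squeeze(-1)
value_detached.sum().backward()
body_grad_when_detached = body.weight.grad  # no gradient was computed
head_grad_when_detached = v_head.weight.grad.clone()

# Attached head: gradients flow through the body as well.
v_head.zero_grad()
value_attached = v_head(hidden).squeeze(-1)
value_attached.sum().backward()
body_grad_when_attached = body.weight.grad
```

In the detached case only `v_head` receives gradients, which matches the description of training the head without updating the body.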

In [None]:
# default_exp t5

In [None]:
# export
from transformers import T5Model, T5PreTrainedModel, AutoTokenizer, T5ForConditionalGeneration
from transformers import top_k_top_p_filtering
from torch import nn
from torch.nn import Identity
import torch.nn.functional as F
import torch
from trl.core import RLMixin
from collections import OrderedDict

In [None]:
# exports

class T5ValueHead(nn.Module):
    """The ValueHead class implements a head for T5 that returns a scalar for each output token."""
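    def __init__(self, config):
        super().__init__()
        # The head is applied to the LM logits (last dimension `vocab_size`),
        # so the linear layer maps vocab_size -> one scalar per token.
        self.state_representation = nn.Linear(config.vocab_size, 1)

    def forward(self, hidden_states):
        # hidden_states: (batch, seq_len, vocab_size) -> output: (batch, seq_len, 1)
        output = self.state_representation(hidden_states)
        return output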

In [None]:
# exports

class T5HeadWithValueModel(T5ForConditionalGeneration, RLMixin):
    """The T5HeadWithValueModel class implements a T5 language model with a secondary, scalar head."""
    def __init__(self, config):
        super().__init__(config)
        self.v_head = T5ValueHead(config)
        
    def forward(
        self,
        input_ids=None,
        attention_mask=None,
        decoder_input_ids=None,
        decoder_attention_mask=None,
        head_mask=None,
        decoder_head_mask=None,
        encoder_outputs=None,
        past_key_values=None,
        inputs_embeds=None,
        decoder_inputs_embeds=None,
        labels=None,
        use_cache=None,
        output_attentions=None,
        output_hidden_states=None,
        return_dict=None
    ):
       
        transformer_output = super().forward(
            input_ids=input_ids,
            attention_mask=attention_mask,
            decoder_input_ids=decoder_input_ids,
            decoder_attention_mask=decoder_attention_mask,
            head_mask=head_mask,
            decoder_head_mask=decoder_head_mask,
            encoder_outputs=encoder_outputs,
            past_key_values=past_key_values,
            inputs_embeds=inputs_embeds,
            decoder_inputs_embeds=decoder_inputs_embeds,
            labels=labels,
            use_cache=use_cache,
            output_attentions=output_attentions,
            output_hidden_states=True,
            return_dict=True
        )
        

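        # Access the logits by name: with return_dict=True, indexing with [0]
        # would return the loss instead whenever `labels` is passed.
        # Note: `value` is computed here but is not returned to the caller.
        value = self.v_head(transformer_output.logits).squeeze(-1)
        
        return transformer_output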

## Load a pre-trained language model
Loading a pre-trained language model works just like loading a model from the `transformers` library.

In [None]:
model = T5HeadWithValueModel.from_pretrained('ramsrigouthamg/t5_paraphraser')
tokenizer = AutoTokenizer.from_pretrained('ramsrigouthamg/t5_paraphraser')

Some weights of T5HeadWithValueModel were not initialized from the model checkpoint at ramsrigouthamg/t5_paraphraser and are newly initialized: ['v_head.state_representation.weight', 'v_head.state_representation.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Forward pass

In [None]:
sentence = "Which course should I take to get started in data science?"
text =  "paraphrase: " + sentence + " </s>"
max_len = 256

encoding = tokenizer.encode_plus(text, padding=True, return_tensors="pt")


# sample with top_k = 100 and top_p = 0.95 and return num_return_sequences = 3 sequences
beam_outputs = model.generate(
    input_ids=encoding["input_ids"], #attention_mask=encoding["attention_mask"],
    do_sample=True,
    max_length=256,
    top_k=100,
    top_p=0.95,
    early_stopping=True,
    num_return_sequences=3
)

In [None]:
print("Original Question ::")
print(sentence)
print("\n")
print("Paraphrased Questions :: ")
final_outputs = []
for beam_output in beam_outputs:
    sent = tokenizer.decode(beam_output, skip_special_tokens=True, clean_up_tokenization_spaces=True)
    if sent.lower() != sentence.lower() and sent not in final_outputs:
        final_outputs.append(sent)

for i, final_output in enumerate(final_outputs):
    print("{}: {}".format(i, final_output))


Original Question ::
Which course should I take to get started in data science?


Paraphrased Questions :: 
0: What are the top 5 courses on Data Science?
1: How can I learn to use data science?
2: Which courses should I take for Data Science?


We can also have the model respond to a batch of queries with the custom `respond_to_batch` function (here the batch contains a single query) and then decode the responses.

**Note:** Batching several queries only works if all of them have the same number of tokens. If that is not the case, one must pad the tensors before stacking them with `torch.cat(queries)`.

In [None]:
r = model.respond_to_batch(encoding["input_ids"],
                           max_length=256)

In [None]:
outputs = []
for out in r:
    sent = tokenizer.decode(out, skip_special_tokens=True, clean_up_tokenization_spaces=True)
    if sent.lower() != sentence.lower() and sent not in outputs:
        outputs.append(sent)

for i, final_output in enumerate(outputs):
    print("{}: {}".format(i, final_output))

0: Which is the best undergraduate course for data science?



## Why the custom response function?
The models in the `transformers` library come with a very useful and optimised generation function, `model.generate()`. In the beginning this function was indeed used to generate text, but after lengthy debugging it turned out that PPO was exploiting some of its features. These features are generally useful for text generation, but they allowed the model to gain extra reward without sampling properly.

### The model reward
To understand how the model was able to exploit the generation function it is worth looking at the reward function for language modeling with PPO. The reward consists of an arbitrary score (any scalar to indicate whether the model output was good or bad) and the KL-divergence from the untrained model:

$$reward = score - \beta \times KL$$

where $\beta$ is some positive factor. The KL-divergence is calculated with:

$$ KL = \mathbb{E}_{x \sim p_{model}} [\log p_{model}(x) - \log p_{refmodel}(x)]$$

Since $x$ is sampled from $p_{model}$, the KL-divergence is always non-negative. However, if the model found a way to make the estimated KL-divergence negative, the KL term would add to the reward instead of penalising it. This is what happened twice in the experiments, and both times a quirk of the text generation was abused to avoid proper sampling from the probability distribution.
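
To make the sign argument concrete, here is a small sketch with two made-up discrete distributions. The probabilities and the value `beta=0.2` are assumptions chosen for illustration only. The exact expectation under $p_{model}$ is non-negative, but the log-ratio of a single token that was not drawn from $p_{model}$ can be negative, which then raises the reward.

```python
import math

def kl_divergence(p_model, p_ref):
    """Exact KL(p_model || p_ref) for two discrete distributions."""
    return sum(
        p * (math.log(p) - math.log(q))
        for p, q in zip(p_model, p_ref)
        if p > 0
    )

def reward(score, kl, beta=0.2):
    """reward = score - beta * KL (beta is an assumed value)."""
    return score - beta * kl

# Made-up next-token distributions over a 3-token vocabulary.
p_model = [0.7, 0.2, 0.1]
p_ref = [0.5, 0.2, 0.3]

# Proper expectation over x ~ p_model: never negative.
kl = kl_divergence(p_model, p_ref)

# Log-ratio of a token that the model rates as less likely than the
# reference does. If such tokens are forced into the sequence instead of
# being sampled, the "KL" estimate becomes negative.
log_ratio_token_2 = math.log(p_model[2]) - math.log(p_ref[2])

reward_proper = reward(1.0, kl)
reward_exploited = reward(1.0, log_ratio_token_2)
```

With the same score, `reward_exploited` is larger than `reward_proper` because the negative log-ratio turns the penalty into a bonus.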

### Case 1: `min_length=None`
When no `min_length` is specified in the `model.generate()` function, the model probability distribution is sampled normally until the first `<eos>` token appears. Then the rest of the sequence is padded with a padding token until `max_length` is reached (for GPT2 this is also the `<eos>` token). If that sequence is passed through the model again to evaluate the log-probabilities, everything is normal up to the first `<eos>` token. After it, the padded `<eos>` tokens have very low probability, since multiple `<eos>` tokens in a row are very unlikely. The model exploited this by decreasing the probability of the `<eos>` token after its first appearance even further, below the probability assigned by the reference model, thus achieving negative KL-divergence. Additionally, it inserted the first `<eos>` earlier and earlier in the sentences to minimize the KL-divergence and thus maximise the reward. This only worked because the sequence after the first `<eos>` token wasn't properly sampled but padded; otherwise the low probabilities would have led to other tokens with higher probability being sampled.
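
The effect can be reproduced with a toy calculation. The sketch below uses invented per-token probabilities and a hypothetical token id `EOS = 0`; it is not taken from the actual experiments. It sums the per-token log-ratios over a padded sequence, once including the padded positions and once ignoring everything after the first `<eos>`.

```python
import math

EOS = 0  # hypothetical id used for both <eos> and padding

def kl_estimate(tokens, p_model, p_ref, mask_after_eos=False):
    """Sum of per-token log-ratios log p_model - log p_ref.

    With mask_after_eos=True, positions after the first EOS are ignored.
    """
    total = 0.0
    seen_eos = False
    for tok, pm, pr in zip(tokens, p_model, p_ref):
        if mask_after_eos and seen_eos:
            break
        total += math.log(pm) - math.log(pr)
        if tok == EOS:
            seen_eos = True
    return total

# Two sampled tokens, the first <eos>, then three padded <eos> tokens.
tokens = [5, 7, EOS, EOS, EOS, EOS]
# Invented probabilities of each token under the two models. The model
# has pushed the probability of a repeated <eos> far below the reference.
p_model = [0.30, 0.40, 0.20, 1e-6, 1e-6, 1e-6]
p_ref = [0.25, 0.35, 0.15, 1e-3, 1e-3, 1e-3]

kl_padded = kl_estimate(tokens, p_model, p_ref)
kl_masked = kl_estimate(tokens, p_model, p_ref, mask_after_eos=True)
```

Including the padded positions drives the estimate strongly negative, while the masked estimate over the properly sampled tokens stays positive.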


### Case 2: `min_length=max_length`
I thought this could be easily fixed: just set `min_length=max_length`. This seemed to work fine for a few experiments until the training failed again due to negative KL-divergences. Finding the problem was harder than before, since it only happened rarely and after several training steps. In addition, the generated sentences deteriorated quickly to complete gibberish. After some investigation it turned out that the model was again exploiting the sampling function. Up to this point I was not aware that, with `min_length` set, the model is not allowed to produce an `<eos>` token before `min_length` is reached. In practice this is achieved by setting the next-token logit of `<eos>` to -infinity:

```
next_token_logits[:, eos_token_id] = -float("inf")
```

This makes sure that after the softmax function the probability of the `<eos>` token is zero, no matter the model output. The model exploited this by maximizing the logit output for that token, which pushes all other logits to increasingly small values relative to it. Since I did not apply the same step when evaluating the generated sequence (the softmax was calculated without the -inf trick), the probabilities of the generated sequences were extremely small and in fact smaller than the probabilities under the reference model. This again led to negative KL-divergence.
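
The mismatch between sampling and evaluation can be shown with a plain softmax. The logits below are invented and `EOS = 0` is a hypothetical token id. The sketch compares the distribution used for sampling (with the `<eos>` logit masked) against the distribution used for evaluation (without the mask).

```python
import math

def softmax(logits):
    """Numerically stable softmax; -inf logits get probability 0."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

EOS = 0  # hypothetical <eos> token id

# Invented logits: the model has learned a huge <eos> logit.
model_logits = [12.0, 1.0, 0.5, 0.0]
ref_logits = [0.0, 1.0, 0.5, 0.0]

# Sampling: <eos> is forbidden before min_length is reached.
masked_logits = list(model_logits)
masked_logits[EOS] = -float("inf")
p_sampling = softmax(masked_logits)

# Evaluation: softmax without the -inf trick.
p_eval = softmax(model_logits)
p_ref = softmax(ref_logits)

token = 1  # a token that can be drawn from p_sampling
log_ratio = math.log(p_eval[token]) - math.log(p_ref[token])
```

The token is fairly likely under `p_sampling` but has a tiny probability under `p_eval`, so its log-ratio against the reference is negative even though it was a legitimate sample.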

### Conclusion
In both cases the condition $x \sim p_{model}$ in the KL-divergence equation was not satisfied, but this was hidden inside the sequence generation function. Reinforcement learning is very effective at finding and exploiting environment quirks, as others have pointed out for other environments such as Atari games. The solution was to go back to a simpler sequence sampler to avoid these exploits. Alternatively, I could have applied the same tricks and some masking to the model outputs when evaluating the sequences, but I didn't feel confident enough that there would not be other sequence generation tricks the model could abuse.
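
As an illustration of what a simpler sampler means here, the sketch below draws every token from the full softmax for a fixed number of steps, with no padding, no `<eos>` handling and no logit masking. It is not the `respond_to_batch` implementation, which is not shown in this notebook. The `toy_logits` function is a hypothetical stand-in for a model.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def sample_sequence(next_token_logits, length, rng):
    """Sample exactly `length` tokens, each drawn from the full softmax."""
    tokens = []
    for _ in range(length):
        probs = softmax(next_token_logits(tokens))
        tok = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        tokens.append(tok)
    return tokens

def toy_logits(prefix):
    """Hypothetical model with a 4-token vocabulary.

    It slightly prefers repeating the previous token.
    """
    logits = [0.0, 0.5, 1.0, 0.2]
    if prefix:
        logits[prefix[-1]] += 0.5
    return logits

rng = random.Random(0)
sequence = sample_sequence(toy_logits, length=8, rng=rng)
```

Because every token comes from the same distribution that is later used to evaluate the log-probabilities, $x \sim p_{model}$ holds by construction.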