# Part 3: Sequence-to-sequence models

© Anatolii Stehnii, 2018

## Lecture 3: Reinforcement learning for natural language processing

In [1]:
from IPython.display import HTML

def css_styling():
    # Load the shared notebook stylesheet and render it as HTML.
    with open("../custom.css", "r") as f:
        styles = f.read()
    return HTML(styles)

css_styling()

---
*Returning to the previous lectures: one of the most important problems in NLP is that text is represented in a **discrete space**. You cannot add **noise** to text (or you can – [Adversarial Texts with Gradient Methods, Gong et al., 2018](https://arxiv.org/abs/1801.07175)) and you cannot generate text with **Generative Adversarial Networks** (or you can – [MaskGAN: Better Text Generation via Filling in the______, Fedus et al., 2018](https://arxiv.org/abs/1801.07736)), because the text generation function is not differentiable.*

---

### Reinforcement learning for optimization of non-differentiable loss

Many of the NLP problems defined earlier have dedicated, hand-tailored evaluation metrics, such as **BLEU** for neural machine translation or **ROUGE** for text summarization. However, a neural network is usually trained to maximize the **likelihood** function, because neither **BLEU** nor **ROUGE** is differentiable, so we cannot backpropagate an error signal through them.

From its very beginning, reinforcement learning has been a framework for optimizing systems whose target function is **unknown or non-differentiable**. We can treat the translation quality metric as a delayed reward for an actor that, at each step, selects an action *(the next word)* given a state *(the previous words and the source sentence)* in order to maximize this reward.
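
The actor and reward framing above can be written down compactly. The notation here is introduced only for illustration and does not come from the lecture: $p_\theta$ is the model, $x$ the source sentence, $y = (y_1, \dots, y_T)$ a sampled output and $r(y)$ a sequence-level reward such as BLEU. Under these assumptions, the objective and its score-function gradient are:

$$
J(\theta) = \mathbb{E}_{y \sim p_\theta(\cdot \mid x)}\big[r(y)\big],
\qquad
\nabla_\theta J(\theta) = \mathbb{E}_{y \sim p_\theta(\cdot \mid x)}\Big[r(y) \sum_{t=1}^{T} \nabla_\theta \log p_\theta(y_t \mid y_{<t}, x)\Big]
$$

Only $\log p_\theta$ is differentiated; the reward $r(y)$ enters as a plain number, which is why it does not have to be differentiable.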

**REINFORCE** – [Simple statistical gradient-following algorithms for connectionist reinforcement learning (Ronald Williams, 1992)](http://www-anw.cs.umass.edu/~barto/courses/cs687/williams92simple.pdf), [Sequence Level Training with Recurrent Neural Networks (Ranzato et al., 2015)](https://arxiv.org/abs/1511.06732)
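
To make the idea concrete, here is a minimal REINFORCE sketch in NumPy. It is a toy under loud assumptions and not the setup of either paper: the "policy" is just a table of independent logits per position (no recurrent network), the reward is a hypothetical token-overlap score standing in for BLEU, and a moving-average baseline is added to reduce variance. The names `reward_fn` and `reinforce_train` are invented for this illustration.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def reward_fn(sample, target):
    # Non-differentiable sequence-level reward: fraction of matching tokens.
    # (A stand-in for BLEU; an assumption made for this toy example.)
    return float(np.mean(sample == target))

def reinforce_train(target, vocab_size, steps=3000, lr=0.5, seed=0):
    rng = np.random.default_rng(seed)
    T = len(target)
    logits = np.zeros((T, vocab_size))  # toy policy: one softmax per position
    baseline = 0.0
    for _ in range(steps):
        probs = softmax(logits)
        # Sample one token per position from the current policy.
        sample = np.array([rng.choice(vocab_size, p=probs[t]) for t in range(T)])
        r = reward_fn(sample, target)
        # Gradient of sum_t log p(y_t) w.r.t. the logits: onehot(y_t) - probs.
        grad_logp = -probs
        grad_logp[np.arange(T), sample] += 1.0
        # Gradient ascent on the expected reward; the reward is only a scalar weight.
        logits += lr * (r - baseline) * grad_logp
        baseline = 0.9 * baseline + 0.1 * r
    return logits

target = np.array([2, 0, 3, 1])
logits = reinforce_train(target, vocab_size=4)
greedy = logits.argmax(axis=-1)  # most likely token at each position
```

Note that `reward_fn` is never differentiated: it is evaluated on a sampled sequence and only scales the gradient of the log-probability.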

**Self-critical sequence training (SCST)** – [A Deep Reinforced Model for Abstractive Summarization (Paulus et al., 2017)](https://arxiv.org/abs/1705.04304), [Self-critical Sequence Training for Image Captioning (Rennie et al., 2017)](https://arxiv.org/abs/1612.00563)
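
The core of self-critical training is that the baseline is the reward of the model's own greedy output, so a sampled sequence is reinforced only if it beats what the model would produce at test time. The sketch below illustrates this with the same kind of toy tabular policy; `token_overlap` and `scst_step` are hypothetical names, and the overlap reward is an assumption standing in for ROUGE or CIDEr.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def token_overlap(candidate, reference):
    # Toy non-differentiable reward (assumption; stands in for ROUGE/CIDEr).
    return float(np.mean(candidate == reference))

def scst_step(logits, reference, rng, lr=0.5):
    probs = softmax(logits)
    T, V = probs.shape
    sampled = np.array([rng.choice(V, p=probs[t]) for t in range(T)])
    greedy = probs.argmax(axis=-1)  # test-time (greedy) output of the same model
    # Self-critical advantage: sampled reward minus the greedy baseline reward.
    advantage = token_overlap(sampled, reference) - token_overlap(greedy, reference)
    # Gradient of the sampled sequence's log-probability w.r.t. the logits.
    grad_logp = -probs
    grad_logp[np.arange(T), sampled] += 1.0
    return logits + lr * advantage * grad_logp, advantage

rng = np.random.default_rng(0)
reference = np.array([2, 0, 3, 1])
logits = np.zeros((4, 4))
for _ in range(3000):
    logits, _ = scst_step(logits, reference, rng)
```

When the sampled sequence equals the greedy one, the advantage is zero and the parameters are left unchanged, so no separately learned baseline is needed.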

### Reinforcement learning for text generation with GANs

![GAN idea](https://pbs.twimg.com/media/CwSKfkBWEAAXd4d.jpg:large)
*Via [Chris Olah](https://twitter.com/ch402/status/793911806494261248/photo/1) on Twitter*

In a typical GAN, the discriminator is shown real and generated content and learns to distinguish them. The gradient of the generator's objective, which is opposite to the discriminator's error, is backpropagated through the generated content to the generator's parameters, so the generator learns to produce more realistic content.

But the step that turns the model's output distribution into text (hardmax, random sampling, beam search) is not differentiable. Therefore, the usual adversarial approach cannot be applied directly to text generation tasks (for example, NMT).
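
A small numerical check illustrates why this step blocks gradients. The sketch below (an illustration with made-up logits, not part of any cited method) compares a finite-difference derivative of the hardmax token choice with that of a softmax probability: the selected token index does not change under a small perturbation, so its derivative is zero, while the softmax probability changes smoothly.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def finite_difference(f, x, i, eps=1e-4):
    # Central finite-difference estimate of d f / d x[i].
    step = np.zeros_like(x)
    step[i] = eps
    return (f(x + step) - f(x - step)) / (2 * eps)

logits = np.array([2.0, 1.0, 0.5])  # arbitrary example values

def hard_choice(x):
    # Index of the selected token (hardmax): piecewise constant in x.
    return float(np.argmax(x))

def soft_prob(x):
    # Probability of token 0 under softmax: smooth in x.
    return float(softmax(x)[0])

d_hard = finite_difference(hard_choice, logits, 0)  # no gradient signal
d_soft = finite_difference(soft_prob, logits, 0)    # informative gradient
```

Because `d_hard` carries no information about how to change the logits, a discriminator's error cannot flow back through the selected tokens to the generator.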

As mentioned above, RL can help with a non-differentiable loss function. [Fedus, Goodfellow, and Dai](https://arxiv.org/abs/1801.07736) proposed **MaskGAN**: an actor-critic setup from reinforcement learning, where the discriminator network is the critic and the generator network is the actor. However, they generated short sequences and did not test this setup on longer texts. In [Text Generation Based on Generative Adversarial Nets with Latent Variable (Wang et al., 2017)](https://arxiv.org/abs/1712.00170), policy gradient is used to train VGAN, a model that combines a variational autoencoder with a GAN.

Another GAN approach, aimed at longer texts, in which the discriminator leaks its intermediate features to guide the generator: [Long Text Generation via Adversarial Training with Leaked Information (Guo et al., 2017)](https://arxiv.org/abs/1709.08624)

### Reinforcement learning for code generation

The previously mentioned **text-to-code** problem is usually solved with a neural model that is trained with a cross-entropy loss and evaluated with metrics inherited from NMT, such as **BLEU**. Such training and evaluation give lower credit to a result that is **syntactically and algorithmically correct** but has a different form (a different order of operations, other variable names, etc.).

However, code execution results can serve as another source of supervision and evaluation: **unit test success** or **database query relevance**. Such supervision is not differentiable, but it can be incorporated as a reward and used for REINFORCE or other policy gradient training.
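
As a rough sketch of unit test success used as a reward, the function below runs a generated Python snippet against a few test cases and returns the fraction that pass. Everything here is hypothetical and for illustration only: `execution_reward`, the `add` task and the test cases are invented, and a real system would have to run untrusted generated code in a sandbox rather than with a bare `exec`.

```python
def execution_reward(source, func_name, test_cases):
    """Reward = fraction of test cases passed; 0.0 if the code does not even load."""
    namespace = {}
    try:
        # WARNING: exec on generated code is unsafe outside a sandbox (toy example).
        exec(source, namespace)
        func = namespace[func_name]
    except Exception:
        return 0.0  # syntax error, runtime error on load, or missing function
    passed = 0
    for args, expected in test_cases:
        try:
            if func(*args) == expected:
                passed += 1
        except Exception:
            pass  # a crashing test case simply earns no credit
    return passed / len(test_cases)

cases = [((1, 2), 3), ((-1, 1), 0)]

# Two correct programs with different surface forms, and one broken program.
candidate_a = "def add(a, b):\n    return a + b\n"
candidate_b = "def add(x, y):\n    total = y\n    total += x\n    return total\n"
candidate_broken = "def add(a, b) return a + b\n"

reward_a = execution_reward(candidate_a, "add", cases)
reward_b = execution_reward(candidate_b, "add", cases)
reward_broken = execution_reward(candidate_broken, "add", cases)
```

Both correct candidates receive the same reward even though they share few tokens, which is exactly the case where an n-gram metric would penalize the second one.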

In [Seq2SQL: Generating Structured Queries from Natural Language using Reinforcement Learning (Zhong et al., 2017)](https://arxiv.org/abs/1709.00103), this approach is used to generate queries for the WikiSQL dataset from natural language descriptions: in addition to the cross-entropy objective, a reward signal from executing the generated query is used to learn a policy.
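
An execution-based reward of this kind might look like the sketch below, which uses an in-memory SQLite database. The table, the queries and the function name `sql_execution_reward` are made up for illustration. The three reward levels (invalid query, valid query with a wrong result, correct result) mirror the scheme described in the Seq2SQL paper, but treat the exact values here as an assumption rather than a reproduction of the authors' code.

```python
import sqlite3
from collections import Counter

def sql_execution_reward(conn, generated_sql, gold_sql):
    # Execute the reference query to obtain the expected rows.
    gold_rows = conn.execute(gold_sql).fetchall()
    try:
        rows = conn.execute(generated_sql).fetchall()
    except sqlite3.Error:
        return -2  # the generated query is not valid SQL for this schema
    # Compare results as multisets, ignoring row order.
    return 1 if Counter(rows) == Counter(gold_rows) else -1

# Hypothetical toy table standing in for a WikiSQL table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE players (name TEXT, team TEXT, points INTEGER)")
conn.executemany(
    "INSERT INTO players VALUES (?, ?, ?)",
    [("Ann", "Red", 30), ("Bob", "Blue", 12), ("Cid", "Red", 25)],
)

gold = "SELECT name FROM players WHERE team = 'Red'"
r_same_result = sql_execution_reward(conn, "SELECT name FROM players WHERE points > 20", gold)
r_wrong_result = sql_execution_reward(conn, "SELECT name FROM players WHERE team = 'Blue'", gold)
r_invalid = sql_execution_reward(conn, "SELEC name FROM players", gold)
```

A query that differs textually from the reference but returns the same rows still earns the positive reward, which a token-level cross-entropy objective alone would not grant.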