# RL Exercise 5 - Summarizing Text Using RL

**GOAL:** This exercise shows how to frame text summarization as a reinforcement learning problem and how to train a policy to perform it.

Here, text summarization means producing a shorter version of a piece of writing.

**NOTE:** In this example, we will use an extremely **simplistic** policy for *extractive summarization*. Extractive summarization keeps some subset of the original text and discards the rest. In contrast, *abstractive summarization* synthesizes brand-new text, and is typically needed to produce high-quality summaries.
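
To make the extractive idea concrete, here is a minimal sketch in which a summary is simply the words selected by a keep/drop mask. The function name `extract` and the example sentence are illustrative and are not part of the exercise code.

```python
def extract(words, keep_mask):
    """Return the words whose mask entry is truthy, in original order."""
    if len(words) != len(keep_mask):
        raise ValueError("words and keep_mask must have the same length")
    return [word for word, keep in zip(words, keep_mask) if keep]


words = "the quick brown fox jumps over the lazy dog".split()
mask = [0, 0, 0, 1, 1, 0, 0, 0, 1]  # one keep (1) / drop (0) decision per word
print(" ".join(extract(words, mask)))  # fox jumps dog
```

An extractive policy only has to produce the mask; it never generates a word that is absent from the original text.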

## Problem Setup

We set up the problem as follows.

1. The **state** of the environment consists of pre-trained word vectors for the current word from the original text and for several of the preceding and following words. This setup could be improved by including more information in the state, so that the policy can make better-informed decisions.
2. The policy's **action** is either to include the current word in the summary or to drop it.
3. The state of the environment starts at the first word in the original text. After an action is taken, the state transitions to the next word in the original text.
4. At each time step, the **reward** is the incremental increase in the [ROUGE score][1] of the current partial summary relative to the partial summary from the previous time step. Computing this requires a ground-truth summary.

[1]: https://en.wikipedia.org/wiki/ROUGE_(metric)
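
The four points above can be sketched as a toy environment. This is only an illustration under stated assumptions, not the `summarization` module's implementation: the class `ToySummarizationEnv` and the helper `rouge1_f1` are hypothetical, the state holds raw words instead of pre-trained word vectors, and the score is a simplified unigram-overlap F1 (the exercise does not say which ROUGE variant it uses).

```python
from collections import Counter


def rouge1_f1(candidate, reference):
    """Unigram-overlap F1 between two token lists (a simplified ROUGE-1)."""
    if not candidate or not reference:
        return 0.0
    overlap = sum((Counter(candidate) & Counter(reference)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(candidate)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)


class ToySummarizationEnv:
    """Hypothetical stand-in for the exercise's environment, without gym."""

    KEEP, DROP = 1, 0

    def __init__(self, words, reference, window=1):
        self.words = words
        self.reference = reference
        self.window = window

    def _observation(self):
        # Context window around the current word; a real state would hold
        # pre-trained word vectors rather than raw strings.
        lo = self.position - self.window
        hi = self.position + self.window
        return [self.words[i] if 0 <= i < len(self.words) else "<pad>"
                for i in range(lo, hi + 1)]

    def reset(self):
        self.position = 0
        self.summary = []
        self.score = 0.0
        return self._observation()

    def step(self, action):
        if action == self.KEEP:
            self.summary.append(self.words[self.position])
        new_score = rouge1_f1(self.summary, self.reference)
        reward = new_score - self.score  # incremental change in the score
        self.score = new_score
        self.position += 1  # always move on to the next word
        done = self.position >= len(self.words)
        obs = None if done else self._observation()
        return obs, reward, done


env = ToySummarizationEnv(
    words="the cat sat on the mat".split(),
    reference="cat sat mat".split(),
)
obs = env.reset()
total, done = 0.0, False
for action in [0, 1, 1, 0, 0, 1]:
    obs, reward, done = env.step(action)
    total += reward
print(env.summary, round(total, 3))
```

Because each reward is the difference between consecutive scores, the undiscounted rewards of an episode sum to the score of the final summary. With an F1-style score, keeping a word that is absent from the reference lowers precision, so an individual reward can be negative.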

In [None]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import gym
import summarization
import ray

from ray.rllib.ppo import PPOAgent, DEFAULT_CONFIG

Start Ray.

In [None]:
ray.init(num_workers=0)

The cell below is a hack, and it works as follows. The `PPOAgent` constructor creates a number of actors, and each actor instantiates a gym environment by calling `gym.make('SimpleSummarization-v0')`. On its own, `gym.make` only knows how to instantiate the pre-defined environments that ship with the `gym` module. The `SimpleSummarization-v0` environment is defined in the `summarization` module and is registered with `gym` only when the `import summarization` statement runs.

Therefore, for the actors to instantiate the environment successfully, the `summarization` module must be imported on the actors. This is why we define a remote function, `import_summarization`, which closes over the `summarization` module. When the actors are created, that remote function is unpickled on each actor, which forces the `summarization` module to be imported there and in turn lets `gym` create the `SimpleSummarization-v0` environment.
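
The registration-on-import pattern behind this hack can be sketched without `gym` at all. In the sketch below, `REGISTRY`, `register`, `make`, and `simulate_import_summarization` are hypothetical stand-ins that mimic the behaviour described above; they are not `gym`'s actual implementation. The point is that `make` fails until the code that calls `register` has run, which for a real module happens as a side effect of importing it.

```python
REGISTRY = {}  # maps environment id -> zero-argument constructor


def register(env_id, constructor):
    """Record how to build the environment named env_id."""
    REGISTRY[env_id] = constructor


def make(env_id):
    """Build a registered environment, or fail if the id is unknown."""
    if env_id not in REGISTRY:
        raise KeyError("No registered environment with id: " + env_id)
    return REGISTRY[env_id]()


class SimpleSummarizationEnv:
    """Placeholder class standing in for the real environment."""


def simulate_import_summarization():
    # Assumption: in the real module a call like this sits at module level,
    # so it runs whenever `summarization` is imported in a process.
    register("SimpleSummarization-v0", SimpleSummarizationEnv)


try:
    make("SimpleSummarization-v0")
except KeyError as error:
    print("before import:", error)

simulate_import_summarization()
env = make("SimpleSummarization-v0")
print("after import:", type(env).__name__)
```

Each actor is a separate process with its own registry, which is why the import has to happen on every actor and not just in the notebook.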

In [None]:
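# This is a hack. The function body only references the `summarization`
# module, so the function closes over it. Unpickling the function on an actor
# then imports the module there, which registers the environment with gym.
@ray.remote
def import_summarization():
    summarization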

Instantiate an agent that can be trained using Proximal Policy Optimization (PPO).

In [None]:
config = DEFAULT_CONFIG.copy()
# Consider using more workers to speed up the rollouts.
config['num_workers'] = 10
config['gamma'] = 0.99
config['sgd_stepsize'] = 5e-3
config['kl_coeff'] = 0.1
config['num_sgd_iter'] = 20
config['sgd_batchsize'] = 8196
config['observation_filter'] = 'NoFilter'
config['model']['fcnet_hiddens'] = [32, 32]

agent = PPOAgent('SimpleSummarization-v0', config)

Try using the model to generate an extractive summary of a sentence. The model has not been trained yet, so the summary should be poor.

In [None]:
env = gym.make('SimpleSummarization-v0')