## Abstractive Text Summarization
#### Sanja Simonovikj - Shell internship, summer 2018

Until now we have used extractive text summarization techniques and algorithms, such as TextRank (a graph-based algorithm) or Elasticsearch (query-based summarization). These approaches gave relatively good summaries in the form of relevant and important sentences extracted from the document(s). However, they have limitations, such as incomplete coverage, redundancy, and sometimes a lack of coherence between sentences.

In this document we provide an overview of an abstractive method for text summarization, presented in a very recent paper:

**"Fast Abstractive Summarization with Reinforce-Selected Sentence Rewriting"**. 

- The paper can be found here: https://arxiv.org/abs/1805.11080
- Abstract:
> Inspired by how humans summarize long documents, we propose an accurate and fast summarization model that first selects salient sentences and then rewrites them abstractively (i.e., compresses and paraphrases) to generate a concise overall summary. We use a novel sentence-level policy gradient method to bridge the non-differentiable computation between these two neural networks in a hierarchical way, while maintaining language fluency. Empirically, we achieve the new state-of-the-art on all metrics (including human evaluation) on the CNN/Daily Mail dataset, as well as significantly higher abstractiveness scores. Moreover, by first operating at the sentence-level and then the word-level, we enable parallel decoding of our neural generative model that results in substantially faster (10-20x) inference speed as well as 4x faster training convergence than previous long-paragraph encoder-decoder models. We also demonstrate the generalization of our model on the test-only DUC-2002 dataset, where we achieve higher scores than a state-of-the-art model.
- The GitHub code is available here: https://github.com/ChenRocks/fast_abs_rl
> A pretrained model is available and can be used to generate summaries for newly provided documents.


### Notes from the paper

![title](fig1.png)

![title](fig2.png)

**Main idea:**
- *The Extractor* (called f) extracts salient sentences. 
- *The Abstractor* (called g) rewrites each sentence and generates new words from a large vocabulary.
- The extractor and abstractor are first trained separately by optimizing a maximum-likelihood (ML) objective. Reinforcement learning (RL) with policy gradient techniques is then used to train the full model end-to-end (to approximate the function the paper calls h).
- The method relies on the following assumption: in any given pair of document and summary, every summary sentence can be produced from some document sentence. This assumption is very important because it determines how the training data is constructed.


**Extractor details:**
- The extractor uses embedded word vectors and a temporal convolutional model to obtain a representation of each individual sentence in the document. A bidirectional LSTM-RNN is then applied to incorporate global context, i.e. to obtain a sentence representation that takes into account all previous and future sentences. The extractor then selects sentences based on these representations; an LSTM-RNN is used to train a Pointer Network for this purpose.
- *Extractor Training:* The summarization dataset used does not have saliency labels for each sentence. To obtain labels indicating whether a given sentence should be extracted, the following approach is used: for each ground-truth summary sentence, we find the most similar document sentence using the ROUGE-L (longest common subsequence) metric. This yields the labeled training data needed to train the extractor.
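
The labeling step described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: it uses a simplified ROUGE-L recall (LCS length divided by the length of the summary sentence), which is an assumption about the exact variant, and the function names `lcs_length`, `rouge_l_recall` and `proxy_labels` are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_recall(candidate, reference):
    """Simplified ROUGE-L recall: LCS length divided by reference length."""
    if not reference:
        return 0.0
    return lcs_length(candidate, reference) / len(reference)


def proxy_labels(doc_sents, summary_sents):
    """For each summary sentence, return the index of the most similar document sentence."""
    labels = []
    for summ in summary_sents:
        scores = [rouge_l_recall(doc, summ) for doc in doc_sents]
        labels.append(max(range(len(doc_sents)), key=scores.__getitem__))
    return labels


doc = [
    "russia remained china 's largest crude supplier in february".split(),
    "angola 's exports to china also rose sharply".split(),
    "venezuela 's exports to china dropped sharply".split(),
]
summary = [
    "angola 's exports to china rose sharply".split(),
    "russia remained china 's largest supplier".split(),
]
labels = proxy_labels(doc, summary)
print(labels)  # [1, 0]
```

Each label is the index of the document sentence that the extractor should learn to select for the corresponding summary sentence.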

**Abstractor details:**
- The abstractor approximates g, which compresses and paraphrases an extracted document sentence into a concise summary sentence. A standard encoder-aligner-decoder network is trained to minimize the cross-entropy loss of the decoder language model at each generation step. At a high level, it is a sequence-to-sequence model with attention and a copy mechanism (but no coverage).
- *Abstractor Training:* The training pairs for the abstractor are created by taking each ground-truth summary sentence and pairing it with its extracted document sentence (found in the same way as when training the extractor).

**End-to-end training details:**
- Reinforcement Learning (RL) is used to train the full model. The extractor is considered the RL agent. Note that the agent acts at the sentence level, as opposed to the word level. The reward is the ROUGE-L score between the final generated sentence (after passing through the extractor and the abstractor) and the associated ground-truth summary sentence. During RL training, the abstractor parameters are kept fixed. This implies that the extraction is reinforce-guided.
- To limit the length of the produced summary, a "stop" action is added to the policy action space. After reaching this stop action, the agent receives 0 reward. This means that the length of the produced summary is not customizable when using a pretrained model: the length is effectively one of the things being learned. To change it, we would need to retrain the whole model on training data whose reference (ground-truth) summaries have the desired length. With the provided pretrained model, we get ~4 sentences as a summary, regardless of the length of the input document.
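
To illustrate why the summary length cannot be set directly, here is a toy sketch of extraction with a stop action. The scores are made-up numbers standing in for the output of a trained policy, `greedy_extract` is a hypothetical helper, and the rule that an already extracted sentence is not picked again is an assumption.

```python
def greedy_extract(step_scores, max_steps=10):
    """Pick one action per step until the stop action wins.

    step_scores[t] holds hypothetical policy scores at step t: one score per
    document sentence, followed by one extra score for the "stop" action.
    """
    selected = []
    for scores in step_scores[:max_steps]:
        stop_index = len(scores) - 1
        # Assumption: sentences that were already extracted are not picked again.
        candidates = [i for i in range(len(scores)) if i not in selected]
        action = max(candidates, key=scores.__getitem__)
        if action == stop_index:
            break
        selected.append(action)
    return selected


# Three document sentences (indices 0-2) plus the stop action (index 3).
scores_per_step = [
    [0.1, 0.6, 0.2, 0.1],
    [0.5, 0.3, 0.1, 0.1],
    [0.1, 0.1, 0.2, 0.6],
]
extracted = greedy_extract(scores_per_step)
print(extracted)  # [1, 0]
```

The number of extracted sentences depends only on when the policy prefers the stop action, which is why changing the summary length requires retraining rather than setting a parameter.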

**Repetition-avoiding reranking:**
- Reranking is an optional strategy to mitigate redundancy. The extractor already handles most of the redundancy issue by selecting non-redundant sentences, but the reranking mechanism can slightly improve the outcome by removing a few "across-sentence" repetitions.
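
A rough sketch of the reranking idea: given several candidate rewrites (a beam) for each extracted sentence, choose the combination with the fewest n-grams repeated across sentences. The use of bigrams, the brute-force search over all combinations, and the names `repeated_ngrams` and `rerank` are assumptions made for illustration; the actual implementation may differ.

```python
from collections import Counter
from itertools import product


def repeated_ngrams(sentences, n=2):
    """Count n-gram occurrences beyond the first across all sentences."""
    counts = Counter()
    for sent in sentences:
        tokens = sent.split()
        counts.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return sum(c - 1 for c in counts.values() if c > 1)


def rerank(beams, n=2):
    """Pick one candidate per sentence so the summary has the fewest repeated n-grams.

    beams[i] is the list of candidate rewrites for the i-th extracted sentence.
    Brute force over all combinations, so only practical for small beams.
    """
    return min(product(*beams), key=lambda combo: repeated_ngrams(combo, n))


beams = [
    ["total has also bagged a 25 % interest", "total bagged a 25 % interest"],
    ["the company acquired a 25 % stake", "the company acquired an option to buy a stake"],
]
best = rerank(beams)
print(best)
```

In this toy input, the first candidate of the second beam repeats the bigrams "a 25" and "25 %" from the first sentence, so the second candidate is preferred.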


### Training speed:
As reported in the paper: 
> "It took a total of 19.71 hours to train our model. 4.15 hours for the abstractor, 15.56 hours for the RL training. Extractor ML training can be run at the same time with abstractor training and is approximately 1.5 hours."

The code is written in Python 3 using the PyTorch library and expects a GPU with a CUDA-enabled installation. However, running a pretrained model only requires a CPU.

### Examples
Below we present summaries generated on our own data using a pretrained model. The model was trained as in the paper, with the reranking strategy, on the CNN/DailyMail news dataset. To generate these summaries, the beam size was set to 5 and the batch size to 1 (this avoids the interleaving bug); all other parameters were left at their defaults.

#### example 1:
Raw text:
> Russia remained China's largest crude supplier in February, with exports rising by 199,000 b/d to 1.32 million b/d, thanks to the recent start-up of a new crude pipeline between the two countries.
Angola's exports to China also rose sharply, up 14.7% to 978,000 b/d, squeezing Saudi Arabia's supplies, which fell 2.9%, according to Chinese customs data (IOD Feb.27'18).
Although US exports in February did not reach the previous month's record high, they still stood at a healthy 238,000 b/d.
Going forward, the extent of crude trade between the two countries will be watched keenly amid heightened trade tensions.
Venezuela's exports to China dropped sharply year-on-year, and compared to January, in a likely sign of growing upstream struggles in the Latin American country (EC Mar.23'18).
China's February crude import data alone does not give an entirely accurate picture, with the New Year holiday having shut down most of the country for the second half of the month.
While February crude imports edged up 1.5% from the previous year, combined imports for January and February surged by 10.8%, or 885,000 b/d, year-on-year to 9.06 million b/d (IOD Mar.12'18).

Produced summary:
> russia remained china 's largest crude supplier in february .
angola 's exports to china rose sharply , up 14.7 % to 978,000 b/d .
us exports in february did not reach record high .
crude trade between the two countries will be watched keenly amid heightened trade tensions .

#### example 2:
Raw text:
> Total has secured nonoperated interests in two exploration blocks offshore Guyana after acquiring the right to farm into the Orinduik Block last year.
The French major has agreed to acquire a 35% stake in the deepwater Canje Block from Canada's JHI Associates and local firm Mid-Atlantic Oil & Gas, which will retain a 30% interest.
The tract is operated by Exxon Mobil (35%) and lies in water depths ranging from 1,700 to 3,000 meters.
Total has also bagged a 25% interest in the shallow-water Kanuku Block in water depths of 70 to 100 meters, under an agreement with operator Repsol (37.5%) and Tullow Oil (37.5%).
The company previously acquired an option to buy a 25% stake in the nearby shallow-water Orinduik tract -- operated by UK-based Tullow (60%) -- from Canadian independent Eco Atlantic, which will keep a 15% stake.
Guyana is set to become a significant oil producer once Exxon starts production in 2020 from the prolific Stabroek Block, where it has made a string of discoveries in recent years.
Exxon has identified as many as 3.2 billion barrels recoverable from Liza and other discoveries, including Liza Deep, Payara, Turbot and Snoek (EIF Jan.10'18).

Produced summary:
> the french major has agreed to acquire a 35 % stake .
the tract is operated by exxon mobil and lies in water depths .
total has also bagged a 25 % interest in the shallow-water kanuku block .
the company previously acquired an option to buy a 25 % stake in the nearby shallow-water orinduik tract .


#### example 3:
Raw text:
> Britain's largest labour union, Unite the Union, announced series of 24 hour and 12 hour rig worker strikes at Total's North Sea oil and gas platforms, a statement said on Thursday.
Details to follow:
The three platforms will be forced to cease production, the statement said
There will be three 24 hour stoppages on July 23 and Aug. 6 and 20 as well as two 12 hour stoppages on July 30 and Aug. 13
The union also announced a ban on overtime starting on July 23
The strike will affect Total's Alwyn, Dunbar and Elgin-Franklin platforms
Union workers voted in favour of strike action on June 28 over work shifts and pay
Reporting By Julia Payne 

Produced summary:
> britain 's largest labour union , unite the union , announced series of 24 hour strikes .
the three platforms will be forced to cease production .
there will be three 24 hour stoppages on july 23 and aug. 6 .
union also announced a ban on overtime starting on july 23 .
strike will affect total 's alwyn , dunbar and elgin-franklin platforms .
details to follow : 


#### example 4:
Raw text:
> Shipments of CPC Blend crude oil are set to scale new heights in March, with some 1.3 million b/d due to load at the Yuzhnaya Ozerreyevka terminal near the Russian Black Sea port of Novorossiysk, according to the latest schedule (NC Feb.15'18).
The main reason for the uptick is higher volumes going into the Caspian Pipeline Consortium (CPC) pipeline from Kazakhstan's giant Kashagan field, which is set to contribute some 260,000 b/d this month, almost 100,000 b/d more than the amount scheduled for February.
The marketing of the Kashagan barrels is assigned to the seven shareholders in the North Caspian Operating Co. (NCOC), although some of the rights are handed over to third parties.
For example, Total handles Exxon Mobil's share of production, while Swiss trader Vitol lifts the barrels belonging to KMG Kashagan B.V. under a four-year offtake deal.
Most of the CPC Blend ends up in Europe, with Italy the most popular destination.
In recent months, CPC has found a home in the Baltics, with Royal Dutch Shell putting at least one cargo into Gdansk, Poland, and Chevron delivering into the Lithuanian port of Butinge.

Produced summary:
> shipments of cpc blend crude oil are set to scale new heights in march .
the uptick is higher volumes going into the pipeline from kazakhstan 's giant kashagan field .
the marketing of the kashagan barrels is assigned to the seven shareholders in the north caspian operating co. -lrb- ncoc -rrb- .
in recent months , cpc has found a home with royal dutch shell .


#### example 5:
Raw text:
> INDONESIA -- Chevron has submitted a revised development plan for the second stage of its Indonesia Deepwater Development (IDD), with costs said to be reduced by 25% from the original \$12 billion.
The US major also submitted a proposal to extend the contracted areas related to the IDD scheme.
The firm has had to rework and resubmit its development plan several times, delaying the start-up of the phase from 2020 to 2022-23 (PIW Jan.18'16).
One previous plan was rejected by the Indonesian government for requiring too many fiscal incentives, but Chevron has been able to significantly work down the project's capital and operational costs for the Gehem, Gendalo, Gandang and Maha fields under its latest development plan proposal.
The first stage of IDD came on stream in 2016.
At its peak, IDD is expected to produce 1.1 Bcf/d (11.4 Bcm/yr), with most of the gas earmarked for delivery to Indonesia's Bontang LNG plant in East Kalimantan.

Produced summary:
> indonesia has submitted a revised development plan for the second stage of its indonesia deepwater development .
us major also submitted proposal to extend contracted areas related to idd scheme .
the firm has had to rework and resubmit its development plan several times .
chevron has been rejected by the indonesian government for requiring too many fiscal incentives .

### Running the pre-trained model
To produce summaries for our documents, download the code and follow the instructions at https://github.com/ChenRocks/fast_abs_rl under "Decode summaries from the pretrained model". Note that raw documents need to be preprocessed before they are passed to the model; the preprocessing instructions can be found here: https://github.com/ChenRocks/cnn-dailymail. The code in "make_datafiles.py" is customized for the CNN/DailyMail dataset and should be adapted to our own data; our local version is saved in "newest_make_datafiles.py". The main file that produces the final summaries is "decode_full_model.py". Make sure you run "export DATA=path/to/decompressed/data" before running the model.
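
Forgetting to set the DATA variable is easy to do, so a small pre-flight check can help. The sketch below is a hypothetical helper (`resolve_data_dir` is not part of the repository) and assumes the scripts read the data location from the DATA environment variable.

```python
import os
from pathlib import Path


def resolve_data_dir(env=os.environ):
    """Return the directory named by DATA, or raise with a helpful message."""
    value = env.get("DATA")
    if not value:
        raise RuntimeError("DATA is not set; run: export DATA=path/to/decompressed/data")
    path = Path(value)
    if not path.is_dir():
        raise RuntimeError(f"DATA points to a missing directory: {path}")
    return path


# Report the problem instead of failing later inside the decoding script.
try:
    print("Using data directory:", resolve_data_dir())
except RuntimeError as err:
    print("Configuration problem:", err)
```

Running this before "decode_full_model.py" gives an explicit message when the variable is missing or points to the wrong place.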

#### Possible bugs/incompatibilities:
Here are some potential errors you can run into and how to solve them:
- **"RuntimeError: get_device is not implemented for type torch.FloatTensor"**
> When running on CPU, replace every instance of something.**get_device()** with **"cpu"**. Example:

```python
# Original (fails on CPU tensors):
# token, states = bs.pack_beam(beam, article.get_device())
token, states = bs.pack_beam(beam, "cpu")

# Original (fails on CPU tensors):
# ind = torch.LongTensor(ind).to(attention.get_device())
ind = torch.LongTensor(ind).to("cpu")
```
- **"AttributeError: module 'torch._C' has no attribute '_cuda_getDevice'"**
> In "decoding.py", in class "DecodeDataset", in the "load_best_ckpt" method, change

```python
ckpt = torch.load(join(model_dir, 'ckpt/{}'.format(ckpts[0])))['state_dict']
```
to
```python
ckpt = torch.load(join(model_dir, 'ckpt/{}'.format(ckpts[0])), map_location='cpu')['state_dict']
```

- **"ValueError: length of all samples has to be greater than 0, but found an element in 'lengths' that is <=0"**
> In "decoding.py", in class "Abstractor", in the "_prepro" method, change:

```python
articles = conver2id(UNK, self._word2id, raw_article_sents)
# Add the following line to drop empty (zero-length) sentences:
articles = [art for art in articles if len(art) != 0]
```
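
The effect of this filter can be seen on a toy input. The error presumably arises when a sentence contains no tokens (for example, a blank line in the preprocessed input), which yields a zero-length sequence. In the sketch below, `to_ids` is a hypothetical stand-in for the repository's conversion function, and the vocabulary and the UNK id are made up.

```python
UNK = 1  # hypothetical id for unknown words


def to_ids(word2id, sentences):
    """Stand-in for the repository's conversion step: map tokens to integer ids."""
    return [[word2id.get(tok, UNK) for tok in sent] for sent in sentences]


word2id = {"the": 2, "strike": 3}
raw_article_sents = [["the", "strike"], [], ["the", "platforms"]]

articles = to_ids(word2id, raw_article_sents)
# Drop sentences that ended up with no tokens, as in the fix above.
articles = [art for art in articles if len(art) != 0]
print(articles)  # [[2, 3], [2, 1]]
```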