## Abstractive Summarisation

Abstractive summarisation generates new sentences instead of reusing sentences from the original text. This is in contrast to the extractive approach, where we used only sentences that were already present. As a result, the sentences in an abstractive summary might not appear anywhere in the original text:
![image.png](attachment:image.png)

Here we build an abstractive text summariser using deep learning.

### Introduction to Sequence-to-Sequence (Seq2Seq) Modelling

We can build a Seq2Seq model for any problem that involves sequential information. Examples include sentiment classification, neural machine translation and named entity recognition.

Here our objective is to build a text summariser where the input is a long sequence of words (the text body) and the output is a short summary (which is a sequence as well). So we can model this as a **Many-to-Many Seq2Seq problem**.
<img src="../figures/encoder-decoder.jpg">
There are two major components of a Seq2Seq model:
* Encoder
* Decoder

### Understanding the Encoder-Decoder Architecture
>The Encoder-Decoder architecture is mainly used to solve sequence-to-sequence (Seq2Seq) problems where the input and output sequences are of different lengths.

From the perspective of text summarization, the input is a long sequence of words and the output will be a short version of the input sequence.
![image.png](attachment:image.png)

Generally, gated variants of RNNs (GRUs or LSTMs) are preferred as the encoder and decoder components. This is because their gating mechanisms mitigate the vanishing gradient problem, which lets them capture long-term dependencies.

We can set up the Encoder-Decoder in 2 phases:

* Training phase
* Inference phase

### Training phase
In the training phase, we will first set up the encoder and decoder. We will then train the model to predict the target sequence offset by one timestep. Let's see in detail how to set up the encoder and decoder.

#### Encoder
An encoder Long Short-Term Memory (LSTM) network reads the entire input sequence, with one word fed into the encoder at each timestep. It processes the information at every timestep and captures the contextual information present in the input sequence.

![image.png](attachment:image.png)

The hidden state (h$_i$) and cell state (c$_i$) of the last timestep are used to initialize the decoder. This hand-off is needed because the encoder and decoder are two separate LSTM networks, so the encoder's final states are how the context of the input sequence reaches the decoder.
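
The hand-off can be written compactly. This is only a notational sketch: $x_t$ for the word fed in at timestep $t$, $T$ for the last encoder timestep, and $s_0^{\text{dec}}$ for the decoder's initial state are symbols assumed here, not taken from the figures.

$$
(h_t, c_t) = \mathrm{LSTM}_{\text{enc}}(x_t, h_{t-1}, c_{t-1}), \quad t = 1, \dots, T
$$

$$
s_0^{\text{dec}} = (h_T, c_T)
$$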

#### Decoder
The decoder is also an LSTM network which reads the entire target sequence word-by-word and predicts the same sequence offset by one timestep. **The decoder is trained to predict the next word in the sequence given the previous word.**

![image.png](attachment:image.png)

`<start>` and `<end>` are special tokens that are added to the target sequence before it is fed into the decoder. The target sequence is unknown when decoding a test sequence, so we start predicting by passing the decoder its first input, which is always the `<start>` token. The `<end>` token signals the end of the sentence.
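
To make the "offset by one timestep" idea concrete, here is a minimal sketch in plain Python of how a training pair for the decoder could be built from one tokenised summary. The helper name `make_decoder_pair` and the example tokens are hypothetical and only for illustration; a real pipeline would work on integer token ids and padded batches.

```python
START, END = "<start>", "<end>"


def make_decoder_pair(summary_tokens):
    """Return (decoder_input, decoder_target) for one target sequence.

    Hypothetical helper: the target is the input shifted left by one
    timestep, so at each position the decoder sees the previous word
    and is trained to predict the next one.
    """
    framed = [START] + list(summary_tokens) + [END]
    decoder_input = framed[:-1]   # starts with <start>, drops <end>
    decoder_target = framed[1:]   # drops <start>, finishes with <end>
    return decoder_input, decoder_target


decoder_input, decoder_target = make_decoder_pair(["late", "fee", "dispute"])
for seen, expected in zip(decoder_input, decoder_target):
    print(f"{seen:>8} -> {expected}")
```

Each printed row pairs the word the decoder receives at a timestep with the word it is expected to predict at that timestep.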

Pretty intuitive so far.

### Inference Phase
After training, the model is tested on new source sequences for which the target sequence is unknown. So, we need to set up the inference architecture to decode a test sequence:
![image.png](attachment:image.png)

#### How does the inference process work?

Here are the steps to decode the test sequence:

1. Encode the entire input sequence and initialize the decoder with the internal states of the encoder
2. Pass `<start>` token as an input to the decoder
3. Run the decoder for one timestep with the internal states
4. The output will be a probability distribution over the next word. The word with the maximum probability will be selected.
5. Pass the selected word as an input to the decoder in the next timestep and update the internal states with those from the current timestep
6. Repeat steps 3-5 until we generate the `<end>` token or hit the maximum length of the target sequence.
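
The steps above can be sketched as a greedy decoding loop. This is an illustration under stated assumptions, not the model built in this notebook: `toy_encode`, `toy_decoder_step` and `TOY_TABLE` are hypothetical stand-ins for the trained encoder and decoder, and the "state" is a plain integer where a real LSTM would carry its hidden and cell states.

```python
# Hypothetical next-word probabilities standing in for a trained decoder.
TOY_TABLE = {
    "<start>": {"card": 0.7, "fee": 0.2, "<end>": 0.1},
    "card": {"fee": 0.6, "card": 0.1, "<end>": 0.3},
    "fee": {"<end>": 0.8, "card": 0.2},
}


def toy_encode(source_tokens):
    # Stand-in for step 1: a real encoder would return its final (h, c).
    return len(source_tokens)


def toy_decoder_step(token, state):
    # Stand-in for step 3: returns next-word probabilities and a new state.
    return TOY_TABLE[token], state + 1


def greedy_decode(encode, decoder_step, source_tokens, max_len,
                  start="<start>", end="<end>"):
    state = encode(source_tokens)          # step 1: initialise from encoder
    token = start                          # step 2: first input is <start>
    output = []
    for _ in range(max_len):               # step 6: stop at max length...
        probs, state = decoder_step(token, state)  # steps 3 and 5
        token = max(probs, key=probs.get)  # step 4: most probable word
        if token == end:                   # ...or when <end> is generated
            break
        output.append(token)
    return output


summary = greedy_decode(toy_encode, toy_decoder_step,
                        ["i", "was", "charged", "twice"], max_len=10)
print(summary)  # ['card', 'fee']
```

Picking the single most probable word at every step is the simplest decoding strategy; alternatives such as beam search keep several candidate sequences, but they are outside what the steps above describe.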

In [1]:
%matplotlib inline
%load_ext autoreload
%autoreload 2

In [2]:
import numpy as np
import pandas as pd

In [3]:
df = pd.read_csv('../data/with_cleaned.csv')



In [7]:
!pip install nltk



In [44]:
first_complaint = df['Consumer complaint narrative'].values[8]

In [17]:
from nltk.tokenize import sent_tokenize

In [45]:
sentences = sent_tokenize(first_complaint)