# Title - Text Summarization with Pretrained Encoders: A market research survey case study

#### Members Names:
                  Donald Lane
                  Oyindamola Oyetola

#### Members Emails:
                  donald.lane@ryerson.ca
                  ooyetola@ryerson.ca

# Introduction:

#### Problem Description:

<p align="justify">The paper showcased how Bidirectional Encoder Representations from Transformers (BERT) can be usefully applied to text summarization and proposed a general framework for both extractive and abstractive models. 'Extractive' means that the summary is generated by selecting salient sentences or phrases from the source text, while 'abstractive' means that sentences are paraphrased and restructured to compose the summary.</p>
<br>

<p align="justify">The authors have shown that BERT can be used successfully to generate abstractive text summaries on their test data. However, the summarizer was trained on the CNN news dataset and may therefore not produce accurate or representative results on test data from other types of datasets and use cases, e.g., fiction books.</p>

For this research, a survey dataset has been generated to test the BERT-based model on this unique data.

<br>

#### Context of the Problem:

<p align="justify">Text summarization is important because of the need to condense documents into shorter versions while preserving most of their meaning.</p>

<p align="justify">The market survey industry is no different. Surveys often produce a lot of free text, so industry professionals and other stakeholders can benefit greatly from accurate summaries of the responses to each survey question.</p>

<p align="justify">In the 'survey world', free-text responses to any given question can pose a challenge to automated text summarizers. The text is often abbreviated, grammatically poorly structured, and redundant or repetitive. This type of data was therefore selected to see how well abstractive text summarization with pretrained BERT-based models performs.</p>

<br>

#### Limitations of Other Approaches:

<p align="justify">Other approaches rely on special mechanisms (e.g. reinforcement learning, multiple communicating encoders, copying mechanisms) beyond the model's minimum requirements to achieve good summarization.</p>
<p align="justify">They also do not use pre-trained embeddings.</p>

<br>

#### Solution:

<p align="justify">The paper introduced a novel document-level encoder based on BERT, which is able to express the semantics of a document and obtain representations for its sentences. The solution achieved better results with a minimum-requirement model, without using any of the special mechanisms mentioned above.</p>

# Background

| Reference |Explanation |  Dataset/Input |Weakness
| --- | --- | --- | --- |
| Rush et al.[1] | They were among the first to apply the neural encoder decoder architecture to text summarization|DUC-2004 | ROUGE average score is 20.68
| Nallapati et al.[2] | They used an encoder based on Recurrent Neural Networks|DUC-2003, Gigaword, CNN/Daily Mail | ROUGE average score is 27.14
| See et al.[3] | They used a hybrid pointer-generator network that can copy words from the source text via pointing and a coverage mechanism which keeps track of words that have been summarized. |CNN / Daily Mail | ROUGE average score is 31.06
| Celikyilmaz et al. [4] | They used an abstractive system where multiple agents (encoders) represent the document together with a hierarchical attention mechanism (over the agents) for decoding.|  CNN/Daily Mail and New York Times | ROUGE average score is 33.02
| Paulus et al. [5] | They used intra-temporal attention processes in the encoder and decoder to address repetition and incoherence problems|  CNN/Daily Mail and New York Times | ROUGE average score is 30.87
| Gehrmann et al. [6] | They used a data-efficient content selector to over-determine phrases in a source document that should be part of the summary. The selector is used as a bottom-up attention step to constrain the model to likely phrases.|  CNN/Daily Mail and New York Times | ROUGE average score is 32.75
| Shi et al. [7] | They used a salience estimation network to iteratively extract salient sentences. |CNN/Daily Mail | ROUGE average score is 32.30
| Chen and Bansal. [8] | They used an accurate and fast summarization model that first selects salient sentences and then rewrites them abstractively (i.e., compresses and paraphrases) to generate a concise overall summary. |CNN/Daily Mail, DUC 2002 | ROUGE average score is 32.65
| Zhou et al. [9] | They used an end-to-end neural network framework for extractive document summarization by jointly learning to score and select sentences. |CNN/Daily Mail | ROUGE average score is 32.86
| Zhang et al. [10] | They used a latent variable extractive model where sentences are viewed as latent variables and sentences with activated variables are used to infer gold summaries. |CNN/Daily Mail | ROUGE average score is 32.45
| Li et al. [11] | They extended seq2seq model with an information selection network to generate more informative summaries. |CNN/Daily Mail | ROUGE average score is 32.06
| Narayan et al. [12] | They used an abstractive model which is particularly suited to extreme summarization (i.e., single sentence summaries), based on convolutional neural networks and additionally conditioned on topic distributions. |XSum (BBC articles) | ROUGE average score is 20.41
| Liu and Lapata [13] | They applied the BERT model to text summarization |CNN/Daily Mail, NYT, XSum| ROUGE average score is 33.63

<p align="center">ROUGE Average Score represents the average of ROUGE-1, ROUGE-2 and ROUGE-L</p>
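
The averaging used in the last column of the table can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the function name `rouge_average` and the input scores are placeholders chosen for this example and are not taken from any of the cited papers.

```python
def rouge_average(rouge_1: float, rouge_2: float, rouge_l: float) -> float:
    """Return the unweighted mean of ROUGE-1, ROUGE-2 and ROUGE-L scores."""
    return (rouge_1 + rouge_2 + rouge_l) / 3.0


# Placeholder scores for illustration only (not results from any paper).
example = rouge_average(40.0, 20.0, 30.0)
print(f"ROUGE average: {example:.2f}")
```
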

# Methodology

<p align="justify">BERT is a language representation model that is trained with masked language modeling and a “next sentence prediction” task on a corpus of 3,300M words.</p>

<p align="justify">The input text is first preprocessed by inserting two special tokens. [CLS] is inserted at the beginning of the text; the output representation of this token is used to aggregate information from the whole sequence (e.g., for classification tasks). A [SEP] token is inserted after each sentence as an indicator of sentence boundaries. The modified text is then represented as a sequence of tokens.</p>

<p align="justify">Each token is assigned three kinds of embeddings: <I>token embeddings</I> indicate the meaning of each token, <I>segmentation embeddings</I> are used to discriminate between two sentences and <I>position embeddings </I> indicate the position of each token within the text sequence. These three embeddings are summed to a single input vector and fed to a bidirectional Transformer with multiple layers.</p>
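
The summation of the three embeddings can be illustrated with a toy example. The sketch below uses plain Python lists and tiny two-dimensional lookup tables that are invented for illustration; the names `embed_tokens`, `token_table`, `segment_table` and `position_table` are assumptions of this example, and a real BERT model uses learned tables with hundreds of dimensions.

```python
from typing import Dict, List


def add_vectors(*vectors: List[float]) -> List[float]:
    """Element-wise sum of equally sized vectors."""
    return [sum(values) for values in zip(*vectors)]


def embed_tokens(
    token_ids: List[int],
    segment_ids: List[int],
    token_table: Dict[int, List[float]],
    segment_table: Dict[int, List[float]],
    position_table: List[List[float]],
) -> List[List[float]]:
    """Sum token, segment and position embeddings for every input position."""
    inputs = []
    for position, (tok, seg) in enumerate(zip(token_ids, segment_ids)):
        inputs.append(
            add_vectors(token_table[tok], segment_table[seg], position_table[position])
        )
    return inputs


# Toy lookup tables with embedding dimension 2 (illustrative values only).
token_table = {0: [1.0, 0.0], 1: [0.0, 1.0]}
segment_table = {0: [0.5, 0.5], 1: [-0.5, -0.5]}
position_table = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]

vectors = embed_tokens([0, 1, 0], [0, 0, 1], token_table, segment_table, position_table)
print(vectors)
```

Each resulting vector is what would be passed to the first Transformer layer for that position.
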

<br>

<p align="justify"> Although BERT has been fine-tuned for various NLP tasks, its application to summarization is not as straightforward.
Since BERT is trained as a masked-language model, the output vectors are grounded to tokens instead of sentences, while in extractive summarization, most models manipulate sentence-level representations. Although segmentation embeddings represent different sentences in BERT, they only apply to sentence-pair inputs, while in summarization they must encode and manipulate multi-sentential inputs. The figure below illustrates the proposed BERT architecture for SUMmarization (which is called BERTSUM).</p>
<br>
<p align="justify">In order to represent individual sentences, external [CLS] tokens are inserted at the start of each sentence, and each [CLS] symbol collects features for the sentence preceding it. Interval segment embeddings are also used to distinguish multiple sentences within a document. Each sentence is assigned segment embedding EA or EB in alternation. For example, for document [sent1, sent2, sent3, sent4, sent5], embeddings [EA, EB, EA, EB, EA] will be assigned.
This way, document representations are learned hierarchically, where lower Transformer layers represent adjacent sentences, while higher layers, in combination with self-attention, represent multi-sentence discourse.
Position embeddings in the original BERT model have a maximum length of 512;
this limitation was overcome by adding more position embeddings that are initialized randomly and fine-tuned with the other parameters in the encoder.
</p>
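
The input formatting described above can be sketched as follows. This is a minimal illustration, not the PreSumm preprocessing code: the function name `bertsum_inputs` is hypothetical, sentences are assumed to be already tokenized, and segment id 0 stands for EA while 1 stands for EB.

```python
from typing import List, Tuple


def bertsum_inputs(sentences: List[List[str]]) -> Tuple[List[str], List[int], List[int]]:
    """Build tokens, interval segment ids and [CLS] positions for a document.

    Each sentence is a list of already tokenized words.
    """
    tokens: List[str] = []
    segment_ids: List[int] = []
    cls_positions: List[int] = []
    for index, sentence in enumerate(sentences):
        # Remember where this sentence's [CLS] token sits in the sequence.
        cls_positions.append(len(tokens))
        piece = ["[CLS]"] + sentence + ["[SEP]"]
        tokens.extend(piece)
        # Alternate segment ids per sentence: 0 (E_A), 1 (E_B), 0, 1, ...
        segment_ids.extend([index % 2] * len(piece))
    return tokens, segment_ids, cls_positions


# Toy document with three short sentences (illustrative text only).
doc = [["good", "service"], ["slow", "sign", "up"], ["thanks"]]
tokens, segment_ids, cls_positions = bertsum_inputs(doc)
print(tokens)
print(segment_ids)
print(cls_positions)
```

The vectors the encoder produces at `cls_positions` would then serve as the sentence representations.
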

![Bert for Summarization ](https://camo.githubusercontent.com/406a1ba1dee0a456af49e8687fbfd9989484855f/68747470733a2f2f692e696d6775722e636f6d2f4d6a4375424d712e706e67 "Bert for Summarization") 

<p align="center">Architecture of the original BERT model (left) and BERTSUM (right). The sequence on top is the input document, followed by the summation of three kinds of embeddings for each token. The summed vectors are used
as input embeddings to several bidirectional Transformer layers, generating contextual vectors for each token.
BERTSUM extends BERT by inserting multiple [CLS] symbols to learn sentence representations and using interval segmentation embeddings (illustrated in red and green color) to distinguish multiple sentences. </p>

<br>
<p align="justify">To keep the presentation simple, we will focus on abstractive summarization. A standard encoder-decoder framework is used for abstractive summarization. The encoder is the pretrained BERTSUM and the decoder
is a 6-layered Transformer initialized randomly.
It is conceivable that there is a mismatch between the encoder and the decoder, since the encoder is pretrained while the decoder must be trained from scratch. This can make fine-tuning unstable; for example, the encoder might overfit
the data while the decoder underfits, or vice versa.
To circumvent this, a new fine-tuning schedule was designed that separates the optimizers of the encoder and the decoder (two Adam optimizers were used, with β1 = 0.9 and β2 = 0.999, each with its own warmup steps and learning rate). This is based on the assumption that the pretrained encoder should be fine-tuned with a smaller learning rate and smoother decay (so that the encoder can be trained with more accurate gradients when the decoder is becoming stable).</p>
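
To make the idea of separate schedules concrete, here is a small sketch of a warmup-then-decay learning rate rule applied with different settings for the encoder and the decoder. The function name `warmup_then_decay_lr` is hypothetical, and the specific base rates and warmup steps (2e-3 with 20,000 steps for the encoder, 0.1 with 10,000 steps for the decoder) are the values as recalled from the paper; treat them as assumptions rather than as the settings used in this notebook.

```python
def warmup_then_decay_lr(step: int, base_lr: float, warmup: int) -> float:
    """Linear warmup for `warmup` steps, then inverse square root decay."""
    if step < 1:
        raise ValueError("step must be >= 1")
    return base_lr * min(step ** -0.5, step * warmup ** -1.5)


# Assumed settings (recalled from the paper, not read from this notebook).
ENCODER = {"base_lr": 2e-3, "warmup": 20_000}
DECODER = {"base_lr": 0.1, "warmup": 10_000}

for step in (1_000, 10_000, 20_000, 100_000):
    lr_enc = warmup_then_decay_lr(step, **ENCODER)
    lr_dec = warmup_then_decay_lr(step, **DECODER)
    print(f"step {step:>7}: encoder lr = {lr_enc:.2e}, decoder lr = {lr_dec:.2e}")
```

With these settings the encoder's learning rate stays far below the decoder's throughout training and peaks later, which matches the stated intent of a smaller learning rate and smoother decay for the pretrained encoder.
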

<p align="justify"> In addition, a two-stage fine-tuning approach was proposed, where the encoder is first fine-tuned on the extractive summarization task and then fine-tuned on the abstractive summarization task. The two-stage approach is conceptually very simple: the model can take advantage of information shared between the two tasks without fundamentally changing its architecture.
The default abstractive model is named BERTSUMABS and the two-stage fine-tuned model is named BERTSUMEXTABS. </p>

<br>

<p align="justify"> The model was evaluated on three benchmark datasets: CNN/DailyMail news highlights dataset, New York Times Annotated Corpus and XSum.

CNN/DailyMail contains news articles and associated highlights, i.e., a few bullet points giving a brief overview of the article.
NYT contains 110,540 articles with abstractive summaries.
XSum contains 226,711 news articles accompanied with a one-sentence summary, answering the question “What is this article about?”.</p>


# Implementation

Implementation of the research paper's technical approach.

## Note:  This code is designed to be executed in Google Colab

## Steps

Clone the github repository

In [0]:
!git clone https://github.com/mingchen62/PreSumm.git 
%cd PreSumm

Cloning into 'PreSumm'...
remote: Enumerating objects: 154, done.
remote: Total 154 (delta 0), reused 0 (delta 0), pack-reused 154
Receiving objects: 100% (154/154), 12.97 MiB | 16.11 MiB/s, done.
Resolving deltas: 100% (64/64), done.
/content/PreSumm


Installation of dependencies

In [0]:
!pip install torch==1.1.0 pytorch_transformers tensorboardX multiprocess pyrouge

Collecting torch==1.1.0
  Downloading https://files.pythonhosted.org/packages/69/60/f685fb2cfb3088736bafbc9bdbb455327bdc8906b606da9c9a81bae1c81e/torch-1.1.0-cp36-cp36m-manylinux1_x86_64.whl (676.9MB)
Collecting pytorch_transformers
  Downloading https://files.pythonhosted.org/packages/a3/b7/d3d18008a67e0b968d1ab93ad444fc05699403fa662f634b2f2c318a508b/pytorch_transformers-1.2.0-py3-none-any.whl (176kB)
Collecting tensorboardX
  Downloading https://files.pythonhosted.org/packages/35/f1/5843425495765c8c2dd0784a851a93ef204d314fc87bcc2bbb9f662a3ad1/tensorboardX-2.0-py2.py3-none-any.whl (195kB)
Collecting pyrouge
  Downloading https://files.pythonhosted.org/packages/11/85/e522dd6b36880ca19dcf7f262b22365748f56edc6f455e7b6a37d0382c32/pyrouge-0.1.3.tar.gz (60kB)

Download the pretrained Transformer models (extractive and abstractive) trained on the CNN/DailyMail dataset

In [0]:
%cd /content/PreSumm/models

# CNN/DM extractive model (bertext_cnndm_transformer.pt)
# The URL is quoted so the shell does not interpret '&' or '?' in it
!gdown "https://drive.google.com/uc?id=1kKWoV0QCbeIuFt85beQgJ4v0lujaXobJ"

# CNN/DM abstractive model (model_step_148000.pt)
!gdown "https://drive.google.com/uc?id=1-IKVCtc4Q-BdZpjXc4s70_fRsWnjtYLr"


/content/PreSumm/models
Downloading...
From: https://drive.google.com/uc?id=1kKWoV0QCbeIuFt85beQgJ4v0lujaXobJ
To: /content/PreSumm/models/bertext_cnndm_transformer.zip
1.32GB [00:18, 73.0MB/s]
Downloading...
From: https://drive.google.com/uc?id=1-IKVCtc4Q-BdZpjXc4s70_fRsWnjtYLr
To: /content/PreSumm/models/bertsumextabs_cnndm_final_model.zip
1.98GB [00:49, 40.3MB/s]
Permission denied: https://drive.google.com/uc?id=1H50fClyTkNprWJNh10HWdGEdDdQIkzsI
Maybe you need to change permission over 'Anyone with the link'?


Unzipping the models

In [0]:
!unzip /content/PreSumm/models/bertext_cnndm_transformer.zip
!unzip /content/PreSumm/models/bertsumextabs_cnndm_final_model.zip

Archive:  /content/PreSumm/models/bertext_cnndm_transformer.zip
  inflating: bertext_cnndm_transformer.pt  
Archive:  /content/PreSumm/models/bertsumextabs_cnndm_final_model.zip
  inflating: model_step_148000.pt    


Creating directories to store the models with data


In [0]:
!mkdir -p /content/PreSumm/models/CNN_DailyMail_Extractive
!mkdir -p /content/PreSumm/models/CNN_DailyMail_Abstractive

Move the contents of each model into appropriate training data directories

In [0]:
!mv /content/PreSumm/models/bertext_cnndm_transformer.pt /content/PreSumm/models/CNN_DailyMail_Extractive
!mv /content/PreSumm/models/model_step_148000.pt /content/PreSumm/models/CNN_DailyMail_Abstractive

Create directories to hold test data and text summary outputs

In [0]:
!mkdir -p /content/PreSumm/bert_data_test/
!mkdir -p /content/PreSumm/bert_data/cnndm

### Loading a new text file to summarize

First we will connect to your Google Drive account by running the following cell

In [0]:
from google.colab import drive
drive.mount('/content/drive')

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at /content/drive


Load a new text file and pre-process it for the summarization phase.

Note: You may have to change the path or name of the file to load in the `open(...)` call. In the following example, the text file is stored in a Google Drive folder.

In [0]:
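%cd /content/PreSumm/bert_data/cnndm

# Read the raw survey responses from Google Drive (adjust the path as needed)
with open('/content/drive/My Drive/Colab Notebooks/sample.txt', 'r') as src:
    text = src.read()

# Split on periods; the pieces are written back to back, without the periods
text = text.split('.')
# Mode 'a' appends, so re-running this cell adds the text to file.txt again
with open('/content/PreSumm/bert_data/cnndm/file.txt', 'a') as f:
    f.writelines(text)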

/content/PreSumm/bert_data/cnndm


Import NLTK (download punkt if required). This will be used for tokenization of the input text file.

In [0]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

The following cell writes the script that loads the pretrained model and summarizes the input files

In [0]:
%%writefile /content/PreSumm/src/summarizer.py
#!/usr/bin/env python
"""
    Main training workflow
"""
from __future__ import division

import argparse
import os
from others.logging import init_logger
from train_abstractive import validate_abs, train_abs, baseline, test_abs, test_text_abs, load_models_abs
from train_extractive import train_ext, validate_ext, test_ext
from prepro import data_builder
import glob

model_flags = ['hidden_size', 'ff_size', 'heads', 'emb_size', 'enc_layers', 'enc_hidden_size', 'enc_ff_size',
               'dec_layers', 'dec_hidden_size', 'dec_ff_size', 'encoder', 'ff_actv', 'use_interval']


def str2bool(v):
    if v.lower() in ('yes', 'true', 't', 'y', '1'):
        return True
    elif v.lower() in ('no', 'false', 'f', 'n', '0'):
        return False
    else:
        raise argparse.ArgumentTypeError('Boolean value expected.')



def init_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("-task", default='abs', type=str, choices=['ext', 'abs'])
    parser.add_argument("-encoder", default='bert', type=str, choices=['bert', 'baseline'])
    parser.add_argument("-mode", default='test', type=str, choices=['train', 'validate', 'test'])
    parser.add_argument("-bert_data_path", default='../../bert_data_new/cnndm')
    parser.add_argument("-model_path", default='../../models/')
    parser.add_argument("-result_path", default='../../results/cnndm')
    parser.add_argument("-temp_dir", default='../../temp')

    parser.add_argument("-batch_size", default=140, type=int)
    parser.add_argument("-test_batch_size", default=200, type=int)

    parser.add_argument("-max_pos", default=800, type=int)
    parser.add_argument("-use_interval", type=str2bool, nargs='?',const=True,default=True)
    parser.add_argument("-large", type=str2bool, nargs='?',const=True,default=False)
    parser.add_argument("-load_from_extractive", default='', type=str)

    parser.add_argument("-sep_optim", type=str2bool, nargs='?',const=True,default=True)
    parser.add_argument("-lr_bert", default=2e-3, type=float)
    parser.add_argument("-lr_dec", default=2e-3, type=float)
    parser.add_argument("-use_bert_emb", type=str2bool, nargs='?',const=True,default=False)

    parser.add_argument("-share_emb", type=str2bool, nargs='?', const=True, default=False)
    parser.add_argument("-finetune_bert", type=str2bool, nargs='?', const=True, default=True)
    parser.add_argument("-dec_dropout", default=0.2, type=float)
    parser.add_argument("-dec_layers", default=6, type=int)
    parser.add_argument("-dec_hidden_size", default=768, type=int)
    parser.add_argument("-dec_heads", default=8, type=int)
    parser.add_argument("-dec_ff_size", default=2048, type=int)
    parser.add_argument("-enc_hidden_size", default=512, type=int)
    parser.add_argument("-enc_ff_size", default=512, type=int)
    parser.add_argument("-enc_dropout", default=0.2, type=float)
    parser.add_argument("-enc_layers", default=6, type=int)

    # params for EXT
    parser.add_argument("-ext_dropout", default=0.2, type=float)
    parser.add_argument("-ext_layers", default=2, type=int)
    parser.add_argument("-ext_hidden_size", default=768, type=int)
    parser.add_argument("-ext_heads", default=8, type=int)
    parser.add_argument("-ext_ff_size", default=2048, type=int)

    parser.add_argument("-label_smoothing", default=0.1, type=float)
    parser.add_argument("-generator_shard_size", default=32, type=int)
    parser.add_argument("-alpha",  default=0.6, type=float)
    parser.add_argument("-beam_size", default=5, type=int)
    parser.add_argument("-min_length", default=15, type=int)
    parser.add_argument("-max_length", default=150, type=int)
    parser.add_argument("-max_tgt_len", default=140, type=int)

    # params for preprocessing
    parser.add_argument("-shard_size", default=2000, type=int)
    parser.add_argument('-min_src_nsents', default=3, type=int)
    parser.add_argument('-max_src_nsents', default=100, type=int)
    parser.add_argument('-min_src_ntokens_per_sent', default=5, type=int)
    parser.add_argument('-max_src_ntokens_per_sent', default=200, type=int)
    parser.add_argument('-min_tgt_ntokens', default=5, type=int)
    parser.add_argument('-max_tgt_ntokens', default=500, type=int)
    parser.add_argument("-lower", type=str2bool, nargs='?',const=True,default=True)
    parser.add_argument("-use_bert_basic_tokenizer", type=str2bool, nargs='?',const=True,default=False)

 
    parser.add_argument("-param_init", default=0, type=float)
    parser.add_argument("-param_init_glorot", type=str2bool, nargs='?',const=True,default=True)
    parser.add_argument("-optim", default='adam', type=str)
    parser.add_argument("-lr", default=1, type=float)
    parser.add_argument("-beta1", default= 0.9, type=float)
    parser.add_argument("-beta2", default=0.999, type=float)
    parser.add_argument("-warmup_steps", default=8000, type=int)
    parser.add_argument("-warmup_steps_bert", default=8000, type=int)
    parser.add_argument("-warmup_steps_dec", default=8000, type=int)
    parser.add_argument("-max_grad_norm", default=0, type=float)

    parser.add_argument("-save_checkpoint_steps", default=5, type=int)
    parser.add_argument("-accum_count", default=1, type=int)
    parser.add_argument("-report_every", default=1, type=int)
    parser.add_argument("-train_steps", default=1000, type=int)
    parser.add_argument("-recall_eval", type=str2bool, nargs='?',const=True,default=False)


    parser.add_argument('-visible_gpus', default='-1', type=str)
    parser.add_argument('-gpu_ranks', default='0', type=str)
    parser.add_argument('-log_file', default='../../logs/cnndm.log')
    parser.add_argument('-seed', default=666, type=int)

    parser.add_argument("-test_all", type=str2bool, nargs='?',const=True,default=False)
    parser.add_argument("-test_from", default='')
    parser.add_argument("-test_start_from", default=-1, type=int)

    parser.add_argument("-train_from", default='')
    parser.add_argument("-report_rouge", type=str2bool, nargs='?',const=True,default=True)
    parser.add_argument("-block_trigram", type=str2bool, nargs='?', const=True, default=True)

    args = parser.parse_args()
    args.gpu_ranks = [int(i) for i in range(len(args.visible_gpus.split(',')))]
    args.world_size = len(args.gpu_ranks)
    os.environ["CUDA_VISIBLE_DEVICES"] = args.visible_gpus

    init_logger(args.log_file)
    device = "cpu" if args.visible_gpus == '-1' else "cuda"
    device_id = 0 if device == "cuda" else -1

    return args, device_id

if __name__ == '__main__':
    args, device_id = init_args()
    print(args.task, args.mode) 

    cp = args.test_from
    try:
        # Checkpoint names look like model_step_148000.pt; recover the step count
        step = int(cp.split('.')[-2].split('_')[-1])
    except (IndexError, ValueError):
        step = 0

    predictor = load_models_abs(args, device_id, cp, step)

    all_files = glob.glob(os.path.join('/content/PreSumm/bert_data/cnndm', '*'))
    print('Files In Input Dir: ' + str(len(all_files)))
    for file in all_files:
        with open(file) as f:
            source = f.read().rstrip()

        # Convert the raw text into the BERT input format expected by the model
        data_builder.str_format_to_bert(source, args, '../bert_data_test/cnndm.test.0.bert.pt')
        args.bert_data_path = '../bert_data_test/cnndm'
        test_text_abs(args, device_id, cp, step, predictor)


Overwriting /content/PreSumm/src/summarizer.py
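
The script above recovers the training step from the checkpoint file name. The standalone sketch below isolates that logic so it can be tried without loading a model; the function name `checkpoint_step` is hypothetical and is not part of the PreSumm code.

```python
def checkpoint_step(path: str) -> int:
    """Return the step encoded in a name such as model_step_148000.pt, or 0."""
    try:
        # Drop the extension, then take the text after the last underscore.
        return int(path.split('.')[-2].split('_')[-1])
    except (IndexError, ValueError):
        # No extension, or the last part of the name is not a number.
        return 0


print(checkpoint_step('/content/PreSumm/models/CNN_DailyMail_Abstractive/model_step_148000.pt'))
print(checkpoint_step('bertext_cnndm_transformer.pt'))
```

Falling back to 0 keeps the script usable with checkpoints whose names do not follow the `model_step_<N>.pt` pattern.
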


Set up and run the BERT text summarizer on the input data, using the CNN/DM abstractive model

In [0]:
%cd /content/PreSumm/src
!python summarizer.py -task abs -mode test -test_from /content/PreSumm/models/CNN_DailyMail_Abstractive/model_step_148000.pt -batch_size 32 -test_batch_size 500 -bert_data_path ../bert_data/cnndm -log_file ../logs/val_abs_bert_cnndm -report_rouge False  -sep_optim true -use_interval true -visible_gpus -1 -max_pos 512 -max_src_nsents 100 -max_length 200 -alpha 0.95 -min_length 50 -result_path ../results/abs_bert_cnndm_sample

/content/PreSumm/src
abs test
[2020-04-16 14:55:23,300 INFO] Loading checkpoint from /content/PreSumm/models/CNN_DailyMail_Abstractive/model_step_148000.pt
Namespace(accum_count=1, alpha=0.95, batch_size=32, beam_size=5, bert_data_path='../bert_data/cnndm', beta1=0.9, beta2=0.999, block_trigram=True, dec_dropout=0.2, dec_ff_size=2048, dec_heads=8, dec_hidden_size=768, dec_layers=6, enc_dropout=0.2, enc_ff_size=512, enc_hidden_size=512, enc_layers=6, encoder='bert', ext_dropout=0.2, ext_ff_size=2048, ext_heads=8, ext_hidden_size=768, ext_layers=2, finetune_bert=True, generator_shard_size=32, gpu_ranks=[0], label_smoothing=0.1, large=False, load_from_extractive='', log_file='../logs/val_abs_bert_cnndm', lower=True, lr=1, lr_bert=0.002, lr_dec=0.002, max_grad_norm=0, max_length=200, max_pos=512, max_src_nsents=100, max_src_ntokens_per_sent=200, max_tgt_len=140, max_tgt_ntokens=500, min_length=50, min_src_nsents=3, min_src_ntokens_per_sent=5, min_tgt_ntokens=5, mode='test', model_path='../

View the abstractive text summary that was generated above.

In [0]:
!head /content/PreSumm/results/abs_bert_cnndm_sample.148000.candidate

` i 've just started using it so it 's too early for me to have any opinion on this service '<q>` i 'm not impressed with the delay of opening accounts and getting answers for questions . ' i do not even understand why i get a phone call with the same questions i answered online . '<q>i do n't know why i have been surprised by the delay '
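
The raw candidate line is lowercased, tokenized, and uses `<q>` between generated sentences. The sketch below shows one possible way to make such a line easier to read; the helper names `detokenize` and `readable_sentences` are hypothetical, and the regular expressions only cover the simple spacing patterns visible in the output above.

```python
import re
from typing import List


def detokenize(sentence: str) -> str:
    """Re-attach contractions and punctuation that the tokenizer split off."""
    # "i 've" -> "i've", "do n't" -> "don't", "it 's" -> "it's"
    sentence = re.sub(r"\s+(n't|'\w+)", r"\1", sentence)
    # "questions ." -> "questions."
    sentence = re.sub(r"\s+([.,!?])", r"\1", sentence)
    return sentence


def readable_sentences(candidate_line: str) -> List[str]:
    """Split a candidate line on '<q>' and tidy up each sentence."""
    parts = [part.strip() for part in candidate_line.split("<q>")]
    return [detokenize(part) for part in parts if part]


for sentence in readable_sentences("i 've just started<q>it 's late ."):
    print(sentence)
```
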


<br>

## Results

<p align="justify">The authors demonstrated the influence of language model pre-training on text summarization tasks. They did so by using the extractive model based on pre-trained models as the starting point for generating abstractive summaries, a 'two-stage fine-tuning approach' intended to enhance the quality of the generated summaries. State-of-the-art results were achieved on three large datasets.</p>

<p align="justify">The following results, extracted from the paper, illustrate the effectiveness of the BERT-based summarizers compared to other approaches. The BERT-based models achieve higher ROUGE scores than the earlier approaches.</p>

![image.png](https://i.ibb.co/ydkVJxM/Table3-Rouge.png) ![image.png](https://i.ibb.co/h1dGCVz/Table4-Rouge.png)

<p align="justify">Moving on to the new test dataset, the results of the abstractive text summarizer based on a BERT pre-trained language model are below. The model generated the summary with paraphrases of the input dataset. However, the summary does not fully reflect the full dataset or the gist of the responses in this fictitious dataset. Some main themes are not included in the summary.</p>

<p align="justify">Furthermore, the summary appears more 'negative' in tone and leaves out the positive responses. While it is difficult to know why, it is believed that this is a function of the CNN/Daily Mail articles that were used to fine-tune the BERT-based summarizer. Many news articles are negative in nature in this day and age.</p>
<br>
<p align="justify">What is impressive is that each sentence in the summary makes complete sense, and the paraphrasing is in fact based on survey responses and is therefore accurate. However, the full summary does feel a bit fragmented, which is likely a consequence of the input data content. Additionally, the summarizer presented the summary in the first person, further reflecting the input data.</p>

<p align="justify">For comparison, an extractive summary generated with the Gensim summarizer has been included; see below. Recall that extractive summaries use phrases and sentences taken directly from the input document. This version managed to present both the negative and the positive themes in the input data. Furthermore, the summary is readable and understandable. It mixes first-person and third-party views, which makes it somewhat disjointed, but it is easily interpretable.</p>
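
To illustrate what "extractive" means in practice, here is a deliberately simple stand-in that scores sentences by average word frequency and returns the top ones verbatim. It is not Gensim's algorithm and was not used to produce the results in this report; the function names and the sample text are invented for this example.

```python
import re
from collections import Counter
from typing import List


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def extractive_summary(text: str, num_sentences: int = 2) -> List[str]:
    """Return the highest scoring sentences, unchanged and in original order."""
    sentences = split_sentences(text)
    frequencies = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence: str) -> float:
        words = re.findall(r"[a-z']+", sentence.lower())
        return sum(frequencies[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:num_sentences])  # restore document order
    return [sentences[i] for i in chosen]


# Invented sample responses, for illustration only.
sample = (
    "The app is easy to use. Opening the account took too long. "
    "The app is easy and the fees are low. I have no opinion yet."
)
for sentence in extractive_summary(sample, num_sentences=2):
    print(sentence)
```

Because every returned sentence is copied from the input, an extractive summary cannot paraphrase, which is the key difference from the abstractive model evaluated here.
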

### **The abstractive summary**
<br>
For readability: here is the summary that was generated using the model:


---
i 've just started using it so it 's too early for me to have any opinion on this service ` i 'm not impressed with the delay of opening accounts and getting answers for questions . ' i do not even understand why i get a phone call with the same questions i answered online . '<q>i do n't know why i have been surprised by the delay '
___

### **A comparison**

<p align="justify">For comparison, here is a summary of the same input text generated with the Gensim summarizer. This is not a direct comparison, however, as Gensim uses an extractive text summarization approach.</p>



---

The process to open the account was quite cumbersome however and the need to speak to a person before investing was an irritant as I was expecting a fully online experience.
I really like the ease of use of the product, and I would recommend it for someone's non registered plan; however I don't think the underlying investments have a long enough track record to be comfortable telling someone they should consider it for an RSP.


---


# Conclusion and Future Direction

<p align="justify">The paper introduced a novel document-level encoder and proposed a general framework for both abstractive and extractive summarization. Experimental results across three datasets show that the model achieves state-of-the-art results across the board under automatic and human-based evaluation protocols.
Although the research mainly focused on document encoding for summarization, in the future, the capabilities of BERT can be used for language generation.</p>
<br>

<p align="justify">On a new test dataset created to mimic the results of a survey, one that is challenging because of its grammar and the redundancy of statements and themes, BERT generated 'decent' results overall, with some very promising results at the micro (sentence) level.</p>

<p align="justify">One recommendation is to train the BERT-based model on another large corpus, perhaps a wiki-based one, to see whether the summaries improve. Furthermore, optimizing the model's hyper-parameters may produce summaries that reflect the input even better.</p>

<br>

## You can find this notebook and data on the following github accounts:

https://github.com/troylane/Abstractive-Text-Summarization

https://github.com/ooyetola/abstractive-text-summarization.git


# References:

[1]:  Alexander M. Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 379–389, Lisbon, Portugal.

[2]:  Ramesh Nallapati, Bowen Zhou, Cicero dos Santos, Çağlar Gülçehre, and Bing Xiang. 2016. Abstractive text summarization using sequence-to-sequence RNNs and beyond. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, pages 280–290, Berlin, Germany.

[3]:  Abigail See, Peter J. Liu, and Christopher D. Manning. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL, pages 1073–1083, 2017.

[4]:  Asli Celikyilmaz, Antoine Bosselut, Xiaodong He, and Yejin Choi. 2018. Deep communicating agents for abstractive summarization. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1662–1675, New Orleans, Louisiana.

[5]:  Romain Paulus, Caiming Xiong, and Richard Socher. A deep reinforced model for abstractive summarization. ICLR, pages 1–13, 2018.

[6]:  Sebastian Gehrmann, Yuntian Deng, and Alexander M Rush. Bottom-up abstractive summarization. arXiv preprint arXiv:1808.10792, 2018.

[7]:  Jiaxin Shi, Chen Liang, Lei Hou, Juanzi Li, Zhiyuan Liu, and Hanwang Zhang. DeepChannel: Salience Estimation by Contrastive Learning for Extractive Document Summarization. CoRR, 2018.

[8]:  Yen-Chun Chen and Mohit Bansal. Fast abstractive summarization with reinforce-selected sentence rewriting. arXiv preprint arXiv:1805.11080, 2018.

[9]:  Qingyu Zhou, Nan Yang, Furu Wei, Shaohan Huang, Ming Zhou, and Tiejun Zhao. Neural document summarization by jointly learning to score and select sentences. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 1:Long Papers, pages 654–663, 2018.

[10]:  Xingxing Zhang, Mirella Lapata, Furu Wei, and Ming Zhou. Neural latent extractive document summarization. arXiv preprint arXiv:1808.07187, 2018.

[11]:  Wei Li, Xinyan Xiao, Yajuan Lyu, and Yuanzhuo Wang. Improving Neural Abstractive Document Summarization with Explicit Information Selection Modeling. In EMNLP, pages 1787–1796, 2018.

[12]:  Shashi Narayan, Shay B. Cohen, and Mirella Lapata. 2018. Don’t give me the details, just the summary! Topic-aware convolutional neural networks for extreme summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1797–1807, Brussels, Belgium.

[13]:  Liu, Yang & Lapata, Mirella. (2019). Text Summarization with Pretrained Encoders. 3721-3731. 10.18653/v1/D19-1387.