# Text summarization with Simple Transformers T5

In this notebook, we implement a news article summarization task with T5, using the news summary dataset published by [Kondalarao Vonteru].

[Kondalarao Vonteru]: https://www.kaggle.com/sunnysai12345

## Content

* [Import libraries](#Import-libraries)

* [EDA](#EDA)

* [Data Augmentation](#Data-Augmentation)

* [Build the model](#Build-the-model)

* [Evaluation of the model](#Evaluation-of-the-model)

In [None]:
!pip install -q sumeval==0.2.2
!pip install -q nlpaug==1.1.3
!pip install -q simpletransformers==0.60.9

## Import libraries <a class="anchor" id="Import-libraries"></a>

In [None]:
import gc
import random
import warnings
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.graph_objs as go
from wordcloud import WordCloud
from sklearn.model_selection import train_test_split

import nlpaug.augmenter.word as naw
from sumeval.metrics.rouge import RougeCalculator

import torch
from simpletransformers.t5 import T5Model, T5Args

print(f'PyTorch version: {torch.__version__}')

In [None]:
warnings.simplefilter('ignore')
pd.set_option('display.max_colwidth', 10000)
# Use the GPU when one is available.
cuda = torch.cuda.is_available()

In [None]:
df = pd.read_csv('../input/news-summary/news_summary.csv', encoding='ISO-8859-1').dropna().reset_index(drop=True)
more_df = pd.read_csv('../input/news-summary/news_summary_more.csv', encoding='ISO-8859-1')

## EDA <a class="anchor" id="EDA"></a>

In [None]:
display(df.head(1))
display(more_df.head(1))

In [None]:
# Character length of each column, computed with the vectorized string accessor.
df['headlines_length'] = df['headlines'].str.len()
df['text_length'] = df['text'].str.len()
df['ctext_length'] = df['ctext'].str.len()

more_df['headlines_length'] = more_df['headlines'].str.len()
more_df['text_length'] = more_df['text'].str.len()

In [None]:
print('df headlines length:\n', df['headlines_length'].describe())
print()
print('more_df headlines length:\n', more_df['headlines_length'].describe())

In [None]:
print('df text length:\n', df['text_length'].describe())
print()
print('df ctext length:\n', df['ctext_length'].describe())
print()
print('more_df text length:\n', more_df['text_length'].describe())

In [None]:
df = df.drop(['author', 'date', 'read_more', 'ctext',
              'headlines_length', 'text_length', 'ctext_length'], axis=1)
more_df = more_df.drop(['headlines_length', 'text_length'], axis=1)
df = pd.concat([df, more_df]).reset_index(drop=True)

In [None]:
# https://www.kaggle.com/arthurtok/spooky-nlp-and-topic-modelling-tutorial

all_words = df['text'].str.split(expand=True).unstack().value_counts()
 
# Skip the two most frequent tokens and plot ranks 3 to 50.
# The color array must cover the same slice as x and y.
data = [go.Bar(
    x=all_words.index.values[2:50],
    y=all_words.values[2:50],
    marker=dict(colorscale='Jet',
                color=all_words.values[2:50]),
    text='Word counts'
)]

layout = go.Layout(
    title='Word frequencies in the dataset (ranks 3 to 50)'
)

fig = go.Figure(data=data, layout=layout)

fig.show()

In [None]:
wc = WordCloud(width=900, height=600)

wc.generate(','.join(df['headlines']))
plt.figure(figsize=(18,13))
plt.imshow(wc)
plt.axis('off')
plt.title('headlines word cloud', fontdict={'fontsize': 20})

wc.generate(','.join(df['text']))
plt.figure(figsize=(18,13))
plt.imshow(wc)
plt.axis('off')
plt.title('text word cloud', fontdict={'fontsize': 20})

plt.show()

In [None]:
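# Simple Transformers' T5 model expects the columns `prefix`, `input_text`, and `target_text`.
df = df.rename(columns={'text': 'input_text', 'headlines': 'target_text'}).reindex(columns=['input_text', 'target_text'])
df['prefix'] = ''

# Hold out 20% for testing, then 20% of the remainder for validation
# (roughly 64% train / 16% valid / 20% test overall).
train, test = train_test_split(df, test_size=0.2, random_state=42)
train, valid = train_test_split(train, test_size=0.2, random_state=42)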

## Data Augmentation <a class="anchor" id="Data-Augmentation"></a>

<img src="https://github.com/makcedward/nlpaug/blob/master/res/logo_small.png?raw=true" style="height: 300px; width: 300px;  object-position: 0px;"/>

We use nlpaug to augment the training text.

* [nlpaug documentation]

[nlpaug documentation]: https://nlpaug.readthedocs.io/en/latest/

* [nlpaug GitHub repository]

[nlpaug GitHub repository]: https://github.com/makcedward/nlpaug

In [None]:
aug = naw.SynonymAug(aug_src='wordnet')
augmented_text = aug.augment(list(train['input_text'].head(1)))
print("Original:")
print(','.join(train.head(1)['input_text'].values))
print()
print("Augmented Text:")
print(','.join(augmented_text))

After augmenting the training texts, we append the augmented copies to the original training data. The targets are left unchanged, so the number of training rows doubles.
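
As a toy illustration of this combination step, the sketch below replaces words using a small hand-written synonym table and appends the augmented rows to the originals. The `SYNONYMS` table, the `synonym_augment` helper, and the sample rows are all hypothetical and exist only for this example; nlpaug's `SynonymAug` looks synonyms up in WordNet and its selection logic differs.

```python
import random

# Hypothetical synonym table, for illustration only.
# nlpaug's SynonymAug draws synonyms from WordNet instead.
SYNONYMS = {
    "big": ["large", "huge"],
    "quick": ["fast", "rapid"],
    "says": ["states", "reports"],
}


def synonym_augment(text, rng, prob=0.5):
    """Replace each word that has a known synonym with probability `prob`."""
    words = []
    for word in text.split():
        choices = SYNONYMS.get(word.lower())
        if choices and rng.random() < prob:
            words.append(rng.choice(choices))
        else:
            words.append(word)
    return " ".join(words)


rng = random.Random(42)

# Made-up rows standing in for the training data.
train_rows = [
    {"input_text": "Minister says big reform is coming", "target_text": "Big reform coming"},
    {"input_text": "A quick vote ended the debate", "target_text": "Vote ends debate"},
]

# Augment only the inputs; each augmented row keeps its original target.
augmented_rows = [
    {"input_text": synonym_augment(row["input_text"], rng), "target_text": row["target_text"]}
    for row in train_rows
]

combined = train_rows + augmented_rows
print(len(train_rows), len(combined))
```

Because only `input_text` is rewritten, the model sees two differently worded inputs for every target, which is the same effect the notebook aims for on the real training set.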

In [None]:
train = pd.concat([
    train,
    pd.DataFrame({
        'input_text': naw.SynonymAug(aug_src='wordnet').augment(list(train['input_text'])),
        'target_text': list(train['target_text']),
        'prefix': '',
    }),
])

## Build the model <a class="anchor" id="Build-the-model"></a>

<img src="https://repository-images.githubusercontent.com/212747520/6ef26800-0982-11ea-8476-80e5c7b4d3c4" style="height: 250px; width: 500px;  object-position: 0px;"/>

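We use Simple Transformers to build the model.

* [Simple Transformers documentation]

[Simple Transformers documentation]: https://simpletransformers.ai/

* [Simple Transformers GitHub repository]

[Simple Transformers GitHub repository]: https://github.com/ThilinaRajapakse/simpletransformers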

### Training

In [None]:
train_params = {
    'max_seq_length': 512,
    'max_length': 128,
    'train_batch_size': 8,
    'eval_batch_size': 8,
    'num_train_epochs': 2,
    'evaluate_during_training': True,
    'evaluate_during_training_steps': 10000,
    'use_multiprocessing': False,
    'fp16': False,
    'save_steps': -1,
    'save_eval_checkpoints': False,
    'save_model_every_epoch': False,
    'no_cache': True,
    'reprocess_input_data': True,
    'overwrite_output_dir': True,
    'preprocess_inputs': False,
    'num_return_sequences': 1 
}

model = T5Model('t5', 't5-small', args=train_params, use_cuda=cuda)
model.train_model(train, eval_data=valid)
gc.collect()

### Predict

In [None]:
pred_params = {
    'max_seq_length': 512,
    'use_multiprocessed_decoding': False
}

# Reload the best checkpoint saved during training.
model = T5Model('t5', 'outputs/best_model', args=pred_params, use_cuda=cuda)
pred = model.predict(list(test['input_text']))

In [None]:
random.sample(pred, 5)

## Evaluation of the model <a class="anchor" id="Evaluation-of-the-model"></a>

<img src="https://raw.githubusercontent.com/chakki-works/sumeval/master/doc/top.png" style="height: 150px; width: 600px;  object-position: 0px;"/>

We evaluate the model with ROUGE scores computed by [sumeval].

[sumeval]: https://github.com/chakki-works/sumeval

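ROUGE-1: Measures the overlap of uni-grams (single words) between the generated text and the reference.

ROUGE-2: Measures the overlap of bi-grams (pairs of adjacent words) between the generated text and the reference.

ROUGE-L: Measures the longest common subsequence shared by the generated text and the reference, so it rewards words that appear in the same order even when they are not adjacent.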
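
To make the n-gram and subsequence matching concrete, here is a minimal pure-Python sketch of the three scores as F1 values. It is an illustration, not sumeval's implementation: it tokenizes on whitespace and does no stop-word removal or stemming, so its numbers will generally differ from what `RougeCalculator` reports. The function names and the example sentences are made up for this sketch.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def f1(match, summary_total, reference_total):
    """Harmonic mean of precision and recall, 0.0 when nothing matches."""
    if match == 0:
        return 0.0
    precision = match / summary_total
    recall = match / reference_total
    return 2 * precision * recall / (precision + recall)


def rouge_n_f1(summary, reference, n):
    """F1 of the clipped n-gram overlap between two strings."""
    summary_counts = Counter(ngrams(summary.lower().split(), n))
    reference_counts = Counter(ngrams(reference.lower().split(), n))
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((summary_counts & reference_counts).values())
    return f1(overlap, sum(summary_counts.values()), sum(reference_counts.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    previous = [0] * (len(b) + 1)
    for token_a in a:
        current = [0]
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def rouge_l_f1(summary, reference):
    """F1 based on the longest common subsequence of the two strings."""
    summary_tokens = summary.lower().split()
    reference_tokens = reference.lower().split()
    lcs = lcs_length(summary_tokens, reference_tokens)
    return f1(lcs, len(summary_tokens), len(reference_tokens))


summary = "the cat sat on the mat"
reference = "the cat lay on the mat"

print("ROUGE-1:", round(rouge_n_f1(summary, reference, 1), 3))
print("ROUGE-2:", round(rouge_n_f1(summary, reference, 2), 3))
print("ROUGE-L:", round(rouge_l_f1(summary, reference), 3))
```

In this example five of the six uni-grams match but only three of the five bi-grams do, which is why ROUGE-2 is usually the strictest of the three scores.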

In [None]:
rouge = RougeCalculator(stopwords=True, lang="en")

def rouge_calc(preds, targets):
    """Return the mean ROUGE-1, ROUGE-2, and ROUGE-L over all prediction/target pairs."""
    pairs = list(zip(preds, targets))
    rouge_1 = [rouge.rouge_n(summary=p, references=t, n=1) for p, t in pairs]
    rouge_2 = [rouge.rouge_n(summary=p, references=t, n=2) for p, t in pairs]
    rouge_l = [rouge.rouge_l(summary=p, references=t) for p, t in pairs]

    return {"Rouge_1": np.mean(rouge_1),
            "Rouge_2": np.mean(rouge_2),
            "Rouge_L": np.mean(rouge_l)}

In [None]:
rouge_calc(pred, list(test['target_text']))