# 8. Text Summarization
In this notebook we will build models that summarize the protocols from this track. We will explore two types of summarization:
* Extractive summarization: This approach detects the important sections of the input text and uses them to build a summary that is a subset of the original text. No new text is generated with this technique.
* Abstractive summarization: This approach conveys the important information of the input text using newly generated text instead of copying sentences verbatim.

## Setup
As always, we will be importing common constants of every notebook from the track, and then we will load the protocols dataframe:

In [1]:
%run __init__.py

INFO:root:Starting logger


In [2]:
import os

import pandas as pd

DF_FILE_PATH = os.path.join(NOTEBOOK_2_RESULTS_DIR, 'protocols_dataframe.pkl')

df = pd.read_pickle(DF_FILE_PATH)

The dataframe contains a column with the full text of the protocol without its abstract. Since the idea is to generate a summary of the protocol, and the abstract is in itself an abstractive summary, we will try to infer a good summary without relying on it:

In [3]:
protocols_no_abstract = df['full_text_no_abstract_cleaned'].values
protocols_no_abstract[0][:500]

'Scratch Wound Healing Assay. Grow cells in DMEM supplemented with 10% FBS. Seed cells into 24-well tissue culture plate at a density that after 24 h of growth, they should reach ~70-80% confluence as a monolayer. Do not change the medium. Gently and slowly scratch the monolayer with a new 1 ml pipette tip across the center of the well. While scratching across the surface of the well, the long-axial of the tip should always be perpendicular to the bottom of the well. The resulting gap distance th'

## Extractive summary
For the extractive approach we score each sentence by the normalized frequency of the terms it contains and select the highest-scoring sentences for the final summary:

In [6]:
from collections import Counter
from string import punctuation

import spacy
import en_core_sci_lg


nlp = en_core_sci_lg.load()

def extract_summary(text, max_sentences):
    """Return the `max_sentences` highest-scoring sentences of `text`."""
    pos_tag = {'ADJ', 'NOUN', 'PROPN', 'VERB'}
    doc = nlp(text.lower())

    # Keep content words only: skip stop words and punctuation.
    keyword = [token.text for token in doc
               if token.text not in nlp.Defaults.stop_words
               and token.text not in punctuation
               and token.pos_ in pos_tag]
    if not keyword:
        # Nothing to score (e.g. empty text): avoid indexing an empty list.
        return ''

    # Normalize term frequencies by the count of the most frequent keyword.
    freq_word = Counter(keyword)
    max_freq = freq_word.most_common(1)[0][1]
    for w in freq_word:
        freq_word[w] = freq_word[w] / max_freq

    # Score each sentence as the sum of the frequencies of its keywords.
    sent_score = {}
    for sent in doc.sents:
        for word in sent:
            if word.text in freq_word:
                sent_score[sent] = sent_score.get(sent, 0) + freq_word[word.text]

    # Keep the top-scoring sentences, highest score first.
    ranked = sorted(sent_score.items(), key=lambda kv: kv[1], reverse=True)
    summary = [str(sent).capitalize() for sent, _ in ranked[:max_sentences]]
    return ' '.join(summary)

In [7]:
num_sentences = 4

extractive_summaries = [extract_summary(t, num_sentences)
                        for t in protocols_no_abstract]
extractive_summaries[0]

'Grow cells for additional 48 h (or the time required if different cells are used). Wash the cells twice with 1x pbs, then fix the cells with 3.7% paraformaldehye for 30 min. Seed cells into 24-well tissue culture plate at a density that after 24 h of growth, they should reach ~70-80% confluence as a monolayer. After scratching, gently wash the well twice with medium to remove the detached cells.'
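
To make the scoring step concrete, the following is a minimal, dependency-free sketch of the same idea. It is a simplification and not the function above: `toy_extract_summary`, `_tokens`, the regex-based sentence splitter, and the tiny `STOP_WORDS` set are hypothetical stand-ins for the spaCy pipeline, and no part-of-speech filter is applied.

```python
import re
from collections import Counter

# Hypothetical, tiny stop-word list; the notebook relies on spaCy's list instead.
STOP_WORDS = {"the", "a", "an", "of", "with", "and", "to", "in", "for"}


def _tokens(sentence):
    """Lowercased alphanumeric tokens of a sentence, without stop words."""
    return [w for w in re.findall(r"[a-z0-9]+", sentence.lower())
            if w not in STOP_WORDS]


def toy_extract_summary(text, max_sentences):
    """Rank sentences by the summed normalized frequency of their tokens."""
    # Naive split after '.', '!' or '?' followed by whitespace (an assumption).
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(w for s in sentences for w in _tokens(s))
    if not freq:
        return ""
    max_freq = max(freq.values())
    scores = {s: sum(freq[w] / max_freq for w in _tokens(s)) for s in sentences}
    ranked = sorted(sentences, key=scores.get, reverse=True)
    return " ".join(ranked[:max_sentences])


text = "Seed the cells in a plate. Wash the cells and fix the cells. Take a picture."
print(toy_extract_summary(text, 1))
```

In this toy text the second sentence ranks first because it contains the most frequent term ("cells") twice, which also shows a known bias of the method: longer sentences that repeat frequent terms tend to score higher.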

Since the protocols contain only the materials and the procedure (the list of steps), an extractive summary selects the most important steps and drops redundant information. To obtain a summary closer to an abstract, however, we need to perform abstractive summarization.

## Abstractive summary
For the abstractive summaries we will use a series of sequence-to-sequence models that take the protocol text without the abstract as input and try to produce the abstract as output. First, we will load this data:

In [8]:
x = df['full_text_no_abstract_cleaned'].values
y_true = df['abstract'].values

Now, we will also implement a series of functions which are common to every summarization model used in this section:

In [None]:
import torch

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

DEFAULT_DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

def chunks(lst, n):
    """Yield successive n-sized chunks from lst."""
    for i in range(0, len(lst), n):
        yield lst[i : i + n]

def trim_batch(input_ids, pad_token_id, attention_mask=None):
    """Remove columns that are populated exclusively by pad_token_id"""
    keep_column_mask = input_ids.ne(pad_token_id).any(dim=0)
    if attention_mask is None:
        return input_ids[:, keep_column_mask]
    else:
        return (input_ids[:, keep_column_mask], attention_mask[:, keep_column_mask])

def get_model_predictions(model, tokenizer, x):
    return [_predict(model, tokenizer, doc) for doc in x]

def _predict(model, tokenizer, doc):
    batch = tokenizer(doc, return_tensors="pt", truncation=True, padding="max_length").to(DEFAULT_DEVICE)
    input_ids, attention_mask = trim_batch(**batch, pad_token_id=tokenizer.pad_token_id)
    summaries = model.generate(
        input_ids=input_ids,
        attention_mask=attention_mask,
        decoder_start_token_id=None
    )
    dec = tokenizer.batch_decode(summaries, skip_special_tokens=True, clean_up_tokenization_spaces=False)
    return dec[0]
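
The batching and trimming helpers above can be illustrated without loading a model. The sketch below mirrors `chunks` and the column-trimming logic of `trim_batch` on plain Python lists; `PAD`, `pad_to`, and `trim_pad_columns` are hypothetical names introduced for this illustration, and the token ids are made up.

```python
PAD = 1  # hypothetical pad token id, used only in this illustration


def chunks(lst, n):
    """Yield successive n-sized chunks from lst."""
    for i in range(0, len(lst), n):
        yield lst[i:i + n]


def pad_to(rows, length):
    """Right-pad every row with PAD up to `length` (like padding='max_length')."""
    return [row + [PAD] * (length - len(row)) for row in rows]


def trim_pad_columns(rows):
    """Drop the columns in which every row holds PAD (list version of trim_batch)."""
    keep = [any(row[c] != PAD for row in rows) for c in range(len(rows[0]))]
    return [[value for value, k in zip(row, keep) if k] for row in rows]


docs = [[0, 5, 2], [0, 7, 8, 9, 2], [0, 4, 2]]
for batch in chunks(docs, 2):
    padded = pad_to(batch, 8)        # every row now has 8 columns
    print(trim_pad_columns(padded))  # columns that are all PAD are removed
```

Since `_predict` handles one document at a time, trimming there simply removes the padding added by `padding="max_length"`. With batches, as sketched here, each batch is cut down to the length of its longest row.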


In the following cell we will generate summaries with three different models, all based on [BART](https://ai.facebook.com/research/publications/bart-denoising-sequence-to-sequence-pre-training-for-natural-language-generation-translation-and-comprehension/), a denoising autoencoder for pretraining sequence-to-sequence models:
* facebook/bart-large-cnn: BART model fine-tuned on the [CNN/Dailymail summarization dataset](https://www.tensorflow.org/datasets/catalog/cnn_dailymail). This is a general-purpose model used as a baseline for the other two.
* distillbart_cnn_protocols: A smaller student model built by copying alternating layers of the bart-large-cnn model and then fine-tuning it on a subset of protocols.
* distillbart_xsum_protocols: This model follows the same principles as the previous one, but the base model was trained on the [extreme summarization (XSum) dataset](https://www.tensorflow.org/datasets/catalog/xsum) instead of the CNN one.

Due to the nature of the original datasets of the pretrained models, we expect the cnn variation of distillbart to produce longer summaries and the xsum variation to produce shorter ones. The end user can choose the model according to the desired summary length.
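
One way to check this expectation once predictions are available is to compare the average summary length per model. The helper below is a minimal sketch: `mean_word_count` is a hypothetical name, whitespace splitting is only a rough proxy for length, and the example strings are invented for illustration.

```python
def mean_word_count(summaries):
    """Average number of whitespace-separated words per summary."""
    if not summaries:
        return 0.0
    return sum(len(s.split()) for s in summaries) / len(summaries)


# Invented examples standing in for two lists of predicted summaries.
toy_results = {
    "long_style": ["Cells are seeded and scratched. The gap is imaged over time."],
    "short_style": ["A scratch assay protocol."],
}
lengths = {name: mean_word_count(preds) for name, preds in toy_results.items()}
print(lengths)
```

The same call could be applied to each entry of `model_results` after it has been populated.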

In [None]:
base_model_dir = os.path.join(DATA_DIR, 'text_summarization_models')
models_names = ['facebook/bart-large-cnn',
                'distillbart_cnn_protocols',
                'distillbart_xsum_protocols']

model_results = {}
for name in models_names:
    model_path = name if 'distillbart' not in name \
                      else os.path.join(base_model_dir, name, 'best_tfmr')
    model = AutoModelForSeq2SeqLM.from_pretrained(model_path).to(DEFAULT_DEVICE)
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model_results[name] = get_model_predictions(model, tokenizer, x)

# add extractive results
model_results['tf_extractive'] = extractive_summaries

Now that every model has produced its summaries, we will select a sample protocol to see what the output looks like:

In [None]:
import random

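# Pick any valid protocol index; a fixed upper bound of 100 would raise an
# IndexError whenever fewer than 101 protocols are loaded.
result_sample_idx = random.randrange(len(x))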

for model_name, results in model_results.items():
    print(model_name)
    print(results[result_sample_idx])
    print('\n')

## Evaluation
Now that we have obtained a predicted summary from every model, we will evaluate the results by comparing them to the abstract of each protocol. To do so, we will compute the ROUGE-1 and ROUGE-L scores of each model:

In [None]:
import numpy as np

from rouge_score import rouge_scorer

def _compute_mean(scores):
    precision = [score.precision for score in scores]
    recall = [score.recall for score in scores]
    fmeasure = [score.fmeasure for score in scores]
    return {
        'precision': np.mean(precision),
        'recall': np.mean(recall),
        'fmeasure': np.mean(fmeasure)
    }

def compute_rouge_scores(y_true, y_pred):
    """Mean ROUGE-1 and ROUGE-L of the predictions against the references."""
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer=True)
    rouge1_scores = []
    rougel_scores = []

    for target, prediction in zip(y_true, y_pred):
        # RougeScorer.score expects (target, prediction), in that order.
        rouge_score = scorer.score(target, prediction)
        rouge1_scores.append(rouge_score['rouge1'])
        rougel_scores.append(rouge_score['rougeL'])

    return {
        'rouge1': _compute_mean(rouge1_scores),
        'rougeL': _compute_mean(rougel_scores)
    }


In [None]:
model_scores = {}
for model_name, predictions in model_results.items():
    model_scores[model_name] = compute_rouge_scores(y_true, predictions)

In [None]:
for model_name, scores in model_scores.items():
    print(model_name)
    print(scores)
    print('\n')
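
For intuition about what ROUGE-L measures, the sketch below computes it from the longest common subsequence (LCS) of two token lists. It is a simplified illustration and not the `rouge_score` implementation: `lcs_length` and `toy_rouge_l` are hypothetical names, tokens are obtained by lowercasing and whitespace splitting, and no stemming is applied.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def toy_rouge_l(target, prediction):
    """Precision, recall and F1 based on the LCS of the two texts."""
    t, p = target.lower().split(), prediction.lower().split()
    lcs = lcs_length(t, p)
    precision = lcs / len(p) if p else 0.0
    recall = lcs / len(t) if t else 0.0
    f1 = 2 * precision * recall / (precision + recall) if lcs else 0.0
    return precision, recall, f1


target = "cells are scratched and imaged"
prediction = "cells are imaged"
print(toy_rouge_l(target, prediction))
```

Here all three predicted tokens appear in the target in the same order, so precision is 3/3, while recall is 3/5 because two target tokens are missing from the prediction.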

If we take a look at the results, the _distillbart_cnn_protocols_ model obtains the best F1 scores, followed by _distillbart_xsum_protocols_. Both the general-purpose BART model and the TF extractive approach score lower.

Regarding the differences between the xsum and cnn distillbart models, given the difference in predicted summary length discussed earlier, we expect the xsum variation to obtain a higher precision and the cnn one a higher recall. Depending on the needs of the final summary, one model can be selected over the other. However, the cnn variation obtains the best F1 score overall and is the sensible default when in doubt.
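
The link between summary length, precision, and recall can be seen with a simplified unigram-overlap score in the spirit of ROUGE-1. This is a sketch under stated assumptions and not the `rouge_score` implementation: `toy_rouge1` is a hypothetical helper, tokens are obtained by whitespace splitting without stemming, and the three texts are invented.

```python
from collections import Counter


def toy_rouge1(target, prediction):
    """Unigram-overlap precision, recall and F1 between two texts."""
    t = Counter(target.lower().split())
    p = Counter(prediction.lower().split())
    overlap = sum((t & p).values())  # clipped count of shared unigrams
    precision = overlap / sum(p.values()) if p else 0.0
    recall = overlap / sum(t.values()) if t else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1


abstract = "this protocol measures cell migration with a scratch assay"
short_summary = "scratch assay protocol"
long_summary = ("this protocol measures cell migration after cells are "
                "grown washed fixed and stained")

short_p, short_r, _ = toy_rouge1(abstract, short_summary)
long_p, long_r, _ = toy_rouge1(abstract, long_summary)
print(f"short: precision={short_p:.2f} recall={short_r:.2f}")
print(f"long:  precision={long_p:.2f} recall={long_r:.2f}")
```

In this toy case every word of the short summary occurs in the abstract, which gives it the higher precision, while the long summary covers more of the abstract's words and so gets the higher recall.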

## Saving the results
Finally, we will be saving the results obtained in this notebook. We will first save the predictions made by each model, and later on their respective scores.

### Predictions

In [None]:
results_df = df[['pr_id', 'title', 'abstract']]
for model_name, predictions in model_results.items():
    results_df = results_df.assign(**{model_name: predictions})
results_df.head()

In [None]:
PREDICTIONS_OUTPUT_PATH = os.path.join(NOTEBOOK_8_RESULTS_DIR, 'predictions.csv')
results_df.to_csv(PREDICTIONS_OUTPUT_PATH, index=False)

### Rouge scores

In [None]:
scores_df = pd.DataFrame({
    'model': list(model_scores.keys()),
    'rouge1_prec': [v['rouge1']['precision'] for v in model_scores.values()],
    'rouge1_recall': [v['rouge1']['recall'] for v in model_scores.values()],
    'rouge1_fmeasure': [v['rouge1']['fmeasure'] for v in model_scores.values()],
    'rougeL_prec': [v['rougeL']['precision'] for v in model_scores.values()],
    'rougeL_recall': [v['rougeL']['recall'] for v in model_scores.values()],
    'rougeL_fmeasure': [v['rougeL']['fmeasure'] for v in model_scores.values()],
})
scores_df

In [None]:
SCORES_OUTPUT_PATH = os.path.join(NOTEBOOK_8_RESULTS_DIR, 'scores.csv')
scores_df.to_csv(SCORES_OUTPUT_PATH, index=False)