# 8. Text Summarization
In this notebook we build models that produce a summary of each protocol from this track. We explore two summarization approaches:
* Extractive summarization: This approach detects the most important sections of the input text and uses them verbatim to build the summary, which is therefore a subset of the original text. No new text is generated with this technique.
* Abstractive summarization: This approach conveys the important information from the input text but generates new text to do so, instead of copying sentences.

## Setup
As always, we import the constants shared by every notebook in the track and then load the protocols dataframe:

In [1]:
%run __init__.py

INFO:root:Starting logger


In [2]:
import pandas as pd

DF_FILE_PATH = os.path.join(NOTEBOOK_2_RESULTS_DIR, 'protocols_dataframe.pkl')

df = pd.read_pickle(DF_FILE_PATH)

The dataframe contains a column with the full text of the protocol without its abstract. Since the idea is to generate a summary of the protocol, and the abstract is in itself an abstractive summary, we will try to infer a good summary without relying on it:

In [3]:
protocols_no_abstract = df['full_text_no_abstract_cleaned'].values
protocols_no_abstract[0][:500]

'Scratch Wound Healing Assay. Grow cells in DMEM supplemented with 10% FBS. Seed cells into 24-well tissue culture plate at a density that after 24 h of growth, they should reach ~70-80% confluence as a monolayer. Do not change the medium. Gently and slowly scratch the monolayer with a new 1 ml pipette tip across the center of the well. While scratching across the surface of the well, the long-axial of the tip should always be perpendicular to the bottom of the well. The resulting gap distance th'

## Extractive summary
For the extractive approach we score each sentence by the normalized frequency of the terms it contains, and the highest-scoring sentences are then added to the final summary:

In [4]:
from collections import Counter
from string import punctuation

import spacy
import en_core_sci_lg


nlp = en_core_sci_lg.load()

def extract_summary(text, max_sentences):
    """Return the max_sentences highest-scoring sentences of text, best first."""
    pos_tag = ['ADJ', 'NOUN', 'PROPN', 'VERB']
    doc = nlp(text.lower())

    # Collect content words, skipping stop words and punctuation.
    keyword = []
    for token in doc:
        if token.text in nlp.Defaults.stop_words or token.text in punctuation:
            continue
        if token.pos_ in pos_tag:
            keyword.append(token.text)

    # Without keywords there is nothing to score.
    if not keyword:
        return ''

    # Normalize term frequencies by the count of the most frequent term.
    freq_word = Counter(keyword)
    max_freq = freq_word.most_common(1)[0][1]
    for w in freq_word:
        freq_word[w] = freq_word[w] / max_freq

    # Score each sentence as the sum of the normalized frequencies of its words.
    sent_score = {}
    for sent in doc.sents:
        for word in sent:
            if word.text in freq_word:
                sent_score[sent] = sent_score.get(sent, 0) + freq_word[word.text]

    sorted_x = sorted(sent_score.items(), key=lambda kv: kv[1], reverse=True)

    # Keep only the top max_sentences sentences.
    summary = [str(sent).capitalize() for sent, _ in sorted_x[:max_sentences]]
    return ' '.join(summary)

In [5]:
num_sentences = 4

extractive_summaries = [extract_summary(t, num_sentences)
                        for t in protocols_no_abstract]
extractive_summaries[0]

'Grow cells for additional 48 h (or the time required if different cells are used). Wash the cells twice with 1x pbs, then fix the cells with 3.7% paraformaldehye for 30 min. Seed cells into 24-well tissue culture plate at a density that after 24 h of growth, they should reach ~70-80% confluence as a monolayer. After scratching, gently wash the well twice with medium to remove the detached cells.'

Since the protocols contain only the materials and procedure (list of steps of the protocol), an extractive summary will try to select the most important steps from the protocol and remove any redundant information (if any). However, if we want to obtain a summary closer to an abstract, we will need to perform abstractive summarization.
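To make the sentence scoring used by the extractive approach concrete, the following is a minimal, dependency-free sketch of the same idea. It is not the notebook's implementation: the `toy_extract_summary` name, the small `STOP_WORDS` set and the regex-based sentence splitting are assumptions made for illustration, and there is no part-of-speech filtering.

```python
import re
from collections import Counter

# Hypothetical stop-word list for illustration; the notebook relies on spaCy's list.
STOP_WORDS = {"the", "a", "an", "of", "in", "with", "and", "to", "for", "at"}


def toy_extract_summary(text, max_sentences):
    # Naive sentence split: break on whitespace that follows '.', '!' or '?'.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    tokenized = [re.findall(r"[a-z0-9]+", s.lower()) for s in sentences]

    # Count every non-stop-word term over the whole document.
    counts = Counter(w for words in tokenized for w in words if w not in STOP_WORDS)
    if not counts:
        return ""
    max_count = max(counts.values())

    # Score each sentence as the sum of its normalized term frequencies.
    scores = [sum(counts[w] / max_count for w in words if w in counts)
              for words in tokenized]

    # Keep the highest-scoring sentences, best first.
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return " ".join(sentences[i] for i in ranked[:max_sentences])


toy_text = ("Seed cells in a plate. Wash the cells twice and fix the cells. "
            "Image the plate.")
toy_summary = toy_extract_summary(toy_text, 1)
print(toy_summary)
```

Because the score is a sum over words, longer sentences that repeat frequent terms tend to rank higher, which is worth keeping in mind when reading the extractive summaries.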

## Abstractive summary
For the abstractive summaries we will train a series of models with the protocol text without the abstract, and try to infer the abstract as an output. First, we will load this data:

In [6]:
x = df['full_text_no_abstract_cleaned'].values
y_true = df['abstract'].values

Now, we will also implement a series of functions which are common to every summarization model used in this section:

In [7]:
import torch

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

DEFAULT_DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

def chunks(lst, n):
    """Yield successive n-sized chunks from lst."""
    for i in range(0, len(lst), n):
        yield lst[i : i + n]

def trim_batch(input_ids, pad_token_id, attention_mask=None):
    """Remove columns that are populated exclusively by pad_token_id"""
    keep_column_mask = input_ids.ne(pad_token_id).any(dim=0)
    if attention_mask is None:
        return input_ids[:, keep_column_mask]
    else:
        return (input_ids[:, keep_column_mask], attention_mask[:, keep_column_mask])

def get_model_predictions(model, tokenizer, x):
    return [_predict(model, tokenizer, doc) for doc in x]

def _predict(model, tokenizer, doc):
    batch = tokenizer(doc, return_tensors="pt", truncation=True, padding="max_length").to(DEFAULT_DEVICE)
    input_ids, attention_mask = trim_batch(**batch, pad_token_id=tokenizer.pad_token_id)
    summaries = model.generate(
        input_ids=input_ids,
        attention_mask=attention_mask,
        decoder_start_token_id=None
    )
    dec = tokenizer.batch_decode(summaries, skip_special_tokens=True, clean_up_tokenization_spaces=False)
    return dec[0]


INFO:transformers.file_utils:PyTorch version 1.6.0+cpu available.
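As a rough illustration of what two of the helpers above do, here is a sketch that runs without PyTorch. `chunks` is copied from the cell above, while `trim_padding_columns` is a hypothetical pure-Python analogue of `trim_batch` that works on nested lists instead of tensors. The token ids and document names are made up.

```python
def chunks(lst, n):
    """Yield successive n-sized chunks from lst."""
    for i in range(0, len(lst), n):
        yield lst[i : i + n]


def trim_padding_columns(rows, pad_id):
    """List-based analogue of trim_batch: drop columns that only contain pad_id."""
    if not rows:
        return []
    # A column is kept if at least one row has a non-pad token in it.
    keep = [any(row[j] != pad_id for row in rows) for j in range(len(rows[0]))]
    return [[tok for tok, k in zip(row, keep) if k] for row in rows]


# Hypothetical stand-ins for protocol texts, grouped into batches of two.
docs = ["protocol A", "protocol B", "protocol C", "protocol D", "protocol E"]
batches = list(chunks(docs, 2))

# Two padded sequences with pad id 1; the last two columns are pure padding.
trimmed = trim_padding_columns([[5, 6, 1, 1], [7, 1, 1, 1]], pad_id=1)
print(batches)
print(trimmed)
```

Trimming matters here because `_predict` pads every document to the maximum length, so removing all-padding columns avoids feeding unnecessary positions to `model.generate`.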


In the following cell we obtain the summaries from three different models. All three are based on [BART](https://ai.facebook.com/research/publications/bart-denoising-sequence-to-sequence-pre-training-for-natural-language-generation-translation-and-comprehension/), a denoising autoencoder for pretraining sequence-to-sequence models:
* facebook/bart-large-cnn: BART model fine-tuned on the [CNN/Dailymail summarization dataset](https://www.tensorflow.org/datasets/catalog/cnn_dailymail). This is a general-purpose model used as a baseline for the following ones.
* distillbart_cnn_protocols: A distilled model initialized by copying alternating layers of the bart-large-cnn model and then fine-tuned on a subset of protocols.
* distillbart_xsum_protocols: This model follows the same principles as the previous one, but the base model was trained on the [extreme summarization (XSum) dataset](https://www.tensorflow.org/datasets/catalog/xsum) instead of the CNN one.

Because of the datasets used to train the base models, we expect the cnn variation of distillbart to produce longer summaries and the xsum variation to produce shorter ones. The end user can choose the model according to the desired summary length.
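One simple way to check the expected length difference is to compare the mean word count of the summaries produced by each model. The sketch below uses made-up toy summaries and a hypothetical `mean_word_count` helper. In the notebook, the same function could be applied to each list of predictions once they have been generated.

```python
from statistics import mean


def mean_word_count(summaries):
    """Average number of whitespace-separated words per summary."""
    return mean(len(s.split()) for s in summaries)


# Hypothetical toy outputs, NOT real model predictions.
toy_results = {
    "cnn_style": [
        "This protocol describes the assay in detail. It lists each step.",
        "Cells are seeded and imaged. Results are quantified afterwards.",
    ],
    "xsum_style": [
        "A wound healing assay is described.",
        "Cells are imaged.",
    ],
}

length_by_model = {name: mean_word_count(s) for name, s in toy_results.items()}
print(length_by_model)
```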

In [8]:
base_model_dir = os.path.join(DATA_DIR, 'text_summarization_models')
models_names = ['facebook/bart-large-cnn',
                'distillbart_cnn_protocols',
                'distillbart_xsum_protocols']

model_results = {}
for name in models_names:
    model_path = name if 'distillbart' not in name \
                      else os.path.join(os.path.join(base_model_dir, name), 'best_tfmr')
    model = AutoModelForSeq2SeqLM.from_pretrained(model_path).to(DEFAULT_DEVICE)
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model_results[name] = get_model_predictions(model, tokenizer, x)

# add extractive results (term-frequency scoring; the key matches the name shown in the outputs below)
model_results['tfidf_extractive'] = extractive_summaries

INFO:transformers.configuration_utils:loading configuration file https://s3.amazonaws.com/models.huggingface.co/bert/facebook/bart-large-cnn/config.json from cache at C:\Users\alex/.cache\torch\transformers\5f0de1d2bbb8eb1a3b69656622293b3328b06b701663a9d4109359751cb4e739.5e72c6158467741b29afbcad014cd97414f17a191d39253eef90d7bfe969cc1f
INFO:transformers.configuration_utils:Model config BartConfig {
  "_num_labels": 3,
  "activation_dropout": 0.0,
  "activation_function": "gelu",
  "add_bias_logits": false,
  "add_final_layer_norm": false,
  "attention_dropout": 0.0,
  "bos_token_id": 0,
  "classif_dropout": 0.0,
  "d_model": 1024,
  "decoder_attention_heads": 16,
  "decoder_ffn_dim": 4096,
  "decoder_layerdrop": 0.0,
  "decoder_layers": 12,
  "decoder_start_token_id": 2,
  "dropout": 0.1,
  "early_stopping": true,
  "encoder_attention_heads": 16,
  "encoder_ffn_dim": 4096,
  "encoder_layerdrop": 0.0,
  "encoder_layers": 12,
  "eos_token_id": 2,
  "extra_pos_embeddings": 2,
  "id2label":

INFO:transformers.tokenization_utils_base:Model name 'E:\hercules\hercules-challenge-protocols\data\text_summarization_models\distillbart_cnn_protocols\best_tfmr' not found in model shortcut name list (facebook/bart-base, facebook/bart-large, facebook/bart-large-mnli, facebook/bart-large-cnn, facebook/bart-large-xsum, yjernite/bart_eli5). Assuming 'E:\hercules\hercules-challenge-protocols\data\text_summarization_models\distillbart_cnn_protocols\best_tfmr' is a path, a model identifier, or url to a directory containing tokenizer files.
INFO:transformers.tokenization_utils_base:Didn't find file E:\hercules\hercules-challenge-protocols\data\text_summarization_models\distillbart_cnn_protocols\best_tfmr\added_tokens.json. We won't load it.
INFO:transformers.tokenization_utils_base:Didn't find file E:\hercules\hercules-challenge-protocols\data\text_summarization_models\distillbart_cnn_protocols\best_tfmr\tokenizer.json. We won't load it.
INFO:transformers.tokenization_utils_base:loading file

Now that every model has produced its predictions, we will select a sample protocol to see what the output looks like:

In [10]:
import random

# Pick a random valid index, whatever the number of protocols
result_sample_idx = random.randrange(len(x))

for model_name, results in model_results.items():
    print(model_name)
    print(results[result_sample_idx])
    print('\n')

facebook/bart-large-cnn
Gallyas Silver Impregnation of Myelinated Nerve Fibers. Carefully remove the brain or spinal cord and post-fix it in 4% PFA for tissue for one hour up to overnight at 4 °C in a 50 ml Falcon. Embed the brain in paraffin by using an automated embedding machine for best results.


distillbart_cnn_protocols
 Myelinated Nerve Fibers are a multilayly, specialized, specialized membrane membrane membrane of the brain and spinal cord. Myelin is a critical component of the neurobiological process that is critical to the survival of myelin fibers in the brain. Here, we describe the protocol of paraffin embedding and microtome sectioning of brain tissue using Gallyas silver impregnation. This protocol provides a detailed understanding of the molecular molecular process of myelinated fibers.


distillbart_xsum_protocols
 The Gallyas silver impregnation of myelinated nerve fibers (Gallyas Silver Impregnation) is described in a series of microtome sections of the brain. This p

## Evaluation
Now that we have a predicted summary from every model, we will evaluate the results by comparing them to the abstract of each protocol. To do so, we will compute the ROUGE-1 (unigram overlap) and ROUGE-L (longest common subsequence) scores of each model:

In [23]:
import numpy as np

from rouge_score import rouge_scorer

def _compute_mean(scores):
    precision = [score.precision for score in scores]
    recall = [score.recall for score in scores]
    fmeasure = [score.fmeasure for score in scores]
    return {
        'precision': np.mean(precision),
        'recall': np.mean(recall),
        'fmeasure': np.mean(fmeasure)
    }

def compute_rouge_scores(predictions, references):
    """Mean ROUGE-1 and ROUGE-L of the predictions against the reference abstracts."""
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer=True)
    rouge1_scores = []
    rougel_scores = []

    for reference, prediction in zip(references, predictions):
        # RougeScorer.score expects the target (reference) first, then the prediction
        rouge_score = scorer.score(reference, prediction)
        rouge1_scores.append(rouge_score['rouge1'])
        rougel_scores.append(rouge_score['rougeL'])

    return {
        'rouge1': _compute_mean(rouge1_scores),
        'rougeL': _compute_mean(rougel_scores)
    }
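To clarify what these metrics measure, here is a simplified, dependency-free sketch of ROUGE-1 and ROUGE-L. It is an assumption-laden approximation and not the `rouge_score` implementation: tokens are obtained with a plain lowercase whitespace split and no stemming is applied, so the values will not match the library exactly. The `toy_rouge1` and `toy_rougeL` names are hypothetical.

```python
from collections import Counter


def _prf(overlap, n_pred, n_ref):
    """Precision, recall and F-measure from an overlap count."""
    precision = overlap / n_pred if n_pred else 0.0
    recall = overlap / n_ref if n_ref else 0.0
    total = precision + recall
    fmeasure = 2 * precision * recall / total if total else 0.0
    return precision, recall, fmeasure


def toy_rouge1(reference, prediction):
    """Unigram overlap between reference and prediction (clipped counts)."""
    ref, pred = reference.lower().split(), prediction.lower().split()
    overlap = sum((Counter(ref) & Counter(pred)).values())
    return _prf(overlap, len(pred), len(ref))


def toy_rougeL(reference, prediction):
    """Longest common subsequence between reference and prediction."""
    ref, pred = reference.lower().split(), prediction.lower().split()
    prev = [0] * (len(pred) + 1)
    for r in ref:
        curr = [0]
        for j, p in enumerate(pred, start=1):
            curr.append(prev[j - 1] + 1 if r == p else max(prev[j], curr[j - 1]))
        prev = curr
    return _prf(prev[-1], len(pred), len(ref))


reference = "the cells are fixed and stained"
prediction = "cells are stained and fixed"
print(toy_rouge1(reference, prediction))
print(toy_rougeL(reference, prediction))
```

In this toy pair every predicted word appears in the reference, so ROUGE-1 precision is 1.0, while ROUGE-L is lower because the word order differs and only a three-word subsequence is shared.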


In [38]:
model_scores = {}
for model_name, predictions in model_results.items():
    model_scores[model_name] = compute_rouge_scores(predictions, y_true)

In [34]:
for model_name, scores in model_scores.items():
    print(model_name)
    print(scores)
    print('\n')

facebook/bart-large-cnn
{'rouge1': {'precision': 0.35190216760012255, 'recall': 0.1970650603655765, 'fmeasure': 0.24239708155727885}, 'rougeL': {'precision': 0.2222215001177071, 'recall': 0.12408936764926332, 'fmeasure': 0.15280770125052318}}


distillbart_cnn_protocols
{'rouge1': {'precision': 0.474512108884034, 'recall': 0.3355643603445984, 'fmeasure': 0.3774299921065759}, 'rougeL': {'precision': 0.2692036410087131, 'recall': 0.19382174285485979, 'fmeasure': 0.21693775860376777}}


distillbart_xsum_protocols
{'rouge1': {'precision': 0.5958868276552898, 'recall': 0.2338036468809735, 'fmeasure': 0.32253741512607187}, 'rougeL': {'precision': 0.38956795161294017, 'recall': 0.15703934413935447, 'fmeasure': 0.2147638629198031}}


tfidf_extractive
{'rouge1': {'precision': 0.2284128492474408, 'recall': 0.3318414622415259, 'fmeasure': 0.24812294545758565}, 'rougeL': {'precision': 0.13074264248047132, 'recall': 0.18895182529229163, 'fmeasure': 0.1404100208081965}}




Looking at the results, the _distillbart_cnn_protocols_ model obtains the best F-measure for both ROUGE-1 and ROUGE-L, followed by _distillbart_xsum_protocols_. Both the general-purpose BART model and the term-frequency extractive approach score lower.

Regarding the differences between the xsum and cnn distillbart models, the difference in predicted summary length explained before makes this outcome expected: the shorter xsum summaries obtain a higher precision, while the longer cnn summaries obtain a higher recall. Depending on the needs of the final summary, one model could be selected over the other. However, the cnn variation obtains the best F-measure overall and should be the default choice when in doubt.
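The selection logic described above can be sketched as a small helper that picks the model with the highest value of a given metric. The ROUGE-1 values are rounded from the scores printed earlier in this section, and the `best_model` function is a hypothetical helper, not part of the notebook.

```python
# ROUGE-1 values rounded to three decimals from the scores printed above.
rouge1_scores = {
    "facebook/bart-large-cnn": {"precision": 0.352, "recall": 0.197, "fmeasure": 0.242},
    "distillbart_cnn_protocols": {"precision": 0.475, "recall": 0.336, "fmeasure": 0.377},
    "distillbart_xsum_protocols": {"precision": 0.596, "recall": 0.234, "fmeasure": 0.323},
    "tfidf_extractive": {"precision": 0.228, "recall": 0.332, "fmeasure": 0.248},
}


def best_model(scores, metric="fmeasure"):
    """Return the name of the model with the highest value for metric."""
    return max(scores, key=lambda name: scores[name][metric])


default_choice = best_model(rouge1_scores)                  # balanced choice
most_precise = best_model(rouge1_scores, "precision")       # favors short summaries
highest_recall = best_model(rouge1_scores, "recall")        # favors coverage
print(default_choice, most_precise, highest_recall)
```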

## Saving the results
Finally, we will be saving the results obtained in this notebook. We will first save the predictions made by each model, and later on their respective scores.

### Predictions

In [29]:
results_df = df[['pr_id', 'title', 'abstract']]
for model_name, predictions in model_results.items():
    results_df = results_df.assign(**{model_name: predictions})
results_df.head()

| | pr_id | title | abstract | facebook/bart-large-cnn | distillbart_cnn_protocols | distillbart_xsum_protocols | tfidf_extractive |
|---|---|---|---|---|---|---|---|
| 0 | e100 | Scratch Wound Healing Assay | The scratch wound healing assay has been widel... | Scratch Wound Healing Assay. Grow cells in DME... | The scratch wound healing assay is a useful t... | Scratch wound healing assay (Scratch Wound He... | Grow cells for additional 48 h (or the time re... |
| 1 | e1029 | ADCC Assay Protocol | Antibody-dependent cell-mediated cytotoxicity ... | ADCC Assay Protocol. Infect Raji cells at a MO... | Antibody-dependent cell-mediated cytotoxicity... | Antibody-dependent cell-mediated cytotoxicity... | 1 2 3 4 5 6 7 unstained target cells unstained... |
| 2 | e1072 | Catalase Activity Assay in Candida glabrata | Commensal and pathogenic fungi are exposed to ... | Catalase Activity Assay in Candida glabrata. Y... | This protocol describes the preparation of to... | Candida glabrata is a type of yeast that can ... | Preparation of total soluble extracts yeast st... |
| 3 | e1077 | RNA Isolation and Northern Blot Analysis | The northern blot is a technique used in molec... | Cells were infected with either 2 PFU of live ... | The northern blot is a technique used in vitr... | Northern blot analysis is based on an RNA iso... | Place the sample in a thermal block cycler and... |
| 4 | e1090 | Flow Cytometric Analysis of Autophagic Activit... | Flow cytometry allows very sensitive and relia... | Flow Cytometric Analysis of Autophagic Activit... | Flow Cytometric analysis of autophagic flux i... | Autophagic flux in primary cells is determine... | Positive control [rapamycin (1-5 µmol/l), pp24... |


In [30]:
PREDICTIONS_OUTPUT_PATH = os.path.join(NOTEBOOK_8_RESULTS_DIR, 'predictions.csv')
results_df.to_csv(PREDICTIONS_OUTPUT_PATH, index=False)

### Rouge scores

In [39]:
scores_df = pd.DataFrame({
    'model': list(model_scores.keys()),
    'rouge1_prec': [v['rouge1']['precision'] for v in model_scores.values()],
    'rouge1_recall': [v['rouge1']['recall'] for v in model_scores.values()],
    'rouge1_fmeasure': [v['rouge1']['fmeasure'] for v in model_scores.values()],
    'rougeL_prec': [v['rougeL']['precision'] for v in model_scores.values()],
    'rougeL_recall': [v['rougeL']['recall'] for v in model_scores.values()],
    'rougeL_fmeasure': [v['rougeL']['fmeasure'] for v in model_scores.values()],
})
scores_df

| | model | rouge1_prec | rouge1_recall | rouge1_fmeasure | rougeL_prec | rougeL_recall | rougeL_fmeasure |
|---|---|---|---|---|---|---|---|
| 0 | facebook/bart-large-cnn | 0.363438 | 0.178974 | 0.218508 | 0.233888 | 0.115813 | 0.14102 |
| 1 | distillbart_cnn_protocols | 0.517355 | 0.320229 | 0.363781 | 0.322455 | 0.207531 | 0.230837 |
| 2 | distillbart_xsum_protocols | 0.61695 | 0.206607 | 0.288678 | 0.401467 | 0.133659 | 0.18694 |
| 3 | tfidf_extractive | 0.21844 | 0.282621 | 0.2299 | 0.121314 | 0.159748 | 0.128178 |


In [32]:
SCORES_OUTPUT_PATH = os.path.join(NOTEBOOK_8_RESULTS_DIR, 'scores.csv')
scores_df.to_csv(SCORES_OUTPUT_PATH, index=False)