<a href="https://colab.research.google.com/github/victor-roris/NLPlearning/blob/master/summarization-keywords/Summarization_Transformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Summarization using Transformers

Summarization is the task of condensing a document or an article into a shorter text that preserves its main points.

Web: https://huggingface.co/transformers/task_summary.html?highlight=summarization#summarization

Model zoo: https://huggingface.co/models?filter=en&pipeline_tag=summarization

## Install

In [None]:
!pip install transformers

## Document of study

We are going to apply summarization models to one specific text. Using the same content throughout makes it possible to compare the results of the different models. It also helps to know the document well, in order to judge whether the results are valid.

To this end, we use the text of a scientific article, with the abstract and the keywords removed from the content.

The authors labelled the document with the following abstract and keywords:

* **Abstract**: The provision of comprehensive support for traceability and control is a raising demand in some environments such as the eHealth domain where processes can be of critical importance. This paper provides a detailed and thoughtful description of a holistic platform for the characterization and control of processes in the frame of the HACCP context. Traceability features are fully integrated in the model along with support for services concerned with information for the platform users. These features are provided using already tested technologies (RESTful models, QR Codes) and low cost devices (regular smartphones).

* **Keywords**: traceability, eHealth, software platform, mobile environments
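
Because the authors' abstract and keywords are available, a generated summary can be checked against them with a rough lexical score. The block below is a minimal sketch, not an official ROUGE implementation: `tokenize`, `unigram_overlap` and `keyword_coverage` are hypothetical helper names, and the example strings are illustrative rather than model output.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def unigram_overlap(candidate, reference):
    """Return (precision, recall) of candidate unigrams against the reference.

    Counts are clipped, so a word repeated in the candidate is credited at most
    as many times as it occurs in the reference.
    """
    cand_counts = Counter(tokenize(candidate))
    ref_counts = Counter(tokenize(reference))
    matched = sum((cand_counts & ref_counts).values())
    precision = matched / max(sum(cand_counts.values()), 1)
    recall = matched / max(sum(ref_counts.values()), 1)
    return precision, recall


def keyword_coverage(candidate, keywords):
    """Return the keywords that appear verbatim (case-insensitive) in the candidate."""
    lowered = candidate.lower()
    return [kw for kw in keywords if kw.lower() in lowered]


# Illustrative inputs only: a shortened reference and a made-up candidate.
reference = "A holistic platform for the characterization and control of processes"
candidate = "A platform for the control of clinical processes"
precision, recall = unigram_overlap(candidate, reference)
print(f"precision={precision:.2f} recall={recall:.2f}")
print(keyword_coverage(candidate, ["traceability", "eHealth", "software platform"]))
```

A score like this only measures shared vocabulary, so it complements rather than replaces reading the summaries.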


Download the text file

In [None]:
!wget -O article.txt "https://www.dropbox.com/s/1mz0ociy6ipz67q/victor_roris-worldcist2016.txt?dl=1"

Read the content

In [1]:
# Read the whole article into a single string
with open('article.txt', mode='r', encoding='utf-8') as file:
  content = file.read()

In [2]:
print(f"Number of words : {len(content.split())}")
print("First lines:")
for line in content.split("\n")[0:3]:
  print(line)

Number of words : 3830
First lines:
﻿________________
A telematic based approach towards the normalization of clinical praxis
Víctor M. Alonso Rorís1, Juan M. Santos Gago1, Luis Álvarez Sabucedo1, 


## Apply Transformers pipeline

Transformers models available for summarization: https://huggingface.co/models?filter=en&pipeline_tag=summarization

In [3]:
from transformers import pipeline

In [4]:
# Summarize a long text in two passes: first summarize fixed-size word chunks,
# then summarize the concatenation of those partial summaries.
def run_comb_summarization_for_long_texts(content, summarizer, step_words=500):
  # Split the text into consecutive chunks of at most `step_words` words
  content_words = content.split(" ")
  chunks = [" ".join(content_words[i:i+step_words])
            for i in range(0, len(content_words), step_words)]

  try:
    # Fast path: summarize all the chunks in a single batched call
    summs = summarizer(chunks, min_length=5, max_length=120)
  except Exception:
    # Fallback: go chunk by chunk so that one failure does not discard the rest
    summs = []
    for chunk in chunks:
      if not chunk:
        continue
      try:
        summs.append(summarizer(chunk, min_length=5, max_length=120)[0])
      except Exception:
        # Retry with the chunk split into two halves (by characters)
        half = len(chunk) // 2
        try:
          summs.append(summarizer(chunk[:half], min_length=5, max_length=120)[0])
          summs.append(summarizer(chunk[half:], min_length=5, max_length=120)[0])
        except Exception:
          print(f"The following chunk of text ({len(chunk)} characters) has failed. Please review the causes.")
          print(chunk)
          print("----------")

  # Second pass: summarize the concatenated partial summaries
  summ_conc = ". ".join([summ['summary_text'] for summ in summs])
  comb_summ = summarizer(summ_conc, min_length=5, max_length=90)

  print("Combined Summary: ")
  display(comb_summ[0]['summary_text'])
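
The chunking step of the function above can be tried on its own, without loading any model. This is a minimal sketch with a hypothetical helper name, `split_into_word_chunks`. It splits on any whitespace with `str.split()`, whereas the function above splits on single spaces, so chunk boundaries may differ slightly on text that contains newlines.

```python
def split_into_word_chunks(text, step_words=500):
    """Split text into consecutive chunks of at most step_words words."""
    if step_words <= 0:
        raise ValueError("step_words must be positive")
    words = text.split()
    return [
        " ".join(words[start:start + step_words])
        for start in range(0, len(words), step_words)
    ]


chunks = split_into_word_chunks("one two three four five six seven", step_words=3)
print(chunks)  # ['one two three', 'four five six', 'seven']
```

Note that the last chunk can be much shorter than the others, which may lead to a weaker partial summary for it.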


### Bart in PyTorch

Default model: no `model` argument is passed, so the pipeline loads its default summarization checkpoint.

In [7]:
# use bart in pytorch
summarizer = pipeline("summarization")

In [8]:
# Summarize only the first 4000 characters of the article
summary = summarizer(content[:4000], min_length=5, max_length=90)

print("Partial Summary: ")
summary[0]['summary_text']

Partial Summary: 


' A telematic based approach towards the normalization of clinical praxis has been developed . The aim of the study was to create a tool to carry out the implementation of controls (in systems such as HACCP) and to record the values obtained efficiently and in a cost-effective manner . The system must provide with tools to control the entire life cycle of procedures and entities .'
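
The slice `content[:4000]` counts characters, not words or tokens, so the input can end in the middle of a word. A minimal sketch of a word-safe alternative follows; `truncate_on_word_boundary` is a hypothetical helper, not part of the Transformers library.

```python
def truncate_on_word_boundary(text, max_chars):
    """Return at most max_chars characters of text without splitting a word."""
    if max_chars <= 0:
        return ""
    if len(text) <= max_chars:
        return text
    cut = text[:max_chars]
    # The cut is clean if it ends on whitespace or the next character is whitespace.
    if cut[-1].isspace() or text[max_chars].isspace():
        return cut.rstrip()
    parts = cut.rsplit(None, 1)
    # With no whitespace in the slice there is only one (partial) word: keep it as is.
    return parts[0] if len(parts) == 2 else cut


print(truncate_on_word_boundary("hello world again", 8))
# In the notebook this could be used as, for example:
# summarizer(truncate_on_word_boundary(content, 4000), min_length=5, max_length=90)
```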

In [9]:
run_comb_summarization_for_long_texts(content, summarizer)

Combined Summary: 


' A telematic based approach towards the normalization of clinical praxis has been developed . HACCP (Hazard Analysis and Critical Control Points) is a system aimed to establish a preventive, systematic and organized control of risks . Traceability allows health authorities to respond quickly to the eventual detection of risks for quality and safety .'

### T5 in TensorFlow

In [10]:
# use t5 in tf
summarizer = pipeline("summarization", model="t5-base", tokenizer="t5-base", framework="tf")

All model checkpoint layers were used when initializing TFT5ForConditionalGeneration.

All the layers of TFT5ForConditionalGeneration were initialized from the model checkpoint at t5-base.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.


In [11]:
# Pass the whole article; it is far longer than the 512 tokens the model accepts
summary = summarizer(content, min_length=5, max_length=90)

print("\n---------------------------------\n")
print("Partial Summary: ")
summary[0]['summary_text']

Token indices sequence length is longer than the specified maximum sequence length for this model (5668 > 512). Running this sequence through the model will result in indexing errors



---------------------------------

Partial Summary: 


'a telematic-based solution to support the standardization of control and traceability procedures is proposed . the proposed system is based on the HACCP (Hazard Analysis and Critical Control Points) model . it is possible to use the system in a variety of health care settings, including in the pharmaceutical industry .'
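
The warning above reports 5668 tokens for a text of 3830 words, roughly 1.5 tokens per word, so a chunk of 500 words can still exceed a 512-token limit. One way to stay within the limit is to build chunks against a token budget instead of a word count. The block below is a minimal sketch under stated assumptions: `chunk_by_token_budget` is a hypothetical helper, sentences are split with a simple regular expression, and the token counter is passed in as a callable so the example runs without downloading a model.

```python
import re


def chunk_by_token_budget(text, count_tokens, max_tokens=512):
    """Group sentences into chunks whose token count stays within max_tokens.

    count_tokens is any callable mapping a string to an integer token count.
    A single sentence longer than the budget is kept as its own chunk.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and count_tokens(candidate) > max_tokens:
            # Adding this sentence would exceed the budget: close the chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


# Stand-in counter for illustration: one token per whitespace-separated word.
def count_words(s):
    return len(s.split())


demo = "First sentence here. Second one follows. Third is last."
print(chunk_by_token_budget(demo, count_words, max_tokens=6))

# With a real model, the counter could wrap the pipeline's tokenizer, e.g.
# count_tokens = lambda s: len(summarizer.tokenizer(s)["input_ids"])
```

Leaving some headroom below the model limit is prudent, since the pipeline may add special tokens or a task prefix to the input.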

In [12]:
run_comb_summarization_for_long_texts(content, summarizer)

Combined Summary: 


'a telematic based approach towards the normalization of clinical praxis was developed . the objective was to create a tool to carry out the implementation of controls (in systems such as HACCP) traceability is the ability to track the history, actual usage and current status of entities .'

### Pegasus

https://huggingface.co/transformers/model_doc/pegasus.html

#### google/pegasus-xsum

In [6]:
summarizer = pipeline("summarization", model="google/pegasus-xsum", tokenizer="google/pegasus-xsum")

In [14]:
# Summarize only the first 1000 characters of the article
summary = summarizer(content[:1000], min_length=5, max_length=90)

print("\n---------------------------------\n")
print("Partial Summary: ")
summary[0]['summary_text']


---------------------------------

Partial Summary: 


'The aim of this study is to develop a telematic approach towards the normalization of clinical procedures and practices in hospitals.'

In [25]:
run_comb_summarization_for_long_texts(content, summarizer, step_words=320)

Combined Summary: 


'Key words: hazard analysis critical control points (HACCP), ISO testing, pen and paper.'

#### google/pegasus-large

In [None]:
!pip install sentencepiece

In [5]:
summarizer = pipeline("summarization", model="google/pegasus-large", tokenizer="google/pegasus-large")

In [6]:
# Summarize only the first 1000 characters of the article
summary = summarizer(content[:1000], min_length=5, max_length=90)

print("\n---------------------------------\n")
print("Partial Summary: ")
summary[0]['summary_text']


---------------------------------

Partial Summary: 


'Santos Gago1, Luis lvarez Sabucedo1, Mateo Ramos Merino1, Javier Sanz Valero2 1 Telematic Engineering Department, University of Vigo, 36310 Vigo, Spain valonso, jsgago, lsabucedo, mateo.ramos@gist.uvigo.es 2 Public Health & History of Science, University Miguel Hernandez, 03550 Alicante, Spain jsanz@umh'

In [None]:
run_comb_summarization_for_long_texts(content, summarizer, step_words=320)

### DistilBART (sshleifer)

#### distilbart-cnn-12-6

In [26]:
summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6", tokenizer="sshleifer/distilbart-cnn-12-6")

In [27]:
# Summarize only the first 1000 characters of the article
summary = summarizer(content[:1000], min_length=5, max_length=90)

print("\n---------------------------------\n")
print("Partial Summary: ")
summary[0]['summary_text']


---------------------------------

Partial Summary: 


' A telematic based approach towards the normalization of clinical praxis . The healthcare environment is an area in which the quality and safety of clinical procedures and practices is particularly relevant .'

In [28]:
run_comb_summarization_for_long_texts(content, summarizer)

Combined Summary: 


' A telematic based approach towards the normalization of clinical praxis has been developed . HACCP (Hazard Analysis and Critical Control Points) is a system aimed to establish a preventive, systematic and organized control of risks . Traceability allows health authorities to respond quickly to the eventual detection of risks for quality and safety .'

#### distilbart-cnn-6-6

In [29]:
summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-6-6", tokenizer="sshleifer/distilbart-cnn-6-6")

In [30]:
# Summarize only the first 1000 characters of the article
summary = summarizer(content[:1000], min_length=5, max_length=90)

print("\n---------------------------------\n")
print("Partial Summary: ")
summary[0]['summary_text']


---------------------------------

Partial Summary: 


' The healthcare environment is an area in which the quality and safety of clinical procedures and practices is particularly relevant . The arise of situations and risks not properly tackled may put at stake the life of patients . For example, in case a patient requires to be provided with intravenous nutrition'

In [31]:
run_comb_summarization_for_long_texts(content, summarizer)

Combined Summary: 


' A telematic based approach to the normalization of clinical praxis has been developed . The authors of this work have collaborated under the support of projects mentioned in the acknowledgments to address a solution to this problem . They say the HACCP is a system aimed to establish a preventive, systematic and organized control of risks .'
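
To compare the models side by side, it can help to collect the combined summaries in one dictionary instead of printing them cell by cell. The block below is a minimal sketch, not the notebook's own function: `summarize_in_two_passes`, `compare_models` and `first_words_stub` are hypothetical names, and the stub only imitates the return format of the pipelines used above (`[{"summary_text": ...}]`) so that the example runs without loading a model.

```python
def summarize_in_two_passes(chunks, summarizer, chunk_max_length=120, final_max_length=90):
    """Summarize each chunk, then summarize the concatenation of those summaries.

    summarizer is any callable that takes a string plus min_length/max_length
    keyword arguments and returns a list like [{"summary_text": "..."}].
    """
    partial = [
        summarizer(chunk, min_length=5, max_length=chunk_max_length)[0]["summary_text"]
        for chunk in chunks
        if chunk.strip()
    ]
    combined = summarizer(" ".join(partial), min_length=5, max_length=final_max_length)
    return partial, combined[0]["summary_text"]


def compare_models(chunks, summarizers):
    """Return {model_name: final_summary} for a dict of name -> summarizer callables."""
    return {
        name: summarize_in_two_passes(chunks, summarizer)[1]
        for name, summarizer in summarizers.items()
    }


# Stub summarizer for illustration only: keeps the first max_length // 10 words.
def first_words_stub(text, min_length=5, max_length=90):
    return [{"summary_text": " ".join(text.split()[: max_length // 10])}]


results = compare_models(
    ["alpha beta gamma " * 10, "delta epsilon " * 10],
    {"stub": first_words_stub},
)
print(results)
```

In the notebook, the dictionary values would be the pipeline objects created above, keyed by model name.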