<a href="https://colab.research.google.com/github/victor-roris/NLPlearning/blob/master/summarization-keywords/Summarization_BertExtractiveSummarizer_ipynb.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Bert Extractive Summarizer

The summarizer first embeds each sentence, then runs a clustering algorithm over the embeddings and selects the sentences closest to each cluster's centroid.

GitHub: https://github.com/dmmiller612/bert-extractive-summarizer

Paper: https://arxiv.org/abs/1906.04165 
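
The embed, cluster, and select idea described above can be sketched with NumPy alone. This is a minimal illustration, not the library's implementation: the toy 2-D vectors stand in for real sentence embeddings, and the helper names `kmeans` and `select_sentences` are hypothetical.

```python
import numpy as np

def kmeans(points, k, n_iter=20, seed=0):
    """Plain Lloyd's k-means; returns the centroids as a (k, dim) array."""
    rng = np.random.default_rng(seed)
    # Start from k distinct points picked at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Assign each point to its nearest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = points[labels == j]
            if len(members):  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return centroids

def select_sentences(embeddings, k):
    """Return the sorted indices of the sentences closest to each centroid."""
    centroids = kmeans(embeddings, k)
    dists = np.linalg.norm(embeddings[:, None, :] - centroids[None, :, :], axis=2)
    chosen = {int(dists[:, j].argmin()) for j in range(k)}
    return sorted(chosen)

# Toy 2-D "embeddings" (assumption): two well separated groups of three sentences.
toy_embeddings = np.array([
    [0.0, 0.0], [0.2, 0.0], [0.1, 0.0],
    [5.0, 5.0], [5.2, 5.0], [5.1, 5.0],
])
selected = select_sentences(toy_embeddings, k=2)
print(selected)
```

Returning the indices in document order keeps the selected sentences in the order they appear in the source text.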

In [None]:
!pip install bert-extractive-summarizer==0.4.2

In [None]:
!pip install spacy
!pip install transformers==2.2.2
!pip install neuralcoref

!python -m spacy download en_core_web_md

In [None]:
!pip install sentencepiece

## Document of study

We are going to apply keyword extraction algorithms to a specific text. The idea is to always use the same content so that the different results can be compared. At the same time, it is important to know the document well in order to judge whether the results are valid.

To this end, we use the text of a scientific article, from which we removed the abstract and the keywords.

The authors labelled the document with the following abstract and keywords:

* **Abstract**: The provision of comprehensive support for traceability and control is a raising demand in some environments such as the eHealth domain where processes can be of critical importance. This paper provides a detailed and thoughtful description of a holistic platform for the characterization and control of processes in the frame of the HACCP context. Traceability features are fully integrated in the model along with support for services concerned with information for the platform users. These features are provided using already tested technologies (RESTful models, QR Codes) and low cost devices (regular smartphones).

* **Keywords**: traceability, eHealth, software platform, mobile environments


Download the text file

In [None]:
!wget -O article.txt https://www.dropbox.com/s/1mz0ociy6ipz67q/victor_roris-worldcist2016.txt?dl=1 

Read the content

In [4]:
# Read the whole article into a single string
content = ""
with open('article.txt',mode='r') as file:
  content = file.read()

In [5]:
print(f"Number of words : {len(content.split())}")
print("First lines:")
for line in content.split("\n")[0:3]:
  print(line)

Number of words : 3830
First lines:
﻿________________
A telematic based approach towards the normalization of clinical praxis
Víctor M. Alonso Rorís1, Juan M. Santos Gago1, Luis Álvarez Sabucedo1, 
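
The first line of the output above starts with an invisible byte order mark (U+FEFF), which also shows up as `\ufeff` in the summaries further down. A minimal sketch of one way to drop it before summarizing follows; the helper name `strip_bom` and the sample string are illustrative assumptions.

```python
def strip_bom(text):
    """Remove a leading byte order mark (U+FEFF), if present."""
    return text.lstrip("\ufeff")

# Sample string (assumption) shaped like the first lines of article.txt.
sample = "\ufeff________________\nA telematic based approach"
clean = strip_bom(sample)
print(repr(clean[:20]))
```

Opening the file with `encoding="utf-8-sig"` is the standard-library alternative: that codec skips a leading BOM while decoding.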


## Apply the summarizer

```
model = Summarizer(
    model: This gets used by the hugging face bert library to load the model, you can supply a custom trained model here
    custom_model: If you have a pre-trained model, you can add the model class here.
    custom_tokenizer:  If you have a custom tokenizer, you can add the tokenizer here.
    hidden: Needs to be negative, but allows you to pick which layer you want the embeddings to come from.
    reduce_option: It can be 'mean', 'median', or 'max'. This reduces the embedding layer for pooling.
    sentence_handler: The handler to process sentences. If you want to use coreference, instantiate and pass a CoreferenceHandler instance
)

model(
    body: str # The string body that you want to summarize
    ratio: float # The ratio of sentences that you want for the final summary
    min_length: int # Sentences shorter than this many characters are removed (40 by default)
    max_length: int # Sentences longer than this many characters are removed
    num_sentences: int # Number of sentences to use. Overrides ratio if supplied.
)
```
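
To make `reduce_option`, `ratio`, and `num_sentences` more concrete, here is a small NumPy sketch. It is an illustration under assumptions, not the library's code: `pool_tokens` and `summary_size` are hypothetical helpers, and the rounding rule and the default ratio of 0.2 are assumed.

```python
import numpy as np

def pool_tokens(token_vectors, reduce_option="mean"):
    """Collapse a (tokens, dim) array of token embeddings into one sentence vector."""
    if reduce_option == "mean":
        return token_vectors.mean(axis=0)
    if reduce_option == "median":
        return np.median(token_vectors, axis=0)
    if reduce_option == "max":
        return token_vectors.max(axis=0)
    raise ValueError(f"unknown reduce_option: {reduce_option!r}")

def summary_size(n_sentences, ratio=0.2, num_sentences=None):
    """Number of sentences to keep; num_sentences, when given, overrides ratio."""
    if num_sentences is not None:
        return min(num_sentences, n_sentences)
    # Assumed rounding rule: truncate, but always keep at least one sentence.
    return max(1, int(n_sentences * ratio))

# Three toy token vectors of dimension 2 (assumption, not real BERT output).
tokens = np.array([[1.0, 4.0], [2.0, 0.0], [6.0, 2.0]])
for option in ("mean", "median", "max"):
    print(option, pool_tokens(tokens, option))
print(summary_size(50, ratio=0.1))
print(summary_size(50, ratio=0.1, num_sentences=3))
```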

* Simple summarization

In [4]:
from summarizer import Summarizer

model = Summarizer()
model(content)

100%|██████████| 434/434 [00:00<00:00, 85461.41B/s]
100%|██████████| 1344997306/1344997306 [01:44<00:00, 12851549.38B/s]
100%|██████████| 231508/231508 [00:00<00:00, 321135.85B/s]


"\ufeff________________\nA telematic based approach towards the normalization of clinical praxis\nVíctor M. Alonso Rorís1, Juan M. Santos Gago1, Luis Álvarez Sabucedo1, \nMateo Ramos Merino1, Javier Sanz Valero2\n\n\n1 Telematic Engineering Department, University of Vigo, 36310 Vigo, Spain \n{valonso, jsgago, lsabucedo, mateo.ramos}@gist.uvigo.es\n2 Public Health & History of Science, University Miguel Hernandez, 03550 Alicante, Spain\njsanz@umh.es\n\n1   Introduction\nThe healthcare environment is an area in which the quality and safety of clinical procedures and practices is particularly relevant. For example, in case a patient requires to be provided with intravenous nutrition, it is especially critical to ensure the quality of the nutrient mixture supplied and the attention given [2]. The core of this system is the identification of moments or places where monitoring specific variables within procedures in order to control potential hazards. However, it is becoming more common its 

* Specified ratio of sentences

In [8]:
from summarizer import Summarizer

model = Summarizer()
result = model(content, ratio=0.1)  # Specified with ratio
result

'\ufeff________________\nA telematic based approach towards the normalization of clinical praxis\nVíctor M. Alonso Rorís1, Juan M. Santos Gago1, Luis Álvarez Sabucedo1, \nMateo Ramos Merino1, Javier Sanz Valero2\n\n\n1 Telematic Engineering Department, University of Vigo, 36310 Vigo, Spain \n{valonso, jsgago, lsabucedo, mateo.ramos}@gist.uvigo.es\n2 Public Health & History of Science, University Miguel Hernandez, 03550 Alicante, Spain\njsanz@umh.es\n\n1   Introduction\nThe healthcare environment is an area in which the quality and safety of clinical procedures and practices is particularly relevant. However, it is becoming more common its use in the pharmaceutical and healthcare environment [5], 6]. Furthermore, it should allow to check and automatically analysis the recorded information in real time. In our system, CPs are intended to record the monitored variables and elements under control into traceable records, hereinafter, traces. Actually, this identification is usually done usi

* Specified maximum sentence length

In [12]:
from summarizer import Summarizer

model = Summarizer()
result = model(content, max_length=50) 
result

'Control operations can carry out a particular CP. In this model, concepts (e.g., users, CP, etc.),'
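
The very short result above is consistent with `max_length` acting as a per-sentence filter rather than a limit on the summary as a whole. A minimal sketch of such a filter, assuming strict bounds and defaults of 40 and 600 characters (the helper `filter_by_length` is hypothetical):

```python
def filter_by_length(sentences, min_length=40, max_length=600):
    """Keep only sentences whose character count lies strictly between the bounds."""
    return [s for s in sentences if min_length < len(s) < max_length]

candidates = [
    "Read the content",  # shorter than min_length, dropped
    "Control operations can carry out a particular CP.",
    "In this model, concepts (e.g., users, CP, etc.),",
    "The healthcare environment is an area in which the quality and safety "
    "of clinical procedures and practices is particularly relevant.",  # too long for max_length=50
]
kept = filter_by_length(candidates, max_length=50)
print(kept)
```

With `max_length=50`, only sentences between 40 and 50 characters remain as candidates, so the clustering step has very little to choose from.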

* CoReference

In [None]:
from summarizer import Summarizer
from summarizer.coreference_handler import CoreferenceHandler

handler = CoreferenceHandler(greedyness=.4)
# How coreference works:
# >>>handler.process('''My sister has a dog. She loves him.''', min_length=2)
# ['My sister has a dog.', 'My sister loves a dog.']

model = Summarizer(sentence_handler=handler)
model(content)



* Custom model

In [6]:
from transformers import AutoConfig, AutoModel, AutoTokenizer

# Load model, model config and tokenizer via Transformers
custom_config = AutoConfig.from_pretrained('gpt2')
custom_config.output_hidden_states=True
custom_tokenizer = AutoTokenizer.from_pretrained('gpt2')
custom_model = AutoModel.from_pretrained('gpt2', config=custom_config)

from summarizer import Summarizer

model = Summarizer(custom_model=custom_model, custom_tokenizer=custom_tokenizer)
model(content, max_length=100)

'The arise of situations and risks not properly tackled may put at stake the life of patients [1]. These records enable the application traceability mechanisms. This URI will be generated by the system and will uniquely identify one single entity. Control operations can carry out a particular CP. These operations are represented as resources in the REST server interface. In this model, concepts (e.g., users, CP, etc.), Regretfully, the implementation of a CP is often a costly procedure. Given the scenario above presented, the CP1 will be described in the frame of the application. Additionally, the log of traces also provides detailed information on any procedure. Therefore, it is possible to take full advantage of process mining tools as ProM[1]. The CPs application allows to minimize deviations in the standard processes. This feature may strengthen the knowledge of professionals involved. Journal of Parenteral and Enteral Nutrition 36(2 suppl), 20S–22S (2012)\n3. Springer Science & Bu