<a href="https://colab.research.google.com/github/victor-roris/NLPlearning/blob/master/summarization-keywords/Summarization_PyTextRank.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# PyTextRank

PyTextRank is a Python implementation of TextRank as a spaCy pipeline extension, for graph-based natural language work -- and related knowledge graph practices. This includes the textgraphs algorithms:

 - TextRank by [mihalcea04textrank]
 - PositionRank by [florescuc17]
 - Biased TextRank by [kazemi-etal-2020-biased]


 

 GitHub: https://github.com/DerwenAI/pytextrank

 Documentation: https://derwen.ai/docs/ptr/ 

 spaCy: https://spacy.io/universe/project/spacy-pytextrank

## Install

In [None]:
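! pip install pytextrank
! pip install 'graphviz>=0.13'
! pip install 'networkx >= 2.0'
# the string-based nlp.add_pipe("textrank") call used below needs spaCy 3.x
! pip install 'spacy >= 3.0'
# icecream is used for the debug printing in the last cells
! pip install icecream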

In [None]:
!python -m spacy download en_core_web_md
!python -m spacy download en_core_web_sm

## Document of study

We are going to apply keyword extraction algorithms to a specific text. The idea is to always use the same content so that the results of the different algorithms can be compared. At the same time, it is important to know the document well in order to judge whether the results are valid.

To reach this goal, we are going to use the text of a scientific article, from which we removed the abstract and the keywords.

The authors labelled the document with the following abstract and keywords:

* **Abstract**: The provision of comprehensive support for traceability and control is a raising demand in some environments such as the eHealth domain where processes can be of critical importance. This paper provides a detailed and thoughtful description of a holistic platform for the characterization and control of processes in the frame of the HACCP context. Traceability features are fully integrated in the model along with support for services concerned with information for the platform users. These features are provided using already tested technologies (RESTful models, QR Codes) and low cost devices (regular smartphones).

* **Keywords**: traceability, eHealth, software platform, mobile environments


Download the text file

In [None]:
!wget -O article.txt https://www.dropbox.com/s/1mz0ociy6ipz67q/victor_roris-worldcist2016.txt?dl=1 

Read the content

In [1]:
# Read the whole article into a single string
with open('article.txt', mode='r', encoding='utf-8') as file:
  content = file.read()

In [2]:
print(f"Number of words : {len(content.split())}")
print("First lines:")
for line in content.split("\n")[0:3]:
  print(line)

Number of words : 3830
First lines:
﻿________________
A telematic based approach towards the normalization of clinical praxis
Víctor M. Alonso Rorís1, Juan M. Santos Gago1, Luis Álvarez Sabucedo1, 


## Applying PyTextRank

* spaCy direct integration

In [24]:
import spacy
import pytextrank

nlp = spacy.load('en_core_web_md')

nlp.add_pipe("textrank")
doc = nlp(content)

tr = doc._.textrank

for sent in tr.summary(limit_phrases=15, limit_sentences=5):
  print(sent)

These operations can be of two types, control or information.
Conversely, information operations are intended to retrieve relevant resources for the human user associated with the entity (e.g., video tutorials, manuals, brochures, etc.), a paramount feature nowadays [15].
The proposed platform is based on control systems widely tested and broadly adopted in the community.
Nevertheless, it is common that health practitioners initially record data on paper and then they transfer it to information technology based systems [9].
The proper implementation of the proposed system should allow to meet different objectives: monitoring adherence to procedures, risk control, traceability of entities and even behavioral analysis for continuous optimization and decision making procedures.
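
For intuition about what the `textrank` component computes, here is a minimal sketch of the underlying idea: build a co-occurrence graph over the words and run PageRank on it. It is a simplification, not PyTextRank's implementation. The function `toy_textrank`, its parameters, and the sample tokens are all made up for illustration, and the real component works on lemmas and part-of-speech tags from the spaCy pipeline.

```python
from collections import defaultdict

def toy_textrank(tokens, window=2, damping=0.85, iterations=50):
    """Rank words by PageRank over an undirected co-occurrence graph."""
    # Link each token to the tokens that follow it within the window.
    neighbors = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + 1 + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)

    # Power iteration for PageRank, starting from a uniform distribution.
    nodes = sorted(neighbors)
    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        rank = {
            node: (1.0 - damping) / len(nodes)
            + damping * sum(rank[nb] / len(neighbors[nb]) for nb in neighbors[node])
            for node in nodes
        }
    return rank

# Hypothetical token sequence, already stripped of punctuation.
tokens = "control operations monitor risk control systems and control procedures".split()
scores = toy_textrank(tokens)
for word, score in sorted(scores.items(), key=lambda kv: -kv[1])[:3]:
    print(f"{word}: {score:.3f}")
```

Words that co-occur with many different neighbours accumulate the highest score, which is why frequent, well-connected terms such as "control" tend to dominate the top phrases.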


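* Adding stopwords

The `stopwords` setting maps a lemma to the part-of-speech tags under which it should be ignored when building the graph. Note that this cell also switches to the smaller `en_core_web_sm` model.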

In [50]:
nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("textrank", config={ "stopwords": { "word": ["NOUN"] } })

doc = nlp(content)

tr = doc._.textrank

for sent in tr.summary(limit_phrases=15, limit_sentences=5):
  print(sent)

These operations can be of two types, control or information.
Conversely, information operations are intended to retrieve relevant resources for the human user associated with the entity (e.g., video tutorials, manuals, brochures, etc.), a paramount feature nowadays
The proposed platform is based on control systems widely tested and broadly adopted in the community.
and then they transfer it to information technology based systems [9].

described and tested as a use case in the context of PN mixtures.
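
As a rough illustration of how a lemma-to-POS mapping like the one passed to `stopwords` can act as a filter, consider the sketch below. The helper `is_stopword` and the candidate list are hypothetical and only mirror the configuration format; they are not part of the PyTextRank API.

```python
def is_stopword(lemma, pos, stopwords):
    """Return True when the (lemma, POS) pair is listed in the stop-word mapping."""
    return pos in stopwords.get(lemma.lower(), [])

# Same mapping as in the configuration: ignore "word" when it is tagged as a noun.
stopwords = {"word": ["NOUN"]}

# Hypothetical (lemma, POS) pairs as a tagger might produce them.
candidates = [("word", "NOUN"), ("word", "VERB"), ("platform", "NOUN")]

kept = [(lemma, pos) for lemma, pos in candidates if not is_stopword(lemma, pos, stopwords)]
print(kept)
```

Only the exact lemma and tag combination is dropped, so the same lemma under a different tag would still take part in the ranking.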


* Position Rank

The PositionRank enhanced algorithm is simple to use in the spaCy pipeline and it supports all of the other features described above:

In [51]:
nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("positionrank")

doc = nlp(content)

tr = doc._.textrank

for sent in tr.summary(limit_phrases=15, limit_sentences=5):
  print(sent)

A telematic based approach towards the normalization of clinical praxis
Víctor M. Alonso Rorís1, Juan M. Santos Gago1, Luis Álvarez Sabucedo1, 
Mateo Ramos Merino1, Javier Sanz Valero2
Alonso-Rorís, V. M., Santos Gago, J. M., Pérez Rodríguez, R., Rivas Costa, C., Gómez Carballa, M. A., & Anido Rifón, L.: Information extraction in semantic, highly-structured, and semi-structured web sources.
Sanz-Valero, J., Alvarez-Sabucedo, L., Wanden-Berghe, C., Alonso-Rorís, V. M., Santos-Gago, J. M.: SUN-PP236:
These operations can be of two types, control or information.
and then they transfer it to information technology based systems [9].
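
The summary above leans towards the title and author lines. A plausible reason is that PositionRank biases the ranking towards words that appear early in the document. The sketch below shows that bias as described in the PositionRank paper: each word is weighted by the sum of the inverse of its positions. The function `position_weights` and the token list are made up for illustration, and this is not PyTextRank's code.

```python
from collections import defaultdict

def position_weights(tokens):
    """Weight each word by the sum of its inverse positions, normalized to sum to 1."""
    raw = defaultdict(float)
    for position, word in enumerate(tokens, start=1):
        raw[word] += 1.0 / position
    total = sum(raw.values())
    return {word: weight / total for word, weight in raw.items()}

# Hypothetical token sequence; "clinical" appears at positions 3 and 6.
tokens = "telematic approach clinical praxis platform clinical control".split()
weights = position_weights(tokens)
for word, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{word}: {weight:.3f}")
```

A word that appears once at the very beginning can outweigh a word that is repeated later, which fits a summary dominated by the opening lines of the article.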


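* Summarization explanation

The cells below walk through how the extractive summary is computed, following the PyTextRank example notebook:

https://github.com/DerwenAI/pytextrank/blob/main/examples/explain_summ.ipynb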

In [41]:
import spacy
import pytextrank

# load a spaCy model, depending on language, scale, etc.
nlp = spacy.load("en_core_web_md")

# add PyTextRank to the spaCy pipeline
nlp.add_pipe("textrank")
doc = nlp(content)

# examine the top-ranked phrases in the document
for phrase in doc._.phrases[:10]:
    print(phrase.text)
    print(phrase.rank, phrase.count)
    print(phrase.chunks)

Control operations
0.0692218951939302 1
[Control operations]
information operations
0.06713421040434547 1
[information operations]
control systems
0.0666589433849378 1
[control systems]
information technology based systems
0.06378624740036501 1
[information technology based systems]
risk control
0.0568668007951074 1
[risk control]
clinical procedures
0.05678111224612967 1
[clinical procedures]
PN mixtures
0.05672523544618933 1
[PN mixtures]
quality control
0.05602853508691536 1
[quality control]
parenteral nutrition systems
0.05576636898434965 1
[parenteral nutrition systems]
retrieved operations
0.05565472628605673 1
[retrieved operations]


In [42]:
# for each sentence, keep its token boundaries plus a (still empty) set of phrase ids
sent_bounds = [ [s.start, s.end, set()] for s in doc.sents ]
sent_bounds[:10]

[[0, 18, set()],
 [18, 51, set()],
 [51, 88, set()],
 [88, 92, set()],
 [92, 98, set()],
 [98, 119, set()],
 [119, 140, set()],
 [140, 176, set()],
 [176, 208, set()],
 [208, 227, set()]]

In [52]:
from icecream import ic

# number of top-ranked phrases to take into account
limit_phrases = 20

phrase_id = 0
unit_vector = []

for p in doc._.phrases:
    # collect the rank of each top phrase
    unit_vector.append(p.rank)

    for chunk in p.chunks:
        # record the phrase id in the sentence that contains this chunk
        for sent_start, sent_end, sent_vector in sent_bounds:
            if chunk.start >= sent_start and chunk.end <= sent_end:
                sent_vector.add(phrase_id)
                break

    phrase_id += 1

    if phrase_id == limit_phrases:
        break

In [53]:
# normalize the phrase ranks so that they sum to 1
sum_ranks = sum(unit_vector)

unit_vector = [ rank/sum_ranks for rank in unit_vector ]
unit_vector

[0.05920502879104255,
 0.057683882543991476,
 0.05748537258313386,
 0.05709711234886287,
 0.056834103217638994,
 0.05469871302611156,
 0.054325899299362976,
 0.05423774442452605,
 0.051859279701128934,
 0.05149099560511139,
 0.050921248436766195,
 0.04866858649573457,
 0.04757708489163605,
 0.04494377818406676,
 0.042761898406489016,
 0.042761898406489016,
 0.0427323129846172,
 0.04205451910884827,
 0.04139611711965353,
 0.04126442442478851]

In [54]:
from math import sqrt

sent_rank = {}
sent_id = 0

for sent_start, sent_end, sent_vector in sent_bounds:
    # Euclidean distance between the sentence and the unit vector:
    # only the phrases missing from the sentence contribute
    sum_sq = 0.0

    for phrase_id in range(len(unit_vector)):
        if phrase_id not in sent_vector:
            sum_sq += unit_vector[phrase_id]**2.0

    sent_rank[sent_id] = sqrt(sum_sq)
    sent_id += 1

In [55]:
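from operator import itemgetter

# sentences sorted by ascending distance (closest to the top phrases first);
# the trailing `pass` suppresses the long cell output
sorted(sent_rank.items(), key=itemgetter(1))

pass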

In [56]:
limit_sentences = 5

# map each sentence id to its text
sent_text = {}
sent_id = 0

for sent in doc.sents:
    sent_text[sent_id] = sent.text
    sent_id += 1

num_sent = 0

# print the sentences with the smallest distance, up to the limit
for sent_id, rank in sorted(sent_rank.items(), key=itemgetter(1)):
    ic(sent_id, sent_text[sent_id])
    num_sent += 1

    if num_sent == limit_sentences:
        break

ic| sent_id: 1
    sent_text[sent_id]: ('A telematic based approach towards the normalization of clinical praxis
                        '
                         'Víctor M. Alonso Rorís1, Juan M. Santos Gago1, Luis Álvarez Sabucedo1, 
                        '
                         'Mateo Ramos Merino1, Javier Sanz Valero2')
ic| sent_id: 268
    sent_text[sent_id]: 'In Annals of Nutrition and Metabolism 63, 366-367 (2013).'
ic| sent_id: 24
    sent_text[sent_id]: ('In addition, these information systems are often ad-hoc telematics '
                         'solutions, designed to cover only specific tasks (e.g., prescription and '
                         'adherence to treatment).')
ic| sent_id: 69
    sent_text[sent_id]: ('Specifically, a control operation defines variables that the user (automated '
                         'agent or human) must monitor.')
ic| sent_id: 214
    sent_text[sent_id]: '
                         '
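
The walkthrough above can be condensed into one small function that runs on toy data without spaCy. This is only a sketch of the same distance-based ranking; the name `rank_sentences`, the phrase ranks, and the sentence-to-phrase sets are invented for illustration.

```python
from math import sqrt

def rank_sentences(phrase_ranks, sentence_phrases):
    """Order sentence ids by Euclidean distance from the normalized phrase-rank vector."""
    total = sum(phrase_ranks)
    unit_vector = [rank / total for rank in phrase_ranks]
    distances = {}
    for sent_id, phrase_ids in enumerate(sentence_phrases):
        # Only the phrases missing from the sentence contribute to its distance.
        missing = (w for i, w in enumerate(unit_vector) if i not in phrase_ids)
        distances[sent_id] = sqrt(sum(w ** 2 for w in missing))
    return sorted(distances.items(), key=lambda item: item[1])

# Made-up ranks for three phrases and the phrase ids found in three sentences.
phrase_ranks = [0.07, 0.06, 0.05]
sentence_phrases = [{0, 1}, {2}, set()]

ranked = rank_sentences(phrase_ranks, sentence_phrases)
for sent_id, distance in ranked:
    print(sent_id, round(distance, 4))
```

The sentence containing the two highest-ranked phrases comes first, and a sentence with none of the top phrases comes last. In the real run, a phrase chunk is assigned only to the first sentence whose boundaries contain it.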
