<a href="https://colab.research.google.com/github/victor-roris/mediumseries/blob/master/NLP/PyTextRank.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# PyTextRank

PyTextRank, developed by Paco Nathan, is a Python implementation of TextRank packaged as a spaCy extension. It is used to:

 - extract the top-ranked phrases from text documents
 - infer links from unstructured text into structured data
 - run extractive summarization of text documents

For more information, see the official GitHub repository: https://github.com/DerwenAI/pytextrank
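
To make the first bullet more concrete, here is a minimal sketch of the general TextRank idea: link words that occur close together, then rank the words with a PageRank-style iteration. It is not PyTextRank's implementation (which works on spaCy lemmas and noun chunks); the function names, the window size, and the whitespace tokenization are assumptions chosen for illustration.

```python
from collections import defaultdict


def build_cooccurrence_graph(tokens, window=2):
    """Link every token to the `window` tokens that follow it (undirected)."""
    graph = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + 1 + window]:
            if other != word:
                graph[word].add(other)
                graph[other].add(word)
    return graph


def pagerank(graph, damping=0.85, iterations=50):
    """Plain power iteration over an undirected graph stored as adjacency sets."""
    nodes = list(graph)
    if not nodes:
        return {}
    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        new_rank = {}
        for node in nodes:
            # each neighbour shares its current rank equally among its own neighbours
            incoming = sum(rank[other] / len(graph[other]) for other in graph[node])
            new_rank[node] = (1 - damping) / len(nodes) + damping * incoming
        rank = new_rank
    return rank


# hypothetical toy input: a handful of content words, already lowercased
tokens = "systems linear constraints systems solutions systems equations".split()
ranks = pagerank(build_cooccurrence_graph(tokens))
best_word = max(ranks, key=ranks.get)
```

In this toy input `systems` co-occurs with every other word, so it ends up with the highest rank.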

## Installation

In [0]:
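# The cells below use the PyTextRank 2.x API on top of spaCy 2.x
# (the install log below shows pytextrank 2.0.0).
! pip install pytextrank
! pip install 'graphviz>=0.13'
! pip install 'networkx>=2.0'
! pip install 'spacy>=2.0'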

Collecting pytextrank
  Downloading https://files.pythonhosted.org/packages/1d/bb/b7f864f862e6fbbbe9c935c5fd006e96fc8cc6c63b74d6c1b8adb668d3d1/pytextrank-2.0.0-py3-none-any.whl
Installing collected packages: pytextrank
Successfully installed pytextrank-2.0.0
Collecting graphviz>=0.13
  Downloading https://files.pythonhosted.org/packages/f5/74/dbed754c0abd63768d3a7a7b472da35b08ac442cf87d73d5850a6f32391e/graphviz-0.13.2-py2.py3-none-any.whl
Installing collected packages: graphviz
  Found existing installation: graphviz 0.10.1
    Uninstalling graphviz-0.10.1:
      Successfully uninstalled graphviz-0.10.1
Successfully installed graphviz-0.13.2


## Basic Usage Example


1. Import the packages

In [0]:
import spacy
import pytextrank

2. Load the spaCy model

In [0]:
# load a spaCy model, depending on language, scale, etc.
nlp = spacy.load("en_core_web_sm")

3. Add PyTextRank to the spaCy pipeline

For background on custom pipeline components and extensions in spaCy v2, see https://explosion.ai/blog/spacy-v2-pipelines-extensions

In [0]:
# add PyTextRank to the spaCy pipeline
tr = pytextrank.TextRank()
nlp.add_pipe(tr.PipelineComponent, name="textrank", last=True)
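
The pattern behind this step is that a pipeline is an ordered list of callables, each of which receives the document, annotates it, and returns it. The toy sketch below illustrates only that pattern. `ToyDoc`, `ToyPipeline`, and the two components are hypothetical stand-ins written for this example, not spaCy classes.

```python
class ToyDoc:
    """Hypothetical stand-in for a document object that components annotate."""

    def __init__(self, text):
        self.text = text
        self.tokens = []
        self.extensions = {}  # plays the role of custom attributes such as `doc._`


def tokenizer(doc):
    # first component: split the raw text into lowercase tokens
    doc.tokens = doc.text.lower().split()
    return doc


def token_counter(doc):
    # later component: relies on the annotations left by earlier components
    doc.extensions["n_tokens"] = len(doc.tokens)
    return doc


class ToyPipeline:
    """Hypothetical pipeline: runs the registered components in order."""

    def __init__(self):
        self.components = []

    def add_pipe(self, component, name, last=True):
        entry = (name, component)
        if last:
            self.components.append(entry)
        else:
            self.components.insert(0, entry)

    @property
    def pipe_names(self):
        return [name for name, _ in self.components]

    def __call__(self, text):
        doc = ToyDoc(text)
        for _, component in self.components:
            doc = component(doc)
        return doc


toy_nlp = ToyPipeline()
toy_nlp.add_pipe(tokenizer, name="tokenizer")
toy_nlp.add_pipe(token_counter, name="counter", last=True)
toy_doc = toy_nlp("Upper bounds exist")
```

Adding the ranking component last, as in the cell above, matters for the same reason as in the toy: it depends on the annotations produced by the components that run before it.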

4. Run the pipeline on the text

In [0]:
# example text
text = "Compatibility of systems of linear constraints over the set of natural numbers. Criteria of compatibility of a system of linear Diophantine equations, strict inequations, and nonstrict inequations are considered. Upper bounds for components of a minimal set of solutions and algorithms of construction of minimal generating sets of solutions for all types of systems are given. These criteria and the corresponding algorithms for constructing a minimal supporting set of solutions can be used in solving all the considered types systems and systems of mixed types."

doc = nlp(text)

5. Access the PyTextRank analysis

In [0]:
# examine the top-ranked phrases in the document
for p in doc._.phrases:
    print("TextRank : {:.4f} - Count : {:5d} - Text :  {} - Chunk : {}".format(p.rank, p.count, p.text, p.chunks))

TextRank : 0.1567 - Count :     1 - Text :  minimal generating sets - Chunk : [minimal generating sets]
TextRank : 0.1371 - Count :     4 - Text :  systems - Chunk : [systems, systems, systems, a system]
TextRank : 0.1178 - Count :     3 - Text :  solutions - Chunk : [solutions, solutions, solutions]
TextRank : 0.1164 - Count :     1 - Text :  linear diophantine equations - Chunk : [linear Diophantine equations]
TextRank : 0.1077 - Count :     1 - Text :  nonstrict inequations - Chunk : [nonstrict inequations]
TextRank : 0.1050 - Count :     1 - Text :  mixed types - Chunk : [mixed types]
TextRank : 0.1044 - Count :     1 - Text :  strict inequations - Chunk : [strict inequations]
TextRank : 0.1000 - Count :     1 - Text :  a minimal supporting set - Chunk : [a minimal supporting set]
TextRank : 0.0979 - Count :     1 - Text :  linear constraints - Chunk : [linear constraints]
TextRank : 0.0919 - Count :     1 - Text :  upper bounds - Chunk : [Upper bounds]
TextRank : 0.0913 - Count : 
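
Each phrase exposes `rank`, `count`, `text` and `chunks`, as the loop above shows, so the list can be filtered like any other Python sequence. The sketch below uses a hypothetical `Phrase` tuple filled with values copied from the first four output lines, so that it runs without spaCy. `filter_phrases` is an illustrative helper, not part of PyTextRank.

```python
from collections import namedtuple

# hypothetical stand-in for the phrase objects; values copied from the output above
Phrase = namedtuple("Phrase", ["text", "rank", "count"])

phrases = [
    Phrase("minimal generating sets", 0.1567, 1),
    Phrase("systems", 0.1371, 4),
    Phrase("solutions", 0.1178, 3),
    Phrase("linear diophantine equations", 0.1164, 1),
]


def filter_phrases(phrases, min_rank=0.0, min_count=1, limit=None):
    """Keep phrases above both thresholds, ordered by descending rank."""
    kept = [p for p in phrases if p.rank >= min_rank and p.count >= min_count]
    kept.sort(key=lambda p: p.rank, reverse=True)
    return kept[:limit]


# phrases that occur at least twice in the document
repeated = [p.text for p in filter_phrases(phrases, min_count=2)]
```

With the real objects, the same helper could be called as `filter_phrases(doc._.phrases, min_count=2)`.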

## Example

This section follows the sample script from the PyTextRank repository: https://github.com/DerwenAI/pytextrank/blob/master/example.py

In [0]:
#!/usr/bin/env python
# encoding: utf-8

import logging
import pytextrank
import spacy
import sys

######################################################################
## sample usage
######################################################################

# load a spaCy model, depending on language, scale, etc.

nlp = spacy.load("en_core_web_sm")

# logging is optional: to debug, set the `logger` parameter
# when initializing the TextRank object

logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)
logger = logging.getLogger("PyTR")

# add PyTextRank into the spaCy pipeline

tr = pytextrank.TextRank(logger=None)
nlp.add_pipe(tr.PipelineComponent, name="textrank", last=True)


**TextRank on a short text**

In [0]:

# parse the document
local = False

if local:
    # read the sample text from a local checkout of the repository
    with open("dat/mih.txt", "r") as f:
        text = f.read()
else:
    # otherwise download the same file from GitHub
    from urllib.request import urlopen
    url = 'https://raw.githubusercontent.com/DerwenAI/pytextrank/master/dat/mih.txt'
    text = urlopen(url).read().decode("utf-8")

doc = nlp(text)

print("pipeline", nlp.pipe_names)
print("elapsed time: {} ms".format(tr.elapsed_time))

pipeline ['tagger', 'parser', 'ner', 'textrank']
elapsed time: 8.878231048583984 ms
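
The load-then-parse step is repeated for every sample file in this notebook, so it can be wrapped in one helper that always returns the text it has just read. This is a sketch: `load_text`, its parameters, and `BASE_URL` are names introduced here, with the URL prefix taken from the download links used in the cells.

```python
import os
from urllib.request import urlopen

# assumed prefix, copied from the raw GitHub links used in this notebook
BASE_URL = "https://raw.githubusercontent.com/DerwenAI/pytextrank/master/dat/"


def load_text(name, local=False, local_dir="dat", base_url=BASE_URL):
    """Return the contents of a sample file, read from disk or downloaded."""
    if local:
        with open(os.path.join(local_dir, name), "r", encoding="utf-8") as f:
            return f.read()
    return urlopen(base_url + name).read().decode("utf-8")
```

A cell would then read `text = load_text("mih.txt", local=local)` followed by `doc = nlp(text)`, which makes it harder to parse a stale `text` variable by accident.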


Top-ranked phrases

In [0]:
# examine the top-ranked phrases in the document
print('Top-ranked phrases: ')
print('RANK - COUNT - TEXT [chunks] ')
for phrase in doc._.phrases:
    print("{:.4f} {:5d}  {} - {}".format(phrase.rank, phrase.count, phrase.text, phrase.chunks))

Top-ranked phrases: 
RANK - COUNT - TEXT [chunks] 
0.1567     1  minimal generating sets - [minimal generating sets]
0.1371     4  systems - [systems, systems, systems, a system]
0.1178     3  solutions - [solutions, solutions, solutions]
0.1164     1  linear diophantine equations - [linear Diophantine equations]
0.1077     1  nonstrict inequations - [nonstrict inequations]
0.1050     1  mixed types - [mixed types]
0.1044     1  strict inequations - [strict inequations]
0.1000     1  a minimal supporting set - [a minimal supporting set]
0.0979     1  linear constraints - [linear constraints]
0.0919     1  upper bounds - [Upper bounds]
0.0913     1  a minimal set - [a minimal set]
0.0804     1  components - [components]
0.0797     1  natural numbers - [natural numbers]
0.0797     1  algorithms - [algorithms]
0.0782     1  all the considered types systems - [all the considered types systems]
0.0768     1  diophantine - [Diophantine]
0.0697     2  compatibility - [Compatibility, compatibi

Write a GraphViz document to visualize the lemma graph

In [0]:
# generate a GraphViz doc to visualize the lemma graph
tr.write_dot(path="lemma_graph.dot")
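
For readers who have not seen the DOT format, the sketch below writes a tiny undirected graph by hand. It shows the file format only and does not reproduce the exact file that `tr.write_dot` produces; `to_dot` and the two edges are hypothetical.

```python
def to_dot(edges, name="lemma_graph"):
    """Serialize (source, target) pairs as an undirected GraphViz DOT graph."""
    lines = ["graph {} {{".format(name)]
    for source, target in edges:
        lines.append('    "{}" -- "{}";'.format(source, target))
    lines.append("}")
    return "\n".join(lines)


# hypothetical edges between lemmas
edges = [("systems", "solutions"), ("minimal", "sets")]
dot_text = to_dot(edges)
```

A `.dot` file can be rendered with the Graphviz command-line tool, for example `dot -Tpng lemma_graph.dot -o lemma_graph.png`.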

**TextRank on a long text**

In [0]:
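# switch to a longer text document...

if local:
    with open("dat/lee.txt", "r") as f:
        text = f.read()
else:
    from urllib.request import urlopen
    url = 'https://raw.githubusercontent.com/DerwenAI/pytextrank/master/dat/lee.txt'
    text = urlopen(url).read().decode("utf-8")

doc = nlp(text)

Top-ranked phrases. The outputs recorded in this section were captured while `text` still held the short document, so they repeat the short-text results; rerun the cells to get the results for `lee.txt`.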

In [0]:
for phrase in doc._.phrases[:20]:
    print(phrase)

minimal generating sets
systems
solutions
linear diophantine equations
nonstrict inequations
mixed types
strict inequations
a minimal supporting set
linear constraints
upper bounds
a minimal set
components
natural numbers
algorithms
all the considered types systems
diophantine
compatibility
construction
the set
criteria


Summarize

In [0]:
# summarize the document based on the top 15 phrases, 
# yielding the top 5 sentences...

for sent in doc._.textrank.summary(limit_phrases=15, limit_sentences=5):
    print(sent)


Upper bounds for components of a minimal set of solutions and algorithms of construction of minimal generating sets of solutions for all types of systems are given.
Criteria of compatibility of a system of linear Diophantine equations, strict inequations, and nonstrict inequations are considered.
These criteria and the corresponding algorithms for constructing a minimal supporting set of solutions can be used in solving all the considered types systems and systems of mixed types.
Compatibility of systems of linear constraints over the set of natural numbers.
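
Extractive summarization selects existing sentences instead of generating new ones. The sketch below is a simplified stand-in for that idea and not the scoring that `doc._.textrank.summary` uses: it gives each sentence the summed rank of the top phrases it contains and returns the best ones. The sentences and ranks are copied from the example above; `summarize` is a hypothetical helper.

```python
def summarize(sentences, phrase_ranks, limit_sentences=2):
    """Return the highest-scoring sentences, best first."""
    scored = []
    for index, sentence in enumerate(sentences):
        lowered = sentence.lower()
        # a sentence earns the rank of every top phrase that appears in it
        score = sum(rank for phrase, rank in phrase_ranks.items() if phrase in lowered)
        scored.append((score, index))
    # sort by descending score, breaking ties by position in the document
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [sentences[index] for _, index in scored[:limit_sentences]]


sentences = [
    "Compatibility of systems of linear constraints over the set of natural numbers.",
    "Criteria of compatibility of a system of linear Diophantine equations, "
    "strict inequations, and nonstrict inequations are considered.",
    "Upper bounds for components of a minimal set of solutions and algorithms of "
    "construction of minimal generating sets of solutions for all types of systems are given.",
]

phrase_ranks = {
    "minimal generating sets": 0.1567,
    "systems": 0.1371,
    "solutions": 0.1178,
    "linear diophantine equations": 0.1164,
}

summary = summarize(sentences, phrase_ranks, limit_sentences=2)
```

The third sentence contains three of the four phrases and comes first, followed by the opening sentence.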


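**Stopwords**

The outputs recorded in this section were captured while `text` still held the short document, which is why the two lists are identical; rerun the cells to see the effect of the stop list on `gen.txt`.

In [0]:

# to show use of stopwords, first we output a baseline...

if local:
    with open("dat/gen.txt", "r") as f:
        text = f.read()
else:
    from urllib.request import urlopen
    url = 'https://raw.githubusercontent.com/DerwenAI/pytextrank/master/dat/gen.txt'
    text = urlopen(url).read().decode("utf-8")

doc = nlp(text)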

print(f'Top ranked phrases with stopwords : ')
for phrase in doc._.phrases[:10]:
    print(f'\t - {phrase}')


Top ranked phrases with stopwords : 
	 - minimal generating sets
	 - systems
	 - solutions
	 - linear diophantine equations
	 - nonstrict inequations
	 - mixed types
	 - strict inequations
	 - a minimal supporting set
	 - linear constraints
	 - upper bounds


Load stopwords

In [0]:

# now add `("gensim", "PROPN")` to the stop words list
# then see how the top-ranked phrases differ...

if not local :
  url = 'https://raw.githubusercontent.com/DerwenAI/pytextrank/master/stop.json'
  data = urlopen(url).read().decode("utf-8") 
  with open('stop.json', 'w') as localfile:
      localfile.write(data)

tr.load_stopwords(path="stop.json")

doc = nlp(text)

print(f'Top ranked phrases without stopwords :')
for phrase in doc._.phrases[:10]:
    print(f'\t - {phrase}')

Top ranked phrases without stopwords :
	 - minimal generating sets
	 - systems
	 - solutions
	 - linear diophantine equations
	 - nonstrict inequations
	 - mixed types
	 - strict inequations
	 - a minimal supporting set
	 - linear constraints
	 - upper bounds
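
The comment in the cell above describes a stop word as a pair such as `("gensim", "PROPN")`, that is, a lemma together with a part-of-speech tag. The sketch below assumes a JSON layout that maps each lemma to a list of tags. That layout is an assumption made for this example and should be checked against the actual `stop.json`. `load_stop_pairs` and `drop_stopwords` are hypothetical helpers that only illustrate the filtering idea.

```python
import json
import os
import tempfile

# assumed layout: lemma -> list of part-of-speech tags to ignore
example_stopwords = {"gensim": ["PROPN"]}

# write to a temporary file so an existing stop.json is left untouched
example_path = os.path.join(tempfile.mkdtemp(), "stop_example.json")
with open(example_path, "w") as f:
    json.dump(example_stopwords, f)


def load_stop_pairs(path):
    """Read the JSON file and flatten it into a set of (lemma, pos) pairs."""
    with open(path, "r") as f:
        data = json.load(f)
    return {(lemma, pos) for lemma, tags in data.items() for pos in tags}


def drop_stopwords(tagged_lemmas, stop_pairs):
    """Remove every (lemma, pos) pair that appears in the stop list."""
    return [pair for pair in tagged_lemmas if pair not in stop_pairs]


stop_pairs = load_stop_pairs(example_path)
remaining = drop_stopwords([("gensim", "PROPN"), ("library", "NOUN")], stop_pairs)
```

Matching on the pair and not on the word alone means a lemma is only ignored when it carries the listed tag.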


## Example: Spanish text 

In [0]:
! python -m spacy download es_core_news_sm

Collecting es_core_news_sm==2.1.0
  Downloading https://github.com/explosion/spacy-models/releases/download/es_core_news_sm-2.1.0/es_core_news_sm-2.1.0.tar.gz (11.1MB)
     |████████████████████████████████| 11.1MB 781kB/s 
Building wheels for collected packages: es-core-news-sm
  Building wheel for es-core-news-sm (setup.py) ... done
  Created wheel for es-core-news-sm: filename=es_core_news_sm-2.1.0-cp36-none-any.whl size=11111557 sha256=a1d91c5da711883498d09a1b9bc6c07674e80185db21735fb947d75bab754f0a
  Stored in directory: /tmp/pip-ephem-wheel-cache-3s232e3b/wheels/cc/ee/c4/68922955901918a9aaa82e828d4f7ee1ccfc861285277e79b7
Successfully built es-core-news-sm
Installing collected packages: es-core-news-sm
Successfully installed es-core-news-sm-2.1.0
✔ Download and installation successful
You can now load the model via spacy.load('es_core_news_sm')


**Note**: after the installation, restart the runtime (re-initialize the environment) before loading the model.

In [0]:
# load a spaCy model, depending on language, scale, etc.
nlp = spacy.load("es_core_news_sm")
#nlp = spacy.load("en_core_web_sm")

# add PyTextRank into the spaCy pipeline
tr = pytextrank.TextRank(logger=None)
nlp.add_pipe(tr.PipelineComponent, name="textrank", last=True)

In [0]:
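from bs4 import BeautifulSoup
import requests
import traceback
import sys

def get_text(url):
    """Download a web page and return the text of its <p> elements, one per line."""
    buf = []

    try:
        soup = BeautifulSoup(requests.get(url).text, "html.parser")

        # collect the text of every paragraph on the page
        for p in soup.find_all("p"):
            buf.append(p.get_text())

        return "\n".join(buf)
    except Exception:
        # report the failure and stop: the following cells need the text
        print(traceback.format_exc())
        sys.exit(-1)


doc = nlp(get_text("https://es.wikipedia.org/wiki/Italia"))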
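
The paragraph-extraction step can be tried without network access or third-party packages. The sketch below does the same job as the `find_all("p")` loop using only the standard library's `html.parser`. `ParagraphExtractor` and `paragraphs_from_html` are names introduced for this example, and the HTML snippet is made up.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the text found inside <p> elements, one string per paragraph."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._in_paragraph = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_paragraph = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_paragraph:
            self.paragraphs.append("".join(self._buffer))
            self._in_paragraph = False

    def handle_data(self, data):
        # text inside nested tags such as <b> is kept as well
        if self._in_paragraph:
            self._buffer.append(data)


def paragraphs_from_html(html):
    """Return the paragraph texts of an HTML string, joined by newlines."""
    parser = ParagraphExtractor()
    parser.feed(html)
    return "\n".join(parser.paragraphs)


# made-up snippet for illustration
sample_html = "<html><body><h1>Italia</h1><p>Italia es un <b>país</b>.</p><p>Roma</p></body></html>"
sample_text = paragraphs_from_html(sample_html)
```

The heading is skipped and only the two paragraphs are kept, which mirrors what `get_text` passes to `nlp`.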

In [0]:
doc[0:500]





Italia, oficialmente República Italiana (en italiano, Repubblica Italiana), es un país miembro de la Unión Europea, cuya forma de gobierno es la república parlamentaria. Su territorio, con capital en Roma, se divide en veinte regiones formadas estas, a su vez, por 110 provincias.

Italia se ubica en el centro del mar Mediterráneo, en Europa meridional. Ocupa la península Itálica, así como la llanura Padana, la islas de Sicilia y Cerdeña y alrededor de ochocientas islas menores entre las que se destacan las islas Tremiti en el mar Adriático, los archipiélagos Campano y Toscano en el mar Tirreno, o las islas Pelagias en África septentrional, entre otras. En el norte, está rodeada por los Alpes y tiene frontera con Francia, Suiza, Austria, y Eslovenia. Los Estados de San Marino y Ciudad del Vaticano son enclaves dentro del territorio italiano. A su vez, Campione d'Italia es un municipio italiano que forma un pequeño enclave en territorio suizo.

Ha sido el hogar de muchas culturas eur

In [0]:
print('Top ranked phrases :')
for phrase in doc._.phrases[:10]:
    print(f'\t - {phrase}')

Top ranked phrases :
	 - italia
	 - imperio italiano
	 - estado italiano
	 - neorrealismo italiano
	 - siglos
	 - país
	 - siglo xvi
	 - siglo xviii
	 - siglo xvii
	 - siglo xx


In [0]:
print('Summarization :')
for sent in doc._.textrank.summary(limit_phrases=15, limit_sentences=5):
    # flatten each sentence to a single line before printing
    sent_ = sent.text.replace('\n', '')
    print(f'\t - {sent_}')

Summarization :
	 - Hacia el 400 a. C., Etruria (nombre del país de los etruscos, entre la Toscana y la Lombardía) fue absorbida por los romanos y, antes o después, lo fueron el resto de pueblos itálicos.[21]​Como Antigua Roma se designa a una sociedad agrícola surgida a mediados del siglo VIII a. C. en el Latium Vetus (actual Lacio), que se expandió desde la ciudad de Roma a toda la península itálica, unificándola bajo el nombre de Italia, y que creció durante siglos hasta convertirse en un imperio, que en su época de apogeo, llegó a abarcar desde la península ibérica a Anatolia y desde las islas británicas hasta Egipto, provocando un importante florecimiento cultural en cada lugar en el que gobernó.
	 - La explicación para este nombre es el hecho de ser un país rico en ganado bovino.[19]​ En el siglo I a. C., el toro, símbolo del pueblo samnita sublevado contra Roma, fue representado en las monedas emitidas por los insurrectos abatiendo a una loba, símbolo de Roma: la leyenda del vité