<a href="https://colab.research.google.com/github/victor-roris/mediumseries/blob/master/nlp/PyTextRank%20-%20Example%20of%20use.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# PyTextRank

Python tool developed by Paco Nathan.

PyTextRank is a Python implementation of TextRank as a spaCy extension, used to:

 - extract the top-ranked phrases from text documents
 - infer links from unstructured text into structured data
 - run extractive summarization of text documents

To more information, visit official github: https://github.com/DerwenAI/pytextrank

## Installation

In [20]:
! pip install pytextrank
! pip install 'graphviz>=0.13'
! pip install 'networkx >= 2.0'
! pip install 'spacy >= 2.0'



## Basic Usage Example


1. Import the packages

In [0]:
import spacy
import pytextrank

2. Load the spacy model

In [0]:
# load a spaCy model, depending on language, scale, etc.
nlp = spacy.load("en_core_web_sm")

3. Add the PyTextRank to the spaCy pipeline

https://explosion.ai/blog/spacy-v2-pipelines-extensions 

In [0]:
# add PyTextRank to the spaCy pipeline
tr = pytextrank.TextRank()
nlp.add_pipe(tr.PipelineComponent, name="textrank", last=True)

4. Tokenize the text

In [0]:
# example text
text = "Compatibility of systems of linear constraints over the set of natural numbers. Criteria of compatibility of a system of linear Diophantine equations, strict inequations, and nonstrict inequations are considered. Upper bounds for components of a minimal set of solutions and algorithms of construction of minimal generating sets of solutions for all types of systems are given. These criteria and the corresponding algorithms for constructing a minimal supporting set of solutions can be used in solving all the considered types systems and systems of mixed types."

doc = nlp(text)

5. Access to the pytextrank analysis

In [5]:
# examine the top-ranked phrases in the document
for p in doc._.phrases:
    print("TextRank : {:.4f} - Count : {:5d} - Text :  {} - Chunk : {}".format(p.rank, p.count, p.text, p.chunks))

TextRank : 0.1567 - Count :     1 - Text :  minimal generating sets - Chunk : [minimal generating sets]
TextRank : 0.1371 - Count :     4 - Text :  systems - Chunk : [systems, systems, systems, a system]
TextRank : 0.1178 - Count :     3 - Text :  solutions - Chunk : [solutions, solutions, solutions]
TextRank : 0.1164 - Count :     1 - Text :  linear diophantine equations - Chunk : [linear Diophantine equations]
TextRank : 0.1077 - Count :     1 - Text :  nonstrict inequations - Chunk : [nonstrict inequations]
TextRank : 0.1050 - Count :     1 - Text :  mixed types - Chunk : [mixed types]
TextRank : 0.1044 - Count :     1 - Text :  strict inequations - Chunk : [strict inequations]
TextRank : 0.1000 - Count :     1 - Text :  a minimal supporting set - Chunk : [a minimal supporting set]
TextRank : 0.0979 - Count :     1 - Text :  linear constraints - Chunk : [linear constraints]
TextRank : 0.0919 - Count :     1 - Text :  upper bounds - Chunk : [Upper bounds]
TextRank : 0.0913 - Count : 

## Example

https://github.com/DerwenAI/pytextrank/blob/master/example.py

In [0]:
#!/usr/bin/env python
# encoding: utf-8

import logging
import pytextrank
import spacy
import sys

######################################################################
## sample usage
######################################################################

# load a spaCy model, depending on language, scale, etc.

nlp = spacy.load("en_core_web_sm")

# logging is optional: to debug, set the `logger` parameter
# when initializing the TextRank object

logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)
logger = logging.getLogger("PyTR")

# add PyTextRank into the spaCy pipeline

tr = pytextrank.TextRank(logger=None)
nlp.add_pipe(tr.PipelineComponent, name="textrank", last=True)


**Text Rank in a short text**

In [7]:

# parse the document
local = False

if local :
  with open("dat/mih.txt", "r") as f:
      text = f.read()
else: 
  from urllib.request import urlopen
  url = 'https://raw.githubusercontent.com/DerwenAI/pytextrank/master/dat/mih.txt'
  data = urlopen(url).read().decode("utf-8") 

doc = nlp(text)

print("pipeline", nlp.pipe_names)
print("elapsed time: {} ms".format(tr.elapsed_time))

pipeline ['tagger', 'parser', 'ner', 'textrank']
elapsed time: 8.878231048583984 ms


Top-ranked phrases

In [8]:
# examine the top-ranked phrases in the document
print('Top-ranked phrases: ')
print('RANK - COUNT - TEXT [chunks] ')
for phrase in doc._.phrases:
    print("{:.4f} {:5d}  {} - {}".format(phrase.rank, phrase.count, phrase.text, phrase.chunks))

Top-ranked phrases: 
RANK - COUNT - TEXT [chunks] 
0.1567     1  minimal generating sets - [minimal generating sets]
0.1371     4  systems - [systems, systems, systems, a system]
0.1178     3  solutions - [solutions, solutions, solutions]
0.1164     1  linear diophantine equations - [linear Diophantine equations]
0.1077     1  nonstrict inequations - [nonstrict inequations]
0.1050     1  mixed types - [mixed types]
0.1044     1  strict inequations - [strict inequations]
0.1000     1  a minimal supporting set - [a minimal supporting set]
0.0979     1  linear constraints - [linear constraints]
0.0919     1  upper bounds - [Upper bounds]
0.0913     1  a minimal set - [a minimal set]
0.0804     1  components - [components]
0.0797     1  natural numbers - [natural numbers]
0.0797     1  algorithms - [algorithms]
0.0782     1  all the considered types systems - [all the considered types systems]
0.0768     1  diophantine - [Diophantine]
0.0697     2  compatibility - [Compatibility, compatibi

Document to visualize graph

In [0]:
# generate a GraphViz doc to visualize the lemma graph
tr.write_dot(path="lemma_graph.dot")

**Text Rank in a long text**

In [0]:
# switch to a longer text document...

if local :
  with open("dat/lee.txt", "r") as f:
      text = f.read()
else: 
  from urllib.request import urlopen
  url = 'https://raw.githubusercontent.com/DerwenAI/pytextrank/master/dat/lee.txt'
  data = urlopen(url).read().decode("utf-8") 

doc = nlp(text)

Top-ranked phrases

In [11]:
for phrase in doc._.phrases[:20]:
    print(phrase)

minimal generating sets
systems
solutions
linear diophantine equations
nonstrict inequations
mixed types
strict inequations
a minimal supporting set
linear constraints
upper bounds
a minimal set
components
natural numbers
algorithms
all the considered types systems
diophantine
compatibility
construction
the set
criteria


Summarize

In [12]:
# summarize the document based on the top 15 phrases, 
# yielding the top 5 sentences...

for sent in doc._.textrank.summary(limit_phrases=15, limit_sentences=5):
    print(sent)


Upper bounds for components of a minimal set of solutions and algorithms of construction of minimal generating sets of solutions for all types of systems are given.
Criteria of compatibility of a system of linear Diophantine equations, strict inequations, and nonstrict inequations are considered.
These criteria and the corresponding algorithms for constructing a minimal supporting set of solutions can be used in solving all the considered types systems and systems of mixed types.
Compatibility of systems of linear constraints over the set of natural numbers.


**Stopwords**

In [13]:

# to show use of stopwords, first we output a baseline...

if local :
  with open("dat/gen.txt", "r") as f:
      text = f.read()
else: 
  from urllib.request import urlopen
  url = 'https://raw.githubusercontent.com/DerwenAI/pytextrank/master/dat/gen.txt'
  data = urlopen(url).read().decode("utf-8") 

doc = nlp(text)

print(f'Top ranked phrases with stopwords : ')
for phrase in doc._.phrases[:10]:
    print(f'\t - {phrase}')


Top ranked phrases with stopwords : 
	 - minimal generating sets
	 - systems
	 - solutions
	 - linear diophantine equations
	 - nonstrict inequations
	 - mixed types
	 - strict inequations
	 - a minimal supporting set
	 - linear constraints
	 - upper bounds


Load stopwords

In [14]:

# now add `("gensim", "PROPN")` to the stop words list
# then see how the top-ranked phrases differ...

if not local :
  url = 'https://raw.githubusercontent.com/DerwenAI/pytextrank/master/stop.json'
  data = urlopen(url).read().decode("utf-8") 
  with open('stop.json', 'w') as localfile:
      localfile.write(data)

tr.load_stopwords(path="stop.json")

doc = nlp(text)

print(f'Top ranked phrases without stopwords :')
for phrase in doc._.phrases[:10]:
    print(f'\t - {phrase}')

Top ranked phrases without stopwords :
	 - minimal generating sets
	 - systems
	 - solutions
	 - linear diophantine equations
	 - nonstrict inequations
	 - mixed types
	 - strict inequations
	 - a minimal supporting set
	 - linear constraints
	 - upper bounds
