<a href="https://colab.research.google.com/github/victor-roris/NLPlearning/blob/master/summarization-keywords/KeywordExtraction_KEX.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# KEX

Kex is a Python library for unsupervised keyword extraction. It supports the following features:

- Easy interface for keyword extraction with a variety of algorithms
- Quick benchmarking over 15 English public datasets
- Custom keyword extractor implementation support

Paper: https://arxiv.org/pdf/2104.08028v1.pdf

GitHub: https://github.com/asahi417/kex



## Install

In [1]:
!pip install kex



## Document of study

We are going to apply the keyword extraction algorithms to one specific text. Using the same content throughout lets us compare the results of the different algorithms. It also helps to know the document well, so that we can judge whether the extracted keywords are sensible.

For this purpose we use the text of a scientific article, with its abstract and keywords removed from the content.

The authors described the article with the following abstract and keywords:

* **Abstract**: The provision of comprehensive support for traceability and control is a raising demand in some environments such as the eHealth domain where processes can be of critical importance. This paper provides a detailed and thoughtful description of a holistic platform for the characterization and control of processes in the frame of the HACCP context. Traceability features are fully integrated in the model along with support for services concerned with information for the platform users. These features are provided using already tested technologies (RESTful models, QR Codes) and low cost devices (regular smartphones).

* **Keywords**: traceability, eHealth, software platform, mobile environments
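
Since the author keywords act as a reference, one simple way to judge an extractor is to count exact matches against them. The snippet below is a minimal sketch under that assumption: the helper names (`normalize`, `evaluate`) and the example prediction are hypothetical, and exact matching after lowercasing is only one of several possible criteria.

```python
def normalize(phrase):
    """Lowercase a phrase and collapse whitespace so matching ignores casing."""
    return " ".join(phrase.lower().split())


def evaluate(predicted, gold):
    """Return (precision, recall, f1) for exact matches after normalization."""
    predicted_set = {normalize(p) for p in predicted}
    gold_set = {normalize(g) for g in gold}
    hits = len(predicted_set & gold_set)
    precision = hits / len(predicted_set) if predicted_set else 0.0
    recall = hits / len(gold_set) if gold_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# Keywords assigned by the authors of the article
gold_keywords = ["traceability", "eHealth", "software platform", "mobile environments"]

# Hypothetical extractor output, for illustration only
example_prediction = ["Traceability", "control systems", "ehealth", "system"]

precision, recall, f1 = evaluate(example_prediction, gold_keywords)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Exact matching is strict: a prediction such as "platform" would not count as a hit for "software platform".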


Download the text file

In [2]:
!wget -O article.txt https://www.dropbox.com/s/1mz0ociy6ipz67q/victor_roris-worldcist2016.txt?dl=1 

--2021-05-19 11:44:58--  https://www.dropbox.com/s/1mz0ociy6ipz67q/victor_roris-worldcist2016.txt?dl=1
Resolving www.dropbox.com (www.dropbox.com)... 162.125.6.18, 2620:100:601c:18::a27d:612
Connecting to www.dropbox.com (www.dropbox.com)|162.125.6.18|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: /s/dl/1mz0ociy6ipz67q/victor_roris-worldcist2016.txt [following]
--2021-05-19 11:44:58--  https://www.dropbox.com/s/dl/1mz0ociy6ipz67q/victor_roris-worldcist2016.txt
Reusing existing connection to www.dropbox.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://uc6a62c3b2fccde64b1cf894c0ca.dl.dropboxusercontent.com/cd/0/get/BOxmxt3ajEznRcwiFf2y0-2OB8YaPh-6rgxfIdbC-A8kQYTPUt3P9Jh-dNT3gxGJoG_aaijJdhF2tU8z7wyLSh3PZ8TI_Vat7-N9S_Rv6Tc9C3GFJqHAyQDXFugph4wRLn7_J28LkDEVLFs19yaZjKCv/file?dl=1# [following]
--2021-05-19 11:44:58--  https://uc6a62c3b2fccde64b1cf894c0ca.dl.dropboxusercontent.com/cd/0/get/BOxmxt3ajEznRcwiFf2y0-2OB8YaPh-6rg

Read the content

In [3]:
# Read the whole article into a single string
with open('article.txt', mode='r') as file:
  content = file.read()

In [4]:
print(f"Number of words : {len(content.split())}")
print("First lines:")
for line in content.split("\n")[0:3]:
  print(line)

Number of words : 3830
First lines:
﻿________________
A telematic based approach towards the normalization of clinical praxis
Víctor M. Alonso Rorís1, Juan M. Santos Gago1, Luis Álvarez Sabucedo1, 


## Apply KEX


In [5]:
import kex
n_keywords = 10

2021-05-19 11:44:59 INFO     'pattern' package not found; tag filters are not available for English


In [6]:
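def print_keywords(keywords):
  """Print the fields of the top-ranked keyword, then the full ranked list."""
  print(f"Number of keywords : {len(keywords)}")

  # Use the first (top-ranked) keyword to show which fields are available
  keyword = keywords[0]

  print(f"- Parameters by keyword : {keyword.keys()}")

  print("- First keyword")
  print(f"\t - Keyword : {keyword['stemmed']}")
  print(f"\t - Number of ocurrences : {keyword['count']}")
  print(f"\t - Score : {keyword['score']}")

  print()
  print("List of keywords")
  # 'raw' is indexed as a sequence; its first entry is printed as the surface form
  for idx, keyword in enumerate(keywords):
    print(f"{idx} [{keyword['score']}] {keyword['raw'][0]}")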

The built-in algorithms in kex are:

- FirstN: heuristic baseline to pick up first n phrases as keywords
- TF: scoring by term frequency
- TFIDF: scoring by TFIDF
- LexSpec: scoring by lexical specificity
- TextRank: Mihalcea et al., 04
- SingleRank: Wan et al., 08
- TopicalPageRank: Liu et al., 10
- SingleTPR: Sterckx et al., 15
- TopicRank: Bougouin et al., 13
- PositionRank: Florescu et al., 18
- TFIDFRank: SingleRank + TFIDF based word distribution prior
- LexRank: SingleRank + lexical specificity based word distribution prior
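
Several of the algorithms in this list (TextRank and the variants built on it) rank words by running PageRank over a word co-occurrence graph. The following is a rough, pure-Python sketch of that idea only, not the kex implementation: the stopword list, the window size and the function names are assumptions, and it skips the part-of-speech filtering and phrase assembly that a real extractor performs.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list (an assumption, not the list kex uses)
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "for", "with", "on", "by"}


def build_graph(text, window=2):
    """Link tokens that fall inside the same sliding window of `window` tokens.

    Stopwords are removed first, so with window=2 an edge joins each pair of
    consecutive remaining tokens.
    """
    tokens = [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]
    graph = defaultdict(set)
    for i, token in enumerate(tokens):
        for other in tokens[i + 1 : i + window]:
            if other != token:
                graph[token].add(other)
                graph[other].add(token)
    return graph


def pagerank(graph, damping=0.85, iterations=50):
    """Power iteration for PageRank on an undirected, unweighted graph."""
    scores = {node: 1.0 / len(graph) for node in graph}
    for _ in range(iterations):
        scores = {
            node: (1 - damping) / len(graph)
            + damping * sum(scores[nb] / len(graph[nb]) for nb in graph[node])
            for node in graph
        }
    return scores


sample = (
    "Traceability and control of processes. "
    "The platform supports control of processes and traceability of records."
)
graph = build_graph(sample)
scores = pagerank(graph)
for word, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{score:.3f} {word}")
```

Words that co-occur with many other words accumulate score, which is why frequent, well-connected terms tend to dominate the output of graph-based extractors.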

### Direct methods


Algorithms such as `FirstN`, `TextRank`, `SingleRank`, `TopicRank` and `PositionRank` can be directly applied.
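
Because these extractors share the same call pattern, a small helper can run several of them and collect their keywords side by side. This is a sketch under assumptions: `compare_extractors` is a hypothetical helper, it relies only on the `get_keywords(text, n_keywords=...)` call and the `'raw'` field used in this notebook, and `DummyExtractor` is a stand-in so the snippet runs without kex installed.

```python
def compare_extractors(models, text, n_keywords=10):
    """Run each extractor on `text` and collect the first surface form of each keyword.

    `models` maps a display name to any object exposing
    `get_keywords(text, n_keywords=...)` that returns dicts with a 'raw' list.
    """
    results = {}
    for name, model in models.items():
        keywords = model.get_keywords(text, n_keywords=n_keywords)
        results[name] = [kw["raw"][0] for kw in keywords]
    return results


class DummyExtractor:
    """Hypothetical stand-in that returns a fixed list of phrases."""

    def __init__(self, phrases):
        self.phrases = phrases

    def get_keywords(self, text, n_keywords=10):
        return [{"raw": [p]} for p in self.phrases[:n_keywords]]


results = compare_extractors(
    {"A": DummyExtractor(["system", "control"]), "B": DummyExtractor(["client"])},
    "any text",
    n_keywords=1,
)
for name, phrases in results.items():
    print(f"{name}: {phrases}")
```

With kex installed, the same helper could presumably be called as `compare_extractors({"FirstN": kex.FirstN(), "TextRank": kex.TextRank()}, content, n_keywords)`.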


In [7]:
model = kex.FirstN() 
keywords = model.get_keywords(content, n_keywords=n_keywords)
print_keywords(keywords)

Number of keywords : 10
- Parameters by keyword : dict_keys(['stemmed', 'pos', 'raw', 'offset', 'count', 'score'])
- First keyword
	 - Keyword : approach
	 - Number of ocurrences : 1
	 - Score : 0

List of keywords
0 [0] approach
1 [1] normalization
2 [2] Luis Álvarez Sabucedo1
3 [3] Mateo Ramos Merino1
4 [4] Engineering Department
5 [5] University
6 [6] Vigo
7 [7] 36310 Vigo
8 [8] Spain
9 [9] valonso


In [17]:
model = kex.TextRank() 
keywords = model.get_keywords(content, n_keywords=n_keywords)
print_keywords(keywords)

Number of keywords : 10
- Parameters by keyword : dict_keys(['stemmed', 'pos', 'raw', 'offset', 'count', 'score', 'n_source_tokens'])
- First keyword
	 - Keyword : system
	 - Number of ocurrences : 24
	 - Score : 0.010023246097269418

List of keywords
0 [0.010023246097269418] system
1 [0.009063705904145378] technological system
2 [0.009020646588795533] information systems
3 [0.008359524585289803] control systems
4 [0.008104165711021339] technologies
5 [0.008061106395671494] information technology
6 [0.00801804708032165] information
7 [0.00768183756774727] system description
8 [0.007525717686871705] information operations
9 [0.00750820812727075] process


In [8]:
model = kex.SingleRank()  
keywords = model.get_keywords(content, n_keywords=n_keywords)
print_keywords(keywords)

Number of keywords : 10
- Parameters by keyword : dict_keys(['stemmed', 'pos', 'raw', 'offset', 'count', 'score', 'n_source_tokens'])
- First keyword
	 - Keyword : system
	 - Number of ocurrences : 24
	 - Score : 0.00951135136685336

List of keywords
0 [0.00951135136685336] system
1 [0.009316402856993903] operations
2 [0.009193496709651451] control systems
3 [0.00909602245472172] Control operations
4 [0.00887564205244954] control
5 [0.008746549543591418] technological system
6 [0.008601789883868498] client
7 [0.008568551318443526] information systems
8 [0.008471077063513797] information operations
9 [0.007981747720329476] technologies


In [18]:
model = kex.TopicRank() 
keywords = model.get_keywords(content, n_keywords=n_keywords)
print_keywords(keywords)

Number of keywords : 10
- Parameters by keyword : dict_keys(['stemmed', 'pos', 'raw', 'offset', 'count', 'score', 'n_source_tokens'])
- First keyword
	 - Keyword : system
	 - Number of ocurrences : 24
	 - Score : 0.015863301489363928

List of keywords
0 [0.015863301489363928] system
1 [0.013317010133330273] operating temperature
2 [0.013174100570169357] human users
3 [0.011725946689673762] control
4 [0.010831990047865789] e.g.
5 [0.010617600258692066] information technology
6 [0.01045177655297573] PN
7 [0.009041290380981006] procedures
8 [0.008741900804153602] labels
9 [0.008722486691391423] traces


In [19]:
model = kex.PositionRank() 
keywords = model.get_keywords(content, n_keywords=n_keywords)
print_keywords(keywords)

Number of keywords : 10
- Parameters by keyword : dict_keys(['stemmed', 'pos', 'raw', 'offset', 'count', 'score', 'n_source_tokens'])
- First keyword
	 - Keyword : M.
	 - Number of ocurrences : 3
	 - Score : 0.042495719917476305

List of keywords
0 [0.042495719917476305] M.
1 [0.027713713851861645] 1
2 [0.025290769818424177] J. M.
3 [0.025269399780546892] V. M.
4 [0.023274685911553985] M. A.
5 [0.018997748507953865] ad-hoc telematics solutions
6 [0.018236219614993644] Clinical
7 [0.018011054269844104] clinical praxis Víctor M. Alonso Rorís1
8 [0.017641874985082736] Juan M. Santos Gago1
9 [0.0166193427185386] Álvarez


### Compute a statistical prior

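Algorithms such as `TF`, `TFIDF`, `TFIDFRank`, `LexSpec`, `LexRank`, `TopicalPageRank`, and `SingleTPR` need to compute a prior distribution beforehand. Here the prior is trained on the sentences of the article itself, which we obtain with NLTK's Punkt tokenizer: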

In [9]:
import nltk
import nltk.data
nltk.download('punkt')

tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
sentences = tokenizer.tokenize(content)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
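
The `train(sentences, ...)` calls in this section fit the statistical prior on the sentences of the article, treating each sentence as a document. As a rough illustration of what such a prior can look like for TF-IDF, the sketch below derives inverse document frequencies from a list of sentences. It uses the textbook formula `idf = log(N / df)`, which is an assumption and may differ from what kex computes internally; the function names and toy sentences are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(sentence):
    """Split a sentence into lowercase alphabetic tokens."""
    return re.findall(r"[a-z]+", sentence.lower())


def inverse_document_frequency(sentences):
    """Treat each sentence as a document and return idf = log(N / df) per word."""
    n_docs = len(sentences)
    doc_freq = Counter()
    for sentence in sentences:
        doc_freq.update(set(tokenize(sentence)))
    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}


def tfidf_scores(text, idf):
    """Score each word of `text` by its term frequency times its idf prior."""
    counts = Counter(tokenize(text))
    return {word: tf * idf.get(word, 0.0) for word, tf in counts.items()}


toy_sentences = [
    "The system records the control operations.",
    "The system notifies the users.",
    "The control operations are traceable.",
]
idf = inverse_document_frequency(toy_sentences)
scores = tfidf_scores(" ".join(toy_sentences), idf)
for word in ("the", "system", "records"):
    print(f"{word}: idf={idf[word]:.3f} score={scores[word]:.3f}")
```

A word that appears in every sentence gets an idf of zero, so it cannot become a keyword however often it occurs.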


In [10]:
model = kex.TF() 
model.train(sentences, export_directory='./tmp')
keywords = model.get_keywords(content, n_keywords=n_keywords)
print_keywords(keywords)

Number of keywords : 10
- Parameters by keyword : dict_keys(['stemmed', 'pos', 'raw', 'offset', 'count', 'score', 'n_source_tokens'])
- First keyword
	 - Keyword : system
	 - Number of ocurrences : 24
	 - Score : 37.0

List of keywords
0 [37.0] system
1 [34.5] control systems
2 [34.0] operations
3 [33.0] Control operations
4 [32.0] control
5 [30.0] information systems
6 [28.5] information operations
7 [26.0] Users
8 [25.0] system description
9 [24.5] technological system


In [11]:
model = kex.TFIDF() 
model.train(sentences, export_directory='./tmp')
keywords = model.get_keywords(content, n_keywords=n_keywords)
print_keywords(keywords)

2021-05-19 11:45:02 INFO     adding document #0 to Dictionary(0 unique tokens: [])
2021-05-19 11:45:02 INFO     built Dictionary(897 unique tokens: ['&', ',', '.', '03550', '1']...) from 246 documents (total 3153 corpus positions)
2021-05-19 11:45:02 INFO     collecting document frequencies
2021-05-19 11:45:02 INFO     PROGRESS: processing document #0
2021-05-19 11:45:02 INFO     calculating IDF weights for 246 documents and 896 features (2859 matrix non-zeros)
2021-05-19 11:45:02 INFO     saving TfidfModel object under ./tmp/tfidf_model, separately None
2021-05-19 11:45:02 INFO     saved ./tmp/tfidf_model
2021-05-19 11:45:02 INFO     saving dictionary mapping to ./tmp/tfidf_dict


Number of keywords : 10
- Parameters by keyword : dict_keys(['stemmed', 'pos', 'raw', 'offset', 'count', 'score', 'n_source_tokens'])
- First keyword
	 - Keyword : oper
	 - Number of ocurrences : 21
	 - Score : 0.14893716050988023

List of keywords
0 [0.14893716050988023] operations
1 [0.1456873886164094] system
2 [0.1423704106854542] Control operations
3 [0.14074552473871876] control systems
4 [0.13580366086102816] control
5 [0.13135338998251403] information operations
6 [0.12972850403577862] information systems
7 [0.12179945368644407] Users
8 [0.11376961945514784] information
9 [0.11200222045350436] system description


In [12]:
model = kex.TFIDFRank() 
model.train(sentences, export_directory='./tmp')
keywords = model.get_keywords(content, n_keywords=n_keywords)
print_keywords(keywords)

2021-05-19 11:45:04 INFO     adding document #0 to Dictionary(0 unique tokens: [])
2021-05-19 11:45:04 INFO     built Dictionary(897 unique tokens: ['&', ',', '.', '03550', '1']...) from 246 documents (total 3153 corpus positions)
2021-05-19 11:45:04 INFO     collecting document frequencies
2021-05-19 11:45:04 INFO     PROGRESS: processing document #0
2021-05-19 11:45:04 INFO     calculating IDF weights for 246 documents and 896 features (2859 matrix non-zeros)
2021-05-19 11:45:04 INFO     saving TfidfModel object under ./tmp/tfidf_model, separately None
2021-05-19 11:45:04 INFO     saved ./tmp/tfidf_model
2021-05-19 11:45:04 INFO     saving dictionary mapping to ./tmp/tfidf_dict


Number of keywords : 10
- Parameters by keyword : dict_keys(['stemmed', 'pos', 'raw', 'offset', 'count', 'score', 'n_source_tokens'])
- First keyword
	 - Keyword : oper
	 - Number of ocurrences : 21
	 - Score : 0.014513331529947682

List of keywords
0 [0.014513331529947682] operations
1 [0.01427188260367997] Control operations
2 [0.014030433677412257] control
3 [0.013836296749628656] control systems
4 [0.013642159821845056] system
5 [0.012671743174924793] information operations
6 [0.012236157320873482] information systems
7 [0.012026474425472382] client
8 [0.011676799411742099] technological system
9 [0.010830154819901905] information


In [13]:
model = kex.LexSpec() 
model.train(sentences, export_directory='./tmp')
keywords = model.get_keywords(content[:11500], n_keywords=n_keywords) # Fails on the full text, so only the first 11500 characters are used
print_keywords(keywords)

Number of keywords : 10
- Parameters by keyword : dict_keys(['stemmed', 'pos', 'raw', 'offset', 'count', 'score', 'n_source_tokens'])
- First keyword
	 - Keyword : server
	 - Number of ocurrences : 4
	 - Score : 1.291104119772643

List of keywords
0 [1.291104119772643] server
1 [1.1954150930170626] recording
2 [1.1064416734547526] mechanisms
3 [0.924830053488585] entities
4 [0.6455520598863215] server functions
5 [0.6455520598863215] server agent
6 [0.5996007124997992] recorded information
7 [0.5977075465085313] traceable records
8 [0.5977075465085313] auditable record
9 [0.5532208367273763] invocation mechanisms


In [14]:
model = kex.LexRank() 
model.train(sentences, export_directory='./tmp')
keywords = model.get_keywords(content[:11500], n_keywords=n_keywords) # Fails on the full text, so only the first 11500 characters are used
print_keywords(keywords)

Number of keywords : 10
- Parameters by keyword : dict_keys(['stemmed', 'pos', 'raw', 'offset', 'count', 'score', 'n_source_tokens'])
- First keyword
	 - Keyword : system
	 - Number of ocurrences : 19
	 - Score : 0.03041609828790691

List of keywords
0 [0.03041609828790691] system
1 [0.02832000843073278] agents
2 [0.02683221457198216] entities
3 [0.0257337302241822] client agents
4 [0.02333211196445681] server agent
5 [0.02314745201763162] client
6 [0.023083767551911247] system description
7 [0.021330364707705406] HTTP
8 [0.020784550096836097] monitored variables
9 [0.020771119619909992] information systems


In [15]:
model = kex.TopicalPageRank() 
model.train(sentences, export_directory='./tmp')

## There seems to be a bug in the implementation: the `build_graph` method returns
## 4 values, but the calling code unpacks only 3. I manually patched the library to
## accept the extra, unused value:
# graph, phrase_instance, original_sentence_token_size, _ = self.build_graph(document)

keywords = model.get_keywords(content, n_keywords=n_keywords) 
print_keywords(keywords)

2021-05-19 11:45:07 INFO     adding document #0 to Dictionary(0 unique tokens: [])
2021-05-19 11:45:07 INFO     built Dictionary(897 unique tokens: ['&', ',', '.', '03550', '1']...) from 246 documents (total 3153 corpus positions)
2021-05-19 11:45:07 INFO     using symmetric alpha at 0.06666666666666667
2021-05-19 11:45:07 INFO     using symmetric eta at 0.06666666666666667
2021-05-19 11:45:07 INFO     using serial LDA version on this node
2021-05-19 11:45:07 INFO     running online (single-pass) LDA training, 15 topics, 1 passes over the supplied corpus of 246 documents, updating model once every 246 documents, evaluating perplexity every 246 documents, iterating 50x with a convergence threshold of 0.001000
2021-05-19 11:45:07 INFO     -13.721 per-word bound, 13506.0 perplexity estimate based on a held-out corpus of 246 documents with 3153 words
2021-05-19 11:45:07 INFO     PROGRESS: pass 0, at document #246/246
2021-05-19 11:45:07 INFO     topic #2 (0.067): 0.106*"," + 0.052*"." + 0.

4
Number of keywords : 10
- Parameters by keyword : dict_keys(['stemmed', 'pos', 'raw', 'offset', 'count', 'score', 'n_source_tokens'])
- First keyword
	 - Keyword : control oper defin variabl
	 - Number of ocurrences : 1
	 - Score : 0.03879468693711062

List of keywords
0 [0.03879468693711062] control operation defines variables
1 [0.03451257219321824] control process procedures
2 [0.0319067860816122] control systems
3 [0.031685053758973764] Control operations
4 [0.030483723020960727] Standard Operating Procedures
5 [0.029533370625683952] client application agent
6 [0.028246981884026336] generic client agent
7 [0.027196593246606168] information systems
8 [0.027095874492829033] parenteral nutrition systems
9 [0.026974860923967732] information operations


In [16]:
model = kex.SingleTPR() 
model.train(sentences, export_directory='./tmp')
keywords = model.get_keywords(content, n_keywords=n_keywords) 
print_keywords(keywords)

2021-05-19 11:51:31 INFO     adding document #0 to Dictionary(0 unique tokens: [])
2021-05-19 11:51:31 INFO     built Dictionary(897 unique tokens: ['&', ',', '.', '03550', '1']...) from 246 documents (total 3153 corpus positions)
2021-05-19 11:51:31 INFO     using symmetric alpha at 0.06666666666666667
2021-05-19 11:51:31 INFO     using symmetric eta at 0.06666666666666667
2021-05-19 11:51:31 INFO     using serial LDA version on this node
2021-05-19 11:51:31 INFO     running online (single-pass) LDA training, 15 topics, 1 passes over the supplied corpus of 246 documents, updating model once every 246 documents, evaluating perplexity every 246 documents, iterating 50x with a convergence threshold of 0.001000
2021-05-19 11:51:31 INFO     -13.737 per-word bound, 13652.2 perplexity estimate based on a held-out corpus of 246 documents with 3153 words
2021-05-19 11:51:31 INFO     PROGRESS: pass 0, at document #246/246
2021-05-19 11:51:31 INFO     topic #10 (0.067): 0.070*"," + 0.063*"." + 0

Number of keywords : 10
- Parameters by keyword : dict_keys(['stemmed', 'pos', 'raw', 'offset', 'count', 'score', 'n_source_tokens'])
- First keyword
	 - Keyword : oper
	 - Number of ocurrences : 21
	 - Score : 0.016645899118662582

List of keywords
0 [0.016645899118662582] operations
1 [0.01610051087992711] Control operations
2 [0.015555122641191632] control
3 [0.014931230333202268] control systems
4 [0.014307338025212906] system
5 [0.014038391837220335] information operations
6 [0.012911072153984123] client
7 [0.012869111290495498] information systems
8 [0.011962486579538929] technological system
9 [0.011439380176468741] control process procedures
