
# RaKUn

This keyword detection algorithm exploits graph-based language representations for efficient denoising and keyword detection. Key ideas of RaKUn:
1. Transform texts to graphs
2. Prune graphs based on token similarity (meta vertex introduction)
3. Rank nodes -> keywords
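
The three steps above can be illustrated with a toy sketch in plain Python. This is not the RaKUn implementation: the function names `edit_distance` and `toy_keywords` are hypothetical, the graph only links consecutive tokens, near-duplicate tokens are merged when their edit distance is at most the threshold, and the ranking step swaps in a simple weighted degree instead of the centrality measure described in the paper.

```python
from collections import defaultdict


def edit_distance(a, b):
    """Levenshtein distance computed with a rolling dynamic-programming row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def toy_keywords(text, stopwords=frozenset(), distance_threshold=2, top_k=3):
    tokens = [t.strip(".,;:()").lower() for t in text.split()]
    tokens = [t for t in tokens if t and t not in stopwords]

    # 1. Text -> graph: weighted edges between consecutive tokens.
    edges = defaultdict(int)
    for left, right in zip(tokens, tokens[1:]):
        if left != right:
            edges[(left, right)] += 1

    # 2. Prune: map each token to the first earlier token within the
    #    edit-distance threshold (a stand-in for a meta vertex).
    representatives = []
    merged_into = {}
    for token in dict.fromkeys(tokens):  # unique tokens, first-seen order
        match = next(
            (r for r in representatives
             if edit_distance(r, token) <= distance_threshold),
            None,
        )
        if match is None:
            representatives.append(token)
            match = token
        merged_into[token] = match

    # 3. Rank: weighted degree of the merged nodes (assumption: a simple
    #    substitute for the ranking used by RaKUn).
    score = defaultdict(int)
    for (left, right), weight in edges.items():
        a, b = merged_into[left], merged_into[right]
        if a != b:
            score[a] += weight
            score[b] += weight
    return sorted(score.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


sample = ("The platform controls processes. Process control needs a platform. "
          "The platform traces each process.")
ranked = toy_keywords(sample, stopwords={"the", "a", "needs", "each"})
print(ranked)
```

In this sample, "control" is folded into "controls" and "process" into "processes" before ranking, which is the kind of denoising that the pruning step is meant to provide.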


Paper: https://arxiv.org/pdf/1907.06458v3.pdf

GitHub: https://github.com/SkBlaz/rakun


## Install

In [None]:
!pip install mrakun

## Document of study

We apply keyword extraction algorithms to one specific text. Using the same content every time makes the results of different algorithms comparable, and knowing the document well makes it possible to judge whether the extracted keywords are valid.

For this purpose we use the text of a scientific article, with the abstract and the keywords removed from the content.

The authors described the document with the following abstract and keywords:

* **Abstract**: The provision of comprehensive support for traceability and control is a raising demand in some environments such as the eHealth domain where processes can be of critical importance. This paper provides a detailed and thoughtful description of a holistic platform for the characterization and control of processes in the frame of the HACCP context. Traceability features are fully integrated in the model along with support for services concerned with information for the platform users. These features are provided using already tested technologies (RESTful models, QR Codes) and low cost devices (regular smartphones).

* **Keywords**: traceability, eHealth, software platform, mobile environments


Download the text file

In [None]:
!wget -O article.txt "https://www.dropbox.com/s/1mz0ociy6ipz67q/victor_roris-worldcist2016.txt?dl=1"

Read the content

In [3]:
# Read the whole article into a single string
with open('article.txt', mode='r', encoding='utf-8') as file:
  content = file.read()
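
Some text files start with a byte-order mark (U+FEFF), which then shows up as an invisible character at the beginning of the first line. A minimal sketch of a read step that drops it, assuming the file is UTF-8 encoded; the helper name `read_text_without_bom` and the temporary sample file are hypothetical and only serve the demonstration.

```python
import os
import tempfile


def read_text_without_bom(path):
    """Read a UTF-8 file, dropping a leading byte-order mark if present."""
    # The "utf-8-sig" codec skips the BOM when decoding and behaves like
    # plain UTF-8 when the file has none.
    with open(path, mode="r", encoding="utf-8-sig") as handle:
        return handle.read()


# Self-contained demonstration on a temporary file that starts with a BOM
with tempfile.TemporaryDirectory() as folder:
    sample_path = os.path.join(folder, "sample.txt")
    with open(sample_path, "wb") as handle:
        handle.write(b"\xef\xbb\xbfA telematic based approach\n")
    sample = read_text_without_bom(sample_path)

print(repr(sample))
```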

In [4]:
print(f"Number of words : {len(content.split())}")
print("First lines:")
for line in content.split("\n")[0:3]:
  print(line)

Number of words : 3830
First lines:
﻿________________
A telematic based approach towards the normalization of clinical praxis
Víctor M. Alonso Rorís1, Juan M. Santos Gago1, Luis Álvarez Sabucedo1, 


## Apply RaKUn

Hyperparameter description: https://github.com/SkBlaz/rakun#hyperparameter-explanation

In [12]:
import nltk
from mrakun import RakunDetector
from nltk.corpus import stopwords

# NLTK data: the stop word list and the punkt tokenizer models
nltk.download('stopwords')
nltk.download('punkt')

hyperparameters = {"distance_threshold": 2,
                   "distance_method": "editdistance",
                   "num_keywords": 10,
                   "pair_diff_length": 2,
                   "stopwords": stopwords.words('english'),
                   "bigram_count_threshold": 3,
                   "num_tokens": [1, 2],
                   "max_similar": 3,  # n most similar can show up n times
                   "max_occurrence": 3}  # maximum frequency overall

keyword_detector = RakunDetector(hyperparameters)
keywords = keyword_detector.find_keywords(content, input_type="text")

print()
keywords
# keyword_detector.visualize_network()

19-May-21 12:23:47 - Initiated a keyword detector instance.


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


19-May-21 12:23:51 - Number of nodes reduced from 877 to 800





[('control', 0.15019487151274621),
 ('system', 0.11831072413827162),
 ('information', 0.1072080768975499),
 ('control and traceability', 0.09631130380535535),
 ('control and traceability', 0.09631130380535535),
 ('control and traceability', 0.09631130380535535),
 ('application', 0.07461086465298247),
 ('operations', 0.06307873482228536),
 ('description', 0.06269849996528838),
 ('procedures', 0.055083288992466485)]
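
One way to judge this output against the author keywords listed earlier is to remove repeated entries and then look for exact and partial matches. The sketch below is only an illustration: the helper names `token_set` and `compare_keywords` are hypothetical, and "partial match" is taken here to mean that an extracted keyword shares at least one word with an author keyword.

```python
# Author keywords and extracted keywords, copied from this document
author_keywords = ["traceability", "eHealth", "software platform",
                   "mobile environments"]
extracted_keywords = ["control", "system", "information",
                      "control and traceability", "control and traceability",
                      "control and traceability", "application", "operations",
                      "description", "procedures"]


def token_set(phrase):
    """Lowercased set of the words in a keyword."""
    return set(phrase.lower().split())


def compare_keywords(extracted, gold):
    # Drop repeated entries while keeping the ranking order
    unique = list(dict.fromkeys(extracted))
    gold_lower = {g.lower() for g in gold}
    exact = [k for k in unique if k.lower() in gold_lower]
    # Assumption: sharing one word with an author keyword counts as partial
    partial = [k for k in unique
               if any(token_set(k) & token_set(g) for g in gold)]
    return unique, exact, partial


unique, exact, partial = compare_keywords(extracted_keywords, author_keywords)
print(f"Unique keywords : {len(unique)}")
print(f"Exact matches   : {exact}")
print(f"Partial matches : {partial}")
```

With these inputs there is no exact match, and "control and traceability" is the only extracted keyword that overlaps with an author keyword.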

### Use fasttext

Using RaKUn with fastText requires a pretrained embedding model. Download the .bin file for the chosen language from https://github.com/facebookresearch/fastText/blob/master/docs/pretrained-vectors.md and save it locally.

```python
from mrakun import RakunDetector
from nltk.corpus import stopwords

hyperparameters = {"distance_threshold":0.2,
                   "distance_method": "fasttext",
                   "pretrained_embedding_path": '../pretrained_models/fasttext/wiki.en.bin', #change path accordingly
                   "num_keywords" : 10,
                   "pair_diff_length":2,
                   "stopwords" : stopwords.words('english'),
                   "bigram_count_threshold":2,
                   "num_tokens":[1]}

keyword_detector = RakunDetector(hyperparameters)
example_data = "./datasets/wiki20/docsutf8/7183.txt"
keywords = keyword_detector.find_keywords(example_data)
print(keywords)

keyword_detector.verbose = False
```
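
With embeddings, `distance_threshold` is a fraction (0.2 above) rather than a number of character edits. The sketch below illustrates how such a threshold could separate near-synonymous tokens from unrelated ones, assuming that the distance between two tokens is the cosine distance between their vectors. The source does not confirm that RaKUn computes it this way, and the three-dimensional vectors are made up for the illustration; real fastText vectors have many more dimensions.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity of two equally sized vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


# Hypothetical toy vectors, not taken from any pretrained model
toy_vectors = {
    "process": [0.9, 0.1, 0.0],
    "processes": [0.85, 0.15, 0.05],
    "smartphone": [0.0, 0.2, 0.9],
}

distance_threshold = 0.2
for first, second in [("process", "processes"), ("process", "smartphone")]:
    distance = cosine_distance(toy_vectors[first], toy_vectors[second])
    decision = "merge" if distance <= distance_threshold else "keep separate"
    print(f"{first} / {second}: distance {distance:.3f} -> {decision}")
```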