<a href="https://colab.research.google.com/github/victor-roris/NLPlearning/blob/master/summarization-keywords/KeywordExtraction_KeyBERT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# KeyBERT

KeyBERT is a minimal and easy-to-use keyword extraction technique that uses BERT embeddings and simple cosine similarity to find the sub-phrases in a document that are most similar to the document itself.


GitHub: https://github.com/MaartenGr/KeyBERT

Tutorial: https://towardsdatascience.com/keyword-extraction-with-bert-724efca412ea

## Install

In [None]:
!pip install sentence-transformers

## Document of study

We are going to apply keyword extraction algorithms to one specific text. The idea is to always use the same content so that the results of the different methods can be compared. It also helps to know the document in order to judge whether the results are valid.

To this end, we use the text of a scientific article, from which we removed the abstract and the keywords.

The authors labelled the document with the following abstract and keywords:

* **Abstract**: The provision of comprehensive support for traceability and control is a raising demand in some environments such as the eHealth domain where processes can be of critical importance. This paper provides a detailed and thoughtful description of a holistic platform for the characterization and control of processes in the frame of the HACCP context. Traceability features are fully integrated in the model along with support for services concerned with information for the platform users. These features are provided using already tested technologies (RESTful models, QR Codes) and low cost devices (regular smartphones).

* **Keywords**: traceability, eHealth, software platform, mobile environments


Download the text file

In [None]:
!wget -O article.txt "https://www.dropbox.com/s/1mz0ociy6ipz67q/victor_roris-worldcist2016.txt?dl=1"

Read the content

In [1]:
# Read the whole article into a single string
with open('article.txt', mode='r', encoding='utf-8') as file:
  content = file.read()

In [9]:
print(f"Number of words : {len(content.split())}")
print("First lines:")
for line in content.split("\n")[0:3]:
  print(line)

Number of words : 3830
First lines:
﻿________________
A telematic based approach towards the normalization of clinical praxis
Víctor M. Alonso Rorís1, Juan M. Santos Gago1, Luis Álvarez Sabucedo1, 


## Minimal method of Keyword Extraction with BERT

In this section we describe the algorithm that KeyBERT uses to identify and extract keywords.

### Candidate Keywords/Keyphrases

We start by creating a list of candidate keywords or keyphrases from the document. Although many approaches focus on noun phrases, we keep it simple and use scikit-learn's CountVectorizer. It lets us specify the length of the keywords, turning them into keyphrases, and it is also a quick way to remove stop words.

In [10]:
from sklearn.feature_extraction.text import CountVectorizer

n_gram_range = (3, 3)
stop_words = "english"

# Extract candidate words/phrases
count = CountVectorizer(ngram_range=n_gram_range, stop_words=stop_words).fit([content])
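# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
candidates = count.get_feature_names_out().tolist()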

In [17]:
print(f"Number of keyword candidates : {len(candidates)}")

print("Some of them are :")
display(candidates[100:120])
print("...")

Number of keyword candidates : 2119
Some of them are :


['ability track history',
 'able access server',
 'able navigate different',
 'absence presence particular',
 'abstract characterization application',
 'acceptance environments health',
 'access basic features',
 'access current mobile',
 'access relevant information',
 'access server functions',
 'access statistics plots',
 'access web pages',
 'accessed list filtered',
 'accessed proper authorization',
 'accessible operations referenced',
 'according different criteria',
 'according physical biological',
 'according specific needs',
 'according usage model',
 'account user moment']

...
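
Several candidates above, such as 'able access server', join words that are not adjacent in the article. This is because CountVectorizer drops stop words before it builds the n-grams. The following is a minimal pure-Python sketch of that order of operations, not the scikit-learn implementation. The `ngram_candidates` helper and the tiny `STOP_WORDS` set are assumptions made for illustration; the real "english" list is much longer.

```python
import re

# Tiny illustrative stop-word list (an assumption, not scikit-learn's list).
STOP_WORDS = {"the", "is", "to", "a", "of"}

def ngram_candidates(text, n=3, stop_words=STOP_WORDS):
    """Return the sorted, unique word n-grams of `text` after dropping stop words."""
    # Lowercase the text and keep tokens of two or more word characters.
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    # Stop words go first, so an n-gram can join words that were not adjacent.
    tokens = [t for t in tokens if t not in stop_words]
    ngrams = {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
    return sorted(ngrams)

sentence = "The platform is able to access the server functions"
print(ngram_candidates(sentence))
# ['able access server', 'access server functions', 'platform able access']
```

Note that 'platform able access' never appears as a contiguous phrase in the sentence, which is worth keeping in mind when reading the extracted keyphrases.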


### Embeddings

Next, we convert both the document and the candidate keywords/keyphrases to numerical data. We use BERT for this purpose, as it has shown great results on both similarity and paraphrasing tasks.

In [11]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('distilbert-base-nli-mean-tokens')
doc_embedding = model.encode([content])
candidate_embeddings = model.encode(candidates)

In [19]:
doc_embedding.shape

(1, 768)

In [21]:
candidate_embeddings.shape

(2119, 768)

### Cosine Similarity

In the final step, we want to find the candidates that are most similar to the document. 

In [12]:
from sklearn.metrics.pairwise import cosine_similarity

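top_n = 20
# Despite the variable name, these are similarities: higher means closer to the document
distances = cosine_similarity(doc_embedding, candidate_embeddings)
# argsort is ascending, so the last top_n indices are the best matches (best one last)
keywords = [candidates[index] for index in distances.argsort()[0][-top_n:]]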

In [13]:
keywords

['________________ telematic based',
 'journal medical internet',
 'medicine technology 41',
 'domain clinical practice',
 'principles applications springer',
 'clinical praxis víctor',
 'applications springer science',
 'telematics solutions designed',
 'nutrición clínica el',
 'health history science',
 'video tutorials manuals',
 'hoc telematics solutions',
 'university vigo 36310',
 'medical informatics association',
 'unidad nutrición clínica',
 'telematic engineering department',
 'sanz valero2 telematic',
 'clinical medicine bmj',
 'medical internet research',
 'valero2 telematic engineering']
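
To make the ranking step concrete, here is a small pure-Python sketch of cosine similarity on hand-made vectors. The 2-d "embeddings", the names and the `top_candidates` helper are hypothetical and chosen only so the geometry is easy to check by hand. Unlike the cell above, this sketch returns the best match first.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def top_candidates(doc_vec, cand_vecs, names, top_n):
    """Return the top_n (name, similarity) pairs, most similar first."""
    scored = [(name, cosine(doc_vec, vec)) for name, vec in zip(names, cand_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]

# Hypothetical 2-d embeddings (real sentence embeddings here have 768 dimensions).
doc = [1.0, 0.0]
names = ["same direction", "diagonal", "orthogonal"]
vecs = [[2.0, 0.0], [1.0, 1.0], [0.0, 3.0]]

for name, score in top_candidates(doc, vecs, names, top_n=2):
    print(f"{name}: {score:.3f}")
# same direction: 1.000
# diagonal: 0.707
```

Cosine similarity ignores vector length, which is why `[2.0, 0.0]` scores a perfect 1.0 against `[1.0, 0.0]`.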

## KeyBERT library



In [None]:
!pip install "keybert[all]"

Minimal example

In [2]:
from keybert import KeyBERT
kw_model = KeyBERT('distilbert-base-nli-mean-tokens')
keywords = kw_model.extract_keywords(content)

In [5]:
keywords

[('telematics', 0.4772),
 ('clínica', 0.4731),
 ('medicine', 0.4718),
 ('dbpedia', 0.4575),
 ('hospitalaria', 0.4334)]

Set keyphrase_ngram_range to control the length of the resulting keywords/keyphrases:

In [4]:
kw_model.extract_keywords(content, keyphrase_ngram_range=(3, 3), stop_words=None)

[('valero2 telematic engineering', 0.6617),
 ('medical internet research', 0.6385),
 ('clinical medicine bmj', 0.63),
 ('sanz valero2 telematic', 0.6164),
 ('telematic engineering department', 0.613)]

Max Sum Similarity

In [6]:
kw_model.extract_keywords(content, keyphrase_ngram_range=(3, 3), stop_words='english', 
                              use_maxsum=True, nr_candidates=20, top_n=5)

[('________________ telematic based', 0.5062),
 ('medicine technology 41', 0.4976),
 ('health history science', 0.4705),
 ('video tutorials manuals', 0.3966),
 ('university vigo 36310', 0.3924)]
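
As we understand it, Max Sum Similarity first keeps the nr_candidates phrases closest to the document and then returns the combination of top_n phrases that are least similar to each other. The sketch below illustrates that idea in pure Python; it is not KeyBERT's own code. The `max_sum_similarity` helper, the phrase names and the 2-d vectors are hypothetical.

```python
import itertools
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def max_sum_similarity(doc_vec, cand_vecs, names, top_n, nr_candidates):
    """Pick top_n phrases out of the nr_candidates closest to the document,
    choosing the combination whose members are least similar to each other."""
    # Step 1: keep the nr_candidates phrases most similar to the document.
    ranked = sorted(range(len(names)),
                    key=lambda i: cosine(doc_vec, cand_vecs[i]), reverse=True)
    pool = ranked[:nr_candidates]

    # Step 2: exhaustive search for the lowest summed pairwise similarity.
    def redundancy(combo):
        return sum(cosine(cand_vecs[i], cand_vecs[j])
                   for i, j in itertools.combinations(combo, 2))

    best = min(itertools.combinations(pool, top_n), key=redundancy)
    return [names[i] for i in best]

# Hypothetical toy data: the first two phrases are near-duplicates.
doc = [1.0, 1.0]
names = ["telematic platform", "telematics platform", "clinical process", "weekly report"]
vecs = [[1.0, 0.9], [1.0, 0.8], [0.2, 1.0], [-1.0, 0.0]]

print(max_sum_similarity(doc, vecs, names, top_n=2, nr_candidates=3))
# ['telematics platform', 'clinical process']
```

A plain top-2 ranking would return the two near-duplicate phrases; here one of them is swapped for a less redundant phrase. The search is exhaustive, so its cost grows quickly: nr_candidates=20 with top_n=5 already means 15504 combinations.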

Maximal Marginal Relevance with high diversity

In [7]:
kw_model.extract_keywords(content, keyphrase_ngram_range=(3, 3), stop_words='english', 
                              use_mmr=True, diversity=0.7)

[('valero2 telematic engineering', 0.6617),
 ('science nutrition annals', 0.4805),
 ('user weekly monthly', 0.2615),
 ('managers perform better', 0.1867),
 ('thesis university california', 0.4465)]

Maximal Marginal Relevance with low diversity

In [8]:
kw_model.extract_keywords(content, keyphrase_ngram_range=(3, 3), stop_words='english', 
                              use_mmr=True, diversity=0.2)

[('valero2 telematic engineering', 0.6617),
 ('medical internet research', 0.6385),
 ('university vigo 36310', 0.6084),
 ('clinical medicine bmj', 0.63),
 ('video tutorials manuals', 0.5976)]
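
The effect of the diversity parameter can be illustrated with a small greedy Maximal Marginal Relevance sketch. It is written in pure Python under our own assumptions and is not KeyBERT's implementation; the `mmr` helper, the phrase names and the 2-d vectors are hypothetical. The selection starts from the phrase closest to the document, which would be consistent with both runs above sharing the same first keyphrase, and then repeatedly adds the phrase with the best trade-off between relevance and redundancy.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def mmr(doc_vec, cand_vecs, names, top_n, diversity):
    """Greedy Maximal Marginal Relevance selection of top_n phrases."""
    doc_sim = [cosine(doc_vec, v) for v in cand_vecs]
    # Start from the phrase most similar to the document.
    selected = [max(range(len(names)), key=lambda i: doc_sim[i])]
    remaining = [i for i in range(len(names)) if i != selected[0]]
    while remaining and len(selected) < top_n:
        scores = {}
        for i in remaining:
            # Redundancy: similarity to the closest phrase already selected.
            redundancy = max(cosine(cand_vecs[i], cand_vecs[j]) for j in selected)
            scores[i] = (1 - diversity) * doc_sim[i] - diversity * redundancy
        best = max(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
    return [names[i] for i in selected]

# Hypothetical toy data: two near-duplicates, one related phrase, one unrelated phrase.
doc = [1.0, 1.0]
names = ["telematic platform", "telematics platform", "clinical process", "weekly report"]
vecs = [[1.0, 0.9], [1.0, 0.8], [0.2, 1.0], [-1.0, 0.0]]

print(mmr(doc, vecs, names, top_n=2, diversity=0.2))
# ['telematic platform', 'telematics platform']
print(mmr(doc, vecs, names, top_n=2, diversity=0.7))
# ['telematic platform', 'weekly report']
```

With low diversity the near-duplicate phrase is picked second. With high diversity the redundancy penalty dominates, and a phrase that is barely related to the document can win, much like the low-scoring phrases in the high-diversity output above.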