# NLP Keyword Extraction - Introduction

Keyword extraction is a natural language processing technique that automatically identifies the words and phrases that best summarize the main themes of a document. These words and phrases can be used as index terms or tags to help classify and organize the text, and can also be used as input to other NLP tasks such as text summarization or information retrieval.

## Dataset

There are various datasets available for the task of keyword extraction. I am particularly interested in keyword extraction from the title and abstract of a research paper, and hence the [ICMLA 2014/2015/2016/2017 Accepted Papers Data Set](https://data.mendeley.com/datasets/wj5vb6h9jy/2) from Mendeley Data is used here in the examples.

## Keyword Extraction Methods

There are multiple keyword extraction methods. Three of them are described here: TF-IDF, RAKE and YAKE.

### Term Frequency-Inverse Document Frequency (TF-IDF)

TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical measure used to evaluate the importance of a word or phrase within a document in relation to an entire corpus. It's often used as a weighting factor in information retrieval and text mining tasks, such as text summarization, document classification and keyword extraction.

The basic idea behind TF-IDF is that words that appear frequently in a document are likely to be more important to the meaning of that document than words that appear less frequently. However, it also takes into account that some words, such as "the" or "and", appear frequently across many documents, and therefore shouldn't be given as much weight as words that appear less frequently in the corpus. This is where the "IDF" component comes in, which down-weights words that appear in many documents. 

The term frequency (TF) can be calculated as:
$$tf(t,d) = \frac{n(t,d)}{\sum_{t' \in d} n(t',d)}$$

The inverse document frequency (IDF) can be calculated as:
$$idf(t) = \log_{10}\left(\frac{N}{df(t)}\right)$$

TF-IDF is the product of TF and IDF:
$$\text{tf-idf}(t,d) = tf(t,d) \times idf(t)$$

Where,
- $tf(t,d)$ : term frequency of term `t` in document `d`
- $n(t,d)$ : number of occurrences of term `t` in document `d`
- $\sum_{t' \in d} n(t',d)$ : total number of words in document `d`
- $idf(t)$ : inverse document frequency of term `t`
- $N$ : total number of documents in the corpus
- $df(t)$ : number of documents in the corpus that contain term `t`
- $\text{tf-idf}(t,d)$ : TF-IDF weight of term `t` in document `d`
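
To make the formulas concrete, here is a minimal pure-Python sketch that applies them to a toy corpus. The corpus and the function names (`tf`, `idf`, `tf_idf`) are made up for illustration and are not taken from any library; libraries such as scikit-learn use slightly different smoothing and normalization, so their numbers will not match these.

```python
import math
from collections import Counter

# Toy corpus (hypothetical data): each document is a list of lowercase tokens.
corpus = [
    "deep learning for image classification".split(),
    "image segmentation with deep networks".split(),
    "keyword extraction for research papers".split(),
]

def tf(term, doc):
    """n(t,d) divided by the total number of tokens in d."""
    return Counter(doc)[term] / len(doc)

def idf(term, docs):
    """log10(N / df(t)); assumes the term occurs in at least one document."""
    df = sum(1 for doc in docs if term in doc)
    return math.log10(len(docs) / df)

def tf_idf(term, doc, docs):
    """TF-IDF weight of a term in one document of the corpus."""
    return tf(term, doc) * idf(term, docs)

doc = corpus[0]
for term in doc:
    print(f"{term:15s} {tf_idf(term, doc, corpus):.4f}")
```

In this toy corpus, "learning" occurs in only one document while "deep" occurs in two, so "learning" receives the higher weight even though both appear once in the first document.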




For extracting the keywords, the TF-IDF of every candidate word or phrase in the document is calculated, and then the top five keyphrases with the highest TF-IDF are selected as keywords. Keywords can be unigrams, bigrams or longer n-grams, where n can in principle be any number from 1 to the total number of words in the document, but generally only bigrams (n=2) and trigrams (n=3) are used.
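
The selection step can be sketched as follows. This is only an illustration under simple assumptions: the documents are already tokenized, no stopword filtering is applied, and the helper names (`ngrams`, `top_keyphrases`) and the example documents are hypothetical.

```python
import math
from collections import Counter

def ngrams(tokens, n_min=1, n_max=3):
    """All contiguous n-grams of tokens for n_min <= n <= n_max, joined by spaces."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]

def top_keyphrases(docs, doc_index, k=5, n_min=1, n_max=3):
    """Rank the n-grams of one document by TF-IDF and return the top k."""
    grams_per_doc = [ngrams(d, n_min, n_max) for d in docs]
    # df(t): number of documents that contain each n-gram.
    df = Counter(g for grams in grams_per_doc for g in set(grams))
    counts = Counter(grams_per_doc[doc_index])
    total = sum(counts.values())
    scores = {
        g: (c / total) * math.log10(len(docs) / df[g])
        for g, c in counts.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]

# Hypothetical, pre-tokenized titles.
docs = [
    "support vector machines for text classification".split(),
    "neural networks for text classification".split(),
    "support vector machines for anomaly detection".split(),
]

for phrase, score in top_keyphrases(docs, doc_index=1):
    print(f"{score:.4f}  {phrase}")
```

Because "for" appears in every document, its IDF is zero and it never reaches the top of the ranking. In practice a stopword filter is usually applied as well, so that candidates such as "networks for" are discarded.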

### Rapid Automatic Keyword Extraction (RAKE)

Rapid Automatic Keyword Extraction (RAKE) is an unsupervised method for extracting keywords from individual documents. It is based on the observation that keywords are often multi-word phrases that rarely contain stopwords or punctuation.

The RAKE algorithm uses two main components to extract keywords:

- A stopword list: a list of common words (such as "and", "the", "is", etc.) that are not considered to be keywords.
- A delimiter list: a list of characters (such as punctuation marks) that are used to split the text into phrases.

The algorithm works by:

1. Splitting the text into candidate phrases at the delimiters and at stopwords, so that each candidate is a run of consecutive non-stopwords
2. Calculating the frequency of each word, which is the number of times it occurs across the candidate phrases
3. Calculating the degree of each word, which is the number of word co-occurrences within candidate phrases, counting the word itself (equivalently, the summed length of the phrases it occurs in)
4. Assigning a score to each word, which is its degree divided by its frequency
5. Assigning a score to each phrase, which is the sum of the scores of the words in the phrase
6. Selecting the phrases with the highest scores as the keywords
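
A minimal sketch of RAKE-style scoring is shown below. It follows the commonly described formulation in which each word is scored as degree divided by frequency and each phrase as the sum of its word scores. The stopword list, the regular expressions and the function names are simplifying assumptions of mine, not part of any particular RAKE library.

```python
import re
from collections import defaultdict

# A tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "on", "the", "to", "with"}

def candidate_phrases(text):
    """Split text at punctuation and stopwords into candidate phrases (lists of words)."""
    phrases = []
    # Punctuation acts as a phrase delimiter.
    for fragment in re.split(r'[.,;:!?()\[\]"\n]+', text.lower()):
        current = []
        for word in re.findall(r"[a-z0-9][a-z0-9\-]*", fragment):
            if word in STOPWORDS:
                # A stopword closes the current candidate phrase.
                if current:
                    phrases.append(current)
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(current)
    return phrases

def rake(text, top=5):
    """Return the top-scoring candidate phrases as (phrase, score) pairs."""
    phrases = candidate_phrases(text)
    freq = defaultdict(int)    # freq(w): occurrences of w across candidate phrases
    degree = defaultdict(int)  # deg(w): summed length of the phrases w occurs in
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)
    word_score = {w: degree[w] / freq[w] for w in freq}
    # A phrase's score is the sum of its member word scores.
    phrase_scores = {" ".join(p): sum(word_score[w] for w in p) for p in phrases}
    return sorted(phrase_scores.items(), key=lambda item: item[1], reverse=True)[:top]

text = (
    "Keyword extraction is the task of identifying important phrases in a document. "
    "Keyword extraction methods are unsupervised."
)
for phrase, score in rake(text):
    print(f"{score:.1f}  {phrase}")
```

Note how the scoring favors longer phrases: every word in a three-word phrase gets a degree of at least 3, so long candidates tend to outrank short ones.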


The RAKE algorithm is considered a simple and efficient keyword extraction method, but it also has some limitations: it doesn't take into account the context or meaning of the words and phrases, it cannot extract phrases that contain a stopword (such as "quality of service") without extra handling, and its scoring tends to favor long phrases.

### Yet Another Keyword Extractor (YAKE)

Yet Another Keyword Extractor (YAKE) is an unsupervised automatic keyword extraction method based on statistical features computed from a single document. Unlike TextRank, it is not graph-based. It was introduced in 2018 as a language-independent keyword extraction method that does not require a training corpus or any external knowledge beyond a stopword list.

The YAKE algorithm works by:

1. Pre-processing the text: splitting it into sentences and terms, and marking stopwords.

2. Computing a set of statistical features for each term: casing, position in the text, normalized term frequency, relatedness to context (how many different terms co-occur with it), and the number of different sentences it appears in.

3. Combining these features into a single score for each term.

4. Generating candidate n-grams, scoring each one from the scores of its terms, removing near-duplicate candidates, and ranking the rest.

Unlike most other methods, YAKE assigns lower scores to more relevant keywords, so the final ranking is in ascending order of score.
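
How such features can be combined into a score is sketched below. This reflects my reading of the scoring formulas in the YAKE paper and is not the reference implementation. The function and parameter names are hypothetical, the feature values are invented inputs, and computing the features from raw text is left out entirely.

```python
from math import prod

def term_score(t_case, t_pos, tf_norm, t_rel, t_sent):
    """Combine per-term features into S(t); lower values mean more relevant terms.

    S(t) = (t_rel * t_pos) / (t_case + tf_norm / t_rel + t_sent / t_rel)
    """
    return (t_rel * t_pos) / (t_case + tf_norm / t_rel + t_sent / t_rel)

def keyword_score(term_scores, keyword_freq):
    """Score a candidate n-gram from the S(t) of its terms and its own frequency.

    S(kw) = prod(S(t)) / (keyword_freq * (1 + sum(S(t))))
    """
    return prod(term_scores) / (keyword_freq * (1 + sum(term_scores)))

# Invented feature values for two terms (assumptions, not measured from any text).
# The first term is capitalized often, appears early and is spread over sentences.
strong = term_score(t_case=1.0, t_pos=1.0, tf_norm=1.0, t_rel=1.0, t_sent=1.0)
# The second term appears late, is rare, and co-occurs with many different terms.
weak = term_score(t_case=0.0, t_pos=3.0, tf_norm=0.2, t_rel=2.0, t_sent=0.2)

print(f"strong term: {strong:.4f}")
print(f"weak term:   {weak:.4f}")
print(f"bigram of two strong terms seen twice: {keyword_score([strong, strong], 2):.4f}")
```

The ratio form means that a term which co-occurs with many different terms (high `t_rel`, typical of stopword-like words) or appears late in the text (high `t_pos`) is pushed toward a higher, that is worse, score.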

YAKE is considered to be a powerful method for keyword extraction, as it is language independent, requires no external resources beyond a stopword list, and is able to extract multi-word expressions as keywords.
It has been used in various domains such as web pages and scientific literature, and its authors report results that compare favorably with other unsupervised methods on several benchmark datasets.

Like other keyword extraction methods, YAKE has its own limitations: it can return low-frequency words and phrases that are not relevant, and it may not perform well on very short or low-quality texts.

## References

- Vallejo-Huanga, Diego; Morillo, Paulina; Ferri, Cèsar (2019), “ICMLA 2014/2015/2016/2017 Accepted Papers Data Set”, Mendeley Data, V2, doi: 10.17632/wj5vb6h9jy.2