## What is Keyword Extraction?

Keyword extraction is the task of automatically identifying a set of terms that best describe the subject of a document. It is an important technique in information retrieval (IR) systems, where keywords simplify and speed up search. Keyword extraction can also reduce the dimensionality of text before further analysis, such as text classification or topic modeling. [S. Art et al.](https://onlinelibrary.wiley.com/doi/abs/10.1002/smj.2699), for example, extracted keywords to measure patent similarity. With keyword extraction you can automatically index data, summarize a text, or generate tag clouds from the most representative keywords.

## How are keywords extracted?

Keyword extraction algorithms typically include the following steps:

- *Candidate generation*. Detection of possible candidate keywords from the text.
- *Property calculation*. Computation of properties and statistics required for ranking.
- *Ranking*. Computation of a score for each candidate keyword, followed by sorting all candidates in descending order of score. The top n candidates are then selected as the n keywords representing the text.
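
The three steps above can be sketched in a few lines of Python. This is only a minimal illustration, not any particular published algorithm: the stop-word list `STOP`, the function names, and the choice of raw frequency as the only "property" are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list used only for this illustration.
STOP = {"a", "an", "and", "the", "of", "to", "is", "in", "for", "on", "it"}


def generate_candidates(text):
    """Candidate generation: lowercase word tokens that are not stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in STOP]


def compute_properties(candidates):
    """Property calculation: here, just the raw frequency of each candidate."""
    return Counter(candidates)


def rank(properties, n):
    """Ranking: score = frequency, sorted in descending order, keep the top n."""
    return [word for word, _ in properties.most_common(n)]


def extract_keywords(text, n=3):
    """Run the three steps in sequence and return the top n keywords."""
    return rank(compute_properties(generate_candidates(text)), n)


sample = (
    "Keyword extraction finds keywords in text. "
    "Good keywords describe the text, and extraction is automatic."
)
print(extract_keywords(sample, n=3))
```

Real algorithms differ mainly in the second step: instead of raw frequency they compute properties such as co-occurrence degree, position in the document, or graph centrality.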

## Automatic keyword extraction algorithms

- Rapid Automatic Keyword Extraction (RAKE). Python implementations: [one](https://github.com/csurfer/rake-nltk), [two](https://github.com/zelandiya/RAKE-tutorial), [three](https://github.com/aneesha/RAKE)
- TextRank. Python implementations: [one](https://pypi.org/project/summa/), [two](https://radimrehurek.com/gensim/summarization/keywords.html)
- [Yet Another Keyword Extractor (Yake)](https://github.com/LIAAD/yake)
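
To give a feel for how one of these algorithms works, here is a rough sketch of RAKE-style scoring in plain Python. It is a simplified illustration written for this notebook, not the code of any of the linked implementations. The stop-word list `STOP`, the function names, and the sample sentence are assumptions. The idea is that stop words and punctuation split the text into candidate phrases, each word is scored as degree divided by frequency, and a phrase's score is the sum of its word scores.

```python
import re
from collections import defaultdict

# Hypothetical, tiny stop-word list for illustration only.
STOP = {"a", "an", "and", "the", "of", "to", "is", "in", "for", "on", "are", "with"}


def candidate_phrases(text):
    """Split text into candidate phrases at punctuation and stop words."""
    phrases = []
    # Punctuation (except apostrophes and hyphens) acts as a phrase boundary.
    for fragment in re.split(r"[^\w\s'-]+", text.lower()):
        current = []
        for word in fragment.split():
            if word in STOP:
                if current:
                    phrases.append(tuple(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))
    return phrases


def rake_scores(text):
    """Return (phrase, score) pairs sorted by descending score."""
    phrases = candidate_phrases(text)
    freq = defaultdict(int)    # how often each word occurs
    degree = defaultdict(int)  # total length of the phrases the word occurs in
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)
    word_score = {w: degree[w] / freq[w] for w in freq}
    scores = {" ".join(p): sum(word_score[w] for w in p) for p in set(phrases)}
    # Sort by score (descending), then alphabetically to break ties.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


sample = (
    "Automatic keyword extraction is useful. "
    "Keyword extraction with graphs is popular, and extraction is fast."
)
for phrase, score in rake_scores(sample)[:3]:
    print(phrase, score)
```

Because longer phrases accumulate the scores of all their words, this scheme favours multi-word keyphrases over single frequent words, which is the main difference from the plain word counting used later in this notebook.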


## If you want to know more...
- [Slobodan Beliga](https://pdfs.semanticscholar.org/bdbf/25f3dcf63d38cdb527a9ffca269fa0b8046b.pdf) Keyword extraction: a review of methods and approaches
- [Kamil Bennani-Smires et al.](https://arxiv.org/pdf/1801.04470.pdf) Simple Unsupervised Keyphrase Extraction using Sentence Embeddings
- [YanYing et al.](https://www.sciencedirect.com/science/article/pii/S1877050917303629) A Graph-based Approach of Automatic Keyphrase Extraction
- [Martin Dostal and Karel Jezek](http://ceur-ws.org/Vol-706/poster13.pdf) Automatic Keyphrase Extraction based on NLP and Statistical Methods

# REQUIRED PART OF THE WORK  "EXTRACT TOP WORDS" 

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
# load the dataset
df = pd.read_excel('../input/invisalign-modif/invisalign modif.xlsx')
df.head()

In [None]:
import re
import string
import nltk

import pandas as pd

from collections import Counter
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

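nltk.download('punkt')
nltk.download('punkt_tab')  # newer NLTK releases look for 'punkt_tab' instead of 'punkt'
nltk.download('stopwords')

# stop words for every language shipped with NLTK; a set makes lookups fast
STOP_WORDS = set(stopwords.words())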

# removing the emojis
# https://www.kaggle.com/alankritamishra/covid-19-tweet-sentiment-analysis#Sentiment-analysis
EMOJI_PATTERN = re.compile("["
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           u"\U00002702-\U000027B0"  # dingbats
                           u"\U000024C2-\U0001F251"  # enclosed characters (broad range, also covers CJK text)
                           "]+", flags=re.UNICODE)


def cleaning(text):
    """
    Convert to lowercase.
    Remove URL links, HTML tags, special characters and punctuation.
    Tokenize and remove stop words.
    """
    text = text.lower()
    text = re.sub(r'https?://\S+|www\.\S+', '', text)  # URLs
    text = re.sub(r'<.*?>+', '', text)  # HTML tags
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub(r'\n', ' ', text)  # keep words on adjacent lines apart
    text = re.sub('[’“”…]', '', text)

    text = EMOJI_PATTERN.sub('', text)

    # removing the stop-words
    text_tokens = word_tokenize(text)
    tokens_without_sw = [
        word for word in text_tokens if word not in STOP_WORDS]

    return " ".join(tokens_without_sw)


# skip missing tweets and make sure every value is a string before cleaning
dt = df['Tweet Text'].dropna().astype(str).apply(cleaning)

# count the words once and reuse the counter for both rankings
word_counts = Counter(" ".join(dt).split())
word_count_10 = word_counts.most_common(10)
word_count_100 = word_counts.most_common(100)
word_frequency_10 = pd.DataFrame(word_count_10, columns=['Word', 'Frequency'])
word_frequency_100 = pd.DataFrame(word_count_100, columns=['Word', 'Frequency'])
print(word_frequency_10)
print(word_frequency_100)

In [None]:
word_frequency_10.to_excel("./TOP 10.xlsx")
word_frequency_100.to_excel("./TOP 100.xlsx")