## **Keyword Extraction**
Keyword extraction automatically detects the words and phrases that best represent a text. The extracted keywords can then feed tasks such as topic modeling, and they offer an efficient way to get insights from large amounts of unstructured text data.

### **Libraries that help in extracting the keywords**
1. spaCy
2. YAKE
3. Rake-NLTK
4. Gensim

## **1. spaCy**
spaCy is an open-source software library for advanced natural language processing, written in Python and Cython. The library is published under the MIT license, and its main developers are Matthew Honnibal and Ines Montani, the founders of the software company Explosion. The example below uses spaCy's named entity recognizer and treats the detected entities (`doc.ents`) as keywords.

In [1]:
# Required libraries
import spacy

In [2]:
text = """spaCy is an open-source software library for advanced natural language processing, written in the programming 
          languages Python and Cython. The library is published under the MIT license and its main developers are Matthew 
          Honnibal and Ines Montani, the founders of the software company Explosion."""

In [9]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.1.0




  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.1.0/en_core_web_sm-3.1.0-py3-none-any.whl (13.6 MB)
Installing collected packages: en-core-web-sm
  Attempting uninstall: en-core-web-sm
    Found existing installation: en-core-web-sm 2.2.0
    Uninstalling en-core-web-sm-2.2.0:
      Successfully uninstalled en-core-web-sm-2.2.0
Successfully installed en-core-web-sm-3.1.0
[+] Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [11]:
# Load the small English pipeline
sp = spacy.load("en_core_web_sm")

In [12]:
doc_spacy = sp(text)

In [13]:
print(doc_spacy.ents)

(MIT, Matthew 
          Honnibal, Ines Montani, Explosion)
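
The entity `Matthew Honnibal` is printed across two lines above because the triple-quoted `text` literal carries the line breaks and indentation of the notebook cell. A minimal sketch of one way to clean that up before calling `sp(...)`, assuming plain whitespace collapsing is acceptable for your text; `normalize_whitespace` is a hypothetical helper name, not part of spaCy:

```python
# Same sample text as above, including the line breaks and indentation
# that the triple-quoted literal carries.
text = """spaCy is an open-source software library for advanced natural language processing, written in the programming 
          languages Python and Cython. The library is published under the MIT license and its main developers are Matthew 
          Honnibal and Ines Montani, the founders of the software company Explosion."""


def normalize_whitespace(raw: str) -> str:
    """Collapse every run of whitespace (spaces, tabs, newlines) into one space."""
    return " ".join(raw.split())


clean_text = normalize_whitespace(text)
print(clean_text)
```

Passing `clean_text` instead of `text` to the pipeline should keep multi-word entities on one line. Whether the set of detected entities stays exactly the same has not been checked here.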


## **2. YAKE**
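YAKE (Yet Another Keyword Extractor) is an unsupervised method that ranks candidate keywords using statistical features of the text itself, so it needs no training corpus. You can control the maximum keyword length (in words), the number of keywords returned, and the deduplication threshold. Each keyword in the output is paired with a score, and a lower score means a more relevant keyword.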

In [16]:
!pip install yake

Collecting yake
  Downloading yake-0.4.8-py2.py3-none-any.whl (60 kB)
Collecting segtok
  Downloading segtok-1.5.11-py3-none-any.whl (24 kB)
Collecting jellyfish
  Downloading jellyfish-0.9.0-cp38-cp38-win_amd64.whl (26 kB)
Installing collected packages: segtok, jellyfish, yake
Successfully installed jellyfish-0.9.0 segtok-1.5.11 yake-0.4.8


In [17]:
# Required libraries
import yake

In [18]:
text = """spaCy is an open-source software library for advanced natural language processing, written in the programming 
          languages Python and Cython. The library is published under the MIT license and its main developers are Matthew 
          Honnibal and Ines Montani, the founders of the software company Explosion."""

#### **Initializing some parameters**

In [19]:
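language = "en"                # language of the input text
max_ngram_size = 3             # maximum number of words per keyword
deduplication_threshold = 0.9  # similarity limit above which a candidate counts as a duplicate
numOfKeywords = 20             # number of keywords to return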

In [20]:
# Extractor with default settings (not used below)
kw_extractor = yake.KeywordExtractor()
# Extractor configured with the parameters defined above
custom_kw_extractor = yake.KeywordExtractor(lan=language, n=max_ngram_size, dedupLim=deduplication_threshold,
                                            top=numOfKeywords, features=None)

In [21]:
keywords = custom_kw_extractor.extract_keywords(text)

for kw in keywords:
    print(kw)

('programming languages Python', 0.001295347548560416)
('natural language processing', 0.002012136772192602)
('advanced natural language', 0.0026621455770583914)
('Python and Cython', 0.0035840985079775055)
('open-source software library', 0.008298152696966859)
('languages Python', 0.009390717577572831)
('language processing', 0.01453240965208459)
('software company Explosion', 0.015993140254256993)
('advanced natural', 0.01840251352140607)
('natural language', 0.019161829017826378)
('programming languages', 0.019161829017826378)
('open-source software', 0.032652195076937375)
('Ines Montani', 0.03375876229391358)
('Matthew Honnibal', 0.04096703831447956)
('Honnibal and Ines', 0.04096703831447956)
('Cython', 0.053691021027863564)
('software library', 0.05857047036380304)
('company Explosion', 0.06120870235178475)
('Python', 0.06651575167590484)
('library for advanced', 0.07441175006256819)
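
Because YAKE scores run from most relevant (lowest) to least relevant (highest), post-processing usually means sorting in ascending order and optionally cutting off at a score limit. The following is a minimal sketch under that assumption. It uses a handful of `(keyword, score)` pairs copied from the output above; `best_keywords`, `limit` and `max_score` are hypothetical names introduced for illustration, not part of the YAKE API:

```python
# A few (keyword, score) pairs copied from the output above, deliberately shuffled.
scored_keywords = [
    ("Python", 0.06651575167590484),
    ("natural language processing", 0.002012136772192602),
    ("Ines Montani", 0.03375876229391358),
    ("programming languages Python", 0.001295347548560416),
    ("open-source software library", 0.008298152696966859),
]


def best_keywords(scored, limit=3, max_score=None):
    """Return up to `limit` phrases, most relevant first (lowest YAKE score first)."""
    ranked = sorted(scored, key=lambda pair: pair[1])
    if max_score is not None:
        # Keep only phrases whose score does not exceed the cutoff.
        ranked = [pair for pair in ranked if pair[1] <= max_score]
    return [phrase for phrase, _ in ranked[:limit]]


print(best_keywords(scored_keywords))
print(best_keywords(scored_keywords, limit=10, max_score=0.01))
```

A score cutoff such as `0.01` is only meaningful relative to one document, so treat it as something to tune per text rather than a fixed constant.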


## **3. Rake-NLTK**
Rake-NLTK is a Python implementation of a modified version of the Rapid Automatic Keyword Extraction (RAKE) algorithm, built on the NLTK toolkit. RAKE splits the text into candidate phrases at stop words and punctuation, then scores each phrase from the frequency and co-occurrence of its words. Install the library with `pip install rake-nltk`.

In [25]:
!pip install rake-nltk

Collecting rake-nltk
  Using cached rake_nltk-1.0.6-py3-none-any.whl (9.1 kB)
Collecting nltk<4.0.0,>=3.6.2
  Using cached nltk-3.6.7-py3-none-any.whl (1.5 MB)
Installing collected packages: nltk, rake-nltk
  Attempting uninstall: nltk
    Found existing installation: nltk 3.5
    Uninstalling nltk-3.5:
      Successfully uninstalled nltk-3.5
Successfully installed nltk-3.6.7 rake-nltk-1.0.6


In [26]:
# Required libraries
from rake_nltk import Rake

In [27]:
text = """spaCy is an open-source software library for advanced natural language processing, written in the programming 
          languages Python and Cython. The library is published under the MIT license and its main developers are Matthew 
          Honnibal and Ines Montani, the founders of the software company Explosion."""

In [28]:
# Create a RAKE extractor with default settings
rake_nltk_var = Rake()

In [29]:
rake_nltk_var.extract_keywords_from_text(text)
keyword_extracted = rake_nltk_var.get_ranked_phrases()
print(keyword_extracted)

['advanced natural language processing', 'software company explosion', 'programming languages python', 'source software library', 'mit license', 'matthew honnibal', 'main developers', 'ines montani', 'library', 'written', 'spacy', 'published', 'open', 'founders', 'cython']
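
To make the ranking above less of a black box, here is a minimal pure-Python sketch of the RAKE idea. It is not the rake-nltk implementation: the stop-word list is a tiny hypothetical one (rake-nltk uses NLTK's list), the function names `candidate_phrases` and `rake_rank` are made up for illustration, and the scoring assumes the degree-to-frequency ratio from the original RAKE description.

```python
import re
from collections import defaultdict

# Hypothetical, deliberately tiny stop-word list for this sketch only.
STOPWORDS = {"a", "an", "and", "are", "for", "in", "is", "its", "of", "the", "under"}


def candidate_phrases(text, stopwords=STOPWORDS):
    """Split text into phrases, breaking at punctuation and at stop words."""
    phrases = []
    for fragment in re.split(r"[^\w\s]+", text.lower()):
        current = []
        for word in fragment.split():
            if word in stopwords:
                if current:
                    phrases.append(tuple(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))
    return phrases


def rake_rank(text):
    """Score each phrase as the sum of degree/frequency over its words."""
    phrases = candidate_phrases(text)
    frequency = defaultdict(int)  # how often a word occurs
    degree = defaultdict(int)     # total length of the phrases a word occurs in
    for phrase in phrases:
        for word in phrase:
            frequency[word] += 1
            degree[word] += len(phrase)
    word_score = {word: degree[word] / frequency[word] for word in frequency}
    phrase_score = {" ".join(p): sum(word_score[w] for w in p) for p in phrases}
    return sorted(phrase_score.items(), key=lambda item: item[1], reverse=True)


sentence = ("spaCy is an open-source software library for advanced natural "
            "language processing, written in the programming languages Python and Cython.")
ranked = rake_rank(sentence)
for phrase, score in ranked:
    print(phrase, score)
```

Long phrases made of words that rarely appear elsewhere score highest, which is consistent with `advanced natural language processing` leading the rake-nltk output above.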


## **4. Gensim**
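Gensim was developed primarily for topic modeling. Over time, it added other NLP features such as summarization and text similarity. The `keywords` function used here lives in the `gensim.summarization` module, which was removed in Gensim 4.0, so this example needs a 3.x release; the cell below installs 3.4.0.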

In [43]:
!pip install gensim==3.4.0

Collecting gensim==3.4.0
  Downloading gensim-3.4.0.tar.gz (22.2 MB)
Building wheels for collected packages: gensim
  Building wheel for gensim (setup.py): started
  Building wheel for gensim (setup.py): finished with status 'done'
  Created wheel for gensim: filename=gensim-3.4.0-cp38-cp38-win_amd64.whl size=22590883 sha256=147fa9de2249829c772cc994aca9f7c4872343483467aa109678e3b21c025508
  Stored in directory: c:\users\jgaur\appdata\local\pip\cache\wheels\b4\a4\71\a301cdb2b7d5d31525936fcb8dcd9a5f144578d047407f7cf9
Successfully built gensim
Installing collected packages: gensim
  Attempting uninstall: gensim
    Found existing installation: gensim 3.8.3
    Uninstalling gensim-3.8.3:
      Successfully uninstalled gensim-3.8.3
Successfully installed gensim-3.4.0


In [48]:
# Required libraries
from gensim.summarization import keywords

In [49]:
text = """spaCy is an open-source software library for advanced natural language processing, written in the programming 
          languages Python and Cython. The library is published under the MIT license and its main developers are Matthew 
          Honnibal and Ines Montani, the founders of the software company Explosion."""

In [54]:
# Extract keywords (needs Gensim 3.x, where gensim.summarization is still available)
print(keywords(text))
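
Gensim's `keywords` function is generally described as a variant of TextRank, which ranks words by running a PageRank-style computation over a word co-occurrence graph. The sketch below illustrates that idea in plain Python so it runs without Gensim. It is not Gensim's implementation: the stop-word list, the `window`, `damping` and `iterations` defaults, and the function name `textrank_keywords` are all assumptions chosen for illustration.

```python
import re
from collections import defaultdict

# Hypothetical, deliberately tiny stop-word list for this sketch only.
STOPWORDS = {"a", "an", "and", "are", "for", "in", "is", "its", "of", "the", "under"}


def textrank_keywords(text, window=2, damping=0.85, iterations=50, top=5):
    """Rank single words with a PageRank-style score over a co-occurrence graph."""
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

    # Link each word to the distinct words that follow it within `window` positions.
    neighbors = defaultdict(set)
    for i, word in enumerate(words):
        for other in words[i + 1 : i + 1 + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)

    # Repeatedly let every word collect an equal share of each neighbor's score.
    scores = {word: 1.0 for word in neighbors}
    for _ in range(iterations):
        scores = {
            word: (1 - damping)
            + damping * sum(scores[n] / len(neighbors[n]) for n in neighbors[word])
            for word in neighbors
        }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top]


sample_text = (
    "spaCy is an open-source software library for advanced natural language "
    "processing, written in the programming languages Python and Cython. "
    "The library is published under the MIT license and its main developers are "
    "Matthew Honnibal and Ines Montani, the founders of the software company Explosion."
)
top_words = textrank_keywords(sample_text)
for word, score in top_words:
    print(word, round(score, 3))
```

Words that co-occur with many different neighbors end up with the highest scores. A fuller version would also merge adjacent top-ranked words into multi-word keywords.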