## Keyword Extraction Process in Python with NLP

We will discuss the Python libraries **spaCy**, **YAKE**, **rake-nltk**, and **Gensim** for keyword extraction.

### 1. spaCy

spaCy is an all-in-one Python library for NLP tasks. Here, we are interested only in its keyword extraction functionality.

We start by installing the spaCy library and then download the `en_core_sci_lg` model. After that, we pass the article text into the NLP pipeline, which returns the extracted keywords.

Each model is trained for a different kind of text. If an article contains medical terms, use the `en_core_sci_lg` model; otherwise, you can use the `en_core_web_sm` model.
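The model choice and pipeline call described above can be sketched as follows. This is a minimal sketch, not a definitive implementation: the helper names `choose_model` and `extract_entities` are hypothetical, and it assumes the chosen model has already been installed (for example with `python -m spacy download en_core_web_sm`).

```python
def choose_model(is_biomedical):
    """Pick a model name following the rule of thumb above (hypothetical helper)."""
    return "en_core_sci_lg" if is_biomedical else "en_core_web_sm"


def extract_entities(text, model_name):
    """Run the spaCy pipeline and return the entity texts as plain strings."""
    # Imported lazily so that choose_model() works even without spaCy installed.
    import spacy

    nlp = spacy.load(model_name)  # raises OSError if the model is not installed
    doc = nlp(text)
    return [ent.text for ent in doc.ents]


# Example usage (requires spaCy and the chosen model to be installed):
# text = "Explosion was founded by Matthew Honnibal and Ines Montani."
# print(extract_entities(text, choose_model(is_biomedical=False)))
```

Keeping the model name in one place makes it easy to switch between the general-purpose and the scientific model without touching the extraction code.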

In [1]:
!pip install spacy

In [3]:
!pip install --upgrade numpy


In [1]:
import spacy


Go to https://allenai.github.io/scispacy/ to download the model. scispaCy is a Python package containing spaCy models for processing biomedical, scientific, or clinical text.

In [4]:
# spacy.load() expects the directory that directly contains config.cfg
nlp = spacy.load('/Users/leexuewei/Downloads/en_core_sci_lg-0.4.0/en_core_sci_lg')
text = ("spaCy is an open-source software library for advanced natural language processing, "
        "written in the programming languages Python and Cython. "
        "The library is published under the MIT license and its main developers are "
        "Matthew Honnibal and Ines Montani, the founders of the software company Explosion.")
doc = nlp(text)
print(doc.ents)  # tuple of entity spans found by the pipeline

OSError: [E053] Could not read config.cfg from /Users/leexuewei/Downloads/en_core_sci_lg-0.4.0/en_core_sci_lg/config.cfg
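
The `E053` error means that spaCy did not find a `config.cfg` file in the directory passed to `spacy.load()`. An unpacked model archive may nest the loadable directory one or more levels deeper than expected. The sketch below searches a download folder for the directory that actually holds `config.cfg`; the helper name `find_model_dir` is hypothetical and uses only the standard library.

```python
from pathlib import Path


def find_model_dir(root):
    """Return the first directory at or below `root` that contains config.cfg, or None."""
    root_path = Path(root)
    if (root_path / "config.cfg").is_file():
        return root_path
    # Search nested folders; sorted() makes the choice deterministic.
    for cfg in sorted(root_path.rglob("config.cfg")):
        return cfg.parent
    return None


# Example usage (the path is the one from the cell above):
# model_dir = find_model_dir('/Users/leexuewei/Downloads/en_core_sci_lg-0.4.0')
# if model_dir is not None:
#     nlp = spacy.load(model_dir)
```

If the helper returns `None`, the download is probably incomplete or is not a spaCy v3 model directory.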

### Observations.

1. The objects in `doc.ents` can be 1-grams, 2-grams, 3-grams, and so on. You can't control the extraction process by n-gram size or other parameters.

2. For text containing medical terms, use an `en_core_sci_xx` model (`xx` = `sm`, `md`, or `lg`). These models also work on non-medical articles.

3. Load a different model by passing its name or path to the `spacy.load()` function.
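
Although spaCy does not expose an n-gram parameter for `doc.ents`, the results can be filtered afterwards. The following is a minimal sketch under stated assumptions: `phrases` is a hypothetical stand-in for `[ent.text for ent in doc.ents]`, and words are counted by splitting on whitespace, which may differ from spaCy's own token count (`len(ent)`).

```python
def filter_by_ngram(phrases, min_n=1, max_n=3):
    """Keep phrases whose whitespace-separated word count lies in [min_n, max_n]."""
    return [p for p in phrases if min_n <= len(p.split()) <= max_n]


# Hypothetical entity texts, standing in for [ent.text for ent in doc.ents].
phrases = [
    "spaCy",
    "open-source software library",
    "advanced natural language processing",
    "MIT license",
]

print(filter_by_ngram(phrases, max_n=2))  # ['spaCy', 'MIT license']
```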

To control the keyword extraction process, use the YAKE Python library.

### 2. YAKE

YAKE selects the most important keywords from an article using statistical features of the text. With YAKE, you can control the number of extracted keywords and other parameters.

In [6]:
import yake

In [7]:
kw_extractor = yake.KeywordExtractor()
text = "spaCy is an open-source software library for advanced natural language processing, "\
       "written in the programming languages Python and Cython. The library is published under the MIT license and its main developers are Matthew Honnibal and Ines Montani,"\
       "the founders of the software company Explosion."
language = 'en'
max_ngram_size = 3
deduplication_threshold = 0.9
numOfKeywords = 20

In [8]:
custom_kw_extractor = yake.KeywordExtractor(lan=language, n=max_ngram_size,
                                            dedupLim=deduplication_threshold, top=numOfKeywords, features=None)

In [9]:
keywords = custom_kw_extractor.extract_keywords(text)
for kw in keywords:
    print(kw)

('programming languages Python', 0.0019334946410281481)
('natural language processing', 0.0029484272342509585)
('advanced natural language', 0.004097221079360261)
('Python and Cython', 0.004169180610102857)
('languages Python', 0.012356448651811323)
('open-source software library', 0.0132657006044428)
('language processing', 0.018764236727871385)
('Ines Montani,the founders', 0.021383729713214176)
('software company Explosion', 0.022297017478192415)
('advanced natural', 0.023429465030212888)
('natural language', 0.02595314493061387)
('programming languages', 0.02595314493061387)
('open-source software', 0.04345039133945653)
('Matthew Honnibal', 0.048358238494448195)
('Honnibal and Ines', 0.048358238494448195)
('Ines Montani,the', 0.048358238494448195)
('Cython', 0.05687138998792217)
('company Explosion', 0.07091362906117832)
('Python', 0.07300455839253525)
('software library', 0.08194949968704518)
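
In YAKE, a lower score means a more relevant keyword, which is why the list above is sorted in ascending order. The sketch below post-processes such `(keyword, score)` tuples using a few values copied from the output above; the helper name `top_keywords` and its `max_score` cutoff are illustrative assumptions, not part of the YAKE API.

```python
# A few (keyword, score) pairs copied from the YAKE output above.
keywords = [
    ("programming languages Python", 0.0019334946410281481),
    ("natural language processing", 0.0029484272342509585),
    ("Cython", 0.05687138998792217),
    ("Python", 0.07300455839253525),
    ("software library", 0.08194949968704518),
]


def top_keywords(scored, k=3, max_score=None):
    """Return up to k keyword strings, best (lowest score) first."""
    ranked = sorted(scored, key=lambda pair: pair[1])
    if max_score is not None:
        # Optional cutoff: drop keywords that score worse than max_score.
        ranked = [pair for pair in ranked if pair[1] <= max_score]
    return [phrase for phrase, _ in ranked[:k]]


print(top_keywords(keywords, k=2))
# ['programming languages Python', 'natural language processing']
print(top_keywords(keywords, k=5, max_score=0.06))
# ['programming languages Python', 'natural language processing', 'Cython']
```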


### Observations.

1. To extract keywords from a non-English text, for example German, set `language = 'de'`. A mismatch between the language of the text and the `language` variable gives poorly extracted keywords.

2. The `deduplication_threshold` variable limits how much the extracted keywords may repeat each other's words. Set it to 0.1 to avoid repeating words across keywords. If you set it to 0.9, repeated words are allowed.

   Example:

   - For `deduplication_threshold = 0.1`, the output will be `['Python and Cython','software','ines','library is published']`.
   - For `deduplication_threshold = 0.9`, the output will be `['Python and Cython','programming languages python','natural language processing','advanced natural language','languages python','language processing','ines montain','cython','advanced natural','honnibal and ines','software company explosion','natural language','programming languages','matthew honnibal','python','open-source software library','company explosion','spacy','processing','written']`.
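
The effect of the threshold can be illustrated with a simplified sketch. It is not YAKE's actual implementation: it assumes `difflib.SequenceMatcher` as the similarity measure (YAKE ships its own similarity functions), and the helper name `deduplicate` is hypothetical. A candidate is kept only if its similarity to every keyword already kept is at or below the threshold.

```python
from difflib import SequenceMatcher


def deduplicate(candidates, threshold):
    """Greedily keep candidates that are not too similar to any kept keyword."""
    kept = []
    for phrase in candidates:
        too_similar = any(
            SequenceMatcher(None, phrase.lower(), other.lower()).ratio() > threshold
            for other in kept
        )
        if not too_similar:
            kept.append(phrase)
    return kept


# Candidates in ranked order (best first), taken from the YAKE output above.
candidates = [
    "natural language processing",
    "natural language",
    "language processing",
    "Cython",
]

strict = deduplicate(candidates, 0.1)  # low threshold: overlapping phrases are dropped
loose = deduplicate(candidates, 0.9)   # high threshold: overlapping phrases are kept
print(strict)
print(loose)
```

With the low threshold, "natural language" and "language processing" are discarded because they overlap heavily with the first keyword; with the high threshold, all four candidates survive.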
        
### 3. Rake-Nltk

