# Keyword Phrase Extraction
## This notebook outlines the concepts involved in extracting keyword phrases from text

Problem: **Identify keywords in a piece of text**

Identify **words that are very important** in a piece of text

### Possible Solutions:
- TF-IDF (already seen)
- Noun Chunks
- Specialized keyword extraction algorithms
    - TextRank
    - SGRank

Textacy is an excellent library, built on spaCy, that provides several information extraction functions. Many of them rely on regular expression patterns and heuristics to extract specific expressions such as acronyms and quotations. It can also extract matches for custom patterns, including POS tag patterns, statements involving an entity, and subject-verb-object tuples.
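
To make the "regular expression patterns and heuristics" idea concrete, here is a minimal sketch of pattern-based extraction using only Python's `re` module. It is not Textacy's implementation; the pattern, the `find_acronyms` helper, and the sample sentence are assumptions made up for illustration.

```python
import re

# Hypothetical pattern: two or more capital letters wrapped in parentheses, e.g. "(OCR)".
ACRONYM_IN_PARENS = re.compile(r"\(([A-Z]{2,})\)")

def find_acronyms(text):
    """Return acronyms that appear in parentheses, in order of appearance."""
    return ACRONYM_IN_PARENS.findall(text)

# Made-up sample sentence, not taken from kpe_sample_text.txt.
sample = "Optical character recognition (OCR) and natural language processing (NLP) are related."
print(find_acronyms(sample))
```

A heuristic like this is cheap but brittle: it misses acronyms that are not parenthesized and matches any capitalized token in parentheses.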

We will use Textacy to extract keywords from documents.

Documentation: https://textacy.readthedocs.io/en/stable/

### Install Textacy

In [1]:
# !pip install textacy==0.9.1
# !python -m spacy download en_core_web_sm

### Import the libraries

In [2]:
import spacy
import textacy
import textacy.extract
import textacy.ke

### Load a spaCy model, which will be used for all further processing

In [3]:
en = textacy.load_spacy_lang("en_core_web_sm")

### Load a sample text to find keywords
- kpe_sample_text.txt

In [6]:
with open('kpe_sample_text.txt') as f:
    mytext = f.read()

### Convert the text into a spaCy document

In [7]:
doc = textacy.make_spacy_doc(mytext, lang=en)

## Find keywords

## 1. Noun Chunks

In [13]:
print([chunk for chunk in textacy.extract.noun_chunks(doc)])

[Common NLP Tasks, list, most commonly researched tasks, natural language processing, tasks, direct real-world applications, others, subtasks, larger tasks, natural language processing tasks, they, categories, convenience, coarse division, Text and speech processing, Optical character recognition, (OCR, image, printed text, corresponding text, Speech recognition, sound clip, person, people, textual representation, speech, opposite, text, speech, extremely difficult problems, natural speech, pauses, successive words, speech segmentation, necessary subtask, speech recognition, most spoken languages, sounds, process, coarticulation, conversion, analog signal, characters, very difficult process, words, same language, people, different accents, speech recognition software, wide variety, input, terms, textual equivalent, Speech segmentation, sound clip, person, people, it, words, subtask, speech recognition, it, speech, text, units, spoken representation, Text, speech, Word segmentation, Tok

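### Issues

- The list contains pronouns and generic single words (`they`, `it`, `others`, `list`) that are not useful keywords
- Chunks repeat (`speech`, `people`, `sound clip`) and are not scored, so there is no way to pick the most important ones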
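
The noun chunk output above includes pronouns, generic single words, and repeated chunks. One simple way to reduce that noise is to post-filter the chunk strings. The sketch below is only an illustration: `clean_chunks` and its `min_words` parameter are hypothetical names, and the sample list is a hand-picked subset of the output above.

```python
def clean_chunks(chunks, min_words=2):
    """Drop short chunks and case-insensitive duplicates, keeping first-seen order."""
    seen = set()
    kept = []
    for chunk in chunks:
        key = chunk.lower()
        # Single-word chunks are mostly pronouns or generic nouns in this output.
        if len(key.split()) < min_words or key in seen:
            continue
        seen.add(key)
        kept.append(chunk)
    return kept

# Hand-picked subset of the noun chunks printed above.
sample_chunks = [
    "Common NLP Tasks", "list", "natural language processing", "tasks", "they",
    "Speech recognition", "sound clip", "speech recognition", "it", "sound clip",
]
print(clean_chunks(sample_chunks))
```

With real spaCy spans, the same function can be applied to `[chunk.text for chunk in ...]`. The filter still does not rank the chunks, which is what the algorithms in the next sections address.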

## 2. TextRank

In [8]:
textacy.ke.textrank(doc, topn=5)

[('natural language semantic', 0.01826851160612513),
 ('natural language processing task', 0.018092527578712124),
 ('language text segmentation', 0.016815366400779165),
 ('possible word form', 0.012521268238753758),
 ('natural language expression', 0.012062163598739382)]
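
For intuition about where these scores come from, the following is a simplified, single-word sketch of the TextRank idea: build a graph linking words that occur near each other, then score the nodes with PageRank. It is not Textacy's implementation, which works on spaCy documents and joins words into phrases. The function name `toy_textrank`, the parameters `window`, `damping`, and `iterations`, and the sample sentence are all assumptions for illustration.

```python
from collections import defaultdict

def toy_textrank(tokens, window=2, damping=0.85, iterations=50):
    """Rank words by PageRank over an undirected co-occurrence graph."""
    # Link two different words if they occur within `window` tokens of each other.
    neighbors = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + 1 + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)

    # Power iteration, starting from uniform scores.
    n_nodes = len(neighbors)
    scores = {word: 1.0 / n_nodes for word in neighbors}
    for _ in range(iterations):
        scores = {
            word: (1 - damping) / n_nodes
            + damping * sum(scores[n] / len(neighbors[n]) for n in neighbors[word])
            for word in neighbors
        }
    # Highest-scoring words first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Made-up sample sentence, already tokenized by whitespace.
tokens = "speech recognition converts speech to text and text to speech".split()
ranking = toy_textrank(tokens)
for word, score in ranking:
    print(f"{word}: {score:.4f}")
```

Words that co-occur with many other well-connected words end up with higher scores, which is why the scores sum to 1 and are small for long documents.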

### Get more keywords

In [9]:
textacy.ke.textrank(doc, topn=20)

[('natural language semantic', 0.01826851160612513),
 ('natural language processing task', 0.018092527578712124),
 ('language text segmentation', 0.016815366400779165),
 ('possible word form', 0.012521268238753758),
 ('natural language expression', 0.012062163598739382),
 ('readable human language', 0.011964826503084598),
 ('powerful neural language model', 0.011901252212139994),
 ('natural language generation', 0.011871126457925964),
 ('multiple possible semantic', 0.011691648529128245),
 ('language question', 0.011681325715062103),
 ('natural language understanding', 0.011676450614983213),
 ('natural language concept', 0.011447215522776017),
 ('implicit semantic role labelling', 0.011167395093629413),
 ('NLP task proper', 0.010611921448657437),
 ('elementary NLP task', 0.01021661374848852),
 ('word sense disambiguation', 0.009974047524682016),
 ('explicit semantic role', 0.009904684234914677),
 ('separate word', 0.009549844032787299),
 ('word segmentation', 0.00953080397548952),
 ('i

### Keywords using TextRank algorithm

In [10]:
[kps for kps, weights in textacy.ke.textrank(doc, normalize="lemma", topn=10)]

['natural language semantic',
 'natural language processing task',
 'language text segmentation',
 'possible word form',
 'natural language expression',
 'readable human language',
 'powerful neural language model',
 'natural language generation',
 'multiple possible semantic',
 'language question']

### Keywords using SGRank algorithm

In [11]:
[kps for kps, weights in textacy.ke.sgrank(doc, topn=10)]

['natural language',
 'speech recognition',
 'word',
 'text',
 'separate word',
 'task',
 'semantic role',
 'word sense',
 'NLP',
 'sentence']

Issue: **Overlapping key phrases**

Solution: **aggregate_term_variants**
- It groups terms that are variants of one another; choosing one term from each group gives a list of non-overlapping key phrases

In [12]:
terms = {term for term, weight in textacy.ke.sgrank(doc)}
print(textacy.ke.utils.aggregate_term_variants(terms))

[{'speech recognition'}, {'natural language'}, {'semantic role'}, {'separate word'}, {'word sense'}, {'sentence'}, {'text'}, {'word'}, {'task'}, {'NLP'}]
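
The last step, choosing one term per group, is left to the reader. A minimal sketch follows, assuming the groups are a list of sets of strings like the output above. The `pick_representatives` helper, its "longest term wins" rule, and the sample groups are hypothetical choices for illustration, not part of Textacy.

```python
def pick_representatives(groups):
    """Choose one term per group: the longest, with ties broken alphabetically."""
    # sorted(group) makes tie-breaking deterministic, since max() keeps the first maximum.
    return sorted(max(sorted(group), key=len) for group in groups)

# Made-up groups; in the run above every group contained a single term.
sample_groups = [
    {"NLP", "natural language processing"},
    {"speech recognition"},
    {"word sense", "word senses"},
]
print(pick_representatives(sample_groups))
```

Preferring the longest variant keeps the most specific phrase; picking the shortest or the highest-scored term would be equally valid rules.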
