# Keyphrase Extraction
Keyword and phrase extraction (KPE) is the information extraction (IE) task of pulling out the important words and phrases that capture the gist of a given text document.

In [1]:
# !pip install pyemd
# !pip install textacy --upgrade
# !python -m spacy download en_core_web_sm

In [5]:
import spacy
import textacy

In [6]:
# Load a spaCy model, which will be used for all further processing.
en = textacy.load_spacy_lang("en_core_web_sm")

# Sample text: nlphistory.txt holds the history section of Wikipedia's
# page on Natural Language Processing:
# https://en.wikipedia.org/wiki/Natural_language_processing
path = ''

# Read the file with a context manager so the handle is closed promptly.
# The UTF-8 encoding is an assumption about how the sample file was saved.
with open(path + 'Data/nlphistory.txt', encoding='utf-8') as f:
    mytext = f.read()

# Convert the text into a spaCy document.
doc = textacy.make_spacy_doc(mytext, lang=en)

In [11]:
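import textacy.ke
# Run TextRank and return the 5 top-ranked (keyphrase, score) pairs.
# Note: in textacy 0.11 and later, this module lives at textacy.extract.keyterms.
textacy.ke.textrank(doc, topn=5)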

ImportError: cannot import name 'emd' from 'pyemd.emd' (unknown location)

In [13]:
# Google Colab implementation
# https://colab.research.google.com/drive/1AaN8AM2179JBx078prYHrjj5QPKqWFoR?usp=sharing

## Practical Advice
- The process of extracting potential n-grams and building the graph with them is sensitive to document length, which could be an issue in a production scenario. One approach to dealing with it is to not use the full text, but instead try using the first M% and the last N% of the text, since we would expect that the introductory and concluding parts of the text should cover the main summary of the text.
- Since each keyphrase is independently ranked, we sometimes end up seeing overlapping keyphrases (e.g., “buy back stock” and “buy back”). One solution for this could be to use some similarity measure (e.g., cosine similarity) between the top-ranked keyphrases and choose the ones that are most dissimilar to one another. textacy already implements a function to address this issue, as shown in the notebook.
- Seeing counterproductive patterns (e.g., a keyphrase that starts with a preposition when you don’t want that) is another common problem. This is relatively straightforward to handle by tweaking the implementation code for the algorithm and explicitly encoding information about such unwanted word patterns.
- Improper text extraction can affect the rest of the KPE process, especially when dealing with formats such as PDF or scanned images. This is primarily because KPE is sensitive to sentence structure in the document. Hence, it’s always a good idea to add some post-processing to the extracted keyphrase list to create a final, meaningful list without noise.
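
The first three points above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not textacy's built-in functionality: the helper names (`head_tail`, `drop_overlapping`, `drop_bad_starts`), the `LEADING_STOPWORDS` list, and the default fractions and threshold are all hypothetical, and token-set Jaccard overlap is used here as a simple stand-in for cosine similarity.

```python
# Hypothetical helpers for pre- and post-processing around a keyphrase extractor.
# Input to the filters is a list of (phrase, score) pairs sorted best-first,
# which is the shape a TextRank-style extractor typically returns.

# Illustrative, deliberately incomplete list of words a keyphrase should not start with.
LEADING_STOPWORDS = {"of", "in", "on", "at", "for", "with", "by", "to", "from"}


def head_tail(text, head_frac=0.2, tail_frac=0.1):
    """Keep only the first head_frac and the last tail_frac of the words."""
    words = text.split()
    n = len(words)
    head_n = int(n * head_frac)
    tail_n = int(n * tail_frac)
    if head_n + tail_n >= n:
        return text  # The two slices would cover everything; nothing to trim.
    return " ".join(words[:head_n] + words[n - tail_n:])


def jaccard(a, b):
    """Jaccard similarity between two token sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def drop_overlapping(ranked, max_sim=0.5):
    """Greedily keep a phrase only if it is dissimilar to every phrase kept so far."""
    kept = []
    for phrase, score in ranked:
        tokens = set(phrase.lower().split())
        if all(jaccard(tokens, set(k.lower().split())) <= max_sim for k, _ in kept):
            kept.append((phrase, score))
    return kept


def drop_bad_starts(ranked, banned=LEADING_STOPWORDS):
    """Remove empty phrases and phrases whose first word is in the banned set."""
    return [
        (phrase, score)
        for phrase, score in ranked
        if phrase.split() and phrase.lower().split()[0] not in banned
    ]


# Made-up scores, for illustration only.
ranked = [
    ("buy back stock", 0.9),
    ("buy back", 0.8),
    ("of the company", 0.5),
    ("share price", 0.4),
]
cleaned = drop_bad_starts(drop_overlapping(ranked))
```

Because the filter is greedy and processes phrases best-first, the higher-ranked phrase of an overlapping pair always survives. In practice the fractions and the similarity threshold would need tuning on your own documents.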