<a href="https://colab.research.google.com/github/rahiakela/practical-natural-language-processing/blob/chapter-5-information-extraction/1_keyphrase_extraction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Keyphrase Extraction

Consider a scenario where we want to buy a product on Amazon that has a hundred reviews. There’s no way we’re going to read all of them to get an idea of what users think about the product. To help with this, Amazon has a filtering feature: “Read reviews that mention.” It presents a set of keywords and phrases that several people used in their reviews, which we can then use to filter the reviews. This is a good example of where keyphrase extraction (KPE) is useful in an application we all use.

<img src='https://github.com/practical-nlp/practical-nlp-figures/raw/master/figures/5-4.png?raw=1' width='800'/>

Keyword and phrase extraction, as the name indicates, is the information extraction (IE) task concerned with extracting from a text document the important words and phrases that capture its gist. It’s useful for several downstream NLP tasks, such as search/information retrieval, automatic document tagging, recommendation systems, and text summarization.

KPE is a well-studied problem in the NLP community, and the two most commonly
used methods to solve it are supervised learning and unsupervised learning.

Supervised learning approaches require corpora of texts paired with their keyphrases, and they use either engineered features or deep learning (DL) techniques. Creating such labeled datasets for KPE is a time- and cost-intensive endeavor.

Hence, unsupervised approaches that do not require a labeled dataset and are largely domain agnostic are more popular for KPE. These approaches are also more commonly used in real-world KPE applications.

Recent research has also shown that state-of-the-art DL methods for KPE don’t
perform any better than unsupervised approaches.

All the popular unsupervised KPE algorithms are based on the idea of representing the words and phrases in a text as nodes in a weighted graph where the weight indicates the importance of that keyphrase. Keyphrases are then identified based on how connected they are with the rest of the graph. The top-N important nodes from the graph are then returned as keyphrases. Important nodes are those words and phrases that are frequent enough and also well connected to different parts of the text.

The different graph-based KPE approaches differ in the way they select potential words/phrases from the text (from a large set of possible words and phrases in the entire text) and the way these words/phrases are scored in the graph.
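
To make the graph idea concrete, here is a minimal, TextRank-style sketch in plain Python. It is not textacy's implementation: the helper names (`build_cooccurrence_graph`, `rank_nodes`, `top_keywords`), the unweighted co-occurrence edges, the window size, and the damping factor are all assumptions chosen for illustration. Words are nodes, adjacent words share an edge, and a PageRank-style iteration scores each node by how well connected it is.

```python
from collections import defaultdict


def build_cooccurrence_graph(tokens, window_size=2):
    """Link distinct words that occur within `window_size` tokens of each other."""
    graph = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + window_size]:
            if other != word:
                graph[word].add(other)
                graph[other].add(word)
    return graph


def rank_nodes(graph, damping=0.85, iterations=50):
    """Score nodes with a PageRank-style power iteration on an undirected graph."""
    n = len(graph)
    scores = {node: 1.0 / n for node in graph}
    for _ in range(iterations):
        new_scores = {}
        for node in graph:
            # Each neighbor shares its score equally among its own neighbors.
            incoming = sum(scores[nb] / len(graph[nb]) for nb in graph[node])
            new_scores[node] = (1 - damping) / n + damping * incoming
        scores = new_scores
    return scores


def top_keywords(tokens, topn=3, window_size=2):
    """Return the `topn` highest-scoring words (ties broken alphabetically)."""
    scores = rank_nodes(build_cooccurrence_graph(tokens, window_size))
    return sorted(scores, key=lambda w: (-scores[w], w))[:topn]


# Toy input: already tokenized, lowercased, and stripped of stop words.
tokens = ("statistical machine translation systems replaced "
          "rule based machine translation systems").split()
print(top_keywords(tokens))
```

Real implementations differ mainly in the two places noted above: which candidates become nodes (for example, only nouns and adjectives, or multi-word n-grams) and how edges and scores are weighted.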

## Setup

In [None]:
%%shell

# We need textacy, which in turn loads the spaCy library.
pip install textacy==0.9.1
python -m spacy download en_core_web_sm

In [2]:
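import spacy
import textacy
import textacy.extract  # noun-chunk extraction, used later in this notebook
import textacy.ke       # keyphrase extraction algorithms (TextRank, SGRank)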

## Implementing KPE

The Python library [textacy](https://github.com/chartbeat-labs/textacy), built on top of the well-known library [spaCy](https://spacy.io/), contains implementations of some of the common graph-based keyword and phrase extraction algorithms.

This notebook illustrates the use of textacy to extract keyphrases using two algorithms:

1. [TextRank](https://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf)
2. [SGRank](https://www.aclweb.org/anthology/S15-1013/)

As our test document, we’ll use a sample text file, nlphistory.txt, which contains the text of the history section of Wikipedia's page on [natural language processing](https://en.wikipedia.org/wiki/Natural_language_processing).

In [3]:
!wget https://raw.githubusercontent.com/practical-nlp/practical-nlp/master/Ch5/Data/nlphistory.txt

--2020-12-11 09:32:37--  https://raw.githubusercontent.com/practical-nlp/practical-nlp/master/Ch5/Data/nlphistory.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6049 (5.9K) [text/plain]
Saving to: ‘nlphistory.txt’


2020-12-11 09:32:37 (70.2 MB/s) - ‘nlphistory.txt’ saved [6049/6049]



In [4]:
# Load a spaCy model, which will be used for all further processing.
en = textacy.load_spacy_lang("en_core_web_sm")

# Read the sample file; the context manager closes it once we are done.
with open("nlphistory.txt", encoding="utf-8") as f:
    mytext = f.read()

# Convert the text into a spaCy document.
doc = textacy.make_spacy_doc(mytext, lang=en)

In [5]:
textacy.ke.textrank(doc, topn=5)

[('successful natural language processing system', 0.02475549496438359),
 ('statistical machine translation system', 0.024648673368376665),
 ('natural language system', 0.020518708001159278),
 ('statistical natural language processing', 0.01858983530270439),
 ('natural language task', 0.01579726776487791)]

In [6]:
# Print the keywords using TextRank algorithm, as implemented in Textacy.
print("Textrank output: ", [kps for kps, weights in textacy.ke.textrank(doc, normalize="lemma", topn=5)])

# Print the key words and phrases, using SGRank algorithm, as implemented in Textacy
print("SGRank output: ", [kps for kps, weights in textacy.ke.sgrank(doc, topn=5)])

Textrank output:  ['successful natural language processing system', 'statistical machine translation system', 'natural language system', 'statistical natural language processing', 'natural language task']
SGRank output:  ['natural language processing system', 'statistical machine translation', 'research', 'late 1980', 'early']


In [7]:
# To address the issue of overlapping key phrases, textacy has a function: aggregate_term_variants.
# Choosing one of the grouped terms per item will give us a list of non-overlapping key phrases!
terms = set(term for term, weight in textacy.ke.sgrank(doc))
print(textacy.ke.utils.aggregate_term_variants(terms))

[{'natural language processing system'}, {'statistical machine translation'}, {'statistical model'}, {'late 1980'}, {'research'}, {'example'}, {'world'}, {'early'}, {'ELIZA'}, {'real'}]


In [8]:
# Another way to find key phrases is to simply treat all noun chunks as candidates.
# However, keep in mind that this yields a lot of phrases, with no way to rank them!
print([chunk for chunk in textacy.extract.noun_chunks(doc)])

[history, natural language processing, 1950s, work, earlier periods, Alan Turing, article, what, criterion, intelligence, Georgetown experiment, fully automatic translation, more than sixty Russian sentences, English, authors, three or five years, machine translation, real progress, ALPAC report, ten-year-long research, expectations, machine translation, Little further research, machine translation, late 1980s, first statistical machine translation systems, notably successful natural language processing systems, SHRDLU, natural language system, restricted "blocks worlds, restricted vocabularies, ELIZA, simulation, Rogerian psychotherapist, Joseph Weizenbaum, almost no information, human thought, emotion, ELIZA, startlingly human-like interaction, "patient, very small knowledge base, ELIZA, generic response, example, head, you, head, 1970s, many programmers, "conceptual ontologies, real-world information, computer-understandable data, Examples, MARGIE, Schank, Cullingford, (Wilensky, Le

There are numerous options for how long our n-grams should be in these phrases;
what POS tags should be considered or ignored; what pre-processing should be done a priori; how to eliminate overlapping n-grams, such as statistical machine translation and machine translation in the above example; and so on.
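
As one illustration of the overlap question, the sketch below drops any phrase whose words appear contiguously inside a longer phrase in the same list. This is a hypothetical helper (`drop_nested_phrases`), not part of textacy, and it is not how `aggregate_term_variants` works. It assumes the input is a list of phrases ordered from most to least important and compares phrases word by word, ignoring case.

```python
def _contains(longer, shorter):
    """True if the word list `shorter` occurs contiguously inside `longer`."""
    n = len(shorter)
    return any(longer[i : i + n] == shorter for i in range(len(longer) - n + 1))


def drop_nested_phrases(ranked_phrases):
    """Remove phrases that are nested inside a longer phrase from the same list.

    The relative order of the surviving phrases is preserved.
    """
    kept = []
    # Visit longer phrases first so shorter ones are checked against them.
    for phrase in sorted(ranked_phrases, key=lambda p: -len(p.split())):
        words = phrase.lower().split()
        if not any(_contains(k.lower().split(), words) for k in kept):
            kept.append(phrase)
    kept_set = set(kept)
    # dict.fromkeys removes exact duplicates while keeping the input order.
    return [p for p in dict.fromkeys(ranked_phrases) if p in kept_set]


phrases = [
    "statistical machine translation",
    "machine translation",
    "natural language system",
    "research",
]
print(drop_nested_phrases(phrases))
```

Matching on whole words rather than substrings avoids dropping an unrelated phrase such as "art" just because "smart system" happens to contain those letters. Whether to keep the longer or the shorter variant is itself a design choice that depends on the application.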

We showed one example of implementing KPE with textacy. There are other options, though. For example, the Python library gensim included a keyword extractor based on TextRank in versions before 4.0 (the `gensim.summarization` module was removed in gensim 4.0). [This notebook](https://github.com/JRC1995/TextRank-Keyword-Extraction/blob/master/TextRank.ipynb) shows how to implement TextRank from scratch.

Documentation: https://chartbeat-labs.github.io/textacy/build/html/index.html