# Keyword Extraction

In this tutorial, we will learn how to extract keywords from a text in Python. We assume that the reader is comfortable reading Python code and knows basic probability.

## Outline
* Motivation
* Setup
* Methods

### Motivation
A keyword is defined as "*a word which occurs in a text more often than we would expect to occur by chance alone*" [[source]].

The Internet holds far more information than any person could read, even on a single topic. Computers, however, can process text much faster than humans, so we can use them to help filter through this mass of online information.

This raises the question: how do we do this? There are many different approaches to text search, but we can boil them down to a general process.

#### Preprocess data
Let's think of the Internet as a sea of text documents. To work with these documents, we first need to describe or categorize each one in some way, so that we can later organize the data in a neat, searchable data structure. Preprocessing therefore means taking a text document and converting it into a format that is useful for search.

#### Organize data
Once we have our data processed, we need to organize it in some way. This most likely involves creating one or more data structures with quick lookup or query time.


#### Query data
After our data is processed and organized, we should have some kind of interface to make queries on our data. This might involve working directly with the data structure that we used to organize our data, or we might build a tool around our data structure(s).
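
As a rough illustration of the organize and query steps, the sketch below builds an inverted index, a dictionary that maps each word to the set of documents containing it. The toy corpus and the `build_index` and `query` helpers are hypothetical and made up for this example; real search systems are far more elaborate.

```python
from collections import defaultdict

# Hypothetical toy corpus: document id -> already preprocessed list of words.
documents = {
    'doc1': ['python', 'keyword', 'extraction'],
    'doc2': ['python', 'web', 'server'],
    'doc3': ['keyword', 'search', 'engine'],
}

def build_index(docs):
    # Organize: map every word to the set of documents that contain it.
    index = defaultdict(set)
    for doc_id, words in docs.items():
        for word in words:
            index[word].add(doc_id)
    return index

def query(index, words):
    # Query: return the documents that contain all of the query words.
    matches = [index.get(word, set()) for word in words]
    if not matches:
        return set()
    return set.intersection(*matches)

index = build_index(documents)
print(sorted(query(index, ['python', 'keyword'])))
```

Only `doc1` contains both query words, so it is the only match. Looking up a word is a single dictionary access, which is what makes this kind of structure quick to query.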


Now that we have a broad idea of how looking up information works, we can see where keyword extraction comes in. When making a search query, people will usually include keywords related to a topic. Since we want to match documents that contain information about these keywords, we can use keyword extraction as part of our preprocessing stage. In other words, we can use the keywords in a text as a way to represent a document.

[source]: https://en.wikipedia.org/wiki/Keyword_(linguistics)

## Setup
For this tutorial, make sure that you have Python 2.7 installed. In addition, you will need to install a library called NLTK (Natural Language Toolkit), along with its stop word corpus, which can be fetched once with `nltk.download('stopwords')`.

In [5]:
import nltk

## Methods

### Rapid Automatic Keyword Extraction
The first method that we will cover is called rapid automatic keyword extraction (RAKE).

Stop words are common words that provide little to no information about the content in a document when it comes to information retrieval. For example, the words "of" or "the" might be considered stop words because they don't provide any information by themselves.
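
To make this concrete, here is a minimal sketch that drops stop words from a sentence. The stop word set and the sentence are hypothetical, hand-picked for illustration; NLTK's English list is much longer.

```python
# Hypothetical, hand-picked stop word list (an assumption for this example).
toy_stop_words = {'of', 'the', 'a', 'is', 'in'}

sentence = 'the history of the internet is a story of rapid growth'

# Keep only the words that are not stop words.
content_words = [w for w in sentence.split() if w not in toy_stop_words]
print(content_words)
```

The words that remain (`history`, `internet`, `story`, `rapid`, `growth`) are the ones that say something about the content.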

The idea behind RAKE is that keywords and key phrases in a text usually do not contain stop words. In addition, keywords tend to occur frequently in a text and co-occur in a text with other keywords.

Using these ideas, this is how RAKE works:

1. Split the input text on stop words (and on phrase delimiters such as punctuation). The phrases that result from this are called the candidates for keywords.

2. In order to pick which candidates are keywords, each word is scored based upon the number of times it occurs and the number of times it co-occurs with other words inside candidates. A candidate's score is the sum of the scores of its words, and the candidates with the highest scores are selected as keywords.
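
The two steps above can be traced on a single sentence. The sketch below is a self-contained toy version, not the implementation developed later in this tutorial: `toy_rake` and `TOY_STOP_WORDS` are hypothetical names, and the stop word list is hand-picked so that the example runs without NLTK. It assumes the common RAKE scoring in which a word's score is its degree divided by its frequency.

```python
import re
from collections import defaultdict

# Hypothetical, hand-picked stop words so the example runs without NLTK.
TOY_STOP_WORDS = {'a', 'of', 'the', 'is', 'for', 'and', 'in', 'from'}

def toy_rake(text):
    # Step 1: split on punctuation and stop words to get candidate phrases.
    candidates = []
    for fragment in re.split(r'[.,;:!?]', text.lower()):
        phrase = []
        for word in fragment.split():
            if word in TOY_STOP_WORDS:
                if phrase:
                    candidates.append(phrase)
                phrase = []
            else:
                phrase.append(word)
        if phrase:
            candidates.append(phrase)

    # Step 2: score words by degree / frequency, where the degree of a word
    # is the total length of the candidate phrases it appears in.
    frequency = defaultdict(int)
    degree = defaultdict(int)
    for phrase in candidates:
        for word in phrase:
            frequency[word] += 1
            degree[word] += len(phrase)

    # A phrase's score is the sum of the scores of its words.
    scores = {}
    for phrase in candidates:
        scores[' '.join(phrase)] = sum(
            float(degree[w]) / frequency[w] for w in phrase)
    return scores

scores = toy_rake('Keyword extraction is a method for automatic keyword '
                  'extraction from text.')
for phrase, score in sorted(scores.items(), key=lambda x: x[1], reverse=True):
    print('%s: %.1f' % (phrase, score))
```

Here the candidates are "keyword extraction", "method", "automatic keyword extraction", and "text". The words "keyword" and "extraction" each occur twice with a degree of 5, so they score 2.5 each, and the longest phrase, "automatic keyword extraction", ranks first with a score of 8.0.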


You can read more about RAKE [here][RAKE].

[RAKE]: http://www.cbs.dtu.dk/courses/introduction_to_systems_biology/chapter1_textmining.pdf

#### Implementing RAKE
The implementation below is adapted from [here][SOURCE].

[SOURCE]: http://sujitpal.blogspot.jp/2013/03/implementing-rake-algorithm-with-nltk.html

In [None]:
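import re

# First get a list of stopwords.
# NLTK provides a list of stopwords for English that we will use.
# A set gives fast membership tests.
stop_words = set(nltk.corpus.stopwords.words('english'))

def get_candidates(text):
    # Each candidate is a list of consecutive non-stop words.
    candidates = list()

    # Punctuation also ends a phrase, so split the text into fragments first.
    for fragment in re.split(r'[.,;:!?()"]+', text.lower()):
        phrase = list()
        for word in fragment.split():
            if word not in stop_words:
                phrase.append(word)
            elif phrase:
                # A stop word closes the current candidate phrase.
                candidates.append(phrase)
                phrase = list()
        if phrase:
            candidates.append(phrase)

    return candidates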

Now we will implement scoring each candidate:

In [9]:
def get_counts(candidates):
    counts = dict()
    for phrase in candidates:
        for word in phrase:
            if word in counts:
                counts[word] += 1
            else:
                counts[word] = 1
    
    return counts

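def get_degrees(candidates):
    # The degree of a word is the total length of the phrases it appears in,
    # so words that co-occur with many other words get a higher degree.
    degrees = dict()
    
    for phrase in candidates:
        length = len(phrase)
        for word in phrase:
            if word in degrees:
                degrees[word] += length
            else:
                degrees[word] = length
    return degrees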
    
def get_word_scores(candidates):
    word_counts = get_counts(candidates)
    word_degrees = get_degrees(candidates)
    scores = dict()
    
    for word in word_counts:
        scores[word] = float(word_degrees[word]) / float(word_counts[word])
    return scores
    
def get_phrase_scores(candidates):
    word_scores = get_word_scores(candidates)
    scores = list()
    
    for phrase in candidates:
        score = 0.0
        for word in phrase:
            score += word_scores[word]
            
        sentence = ' '.join(phrase)
        scores.append((sentence, score))
    return scores
        
    
def extract_keywords(text):
    candidates = get_candidates(text)
    phrase_scores = sorted(get_phrase_scores(candidates), key=lambda x : x[1], reverse=True)
    return phrase_scores

def top_k_keywords(text, k):
    keywords = extract_keywords(text)
    # Slicing returns at most k items, even when fewer than k keywords exist.
    return keywords[:k]
    

### TextRank