# Introduction 

This tutorial introduces TextRank, an algorithm commonly used to summarize text documents. Data is being collected and generated at an unprecedented rate, and making use of it requires extracting the important information quickly and accurately. For text, that often means summarizing a large document or collection so that a reader can tell what it is about without reading all of it. Summarization is useful almost everywhere, since nearly every professional has to read large amounts of text at some point. There are many approaches to this problem, such as supervised machine learning and maximum entropy models. In this tutorial, however, we cover an unsupervised algorithm called TextRank (LexRank is a closely related algorithm built on the same concept). TextRank finds the important parts of a document with a method similar to PageRank, but using different vertices and edges.

# Tutorial Content

We will go over how the TextRank algorithm works, starting with a bag-of-words approach. The data is text copied from a Wikipedia page. After covering the basics of TextRank and its use in document summarization, we will go over how TextRank can be applied to keyword extraction.






# Installation 
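The simplest way to get the libraries used here (networkx, numpy, nltk, and scikit-learn) is to install the Anaconda distribution, which bundles all of them. The NLTK data used below (the tokenizer models, the WordNet lemmatizer data, and the English stop word list) may also need to be fetched once with nltk.download().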

In [40]:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import networkx as nx
import numpy as np  
import nltk
import string
import operator
from nltk.tokenize.punkt import PunktSentenceTokenizer
from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer


# Text Rank

TextRank was introduced in 2004 by Rada Mihalcea and Paul Tarau; the closely related LexRank algorithm was published the same year by researchers at the University of Michigan. Both use the same concept as PageRank: a random walk on a graph in which the sentences are the vertices and the similarities between sentences are the edges. The algorithm therefore extracts the sentences that are most similar to the other sentences; such a sentence is likely to be important because it covers information found in many others. To do that, we start by splitting the document into sentences.
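
To make the random walk concrete, here is a minimal pure-Python sketch of PageRank by power iteration on a tiny weighted graph. It is only an illustration, not the networkx implementation used later: the function name, the toy weights, and the default damping factor of 0.85 are assumptions for this example, and nodes without any edges are simply ignored.

```python
def pagerank_power_iteration(weights, damping=0.85, tol=1e-8, max_iter=100):
    """Rank the nodes of a weighted, undirected graph by power iteration.

    weights: square list of lists; weights[i][j] is the similarity between
             node i and node j (0 means there is no edge).
    """
    n = len(weights)
    # Total edge weight leaving each node, used to turn weights into
    # transition probabilities.
    out_weight = [sum(row) for row in weights]
    scores = [1.0 / n] * n  # start from a uniform distribution

    for _ in range(max_iter):
        new_scores = []
        for j in range(n):
            # Probability mass flowing into node j from its neighbours.
            incoming = sum(
                scores[i] * weights[i][j] / out_weight[i]
                for i in range(n)
                if out_weight[i] > 0
            )
            # With probability (1 - damping) the walker jumps to a random node.
            new_scores.append((1.0 - damping) / n + damping * incoming)
        change = sum(abs(a - b) for a, b in zip(new_scores, scores))
        scores = new_scores
        if change < tol:
            break
    return scores


# Toy graph (hypothetical data): sentence 0 is similar to every other
# sentence, while sentences 1-3 are only similar to sentence 0.
toy_weights = [
    [0.0, 0.5, 0.5, 0.5],
    [0.5, 0.0, 0.0, 0.0],
    [0.5, 0.0, 0.0, 0.0],
    [0.5, 0.0, 0.0, 0.0],
]
toy_scores = pagerank_power_iteration(toy_weights)
best = max(range(len(toy_scores)), key=lambda i: toy_scores[i])
print("highest-ranked sentence index: %d" % best)
```

Sentence 0 ends up with the highest score because every other sentence "recommends" it, which is the behaviour the summarizer relies on.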

# Sentence splitting

Since we summarize the document by finding its most important sentences, we first need to split the document into sentences.
We do this with NLTK's Punkt sentence tokenizer.

In [41]:
def sentence_tokens(document):
    tokenizer = PunktSentenceTokenizer()
    sentences = tokenizer.tokenize(document)
    return sentences

In [42]:
document = """
Another keyphrase extraction algorithm is TextRank. While supervised methods have some nice properties, like being able to produce interpretable rules for what features characterize a keyphrase, they also require a large amount of training data. Many documents with known keyphrases are needed. Furthermore, training on a specific domain tends to customize the extraction process to that domain, so the resulting classifier is not necessarily portable, as some of Turney's results demonstrate. Unsupervised keyphrase extraction removes the need for training data. It approaches the problem from a different angle. Instead of trying to learn explicit features that characterize keyphrases, the TextRank algorithm[3] exploits the structure of the text itself to determine keyphrases that appear "central" to the text in the same way that PageRank selects important Web pages. Recall this is based on the notion of "prestige" or "recommendation" from social networks. In this way, TextRank does not rely on any previous training data at all, but rather can be run on any arbitrary piece of text, and it can produce output simply based on the text's intrinsic properties. Thus the algorithm is easily portable to new domains and languages.

TextRank is a general purpose graph-based ranking algorithm for NLP. Essentially, it runs PageRank on a graph specially designed for a particular NLP task. For keyphrase extraction, it builds a graph using some set of text units as vertices. Edges are based on some measure of semantic or lexical similarity between the text unit vertices. Unlike PageRank, the edges are typically undirected and can be weighted to reflect a degree of similarity. Once the graph is constructed, it is used to form a stochastic matrix, combined with a damping factor (as in the "random surfer model"), and the ranking over vertices is obtained by finding the eigenvector corresponding to eigenvalue 1 (i.e., the stationary distribution of the random walk on the graph).

The vertices should correspond to what we want to rank. Potentially, we could do something similar to the supervised methods and create a vertex for each unigram, bigram, trigram, etc. However, to keep the graph small, the authors decide to rank individual unigrams in a first step, and then include a second step that merges highly ranked adjacent unigrams to form multi-word phrases. This has a nice side effect of allowing us to produce keyphrases of arbitrary length. For example, if we rank unigrams and find that "advanced", "natural", "language", and "processing" all get high ranks, then we would look at the original text and see that these words appear consecutively and create a final keyphrase using all four together. Note that the unigrams placed in the graph can be filtered by part of speech. The authors found that adjectives and nouns were the best to include. Thus, some linguistic knowledge comes into play in this step.

Edges are created based on word co-occurrence in this application of TextRank. Two vertices are connected by an edge if the unigrams appear within a window of size N in the original text. N is typically around 2–10. Thus, "natural" and "language" might be linked in a text about NLP. "Natural" and "processing" would also be linked because they would both appear in the same string of N words. These edges build on the notion of "text cohesion" and the idea that words that appear near each other are likely related in a meaningful way and "recommend" each other to the reader.

Since this method simply ranks the individual vertices, we need a way to threshold or produce a limited number of keyphrases. The technique chosen is to set a count T to be a user-specified fraction of the total number of vertices in the graph. Then the top T vertices/unigrams are selected based on their stationary probabilities. A post- processing step is then applied to merge adjacent instances of these T unigrams. As a result, potentially more or less than T final keyphrases will be produced, but the number should be roughly proportional to the length of the original text.

It is not initially clear why applying PageRank to a co-occurrence graph would produce useful keyphrases. One way to think about it is the following. A word that appears multiple times throughout a text may have many different co-occurring neighbors. For example, in a text about machine learning, the unigram "learning" might co-occur with "machine", "supervised", "un-supervised", and "semi-supervised" in four different sentences. Thus, the "learning" vertex would be a central "hub" that connects to these other modifying words. Running PageRank/TextRank on the graph is likely to rank "learning" highly. Similarly, if the text contains the phrase "supervised classification", then there would be an edge between "supervised" and "classification". If "classification" appears several other places and thus has many neighbors, its importance would contribute to the importance of "supervised". If it ends up with a high rank, it will be selected as one of the top T unigrams, along with "learning" and probably "classification". In the final post-processing step, we would then end up with keyphrases "supervised learning" and "supervised classification".

In short, the co-occurrence graph will contain densely connected regions for terms that appear often and in different contexts. A random walk on this graph will have a stationary distribution that assigns large probabilities to the terms in the centers of the clusters. This is similar to densely connected Web pages getting ranked highly by PageRank. This approach has also been used in document summarization, considered below.
"""
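# Drop non-ASCII characters. unicode() exists only in Python 2; on Python 3
# the equivalent is document.encode('ascii', 'ignore').decode('ascii').
document = unicode(document, 'ascii', 'ignore')
# Join the lines into a single string before splitting it into sentences.
document1 = ' '.join(document.strip().split('\n'))
sentences = sentence_tokens(document1)
print(sentences[0:3])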


[u'Another keyphrase extraction algorithm is TextRank.', u'While supervised methods have some nice properties, like being able to produce interpretable rules for what features characterize a keyphrase, they also require a large amount of training data.', u'Many documents with known keyphrases are needed.']


# Creating Bag of Words

To use PageRank, we need a similarity graph to do a random walk on. To build it, we first create a bag of words for each sentence. We could use Python's collections.Counter, but that returns a dictionary of counts, whereas we want a sparse matrix with sentences as rows, unique words as columns, and each entry giving the number of times a word occurs in a sentence. Luckily, CountVectorizer from sklearn does exactly that.
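
As a rough picture of what that matrix contains, the sketch below builds a small dense version by hand with collections.Counter. It is an illustration only: the helper name and the toy sentences are made up, and the tokenization (lowercasing, keeping tokens of two or more word characters) only approximates CountVectorizer's defaults.

```python
import re
from collections import Counter


def bag_of_words_matrix(sentences):
    """Return (vocabulary, matrix) with one row per sentence and one
    column per unique word; each entry is a word count."""
    # Lowercase and keep tokens of two or more word characters.
    tokenized = [re.findall(r"\b\w\w+\b", s.lower()) for s in sentences]
    vocabulary = sorted(set(word for tokens in tokenized for word in tokens))
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)  # missing words count as 0
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix


# Hypothetical toy input.
toy_sentences = [
    "The cat sat on the mat.",
    "The dog sat on the log.",
    "Cats and dogs.",
]
vocabulary, matrix = bag_of_words_matrix(toy_sentences)
print(vocabulary)
for row in matrix:
    print(row)
```

CountVectorizer returns the same kind of matrix in sparse form, which matters for real documents where most entries are zero.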

In [43]:
# input: list of sentences
# output: sparse matrix of word counts (one row per sentence, one column per word)
def counts(tokens):
    counter = CountVectorizer()
    matrix = counter.fit_transform(tokens)
    return matrix
    

In [44]:
word_matrix = counts(sentences)
word_matrix

<49x359 sparse matrix of type '<type 'numpy.int64'>'
	with 802 stored elements in Compressed Sparse Row format>

# Create a Graph

We now have a sparse sentence-by-word matrix, but the graph we want is represented by a symmetric sentence-by-sentence matrix. First, we reweight the counts with TfidfTransformer, which converts a count matrix into a tf-idf matrix; tf-idf downweights words that appear in many sentences and so better represents how important a word is to a particular sentence. By default, TfidfTransformer also scales every row to unit length. We then multiply the normalized matrix by its transpose. Each entry of the product is the dot product of two sentences' tf-idf vectors, that is, their cosine similarity: a number from 0 to 1, where 1 means the two sentences have the same word distribution and 0 means they share no words. The result is an adjacency matrix for the sentence graph, with the similarities as edge weights. This tf-idf cosine approach is the one used by the LexRank algorithm. The TextRank algorithm uses a different similarity measure, based on the number of words two sentences share rather than on tf-idf.
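
The reason the product stays between 0 and 1 can be seen in a small sketch: once every row has unit length, the dot product of two rows is their cosine similarity. The helper names and the toy rows below are hypothetical, and plain lists stand in for the sparse tf-idf matrix.

```python
import math


def l2_normalize(row):
    """Scale a vector to unit length (all-zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in row))
    return [x / norm for x in row] if norm > 0 else list(row)


def similarity_matrix(rows):
    """Multiply the row-normalized matrix by its transpose."""
    unit_rows = [l2_normalize(r) for r in rows]
    return [
        [sum(a * b for a, b in zip(u, v)) for v in unit_rows]
        for u in unit_rows
    ]


# Hypothetical weights: rows 0 and 1 point in the same direction,
# row 2 shares no words with either of them.
toy_rows = [
    [1.0, 2.0, 0.0],
    [2.0, 4.0, 0.0],
    [0.0, 0.0, 3.0],
]
sim = similarity_matrix(toy_rows)
for row in sim:
    print(row)
```

Rows 0 and 1 get a similarity of about 1 even though their raw weights differ, because only the direction of the vector matters after normalization.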



In [45]:
def graph(word_matrix):
    normalized_matrix = TfidfTransformer().fit_transform(word_matrix)
    similarity_graph = normalized_matrix * normalized_matrix.T
    return similarity_graph

similarity_graph = graph(word_matrix)
similarity_graph

<49x49 sparse matrix of type '<type 'numpy.float64'>'
	with 1929 stored elements in Compressed Sparse Row format>
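
For comparison, the similarity measure described in the original TextRank paper counts the words two sentences share and divides by the sum of the logarithms of the sentence lengths. The sketch below is a minimal version of that formula; the function name is hypothetical, and returning 0.0 for empty or one-word sentences is an assumption made here to avoid dividing by zero.

```python
import math


def overlap_similarity(tokens_a, tokens_b):
    """Shared-word count divided by log(len(a)) + log(len(b))."""
    if not tokens_a or not tokens_b:
        return 0.0
    shared = len(set(tokens_a) & set(tokens_b))
    denominator = math.log(len(tokens_a)) + math.log(len(tokens_b))
    if denominator <= 0:
        # Both sentences have a single token, so log(1) + log(1) == 0.
        return 0.0
    return shared / denominator


# Hypothetical tokenized sentences.
a = ["textrank", "ranks", "sentences"]
b = ["lexrank", "ranks", "sentences"]
print(overlap_similarity(a, b))
```

Dividing by the log lengths keeps long sentences from scoring highly just because they contain more words.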

# The Algorithm

Now we run networkx's PageRank implementation on the graph built from this sparse matrix. PageRank models a random walk over the sentences and iterates until the scores converge, up to a maximum number of iterations. The resulting score of a sentence reflects how similar it is to the other sentences. A higher rank means the sentence is similar to many sentences in the article, which suggests it is important in some way.

In [48]:

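def summary(similarity_graph, sentences, n):
    # Build a weighted graph whose nodes are sentence indices.
    # Note: networkx 3.0 removed from_scipy_sparse_matrix; on newer versions
    # use nx.from_scipy_sparse_array instead.
    nx_graph = nx.from_scipy_sparse_matrix(similarity_graph)
    scores = nx.pagerank(nx_graph)
    # Pair each sentence with its score and sort from highest to lowest.
    ranked = sorted(((scores[i], s) for i, s in enumerate(sentences)),
                    reverse=True)
    # Join the n best sentences (in rank order) into a single string.
    return " ".join(s for _, s in ranked[:n])

summ = summary(similarity_graph, sentences, 4)
print(summ)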

Instead of trying to learn explicit features that characterize keyphrases, the TextRank algorithm[3] exploits the structure of the text itself to determine keyphrases that appear "central" to the text in the same way that PageRank selects important Web pages. Once the graph is constructed, it is used to form a stochastic matrix, combined with a damping factor (as in the "random surfer model"), and the ranking over vertices is obtained by finding the eigenvector corresponding to eigenvalue 1 (i.e., the stationary distribution of the random walk on the graph). The technique chosen is to set a count T to be a user-specified fraction of the total number of vertices in the graph. These edges build on the notion of "text cohesion" and the idea that words that appear near each other are likely related in a meaningful way and "recommend" each other to the reader. 


# TextRank for keyword extraction

Imagine a scenario where you're given a large document and need to figure out the key things it talks about. You want an algorithm that extracts the most important words and phrases from the text. But what determines the importance of a word or a phrase? For this problem, we can also use the TextRank approach, with slight variations.

# Process text

For a keyword extraction algorithm, a natural intuition is to find the words that occur the most. To do that effectively, we remove punctuation and lemmatize the words, so that different forms of the same word are counted together. The same reasoning applies to the TextRank algorithm.


In [49]:
# taken from homework3 
def process(text, lemmatizer=nltk.stem.wordnet.WordNetLemmatizer()):
    """ Normalizes case and handles punctuation
    Inputs:
        text: str: raw text
        lemmatizer: an instance of a class implementing the lemmatize() method
                    (the default argument is of type nltk.stem.wordnet.WordNetLemmatizer)
    Outputs:
        list(str): tokenized text
    """
    # Lowercase and drop possessive 's so "Turney's" becomes "turney".
    b = text.lower()
    b = b.replace("'s", "")

    def applyFunc(s):
        # Delete apostrophes outright ("don't" -> "dont"); turn every other
        # punctuation character into a space so it separates tokens.
        if s == "'":
            return ""
        elif s in string.punctuation:
            return " "
        else:
            return s

    newB = ''.join(map(applyFunc, b))

    tokens = nltk.word_tokenize(newB)
    newTokens = []
    for tok in tokens:
        try:
            word = lemmatizer.lemmatize(tok)
            newTokens.append(word)
        except Exception:
            # Skip tokens the lemmatizer cannot handle.
            pass

    return newTokens

In [50]:
text = "This is a sample test input for processing."
print(process(text))
# lemmatizer=nltk.stem.wordnet.WordNetLemmatizer()
# print(lemmatizer.lemmatize("processes"))
process("Education is the ability to listen to almost anything without losing your temper or your self-confidence.")
print(process("I don't know how this works"))
print(process("I'm doing well! How about you?"))
print(process("Are #those John-Bahsd's dishes?"))

['this', 'is', 'a', 'sample', 'test', 'input', 'for', 'processing']
['i', 'dont', 'know', 'how', 'this', u'work']
['im', 'doing', 'well', 'how', 'about', 'you']
['are', 'those', 'john', 'bahsd', u'dish']


# Stop words

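Extremely common words, or stop words, can also distort the results: they co-occur with almost every other word, so they would rank highly without telling us anything about the text. We remove all the stop words in NLTK's English stop word list.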


In [51]:
def remove_stopwords(tokens):
    # Use a set so each membership test is a constant-time lookup.
    stopwords = set(nltk.corpus.stopwords.words('english'))
    return [tok for tok in tokens if tok not in stopwords]
    

In [52]:
tokens = remove_stopwords(process(document))
print(tokens[:100])

[u'another', u'keyphrase', u'extraction', u'algorithm', u'textrank', u'supervised', u'method', u'nice', u'property', u'like', u'able', u'produce', u'interpretable', u'rule', u'feature', u'characterize', u'keyphrase', u'also', u'require', u'large', u'amount', u'training', u'data', u'many', u'document', u'known', u'keyphrases', u'needed', u'furthermore', u'training', u'specific', u'domain', u'tends', u'customize', u'extraction', u'process', u'domain', u'resulting', u'classifier', u'necessarily', u'portable', u'turney', u'result', u'demonstrate', u'unsupervised', u'keyphrase', u'extraction', u'remove', u'need', u'training', u'data', u'approach', u'problem', u'different', u'angle', u'instead', u'trying', u'learn', u'explicit', u'feature', u'characterize', u'keyphrases', u'textrank', u'algorithm', u'3', u'exploit', u'structure', u'text', u'determine', u'keyphrases', u'appear', u'central', u'text', u'way', u'pagerank', u'selects', u'important', u'web', u'page', u'recall', u'based', u'notion'

# Co-occurrence

Now that we've finished processing the text, how does TextRank actually work on keywords? The key idea is to build a graph with unigrams as vertices and co-occurrences between two words as edges. Two words co-occur when one appears within a window of n words of the other. For example, with a co-occurrence window of 2, only words that are next to each other are connected. With a window of 3, each word is also connected to the word two positions ahead of it.
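
The window idea can be sketched in a few lines of pure Python. This is an illustrative helper, not part of the tutorial's pipeline: the function name is made up, each pair is counted so the counts could serve as edge weights, and pairs of identical words are skipped to avoid self-loops (an assumption of this sketch).

```python
from collections import Counter


def cooccurrence_edges(tokens, window=2):
    """Count how often each unordered pair of words appears within
    `window` positions of each other."""
    edges = Counter()
    for i, word in enumerate(tokens):
        # Pair the current word with the next (window - 1) words.
        for other in tokens[i + 1 : i + window]:
            if other != word:
                edges[tuple(sorted((word, other)))] += 1
    return edges


# Hypothetical token list.
toy_tokens = ["natural", "language", "processing", "natural", "language"]
adjacent_only = cooccurrence_edges(toy_tokens, window=2)
wider = cooccurrence_edges(toy_tokens, window=3)
print(adjacent_only)
print(wider)
```

With a window of 2 the pair ("language", "natural") is seen twice; widening the window to 3 adds pairs that are two positions apart. The counts could be passed to networkx as weights, for example through add_weighted_edges_from.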

# Making a Graph

TextRank uses the PageRank algorithm to rank the nodes. To use PageRank, we first need to make a graph. Conveniently, networkx provides a graph data structure. To keep things simple, we use a co-occurrence window of 2 and unweighted edges. Usually, each edge is weighted by the number of times the co-occurrence happens.

In [53]:
def make_graph(tokens):
    graph = nx.Graph()
    # set(tokens) gives the unique unigrams, one node each
    graph.add_nodes_from(set(tokens))

    # add an edge for every pair of adjacent words (co-occurrence window of 2)
    for i in range(len(tokens) - 1):
        t1, t2 = tokens[i], tokens[i + 1]
        graph.add_edge(t1, t2)
    return graph


graph = make_graph(tokens)
print(type(graph))
    

<class 'networkx.classes.graph.Graph'>


# Keyword Extraction

Now we rank the nodes in the graph with the PageRank algorithm in networkx. The function takes a parameter n, the number of keywords to extract from the text.

In [54]:
def extract_n_keywords(graph, n=10):
    ranks = nx.pagerank(graph)
    # Sort (word, score) pairs by score, highest first, and keep the top n.
    top = sorted(ranks.items(), key=operator.itemgetter(1), reverse=True)[:n]
    return set(word for word, _ in top), ranks

keywords, word_ranks = extract_n_keywords(graph)
print(keywords)

set([u'word', u'would', u'text', u'vertex', u'unigrams', u'pagerank', u'supervised', u'graph', u'based', u'keyphrases'])


# Keyword Phrases

So far, we've only extracted key unigrams from the text, but what we want is phrases along with unigrams. Fortunately, we don't need to go back and create a new graph based on n-grams to find important phrases. After getting the top n keywords, all we need to do is look at every place a keyword occurs in the document and check whether other keywords are adjacent to it. We then average the PageRank scores of the words in each phrase, so longer phrases are not overweighted, and rerank the keywords together with the key phrases.


In [55]:
def key_phrases(keywords, tokens):
    # Note: relies on the global word_ranks returned by extract_n_keywords.
    keyphrases = {}
    j = 0
    for i, word in enumerate(tokens):
        # skip tokens already consumed by the previous phrase
        if i < j:
            continue
        if word in keywords:
            # collect the run of consecutive keywords starting here
            # (capped at 10 tokens)
            kp_words = []
            for x in tokens[i:i+10]:
                if x in keywords:
                    kp_words.append(x)
                else:
                    break

            # average the scores so longer phrases are not favored
            sum_ranks = sum(word_ranks[w] for w in kp_words)
            avg_pagerank = sum_ranks / float(len(kp_words))

            # store the phrase with its score; phrases are reranked below
            keyphrases[' '.join(kp_words)] = avg_pagerank

            j = i + len(kp_words)
    ranked_phrases = sorted(keyphrases.items(), key=operator.itemgetter(1), reverse=True)
    phrases = [phrase for phrase, _ in ranked_phrases]
    return phrases, ranked_phrases

In [56]:
keywords_and_phrases, ranks = key_phrases(keywords, tokens)
for i in ranks:
    print("%s %s" % (i[0], i[1]))

text 0.021334247901
graph 0.0180259236916
vertex graph 0.0163042600728
graph vertex 0.0163042600728
based text 0.0156433251488
graph would 0.0151572236273
vertex 0.014582596454
pagerank graph 0.0140475086452
graph based 0.0139891630441
vertex unigrams 0.0139198268555
keyphrases 0.0134768320763
vertex would 0.0134355600085
unigrams 0.0132570572571
keyphrases supervised 0.0124746149791
would 0.012288523563
supervised 0.0114723978819
word 0.0111125170695
based word 0.010532459733
pagerank 0.0100690935987
based 0.00995240239659


# Evaluation

Now that we know how TextRank works, how do we know how well it works? This is a difficult question to answer. Because it is hard to define what a good summary is, there is no absolute measure of how good a summarization algorithm is. A typical benchmark, however, is the ROUGE (Recall-Oriented Understudy for Gisting Evaluation) measure. It is a recall-based measure, which encourages an algorithm to cover as many topics as it can. The measure compares the generated summary against a reference summary and computes recall over the n-grams they share. For the purposes of this tutorial, we will not go over the evaluation because the ROUGE system requires a registration application.
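
Although we do not run the official ROUGE toolkit here, the core idea of ROUGE-N recall is simple enough to sketch: count the reference n-grams that also appear in the generated summary and divide by the total number of reference n-grams. The code below is a simplified illustration under that definition, not the official scorer; the function names and the example sentences are hypothetical, and it handles a single reference with no stemming or stop word handling.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def rouge_n_recall(candidate_tokens, reference_tokens, n=1):
    """Fraction of reference n-grams that also occur in the candidate."""
    reference_counts = Counter(ngrams(reference_tokens, n))
    candidate_counts = Counter(ngrams(candidate_tokens, n))
    total = sum(reference_counts.values())
    if total == 0:
        return 0.0
    # Clip each match at the number of times the candidate contains it.
    matched = sum(
        min(count, candidate_counts[gram])
        for gram, count in reference_counts.items()
    )
    return matched / float(total)


# Hypothetical summaries.
reference = "the cat sat on the mat".split()
candidate = "the cat lay on the mat".split()
print(rouge_n_recall(candidate, reference, n=1))
print(rouge_n_recall(candidate, reference, n=2))
```

Here five of the six reference unigrams and three of the five reference bigrams are recovered, so the unigram score is higher than the bigram score.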

# Further Reading

The original TextRank paper: 
http://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf

Other, more advanced variants of TextRank:

DivRank:
http://clair.si.umich.edu/~radev/papers/SIGKDD2010.pdf

CollabRank:
http://www.aclweb.org/anthology/C/C08/C08-1122.pdf

ExpandRank:
http://www.aaai.org/Papers/AAAI/2008/AAAI08-136.pdf

Wikipedia:
https://en.wikipedia.org/wiki/Automatic_summarization

More on Evaluation:
ROUGE for python
https://pypi.python.org/pypi/pyrouge/0.1.0