# Text summarization: TextRank

With the enormous amount of data surrounding us, it is important to be able to extract the most important information from it. In this notebook, we focus on one algorithm for extracting such information from text.

Broadly speaking, there are two approaches to summarizing text: extractive summarization, where the summary is made up of parts of the document itself, and abstractive summarization, where the summary is not a part of the document and is generated by a learning model.

Abstractive summarization is an extremely difficult problem and remains an area of continued research. Some of the advancements in abstractive summarization using recurrent neural networks can be found in the works mentioned in [Quora](https://www.quora.com/Has-Deep-Learning-been-applied-to-automatic-text-summarization-successfully).

In this work, we approach text summarization from an extractive viewpoint. The questions which we address are: 
   * Given a document, which are the most important sentences in it?
   * Given a document, which are the most important keywords in it?

To answer these questions, we implement TextRank, an algorithm that ranks pieces of text in a document by their importance. TextRank is analogous to Google's PageRank and was introduced by Mihalcea and Tarau in the paper [TextRank: Bringing Order into Texts](https://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf).
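To make the PageRank analogy concrete, here is a minimal sketch of a PageRank-style iteration over a weighted graph, where each node stands for a piece of text and each edge weight for a similarity between two pieces. The function name `pagerank_scores`, the damping factor of 0.85, the fixed iteration count, and the toy similarity matrix are all assumptions made for illustration; the sketch also uses the normalized `(1 - d) / n` form of the update, so it is not a transcription of the formula in the paper.

```python
def pagerank_scores(weights, damping=0.85, iterations=50):
    """Score the nodes of a weighted graph with a PageRank-style iteration.

    weights[i][j] is a (hypothetical) similarity between items i and j.
    """
    n = len(weights)
    # Total edge weight leaving each node, used to normalize its votes.
    out_weight = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            # Each neighbour j passes on a share of its score proportional
            # to the weight of the edge from j to i.
            incoming = sum(
                weights[j][i] / out_weight[j] * scores[j]
                for j in range(n)
                if out_weight[j] > 0
            )
            new_scores.append((1.0 - damping) / n + damping * incoming)
        scores = new_scores
    return scores


# Made-up similarities: items 0 and 1 resemble each other, item 2 is isolated.
similarity = [
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
]
scores = pagerank_scores(similarity)
print(scores)
```

In this toy graph the two connected items end up with equal scores, and the isolated item keeps only the baseline `(1 - d) / n` term, so it ranks last.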

TextRank is an unsupervised algorithm that is much simpler to implement than abstractive summarization methods, and yet it yields good Recall-Oriented Understudy for Gisting Evaluation (ROUGE) scores.

In [5]:
import numpy as np
import argparse
from nltk.tokenize.punkt import PunktSentenceTokenizer
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

In [6]:
document = ""

In [7]:
simple_doc = "Quantum mechanics is interesting. Quantum mechanics is weird. Hello, you there?"
document = simple_doc

For illustrative purposes, we will use the above simple_doc to show the steps involved in the implementation of TextRank. The following things should be noted about this document:
   * It is clear that the third sentence is not important, so we expect TextRank to rank it lowest.
   * It is not clear whether the first or the second sentence is more important.
   * It is clear that "quantum" and "mechanics" are the most important words.
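As a quick sanity check on the last point, the sketch below counts raw word frequencies in simple_doc. This is plain counting, not TextRank, and the tiny stop-word set is a hand-picked assumption used only for this illustration.

```python
import re
from collections import Counter

simple_doc = "Quantum mechanics is interesting. Quantum mechanics is weird. Hello, you there?"

# Hand-picked stop words, assumed only for this illustration.
stop_words = {"is", "you", "there"}

# Lowercase the text, keep alphabetic runs, and drop the stop words.
words = [w for w in re.findall(r"[a-z]+", simple_doc.lower()) if w not in stop_words]
counts = Counter(words)
print(counts)
```

"quantum" and "mechanics" each occur twice, while every other remaining word occurs once, which matches the intuition above.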

In [8]:
# From the document, extract the list of sentences
def get_sentences(doc):
    sentence_tokenizer = PunktSentenceTokenizer()
    return sentence_tokenizer.tokenize(doc)

sentences = get_sentences(document)
print(sentences)

['Quantum mechanics is interesting.', 'Quantum mechanics is weird.', 'Hello, you there?']


In [15]:
# A callable class that stems each word to its root according to the rules defined in PorterStemmer
class PorterTokenizer(object):
    def __init__(self):
        self.porter = PorterStemmer()

    def __call__(self, doc):
        # Split on whitespace and stem every resulting token
        return [self.porter.stem(word) for word in doc.split()]

print(PorterTokenizer()("run running runs"))

[u'run', u'run', u'run']


In [16]:
# Build a TF-IDF vectorizer that drops English stop words and stems tokens with PorterTokenizer
tfidf = TfidfVectorizer(preprocessor=None, 
                        stop_words=stopwords.words('english'),
                        tokenizer=PorterTokenizer())
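
The vectorizer above is configured but not yet applied. As a rough picture of what a TF-IDF representation gives us, the following pure-Python sketch computes TF-IDF weights and pairwise cosine similarities for three short token lists. The token lists are hand-prepared stand-ins for what stemming and stop-word removal might leave, not actual output of the vectorizer, and the helper names `tfidf_vector` and `cosine` are hypothetical. The idf term uses the smoothed form `ln((1 + n) / (1 + df)) + 1`, which scikit-learn documents as its default.

```python
import math
from collections import Counter

# Hand-prepared tokens, assumed for illustration only.
tokenized = [
    ["quantum", "mechan", "interest"],
    ["quantum", "mechan", "weird"],
    ["hello"],
]

n_docs = len(tokenized)
# Document frequency: in how many sentences each term appears.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))


def tfidf_vector(tokens):
    """Return an L2-normalized {term: weight} mapping for one sentence."""
    term_freq = Counter(tokens)
    weights = {
        term: count * (math.log((1.0 + n_docs) / (1.0 + doc_freq[term])) + 1.0)
        for term, count in term_freq.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}


def cosine(u, v):
    """Dot product of two normalized sparse vectors."""
    return sum(weight * v.get(term, 0.0) for term, weight in u.items())


vectors = [tfidf_vector(tokens) for tokens in tokenized]
similarity = [[cosine(u, v) for v in vectors] for u in vectors]
for row in similarity:
    print(row)
```

The first two token lists share terms, so their similarity is positive, while the third shares nothing with either and gets a similarity of zero. A matrix of this kind is what a graph-based ranking step can use as edge weights.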