# Text summarization: TextRank

With an enormous amount of data surrounding us, it is important to be able to extract the most important information from it. In this notebook, we focus on one algorithm for extracting such information from text.

Broadly speaking, there are two approaches to summarizing text: extractive summarization, where the summary is made up of parts of the document itself, and abstractive summarization, where the summary is not taken from the document but is generated by a learned model.

Abstractive summarization is an extremely difficult problem and remains an area of continued research. Some of the advancements in abstractive summarization using recurrent neural networks can be found in the works mentioned on [Quora](https://www.quora.com/Has-Deep-Learning-been-applied-to-automatic-text-summarization-successfully).

In this work, we approach text summarization from an extractive viewpoint. The questions which we address are: 
   * Given a document, which are the most important sentences in it?
   * Given a document, which are the most important keywords in it?

To answer these questions, we implement TextRank, an algorithm that ranks pieces of text in a document by their importance. TextRank is analogous to Google's PageRank and was introduced by Mihalcea and Tarau in the paper [TextRank: Bringing Order into Texts](https://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf).

TextRank is an unsupervised algorithm that is much simpler to implement than abstractive summarization methods, yet it is reported to yield good Recall-Oriented Understudy for Gisting Evaluation (ROUGE) scores.
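
To make the ROUGE metric mentioned above concrete, here is a minimal sketch of ROUGE-N recall: the fraction of n-grams in a reference summary that also occur in the candidate summary. It is a simplified illustration only (whitespace tokenization, a single reference, no stemming); the function name `rouge_n_recall` and the two example summaries are hypothetical and not part of this notebook's pipeline.

```python
from collections import Counter

def rouge_n_recall(candidate, reference, n=1):
    """Fraction of reference n-grams that also appear in the candidate (clipped counts)."""
    def ngrams(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    # Count each reference n-gram at most as often as it occurs in the candidate
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total

reference = "quantum mechanics is weird"         # hypothetical human-written summary
candidate = "quantum mechanics is interesting"   # hypothetical system summary
print(rouge_n_recall(candidate, reference, n=1))  # 3 of the 4 reference unigrams are matched
```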

In [176]:
import logging
import re
from IPython.display import display, HTML
import numpy as np
import argparse
from nltk.tokenize.punkt import PunktSentenceTokenizer
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

In [216]:
# Utility functions
def get_sentences(doc):
    sentence_tokenizer = PunktSentenceTokenizer()
    return sentence_tokenizer.tokenize(doc)

def remove_non_words(sentences):
    regex = re.compile('[^a-zA-Z\" "]')
    return [regex.sub('', s) for s in sentences]

def get_idx_to_sentences(sentences):
    return {idx: s for idx, s in enumerate(sentences)}

def get_idx_to_word(vocab):
    print(vocab)  # show the vocabulary mapping (word -> column index)
    return {vocab[word]: word for word in vocab}

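def get_ranks(directed_graph_weights, d=0.85):
    # Solve the PageRank-style linear system (I - d*M) r = (1 - d) * 1, where M is
    # the weight matrix with a zero diagonal and each column normalized to sum to 1.
    A = np.array(directed_graph_weights, dtype=float)  # copy, so the caller's matrix is not modified
    matrix_size = A.shape[0]
    for i in range(matrix_size):
        A[i, i] = 0  # ignore self-similarity
        col_sum = np.sum(A[:, i])
        if col_sum != 0:
            A[:, i] /= col_sum
        A[:, i] *= -d
        A[i, i] = 1

    B = (1-d) * np.ones((matrix_size, 1))

    ranks = np.linalg.solve(A, B)
    return {idx: r[0] for idx, r in enumerate(ranks)}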

def display_highlighted_sentneces(ranks_of_sentences, 
                                  raw_sentences, 
                                  sentences_to_highlight = 3,
                                  dark=0.8):
    # Indices of the top-ranked sentences
    top_idx = sorted(ranks_of_sentences, key=lambda k: ranks_of_sentences[k], reverse=True)[:sentences_to_highlight]
    # Rescale the ranks to [0, 1) so that they map onto a valid color range
    values = np.array([ranks_of_sentences[idx] for idx in range(len(raw_sentences))])
    weights = (values - values.min())/(values.max() - values.min() + 1e-4)
    html = ''
    fmt = ' <span style="background-color: #{0:02x}{0:02x}ff">{1}</span>'
    for idx in range(len(raw_sentences)):
        if idx in top_idx:
            c = int(255*((1.-dark)*(1.-weights[idx])+dark))
        else:
            c = 255  # white background for sentences that are not highlighted
        html += fmt.format(c,raw_sentences[idx])
    display(HTML(html))
    
def display_highlighted_words(ranks_of_words, 
                              raw_sentences, 
                              vocab,
                              words_to_highlight = None,
                              dark=0.8):
    if words_to_highlight is None:
        words_to_highlight = len(vocab) // 5  # default: highlight the top 20% of the vocabulary
    top_idx = sorted(ranks_of_words, key=lambda k: ranks_of_words[k], reverse=True)[:words_to_highlight]
    # Rescale the ranks to [0, 1) so that they map onto a valid color range
    values = np.array([ranks_of_words[idx] for idx in range(len(ranks_of_words))])
    weights = (values - values.min())/(values.max() - values.min() + 1e-4)
    html = ''
    fmt = ' <span style="background-color: #{0:02x}{0:02x}ff">{1}</span>'
    regex = re.compile('[^a-zA-Z\" "]')
    stemmer = PorterTokenizer()
    for s in raw_sentences:
        for w_ in s.split(' '):
            stems = stemmer(regex.sub('', w_))  # empty list if the token contains no letters
            stemmed_word = stems[0].lower() if stems else ''
            if stemmed_word in vocab and vocab[stemmed_word] in top_idx:
                c = int(255*((1.-dark)*(1.-weights[vocab[stemmed_word]])+dark))
            else:
                c = 255  # white background for words that are not highlighted
            html += fmt.format(c,w_)
    display(HTML(html))

In [178]:
logger = logging.getLogger('TextRank')
logger.setLevel(logging.INFO)

In [179]:
doc = "Accumulation of intracellular double-stranded RNA (dsRNA) usually marks viral " \
    "infections or de-repression of endogenous retroviruses and repeat elements. The innate " \
    "immune system, the first line of defense in mammals, is therefore equipped to sense " \
    "dsRNA and mount a protective response. The largest family of dsRNA sensors are " \
    "oligoadenylate synthetases (OAS) which produce a second messenger, 2-5A, in " \
    "response to dsRNA. This 2-5A activates an endoribonuclease, RNase L, which cleaves " \
    "single-stranded cellular and viral RNAs. OAS/RNase L is not only essential for coping " \
    "with bacterial and viral infections but also a major regulator of cell cycle progression, " \
    "differentiation, and apoptosis, processes often misregulated in cancers. We seek to " \
    "understand the dynamics and molecular basis of signaling in the OAS/RNase L " \
    "pathway. To this end we have developed a three-pronged approach to: a) identify " \
    "dsRNAs that accumulate b) monitor 2-5A levels real-time in live cells and c) map direct " \
    "RNA cleavages by RNase L. These approaches collectively provide a complete " \
    "molecular framework to examine dsRNA signaling in various infections and disease " \
    "states."

In [180]:
simple_doc = "Quantum mechanics is interesting. Quantum mechanics is weird. Hello, you there?"
document = doc

simple_doc commentary: For illustration, we use the simple_doc above to show the steps involved in implementing TextRank. Note the following about this document:
   * The third sentence is clearly unimportant, so we expect TextRank to rank it lowest.
   * It is not clear whether the first or the second sentence is more important.
   * It is clear that "quantum" and "mechanics" are the most important words.

In [181]:
raw_sentences = get_sentences(document) # Extract the list of sentences from the document
sentences = remove_non_words(raw_sentences) # Remove all non-letter characters from each sentence
idx_to_sentences = get_idx_to_sentences(sentences) # Map each index to its sentence

logger.debug(sentences)

In [182]:
# A callable class which stems each word to its root according to the rules defined in PorterStemmer
class PorterTokenizer(object):
    def __init__(self):
        self.porter = PorterStemmer()

    def __call__(self, *args, **kwargs):
        return [self.porter.stem(word) for word in args[0].split()]
    
logger.debug(PorterTokenizer()("run running runs")) # Example

In [183]:
# We create a term frequency-inverse document frequency (TF-IDF) vectorizer object
# Input: List of sentences.
# Processing: 1) Stem each word to its root according to PorterStemmer (done by the tokenizer) and 
#             2) Remove the resulting tokens that appear in stop_words
tfidf = TfidfVectorizer(preprocessor=None, 
                        stop_words=stopwords.words('english'),
                        tokenizer=PorterTokenizer())

In [184]:
# tfidf_mat: Normalized tfidf matrix with each row corresponding to a sentence and each column corresponding to a word 
# vocab: Dictionary mapping each word to its index. The index corresponds to the column number of the word in tfidf_mat 
tfidf_mat = tfidf.fit_transform(sentences).toarray()
vocab = tfidf.vocabulary_
idx_to_word = get_idx_to_word(vocab)

logger.debug('\n{}'.format(tfidf_mat))
logger.debug(vocab)

{u'major': 47, u'identifi': 37, u'examin': 33, u'cancer': 9, u'oligoadenyl': 59, u'famili': 34, u'doublestrand': 24, u'equip': 31, u'dynam': 26, u'endoribonucleas': 30, u'onli': 60, u'monitor': 54, u'derepress': 19, u'intracellular': 41, u'activ': 1, u'system': 82, u'mark': 50, u'accumul': 0, u'live': 46, u'singlestrand': 79, u'sens': 76, u'therefor': 83, u'regul': 68, u'variou': 88, u'innat': 40, u'synthetas': 81, u'infect': 39, u'rna': 72, u'protect': 65, u'framework': 36, u'b': 5, u'understand': 86, u'cellular': 11, u'rnase': 73, u'apoptosi': 3, u'line': 45, u'sensor': 77, u'level': 44, u'provid': 66, u'realtim': 67, u'signal': 78, u'l': 42, u'pathway': 61, u'collect': 14, u'mammal': 48, u'mount': 55, u'often': 58, u'process': 62, u'molecular': 53, u'direct': 22, u'retrovirus': 71, u'second': 74, u'immun': 38, u'cleav': 12, u'respons': 70, u'seek': 75, u'viral': 89, u'differenti': 21, u'develop': 20, u'messeng': 51, u'cope': 16, u'cell': 10, u'also': 2, u'state': 80, u'endogen': 29,

simple_doc commentary: Running the cells above on simple_doc (that is, with document = simple_doc) gives a vocabulary of 5 words and three sentences. Notice that the words "you" and "there", which were part of the third sentence, "Hello, you there?", have been removed from the vocabulary by stop_words. As a result, only "hello" remains in the third sentence. This is confirmed by the fact that the third row of tfidf_mat has 1 in the 0th column (note that in vocab, 'hello': 0) and zeros in all other columns.
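
The simple_doc matrix described above can be reproduced by hand. The sketch below is a minimal pure-Python illustration, assuming the default TfidfVectorizer weighting (raw term counts, smoothed idf of ln((1 + n) / (1 + df)) + 1, and L2-normalized rows). The token lists are typed in as an assumption about what stemming and stop-word removal leave behind; they are not computed here, and names such as `tokenized` and `vocab_index` are hypothetical.

```python
import math

# Assumed tokens of simple_doc after stemming and stop-word removal (not computed here)
tokenized = [
    ["quantum", "mechan", "interest"],
    ["quantum", "mechan", "weird"],
    ["hello"],
]

vocab_words = sorted({w for sent in tokenized for w in sent})
vocab_index = {w: i for i, w in enumerate(vocab_words)}   # word -> column index
n_docs = len(tokenized)

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
df = {w: sum(w in sent for sent in tokenized) for w in vocab_words}
idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab_words}

rows = []
for sent in tokenized:
    row = [0.0] * len(vocab_words)
    for w in sent:
        row[vocab_index[w]] += idf[w]            # term count times idf
    norm = math.sqrt(sum(x * x for x in row))
    rows.append([x / norm for x in row])         # L2-normalize each sentence row

print(vocab_index)
for row in rows:
    print([round(x, 3) for x in row])
```

The third row contains a single non-zero entry in the "hello" column, and the rarer words "interest" and "weird" receive a larger weight than "quantum" and "mechan", which occur in two sentences.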

To carry out the TextRank algorithm, we now construct a directed weighted graph where each sentence is a node and the weight of the edge between two sentences specifies their similarity. Suppose s_i is the tfidf vector for sentence i (that is, the i-th row of tfidf_mat); then the similarity between sentences i and j is defined as s_i * s_j.T. Since the rows of tfidf_mat are normalized, this is the cosine similarity of the two sentences.

In [185]:
directed_graph_weights_sentences = np.dot(tfidf_mat, tfidf_mat.T)
logger.debug('\n{}'.format(directed_graph_weights_sentences))
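
As a small illustration of what this product contains, the sketch below builds a toy stand-in for tfidf_mat (the values are made up for illustration) and inspects the resulting similarity matrix. Each entry (i, j) is the cosine similarity of sentences i and j, so the matrix is symmetric with ones on the diagonal; the edge weights only become direction-dependent once get_ranks normalizes each column.

```python
import numpy as np

# Toy stand-in for tfidf_mat: 3 sentences x 5 terms (illustrative values, not real output)
toy = np.array([
    [0.0, 1.7, 1.3, 1.3, 0.0],   # sentence 0
    [0.0, 0.0, 1.3, 1.3, 1.7],   # sentence 1 shares two terms with sentence 0
    [1.0, 0.0, 0.0, 0.0, 0.0],   # sentence 2 shares no terms with the others
])
toy /= np.linalg.norm(toy, axis=1, keepdims=True)   # unit-length rows, like an L2-normalized tfidf matrix

similarity = np.dot(toy, toy.T)   # entry (i, j) is the cosine similarity of sentences i and j
print(np.round(similarity, 3))
```

A sentence that shares no terms with the rest, like sentence 2 here, gets zero weight on all of its edges.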

Just as we defined the weighted graph for sentences, we can define a weighted graph for the words in the document. The similarity between words i and j is defined as w_i.T * w_j, where w_i and w_j are the i-th and j-th columns of tfidf_mat, so two words are similar when they carry weight in the same sentences.

In [186]:
directed_graph_weights_words = np.dot(tfidf_mat.T, tfidf_mat)
logger.debug('\n{}'.format(directed_graph_weights_words))

Now that we have the graph weights, we solve for the ranks of the sentences and words in the document. 

In [187]:
ranks_of_sentences = get_ranks(directed_graph_weights_sentences, 0.85)
ranks_of_words = get_ranks(directed_graph_weights_words, 0.85)

logger.debug(ranks_of_sentences)
logger.debug(ranks_of_words)
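
The linear solve in get_ranks can be cross-checked against the iterative form of the ranking, r = (1 - d) + d * M r, where M is the weight matrix with a zero diagonal and columns normalized to sum to 1. The sketch below is a minimal illustration under those assumptions; the helper names (`column_normalize`, `ranks_by_iteration`, `ranks_by_solve`) and the toy weight matrix are hypothetical and not part of the notebook.

```python
import numpy as np

def column_normalize(weights):
    """Zero the diagonal and scale every non-empty column to sum to 1."""
    M = np.array(weights, dtype=float)       # copy, so the input is left untouched
    np.fill_diagonal(M, 0.0)                 # ignore self-similarity
    col_sums = M.sum(axis=0)
    nonzero = col_sums != 0
    M[:, nonzero] /= col_sums[nonzero]
    return M

def ranks_by_iteration(weights, d=0.85, tol=1e-10, max_iter=1000):
    """Repeat r <- (1 - d) + d * M r until the update is smaller than tol."""
    M = column_normalize(weights)
    r = np.ones(M.shape[0])
    for _ in range(max_iter):
        r_next = (1 - d) + d * M.dot(r)
        if np.abs(r_next - r).max() < tol:
            return r_next
        r = r_next
    return r

def ranks_by_solve(weights, d=0.85):
    """Closed form of the same fixed point: solve (I - d M) r = (1 - d) * 1."""
    M = column_normalize(weights)
    n = M.shape[0]
    return np.linalg.solve(np.eye(n) - d * M, (1 - d) * np.ones(n))

# Toy symmetric similarity matrix; node 3 is not connected to anything
toy_weights = np.array([
    [1.0, 0.5, 0.2, 0.0],
    [0.5, 1.0, 0.4, 0.0],
    [0.2, 0.4, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
])
print(np.round(ranks_by_iteration(toy_weights), 4))
print(np.round(ranks_by_solve(toy_weights), 4))
```

Both approaches should agree up to the iteration tolerance. The isolated node receives only the baseline score 1 - d, which matches the expectation that a sentence unrelated to the rest of the document is ranked lowest.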

In [206]:
display_highlighted_sentneces(ranks_of_sentences, raw_sentences)

In [217]:
display_highlighted_words(ranks_of_words, raw_sentences, vocab)
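
Besides highlighting, the sentence ranks can be turned into a plain-text extractive summary by keeping the top-ranked sentences in their original order. The sketch below assumes a ranks dictionary keyed by sentence index, as returned by get_ranks; the function `top_sentences` and the example ranks are hypothetical and only meant to show the idea.

```python
def top_sentences(ranks, sentences, k=3):
    """Return the k highest-ranked sentences, kept in their original document order."""
    best = sorted(ranks, key=ranks.get, reverse=True)[:k]
    return [sentences[idx] for idx in sorted(best)]

# Hypothetical ranks for a three-sentence document
example_sentences = ["Quantum mechanics is interesting.",
                     "Quantum mechanics is weird.",
                     "Hello, you there?"]
example_ranks = {0: 1.02, 1: 0.98, 2: 0.15}

print(" ".join(top_sentences(example_ranks, example_sentences, k=2)))
```

With the notebook's own variables, the equivalent call would be top_sentences(ranks_of_sentences, raw_sentences).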