<a href="https://colab.research.google.com/github/shahabbadihi/LA/blob/master/TextRank-v5.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# TextRank: Bringing Order into Texts
In this part you will implement the paper [TextRank: Bringing Order into Texts](https://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf).

In this paper, [Mihalcea](https://scholar.google.com/citations?user=UetM7FgAAAAJ&hl=en) and [Tarau](https://scholar.google.com/citations?user=JUMRc-oAAAAJ&hl=en) introduce TextRank, a **graph-based ranking model for text processing**, and show how it can be used successfully in natural language applications. In particular, they propose and evaluate two unsupervised approaches, one for keyword extraction and one for sentence extraction.

The basic idea implemented by a graph-based ranking model is that of “voting” or “recommendation”.

When one vertex links to another, it is essentially casting a vote for that vertex. The more votes are cast for a vertex, the more important that vertex is. Moreover, the importance of the vertex casting the vote determines how much the vote itself counts, and the ranking model takes this into account as well. **This paper relies on Google's PageRank**.

## Definition

Formally, let $G=(V, E)$ be a directed graph with the set of vertices $V$ and set of edges $E$, where $E$ is a subset of $V \times V$. For a given vertex $V_i$ , let $In(V_i)$ be the set of vertices that point to it (predecessors), and let $Out(V_i)$ be the set of vertices that vertex $V_i$ points to (successors). The score of a vertex $V_i$ is defined as follows (Brin and Page, 1998):

$S(V_i) = (1-d) + d* \sum_{j \in In(V_i)} \frac{1}{|Out(V_j)|}S(V_j)$


where $d$ is a damping factor, usually set to 0.85.
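
The formula can be made concrete by iterating it on a small graph. The following is only a minimal sketch: the function name `pagerank_scores`, the toy graph, and the fixed number of iterations are assumptions made for illustration, and it assumes every vertex has at least one outgoing edge.

```python
def pagerank_scores(out_links, d=0.85, iterations=50):
    """Iterate S(V_i) = (1 - d) + d * sum_{j in In(V_i)} S(V_j) / |Out(V_j)|."""
    nodes = list(out_links)
    scores = {node: 1.0 for node in nodes}  # arbitrary initial scores
    for _ in range(iterations):
        new_scores = {}
        for node in nodes:
            # Sum the votes of every predecessor, each split over its out-links.
            incoming = sum(
                scores[other] / len(out_links[other])
                for other in nodes
                if node in out_links[other]
            )
            new_scores[node] = (1 - d) + d * incoming
        scores = new_scores
    return scores


# Hypothetical toy graph: A -> B, A -> C, B -> C, C -> A
toy_graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
scores = pagerank_scores(toy_graph)
for node, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(node, round(score, 3))
```

In this toy graph, vertex C receives votes from both A and B, so it ends up with the highest score.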

## Graph representation

TextRank builds a weighted graph representation of a document, using words as nodes and **co-occurrence** [<sup>1</sup>](#fn1) frequencies between pairs of words as edge weights. It then applies PageRank to this graph and treats the PageRank score of each word as its significance.

<img src="https://github.com/shahabbadihi/LA/blob/master/textrank.png?raw=1" width="400" align="center">

<span id="fn1"> [1]: In linguistics, co-occurrence or cooccurrence is an above-chance frequency of occurrence of two terms (also known as coincidence or concurrence) from a text corpus alongside each other in a certain order. For example, when the term "strong coffee" appears in a document, the term "espresso bean" probably also tends to occur in that document.</span>


In [2]:
!wget 'https://github.com/shahabbadihi/LA/raw/master/PageRank.zip'
!7z x '/content/PageRank.zip'

--2023-01-27 06:16:25--  https://github.com/shahabbadihi/LA/raw/master/PageRank.zip
Resolving github.com (github.com)... 140.82.112.4
Connecting to github.com (github.com)|140.82.112.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/shahabbadihi/LA/master/PageRank.zip [following]
--2023-01-27 06:16:25--  https://raw.githubusercontent.com/shahabbadihi/LA/master/PageRank.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 740192 (723K) [application/zip]
Saving to: ‘PageRank.zip’


2023-01-27 06:16:25 (16.1 MB/s) - ‘PageRank.zip’ saved [740192/740192]


7-Zip [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
p7zip Version 16.02 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,64 bits,2 CPUs I

In [3]:
import os
import sys
import copy
import collections

import nltk
import nltk.tokenize

sys.path.append(".")

import pandas
import page_rank
import text_rank

In [4]:
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

## Preprocessing
**Tokenization** is a common task in **Natural Language Processing** (NLP). It is a fundamental step both in traditional NLP methods such as Count Vectorizer and in deep learning-based architectures such as [Transformers](https://www.analyticsvidhya.com/blog/2019/06/understanding-transformers-nlp-state-of-the-art-models/?utm_source=blog&utm_medium=what-is-tokenization-nlp).

Tokenization is the process of dividing a large quantity of text into smaller parts called **tokens**.

The Natural Language Toolkit (NLTK) provides the [**nltk.tokenize**](https://www.nltk.org/api/nltk.tokenize.html) module, which covers both word tokenization and sentence tokenization.

We use the function [word_tokenize()](https://www.geeksforgeeks.org/python-nltk-nltk-tokenizer-word_tokenize/) to split a sentence into words. The example below shows what it returns for a short input.
```python
words = nltk.tokenize.word_tokenize("I love Applied linear algebra! specially the projects.")
print(words)
# ['I', 'love', 'Applied', 'linear', 'algebra', '!', 'specially', 'the', 'projects', '.']
```
After tokenizing the document, we filter out punctuation (e.g., !, ?) and words whose [PoS tags](https://en.wikipedia.org/wiki/Part-of-speech_tagging) are irrelevant.
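
The filtering step can be sketched on its own, without downloading any NLTK data, by starting from tokens that are already tagged. The helper name `filter_tokens` is hypothetical, the punctuation check is only a stand-in for `text_rank.__is_punctuation` (whose implementation is not shown in this notebook), and the tags are the ones this notebook prints for the sentence "hello, my name is Shahab!".

```python
import string

# (word, tag) pairs, hard-coded so the sketch needs no tagger.
tagged = [
    ("hello", "NN"), (",", ","), ("my", "PRP$"), ("name", "NN"),
    ("is", "VBZ"), ("Shahab", "NNP"), ("!", "."),
]


def filter_tokens(tagged_tokens, relevant_pos_tags):
    """Keep lower-cased words that are not punctuation and have a relevant tag."""
    kept = []
    for word, tag in tagged_tokens:
        # Assumption: a token counts as punctuation if all its characters are.
        is_punctuation = all(char in string.punctuation for char in word)
        if not is_punctuation and tag in relevant_pos_tags:
            kept.append(word.lower())
    return kept


print(filter_tokens(tagged, {"NN"}))  # ['hello', 'name']
```

Because the tag comparison is an exact match, "Shahab" (tagged `NNP`) is dropped unless `NNP` is added to the set of relevant tags.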

In [25]:
def __preprocess_document(document, relevant_pos_tags):
    '''
    This function accepts a string representation 
    of a document as input, and returns a tokenized
    list of words corresponding to that document.
    '''
    
    # Tokenizing the document
    words = nltk.tokenize.word_tokenize(document)
    
    # PoS tagging: nltk.pos_tag returns (word, tag) pairs; keep only the tags
    pos_tags = [tag for _, tag in nltk.pos_tag(words)]
    print(pos_tags)  # debug output: the tag sequence of the document
    
    # Filter out words with irrelevant POS tags
    filtered_words = []
    for index, word in enumerate(words):
        word = word.lower()
        tag = pos_tags[index]
        # Keep the word only if it is not punctuation and its tag is relevant
        if not text_rank.__is_punctuation(word) and tag in relevant_pos_tags:
            filtered_words.append(word)


    return filtered_words

__preprocess_document("hello, my name is Shahab!", ["NN", "JJ"])

['NN', ',', 'PRP$', 'NN', 'VBZ', 'NNP', '.']


['hello', 'name']

## Ranking
In this section, we first implement weighted PageRank and then use it to implement TextRank.

### PageRank using Power method

In [41]:
def power_iteration(transition_weights, rsp=0.15, epsilon=0.00001, max_iterations=1000):
    # Clerical work:
    transition_weights = pandas.DataFrame(transition_weights)
    nodes = page_rank.__extract_nodes(transition_weights)
    transition_weights = page_rank.__make_square(transition_weights, nodes, default=0.0)
    transition_weights = page_rank.__ensure_rows_positive(transition_weights)

    # Setup:
    state = page_rank.__start_state(nodes)
    transition_probabilities = page_rank.__normalize_rows(transition_weights)
    
    # Compute the transition matrix: follow an edge with probability (1 - rsp),
    # otherwise jump to a uniformly random node
    transition_matrix = transition_probabilities * (1 - rsp) + rsp / len(nodes)

    # Power iteration: repeat until the state changes by less than epsilon
    for iteration in range(max_iterations):
        old_state = state.copy()
        state = old_state.dot(transition_matrix)
        state = state / state.abs().sum()
        if (state - old_state).abs().sum() < epsilon:
            break

    return state
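
The power method in the cell above can be checked by hand on a toy graph. Below is a dependency-free sketch of the same idea; the name `power_method`, the list-of-lists matrix layout, and the toy weights are assumptions for illustration, and every row is assumed to have a positive sum.

```python
def power_method(weights, rsp=0.15, epsilon=1e-5, max_iterations=1000):
    """Stationary distribution of a row-normalized matrix with uniform jumps."""
    n = len(weights)
    # Row-normalize the weights, then mix in a uniform jump with probability rsp.
    matrix = []
    for row in weights:
        total = sum(row)
        matrix.append([(1 - rsp) * w / total + rsp / n for w in row])

    state = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(max_iterations):
        old_state = state
        # Row vector times matrix: state[j] = sum_i old_state[i] * matrix[i][j]
        state = [
            sum(old_state[i] * matrix[i][j] for i in range(n)) for j in range(n)
        ]
        # Stop once the L1 change between iterations is below epsilon.
        if sum(abs(a - b) for a, b in zip(state, old_state)) < epsilon:
            break
    return state


# Hypothetical symmetric weights for three nodes; node 0 has the heaviest edges.
toy_weights = [
    [0.0, 2.0, 1.0],
    [2.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
]
toy_state = power_method(toy_weights)
print(toy_state)
```

Each row of the smoothed matrix sums to 1, so the state remains a probability distribution without any renormalization.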

### TextRank algorithm
The authors use a co-occurrence relation (as discussed above), controlled by the distance between word occurrences: **two vertices are connected** if their corresponding lexical units co-occur within a **window of maximum $N$ words**, where $N$ can be set anywhere from 2 to 10 words.
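
As a rough illustration of the windowed co-occurrence relation, the sketch below counts how often two words appear close to each other. The function name `cooccurrence_weights` and the token list are made up for this example, and `window_size` is interpreted here as the maximum distance on each side of a word, which is an assumption that matches the `textrank` cell of this notebook rather than a definition taken from the paper.

```python
import collections


def cooccurrence_weights(words, window_size=2):
    """Count how often each pair of words lies within window_size positions."""
    weights = collections.defaultdict(collections.Counter)
    for index, word in enumerate(words):
        # Clip the window to the bounds of the token list.
        start = max(0, index - window_size)
        stop = min(len(words), index + window_size + 1)
        for other_index in range(start, stop):
            if other_index != index:
                weights[word][words[other_index]] += 1
    return weights


# Hypothetical, already filtered token list.
tokens = ["linear", "algebra", "course", "linear", "systems"]
weights = cooccurrence_weights(tokens, window_size=1)
print(dict(weights["linear"]))
```

The resulting weights are symmetric, so the graph can be treated as undirected.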


The vertices added to the graph can be restricted with **syntactic filters**, which select only lexical units of a certain part of speech. One can, for instance, consider only nouns and verbs for addition to the graph, and consequently draw potential edges based only on relations that can be established between nouns and verbs. Experiments showed that the **best results were observed for nouns and adjectives only**. In the Penn Treebank tag set returned by `nltk.pos_tag`, these are tagged "NN" and "JJ".

In [47]:
def textrank(document, window_size=2, rsp=0.15, relevant_pos_tags=["NN", "JJ"]):
    '''
    Accepts a string representation of a document and returns
    a pandas Series that maps words to their TextRank scores.
    Keyword arguments:
    window_size: maximum distance (in words) at which two words co-occur, can be set from 2 to 10 (default 2)
    rsp: probability of jumping to a uniformly random word, i.e. 1 - damping factor (default 0.15)
    relevant_pos_tags: list of PoS tags that the graph is restricted to (default ["NN", "JJ"])
    '''
    
    # Tokenize document:
    words = __preprocess_document(document, relevant_pos_tags)
    
    
    # Building the weighted graph:
    # nodes: words
    # edge weights: number of times two words co-occur within a window of predetermined size
    edge_weights = collections.defaultdict(collections.Counter)
    for index, word in enumerate(words):
        for other_index in range(index - window_size, index + window_size + 1):
            if other_index >= 0 and other_index < len(words) and other_index != index:
                other_word = words[other_index]
                edge_weights[word][other_word] += 1.0

    # Apply `power_iteration` to `edge_weights` and sort the scores in descending order
    word_probabilities = power_iteration(edge_weights, rsp=rsp)

    word_probabilities.sort_values(inplace=True, ascending=False)

    
    return word_probabilities

## Apply TextRank

In [43]:
def apply_text_rank(file_name, title="a document"):
    print("Reading \"%s\" ..." % title)
    # Read the document and keep ASCII characters only:
    file_path = os.path.join(os.path.abspath(''), file_name)
    with open(file_path) as document_file:
        document = document_file.read()
    document = text_rank.__ascii_only(document)
    
    print("Applying TextRank to \"%s\" ..." % title)
    
    # Compute the TextRank score vector
    keyword_scores = textrank(document)

    print()
    header = "Keyword Significance Scores for \"%s\":" % title
    print(header)
    print("-" * len(header))
    print(keyword_scores)
    print()

### Cinderella story

In [48]:
apply_text_rank("data/Cinderalla.txt", "Cinderalla")

Reading "Cinderalla" ...
Applying TextRank to "Cinderalla" ...
['RB', 'IN', 'DT', 'NN', ',', 'EX', 'VBD', 'DT', 'NN', ',', 'WP', 'IN', 'PRP$', 'NN', 'CC', 'NN', 'NN', 'VBD', ',', 'VBD', 'DT', 'NN', 'CC', 'JJS', 'NN', 'IN', 'PDT', 'DT', 'NN', '.', 'PRP', 'VBD', 'CD', 'NNS', 'IN', 'DT', 'JJ', 'NN', 'WP', 'VBD', 'RB', 'IN', 'JJ', 'CC', 'NN', 'IN', 'PRP$', 'NN', '.', 'DT', 'NN', 'RB', 'VBD', 'DT', 'JJ', 'NN', 'IN', 'DT', 'NN', 'VBN', 'NNP', ',', 'WP', 'VBD', 'VBN', 'IN', 'NN', 'CC', 'VBD', 'CD', 'IN', 'DT', 'JJS', 'NNS', 'DT', 'NN', 'VBD', 'RB', 'VBN', '.', 'NNP', 'POS', 'NN', 'VBD', 'RB', 'JJ', 'IN', 'PRP$', 'NN', 'CC', 'NN', 'CC', 'VBD', 'PRP$', 'VB', 'DT', 'JJS', 'CC', 'RBS', 'JJ', 'NN', 'IN', 'DT', 'NN', '.', 'NN', 'VBD', 'DT', 'NNS', ',', 'VBD', 'DT', 'NN', 'CC', 'VBD', 'DT', 'NN', 'DT', 'IN', 'PRP$', 'NNS', 'VBN', 'IN', 'JJ', 'NNS', 'VBD', 'JJ', 'VBG', 'JJ', '.', 'RB', 'PRP', 'RB', 'VBD', 'IN', 'DT', 'NNP', 'POS', 'NN', 'VBD', 'TO', 'VB', 'DT', 'NN', ',', 'VBG', 'PDT', 'DT', 'JJ', 'N

### Beauty and the Beast story

In [49]:
apply_text_rank("data/Beauty_and_the_Beast.txt", "Beauty and the Beast")

Reading "Beauty and the Beast" ...
Applying TextRank to "Beauty and the Beast" ...
['RB', 'IN', 'DT', 'NN', 'RB', 'VBD', 'DT', 'JJ', 'NN', 'CC', 'PRP$', 'NN', '.', 'PRP', 'VBD', 'DT', 'NNS', 'IN', 'JJ', 'NNS', 'CC', 'VBD', 'RB', 'JJ', '.', 'RB', 'CD', 'NN', ',', 'DT', 'NN', 'VBD', 'NN', 'TO', 'DT', 'JJ', 'NN', 'NN', 'CC', 'DT', 'JJ', 'NN', 'VBD', 'JJ', '.', 'EX', 'VBD', 'DT', 'JJ', 'NN', 'CC', 'PDT', 'DT', 'NNS', 'IN', 'DT', 'NN', 'VBD', 'VBN', '.', 'CC', 'DT', 'NN', 'VBD', 'TO', 'VB', 'DT', 'JJ', 'NN', '.', 'PRP', 'VBD', 'TO', 'DT', 'NNS', 'CC', 'VBD', 'RB', 'JJ', '.', 'VB', 'PRP', 'VBD', 'NN', 'TO', 'VB', 'DT', 'NN', 'IN', 'JJ', 'NNS', '.', 'DT', 'JJ', 'NNS', 'VBD', 'PRP', 'NN', 'CC', 'VBD', ',', '``', 'NNP', 'PRP', 'VBD', 'TO', 'VB', 'DT', 'RBS', 'JJ', 'NN', 'IN', 'DT', 'NN', '.', 'PRP', 'MD', 'VBG', 'RB', 'CC', 'VB', 'RB', 'RB', '.', 'PRP', 'MD', 'VB', 'RB', '.', 'PDT', 'DT', 'NNS', 'VBD', 'DT', 'NN', 'CC', 'VBD', 'PRP$', 'JJ', 'NNS', '.', 'WRB', 'PRP', 'VBD', 'DT', 'JJ', 'NN', 'NN

In [None]:
# Optional: test TextRank on other documents :))