# Text Summarization with NLTK - Extractive

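This notebook introduces TextRank, an unsupervised extractive summarization algorithm based on a weighted graph. TextRank is derived from Google's PageRank algorithm. The summarizer implemented in section 2 is a simpler frequency-based scorer rather than the graph algorithm itself.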

TextRank works as follows:
1. Pre-process the text: remove stop words and stem the remaining words.
2. Create a graph where the vertices are sentences.
3. Connect every sentence to every other sentence by an edge. The weight of the edge is the similarity between the two sentences.
4. Run the PageRank algorithm on the graph.
5. Pick the vertices (sentences) with the highest PageRank scores.

![](page_rank.png)
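
The five steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code used in the rest of this notebook: the stop-word list is a tiny stand-in for NLTK's, stemming is skipped, the similarity measure (shared words divided by the sum of the log sentence lengths) is one common choice assumed here, and the helper names `tokenize`, `similarity`, `pagerank` and `textrank_summary` are hypothetical.

```python
import math
import re

# Tiny illustrative stop-word list (an assumption; NLTK's list is far larger).
STOPWORDS = {"a", "an", "and", "the", "is", "it", "to", "of", "in", "for", "my", "this", "i", "that", "or"}

def tokenize(sentence):
    """Step 1: lowercase, keep alphabetic tokens, drop stop words (no stemming in this sketch)."""
    return [w for w in re.findall(r"[a-z]+", sentence.lower()) if w not in STOPWORDS]

def similarity(a, b):
    """Step 3: number of shared words, normalised by the log of both sentence lengths."""
    if not a or not b:
        return 0.0
    denom = math.log(len(a)) + math.log(len(b))
    return len(set(a) & set(b)) / denom if denom > 0 else 0.0

def pagerank(weights, damping=0.85, iterations=50):
    """Step 4: weighted PageRank by simple power iteration."""
    n = len(weights)
    scores = [1.0 / n] * n
    out_sum = [sum(row) for row in weights]
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            # Rank flowing into sentence i from every neighbour j with outgoing weight.
            rank = sum(weights[j][i] / out_sum[j] * scores[j]
                       for j in range(n) if out_sum[j] > 0)
            new_scores.append((1 - damping) / n + damping * rank)
        scores = new_scores
    return scores

def textrank_summary(sentences, n=1):
    """Steps 2 to 5: build the sentence graph, rank it, return the n best sentences."""
    if not sentences:
        return []
    tokens = [tokenize(s) for s in sentences]
    size = len(sentences)
    weights = [[similarity(tokens[i], tokens[j]) if i != j else 0.0
                for j in range(size)] for i in range(size)]
    scores = pagerank(weights)
    best = sorted(range(size), key=lambda i: scores[i], reverse=True)[:n]
    return [sentences[i] for i in best]

example = [
    "The power supply failed after a power spike.",
    "Delivery was quick.",
    "A power spike can damage the power supply and the router.",
]
print(textrank_summary(example, n=2))
```

In this example the two sentences about the power supply share words, so they reinforce each other in the graph, while the isolated delivery sentence keeps only the base score.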

In [1]:
import nltk
import numpy as np
from bs4 import BeautifulSoup
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from string import punctuation
from heapq import nlargest
from nltk.stem import WordNetLemmatizer

### 1. Load the data

The data consists of reviews of items from an online store.

In [2]:
with open("positive.review") as f:  # close the file once it has been read
    positive_reviews = BeautifulSoup(f.read(), "lxml")
positive_reviews = positive_reviews.find_all('review_text')

### 2. Create a frequency summarizer class

In [3]:
class FrequencySummarizer:
    def __init__(self, min_freq = 0.2, max_freq = 0.8):
        self.min_freq = min_freq
        self.max_freq = max_freq
        self._stopwords = set(stopwords.words('english') + list(punctuation))
        
    def find_frequency(self,wordlist):
        freq_words = nltk.FreqDist(wordlist)
        my_new_dict = {}
        maximum_freq = max(freq_words.values())
        for key,value in freq_words.items():
            freq_ratio = value/maximum_freq
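            # NOTE: the two tests are joined with `or`, so every ratio passes whenever
            # min_freq < max_freq and no word is filtered out here.
            if(freq_ratio > self.min_freq or freq_ratio < self.max_freq):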
                my_new_dict[key] = freq_ratio
#         print(my_new_dict)
        return my_new_dict
    
    
    def word_tokenizer(self,s):
        wordnet_lemmatizer = WordNetLemmatizer()
        s = s.lower()  # downcase
        tokens = nltk.tokenize.word_tokenize(s)  # split string into words (token)
        tokens = [t for t in tokens if len(t) > 2]  # remove short words, they are probably not useful
        tokens = [wordnet_lemmatizer.lemmatize(t) for t in tokens]  # put words into base form
        tokens = [t for t in tokens if t not in self._stopwords]  # remove stopwords
        return tokens  
   

    def summarize(self, document, n):
        sent_tokens = sent_tokenize(document)
        #print(sent_tokens)
        word_sent = [self.word_tokenizer(s) for s in sent_tokens]
        word_sent = sum(word_sent, [])
        #print(word_sent)
        frequencies_dict = self.find_frequency(word_sent)
        sentence_ranks = {}
        # Score each sentence as the sum of the weights of its words.
        for sent in sent_tokens:
            word_tokens = self.word_tokenizer(sent)
            sentence_ranks[sent] = sum(frequencies_dict.get(word, 0.0) for word in word_tokens)
        # Return the n highest-scoring sentences, best first.
        return nlargest(n, sentence_ranks, key=sentence_ranks.get)
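
The weighting step in `find_frequency` can be looked at in isolation. The sketch below uses only the standard library, and the function names `relative_frequencies` and `keep_in_band` are hypothetical. It also shows what a filter that applies both cut-offs at once would keep, which is an assumption about the intent of `min_freq` and `max_freq`; the class above keeps every word because its condition uses `or`.

```python
from collections import Counter

def relative_frequencies(words):
    """Map each word to its count divided by the count of the most frequent word."""
    counts = Counter(words)
    top = max(counts.values())
    return {word: count / top for word, count in counts.items()}

def keep_in_band(freqs, min_freq=0.2, max_freq=0.8):
    """Keep only words whose relative frequency lies strictly between both cut-offs."""
    return {word: f for word, f in freqs.items() if min_freq < f < max_freq}

words = ["power"] * 4 + ["due"] * 2 + ["supply"] * 2 + ["unit"]
freqs = relative_frequencies(words)
print(freqs)   # {'power': 1.0, 'due': 0.5, 'supply': 0.5, 'unit': 0.25}
band = keep_in_band(freqs)
print(band)    # {'due': 0.5, 'supply': 0.5, 'unit': 0.25}
```

If words were dropped this way, the scoring loop in `summarize` would need a default weight for words missing from the dictionary.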

In [4]:
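# Uncomment and run once if the NLTK resources are missing.
# nltk.download('punkt')  # tokenizer models ('punkt_tab' on newer NLTK releases)
# nltk.download('stopwords')
# nltk.download('wordnet')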

### 3. Check the output: example 1

In [5]:
print('********** Review Text **********')
print(positive_reviews[0].text)
fs = FrequencySummarizer()
print('********** Review Summary **********')
summary = fs.summarize(positive_reviews[0].text, 2)


for x in summary:
    print(x)

********** Review Text **********

I purchased this unit due to frequent blackouts in my area and 2 power supplies going bad.  It will run my cable modem, router, PC, and LCD monitor for 5 minutes.  This is more than enough time to save work and shut down.   Equally important, I know that my electronics are receiving clean power.

I feel that this investment is minor compared to the loss of valuable data or the failure of equipment due to a power spike or an irregular power supply.

As always, Amazon had it to me in <2 business days

********** Review Summary **********
I feel that this investment is minor compared to the loss of valuable data or the failure of equipment due to a power spike or an irregular power supply.

I purchased this unit due to frequent blackouts in my area and 2 power supplies going bad.


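### Weight of each word in the review

Each weight is the word's count divided by the count of the most frequent word. In the first review, `power` occurs four times, so it gets weight 1.0, and a word that occurs once gets 0.25.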

` {'purchased': 0.25, 'unit': 0.25, 'due': 0.5, 'frequent': 0.25, 'blackout': 0.25, 'area': 0.25, 'power': 1.0, 'supply': 0.5, 
 'going': 0.25, 'bad': 0.25, 'run': 0.25, 'cable': 0.25, 'modem': 0.25, 'router': 0.25, 'lcd': 0.25, 'monitor': 0.25, 
 'minute': 0.25, 'enough': 0.25, 'time': 0.25, 'save': 0.25, 'work': 0.25, 'shut': 0.25, 'equally': 0.25, 'important': 0.25, 
 'know': 0.25, 'electronics': 0.25, 'receiving': 0.25, 'clean': 0.25, 'feel': 0.25, 'investment': 0.25, 'minor': 0.25, 
 'compared': 0.25, 'loss': 0.25, 'valuable': 0.25, 'data': 0.25, 'failure': 0.25, 'equipment': 0.25, 'spike': 0.25, 
 'irregular': 0.25, 'always': 0.25, 'amazon': 0.25, 'business': 0.25, 'day': 0.25} `
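
To see how these weights lead to the summary, the scores of three sentences from the first review can be recomputed by hand. The token lists below are written out manually and are an assumption about what `word_tokenizer` returns (stop words and tokens of one or two characters removed, plurals reduced to the singular); the sentence labels are hypothetical shorthand.

```python
from heapq import nlargest

# Weights copied from the table above; every word not listed here has weight 0.25.
HEAVY = {"power": 1.0, "due": 0.5, "supply": 0.5}

def weight(word):
    return HEAVY.get(word, 0.25)

# Hand-written token lists (assumed tokenizer output) for three sentences of the review.
tokens = {
    "investment sentence": ["feel", "investment", "minor", "compared", "loss", "valuable",
                            "data", "failure", "equipment", "due", "power", "spike",
                            "irregular", "power", "supply"],
    "purchase sentence": ["purchased", "unit", "due", "frequent", "blackout", "area",
                          "power", "supply", "going", "bad"],
    "clean power sentence": ["equally", "important", "know", "electronics", "receiving",
                             "clean", "power"],
}

# A sentence's score is the sum of the weights of its words.
scores = {label: sum(weight(t) for t in toks) for label, toks in tokens.items()}
print(scores)  # {'investment sentence': 5.75, 'purchase sentence': 3.75, 'clean power sentence': 2.5}
print(nlargest(2, scores, key=scores.get))  # ['investment sentence', 'purchase sentence']
```

The two highest scores belong to the two sentences printed in the review summary. Because the score is a plain sum, longer sentences with repeated high-weight words such as `power` tend to rank first.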

### 4. Check the output: example 2


In [6]:
print('********** Review Text **********')
print(positive_reviews[10].text)
fs = FrequencySummarizer()
print('********** Review Summary **********')
summary = fs.summarize(positive_reviews[10].text, 2)


for x in summary:
    print(x)

********** Review Text **********

I am very happy with this product. It folds super slim, so traveling with it is a breeze! Pretty good sound - not Bose quality, but for the price, very respectable! I've had it almost a year, and it has been along on many weekend get-aways, and works great. I use it alot, so it was a good purchase for me

********** Review Summary **********
I've had it almost a year, and it has been along on many weekend get-aways, and works great.
Pretty good sound - not Bose quality, but for the price, very respectable!


### References:

1. https://github.com/icoxfog417/awesome-text-summarization
2. https://glowingpython.blogspot.in/2014/09/text-summarization-with-nltk.html
