# Auto Summarization of Text/Article using Natural Language Processing

## Import Libraries

In [1]:
import urllib2

In [2]:
from bs4 import BeautifulSoup
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from string import punctuation
from heapq import nlargest
from nltk.probability import FreqDist
from collections import defaultdict

## Extract Data

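The function below takes the URL of an article, downloads the page, decodes it as UTF-8, and parses it with BeautifulSoup's lxml parser. It then joins the text of every `<article>` element and converts the result to plain ASCII.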

In [16]:
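def get_text_from_url(article_url):
    # Download the page and decode it as UTF-8, skipping undecodable bytes.
    page = urllib2.urlopen(article_url).read().decode('utf8', 'ignore')
    soup = BeautifulSoup(page, 'lxml')
    # Join the text of every <article> element on the page.
    joined_text = ' '.join(a.text for a in soup.find_all('article'))
    # Non-ASCII characters become '?', and every '?' (including genuine
    # question marks) is then replaced with a space.
    return joined_text.encode('ascii', errors='replace').replace("?", " ")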
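
The ASCII clean-up on the last line of the function is easy to misread, so here is a minimal Python 3 sketch of what it does to a short string. The helper names `to_ascii_lossy` and `to_ascii_gentle` and the punctuation mapping are illustrative assumptions, not part of the notebook. The second helper is only one possible alternative that keeps apostrophes and question marks.

```python
def to_ascii_lossy(text):
    """Mimic the notebook's step: non-ASCII becomes '?', then every '?' becomes a space."""
    return text.encode("ascii", errors="replace").decode("ascii").replace("?", " ")


# Hypothetical mapping from common typographic quotes to their ASCII equivalents.
_PUNCT_MAP = str.maketrans({
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u201c": '"', "\u201d": '"',   # curly double quotes
})


def to_ascii_gentle(text):
    """Translate known typographic quotes first, then drop any remaining non-ASCII."""
    return text.translate(_PUNCT_MAP).encode("ascii", errors="ignore").decode("ascii")


sample = "Committee\u2019s Russia probe. Why now?"
print(repr(to_ascii_lossy(sample)))
print(repr(to_ascii_gentle(sample)))
```

The lossy version turns the curly apostrophe into a space, which is why fragments such as "Committee s" can appear in the extracted text. It also erases real question marks, which can hide sentence boundaries from the sentence tokenizer.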

I am using a Washington Post article as the sample to summarize. The text is first split into sentences with sent_tokenize, and then the lowercased text is split into words with word_tokenize.

In [20]:
sample_url = "https://www.washingtonpost.com/powerpost/devin-nunes-targeting-mueller-and-the-fbi-alarms-democrats-and-some-republicans-with-his-tactics/2017/12/30/b8181ebc-eb02-11e7-9f92-10a2203f6c8d_story.html?hpid=hp_hp-top-table-main_nunes-752am%3Ahomepage%2Fstory&utm_term=.f1c7e28204ee"
text = get_text_from_url(sample_url)
sents = sent_tokenize(text)
word_sent = word_tokenize(text.lower())

English stopwords and punctuation are removed next, because they carry little meaning and would otherwise dominate the frequency distribution of the words.

In [21]:
_stopwords = set(stopwords.words('english')+list(punctuation))
word_sent=[word for word in word_sent if word not in _stopwords]
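
To make the filtering step concrete, here is a small sketch on a hand-made token list. The stopword set is a tiny illustrative stand-in (an assumption) for NLTK's much longer English list, so the example runs without downloading any corpus.

```python
from string import punctuation

# Tiny illustrative stopword list; NLTK's English list is far longer.
toy_stopwords = {"the", "of", "is", "a", "and", "to"}
ignore = toy_stopwords | set(punctuation)

tokens = ["the", "chairman", "of", "the", "committee", "is", "nunes", ".", "--"]
kept = [tok for tok in tokens if tok not in ignore]
print(kept)
```

Note that `set(punctuation)` contains single characters only. A multi-character token such as `--` passes through the filter, so it may be worth checking the most frequent tokens for leftovers of this kind.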

## Frequency Distribution

Using the frequency distribution of the remaining words, let us look at the 10 most frequent words in the article, which we treat as the most significant ones.

In [22]:
freq = FreqDist(word_sent)
print(nlargest(10, freq, key=freq.get))

ranking = defaultdict(int)

['nunes', 'committee', 'probe', 'intelligence', 'house', 'trump', 'russia', 'said', 'investigation', 'democrats']
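
As a small illustration of the top-word lookup, the sketch below uses `collections.Counter` in place of `FreqDist` so that it runs with the standard library alone (a substitution made here for convenience, not what the notebook does). It compares the `nlargest` call with the counter's own `most_common` method on a toy word list.

```python
from collections import Counter
from heapq import nlargest

# Toy word list standing in for the filtered article tokens.
words = ["nunes", "committee", "probe", "nunes", "house", "committee", "nunes"]
freq = Counter(words)

# Iterating a Counter yields its keys; key=freq.get ranks them by count.
top_by_heap = nlargest(2, freq, key=freq.get)

# most_common returns (word, count) pairs, so keep only the words.
top_by_counter = [word for word, _ in freq.most_common(2)]

print(top_by_heap)
print(top_by_counter)
```

Both approaches return the same words here. When several words share the same count at the cut-off, which of them is returned depends on insertion order, so the tail of a top-10 list should not be over-interpreted.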


Now each sentence is scored by summing the frequencies of the words it contains. Let us see the indices of the 4 highest-scoring sentences.

In [34]:
for i, sent in enumerate(sents):
    for w in word_tokenize(sent.lower()):
        if w in freq:
            ranking[i] += freq[w]

sents_index = nlargest(4, ranking, key=ranking.get)
print(sorted(sents_index, reverse=True))

[36, 11, 9, 0]
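
The scoring loop can be traced by hand on a toy input. The sketch below is written for Python 3 and uses a simple regular-expression tokenizer and a three-word stopword set, both of which are assumptions chosen to keep the example self-contained. The notebook itself uses NLTK's tokenizers and stopword list.

```python
import re
from collections import Counter, defaultdict
from heapq import nlargest


def tokenize(text):
    """Very rough word tokenizer, used here instead of nltk.word_tokenize."""
    return re.findall(r"[a-z']+", text.lower())


sentences = [
    "Nunes chairs the committee.",
    "The weather was mild.",
    "The committee runs the Russia probe, and Nunes leads the committee.",
]
stop = {"the", "was", "and"}  # illustrative stopword set

# Word frequencies over the whole text, stopwords excluded.
freq = Counter(w for s in sentences for w in tokenize(s) if w not in stop)

# Score each sentence by summing the frequencies of its counted words.
ranking = defaultdict(int)
for i, sent in enumerate(sentences):
    for w in tokenize(sent):
        if w in freq:
            ranking[i] += freq[w]

top = nlargest(2, ranking, key=ranking.get)
print(dict(ranking))
print(top)
```

Because the score is a plain sum, longer sentences tend to rank higher. Dividing each score by the number of counted words is one possible adjustment, although the notebook does not do this.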


## Summarization

Now we can build the summary from the 4 most significant sentences, printed in the order in which they appear in the article.

In [35]:
print("Summarized Text: {}".format([sents[j] for j in sorted(sents_index)]))

Summarized Text: ['       Rep. Devin Nunes, once sidelined by an ethics inquiry from leading the House Intelligence Committee s Russia probe, is reasserting the full authority of his position as chairman just as the GOP appears poised to challenge special counsel Robert S. Mueller III s investigation of possible coordination between the Trump campaign and Russian officials.', 'Gowdy, a member of the Intelligence panel who also chairs the House Committee on Oversight and Government Reform, suggested that Nunes has taken some of these steps without the express blessing of House Speaker Paul D. Ryan (R-Wis.),who has been involved in crafting the GOP s multipronged approach to examining a string of allegations from Russian election interference to alleged mismanagement at the nation s top law enforcement agencies.', 'But Nunes s moves coincide with what Democrats say is a coordinated GOP effort to shutter the House Intelligence Committee s Russia probe, publicly absolve President Trump of 
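
The output above is the printed form of a Python list, which is why it carries brackets and quotes. As a hedged sketch with toy data (the sentences and indices below are made up for illustration), the selected sentences could instead be joined into one readable paragraph:

```python
# Toy sentences and indices; in the notebook these come from sent_tokenize and nlargest.
sents = ["  First key sentence.", "Filler sentence.", "Second key sentence."]
sents_index = [2, 0]  # highest score first, as nlargest returns them

# sorted() restores document order; strip() drops stray leading whitespace.
summary = " ".join(sents[j].strip() for j in sorted(sents_index))
print("Summarized Text: " + summary)
```

Sorting the indices before joining keeps the summary in the article's original order, regardless of how the sentences were ranked.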

The above process has a lot of practical applications. Any post can be summarized in the same way, although the extraction step may differ depending on the encoding of the document.