## Auto-Generate an Abstract or Summary
---
#### Workflow
1. Identify the most important words.
2. Compute a significance score for each sentence based on the words it contains.
3. Pick the top-scoring sentences for the summary.
---
In this example, we use the following definitions:
1. <b>Word Importance = Word Frequency</b>
2. <b>Sentence Score = sum(Word Importances)</b>

The implementation proceeds in three steps:
1. Download and parse the text
    * Use BeautifulSoup to parse the HTML
2. Tokenize and remove stop words
3. Extract sentences
    * Compute word frequencies
    * Calculate the significance score of each sentence
    * Sort sentences by their significance score
    * Take the top N sentences
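
As a minimal sketch of the two definitions above, the toy example below scores three made-up sentences using only the standard library. The sentences, the `score_sentences` helper, and the whitespace tokenizer are illustrative assumptions, not part of the notebook's pipeline.

```python
from collections import Counter

def score_sentences(sentences):
    """Return (word frequencies, per-sentence scores) for a list of sentences."""
    # Naive tokenization: lowercase and split on whitespace (assumption for this toy example).
    tokenized = [s.lower().split() for s in sentences]
    # Word importance = word frequency across the whole text.
    freq = Counter(word for tokens in tokenized for word in tokens)
    # Sentence score = sum of the importances of the words it contains.
    scores = [sum(freq[w] for w in tokens) for tokens in tokenized]
    return freq, scores

toy_sentences = [
    "voters want change",
    "every party promises change",
    "change is hard",
]
freq, scores = score_sentences(toy_sentences)
print(freq["change"])  # 3
print(scores)          # [5, 6, 5]
```

The second sentence wins only because it has one more word than the others, which illustrates that a plain sum tends to favour longer sentences.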

In [184]:
import requests
from bs4 import BeautifulSoup

In [185]:
docurl = 'https://www.bbc.com/news/world-asia-india-60431187'

In [186]:
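def getTextFromUrl(url):
    """Download a page and return the text of its <article> elements."""
    page = requests.get(url)
    htmlContent = page.content
    soup = BeautifulSoup(htmlContent, 'html.parser')
    # Join the text of every <article> tag into one string
    text = ' '.join(article.text for article in soup.find_all('article'))
    # Strip apostrophes, so "there's" becomes "theres"
    return text.replace('\'', '')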

In [193]:
text = getTextFromUrl(docurl)
text = text.replace('Image source, Getty ImagesImage caption', ' ')
text = text.replace('By Atul Sangar Editor, BBC News PunjabiPublished18 hours agoSharecloseShare pageCopy linkAbout sharing' , ' ')
# text.replace('.','\t.\t')

Now that we have the text, we clean up its punctuation and extract sentences from it.

In [194]:
from nltk import sent_tokenize , word_tokenize
from nltk.corpus import stopwords
from string import punctuation

In [195]:
# The scraped text often has no space after '.' or '?', which hides sentence
# boundaries from sent_tokenize, so pad both marks with spaces first.
new_text = text.replace('.', '.  ').replace('?', '?  ')
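
Padding every `.` and `?` also splits tokens that contain a period mid-word, for example a decimal such as `3.5`. As a hedged alternative sketch, a regular expression can add the space only where a sentence-ending mark is immediately followed by an uppercase letter. The `fix_missing_spaces` helper and the sample string are hypothetical and not part of the notebook.

```python
import re

def fix_missing_spaces(text):
    """Insert a space after '.', '?' or '!' when an uppercase letter follows directly."""
    # The lookahead leaves decimals ("3.5") and lowercase continuations untouched.
    return re.sub(r'([.?!])(?=[A-Z])', r'\1 ', text)

sample = "There is one buzzword: change.Everyone is promising it.Turnout was 3.5 points higher."
print(fix_missing_spaces(sample))
# There is one buzzword: change. Everyone is promising it. Turnout was 3.5 points higher.
```

This heuristic has its own limits: dotted abbreviations in capitals (such as `U.S.`) would still be split, and a missing space before a lowercase word is not repaired.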

In [196]:
sentences = sent_tokenize( new_text )
sentences

['Punjab voting: Can AAP triumph over BJP, Congress to win elections?',
 ', The Aam Aadmi Party is giving a tough fight to the CongressAs the northern Indian state of Punjab votes to choose its next government, theres one buzzword in the air: change.',
 'Everyone is promising it.',
 'The Aam Aadmi Party (AAP), which impressively became the main opposition party in its debut in the state in 2017, is counting on its assurance of changing the plight of Punjabs voters to help it win.',
 'Both the regional Shiromani Akali Dal (SAD) - which had to do massive damage control after initially speaking in favour of the controversial farm laws - and opposition Bharatiya Janata Party (BJP) hope voters discontent with the ruling Congress party will see them through.',
 'Even the Congress says if given another chance, they will change - by moving away from top-down decisions and keep the common citizen front and centre.',
 'The India election that will test Modis popularityWhy PM Modi rolled back Ind

In [197]:
words = word_tokenize(text.lower())
words

['punjab',
 'voting',
 ':',
 'can',
 'aap',
 'triumph',
 'over',
 'bjp',
 ',',
 'congress',
 'to',
 'win',
 'elections',
 '?',
 ',',
 'the',
 'aam',
 'aadmi',
 'party',
 'is',
 'giving',
 'a',
 'tough',
 'fight',
 'to',
 'the',
 'congressas',
 'the',
 'northern',
 'indian',
 'state',
 'of',
 'punjab',
 'votes',
 'to',
 'choose',
 'its',
 'next',
 'government',
 ',',
 'theres',
 'one',
 'buzzword',
 'in',
 'the',
 'air',
 ':',
 'change.everyone',
 'is',
 'promising',
 'it',
 '.',
 'the',
 'aam',
 'aadmi',
 'party',
 '(',
 'aap',
 ')',
 ',',
 'which',
 'impressively',
 'became',
 'the',
 'main',
 'opposition',
 'party',
 'in',
 'its',
 'debut',
 'in',
 'the',
 'state',
 'in',
 '2017',
 ',',
 'is',
 'counting',
 'on',
 'its',
 'assurance',
 'of',
 'changing',
 'the',
 'plight',
 'of',
 'punjabs',
 'voters',
 'to',
 'help',
 'it',
 'win.both',
 'the',
 'regional',
 'shiromani',
 'akali',
 'dal',
 '(',
 'sad',
 ')',
 '-',
 'which',
 'had',
 'to',
 'do',
 'massive',
 'damage',
 'control',
 '

We first need to remove stopwords and punctuation.

In [198]:
newStopWords = set( stopwords.words('english') + list(punctuation) )

In [199]:
words = [ word for word in words if word not in newStopWords]
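
To see what this filter does on a small input, here is a minimal sketch that runs without the NLTK corpus. The `toy_stopwords` set is a tiny hand-written stand-in for `stopwords.words('english')` (an assumption for illustration; the real list is much longer), and the tokens are taken from the output above.

```python
from string import punctuation

# Tiny stand-in for stopwords.words('english'); the real NLTK list is much longer.
toy_stopwords = {"the", "is", "to", "in", "of", "a", "its", "it", "on"}
# Adding each punctuation character lets one membership test drop both kinds of token.
stop_set = toy_stopwords | set(punctuation)

tokens = ["the", "aam", "aadmi", "party", "(", "aap", ")", ",",
          "is", "counting", "on", "its", "assurance"]
kept = [tok for tok in tokens if tok not in stop_set]
print(kept)  # ['aam', 'aadmi', 'party', 'aap', 'counting', 'assurance']
```

Membership is tested on whole tokens, so a token that merely contains punctuation, such as `top-down` or `change.everyone`, passes through unchanged.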

In [200]:
words

['punjab',
 'voting',
 'aap',
 'triumph',
 'bjp',
 'congress',
 'win',
 'elections',
 'aam',
 'aadmi',
 'party',
 'giving',
 'tough',
 'fight',
 'congressas',
 'northern',
 'indian',
 'state',
 'punjab',
 'votes',
 'choose',
 'next',
 'government',
 'theres',
 'one',
 'buzzword',
 'air',
 'change.everyone',
 'promising',
 'aam',
 'aadmi',
 'party',
 'aap',
 'impressively',
 'became',
 'main',
 'opposition',
 'party',
 'debut',
 'state',
 '2017',
 'counting',
 'assurance',
 'changing',
 'plight',
 'punjabs',
 'voters',
 'help',
 'win.both',
 'regional',
 'shiromani',
 'akali',
 'dal',
 'sad',
 'massive',
 'damage',
 'control',
 'initially',
 'speaking',
 'favour',
 'controversial',
 'farm',
 'laws',
 'opposition',
 'bharatiya',
 'janata',
 'party',
 'bjp',
 'hope',
 'voters',
 'discontent',
 'ruling',
 'congress',
 'party',
 'see',
 'through.even',
 'congress',
 'says',
 'given',
 'another',
 'chance',
 'change',
 'moving',
 'away',
 'top-down',
 'decisions',
 'keep',
 'common',
 'citiz

Build a frequency distribution of the words, then score each sentence using those frequencies.

In [201]:
from nltk.probability import FreqDist
freq_words = FreqDist(words)
freq_words


FreqDist({'congress': 12, 'party': 12, 'mr': 12, 'bjp': 9, 'chief': 9, 'punjab': 8, 'also': 8, 'singh': 8, 'minister': 8, 'aap': 6, ...})

In [202]:
from heapq import nlargest

In [203]:
nlargest( 5 , freq_words , key = freq_words.get)
# these are the most frequent words in our text

['congress', 'party', 'mr', 'bjp', 'chief']
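
In current NLTK releases `FreqDist` is a subclass of `collections.Counter`, so the same lookup can also be written with `most_common`. The sketch below uses a plain `Counter` seeded with the counts printed above (only the words shown in that output, so it is not the full distribution) to compare the two calls.

```python
from collections import Counter
from heapq import nlargest

# Counts copied from the FreqDist output above; the remaining words are omitted.
toy_freq = Counter({
    "congress": 12, "party": 12, "mr": 12, "bjp": 9, "chief": 9,
    "punjab": 8, "also": 8, "singh": 8, "minister": 8, "aap": 6,
})

top_by_heap = nlargest(5, toy_freq, key=toy_freq.get)         # words only
top_by_counter = [word for word, _ in toy_freq.most_common(5)]  # words with counts, unpacked
print(top_by_heap)
print(top_by_counter)
```

Both calls break ties in favour of the word that was inserted first, which is why `congress`, `party`, and `mr` keep their original order.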

In [204]:
from collections import defaultdict

In [205]:
# ranking maps each sentence index to its significance score
ranking = defaultdict(int)

for i,sent in enumerate(sentences):
    for word in word_tokenize( sent.lower() ):
        if word in freq_words:
            ranking[i] += freq_words[word]

In [206]:
sum_sentences = nlargest( 5 , ranking , key = ranking.get)

In [207]:
summary = ''
for i in sorted(sum_sentences):
    summary += sentences[i] + '.   '
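
The selection step can be traced on a toy input. In the sketch below, `toy_sentences` and `toy_ranking` are hypothetical stand-ins for `sentences` and `ranking`: `nlargest` returns sentence indices from highest to lowest score, and sorting those indices puts the chosen sentences back in document order.

```python
from heapq import nlargest

toy_sentences = ["First point.", "Minor aside.", "Second point.", "Closing remark."]
# Hypothetical scores keyed by sentence index, standing in for `ranking`.
toy_ranking = {0: 7, 1: 2, 2: 9, 3: 4}

top = nlargest(2, toy_ranking, key=toy_ranking.get)  # indices, highest score first
ordered = sorted(top)                                # back to document order
toy_summary = " ".join(toy_sentences[i] for i in ordered)
print(top)          # [2, 0]
print(ordered)      # [0, 2]
print(toy_summary)  # First point. Second point.
```

Sentences produced by `sent_tokenize` normally keep their closing punctuation, so joining with a plain space avoids a doubled period at the end of each sentence.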

In [208]:
summary

'Both the regional Shiromani Akali Dal (SAD) - which had to do massive damage control after initially speaking in favour of the controversial farm laws - and opposition Bharatiya Janata Party (BJP) hope voters discontent with the ruling Congress party will see them through..   Arvind Kejriwal, AAP chief and Delhi chief minister, alleges that both the Congress party and the SAD have failed to keep their promises to people despite being in power for decades since the state was formed in the 1960s..   He also sparked a controversy by asking voters not to give chances to "outsiders" from Uttar Pradesh and Bihar - in an apparent attack on Mr Kejriwal and Mr Modi who are not from Punjab..   Mr Modis BJP is in an alliance with two regional parties, including the one formed by Mr Singh, the former Congress chief minister..   , PM Modis (right) BJP is in alliance with former chief minister Amarinder Singh (left)The prime minister has also attacked Mr Kejriwal based on an unverified allegation m

In [223]:
def summerizeText( text , n ):
    """Return the n highest-scoring sentences of text, in their original order."""
    # Pad '.' and '?' with spaces so sent_tokenize can find sentence boundaries
    sentences = text.split('.')
    new_text = '.  '.join(sentences)
    sentences = new_text.split('?')
    new_text = '?  '.join(sentences)
    sentences = sent_tokenize( new_text )

    # Word importance = frequency, ignoring stopwords and punctuation
    words = word_tokenize(text.lower())
    newStopWords = set( stopwords.words('english') + list(punctuation) )
    words = [ word for word in words if word not in newStopWords]
    freq_words = FreqDist(words)

    # Sentence score = sum of the frequencies of its words
    ranking = defaultdict(int)

    for i,sent in enumerate(sentences):
        for word in word_tokenize( sent.lower() ):
            if word in freq_words:
                ranking[i] += freq_words[word]

    # Keep the n best sentences, then restore document order
    sum_sentences = nlargest( n , ranking , key = ranking.get)

    summary = ' '.join( sentences[i]  for i in sorted(sum_sentences) )
    return summary

In [224]:
summerizeText( text , 3 )

'Arvind Kejriwal, AAP chief and Delhi chief minister, alleges that both the Congress party and the SAD have failed to keep their promises to people despite being in power for decades since the state was formed in the 1960s. Mr Modis BJP is in an alliance with two regional parties, including the one formed by Mr Singh, the former Congress chief minister. , PM Modis (right) BJP is in alliance with former chief minister Amarinder Singh (left)The prime minister has also attacked Mr Kejriwal based on an unverified allegation made by one of the AAP leaders former colleagues, who has accused him of supporting Sikh separatists.'