## Text summarization

#### 1. Import libraries

In [53]:
import bs4 as bs
import urllib.request
import re
import nltk
import heapq

#### 2. Web Scraping 

In [59]:
source = urllib.request.urlopen('https://en.wikipedia.org/wiki/Priyanka_Chopra')

In [60]:
soup = bs.BeautifulSoup(source,'lxml')

In [61]:
text = ""
for paragraph in soup.find_all('p'):
    text += paragraph.text


#### 3. Preprocessing text

In [62]:
# Preprocessing text
text = re.sub(r'\[[0-9]*\]',' ',text) #Removes references in wikipedia
text = re.sub(r'\s+',' ',text) #Removes spaces
clean_text = text.lower()
clean_text=  re.sub(r'\W',' ',clean_text) #removes special characters
clean_text=  re.sub(r'\d',' ',clean_text) #removes digits
clean_text=  re.sub(r'\s+',' ',clean_text) #removes spaces
#  The clean_text id for creating histogram .i.e. this is the training data.
#  The text is the prediction.

#### 4. Build the Model

In [63]:
sentences = nltk.sent_tokenize(text)
stop_words = nltk.corpus.stopwords.words('english')


In [64]:
# Basic Histogram
word2count={}
for word in nltk.word_tokenize(clean_text):
    if word not in stop_words:
        if word not in word2count.keys():
            word2count[word] =1
        else:
            word2count[word] += 1

In [65]:
# Weighted Histogram
for key in word2count.keys():
    word2count[key] = word2count[key]/max(word2count.values())

In [66]:
# Get score of sentence
sent2score={}
for sentence in sentences:
    for word in nltk.word_tokenize(sentence.lower()):
        if word in word2count.keys():
            if len(sentence.split(' '))<25:
                if sentence not in sent2score.keys():
                    sent2score[sentence] =word2count[word]
                else:
                    sent2score[sentence] += word2count[word]


#### 5. Get the summary

In [67]:
# Select Top 3 sentences
best_sentences = heapq.nlargest(3,sent2score, key=sent2score.get )
print("--------------------------------")
for sentence in best_sentences:
    print(sentence)
print("--------------------------------")

--------------------------------
The film received rave reviews from film critics and was a major commercial success, earning ₹1.75 billion (US$26 million) worldwide.
In December 2012, she received three nominations: Best Female Artist, Best Song and Best Video (for "In My City") at the World Music Awards.
The film premiered at the 2014 Toronto International Film Festival, received positive reviews from critics, and her performance received critical acclaim.
--------------------------------
