# Text Summarizer from websites

In [19]:
import bs4 as bs  
import urllib.request  
import re
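# Note: the gensim.summarization module was removed in Gensim 4.0,
# so these two imports need a Gensim 3.x installation.
from gensim.summarization.summarizer import summarize
from gensim.summarization import keywords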


scraped_data = urllib.request.urlopen('https://en.wikipedia.org/wiki/Isaac_Newton')  
article = scraped_data.read()

parsed_article = bs.BeautifulSoup(article,'lxml')

paragraphs = parsed_article.find_all('p')

article_text = ""

for p in paragraphs:  
    article_text += p.text

In [20]:
article_text

'\n\nSir Isaac Newton FRS PRS (25 December 1642\xa0– 20 March 1726/27[a]) was an English mathematician, physicist, astronomer, theologian, and author (described in his own day as a "natural philosopher") who is widely recognised as one of the most influential scientists of all time, and a key figure in the scientific revolution. His book Philosophiæ Naturalis Principia Mathematica ("Mathematical Principles of Natural Philosophy"), first published in 1687, laid the foundations of classical mechanics. Newton also made seminal contributions to optics, and shares credit with Gottfried Wilhelm Leibniz for developing the infinitesimal calculus.\nIn Principia, Newton formulated the laws of motion and universal gravitation that formed the dominant scientific viewpoint until it was superseded by the theory of relativity. Newton used his mathematical description of gravity to prove Kepler\'s laws of planetary motion, account for tides, the trajectories of comets, the precession of the equinoxes 

In the script above, we first import the libraries needed to scrape data from the web. We then call the `urlopen` function from `urllib.request` to fetch the page, and call `read` on the returned object to get its contents. To parse the data, we create a `BeautifulSoup` object and pass it the scraped data (`article`) along with the name of the parser, `lxml`.

In Wikipedia articles, the text of the article is enclosed in `<p>` tags. To retrieve it, we call `find_all` on the `BeautifulSoup` object and pass the tag name as a parameter. `find_all` returns all the paragraphs in the article as a list, and the loop then concatenates them to recreate the article.

Once the article is scraped, we need to do some preprocessing.
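To try the "collect every `<p>`" idea without a network request, here is a minimal sketch that uses only the standard library's `html.parser` in place of BeautifulSoup and `lxml`. The `ParagraphCollector` class and the sample HTML are made up for illustration; the final join mirrors the `article_text += p.text` loop.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Hypothetical helper: gathers the text found inside <p> tags."""

    def __init__(self):
        super().__init__()
        self.depth = 0          # > 0 while we are inside a <p> element
        self.paragraphs = []    # one string per <p> element

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.depth += 1
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Text outside <p> (headings, divs, ...) is ignored.
        if self.depth > 0:
            self.paragraphs[-1] += data


sample_html = (
    "<html><body><h1>Title</h1>"
    "<p>First <b>bold</b> paragraph.</p>"
    "<div>skip</div>"
    "<p>Second paragraph.</p></body></html>"
)

collector = ParagraphCollector()
collector.feed(sample_html)
collector.close()

# Concatenate the paragraphs, as the scraping cell does.
article_text = "".join(collector.paragraphs)
print(collector.paragraphs)
```

Nested tags such as `<b>` contribute their text to the enclosing paragraph, which is also what `p.text` gives in BeautifulSoup.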

# Preprocessing
The first preprocessing step is to remove references from the article. In Wikipedia, references are enclosed in square brackets. The following script removes the bracketed numeric reference markers (such as `[21]`) and replaces the resulting runs of whitespace with a single space:

In [2]:
# Removing Square Brackets and Extra Spaces
article_text = re.sub(r'\[[0-9]*\]', ' ', article_text)  
article_text = re.sub(r'\s+', ' ', article_text)  

The `article_text` object now contains the text without numeric reference markers. We do not want to remove anything else from it, since this is the original article. Numbers, punctuation marks and special characters stay in this text because we will build the summary from its sentences and look up the weighted word frequencies for the words in it.

To clean the text and calculate weighted frequencies, we will create another object. Take a look at the following script:

In [3]:
# Removing special characters and digits
formatted_article_text = re.sub('[^a-zA-Z]', ' ', article_text )  
formatted_article_text = re.sub(r'\s+', ' ', formatted_article_text)  

Now we have two objects: `article_text`, which contains the original article, and `formatted_article_text`, which contains the formatted article. We will use `formatted_article_text` to compute the weighted word frequencies, and then look up those frequencies for the words in the `article_text` object.
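As a quick check of what the two regex passes do, here is a small sketch that applies the same patterns to a made-up sample string (the sample text and the names `original` and `formatted` are assumptions for illustration, not part of the notebook):

```python
import re

sample = "Newton was born in 1642.[1] He   published the Principia in 1687.[a][22]"

# Pass 1: drop numeric reference markers, then collapse runs of whitespace.
original = re.sub(r'\[[0-9]*\]', ' ', sample)
original = re.sub(r'\s+', ' ', original)

# Pass 2: keep letters only, for frequency counting.
formatted = re.sub('[^a-zA-Z]', ' ', original)
formatted = re.sub(r'\s+', ' ', formatted)

print(repr(original))
print(repr(formatted))
```

Note that the first pattern only matches digits between the brackets, so a lettered note such as `[a]` survives in `original`. In `formatted` the full stops are gone, which is why only `original` can later be split into sentences.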

## Converting Text To Sentences
At this point we have preprocessed the data. Next, we need to tokenize the article into sentences. We will use the `article_text` object for this, since it still contains full stops. The `formatted_article_text` object does not contain any punctuation and therefore cannot be split into sentences using the full stop as a delimiter.

The following script performs sentence tokenization:

In [4]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\user\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [5]:
sentence_list = nltk.sent_tokenize(article_text)  
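The reason for tokenizing `article_text` rather than `formatted_article_text` can be seen with a deliberately naive splitter. The sketch below uses a simple regular expression as a stand-in for `nltk.sent_tokenize` (which is considerably more robust, for example around abbreviations); `naive_sentences` and the sample text are hypothetical.

```python
import re


def naive_sentences(text):
    # Hypothetical helper: split after '.', '!' or '?' followed by whitespace.
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]


original = "Newton studied optics. He also built a telescope."
# Same letter-only cleanup as in the preprocessing step.
formatted = re.sub(r'\s+', ' ', re.sub('[^a-zA-Z]', ' ', original))

print(naive_sentences(original))    # two sentences
print(naive_sentences(formatted))   # one long "sentence": no full stops left
```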

# Find Weighted Frequency of Occurrence
To find the frequency of occurrence of each word, we use the `formatted_article_text` variable, since it doesn't contain punctuation, digits, or other special characters. Take a look at the following script:

In [6]:
# Requires the stop word corpus; run nltk.download('stopwords') once if it is missing.
stopwords = nltk.corpus.stopwords.words('english')

word_frequencies = {}  
for word in nltk.word_tokenize(formatted_article_text):  
    if word not in stopwords:
        if word not in word_frequencies.keys():
            word_frequencies[word] = 1
        else:
            word_frequencies[word] += 1

In the script above, we first store all the English stop words from the nltk library in a `stopwords` variable. Next, we loop through all the words in `formatted_article_text` and check whether each one is a stop word. If it is not, we check whether it already exists in the `word_frequencies` dictionary. If the word is encountered for the first time, it is added to the dictionary as a key and its value is set to 1. Otherwise, its value is incremented by 1.

Finally, to find the weighted frequency, we divide the number of occurrences of each word by the frequency of the most frequently occurring word, as shown below:
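The same counting logic can be written more compactly with `collections.Counter`. This is only a sketch of an equivalent approach, not the notebook's code: the tiny `stop_words` set stands in for NLTK's English stop word list, and a plain `split()` stands in for `nltk.word_tokenize`.

```python
from collections import Counter

# Tiny stand-in for NLTK's English stop word list (assumption for this sketch).
stop_words = {"the", "of", "and", "a", "in"}

text = "the laws of motion and the law of gravitation describe motion"

# Count every token that is not a stop word.
word_counts = Counter(w for w in text.split() if w not in stop_words)

print(word_counts.most_common(2))
```

`Counter` handles the "first time seen" case itself, so the explicit `if`/`else` on dictionary membership is not needed.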

In [7]:
maximum_frequency = max(word_frequencies.values())

# Scale every count into the range (0, 1] relative to the most frequent word.
for word in word_frequencies.keys():  
    word_frequencies[word] = (word_frequencies[word]/maximum_frequency)
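To make the normalisation concrete, here is a small worked example on made-up counts (the words and numbers are illustrative only). Each count is divided by the largest count, so the most frequent word gets a weight of 1.0.

```python
# Made-up raw counts for illustration.
word_counts = {"motion": 4, "gravity": 2, "optics": 1}

max_count = max(word_counts.values())

# Build a new dictionary instead of overwriting the counts in place.
weighted = {word: count / max_count for word, count in word_counts.items()}

print(weighted)
```

Building a new dictionary keeps the raw counts available, which can be handy when experimenting with other weighting schemes.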

# Calculating Sentence Scores
We have now calculated the weighted frequencies for all the words. The next step is to score each sentence by adding up the weighted frequencies of the words that occur in it. The following script calculates sentence scores:

In [16]:
sentence_scores = {}  
for sent in sentence_list:  
    for word in nltk.word_tokenize(sent.lower()):
        if word in word_frequencies.keys():
            if len(sent.split(' ')) < 30:
                if sent not in sentence_scores.keys():
                    sentence_scores[sent] = word_frequencies[word]
                else:
                    sentence_scores[sent] += word_frequencies[word]

In the script above, we first create an empty `sentence_scores` dictionary. The keys of this dictionary will be the sentences themselves and the values will be the corresponding scores of the sentences. Next, we loop through each sentence in the `sentence_list` and tokenize the sentence into words.

We then check if the word exists in the `word_frequencies` dictionary. This check is needed because `sentence_list` was created from the `article_text` object, which still contains stop words, numbers and punctuation, whereas `word_frequencies` was built from `formatted_article_text` with the stop words filtered out. Note that each sentence is lower-cased before the lookup while the keys of `word_frequencies` keep their original case, so capitalised words such as `Newton` never match and do not contribute to the score.

We do not want very long sentences in the summary, so we calculate the score only for sentences with fewer than 30 words (you can tweak this parameter for your own use case). Next, we check whether the sentence exists in the `sentence_scores` dictionary. If it does not, we add it as a key and assign it the weighted frequency of the first matching word in the sentence as its value. Otherwise, we simply add the weighted frequency of the word to the existing value.
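The scoring rule described above can be tried on a toy input. The sketch below is a simplified variant, not the notebook's cell: `score_sentences` is a hypothetical helper, the weights are made up, and plain `split()` plus punctuation stripping stands in for `nltk.word_tokenize`.

```python
def score_sentences(sentences, weights, max_words=30):
    """Sum the weights of known words for each sufficiently short sentence."""
    scores = {}
    for sent in sentences:
        # Skip long sentences once, before looking at individual words.
        if len(sent.split(' ')) >= max_words:
            continue
        for word in sent.lower().split():
            word = word.strip('.,;:!?')
            if word in weights:
                scores[sent] = scores.get(sent, 0) + weights[word]
    return scores


weights = {"motion": 1.0, "gravity": 0.5, "optics": 0.25}
sentences = [
    "Gravity governs motion.",
    "He studied optics.",
    "He was born in England.",
]

scores = score_sentences(sentences, weights)
print(scores)
```

A sentence with no weighted words, like the third one here, never enters the dictionary at all, so it cannot be selected for the summary.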

# Getting the Summary
Now we have the `sentence_scores` dictionary, which contains the sentences with their corresponding scores. To summarize the article, we can take the top N sentences with the highest scores. The following script retrieves the top 7 sentences and prints them on the screen.

In [17]:
import heapq  
summary_sentences = heapq.nlargest(7, sentence_scores, key=sentence_scores.get)

summary = ' '.join(summary_sentences)  
print(summary)  

Newton also claimed that the four types could be obtained by plane projection from one of them, and this was proved in 1731, four years after his death. In 1679, Newton returned to his work on celestial mechanics by considering gravitation and its effect on the orbits of planets with reference to Kepler's laws of planetary motion. His work on the subject usually referred to as fluxions or calculus, seen in a manuscript of October 1666, is now published among Newton's mathematical papers. In 1665, he discovered the generalised binomial theorem and began to develop a mathematical theory that later became calculus. Building the design, the first known functional reflecting telescope, today known as a Newtonian telescope, involved solving the problem of a suitable mirror material and shaping technique. From this work, he concluded that the lens of any refracting telescope would suffer from the dispersion of light into colours (chromatic aberration). He used the Latin word gravitas (weight)
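One detail worth knowing about `heapq.nlargest` is that it returns the keys ordered by score, not by their position in the article. The sketch below shows this on made-up scores and then, as an optional variation that is not part of the notebook, re-sorts the selected sentences into their original order.

```python
import heapq

# Made-up sentences (in article order) and scores for illustration.
sentence_list = [
    "He studied optics.",
    "Gravity governs motion.",
    "He built a telescope.",
    "Motion follows laws.",
]
sentence_scores = {
    "He studied optics.": 0.25,
    "Gravity governs motion.": 1.5,
    "He built a telescope.": 0.75,
    "Motion follows laws.": 2.0,
}

# Keys with the two highest scores, best first.
top = heapq.nlargest(2, sentence_scores, key=sentence_scores.get)

# Optional: restore the order in which the sentences appear in the article.
in_order = sorted(top, key=sentence_list.index)

print(' '.join(top))
print(' '.join(in_order))
```

Keeping the article order tends to read more naturally, since a highly scored sentence may refer back to something stated earlier.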

# Summary using Gensim

In [24]:
print(summarize(article_text, ratio=0.01))

Newton's work has been said "to distinctly advance every branch of mathematics then studied."[21] His work on the subject usually referred to as fluxions or calculus, seen in a manuscript of October 1666, is now published among Newton's mathematical papers.[22] The author of the manuscript De analysi per aequationes numero terminorum infinitas, sent by Isaac Barrow to John Collins in June 1669, was identified by Barrow in a letter sent to Collins in August of that year as "[...] of an extraordinary genius and proficiency in these things."[23]
In the same work, Newton presented a calculus-like method of geometrical analysis using 'first and last ratios', gave the first analytical determination (based on Boyle's law) of the speed of sound in air, inferred the oblateness of Earth's spheroidal figure, accounted for the precession of the equinoxes as a result of the Moon's gravitational attraction on the Earth's oblateness, initiated the gravitational study of the irregularities in the moti