# Text Summarizer

***

## Contents
1. [Overview](#1)
2. [Extractive Text Summarizer](#2)
3. [Abstractive Text Summarizer](#3)

***
<a id = '1'></a>
## 1. Overview
In this notebook we create text summarizers for Wikipedia articles using two different methods: extractive and abstractive summarization.

***

<a id = '2'></a>
## 2. Extractive Text Summarizer
Extractive text summarizers read the input text and discard the sentences deemed less important, so the summary consists of sentences taken verbatim from the source. The input here is text scraped from Wikipedia.

In [1]:
import bs4 as bs
import urllib.request
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
import heapq
import warnings
warnings.filterwarnings('ignore')

In [2]:
### Scrape Wikipedia
scraped_data = urllib.request.urlopen("https://en.wikipedia.org/wiki/Bear")
article = scraped_data.read()

parsed_article = bs.BeautifulSoup(article, 'lxml')

paragraphs = parsed_article.find_all('p')

article_text = ""
for p in paragraphs:
    article_text += p.text

print("\n First 500 characters of Wikipedia article: \n", article_text[:500])


 First 500 characters of Wikipedia article: 
 

Amphicynodontinae
Hemicyoninae
Ursavinae
Agriotheriinae
Ailuropodinae
Tremarctinae
Ursinae
Bears are carnivoran mammals of the family Ursidae. They are classified as caniforms, or doglike carnivorans.  Although only eight species of bears are extant, they are widespread, appearing in a wide variety of habitats throughout the Northern Hemisphere and partially in the Southern Hemisphere. Bears are found on the continents of North America, South America, Europe, and Asia. Common characteristics o


In [3]:
### Preprocessing

# Clean formatting
article_text = re.sub(r'(\r\n?|\n)+', '. ', article_text)
article_text = re.sub(r'\[[0-9]*\]', ' ', article_text)
article_text = re.sub(r'\s+', ' ', article_text)

# Shorten article to first 512 words
temp = article_text.split()
temp = temp[:512]
shortened_text = ''
for word in temp:
    shortened_text = shortened_text + ' ' + word

print("\n First 500 characters of shortened article: \n", shortened_text[:500])


 First 500 characters of shortened article: 
  . Amphicynodontinae. Hemicyoninae. Ursavinae. Agriotheriinae. Ailuropodinae. Tremarctinae. Ursinae. Bears are carnivoran mammals of the family Ursidae. They are classified as caniforms, or doglike carnivorans. Although only eight species of bears are extant, they are widespread, appearing in a wide variety of habitats throughout the Northern Hemisphere and partially in the Southern Hemisphere. Bears are found on the continents of North America, South America, Europe, and Asia. Common characteri


In [4]:
# Remove punctuation and digits (keep letters only)
cleaned_article_text = re.sub('[^a-zA-Z]', ' ', shortened_text)
cleaned_article_text = re.sub(r'\s+', ' ', cleaned_article_text)

print("\n First 500 characters of cleaned text: \n", cleaned_article_text[:500])


 First 500 characters of cleaned text: 
  Amphicynodontinae Hemicyoninae Ursavinae Agriotheriinae Ailuropodinae Tremarctinae Ursinae Bears are carnivoran mammals of the family Ursidae They are classified as caniforms or doglike carnivorans Although only eight species of bears are extant they are widespread appearing in a wide variety of habitats throughout the Northern Hemisphere and partially in the Southern Hemisphere Bears are found on the continents of North America South America Europe and Asia Common characteristics of modern bea


In [5]:
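# Tokenize sentences
# Note: sentences come from the full article, while the word counts below
# use only the first 512 words (cleaned_article_text)
sentences = sent_tokenize(article_text)

# Word frequency table
# Use a new name so the imported `stopwords` module is not overwritten,
# which would break this cell if it were run a second time
stop_words = set(stopwords.words('english'))

freqTable = dict()
words = word_tokenize(cleaned_article_text)

for word in words:
    word = word.lower()
    if word in stop_words:
        continue
    freqTable[word] = freqTable.get(word, 0) + 1

# Normalize counts by the highest count so every score lies in (0, 1]
max_freq = max(freqTable.values())
for word in freqTable:
    freqTable[word] = freqTable[word] / max_freq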

The dictionary now holds a score for every unique non-stop word in the shortened text, based on its frequency relative to the most frequent word. These scores are then used to calculate a score for each sentence in the text.
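
As a small illustration of this normalization, the sketch below builds the same kind of table for a made-up sentence. The sample text and the one-word stop-word set are assumptions for the example, not the NLTK list used in the notebook.

```python
from collections import Counter

# Hypothetical sample text and stop-word set, for illustration only
sample_words = "bears eat fish and bears eat berries".split()
sample_stop_words = {"and"}

# Count every word that is not a stop word
counts = Counter(w for w in sample_words if w not in sample_stop_words)

# Divide by the highest count so the most frequent word scores 1.0
top = max(counts.values())
sample_freq = {word: count / top for word, count in counts.items()}

print(sample_freq)
```

Words that occur as often as the most frequent word score 1.0, and rarer words score proportionally less.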

In [6]:
# Sentence scores
sentenceValue = dict()
for sent in sentences:
    # Skip long sentences (40 or more space-separated words)
    if len(sent.split(' ')) >= 40:
        continue
    sent_lower = sent.lower()
    for word, freq in freqTable.items():
        # Substring test: a word also matches inside longer words
        if word in sent_lower:
            sentenceValue[sent] = sentenceValue.get(sent, 0) + freq
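
Note that the scoring cell tests each word against the lower-cased sentence string, which is a substring test, so a short word also matches inside longer words. The sketch below contrasts that with a whole-token test on a made-up sentence. The helper names `substring_score` and `token_score` and the sample frequencies are hypothetical, and a plain regex split stands in for `word_tokenize` to keep the example self-contained.

```python
import re

# Hypothetical frequency table and sentence, for illustration only
sample_freq = {"ear": 1.0, "bears": 0.5}
sentence = "Bears have small rounded ears."

def substring_score(sent, freq_table):
    # Mirrors the notebook's test: a word counts if it occurs anywhere in the text
    return sum(f for w, f in freq_table.items() if w in sent.lower())

def token_score(sent, freq_table):
    # Alternative: a word counts only if it appears as a whole token
    tokens = set(re.findall(r"[a-z]+", sent.lower()))
    return sum(f for w, f in freq_table.items() if w in tokens)

print(substring_score(sentence, sample_freq))
print(token_score(sentence, sample_freq))
```

Here "ear" contributes to the substring score because it occurs inside "bears" and "ears", but not to the token score. Which behaviour is preferable depends on whether partial matches should count.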

With a score for each sentence, we can select the `n` highest-scoring sentences as the summary of our text. Here we take the top 4 sentences and join them into a single paragraph.
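
To make the selection step concrete, here is a minimal sketch on made-up scores; `sample_scores` is a hypothetical stand-in for the sentence-score dictionary. Iterating over a dictionary yields its keys, so `heapq.nlargest` with `key=sample_scores.get` returns the sentences themselves, highest score first.

```python
import heapq

# Hypothetical sentence scores, for illustration only
sample_scores = {
    "Bears are mammals.": 2.5,
    "Some bears hibernate.": 1.0,
    "Bears live on several continents.": 3.0,
}

# Keep the two sentences with the highest scores
top_two = heapq.nlargest(2, sample_scores, key=sample_scores.get)
print(top_two)
```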

In [7]:
# Return summary
n = 4
summary = ' '.join(heapq.nlargest(n, sentenceValue, key = sentenceValue.get))   

print("\n Summarised version of the Wikipedia article: \n\n", summary)


 Summarised version of the Wikipedia article: 

 The English word "bear" comes from Old English bera and belongs to a family of names for the bear in Germanic languages, such as Swedish björn, also used as a first name. This terminology for the animal originated as a taboo avoidance term: proto-Germanic tribes replaced their original word for bear—arkto—with this euphemistic expression out of fear that speaking the animal's true name might cause it to appear. Although only eight species of bears are extant, they are widespread, appearing in a wide variety of habitats throughout the Northern Hemisphere and partially in the Southern Hemisphere. Common characteristics of modern bears include large bodies with stocky legs, long snouts, small rounded ears, shaggy hair, plantigrade paws with five nonretractile claws, and short tails..
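
Because `heapq.nlargest` orders its result by score, the summary above lists sentences from highest to lowest score rather than in article order. If reading order matters, one option (an assumption on our part, not something this notebook does) is to sort the selected sentences by their position in the original sentence list. The names and scores below are hypothetical sample data.

```python
import heapq

# Hypothetical sentences in article order, with made-up scores
sample_sentences = ["First point.", "Second point.", "Third point.", "Fourth point."]
sample_scores = {"First point.": 1.0, "Second point.": 2.0,
                 "Third point.": 0.5, "Fourth point.": 3.0}

# Pick the two highest-scoring sentences (returned in score order)
picked = heapq.nlargest(2, sample_scores, key=sample_scores.get)

# Re-sort the picks by where they occur in the article
ordered = sorted(picked, key=sample_sentences.index)

print(" ".join(ordered))
```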


***

<a id ='3'></a>
## 3. Abstractive Text Summarizer
Abstractive text summarizers differ from extractive ones: rather than only extracting the most important sentences from the original text, they generate new sentences from the details the model deems important. In this notebook we use Hugging Face's *transformers* library to run a pre-trained T5 model (`t5-base`) as-is, without further fine-tuning.

In [8]:
from transformers import T5ForConditionalGeneration, T5Tokenizer
model = T5ForConditionalGeneration.from_pretrained("t5-base")
tokenizer = T5Tokenizer.from_pretrained("t5-base")

In [9]:
### Scrape Wikipedia
scraped_data = urllib.request.urlopen("https://en.wikipedia.org/wiki/Bear")
article = scraped_data.read()

parsed_article = bs.BeautifulSoup(article, 'lxml')
paragraphs = parsed_article.find_all('p')

article_text = ""
for p in paragraphs:
    article_text += p.text

In [None]:
# Model input: prepend the "summarize: " task prefix; truncation keeps only the first 512 tokens
inputs = tokenizer.encode("summarize: " + article_text, return_tensors = "pt", max_length = 512, truncation = True)

# Model output
outputs = model.generate(inputs, max_length = 150, min_length = 40,
                         length_penalty = 2.0, num_beams = 4,
                         early_stopping = True)

In [None]:
print('\n\n')
# Decode the best beam, dropping special tokens such as <pad> and </s>
print("\n Summarised version of the Wikipedia article: \n\n", tokenizer.decode(outputs[0], skip_special_tokens = True))