# Simple Text Summarization using NLTK
_NLTK is the **N**atural **L**anguage **T**ool**k**it library in Python_.
> Reference: [How-to article by Usman Malik](https://stackabuse.com/text-summarization-with-nltk-in-python/)

This technique uses relative word frequency to score sentences in a long body of text, then ranks those sentences by the highest scores and returns the top sentences. It is a simple way to summarize large amounts of text using the author's own words, and provides a surprisingly smooth summary of key themes in the text.

This is an _extractive_ method because the sentences used for the summary come from the text itself.

Other techniques of text summarization, called _abstractive_ methods, use machine learning to generate new text summaries based on language patterns and themes of the text.
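
To make the extractive, frequency-based idea concrete before the step-by-step walkthrough, here is a minimal standard-library sketch. It is not the notebook's implementation: the `summarize` function, the tiny `STOPWORDS` set, and the regex sentence splitter are assumptions that stand in for NLTK's stopword list and sentence tokenizer.

```python
import heapq
import re
from collections import Counter

# Tiny illustrative stopword set (assumption); the notebook uses NLTK's English list.
STOPWORDS = {"a", "an", "and", "are", "in", "is", "it", "of", "on", "the", "to"}


def summarize(text, n_sentences=1):
    """Return the n highest-scoring sentences, scored by relative word frequency."""
    # Naive split after ., ! or ? followed by whitespace (stand-in for a real tokenizer).
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())

    # Count the non-stopword words across the whole text.
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]
    counts = Counter(words)
    max_count = max(counts.values())

    # Weighted frequency: each count divided by the highest count.
    weights = {word: count / max_count for word, count in counts.items()}

    # A sentence's score is the sum of the weights of its words.
    scores = {
        sent: sum(weights.get(w, 0) for w in re.findall(r"[a-z]+", sent.lower()))
        for sent in sentences
    }
    return heapq.nlargest(n_sentences, scores, key=scores.get)


sample = (
    "Galaxies fill the deep field image. "
    "The weather was mild. "
    "The image shows galaxies in a deep field of space."
)
top_sentences = summarize(sample, n_sentences=1)
print(top_sentences)
```

The third sentence wins because it contains the most high-frequency words, which also shows why longer sentences tend to score higher under this scheme.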

## Uses
This Python program will extract brief summaries from large amounts of text. **It is best used with text that has a single theme or message.** For example: 
* News articles 
* Academic research papers 
* Speeches or addresses 
* Wikipedia articles 
* Sports updates 
* Chapters of popular business books 😉

## Limitations
This technique is not the best for summarizing large bodies of text with multiple themes or an evolving story. For example: 
* Novels (especially those with complex plots)
* Entire books, even if each chapter has a central theme
* How-to guides (since the order of instructions is important)

# Setup
A user inputs the following settings:
* URLs (web pages) to summarize
* Summary length (in number of sentences)

Also, in this section I import the required packages for the analysis: `requests` for interacting with websites, `BeautifulSoup` for parsing webpage HTML, `re` for using regular expressions (RegEx) to clean the text, `nltk` for the text analysis, `statistics` for basic sentence-length statistics, and `heapq` for returning the top-scoring sentences.

In [None]:
# How long (in sentences) would you like each summary to be?
summary_lenth = 6

page_to_summarize = 'https://en.wikipedia.org/wiki/Hubble_Ultra-Deep_Field'

# In the list below, enter websites to summarize
# pages_to_summarize = [
#                       'https://speeches.byu.edu/talks/lawrence-e-corbridge/stand-for-ever/', 
#                       'https://en.wikipedia.org/wiki/Machine_learning', 
#                       'https://en.wikipedia.org/wiki/Maui', 
#                       'https://en.wikipedia.org/wiki/Hana_Highway', 
#                       'https://en.wikipedia.org/wiki/Hubble_Ultra-Deep_Field', 
#                       'https://en.wikipedia.org/wiki/Milky_Way'
#                       ]

In [None]:
# import required packages
import requests                 # for making web requests
from bs4 import BeautifulSoup   # for parsing web pages
import re                       # for working with regular expressions to clean text
import nltk                     # for text analysis
import statistics               # for performing basic statistical functions
import heapq                    # for returning "top N" sentences

# get list of stopwords from NLTK
nltk.download('stopwords')

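# get the Punkt sentence tokenizer models from NLTK (used by nltk.sent_tokenize)
# note: recent NLTK releases may ask for 'punkt_tab' instead
nltk.download('punkt')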

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

# Step-by-Step process

## Get all text from the webpage


In [None]:
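webpage = requests.get(page_to_summarize)
# name the parser explicitly so the result does not depend on which parsers are installed
parsed_page = BeautifulSoup(webpage.text, 'html.parser')
# paragraphs = parsed_page.find_all(text=True)      # find every text node, whatever tag it sits in (<div>, <span>, <p>, ...)
paragraphs = parsed_page.find_all('p')              # find_all is the current name for findAll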

article_text = ''

for paragraph in paragraphs:
    article_text += ' ' + paragraph.text

# remove blank spaces before and after the article text
article_text = article_text.strip()

## Remove Wikipedia references from the text
In Wikipedia articles, references are contained in superscripted brackets following a word, like this: _sample reference_ `[3]`
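
As a small illustration of the cleanup this section performs, the sketch below applies the same two substitutions to a made-up sentence (the `sample` text is hypothetical, not taken from the article).

```python
import re

# Made-up sentence with Wikipedia-style numeric reference markers.
sample = "The image contains about 10,000 galaxies.[3] It was released in 2004.[12][13]"

# Replace each bracketed number such as [3] with a space ...
no_refs = re.sub(r"\[[0-9]*\]", " ", sample)

# ... then collapse runs of whitespace and trim the ends.
cleaned = re.sub(r"\s+", " ", no_refs).strip()

print(cleaned)
```

Because the pattern only matches digits, markers such as `[a]` or `[citation needed]` are left in place.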

In [None]:
# Use the re (RegEx) library to replace any references with a single space
# See: https://www.kite.com/python/answers/how-to-use-re.sub()-in-python
# Also: https://docs.python.org/3/library/re.html#regular-expression-syntax

# The 'r' in front of the pattern tells Python to treat this as a raw string,
# so backslash escape sequences (like \n for a new line)
# are passed to the regex engine as ordinary text.
article_text = re.sub(
    pattern = r'\[[0-9]*\]', # or, to also remove lettered references like [a]: pattern = r'\[[a-z0-9]*\]'
    repl = ' ', 
    string = article_text)

# replace the extra spaces with a single space
article_text = re.sub(
    pattern = r'\s+', 
    repl = ' ', 
    string = article_text)

article_text

"Coordinates: 3h 32m 39.0s, −27° 47′ 29.1″ The Hubble Ultra-Deep Field (HUDF) is a deep-field image of a small region of space in the constellation Fornax, containing an estimated 10,000 galaxies. The original data for the image was collected by the Hubble Space Telescope from September 2003 to January 2004. It includes light from galaxies that existed about 13 billion years ago, some 400 to 800 million years after the Big Bang. The HUDF image was taken in a section of the sky with a low density of bright stars in the near-field, allowing much better viewing of dimmer, more distant objects. Located southwest of Orion in the southern-hemisphere constellation Fornax, the rectangular image is 2.4 arcminutes to an edge, or 3.4 arcminutes diagonally. This is about one-tenth of the angular diameter of a full moon viewed from Earth (less than 34 arcminutes), smaller than a 1 mm2 piece of paper held 1 m away, and equal to roughly one twenty-six-millionth of the total area of the sky. The image

## Copy text, then remove special characters and digits to have words only 

In [None]:
# replace non-letter characters with a single space
words_only = re.sub(
    pattern = r'[^a-zA-Z]', 
    repl = ' ', 
    string = article_text)

# collapse runs of whitespace into a single space
words_only = re.sub(
    pattern = r'\s+', 
    repl = ' ', 
    string = words_only)

words_only

'Coordinates h m s The Hubble Ultra Deep Field HUDF is a deep field image of a small region of space in the constellation Fornax containing an estimated galaxies The original data for the image was collected by the Hubble Space Telescope from September to January It includes light from galaxies that existed about billion years ago some to million years after the Big Bang The HUDF image was taken in a section of the sky with a low density of bright stars in the near field allowing much better viewing of dimmer more distant objects Located southwest of Orion in the southern hemisphere constellation Fornax the rectangular image is arcminutes to an edge or arcminutes diagonally This is about one tenth of the angular diameter of a full moon viewed from Earth less than arcminutes smaller than a mm piece of paper held m away and equal to roughly one twenty six millionth of the total area of the sky The image is oriented so that the upper left corner points toward north on the celestial sphere

## Determine weighted frequency of words

In [None]:
# Get list of stopwords; words with little meaning, like "the", "an", "is", etc.
stopwords = nltk.corpus.stopwords.words('english')

# Create a dictionary to hold the words and their frequencies
word_frequencies = {}

# loop through the words and build a dictionary of unique words and their counts
# (split() with no arguments also drops the empty strings left by leading/trailing spaces)
for word in words_only.lower().split():
    # make sure the word isn't a meaningless stopword
    if word not in stopwords:
        # if the word isn't in the frequency dictionary yet, add it
        if word not in word_frequencies:
            word_frequencies[word] = 1
        else:
            word_frequencies[word] += 1

# Alternative form of the loop:
# for word in nltk.word_tokenize(words_only.lower()):

Next steps: implement a TF-IDF (term frequency times inverse document frequency) measure on each term, defining "documents" as sentences or perhaps as paragraphs of text in the page.
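
A minimal sketch of what that next step could look like, treating each sentence as a "document". This is one common unsmoothed TF-IDF variant, not something the notebook implements. The `tf_idf_by_sentence` helper and the sample sentences are hypothetical.

```python
import math
import re
from collections import Counter


def tf_idf_by_sentence(sentences):
    """Return one {word: tf-idf} dict per sentence, treating sentences as documents."""
    tokenized = [re.findall(r"[a-z]+", s.lower()) for s in sentences]
    n_docs = len(tokenized)

    # Document frequency: the number of sentences that contain each word.
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            # tf = share of the sentence's words; idf = log(N / df)
            word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores


sentences = [
    "The field is deep.",
    "The field contains galaxies.",
    "The galaxies are young.",
]
tfidf = tf_idf_by_sentence(sentences)
print(tfidf[0])
```

A word that appears in every sentence (here "the") gets a score of zero, so this weighting dampens stopword-like terms without needing an explicit stopword list.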

In [None]:
# Update the word frequency count to a weighted frequency

# Determine the maximum frequency
max_frequency = max(word_frequencies.values())

# Loop through the dictionary and update the value of each word
# to its weighted frequency (that is, word_count ÷ max_frequency)
for word in word_frequencies.keys():
    word_frequencies[word] = (word_frequencies[word] / max_frequency)

Sort the dictionary of word frequencies from most to least frequent. This way, we can identify the top words and add extra points to sentences that begin with one of them.

Note here: We can choose an arbitrary "top _n_" number of words to use, or we can calculate breaks that determine what the top words are. Here are a few methods:
* [pandas `qcut`](https://pbpython.com/pandas-qcut-cut.html): divide the data into bins with the same number of data points. For example, use 20 bins and take the top one for the top 5% of words
* Python's built-in `statistics` module has a [`quantiles()`](https://docs.python.org/3/library/statistics.html#statistics.quantiles) function that computes similar cut points (though it is not as full-featured as pandas `qcut`)
* Use the [Jenks natural breaks algorithm](https://pbpython.com/natural-breaks.html) to find natural breaks in the data and identify the top group of words. This technique produces breaks that appear intuitive: for example, if the 7th-most used word has a weighted frequency of 0.5 and the 8th-most frequent word's frequency is 0.38, with the next 10 words at a similar frequency, we could determine that the top group would be the 7 most-used words.
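
As a rough illustration of the quantile option, the sketch below uses `statistics.quantiles` with 20 bins and keeps the words above the last cut point. The `weights` dictionary is hypothetical data, not the article's frequencies.

```python
import statistics

# Hypothetical weighted frequencies for 40 words: 1/1, 1/2, ..., 1/40.
weights = {f"word{i}": 1 / i for i in range(1, 41)}

# n=20 returns 19 cut points; the last one approximates the 95th percentile.
cut_points = statistics.quantiles(list(weights.values()), n=20)
threshold = cut_points[-1]

# Keep the words whose weight lies above the top cut point (roughly the top 5%).
top_words = [word for word, weight in weights.items() if weight > threshold]
print(top_words)
```

With 40 words, the top 5% is two words. On real data with many tied frequencies, the group can be larger or smaller than exactly 5%.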

In [None]:
sorted_word_freq = sorted(word_frequencies.items(), key = lambda item: item[1], reverse=True)

In [None]:
# Use the top 10 words when there are 200 or more unique words;
# otherwise use 5% of the number of unique words (but at least 1 word)
if len(sorted_word_freq) < 200:
    top_word_count = max(1, int(len(sorted_word_freq) * .05))
else:
    top_word_count = 10

top_words = [word[0] for word in sorted_word_freq[:top_word_count]]

print(top_word_count)
print(top_words)

10
['field', 'galaxies', 'deep', 'image', 'years', 'hubble', 'hudf', 'redshifts', 'acs', 'ultra']


In [None]:
# Using natural breaks

!pip install jenkspy
import jenkspy

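# jenkspy expects a list or array of values; since jenkspy 0.3 the keyword is n_classes
# (earlier releases called it nb_class)
breaks = jenkspy.jenks_breaks(list(word_frequencies.values()), n_classes=2)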
print(breaks)

# keep the words whose weighted frequency falls in the upper class
top_word_freq = [word for word, freq in sorted_word_freq if freq >= breaks[1]]
top_word_freq

[0.03571428571428571, 0.35714285714285715, 1.0]


['field', 'galaxies', 'deep', 'image', 'years', 'hubble', 'hudf', 'redshifts']
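
To show what a natural break optimizes, here is a standard-library sketch of the two-class case: it tries every split of the sorted values and keeps the one with the smallest total within-group squared deviation, which is the objective Jenks natural breaks minimizes. The `two_class_break` helper is hypothetical and is not how `jenkspy` is implemented. The input is only the rounded top-ten weights from this notebook's `sorted_word_freq[:10]` output.

```python
def two_class_break(values):
    """Split sorted values into two groups with the smallest total squared deviation."""
    ordered = sorted(values)

    def squared_deviation(group):
        mean = sum(group) / len(group)
        return sum((x - mean) ** 2 for x in group)

    # Try every split point and keep the one with the tightest two groups.
    best_index = min(
        range(1, len(ordered)),
        key=lambda i: squared_deviation(ordered[:i]) + squared_deviation(ordered[i:]),
    )
    # The lowest value of the upper group acts as the break.
    return ordered[best_index]


# Rounded weights of the ten most frequent words only (not the full vocabulary).
top_ten = {
    "field": 1.0, "galaxies": 0.96, "deep": 0.75, "image": 0.57, "years": 0.54,
    "hubble": 0.5, "hudf": 0.5, "redshifts": 0.36, "acs": 0.32, "ultra": 0.29,
}
break_value = two_class_break(top_ten.values())
upper_group = [word for word, weight in top_ten.items() if weight >= break_value]
print(break_value, upper_group)
```

On these ten values the break lands at 0.75, while the notebook's break over the full vocabulary is about 0.357. The many low-frequency words in the full data pull the lower class down, so more words qualify for the upper class.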

In [None]:
sorted_word_freq[:10]

[('field', 1.0),
 ('galaxies', 0.9642857142857143),
 ('deep', 0.75),
 ('image', 0.5714285714285714),
 ('years', 0.5357142857142857),
 ('hubble', 0.5),
 ('hudf', 0.5),
 ('redshifts', 0.35714285714285715),
 ('acs', 0.32142857142857145),
 ('ultra', 0.2857142857142857)]

## Split text into sentences (i.e., tokenize by sentence)

In [None]:
# Use the full article text (with references removed, but punctuation and numbers intact)
# to generate a list of the sentences in the article
sentence_list = nltk.sent_tokenize(article_text)

## Calculate sentence scores
These scores are based on the sum of the weighted frequencies of the words in the sentence. 

Note that stopwords, numbers, and special characters have a weighted frequency of 0 since they weren't included in the distinct word list used to calculate the weighted frequencies.
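
A worked example of the scoring rule may help here. The sketch below uses hypothetical weights and a hypothetical `score_sentence` helper. It follows the same idea as the notebook: sum the weights of the scored words and double a top word when it is among the first five scored words. It uses a plain `split()` where the notebook uses `nltk.word_tokenize`.

```python
# Hypothetical weighted frequencies and top-word list, for illustration only.
word_weights = {"field": 1.0, "galaxies": 0.9, "deep": 0.75, "dwarf": 0.1}
top_words = {"field", "galaxies"}


def score_sentence(sentence, bonus_window=5):
    """Sum the weights of scored words, doubling top words that appear early."""
    score = 0.0
    scored_seen = 0  # counts only words that have a weight
    for word in sentence.lower().replace(".", "").split():
        if word not in word_weights:
            continue  # stopwords, numbers, and symbols contribute nothing
        multiplier = 2 if scored_seen < bonus_window and word in top_words else 1
        score += word_weights[word] * multiplier
        scored_seen += 1
    return score


first = score_sentence("Galaxies in the deep field are 13 billion years old.")
second = score_sentence("The 10,000 objects are mostly dwarf galaxies.")
print(first, second)
```

The first sentence scores about 4.55 (1.8 + 0.75 + 2.0) and the second about 1.9 (0.1 + 1.8). Words without a weight, such as "13" or "the", are skipped entirely.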

In [None]:
# Set a maximum length of the sentences allowed to be used in the summary.
# Note that sentences with more words could be scored higher simply
# by having more words. This attempts to reduce that effect.

# Create a list of the word counts of each sentence
sentence_word_counts = [len(sent.split(' ')) for sent in sentence_list]

# Find the maximum, average, and median sentence lengths
max_word_count = max(sentence_word_counts)
avg_word_count = sum(sentence_word_counts) / len(sentence_list)
median_word_count = statistics.median(sentence_word_counts)

# Find the standard deviation of sentence length
stdev_word_count = statistics.stdev(sentence_word_counts)

# Set the maximum summary sentence length to the median sentence length
# max_summary_sentence_length = median_word_count

# Set the maximum summary sentence length to the avg sentence length
# max_summary_sentence_length = avg_word_count

# Set the maximum summary sentence length to 1 stdev above the mean
max_summary_sentence_length = int(avg_word_count + stdev_word_count)
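
To see the effect of the "mean plus one standard deviation" cutoff, here is a small sketch on hypothetical sentence lengths (the numbers are made up):

```python
import statistics

# Hypothetical word counts for eight sentences.
sentence_word_counts = [8, 12, 15, 18, 20, 22, 25, 60]

mean_length = statistics.mean(sentence_word_counts)
stdev_length = statistics.stdev(sentence_word_counts)  # sample standard deviation

# Sentences longer than mean + 1 standard deviation are excluded from scoring.
max_length = int(mean_length + stdev_length)
kept = [count for count in sentence_word_counts if count <= max_length]
print(max_length, kept)
```

Only the 60-word outlier is dropped. Using the median (19 here) as the cutoff instead would exclude half of the sentences.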

In [None]:
# create a dictionary of the sentence scores
sentence_scores = {}
score_multiplier = 1
# loop through the sentences and score them
for sent in sentence_list:
    # skip sentences longer than the maximum word count for summary sentences
    # if len(nltk.word_tokenize(sent)) > 30: #max_summary_sentence_length:
    #     continue
    if len(sent.split(' ')) > max_summary_sentence_length:
        continue
    
    i = 0       # number of scored words seen so far in this sentence

    for word in nltk.word_tokenize(sent.lower()):
        # skip words without a weighted frequency (stopwords, numbers, punctuation)
        if word not in word_frequencies:
            continue

        # double the weight of a top word that is among the first 5 scored words
        if (i < 5) and (word in top_words):
            score_multiplier = 2
        else:
            score_multiplier = 1
        i += 1

        # add the word's (possibly doubled) weight to the sentence's running score
        sentence_scores[sent] = sentence_scores.get(sent, 0) + word_frequencies[word] * score_multiplier


## Sort the sentences in descending order by sentence score

In [None]:
# Sort the sentence scores dictionary in descending order by word frequency score
sorted_sentences = sorted(sentence_scores.items(), key = lambda item: item[1], reverse=True)

In [None]:
# Alternate version using the heapq library
sorted_sentences_v2 = heapq.nlargest(summary_lenth, sentence_scores, key=sentence_scores.get)
article_summary_v2 = ' '.join(sorted_sentences_v2)

article_summary_v2

'The Hubble eXtreme Deep Field (HXDF), released on September 25, 2012, is an image of a portion of space in the center of the Hubble Ultra Deep Field image. In the years since the original Hubble Deep Field, the Hubble Deep Field South and the GOODS sample were analyzed, providing increased statistics at the high redshifts probed by the HDF. Coordinates: 3h 32m 39.0s, −27° 47′ 29.1″ The Hubble Ultra-Deep Field (HUDF) is a deep-field image of a small region of space in the constellation Fornax, containing an estimated 10,000 galaxies. Many of the smaller galaxies in the image are very young galaxies that eventually developed into major galaxies, similar to the Milky Way and other galaxies in our galactic neighborhood. On January 23, 2019, the Instituto de Astrofísica de Canarias released an even deeper version of the infrared images of the Hubble Ultra Deep Field obtained with the WFC3 instrument, named the ABYSS Hubble Ultra Deep Field. Galaxies at high redshifts have been confirmed to
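
The two approaches (full sort versus `heapq.nlargest`) should select the same sentences. A small sketch with hypothetical scores shows the equivalence:

```python
import heapq

# Hypothetical sentence scores.
sentence_scores = {
    "Sentence A.": 2.5,
    "Sentence B.": 4.0,
    "Sentence C.": 1.0,
    "Sentence D.": 3.2,
}
n = 2

# Full sort, then slice: yields (sentence, score) pairs, so keep only the sentence.
by_sort = [
    sent
    for sent, _ in sorted(sentence_scores.items(), key=lambda item: item[1], reverse=True)[:n]
]

# heapq.nlargest iterates over the dictionary's keys and ranks them by score.
by_heap = heapq.nlargest(n, sentence_scores, key=sentence_scores.get)

print(by_sort == by_heap, by_heap)
```

`heapq.nlargest` avoids sorting every sentence when only a few are needed, which matters little at this scale but keeps the intent ("top N") explicit.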

## Create the summary paragraph

In [None]:
article_summary = ''

for sent in sorted_sentences[0:summary_lenth]:
    article_summary += ' ' + sent[0]

# replace newlines with spaces, then remove whitespace from the beginning and end of the article summary
article_summary = article_summary.replace('\n', ' ')
article_summary = article_summary.strip()

In [None]:
article_summary

'The Hubble eXtreme Deep Field (HXDF), released on September 25, 2012, is an image of a portion of space in the center of the Hubble Ultra Deep Field image. In the years since the original Hubble Deep Field, the Hubble Deep Field South and the GOODS sample were analyzed, providing increased statistics at the high redshifts probed by the HDF. Coordinates: 3h 32m 39.0s, −27° 47′ 29.1″ The Hubble Ultra-Deep Field (HUDF) is a deep-field image of a small region of space in the constellation Fornax, containing an estimated 10,000 galaxies. Many of the smaller galaxies in the image are very young galaxies that eventually developed into major galaxies, similar to the Milky Way and other galaxies in our galactic neighborhood. On January 23, 2019, the Instituto de Astrofísica de Canarias released an even deeper version of the infrared images of the Hubble Ultra Deep Field obtained with the WFC3 instrument, named the ABYSS Hubble Ultra Deep Field. Galaxies at high redshifts have been confirmed to

In [None]:
print(article_summary)

The Hubble eXtreme Deep Field (HXDF), released on September 25, 2012, is an image of a portion of space in the center of the Hubble Ultra Deep Field image. In the years since the original Hubble Deep Field, the Hubble Deep Field South and the GOODS sample were analyzed, providing increased statistics at the high redshifts probed by the HDF. Coordinates: 3h 32m 39.0s, −27° 47′ 29.1″ The Hubble Ultra-Deep Field (HUDF) is a deep-field image of a small region of space in the constellation Fornax, containing an estimated 10,000 galaxies. Many of the smaller galaxies in the image are very young galaxies that eventually developed into major galaxies, similar to the Milky Way and other galaxies in our galactic neighborhood. On January 23, 2019, the Instituto de Astrofísica de Canarias released an even deeper version of the infrared images of the Hubble Ultra Deep Field obtained with the WFC3 instrument, named the ABYSS Hubble Ultra Deep Field. Galaxies at high redshifts have been confirmed to 

# Single-function process