<a href="https://colab.research.google.com/github/royam0820/DL/blob/master/A_Simple_Text_Summarization_Example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text Summarization Process

Ref.: https://blog.floydhub.com/gentle-introduction-to-text-summarization-in-machine-learning/

Let's create a text summarizer that can shorten the information found in a lengthy web article. To keep things simple, apart from Python’s NLTK toolkit, we’ll not use any other machine learning library.


Here is the blueprint for this process:

```
# Creating a dictionary for the word frequency table
frequency_table = _create_dictionary_table(article)

# Tokenizing the sentences
sentences = sent_tokenize(article)

# Algorithm for scoring a sentence by its words
sentence_scores = _calculate_sentence_scores(sentences, frequency_table)

# Getting the threshold
threshold = _calculate_average_score(sentence_scores)

# Producing the summary
article_summary = _get_article_summary(sentences, sentence_scores, 1.5 * threshold)

print(article_summary)
```



## Importing Libraries

In [7]:
#importing libraries
import nltk
nltk.download("stopwords")
nltk.download('punkt')
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize, sent_tokenize
import bs4 as BeautifulSoup
import urllib.request  

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## Import a text using the Beautiful Soup Library


In [0]:
# Fetching the content from the URL
fetched_data = urllib.request.urlopen('https://en.wikipedia.org/wiki/20th_century')

article_read = fetched_data.read()

# Parsing the URL content and storing in a variable
article_parsed = BeautifulSoup.BeautifulSoup(article_read,'html.parser')

# Returning <p> tags
paragraphs = article_parsed.find_all('p')

article_content = ''

# Looping through the paragraphs and adding them to the variable
for p in paragraphs:  
    article_content += p.text
    

In [9]:
print(article_content)

The 20th (twentieth) century was a century that began on 
January 1, 1901[1] and ended on December 31, 2000.[2] It was the tenth and final century of the 2nd millennium. It is distinct from the century known as the 1900s which began on January 1, 1900 and ended on December 31, 1999.
The 20th century was dominated by a chain of events that heralded significant changes in world history as to redefine the era: flu pandemic, World War I and World War II, nuclear power and space exploration, nationalism and decolonization, the Cold War and post-Cold War conflicts; intergovernmental organizations and cultural homogenization through developments in emerging transportation and communications technology; poverty reduction and world population growth, awareness of environmental degradation, ecological extinction;[3][4] and the birth of the Digital Revolution. It saw great advances in communication and medical technology that by the late 1980s allowed for near-instantaneous worldwide computer com

NOTE: BeautifulSoup converts the incoming text to Unicode and the outgoing text to UTF-8, saving you the hassle of managing different charset encodings while scraping text from the web.

We’re using the `urlopen` function from the `urllib.request` module to open the web page. Then, we use the `read` method to read the raw content of the response. To parse the data, we create a `BeautifulSoup` object and pass it two arguments: `article_read` and the parser name `'html.parser'`.

The `find_all` method returns all the `<p>` elements present in the HTML. Using `.text` then lets us select only the text found within those `<p>` elements.
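
To make the "collect every `<p>` element and keep only its text" step concrete without a network call, here is a minimal sketch that uses Python's standard-library `html.parser` as a stand-in for Beautiful Soup. The `ParagraphCollector` class and the sample HTML are hypothetical and written only for illustration; this is not how Beautiful Soup works internally.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collects the text inside each <p> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []   # finished paragraph texts
        self._buffer = None    # text pieces of the <p> currently open, if any

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p" and self._buffer is not None:
            self.paragraphs.append("".join(self._buffer))
            self._buffer = None

    def handle_data(self, data):
        # Text nested in inline tags such as <b> still belongs to the open <p>
        if self._buffer is not None:
            self._buffer.append(data)


sample_html = (
    "<html><body><h1>Title</h1>"
    "<p>First <b>bold</b> paragraph.</p>"
    "<div>Not a paragraph</div>"
    "<p>Second paragraph.</p>"
    "</body></html>"
)

collector = ParagraphCollector()
collector.feed(sample_html)
collector.close()

# Joining with a space keeps the last word of one paragraph
# from fusing with the first word of the next
article_text = " ".join(collector.paragraphs)
print(article_text)
```

The heading and the `<div>` are ignored, while the text inside the nested `<b>` tag is kept as part of its paragraph.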





## Processing the data
To ensure the scraped text is as noise-free as possible, we’ll perform some **basic text cleaning**. To help with this, we’ll import a list of `stopwords` from the `nltk` library.

We are also importing `PorterStemmer`, an algorithm for reducing words to their root forms. For example, *cleaning*, *cleaned*, and *cleans* can all be reduced to the root *clean*.

Then we will create a dictionary that stores how often each word occurs in the text. We’ll loop through the words of the text and skip any stop words.

For each remaining word, we’ll check whether it is already present in `frequency_table`. If it is, its count is incremented by 1. Otherwise, the word is being seen for the first time, and its count is set to 1.

In [0]:
def _create_dictionary_table(text_string) -> dict:
   
    #removing stop words
    stop_words = set(stopwords.words("english"))
    
    words = word_tokenize(text_string)
    
    #reducing words to their root form
    stem = PorterStemmer()
    
    #creating dictionary for the word frequency table
    frequency_table = dict()
    for wd in words:
        wd = stem.stem(wd)
        if wd in stop_words:
            continue
        if wd in frequency_table:
            frequency_table[wd] += 1
        else:
            frequency_table[wd] = 1

    return frequency_table
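
The counting logic can also be tried in isolation. The following is a minimal sketch that assumes a tiny hand-written stop word set and a regular-expression tokenizer instead of NLTK, and it skips stemming entirely; `STOP_WORDS` and `build_frequency_table` are hypothetical names used only for this illustration.

```python
import re
from collections import Counter

# Hypothetical mini stop word list; NLTK's English list is much longer
STOP_WORDS = {"the", "a", "is", "of", "and", "on"}


def build_frequency_table(text):
    """Counts every lowercase word that is not a stop word."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(word for word in words if word not in STOP_WORDS)


table = build_frequency_table("The cat sat on the mat. The cat is a cat.")
print(table)
```

Lowercasing before the stop word check means capitalized forms such as "The" are filtered too, and the regular expression drops punctuation so that tokens like `,` and `.` never enter the table.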

In [11]:
_create_dictionary_table(article_content)


{'%': 1,
 "''": 3,
 "'s": 21,
 '(': 10,
 ')': 10,
 ',': 266,
 '.': 115,
 '/': 2,
 '0.15-0.20°c': 1,
 '0.8°': 1,
 '1': 4,
 '1.4°': 1,
 '1.6': 1,
 '100': 1,
 '11': 1,
 '12': 1,
 '13': 1,
 '14': 1,
 '15': 2,
 '16': 1,
 '17': 1,
 '18': 1,
 '1804': 1,
 '1880': 1,
 '18th': 1,
 '19': 1,
 '1900': 2,
 '1901': 2,
 '1905': 1,
 '1914': 1,
 '1920': 2,
 '1927': 2,
 '1929': 1,
 '1930': 2,
 '1945': 3,
 '1975': 1,
 '1976': 1,
 '1980': 2,
 '1991': 1,
 '1999': 2,
 '19th': 3,
 '2': 2,
 '20': 1,
 '2000': 1,
 '20th': 18,
 '24': 1,
 '262,000,000': 1,
 '28': 1,
 '29': 1,
 '2nd': 1,
 '3': 1,
 '30': 1,
 '31': 2,
 '35': 1,
 '4': 1,
 '40+': 1,
 '5': 1,
 '50': 1,
 '6': 2,
 '6.1': 1,
 '60': 2,
 '65': 1,
 '7': 1,
 '70': 1,
 '70+': 1,
 '8': 1,
 '80': 2,
 '9': 1,
 ':': 4,
 ';': 10,
 'An': 1,
 'At': 2,
 'By': 1,
 'I': 4,
 'II': 8,
 'In': 6,
 'It': 7,
 'US': 5,
 '[': 23,
 ']': 23,
 '``': 5,
 'aaron': 1,
 'abba': 2,
 'abdel': 1,
 'abhorr': 1,
 'abstract': 1,
 'academi': 1,
 'acceler': 2,
 'accord': 1,
 'achiev': 1,
 'ack

## Tokenizing the article into sentences

To split `article_content` into a list of sentences, we’ll use the `sent_tokenize` function from the `nltk` library.

In [0]:
from nltk.tokenize import word_tokenize, sent_tokenize

sentences = sent_tokenize(article_content)

In [83]:
print(sentences)

['The 20th (twentieth) century was a century that began on \nJanuary 1, 1901[1] and ended on December 31, 2000.', '[2] It was the tenth and final century of the 2nd millennium.', 'It is distinct from the century known as the 1900s which began on January 1, 1900 and ended on December 31, 1999.', 'The 20th century was dominated by a chain of events that heralded significant changes in world history as to redefine the era: flu pandemic, World War I and World War II, nuclear power and space exploration, nationalism and decolonization, the Cold War and post-Cold War conflicts; intergovernmental organizations and cultural homogenization through developments in emerging transportation and communications technology; poverty reduction and world population growth, awareness of environmental degradation, ecological extinction;[3][4] and the birth of the Digital Revolution.', 'It saw great advances in communication and medical technology that by the late 1980s allowed for near-instantaneous worldw
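
The output above shows Wikipedia citation markers such as `[1]` and `[2]` surviving in the text, and one of them ends up at the start of a sentence. As an optional cleaning step, here is a small sketch that strips such markers before tokenizing. The helper name `strip_citation_markers` is hypothetical, and the pattern assumes the markers are purely numeric.

```python
import re

# Matches bracketed numeric markers such as [1] or [23]
CITATION_MARKER = re.compile(r"\[\d+\]")


def strip_citation_markers(text):
    """Removes bracketed numeric citation markers from the text."""
    return CITATION_MARKER.sub("", text)


sample = "and ended on December 31, 2000.[2] It was the tenth and final century"
cleaned = strip_citation_markers(sample)
print(cleaned)
```

Applied to `article_content` before calling `sent_tokenize`, this would keep the markers out of both the sentences and the frequency table.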

## Finding the weighted frequencies
To evaluate the score for every sentence in the text, we’ll be analyzing the frequency of occurrence of each term. In this case, we’ll be scoring each sentence by its words; that is, adding the frequency of each important word found in the sentence.

In [0]:
def _calculate_sentence_scores(sentences, frequency_table) -> dict:

    #algorithm for scoring a sentence by its words
    sentence_weight = dict()

    for sentence in sentences:
        #sentences are keyed by their first 7 characters
        sentence_key = sentence[:7]
        sentence_lower = sentence.lower()
        matched_terms = 0
        for term in frequency_table:
            #substring check: a term counts once if it appears anywhere in the sentence
            if term in sentence_lower:
                matched_terms += 1
                if sentence_key in sentence_weight:
                    sentence_weight[sentence_key] += frequency_table[term]
                else:
                    sentence_weight[sentence_key] = frequency_table[term]

        #a sentence that matches no term is skipped (avoids KeyError and division by zero)
        if matched_terms:
            sentence_weight[sentence_key] = sentence_weight[sentence_key] / matched_terms

    return sentence_weight

NOTE: To ensure long sentences do not get unnecessarily high scores compared with short ones, we divide each sentence’s score by the number of frequency-table terms found in that sentence.

Also, to keep the dictionary keys short, we use `sentence[:7]`, the first 7 characters of each sentence, as the key. This choice is arbitrary. Two sentences that share the same first 7 characters (for example, two sentences that both begin with "The 20t") end up under the same key, so for longer documents it’s better to use a hash of the full sentence or the sentence’s index to avoid such collisions.
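
One way to sidestep the key collisions mentioned above is to key the scores by sentence index. The following is a minimal sketch under that assumption, not the notebook's implementation: `score_sentences_by_index` is a hypothetical name, it uses a regular-expression tokenizer instead of NLTK, and it matches whole tokens (counting repeated words each time) rather than substrings.

```python
import re


def score_sentences_by_index(sentences, frequency_table):
    """Returns {sentence index: average frequency of its known words}."""
    scores = {}
    for index, sentence in enumerate(sentences):
        tokens = re.findall(r"[a-z0-9']+", sentence.lower())
        matched = [token for token in tokens if token in frequency_table]
        if not matched:
            # No known word in this sentence, so it gets no score
            continue
        scores[index] = sum(frequency_table[token] for token in matched) / len(matched)
    return scores


frequency_table = {"cat": 3, "sat": 1, "mat": 1}
sentences = ["The cat sat on the mat.", "The cat is a cat.", "The end."]
scores = score_sentences_by_index(sentences, frequency_table)
print(scores)
```

The first two sample sentences share the same first 7 characters ("The cat"), so they would collide under a `sentence[:7]` key, while index keys keep their scores separate.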

## Calculating the threshold
To decide which sentences are eligible for the summary, we’ll compute the average score of the sentences. Using this average as a threshold, we can avoid selecting sentences that score below it. When producing the summary, the code passes `1.5 * threshold`, so only sentences scoring at least 1.5 times the average are kept.



In [0]:
def _calculate_average_score(sentence_weight) -> float:
   
    #calculating the average score for the sentences
    sum_values = 0
    for entry in sentence_weight:
        sum_values += sentence_weight[entry]

    #getting sentence average value from source text
    average_score = (sum_values / len(sentence_weight))

    return average_score
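
As a small worked example of the arithmetic, here is a sketch with made-up scores. The values in `sentence_scores` are hypothetical, and `statistics.mean` stands in for the manual loop.

```python
from statistics import mean

# Hypothetical sentence scores, keyed by sentence index
sentence_scores = {0: 2.0, 1: 4.0, 2: 6.0}

average_score = mean(sentence_scores.values())   # (2 + 4 + 6) / 3
threshold = 1.5 * average_score                  # stricter cut-off used for the summary

# Only sentences scoring at least the threshold survive
selected = [key for key, score in sentence_scores.items() if score >= threshold]
print(average_score, threshold, selected)
```

Raising the multiplier shortens the summary, and lowering it toward 1.0 lets more sentences through.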

## Getting the Summary
Lastly, since we have all the required parameters, we can now generate a summary for the article by keeping, in their original order, the sentences whose score meets or exceeds the threshold.

In [0]:
def _get_article_summary(sentences, sentence_weight, threshold):
    sentence_counter = 0
    article_summary = ''

    for sentence in sentences:
        if sentence[:7] in sentence_weight and sentence_weight[sentence[:7]] >= (threshold):
            article_summary += " " + sentence
            sentence_counter += 1

    return article_summary
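
The selection step can be sketched on its own with a few made-up sentences. This version assumes scores keyed by sentence index rather than by `sentence[:7]`, and `select_summary` is a hypothetical name; it joins the chosen sentences with single spaces, so the result has no leading space.

```python
def select_summary(sentences, scores, threshold):
    """Joins, in original order, the sentences whose score meets the threshold."""
    chosen = [
        sentence
        for index, sentence in enumerate(sentences)
        if index in scores and scores[index] >= threshold
    ]
    return " ".join(chosen)


sentences = ["Cats sleep a lot.", "Dogs bark.", "Cats and dogs can be friends."]
scores = {0: 2.0, 1: 1.0, 2: 3.0}   # hypothetical scores
summary = select_summary(sentences, scores, threshold=2.0)
print(summary)
```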

## Producing a text summary


In [0]:
def _run_article_summary(article):
    
    #creating a dictionary for the word frequency table
    frequency_table = _create_dictionary_table(article)

    #tokenizing the sentences
    sentences = sent_tokenize(article)

    #algorithm for scoring a sentence by its words
    sentence_scores = _calculate_sentence_scores(sentences, frequency_table)

    #getting the threshold
    threshold = _calculate_average_score(sentence_scores)

    #producing the summary
    article_summary = _get_article_summary(sentences, sentence_scores, 1.5 * threshold)

    return article_summary

In [18]:
summary_results = _run_article_summary(article_content)
print(summary_results)

 Terms like ideology, world war, genocide, and nuclear war entered common usage. Humans explored space for the first time, taking their first footsteps on the Moon. However, these same wars resulted in the destruction of the imperial system. The victorious Bolsheviks then established the Soviet Union, the world's first communist state. In total, World War II left some 60 million people dead. During the century, the social taboo of sexism fell. Communications and information technology, transportation technology, and medical advances had radically altered daily lives. Since the US was in a dominant position, a major part of the process was Americanization. Terrorism, dictatorship, and the spread of nuclear weapons were pressing global issues. Millions were infected with HIV, the virus which causes AIDS. This includes deaths caused by wars, genocide, politicide and mass murders. Later in the 20th century, the development of computers led to the establishment of a theory of computation.





## Conclusion

Here is the workflow for text summarization:

- create the word frequency table
- tokenize the article into sentences
- score each sentence by its words
- calculate the score threshold
- create the summary

## The full code

In [89]:
#importing libraries
import nltk
nltk.download("stopwords")
nltk.download('punkt')
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize, sent_tokenize
import bs4 as BeautifulSoup
import urllib.request  

#fetching the content from the URL
fetched_data = urllib.request.urlopen('https://en.wikipedia.org/wiki/20th_century')

article_read = fetched_data.read()

#parsing the URL content and storing in a variable
article_parsed = BeautifulSoup.BeautifulSoup(article_read,'html.parser')

#returning <p> tags
paragraphs = article_parsed.find_all('p')

article_content = ''

#looping through the paragraphs and adding them to the variable
for p in paragraphs:  
    article_content += p.text


def _create_dictionary_table(text_string) -> dict:
   
    #removing stop words
    stop_words = set(stopwords.words("english"))
    
    words = word_tokenize(text_string)
    
    #reducing words to their root form
    stem = PorterStemmer()
    
    #creating dictionary for the word frequency table
    frequency_table = dict()
    for wd in words:
        wd = stem.stem(wd)
        if wd in stop_words:
            continue
        if wd in frequency_table:
            frequency_table[wd] += 1
        else:
            frequency_table[wd] = 1

    return frequency_table


def _calculate_sentence_scores(sentences, frequency_table) -> dict:

    #algorithm for scoring a sentence by its words
    sentence_weight = dict()

    for sentence in sentences:
        #sentences are keyed by their first 7 characters
        sentence_key = sentence[:7]
        sentence_lower = sentence.lower()
        matched_terms = 0
        for term in frequency_table:
            #substring check: a term counts once if it appears anywhere in the sentence
            if term in sentence_lower:
                matched_terms += 1
                if sentence_key in sentence_weight:
                    sentence_weight[sentence_key] += frequency_table[term]
                else:
                    sentence_weight[sentence_key] = frequency_table[term]

        #a sentence that matches no term is skipped (avoids KeyError and division by zero)
        if matched_terms:
            sentence_weight[sentence_key] = sentence_weight[sentence_key] / matched_terms

    return sentence_weight

def _calculate_average_score(sentence_weight) -> float:
   
    #calculating the average score for the sentences
    sum_values = 0
    for entry in sentence_weight:
        sum_values += sentence_weight[entry]

    #getting sentence average value from source text
    average_score = (sum_values / len(sentence_weight))

    return average_score

def _get_article_summary(sentences, sentence_weight, threshold):
    sentence_counter = 0
    article_summary = ''

    for sentence in sentences:
        if sentence[:7] in sentence_weight and sentence_weight[sentence[:7]] >= (threshold):
            article_summary += " " + sentence
            sentence_counter += 1

    return article_summary

def _run_article_summary(article):
    
    #creating a dictionary for the word frequency table
    frequency_table = _create_dictionary_table(article)

    #tokenizing the sentences
    sentences = sent_tokenize(article)

    #algorithm for scoring a sentence by its words
    sentence_scores = _calculate_sentence_scores(sentences, frequency_table)

    #getting the threshold
    threshold = _calculate_average_score(sentence_scores)

    #producing the summary
    article_summary = _get_article_summary(sentences, sentence_scores, 1.5 * threshold)

    return article_summary

if __name__ == '__main__':
    summary_results = _run_article_summary(article_content)
    print(summary_results)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


TypeError: ignored