# Text Summarization

This notebook walks through text summarization of the Wikipedia page for Natural Language Processing. The approach here is extractive rather than abstractive: the summary is assembled from sentences that already appear in the text. I used this [Medium](https://becominghuman.ai/text-summarization-in-5-steps-using-nltk-65b21e352b65) article as a resource.

In [1]:
# import libraries
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize, sent_tokenize

The first step in summarizing text is getting the text. For this example we will extract the text from a webpage using Beautiful Soup, a popular web scraping library.

In [23]:
import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/Natural_language_processing'
res = requests.get(url)
html_page = res.content
soup = BeautifulSoup(html_page, 'html.parser')
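# collect every text node on the page (`string=` is the current name of the older `text=` argument)
text = soup.find_all(string=True)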

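output = ''
# parent tags whose text nodes are markup or metadata rather than page content
blacklist = [
    '[document]',
    'noscript',
    'header',
    'html',
    'meta',
    'head',
    'input',
    'script',
    'style',
]

# keep only text nodes whose parent tag is not blacklisted
for t in text:
    if t.parent.name not in blacklist:
        output += '{} '.format(t)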

print(output)

Natural language processing - Wikipedia 
 
 
 
 
  CentralNotice  
 
 
 Natural language processing 
 
 From Wikipedia, the free encyclopedia 
 
 
 Jump to navigation 
 Jump to search 
 Not to be confused with  Neuro-linguistic programming . 
 This article is about language processing by computers. For the processing of language by the human brain, see  Language processing in the brain . 
   An  automated online assistant  providing  customer service  on a web page, an example of an application where natural language processing is a major component. [1] 
 Natural language processing  ( NLP ) is a subfield of  linguistics ,  computer science ,  information engineering , and  artificial intelligence  concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of  natural language  data.
 Challenges in natural language processing frequently involve  speech recognition ,  natural language under

In [24]:
text_str = output

### Frequency Table

The extractive method of text summarization does exactly what its name says: it extracts the most important sentences from a corpus and puts them together into a summary. To determine which sentences are the most important, we first need to know which words occur most often, which is why we build a frequency table.

In [16]:
# create frequency table
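def _create_frequency_table(text_string) -> dict:
    """Map each stemmed, non-stop word in the text to its number of occurrences."""
    stopWords = set(stopwords.words("english"))
    words = word_tokenize(text_string)
    ps = PorterStemmer()

    freqTable = dict()
    for word in words:
        # filter stop words before stemming: a stem such as "thi" (from "this")
        # would no longer match the stop-word list
        if word.lower() in stopWords:
            continue
        word = ps.stem(word)
        if word in freqTable:
            freqTable[word] += 1
        else:
            freqTable[word] = 1

    return freqTable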
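
To make the idea concrete, here is a minimal sketch of a frequency table on a toy string, using only the standard library. It is not the notebook's implementation: `toy_frequency_table` and `TOY_STOP_WORDS` are hypothetical names, the stop-word list is a tiny hand-picked stand-in for NLTK's, tokens are split with a simple regular expression, and no stemming is applied.

```python
import re
from collections import Counter

# Hypothetical, hand-picked stop words for illustration only;
# the notebook itself uses NLTK's English stop-word list.
TOY_STOP_WORDS = {"the", "a", "is", "of", "and", "to", "on"}


def toy_frequency_table(text: str) -> Counter:
    """Count lower-cased tokens that are not stop words (no stemming)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(token for token in tokens if token not in TOY_STOP_WORDS)


sample = "The cat sat on the mat. The cat is a happy cat."
table = toy_frequency_table(sample)
print(table["cat"], table["sat"])  # prints: 3 1
```

The word "cat" dominates the table, so sentences that mention it will tend to score higher in the next step.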

### Score Sentences

Next, we score each sentence so we know which ones are the most important. One simple way to do this is to add up the frequency-table values of the non-stop words found in the sentence, then divide by the number of those words so that long sentences are not favored just for being long.

In [17]:
def _score_sentences(sentences, freqTable) -> dict:
    """Score each sentence by the average frequency of the table words it contains."""
    sentenceValue = dict()

    for sentence in sentences:
        # sentences are keyed by their first 10 characters
        key = sentence[:10]
        matched_word_count = 0
        for wordValue in freqTable:
            # substring match, so the stem "process" also matches "processing"
            if wordValue in sentence.lower():
                matched_word_count += 1
                if key in sentenceValue:
                    sentenceValue[key] += freqTable[wordValue]
                else:
                    sentenceValue[key] = freqTable[wordValue]

        # normalize by the number of matched words (guarding against division by zero)
        if key in sentenceValue and matched_word_count > 0:
            sentenceValue[key] = sentenceValue[key] / matched_word_count

    return sentenceValue
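
The scoring rule can be checked by hand on a small example. The sketch below is a simplified variant, not the function above: `toy_score_sentences` is a hypothetical name, it matches whole tokens instead of substrings (so the toy table holds unstemmed words), and it keys each sentence by its position rather than its first 10 characters, which avoids two sentences with the same opening sharing one score.

```python
import re


def toy_score_sentences(sentences, freq_table):
    """Average table frequency of the distinct table words found in each sentence."""
    scores = {}
    for index, sentence in enumerate(sentences):
        tokens = set(re.findall(r"[a-z']+", sentence.lower()))
        matched = [word for word in freq_table if word in tokens]
        if matched:  # sentences with no table words get no score
            scores[index] = sum(freq_table[word] for word in matched) / len(matched)
    return scores


freq_table = {"cat": 3, "sat": 1, "mat": 1, "happy": 1}
sentences = ["The cat sat on the mat.", "The cat is a happy cat.", "It is late."]
scores = toy_score_sentences(sentences, freq_table)
print(scores)
```

The first sentence matches "cat", "sat", and "mat" for a score of (3 + 1 + 1) / 3, the second matches "cat" and "happy" for (3 + 1) / 2 = 2.0, and the third matches nothing and is left unscored.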

Now that each sentence is scored, we need a way of selecting the ones with the highest scores. The easiest way to do this is to compute the average score across all sentences and use it as a cutoff.

In [18]:
def _find_average_score(sentenceValue) -> float:
    
    sumValues = 0
    for entry in sentenceValue:
        sumValues += sentenceValue[entry]

    # Average value of a sentence from original text
    average = (sumValues / len(sentenceValue))

    return average

Using the average score as the threshold, we can generate the summary by selecting all sentences whose score is at least the threshold. The run below passes 1.3 times the average, which makes the cut stricter.

In [19]:
def _generate_summary(sentences, sentenceValue, threshold):
    sentence_count = 0
    summary = ''

    for sentence in sentences:
        if sentence[:10] in sentenceValue and sentenceValue[sentence[:10]] >= (threshold):
            summary += " " + sentence
            sentence_count += 1

    return summary
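
As a rough illustration of how the threshold controls summary length, the sketch below selects sentences from made-up scores. `toy_select` and its `factor` parameter are hypothetical names, and the scores are invented for the example rather than taken from the notebook's output.

```python
def toy_select(sentences, scores, factor=1.0):
    """Keep sentences scoring at least factor * mean score, in original order."""
    threshold = factor * sum(scores.values()) / len(scores)
    return [
        sentence
        for index, sentence in enumerate(sentences)
        if index in scores and scores[index] >= threshold
    ]


sentences = ["Alpha one.", "Beta two.", "Gamma three.", "Delta four."]
scores = {0: 1.0, 1: 4.0, 2: 2.0, 3: 5.0}  # invented scores, mean is 3.0

print(toy_select(sentences, scores))              # prints: ['Beta two.', 'Delta four.']
print(toy_select(sentences, scores, factor=1.5))  # prints: ['Delta four.']
```

Raising the factor shortens the summary, which is the effect of multiplying the average by a constant before selecting sentences.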

In [25]:
def run_summarization(text):
    # 1 Create the word frequency table
    freq_table = _create_frequency_table(text)

    # 2 Tokenize the sentences
    # NLTK already provides a sentence tokenizer, so sent_tokenize()
    # is all we need to build the list of sentences.
    sentences = sent_tokenize(text)

    # 3 Important Algorithm: score the sentences
    sentence_scores = _score_sentences(sentences, freq_table)

    # 4 Find the threshold
    threshold = _find_average_score(sentence_scores)

    # 5 Important Algorithm: Generate the summary
    summary = _generate_summary(sentences, sentence_scores, 1.3 * threshold)

    return summary


if __name__ == '__main__':
    result = run_summarization(text_str)
    print(result)

 This article is about language processing by computers. For the processing of language by the human brain, see  Language processing in the brain . As a result, a great deal of research has gone into methods of more effectively learning from limited amounts of data. However, there is an enormous amount of non-annotated data available (including, among other things, the entire content of the  World Wide Web ), which can often make up for the inferior results if the algorithm used has a low enough  time complexity  to be practical. However, this is rarely robust to natural language variation. the structure of words) of the language being considered. "open, opens, opened, opening") as separate words. Some languages have more such ambiguity than others. There are two primary types of parsing, Dependency Parsing and Constituency Parsing. marking  abbreviations ). (e.g. "close" will be the root for "closed", "closing", "close", "closer" etc.). person, location, organization). Furthermore, ma

As we can see, the extractive method does not produce the best results. Some parts of the summary make sense, while others are fragments that do not stand on their own. We could probably improve the output by cleaning the input text more thoroughly. Even with improvements, extractive summaries often read as choppy, which has led to the development of abstractive text summarization, where neural networks generate new sentences to summarize the text.
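
One possible cleaning step, sketched under assumptions: the scraped text above contains bracketed citation markers such as `[1]`, stray spaces before punctuation, and runs of whitespace. The hypothetical helper `toy_clean_text` below removes those with regular expressions before the text is tokenized. It has not been run against the full page, so whether it improves the summary is untested.

```python
import re


def toy_clean_text(raw: str) -> str:
    """Strip numeric citation markers, tighten punctuation, collapse whitespace."""
    # drop bracketed numeric citation markers such as "[1]"
    no_citations = re.sub(r"\[\d+\]", "", raw)
    # remove whitespace left before punctuation, e.g. "brain ." -> "brain."
    tightened = re.sub(r"\s+([.,;:!?)])", r"\1", no_citations)
    # collapse any remaining run of whitespace into a single space
    return re.sub(r"\s+", " ", tightened).strip()


raw = "  An  automated online assistant  providing  customer service . [1] \n  Jump to search "
print(toy_clean_text(raw))
# prints: An automated online assistant providing customer service. Jump to search
```

Navigation fragments such as "Jump to search" survive this step; removing them would need filtering at the scraping stage, for example by keeping only text from paragraph tags.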