# Semantic Text Summarization
This notebook uses semantic, model-based methods to understand the text while keeping to the standards of extractive summarization. The task is implemented with several pretrained models, such as **BERT, BART, T5, XLNet and GPT2**, for summarizing articles. The results are also compared with a classical method, i.e. **summarization based on word frequencies**.


In [1]:
## installation
!pip install transformers --upgrade
!pip install bert-extractive-summarizer
!pip install neuralcoref
!python -m spacy download en_core_web_md

Requirement already up-to-date: transformers in /usr/local/lib/python3.6/dist-packages (2.8.0)
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_md')


In [0]:
from transformers import pipeline
from summarizer import Summarizer, TransformerSummarizer

import pprint

pp = pprint.PrettyPrinter(indent=14)

In [32]:
## documentation for summarizer: https://huggingface.co/transformers/main_classes/pipelines.html#summarizationpipeline
# summarize with BART
summarizer_bart = pipeline(task='summarization', model="bart-large-cnn")

#summarize with BERT
summarizer_bert = Summarizer()

# summarize with T5
summarizer_t5 = pipeline(task='summarization', model="t5-large") # options: 't5-small', 't5-base', 't5-large', 't5-3b', 't5-11b'
# For T5 you can choose the size of the model. Everything above t5-base is very slow, even on GPU or TPU.

# summarize with XLNet
summarizer_xlnet = TransformerSummarizer(transformer_type="XLNet",transformer_model_key="xlnet-base-cased")

# summarize with GPT2
summarizer_gpt2 = TransformerSummarizer(transformer_type="GPT2",transformer_model_key="gpt2-medium")



In [0]:
data = '''
For the actual assembly of an module, the Material list of a complete module is displayed in order to make the necessary materials physically available. Also CAD model of the assembly and 2-D construction models can be viewed or printed out in order to be able to later on
to carry out individual steps.
Necessary steps: The material list, 3D model and 2D drawings of a complete assembly must be available.
'''

In [40]:
# Bart for Text - Summarization
print('Bart for Text - Summarization')
summary_bart = summarizer_bart(data, min_length=10, max_length=40) # change min_ and max_length for different output
pp.pprint(summary_bart[0]['summary_text'])

# BERT for Text - Summarization
print('\n BERT for Text - Summarization')
summary_bert = summarizer_bert(data, min_length=60)
full = ''.join(summary_bert)
pp.pprint(full)

# XLNet for Text - Summarization
print('\n XLNet for Text - Summarization')
summary_xlnet = summarizer_xlnet(data, min_length=60)
full = ''.join(summary_xlnet)
pp.pprint(full)

# GPT2 for Text - Summarization
print('\n GPT2 for Text - Summarization')
summary_gpt2 = summarizer_gpt2(data, min_length=60, ratio = 0.1)
full = ''.join(summary_gpt2)
pp.pprint(full)

# T5 for Text - Summarization
print('\n T5 for Text - Summarization')
summary_t5 = summarizer_t5(data, min_length=10) # change min_ and max_length for different output
pp.pprint(summary_t5[0]['summary_text'])

Bart for Text - Summarization
('Necessary steps: The material list, 3D model and 2D drawings of a complete '
 'assembly must be available.')

 BERT for Text - Summarization
('For the actual assembly of an module, the Material list of a complete module '
 'is displayed in order to make the necessary materials physically available. '
 'Necessary steps: The material list, 3D model and 2D drawings of a complete '
 'assembly must be available.')

 XLNet for Text - Summarization
('For the actual assembly of an module, the Material list of a complete module '
 'is displayed in order to make the necessary materials physically available.')

 GPT2 for Text - Summarization
('For the actual assembly of an module, the Material list of a complete module '
 'is displayed in order to make the necessary materials physically available. '
 'Necessary steps: The material list, 3D model and 2D drawings of a complete '
 'assembly must be available.')

 T5 for Text - Summarization
('Material list of a complete module is displayed in order to make the '
 'necessary materials physically available . also CAD model of the assembly '
 'and 2-D construction models can be viewed or printed out .')
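
The summaries above differ noticeably in length. As a rough way to put numbers on that, here is a minimal sketch of a word-level compression ratio. The helper `compression_ratio` is hypothetical (it is not part of `transformers` or `summarizer`), and counting whitespace-separated words is an assumption made for simplicity.

```python
def compression_ratio(source: str, summary: str) -> float:
    """Return summary length divided by source length, counted in whitespace-separated words."""
    source_words = source.split()
    if not source_words:
        raise ValueError("source text must not be empty")
    return len(summary.split()) / len(source_words)


# First input text of this notebook, with line breaks collapsed.
source_text = (
    "For the actual assembly of an module, the Material list of a complete module "
    "is displayed in order to make the necessary materials physically available. "
    "Also CAD model of the assembly and 2-D construction models can be viewed or "
    "printed out in order to be able to later on to carry out individual steps. "
    "Necessary steps: The material list, 3D model and 2D drawings of a complete "
    "assembly must be available."
)

# Two of the summaries printed above, copied verbatim.
summaries = {
    "BART": (
        "Necessary steps: The material list, 3D model and 2D drawings of a complete "
        "assembly must be available."
    ),
    "BERT": (
        "For the actual assembly of an module, the Material list of a complete module "
        "is displayed in order to make the necessary materials physically available. "
        "Necessary steps: The material list, 3D model and 2D drawings of a complete "
        "assembly must be available."
    ),
}

ratios = {name: compression_ratio(source_text, text) for name, text in summaries.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}")
```

A lower ratio means a shorter summary relative to the input; it says nothing about whether the right sentences were kept.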


In [0]:
# second example: a different input text
data = '''
In the production of SMC (Sheet Moulding Compound), the maturing of the semi-finished product (resin+glass fibre) is of decisive importance. 
The associated thickening of the material determines the viscosity and thus the quality of the end product. 
Possible defects due to short maturing and soft semi-finished products are lack of fibre transport, while too long maturing and hard semi-finished products result in incompletely filled components. 
By adjusting the press parameters such as closing force, closing speed, mould temperature etc., the fluctuations in thickening can normally be compensated. 
By measuring the flowability/viscosity of the material or by measuring additional parameters during the manufacturing process, the ideal process window for the production of SMC is to be controlled even better.
'''

In [42]:
# Bart for Text - Summarization
print('Bart for Text - Summarization')
summary_bart = summarizer_bart(data, min_length=10, max_length=40) # change min_ and max_length for different output
pp.pprint(summary_bart[0]['summary_text'])

# BERT for Text - Summarization
print('\n BERT for Text - Summarization')
summary_bert = summarizer_bert(data, min_length=60)
full = ''.join(summary_bert)
pp.pprint(full)

# XLNet for Text - Summarization
print('\n XLNet for Text - Summarization')
summary_xlnet = summarizer_xlnet(data, min_length=60)
full = ''.join(summary_xlnet)
pp.pprint(full)

# GPT2 for Text - Summarization
print('\n GPT2 for Text - Summarization')
summary_gpt2 = summarizer_gpt2(data, min_length=60)
full = ''.join(summary_gpt2)
pp.pprint(full)

# T5 for Text - Summarization
print('\n T5 for Text - Summarization')
summary_t5 = summarizer_t5(data, min_length=10) # change min_ and max_length for different output
pp.pprint(summary_t5[0]['summary_text'])

Bart for Text - Summarization
('The maturing of the semi-finished product is of decisive importance. The '
 'associated thickening of the material determines the viscosity and thus the '
 'quality of the end product.')

 BERT for Text - Summarization
('In the production of SMC (Sheet Moulding Compound), the maturing of the '
 'semi-finished product (resin+glass fibre) is of decisive importance. The '
 'associated thickening of the material determines the viscosity and thus the '
 'quality of the end product.')

 XLNet for Text - Summarization
('In the production of SMC (Sheet Moulding Compound), the maturing of the '
 'semi-finished product (resin+glass fibre) is of decisive importance.')

 GPT2 for Text - Summarization
('In the production of SMC (Sheet Moulding Compound), the maturing of the '
 'semi-finished product (resin+glass fibre) is of decisive importance. By '
 'measuring the flowability/viscosity of the material or by measuring '
 'additional parameters during the manufacturing process, the ideal process '
 'window for the production of SMC is to be controlled even better.')

 T5 for Text - Summarization
('maturing of semi-finished product (resin+glass fibre) is of decisive '
 'importance . the associated thickening of the material determines the '
 'viscosity and thus the quality of the end product . fluctuations in '
 'thickness can normally be compensated by adjusting press parameters .')
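
In the outputs above, the BERT, XLNet and GPT2 summaries repeat source sentences word for word, while the T5 summary rewrites them. A minimal sketch for checking this automatically follows. The helpers `split_sentences` and `is_extractive` are hypothetical names introduced here, and the regex-based sentence splitter is a simplifying assumption (it is not NLTK's tokenizer).

```python
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part.strip() for part in parts if part.strip()]


def is_extractive(source: str, summary: str) -> bool:
    """True if every summary sentence occurs verbatim in the source (whitespace-normalized)."""
    normalized_source = " ".join(source.split())
    sentences = split_sentences(summary)
    if not sentences:
        return False
    return all(" ".join(sentence.split()) in normalized_source for sentence in sentences)


# First two sentences of the second input text.
source_text = (
    "In the production of SMC (Sheet Moulding Compound), the maturing of the "
    "semi-finished product (resin+glass fibre) is of decisive importance. "
    "The associated thickening of the material determines the viscosity and thus "
    "the quality of the end product."
)

# XLNet output and the first sentence of the T5 output, copied from above.
xlnet_summary = (
    "In the production of SMC (Sheet Moulding Compound), the maturing of the "
    "semi-finished product (resin+glass fibre) is of decisive importance."
)
t5_summary = (
    "maturing of semi-finished product (resin+glass fibre) is of decisive importance ."
)

print("XLNet extractive:", is_extractive(source_text, xlnet_summary))
print("T5 extractive:", is_extractive(source_text, t5_summary))
```

The check is strict: a single dropped word or changed capitalization makes a sentence count as rewritten.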


In [43]:
# Text summarization using word frequencies
# importing libraries

import urllib.request

import nltk
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize, sent_tokenize

nltk.download('stopwords')
nltk.download('punkt')

# fetching the content from the URL
# (only used if the commented-out paragraph loop further down is enabled)
fetched_data = urllib.request.urlopen('https://en.wikipedia.org/wiki/20th_century')

article_read = fetched_data.read()

# parsing the page content and storing it in a variable
article_parsed = BeautifulSoup(article_read, 'html.parser')

# collecting all <p> tags
paragraphs = article_parsed.find_all('p')

article_content = ''' 
In the production of SMC (Sheet Moulding Compound), the maturing of the semi-finished product (resin+glass fibre) is of decisive importance. The associated thickening of the material determines the viscosity and thus the quality of the end product. Possible defects due to short maturing and soft semi-finished products are lack of fibre transport, while too long maturing and hard semi-finished products result in incompletely filled components. By adjusting the press parameters such as closing force, closing speed, mould temperature etc., the fluctuations in thickening can normally be compensated. By measuring the flowability/viscosity of the material or by measuring additional parameters during the manufacturing process, the ideal process window for the production of SMC is to be controlled even better.
'''

#looping through the paragraphs and adding them to the variable
#for p in paragraphs:  
#    article_content += p.text

#print(article_content)


def _create_dictionary_table(text_string) -> dict:

    # stop words to be skipped
    stop_words = set(stopwords.words("english"))

    words = word_tokenize(text_string)

    # stemmer for reducing words to their root form
    stem = PorterStemmer()

    # creating dictionary for the word frequency table
    frequency_table = dict()
    for wd in words:
        # check for stop words before stemming, since a stem may no longer match the stop-word list
        if wd.lower() in stop_words:
            continue
        wd = stem.stem(wd)
        if wd in frequency_table:
            frequency_table[wd] += 1
        else:
            frequency_table[wd] = 1

    return frequency_table


def _calculate_sentence_scores(sentences, frequency_table) -> dict:

    # scoring each sentence by the frequency-table entries it contains
    sentence_weight = dict()

    for sentence in sentences:
        sentence_lower = sentence.lower()
        matched_entries = 0
        total_weight = 0
        for word, weight in frequency_table.items():
            # substring match, so a stem also matches the inflected forms that contain it
            if word in sentence_lower:
                matched_entries += 1
                total_weight += weight

        # keyed by the full sentence so that sentences sharing a prefix cannot collide;
        # sentences without any match are skipped to avoid dividing by zero
        if matched_entries:
            sentence_weight[sentence] = total_weight / matched_entries

    return sentence_weight

def _calculate_average_score(sentence_weight) -> float:

    # nothing was scored, so there is no meaningful average
    if not sentence_weight:
        return 0.0

    # calculating the average score for the sentences
    sum_values = 0
    for entry in sentence_weight:
        sum_values += sentence_weight[entry]

    # getting sentence average value from source text
    average_score = (sum_values / len(sentence_weight))

    return average_score

def _get_article_summary(sentences, sentence_weight, threshold):
    article_summary = ''

    # keeping every sentence whose score reaches the threshold, in original order
    for sentence in sentences:
        if sentence in sentence_weight and sentence_weight[sentence] >= threshold:
            article_summary += " " + sentence

    return article_summary

def _run_article_summary(article):
    
    #creating a dictionary for the word frequency table
    frequency_table = _create_dictionary_table(article)

    #tokenizing the sentences
    sentences = sent_tokenize(article)

    #algorithm for scoring a sentence by its words
    sentence_scores = _calculate_sentence_scores(sentences, frequency_table)

    #getting the threshold
    threshold = _calculate_average_score(sentence_scores)

    #producing the summary
    article_summary = _get_article_summary(sentences, sentence_scores, 1.1 * threshold)

    return article_summary

if __name__ == '__main__':
    summary_results = _run_article_summary(article_content)
    print(summary_results)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
  
In the production of SMC (Sheet Moulding Compound), the maturing of the semi-finished product (resin+glass fibre) is of decisive importance.
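
The threshold of 1.1 times the average score kept only one sentence here. If a fixed summary length is preferred, one alternative is to keep the `k` highest-scoring sentences instead. The following is a simplified, standard-library-only sketch of that idea, not the notebook's exact scoring: `top_k_sentences` is a hypothetical helper, the stop-word set is a tiny illustrative list rather than NLTK's, and no stemming is applied.

```python
import re
from collections import Counter
from typing import List

# Tiny illustrative stop-word list (an assumption; NLTK's list is much longer).
STOP_WORDS = {"the", "of", "a", "an", "is", "and", "to", "in", "at", "on"}


def top_k_sentences(text: str, k: int) -> List[str]:
    """Return the k sentences with the highest average word frequency, in original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]
    tokenized = [
        [w for w in re.findall(r"[a-z0-9]+", s.lower()) if w not in STOP_WORDS]
        for s in sentences
    ]
    frequencies = Counter(word for words in tokenized for word in words)

    # Score = mean corpus frequency of the sentence's non-stop words.
    scores = [
        sum(frequencies[w] for w in words) / len(words) if words else 0.0
        for words in tokenized
    ]

    best = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(best)]


toy_text = "Cats chase mice. Cats sleep. Dogs bark loudly at night."
print(top_k_sentences(toy_text, 2))
```

Because the score is an average, short sentences built from frequent words rank highly; whether that is desirable depends on the text.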


In [6]:
# Text summarization using Gensim
# note: the gensim.summarization module was removed in Gensim 4.0, so this cell needs gensim < 4.0

from gensim.summarization.summarizer import summarize
print(summarize(data))

By measuring the flowability/viscosity of the material or by measuring additional parameters during the manufacturing process, the ideal process window for the production of SMC is to be controlled even better.
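
Gensim's `summarize` is based on the TextRank algorithm, which ranks sentences by how strongly they are connected to the other sentences. The sketch below illustrates the general idea with the standard library only. It is not Gensim's implementation: the word-overlap similarity, the damping factor of 0.85 and the fixed iteration count are assumptions chosen for illustration, and `textrank_scores` is a hypothetical name.

```python
import math
import re
from typing import List


def _words(sentence: str) -> set:
    """Lowercased set of alphanumeric tokens in the sentence."""
    return set(re.findall(r"[a-z0-9]+", sentence.lower()))


def _similarity(a: str, b: str) -> float:
    """Shared-word count, normalized by the log of both sentence vocabularies."""
    words_a, words_b = _words(a), _words(b)
    if len(words_a) < 2 or len(words_b) < 2:
        return 0.0  # avoids log(1) + log(1) == 0 in the denominator
    return len(words_a & words_b) / (math.log(len(words_a)) + math.log(len(words_b)))


def textrank_scores(sentences: List[str], damping: float = 0.85, iterations: int = 50) -> List[float]:
    """Score sentences with a PageRank-style iteration over the similarity graph."""
    n = len(sentences)
    if n == 0:
        return []
    weights = [
        [_similarity(sentences[i], sentences[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    outgoing = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        updated = []
        for i in range(n):
            # Each neighbour j passes on its score in proportion to the edge weight j -> i.
            received = sum(
                weights[j][i] / outgoing[j] * scores[j] for j in range(n) if outgoing[j] > 0
            )
            updated.append((1 - damping) / n + damping * received)
        scores = updated
    return scores


toy_sentences = [
    "The cat sat on the mat.",
    "The cat lay on the mat.",
    "Stock prices fell sharply today.",
]
scores = textrank_scores(toy_sentences)
for sentence, score in zip(toy_sentences, scores):
    print(f"{score:.3f}  {sentence}")
```

In the toy input, the two cat sentences support each other and outrank the unrelated third sentence, which only keeps the baseline score.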
