# An extractive summarization method consists of selecting important sentences, paragraphs, etc. from the original document and concatenating them into a shorter form.

## How to do text summarization

* Text cleaning
* Sentence Tokenization
* Word tokenization
* Word-frequency table
* Text Summarization
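
The steps above can be sketched end to end with only the standard library. This is a minimal illustration, not the spaCy pipeline used in the rest of this notebook: the regex-based tokenization, the `TOY_STOP_WORDS` set, and the `summarize` function are all assumptions made for the sketch.

```python
import re
from collections import Counter
from heapq import nlargest

# Hypothetical, tiny stop-word set used only for this sketch.
TOY_STOP_WORDS = {"a", "an", "and", "are", "is", "of", "the", "to", "in", "it"}


def words_of(sentence):
    # Text cleaning + word tokenization: lowercase alphabetic runs, minus stop words.
    return [w for w in re.findall(r"[a-z]+", sentence.lower())
            if w not in TOY_STOP_WORDS]


def summarize(text, ratio=0.3):
    # Sentence tokenization: naive split after '.', '!' or '?' followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    # Word-frequency table over the whole text.
    freq = Counter(w for s in sentences for w in words_of(s))
    if not freq:
        return ""
    top = max(freq.values())

    # Sentence score = sum of normalized word frequencies.
    scores = {i: sum(freq[w] / top for w in words_of(s))
              for i, s in enumerate(sentences)}

    # Keep the highest-scoring sentences, then restore document order.
    k = max(1, int(len(sentences) * ratio))
    best = sorted(nlargest(k, scores, key=scores.get))
    return " ".join(sentences[i] for i in best)


print(summarize("Cats sleep. Cats chase cats. Birds sing."))
```

With three sentences and `ratio=0.3`, the sketch keeps a single sentence: the one whose words are most frequent across the text.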

In [2]:
# Load text

text = """

Text Summarization using NLP
Published by georgiannacambel on 4 September 2020
Extractive Text Summarization
What is text summarization?
Text summarization is the process of creating a short, accurate, and fluent summary of a longer text document. It is the process of distilling the most important information from a source text. Automatic text summarization is a common problem in machine learning and natural language processing (NLP). Automatic text summarization methods are greatly needed to address the ever-growing amount of text data available online to both better help discover relevant information and to consume relevant information faster.

Why automatic text summarization?
Summaries reduce reading time.
While researching using various documents, summaries make the selection process easier.
Automatic summarization improves the effectiveness of indexing.
Automatic summarization algorithms are less biased than human summarizers.
Personalized summaries are useful in question-answering systems as they provide personalized information.
Using automatic or semi-automatic summarization systems enables commercial abstract services to - increase the number of text documents they are able to process.

An Extractive summarization method consists of selecting important sentences, paragraphs etc. from the original document and concatenating them into shorter form.
An Abstractive summarization is an understanding of the main concepts in a document and then express those concepts in clear natural language.
The Domain-specific summarization techniques utilize the available knowledge specific to the domain of text. For example, automatic summarization research on medical text generally attempts to utilize the various sources of codified medical knowledge and ontologies.
The Generic summarization focuses on obtaining a generic summary or abstract of the collection of documents, or sets of images, or videos, news stories etc.
TheQuery-based summarization, sometimes called query-relevant summarization, summarizes objects specific to a query.
The Multi-document summarization is an automatic procedure aimed at extraction of information from multiple texts written about the same topic. Resulting summary report allows individual users, such as professional information consumers, to quickly familiarize themselves with information contained in a large cluster of documents.
The Single-document summarization generates a summary from a single source document.

"""

In [3]:
# Let's get started with spaCy
!pip install -U spacy

!python -m spacy download en_core_web_sm

"""
spaCy is a library for natural language processing.
STOP_WORDS is the set of default stop words for spaCy's English language model.
punctuation (from Python's string module) is a pre-initialized string containing all ASCII punctuation characters.
"""


Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 23.4 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [4]:
import spacy
from spacy.lang.en.stop_words import STOP_WORDS

In [5]:
stopwords = list(STOP_WORDS)

In [7]:
# List of stop words. We can add more stop words manually.
print(stopwords)

['even', 'a', 'around', 'from', 'whereas', 'wherever', 'upon', 'within', 'thereafter', 'really', 'besides', 'than', 'show', '’m', 'various', 'should', 'never', 'some', '’ve', 'somehow', 'serious', 'again', 'same', 'except', 'rather', 'nine', 'namely', 'seem', 'whoever', 'five', 'nowhere', 'them', 'beyond', 'twelve', 'very', 'their', 'being', 'by', 'latterly', '’re', 'hereupon', 'using', 'ten', 'thus', 'enough', 'quite', 'third', 'toward', 'last', 'say', 'he', 'see', 'these', "'d", 'does', 'doing', 'him', 'top', 'on', 'be', 'an', 'above', 'up', '‘d', 'or', 'others', 'his', 'many', 'thence', 'did', 'whenever', 'here', 'six', 'only', 'few', 'until', 'all', 'were', 'behind', 'bottom', 'ours', 'yourself', 'n’t', 'why', 'cannot', 'forty', 'becomes', 'yours', 'also', 'myself', 'moreover', 'take', 'three', 'where', 'across', 'nor', 'per', 'thru', 'will', 'already', 'since', 'therefore', 'mostly', 'whereafter', "'m", 'due', 'n‘t', 'almost', 'me', 'been', 'along', 'into', 'too', 'latter', 'seeme
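
Adding stop words manually could look like the sketch below. To keep it runnable without spaCy, `base_stop_words` is a hypothetical stand-in for `STOP_WORDS`, and the custom words are illustrative choices, not a recommendation from the original notebook.

```python
# Hypothetical stand-in for spaCy's STOP_WORDS so the sketch runs without spaCy.
base_stop_words = {"a", "an", "the", "is", "of"}

# Extra, domain-specific words to ignore (illustrative choices).
custom_stop_words = {"etc", "published"}

# Keep a set rather than a list: membership tests are O(1) on average.
stopwords = base_stop_words | custom_stop_words

print("published" in stopwords)
print("summary" in stopwords)
```

Keeping the collection as a set also avoids duplicate entries if a custom word is already a default stop word.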

In [9]:
# load nlp model
"""
spacy.load is used to load a model. spacy.load('en_core_web_sm') loads the model package en_core_web_sm.
This will return a language object nlp containing all components and data needed to process text.
"""

nlp = spacy.load('en_core_web_sm')

In [10]:
doc = nlp(text)

In [11]:
# Tokenize the document

tokens = [token.text for token in doc]
print(tokens)

['\n\n', 'Text', 'Summarization', 'using', 'NLP', '\n', 'Published', 'by', 'georgiannacambel', 'on', '4', 'September', '2020', '\n', 'Extractive', 'Text', 'Summarization', '\n', 'What', 'is', 'text', 'summarization', '?', '\n', 'Text', 'summarization', 'is', 'the', 'process', 'of', 'creating', 'a', 'short', ',', 'accurate', ',', 'and', 'fluent', 'summary', 'of', 'a', 'longer', 'text', 'document', '.', 'It', 'is', 'the', 'process', 'of', 'distilling', 'the', 'most', 'important', 'information', 'from', 'a', 'source', 'text', '.', 'Automatic', 'text', 'summarization', 'is', 'a', 'common', 'problem', 'in', 'machine', 'learning', 'and', 'natural', 'language', 'processing', '(', 'NLP', ')', '.', 'Automatic', 'text', 'summarization', 'methods', 'are', 'greatly', 'needed', 'to', 'address', 'the', 'ever', '-', 'growing', 'amount', 'of', 'text', 'data', 'available', 'online', 'to', 'both', 'better', 'help', 'discover', 'relevant', 'information', 'and', 'to', 'consume', 'relevant', 'information',

In [12]:
print(len(tokens))

201


In [14]:
# Perform text cleaning
# Step 1: build the punctuation string used to filter tokens

from string import punctuation

punctuation = punctuation + '\n'
punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~\n'
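
One detail worth checking: `token in punctuation` on a string is a substring test, so a token made of two newlines is not matched even though `'\n'` was appended. The sketch below contrasts that with a per-character check; the helper names `is_punct_substring` and `is_punct_chars` are hypothetical.

```python
from string import punctuation

punct_string = punctuation + "\n"   # same value as in the cell above
punct_chars = set(punct_string)     # one entry per character


def is_punct_substring(token):
    # `in` on a string checks whether token occurs as a contiguous substring.
    return token in punct_string


def is_punct_chars(token):
    # True only when every character of the token is punctuation or a newline.
    return bool(token) and all(ch in punct_chars for ch in token)


print(is_punct_substring("\n\n"), is_punct_chars("\n\n"))
print(is_punct_substring(","), is_punct_chars(","))
```

The per-character version would treat a `'\n\n'` token as punctuation, while the substring test lets it through.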

In [15]:
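# Build the word-frequency table, ignoring stop words and punctuation.
# Note: keys keep their original case, so 'Text' and 'text' are counted separately.

word_frequencies = {}
for word in doc:
    if word.text.lower() not in stopwords:
        # `in punctuation` is a substring test, so a token such as '\n\n' is not filtered out
        if word.text.lower() not in punctuation:
            if word.text not in word_frequencies.keys():
                word_frequencies[word.text] = 1
            else:
                word_frequencies[word.text] += 1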

print(word_frequencies)

{'\n\n': 3, 'Text': 3, 'Summarization': 2, 'NLP': 2, 'Published': 1, 'georgiannacambel': 1, '4': 1, 'September': 1, '2020': 1, 'Extractive': 1, 'text': 8, 'summarization': 8, 'process': 4, 'creating': 1, 'short': 1, 'accurate': 1, 'fluent': 1, 'summary': 1, 'longer': 1, 'document': 1, 'distilling': 1, 'important': 1, 'information': 4, 'source': 1, 'Automatic': 4, 'common': 1, 'problem': 1, 'machine': 1, 'learning': 1, 'natural': 1, 'language': 1, 'processing': 1, 'methods': 1, 'greatly': 1, 'needed': 1, 'address': 1, 'growing': 1, 'data': 1, 'available': 1, 'online': 1, 'better': 1, 'help': 1, 'discover': 1, 'relevant': 2, 'consume': 1, 'faster': 1, 'automatic': 3, 'Summaries': 1, 'reduce': 1, 'reading': 1, 'time': 1, 'researching': 1, 'documents': 2, 'summaries': 2, 'selection': 1, 'easier': 1, 'improves': 1, 'effectiveness': 1, 'indexing': 1, 'algorithms': 1, 'biased': 1, 'human': 1, 'summarizers': 1, 'Personalized': 1, 'useful': 1, 'question': 1, 'answering': 1, 'systems': 2, 'provi
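
In the table above, `'Text'` and `'text'` are separate keys and `'\n\n'` is counted as a word. A possible variant, sketched here with `collections.Counter`, folds case and drops tokens that contain no letter or digit. The `tokens` list and `toy_stop_words` set are hypothetical stand-ins for the spaCy tokens and `STOP_WORDS`, and this variant would produce different counts from the ones shown in this notebook.

```python
from collections import Counter

# Hypothetical token list standing in for `[token.text for token in doc]`.
tokens = ["Text", "summarization", "is", "the", "process", ".",
          "Automatic", "text", "summarization", "\n\n"]
toy_stop_words = {"is", "the"}  # stand-in for STOP_WORDS


def keep(token):
    lowered = token.lower()
    # Drop stop words and anything without at least one letter or digit.
    return lowered not in toy_stop_words and any(ch.isalnum() for ch in lowered)


# Count lowercase forms so 'Text' and 'text' share one entry.
word_frequencies = Counter(t.lower() for t in tokens if keep(t))
print(word_frequencies)
```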

In [16]:
max_frequency = max(word_frequencies.values())
max_frequency

8

In [17]:
# Normalize each frequency by dividing it by the maximum frequency.

for word in word_frequencies.keys():
    word_frequencies[word] = word_frequencies[word]/max_frequency

print(word_frequencies)

{'\n\n': 0.375, 'Text': 0.375, 'Summarization': 0.25, 'NLP': 0.25, 'Published': 0.125, 'georgiannacambel': 0.125, '4': 0.125, 'September': 0.125, '2020': 0.125, 'Extractive': 0.125, 'text': 1.0, 'summarization': 1.0, 'process': 0.5, 'creating': 0.125, 'short': 0.125, 'accurate': 0.125, 'fluent': 0.125, 'summary': 0.125, 'longer': 0.125, 'document': 0.125, 'distilling': 0.125, 'important': 0.125, 'information': 0.5, 'source': 0.125, 'Automatic': 0.5, 'common': 0.125, 'problem': 0.125, 'machine': 0.125, 'learning': 0.125, 'natural': 0.125, 'language': 0.125, 'processing': 0.125, 'methods': 0.125, 'greatly': 0.125, 'needed': 0.125, 'address': 0.125, 'growing': 0.125, 'data': 0.125, 'available': 0.125, 'online': 0.125, 'better': 0.125, 'help': 0.125, 'discover': 0.125, 'relevant': 0.25, 'consume': 0.125, 'faster': 0.125, 'automatic': 0.375, 'Summaries': 0.125, 'reduce': 0.125, 'reading': 0.125, 'time': 0.125, 'researching': 0.125, 'documents': 0.25, 'summaries': 0.25, 'selection': 0.125, '
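
As a small worked example of the normalization, the sketch below takes a handful of counts from the table printed earlier and divides each by the maximum (8). The names `raw_counts` and `normalized` are introduced only for this illustration.

```python
# A few raw counts copied from the word-frequency output above.
raw_counts = {"text": 8, "process": 4, "Text": 3, "NLP": 2, "creating": 1}

max_count = max(raw_counts.values())

# Build a new dict instead of overwriting values in place.
normalized = {word: count / max_count for word, count in raw_counts.items()}

print(normalized)
```

The most frequent word always ends up at 1.0, and every other word falls between 0 and 1.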

In [18]:
# Sentence tokenization: doc.sents yields one Span per sentence

sentence_tokens = list(doc.sents)
print(sentence_tokens)

[

Text Summarization using NLP
Published by georgiannacambel on 4 September 2020
Extractive Text Summarization
What is text summarization?
, Text summarization is the process of creating a short, accurate, and fluent summary of a longer text document., It is the process of distilling the most important information from a source text., Automatic text summarization is a common problem in machine learning and natural language processing (NLP)., Automatic text summarization methods are greatly needed to address the ever-growing amount of text data available online to both better help discover relevant information and to consume relevant information faster.

, Why automatic text summarization?
, Summaries reduce reading time.
, While researching using various documents, summaries make the selection process easier.
, Automatic summarization improves the effectiveness of indexing.
, Automatic summarization algorithms are less biased than human summarizers.
, Personalized summaries are useful

In [21]:
"""
Now we will calculate the sentence scores. The sentence score for a particular sentence is the sum of the normalized frequencies of the words in that sentence.
All the sentences will be stored with their score in the dictionary sentence_scores.
"""


In [22]:
sentence_scores = {}
for sent in sentence_tokens:
    for word in sent:
        # Words are looked up in lowercase, so only lowercase keys of word_frequencies contribute
        if word.text.lower() in word_frequencies.keys():
            if sent not in sentence_scores.keys():
                sentence_scores[sent] = word_frequencies[word.text.lower()]
            else:
                sentence_scores[sent] += word_frequencies[word.text.lower()]

sentence_scores

{
 
 Text Summarization using NLP
 Published by georgiannacambel on 4 September 2020
 Extractive Text Summarization
 What is text summarization?: 6.75,
 Text summarization is the process of creating a short, accurate, and fluent summary of a longer text document.: 4.375,
 It is the process of distilling the most important information from a source text.: 2.375,
 Automatic text summarization is a common problem in machine learning and natural language processing (NLP).: 3.25,
 Automatic text summarization methods are greatly needed to address the ever-growing amount of text data available online to both better help discover relevant information and to consume relevant information faster.
 : 6.875,
 Why automatic text summarization?: 2.375,
 Summaries reduce reading time.: 0.625,
 While researching using various documents, summaries make the selection process easier.: 1.375,
 Automatic summarization improves the effectiveness of indexing.: 1.75,
 Automatic summarization algorithms are le
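
To see where a score comes from, the sketch below recomputes the 0.625 reported for "Summaries reduce reading time." using normalized frequencies copied from the earlier output. The word list is typed by hand rather than produced by spaCy, so treat it as an illustration of the arithmetic only.

```python
# Normalized frequencies copied from the output above (subset).
word_frequencies = {
    "summaries": 0.25,
    "Summaries": 0.125,
    "reduce": 0.125,
    "reading": 0.125,
    "time": 0.125,
}

# Hand-written tokens for the sentence "Summaries reduce reading time."
sentence_words = ["Summaries", "reduce", "reading", "time", "."]

# Same rule as the loop above: look each word up in lowercase, skip misses.
score = sum(word_frequencies.get(w.lower(), 0.0) for w in sentence_words)

print(score)
```

Because the lookup is lowercased, "Summaries" contributes the 0.25 stored under `'summaries'`, and the capitalized key is never read.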

In [23]:
# Now we are going to select 30% of the sentences having the largest scores. For this we are going to import nlargest from heapq.

from heapq import nlargest

In [24]:
select_length = int(len(sentence_tokens)*0.3)
select_length

3
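
Note that `int()` truncates, so any sentence count from 10 to 13 gives 3 here, and a very short text would give 0. The helper below is a hypothetical variant with a lower bound of one sentence; the `max(1, ...)` guard is an addition for this sketch and is not part of the cell above.

```python
def select_length_for(n_sentences, ratio=0.3):
    # int() truncates toward zero; max(1, ...) is an added guard so that
    # short texts still yield at least one summary sentence.
    return max(1, int(n_sentences * ratio))


for n in (2, 10, 13):
    print(n, select_length_for(n))
```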

In [25]:
"""
nlargest() returns a list of the select_length largest elements,
i.e. the 3 highest-scoring sentences from sentence_scores.
key=sentence_scores.get specifies a one-argument function used to extract the comparison key (the score) for each sentence in sentence_scores.
"""


In [26]:
summary = nlargest(select_length, sentence_scores, key = sentence_scores.get)
summary

[Automatic text summarization methods are greatly needed to address the ever-growing amount of text data available online to both better help discover relevant information and to consume relevant information faster.
 ,
 
 
 Text Summarization using NLP
 Published by georgiannacambel on 4 September 2020
 Extractive Text Summarization
 What is text summarization?,
 Using automatic or semi-automatic summarization systems enables commercial abstract services to - increase the number of text documents they are able to process.
 ]
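
`nlargest` returns the sentences sorted by score, not in the order they appear in the text, which is why the title block shows up second above. The sketch below uses the first five scores from the earlier output with hypothetical labels `s0` to `s4` in place of the spaCy spans, and shows one possible way to put the selected sentences back in document order.

```python
from heapq import nlargest

# First five sentence scores from the output above; s0..s4 are hypothetical
# labels standing in for the spaCy Span objects, in document order.
scores = {"s0": 6.75, "s1": 4.375, "s2": 2.375, "s3": 3.25, "s4": 6.875}

# Iterating a dict yields its keys; scores.get maps each key to its score.
top = nlargest(3, scores, key=scores.get)

# Optional step (not in the original notebook): restore document order.
document_order = list(scores)
in_document_order = sorted(top, key=document_order.index)

print(top)
print(in_document_order)
```

Whether to reorder is a design choice: score order puts the most important sentence first, while document order usually reads more naturally.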

In [27]:
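# Each element of summary is a sentence (a spaCy Span); take its text and join everything into one string
final_summary = [sent.text for sent in summary]
summary = ' '.join(final_summary)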

In [28]:
print(text)



Text Summarization using NLP
Published by georgiannacambel on 4 September 2020
Extractive Text Summarization
What is text summarization?
Text summarization is the process of creating a short, accurate, and fluent summary of a longer text document. It is the process of distilling the most important information from a source text. Automatic text summarization is a common problem in machine learning and natural language processing (NLP). Automatic text summarization methods are greatly needed to address the ever-growing amount of text data available online to both better help discover relevant information and to consume relevant information faster.

Why automatic text summarization?
Summaries reduce reading time.
While researching using various documents, summaries make the selection process easier.
Automatic summarization improves the effectiveness of indexing.
Automatic summarization algorithms are less biased than human summarizers.
Personalized summaries are useful in question-answe

In [29]:
print(summary)


Automatic text summarization methods are greatly needed to address the ever-growing amount of text data available online to both better help discover relevant information and to consume relevant information faster.

 

Text Summarization using NLP
Published by georgiannacambel on 4 September 2020
Extractive Text Summarization
What is text summarization?
 Using automatic or semi-automatic summarization systems enables commercial abstract services to - increase the number of text documents they are able to process.


