### Text Summarization

Text Cleaning<br>
Sentence Tokenization<br>
Word Tokenization<br>
Word-frequency table<br>
Summarization
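
The five steps can be sketched end to end in plain Python. This is a minimal illustration, not the spaCy-based code used in the cells below: the `summarize` function, the tiny `TOY_STOPWORDS` set, and the regex-based tokenizers are all assumptions made for the sketch.

```python
import re
from collections import Counter
from heapq import nlargest

# Toy stand-in for spaCy's STOP_WORDS (assumption, deliberately tiny).
TOY_STOPWORDS = {"the", "is", "a", "of", "and", "to", "in", "it", "on"}

def summarize(text, ratio=0.3):
    # 1. Text cleaning: collapse runs of whitespace.
    cleaned = re.sub(r"\s+", " ", text).strip()
    # 2. Sentence tokenization: naive split after ., ! or ?.
    sentences = re.split(r"(?<=[.!?])\s+", cleaned)
    # 3. Word tokenization: lowercase alphabetic runs per sentence.
    words = [re.findall(r"[a-z]+", s.lower()) for s in sentences]
    # 4. Word-frequency table, normalized by the most frequent word.
    counts = Counter(w for ws in words for w in ws if w not in TOY_STOPWORDS)
    if not counts:
        return ""
    top = max(counts.values())
    weights = {w: c / top for w, c in counts.items()}
    # 5. Summarization: score each sentence and keep the highest-scoring ones.
    scores = [sum(weights.get(w, 0.0) for w in ws) for ws in words]
    k = max(1, int(len(sentences) * ratio))
    best = nlargest(k, range(len(sentences)), key=scores.__getitem__)
    return " ".join(sentences[i] for i in best)

print(summarize("Cats sleep. Cats eat fish and cats play. Dogs bark."))
```

The `max(1, ...)` guard is a choice made for the sketch so that very short inputs still return one sentence.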

In [1]:
text = """
There are broadly two types of extractive summarization tasks depending on what the summarization program focuses on. The first is generic summarization, which focuses on obtaining a generic summary or abstract of the collection (whether documents, or sets of images, or videos, news stories etc.). The second is query relevant summarization, sometimes called query-based summarization, which summarizes objects specific to a query. Summarization systems are able to create both query relevant text summaries and generic machine-generated summaries depending on what the user needs.

An example of a summarization problem is document summarization, which attempts to automatically produce an abstract from a given document. Sometimes one might be interested in generating a summary from a single source document, while others can use multiple source documents (for example, a cluster of articles on the same topic). This problem is called multi-document summarization. A related application is summarizing news articles. Imagine a system, which automatically pulls together news articles on a given topic (from the web), and concisely represents the latest news as a summary.

Image collection summarization is another application example of automatic summarization. It consists in selecting a representative set of images from a larger set of images.[9] A summary in this context is useful to show the most representative images of results in an image collection exploration system. Video summarization is a related domain, where the system automatically creates a trailer of a long video. This also has applications in consumer or personal videos, where one might want to skip the boring or repetitive actions. Similarly, in surveillance videos, one would want to extract important and suspicious activity, while ignoring all the boring and redundant frames captured.

At a very high level, summarization algorithms try to find subsets of objects (like set of sentences, or a set of images), which cover information of the entire set. This is also called the core-set. These algorithms model notions like diversity, coverage, information and representativeness of the summary. Query based summarization techniques, additionally model for relevance of the summary with the query. Some techniques and algorithms which naturally model summarization problems are TextRank and PageRank, Submodular set function, Determinantal point process, maximal marginal relevance (MMR) etc.

Keyphrase extraction"""


In [2]:
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from string import punctuation

In [3]:
stopwords = list(STOP_WORDS)
stopwords

['can',
 'becomes',
 'wherein',
 '‘s',
 'their',
 'three',
 'those',
 '‘d',
 'will',
 'perhaps',
 'anyhow',
 'who',
 'for',
 'due',
 'out',
 'whatever',
 'beside',
 'without',
 'yours',
 'yet',
 'during',
 'when',
 'hereby',
 'became',
 'whereas',
 'how',
 'very',
 'nor',
 'say',
 'us',
 'up',
 'itself',
 'other',
 'thereupon',
 'others',
 'ours',
 'themselves',
 're',
 'forty',
 'no',
 'being',
 'ca',
 'together',
 'above',
 'although',
 'everyone',
 'namely',
 'ever',
 'behind',
 'using',
 'our',
 'along',
 'be',
 'become',
 'been',
 '‘re',
 'whence',
 'has',
 'have',
 'well',
 'five',
 'side',
 'go',
 'at',
 'per',
 'whereafter',
 'seeming',
 'nine',
 'under',
 'herself',
 'many',
 'thereby',
 'elsewhere',
 'them',
 'hereupon',
 '‘ll',
 'thru',
 '’ve',
 'mine',
 'an',
 'your',
 'call',
 'about',
 'same',
 'could',
 'show',
 'less',
 'here',
 'you',
 "'d",
 'there',
 'sometime',
 'where',
 'another',
 'and',
 'each',
 'therein',
 'seemed',
 'throughout',
 'n’t',
 'with',
 'was',
 'ge

In [4]:
nlp = spacy.load('en_core_web_sm')

In [5]:
doc = nlp(text)

In [6]:
tokens = [token.text for token in doc]
print(tokens)

['\n', 'There', 'are', 'broadly', 'two', 'types', 'of', 'extractive', 'summarization', 'tasks', 'depending', 'on', 'what', 'the', 'summarization', 'program', 'focuses', 'on', '.', 'The', 'first', 'is', 'generic', 'summarization', ',', 'which', 'focuses', 'on', 'obtaining', 'a', 'generic', 'summary', 'or', 'abstract', 'of', 'the', 'collection', '(', 'whether', 'documents', ',', 'or', 'sets', 'of', 'images', ',', 'or', 'videos', ',', 'news', 'stories', 'etc', '.', ')', '.', 'The', 'second', 'is', 'query', 'relevant', 'summarization', ',', 'sometimes', 'called', 'query', '-', 'based', 'summarization', ',', 'which', 'summarizes', 'objects', 'specific', 'to', 'a', 'query', '.', 'Summarization', 'systems', 'are', 'able', 'to', 'create', 'both', 'query', 'relevant', 'text', 'summaries', 'and', 'generic', 'machine', '-', 'generated', 'summaries', 'depending', 'on', 'what', 'the', 'user', 'needs', '.', '\n\n', 'An', 'example', 'of', 'a', 'summarization', 'problem', 'is', 'document', 'summarizat

In [7]:
punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [8]:
punctuation = punctuation + '\n'
punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~\n'
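
Because `punctuation` is a plain string, `x not in punctuation` is a substring test rather than a per-character check. A single `'\n'` token is filtered out, but a `'\n\n'` token is not, which is consistent with `'\n\n'` appearing in the frequency table printed further down. The sketch below shows the difference; `is_punct_or_space` is a hypothetical helper, not part of the notebook.

```python
from string import punctuation

extended = punctuation + "\n"

# Substring semantics: a single newline is found, a double newline is not.
single_filtered = "\n" in extended      # True
double_filtered = "\n\n" in extended    # False
# Multi-character runs only match if they are adjacent in the string.
run_filtered = "()" in extended         # True, "(" and ")" are neighbours

def is_punct_or_space(token_text):
    """Hypothetical helper: True if the text is non-empty and every
    character is punctuation or whitespace."""
    return bool(token_text) and all(
        ch in punctuation or ch.isspace() for ch in token_text
    )

print(single_filtered, double_filtered, is_punct_or_space("\n\n"))
```

With spaCy tokens, the `is_punct` and `is_space` attributes offer a similar per-token check.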

In [9]:
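word_frequencies = {}
for word in doc:
    # Skip stop words and punctuation; both checks are case-insensitive.
    if word.text.lower() not in stopwords:
        if word.text.lower() not in punctuation:
            # Counts are keyed by the token's original casing.
            if word.text not in word_frequencies.keys():
                word_frequencies[word.text] = 1
            else:
                word_frequencies[word.text] += 1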
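
The loop above stores counts under the token's original casing, so `'Summarization'` and `'summarization'` end up as separate entries, while the scoring cell later looks words up in lowercase. One possible variant keys the table by the lowercased text. The sketch below works on a plain list of strings instead of a spaCy `doc`; `count_words` and its parameters are hypothetical names.

```python
from collections import Counter

def count_words(tokens, stopwords, skip):
    """Count lowercased tokens, skipping stop words and anything in `skip`."""
    counts = Counter()
    for tok in tokens:
        key = tok.lower()
        if key in stopwords or key in skip:
            continue
        counts[key] += 1
    return counts

tokens = ["Summarization", "systems", "are", "able", ".",
          "A", "summarization", "problem", "."]
freqs = count_words(tokens, stopwords={"are", "a"}, skip={"."})
print(freqs)
```

Applying this to the notebook's text would change the counts shown in the outputs, since capitalized and lowercase forms would be merged.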

In [11]:
print(word_frequencies)

{'broadly': 1, 'types': 1, 'extractive': 1, 'summarization': 14, 'tasks': 1, 'depending': 2, 'program': 1, 'focuses': 2, 'generic': 3, 'obtaining': 1, 'summary': 6, 'abstract': 2, 'collection': 3, 'documents': 2, 'sets': 1, 'images': 4, 'videos': 3, 'news': 4, 'stories': 1, 'etc': 2, 'second': 1, 'query': 5, 'relevant': 2, 'called': 3, 'based': 2, 'summarizes': 1, 'objects': 2, 'specific': 1, 'Summarization': 1, 'systems': 1, 'able': 1, 'create': 1, 'text': 1, 'summaries': 2, 'machine': 1, 'generated': 1, 'user': 1, 'needs': 1, '\n\n': 4, 'example': 3, 'problem': 2, 'document': 4, 'attempts': 1, 'automatically': 3, 'produce': 1, 'given': 2, 'interested': 1, 'generating': 1, 'single': 1, 'source': 2, 'use': 1, 'multiple': 1, 'cluster': 1, 'articles': 3, 'topic': 2, 'multi': 1, 'related': 2, 'application': 2, 'summarizing': 1, 'Imagine': 1, 'system': 3, 'pulls': 1, 'web': 1, 'concisely': 1, 'represents': 1, 'latest': 1, 'Image': 1, 'automatic': 1, 'consists': 1, 'selecting': 1, 'represen

In [12]:
max_frequency = max(word_frequencies.values())

In [13]:
max_frequency

14

In [14]:
for word in word_frequencies.keys():
  word_frequencies[word] = word_frequencies[word]/max_frequency
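
The normalization divides every count by the largest one, so the most frequent word gets weight 1.0 and all other weights fall in (0, 1]. A small sketch using a few counts taken from the output above (`normalize` is a hypothetical helper that returns a new dict instead of updating in place):

```python
def normalize(counts):
    """Divide every count by the largest count; returns a new dict."""
    if not counts:
        return {}
    top = max(counts.values())
    return {word: n / top for word, n in counts.items()}

weights = normalize({"summarization": 14, "summary": 6, "query": 5, "broadly": 1})
print(weights)
```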

In [15]:
print(word_frequencies)

{'broadly': 0.07142857142857142, 'types': 0.07142857142857142, 'extractive': 0.07142857142857142, 'summarization': 1.0, 'tasks': 0.07142857142857142, 'depending': 0.14285714285714285, 'program': 0.07142857142857142, 'focuses': 0.14285714285714285, 'generic': 0.21428571428571427, 'obtaining': 0.07142857142857142, 'summary': 0.42857142857142855, 'abstract': 0.14285714285714285, 'collection': 0.21428571428571427, 'documents': 0.14285714285714285, 'sets': 0.07142857142857142, 'images': 0.2857142857142857, 'videos': 0.21428571428571427, 'news': 0.2857142857142857, 'stories': 0.07142857142857142, 'etc': 0.14285714285714285, 'second': 0.07142857142857142, 'query': 0.35714285714285715, 'relevant': 0.14285714285714285, 'called': 0.21428571428571427, 'based': 0.14285714285714285, 'summarizes': 0.07142857142857142, 'objects': 0.14285714285714285, 'specific': 0.07142857142857142, 'Summarization': 0.07142857142857142, 'systems': 0.07142857142857142, 'able': 0.07142857142857142, 'create': 0.07142857

In [16]:
sentence_tokens = [sent for sent in doc.sents]
print(sentence_tokens)

[
, There are broadly two types of extractive summarization tasks depending on what the summarization program focuses on., The first is generic summarization, which focuses on obtaining a generic summary or abstract of the collection (whether documents, or sets of images, or videos, news stories etc.)., The second is query relevant summarization, sometimes called query-based summarization, which summarizes objects specific to a query., Summarization systems are able to create both query relevant text summaries and generic machine-generated summaries depending on what the user needs.

, An example of a summarization problem is document summarization, which attempts to automatically produce an abstract from a given document., Sometimes one might be interested in generating a summary from a single source document, while others can use multiple source documents (for example, a cluster of articles on the same topic)., This problem is called multi-document summarization., A related applicati

In [17]:
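sentence_scores = {}
for sent in sentence_tokens:
    for word in sent:
        # Look words up in lowercase; words missing from the table add nothing.
        if word.text.lower() in word_frequencies.keys():
            # A sentence's score is the sum of its words' normalized frequencies.
            if sent not in sentence_scores.keys():
                sentence_scores[sent] = word_frequencies[word.text.lower()]
            else:
                sentence_scores[sent] += word_frequencies[word.text.lower()]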
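
The scoring rule is a plain sum: each sentence's score is the total weight of the words it contains, and a sentence with no weighted words gets no entry at all. A self-contained sketch of that rule on toy data follows; `score_sentences`, the sample weights, and the use of sentence indices as keys are assumptions made for illustration.

```python
def score_sentences(sentences, weights):
    """Map each sentence index to the sum of its words' weights.
    Sentences without any weighted word are left out."""
    scores = {}
    for i, words in enumerate(sentences):
        total = sum(weights.get(w.lower(), 0.0) for w in words)
        if total > 0:
            scores[i] = total
    return scores

weights = {"summarization": 1.0, "query": 0.5, "generic": 0.25}
sentences = [
    ["Generic", "summarization", "is", "first"],
    ["Query", "summarization", "uses", "a", "query"],
    ["Nothing", "matches", "here"],
]
scores = score_sentences(sentences, weights)
print(scores)
```

Because the score is a sum, longer sentences tend to score higher simply by containing more words.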

In [18]:
sentence_scores

{There are broadly two types of extractive summarization tasks depending on what the summarization program focuses on.: 2.642857142857143,
 The first is generic summarization, which focuses on obtaining a generic summary or abstract of the collection (whether documents, or sets of images, or videos, news stories etc.).: 3.642857142857143,
 The second is query relevant summarization, sometimes called query-based summarization, which summarizes objects specific to a query.: 3.928571428571429,
 Summarization systems are able to create both query relevant text summaries and generic machine-generated summaries depending on what the user needs.
 : 3.0000000000000004,
 An example of a summarization problem is document summarization, which attempts to automatically produce an abstract from a given document.: 3.5714285714285716,
 Sometimes one might be interested in generating a summary from a single source document, while others can use multiple source documents (for example, a cluster of arti

In [19]:
from heapq import nlargest

In [20]:
select_length = int(len(sentence_tokens)*0.3)
select_length

6

In [21]:
summary = nlargest(select_length, sentence_scores, key = sentence_scores.get)
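
`nlargest` returns items in descending score order, which is why the summary printed below opens with a sentence from the final paragraph of the text. If reading order matters, one option is to select by score and then sort the selection back into document order. The sketch assumes scores are keyed by sentence index, which differs from the notebook's span-keyed dict.

```python
from heapq import nlargest

sentences = ["s0", "s1", "s2", "s3", "s4"]
scores = {0: 2.6, 1: 3.6, 2: 3.9, 3: 3.0, 4: 1.2}
k = 3

# Iterating a dict yields its keys; key=scores.get ranks them by score.
by_score = nlargest(k, scores, key=scores.get)
# Restore the original reading order of the selected sentences.
in_document_order = sorted(by_score)
ordered_summary = " ".join(sentences[i] for i in in_document_order)
print(by_score, in_document_order, ordered_summary)
```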

In [22]:
summary

[At a very high level, summarization algorithms try to find subsets of objects (like set of sentences, or a set of images), which cover information of the entire set.,
 The second is query relevant summarization, sometimes called query-based summarization, which summarizes objects specific to a query.,
 The first is generic summarization, which focuses on obtaining a generic summary or abstract of the collection (whether documents, or sets of images, or videos, news stories etc.).,
 An example of a summarization problem is document summarization, which attempts to automatically produce an abstract from a given document.,
 Some techniques and algorithms which naturally model summarization problems are TextRank and PageRank, Submodular set function, Determinantal point process, maximal marginal relevance (MMR) etc.
 
 Keyphrase extraction,
 Summarization systems are able to create both query relevant text summaries and generic machine-generated summaries depending on what the user needs.

In [23]:
final_summary = [sent.text for sent in summary]

In [24]:
summary = ' '.join(final_summary)

In [26]:
print(summary)

At a very high level, summarization algorithms try to find subsets of objects (like set of sentences, or a set of images), which cover information of the entire set. The second is query relevant summarization, sometimes called query-based summarization, which summarizes objects specific to a query. The first is generic summarization, which focuses on obtaining a generic summary or abstract of the collection (whether documents, or sets of images, or videos, news stories etc.). An example of a summarization problem is document summarization, which attempts to automatically produce an abstract from a given document. Some techniques and algorithms which naturally model summarization problems are TextRank and PageRank, Submodular set function, Determinantal point process, maximal marginal relevance (MMR) etc.

Keyphrase extraction Summarization systems are able to create both query relevant text summaries and generic machine-generated summaries depending on what the user needs.




In [27]:
len(summary)

989

In [28]:
len(text)

2498
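
The two lengths give a rough compression ratio. The sketch below only repeats the arithmetic on the character counts reported above.

```python
summary_chars = 989   # len(summary) reported above
text_chars = 2498     # len(text) reported above

ratio = summary_chars / text_chars
print(f"Summary is {ratio:.1%} of the original length")
```

The ratio comes out near 40% even though only 30% of the sentences were selected. A likely explanation is that sum-based scoring favours longer sentences.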