In [2]:
text = """
There are broadly two types of extractive summarization tasks depending on what the summarization program focuses on.
The first is generic summarization, which focuses on obtaining a generic summary or abstract of the collection (whether documents, or sets of images, or videos, news stories etc.).
The second is query relevant summarization, sometimes called query-based summarization, which summarizes objects specific to a query.
Summarization systems are able to create both query relevant text summaries and generic machine-generated summaries depending on what the user needs.
An example of a summarization problem is document summarization, which attempts to automatically produce an abstract from a given document.
Sometimes one might be interested in generating a summary from a single source document, while others can use multiple source documents (for example, a cluster of articles on the same topic).
This problem is called multi-document summarization. A related application is summarizing news articles.
Imagine a system, which automatically pulls together news articles on a given topic (from the web), and concisely represents the latest news as a summary.
Image collection summarization is another application example of automatic summarization.
It consists in selecting a representative set of images from a larger set of images.
A summary in this context is useful to show the most representative images of results in an image collection exploration system.
Video summarization is a related domain, where the system automatically creates a trailer of a long video. This also has applications in consumer or personal videos, where one might want to skip the boring or repetitive actions.
Similarly, in surveillance videos, one would want to extract important and suspicious activity, while ignoring all the boring and redundant frames captured."""

In [3]:
import nltk
from nltk.corpus import stopwords
from string import punctuation

In [4]:
# nltk.download('stopwords')
stops=list(stopwords.words('english'))
stops

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

TOKENIZING SENTENCES

In [22]:
# nltk.download('punkt')
from nltk.tokenize import sent_tokenize
sentences=sent_tokenize(text)
# print(type(sentences))
print(sentences)

['\nThere are broadly two types of extractive summarization tasks depending on what the summarization program focuses on.', 'The first is generic summarization, which focuses on obtaining a generic summary or abstract of the collection (whether documents, or sets of images, or videos, news stories etc.).', 'The second is query relevant summarization, sometimes called query-based summarization, which summarizes objects specific to a query.', 'Summarization systems are able to create both query relevant text summaries and generic machine-generated summaries depending on what the user needs.', 'An example of a summarization problem is document summarization, which attempts to automatically produce an abstract from a given document.', 'Sometimes one might be interested in generating a summary from a single source document, while others can use multiple source documents (for example, a cluster of articles on the same topic).', 'This problem is called multi-document summarization.', 'A relat

TOKENIZING WORDS

In [6]:
from nltk.tokenize import word_tokenize
words=word_tokenize(text)
print(words)

['There', 'are', 'broadly', 'two', 'types', 'of', 'extractive', 'summarization', 'tasks', 'depending', 'on', 'what', 'the', 'summarization', 'program', 'focuses', 'on', '.', 'The', 'first', 'is', 'generic', 'summarization', ',', 'which', 'focuses', 'on', 'obtaining', 'a', 'generic', 'summary', 'or', 'abstract', 'of', 'the', 'collection', '(', 'whether', 'documents', ',', 'or', 'sets', 'of', 'images', ',', 'or', 'videos', ',', 'news', 'stories', 'etc.', ')', '.', 'The', 'second', 'is', 'query', 'relevant', 'summarization', ',', 'sometimes', 'called', 'query-based', 'summarization', ',', 'which', 'summarizes', 'objects', 'specific', 'to', 'a', 'query', '.', 'Summarization', 'systems', 'are', 'able', 'to', 'create', 'both', 'query', 'relevant', 'text', 'summaries', 'and', 'generic', 'machine-generated', 'summaries', 'depending', 'on', 'what', 'the', 'user', 'needs', '.', 'An', 'example', 'of', 'a', 'summarization', 'problem', 'is', 'document', 'summarization', ',', 'which', 'attempts', 't

MAKING WORD FREQUENCY TABLE


In [7]:
word_frequency={}
for word in words:
    token=word.lower()
    # skip stop words and single-character punctuation tokens
    if token not in stops and token not in punctuation:
        word_frequency[token]=word_frequency.get(token,0)+1

print(word_frequency)

{'broadly': 1, 'two': 1, 'types': 1, 'extractive': 1, 'summarization': 12, 'tasks': 1, 'depending': 2, 'program': 1, 'focuses': 2, 'first': 1, 'generic': 3, 'obtaining': 1, 'summary': 4, 'abstract': 2, 'collection': 3, 'whether': 1, 'documents': 2, 'sets': 1, 'images': 4, 'videos': 3, 'news': 4, 'stories': 1, 'etc.': 1, 'second': 1, 'query': 3, 'relevant': 2, 'sometimes': 2, 'called': 2, 'query-based': 1, 'summarizes': 1, 'objects': 1, 'specific': 1, 'systems': 1, 'able': 1, 'create': 1, 'text': 1, 'summaries': 2, 'machine-generated': 1, 'user': 1, 'needs': 1, 'example': 3, 'problem': 2, 'document': 3, 'attempts': 1, 'automatically': 3, 'produce': 1, 'given': 2, 'one': 3, 'might': 2, 'interested': 1, 'generating': 1, 'single': 1, 'source': 2, 'others': 1, 'use': 1, 'multiple': 1, 'cluster': 1, 'articles': 3, 'topic': 2, 'multi-document': 1, 'related': 2, 'application': 2, 'summarizing': 1, 'imagine': 1, 'system': 3, 'pulls': 1, 'together': 1, 'web': 1, 'concisely': 1, 'represents': 1, 
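The same frequency table can be sketched with `collections.Counter`. This is only an illustration, not the notebook's cell: the helper name `build_word_frequency` and the toy token list are made up, and it deliberately differs in one respect by dropping any token made up entirely of punctuation characters (for example a two-character quote token), which the substring check above would keep.

```python
from collections import Counter
from string import punctuation


def build_word_frequency(tokens, stop_words):
    """Count lower-cased tokens, skipping stop words and punctuation-only tokens.

    Hypothetical helper for illustration; not part of NLTK or the notebook.
    """
    stop_words = set(stop_words)   # set lookup is faster than scanning a list
    punct_chars = set(punctuation)
    counts = Counter()
    for token in tokens:
        token = token.lower()
        if token in stop_words:
            continue
        # drop tokens consisting only of punctuation characters, e.g. "." or "``"
        if all(ch in punct_chars for ch in token):
            continue
        counts[token] += 1
    return counts


# toy input standing in for the output of word_tokenize(text)
demo_tokens = ["Summarization", "is", "useful", ".", "Summarization",
               "systems", "summarize", ",", "``"]
demo_counts = build_word_frequency(demo_tokens, stop_words=["is"])
print(demo_counts)
```

`Counter` also offers `most_common(n)`, which is handy for checking which words will dominate the sentence scores later on.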

NORMALIZING WORD FREQUENCIES 

In [10]:
maxi=max(word_frequency.values())
for key in word_frequency.keys():
    word_frequency[key]/=maxi

print(word_frequency)

{'broadly': 0.08333333333333333, 'two': 0.08333333333333333, 'types': 0.08333333333333333, 'extractive': 0.08333333333333333, 'summarization': 1.0, 'tasks': 0.08333333333333333, 'depending': 0.16666666666666666, 'program': 0.08333333333333333, 'focuses': 0.16666666666666666, 'first': 0.08333333333333333, 'generic': 0.25, 'obtaining': 0.08333333333333333, 'summary': 0.3333333333333333, 'abstract': 0.16666666666666666, 'collection': 0.25, 'whether': 0.08333333333333333, 'documents': 0.16666666666666666, 'sets': 0.08333333333333333, 'images': 0.3333333333333333, 'videos': 0.25, 'news': 0.3333333333333333, 'stories': 0.08333333333333333, 'etc.': 0.08333333333333333, 'second': 0.08333333333333333, 'query': 0.25, 'relevant': 0.16666666666666666, 'sometimes': 0.16666666666666666, 'called': 0.16666666666666666, 'query-based': 0.08333333333333333, 'summarizes': 0.08333333333333333, 'objects': 0.08333333333333333, 'specific': 0.08333333333333333, 'systems': 0.08333333333333333, 'able': 0.0833333
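The normalization divides every count by the largest count, so the most frequent word ends up with weight 1.0. A minimal sketch of the same step that returns a new dictionary instead of overwriting the counts is shown here; the function name `normalize_frequencies` is hypothetical, and the sample counts are a few of the values printed above.

```python
def normalize_frequencies(counts):
    """Return a new dict with every count divided by the largest count."""
    if not counts:
        return {}  # avoid calling max() on an empty table
    top = max(counts.values())
    return {word: count / top for word, count in counts.items()}


# a few counts taken from the frequency table printed earlier
sample_counts = {"summarization": 12, "summary": 4, "generic": 3, "broadly": 1}
sample_weights = normalize_frequencies(sample_counts)
print(sample_weights)
```

Keeping the raw counts untouched makes it possible to inspect them again later, at the cost of holding two dictionaries in memory.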

In [31]:
sentence_importance={}

for sent in sentences:
    # str.split() only splits on whitespace, so tokens such as "summarization," keep
    # their trailing punctuation and do not match any key in word_frequency
    for token in sent.split():
        token=token.lower()
        if token in word_frequency:
            sentence_importance[sent]=sentence_importance.get(sent,0)+word_frequency[token]

# scale every score by the highest one so the top sentence scores 1.0
maximum=max(sentence_importance.values())

for key in sentence_importance:
    sentence_importance[key]/=maximum
sentence_importance

{'\nThere are broadly two types of extractive summarization tasks depending on what the summarization program focuses on.': 1.0,
 'The first is generic summarization, which focuses on obtaining a generic summary or abstract of the collection (whether documents, or sets of images, or videos, news stories etc.).': 0.735294117647059,
 'The second is query relevant summarization, sometimes called query-based summarization, which summarizes objects specific to a query.': 0.4117647058823529,
 'Summarization systems are able to create both query relevant text summaries and generic machine-generated summaries depending on what the user needs.': 0.9411764705882354,
 'An example of a summarization problem is document summarization, which attempts to automatically produce an abstract from a given document.': 0.8529411764705883,
 'Sometimes one might be interested in generating a summary from a single source document, while others can use multiple source documents (for example, a cluster of articl
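Because the scoring cell splits each sentence on whitespace, a word followed directly by a comma or period is not found in the frequency table. The sketch below shows one possible variant that strips punctuation before the lookup. It is an assumption-laden illustration: the regular expression is a simple stand-in for a real tokenizer (it is not NLTK's `word_tokenize`), the name `score_sentences` is hypothetical, and the weights are toy values, so its scores will not match the ones printed above.

```python
import re

# letters/digits, optionally joined by hyphens or apostrophes (e.g. "query-based")
TOKEN_PATTERN = re.compile(r"[a-z0-9]+(?:[-'][a-z0-9]+)*")


def score_sentences(sentences, word_weights):
    """Sum the weights of the words in each sentence, then scale to a 0..1 range."""
    scores = {}
    for sentence in sentences:
        tokens = TOKEN_PATTERN.findall(sentence.lower())
        score = sum(word_weights.get(token, 0.0) for token in tokens)
        if score > 0:  # sentences without any weighted word are left out
            scores[sentence] = score
    if not scores:
        return {}
    top = max(scores.values())
    return {sentence: score / top for sentence, score in scores.items()}


# toy weights and sentences, for illustration only
toy_weights = {"summarization": 1.0, "generic": 0.25, "query": 0.25}
toy_sentences = [
    "The first is generic summarization, which is broad.",
    "The second answers a query.",
]
toy_scores = score_sentences(toy_sentences, toy_weights)
print(toy_scores)
```

With plain `split()`, the token `summarization,` in the first toy sentence would be missed, so both sentences would receive the same raw score of 0.25.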

In [32]:
from heapq import nlargest
sentences_selected=int(len(sentences)*0.2)
final_sentences=nlargest(sentences_selected,sentence_importance,key=sentence_importance.get)
print(final_sentences)

['\nThere are broadly two types of extractive summarization tasks depending on what the summarization program focuses on.', 'Summarization systems are able to create both query relevant text summaries and generic machine-generated summaries depending on what the user needs.', 'An example of a summarization problem is document summarization, which attempts to automatically produce an abstract from a given document.']


In [33]:
summary=' '.join(final_sentences)
print(summary)


There are broadly two types of extractive summarization tasks depending on what the summarization program focuses on. Summarization systems are able to create both query relevant text summaries and generic machine-generated summaries depending on what the user needs. An example of a summarization problem is document summarization, which attempts to automatically produce an abstract from a given document.
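`nlargest` returns the selected sentences from highest to lowest score, which is not necessarily the order in which they appear in the text, and the first sentence still carries its leading newline. A small sketch of one way to handle both follows; the function name `pick_summary`, the `ratio` parameter, and the toy sentences and scores are all hypothetical.

```python
from heapq import nlargest


def pick_summary(sentences, scores, ratio=0.2):
    """Join the top-scoring sentences in their original document order."""
    count = max(1, int(len(sentences) * ratio))  # always keep at least one sentence
    chosen = set(nlargest(count, scores, key=scores.get))
    # walk the sentences in document order and trim stray whitespace
    ordered = [s.strip() for s in sentences if s in chosen]
    return " ".join(ordered)


# toy data, for illustration only
toy_sentences = [
    "\nAlpha opens the text.",
    "Beta adds detail.",
    "Gamma matters most.",
    "Delta wraps up.",
    "Epsilon is extra.",
]
toy_scores = dict(zip(toy_sentences, [0.6, 0.2, 1.0, 0.1, 0.3]))
toy_summary = pick_summary(toy_sentences, toy_scores, ratio=0.4)
print(toy_summary)
# prints: Alpha opens the text. Gamma matters most.
```

Keeping document order tends to make an extractive summary easier to read, since sentences that refer back to earlier ones stay after them.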
