## Text summarization with spaCy

Example text

In [1]:
text = 'In 1940, an advertising executive named James Webb Young published a short guide titled, A Technique for Producing Ideas. In this guide, he made a simple, but profound statement about generating creative ideas.According to Young, innovative ideas happen when you develop new combinations of old elements. In other words, creative thinking is not about generating something new from a blank slate, but rather about taking what is already present and combining those bits and pieces in a way that has not been done previously.Most important, the ability to generate new combinations hinges upon your ability to see the relationships between concepts. If you can form a new link between two old ideas, you have done something creative.Young believed this process of creative connection always occurred in five steps.Gather new material. At first, you learn. During this stage you focus on 1) learning specific material directly related to your task and 2) learning general material by becoming fascinated with a wide range of conceptsThoroughly work over the materials in your mind. During this stage, you examine what you have learned by looking at the facts from different angles and experimenting with fitting various ideas togetherStep away from the problem. Next, you put the problem completely out of your mind and go do something else that excites you and energizes youLet your idea return to you. At some point, but only after you have stopped thinking about it, your idea will come back to you with a flash of insight and renewed energy.Shape and develop your idea based on feedback. For any idea to succeed, you must release it out into the world, submit it to criticism, and adapt it as needed.'

Installing the spaCy library and its small English model

In [2]:
#!pip install -U spacy
#!python -m spacy download en_core_web_sm

### Importing necessary libraries

In [3]:
import spacy
from string import punctuation
from spacy.lang.en.stop_words import STOP_WORDS

Converting the text to a spaCy Doc

In [4]:
nlp = spacy.load('en_core_web_sm')
doc = nlp(text)
doc

In 1940, an advertising executive named James Webb Young published a short guide titled, A Technique for Producing Ideas. In this guide, he made a simple, but profound statement about generating creative ideas.According to Young, innovative ideas happen when you develop new combinations of old elements. In other words, creative thinking is not about generating something new from a blank slate, but rather about taking what is already present and combining those bits and pieces in a way that has not been done previously.Most important, the ability to generate new combinations hinges upon your ability to see the relationships between concepts. If you can form a new link between two old ideas, you have done something creative.Young believed this process of creative connection always occurred in five steps.Gather new material. At first, you learn. During this stage you focus on 1) learning specific material directly related to your task and 2) learning general material by becoming fascinate

## Preprocessing 

Word tokenization: iterating over the doc yields tokens, and .text returns each token as a string

In [5]:
words = [word.text for word in doc]
words

['In',
 '1940',
 ',',
 'an',
 'advertising',
 'executive',
 'named',
 'James',
 'Webb',
 'Young',
 'published',
 'a',
 'short',
 'guide',
 'titled',
 ',',
 'A',
 'Technique',
 'for',
 'Producing',
 'Ideas',
 '.',
 'In',
 'this',
 'guide',
 ',',
 'he',
 'made',
 'a',
 'simple',
 ',',
 'but',
 'profound',
 'statement',
 'about',
 'generating',
 'creative',
 'ideas',
 '.',
 'According',
 'to',
 'Young',
 ',',
 'innovative',
 'ideas',
 'happen',
 'when',
 'you',
 'develop',
 'new',
 'combinations',
 'of',
 'old',
 'elements',
 '.',
 'In',
 'other',
 'words',
 ',',
 'creative',
 'thinking',
 'is',
 'not',
 'about',
 'generating',
 'something',
 'new',
 'from',
 'a',
 'blank',
 'slate',
 ',',
 'but',
 'rather',
 'about',
 'taking',
 'what',
 'is',
 'already',
 'present',
 'and',
 'combining',
 'those',
 'bits',
 'and',
 'pieces',
 'in',
 'a',
 'way',
 'that',
 'has',
 'not',
 'been',
 'done',
 'previously',
 '.',
 'Most',
 'important',
 ',',
 'the',
 'ability',
 'to',
 'generate',
 'new',
 '

Below are the stop words that ship with spaCy

In [6]:
stopwords = STOP_WORDS
stopwords 

{"'d",
 "'ll",
 "'m",
 "'re",
 "'s",
 "'ve",
 'a',
 'about',
 'above',
 'across',
 'after',
 'afterwards',
 'again',
 'against',
 'all',
 'almost',
 'alone',
 'along',
 'already',
 'also',
 'although',
 'always',
 'am',
 'among',
 'amongst',
 'amount',
 'an',
 'and',
 'another',
 'any',
 'anyhow',
 'anyone',
 'anything',
 'anyway',
 'anywhere',
 'are',
 'around',
 'as',
 'at',
 'back',
 'be',
 'became',
 'because',
 'become',
 'becomes',
 'becoming',
 'been',
 'before',
 'beforehand',
 'behind',
 'being',
 'below',
 'beside',
 'besides',
 'between',
 'beyond',
 'both',
 'bottom',
 'but',
 'by',
 'ca',
 'call',
 'can',
 'cannot',
 'could',
 'did',
 'do',
 'does',
 'doing',
 'done',
 'down',
 'due',
 'during',
 'each',
 'eight',
 'either',
 'eleven',
 'else',
 'elsewhere',
 'empty',
 'enough',
 'even',
 'ever',
 'every',
 'everyone',
 'everything',
 'everywhere',
 'except',
 'few',
 'fifteen',
 'fifty',
 'first',
 'five',
 'for',
 'former',
 'formerly',
 'forty',
 'four',
 'from',
 'fron

These are the punctuation characters available in string.punctuation. More characters can be appended, for example punctuation + '\n'.
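
A minimal sketch of extending the punctuation string, using only the standard library. One detail worth knowing: `in` on a string is a substring test, so converting to a set (the name `punctuation_set` is introduced here for illustration) gives an exact single-character match instead.

```python
from string import punctuation

# Extend the default ASCII punctuation with a newline character.
punctuations = punctuation + '\n'

# `in` on a string checks for a substring, so multi-character runs
# and even the empty string are reported as present.
print('\n' in punctuations)  # True
print('()' in punctuations)  # True
print('' in punctuations)    # True

# A set only matches whole single characters (illustrative name).
punctuation_set = set(punctuations)
print('()' in punctuation_set)  # False
print('!' in punctuation_set)   # True
```

For tokens produced by a tokenizer the two checks usually agree, since punctuation tokens tend to be single characters, but the set makes the intent explicit.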

In [7]:
punctuations = punctuation
punctuations

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

Finding the frequency of each word.
Here:

    1. Check whether the lowercased token is in the stopwords variable. Only tokens that are not stop words move on to the next check.
    
    2. Check whether the lowercased token is in the punctuations variable. Only tokens that are not punctuation move on to the next check.
    
    3. If the lowercased token is not yet a key in the word_frequency dictionary, add it with a value of 1.
    
    4. Otherwise, increment its existing value.
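
The same four steps can be sketched more compactly with `collections.Counter`. This is an illustration only: the token list is typed by hand from two sentences of the example text, and the small `stopwords` set is a hypothetical stand-in for spaCy's STOP_WORDS, so the block runs without spaCy installed.

```python
from collections import Counter
from string import punctuation

# Hypothetical stand-in for spaCy's STOP_WORDS (assumption of this sketch).
stopwords = {'in', 'this', 'he', 'made', 'a', 'but', 'about',
             'to', 'when', 'you', 'of'}

# Hand-written tokens for two sentences of the example text.
tokens = ['In', 'this', 'guide', ',', 'he', 'made', 'a', 'simple', ',',
          'but', 'profound', 'statement', 'about', 'generating',
          'creative', 'ideas', '.',
          'According', 'to', 'Young', ',', 'innovative', 'ideas', 'happen',
          'when', 'you', 'develop', 'new', 'combinations', 'of', 'old',
          'elements', '.']

punctuation_chars = set(punctuation)

# Counter does the "set to 1, otherwise increment" bookkeeping itself.
word_frequency = Counter(
    token.lower()
    for token in tokens
    if token.lower() not in stopwords
    and token.lower() not in punctuation_chars
)

print(word_frequency.most_common(3))
```

`Counter` is a `dict` subclass, so later steps that read `word_frequency[word]` or call `.values()` work unchanged.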


In [8]:
word_frequency = {}

for word in doc:
    if word.text.lower() not in stopwords:
        if word.text.lower() not in punctuations:
            if word.text.lower() not in word_frequency.keys():
                word_frequency[word.text.lower()]= 1
            else:
                word_frequency[word.text.lower()] +=1

In [9]:
word_frequency

{'1940': 1,
 'advertising': 1,
 'executive': 1,
 'named': 1,
 'james': 1,
 'webb': 1,
 'young': 3,
 'published': 1,
 'short': 1,
 'guide': 2,
 'titled': 1,
 'technique': 1,
 'producing': 1,
 'ideas': 5,
 'simple': 1,
 'profound': 1,
 'statement': 1,
 'generating': 2,
 'creative': 4,
 'according': 1,
 'innovative': 1,
 'happen': 1,
 'develop': 2,
 'new': 5,
 'combinations': 2,
 'old': 2,
 'elements': 1,
 'words': 1,
 'thinking': 2,
 'blank': 1,
 'slate': 1,
 'taking': 1,
 'present': 1,
 'combining': 1,
 'bits': 1,
 'pieces': 1,
 'way': 1,
 'previously': 1,
 'important': 1,
 'ability': 2,
 'generate': 1,
 'hinges': 1,
 'relationships': 1,
 'concepts': 1,
 'form': 1,
 'link': 1,
 'believed': 1,
 'process': 1,
 'connection': 1,
 'occurred': 1,
 'steps': 1,
 'gather': 1,
 'material': 3,
 'learn': 1,
 'stage': 2,
 'focus': 1,
 '1': 1,
 'learning': 2,
 'specific': 1,
 'directly': 1,
 'related': 1,
 'task': 1,
 '2': 1,
 'general': 1,
 'fascinated': 1,
 'wide': 1,
 'range': 1,
 'conceptsthoroug

Finding the maximum frequency value

In [10]:
max_frequency = max(word_frequency.values())
max_frequency

5

To normalize the frequencies, divide each value by the maximum frequency
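
As a small worked example of this normalization, the sketch below takes a few raw counts from the word_frequency output above and scales them so that the most frequent word gets 1.0. Building a new dictionary (named `normalized` here) instead of updating in place is a choice of this sketch.

```python
# Raw counts for a few words, copied from the word_frequency output above.
word_frequency = {'ideas': 5, 'creative': 4, 'young': 3, 'guide': 2, '1940': 1}

max_frequency = max(word_frequency.values())

# Divide every count by the largest count; the result lies in (0, 1].
normalized = {word: count / max_frequency
              for word, count in word_frequency.items()}

print(normalized)
# {'ideas': 1.0, 'creative': 0.8, 'young': 0.6, 'guide': 0.4, '1940': 0.2}
```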

In [11]:
for word in word_frequency.keys():
    word_frequency[word]= word_frequency[word]/max_frequency
    

In [12]:
word_frequency

{'1940': 0.2,
 'advertising': 0.2,
 'executive': 0.2,
 'named': 0.2,
 'james': 0.2,
 'webb': 0.2,
 'young': 0.6,
 'published': 0.2,
 'short': 0.2,
 'guide': 0.4,
 'titled': 0.2,
 'technique': 0.2,
 'producing': 0.2,
 'ideas': 1.0,
 'simple': 0.2,
 'profound': 0.2,
 'statement': 0.2,
 'generating': 0.4,
 'creative': 0.8,
 'according': 0.2,
 'innovative': 0.2,
 'happen': 0.2,
 'develop': 0.4,
 'new': 1.0,
 'combinations': 0.4,
 'old': 0.4,
 'elements': 0.2,
 'words': 0.2,
 'thinking': 0.4,
 'blank': 0.2,
 'slate': 0.2,
 'taking': 0.2,
 'present': 0.2,
 'combining': 0.2,
 'bits': 0.2,
 'pieces': 0.2,
 'way': 0.2,
 'previously': 0.2,
 'important': 0.2,
 'ability': 0.4,
 'generate': 0.2,
 'hinges': 0.2,
 'relationships': 0.2,
 'concepts': 0.2,
 'form': 0.2,
 'link': 0.2,
 'believed': 0.2,
 'process': 0.2,
 'connection': 0.2,
 'occurred': 0.2,
 'steps': 0.2,
 'gather': 0.2,
 'material': 0.6,
 'learn': 0.2,
 'stage': 0.4,
 'focus': 0.2,
 '1': 0.2,
 'learning': 0.4,
 'specific': 0.2,
 'directl

Performing sentence tokenization

In [13]:
sent_token = [sent for sent in doc.sents]
sent_token

[In 1940, an advertising executive named James Webb Young published a short guide titled, A Technique for Producing Ideas.,
 In this guide, he made a simple, but profound statement about generating creative ideas.,
 According to Young, innovative ideas happen when you develop new combinations of old elements.,
 In other words, creative thinking is not about generating something new from a blank slate, but rather about taking what is already present and combining those bits and pieces in a way that has not been done previously.,
 Most important, the ability to generate new combinations hinges upon your ability to see the relationships between concepts.,
 If you can form a new link between two old ideas, you have done something creative.,
 Young believed this process of creative connection always occurred in five steps.,
 Gather new material.,
 At first, you learn.,
 During this stage you focus on 1) learning specific material directly related to your task and 2) learning general materia

Finding each sentence's score by summing the normalized frequencies of the words it contains
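
The scoring rule can be sketched without spaCy as a sum over a sentence's tokens. The weights below are copied from the normalized output above; the pre-tokenized sentences and the helper name `score_sentence` are assumptions made for this illustration.

```python
# Normalized weights copied from the output above.
word_frequency = {'new': 1.0, 'material': 0.6, 'gather': 0.2, 'learn': 0.2}

# Hand-tokenized sentences (assumption: stands in for doc.sents).
sentences = {
    'Gather new material.': ['Gather', 'new', 'material', '.'],
    'At first, you learn.': ['At', 'first', ',', 'you', 'learn', '.'],
}

def score_sentence(tokens, weights):
    """Sum the weights of the tokens; unknown tokens contribute 0."""
    return sum(weights.get(token.lower(), 0.0) for token in tokens)

sent_frequency = {sentence: score_sentence(tokens, word_frequency)
                  for sentence, tokens in sentences.items()}

for sentence, score in sent_frequency.items():
    print(f'{score:.1f}  {sentence}')
```

Unlike the nested-`if` version, this sketch also records a score of 0.0 for a sentence that contains no weighted words. Because the score is a plain sum, longer sentences tend to score higher.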

In [14]:
sent_frequency= {}

for sent in doc.sents:
    for word in sent:
        if word.text.lower() in word_frequency.keys():
            if sent not in sent_frequency.keys():
                sent_frequency[sent] = word_frequency[word.text.lower()]
            else:
                sent_frequency[sent] += word_frequency[word.text.lower()]
                
        

In [15]:
sent_frequency

{In 1940, an advertising executive named James Webb Young published a short guide titled, A Technique for Producing Ideas.: 4.2,
 In this guide, he made a simple, but profound statement about generating creative ideas.: 3.2,
 According to Young, innovative ideas happen when you develop new combinations of old elements.: 4.6000000000000005,
 In other words, creative thinking is not about generating something new from a blank slate, but rather about taking what is already present and combining those bits and pieces in a way that has not been done previously.: 4.600000000000001,
 Most important, the ability to generate new combinations hinges upon your ability to see the relationships between concepts.: 3.2000000000000006,
 If you can form a new link between two old ideas, you have done something creative.: 3.5999999999999996,
 Young believed this process of creative connection always occurred in five steps.: 2.4000000000000004,
 Gather new material.: 1.7999999999999998,
 At first, you le

In [16]:
from heapq import nlargest

max_sentence is the number of sentences we would like in the summary, here 30% of the sentence count, rounded down
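
Since `int()` truncates, a very short text could end up with a summary size of 0. A minimal sketch of one way to guard against that follows; the helper `summary_size` and its floor of one sentence are assumptions added here, not part of the original notebook.

```python
def summary_size(n_sentences, ratio=0.3):
    """Number of sentences to keep (hypothetical helper for this sketch)."""
    # int() drops the fractional part, so 2 sentences * 0.3 would give 0;
    # max(1, ...) keeps at least one sentence.
    return max(1, int(n_sentences * ratio))

print(summary_size(10))  # 3
print(summary_size(2))   # 1
```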

In [17]:
max_sentence = int(len(sent_token)*0.3)
max_sentence

4

Using nlargest to find the 4 sentences with the highest scores
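
To show how `nlargest` behaves on a dictionary, here is a small sketch with made-up sentence keys and scores. It also shows an optional extra step, not in the original notebook, that puts the selected sentences back into document order, since `nlargest` returns them ranked by score.

```python
from heapq import nlargest

# Hypothetical scores keyed by sentence text, inserted in document order.
sent_frequency = {
    'First sentence.': 4.2,
    'Second sentence.': 3.2,
    'Third sentence.': 4.6,
    'Fourth sentence.': 1.8,
}

# Iterating a dict yields its keys; key=sent_frequency.get ranks them by score.
top = nlargest(2, sent_frequency, key=sent_frequency.get)
print(top)  # ['Third sentence.', 'First sentence.']

# Optional: restore document order (dicts keep insertion order in Python 3.7+).
position = {sentence: i for i, sentence in enumerate(sent_frequency)}
in_order = sorted(top, key=position.get)
print(in_order)  # ['First sentence.', 'Third sentence.']
```

Whether to reorder is a design choice: score order puts the highest-scoring sentence first, while document order usually reads more naturally.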

In [18]:
summary = nlargest(max_sentence,sent_frequency, sent_frequency.get)
summary 

[During this stage you focus on 1) learning specific material directly related to your task and 2) learning general material by becoming fascinated with a wide range of conceptsThoroughly work over the materials in your mind.,
 In other words, creative thinking is not about generating something new from a blank slate, but rather about taking what is already present and combining those bits and pieces in a way that has not been done previously.,
 According to Young, innovative ideas happen when you develop new combinations of old elements.,
 In 1940, an advertising executive named James Webb Young published a short guide titled, A Technique for Producing Ideas.]

In [19]:
fs = [i.text for i in summary]
fs

['During this stage you focus on 1) learning specific material directly related to your task and 2) learning general material by becoming fascinated with a wide range of conceptsThoroughly work over the materials in your mind.',
 'In other words, creative thinking is not about generating something new from a blank slate, but rather about taking what is already present and combining those bits and pieces in a way that has not been done previously.',
 'According to Young, innovative ideas happen when you develop new combinations of old elements.',
 'In 1940, an advertising executive named James Webb Young published a short guide titled, A Technique for Producing Ideas.']

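Joining the selected sentences into a single string. Note that ''.join inserts no separator, so the sentences run together; ' '.join(fs) would put a space between them.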

In [20]:
summary = ''.join(fs)
summary

'During this stage you focus on 1) learning specific material directly related to your task and 2) learning general material by becoming fascinated with a wide range of conceptsThoroughly work over the materials in your mind.In other words, creative thinking is not about generating something new from a blank slate, but rather about taking what is already present and combining those bits and pieces in a way that has not been done previously.According to Young, innovative ideas happen when you develop new combinations of old elements.In 1940, an advertising executive named James Webb Young published a short guide titled, A Technique for Producing Ideas.'

In [21]:
len(text)

1702

In [22]:
len(summary)

658

Measured in characters, the summary is roughly 39% of the length of the original text (658 / 1702).