### Extractive Summarization

Here we use a very simple algorithm to create a text summary. Text summarization comes in two types: **Extractive Summarization** and **Abstractive Summarization**. We will build the summary using the **Extractive Summarization Technique**.

In [1]:
## Import Spacy 
import spacy # for filtering and text processing
from collections import Counter # to monitor the word count
from string import punctuation  # to create the list of punctuations that need to be removed

In [3]:
# Load the large English model
nlp = spacy.load('en_core_web_lg')

We are interested in extracting the keywords from the long text, sentence by sentence. The following steps need to be followed in order to extract keywords:

- Tokenize the text
- Remove stopwords and punctuation
- Keep only the words with the necessary POS tags (proper nouns, adjectives and nouns)
- Selectively lowercase words

In [6]:
# get the list of stop words
from stop_words import get_stop_words
STOPWORDS = get_stop_words('en')

In [8]:
len(STOPWORDS)

174

In [10]:
len(punctuation)

32

So we are removing 32 punctuation characters and 174 stopwords.

In [24]:
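# create function to extract keywords
def extract_words(text):
    """Return the proper nouns, adjectives and nouns in `text`, skipping stopwords and punctuation."""
    result = []
    # run the text through the spaCy pipeline (tokenization, POS tagging)
    doc = nlp(text)
    for token in doc:
        # skip stopwords and punctuation
        if token.text in STOPWORDS or token.text in punctuation:
            continue
        # keep only proper nouns, adjectives and nouns
        if token.pos_ in ['PROPN', 'ADJ', 'NOUN']:
            result.append(token.text)

    return result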
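
A detail worth keeping in mind about the membership test against `punctuation` in `extract_words`: `string.punctuation` is a single 32-character string, so `in` performs a substring test rather than a per-character lookup. The sketch below illustrates the difference using only the standard library; `PUNCT_SET` and `is_punct` are hypothetical names introduced for this illustration and are not part of the notebook.

```python
from string import punctuation

# `punctuation` is one string, so `in` checks whether the token is a substring of it.
print("," in punctuation)    # True: a single punctuation character
print("..." in punctuation)  # False: "..." is not a contiguous substring
print("" in punctuation)     # True: the empty string is a substring of any string

# Hypothetical alternative: treat a token as punctuation when every character is one.
PUNCT_SET = set(punctuation)

def is_punct(token_text):
    """True when a non-empty token consists only of punctuation characters."""
    return bool(token_text) and all(ch in PUNCT_SET for ch in token_text)

print(is_punct("..."))   # True
print(is_punct("H&M."))  # False: contains letters
```

In the notebook itself the POS filter runs afterwards and only keeps proper nouns, adjectives and nouns, so tokens tagged as punctuation would likely be dropped there anyway.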

In [89]:
# create a function that takes the keyword lists and scores each word by its relative frequency
def norm_scores(list_words):
    # list_words is a list of lists of words, one list per sentence, built with extract_words
    lw = [item for sublist in list_words for item in sublist]
    word_count = Counter(lw)
    # frequency of the most common keyword, used as the normalizer
    max_freq = word_count.most_common(1)[0][1]
    for k in word_count:
        word_count[k] /= max_freq
    # returns a Counter mapping each keyword to its frequency divided by max_freq
    return word_count
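
To make the normalization in `norm_scores` concrete, here is a small standard-library sketch on toy data. The function name `normalize_counts` and the toy keyword lists are assumptions made for illustration. Unlike the notebook's version, this sketch returns a plain dict and guards against an input with no keywords at all, a case in which `most_common(1)[0]` would raise an `IndexError`.

```python
from collections import Counter

def normalize_counts(keyword_lists):
    """Map each keyword to its count divided by the count of the most frequent keyword."""
    flat = [word for sublist in keyword_lists for word in sublist]
    counts = Counter(flat)
    if not counts:  # no keywords at all: nothing to normalize
        return {}
    max_freq = max(counts.values())
    return {word: count / max_freq for word, count in counts.items()}

# Toy input: "fashion" occurs twice, the other keywords once.
toy = [["fashion", "labor"], ["clothing", "fashion"], []]
scores = normalize_counts(toy)
print(scores)  # {'fashion': 1.0, 'labor': 0.5, 'clothing': 0.5}
```

The most frequent keyword always gets a score of 1.0, and every other keyword gets a fraction of that.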

In [100]:
# create a function that calculates the importance of each sentence as the sum of the
# normalized frequency scores of its keywords, so both the number of keywords and
# their relative frequency in the text contribute
def sent_score(text, word_count):
    list_score = []
    for sent in nlp(text).sents:
        keywords = extract_words(str(sent))
        if len(keywords) == 0:
            list_score.append(0)
            continue
        s = 0
        for w in keywords:
            s += word_count[w]
        list_score.append(s)
    return list_score

Note that `text` holds the passage to be summarized; its definition is not shown in the cells here.

In [91]:
sentences = [str(sent) for sent in nlp(text).sents]

In [96]:
sentences

['I’ve always been a bargain shopper.',
 'When I moved to New York in 2000 I discovered H&M.',
 'At the time, fast fashion didn’t mean sweatshop labor and climate damage —',
 'it meant that I could find a brand-new sensible office dress for $14.99 and still have enough money to pay for groceries.',
 'I thought my penchant for cheap clothing was temporary, that sometime in my 30s, after a decade of working in the corporate world, a switch would flip and suddenly the clothing I saw in fashion magazines would become available to me like a birthright.',
 'It hasn’t happened yet.']

In [92]:
list_words = [extract_words(sent) for sent in sentences]

In [97]:
list_words

[['bargain', 'shopper'],
 ['New', 'York', 'H&M.'],
 ['time', 'fast', 'fashion', 'sweatshop', 'labor', 'climate', 'damage'],
 ['brand',
  'new',
  'sensible',
  'office',
  'dress',
  'enough',
  'money',
  'groceries'],
 ['penchant',
  'cheap',
  'clothing',
  'temporary',
  '30s',
  'decade',
  'corporate',
  'world',
  'switch',
  'clothing',
  'fashion',
  'magazines',
  'available',
  'birthright'],
 []]

In [98]:
word_count = norm_scores(list_words)

In [99]:
word_count

Counter({'bargain': 0.5,
         'shopper': 0.5,
         'New': 0.5,
         'York': 0.5,
         'H&M.': 0.5,
         'time': 0.5,
         'fast': 0.5,
         'fashion': 1.0,
         'sweatshop': 0.5,
         'labor': 0.5,
         'climate': 0.5,
         'damage': 0.5,
         'brand': 0.5,
         'new': 0.5,
         'sensible': 0.5,
         'office': 0.5,
         'dress': 0.5,
         'enough': 0.5,
         'money': 0.5,
         'groceries': 0.5,
         'penchant': 0.5,
         'cheap': 0.5,
         'clothing': 1.0,
         'temporary': 0.5,
         '30s': 0.5,
         'decade': 0.5,
         'corporate': 0.5,
         'world': 0.5,
         'switch': 0.5,
         'magazines': 0.5,
         'available': 0.5,
         'birthright': 0.5})

In [101]:
sent_score(text,word_count)

[1.0, 1.5, 4.0, 4.0, 8.5, 0]

**NOTE:** We quickly see that the sentence score depends on the length of the sentence as well as on the number of keywords present in it, and counting keywords is the basic premise of associating importance with sentences. So we can arrange the sentences of the text according to their importance.
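
The arithmetic behind these scores can be checked without spaCy. The sketch below copies the keyword lists printed earlier and recomputes the totals. The per-keyword average at the end is not part of the method above; it is included only as an assumption-laden illustration of how much of the raw score comes from sentence length.

```python
from collections import Counter

# Keyword lists copied from the `list_words` output above.
keyword_lists = [
    ["bargain", "shopper"],
    ["New", "York", "H&M."],
    ["time", "fast", "fashion", "sweatshop", "labor", "climate", "damage"],
    ["brand", "new", "sensible", "office", "dress", "enough", "money", "groceries"],
    ["penchant", "cheap", "clothing", "temporary", "30s", "decade", "corporate",
     "world", "switch", "clothing", "fashion", "magazines", "available", "birthright"],
    [],
]

counts = Counter(word for kws in keyword_lists for word in kws)
max_freq = max(counts.values())  # 2, reached by "fashion" and "clothing"
weights = {word: count / max_freq for word, count in counts.items()}

# Sentence score = sum of the normalized weights of its keywords.
totals = [sum(weights[w] for w in kws) for kws in keyword_lists]
print(totals)  # [1.0, 1.5, 4.0, 4.0, 8.5, 0]

# Illustration only: average weight per keyword, which removes the length effect.
averages = [round(t / len(kws), 2) if kws else 0.0 for t, kws in zip(totals, keyword_lists)]
print(averages)  # [0.5, 0.5, 0.57, 0.5, 0.61, 0.0]
```

The totals range from 0 to 8.5, while the per-keyword averages all sit between 0.5 and 0.61 for sentences with keywords, which suggests that most of the spread in the raw scores comes from the number of keywords.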

In [106]:
scores = sent_score(text,word_count)
sent_importance = dict()

for s,sent in zip(scores,nlp(text).sents):
    sent_importance[str(sent)] = s

In [107]:
sent_importance

{'I’ve always been a bargain shopper.': 1.0,
 'When I moved to New York in 2000 I discovered H&M.': 1.5,
 'At the time, fast fashion didn’t mean sweatshop labor and climate damage —': 4.0,
 'it meant that I could find a brand-new sensible office dress for $14.99 and still have enough money to pay for groceries.': 4.0,
 'I thought my penchant for cheap clothing was temporary, that sometime in my 30s, after a decade of working in the corporate world, a switch would flip and suddenly the clothing I saw in fashion magazines would become available to me like a birthright.': 8.5,
 'It hasn’t happened yet.': 0}

In [110]:
sorted_sents = {k: v for k, v in sorted(sent_importance.items(), key=lambda item: item[1], reverse=True)}

In [113]:
for i,k in enumerate(sorted_sents.keys()):
    print("Importance Rank:{}".format(i+1))
    print(k)

Importance Rank:1
I thought my penchant for cheap clothing was temporary, that sometime in my 30s, after a decade of working in the corporate world, a switch would flip and suddenly the clothing I saw in fashion magazines would become available to me like a birthright.
Importance Rank:2
At the time, fast fashion didn’t mean sweatshop labor and climate damage —
Importance Rank:3
it meant that I could find a brand-new sensible office dress for $14.99 and still have enough money to pay for groceries.
Importance Rank:4
When I moved to New York in 2000 I discovered H&M.
Importance Rank:5
I’ve always been a bargain shopper.
Importance Rank:6
It hasn’t happened yet.
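
The ranking can be turned into an actual summary by keeping the top-scoring sentences and presenting them in their original order. The following is a minimal sketch of that step, not the notebook's own code: `top_k_in_order` is a hypothetical helper, and the labels `S1` to `S6` stand in for the six sentences above. Working with indices rather than a dict keyed by sentence text also avoids losing a sentence that happens to occur twice in the text.

```python
def top_k_in_order(sentences, scores, k):
    """Return the k highest-scoring sentences, kept in their original order."""
    # Rank sentence indices by score, highest first (ties keep document order).
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    # Restore document order among the selected indices.
    keep = sorted(ranked[:k])
    return [sentences[i] for i in keep]

# Placeholder labels for the six sentences, with the scores computed above.
labels = ["S1", "S2", "S3", "S4", "S5", "S6"]
scores = [1.0, 1.5, 4.0, 4.0, 8.5, 0]

summary = top_k_in_order(labels, scores, k=3)
print(summary)  # ['S3', 'S4', 'S5']
```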


### Shortcomings:

Although this method is simple to execute and intuitively assigns more importance to a sentence that contains more of the keywords occurring frequently in the document, it has the following shortcomings:

- When the sentences with the highest individual scores are combined, the resulting text may lack coherence, especially when the text being scored is longer than a paragraph.
- Certain non-textual keywords such as dates and numeric quantities, e.g. prices, are not included. Depending on the use case, they might be important.
- All the tokens are treated as unigrams. In the future, n-grams should be included, so that the method performs key phrase extraction instead of keyword extraction.
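
As a rough idea of what the n-gram extension in the last point could look like, the sketch below counts bigrams (pairs of adjacent tokens) with the standard library. It is only an illustration under stated assumptions: `bigrams` is a hypothetical helper, the token lists are made up, and each list is assumed to hold the tokens of one sentence in their original order.

```python
from collections import Counter

def bigrams(tokens):
    """Return the adjacent token pairs of one sentence."""
    return list(zip(tokens, tokens[1:]))

# Made-up token lists, one per sentence, in original order.
token_lists = [
    ["fast", "fashion", "brands"],
    ["cheap", "fast", "fashion"],
]

bigram_counts = Counter(pair for tokens in token_lists for pair in bigrams(tokens))
print(bigram_counts.most_common(1))  # [(('fast', 'fashion'), 2)]
```

Such phrase counts could then be normalized and summed per sentence in the same way as the unigram scores.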