### Reference

- [Extractive Text Summarization Using spaCy in Python](https://medium.com/better-programming/extractive-text-summarization-using-spacy-in-python-88ab96d1fd97)

# TF-IDF 

- TF-IDF *(Term Frequency-Inverse Document Frequency)* weights how important a term is to a document; here the term weights are used to score the importance of each sentence for text summarization. It is widely used in 
    
    - *information retrieval* and 
    
    - *text mining* 
    

## TF: Term Frequency

- measure of frequency of a term in a document

*normalization*

- since documents differ in length, a term may appear many more times in a long document than in a short one

- thus, *the term frequency is often divided by the document length* (such as the total number of terms in the document) as a way of normalization
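
A minimal sketch of length-normalized term frequency, assuming documents are already split into tokens. The helper name `term_frequency` and the two toy documents are made up for illustration; they are not part of the notebook below.

```python
from collections import Counter

def term_frequency(tokens):
    """Return each term's count divided by the total number of tokens."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}

# hypothetical toy documents of different lengths
short_doc = "case closed".split()
long_doc = "the case was a large black case with holes".split()

tf_short = term_frequency(short_doc)
tf_long = term_frequency(long_doc)

print(tf_short["case"])  # 1 occurrence out of 2 tokens
print(tf_long["case"])   # 2 occurrences out of 9 tokens
```

Although "case" occurs more often in the long document, its normalized frequency is lower there, which is the effect the normalization is meant to achieve.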



## IDF: Inverse Document Frequency

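- measures how important a term is: terms that appear in many documents get a low weight, while terms that appear in few documents get a high weight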

*stopwords*

- while computing the term frequency (TF), all terms are considered equally important

- however, certain terms appear many times yet carry little importance in the document
    - for example: is, are, they, and so on
    - these are called stopwords
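
The notes above do not give a formula for IDF. One common formulation is `idf(t) = log(N / df(t))`, where `N` is the number of documents and `df(t)` is the number of documents containing the term `t`; libraries often add smoothing terms on top of this. The sketch below assumes that plain formulation, and the function name and toy corpus are hypothetical.

```python
import math

def inverse_document_frequency(documents):
    """Return log(N / df) for every term, where df counts documents containing the term."""
    n_docs = len(documents)
    doc_freq = {}
    for tokens in documents:
        for term in set(tokens):  # count each term once per document
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}

# hypothetical toy corpus of three tokenized documents
corpus = [
    "ghosn is in lebanon".split(),
    "yamaha is a brand".split(),
    "the case is large".split(),
]
idf = inverse_document_frequency(corpus)

print(idf["is"])     # appears in all 3 documents, so log(3/3) = 0.0
print(idf["ghosn"])  # appears in 1 document, so log(3/1)
```

A stopword such as "is" occurs in every document and ends up with a weight of zero, so it no longer contributes when TF is multiplied by IDF.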


### TF-IDF Limitations

- A common term in a domain might be an important term in another domain
    
    - As the saying goes: “One man’s meat is another man’s poison”

- TF-IDF is not a good choice if you are dealing with multiple domains

    - An unbalanced dataset tends to be biased, which can greatly affect the result

# Identifying top sentences in an article (Text Summarization)

Following are the steps in text summarization:

1. tokenize the article using spaCy's language model

2. extract important keywords and normalize their weights

3. calculate the importance of each sentence in the article based on keyword appearance

4. sort the sentences by the calculated importance
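
The four steps above can be sketched end to end without spaCy. This is only an illustration under simplifying assumptions: a naive regex tokenizer and sentence splitter stand in for the language model, `STOP_WORDS` is a tiny hand-picked set rather than spaCy's list, and the `summarize` function is a made-up name.

```python
import re
from collections import Counter

# hypothetical stand-in for spaCy's stop word list
STOP_WORDS = {"the", "a", "is", "in", "of", "and", "it", "was", "to"}

def summarize(text, num_sentences=1):
    # 1. split into sentences and tokenize each one (naive regex, not spaCy)
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    tokenized = [re.findall(r"[a-z']+", s.lower()) for s in sentences]

    # 2. count keywords and normalize by the highest count
    keywords = [w for tokens in tokenized for w in tokens if w not in STOP_WORDS]
    counts = Counter(keywords)
    if not counts:
        return []
    max_count = max(counts.values())
    weights = {w: c / max_count for w, c in counts.items()}

    # 3. score each sentence by summing the weights of its keywords
    scores = [sum(weights.get(w, 0.0) for w in tokens) for tokens in tokenized]

    # 4. sort sentences by score, highest first, and keep the top ones
    ranked = sorted(zip(sentences, scores), key=lambda pair: pair[1], reverse=True)
    return [sentence for sentence, _ in ranked[:num_sentences]]

text = "Ghosn fled Japan. The weather was nice. Ghosn said Japan was unfair to Ghosn."
print(summarize(text))
```

The implementation below follows the same outline but relies on spaCy for tokenization, stop words, part-of-speech tags and sentence boundaries.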

## Implementation

In [105]:
# conduct imports
import spacy 
from collections import Counter
from string import punctuation
import pprint

### Load spaCy Model


In [107]:
# load the large English web model
# (requires a one-time download: python -m spacy download en_core_web_lg)
nlp = spacy.load("en_core_web_lg")

### Text to summarize

In [109]:
input_text = '''Yamaha is reminding people that musical equipment cases are for musical equipment — not people — two weeks after fugitive auto titan Carlos Ghosn reportedly was smuggled out of Japan in one. In a tweet over the weekend, the Japanese musical equipment company said it was not naming any names, but noted there had been many recent stories about people getting into musical equipment cases. Yamaha (YAMCY) warned people not to get into, or let others get into, its cases to avoid "unfortunate accidents." Multiple media outlets have reported that Ghosn managed to sneak through a Japanese airport to a private jet that whisked him out of the country by hiding in a large, black music equipment case with breathing holes drilled in the bottom. CNN Business has not independently confirmed those details of his escape. The former Nissan (NSANF) CEO had been out on bail awaiting trial in Japan on charges of financial wrongdoing before making his stunning escape to Lebanon at the end of December. Ghosn has referred to his departure as an effort to "escape injustice." In an interview with CNN\'s Richard Quest last week, Ghosn did not comment on the nature of his escape, saying he didn\'t want to endanger any of the people who aided in the operation. Ghosn did, however, respond to a question about what it felt like to ride through the airport in a packing case by first declining to comment but then adding: "Freedom, no matter the way it happens, is always sweet." In a press conference in Lebanon ahead of the CNN interview last Wednesday, Ghosn\'s first public appearance since fleeing Japan, Ghosn said he decided to leave the country because he believed he would not receive a fair trial, a claim Japanese authorities have disputed. Brands sometimes capitalize on their tangential relationship to big news in order to attract attention on social media. 
Yamaha is one of Japan\'s best known brands and Ghosn was one of Japan\'s top executives before being ousted from Nissan — a match made in social media heaven. Not surprisingly, Yamaha\'s post went viral on Twitter over the weekend.'''



### Tokenize text

In [111]:
# lowercase the text so keyword matching is case-insensitive, then run the spaCy pipeline
doc = nlp(input_text.lower())

### Explore the tokens

In [113]:
print("token ----- lemma_ ----- is_stop? ----- part-of-speech")
print('='*75)
for token in doc:
    print(f" {token.text} ----- {token.lemma_} ----- {token.is_stop} ----- {token.pos_}")

token ----- lemma_ ----- is_stop? ----- part-of-speech
 yamaha ----- yamaha ----- False ----- PROPN
 is ----- be ----- True ----- AUX
 reminding ----- remind ----- False ----- VERB
 people ----- people ----- False ----- NOUN
 that ----- that ----- True ----- SCONJ
 musical ----- musical ----- False ----- ADJ
 equipment ----- equipment ----- False ----- NOUN
 cases ----- case ----- False ----- NOUN
 are ----- be ----- True ----- AUX
 for ----- for ----- True ----- ADP
 musical ----- musical ----- False ----- ADJ
 equipment ----- equipment ----- False ----- NOUN
 — ----- — ----- False ----- PUNCT
 not ----- not ----- True ----- PART
 people ----- people ----- False ----- NOUN
 — ----- — ----- False ----- PUNCT
 two ----- two ----- True ----- NUM
 weeks ----- week ----- False ----- NOUN
 after ----- after ----- True ----- ADP
 fugitive ----- fugitive ----- False ----- ADJ
 auto ----- auto ----- False ----- PROPN
 titan ----- titan ----- False ----- PROPN
 carlos ----- carlos ----- False -

### Extract keywords (filter out stop words and punctuation)

In [115]:
# init blank keyword list 
keyword = []

for token in doc:
    # skip the token if it is a stop word or punctuation
    if token.text in nlp.Defaults.stop_words or token.text in punctuation:
        continue

    # keep the token if it is a noun, verb, adjective or proper noun
    elif token.pos_ in ['NOUN','VERB','ADJ','PROPN']:
        keyword.append(token.text)

# check populated keyword list
keyword

['yamaha',
 'reminding',
 'people',
 'musical',
 'equipment',
 'cases',
 'musical',
 'equipment',
 'people',
 'weeks',
 'fugitive',
 'auto',
 'titan',
 'carlos',
 'ghosn',
 'smuggled',
 'japan',
 'tweet',
 'weekend',
 'japanese',
 'musical',
 'equipment',
 'company',
 'said',
 'naming',
 'names',
 'noted',
 'recent',
 'stories',
 'people',
 'getting',
 'musical',
 'equipment',
 'cases',
 'yamaha',
 'yamcy',
 'warned',
 'people',
 'let',
 'cases',
 'avoid',
 'unfortunate',
 'accidents',
 'multiple',
 'media',
 'outlets',
 'reported',
 'ghosn',
 'managed',
 'sneak',
 'japanese',
 'airport',
 'private',
 'jet',
 'whisked',
 'country',
 'hiding',
 'large',
 'black',
 'music',
 'equipment',
 'case',
 'breathing',
 'holes',
 'drilled',
 'cnn',
 'business',
 'confirmed',
 'details',
 'escape',
 'nissan',
 'nsanf',
 'ceo',
 'bail',
 'awaiting',
 'trial',
 'japan',
 'charges',
 'financial',
 'wrongdoing',
 'making',
 'stunning',
 'escape',
 'lebanon',
 'end',
 'december',
 'ghosn',
 'referred',

### Normalize weightage of keywords

##### Get frequency of each word

- `Counter` counts how many times each keyword occurs in the list and returns a dictionary-like object that maps each keyword to its frequency

In [117]:
# get frequency of each word
freq_of_keywords = Counter(keyword)

# pretty print frequency of words 
pprint.pprint(freq_of_keywords)

Counter({'ghosn': 8,
         'people': 5,
         'equipment': 5,
         'japan': 5,
         'yamaha': 4,
         'musical': 4,
         'escape': 4,
         'cases': 3,
         'japanese': 3,
         'media': 3,
         'cnn': 3,
         'weekend': 2,
         'said': 2,
         'airport': 2,
         'country': 2,
         'case': 2,
         'nissan': 2,
         'trial': 2,
         'lebanon': 2,
         'interview': 2,
         'comment': 2,
         'brands': 2,
         'social': 2,
         'reminding': 1,
         'weeks': 1,
         'fugitive': 1,
         'auto': 1,
         'titan': 1,
         'carlos': 1,
         'smuggled': 1,
         'tweet': 1,
         'company': 1,
         'naming': 1,
         'names': 1,
         'noted': 1,
         'recent': 1,
         'stories': 1,
         'getting': 1,
         'yamcy': 1,
         'warned': 1,
         'let': 1,
         'avoid': 1,
         'unfortunate': 1,
         'accidents': 1,
         'multiple': 1,


##### Sort by frequency - most common to least common

In [152]:
# list (keyword, count) pairs from most frequent to least frequent
sort_occurence = freq_of_keywords.most_common()
pprint.pprint(sort_occurence)

[('ghosn', 8),
 ('people', 5),
 ('equipment', 5),
 ('japan', 5),
 ('yamaha', 4),
 ('musical', 4),
 ('escape', 4),
 ('cases', 3),
 ('japanese', 3),
 ('media', 3),
 ('cnn', 3),
 ('weekend', 2),
 ('said', 2),
 ('airport', 2),
 ('country', 2),
 ('case', 2),
 ('nissan', 2),
 ('trial', 2),
 ('lebanon', 2),
 ('interview', 2),
 ('comment', 2),
 ('brands', 2),
 ('social', 2),
 ('reminding', 1),
 ('weeks', 1),
 ('fugitive', 1),
 ('auto', 1),
 ('titan', 1),
 ('carlos', 1),
 ('smuggled', 1),
 ('tweet', 1),
 ('company', 1),
 ('naming', 1),
 ('names', 1),
 ('noted', 1),
 ('recent', 1),
 ('stories', 1),
 ('getting', 1),
 ('yamcy', 1),
 ('warned', 1),
 ('let', 1),
 ('avoid', 1),
 ('unfortunate', 1),
 ('accidents', 1),
 ('multiple', 1),
 ('outlets', 1),
 ('reported', 1),
 ('managed', 1),
 ('sneak', 1),
 ('private', 1),
 ('jet', 1),
 ('whisked', 1),
 ('hiding', 1),
 ('large', 1),
 ('black', 1),
 ('music', 1),
 ('breathing', 1),
 ('holes', 1),
 ('drilled', 1),
 ('business', 1),
 ('confirmed', 1),
 ('deta

##### Get the most common word

In [121]:
print(sort_occurence[0])

('ghosn', 8)


##### Get frequency of most common word

In [123]:
max_frequency = sort_occurence[0][1]
print(max_frequency)

8


##### Normalize frequency of each keyword with the max frequency value

In [154]:
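# divide each raw count by the highest count so that the most common keyword gets 1.0
freq_of_words = Counter()
for word in freq_of_keywords:
    freq_of_words[word] = freq_of_keywords[word]/max_frequency

print('Normalized TF:'+'-'*50,'\n')
pprint.pprint(freq_of_words)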

Normalized TF:-------------------------------------------------- 

Counter({'ghosn': 1.0,
         'people': 0.625,
         'equipment': 0.625,
         'japan': 0.625,
         'yamaha': 0.5,
         'musical': 0.5,
         'escape': 0.5,
         'cases': 0.375,
         'japanese': 0.375,
         'media': 0.375,
         'cnn': 0.375,
         'weekend': 0.25,
         'said': 0.25,
         'airport': 0.25,
         'country': 0.25,
         'case': 0.25,
         'nissan': 0.25,
         'trial': 0.25,
         'lebanon': 0.25,
         'interview': 0.25,
         'comment': 0.25,
         'brands': 0.25,
         'social': 0.25,
         'reminding': 0.125,
         'weeks': 0.125,
         'fugitive': 0.125,
         'auto': 0.125,
         'titan': 0.125,
         'carlos': 0.125,
         'smuggled': 0.125,
         'tweet': 0.125,
         'company': 0.125,
         'naming': 0.125,
         'names': 0.125,
         'noted': 0.125,
         'recent': 0.125,
         'stor
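
The same normalization can be written as a small reusable function. This is a sketch, not the notebook's code: `normalize_by_max` is a hypothetical helper, and the sample counts are a handful of values copied from the frequency output earlier in this section.

```python
from collections import Counter

def normalize_by_max(counts):
    """Scale counts so that the most frequent key maps to 1.0."""
    if not counts:
        return {}
    max_count = max(counts.values())
    return {word: count / max_count for word, count in counts.items()}

# a few counts taken from the keyword frequencies shown earlier
sample_counts = Counter({"ghosn": 8, "people": 5, "yamaha": 4, "cases": 3})
normalized = normalize_by_max(sample_counts)
print(normalized)
```

Taking `max(counts.values())` directly avoids the intermediate sorted list when only the highest count is needed.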

### Sentence Importance Computation

##### Extract all sentences

In [127]:
for sentence in doc.sents:
    print('-'*150)
    pprint.pprint(sentence)
    

------------------------------------------------------------------------------------------------------------------------------------------------------
yamaha is reminding people that musical equipment cases are for musical equipment — not people — two weeks after fugitive auto titan carlos ghosn reportedly was smuggled out of japan in one.
------------------------------------------------------------------------------------------------------------------------------------------------------
in a tweet over the weekend, the japanese musical equipment company said it was not naming any names, but noted there had been many recent stories about people getting into musical equipment cases.
------------------------------------------------------------------------------------------------------------------------------------------------------
yamaha (yamcy) warned people not to get into, or let others get into, its cases to avoid "unfortunate accidents.
---------------------------------------------

In [129]:
# access each word in each sentence
for sentence in doc.sents:
    print('='*150)
    for word in sentence:
        print('---'*10)
        print(word, ', type:' ,type(word))


------------------------------
yamaha , type: <class 'spacy.tokens.token.Token'>
------------------------------
is , type: <class 'spacy.tokens.token.Token'>
------------------------------
reminding , type: <class 'spacy.tokens.token.Token'>
------------------------------
people , type: <class 'spacy.tokens.token.Token'>
------------------------------
that , type: <class 'spacy.tokens.token.Token'>
------------------------------
musical , type: <class 'spacy.tokens.token.Token'>
------------------------------
equipment , type: <class 'spacy.tokens.token.Token'>
------------------------------
cases , type: <class 'spacy.tokens.token.Token'>
------------------------------
are , type: <class 'spacy.tokens.token.Token'>
------------------------------
for , type: <class 'spacy.tokens.token.Token'>
------------------------------
musical , type: <class 'spacy.tokens.token.Token'>
------------------------------
equipment , type: <class 'spacy.tokens.token.Token'>
------------------------------

##### Exploratory Sentence-Word Analysis

In [131]:
# access each word in each sentence 
# check if word is in list of keywords
for sentence in doc.sents:
    print('='*150)
    for word in sentence:
        print('---'*15)
        print(word.text)
        print('keyword? = ', word.text in freq_of_keywords.keys()) # check if word text is a keyword
        print('stop-word? = ', word.is_stop ) # check if word-token is a stop word

---------------------------------------------
yamaha
keyword? =  True
stop-word? =  False
---------------------------------------------
is
keyword? =  False
stop-word? =  True
---------------------------------------------
reminding
keyword? =  True
stop-word? =  False
---------------------------------------------
people
keyword? =  True
stop-word? =  False
---------------------------------------------
that
keyword? =  False
stop-word? =  True
---------------------------------------------
musical
keyword? =  True
stop-word? =  False
---------------------------------------------
equipment
keyword? =  True
stop-word? =  False
---------------------------------------------
cases
keyword? =  True
stop-word? =  False
---------------------------------------------
are
keyword? =  False
stop-word? =  True
---------------------------------------------
for
keyword? =  False
stop-word? =  True
---------------------------------------------
musical
keyword? =  True
stop-word? =  False
---------------

##### Calculate the importance of the sentences 

- identify occurrences of important keywords in each sentence 

- then, sum up their frequencies to weight each sentence

In [132]:
# initialize dictionary to accumulate strength for each sentence
sentence_strength = {} 


- for a given sentence

    - each time a word is found in the `freq_of_keywords` dictionary
    
    - add the frequency of the word to the sentence strength

In [133]:
# loop over each sentence in document
# note: the raw counts in freq_of_keywords are used as weights here; using the
# normalized values instead would divide every score by max_frequency and
# leave the ranking of the sentences unchanged
for sentence in doc.sents:

    # loop over each word in current sentence 
    for word in sentence: 

        # check if current word is in important keyword list
        if word.text in freq_of_keywords.keys():

            # check if current sentence is present in sentence_strength dictionary 
            if sentence in sentence_strength.keys():

                # increase weight of sentence by the frequency of the current word
                sentence_strength[sentence] += freq_of_keywords[word.text]

            # if sentence is not in the sentence_strength dictionary
            else:
                
                # create key entry for current sentence and set the value to current word's frequency
                sentence_strength[sentence] = freq_of_keywords[word.text]

pprint.pprint(sentence_strength)
            

{yamaha is reminding people that musical equipment cases are for musical equipment — not people — two weeks after fugitive auto titan carlos ghosn reportedly was smuggled out of japan in one.: 55,
 in a tweet over the weekend, the japanese musical equipment company said it was not naming any names, but noted there had been many recent stories about people getting into musical equipment cases.: 41,
 yamaha (yamcy) warned people not to get into, or let others get into, its cases to avoid "unfortunate accidents.: 18,
 " multiple media outlets have reported that ghosn managed to sneak through a japanese airport to a private jet that whisked him out of the country by hiding in a large, black music equipment case with breathing holes drilled in the bottom.: 40,
 cnn business has not independently confirmed those details of his escape.: 10,
 the former nissan (nsanf) ceo had been out on bail awaiting trial in japan on charges of financial wrongdoing before making his stunning escape to lebano
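
The accumulation loop above can be written more compactly with `collections.defaultdict`. The sketch below works on plain strings instead of spaCy spans, and the keyword weights and sentences are made-up toy data, so it only illustrates the pattern.

```python
from collections import defaultdict

# hypothetical keyword weights and pre-tokenized sentences
keyword_weights = {"ghosn": 8, "japan": 5, "escape": 4}
sentences = {
    "ghosn left japan": ["ghosn", "left", "japan"],
    "the escape was stunning": ["the", "escape", "was", "stunning"],
}

sentence_scores = defaultdict(int)
for sentence, words in sentences.items():
    for word in words:
        # words that are not keywords contribute nothing to the score
        sentence_scores[sentence] += keyword_weights.get(word, 0)

print(dict(sentence_scores))
```

One difference from the loop above: with this pattern a sentence that contains no keywords still gets an entry with a score of 0, whereas the original loop leaves such a sentence out of the dictionary.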

### Sort the sentences by their strengths (importance)


In [141]:
sorted_sentence_strengths = sorted(sentence_strength.items(), key = lambda key_value: key_value[1], reverse=True)
pprint.pprint(sorted_sentence_strengths)

[(yamaha is reminding people that musical equipment cases are for musical equipment — not people — two weeks after fugitive auto titan carlos ghosn reportedly was smuggled out of japan in one.,
  55),
 (in a press conference in lebanon ahead of the cnn interview last wednesday, ghosn's first public appearance since fleeing japan, ghosn said he decided to leave the country because he believed he would not receive a fair trial, a claim japanese authorities have disputed.,
  51),
 (in a tweet over the weekend, the japanese musical equipment company said it was not naming any names, but noted there had been many recent stories about people getting into musical equipment cases.,
  41),
 (" multiple media outlets have reported that ghosn managed to sneak through a japanese airport to a private jet that whisked him out of the country by hiding in a large, black music equipment case with breathing holes drilled in the bottom.,
  40),
 (yamaha is one of japan's best known brands and ghosn was o

### Set summary sentence limit

In [142]:
num_summary_sentences = 3

##### Extract the first `num_summary_sentences` from the `sorted_sentence_strengths` list

In [143]:
if num_summary_sentences > len(sorted_sentence_strengths):
    print('Number of sentences required in summary is more than sentences in original text!')

# slicing past the end of a list is safe, so this keeps at most num_summary_sentences items
summary_sentence_set = sorted_sentence_strengths[:num_summary_sentences]

pprint.pprint(summary_sentence_set)


[(yamaha is reminding people that musical equipment cases are for musical equipment — not people — two weeks after fugitive auto titan carlos ghosn reportedly was smuggled out of japan in one.,
  55),
 (in a press conference in lebanon ahead of the cnn interview last wednesday, ghosn's first public appearance since fleeing japan, ghosn said he decided to leave the country because he believed he would not receive a fair trial, a claim japanese authorities have disputed.,
  51),
 (in a tweet over the weekend, the japanese musical equipment company said it was not naming any names, but noted there had been many recent stories about people getting into musical equipment cases.,
  41)]
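
When only the top few sentences are needed, `heapq.nlargest` from the standard library can replace the separate sort and slice. The pairs below are hypothetical placeholders that mimic the shape of the `(sentence, strength)` pairs, not the actual spaCy spans.

```python
import heapq

# hypothetical (sentence, strength) pairs
scored = [("sentence a", 55), ("sentence b", 10), ("sentence c", 51), ("sentence d", 41)]

# pick the three pairs with the highest strength, strongest first
top_three = heapq.nlargest(3, scored, key=lambda pair: pair[1])
print(top_three)
```

If fewer than three pairs exist, `heapq.nlargest` simply returns all of them, so no explicit length check is required.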


### Generate final summary 

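- join the text of the selected sentences and capitalize the first letter of each one (the text was lowercased before tokenizing, so proper nouns stay lowercase)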

In [151]:
summary = ' '.join([ item[0].text.capitalize() for item in summary_sentence_set ])
print(summary)

Yamaha is reminding people that musical equipment cases are for musical equipment — not people — two weeks after fugitive auto titan carlos ghosn reportedly was smuggled out of japan in one. In a press conference in lebanon ahead of the cnn interview last wednesday, ghosn's first public appearance since fleeing japan, ghosn said he decided to leave the country because he believed he would not receive a fair trial, a claim japanese authorities have disputed. In a tweet over the weekend, the japanese musical equipment company said it was not naming any names, but noted there had been many recent stories about people getting into musical equipment cases.
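
The summary above lists sentences in order of strength, so the tweet sentence follows the press conference sentence even though it comes earlier in the article. A possible refinement, sketched here on made-up data, is to pick the strongest sentences and then restore their original order. The triples below are hypothetical; with spaCy spans the position could be taken from `sentence.start`.

```python
# hypothetical (position_in_article, sentence, strength) triples
scored = [
    (0, "Yamaha issued a warning.", 55),
    (1, "The company named no names.", 41),
    (2, "CNN has not confirmed details.", 10),
    (3, "Ghosn spoke in Lebanon.", 51),
]

num_summary_sentences = 3

# pick the strongest sentences, then put them back in article order
top = sorted(scored, key=lambda item: item[2], reverse=True)[:num_summary_sentences]
in_order = sorted(top, key=lambda item: item[0])
summary = " ".join(sentence for _, sentence, _ in in_order)
print(summary)
```

Whether ranked order or article order reads better depends on the text; article order tends to keep the narrative easier to follow.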
