### Reference

- [Extractive Text Summarization Using spaCy in Python](https://medium.com/better-programming/extractive-text-summarization-using-spacy-in-python-88ab96d1fd97)

# TF-IDF 

- TF-IDF *(Term Frequency-Inverse Document Frequency)* measures how important a term is to a document. Here it is used to score the importance of sentences for text summarization. It is widely used in

    - *information retrieval* and

    - *text mining*
    

## TF: Term Frequency

- measure of frequency of a term in a document

*normalization*

- since every document is different in length, it is possible that a term would appear many more times in long documents than shorter ones

- thus, *the term frequency is often divided by the document length* (such as the total number of terms in the document) as a way of normalization
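
As a minimal sketch of this normalization, the snippet below divides each term's count by the total number of tokens in the document. The helper name `term_frequency` and the two toy documents are assumptions made for illustration only.

```python
from collections import Counter

def term_frequency(tokens):
    """Return each term's count divided by the total number of tokens."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}

# Two toy documents of different lengths (illustrative data only).
short_doc = "the vaccine is safe".split()
long_doc = ("the vaccine is safe and the vaccine is effective "
            "and the vaccine is available").split()

tf_short = term_frequency(short_doc)
tf_long = term_frequency(long_doc)

# Raw counts are 1 vs 3, but the normalized values are close to each other.
print(tf_short["vaccine"])  # 0.25
print(tf_long["vaccine"])   # 3 / 14, roughly 0.214
```

Without normalization the longer document would appear to be three times more about "vaccine"; after dividing by length the two values are comparable.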



## IDF: Inverse Document Frequency

- measures how informative a term is: terms that appear in few documents score higher than terms that appear in most of them

*stopwords*

- while computing the term frequency (TF), all terms are considered equally important

- however, certain terms appear many times but carry little importance in the document
    - for example: is, are, they, and so on
    - these are called stopwords
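
One common, unsmoothed way to compute IDF is `log(N / df)`, where `N` is the number of documents and `df` is the number of documents that contain the term. The sketch below assumes that definition; libraries often add smoothing, so their values can differ. The function name and the toy corpus are hypothetical.

```python
import math

def inverse_document_frequency(documents):
    """Map each term to log(N / df), where df counts the documents containing it."""
    n_docs = len(documents)
    doc_freq = {}
    for tokens in documents:
        # set() ensures a term is counted once per document
        for term in set(tokens):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}

# Toy corpus of three tokenized documents (illustrative data only).
corpus = [
    "the vaccine is safe".split(),
    "the president is open to the vaccine".split(),
    "the transition team is ready".split(),
]
idf = inverse_document_frequency(corpus)

print(idf["the"])      # 0.0, because "the" appears in every document
print(idf["vaccine"])  # log(3 / 2)
print(idf["team"])     # log(3 / 1), the rarest term scores highest
```

Stopwords such as "the" and "is" occur in every document, so their IDF is zero and they contribute nothing to a TF-IDF product.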


### TF-IDF Limitations

- A common term in a domain might be an important term in another domain
    
    - As the saying goes: “One man’s meat is another man’s poison”

- TF-IDF is not a good choice if you are dealing with multiple domains

    -  An unbalanced dataset tends to be biased, which can greatly affect the result

# Identifying top sentences in an article (Text Summarization)

Following are the steps in text summarization:

1. tokenize the article using spaCy's language model

2. extract important keywords and normalize their weights

3. calculate the importance of each sentence in the article based on keyword appearance

4. sort the sentences based on the calculated importance
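
The four steps above can be sketched end to end in plain Python. This is not the spaCy implementation: regular expressions stand in for spaCy's tokenizer and sentence splitter, the stop-word list is a tiny hypothetical one, and there is no part-of-speech filter. The function name `summarize` is also an assumption.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; spaCy ships a much larger one).
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "it", "of", "the", "to", "was"}

def summarize(text, num_sentences=1):
    # 1. Split into sentences and lowercase word tokens (regex stand-in for spaCy).
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    tokenized = [re.findall(r"[a-z0-9'-]+", s.lower()) for s in sentences]

    # 2. Count keywords and normalize by the largest count.
    counts = Counter(w for words in tokenized for w in words if w not in STOP_WORDS)
    if not counts:
        return []
    max_count = max(counts.values())
    weights = {w: c / max_count for w, c in counts.items()}

    # 3. Score each sentence by summing the weights of its keywords.
    scores = [sum(weights.get(w, 0.0) for w in words) for words in tokenized]

    # 4. Sort sentences by score, highest first, and keep the top ones.
    ranked = sorted(zip(sentences, scores), key=lambda pair: pair[1], reverse=True)
    return [sentence for sentence, _ in ranked[:num_sentences]]

text = ("The vaccine is safe. The vaccine was administered in a televised event. "
        "It rained.")
print(summarize(text, num_sentences=1))
# ['The vaccine was administered in a televised event.']
```

The second sentence wins because it contains the most frequent keyword ("vaccine") plus three other keywords, which also shows that this scoring favors longer sentences.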

## Implementation

In [1]:
# conduct imports
import spacy 
from collections import Counter
from string import punctuation
import pprint

### Load spaCy Model


In [2]:
# load large english web model
nlp = spacy.load("en_core_web_lg")

### Text to summarize

In [3]:
# input_text = '''Yamaha is reminding people that musical equipment cases are for musical equipment — not people — two weeks after fugitive auto titan Carlos Ghosn reportedly was smuggled out of Japan in one. In a tweet over the weekend, the Japanese musical equipment company said it was not naming any names, but noted there had been many recent stories about people getting into musical equipment cases. Yamaha (YAMCY) warned people not to get into, or let others get into, its cases to avoid "unfortunate accidents." Multiple media outlets have reported that Ghosn managed to sneak through a Japanese airport to a private jet that whisked him out of the country by hiding in a large, black music equipment case with breathing holes drilled in the bottom. CNN Business has not independently confirmed those details of his escape. The former Nissan (NSANF) CEO had been out on bail awaiting trial in Japan on charges of financial wrongdoing before making his stunning escape to Lebanon at the end of December. Ghosn has referred to his departure as an effort to "escape injustice." In an interview with CNN\'s Richard Quest last week, Ghosn did not comment on the nature of his escape, saying he didn\'t want to endanger any of the people who aided in the operation. Ghosn did, however, respond to a question about what it felt like to ride through the airport in a packing case by first declining to comment but then adding: "Freedom, no matter the way it happens, is always sweet." In a press conference in Lebanon ahead of the CNN interview last Wednesday, Ghosn\'s first public appearance since fleeing Japan, Ghosn said he decided to leave the country because he believed he would not receive a fair trial, a claim Japanese authorities have disputed. Brands sometimes capitalize on their tangential relationship to big news in order to attract attention on social media. 
# Yamaha is one of Japan\'s best known brands and Ghosn was one of Japan\'s top executives before being ousted from Nissan — a match made in social media heaven. Not surprisingly, Yamaha\'s post went viral on Twitter over the weekend.'''



In [4]:
input_text = '''Vice President Mike Pence was administered the Covid-19 vaccine in a televised event Friday, becoming the senior-most member of the Trump administration known to have received the shot.

Mr. Pence said he hoped his vaccination would help to build public confidence that the vaccine was safe and effective. It isn’t known whether President Trump and first lady Melania Trump, who both contracted Covid-19 this fall, will receive the vaccine in the coming weeks. White House press secretary Kayleigh McEnany said this week that the president is open to taking the vaccine but “wants to show Americans that our priority are the most vulnerable.”

President-elect Joe Biden and his wife, Jill Biden, will get the vaccine in public on Monday, his transition team said.

The transition team said Vice President-elect Kamala Harris and her husband, Doug Emhoff, would receive the vaccine the following week.

Mr. Pence received his first of two doses alongside second lady Karen Pence and Surgeon General Jerome Adams, in an event held in the Eisenhower Executive Office Building, in front of two signs reading “Safe and Effective.” Dr. Adams flashed the vice president a thumbs-up after all three of them had received their shots.'''

### Tokenize text

In [5]:
doc = nlp(input_text.lower())

### Explore the tokens

In [6]:
print("token ----- lemma_ ----- is_stop? ----- part-of-speech")
print('='*75)
for token in doc:
    print(f" {str(token)} ----- {str(token.lemma_)} ----- {str(token.is_stop)} ----- {str(token.pos_)}")

token ----- lemma_ ----- is_stop? ----- part-of-speech
 vice ----- vice ----- False ----- PROPN
 president ----- president ----- False ----- PROPN
 mike ----- mike ----- False ----- PROPN
 pence ----- pence ----- False ----- PROPN
 was ----- be ----- True ----- AUX
 administered ----- administer ----- False ----- VERB
 the ----- the ----- True ----- DET
 covid-19 ----- covid-19 ----- False ----- NUM
 vaccine ----- vaccine ----- False ----- NOUN
 in ----- in ----- True ----- ADP
 a ----- a ----- True ----- DET
 televised ----- televise ----- False ----- VERB
 event ----- event ----- False ----- NOUN
 friday ----- friday ----- False ----- PROPN
 , ----- , ----- False ----- PUNCT
 becoming ----- become ----- True ----- VERB
 the ----- the ----- True ----- DET
 senior ----- senior ----- False ----- ADV
 - ----- - ----- False ----- PUNCT
 most ----- most ----- True ----- ADJ
 member ----- member ----- False ----- NOUN
 of ----- of ----- True ----- ADP
 the ----- the ----- True ----- DET
 tr

### Extract keywords (filter out stop words and punctuation)

In [7]:
# init blank keyword list 
keyword = []

for token in doc:
    # skip the current token if it is a stop word or punctuation
    if token.text in nlp.Defaults.stop_words or token.text in punctuation:
        continue

    # if the token is a noun, verb, adjective or proper noun, save its text to the keyword list
    elif token.pos_ in ['NOUN','VERB','ADJ','PROPN']:
        keyword.append(token.text)

# check populated keyword list
keyword

['vice',
 'president',
 'mike',
 'pence',
 'administered',
 'vaccine',
 'televised',
 'event',
 'friday',
 'member',
 'trump',
 'administration',
 'known',
 'received',
 'shot',
 'mr',
 'pence',
 'said',
 'hoped',
 'vaccination',
 'help',
 'build',
 'public',
 'confidence',
 'vaccine',
 'safe',
 'effective',
 'known',
 'president',
 'trump',
 'lady',
 'melania',
 'trump',
 'contracted',
 'covid-19',
 'fall',
 'receive',
 'vaccine',
 'coming',
 'weeks',
 'white',
 'house',
 'press',
 'secretary',
 'kayleigh',
 'mcenany',
 'said',
 'week',
 'president',
 'open',
 'taking',
 'vaccine',
 'wants',
 'americans',
 'priority',
 'vulnerable',
 'president',
 'elect',
 'joe',
 'biden',
 'wife',
 'jill',
 'biden',
 'vaccine',
 'public',
 'monday',
 'transition',
 'team',
 'said',
 'transition',
 'team',
 'said',
 'vice',
 'president',
 'elect',
 'kamala',
 'harris',
 'husband',
 'doug',
 'emhoff',
 'receive',
 'vaccine',
 'following',
 'week',
 'mr',
 'pence',
 'received',
 'doses',
 'second',
 'l

### Normalize weightage of keywords

##### Get frequency of each word

- `Counter` converts the list into a dictionary-like object that maps each keyword to its frequency

In [8]:
# get frequency of each word
freq_of_keywords = Counter(keyword)

# pretty print frequency of words 
pprint.pprint(freq_of_keywords)

Counter({'president': 6,
         'vaccine': 6,
         'pence': 4,
         'said': 4,
         'vice': 3,
         'trump': 3,
         'received': 3,
         'event': 2,
         'known': 2,
         'mr': 2,
         'public': 2,
         'safe': 2,
         'effective': 2,
         'lady': 2,
         'receive': 2,
         'week': 2,
         'elect': 2,
         'biden': 2,
         'transition': 2,
         'team': 2,
         'adams': 2,
         'mike': 1,
         'administered': 1,
         'televised': 1,
         'friday': 1,
         'member': 1,
         'administration': 1,
         'shot': 1,
         'hoped': 1,
         'vaccination': 1,
         'help': 1,
         'build': 1,
         'confidence': 1,
         'melania': 1,
         'contracted': 1,
         'covid-19': 1,
         'fall': 1,
         'coming': 1,
         'weeks': 1,
         'white': 1,
         'house': 1,
         'press': 1,
         'secretary': 1,
         'kayleigh': 1,
         'mcenany

##### Sort by frequency - most common to least common

In [9]:
# sort keywords from most common to least common
sort_occurence = freq_of_keywords.most_common()
pprint.pprint(sort_occurence)

[('president', 6),
 ('vaccine', 6),
 ('pence', 4),
 ('said', 4),
 ('vice', 3),
 ('trump', 3),
 ('received', 3),
 ('event', 2),
 ('known', 2),
 ('mr', 2),
 ('public', 2),
 ('safe', 2),
 ('effective', 2),
 ('lady', 2),
 ('receive', 2),
 ('week', 2),
 ('elect', 2),
 ('biden', 2),
 ('transition', 2),
 ('team', 2),
 ('adams', 2),
 ('mike', 1),
 ('administered', 1),
 ('televised', 1),
 ('friday', 1),
 ('member', 1),
 ('administration', 1),
 ('shot', 1),
 ('hoped', 1),
 ('vaccination', 1),
 ('help', 1),
 ('build', 1),
 ('confidence', 1),
 ('melania', 1),
 ('contracted', 1),
 ('covid-19', 1),
 ('fall', 1),
 ('coming', 1),
 ('weeks', 1),
 ('white', 1),
 ('house', 1),
 ('press', 1),
 ('secretary', 1),
 ('kayleigh', 1),
 ('mcenany', 1),
 ('open', 1),
 ('taking', 1),
 ('wants', 1),
 ('americans', 1),
 ('priority', 1),
 ('vulnerable', 1),
 ('joe', 1),
 ('wife', 1),
 ('jill', 1),
 ('monday', 1),
 ('kamala', 1),
 ('harris', 1),
 ('husband', 1),
 ('doug', 1),
 ('emhoff', 1),
 ('following', 1),
 ('dose

##### Get the most common word

In [10]:
print(sort_occurence[0])

('president', 6)


##### Get frequency of most common word

In [11]:
max_frequency = sort_occurence[0][1]
print(max_frequency)

6


##### Normalize frequency of each keyword with the max frequency value

In [12]:
for word in freq_of_keywords:
    freq_of_keywords[word] = freq_of_keywords[word]/max_frequency

print('Normalized TF:'+'-'*50,'\n')
pprint.pprint(freq_of_keywords)

Normalized TF:-------------------------------------------------- 

Counter({'president': 1.0,
         'vaccine': 1.0,
         'pence': 0.6666666666666666,
         'said': 0.6666666666666666,
         'vice': 0.5,
         'trump': 0.5,
         'received': 0.5,
         'event': 0.3333333333333333,
         'known': 0.3333333333333333,
         'mr': 0.3333333333333333,
         'public': 0.3333333333333333,
         'safe': 0.3333333333333333,
         'effective': 0.3333333333333333,
         'lady': 0.3333333333333333,
         'receive': 0.3333333333333333,
         'week': 0.3333333333333333,
         'elect': 0.3333333333333333,
         'biden': 0.3333333333333333,
         'transition': 0.3333333333333333,
         'team': 0.3333333333333333,
         'adams': 0.3333333333333333,
         'mike': 0.16666666666666666,
         'administered': 0.16666666666666666,
         'televised': 0.16666666666666666,
         'friday': 0.16666666666666666,
         'member': 0.1666666666
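
The cell above overwrites the counts in `freq_of_keywords` in place, so running it a second time divides the already normalized values again. A minimal alternative sketch, using hypothetical names (`keyword_counts`, `normalized`) and toy data, builds a separate dictionary and leaves the raw counts untouched:

```python
from collections import Counter

# Toy keyword list (illustrative data only).
keyword_counts = Counter(["president", "vaccine", "president", "pence"])

# most_common(1) returns a one-element list holding the top (word, count) pair.
max_count = keyword_counts.most_common(1)[0][1]

# Build a new mapping instead of mutating the Counter.
normalized = {word: count / max_count for word, count in keyword_counts.items()}

print(normalized)      # {'president': 1.0, 'vaccine': 0.5, 'pence': 0.5}
print(keyword_counts)  # raw counts are unchanged
```

Keeping the raw counts separate makes the normalization step safe to re-run.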

### Sentence Importance Computation

##### Extract all sentences

In [13]:
for sentence in doc.sents:
    print('-'*150)
    pprint.pprint(sentence)
    

------------------------------------------------------------------------------------------------------------------------------------------------------
vice president mike pence was administered the covid-19 vaccine in a televised event friday, becoming the senior-most member of the trump administration known to have received the shot.


------------------------------------------------------------------------------------------------------------------------------------------------------
mr. pence said he hoped his vaccination would help to build public confidence that the vaccine was safe and effective.
------------------------------------------------------------------------------------------------------------------------------------------------------
it isn’t known whether president trump and first lady melania trump, who both contracted covid-19 this fall, will receive the vaccine in the coming weeks.
-------------------------------------------------------------------------------------

In [14]:
# access each word in each sentence
for sentence in doc.sents:
    print('='*150)
    for word in sentence:
        print('---'*10)
        print(word, ', type:' ,type(word))


------------------------------
vice , type: <class 'spacy.tokens.token.Token'>
------------------------------
president , type: <class 'spacy.tokens.token.Token'>
------------------------------
mike , type: <class 'spacy.tokens.token.Token'>
------------------------------
pence , type: <class 'spacy.tokens.token.Token'>
------------------------------
was , type: <class 'spacy.tokens.token.Token'>
------------------------------
administered , type: <class 'spacy.tokens.token.Token'>
------------------------------
the , type: <class 'spacy.tokens.token.Token'>
------------------------------
covid-19 , type: <class 'spacy.tokens.token.Token'>
------------------------------
vaccine , type: <class 'spacy.tokens.token.Token'>
------------------------------
in , type: <class 'spacy.tokens.token.Token'>
------------------------------
a , type: <class 'spacy.tokens.token.Token'>
------------------------------
televised , type: <class 'spacy.tokens.token.Token'>
------------------------------
ev

##### Exploratory Sentence-Word Analysis

In [15]:
# access each word in each sentence 
# check if word is in list of keywords
for sentence in doc.sents:
    print('='*150)
    for word in sentence:
        print('---'*15)
        print(word.text)
        print('keyword? = ', word.text in freq_of_keywords.keys()) # check if word text is a keyword
        print('stop-word? = ', word.is_stop ) # check if word-token is a stop word

---------------------------------------------
vice
keyword? =  True
stop-word? =  False
---------------------------------------------
president
keyword? =  True
stop-word? =  False
---------------------------------------------
mike
keyword? =  True
stop-word? =  False
---------------------------------------------
pence
keyword? =  True
stop-word? =  False
---------------------------------------------
was
keyword? =  False
stop-word? =  True
---------------------------------------------
administered
keyword? =  True
stop-word? =  False
---------------------------------------------
the
keyword? =  False
stop-word? =  True
---------------------------------------------
covid-19
keyword? =  True
stop-word? =  False
---------------------------------------------
vaccine
keyword? =  True
stop-word? =  False
---------------------------------------------
in
keyword? =  False
stop-word? =  True
---------------------------------------------
a
keyword? =  False
stop-word? =  True
------------------

##### Calculate the importance of the sentences 

- identify occurrences of important keywords in each sentence 

- then, sum up the value to weight each sentence

In [16]:
# initialize dictionary to accumulate strength for each sentence
sentence_strength = {} 


- for a given sentence

    - each time a word is found in the `freq_of_keywords` dictionary
    
    - add the frequency of the word to the sentence strength

In [17]:
# loop over each sentence in document
for sentence in doc.sents:

    # loop over each word in current sentence 
    for word in sentence: 

        # check if current word is in important keyword list
        if word.text in freq_of_keywords.keys():

            # check if current sentence is present in sentence_strength dictionary 
            if sentence in sentence_strength.keys():

                # increase weight of sentence by normalized frequency of the current word
                sentence_strength[sentence] += freq_of_keywords[word.text]

            # if sentence is not in the sentence_strength dictionary
            else:
                
                # create key entry for current sentence and set the value to current word's normalized frequency
                sentence_strength[sentence] = freq_of_keywords[word.text]

pprint.pprint(sentence_strength)
            

{vice president mike pence was administered the covid-19 vaccine in a televised event friday, becoming the senior-most member of the trump administration known to have received the shot.

: 6.166666666666667,
 mr. pence said he hoped his vaccination would help to build public confidence that the vaccine was safe and effective.: 4.499999999999999,
 it isn’t known whether president trump and first lady melania trump, who both contracted covid-19 this fall, will receive the vaccine in the coming weeks.: 5.0,
 white house press secretary kayleigh mcenany said this week that the president is open to taking the vaccine but “wants to show americans that our priority are the most vulnerable.”

: 5.000000000000001,
 president-elect joe biden and his wife, jill biden, will get the vaccine in public on monday, his transition team said.

: 5.333333333333333,
 the transition team said vice president-elect kamala harris and her husband, doug emhoff, would receive the vaccine the following week.

: 5

### Sort the sentences by their strengths (importance)


In [18]:
sorted_sentence_strengths = sorted(sentence_strength.items(), key = lambda key_value: key_value[1], reverse=True)
pprint.pprint(sorted_sentence_strengths)

[(vice president mike pence was administered the covid-19 vaccine in a televised event friday, becoming the senior-most member of the trump administration known to have received the shot.

,
  6.166666666666667),
 (mr. pence received his first of two doses alongside second lady karen pence and surgeon general jerome adams, in an event held in the eisenhower executive office building, in front of two signs reading “safe and effective.”,
  6.000000000000001),
 (the transition team said vice president-elect kamala harris and her husband, doug emhoff, would receive the vaccine the following week.

,
  5.833333333333332),
 (president-elect joe biden and his wife, jill biden, will get the vaccine in public on monday, his transition team said.

,
  5.333333333333333),
 (white house press secretary kayleigh mcenany said this week that the president is open to taking the vaccine but “wants to show americans that our priority are the most vulnerable.”

,
  5.000000000000001),
 (it isn’t known wh

### Set summary sentence limit

In [19]:
num_summary_sentences = 3

##### Extract the first `num_summary_sentences` from the `sorted_sentence_strengths` list

In [20]:
if num_summary_sentences > len(sorted_sentence_strengths):
    print('Number of sentences required in summary is more than sentences in original text!')

# slicing past the end of a list is safe: it returns all available sentences
summary_sentence_set = sorted_sentence_strengths[:num_summary_sentences]

pprint.pprint(summary_sentence_set)


[(vice president mike pence was administered the covid-19 vaccine in a televised event friday, becoming the senior-most member of the trump administration known to have received the shot.

,
  6.166666666666667),
 (mr. pence received his first of two doses alongside second lady karen pence and surgeon general jerome adams, in an event held in the eisenhower executive office building, in front of two signs reading “safe and effective.”,
  6.000000000000001),
 (the transition team said vice president-elect kamala harris and her husband, doug emhoff, would receive the vaccine the following week.

,
  5.833333333333332)]
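
Sorting by strength means the summary lists sentences by score rather than in the order they appear in the article. As an optional variation, the sketch below picks the top sentences with `heapq.nlargest` and then restores document order. The `(sentence, score)` pairs and the helper name `top_sentences` are hypothetical stand-ins for the notebook's data.

```python
import heapq

# Hypothetical (sentence, score) pairs listed in document order.
scored = [
    ("First sentence.", 6.2),
    ("Second sentence.", 4.5),
    ("Third sentence.", 5.0),
    ("Fourth sentence.", 6.0),
]

def top_sentences(scored_sentences, limit):
    """Pick the `limit` best-scoring sentences, returned in document order."""
    indexed = list(enumerate(scored_sentences))
    # nlargest simply returns every item when limit exceeds the list length
    best = heapq.nlargest(limit, indexed, key=lambda item: item[1][1])
    return [sentence for _, (sentence, _) in sorted(best, key=lambda item: item[0])]

print(top_sentences(scored, 2))
# ['First sentence.', 'Fourth sentence.']
```

Whether document order reads better than score order depends on the text; for news articles it usually keeps the summary easier to follow.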


### Generate final summary 

- by cleaning up the extracted sentence list

In [21]:
summary = ' '.join([ item[0].text.capitalize() for item in summary_sentence_set ])
print(summary)

Vice president mike pence was administered the covid-19 vaccine in a televised event friday, becoming the senior-most member of the trump administration known to have received the shot.

 Mr. pence received his first of two doses alongside second lady karen pence and surgeon general jerome adams, in an event held in the eisenhower executive office building, in front of two signs reading “safe and effective.” The transition team said vice president-elect kamala harris and her husband, doug emhoff, would receive the vaccine the following week.


