# Text Summarization

- Text summarization is the process of distilling the most important information from one or more sources to produce an abridged version for a particular user and task.
- The idea is to find a subset of the data that carries the “information” of the entire set.
- Main Idea
    - Preprocess the text (remove stop words and punctuation).
    - Build a word frequency table (word frequency distribution): how many times each word appears in the document.
    - Score each sentence from the words it contains and the frequency table.
    - Build the summary by joining the sentences that score above a chosen limit (this notebook keeps the 4 highest-scoring sentences).
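
The four steps above can be sketched in plain Python before bringing in spaCy. This is a minimal illustration, not the notebook's implementation: the regex tokenizer, the tiny `STOP` set, the naive sentence splitter, and the names `tokenize`, `summarize_by_frequency`, and `min_score` are all assumptions made for the example.

```python
import re
from collections import Counter

# Tiny illustrative stop-word set (an assumption; the notebook uses spaCy's list).
STOP = {"the", "a", "an", "of", "and", "is", "are", "on", "in", "to", "it"}

def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def summarize_by_frequency(text, min_score):
    """Keep every sentence whose summed word count reaches min_score."""
    # Steps 1-2: preprocess and count the non-stop words.
    counts = Counter(w for w in tokenize(text) if w not in STOP)
    # Naive sentence split on '.', '!' or '?' followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Step 3: score each sentence (stop words are absent from counts, so they add 0).
    scores = {s: sum(counts[w] for w in tokenize(s)) for s in sentences}
    # Step 4: join the sentences that clear the threshold, in original order.
    return " ".join(s for s in sentences if scores[s] >= min_score)

text = "Raw meat is risky. Cook raw meat well. The weather is nice."
print(summarize_by_frequency(text, min_score=5))
```

Here the first two sentences score 5 and 6 and are kept, while the third scores 2 and is dropped.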

In [80]:
import spacy
nlp = spacy.load('en_core_web_sm')
from spacy.lang.en.stop_words import STOP_WORDS

### Text Processing + Tokenization

In [2]:
stop_words = list(STOP_WORDS)
print(stop_words)

['is', 'latterly', 'his', 'became', 'once', 'amongst', 'across', 'also', 'whither', 'however', 'least', 'since', 'must', 'will', "'ll", 'such', 'towards', 'get', 'twenty', 'were', 'along', "'d", 'still', 'done', 'a', 'down', 'someone', 'here', 'last', 'enough', 'becoming', 'somewhere', 'them', 'some', 'used', 'five', 'n’t', 'was', 'other', 'eight', 'serious', 'whole', '‘m', 'what', 'throughout', 'but', 'over', 'from', 'yet', 'whereby', 'either', 'eleven', 'cannot', 'fifteen', 'can', 'she', 'else', 'when', 'whenever', 'less', 'in', 'themselves', 'so', 'with', 'except', 'beside', 'our', 'due', 'one', 'hereupon', 'the', 'and', 'herein', 'your', '’m', 'after', 'another', 'move', 'please', 'twelve', 'their', 'to', 'seems', 'do', 'under', 'on', 'even', 'which', 'neither', 'see', 'n‘t', 'be', 'again', 'every', 'before', 'all', 'just', 'two', 'toward', 'these', 'meanwhile', 'thereupon', 'former', 'those', 'very', 'each', '’s', 'without', 'full', 'are', 'ca', 'how', 'does', 'myself', '‘ve', 'go

In [3]:
len(stop_words)

326

In [47]:
with open('covid19.txt') as f:
    doc = nlp(f.read())

In [48]:
mytokens = [token.text for token in doc]

### Word Frequency Table
- Dictionary of words and their counts
- How many times each word appears in the document
- Using non-stopwords

In [49]:
# Build the word frequency table
# word.text is the raw string of each spaCy token
# Note: the stop-word check is case-sensitive and keys keep their original case,
# so capitalized stop words such as 'As' and 'Which' are still counted.
word_frequencies = {}
for word in doc:
    if word.text not in stop_words and not word.is_punct and word.pos_ != 'SPACE':
        word_frequencies[word.text] = word_frequencies.get(word.text, 0) + 1

In [50]:
word_frequencies

{'Through': 1,
 'International': 1,
 'Food': 2,
 'Safety': 2,
 'Authorities': 2,
 'Network': 1,
 'INFOSAN)and': 1,
 'National': 1,
 'seeking': 1,
 'information': 1,
 'potential': 2,
 'persistence': 1,
 'SARS': 2,
 'CoV-2': 2,
 'Which': 1,
 'causes': 1,
 'COVID-19': 1,
 'foods': 2,
 'traded': 1,
 'internationally': 1,
 'As': 2,
 'role': 1,
 'food': 1,
 'transmission': 1,
 'virus': 1,
 'Currently': 1,
 'investigations': 1,
 'conducted': 1,
 'evaluate': 1,
 'viability': 1,
 'survival': 1,
 'time': 1,
 'general': 1,
 'rule': 1,
 'consumption': 1,
 'raw': 3,
 'undercooked': 1,
 'animal': 2,
 'products': 1,
 'avoided': 1,
 'Raw': 1,
 'meat': 1,
 'milk': 1,
 'organs': 1,
 'handled': 1,
 'care': 1,
 'avoid': 1,
 'crosscontamination': 1,
 'uncooked': 1}

### Maximum Word Frequency
- Find the weighted frequency of each word.
- Divide each word's count by the count of the most frequent word.
- Sentence scores are sums of these weights, so a long sentence can outscore a short one.
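
As a small worked example of this weighting, the sketch below normalizes three counts taken from the frequency table printed earlier (`raw`: 3, `Food`: 2, `virus`: 1). The helper name `normalize` is hypothetical.

```python
def normalize(counts):
    """Divide every count by the largest count so weights fall in (0, 1]."""
    top = max(counts.values())
    return {word: n / top for word, n in counts.items()}

counts = {"raw": 3, "Food": 2, "virus": 1}
weights = normalize(counts)

print(weights["raw"])              # 1.0, the most frequent word
print(round(weights["Food"], 3))   # 0.667
print(round(weights["virus"], 3))  # 0.333
```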

In [51]:
# Maximum Word Frequency
maximum_frequency = max(word_frequencies.values())
maximum_frequency

3

In [52]:
for word in word_frequencies.keys():  
    word_frequencies[word] = (word_frequencies[word]/maximum_frequency)

### Word Frequency Distribution

In [53]:
# Frequency Table
word_frequencies

{'Through': 0.3333333333333333,
 'International': 0.3333333333333333,
 'Food': 0.6666666666666666,
 'Safety': 0.6666666666666666,
 'Authorities': 0.6666666666666666,
 'Network': 0.3333333333333333,
 'INFOSAN)and': 0.3333333333333333,
 'National': 0.3333333333333333,
 'seeking': 0.3333333333333333,
 'information': 0.3333333333333333,
 'potential': 0.6666666666666666,
 'persistence': 0.3333333333333333,
 'SARS': 0.6666666666666666,
 'CoV-2': 0.6666666666666666,
 'Which': 0.3333333333333333,
 'causes': 0.3333333333333333,
 'COVID-19': 0.3333333333333333,
 'foods': 0.6666666666666666,
 'traded': 0.3333333333333333,
 'internationally': 0.3333333333333333,
 'As': 0.6666666666666666,
 'role': 0.3333333333333333,
 'food': 0.3333333333333333,
 'transmission': 0.3333333333333333,
 'virus': 0.3333333333333333,
 'Currently': 0.3333333333333333,
 'investigations': 0.3333333333333333,
 'conducted': 0.3333333333333333,
 'evaluate': 0.3333333333333333,
 'viability': 0.3333333333333333,
 'survival': 0.

### Sentence Score and Ranking of Words in Each Sentence
- Split the document into sentence tokens.
- Score every sentence by summing the weights of the words it contains.
- Only non-stop words present in the word frequency table contribute to the score.
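
Because a sentence score is a sum, sentences with more words tend to score higher. The sketch below contrasts the summed score with a length-averaged variant on made-up weights. The averaging is an alternative shown for comparison only and is not what this notebook does; the names `sum_score` and `mean_score` are hypothetical.

```python
# Made-up normalized weights; words missing from the table count as 0.
weights = {"raw": 1.0, "meat": 0.5, "food": 0.5, "virus": 0.25}

def sum_score(words):
    """Sum the weights of the words that appear in the table."""
    return sum(weights.get(w, 0.0) for w in words)

def mean_score(words):
    """Average the weights instead, so sentence length matters less."""
    return sum_score(words) / len(words) if words else 0.0

short = ["raw", "meat"]
long = ["food", "virus", "food", "virus", "meat", "the", "of", "a"]

print(sum_score(short), sum_score(long))    # 1.5 2.0
print(mean_score(short), mean_score(long))  # 0.75 0.25
```

The longer word list wins on the summed score but loses once the score is averaged over its length.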

In [54]:
# Sentence Tokens
sentence_list = list(doc.sents)
print(sentence_list)

[Through the International Food Safety Authorities Network (, INFOSAN)and
, National Food Safety Authorities are seeking more information on the
potential for persistence of SARS-CoV-2., Which causes COVID-19 on foods
traded internationally., As well as the potential role of food in the transmission
of the virus., Currently there are investigations conducted to evaluate the
viability and survival time of SARS-CoV-2., As a general rule, the consumption
of raw or undercooked animal products should be avoided., Raw meat, raw
, milk or raw animal, organs should be handled with care to avoid crosscontamination with uncooked foods.]


In [57]:
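# Score each sentence by summing the weights of its words
# Note: lookups use the lowercased token, but the table keys keep their original
# case, so 'Food' is scored with the weight stored under 'food', and words stored
# only in capitalized form (e.g. 'National') are skipped.
sentence_scores = {}
for sent in sentence_list:
    for word in sent:
        if word.text.lower() in word_frequencies:
            # Only score sentences with fewer than 30 space-separated words
            if len(sent.text.split(' ')) < 30:
                sentence_scores[sent] = sentence_scores.get(sent, 0) + word_frequencies[word.text.lower()]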


In [58]:
sentence_scores

{Through the International Food Safety Authorities Network (: 0.3333333333333333,
 National Food Safety Authorities are seeking more information on the
 potential for persistence of SARS-CoV-2.: 1.9999999999999998,
 Which causes COVID-19 on foods
 traded internationally.: 1.6666666666666665,
 As well as the potential role of food in the transmission
 of the virus.: 1.9999999999999998,
 Currently there are investigations conducted to evaluate the
 viability and survival time of SARS-CoV-2.: 1.9999999999999998,
 As a general rule, the consumption
 of raw or undercooked animal products should be avoided.: 3.666666666666667,
 Raw meat, raw: 2.333333333333333,
 milk or raw animal: 2.0,
 organs should be handled with care to avoid crosscontamination with uncooked foods.: 2.6666666666666665}

### Finding the Top N Sentences with the Largest Scores
**using heapq**
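
For reference, `nlargest(n, iterable, key=...)` returns the `n` items with the highest key values in descending order. Iterating a dictionary yields its keys, so passing the dictionary's `get` method as the key ranks the keys by their values. The sketch below uses made-up sentence labels and scores, and then restores document order, which is an optional extra step not taken in this notebook.

```python
from heapq import nlargest

# Hypothetical scores keyed by sentence label, listed in document order.
scores = {"s1": 2.0, "s2": 3.5, "s3": 0.5, "s4": 4.0}

# The two highest-scoring labels, best first.
top = nlargest(2, scores, key=scores.get)
print(top)  # ['s4', 's2']

# Optional: put the selected sentences back in their original order.
order = {label: i for i, label in enumerate(scores)}
in_document_order = sorted(top, key=order.get)
print(in_document_order)  # ['s2', 's4']
```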

In [59]:
from heapq import nlargest

In [60]:
summarized_sentences = nlargest(4, sentence_scores, key=sentence_scores.get)
summarized_sentences

[As a general rule, the consumption
 of raw or undercooked animal products should be avoided.,
 organs should be handled with care to avoid crosscontamination with uncooked foods.,
 Raw meat, raw,
 milk or raw animal]

### Convert from spaCy Span to Text

In [61]:
# Print each selected sentence as a plain string (Span.text)
for w in summarized_sentences:
    print(w.text)

As a general rule, the consumption
of raw or undercooked animal products should be avoided.
organs should be handled with care to avoid crosscontamination with uncooked foods.
Raw meat, raw

milk or raw animal


In [64]:
# Convert the spaCy Span objects to strings so they can be joined
final_sentences = [ w.text for w in summarized_sentences ]

### Join sentences

In [65]:
summary = ' '.join(final_sentences)
summary

'As a general rule, the consumption\nof raw or undercooked animal products should be avoided. organs should be handled with care to avoid crosscontamination with uncooked foods. Raw meat, raw\n milk or raw animal'
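
The joined summary still carries the line breaks of the source file (the `\n` characters in the output above). One possible cleanup, shown here on a shortened copy of that output, is to split on any whitespace and rejoin with single spaces. The helper name `collapse_whitespace` is hypothetical.

```python
def collapse_whitespace(text):
    """Replace every run of spaces, tabs, and newlines with one space."""
    return " ".join(text.split())

raw_summary = (
    "As a general rule, the consumption\nof raw or undercooked animal "
    "products should be avoided. Raw meat, raw\n milk or raw animal"
)
print(collapse_whitespace(raw_summary))
```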

## Gensim Summarization

In [74]:
# Requires gensim < 4.0: the gensim.summarization module was removed in Gensim 4.0
from gensim.summarization import summarize

In [77]:
summarize(open('covid19.txt').read())

'National Food Safety Authorities are seeking more information on the\npotential for persistence of SARS-CoV-2.'

### All in One Place

In [87]:
# Wrap all the steps in a function for reusability
def text_summarizer(raw_docx):
    docx = nlp(raw_docx)
    stopwords = set(STOP_WORDS)

    # Build the word frequency table, skipping stop words, punctuation, and whitespace.
    # Unlike the step-by-step cells, tokens are lowercased before counting so that
    # the lowercase lookups in the scoring loop match the table keys.
    word_frequencies = {}
    for word in docx:
        token = word.text.lower()
        if token not in stopwords and not word.is_punct and word.pos_ != 'SPACE':
            word_frequencies[token] = word_frequencies.get(token, 0) + 1

    # Normalize by the count of the most frequent word
    maximum_frequency = max(word_frequencies.values())
    for word in word_frequencies:
        word_frequencies[word] = word_frequencies[word] / maximum_frequency

    # Sentence tokens
    sentence_list = list(docx.sents)

    # Calculate sentence scores, ignoring sentences of 30 or more words
    sentence_scores = {}
    for sent in sentence_list:
        if len(sent.text.split(' ')) >= 30:
            continue
        for word in sent:
            token = word.text.lower()
            if token in word_frequencies:
                sentence_scores[sent] = sentence_scores.get(sent, 0) + word_frequencies[token]

    # Find the 4 highest-scoring sentences and join them
    summary_sentences = nlargest(4, sentence_scores, key=sentence_scores.get)
    final_sentences = [w.text for w in summary_sentences]
    summary = ' '.join(final_sentences)
    print("Original Document\n")
    print(raw_docx)
    print("Total Length:", len(raw_docx))
    print('\n\nSummarized Document\n')
    print(summary)
    print("Total Length:", len(summary))

In [88]:
text_summarizer(open('covid_research.txt').read())

Original Document

Dr Sonya Babu-Narayan, Associate Medical Director at the British Heart Foundation and Honorary Consultant Cardiologist, said: 

“Every day we learn more about Covid-19. Information to date suggests that people with heart disease, or are at risk of heart disease due to factors such as high blood pressure, diabetes or being severely overweight with a body mass index higher than 40, are at an increased risk of complications caused by the virus.

“If you have one of these conditions you should be taking all precautions possible to reduce your chance of catching the virus.

“Viruses can cause significant inflammation which can injure the heart and can worsen a person’s existing heart condition even if the virus does not enter the heart directly.

“Evidence shows that people with higher levels of a protein used to measure heart injury in their blood are more likely to die after contracting Covid-19. 

“However this kind of observational evidence can’t tell us why some peop

In [89]:
summarize(open('covid_research.txt').read())

'Information to date suggests that people with heart disease, or are at risk of heart disease due to factors such as high blood pressure, diabetes or being severely overweight with a body mass index higher than 40, are at an increased risk of complications caused by the virus.\n“Evidence shows that people with higher levels of a protein used to measure heart injury in their blood are more likely to die after contracting Covid-19.\n“However this kind of observational evidence can’t tell us why some people suffer heart damage, whether this is caused by the virus or how exactly this may lead to worse outcomes.\nis a case series, which follows-up patients who have been infected by COVID-19 and try to draw conclusions regarding the impact of treatments and the pathophysiology of the disease on mortality.\n“The comment in the press release – “It is likely that even in the absence of previous heart disease, the heart muscle can be affected by coronavirus disease,” said Mohammad Madjid, MD, MS