# AUTOMATIC TEXT SUMMARIZATION USING SUMY





Automatic text summarization is the process of using software to shorten a text document into a summary that keeps the major points of the original. The main idea of summarization is to find a subset of the data that contains the “information” of the entire set. Such techniques are widely used in industry today.

### WHY SHOULD WE DO SUMMARIZATION?

Knowing when it is appropriate to summarize and when it is not both aids in defining summarization and allows students to talk and write about long bodies of work in an organized and efficient way. Being able to summarize is important because it can simplify the complicated, aid efficient studying, and improve one's ability to communicate clearly and effectively.


There are two types of automatic summarization: extractive and abstractive.


**Extractive summarization** identifies the important sections of the text and reproduces them verbatim, so the summary is a subset of the sentences from the original text. **Abstractive summarization** instead interprets and examines the text with advanced natural language techniques and generates a new, shorter text that conveys the most critical information from the original.


**SUMY LIBRARY** - Sumy is a simple library and command-line utility for extracting summaries from HTML pages or plain text. The package also contains a simple evaluation framework for text summaries. Implemented summarization methods:

* Luhn - heuristic method
* Edmundson - heuristic method with previous statistic research
* Latent Semantic Analysis (LSA) - a technique in natural language processing, in particular distributional semantics, that analyzes relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms. LSA assumes that words that are close in meaning will occur in similar pieces of text (the distributional hypothesis).
* LexRank - unsupervised approach inspired by the PageRank and HITS algorithms
* TextRank - unsupervised approach, also using the PageRank algorithm
* SumBasic - method that is often used as a baseline in the literature
* KL-Sum - method that greedily adds sentences to a summary so long as it decreases the KL divergence
* Reduction - graph-based summarization, where a sentence's salience is computed as the sum of the weights of its edges to other sentences. The weight of an edge between two sentences is computed in the same manner as in TextRank.





In this tutorial, we consider four of these summarization methods, which differ in the algorithm they use to produce the summary.

1.   LexRank
2.   Luhn
3.   Latent Semantic Analysis (LSA)
4.   TextRank


We are going to take a document from the internet and summarize it using all four methods listed above. We will then compare the techniques and analyze the summaries using a metric called ROUGE-N.




In [23]:
!pip install sumy




In [0]:
from sumy.nlp.tokenizers import Tokenizer
from sumy.parsers.plaintext import PlaintextParser
from sumy.summarizers.lex_rank import LexRankSummarizer



In [0]:
document1 = """ A spate of mysterious second-time infections is calling into question the accuracy of COVID-19 diagnostic tools even as China prepares to lift quarantine measures to allow residents to leave the epicenter of its outbreak next month. It's also raising concerns of a possible second wave of cases.

From March 18-22, the Chinese city of Wuhan reported no new cases of the virus through domestic transmission — that is, infection passed on from one person to another. The achievement was seen as a turning point in efforts to contain the virus, which has infected more than 80,000 people in China. Wuhan was particularly hard-hit, with more than half of all confirmed cases in the country.

But some Wuhan residents who had tested positive earlier and then recovered from the disease are testing positive for the virus a second time. Based on data from several quarantine facilities in the city, which house patients for further observation after their discharge from hospitals, about 5%-10% of patients pronounced "recovered" have tested positive again.

Some of those who retested positive appear to be asymptomatic carriers — those who carry the virus and are possibly infectious but do not exhibit any of the illness's associated symptoms — suggesting that the outbreak in Wuhan is not close to being over.

NPR has spoken by phone or exchanged text messages with four individuals in Wuhan who are part of this group of individuals testing positive a second time in March. All four said they had been sickened with the virus and tested positive, then were released from medical care in recent weeks after their condition improved and they tested negative.

Two of them are front-line doctors who were sickened after treating patients in their Wuhan hospitals. The other two are Wuhan residents. They all requested anonymity when speaking with NPR because those who have challenged the government's handling of the outbreak have been detained.

One of the Wuhan residents who spoke to NPR exhibited severe symptoms during their first round of illness and was eventually hospitalized. The second resident displayed only mild symptoms at first and was quarantined in one of more than a dozen makeshift treatment centers erected in Wuhan during the peak of the outbreak.

But when both were tested a second time for the coronavirus on Sunday, March 22, as a precondition for seeking medical care for unrelated health issues, they tested positive for the coronavirus even though they exhibited none of the typical symptoms, such as a fever or dry cough. The time from their recovery and release to the retest ranged from a few days to a few weeks.

Could that second positive test mean a second round of infection? Virologists think it is unlikely that a COVID-19 patient could be re-infected so quickly after recovery but caution that it is too soon to know.

Under its newest COVID-19 prevention guidelines, China does not include in its overall daily count for total and for new cases those who retest positive after being released from medical care. China also does not include asymptomatic cases in case counts.

"I have no idea why the authorities choose not to count [asymptomatic] cases in the official case count. I am baffled," said one of the Wuhan doctors who had a second positive test after recovering.

These four people are now being isolated under medical observation. It is unclear whether they are infectious and why they tested positive after their earlier negative test.

It is possible they were first given a false negative test result, which can happen if the swab used to collect samples of the virus misses bits of the virus. Dr. Li Wenliang, a whistleblowing doctor who later died of the virus himself in February, tested negative for the coronavirus several times before being accurately diagnosed.

In February, Wang Chen, a director at the state-run Chinese Academy of Medical Sciences, estimated that the nucleic acid tests used in China were accurate at identifying positive cases of the coronavirus only 30%-50% of the time.

Another theory is that, because the test amplifies tiny bits of DNA, residual virus from the initial infection could have falsely resulted in that second positive reading.

"There are false positives with these types of tests," Dr. Jeffrey Shaman, a professor of environmental health sciences at Columbia University, told NPR by email. Shaman recently co-authored a modeling study showing that transmission by individuals who did not exhibit any symptoms was a driver of the Wuhan outbreak.

How real is China's recovery?

On Tuesday, Hubei province, where Wuhan is the capital, said it would relax lockdown measures that have now been in place for more than two months and begin letting residents leave cities the following day. Wuhan said it would begin lifting its quarantine measures and letting residents leave two weeks later, on April 8.

To leave Wuhan, residents must first test negative for the coronavirus, according to municipal authorities. Such screenings will identify some remaining asymptomatic virus carriers. But the high rate of false negatives that Chinese doctors have cited means many with the virus could pass undetected.

Last Thursday, Wuhan reported for the first time since the outbreak began that it had no new cases of the virus from the day before — a milestone in China's virus containment efforts. The city reported a zero rise in new cases for the following four days.

Assessing asymptomatic carriers

But Caixin, an independent Chinese news outlet, reported earlier this week that Wuhan hospitals were continuing to see new cases of asymptomatic virus carriers, citing a health official who said he had seen up to a dozen such cases a day.

Responding to inquiries about how the city was counting asymptomatic cases, Wuhan's health commission said Monday that it is quarantining new asymptomatic patients in specialized wards for 14 days. Such patients would be included in new daily case counts if they develop symptoms during that time, authorities said.

"Based on available World Health Organization data, new infections are mainly transmitted by patients who have developed symptoms. Hence [asymptomatic cases] may not be the main source of transmission," the commission said.

A researcher at China's health commission told reporters Tuesday that asymptomatic carriers "would not cause the spread" of the virus. Zunyou Wu, the researcher, explained this was because the authorities were isolating people who had close contact with confirmed patients. Wu did not explain how they would identify asymptomatic carriers who had no close contact with confirmed patients.

Addressing growing public concern of asymptomatic patients, China's Premier Li Keqiang urged during Thursday's senior-level government meeting that "relevant departments must ... truthfully, timely, and openly" answer questions, such as whether these patients are infectious and how the course of the outbreak may change.

Research suggests that the spread can be caused by asymptomatic carriers. Studies of patients from Wuhan and other Chinese cities who were diagnosed early in the outbreak suggest that asymptomatic carriers of the virus can infect those they have close contact with, such as family members.

"In terms of those who retested positive, the official party line is that they have not been proven to be infectious. That is not the same as saying they are not infectious," one of the Wuhan doctors who tested positive twice told NPR. He is now isolated and under medical observation. "If they really are not infectious," the doctor said, "then there would be no need to take them back to the hospitals again."

Geoff Brumfiel contributed reporting from Washington, D.C.

"""


In [68]:
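import nltk
nltk.download('punkt')  # sentence tokenizer data used by sumy's Tokenizer
# Parse the raw string into a sumy document using an English tokenizer
parser = PlaintextParser.from_string(document1, Tokenizer("english"))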


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


**LEX RANK**

LexRank is an unsupervised approach to text summarization based on graph-based centrality scoring of sentences. The main idea is that sentences “recommend” other similar sentences to the reader. Thus, if one sentence is very similar to many others, it will likely be a sentence of great importance. The importance of this sentence also stems from the importance of the sentences “recommending” it. Thus, to get ranked highly and placed in a summary, a sentence must be similar to many sentences that are in turn also similar to many other sentences. This makes intuitive sense and allows the algorithms to be applied to any arbitrary new text.
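The “recommendation” idea can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not Sumy's implementation: it uses whitespace tokenization, raw term counts rather than the TF-IDF weights of LexRank proper, no self-links, and an illustrative similarity `threshold` of 0.1. The function names are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def lexrank_scores(sentences, threshold=0.1, damping=0.85, iterations=50):
    """Toy LexRank: centrality on a thresholded cosine-similarity graph."""
    bags = [Counter(s.lower().split()) for s in sentences]
    n = len(bags)
    if n == 0:
        return []
    # Link two different sentences when they are similar enough.
    adjacency = [
        [1.0 if i != j and cosine_similarity(bags[i], bags[j]) > threshold else 0.0
         for j in range(n)]
        for i in range(n)
    ]
    degrees = [sum(row) for row in adjacency]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            # Each neighbour j "recommends" i with an equal share of its score.
            received = sum(
                adjacency[j][i] * scores[j] / degrees[j]
                for j in range(n) if degrees[j] > 0
            )
            new_scores.append((1 - damping) / n + damping * received)
        scores = new_scores
    return scores


sentences = [
    "the virus spread in the city",
    "the virus spread quickly",
    "the city reported the virus",
    "bananas are yellow",
]
scores = lexrank_scores(sentences)
```

In this toy input the first three sentences recommend one another and end up with equal, higher scores, while the unrelated last sentence keeps only the baseline score.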

In [0]:
# Using LexRank
summarizer = LexRankSummarizer()
#Summarize the document with 15 sentences
summary = summarizer(parser.document, 15)

In [83]:
for sentence in summary:
    print(sentence)

From March 18-22, the Chinese city of Wuhan reported no new cases of the virus through domestic transmission — that is, infection passed on from one person to another.
But some Wuhan residents who had tested positive earlier and then recovered from the disease are testing positive for the virus a second time.
Based on data from several quarantine facilities in the city, which house patients for further observation after their discharge from hospitals, about 5%-10% of patients pronounced "recovered" have tested positive again.
Some of those who retested positive appear to be asymptomatic carriers — those who carry the virus and are possibly infectious but do not exhibit any of the illness's associated symptoms — suggesting that the outbreak in Wuhan is not close to being over.
All four said they had been sickened with the virus and tested positive, then were released from medical care in recent weeks after their condition improved and they tested negative.
But when both were tested a se

**LUHN**

Luhn's method is a heuristic approach to text summarization and one of the earliest. Luhn proposed that the significance of a word in a document is indicated by how frequently it occurs. The idea is that words with the very highest frequencies (stop words) and words with the lowest frequencies contribute little to the meaning of the document, so the sentences that matter most are those with the densest concentration of the remaining significant words. It is not considered a very accurate approach. Luhn's algorithm is based on term frequency, in the same spirit as TF-IDF: it selects only the words of higher importance according to their frequency. In some descriptions, higher weights are also assigned to words that appear at the beginning of the document.

Luhn introduced the following criteria during text preprocessing:

1.  Removing stopwords
2.  Stemming (Likes->Like)
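A rough sketch of Luhn-style scoring, assuming a tiny hand-written stop-word list, whitespace tokenization, no stemming, and a single cluster per sentence running from its first to its last significant word. Luhn's original method and Sumy's `LuhnSummarizer` are more elaborate; the names `STOP_WORDS`, `significant_words`, and `luhn_score` are illustrative only.

```python
from collections import Counter

# Tiny illustrative stop-word list; a real run would use a full one.
STOP_WORDS = {"the", "a", "an", "of", "in", "is", "and", "to", "for", "are"}


def significant_words(sentences, min_count=2):
    """Words that are not stop words and occur at least min_count times."""
    counts = Counter(
        w for s in sentences for w in s.lower().split() if w not in STOP_WORDS
    )
    return {w for w, c in counts.items() if c >= min_count}


def luhn_score(sentence, significant):
    """(significant words in the cluster) squared, divided by cluster length."""
    words = sentence.lower().split()
    positions = [i for i, w in enumerate(words) if w in significant]
    if not positions:
        return 0.0
    span = positions[-1] - positions[0] + 1
    return len(positions) ** 2 / span


sentences = [
    "the virus spread in wuhan",
    "wuhan reported the virus",
    "weather is nice",
]
keywords = significant_words(sentences)
luhn_scores = [luhn_score(s, keywords) for s in sentences]
```

Sentences that pack several significant words close together score highest, and a sentence with no significant words scores zero.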

In [0]:
from sumy.summarizers.luhn import LuhnSummarizer

In [0]:
summarizer_luhn = LuhnSummarizer()
# Summarize the document with 15 sentences
summary_1 = summarizer_luhn(parser.document, 15)

In [77]:
for sentence in summary_1:
    print(sentence)

From March 18-22, the Chinese city of Wuhan reported no new cases of the virus through domestic transmission — that is, infection passed on from one person to another.
But some Wuhan residents who had tested positive earlier and then recovered from the disease are testing positive for the virus a second time.
Based on data from several quarantine facilities in the city, which house patients for further observation after their discharge from hospitals, about 5%-10% of patients pronounced "recovered" have tested positive again.
Some of those who retested positive appear to be asymptomatic carriers — those who carry the virus and are possibly infectious but do not exhibit any of the illness's associated symptoms — suggesting that the outbreak in Wuhan is not close to being over.
NPR has spoken by phone or exchanged text messages with four individuals in Wuhan who are part of this group of individuals testing positive a second time in March.
All four said they had been sickened with the vi

**LSA - LATENT SEMANTIC ANALYSIS**

Latent Semantic Analysis (LSA) is a theory and method for extracting and representing the
contextual-usage meaning of words by statistical computations applied to a large corpus of
text (Landauer and Dumais, 1997). The underlying idea is that the aggregate of all the word
contexts in which a given word does and does not appear provides a set of mutual
constraints that largely determines the similarity of meaning of words and sets of words to
each other. 

To extract and understand patterns from documents, LSA relies on the following assumptions:

1. The meaning of a sentence or document is the sum of the meanings of all the words occurring in it. Overall, the meaning of a word is an average across all the documents it occurs in.
2. The semantic associations between words are not present explicitly, but only latently, in a large sample of language.
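To make the “latent” part concrete, here is a small sketch that builds a term-by-sentence count matrix and finds how strongly each sentence loads on the single dominant topic. It is a simplification under stated assumptions: plain Python power iteration stands in for the full singular value decomposition that LSA uses, only the first topic is recovered, and the function names are hypothetical rather than taken from Sumy.

```python
import math
from collections import Counter


def term_sentence_matrix(sentences):
    """Rows are vocabulary terms, columns are sentences, cells are counts."""
    bags = [Counter(s.lower().split()) for s in sentences]
    vocab = sorted({w for bag in bags for w in bag})
    return [[bag[w] for bag in bags] for w in vocab]


def dominant_topic_weights(matrix, iterations=100):
    """Absolute loading of each sentence on the strongest latent topic."""
    if not matrix:
        return []
    n = len(matrix[0])
    # gram[i][j] is the dot product of sentence columns i and j (A^T A).
    gram = [
        [sum(row[i] * row[j] for row in matrix) for j in range(n)]
        for i in range(n)
    ]
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iterations):
        # Power iteration converges to the leading eigenvector of A^T A,
        # which is the first right singular vector of A.
        w = [sum(gram[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    return [abs(x) for x in v]


sentences = ["virus cases wuhan", "wuhan virus test", "banana apple"]
weights = dominant_topic_weights(term_sentence_matrix(sentences))
```

The two sentences that share vocabulary load equally on the dominant topic, while the unrelated sentence has a weight close to zero.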

In [0]:
from sumy.summarizers.lsa import LsaSummarizer

In [0]:
summarizer_lsa = LsaSummarizer()
# Summarize the document with 15 sentences
summary_2 = summarizer_lsa(parser.document, 15)

In [79]:
for sentence in summary_2:
    print(sentence)

Wuhan was particularly hard-hit, with more than half of all confirmed cases in the country.
Based on data from several quarantine facilities in the city, which house patients for further observation after their discharge from hospitals, about 5%-10% of patients pronounced "recovered" have tested positive again.
NPR has spoken by phone or exchanged text messages with four individuals in Wuhan who are part of this group of individuals testing positive a second time in March.
All four said they had been sickened with the virus and tested positive, then were released from medical care in recent weeks after their condition improved and they tested negative.
Two of them are front-line doctors who were sickened after treating patients in their Wuhan hospitals.
Under its newest COVID-19 prevention guidelines, China does not include in its overall daily count for total and for new cases those who retest positive after being released from medical care.
I am baffled," said one of the Wuhan doctor

In [0]:
# Alternative method using a stemmer and stop words
from sumy.nlp.stemmers import Stemmer
from sumy.utils import get_stop_words
summarizer_lsa2 = LsaSummarizer(Stemmer("english"))
summarizer_lsa2.stop_words = get_stop_words("english")

In [81]:
for sentence in summarizer_lsa2(parser.document,15):
    print(sentence)

A spate of mysterious second-time infections is calling into question the accuracy of COVID-19 diagnostic tools even as China prepares to lift quarantine measures to allow residents to leave the epicenter of its outbreak next month.
From March 18-22, the Chinese city of Wuhan reported no new cases of the virus through domestic transmission — that is, infection passed on from one person to another.
Some of those who retested positive appear to be asymptomatic carriers — those who carry the virus and are possibly infectious but do not exhibit any of the illness's associated symptoms — suggesting that the outbreak in Wuhan is not close to being over.
One of the Wuhan residents who spoke to NPR exhibited severe symptoms during their first round of illness and was eventually hospitalized.
The second resident displayed only mild symptoms at first and was quarantined in one of more than a dozen makeshift treatment centers erected in Wuhan during the peak of the outbreak.
Under its newest COVI

**TEXT RANK**

TextRank is a graph-based ranking model for text processing that can be used to find the most relevant sentences in a text and also to find keywords.
To find the most relevant sentences, a graph is constructed in which each vertex represents a sentence in the document and the edges between sentences are based on content overlap, namely the number of words that two sentences have in common. The basic idea implemented by a graph-based ranking model is that of voting or recommendation.
When one vertex links to another, it is basically casting a vote for that vertex. The higher the number of votes cast for a vertex, the higher the importance of that vertex.

This network of sentences is then fed into the PageRank algorithm, which identifies the most important sentences. To extract a summary of the text, we take only the most important sentences.
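The voting scheme can be sketched as a weighted PageRank over a word-overlap graph. This is a minimal illustration, not Sumy's `TextRankSummarizer`: it assumes whitespace tokenization and uses the raw count of shared words as the edge weight, whereas the original TextRank paper also normalises that count by sentence length. The function names are hypothetical.

```python
def word_overlap(a, b):
    """Number of distinct words two tokenized sentences share."""
    return len(set(a) & set(b))


def textrank_scores(sentences, damping=0.85, iterations=50):
    """Toy TextRank: weighted PageRank where edges are word overlaps."""
    words = [s.lower().split() for s in sentences]
    n = len(words)
    if n == 0:
        return []
    weights = [
        [float(word_overlap(words[i], words[j])) if i != j else 0.0
         for j in range(n)]
        for i in range(n)
    ]
    out_weight = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        # Sentence j votes for sentence i in proportion to their overlap.
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] * scores[j] / out_weight[j]
                for j in range(n) if out_weight[j] > 0
            )
            for i in range(n)
        ]
    return scores


sentences = [
    "wuhan reported no new cases",
    "wuhan residents tested positive",
    "residents tested positive again in wuhan",
    "completely unrelated text",
]
scores = textrank_scores(sentences)
# Indices of the sentences, most important first.
ranking = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
```

The second and third sentences share the most words, so they collect the most votes; the unrelated sentence receives none and ranks last.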

In [0]:
from sumy.summarizers.text_rank import TextRankSummarizer
summarizer_3 = TextRankSummarizer()
summary_3 = summarizer_3(parser.document, 15)

In [88]:
for sentence in summary_3:
    print(sentence)

A spate of mysterious second-time infections is calling into question the accuracy of COVID-19 diagnostic tools even as China prepares to lift quarantine measures to allow residents to leave the epicenter of its outbreak next month.
From March 18-22, the Chinese city of Wuhan reported no new cases of the virus through domestic transmission — that is, infection passed on from one person to another.
But some Wuhan residents who had tested positive earlier and then recovered from the disease are testing positive for the virus a second time.
Some of those who retested positive appear to be asymptomatic carriers — those who carry the virus and are possibly infectious but do not exhibit any of the illness's associated symptoms — suggesting that the outbreak in Wuhan is not close to being over.
The second resident displayed only mild symptoms at first and was quarantined in one of more than a dozen makeshift treatment centers erected in Wuhan during the peak of the outbreak.
But when both wer

**ROUGE METRIC**

ROUGE, or Recall-Oriented Understudy for Gisting Evaluation, is a set of metrics and a software package used for evaluating automatic summarization and machine translation software in natural language processing. The metrics compare an automatically produced summary or translation against a reference or a set of references (human-produced) summary or translation.

ROUGE comprises several evaluation metrics. The ones relevant to this tutorial belong to the ROUGE-N family:

* ROUGE-N: Overlap of N-grams between the system and reference summaries.
* ROUGE-1 refers to the overlap of unigram (each word) between the system and reference summaries.
* ROUGE-2 refers to the overlap of bigrams between the system and reference summaries.

Generally, only ROUGE-1 and ROUGE-2 (and sometimes ROUGE-3, if the gold and model summaries are really long) are used for summarization evaluation. The rationale is that as we increase N, we increase the length of the N-gram word phrase that must match completely in both the gold and the model summary.

As an example, consider two semantically similar phrases, “apples bananas” and “bananas apples”. If we use ROUGE-1, we only consider unigrams, which are the same for both phrases. But if we use ROUGE-2, we use 2-word phrases, so “apples bananas” becomes a single entity that is different from “bananas apples”, leading to a “miss” and a lower evaluation score.
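The example can be checked with a small recall-style ROUGE-N function. This is a sketch assuming whitespace tokenization and a single reference summary; it is not the scoring code used to produce the numbers in this tutorial, and the function names are illustrative.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(system, reference, n=1):
    """Fraction of reference n-grams that also appear in the system summary."""
    system_ngrams = ngrams(system.lower().split(), n)
    reference_ngrams = ngrams(reference.lower().split(), n)
    total = sum(reference_ngrams.values())
    if total == 0:
        return 0.0
    # Counter intersection keeps the minimum count of each shared n-gram.
    matched = sum((system_ngrams & reference_ngrams).values())
    return matched / total


rouge_1 = rouge_n("apples bananas", "bananas apples", n=1)  # 1.0
rouge_2 = rouge_n("apples bananas", "bananas apples", n=2)  # 0.0
```

Both unigrams match, so ROUGE-1 is 1.0, while the only bigram differs, so ROUGE-2 is 0.0.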

We measured the ROUGE-1 score for each of the summarization techniques above, and the scores are tabulated below.

Summarization technique | ROUGE-1 score
--- | ---
LexRank | 0.26
Luhn | 0.126
LSA | 0.211
TextRank | 0.197



**ANALYSIS AND CONCLUSION**

LexRank is the winner here, as it yields the best ROUGE score. Unfortunately, we found the summaries it generated to be less informative than those from TextRank and Luhn’s model. Furthermore, LexRank doesn’t always beat TextRank on ROUGE score: for example, TextRank performs marginally better than LexRank on some of the text documents I have tested. The choice between LexRank and TextRank therefore depends on the text document, and it’s worth trying both.

Another point from the table is that Luhn’s algorithm has the lowest ROUGE score. This is because it extracts a longer summary and hence covers more of the original text. Unfortunately, we couldn’t make it shorter because the wrapper for Luhn’s algorithm in Sumy doesn’t provide parameters to change the word limit.