# TEXT SUMMARIZATION USING SUMY & PYTHON

- Install Sumy with `pip install sumy`

- Sumy offers several summarization algorithms, including:
    - Luhn – heuristic method
    - Latent Semantic Analysis (LSA)
    - Edmundson – heuristic method that builds on earlier statistical research
    - LexRank – unsupervised approach inspired by the PageRank and HITS algorithms
    - TextRank
    - SumBasic – method that is often used as a baseline in the literature
    - KL-Sum – method that greedily adds sentences to a summary as long as doing so decreases the KL divergence between the document's and the summary's word distributions
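The greedy idea behind KL-Sum can be sketched in plain Python. This is only an illustration under stated assumptions, not Sumy's implementation: the function names, the regex tokenizer, and the additive smoothing constant are all made up for this example, and the first sentence is always accepted because an empty summary has no word distribution to compare against.

```python
import math
import re
from collections import Counter


def words(text):
    """Lowercase word tokens; a crude stand-in for a real tokenizer."""
    return re.findall(r"[a-z']+", text.lower())


def kl_divergence(doc_freq, summary_counts, smoothing=1e-3):
    """KL(document || summary) over the document vocabulary.

    Additive smoothing (an assumption of this sketch) keeps the summary
    probability of unseen words above zero.
    """
    total = sum(summary_counts.values()) + smoothing * len(doc_freq)
    kl = 0.0
    for word, p in doc_freq.items():
        q = (summary_counts.get(word, 0) + smoothing) / total
        kl += p * math.log(p / q)
    return kl


def kl_sum(sentences, max_sentences):
    """Greedily pick sentences while each pick lowers the KL divergence."""
    doc_counts = Counter(w for s in sentences for w in words(s))
    n_words = sum(doc_counts.values())
    doc_freq = {w: c / n_words for w, c in doc_counts.items()}

    chosen = []
    summary_counts = Counter()
    current = math.inf  # empty summary: any first sentence is an improvement
    while len(chosen) < max_sentences:
        best_index, best_kl = None, current
        for i, sentence in enumerate(sentences):
            if i in chosen:
                continue
            candidate = summary_counts + Counter(words(sentence))
            kl = kl_divergence(doc_freq, candidate)
            if kl < best_kl:
                best_index, best_kl = i, kl
        if best_index is None:  # no remaining sentence decreases the divergence
            break
        chosen.append(best_index)
        summary_counts += Counter(words(sentences[best_index]))
        current = best_kl
    # Return the picked sentences in document order.
    return [sentences[i] for i in sorted(chosen)]


sample = [
    "Machine learning builds models from data.",
    "Models make predictions from data.",
    "The weather was pleasant.",
]
print(kl_sum(sample, 2))
```

The loop stops early when no remaining sentence lowers the divergence, so the result can be shorter than `max_sentences`.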

In [40]:
from sumy.parsers.html import HtmlParser
from sumy.parsers.plaintext import PlaintextParser

from sumy.summarizers.lsa import LsaSummarizer 
from sumy.summarizers.luhn import LuhnSummarizer
from sumy.summarizers.lex_rank import LexRankSummarizer
from sumy.summarizers.text_rank import TextRankSummarizer

from sumy.nlp.tokenizers import Tokenizer
from sumy.nlp.stemmers import Stemmer
from sumy.utils import get_stop_words

## LsaSummarizer

Latent Semantic Analysis (LSA) combines term frequency with singular value decomposition (SVD).
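To make the "term frequency plus SVD" idea concrete, here is a deliberately reduced sketch. It is not how `LsaSummarizer` is implemented: it assumes a regex tokenizer with no stemming or stop words, and it keeps only the first right singular vector of the term-sentence matrix, which it approximates by power iteration instead of computing a full SVD. All function names are hypothetical.

```python
import math
import re
from collections import Counter


def term_sentence_matrix(sentences):
    """Rows are terms, columns are sentences, entries are raw term counts."""
    counts = [Counter(re.findall(r"[a-z]+", s.lower())) for s in sentences]
    vocabulary = sorted(set().union(*counts))
    return [[c[term] for c in counts] for term in vocabulary]


def dominant_topic_weights(matrix, iterations=100):
    """Approximate the first right singular vector of `matrix`.

    Power iteration on A^T A; starting from a positive vector on a
    non-negative matrix keeps every weight non-negative.
    """
    n_sentences = len(matrix[0])
    v = [1.0] * n_sentences
    for _ in range(iterations):
        # av = A v (one value per term), then atav = A^T (A v) (one per sentence)
        av = [sum(row[j] * v[j] for j in range(n_sentences)) for row in matrix]
        atav = [
            sum(matrix[i][j] * av[i] for i in range(len(matrix)))
            for j in range(n_sentences)
        ]
        norm = math.sqrt(sum(x * x for x in atav))
        if norm == 0:
            return atav
        v = [x / norm for x in atav]
    return v


def lsa_top_sentences(sentences, count):
    """Return the `count` sentences with the largest weight in the main topic."""
    matrix = term_sentence_matrix(sentences)
    if not matrix:  # no words at all: nothing to rank
        return sentences[:count]
    weights = dominant_topic_weights(matrix)
    ranked = sorted(range(len(sentences)), key=lambda i: weights[i], reverse=True)
    return [sentences[i] for i in sorted(ranked[:count])]


sample = [
    "machine learning builds models",
    "machine learning uses data",
    "bananas are yellow",
]
print(lsa_top_sentences(sample, 2))
```

Sentences that share vocabulary with the rest of the text get a large weight in the dominant topic, while an unrelated sentence gets a weight near zero. A fuller LSA ranking would combine several singular vectors and their singular values.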

In [42]:
document ="""Machine learning (ML) is the scientific study of algorithms and 
statistical models that computer systems use to progressively improve their performance on a specific task. 
Machine learning algorithms build a mathematical model of sample data, known as "training data", in order 
to make predictions or decisions without being explicitly programmed to perform the task. Machine learning 
algorithms are used in the applications of email filtering, detection of network intruders, and computer 
vision, where it is infeasible to develop an algorithm of specific instructions for performing the task. 
Machine learning is closely related to computational statistics, which focuses on making predictions using 
computers. The study of mathematical optimization delivers methods, theory and application domains to the 
field of machine learning. Data mining is a field of study within machine learning, and focuses on 
exploratory data analysis through unsupervised learning. In its application across business problems, 
machine learning is also referred to as predictive analytics."""

In [60]:
url_wiki = "https://en.wikipedia.org/wiki/Automatic_summarization"
url_npa  = "https://corporatefinanceinstitute.com/resources/knowledge/other/non-performing-assets-in-indian-banks/"
url_covid = "https://en.wikipedia.org/wiki/Coronavirus_disease_2019"

In [64]:
LANGUAGE = "english"
SENTENCES_COUNT = 10

In [61]:
# from URLs
parser_url_wiki = HtmlParser.from_url(url_wiki, Tokenizer(LANGUAGE))
parser_url_npa  = HtmlParser.from_url(url_npa,  Tokenizer(LANGUAGE))
parser_url_covid  = HtmlParser.from_url(url_covid,  Tokenizer(LANGUAGE))

In [62]:
print(parser_url_wiki.document)
print(parser_url_npa.document)
print(parser_url_covid.document)

<DOM with 60 paragraphs>
<DOM with 12 paragraphs>
<DOM with 87 paragraphs>


In [55]:
# from a string
parser_string = PlaintextParser.from_string(document, Tokenizer(LANGUAGE))

In [56]:
stemmer = Stemmer(LANGUAGE)

In [57]:
summ = LsaSummarizer(stemmer)
summ.stop_words = get_stop_words(LANGUAGE)

In [65]:
# for the COVID-19 text
for sentence in summ(parser_url_covid.document, SENTENCES_COUNT):
    print("-----------------------------------------------------------------------------------")
    print(sentence)

-----------------------------------------------------------------------------------
Infection appears to set off a chain of vasoconstrictive responses within the body, constriction of blood vessels within the pulmonary circulation has also been posited as a mechanism in which oxygenation decreases alongside the presentation of viral pneumonia.
-----------------------------------------------------------------------------------
[104] Chinese scientists were able to isolate a strain of the coronavirus and publish the genetic sequence so laboratories across the world could independently develop polymerase chain reaction (PCR) tests to detect infection by the virus.
-----------------------------------------------------------------------------------
This graphic shows how early adoption of containment measures tends to protect wider swaths of the population.Progressively stronger mitigation efforts to reduce the number of active cases at any given time—" flattening the curve "—allows healthc

In [58]:
# for the wiki text
for sentence in summ(parser_url_wiki.document, SENTENCES_COUNT):
    print("-----------------------------------------------------------------------------------")
    print(sentence)

-----------------------------------------------------------------------------------
Some techniques and algorithms which naturally model summarization problems are TextRank and PageRank, Submodular set function , Determinantal point process , maximal marginal relevance (MMR) etc.
-----------------------------------------------------------------------------------
-----------------------------------------------------------------------------------
LexRank deals with diversity as a heuristic final stage using CSIS, and other systems have used similar methods, such as Maximal Marginal Relevance (MMR), [12] in trying to eliminate redundancy in information retrieval results.
-----------------------------------------------------------------------------------
^ Sebastian Tschiatschek, Rishabh Iyer, Hoachen Wei and Jeff Bilmes, Learning Mixtures of Submodular Functions for Image Collection Summarization , In Advances of Neural Information Processing Systems (NIPS), Montreal, Canada, December - 2

In [59]:
# for NPA text
for sentence in summ(parser_url_npa.document, SENTENCES_COUNT):
    print("-----------------------------------------------------------------------------------")
    print(sentence)
    


-----------------------------------------------------------------------------------
The situation became serious with the substantial delay in environmental permits, affecting the infrastructure sector – power, iron, and steel – resulting in volatility in prices of raw materials and a shortage of supply.
-----------------------------------------------------------------------------------
Another reason is the relaxed lending norms adopted by banks, especially to the big corporate houses, foregoing analysis of their financials and credit ratings.
-----------------------------------------------------------------------------------
More “Haircuts” for Banks – For quite some time, PSU lenders have started putting aside a large portion of their profits for provisions and losses because of NPA.
-----------------------------------------------------------------------------------
Refinancing from the Central Bank – The US Federal Reserve spent $700 billion to purchase stressed assets in 2008-09 u
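The three summary loops above repeat the same pattern. A small helper could wrap it; `print_summary` is a hypothetical name, not part of Sumy, and the separator width is arbitrary. It only assumes that the summarizer is callable as `summarizer(document, sentences_count)`, which is how `summ` is used in the cells above.

```python
SEPARATOR = "-" * 80  # arbitrary width, chosen for this sketch


def print_summary(summarizer, document, sentences_count, separator=SEPARATOR):
    """Print each summary sentence under a separator line.

    Returns the sentences as strings so they can be reused afterwards.
    """
    lines = []
    for sentence in summarizer(document, sentences_count):
        print(separator)
        print(sentence)
        lines.append(str(sentence))
    return lines


# Intended use with the objects defined earlier in this notebook:
# print_summary(summ, parser_url_npa.document, SENTENCES_COUNT)
```

Returning the sentences as well as printing them makes it easier to compare the output of different summarizers on the same document.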