# Summarization
## This notebook outlines the concepts behind Text Summarization

## Summarization
- The task of capturing the essential gist of a long piece of text in a much shorter form

### Types of Summarization
- 1. **Extractive Summarization**
    - Selects the sentences from the source text that best represent it
    - Arranges them, unchanged, to form a summary
- 2. **Abstractive Summarization**
    - Identifies the most important ideas in the text
    - Expresses them in newly generated (paraphrased) sentences to form a summary
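To make the extractive idea concrete, here is a minimal sketch that scores each sentence by the average frequency of its words and keeps the top few in their original order. It is an illustration only: the stop-word list, the naive sentence splitter, and the helper names (`split_sentences`, `extractive_summary`) are assumptions, not the method used by any of the libraries below.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real libraries ship larger ones).
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "it", "of", "on", "the", "to"}


def split_sentences(text):
    """Naively split text into sentences at ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def content_words(text):
    """Lower-case word tokens with stop words removed."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS]


def extractive_summary(text, n_sentences=2):
    """Return the n highest-scoring sentences, kept in their original order."""
    sentences = split_sentences(text)
    freq = Counter(content_words(text))

    def score(sentence):
        tokens = content_words(sentence)
        # Average frequency, so long sentences are not favoured automatically.
        return sum(freq[t] for t in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:n_sentences])  # restore document order
    return [sentences[i] for i in chosen]


sample_text = (
    "Cats sleep most of the day. "
    "Cats hunt mice at night. "
    "The weather was pleasant yesterday."
)
print(extractive_summary(sample_text, n_sentences=2))
```

The output consists only of sentences copied verbatim from the input, which is the defining property of extractive summarization. An abstractive system would instead generate new wording.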

## Summarization Libraries
- Sumy
- Gensim
- Summa
- BERT **
- BART **
- PEGASUS **
- T5 **

** Will be seen in DL-1


## 1. Sumy
1. Luhn – Heuristic method
2. Latent Semantic Analysis (LSA)
3. LexRank – Unsupervised approach inspired by the PageRank and HITS algorithms
4. TextRank – Graph-based summarization technique with keyword extraction from the document

Documentation Reference [sumy](https://github.com/miso-belica/sumy)
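LexRank and TextRank both rank sentences with a PageRank-style walk over a sentence-similarity graph. The sketch below illustrates that idea in plain Python. It is not Sumy's implementation: the Jaccard word-overlap similarity, the damping value, the fixed iteration count, and the names `tokenize`, `similarity` and `rank_sentences` are all assumptions chosen for brevity.

```python
import re


def tokenize(sentence):
    """Lower-case a sentence and return its set of word tokens."""
    return set(re.findall(r"[a-z]+", sentence.lower()))


def similarity(a, b):
    """Jaccard overlap between two token sets (an assumed, simple measure)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def rank_sentences(sentences, damping=0.85, iterations=50):
    """Score sentences with PageRank-style power iteration on a similarity graph."""
    n = len(sentences)
    if n == 0:
        return []
    tokens = [tokenize(s) for s in sentences]
    # Edge weights: similarity between every pair of distinct sentences.
    weights = [
        [similarity(tokens[i], tokens[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    out_sums = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            # Rank flowing into sentence i from every connected sentence j.
            incoming = sum(
                scores[j] * weights[j][i] / out_sums[j]
                for j in range(n)
                if out_sums[j] > 0
            )
            new_scores.append((1 - damping) / n + damping * incoming)
        scores = new_scores
    return scores


example_sentences = [
    "The cat sat on the mat.",
    "The cat chased the mouse.",
    "The mouse hid under the mat.",
    "Stock prices fell sharply today.",
]
example_scores = rank_sentences(example_sentences)
for sentence, score in zip(example_sentences, example_scores):
    print(f"{score:.3f}  {sentence}")
```

The last sentence shares no words with the others, so it receives only the baseline score, while the three connected sentences reinforce one another. A summarizer would then return the highest-scoring sentences.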

## Task: Take a piece of text from a wiki page and summarize it using Sumy
### Steps
- Install the necessary libraries
- Import the libraries
- Scrape the text from a pre-defined webpage
- Summarize

### Install Sumy

In [1]:
#! pip install sumy

### Import the libraries

In [2]:
from sumy.parsers.html import HtmlParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.text_rank import TextRankSummarizer

### Scrape the text

In [3]:
url = "https://en.wikipedia.org/wiki/Automatic_summarization"
parser = HtmlParser.from_url(url, Tokenizer("english"))

In [4]:
doc = parser.document
doc

<DOM with 62 paragraphs>

### Summarize - TextRankSummarizer

In [5]:
summarizer = TextRankSummarizer()

In [6]:
summary_text = summarizer(doc, 5)
summary_text

(<Sentence: For text, extraction is analogous to the process of skimming, where the summary (if available), headings and subheadings, figures, the first and last paragraphs of a section, and optionally the first and last sentences in a paragraph are read before one chooses to read the entire document in detail.>,
 <Sentence: Instead of trying to learn explicit features that characterize keyphrases, the TextRank algorithm [8] exploits the structure of the text itself to determine keyphrases that appear "central" to the text in the same way that PageRank selects important Web pages.>,
 <Sentence: Once the graph is constructed, it is used to form a stochastic matrix, combined with a damping factor (as in the "random surfer model"), and the ranking over vertices is obtained by finding the eigenvector corresponding to eigenvalue 1 (i.e., the stationary distribution of the random walk on the graph).>,
 <Sentence: While the goal of a brief summary is to simplify information search and cut the

### Try different Summarizers
- LexRankSummarizer
- LuhnSummarizer
- LsaSummarizer

### Import the summarizers

In [7]:
from sumy.summarizers.lex_rank import LexRankSummarizer
from sumy.summarizers.luhn import LuhnSummarizer
from sumy.summarizers.lsa import LsaSummarizer

### Create Summarizers

In [8]:
lexSummarizer =  LexRankSummarizer()
luhnSummarizer = LuhnSummarizer()
lsaSummarizer = LsaSummarizer()

### LexRankSummarizer

In [9]:
lex_summary_text = lexSummarizer(doc, 5)
lex_summary_text

(<Sentence: An example of a summarization problem is document summarization, which attempts to automatically produce an abstract from a given document.>,
 <Sentence: The main difficulty in supervised extractive summarization is that the known summaries must be manually created by extracting sentences so the sentences in an original training document can be labeled as "in summary" or "not in summary".>,
 <Sentence: For example, in document summarization, one would like the summary to cover all important and relevant concepts in the document.>,
 <Sentence: Automatic Text Summarization .>,
 <Sentence: Automatic Keyphrases Extraction .>)

### LuhnSummarizer

In [10]:
luhn_summary_text = luhnSummarizer(doc, 5)
luhn_summary_text

(<Sentence: Once the graph is constructed, it is used to form a stochastic matrix, combined with a damping factor (as in the "random surfer model"), and the ranking over vertices is obtained by finding the eigenvector corresponding to eigenvalue 1 (i.e., the stationary distribution of the random walk on the graph).>,
 <Sentence: Because ROUGE is based only on content overlap, it can determine if the same general concepts are discussed between an automatic summary and a reference summary, but it cannot determine if the result is coherent or the sentences flow together in a sensible manner.>,
 <Sentence: It is worth noting that TextRank was applied to summarization exactly as described here, while LexRank was used as part of a larger summarization system ( MEAD ) that combines the LexRank score (stationary probability) with other features like sentence position and length using a linear combination with either user-specified or automatically tuned weights.>,
 <Sentence: This tool does no

### LsaSummarizer

In [11]:
lsa_summary_text = lsaSummarizer(doc, 5)
lsa_summary_text

(<Sentence: For instance, in the above text, we might learn a rule that says phrases with initial capital letters are likely to be keyphrases.>,
 <Sentence: Hulth uses a reduced set of features, which were found most successful in the KEA (Keyphrase Extraction Algorithm) work derived from Turney’s seminal paper.>,
 <Sentence: However, when summarizing multiple documents, there is a greater risk of selecting duplicate or highly redundant sentences to place in the same summary.>,
 <Sentence: Ideally, we would like to extract sentences that are both "central" (i.e., contain the main ideas) and "diverse" (i.e., they differ from one another).>,
 <Sentence: Although they did not replace other approaches and are often combined with them, by 2019 machine learning methods dominated the extractive summarization of single documents, which was considered to be nearing maturity.>)
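All four Sumy summarizers above are called the same way, with a document and a sentence count, so their outputs can be collected side by side in one loop. The helper below is a hypothetical convenience, not part of Sumy. It assumes that `str()` on each returned sentence yields its text, and it uses stand-in summarizers so the sketch runs without Sumy installed.

```python
def compare_summarizers(summarizers, document, sentence_count=5):
    """Run each summarizer on the same document and collect plain-text results.

    `summarizers` maps a label to any callable accepting (document, sentence_count),
    which is how the summarizers in this notebook are invoked.
    """
    results = {}
    for name, summarize_fn in summarizers.items():
        sentences = summarize_fn(document, sentence_count)
        # str() is assumed to turn each returned sentence object into its text.
        results[name] = [str(sentence) for sentence in sentences]
    return results


# Stand-in summarizers (assumptions) so the sketch is self-contained.
def first_n(document, n):
    return document[:n]


def last_n(document, n):
    return document[-n:]


toy_document = ["Sentence one.", "Sentence two.", "Sentence three.", "Sentence four."]
toy_results = compare_summarizers({"first": first_n, "last": last_n}, toy_document, 2)
for name, sentences in toy_results.items():
    print(name, "->", sentences)
```

With the objects created in this notebook, the call would look like `compare_summarizers({"TextRank": summarizer, "LexRank": lexSummarizer, "Luhn": luhnSummarizer, "LSA": lsaSummarizer}, doc)`.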

## 2. Gensim

## Task: Take a piece of text from a wiki page and summarize it using Gensim
### Steps
- Install the necessary libraries
- Import the libraries
- Scrape the text from a pre-defined webpage
- Summarize

### Install the library

In [12]:
#!pip install "gensim<4.0"  # gensim.summarization was removed in Gensim 4.0

### Import the library

In [13]:
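# Note: the gensim.summarization module was removed in Gensim 4.0,
# so this import requires gensim < 4.0.
from gensim.summarization import summarize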

### Scrape the text
- Use BeautifulSoup to extract text (from Task 1 of ML-1)

In [14]:
import requests
import re
from bs4 import BeautifulSoup

In [15]:
def get_page(url):
    """Download the page at `url` and parse it into a BeautifulSoup tree."""
    res = requests.get(url)
    soup = BeautifulSoup(res.text, 'html.parser')
    return soup

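In [16]:
def collect_text(soup, url):
    """Concatenate the text of every <p> element, prefixed with the source URL."""
    # `url` is passed in explicitly instead of being read from a global variable.
    text = f'url: {url}\n\n'
    para_text = soup.find_all('p')
    print(f"paragraphs text = \n {para_text}")
    for para in para_text:
        text += f"{para.text}\n\n"
    return text

In [17]:
url = "https://en.wikipedia.org/wiki/Automatic_summarization"

In [18]:
text = collect_text(get_page(url), url)
text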

paragraphs text = 
 [<p><b>Automatic summarization</b> is the process of shortening a set of data computationally, to create a subset (a <a href="/wiki/Abstract_(summary)" title="Abstract (summary)">summary</a>) that represents the most important or relevant information within the original content. 
</p>, <p>In addition to text, images and videos can also be summarized. Text summarization finds the most informative sentences in a document;<sup class="reference" id="cite_ref-Torres2014_1-0"><a href="#cite_note-Torres2014-1">[1]</a></sup> image summarization finds the most representative images within an image collection<sup class="noprint Inline-Template Template-Fact" style="white-space:nowrap;">[<i><a href="/wiki/Wikipedia:Citation_needed" title="Wikipedia:Citation needed"><span title="This claim needs references to reliable sources. (February 2019)">citation needed</span></a></i>]</sup>; video summarization extracts the most important frames from the video content.<sup class="referen

'url: https://en.wikipedia.org/wiki/Automatic_summarization\n\nAutomatic summarization is the process of shortening a set of data computationally, to create a subset (a summary) that represents the most important or relevant information within the original content. \n\n\nIn addition to text, images and videos can also be summarized. Text summarization finds the most informative sentences in a document;[1] image summarization finds the most representative images within an image collection[citation needed]; video summarization extracts the most important frames from the video content.[2]\n\n\nThere are two general approaches to automatic summarization: extraction and abstraction. \n\n\nHere, content is extracted from the original data, but the extracted content is not modified in any way. Examples of extracted content include key-phrases that can be used to "tag" or index a text document, or key sentences (including headings) that collectively comprise an abstract, and representative ima
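The scraped text still contains Wikipedia citation markers such as `[1]` and `[citation needed]`, and these carry over into the summaries. One possible cleanup step is to strip them with `re` (imported earlier) before summarizing. The pattern and the `strip_citations` helper are assumptions for illustration and are not part of the original notebook.

```python
import re

# Matches bracketed numeric references like [1] or [23], and [citation needed].
CITATION_PATTERN = re.compile(r"\[(?:\d+|citation needed)\]")


def strip_citations(text):
    """Remove Wikipedia-style citation markers from scraped text."""
    return CITATION_PATTERN.sub("", text)


cited_text = (
    "Text summarization finds the most informative sentences in a document;[1] "
    "image summarization finds the most representative images within an image "
    "collection[citation needed]; video summarization extracts the most important "
    "frames from the video content.[2]"
)
cleaned_text = strip_citations(cited_text)
print(cleaned_text)
```

The pattern is deliberately narrow, so other bracketed text in the article is left untouched.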

### Summarize
- **word_count**: maximum number of words wanted in the summary; when it is given, `ratio` is ignored
- **ratio**: fraction (between 0 and 1) of the sentences in the original text to return as the summary
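The interplay between the two parameters can be sketched as follows: with only `ratio`, a fixed fraction of the top-scoring sentences is kept; with `word_count`, sentences are added until the word budget would be exceeded, and `ratio` is not consulted. This is a simplified illustration, not Gensim's selection code, and `select_sentences` and its sample scores are hypothetical.

```python
def select_sentences(sentences, scores, ratio=0.2, word_count=None):
    """Pick top-scoring sentences by ratio, or by a word budget when one is given."""
    order = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    if word_count is None:
        # Keep a fixed fraction of the sentences.
        keep = order[: int(len(sentences) * ratio)]
    else:
        # Word budget takes precedence; ratio is ignored in this branch.
        keep, used = [], 0
        for i in order:
            length = len(sentences[i].split())
            if used + length > word_count:
                break
            keep.append(i)
            used += length
    # Restore document order so the summary reads naturally.
    return [sentences[i] for i in sorted(keep)]


demo_sentences = [
    "Summaries save time.",
    "Extractive methods copy sentences from the source.",
    "They are simple.",
    "Abstractive methods write new sentences.",
    "Both are useful.",
]
demo_scores = [0.1, 0.9, 0.3, 0.8, 0.2]  # made-up importance scores

print(select_sentences(demo_sentences, demo_scores, ratio=0.4))
print(select_sentences(demo_sentences, demo_scores, ratio=0.4, word_count=10))
```

In the second call the ten-word budget admits only the top-scoring seven-word sentence, even though `ratio=0.4` alone would have kept two sentences.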

In [19]:
gensim_summary_text = summarize(text, word_count=200, ratio=0.1)
gensim_summary_text

'Automatic summarization is the process of shortening a set of data computationally, to create a subset (a summary) that represents the most important or relevant information within the original content.\nExamples of extracted content include key-phrases that can be used to "tag" or index a text document, or key sentences (including headings) that collectively comprise an abstract, and representative images or video segments, as stated above.\nFor text, extraction is analogous to the process of skimming, where the summary (if available), headings and subheadings, figures, the first and last paragraphs of a section, and optionally the first and last sentences in a paragraph are read before one chooses to read the entire document in detail.[3] Other examples of extraction that include key sequences of text in terms of clinical relevance (including patient/problem, intervention, and outcome).[4]\nSome techniques and algorithms which naturally model summarization problems are TextRank and 

## 3. Summa

## Task: Take a piece of text from a wiki page and summarize it using Summa
### Steps
- Install the necessary libraries
- Import the libraries
- Scrape the text from a pre-defined webpage
- Summarize

### Install the library

In [20]:
# !pip install summa

### Import the library

In [21]:
from summa import summarizer
from summa import keywords

### Scrape the text
- Use BeautifulSoup to extract text (from Task 1 of ML-1); the `text` variable scraped in the Gensim section is reused here

### Summarize

In [22]:
summa_summary_text = summarizer.summarize(text, ratio=0.1)
summa_summary_text

'Automatic summarization is the process of shortening a set of data computationally, to create a subset (a summary) that represents the most important or relevant information within the original content.\nText summarization finds the most informative sentences in a document;[1] image summarization finds the most representative images within an image collection[citation needed]; video summarization extracts the most important frames from the video content.[2]\nExamples of extracted content include key-phrases that can be used to "tag" or index a text document, or key sentences (including headings) that collectively comprise an abstract, and representative images or video segments, as stated above.\nFor text, extraction is analogous to the process of skimming, where the summary (if available), headings and subheadings, figures, the first and last paragraphs of a section, and optionally the first and last sentences in a paragraph are read before one chooses to read the entire document in 