# Summarization
## This notebook outlines the concepts behind Text Summarization

## Summarization
- The task of condensing a long piece of text into a short version that preserves its key points

### Types of Summarization
- 1. **Extractive Summarization**
    - Select sentences from the corpus that best represent the text
    - Arrange them to form a summary
- 2. **Abstractive Summarization**
    - Identifies the most important ideas in the text
    - Paraphrases them in newly generated sentences to form a summary
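
To make the extractive idea concrete, here is a minimal sketch that scores sentences by average word frequency and returns the top ones in their original order. The function name `extractive_summary`, the regex-based sentence split, and the sample text are illustrative assumptions, not part of any library used later in this notebook.

```python
import re
from collections import Counter

def extractive_summary(text, sentence_count=2):
    """Pick the highest-scoring sentences and return them in document order."""
    # Naive sentence split: break after ., ! or ? followed by whitespace
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Word frequencies over the whole text act as a crude importance signal
    frequencies = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence):
        words = re.findall(r"[a-z']+", sentence.lower())
        return sum(frequencies[w] for w in words) / max(len(words), 1)

    # Rank sentence indices by score, keep the top ones, restore document order
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:sentence_count])
    return [sentences[i] for i in chosen]

sample = (
    "Neural networks learn from examples. "
    "A network is made of neurons. "
    "The weather was pleasant on Sunday. "
    "Neurons in a network learn by adjusting connections."
)
result = extractive_summary(sample, sentence_count=2)
for sentence in result:
    print(sentence)
```

An abstractive system would instead generate new wording, which typically requires a trained sequence-to-sequence model.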

## Summarization Libraries
- Sumy
- Gensim
- Summa
- BERT **
    - BART **
    - PEGASUS **
    - T5 **

** Will be seen in DL-1


## 1. Sumy
    1. Luhn – Heuristic method
    2. Latent Semantic Analysis (LSA)
    3. LexRank – Unsupervised approach inspired by the PageRank and HITS algorithms
    4. TextRank – Graph-based summarization technique with keyword extraction from the document

Documentation Reference [sumy](https://github.com/miso-belica/sumy)
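
The graph-based idea behind LexRank and TextRank can be sketched in a few lines: sentences are nodes, edge weights are pairwise similarities, and a PageRank-style iteration scores each sentence. This is an illustration only. It uses Jaccard word overlap as a stand-in similarity, and the names `rank_sentences`, `similarity`, and `tokenize` are hypothetical rather than Sumy's API.

```python
import re

def tokenize(sentence):
    """Lowercased set of words in a sentence."""
    return set(re.findall(r"[a-z]+", sentence.lower()))

def similarity(a, b):
    """Jaccard word overlap (an assumed stand-in for the library's measure)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def rank_sentences(sentences, damping=0.85, iterations=50):
    """Score sentences with a PageRank-style iteration over a similarity graph."""
    n = len(sentences)
    if n == 0:
        return []
    words = [tokenize(s) for s in sentences]
    # Edge weights: similarity between every pair of distinct sentences
    weights = [
        [similarity(words[i], words[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    out_sums = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            # Each neighbour passes on its score in proportion to the edge weight
            incoming = sum(
                weights[j][i] / out_sums[j] * scores[j]
                for j in range(n)
                if out_sums[j] > 0
            )
            new_scores.append((1 - damping) / n + damping * incoming)
        scores = new_scores
    return scores

sentences = [
    "The lion has a big mane.",
    "A lion is a big cat.",
    "The dog is not a lion.",
    "Bananas are yellow.",
]
scores = rank_sentences(sentences)
for sentence, score in zip(sentences, scores):
    print(f"{score:.3f}  {sentence}")
```

The unrelated last sentence shares no words with the others, so it only receives the baseline score and ranks lowest.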

## Task: Take a piece of text from a web page and summarize it using Sumy
### Steps
- Install the necessary libraries
- Import the libraries
- Scrape the text from a pre-defined webpage
- Summarize

### Install Sumy

In [None]:
#!pip install sumy


### Import the libraries
- HtmlParser
- Tokenizer
- TextRankSummarizer

In [1]:
# Import necessary libraries
from sumy.parsers.html import HtmlParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.text_rank import TextRankSummarizer

### Scrape the text

In [2]:
import nltk
nltk.download('punkt')
url = "https://medium.com/@subashgandyer/papa-what-is-a-neural-network-c5e5cc427c7"
parser = HtmlParser.from_url(url, Tokenizer("english"))

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\macwa\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


### Summarize - TextRankSummarizer

In [3]:
# Set parameters for TextRankSummarizer
sentence_count = 5  # Maximum number of sentences in the summary

# Summarize using TextRankSummarizer
summarizer = TextRankSummarizer()
summary = summarizer(parser.document, sentences_count=sentence_count)  

# Print the summary
for sentence in summary:
    print(sentence)

Papa, What is a Neural Network?At the back of my head, thoughts of me taking days to comprehend what a NN (short for Neural Network) is, how it would work, where it is used, how it is simulating our human brain’s inner workings were going through.
“Neural Network is a collection (a network) of neurons whose job is to learn a new thing or a new place or a new process or a new concept.”
After telling her the features of a lion, asked her “Can you draw these for me?” She happily drew almost a similar figure to that of a dog she drew before.
When you see a new object, your brain will ask the neurons, ‘Hey, anybody experienced this before?’ The neurons will say, ‘Yes, I have seen this.’ Certain other neurons will say, ‘No, I have not seen this.’ The neurons that have seen this before, will group together and form logical connections from the past and gives us an object from our memory.
The same principle is applied for a song that you hear, a cartoon that you watch, a rhyme that you sing, a

### Try different Summarizers
- LexRankSummarizer
- LuhnSummarizer
- LsaSummarizer
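
Of the summarizers listed above, Luhn's heuristic is the easiest to sketch by hand: words that occur often (and are not stopwords) count as significant, and a sentence scores highly when it contains a dense cluster of them. The stopword list, the `min_count` and `max_gap` thresholds, and the function names below are assumptions made for illustration; Sumy's `LuhnSummarizer` differs in its details.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not Sumy's list)
STOPWORDS = {"a", "an", "the", "is", "are", "of", "to", "and", "in", "it", "was"}

def significant_words(text, min_count=2):
    """Non-stopwords that occur at least min_count times in the text."""
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]
    return {w for w, c in Counter(words).items() if c >= min_count}

def luhn_score(sentence, significant, max_gap=4):
    """Best cluster score: (significant words in cluster)^2 / cluster span."""
    words = re.findall(r"[a-z]+", sentence.lower())
    positions = [i for i, w in enumerate(words) if w in significant]
    if not positions:
        return 0.0
    best = 0.0
    start = prev = positions[0]
    count = 1
    for pos in positions[1:]:
        if pos - prev <= max_gap:
            count += 1  # still inside the same cluster
        else:
            # close the current cluster and start a new one
            best = max(best, count ** 2 / (prev - start + 1))
            start, count = pos, 1
        prev = pos
    return max(best, count ** 2 / (prev - start + 1))

text = "Neurons form a network. The network of neurons learns. Lunch was tasty."
significant = significant_words(text)
for sentence in text.split(". "):
    print(round(luhn_score(sentence, significant), 3), sentence)
```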

### Import the summarizers

In [15]:
#installing gensim
!pip install gensim==3.8.3


Collecting gensim==3.8.3
  Downloading gensim-3.8.3.tar.gz (23.4 MB)

  error: subprocess-exited-with-error
  
  python setup.py bdist_wheel did not run successfully.
  exit code: 1
  
  [751 lines of output]
  C:\Users\macwa\anaconda3\Lib\site-packages\setuptools\__init__.py:84: _DeprecatedInstaller: setuptools.installer and fetch_build_eggs are deprecated.
  !!
  
          ********************************************************************************
          Requirements should be satisfied by a PEP 517 installer.
          If you are using pip, you can try `pip install --use-pep517`.
          ********************************************************************************
  
  !!
    dist.fetch_build_eggs(dist.setup_requires)
  running bdist_wheel
  running build
  running build_py
  creating build
  creating build\lib.win-amd64-cpython-311
  creating build\lib.win-amd64-cpython-311\gensim
  copying gensim\downloader.py -> build\lib.win-amd64-cpython-311\gensim
  copying gensim\interfaces.py -> build\lib.win-amd64-cpython-311\gensim
  copying ge

In [4]:
from sumy.summarizers.lex_rank import LexRankSummarizer
from sumy.summarizers.luhn import LuhnSummarizer
from sumy.summarizers.lsa import LsaSummarizer


In [None]:
# Import necessary libraries
from bs4 import BeautifulSoup
import requests
from gensim.summarization import summarize


### Create Summarizers

### LexRankSummarizer

In [6]:
# Initialize LexRankSummarizer
lex_rank_summarizer = LexRankSummarizer()
lex_summary = lex_rank_summarizer(parser.document, sentences_count=5)  # Adjust sentence count as needed
# Print LexRank summary
print("LexRank Summary:")
for sentence in lex_summary:
    print(sentence)
print("\n")

LexRank Summary:
After telling her the features of a lion, asked her “Can you draw these for me?” She happily drew almost a similar figure to that of a dog she drew before.
Was it a dog or a lion?
Do you know what is the difference between a lion and a dog?” She said, “Yes.” I said, “This is called Learning.
Picture of my version of Neural Network with their Neuron friends“Your brain is here inside our head.
Ultimately, the neurons in your brain tell that it is a lion and not a dog.




### LuhnSummarizer

In [7]:
# Initialize LuhnSummarizer
luhn_summarizer = LuhnSummarizer()
luhn_summary = luhn_summarizer(parser.document, sentences_count=5)  # Adjust sentence count as needed

# Print Luhn summary
print("Luhn Summary:")
for sentence in luhn_summary:
    print(sentence)
print("\n")

Luhn Summary:
Papa, What is a Neural Network?At the back of my head, thoughts of me taking days to comprehend what a NN (short for Neural Network) is, how it would work, where it is used, how it is simulating our human brain’s inner workings were going through.
How you learnt it is because of Neural Network inside your brain.” Now, a neural network is a collection of neurons that keeps switching on and off based on things you see, feel, hear and think just like switching on light bulb at our home.
Every neuron is waiting for your eyes to see something new, for your nose to smell something new, for your ears to hear something new, for your tongue to taste something new.
When something new is heard, or smelled, or seen, or tasted, the neurons will group together to send signals and forms connections with already seen, heard, tasted or smelled neurons.
When you see a new object, your brain will ask the neurons, ‘Hey, anybody experienced this before?’ The neurons will say, ‘Yes, I have see

### LsaSummarizer

In [8]:
lsa_summarizer = LsaSummarizer()
lsa_summary = lsa_summarizer(parser.document, sentences_count=5)  # Adjust sentence count as needed

# Print LSA summary
print("LSA Summary:")
for sentence in lsa_summary:
    print(sentence)
print("\n")

LSA Summary:
If you’ve noticed, this is how ML people make their machines learn through Reinforcement Learning.
For example, when I showed you a lion picture, your brain asked the neurons who had seen it before.
Every neuron will tune itself to pick up certain features like legs, tail, face, beard, and so on.
And I hope she will not come to me running asking “Papa, what is Meural Metark?” again.
And I have a strong feeling; she would ask me another stunning question sooner or later.




## 2. Gensim

## Task: Take a piece of text from a web page and summarize it using Gensim
### Steps
- Install the necessary libraries
- Import the libraries
- Scrape the text from a pre-defined webpage
- Summarize

### Install the library

In [9]:
# Note: gensim.summarization was removed in Gensim 4.0, so this section needs a 3.x release
#!pip install "gensim<4"

### Import the library

In [None]:
from bs4 import BeautifulSoup
import requests
from gensim.summarization import summarize

### Scrape the text
- Use BeautifulSoup to extract text (from Task1 of ML-1)

In [11]:
# Function to get page content
def get_page(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    return soup

In [12]:
# Function to extract text
def collect_text(soup):
    text = ""
    for paragraph in soup.find_all('p'):
        text += paragraph.text
    return text

In [13]:
# Scrape the text
url = "https://medium.com/@subashgandyer/papa-what-is-a-neural-network-c5e5cc427c7"
soup = get_page(url)

In [14]:
text = collect_text(soup)
text

'Sign upSign inSign upSign inSubash GandyerFollow--1ListenShareIt was a cozy Sunday afternoon in the month of February 2018. I just finished my huge customary Sunday lunch spread with family and resting along. Everyone in the family was taking a quick nap for a pre-planned evening outing. Well not everyone, actually.My 4-year-old angel came running to me, asked me to play with her for a while. As I was lazy and not in a position to move after the big spread, I evaded the chance to play with her by telling her “Papa’s got some work baby. Got to code some stuff.” I thought that would be the end of the conversation. No! It wasn’t. As my daughter was very inquisitive, she asked me “Papa, what stuff?” I said, “I need to code something for my work.” She didn’t leave. She again asked, “What is code something?” I wanted to end this conversation, as I was half past asleep. “Just some stuff baby. You wouldn’t understand. Way beyond your age.” Tanishi never takes NO for an answer. “Papa, tell me 
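
The output above runs paragraphs together ("Sign upSign in") because `collect_text` concatenates paragraph text without a separator. As a minimal sketch of the fix, the example below collects `<p>` text and joins it with spaces. It uses the standard library's `html.parser` instead of BeautifulSoup so that it runs without extra installs; the class name `ParagraphCollector` and the HTML snippet are made up for illustration.

```python
from html.parser import HTMLParser

class ParagraphCollector(HTMLParser):
    """Collect the text found inside <p> elements."""

    def __init__(self):
        super().__init__()
        self.depth = 0        # greater than 0 while inside a <p> element
        self.paragraphs = []  # finished paragraph strings
        self._current = []    # text pieces of the paragraph being read

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "p" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.paragraphs.append("".join(self._current).strip())
                self._current = []

    def handle_data(self, data):
        if self.depth > 0:
            self._current.append(data)

html = "<div><p>Sign up</p><p>Sign in</p><p>It was a <b>cozy</b> Sunday.</p></div>"
collector = ParagraphCollector()
collector.feed(html)

# Joining with a space keeps paragraph boundaries visible to the sentence tokenizer
joined = " ".join(collector.paragraphs)
print(joined)
```

With BeautifulSoup, the equivalent change is to append a space after each `paragraph.text`.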

### Summarize
- **word_count**: maximum number of words we want in the summary
- **ratio**: fraction of the sentences in the original text to return as output
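
The two parameters can be pictured as two ways of cutting a ranked list of sentences. The sketch below is a simplification under stated assumptions: `ranked_sentences` is a hypothetical list already sorted from most to least important, words are counted by whitespace, and, as in Gensim 3.x, `ratio` is ignored when `word_count` is given. Gensim's own selection logic is more involved.

```python
def select_sentences(ranked_sentences, ratio=0.2, word_count=None):
    """Cut a ranked sentence list by ratio, or by a word budget if one is given."""
    if word_count is None:
        # Keep the top fraction of sentences
        keep = int(len(ranked_sentences) * ratio)
        return ranked_sentences[:keep]

    # A word budget takes precedence over the ratio
    selected, total = [], 0
    for sentence in ranked_sentences:
        words = len(sentence.split())
        if total + words > word_count:
            break
        selected.append(sentence)
        total += words
    return selected

ranked = ["a b c", "d e", "f g h i", "j", "k l"]
print(select_sentences(ranked, ratio=0.4))                # top 40% of the sentences
print(select_sentences(ranked, ratio=1.0, word_count=4))  # ratio is ignored here
```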

In [None]:
# Set parameters for Gensim summarization
word_count = 1000  # Maximum amount of words in the summary
ratio = 0.2  # Fraction of sentences in the original text to be returned as output

# Summarize
summary = summarize(text, word_count=word_count, ratio=ratio)  

# Print the summary
print(summary)

## 3. Summa

## Task: Take a piece of text from a web page and summarize it using Summa
### Steps
- Install the necessary libraries
- Import the libraries
- Scrape the text from a pre-defined webpage
- Summarize

### Install the library

In [16]:
!pip install summa



### Import the library

In [17]:
# Import the necessary libraries
from bs4 import BeautifulSoup
import requests
from summa.summarizer import summarize

### Scrape the text
- Use BeautifulSoup to extract text (from Task1 of ML-1)

In [18]:
# Function to get page content
def get_page(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    return soup

# Function to extract text
def collect_text(soup):
    text = ""
    for paragraph in soup.find_all('p'):
        text += paragraph.text + " "
    return text

# Scrape the text
url = "https://medium.com/@subashgandyer/papa-what-is-a-neural-network-c5e5cc427c7"
soup = get_page(url)
text = collect_text(soup)
text

'Sign up Sign in Sign up Sign in Subash Gandyer Follow -- 1 Listen Share It was a cozy Sunday afternoon in the month of February 2018. I just finished my huge customary Sunday lunch spread with family and resting along. Everyone in the family was taking a quick nap for a pre-planned evening outing. Well not everyone, actually. My 4-year-old angel came running to me, asked me to play with her for a while. As I was lazy and not in a position to move after the big spread, I evaded the chance to play with her by telling her “Papa’s got some work baby. Got to code some stuff.” I thought that would be the end of the conversation. No! It wasn’t. As my daughter was very inquisitive, she asked me “Papa, what stuff?” I said, “I need to code something for my work.” She didn’t leave. She again asked, “What is code something?” I wanted to end this conversation, as I was half past asleep. “Just some stuff baby. You wouldn’t understand. Way beyond your age.” Tanishi never takes NO for an answer. “Pap

### Summarize

In [19]:
# Summarize
summary = summarize(text, ratio=0.2)  # Adjust ratio as needed

# Print the summary
print(summary)

“Papa, tell me what stuff means and something means.” Cannot help evade a cute curious face, I said, “I am working on Neural Network.” Before I finish the statement, “Papa, What is a Meural Metark?” I gave up my stubbornness of avoiding her.
With a smile, I said slowly, “Its Neu — ral Net — work” She asked, “Papa, What is Meu-ral Met-ark?” At the back of my head, thoughts of me taking days to comprehend what a NN (short for Neural Network) is, how it would work, where it is used, how it is simulating our human brain’s inner workings were going through.
“Neural Network is a collection (a network) of neurons whose job is to learn a new thing or a new place or a new process or a new concept.” It would be stupid on my part to start with a definition of Neural Network like how we used to teach adults in college.
Asked her to draw a dog out of it.
After all, neural network inside our brain helps us to learn new things in our life.
What I was actually doing here was teaching her neural networ

## 4. Hugging Face Transformers

In [20]:
#!pip install transformers[sentencepiece]


In [21]:
# Function to get page content
def get_page(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    return soup

# Function to extract text
def collect_text(soup):
    text = ""
    for paragraph in soup.find_all('p'):
        text += paragraph.text + " "
    return text

# Scrape the text
url = "https://medium.com/@subashgandyer/papa-what-is-a-neural-network-c5e5cc427c7"
soup = get_page(url)
text = collect_text(soup)
text

'Sign up Sign in Sign up Sign in Subash Gandyer Follow -- 1 Listen Share It was a cozy Sunday afternoon in the month of February 2018. I just finished my huge customary Sunday lunch spread with family and resting along. Everyone in the family was taking a quick nap for a pre-planned evening outing. Well not everyone, actually. My 4-year-old angel came running to me, asked me to play with her for a while. As I was lazy and not in a position to move after the big spread, I evaded the chance to play with her by telling her “Papa’s got some work baby. Got to code some stuff.” I thought that would be the end of the conversation. No! It wasn’t. As my daughter was very inquisitive, she asked me “Papa, what stuff?” I said, “I need to code something for my work.” She didn’t leave. She again asked, “What is code something?” I wanted to end this conversation, as I was half past asleep. “Just some stuff baby. You wouldn’t understand. Way beyond your age.” Tanishi never takes NO for an answer. “Pap

In [22]:
# import and initialize the tokenizer and model from the checkpoint
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

checkpoint = "sshleifer/distilbart-cnn-12-6"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

In [23]:
# max tokens including the special tokens
print(tokenizer.model_max_length)

# max tokens excluding the special tokens
print(tokenizer.max_len_single_sentence)

# number of special tokens
print(tokenizer.num_special_tokens_to_add())

1024
1022
2


In [24]:
# extract the sentences from the document
import nltk
nltk.download('punkt')
sentences = nltk.tokenize.sent_tokenize(text)

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\macwa\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [25]:
# find the max tokens in the longest sentence
max([len(tokenizer.tokenize(sentence)) for sentence in sentences])

94

In [26]:
# Create chunks of whole sentences that fit within the model's token limit
length = 0   # number of tokens in the current chunk
chunk = ""
chunks = []
for sentence in sentences:
    sentence_length = len(tokenizer.tokenize(sentence))

    if length + sentence_length <= tokenizer.max_len_single_sentence:
        # the sentence still fits: add it to the current chunk
        chunk += sentence + " "
        length += sentence_length
    else:
        # the sentence would overflow: save the chunk and start a new one
        chunks.append(chunk.strip())
        chunk = sentence + " "
        length = sentence_length

# save the last chunk, which the loop never appends on its own
if chunk:
    chunks.append(chunk.strip())
len(chunks)

4
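
The same greedy chunking idea can be tried without downloading a model. In the sketch below, a whitespace word count stands in for `tokenizer.tokenize`; the function name `chunk_sentences` and the `count_tokens` parameter are illustrative assumptions.

```python
def chunk_sentences(sentences, max_tokens, count_tokens=lambda s: len(s.split())):
    """Greedily pack whole sentences into chunks of at most max_tokens tokens."""
    chunks, current, length = [], [], 0
    for sentence in sentences:
        n = count_tokens(sentence)
        # Start a new chunk when this sentence would push the current one over the limit
        if current and length + n > max_tokens:
            chunks.append(" ".join(current))
            current, length = [], 0
        current.append(sentence)
        length += n
    # Flush whatever is left after the last sentence
    if current:
        chunks.append(" ".join(current))
    return chunks

demo_sentences = ["one two three", "four five", "six seven eight nine", "ten"]
demo_chunks = chunk_sentences(demo_sentences, max_tokens=5)
print(demo_chunks)
```

A single sentence longer than `max_tokens` still ends up in a chunk of its own, so a real pipeline would need truncation or a finer split for that case.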

In [27]:
# token count of each chunk, including the special tokens
[len(tokenizer(c).input_ids) for c in chunks]


[1008, 1016, 1020, 126]

In [28]:
# With special tokens added
sum([len(tokenizer(c).input_ids) for c in chunks])

3170

In [29]:
len(tokenizer(text).input_ids)


Token indices sequence length is longer than the specified maximum sequence length for this model (3164 > 1024). Running this sequence through the model will result in indexing errors


3164

In [30]:
# Without special tokens
sum([len(tokenizer.tokenize(c)) for c in chunks])


3162

In [31]:
len(tokenizer.tokenize(text))


3162
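
The counts above are consistent once special tokens are accounted for: each tokenized sequence gains 2 special tokens, so four chunks carry 8 extra tokens while the unchunked text carries only 2. The check below uses only numbers printed earlier in this notebook; the variable names are illustrative.

```python
special_tokens_per_sequence = 2   # printed by tokenizer.num_special_tokens_to_add()
num_chunks = 4                    # printed by len(chunks)
tokens_without_special = 3162     # printed by len(tokenizer.tokenize(text))

# Every chunk is tokenized separately, so each one gets its own special tokens
chunked_total = tokens_without_special + num_chunks * special_tokens_per_sequence

# The full text is a single sequence, so the special tokens are added once
unchunked_total = tokens_without_special + special_tokens_per_sequence

print(chunked_total)    # 3170
print(unchunked_total)  # 3164
```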

In [32]:
# inputs to the model
inputs = [tokenizer(chunk, return_tensors="pt") for chunk in chunks]

In [33]:
# Generate and print one summary per chunk
for model_input in inputs:  # avoid shadowing the built-in input()
    output = model.generate(**model_input)
    print(tokenizer.decode(output[0], skip_special_tokens=True))

 Subash Gandyer's 4-year-old daughter Tanishi asked him to play with her for a while. He evaded the chance to play by telling her “Papa’s got some work baby. Got to code some stuff.” She again asked, “What is code something?” I wanted to end this conversation, as I was half past asleep. She asked, "Papa, What is Neu — ral Net — work"
 A big beard is the main difference between a lion and a dog. I asked her what these pictures look like to you? Was it a dog or a lion? She kept mixing the answers first. After rewarding her for correct classifications with nice adjectives and correcting her for wrong classifications, her detection accuracy improved a lot.
 When a picture is shown to you, your neurons will group together and tries to signal what that object is by forming logical connections between the past and the present. Ultimately, the neurons in your brain tell that it is a lion and not a dog. This complex working of the neurons inside the brain works super fast in the order of millis

In [36]:
from sumy.parsers.html import HtmlParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.text_rank import TextRankSummarizer
from sumy.summarizers.lex_rank import LexRankSummarizer
from sumy.summarizers.luhn import LuhnSummarizer
from sumy.summarizers.lsa import LsaSummarizer
from summa import summarizer as summa_summarizer
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from nltk.translate.bleu_score import sentence_bleu
from nltk.translate.bleu_score import SmoothingFunction
import nltk
import requests
from bs4 import BeautifulSoup

nltk.download('punkt')

def fetch_summaries(url):
    # Fetch the content of the webpage
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    text = ' '.join([p.text for p in soup.find_all('p')])

    # Summarize using different methods
    parser = HtmlParser.from_url(url, Tokenizer("english"))
    text_rank_summarizer = TextRankSummarizer()
    lex_rank_summarizer = LexRankSummarizer()
    luhn_summarizer = LuhnSummarizer()
    lsa_summarizer = LsaSummarizer()

    # Summarize using Summa
    summa_summary = summa_summarizer.summarize(text)

    # Summarize using Hugging Face Transformers
    checkpoint = "sshleifer/distilbart-cnn-12-6"
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)
    inputs = tokenizer(text, return_tensors="pt", max_length=1024, truncation=True)
    outputs = model.generate(**inputs)
    hugging_face_summary = tokenizer.decode(outputs[0], skip_special_tokens=True)

    return {
        'TextRank': text_rank_summarizer(parser.document, sentences_count=5),
        'LexRank': lex_rank_summarizer(parser.document, sentences_count=5),
        'Luhn': luhn_summarizer(parser.document, sentences_count=5),
        'LSA': lsa_summarizer(parser.document, sentences_count=5),
        'Summa': summa_summary,
        'Hugging Face': hugging_face_summary
    }

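def evaluate_summaries(summaries, reference_text):
    # Heuristic scoring only. sentence_bleu expects tokenized input (lists of
    # tokens); because raw strings are passed here, n-grams are compared
    # character by character. Sumy returns a tuple of sentences, while Summa
    # and the transformer model return one string, so len(summary) and
    # summary[i] refer to sentences for the former and characters for the latter.
    smoothing = SmoothingFunction().method7
    scores = {}
    for method, summary in summaries.items():
        # Coherence: BLEU between each pair of consecutive items
        coherence_score = sum(
            sentence_bleu(
                [str(summary[i]), str(summary[i + 1])],
                str(summary[i + 1]),
                smoothing_function=smoothing,
            )
            for i in range(len(summary) - 1)
        ) / len(summary)

        # Relevance: summary length relative to the reference word count
        relevance_score = len(summary) / len(reference_text.split())

        # Informativeness: BLEU of the whole summary against the reference words
        informativeness_score = sentence_bleu(
            reference_text.split(),
            ' '.join(map(str, summary)),
            smoothing_function=smoothing,
        )

        # Overall score: weighted sum of the three components
        overall_score = 0.4 * coherence_score + 0.3 * relevance_score + 0.3 * informativeness_score

        scores[method] = overall_score

    return scores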

def find_best_model(url):
    # Fetch reference text
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    reference_text = ' '.join([p.text for p in soup.find_all('p')])

    # Fetch summaries
    summaries = fetch_summaries(url)

    # Evaluate summaries
    scores = evaluate_summaries(summaries, reference_text)

    # Select the best model
    best_model = max(scores, key=scores.get)

    return best_model, summaries[best_model], scores[best_model]

# Example usage:
url = "https://medium.com/@subashgandyer/papa-what-is-a-neural-network-c5e5cc427c7"
best_model, best_summary, score = find_best_model(url)
print("Best Summarization Method:", best_model)
print("Overall Score:", score)
print("\nBest Summary:")
print(best_summary)


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\macwa\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Best Summarization Method: Summa
Overall Score: 0.6201809856387016

Best Summary:
“Papa, tell me what stuff means and something means.” Cannot help evade a cute curious face, I said, “I am working on Neural Network.” Before I finish the statement, “Papa, What is a Meural Metark?” I gave up my stubbornness of avoiding her.
With a smile, I said slowly, “Its Neu — ral Net — work” She asked, “Papa, What is Meu-ral Met-ark?” At the back of my head, thoughts of me taking days to comprehend what a NN (short for Neural Network) is, how it would work, where it is used, how it is simulating our human brain’s inner workings were going through.
“Neural Network is a collection (a network) of neurons whose job is to learn a new thing or a new place or a new process or a new concept.” It would be stupid on my part to start with a definition of Neural Network like how we used to teach adults in college.
Asked her to draw a dog out of it.
After all, neural network inside our brain helps us to learn new
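
A simpler, word-level sanity check can sit alongside the BLEU-based score above. The sketch below computes unigram overlap (precision, recall, and F1) between a summary and a reference text, in the spirit of ROUGE-1. The function name `unigram_overlap` and the regex tokenization are assumptions for illustration; this is not the scoring used in the cell above.

```python
import re
from collections import Counter

def unigram_overlap(summary, reference):
    """Return (precision, recall, f1) of summary words against reference words."""
    def tokens(text):
        return re.findall(r"[a-z0-9']+", text.lower())

    summary_counts = Counter(tokens(summary))
    reference_counts = Counter(tokens(reference))
    # Clipped overlap: a word counts at most as often as it appears in both texts
    overlap = sum((summary_counts & reference_counts).values())
    precision = overlap / max(sum(summary_counts.values()), 1)
    recall = overlap / max(sum(reference_counts.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1

precision, recall, f1 = unigram_overlap("the cat sat", "the cat sat on the mat")
print(precision, recall, round(f1, 3))
```

When the reference is the source article itself, extractive summaries reach a precision near 1 by construction, so a human-written reference summary gives a more meaningful comparison.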