# Summarization
## This notebook outlines the concepts behind Text Summarization

## Summarization
- The task of capturing the gist of a long piece of text in a much shorter form

### Types of Summarization
- 1. **Extractive Summarization**
    - Selects the sentences from the corpus that best represent the text
    - Arranges them, unchanged, to form a summary
- 2. **Abstractive Summarization**
    - Identifies the most important ideas in the text
    - Paraphrases them in newly generated sentences to form a summary
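
To make the extractive idea concrete, here is a minimal sketch that scores each sentence by the average frequency of its words and keeps the top ones in their original order. It is a toy illustration only: the stopword list, the regex sentence splitter, and the names `extractive_summary` and `split_sentences` are assumptions made for this example, not part of any library used in this notebook.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real libraries ship fuller lists).
STOPWORDS = {"a", "an", "the", "is", "are", "of", "to", "and", "in", "it", "that", "on", "for"}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def content_words(text):
    """Lowercased words with stopwords removed."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]


def extractive_summary(text, num_sentences=2):
    """Return the highest-scoring sentences, kept in document order."""
    sentences = split_sentences(text)
    freq = Counter(content_words(text))

    def score(sentence):
        tokens = content_words(sentence)
        # Average (not total) frequency, so long sentences are not favoured automatically.
        return sum(freq[w] for w in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:num_sentences])  # restore document order
    return [sentences[i] for i in chosen]


sample = (
    "Neural networks learn from examples. "
    "The weather was pleasant on Sunday. "
    "Each neuron in neural networks responds to features of the examples. "
    "We had lunch together."
)
summary = extractive_summary(sample, num_sentences=2)
print(summary)
```

Every sentence in the output is copied verbatim from the input, which is what distinguishes extractive from abstractive summarization.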

## Summarization Libraries
- Sumy
- Gensim
- Summa
- BERT **
- BART **
- PEGASUS **
- T5 **

** Will be seen in DL-1


## 1. Sumy
1. Luhn – heuristic method
2. Latent Semantic Analysis (LSA)
3. LexRank – unsupervised approach inspired by the PageRank and HITS algorithms
4. TextRank – graph-based summarization technique with keyword extraction from the document
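
The graph-based idea behind TextRank and LexRank can be sketched in a few lines: sentences are nodes, edge weights measure how similar two sentences are, and a PageRank-style iteration gives each sentence a score. The code below is a simplified illustration under stated assumptions (a plain word-overlap similarity, a fixed number of iterations, hypothetical names `similarity` and `rank_sentences`); it is not Sumy's implementation.

```python
import math
import re


def tokenize(sentence):
    """Set of lowercase words in a sentence."""
    return set(re.findall(r"[a-z]+", sentence.lower()))


def similarity(a, b):
    """Shared-word count, normalised by the log of each sentence's length."""
    wa, wb = tokenize(a), tokenize(b)
    if len(wa) < 2 or len(wb) < 2:
        return 0.0  # avoids a zero denominator for one-word sentences
    return len(wa & wb) / (math.log(len(wa)) + math.log(len(wb)))


def rank_sentences(sentences, damping=0.85, iterations=50):
    """PageRank-style scores over the sentence-similarity graph."""
    n = len(sentences)
    weights = [
        [similarity(sentences[i], sentences[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    out_sum = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        updated = []
        for i in range(n):
            # Each neighbour j passes on a share of its score proportional to the edge weight.
            incoming = sum(
                weights[j][i] / out_sum[j] * scores[j] for j in range(n) if out_sum[j] > 0
            )
            updated.append((1 - damping) / n + damping * incoming)
        scores = updated
    return scores


sentences = [
    "The cat sat on the mat.",
    "The cat chased the mouse.",
    "The mouse hid under the mat near the cat.",
    "Stock prices fell sharply today.",
]
scores = rank_sentences(sentences)
best = max(range(len(sentences)), key=lambda i: scores[i])
print(sentences[best])
```

The unrelated sentence about stock prices shares no words with the others, so it receives only the baseline score, while the sentence that overlaps most with the rest ranks highest.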

Documentation Reference [sumy](https://github.com/miso-belica/sumy)

## Task: Take a piece of text from a web page and summarize it using Sumy
### Steps
- Install the necessary libraries
- Import the libraries
- Scrape the text from a pre-defined webpage
- Summarize

### Install Sumy

In [177]:
! pip install sumy



### Import the libraries
- HtmlParser
- Tokenizer
- TextRankSummarizer

In [178]:
from sumy.parsers.html import HtmlParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.text_rank import TextRankSummarizer

from bs4 import BeautifulSoup
import requests
import nltk

### Scrape the text

In [179]:
url = "https://medium.com/@subashgandyer/papa-what-is-a-neural-network-c5e5cc427c7"

In [180]:
def get_page(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    return soup

def collect_text(soup):
    paragraphs = soup.find_all('p')
    text = '\n'.join([p.text for p in paragraphs])
    return text

soup = get_page(url)
text = collect_text(soup)

text

'Sign up\nSign in\nSign up\nSign in\nSubash Gandyer\nFollow\n--\n1\nListen\nShare\nIt was a cozy Sunday afternoon in the month of February 2018. I just finished my huge customary Sunday lunch spread with family and resting along. Everyone in the family was taking a quick nap for a pre-planned evening outing. Well not everyone, actually.\nMy 4-year-old angel came running to me, asked me to play with her for a while. As I was lazy and not in a position to move after the big spread, I evaded the chance to play with her by telling her “Papa’s got some work baby. Got to code some stuff.” I thought that would be the end of the conversation. No! It wasn’t. As my daughter was very inquisitive, she asked me “Papa, what stuff?” I said, “I need to code something for my work.” She didn’t leave. She again asked, “What is code something?” I wanted to end this conversation, as I was half past asleep. “Just some stuff baby. You wouldn’t understand. Way beyond your age.” Tanishi never takes NO for an a
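
The scraped text above begins with page-navigation fragments ("Sign up", "Sign in", "Follow", and so on) that are not part of the article. One rough way to reduce this noise, assuming the paragraphs were joined with newlines as in `collect_text`, is to drop lines that are too short to be prose. The helper name `drop_short_lines` and the four-word threshold are assumptions for this sketch; the threshold is a heuristic and may need tuning for other pages.

```python
def drop_short_lines(text, min_words=4):
    """Keep only lines that look like prose, judged by a minimum word count."""
    kept = [line.strip() for line in text.split("\n") if len(line.split()) >= min_words]
    return "\n".join(kept)


raw = (
    "Sign up\nSign in\nSubash Gandyer\nFollow\n--\n1\nListen\nShare\n"
    "It was a cozy Sunday afternoon in the month of February 2018."
)
cleaned = drop_short_lines(raw)
print(cleaned)
```

In this notebook it could be applied as `text = drop_short_lines(text)` before the text is passed to a summarizer.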

### Summarize - TextRankSummarizer

In [181]:
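import nltk
# nltk.download('punkt')  # uncomment on first run; the tokenizer needs this data

# Named *_summarizer so it matches the variable used in the next cell.
text_rank_summarizer = TextRankSummarizer()

def summarize_with_sumy(text, summarizer, num_sentences):
    # `url` is the global defined earlier; HtmlParser.from_string takes it as its second argument.
    # Since `text` is already plain text, sumy's PlaintextParser would also be a natural fit.
    parser = HtmlParser.from_string(text, url, Tokenizer("english"))
    summary = summarizer(parser.document, num_sentences)
    return "\n".join([str(sentence) for sentence in summary])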


In [186]:
text_rank_summary = summarize_with_sumy(text, text_rank_summarizer, 5)
print(text_rank_summary)

With a smile, I said slowly, “Its Neu — ral Net — work”
She asked, “Papa, What is Meu-ral Met-ark?”
At the back of my head, thoughts of me taking days to comprehend what a NN (short for Neural Network) is, how it would work, where it is used, how it is simulating our human brain’s inner workings were going through.
“Neural Network is a collection (a network) of neurons whose job is to learn a new thing or a new place or a new process or a new concept.”
It would be stupid on my part to start with a definition of Neural Network like how we used to teach adults in college.
After telling her the features of a lion, asked her “Can you draw these for me?” She happily drew almost a similar figure to that of a dog she drew before.
When you see a new object, your brain will ask the neurons, ‘Hey, anybody experienced this before?’ The neurons will say, ‘Yes, I have seen this.’ Certain other neurons will say, ‘No, I have not seen this.’ The neurons that have seen this before, will group together 

### Try different Summarizers
- LexRankSummarizer
- LuhnSummarizer
- LsaSummarizer

### Import the summarizers

In [187]:
from sumy.summarizers.lex_rank import LexRankSummarizer
from sumy.summarizers.luhn import LuhnSummarizer
from sumy.summarizers.lsa import LsaSummarizer

### Create Summarizers

In [188]:
lex_rank_summarizer = LexRankSummarizer()
luhn_summarizer = LuhnSummarizer()
lsa_summarizer = LsaSummarizer()

In [189]:
from nltk.tokenize import sent_tokenize

def preprocess_text(text):
    sentences = sent_tokenize(text)
    cleaned_text = ' '.join(sentences)
    return cleaned_text

cleaned_text = preprocess_text(text)

### LexRankSummarizer

In [190]:
lex_rank_summary = summarize_with_sumy(cleaned_text, lex_rank_summarizer, 5)
lex_rank_summary

'I said, “Good Job!” and asked her, “Where’s the tail, baby?” She smiled and drew a tail.\nAfter telling her the features of a lion, asked her “Can you draw these for me?” She happily drew almost a similar figure to that of a dog she drew before.\nWas it a dog or a lion?\nUltimately, the neurons in your brain tell that it is a lion and not a dog.\nTanishi: That’s it.'

### LuhnSummarizer

In [191]:
luhn_summary = summarize_with_sumy(cleaned_text, luhn_summarizer, 5)
print(luhn_summary)

With a smile, I said slowly, “Its Neu — ral Net — work”
She asked, “Papa, What is Meu-ral Met-ark?”
At the back of my head, thoughts of me taking days to comprehend what a NN (short for Neural Network) is, how it would work, where it is used, how it is simulating our human brain’s inner workings were going through.
“Neural Network is a collection (a network) of neurons whose job is to learn a new thing or a new place or a new process or a new concept.”
It would be stupid on my part to start with a definition of Neural Network like how we used to teach adults in college.
How you learnt it is because of Neural Network inside your brain.” Now, a neural network is a collection of neurons that keeps switching on and off based on things you see, feel, hear and think just like switching on light bulb at our home.
Every neuron is waiting for your eyes to see something new, for your nose to smell something new, for your ears to hear something new, for your tongue to taste something new.
When yo

### LsaSummarizer

In [192]:
lsa_summary = summarize_with_sumy(cleaned_text, lsa_summarizer, 5)
print(lsa_summary)

If you’ve noticed, this is how ML people make their machines learn through Reinforcement Learning.
For example, when I showed you a lion picture, your brain asked the neurons who had seen it before.
Every neuron will tune itself to pick up certain features like legs, tail, face, beard, and so on.
And I hope she will not come to me running asking “Papa, what is Meural Metark?” again.
And I have a strong feeling; she would ask me another stunning question sooner or later.
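
The TextRank and Luhn outputs above share several sentences, while the LSA output looks quite different. A simple way to quantify this is the Jaccard overlap between the sets of sentences in two summaries. The helper below is a sketch that assumes each summary is a newline-separated string, as returned by `summarize_with_sumy`; the name `sentence_overlap` is hypothetical.

```python
def sentence_overlap(summary_a, summary_b):
    """Jaccard overlap between the sets of lines in two newline-separated summaries."""
    a = {line.strip() for line in summary_a.split("\n") if line.strip()}
    b = {line.strip() for line in summary_b.split("\n") if line.strip()}
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


first = "Sentence one.\nSentence two.\nSentence three."
second = "Sentence two.\nSentence three.\nSentence four."
overlap = sentence_overlap(first, second)  # 2 shared lines out of 4 distinct lines
print(overlap)
```

With the variables from this notebook, `sentence_overlap(text_rank_summary, luhn_summary)` would give a value between 0 (no sentences in common) and 1 (identical selections).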


## 2. Gensim

## Task: Take a piece of text from a web page and summarize it using Gensim
### Steps
- Install the necessary libraries
- Import the libraries
- Scrape the text from a pre-defined webpage
- Summarize

### Install the library

In [27]:
# The gensim.summarization module was removed in gensim 4.0, so an older release is pinned here.
!pip install gensim==3.6.0

Collecting gensim==3.6.0
  Downloading gensim-3.6.0.tar.gz (23.1 MB)

### Import the library

In [193]:
from gensim.summarization import summarize

### Scrape the text
- Use BeautifulSoup to extract the text (from Task 1 of ML-1)

In [194]:
from bs4 import BeautifulSoup
import requests

def get_text_from_url(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    paragraphs = soup.find_all('p')
    text = ' '.join([para.text for para in paragraphs])
    return text

url = "https://medium.com/@subashgandyer/papa-what-is-a-neural-network-c5e5cc427c7"
text = get_text_from_url(url)

In [195]:
text

'Sign up Sign in Sign up Sign in Subash Gandyer Follow -- 1 Listen Share It was a cozy Sunday afternoon in the month of February 2018. I just finished my huge customary Sunday lunch spread with family and resting along. Everyone in the family was taking a quick nap for a pre-planned evening outing. Well not everyone, actually. My 4-year-old angel came running to me, asked me to play with her for a while. As I was lazy and not in a position to move after the big spread, I evaded the chance to play with her by telling her “Papa’s got some work baby. Got to code some stuff.” I thought that would be the end of the conversation. No! It wasn’t. As my daughter was very inquisitive, she asked me “Papa, what stuff?” I said, “I need to code something for my work.” She didn’t leave. She again asked, “What is code something?” I wanted to end this conversation, as I was half past asleep. “Just some stuff baby. You wouldn’t understand. Way beyond your age.” Tanishi never takes NO for an answer. “Pap

### Summarize
- **word_count**: maximum number of words wanted in the summary
- **ratio**: fraction of the sentences in the original text to return as the summary
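
To illustrate how the two limits differ, the sketch below applies each one to a list of sentences that is assumed to be already ranked best-first. It is a simplified reading of the parameter descriptions above, not Gensim's code: the function names are hypothetical, and the library's own stopping rule for `word_count` may differ in detail.

```python
def select_by_ratio(ranked_sentences, ratio):
    """Keep the top fraction of the ranked sentences."""
    keep = int(len(ranked_sentences) * ratio)
    return ranked_sentences[:keep]


def select_by_word_count(ranked_sentences, word_count):
    """Add sentences best-first while the running total stays within the word budget."""
    selected, total = [], 0
    for sentence in ranked_sentences:
        n = len(sentence.split())
        if total + n > word_count:
            break
        selected.append(sentence)
        total += n
    return selected


ranked = ["a b c d e", "f g h", "i j k l", "m n"]
by_ratio = select_by_ratio(ranked, ratio=0.5)
by_words = select_by_word_count(ranked, word_count=9)
print(by_ratio)
print(by_words)
```

A ratio scales with the length of the input, so longer articles give longer summaries, whereas a word budget caps the summary at roughly the same size regardless of input length.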

In [204]:
summary_word_count = summarize(text, word_count=100)
summary_ratio = summarize(text, ratio=0.4)

print("Summary with word count limit:")
print(summary_word_count)

print("\nSummary with ratio limit:")
print(summary_ratio)

Summary with word count limit:
What I was actually doing here was teaching her neural network (brain) the features of a lion like exactly how Machine Learning Engineers would train the machine to learn new features.
After telling her the features of a lion, asked her “Can you draw these for me?” She happily drew almost a similar figure to that of a dog she drew before.
The neurons grouped together with features like face, body, legs, tail and a beard forms a lion.
Once all the features are there, the neurons will send a signal that the picture you are looking at is a lion and not a dog.

Summary with ratio limit:
As I was lazy and not in a position to move after the big spread, I evaded the chance to play with her by telling her “Papa’s got some work baby.
As my daughter was very inquisitive, she asked me “Papa, what stuff?” I said, “I need to code something for my work.” She didn’t leave.
Way beyond your age.” Tanishi never takes NO for an answer.
“Papa, tell me what stuff means and s

## 3. Summa

## Task: Take a piece of text from a web page and summarize it using Summa
### Steps
- Install the necessary libraries
- Import the libraries
- Scrape the text from a pre-defined webpage
- Summarize

### Install the library

In [152]:
!pip install summa



### Import the library

In [210]:
from summa import summarizer

### Scrape the text
- Use BeautifulSoup to extract the text (from Task 1 of ML-1)

### Summarize

In [223]:
from bs4 import BeautifulSoup
import requests

def get_text_from_url(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    paragraphs = soup.find_all('p')
    text = ' '.join([para.text for para in paragraphs])
    return text

url = "https://medium.com/@subashgandyer/papa-what-is-a-neural-network-c5e5cc427c7"
text = get_text_from_url(url)


suma_summary = summarizer.summarize(text)
suma_summary

'“Papa, tell me what stuff means and something means.” Cannot help evade a cute curious face, I said, “I am working on Neural Network.” Before I finish the statement, “Papa, What is a Meural Metark?” I gave up my stubbornness of avoiding her.\nWith a smile, I said slowly, “Its Neu — ral Net — work” She asked, “Papa, What is Meu-ral Met-ark?” At the back of my head, thoughts of me taking days to comprehend what a NN (short for Neural Network) is, how it would work, where it is used, how it is simulating our human brain’s inner workings were going through.\n“Neural Network is a collection (a network) of neurons whose job is to learn a new thing or a new place or a new process or a new concept.” It would be stupid on my part to start with a definition of Neural Network like how we used to teach adults in college.\nAsked her to draw a dog out of it.\nAfter all, neural network inside our brain helps us to learn new things in our life.\nWhat I was actually doing here was teaching her neural 

## ASSIGNMENT: Take the same Medium article (the one I wrote) that we used for Task 1 of ML-1, extract the text, summarize it using all the above methods, and provide the best summary with a note explaining why the chosen library is the best
url = https://medium.com/@subashgandyer/papa-what-is-a-neural-network-c5e5cc427c7

### Submit 2 files
- (notebook) .ipynb
- (summary) .txt

In [224]:
selected_summary = suma_summary
with open("Vidit_Task5_Summarization.txt", "w") as file:
    file.write(selected_summary)

In [225]:
justification = """

I selected Summa's summary because it discusses different types of neural networks and uses a comparison for better understanding. The summary captures the main points of the article in a short and clear way. It strikes a good balance between being short and easy to understand, making it the best choice for explaining the key ideas of the article.
"""
print("Justification for choosing Summa:\n", justification)

with open("Vidit_Task5_Summarization.txt", "a") as file:
    file.write("\n\n")  # blank line between the summary and the justification
    file.write(justification)

Justification for choosing Summa:
 

I selected Summa's summary because it discusses different types of neural networks and uses a comparison for better understanding. The summary captures the main points of the article in a short and clear way. It strikes a good balance between being short and easy to understand, making it the best choice for explaining the key ideas of the article.
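
When justifying a choice between libraries, a few simple numbers can support the qualitative judgement. The sketch below reports the length of each candidate summary and its compression ratio relative to the source text. The function name `summary_stats` and the sample strings are assumptions for illustration; the numbers say nothing about quality on their own and are only meant to complement reading the summaries.

```python
def summary_stats(source_text, summary):
    """Word count of a summary and its length relative to the source text."""
    source_words = source_text.split()
    summary_words = summary.split()
    return {
        "summary_words": len(summary_words),
        "compression": len(summary_words) / len(source_words) if source_words else 0.0,
    }


source = "one two three four five six seven eight nine ten"
candidates = {"short": "one two", "long": "one two three four five"}
report = {name: summary_stats(source, text) for name, text in candidates.items()}
for name, stats in report.items():
    print(name, stats)
```

In this notebook the candidates could be the strings produced earlier, for example `{"TextRank": text_rank_summary, "Gensim": summary_word_count, "Summa": suma_summary}`, with `text` as the source.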

