# **Text Summarizer:**

Description: Text summarization in NLP condenses a lengthy text into a few digestible paragraphs or sentences. It extracts the key information while preserving the meaning of the text, which reduces the time needed to grasp long pieces such as articles without losing vital information.

In [None]:
!pip install sumy

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting sumy
  Downloading sumy-0.11.0-py2.py3-none-any.whl (97 kB)
Collecting pycountry>=18.2.23
  Downloading pycountry-22.3.5.tar.gz (10.1 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting breadability>=0.1.20
  Downloading breadability-0.1.20.tar.gz (32 kB)
  Preparing metadata (setup.py) ... done
Collecting docopt<0.7,>=0.6.1
  Downloading docopt-0.6.2.tar.gz (25 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: breadability, docopt, pycount

In [None]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
# Summarize a web page with four of Sumy's summarizers: TextRank, LexRank, Luhn, and LSA.
from sumy.parsers.html import HtmlParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.text_rank import TextRankSummarizer
from sumy.summarizers.lex_rank import LexRankSummarizer
from sumy.summarizers.luhn import LuhnSummarizer
from sumy.summarizers.lsa import LsaSummarizer

num_sentences_in_summary = 2  # number of sentences to extract per summarizer
url = "https://en.wikipedia.org/wiki/Automatic_summarization"  # page to summarize
parser = HtmlParser.from_url(url, Tokenizer("english"))

# Labels and summarizer instances, kept in the same order
summarizer_names = ("TextRankSummarizer:", "LexRankSummarizer:", "LuhnSummarizer:", "LsaSummarizer:")
summarizers = [TextRankSummarizer(), LexRankSummarizer(), LuhnSummarizer(), LsaSummarizer()]

for name, summarizer in zip(summarizer_names, summarizers):
    print(name)
    for sentence in summarizer(parser.document, num_sentences_in_summary):
        print(sentence)
    print("-" * 30)

TextRankSummarizer:
For text, extraction is analogous to the process of skimming, where the summary (if available), headings and subheadings, figures, the first and last paragraphs of a section, and optionally the first and last sentences in a paragraph are read before one chooses to read the entire document in detail.
A Class of Submodular Functions for Document Summarization", The 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-HLT), 2011^ Sebastian Tschiatschek, Rishabh Iyer, Hoachen Wei and Jeff Bilmes, Learning Mixtures of Submodular Functions for Image Collection Summarization, In Advances of Neural Information Processing Systems (NIPS), Montreal, Canada, December - 2014.^ Ramakrishna Bairi, Rishabh Iyer, Ganesh Ramakrishnan and Jeff Bilmes, Summarizing Multi-Document Topic Hierarchies using Submodular Mixtures, To Appear In the Annual Meeting of the Association for Computational Linguistics (ACL), Beijing, China, July - 2015

Summarization with Gensim

In [None]:
!pip install gensim==3.8.3  # pin to 3.x: the gensim.summarization module was removed in gensim 4.0

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting gensim==3.8.3
  Downloading gensim-3.8.3.tar.gz (23.4 MB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: gensim
  Building wheel for gensim (setup.py) ... done
  Created wheel for gensim: filename=gensim-3.8.3-cp39-cp39-linux_x86_64.whl size=26528039 sha256=6c050cf5f07a9ca0294b10761168aaf067022d645e515bdeb9353d83e9f59199
  Stored in directory: /root/.cache/pip/wheels/ca/5d/af/618594ec2f28608c1d6ee7d2b7e95a3e9b06551e3b80a491d6
Successfully built gensim
Installing collected packages: gensim
  Attempting uninstall: gensim
    Found existing installation: gensim 4.3.1
    Uninstalling gensim-4.3.1:
      Successfully uninstalled gensim-4.3.1
Successfully installed gensim-3.8.3


In [None]:
from gensim.summarization import summarize, summarize_corpus
from gensim.summarization.textcleaner import split_sentences
from gensim import corpora


In [None]:
# Read the input document (path as used in the Colab session)
with open("/content/nlp.txt") as f:
    text = f.read()

# summarize() extracts the most relevant sentences from a text.
# When word_count is given, gensim ignores ratio, so only word_count is passed here.
print("Summarize:\n", summarize(text, word_count=200))


# summarize_corpus() selects the most important documents in a corpus.
sentences = split_sentences(text)  # treat each sentence as one document
tokens = [sentence.split() for sentence in sentences]
dictionary = corpora.Dictionary(tokens)
corpus = [dictionary.doc2bow(sentence_tokens) for sentence_tokens in tokens]

# Print the selected documents (shown here in bag-of-words representation)
print("-" * 30, "\nSummarize Corpus\n", summarize_corpus(corpus, ratio=0.1))

Summarize:
 As a computer science student specializing in artificial intelligence and machine learning, I am eager to apply my knowledge and skills to a machine learning development internship role.
Relevant coursework: My specialization in artificial intelligence and machine learning has equipped me with the necessary knowledge to understand the fundamental concepts of machine learning.
I have taken courses in algorithms, statistics, and probability, which are critical components of machine learning.
NLP experience: I have experience in natural language processing, which is a crucial area in machine learning.
This experience has helped me develop an understanding of how machine learning algorithms can be used to analyze and process natural language data.
I am always eager to learn and stay updated with the latest developments in the field of machine learning.
Good communication skills: Effective communication is a critical component of any successful team.
As a computer science studen

Observations: The summary of the given text is displayed as the output.
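
The bag-of-words (BoW) representation passed to `summarize_corpus` can be illustrated without Gensim. The following is a minimal sketch, not Gensim's implementation: `build_dictionary` and `doc_to_bow` are hypothetical helpers that mimic the shape of `doc2bow` output, a list of `(token_id, count)` pairs per document. Assigning IDs in first-seen order is an assumption and may differ from what Gensim does.

```python
def build_dictionary(tokenized_docs):
    """Map each distinct token to an integer ID (first-seen order, an assumption)."""
    token_to_id = {}
    for doc in tokenized_docs:
        for token in doc:
            token_to_id.setdefault(token, len(token_to_id))
    return token_to_id


def doc_to_bow(tokens, token_to_id):
    """Return sorted (token_id, count) pairs for the known tokens in one document."""
    counts = {}
    for token in tokens:
        if token in token_to_id:
            token_id = token_to_id[token]
            counts[token_id] = counts.get(token_id, 0) + 1
    return sorted(counts.items())


# Each sentence is treated as one document, as in the cell above.
docs = [sentence.split() for sentence in ["NLP is fun", "NLP is NLP"]]
vocab = build_dictionary(docs)
corpus = [doc_to_bow(doc, vocab) for doc in docs]
print(corpus)
```

Each inner list describes one sentence, so the corpus loses word order but keeps how often each token occurs.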

Summarization with Sumy

Sumy is a simple library and command-line utility for extracting summaries from HTML pages or plain text. The package also contains a simple evaluation framework for text summaries.

Sumy offers several summarization algorithms, including:

Luhn (heuristic method): Luhn's algorithm is one of the earliest approaches to text summarization. It is based on word frequency, similar in spirit to TF-IDF. Luhn proposed that how often a word occurs in a document indicates how significant it is: the most frequent words (such as stop words) and the least frequent words both contribute little to the meaning of the document, so sentences are ranked by the remaining significant words they contain. It is not considered a very accurate approach.
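
The frequency heuristic can be sketched in a few lines of plain Python. This is a simplified illustration, not Sumy's `LuhnSummarizer`: the stop-word list, the `min_count` cutoff, and the scoring formula (significant words squared, divided by the span they cover) are assumptions chosen to keep the example small.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "and", "to", "in", "it", "on"}


def tokenize(text):
    """Lowercase the text and keep alphabetic runs only."""
    return re.findall(r"[a-z]+", text.lower())


def significant_words(sentences, min_count=2):
    """Words that are neither stop words nor rarer than min_count."""
    counts = Counter(
        word for s in sentences for word in tokenize(s) if word not in STOP_WORDS
    )
    return {word for word, count in counts.items() if count >= min_count}


def luhn_score(sentence, significant):
    """(number of significant words)^2 / span from the first to the last one."""
    words = tokenize(sentence)
    positions = [i for i, word in enumerate(words) if word in significant]
    if not positions:
        return 0.0
    span = positions[-1] - positions[0] + 1
    return len(positions) ** 2 / span


sentences = [
    "Summarization shortens a long text.",
    "The weather is nice on the coast.",
    "Extractive summarization picks sentences from the text.",
]
significant = significant_words(sentences)
scores = [luhn_score(s, significant) for s in sentences]
best = max(range(len(sentences)), key=scores.__getitem__)
print(sentences[best])
```

A sentence with no significant words scores zero, which is how the off-topic second sentence drops out.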

Latent Semantic Analysis (LSA): LSA is a technique for creating a vector representation of a document. Vector representations make it possible to compare documents for similarity by calculating the distance between their vectors.
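
Comparing documents through their vectors can be sketched as follows. As a simplifying assumption, the vectors here are raw term counts rather than the reduced vectors LSA derives with a singular value decomposition, so this shows only the comparison step. The helper names `to_vector` and `cosine_similarity` are hypothetical.

```python
import math
from collections import Counter


def to_vector(text):
    """Represent a document as term counts (a stand-in for LSA's reduced vectors)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


doc_a = to_vector("text summarization extracts key sentences")
doc_b = to_vector("summarization extracts important sentences from text")
doc_c = to_vector("the cat sat on the mat")

sim_ab = cosine_similarity(doc_a, doc_b)
sim_ac = cosine_similarity(doc_a, doc_c)
print(sim_ab > sim_ac)
```

Documents that share vocabulary score closer to 1, and documents with no terms in common score 0.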

LexRank: An unsupervised, graph-based approach to automatic text summarization, inspired by the PageRank and HITS algorithms. Sentences are scored on a graph: LexRank computes sentence importance from eigenvector centrality in a graph representation of the sentences.
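
Eigenvector centrality on a sentence graph can be approximated with power iteration. The sketch below is not Sumy's `LexRankSummarizer`: the similarity matrix is made up for illustration, and the `threshold` and `damping` values are assumptions.

```python
def lexrank_scores(similarity, threshold=0.1, damping=0.85, iterations=100):
    """Power iteration over a thresholded, row-normalized similarity graph."""
    n = len(similarity)
    # Keep an unweighted edge wherever similarity exceeds the threshold.
    adjacency = [
        [1.0 if similarity[i][j] > threshold else 0.0 for j in range(n)]
        for i in range(n)
    ]
    # Row-normalize so each row is a probability distribution.
    transition = []
    for row in adjacency:
        total = sum(row)
        transition.append([v / total if total else 1.0 / n for v in row])
    # Repeatedly redistribute scores along edges until they settle.
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) / n
            + damping * sum(scores[j] * transition[j][i] for j in range(n))
            for i in range(n)
        ]
    return scores


# Hypothetical pairwise sentence similarities: sentence 0 resembles all others.
similarity = [
    [1.0, 0.6, 0.5, 0.4],
    [0.6, 1.0, 0.05, 0.0],
    [0.5, 0.05, 1.0, 0.02],
    [0.4, 0.0, 0.02, 1.0],
]
scores = lexrank_scores(similarity)
print(scores.index(max(scores)))
```

The sentence connected to the most other sentences ends up with the highest centrality, which is why it would be picked first for the summary.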

TextRank: An unsupervised, extractive, graph-based approach built on PageRank. The vertices of the graph are sentences, and the edge weights denote the similarity between sentences.
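
A minimal sketch of such a graph, assuming word overlap normalized by log sentence length as the similarity measure and a weighted PageRank for ranking. It is an illustration rather than Sumy's `TextRankSummarizer`: the function names are hypothetical, and sentence length is counted in unique words for simplicity.

```python
import math
import re


def words(sentence):
    """Unique lowercase words in a sentence."""
    return set(re.findall(r"[a-z]+", sentence.lower()))


def overlap_similarity(a, b):
    """Shared words divided by the sum of the log sentence lengths."""
    wa, wb = words(a), words(b)
    if not wa or not wb:
        return 0.0
    denom = math.log(len(wa)) + math.log(len(wb))
    return len(wa & wb) / denom if denom > 0 else 0.0


def build_graph(sentences):
    """Weighted adjacency matrix: vertices are sentences, weights are similarities."""
    n = len(sentences)
    return [
        [overlap_similarity(sentences[i], sentences[j]) if i != j else 0.0
         for j in range(n)]
        for i in range(n)
    ]


def weighted_pagerank(weights, damping=0.85, iterations=50):
    """Each vertex passes its score to neighbors in proportion to edge weight."""
    n = len(weights)
    out_sums = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_sums[j] * scores[j]
                for j in range(n) if out_sums[j] > 0
            )
            for i in range(n)
        ]
    return scores


sentences = [
    "TextRank builds a graph of sentences.",
    "Edges in the graph link similar sentences.",
    "PageRank then scores each sentence in the graph.",
    "Bananas are yellow.",
]
graph = build_graph(sentences)
scores = weighted_pagerank(graph)
print(sentences[scores.index(max(scores))])
```

The unrelated last sentence shares no words with the others, so it has no edges and keeps only the baseline score.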

Summarization with Gensim

Gensim is a Python library for topic modeling, document indexing, and similarity retrieval with large corpora. The target audience is the natural language processing (NLP) and information retrieval (IR) community.