# Text Summarization

Text summarization is the process of identifying the most important information in a document and producing a shorter version that preserves its key ideas.
+ The idea of summarization is to find a **subset of data that contains the "information" of the entire set**.

#### Types of Summarization Approaches
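- **Extractive**: selects the most important sentences from the source text and returns them unchanged.
- **Abstractive**: generates new sentences that paraphrase the source text.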

In [1]:
!pip install sumy summa lexrank spacy nltk gensim



## 1) Extractive Summarization

In [2]:
docx = """Machine learning (ML) is the study of computer algorithms that improve automatically through experience and by the use of data. 
It is seen as a part of artificial intelligence. Machine learning algorithms build a model based on sample data, known as "training data", 
in order to make predictions or decisions without being explicitly programmed to do so. 
Machine learning algorithms are used in a wide variety of applications, such as email filtering and computer vision, where it is difficult or 
unfeasible to develop conventional algorithms to perform the needed tasks.
A subset of machine learning is closely related to computational statistics, which focuses on making predictions using computers; 
but not all machine learning is statistical learning. The study of mathematical optimization delivers methods, theory and application domains 
to the field of machine learning. Data mining is a related field of study, focusing on exploratory data analysis through unsupervised learning. 
In its application across business problems, machine learning is also referred to as predictive analytics."""

### Using Sumy
+ Steps
    - parser
    - tokenizer
    - algorithm

In [3]:
# Load Our Pkgs
import sumy

In [4]:
# Load Pkgs
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer

In [5]:
print(docx)

Machine learning (ML) is the study of computer algorithms that improve automatically through experience and by the use of data. 
It is seen as a part of artificial intelligence. Machine learning algorithms build a model based on sample data, known as "training data", 
in order to make predictions or decisions without being explicitly programmed to do so. 
Machine learning algorithms are used in a wide variety of applications, such as email filtering and computer vision, where it is difficult or 
unfeasible to develop conventional algorithms to perform the needed tasks.
A subset of machine learning is closely related to computational statistics, which focuses on making predictions using computers; 
but not all machine learning is statistical learning. The study of mathematical optimization delivers methods, theory and application domains 
to the field of machine learning. Data mining is a related field of study, focusing on exploratory data analysis through unsupervised learning. 
In it

In [6]:
# Parsing From String
import nltk
nltk.download('punkt')
parser = PlaintextParser.from_string(docx,Tokenizer("english"))

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [7]:
# Get The Text
parser.document

<DOM with 1 paragraphs>

In [8]:
# Get The Sentence Text
parser.document.sentences

(<Sentence: Machine learning (ML) is the study of computer algorithms that improve automatically through experience and by the use of data.>,
 <Sentence: It is seen as a part of artificial intelligence.>,
 <Sentence: Machine learning algorithms build a model based on sample data, known as "training data", in order to make predictions or decisions without being explicitly programmed to do so.>,
 <Sentence: Machine learning algorithms are used in a wide variety of applications, such as email filtering and computer vision, where it is difficult or unfeasible to develop conventional algorithms to perform the needed tasks.>,
 <Sentence: A subset of machine learning is closely related to computational statistics, which focuses on making predictions using computers; but not all machine learning is statistical learning.>,
 <Sentence: The study of mathematical optimization delivers methods, theory and application domains to the field of machine learning.>,
 <Sentence: Data mining is a related f

In [9]:
# Get Word Tokens
parser.document.words

('Machine',
 'learning',
 'ML',
 'is',
 'the',
 'study',
 'of',
 'computer',
 'algorithms',
 'that',
 'improve',
 'automatically',
 'through',
 'experience',
 'and',
 'by',
 'the',
 'use',
 'of',
 'data',
 'It',
 'is',
 'seen',
 'as',
 'a',
 'part',
 'of',
 'artificial',
 'intelligence',
 'Machine',
 'learning',
 'algorithms',
 'build',
 'a',
 'model',
 'based',
 'on',
 'sample',
 'data',
 'known',
 'as',
 'training',
 'data',
 'in',
 'order',
 'to',
 'make',
 'predictions',
 'or',
 'decisions',
 'without',
 'being',
 'explicitly',
 'programmed',
 'to',
 'do',
 'so',
 'Machine',
 'learning',
 'algorithms',
 'are',
 'used',
 'in',
 'a',
 'wide',
 'variety',
 'of',
 'applications',
 'such',
 'as',
 'email',
 'filtering',
 'and',
 'computer',
 'vision',
 'where',
 'it',
 'is',
 'difficult',
 'or',
 'unfeasible',
 'to',
 'develop',
 'conventional',
 'algorithms',
 'to',
 'perform',
 'the',
 'needed',
 'tasks',
 'A',
 'subset',
 'of',
 'machine',
 'learning',
 'is',
 'closely',
 'related',


In [None]:
# Mount Google Drive in Colab so the example file can be read
from google.colab import drive
drive.mount("/content/drive")

In [None]:
# Parsing From File
parser_file = PlaintextParser.from_file("/content/drive/My Drive/Colab Notebooks/Summarization/example.txt",Tokenizer("english"))

In [None]:
parser_file.document

## 1.1)  **Word Frequency Based**

### **Using Luhn** (in Sumy)
+ Ranks sentences by how many high-frequency (significant) words they contain.
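
To make the frequency idea concrete, here is a minimal sketch using only the standard library. It is a simplified word-frequency scorer, not Luhn's exact cluster-based scoring and not Sumy's implementation; the tiny `STOP_WORDS` set, the naive sentence splitter, and the function names are assumptions made for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real tools ship much larger lists).
STOP_WORDS = {"a", "an", "and", "are", "as", "by", "in", "is", "it", "of", "or", "the", "to"}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def frequency_summary(text, n_sentences=2):
    """Return the n highest-scoring sentences, kept in their original order."""
    sentences = split_sentences(text)
    # Content words per sentence (lower-cased, stop words removed).
    tokenized = [
        [w for w in re.findall(r"[a-z']+", s.lower()) if w not in STOP_WORDS]
        for s in sentences
    ]
    # Document-wide word frequencies.
    freq = Counter(w for words in tokenized for w in words)
    # Sentence score = average frequency of its content words.
    scores = [
        sum(freq[w] for w in words) / len(words) if words else 0.0
        for words in tokenized
    ]
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:n_sentences]
    return [sentences[i] for i in sorted(top)]


sample = (
    "Machine learning algorithms build a model from sample data. "
    "The weather was pleasant yesterday. "
    "Machine learning algorithms are used in many applications."
)
for sentence in frequency_summary(sample, 2):
    print(sentence)
```

The off-topic weather sentence shares no repeated words with the rest of the text, so it scores lowest and is dropped.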

In [10]:
from sumy.summarizers.luhn import LuhnSummarizer

In [17]:
luhn_summarizer = LuhnSummarizer()
summary_1 = luhn_summarizer(parser.document,2) # 2 =  number of sentences we want

In [18]:
print(summary_1)

(<Sentence: Machine learning (ML) is the study of computer algorithms that improve automatically through experience and by the use of data.>, <Sentence: Machine learning algorithms are used in a wide variety of applications, such as email filtering and computer vision, where it is difficult or unfeasible to develop conventional algorithms to perform the needed tasks.>)


In [19]:
for sentence in summary_1:
    print(sentence)

Machine learning (ML) is the study of computer algorithms that improve automatically through experience and by the use of data.
Machine learning algorithms are used in a wide variety of applications, such as email filtering and computer vision, where it is difficult or unfeasible to develop conventional algorithms to perform the needed tasks.


### **Using LSA (Latent Semantic Analysis)** in Sumy
+ Combines term-frequency statistics with **Singular Value Decomposition** (SVD) to identify the latent topics of a text and select the sentences that best represent them.
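
The sketch below illustrates the general LSA idea with NumPy: build a term-by-sentence count matrix, decompose it with SVD, and score each sentence by its weight in the strongest topics. It is not Sumy's implementation; the `n_topics` parameter, the raw-count weighting, and the naive sentence splitter are assumptions chosen to keep the example short.

```python
import re

import numpy as np


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def lsa_summary(text, n_sentences=2, n_topics=2):
    """Pick the sentences with the largest weight in the top latent topics."""
    sentences = split_sentences(text)
    tokenized = [re.findall(r"[a-z']+", s.lower()) for s in sentences]
    vocab = sorted({w for words in tokenized for w in words})
    index = {w: i for i, w in enumerate(vocab)}

    # Term-by-sentence matrix of raw counts (rows: terms, columns: sentences).
    matrix = np.zeros((len(vocab), len(sentences)))
    for j, words in enumerate(tokenized):
        for w in words:
            matrix[index[w], j] += 1.0

    # SVD: rows of vt describe how strongly each sentence expresses each topic.
    _, singular_values, vt = np.linalg.svd(matrix, full_matrices=False)
    k = min(n_topics, len(singular_values))

    # Sentence score = length of its vector in the k strongest topics,
    # with each topic weighted by its singular value.
    weighted = (singular_values[:k, None] ** 2) * (vt[:k, :] ** 2)
    scores = np.sqrt(weighted.sum(axis=0))

    top = np.argsort(-scores)[:n_sentences]
    return [sentences[i] for i in sorted(top)]


sample = (
    "Machine learning algorithms build a model from sample data. "
    "The weather was pleasant yesterday. "
    "Machine learning algorithms are used in many applications."
)
for sentence in lsa_summary(sample, 2):
    print(sentence)
```

Keeping only the first `k` topics matters: if every singular value were used, the score would reduce to the plain length of each sentence's count vector.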

In [20]:
# Load Our Pkgs
from sumy.summarizers.lsa import LsaSummarizer

In [21]:
lsa_summarizer = LsaSummarizer()

In [22]:
summary_2 = lsa_summarizer(parser.document,2)

In [23]:
for sentence in summary_2:
    print(sentence)

Machine learning algorithms are used in a wide variety of applications, such as email filtering and computer vision, where it is difficult or unfeasible to develop conventional algorithms to perform the needed tasks.
In its application across business problems, machine learning is also referred to as predictive analytics.


In [24]:
# Alternative method: add stemming and stop-word removal
from sumy.nlp.stemmers import Stemmer
from sumy.utils import get_stop_words
summarizer_lsa2 = LsaSummarizer(Stemmer("english"))
summarizer_lsa2.stop_words = get_stop_words("english")

In [25]:
for summary in summarizer_lsa2(parser.document,2):
    print(summary)

Machine learning (ML) is the study of computer algorithms that improve automatically through experience and by the use of data.
The study of mathematical optimization delivers methods, theory and application domains to the field of machine learning.


## 1.2) **Graph-Based Algorithms**
+ **TextRank**
+ **LexRank**

### **Using LexRank** (in Sumy)
+ It is an **unsupervised approach** to text summarization based on graph-based centrality scoring of sentences.
+ The main idea is that sentences "recommend" other similar sentences to the reader. Thus, if one sentence is very similar to many others, it is likely to be a sentence of great importance.
+ LexRank measures the similarity between sentences (typically the cosine similarity of their TF-IDF vectors) and uses it to compute each sentence's centrality, which becomes its weight.
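
The points above can be sketched in plain Python: represent each sentence as a TF-IDF vector, connect sentences whose cosine similarity passes a threshold, and run a PageRank-style iteration over that graph. This is an illustrative approximation rather than Sumy's code; the `threshold`, `damping`, and `iterations` defaults and the helper names are assumptions.

```python
import math
import re
from collections import Counter


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def lexrank_summary(text, n_sentences=2, threshold=0.1, damping=0.85, iterations=50):
    sentences = split_sentences(text)
    n = len(sentences)
    if n == 0:
        return []

    # TF-IDF vectors, treating each sentence as a "document".
    bags = [Counter(re.findall(r"[a-z']+", s.lower())) for s in sentences]
    doc_freq = Counter(w for bag in bags for w in bag)
    idf = {w: math.log(n / doc_freq[w]) + 1.0 for w in doc_freq}
    vectors = [{w: tf * idf[w] for w, tf in bag.items()} for bag in bags]

    # Unweighted graph: an edge joins two sentences that are similar enough.
    adj = [
        [1.0 if i != j and cosine(vectors[i], vectors[j]) >= threshold else 0.0 for j in range(n)]
        for i in range(n)
    ]
    degree = [sum(row) for row in adj]

    # PageRank-style power iteration: a sentence is central if central
    # sentences point to it.
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) / n
            + damping * sum(scores[j] / degree[j] for j in range(n) if adj[j][i] and degree[j])
            for i in range(n)
        ]

    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:n_sentences]
    return [sentences[i] for i in sorted(top)]


sample = (
    "Machine learning algorithms build models from data. "
    "Machine learning algorithms are used in many applications. "
    "The cat sat quietly. "
    "Data mining is related to machine learning."
)
for sentence in lexrank_summary(sample, 3):
    print(sentence)
```

The sentence about the cat shares no words with the others, so it has no edges in the graph and receives only the baseline score.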



In [26]:
# Using Sumy Lexrank
from sumy.summarizers.lex_rank import LexRankSummarizer

In [27]:
lex_summarizer = LexRankSummarizer()

In [28]:
summary_3 = lex_summarizer(parser.document,2)

In [29]:
for summary in summary_3:
    print(summary)

Machine learning (ML) is the study of computer algorithms that improve automatically through experience and by the use of data.
Data mining is a related field of study, focusing on exploratory data analysis through unsupervised learning.


### **Using TextRank**
+ TextRank is a graph-based ranking model for text processing that can be used to find the most relevant sentences in a text and also to extract keywords.
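
As a rough illustration of the keyword side of TextRank, the sketch below links words that occur close to each other and ranks them with a PageRank-style iteration. It is a simplified version written for this notebook, not the code used by Sumy, Gensim, or Summa; the `window` size, the small `STOP_WORDS` set, and the function name are assumptions.

```python
import re
from collections import defaultdict

# Tiny illustrative stop-word list (an assumption; real tools ship much larger lists).
STOP_WORDS = {"a", "an", "and", "are", "as", "by", "in", "is", "it", "of", "or", "that", "the", "to"}


def textrank_keywords(text, top_k=5, window=2, damping=0.85, iterations=50):
    """Rank words by their centrality in a co-occurrence graph."""
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]

    # Undirected graph: two words are linked if they appear within
    # `window` positions of each other.
    graph = defaultdict(set)
    for i, word in enumerate(words):
        for other in words[i + 1 : i + 1 + window]:
            if other != word:
                graph[word].add(other)
                graph[other].add(word)

    # PageRank-style iteration: a word is important if important words link to it.
    scores = {w: 1.0 for w in graph}
    for _ in range(iterations):
        scores = {
            w: (1 - damping) + damping * sum(scores[nb] / len(graph[nb]) for nb in graph[w])
            for w in graph
        }
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


sample = (
    "Machine learning algorithms build a model based on sample data. "
    "Machine learning algorithms are used in a wide variety of applications."
)
print(textrank_keywords(sample, top_k=5))
```

Sentence-level TextRank follows the same pattern, with sentences as nodes and a similarity measure as edge weights.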


### Using Sumy with TextRank

In [30]:
from sumy.summarizers.text_rank import TextRankSummarizer

In [31]:
txtRank = TextRankSummarizer()

In [32]:
summary_4 = txtRank(parser.document,2)

In [33]:
for summary in summary_4:
    print(summary)

Machine learning algorithms are used in a wide variety of applications, such as email filtering and computer vision, where it is difficult or unfeasible to develop conventional algorithms to perform the needed tasks.
A subset of machine learning is closely related to computational statistics, which focuses on making predictions using computers; but not all machine learning is statistical learning.


### Using Gensim 

In [34]:
# Using Gensim (the gensim.summarization module was removed in Gensim 4.0, so this requires gensim < 4.0)
from gensim.summarization.summarizer import summarize

In [35]:
summarize(docx)

'A subset of machine learning is closely related to computational statistics, which focuses on making predictions using computers; \nIn its application across business problems, machine learning is also referred to as predictive analytics.'

### Using Summa

In [36]:
# Using Summa
from summa import summarizer

In [37]:
summarizer.summarize(docx)

'Machine learning (ML) is the study of computer algorithms that improve automatically through experience and by the use of data.\nA subset of machine learning is closely related to computational statistics, which focuses on making predictions using computers; '

## 2) Abstractive Summarization (with prebuilt **Transformers** provided by **Hugging Face**)

In [38]:
!pip install transformers



### 2.1) Fetch our data with the `wikipedia` package

In [39]:
!pip install wikipedia



In [40]:
import wikipedia

wikipedia.search("Computer Science")

['Computer science',
 'Abstraction (computer science)',
 'Computer science and engineering',
 'Glossary of computer science',
 'Heuristic (computer science)',
 'Computer graphics (computer science)',
 'Record (computer science)',
 'Theoretical computer science',
 'Recursion (computer science)',
 'History of computer science']

In [41]:
wiki_page = wikipedia.page('Computer science')

In [42]:
wiki_page.title

'Computer science'

In [43]:
docx_wiki = wiki_page.content

In [44]:
print(docx_wiki)

Computer science is the study of algorithmic processes, computational machines and computation itself. As a discipline, computer science spans a range of topics from theoretical studies of algorithms, computation and information to the practical issues of implementing computational systems in hardware and software.Its fields can be divided into theoretical and practical disciplines. For example, the theory of computation concerns abstract models of computation and general classes of problems that can be solved using them, while computer graphics or computational geometry emphasize more specific applications. Algorithms and data structures have been called the heart of computer science. Programming language theory considers approaches to the description of computational processes, while computer programming involves the use of them to create complex systems. Computer architecture describes construction of computer components and computer-operated equipment. Artificial intelligence aims 

In [45]:
# Let's pick just a part of the content
new_text = docx_wiki[0:1472]

In [46]:
print(new_text)

Computer science is the study of algorithmic processes, computational machines and computation itself. As a discipline, computer science spans a range of topics from theoretical studies of algorithms, computation and information to the practical issues of implementing computational systems in hardware and software.Its fields can be divided into theoretical and practical disciplines. For example, the theory of computation concerns abstract models of computation and general classes of problems that can be solved using them, while computer graphics or computational geometry emphasize more specific applications. Algorithms and data structures have been called the heart of computer science. Programming language theory considers approaches to the description of computational processes, while computer programming involves the use of them to create complex systems. Computer architecture describes construction of computer components and computer-operated equipment. Artificial intelligence aims 
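
A fixed character slice such as `docx_wiki[0:1472]` may cut the text in the middle of a sentence. One possible refinement, shown here as a small sketch, is to trim the slice back to the last sentence-ending punctuation mark. The helper `trim_to_sentence` is hypothetical and not part of any library used in this notebook.

```python
def trim_to_sentence(text, max_chars):
    """Return the longest prefix of `text` within `max_chars` that ends at '.', '!' or '?'."""
    if len(text) <= max_chars:
        return text
    chunk = text[:max_chars]
    # Position of the last sentence-ending character inside the chunk.
    cut = max(chunk.rfind("."), chunk.rfind("!"), chunk.rfind("?"))
    # Fall back to the raw chunk if no sentence end was found.
    return chunk[: cut + 1] if cut != -1 else chunk


sample = "First sentence. Second sentence is longer. Third one."
print(trim_to_sentence(sample, 30))
```

With the notebook's data this could be called as `trim_to_sentence(docx_wiki, 1472)` in place of the plain slice.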

### 2.2) Using Transformers for Summarization

In [47]:
# import transformers pipeline
from transformers import pipeline

In [48]:
# Create our summarizer (this will download the prebuilt model for summarization)
my_summarizer = pipeline("summarization")

In [49]:
# Make the summarization
print(my_summarizer(new_text))

[{'summary_text': ' As a discipline, computer science spans a range of topics from theoretical studies of algorithms, computation and information to the practical issues of implementing computational systems in hardware and software . Algorithms and data structures have been called the heart of computer science . The Turing Award is generally recognized as the highest distinction in computer sciences .'}]


In [50]:
# Save the summary in a variable
my_summary = my_summarizer(new_text)

In [51]:
type(my_summary)

list

In [52]:
my_summary_text = my_summary[0]['summary_text']

In [53]:
print(my_summary_text)

 As a discipline, computer science spans a range of topics from theoretical studies of algorithms, computation and information to the practical issues of implementing computational systems in hardware and software . Algorithms and data structures have been called the heart of computer science . The Turing Award is generally recognized as the highest distinction in computer sciences .


In [54]:
# Length of the original text and of the summary (in characters)
print("Original Doc Len:", len(new_text))
print("Summary Length:", len(my_summary_text))

Original Doc Len: 1472
Summary Length: 386


