# **Text Summarization** [[readmore]](https://devopedia.org/text-summarization)

# 1. Extractive Text Summarization

Extractive summarization identifies the important parts of a text (paragraphs, sentences, or even words) and selects them to form a summary that is a subset of the original text.

### Dependencies

In [30]:
!pip install -U spacy



In [31]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.5.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.5.0/en_core_web_sm-3.5.0-py3-none-any.whl (12.8 MB)
[+] Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [19]:
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from string import punctuation

In [20]:
stopwords = list(STOP_WORDS)

In [21]:
# Load a pretrained spaCy NLP pipeline
nlp = spacy.load('en_core_web_sm')

In [1]:
text = '''Not very far in the past, Artificial Intelligence seemed a distant dream. Today, it’s pretty much everywhere, even though in its weakest forms. The AI industry is now worth $327.5 billion and is expected to grow up to $500 billion by 2024, according to a report by International Data Corporation (IDC). Though, the history of artificial intelligence is longer than we can imagine. 

Artificial Intelligence is defined as the development of a computer system that is able to perform a task that requires human intelligence, for example, speech recognition, visual perception, decision making, and translation between languages. A more lucid definition would be “the intelligence exhibited by machines or software”. It also refers to the academic field of study, which studies how to create computers and computer software capable of intelligent behavior. Since there is no clear line between ordinary software and AI, it becomes difficult to distinguish the difference. Hence, one may think of AI technology as the sort of technology one would use to perform tasks that require some level of intelligence to accomplish. 

The long-term goal of AI research ever since its inception has been what is called Strong AI, which describes a computer capable of doing every intelligent task that a human can perform. However, presently the AI innovation no more relies upon whether we accomplish Strong AI or not. The technology has now progressed to a level where, regardless of future advancements in the field, it is poised to disrupt any innovation-based industry. 

'''

### Word Tokenization

In [23]:
doc = nlp(text)

In [24]:
word_tokens = [token.text for token in doc]

In [25]:
print(word_tokens)

['\n', 'Here', 'are', 'some', 'everyday', 'examples', 'of', 'text', 'summarization', ':', 'news', 'headlines', ',', 'outlines', 'for', 'students', ',', 'movie', 'previews', ',', 'meeting', 'minutes', ',', 'biographies', 'for', 'resumes', 'or', 'obituaries', ',', 'abridged', 'versions', 'of', 'books', ',', 'newsletter', 'production', ',', 'financial', 'research', ',', 'patent', 'research', ',', 'legal', 'contract', 'analysis', ',', 'tweeting', 'about', 'new', 'content', ',', 'chatbots', 'that', 'answer', 'questions', ',', 'email', 'summaries', ',', 'and', 'more', '.', '\n\n', 'When', 'Google', 'Search', 'presents', 'search', 'results', ',', 'some', 'entries', 'are', 'accompanied', 'by', 'auto', '-', 'generated', 'summaries', '.', 'Google', 'may', 'be', 'leveraging', 'a', 'knowledge', 'graph', 'for', 'this', 'purpose', '.', 'Google', "'s", 'approach', 'to', 'summarization', 'is', 'mainly', 'entity', 'centric', '.', 'Summarization', 'extends', 'to', 'timelines', 'and', 'events', 'about', 

In [26]:
punctuation = punctuation + "\n\n" + '”' + '“'  
punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~\n\n”“'
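Because `punctuation` is a plain string, `x in punctuation` is a substring test: it matches single characters, but also any run of characters that happens to appear in the string. A minimal sketch of an exact-match alternative using a set; the names `NOISE_TOKENS` and `is_noise` are hypothetical and not part of the notebook:

```python
from string import punctuation

# Same extra characters the notebook appends, kept as separate set members
NOISE_TOKENS = set(punctuation) | {"\n", "\n\n", "”", "“"}

def is_noise(token_text):
    """Return True only when the whole token is a punctuation mark or newline run."""
    return token_text in NOISE_TOKENS

extended = punctuation + "\n\n" + "”" + "“"  # the string built in the notebook
print("+," in extended)   # True: substring match on the string
print(is_noise("+,"))     # False: exact match on the set
print(is_noise("\n\n"))   # True
```

For single-character tokens both approaches agree; the set only changes the result for multi-character tokens.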

### Text Cleaning & Word frequency calculation (or BOW)

In [27]:
word_frequencies = {}
for w_token in doc:
    if w_token.text.lower() not in stopwords and w_token.text.lower() not in punctuation:
        if w_token.text not in word_frequencies.keys():
            word_frequencies[w_token.text] = 1
        else:
            word_frequencies[w_token.text] += 1

In [28]:
print(word_frequencies)

{'everyday': 1, 'examples': 1, 'text': 1, 'summarization': 2, 'news': 1, 'headlines': 1, 'outlines': 1, 'students': 1, 'movie': 1, 'previews': 1, 'meeting': 1, 'minutes': 1, 'biographies': 1, 'resumes': 1, 'obituaries': 1, 'abridged': 1, 'versions': 1, 'books': 1, 'newsletter': 1, 'production': 1, 'financial': 1, 'research': 2, 'patent': 1, 'legal': 1, 'contract': 1, 'analysis': 1, 'tweeting': 1, 'new': 1, 'content': 1, 'chatbots': 1, 'answer': 1, 'questions': 1, 'email': 1, 'summaries': 3, 'Google': 3, 'Search': 1, 'presents': 1, 'search': 1, 'results': 1, 'entries': 1, 'accompanied': 1, 'auto': 1, 'generated': 1, 'leveraging': 1, 'knowledge': 1, 'graph': 1, 'purpose': 1, 'approach': 1, 'mainly': 1, 'entity': 1, 'centric': 1, 'Summarization': 1, 'extends': 1, 'timelines': 1, 'events': 1, 'entities': 1, 'Doctors': 1, 'write': 1, 'long': 1, 'medical': 1, 'notes': 1, 'containing': 1, 'nutritional': 1, 'information': 1, 'pregnant': 2, 'mothers': 2, 'reduced': 1, 'short': 1, 'crisp': 1, 'f

In [29]:
max_frequency = max(word_frequencies.values())

In [30]:
max_frequency

3

In [31]:
# Normalizing word frequencies
for word in word_frequencies.keys():
    word_frequencies[word] = word_frequencies[word]/max_frequency

In [32]:
print(word_frequencies)

{'everyday': 0.3333333333333333, 'examples': 0.3333333333333333, 'text': 0.3333333333333333, 'summarization': 0.6666666666666666, 'news': 0.3333333333333333, 'headlines': 0.3333333333333333, 'outlines': 0.3333333333333333, 'students': 0.3333333333333333, 'movie': 0.3333333333333333, 'previews': 0.3333333333333333, 'meeting': 0.3333333333333333, 'minutes': 0.3333333333333333, 'biographies': 0.3333333333333333, 'resumes': 0.3333333333333333, 'obituaries': 0.3333333333333333, 'abridged': 0.3333333333333333, 'versions': 0.3333333333333333, 'books': 0.3333333333333333, 'newsletter': 0.3333333333333333, 'production': 0.3333333333333333, 'financial': 0.3333333333333333, 'research': 0.6666666666666666, 'patent': 0.3333333333333333, 'legal': 0.3333333333333333, 'contract': 0.3333333333333333, 'analysis': 0.3333333333333333, 'tweeting': 0.3333333333333333, 'new': 0.3333333333333333, 'content': 0.3333333333333333, 'chatbots': 0.3333333333333333, 'answer': 0.3333333333333333, 'questions': 0.333333

In [33]:
max(word_frequencies.values())

1.0
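The counting cell above filters on the lowercased token but stores the key in its original case, while the scoring step later looks words up in lowercase, so capitalized entries such as `'Google'` are never matched. The sketch below shows one way to keep the keys case-consistent. It is spaCy-free for brevity, so the regex tokenizer and the tiny `DEMO_STOPWORDS` set are assumptions for illustration only:

```python
import re
from collections import Counter

# Hypothetical, tiny stop-word list used only for this demo
DEMO_STOPWORDS = {"a", "an", "the", "is", "to", "for", "may", "be", "this"}

def normalized_frequencies(text, stopwords=DEMO_STOPWORDS):
    """Count lowercased non-stop-word tokens and scale counts by the maximum count."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    counts = Counter(w for w in words if w not in stopwords)
    if not counts:
        return {}
    top = max(counts.values())
    return {word: count / top for word, count in counts.items()}

demo = "Google Search presents search results. Google may be leveraging a knowledge graph."
freqs = normalized_frequencies(demo)
print(freqs["google"], freqs["search"], freqs["presents"])  # 1.0 1.0 0.5
```

With lowercase keys, `Search` and `search` are counted as one word, and a later lowercase lookup finds every entry.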

### Sentence Tokenization & Sentence score calculation

In [34]:
# Sentence tokenization
sentence_tokens = list(doc.sents)
print(len(sentence_tokens))

7


In [35]:
for sent in sentence_tokens:
    print(sent, end="\n****")


Here are some everyday examples of text summarization: news headlines, outlines for students, movie previews, meeting minutes, biographies for resumes or obituaries, abridged versions of books, newsletter production, financial research, patent research, legal contract analysis, tweeting about new content, chatbots that answer questions, email summaries, and more.


****When Google Search presents search results, some entries are accompanied by auto-generated summaries.
****Google may be leveraging a knowledge graph for this purpose.
****Google's approach to summarization is mainly entity centric.
****Summarization extends to timelines and events about entities.


****Doctors write long medical notes containing nutritional information for pregnant mothers.
****When these were reduced to short crisp summaries, pregnant mothers found them a lot easier to understand.

****

In [36]:
# Calculate sentence score
sentence_scores = {}
for sent in sentence_tokens:
    for word in sent:
        if word.text.lower() in word_frequencies.keys():
            if sent not in sentence_scores.keys():
                sentence_scores[sent] = word_frequencies[word.text.lower()]
            else:
                sentence_scores[sent] += word_frequencies[word.text.lower()]

In [37]:
sentence_scores

{
 Here are some everyday examples of text summarization: news headlines, outlines for students, movie previews, meeting minutes, biographies for resumes or obituaries, abridged versions of books, newsletter production, financial research, patent research, legal contract analysis, tweeting about new content, chatbots that answer questions, email summaries, and more.
 : 13.333333333333336,
 When Google Search presents search results, some entries are accompanied by auto-generated summaries.: 3.6666666666666665,
 Google may be leveraging a knowledge graph for this purpose.: 1.3333333333333333,
 Google's approach to summarization is mainly entity centric.: 1.9999999999999998,
 Summarization extends to timelines and events about entities.
 : 1.9999999999999998,
 Doctors write long medical notes containing nutritional information for pregnant mothers.: 3.666666666666666,
 When these were reduced to short crisp summaries, pregnant mothers found them a lot easier to understand.: 4.66666666666
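The scoring rule above (sum the normalized frequency of every known word in a sentence) can be sketched without spaCy as follows. The function name, the regex tokenizer, and the demo data are assumptions. Scores are keyed by sentence index rather than by span, and the optional `average` flag is a variation that is not part of the notebook's method:

```python
import re

def score_sentences(sentences, word_frequencies, average=False):
    """Score each sentence by summing the frequencies of its known words.

    With average=True the sum is divided by the number of scored words,
    which reduces the advantage of long sentences.
    """
    scores = {}
    for index, sentence in enumerate(sentences):
        words = re.findall(r"[a-z0-9']+", sentence.lower())
        weights = [word_frequencies[w] for w in words if w in word_frequencies]
        if not weights:
            continue  # like the notebook, skip sentences with no scored words
        total = sum(weights)
        scores[index] = total / len(weights) if average else total
    return scores

demo_frequencies = {"google": 1.0, "summaries": 1.0, "search": 0.5, "graph": 0.5}
demo_sentences = [
    "When Google Search presents search results, some entries have summaries.",
    "Google may be leveraging a knowledge graph.",
    "Thanks for reading.",
]
scores = score_sentences(demo_sentences, demo_frequencies)
print(scores)  # {0: 3.0, 1: 1.5}
```

A plain sum favors long sentences, which is consistent with the first sentence receiving by far the highest score in the recorded output.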

### Summarizing Text (extractive)

In [38]:
from heapq import nlargest

In [39]:
# Keep roughly 30% of the sentences; int() truncates toward zero
select_length = int(len(sentence_tokens)*0.3)
select_length

2

In [40]:
summary = nlargest(select_length, sentence_scores, key = sentence_scores.get)
summary

[
 Here are some everyday examples of text summarization: news headlines, outlines for students, movie previews, meeting minutes, biographies for resumes or obituaries, abridged versions of books, newsletter production, financial research, patent research, legal contract analysis, tweeting about new content, chatbots that answer questions, email summaries, and more.
 ,
 When these were reduced to short crisp summaries, pregnant mothers found them a lot easier to understand.]

In [41]:
# Each item in `summary` is a spaCy Span holding one sentence
final_summary = [sent.text for sent in summary]
summary = " ".join(final_summary)

In [42]:
summary = summary.replace("\n","").replace("  ", " ")
print(summary)

Here are some everyday examples of text summarization: news headlines, outlines for students, movie previews, meeting minutes, biographies for resumes or obituaries, abridged versions of books, newsletter production, financial research, patent research, legal contract analysis, tweeting about new content, chatbots that answer questions, email summaries, and more. When these were reduced to short crisp summaries, pregnant mothers found them a lot easier to understand.
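`nlargest` returns sentences ordered by score, which need not match their reading order, and `int(len(...) * 0.3)` evaluates to 0 for texts with fewer than four sentences. A minimal sketch that addresses both, assuming scores are keyed by sentence index; `select_summary` and the demo data are hypothetical:

```python
from heapq import nlargest

def select_summary(sentences, scores, ratio=0.3):
    """Pick the top-scoring sentences and return them in document order."""
    # int() truncates, so guard against selecting zero sentences for short texts
    count = max(1, int(len(sentences) * ratio))
    top_indices = nlargest(count, scores, key=scores.get)
    return " ".join(sentences[i].strip() for i in sorted(top_indices))

demo_sentences = ["First point.", "A minor aside.", "Key finding here.", "Closing remark."]
demo_scores = {0: 2.0, 1: 0.5, 2: 3.0, 3: 1.0}
print(select_summary(demo_sentences, demo_scores, ratio=0.5))
# First point. Key finding here.
```

Sorting the selected indices before joining keeps the summary in the same order as the source text.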


# 2. Abstractive Text Summarization using `pegasus-xsum`
Abstractive summarization generates a summary from the main ideas of a text instead of copying its most salient sentences verbatim.

**Remark:** `pegasus-xsum` is a pretrained Seq2Seq model that specializes in *text summarization*.

### Dependencies

In [27]:
# Checking the highest CUDA version supported by this machine's driver
# https://stackoverflow.com/questions/9727688/how-to-get-the-cuda-version
!nvidia-smi

Sat Apr  8 21:16:46 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 512.95       Driver Version: 512.95       CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA GeForce ... WDDM  | 00000000:01:00.0 Off |                  N/A |
| N/A   52C    P0    12W /  N/A |    101MiB /  4096MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

**The highest CUDA version supported is 11.6**

In [29]:
# Installing PyTorch with CUDA 11.6
# https://pytorch.org/get-started
!pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu116

Looking in indexes: https://download.pytorch.org/whl/cu116
Collecting torch
  Downloading https://download.pytorch.org/whl/cu116/torch-1.13.1%2Bcu116-cp39-cp39-win_amd64.whl (2434.0 MB)
Collecting torchvision
  Downloading https://download.pytorch.org/whl/cu116/torchvision-0.14.1%2Bcu116-cp39-cp39-win_amd64.whl (4.8 MB)
Collecting torchaudio
  Downloading https://download.pytorch.org/whl/cu116/torchaudio-0.13.1%2Bcu116-cp39-cp39-win_amd64.whl (2.3 MB)
Installing collected packages: torch, torchvision, torchaudio
Successfully installed torch-1.13.1+cu116 torchaudio-0.13.1+cu116 torchvision-0.14.1+cu116
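Installing a CUDA build does not by itself guarantee that PyTorch can see the GPU. A minimal check is sketched below. The `pick_device` helper is hypothetical, and it falls back to `"cpu"` when `torch` cannot be imported, so it runs in any environment:

```python
def pick_device():
    """Return "cuda" when PyTorch can use a GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this environment; fall back to CPU
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"

device = pick_device()
print(f"Using device: {device}")
```

If a GPU is reported, the model and the tokenized inputs can be moved to it with `model.to(device)` and `tokens.to(device)` before calling `generate`; otherwise everything runs on the CPU.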


In [32]:
# Install Hugging Face Transformers
!pip install transformers

Collecting transformers
  Downloading transformers-4.27.4-py3-none-any.whl (6.8 MB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.3-cp39-cp39-win_amd64.whl (3.5 MB)
Collecting huggingface-hub<1.0,>=0.11.0
  Downloading huggingface_hub-0.13.4-py3-none-any.whl (200 kB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.13.4 tokenizers-0.13.3 transformers-4.27.4


In [36]:
# PegasusTokenizer requires SentencePiece
!pip install sentencepiece

Collecting sentencepiece
  Downloading sentencepiece-0.1.97-cp39-cp39-win_amd64.whl (1.1 MB)
Installing collected packages: sentencepiece
Successfully installed sentencepiece-0.1.97


### Importing and loading the pre-trained model from the Hugging Face Hub

In [1]:
# Importing dependencies from transformers
# https://huggingface.co/google/pegasus-xsum
from transformers import PegasusForConditionalGeneration, PegasusTokenizer

In [2]:
# Load the pretrained tokenizer published with the model
tokenizer = PegasusTokenizer.from_pretrained("google/pegasus-xsum")

To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to see activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development


In [3]:
# Load model
model = PegasusForConditionalGeneration.from_pretrained("google/pegasus-xsum")

### Summarizing Text (Abstractive)

In [48]:
text = '''We introduce a new language representation model called BERT, which stands for
Bidirectional Encoder Representations from
Transformers. Unlike recent language representation models (Peters et al., 2018a; Radford et al., 2018), BERT is designed to pretrain deep bidirectional representations from
unlabeled text by jointly conditioning on both
left and right context in all layers. As a result, the pre-trained BERT model can be finetuned with just one additional output layer
to create state-of-the-art models for a wide
range of tasks, such as question answering and
language inference, without substantial taskspecific architecture modifications.
BERT is conceptually simple and empirically
powerful. It obtains new state-of-the-art results on eleven natural language processing
tasks, including pushing the GLUE score to
80.5% (7.7% point absolute improvement),
MultiNLI accuracy to 86.7% (4.6% absolute
improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1
(5.1 point absolute improvement).
'''

In [49]:
# Tokenize: convert the raw source text into numeric token IDs
# Input: raw source text
# Output: PyTorch tensors
tokens = tokenizer(text, truncation=True, padding="longest", return_tensors="pt")

In [50]:
tokens

{'input_ids': tensor([[  184,  4094,   114,   177,  1261,  5114,   861,   568,   110, 62613,
           108,   162,  3034,   118,  7671, 37390, 93789, 37955,   116,   135,
         38979,   107,  6857,   909,  1261,  5114,  1581,   143, 15617,   116,
          3256,  2700,   107,   108,   931,   304,   206, 66506,  3256,  2700,
           107,   108, 28505,   108,   110, 62613,   117,   638,   112,  1133,
         20098,  1355, 79050, 13872,   135,  1596, 53541,  1352,   141, 13039,
          5542,   124,   302,   518,   111,   268,  2956,   115,   149,  4427,
           107,   398,   114,   711,   108,   109,  1133,   121, 14787,   110,
         62613,   861,   137,   129,  1226, 37126,   122,   188,   156,   853,
          2940,  2865,   112,   421,   449,   121,  1313,   121,   544,   121,
          3904,  1581,   118,   114,   827,   499,   113,  2722,   108,   253,
           130,   906,  9421,   111,  1261, 40410,   108,   347,  4844,  1778,
          7115,  3105,  9896,   107,  

In [51]:
# Summarize text
summary = model.generate(**tokens)



In [52]:
# Tokenized summary
summary

tensor([[   0,  222,  136,  800,  108,  145,  799,  114,  177,  449,  121, 1313,
          121,  544,  121, 3904,  861,  118, 1355, 1261, 2196,  107,    1]])

In [53]:
# Decoded Summary
tokenizer.decode(summary[0])

'<pad>In this paper, we present a new state-of-the-art model for deep language processing.</s>'
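The decoded string above still contains the `<pad>` and `</s>` special tokens. Passing `skip_special_tokens=True` to the tokenizer's decode methods drops them. The sketch below wraps the tokenize, generate, and decode steps in one function. The `summarize` name is hypothetical, and it expects the `tokenizer` and `model` objects loaded earlier to be passed in:

```python
def summarize(text, tokenizer, model, **generate_kwargs):
    """Tokenize `text`, generate a summary, and decode it without special tokens."""
    tokens = tokenizer(text, truncation=True, padding="longest", return_tensors="pt")
    # Extra keyword arguments are forwarded unchanged to model.generate
    summary_ids = model.generate(**tokens, **generate_kwargs)
    # batch_decode returns one string per generated sequence; a single text gives one
    return tokenizer.batch_decode(summary_ids, skip_special_tokens=True)[0]

# Usage with the objects from the cells above (the beam count is an illustrative value):
# print(summarize(text, tokenizer, model, num_beams=4))
```

Keeping the generation settings as keyword arguments makes it easy to experiment with options such as `num_beams` or `max_new_tokens` without editing the function.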