# **Text Summarization using NLP**


**What is text summarization?**

Text summarization is the process of distilling the most important information from a source text.

**Why automatic text summarization?**



1.   Summaries reduce reading time.
2.   When researching documents, summaries make the selection process easier.
3.   Automatic summarization improves the effectiveness of indexing.
4.   Automatic summarization algorithms can be less biased than human summarizers.
5.   Personalized summaries are useful in question-answering systems because they provide personalized information.
6.   Using automatic or semi-automatic summarization systems enables commercial abstract services to increase the number of text documents they are able to process.







**How to do text summarization**


*   Text cleaning
*   Sentence tokenization
*   Word tokenization
*   Word-frequency table
*   Summarization
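
The five steps above can be sketched in plain Python. This is a minimal, frequency-based illustration using only the standard library, not a production pipeline: the stop-word list, the regex-based tokenizers, and the function names (`clean_text`, `split_sentences`, `tokenize_words`, `word_frequencies`, `summarize`) are all assumptions made for this example.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real pipelines use a fuller one).
STOP_WORDS = {"a", "an", "and", "are", "as", "by", "for", "in", "is", "it",
              "of", "on", "or", "that", "the", "to", "with"}


def clean_text(text):
    """Text cleaning: collapse runs of whitespace into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


def split_sentences(text):
    """Sentence tokenization: naive split after '.', '!' or '?'."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def tokenize_words(sentence):
    """Word tokenization: lowercase alphabetic tokens."""
    return re.findall(r"[a-z']+", sentence.lower())


def word_frequencies(sentences):
    """Word-frequency table, normalized by the most frequent word."""
    counts = Counter(
        w for s in sentences for w in tokenize_words(s) if w not in STOP_WORDS
    )
    top = max(counts.values(), default=1)
    return {w: c / top for w, c in counts.items()}


def summarize(text, num_sentences=2):
    """Summarization: keep the highest-scoring sentences in original order."""
    sentences = split_sentences(clean_text(text))
    freqs = word_frequencies(sentences)
    scores = [sum(freqs.get(w, 0.0) for w in tokenize_words(s)) for s in sentences]
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:num_sentences])
    return " ".join(sentences[i] for i in keep)


print(summarize("Cats sleep a lot. Cats chase mice and cats climb trees. Dogs bark.", 1))
```

Because a sentence's score is a plain sum of word frequencies, longer sentences are favoured; dividing by sentence length is a common variation.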



  **Text variable**








# **1. Abstractive Summarization**

Abstractive summarization techniques emulate human writing by generating entirely new sentences to convey key concepts from the source text, rather than merely copying portions of it. These new sentences distill the vital information while dropping irrelevant details, and they often use vocabulary that does not appear in the original text. Early abstractive models relied on designs based on recurrent neural networks (RNNs), but Transformers have since come to dominate the natural language processing field.

 **What are Transformers?**

Transformers are a family of models, originally built around an encoder-decoder architecture, that transform an input sequence into an output sequence. They are distinguished by a “self-attention” mechanism, along with other enhancements such as positional encoding. NOTE: Not all Transformers are intended for use in text summarization. Let’s look at PEGASUS, a model that performs well in terms of output quality for text summarization.
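
To make the two ideas named above more concrete, here is a small pure-Python sketch of sinusoidal positional encoding and scaled dot-product self-attention. It is an illustration under simplifying assumptions: queries, keys, and values are all the raw input vectors (a real Transformer applies learned projections and uses multiple heads), and the function names are chosen for this example only.

```python
import math


def positional_encoding(seq_len, d_model):
    """Sinusoidal encoding: sine on even dimensions, cosine on odd dimensions."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Scaled dot-product self-attention with Q = K = V = x (no learned weights)."""
    d = len(x[0])
    outputs, weights = [], []
    for q in x:
        # Similarity of this position's vector to every position, scaled by sqrt(d).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x]
        w = softmax(scores)
        weights.append(w)
        # Each output is a weighted average of all input vectors.
        outputs.append([sum(wj * v[c] for wj, v in zip(w, x)) for c in range(d)])
    return outputs, weights


# Toy "embeddings" for a 3-token sequence, with position information added.
embeddings = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
pe = positional_encoding(len(embeddings), 4)
x = [[e + p for e, p in zip(row, pe_row)] for row, pe_row in zip(embeddings, pe)]
outputs, weights = self_attention(x)
print(weights[0])  # how strongly token 0 attends to each token
```

Each row of `weights` sums to 1, so every output vector is a weighted blend of all input vectors; this is what lets each token take the whole sequence into account.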

PEGASUS shares similarities with other transformer models, with its primary distinction lying in a unique approach used during the model’s pre-training. Specifically, the most crucial sentences in the training text corpora are “masked” (hidden from the model) during PEGASUS pre-training. The model is then tasked with generating these concealed sentences as a single output sequence.
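
The masking idea can be illustrated with a toy data-preparation sketch. This is not the PEGASUS implementation: sentence importance is approximated here by a simple unigram-overlap F1 between a sentence and the rest of the document (a simplified stand-in for a ROUGE-style score), the `MASK` string is a placeholder, and the function names are hypothetical.

```python
import re
from collections import Counter

MASK = "<mask_1>"  # placeholder mask string used only in this sketch


def split_sentences(text):
    """Naive sentence split after '.', '!' or '?'."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def words(sentence):
    """Lowercase alphabetic tokens."""
    return re.findall(r"[a-z']+", sentence.lower())


def overlap_f1(candidate, reference):
    """Unigram-overlap F1 between two token lists."""
    c, r = Counter(candidate), Counter(reference)
    common = sum((c & r).values())
    if common == 0:
        return 0.0
    precision = common / sum(c.values())
    recall = common / sum(r.values())
    return 2 * precision * recall / (precision + recall)


def make_gap_sentence_example(text, num_masked=1):
    """Mask the most 'important' sentences; return (masked source, target)."""
    sentences = split_sentences(text)
    scores = []
    for i, sentence in enumerate(sentences):
        # Score each sentence against all the other sentences in the document.
        rest = [w for j, other in enumerate(sentences) if j != i for w in words(other)]
        scores.append(overlap_f1(words(sentence), rest))
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    masked = set(ranked[:num_masked])
    source = " ".join(MASK if i in masked else s for i, s in enumerate(sentences))
    target = " ".join(sentences[i] for i in sorted(masked))
    return source, target


doc = ("Solar power is cheap. Wind power is clean. "
       "Solar and wind power are both cheap and clean. I like tea.")
source, target = make_gap_sentence_example(doc)
print(source)
print(target)
```

During pre-training the model would see `source` as input and be asked to generate `target`, which closely resembles the summarization task itself.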

In [None]:
from transformers import pipeline
from transformers import PegasusForConditionalGeneration, PegasusTokenizer

# Pick model
model_name = "google/pegasus-xsum"

# Load pretrained tokenizer
pegasus_tokenizer = PegasusTokenizer.from_pretrained(model_name)

# Take user input for the text
example_text = input("Please enter the text to summarize:\n")

print('Original Document Size:', len(example_text))

# Define PEGASUS model
pegasus_model = PegasusForConditionalGeneration.from_pretrained(model_name)

# Create tokens
tokens = pegasus_tokenizer(example_text, truncation=True, padding="longest", return_tensors="pt")

# Generate the summary
encoded_summary = pegasus_model.generate(**tokens)

# Decode the summarized text
decoded_summary = pegasus_tokenizer.decode(encoded_summary[0], skip_special_tokens=True)

# Print the summary
print('Decoded Summary:', decoded_summary)

# Alternatively, use the summarization pipeline, reusing the model and
# tokenizer loaded above so the weights are not loaded a second time
summarizer = pipeline(
    "summarization",
    model=pegasus_model,
    tokenizer=pegasus_tokenizer,
    framework="pt"
)

# Generate summary using the pipeline
summary = summarizer(example_text, min_length=30, max_length=150)
print('Pipeline Summary:', summary[0]["summary_text"])


Please enter the text to summarize:
Deep learning (also known as deep structured learning) is part of a broader family of machine learning methods based on artificial neural networks with representation learning.  Learning can be supervised, semi-supervised or unsupervised. Deep-learning architectures such as  deep neural networks, deep belief networks, deep reinforcement learning,  recurrent neural networks and convolutional neural networks have been applied to  fields including computer vision, speech recognition, natural language processing, machine translation, bioinformatics, drug design, medical image analysis,  material inspection and board game programs, where they have produced results  comparable to and in some cases surpassing human expert performance.  Artificial neural networks (ANNs) were inspired by information processing and  distributed communication nodes in biological systems. ANNs have various differences  from biological brains. Specifically, neural networks tend t

Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-xsum and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Decoded Summary: Deep learning is a branch of computer science that deals with the study and training of machine learning.


Pipeline Summary: Deep learning is a branch of computer science which deals with the study and training of complex systems such as speech recognition, natural language processing, machine translation and medical image analysis. Deep-learning architectures such as deep neural networks, deep belief networks, deep reinforcement learning, recurrent neural networks and neuralal networks have been applied to fields including computer vision, speech recognition, natural language processing, machine translation, bioinformatics, drug design, medical image analysis, material inspection and board game programs, where they have produced results comparable to and in some cases surpassing human expert performance.


# **Input Passage:**

Deep learning (also known as deep structured learning) is part of a broader family of machine learning methods based on artificial neural networks with representation learning.  Learning can be supervised, semi-supervised or unsupervised. Deep-learning architectures such as  deep neural networks, deep belief networks, deep reinforcement learning,  recurrent neural networks and convolutional neural networks have been applied to  fields including computer vision, speech recognition, natural language processing, machine translation, bioinformatics, drug design, medical image analysis,  material inspection and board game programs, where they have produced results  comparable to and in some cases surpassing human expert performance.  Artificial neural networks (ANNs) were inspired by information processing and  distributed communication nodes in biological systems. ANNs have various differences  from biological brains. Specifically, neural networks tend to be static and symbolic, while the biological brain of most living organisms is dynamic (plastic) and analogue. The adjective "deep" in deep learning refers to the use of multiple layers in the network. Early work showed that a linear perceptron cannot be a universal classifier,  but that a network with a nonpolynomial activation function with one hidden layer of  unbounded width can. Deep learning is a modern variation which is concerned with an  unbounded number of layers of bounded size, which permits practical application and  optimized implementation, while retaining theoretical universality under mild conditions.  In deep learning the layers are also permitted to be heterogeneous and to deviate widely  from biologically informed connectionist models, for the sake of efficiency, trainability  and understandability, whence the structured part.

# **Output (Summary):**

 Deep learning is a branch of computer science which deals with the study and training of complex systems such as speech recognition, natural language processing, machine translation and medical image analysis. Deep-learning architectures such as deep neural networks, deep belief networks, deep reinforcement learning, recurrent neural networks and neuralal networks have been applied to fields including computer vision, speech recognition, natural language processing, machine translation, bioinformatics, drug design, medical image analysis, material inspection and board game programs, where they have produced results comparable to and in some cases surpassing human expert performance.

# **2. Extractive Summarization**

Extractive summarization is a technique in natural language processing (NLP) that involves selecting the most important sentences or phrases directly from a document to create a summary. Unlike abstractive summarization, which generates new sentences, extractive summarization identifies and extracts key parts of the original text. It typically uses algorithms that rank sentences based on factors such as term frequency, importance, or similarity to the central theme. Techniques like TextRank or pre-trained models such as BERT can be used to rank sentences and select the most relevant ones for the summary, preserving the original meaning and context of the source document.
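
The graph-ranking idea behind TextRank can be sketched in a few lines of standard-library Python. This is a simplified illustration rather than a reference implementation: sentences are split with a naive regex, similarity is word overlap normalized by the log of each sentence's distinct-word count, and the scores come from a fixed number of PageRank-style iterations. The function names and default parameters are assumptions for this example.

```python
import math
import re


def split_sentences(text):
    """Naive sentence split after '.', '!' or '?'."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def word_set(sentence):
    """Set of lowercase alphabetic tokens in a sentence."""
    return set(re.findall(r"[a-z']+", sentence.lower()))


def similarity(a, b):
    """Shared words, normalized by the log sizes of both sentences."""
    wa, wb = word_set(a), word_set(b)
    if not wa or not wb:
        return 0.0
    denom = math.log(len(wa)) + math.log(len(wb))
    return len(wa & wb) / denom if denom > 0 else 0.0


def textrank_summary(text, num_sentences=2, damping=0.85, iterations=50):
    """Rank sentences on a similarity graph and return the top ones in order."""
    sentences = split_sentences(text)
    n = len(sentences)
    if n == 0:
        return ""
    # Weighted, undirected graph: edge weight = sentence similarity.
    sim = [[similarity(sentences[i], sentences[j]) if i != j else 0.0
            for j in range(n)] for i in range(n)]
    out_weight = [sum(row) for row in sim]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            # Each neighbour j passes on its score in proportion to edge weight.
            rank = sum(sim[j][i] / out_weight[j] * scores[j]
                       for j in range(n) if out_weight[j] > 0)
            new_scores.append((1 - damping) / n + damping * rank)
        scores = new_scores
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:num_sentences])
    return " ".join(sentences[i] for i in keep)


doc = ("Solar power is cheap. Wind power is clean. "
       "Solar and wind power are both cheap and clean. I like tea.")
print(textrank_summary(doc, num_sentences=1))
```

Every sentence in the result is copied verbatim from the input, which is the defining property of extractive summarization. A BERT-based approach replaces the word-overlap similarity with comparisons between learned sentence embeddings.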

In [2]:
!pip install bert-extractive-summarizer
from summarizer import Summarizer

# Take user input for the text
example_text = input("Please enter the text to summarize:\n")

# Load the BERT-based summarizer
bert_summarizer = Summarizer()

# Generate the summary using BERT
summary = bert_summarizer(example_text)

# Print the summary
print('Extractive Summary:', summary)


Please enter the text to summarize:
Climate change refers to long-term changes in temperatures and weather patterns. These shifts may be natural, such as through variations in the solar cycle. But since the 1800s, human activities have been the main driver of climate change, primarily due to burning fossil fuels like coal, oil, and gas. Burning these materials releases what are called greenhouse gases, which act like a blanket wrapped around the Earth, trapping the sun's heat and raising temperatures. Some consequences of climate change include rising sea levels, extreme weather events, and loss of biodiversity. Reducing emissions and transitioning to renewable energy are key steps tow

Extractive Summary: Climate change refers to long-term changes in temperatures and weather patterns. But since the 1800s, human activities have been the main driver of climate change, primarily due to burning fossil fuels like coal, oil, and gas.


# **Input Passage:**
Climate change refers to long-term changes in temperatures and weather patterns. These shifts may be natural, such as through variations in the solar cycle. But since the 1800s, human activities have been the main driver of climate change, primarily due to burning fossil fuels like coal, oil, and gas. Burning these materials releases what are called greenhouse gases, which act like a blanket wrapped around the Earth, trapping the sun's heat and raising temperatures. Some consequences of climate change include rising sea levels, extreme weather events, and loss of biodiversity. Reducing emissions and transitioning to renewable energy are key steps toward mitigating these impacts.



# **Output (Summary):**

Climate change refers to long-term changes in temperatures and weather patterns. But since the 1800s, human activities have been the main driver of climate change, primarily due to burning fossil fuels like coal, oil, and gas.