# <div style="text-align: center; background-color: white; font-family:'Poppins', sans-serif; color: black; padding: 20px; line-height: 2; border-bottom:4px solid black; overflow:hidden"> Text summarization </div>

* Text summarization is a core NLP task. It either generates a short summary of a long article or extracts the most important sentences from the text.
* There are two main types: extractive and abstractive summarization.
* Extractive Summarization:

    * In extractive summarization, the goal is to identify important sentences and copy them directly from the original text. Sentences are ranked by relevance, often using TF-IDF (Term Frequency-Inverse Document Frequency) scores or Word2Vec embeddings, so the summary consists of exact sentences from the source.
    * Limitation: It does not produce new sentences or rephrase ideas. If you need a much shorter text that still carries the full meaning, this method alone may not work well.
    * Strength: It surfaces the most relevant information from large amounts of text, making it more digestible.


* Abstractive Summarization:

    * Abstractive summarization involves generating new sentences that convey the same meaning as the original text, rather than simply extracting parts of it. This method typically uses advanced deep learning models, such as Transformers, to understand and paraphrase the content.
    * Advantage: This method reduces the size of the text while preserving its core meaning, and it can produce more coherent summaries that are often more human-like.
    
    * Methods for Abstractive Summarization:
        * Seq2Seq Models (with Attention Mechanism):
        
            * Traditional Sequence-to-Sequence (Seq2Seq) models with Attention mechanisms are widely used for abstractive summarization. The model is trained to map the input sequence (the original text) to a condensed sequence (the summary). The attention mechanism helps the model focus on the most relevant parts of the text.
        * Transformer Models:
        
            * Transformer-based models now dominate summarization. Generative models such as GPT-3/4 (Generative Pre-trained Transformers) can produce abstractive summaries. BERT (Bidirectional Encoder Representations from Transformers) is an encoder-only model, so on its own it is better suited to scoring or selecting sentences than to generating new text.
            * The T5 (Text-to-Text Transfer Transformer) and BART (Bidirectional and Auto-Regressive Transformers) models are encoder-decoder architectures well suited to text generation tasks like summarization, and they have shown strong performance.
        * Pre-trained Models:
        
            * Instead of training from scratch, you can use pre-trained models for summarization. Popular pre-trained models include BART, T5, and PEGASUS, which are optimized for tasks like summarization and text generation.
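
To make the attention idea from the Seq2Seq item above more concrete, here is a minimal NumPy sketch of dot-product attention for a single decoder step. The function name `attention_context` and the toy vectors are assumptions for illustration, not part of any library; a real Seq2Seq model learns these vectors during training.

```python
import numpy as np

def attention_context(query, encoder_states):
    """Return (weights, context) for one decoder step using dot-product attention.

    query:          shape (d,), a hypothetical decoder state
    encoder_states: shape (n, d), one hypothetical vector per input token
    """
    scores = encoder_states @ query                   # similarity of each input token to the query
    scores = scores - scores.max()                    # stabilize the softmax numerically
    weights = np.exp(scores) / np.exp(scores).sum()   # softmax: weights sum to 1
    context = weights @ encoder_states                # weighted average of the input vectors
    return weights, context

# Toy example: three input tokens with 2-dimensional states
encoder_states = np.array([[1.0, 0.0],
                           [0.0, 1.0],
                           [1.0, 1.0]])
query = np.array([2.0, 0.0])

weights, context = attention_context(query, encoder_states)
print(weights)   # tokens 0 and 2 align with the query and get the largest weights
print(context)
```

The weights act as a soft selection over the input: tokens whose vectors point in the same direction as the query contribute most to the context vector.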

<div style="text-align: center; background-color: white; font-family: 'Poppins', sans-serif; color: black; padding: 20px;font-size: 24px; line-height: 2; overflow:hidden"> Extractive Summarization </div>

### 1. Text Summarization with TF-IDF (Extractive Summarization)
The idea of extractive summarization is to extract the most important sentences from the text. TF-IDF identifies important terms in a document based on how often they occur in it and how rare they are across the other documents. In this example, each sentence is treated as one document.

* We'll follow these steps:

    * Preprocess the text.
    * Calculate the TF-IDF for each word.
    * Use the TF-IDF scores to rank and extract the most relevant sentences.
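
Before turning to scikit-learn, the following sketch computes TF-IDF by hand using the textbook definition tf × log(N / df). The helper `tf_idf` and the three toy documents are assumptions for illustration. `TfidfVectorizer` applies a smoothed IDF and L2 normalization by default, so its numbers will differ from these.

```python
import math

def tf_idf(term, doc_tokens, corpus_tokens):
    """Textbook TF-IDF: term frequency in one document times log(N / document frequency)."""
    tf = doc_tokens.count(term) / len(doc_tokens)
    df = sum(1 for tokens in corpus_tokens if term in tokens)
    if df == 0:
        return 0.0
    return tf * math.log(len(corpus_tokens) / df)

corpus = [
    "nlp helps computers understand language".split(),
    "nlp powers chatbots".split(),
    "nlp enables translation".split(),
]

# "nlp" occurs in every document, so its IDF is log(3/3) = 0
print(tf_idf("nlp", corpus[1], corpus))
# "chatbots" occurs in only one document, so it scores higher
print(tf_idf("chatbots", corpus[1], corpus))
```

A term that appears everywhere carries no weight, while a term concentrated in one document stands out. That is the property the sentence ranking relies on.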

In [5]:
import string
import re
from sklearn.feature_extraction.text import TfidfVectorizer
import nltk
from nltk.corpus import stopwords

In [6]:
# Sample text
sample_text = """
Natural language processing (NLP) is a field of artificial intelligence (AI) that enables computers to understand,
interpret, and generate human language. It is an interdisciplinary field that draws from linguistics, computer science, 
and artificial intelligence. NLP is used in various applications, including machine translation, sentiment analysis, 
chatbots, and text summarization. NLP helps in extracting meaningful information from unstructured text data.
"""

In [3]:
# Preprocess the text
import nltk
from nltk.corpus import stopwords
import string

nltk.download('stopwords')

stop_words = set(stopwords.words('english'))

# Remove "not" from stopwords
stop_words.discard('not')

punctuations = string.punctuation


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\test\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [4]:
nltk.download('punkt')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\test\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\test\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [12]:
from nltk.tokenize import sent_tokenize, word_tokenize
import string

def preprocess(text):
    words = word_tokenize(text.lower())   # Word tokenization
    words = [w for w in words if w not in stop_words and w not in string.punctuation]
    return " ".join(words)

# Sentence tokenization
sentences = sent_tokenize(sample_text)

# Preprocess each sentence
processed_sentences = [preprocess(sentence) for sentence in sentences]


In [13]:
sentences

['\nNatural language processing (NLP) is a field of artificial intelligence (AI) that enables computers to understand,\ninterpret, and generate human language.',
 'It is an interdisciplinary field that draws from linguistics, computer science, \nand artificial intelligence.',
 'NLP is used in various applications, including machine translation, sentiment analysis, \nchatbots, and text summarization.',
 'NLP helps in extracting meaningful information from unstructured text data.']

In [14]:
processed_sentences

['natural language processing nlp field artificial intelligence ai enables computers understand interpret generate human language',
 'interdisciplinary field draws linguistics computer science artificial intelligence',
 'nlp used various applications including machine translation sentiment analysis chatbots text summarization',
 'nlp helps extracting meaningful information unstructured text data']

In [15]:
# TF-IDF Vectorization
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(processed_sentences)

In [16]:
# Sentence scoring based on TF-IDF
sentence_scores = tfidf_matrix.sum(axis=1).A1  # Get the sum of the TF-IDF scores for each sentence
ranked_sentences = [(score, sentence) for score, sentence in zip(sentence_scores, sentences)]
ranked_sentences = sorted(ranked_sentences, reverse=True, key=lambda x: x[0])

In [17]:
ranked_sentences

[(3.5833285240010877,
  '\nNatural language processing (NLP) is a field of artificial intelligence (AI) that enables computers to understand,\ninterpret, and generate human language.'),
 (3.440746222419363,
  'NLP is used in various applications, including machine translation, sentiment analysis, \nchatbots, and text summarization.'),
 (2.8110807898436923,
  'It is an interdisciplinary field that draws from linguistics, computer science, \nand artificial intelligence.'),
 (2.8012310403735534,
  'NLP helps in extracting meaningful information from unstructured text data.')]

In [None]:
# tfidf_matrix is a TF-IDF matrix, where:
# Rows represent sentences.
# Columns represent words (features).
# The values in the matrix are TF-IDF scores for each word in a given sentence.
# .sum(axis=1) → Sums up all the TF-IDF scores for each sentence.
# .A1 → Converts the result from a NumPy matrix to a 1D NumPy array.
# Now, sentence_scores contains a single score for each sentence, representing its importance based on TF-IDF


# ranked_sentences is a list where each sentence is associated with its TF-IDF-based score.
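
Summing TF-IDF weights tends to favor longer sentences, because they contain more nonzero terms. One possible alternative, sketched below, divides each row sum by the number of terms present in the sentence. The names `toy_sentences`, `sum_scores`, and `mean_scores` are assumptions for illustration and are not part of the cells above.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

toy_sentences = [
    "nlp field artificial intelligence enables computers understand language",
    "nlp helps extracting information",
]

toy_matrix = TfidfVectorizer().fit_transform(toy_sentences)

# Row sums, as in the scoring step above
sum_scores = np.asarray(toy_matrix.sum(axis=1)).ravel()

# Number of stored (nonzero) terms in each row of the CSR matrix
terms_per_sentence = np.diff(toy_matrix.indptr)

# Average TF-IDF weight per term, which removes most of the length advantage
mean_scores = sum_scores / np.maximum(terms_per_sentence, 1)

print(sum_scores)    # the longer sentence has the larger sum
print(mean_scores)   # the shorter sentence has the larger per-term mean
```

Which variant is preferable depends on the text: the sum rewards coverage, while the mean rewards sentences made up of distinctive terms.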

In [18]:
# Extract top 2 sentences as summary
num_sentences = 2
summary = [sentence for _, sentence in ranked_sentences[:num_sentences]]
# Extracts only the sentence text (ignoring the score).
# The _ is a placeholder for the TF-IDF score, which we don't need in the summary.

In [19]:
# Print the summary
print("\nSummary:")
print(" ".join(summary))


Summary:

Natural language processing (NLP) is a field of artificial intelligence (AI) that enables computers to understand,
interpret, and generate human language. NLP is used in various applications, including machine translation, sentiment analysis, 
chatbots, and text summarization.
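
The code above joins the selected sentences in score order, which happens to match the source order here. In general, a summary usually reads better when the chosen sentences keep their original sequence. A minimal sketch, where `top_k_in_document_order` is a hypothetical helper and the demo data is made up:

```python
def top_k_in_document_order(scores, sentences, k):
    """Pick the k highest-scoring sentences, then return them in source order."""
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:k])   # restore original positions
    return [sentences[i].strip() for i in keep]

demo_sentences = ["First point.", "Minor detail.", "Key finding.", "Closing remark."]
demo_scores = [2.0, 0.5, 3.0, 1.0]

print(" ".join(top_k_in_document_order(demo_scores, demo_sentences, k=2)))
# prints: First point. Key finding.
```

The `.strip()` call also removes the leading newline that `sent_tokenize` keeps on the first sentence of a triple-quoted string.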


### 2. Text Summarization using Word2Vec (Advanced)
While TF-IDF ranks sentences by word frequencies, Word2Vec represents each word as a dense vector in a continuous space, so words that occur in similar contexts end up with similar vectors. Sentence vectors built from these embeddings can capture similarity in meaning that exact word matching misses, which can improve summarization.

* Steps to use Word2Vec for text summarization:
    * Preprocess the text (same as TF-IDF).
    * Train a Word2Vec model using the preprocessed text.
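    * Build a vector for each sentence by averaging its word vectors, then compute the cosine similarity between every pair of sentences.
    * Score each sentence by its average similarity to the others, rank the sentences by score, and extract the most relevant ones.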

In [1]:
import gensim
import string
import numpy as np
import re
from nltk.corpus import stopwords
from sklearn.metrics.pairwise import cosine_similarity

If the imports above raise an error, set up a compatible environment:
* Create a compatible environment: conda create -n nlp_env python=3.10
* Activate it: conda activate nlp_env
* Install required packages:
    * pip install numpy==1.26.4
    * pip install scipy==1.10.
    * pip install gensim==4.3.2
    * pip install nltk
    * pip install scikit-learn


* Install a Jupyter kernel for this environment:
    * pip install ipykernel
    * python -m ipykernel install --user --name nlp_env --display-name "Python (nlp_env)"
    * conda install notebook
* Launch Jupyter from this environment: jupyter notebook
* Select the correct kernel: Kernel → Change Kernel → Python (nlp_env)


In [2]:
# Sample text
sample_text = """
Natural language processing (NLP) is a field of artificial intelligence (AI) that enables computers to understand,
interpret, and generate human language. It is an interdisciplinary field that draws from linguistics, computer science, 
and artificial intelligence. NLP is used in various applications, including machine translation, sentiment analysis, 
chatbots, and text summarization. NLP helps in extracting meaningful information from unstructured text data.
"""

In [5]:
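from nltk.tokenize import sent_tokenize, word_tokenize
import string

# Rebuild the stopword set so this section also runs in a fresh kernel
stop_words = set(stopwords.words('english'))
stop_words.discard('not')

def preprocess(text):
    words = word_tokenize(text.lower())   # Word tokenization
    words = [w for w in words if w not in stop_words and w not in string.punctuation]
    return words   # Return a token list, which is the input format Word2Vec expects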

# Sentence tokenization
sentences = sent_tokenize(sample_text)

# Preprocess each sentence
processed_sentences = [preprocess(sentence) for sentence in sentences]


In [6]:
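# Train Word2Vec model (sg=0 selects CBOW; min_count=1 keeps every word)
# Note: four sentences are far too few to learn meaningful embeddings, so the vectors
# are largely determined by random initialization. This cell only demonstrates the workflow.
model = gensim.models.Word2Vec(processed_sentences, vector_size=100, window=5, min_count=1, sg=0)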

In [7]:
# Function to get the average sentence vector
def sentence_vector(sentence):
    word_vectors = [model.wv[word] for word in sentence if word in model.wv]
    if word_vectors:
        return np.mean(word_vectors, axis=0)
    else:
        return np.zeros(model.vector_size)


# If there are valid word vectors, compute their mean (average) along axis=0. This results in a single vector representing the sentence
# If none of the words in the sentence exist in the model, return a zero vector with the same size as the model's word embeddings.
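
As a small illustration of the averaging step, the sketch below uses a hand-written dictionary in place of a trained model. The dictionary `toy_vectors` and the function `average_vector` are assumptions for illustration only. They mirror the logic of `sentence_vector` without needing gensim.

```python
import numpy as np

toy_vectors = {
    "nlp": np.array([1.0, 0.0]),
    "text": np.array([0.0, 1.0]),
}

def average_vector(tokens, vectors, size=2):
    """Average the vectors of known tokens; fall back to zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    return np.mean(known, axis=0) if known else np.zeros(size)

print(average_vector(["nlp", "text", "unknown"], toy_vectors))  # [0.5 0.5]
print(average_vector(["unknown"], toy_vectors))                 # [0. 0.]
```

Unknown tokens are skipped rather than counted as zeros, so they do not pull the average toward the origin.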


In [8]:
# Create sentence vectors
sentence_vectors = [sentence_vector(sentence) for sentence in processed_sentences]

In [9]:
# Compute similarity matrix
similarity_matrix = cosine_similarity(sentence_vectors)

# Calculates cosine similarity between every pair of sentence vectors

In [10]:
similarity_matrix

array([[ 1.0000001 ,  0.27345452,  0.2209907 ,  0.18821998],
       [ 0.27345452,  0.99999994,  0.11778174, -0.0738035 ],
       [ 0.2209907 ,  0.11778174,  1.        ,  0.2206038 ],
       [ 0.18821998, -0.0738035 ,  0.2206038 ,  1.        ]],
      dtype=float32)
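
For reference, the cosine similarity of two vectors is their dot product divided by the product of their norms. The sketch below implements this directly in NumPy and compares it with scikit-learn on toy vectors. The function name `cosine` and the vectors are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def cosine(a, b):
    """Cosine of the angle between vectors a and b."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 0.0])
b = np.array([1.0, 1.0])
c = np.array([-1.0, 0.0])

print(cosine(a, b))   # about 0.707: the vectors are 45 degrees apart
print(cosine(a, c))   # -1.0: the vectors point in opposite directions

# The hand-written function agrees with scikit-learn's pairwise matrix
matrix = cosine_similarity(np.vstack([a, b, c]))
print(np.isclose(matrix[0, 1], cosine(a, b)))
```

Values range from -1 to 1, which is why the similarity matrix above can contain small negative entries.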

In [11]:
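# Compute the average similarity of each sentence to all sentences.
# The diagonal (self-similarity of about 1) is included; it shifts every score by
# roughly the same amount, so it does not affect the ranking.
sentence_scores = similarity_matrix.mean(axis=1)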

In [12]:
ranked_sentences = [(score, sentence) for score, sentence in zip(sentence_scores, sentences)]
ranked_sentences = sorted(ranked_sentences, reverse=True, key=lambda x: x[0])

In [13]:
# Extract top 2 sentences for the summary
top_n = 2
summary = [sentence for _, sentence in ranked_sentences[:top_n]]

In [14]:
# Print the summary
print("\nSummary:")
print(" ".join(summary))


Summary:

Natural language processing (NLP) is a field of artificial intelligence (AI) that enables computers to understand,
interpret, and generate human language. NLP is used in various applications, including machine translation, sentiment analysis, 
chatbots, and text summarization.
