# Text Summarization in Natural Language Processing (NLP)

Text Summarization is the process of distilling the most important information from a source (or sources) to produce an abridged version for particular users (or uses) and tasks.
It involves understanding the context, semantics, and the overall meaning of the text.

## Techniques

There are mainly two types of techniques used in text summarization:

1. **Extractive Summarization**: 
+ Relies on simple, traditional algorithms.
+ It is like highlighting the key points in a paragraph. It preserves the original wording and meaning.
+ This technique selects key phrases or sentences from the source material, based on their significance and relevance, and combines them to make a summary.
+ The disadvantage of this method is that the selected sentences may read out of context, because they are lifted directly from the source text.

2. **Abstractive Summarization**: 
+ Semantics-driven and more time-consuming than extractive summarization.
+ This technique involves understanding the context and **semantics** of the text and rephrasing it to create a summary. 
+ It's more complex and closer to how humans summarize text.
+ It typically relies on deep learning (DL) models such as Transformers.
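
The extractive approach can be illustrated with a minimal sketch that scores each sentence by the average frequency of its words and keeps the highest-scoring ones. This is only one simple scoring scheme, not the method used later in this notebook; the function names, the regex-based sentence splitter, and the tiny stop-word list are all assumptions made for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; a real pipeline would use a fuller one).
STOP_WORDS = {"the", "a", "an", "of", "in", "at", "is", "are", "to", "and", "it", "on"}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Lowercase word tokens with stop words removed."""
    return [w for w in re.findall(r"[a-z0-9']+", sentence.lower()) if w not in STOP_WORDS]


def extractive_summary(text, num_sentences=2):
    """Return the top-scoring sentences, kept in their original order."""
    sentences = split_sentences(text)
    freq = Counter(w for s in sentences for w in tokenize(s))

    def score(sentence):
        tokens = tokenize(sentence)
        # Average frequency, so long sentences are not favoured automatically.
        return sum(freq[w] for w in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:num_sentences])
    return [sentences[i] for i in chosen]


sample = (
    "Bank of China is one of two frontrunners in the race to list overseas. "
    "The other is China Construction Bank. "
    "Both are expected to list abroad during 2005."
)
summary = extractive_summary(sample, num_sentences=1)
print(summary)
```

With `num_sentences=1` the sketch selects "The other is China Construction Bank.", which also shows the disadvantage noted above: once lifted from its context, "The other" no longer refers to anything.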

## Models
There are several models used for text summarization. Some of them are:

1. **Seq2Seq Model**: This model is used in abstractive summarization. It uses two Recurrent Neural Networks (RNNs), one to read the input and one to generate the output.
2. **BERT (Bidirectional Encoder Representations from Transformers)**: BERT is a transformer-based machine learning technique for NLP pre-training. It can be used for extractive summarization.
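3. **T5 (Text-To-Text Transfer Transformer)**: T5 is an encoder-decoder transformer that frames every task as text-to-text generation. It is typically used for abstractive summarization, although it can also be fine-tuned to produce extractive output.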

![Image](/Users/daver/Desktop/NLP_Lab_Exam_Codes/resources/text_summ.png)

# Pipeline
1. Perform Cleaning and Preprocessing
1. Generate representations using the following text representation and word embedding techniques:
    + Bag of Words
    + TF-IDF
    + Continuous Bag of Words (CBOW, a Word2Vec architecture)
    + Skip-gram (a Word2Vec architecture)
    + Word2Vec
    + GloVe
    + FastText
1. Compute the distance between each pair of sentences, and evaluate whether semantically similar sentences lie closer to each other.
1. Look up other models online and understand how they work.
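
As a rough sketch of the representation and distance steps, the following builds TF-IDF vectors by hand and compares sentences with cosine similarity (distance is then `1 - similarity`). It uses only the standard library so the arithmetic stays visible; the function names are hypothetical, the smoothed IDF formula is one common variant chosen here as an assumption, and the three token lists are toy inputs.

```python
import math
from collections import Counter


def tfidf_vectors(tokenized_sentences):
    """Build one TF-IDF vector (stored as a dict) per tokenized sentence."""
    n = len(tokenized_sentences)
    # Document frequency: in how many sentences each term appears.
    df = Counter(term for tokens in tokenized_sentences for term in set(tokens))
    vectors = []
    for tokens in tokenized_sentences:
        tf = Counter(tokens)
        # Smoothed IDF, log((1 + n) / (1 + df)) + 1, so shared terms keep a small weight.
        vectors.append({t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()})
    return vectors


def cosine_sim(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy preprocessed sentences (already lowercased and stripped of stop words).
docs = [
    "bank china one two frontrunners race list overseas".split(),
    "china construction bank".split(),
    "expected list abroad 2005".split(),
]
v0, v1, v2 = tfidf_vectors(docs)

sim_01 = cosine_sim(v0, v1)  # share "bank" and "china"
sim_02 = cosine_sim(v0, v2)  # share only "list"
sim_12 = cosine_sim(v1, v2)  # share nothing
print(sim_01, sim_02, sim_12)
```

Sentences that share more terms end up closer together, and sentences with no terms in common have similarity 0. That is the main limitation of count-based representations: two sentences with the same meaning but different vocabulary still look unrelated.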

# Theory
1. How do we identify informative sentences?
1. Explain the impact of embedding techniques in identifying the distance between sentences.
1. Which Embedding technique generates contextual representations? Justify.
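
For the last question, it helps to see what "non-contextual" means in practice. The techniques listed in the pipeline assign one fixed vector per word (or one count-based vector per document), whatever the surrounding words are; contextual representations come from models such as ELMo or BERT. The sketch below uses invented 3-dimensional vectors, and `toy_contextual_vector` is only a crude stand-in that blends a word with its sentence mean. It is not how BERT computes representations.

```python
import numpy as np

# Hypothetical 3-dimensional static embeddings (values invented for illustration).
STATIC = {
    "river": np.array([0.9, 0.1, 0.0]),
    "bank": np.array([0.5, 0.5, 0.0]),
    "loan": np.array([0.0, 0.2, 0.9]),
}


def static_vector(word, sentence):
    """A static model ignores the sentence: one vector per word type."""
    return STATIC[word]


def toy_contextual_vector(word, sentence, mix=0.5):
    """Crude stand-in for a contextual model: blend the word with its sentence mean."""
    context = np.mean([STATIC[w] for w in sentence], axis=0)
    return (1 - mix) * STATIC[word] + mix * context


s1 = ["river", "bank"]
s2 = ["bank", "loan"]

# Does "bank" get the same vector in both sentences?
same_static = bool(np.allclose(static_vector("bank", s1), static_vector("bank", s2)))
same_contextual = bool(
    np.allclose(toy_contextual_vector("bank", s1), toy_contextual_vector("bank", s2))
)
print(same_static, same_contextual)
```

The static lookup returns an identical vector for "bank" in both sentences, while the context-dependent version does not. A useful justification for the theory question is built on exactly this difference.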


# Other Work

In [1]:
import numpy as np
import pandas as pd

import re
import textwrap
import string
import ssl

import spacy

import nltk
from nltk.tokenize import sent_tokenize, word_tokenize 
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

from gensim.models import Word2Vec

from sklearn.metrics.pairwise import cosine_similarity

In [2]:
try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context

In [3]:
# nltk.download()

In [10]:
nltk.download('punkt', download_dir = "/Users/daver/Desktop/NLP_Lab_Exam_Codes/.venv/nltk_data")
nltk.download('stopwords', download_dir = "/Users/daver/Desktop/NLP_Lab_Exam_Codes/.venv/nltk_data")
nltk.download('wordnet', download_dir = "/Users/daver/Desktop/NLP_Lab_Exam_Codes/.venv/nltk_data")

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/daver/Desktop/NLP_Lab_Exam_Codes/.venv/nltk_dat
[nltk_data]     a...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/daver/Desktop/NLP_Lab_Exam_Codes/.venv/nltk_dat
[nltk_data]     a...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/daver/Desktop/NLP_Lab_Exam_Codes/.venv/nltk_dat
[nltk_data]     a...


True

In [5]:
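# Note: this rebinds the name `stopwords` from the NLTK corpus reader to a plain list,
# so running this cell a second time raises AttributeError until the import cell is re-run.
stopwords = stopwords.words('english')
lem = WordNetLemmatizer()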

NUM_CLUSTERS = 15
iterations = 25

In [6]:
text = """Millions go missing at China bank Two senior officials at one of China's top commercial banks have reportedly disappeared after funds worth up to $120m (£64m) went missing. The pair both worked at Bank of China in the northern city of Harbin, the South China Morning Post reported. The latest scandal at Bank of China will do nothing to reassure foreign investors that China's big four banks are ready for international listings. Government policy sees the bank listings as vital economic reforms. Bank of China is one of two frontrunners in the race to list overseas. The other is China Construction Bank. Both are expected to list abroad during 2005. They shared a $45bn state bailout in 2003, to help clean up their balance sheets in preparation for a foreign stock market debut. However, a report in the China-published Economic Observer said on Monday that the two banks may have scrapped plans to list in New York because of the cost of meeting regulatory requirements imposed since the Enron scandal. Bank of China is the country's biggest foreign exchange dealer, while China Construction Bank is the largest deposit holder. China's banking sector is burdened with at least $190bn of bad debt according to official data, though most observers believe the true figure is far higher. Officially, one in five loans is not being repaid. Attempts to strengthen internal controls an tighten lending policies have uncovered a succession of scandals involving embezzlement by bank officials and loans-for-favours. The most high-profile case involved the ex-president of Bank of China, Wang Xuebing, jailed for 12 years in 2003. Although, he committed the offences whilst running Bank of China in New York, Mr.Wang was head of China Construction Bank when the scandal broke. Earlier this month, a China Construction Bank branch manager was jailed for life in a separate case. China's banks used to act as cash offices for state enterprises and did not require checks on credit worthiness. 
The introduction of market reforms has been accompanied by attempts to modernize the banking sector, but links between banks and local government remain strong. Last year, China's premier, Wen Jiabao, targeted bank lending practices in a series of speeches, and regulators ordered all big loans to be scrutinized, in an attempt to cool down irresponsible lending. China's leaders see reforming the top four banks as vital to distribute capital to profitable companies and protect the health of China's economic boom. But two problems persist. First, inefficient state enterprises continue to receive protection from bankruptcy because they employ large numbers of people. Second, many questionable loans come not from the big four, but from smaller banks. Another high-profile financial firm, China Life, is facing shareholder lawsuits and a probe by the US Securities and Exchange Commission following its 2004 New York listing over its failure to disclose accounting irregularities at its parent company."""

In [7]:
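def average_word_vectors(tokens, model, vector_size):
    """Return the mean Word2Vec vector of the in-vocabulary tokens.

    Tokens missing from `model.wv` are skipped; if none are known,
    a zero vector of length `vector_size` is returned.
    """
    vector_sum = np.zeros(vector_size)
    count = 0
    for word in tokens:
        if word in model.wv:
            vector_sum += model.wv[word]
            count += 1
    if count == 0:
        return np.zeros(vector_size)
    return vector_sum / count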
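
The behaviour of the averaging step can be checked without training a model. The sketch below is a compact equivalent that takes the keyed vectors directly, exercised with a stand-in object that imitates only the `.wv` lookup of a trained model. The name `mean_vector`, the stand-in `toy_model`, and its 2-dimensional values are all hypothetical.

```python
import numpy as np
from types import SimpleNamespace


def mean_vector(tokens, keyed_vectors, vector_size):
    """Average the vectors of in-vocabulary tokens; zeros if none are known."""
    known = [keyed_vectors[t] for t in tokens if t in keyed_vectors]
    return np.mean(known, axis=0) if known else np.zeros(vector_size)


# Stand-in for a trained model: only the `.wv` lookup is imitated (invented values).
toy_model = SimpleNamespace(wv={
    "bank": np.array([1.0, 0.0]),
    "china": np.array([0.0, 1.0]),
})

# "harbin" is out of vocabulary, so it is skipped rather than counted as zeros.
vec = mean_vector(["bank", "china", "harbin"], toy_model.wv, vector_size=2)
empty = mean_vector(["harbin"], toy_model.wv, vector_size=2)
print(vec, empty)
```

Skipping unknown words, rather than averaging in zero vectors for them, keeps sentences with many rare words from being pulled towards the origin.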

In [8]:

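def preprocess(text):
    """Tokenize, lowercase, drop stop words and punctuation, lemmatize, and rejoin."""
    punctuation = set(string.punctuation)
    # Lowercase and strip each token produced by the NLTK word tokenizer.
    words = [word.lower().strip() for word in word_tokenize(text)]
    # Drop stop words, then tokens that are a single punctuation character.
    # Membership is tested against a set: `in string.punctuation` would be a substring test.
    words = [word for word in words if word not in stopwords]
    words = [word for word in words if word not in punctuation]
    # Reduce each remaining word to its WordNet lemma (default part of speech: noun).
    words = [lem.lemmatize(word) for word in words]
    return " ".join(words)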

In [11]:
sentences = sent_tokenize(text)
sentence_df = pd.DataFrame(sentences)
sentence_df.columns = ['Original']
sentence_df['Preprocessed'] = sentence_df['Original'].apply(preprocess)
sentence_df

| | Original | Preprocessed |
|---|---|---|
| 0 | Millions go missing at China bank Two senior o... | million go missing china bank two senior offic... |
| 1 | The pair both worked at Bank of China in the n... | pair worked bank china northern city harbin so... |
| 2 | The latest scandal at Bank of China will do no... | latest scandal bank china nothing reassure for... |
| 3 | Government policy sees the bank listings as vi... | government policy see bank listing vital econo... |
| 4 | Bank of China is one of two frontrunners in th... | bank china one two frontrunners race list over... |
| 5 | The other is China Construction Bank. | china construction bank |
| 6 | Both are expected to list abroad during 2005. | expected list abroad 2005 |
| 7 | They shared a $45bn state bailout in 2003, to ... | shared 45bn state bailout 2003 help clean bala... |
| 8 | However, a report in the China-published Econo... | however report china-published economic observ... |
| 9 | Bank of China is the country's biggest foreign... | bank china country 's biggest foreign exchange... |
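
The notebook stops after preprocessing. One plausible next step, offered here as an assumption rather than the intended solution, is to turn each preprocessed sentence into a vector and keep the sentences that are most similar, on average, to all the others. The function name `rank_by_centrality` and the 2-dimensional vectors below are hypothetical; in practice the rows would come from TF-IDF or averaged word vectors.

```python
import numpy as np


def rank_by_centrality(sentence_vectors, top_k=2):
    """Return indices of the top_k sentences most similar, on average, to the others."""
    X = np.asarray(sentence_vectors, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0          # avoid division by zero for all-zero rows
    unit = X / norms
    sim = unit @ unit.T              # pairwise cosine similarities
    np.fill_diagonal(sim, 0.0)       # ignore each sentence's similarity to itself
    scores = sim.sum(axis=1) / max(len(X) - 1, 1)
    top = np.argsort(-scores)[:top_k]
    # Sort the chosen indices so the summary keeps the source order.
    return sorted(top.tolist()), scores


# Hypothetical 2-D sentence vectors: three point roughly the same way, one is an outlier.
vectors = [[1.0, 0.1], [0.9, 0.2], [1.0, 0.0], [0.0, 1.0]]
selected, scores = rank_by_centrality(vectors, top_k=2)
print(selected)
```

The outlier sentence receives the lowest score because it resembles none of the others, so it is the last candidate for the summary.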


In [12]:
# import requests, zipfile, io

# def download_glove():
#     glove_url = "http://nlp.stanford.edu/data/glove.6B.zip"
#     response = requests.get(glove_url)
#     z = zipfile.ZipFile(io.BytesIO(response.content))
#     z.extractall()

# download_glove()

In [13]:
# import requests

# model_url = "https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.en.300.bin.gz"
# model_path = "cc.en.300.bin.gz"

# response = requests.get(model_url)
# with open(model_path, "wb") as file:
#     file.write(response.content)

In [14]:
# import gzip
# import shutil

# with gzip.open('cc.en.300.bin.gz', 'rb') as f_in:
#     with open('cc.en.300.bin', 'wb') as f_out:
#         shutil.copyfileobj(f_in, f_out)

# Links

https://www.analyticsvidhya.com/blog/2021/11/a-beginners-guide-to-understanding-text-summarization-with-nlp/

text summarization model on custom dataset
https://www.geeksforgeeks.org/text-summarization-in-nlp/

https://www.topcoder.com/thrive/articles/text-summarization-in-nlp