# MULTI-DOCUMENT TEXT SUMMARIZATION

The goal of text summarization is to use natural language processing to condense a text automatically. A good summary saves time in understanding a single text and makes it practical to get through multiple texts in a short period.


### TYPES OF TEXT SUMMARIZATION

There are two types of Text Summarization: 
1. Extractive Type 
2. Abstractive Type 

Extractive summarization builds the summary from sentences taken verbatim from the original text. Rather than writing new text based on the full content, it rates each sentence in the document against all the others according to how well it represents the content, then keeps the highest-rated sentences.

Abstractive summarization instead generates new text that conveys the most significant points of the original, so the summary may contain sentences that do not appear in the source.


# EXTRACTIVE SUMMARIZATION

In [171]:
# import the required libraries

import os
import nltk
nltk.download('punkt')
nltk.download('stopwords') 
from collections import Counter
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

from heapq import nlargest


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\LENOVO\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\LENOVO\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [172]:
# get input from dataset

src_file = "./dataset/multi_news/test.truncate.fix.pun.src.txt"
with open(src_file, "r", encoding="utf8") as f:
    s = f.readlines()

input = s[:20]

out_file = "./dataset/multi_news/test.txt.tgt.tokenized.fixed.cleaned.final.truncated.txt"
with open(out_file, "r", encoding="utf8") as f:
    s = f.readlines()

output = s[:20]

In [173]:
# preprocessing text (input) data

# str.replace returns a new string, so rebuild each list to keep the cleaned text
input = [text.replace("x91", "'").replace("x92", "'") for text in input]
output = [text.replace("x91", "'").replace("x92", "'") for text in output]
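If the `x91` and `x92` markers stand for the Windows-1252 bytes 0x91 and 0x92 (an assumption; the dataset description here does not say), they correspond to the left and right single curly quotation marks. The sketch below uses a hypothetical helper, `normalize_quotes`, to map those characters to a plain apostrophe. It also shows the list being rebuilt, because Python strings are immutable and `str.replace` returns a new string.

```python
def normalize_quotes(text):
    """Map curly single quotes (and their C1 control forms) to "'"."""
    # Assumed variants: U+0091/U+0092 control characters left by a wrong
    # decoding step, and the Unicode curly quotes U+2018/U+2019.
    for marker in ("\x91", "\x92", "\u2018", "\u2019"):
        text = text.replace(marker, "'")
    return text

# In Windows-1252, bytes 0x91 and 0x92 decode to the curly quotes.
left_quote = b"\x91".decode("cp1252")
right_quote = b"\x92".decode("cp1252")

docs = ["It\u2019s a test", "the \x91best\x92 result"]
# Rebuild the list so the cleaned strings are actually kept.
cleaned = [normalize_quotes(d) for d in docs]
print(cleaned)
```

Whether this mapping is appropriate depends on how the dataset files were encoded, so it is worth inspecting a few raw lines first.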


## FREQUENCY-BASED APPROACH

In [174]:
def generate_summary(text, n):
    # Split into sentences and collect lowercase content words
    # (stop words and non-alphanumeric tokens are dropped)
    sentences = sent_tokenize(text)
    stop_words = set(stopwords.words('english'))
    words = [word.lower() for word in word_tokenize(text) if word.lower() not in stop_words and word.isalnum()]

    # Document-wide frequency of each content word
    word_freq = Counter(words)

    sentence_scores = {}

    for sentence in sentences:
        sentence_words = [word.lower() for word in word_tokenize(sentence) if word.lower() not in stop_words and word.isalnum()]
        # Score a sentence by the summed frequency of its content words
        sentence_score = sum(word_freq[word] for word in sentence_words)
        # Only sentences with fewer than 20 content words are candidates
        if len(sentence_words) < 20:
            sentence_scores[sentence] = sentence_score

    # Select the top n sentences with the highest scores
    summary_sentences = sorted(sentence_scores, key=sentence_scores.get, reverse=True)[:n]
    summary = ' '.join(summary_sentences)

    return summary
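To make the scoring step concrete, here is a minimal standard-library sketch of the same idea on three toy sentences. The regex tokenizer, the tiny `STOP_WORDS` set, and the helper names `content_words` and `score_sentences` are illustrative assumptions; the notebook itself relies on NLTK for tokenization and stop words.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (assumption; the notebook uses NLTK's).
STOP_WORDS = {"the", "a", "is", "on", "and", "of", "it"}

def content_words(text):
    """Lowercased alphanumeric tokens with stop words removed."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOP_WORDS]

def score_sentences(sentences):
    """Score each sentence by the summed document-wide frequency of its words."""
    word_freq = Counter(w for s in sentences for w in content_words(s))
    return {s: sum(word_freq[w] for w in content_words(s)) for s in sentences}

sentences = [
    "The cat sat on the mat.",
    "The cat chased a mouse.",
    "It rained.",
]
scores = score_sentences(sentences)
for sentence, score in scores.items():
    print(score, sentence)
```

Here "cat" occurs twice in the document, so both sentences that mention it outscore the third. Because the score is a raw sum, longer sentences tend to accumulate higher totals, which may be why `generate_summary` only considers sentences with fewer than 20 content words.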

## FREQUENCY BASED USING TF-IDF

In [175]:
def generate_summary_tfidf(text, n):
    # Tokenize the text into individual sentences
    sentences = sent_tokenize(text)

    # Create the TF-IDF matrix
    vectorizer = TfidfVectorizer(stop_words='english')
    tfidf_matrix = vectorizer.fit_transform(sentences)

    # Score every sentence except the last by its cosine similarity to the last sentence
    sentence_scores = cosine_similarity(tfidf_matrix[-1], tfidf_matrix[:-1])[0]

    # Select the indices of the top n sentences with the highest scores
    summary_sentences = nlargest(n, range(len(sentence_scores)), key=sentence_scores.__getitem__)

    # Join the selected sentences in their original document order
    summary_tfidf = ' '.join([sentences[i] for i in sorted(summary_sentences)])

    return summary_tfidf
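`generate_summary_tfidf` ranks sentences by their similarity to the final sentence of the text. One possible alternative, shown here only as a sketch and not as the notebook's method, is to rank each sentence against a vector for the whole document, taken as the sum of all sentence vectors. The code is pure Python so it runs without scikit-learn. The regex tokenizer and the names `tfidf_vectors` and `summarize_by_centroid` are assumptions, no stop words are removed, and the smoothed IDF formula is intended to mirror what `TfidfVectorizer` is understood to use by default.

```python
import math
import re
from collections import Counter

def tokenize(sentence):
    """Lowercase alphanumeric tokens (a simplified tokenizer)."""
    return re.findall(r"[a-z0-9]+", sentence.lower())

def tfidf_vectors(sentences):
    """Return one L2-normalised TF-IDF dict per sentence (smoothed IDF)."""
    tokenized = [tokenize(s) for s in sentences]
    n = len(tokenized)
    # Document frequency: number of sentences that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n) / (1 + d)) + 1 for t, d in df.items()}
    vectors = []
    for tokens in tokenized:
        vec = {t: c * idf[t] for t, c in Counter(tokens).items()}
        norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
        vectors.append({t: v / norm for t, v in vec.items()})
    return vectors

def dot(a, b):
    """Dot product of two sparse vectors stored as dicts."""
    return sum(weight * b.get(term, 0.0) for term, weight in a.items())

def summarize_by_centroid(sentences, n):
    """Keep the n sentences most similar to the summed document vector."""
    vectors = tfidf_vectors(sentences)
    centroid = {}
    for vec in vectors:
        for term, weight in vec.items():
            centroid[term] = centroid.get(term, 0.0) + weight
    centroid_norm = math.sqrt(dot(centroid, centroid)) or 1.0
    # Sentence vectors are unit length (or empty), so cosine = dot / |centroid|.
    scores = [dot(vec, centroid) / centroid_norm for vec in vectors]
    top = sorted(range(len(sentences)), key=scores.__getitem__, reverse=True)[:n]
    # Return the selected sentences in their original order.
    return [sentences[i] for i in sorted(top)]

toy = [
    "apples and oranges are fruit",
    "oranges are sweet fruit",
    "the weather is cold today",
]
picked = summarize_by_centroid(toy, 2)
print(picked)
```

In the toy input the third sentence shares no terms with the others, so it is the one left out. This variant would produce different summaries, and therefore different ROUGE scores, from the function above.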

In [176]:
summary = []
summary_sentences = []
formatted_summary = []

for i in range(0, 20):
    s = generate_summary(input[i], 9)
    f = s.split('. ')
    e = '.\n'.join(f)
    summary.append(s)
    summary_sentences.append(f)
    formatted_summary.append(e)


summary_tfidf = []
summary_sentences_tfidf = []
formatted_summary_tfidf = []

for i in range(0, 20):
    s = generate_summary_tfidf(input[i], 9)
    f = s.split('. ')
    e = '.\n'.join(f)
    summary_tfidf.append(s)
    summary_sentences_tfidf.append(f)
    formatted_summary_tfidf.append(e)

## ROUGE SCORES FOR THESE TWO METHODS OF SUMMARIZATION

In [177]:
!pip install rouge
from rouge import Rouge

def calculate_rouge_score(text, summary):
    # text is the reference summary; summary is the generated (hypothesis) summary
    rouge = Rouge()
    scores = rouge.get_scores(summary, text)
    # The 'f' entry is the F1 score (not recall) for each ROUGE variant
    rouge_1_f1 = scores[0]['rouge-1']['f']
    rouge_2_f1 = scores[0]['rouge-2']['f']
    rouge_l_f1 = scores[0]['rouge-l']['f']
    return rouge_1_f1, rouge_2_f1, rouge_l_f1

Defaulting to user installation because normal site-packages is not writeable
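For intuition about what the `'f'` value reports, the sketch below computes ROUGE-1 precision, recall, and F1 by hand from clipped unigram overlap. It is a simplified illustration with whitespace tokenization and a hypothetical function name `rouge_1`. The `rouge` package applies its own preprocessing and counting, so its numbers may not match this sketch exactly.

```python
from collections import Counter

def rouge_1(hypothesis, reference):
    """Unigram-overlap precision, recall and F1 (simplified ROUGE-1)."""
    hyp_counts = Counter(hypothesis.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clipped overlap: a word counts at most as often as it appears in both.
    overlap = sum((hyp_counts & ref_counts).values())
    precision = overlap / max(sum(hyp_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"p": precision, "r": recall, "f": f1}

same_length = rouge_1("the cat sat on the mat", "the cat lay on the mat")
short_hyp = rouge_1("the cat", "the cat sat on the mat")
print(same_length)
print(short_hyp)
```

In the second call every hypothesis word appears in the reference, so precision is 1.0 while recall is only 2/6. F1 balances the two, which is why it is a reasonable single figure for comparing summaries of different lengths.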


### For Frequency based approach

In [180]:
rouge1, rouge2, rougel = 0, 0, 0

# Accumulate the F1 scores over the 20 documents, then report the averages
for i in range(0, 20):
    rouge_1_f1, rouge_2_f1, rouge_l_f1 = calculate_rouge_score(output[i], formatted_summary[i])
    rouge1 += rouge_1_f1
    rouge2 += rouge_2_f1
    rougel += rouge_l_f1

print("ROUGE Scores for Frequency based approach")
print("ROUGE-1:", rouge1/20)
print("ROUGE-2:", rouge2/20)
print("ROUGE-L:", rougel/20)

ROUGE Scores for Frequency based approach
ROUGE-1: 0.2766996648238301
ROUGE-2: 0.08311613973407764
ROUGE-L: 0.24519889172116605


### For Frequency based using Tf-Idf

In [181]:
rouge1, rouge2, rougel = 0, 0, 0

# Accumulate the F1 scores over the 20 documents, then report the averages
for i in range(0, 20):
    rouge_1_f1, rouge_2_f1, rouge_l_f1 = calculate_rouge_score(output[i], formatted_summary_tfidf[i])
    rouge1 += rouge_1_f1
    rouge2 += rouge_2_f1
    rougel += rouge_l_f1

print("ROUGE Scores for Frequency based Tf-Idf approach")
print("ROUGE-1:", rouge1/20)
print("ROUGE-2:", rouge2/20)
print("ROUGE-L:", rougel/20)

ROUGE Scores for Frequency based Tf-Idf approach
ROUGE-1: 0.29930268602339466
ROUGE-2: 0.10104034882506026
ROUGE-L: 0.26855892342746535
