<a href="https://colab.research.google.com/github/thakkar-hiren/News-Summarisation/blob/main/5_News_Summary.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text Summarization Using GloVe Word Embeddings and the TextRank Algorithm
This notebook implements extractive text summarization with GloVe (Global Vectors for Word Representation) word embeddings and the TextRank algorithm. It loads a dataset of news articles together with their preprocessed text, then downloads and loads pre-trained GloVe word vectors, which capture semantic relationships between words. Each article is split into sentences, and each sentence is embedded as the mean of the GloVe vectors of its words. Cosine similarity between the sentence embeddings gives a similarity matrix, TextRank ranks the sentences on that matrix, and the top-ranked sentences form the summary.

The summary length is configurable; the notebook generates 3-sentence and 5-sentence summaries. Evaluation functions compute average ROUGE (Recall-Oriented Understudy for Gisting Evaluation) scores against the reference text in the dataset to measure summary quality. The approach suits extractive summarization tasks such as document summarization, text analysis, and information retrieval.

# Mounting Drive

In [None]:
# Mounting Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# Loading Data

In [None]:
# Loading Dataset
import pandas as pd
news_data = pd.read_csv('/content/drive/MyDrive/Datasets/filtered_news_data.csv')
news_data

Unnamed: 0.1,Unnamed: 0,text,ctext,text_len,ctext_len,text_sent_count,text_word_count,ctext_sent_count,ctext_word_count,preprocessed_text,preprocessed_ctext
0,0,The Administration of Union Territory Daman an...,The Daman and Diu administration on Wednesday ...,358,2313,2,62,16,413,the administration of union territory daman an...,the daman and diu administration on wednesday ...
1,2,The Indira Gandhi Institute of Medical Science...,The Indira Gandhi Institute of Medical Science...,398,2112,3,70,18,379,the indira gandhi institute of medical science...,the indira gandhi institute of medical science...
2,4,Hotels in Maharashtra will train their staff t...,Hotels in Mumbai and other Indian cities are t...,366,3249,3,67,8,569,hotels in maharashtra will train their staff t...,hotels in mumbai and other indian cities are t...
3,5,A 32-year-old man on Wednesday was found hangi...,An alleged suspect in a kidnapping case was fo...,347,2247,3,63,23,440,a 32yearold man on wednesday was found hanging...,an alleged suspect in a kidnapping case was fo...
4,6,The Delhi High Court reduced the compensation ...,"In an interesting ruling, the Delhi high court...",361,2367,3,70,11,444,the delhi high court reduced the compensation ...,in an interesting ruling the delhi high court ...
...,...,...,...,...,...,...,...,...,...,...,...
3911,4507,After emerging as the single largest party in ...,As uncertainty loomed large in the BMC polls d...,335,606,3,66,1,115,after emerging as the single largest party in ...,as uncertainty loomed large in the bmc polls d...
3912,4509,Fruit juice concentrate maker Rasna is eyeing ...,"Mumbai, Feb 23 (PTI) Fruit juice concentrate m...",361,1000,3,70,8,200,fruit juice concentrate maker rasna is eyeing ...,mumbai feb 23 pti fruit juice concentrate make...
3913,4510,Former Indian cricketer Sachin Tendulkar atten...,Former cricketer Sachin Tendulkar was spotted ...,362,933,2,63,3,158,former indian cricketer sachin tendulkar atten...,former cricketer sachin tendulkar was spotted ...
3914,4512,The Maharashtra government has initiated an in...,Maharahstra Power Minister Chandrashekhar Bawa...,387,1397,3,63,9,233,the maharashtra government has initiated an in...,maharahstra power minister chandrashekhar bawa...


# Loading Pre-trained Glove Embeddings

In [None]:
!wget https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
!unzip -q glove.6B.zip

--2024-02-19 07:28:12--  https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip’


2024-02-19 07:30:51 (5.17 MB/s) - ‘glove.6B.zip’ saved [862182613/862182613]



In [None]:
# Load pre-trained GloVe word vectors
from gensim.models import KeyedVectors

glove_file_path = '/content/glove.6B.100d.txt'
glove_model = KeyedVectors.load_word2vec_format(glove_file_path, binary=False, no_header=True)
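The `no_header=True` flag is needed because the GloVe text files do not start with the "vocabulary size, vector size" header line that the word2vec text format has; every line is simply a token followed by its space-separated vector components. A minimal sketch of reading that layout by hand, using an invented three-dimensional sample rather than the real file (the `parse_glove_lines` helper and the sample values are illustrative assumptions, not part of the notebook):

```python
def parse_glove_lines(lines):
    """Parse GloVe-style text lines into a {token: vector} dictionary."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        token, values = parts[0], parts[1:]
        vectors[token] = [float(v) for v in values]
    return vectors


# Toy sample in the same layout as glove.6B.*.txt (values are invented)
sample = [
    "the 0.1 0.2 0.3",
    "news -0.4 0.5 0.6",
]
toy_vectors = parse_glove_lines(sample)
print(toy_vectors["news"])      # [-0.4, 0.5, 0.6]
print(len(toy_vectors["the"]))  # 3
```

In practice the gensim loader used above is preferable, since it also builds the vocabulary index that `key_to_index` relies on.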

In [None]:
glove_model.vector_size

100

# Building Model

In [None]:
# Import required libraries

import nltk
import numpy as np
from nltk.tokenize import word_tokenize, sent_tokenize
from sklearn.metrics.pairwise import cosine_similarity

nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [None]:
# Tokenize function
def tokenize_text(text):
    """Split text into sentences, then split each sentence into word tokens."""
    return [word_tokenize(sent) for sent in sent_tokenize(text)]

In [None]:
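# Sentence embedding function
def sentence_embedding(sentence):
    """Return the mean GloVe vector of the in-vocabulary tokens of a tokenized sentence."""
    # Tokens missing from the GloVe vocabulary are skipped
    word_embeddings = [glove_model[word] for word in sentence if word in glove_model.key_to_index]
    if len(word_embeddings) == 0:
        # No known tokens: fall back to a zero vector of the right size
        return np.zeros(glove_model.vector_size)
    return np.mean(word_embeddings, axis=0)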
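To make the averaging step concrete, here is a small sketch that uses a hypothetical three-dimensional vocabulary in place of `glove_model`. The names `toy_vectors` and `toy_sentence_embedding` and all vector values are assumptions made for illustration only.

```python
import numpy as np

# Hypothetical 3-dimensional vocabulary standing in for the GloVe model
toy_vectors = {
    "court": np.array([1.0, 0.0, 2.0]),
    "ruling": np.array([3.0, 2.0, 0.0]),
}

def toy_sentence_embedding(tokens, vectors, dim=3):
    """Average the vectors of known tokens; return a zero vector if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

# "xyzzy" is out of vocabulary and is ignored
emb = toy_sentence_embedding(["court", "ruling", "xyzzy"], toy_vectors)
print(emb)    # [2. 1. 1.]

empty = toy_sentence_embedding(["xyzzy"], toy_vectors)
print(empty)  # [0. 0. 0.]
```

Averaging discards word order and silently drops unknown tokens, so two sentences with the same known words in a different order receive the same embedding.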

In [None]:
# Calculate similarity matrix based on sentence embeddings
def cosine_similarity_matrix(embeddings):
    similarity_matrix = cosine_similarity(embeddings, embeddings)
    return similarity_matrix
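Cosine similarity between two vectors is their dot product divided by the product of their lengths. The sketch below computes the same quantity with plain NumPy on made-up two-dimensional vectors; it illustrates the formula and is not scikit-learn's implementation (`cosine_matrix` is a hypothetical helper name, and mapping all-zero rows to similarity 0 is an assumption of this sketch).

```python
import numpy as np

def cosine_matrix(embeddings):
    """Pairwise cosine similarity; all-zero rows get similarity 0 with everything."""
    embeddings = np.asarray(embeddings, dtype=float)
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    safe_norms = np.where(norms == 0, 1.0, norms)  # avoid division by zero
    unit = embeddings / safe_norms                 # scale every row to unit length
    return unit @ unit.T

vecs = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]])
sim = cosine_matrix(vecs)
print(np.round(sim, 3))
```

The result is symmetric with ones on the diagonal: the first two vectors are orthogonal (similarity 0), and each has similarity of about 0.707 with the third.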

In [None]:
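# TextRank function
def textrank(similarity_matrix):
    """Run a PageRank-style power iteration over the sentence similarity matrix."""
    damping_factor = 0.85
    n_iterations = 500
    n_sentences = similarity_matrix.shape[0]
    ranks = np.ones(n_sentences) / n_sentences  # Start from a uniform rank vector

    # Note: similarity_matrix is used as-is (its rows/columns are not normalised to sum to 1),
    # so the rank magnitudes grow on every iteration and can overflow to inf on longer articles.
    for _ in range(n_iterations):
        ranks = (1 - damping_factor) / n_sentences + damping_factor * similarity_matrix.T.dot(ranks)
    return ranks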
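In the standard PageRank formulation that TextRank builds on, each node's outgoing edge weights are normalised to sum to 1, which keeps the rank vector a probability distribution. The function above multiplies by the raw similarity matrix instead, so its output is an unnormalised score vector. Below is a hedged sketch of a normalised variant. The name `textrank_normalized`, the `tol` early-stopping parameter, the zeroed diagonal and the clipping of negative similarities are illustrative choices, not part of the notebook's pipeline, and the ROUGE scores reported in this notebook were not produced with it.

```python
import numpy as np

def textrank_normalized(similarity_matrix, damping=0.85, n_iterations=100, tol=1e-8):
    """Power iteration on a column-normalised similarity graph (illustrative sketch)."""
    weights = np.array(similarity_matrix, dtype=float)  # work on a copy
    np.fill_diagonal(weights, 0.0)         # ignore self-similarity
    weights = np.clip(weights, 0.0, None)  # treat negative similarities as "no edge"
    n = weights.shape[0]

    col_sums = weights.sum(axis=0)
    safe_sums = np.where(col_sums > 0, col_sums, 1.0)
    # Each column sums to 1; columns without any edge spread their rank uniformly
    transition = np.where(col_sums > 0, weights / safe_sums, 1.0 / n)

    ranks = np.full(n, 1.0 / n)
    for _ in range(n_iterations):
        new_ranks = (1 - damping) / n + damping * (transition @ ranks)
        converged = np.abs(new_ranks - ranks).sum() < tol
        ranks = new_ranks
        if converged:
            break
    return ranks

# Made-up similarity matrix for three sentences
sim = np.array([
    [1.0, 0.9, 0.2],
    [0.9, 1.0, 0.3],
    [0.2, 0.3, 1.0],
])
ranks = textrank_normalized(sim)
print(ranks.sum())            # stays close to 1.0
print(ranks.argsort()[::-1])  # sentence indices, highest rank first
```

Because the ranks stay bounded, the number of iterations no longer affects numerical stability, and the loop can stop as soon as the scores stop changing.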

In [None]:
# Summarization function
def summarize_article(article_text, nos=3):
    """Return an extractive summary built from the `nos` top-ranked sentences."""
    tokenized_sentences = tokenize_text(article_text)
    sentence_embeddings = np.array([sentence_embedding(sentence) for sentence in tokenized_sentences])
    similarity_matrix = cosine_similarity_matrix(sentence_embeddings)
    ranks = textrank(similarity_matrix)

    # Never ask for more sentences than the article contains
    n_summary_sentences = min(nos, len(tokenized_sentences))
    top_sentence_indices = ranks.argsort()[-n_summary_sentences:][::-1]
    # Emit the selected sentences in their original document order
    predicted_summary = " ".join([" ".join(tokenized_sentences[i]) for i in sorted(top_sentence_indices)])

    return predicted_summary
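The selection step picks the highest-scoring sentence indices and then re-sorts them so the summary follows the article's original order. A small sketch with made-up scores shows the difference between the two orderings:

```python
import numpy as np

ranks = np.array([0.30, 0.05, 0.10, 0.35, 0.20])  # made-up scores for 5 sentences
k = 3

top = ranks.argsort()[-k:][::-1]          # k highest-scoring indices, best first
in_document_order = sorted(top.tolist())  # same indices, in article order

print(top.tolist())        # [3, 0, 4]
print(in_document_order)   # [0, 3, 4]
```

Sentence 3 has the highest score, but sentence 0 appears first in the summary because it comes first in the article.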

In [None]:
# Generate a 3-sentence summary for each article
news_data['predictedSummary_3'] = news_data['preprocessed_ctext'].apply(summarize_article)

In [None]:
# Generate a 5-sentence summary for each article
news_data['predictedSummary_5'] = news_data['preprocessed_ctext'].apply(summarize_article, nos=5)

# Model Evaluation

In [None]:
!pip install rouge_score

Collecting rouge_score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: rouge_score
  Building wheel for rouge_score (setup.py) ... done
  Created wheel for rouge_score: filename=rouge_score-0.1.2-py3-none-any.whl size=24933 sha256=74a05a4607490ac239fba286d97f34febda1282ee57f7858634e29a94530bac3
  Stored in directory: /root/.cache/pip/wheels/5f/dd/89/461065a73be61a532ff8599a28e9beef17985c9e9c31e541b4
Successfully built rouge_score
Installing collected packages: rouge_score
Successfully installed rouge_score-0.1.2


In [None]:
# Model evaluation function
from rouge_score import rouge_scorer

def evaluate_summaries_3(news_data):
    scorer = rouge_scorer.RougeScorer(rouge_types=['rouge1', 'rouge2', 'rougeL'])
    rouge_scores = []

    for idx, row in news_data.iterrows():
        scores = scorer.score(target=row['preprocessed_text'], prediction=row['predictedSummary_3'])
        rouge_scores.append(scores)

    avg_rouge1_precision = np.mean([score['rouge1'].precision for score in rouge_scores])
    avg_rouge1_recall = np.mean([score['rouge1'].recall for score in rouge_scores])
    avg_rouge1_f1 = np.mean([score['rouge1'].fmeasure for score in rouge_scores])

    avg_rouge2_precision = np.mean([score['rouge2'].precision for score in rouge_scores])
    avg_rouge2_recall = np.mean([score['rouge2'].recall for score in rouge_scores])
    avg_rouge2_f1 = np.mean([score['rouge2'].fmeasure for score in rouge_scores])

    avg_rougeL_precision = np.mean([score['rougeL'].precision for score in rouge_scores])
    avg_rougeL_recall = np.mean([score['rougeL'].recall for score in rouge_scores])
    avg_rougeL_f1 = np.mean([score['rougeL'].fmeasure for score in rouge_scores])

    print("Average ROUGE-1 Precision: ", avg_rouge1_precision)
    print("Average ROUGE-1 Recall: ", avg_rouge1_recall)
    print("Average ROUGE-1 F1-Score: ", avg_rouge1_f1)
    print("Average ROUGE-2 Precision: ", avg_rouge2_precision)
    print("Average ROUGE-2 Recall: ", avg_rouge2_recall)
    print("Average ROUGE-2 F1-Score: ", avg_rouge2_f1)
    print("Average ROUGE-L Precision: ", avg_rougeL_precision)
    print("Average ROUGE-L Recall: ", avg_rougeL_recall)
    print("Average ROUGE-L F1-Score: ", avg_rougeL_f1)

In [None]:
def evaluate_summaries_5(news_data):
    scorer = rouge_scorer.RougeScorer(rouge_types=['rouge1', 'rouge2', 'rougeL'])
    rouge_scores = []

    for idx, row in news_data.iterrows():
        scores = scorer.score(target=row['preprocessed_text'], prediction=row['predictedSummary_5'])
        rouge_scores.append(scores)

    avg_rouge1_precision = np.mean([score['rouge1'].precision for score in rouge_scores])
    avg_rouge1_recall = np.mean([score['rouge1'].recall for score in rouge_scores])
    avg_rouge1_f1 = np.mean([score['rouge1'].fmeasure for score in rouge_scores])

    avg_rouge2_precision = np.mean([score['rouge2'].precision for score in rouge_scores])
    avg_rouge2_recall = np.mean([score['rouge2'].recall for score in rouge_scores])
    avg_rouge2_f1 = np.mean([score['rouge2'].fmeasure for score in rouge_scores])

    avg_rougeL_precision = np.mean([score['rougeL'].precision for score in rouge_scores])
    avg_rougeL_recall = np.mean([score['rougeL'].recall for score in rouge_scores])
    avg_rougeL_f1 = np.mean([score['rougeL'].fmeasure for score in rouge_scores])

    print("Average ROUGE-1 Precision: ", avg_rouge1_precision)
    print("Average ROUGE-1 Recall: ", avg_rouge1_recall)
    print("Average ROUGE-1 F1-Score: ", avg_rouge1_f1)
    print("Average ROUGE-2 Precision: ", avg_rouge2_precision)
    print("Average ROUGE-2 Recall: ", avg_rouge2_recall)
    print("Average ROUGE-2 F1-Score: ", avg_rouge2_f1)
    print("Average ROUGE-L Precision: ", avg_rougeL_precision)
    print("Average ROUGE-L Recall: ", avg_rougeL_recall)
    print("Average ROUGE-L F1-Score: ", avg_rougeL_f1)

In [None]:
print("Evaluation for the summary of 3 sentences: \n")
evaluate_summaries_3(news_data)
print("\n")
print("Evaluation for the summary of 5 sentences: \n")
evaluate_summaries_5(news_data)

Evaluation for the summary of 3 sentences: 

Average ROUGE-1 Precision:  0.31602691596299876
Average ROUGE-1 Recall:  0.3305359114962941
Average ROUGE-1 F1-Score:  0.3121011673863618
Average ROUGE-2 Precision:  0.10721657244461096
Average ROUGE-2 Recall:  0.11312171908637578
Average ROUGE-2 F1-Score:  0.1059864920909852
Average ROUGE-L Precision:  0.20546140823208367
Average ROUGE-L Recall:  0.21182388426528806
Average ROUGE-L F1-Score:  0.20094894997070192


Evaluation for the summary of 5 sentences: 

Average ROUGE-1 Precision:  0.28724820351086544
Average ROUGE-1 Recall:  0.4706448681567012
Average ROUGE-1 F1-Score:  0.34547384483344384
Average ROUGE-2 Precision:  0.1147934564335748
Average ROUGE-2 Recall:  0.1865170688772358
Average ROUGE-2 F1-Score:  0.13696756106167463
Average ROUGE-L Precision:  0.18340939007251364
Average ROUGE-L Recall:  0.29621744127764577
Average ROUGE-L F1-Score:  0.21860605914019293
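
The 5-sentence summaries score higher recall but lower ROUGE-1 and ROUGE-L precision than the 3-sentence ones. That pattern is expected: a longer summary covers more of the reference tokens but also adds more tokens that do not match. The sketch below illustrates it with a simplified ROUGE-1 on whitespace tokens. The `rouge1` helper and the example strings are assumptions for illustration; the `rouge_score` package applies its own tokenization and computes ROUGE-2 and ROUGE-L differently.

```python
from collections import Counter

def rouge1(target, prediction):
    """Unigram-overlap precision, recall and F1 on whitespace tokens (simplified)."""
    target_counts = Counter(target.split())
    prediction_counts = Counter(prediction.split())
    # Clipped overlap: each token counts at most as often as it appears in both texts
    overlap = sum((target_counts & prediction_counts).values())
    precision = overlap / max(sum(prediction_counts.values()), 1)
    recall = overlap / max(sum(target_counts.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1

reference = "court reduced the compensation"
short_summary = "court reduced fine"
long_summary = "court reduced the fine after a long hearing"

short_scores = rouge1(reference, short_summary)
long_scores = rouge1(reference, long_summary)
print(short_scores)  # precision 2/3, recall 2/4
print(long_scores)   # precision 3/8, recall 3/4
```

The longer prediction raises recall from 0.5 to 0.75 while precision drops from about 0.67 to 0.375.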


# Rough Work

In [None]:
# Tokenize text
tokenized_texts = [word_tokenize(sent) for sent in sent_tokenize(news_data['preprocessed_ctext'][0])]
tokenized_texts

[['the',
  'daman',
  'and',
  'diu',
  'administration',
  'on',
  'wednesday',
  'withdrew',
  'a',
  'circular',
  'that',
  'asked',
  'women',
  'staff',
  'to',
  'tie',
  'rakhis',
  'on',
  'male',
  'colleagues',
  'after',
  'the',
  'order',
  'triggered',
  'a',
  'backlash',
  'from',
  'employees',
  'and',
  'was',
  'ripped',
  'apart',
  'on',
  'social',
  'media',
  '.'],
 ['the',
  'union',
  'territorys',
  'administration',
  'was',
  'forced',
  'to',
  'retreat',
  'within',
  '24',
  'hours',
  'of',
  'issuing',
  'the',
  'circular',
  'that',
  'made',
  'it',
  'compulsory',
  'for',
  'its',
  'staff',
  'to',
  'celebrate',
  'rakshabandhan',
  'at',
  'workplace',
  '.'],
 ['it',
  'has',
  'been',
  'decided',
  'to',
  'celebrate',
  'the',
  'festival',
  'of',
  'rakshabandhan',
  'on',
  'august',
  '7',
  '.'],
 ['in',
  'this',
  'connection',
  'all',
  'offices',
  'departments',
  'shall',
  'remain',
  'open',
  'and',
  'celebrate',
  'the',


In [None]:
# Sentence embedding function
def sentence_embedding(sentence):
    word_embeddings = [glove_model[word] for word in sentence if word in glove_model.key_to_index]
    if len(word_embeddings) == 0:
        return np.zeros(glove_model.vector_size)
    return np.mean(word_embeddings, axis=0)  # You can also try sum instead of mean

In [None]:
sentence_embeddings = np.array([sentence_embedding(sentence) for sentence in tokenized_texts])
sentence_embeddings

array([[-0.04380909, -0.09082867,  0.18125317, ..., -0.12316607,
         0.53395194,  0.06034261],
       [-0.01150746, -0.06801052,  0.20491797, ..., -0.16314735,
         0.5325506 ,  0.08096194],
       [-0.0167263 , -0.01593585,  0.22243348, ..., -0.20327693,
         0.60671365,  0.08409016],
       ...,
       [-0.16707925, -0.03983718,  0.4561327 , ..., -0.45501277,
         0.5545738 ,  0.2964479 ],
       [-0.08643195,  0.0122804 ,  0.26521334, ...,  0.03965085,
         0.5447589 ,  0.08598915],
       [-0.17723392,  0.07728847,  0.32824934, ..., -0.06450294,
         0.6807927 ,  0.06960159]], dtype=float32)

In [None]:
len(sentence_embeddings)

17

In [None]:
# Calculate similarity matrix based on sentence embeddings
def cosine_similarity_matrix(embeddings):
    similarity_matrix = cosine_similarity(embeddings, embeddings)
    return similarity_matrix

In [None]:
similarity_matrix = cosine_similarity_matrix(sentence_embeddings)

In [None]:
similarity_matrix

array([[1.        , 0.9696266 , 0.94387895, 0.96249056, 0.961668  ,
        0.97046554, 0.9654246 , 0.87012523, 0.85056084, 0.95545536,
        0.91417015, 0.92672354, 0.9440801 , 0.93899983, 0.8944693 ,
        0.96034044, 0.9543654 ],
       [0.9696266 , 1.0000001 , 0.95525235, 0.97226185, 0.97376335,
        0.9670247 , 0.9714387 , 0.8866603 , 0.8512679 , 0.9615502 ,
        0.9068859 , 0.9001034 , 0.94502527, 0.9373632 , 0.89195114,
        0.96381354, 0.96420914],
       [0.94387895, 0.95525235, 1.0000002 , 0.9596636 , 0.952504  ,
        0.9442219 , 0.9348501 , 0.8609182 , 0.84288347, 0.93383986,
        0.9018837 , 0.8859077 , 0.9329976 , 0.9488643 , 0.8775196 ,
        0.96080804, 0.9583695 ],
       [0.96249056, 0.97226185, 0.9596636 , 0.9999999 , 0.9688114 ,
        0.9589154 , 0.9634564 , 0.8732775 , 0.87199837, 0.9643351 ,
        0.9267743 , 0.92086715, 0.95138025, 0.9483334 , 0.88241124,
        0.96559215, 0.97414786],
       [0.961668  , 0.97376335, 0.952504  , 0.968811

In [None]:
# TextRank function
def textrank(similarity_matrix):
    damping_factor = 0.85
    n_iterations = 250
    n_sentences = similarity_matrix.shape[0]
    ranks = np.ones(n_sentences) / n_sentences  # Start from a uniform rank vector

    # The similarity matrix is not normalised, so the ranks are scaled up on every
    # iteration; only their relative order is meaningful.
    for _ in range(n_iterations):
        ranks = (1 - damping_factor) / n_sentences + damping_factor * similarity_matrix.T.dot(ranks)
    return ranks

In [None]:
ranks = textrank(similarity_matrix)

In [None]:
ranks

array([5.68263670e+279, 5.69560896e+279, 5.61589075e+279, 5.71164792e+279,
       5.68806309e+279, 5.67184756e+279, 5.62744907e+279, 5.24359208e+279,
       5.17514934e+279, 5.66557629e+279, 5.41323969e+279, 5.40239951e+279,
       5.62651025e+279, 5.59915158e+279, 5.33184871e+279, 5.65966493e+279,
       5.66399958e+279])

In [None]:
top_sentence_indices = ranks.argsort()[-3:][::-1]
predicted_summary = " ".join([" ".join(tokenized_texts[i]) for i in sorted(top_sentence_indices)])
predicted_summary

'the union territorys administration was forced to retreat within 24 hours of issuing the circular that made it compulsory for its staff to celebrate rakshabandhan at workplace . in this connection all offices departments shall remain open and celebrate the festival collectively at a suitable time wherein all the lady staff shall tie rakhis to their colleagues the order issued on august 1 by gurpreet singh deputy secretary personnel had said . to ensure that no one skipped office an attendance report was to be sent to the government the next evening .'