
# Text Summarization Using Word Embeddings and TextRank Algorithm
This notebook implements extractive text summarization using word embeddings and the TextRank algorithm. It loads a dataset of news articles that already includes preprocessed text columns, then trains Word2Vec embeddings on the preprocessed article corpus to capture semantic relationships between words. Each article is split into sentences, and each sentence is represented by the mean of its word vectors. Pairwise cosine similarity between the sentence embeddings gives a similarity matrix, which TextRank uses to rank the sentences; the top-ranked sentences, kept in their original order, form the summary. The summary length is configurable, and this notebook generates summaries of 3 and 5 sentences. Evaluation functions compute the average ROUGE (Recall-Oriented Understudy for Gisting Evaluation) scores of the generated summaries against the reference summaries. The approach suits extractive summarization tasks such as document summarization, text analysis, and information retrieval.

# Mounting Drive

In [None]:
# Mounting Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# Loading Data

In [None]:
# Loading Dataset
import pandas as pd
news_data = pd.read_csv('/content/drive/MyDrive/Datasets/filtered_news_data.csv')
news_data

Unnamed: 0.1,Unnamed: 0,text,ctext,text_len,ctext_len,text_sent_count,text_word_count,ctext_sent_count,ctext_word_count,preprocessed_text,preprocessed_ctext
0,0,The Administration of Union Territory Daman an...,The Daman and Diu administration on Wednesday ...,358,2313,2,62,16,413,the administration of union territory daman an...,the daman and diu administration on wednesday ...
1,2,The Indira Gandhi Institute of Medical Science...,The Indira Gandhi Institute of Medical Science...,398,2112,3,70,18,379,the indira gandhi institute of medical science...,the indira gandhi institute of medical science...
2,4,Hotels in Maharashtra will train their staff t...,Hotels in Mumbai and other Indian cities are t...,366,3249,3,67,8,569,hotels in maharashtra will train their staff t...,hotels in mumbai and other indian cities are t...
3,5,A 32-year-old man on Wednesday was found hangi...,An alleged suspect in a kidnapping case was fo...,347,2247,3,63,23,440,a 32yearold man on wednesday was found hanging...,an alleged suspect in a kidnapping case was fo...
4,6,The Delhi High Court reduced the compensation ...,"In an interesting ruling, the Delhi high court...",361,2367,3,70,11,444,the delhi high court reduced the compensation ...,in an interesting ruling the delhi high court ...
...,...,...,...,...,...,...,...,...,...,...,...
3911,4507,After emerging as the single largest party in ...,As uncertainty loomed large in the BMC polls d...,335,606,3,66,1,115,after emerging as the single largest party in ...,as uncertainty loomed large in the bmc polls d...
3912,4509,Fruit juice concentrate maker Rasna is eyeing ...,"Mumbai, Feb 23 (PTI) Fruit juice concentrate m...",361,1000,3,70,8,200,fruit juice concentrate maker rasna is eyeing ...,mumbai feb 23 pti fruit juice concentrate make...
3913,4510,Former Indian cricketer Sachin Tendulkar atten...,Former cricketer Sachin Tendulkar was spotted ...,362,933,2,63,3,158,former indian cricketer sachin tendulkar atten...,former cricketer sachin tendulkar was spotted ...
3914,4512,The Maharashtra government has initiated an in...,Maharahstra Power Minister Chandrashekhar Bawa...,387,1397,3,63,9,233,the maharashtra government has initiated an in...,maharahstra power minister chandrashekhar bawa...


# Building Model

In [None]:
# Import required libraries

import numpy as np
import nltk
from gensim.models import Word2Vec  # needed by the Word2Vec training cell below
from nltk.tokenize import sent_tokenize, word_tokenize
from sklearn.metrics.pairwise import cosine_similarity

nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
# Tokenize function
def tokenize_text(text):
    return [word_tokenize(sent) for sent in sent_tokenize(text)]

In [None]:
# Train Word2Vec model
tokenized_corpus = [word_tokenize(article) for article in news_data['preprocessed_ctext']]
word2vec_model = Word2Vec(sentences=tokenized_corpus, vector_size=100, window=5, min_count=1, workers=4)

In [None]:
# Sentence embedding function
def sentence_embedding(sentence):
    word_embeddings = [word2vec_model.wv[word] for word in sentence if word in word2vec_model.wv]
    if len(word_embeddings) == 0:
        return np.zeros(word2vec_model.vector_size)
    return np.mean(word_embeddings, axis=0)
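
To make the averaging step concrete, here is a minimal sketch that uses a hypothetical three-word vocabulary in place of the trained Word2Vec model. The names `toy_vectors` and `mean_pool` are illustrative assumptions and are not part of the notebook; the logic mirrors `sentence_embedding`, including the all-zeros fallback for sentences with no known words.

```python
import numpy as np

# Hypothetical 3-dimensional word vectors standing in for word2vec_model.wv
toy_vectors = {
    "court": np.array([1.0, 0.0, 2.0]),
    "ruling": np.array([3.0, 2.0, 0.0]),
    "delhi": np.array([2.0, 4.0, 4.0]),
}

def mean_pool(tokens, vectors, dim=3):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [vectors[token] for token in tokens if token in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

# "unseen" is out of vocabulary, so it is skipped when averaging
pooled = mean_pool(["delhi", "court", "ruling", "unseen"], toy_vectors)
empty = mean_pool(["unseen"], toy_vectors)
print(pooled)  # [2. 2. 2.]
print(empty)   # [0. 0. 0.]
```

Mean pooling ignores word order, so two sentences with the same words in a different order receive the same embedding.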

In [None]:
# Calculate similarity matrix based on sentence embeddings
def cosine_similarity_matrix(embeddings):
    similarity_matrix = cosine_similarity(embeddings, embeddings)
    return similarity_matrix
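
As a rough illustration of what the similarity matrix contains, the sketch below computes pairwise cosine similarity with plain NumPy on hypothetical 2-dimensional embeddings. `cosine_matrix` and `toy_embeddings` are illustrative names, not the notebook's code, which relies on scikit-learn's `cosine_similarity`. In this sketch, an all-zero row (a sentence with no in-vocabulary words) is given similarity 0 to every sentence.

```python
import numpy as np

def cosine_matrix(embeddings):
    """Pairwise cosine similarity; all-zero rows get similarity 0 to everything."""
    embeddings = np.asarray(embeddings, dtype=np.float64)
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    safe_norms = np.where(norms == 0, 1.0, norms)  # avoid dividing by zero
    unit = embeddings / safe_norms
    return unit @ unit.T

toy_embeddings = np.array([
    [1.0, 0.0],  # sentence A
    [2.0, 0.0],  # same direction as A, different length
    [0.0, 3.0],  # orthogonal to A
    [0.0, 0.0],  # sentence with no in-vocabulary words
])
toy_similarity = cosine_matrix(toy_embeddings)
print(toy_similarity)
```

Sentences A and B score 1.0 against each other because cosine similarity depends only on direction, not on vector length.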

In [None]:
# TextRank function
def textrank(similarity_matrix):
    damping_factor = 0.85
    n_iterations = 500
    n_sentences = similarity_matrix.shape[0]
    ranks = np.ones(n_sentences) / n_sentences  # Initialize page ranks

    for _ in range(n_iterations):
        new_ranks = (1 - damping_factor) / n_sentences + damping_factor * similarity_matrix.T.dot(ranks)
        ranks = new_ranks
    return ranks
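
The update above multiplies the ranks by the raw similarity matrix, so the scores can grow at every iteration; the Rough Work section shows values on the order of 1e261 after 250 iterations. One possible variant, sketched below under the assumption that a PageRank-style row normalization is acceptable here, keeps the scores bounded so that they sum to 1. `textrank_normalized` is a hypothetical name, this is not the function that produced the results reported in this notebook, and it may select different sentences.

```python
import numpy as np

def textrank_normalized(similarity_matrix, damping_factor=0.85,
                        n_iterations=100, tol=1e-8):
    """PageRank-style scores over a row-normalized similarity matrix (a sketch)."""
    sim = np.array(similarity_matrix, dtype=np.float64)
    n_sentences = sim.shape[0]
    if n_sentences == 0:
        return np.array([])

    sim = np.clip(sim, 0.0, None)   # treat negative similarities as no link
    np.fill_diagonal(sim, 0.0)      # ignore self-similarity

    # Each row becomes a probability distribution over the other sentences;
    # rows with no links fall back to a uniform distribution.
    row_sums = sim.sum(axis=1, keepdims=True)
    safe_sums = np.where(row_sums > 0, row_sums, 1.0)
    transition = np.where(row_sums > 0, sim / safe_sums, 1.0 / n_sentences)

    ranks = np.ones(n_sentences) / n_sentences
    for _ in range(n_iterations):
        new_ranks = ((1 - damping_factor) / n_sentences
                     + damping_factor * transition.T.dot(ranks))
        converged = np.abs(new_ranks - ranks).sum() < tol
        ranks = new_ranks
        if converged:
            break
    return ranks

toy_similarity = np.array([
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.2],
    [0.1, 0.2, 1.0],
])
toy_ranks = textrank_normalized(toy_similarity)
print(toy_ranks, toy_ranks.sum())
```

Because every row of `transition` sums to 1, the ranks remain a probability distribution, and the loop can stop early once the scores stop changing.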

In [None]:
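# Summarization function
def summarize_article(article_text, nos=3):
    # Split the article into sentences, each a list of word tokens
    tokenized_sentences = tokenize_text(article_text)
    if not tokenized_sentences:
        return ""  # nothing to summarize for an empty article

    # Embed every sentence, compare all pairs, and rank the sentences
    sentence_embeddings = np.array([sentence_embedding(sentence) for sentence in tokenized_sentences])
    similarity_matrix = cosine_similarity_matrix(sentence_embeddings)
    ranks = textrank(similarity_matrix)

    # Keep at most `nos` sentences (fewer if the article is shorter)
    n_summary_sentences = min(nos, len(tokenized_sentences))
    # Indices of the highest-ranked sentences, best first
    top_sentence_indices = ranks.argsort()[-n_summary_sentences:][::-1]
    # Join the selected sentences in their original document order
    predicted_summary = " ".join([" ".join(tokenized_sentences[i]) for i in sorted(top_sentence_indices)])

    return predicted_summary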
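
The selection step in `summarize_article` can be sketched in isolation with hypothetical rank values. `select_top_sentences` and `toy_ranks` are illustrative names that do not appear in the notebook.

```python
import numpy as np

def select_top_sentences(ranks, k):
    """Return the indices of the k highest-ranked sentences in document order."""
    ranks = np.asarray(ranks)
    k = min(k, len(ranks))
    if k == 0:
        return []
    top_indices = ranks.argsort()[-k:]   # argsort is ascending, so take the tail
    return sorted(top_indices.tolist())  # restore the original sentence order

toy_ranks = [0.10, 0.30, 0.05, 0.25, 0.20, 0.10]
selected = select_top_sentences(toy_ranks, k=3)
print(selected)  # [1, 3, 4]
```

Sorting the selected indices keeps the summary sentences in the order they appear in the article, which usually reads more naturally than ordering them by score.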

In [None]:
# Apply summarization to each article (3-sentence summaries, the default)
news_data['predictedSummary_3'] = news_data['preprocessed_ctext'].apply(summarize_article)

In [None]:
# Apply summarization to each article (5-sentence summaries)
news_data['predictedSummary_5'] = news_data['preprocessed_ctext'].apply(lambda x: summarize_article(x, nos=5))

# Model Evaluation

In [None]:
!pip install rouge_score

Collecting rouge_score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: rouge_score
  Building wheel for rouge_score (setup.py) ... done
  Created wheel for rouge_score: filename=rouge_score-0.1.2-py3-none-any.whl size=24933 sha256=e29d0d716ee211c770ac83d288c1a1742aea2ea480d93337c771c56f056adebe
  Stored in directory: /root/.cache/pip/wheels/5f/dd/89/461065a73be61a532ff8599a28e9beef17985c9e9c31e541b4
Successfully built rouge_score
Installing collected packages: rouge_score
Successfully installed rouge_score-0.1.2


In [None]:
# Model evaluation function
from rouge_score import rouge_scorer

def evaluate_summaries(news_data, summary_column):
    """Print average ROUGE-1, ROUGE-2 and ROUGE-L scores for one summary column."""
    rouge_labels = {'rouge1': 'ROUGE-1', 'rouge2': 'ROUGE-2', 'rougeL': 'ROUGE-L'}
    scorer = rouge_scorer.RougeScorer(rouge_types=list(rouge_labels))

    # Score every predicted summary against its reference summary
    rouge_scores = [
        scorer.score(target=row['preprocessed_text'], prediction=row[summary_column])
        for _, row in news_data.iterrows()
    ]

    # Average precision, recall and F1 over all articles
    for rouge_type, label in rouge_labels.items():
        print(f"Average {label} Precision: ", np.mean([score[rouge_type].precision for score in rouge_scores]))
        print(f"Average {label} Recall: ", np.mean([score[rouge_type].recall for score in rouge_scores]))
        print(f"Average {label} F1-Score: ", np.mean([score[rouge_type].fmeasure for score in rouge_scores]))

def evaluate_summaries_3(news_data):
    # Evaluate the 3-sentence summaries
    evaluate_summaries(news_data, 'predictedSummary_3')

In [None]:
def evaluate_summaries_5(news_data):
    # Evaluate the 5-sentence summaries
    evaluate_summaries(news_data, 'predictedSummary_5')

In [None]:
print("Evaluation for the summary of 3 sentences: \n")
evaluate_summaries_3(news_data)
print("\n")
print("Evaluation for the summary of 5 sentences: \n")
evaluate_summaries_5(news_data)

Evaluation for the summary of 3 sentences: 

Average ROUGE-1 Precision:  0.31554466757350613
Average ROUGE-1 Recall:  0.32987876215131134
Average ROUGE-1 F1-Score:  0.31159437019474256
Average ROUGE-2 Precision:  0.10658324693814096
Average ROUGE-2 Recall:  0.11234937042676994
Average ROUGE-2 F1-Score:  0.10532370743191859
Average ROUGE-L Precision:  0.20485872139471567
Average ROUGE-L Recall:  0.21104750040219733
Average ROUGE-L F1-Score:  0.2003123362992979


Evaluation for the summary of 5 sentences: 

Average ROUGE-1 Precision:  0.2872579946333704
Average ROUGE-1 Recall:  0.4708750074048507
Average ROUGE-1 F1-Score:  0.3455439024952956
Average ROUGE-2 Precision:  0.11485351274032472
Average ROUGE-2 Recall:  0.18676330106271918
Average ROUGE-2 F1-Score:  0.13708012816620416
Average ROUGE-L Precision:  0.1834844888123956
Average ROUGE-L Recall:  0.29651496659848775
Average ROUGE-L F1-Score:  0.21874481235115323
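
For readers unfamiliar with the metric, the sketch below shows roughly what a ROUGE-1 score measures: unigram overlap between the reference and the predicted summary, reported as precision, recall, and F1. It is a simplified illustration that splits on whitespace; the `rouge_score` package applies its own tokenization, so its numbers can differ. `rouge1_sketch` and the two example strings are hypothetical.

```python
from collections import Counter

def rouge1_sketch(target, prediction):
    """Simplified ROUGE-1: clipped unigram overlap on whitespace tokens."""
    target_counts = Counter(target.split())
    prediction_counts = Counter(prediction.split())
    # Each word counts at most as often as it appears in both texts
    overlap = sum((target_counts & prediction_counts).values())
    precision = overlap / max(sum(prediction_counts.values()), 1)
    recall = overlap / max(sum(target_counts.values()), 1)
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

reference = "the court reduced the compensation"
candidate = "the court reduced the fine on wednesday"
precision, recall, f1 = rouge1_sketch(reference, candidate)
print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.571 0.8 0.667
```

A longer candidate has more chances to cover the reference words but also carries more words that are not in the reference, which is consistent with the 5-sentence summaries above showing higher recall and lower ROUGE-1 precision than the 3-sentence summaries.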


# Rough Work

In [None]:
# Tokenize text
tokenized_texts = [word_tokenize(sent) for sent in sent_tokenize(news_data['preprocessed_ctext'][0])]
tokenized_texts

[['the',
  'daman',
  'and',
  'diu',
  'administration',
  'on',
  'wednesday',
  'withdrew',
  'a',
  'circular',
  'that',
  'asked',
  'women',
  'staff',
  'to',
  'tie',
  'rakhis',
  'on',
  'male',
  'colleagues',
  'after',
  'the',
  'order',
  'triggered',
  'a',
  'backlash',
  'from',
  'employees',
  'and',
  'was',
  'ripped',
  'apart',
  'on',
  'social',
  'media',
  '.'],
 ['the',
  'union',
  'territorys',
  'administration',
  'was',
  'forced',
  'to',
  'retreat',
  'within',
  '24',
  'hours',
  'of',
  'issuing',
  'the',
  'circular',
  'that',
  'made',
  'it',
  'compulsory',
  'for',
  'its',
  'staff',
  'to',
  'celebrate',
  'rakshabandhan',
  'at',
  'workplace',
  '.'],
 ['it',
  'has',
  'been',
  'decided',
  'to',
  'celebrate',
  'the',
  'festival',
  'of',
  'rakshabandhan',
  'on',
  'august',
  '7',
  '.'],
 ['in',
  'this',
  'connection',
  'all',
  'offices',
  'departments',
  'shall',
  'remain',
  'open',
  'and',
  'celebrate',
  'the',


In [None]:
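from gensim.models import Word2Vec
# Train Word2Vec model
# Caution: news_data['preprocessed_ctext'][0] is a single article string, so this
# loop iterates over its characters rather than its sentences. Passing
# tokenized_texts as `sentences` would train on the article's tokenized sentences.
tokenized_corpus = [word_tokenize(article) for article in news_data['preprocessed_ctext'][0]]
word2vec_model = Word2Vec(sentences=tokenized_corpus, vector_size=100, window=5, min_count=1, workers=4)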

In [None]:
# Sentence embedding function
def sentence_embedding(sentence):
    word_embeddings = [word2vec_model.wv[word] for word in sentence if word in word2vec_model.wv]
    if len(word_embeddings) == 0:
        return np.zeros(word2vec_model.vector_size)
    return np.mean(word_embeddings, axis=0)  # You can also try sum instead of mean

In [None]:
sentence_embeddings = np.array([sentence_embedding(sentence) for sentence in tokenized_texts])
sentence_embeddings

array([[ 0.00050722,  0.00423183, -0.00121356, ...,  0.00336173,
         0.00354513, -0.00345067],
       [ 0.00133252,  0.00654086,  0.0099846 , ...,  0.00906   ,
        -0.00579077,  0.00368608],
       [ 0.00172302,  0.00613801,  0.0039341 , ...,  0.00674994,
        -0.00693857, -0.00018685],
       ...,
       [ 0.00133252,  0.00654086,  0.0099846 , ...,  0.00906   ,
        -0.00579077,  0.00368608],
       [ 0.00133252,  0.00654086,  0.0099846 , ...,  0.00906   ,
        -0.00579077,  0.00368608],
       [ 0.00071354,  0.00480909,  0.00158598, ...,  0.00478629,
         0.00121116, -0.00166648]], dtype=float32)

In [None]:
# Calculate similarity matrix based on sentence embeddings
def cosine_similarity_matrix(embeddings):
    similarity_matrix = cosine_similarity(embeddings, embeddings)
    return similarity_matrix

In [None]:
similarity_matrix = cosine_similarity_matrix(sentence_embeddings)

In [None]:
len(similarity_matrix)

17

In [None]:
similarity_matrix

array([[1.        , 0.4306957 , 0.32566708, 0.78693867, 0.4306957 ,
        0.94901186, 0.94901186, 0.4306957 , 0.4306957 , 0.7607797 ,
        0.4306957 , 0.4306957 , 0.94901186, 0.4306957 , 0.4306957 ,
        0.4306957 , 0.94901186],
       [0.4306957 , 0.99999976, 0.66515803, 0.5865928 , 0.99999976,
        0.6932388 , 0.6932388 , 0.99999976, 0.99999976, 0.54168797,
        0.99999976, 0.99999976, 0.6932388 , 0.99999976, 0.99999976,
        0.99999976, 0.6932388 ],
       [0.32566708, 0.66515803, 1.        , 0.38877335, 0.66515803,
        0.49240652, 0.49240652, 0.66515803, 0.66515803, 0.35651818,
        0.66515803, 0.66515803, 0.49240652, 0.66515803, 0.66515803,
        0.66515803, 0.49240652],
       [0.78693867, 0.5865928 , 0.38877335, 0.9999999 , 0.5865928 ,
        0.8333221 , 0.8333221 , 0.5865928 , 0.5865928 , 0.64797264,
        0.5865928 , 0.5865928 , 0.8333221 , 0.5865928 , 0.5865928 ,
        0.5865928 , 0.8333221 ],
       [0.4306957 , 0.99999976, 0.66515803, 0.586592

In [None]:
# TextRank function
def textrank(similarity_matrix):
    damping_factor = 0.85
    n_iterations = 250
    n_sentences = similarity_matrix.shape[0]
    ranks = np.ones(n_sentences) / n_sentences  # Initialize page ranks

    for _ in range(n_iterations):
        new_ranks = (1 - damping_factor) / n_sentences + damping_factor * similarity_matrix.T.dot(ranks)
        ranks = new_ranks
    return ranks

In [None]:
ranks = textrank(similarity_matrix)

In [None]:
ranks

array([1.97876676e+261, 2.74324803e+261, 1.93543359e+261, 2.17602147e+261,
       2.74324803e+261, 2.53839601e+261, 2.53839601e+261, 2.74324803e+261,
       2.74324803e+261, 2.05209536e+261, 2.74324803e+261, 2.74324803e+261,
       2.53839601e+261, 2.74324803e+261, 2.74324803e+261, 2.74324803e+261,
       2.53839601e+261])

In [None]:
top_sentence_indices = ranks.argsort()[-5:][::-1]
predicted_summary = " ".join([" ".join(tokenized_texts[i]) for i in sorted(top_sentence_indices)])
predicted_summary

'the union territorys administration was forced to retreat within 24 hours of issuing the circular that made it compulsory for its staff to celebrate rakshabandhan at workplace . there are sensitivities involved . the notice was issued on daman and diu administrator and former gujarat home minister praful kodabhai patels direction sources said . in 2014 the year bjp stormed to power at the centre rashtriya swayamsevak sangh rss chief mohan bhagwat said the festival had national significance and should be celebrated widely to protect hindu culture and live by the values enshrined in it . the rss is the ideological parent of the ruling bjp .'

In [None]:
len(predicted_summary.split())

107