<a href="https://colab.research.google.com/github/thakkar-hiren/News-Summarisation/blob/main/3_News_Summary.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text Summarization Using TF-IDF and TextRank Algorithm
This notebook implements extractive text summarization with TF-IDF (Term Frequency-Inverse Document Frequency) and the TextRank algorithm. TF-IDF weights each word by how often it occurs in a sentence and how rare it is across the sentences of the same article; the vectorizer is fitted per article, with each sentence treated as a document. TextRank then ranks sentences by the cosine similarity of their TF-IDF vectors to the other sentences in the article.

The notebook loads a dataset of preprocessed news articles, builds a TF-IDF vector for every sentence, and selects the top-ranked sentences to produce summaries of 3 and 5 sentences. The summaries are evaluated against human-written reference summaries with the ROUGE (Recall-Oriented Understudy for Gisting Evaluation) metric, and the notebook reports average ROUGE-1, ROUGE-2, and ROUGE-L precision, recall, and F1 scores. The same approach applies to document summarization, information retrieval, and text mining.

# Mounting Drive

In [None]:
# Mounting Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# Loading Data

In [None]:
# Loading Dataset
import pandas as pd
news_data = pd.read_csv('/content/drive/MyDrive/Datasets/filtered_news_data.csv')
news_data

Unnamed: 0.1,Unnamed: 0,text,ctext,text_len,ctext_len,text_sent_count,text_word_count,ctext_sent_count,ctext_word_count,preprocessed_text,preprocessed_ctext
0,0,The Administration of Union Territory Daman an...,The Daman and Diu administration on Wednesday ...,358,2313,2,62,16,413,the administration of union territory daman an...,the daman and diu administration on wednesday ...
1,2,The Indira Gandhi Institute of Medical Science...,The Indira Gandhi Institute of Medical Science...,398,2112,3,70,18,379,the indira gandhi institute of medical science...,the indira gandhi institute of medical science...
2,4,Hotels in Maharashtra will train their staff t...,Hotels in Mumbai and other Indian cities are t...,366,3249,3,67,8,569,hotels in maharashtra will train their staff t...,hotels in mumbai and other indian cities are t...
3,5,A 32-year-old man on Wednesday was found hangi...,An alleged suspect in a kidnapping case was fo...,347,2247,3,63,23,440,a 32yearold man on wednesday was found hanging...,an alleged suspect in a kidnapping case was fo...
4,6,The Delhi High Court reduced the compensation ...,"In an interesting ruling, the Delhi high court...",361,2367,3,70,11,444,the delhi high court reduced the compensation ...,in an interesting ruling the delhi high court ...
...,...,...,...,...,...,...,...,...,...,...,...
3911,4507,After emerging as the single largest party in ...,As uncertainty loomed large in the BMC polls d...,335,606,3,66,1,115,after emerging as the single largest party in ...,as uncertainty loomed large in the bmc polls d...
3912,4509,Fruit juice concentrate maker Rasna is eyeing ...,"Mumbai, Feb 23 (PTI) Fruit juice concentrate m...",361,1000,3,70,8,200,fruit juice concentrate maker rasna is eyeing ...,mumbai feb 23 pti fruit juice concentrate make...
3913,4510,Former Indian cricketer Sachin Tendulkar atten...,Former cricketer Sachin Tendulkar was spotted ...,362,933,2,63,3,158,former indian cricketer sachin tendulkar atten...,former cricketer sachin tendulkar was spotted ...
3914,4512,The Maharashtra government has initiated an in...,Maharahstra Power Minister Chandrashekhar Bawa...,387,1397,3,63,9,233,the maharashtra government has initiated an in...,maharahstra power minister chandrashekhar bawa...


# Building Model

In [None]:
# Importing important libraries

import numpy as np
import nltk

from nltk.tokenize import sent_tokenize, word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
# Tokenize function
def tokenize_text(text):
    return [word_tokenize(sent) for sent in sent_tokenize(text)]
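
To make the return shape of `tokenize_text` concrete, here is a rough stand-in that needs no NLTK data download. It is only an approximation: `simple_tokenize_text` and its regular expressions are assumptions made for illustration and do not reproduce the Punkt model behind `sent_tokenize`.

```python
import re

def simple_tokenize_text(text):
    """Hypothetical regex approximation of tokenize_text (not NLTK's algorithm)."""
    # Split after sentence-ending punctuation that is followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    # Within each sentence, keep runs of word characters and single punctuation marks.
    return [re.findall(r"\w+|[^\w\s]", sentence) for sentence in sentences if sentence]

sample = "the circular was withdrawn. staff had objected to the order."
tokens = simple_tokenize_text(sample)
print(tokens)
```

The result is a list with one inner list of tokens per sentence, which is the structure the later functions join back into strings.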

In [None]:
# TF-IDF function
def calculate_tfidf_vectors(tokenized_texts):
    tfidf_vectorizer = TfidfVectorizer()
    tfidf_matrix = tfidf_vectorizer.fit_transform([" ".join(tokens) for tokens in tokenized_texts])
    return tfidf_matrix
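
Because the vectorizer is fitted on the sentences of a single article, each sentence plays the role of a document when the inverse document frequency is computed. The sketch below hand-rolls the calculation following scikit-learn's documented defaults (tokens of two or more word characters, smoothed IDF, L2-normalized rows). The function name `tfidf_rows` and the sample sentences are assumptions for illustration; this is not the library's implementation.

```python
import math
import re
from collections import Counter

def tfidf_rows(sentences):
    """Illustrative TF-IDF over sentences, mirroring TfidfVectorizer's default settings."""
    # Default token pattern: keep tokens of two or more word characters.
    docs = [re.findall(r"\b\w\w+\b", s.lower()) for s in sentences]
    vocab = sorted({term for doc in docs for term in doc})
    n_docs = len(docs)
    # Document frequency: number of sentences that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1.
    idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}
    rows = []
    for doc in docs:
        counts = Counter(doc)
        row = [counts[term] * idf[term] for term in vocab]
        # L2-normalize so that a dot product between rows is a cosine similarity.
        norm = math.sqrt(sum(value * value for value in row)) or 1.0
        rows.append([value / norm for value in row])
    return vocab, rows

sentences = [
    "the circular was withdrawn .",
    "the staff opposed the circular .",
    "a new order followed .",
]
vocab, rows = tfidf_rows(sentences)
similarity_0_1 = sum(a * b for a, b in zip(rows[0], rows[1]))
similarity_0_2 = sum(a * b for a, b in zip(rows[0], rows[2]))
print(vocab)
print(similarity_0_1, similarity_0_2)
```

Single-character tokens such as `a` and `.` drop out of the vocabulary, and sentences that share no remaining terms have a similarity of zero.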

In [None]:
# TextRank function
def textrank(tfidf_matrix):
    similarity_matrix = cosine_similarity(tfidf_matrix, tfidf_matrix)

    damping_factor = 0.85
    n_iterations = 500
    n_sentences = similarity_matrix.shape[0]
    ranks = np.ones(n_sentences) / n_sentences  # Initialize page ranks

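    # Power iteration. The similarity matrix is not normalized, so the scores
    # grow with every step; only their relative order is used downstream.
    for _ in range(n_iterations):
        ranks = (1 - damping_factor) / n_sentences + damping_factor * similarity_matrix.T.dot(ranks)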

    return ranks
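
The function above multiplies by the raw similarity matrix, and the Rough Work output in this notebook shows the resulting scores reaching about 1e175 after 500 iterations. For articles with many similar sentences the values could overflow. One possible variant, sketched below, normalizes each column so the scores stay bounded and sum to 1. The name `textrank_normalized`, the tolerance, and the choice to drop self-similarity are assumptions. This variant can order sentences differently from the notebook's function, so the ROUGE scores reported in this notebook were not produced with it.

```python
import numpy as np

def textrank_normalized(similarity_matrix, damping_factor=0.85, max_iterations=500, tol=1e-8):
    """PageRank-style power iteration on a column-normalized similarity matrix."""
    weights = np.array(similarity_matrix, dtype=float)  # work on a copy
    np.fill_diagonal(weights, 0.0)  # assumption: ignore each sentence's self-similarity
    n = weights.shape[0]

    column_sums = weights.sum(axis=0)
    dangling = column_sums == 0
    # Assumption: a sentence similar to nothing spreads its score uniformly.
    weights[:, dangling] = 1.0
    column_sums[dangling] = n
    transition = weights / column_sums  # every column now sums to 1

    ranks = np.full(n, 1.0 / n)
    for _ in range(max_iterations):
        new_ranks = (1 - damping_factor) / n + damping_factor * transition.dot(ranks)
        converged = np.abs(new_ranks - ranks).sum() < tol
        ranks = new_ranks
        if converged:
            break
    return ranks

# Hypothetical similarity matrix for three sentences.
toy_similarity = np.array([
    [1.0, 0.5, 0.1],
    [0.5, 1.0, 0.4],
    [0.1, 0.4, 1.0],
])
toy_ranks = textrank_normalized(toy_similarity)
print(toy_ranks, toy_ranks.sum())
```

In the toy matrix the middle sentence is the most similar to the other two, so it receives the highest score.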

In [None]:
# Summarization function
def summarize_article(article_text, nos=3):
    tokenized_texts = tokenize_text(article_text)
    tfidf_matrix = calculate_tfidf_vectors(tokenized_texts)
    ranks = textrank(tfidf_matrix)

    n_summary_sentences = min(nos, len(tokenized_texts))
    top_sentence_indices = ranks.argsort()[-n_summary_sentences:][::-1]
    predicted_summary = " ".join([" ".join(tokenized_texts[i]) for i in sorted(top_sentence_indices)])

    return predicted_summary
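
The selection step in `summarize_article` picks the highest-scoring sentences and then restores their order of appearance in the article. A small sketch with made-up scores (the values in `example_ranks` are hypothetical) shows the two orderings:

```python
import numpy as np

example_ranks = np.array([0.30, 0.10, 0.40, 0.20])  # hypothetical scores for four sentences
n_summary_sentences = 2

# argsort is ascending, so take the last k indices and reverse them: best sentence first.
top_sentence_indices = example_ranks.argsort()[-n_summary_sentences:][::-1]
# Sorting the indices puts the chosen sentences back in article order.
article_order = [int(i) for i in sorted(top_sentence_indices)]

print(top_sentence_indices)
print(article_order)
```

Since the summary is rebuilt by joining tokens with spaces, punctuation ends up separated from the preceding word, as in the `... a day apart .` output shown in the Rough Work section.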

In [None]:
# Apply summarization to each article for a 3-sentence summary (the default)
news_data['predictedSummary_3'] = news_data['preprocessed_ctext'].apply(summarize_article)

In [None]:
# Apply summarization to each article for a 5-sentence summary
news_data['predictedSummary_5'] = news_data['preprocessed_ctext'].apply(summarize_article, nos=5)

# Model Evaluation

In [None]:
!pip install rouge_score

Collecting rouge_score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: rouge_score
  Building wheel for rouge_score (setup.py) ... done
  Created wheel for rouge_score: filename=rouge_score-0.1.2-py3-none-any.whl size=24933 sha256=19a8e4c65a049e86f2c6fd0e2baf8f6f97197464420554b777f465077b44df32
  Stored in directory: /root/.cache/pip/wheels/5f/dd/89/461065a73be61a532ff8599a28e9beef17985c9e9c31e541b4
Successfully built rouge_score
Installing collected packages: rouge_score
Successfully installed rouge_score-0.1.2


In [None]:
# Function to evaluate summaries generated with 3 sentences
from rouge_score import rouge_scorer
import numpy as np

def evaluate_summaries_3(news_data):
    scorer = rouge_scorer.RougeScorer(rouge_types=['rouge1', 'rouge2', 'rougeL'])
    rouge_scores = []

    for idx, row in news_data.iterrows():
        scores = scorer.score(target=row['preprocessed_text'], prediction=row['predictedSummary_3'])
        rouge_scores.append(scores)

    avg_rouge1_precision = np.mean([score['rouge1'].precision for score in rouge_scores])
    avg_rouge1_recall = np.mean([score['rouge1'].recall for score in rouge_scores])
    avg_rouge1_f1 = np.mean([score['rouge1'].fmeasure for score in rouge_scores])

    avg_rouge2_precision = np.mean([score['rouge2'].precision for score in rouge_scores])
    avg_rouge2_recall = np.mean([score['rouge2'].recall for score in rouge_scores])
    avg_rouge2_f1 = np.mean([score['rouge2'].fmeasure for score in rouge_scores])

    avg_rougeL_precision = np.mean([score['rougeL'].precision for score in rouge_scores])
    avg_rougeL_recall = np.mean([score['rougeL'].recall for score in rouge_scores])
    avg_rougeL_f1 = np.mean([score['rougeL'].fmeasure for score in rouge_scores])

    print("Average ROUGE-1 Precision: ", avg_rouge1_precision)
    print("Average ROUGE-1 Recall: ", avg_rouge1_recall)
    print("Average ROUGE-1 F1-Score: ", avg_rouge1_f1)
    print("Average ROUGE-2 Precision: ", avg_rouge2_precision)
    print("Average ROUGE-2 Recall: ", avg_rouge2_recall)
    print("Average ROUGE-2 F1-Score: ", avg_rouge2_f1)
    print("Average ROUGE-L Precision: ", avg_rougeL_precision)
    print("Average ROUGE-L Recall: ", avg_rougeL_recall)
    print("Average ROUGE-L F1-Score: ", avg_rougeL_f1)
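
To show what the precision, recall, and F1 values measure, here is a simplified ROUGE-1 calculation on whitespace tokens. It is a sketch for intuition only: the function `rouge1` and the two sample strings are assumptions, and the `rouge_score` package applies its own tokenization, so its numbers can differ.

```python
from collections import Counter

def rouge1(target, prediction):
    """Unigram-overlap precision, recall, and F1 on whitespace tokens (simplified)."""
    target_counts = Counter(target.split())
    prediction_counts = Counter(prediction.split())
    # Clipped overlap: a token counts at most as often as it appears in both texts.
    overlap = sum((target_counts & prediction_counts).values())
    precision = overlap / max(sum(prediction_counts.values()), 1)
    recall = overlap / max(sum(target_counts.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1

reference = "the administration withdrew the circular"
summary = "the circular was withdrawn by the administration on wednesday"
precision, recall, f1 = rouge1(reference, summary)
print(precision, recall, f1)
```

Four of the nine summary tokens appear in the reference (precision 4/9), and four of the five reference tokens appear in the summary (recall 4/5). A longer summary tends to raise recall and lower precision.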

In [None]:
# Function to evaluate summaries generated with 5 sentences
def evaluate_summaries_5(news_data):
    scorer = rouge_scorer.RougeScorer(rouge_types=['rouge1', 'rouge2', 'rougeL'])
    rouge_scores = []

    for idx, row in news_data.iterrows():
        scores = scorer.score(target=row['preprocessed_text'], prediction=row['predictedSummary_5'])
        rouge_scores.append(scores)

    avg_rouge1_precision = np.mean([score['rouge1'].precision for score in rouge_scores])
    avg_rouge1_recall = np.mean([score['rouge1'].recall for score in rouge_scores])
    avg_rouge1_f1 = np.mean([score['rouge1'].fmeasure for score in rouge_scores])

    avg_rouge2_precision = np.mean([score['rouge2'].precision for score in rouge_scores])
    avg_rouge2_recall = np.mean([score['rouge2'].recall for score in rouge_scores])
    avg_rouge2_f1 = np.mean([score['rouge2'].fmeasure for score in rouge_scores])

    avg_rougeL_precision = np.mean([score['rougeL'].precision for score in rouge_scores])
    avg_rougeL_recall = np.mean([score['rougeL'].recall for score in rouge_scores])
    avg_rougeL_f1 = np.mean([score['rougeL'].fmeasure for score in rouge_scores])

    print("Average ROUGE-1 Precision: ", avg_rouge1_precision)
    print("Average ROUGE-1 Recall: ", avg_rouge1_recall)
    print("Average ROUGE-1 F1-Score: ", avg_rouge1_f1)
    print("Average ROUGE-2 Precision: ", avg_rouge2_precision)
    print("Average ROUGE-2 Recall: ", avg_rouge2_recall)
    print("Average ROUGE-2 F1-Score: ", avg_rouge2_f1)
    print("Average ROUGE-L Precision: ", avg_rougeL_precision)
    print("Average ROUGE-L Recall: ", avg_rougeL_recall)
    print("Average ROUGE-L F1-Score: ", avg_rougeL_f1)
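
`evaluate_summaries_3` and `evaluate_summaries_5` differ only in the column they read, so the averaging could be shared. The sketch below is one possible refactoring, not the notebook's code: `average_rouge` is a hypothetical helper that accepts any scorer exposing `score(target, prediction)`, and `ExactMatchScorer` is a stand-in used only so the example runs without `rouge_score` installed. It assumes at least one reference/prediction pair.

```python
from collections import namedtuple
from statistics import mean

def average_rouge(references, predictions, scorer):
    """Average precision, recall, and F1 per metric over paired texts."""
    per_pair = [scorer.score(ref, pred) for ref, pred in zip(references, predictions)]
    return {
        metric: {
            "precision": mean(scores[metric].precision for scores in per_pair),
            "recall": mean(scores[metric].recall for scores in per_pair),
            "f1": mean(scores[metric].fmeasure for scores in per_pair),
        }
        for metric in per_pair[0]
    }

# Stand-in scorer (hypothetical) that mimics the shape of RougeScorer's return value.
Score = namedtuple("Score", ["precision", "recall", "fmeasure"])

class ExactMatchScorer:
    def score(self, target, prediction):
        value = 1.0 if target == prediction else 0.0
        return {"exact": Score(value, value, value)}

averages = average_rouge(["a b", "c d"], ["a b", "x y"], ExactMatchScorer())
print(averages)
```

With the notebook's data, the call would look like `average_rouge(news_data['preprocessed_text'], news_data['predictedSummary_5'], rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL']))`.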

In [None]:
# Evaluation for summaries of 3 sentences
print("For TF-IDF:\n ")
print("Evaluation for the summary of 3 sentences: \n")
evaluate_summaries_3(news_data)
print("\n")

# Evaluation for summaries of 5 sentences
print("Evaluation for the summary of 5 sentences: \n")
evaluate_summaries_5(news_data)

For TF-IDF:
 
Evaluation for the summary of 3 sentences: 

Average ROUGE-1 Precision:  0.34813080008022884
Average ROUGE-1 Recall:  0.4652806088247644
Average ROUGE-1 F1-Score:  0.3879817309253674
Average ROUGE-2 Precision:  0.157848574570862
Average ROUGE-2 Recall:  0.21217665838372335
Average ROUGE-2 F1-Score:  0.176232613746748
Average ROUGE-L Precision:  0.24301372704045004
Average ROUGE-L Recall:  0.32288573984484364
Average ROUGE-L F1-Score:  0.2698092123912391


Evaluation for the summary of 5 sentences: 

Average ROUGE-1 Precision:  0.28849736891942535
Average ROUGE-1 Recall:  0.5850701679136406
Average ROUGE-1 F1-Score:  0.3748496948778918
Average ROUGE-2 Precision:  0.14467388988768537
Average ROUGE-2 Recall:  0.29505779412296745
Average ROUGE-2 F1-Score:  0.18801923814000965
Average ROUGE-L Precision:  0.20549825971349794
Average ROUGE-L Recall:  0.41419477891706513
Average ROUGE-L F1-Score:  0.26594790106321636
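
When reading these numbers, note that each reported F1 score is the mean of per-article F1 values, which is not the same as the F1 computed from the mean precision and mean recall. A quick check with the 3-sentence ROUGE-1 figures from the output above:

```python
# Values copied from the 3-sentence ROUGE-1 output above.
mean_precision = 0.34813080008022884
mean_recall = 0.4652806088247644
reported_mean_f1 = 0.3879817309253674

# Harmonic mean of the averaged precision and recall.
f1_of_means = 2 * mean_precision * mean_recall / (mean_precision + mean_recall)
print(round(f1_of_means, 4), round(reported_mean_f1, 4))
```

The harmonic mean of the averages comes out near 0.398, slightly above the reported average of 0.388, so the two quantities should not be used interchangeably.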


# Rough Work

In [None]:
# Tokenize text
tokenized_texts = [word_tokenize(sent) for sent in sent_tokenize(news_data['preprocessed_ctext'][0])]
tokenized_texts

[['the', 'daman', 'and', 'diu', 'administration', 'on', 'wednesday', 'withdrew', 'a', 'circular', 'that', 'asked', 'women', 'staff', 'to', 'tie', 'rakhis', 'on', 'male', 'colleagues', 'after', 'the', 'order', 'triggered', 'a', 'backlash', 'from', 'employees', 'and', 'was', 'ripped', 'apart', 'on', 'social', 'media', '.'],
 ['the', 'union', 'territorys', 'administration', 'was', 'forced', 'to', 'retreat', 'within', '24', 'hours', 'of', 'issuing', 'the', 'circular', 'that', 'made', 'it', 'compulsory', 'for', 'its', 'staff', 'to', 'celebrate', 'rakshabandhan', 'at', 'workplace', '.'],
 ['it', 'has', 'been', 'decided', 'to', 'celebrate', 'the', 'festival', 'of', 'rakshabandhan', 'on', 'august', '7', '.'],
 ['in', 'this', 'connection', 'all', 'offices', 'departments', 'shall', 'remain', 'open', 'and', 'celebrate', 'the',


In [None]:
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform([" ".join(tokens) for tokens in tokenized_texts])

In [None]:
similarity_matrix = cosine_similarity(tfidf_matrix, tfidf_matrix)

In [None]:
len(similarity_matrix)

17

In [None]:
similarity_matrix

array([[1.        , 0.15817213, 0.14458832, 0.21463194, 0.10321352,
        0.22367913, 0.15062308, 0.09771081, 0.        , 0.0643706 ,
        0.01924432, 0.23111835, 0.07939656, 0.09016174, 0.05563854,
        0.10145473, 0.08349548],
       [0.15817213, 1.        , 0.22429982, 0.11193639, 0.15692566,
        0.13770593, 0.12101404, 0.11528361, 0.        , 0.10864859,
        0.04541063, 0.03907747, 0.10969673, 0.11963701, 0.09436176,
        0.12815779, 0.12583561],
       [0.14458832, 0.22429982, 1.        , 0.17947787, 0.08256713,
        0.12299898, 0.06893996, 0.03090215, 0.        , 0.07176602,
        0.03751553, 0.06034897, 0.12402665, 0.12743929, 0.10168019,
        0.17788537, 0.10975404],
       [0.21463194, 0.11193639, 0.17947787, 1.        , 0.05938272,
        0.1345974 , 0.19217785, 0.04184935, 0.        , 0.08598232,
        0.01693518, 0.11392982, 0.03764818, 0.18836864, 0.07344363,
        0.1355215 , 0.18305363],
       [0.10321352, 0.15692566, 0.08256713, 0.059382

In [None]:
damping_factor = 0.85
n_iterations = 500
n_sentences = similarity_matrix.shape[0]
ranks = np.ones(n_sentences) / n_sentences  # Initialize page ranks

for _ in range(n_iterations):
    new_ranks = (1 - damping_factor) / n_sentences + damping_factor * similarity_matrix.T.dot(ranks)
    ranks = new_ranks

ranks

array([9.33596474e+175, 9.14948874e+175, 8.69844213e+175, 9.29087653e+175,
       8.23978914e+175, 1.08944067e+176, 9.55781547e+175, 6.13812095e+175,
       3.00291642e+174, 7.33897628e+175, 2.74858686e+175, 6.25345142e+175,
       6.91873841e+175, 1.01485885e+176, 8.55396315e+175, 9.47741595e+175,
       7.13144352e+175])

In [None]:
top_sentence_indices = ranks.argsort()[-3:][::-1]
print(top_sentence_indices)
predicted_summary = [" ".join(tokenized_texts[i]) for i in sorted(top_sentence_indices)]
predicted_summary

[ 5 13  6]


['the two notifications one mandating the celebration of rakshabandhan left and the other withdrawing the mandate right were issued by the daman and diu administration a day apart .',
 'the circular was withdrawn through a oneline order issued late in the evening by the uts department of personnel and administrative reforms .',
 'in 2014 the year bjp stormed to power at the centre rashtriya swayamsevak sangh rss chief mohan bhagwat said the festival had national significance and should be celebrated widely to protect hindu culture and live by the values enshrined in it .']

In [None]:
smr = " ".join(predicted_summary)
len(smr.split())

94

In [None]:
# Model evaluation function
from rouge_score import rouge_scorer

def evaluate_summaries(news_data):
    scorer = rouge_scorer.RougeScorer(rouge_types=['rouge1', 'rouge2', 'rougeL'])
    rouge_scores = []

    for idx, row in news_data.iterrows():
        scores = scorer.score(target=row['preprocessed_text'], prediction=row['predictedSummary_3'])
        rouge_scores.append(scores)

    avg_rouge1_precision = np.mean([score['rouge1'].precision for score in rouge_scores])
    avg_rouge1_recall = np.mean([score['rouge1'].recall for score in rouge_scores])
    avg_rouge1_f1 = np.mean([score['rouge1'].fmeasure for score in rouge_scores])

    avg_rouge2_precision = np.mean([score['rouge2'].precision for score in rouge_scores])
    avg_rouge2_recall = np.mean([score['rouge2'].recall for score in rouge_scores])
    avg_rouge2_f1 = np.mean([score['rouge2'].fmeasure for score in rouge_scores])

    avg_rougeL_precision = np.mean([score['rougeL'].precision for score in rouge_scores])
    avg_rougeL_recall = np.mean([score['rougeL'].recall for score in rouge_scores])
    avg_rougeL_f1 = np.mean([score['rougeL'].fmeasure for score in rouge_scores])

    print("Average ROUGE-1 Precision: ", avg_rouge1_precision)
    print("Average ROUGE-1 Recall: ", avg_rouge1_recall)
    print("Average ROUGE-1 F1-Score: ", avg_rouge1_f1)
    print("Average ROUGE-2 Precision: ", avg_rouge2_precision)
    print("Average ROUGE-2 Recall: ", avg_rouge2_recall)
    print("Average ROUGE-2 F1-Score: ", avg_rouge2_f1)
    print("Average ROUGE-L Precision: ", avg_rougeL_precision)
    print("Average ROUGE-L Recall: ", avg_rougeL_recall)
    print("Average ROUGE-L F1-Score: ", avg_rougeL_f1)

In [None]:
# Evaluate summaries
evaluate_summaries(news_data)

Average ROUGE-1 Precision:  0.34813080008022884
Average ROUGE-1 Recall:  0.4652806088247644
Average ROUGE-1 F1-Score:  0.3879817309253674
Average ROUGE-2 Precision:  0.157848574570862
Average ROUGE-2 Recall:  0.21217665838372335
Average ROUGE-2 F1-Score:  0.176232613746748
Average ROUGE-L Precision:  0.24301372704045004
Average ROUGE-L Recall:  0.32288573984484364
Average ROUGE-L F1-Score:  0.2698092123912391
