In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/bbc-news-summary/BBC News Summary/Summaries/politics/361.txt
/kaggle/input/bbc-news-summary/BBC News Summary/Summaries/politics/245.txt
/kaggle/input/bbc-news-summary/BBC News Summary/Summaries/politics/141.txt
/kaggle/input/bbc-news-summary/BBC News Summary/Summaries/politics/372.txt
/kaggle/input/bbc-news-summary/BBC News Summary/Summaries/politics/333.txt
/kaggle/input/bbc-news-summary/BBC News Summary/Summaries/politics/276.txt
/kaggle/input/bbc-news-summary/BBC News Summary/Summaries/politics/244.txt
/kaggle/input/bbc-news-summary/BBC News Summary/Summaries/politics/175.txt
/kaggle/input/bbc-news-summary/BBC News Summary/Summaries/politics/351.txt
/kaggle/input/bbc-news-summary/BBC News Summary/Summaries/politics/265.txt
/kaggle/input/bbc-news-summary/BBC News Summary/Summaries/politics/178.txt
/kaggle/input/bbc-news-summary/BBC News Summary/Summaries/politics/201.txt
/kaggle/input/bbc-news-summary/BBC News Summary/Summaries/politics/087.txt
/kaggle/input/bbc-news-su

# Table of contents

 1. Problem Statement
 
 2. Methods
 
           Deep Dive Into Extractive Text Summarization Methods
           Graph Based Summarization Understanding
 
 
 3. Initialization
 
 4. EDA
 
 5. Preprocessing
 
           Sentence Tokenization
           Spell Correction
           Sentence Similarity
 
 
 6. Summarization
 
 7. Validation
 
           BLEU Score
           Similarity Score


 8. Summarization With Inbuilt Tool
 
           Sumy


 9. Conclusion

# 1. Problem statement

Text summarization involves creating a shorter version of a text that retains its key information. Humans do this by reading a text and picking out its main points; machines can perform the same task using natural language processing (NLP). Automatic text summarization has many applications, including condensing customer reviews, summarizing news articles, and creating summary reports from business meeting notes. In this notebook, I'll explore this exciting application of NLP in greater depth.

# Methods

**Extractive** summarization involves identifying and extracting the most important sentences from the original text. It is easier to implement and can be done quickly using an unsupervised approach that does not require prior training.

**Abstractive** summarization involves understanding the main ideas of the text and generating a new, summarized version based on that understanding. It has the advantage of being able to generate new text, but it is more complex to implement and requires language generation capabilities.

# Deep Dive Into Extractive Text Summarization Methods

Extractive summarization selects informative sentences from the document, exactly as they appear in the source, according to specific criteria, and assembles them into a summary. The main challenge is deciding which sentences of the input document are significant enough to be included. For this task, sentences are scored on their features: each sentence is first assigned a score, and the sentences are then ranked by score. The sentences with the highest scores are the most likely to be included in the final summary.
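To make the score-then-rank idea concrete, here is a minimal sketch that scores sentences by the average frequency of their content words. The function names, the tiny stop-word list, and the example sentences are assumptions made for illustration; this is not the method used later in this notebook.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real pipelines use a fuller list).
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "to", "and", "in", "it", "on"}

def score_sentences(sentences):
    """Score each sentence by the average document frequency of its content words."""
    tokenized = [
        [w for w in re.findall(r"[a-z0-9]+", s.lower()) if w not in STOP_WORDS]
        for s in sentences
    ]
    freq = Counter(w for words in tokenized for w in words)
    return [
        sum(freq[w] for w in words) / len(words) if words else 0.0
        for words in tokenized
    ]

def top_sentences(sentences, n):
    """Return the n highest-scoring sentences, highest score first."""
    scores = score_sentences(sentences)
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in ranked[:n]]

# Made-up sentences, used only to exercise the functions.
example = [
    "The budget will cut taxes.",
    "Taxes and the budget dominate the election.",
    "It rained on Tuesday.",
]
best = top_sentences(example, 2)
```

The off-topic sentence shares no content words with the others, so it receives the lowest score and is left out of `best`.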

**The following techniques are used for extractive text summarization:**

> 1. Term frequency (TF) and the inverse document frequency (IDF)
> 2. Cluster Based Method
> 3. Text Summarization with Neural Network
> 4. Text Summarization with Fuzzy Logic
> 5. Graph based Method
> 6. Latent Semantic Analysis Method
> 7. Machine Learning approach
> 8. Query based summarization

# Graph Based Summarization

**Of the strategies listed above, we will focus on graph-based methods for summarization.**

The core idea behind this method is to compute the similarity between every pair of sentences and return the sentences that are most similar to the rest of the document. We use cosine similarity as the similarity metric and the TextRank algorithm to rank the sentences by importance.

Before looking at TextRank, it is worth briefly discussing PageRank, the algorithm that inspired it.

PageRank is a graph-based algorithm used by Google to rank web pages in search results. PageRank first builds a graph in which the pages are the vertices and the links between pages are the edges. A PageRank score is then calculated for each page, which is essentially the probability of a user visiting that page.
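As a reference point, the commonly cited form of the PageRank score is given below. The symbols are introduced here for illustration and do not appear elsewhere in this notebook: $N$ is the number of pages, $d$ is the damping factor (often set to 0.85), $M(p_i)$ is the set of pages linking to $p_i$, and $L(p_j)$ is the number of outbound links on page $p_j$.

$$
PR(p_i) = \frac{1 - d}{N} + d \sum_{p_j \in M(p_i)} \frac{PR(p_j)}{L(p_j)}
$$

Each page shares its score equally among the pages it links to, and the $(1 - d)/N$ term models a reader who jumps to a random page.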


**The parallels between TextRank and PageRank are:**

1. Sentences take the place of pages as the vertices of the graph.
2. Similarity between sentences takes the place of links as the edges.
3. Instead of page-visit probabilities, sentence similarities are used to calculate the ranks.
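The three points above can be sketched as a weighted PageRank computed by power iteration over a sentence similarity matrix. This is a dependency-free illustration, not the NetworkX implementation used later; the function name `textrank_scores`, its parameters, and the toy weights are assumptions.

```python
def textrank_scores(weights, damping=0.85, tol=1e-8, max_iter=200):
    """Rank the nodes of a weighted undirected graph by power iteration.

    weights[i][j] is the non-negative edge weight between sentences i and j.
    """
    n = len(weights)
    if n == 0:
        return []
    out_sum = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(max_iter):
        # Score held by nodes without edges is spread evenly over all nodes.
        dangling = sum(scores[j] for j in range(n) if out_sum[j] == 0)
        new_scores = []
        for i in range(n):
            incoming = sum(
                scores[j] * weights[j][i] / out_sum[j]
                for j in range(n)
                if out_sum[j] > 0
            )
            new_scores.append((1 - damping) / n + damping * (incoming + dangling / n))
        converged = sum(abs(a - b) for a, b in zip(new_scores, scores)) < tol
        scores = new_scores
        if converged:
            break
    return scores

# Hypothetical similarity weights: sentence 0 is similar to both others,
# while sentences 1 and 2 are barely related to each other.
sim = [
    [0.0, 0.8, 0.7],
    [0.8, 0.0, 0.1],
    [0.7, 0.1, 0.0],
]
scores = textrank_scores(sim)
```

Sentence 0 is strongly connected to both other sentences, so it ends up with the highest score. The scores always sum to 1, which matches the probability interpretation of PageRank.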

# Initialization : Loading data

In [2]:
import os
import pandas as pd
path_, filename_, category_, article_or_summary_ = [],[],[],[]
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        path_.append(os.path.join(dirname, filename))
        filename_.append(filename)
        category_.append(dirname.split("/")[-1])
        article_or_summary_.append(dirname.split("/")[-2])

In [3]:
df = pd.DataFrame({"path":path_, "filename":filename_, "category":category_, "article_or_summary":article_or_summary_}, columns=["path", "filename", "category", "article_or_summary"])
df

Unnamed: 0,path,filename,category,article_or_summary
0,/kaggle/input/bbc-news-summary/BBC News Summar...,361.txt,politics,Summaries
1,/kaggle/input/bbc-news-summary/BBC News Summar...,245.txt,politics,Summaries
2,/kaggle/input/bbc-news-summary/BBC News Summar...,141.txt,politics,Summaries
3,/kaggle/input/bbc-news-summary/BBC News Summar...,372.txt,politics,Summaries
4,/kaggle/input/bbc-news-summary/BBC News Summar...,333.txt,politics,Summaries
...,...,...,...,...
8895,/kaggle/input/bbc-news-summary/bbc news summar...,380.txt,business,News Articles
8896,/kaggle/input/bbc-news-summary/bbc news summar...,192.txt,business,News Articles
8897,/kaggle/input/bbc-news-summary/bbc news summar...,248.txt,business,News Articles
8898,/kaggle/input/bbc-news-summary/bbc news summar...,004.txt,business,News Articles
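Later cells pick the first row of the "News Articles" subset and the first row of the "Summaries" subset. Whether those two rows describe the same article depends on the order in which `os.walk` lists the files. A small sketch of one way to pair each article with its summary explicitly, assuming the folder layout `.../<News Articles|Summaries>/<category>/<filename>` (the helper name and the short example paths are hypothetical):

```python
from collections import defaultdict
from pathlib import PurePosixPath

def pair_articles_with_summaries(paths):
    """Group file paths by (category, filename) and keep only complete pairs.

    Assumes the layout .../<News Articles|Summaries>/<category>/<filename>.
    """
    grouped = defaultdict(dict)
    for p in paths:
        parts = PurePosixPath(p).parts
        kind, category, filename = parts[-3], parts[-2], parts[-1]
        grouped[(category, filename)][kind] = p
    return {
        key: (found["News Articles"], found["Summaries"])
        for key, found in grouped.items()
        if "News Articles" in found and "Summaries" in found
    }

# Hypothetical paths that only mimic the dataset layout.
paths = [
    "data/Summaries/politics/361.txt",
    "data/News Articles/politics/361.txt",
    "data/News Articles/business/004.txt",
]
pairs = pair_articles_with_summaries(paths)
```

Here only `politics/361.txt` has both an article and a summary, so it is the only key in `pairs`.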


# EDA

In [4]:
!pip install cufflinks
import plotly_express as pe
import cufflinks as cf

cf.go_offline()




A NumPy version >=1.16.5 and <1.23.0 is required for this version of SciPy (detected version 1.23.5



**Distribution of Number of Articles in Each Category**

In [5]:
from collections import Counter

ct = Counter(df[df['article_or_summary']=="News Articles"]["category"])
pd.DataFrame({"category":ct.keys(), "value":ct.values()}).iplot(kind='bar', x='category', y='value')

**Distribution of Category and its Values**

In [6]:
pd.DataFrame({"category":ct.keys(), "value":ct.values()}).iplot(kind='box')

**Distribution Size of Each Category**

In [7]:
pd.DataFrame({"category":ct.keys(), "value":ct.values()}).iplot(kind='bubble', x='category', y='value', size='value')

**Coverage Ratio of Each Category**

In [8]:
pd.DataFrame({"category":ct.keys(), "value":ct.values()}).iplot(kind='pie', labels="category", values='value')

# Preprocessing: 1. Sentence Tokenization

In [9]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize
import numpy as np
import networkx as nx
import re

In [10]:
def read_article(text):
    # Split the article into sentences with NLTK's sentence tokenizer.
    # No character-level cleaning is applied: punctuation and symbols are
    # kept so that later steps see the text as written.
    return sent_tokenize(text)
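If non-alphanumeric characters should be stripped before further processing, note that `str.replace` matches literal text and returns a new string, so a regex-style pattern needs `re.sub` and the result has to be kept. A minimal sketch with a made-up sentence:

```python
import re

sentence = "Mr Brown has about £2bn to spare!"

# str.replace looks for the literal text "[^a-zA-Z0-9]", which is absent,
# and it returns a new string rather than modifying the original.
literal = sentence.replace("[^a-zA-Z0-9]", " ")

# re.sub interprets the pattern as a character class.
cleaned = re.sub(r"[^a-zA-Z0-9]", " ", sentence)
# Collapse the runs of spaces left behind by the substitution.
cleaned = re.sub(r"\s+", " ", cleaned).strip()
```

Whether this kind of cleaning helps depends on the downstream model; sentence encoders are generally trained on natural, punctuated text.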

In [11]:
file_path = df[df['article_or_summary']=='News Articles'].iloc[0]['path']
with open(file_path, "r") as f:
    article = f.read()

In [12]:
sent_tok = read_article(article)
sent_tok

["Budget to set scene for election\n\nGordon Brown will seek to put the economy at the centre of Labour's bid for a third term in power when he delivers his ninth Budget at 1230 GMT.",
 'He is expected to stress the importance of continued economic stability, with low unemployment and interest rates.',
 'The chancellor is expected to freeze petrol duty and raise the stamp duty threshold from £60,000.',
 'But the Conservatives and Lib Dems insist voters face higher taxes and more means-testing under Labour.',
 'Treasury officials have said there will not be a pre-election giveaway, but Mr Brown is thought to have about £2bn to spare.',
 "- Increase in the stamp duty threshold from £60,000 \n - A freeze on petrol duty \n - An extension of tax credit scheme for poorer families \n - Possible help for pensioners The stamp duty threshold rise is intended to help first time buyers - a likely theme of all three of the main parties' general election manifestos.",
 'Ten years ago, buyers had a m

# 2. Spell Correction

In [13]:
from textblob import TextBlob
mod_sent = []
for tok in sent_tok:
    blob_obj = TextBlob(tok)
    correct_sent = str(blob_obj.correct())
    print(f"\033[94m Original Token : {tok} \033[0m")
    print(f"\033[92m Corrected Token: {correct_sent} \033[0m")
    mod_sent.append(correct_sent)

Original Token : Budget to set scene for election

Gordon Brown will seek to put the economy at the centre of Labour's bid for a third term in power when he delivers his ninth Budget at 1230 GMT.
Corrected Token: Budget to set scene for election

Gordon Grown will seek to put the economy at the centre of Labour's bid for a third term in power when he delivers his ninth Budget at 1230 GMT.
Original Token : He is expected to stress the importance of continued economic stability, with low unemployment and interest rates.
Corrected Token: He is expected to stress the importance of continued economic stability, with low unemployment and interest rates.
Original Token : The chancellor is expected to freeze petrol duty and raise the stamp duty threshold from £60,000.
Corrected Token: The chancellor is expected to freeze petrol duty and raise the stamp duty threshold from £60,000.
Original Token : But the Conservatives 

**Modified Sentences**

In [14]:
" ".join(mod_sent)

'Budget to set scene for election\n\nGordon Grown will seek to put the economy at the centre of Labour\'s bid for a third term in power when he delivers his ninth Budget at 1230 GMT. He is expected to stress the importance of continued economic stability, with low unemployment and interest rates. The chancellor is expected to freeze petrol duty and raise the stamp duty threshold from £60,000. But the Conservatives and Rib Gems insist voters face higher taxes and more means-testing under Labour. Treasury officials have said there will not be a pre-election giveaway, but Or Grown is thought to have about £in to spare. - Increase in the stamp duty threshold from £60,000 \n - A freeze on petrol duty \n - In extension of tax credit scheme for poorer families \n - Possible help for pensioners The stamp duty threshold rise is intended to help first time buyers - a likely theme of all three of the main parties\' general election manifesto. Men years ago, buyers had a much greater chance of avo
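The output above shows that automatic correction can damage text that was already right: "Gordon Brown" became "Gordon Grown", "Lib Dems" became "Rib Gems", and "£2bn" became "£in". One possible guard, sketched below under stated assumptions, is to skip capitalised tokens and tokens containing digits. The helper `correct_preserving_names` and the `toy_corrector` stand-in are hypothetical and exist only for illustration.

```python
import re

def correct_preserving_names(sentence, correct_word):
    """Apply correct_word to lowercase-initial words only.

    correct_word is any callable mapping a word to its correction; capitalised
    tokens and tokens containing digits are left untouched.
    """
    def fix(match):
        word = match.group(0)
        if word[0].isupper() or any(ch.isdigit() for ch in word):
            return word
        return correct_word(word)
    return re.sub(r"[A-Za-z0-9']+", fix, sentence)

# Hypothetical stand-in for a real spell checker, used only for illustration.
def toy_corrector(word):
    return {"econmy": "economy", "brown": "grown"}.get(word, word)

fixed = correct_preserving_names("Gordon Brown will put the econmy first", toy_corrector)
```

A real corrector could be plugged in as the callable, for example `lambda w: str(TextBlob(w).correct())`. The trade-off is that genuine typos in capitalised words, including the first word of each sentence, are no longer corrected.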

# Sentence Similarity

In [15]:
import tensorflow_hub as hub

embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

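def sentence_similarity(sent1,sent2,embed):  
    # Embed both sentences with the Universal Sentence Encoder.
    A = embed([sent1])[0]
    B = embed([sent2])[0]
    # Note: this returns 1 - cosine similarity, i.e. the cosine distance,
    # so values near 0 mean the sentences are close in meaning.
    return 1 - (np.dot(A,B)/(np.linalg.norm(A)*np.linalg.norm(B)))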





In [16]:
print(f"\033[92m Sentence 1 : {mod_sent[0]} \033[0m")
print(f"\033[92m Sentence 2 : {mod_sent[1]} \033[0m")
print(f"\033[92m Similarity Score : {sentence_similarity(mod_sent[0], mod_sent[1], embed)} \033[0m")

Sentence 1 : Budget to set scene for election

Gordon Grown will seek to put the economy at the centre of Labour's bid for a third term in power when he delivers his ninth Budget at 1230 GMT.
Sentence 2 : He is expected to stress the importance of continued economic stability, with low unemployment and interest rates.
Similarity Score : 0.7819880545139313


In [17]:
def build_similarity_matrix(sentences,embeds):
    similarity_matrix = np.zeros((len(sentences),len(sentences)))
    for idx1 in range(len(sentences)):
        for idx2 in range(len(sentences)):
            if idx1!=idx2:
                similarity_matrix[idx1][idx2] = sentence_similarity(sentences[idx1],sentences[idx2],embeds)
    return similarity_matrix
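The `sentence_similarity` function that fills this matrix computes `1 - cos`, which is the cosine distance rather than the cosine similarity. The value of about 0.78 printed earlier for the first two sentences therefore corresponds to a cosine similarity of about 0.22. The sketch below separates the two quantities on toy vectors; the function names and vectors are illustrative assumptions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)

def cosine_distance(a, b):
    """1 - cosine similarity: 0.0 means same direction."""
    return 1.0 - cosine_similarity(a, b)

# Toy vectors, not real sentence embeddings.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

For TextRank, edge weights are usually meant to grow with similarity, so it is worth deciding deliberately which of the two quantities goes into the matrix.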

In [18]:
sim_mat = build_similarity_matrix(mod_sent, embed)
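`build_similarity_matrix` calls the encoder twice for every ordered pair of sentences, i.e. 2·n·(n−1) times for n sentences. A possible optimisation, sketched here with a hypothetical stand-in embedder, is to embed each sentence once and reuse the vectors. Note that this sketch stores cosine similarity (not `1 - cos`) and uses plain Python lists instead of NumPy arrays; both are choices made for the illustration.

```python
import math

def build_similarity_matrix_cached(sentences, embed_fn):
    """Embed each sentence once, then fill a symmetric cosine-similarity matrix."""
    vectors = [embed_fn(s) for s in sentences]
    norms = [math.sqrt(sum(x * x for x in v)) for v in vectors]
    n = len(sentences)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dot = sum(x * y for x, y in zip(vectors[i], vectors[j]))
            matrix[i][j] = matrix[j][i] = dot / (norms[i] * norms[j])
    return matrix

# Hypothetical embedder that counts its own calls; a real one would be a model.
calls = {"count": 0}

def toy_embed(sentence):
    calls["count"] += 1
    return [float(len(sentence)), float(sentence.count(" ") + 1), 1.0]

matrix = build_similarity_matrix_cached(["a b c", "d e", "f"], toy_embed)
```

With three sentences the stand-in embedder is called three times instead of twelve. With the encoder loaded above, passing the whole list in a single call, `embed(sentences)`, should reduce the overhead further.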

In [19]:
from bokeh.io import output_notebook, show, save
from bokeh.models import Range1d, Circle, ColumnDataSource, MultiLine
from bokeh.plotting import figure
from bokeh.plotting import from_networkx
import networkx

output_notebook()

g = nx.Graph()

for i in range(sim_mat.shape[0]):
    for j in range(sim_mat.shape[1]):
        if sim_mat[i][j] >=.9:
            g.add_edge(i, j)

HOVER_TOOLTIPS = [("sent_tok", "@index")]
plot = figure(tooltips = HOVER_TOOLTIPS, tools="pan,wheel_zoom,save,reset", active_scroll='wheel_zoom',x_range=Range1d(-10.1, 10.1), y_range=Range1d(-10.1, 10.1))

network_graph = from_networkx(g, networkx.spring_layout, scale=7, center=(0, 0))
network_graph.node_renderer.glyph = Circle(size=15,fill_color='green')
network_graph.edge_renderer.glyph = MultiLine(line_alpha=0.5, line_width=1)
plot.renderers.append(network_graph)
show(plot)
            
            

# Summarization

The summary function below combines the steps covered above: it collects the top N most relevant sentences to summarize an entire article.

Steps:

1. Read the article and extract its text.
2. Generate sentence tokens.
3. Compute the cosine similarity between sentences.
4. Use NetworkX to build the sentence similarity graph.
5. Rank the sentences with the PageRank method.
6. Collect the top N sentences and present them as the summary of the article.

Note: The steps above apply to the extractive strategy for text summarization.

In [20]:
file_path_summary = df[df['article_or_summary']=='Summaries'].iloc[0]['path']
with open(file_path_summary, "r") as f:
    actual_summary = f.read()

In [21]:
def generate_summary(text,top_n,embeds):
    sentences = read_article(text)
    sentence_similarity_matrix = build_similarity_matrix(sentences,embeds)
    sentence_similarity_graph = nx.from_numpy_array(sentence_similarity_matrix)
    # PageRank returns a dict mapping each sentence index to its score.
    scores = nx.pagerank(sentence_similarity_graph)
    ranked_sentences = sorted(((scores[i],s) for i,s in enumerate(sentences)),reverse=True)
    # Slicing avoids an IndexError when the text has fewer than top_n sentences.
    summarize_text = [s for _, s in ranked_sentences[:top_n]]
    return " ".join(summarize_text)
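`generate_summary` joins the selected sentences in rank order, so the summary can jump back and forth through the article. A common alternative is to pick the top N by score and then restore document order. The helper below is a hypothetical sketch of that selection step; `scores` can be a list or the dict returned by `nx.pagerank`.

```python
def select_top_sentences(sentences, scores, top_n):
    """Pick the top_n highest-scoring sentences and return them in document order."""
    top_n = max(0, min(top_n, len(sentences)))
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:top_n])
    return [sentences[i] for i in keep]

# Hypothetical scores for four placeholder sentences.
sentences = ["S0", "S1", "S2", "S3"]
scores = [0.10, 0.40, 0.20, 0.30]
summary = " ".join(select_top_sentences(sentences, scores, top_n=2))
```

Here `S1` and `S3` have the two highest scores and appear in their original order. Asking for more sentences than exist simply returns all of them.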

In [22]:
pip install scipy

Note: you may need to restart the kernel to use updated packages.


In [23]:
!pip install networkx



In [24]:
pip install pyg-nightly

Collecting pyg-nightly
  Downloading pyg_nightly-2.4.0.dev20230717-py3-none-any.whl (970 kB)
Installing collected packages: pyg-nightly
Successfully installed pyg-nightly-2.4.0.dev20230717
Note: you may need to restart the kernel to use updated packages.


In [25]:
pip install scipy

Note: you may need to restart the kernel to use updated packages.


In [26]:
pip install --upgrade networkx==2.6

Collecting networkx==2.6
  Downloading networkx-2.6-py3-none-any.whl (1.9 MB)
Reason for being yanked: Need to resolve: https://github.com/networkx/networkx/pull/4967
Installing collected packages: networkx
  Attempting uninstall: networkx
    Found existing installation: networkx 3.1
    Uninstalling networkx-3.1:
      Successfully uninstalled networkx-3.1
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
momepy 0.6.0 requires shapely>=2, but you have shapely 1.8.5.post1 which is incompatible.
scikit-image 0.21.0 requires networkx>=2.8, but you have networkx 2.6 which is incompatible.
ydata-profiling 4.3.1 requires scipy<1.11,>=1.4.1, but you have scipy 1.11.1 which is incompatible.
Successfully insta

In [27]:
pip install --upgrade scipy==1.8.0

Collecting scipy==1.8.0
  Downloading scipy-1.8.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (42.3 MB)
Installing collected packages: scipy
  Attempting uninstall: scipy
    Found existing installation: scipy 1.11.1
    Uninstalling scipy-1.11.1:
      Successfully uninstalled scipy-1.11.1
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
momepy 0.6.0 requires shapely>=2, but you have shapely 1.8.5.post1 which is incompatible.
pymc3 3.11.5 requires numpy<1.22.2,>=1.15.0, but you have numpy 1.23.5 which is incompatible.
pymc3 3.11.5 requires scipy<1.8.0,>=1.7.3, but you have scipy 1.8.0 which is incompatible.
scikit-image 0.21.0 requires networkx>=2.8, but you have networkx 2.6 which is incompatible.

In [28]:
Original_Text = " ".join(mod_sent)
Summarized_Text = generate_summary(Original_Text, top_n=5, embeds=embed)

Custom TB Handler failed, unregistering


Unexpected exception formatting exception. Falling back to standard exception


Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3508, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "/tmp/ipykernel_20/3785386790.py", line 2, in <module>
    Summarized_Text = generate_summary(Original_Text, top_n=5, embeds=embed)
  File "/tmp/ipykernel_20/1773938359.py", line 6, in generate_summary
    scores = nx.pagerank(sentence_similarity_graph)
  File "/opt/conda/lib/python3.10/site-packages/networkx/classes/backends.py", line 148, in wrapper
  File "/opt/conda/lib/python3.10/site-packages/networkx/algorithms/link_analysis/pagerank_alg.py", line 110, in pagerank
  File "/opt/conda/lib/python3.10/site-packages/networkx/algorithms/link_analysis/pagerank_alg.py", line 461, in _pagerank_scipy
    S = np.array(M.sum(axis=1)).flatten()
  File "/opt/conda/lib/python3.10/site-packages/networkx/convert_matrix.py", line 593, in to_scipy_sparse_array
AttributeError: module 'scipy.sparse' h
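The truncated traceback ends inside NetworkX with an `AttributeError` on `scipy.sparse`, which is consistent with a NetworkX/SciPy version mismatch after the installs above. As pip's own note says, the kernel may need a restart before updated packages take effect. A small, hedged sketch for checking which versions are installed on disk (the helper name is an assumption):

```python
from importlib import metadata

def installed_versions(packages):
    """Return {package: version string, or None if the package is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

report = installed_versions(["networkx", "scipy", "numpy"])
for name, version in report.items():
    print(f"{name}: {version or 'not installed'}")
```

These are the versions found on disk. Modules that were already imported in the running kernel keep their old code until the kernel is restarted.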

In [None]:
Original_Text

In [None]:
Summarized_Text

In [None]:
actual_summary

# Validation

There are multiple ways to compare two texts and measure how close a generated summary is to the reference:


    1. N-grams/BLEU score: mostly used in translation.
    2. Similarity score between two texts: mostly used for summary comparison or similar word/sentence search.


In our case the second option is the better fit, but we will implement both and compare the scores.

# N-Grams/BLEU Score

In [None]:
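from nltk.translate.bleu_score import sentence_bleu

# sentence_bleu expects token lists: a list of tokenized references and a
# tokenized hypothesis. Raw strings would be compared character by character.
hypothesis = Summarized_Text.split()
reference = actual_summary.split()
BLEUscore = sentence_bleu([reference], hypothesis)
print(f"BLEUscore : {BLEUscore}")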

The score was only about 31%. That does not mean the summary is wrong: BLEU compares overlapping words and n-grams, not context or semantic meaning.

Two summaries can therefore mean the same thing and still receive a low BLEU score. For this reason the metric is mainly used for translation and is a poor fit for this particular case.
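BLEU is built on clipped n-gram precision, which it combines across several n-gram sizes and adjusts with a brevity penalty. The sketch below implements only the precision component, to show why a paraphrase with no shared words scores zero. The function name and example sentences are illustrative assumptions.

```python
from collections import Counter

def ngram_precision(hypothesis, reference, n):
    """Clipped n-gram precision of hypothesis against one reference (both token lists)."""
    hyp_ngrams = Counter(tuple(hypothesis[i:i + n]) for i in range(len(hypothesis) - n + 1))
    ref_ngrams = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    total = sum(hyp_ngrams.values())
    if total == 0:
        return 0.0
    # Each hypothesis n-gram counts at most as often as it occurs in the reference.
    matched = sum(min(count, ref_ngrams[gram]) for gram, count in hyp_ngrams.items())
    return matched / total

# Made-up sentences: a paraphrase and an exact copy of the reference.
reference = "the chancellor will freeze petrol duty".split()
paraphrase = "fuel tax is set to stay unchanged".split()
copy = "the chancellor will freeze petrol duty".split()

p_paraphrase = ngram_precision(paraphrase, reference, 1)
p_copy = ngram_precision(copy, reference, 2)
```

The paraphrase conveys a similar message but shares no words with the reference, so its unigram precision is 0, while the exact copy reaches 1.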

# Similarity Score

The definition below is the same one used in the steps above.

In [None]:
def sentence_similarity(sent1,sent2,embed):  
    A = embed([sent1])[0]
    B = embed([sent2])[0]
    return 1 - (np.dot(A,B)/(np.linalg.norm(A)*np.linalg.norm(B)))

In [None]:
print(f"Sentence Similarity Score : {sentence_similarity(Summarized_Text, actual_summary, embed)}")

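For our use case this gives a more favourable score than BLEU, about 56.3%. Keep in mind that `sentence_similarity` as defined returns 1 - cosine similarity, so lower values indicate a closer match.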

# Summarization With Sumy

In [None]:
!pip install  sumy
import sumy
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lex_rank import LexRankSummarizer

In [None]:
# For Strings
parser = PlaintextParser.from_string(Original_Text,Tokenizer("english"))

summarizer = LexRankSummarizer()
# Summarize the document with 5 sentences
summary = summarizer(parser.document, 5)

for sentence in summary:
    print(sentence)

# Conclusion

To wrap up the implementation, here is a summary of what we have done and achieved so far:


Summary:

     1. We collected BBC articles and their summaries for reference and comparison.
     2. We surveyed multiple methods and techniques used for text summarization, including extractive and abstractive methods.
     3. We took a deep dive into extractive methodologies.
     4. We picked the graph-based method for extractive text summarization.
     5. We converted the article into sentence tokens.
     6. We computed the similarity matrix for graph creation.
     7. We used the PageRank algorithm to rank the sentence tokens and selected the top N to represent the summary.
     8. For validation we implemented both the BLEU score and the similarity score, and found that for our case BLEU is not suitable while the similarity score is much more reliable.
     

 
 
     