# Text summarization

In [9]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
df=pd.read_csv('C:/users/ylepen/OneDrive - Université Paris-Dauphine/COURS Dauphine/NLP/session 4 Topic modeling/un-general-debates.csv')

In [3]:
df.head()

| Unnamed: 0 | session | year | country | text |
|---|---|---|---|---|
| 0 | 44 | 1989 | MDV | It is indeed a pleasure for me and the member... |
| 1 | 44 | 1989 | FIN | \nMay I begin by congratulating you. Sir, on ... |
| 2 | 44 | 1989 | NER | \nMr. President, it is a particular pleasure ... |
| 3 | 44 | 1989 | URY | \nDuring the debate at the fortieth session o... |
| 4 | 44 | 1989 | ZWE | I should like at the outset to express my del... |


In [4]:
text = df.loc[0,'text']

In [5]:
print(text)

﻿It is indeed a pleasure for me and the members of my delegation to extend to Ambassador Garba our sincere congratulations on his election to the presidency of the forty-fourth session of the General Assembly. His election to this high office is a well-deserved tribute to his personal qualities and experience. I am fully confident that under his able and wise leadership the Assembly will further consolidate the gains achieved during the past year.
My delegation associates itself with previous speakers in expressing its appreciation of the dedicated efforts of his predecessor, His Excellency Mr. Dante Caputo, for the exemplary manner in which he discharged his duties as President of the forty-third session of the General Assembly.
As in previous years, my delegation wishes to note its satisfaction with and gratitude for the assiduous and unrelenting efforts exerted by the Secretary-General of the United Nations in the cause of peace and international harmony. We pay a tribute to him for

##  Extractive methods

### Summarization based on tf-idf

In [6]:
import nltk
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\ylepen\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk import tokenize

Punkt Sentence Tokenizer

This tokenizer divides a text into a list of sentences by using an unsupervised algorithm to build a model for abbreviation words, collocations, and words that start sentences. It must be trained on a large collection of plaintext in the target language before it can be used.

The NLTK data package includes a pre-trained Punkt tokenizer for English.

In [8]:
from nltk.tokenize import sent_tokenize
# Sample text
text_0 = "NLTK is a great NLP toolkit. It makes processing text easy!"
# Tokenize sentences
sentences = sent_tokenize(text_0)
print(sentences)

['NLTK is a great NLP toolkit.', 'It makes processing text easy!']


In [9]:
sentences = tokenize.sent_tokenize(text,language='english')
tfidfVectorizer = TfidfVectorizer()
words_tfidf = tfidfVectorizer.fit_transform(sentences)

In [10]:
print(len(sentences))

121


In [11]:
sentences[0]

'\ufeffIt is indeed a pleasure for me and the members of my delegation to extend to Ambassador Garba our sincere congratulations on his election to the presidency of the forty-fourth session of the General Assembly.'
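
The leading `'\ufeff'` in the first sentence is a Unicode byte order mark (BOM) left over from the source file, not part of the speech. A minimal sketch of how it could be removed before tokenizing, assuming the mark only occurs at the start of the string (`first_sentence` and `cleaned` are illustrative names, not part of the notebook):

```python
# Illustrative string that mimics the first sentence returned above
first_sentence = "\ufeffIt is indeed a pleasure."

# str.lstrip with an explicit character set removes only leading BOM characters
cleaned = first_sentence.lstrip("\ufeff")

print(repr(first_sentence))
print(repr(cleaned))
```

The same call can be applied to `text` before `sent_tokenize`, so that the BOM does not end up in the first sentence of a summary.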

In [12]:
words_tfidf

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 2542 stored elements and shape (121, 987)>

In [13]:
# Parameter to specify the required number of sentences in the summary
num_summary_sentence = 10

In [14]:
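# Score each sentence by the sum of its tf-idf values, then sort the
# sentence indices by score in descending order.
# The sparse sum returns a (n_sentences, 1) matrix, so we flatten it to 1-D.
sent_sum = np.asarray(words_tfidf.sum(axis=1)).ravel()
important_sent = np.argsort(sent_sum)[::-1]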

#### Print the 10 most important sentences in the order they appear in the text

In [21]:
summary_idf=[]

In [22]:
for i in range(0,len(sentences)):
    if i in important_sent[:num_summary_sentence]:
        summary_idf.append(sentences[i])
        print(sentences[i])

My delegation associates itself with previous speakers in expressing its appreciation of the dedicated efforts of his predecessor, His Excellency Mr. Dante Caputo, for the exemplary manner in which he discharged his duties as President of the forty-third session of the General Assembly.
Developments in southern Africa, and more particularly in Namibia with regard to the implementation of the United Nations independence plan, are welcome signals of hope, but amidst the hopes there are still dark reminders of the precariousness of global political reconciliation.
The link between economic development and the environment has recently been recognized and it is encouraging to note the high profile given to environmental issues at the Paris summit meeting of the Group of Seven in July this year, in this regard, it is of particular interest that there is an increasing awareness and acceptance of the fact that certain technologies have a deleterious effect on the environment.
The transition th

In [23]:
print(' '.join(summary_idf))

My delegation associates itself with previous speakers in expressing its appreciation of the dedicated efforts of his predecessor, His Excellency Mr. Dante Caputo, for the exemplary manner in which he discharged his duties as President of the forty-third session of the General Assembly. Developments in southern Africa, and more particularly in Namibia with regard to the implementation of the United Nations independence plan, are welcome signals of hope, but amidst the hopes there are still dark reminders of the precariousness of global political reconciliation. The link between economic development and the environment has recently been recognized and it is encouraging to note the high profile given to environmental issues at the Paris summit meeting of the Group of Seven in July this year, in this regard, it is of particular interest that there is an increasing awareness and acceptance of the fact that certain technologies have a deleterious effect on the environment. The transition that 

In [24]:
summary_tf_idf = ' '.join(summary_idf)
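
To make the scoring used in this section explicit, here is a minimal pure-Python sketch of tf-idf sentence ranking. It is a simplified stand-in, not the scikit-learn implementation: the regex sentence splitter and tokenizer are assumptions made to keep the example dependency-free, the idf uses the smoothed form `ln((1 + n) / (1 + df)) + 1`, and the row normalization that `TfidfVectorizer` applies by default is omitted. The function names (`split_sentences`, `tfidf_scores`, `summarize`) are hypothetical.

```python
import math
import re
from collections import Counter


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tfidf_scores(sentences):
    """Return one score per sentence: the sum of its tf-idf weights."""
    tokenized = [re.findall(r"[a-z0-9]+", s.lower()) for s in sentences]
    n = len(sentences)
    # Document frequency: in how many sentences each word occurs
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))
    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = sum(
            count * (math.log((1 + n) / (1 + doc_freq[word])) + 1)
            for word, count in term_freq.items()
        )
        scores.append(score)
    return scores


def summarize(text, k):
    """Keep the k best-scoring sentences, returned in document order."""
    sentences = split_sentences(text)
    scores = tfidf_scores(sentences)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]


doc = "The cat sat. The dog chased the red ball across the garden. It rained."
print(summarize(doc, 2))
```

Because the score is a sum over the words of a sentence, long sentences tend to rank higher than short ones; dividing by the sentence length is one possible adjustment if that bias is unwanted.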

### LSA algorithm

We use the **sumy** library, which provides:
- a tokenizer and a stemmer (stemming strips word endings with simple rules; it is a cruder alternative to lemmatization)
- a stop word list

https://www.geeksforgeeks.org/nlp/mastering-text-summarization-with-sumy-a-python-library-overview/


In [15]:
num_summary_sentence = 10

In [23]:
from sumy.nlp.tokenizers import Tokenizer
from sumy.nlp.stemmers import Stemmer
from sumy.utils import get_stop_words
from sumy.summarizers.lsa import LsaSummarizer

In [24]:
LANGUAGE = "english"
stemmer = Stemmer(LANGUAGE)

**Step 1**: We convert the input text into a format suitable for summarization using a tokenizer and a parser.

In NLP, parsing usually means analyzing the grammatical structure of a sentence: the roles of its words (nouns, verbs, adjectives) and the relationships between them. Here the term is used more loosely. sumy's `PlaintextParser` does not perform a grammatical analysis; it uses the supplied tokenizer to split the raw string into paragraphs, sentences, and words, and exposes the result as `parser.document`, which is the input expected by the summarizers.


In [20]:
from sumy.parsers.plaintext import PlaintextParser

In [19]:
parser =PlaintextParser.from_string(text,Tokenizer(LANGUAGE))

**Step 2**: We initialize the LSA summarizer

In [20]:
summarizer = LsaSummarizer(stemmer)
summarizer.stop_words = get_stop_words(LANGUAGE)

**Step 3**: We generate a summary with a given number of sentences

In [21]:
summary = summarizer(parser.document,num_summary_sentence)
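
For intuition, the ranking behind LSA-based summarization can be written as follows. This is a sketch of one common formulation (the ranking proposed by Steinberger and Ježek); that `LsaSummarizer` follows exactly this variant is an assumption here, and the symbols $A$, $U$, $\Sigma$, $V$, $k$, and $s_j$ are introduced only for illustration.

```latex
% A is the term-by-sentence matrix (m terms, n sentences), after stemming
% and stop word removal. Its singular value decomposition is
A = U \Sigma V^{\top},
\qquad
\Sigma = \operatorname{diag}(\sigma_1, \sigma_2, \dots),
\quad \sigma_1 \ge \sigma_2 \ge \dots \ge 0 .

% Row j of V describes sentence j in the space of latent topics.
% Keeping the k largest singular values, the score of sentence j is
s_j = \sqrt{\sum_{i=1}^{k} \sigma_i^{2} \, v_{ji}^{2}} .
```

The `num_summary_sentence` sentences with the largest scores $s_j$ are then kept as the summary.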

In [25]:
print(summary)

(<Sentence: My delegation associates itself with previous speakers in expressing its appreciation of the dedicated efforts of his predecessor, His Excellency Mr. Dante Caputo, for the exemplary manner in which he discharged his duties as President of the forty-third session of the General Assembly.>, <Sentence: As in previous years, my delegation wishes to note its satisfaction with and gratitude for the assiduous and unrelenting efforts exerted by the Secretary-General of the United Nations in the cause of peace and international harmony.>, <Sentence: We pay a tribute to him for his untiring efforts to promote conditions conducive to the realization of the noble principles enshrined in the Charter of the United Nations, we praise and congratulate him on the successes the Organization has achieved in recent years.>, <Sentence: We fervently hope that, with the developments taking place in the region and elsewhere, the question of Lebanon will be solved in a manner which will restore its

In [22]:
summary_lsa=[]

In [23]:
for sentence in summary:
    summary_lsa.append(str(sentence))
    print(sentence)

My delegation associates itself with previous speakers in expressing its appreciation of the dedicated efforts of his predecessor, His Excellency Mr. Dante Caputo, for the exemplary manner in which he discharged his duties as President of the forty-third session of the General Assembly.
As in previous years, my delegation wishes to note its satisfaction with and gratitude for the assiduous and unrelenting efforts exerted by the Secretary-General of the United Nations in the cause of peace and international harmony.
We pay a tribute to him for his untiring efforts to promote conditions conducive to the realization of the noble principles enshrined in the Charter of the United Nations, we praise and congratulate him on the successes the Organization has achieved in recent years.
We fervently hope that, with the developments taking place in the region and elsewhere, the question of Lebanon will be solved in a manner which will restore its independence and national integrity, and alleviate

In [24]:
summary_lsa = ' '.join(summary_lsa)
print(summary_lsa)

My delegation associates itself with previous speakers in expressing its appreciation of the dedicated efforts of his predecessor, His Excellency Mr. Dante Caputo, for the exemplary manner in which he discharged his duties as President of the forty-third session of the General Assembly. As in previous years, my delegation wishes to note its satisfaction with and gratitude for the assiduous and unrelenting efforts exerted by the Secretary-General of the United Nations in the cause of peace and international harmony. We pay a tribute to him for his untiring efforts to promote conditions conducive to the realization of the noble principles enshrined in the Charter of the United Nations, we praise and congratulate him on the successes the Organization has achieved in recent years. We fervently hope that, with the developments taking place in the region and elsewhere, the question of Lebanon will be solved in a manner which will restore its independence and national integrity, and alleviate th

### Summarizing a text using TextRank

In [22]:
from sumy.summarizers.text_rank import TextRankSummarizer

In [27]:
parser = PlaintextParser.from_string(text,Tokenizer(LANGUAGE))
summarizer = TextRankSummarizer(stemmer)
summarizer.stop_words = get_stop_words(LANGUAGE)

for sentence in summarizer(parser.document, num_summary_sentence):
    print(str(sentence))

Developments in southern Africa, and more particularly in Namibia with regard to the implementation of the United Nations independence plan, are welcome signals of hope, but amidst the hopes there are still dark reminders of the precariousness of global political reconciliation.
The link between economic development and the environment has recently been recognized and it is encouraging to note the high profile given to environmental issues at the Paris summit meeting of the Group of Seven in July this year, in this regard, it is of particular interest that there is an increasing awareness and acceptance of the fact that certain technologies have a deleterious effect on the environment.
The international political climate and the security perceptions of states, as well as the environment, are actual and potential sacrifices to nuclear weapons.
The transition that many of the world's conflict, are making towards negotiations and understanding owes a great deal to improved relations betwe

In [28]:
summary=summarizer(parser.document, num_summary_sentence)

In [29]:
print(summary)

(<Sentence: Developments in southern Africa, and more particularly in Namibia with regard to the implementation of the United Nations independence plan, are welcome signals of hope, but amidst the hopes there are still dark reminders of the precariousness of global political reconciliation.>, <Sentence: The link between economic development and the environment has recently been recognized and it is encouraging to note the high profile given to environmental issues at the Paris summit meeting of the Group of Seven in July this year, in this regard, it is of particular interest that there is an increasing awareness and acceptance of the fact that certain technologies have a deleterious effect on the environment.>, <Sentence: The international political climate and the security perceptions of states, as well as the environment, are actual and potential sacrifices to nuclear weapons.>, <Sentence: The transition that many of the world's conflict, are making towards negotiations and understa
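
To show the idea behind TextRank, here is a simplified pure-Python sketch: sentences are nodes of a graph, edges are weighted by word overlap, and a PageRank-style iteration assigns each sentence a score. This is not sumy's implementation; the tokenizer, the overlap-based similarity, the normalized `(1 - damping) / n` term, and the names `similarity` and `textrank` are assumptions made for illustration.

```python
import math
import re


def words_of(sentence):
    """Lowercased set of alphanumeric tokens (a deliberately crude tokenizer)."""
    return set(re.findall(r"[a-z0-9]+", sentence.lower()))


def similarity(a, b):
    """Word overlap normalized by the log of the sentence lengths."""
    if len(a) < 2 or len(b) < 2:
        return 0.0  # avoids a zero denominator, since log(1) = 0
    return len(a & b) / (math.log(len(a)) + math.log(len(b)))


def textrank(sentences, damping=0.85, iterations=50):
    """Return one score per sentence from a weighted PageRank iteration."""
    n = len(sentences)
    if n == 0:
        return []
    words = [words_of(s) for s in sentences]
    # Symmetric weight matrix with a zero diagonal (no self-links)
    weights = [
        [similarity(words[i], words[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    out_sum = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            # Each neighbor j passes on a share of its score proportional
            # to the weight of the edge between j and i
            incoming = sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n)
                if out_sum[j] > 0
            )
            new_scores.append((1 - damping) / n + damping * incoming)
        scores = new_scores
    return scores


example = [
    "The cat chased the mouse.",
    "The mouse escaped the cat.",
    "Stocks fell sharply today.",
]
for sentence, score in zip(example, textrank(example)):
    print(f"{score:.3f}  {sentence}")
```

In this toy example the third sentence shares no words with the others, so it receives only the baseline score and would be the first to be dropped from a summary.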

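## Measuring the performance of text summarization methods

- A common evaluation metric is the ROUGE score (Recall-Oriented Understudy for Gisting Evaluation)
    - It compares a candidate summary with a reference summary and is used to evaluate text summarization and translation models
- Several variants of ROUGE exist:
    - ROUGE-1: overlap of unigrams (single words)
    - ROUGE-2: overlap of bigrams (pairs of consecutive words)
    - ROUGE-L: longest common subsequence of words
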

In [2]:
from rouge_score import rouge_scorer
scorer = rouge_scorer.RougeScorer(['rouge1','rouge2','rougeL'],use_stemmer=True)

In [50]:
reference_summary="the cat is on the mat"
candidate_summary="the cat and the dog"

In [51]:
scores=scorer.score(reference_summary,candidate_summary)

In [52]:
for key in scores:
    print(f'{key}: {scores[key]}')

rouge1: Score(precision=0.6, recall=0.5, fmeasure=0.5454545454545454)
rouge2: Score(precision=0.25, recall=0.2, fmeasure=0.22222222222222224)
rougeL: Score(precision=0.6, recall=0.5, fmeasure=0.5454545454545454)
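
The scores above can be reproduced by hand, which makes the definitions concrete. The following is a minimal sketch, not the `rouge_score` implementation: it splits on whitespace and applies no stemming, which is enough for this toy example. The helper names (`ngram_counts`, `prf`, `rouge_n`, `lcs_length`, `rouge_l`) are hypothetical.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def prf(overlap, n_candidate, n_reference):
    """Precision, recall and F-measure from an overlap count."""
    precision = overlap / n_candidate if n_candidate else 0.0
    recall = overlap / n_reference if n_reference else 0.0
    total = precision + recall
    fmeasure = 2 * precision * recall / total if total else 0.0
    return precision, recall, fmeasure


def rouge_n(reference, candidate, n):
    """ROUGE-N: clipped n-gram overlap between reference and candidate."""
    ref = ngram_counts(reference.lower().split(), n)
    cand = ngram_counts(candidate.lower().split(), n)
    overlap = sum((ref & cand).values())  # min count of each shared n-gram
    return prf(overlap, sum(cand.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(reference, candidate):
    """ROUGE-L: based on the longest common subsequence of words."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    return prf(lcs_length(ref, cand), len(cand), len(ref))


reference_summary = "the cat is on the mat"
candidate_summary = "the cat and the dog"
print("rouge1:", rouge_n(reference_summary, candidate_summary, 1))
print("rouge2:", rouge_n(reference_summary, candidate_summary, 2))
print("rougeL:", rouge_l(reference_summary, candidate_summary))
```

For ROUGE-1, three of the five candidate words (`the`, `cat`, `the`) also occur in the six-word reference, giving a precision of 3/5 and a recall of 3/6. For ROUGE-2, only the bigram `the cat` is shared, giving 1/4 and 1/5. The longest common subsequence is `the cat the`, so ROUGE-L equals ROUGE-1 here.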


### Example with a real dataset

In [44]:
df=pd.read_csv('C:/users/ylepen/OneDrive - Université Paris-Dauphine/COURS Dauphine/NLP/session 5 Text Summarization/cnn_dailymail/train.csv')

In [54]:
df.head()

| Unnamed: 0 | id | article | highlights |
|---|---|---|---|
| 0 | 0001d1afc246a7964130f43ae940af6bc6c57f01 | By . Associated Press . PUBLISHED: . 14:11 EST... | Bishop John Folda, of North Dakota, is taking ... |
| 1 | 0002095e55fcbd3a2f366d9bf92a95433dc305ef | (CNN) -- Ralph Mata was an internal affairs li... | Criminal complaint: Cop used his role to help ... |
| 2 | 00027e965c8264c35cc1bc55556db388da82b07f | A drunk driver who killed a young woman in a h... | Craig Eccleston-Todd, 27, had drunk at least t... |
| 3 | 0002c17436637c4fe1837c935c04de47adb18e9a | (CNN) -- With a breezy sweep of his pen Presid... | Nina dos Santos says Europe must be ready to a... |
| 4 | 0003ad6ef0c37534f80b55b4235108024b407f0b | Fleetwood are the only team still to have a 10... | Fleetwood top of League One after 2-0 win at S... |


In [77]:
df.loc[200,'highlights']

'English champion Manchester City posts loss of $158 million for 2011-12 season .\nIts revenues rose to a record $374 million after winning title and playing in Champions League .\nOwner Sheikh Mansour bin Zayed injected $273 million to keep the club debt free .\nCity paid more than $325 million in player wages -- the first English club to reach that level .'

In [78]:
df.loc[200,'article']

'(CNN) -- Big-spending English club Manchester City moved a step closer to meeting European football\'s financial fairplay requirements on Friday despite posting a loss of almost $160 million for last season. City\'s deficit of £97.9 million ($158 million) for 2011-12\'s Premier League-winning campaign was just under half that of the £197.5 million ($318 million) for the previous period -- which was the biggest loss in soccer history. The latest figure represents the fourth highest deficit in the English game -- three of which belong to City since the arrival of its Abu Dhabi owners in 2008. It can be contrasted with the $37 million net profit made by rival Manchester United in 2011-12. United posted a reduced revenue of £320 million ($517 million) for that period, while City closed the gap with a club-record turnover of £231.1 million ($374 million). Both are substantially behind leading Spanish clubs Real Madrid and Barcelona. Chelsea boosted by first profit in Abramovich era . It wa

In [79]:
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk import tokenize

In [80]:
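# Tokenize the article being summarized (not the UN speech still stored in `text`)
sentences = tokenize.sent_tokenize(df.loc[200,'article'])
tfidfVectorizer = TfidfVectorizer()
words_tfidf = tfidfVectorizer.fit_transform(sentences)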

In [81]:
num_summary_sentence = 4

In [82]:
from sumy.summarizers.text_rank import TextRankSummarizer

In [83]:
parser = PlaintextParser.from_string(df.loc[200,'article'],Tokenizer(LANGUAGE))
summarizer = TextRankSummarizer(stemmer)
summarizer.stop_words = get_stop_words(LANGUAGE)

for sentence in summarizer(parser.document, num_summary_sentence):
    print(str(sentence))

(CNN) -- Big-spending English club Manchester City moved a step closer to meeting European football's financial fairplay requirements on Friday despite posting a loss of almost $160 million for last season.
City's deficit of £97.9 million ($158 million) for 2011-12's Premier League-winning campaign was just under half that of the £197.5 million ($318 million) for the previous period -- which was the biggest loss in soccer history.
United posted a reduced revenue of £320 million ($517 million) for that period, while City closed the gap with a club-record turnover of £231.1 million ($374 million).
It is building an academy to try to avoid paying over the odds for star players in the future -- the £201.8 million ($326 million) wage bill for 2011-12 made City the first English club to break £200 million in salaries, according to the Sporting Intelligence website.


In [84]:
summary_exp=[]

for sentence in summarizer(parser.document, num_summary_sentence):
    summary_exp.append(str(sentence))

In [85]:
from rouge_score import rouge_scorer
scorer = rouge_scorer.RougeScorer(['rouge1','rouge2','rougeL'],use_stemmer=True)

In [86]:
# Join with a space so that consecutive sentences are not glued together
summary_exp = ' '.join(summary_exp)

In [87]:
scores=scorer.score(df.loc[200,'highlights'],summary_exp)

In [30]:
print(scores)

{'rouge1': Score(precision=0.24, recall=0.6, fmeasure=0.34285714285714286), 'rouge2': Score(precision=0.08053691275167785, recall=0.2033898305084746, fmeasure=0.11538461538461539), 'rougeL': Score(precision=0.16666666666666666, recall=0.4166666666666667, fmeasure=0.23809523809523808)}
