# Text Summarisation using Natural Language Processing

In practice, there are two main approaches to automatic text summarisation: extraction and abstraction.

Abstraction is the notion of generating a summary based on the text, with words in the summary that are not necessarily present in the text itself. This method however, can be too complex at times. Extraction therefore is more commonly used, and works by selecting certain words or sentences from the text, and creating summaries using them.

In [8]:
import nltk
from nltk import FreqDist
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.stem.wordnet import WordNetLemmatizer
lemma = WordNetLemmatizer()
from nltk.corpus import stopwords
stop = stopwords.words('english')
from bs4 import BeautifulSoup
from urllib.request import urlopen
from gensim.models import Phrases
from gensim.models.phrases import Phraser
import os
from collections import Counter
import string

In [9]:
punctuations = list(string.punctuation)
#Add some more punctuation, as the list doesn't cover all cases.
punctuations.extend(['”', '–', '``', "''"])
stop = stop + punctuations

The core idea behind unsupervised text summarisation is the following:
- split text into sentences;
- tokenise these sentences into separate words;
- assign scores to sentences based on their importance;
- select several top sentences and display them in original order;

The key idea is assigning scores to sentences. Here are some of the ways to do this:
- calculate the similarity between each pair of sentences and select sentences which are most similar to most sentences;
- calculate word frequences, select most frequent words and select sentences which have most of these words;

In this notebook I'll use the following old news article from 2017:

In [10]:
url = urlopen('http://news.sky.com/story/snap-election-to-be-held-in-march-after-northern-ireland-government-collapses-10731488')
soup = BeautifulSoup(url.read().decode('utf8'), "lxml")
text = '\n\n'.join(map(lambda p: p.text, soup.find_all('p')))

text = text[text.find('An early election'):]
title = soup.find('h1').text.strip()
print(title, '\n', '_' * 60, '\n', text)

Snap election to be held in March after Northern Ireland government collapses 
 ____________________________________________________________ 
 An early election will be held in Northern Ireland on 2 March after the collapse of its government, it has been announced.

Northern Ireland Secretary James Brokenshire said the devolved Northern Ireland Assembly will sit for the last time on 25 January, before it is dissolved the following day.

The break-up of the power-sharing government comes amid a dispute between Sinn Fein and the DUP over a botched renewable energy scheme that could have cost the taxpayer £500m.

The "cash for ash" scandal prompted the resignation of deputy first minister Martin McGuinness, who called for DUP first minister Arlene Foster to quit.

She refused, calling Mr McGuinness' actions "not principled" and "purely political".

On Monday afternoon, Sinn Fein announced it would not replace Mr McGuinness - triggering the snap election.

Despite a last-ditch attempt by T

## Calculating the similarity between sentences

In general, this is how it works:

- the text will be split into sentences;
- the split sentences into words/tokens (there are several ways to do it, which give various results);
- calculate similarity between sentences - while there are many ways to do it, I'll use a simple one: comparing tokens in each sentence. Similarity between sentences is calculated as number of words which are present in both sentences divided by average length of sentences (for normalization);
- assign scores to sentences based on their similarity with other sentences - for each sentence get a sum of similarity scores with each other sentence;
- select several best sentences and show them in order, in which they appeare in the article;

At first, we simply split sentences into words, using space as a separator. 

In [11]:
def intersection(sent1, sent2):
    s1 = sent1.split(' ')
    s2 = sent2.split(' ')

    intersection = [i for i in s1 if i in s2]
    #Normalization
    return len(intersection) / ((len(s1) + len(s2)) / 2)

Now, we form a matrix of similarities between each pair of sentences. Spatially, this is a 2D-matrix with a length equal to the number of sentences.

In [14]:
sentences = sent_tokenize(text)
matrix = [[intersection(sentences[i], sentences[j]) for i in range(0,len(sentences))] for j in range(0,len(sentences))]
matrix[:2]

[[1.0,
  0.40816326530612246,
  0.1568627450980392,
  0.08695652173913043,
  0.0,
  0.10256410256410256,
  0.15384615384615385,
  0.07936507936507936,
  0.1791044776119403,
  0.1111111111111111,
  0.1875,
  0.3018867924528302,
  0.12121212121212122,
  0.0,
  0.13793103448275862,
  0.08888888888888889,
  0.17857142857142858,
  0.10256410256410256,
  0.34782608695652173,
  0.4,
  0.0],
 [0.24489795918367346,
  1.0,
  0.10714285714285714,
  0.11764705882352941,
  0.0,
  0.09090909090909091,
  0.17543859649122806,
  0.030534351145038167,
  0.027777777777777776,
  0.1016949152542373,
  0.21621621621621623,
  0.20689655172413793,
  0.21052631578947367,
  0.0,
  0.19047619047619047,
  0.0,
  0.19672131147540983,
  0.09090909090909091,
  0.3137254901960784,
  0.24,
  0.0]]

We calculate the score for each sentence, which is a sum of similarity scores with other sentences.

In [15]:
scores = {sentences[i]: sum(matrix[i]) for i in range(len(matrix))}
scores

{'An early election will be held in Northern Ireland on 2 March after the collapse of its government, it has been announced.': 4.144353911770331,
 'Northern Ireland Secretary James Brokenshire said the devolved Northern Ireland Assembly will sit for the last time on 25 January, before it is dissolved the following day.': 3.5615137735140294,
 'The break-up of the power-sharing government comes amid a dispute between Sinn Fein and the DUP over a botched renewable energy scheme that could have cost the taxpayer £500m.': 4.356835581188402,
 'The "cash for ash" scandal prompted the resignation of deputy first minister Martin McGuinness, who called for DUP first minister Arlene Foster to quit.': 3.7467069848694576,
 'She refused, calling Mr McGuinness\' actions "not principled" and "purely political".': 1.4685232694430241,
 'On Monday afternoon, Sinn Fein announced it would not replace Mr McGuinness - triggering the snap election.': 3.285412517463482,
 'Despite a last-ditch attempt by Theres

We select five best sentences.

In [16]:
sents = sorted(scores, key=scores.__getitem__, reverse=True)[:5]
sents

['He added that the collapse of the power-sharing government was the "greatest challenge to face the Northern Ireland peace process in a decade".',
 'The break-up of the power-sharing government comes amid a dispute between Sinn Fein and the DUP over a botched renewable energy scheme that could have cost the taxpayer £500m.',
 'He said: "This is essential for the operation of devolved government.',
 'Please use Chrome browser for a more accessible video player\n\n\n\nSky News Ireland Correspondent David Blevins said the relationship between Sinn Fein and the DUP had been "slowly breaking down for a period of months".',
 'An early election will be held in Northern Ireland on 2 March after the collapse of its government, it has been announced.']

In [17]:
tuples = [(i, text.find(i)) for i in sents]
sorted_tuples = sorted(tuples, key=lambda x: x[0])
#Leave only sentences.
best_sents = [i[0] for i in sorted_tuples]
best_sents

['An early election will be held in Northern Ireland on 2 March after the collapse of its government, it has been announced.',
 'He added that the collapse of the power-sharing government was the "greatest challenge to face the Northern Ireland peace process in a decade".',
 'He said: "This is essential for the operation of devolved government.',
 'Please use Chrome browser for a more accessible video player\n\n\n\nSky News Ireland Correspondent David Blevins said the relationship between Sinn Fein and the DUP had been "slowly breaking down for a period of months".',
 'The break-up of the power-sharing government comes amid a dispute between Sinn Fein and the DUP over a botched renewable energy scheme that could have cost the taxpayer £500m.']

We can put everything together as follows:

In [18]:
def intersection(sent1, sent2):
    s1 = sent1.split(' ')
    s2 = sent2.split(' ')
    intersection = [i for i in s1 if i in s2]
    return len(intersection) / ((len(s1) + len(s2)) / 2)

def get_summary(text, limit=3):
    sentences = sent_tokenize(text)
    matrix = [[intersection(sentences[i], sentences[j]) for i in range(0,len(sentences))] for j in range(0,len(sentences))]
    scores = {sentences[i]: sum(matrix[i]) for i in range(len(matrix))}
    sents = sorted(scores, key=scores.__getitem__, reverse=True)[:limit]
    best_sents = [i[0] for i in sorted([(i, text.find(i)) for i in sents], key=lambda x: x[0])]
    return best_sents

def summarize(text, limit=3):
    summary = get_summary(text, limit)
    print(title)
    print()
    print(' '.join(summary))

In [19]:
summarize(text,5)

Snap election to be held in March after Northern Ireland government collapses

An early election will be held in Northern Ireland on 2 March after the collapse of its government, it has been announced. He added that the collapse of the power-sharing government was the "greatest challenge to face the Northern Ireland peace process in a decade". He said: "This is essential for the operation of devolved government. Please use Chrome browser for a more accessible video player



Sky News Ireland Correspondent David Blevins said the relationship between Sinn Fein and the DUP had been "slowly breaking down for a period of months". The break-up of the power-sharing government comes amid a dispute between Sinn Fein and the DUP over a botched renewable energy scheme that could have cost the taxpayer £500m.


This is our summary. The number of sentences in the summary is actually arbitrary, and can be changed where appropriate, to get the desired results.

In order to improve this algorithm, I think that splitting sentences while calculating intersections might be changed. If we split by space, then punctuation stays attached to the words, and this yields problems when calculating sentence - wise similarities. 

In [20]:
def intersection(sent1, sent2):
    s1 = [i for i in word_tokenize(sent1) if i not in punctuations and i not in stop]
    s2 = [i for i in word_tokenize(sent2) if i not in punctuations and i not in stop]
    intersection = [i for i in s1 if i in s2]
    return len(intersection) / ((len(s1) + len(s2)) / 2)

In [21]:
summarize(text,5)

Snap election to be held in March after Northern Ireland government collapses

An early election will be held in Northern Ireland on 2 March after the collapse of its government, it has been announced. He added that the collapse of the power-sharing government was the "greatest challenge to face the Northern Ireland peace process in a decade". Please use Chrome browser for a more accessible video player



Sinn Fein and the DUP are expected to remain the two largest parties following the election, meaning they will still have to hammer out a power-sharing arrangement. Please use Chrome browser for a more accessible video player



Sky News Ireland Correspondent David Blevins said the relationship between Sinn Fein and the DUP had been "slowly breaking down for a period of months". The break-up of the power-sharing government comes amid a dispute between Sinn Fein and the DUP over a botched renewable energy scheme that could have cost the taxpayer £500m.


We can see that there are bigrams and trigrams among the most common words. Now I'll use this.

In [13]:
def intersection(sent1, sent2):
    #As sentences are lists of tokens, there is no need to split them.
    intersection = [i for i in sent1 if i in sent2]
    return len(intersection) / ((len(sent1) + len(sent2)) / 2)

def split_sentences(sents):
    sentence_stream = [[i for i in word_tokenize(sent) if i not in stop] for sent in sents]
    bigram = Phrases(sentence_stream, min_count=2, threshold=2, delimiter=b'_')
    bigram_phraser = Phraser(bigram)
    bigram_tokens = bigram_phraser[sentence_stream]
    trigram = Phrases(bigram_tokens,min_count=2, threshold=2, delimiter=b'_')
    trigram_phraser = Phraser(trigram)
    trigram_tokens = trigram_phraser[bigram_tokens]
    return [i for i in trigram_tokens]

def get_summary(text, limit=3):
    sents = sent_tokenize(text)
    sentences = split_sentences(sents)
    matrix = [[intersection(sentences[i], sentences[j]) for i in range(0,len(sentences))] for j in range(0,len(sentences))]
    scores = {sents[i]: sum(matrix[i]) for i in range(len(matrix))}
    sents = sorted(scores, key=scores.__getitem__, reverse=True)[:limit]
    best_sents = [i[0] for i in sorted([(i, text.find(i)) for i in sents], key=lambda x: x[0])]
    return best_sents

In [14]:
summarize(text,5)

Snap election to be held in March after Northern Ireland government collapses

"The botched renewable energy scheme is being blamed for the collapse of the devolved government but it was just the tip of the iceberg." An early election will be held in Northern Ireland on 2 March after the collapse of its government, it has been announced. Announcing the dissolution of the Northern Ireland Assembly, Mr Brokenshire urged both parties "to conduct this election with a view to...re-establishing a partnership government at the earliest opportunity after that poll." He added that the collapse of the power-sharing government was the "greatest challenge to face the Northern Ireland peace process in a decade". The break-up of the power-sharing government comes amid a dispute between Sinn Fein and the DUP over a botched renewable energy scheme that could have cost the taxpayer £500m.


The summary changed again. Numerous ways to split sentences may work better on some types of texts and worse on others.