### In this notebook, we build an extractive text summarization model using [RoBERTa](https://arxiv.org/abs/1907.11692).

Approach:

1. Split the article/passage into a list of sentences using nltk's sentence tokenizer.
2. For each sentence, extract a contextual embedding using Sentence Transformers.
3. Apply K-means clustering to the embeddings. The idea is to group sentences that are contextually similar and to pick, from each cluster, the one sentence closest to the cluster mean (centroid).
4. For each sentence embedding, calculate its distance from the centroid of its cluster. The distance is zero only if the embedding coincides with the centroid.
5. For each cluster, select the sentence whose embedding has the lowest distance from the centroid, and return the selected sentences as the summary in the order in which they appear in the original text.
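
Steps 4 and 5 can be sketched on toy data without any model. This is a minimal sketch, assuming the embeddings and cluster labels are already available; the function name `pick_representatives` and the 2-D toy vectors are hypothetical, and the sketch uses the arithmetic mean and Euclidean distance for simplicity.

```python
import numpy as np

def pick_representatives(embeddings, labels):
    """Return indices of the sentences closest to their cluster mean, in document order."""
    embeddings = np.asarray(embeddings, dtype=float)
    labels = np.asarray(labels)
    chosen = []
    for cluster in np.unique(labels):
        members = np.where(labels == cluster)[0]       # sentence indices in this cluster
        centroid = embeddings[members].mean(axis=0)    # cluster mean
        dists = np.linalg.norm(embeddings[members] - centroid, axis=1)
        chosen.append(int(members[np.argmin(dists)]))  # closest sentence in the cluster
    return sorted(chosen)                              # restore the original sentence order

# Toy 2-D "embeddings" for six sentences spread over two clusters (made-up values)
toy_embeddings = [[5.0, 5.0], [0.0, 0.0], [0.2, 0.0], [5.0, 6.0], [1.0, 0.0], [5.0, 5.4]]
toy_labels = [1, 0, 0, 1, 0, 1]
print(pick_representatives(toy_embeddings, toy_labels))
```

The returned indices point back into the sentence list, so joining the corresponding sentences gives the summary.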

### Sentence transformer

**Sentence Transformers** is a Python library that allows us to represent sentences and paragraphs as dense vectors. The package is compatible with state-of-the-art models such as BERT, RoBERTa and XLM-RoBERTa.

For more details, check [here](https://www.sbert.net/).

**Implementation**

Let's implement the extractive text summarizer.

In [None]:
# Install required libraries
! pip install -U sentence-transformers

Import required libraries

In [None]:
import nltk
import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer 
from nltk.cluster import KMeansClusterer

Initialize the Sentence Transformer with an STS (Semantic Textual Similarity) model.

In [None]:
model = SentenceTransformer('stsb-roberta-base')

We are going to summarize an Airbus article titled "How air-traffic growth offers huge international potential for Indian carriers".   
Reference:   
https://www.airbus.com/newsroom/stories/How-air-traffic-growth-offers-huge-international-potential-for-Indian-carriers.html

In [None]:
article = '''
Airbus shares a long-standing relationship with the Indian aviation industry and has worked collaboratively with the sector to catalyse its growth for the past 50 years.

The Covid pandemic continues to have a huge impact on the aviation sector. In India, with a population of almost 1.3 billion and a growing middle-class eager to travel, domestic aviation is already picking up, while long-haul travel is set to follow over the next few years.

India is the single biggest market for the A320neo Family, with about 900 aircraft ordered between two of the country’s biggest carriers, IndiGo and GoAir. The country, which currently has 99 operational airports, is set to reach 200 airports by 2040. Meanwhile, air cargo is projected to quadruple to 17 million tons in the same timescale.

India is an ideal halfway-point between the Far East and the West, offering an additional pivotal access point to major world capitals and populations. This, together with the global Indian diaspora and immense international business links, means the market for international long-haul travel will also thrive.

Today, the share of revenue captured by Indian carriers serving the country’s international air traffic is just 36%; many of the widebodies serving the Indian market are Airbus aircraft but the proportion flying with Indian carriers is small. With the integration of new-generation Airbus aircraft, providing airlines with lower fuel burn, lower costs and greater range, Indian carriers are in the best position to capture a larger share of this ever-growing market, which is not only profitable but has huge potential for growth.

Airbus can offer Indian carriers two of the most modern, cost-effective and environmentally-sustainable widebody aircraft on the market today, specifically designed for comfort and the passenger in mind.

The A350 is an aircraft with the range, economics, passenger capacity and comfort perfectly suited for long-haul travel, offering direct flights from India to both the US East and West Coasts - potentially linking the strategically-important IT hubs of Silicon Bengaluru and Silicon Valley.

Airbus’ other latest widebody - the A330neo - is the perfect fit for flights from India to Europe’s major commercial centres, as well as destinations in South-East Asia and Australia, including Sydney.

Throughout the Covid crisis, cargo has been a key and stable source of revenue for carriers worldwide. Airlines have been using Airbus’ widebody aircraft for such missions but India’s fleet is mainly composed of single-aisle aircraft. This provides huge potential for Indian carriers to build capacity on cargo to service this growing demand for worldwide freight transportation.

Our most advanced and reliable widebody aircraft could be a strategic enabler for Indian airlines to grow in a profitable way, diversify their operations and become more resilient in their journey to recapturing international market share.
'''

Now we will convert the article into a list of sentences using nltk's sentence tokenizer.

In [None]:
# Download punkt
nltk.download('punkt')
sentences = nltk.sent_tokenize(article)

# Strip white spaces (leading & trailing)
sentences = [sentence.strip() for sentence in sentences]

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Let's check the tokenized sentences.

In [None]:
sentences

['Airbus shares a long-standing relationship with the Indian aviation industry and has worked collaboratively with the sector to catalyse its growth for the past 50 years.',
 'The Covid pandemic continues to have a huge impact on the aviation sector.',
 'In India, with a population of almost 1.3 billion and a growing middle-class eager to travel, domestic aviation is already picking up, while long-haul travel is set to follow over the next few years.',
 'India is the single biggest market for the A320neo Family, with about 900 aircraft ordered between two of the country’s biggest carriers, IndiGo and GoAir.',
 'The country, which currently has 99 operational airports, is set to reach 200 airports by 2040.',
 'Meanwhile, air cargo is projected to quadruple to 17 million tons in the same timescale.',
 'India is an ideal halfway-point between the Far East and the West, offering an additional pivotal access point to major world capitals and populations.',
 'This, together with the global I

Let's convert the above list of sentences to a pandas data frame.

In [None]:
df = pd.DataFrame(sentences, columns = ['sentences'])
df.head()

Unnamed: 0,sentences
0,Airbus shares a long-standing relationship wit...
1,The Covid pandemic continues to have a huge im...
2,"In India, with a population of almost 1.3 bill..."
3,India is the single biggest market for the A32...
4,"The country, which currently has 99 operationa..."


In [None]:
print(f'Number of sentences in our article : {len(df)}')

Number of sentences in our article : 17


Now create a function that takes a sentence as input and returns its dense vector (embedding).

In [None]:
def get_sent_embeddings(sent):
  embeddings = model.encode([sent])
  return embeddings[0]

Apply the above function to get an embedding for each sentence.

In [None]:
df['embeddings'] = df['sentences'].apply(get_sent_embeddings)
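
Calling `encode` once per row works, but `SentenceTransformer.encode` also accepts a list of sentences and returns one vector per sentence, which is usually faster. The following is a minimal sketch, assuming any object that exposes such an `encode` method; the `embed_all` helper and the `DummyEncoder` stand-in are hypothetical and exist only so that the snippet runs without downloading a model.

```python
import numpy as np

class DummyEncoder:
    """Hypothetical stand-in for a SentenceTransformer model (no download needed)."""
    def encode(self, sentences):
        # One fake 3-dimensional vector per sentence: [character count, word count, 1.0]
        return np.array([[len(s), len(s.split()), 1.0] for s in sentences], dtype=float)

def embed_all(encoder, sentences):
    """Encode all sentences in a single call and return a list of 1-D vectors."""
    vectors = encoder.encode(list(sentences))
    return [vec for vec in vectors]

vectors = embed_all(DummyEncoder(), ["Airbus builds aircraft.", "India is growing."])
print(len(vectors), vectors[0].shape)
```

With the real model, the same helper could be called as `embed_all(model, sentences)`.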

In [None]:
df.head()

Unnamed: 0,sentences,embeddings
0,Airbus shares a long-standing relationship wit...,"[0.21784155, 0.16952349, -0.6167233, 0.1363491..."
1,The Covid pandemic continues to have a huge im...,"[-0.49463013, 0.41615656, 0.5399393, 0.0096432..."
2,"In India, with a population of almost 1.3 bill...","[0.28975138, 0.25816405, -0.018161606, 0.51176..."
3,India is the single biggest market for the A32...,"[0.1325182, -0.04266982, 0.018145405, -0.19314..."
4,"The country, which currently has 99 operationa...","[-0.20675133, -0.3792965, -0.58686835, 0.08583..."


Let's cluster the sentence embeddings using nltk's **KMeansClusterer**.

In [None]:
n_clusters = 5     # Number of clusters, i.e. the number of sentences the end user expects in the summary.
iterations = 25    # Passed as `repeats`: the number of randomised clustering trials.

# Convert embeddings into a numpy array
X = np.array(df['embeddings'].tolist())

# Cosine distance measures how dissimilar 2 vectors are, based on the angle between them
kcluster = KMeansClusterer(n_clusters, distance = nltk.cluster.util.cosine_distance,
                           repeats = iterations, avoid_empty_clusters = True)

assigned_clusters = kcluster.cluster(X, assign_clusters = True)

In [None]:
assigned_clusters

[2, 0, 3, 1, 1, 0, 3, 3, 2, 2, 4, 4, 1, 0, 1, 3, 2]
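
The cosine distance used for clustering is one minus the cosine of the angle between two vectors, so it depends only on direction and not on vector length. Below is a minimal NumPy sketch of that formula; the helper name `cosine_dist` is hypothetical and is not the nltk function itself.

```python
import numpy as np

def cosine_dist(u, v):
    """Cosine distance: 1 - (u . v) / (|u| * |v|)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine_dist([1, 0], [2, 0]))   # same direction, different length -> 0.0
print(cosine_dist([1, 0], [0, 3]))   # orthogonal vectors -> 1.0
```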

Next, we store each sentence's cluster label together with the centroid of that cluster, so that we can compute the distance between each sentence embedding and its centroid.

In [None]:
df['Cluster'] = assigned_clusters 
df['Centroid'] = df['Cluster'].apply(lambda x : kcluster.means()[x])

In [None]:
df

Unnamed: 0,sentences,embeddings,Cluster,Centroid
0,Airbus shares a long-standing relationship wit...,"[0.21784155, 0.16952349, -0.6167233, 0.1363491...",2,"[0.2902415, 0.2772183, -0.087056234, 0.1957103..."
1,The Covid pandemic continues to have a huge im...,"[-0.49463013, 0.41615656, 0.5399393, 0.0096432...",0,"[-0.22139701, 0.40296745, 0.40665817, 0.288259..."
2,"In India, with a population of almost 1.3 bill...","[0.28975138, 0.25816405, -0.018161606, 0.51176...",3,"[0.2061173, 0.3206009, -0.12644273, -0.0346485..."
3,India is the single biggest market for the A32...,"[0.1325182, -0.04266982, 0.018145405, -0.19314...",1,"[0.10943105, -0.2057881, -0.17042246, 0.097669..."
4,"The country, which currently has 99 operationa...","[-0.20675133, -0.3792965, -0.58686835, 0.08583...",1,"[0.10943105, -0.2057881, -0.17042246, 0.097669..."
5,"Meanwhile, air cargo is projected to quadruple...","[0.2733606, 0.24424215, 0.40174732, 0.51163584...",0,"[-0.22139701, 0.40296745, 0.40665817, 0.288259..."
6,India is an ideal halfway-point between the Fa...,"[-0.13439874, -0.07330459, -0.6907727, -1.6413...",3,"[0.2061173, 0.3206009, -0.12644273, -0.0346485..."
7,"This, together with the global Indian diaspora...","[0.3611502, 0.6893522, 0.11529731, 0.7016257, ...",3,"[0.2061173, 0.3206009, -0.12644273, -0.0346485..."
8,"Today, the share of revenue captured by Indian...","[0.119512305, -0.07320264, 0.44155404, -0.3226...",2,"[0.2902415, 0.2772183, -0.087056234, 0.1957103..."
9,With the integration of new-generation Airbus ...,"[0.24622929, 0.39527175, -0.12129541, 0.254528...",2,"[0.2902415, 0.2772183, -0.087056234, 0.1957103..."


To compute the distance, we will use scipy's **distance_matrix** function, which returns the Euclidean distance by default.

In [None]:
from scipy.spatial import distance_matrix 

def distance_from_centroid(row):
  dist_matrix = distance_matrix([row['embeddings']], [row['Centroid'].tolist()])[0][0]
  return dist_matrix

In [None]:
df['distance_from_centroid'] = df.apply(distance_from_centroid, axis=1)
df

Unnamed: 0,sentences,embeddings,Cluster,Centroid,distance_from_centroid
0,Airbus shares a long-standing relationship wit...,"[0.21784155, 0.16952349, -0.6167233, 0.1363491...",2,"[0.2902415, 0.2772183, -0.087056234, 0.1957103...",9.632981
1,The Covid pandemic continues to have a huge im...,"[-0.49463013, 0.41615656, 0.5399393, 0.0096432...",0,"[-0.22139701, 0.40296745, 0.40665817, 0.288259...",11.51541
2,"In India, with a population of almost 1.3 bill...","[0.28975138, 0.25816405, -0.018161606, 0.51176...",3,"[0.2061173, 0.3206009, -0.12644273, -0.0346485...",10.491758
3,India is the single biggest market for the A32...,"[0.1325182, -0.04266982, 0.018145405, -0.19314...",1,"[0.10943105, -0.2057881, -0.17042246, 0.097669...",9.578666
4,"The country, which currently has 99 operationa...","[-0.20675133, -0.3792965, -0.58686835, 0.08583...",1,"[0.10943105, -0.2057881, -0.17042246, 0.097669...",15.147627
5,"Meanwhile, air cargo is projected to quadruple...","[0.2733606, 0.24424215, 0.40174732, 0.51163584...",0,"[-0.22139701, 0.40296745, 0.40665817, 0.288259...",13.31949
6,India is an ideal halfway-point between the Fa...,"[-0.13439874, -0.07330459, -0.6907727, -1.6413...",3,"[0.2061173, 0.3206009, -0.12644273, -0.0346485...",11.681203
7,"This, together with the global Indian diaspora...","[0.3611502, 0.6893522, 0.11529731, 0.7016257, ...",3,"[0.2061173, 0.3206009, -0.12644273, -0.0346485...",8.5164
8,"Today, the share of revenue captured by Indian...","[0.119512305, -0.07320264, 0.44155404, -0.3226...",2,"[0.2902415, 0.2772183, -0.087056234, 0.1957103...",10.523816
9,With the integration of new-generation Airbus ...,"[0.24622929, 0.39527175, -0.12129541, 0.254528...",2,"[0.2902415, 0.2772183, -0.087056234, 0.1957103...",7.688102
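
Because the clusters were formed with cosine distance while the distances in this table are Euclidean, the two measures can disagree about which sentence is "closest" to a centroid. The following is a small NumPy sketch of such a disagreement; the vectors and helper names are made up for illustration.

```python
import numpy as np

def euclidean_dist(u, v):
    """Straight-line distance between two vectors."""
    return float(np.linalg.norm(np.asarray(u, dtype=float) - np.asarray(v, dtype=float)))

def cosine_dist(u, v):
    """Cosine distance: 1 - (u . v) / (|u| * |v|)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

centroid = [1.0, 1.0]
a = [3.0, 3.0]   # same direction as the centroid, but far away
b = [1.0, 0.5]   # near the centroid, but pointing in a different direction

print(euclidean_dist(a, centroid), euclidean_dist(b, centroid))  # b is closer
print(cosine_dist(a, centroid), cosine_dist(b, centroid))        # a is closer
```

If consistency with the clustering step matters, one option would be to rank sentences by cosine distance to the centroid instead.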


The final step is to generate the summary. This can be done with the following steps:
1. Group the sentences by the Cluster column.
2. Sort each group in ascending order of the distance_from_centroid column and select the first row (the sentence with the minimum distance from the centroid).
3. Sort the selected sentences by their position in the original text.  

In [None]:
# Pick the closest sentence per cluster, then restore the original sentence order (step 3)
sents = df.sort_values(by = 'distance_from_centroid', ascending=True).groupby('Cluster')
sents = sents.head(1).sort_index()['sentences'].tolist()
sents

['This, together with the global Indian diaspora and immense international business links, means the market for international long-haul travel will also thrive.',
 'With the integration of new-generation Airbus aircraft, providing airlines with lower fuel burn, lower costs and greater range, Indian carriers are in the best position to capture a larger share of this ever-growing market, which is not only profitable but has huge potential for growth.',
 'Airbus can offer Indian carriers two of the most modern, cost-effective and environmentally-sustainable widebody aircraft on the market today, specifically designed for comfort and the passenger in mind.',
 'Throughout the Covid crisis, cargo has been a key and stable source of revenue for carriers worldwide.',
 'Airlines have been using Airbus’ widebody aircraft for such missions but India’s fleet is mainly composed of single-aisle aircraft.']

In [None]:
# Final summary
summary = ' '.join(sents)
summary

'This, together with the global Indian diaspora and immense international business links, means the market for international long-haul travel will also thrive. With the integration of new-generation Airbus aircraft, providing airlines with lower fuel burn, lower costs and greater range, Indian carriers are in the best position to capture a larger share of this ever-growing market, which is not only profitable but has huge potential for growth. Airbus can offer Indian carriers two of the most modern, cost-effective and environmentally-sustainable widebody aircraft on the market today, specifically designed for comfort and the passenger in mind. Throughout the Covid crisis, cargo has been a key and stable source of revenue for carriers worldwide. Airlines have been using Airbus’ widebody aircraft for such missions but India’s fleet is mainly composed of single-aisle aircraft.'

References:
1. https://www.sbert.net/
2. https://www.airbus.com/newsroom/stories/How-air-traffic-growth-offers-huge-international-potential-for-Indian-carriers.html
3. https://www.topbots.com/extractive-text-summarization-using-contextual-embeddings/