# Text Summarization

Text summarization is the task of condensing a long piece of text into a shorter one. The goal is a coherent and fluent summary that keeps only the main points of the document.
Automatic text summarization is a common problem in machine learning and natural language processing (NLP).

There are broadly two different approaches that are used for text summarization:

* Extractive Summarization
* Abstractive Summarization


In this notebook, we will build an extraction-based text summarizer using Python.

### Extractive Summarization
The name gives away what this approach does: we identify the important sentences or phrases in the original text and extract only those. The extracted sentences form our summary. The diagram below illustrates extractive summarization:

![Extractive_Summarization.png](images/Extractive_Summarization.png)

# Problem Statement

* Generating short descriptions (headlines) from text (news articles).
* Summarizing a large amount of information so that it can be represented in a compressed form.

The objective here is to generate a summary for the news articles using the extraction-based approach. You can download the dataset from [here](https://www.kaggle.com/sunnysai12345/news-summary).

# TextRank Algorithm

TextRank was inspired by the PageRank algorithm, which is used primarily for ranking web pages in online search results. PageRank gives a page a high score when many pages link to it, and more so when those linking pages are themselves highly ranked.
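
As a small illustration of PageRank, here is a minimal sketch of the power-iteration form of the algorithm on a made-up four-page web. The function name `pagerank`, the damping value of 0.85 and the example link matrix are assumptions chosen for illustration; later in this notebook the actual ranking is done with networkx.

```python
import numpy as np

def pagerank(adj, damping=0.85, tol=1e-8, max_iter=200):
    """Power-iteration PageRank on a dense adjacency matrix.

    adj[i, j] > 0 means page i links to page j.
    """
    adj = np.asarray(adj, dtype=float)
    n = adj.shape[0]
    out_degree = adj.sum(axis=1, keepdims=True)
    # Row-normalize the links; a page with no outgoing links spreads its rank uniformly.
    safe_degree = np.where(out_degree > 0, out_degree, 1.0)
    transition = np.where(out_degree > 0, adj / safe_degree, 1.0 / n)
    rank = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        # Each page keeps a small baseline and receives rank from pages linking to it.
        new_rank = (1 - damping) / n + damping * (rank @ transition)
        if np.abs(new_rank - rank).sum() < tol:
            return new_rank
        rank = new_rank
    return rank

# Hypothetical web: pages A, B and C all link to D, and D links back to A.
links = np.array([
    [0, 0, 0, 1],  # A -> D
    [0, 0, 0, 1],  # B -> D
    [0, 0, 0, 1],  # C -> D
    [1, 0, 0, 0],  # D -> A
])
pagerank_scores = pagerank(links)
print(pagerank_scores)
```

In this toy web, D receives the most links and ends up with the highest score, A comes second because the highly ranked D links to it, and B and C (which nobody links to) share the lowest score. TextRank applies the same idea with sentences in place of pages and similarity scores in place of links.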

TextRank is an extractive and unsupervised text summarization technique. Let’s take a look at the flow of the TextRank algorithm that we will be following:

* The first step is to concatenate all the text contained in the articles
* Then split the text into individual sentences
* In the next step, we will find a vector representation (built from word embeddings) for each sentence
* Similarities between sentence vectors are then calculated and stored in a matrix
* The similarity matrix is then converted into a graph, with sentences as vertices and similarity scores as edge weights, for sentence rank calculation
* Finally, a certain number of top-ranked sentences form the final summary

![graph.png](images/graph.png)

# Import the Libraries

In [0]:
project_path = "/text_summarization/"

In [0]:
import numpy as np
import pandas as pd
import nltk
nltk.download('punkt') # one time execution
import re

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


# Read the dataset 

In [0]:
news_summary=pd.read_csv(project_path+"news_summary.csv", encoding='iso-8859-1')
news_summary_more=pd.read_csv(project_path+"news_summary_more.csv", encoding='iso-8859-1')

In [0]:
news_summary.head(2)

Unnamed: 0,author,date,headlines,read_more,text,ctext
0,Chhavi Tyagi,"03 Aug 2017,Thursday",Daman & Diu revokes mandatory Rakshabandhan in...,http://www.hindustantimes.com/india-news/raksh...,The Administration of Union Territory Daman an...,The Daman and Diu administration on Wednesday ...
1,Daisy Mowke,"03 Aug 2017,Thursday",Malaika slams user who trolled her for 'divorc...,http://www.hindustantimes.com/bollywood/malaik...,Malaika Arora slammed an Instagram user who tr...,"From her special numbers to TV?appearances, Bo..."


In [0]:
news_summary_more.head(2)

Unnamed: 0,headlines,text
0,upGrad learner switches to career in ML & Al w...,"Saurav Kant, an alumnus of upGrad and IIIT-B's..."
1,Delhi techie wins free food from Swiggy for on...,Kunal Shah's credit card bill payment platform...


In [0]:
temp_df = news_summary.copy()
temp_df["text"] = temp_df["text"].str.cat(temp_df['ctext'], sep = " ")

In [0]:
data = pd.DataFrame()
data['Text'] = pd.concat([news_summary_more['text'], temp_df['text']], ignore_index=True)
data['Summary'] = pd.concat([news_summary_more['headlines'],temp_df['headlines']],ignore_index = True)

In [0]:
limited_data = data.iloc[:10,:]

In [0]:
limited_data.shape

(10, 2)

In [0]:
limited_data["Text"][3]

'With Aegon Life iTerm Insurance plan, customers can enjoy tax benefits on your premiums paid and save up to Ã¢Â\x82Â¹46,800^ on taxes. The plan provides life cover up to the age of 100 years. Also, customers have options to insure against Critical Illnesses, Disability and Accidental Death Benefit Rider with a life cover up to the age of 80 years.'

In [0]:
limited_data["Text"][7]

'Weeks after ex-CBI Director Alok Verma told the Department of Personnel and Training to consider him retired, the Home Ministry asked him to join work on the last day of his fixed tenure as Director on Thursday. The ministry directed him to immediately join as DG, Fire Services, the post he was transferred to after his removal as CBI chief.'

# Split Text into Sentences
The next step is to break the text into individual sentences. We will use the sent_tokenize() function of the nltk library to do this.

In [0]:
from nltk.tokenize import sent_tokenize
sentences = []
for s in limited_data['Text']:
    sentences.append(sent_tokenize(s))

sentences = [y for x in sentences for y in x] # flatten list

In [0]:
sentences[:5]

["Saurav Kant, an alumnus of upGrad and IIIT-B's PG Program in Machine learning and Artificial Intelligence, was a Sr Systems Engineer at Infosys with almost 5 years of work experience.",
 "The program and upGrad's 360-degree career support helped him transition to a Data Scientist at Tech Mahindra with 90% salary hike.",
 "upGrad's Online Power Learning has powered 3 lakh+ careers.",
 "Kunal Shah's credit card bill payment platform, CRED, gave users a chance to win free food from Swiggy for one year.",
 'Pranav Kaushik, a Delhi techie, bagged this reward after spending 2000 CRED coins.']

# Download GloVe Word Embeddings
GloVe word embeddings are vector representations of words. These word embeddings will be used to create vectors for our sentences. We could also have used the Bag-of-Words or TF-IDF approaches to create features for our sentences, but those features are sparse, their number is usually pretty large, and they do not capture similarity in meaning between different words. (Note that averaging word vectors, as we do later in this notebook, also ignores the order of the words.)

We will be using the pre-trained Wikipedia 2014 + Gigaword 5 GloVe vectors (glove.6B.zip; the download command is in the cell below). Heads up: the size of this download is 822 MB.

In [0]:
# !wget http://nlp.stanford.edu/data/glove.6B.zip
# !unzip glove*.zip

# Text Preprocessing
It is always good practice to make your textual data as noise-free as possible. So, let’s do some basic text cleaning.

In [0]:
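# remove punctuation, numbers and special characters
# (regex=True is needed: since pandas 2.0 the pattern is treated as a literal string by default)
clean_sentences = pd.Series(sentences).str.replace("[^a-zA-Z]", " ", regex=True)

# convert all letters to lowercase
clean_sentences = [s.lower() for s in clean_sentences]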

Get rid of the stopwords (commonly used words of a language – is, am, the, of, in, etc.) present in the sentences. If you have not downloaded nltk-stopwords, then execute the following line of code:

In [0]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

Now we can import the stopwords.

In [0]:
from nltk.corpus import stopwords
stop_words = stopwords.words('english')

Let’s define a function to remove these stopwords from our dataset.

In [0]:
# function to remove stopwords
def remove_stopwords(sen):
    sen_new = " ".join([i for i in sen if i not in stop_words])
    return sen_new

In [0]:
# remove stopwords from the sentences
clean_sentences = [remove_stopwords(r.split()) for r in clean_sentences]

We will use clean_sentences to create vectors for sentences in our data with the help of the GloVe word vectors.
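
To make the effect of the two cleaning steps concrete, here is a minimal sketch that applies them to one sentence from the data. The helper `clean_sentence` and the tiny `demo_stop_words` set are hypothetical and exist only for this illustration; the notebook itself uses the pandas call and the nltk stopword list shown above.

```python
import re

# Small hand-picked stopword set, used here only for illustration;
# the notebook itself uses nltk's English stopword list.
demo_stop_words = {"a", "to", "with", "and", "the", "him", "at"}

def clean_sentence(sentence, stop_words):
    # Replace every non-letter character with a space, then lowercase.
    letters_only = re.sub(r"[^a-zA-Z]", " ", sentence).lower()
    # split() also drops the extra whitespace left behind by the substitution.
    return " ".join(w for w in letters_only.split() if w not in stop_words)

example = "The program helped him transition to a Data Scientist with 90% salary hike."
cleaned_example = clean_sentence(example, demo_stop_words)
print(cleaned_example)
# program helped transition data scientist salary hike
```

Note that the digits and the percent sign disappear entirely, so numeric details such as "90%" play no part in the sentence vectors built in the next section.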

# Vector Representation of Sentences

Let’s extract the word embeddings (word vectors).

In [0]:
# Extract word vectors: each line holds a word followed by its 100 coefficients
word_embeddings = {}
with open('glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        word_embeddings[word] = coefs
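
For readers who have not looked inside a GloVe file, the following sketch parses two made-up lines in the same plain-text layout (a token followed by space-separated coefficients). The words and values are invented, and the vectors have 3 dimensions instead of 100 so that the result is easy to inspect.

```python
import io
import numpy as np

# Two made-up lines in the GloVe text layout: a token followed by its
# space-separated coefficients (3 dimensions here instead of 100).
fake_glove_file = io.StringIO(
    "king 0.1 0.2 0.3\n"
    "queen 0.4 0.5 0.6\n"
)

demo_embeddings = {}
for line in fake_glove_file:
    values = line.split()
    # First field is the word, the remaining fields are the vector.
    demo_embeddings[values[0]] = np.asarray(values[1:], dtype="float32")

print(demo_embeddings["queen"])
```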

Now, let’s create vectors for our sentences. We will first fetch the vectors (each with 100 elements) for the constituent words of a sentence and then take the mean of those vectors to arrive at a single consolidated vector for the sentence.

In [0]:
sentence_vectors = []
for i in clean_sentences:
    if len(i) != 0:
        v = sum([word_embeddings.get(w, np.zeros((100,))) for w in i.split()])/(len(i.split())+0.001)
    else:
        v = np.zeros((100,))
    sentence_vectors.append(v)
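
The averaging step can be checked by hand on toy data. The sketch below uses made-up 3-dimensional embeddings and a hypothetical helper `sentence_vector`; unlike the cell above, it takes the plain mean rather than dividing by the word count plus 0.001, so the numbers differ very slightly from what the notebook computes.

```python
import numpy as np

# Toy 3-dimensional embeddings (made-up values; the real GloVe vectors have 100).
toy_embeddings = {
    "free": np.array([1.0, 0.0, 0.0]),
    "food": np.array([0.0, 1.0, 0.0]),
}

def sentence_vector(sentence, embeddings, dim):
    words = sentence.split()
    if not words:
        # An empty sentence gets the zero vector.
        return np.zeros(dim)
    # Unknown words contribute a zero vector but still count in the average.
    vectors = [embeddings.get(w, np.zeros(dim)) for w in words]
    return np.mean(vectors, axis=0)

toy_vector = sentence_vector("free food swiggy", toy_embeddings, dim=3)
print(toy_vector)
```

Here "swiggy" has no embedding, so the two known words are averaged over three words and the result is roughly [0.333, 0.333, 0]. Out-of-vocabulary words therefore pull a sentence vector toward zero.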

# Similarity Matrix Preparation
The next step is to find the similarities between the sentences, and we will use cosine similarity for this.

Let’s first define a zero matrix of dimensions (n × n), where n is the number of sentences. We will then fill this matrix with the cosine similarity scores of the sentences.

In [0]:
# similarity matrix
sim_mat = np.zeros([len(sentences), len(sentences)])

We will use Cosine Similarity to compute the similarity between a pair of sentences.

In [0]:
from sklearn.metrics.pairwise import cosine_similarity

Now fill the matrix with the cosine similarity scores. The diagonal is left at zero, so a sentence is never compared with itself.

In [0]:
for i in range(len(sentences)):
    for j in range(len(sentences)):
        if i != j:
            sim_mat[i][j] = cosine_similarity(sentence_vectors[i].reshape(1,100), sentence_vectors[j].reshape(1,100))[0,0]
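
Cosine similarity is the dot product of two vectors divided by the product of their lengths. As a sketch of what the double loop computes, the same matrix can be built in one step with NumPy; the function name `cosine_similarity_matrix` and the toy vectors are assumptions for illustration, not part of the notebook.

```python
import numpy as np

def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarity with the diagonal set to zero."""
    X = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    # Leave all-zero vectors as zeros instead of dividing by zero.
    unit = X / np.where(norms == 0, 1.0, norms)
    sim = unit @ unit.T
    np.fill_diagonal(sim, 0.0)  # mirrors the `if i != j` check in the loop
    return sim

toy_vectors = [[1.0, 0.0], [1.0, 1.0], [0.0, 2.0], [0.0, 0.0]]
toy_sim = cosine_similarity_matrix(toy_vectors)
print(toy_sim.round(3))
```

The first and third toy vectors are orthogonal and get a similarity of 0, while each of them is at 45 degrees to the second one and gets about 0.707. For a handful of sentences the loop is fine; the single matrix product mainly helps when n grows, since the loop makes n² separate calls.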

# Applying PageRank Algorithm
Before proceeding further, let’s convert the similarity matrix sim_mat into a graph. The nodes of this graph will represent the sentences and the edges will represent the similarity scores between the sentences. On this graph, we will apply the PageRank algorithm to arrive at the sentence rankings.

![text_rank.png](images/text_rank.png)

In [0]:
import networkx as nx

nx_graph = nx.from_numpy_array(sim_mat)
scores = nx.pagerank(nx_graph)

# Summary Extraction
Finally, it’s time to extract the top N sentences based on their rankings for summary generation.

In [0]:
ranked_sentences = sorted(((scores[i],s) for i,s in enumerate(sentences)), reverse=True)
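
Sorting by score returns the sentences in rank order, which can read oddly as a summary. One possible variation, sketched here on made-up scores, is to pick the top N sentences by score and then print them in their original order. The helper `top_n_in_document_order` is hypothetical and is not what this notebook does.

```python
# Made-up scores keyed by sentence index, in the same shape nx.pagerank returns.
toy_sentences = ["First sentence.", "Second sentence.", "Third sentence.", "Fourth sentence."]
toy_scores = {0: 0.20, 1: 0.35, 2: 0.15, 3: 0.30}

def top_n_in_document_order(sentences, scores, n):
    # Rank sentence indices by score, highest first, and keep the best n.
    best = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:n]
    # Re-sort the chosen indices so the summary follows the source order.
    return [sentences[i] for i in sorted(best)]

toy_summary = top_n_in_document_order(toy_sentences, toy_scores, n=2)
print(toy_summary)
# ['Second sentence.', 'Fourth sentence.']
```

Because the selection uses slicing, asking for more sentences than exist simply returns all of them instead of raising an IndexError.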

In [0]:
# Extract top 10 sentences as the summary
for i in range(10):
    print(ranked_sentences[i][1])

Weeks after ex-CBI Director Alok Verma told the Department of Personnel and Training to consider him retired, the Home Ministry asked him to join work on the last day of his fixed tenure as Director on Thursday.
Kunal Shah's credit card bill payment platform, CRED, gave users a chance to win free food from Swiggy for one year.
"It would have been better if the authorities would have served the notice first if any and then publicised this," reads a press release issued on behalf of Rahat.
The plan provides life cover up to the age of 100 years.
India lost an international match under Rohit Sharma's captaincy after 12 consecutive victories dating back to March 2018.
Also, customers have options to insure against Critical Illnesses, Disability and Accidental Death Benefit Rider with a life cover up to the age of 80 years.
"I did this...to satisfy his ego in the hope that he will do justice to the state," he added.
The program and upGrad's 360-degree career support helped him transition to