# Text Summarization

Alice continues her journey and has now reached 2015. Things have become easier, since word2vec is available! This time Alice needs help summarizing news texts.

The task of summarization is to produce a shorter text that retains all (or almost all) of the information in the original, losing as little as possible.

Methods for solving this problem are usually divided into two categories:
- Extractive Summarization $-$ algorithms based on identifying the most informative parts of the source text (sentences, paragraphs, etc.) and compiling a summary from them.
- Abstractive Summarization $-$ algorithms that generate new text based on the source.

We will work with Extractive Summarization.

## 0. Dataset Preprocessing

In [1]:
import os
import nltk
import numpy as np

from scipy import sparse
from collections import defaultdict
from tqdm.notebook import tqdm
from nltk.tokenize import sent_tokenize, word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer

### Loading dataset

We will use data from the CNN/DailyMail news corpus.

In [2]:
DATA_DIR = './cnn_stories_short/'

In [None]:
%%capture

!wget https://www.dropbox.com/s/kofxrgod7kl720m/cnn_stories_short.zip
!mkdir -p $DATA_DIR
!unzip cnn_stories_short.zip -d $DATA_DIR
!rm -r ./cnn_stories_short/__MACOSX

### Dataset preparation

The dataset consists of source texts and summaries already written for them. We will keep only the source texts.

In [4]:
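texts = []
for filename in os.listdir(DATA_DIR):
    with open(os.path.join(DATA_DIR, filename), 'r', encoding='utf-8') as input_file:
        # The article body precedes the first '@highlight' marker;
        # the chunks after it are the reference summaries.
        all_texts = input_file.read().split('@highlight')
        texts.append(all_texts[0])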
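To make the split on `'@highlight'` concrete, here is a minimal sketch on a made-up story string. The `story` text is hypothetical and not taken from the dataset; only the marker itself comes from the loading code above.

```python
# Hypothetical miniature story in the same layout (not real dataset content).
story = (
    "The city opened a new bridge on Monday. Traffic is expected to ease.\n\n"
    "@highlight\n\n"
    "New bridge opens\n\n"
    "@highlight\n\n"
    "Traffic expected to ease\n"
)

# Everything before the first marker is the article body;
# every later chunk is one reference highlight.
parts = story.split("@highlight")
body = parts[0].strip()
highlights = [part.strip() for part in parts[1:]]

print(body)
print(highlights)
```

The notebook keeps only the body (`all_texts[0]`); the highlights would be the natural reference if the summaries were to be evaluated later.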

#### We will need:
* texts broken into sentences
* sentences broken into tokens
* texts broken into sentences, which are in turn broken into tokens
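The three shapes listed above can be illustrated on a toy input. This is only a sketch: `naive_sent_tokenize` and `naive_word_tokenize` are hypothetical regex stand-ins for NLTK's `sent_tokenize` and `word_tokenize`, used here so the example runs without downloading tokenizer data, and they will not match NLTK's output on real text.

```python
import re

def naive_sent_tokenize(text):
    # Split after ., ! or ? followed by whitespace (stand-in for sent_tokenize).
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def naive_word_tokenize(sentence):
    # Words and single punctuation marks as separate tokens (stand-in for word_tokenize).
    return re.findall(r"\w+|[^\w\s]", sentence)

toy_texts = ["Alice reads news. She wants a summary.", "Bob writes code."]

# 1) texts broken into sentences
sent_texts = [naive_sent_tokenize(t) for t in toy_texts]
# 2) a flat list of tokenized sentences across all texts
flat_sentences = [naive_word_tokenize(s) for text in sent_texts for s in text]
# 3) texts -> sentences -> tokens (nested)
nested = [[naive_word_tokenize(s) for s in text] for text in sent_texts]

print(sent_texts)
print(flat_sentences)
print(nested)
```

The flat list is convenient for corpus-level statistics, while the nested form keeps sentences grouped by text, which is what a per-text summarizer needs.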

In [5]:
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to
[nltk_data]     /home/gotheartem/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [6]:
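# Each text as a list of sentence strings
sent_tokenized_texts = [sent_tokenize(text) for text in texts]
# All sentences from all texts, flattened, each as a list of tokens
tokenized_sentences = [word_tokenize(sent) for text in sent_tokenized_texts for sent in text]
# Each text as a list of sentences, each sentence as a list of tokens
tokenized_texts = [[word_tokenize(sent) for sent in text] for text in sent_tokenized_texts]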

### Loading Word Embedding Model

For the TextRank algorithm, we need to obtain a vector representation for each sentence in the text.

We will use pre-trained GloVe vectors. **GloVe** (Global Vectors for Word Representation) is an unsupervised learning algorithm for obtaining vector representations of words, developed at Stanford University. It uses global word-word co-occurrence statistics from a corpus to build dense embeddings that capture semantic meaning. Because words are represented in a continuous vector space where similar words lie close together, GloVe vectors improve performance on many natural language processing tasks.

Let's download the vectors:

In [None]:
%%capture

!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip glove*.zip

The downloaded archive contains a set of files with vectors of different lengths. Each line of a file stores a word followed by the space-separated values of its vector representation.
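As a small illustration of that layout, the sketch below parses a single line. The `sample_line` values are made up and shortened to four numbers, and `parse_glove_line` is a hypothetical helper name.

```python
# Hypothetical line in the same layout; a real 100-d file has 100 numbers per word.
sample_line = "news 0.12 -0.5 0.33 0.0"

def parse_glove_line(line):
    """Split one line into (word, vector); the first field is the word."""
    fields = line.rstrip().split(" ")
    return fields[0], [float(x) for x in fields[1:]]

word, vector = parse_glove_line(sample_line)
print(word, vector)
```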

In [8]:
word_embeddings = {}
with open('glove.6B.100d.txt', encoding='utf-8') as f:
    # Iterate over the file lazily instead of loading every line at once
    for line in f:
        values = line.split()
        word = values[0]
        word_embeddings[word] = np.asarray(values[1:], dtype='float32')

We stored the vectors in `word_embeddings`, a dictionary whose keys are words and whose values are the corresponding vectors.

## Task 1: Word2Vec text representation

*   For each sentence, obtain its vector representation by averaging the word2vec representations of its words: sum them component by component and divide by the number of words in the sentence. If the embedding model does not contain a word, use a zero vector for it. Use the word representations stored in `word_embeddings`.
*   Compute the cosine similarity between every pair of sentences to obtain the cosine similarity matrix **G**.
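The averaging step can be sketched in plain Python on toy data. The 2-d `toy_embeddings` and the helper name `average_embedding` are hypothetical. As in the task description, an unknown word contributes a zero vector but still counts in the denominator.

```python
def average_embedding(tokens, embeddings, dim):
    """Mean of word vectors; unknown words contribute a zero vector."""
    if not tokens:
        return [0.0] * dim
    total = [0.0] * dim
    for token in tokens:
        vec = embeddings.get(token, [0.0] * dim)
        total = [t + v for t, v in zip(total, vec)]
    return [t / len(tokens) for t in total]

toy_embeddings = {"good": [1.0, 0.0], "news": [0.0, 1.0]}  # hypothetical 2-d vectors
# "today" is not in the toy vocabulary, so it adds zeros to the sum.
sentence_vec = average_embedding(["good", "news", "today"], toy_embeddings, dim=2)
print(sentence_vec)
```

With real embeddings the same logic is usually written with NumPy arrays, which is more compact and much faster.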

In [9]:
# TODO complete transform function. You can add additional values in class constructor if necessary.

class TfidfEmbeddingVectorizer:
    def __init__(self, embedding_model, dim=100):
        self.embedding_model = embedding_model
        self.dim = dim
    
    def transform(self, X):
        sentence_vectors = []
        for sentence in X:
            if len(sentence) != 0:
                # Average the word vectors; out-of-vocabulary words count as zero vectors
                v = sum([self.embedding_model.get(word, np.zeros((self.dim,))) for word in sentence]) / len(sentence)
            else:
                # An empty sentence gets the zero vector
                v = np.zeros((self.dim,))
            sentence_vectors.append(v)
        return np.array(sentence_vectors)

In [10]:
sentence_vectorizer = TfidfEmbeddingVectorizer(word_embeddings)

### Building the Cosine Similarity Matrix

For the *TextRank* algorithm, we need to build a weighted graph from the text. The graph will be represented as a matrix of cosine similarity between sentences.

As an example, let's choose one text and build the similarity matrix for it, using cosine similarity between sentence vectors as the edge weight.
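For intuition, here is a minimal plain-Python sketch of a pairwise cosine similarity matrix. The `toy_vectors` are made up, and returning 0.0 for an all-zero vector is a convention chosen for this sketch.

```python
import math

def cosine(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)

def similarity_matrix(vectors):
    # Entry [i][j] is the similarity between sentence i and sentence j.
    return [[cosine(u, v) for v in vectors] for u in vectors]

toy_vectors = [[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]]  # hypothetical sentence vectors
S = similarity_matrix(toy_vectors)
for row in S:
    print([round(x, 3) for x in row])
```

The matrix is symmetric, so the resulting graph is undirected: orthogonal sentence vectors get weight 0 and parallel ones get weight 1.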

In [11]:
TEXT_NUM = 5

In [12]:
sentences = tokenized_texts[TEXT_NUM]

Using the vectorizer, we will obtain vectors for all sentences of the text.

In [13]:
vectorized_sentences = sentence_vectorizer.transform(sentences)

Let's calculate the matrix of cosine similarities.

In [14]:
## TODO calculate the cosine similarity matrix and assign it to G.

from sklearn.metrics.pairwise import cosine_similarity
def get_cosine_similarity_matrix(sentences):
    return cosine_similarity(sentences)

G = get_cosine_similarity_matrix(vectorized_sentences)

## Extractive Summarization $-$ TextRank

Now we will implement the text summarization method itself. It will be based on the *PageRank* algorithm.

*PageRank* is an iterative algorithm that evaluates the importance of each node in a graph based on its connections to other nodes. It was originally used to rank web pages for search engines.

The adaptation of this algorithm for text summarization is called *TextRank*.

The algorithm sequentially goes through all the nodes in the graph and recalculates the PageRank values for each of them using the formula below.

This happens until the process stabilizes, that is, the *PageRank* values for all nodes stop changing significantly with each new iteration.

$$ G = (V, E) \text{ is a graph} $$

$$ PageRank(v) = \frac{1-d}{N} + d \sum_{u} \frac{PageRank(u) \cdot W_{(u, v)}}{C(u)} $$

$$ v \text{ is a vertex of the graph, } v \in V $$

$$ u \text{ ranges over the vertices such that } (u, v) \in E $$

$$ C(u) \text{ is the number of vertices } w \text{ such that } (u, w) \in E $$

$$ W_{(u, v)} \text{ is the weight of the edge } (u, v) \in E $$

$$ N = |V| \text{ is the number of vertices} $$

$$ d = 0.85 \text{ is the damping factor} $$
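The update rule above can be sketched as a plain-Python power iteration. One assumption to note: for a weighted graph this sketch takes $C(u)$ to be the total weight of the edges leaving $u$, which equals the neighbour count when all weights are 1 and keeps the scores summing to 1. The function name, tolerance and iteration cap are illustrative choices, and the sketch assumes every node has at least one edge with positive weight.

```python
def pagerank(W, d=0.85, tol=1e-6, max_iter=200):
    """Power iteration on a weight matrix W[u][v] (illustrative sketch)."""
    n = len(W)
    out_weight = [sum(row) for row in W]  # stands in for C(u), see the note above
    scores = [1.0 / n] * n                # start from a uniform distribution
    for _ in range(max_iter):
        new_scores = []
        for v in range(n):
            # Score flowing into v from every u, split in proportion to edge weights
            incoming = sum(
                scores[u] * W[u][v] / out_weight[u]
                for u in range(n)
                if out_weight[u] > 0
            )
            new_scores.append((1 - d) / n + d * incoming)
        converged = max(abs(a - b) for a, b in zip(new_scores, scores)) < tol
        scores = new_scores
        if converged:
            break
    return scores

# Path graph 0 - 1 - 2: the middle node is connected to both others.
W = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
scores = pagerank(W)
print([round(s, 4) for s in scores])
```

On this toy graph the middle node ends up with the highest score. A central sentence behaves the same way in TextRank: it collects score from many similar sentences.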

Let's use the NetworkX library to run the PageRank algorithm.

In [15]:
import networkx as nx

nx_graph = nx.from_numpy_array(G)
nx_scores = nx.pagerank(nx_graph)

In [16]:
ranked_sentences = sorted(((nx_scores[i], s, i) for i,s in enumerate(sentences)), reverse=True)

Let's output the 5 sentences with the highest TextRank scores. This will be our final summary.

In [17]:
SUMMARY_LEN = 5

for i in range(SUMMARY_LEN):
    print(' '.join(ranked_sentences[i][1]))

Last year 's Hero of the Year was Pushpa Basnet , a Nepalese woman who supports children so they do n't have to live behind bars with their incarcerated parents .
Click here to see all the extraordinary Heroes who have been featured this year .
( CNN ) -- Ten everyday people will be recognized Thursday for their remarkable efforts to make the world a better place .
`` We want to work with the government to bring them all out of prison .
One of the top 10 will receive an additional $ 250,000 for their cause if the public chooses them as the CNN Hero of the Year .


Now let's combine everything into one `summarize` function, which receives a text as a list of tokenized sentences and returns the 5 sentences with the highest *TextRank* scores.

In [18]:
def summarize(sentences, summary_len=5):
    # Embed every sentence, then build the similarity graph
    vectorized_sentences = sentence_vectorizer.transform(sentences)
    G = get_cosine_similarity_matrix(vectorized_sentences)
    nx_graph = nx.from_numpy_array(G)
    nx_scores = nx.pagerank(nx_graph)
    # Rank sentences by score, highest first
    ranked_sentences = sorted(((nx_scores[i], s, i) for i, s in enumerate(sentences)), reverse=True)
    # Slicing avoids an IndexError when a text has fewer than summary_len sentences
    return [' '.join(s) for _, s, _ in ranked_sentences[:summary_len]]
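The ranked tuples also carry each sentence's position `i`, which the function above does not use. One optional variation, not part of the notebook's solution, is to pick the top-scoring sentences and then restore their original order so the summary reads more naturally. The sketch below uses made-up sentences and scores, and `top_k_in_document_order` is a hypothetical helper name.

```python
def top_k_in_document_order(scores, sentences, k):
    """Pick the k highest-scoring sentences, then restore their original order."""
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen = sorted(ranked[:k])  # slicing also copes with k > len(sentences)
    return [sentences[i] for i in chosen]

toy_sentences = [
    "A storm hit the coast.",
    "Read more here.",
    "Officials urged residents to evacuate.",
]
toy_scores = [0.35, 0.20, 0.45]  # hypothetical TextRank scores
summary = top_k_in_document_order(toy_scores, toy_sentences, k=2)
print(summary)
```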

In [19]:
summarize(tokenized_texts[5])

["Last year 's Hero of the Year was Pushpa Basnet , a Nepalese woman who supports children so they do n't have to live behind bars with their incarcerated parents .",
 'Click here to see all the extraordinary Heroes who have been featured this year .',
 '( CNN ) -- Ten everyday people will be recognized Thursday for their remarkable efforts to make the world a better place .',
 '`` We want to work with the government to bring them all out of prison .',
 'One of the top 10 will receive an additional $ 250,000 for their cause if the public chooses them as the CNN Hero of the Year .']

Let's get summaries for all our texts:

In [20]:
system_summaries = [summarize(text) for text in tqdm(tokenized_texts)]


  0%|          | 0/300 [00:00<?, ?it/s]

Let's look at the summary of the text at index 10:

In [21]:
print("\n".join(system_summaries[10][:5]))

`` As a club , we had the honor of his services , albeit for a very short time , in the 1987-88 season , when we won the Spanish Cup .
`` I wish to express our sorrow at the loss of one of football 's greatest men and one of the most charismatic and likeable managers we remember , '' Barca president Josep Maria Bartomeu said before his side 's 3-2 defeat by Valencia -- its first at home in the league since April 2012 .
But his heart lay in the nation 's capital , where he was on the books of Atletico 's big rival Real Madrid from 1958-60 as a player -- though he spent most of that time out on loan to other clubs .
`` Always with us , Luis , '' led the website tribute of Atletico Madrid , the club where Aragones played for a decade between 1964-74 and was head coach on four occasions , most recently 2001-03 .
`` Today is a day of mourning for this sport , but it should also be a day of recognition for a legendary figure who was vital in giving us a glorious period with our Spanish natio

## Task 2: IDF word2vec modification

Modify your previous solution. For each sentence, obtain its vector representation by averaging the word2vec representations of its words, each multiplied by the IDF value of that word.
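A minimal plain-Python sketch of this idea follows. It treats each tokenized sentence as a document and uses the smoothed formula $\ln\frac{1+n}{1+df(t)} + 1$, which is the default IDF documented for scikit-learn's `TfidfVectorizer`. The helper names, the toy documents and the 2-d embeddings are hypothetical.

```python
import math
from collections import Counter

def smooth_idf(tokenized_docs):
    """idf(t) = ln((1 + n) / (1 + df(t))) + 1, where df counts documents containing t."""
    n = len(tokenized_docs)
    df = Counter(token for doc in tokenized_docs for token in set(doc))
    return {token: math.log((1 + n) / (1 + count)) + 1 for token, count in df.items()}

def idf_weighted_average(tokens, embeddings, idf, dim):
    """Average of IDF-scaled word vectors; unknown words add zeros, unseen IDF defaults to 1.0."""
    if not tokens:
        return [0.0] * dim
    total = [0.0] * dim
    for token in tokens:
        weight = idf.get(token, 1.0)
        vec = embeddings.get(token, [0.0] * dim)
        total = [t + weight * v for t, v in zip(total, vec)]
    return [t / len(tokens) for t in total]

docs = [["the", "storm"], ["the", "coast"], ["the", "storm", "passed"]]
idf = smooth_idf(docs)
toy_embeddings = {"the": [1.0, 0.0], "storm": [0.0, 1.0]}  # hypothetical 2-d vectors
vec = idf_weighted_average(["the", "storm"], toy_embeddings, idf, dim=2)
print(idf)
print(vec)
```

A word that occurs in every document, such as "the" here, gets the minimum weight of 1.0. Rarer words get larger weights and so pull the sentence vector further toward their own direction.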

In [22]:
# TODO complete transform function. You can add additional values in class constructor if necessary.

class TfidfEmbeddingVectorizer:
    def __init__(self, embedding_model, dim=100):
        self.embedding_model = embedding_model
        self.dim = dim
        self.word_idf = {}
    
    def fit(self, X):
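        # Join each tokenized sentence back into a string for TfidfVectorizer
        texts = [' '.join(sentence) for sentence in X]
        
        # Fit TF-IDF to estimate how informative each word is
        tfidf = TfidfVectorizer()
        tfidf.fit(texts)
        
        # Map each vocabulary word to its IDF value.
        # Note: TfidfVectorizer lowercases its input by default, so these keys are lowercase.
        self.word_idf = {word: tfidf.idf_[i] for word, i in tfidf.vocabulary_.items()}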
        return self
    
    def transform(self, X):
        sentence_vectors = []
        for sentence in X:
            if len(sentence) != 0:
                v = sum([self.embedding_model.get(word, np.zeros((self.dim,))) * self.word_idf.get(word, 1.0) for word in sentence]) / len(sentence)
            else:
                v = np.zeros((self.dim,))
            sentence_vectors.append(v)
        return np.array(sentence_vectors)

In [23]:
sentence_vectorizer = TfidfEmbeddingVectorizer(word_embeddings)
sentence_vectorizer = sentence_vectorizer.fit(tokenized_sentences)

In [24]:
# TODO copy your function for cosine similarity here

def get_cosine_similarity_matrix(sentences):
    return cosine_similarity(sentences)

In [25]:
def summarize(sentences, summary_len=5):
    # Embed every sentence with the IDF-weighted vectorizer, then build the similarity graph
    vectorized_sentences = sentence_vectorizer.transform(sentences)
    G = get_cosine_similarity_matrix(vectorized_sentences)
    nx_graph = nx.from_numpy_array(G)
    nx_scores = nx.pagerank(nx_graph)
    # Rank sentences by score, highest first
    ranked_sentences = sorted(((nx_scores[i], s, i) for i, s in enumerate(sentences)), reverse=True)
    # Slicing avoids an IndexError when a text has fewer than summary_len sentences
    return [' '.join(s) for _, s, _ in ranked_sentences[:summary_len]]

Summarize your texts:

In [26]:
system_summaries = [summarize(text) for text in tqdm(tokenized_texts)]


  0%|          | 0/300 [00:00<?, ?it/s]

Print the summary of the text at index 7:

In [27]:
system_summaries[7]

["`` It should be noted that over many years , Jack Cafferty has expressed critical comments on many governments , including the U.S. government and its leaders . ''",
 "Cafferty , who appears daily on CNN 's `` The Situation Room , '' made the remarks as host Wolf Blitzer was comparing today 's China to that of 20 or 30 years ago .",
 '`` On this occasion Jack was offering his strongly held opinion of the Chinese government , not the Chinese people -- - a point he subsequently clarified on The Situation Room on April 14 .',
 "He issued a clarification of his remarks on Monday 's `` Situation Room , '' saying that by `` goons and thugs , '' he meant the Chinese government , not the Chinese people .",
 "CNN issued a statement Tuesday saying : `` We are aware of concerns about Jack Cafferty 's comments related to China in the context of the upcoming Olympics , which were broadcast on The Situation Room on April 9 , 2008 ."]