## Text Summarization

В данном задании мы будем строить саммари из текста, используя алгоритм LexRank (https://www.cs.cmu.edu/afs/cs/project/jair/pub/volume22/erkan04a-html/erkan04a.html)

In [25]:
import pandas as pd
import numpy as np
import re
import os
import math
from collections import Counter
from collections import defaultdict
import nltk
from nltk import RegexpTokenizer
from nltk import SnowballStemmer

Загрузим документы из файлов.

In [26]:
data_dir = './data'

In [27]:
documents = []
for file in os.listdir(data_dir):
    file_name = data_dir + '/' + file
    with open(file_name) as f:
        data = f.read()
        documents.append(data.strip())

Посмотрим, как выглядят документы:

In [28]:
documents[0]

"The New York Times said in an editorial for Sunday, Dec. 20: The Republicans' drive for a partisan impeachment based soley on party-line voting power rather than any sense of proportion produced an unexpected sideshow in the resignation of Rep. Bob Livingston from his role as future speaker of the House. Analysts on both sides of the struggle over President Clinton's future will point to Livingston's downfall as evidence of a generalized breakdown in legislative civility on Capitol Hill. Democrats will see a moral symmetry, depicting Livingston as a victim of the sexual puritanism he was wielding against Clinton. But the one thing that no one should fall for is Livingston's invitation to use his resignation as a model for resolving the crisis at the White House. Livingston's statement rocked Congress, but it should not shake the foundations of the somber constitutional process now underway. By lying under oath, Clinton made it necessary for the House to consider impeachment. The evide

Разобьем текст каждого документа на предложения.

In [29]:
def get_sentences(doc):
    """
    Get sentences from document

    Inputs:
    - text: string

    Returns:
    - list of sentences
    """
    return re.split(u'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s', doc)

In [30]:
sentences = get_sentences(documents[0])
print sentences

['The New York Times said in an editorial for Sunday, Dec.', "20: The Republicans' drive for a partisan impeachment based soley on party-line voting power rather than any sense of proportion produced an unexpected sideshow in the resignation of Rep.", 'Bob Livingston from his role as future speaker of the House.', "Analysts on both sides of the struggle over President Clinton's future will point to Livingston's downfall as evidence of a generalized breakdown in legislative civility on Capitol Hill.", 'Democrats will see a moral symmetry, depicting Livingston as a victim of the sexual puritanism he was wielding against Clinton.', "But the one thing that no one should fall for is Livingston's invitation to use his resignation as a model for resolving the crisis at the White House.", "Livingston's statement rocked Congress, but it should not shake the foundations of the somber constitutional process now underway.", 'By lying under oath, Clinton made it necessary for the House to consider 

Далее нужно разделить каждое предложение на токены (слова), при этом можно удалить пунктуацию, стоп-слова, выполнить стемминг.

In [31]:
from nltk.stem import WordNetLemmatizer
morph = WordNetLemmatizer()

In [32]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
stop_words = stopwords.words('english')

In [33]:
def is_stop(word):
    if word in stop_words:
        return False
    return True
def lem_word(w):
    return morph.lemmatize(w)
def stem_word(w):
    stemmer = SnowballStemmer("english")
    return stemmer.stem(w)

In [34]:
def sentence_words(sentence):
    """
    Get words from sentence
    
    Perform normalization, remove punctuation, remove stop_words, etc...

    Inputs:
    - sentence: string - sentence text

    Returns:
    - list of strings
    """
    sentence_terms = None
    tokenizer = RegexpTokenizer(u'\w+')
    sentence = tokenizer.tokenize(sentence)
    #############################################################################
    #                                 YOUR CODE                                 #
    #############################################################################
    pass
    #############################################################################
    #                              END OF YOUR CODE                             #
    #############################################################################
    return map(stem_word, map(lem_word, filter(is_stop, sentence)))

In [35]:
sentences = map(sentence_words, sentences)

In [36]:
sentences

[[u'the', u'new', u'york', u'time', u'said', u'editori', u'sunday', u'dec'],
 ['20',
  u'the',
  u'republican',
  u'drive',
  u'partisan',
  u'impeach',
  u'base',
  u'soley',
  u'parti',
  u'line',
  u'vote',
  u'power',
  u'rather',
  u'sens',
  u'proport',
  u'produc',
  u'unexpect',
  u'sideshow',
  u'resign',
  u'rep'],
 [u'bob', u'livingston', u'role', u'futur', u'speaker', u'hous'],
 [u'analyst',
  u'side',
  u'struggl',
  u'presid',
  u'clinton',
  u'futur',
  u'point',
  u'livingston',
  u'downfal',
  u'evid',
  u'general',
  u'breakdown',
  u'legisl',
  u'civil',
  u'capitol',
  u'hill'],
 [u'democrat',
  u'see',
  u'moral',
  u'symmetri',
  u'depict',
  u'livingston',
  u'victim',
  u'sexual',
  u'puritan',
  u'wield',
  u'clinton'],
 [u'but',
  u'one',
  u'thing',
  u'one',
  u'fall',
  u'livingston',
  u'invit',
  u'use',
  u'resign',
  u'model',
  u'resolv',
  u'crisi',
  u'white',
  u'hous'],
 [u'livingston',
  u'statement',
  u'rock',
  u'congress',
  u'shake',
  u'foun

Нам необходима функция для подсчета tf слов в каждом предложении:

In [37]:
def compute_tf(sentences):
    """
    Compute TF for each term in sentece for all sentences

    Inputs:
    - sentences: list of tokenized sentences

    Returns:
    - list of dicts of tf values
    """
    tf_values = map(Counter, sentences)

    tf_metrics = []
    #############################################################################
    #                                 YOUR CODE                                 #
    #############################################################################
    for idx, sentence in enumerate(sentences):
        tf_metrics.append(defaultdict(float, {key : float(value) / len(sentence) for (key, value) in tf_values[idx].items()}))
            
    #############################################################################
    #                              END OF YOUR CODE                             #
    #############################################################################

    return tf_metrics

In [38]:
tf_metrics = compute_tf(sentences)

Так же нам необходима функция для подсчета IDF всех токенов из текста:

In [39]:
def compute_idf(sentences):
    """
    Compute IDF for each term in sentece

    Inputs:
    - sentences: list of sentences
    
    Returns:
    - dictionary of form { term : idf }
    """
    idf_metrics = {}
    sentences_count = len(sentences)
    
    counter = Counter()

    #############################################################################
    #                                 YOUR CODE                                 #
    #############################################################################
    for sentence in sentences:
        tmp_counter = Counter(sentence)
        for key in tmp_counter.keys():
            counter[key] += 1
            
    for key, value in counter.items():
        idf_metrics[key] = np.log(float(sentences_count) / value)
    #############################################################################
    #                              END OF YOUR CODE                             #
    #############################################################################

    return idf_metrics

In [40]:
idf_metrics = compute_idf(sentences)

Для того, чтобы определить похожесть двух предложений, нам нужно использовать какую-то метрику похожести. Мы будем использовать модицифированное косинусное расстояние.

In [41]:
def cosine_similarity(sentence1, sentence2, tf1, tf2, idf_metrics):
    """
    Compute idf-modified-cosine distance between two sentences.
    
    Inputs:
    - sentence1: list of terms of sentence 1
    - sentence2: list of terms of sentence 2
    - tf1: dict of term frequencies of sentence 1
    - tf2: dict of term frequencies of sentence 2
    - idf_metrics: list of inverted document frequencies

    Returns:
    - modified cosine similarity
    """
    similarity = 0.0
    norm_1 = 0.0
    norm_2 = 0.0
    
    #############################################################################
    #                                 YOUR CODE                                 #
    #############################################################################
    for word, idf in idf_metrics.items():
        similarity += tf1[word] * idf * tf2[word] * idf
        norm_1 += (tf1[word] * idf) ** 2
        norm_2 += (tf2[word] * idf) ** 2
    similarity /= (np.sqrt(norm_1) * np.sqrt(norm_2))
    #############################################################################
    #                              END OF YOUR CODE                             #
    #############################################################################
    
    return similarity

Создадим матрицу размером NxN, где N - это количество предложений. В каждой ячейки i,j - будем записывать похожесть i-го предложения на j-ое.

In [42]:
def create_matrix(sentences, threshold, tf_metrics, idf_metrics):
    """
    Creates matrix of shape |sentences|×|sentences|.
    
    Inputs:
    - sentences: list of sentences
    - threshold: threshold value for selecting edges
    - tf_metrics: list of TF metrics for each sentence
    - idf_metrics: list of inverted document frequencies

    Returns:
    - matrix of similarities
    """
    sentences_count = len(sentences)
    matrix = np.zeros((sentences_count, sentences_count), dtype=np.float)
    degrees = np.zeros((sentences_count, ))

    #############################################################################
    #                                 YOUR CODE                                 #
    #############################################################################
    for idx in xrange(sentences_count):
        for jdx in xrange(sentences_count):
            if cosine_similarity(sentences[idx], sentences[jdx], tf_metrics[idx], tf_metrics[jdx], idf_metrics) > threshold and idx != jdx:
                matrix[idx][jdx] = 1
        matrix[idx][idx] = 1
        degrees[idx] = np.sum(matrix[idx])
        matrix[idx] /= degrees[idx]
    #############################################################################
    #                              END OF YOUR CODE                             #
    #############################################################################

    return matrix

In [43]:
matrix = create_matrix(sentences, 0, tf_metrics, idf_metrics)

Реализуем power-метод:

In [44]:
def power_method(matrix, epsilon):
    """
    Perform Power-method.
    
    Inputs:
    - matrix: matrix of sentences similarities
    - epsilon: stop when values changes less then epsilon
    - tf_metrics: list of TF metrics for each sentence
    - idf_metrics: list of inverted document frequencies

    Returns:
    - vector of sentence's importancies
    """
    transposed_matrix = matrix.T
    sentences_count = len(transposed_matrix )
    p_vector = np.array([1.0 / sentences_count] * sentences_count)
    lambda_val = 1.0

    #############################################################################
    #                                 YOUR CODE                                 #
    #############################################################################
    while True:
        p_vector_new = np.matmul(p_vector, transposed_matrix )
        if np.linalg.norm(p_vector_new - p_vector) < epsilon:
            break
        p_vector = p_vector_new
    #############################################################################
    #                              END OF YOUR CODE                             #
    #############################################################################

    return p_vector

In [45]:
values = power_method(matrix, 0.00000001)

По полученым оценкам важности каждого предложения выделим k-самых важных.

In [46]:
def get_top_sentences(sentences, values, k):
    """
    Get K top rated sentences from all sentences.
    
    Inputs:
    - sentences: list of sentences
    - values: list of computed sentence importancies
    - k: number of sentences to extract

    Returns:
    - list of sentences
    """
    top_sentences = []
    
    #############################################################################
    #                                 YOUR CODE                                 #
    #############################################################################
    for idx, val in enumerate(np.argsort(-values)):
        if idx > k:
            break
        top_sentences.append(sentences[val])
    #############################################################################
    #                              END OF YOUR CODE                             #
    #############################################################################
    
    return top_sentences

In [47]:
get_top_sentences(sentences, values, 3)

[[u'the', u'new', u'york', u'time', u'said', u'editori', u'sunday', u'dec'],
 ['if',
  u'lott',
  u'refus',
  u'allow',
  u'bipartisan',
  u'search',
  u'censur',
  u'burden',
  u'fall',
  u'upon',
  u'respect',
  u'member',
  u'like',
  u'joseph',
  u'lieberman',
  u'democrat',
  u'side',
  u'orrin',
  u'hatch',
  u'republican'],
 ['in',
  u'intrigu',
  u'report',
  u'nbc',
  u'tim',
  u'russert',
  u'describ',
  u'discuss',
  u'among',
  u'bipartisan',
  u'group',
  u'senat',
  u'censur',
  u'would',
  u'involv',
  u'form',
  u'presidenti',
  u'confess',
  u'fine',
  u'joint',
  u'congression',
  u'request',
  u'independ',
  u'counsel',
  u'kenneth',
  u'starr',
  u'prosecut',
  u'clinton',
  u'court'],
 [u'bob', u'dole']]

Построим саммари:

In [48]:
for document in documents:
    sentences = get_sentences(document)
    sentences_tokenized = [sentence_words(x) for x in sentences]
    tf_metrics = compute_tf(sentences_tokenized)
    idf_metrics = compute_idf(sentences_tokenized)
    
    matrix = create_matrix(sentences_tokenized, 0.3, tf_metrics, idf_metrics)
    sentence_values = power_method(matrix, 0.001)
    
    print(get_top_sentences(sentences, sentence_values, 3))

['The New York Times said in an editorial for Sunday, Dec.', 'If Lott refuses to allow a bipartisan search for censure, the burden will fall upon respected members like Joseph Lieberman on the Democratic side and Orrin Hatch for the Republicans.', 'In an intriguing report on NBC, Tim Russert described discussions among a bipartisan group of senators about a censure that would involve some form of presidential confession, a fine and a joint congressional request that Independent Counsel Kenneth Starr not prosecute Clinton in the courts.', 'Bob Dole.']
["``We are here to debate impeachment and should not be distracted from that,'' the minority whip, Rep.", "CBS' corporate decision to try to have it both ways meant that Rather would break in with updates from Washington.", "``I'll see you very soon,'' he said as he signed off the complete coverage, looking like a trouper.", 'But he was in a position no news anchor should have been in on this day.']
['Bob Livingston, the incoming speaker o