## Summarize Document

We have converted the input document to a Neo4j graph, where nodes correspond to sentences, and edges connect sentences that share common meaningful nouns.

Our objective is to generate a fixed-length (number of sentences) extractive summary of the document.

In this notebook, we will apply various graph metrics and techniques to build different summarizers. We then use a voting summarizer that reports, for each sentence of the summary, the sentence that has been recommended by the maximum number of summarizers.

In [1]:
import collections
import numpy as np
import pandas as pd
import py2neo
import operator
import os

In [2]:
DATA_DIR = "../data"
SENTENCE_PATH = os.path.join(DATA_DIR, "sentences.txt")

NEO4J_CONN_URL = "bolt://localhost:7687"

NUM_SENTS_IN_SUMMARY = 5

### Utility Functions

Our graph does not contain the sentence text, so we build an external index that returns the text for a given sentence ID. The `build_sentence_index` function creates a dictionary that maps sentence IDs (`s000`, `s001`, and so on) to sentence text.

In [3]:
def build_sentence_index(sentence_file):
    # Map sentence IDs ("s000", "s001", ...) to sentence text, one sentence per line.
    sent_index = {}
    with open(sentence_file, "r") as fsents:
        for lid, line in enumerate(fsents):
            sent_index["s{:03d}".format(lid)] = line.strip()
    return sent_index

sent_index = build_sentence_index(SENTENCE_PATH)
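
To make the key format concrete, here is a minimal sketch that builds the same kind of index from an in-memory file. The name `toy_sentence_index` and the two sample sentences are made up for illustration and are not part of the notebook's data.

```python
import io

def toy_sentence_index(lines):
    """Map zero-padded ids ("s000", "s001", ...) to stripped lines."""
    return {"s{:03d}".format(i): line.strip() for i, line in enumerate(lines)}

# Hypothetical two-sentence "file"; the real input is sentences.txt.
toy_file = io.StringIO("First sentence.\nSecond sentence.\n")
toy_index = toy_sentence_index(toy_file)
print(toy_index)
```

The zero-padded IDs sort lexicographically in document order, which is what lets later cells use `sorted(...)` to put recommendations back in reading order.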

The `display_recommendations` convenience function prints the recommendations from a particular summarizer. With `debug=False`, it also prints the text of each recommended sentence.

In [4]:
def display_recommendations(sent_index, label, sids, debug=True):
    if debug:
        print("%10s: %s" % (label, str(sids)))
    else:
        print("%10s: %s" % (label, sids))
        for sid in sids:
            print(sid, ":", sent_index[sid])

In [5]:
graph = py2neo.Graph(NEO4J_CONN_URL, auth=("neo4j", "graph"))

### Degree Summarizer (CN-Degree)

Summarization involves identifying the "important" sentences in a document. The degree centrality of a sentence indicates how well connected it is to other sentences, so this summarizer uses it as a proxy for importance.

In [6]:
def compute_degree_centrality(graph):
    query = """
        CALL algo.degree.stream("Sentence", "SIM", {direction: "both"})
        YIELD nodeId, score
        RETURN algo.asNode(nodeId).sid AS sid, score
        ORDER BY score DESC
        LIMIT %d
    """ % (NUM_SENTS_IN_SUMMARY)
    results = graph.run(query).data()
    recos = [x["sid"] for x in results]
    recos = sorted(recos)
    return recos

degree_recos = compute_degree_centrality(graph)
display_recommendations(sent_index, "degree", degree_recos)

    degree: ['s005', 's021', 's056', 's062', 's063']
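
For readers without a running Neo4j instance, the idea can be sketched in plain Python. The toy edge list and the helper names below are hypothetical; the notebook itself relies on `algo.degree.stream`, and this sketch only mirrors the "rank by score, then restore document order" pattern.

```python
from collections import Counter

# Hypothetical toy graph: undirected edges between sentence ids.
toy_edges = [("s000", "s001"), ("s000", "s002"), ("s000", "s003"),
             ("s001", "s002"), ("s003", "s004")]

def toy_degree_centrality(edges):
    """Count the edges incident on each node, treating edges as undirected."""
    degree = Counter()
    for source, target in edges:
        degree[source] += 1
        degree[target] += 1
    return dict(degree)

def top_k_by_score(scores, k):
    """Pick the k best-scoring ids (ties broken by id), then sort by position."""
    ranked = sorted(scores, key=lambda sid: (-scores[sid], sid))
    return sorted(ranked[:k])

toy_degrees = toy_degree_centrality(toy_edges)
toy_top = top_k_by_score(toy_degrees, 2)
print(toy_degrees, toy_top)
```

The explicit tie-break by ID is a choice made for this sketch; the Cypher query above leaves the order of tied scores to the database.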


### Strength Summarizer (CN-Strength)

Degree centrality is based only on the number of edges incident on a node. Strength centrality extends it by also taking the weights of those edges into account. Like degree centrality, it can serve as a useful proxy for importance.

In [7]:
def compute_strength_centrality(graph):
    query = """
        CALL algo.degree.stream("Sentence", "SIM", {
            direction: "both", weightProperty: "sim"})
        YIELD nodeId, score
        RETURN algo.asNode(nodeId).sid AS sid, score
        ORDER BY score DESC
        LIMIT %d
    """ % (NUM_SENTS_IN_SUMMARY)
    results = graph.run(query).data()
    recos = [x["sid"] for x in results]
    recos = sorted(recos)
    return recos

strength_recos = compute_strength_centrality(graph)
display_recommendations(sent_index, "strength", strength_recos)

  strength: ['s008', 's021', 's056', 's062', 's063']
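
The difference between degree and strength can be seen on a small weighted graph. This is a sketch with made-up edges and weights, assuming strength is the sum of incident edge weights; it does not call Neo4j.

```python
from collections import defaultdict

# Hypothetical weighted, undirected edges: (source, target, similarity).
toy_weighted_edges = [("s000", "s001", 0.9), ("s000", "s002", 0.5),
                      ("s001", "s002", 0.3), ("s002", "s003", 0.1),
                      ("s002", "s004", 0.1)]

def toy_degree_and_strength(weighted_edges):
    """Return (degree, strength) dicts; strength sums incident edge weights."""
    degree = defaultdict(int)
    strength = defaultdict(float)
    for source, target, weight in weighted_edges:
        for node in (source, target):
            degree[node] += 1
            strength[node] += weight
    return dict(degree), dict(strength)

toy_degree, toy_strength = toy_degree_and_strength(toy_weighted_edges)
most_connected = max(toy_degree, key=toy_degree.get)
strongest = max(toy_strength, key=toy_strength.get)
print(most_connected, strongest)
```

In this toy graph the node with the most edges has several weak links, so the two measures pick different top nodes, much as the degree and strength outputs above differ in one position.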


### Closeness Summarizer (CN-LI)

High closeness centrality indicates nodes that can spread information across the network efficiently. This roughly corresponds to the CN-LI (Locality Index) summarizer in the paper.

In [8]:
def compute_closeness_centrality(graph):
    query = """
        CALL algo.closeness.stream('Sentence', 'SIM')
        YIELD nodeId, centrality
        RETURN algo.asNode(nodeId).sid AS sid, centrality AS score
        ORDER BY score DESC
        LIMIT %d
    """ % (NUM_SENTS_IN_SUMMARY)
    results = graph.run(query).data()
    recos = [x["sid"] for x in results]
    recos = sorted(recos)
    return recos

closeness_recos = compute_closeness_centrality(graph)
display_recommendations(sent_index, "closeness", closeness_recos)

 closeness: ['s005', 's021', 's056', 's062', 's063']
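
As a rough illustration, closeness can be computed with a breadth-first search over hop counts. The sketch below assumes the textbook definition (number of reachable nodes divided by the sum of their distances) on a hypothetical toy graph; the exact normalization used by `algo.closeness.stream` may differ.

```python
from collections import deque

# Hypothetical toy graph as an undirected adjacency list.
toy_adjacency = {
    "s000": ["s001", "s002", "s003"],
    "s001": ["s000", "s002"],
    "s002": ["s000", "s001"],
    "s003": ["s000", "s004"],
    "s004": ["s003"],
}

def hop_distances(adjacency, start):
    """Breadth-first search returning hop counts from start to reachable nodes."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist

def toy_closeness(adjacency):
    """Closeness = (reachable nodes) / (sum of distances to them)."""
    scores = {}
    for node in adjacency:
        dist = hop_distances(adjacency, node)
        total = sum(d for other, d in dist.items() if other != node)
        scores[node] = (len(dist) - 1) / total if total > 0 else 0.0
    return scores

toy_closeness_scores = toy_closeness(toy_adjacency)
print(max(toy_closeness_scores, key=toy_closeness_scores.get))
```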


### PageRank Summarizer (CN-PR)

PageRank is a well-known measure of centrality and is hence included here. The paper did not use this measure, but we think it is important enough to include in our implementation.

In [9]:
def compute_pagerank_centrality(graph):
    query = """
        CALL algo.pageRank.stream('Sentence', 'SIM', 
            {iterations:20, dampingFactor:0.85})
        YIELD nodeId, score
        RETURN algo.asNode(nodeId).sid AS sid, score
        ORDER BY score DESC
        LIMIT %d
    """ % (NUM_SENTS_IN_SUMMARY)
    results = graph.run(query).data()
    recos = [x["sid"] for x in results]
    recos = sorted(recos)
    return recos

pagerank_recos = compute_pagerank_centrality(graph)
display_recommendations(sent_index, "pagerank", pagerank_recos)

  pagerank: ['s005', 's021', 's056', 's062', 's063']
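
The following sketch shows a plain power-iteration PageRank with the same `iterations` and `dampingFactor` values as the query above. It uses the normalized textbook formulation on a hypothetical toy graph and assumes every node has at least one neighbor; the scores are not expected to match the scale of the Neo4j procedure.

```python
# Hypothetical toy graph as an undirected adjacency list.
toy_pr_adjacency = {
    "s000": ["s001", "s002", "s003"],
    "s001": ["s000", "s002"],
    "s002": ["s000", "s001"],
    "s003": ["s000", "s004"],
    "s004": ["s003"],
}

def toy_pagerank(adjacency, iterations=20, damping=0.85):
    """Power iteration; each node splits its rank evenly among its neighbors."""
    n = len(adjacency)
    rank = {node: 1.0 / n for node in adjacency}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in adjacency}
        for node, neighbors in adjacency.items():
            share = damping * rank[node] / len(neighbors)
            for neighbor in neighbors:
                new_rank[neighbor] += share
        rank = new_rank
    return rank

toy_ranks = toy_pagerank(toy_pr_adjacency)
print(max(toy_ranks, key=toy_ranks.get))
```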


### Shortest Path Summarizer (CN-SP)

Sentences that are close to other sentences in the graph are likely to be good candidates for summary sentences. This technique computes the shortest paths between all pairs of nodes, then computes the mean shortest path from each node. Nodes with a low mean shortest path are preferred for the summary. This corresponds to the CN-SP summarizer in the paper.

Two variants of CN-SP have been proposed: the first normalizes the non-zero weights, and the second takes reciprocals of the non-zero weights of the shortest path matrix. These correspond to CN-SP<sup>WC</sup> and CN-SP<sup>WI</sup> in the paper, respectively.

In [10]:
def compute_mean_shortest_path(graph, preprocess_weight=None):
    query = """
        CALL algo.allShortestPaths.stream('sim', {nodeQuery:'Sentence',defaultValue:1.0})
        YIELD sourceNodeId, targetNodeId, distance
        WITH sourceNodeId, targetNodeId, distance
        WHERE algo.isFinite(distance) = true
        MATCH (source:Sentence) WHERE id(source) = sourceNodeId
        MATCH (target:Sentence) WHERE id(target) = targetNodeId
        WITH source, target, distance WHERE source <> target
        RETURN source.sid AS source, target.sid AS target, distance
    """
    results_df = graph.run(query).to_data_frame()
    if preprocess_weight is not None:
        if preprocess_weight == "norm":
            max_dist = results_df["distance"].max()
            results_df["distance"] = np.where(
                results_df["distance"]==0, 0, 
                max_dist - results_df["distance"] + 1)
        if preprocess_weight == "recip":
            results_df["distance"] = np.where(
                results_df["distance"]==0, 0, 
                1.0 / results_df["distance"])
    # Average only the numeric distance column; the target column holds sentence IDs.
    msps = results_df.groupby("source")["distance"].mean().to_dict()
    sorted_msps = sorted(msps.items(), key=operator.itemgetter(1))
    recos = [x[0] for x in sorted_msps[0:NUM_SENTS_IN_SUMMARY]]
    return recos


msp_recos = compute_mean_shortest_path(graph)
display_recommendations(sent_index, "MSP", msp_recos)
msp_n_recos = compute_mean_shortest_path(graph, preprocess_weight="norm")
display_recommendations(sent_index, "MSP_n", msp_n_recos)
msp_r_recos = compute_mean_shortest_path(graph, preprocess_weight="recip")
display_recommendations(sent_index, "MSP_r", msp_r_recos)

       MSP: ['s010', 's058', 's017', 's051', 's024']
     MSP_n: ['s071', 's067', 's078', 's041', 's050']
     MSP_r: ['s071', 's067', 's078', 's041', 's050']
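
The unweighted-transform case (plain mean shortest path) can be sketched without Neo4j or pandas. Here Floyd-Warshall stands in for `algo.allShortestPaths.stream`, and the three-node path graph and its weights are hypothetical.

```python
import itertools

def all_pairs_shortest_paths(nodes, weighted_edges):
    """Floyd-Warshall over an undirected weighted graph."""
    inf = float("inf")
    dist = {(a, b): (0.0 if a == b else inf) for a in nodes for b in nodes}
    for a, b, weight in weighted_edges:
        dist[(a, b)] = min(dist[(a, b)], weight)
        dist[(b, a)] = min(dist[(b, a)], weight)
    # product() varies its first element slowest, so k is the outer loop.
    for k, i, j in itertools.product(nodes, repeat=3):
        if dist[(i, k)] + dist[(k, j)] < dist[(i, j)]:
            dist[(i, j)] = dist[(i, k)] + dist[(k, j)]
    return dist

def mean_shortest_paths(nodes, dist):
    """Mean finite distance from each node to every other reachable node."""
    means = {}
    for source in nodes:
        values = [dist[(source, target)] for target in nodes
                  if target != source and dist[(source, target)] != float("inf")]
        means[source] = sum(values) / len(values)
    return means

# Hypothetical path graph: s000 -(1.0)- s001 -(2.0)- s002
toy_nodes = ["s000", "s001", "s002"]
toy_dist = all_pairs_shortest_paths(
    toy_nodes, [("s000", "s001", 1.0), ("s001", "s002", 2.0)])
toy_means = mean_shortest_paths(toy_nodes, toy_dist)
toy_best = min(toy_means, key=toy_means.get)
print(toy_means, toy_best)
```

The middle node has the lowest mean shortest path, so it would be the first sentence picked under this criterion.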


### d-Ring Summarizer (CN-Ring)

In this technique, we first choose the node with highest degree centrality, then add nodes to this subgraph by decreasing the degree centrality requirement in a stepped manner. At each stage, we add the newly qualified nodes to our subgraph. The process continues until we have reached the maximum number of sentences required by our summary.

The intuition here is that we choose the most central nodes, proxies for the best summary sentences, from the graph in a step-wise manner. 

Two variants are proposed. The first picks qualifying sentences by their position (earlier sentences are preferred), and the second picks qualifying sentences by their degree centrality. They correspond to CN-Rings<sup>L</sup> and CN-Rings<sup>K</sup>, respectively.

Implementation-wise, we need a lookup table of node centrality as well as a list of nodes sorted by centrality, to support this and similar methods later. Both are produced by the `compute_node_degrees` function.

In [11]:
def compute_node_degrees(graph):
    query = """
        CALL algo.degree.stream("Sentence", "SIM", {direction: "both"})
        YIELD nodeId, score
        RETURN algo.asNode(nodeId).sid AS sid, score
        ORDER BY score DESC
    """
    result = graph.run(query).data()
    result_list = sorted([(x["sid"], x["score"]) for x in result],
        key=operator.itemgetter(1), reverse=True)
    result_dict = {x["sid"]: x["score"] for x in result}
    return result_list, result_dict


def _compute_rings(graph, cent_list, cent_dict, summary_sents, sort_by):
    neighbor_sids = set()
    for sid in summary_sents:
        query = """
            MATCH (a {sid:"%s"})-[:SIM]->(b) 
            RETURN algo.getNodeById(id(b)).sid AS sid
        """ % (sid)
        result = graph.run(query).data()
        neighbor_sids.update([x["sid"] for x in result])
    if sort_by == "degree":
        neighbor_sids_scored = sorted([(x, cent_dict[x]) for x in neighbor_sids],
            key=operator.itemgetter(1), reverse=True)
        neighbor_sids = [x[0] for x in neighbor_sids_scored]
    else:
        neighbor_sids = sorted(neighbor_sids)
    summary_sents.extend(neighbor_sids)
    if len(summary_sents) >= NUM_SENTS_IN_SUMMARY:
        return summary_sents
    else:
        # propagate the result of the recursive call back to the caller
        return _compute_rings(graph, cent_list, cent_dict, summary_sents, sort_by)

        
def compute_rings(graph, cent_list, cent_dict, sort_by):
    summary_sents = [cent_list[0][0]]
    _compute_rings(graph, cent_list, cent_dict, summary_sents, sort_by)
    return sorted(summary_sents[0:NUM_SENTS_IN_SUMMARY])


cent_list, cent_dict = compute_node_degrees(graph)

ring_k_recos = compute_rings(graph, cent_list, cent_dict, sort_by="degree")
display_recommendations(sent_index, "ring_k", ring_k_recos)
ring_l_recos = compute_rings(graph, cent_list, cent_dict, sort_by="position")
display_recommendations(sent_index, "ring_l", ring_l_recos)

    ring_k: ['s005', 's021', 's056', 's062', 's063']
    ring_l: ['s001', 's003', 's004', 's005', 's063']
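
The ring expansion can be sketched iteratively on a toy adjacency list. The graph and the function name `toy_rings` are hypothetical, and this version deliberately differs from the cell above in two ways: it skips nodes that are already selected, and it stops when no new neighbors are found.

```python
# Hypothetical toy graph as an undirected adjacency list.
toy_ring_adjacency = {
    "s000": ["s001", "s002", "s003", "s004"],
    "s001": ["s000"],
    "s002": ["s000"],
    "s003": ["s000", "s004", "s005"],
    "s004": ["s000", "s003"],
    "s005": ["s003"],
}

def toy_rings(adjacency, limit, sort_by="degree"):
    """Grow rings of neighbors around the highest-degree node."""
    degree = {node: len(nbrs) for node, nbrs in adjacency.items()}
    seed = max(sorted(adjacency), key=lambda node: degree[node])
    selected = [seed]
    frontier = [seed]
    while len(selected) < limit and frontier:
        # Next ring: neighbors of the current ring that are not yet selected.
        ring = {nb for node in frontier for nb in adjacency[node]} - set(selected)
        if sort_by == "degree":
            ordered = sorted(ring, key=lambda node: (-degree[node], node))
        else:
            ordered = sorted(ring)  # position order, earlier sentences first
        selected.extend(ordered)
        frontier = ordered
    return sorted(selected[:limit])

toy_ring_k = toy_rings(toy_ring_adjacency, 3, sort_by="degree")
toy_ring_l = toy_rings(toy_ring_adjacency, 3, sort_by="position")
print(toy_ring_k, toy_ring_l)
```

On this graph the first ring holds four candidates for two remaining slots, so the two orderings select different sentences.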


### k-Core Summarizers (CN-Cores)

This summarizer starts with a small core consisting of the node with the highest degree centrality, then adds more nodes by sequentially decreasing the value of k. At each step, candidate nodes are added until the required number of summary sentences is reached.

Two variants are proposed: the first adds candidate nodes by position, and the second adds candidate nodes by centrality.

In [12]:
def compute_cores(graph, result_list, result_dict, sort_by):
    k_values = sorted(set(x[1] for x in result_list), reverse=True)
    summary_sents = []
    for k_value in k_values:
        # use the function arguments rather than the notebook-level globals
        new_sents = [x[0] for x in result_list if x[1] == k_value]
        if sort_by == "degree":
            new_sents_scored = sorted([(x, result_dict[x]) for x in new_sents],
                key=operator.itemgetter(1), reverse=True)
            new_sents = [x[0] for x in new_sents_scored]
        else:
            new_sents = sorted(new_sents)
        summary_sents.extend(new_sents)
        if len(summary_sents) >= NUM_SENTS_IN_SUMMARY:
            break
    # also covers graphs with fewer sentences than the summary length
    return sorted(summary_sents[0:NUM_SENTS_IN_SUMMARY])

core_k_recos = compute_cores(graph, cent_list, cent_dict, sort_by="degree")
display_recommendations(sent_index, "core_k", core_k_recos)
core_l_recos = compute_cores(graph, cent_list, cent_dict, sort_by="position")
display_recommendations(sent_index, "core_l", core_l_recos)

    core_k: ['s005', 's021', 's056', 's062', 's063']
    core_l: ['s005', 's021', 's056', 's062', 's063']
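
The cell above uses each node's raw degree as the threshold k. For comparison only, here is a sketch of the textbook k-core decomposition, which assigns each node a core number by repeatedly peeling off the lowest-degree node. This is not the method used in the cell above, and the toy graph and the name `core_numbers` are hypothetical.

```python
# Hypothetical toy graph: a triangle (s000, s003, s004) with three pendant nodes.
toy_core_adjacency = {
    "s000": ["s001", "s002", "s003", "s004"],
    "s001": ["s000"],
    "s002": ["s000"],
    "s003": ["s000", "s004", "s005"],
    "s004": ["s000", "s003"],
    "s005": ["s003"],
}

def core_numbers(adjacency):
    """Peel the minimum-degree node repeatedly; record the running maximum."""
    degree = {node: len(nbrs) for node, nbrs in adjacency.items()}
    remaining = set(adjacency)
    core = {}
    k = 0
    while remaining:
        node = min(sorted(remaining), key=lambda n: degree[n])
        k = max(k, degree[node])
        core[node] = k
        remaining.remove(node)
        for neighbor in adjacency[node]:
            if neighbor in remaining:
                degree[neighbor] -= 1
    return core

toy_cores = core_numbers(toy_core_adjacency)
print(toy_cores)
```

Here `s000` has degree 4 but core number 2, because only the triangle survives once the pendant nodes are removed. A degree threshold and a true core number can therefore rank nodes differently.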


### w-Cuts Summarizers (CN-Cuts)

Similar to k-cores, w-cuts starts with the node pair that has the highest edge weight, then progressively adds more edges by decreasing the edge weight threshold. At each stage, nodes from the candidate triples are added to the summary until the required number of summary sentences is reached.

Two variants are proposed: one where nodes are added based on position, and another where nodes are added based on degree. They correspond to CN-Cuts<sup>L</sup> and CN-Cuts<sup>K</sup>, respectively.

In [13]:
def compute_edge_weights(graph):
    query = """
        MATCH (a)-[e:SIM]->(b) 
        RETURN a.sid AS source, e.sim AS weight, b.sid AS target 
        ORDER BY weight DESC
    """
    result = graph.run(query).data()
    edge_list = sorted([(x["source"], x["weight"], x["target"]) for x in result],
        key=operator.itemgetter(1), reverse=True)
    return edge_list


def compute_cuts(graph, edge_list, cent_dict, sort_by):
    w_values = sorted(list(set([x[1] for x in edge_list])), reverse=True)
    summary_sents = []
    for w_value in w_values:
        node_pairs = [(x[0], x[2]) for x in edge_list if x[1] == w_value]
        # NOTE: a node that appears in several edges of this weight (for example,
        # when an edge is stored in both directions) is appended more than once.
        new_sents = []
        for source, target in node_pairs:
            if source not in summary_sents:
                new_sents.append(source)
            if target not in summary_sents:
                new_sents.append(target)
        if sort_by == "degree":
            new_sents_scored = sorted([(x, cent_dict[x]) for x in new_sents],
                key=operator.itemgetter(1), reverse=True)
            new_sents = [x[0] for x in new_sents_scored]
        else:
            new_sents = sorted(new_sents)
        summary_sents.extend(new_sents)
        if len(summary_sents) >= NUM_SENTS_IN_SUMMARY:
            break
    # also covers graphs with fewer sentences than the summary length
    return sorted(summary_sents[0:NUM_SENTS_IN_SUMMARY])

        
edge_list = compute_edge_weights(graph)

cut_k_recos = compute_cuts(graph, edge_list, cent_dict, sort_by="degree")
display_recommendations(sent_index, "cut_k", cut_k_recos)
cut_l_recos = compute_cuts(graph, edge_list, cent_dict, sort_by="position")
display_recommendations(sent_index, "cut_l", cut_l_recos)

     cut_k: ['s052', 's052', 's062', 's063', 's063']
     cut_l: ['s052', 's052', 's062', 's063', 's063']
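
The outputs above contain repeated sentence IDs. One possible way to avoid that is to skip a node if it is already a candidate at the current weight, as in the sketch below. The function `toy_cuts`, the edge triples, and the degree table are hypothetical, and the deduplication is an assumption of this sketch rather than the behavior of the cell above.

```python
def toy_cuts(weighted_edges, degree, limit, sort_by="degree"):
    """Add nodes from edges in decreasing weight order, without duplicates."""
    selected = []
    for weight in sorted({w for _, w, _ in weighted_edges}, reverse=True):
        candidates = []
        for source, w, target in weighted_edges:
            if w != weight:
                continue
            for node in (source, target):
                if node not in selected and node not in candidates:
                    candidates.append(node)
        if sort_by == "degree":
            candidates.sort(key=lambda node: (-degree[node], node))
        else:
            candidates.sort()  # position order, earlier sentences first
        selected.extend(candidates)
        if len(selected) >= limit:
            break
    return sorted(selected[:limit])

# Hypothetical (source, weight, target) triples, some stored in both directions.
toy_cut_edges = [("s000", 0.9, "s001"), ("s001", 0.9, "s000"),
                 ("s001", 0.5, "s002"), ("s002", 0.5, "s001"),
                 ("s002", 0.2, "s003")]
toy_cut_degree = {"s000": 1, "s001": 2, "s002": 2, "s003": 1}
toy_cut_recos = toy_cuts(toy_cut_edges, toy_cut_degree, 3)
print(toy_cut_recos)
```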


### Majority Voting (CN-Voting)

Recommendations from each summarizer are passed through a Majority Voting ensembling summarizer. The summarizer picks the most frequent sentence at each summary position.
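
As a small illustration of position-wise voting, the sketch below votes over three made-up recommendation lists. The name `toy_majority_vote` and the sentence IDs are hypothetical.

```python
from collections import Counter

def toy_majority_vote(recommendations):
    """For each summary position, pick the id recommended most often there."""
    return [Counter(position).most_common(1)[0][0]
            for position in zip(*recommendations)]

# Hypothetical outputs of three summarizers, each sorted by position.
toy_recos = [
    ["s001", "s004", "s009"],
    ["s001", "s005", "s009"],
    ["s002", "s005", "s009"],
]
toy_vote = toy_majority_vote(toy_recos)
print(toy_vote)
```

Because each position is voted on independently, nothing in this scheme prevents the same sentence from winning at two positions when the input lists disagree about ordering.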

In [14]:
def find_most_frequent(votes):
    vote_counter = collections.Counter()
    for vote in votes:
        vote_counter[vote] += 1
    return vote_counter.most_common(1)[0][0]


def compute_majority_vote(results):
    lhs = None
    for result in results:
        if lhs is None:
            lhs = result
            continue
        else:
            rhs = result
            assert(len(lhs) == len(rhs))
    majority_recos = []
    for i in range(len(lhs)):
        recos_at_i = []
        for result in results:
            recos_at_i.append(result[i])
        majority_recos.append(find_most_frequent(recos_at_i))
    return majority_recos


majvote_recos = compute_majority_vote([
    degree_recos, strength_recos, closeness_recos, pagerank_recos,
    msp_recos, msp_n_recos, msp_r_recos,
    ring_k_recos, ring_l_recos, 
    core_k_recos, core_l_recos,
    cut_k_recos, cut_l_recos
])
display_recommendations(sent_index, "vote", majvote_recos, debug=False)

      vote: ['s005', 's021', 's056', 's062', 's063']
s005 : Well before the Iranian leader's arrival in Caracas, his plans for a Latin America tour grabbed global attention as tensions grow between many Western powers and Iran over the nation's nuclear program.
s021 : Spanish-language headlines on the network's website last week described Israeli spies, foreign intervention in Syria, a report that Japan plans to "disobey" U.S. sanctions against Iran and an allegation that airport security screening machines in the United States cause death.
s056 : Officials in the United States and other Western nations have ratcheted up sanctions against Iran several times since a November report by the U.N. nuclear watchdog agency said the Iranian government was developing the technology needed to build a nuclear weapon.
s062 : Some U.S. government officials and Washington analysts allege that Iran could be using new business relationships and energy deals with Latin American countries as a cover for