## Create Graph

For the summarization task, we model a document as a graph of sentence nodes. Each node represents a single sentence from our `sentences.txt` file.

Nodes are connected by edges whose weights reflect how many commonly occurring nouns two sentences share. Consider two sentences `s1` and `s2`. If they share one noun, we draw an edge between them with weight 1. If they share two nouns, the edge has weight 2, and so on.
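As a minimal sketch of this weighting rule, the snippet below counts the nouns that two sentences have in common. The noun lists are made up for illustration, and `edge_weight` is a hypothetical helper that does not appear in the notebook.

```python
def edge_weight(nouns_a, nouns_b):
    """Number of distinct nouns the two sentences have in common."""
    return len(set(nouns_a) & set(nouns_b))

# Hypothetical noun lemmas for three sentences (illustration only).
s1 = ["iran", "sanction", "country"]
s2 = ["iran", "country", "venezuela"]
s3 = ["oil", "price"]

w12 = edge_weight(s1, s2)  # shares "iran" and "country"
w13 = edge_weight(s1, s3)  # shares nothing, so no edge would be drawn
print(w12, w13)
```

The notebook computes these weights for all sentence pairs at once with a matrix product rather than pair by pair.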

In this notebook, we will process each sentence to extract its nouns, compute similarity between all pairs of sentences based on noun co-occurrence, and generate CSV files that will be used by `neo4j-admin` to load the graph into Neo4j.

In [1]:
import collections
import numpy as np
import os
import spacy

In [2]:
DATA_DIR = "../data"
SENTENCE_PATH = os.path.join(DATA_DIR, "sentences.txt")

NODE_PATH = os.path.join(DATA_DIR, "nodes.csv")
EDGE_PATH = os.path.join(DATA_DIR, "edges.csv")

### Extract nouns

We will use the spaCy English model to tokenize each sentence in our `sentences.txt` file, POS tag the tokens, keep only the nouns, lemmatize them, and build a vocabulary of the nouns in our document.

Lemmatization is done so we can treat words such as `countries` and `country` the same way. Clearly a sentence that mentions the first variation of the word is similar to another that mentions the second.
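The effect of lemmatization on matching can be sketched without loading a model. In the snippet below, `TOY_LEMMAS` is a hypothetical hand-written lookup table that stands in for spaCy's lemmatizer, and the word lists are invented for illustration.

```python
# Hypothetical lemma table standing in for spaCy's lemmatizer (illustration only).
TOY_LEMMAS = {"countries": "country", "sanctions": "sanction"}

def lemmatize(word):
    """Return the toy lemma for a word, or the lowercased word if unknown."""
    word = word.lower()
    return TOY_LEMMAS.get(word, word)

nouns_a = ["countries", "sanctions"]
nouns_b = ["country", "oil"]

# Comparing surface forms finds no overlap between the two sentences.
shared_raw = set(nouns_a) & set(nouns_b)
# Comparing lemmas recovers the shared noun "country".
shared_lemma = {lemmatize(w) for w in nouns_a} & {lemmatize(w) for w in nouns_b}
print(shared_raw, shared_lemma)
```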

In [3]:
nlp = spacy.load('en_core_web_sm')

In [4]:
noun_ctr = collections.Counter()
num_docs = 0
with open(SENTENCE_PATH, "r") as fsents:
    for line in fsents:
        doc = nlp(line.strip())
        for token in doc:
            # Penn Treebank noun tags all start with "NN" (NN, NNS, NNP, NNPS)
            if token.tag_.startswith("NN"):
                noun_ctr[token.lemma_] += 1
        num_docs += 1

In [5]:
noun_ctr.most_common(10)

[('iran', 41),
 ('u.s.', 14),
 ('country', 12),
 ('america', 11),
 ('united', 11),
 ('states', 11),
 ('venezuela', 10),
 ('sanction', 10),
 ('ahmadinejad', 9),
 ('latin', 9)]

In [6]:
len(noun_ctr)

340

### Discard "uncommon" nouns

We will discard nouns which occur only once. It is also reasonable to use a higher threshold depending on the richness of the vocabulary extracted.

The `vocab` dictionary is used to map each common noun to a position, so we can represent each document as a vector of common nouns.
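A minimal sketch of the thresholding step with the cutoff exposed as a parameter, so a higher threshold is easy to try. The helper `build_vocab`, its `min_count` argument, and the toy counts are assumptions made for illustration.

```python
import collections

def build_vocab(noun_counts, min_count=2):
    """Map each noun seen at least `min_count` times to a column index."""
    common = [noun for noun, count in noun_counts.most_common()
              if count >= min_count]
    return {noun: i for i, noun in enumerate(common)}

# Toy counts (illustration only); "oil" occurs once and is dropped.
toy_counts = collections.Counter(
    {"iran": 4, "country": 3, "sanction": 2, "oil": 1})

vocab_2 = build_vocab(toy_counts)               # keep nouns seen at least twice
vocab_3 = build_vocab(toy_counts, min_count=3)  # stricter threshold
print(vocab_2)
print(vocab_3)
```

With `min_count=2` this keeps the same nouns as the `x[1] > 1` filter used in the notebook.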

In [7]:
common_nouns = [x[0] for x in noun_ctr.most_common() if x[1] > 1]
vocab = {x: i for (i, x) in enumerate(common_nouns)}

In [8]:
len(common_nouns)

111

### Construct Term-Document Matrix

We will represent each document as a vector of (lemmas of) common nouns. For that, we will construct a Term Document (TD) Matrix first, which is initially populated with zeros.

We populate the TD Matrix by reading through the `sentences.txt` file again, line by line, this time matching the tokens against our vocabulary represented by the `vocab` dictionary. For each occurrence of a common noun, we add 1 to the TD matrix entry at the row for that sentence and the column for that noun.

In [9]:
td_matrix = np.zeros((num_docs, len(common_nouns)))

with open(SENTENCE_PATH, "r") as fsents:
    for row, line in enumerate(fsents):
        doc = nlp(line.strip())
        for token in doc:
            # count every token whose lemma is in the common-noun vocabulary
            if token.lemma_ in vocab:
                td_matrix[row, vocab[token.lemma_]] += 1

### Construct Similarity Matrix

The shape of the TD Matrix is (number of sentences, number of common nouns). 

We want to construct a matrix that will contain similarities between sentences in terms of common noun co-occurrence, i.e., a matrix of shape (number of sentences, number of sentences).

This can be done by matrix multiplying the TD Matrix of shape (number of sentences, number of common nouns) with its transpose of shape (number of common nouns, number of sentences).

Normalization (as is done in the case of cosine similarity) is not required, since we are concerned only with edge weights relative to each other.
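To make the product concrete, here is a small sketch on a made-up matrix of three sentences and four common nouns. The name `toy_td` and its values are illustrative and unrelated to the real data.

```python
import numpy as np

# Toy matrix: 3 sentences (rows) x 4 common nouns (columns), illustration only.
toy_td = np.array([
    [1, 1, 0, 0],   # sentence 0 mentions nouns 0 and 1
    [1, 1, 1, 0],   # sentence 1 mentions nouns 0, 1 and 2
    [0, 0, 0, 2],   # sentence 2 mentions noun 3 twice
])

# Entry (i, j) is the dot product of the count vectors of sentences i and j.
toy_S = np.matmul(toy_td, toy_td.T)
print(toy_S)
```

Sentences 0 and 1 share two nouns, so `toy_S[0, 1]` is 2, while sentence 2 shares nothing with the others and gets zeros off the diagonal. Because the entries are dot products of counts, a noun repeated within a sentence contributes more than 1. The diagonal holds each sentence's similarity with itself.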

In [10]:
S = np.matmul(td_matrix, td_matrix.T)
print(S.shape)

(80, 80)


### Create CSV files to load into Neo4j

We will create two CSV files, one containing nodes and the other containing edges, to be used by the `neo4j-admin` tool to bulk import our graph into Neo4j.

The `neo4j-admin` tool is able to parse the schema from the CSV header line. 

Our nodes contain just the sentence ID (`sid`) and the entity label `Sentence`. For the `sid`, we use `s` followed by a zero-padded 3-digit running number. Padding is needed because our algorithm will attempt to sort by position; we probably should have specified a numeric position attribute and used that instead, but this works as well. The first few lines of `nodes.csv` look like this:

    sid:ID,:LABEL
    s000,Sentence
    s001,Sentence
    ...
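The text above mentions a numeric position attribute as an alternative to relying on zero padding. A minimal sketch of what that could look like, assuming a hypothetical `pos:int` property column that the notebook itself does not write:

```python
import io

def write_nodes(out, num_sentences):
    """Write node rows that carry a hypothetical integer `pos` property."""
    out.write("sid:ID,pos:int,:LABEL\n")
    for i in range(num_sentences):
        out.write("s{:03d},{:d},Sentence\n".format(i, i))

buf = io.StringIO()   # stands in for the real nodes.csv file
write_nodes(buf, 3)
node_lines = buf.getvalue().splitlines()
print("\n".join(node_lines))
```

With such a column, sorting could use the integer `pos` value instead of the string `sid`.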

Our relationships contain the start `sid`, the similarity value `sim`, the end `sid`, and the relationship type `SIM`. The similarity value corresponds to the value extracted from the similarity matrix `S` we generated earlier. The first few lines of `edges.csv` look like this:

    :START_ID,sim:int,:END_ID,:TYPE
    s000,1,s001,SIM
    s000,1,s002,SIM

Note that we have used `sim:int` to indicate that Neo4j should consider the second column as an integer.
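As a sketch of how rows in this format follow from a similarity matrix, the snippet below generates them from a toy 3x3 matrix held in plain Python lists. The helper `edge_rows` and the values in `toy_sim` are made up for illustration.

```python
def edge_rows(sim):
    """Yield one CSV row per ordered pair (i, j) with positive similarity."""
    n = len(sim)
    for i in range(n):
        for j in range(n):
            # skip self-loops and pairs that share no nouns
            if i != j and sim[i][j] > 0:
                yield "s{:03d},{:d},s{:03d},SIM".format(i, int(sim[i][j]), j)

# Toy symmetric similarity matrix for three sentences (illustration only).
toy_sim = [
    [2, 1, 0],
    [1, 3, 2],
    [0, 2, 1],
]
rows = list(edge_rows(toy_sim))
print("\n".join(rows))
```

Because the matrix is symmetric, every connected pair produces two rows, one per direction.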

In [11]:
num_nodes, num_edges = 0, 0

with open(NODE_PATH, "w") as fnodes:
    fnodes.write("sid:ID,:LABEL\n")
    for i in range(S.shape[0]):
        fnodes.write("s{:03d},Sentence\n".format(i))
        num_nodes += 1
print("number of nodes: {:d}".format(num_nodes))

with open(EDGE_PATH, "w") as fedges:
    fedges.write(":START_ID,sim:int,:END_ID,:TYPE\n")
    for i in range(S.shape[0]):
        for j in range(S.shape[1]):
            # skip self-loops and pairs that share no common nouns
            if S[i, j] > 0 and i != j:
                fedges.write("s{:03d},{:d},s{:03d},SIM\n".format(i, int(S[i, j]), j))
                num_edges += 1
print("number of edges: {:d}".format(num_edges))

number of nodes: 80
number of edges: 2400


### Bulk Load Graph into Neo4j

We will use the `neo4j-admin` tool to bulk load data from the generated CSV files `nodes.csv` and `edges.csv`, using the following command.

    cd $NEO4J_HOME
    bin/neo4j-admin import \
        --nodes=/path/to/nodes.csv \
        --relationships=/path/to/edges.csv

Note that the Neo4j server should be stopped during the import; otherwise `neo4j-admin` complains that the database is in use.