## Create Graph

This notebook reads the serialized matrices of generation probabilities computed from random walks of path lengths 1..3, and writes out CSV files for nodes and relationships that can be imported by `neo4j-admin` to create a Neo4j graph.

We need a Neo4j graph in this example only because we want to use the **Louvain Community Detection Algorithm** to cluster the resulting graph of generation probabilities.

In [1]:
import numpy as np
import os

In [2]:
DATA_DIR = "../data"

LABEL_FILEPATH = os.path.join(DATA_DIR, "labels.tsv")
GENPROB_FILEPATH_TEMPLATE = os.path.join(DATA_DIR, "genprobs_{:d}.npy")

NODES_FILEPATH = os.path.join(DATA_DIR, "nodes.csv")
EDGES_FILEPATH_TEMPLATE = os.path.join(DATA_DIR, "edges-{:d}.csv")

### Generate nodes.csv and node_id lookup

We read through the `labels.tsv` file to generate a mapping between the row ID of the generation probabilities matrix and the unique `doc_id` that we generated for the document.

The `nodes.csv` file looks something like this.

    doc_id:ID,category,:LABEL
    0-0-54241,alt.atheism,Document
    0-0-54242,alt.atheism,Document
    0-0-54243,alt.atheism,Document
    0-0-54244,alt.atheism,Document
    0-0-54245,alt.atheism,Document

In [3]:
row2docid = {}
num_nodes = 0
print("generating nodes...")
# context managers close both files (the labels file was previously left open)
with open(LABEL_FILEPATH, "r") as flabels, open(NODES_FILEPATH, "w") as fnodes:
    fnodes.write("doc_id:ID,category,:LABEL\n")
    for line in flabels:
        doc_id, label = line.strip().split('\t')
        # the line index in labels.tsv is used as the row ID of the genprobs matrices
        row2docid[num_nodes] = doc_id
        fnodes.write(",".join([doc_id, label, "Document"]) + "\n")
        num_nodes += 1

print("number of nodes: {:d}".format(num_nodes))

generating nodes...
number of nodes: 18810
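
The cell above joins fields with a plain `","`, which works because the `doc_id` and category values shown contain no commas. As a minimal sketch, assuming we wanted standard CSV quoting for values that do contain a comma, the same output could be produced with the standard-library `csv` module. The helper `write_nodes` and the second sample category are hypothetical and only serve the illustration.

```python
import csv
import io

def write_nodes(label_lines, out):
    """Write a nodes CSV with the same header; return the row -> doc_id mapping."""
    row2docid = {}
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["doc_id:ID", "category", ":LABEL"])
    for row, line in enumerate(label_lines):
        doc_id, label = line.rstrip("\n").split("\t")
        row2docid[row] = doc_id
        # csv.writer quotes a field only when it contains a delimiter or quote
        writer.writerow([doc_id, label, "Document"])
    return row2docid

# Tiny in-memory example; the second category is made up to show quoting.
sample = ["0-0-54241\talt.atheism\n", "0-0-54242\tmisc, other\n"]
buf = io.StringIO()
mapping = write_nodes(sample, buf)
print(buf.getvalue())
```

Passing open file objects instead of the in-memory `sample` and `buf` would give the same behavior as the notebook cell for the real `labels.tsv`.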


### Generate edges-*.csv

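Next we deserialize each of the `genprobs_*.npy` files (corresponding to t=1..3 respectively) to find the edges: every non-zero entry `G[i, j]` becomes a `PROB` relationship from document `i` to document `j`. Each `edges-*.csv` file looks like this: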

    :START_ID,gen_prob:float,:END_ID,:TYPE
    0-0-54241,0.02500,0-0-54255,PROB
    0-0-54241,0.03750,0-0-53404,PROB
    0-0-54241,0.02500,0-0-53783,PROB
    0-0-54241,0.02500,0-0-53579,PROB
    0-0-54241,0.01250,0-0-53671,PROB

In [4]:
for num_hops in range(1, 4):
    print("generating edges for generation probabilities ({:d})..."
          .format(num_hops))
    G = np.load(GENPROB_FILEPATH_TEMPLATE.format(num_hops))
    print(G.shape)
    num_edges = 0
    with open(EDGES_FILEPATH_TEMPLATE.format(num_hops), "w") as fedges:
        fedges.write(":START_ID,gen_prob:float,:END_ID,:TYPE\n")
        # emit one PROB relationship per non-zero entry of G
        # (row/col replace the inner `i`, which shadowed the outer loop variable)
        for row in range(G.shape[0]):
            for col in range(G.shape[1]):
                if G[row, col] == 0:
                    continue
                start_id = row2docid[row]
                end_id = row2docid[col]
                gen_prob = G[row, col]
                fedges.write("{:s},{:.5f},{:s},PROB\n".format(
                    start_id, gen_prob, end_id))
                num_edges += 1

    print("number of edges: {:d}".format(num_edges))
    # percentage of non-zero entries, measured against half the matrix size
    print("sparsity (%):",
          (num_edges * 100) / (np.power(G.shape[0], 2) / 2))

generating edges for generation probabilities (1)...
(18810, 18810)
number of edges: 897785
sparsity (%): 0.5074867989331181
generating edges for generation probabilities (2)...
(18810, 18810)
number of edges: 1424403
sparsity (%): 0.8051657344027024
generating edges for generation probabilities (3)...
(18810, 18810)
number of edges: 1411704
sparsity (%): 0.7979874290627249
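
The nested loop visits every cell of each 18810 x 18810 matrix, although fewer than 1% of them are non-zero. As a sketch of one possible alternative, assuming the matrices are dense NumPy arrays as loaded by `np.load`, `np.nonzero` can return the non-zero positions directly so that only actual edges are iterated. The helper `write_edges` and the toy matrix and IDs below are hypothetical.

```python
import io
import numpy as np

def write_edges(G, row2docid, out):
    """Write one PROB row per non-zero entry of G; return the edge count."""
    out.write(":START_ID,gen_prob:float,:END_ID,:TYPE\n")
    # indices of the non-zero cells, returned in row-major order
    rows, cols = np.nonzero(G)
    probs = G[rows, cols]
    for r, c, p in zip(rows.tolist(), cols.tolist(), probs.tolist()):
        out.write("{:s},{:.5f},{:s},PROB\n".format(
            row2docid[r], p, row2docid[c]))
    return len(rows)

# Toy 3 x 3 matrix standing in for one genprobs_*.npy file.
G = np.array([[0.0,    0.025,  0.0],
              [0.0375, 0.0,    0.0],
              [0.0,    0.0125, 0.0]])
ids = {0: "doc-a", 1: "doc-b", 2: "doc-c"}
buf = io.StringIO()
n = write_edges(G, ids, buf)
print(n)
print(buf.getvalue())
```

Because `np.nonzero` yields indices in row-major order, the rows come out in the same order as the nested loop would produce them.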


### Import to Neo4j

The sparsity figures above are all under 1%, so only a small fraction of possible document pairs carry a non-zero generation probability and the resulting graphs are quite sparse.

To import `nodes.csv` together with one of the `edges-*.csv` files (`edges-1.csv` in this example), the following `neo4j-admin` command can be used.

    cd $NEO4J_HOME
    bin/neo4j-admin import \
        --nodes=/path/to/nodes.csv \
        --relationships=/path/to/edges-1.csv
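
Since the notebook writes three edge files, the command has to be repeated once per file. The sketch below only assembles the argument lists and does not run anything. The helper `import_command` is hypothetical, it uses only the two flags shown above, and it resolves `../data` to an absolute path on the assumption that the command is run after `cd $NEO4J_HOME`, where a notebook-relative path would no longer point at the data directory.

```python
import os

def import_command(num_hops, data_dir="../data"):
    """Return the neo4j-admin argument list for one edges file (hypothetical helper)."""
    # absolute paths stay valid after changing into $NEO4J_HOME
    data_dir = os.path.abspath(data_dir)
    return [
        "bin/neo4j-admin", "import",
        "--nodes=" + os.path.join(data_dir, "nodes.csv"),
        "--relationships=" + os.path.join(
            data_dir, "edges-{:d}.csv".format(num_hops)),
    ]

# One command per path length t=1..3.
for num_hops in range(1, 4):
    print(" ".join(import_command(num_hops)))
```

Whether each of the three graphs should go into its own database is not specified here, so the sketch leaves that choice open.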