## Create Graph

This notebook creates the node and edge CSV files that `neo4j-admin` reads to import the graph into the database.

The node file stores a `seed_label` attribute alongside `doc_id`. For sentences that were manually annotated with a label of 1 or 2, `seed_label` holds that value. For sentences that were not manually annotated, `seed_label` holds a unique running number (the sentence's row index in the input file).
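The seed-label rule can be sketched as a small function. This is an illustrative sketch, not part of the notebook: `seed_label_for` and `colliding_rows` are hypothetical helpers, and the `-1` sentinel for "not annotated" is taken from the node-writing cell further down. Because the running number is a row index, an unannotated sentence at row 1 or 2 would receive a seed equal to a manual label, so the second helper checks for that case.

```python
UNLABELED = -1  # sentinel assumed to mark rows without a manual annotation


def seed_label_for(row_id, label):
    """Return the manual label if present, else the row index (hypothetical helper)."""
    return label if label != UNLABELED else row_id


def colliding_rows(labels, manual_labels=(1, 2)):
    """Row indices whose fallback seed would equal one of the manual labels."""
    return [row_id for row_id, label in enumerate(labels)
            if label == UNLABELED and row_id in manual_labels]


labels = [2, -1, 1, -1]          # toy annotations, not taken from the dataset
seeds = [seed_label_for(i, lab) for i, lab in enumerate(labels)]
print(seeds)                     # [2, 1, 1, 3]
print(colliding_rows(labels))    # [1]
```

An empty result from `colliding_rows` on the real labels would indicate that no fallback seed coincides with a manual label.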

Edges are extracted from the similarity matrix built in the previous notebook.

In [1]:
import numpy as np
import os

In [2]:
DATA_DIR = "../data"
SENTS_FILEPATH = os.path.join(DATA_DIR, "sentences-compound-plabels.tsv")
SIM_MATRIX_FILEPATH = os.path.join(DATA_DIR, "sim-matrix.npy")

NODES_FILEPATH = os.path.join(DATA_DIR, "nodes-compound.csv")
EDGES_FILEPATH = os.path.join(DATA_DIR, "edges-compound.csv")

### Write Nodes File

This step reads the partially annotated `sentences-compound-plabels.tsv` file and writes out the node values as described above. It also builds a `row_id` to `doc_id` mapping, which the next step uses to write the edges file.

The `nodes-compound.csv` file looks like this:

    doc_id:ID,seed_label:int,:LABEL
    S000292971500333X-2942,2,Sentence
    S000292971500333X-4265,2,Sentence
    ...
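Before handing the file to the importer, it may help to check that it matches this layout. The sketch below is one possible sanity check, assuming the three-column header shown above. `check_nodes` is a hypothetical helper, and the in-memory sample stands in for the real file.

```python
import csv
import io

# Sample content in the nodes file layout shown above (two rows only).
sample = io.StringIO(
    "doc_id:ID,seed_label:int,:LABEL\n"
    "S000292971500333X-2942,2,Sentence\n"
    "S000292971500333X-4265,2,Sentence\n"
)


def check_nodes(fh):
    """Return the number of rows after checking header, IDs, and labels."""
    reader = csv.reader(fh)
    header = next(reader)
    assert header == ["doc_id:ID", "seed_label:int", ":LABEL"], header
    seen = set()
    for doc_id, seed_label, node_label in reader:
        assert doc_id not in seen, "duplicate ID: " + doc_id
        seen.add(doc_id)
        int(seed_label)               # raises ValueError if not an integer
        assert node_label == "Sentence"
    return len(seen)


num_rows = check_nodes(sample)
print(num_rows)  # 2
```

To check the generated file instead of the sample, the same function could be called with `open(NODES_FILEPATH, newline="")`.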

In [3]:
# Maps each row index in the sentences file to its doc_id; reused when writing edges.
row2docid = {}
curr_rowid = 0
with open(SENTS_FILEPATH, "r") as fsents, open(NODES_FILEPATH, "w") as fnodes:
    fnodes.write("doc_id:ID,seed_label:int,:LABEL\n")
    for line in fsents:
        if line.startswith("#"):  # skip comment lines
            continue
        pii, sent_id, text, label = line.strip().split('\t')
        doc_id = "-".join([pii, sent_id])
        label = int(label)
        # manually annotated rows keep their label; all others use the row index
        seed_label = label if label != -1 else curr_rowid
        fnodes.write("{:s},{:d},Sentence\n".format(doc_id, seed_label))
        row2docid[curr_rowid] = doc_id
        curr_rowid += 1

print("number of nodes: {:d}".format(curr_rowid))

number of nodes: 668


### Write Edges File

The `edges-compound.csv` file looks like this:

    :START_ID,similarity:float,:END_ID,:TYPE
    S000292971500333X-2942,0.23417,S000292971500333X-4265,SIM
    S000292971500333X-2942,0.21130,S000292971500333X-6155,SIM
    ...

Each nonzero entry `S[i, j]` of the similarity matrix becomes an edge from sentence `i` to sentence `j`, with the entry's value as the edge weight. The actual `doc_id` values are looked up via the `row2docid` mapping created in the previous step.
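As an alternative to looping over every cell, the nonzero entries can be located in one call. The sketch below uses `np.nonzero` on a made-up 3x3 matrix. `S_toy` and `row2docid_toy` are illustrative stand-ins, not the notebook's data, and the line format mirrors the edges layout shown above.

```python
import numpy as np

# Toy 3x3 similarity matrix and ID mapping; values are made up for illustration.
S_toy = np.array([[0.0, 0.5, 0.0],
                  [0.5, 0.0, 0.25],
                  [0.0, 0.25, 0.0]])
row2docid_toy = {0: "doc-a", 1: "doc-b", 2: "doc-c"}

# np.nonzero returns the row and column indices of nonzero entries in row-major order.
rows, cols = np.nonzero(S_toy)
edge_lines = [
    "{:s},{:.5f},{:s},SIM".format(row2docid_toy[i], S_toy[i, j], row2docid_toy[j])
    for i, j in zip(rows.tolist(), cols.tolist())
]
for edge_line in edge_lines:
    print(edge_line)
```

When the matrix is symmetric, as in this toy example, each pair appears twice, once per direction. The nested-loop version behaves the same way.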

In [4]:
# Write one SIM edge per nonzero entry of the similarity matrix.
num_edges = 0
S = np.load(SIM_MATRIX_FILEPATH)
with open(EDGES_FILEPATH, "w") as fedges:
    fedges.write(":START_ID,similarity:float,:END_ID,:TYPE\n")
    for i in range(S.shape[0]):
        for j in range(S.shape[1]):
            if S[i, j] != 0:
                src_doc_id = row2docid[i]
                dst_doc_id = row2docid[j]
                weight = S[i, j]
                fedges.write("{:s},{:.5f},{:s},SIM\n"
                    .format(src_doc_id, weight, dst_doc_id))
                num_edges += 1

print("number of edges: {:d}".format(num_edges))

number of edges: 3340
