## Create Graphs

This notebook uses the provided knowledge graph dumps (tab-separated files) to build the node and edge files for two separate Neo4j graphs, in the format the `neo4j-admin` tool expects.

Both graphs use the same nodes file, `neo4j-nodes.tsv`. The first graph has only edges that are not parent-child relationships (we don't want our Personalized PageRank to be influenced by these relationships), and the second graph contains only `IsChildOf` relationships (we use this graph to roll up concepts into categories). Their edge files are `neo4j-redges.tsv` and `neo4j-pedges.tsv` respectively. Despite the `.tsv` extension, all three output files are written comma-separated.

These files will be used to populate Neo4j graphs using the following command pattern.

    neo4j-admin import \
        --nodes=/path/to/nodes.csv \
        --relationships=/path/to/edges.csv
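
As a rough illustration of how this pattern applies to the two graphs, the sketch below assembles one argument list per graph from the output paths this notebook writes. The helper name `build_import_args` is hypothetical, and only the `--nodes` and `--relationships` flags shown above are used. It only builds and prints the commands; it does not run them.

```python
import os

DATA_DIR = "../data"  # same location the notebook uses for its outputs


def build_import_args(nodes_path, rels_path):
    """Return the neo4j-admin import argument list for one graph (hypothetical helper)."""
    return [
        "neo4j-admin", "import",
        "--nodes={:s}".format(nodes_path),
        "--relationships={:s}".format(rels_path),
    ]


nodes_path = os.path.join(DATA_DIR, "neo4j-nodes.tsv")
# first file: aggregated non-hierarchical edges; second file: IsChildOf edges
for edges_file in ["neo4j-redges.tsv", "neo4j-pedges.tsv"]:
    args = build_import_args(nodes_path, os.path.join(DATA_DIR, edges_file))
    print(" ".join(args))
```

Both commands share the same nodes file and differ only in the relationships file.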


In [1]:
import os
import pandas as pd

In [2]:
DATA_DIR = "../data"

# TSV dumps from knowledge graph (provided)
KG_NODES_FILEPATH = os.path.join(DATA_DIR, "emmet-vertices.tsv")
KG_EDGES_FILEPATH = os.path.join(DATA_DIR, "emmet-edges.tsv")

# neo4j-admin import compatible node and relationship files (outputs)
NEO4J_NODES_FILEPATH = os.path.join(DATA_DIR, "neo4j-nodes.tsv")
NEO4J_RELS_FILEPATH = os.path.join(DATA_DIR, "neo4j-redges.tsv")
NEO4J_PARENT_FILEPATH = os.path.join(DATA_DIR, "neo4j-pedges.tsv")

### Write out Nodes

As with the annotation, we extract all concepts except those of class qualifier or event.
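
The filter can be sketched on its own as a small function, which makes it easy to check against a few rows. The function name `keep_node` and the sample rows are made up for illustration; the five-column layout and the excluded groups mirror the cell below.

```python
EXCLUDED_GROUPS = {"qualifier", "event"}


def keep_node(line):
    """Return the concept ID if the row should be kept, otherwise None."""
    cols = line.strip().split("\t")
    if len(cols) != 5:
        # skip malformed rows
        return None
    cid, _syns, _codes, _stycodes, stygrps = cols
    if stygrps in EXCLUDED_GROUPS:
        return None
    return cid


# made-up rows, not taken from the real dump
sample_lines = [
    "101\tfever\tC1\tT1\tdisorder\n",
    "102\tsevere\tC2\tT2\tqualifier\n",
    "103\tadmission\tC3\tT3\tevent\n",
    "104\tmalformed row\n",
]
kept = [cid for cid in map(keep_node, sample_lines) if cid is not None]
print(kept)
```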

In [3]:
valid_nodes = set()
num_nodes = 0

fnodes_neo = open(NEO4J_NODES_FILEPATH, "w")
fnodes_neo.write("cid:ID,:LABEL\n")

fnodes_kg = open(KG_NODES_FILEPATH, "r")
for line in fnodes_kg:
    if num_nodes % 100000 == 0:
        print("{:d} nodes processed".format(num_nodes))
    cols = line.strip().split('\t')
    if len(cols) != 5:
        continue
    cid, syns, codes, stycodes, stygrps = cols
    if stygrps == "qualifier" or stygrps == "event":
        continue
    fnodes_neo.write("{:s},Concept\n".format(cid))
    valid_nodes.add(cid)
    num_nodes += 1

print("{:d} nodes processed, COMPLETE".format(num_nodes))
fnodes_kg.close()
fnodes_neo.close()

0 nodes processed
100000 nodes processed
200000 nodes processed
300000 nodes processed
400000 nodes processed
500000 nodes processed
570639 nodes processed, COMPLETE


In [4]:
nodes_df = pd.read_csv(NEO4J_NODES_FILEPATH)
nodes_df.head()

Unnamed: 0,cid:ID,:LABEL
0,8247637,Concept
1,8247638,Concept
2,8247640,Concept
3,8247641,Concept
4,8247642,Concept


### Write out Relationships

We write out two sets of relationships: one with every relationship except `IsParentOf` and `IsChildOf`, and one with only `IsChildOf`. We keep only edges whose weight is 7.0 or higher (weights are on a 10-point scale).

Further, for the first graph, we aggregate multiple relationships between the same pair of nodes into a single one, since PPR (and PageRank as well) operates on a single relationship type, and we want to find high-PageRank nodes across all relationships. This is done in two steps.

In the first step, we filter out edges whose endpoints are not in our node list or whose weight is below 7.0, and we accumulate the weights for the same (source, destination) pair across multiple relationships where applicable. In the second step, we write the accumulated edges out into two distinct files.
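
A minimal sketch of the first step on in-memory rows, assuming each row is already parsed into `(src, rel, dst, weight)`. The function name `aggregate_edges`, the relationship names `treats` and `causes`, and the sample rows are hypothetical; the threshold check mirrors the `weight < 7.0` test in the cell below.

```python
from collections import defaultdict

MIN_WEIGHT = 7.0  # edges below this weight are dropped


def aggregate_edges(rows, valid_nodes):
    """Sum weights per (src, dst) pair, separately for IsChildOf and all other edges."""
    rel_weights = defaultdict(float)
    parent_weights = defaultdict(float)
    for src, rel, dst, weight in rows:
        if weight < MIN_WEIGHT:
            continue
        if src not in valid_nodes or dst not in valid_nodes:
            continue
        if rel == "IsParentOf":
            # discarded regardless
            continue
        target = parent_weights if rel == "IsChildOf" else rel_weights
        target[(src, dst)] += weight
    return dict(rel_weights), dict(parent_weights)


# made-up rows for illustration
rows = [
    ("1", "treats", "2", 8.0),
    ("1", "causes", "2", 9.0),       # same pair, weights are summed
    ("1", "treats", "3", 5.0),       # dropped: low weight
    ("1", "IsChildOf", "4", 10.0),
    ("4", "IsParentOf", "1", 10.0),  # dropped: IsParentOf
    ("1", "treats", "99", 9.0),      # dropped: 99 is not a valid node
]
rel_edges, parent_edges = aggregate_edges(rows, {"1", "2", "3", "4"})
print(rel_edges)
print(parent_edges)
```

Because weights are summed, an aggregated edge can end up above the 10-point scale of the individual relationships.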

In [5]:
rel_dict, parent_dict = {}, {}
num_edges, num_rel_edges, num_parent_edges = 0, 0, 0

In [6]:
fedges_kg = open(KG_EDGES_FILEPATH, "r")
for line in fedges_kg:
    # print(line)
    if num_edges % 500000 == 0:
        print("{:d} edges ({:d} parent, {:d} rel) processed".format(
            num_edges, num_parent_edges, num_rel_edges))
    src, rel, dst, weight = line.strip().split('\t')
    weight = float(weight)
    if weight < 7.0:
        # remove low weights
        continue
    if src not in valid_nodes or dst not in valid_nodes:
        # remove edges connecting an invalid node (discarded earlier)
        continue
    if rel == "IsParentOf":
        # we discard this regardless
        continue
    key = (src, dst)
    if rel == "IsChildOf":
        # parent-child relationship
        parent_dict[key] = parent_dict.get(key, 0.0) + weight
        num_parent_edges += 1
    else:
        # any other relationship
        rel_dict[key] = rel_dict.get(key, 0.0) + weight
        num_rel_edges += 1
    num_edges += 1
    
print("{:d} edges ({:d} parent, {:d} rel) processed, COMPLETE".format(
    num_edges, num_parent_edges, num_rel_edges))
fedges_kg.close()

0 edges (0 parent, 0 rel) processed
0 edges (0 parent, 0 rel) processed
500000 edges (470680 parent, 29320 rel) processed
500000 edges (470680 parent, 29320 rel) processed
500000 edges (470680 parent, 29320 rel) processed
1000000 edges (946771 parent, 53229 rel) processed
1000000 edges (946771 parent, 53229 rel) processed
1397470 edges (1316373 parent, 81097 rel) processed, COMPLETE


### Aggregated Relationship Edges

In [7]:
num_rel_edges = 0
fedges_neo_r = open(NEO4J_RELS_FILEPATH, "w")
fedges_neo_r.write(":START_ID,weight:float,:END_ID,:TYPE\n")
for key in rel_dict.keys():
    weight = rel_dict[key]
    src, dst = key
    fedges_neo_r.write("{:s},{:.1f},{:s},REL\n".format(src, weight, dst))
    num_rel_edges += 1
fedges_neo_r.close()
print("number of relation edges: {:d}".format(num_rel_edges))

number of relation edges: 79859


In [8]:
redges_df = pd.read_csv(NEO4J_RELS_FILEPATH)
redges_df.head()

Unnamed: 0,:START_ID,weight:float,:END_ID,:TYPE
0,8247673,8.0,2791353,REL
1,8247673,8.0,5348908,REL
2,8247690,10.0,8121429,REL
3,8247690,8.0,8881721,REL
4,8247690,8.0,8921489,REL


### Aggregated Parent Edges

In [9]:
num_parent_edges = 0
fedges_neo_p = open(NEO4J_PARENT_FILEPATH, "w")
fedges_neo_p.write(":START_ID,weight:float,:END_ID,:TYPE\n")
for key in parent_dict.keys():
    weight = parent_dict[key]
    src, dst = key
    fedges_neo_p.write("{:s},{:.1f},{:s},REL\n".format(src, weight, dst))
    num_parent_edges += 1
fedges_neo_p.close()
print("number of parent edges: {:d}".format(num_parent_edges))

number of parent edges: 720877


In [10]:
pedges_df = pd.read_csv(NEO4J_PARENT_FILEPATH)
pedges_df.head()

Unnamed: 0,:START_ID,weight:float,:END_ID,:TYPE
0,8247637,20.0,8237638,REL
1,8247638,20.0,8237638,REL
2,8242678,20.0,8247638,REL
3,8247640,20.0,3814900,REL
4,8247640,20.0,3816314,REL
