## Run Personalized PageRank (PPR)

We start with the first graph of aggregated relations (the one without parent and child relationships). The Neo4j graph is built with the following `neo4j-admin` command.

    neo4j-admin import \
        --nodes=/path/to/neo4j-nodes.tsv \
        --relationships=/path/to/neo4j-redges.tsv
        
In this notebook, we use the top 10 concepts for each article from the `story-concepts.tsv` file to populate the `sourceNodes` parameter of a Personalized PageRank (PPR) call. We then write out the 20 concepts with the highest PageRank scores returned by PPR.

The output is written to the file `story-ppr-concepts.tsv` in the same tab-separated format as the input (story ID, concept ID, value), with the PageRank score in place of the match count.
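
To make the role of `sourceNodes` concrete, here is a minimal pure-Python sketch of personalized PageRank by power iteration on a toy graph. It is not the Neo4j implementation: the function name `personalized_pagerank`, the toy edges, the normalization (scores sum to 1), and the handling of nodes without outgoing edges are all assumptions made for illustration, and the scores Neo4j returns may be scaled differently.

```python
def personalized_pagerank(edges, source_nodes, damping=0.85, iterations=50):
    """Toy personalized PageRank: random jumps land only on source_nodes."""
    sources = set(source_nodes)
    if not sources:
        raise ValueError("at least one source node is required")
    nodes = sorted({n for edge in edges for n in edge} | sources)
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    # Teleport distribution: uniform over the source nodes, zero elsewhere.
    teleport = {n: (1.0 / len(sources) if n in sources else 0.0) for n in nodes}
    scores = dict(teleport)

    for _ in range(iterations):
        # With probability (1 - damping) the walker jumps back to a source node.
        new_scores = {n: (1.0 - damping) * teleport[n] for n in nodes}
        for n in nodes:
            if out_links[n]:
                # Otherwise it follows one outgoing edge chosen uniformly.
                share = damping * scores[n] / len(out_links[n])
                for dst in out_links[n]:
                    new_scores[dst] += share
            else:
                # Assumption: a node with no outgoing edges returns its
                # mass to the source nodes.
                for m in nodes:
                    new_scores[m] += damping * scores[n] * teleport[m]
        scores = new_scores
    return scores


# Hypothetical toy graph; "a" plays the role of a story's annotated concept.
toy_edges = [("a", "b"), ("b", "c"), ("c", "a"), ("d", "c")]
scores = personalized_pagerank(toy_edges, ["a"])
print(sorted(scores, key=scores.get, reverse=True))  # ['a', 'b', 'c', 'd']
```

Node `d` is neither a source node nor reachable from one, so it ends with a score of zero. This is the sense in which PPR ranks concepts by their proximity to the concepts already annotated for a story.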

In [1]:
import os
import pandas as pd
import py2neo

In [2]:
DATA_DIR = "../data"
ANNOTATION_FILEPATH = os.path.join(DATA_DIR, "story-concepts.tsv")
PPR_RESULTS_FILEPATH = os.path.join(DATA_DIR, "story-ppr-concepts.tsv")

NEO4J_CONN_URL = "bolt://localhost:7687"

TOP_INPUTS = 10
TOP_OUTPUTS = 20

### Supporting functions

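* `get_concepts` returns up to `num_concepts` concept IDs annotated for the story given by `story_id`, keeping only concepts matched more than once. Concepts are taken in file order, so these are the most frequent ones only if each story's rows in `story-concepts.tsv` are sorted by descending match count.

* `build_ppr_query` builds the PPR query from `cids`, the input list of concept IDs, and `num_concepts`, the number of output concepts to return.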

In [3]:
def get_concepts(story_id, num_concepts):
    """Return up to num_concepts concept IDs for story_id, in file order."""
    cids = []
    with open(ANNOTATION_FILEPATH, "r") as fannots:
        for line in fannots:
            sid, cid, num_matched = line.strip().split('\t')
            if sid != story_id:
                continue
            # keep only concepts that matched more than once in the story
            if int(num_matched) > 1:
                cids.append(cid)
    # slicing also handles lists shorter than num_concepts
    return cids[:num_concepts]


def build_ppr_query(cids, num_concepts):
    query_str = []
    for cid in cids:
        query_str.append("MATCH (c%s:Concept {cid:'%s'})" % (cid, cid))
    source_nodes = "[" + ",".join(["c{:s}".format(cid) for cid in cids]) + "]"
    query_str.append("""
CALL algo.pageRank.stream('Concept', 'REL', 
    {iterations:50, dampingFactor:0.85, sourceNodes:%s})
YIELD nodeId, score
RETURN algo.asNode(nodeId).cid AS cid, score
ORDER BY score DESC
LIMIT %d
    """ % (source_nodes, num_concepts))
    return "\n".join(query_str)
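
As an alternative to interpolating concept IDs into the query text, the same call could in principle be written with Cypher parameters. The sketch below is an assumption, not part of the original notebook: `build_ppr_query_params` is a hypothetical helper, and the query relies on `sourceNodes` accepting a collected list of nodes, which has not been verified here against a live database.

```python
def build_ppr_query_params(cids, num_concepts):
    """Return (query, params) for a PPR call that uses Cypher parameters.

    Hypothetical variant of build_ppr_query: the concept IDs travel as
    parameters instead of being formatted into the query string.
    """
    query = """
MATCH (c:Concept) WHERE c.cid IN $cids
WITH collect(c) AS sources
CALL algo.pageRank.stream('Concept', 'REL',
    {iterations:50, dampingFactor:0.85, sourceNodes:sources})
YIELD nodeId, score
RETURN algo.asNode(nodeId).cid AS cid, score
ORDER BY score DESC
LIMIT $limit
"""
    params = {"cids": [str(cid) for cid in cids], "limit": int(num_concepts)}
    return query, params


query, params = build_ppr_query_params(["8001550", "9199226"], 20)
print(params)
```

With py2neo, the pair could then be passed as `graph.run(query, params)`. The query text stays the same for every story, and concept IDs no longer need to be valid as part of a Cypher variable name.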


### Run PPR for each story

In [4]:
graph = py2neo.Graph(NEO4J_CONN_URL, auth=("neo4j", "graph"))

# the context manager closes the output file even if a query fails
with open(PPR_RESULTS_FILEPATH, "w") as fppr:
    for fname in os.listdir(DATA_DIR):
        if not fname.endswith(".story"):
            continue
        # the story ID is the file name without its extension
        story_id = fname.split('.')[0]
        cids = get_concepts(story_id, TOP_INPUTS)
        if len(cids) == 0:
            print("Not enough concepts found in {:s} for PPR, skipping"
                .format(story_id))
            continue
        query = build_ppr_query(cids, TOP_OUTPUTS)
        print("running PPR for story ID: {:s}".format(story_id))
        print("query:")
        print(query)
        print("---")
        results = graph.run(query).data()
        # one output row per returned concept: story ID, concept ID, score
        for result in results:
            fppr.write("{:s}\t{:s}\t{:.5f}\n"
                .format(story_id, result["cid"], result["score"]))

running PPR for story ID: 190823140729
query:
MATCH (c8001550:Concept {cid:'8001550'})
MATCH (c9199226:Concept {cid:'9199226'})
MATCH (c2795416:Concept {cid:'2795416'})
MATCH (c9790284:Concept {cid:'9790284'})
MATCH (c8837843:Concept {cid:'8837843'})
MATCH (c9793882:Concept {cid:'9793882'})
MATCH (c8113323:Concept {cid:'8113323'})

CALL algo.pageRank.stream('Concept', 'REL', 
    {iterations:50, dampingFactor:0.85, sourceNodes:[c8001550,c9199226,c2795416,c9790284,c8837843,c9793882,c8113323]})
YIELD nodeId, score
RETURN algo.asNode(nodeId).cid AS cid, score
ORDER BY score DESC
LIMIT 20
    
---
running PPR for story ID: 190904194433
query:
MATCH (c8101607:Concept {cid:'8101607'})
MATCH (c8133519:Concept {cid:'8133519'})
MATCH (c9199226:Concept {cid:'9199226'})
MATCH (c9723275:Concept {cid:'9723275'})
MATCH (c8106247:Concept {cid:'8106247'})
MATCH (c9722807:Concept {cid:'9722807'})
MATCH (c8128900:Concept {cid:'8128900'})
MATCH (c8131137:Concept {cid:'8131137'})
MATCH (c8120858:Concept {

### Verify Output

In [5]:
ppr_df = pd.read_csv(PPR_RESULTS_FILEPATH, delimiter="\t",
    names=["story_id", "concept_id", "pagerank"])
ppr_df.head()

| | story_id | concept_id | pagerank |
|---|---|---|---|
| 0 | 190823140729 | 8837843 | 0.15 |
| 1 | 190823140729 | 2795416 | 0.15 |
| 2 | 190823140729 | 9199226 | 0.15 |
| 3 | 190823140729 | 8113323 | 0.15 |
| 4 | 190823140729 | 9793882 | 0.15 |
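
Beyond inspecting the first rows, a few structural checks can be scripted. The following is a minimal sketch rather than part of the original notebook: `check_ppr_rows` is a hypothetical helper, the checks (three columns, no duplicate story/concept pairs, non-negative scores, at most `TOP_OUTPUTS` rows per story) are assumptions about what a valid file looks like, and the sample rows are the five rows shown above, written as the `{:.5f}` format in the notebook would produce them.

```python
import csv
import io
from collections import Counter

TOP_OUTPUTS = 20  # same value as in the notebook configuration


def check_ppr_rows(lines, max_per_story):
    """Return a list of problems found in tab-separated PPR result rows."""
    problems = []
    per_story = Counter()
    seen = set()
    for row_num, row in enumerate(csv.reader(lines, delimiter="\t"), start=1):
        if len(row) != 3:
            problems.append("row {:d}: expected 3 columns".format(row_num))
            continue
        story_id, concept_id, score = row
        per_story[story_id] += 1
        if (story_id, concept_id) in seen:
            problems.append("row {:d}: duplicate concept".format(row_num))
        seen.add((story_id, concept_id))
        if float(score) < 0:
            problems.append("row {:d}: negative score".format(row_num))
    for story_id, count in per_story.items():
        if count > max_per_story:
            problems.append("story {:s}: {:d} rows".format(story_id, count))
    return problems


# Sample rows taken from the head() output above.
sample = io.StringIO(
    "190823140729\t8837843\t0.15000\n"
    "190823140729\t2795416\t0.15000\n"
    "190823140729\t9199226\t0.15000\n"
    "190823140729\t8113323\t0.15000\n"
    "190823140729\t9793882\t0.15000\n"
)
print(check_ppr_rows(sample, TOP_OUTPUTS))  # []
```

To check the real output, the same function could be given an open handle on `PPR_RESULTS_FILEPATH` instead of the in-memory sample.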
