## Cluster with Louvain Community Detection

We construct a graph from the nodes in `nodes.csv` and the matching `edges-${t}.csv` for t = 1..3, where t is the number of hops used to generate the random walks.

We then run the Louvain community detection algorithm on the graph and use the community it assigns to each document as that document's cluster.

The Louvain algorithm greedily maximizes the modularity of the partition of the network into communities. Modularity measures how densely the nodes within each community are connected, compared with the connectivity expected in a random network.
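To make the modularity score concrete, here is a minimal sketch that computes it for a weighted, undirected toy graph. The function name `modularity` and the toy edge list are illustrative assumptions and are not part of this notebook or of the Neo4j library; the formula is the standard one, Q = Σ_c [ w_in(c)/m − (deg(c)/2m)² ], where m is the total edge weight.

```python
from collections import defaultdict

def modularity(edges, community):
    """Weighted modularity Q of a partition of an undirected graph.

    edges: iterable of (u, v, weight) with u != v
    community: dict mapping node -> community id
    """
    total_weight = 0.0             # m: sum of all edge weights
    internal = defaultdict(float)  # edge weight inside each community
    degree = defaultdict(float)    # summed weighted degree per community
    for u, v, w in edges:
        total_weight += w
        degree[community[u]] += w
        degree[community[v]] += w
        if community[u] == community[v]:
            internal[community[u]] += w
    two_m = 2.0 * total_weight
    return sum(2.0 * internal[c] / two_m - (degree[c] / two_m) ** 2
               for c in degree)

# toy graph (made up): two triangles joined by a single edge
toy_edges = [("a", "b", 1.0), ("b", "c", 1.0), ("a", "c", 1.0),
             ("d", "e", 1.0), ("e", "f", 1.0), ("d", "f", 1.0),
             ("c", "d", 1.0)]
split = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
merged = {node: 0 for node in split}

q_split = modularity(toy_edges, split)
q_merged = modularity(toy_edges, merged)
print(q_split, q_merged)
```

Splitting the toy graph into its two triangles scores higher than putting every node in one community, which is the kind of improvement Louvain searches for.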

**NOTE:** This notebook must be run once for each value of `t`. Before each run, the graph has to be recreated manually using `neo4j-admin`.
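As a rough sketch of the manual step, the helper below only assembles a `neo4j-admin import` command line for a given `t`; it does not run it. The helper name `neo4j_import_args`, the `csv_dir` argument, and the assumption that both CSV files sit in the same directory are hypothetical. The `--nodes` and `--relationships` options should be checked against the installed Neo4j version, which may also require further options.

```python
import os

def neo4j_import_args(csv_dir, num_hops):
    """Assemble (but do not run) a neo4j-admin import command line."""
    nodes_csv = os.path.join(csv_dir, "nodes.csv")
    edges_csv = os.path.join(csv_dir, "edges-{:d}.csv".format(num_hops))
    return [
        "neo4j-admin", "import",
        "--nodes=" + nodes_csv,
        "--relationships=" + edges_csv,
    ]

# csv_dir is an assumed location; adjust it to wherever the CSV files live
args = neo4j_import_args("../data", 2)
print(" ".join(args))
```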

In [1]:
import collections
import numpy as np
import pandas as pd
import py2neo
import operator
import os

### Set the NUM_HOPS parameter

⚠ Note that changing this value must be accompanied by repopulating the database with the matching (node, edge) CSV file pair.

In [2]:
# change this for each run (1, 2, 3)
NUM_HOPS = 1

In [3]:
DATA_DIR = "../data"
PRED_FILEPATH_TEMPLATE = os.path.join(
    DATA_DIR, "louvain-preds-g{:d}.tsv")
NEO4J_CONN_URL = "bolt://localhost:7687"

In [4]:
graph = py2neo.Graph(NEO4J_CONN_URL, auth=("neo4j", "graph"))

### Execute Louvain Community Detection

In [5]:
query = """
    CALL algo.louvain.stream('Document', 'PROB', 
        {weightProperty: 'gen_prob'})
    YIELD nodeId, community
    RETURN algo.asNode(nodeId).doc_id AS doc_id, 
           algo.asNode(nodeId).category AS label, 
           community AS pred
    ORDER BY pred DESC
"""
results = graph.run(query).data()

### Collect Predictions

The community value returned by the Louvain community detection algorithm is used as the cluster id.
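One way to see what using the community as a cluster id implies is to look at the most common category inside each community. The sketch below assumes rows shaped like the query results (`doc_id`, `label`, `pred`); the function name `majority_labels` and the toy rows are made up for illustration and are not the notebook's data.

```python
from collections import Counter, defaultdict

def majority_labels(rows):
    """Map each community id to (most common label, share of members with it)."""
    labels_by_community = defaultdict(Counter)
    for row in rows:
        labels_by_community[row["pred"]][row["label"]] += 1
    summary = {}
    for community, label_counts in labels_by_community.items():
        label, count = label_counts.most_common(1)[0]
        summary[community] = (label, count / sum(label_counts.values()))
    return summary

# toy rows, made up for illustration only
toy_rows = [
    {"doc_id": "a", "label": "rec.sport.baseball", "pred": 0},
    {"doc_id": "b", "label": "rec.sport.baseball", "pred": 0},
    {"doc_id": "c", "label": "misc.forsale", "pred": 0},
    {"doc_id": "d", "label": "misc.forsale", "pred": 1},
]
summary = majority_labels(toy_rows)
for community, (label, share) in sorted(summary.items()):
    print(community, label, round(share, 2))
```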

In [6]:
# write one (doc_id, label, community) row per document and count community sizes
num_results = 0
pred_category_counts = collections.Counter()
with open(PRED_FILEPATH_TEMPLATE.format(NUM_HOPS), "w") as fpred:
    for result in results:
        if num_results % 1000 == 0:
            print("{:d} document predictions extracted".format(num_results))
        doc_id = result["doc_id"]
        label = result["label"]
        pred = result["pred"]
        pred_category_counts[pred] += 1
        fpred.write("{:s}\t{:s}\t{:d}\n".format(doc_id, label, pred))
        num_results += 1

print("{:d} document predictions extracted, COMPLETE".format(num_results))
print("category counts")
# (community id, size) pairs sorted by community id
print(sorted(pred_category_counts.items(), key=operator.itemgetter(0)))

0 document predictions extracted
1000 document predictions extracted
2000 document predictions extracted
3000 document predictions extracted
4000 document predictions extracted
5000 document predictions extracted
6000 document predictions extracted
7000 document predictions extracted
8000 document predictions extracted
9000 document predictions extracted
10000 document predictions extracted
11000 document predictions extracted
12000 document predictions extracted
13000 document predictions extracted
14000 document predictions extracted
15000 document predictions extracted
16000 document predictions extracted
17000 document predictions extracted
18000 document predictions extracted
18810 document predictions extracted, COMPLETE
category counts
[(0, 1295), (1, 2560), (2, 1001), (3, 1119), (4, 915), (5, 1096), (6, 3304), (7, 2327), (8, 2087), (9, 2560), (10, 504), (11, 24), (12, 1), (13, 1), (14, 10), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1), (20, 1)]
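The counts above are easier to read once the communities are split by size. The following sketch copies the (community id, size) pairs from the output above; the `MIN_SIZE` threshold of 100 is an arbitrary, hypothetical cut-off chosen only for illustration.

```python
# (community id, size) pairs copied from the output above
category_counts = [
    (0, 1295), (1, 2560), (2, 1001), (3, 1119), (4, 915), (5, 1096),
    (6, 3304), (7, 2327), (8, 2087), (9, 2560), (10, 504), (11, 24),
    (12, 1), (13, 1), (14, 10), (15, 1), (16, 1), (17, 1), (18, 1),
    (19, 1), (20, 1),
]
MIN_SIZE = 100  # hypothetical threshold separating large from small communities

total = sum(size for _, size in category_counts)
large = [(c, size) for c, size in category_counts if size >= MIN_SIZE]
covered = sum(size for _, size in large)
singletons = sum(1 for _, size in category_counts if size == 1)

print("{:d} of {:d} communities have at least {:d} members".format(
    len(large), len(category_counts), MIN_SIZE))
print("they cover {:d} of {:d} documents".format(covered, total))
print("{:d} communities contain a single document".format(singletons))
```

For this run, 11 communities hold 18768 of the 18810 documents, and the remaining 42 documents are spread over 10 small communities, 8 of which contain a single document.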


In [7]:
pred_df = pd.read_csv(PRED_FILEPATH_TEMPLATE.format(NUM_HOPS), 
                      delimiter="\t",
                      names=["doc_id", "label", "prediction"])
pred_df.head()

| | doc_id | label | prediction |
|---|---|---|---|
| 0 | 1-19-83677 | talk.religion.misc | 20 |
| 1 | 1-18-178690 | talk.politics.misc | 19 |
| 2 | 1-9-104905 | rec.sport.baseball | 18 |
| 3 | 1-9-104547 | rec.sport.baseball | 17 |
| 4 | 1-6-76496 | misc.forsale | 16 |
