## Cluster using KMeans

K-Means is not technically a graph method, but it is included here for evaluation purposes.

Each document is represented by its row in the appropriate generation probabilities matrix, as a sparse vector of generation probabilities q(d<sub>i</sub>|d<sub>j</sub>).

We use `MiniBatchKMeans` from the scikit-learn library to generate clusters, setting k=20 (since this is the 20 newsgroups dataset). `MiniBatchKMeans` is preferred because of the size of our dataset: `KMeans` is generally safe to use for datasets with fewer than 10k samples, but `MiniBatchKMeans` is recommended for larger ones (ours has about 18.8k documents).
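As a rough illustration of the choice above, the sketch below fits both estimators on a small synthetic dataset. The data (`X_demo`, three well-separated 2-D blobs) and the parameter values are assumptions made for the example, not the newsgroups matrix or the settings used in this notebook.

```python
import numpy as np
from sklearn.cluster import KMeans, MiniBatchKMeans

# Hypothetical stand-in data: three well-separated 2-D blobs of 200 points each.
rng = np.random.RandomState(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X_demo = np.vstack([c + rng.normal(scale=0.5, size=(200, 2)) for c in centers])

# Full-batch K-Means uses every sample in each iteration.
full = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X_demo)

# Mini-batch K-Means updates the centroids from small random batches,
# trading a little cluster quality for speed on larger datasets.
mini = MiniBatchKMeans(n_clusters=3, batch_size=100, n_init=3,
                       random_state=42).fit(X_demo)

# Inertia (within-cluster sum of squares) gives a quick quality comparison.
print("KMeans inertia:          {:.2f}".format(full.inertia_))
print("MiniBatchKMeans inertia: {:.2f}".format(mini.inertia_))
```

On data this small the two usually land on very similar centroids; the speed difference only starts to matter as the number of rows grows.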

**NOTE:** We will run this notebook multiple times for different values of `NUM_HOPS` (and once more for generating baseline K-Means clusters for the original TD Matrix).

In [1]:
import numpy as np
import os
import pandas as pd

from scipy.sparse import load_npz
from sklearn.cluster import MiniBatchKMeans

### Set NUM_HOPS parameter

We will run this notebook multiple times for different values of the `NUM_HOPS` parameter below.

In [2]:
NUM_HOPS = 1

### Constants

In [3]:
NUM_CLUSTERS = 20  # dataset is 20 newsgroups

DATA_DIR = "../data"
LABEL_FILEPATH = os.path.join(DATA_DIR, "labels.tsv")

PREDS_FILEPATH_TEMPLATE = os.path.join(DATA_DIR, "kmeans-preds-g{:d}.tsv")
GENPROBS_FILEPATH_TEMPLATE = os.path.join(DATA_DIR, "genprobs_{:d}.npy")
# # reusing for predictions for TD Matrix
# PREDS_FILEPATH_TEMPLATE = os.path.join(DATA_DIR, "kmeans-preds-td.tsv")
# GENPROBS_FILEPATH_TEMPLATE = os.path.join(DATA_DIR, "tdmatrix.npz")

### Generate doc_id mappings

Build a mapping from each row ID in the generation probability matrix to the corresponding `doc_id` and label.

In [4]:
row2docid_labels = {}
# the line position in the labels file is the row ID in the matrix
with open(LABEL_FILEPATH, "r") as flabels:
    for row_id, line in enumerate(flabels):
        doc_id, label = line.strip().split('\t')
        row2docid_labels[row_id] = (doc_id, label)

num_nodes = len(row2docid_labels)

### Load Data

In [5]:
X = np.load(GENPROBS_FILEPATH_TEMPLATE.format(NUM_HOPS))
# # reusing for predictions for TD Matrix
# X = load_npz(GENPROBS_FILEPATH_TEMPLATE)
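The notebook switches between the dense `.npy` generation probabilities and the sparse `.npz` TD matrix by commenting lines in and out. One possible alternative, sketched here under the assumption that the file extension reliably indicates the format, is a small loader that dispatches on the extension. The helper name `load_matrix` and the temporary demo files are hypothetical and not part of the original notebook.

```python
import os
import tempfile

import numpy as np
from scipy.sparse import csr_matrix, load_npz, save_npz


def load_matrix(path):
    """Hypothetical helper: load a dense .npy array or a SciPy sparse .npz matrix."""
    ext = os.path.splitext(path)[1]
    if ext == ".npy":
        return np.load(path)
    if ext == ".npz":
        return load_npz(path)
    raise ValueError("unsupported matrix file extension: {:s}".format(ext))


# Demo with throwaway files so the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp_dir:
    dense_path = os.path.join(tmp_dir, "demo_dense.npy")
    sparse_path = os.path.join(tmp_dir, "demo_sparse.npz")

    np.save(dense_path, np.eye(3))
    save_npz(sparse_path, csr_matrix(np.eye(4)))

    dense_demo = load_matrix(dense_path)
    sparse_demo = load_matrix(sparse_path)

print(dense_demo.shape, sparse_demo.shape)
```

`MiniBatchKMeans` accepts both dense arrays and SciPy sparse matrices, so the result of either branch can be passed to `fit` unchanged.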

### KMeans Clustering

In [6]:
kmeans = MiniBatchKMeans(n_clusters=NUM_CLUSTERS, random_state=42)
kmeans.fit(X)
preds = kmeans.predict(X)

### Write out predictions

In [7]:
num_predicted = 0
# one output line per document: doc_id, true label, predicted cluster
with open(PREDS_FILEPATH_TEMPLATE.format(NUM_HOPS), "w") as fpreds:
    for row_id, pred in enumerate(preds):
        if num_predicted % 1000 == 0:
            print("{:d} rows predicted".format(num_predicted))
        doc_id, label = row2docid_labels[row_id]
        fpreds.write("{:s}\t{:s}\t{:d}\n".format(doc_id, label, pred))
        num_predicted += 1

print("{:d} rows predicted, COMPLETE".format(num_predicted))

0 rows predicted
1000 rows predicted
2000 rows predicted
3000 rows predicted
4000 rows predicted
5000 rows predicted
6000 rows predicted
7000 rows predicted
8000 rows predicted
9000 rows predicted
10000 rows predicted
11000 rows predicted
12000 rows predicted
13000 rows predicted
14000 rows predicted
15000 rows predicted
16000 rows predicted
17000 rows predicted
18000 rows predicted
18810 rows predicted, COMPLETE


In [8]:
pred_df = pd.read_csv(PREDS_FILEPATH_TEMPLATE.format(NUM_HOPS), 
                      delimiter="\t",
                      names=["doc_id", "label", "prediction"])
pred_df.head()

|   | doc_id | label | prediction |
|---|---|---|---|
| 0 | 0-0-54241 | alt.atheism | 8 |
| 1 | 0-0-54242 | alt.atheism | 8 |
| 2 | 0-0-54243 | alt.atheism | 8 |
| 3 | 0-0-54244 | alt.atheism | 8 |
| 4 | 0-0-54245 | alt.atheism | 8 |
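Since the predictions are written out alongside the true labels for later evaluation, here is a minimal sketch of one way such a frame could be inspected. The toy data in `toy_df` is made up for illustration, and the choice of a label-by-cluster contingency table plus the adjusted Rand index is an assumption, not necessarily the evaluation used elsewhere in this project.

```python
import pandas as pd
from sklearn.metrics import adjusted_rand_score

# Hypothetical toy frame with the same columns as pred_df.
toy_df = pd.DataFrame({
    "doc_id": ["a", "b", "c", "d", "e", "f"],
    "label": ["alt.atheism"] * 3 + ["sci.space"] * 3,
    "prediction": [8, 8, 8, 2, 2, 8],
})

# Contingency table: rows are true labels, columns are predicted cluster IDs.
table = pd.crosstab(toy_df["label"], toy_df["prediction"])
print(table)

# The adjusted Rand index compares two partitions regardless of how clusters are numbered.
ari = adjusted_rand_score(toy_df["label"], toy_df["prediction"])
print("ARI: {:.3f}".format(ari))
```

Cluster IDs are arbitrary, so a permutation-invariant measure such as this is more meaningful than comparing `label` and `prediction` values directly.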
