## Evaluation

We evaluate each clustering using the [Silhouette Score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html).

The Silhouette Coefficient is computed for each sample from the mean intra-cluster distance (`a`) and the mean nearest-cluster distance (`b`), i.e., the mean distance between the sample and the members of the nearest cluster that the sample does not belong to. The Silhouette Coefficient for a sample is `(b - a) / max(a, b)`.

The Silhouette Score is the mean Silhouette Coefficient over all samples and serves as a measure of cluster quality, with values ranging from -1 (worst) to +1 (best). Values near 0 indicate overlapping clusters.
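
To make the formula concrete, the following sketch computes `(b - a) / max(a, b)` by hand on a toy dataset and compares the result with scikit-learn. The data values and the helper name `silhouette_by_hand` are illustrative assumptions and are not part of the experiments in this notebook; Euclidean distance is assumed, which is the default metric of `silhouette_score`.

```python
import numpy as np
from sklearn.metrics import silhouette_samples, silhouette_score

# Toy data: two well-separated 1-D clusters (illustrative values only).
X_toy = np.array([[0.0], [1.0], [10.0], [11.0]])
cluster_ids = np.array([0, 0, 1, 1])


def silhouette_by_hand(X, cluster_ids):
    """Compute (b - a) / max(a, b) per sample, using Euclidean distance."""
    # Pairwise Euclidean distances between all samples.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = cluster_ids == cluster_ids[i]
        same[i] = False  # exclude the sample itself from its own cluster
        if not same.any():
            continue  # singleton cluster: the coefficient is taken to be 0
        # a: mean distance to the other members of the sample's own cluster.
        a = dists[i, same].mean()
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            dists[i, cluster_ids == c].mean()
            for c in np.unique(cluster_ids)
            if c != cluster_ids[i]
        )
        scores[i] = (b - a) / max(a, b)
    return scores


manual = silhouette_by_hand(X_toy, cluster_ids)
reference = silhouette_samples(X_toy, cluster_ids)
print(np.allclose(manual, reference))
print(np.isclose(manual.mean(), silhouette_score(X_toy, cluster_ids)))
```

For the first sample, `a = 1` and `b = 10.5`, so its coefficient is `9.5 / 10.5`, close to +1 as expected for well-separated clusters.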

**NOTE:** This notebook will need to be run for each of the following experiments:

* TD-Matrix, KMeans
* G1-Matrix, KMeans
* G1-Matrix, Louvain
* G2-Matrix, KMeans
* G2-Matrix, Louvain
* G3-Matrix, KMeans
* G3-Matrix, Louvain


In [1]:
import numpy as np
import os

from scipy.sparse import load_npz
from sklearn.metrics import silhouette_score

In [2]:
DATA_DIR = "../data"

In [3]:
# Uncomment exactly one MATRIX_FILEPATH / PRED_FILEPATH pair per run.
# MATRIX_FILEPATH = os.path.join(DATA_DIR, "tdmatrix.npz")
# PRED_FILEPATH = os.path.join(DATA_DIR, "kmeans-preds-td.tsv")

MATRIX_FILEPATH = os.path.join(DATA_DIR, "genprobs_1.npy")
PRED_FILEPATH = os.path.join(DATA_DIR, "kmeans-preds-g1.tsv")

# MATRIX_FILEPATH = os.path.join(DATA_DIR, "genprobs_1.npy")
# PRED_FILEPATH = os.path.join(DATA_DIR, "louvain-preds-g1.tsv")

# MATRIX_FILEPATH = os.path.join(DATA_DIR, "genprobs_2.npy")
# PRED_FILEPATH = os.path.join(DATA_DIR, "kmeans-preds-g2.tsv")

# MATRIX_FILEPATH = os.path.join(DATA_DIR, "genprobs_2.npy")
# PRED_FILEPATH = os.path.join(DATA_DIR, "louvain-preds-g2.tsv")

# MATRIX_FILEPATH = os.path.join(DATA_DIR, "genprobs_3.npy")
# PRED_FILEPATH = os.path.join(DATA_DIR, "kmeans-preds-g3.tsv")

# MATRIX_FILEPATH = os.path.join(DATA_DIR, "genprobs_3.npy")
# PRED_FILEPATH = os.path.join(DATA_DIR, "louvain-preds-g3.tsv")
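
As an alternative to commenting and uncommenting path pairs, the seven experiments could be selected by name from a lookup table. This is only a sketch: the names `EXPERIMENTS` and `EXPERIMENT` and the experiment keys are hypothetical, while the file names are the ones used in the cell above.

```python
import os

DATA_DIR = "../data"

# Hypothetical lookup table: experiment key -> (matrix file, predictions file).
EXPERIMENTS = {
    "td-kmeans": ("tdmatrix.npz", "kmeans-preds-td.tsv"),
    "g1-kmeans": ("genprobs_1.npy", "kmeans-preds-g1.tsv"),
    "g1-louvain": ("genprobs_1.npy", "louvain-preds-g1.tsv"),
    "g2-kmeans": ("genprobs_2.npy", "kmeans-preds-g2.tsv"),
    "g2-louvain": ("genprobs_2.npy", "louvain-preds-g2.tsv"),
    "g3-kmeans": ("genprobs_3.npy", "kmeans-preds-g3.tsv"),
    "g3-louvain": ("genprobs_3.npy", "louvain-preds-g3.tsv"),
}

# Change this one value to switch experiments.
EXPERIMENT = "g1-kmeans"

matrix_file, pred_file = EXPERIMENTS[EXPERIMENT]
MATRIX_FILEPATH = os.path.join(DATA_DIR, matrix_file)
PRED_FILEPATH = os.path.join(DATA_DIR, pred_file)
print(MATRIX_FILEPATH, PRED_FILEPATH)
```

With this layout, a misspelled experiment key fails immediately with a `KeyError` instead of silently reusing a stale path pair.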

### Create label mappings

In [4]:
# Collect the distinct label names (second tab-separated column of the label file).
LABEL_FILEPATH = os.path.join(DATA_DIR, "labels.tsv")
unique_labels = set()
with open(LABEL_FILEPATH, "r") as flabel:
    for line in flabel:
        _, label = line.strip().split('\t')
        unique_labels.add(label)

# Map each label name to an integer id, assigned in sorted order.
label2lid = {label: lid for lid, label in enumerate(sorted(unique_labels))}

print(label2lid)

{'alt.atheism': 0, 'comp.graphics': 1, 'comp.os.ms-windows.misc': 2, 'comp.sys.ibm.pc.hardware': 3, 'comp.sys.mac.hardware': 4, 'comp.windows.x': 5, 'misc.forsale': 6, 'rec.autos': 7, 'rec.motorcycles': 8, 'rec.sport.baseball': 9, 'rec.sport.hockey': 10, 'sci.crypt': 11, 'sci.electronics': 12, 'sci.med': 13, 'sci.space': 14, 'soc.religion.christian': 15, 'talk.politics.guns': 16, 'talk.politics.mideast': 17, 'talk.politics.misc': 18, 'talk.religion.misc': 19}


### Collect labels and predictions

In [5]:
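# Each line has three tab-separated columns; the second is the label name
# and the third is the predicted cluster id.
labels, predictions = [], []
with open(PRED_FILEPATH, "r") as fpreds:
    for line in fpreds:
        _, label, pred = line.strip().split('\t')
        labels.append(label2lid[label])
        predictions.append(int(pred))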

### Load data

For the TD-Matrix only, we calculate the silhouette score for both the labels and the predictions. For all other matrices, we calculate it for the predictions only.

In [6]:
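# Use load_npz for the sparse TD-Matrix (.npz) and np.load for the .npy matrices.
# X = load_npz(MATRIX_FILEPATH)
X = np.load(MATRIX_FILEPATH)
# Uncomment for the TD-Matrix run to also score the grouping given by the labels.
# print(silhouette_score(X, labels))
print(silhouette_score(X, predictions))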

0.05757169248381719
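
Because the TD-Matrix is stored as a sparse `.npz` file and the other matrices as `.npy` files, the loader could also be chosen from the file extension instead of by commenting lines in and out. The helper name `load_matrix` below is a hypothetical addition, and the round-trip check uses throwaway toy files rather than the notebook's data.

```python
import os
import tempfile

import numpy as np
from scipy.sparse import csr_matrix, load_npz, save_npz


def load_matrix(filepath):
    """Load a sparse .npz or a dense .npy matrix based on the file extension."""
    ext = os.path.splitext(filepath)[1]
    if ext == ".npz":
        return load_npz(filepath)
    if ext == ".npy":
        return np.load(filepath)
    raise ValueError(f"Unsupported matrix format: {ext}")


# Round-trip check on temporary files (toy 3x3 identity matrices).
with tempfile.TemporaryDirectory() as tmp_dir:
    dense_path = os.path.join(tmp_dir, "dense.npy")
    sparse_path = os.path.join(tmp_dir, "sparse.npz")
    np.save(dense_path, np.eye(3))
    save_npz(sparse_path, csr_matrix(np.eye(3)))
    dense = load_matrix(dense_path)
    sparse = load_matrix(sparse_path)

print(type(dense).__name__, dense.shape, sparse.shape)
```

In the notebook this would reduce the load step to `X = load_matrix(MATRIX_FILEPATH)`; `silhouette_score` accepts both dense arrays and sparse matrices.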
