# Clustering Evaluation Metrics

Once every data point has been assigned a cluster ID, the next step is to measure how well the clustering agrees with the known class labels (the ground truth). Two metrics are used here: the Rand index and purity.

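Rand Index: It compares the clustering result with the ground truth over every pair of data points. A pair is a correct decision if both points are grouped together and belong to the same class, or if they are separated and belong to different classes. A pair is an incorrect decision if the points are placed together but shouldn't be, or are separated but should be together. The Rand index is the fraction of pairs decided correctly, so it ranges from 0 to 1, and a higher value indicates better agreement.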
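
Written as a formula, the Rand index can be sketched as follows. The symbols are assumptions introduced for illustration: TP counts pairs that share a cluster and a class, TN counts pairs that differ in both, FP counts pairs that share a cluster but not a class, FN counts pairs that share a class but not a cluster, and N is the number of data points.

$$
\mathrm{RI} = \frac{TP + TN}{TP + FP + FN + TN} = \frac{TP + TN}{\binom{N}{2}}
$$

The denominator is simply the number of unordered pairs, since each pair falls into exactly one of the four categories.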

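Purity: It focuses on how well each cluster represents a single class. A cluster has high purity when most of its data points belong to the same class according to the ground truth. The overall purity is the number of points that belong to the majority class of their cluster, divided by the total number of points.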
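
One common way to write the overall purity is sketched below. The notation is an assumption for illustration: c_k is the set of points in cluster k, t_j is the set of points whose true class is j, and N is the total number of points.

$$
\mathrm{purity} = \frac{1}{N} \sum_{k} \max_{j} \left| c_k \cap t_j \right|
$$

Each cluster contributes the size of its largest overlap with any single class, which is the quantity a per-cluster majority count would measure.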

In [2]:
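def RandIndex(result, gs):
  """Rand index between cluster assignments `result` and gold-standard labels `gs`."""
  TP = 0  # same cluster, same class
  TN = 0  # different clusters, different classes
  FP = 0  # same cluster, different classes
  FN = 0  # different clusters, same class

  # Visit every unordered pair (i, j) with i < j exactly once
  for i in range(len(result) - 1):
    for j in range(i + 1, len(result)):
      if result[i] == result[j]:  # Positive: the clustering groups the pair together
        if gs[i] == gs[j]:  # The ground truth agrees
          TP += 1
        else:
          FP += 1
      else:  # Negative: the clustering separates the pair
        if gs[i] == gs[j]:
          FN += 1
        else:
          TN += 1

  # Fraction of pairs on which the clustering and the ground truth agree
  RI = (TP + TN) / (TP + FP + TN + FN)

  return RI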

In [3]:
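def Purity(assignment, known):
    """Purity of cluster labels `assignment` against ground-truth labels `known`."""
    aLabels = list(set(assignment))  # distinct cluster IDs
    kLabels = list(set(known))       # distinct ground-truth classes

    maxOverlaps = []

    for cID in aLabels:
        # Indices of the points assigned to cluster cID
        indicesOfCID = {ii for ii in range(len(assignment))
                        if assignment[ii] == cID}
        overlap = []

        for ckID in kLabels:
            # Indices of the points whose true class is ckID
            indicesOfCkID = [ii for ii in range(len(known))
                             if known[ii] == ckID]
            # Number of points shared by this cluster and this class
            overlap.append(len(indicesOfCID.intersection(indicesOfCkID)))
        # The largest overlap is the size of the cluster's majority class
        maxOverlaps.append(max(overlap))

    purity = sum(maxOverlaps) / len(assignment)
    return purity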

In [4]:
# Toy example: cluster assignments and ground-truth class labels
assign = [1,1,2,1,2]
known = [0,0,0,1,1]

In [5]:
# Evaluate the Rand index on the toy example
print(RandIndex(assign, known))

0.4
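
To see where the 0.4 comes from, the ten point pairs of the toy example can be tallied by category. The following is a minimal sketch, not part of the original notebook; the helper name `pair_counts` is hypothetical, and the data is repeated so the block runs on its own.

```python
from itertools import combinations

def pair_counts(result, gs):
    """Tally TP, FP, FN, TN over all unordered pairs of points (hypothetical helper)."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for i, j in combinations(range(len(result)), 2):
        same_cluster = result[i] == result[j]
        same_class = gs[i] == gs[j]
        if same_cluster and same_class:
            counts["TP"] += 1
        elif same_cluster:
            counts["FP"] += 1
        elif same_class:
            counts["FN"] += 1
        else:
            counts["TN"] += 1
    return counts

# Same toy data as above
assign = [1, 1, 2, 1, 2]
known = [0, 0, 0, 1, 1]

counts = pair_counts(assign, known)
print(counts)  # {'TP': 1, 'FP': 3, 'FN': 3, 'TN': 3}
print((counts["TP"] + counts["TN"]) / sum(counts.values()))  # 0.4
```

Only 4 of the 10 pairs (1 true positive and 3 true negatives) are decided correctly, which gives 4/10 = 0.4.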


In [6]:
# Evaluate purity on the toy example
print(Purity(assign, known))

0.6
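
The 0.6 can be broken down per cluster to show which points count toward it. This is a small sketch using `collections.Counter`, offered as an alternative way to inspect the result rather than as the notebook's method; `purity_breakdown` is a hypothetical helper name.

```python
from collections import Counter, defaultdict

def purity_breakdown(assignment, known):
    """Map each cluster ID to (majority class, majority count, cluster size)."""
    classes_in_cluster = defaultdict(Counter)
    for c, k in zip(assignment, known):
        classes_in_cluster[c][k] += 1
    breakdown = {}
    for c, class_counts in classes_in_cluster.items():
        # On a tie, most_common returns the class encountered first
        majority_class, majority_count = class_counts.most_common(1)[0]
        breakdown[c] = (majority_class, majority_count, sum(class_counts.values()))
    return breakdown

# Same toy data as above
assign = [1, 1, 2, 1, 2]
known = [0, 0, 0, 1, 1]

breakdown = purity_breakdown(assign, known)
for cluster_id, (cls, count, size) in breakdown.items():
    print(f"cluster {cluster_id}: majority class {cls}, {count} of {size} points")
print(sum(count for _, count, _ in breakdown.values()) / len(assign))  # 0.6
```

Cluster 1 contributes 2 of its 3 points and cluster 2 contributes 1 of its 2 points (its two classes are tied), so the purity is (2 + 1) / 5 = 0.6.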
