# Find "optimal" number of clusters for transaction amounts

In the class solution, the desired number of clusters (`k`) is passed to the algorithm.

How do we know that this number is optimal? We don't, but we can assess how good it is by summing the squared distances of all clustered points from their cluster centers and comparing the result across different values of `k`. This sum is called the Within Set Sum of Squared Errors (`WSSSE`). It is the cost that k-means minimizes, so it measures the quality of the clustering for a given `k`. You can compute `WSSSE` yourself, or you can obtain it by calling the `computeCost` method on the model.

So, the lower the `WSSSE`, the better? Not quite: `WSSSE` drops to 0 when the number of clusters equals the number of clustered points, because every point can then be its own cluster center. You cannot beat that!
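To make the definition concrete, here is a minimal sketch that computes `WSSSE` by hand in plain NumPy, outside Spark. The helper name `wssse` and the sample amounts are assumptions made up for illustration; they are not taken from `tx.csv` or from the class solution.

```python
import numpy as np

def wssse(points, centers):
    """Sum of squared distances from each point to its nearest center."""
    points = np.asarray(points, dtype=float)
    centers = np.asarray(centers, dtype=float)
    # Pairwise differences, shape (n_points, n_centers, n_dims)
    diffs = points[:, None, :] - centers[None, :, :]
    # Squared Euclidean distances, shape (n_points, n_centers)
    sq_dists = (diffs ** 2).sum(axis=2)
    # Each point contributes the squared distance to its closest center
    return float(sq_dists.min(axis=1).sum())

# Illustrative one-dimensional amounts (made up, not from tx.csv)
amounts = [[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]]

print(wssse(amounts, [[6.5]]))          # one center for everything
print(wssse(amounts, [[2.0], [11.0]]))  # two well-placed centers
print(wssse(amounts, amounts))          # one center per point
```

The cost shrinks as centers are added and reaches 0 once every point has its own center, which is why the raw minimum of `WSSSE` cannot be used to pick `k`.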

What we're looking for instead is the point where `WSSSE` starts to decrease only slowly as `k` increases.

When you plot `WSSSE` against `k`, the optimal `k` is where the curve has an "elbow". In our case that is 6.
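Reading the elbow off a plot is a judgment call. As a rough sketch of one possible heuristic (an assumption for illustration, not the method used in the class solution), you could pick the first `k` after which adding another cluster reduces `WSSSE` by less than some fraction. The function name `find_elbow`, the 20% threshold, and the sample costs below are all hypothetical.

```python
def find_elbow(costs, min_relative_gain=0.2):
    """Return the first k after which one more cluster gains too little.

    costs: list of (k, wssse) pairs sorted by increasing k.
    Returns None if every step still improves by at least the threshold.
    """
    for (k, cost), (_, next_cost) in zip(costs, costs[1:]):
        if cost == 0:
            # Nothing left to improve
            return k
        relative_gain = (cost - next_cost) / cost
        if relative_gain < min_relative_gain:
            return k
    return None

# Made-up WSSSE values for k = 1..8 (not the output of the code in this notebook)
example_costs = [(1, 1000.0), (2, 500.0), (3, 250.0), (4, 120.0),
                 (5, 110.0), (6, 102.0), (7, 96.0), (8, 91.0)]

print(find_elbow(example_costs))
```

The result depends heavily on the chosen threshold, so treat it as a cross-check for the plot rather than a replacement.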

In [None]:
from numpy import array

from pyspark.mllib.clustering import KMeans
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Python Clustering").getOrCreate()


file = "/home/jovyan/Resources/tx.csv"
text = spark.sparkContext.textFile(file)

txns = text.map(lambda st: st.split(",")).map(lambda el: (el[0], el[1], float(el[2])))

# get the amounts for clustering training;
# cache them because the same RDD is reused for every cluster count
amnts = txns.map(lambda v: array([v[2]])).cache()

iterationCount = 20

print("Within Set Sum of Squared Errors")

# train one model per candidate cluster count and print its WSSSE
for clusterCount in range(1, 21):
    model = KMeans.train(amnts, clusterCount, maxIterations=iterationCount)

    wssse = model.computeCost(amnts)
    print(str(clusterCount) + "\t" + str(round(wssse)))


![Optimal Cluster Count](../Resources/img/optimal_cluster_count.png)