# Find the "optimal" number of clusters for transaction amounts

In the class solution the number of desired clusters (`k`) is passed to the algorithm.

How do we know that this number is optimal? Well, we don't, but we can assess how good it is by summing the squared distances of all clustered points from their cluster centers and comparing the result across different values of `k`. This sum is called the Within Set Sum of Squared Errors (`WSSSE`), and it is the quantity k-means tries to minimize for a given `k`. You can compute `WSSSE` yourself, or you can obtain it by calling the `computeCost` method on the model.

So a lower `WSSSE` is better, but only up to a point: `WSSSE` drops to 0 when the number of desired clusters equals the number of clustered points, because every point then becomes its own cluster center. You cannot beat that!
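To make the definition concrete, here is a minimal sketch of computing `WSSSE` by hand for one-dimensional data, in plain Scala without Spark. The names `WssseSketch`, `sqDistToNearest` and `wssse` are illustrative, and the amounts and centers are made up. They are not taken from `tx.csv` or from a trained model.

```scala
// Minimal sketch: WSSSE for one-dimensional data, computed by hand.
// All names here are illustrative; none of them are part of the Spark API.
object WssseSketch {
  // Squared distance from a point to the closest of the given centers.
  def sqDistToNearest(point: Double, centers: Seq[Double]): Double =
    centers.map(c => (point - c) * (point - c)).min

  // WSSSE: sum over all points of the squared distance to the nearest center.
  def wssse(points: Seq[Double], centers: Seq[Double]): Double =
    points.map(p => sqDistToNearest(p, centers)).sum

  def main(args: Array[String]): Unit = {
    val amounts = Seq(1.0, 2.0, 9.0, 11.0)   // made-up amounts
    println(wssse(amounts, Seq(1.5, 10.0)))  // two centers: prints 2.5
    println(wssse(amounts, amounts))         // one center per point: prints 0.0
  }
}
```

The second call shows the degenerate case described above: with one center per point, every squared distance is 0. With a trained Spark model, the centers would come from `model.clusterCenters` rather than being written by hand.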

What we're looking for, instead, is the point where `WSSSE` starts decreasing only slowly as `k` increases.

When you plot `WSSSE` against `k`, the optimal `k` is where the curve forms an "elbow". In our case, that is at `k` = 6.
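The elbow is usually read off the plot by eye. If you want a rough numeric hint, one possible heuristic (an assumption here, not the method used for the plot) is to pick the `k` where the curve bends most sharply, that is, where the second difference of `WSSSE` is largest. The sketch below uses the hypothetical names `ElbowSketch` and `elbow` and made-up cost values. On real data this heuristic can be noisy, so treat its answer as a starting point for inspecting the plot.

```scala
// Minimal sketch of an elbow heuristic based on the largest second difference.
// `ElbowSketch` and `elbow` are illustrative names, not part of the Spark API.
object ElbowSketch {
  // costs(i) is assumed to hold the WSSSE for k = i + 1.
  // Returns the k with the sharpest bend, or None if fewer than 3 values are given.
  def elbow(costs: Seq[Double]): Option[Int] =
    if (costs.length < 3) None
    else {
      // Second difference for each interior k: cost(k-1) - 2*cost(k) + cost(k+1).
      val bends = costs.sliding(3).map {
        case Seq(prev, cur, next) => prev - 2 * cur + next
      }.toSeq
      // Window i is centred on costs(i + 1), which corresponds to k = i + 2.
      Some(bends.zipWithIndex.maxBy(_._1)._2 + 2)
    }

  def main(args: Array[String]): Unit = {
    // Made-up WSSSE values for k = 1..5, not results from tx.csv.
    val costs = Seq(1000.0, 600.0, 200.0, 180.0, 170.0)
    println(elbow(costs)) // prints Some(3)
  }
}
```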

In [None]:
import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.mllib.linalg.Vectors

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("Scala Machine Learning Clustering").getOrCreate()

val file = "/home/jovyan/Resources/tx.csv"
val text = spark.sparkContext.textFile(file)

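// parse the transaction amount (third CSV column) as a Double, the type Vectors.dense expects
val txns = text.map(s => s.split(",")).map(a => a(2).toDouble)

// the amounts need to be vectorized for training; cache them because every run below reuses them
val amnts = txns.map(v => Vectors.dense(v)).cache()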

// maximum number of iterations for each k-means run
val iterationCount = 20

println("Within Set Sum of Squared Errors")
// train one model per candidate k and print its WSSSE
for (clusterCount <- 1 to 20) {
  val model = KMeans.train(amnts, clusterCount, iterationCount)

  val wssse = model.computeCost(amnts)
  println(s"$clusterCount\t${Math.round(wssse)}")
}


![Optimal Cluster Count](../Resources/img/optimal_cluster_count.png)