This code implements constrained k-means clustering in R. Constrained k-means in this context involves imposing lower bounds on cluster sizes to rectify a common problem of small, or even empty, clusters resulting from the usual k-means clustering algorithm. It does not involve must-include or must-exclude constraints. Our principal reference is P. S. Bradley, K. P. Bennett, and A. Demiriz, "Constrained K-Means Clustering," Microsoft Research document MSR-TR-2000-65, May, 2000. Available online at https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/tr-2000-65.pdf.
In our R-function, we extend the Bradley-Bennett-Demiriz constrained k-means algorithm to include problems where indivisible "chunks" of population are clustered. In this case, the stipulated lower bounds are targets to be met approximately. A good example is congressional districting where the number of clusters is stipulated decenially by apportionment and where population chunks can be census tracts, census block groups, or even census blocks. We demonstrate the algorithm by clustering Illinois census block groups into eighteen clusters based on geocoordinates to achieve minimal total within-cluster distance to centroids while achieving approximately equal cluster populations. Data was downloaded from the U.S. Census Bureau; we were unable to find suitable data at census block level. Illinois presents an interesting challenge with one huge metropolis, Chicago, a few smaller cities, and vast swaths of tiny towns and farmland.
We demonstrate a normal clustering problem with pulsar emissions data downloaded from the UCI Machine Learning Repository. In this context, we compare standard k-means clustering results (via R function kmeans) based on eight standardized numeric attributes. As you might expect, imposing moderate lower bounds on cluster sizes causes an increase in total within-cluster distance to centroids. However, the pulsar dataset includes a classification variable that indicates whether the pulsar was confirmed or not. Even though k-means is an unsupervised clustering algorithm with no explicit predictive element, it is interesting that moderate lower bounds decrease conditional entropy for the classification variable, i.e., classification entropy across clusters weighted by proportionate cluster sizes.
Our algorithm solves a linear programming problem at each iteration (function lp.transport in R package lpSolve). Even though this is a very efficient algorithm for problems of this type, we still encountered significant run times for our demo problems on a modern 64-bit PC with four 2.7 GHz processors and 8 GB of RAM. As an alternative we have utilized a collapsed dual algorithm which can be applied to much larger clustering problems. This approach produced very good results on our pulsars emissions example, matching the transportation solution in much less time, 10 minutes versus 103 minutes. So this approach may have merit on standard clustering problems with moderate lower bounding. The collapsed dual approach did not do so well on the congressional districting problem. It converged, but it had higher run time than the transportation algorithm (104 minutes versus 24 minutes), and moreover it produced less equal populations across the clusters. Hence, the collapsed dual approach may not work very well on problems with widely varying populations and very tight population bounds.