# K-Means Demo

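K-means is a simple but powerful clustering method, typically optimized with an Expectation-Maximization-style procedure (Lloyd's algorithm). It starts by randomly selecting K data points in X as initial centroids and assigns every sample to its nearest centroid. The mean of each resulting cluster is then computed and becomes that cluster's new centroid, and the two steps repeat until the centroids stop moving.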
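
The assign-then-average loop described above can be sketched in plain NumPy. This is a minimal illustration only, not cuML's or scikit-learn's implementation; the function name `lloyd_kmeans` and its parameters are hypothetical.

```python
import numpy as np


def lloyd_kmeans(X, k, n_iter=20, seed=0):
    """Illustrative Lloyd iteration (hypothetical helper, not a library API)."""
    rng = np.random.default_rng(seed)
    # Initialization: pick k distinct samples as the starting centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)

    def assign(c):
        # Squared Euclidean distance from every sample to every centroid.
        d2 = ((X[:, None, :] - c[None, :, :]) ** 2).sum(axis=2)
        return d2.argmin(axis=1)

    for _ in range(n_iter):
        # Assignment step: each sample joins its nearest centroid.
        labels = assign(centroids)
        # Update step: each centroid moves to the mean of its members.
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # leave an empty cluster where it is
                new_centroids[j] = members.mean(axis=0)
        converged = np.allclose(new_centroids, centroids)
        centroids = new_centroids
        if converged:
            break
    return centroids, assign(centroids)
```

Calling `lloyd_kmeans(X, k=2)` on a small `(n_samples, n_features)` float array returns the centroids and one label per sample. Production implementations add better initialization, multiple restarts, and tolerance-based stopping.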

cuML's KMeans supports the scalable k-means++ initialization method (`k-means||`). This method is more stable than randomly selecting K points.
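
To give a feel for why distance-aware seeding is more stable than uniform random selection, here is a minimal NumPy sketch of the classic sequential k-means++ seeding. It is not the parallel `k-means||` variant and not cuML's code; `kmeans_plus_plus_init` is a hypothetical helper, and it assumes X contains at least k distinct points.

```python
import numpy as np


def kmeans_plus_plus_init(X, k, seed=0):
    """Sequential k-means++ seeding (illustrative sketch, hypothetical helper)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    # First center: one sample chosen uniformly at random.
    first = rng.integers(n)
    centers = [X[first]]
    # Squared distance from every sample to its nearest chosen center.
    d2 = ((X - X[first]) ** 2).sum(axis=1)
    for _ in range(1, k):
        # Sample the next center with probability proportional to d2,
        # so far-away points are favored and chosen points have weight 0.
        idx = rng.choice(n, p=d2 / d2.sum())
        centers.append(X[idx])
        d2 = np.minimum(d2, ((X - X[idx]) ** 2).sum(axis=1))
    return np.array(centers)
```

Because already-covered regions get low sampling weight, the seeds tend to spread across the data instead of landing in the same cluster.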
    
The model accepts array-like inputs: NumPy arrays in host memory, device arrays (Numba arrays or any `__cuda_array_interface__`-compliant object, such as CuPy arrays), and cuDF DataFrames.

For information about cuDF, refer to the [cuDF documentation](https://docs.rapids.ai/api/cudf/stable).
    
For additional information on cuML's k-means implementation, see the [cuML API documentation](https://docs.rapids.ai/api/cuml/stable/api.html#cuml.KMeans).

## Thread-Safety Configuration (Optional)

On systems with many CPU cores, scikit-learn's default threading behavior can cause issues. OpenBLAS and other BLAS libraries may spawn excessive threads, leading to resource contention and potential hangs, especially in notebook environments.

**This cell is optional.** If you experience hangs or performance issues when running the notebook, execute the cell below to restrict BLAS/OpenMP backends to single-threaded mode. It must run before importing NumPy or scikit-learn to take effect.


In [None]:
# OpenBLAS / thread-safety guard
# MUST be executed before importing numpy / sklearn
import os

# OpenBLAS (pthreads build, MAX_THREADS=80)
os.environ.setdefault('OPENBLAS_NUM_THREADS', '1')

# Prevent other BLAS/OpenMP backends from oversubscribing
os.environ.setdefault('OMP_NUM_THREADS', '1')
os.environ.setdefault('MKL_NUM_THREADS', '1')

# Make joblib conservative inside notebooks
os.environ.setdefault('JOBLIB_MULTIPROCESSING', '0')

print('OPENBLAS_NUM_THREADS =', os.environ.get('OPENBLAS_NUM_THREADS'))
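
Since the guard is expected to take effect only if it runs before NumPy or scikit-learn are imported, it can help to check for that explicitly. The sketch below wraps the same environment variables in a hypothetical helper, `apply_thread_limits`, that also reports which of the watched modules were already loaded; the helper name and its parameters are assumptions for illustration.

```python
import os
import sys


def apply_thread_limits(limit="1", watched=("numpy", "sklearn")):
    """Cap BLAS/OpenMP threads; return watched modules that were already imported."""
    # Modules already present in sys.modules were imported before this call.
    already_imported = [name for name in watched if name in sys.modules]
    for var in ("OPENBLAS_NUM_THREADS", "OMP_NUM_THREADS", "MKL_NUM_THREADS"):
        # setdefault keeps any value exported before the kernel started.
        os.environ.setdefault(var, limit)
    return already_imported


late = apply_thread_limits()
if late:
    print("Already imported; a kernel restart is likely needed:", late)
```

An empty return value means none of the watched modules had been imported when the limits were set.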

## Imports

In [None]:
import cudf
import cupy
import matplotlib.pyplot as plt
from cuml.cluster import KMeans as cuKMeans
from cuml.datasets import make_blobs
from sklearn.cluster import KMeans as skKMeans
from sklearn.metrics import adjusted_rand_score

%matplotlib inline

## Define Parameters

In [None]:
n_samples = 100000
n_features = 25

n_clusters = 8
random_state = 0

## Generate Data

In [None]:
device_data, device_labels = make_blobs(
    n_samples=n_samples,
    n_features=n_features,
    centers=n_clusters,
    random_state=random_state,
    cluster_std=0.1
)

In [None]:
# Copy CuPy arrays from GPU memory to host memory (NumPy arrays).
# This is done to later compare CPU and GPU results.
host_data = device_data.get()
host_labels = device_labels.get()

## Scikit-learn model

### Fit

In [None]:
kmeans_sk = skKMeans(
    init="k-means++",
    n_clusters=n_clusters,
    random_state=random_state,
    n_init='auto'
)
%timeit kmeans_sk.fit(host_data)

## cuML Model

### Fit

In [None]:
kmeans_cuml = cuKMeans(
    init="k-means||",
    n_clusters=n_clusters,
    random_state=random_state
)

%timeit kmeans_cuml.fit(device_data)

## Visualize Centroids

Scikit-learn's k-means implementation uses the `k-means++` initialization strategy while cuML's k-means uses `k-means||`. As a result, the centroids found by the two implementations may not match exactly, and the differences can grow as the standard deviation of the points around each center (`cluster_std` in `make_blobs`) is increased.

*Note*: This is visualizing the centroids in only two dimensions. 
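
Because the order in which k-means numbers its clusters is arbitrary, two sets of centroids cannot be compared row by row. A minimal NumPy sketch of an order-independent check follows; `max_centroid_mismatch` is a hypothetical helper, not part of cuML or scikit-learn, and it simply reports the largest distance from any centroid to its nearest counterpart in the other set, using all features rather than the two plotted dimensions.

```python
import numpy as np


def max_centroid_mismatch(centers_a, centers_b):
    """Largest nearest-neighbor distance between two centroid sets (sketch)."""
    a = np.asarray(centers_a, dtype=float)
    b = np.asarray(centers_b, dtype=float)
    # Pairwise Euclidean distances, shape (len(a), len(b)).
    dists = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    # Check both directions so a missing or extra centroid is not hidden.
    return float(max(dists.min(axis=1).max(), dists.min(axis=0).max()))
```

In this notebook, once both models are fitted, the call would look like `max_centroid_mismatch(kmeans_sk.cluster_centers_, cupy.asnumpy(kmeans_cuml.cluster_centers_))`, assuming the cuML centroids are returned as a CuPy array as they are in the plotting cell.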

In [None]:
fig = plt.figure(figsize=(16, 10))
plt.scatter(host_data[:, 0], host_data[:, 1], c=host_labels, s=50, cmap='viridis')

# Plot the scikit-learn k-means centers as blue filled circles
centers_sk = kmeans_sk.cluster_centers_
plt.scatter(centers_sk[:, 0], centers_sk[:, 1], c='blue', s=100, alpha=.5)

# Plot the cuML k-means centers as red circle outlines
centers_cuml = kmeans_cuml.cluster_centers_
plt.scatter(cupy.asnumpy(centers_cuml[:, 0]),
            cupy.asnumpy(centers_cuml[:, 1]),
            facecolors='none', edgecolors='red', s=100)

plt.title('cuML and sklearn kmeans clustering')

plt.show()

## Compare Results

In [None]:
%%time
cuml_score = adjusted_rand_score(host_labels, kmeans_cuml.labels_.get())
sk_score = adjusted_rand_score(host_labels, kmeans_sk.labels_)

In [None]:
threshold = 1e-4

# Use the absolute difference so the check fails in either direction
passed = abs(cuml_score - sk_score) < threshold
print('compare kmeans: cuml vs sklearn adjusted Rand scores are ' + ('equal' if passed else 'NOT equal'))