# K-Means Clustering with PySpark

This notebook shows how to fit and evaluate a K-Means clustering model with PySpark.

* Method: [K-Means](https://spark.apache.org/docs/2.2.0/api/python/pyspark.ml.html#pyspark.ml.clustering.KMeans)
* Dataset: Spark KMeans Sample Data

## Imports

In [None]:
import findspark
findspark.init()

import numpy as np

from pyspark import SparkContext
from pyspark.sql import SQLContext

from pyspark.ml.clustering import KMeans

import seaborn as sb
import matplotlib.pyplot as plt
from pylab import rcParams

%matplotlib inline
rcParams['figure.figsize'] = 10, 8
sb.set_style('whitegrid')

## Get Some Context

In [None]:
# Create a SparkContext and a SQLContext to use
sc = SparkContext(appName="KMeans Clustering with Spark")
sqlContext = SQLContext(sc)

## Load and Prepare the Data

In [None]:
DATA_FILE = "/Users/robert.dempsey/Dev/daamlobd/data/mllib/sample_kmeans_data.txt"

In [None]:
data = sqlContext.read.format("libsvm").load(DATA_FILE)

In [None]:
# View the first three records
data.take(3)

## Identify the Number of Clusters to Use

Arguments:
* k: number of clusters
* maxIter: max number of iterations
* initMode: initialization algorithm
  * random: select random points as initial cluster centers
  * k-means||: parallel variant of k-means++
* seed: random seed
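
To make the two `initMode` options concrete, here is a small plain-Python sketch (no Spark required) that contrasts uniform random seeding with serial k-means++ seeding. It is an illustration only: Spark's `k-means||` is a parallel variant of k-means++ and is implemented differently. The function names `random_init` and `kmeans_pp_init` and the toy points are assumptions made for this example.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def random_init(points, k, seed):
    """'random' style seeding: pick k distinct points uniformly at random."""
    rng = random.Random(seed)
    return rng.sample(points, k)


def kmeans_pp_init(points, k, seed):
    """Serial k-means++ seeding (hypothetical helper, not Spark's code)."""
    rng = random.Random(seed)
    # The first center is chosen uniformly at random
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Weight each point by its squared distance to the nearest chosen center,
        # so far-away points are more likely to become the next center
        weights = [min(squared_distance(p, c) for c in centers) for p in points]
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers


# Toy data (an assumption for this sketch): two tight groups of 3-D points
points = [(0.0, 0.0, 0.0), (0.1, 0.1, 0.1), (0.2, 0.2, 0.2),
          (9.0, 9.0, 9.0), (9.1, 9.1, 9.1), (9.2, 9.2, 9.2)]

random_centers = random_init(points, k=2, seed=42)
pp_centers = kmeans_pp_init(points, k=2, seed=42)
print("random:   ", random_centers)
print("k-means++:", pp_centers)
```

Because each new center is drawn with probability proportional to its squared distance from the centers already chosen, k-means++ tends to spread the initial centers across separate groups, whereas uniform random seeding can place both in the same group. Fixing the seed makes either choice reproducible.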

In [None]:
# Define the cluster range
cluster_range = range(2, 20)

# Create a list of KMeans models with differing numbers of clusters
kmeans_models = [KMeans(k=i, seed=42) for i in cluster_range]

# Let's take a look at one of the models
kmeans_models[12]

In [None]:
# Fit each model and evaluate the clustering using Within Set Sum of Squared Errors
cluster_scores = []
for kmeans in kmeans_models:
    model = kmeans.fit(data)
    cluster_scores.append(model.computeCost(data))

cluster_scores[12]

In [None]:
# Plot an elbow curve of the scores to find the optimal number of clusters
plt.plot(cluster_range, cluster_scores)
plt.xlabel('Number of Clusters')
plt.ylabel('Score')
plt.title('Elbow Curve')
plt.show()

**Interpretation**: it appears that 2 is the optimal number of clusters for this dataset. That's our first model.
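
Reading the elbow off a plot is a judgment call. As a rough cross-check, the sketch below picks the k where the curve bends most sharply, using the largest second difference of the scores. This is one common heuristic, not something PySpark provides. The function name `find_elbow` and the example scores are made up for illustration and are not results from this notebook.

```python
def find_elbow(ks, scores):
    """Return the k with the largest second difference of the scores.

    Hypothetical helper: it can only return interior values of ks,
    never the first or last one.
    """
    if len(ks) != len(scores) or len(ks) < 3:
        raise ValueError("need at least three (k, score) pairs")
    best_k, best_bend = None, float("-inf")
    for i in range(1, len(scores) - 1):
        # A large positive value means a steep drop followed by a flat stretch
        bend = scores[i - 1] - 2 * scores[i] + scores[i + 1]
        if bend > best_bend:
            best_k, best_bend = ks[i], bend
    return best_k


# Made-up WSSSE values for k = 2..7 (not output from this notebook)
example_ks = list(range(2, 8))
example_scores = [100.0, 30.0, 25.0, 22.0, 20.0, 19.0]
print(find_elbow(example_ks, example_scores))
```

The heuristic needs a neighbor on each side of a candidate, so it cannot return the first k in the range. When the range starts at 2, an elbow at k = 2 still has to be judged from the plot.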

## Fit a K-Means Clustering Model

In [None]:
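# Get the index value of the max cluster score.
# Note: WSSSE generally falls as k grows, so the highest score usually belongs to
# the smallest k in the range. This lookup agrees with the elbow reading above,
# but it is not a general way to choose k.
max_score_index = cluster_scores.index(max(cluster_scores))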

# Get the number of clusters used for the model with the max score
model_to_use = kmeans_models[max_score_index]
best_number_of_clusters = model_to_use.getK()
print("Best number of clusters: {}".format(best_number_of_clusters))

In [None]:
# Fit the model with the best number of clusters
kmeans = KMeans(k=best_number_of_clusters, seed=42)
model = kmeans.fit(data)
model

## Model Evaluation

In [None]:
# Get the model summary
summary = model.summary

### Number of Observations in Each Cluster

In [None]:
summary.clusterSizes

### Within Set Sum of Squared Errors

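The Within Set Sum of Squared Errors (WSSSE) is the sum of the squared distances between each point and the center of the cluster it is assigned to. It measures how tightly the points sit around their centers, that is, the variance the clustering leaves unexplained. ([cite](https://discuss.analyticsvidhya.com/t/what-is-within-cluster-sum-of-squares-by-cluster-in-k-means/2706/2))

For a fixed k, the lower this number the better. Because WSSSE generally keeps falling as k increases, compare it across values of k with an elbow curve rather than simply minimizing it.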
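
To make the quantity concrete, the following plain-Python sketch (no Spark required) computes the within set sum of squared errors by hand: each point contributes its squared distance to the nearest center. The helper name `wssse` and the toy points and centers are assumptions for illustration; `computeCost` performs the equivalent calculation using the fitted model's centers.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def wssse(points, centers):
    """Sum over all points of the squared distance to the nearest center."""
    return sum(min(squared_distance(p, c) for c in centers) for p in points)


# Toy data (an assumption for this sketch): two tight groups of 3-D points
points = [(0.0, 0.0, 0.0), (0.1, 0.1, 0.1), (0.2, 0.2, 0.2),
          (9.0, 9.0, 9.0), (9.1, 9.1, 9.1), (9.2, 9.2, 9.2)]

# One center per group versus a single center placed between the groups
two_centers = [(0.1, 0.1, 0.1), (9.1, 9.1, 9.1)]
one_center = [(4.6, 4.6, 4.6)]

tight = wssse(points, two_centers)
loose = wssse(points, one_center)
print("WSSSE with two centers: %0.2f" % tight)
print("WSSSE with one center:  %0.2f" % loose)
```

Centers that sit inside each group give a far smaller value than a single center placed between the groups, which is why a tighter clustering shows up as a smaller WSSSE.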

In [None]:
wssse = model.computeCost(data)
print("Within Set Sum of Squared Errors: %0.2f" % wssse)

### Show the Cluster Centers

In [None]:
centers = model.clusterCenters()
for center in centers:
    print(center)

### Model Predictions

In [None]:
# Show the predictions
summary.predictions.show()

## Clean Up

In [None]:
sc.stop()