# Spark K-Means Clustering Example

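This notebook demonstrates K-Means clustering with PySpark's DataFrame-based MLlib API (`pyspark.ml`): building a small dataset, assembling features, fitting the model, evaluating it, and inspecting the cluster centers.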

## Step 1: Set Up Spark Session
First, we create a Spark session to work with PySpark.

In [1]:
from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder \
    .appName("KMeansClustering") \
    .getOrCreate()

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/01/13 20:18:52 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
25/01/13 20:18:52 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
25/01/13 20:18:52 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
25/01/13 20:18:52 WARN Utils: Service 'SparkUI' could not bind on port 4042. Attempting port 4043.


## Step 2: Load and Prepare Data
We create a small dataset for clustering: six points in 2D space, each with an `id` and coordinates `x` and `y`.

In [2]:
from pyspark.sql import Row

# Create a sample dataset
data = [
    Row(id=1, x=1.0, y=1.0),
    Row(id=2, x=1.5, y=1.5),
    Row(id=3, x=3.0, y=3.0),
    Row(id=4, x=5.0, y=5.0),
    Row(id=5, x=3.5, y=3.5),
    Row(id=6, x=4.5, y=4.5)
]

# Convert the list to a DataFrame
df = spark.createDataFrame(data)
df.show()

                                                                                

+---+---+---+
| id|  x|  y|
+---+---+---+
|  1|1.0|1.0|
|  2|1.5|1.5|
|  3|3.0|3.0|
|  4|5.0|5.0|
|  5|3.5|3.5|
|  6|4.5|4.5|
+---+---+---+



## Step 3: Feature Engineering
We use `VectorAssembler` to combine the feature columns (`x` and `y`) into a single vector column required by the K-Means algorithm.

In [3]:
from pyspark.ml.feature import VectorAssembler

# Assemble features into a single vector column
assembler = VectorAssembler(inputCols=["x", "y"], outputCol="features")
dataset = assembler.transform(df)
dataset.show()

+---+---+---+---------+
| id|  x|  y| features|
+---+---+---+---------+
|  1|1.0|1.0|[1.0,1.0]|
|  2|1.5|1.5|[1.5,1.5]|
|  3|3.0|3.0|[3.0,3.0]|
|  4|5.0|5.0|[5.0,5.0]|
|  5|3.5|3.5|[3.5,3.5]|
|  6|4.5|4.5|[4.5,4.5]|
+---+---+---+---------+
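Conceptually, `VectorAssembler` packs the values of the input columns, in the order given, into one vector per row and leaves the other columns untouched. The plain-Python sketch below mimics that row-wise behavior without Spark. The `assemble` helper and the dict-based rows are hypothetical and used only for illustration; the real transformer produces Spark ML vectors, not Python lists.

```python
# Plain-Python illustration of what VectorAssembler does row by row.
# `assemble` is a hypothetical helper, not part of PySpark.
rows = [
    {"id": 1, "x": 1.0, "y": 1.0},
    {"id": 2, "x": 1.5, "y": 1.5},
    {"id": 3, "x": 3.0, "y": 3.0},
]

def assemble(row, input_cols, output_col):
    """Return a copy of `row` with the input columns packed into one list."""
    out = dict(row)  # keep the original columns
    out[output_col] = [float(row[c]) for c in input_cols]
    return out

assembled = [assemble(r, ["x", "y"], "features") for r in rows]
for r in assembled:
    print(r)
```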



## Step 4: Apply K-Means Clustering
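We fit K-Means with `k=2` and a fixed `seed` so the run is reproducible, then use the fitted model to add each point's cluster index in a `cluster` column. The index values themselves (0 or 1) are arbitrary labels; only the grouping matters.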

In [4]:
from pyspark.ml.clustering import KMeans

# Create and train the K-Means model
kmeans = KMeans(k=2, seed=1, featuresCol="features", predictionCol="cluster")
model = kmeans.fit(dataset)

# Make predictions
predictions = model.transform(dataset)
predictions.show()

25/01/13 20:19:01 WARN InstanceBuilder: Failed to load implementation from:dev.ludovic.netlib.blas.JNIBLAS


+---+---+---+---------+-------+
| id|  x|  y| features|cluster|
+---+---+---+---------+-------+
|  1|1.0|1.0|[1.0,1.0]|      1|
|  2|1.5|1.5|[1.5,1.5]|      1|
|  3|3.0|3.0|[3.0,3.0]|      0|
|  4|5.0|5.0|[5.0,5.0]|      0|
|  5|3.5|3.5|[3.5,3.5]|      0|
|  6|4.5|4.5|[4.5,4.5]|      0|
+---+---+---+---------+-------+
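To make the fitting step less of a black box, here is a minimal sketch of the basic K-Means iteration (Lloyd's algorithm) in plain Python on the same six points. It is not Spark's implementation: the starting centers (the first and last points) are an assumption chosen for illustration, and the helper names `nearest`, `mean`, and `lloyd` are hypothetical.

```python
import math

points = [(1.0, 1.0), (1.5, 1.5), (3.0, 3.0), (5.0, 5.0), (3.5, 3.5), (4.5, 4.5)]

def nearest(point, centers):
    """Index of the center closest to `point` (Euclidean distance)."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))

def mean(pts):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(pts)
    return tuple(sum(coord) / n for coord in zip(*pts))

def lloyd(points, centers, max_iter=20):
    """Alternate assignment and center updates until the centers stop moving."""
    for _ in range(max_iter):
        labels = [nearest(p, centers) for p in points]
        new_centers = []
        for i, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == i]
            # Keep the old center if a cluster ends up empty.
            new_centers.append(mean(members) if members else old)
        if new_centers == centers:
            break
        centers = new_centers
    # Recompute labels so they always match the returned centers.
    return centers, [nearest(p, centers) for p in points]

# Assumed starting centers: the first and the last point.
final_centers, labels = lloyd(points, [points[0], points[5]])
print(final_centers)
print(labels)
```

With these starting centers the sketch groups the points the same way as the table above, but with the two index labels swapped, which illustrates that cluster indices are arbitrary.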



## Step 5: Evaluate the Clustering Model
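We evaluate the model with the **Within Set Sum of Squared Errors (WSSSE)**: the sum of squared distances from each point to its assigned cluster center. `KMeansModel.computeCost` was removed in Spark 3.0, so we read the same quantity for the training data from `model.summary.trainingCost`.

In [5]:
# Evaluate clustering by reading the training WSSSE from the model summary
# (computeCost() was removed in Spark 3.0)
wssse = model.summary.trainingCost
print(f"Within Set Sum of Squared Errors (WSSSE): {wssse}")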
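As a cross-check that does not depend on any Spark API, WSSSE can be recomputed by hand. The sketch below uses the points from Step 2, the cluster assignments printed in Step 4, and the centers printed in Step 6; the `squared_distance` helper is hypothetical.

```python
# Points from Step 2, in id order.
points = [(1.0, 1.0), (1.5, 1.5), (3.0, 3.0), (5.0, 5.0), (3.5, 3.5), (4.5, 4.5)]
# Cluster index per point, copied from the Step 4 output.
labels = [1, 1, 0, 0, 0, 0]
# Cluster centers, copied from the Step 6 output (position = cluster index).
centers = [(4.0, 4.0), (1.25, 1.25)]

def squared_distance(p, c):
    """Squared Euclidean distance between two points."""
    return sum((pi - ci) ** 2 for pi, ci in zip(p, c))

wssse = sum(squared_distance(p, centers[lab]) for p, lab in zip(points, labels))
print(wssse)  # 5.25
```

Cluster 0 contributes 5.0 and cluster 1 contributes 0.25, so the hand-computed total for this clustering is 5.25.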

## Step 6: Extract Cluster Centers
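We extract and display the cluster centers found by the algorithm. The list is ordered by cluster index, so the first center belongs to cluster 0 and the second to cluster 1.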

In [6]:
# Display cluster centers
centers = model.clusterCenters()
print("Cluster Centers:")
for center in centers:
    print(center)

Cluster Centers:
[4. 4.]
[1.25 1.25]
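Once the centers are known, assigning a point to a cluster amounts to picking the nearest center. The following plain-Python sketch illustrates that rule with the two centers printed above; `predict` is a hypothetical helper and the sample points are made up for illustration.

```python
import math

# Centers copied from the output above (position = cluster index).
centers = [(4.0, 4.0), (1.25, 1.25)]

def predict(point, centers):
    """Return the index of the center nearest to `point`."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))

# Illustrative points that are not part of the training data.
for p in [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0)]:
    print(p, "->", predict(p, centers))
```

Here `(2.0, 2.0)` and `(0.0, 0.0)` fall into cluster 1, while `(3.0, 3.0)` falls into cluster 0, matching the assignment of the training point with `id=3`.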


## Step 7: Stop the Spark Session
Finally, we stop the Spark session to release resources.

In [None]:
# Stop the Spark session
spark.stop()