# Scalable Music Clustering with Spark 🎶

<br />

In this tutorial, we’ll discover musical structures from a large-scale Spotify dataset using [Apache Spark](https://spark.apache.org/). We'll apply dimension reduction and clustering techniques to group songs by audio features such as `danceability`, `energy`, and `acousticness`, revealing musical patterns that go beyond traditional genre labels.

You will get hands-on experience with:

  - Preprocessing and standardizing audio features at scale with Spark DataFrames.
  - Reducing data dimensionality with Principal Component Analysis (PCA) and Singular Value Decomposition (SVD).
  - Implementing and evaluating the K-Means clustering algorithm.
  - Comparing alternative data preprocessing pipelines.
  - Interpreting clusters through artist and song analysis.

By the end, you will understand the whole pipeline of transforming audio features into meaningful musical patterns using scalable data processing with PySpark.


**Credits**

Notebook adapted from [MACS40123 Fall24 Clustering & Dimension Reduction Tutorial](https://github.com/macs40123-f24/course-materials/tree/main/in-class-activities/03_clustering_dimension_reduction).

Dataset downloaded from [Kaggle Spotify Tracks Dataset](https://www.kaggle.com/datasets/maharshipandya/-spotify-tracks-dataset).

**Compatibility**

| Platform | Compatibility | Recommended | Notes |
| :--- | :--- | :--- | :--- |
| **Local Machine** (e.g., 16GB RAM Laptop) | ✅ Yes | ✅ Yes | Suitable for this dataset size and for local development. |
| **Google Colab** | ✅ Yes | ✅ Yes | A viable option for executing the notebook; the dataset fits within standard Colab instance memory. |
| **Midway3 Login Node** | ✅ Yes | ❌ No | **Not Recommended.** Login nodes are shared resources; running intensive computations can disrupt other users. |
| **Midway3 Compute Node** | ✅ Yes | ✅ Yes | **Recommended.** Ideal for this workload, allowing for efficient and scalable computation without impacting shared resources. |

## 1\. Setup and Data Loading

We will first import `pyspark` libraries, set up the Spark environment, and load the data.

In [None]:
import os

from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession, Row
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import StandardScaler, PCA
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator
from pyspark.mllib.feature import StandardScaler as StandardScalerRDD
from pyspark.mllib.linalg.distributed import RowMatrix
import pyspark.sql.functions as F

In [None]:
# Initialize our SparkSession
spark = SparkSession \
        .builder \
        .appName("dr_cluster") \
        .getOrCreate()

### Data Loading

In [None]:
# Run this cell if you are working on Midway
# DATA_DIR = "../datasets/spotify"

In [None]:
# Run this code if you use Colab or your local machine to fetch the dataset
import kagglehub

# Download latest version
path = kagglehub.dataset_download("maharshipandya/-spotify-tracks-dataset")

print("Path to dataset files:", path)
DATA_DIR = path

Using Colab cache for faster access to the '-spotify-tracks-dataset' dataset.
Path to dataset files: /kaggle/input/-spotify-tracks-dataset


In [None]:
# Load the Spotify song data from a CSV file
# The `header=True` option tells Spark to use the first row as the column names.
df = spark.read.csv(os.path.join(DATA_DIR, "dataset.csv"), header=True)

In [None]:
# Check potentially relevant features like danceability, energy, acousticness, etc.
df.columns

['_c0',
 'track_id',
 'artists',
 'album_name',
 'track_name',
 'popularity',
 'duration_ms',
 'explicit',
 'danceability',
 'energy',
 'key',
 'loudness',
 'mode',
 'speechiness',
 'acousticness',
 'instrumentalness',
 'liveness',
 'valence',
 'tempo',
 'time_signature',
 'track_genre']
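
Because `spark.read.csv` is called here without `inferSchema`, every column is read as a string, which is why the next section casts the audio features to `float` and drops rows with missing values. The following is a minimal plain-Python sketch of that cast-then-drop idea, not Spark code; the three-row sample and the helper `to_float` are made up for illustration.

```python
import csv
import io

# Hypothetical three-row sample; the values are made up for illustration.
SAMPLE = """track_id,danceability,tempo
t1,0.5,120.0
t2,not_a_number,98.5
t3,0.8,
"""

def to_float(value):
    """Return float(value), or None when the text is not a valid number."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None

# Every field arrives as text, much like Spark columns read without a schema.
rows = list(csv.DictReader(io.StringIO(SAMPLE)))
numeric_cols = ["danceability", "tempo"]

# Cast each numeric column, then drop rows where any cast failed.
# This mirrors the intent of .cast("float") followed by .dropna().
casted = [{**r, **{c: to_float(r[c]) for c in numeric_cols}} for r in rows]
clean = [r for r in casted if all(r[c] is not None for c in numeric_cols)]

print([r["track_id"] for r in clean])  # ['t1']
```

Only the first row survives: the second has unparseable text and the third has an empty field.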

-----

## 2\. Data Preprocessing

Machine learning models are only as good as the data they're trained on. Raw data is often noisy and not in the right format, so we will preprocess it in three steps:

1.  **Feature Selection**: We'll select the numerical audio features that describe a song's characteristics.
2.  **Vectorization**: Spark's ML libraries expect all features to be in a single column of the `Vector` type.
3.  **Standardization**: We'll scale our features to ensure no single feature (like `duration_ms`) can unfairly dominate the others.


In [None]:
# 1. Select the audio features we want to use for clustering.
feature_cols = ['popularity', 'danceability', 'energy',
                'key', 'loudness', 'speechiness',
                'acousticness', 'instrumentalness', 'liveness',
                'valence', 'tempo', 'duration_ms']

# 2. Cast the selected columns to numeric data type (float), and drop any rows with missing values.
#    We also keep 'track_id' and 'artists' for later interpretation.
df_features = df.select(*(F.col(c).cast("float").alias(c) for c in feature_cols), 'track_id', 'artists') \
                  .dropna()

# 3. Vectorize: Combine all feature columns into a single array column named 'features'.
df_features = df_features.withColumn('features', F.array(*[F.col(c) for c in feature_cols])) \
                         .select('track_id', 'artists', 'features')

# 4. Convert the feature array into the dense vector format that Spark ML expects.
vectors = df_features.rdd.map(lambda row: Vectors.dense(row.features))
features = spark.createDataFrame(vectors.map(Row), ["features_unscaled"])

# 5. Standardize: Scale each feature to unit standard deviation.
#    Note that StandardScaler defaults to withMean=False, so features are scaled but not centered.
#    Scaling prevents features with large values (like tempo or duration) from skewing the results.
standardizer = StandardScaler(inputCol="features_unscaled", outputCol="features")
model = standardizer.fit(features)
features = model.transform(features) \
                .select('features')

# Persist the processed features in memory for faster access in the next steps.
features.persist()

print("Schema of our final features DataFrame:")
features.printSchema()

Schema of our final features DataFrame:
root
 |-- features: vector (nullable = true)
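
To make the effect of standardization concrete, here is a small plain-Python sketch of the same idea on two columns with very different scales. It is an illustration under assumptions, not Spark's implementation: the function `standardize` and the sample values are hypothetical, and it uses the sample standard deviation (n - 1 denominator). The `with_mean` flag plays the role of the scaler's `withMean` option.

```python
from statistics import mean, stdev

def standardize(column, with_mean=True):
    """Scale a list of numbers to unit sample standard deviation.

    with_mean=True also subtracts the mean first (a z-score).
    """
    mu = mean(column) if with_mean else 0.0
    sigma = stdev(column)  # sample standard deviation (n - 1 denominator)
    return [(x - mu) / sigma for x in column]

# Hypothetical values chosen only to show the difference in scale.
duration_ms = [180_000.0, 210_000.0, 240_000.0]
danceability = [0.2, 0.5, 0.8]

print(standardize(duration_ms))   # [-1.0, 0.0, 1.0]
print(standardize(danceability))  # approximately the same values

# Without centering, the spread is still 1 but the values keep their offset.
print(standardize(duration_ms, with_mean=False))
```

After scaling, both columns contribute on a comparable scale to any distance computation, which is what K-Means relies on.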



-----

## 3\. K-Means Clustering: A Baseline Attempt

We'll start with the **K-Means** algorithm to group the data into a specified number of clusters ($k=3$). To evaluate the result, we'll use the **Silhouette score**, which measures how close each song is to its assigned cluster compared with the nearest other cluster.

  * A score of **+1** indicates well-defined, dense clusters.
  * A score of **0** indicates overlapping clusters.
  * A score of **-1** indicates that songs might have been assigned to the wrong clusters.
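
For intuition, the Silhouette score can be computed by hand on a tiny example. The sketch below is a plain-Python illustration, not what `ClusteringEvaluator` runs internally; it uses squared Euclidean distance, which is that evaluator's default distance measure, and the points and function names are hypothetical.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def silhouette_score(points, labels, dist=sq_dist):
    """Mean silhouette coefficient over all points (needs at least 2 clusters)."""
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)

    scores = []
    for p, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster
        b = min(
            sum(dist(p, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)

# Two obvious groups of hypothetical points.
points = [(0, 0), (0, 1), (10, 10), (10, 11)]
good = silhouette_score(points, [0, 0, 1, 1])  # sensible assignment
bad = silhouette_score(points, [0, 1, 0, 1])   # deliberately mixed up

print(round(good, 3), round(bad, 3))
```

The sensible assignment scores close to +1, while the mixed-up assignment scores below 0.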


In [None]:
# Train a K-Means model with k=3
kmeans = KMeans(k=3, seed=1)
model = kmeans.fit(features)

# Assign each song to a cluster
predictions = model.transform(features)

# Evaluate the clustering quality using the Silhouette score
evaluator = ClusteringEvaluator()
silhouette = evaluator.evaluate(predictions)
print(f"Silhouette Score (without dimension reduction) = {silhouette}")

Silhouette Score (without dimension reduction) = 0.20588438964574374


The resulting Silhouette score is relatively low, indicating that the clusters are not well separated in the 12-dimensional feature space. Let's see if we can improve this with dimension reduction.
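
As a reminder of what a K-Means fit does conceptually, here is a minimal sketch of Lloyd's algorithm in plain Python. It is a simplification under stated assumptions: it seeds the centers with the first `k` points, whereas Spark's `KMeans` uses a more careful initialization by default, and the sample points are hypothetical.

```python
def kmeans(points, k, iterations=10):
    """Lloyd's algorithm, seeded with the first k points (a simplification)."""
    centers = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        labels = [
            min(range(k), key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # leave an empty cluster's center where it is
                centers[j] = [sum(dim) / len(members) for dim in zip(*members)]
    return centers, labels

# Hypothetical 2D points forming two loose groups.
points = [(0.0, 0.0), (10.0, 10.0), (0.0, 1.0), (10.0, 11.0), (1.0, 0.0)]
centers, labels = kmeans(points, k=2)

print(labels)   # [0, 1, 0, 1, 0]
print(centers)
```

Each iteration alternates between assigning points and recomputing centers until the assignments stop changing.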

-----

## 4\. PCA for Dimension Reduction

High-dimensional data can be noisy and difficult to cluster. **Principal Component Analysis (PCA)** is a technique that reduces the number of dimensions by finding a new, smaller set of features (called principal components) that captures the most important information, or variance, in the data.

Imagine projecting a 3D object onto a 2D surface - when oriented properly, the shadow still captures the object's key structure. Similarly, PCA projects the 12-dimensional song data into 2 dimensions.
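
The projection idea can be sketched without Spark. The plain-Python example below finds only the first principal component of a tiny 3-dimensional dataset using power iteration on the covariance matrix. It is an illustrative sketch, not the routine Spark's `PCA` uses; the function name and sample rows are hypothetical.

```python
import math

def first_principal_component(rows, iterations=100):
    """Return (direction, scores) for the first principal component."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]

    # Sample covariance matrix (d x d).
    cov = [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]

    # Power iteration converges to the eigenvector with the largest eigenvalue.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]

    # Project each centered row onto the direction to get 1D coordinates.
    scores = [sum(r[j] * v[j] for j in range(d)) for r in centered]
    return v, scores

# Hypothetical rows: the first two features move together, the third is small noise.
rows = [(1.0, 1.0, 0.0), (2.0, 2.0, 0.1), (3.0, 3.0, -0.1), (4.0, 4.0, 0.0)]
direction, scores = first_principal_component(rows)

print([round(x, 3) for x in direction])
print([round(s, 3) for s in scores])
```

The direction puts nearly equal weight on the two correlated features and almost none on the noisy one, so a single coordinate preserves most of the variation.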

In [None]:
# Initialize and fit a PCA model to reduce the 12 features to 2 principal components.
pca = PCA(k=2, inputCol="features", outputCol="pcaFeatures")
model = pca.fit(features)

# Transform our data into the new, 2-dimensional space.
pca_results = model.transform(features).select("pcaFeatures")

# Let's see what the new 2D features look like
print("PCA-reduced features (2 dimensions):")
pca_results.show(5, truncate=False)

PCA-reduced features (2 dimensions):
+------------------------------------------+
|pcaFeatures                               |
+------------------------------------------+
|[-2.893488158834881,1.2323750624986274]   |
|[1.0881426600751212,1.5091509613595928]   |
|[-0.8351044483521946,0.035458566126560065]|
|[1.1383751222000997,-0.11273777119518302] |
|[-1.2943379902838172,0.5336997570650227]  |
+------------------------------------------+
only showing top 5 rows



Now, let's run K-Means again, but this time on our new, simplified `pcaFeatures`.

In [None]:
# Train a K-Means model on the 2D PCA features
pca_kmeans = KMeans(k=3, seed=1, featuresCol="pcaFeatures")
pca_model = pca_kmeans.fit(pca_results)

# Make predictions
pca_predictions = pca_model.transform(pca_results)

# Evaluate the new clustering
pca_evaluator = ClusteringEvaluator(featuresCol="pcaFeatures")
silhouette = pca_evaluator.evaluate(pca_predictions)
print(f"Silhouette Score (with PCA) = {silhouette}")

Silhouette Score (with PCA) = 0.5217440722341667


That's a significant improvement\! By reducing the dimensions and focusing on the directions that capture the most variance, PCA helped K-Means discover better-defined clusters. Can we do even better?

-----

## 5\. SVD: A More Powerful Alternative

**Singular Value Decomposition (SVD)** is another powerful dimension reduction technique, closely related to PCA. It decomposes the original feature matrix into three matrices: $U$, $S$, and $V$. The rows of the $U$ matrix provide lower-dimensional coordinates for each song, analogous to the coordinates that PCA produces after projection.
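
The relationship between the two methods can be written out explicitly. This is a sketch under the assumption that the feature matrix $X$ (one row per song) has been centered and that $U$, $S$, $V$ come from its thin SVD:

$$
X = U S V^\top, \qquad V^\top V = I
$$

$$
Z_{\text{PCA}} = X V = U S V^\top V = U S
$$

So each column of $U$ equals the corresponding column of PCA coordinates divided by its singular value, $u_j = z_j / s_j$. Clustering on $U$ therefore treats the retained directions as equally scaled, while clustering on PCA coordinates weights them by their variance. This is one plausible reason the two pipelines can produce different Silhouette scores.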

Spark's SVD implementation is available through the older RDD-based API (**RDDs**, or Resilient Distributed Datasets), so we'll need to convert our data into that format first.

In [None]:
# Pull the unscaled feature arrays out of df_features as an RDD
vectors_rdd = df_features.rdd.map(lambda row: row["features"])

# Use the RDD-specific StandardScaler to re-scale the data
standardizer_rdd = StandardScalerRDD(withMean=True, withStd=True)
model_rdd = standardizer_rdd.fit(vectors_rdd)
vectors_rdd_scaled = model_rdd.transform(vectors_rdd)

# Create a RowMatrix, the RDD-based format required for SVD
mat = RowMatrix(vectors_rdd_scaled)

# Compute the SVD. We'll keep 2 dimensions to match our PCA approach.
# computeU=True is essential because we need the U matrix for our new features.
svd = mat.computeSVD(2, computeU=True)

# The U matrix contains our new features. Let's convert it back to a DataFrame for K-Means.
U_df = svd.U.rows.map(lambda row: Row(features=Vectors.dense(row.toArray()))) \
                 .toDF()
U_df.persist()

print("SVD-reduced features (2 dimensions):")
U_df.show(5, truncate=False)

SVD-reduced features (2 dimensions):
+-----------------------------------------------+
|features                                       |
+-----------------------------------------------+
|[0.001109088520816445,0.0025289875058067727]   |
|[-0.0058502886041404065,0.0031928528509372793] |
|[-0.0024887007169709457,-3.4189607213817657E-4]|
|[-0.005938088469005461,-6.973548130258261E-4]  |
|[-0.0016860197171270797,8.531684603803135E-4]  |
+-----------------------------------------------+
only showing top 5 rows



With the SVD-reduced representation in hand, let's run K-Means again.

In [None]:
# Train a K-Means model on the SVD-reduced features
svd_kmeans = KMeans(k=3, seed=1)
svd_model = svd_kmeans.fit(U_df)

# Make predictions
svd_predictions = svd_model.transform(U_df)

# Evaluate the final clustering
svd_evaluator = ClusteringEvaluator()
silhouette = svd_evaluator.evaluate(svd_predictions)
print(f"Silhouette Score (with SVD) = {silhouette}")

Silhouette Score (with SVD) = 0.5683019728033961


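Excellent\! This is the best score so far. It appears that, for this dataset, SVD creates a feature space that separates songs more effectively. Keep in mind that each Silhouette score is computed in its own feature space (12 standardized features, 2 PCA features, or 2 columns of $U$), so the comparison is indicative rather than exact.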

-----

## 6\. Interpreting the Clusters: What Did We Find?

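What do the three clusters mean in musical terms? To interpret them, we need to connect the cluster assignments with the original song metadata (e.g., artist, title). Since Spark DataFrames are unordered, we can't directly merge the columns, so we will assign unique IDs to both DataFrames and then `join` them. Note that this only lines up the rows correctly when both DataFrames have the same partitioning and row order, which Spark does not guarantee in general; carrying `track_id` through the whole pipeline is the more robust option.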

In [None]:
# First, let's see how many songs are in each cluster
print("Count of songs per cluster:")
svd_predictions.groupBy('prediction').count().show()

# Add a unique, monotonically increasing ID to the original features and the predicted clusters
df_features_with_id = df_features.withColumn("id", F.monotonically_increasing_id())
svd_predictions_with_id = svd_predictions.withColumn("id", F.monotonically_increasing_id())

# Join the two DataFrames on this new ID to link songs to their cluster prediction
df_merged = df_features_with_id.join(svd_predictions_with_id, on="id", how="inner")

print("Merged DataFrame with song info and cluster prediction:")
df_merged.select("artists", "prediction").show(5)

Count of songs per cluster:
+----------+-----+
|prediction|count|
+----------+-----+
|         1|39458|
|         2|53846|
|         0|20561|
+----------+-----+

Merged DataFrame with song info and cluster prediction:
+--------------------+----------+
|             artists|prediction|
+--------------------+----------+
|         Gen Hoshino|         2|
|        Ben Woodward|         0|
|Ingrid Michaelson...|         0|
|        Kina Grannis|         0|
|    Chord Overstreet|         2|
+--------------------+----------+
only showing top 5 rows
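
The ID-based join above depends on how the rows are partitioned. Spark documents that `monotonically_increasing_id` places the partition index in the upper bits and the row position within the partition in the lower 33 bits. The plain-Python sketch below mimics that layout to show what happens when the same rows are partitioned differently; the helper `monotonic_ids` and the partition sizes are hypothetical.

```python
def monotonic_ids(partition_sizes):
    """Mimic the documented ID layout: partition index in the upper bits,
    row position within the partition in the lower 33 bits."""
    return [
        (partition << 33) + offset
        for partition, size in enumerate(partition_sizes)
        for offset in range(size)
    ]

# The same six rows, split across two partitions in two different ways.
ids_a = monotonic_ids([3, 3])
ids_b = monotonic_ids([2, 4])

print(ids_a)  # [0, 1, 2, 8589934592, 8589934593, 8589934594]
print(ids_b)  # [0, 1, 8589934592, 8589934593, 8589934594, 8589934595]

# Rows whose IDs agree position by position.
aligned = sum(a == b for a, b in zip(ids_a, ids_b))

# An inner join on "id" keeps every ID present on both sides...
shared = set(ids_a) & set(ids_b)
# ...but only some of those IDs refer to the same row in both lists.
correctly_paired = sum(1 for i in shared if ids_a.index(i) == ids_b.index(i))

print(aligned, len(shared), correctly_paired)  # 2 5 2
```

In this sketch the join would return five rows, yet only two of them pair a row with its own prediction. Keeping an identifier such as `track_id` as a column from preprocessing through prediction avoids the problem.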



Now for the fun part\! Let's identify the most frequent artists in each cluster to better understand the "vibe" of each group.

In [None]:
# Group by cluster and artist, then count occurrences
cluster_artist_count = df_merged.groupBy(['prediction', 'artists']) \
                                .count() \
                                .orderBy(['prediction', 'count'], ascending=[True, False])

# --- Cluster 0 ---
print("Top Artists in Cluster 0:")
cluster_artist_count.filter(F.col('prediction') == 0).show(5)

# --- Cluster 1 ---
print("\nTop Artists in Cluster 1:")
cluster_artist_count.filter(F.col('prediction') == 1).show(5)

# --- Cluster 2 ---
print("\nTop Artists in Cluster 2:")
cluster_artist_count.filter(F.col('prediction') == 2).show(10)

Top Artists in Cluster 0:
+----------+---------------+-----+
|prediction|        artists|count|
+----------+---------------+-----+
|         0|   George Jones|  153|
|         0|  Prateek Kuhad|  141|
|         0|Germaine Franco|  102|
|         0|    Norah Jones|   98|
|         0|  Stevie Wonder|   98|
+----------+---------------+-----+
only showing top 5 rows


Top Artists in Cluster 1:
+----------+---------------+-----+
|prediction|        artists|count|
+----------+---------------+-----+
|         1|    Linkin Park|  153|
|         1|        Scooter|  147|
|         1|    The Prophet|  136|
|         1|Håkan Hellström|  121|
|         1|     Rob Zombie|   98|
+----------+---------------+-----+
only showing top 5 rows


Top Artists in Cluster 2:
+----------+---------------+-----+
|prediction|        artists|count|
+----------+---------------+-----+
|         2|    The Beatles|  205|
|         2|           Feid|  202|
|         2|    Chuck Berry|  190|
|         2| The Beach Boys|  

The clusters seem to capture distinct musical styles:

  * **Cluster 0:** The top artists here include **George Jones, Norah Jones, and Stevie Wonder**, alongside Prateek Kuhad and Germaine Franco. This looks like a softer, more acoustic cluster spanning country, jazz-pop, soul, and singer-songwriter material.
  * **Cluster 1:** This cluster features artists like **Linkin Park, Scooter, The Prophet, and Rob Zombie**. This appears to be a loud, high-energy cluster mixing rock, metal, and hard dance music.
  * **Cluster 2:** In the visible part of the output, this group is led by **The Beatles, Feid, Chuck Berry, and The Beach Boys**. This seems to combine classic rock and roll and pop with contemporary reggaeton.
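
The per-cluster artist ranking can also be sketched in plain Python, which is handy for checking the logic on a small sample. The example below is a variation on the notebook's approach rather than a copy of it: it assumes that collaborations in the `artists` column are separated by `;` and counts each artist separately, whereas the Spark code groups by the full string. The helper `top_artists` and the placeholder artist names are hypothetical.

```python
from collections import Counter, defaultdict

def top_artists(rows, n=2, sep=";"):
    """rows: iterable of (cluster, artists_string).

    Returns {cluster: [(artist, count), ...]} with the n most frequent artists.
    Assumes multiple artists are joined by `sep` (an assumption about the data).
    """
    counts = defaultdict(Counter)
    for cluster, artists in rows:
        for name in artists.split(sep):
            counts[cluster][name.strip()] += 1
    return {c: counter.most_common(n) for c, counter in counts.items()}

# Placeholder data; these are not real rows from the dataset.
rows = [
    (0, "Artist A"),
    (0, "Artist A;Artist B"),
    (0, "Artist B"),
    (0, "Artist A"),
    (1, "Artist C"),
    (1, "Artist C;Artist A"),
]

result = top_artists(rows)
print(result[0])  # [('Artist A', 3), ('Artist B', 2)]
print(result[1])  # [('Artist C', 2), ('Artist A', 1)]
```

Splitting collaborations credits every listed artist, which can change the ranking compared with counting whole strings.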

## Conclusion

This tutorial walked through a full pipeline from raw song data to interpretable musical clusters. **Preprocessing** and **dimension reduction** proved essential: K-Means reached a Silhouette score of only about 0.21 on the full 12-dimensional standardized features, while **PCA** (about 0.52) and especially **SVD** (about 0.57) improved cluster quality by keeping only two key dimensions. Most importantly, joining the clusters back to the song metadata revealed distinct **musical patterns**.
