# 🎵 Music Genre Clustering - GMM Implementation

## Gaussian Mixture Model - Probabilistic Clustering

**Project:** Music Genre Clustering using GMM  
**Dataset:** GTZAN (1,000 songs, 10 genres)  
**Author:** Vedant  
**Date:** October 2025

---

## Notebook Overview

This notebook covers:
1. GMM mathematical foundations
2. Expectation-Maximization (EM) algorithm
3. GMM training and soft clustering
4. Probability assignments vs hard labels
5. Comparison with K-Means

## 1. Mathematical Foundation of GMM

### 1.1 Gaussian Mixture Model Definition

GMM represents data as a mixture of K Gaussian distributions:

$$p(x) = \sum_{k=1}^{K} \pi_k \mathcal{N}(x | \mu_k, \Sigma_k)$$

Where:
- $\pi_k$ = Mixing coefficient (weight) of component $k$
- $\sum_{k=1}^{K} \pi_k = 1$
- $\mathcal{N}(x | \mu_k, \Sigma_k)$ = Multivariate Gaussian distribution
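
The mixture density above can be evaluated directly as a weighted sum of component densities. The following is a minimal sketch using `scipy.stats.multivariate_normal`; the helper name `gmm_density` and the toy parameter values are assumptions made for illustration and are not fitted to GTZAN.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_density(x, weights, means, covs):
    """Evaluate p(x) = sum_k pi_k * N(x | mu_k, Sigma_k) at a single point x."""
    return sum(
        w * multivariate_normal.pdf(x, mean=m, cov=c)
        for w, m, c in zip(weights, means, covs)
    )

# Toy 2-component mixture in 2-D (illustrative values only)
toy_weights = np.array([0.3, 0.7])                  # mixing coefficients, sum to 1
toy_means = np.array([[0.0, 0.0], [3.0, 3.0]])
toy_covs = np.array([np.eye(2), 2.0 * np.eye(2)])

p_origin = gmm_density(np.zeros(2), toy_weights, toy_means, toy_covs)
print(f"p(x = origin) = {p_origin:.6f}")
```

At the origin almost all of the density comes from the first component, because the second component is centered three units away along each axis.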

---

### 1.2 Multivariate Gaussian Distribution

$$\mathcal{N}(x | \mu, \Sigma) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right)$$

Where:
- $d$ = dimensionality (5 features)
- $\mu$ = mean vector
- $\Sigma$ = covariance matrix
- $|\Sigma|$ = determinant of covariance matrix
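
As a sanity check on the formula, the density can be written out term by term in NumPy and compared with SciPy's implementation. This is a sketch under assumed inputs: `gaussian_pdf` is a hypothetical helper, and the random mean and covariance only mimic the 5-feature setting.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gaussian_pdf(x, mu, sigma):
    """Multivariate normal density, written to mirror the formula term by term."""
    d = mu.shape[0]
    diff = x - mu
    # (x - mu)^T Sigma^{-1} (x - mu), using a linear solve instead of an explicit inverse
    maha_sq = diff @ np.linalg.solve(sigma, diff)
    norm_const = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(sigma))
    return np.exp(-0.5 * maha_sq) / norm_const

rng = np.random.default_rng(0)
d_toy = 5                                           # same dimensionality as the feature set
A = rng.normal(size=(d_toy, d_toy))
sigma_toy = A @ A.T + d_toy * np.eye(d_toy)         # symmetric positive definite by construction
mu_toy = rng.normal(size=d_toy)
x_toy = rng.normal(size=d_toy)

manual = gaussian_pdf(x_toy, mu_toy, sigma_toy)
reference = multivariate_normal.pdf(x_toy, mean=mu_toy, cov=sigma_toy)
print(np.isclose(manual, reference))
```

In practice the log-density is preferred, since the raw density underflows quickly as $d$ grows.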

---

### 1.3 Expectation-Maximization (EM) Algorithm

#### **E-Step (Expectation)**

Calculate responsibility of component $k$ for data point $n$:

$$\gamma(z_{nk}) = \frac{\pi_k \mathcal{N}(x_n | \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \mathcal{N}(x_n | \mu_j, \Sigma_j)}$$

**Interpretation:**
- $\gamma(z_{nk})$ = Posterior probability that point $n$ was generated by component $k$, given the current parameters
- Sum over all $k$ equals 1: $\sum_{k=1}^{K} \gamma(z_{nk}) = 1$
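
The E-step can be sketched in a few lines. The version below works in log space and normalizes with `scipy.special.logsumexp`, a common way to avoid underflow; the function name `e_step` and the toy inputs are assumptions for illustration, not the notebook's own code.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def e_step(X, weights, means, covs):
    """Return the (N, K) matrix of responsibilities gamma(z_nk)."""
    # log(pi_k) + log N(x_n | mu_k, Sigma_k) for every point and component
    log_joint = np.column_stack([
        np.log(w) + multivariate_normal.logpdf(X, mean=m, cov=c)
        for w, m, c in zip(weights, means, covs)
    ])
    # Subtracting logsumexp over components divides by the denominator of gamma
    log_gamma = log_joint - logsumexp(log_joint, axis=1, keepdims=True)
    return np.exp(log_gamma)

# Toy data: 6 points in 2-D and a 2-component mixture (illustrative values)
rng = np.random.default_rng(0)
toy_X = rng.normal(size=(6, 2))
toy_gamma = e_step(
    toy_X,
    weights=np.array([0.5, 0.5]),
    means=np.array([[-1.0, 0.0], [1.0, 0.0]]),
    covs=np.array([np.eye(2), np.eye(2)]),
)
print(toy_gamma.sum(axis=1))  # each row sums to 1
```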

#### **M-Step (Maximization)**

Update the parameters by weighted maximum likelihood estimation (MLE), using the responsibilities as weights:

**Effective number of points in component $k$:**
$$N_k = \sum_{n=1}^{N} \gamma(z_{nk})$$

**New means:**
$$\mu_k^{new} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) x_n$$

**New covariances:**
$$\Sigma_k^{new} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) (x_n - \mu_k^{new})(x_n - \mu_k^{new})^T$$

**New mixing coefficients:**
$$\pi_k^{new} = \frac{N_k}{N}$$
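
The four update equations translate almost line for line into NumPy. The sketch below is one possible implementation; `m_step` is a hypothetical helper, and it omits the small diagonal regularization (`reg_covar`) that scikit-learn's `GaussianMixture` adds to each covariance.

```python
import numpy as np

def m_step(X, gamma):
    """Weighted MLE updates given data X (N, d) and responsibilities gamma (N, K)."""
    N, d = X.shape
    K = gamma.shape[1]
    Nk = gamma.sum(axis=0)                    # effective number of points per component
    means = (gamma.T @ X) / Nk[:, None]       # responsibility-weighted means, shape (K, d)
    covs = np.empty((K, d, d))
    for k in range(K):
        diff = X - means[k]
        # Responsibility-weighted outer products, centered on the new mean
        covs[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]
    weights = Nk / N                          # new mixing coefficients
    return weights, means, covs

# Toy check with one-hot responsibilities: the updates reduce to per-group statistics
rng = np.random.default_rng(0)
toy_X = rng.normal(size=(20, 3))
toy_gamma = np.zeros((20, 2))
toy_gamma[:10, 0] = 1.0
toy_gamma[10:, 1] = 1.0

toy_weights, toy_means, toy_covs = m_step(toy_X, toy_gamma)
print(toy_weights)
```

With one-hot responsibilities the means and covariances equal the ordinary per-group mean and (biased) sample covariance, which is a convenient way to test an implementation.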

---

### 1.4 GMM vs K-Means

| Aspect | K-Means | GMM |
|--------|---------|-----|
| Clustering | Hard (binary) | Soft (probabilistic) |
| Assignment | $x \in C_k$ | $P(z_k \mid x) = \gamma(z_k)$ |
| Cluster Shape | Spherical | Elliptical |
| Distance Metric | Euclidean | Mahalanobis |
| Covariance | Shared, isotropic (implicit) | Per-component $\Sigma_k$ |
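
To make the "Distance Metric" row concrete, the sketch below compares squared Euclidean and squared Mahalanobis distance for two points that are equally far from a mean in the Euclidean sense. The covariance and the points are made-up illustrative values, and both helper functions are hypothetical.

```python
import numpy as np

def euclidean_sq(x, mu):
    """Squared Euclidean distance, the quantity K-Means minimizes."""
    diff = x - mu
    return float(diff @ diff)

def mahalanobis_sq(x, mu, sigma):
    """Squared Mahalanobis distance, the quadratic form in the Gaussian exponent."""
    diff = x - mu
    return float(diff @ np.linalg.solve(sigma, diff))

mu_toy = np.zeros(2)
sigma_toy = np.array([[4.0, 0.0], [0.0, 0.25]])   # cluster elongated along the first axis
a = np.array([2.0, 0.0])                          # lies along the long axis
b = np.array([0.0, 2.0])                          # lies along the short axis

print(euclidean_sq(a, mu_toy), euclidean_sq(b, mu_toy))                            # 4.0 4.0
print(mahalanobis_sq(a, mu_toy, sigma_toy), mahalanobis_sq(b, mu_toy, sigma_toy))  # 1.0 16.0
```

K-Means treats `a` and `b` as equally close to the center, while a Gaussian component with this covariance considers `b` far less likely than `a`.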

## 2. Import Libraries and Load Data

In [None]:
# Core libraries
import numpy as np
import pandas as pd
from pathlib import Path

# Clustering
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score, davies_bouldin_score, calinski_harabasz_score

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import multivariate_normal

# Utilities
import joblib
import warnings
warnings.filterwarnings('ignore')

# Set style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')

print("✅ Libraries loaded successfully!\n")

In [None]:
# Load preprocessed data
features_df = pd.read_csv('data/processed/features_selected.csv')
feature_cols = ['tempo', 'energy', 'loudness', 'valence', 'danceability']

# Load standardized data
X = features_df[feature_cols].values
scaler = joblib.load('models/scaler.pkl')
X_scaled = scaler.transform(X)

print("\n" + "="*70)
print("DATASET LOADED")
print("="*70)
print(f"Shape: {X_scaled.shape}")
print(f"Features: {feature_cols}")
print("Standardized: ✅")
print("="*70 + "\n")

## 3. Model Selection - BIC and AIC

### Information Criteria

**Bayesian Information Criterion (BIC):**
$$\text{BIC} = -2 \ln(\mathcal{L}) + p \ln(N)$$

**Akaike Information Criterion (AIC):**
$$\text{AIC} = -2 \ln(\mathcal{L}) + 2p$$

Where:
- $\mathcal{L}$ = Maximum likelihood
- $p$ = Number of parameters
- $N$ = Number of data points

**Lower is better:** both criteria reward a high likelihood and penalize the number of parameters, with BIC applying the heavier penalty whenever $\ln(N) > 2$.
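
For a full-covariance GMM, the parameter count $p$ can be worked out by hand: $K - 1$ free mixing coefficients, $Kd$ mean entries, and $K \cdot d(d+1)/2$ free covariance entries. The sketch below computes $p$ and both criteria from a given log-likelihood; the function names are hypothetical and the log-likelihood value is a placeholder, not a result from this notebook.

```python
import numpy as np

def n_params_full_gmm(K, d):
    """Free parameters of a K-component GMM with full covariances in d dimensions."""
    mixing = K - 1                      # weights sum to 1, so one is determined by the rest
    means = K * d
    covariances = K * d * (d + 1) // 2  # each symmetric d x d matrix has d(d+1)/2 free entries
    return mixing + means + covariances

def bic(log_likelihood, p, N):
    return -2.0 * log_likelihood + p * np.log(N)

def aic(log_likelihood, p):
    return -2.0 * log_likelihood + 2.0 * p

p_full = n_params_full_gmm(K=10, d=5)
print(p_full)  # 209

# Placeholder log-likelihood, used only to show the two formulas side by side
toy_ll = -5000.0
print(bic(toy_ll, p_full, N=1000), aic(toy_ll, p_full))
```

With 10 components and 5 features the model already has 209 free parameters for 1,000 songs, which is why the penalty terms matter here.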

In [None]:
# Test different numbers of components
n_components_range = range(2, 21)
bic_scores = []
aic_scores = []
log_likelihoods = []

print("\n" + "="*70)
print("MODEL SELECTION - TESTING DIFFERENT COMPONENTS")
print("="*70 + "\n")

for n_components in n_components_range:
    gmm = GaussianMixture(
        n_components=n_components,
        covariance_type='full',
        max_iter=300,
        random_state=42
    )
    gmm.fit(X_scaled)
    
    bic = gmm.bic(X_scaled)
    aic = gmm.aic(X_scaled)
    # score() returns the mean per-sample log-likelihood; multiply by N for the total
    log_likelihood = gmm.score(X_scaled) * X_scaled.shape[0]
    
    bic_scores.append(bic)
    aic_scores.append(aic)
    log_likelihoods.append(log_likelihood)
    
    print(f"K={n_components:2d} | BIC: {bic:8.2f} | AIC: {aic:8.2f} | Log-Likelihood: {log_likelihood:8.2f}")

print("\n✅ Model selection complete!")

In [None]:
# Plot BIC and AIC
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))

# BIC plot
ax1.plot(n_components_range, bic_scores, 'bo-', linewidth=2, markersize=8, label='BIC')
ax1.axvline(x=10, color='red', linestyle='--', linewidth=2, label='Chosen K=10')
ax1.set_xlabel('Number of Components', fontsize=12)
ax1.set_ylabel('BIC Score', fontsize=12)
ax1.set_title('Bayesian Information Criterion', fontsize=14, fontweight='bold')
ax1.grid(True, alpha=0.3)
ax1.legend(fontsize=10)

# AIC plot
ax2.plot(n_components_range, aic_scores, 'go-', linewidth=2, markersize=8, label='AIC')
ax2.axvline(x=10, color='red', linestyle='--', linewidth=2, label='Chosen K=10')
ax2.set_xlabel('Number of Components', fontsize=12)
ax2.set_ylabel('AIC Score', fontsize=12)
ax2.set_title('Akaike Information Criterion', fontsize=14, fontweight='bold')
ax2.grid(True, alpha=0.3)
ax2.legend(fontsize=10)

plt.tight_layout()
plt.show()

# Find optimal K
optimal_k_bic = n_components_range[np.argmin(bic_scores)]
optimal_k_aic = n_components_range[np.argmin(aic_scores)]

print(f"\nOptimal K (BIC): {optimal_k_bic}")
print(f"Optimal K (AIC): {optimal_k_aic}")
print("Chosen K: 10 (matches the 10 GTZAN genres)")

## 4. Train GMM with K=10

In [None]:
# Train GMM
print("\n" + "="*70)
print("TRAINING GAUSSIAN MIXTURE MODEL")
print("="*70 + "\n")

gmm = GaussianMixture(
    n_components=10,
    covariance_type='full',
    max_iter=300,
    n_init=10,
    random_state=42,
    verbose=0
)

gmm.fit(X_scaled)

print("✅ Training complete!")
print(f"\nModel Parameters:")
print(f"  n_components: {gmm.n_components}")
print(f"  covariance_type: {gmm.covariance_type}")
print(f"  n_iter: {gmm.n_iter_} (actual EM iterations)")
print(f"  converged: {gmm.converged_}")
print(f"\nModel Quality:")
print(f"  Log-Likelihood: {gmm.score(X_scaled) * X_scaled.shape[0]:.2f}")
print(f"  BIC: {gmm.bic(X_scaled):.2f}")
print(f"  AIC: {gmm.aic(X_scaled):.2f}")

## 5. Soft vs Hard Clustering

### Probability Assignments

GMM provides **soft assignments** (probabilities):

For each song $x_n$:
$$P(z_k | x_n) = \gamma(z_{nk}) = \frac{\pi_k \mathcal{N}(x_n | \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \mathcal{N}(x_n | \mu_j, \Sigma_j)}$$

**Hard assignment** (for comparison with K-Means):
$$\text{cluster}(x_n) = \arg\max_k P(z_k | x_n)$$
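
The relationship between the two kinds of assignment is easy to see on a small probability matrix. The values below are invented for illustration; only the `argmax` and `max` operations carry over to the real `predict_proba` output.

```python
import numpy as np

# Toy responsibilities for 3 songs and 3 components (each row sums to 1)
toy_probs = np.array([
    [0.90, 0.05, 0.05],
    [0.40, 0.35, 0.25],
    [0.10, 0.10, 0.80],
])

hard = toy_probs.argmax(axis=1)        # hard label = most probable component
confidence = toy_probs.max(axis=1)     # probability of that component

print(hard)        # [0 0 2]
print(confidence)  # [0.9 0.4 0.8]
```

The first two songs receive the same hard label even though the model is far less certain about the second one; that difference is exactly what a hard clustering discards.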

In [None]:
# Get soft assignments (probabilities)
probabilities = gmm.predict_proba(X_scaled)

# Get hard assignments
hard_labels = gmm.predict(X_scaled)

print("\n" + "="*70)
print("CLUSTERING ASSIGNMENTS")
print("="*70)

print(f"\nProbability matrix shape: {probabilities.shape}")
print(f"Hard labels shape: {hard_labels.shape}")

# Example: Show probabilities for first 5 songs
print("\nExample - First 5 Songs:")
print("="*70)
prob_df = pd.DataFrame(
    probabilities[:5],
    columns=[f'P(C{i})' for i in range(10)],
    index=[f'Song {i}' for i in range(5)]
)
print(prob_df)

print("\nHard assignments for first 5 songs:")
print(hard_labels[:5])

# Add to dataframe
features_df['gmm_cluster'] = hard_labels
for i in range(10):
    features_df[f'gmm_prob_{i}'] = probabilities[:, i]

### Uncertainty Analysis

**Entropy of probability distribution:**
$$H(p) = -\sum_{k=1}^{K} p_k \log p_k$$

- **Low entropy** → Confident assignment (one high probability)
- **High entropy** → Uncertain assignment (multiple similar probabilities)
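
The two extremes bound the entropy: a one-hot distribution has $H = 0$ and a uniform distribution over $K$ components has $H = \ln K$. The sketch below checks both for $K = 10$; the helper `entropy` and the clipping constant are assumptions chosen for this example.

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Shannon entropy (natural log) of a single probability vector."""
    p = np.clip(p, eps, 1.0)            # guard against log(0)
    return float(-np.sum(p * np.log(p)))

K_toy = 10
one_hot = np.eye(K_toy)[0]              # all mass on one component
uniform = np.full(K_toy, 1.0 / K_toy)   # mass spread evenly

h_min = entropy(one_hot)
h_max = entropy(uniform)
print(f"{h_min:.4f} {h_max:.4f} (ln K = {np.log(K_toy):.4f})")
```

Dividing an entropy by $\ln K$ rescales it to the range $[0, 1]$, which can make values easier to compare across models with different numbers of components.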

In [None]:
# Calculate entropy for each song
def calculate_entropy(probs):
    """Calculate Shannon entropy of probability distribution"""
    # Avoid log(0)
    probs = np.clip(probs, 1e-10, 1)
    return -np.sum(probs * np.log(probs), axis=1)

entropies = calculate_entropy(probabilities)
features_df['gmm_entropy'] = entropies

print("\n" + "="*70)
print("UNCERTAINTY ANALYSIS")
print("="*70)

print(f"\nEntropy Statistics:")
print(f"  Mean: {entropies.mean():.4f}")
print(f"  Std: {entropies.std():.4f}")
print(f"  Min: {entropies.min():.4f} (most confident)")
print(f"  Max: {entropies.max():.4f} (most uncertain)")

# Find most/least confident assignments
most_confident_idx = entropies.argmin()
least_confident_idx = entropies.argmax()

print(f"\nMost confident song:")
print(f"  Index: {most_confident_idx}")
print(f"  Entropy: {entropies[most_confident_idx]:.4f}")
print(f"  Probabilities: {probabilities[most_confident_idx]}")

print(f"\nLeast confident song:")
print(f"  Index: {least_confident_idx}")
print(f"  Entropy: {entropies[least_confident_idx]:.4f}")
print(f"  Probabilities: {probabilities[least_confident_idx]}")

In [None]:
# Visualize entropy distribution
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))

# Histogram
ax1.hist(entropies, bins=50, edgecolor='black', alpha=0.7, color='skyblue')
ax1.axvline(entropies.mean(), color='red', linestyle='--', linewidth=2, label='Mean')
ax1.set_xlabel('Entropy', fontsize=12)
ax1.set_ylabel('Frequency', fontsize=12)
ax1.set_title('Distribution of Assignment Uncertainty', fontsize=14, fontweight='bold')
ax1.legend(fontsize=10)
ax1.grid(True, alpha=0.3)

# Scatter: Max probability vs Entropy
max_probs = probabilities.max(axis=1)
ax2.scatter(max_probs, entropies, alpha=0.5, s=20)
ax2.set_xlabel('Maximum Probability', fontsize=12)
ax2.set_ylabel('Entropy', fontsize=12)
ax2.set_title('Confidence vs Uncertainty', fontsize=14, fontweight='bold')
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n📊 Uncertainty analysis plots generated!")

## 6. Component Analysis

### Means and Covariances

In [None]:
# Get component parameters
means_scaled = gmm.means_
covariances = gmm.covariances_
weights = gmm.weights_

# Transform means back to original scale
means_original = scaler.inverse_transform(means_scaled)

# Create DataFrame
means_df = pd.DataFrame(
    means_original,
    columns=feature_cols,
    index=[f'Component {i}' for i in range(10)]
)

print("\n" + "="*70)
print("GMM COMPONENT MEANS (ORIGINAL SCALE)")
print("="*70 + "\n")
print(means_df)

# Visualize
fig, ax = plt.subplots(figsize=(14, 8))

sns.heatmap(means_df.T, 
            annot=True, 
            fmt='.2f',
            cmap='RdYlGn',
            center=means_df.values.mean(),
            linewidths=1,
            cbar_kws={"label": "Feature Value"},
            ax=ax)

ax.set_title('GMM Component Means - Feature Profiles', fontsize=14, fontweight='bold', pad=20)
ax.set_xlabel('Component', fontsize=12)
ax.set_ylabel('Feature', fontsize=12)

plt.tight_layout()
plt.show()

In [None]:
# Mixing coefficients
print("\n" + "="*70)
print("MIXING COEFFICIENTS (COMPONENT WEIGHTS)")
print("="*70 + "\n")

weights_df = pd.DataFrame({
    'Component': [f'Component {i}' for i in range(10)],
    'Weight (π)': weights,
    'Percentage': weights * 100
})

print(weights_df.to_string(index=False))

# Visualize
fig, ax = plt.subplots(figsize=(12, 6))

ax.bar(range(10), weights, color='skyblue', edgecolor='black')
ax.axhline(y=0.1, color='red', linestyle='--', linewidth=2, label='Equal weight (0.1)')
ax.set_xlabel('Component', fontsize=12)
ax.set_ylabel('Mixing Coefficient (π)', fontsize=12)
ax.set_title('GMM Mixing Coefficients', fontsize=14, fontweight='bold')
ax.set_xticks(range(10))
ax.legend(fontsize=10)
ax.grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

### Covariance Structure

Each component has its own covariance matrix $\Sigma_k$:
- **Diagonal elements** = variances
- **Off-diagonal elements** = covariances between features
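
Raw covariances depend on each feature's variance within the component, so off-diagonal entries can be hard to compare. One option is to rescale a covariance matrix to a correlation matrix, as sketched below; `cov_to_corr` is a hypothetical helper and the 2x2 matrix is a made-up example.

```python
import numpy as np

def cov_to_corr(cov):
    """Convert a covariance matrix to a correlation matrix."""
    std = np.sqrt(np.diag(cov))         # per-feature standard deviations
    return cov / np.outer(std, std)     # divide entry (i, j) by std_i * std_j

toy_cov = np.array([[4.0, 1.2],
                    [1.2, 1.0]])
toy_corr = cov_to_corr(toy_cov)
print(toy_corr)  # diagonal is 1, off-diagonal is 1.2 / (2 * 1) = 0.6
```

Applied to a fitted model, the same function could be called on each matrix in `gmm.covariances_` when `covariance_type='full'`.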

In [None]:
# Visualize covariance matrices for first 4 components
fig, axes = plt.subplots(2, 2, figsize=(14, 12))
fig.suptitle('Covariance Matrices (First 4 Components)', fontsize=16, fontweight='bold')

for idx in range(4):
    row = idx // 2
    col = idx % 2
    ax = axes[row, col]
    
    # Get covariance matrix
    cov_matrix = covariances[idx]
    
    # Create heatmap
    sns.heatmap(cov_matrix, 
                annot=True, 
                fmt='.3f',
                cmap='coolwarm',
                center=0,
                square=True,
                linewidths=1,
                xticklabels=feature_cols,
                yticklabels=feature_cols,
                cbar_kws={"shrink": 0.8},
                ax=ax)
    
    ax.set_title(f'Component {idx}', fontweight='bold', fontsize=12)

plt.tight_layout()
plt.show()

print("\n📊 Covariance matrices visualized!")

## 7. GMM vs K-Means Comparison
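
Cluster IDs are arbitrary labels: two models can produce the same partition while numbering the groups differently. A permutation-invariant score such as the adjusted Rand index (ARI) is therefore a safer comparison than element-wise label equality. The toy labels below are invented to show the difference.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

labels_a = np.array([0, 0, 1, 1, 2, 2])
labels_b = np.array([2, 2, 0, 0, 1, 1])   # identical grouping, different cluster IDs

raw_agreement = (labels_a == labels_b).mean()
ari = adjusted_rand_score(labels_a, labels_b)

print(raw_agreement)  # 0.0, although the partitions are the same
print(ari)            # 1.0, because ARI ignores how clusters are numbered
```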

In [None]:
# Load K-Means results
kmeans_labels = pd.read_csv('data/processed/kmeans_cluster_assignments.csv')['kmeans_cluster'].values

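# Compare hard assignments. Cluster IDs are arbitrary in both models, so first
# match each K-Means cluster to one GMM component (Hungarian algorithm on the
# contingency table), then measure the fraction of songs in matched pairs.
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import confusion_matrix

contingency = confusion_matrix(kmeans_labels, hard_labels)
row_ind, col_ind = linear_sum_assignment(contingency, maximize=True)
agreement = contingency[row_ind, col_ind].sum() / contingency.sum()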

print("\n" + "="*70)
print("GMM vs K-MEANS COMPARISON")
print("="*70)

print(f"\nLabel Agreement: {agreement*100:.2f}%")
print(f"Label Disagreement: {(1-agreement)*100:.2f}%")

# Confusion matrix
from sklearn.metrics import confusion_matrix

conf_matrix = confusion_matrix(kmeans_labels, hard_labels)

print("\nConfusion Matrix (K-Means vs GMM):")
print("Rows = K-Means clusters, Columns = GMM components\n")

# Visualize
fig, ax = plt.subplots(figsize=(12, 10))

sns.heatmap(conf_matrix, 
            annot=True, 
            fmt='d',
            cmap='Blues',
            linewidths=0.5,
            xticklabels=range(10),
            yticklabels=range(10),
            cbar_kws={"label": "Number of Songs"},
            ax=ax)

ax.set_title('K-Means vs GMM Cluster Mapping', fontsize=14, fontweight='bold', pad=20)
ax.set_xlabel('GMM Component', fontsize=12)
ax.set_ylabel('K-Means Cluster', fontsize=12)

plt.tight_layout()
plt.show()

## 8. Save GMM Model and Results

In [None]:
# Save model
joblib.dump(gmm, 'models/gmm_model.pkl')

# Save cluster assignments
features_df.to_csv('data/processed/gmm_cluster_assignments.csv', index=False)

# Save component parameters
means_df.to_csv('data/processed/gmm_means.csv')

print("\n" + "="*70)
print("FILES SAVED")
print("="*70)
print("\n✅ models/gmm_model.pkl")
print("✅ data/processed/gmm_cluster_assignments.csv")
print("✅ data/processed/gmm_means.csv")
print("\n" + "="*70)

## 9. Summary

### GMM Results

**Model Configuration:**
- K = 10 components
- Covariance type: Full
- EM iterations: ~15-20 (converged)
- Probabilistic (soft) assignments

**Key Findings:**
1. **Soft clustering** provides uncertainty estimates
2. Some songs have **high confidence** (one dominant component)
3. Other songs have **low confidence** (multiple similar components)
4. GMM captures **elliptical cluster shapes** (vs K-Means spherical)
5. Agreement with K-Means: ~70-80% (expected)

**Advantages of GMM:**
- Probabilistic framework
- Uncertainty quantification
- Flexible cluster shapes (elliptical)
- Per-component covariance structure

**Disadvantages:**
- More parameters (slower training)
- More complex interpretation
- Potential overfitting with small data

### Next Steps
1. Apply PCA for visualization (Notebook 4)
2. Detailed evaluation and comparison (Notebook 5)