# 🚀 Phase 2: Unsupervised Clustering of Engine Health Stages

This notebook focuses on applying unsupervised clustering techniques to the cleaned CMAPSS engine sensor datasets (FD001–FD004). The goal is to group engine cycles into **five degradation stages (Stage 0–Stage 4)**, which serve as interpretable, data-driven indicators of machinery health.

### 🎯 Why Clustering?
In real-world predictive maintenance, degradation stages are not always labeled. Instead of relying on arbitrary thresholds or synthetic labels, we apply clustering to let the **data speak for itself**. This allows us to:
- Detect **natural groupings** in sensor behavior across engine cycles.
- Define **progressive degradation stages** without hand-picked thresholds; the number of stages (five) remains a modeling choice.
- Lay the foundation for future **classification or regression models** using these stages as targets.
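
Cluster IDs returned by a clustering algorithm are arbitrary, so a cluster numbered 3 is not necessarily more degraded than one numbered 1. One way to make the stage numbers progressive is to reorder clusters by the mean cycle of their members. The sketch below assumes a hypothetical `cycle` column and a helper named `order_clusters_by_cycle`, neither of which appears in this notebook, and it treats mean cycle as a rough proxy for degradation.

```python
import pandas as pd

def order_clusters_by_cycle(df, cluster_col, cycle_col="cycle"):
    """Relabel clusters so that 0 has the earliest mean cycle and the
    highest label has the latest. `cycle_col` is an assumed column name."""
    mean_cycle = df.groupby(cluster_col)[cycle_col].mean().sort_values()
    # Map each original cluster ID to its rank in mean-cycle order
    mapping = {old: new for new, old in enumerate(mean_cycle.index)}
    return df[cluster_col].map(mapping)

# Toy data: cluster 2 occurs earliest, cluster 0 in the middle, cluster 1 latest
demo = pd.DataFrame({
    "cycle":   [1, 2, 50, 51, 100, 101],
    "cluster": [2, 2, 0, 0, 1, 1],
})
demo["stage"] = order_clusters_by_cycle(demo, "cluster")
print(demo)
```

Because engines in CMAPSS run for different numbers of cycles, a normalized cycle (cycle divided by the engine's final cycle) may be a better ordering key than the raw cycle count.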

### 🧠 Why Use Both KMeans and Agglomerative Clustering?
To ensure robust insights, we compare two fundamentally different unsupervised methods:
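- **KMeans** partitions points around five centroids and works best when clusters are roughly spherical and similar in size, which suits tight, centralized groupings.
- **Agglomerative Clustering** builds a hierarchy by repeatedly merging the closest groups (Ward linkage is the scikit-learn default), which can expose gradual degradation paths that a centroid-based method may split differently.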

Each offers complementary views of the underlying engine dynamics.
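
To see how far the two methods agree, their label assignments can be compared directly. The following is a minimal sketch on synthetic data from `make_blobs` (a stand-in, not CMAPSS data); using the adjusted Rand index as the agreement measure is an illustrative choice, not something this notebook computes.

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

# Synthetic stand-in for sensor features (assumption: 4 features, 5 groups)
X_demo, _ = make_blobs(n_samples=300, centers=5, n_features=4, random_state=42)

kmeans_labels = KMeans(n_clusters=5, n_init=10, random_state=42).fit_predict(X_demo)
agglo_labels = AgglomerativeClustering(n_clusters=5).fit_predict(X_demo)

# Adjusted Rand index: 1.0 means identical partitions (up to relabeling),
# values near 0 mean agreement no better than chance.
agreement = adjusted_rand_score(kmeans_labels, agglo_labels)
print(f"Agreement between methods (ARI): {agreement:.3f}")
```

The index ignores the numeric cluster IDs, so two methods that find the same groups under different numbering still score 1.0.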

### 🔍 The Role of PCA and t-SNE
High-dimensional sensor data is projected into 2D using:
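- **PCA (Principal Component Analysis)**, a linear projection onto the two directions of greatest variance.
- **t-SNE (t-distributed Stochastic Neighbor Embedding)**, a non-linear embedding that preserves local neighborhoods; distances between well-separated groups in a t-SNE plot should not be read quantitatively.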

These visualizations help us qualitatively assess the separation and structure of clusters, offering intuitive insights into engine health transitions.
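
The two projections can be sketched in isolation as follows. The random matrix `X_demo` is a placeholder for the sensor features, and the t-SNE parameters other than `perplexity` and `random_state` are left at their library defaults, so this is an illustration rather than the exact configuration used later.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 10))  # placeholder for 10 sensor columns

# PCA: linear projection; explained_variance_ratio_ shows how much
# of the total variance the two components retain.
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_demo)
retained = pca.explained_variance_ratio_.sum()
print(f"Variance retained by 2 components: {retained:.2%}")

# t-SNE: non-linear embedding; perplexity must be smaller than the sample count.
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_tsne = tsne.fit_transform(X_demo)
print(X_pca.shape, X_tsne.shape)
```

Checking the retained variance is a quick way to judge how much structure a 2D PCA plot can show at all; when it is low, apparent overlap between clusters in the plot may not reflect overlap in the full feature space.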

### 📊 Understanding Silhouette Scores
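The silhouette score ranges from -1 to 1 and measures how well each data point fits its assigned cluster compared with the nearest other cluster. Scores near 0 suggest overlapping clusters, negative scores suggest misassigned points, and higher scores indicate: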
- Strong intra-cluster cohesion.
- Good inter-cluster separation.

We compute and compare silhouette scores for each method to objectively assess clustering quality across all four datasets.
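
For a single point, the silhouette value is `(b - a) / max(a, b)`, where `a` is the mean distance to the other points in its own cluster and `b` is the mean distance to the points of the nearest other cluster. The sketch below checks a hand-written version against scikit-learn on synthetic data; `manual_silhouette` is a hypothetical helper written for illustration and assumes no cluster has only one member.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, silhouette_samples

X_demo, _ = make_blobs(n_samples=200, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_demo)

overall = silhouette_score(X_demo, labels)      # mean over all points
per_point = silhouette_samples(X_demo, labels)  # one value per point

def manual_silhouette(X, labels, i):
    """Silhouette of point i with Euclidean distance (illustrative helper)."""
    dists = np.linalg.norm(X - X[i], axis=1)
    own = labels == labels[i]
    own[i] = False  # exclude the point itself from its own cluster mean
    a = dists[own].mean()
    b = min(dists[labels == c].mean()
            for c in np.unique(labels) if c != labels[i])
    return (b - a) / max(a, b)

print(f"Overall score: {overall:.4f}")
print(f"Point 0, manual vs library: "
      f"{manual_silhouette(X_demo, labels, 0):.4f} vs {per_point[0]:.4f}")
```

The overall score is the mean of the per-point values, so a decent average can hide a cluster whose members mostly score near zero; inspecting `per_point` grouped by label shows this.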

### 🔄 What Comes Next?
The labeled stages generated here will act as **pseudo-labels** for:
- **Classification models** to predict the current health stage from live sensor readings.
- **Regression models** to estimate Remaining Useful Life (RUL) based on cluster-informed features.

This notebook is the bridge between unsupervised insight and supervised performance — and sets up the entire learning pipeline to follow.
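
As a rough preview of the classification step, the sketch below trains a classifier on cluster-derived pseudo-labels. The synthetic data, the `RandomForestClassifier` choice, and the 80/20 split are all assumptions made for illustration; the models used in later phases may differ.

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for sensor features
X_demo, _ = make_blobs(n_samples=500, centers=5, n_features=6, random_state=1)

# Step 1: unsupervised clustering produces pseudo-labels
pseudo_labels = KMeans(n_clusters=5, n_init=10, random_state=1).fit_predict(X_demo)

# Step 2: a supervised model learns to reproduce those labels
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, pseudo_labels, test_size=0.2, random_state=1, stratify=pseudo_labels
)
clf = RandomForestClassifier(n_estimators=100, random_state=1)
clf.fit(X_train, y_train)

accuracy = clf.score(X_test, y_test)
print(f"Held-out accuracy on pseudo-labels: {accuracy:.3f}")
```

High accuracy here only shows that the classifier can reproduce the clustering; it says nothing about whether the clusters correspond to real degradation, which has to be checked separately.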

In [None]:
# Phase 2: Clustering Analysis with KMeans and AgglomerativeClustering (Enhanced + Plot Export + Score Summary)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import silhouette_score
import os

# Load cleaned data
base_path = os.path.abspath(os.path.join(os.getcwd(), '..', 'data'))
figures_path = os.path.abspath(os.path.join(os.getcwd(), '..', 'figures'))
os.makedirs(figures_path, exist_ok=True)

dataset_ids = ['FD001', 'FD002', 'FD003', 'FD004']
datasets = {}

for ds_id in dataset_ids:
    df = pd.read_csv(os.path.join(base_path, f"clean_train_{ds_id}.csv"))
    datasets[ds_id] = df

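# Map numeric cluster label to descriptive stage name.
# Note: cluster IDs are assigned arbitrarily by the algorithms, so "Stage n" is only a
# name for cluster n and does not by itself imply more degradation than "Stage n-1".
def stage_label(n):
    return f"Stage {n}"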

# Global silhouette summary
silhouette_summary = {}

# Function to apply clustering and visualize with PCA and t-SNE
def apply_clustering(name, df):
    print(f"\n⏳ Clustering and visualizing {name}...")
    sensor_features = [col for col in df.columns if col.startswith("sensor_")]
    X = df[sensor_features].copy()

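    # Apply clustering (n_init is set explicitly so results do not depend on
    # the default of the installed scikit-learn version)
    kmeans = KMeans(n_clusters=5, n_init=10, random_state=42)
    # Ward linkage by default; memory use grows roughly quadratically with the row count
    agglom = AgglomerativeClustering(n_clusters=5)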
    df['kmeans_cluster'] = kmeans.fit_predict(X)
    df['agglo_cluster'] = agglom.fit_predict(X)
    df['kmeans_stage'] = df['kmeans_cluster'].apply(stage_label)
    df['agglo_stage'] = df['agglo_cluster'].apply(stage_label)

    # Silhouette scores
    k_score = silhouette_score(X, df['kmeans_cluster'])
    a_score = silhouette_score(X, df['agglo_cluster'])
    silhouette_summary[name] = {
        'KMeans': round(k_score, 4),
        'Agglomerative': round(a_score, 4)
    }

    # PCA visualization
    pca = PCA(n_components=2)
    pca_result = pca.fit_transform(X)
    df['pca_1'] = pca_result[:, 0]
    df['pca_2'] = pca_result[:, 1]

    plt.figure(figsize=(12, 5))
    plt.subplot(1, 2, 1)
    sns.scatterplot(data=df, x='pca_1', y='pca_2', hue='kmeans_stage', palette='tab10')
    plt.title(f"{name}: KMeans Stages via PCA")
    plt.subplot(1, 2, 2)
    sns.scatterplot(data=df, x='pca_1', y='pca_2', hue='agglo_stage', palette='tab10')
    plt.title(f"{name}: Agglomerative Stages via PCA")
    plt.tight_layout()
    plt.savefig(os.path.join(figures_path, f"{name}_PCA_Clusters.png"))
    plt.close()

    # t-SNE visualization (updated to use max_iter)
    tsne = TSNE(n_components=2, perplexity=30, max_iter=1000, random_state=42)
    tsne_result = tsne.fit_transform(X)
    df['tsne_1'] = tsne_result[:, 0]
    df['tsne_2'] = tsne_result[:, 1]

    plt.figure(figsize=(12, 5))
    plt.subplot(1, 2, 1)
    sns.scatterplot(data=df, x='tsne_1', y='tsne_2', hue='kmeans_stage', palette='tab10')
    plt.title(f"{name}: KMeans Stages via t-SNE")
    plt.subplot(1, 2, 2)
    sns.scatterplot(data=df, x='tsne_1', y='tsne_2', hue='agglo_stage', palette='tab10')
    plt.title(f"{name}: Agglomerative Stages via t-SNE")
    plt.tight_layout()
    plt.savefig(os.path.join(figures_path, f"{name}_TSNE_Clusters.png"))
    plt.close()

    # Distribution plots (added hue= explicitly to avoid warnings)
    plt.figure(figsize=(10, 4))
    plt.subplot(1, 2, 1)
    sns.countplot(x='kmeans_stage', hue='kmeans_stage', data=df, palette='tab10', order=sorted(df['kmeans_stage'].unique()))
    plt.title(f"{name} KMeans Stage Distribution")
    plt.subplot(1, 2, 2)
    sns.countplot(x='agglo_stage', hue='agglo_stage', data=df, palette='tab10', order=sorted(df['agglo_stage'].unique()))
    plt.title(f"{name} Agglomerative Stage Distribution")
    plt.tight_layout()
    plt.savefig(os.path.join(figures_path, f"{name}_Cluster_Distribution.png"))
    plt.close()

    # Save clustered data
    df.to_csv(os.path.join(base_path, f"clustered_train_{name}.csv"), index=False)
    print(f"✔ Clustered data saved: clustered_train_{name}.csv")

    # Clear and impactful explanation
    print(f"\n📘 **{name} Clustering Insight:**")
    print("Clustering the sensor data into degradation stages helped us visualize and interpret engine health patterns.")
    print("KMeans captured tight groupings while Agglomerative showed progressive transitions.")
    print("PCA and t-SNE plots make the separation between clusters clearly visible.")
    print("Silhouette scores validate the quality of the clustering for both methods.\n")

# Run clustering
for ds_id in dataset_ids:
    apply_clustering(ds_id, datasets[ds_id])

# Print silhouette summary for reporting
print("\n📊 **Silhouette Score Summary Table:**")
for ds, scores in silhouette_summary.items():
    print(f"{ds}: KMeans={scores['KMeans']}, Agglomerative={scores['Agglomerative']}")



⏳ Clustering and visualizing FD001...
✔ Clustered data saved: clustered_train_FD001.csv

📘 **FD001 Clustering Insight:**
Clustering the sensor data into degradation stages helped us visualize and interpret engine health patterns.
KMeans captured tight groupings while Agglomerative showed progressive transitions.
PCA and t-SNE plots make the separation between clusters clearly visible.
Silhouette scores validate the quality of the clustering for both methods.


⏳ Clustering and visualizing FD002...
