# Time Series Clustering: Overview

## Why this matters
- Group sequences by **shape**, **phase**, or **dynamics** to discover regimes.
- Build cohorts for monitoring, forecasting, or anomaly detection.
- Create prototypes (medoids/centroids) that summarize behavior.

## Key decisions
1. **Representation**: raw series, aligned series, or engineered features.
2. **Similarity**: Euclidean, correlation, DTW, shape-based distances.
3. **Algorithm**: partitional (k-means/k-medoids), hierarchical, density-based.
4. **Validation**: silhouette, stability, and domain interpretation.

## Notation
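A univariate series of length $T$ is $x = (x_1, \dots, x_T)$. We often standardize each series (z-normalization) using

$$z_t = \frac{x_t - \mu_x}{\sigma_x},$$

where $\mu_x$ and $\sigma_x$ are the mean and standard deviation of $x$.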
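The per-series standardization can be sketched as below. This is a minimal illustration, not part of the notebook's pipeline; the helper name `z_normalize` and the `eps` guard for constant series are assumptions made for the example.

```python
import numpy as np


def z_normalize(x, eps=1e-8):
    """Standardize one series to zero mean and unit standard deviation.

    `eps` is an assumed guard against division by zero for constant series.
    """
    x = np.asarray(x, dtype=float)
    sigma = x.std()
    if sigma < eps:
        # A constant series has no shape to preserve; return zeros.
        return np.zeros_like(x)
    return (x - x.mean()) / sigma


# Two series with the same shape but different offset and scale
a = np.array([1.0, 2.0, 3.0, 2.0, 1.0])
b = 10.0 + 5.0 * a

za, zb = z_normalize(a), z_normalize(b)
same_shape = np.allclose(za, zb)  # offset and scale are removed
```

After standardization, the two series coincide, which is why this step is useful when shape matters more than level or amplitude.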


In [None]:
import numpy as np
import pandas as pd
import plotly.graph_objects as go
import plotly.express as px

colors = ["#1B9E77", "#D95F02", "#7570B3"]


## A simple clustering pipeline
This pipeline frames the common choices you make in most projects.


In [None]:
steps = ["Ingest", "Preprocess", "Representation", "Similarity", "Clustering", "Validate"]

fig = go.Figure()

for i, step in enumerate(steps):
    # One box per pipeline stage, centered at x = i
    fig.add_shape(
        type="rect",
        x0=i - 0.45,
        x1=i + 0.45,
        y0=-0.25,
        y1=0.25,
        line=dict(color="#2C7FB8", width=2),
        fillcolor="#D0ECF4",
    )
    fig.add_annotation(x=i, y=0, text=step, showarrow=False, font=dict(size=12))
    if i < len(steps) - 1:
        # Arrow from the right edge of this box (tail: ax, ay) to the left
        # edge of the next box (head: x, y). axref/ayref are set to the data
        # axes because ax/ay are interpreted as pixel offsets by default.
        fig.add_annotation(
            x=i + 0.55,
            y=0,
            ax=i + 0.45,
            ay=0,
            xref="x",
            yref="y",
            axref="x",
            ayref="y",
            showarrow=True,
            arrowhead=3,
            arrowsize=1,
        )

fig.update_xaxes(visible=False)
fig.update_yaxes(visible=False)
fig.update_layout(
    title="Time series clustering pipeline",
    height=240,
    margin=dict(l=10, r=10, t=40, b=10),
)
fig

## Synthetic dataset
We will create three families of series to make the effects of representation clear.


In [None]:
rng = np.random.default_rng(7)
n_series = 90
length = 120
t = np.linspace(0, 2 * np.pi, length)

series = []
labels_true = []

for _ in range(n_series // 3):
    phase = rng.normal(0, 0.2)
    series.append(np.sin(t + phase) + 0.1 * rng.normal(size=length))
    labels_true.append("sine")

for _ in range(n_series // 3):
    phase = rng.normal(1.0, 0.2)
    series.append(np.sin(t + phase) + 0.1 * rng.normal(size=length))
    labels_true.append("phase_shifted")

for _ in range(n_series // 3):
    trend = np.linspace(-0.6, 0.8, length)
    series.append(trend + 0.6 * np.sin(2 * t) + 0.1 * rng.normal(size=length))
    labels_true.append("trend_seasonal")

series = np.array(series)
labels_true = np.array(labels_true)

sample_ids = rng.choice(len(series), size=12, replace=False)
df_plot = (
    pd.DataFrame(series[sample_ids])
    .assign(series_id=[f"s{i}" for i in sample_ids])
    .assign(true_label=labels_true[sample_ids])
    .melt(id_vars=["series_id", "true_label"], var_name="time", value_name="value")
)

fig = px.line(
    df_plot,
    x="time",
    y="value",
    color="series_id",
    facet_col="true_label",
    facet_col_spacing=0.05,
    title="Sample series by generating process",
)
fig.update_layout(showlegend=False)
fig

## Representation choices

### 1) Raw series
- Requires equal length and aligned time indexes.
- Sensitive to **phase shifts** and **time warping**.

### 2) Feature-based vectors
- Works with standard clustering algorithms.
- Useful for scale, trend, seasonality, or volatility patterns.
- Examples: mean, slope, dominant frequency, autocorrelation.

### 3) Learned or shape-based embeddings
- ROCKET, shapelets, autoencoders, wavelets.
- Often cluster in latent space.


## Feature-based clustering (baseline)
We first map each series to a feature vector and cluster with $k$-means.


In [None]:
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def extract_features(x):
    idx = np.arange(len(x))
    slope = np.polyfit(idx, x, 1)[0]
    mean = x.mean()
    std = x.std()
    autocorr = np.corrcoef(x[:-1], x[1:])[0, 1]
    fft = np.fft.rfft(x - mean)
    dominant_freq = np.argmax(np.abs(fft[1:])) + 1
    amplitude = x.max() - x.min()
    return [mean, std, slope, autocorr, dominant_freq, amplitude]


X = np.array([extract_features(s) for s in series])
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

kmeans = KMeans(n_clusters=3, n_init=20, random_state=7)
labels_km = kmeans.fit_predict(X_scaled)

pca = PCA(n_components=2, random_state=7)
X_2d = pca.fit_transform(X_scaled)

df_feat = pd.DataFrame(X_2d, columns=["PC1", "PC2"])
df_feat["cluster"] = labels_km.astype(str)
df_feat["true_label"] = labels_true

fig = px.scatter(
    df_feat,
    x="PC1",
    y="PC2",
    color="cluster",
    symbol="true_label",
    title="Feature-based clustering (PCA view)",
)
fig

In [None]:
cluster_means = []
for k in range(3):
    cluster_series = series[labels_km == k]
    cluster_means.append(cluster_series.mean(axis=0))

fig = go.Figure()
for k, mean_series in enumerate(cluster_means):
    fig.add_trace(
        go.Scatter(
            x=np.arange(length),
            y=mean_series,
            mode="lines",
            line=dict(color=colors[k], width=3),
            name=f"Cluster {k} mean",
        )
    )
fig.update_layout(title="Average series per feature-based cluster")
fig

## Distance-based clustering with DTW
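Dynamic Time Warping (DTW) aligns two series $x = (x_1, \dots, x_T)$ and $y = (y_1, \dots, y_{T'})$ by letting their time axes stretch and compress locally. The classic recurrence is

$$D_{i,j} = (x_i - y_j)^2 + \min\{D_{i-1,j}, D_{i,j-1}, D_{i-1,j-1}\},$$

with boundary conditions $D_{0,0} = 0$ and $D_{i,0} = D_{0,j} = \infty$ for $i, j \geq 1$. The DTW distance is $\sqrt{D_{T,T'}}$.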

This helps when two series share the same **shape** but are phase-shifted.
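To make the recurrence concrete, here is a small self-contained sketch on two toy sequences that contain the same bump one time step apart. The function name `dtw_toy` and the toy data are illustrative assumptions; the logic follows the recurrence above with $D_{0,0} = 0$ and infinite cost elsewhere on the boundary.

```python
import numpy as np


def dtw_toy(x, y):
    """DTW distance via the recurrence above (squared local cost)."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0  # boundary: only the empty alignment costs nothing
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(np.sqrt(D[n, m]))


x = np.array([0.0, 0.0, 1.0, 2.0, 1.0, 0.0])
y = np.array([0.0, 1.0, 2.0, 1.0, 0.0, 0.0])  # same bump, one step earlier

euclid_xy = float(np.linalg.norm(x - y))  # 2.0: point-by-point mismatch
dtw_xy = dtw_toy(x, y)                    # 0.0: warping absorbs the shift
```

The double loop costs $O(T \cdot T')$ per pair, which is why pairwise DTW on many long series is usually run on a subset or with a constrained warping window.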


In [None]:
from sklearn.cluster import AgglomerativeClustering


def dtw_distance(x, y):
    # Accumulated-cost matrix with an extra row and column for the boundary
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return np.sqrt(D[n, m])


def pairwise_dtw(X):
    # DTW is symmetric, so fill the upper triangle and mirror it
    n = len(X)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d = dtw_distance(X[i], X[j])
            dist[i, j] = d
            dist[j, i] = d
    return dist


# The series are stored family by family, so series[:subset] would contain
# no trend_seasonal series. Take an equal number from each family instead.
subset = 36
per_family = subset // 3
subset_idx = np.concatenate(
    [np.where(labels_true == name)[0][:per_family] for name in np.unique(labels_true)]
)
series_small = series[subset_idx]
labels_small = labels_true[subset_idx]
D = pairwise_dtw(series_small)

# scikit-learn 1.2+ takes `metric`; older versions use `affinity`
try:
    model = AgglomerativeClustering(
        n_clusters=3, metric="precomputed", linkage="average"
    )
except TypeError:
    model = AgglomerativeClustering(
        n_clusters=3, affinity="precomputed", linkage="average"
    )
labels_dtw = model.fit_predict(D)

# Medoid of a cluster: the member with the smallest total distance to the rest
medoids = []
for k in range(3):
    idx = np.where(labels_dtw == k)[0]
    if len(idx) == 0:
        medoids.append(None)
        continue
    sub = D[np.ix_(idx, idx)]
    medoid = idx[np.argmin(sub.sum(axis=1))]
    medoids.append(medoid)

fig = go.Figure()
for k, medoid_idx in enumerate(medoids):
    idx = np.where(labels_dtw == k)[0]
    for j in idx:
        fig.add_trace(
            go.Scatter(
                x=np.arange(length),
                y=series_small[j],
                mode="lines",
                line=dict(color=colors[k], width=1),
                opacity=0.25,
                showlegend=False,
            )
        )
    if medoid_idx is not None:
        fig.add_trace(
            go.Scatter(
                x=np.arange(length),
                y=series_small[medoid_idx],
                mode="lines",
                line=dict(color=colors[k], width=3),
                name=f"DTW cluster {k} medoid",
            )
        )
fig.update_layout(title="DTW-based clusters with medoids")
fig

## Quick sanity check: Euclidean vs DTW
A phase-shifted pair may look far apart in Euclidean space but close under DTW.


In [None]:
x = series[0]
y = series[n_series // 3]  # shifted sine

euclid = np.linalg.norm(x - y)
dtw = dtw_distance(x, y)

pd.DataFrame(
    {
        "metric": ["Euclidean", "DTW"],
        "distance": [euclid, dtw],
    }
)


## Validation and selection
- **Silhouette** scores are a natural fit for feature-based clustering, where Euclidean distance in feature space is meaningful.
- For DTW or custom distances, use **stability** (resampling) and **domain coherence**.
- Choose $k$ with the elbow heuristic, the gap statistic, or by inspecting medoids.
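The stability idea can be sketched as follows: refit the clustering on random subsamples and measure agreement with a reference clustering on the shared points. This is one possible formulation, not the notebook's own procedure; the helper `clustering_stability`, its parameters, and the toy blob data are assumptions for illustration, and the adjusted Rand index is used as the agreement measure.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score


def clustering_stability(X, k, n_rounds=20, frac=0.8, seed=0):
    """Mean adjusted Rand index between a reference clustering and
    clusterings refit on random subsamples (compared on shared points)."""
    rng = np.random.default_rng(seed)
    reference = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
    scores = []
    for r in range(n_rounds):
        idx = rng.choice(len(X), size=int(frac * len(X)), replace=False)
        km = KMeans(n_clusters=k, n_init=10, random_state=seed + r + 1)
        labels = km.fit_predict(X[idx])
        # ARI is invariant to label permutations, so no matching is needed
        scores.append(adjusted_rand_score(reference[idx], labels))
    return float(np.mean(scores))


# Toy feature matrix: three well-separated blobs (synthetic, for illustration)
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X_demo = np.vstack([c + 0.3 * rng.normal(size=(30, 2)) for c in centers])

stability_k3 = clustering_stability(X_demo, k=3)
stability_k5 = clustering_stability(X_demo, k=5)
```

A value of $k$ that matches the data's structure tends to give scores near 1, while an over-split $k$ usually gives lower and more variable agreement.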


In [None]:
from sklearn.metrics import silhouette_score

silhouette = silhouette_score(X_scaled, labels_km)
float(silhouette)
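scikit-learn's `silhouette_score` also accepts a precomputed distance matrix through `metric="precomputed"`, which is one way to score a clustering built on DTW or another custom distance. The sketch below uses a toy matrix rather than the notebook's DTW matrix; `points`, `labels_demo`, and `D_demo` are illustrative names. Because DTW does not satisfy the triangle inequality, such scores are best read as a rough guide alongside stability and domain checks.

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Toy 1-D points forming two tight groups (synthetic, for illustration)
points = np.array([0.0, 0.1, 0.2, 5.0, 5.1, 5.2])
labels_demo = np.array([0, 0, 0, 1, 1, 1])

# Any symmetric matrix with a zero diagonal can be passed, e.g. pairwise DTW
D_demo = np.abs(points[:, None] - points[None, :])

sil_precomputed = silhouette_score(D_demo, labels_demo, metric="precomputed")

# For comparison: the same score computed from the raw coordinates
sil_direct = silhouette_score(points.reshape(-1, 1), labels_demo)
```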


## Practical tips
- Always normalize or standardize per-series when shape matters more than scale.
- Consider detrending or **differencing** before clustering if trend dominates; use **seasonal differencing** when a seasonal cycle dominates.
- For multivariate series, concatenate features per channel or use multivariate DTW.
- Use **medoids** instead of means when using non-Euclidean distances.
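As a rough illustration of the tip about removing a dominant trend, the sketch below subtracts a least-squares line from a series and, as an alternative, takes first differences. The helper `remove_linear_trend` is a hypothetical name, and the example series only mimics the form of the trend-plus-seasonal family without noise.

```python
import numpy as np


def remove_linear_trend(x):
    """Subtract the least-squares line from a series."""
    idx = np.arange(len(x))
    slope, intercept = np.polyfit(idx, x, 1)
    return x - (slope * idx + intercept)


n = 120
t = np.linspace(0, 2 * np.pi, n)
x = np.linspace(-0.6, 0.8, n) + 0.6 * np.sin(2 * t)  # trend plus seasonal cycle

detrended = remove_linear_trend(x)
first_diff = np.diff(x)  # alternative: first differences, one sample shorter

slope_before = np.polyfit(np.arange(n), x, 1)[0]
slope_after = np.polyfit(np.arange(n), detrended, 1)[0]  # numerically zero
```

Linear detrending keeps the series length and the seasonal shape, while differencing also removes slowly varying level changes at the cost of amplifying noise.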


## Exercises
1. Replace DTW with **correlation distance** and compare cluster assignments.
2. Add a new feature (e.g., spectral entropy) and observe changes in PCA space.
3. Try **k-medoids** on the DTW distance matrix and compare medoids.
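As a possible starting point for exercise 3, here is a minimal k-medoids sketch that works directly on a precomputed distance matrix. It uses a simple alternating scheme (assign points, then update medoids) rather than the full PAM swap procedure, and the function name `k_medoids` and its parameters are assumptions for illustration.

```python
import numpy as np


def k_medoids(D, k, n_iter=50, seed=0):
    """Alternating k-medoids on a precomputed distance matrix.

    Returns (medoid_indices, labels). Illustrative sketch, not full PAM.
    """
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest medoid
        labels = np.argmin(D[:, medoids], axis=1)
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.where(labels == c)[0]
            if len(members) == 0:
                continue  # keep the old medoid if a cluster empties out
            # Update step: member with the smallest total within-cluster distance
            within = D[np.ix_(members, members)].sum(axis=1)
            new_medoids[c] = members[np.argmin(within)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    labels = np.argmin(D[:, medoids], axis=1)
    return medoids, labels


# Toy distance matrix from two tight groups of 1-D points
points = np.array([0.0, 0.1, 0.2, 5.0, 5.1, 5.2])
D_toy = np.abs(points[:, None] - points[None, :])
medoid_idx, labels_toy = k_medoids(D_toy, k=2)
```

With the notebook's DTW matrix, the call would look like `k_medoids(D, k=3)`. The result depends on the random initialization, so several restarts with different seeds are advisable.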

## Further reading
- tslearn documentation for DTW and k-medoids.
- aeon/sktime for time-series transformations and pipelines.
- Shapelets and ROCKET for fast, discriminative representations.
