# Counterparty **Re‑clustering** Analytics Workbook
This interactive notebook walks through a **rolling‑window clustering** workflow for equities‑trading counterparties.  We focus on three analytic questions:
1. **Are we using the *right* number of clusters (`k`) at any point in time?**  
   → Inspect internal quality metrics such as the *silhouette* score and the *Davies‑Bouldin* index.
2. **How does the *importance of features* evolve over time?**  
   → Train a lightweight **XGBoost** classifier window‑by‑window and track `gain` feature importances.
3. **Which clusters are *emerging* (growing fast) or *volatile* (frequent member churn)?**  
   → Compare cluster summaries across adjacent windows.

We rely on the `constrained_clustering` utility module created earlier, which provides size‑balanced K‑Means, volume‑share repair, rolling‑window helpers, and ready‑made Seaborn FacetGrid plotting.

## 0. Environment & dependencies
**Purpose:** ensure required libraries are present.

Uncomment and run the `pip install` line only the *first* time you open the notebook.

In [None]:
# !pip install polars scikit-learn xgboost seaborn tqdm matplotlib

## 1. Imports, plotting theme, and data load
**Purpose:** set a clean plotting theme via **Seaborn** and bring the trade history into memory with the helper function.

In [None]:
import polars as pl
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from pathlib import Path
from tqdm.auto import tqdm

import constrained_clustering as cc

sns.set_theme(style='whitegrid', context='notebook')

TRADE_GLOB = Path('/data/equities/trades_*.parquet')  # adjust to your location
trades = cc.load_trade_history(str(TRADE_GLOB))
print(f'Trades loaded: {trades.shape[0]:,}')

## 2. Define rolling time windows
**Methodology:** 90‑day windows stepped every 30 days, so adjacent windows overlap by 60 days.  This captures *local* behaviour while giving enough samples per window.
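
As a rough illustration of the windowing scheme, here is a minimal standard-library sketch. The function `make_windows` is hypothetical and is not the implementation of `cc.generate_time_windows`, which may treat the trailing partial window differently. It assumes half-open windows `[start, end)`, matching the `ts >= start` / `ts < end` filter used later in this notebook.

```python
from datetime import datetime, timedelta

def make_windows(first_ts, last_ts, window_days=90, step_days=30):
    """Return (start, end) pairs; each window covers [start, end)."""
    span = timedelta(days=window_days)
    step = timedelta(days=step_days)
    windows = []
    start = first_ts
    # Keep only windows that fit entirely inside the data range.
    while start + span <= last_ts:
        windows.append((start, start + span))
        start += step
    return windows

# Illustrative date range only (not taken from the trade history).
demo = make_windows(datetime(2024, 1, 1), datetime(2024, 7, 1))
for start, end in demo:
    print(start.date(), "->", end.date())
```

With a 90-day span and a 30-day step, each window shares its last 60 days with the next one, so a counterparty's behaviour is smoothed across three consecutive windows.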

In [None]:
WINDOW_DAYS = 90
STEP_DAYS = 30
windows = cc.generate_time_windows(trades, window_days=WINDOW_DAYS, step_days=STEP_DAYS)
print(f'{len(windows)} windows spanning {windows[0][0].date()} – {windows[-1][1].date()}')

## 3. Choosing the number of clusters, *k*
**Purpose:** use *silhouette* to compare clustering quality across a range of `k` values.

**What to look for:**
* Higher silhouette is better (the score ranges from −1 to 1); compare the medians across `k`.
* Watch for a *knee* where gains flatten; it often marks the sweet spot.
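
To make the metric concrete, the sketch below computes the mean silhouette coefficient from its definition in pure Python. It is an illustration only: `cc.evaluate_clustering` is not shown in this notebook and may compute the score differently (for example via scikit-learn). The points and labels are toy values, and the code assumes Euclidean distance and at least two clusters.

```python
from math import dist
from statistics import mean

def silhouette(points, labels):
    """Mean silhouette coefficient using Euclidean distance."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(mean(dist(p, q) for q in other)
                for other_lab, other in clusters.items() if other_lab != lab)
        scores.append((b - a) / max(a, b))
    return mean(scores)

# Toy data: two tight, well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good = silhouette(points, [0, 0, 0, 1, 1, 1])   # labels follow the groups
bad = silhouette(points, [0, 1, 0, 1, 0, 1])    # labels cut across the groups
print(f"good={good:.2f}  bad={bad:.2f}")
```

Labels that follow the real groups score close to 1, while labels that cut across them score near or below 0, which is the contrast the boxplot is meant to expose across `k`.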

In [None]:
k_range = range(4, 13)  # evaluate k = 4–12
silhouette_by_k = {k: [] for k in k_range}

for start, end in tqdm(windows, desc='Windows'):
    win_trades = trades.filter((pl.col('ts') >= start) & (pl.col('ts') < end))
    if win_trades.is_empty():
        continue
    X, agg, _ = cc.build_counterparty_features(win_trades)
    for k in k_range:
        labels, _ = cc.cluster_counterparties(X, n_clusters=k, size_min=5)
        sil = cc.evaluate_clustering(X, labels)['silhouette']
        silhouette_by_k[k].append(sil)

sil_df = pd.DataFrame(silhouette_by_k)
sil_df.head()

In [None]:
# --- Seaborn boxplot of silhouette scores ---
fig, ax = plt.subplots(figsize=(8, 4))
sns.boxplot(data=sil_df, ax=ax)
ax.set_xlabel('k (number of clusters)')
ax.set_ylabel('Silhouette')
ax.set_title('Silhouette distribution across windows')
plt.tight_layout()

### Decision: pick the `k` with the highest *average* silhouette

In [None]:
K_SELECTED = int(sil_df.mean().idxmax())
print(f'Chosen k = {K_SELECTED}')

## 4. Run full rolling pipeline with the chosen `k`
**Purpose:** produce a time‑series of clustering results, aligned across windows, plus internal metrics for each window.

In [None]:
results = cc.cluster_over_time(trades, windows,
                               n_clusters=K_SELECTED,
                               size_min=5,
                               max_share=0.20)  # no client >20 % notional

## 5. Cluster stability over time
**Methodology:** compute the *adjusted Rand index* (ARI) between the cluster assignments of adjacent windows.

**Interpretation tips:**
* ARI near **1.0** → clusters are stable.
* ARI near **0** → agreement is no better than chance.
* Dips signal structural shifts in counterparty behaviour.
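
For reference, ARI can be computed from pair counts alone. The sketch below is a pure-Python version of the standard formula, not the code behind `cc.evaluate_stability_over_time`. It assumes the two label lists describe the same counterparties in the same order, so in practice you would first restrict both windows to the counterparties present in each (whether the helper does this is not shown here). It also assumes at least two items.

```python
from collections import Counter
from math import comb

def adjusted_rand(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same items."""
    n = len(labels_a)
    # Pairs of items that share a cluster in both labelings.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs that share a cluster in each labeling taken separately.
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)   # agreement expected by chance
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case: both labelings are trivial
    return (sum_cells - expected) / (max_index - expected)

prev_labels = [0, 0, 1, 1, 2, 2]
renamed = [2, 2, 0, 0, 1, 1]      # same grouping, different cluster ids
scrambled = [0, 1, 0, 1, 0, 1]    # grouping cuts across the previous one
print(adjusted_rand(prev_labels, renamed), adjusted_rand(prev_labels, scrambled))
```

Because ARI depends only on which items are grouped together, renaming cluster ids between windows does not lower the score; only genuine regrouping does.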

In [None]:
stability = cc.evaluate_stability_over_time(results)
stability.head()

In [None]:
fig, ax = plt.subplots(figsize=(10, 4))
sns.lineplot(data=stability, x='window_end', y='adjusted_rand', marker='o', ax=ax)
ax.set_title('Adjusted Rand index (cluster stability)')
ax.set_xlabel('Window end date')
ax.set_ylabel('ARI')
plt.xticks(rotation=45)
plt.tight_layout()

## 6. Feature‑importance drift using XGBoost
**Methodology:** train a simple **XGBoost** multi‑class classifier in each window, then record `gain` feature importances.

**What to watch:** which drivers of cluster membership are trending up/down?
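
One simple way to quantify "trending up/down" is the least-squares slope of each feature's importance against the window index. The sketch below is a hypothetical helper (`trend_slope` is not part of `constrained_clustering`), and the importance values are made up for illustration only; they are not results from this notebook.

```python
def trend_slope(values):
    """Least-squares slope of values against window index 0..n-1 (needs n >= 2)."""
    n = len(values)
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    num = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(values))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den

# Made-up importance series, one value per window.
toy_importance = {
    'tot_ntl': [0.40, 0.35, 0.30, 0.25],      # steadily losing importance
    'trade_count': [0.10, 0.15, 0.20, 0.25],  # steadily gaining importance
}
slopes = {name: trend_slope(vals) for name, vals in toy_importance.items()}
for name, slope in slopes.items():
    print(f"{name}: {slope:+.3f} per window")
```

A positive slope marks a feature that is gaining influence over cluster membership; with real data the same function could be applied to each column of the per-window importance table, for example `fi_df.apply(lambda s: trend_slope(s.tolist()))`.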

In [None]:
import xgboost as xgb

feat_cols = ['avg_pnl', 'std_pnl', 'tot_ntl', 'trade_count', 'avg_participation']
fi_over_time = {c: [] for c in feat_cols}

for res in tqdm(results, desc='Windows (XGB)'):
    X = res.agg.select(feat_cols).to_numpy()
    y = res.labels
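    # Request gain explicitly rather than relying on the library default;
    # feature_importances_ is normalised so the values sum to 1.
    model = xgb.XGBClassifier(max_depth=3, n_estimators=120, learning_rate=0.1,
                              importance_type='gain', verbosity=0)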
    model.fit(X, y)
    for c, imp in zip(feat_cols, model.feature_importances_):
        fi_over_time[c].append(imp)

fi_df = pd.DataFrame(fi_over_time)
fi_df.head()

In [None]:
# --- Stacked area plot (pandas/matplotlib) for top features ---
top_feats = fi_df.mean().sort_values(ascending=False).head(4).index
fig, ax = plt.subplots(figsize=(10, 4))
fi_df[top_feats].plot(kind='area', stacked=True, ax=ax, alpha=0.8)
ax.set_xlabel('Window index')
ax.set_ylabel('Gain importance')
ax.set_title('Feature importance drift (top 4)')
ax.legend(loc='upper right')
plt.tight_layout()

## 7. Intra‑cluster feature distributions (latest window)
**Purpose:** inspect how features are distributed *within* each cluster.

**Tool:** use `cc.plot_feature_distributions`, which internally builds a Seaborn **FacetGrid**.

In [None]:
latest = results[-1]
fig = cc.plot_feature_distributions(latest.agg, latest.labels, kind='hist', bins=40, col_wrap=3)
fig.suptitle('Feature distributions by cluster – latest window', y=1.02)
plt.tight_layout()

## 8. Emerging clusters
**Methodology:** flag clusters whose number of counterparties grows by >50 % relative to the previous window.

**Actionable insight:** these clusters might deserve bespoke spread rules or deeper investigation.

In [None]:
emerging = []
for prev, curr in zip(results[:-1], results[1:]):
    # Pair each cluster's summary with its summary in the next window.
    merged = prev.summary.join(curr.summary, on='cluster', how='inner', suffix='_curr')
    # Cast to float so the subtraction cannot wrap on unsigned count columns.
    prev_n = pl.col('n_counterparties').cast(pl.Float64)
    curr_n = pl.col('n_counterparties_curr').cast(pl.Float64)
    merged = merged.with_columns(((curr_n - prev_n) / prev_n).alias('growth'))
    big = merged.filter(pl.col('growth') > 0.5)  # grew by more than 50 %
    # named=True yields dicts, so columns can be looked up by name.
    for row in big.iter_rows(named=True):
        emerging.append({'window_end': curr.end.date(),
                         'cluster': row['cluster'],
                         'growth_pct': 100 * row['growth']})

pd.DataFrame(emerging)
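
The growth check above covers *emerging* clusters. For the *volatile* half of question 3, one possible measure of member churn is one minus the Jaccard similarity of a cluster's membership in adjacent windows. The sketch below is an assumption-laden illustration: the helper names are hypothetical, cluster ids are assumed to be aligned across windows, and membership is passed as plain `{counterparty_id: cluster_id}` dicts because the attribute that holds counterparty ids on each result is not shown in this notebook.

```python
def member_churn(prev_members, curr_members):
    """1 - Jaccard similarity of two member sets (0 = unchanged, 1 = fully replaced)."""
    union = prev_members | curr_members
    if not union:
        return 0.0
    return 1 - len(prev_members & curr_members) / len(union)

def churn_by_cluster(prev_assign, curr_assign):
    """Both arguments map counterparty id -> cluster id for one window."""
    clusters = set(prev_assign.values()) | set(curr_assign.values())
    churn = {}
    for c in clusters:
        prev_m = {cp for cp, lab in prev_assign.items() if lab == c}
        curr_m = {cp for cp, lab in curr_assign.items() if lab == c}
        churn[c] = member_churn(prev_m, curr_m)
    return churn

# Toy assignments: counterparty 'B' moves from cluster 0 to cluster 1.
prev_assign = {'A': 0, 'B': 0, 'C': 1, 'D': 1}
curr_assign = {'A': 0, 'B': 1, 'C': 1, 'D': 1}
churn = churn_by_cluster(prev_assign, curr_assign)
print(churn)
```

Clusters whose churn stays high over several consecutive windows are candidates for the "volatile" label, even when their size barely changes.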

---
### Wrap‑up
* Re‑run the silhouette section regularly, because cluster structure can drift.
* Use feature‑importance drift to update the feature pipeline (e.g. add/remove metrics).
* Automate the entire workflow on a schedule and push the key plots to your reporting dashboard.

Happy clustering!