<div style="background-color:white;" >
<div style="clear: both; display: table;">
  <div style="float: left; width: 14%; padding: 5px; height:auto">
    <img src="img/TUBraunschweig_CO_200vH_300dpi.jpg" alt="TU_Braunschweig" style="width:100%">
  </div>
  <div style="float: left; width: 28%; padding: 5px; height:auto">
    <img src="img/TU_Clausthal_Logo.png" alt="TU_Clausthal" style="width:100%">
  </div>
  <div style="float: left; width: 25%; padding: 5px; height:auto">
    <img src="img/ostfalia.jpg" alt="Ostfalia" style="width:100%">
  </div>
  <div style="float: left; width: 21%; padding: 5px;">
    <img src="img/niedersachsen_rgb_whitebg.png" alt="Niedersachsen" style="width:100%">
  </div>
  <div style="float: left; width: 9%; padding: 5px;">
    <img src="img/internet_BMBF_gefoerdert_2017_en.jpg" alt="bmbf" style="width:100%">
  </div>
</div>
<div style="text-align:center">
<img src="img/ki4all.jpg" alt="KI4ALL-Logo" width="200"/>
</div>
</div>

# Synthetic Biomedical Data – Lesson 3: Advanced Data Generation
Part of the *Microcredit Artificial Data Generator* module.

➡️ [Back to Lesson 3b: Irrelevant Features - noise distributions](03b_noise_distributions.ipynb)
➡️ [Module README](../README.md)

*Before continuing, please ensure you reviewed the prerequisites and learning goals in Lesson 1.*


# Lesson 3c: Correlated Features (Biological Pathways)

### Recap
Previously, you learned that not all features are informative.
Spurious correlations between noise features and the label can arise by chance, especially as dimensionality increases.
Now we explore what happens when features are redundant because they move together.


### Why this lesson: Correlated features?
In real biomedical data, features are often not independent. Biological processes often involve groups of molecules that move together. Biological examples include:
- Genes in the same pathway may be co-expressed (e.g., Gene A and Gene B always expressed together).
- Proteins in a signaling cascade may be activated together.
- Metabolites in the same biochemical pathway are chemically linked, and their concentrations co-vary.

These dependencies create correlated features in real datasets. Strong correlations can:
- Reduce the unique information each feature provides
- Create instability in model training and feature importance
- Mislead or complicate model training and feature selection

Synthetic data allows us to add and control correlations. By simulating them in synthetic data, we can study their impact on visualization, classification, and feature selection under known ground truth.

### Key terms
- **Feature correlation**: statistical association between features (e.g., Pearson correlation).
- **Cluster / module**: a group of features that co-vary.
- **Co-variation**: features that change together across samples.
- **Equicorrelation**: all features in a cluster have nearly the same pairwise correlation.
- **AR(1) / Toeplitz**: correlation decays with distance in feature index.
- **Redundancy**: multiple features carry nearly the same information.
- **Anchor** (driver): the feature that primarily carries the class signal in a cluster.
- **Proxies**: followers that correlate with the anchor and partially mirror its information.
- **Multicollinearity**: strong feature correlations that make individual coefficients unstable (especially in linear models).
- **Attribution ambiguity (non-identifiability of individual effects)**: when several correlated features predict equally well, importance can be spread or exchanged among them.

### What you'll learn
After completing this notebook, you will be able to:
- Generate pathway-like clusters of features with a tunable correlation (e.g., equicorrelated or AR(1)/Toeplitz structure).
- Compose a dataset by mixing correlated blocks with independent features to reach a target dimensionality p.
- Introduce a modest class-conditional shift so that only a subset of cluster members carries signal.
- Visualize and interpret the empirical correlation matrix.
- Evaluate how correlation affects model performance and selection stability.
- Explain practical implications (redundancy, multicollinearity, non-identifiability of individual effects, proxy features) and recommend mitigation strategies (grouped selection, stability checks).

# Step 1. Code – Imports, Installation/Upgrade

In [None]:
# If needed, install or upgrade the package biomedical-data-generator (uncomment in managed environments) via:
# %pip install -U biomedical-data-generator

In [None]:
# Standard package imports
import nb_imports as nb

# Set plotting style
from nb_setup import apply_style

apply_style()

rng = nb.np.random.default_rng(42)

# Step 2. Generate synthetic data with correlated feature clusters


## 2.1 Equicorrelated Feature Cluster

**Equicorrelated features**: A correlation structure in which all pairwise correlations share (approximately) the same value ρ. There are no "special pairs": all features are equally connected to all others. While rare in nature, this extreme case helps us recognize feature redundancy problems.

For a cluster of k features X₁, X₂, ..., Xₖ, the equicorrelated structure means:

**Correlation Matrix**
```
     X₁   X₂   X₃   ...
X₁ [ 1    ρ    ρ   ... ]
X₂ [ ρ    1    ρ   ... ]
X₃ [ ρ    ρ    1   ... ]
...
```
All pairwise correlations are identical:

- Cor(Xᵢ, Xⱼ) = ρ for all i ≠ j
- Cor(Xᵢ, Xᵢ) = 1 (self-correlation is always 1)
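
The definition above can be checked numerically. The following is a minimal sketch, not the implementation behind `nb.sample_cluster`: the helper names `equicorrelation_matrix` and `sample_equicorrelated` are hypothetical, and Gaussian sampling via a Cholesky factor is an assumption made for illustration. The matrix is only a valid correlation matrix for ρ between -1/(k-1) and 1.

```python
import numpy as np


def equicorrelation_matrix(k: int, rho: float) -> np.ndarray:
    """Return the k x k matrix with 1 on the diagonal and rho elsewhere."""
    return (1.0 - rho) * np.eye(k) + rho * np.ones((k, k))


def sample_equicorrelated(n_samples: int, k: int, rho: float, seed: int = 0) -> np.ndarray:
    """Draw Gaussian samples whose population correlation matrix is equicorrelated."""
    rng = np.random.default_rng(seed)
    chol = np.linalg.cholesky(equicorrelation_matrix(k, rho))
    z = rng.standard_normal((n_samples, k))  # independent standard normals
    return z @ chol.T  # rows now have covariance chol @ chol.T


R = equicorrelation_matrix(4, 0.7)
X_demo = sample_equicorrelated(n_samples=20_000, k=4, rho=0.7, seed=0)
R_hat = np.corrcoef(X_demo, rowvar=False)
off_diag = R_hat[~np.eye(4, dtype=bool)]  # all entries except the diagonal
print(R)
print("mean empirical off-diagonal correlation:", off_diag.mean().round(3))
```

With many samples the empirical off-diagonal entries settle close to ρ; with only a few dozen samples they scatter noticeably around it.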

### Why This Matters for Machine Learning
- High ρ → features are nearly redundant (multicollinearity)
- Feature selection becomes non-identifiable: any cluster member can serve as a proxy for the others
- Including 10 highly correlated texture features doesn't give you "10× more information"; it inflates feature importance scores and destabilizes selection.
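
One way to make the "not 10× more information" point concrete is to look at the eigenvalues of the equicorrelation matrix. This is a sketch under stated assumptions: the participation ratio used here as an "effective number of features" is one of several possible summaries and is not part of the lesson or of the `biomedical-data-generator` package.

```python
import numpy as np

k, rho = 10, 0.9
R = (1.0 - rho) * np.eye(k) + rho * np.ones((k, k))

eigvals = np.linalg.eigvalsh(R)  # eigenvalues in ascending order
# Closed form: one eigenvalue equals 1 + (k - 1) * rho, the other k - 1 equal 1 - rho.
largest_share = eigvals[-1] / eigvals.sum()
# Participation ratio (illustrative choice): (sum of eigenvalues)^2 / sum of squared eigenvalues.
participation_ratio = eigvals.sum() ** 2 / (eigvals**2).sum()

print("largest eigenvalue:", round(float(eigvals[-1]), 3))
print("share of total variance on one direction:", round(float(largest_share), 3))
print("effective number of features:", round(float(participation_ratio), 2))
```

For k = 10 and ρ = 0.9 the largest eigenvalue is 9.1, so 91% of the total variance lies along a single shared direction, and the participation ratio is about 1.2 rather than 10.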

### 🧬 Where Do We Find Equicorrelated Structures?
Imagine a group of biomarkers that all "move together":
- If one goes up, all others tend to go up too
- If one goes down, all others follow
- Every pair has about the same correlation ρ (rho)

#### Example 1: Transcription Factor Target Genes
Genes regulated by the same transcription factor often correlate
highly because:
- They respond to the same upstream signal via shared regulatory elements
- Coordinated expression serves a common biological goal

**Clinical implication:** Highly correlated features often have
similar predictive value, so selecting "the best" single feature
becomes somewhat arbitrary without biological context.

#### Example 2: Imaging-Derived Features
Radiomics features from the same tissue region:
- Texture features (homogeneity, entropy, contrast) derived from
  overlapping pixel sets
- Mathematically related through shared image statistics
- High redundancy not obvious from feature names

**Research challenge:** Different texture features may be selected
across folds despite capturing similar information, which complicates
biological interpretation.

### Generate and Visualize

In [None]:
# Generate equicorrelated cluster
equicorrelated_cluster = nb.sample_cluster(
    n_samples=30,
    n_features=10,
    rng=rng,
    structure="equicorrelated",
    rho=0.9,
)
# Compute the empirical correlation matrix from the sampled cluster
correlation_matrix, labels = nb.compute_correlation_matrix(nb.pd.DataFrame(equicorrelated_cluster))

# Plot the heatmap
fig, ax = nb.plot_correlation_matrix(
    correlation_matrix=correlation_matrix,
    labels=labels,
    title="Equicorrelated Cluster (ρ=0.9)",
    annot=True,
)

## 2.2 Toeplitz/AR(1) Feature Cluster: When Distance Matters

**Toeplitz/AR(1) features**: A correlation structure where correlation decays exponentially with distance (ρ^|i-j|). Features have a natural ordering (sequence, time, position), and only nearby features share strong redundancy. This pattern is common in real biological data (gene expression cascades, chromosomal neighborhoods, metabolic pathways), which makes it essential for realistic synthetic benchmarks.

### The Correlation Pattern

When features have positional or temporal ordering, correlation typically decays exponentially with distance:

$$\text{Cor}(X_i, X_j) = \rho^{|i-j|}$$

This creates a localized "neighborhood" structure:
- **Immediate neighbors** (distance 1): correlation = ρ
- **Next-door neighbors** (distance 2): correlation = ρ²
- **Distant features** (distance k): correlation = ρ^k

For a cluster of k features X₁, X₂, ..., Xₖ, the Toeplitz/AR(1) structure means:

**Correlation Matrix R**
```
     X₁    X₂    X₃    X₄   ...
X₁ [ 1     ρ     ρ²    ρ³  ... ]
X₂ [ ρ     1     ρ     ρ²  ... ]
X₃ [ ρ²    ρ     1     ρ   ... ]
X₄ [ ρ³    ρ²    ρ     1   ... ]
...
```
**Distance-dependent correlation:**
- Cor(Xᵢ, Xᵢ₊₁) = ρ (neighbors)
- Cor(Xᵢ, Xᵢ₊₂) = ρ² (distance 2)
- Cor(Xᵢ, Xᵢ₊ₖ) = ρ^k (distance k)

**Example:** With ρ = 0.7:
- Cor(X₁, X₂) = 0.70 (strong)
- Cor(X₁, X₃) = 0.49 (moderate)
- Cor(X₁, X₅) = 0.24 (weak)
- Cor(X₁, X₁₀) = 0.04 (negligible)
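
The ρ^|i-j| pattern is easy to build directly with NumPy. This is a small sketch for checking values by hand; the helper name `ar1_correlation_matrix` is hypothetical and not part of the lesson's `nb` helpers.

```python
import numpy as np


def ar1_correlation_matrix(k: int, rho: float) -> np.ndarray:
    """Return the k x k matrix with entries rho ** |i - j|."""
    idx = np.arange(k)
    return rho ** np.abs(idx[:, None] - idx[None, :])


R = ar1_correlation_matrix(10, 0.7)
# Row 0 holds Cor(X1, Xj) for j = 1..10 (Python indices are zero-based).
print(np.round(R[0], 2))
```

Each diagonal of `R` is constant, which is exactly the Toeplitz property, and the first row shows how quickly the correlation with X₁ fades as the distance grows.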

In [None]:
# Generate Toeplitz/AR(1) cluster (mimics sequential regulatory cascade)
toeplitz_cluster = nb.sample_cluster(
    n_samples=30,
    n_features=10,
    rng=rng,
    structure="toeplitz",  # Changed from "equicorrelated"
    rho=0.7,  # Lag-1 correlation; correlation decays as ρ^k with distance
)
correlation_matrix, labels = nb.compute_correlation_matrix(nb.pd.DataFrame(toeplitz_cluster))

# Plot the heatmap
fig, ax = nb.plot_correlation_matrix(
    correlation_matrix=correlation_matrix,
    labels=labels,
    title="Toeplitz/AR(1) Cluster (ρ=0.7)",
    annot=True,
)


Correlation decays exponentially. Features at distance k have much weaker correlation than neighbors. This creates a "neighborhood" structure where redundancy is localized. Features share information primarily with their immediate neighbors, not the entire cluster.

### Why This Matters for Machine Learning

1. **Localized feature selection instability**: Correlated neighbors are interchangeable
2. **Interpretation challenges**: Which gene in a chromosomal region is "causal" vs. just a correlated neighbor?
3. **Overfitting via redundant neighborhoods**: High-dimensional data with multiple local clusters creates many correlated pathways
4. **Realistic benchmarks**: Real biomarker panels often have this structure

**Key difference from equicorrelated features:**
- Redundancy is local, not global → distant features (e.g., Feature 1 and Feature 100) add nearly independent information
- Multicollinearity is still present, but localized to neighborhoods → not every feature competes with every other
- Selection stability improves between neighborhoods → choosing one feature from each neighborhood is more reproducible
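
The contrast between global and local redundancy can be counted directly. The sketch below builds both population correlation matrices for 100 features and counts, per feature, how many other features are "strongly" correlated with it. The cutoff of 0.5 and the helper name `n_strong_partners` are arbitrary choices made for illustration.

```python
import numpy as np

p, rho, cutoff = 100, 0.7, 0.5
idx = np.arange(p)

R_equi = (1.0 - rho) * np.eye(p) + rho * np.ones((p, p))
R_toep = rho ** np.abs(idx[:, None] - idx[None, :])


def n_strong_partners(R: np.ndarray, cutoff: float) -> np.ndarray:
    """For each feature, count the other features with correlation >= cutoff."""
    strong = R >= cutoff
    np.fill_diagonal(strong, False)  # ignore self-correlation
    return strong.sum(axis=1)


print("equicorrelated:", n_strong_partners(R_equi, cutoff).max())
print("Toeplitz/AR(1):", n_strong_partners(R_toep, cutoff).max())
print("Cor(feature 1, feature 100) under Toeplitz:", R_toep[0, -1])
```

In the equicorrelated block every feature competes with all 99 others, whereas in the Toeplitz block with ρ = 0.7 only the direct neighbors pass the 0.5 cutoff (ρ² = 0.49 already falls below it).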

### 🧬 Where Do We Find the Toeplitz/AR(1) Pattern?

Biomarker groups with inherent positional or temporal structure:

- **Gene expression cascades**: Upstream genes regulate downstream genes
- **Chromosomal position**: Nearby genes share regulatory elements
- **Metabolic pathways**: Sequential enzymatic steps
- **Temporal measurements**: Hormone levels throughout a circadian cycle
- **Spatial transcriptomics**: Gene expression from adjacent tissue regions

In each case, correlation decays with distance: nearby features are strongly correlated, distant ones are nearly independent.

### Terminology: Toeplitz vs. AR(1)

We use the two terms interchangeably, depending on the domain context. In this lesson both describe the same correlation matrix, ρ^|i-j| (strictly speaking, the AR(1) correlation matrix is one particular Toeplitz matrix).
- "near/far" relationship → **Toeplitz**
- "before/after" relationship → **AR(1)**

#### **Toeplitz Structure** (Linear Algebra)
Toeplitz describes the mathematical form of the correlation matrix: every diagonal is constant.
Features have a **"near/far"** relationship without directionality, as in spatial relationships (chromosomal position, anatomical proximity).

#### **AR(1) Process** (Time Series)
Describes a data-generating mechanism where each value depends on its predecessor:

$$X_t = \rho \cdot X_{t-1} + \varepsilon_t$$

This autoregressive process naturally produces Toeplitz correlation.
Features have a **"before/after"** relationship with directionality. Examples are temporal dynamics (time series, cascades) or sequential processes (metabolic pathways).
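
To see that the recursion really produces the ρ^|i-j| pattern, it can be simulated across feature positions. This is a minimal sketch under two assumptions that the formula above leaves open: the first feature is standard normal, and the innovations ε have variance 1 - ρ², so that every feature keeps unit variance. The helper name `simulate_ar1_features` is hypothetical.

```python
import numpy as np


def simulate_ar1_features(n_samples: int, k: int, rho: float, seed: int = 0) -> np.ndarray:
    """Generate k ordered features per sample with X_t = rho * X_{t-1} + eps_t."""
    rng = np.random.default_rng(seed)
    X = np.empty((n_samples, k))
    X[:, 0] = rng.standard_normal(n_samples)  # start in the stationary distribution
    innovation_sd = np.sqrt(1.0 - rho**2)  # assumption: keeps every feature at unit variance
    for t in range(1, k):
        eps = innovation_sd * rng.standard_normal(n_samples)
        X[:, t] = rho * X[:, t - 1] + eps
    return X


X_ar1 = simulate_ar1_features(n_samples=20_000, k=6, rho=0.7, seed=1)
R_hat = np.corrcoef(X_ar1, rowvar=False)
for lag in range(1, 4):
    print(f"lag {lag}: empirical {R_hat[0, lag]:.3f} vs. rho**lag {0.7**lag:.3f}")
```

The empirical correlations at lags 1 to 3 should track 0.7, 0.49, and 0.343 closely when the sample size is this large.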

Generating Toeplitz/AR(1) clusters lets us explore how correlation strength (ρ) and cluster size affect model behavior and feature selection stability. For example, when many features move together, models may attribute importance to several of them, even if only one drives the class difference. Correlation thus obscures which features are truly causal.

# Step 3. Generate a Dataset with Two Correlated Clusters and High-Dimensional Noise

From isolated clusters to a full dataset: we now use the `biomedical-data-generator` to generate a complete p≫n dataset (1214 features, 120 samples) in which two small correlated pathways carry class information while being drowned in high-dimensional noise. This reflects the challenge biomedical researchers face: finding signal among overwhelming irrelevant variation.

In each of the two pathways, only one feature (the "anchor") truly differs between classes. The other features are correlated "proxies" that follow their anchor without their own class effect. This reveals how correlation makes many features appear important even when only a few are causal.

#### Generator Settings
* **Samples and classes**

  >`n_samples=120`, balanced `class_counts={0: 60, 1: 60}` for stable comparisons.

* **Correlated clusters** (`corr_clusters`)
  >- **Pathway A** (6 features): Equicorrelated with `rho=0.7`; all pairs share the same correlation (global redundancy).
  >- **Pathway B** (8 features): Toeplitz with `rho=0.6`; correlation decays with distance (Cor(X_i, X_j) ≈ ρ^|i-j|), which mimics local dependencies.


* **Anchors vs. proxies**

  >`anchor_role="informative"` makes the first feature in each pathway the class-informative anchor. All other features in that pathway are correlated proxies without their own class shift.

* **Effect sizes**

  >`anchor_effect_size` sets the shift of each anchor: `"large"` for Pathway A and `"medium"` for Pathway B. `class_sep=1.2` scales the overall task difficulty: high enough to require modeling, low enough to avoid trivial perfect separation.

* **Informative feature count**

  >`n_informative=2`: exactly the two pathway anchors. No additional standalone informative features.

* **High-dimensional noise**

  >`n_noise=1200` adds independent features unrelated to class, creating realistic p≫n conditions (1214 features vs. 120 samples).

* **Pathway independence**

  >`corr_between=0.0` keeps the two pathways uncorrelated, creating visually distinct blocks in correlation heatmaps.

* **Reproducibility**

  >`random_state=42`. Feature names use `feature_naming="prefixed"` → expect `i1`, `i2` (anchors), `corr1_1`, `corr1_2`, ..., `corr2_1`, ..., `n1`, `n2`, ... (noise).

In [None]:
cfg = nb.DatasetConfig(
    n_samples=120,
    n_classes=2,
    class_counts={0: 60, 1: 60},  # required by the generator
    # Correlated clusters (anchors are the only informative features in each cluster)
    corr_clusters=[
        nb.CorrClusterConfig(
            n_cluster_features=6,
            structure="equicorrelated",
            rho=0.7,  # rho=0.7 is typical for co-regulated genes under the same transcription factor
            anchor_role="informative",  # first column is the informative "anchor"
            anchor_effect_size="large",  # shift strength for the anchor
            anchor_class=0,  # one-vs-rest effect (defaults to 0 if omitted)
            label="Pathway A (equicorr)",
        ),
        nb.CorrClusterConfig(
            n_cluster_features=8,
            structure="toeplitz",
            rho=0.6,
            anchor_role="informative",
            anchor_effect_size="medium",
            anchor_class=1,  # anchor targets class 1 here
            label="Pathway B (toeplitz)",
        ),
    ],
    # Important: n_informative must include the number of informative anchors
    n_informative=2,  # exactly 2 anchors above → no extra free informative features
    n_pseudo=0,  # no additional free pseudo features; proxies come from clusters
    n_noise=1200,  # many independent noise features to reach p >> n
    noise_distribution=nb.NoiseDistribution.normal,
    noise_scale=1.0,
    class_sep=1.2,  # modest separation so redundancy/correlation still matters
    anchor_mode="equalized",
    corr_between=0.0,  # keep clusters independent for clarity
    feature_naming="prefixed",  # i1, corr1_2, ..., n1, n2, ...
    random_state=42,
)

X, y, meta = nb.generate_dataset(cfg, return_dataframe=True)
X.head()

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# 1) Select only the columns with "corr" in their name
corr_cols = [c for c in X.columns if "corr" in c]
if not corr_cols:
    raise ValueError("No columns containing 'corr' found")

X_sub = X.loc[:, corr_cols]


# 2) Sort order: first by cluster ID, then by index within the cluster (anchor=1 first)
def parse_feature_name(feature_name: str):
    """Extract (cluster_id, feature_idx) from 'corrX_Y' or 'corrX_anchor'."""
    parts = feature_name.split("_")
    if len(parts) == 2 and parts[0].startswith("corr"):
        cluster_str = parts[0][4:]  # e.g. "1" from "corr1"
        if cluster_str.isdigit():
            cluster_id = int(cluster_str)
            # Treat "anchor" as index 1 (sorted first)
            if parts[1] == "anchor":
                return (cluster_id, 1)
            elif parts[1].isdigit():
                feature_idx = int(parts[1])
                return (cluster_id, feature_idx)
    return (999, 999)  # Fallback for invalid names


# Sort by (cluster_id, feature_idx) → the anchor (_1 or _anchor) comes first
ordered_cols = sorted(corr_cols, key=parse_feature_name)
X_ordered = X_sub[ordered_cols]

# 3) Compute the correlation matrix
corr = X_ordered.corr(method="spearman")

# 4) Draw the heatmap (Blues_r is the reversed "Blues" map: high values are light, low values dark blue)
annot = corr.shape[0] <= 25
fig, ax = plt.subplots(figsize=(10, 10))
sns.heatmap(
    corr,
    ax=ax,
    cmap="Blues_r",
    vmin=-1.0,
    vmax=1.0,
    square=True,
    xticklabels=True,
    yticklabels=True,
    annot=annot,
    fmt=".2f" if annot else None,
    cbar_kws={"label": "Correlation"},
)
ax.set_title("All correlation clusters (sorted: cluster ID → anchor → 2 → 3 → ...)")
plt.xticks(rotation=90)
plt.yticks(rotation=0)


# 5) Draw separator lines between clusters
def get_cluster_id(feature_name: str):
    """Extract the cluster ID from 'corrX_Y' → X."""
    parts = feature_name.split("_")
    if len(parts) == 2 and parts[0].startswith("corr"):
        cluster_str = parts[0][4:]
        if cluster_str.isdigit():
            return int(cluster_str)
    return None


cluster_ids = [get_cluster_id(f) for f in ordered_cols]
boundaries = [i for i in range(1, len(cluster_ids)) if cluster_ids[i] != cluster_ids[i - 1]]
for b in boundaries:
    ax.hlines(b, *ax.get_xlim(), colors="black", linewidth=1.5)
    ax.vlines(b, *ax.get_ylim(), colors="black", linewidth=1.5)

plt.tight_layout()
plt.show()

In [None]:
# Helper: expected mean off-diagonal correlation of a Toeplitz block (rho = lag-1 correlation)
def expected_mean_offdiag_toeplitz(p: int, rho: float) -> float:
    num = sum((p - k) * (rho**k) for k in range(1, p))
    den = p * (p - 1) / 2
    return num / den


# 1) Equicorrelated block (cluster 1): tolerance mode to hit a mean of ~rho
c1 = cfg.corr_clusters[0]  # p=6, rho=0.7, equicorrelated
seed1, meta1 = nb.find_seed_for_correlation_from_config(
    c1,
    n_samples=cfg.n_samples,  # here: 120
    tolerance=0.03,  # ±0.03 around the target mean
    start_seed=0,
    max_tries=300,
    return_best_on_fail=True,
)

# 2) Toeplitz block (cluster 2): threshold mode with the expected mean
c2 = cfg.corr_clusters[1]  # p=8, rho=0.6, toeplitz
target_mean_tpx = expected_mean_offdiag_toeplitz(c2.n_cluster_features, c2.rho)  # ~0.30
seed2, meta2 = nb.find_seed_for_correlation_from_config(
    c2,
    n_samples=cfg.n_samples,
    tolerance=None,  # <- important: activates threshold mode
    threshold=target_mean_tpx,  # e.g. ~0.30
    op=">=",  # mean_offdiag >= threshold
    start_seed=0,
    max_tries=500,
    return_best_on_fail=True,
)

# Write the seeds into the cluster configs (so the generator uses exactly these seeds)
cfg.corr_clusters[0].random_state = seed1
cfg.corr_clusters[1].random_state = seed2

# Now generate the dataset
X, y, meta = nb.generate_dataset(cfg, return_dataframe=True)
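
The helper `expected_mean_offdiag_toeplitz` in the cell above relies on a counting argument: in a p × p Toeplitz matrix, exactly p - k feature pairs sit at distance k. As a sanity check, the sketch below compares that closed form with a brute-force average over the upper triangle of the ρ^|i-j| matrix. Both function names here are hypothetical and independent of the `nb` helpers.

```python
import numpy as np


def mean_offdiag_direct(p: int, rho: float) -> float:
    """Average the upper-triangle entries of the rho ** |i - j| matrix."""
    idx = np.arange(p)
    R = rho ** np.abs(idx[:, None] - idx[None, :])
    upper = R[np.triu_indices(p, k=1)]  # each pair (i < j) exactly once
    return float(upper.mean())


def mean_offdiag_closed_form(p: int, rho: float) -> float:
    """Same quantity via the lag counts: (p - k) pairs sit at distance k."""
    num = sum((p - k) * rho**k for k in range(1, p))
    return num / (p * (p - 1) / 2)


print(round(mean_offdiag_direct(8, 0.6), 4))
print(round(mean_offdiag_closed_form(8, 0.6), 4))
```

Both routes give the population value (about 0.30 for p = 8, ρ = 0.6); the mean of the empirical correlations in a finite sample fluctuates around it, which is why the seed search above is useful.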

#### Expected outcome

* 1214 features in total: 2 informative anchors (one per pathway), 12 correlated proxies (5 in Pathway A, 7 in Pathway B), and 1200 noise features.
* A correlation heatmap shows two clear blocks. Within each block, anchor and proxies move together, but only the anchor is causally linked to the label.
* Many models will rank some proxies highly due to correlation, illustrating redundancy, multicollinearity, and attribution ambiguity (credit can be split or swapped among correlated features).

## 3.1 Plot the correlation heatmap

In [None]:
# Example: visualize a single cluster by ID via meta
cluster_id = 1  # change to any valid cluster id in your meta
_ = nb.plot_correlation_matrix_for_cluster(
    X[y == 0],
    meta,
    cluster_id=cluster_id,
    correlation_method="spearman",  # "pearson" | "spearman" | "kendall"
    anchor_first=True,
    natural_sort_rest=True,
    title=None,  # auto-title "Cluster {id} - Spearman correlation"
    vmin=-1.0,
    vmax=1.0,
    annot=True,  # numeric labels if matrix is small (<=25×25)
    fmt=".2f",
    show=True,
)

In [None]:
fig, axes = nb.plt.subplots(1, 2, figsize=(16, 6))
nb.plot_correlation_matrix_for_cluster(
    X[y == 0],
    meta,
    cluster_id=1,
    title="Cluster 1 – class 0",
    ax=axes[0],
    annot=True,
)
nb.plot_correlation_matrix_for_cluster(
    X[y == 1],
    meta,
    cluster_id=1,
    title="Cluster 1 – class 1",
    ax=axes[1],
    annot=True,
)
nb.plt.tight_layout()

In [None]:
rng = nb.np.random.default_rng(123)

X_block = nb.sample_cluster(n_samples=30, n_features=6, rng=rng, structure="equicorrelated", rho=0.90)
dfb = nb.pd.DataFrame(X_block, columns=[f"corr1_{i}" for i in range(6)])

C = dfb.corr(method="pearson").to_numpy()
print("min / max / smallest column std:", dfb.min().min(), dfb.max().max(), dfb.std(ddof=1).min())
assert nb.np.isfinite(C).all(), "Correlation matrix contains NaN or inf; check the sampler output."

nb.plt.figure(figsize=(4, 4))
nb.plt.imshow(C, vmin=0, vmax=1, origin="lower", aspect="equal")
nb.plt.title("Equicorrelated ρ=0.90 (pooled, Pearson)")
nb.plt.colorbar()
nb.plt.xticks(range(6), dfb.columns, rotation=90)
nb.plt.yticks(range(6), dfb.columns)
nb.plt.show()

In [None]:
_ = nb.plot_correlation_matrix_for_cluster(
    dfb,
    meta,
    cluster_id=1,
    correlation_method="pearson",  # "pearson" | "spearman" | "kendall"
    anchor_first=True,
    natural_sort_rest=True,
    title=None,  # auto-title based on cluster id and correlation method
    vmin=-1.0,
    vmax=1.0,
    annot=True,  # numeric labels if matrix is small (<=25×25)
    fmt=".2f",
    show=True,
)


## Typical pitfalls you'll observe
- **Redundant picks**: selectors choose many proxies of one anchor → low diversity.
- **Unstable rankings**: correlated features swap order across resamples/CV folds.
- **Masked effects**: in linear models, coefficients shrink or flip under strong collinearity.
- **Overconfident metrics**: naive CV may look good even when the model relies on one cluster.
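
The "unstable rankings" and "masked effects" pitfalls can be reproduced in a few lines. The sketch below is a simplified regression toy example, not the lesson's classification setup: two features with correlation ρ, only the first one drives the outcome, and ordinary least squares is refit on many simulated datasets. The function name `coef_spread` and all parameter values are illustrative assumptions.

```python
import numpy as np


def coef_spread(rho: float, n_samples: int = 60, n_repeats: int = 500, seed: int = 0) -> float:
    """Std. dev. of the first OLS coefficient over repeated simulated datasets."""
    rng = np.random.default_rng(seed)
    chol = np.linalg.cholesky(np.array([[1.0, rho], [rho, 1.0]]))
    coefs = []
    for _ in range(n_repeats):
        X = rng.standard_normal((n_samples, 2)) @ chol.T  # two features with correlation rho
        y = 1.0 * X[:, 0] + rng.standard_normal(n_samples)  # only feature 0 drives y
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        coefs.append(beta[0])
    return float(np.std(coefs))


spread_independent = coef_spread(rho=0.0)
spread_correlated = coef_spread(rho=0.95)
print(f"coefficient std, rho=0.00: {spread_independent:.3f}")
print(f"coefficient std, rho=0.95: {spread_correlated:.3f}")
```

The true coefficient is the same in both settings, yet its estimate varies several times more when the second feature is a near-copy of the first. This is the variance inflation behind unstable importance rankings.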

## Quick takeaways
- Correlated blocks are **biologically realistic** and **statistically tricky**.
- Good evaluations use **group-aware** reasoning (e.g., prefer diverse features across clusters).
- Visual checks (heatmaps/pair plots) help detect **modules** before modeling.

---

**Next:** Generate a small dataset with 2–3 correlated clusters, plot the correlation heatmap, and compare how a linear model vs. a tree ensemble behaves on these features.



### Goal
Simulate correlated features and visualize their relationships with a correlation matrix.

**TODO**: Generate features based on a shared underlying signal (e.g., a "pathway activity" variable + small random noise).
Visualize the results with a heatmap (`sns.heatmap`) or pairplot (`sns.pairplot`).
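
One possible starting point for this task is a latent-factor construction: every feature is a weighted sum of one shared "pathway activity" variable and its own private noise. The weights below are an assumption chosen so that each feature has unit variance and every pair has correlation ρ (valid for 0 ≤ ρ ≤ 1); the function name `shared_signal_cluster` is hypothetical.

```python
import numpy as np


def shared_signal_cluster(n_samples: int, k: int, rho: float, seed: int = 0) -> np.ndarray:
    """Build k features as sqrt(rho) * shared activity + sqrt(1 - rho) * private noise."""
    rng = np.random.default_rng(seed)
    activity = rng.standard_normal((n_samples, 1))  # latent "pathway activity"
    private = rng.standard_normal((n_samples, k))  # feature-specific noise
    return np.sqrt(rho) * activity + np.sqrt(1.0 - rho) * private


X_pathway = shared_signal_cluster(n_samples=20_000, k=5, rho=0.8, seed=0)
R_hat = np.corrcoef(X_pathway, rowvar=False)
print(np.round(R_hat, 2))
```

The resulting matrix `R_hat` can be passed to `sns.heatmap` for the visualization step. Shrinking the private-noise share (larger ρ) makes the block more redundant.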



---

## Next Steps
Proceed to **Lesson 3c: Hidden Subgroups**.