## Glossary — Non-Causal Variation

**Batch effect**

>Systematic **non-biological** differences between groups (e.g., site, device, plate, day, operator).
>
>*Mechanics:* group-specific mean/scale shifts or covariance changes in (X).
>
>*Note:* Exists regardless of the target (Y); not a confounder by itself.

**Pseudo-class**

>A **synthetic group label** introduced to create subgroups (e.g., “site”, “instrument”).
It has an effect only if you assign it shifts or variance changes, and it can be used to simulate batch structure.
It becomes a confounder *only* when associated with (Y).

**Confounder (Z)**

>A variable that **influences both** features and target: (Z→X) and (Z→Y).
It creates spurious associations between (X) and (Y).
A batch/pseudo-class **is a confounder** if it is unequally distributed across classes (i.e., correlated with (Y)).

**Proxy feature**

>A feature carrying information about a subgroup (batch/pseudo-class), enabling a model to **infer group membership** and “cheat” on the target.

**Spurious correlation**

>Association driven by **confounding or sampling artifacts**, not by the underlying biology/causal mechanism.

**Group-aware cross-validation**

>A splitting strategy that **keeps each group intact**, i.e., entirely in either the training or the test fold (e.g., `GroupKFold`, `LeaveOneGroupOut`).
It prevents leakage of subgroup signals and is often essential with batch/patient/site effects.

**Data leakage (across groups)**

>When train/test folds share samples from the **same group**, letting models exploit group identity rather than biology.
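One way to make group leakage concrete is to check, fold by fold, whether any group label occurs on both sides of a split. The sketch below assumes scikit-learn and uses made-up data (10 groups of 20 samples with placeholder features); `shared_groups` is a hypothetical helper written for this illustration, not a library function.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, KFold

def shared_groups(splits, groups):
    """For each (train, test) split, return the groups present on both sides."""
    return [set(groups[tr]) & set(groups[te]) for tr, te in splits]

# Illustrative data: 10 groups with 20 samples each (features are placeholders).
groups = np.repeat(np.arange(10), 20)
X = np.zeros((groups.size, 1))

random_splits = KFold(n_splits=5, shuffle=True, random_state=0).split(X)
grouped_splits = GroupKFold(n_splits=5).split(X, groups=groups)

random_overlap = shared_groups(random_splits, groups)
grouped_overlap = shared_groups(grouped_splits, groups)

print([len(s) for s in random_overlap])   # non-zero counts: groups leak across folds
print([len(s) for s in grouped_overlap])  # all zeros: each group stays on one side
```

If the first list contains non-zero counts for your own split, a model can in principle recognise group identity in the test fold.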


### Quick diagnostics

* **CV sanity check:** Large drop from `KFold` → `GroupKFold`/`LeaveOneGroupOut` ⇒ subgroup reliance.
* **Per-group metrics:** Performance or calibration varies strongly by group.
* **Predict-the-group test:** High AUC for “group from (X)” ⇒ strong batch signature.
* **Attributions:** Importances/SHAP dominated by features tied to group proxies.
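The predict-the-group test from the list above can be sketched as follows. This is a minimal example under stated assumptions: scikit-learn is available, the group label is binary, and the data are simulated here with an arbitrary 2 SD shift on one feature. `group_auc` is a hypothetical helper name.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

def group_auc(X, group, seed=0):
    """Cross-validated ROC AUC for predicting a binary group label from X."""
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, X, group, cv=cv, scoring="roc_auc").mean()

rng = np.random.default_rng(0)
n = 400
group = rng.integers(0, 2, size=n)   # hypothetical batch label
X_clean = rng.normal(size=(n, 5))    # no batch signature
X_batch = X_clean.copy()
X_batch[:, 0] += 2.0 * group         # batch 1 shifts feature 0 by 2 SD

auc_clean = group_auc(X_clean, group)
auc_batch = group_auc(X_batch, group)
print("no batch effect:  ", round(auc_clean, 2))
print("with batch effect:", round(auc_batch, 2))
```

An AUC near 0.5 suggests the features carry little group information; a value close to 1 indicates a strong batch signature that a downstream model could exploit.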

### Mitigations (minimal set)

* **Balanced design:** Distribute classes evenly across groups.
* **Group-aware splits:** Use `GroupKFold` or `LeaveOneGroupOut`.
* **Controls:** Include group as covariate, residualize, or apply batch correction (e.g., ComBat) **without leaking (Y)**.
* **Feature hygiene:** Remove obvious group identifiers if appropriate.

**Edge note:** Balanced batches remove **confounding** (Z↔Y), but the **batch effect** (Z→X) remains and can still harm generalization to unseen batches; an evaluation that ignores grouping will not reveal this.
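Whether a design is balanced can be checked directly from the group and class labels. The sketch below builds the group-by-class contingency table and summarises the association with Cramér's V; both helper names are hypothetical, and the labels are toy data chosen to show the two extremes.

```python
import numpy as np

def group_class_table(group, y):
    """Contingency table of counts: rows are groups, columns are classes."""
    _, gi = np.unique(group, return_inverse=True)
    _, ci = np.unique(y, return_inverse=True)
    table = np.zeros((gi.max() + 1, ci.max() + 1), dtype=int)
    np.add.at(table, (gi, ci), 1)
    return table

def cramers_v(table):
    """Cramér's V: 0 = no group/class association in the sample, 1 = fully aligned."""
    table = np.asarray(table, dtype=float)
    n = table.sum()
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    chi2 = ((table - expected) ** 2 / expected).sum()
    k = min(table.shape) - 1
    return float(np.sqrt(chi2 / (n * k))) if k > 0 else 0.0

y = np.array([0, 1] * 50)
balanced = np.array([0, 0, 1, 1] * 25)   # each group is half class 0, half class 1
aligned = y.copy()                       # group label identical to class label

print(cramers_v(group_class_table(balanced, y)))  # 0.0
print(cramers_v(group_class_table(aligned, y)))   # 1.0
```

A value of 0 only rules out confounding in the sampled design; as the edge note says, the batch effect on the features is still there.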

## Simulating Batch Effects (Non-Causal Variation)

**Why this matters**

In real biomedical datasets, variation often arises not only from biology but also from *technical* or *organizational* factors — such as sequencing batches, scanner type, or measurement site.
These **batch effects** can mimic true biological signal and mislead models if not handled properly.

### What a batch effect is — statistically

A batch effect is just a **group-specific shift** or **scaling difference** in the features.
Formally, if $X$ are your features and $b$ denotes the batch label:

$$
X_{i, \text{batch}=b} = X_i + \Delta_b + \varepsilon_i
$$

Each batch $b$ has its own offset $\Delta_b$, but the underlying biology is unchanged.
The generator already supports this mechanism through *group-wise mean shifts* or *pseudo-classes* — no new function is required.
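As a standalone illustration of the formula (not the generator's own API), the additive shift can be written in a few lines of NumPy. The function name `add_batch_shift`, the offsets, and the sample sizes are all assumptions made for this sketch.

```python
import numpy as np

def add_batch_shift(X, batch, deltas, noise_sd=0.0, seed=0):
    """Return X + Delta_b + eps, where b is each sample's batch.

    X: (n_samples, n_features); batch: (n_samples,) integer labels;
    deltas: (n_batches, n_features), one offset row per batch.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    deltas = np.asarray(deltas, dtype=float)
    eps = rng.normal(0.0, noise_sd, size=X.shape)  # all zeros when noise_sd == 0
    return X + deltas[np.asarray(batch)] + eps

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))            # stand-in for the "biological" features
batch = rng.integers(0, 2, size=200)     # two batches, assigned at random
deltas = np.array([[0.0, 0.0, 0.0],      # batch 0: no offset
                   [1.5, 0.0, -0.5]])    # batch 1: shifts features 0 and 2
X_obs = add_batch_shift(X, batch, deltas)

# The difference in per-batch means approximates Delta_1 - Delta_0.
print(X_obs[batch == 1].mean(axis=0) - X_obs[batch == 0].mean(axis=0))
```

Because `batch` is drawn independently of any class label here, this is a pure batch effect without confounding.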

### Semantics vs. mechanics

* The **mechanics** (mean shift, scaling, correlation) are identical to class-conditional shifts used for informative features.
* The **semantics** differ:

  * For informative features, the shift reflects a *biological effect* (e.g., disease vs. control).
  * For batch effects, the shift reflects a *non-biological factor* (e.g., site, device, technician).

By assigning the shift to a variable like `meta.batch_id` instead of the class label, we reinterpret the same statistical operation as a *non-causal source of variation*.

### Why it matters for model evaluation

If batches are unevenly distributed across classes (e.g., all class 0 samples from site A, all class 1 from site B),
a model can achieve high accuracy by learning **batch identity** rather than biology.
Under a random (shuffled) `KFold`, this reliance stays hidden because every batch appears in both training and test folds;
under `GroupKFold` or `LeaveOneGroupOut`, performance collapses, revealing the confounding.
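The contrast between the two splitting strategies can be reproduced with a small simulation. The sketch below assumes scikit-learn and deliberately uses many sites (40, each containing a single class) so that every training fold still sees both classes. The features contain site offsets and noise only, with no biological signal. All sizes, scales, and the choice of a k-nearest-neighbours classifier are illustrative.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
n_sites, per_site, n_features = 40, 20, 10

site = np.repeat(np.arange(n_sites), per_site)
site_class = np.arange(n_sites) % 2      # every site contains a single class
y = site_class[site]

# No biological signal: each sample is its site's offset plus unit noise.
site_offset = rng.normal(scale=3.0, size=(n_sites, n_features))
X = site_offset[site] + rng.normal(size=(site.size, n_features))

clf = KNeighborsClassifier(n_neighbors=5)
random_cv = KFold(n_splits=5, shuffle=True, random_state=0)
acc_random = cross_val_score(clf, X, y, cv=random_cv).mean()
acc_group = cross_val_score(clf, X, y, groups=site, cv=GroupKFold(n_splits=5)).mean()

print("shuffled KFold accuracy:", round(acc_random, 2))
print("GroupKFold accuracy:    ", round(acc_group, 2))
```

With shuffled folds the classifier can look up a test sample's site among its neighbours, so accuracy should be high; with whole sites held out it should fall towards chance level.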

### Takeaway

> Batch effects are not a new data-generation feature but a new **interpretation** of an existing one.
> The generator stays the same — what changes is **how you label and analyze** the resulting structure.

**Reflection Questions**
* How could you detect a batch effect?
* What happens if batches align with class labels?

## Confounding: when subgroup structure mimics the target

Conceptually:
A confounder is a variable that influences both the features and the target label, creating a spurious association.
In your synthetic setup, this happens when a non-biological factor (e.g., batch, site, or instrument) is correlated with class labels.
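To see why this correlation alone is enough to produce an apparent class signal, consider a simplified case (an assumption for illustration, not the generator's exact model): a binary batch indicator $Z \in \{0, 1\}$ shifts one feature by $\Delta$, and neither the biological part $X_{\text{bio}}$ nor the noise $\varepsilon$ depends on $Y$.

$$
X = X_{\text{bio}} + \Delta Z + \varepsilon
$$

$$
\mathbb{E}[X \mid Y=1] - \mathbb{E}[X \mid Y=0] = \Delta \, \bigl( P(Z=1 \mid Y=1) - P(Z=1 \mid Y=0) \bigr)
$$

With balanced batches the two probabilities are equal and the class difference vanishes. When the batch is perfectly aligned with the class, the difference equals the full shift $\Delta$, even though the feature has no biological relation to $Y$.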

Example:
Imagine you simulate two hospitals (A and B) as batches:

* All samples from hospital A happen to belong to class 0,
* All samples from hospital B belong to class 1.

Now, if you include batch-specific shifts (e.g., slightly higher mean expression or intensity values per hospital), the model can perfectly separate the classes without learning the biological signal at all — it only needs to detect which hospital a sample came from.

This means the model’s apparent accuracy is high under random (shuffled) K-fold cross-validation, because samples from the same hospital appear in both train and test folds. Under `GroupKFold` (splitting by hospital), performance collapses, revealing that the model learned batch identity rather than biology.
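In this extreme two-hospital case the collapse is total, because a split by hospital leaves only one class in each training fold. The sketch below assumes scikit-learn and uses placeholder features; instead of fitting a real model, it predicts the single class seen during training for every test sample.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

hospital = np.array(["A"] * 50 + ["B"] * 50)
y = (hospital == "B").astype(int)   # hospital A -> class 0, hospital B -> class 1
X = np.zeros((y.size, 1))           # placeholder; the split depends only on groups

fold_accuracy = []
for train_idx, test_idx in GroupKFold(n_splits=2).split(X, y, groups=hospital):
    train_classes = np.unique(y[train_idx])
    # Only one class is available for training, so predict it everywhere.
    pred = np.full(test_idx.size, train_classes[0])
    fold_accuracy.append(float((pred == y[test_idx]).mean()))
    print("train classes:", train_classes, "test classes:", np.unique(y[test_idx]))

print(fold_accuracy)  # [0.0, 0.0]
```

Some estimators, such as scikit-learn's `LogisticRegression`, refuse to fit on a single class, so in practice this setup may raise an error rather than return a low score. Using more than two groups, with both classes represented in training, gives a more informative group-aware estimate.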

In short:

“Create confounding where models learn subgroup membership instead of biology” means you deliberately align the subgroup variable (batch/pseudo-class) with the target, so that the model’s success reflects spurious correlation rather than causal signal.

## Pseudo-classes (Artificial Subgroups)

**What is a pseudo-class?**
A *pseudo-class* is a categorical subgroup present in the data that **creates visible clusters or patterns** but is **not causally related to the target outcome**. Models can latch onto these subgroups as “shortcuts,” leading to **spurious performance** and poor generalization.

> **Why this matters**
> - Pseudo-classes can **mislead models** (shortcut learning) and inflate metrics.
> - If train/test splits are not subgroup-aware, you risk **data leakage** (e.g., model learns “site” instead of biology).
> - Handling pseudo-classes properly improves **robustness** and **reproducibility**.

### Examples of pseudo-classes
- **Eye color (when predicting heart disease)** — while “eye color” by itself is an **irrelevant feature**, if we *use it to divide the dataset into subgroups*
  (e.g., “blue-eyed patients” vs. “green-eyed patients”), it becomes a **pseudo-class**.
  The subgroups exist, but they do not causally explain heart disease.
- **Gender/sex**, **age-band**, **ethnicity** *can* behave like pseudo-classes **if domain knowledge says they are unrelated to the outcome you study**.
    - Example: If you simulate a disease independent of sex, then **sex** can be a pseudo-class.
    - ⚠️ In many real settings these variables **do influence** biology; treat as **sensitive attributes** and consider fairness implications.
- **Hospital / site ID** — same disease biology, but different centers.
- **Batch or instrument ID** — processing differences unrelated to outcome.
- **Recruitment year or technician** — administrative grouping, not biology.

### How pseudo-classes mislead models
- They introduce **structure orthogonal to the target** (e.g., site-specific shifts), which the model might exploit.
- If train/test splits don’t block by subgroup, metrics can be **over-optimistic**.
- Correlated pseudo-classes (e.g., site correlates with outcome due to sampling bias) act as **confounders**.

### What we do in synthetic data
We deliberately introduce pseudo-classes to **stress-test** pipelines:
1. **Independent pseudo-class**: subgroup affects features (e.g., mean shift) but **not** the label.
2. **Partially confounded pseudo-class**: subgroup correlates with the label (sampling bias) to demonstrate spurious gains.
3. **Strongly confounded pseudo-class**: extreme case to show failure when CV is not subgroup-aware.
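The three scenarios differ only in how strongly the subgroup label agrees with the class label. A minimal sketch, assuming a binary target and a binary pseudo-class: `make_pseudo_class` is a hypothetical helper, and the agreement levels (0.5, 0.8, 1.0) and the shift size are illustrative choices rather than generator defaults.

```python
import numpy as np

def make_pseudo_class(y, agreement, seed=0):
    """Binary subgroup label that equals y with probability `agreement`."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    agree = rng.random(y.size) < agreement
    return np.where(agree, y, 1 - y)

rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=2000)

scenarios = {
    "independent": 0.5,            # subgroup carries no information about y
    "partially confounded": 0.8,   # sampling bias: subgroup mostly follows y
    "strongly confounded": 1.0,    # subgroup identical to y
}
pseudo = {name: make_pseudo_class(y, p) for name, p in scenarios.items()}

# The subgroup acts on the features only through an assigned mean shift.
X = rng.normal(size=(y.size, 4))
shift = np.array([1.0, 0.0, 0.0, 0.0])   # subgroup 1 shifts feature 0 only
X_shifted = {name: X + shift * g[:, None] for name, g in pseudo.items()}

for name, g in pseudo.items():
    print(name, "agreement with y:", round(float((g == y).mean()), 2))
```

In all three cases the features have no direct dependence on `y`; any class separation in `X_shifted` comes from the subgroup shift alone.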

### Goal
Add an artificial categorical variable and explore how it might mislead models or inflate performance when not handled correctly.