# Clinical Dataset Exploration — Explanation Notebook

This notebook is a narrative companion to `clinica/clinica_data_exploration.ipynb`.
It summarizes the analysis logic, methods, and expected outputs, and indicates where to place the figures generated by the exploration notebook.


## 1. Objectives and Dataset

- Goal: characterize PD vs MSA (MSA-P, MSA-C) clinically; surface discriminative variables; prepare for integration with imaging models.
- Source: clinical spreadsheet (`minimal.csv`) standardized into a clean `pandas` DataFrame.
- Classes considered: `MSA-P`, `MSA-C`, `PD` (MSA-P and MSA-C are sometimes pooled as `MSA`).
- Feature types: binary (red flags; autonomic, sleep, and bulbar symptoms) and continuous (age, disease duration, UPDRS, LEDD, COMPASS subscales, etc.).


## 2. Data Ingestion and Harmonization

- Paths are resolved robustly so the notebook runs from any working directory; plot style is unified via `StyleManager`.
- Load `minimal.csv` and drop the exported index column.
- Column names normalized: Unicode stripped, whitespace collapsed, then slugified to ASCII snake_case.
- Diagnosis labels trimmed and harmonized (e.g., `MSA-P/C` → `MSA-P`).
- Numeric casting based on patterns; categorical cleaning (uppercasing, NA handling).
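The column-name normalization step can be sketched as follows. This is a minimal illustration using only the standard library, not the notebook's actual helper: the function name `slugify_column` and the raw header strings are assumptions made for the example.

```python
import re
import unicodedata


def slugify_column(name: str) -> str:
    """Turn a raw spreadsheet header into an ASCII snake_case identifier."""
    # Decompose accented characters, then drop anything that is not ASCII.
    ascii_name = (
        unicodedata.normalize("NFKD", str(name))
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Replace every run of non-alphanumeric characters with one underscore.
    slug = re.sub(r"[^0-9a-zA-Z]+", "_", ascii_name)
    return slug.strip("_").lower()


# Hypothetical raw headers, chosen only to exercise the function.
raw_columns = ["Età attuale", "  Durata   malattia ", "Severe dysphagia w/ 3 yrs"]
clean_columns = [slugify_column(c) for c in raw_columns]
print(clean_columns)
```

With a DataFrame, the same function could be applied through `df.rename(columns=slugify_column)`.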


## 3. Missingness Profiling

- Missingness is quantified per variable, and optionally per class, to decide which features are usable.
- A 30% threshold is used to flag features with excessive NA.
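A minimal sketch of this profiling step, assuming a `diagnosis` column and a strict "more than 30%" rule (both are assumptions; the toy values below are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the cleaned clinical table (values are made up).
df = pd.DataFrame(
    {
        "diagnosis": ["PD", "PD", "MSA-P", "MSA-C", "MSA-P"],
        "updrs_off": [30.0, np.nan, 45.0, np.nan, 50.0],
        "ledd": [400.0, 600.0, 300.0, 500.0, np.nan],
    }
)

NA_THRESHOLD = 0.30  # assumed rule: flag features with more than 30% missing

features = df.drop(columns="diagnosis")

# Fraction of missing values per variable, over the whole cohort.
missing_overall = features.isna().mean().sort_values(ascending=False)
flagged = missing_overall[missing_overall > NA_THRESHOLD].index.tolist()

# Same fraction, broken down by diagnosis.
missing_by_class = features.isna().groupby(df["diagnosis"]).mean()

print(missing_overall)
print("Flagged:", flagged)
print(missing_by_class)
```

The per-class table helps spot variables that are well covered overall but nearly absent in one diagnosis.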

![Missingness by variable](../images/missing_by_variable.png)


## 4. Cohort Overview

- Distribution of patients across definitive diagnoses.
- A compact summary table (N, M/F, age at onset, duration, current age).

![Patient distribution by diagnosis](../images/patient_distribution.png)
![Clinical/Demographic summary table](../images/summary_table.png)


## 5. Derived Variables and Cleaning

- Create analysis-friendly fields (examples):
  - Timing/severity: `eta_attuale`, `eta_esordio`, `durata_malattia`, `percentuale_risposta_ldopa`.
  - Aggregates: `n_red_flags_msa`, `n_anomalie_mri`.
- Remove or standardize ambiguous/mixed-type columns.
- Keep consistent class ordering: `MSA-P`, `MSA-C`, `PD`.
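One way to keep the class order fixed and to build a count aggregate is sketched below. The `diagnosis` column name and the `red_flag_*` indicator columns are hypothetical, and the row-wise sum is an assumption about how a field like `n_red_flags_msa` might be derived.

```python
import pandas as pd

CLASS_ORDER = ["MSA-P", "MSA-C", "PD"]

df = pd.DataFrame(
    {
        "diagnosis": ["PD", "MSA-C", "MSA-P", "PD"],
        # Hypothetical red-flag indicator columns (1 = present, 0 = absent).
        "red_flag_a": [0, 1, 1, 0],
        "red_flag_b": [0, 1, 0, 1],
        "red_flag_c": [0, 0, 1, 0],
    }
)

# An ordered categorical makes sorting and grouping follow the fixed class order.
df["diagnosis"] = pd.Categorical(df["diagnosis"], categories=CLASS_ORDER, ordered=True)

# Count aggregate: number of positive red flags per patient.
red_flag_cols = ["red_flag_a", "red_flag_b", "red_flag_c"]
df["n_red_flags_msa"] = df[red_flag_cols].sum(axis=1)

print(df.sort_values("diagnosis"))
```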


## 6. Outlier Detection

- Use Tukey's fences on key continuous variables (global scan, optionally per class) to spot extreme values.
- The scan produces a record of suspected outliers for clinician review; flagged values are not dropped automatically.
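Tukey's fences flag values below Q1 − k·IQR or above Q3 + k·IQR. The sketch below assumes the conventional k = 1.5 and a `diagnosis` grouping column; the data are made up for illustration.

```python
import pandas as pd


def tukey_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask marking values outside Tukey's fences."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Missing values compare as False, so they are never flagged.
    return (values < lower) | (values > upper)


df = pd.DataFrame(
    {
        "diagnosis": ["PD"] * 5 + ["MSA-P"] * 5,
        "updrs_off": [20.0, 22.0, 25.0, 27.0, 90.0, 40.0, 42.0, 45.0, 47.0, 50.0],
    }
)

# Global scan: fences computed on the whole cohort.
df["outlier_global"] = tukey_outliers(df["updrs_off"])

# Per-class scan: fences computed within each diagnosis.
df["outlier_by_class"] = (
    df.groupby("diagnosis")["updrs_off"].transform(tukey_outliers).astype(bool)
)

# Flagged rows are listed for review, not dropped.
print(df[df["outlier_global"] | df["outlier_by_class"]])
```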

![Outlier distributions](../images/outliers_boxplots.png)


## 7. Symptom Prevalence and Co-occurrence (PD vs MSA)

- Binary symptom prevalence computed for PD and pooled MSA (P + C).
- 95% CIs (Wald), absolute risk difference Δ = PD% − MSA%, Benjamini–Hochberg FDR across symptoms.
- UpSet-like panels show common multi-symptom clusters per class.
- Jaccard heatmaps reveal symptom co-activation patterns (masked when a symptom has zero positives in-group).
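The prevalence statistics above can be sketched in plain Python. The counts and p-values below are placeholders, the helper names are hypothetical, and the test that produces the per-symptom p-values is not specified here, so the example only shows the Wald interval, the risk difference, and the Benjamini–Hochberg adjustment.

```python
import math


def wald_ci(k: int, n: int, z: float = 1.96) -> tuple[float, float, float]:
    """Proportion k/n with its Wald 95% confidence interval, clipped to [0, 1]."""
    p = k / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)


def benjamini_hochberg(p_values: list[float]) -> list[float]:
    """Benjamini-Hochberg adjusted p-values (q-values), in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q_values[i] = running_min
    return q_values


# Placeholder counts: (positives, group size) for PD and pooled MSA.
counts = {
    "constipation": {"PD": (12, 40), "MSA": (21, 30)},
    "anamnestic_oh": {"PD": (4, 40), "MSA": (18, 30)},
}
for symptom, groups in counts.items():
    p_pd, lo_pd, hi_pd = wald_ci(*groups["PD"])
    p_msa, lo_msa, hi_msa = wald_ci(*groups["MSA"])
    delta = 100 * (p_pd - p_msa)  # risk difference in percentage points
    print(
        f"{symptom}: PD {p_pd:.0%} [{lo_pd:.0%}, {hi_pd:.0%}], "
        f"MSA {p_msa:.0%} [{lo_msa:.0%}, {hi_msa:.0%}], delta {delta:+.1f} pp"
    )

# Placeholder p-values, one per symptom.
print(benjamini_hochberg([0.001, 0.04, 0.03, 0.20]))
```

Wald intervals are known to behave poorly for small groups or proportions near 0 or 1, which is worth keeping in mind when reading intervals for rare symptoms.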

![Binary symptom prevalence (PD vs MSA)](../images/symptom_prevalence_pd_vs_msa.png)
![Symptom clusters (UpSet view)](../images/symptom_upset_plots.png)
![Symptom co-occurrence (Jaccard)](../images/symptom_jaccard_heatmaps.png)
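For reference, the Jaccard index between two symptoms is the number of patients with both divided by the number with at least one. A minimal NumPy sketch, assuming a complete 0/1 patient-by-symptom array for a single group (the function name and toy data are illustrative):

```python
import numpy as np


def jaccard_matrix(binary: np.ndarray) -> np.ndarray:
    """Pairwise Jaccard index between the columns of a 0/1 patient-by-symptom array.

    Symptoms with zero positives get NaN rows/columns so a heatmap can mask them.
    """
    x = binary.astype(bool)
    # Broadcast to (patients, symptoms, symptoms) and count over patients.
    inter = (x[:, :, None] & x[:, None, :]).sum(axis=0).astype(float)
    union = (x[:, :, None] | x[:, None, :]).sum(axis=0).astype(float)
    with np.errstate(divide="ignore", invalid="ignore"):
        jac = inter / union
    empty = ~x.any(axis=0)
    jac[empty, :] = np.nan
    jac[:, empty] = np.nan
    return jac


# Toy data: 4 patients, 3 symptoms; the third symptom has no positives.
toy = np.array(
    [
        [1, 1, 0],
        [1, 0, 0],
        [0, 1, 0],
        [1, 1, 0],
    ]
)
jac = jaccard_matrix(toy)
print(np.round(jac, 2))
```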


## 8. Motor Function and L‑Dopa Responsiveness

- Compare `updrs_off` across classes; boxplots show severity distribution.
- `percentuale_risposta_ldopa` is a key differentiator (lower in MSA than in PD), with a reference threshold at ~30%.

*Figure placeholder: UPDRS OFF and L‑dopa response comparison (image to be inserted).*


## 9. MSA Red Flags and Hoehn & Yahr Severity

- Red flag counts (`n_red_flags_msa` and clinician-certified variant) contrasted across classes.
- Disease stage distribution via Hoehn & Yahr stacked bars.

![Red flags and H&Y severity](../images/red_flags_severity.png)


## 10. Diagnostic Delay and Progression

- Inspect `ritardo_diagnostico` and related timing variables to profile progression differences.

![Diagnostic delay and progression](../images/diagnostic_delay_progression.png)


## 11. MRI Abnormalities (Supportive Features)

- Total abnormality count by diagnosis; specific signs: hot‑cross‑bun, putamen atrophy/signal changes.
- Clinical reading: MSA‑C shows pontocerebellar signs; MSA‑P shows putaminal changes; PD often has a near-normal MRI.

![MRI findings comparison](../images/mri_findings_comparison.png)


## 12. Autonomic Dysfunction Profile (COMPASS)

- Compare subscales (`compass_oh`, `compass_vasc`, `compass_sudor`, `compass_gi`, `compass_uin`, `compass_pupil`) across classes.

![COMPASS subscales comparison](../images/compass_subscales_comparison.png)


## 13. Univariate Screening and Mutual Information

- For each feature: appropriate test (ANOVA/Welch/Kruskal for continuous; Chi‑square for binary).
- Effect sizes and p-values aggregated; FDR applied; MI computed as a non‑linear relevance proxy.
- A combined visualization summarizes top features by q-value/effect and MI.
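To make the mutual-information idea concrete, here is a plug-in estimate for a discrete feature against the class label, written with the standard library only. It is an illustration of the quantity, not necessarily the estimator used in the exploration notebook (which is not specified here); the function name and toy data are assumptions.

```python
import math
from collections import Counter


def mutual_information(feature, labels) -> float:
    """Plug-in estimate of I(feature; labels) in nats for two discrete sequences."""
    n = len(feature)
    joint = Counter(zip(feature, labels))
    px = Counter(feature)
    py = Counter(labels)
    mi = 0.0
    for (x, y), n_xy in joint.items():
        # p(x, y) * log(p(x, y) / (p(x) * p(y))), written with raw counts.
        mi += (n_xy / n) * math.log(n_xy * n / (px[x] * py[y]))
    return mi


labels = ["MSA", "MSA", "PD", "PD"]
informative = [1, 1, 0, 0]  # toy feature that separates the classes perfectly
uninformative = [1, 0, 1, 0]  # toy feature independent of the class

print(mutual_information(informative, labels))
print(mutual_information(uninformative, labels))
```

MI is zero when the feature and the label are independent and grows as the feature becomes more informative, regardless of whether the relationship is linear.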

![Feature importance analysis](../images/feature_importance_analysis.png)


## 14. Differential and Within‑Diagnosis Correlations

- Differential correlations: identify pairs whose association differs between diagnoses.
- Within‑diagnosis correlation heatmaps (MSA‑P, MSA‑C, PD) for key clinical variables.
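One common way to test whether a correlation differs between two independent groups is Fisher's z transform. The sketch below assumes Pearson correlations and group sizes above 3; whether the exploration notebook uses this particular test is an assumption, and the function name is hypothetical.

```python
import math


def fisher_z_diff(r1: float, n1: int, r2: float, n2: int) -> tuple[float, float]:
    """z statistic and two-sided p-value for H0: the two correlations are equal."""
    # Fisher transform stabilizes the variance of each correlation.
    z1, z2 = math.atanh(r1), math.atanh(r2)
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (z1 - z2) / se
    # Two-sided p-value from the standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p


# Toy inputs: correlation of one variable pair in two diagnosis groups.
z, p = fisher_z_diff(r1=0.8, n1=40, r2=0.1, n2=40)
print(f"z = {z:.2f}, p = {p:.4g}")
```

When many variable pairs are scanned, the resulting p-values would need the same FDR adjustment used elsewhere in the analysis.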

![Differential correlations](../images/differential_correlations.png)
![Correlation heatmap (MSA-P)](../images/subplt_htmap_0_MSA-P.png)
![Correlation heatmap (MSA-C)](../images/subplt_htmap_1_MSA-C.png)
![Correlation heatmap (PD)](../images/subplt_htmap_2_PD.png)


## 15. Practical Takeaways for Model Integration

- High‑yield, easy‑to‑collect binaries for fusion with imaging (no image‑label leakage):
  `unexplained_urinary_urge_incontinence`, `constipation`, `anamnestic_oh`, `russamento_osas`, `poor_l_dopa_responsivenes`, `deambulaz_appoggio`.
- Consider adding (if coverage is good): `severe_dysphagia_w_3_yrs`, `cerebellar_syndrome`.
- Exclude radiology signs from the fused tabular input when training a CNN on MRIs (leakage risk); use them for ablation only.
- Suggested fusion: late fusion (CNN embedding ⊕ MLP(tabular)) with robust preprocessing and stratified CV.


## 16. Reproducibility Notes

- Figures saved under `images/` via `StyleManager` (300 DPI, transparent, tight).
- Class order fixed across plots: `MSA-P`, `MSA-C`, `PD` (and pooled `MSA`).
- Missingness and coverage checks precede any feature selection to avoid biased comparisons.
- All inferential results report 95% CIs and FDR‑adjusted q‑values where relevant.
