# Clinical Dataset Exploration — Explanation Notebook

Narrative walkthrough of `clinica/clinica_data_exploration.ipynb`.

## 1. Objectives and Dataset

- Goal: characterize PD vs MSA (MSA-P, MSA-C) clinically; surface discriminative variables; prepare for integration with imaging models.
- Classes considered: `MSA-P`, `MSA-C`, `PD` (MSA-P and MSA-C are sometimes pooled as `MSA`).


| Column                                                         | Type        | Short explanation                                                                |
| :------------------------------------------------------------- | :---------- | :------------------------------------------------------------------------------- |
| anni_dalla_diagnosi                                            | numeric     | Years elapsed since formal diagnosis.                                            |
| anni_dopaminoaagonisti                                         | numeric     | Duration of dopamine-agonist therapy (years).                                    |
| anni_l_dopa                                                    | numeric     | Duration of levodopa therapy (years).                                            |
| anno_diagnosi                                                  | numeric     | Calendar year of diagnosis.                                                      |
| anno_esordio_disautonomia                                      | numeric     | Year of first autonomic symptom.                                                 |
| anno_esordio_sintomi_motori                                    | numeric     | Year of first motor symptom.                                                     |
| anno_esordio_sintomi_non_motori                                | numeric     | Year of first non-motor symptom.                                                 |
| anno_nascita                                                   | numeric     | Year of birth.                                                                   |
| compass_gi                                                     | numeric     | COMPASS-31 gastrointestinal sub-score.                                           |
| compass_oh                                                     | numeric     | COMPASS-31 orthostatic hypotension sub-score.                                    |
| compass_pupil                                                  | numeric     | COMPASS-31 pupillomotor sub-score.                                               |
| compass_sudor                                                  | numeric     | COMPASS-31 sudomotor (sweating) sub-score.                                       |
| compass_totale                                                 | numeric     | Total COMPASS-31 autonomic dysfunction score.                                    |
| compass_uin                                                    | numeric     | COMPASS-31 urinary sub-score.                                                    |
| compass_vasc                                                   | numeric     | COMPASS-31 vasomotor sub-score.                                                  |
| delta_off_on                                                   | numeric     | Difference between UPDRS_OFF and UPDRS_ON (treatment effect).                    |
| durata_malattia                                                | numeric     | Disease duration from onset (years).                                             |
| eta_attuale                                                    | numeric     | Current patient age.                                                             |
| eta_diagnosi                                                   | numeric     | Age at diagnosis.                                                                |
| eta_esordio                                                    | numeric     | Age at first motor symptom.                                                      |
| h_and_y                                                        | numeric     | Hoehn & Yahr stage (1–5).                                                        |
| ledd                                                           | numeric     | Levodopa equivalent daily dose (mg/day).                                         |
| ledd_per_anno                                                  | numeric     | LEDD normalized per year of disease.                                             |
| n_anomalie_mri                                                 | numeric     | Number of abnormal MRI findings.                                                 |
| n_red_flags_msa                                                | numeric     | Count of MSA “red-flag” features (per MDS).                                      |
| n_red_flags_msa_clinic_certified                               | numeric     | Clinically certified number of red flags.                                        |
| parkinsonism                                                   | numeric     | Severity/composite score of parkinsonian signs (rigidity, bradykinesia, tremor). |
| percentuale_risposta_ldopa                                     | numeric     | % improvement after acute L-Dopa test.                                           |
| progression_rate                                               | numeric     | Calculated progression speed (e.g. H&Y / disease years).                         |
| ritardo_diagnostico                                            | numeric     | Diagnostic delay (years from symptom onset to diagnosis).                        |
| updrs_off                                                      | numeric     | UPDRS-III motor score in OFF-medication state.                                   |
| updrs_on                                                       | numeric     | UPDRS-III motor score in ON-medication state.                                    |
| atrofia_cervelletto                                            | binary      | MRI: cerebellar atrophy.                                                         |
| atrofia_del_putamen                                            | binary      | MRI: putaminal atrophy.                                                          |
| atrofia_peduncoli_cerebellari_medi                             | binary      | MRI: middle cerebellar peduncle atrophy.                                         |
| atrofia_ponte                                                  | binary      | MRI: pontine atrophy.                                                            |
| behavioural_alteration                                         | binary      | Behavioural or personality changes.                                              |
| caduta_segnale_putamen                                         | binary      | MRI: putaminal signal loss on T2*/SWI.                                           |
| cadute                                                         | binary      | History of falls.                                                                |
| carrozzina                                                     | binary      | Wheelchair use.                                                                  |
| cerebellar_syndrome                                            | binary      | Presence of cerebellar signs (ataxia, dysmetria).                                |
| cognitive_decline                                              | binary      | Cognitive impairment or dementia.                                                |
| cold_discolored_hands_and_feet                                 | binary      | Peripheral vasomotor disturbance (autonomic).                                    |
| constipation                                                   | binary      | Chronic constipation.                                                            |
| craniocervical_dyst_induced_dy_l_dopa                          | binary      | Craniocervical dystonia induced by L-Dopa.                                       |
| deambulaz_appoggio                                             | binary      | Ambulates with support.                                                          |
| deambulaz_autonoma                                             | binary      | Ambulates independently.                                                         |
| drooling                                                       | binary      | Hypersalivation / drooling.                                                      |
| erectile_disfunction                                           | binary      | Erectile dysfunction (autonomic symptom).                                        |
| fatigue                                                        | binary      | Fatigue / lack of energy.                                                        |
| hot_cross_bun_sign                                             | binary      | MRI: pontine cruciform hyperintensity typical of MSA-C.                          |
| hyposmia                                                       | binary      | Reduced sense of smell.                                                          |
| inspiratory_sighs                                              | binary      | Sighing or irregular breathing pattern.                                          |
| iperintensita_peduncoli_cerebellari_medi                       | binary      | MRI: MCP hyperintensity.                                                         |
| iperintensita_putamen                                          | binary      | MRI: putaminal hyperintensity.                                                   |
| jerky_myoclonic_postural_or_kinetic_tremor                     | binary      | Irregular / jerky tremor type.                                                   |
| moderate_to_severe_postural_instability_w_3_yrs_of_motor_onset | binary      | Postural instability within 3 years of motor onset.                              |
| normal_rmn                                                     | binary      | Normal brain MRI.                                                                |
| pain                                                           | binary      | Chronic or neuropathic pain.                                                     |
| pathologic_laughter_or_crying                                  | binary      | Emotional incontinence (pseudobulbar affect).                                    |
| poor_l_dopa_responsivenes                                      | binary      | Poor or absent clinical response to L-Dopa.                                      |
| postural_deformities                                           | binary      | Axial/postural deformities (camptocormia, Pisa).                                 |
| rapid_progression_w_3_yrs                                      | binary      | Rapid disease progression within 3 years.                                        |
| rbd                                                            | binary      | REM sleep behavior disorder.                                                     |
| russamento_osas                                                | binary      | Snoring / sleep apnea (OSAS).                                                    |
| severe_dysphagia_w_3_yrs                                       | binary      | Severe dysphagia within 3 years of onset.                                        |
| severe_speech_impairement_w_3_yrs                              | binary      | Severe dysarthria within 3 years of onset.                                       |
| sonnolenza_diurna                                              | binary      | Excessive daytime sleepiness.                                                    |
| stridor                                                        | binary      | Laryngeal stridor (inspiratory noise).                                           |
| unexplained_babinski                                           | binary      | Pathological Babinski sign unexplained by stroke.                                |
| unexplained_urinary_urge_incontinence                          | binary      | Urinary urge incontinence unexplained by obstruction.                            |
| unexplained_voiding_difficulties                               | binary      | Urinary retention / difficulty voiding unexplained by prostate disease.          |
| visual_alteration                                              | binary      | Visual disturbances or blurring.                                                 |
| anamnestic_oh                                                  | binary      | History of orthostatic hypotension from anamnesis.                               |
| diagnosi_definita                                              | categorical | Final confirmed diagnosis (PD / MSA-P / MSA-C / Control).                        |
| diagnosi_di_invio                                              | categorical | Referral diagnosis at first evaluation.                                          |
| gruppo_eta                                                     | categorical | Age group (e.g. <50, 50-60, 60-70, >70).                                         |
| sesso                                                          | categorical | Sex (M/F).                                                                       |
| stadio_malattia                                                | categorical | Disease stage grouping (e.g. early / mid / late).                                |
| data_di_nascita                                                | date        | Full birth date.                                                                 |


| **Set**                           | **Variables Included** |
| :-------------------------------- | :--------------------- |
| **Clinician-Certified Red Flags** | `poor_l_dopa_responsivenes`, `rapid_progression_w_3_yrs`, `moderate_to_severe_postural_instability_w_3_yrs_of_motor_onset`, `craniocervical_dyst_induced_dy_l_dopa`, `severe_speech_impairement_w_3_yrs`, `severe_dysphagia_w_3_yrs`, `unexplained_babinski`, `jerky_myoclonic_postural_or_kinetic_tremor`, `postural_deformities`, `unexplained_voiding_difficulties`, `unexplained_urinary_urge_incontinence`, `stridor`, `inspiratory_sighs`, `cold_discolored_hands_and_feet`, `pathologic_laughter_or_crying` |
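
As an illustration of how an aggregate such as `n_red_flags_msa_clinic_certified` could be derived from this set, the sketch below sums the binary columns row by row. It assumes the count is a plain sum and that a missing flag counts as absent; the notebook may handle either point differently. The helper name `count_red_flags` and the demo rows are hypothetical.

```python
import pandas as pd

# Column names copied from the "Clinician-Certified Red Flags" set above.
CERTIFIED_RED_FLAGS = [
    "poor_l_dopa_responsivenes",
    "rapid_progression_w_3_yrs",
    "moderate_to_severe_postural_instability_w_3_yrs_of_motor_onset",
    "craniocervical_dyst_induced_dy_l_dopa",
    "severe_speech_impairement_w_3_yrs",
    "severe_dysphagia_w_3_yrs",
    "unexplained_babinski",
    "jerky_myoclonic_postural_or_kinetic_tremor",
    "postural_deformities",
    "unexplained_voiding_difficulties",
    "unexplained_urinary_urge_incontinence",
    "stridor",
    "inspiratory_sighs",
    "cold_discolored_hands_and_feet",
    "pathologic_laughter_or_crying",
]


def count_red_flags(df: pd.DataFrame, columns=CERTIFIED_RED_FLAGS) -> pd.Series:
    """Row-wise count of positive flags (assumption: NaN is treated as absent)."""
    present = [c for c in columns if c in df.columns]
    return df[present].fillna(0).astype(int).sum(axis=1)


# Hypothetical three-patient example with only a few of the columns available.
demo = pd.DataFrame({
    "stridor": [1, 0, float("nan")],
    "inspiratory_sighs": [1, 0, 1],
    "unexplained_babinski": [0, 0, 1],
})
demo["n_red_flags_msa_clinic_certified"] = count_red_flags(demo)
print(demo)
```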


## 2. Data Ingestion and Harmonization

- Load the clinical CSV, mapping the different representations of missing values to `NaN`.
- Column names normalized: Unicode stripped, whitespace collapsed, then slugified to ASCII snake_case.
- Diagnosis labels trimmed and harmonized (e.g., `MSA-P/C` → `MSA-P`).
- Numeric casting based on patterns; categorical cleaning (uppercasing, NA handling).
- Check for duplicate rows.
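
The column-name normalization step can be sketched with the standard library alone. This is a minimal illustration, not the notebook's actual helper: the function name `slugify_column` and the raw header strings are hypothetical.

```python
import re
import unicodedata


def slugify_column(name: str) -> str:
    """Turn a raw header into ASCII snake_case."""
    # Decompose accented characters, then drop everything outside ASCII.
    ascii_name = (
        unicodedata.normalize("NFKD", name)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Collapse every run of non-alphanumeric characters into one underscore.
    slug = re.sub(r"[^0-9a-zA-Z]+", "_", ascii_name.strip().lower())
    return slug.strip("_")


# Hypothetical raw headers as they might appear in the CSV.
raw_headers = ["  Età  attuale ", "H and Y", "Anno esordio sintomi motori"]
print([slugify_column(h) for h in raw_headers])
```

With pandas, the same function could be applied through `df.rename(columns=slugify_column)`.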


## 3. Missingness Profiling

- We quantify missingness per variable, and optionally per class, to guide the choice of usable features.
- Features with more than 30% missing values are flagged as having excessive NA.
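
A minimal pandas sketch of this profiling step follows. It assumes the class column is `diagnosi_definita`; the function name `missingness_report` and the demo values are hypothetical.

```python
import pandas as pd


def missingness_report(df: pd.DataFrame,
                       class_col: str = "diagnosi_definita",
                       threshold: float = 0.30) -> pd.DataFrame:
    """Fraction of missing values per feature, overall and per class."""
    features = df.drop(columns=[class_col])
    is_missing = features.isna()
    overall = is_missing.mean().rename("overall")
    # One column per class, one row per feature.
    per_class = is_missing.groupby(df[class_col]).mean().T
    report = pd.concat([overall, per_class], axis=1)
    report["flagged"] = report["overall"] > threshold
    return report.sort_values("overall", ascending=False)


# Hypothetical four-patient example.
demo = pd.DataFrame({
    "diagnosi_definita": ["PD", "PD", "MSA-P", "MSA-C"],
    "updrs_off": [30.0, None, 41.0, 38.0],
    "compass_totale": [None, None, 22.0, 35.0],
})
report = missingness_report(demo)
print(report)
```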

![Missingness by variable](../images/missing_by_variable.png)
![Missing PD](../images/missing_by_variable_PD.png)
![Missing MSA](../images/missing_by_variable_MSA.png)

**Note: PD patients have considerably more missing values than MSA patients.**

**Columns with more than 50% missing values were removed.** Remaining columns:
![remaining_columns](../images/missing_by_variable_after_removal.png)

## 4. Derived Variables and Cleaning

- Create analysis-friendly fields (examples):
  - Timing/severity: `eta_attuale`, `eta_esordio`, `durata_malattia`, `percentuale_risposta_ldopa`.
  - Aggregates: `n_red_flags_msa`, `n_anomalie_mri`.
- Remove or standardize ambiguous/mixed-type columns.
- Keep consistent class ordering: `MSA-P`, `MSA-C`, `PD`.


| Variable                               | Short description                                                      |
|---------------------------------------:|------------------------------------------------------------------------|
| eta_attuale                            | Current patient age.                                                   |
| ritardo_diagnostico                    | Diagnostic delay (years from symptom onset to diagnosis).              |
| anni_dalla_diagnosi                    | Years elapsed since formal diagnosis.                                  |
| percentuale_risposta_ldopa             | % improvement after acute L‑Dopa test.                                 |
| ledd_per_anno                          | Levodopa equivalent daily dose normalized per year of disease.         |
| n_red_flags_msa                        | Count of MSA “red-flag” features (per MDS).                            |
| n_red_flags_msa_clinic_certified       | Clinically certified number of red flags.                              |
| n_anomalie_mri                         | Number of abnormal MRI findings.                                       |
| stadio_malattia                        | Disease stage grouping (e.g. early / mid / late).                      |
| gruppo_eta                             | Age group (e.g. <50, 50-60, 60-70, >70).                               |
| progression_rate                       | Calculated progression speed (e.g. H&Y / disease years).               |
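
The sketch below shows one way several of these fields could be computed from the raw columns. Every formula is an assumption inferred from the short descriptions (for example, onset taken as first motor symptom, and L‑Dopa response as the OFF to ON improvement relative to OFF); the notebook's definitions may differ. `add_derived` and `reference_year` are hypothetical names.

```python
import pandas as pd


def add_derived(df: pd.DataFrame, reference_year: int) -> pd.DataFrame:
    """Add timing and severity fields (all formulas are assumptions)."""
    out = df.copy()
    onset = out["anno_esordio_sintomi_motori"]
    out["eta_esordio"] = onset - out["anno_nascita"]
    out["ritardo_diagnostico"] = out["anno_diagnosi"] - onset
    out["durata_malattia"] = reference_year - onset
    out["delta_off_on"] = out["updrs_off"] - out["updrs_on"]
    # Guard against division by zero: non-positive denominators become NaN.
    off = out["updrs_off"].where(out["updrs_off"] > 0)
    out["percentuale_risposta_ldopa"] = 100 * out["delta_off_on"] / off
    years = out["durata_malattia"].where(out["durata_malattia"] > 0)
    out["progression_rate"] = out["h_and_y"] / years
    return out


# Hypothetical two-patient example.
raw = pd.DataFrame({
    "anno_nascita": [1955, 1960],
    "anno_esordio_sintomi_motori": [2015, 2018],
    "anno_diagnosi": [2017, 2019],
    "updrs_off": [40, 50],
    "updrs_on": [30, 20],
    "h_and_y": [3, 2],
})
derived = add_derived(raw, reference_year=2021)
print(derived.T)
```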

## 5. Cohort Overview

- Distribution of patients across definitive diagnoses.
- A compact summary table (N, M/F, age at onset, duration, current age).

![Patient distribution by diagnosis](../images/patient_distribution.png)


## 6. Outlier Detection

- Use Tukey's fences on key continuous variables (global scan, optionally per class) to spot extreme values.
- Provides a record of suspected outliers for clinician review.
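
Tukey's fences can be sketched as follows. The multiplier `k = 1.5` is the conventional choice and is an assumption here, as are the function name and the example doses.

```python
import pandas as pd


def tukey_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Comparisons with NaN are False, so missing values are never flagged.
    return (values < lower) | (values > upper)


# Hypothetical LEDD values (mg/day) with one extreme entry.
ledd = pd.Series([300, 400, 450, 500, 550, 600, 2500], name="ledd")
mask = tukey_outliers(ledd)
print(ledd[mask])
```

Applying the same function within each diagnosis group gives the per-class scan; the flagged rows can then be exported for clinician review.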

![Outlier distributions](../images/outliers_boxplots_by_diagnosis.png)
![Outlier Summary Table](../images/tables/outlier_summary_table.png) 


## 7. Symptom Prevalence and Co-occurrence (PD vs MSA)

- Binary symptom prevalence computed for PD and pooled MSA (P + C).
- 95% CIs (Wald), absolute risk difference Δ = PD% − MSA%, Benjamini–Hochberg FDR across symptoms.
- UpSet-like panels show common multi-symptom clusters per class.
- Jaccard heatmaps reveal symptom co-activation patterns (masked when a symptom has zero positives in-group).
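
The prevalence statistics listed above can be sketched in plain Python. The counts and p-values in the example are invented for illustration, and the function names are hypothetical; the source does not state which test produces the per-symptom p-values.

```python
import math


def wald_ci(k: int, n: int, z: float = 1.96):
    """95% Wald interval for a proportion k/n, clipped to [0, 1]."""
    p = k / n
    half = z * math.sqrt(p * (1 - p) / n)
    return max(0.0, p - half), min(1.0, p + half)


def risk_difference(k_pd: int, n_pd: int, k_msa: int, n_msa: int, z: float = 1.96):
    """Absolute risk difference (PD minus MSA) with a Wald interval."""
    p_pd, p_msa = k_pd / n_pd, k_msa / n_msa
    delta = p_pd - p_msa
    se = math.sqrt(p_pd * (1 - p_pd) / n_pd + p_msa * (1 - p_msa) / n_msa)
    return delta, (delta - z * se, delta + z * se)


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg q-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q[i] = running_min
    return q


# Hypothetical counts: symptom present in 12/40 PD and 27/30 MSA patients.
lo, hi = wald_ci(12, 40)
delta, delta_ci = risk_difference(12, 40, 27, 30)
q_values = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])
print(lo, hi, delta, delta_ci, q_values)
```

The Wald interval is simple but can be poor for small groups or prevalences near 0 or 1, which is worth keeping in mind when reading rare symptoms.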

![Binary symptom prevalence (PD vs MSA)](../images/symptom_prevalence_pd_vs_msa.png)
![Symptom clusters (UpSet view)](../images/symptom_upset_plots.png)
![Symptom co-occurrence (Jaccard)](../images/symptom_jaccard_heatmaps.png)
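
For reference, the Jaccard index behind the heatmaps can be sketched as below. Returning `None` for a symptom with zero positives mirrors the masking rule described above; the function name and the symptom vectors are hypothetical.

```python
def jaccard(a, b):
    """Jaccard index of two 0/1 vectors; None when either has no positives."""
    if sum(a) == 0 or sum(b) == 0:
        return None  # masked cell in the heatmap
    both = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    either = sum(1 for x, y in zip(a, b) if x == 1 or y == 1)
    return both / either


# Hypothetical symptom indicators for five patients of one class.
symptoms = {
    "constipation": [1, 1, 0, 1, 0],
    "rbd": [1, 0, 0, 1, 1],
    "stridor": [0, 0, 0, 0, 0],
}
matrix = {
    (s, t): jaccard(symptoms[s], symptoms[t])
    for s in symptoms
    for t in symptoms
}
print(matrix[("constipation", "rbd")])
```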


## 8. Motor Function and L‑Dopa Responsiveness

- Compare `updrs_off` across classes; boxplots show severity distribution.
- `percentuale_risposta_ldopa` is a key differentiator (lower in MSA than in PD); a reference threshold is marked at ~30%.

*Figure placeholder: UPDRS OFF and L‑Dopa response comparison (image to be inserted).*


## 9. MSA Red Flags and Hoehn & Yahr Severity

- Red flag counts (`n_red_flags_msa` and clinician-certified variant) contrasted across classes.
- Disease stage distribution via Hoehn & Yahr stacked bars.

![Red flags and H&Y severity](../images/red_flags_severity.png)


## 10. Diagnostic Delay and Progression

- Inspect `ritardo_diagnostico` and related timing variables to profile progression differences.

![Diagnostic delay and progression](../images/diagnostic_delay_progression.png)


## 11. MRI Abnormalities (Supportive Features)

- Total abnormality count by diagnosis; specific signs: hot‑cross‑bun, putamen atrophy/signal changes.
- Clinical reading: MSA‑C shows pontocerebellar signs; MSA‑P shows putaminal changes; PD often near-normal MRI.

![MRI findings comparison](../images/mri_findings_comparison.png)


## 12. Autonomic Dysfunction Profile (COMPASS)

- Compare subscales (`compass_oh`, `compass_vasc`, `compass_sudor`, `compass_gi`, `compass_uin`, `compass_pupil`) across classes.

![COMPASS subscales comparison](../images/compass_subscales_comparison.png)


## 13. Univariate Screening and Mutual Information

- For each feature: appropriate test (ANOVA/Welch/Kruskal for continuous; Chi‑square for binary).
- Effect sizes and p-values aggregated; FDR applied; MI computed as a non‑linear relevance proxy.
- A combined visualization summarizes top features by q-value/effect and MI.
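
For binary features, the chi-square statistic and a mutual-information score can both be read off the same contingency counts. The sketch below uses a simple plug-in MI estimate for discrete variables (in nats); this is a stand-in chosen for illustration, not necessarily the estimator used in the notebook. Function names and data are hypothetical.

```python
import math
from collections import Counter


def chi_square(feature, labels):
    """Pearson chi-square statistic and degrees of freedom."""
    n = len(feature)
    joint = Counter(zip(feature, labels))
    f_tot, l_tot = Counter(feature), Counter(labels)
    stat = 0.0
    for f in f_tot:
        for l in l_tot:
            expected = f_tot[f] * l_tot[l] / n
            stat += (joint[(f, l)] - expected) ** 2 / expected
    dof = (len(f_tot) - 1) * (len(l_tot) - 1)
    return stat, dof


def mutual_information(feature, labels):
    """Plug-in mutual information between two discrete variables (nats)."""
    n = len(feature)
    joint = Counter(zip(feature, labels))
    f_tot, l_tot = Counter(feature), Counter(labels)
    return sum(
        (c / n) * math.log(c * n / (f_tot[f] * l_tot[l]))
        for (f, l), c in joint.items()
    )


# Hypothetical data: one perfectly separating and one uninformative symptom.
labels = ["PD"] * 4 + ["MSA"] * 4
separating = [0, 0, 0, 0, 1, 1, 1, 1]
uninformative = [0, 1, 0, 1, 0, 1, 0, 1]
stat_sep, dof_sep = chi_square(separating, labels)
stat_unf, _ = chi_square(uninformative, labels)
mi_sep = mutual_information(separating, labels)
mi_unf = mutual_information(uninformative, labels)
print(stat_sep, dof_sep, mi_sep, stat_unf, mi_unf)
```

Converting the statistic to a p-value requires the chi-square distribution with `dof` degrees of freedom; the resulting p-values would then go through the FDR step.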

![Feature importance analysis](../images/feature_importance_analysis.png)


## 14. Differential and Within‑Diagnosis Correlations

- Differential correlations: identify pairs whose association differs between diagnoses.
- Within‑diagnosis correlation heatmaps (MSA‑P, MSA‑C, PD) for key clinical variables.
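
One common way to test whether a correlation differs between two groups is Fisher's z-transformation. The sketch below uses that approach as an assumption; the notebook may rely on a different method. Function names and the example correlations are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def fisher_z_difference(r1, n1, r2, n2):
    """z statistic and two-sided p-value for H0: rho1 == rho2."""
    z1, z2 = math.atanh(r1), math.atanh(r2)
    se = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    z = (z1 - z2) / se
    # Two-sided tail probability of the standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p


# Hypothetical: a variable pair correlates at 0.6 in PD (n=50)
# and at -0.1 in MSA (n=40).
z, p = fisher_z_difference(0.6, 50, -0.1, 40)
print(z, p)
```

The test needs more than three patients per group and assumes independent samples, so small classes such as MSA-C call for cautious interpretation.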

![Differential correlations](../images/differential_correlations.png)
![Correlation heatmap (MSA-P)](../images/subplt_htmap_0_MSA-P.png)
![Correlation heatmap (MSA-C)](../images/subplt_htmap_1_MSA-C.png)
![Correlation heatmap (PD)](../images/subplt_htmap_2_PD.png)


## 15. Practical Takeaways for Model Integration

- High‑yield, easy‑to‑collect binaries for fusion with imaging (no image‑label leakage):
  `unexplained_urinary_urge_incontinence`, `constipation`, `anamnestic_oh`, `russamento_osas`, `poor_l_dopa_responsivenes`, `deambulaz_appoggio`.
- Consider adding (if coverage is good): `severe_dysphagia_w_3_yrs`, `cerebellar_syndrome`.
- Avoid radiology signs in the fused tabular input when training a CNN on MRIs (leakage risk); use them for ablation only.
- Suggested fusion: late fusion (CNN embedding ⊕ MLP(tabular)) with robust preprocessing and stratified CV.
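
The suggested late-fusion layout can be illustrated with a forward pass in NumPy. All dimensions, the single hidden layer, and the random untrained weights are hypothetical; the sketch only shows how a CNN embedding and an MLP over tabular features are concatenated before the classification head.

```python
import numpy as np

rng = np.random.default_rng(0)


def relu(x):
    return np.maximum(x, 0.0)


def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def late_fusion_forward(cnn_embedding, tabular, params):
    """Class probabilities from [CNN embedding, MLP(tabular)] concatenated."""
    hidden = relu(tabular @ params["w_tab"] + params["b_tab"])  # MLP(tabular)
    fused = np.concatenate([cnn_embedding, hidden], axis=1)     # concatenation
    return softmax(fused @ params["w_out"] + params["b_out"])


# Hypothetical sizes: 4 patients, 128-d embedding, 6 binary features, 3 classes.
batch, emb_dim, n_tab, hid, n_classes = 4, 128, 6, 16, 3
params = {
    "w_tab": rng.normal(scale=0.1, size=(n_tab, hid)),
    "b_tab": np.zeros(hid),
    "w_out": rng.normal(scale=0.1, size=(emb_dim + hid, n_classes)),
    "b_out": np.zeros(n_classes),
}
embedding = rng.normal(size=(batch, emb_dim))
tabular = rng.integers(0, 2, size=(batch, n_tab)).astype(float)
probs = late_fusion_forward(embedding, tabular, params)
print(probs.shape)
```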


## 16. Reproducibility Notes

- Figures saved under `images/` via `StyleManager` (300 DPI, transparent, tight).
- Class order fixed across plots: `MSA-P`, `MSA-C`, `PD` (and pooled `MSA`).
- Missingness and coverage checks precede any feature selection to avoid biased comparisons.
- All inferential results report 95% CIs and FDR‑adjusted q‑values where relevant.
