# 🏥 Clinical Data Analysis of Breast Cancer (GSE96058)

This notebook focuses on exploring and visualizing clinical characteristics of breast cancer patients in the GSE96058 cohort. It provides an overview of patient- and tumor-level features, treatment information, survival outcomes, and molecular predictions.

---

## 📌 Objectives

- Summarize and visualize important clinical variables grouped by PAM50 molecular subtype.
- Explore associations between subtypes and:
  - Tumor characteristics (e.g., size, grade, receptor status)
  - Treatment patterns (e.g., chemotherapy, endocrine therapy)
  - Survival outcomes
  - In silico predictions from the multigene (MGC) and single-gene (SGC) classifiers

---

## 📊 Variable Groups Analyzed

- **Patient & Tumor Characteristics**  
  Age at diagnosis, tumor size, lymph node status, grade, Ki67 index, and hormone receptor status.

- **Treatment**  
  Chemotherapy and endocrine therapy status.

- **Survival**  
  Overall survival time and censoring status.

- **Molecular Predictions (MGC & SGC)**  
  Model-based predictions of ER, PR, HER2, Ki67, and NHG.

---

## 📈 Visualizations

- **Boxplots** for numeric variables (e.g., age, tumor size)
- **Countplots** for categorical variables (e.g., ER status, treatment type)
- Grouped by PAM50 subtype to reveal subtype-specific clinical trends

---

This analysis supports downstream integration with molecular data (e.g., expression or mutation data), survival modeling, and subtype-specific outcome prediction.


In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import re
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

In [None]:
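# Load the sample metadata and the merged expression table (one column per sample)
df_meta_all = pd.read_csv("./data/df_meta.tsv", sep="\t", index_col=0)
df_zscore = pd.read_csv("./data/df_merged.tsv", sep="\t", index_col=0)
# Keep only the metadata rows whose sample appears in the expression table
df_zscore_meta = df_meta_all[df_meta_all["sample_id"].isin(df_zscore.columns)]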

### 🧪 Survival Analysis by PAM50 Subtype

To compare overall survival across PAM50 subtypes, we estimate Kaplan-Meier survival curves from three inputs:

- **T**: overall survival time (in days)
- **E**: event indicator (1 = death, 0 = censored)
- **Group**: PAM50 subtype

We use `lifelines.KaplanMeierFitter` to fit one survival curve per subtype and draw all curves on the same axes.

The plot shows the estimated survival probability over time for each subtype, which gives a visual comparison of whether certain subtypes (e.g., Basal vs LumA) tend toward shorter or longer survival. The curves alone do not establish that a difference is statistically significant.
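For readers who want to see what the Kaplan-Meier estimate computes, here is a minimal pure-Python sketch of the product-limit formula: at each distinct event time, the running survival probability is multiplied by (1 - deaths / number at risk). The function name `kaplan_meier` and the toy data are illustrative assumptions, not part of the notebook or of `lifelines`, and the sketch omits confidence intervals.

```python
from collections import Counter

def kaplan_meier(durations, events):
    """Return [(time, survival_probability)] at each distinct event time.

    durations: follow-up times; events: 1 = death observed, 0 = censored.
    """
    deaths = Counter(t for t, e in zip(durations, events) if e == 1)
    leaving = Counter(durations)  # everyone leaves the risk set at their own time
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(leaving):
        d = deaths.get(t, 0)
        if d > 0:
            # Product-limit step: fraction of the risk set surviving past t
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        # Deaths and censored patients at time t both leave the risk set
        at_risk -= leaving[t]
    return curve

# Toy data (hypothetical, not from GSE96058): six patients, times in days
toy_T = [100, 200, 200, 300, 400, 500]
toy_E = [1, 0, 1, 1, 0, 1]
toy_curve = kaplan_meier(toy_T, toy_E)
for time, prob in toy_curve:
    print(f"day {time}: S(t) = {prob:.3f}")
```

Censored patients lower the number at risk for later time points without producing a drop in the curve, which is why the event indicator **E** is needed alongside **T**.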

In [None]:
from lifelines import KaplanMeierFitter
from lifelines.statistics import logrank_test, multivariate_logrank_test

# Keep samples with complete survival time (T), event status (E), and PAM50 subtype
surv_cols = ["overall_survival_days", "overall_survival_event", "pam50_subtype"]
surv = df_zscore_meta[surv_cols].dropna()
T = surv["overall_survival_days"]
E = surv["overall_survival_event"]
subtypes = surv["pam50_subtype"]

# Draw every curve on one shared axis
fig, ax = plt.subplots(figsize=(8, 6))
kmf = KaplanMeierFitter()

# Fit and plot a KM curve for each subtype
for subtype in sorted(subtypes.unique()):
    mask = subtypes == subtype
    kmf.fit(T[mask], event_observed=E[mask], label=subtype)
    kmf.plot_survival_function(ax=ax, ci_show=False)

ax.set_title("Kaplan-Meier Survival Curves by PAM50 Subtype")
ax.set_xlabel("Time (days)")
ax.set_ylabel("Survival Probability")
ax.legend(title="Subtype")
fig.tight_layout()
plt.show()
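The cell above imports `logrank_test` and `multivariate_logrank_test` from `lifelines.statistics` but does not call them. To illustrate what a two-group log-rank comparison (e.g., Basal vs LumA) computes, the following is a pure-Python sketch under stated assumptions: the function name `logrank_two_groups` and the toy inputs are hypothetical, and the p-value uses the chi-square distribution with one degree of freedom. For the actual cohort, the `lifelines` functions are the more natural choice.

```python
import math

def logrank_two_groups(t1, e1, t2, e2):
    """Two-sample log-rank test; returns (chi2_statistic, p_value) with 1 df."""
    pooled = zip(list(t1) + list(t2), list(e1) + list(e2))
    event_times = sorted({t for t, e in pooled if e == 1})
    observed_minus_expected = 0.0
    variance = 0.0
    for t in event_times:
        # Patients still under observation just before time t
        n1 = sum(1 for x in t1 if x >= t)
        n2 = sum(1 for x in t2 if x >= t)
        # Deaths occurring exactly at time t
        d1 = sum(1 for x, e in zip(t1, e1) if x == t and e == 1)
        d2 = sum(1 for x, e in zip(t2, e2) if x == t and e == 1)
        n, d = n1 + n2, d1 + d2
        # Expected deaths in group 1 if both groups shared one hazard
        observed_minus_expected += d1 - d * n1 / n
        if n > 1:
            variance += d * (n1 / n) * (n2 / n) * (n - d) / (n - 1)
    if variance == 0:
        return 0.0, 1.0
    chi2 = observed_minus_expected ** 2 / variance
    # Upper tail of a chi-square distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

# Toy groups (hypothetical, not from GSE96058), times in days
group_a_T, group_a_E = [100, 150, 200, 250], [1, 1, 1, 1]
group_b_T, group_b_E = [300, 400, 500, 600], [1, 0, 1, 0]
chi2, p = logrank_two_groups(group_a_T, group_a_E, group_b_T, group_b_E)
print(f"chi2 = {chi2:.3f}, p = {p:.4f}")
```

With five PAM50 subtypes, pairwise tests would need a multiple-comparison correction; a single test across all groups avoids that.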


## 🏥 Clinical Variable Analysis Across PAM50 Subtypes

To better understand how clinical and molecular features vary across breast cancer subtypes, we grouped variables into several categories:

- **Patient & Tumor Characteristics**: age at diagnosis, tumor size, lymph node status, ER/PR/HER2 status, etc.
- **Treatment Information**: whether patients received endocrine therapy or chemotherapy.
- **Survival**: overall survival time and event status.
- **Molecular Predictions**:
  - MGC: multigene classifier predictions for ER, PR, HER2, Ki67, and NHG.
  - SGC: single-gene classifier predictions for ER, PR, HER2, and Ki67.

### 🔍 Visualization Strategy

For each variable group, we visualize distributions across PAM50 subtypes using:

- **Boxplots** for continuous (numeric) variables (e.g., age, tumor size).
- **Countplots** for categorical variables (e.g., ER status, treatment).

This allows us to:

- Identify subtype-specific trends (e.g., whether LumA shows a higher proportion of ER-positive tumors).
- Explore associations between molecular subtype and clinical features.
- Generate hypotheses for downstream predictive modeling or survival analysis.

Each figure below covers one variable group. Within a figure, each subplot is titled with its variable name and plot type.
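A subtype-specific trend such as ER positivity can also be read as a table of within-subtype proportions, which is easier to compare than raw counts when subtype sizes differ. The sketch below uses `pd.crosstab` with `normalize="index"` on a small made-up frame; the `"pos"`/`"neg"` coding and the toy values are assumptions, and the real `er_status` column may be encoded differently.

```python
import pandas as pd

# Toy clinical table (hypothetical values, not from GSE96058)
toy_clin = pd.DataFrame({
    "pam50_subtype": ["LumA", "LumA", "LumA", "LumA", "Basal", "Basal", "Basal"],
    "er_status": ["pos", "pos", "pos", "neg", "neg", "neg", "pos"],
})

# Each row sums to 1: the share of every ER category within one subtype
er_by_subtype = pd.crosstab(
    toy_clin["pam50_subtype"], toy_clin["er_status"], normalize="index"
)
print(er_by_subtype.round(2))
```

The same call applies to any categorical column in the variable groups, with the column name swapped in.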

---


In [None]:
variable_groups = {
    "Patient & Tumor Characteristics": [
        "age_at_diagnosis", "tumor_size", "lymph_node_status", "lymph_node_group",
        "nhg", "ki67_status", "er_status", "pgr_status", "her2_status"
    ],
    "Treatment": [
        "endocrine_treated", "chemo_treated"
    ],
    "Survival": [
        "overall_survival_days", "overall_survival_event"
    ],
    "Molecular Predictions (MGC)": [
        "er_prediction_mgc", "pgr_prediction_mgc", "her2_prediction_mgc", "ki67_prediction_mgc", "nhg_prediction_mgc"
    ],
    "Molecular Predictions (SGC)": [
        "er_prediction_sgc", "pgr_prediction_sgc", "her2_prediction_sgc", "ki67_prediction_sgc"
    ]
}

df = df_zscore_meta.copy()
subtype_col = "pam50_subtype"
ncols = 3

for group_name, variables in variable_groups.items():
    # Enough rows to fit every variable of the group in a 3-column grid
    nrows = (len(variables) + ncols - 1) // ncols
    fig, axes = plt.subplots(nrows, ncols, figsize=(5 * ncols, 4 * nrows), squeeze=False)
    axes = axes.flatten()

    for ax, var in zip(axes, variables):
        if var not in df.columns:
            ax.axis("off")  # variable missing from the metadata
            continue
        if pd.api.types.is_numeric_dtype(df[var]):
            # Numeric variable: distribution per subtype
            sns.boxplot(data=df, x=subtype_col, y=var, ax=ax)
            ax.set_title(f"{var} (Boxplot)")
        else:
            # Categorical variable: counts per category, colored by subtype
            sns.countplot(data=df, x=var, hue=subtype_col, ax=ax)
            ax.set_title(f"{var} (Countplot)")
        ax.tick_params(axis="x", rotation=45)

    # Hide grid cells that have no variable assigned
    for ax in axes[len(variables):]:
        ax.axis("off")

    plt.tight_layout()
    fig.suptitle(group_name, fontsize=16, y=1.02)
    plt.show()
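The loop above chooses the plot type from the column dtype alone, so a status column stored as 0/1 numbers would be drawn as a boxplot. If that turns out to be the case for this metadata, one possible refinement is to treat numeric columns with few distinct values as categorical. The helper `plot_kind` and its `max_levels` threshold below are hypothetical and would need tuning against the real columns.

```python
import pandas as pd

def plot_kind(series, max_levels=10):
    """Return 'box' for continuous numeric data and 'count' otherwise."""
    is_numeric = pd.api.types.is_numeric_dtype(series)
    # Numeric columns with only a few distinct values behave like categories
    if is_numeric and series.nunique(dropna=True) > max_levels:
        return "box"
    return "count"

# Toy columns (hypothetical values, not from GSE96058)
toy_age = pd.Series(range(30, 60))             # 30 distinct ages
toy_er = pd.Series([0, 1, 1, 0, None])         # binary status stored as numbers
toy_grade = pd.Series(["G1", "G2", "G3", "G2"])

for name, col in [("age", toy_age), ("er", toy_er), ("grade", toy_grade)]:
    print(name, "->", plot_kind(col))
```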