# 🧠 Supervised Learning for Breast Cancer Subtype and Clinical Outcome Prediction

This notebook demonstrates a supervised learning pipeline to classify breast cancer subtypes (PAM50) using gene expression data. The same framework can be adapted to predict clinical outcomes such as survival events by changing the input features `X` and target labels `y`.

---

## 📌 Objectives

- Build models to predict **PAM50 molecular subtypes** using gene expression data.
- Evaluate multiple supervised learning methods:
  - Random Forest
  - Logistic Regression
  - Deep Neural Networks (MLP)
- Use performance metrics like:
  - Classification report
  - Confusion matrix
  - Per-class and average accuracy
- Extend the framework to **clinical outcome prediction** (e.g. survival binary classification).

---


In [None]:
import re

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    ConfusionMatrixDisplay,
    accuracy_score,
    classification_report,
    confusion_matrix,
)
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.preprocessing import LabelEncoder

## ⚙️ Feature and Label Preparation

- `X`: Subset of informative genes (e.g. `limma_top_genes`) or dimensionality-reduced expression data.
- `y`:  
  - For subtype classification: `pam50_subtype`  
  - For clinical prediction: `survival_binary` (e.g., long-term survivor vs. early death)
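
The `survival_binary` label can be derived from follow-up time and event status. The sketch below uses a toy metadata table with made-up values; the column names `overall_survival_days` and `overall_survival_event` and the 5-year / 3-year cut-offs follow the commented-out code later in this notebook, and everything else is an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Toy metadata: column names follow the notebook, values are invented
meta = pd.DataFrame({
    "sample_id": ["S1", "S2", "S3", "S4"],
    "overall_survival_days": [2500, 800, 1500, 2000],
    "overall_survival_event": [0, 1, 0, 1],
})

# Long-term survivor: followed for more than 5 years (1825 days) without a death event
long_term = (meta["overall_survival_days"] > 1825) & (meta["overall_survival_event"] == 0)
# Early death: death event within 3 years (1095 days)
early_death = (meta["overall_survival_days"] < 1095) & (meta["overall_survival_event"] == 1)

# Samples matching neither rule stay NaN (ambiguous outcome)
meta["survival_binary"] = np.select([long_term, early_death], [1.0, 0.0], default=np.nan)

# Keep only unambiguously labeled samples before building X and y
labeled = meta.dropna(subset=["survival_binary"])
print(labeled[["sample_id", "survival_binary"]])
```

Samples that fall between the two rules (for example, censored before 5 years) are dropped rather than guessed, so the expression matrix has to be subset to the same samples.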

```python
# Example
X = df_for_clustering.loc[:, limma_top_genes]
y = df_zscore_meta["pam50_subtype"]
```

In [None]:
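# Sample metadata (one row per sample) and z-scored expression (genes x samples)
df_meta_all = pd.read_csv("./data/df_meta.tsv", sep="\t", index_col=0)
df_zscore = pd.read_csv("./data/df_merged.tsv", sep="\t", index_col=0)
df_all = pd.read_csv("./data/GSE96058_gene_expression_3273_samples_and_136_replicates_transformed.csv", sep=",", index_col=0)

# Keep only metadata rows whose sample has expression data
df_zscore_meta = df_meta_all[df_meta_all["sample_id"].isin(df_zscore.columns)]

# Transpose to samples x genes and order the rows like the metadata,
# so that features and labels stay aligned row by row
df_for_clustering = df_zscore.T.loc[df_zscore_meta["sample_id"]]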

In [None]:
X = df_for_clustering

# # Assume df_zscore_meta is your metadata DataFrame
# df_zscore_meta['survival_binary'] = np.nan  # Initialize as NaN
#
# # Long-term survivors: survival > 5 years (1825 days) and no death event
# df_zscore_meta.loc[(df_zscore_meta['overall_survival_days'] > 1825) & (df_zscore_meta['overall_survival_event'] == 0), 'survival_binary'] = 1
#
# # Short-term death: survival < 3 years (1095 days) and death occurred
# df_zscore_meta.loc[(df_zscore_meta['overall_survival_days'] < 1095) & (df_zscore_meta['overall_survival_event'] == 1), 'survival_binary'] = 0

y = df_zscore_meta["pam50_subtype"]

# Encode subtype labels into integers
le = LabelEncoder()
y_encoded = le.fit_transform(y)

# === Step 2: Split training and test sets ===
X_train, X_test, y_train, y_test = train_test_split(
    X, y_encoded, test_size=0.2, random_state=42, stratify=y_encoded
)
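
A single 80/20 split gives one estimate of performance; stratified k-fold cross-validation is a common complement. The sketch below runs on synthetic data from `make_classification` as a stand-in for the expression matrix. The fold count, the scoring choice and the `_demo` names are assumptions, not settings taken from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for (X, y_encoded): 300 samples, 50 features, 5 classes
X_demo, y_demo = make_classification(
    n_samples=300, n_features=50, n_informative=10,
    n_classes=5, n_clusters_per_class=1, random_state=42,
)

# Stratified folds keep class proportions similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
clf = RandomForestClassifier(n_estimators=50, random_state=42, class_weight="balanced")

# One balanced-accuracy score per fold
scores = cross_val_score(clf, X_demo, y_demo, cv=cv, scoring="balanced_accuracy")
print(f"Balanced accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

With the real data, `X_demo` and `y_demo` would be replaced by `X_train` and `y_train`, leaving the test set untouched for the final evaluation.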


### 🎯 Model A: Random Forest Classifier

- **Type**: Ensemble Tree-based Classifier  
- **Key Features**:
  - Handles high-dimensional data and nonlinear relationships.
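  - Impurity-based variable importances provide a built-in feature ranking.
  - Averaging many trees reduces variance; adding trees does not by itself cause overfitting, although individual deep trees can still fit noise.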
  - Supports `class_weight="balanced"` to handle class imbalance.

- **Hyperparameters**:
  - `n_estimators=200`: Number of trees
  - `random_state=42`: Ensures reproducibility

- **Evaluation**:
  - Classification report
  - Normalized confusion matrix
  - Per-class accuracy
  - Average accuracy
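
As a rough illustration of the variable-importance point above, the sketch below fits a forest with the same hyperparameters on synthetic data and ranks features by `feature_importances_`. The data and the `gene_i` names are hypothetical placeholders, not genes from this dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data; with shuffle=False the first 5 columns are the informative ones
X_demo, y_demo = make_classification(
    n_samples=400, n_features=30, n_informative=5, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, shuffle=False, random_state=0,
)
gene_names = [f"gene_{i}" for i in range(X_demo.shape[1])]  # hypothetical names

rf_demo = RandomForestClassifier(n_estimators=200, random_state=42, class_weight="balanced")
rf_demo.fit(X_demo, y_demo)

# Impurity-based importances, one value per feature, normalized to sum to 1
importances = pd.Series(rf_demo.feature_importances_, index=gene_names)
top_genes = importances.sort_values(ascending=False).head(10)
print(top_genes)
```

Impurity-based importances can be biased when features are correlated, which is common among co-expressed genes, so the ranking is best read as a screening aid rather than a definitive marker list.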

In [None]:
# === 3. Model A: Random Forest ===
rf = RandomForestClassifier(n_estimators=200, random_state=42, class_weight='balanced')
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)

print("🎯 Random Forest Results:")

# Use the original subtype names (le.classes_) as labels in the report
target_names = [str(cls) for cls in le.classes_]

# Print classification evaluation report
print(classification_report(y_test, y_pred_rf, target_names=target_names))

# Plot confusion matrix
class_names = le.classes_
cm = confusion_matrix(y_test, y_pred_rf, normalize='true')
plt.figure(figsize=(6, 6))
sns.heatmap(cm, annot=True, fmt=".2f", cmap="Blues",
            xticklabels=class_names, yticklabels=class_names)

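plt.xlabel("Predicted subtype")
plt.ylabel("True subtype")
plt.title("Random Forest: row-normalized confusion matrix")
plt.show()

# Per-class accuracy (recall): the diagonal of the row-normalized matrix
per_class_acc = np.diag(cm) / cm.sum(axis=1)

# Average accuracy across classes (balanced accuracy)
average_accuracy = np.mean(per_class_acc)
print(f"Average per-class accuracy: {average_accuracy:.3f}")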
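
The per-class accuracy computed above is the recall of each class, and its mean is what scikit-learn calls balanced accuracy. A small worked example on toy labels (invented for illustration) makes the arithmetic explicit:

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    balanced_accuracy_score,
    confusion_matrix,
    recall_score,
)

# Toy labels: class 0 has 4 samples, class 1 has 2, class 2 has 4
y_true = np.array([0, 0, 0, 0, 1, 1, 2, 2, 2, 2])
y_pred = np.array([0, 0, 0, 1, 1, 0, 2, 2, 2, 2])

# Row-normalized confusion matrix: each row sums to 1
cm_demo = confusion_matrix(y_true, y_pred, normalize="true")
per_class = np.diag(cm_demo)                      # 0.75, 0.5, 1.0

print(per_class.mean())                           # 0.75
print(balanced_accuracy_score(y_true, y_pred))    # 0.75, same quantity
print(recall_score(y_true, y_pred, average=None)) # per-class recall, same as per_class
print(accuracy_score(y_true, y_pred))             # 0.8, plain accuracy
```

Plain accuracy (0.8) is higher than the class average (0.75) because the small class 1 is the one predicted worst, which is why the class average is the more informative number under class imbalance.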


### 📊 Model B: Logistic Regression (L2-regularized)

- **Type**: Linear Model with Multinomial Softmax
- **Key Features**:
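  - Fast and interpretable
  - On comparably scaled features (such as z-scores), coefficient magnitudes indicate each feature's contribution to a class
  - Supports multiclass settings
  - L2 (ridge) regularization is the scikit-learn default penalty and limits overfitting; its strength is set by `C` (smaller `C` means stronger regularization)

- **Hyperparameters**:
  - `max_iter=1000`: Raises the iteration cap so that `lbfgs` can converge
  - `solver='lbfgs'`: Suitable for multinomial loss
  - `multi_class='auto'`: Selects the multinomial formulation for `lbfgs`; this is already the default, and the parameter is deprecated since scikit-learn 1.5, so it can be omitted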

- **Model Interpretation**:
  - Coefficients (`coef_`, one row per class) are extracted and sorted by absolute magnitude.
  - Top features can be linked to known subtype markers.
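
To illustrate how L2 regularization affects the coefficients that are later interpreted, the sketch below fits the same model at three values of `C` on synthetic, standardized data. The data and the chosen `C` values are assumptions for illustration; the notebook itself uses the default `C=1.0`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class problem, standardized so coefficients are comparable
X_demo, y_demo = make_classification(
    n_samples=300, n_features=40, n_informative=8,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)
X_demo = StandardScaler().fit_transform(X_demo)

norms = {}
for C in (0.01, 1.0, 100.0):
    clf = LogisticRegression(C=C, max_iter=5000, solver="lbfgs")
    clf.fit(X_demo, y_demo)
    # coef_ has one row per class and one column per feature
    norms[C] = float(np.linalg.norm(clf.coef_))
    print(f"C={C:<6} coef_ shape={clf.coef_.shape}  L2 norm={norms[C]:.3f}")
```

Smaller `C` shrinks the coefficients toward zero, so a ranking by absolute coefficient depends on the regularization strength as well as on the data.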

---

In [None]:
# === 4. Model B: Logistic Regression (with L2 regularization) ===
logit = LogisticRegression(max_iter=1000, solver='lbfgs', multi_class='auto')
logit.fit(X_train, y_train)
y_pred_logit = logit.predict(X_test)

# Use the original subtype names (le.classes_) as labels in the report
target_names = [str(cls) for cls in le.classes_]

print("\n📊 Logistic Regression Results:")
print(classification_report(y_test, y_pred_logit, target_names=target_names))

# Plot normalized confusion matrix
class_names = le.classes_
cm = confusion_matrix(y_test, y_pred_logit, normalize='true')
plt.figure(figsize=(6, 6))
sns.heatmap(cm, annot=True, fmt=".2f", cmap="Blues",
            xticklabels=class_names, yticklabels=class_names)

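plt.xlabel("Predicted subtype")
plt.ylabel("True subtype")
plt.title("Logistic Regression: row-normalized confusion matrix")
plt.show()

# Per-class accuracy (recall): the diagonal of the row-normalized matrix
per_class_acc = np.diag(cm) / cm.sum(axis=1)

# Average accuracy across classes (balanced accuracy)
average_accuracy = np.mean(per_class_acc)
print(f"Average per-class accuracy: {average_accuracy:.3f}")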


In [None]:
# For multiclass problems, logit.coef_ has shape (n_classes, n_features);
# row i holds the coefficients for class le.classes_[i]
class_idx = 0  # index of the subtype to inspect
coefficients = logit.coef_[class_idx]

# Feature names come from the columns of X (a DataFrame)
feature_names = X.columns

# Create a DataFrame to display feature importance
coef_df = pd.DataFrame({'Feature': feature_names, 'Coefficient': coefficients})

# Compute absolute coefficient values for ranking
coef_df['AbsCoef'] = coef_df['Coefficient'].abs()

# Sort features by absolute coefficient magnitude (descending)
coef_df = coef_df.sort_values(by='AbsCoef', ascending=False)

# Display the sorted coefficient table for the selected class
print(f"Coefficients for class: {le.classes_[class_idx]}")
print(coef_df)
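
Since a multiclass model has one coefficient row per class, it can help to rank features for every class at once. The sketch below does this for a small hand-written coefficient matrix; the class names, feature names and numbers are hypothetical placeholders, not fitted values. With the fitted model, `coef` would be `logit.coef_`, the index `le.classes_` and the columns `X.columns`.

```python
import numpy as np
import pandas as pd

# Hypothetical (n_classes, n_features) coefficient matrix, invented for illustration
class_labels = ["class_A", "class_B", "class_C"]
feature_labels = ["gene_0", "gene_1", "gene_2", "gene_3"]
coef = np.array([
    [-1.2, -0.3,  1.5,  0.4],
    [-0.5,  2.0, -0.2,  0.1],
    [ 1.1, -0.6, -0.9, -0.7],
])

coef_table = pd.DataFrame(coef, index=class_labels, columns=feature_labels)

# For each class, keep the top_n features by absolute coefficient
top_n = 2
top_per_class = {
    cls: row.abs().sort_values(ascending=False).head(top_n).index.tolist()
    for cls, row in coef_table.iterrows()
}
print(top_per_class)
```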


### 🤖 Model C: Feedforward Neural Network (MLP)

- **Type**: Multi-layer Perceptron using Keras
- **Structure**:
  - Input layer → Dense(512) → Dropout
  - → Dense(256) → Dense(128) → Dropout
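  - → Output layer with Softmax over `num_classes` classes
- **Highlights**:
  - Captures complex nonlinear patterns in gene expression data
  - Dropout used to reduce overfitting
  - Class weights are used to balance class frequencies
- **Target in this example**: `overall_survival_event` (binary), which illustrates the extension to clinical outcome prediction; setting `y` to `pam50_subtype` gives the subtype classifier instead
- **Training**:
  - Loss: categorical crossentropy
  - Optimizer: Adam
  - Evaluation: accuracy and classification report on the held-out validation set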
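
The "balanced" class weights follow the formula `n_samples / (n_classes * count_per_class)`. A worked example on toy labels (invented for illustration) shows the values and the dictionary form that Keras expects for `class_weight`:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy integer labels: 6 samples of class 0, 2 of class 1, 2 of class 2
y_demo = np.array([0, 0, 0, 0, 0, 0, 1, 1, 2, 2])
classes = np.unique(y_demo)

weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_demo)

# Same values by hand: 10 / (3 * [6, 2, 2])
manual = len(y_demo) / (len(classes) * np.bincount(y_demo))

# Mapping from integer class label to weight
class_weight_dict = dict(zip(classes.tolist(), weights.tolist()))
print(class_weight_dict)
```

The majority class gets a weight below 1 and the minority classes above 1, so each class contributes roughly equally to the loss.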

In [None]:
import numpy as np
from keras.utils import to_categorical
from sklearn.preprocessing import LabelEncoder

# Inputs prepared earlier in the notebook:
#   df_zscore: z-scored expression matrix (genes x samples);
#              transposed below, expected shape (3238, 16531)
#   df_zscore_meta: metadata for the samples in df_zscore
# Clinical features (X_clinical, 28 columns) are not used in this example

# Use overall survival event as the target
y = df_zscore_meta["overall_survival_event"]

# Encode labels as integers
le = LabelEncoder()
y_int = le.fit_transform(y)

# Number of classes (usually 2: event = 0 or 1)
num_classes = len(le.classes_)

# Convert labels to one-hot encoded format
y_cat = to_categorical(y_int, num_classes)

# Samples x genes matrix, with rows ordered like the metadata so that
# X_expr and y stay aligned row by row
X_expr = df_zscore.T.loc[df_zscore_meta["sample_id"]]
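
For reference, one-hot encoding maps each integer label to a row with a single 1. The NumPy sketch below reproduces the idea on toy labels; it is an illustrative equivalent, not the code this notebook runs, and the `_demo` names are placeholders.

```python
import numpy as np

# Toy integer-encoded labels with 3 classes
y_int_demo = np.array([0, 1, 1, 0, 2])
num_classes_demo = 3

# Row i of the identity matrix is the one-hot vector for class i
y_onehot = np.eye(num_classes_demo, dtype="float32")[y_int_demo]
print(y_onehot)

# argmax along the class axis recovers the integer labels
recovered = y_onehot.argmax(axis=1)
print(recovered)
```

This is also why the evaluation step applies `np.argmax` to both predictions and one-hot targets before computing the classification report.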


In [None]:
from keras.models import Model
from keras.layers import Input, Dense, Dropout
from sklearn.utils.class_weight import compute_class_weight
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import numpy as np

# === Define expression input model ===
# Take the input width from the data instead of hard-coding the gene count
expr_input = Input(shape=(X_expr.shape[1],), name="expr_input")

# Fully connected layers for expression input
x_expr = Dense(512, activation='relu')(expr_input)
x_expr = Dropout(0.3)(x_expr)
x_expr = Dense(256, activation='relu')(x_expr)

# Combine (only expression here; can be extended with clinical input later)
x = x_expr
x = Dense(128, activation='relu')(x)
x = Dropout(0.3)(x)

# Output layer for classification
output = Dense(num_classes, activation='softmax')(x)

# Build and compile the model
model = Model(inputs=expr_input, outputs=output)
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

# Automatically compute class weights to address class imbalance;
# keys are the integer-encoded labels used for training
class_weights = compute_class_weight(class_weight='balanced',
                                     classes=np.unique(y_int),
                                     y=y_int)
class_weights = dict(enumerate(class_weights))

# === Train/Validation split (stratified to preserve class proportions) ===
X_expr_train, X_expr_val, y_train, y_val = train_test_split(
    X_expr, y_cat, test_size=0.2, random_state=42, stratify=y_int)

# === Model training ===
model.fit(X_expr_train,
          y_train,
          validation_data=(X_expr_val, y_val),
          epochs=50,
          class_weight=class_weights,
          batch_size=64)

# === Evaluation ===
y_pred = model.predict(X_expr_val)
y_pred_label = np.argmax(y_pred, axis=1)
y_true_label = np.argmax(y_val, axis=1)

print(classification_report(y_true_label, y_pred_label))
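
For a binary outcome such as the survival event, threshold-free metrics computed from the predicted probabilities can complement the classification report. The sketch below uses toy softmax-style outputs (invented numbers) in place of `model.predict(X_expr_val)`; the choice of ROC AUC and average precision is an assumption, not part of the original evaluation.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

# Toy ground truth and softmax-style outputs: columns are P(class 0), P(class 1)
y_true_demo = np.array([0, 0, 1, 1])
proba_demo = np.array([
    [0.90, 0.10],
    [0.60, 0.40],
    [0.65, 0.35],
    [0.20, 0.80],
])

# Probability assigned to the positive class (event = 1)
p_event = proba_demo[:, 1]

auc = roc_auc_score(y_true_demo, p_event)
ap = average_precision_score(y_true_demo, p_event)
print(f"ROC AUC = {auc:.2f}, average precision = {ap:.2f}")  # 0.75 and 0.83
```

With the trained network, `p_event` would be `y_pred[:, 1]` and `y_true_demo` would be `y_true_label`.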
