# Assignment 10 – Applied Component
Topic: Model Explainability, Fairness, and Interpretability in Predictive Modeling

---

## 1. Theoretical Background

### 1.1 Logistic Regression

Logistic Regression is used for binary classification: it models the probability that a given input belongs to the positive class. It uses the logistic (sigmoid) function to map any real-valued number to the open interval (0, 1):

$$
\displaystyle
P(y = 1 \mid X) \;=\; \frac{1}{\,1 + e^{-(\beta_{0} + \beta_{1}X_{1} + \beta_{2}X_{2} + \cdots + \beta_{n}X_{n})}\,}
$$



Equivalently, in compact summation form:


$$
\displaystyle
P(y = 1 \mid X) \;=\; \frac{1}{\,1 + e^{-(\beta_{0} + \sum_{j=1}^{n} \beta_{j} X_{j})}\,}
$$

The model is trained using **maximum likelihood estimation (MLE)**, which finds the parameters (\(\beta\)) that maximize the probability of observing the given data.
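As a minimal sketch of the two ideas above, the following pure-Python snippet evaluates the sigmoid formula and the log-likelihood that MLE maximizes. The helper names (`sigmoid`, `predict_proba`, `log_likelihood`) and the toy data are illustrative assumptions, not part of the assignment dataset or of scikit-learn's API.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real z to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(beta0, betas, x):
    """P(y = 1 | x) for an intercept beta0 and a list of coefficients betas."""
    z = beta0 + sum(b * xj for b, xj in zip(betas, x))
    return sigmoid(z)

def log_likelihood(beta0, betas, X, y):
    """Sum of log P(y_i | x_i); MLE looks for the betas that maximize this."""
    total = 0.0
    for xi, yi in zip(X, y):
        p = predict_proba(beta0, betas, xi)
        total += yi * math.log(p) + (1 - yi) * math.log(1 - p)
    return total

# Hypothetical toy data: one feature, four observations.
X_toy = [[-2.0], [-1.0], [1.0], [2.0]]
y_toy = [0, 0, 1, 1]

# An uninformative model (all coefficients zero) versus one whose
# positive slope agrees with the labels.
ll_flat = log_likelihood(0.0, [0.0], X_toy, y_toy)
ll_fit = log_likelihood(0.0, [1.5], X_toy, y_toy)
print("log-likelihood, flat model:", ll_flat)
print("log-likelihood, sloped model:", ll_fit)
```

The sloped model assigns higher probability to the observed labels, so its log-likelihood is larger; an optimizer such as the one inside `LogisticRegression` searches for the coefficients that push this quantity as high as possible.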

### 1.2 XGBoost (Extreme Gradient Boosting)

XGBoost is an ensemble learning algorithm based on gradient boosting. It builds decision trees sequentially, with each new tree fitted to correct the errors of the ensemble built so far.

The objective function combines a loss term and a regularization term:

$$
\displaystyle
\mathrm{Obj} \;=\; \sum_{i} \ell\!\left(y_i, \hat{y}_i\right) \;+\; \sum_{k} \Omega\!\left(f_k\right)
$$

with regularization:

$$
\displaystyle
\Omega\!\left(f_k\right) \;=\; \gamma\,T \;+\; \tfrac{1}{2}\,\lambda\,\lVert w \rVert^{2}
$$

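Here, \(T\) is the number of leaves in a tree, \(w\) is the vector of leaf weights, \(\gamma\) penalizes each additional leaf, and \(\lambda\) is the L2 penalty on the leaf weights. The regularization term discourages overly complex trees, so XGBoost stays expressive while limiting overfitting.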
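To make the objective concrete, here is a small sketch that evaluates the loss term and the regularization term for hand-picked values. It assumes a squared-error loss purely for readability (a classifier would typically use a logistic loss), and the function names, leaf weights, and predictions are all hypothetical rather than taken from the XGBoost library.

```python
def tree_penalty(leaf_weights, gamma, lam):
    """Omega(f) = gamma * T + 0.5 * lambda * ||w||^2 for a single tree."""
    num_leaves = len(leaf_weights)
    return gamma * num_leaves + 0.5 * lam * sum(w * w for w in leaf_weights)

def squared_error(y, y_hat):
    """Illustrative loss; an assumption made for this sketch only."""
    return (y - y_hat) ** 2

def objective(y_true, y_pred, trees, gamma, lam, loss=squared_error):
    """Obj = sum_i loss(y_i, y_hat_i) + sum_k Omega(f_k)."""
    data_term = sum(loss(y, p) for y, p in zip(y_true, y_pred))
    reg_term = sum(tree_penalty(w, gamma, lam) for w in trees)
    return data_term + reg_term

# Hypothetical ensemble: each tree is represented only by its leaf weights.
trees = [[0.5, -0.5], [0.2, 0.1, -0.3]]
y_true = [1.0, 0.0, 1.0]
y_pred = [0.8, 0.1, 0.6]

obj = objective(y_true, y_pred, trees, gamma=1.0, lam=2.0)
obj_no_reg = objective(y_true, y_pred, trees, gamma=0.0, lam=0.0)
print("objective with regularization:", obj)
print("objective without regularization:", obj_no_reg)
```

Comparing the two printed values shows how much of the objective comes from tree complexity alone: with \(\gamma > 0\), adding a leaf only pays off if it reduces the loss by more than the penalty it incurs.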

### 1.3 Explainability with SHAP

SHAP (SHapley Additive exPlanations) assigns each feature an importance value for a particular prediction. It is based on cooperative game theory, where feature contributions are computed as:

$$
\displaystyle
\phi_i \;=\; \sum_{S \subseteq F \setminus \{i\}}\; \frac{|S|!\,\left(|F|-|S|-1\right)!}{|F|!}\;\Big[\, f(S \cup \{i\}) \;-\; f(S) \,\Big]
$$

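Here, \(F\) is the set of all features, \(S\) is a subset that excludes feature \(i\), and \(f(S)\) is the model output when only the features in \(S\) are known. The value \(\phi_i\) is therefore the weighted average marginal contribution of feature \(i\) over all such subsets, which explains how each input variable influences the model's output.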
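The Shapley formula can be evaluated exactly by brute force when the number of features is small. The sketch below does this for a hypothetical three-feature value function; the feature names and `toy_model` are illustrative assumptions, and the real `shap` library estimates \(f(S)\) from a model and background data rather than receiving it directly as done here.

```python
from itertools import combinations
from math import factorial

def shapley_values(value_fn, features):
    """Exact Shapley values by enumerating every subset S of the other features."""
    n = len(features)
    phi = {}
    for i in features:
        others = [f for f in features if f != i]
        total = 0.0
        for size in range(n):
            # Weight |S|! (|F| - |S| - 1)! / |F|! depends only on the subset size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                total += weight * (value_fn(s | {i}) - value_fn(s))
        phi[i] = total
    return phi

def toy_model(present):
    """Hypothetical value function: output given the set of known features."""
    score = 0.0
    if "age" in present:
        score += 2.0
    if "bmi" in present:
        score += 1.0
    if "age" in present and "glucose" in present:
        score += 1.0  # interaction effect shared by the two features
    return score

features = ["age", "bmi", "glucose"]
phi = shapley_values(toy_model, features)
for name, value in phi.items():
    print(name, round(value, 4))

# Efficiency property: contributions sum to f(F) - f(empty set).
print(sum(phi.values()), toy_model(frozenset(features)) - toy_model(frozenset()))
```

The interaction bonus is split evenly between the two features that create it, and the contributions add up to the difference between the full-model output and the empty-set baseline. Enumeration costs grow exponentially with the number of features, which is why practical tools rely on approximations or model-specific shortcuts.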

# Step 1 – Import Required Libraries

In [41]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, RocCurveDisplay
import shap


# Step 2 – Load and Explore the Dataset

In [None]:
import pandas as pd
import chardet

# Detect the correct file encoding
with open("____", 'rb') as f:                 # FILL IN ALL BLANKS
    result = chardet.detect(f.read(____))

# Load dataset safely with detected encoding
df = pd.read_csv("____", encoding=result['____'], low_memory=False)

print("File loaded successfully!")
print("Detected encoding:", result['____'])
print("Shape of dataset:", df.shape)
print("\nPreview of data:")
display(df.____())


In [None]:
# Check for missing values
print("Missing values per column:\n", df.____().sum())    # FILL IN ALL BLANKS

# Handle missing data
df = df.____(df.____(numeric_only=True))

display(df.describe())



Interpretation Question:
What would be the possible implications of not handling missing or inconsistent values in this dataset before training predictive models like Logistic Regression or XGBoost?


# Step 3 – Feature Correlation and Target Distribution

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

print("Available columns:\n", df.columns.tolist())  # FILL IN ALL BLANKS

# Identify the likely target column
possible_targets = [col for col in df.columns if '____' in col.upper() or '____' in col.upper() or '____' in col.upper()]
print("\nPossible target columns:", possible_targets)

df.rename(columns={'____': 'Outcome'}, inplace=True)

# Visualize the target distribution
plt.figure(figsize=(8,5))
sns.countplot(x='Outcome', data=df)
plt.title("Outcome Distribution (0 = No Stroke, 1 = Stroke)")
plt.show()


# Step 4 – Split Data and Scale Features

In [None]:
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import StandardScaler

# Drop obvious leakage columns
leakage_cols = [col for col in df.columns if '____' in col or '____' in col or '____' in col]    # FILL IN ALL BLANKS
print("Dropping leakage columns:", leakage_cols)
X = df.drop(columns=____)
y = df['____']

# Encode categorical columns
from sklearn.preprocessing import LabelEncoder
for col in X.columns:
    if X[col].dtype == '____' or X[col].dtype == '____':
        le = LabelEncoder()
        X[col] = le.fit_transform(X[col].astype(str))
X = X.____(0)

# Stratified split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=____, stratify=y, random_state=____
)

# Balance classes with SMOTE
sm = SMOTE(random_state=____)
X_train_bal, y_train_bal = sm.fit_resample(____, ____ )
print("After SMOTE:", y_train_bal.value_counts().to_dict())

# Scale for logistic regression
scaler = StandardScaler()
X_train_scaled = scaler.____(____)
X_test_scaled = scaler.____(____)


Interpretation Question:
Why is it crucial to encode categorical variables before scaling, and what might occur if categorical features remain unencoded during the training of these models?

# Step 5 – Logistic Regression Model

In [None]:
# Train Logistic Regression on the SMOTE-balanced data
log_model = LogisticRegression(max_iter=____, random_state=____)    # FILL IN ALL BLANKS


log_model.____(____, ____ )

y_pred_lr = log_model.____(____)

print("-- Logistic Regression Report --")
print(classification_report(____, ____))
print("ROC-AUC Score:", roc_auc_score(____, log_model.predict_proba(____)[:,1]))


Interpretation Question:
What can we infer about feature relationships from the magnitude and sign of Logistic Regression coefficients, and how does this aid interpretability in a clinical setting?

# Step 6 – XGBoost Model

In [None]:
# Initialize XGBoost classifier
xgb_model = XGBClassifier(
    n_estimators=____,         # FILL IN ALL BLANKS
    learning_rate=____,
    max_depth=____,
    subsample=____,
    colsample_bytree=____,
    random_state=____,
    use_label_encoder=False,   # deprecated in recent XGBoost releases; remove if it triggers a warning
    eval_metric='____'
)

# Train XGBoost model
xgb_model.____(____, ____ )

# Predict on test set
y_pred_xgb = xgb_model.____(____)

print("-- XGBoost Model Report --")
print(classification_report(____, ____))
print("ROC-AUC Score:", roc_auc_score(____, xgb_model.predict_proba(____)[:,1]))


Interpretation Question:
In comparison to Logistic Regression, how does XGBoost’s gradient-boosting mechanism influence the way it learns patterns from the same dataset?

# Step 7 – Model Comparison Visualization

In [None]:
plt.figure(figsize=(____, ____))   # FILL IN ALL BLANKS

# Plot ROC curves for both models
RocCurveDisplay.from_estimator(____, ____, ____, ax=plt.gca(), name="Logistic Regression")
RocCurveDisplay.from_estimator(____, ____, ____, ax=plt.gca(), name="XGBoost")
plt.title("ROC Curve Comparison")
plt.show()


Interpretation Question:
How should we interpret the comparative ROC curves of Logistic Regression and XGBoost, and what does this reveal about model robustness?

# Step 8 – SHAP Explainability for XGBoost

In [None]:
sample_X = X_test.sample(____, random_state=____)   # FILL IN ALL BLANKS
explainer = shap.Explainer(lambda x: xgb_model.predict_proba(x)[:,____], sample_X, algorithm="____")
shap_values = explainer(____)
shap.summary_plot(shap_values.values, sample_X, feature_names=____)

Interpretation Question:
How do SHAP values enhance the interpretability of complex models like XGBoost, and why might they be preferred over traditional feature importance scores?


# Step 9 – Feature Importance Comparison

In [None]:
# Logistic Regression Feature Importance
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

coeffs = pd.DataFrame({
    "Feature": X.columns,
    "Coefficient": log_model.coef_[____]   # FILL IN ALL BLANKS
}).sort_values(by="____", ascending=____)

top_features = pd.concat([coeffs.head(____), coeffs.tail(____)])

plt.figure(figsize=(____, ____))
sns.barplot(x="____", y="____", data=top_features, palette="____")
plt.title("Top & Bottom 15 Feature Importances – Logistic Regression", fontsize=____)
plt.xlabel("____", fontsize=____)
plt.ylabel("____", fontsize=____)
plt.tight_layout()
plt.show()


# Reflection and Discussion:

Question 1:
Can prioritizing model accuracy compromise ethical responsibility in healthcare AI systems?


Question 2:
How could demographic imbalance in the dataset (e.g., age or sex distribution) influence fairness in stroke prediction models?


Question 3:
How can explainability tools like SHAP support ethical compliance and model validation in sensitive domains such as healthcare?


Question 4:
How does handling missing data intersect with privacy and model transparency principles?


Question 5:
What responsibilities do data scientists bear when deploying predictive models in healthcare decision-making pipelines?