## A6: Imputation via Regression for Missing Data 
Submitted by Siddharth Nair, CE22B106

---

### Part A: Data Preprocessing and Imputation
#### Import all the necessary libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import classification_report, accuracy_score
from sklearn.impute import SimpleImputer
import warnings
warnings.filterwarnings('ignore')

#### Understand the data

In [2]:
# Load the UCI Credit Card dataset
df = pd.read_csv(r"C:\Users\Siddharth Nair\OneDrive\Desktop\DA5401 Assignments\A6\UCI_Credit_Card.csv")
# Display basic information about the dataset
print("Dataset shape:", df.shape)
print("\nColumn names:")
print(df.columns.tolist())
print("\nFirst few rows:")
print(df.head())
print("\nData types:")
print(df.dtypes)
print("\nMissing values (original):")
print(df.isnull().sum())


Dataset shape: (30000, 25)

Column names:
['ID', 'LIMIT_BAL', 'SEX', 'EDUCATION', 'MARRIAGE', 'AGE', 'PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6', 'PAY_AMT1', 'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6', 'default.payment.next.month']

First few rows:
   ID  LIMIT_BAL  SEX  EDUCATION  MARRIAGE  AGE  PAY_0  PAY_2  PAY_3  PAY_4  \
0   1    20000.0    2          2         1   24      2      2     -1     -1   
1   2   120000.0    2          2         2   26     -1      2      0      0   
2   3    90000.0    2          2         2   34      0      0      0      0   
3   4    50000.0    2          2         1   37      0      0      0      0   
4   5    50000.0    1          2         1   57     -1      0     -1      0   

   ...  BILL_AMT4  BILL_AMT5  BILL_AMT6  PAY_AMT1  PAY_AMT2  PAY_AMT3  \
0  ...        0.0        0.0        0.0       0.0     689.0       0.0   
1  ...     3272.0     3455.

#### There are no missing values. Hence, artificially introduce missing values in the `AGE` and `BILL_AMT1` columns at rates of 7% and 8% respectively. Rows are chosen uniformly at random, so the mechanism is strictly MCAR, a special case of MAR

In [3]:
# Part A.1: Load and Prepare Data - Artificially introduce MAR missing values
# Set random seed for reproducibility
np.random.seed(42)

# Make a copy of the original dataset
df_original = df.copy()

# Remove the ID column as it's not needed for modeling
df_work = df.drop('ID', axis=1).copy()

# Introduce missing values in AGE and BILL_AMT1 (5-10%); uniform row sampling is MCAR, which also satisfies MAR
missing_percentage_age = 0.07  # 7% missing
missing_percentage_bill = 0.08  # 8% missing

# Create missing indices for AGE
n_missing_age = int(len(df_work) * missing_percentage_age)
missing_idx_age = np.random.choice(df_work.index, size=n_missing_age, replace=False)

# Create missing indices for BILL_AMT1
n_missing_bill = int(len(df_work) * missing_percentage_bill)
missing_idx_bill = np.random.choice(df_work.index, size=n_missing_bill, replace=False)

# Introduce missing values
df_work.loc[missing_idx_age, 'AGE'] = np.nan
df_work.loc[missing_idx_bill, 'BILL_AMT1'] = np.nan

print(f"Introduced {n_missing_age} missing values in AGE column ({missing_percentage_age*100:.1f}%)")
print(f"Introduced {n_missing_bill} missing values in BILL_AMT1 column ({missing_percentage_bill*100:.1f}%)")

print("\nMissing values after introduction:")
print(df_work.isnull().sum()[df_work.isnull().sum() > 0])

print(f"\nDataset shape after preprocessing: {df_work.shape}")
print(f"Target variable: 'default.payment.next.month'")
print(f"Target distribution:")
print(df_work['default.payment.next.month'].value_counts())

# Check skewness of the columns we're imputing
print(f"\nSkewness analysis:")
print(f"AGE skewness: {df_work['AGE'].skew():.3f}")
print(f"BILL_AMT1 skewness: {df_work['BILL_AMT1'].skew():.3f}")
print("(Values > 1 or < -1 indicate significant skewness, where median is preferable)")

Introduced 2100 missing values in AGE column (7.0%)
Introduced 2400 missing values in BILL_AMT1 column (8.0%)

Missing values after introduction:
AGE          2100
BILL_AMT1    2400
dtype: int64

Dataset shape after preprocessing: (30000, 24)
Target variable: 'default.payment.next.month'
Target distribution:
default.payment.next.month
0    23364
1     6636
Name: count, dtype: int64

Skewness analysis:
AGE skewness: 0.731
BILL_AMT1 skewness: 2.628
(Values > 1 or < -1 indicate significant skewness, where median is preferable)
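
The cell above selects rows uniformly at random, which is a Missing Completely At Random (MCAR) mechanism. Below is a minimal sketch of how a MAR mask could be generated instead, with the probability of `AGE` being missing depending on a fully observed column. The synthetic frame `toy`, the helper `make_mar_mask`, and the logistic weighting are illustrative assumptions, not part of the assignment code.

```python
import numpy as np
import pandas as pd

def make_mar_mask(driver, rate, rng):
    """Boolean mask whose missingness probability rises with `driver`.

    `driver` is a fully observed numeric array; `rate` is the target overall missing rate.
    (Hypothetical helper for illustration.)
    """
    z = (driver - driver.mean()) / driver.std()
    weights = 1.0 / (1.0 + np.exp(-z))            # logistic weight in (0, 1)
    probs = weights * rate / weights.mean()       # rescale so the average probability is `rate`
    probs = np.clip(probs, 0.0, 1.0)
    return rng.random(len(driver)) < probs

rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "LIMIT_BAL": rng.lognormal(mean=11.5, sigma=0.8, size=10_000),
    "AGE": rng.integers(21, 70, size=10_000).astype(float),
})

# Missingness in AGE depends on the observed LIMIT_BAL, not on AGE itself
mask = make_mar_mask(toy["LIMIT_BAL"].to_numpy(), rate=0.07, rng=rng)
toy.loc[mask, "AGE"] = np.nan

print(f"Missing rate in AGE: {toy['AGE'].isna().mean():.3f}")
print(f"Mean LIMIT_BAL where AGE missing:  {toy.loc[mask, 'LIMIT_BAL'].mean():.0f}")
print(f"Mean LIMIT_BAL where AGE observed: {toy.loc[~mask, 'LIMIT_BAL'].mean():.0f}")
```

Under this mask, rows with a higher `LIMIT_BAL` lose `AGE` more often, which is the dependence on observed data that distinguishes MAR from MCAR.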


#### Let's impute the missing values via simple imputation using median values

In [7]:
# Part A.2: Imputation Strategy 1 - Simple Imputation (Baseline)
# Create Dataset A with median imputation
dataset_A = df_work.copy()

# Impute missing values with median
median_age = dataset_A['AGE'].median()
median_bill_amt1 = dataset_A['BILL_AMT1'].median()

dataset_A['AGE'] = dataset_A['AGE'].fillna(median_age)
dataset_A['BILL_AMT1'] = dataset_A['BILL_AMT1'].fillna(median_bill_amt1)

print(f"Median imputation completed for Dataset A")
print(f"AGE median value used: {median_age}")
print(f"BILL_AMT1 median value used: {median_bill_amt1}")

# Verify no missing values remain
print(f"\nMissing values in Dataset A after imputation:")
print(dataset_A.isnull().sum().sum())


Median imputation completed for Dataset A
AGE median value used: 34.0
BILL_AMT1 median value used: 22518.5

Missing values in Dataset A after imputation:
0


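#### Why Median over Mean?
1. The median is less affected by extreme values than the mean.
2. `BILL_AMT1` is strongly right-skewed (skewness 2.628), so the median represents a typical value better than the mean; `AGE` is only mildly skewed (0.731), so the choice matters less there.
3. Any single-value fill adds a spike at the fill value and shrinks the variance, but the median places that spike near the bulk of the observed data instead of pulling it toward the tail.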
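
The robustness argument can be checked numerically. This is a small sketch on synthetic data; the lognormal sample and the 25 injected extreme values are assumptions chosen only to mimic a right-skewed amount column.

```python
import numpy as np

rng = np.random.default_rng(0)
# Right-skewed sample standing in for a bill-amount style column (assumed lognormal)
values = rng.lognormal(mean=10.0, sigma=1.0, size=5_000)

mean_before, median_before = values.mean(), np.median(values)

# Append a handful of extreme values and recompute both statistics
with_outliers = np.concatenate([values, np.full(25, 5_000_000.0)])
mean_after, median_after = with_outliers.mean(), np.median(with_outliers)

mean_shift = abs(mean_after - mean_before) / mean_before
median_shift = abs(median_after - median_before) / median_before
print(f"Relative shift in mean:   {mean_shift:.2%}")
print(f"Relative shift in median: {median_shift:.2%}")
```

The mean moves substantially while the median barely changes, which is why the median is the safer fill value for a skewed column.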

#### Let's impute the missing values using Linear Regression

In [8]:
# Part A.3: Imputation Strategy 2 - Regression Imputation (Linear)
# Create Dataset B with linear regression imputation
dataset_B = df_work.copy()

# Choose BILL_AMT1 for regression imputation (as it has more missing values and may have better predictive relationships)
target_column = 'BILL_AMT1'
print(f"Chosen column for regression imputation: {target_column}")

# First, handle missing values in AGE with median for Dataset B (to use as predictor)
dataset_B['AGE'] = dataset_B['AGE'].fillna(dataset_B['AGE'].median())

# Separate rows with and without missing BILL_AMT1
has_bill_amt1 = dataset_B[target_column].notna()
missing_bill_amt1 = dataset_B[target_column].isna()

print(f"Rows with {target_column} values: {has_bill_amt1.sum()}")
print(f"Rows missing {target_column} values: {missing_bill_amt1.sum()}")

# Prepare feature matrix (all columns except target and outcome variable)
feature_columns = [col for col in dataset_B.columns if col not in [target_column, 'default.payment.next.month']]
print(f"Features used for prediction: {len(feature_columns)} columns")

# Get training data (non-missing rows)
X_train = dataset_B.loc[has_bill_amt1, feature_columns]
y_train = dataset_B.loc[has_bill_amt1, target_column]

# Get prediction data (missing rows)
X_predict = dataset_B.loc[missing_bill_amt1, feature_columns]

# Train Linear Regression model
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)

# Predict missing values
predicted_values = lr_model.predict(X_predict)

# Fill missing values with predictions
dataset_B.loc[missing_bill_amt1, target_column] = predicted_values

print(f"\nLinear Regression imputation completed for Dataset B")
print(f"R-squared score on training data: {lr_model.score(X_train, y_train):.4f}")

# Verify no missing values remain
print(f"Missing values in Dataset B after imputation: {dataset_B.isnull().sum().sum()}")


Chosen column for regression imputation: BILL_AMT1
Rows with BILL_AMT1 values: 27600
Rows missing BILL_AMT1 values: 2400
Features used for prediction: 22 columns

Linear Regression imputation completed for Dataset B
R-squared score on training data: 0.9300
Missing values in Dataset B after imputation: 0


#### Linear Regression Imputation relies on the following assumptions:
1. The probability of missingness depends only on observed variables, not on the missing values themselves (MAR).
2. Once we condition on the observed features, missingness is random.
3. The relationship between the imputed variable and its predictors is approximately linear.
4. Together, these allow a model fitted on the complete rows to predict the missing values systematically.
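
The same procedure can be wrapped in a reusable function. This is a minimal sketch on synthetic data; the helper name `regression_impute`, the toy columns `x1`, `x2`, `y`, and the chosen coefficients are hypothetical, and the helper assumes the predictor columns are fully observed.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def regression_impute(frame, target, predictors):
    """Fill NaNs in `target` with predictions from a linear model on `predictors`.

    Assumes the predictor columns contain no missing values.
    """
    out = frame.copy()
    observed = out[target].notna()
    model = LinearRegression().fit(out.loc[observed, predictors], out.loc[observed, target])
    out.loc[~observed, target] = model.predict(out.loc[~observed, predictors])
    return out, model

rng = np.random.default_rng(1)
n = 2_000
toy = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
toy["y"] = 3.0 * toy["x1"] - 2.0 * toy["x2"] + rng.normal(scale=0.5, size=n)

truth = toy["y"].copy()
missing_idx = rng.choice(toy.index, size=160, replace=False)  # 8% of rows
toy.loc[missing_idx, "y"] = np.nan

imputed, model = regression_impute(toy, "y", ["x1", "x2"])
rmse = np.sqrt(((imputed.loc[missing_idx, "y"] - truth.loc[missing_idx]) ** 2).mean())
print(f"Coefficients: {model.coef_.round(2)}")
print(f"RMSE on imputed rows: {rmse:.3f}")
```

Because the toy relationship is linear by construction, the error on the imputed rows stays close to the noise level; with a non-linear relationship the same helper would leave a larger residual.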

#### Let's impute the missing values using non-linear regression techniques like K-Nearest Neighbours Regression

In [9]:
# Part A.4: Imputation Strategy 3 - Regression Imputation (Non-Linear)
# Create Dataset C with non-linear regression imputation
dataset_C = df_work.copy()

# Use the same target column (BILL_AMT1) for consistency
target_column = 'BILL_AMT1'
print(f"Chosen column for non-linear regression imputation: {target_column}")

# First, handle missing values in AGE with median for Dataset C
dataset_C['AGE'] = dataset_C['AGE'].fillna(dataset_C['AGE'].median())

# Separate rows with and without missing BILL_AMT1
has_bill_amt1_c = dataset_C[target_column].notna()
missing_bill_amt1_c = dataset_C[target_column].isna()

# Prepare feature matrix
feature_columns = [col for col in dataset_C.columns if col not in [target_column, 'default.payment.next.month']]

# Get training and prediction data
X_train_c = dataset_C.loc[has_bill_amt1_c, feature_columns]
y_train_c = dataset_C.loc[has_bill_amt1_c, target_column]
X_predict_c = dataset_C.loc[missing_bill_amt1_c, feature_columns]

# Try K-Nearest Neighbors Regression for non-linear imputation
knn_model = KNeighborsRegressor(n_neighbors=5, weights='distance')
knn_model.fit(X_train_c, y_train_c)

# Predict missing values
predicted_values_knn = knn_model.predict(X_predict_c)

# Fill missing values with predictions
dataset_C.loc[missing_bill_amt1_c, target_column] = predicted_values_knn

print(f"K-Nearest Neighbors Regression imputation completed for Dataset C")
print(f"Number of neighbors used: 5")
print(f"Weight function: distance-based")

# Also try Decision Tree Regression as alternative
print(f"\n--- Alternative: Decision Tree Regression ---")
dataset_C_dt = df_work.copy()
dataset_C_dt['AGE'] = dataset_C_dt['AGE'].fillna(dataset_C_dt['AGE'].median())

# Decision Tree Regression
dt_model = DecisionTreeRegressor(max_depth=10, min_samples_split=50, random_state=42)
dt_model.fit(X_train_c, y_train_c)
predicted_values_dt = dt_model.predict(X_predict_c)

print(f"Decision Tree parameters: max_depth=10, min_samples_split=50")
print(f"Feature importance (top 5):")
feature_importance = pd.DataFrame({
    'feature': feature_columns,
    'importance': dt_model.feature_importances_
}).sort_values('importance', ascending=False)
print(feature_importance.head())

# Use KNN for Dataset C 
print(f"\nUsing KNN Regression for final Dataset C")

# Verify no missing values remain
print(f"Missing values in Dataset C after imputation: {dataset_C.isnull().sum().sum()}")

Chosen column for non-linear regression imputation: BILL_AMT1
K-Nearest Neighbors Regression imputation completed for Dataset C
Number of neighbors used: 5
Weight function: distance-based

--- Alternative: Decision Tree Regression ---
Decision Tree parameters: max_depth=10, min_samples_split=50
Feature importance (top 5):
      feature  importance
11  BILL_AMT2    0.957370
16   PAY_AMT1    0.020529
21   PAY_AMT6    0.002978
12  BILL_AMT3    0.002818
0   LIMIT_BAL    0.002696

Using KNN Regression for final Dataset C
Missing values in Dataset C after imputation: 0
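
The KNN model above measures distance on raw feature values, so columns with large magnitudes (bill and payment amounts) can dominate the neighbour search. This sketch shows how scaling inside a pipeline changes that; the two synthetic features and their scales are assumptions made for illustration only.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
n = 3_000
small = rng.normal(size=n)                    # informative feature on a unit scale
large = rng.normal(scale=50_000.0, size=n)    # uninformative feature on a currency-like scale
X = np.column_stack([small, large])
y = 10.0 * small + rng.normal(scale=0.5, size=n)

X_fit, y_fit = X[:2_400], y[:2_400]
X_new, y_new = X[2_400:], y[2_400:]

# Same KNN settings as the notebook cell, with and without standardization
raw_knn = KNeighborsRegressor(n_neighbors=5, weights="distance").fit(X_fit, y_fit)
scaled_knn = make_pipeline(
    StandardScaler(),
    KNeighborsRegressor(n_neighbors=5, weights="distance"),
).fit(X_fit, y_fit)

def rmse(model):
    """Root mean squared error on the held-out rows."""
    return float(np.sqrt(np.mean((model.predict(X_new) - y_new) ** 2)))

rmse_raw, rmse_scaled = rmse(raw_knn), rmse(scaled_knn)
print(f"RMSE without scaling: {rmse_raw:.2f}")
print(f"RMSE with scaling:    {rmse_scaled:.2f}")
```

Whether scaling would help on the credit card data depends on which columns carry the signal for `BILL_AMT1`, so this is something to test, not a guaranteed improvement.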


#### All missing values are now imputed using various techniques

---

### Part B: Model Training and Performance Assessment

#### Let's split the data into training and testing sets for all 3 imputed sets. Also, create another dataset by ignoring rows with missing values

In [13]:
# Part B.1: Data Split - Create Dataset D (Listwise Deletion) and split all datasets
# Create Dataset D by removing all rows with any missing values (Listwise Deletion)
dataset_D = df_work.dropna()
print(f"Dataset D (Listwise Deletion) shape: {dataset_D.shape}")
print(f"Original dataset shape: {df_work.shape}")
print(f"Rows removed: {df_work.shape[0] - dataset_D.shape[0]} ({((df_work.shape[0] - dataset_D.shape[0])/df_work.shape[0]*100):.1f}%)")

# Prepare datasets for modeling
datasets = {
    'A (Median Imputation)': dataset_A,
    'B (Linear Regression Imputation)': dataset_B,
    'C (KNN Regression Imputation)': dataset_C,
    'D (Listwise Deletion)': dataset_D
}

# Split each dataset into features and target
splits = {}
for name, data in datasets.items():
    X = data.drop('default.payment.next.month', axis=1)
    y = data['default.payment.next.month']
    
    # Split into train and test sets (80-20 split)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )
    
    splits[name] = {
        'X_train': X_train,
        'X_test': X_test,
        'y_train': y_train,
        'y_test': y_test
    }
    
    print(f"\nDataset {name}:")
    print(f"  Training set: {X_train.shape[0]} samples")
    print(f"  Test set: {X_test.shape[0]} samples")
    print(f"  Features: {X_train.shape[1]} columns")
    print(f"  Training mean: {X_train.mean():}, std: {X_train.std():}")
    print(f"  Test mean: {X_test.mean():}, std: {X_test.std():}")
    print(f"  Target distribution in train: {y_train.value_counts().to_dict()}")


Dataset D (Listwise Deletion) shape: (25676, 24)
Original dataset shape: (30000, 24)
Rows removed: 4324 (14.4%)

Dataset A (Median Imputation):
  Training set: 24000 samples
  Test set: 6000 samples
  Features: 23 columns
  Training mean: LIMIT_BAL    167364.666667
SEX               1.604750
EDUCATION         1.853792
MARRIAGE          1.552875
AGE              35.319917
PAY_0            -0.014125
PAY_2            -0.134083
PAY_3            -0.166917
PAY_4            -0.221333
PAY_5            -0.270333
PAY_6            -0.293333
BILL_AMT1     48933.557875
BILL_AMT2     49012.267583
BILL_AMT3     46861.880083
BILL_AMT4     43156.661458
BILL_AMT5     40164.412625
BILL_AMT6     38675.979875
PAY_AMT1       5623.556292
PAY_AMT2       5879.974917
PAY_AMT3       5215.777583
PAY_AMT4       4790.331833
PAY_AMT5       4769.941750
PAY_AMT6       5229.905500
dtype: float64, std: LIMIT_BAL    129511.313151
SEX               0.488915
EDUCATION         0.792375
MARRIAGE          0.521903
AGE        

#### The features are on very different scales (e.g. `LIMIT_BAL` vs `SEX`), hence standardization is necessary

In [11]:
# Part B.2: Classifier Setup - Standardize features
# Standardize features for all datasets
scalers = {}
scaled_splits = {}

for name in splits.keys():
    # Create and fit scaler on training data
    scaler = StandardScaler()
    scaler.fit(splits[name]['X_train'])
    scalers[name] = scaler
    
    # Transform both training and test data
    X_train_scaled = scaler.transform(splits[name]['X_train'])
    X_test_scaled = scaler.transform(splits[name]['X_test'])
    
    scaled_splits[name] = {
        'X_train': X_train_scaled,
        'X_test': X_test_scaled,
        'y_train': splits[name]['y_train'],
        'y_test': splits[name]['y_test']
    }
    
    print(f"Dataset {name}: Features standardized")
    print(f"  Training mean: {X_train_scaled.mean():.6f}, std: {X_train_scaled.std():.6f}")
    print(f"  Test mean: {X_test_scaled.mean():.6f}, std: {X_test_scaled.std():.6f}")

Dataset A (Median Imputation): Features standardized
  Training mean: 0.000000, std: 1.000000
  Test mean: 0.006028, std: 1.060038
Dataset B (Linear Regression Imputation): Features standardized
  Training mean: 0.000000, std: 1.000000
  Test mean: 0.006319, std: 1.060035
Dataset C (KNN Regression Imputation): Features standardized
  Training mean: 0.000000, std: 1.000000
  Test mean: 0.006304, std: 1.060122
Dataset D (Listwise Deletion): Features standardized
  Training mean: 0.000000, std: 1.000000
  Test mean: 0.004194, std: 0.986466
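
The cell above reports one mean and one standard deviation pooled over all columns. A per-column check is more informative; the sketch below does the standardization by hand with NumPy on synthetic stand-ins for `AGE` and `LIMIT_BAL` (both columns are assumptions), fitting the statistics on the training rows only.

```python
import numpy as np

rng = np.random.default_rng(3)
# Two columns on very different scales (stand-ins for AGE and LIMIT_BAL)
X = np.column_stack([
    rng.normal(35, 9, size=1_000),
    rng.lognormal(11.5, 0.8, size=1_000),
])
X_tr, X_te = X[:800], X[800:]

# Fit the statistics on the training rows only, then reuse them for the test rows
mu = X_tr.mean(axis=0)
sigma = X_tr.std(axis=0)          # population std (ddof=0), the convention StandardScaler uses
X_tr_scaled = (X_tr - mu) / sigma
X_te_scaled = (X_te - mu) / sigma

print("Train column means:", X_tr_scaled.mean(axis=0).round(6))
print("Train column stds: ", X_tr_scaled.std(axis=0).round(6))
print("Test column means: ", X_te_scaled.mean(axis=0).round(3))
print("Test column stds:  ", X_te_scaled.std(axis=0).round(3))
```

Training columns come out with mean 0 and standard deviation 1 exactly, while test columns only come close, which is expected because the test rows did not contribute to `mu` and `sigma`.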


#### Let's train a logistic regression on all 4 datasets and compare their performance

In [14]:
# Part B.3: Model Evaluation - Train and evaluate Logistic Regression models
# Train Logistic Regression models and evaluate
results = {}
models = {}

for name in scaled_splits.keys():
    print(f"\n--- Training Model {name} ---")
    
    # Train Logistic Regression
    lr_classifier = LogisticRegression(random_state=42, max_iter=1000)
    lr_classifier.fit(scaled_splits[name]['X_train'], scaled_splits[name]['y_train'])
    
    # Make predictions
    y_pred = lr_classifier.predict(scaled_splits[name]['X_test'])
    y_pred_proba = lr_classifier.predict_proba(scaled_splits[name]['X_test'])
    
    # Calculate metrics
    accuracy = accuracy_score(scaled_splits[name]['y_test'], y_pred)
    report = classification_report(scaled_splits[name]['y_test'], y_pred, output_dict=True)
    
    # Store results
    models[name] = lr_classifier
    results[name] = {
        'accuracy': accuracy,
        'precision_0': report['0']['precision'],
        'recall_0': report['0']['recall'],
        'f1_0': report['0']['f1-score'],
        'precision_1': report['1']['precision'],
        'recall_1': report['1']['recall'],
        'f1_1': report['1']['f1-score'],
        'macro_precision': report['macro avg']['precision'],
        'macro_recall': report['macro avg']['recall'],
        'macro_f1': report['macro avg']['f1-score'],
        'weighted_f1': report['weighted avg']['f1-score']
    }
    
    print(f"Accuracy: {accuracy:.4f}")
    print("Classification Report:")
    print(classification_report(scaled_splits[name]['y_test'], y_pred))



--- Training Model A (Median Imputation) ---
Accuracy: 0.8080
Classification Report:
              precision    recall  f1-score   support

           0       0.82      0.97      0.89      4673
           1       0.69      0.24      0.36      1327

    accuracy                           0.81      6000
   macro avg       0.75      0.60      0.62      6000
weighted avg       0.79      0.81      0.77      6000


--- Training Model B (Linear Regression Imputation) ---
Accuracy: 0.8085
Classification Report:
              precision    recall  f1-score   support

           0       0.82      0.97      0.89      4673
           1       0.69      0.24      0.36      1327

    accuracy                           0.81      6000
   macro avg       0.75      0.61      0.62      6000
weighted avg       0.79      0.81      0.77      6000


--- Training Model C (KNN Regression Imputation) ---
Accuracy: 0.8087
Classification Report:
              precision    recall  f1-score   support

           0  

#### Let's compare the results 

In [16]:
# Part C.1: Results Comparison - Create summary table
# Create comprehensive results comparison table
comparison_df = pd.DataFrame({
    'Model': list(results.keys()),
    'Accuracy': [results[name]['accuracy'] for name in results.keys()],
    'Precision_Class_0': [results[name]['precision_0'] for name in results.keys()],
    'Recall_Class_0': [results[name]['recall_0'] for name in results.keys()],
    'F1_Class_0': [results[name]['f1_0'] for name in results.keys()],
    'Precision_Class_1': [results[name]['precision_1'] for name in results.keys()],
    'Recall_Class_1': [results[name]['recall_1'] for name in results.keys()],
    'F1_Class_1': [results[name]['f1_1'] for name in results.keys()],
    'Macro_F1': [results[name]['macro_f1'] for name in results.keys()],
    'Weighted_F1': [results[name]['weighted_f1'] for name in results.keys()]
})

# Round values for better presentation
comparison_df = comparison_df.round(4)

print(comparison_df.to_string(index=False))

# Save results to CSV for further analysis
comparison_df.to_csv('model_comparison_results.csv', index=False)
print(f"\nResults saved to 'model_comparison_results.csv'")

# Focus on F1-scores for class 1 (default cases) as this is typically more important
print(f"\n--- KEY INSIGHTS ---")
print(f"Ranking by F1-score for Class 1 (Default Detection):")
f1_class1_ranking = comparison_df.sort_values('F1_Class_1', ascending=False)
for i, (idx, row) in enumerate(f1_class1_ranking.iterrows(), 1):
    print(f"{i}. {row['Model']}: F1 = {row['F1_Class_1']:.4f}")

print(f"\nRanking by Weighted F1-score (Overall Performance):")
weighted_f1_ranking = comparison_df.sort_values('Weighted_F1', ascending=False)
for i, (idx, row) in enumerate(weighted_f1_ranking.iterrows(), 1):
    print(f"{i}. {row['Model']}: Weighted F1 = {row['Weighted_F1']:.4f}")

print(f"\n--- DATASET SIZE IMPACT ---")
print("Training Set Sizes:")
for name in splits.keys():
    train_size = splits[name]['X_train'].shape[0]
    print(f"{name}: {train_size:,} samples")

                           Model  Accuracy  Precision_Class_0  Recall_Class_0  F1_Class_0  Precision_Class_1  Recall_Class_1  F1_Class_1  Macro_F1  Weighted_F1
           A (Median Imputation)    0.8080             0.8178          0.9694      0.8872             0.6898          0.2396      0.3557    0.6214       0.7696
B (Linear Regression Imputation)    0.8085             0.8184          0.9692      0.8874             0.6910          0.2427      0.3592    0.6233       0.7706
   C (KNN Regression Imputation)    0.8087             0.8185          0.9692      0.8875             0.6916          0.2434      0.3601    0.6238       0.7709
           D (Listwise Deletion)    0.8115             0.8201          0.9710      0.8892             0.7107          0.2507      0.3706    0.6299       0.7744

Results saved to 'model_comparison_results.csv'

--- KEY INSIGHTS ---
Ranking by F1-score for Class 1 (Default Detection):
1. D (Listwise Deletion): F1 = 0.3706
2. C (KNN Regression Imputation): F1 =

In [19]:
import plotly.graph_objects as go
import plotly.express as px

# Data: F1-scores for class 1, copied from the comparison table above
methods = ["A: Median", "B: Lin Regress", "C: KNN", "D: Listwise"]
f1_scores = [0.3557, 0.3592, 0.3601, 0.3706]

# Create bar chart
fig = go.Figure()

# Add bars with one distinct color per method
colors = ['#1FB8CD', '#DB4545', '#2E8B57', '#5D878F']

fig.add_trace(go.Bar(
    x=methods,
    y=f1_scores,
    marker_color=colors,
    text=[f'{score:.4f}' for score in f1_scores],  # 4 decimals; at 2 decimals the bars read 0.36/0.36/0.36/0.37
    textposition='outside',
    showlegend=False
))

# Update layout
fig.update_layout(
    title="F1-Score Default Detection",
    xaxis_title="Method",
    yaxis_title="F1-Score"
)

# Update traces for better visualization
fig.update_traces(cliponaxis=False)

fig.show()

In [21]:
import pandas as pd
import plotly.graph_objects as go
import numpy as np

# Metrics copied from the comparison table above
data = {
    "Method": ["A: Median Imputation", "B: Linear Regression", "C: KNN Regression", "D: Listwise Deletion"], 
    "Accuracy": [0.8080, 0.8085, 0.8087, 0.8115], 
    "Precision_Class_1": [0.6898, 0.6910, 0.6916, 0.7107], 
    "Recall_Class_1": [0.2396, 0.2427, 0.2434, 0.2507], 
    "F1_Class_1": [0.3557, 0.3592, 0.3601, 0.3706], 
    "Weighted_F1": [0.7696, 0.7706, 0.7709, 0.7744]
}

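# Use a separate name so the credit card DataFrame `df` loaded earlier is not overwritten
metrics_df = pd.DataFrame(data)

# Prepare data for heatmap - set methods as index
df_viz = metrics_df.set_index('Method')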

# Shorten column names to fit 15 character limit
df_viz.columns = ['Accuracy', 'Precision C1', 'Recall C1', 'F1 C1', 'Weighted F1']

# Shorten method names to fit 15 character limit
df_viz.index = ['A: Median', 'B: Linear', 'C: KNN', 'D: Listwise']

# Create heatmap; with 'viridis', brighter (yellow) cells mark higher values
fig = go.Figure(data=go.Heatmap(
    z=df_viz.values,
    x=df_viz.columns,
    y=df_viz.index,
    colorscale='viridis',  # dark purple = lower values, yellow = higher values
    text=np.round(df_viz.values, 4),
    texttemplate="%{text}",
    textfont={"size": 14, "color": "white"},
    hoverongaps=False,
    colorbar=dict(title="Performance")
))

fig.update_layout(
    title="Method Performance Comparison",
    xaxis_title="Metrics",
    yaxis_title="Methods"
)

fig.update_xaxes(side="bottom")
fig.update_yaxes(side="left")
fig.show()

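##### The model trained on Dataset D scores highest on every reported metric, though the margins are small (F1 for class 1: 0.3706 vs 0.3557 to 0.3601) and its test set differs from the other three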

---

### Part C.2: Efficacy Discussion

#### 1. LISTWISE DELETION vs IMPUTATION TRADE-OFF 

Sample Size Analysis:
- Original dataset: 30,000 samples
- Listwise deletion: 25,676 samples (4,324 lost, 14.4% reduction)
- Imputation methods: 30,000 samples (no loss)

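Performance vs Sample Size Trade-off:
- Despite losing 4,324 samples, Listwise Deletion scored highest, but only by about 0.01 in F1 for class 1 (0.3706 vs 0.3601 for KNN)
- The missing values were injected uniformly at random, so the missingness pattern itself cannot be informative here; the gap more likely reflects imputation noise and the different test sets

Why Listwise Deletion might score higher despite fewer samples:
1. Different Test Set: Model D is evaluated on a split of the 25,676 complete rows, while Models A to C are evaluated on 6,000 rows that include imputed values, so the scores are not strictly comparable
2. Imputation Error: Incorrect imputed values in `AGE` and `BILL_AMT1` add noise to both the training and the test features
3. Sufficient Data: With roughly 20,500 training rows remaining, logistic regression is unlikely to be limited by sample size
4. Random Variation: Differences this small may fall within what a different seed or split would produce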
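
The four models above are each scored on their own test split. Here is a hedged sketch of a comparison on one shared test set: split first, learn the fill value from the training rows only, and score every strategy on identical test rows. The synthetic features `f1` and `f2`, the coefficients, and the 8% missing rate are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(5)
n = 6_000
X = pd.DataFrame({"f1": rng.normal(size=n), "f2": rng.normal(size=n)})
logits = 1.5 * X["f1"] - 1.0 * X["f2"] - 1.0
y = pd.Series((rng.random(n) < 1 / (1 + np.exp(-logits))).astype(int))

# Knock out roughly 8% of f2 completely at random
X.loc[rng.choice(X.index, size=int(0.08 * n), replace=False), "f2"] = np.nan

# Split once so every strategy is scored on identical test rows
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

fill = X_tr["f2"].median()                 # statistic learned from training rows only
X_te_filled = X_te.fillna({"f2": fill})    # test rows must be complete to be scored

train_variants = {
    "median": (X_tr.fillna({"f2": fill}), y_tr),
    "listwise": (X_tr.dropna(), y_tr[X_tr["f2"].notna()]),
}

scores = {}
for name, (X_fit, y_fit) in train_variants.items():
    clf = LogisticRegression(max_iter=1000).fit(X_fit, y_fit)
    scores[name] = f1_score(y_te, clf.predict(X_te_filled))
    print(f"{name:>8}: F1 on shared test set = {scores[name]:.4f}")
```

With this layout, any remaining gap between the strategies reflects the training data alone, because the evaluation rows are the same for both.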

#### 2. LINEAR vs NON-LINEAR REGRESSION COMPARISON 

Performance Comparison:
- Linear Regression: F1-Class1 = 0.3592, Accuracy = 0.8085
- KNN Regression: F1-Class1 = 0.3601, Accuracy = 0.8087
- Performance difference: 0.0009 in F1-Class1 and 0.0002 in accuracy (minimal)

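Why KNN may have scored slightly higher (a gap this small could also be noise):
1. Non-linear relationships: KNN can capture complex, non-linear patterns
2. Local similarity: KNN predicts from the samples most similar to each incomplete row
3. Flexibility: No assumption of a linear relationship between variables
4. Robustness: An extreme observation affects only predictions in its own neighbourhood, not a global fit
5. Caveat: KNN was fitted on unscaled features, so its distances are dominated by the large-magnitude amount columns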

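Linear Regression Analysis:
- R-squared: 0.9300 on the training rows (very high, suggesting strong linear relationships)
- The decision-tree feature importances point the same way: `BILL_AMT2` alone accounts for about 0.96 of the importance
- This explains why Linear and KNN performed similarly
- Strong linear relationships mean linear imputation is already close to what a more flexible model can achieve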
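
The R-squared above is measured on the rows used for fitting. Because the missing values were introduced artificially and `df_original` still holds the true values, imputation quality could also be measured directly on the masked rows. Below is a minimal sketch of that check on synthetic data; the helper `imputation_rmse`, the `prev_bill` predictor, and all numeric settings are hypothetical.

```python
import numpy as np
import pandas as pd

def imputation_rmse(truth, imputed, masked_idx):
    """RMSE between true and imputed values, restricted to the masked rows."""
    diff = imputed.loc[masked_idx] - truth.loc[masked_idx]
    return float(np.sqrt((diff ** 2).mean()))

rng = np.random.default_rng(11)
n = 5_000
prev_bill = rng.lognormal(10.0, 1.0, size=n)     # stand-in for a BILL_AMT2-like predictor
truth = pd.Series(0.95 * prev_bill + rng.normal(scale=2_000.0, size=n), name="bill")

masked_idx = rng.choice(truth.index.to_numpy(), size=400, replace=False)
observed = truth.drop(index=masked_idx)

# Strategy 1: fill every masked row with the observed median
median_fill = truth.copy()
median_fill.loc[masked_idx] = observed.median()

# Strategy 2: one-predictor least-squares fit on the observed rows
slope, intercept = np.polyfit(prev_bill[observed.index.to_numpy()], observed.to_numpy(), deg=1)
linear_fill = truth.copy()
linear_fill.loc[masked_idx] = slope * prev_bill[masked_idx] + intercept

rmse_median = imputation_rmse(truth, median_fill, masked_idx)
rmse_linear = imputation_rmse(truth, linear_fill, masked_idx)
print(f"Median fill RMSE: {rmse_median:,.0f}")
print(f"Linear fill RMSE: {rmse_linear:,.0f}")
```

In the notebook, the analogous comparison would use `df_original.loc[missing_idx_bill, 'BILL_AMT1']` as the truth against the same rows of `dataset_A`, `dataset_B` and `dataset_C`.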

#### 3. OVERALL RECOMMENDATIONS 

Best Overall Performance: D (Listwise Deletion) \
Best Default Detection: D (Listwise Deletion)

Recommendation: LISTWISE DELETION

Justification:
1. Highest score on every reported metric despite the smaller sample size, although the margins are small
2. Simpler approach with no risk of imputation errors
3. The values were removed completely at random, so dropping incomplete rows does not bias the remaining sample
4. 14.4% sample loss is acceptable given the large number of rows that remain

When to use each method:
- Listwise Deletion: When the missing percentage is small (<20%) and values are missing completely at random, so deletion does not bias the sample
- Median Imputation: Quick baseline, interpretable, robust to outliers
- Linear Regression: When strong linear relationships exist (high R²)
- Non-linear Regression: When relationships are complex and non-linear

Cautionary Notes:
1. Results may vary with different missing patterns and percentages
2. Domain knowledge should guide the choice of imputation method
3. Multiple imputation methods could be explored for better results
4. The class imbalance (78% vs 22%) affects all methods similarly
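
The class imbalance is one plausible reason recall for class 1 stays near 0.24 under every strategy. As a hedged sketch of one possible follow-up, the example below compares default and balanced class weights in `LogisticRegression` on synthetic data; the features, coefficients, and intercept are assumptions chosen to give a minority positive class.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(21)
n = 8_000
X = rng.normal(size=(n, 3))
logits = 1.2 * X[:, 0] - 0.8 * X[:, 1] - 1.6   # negative intercept makes class 1 the minority
y = (rng.random(n) < 1 / (1 + np.exp(-logits))).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
balanced = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_tr, y_tr)

recall_plain = recall_score(y_te, plain.predict(X_te))
recall_balanced = recall_score(y_te, balanced.predict(X_te))
print(f"Positive rate: {y.mean():.2f}")
print(f"Recall (class 1), default weights:  {recall_plain:.3f}")
print(f"Recall (class 1), balanced weights: {recall_balanced:.3f}")
```

Balanced weights usually raise recall for the minority class at the cost of precision, so the comparison between imputation strategies would need to be rerun under the same setting.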