# EE331 Machine Learning - Final Project
## Heart Attack Risk Prediction

Download dataset here: https://www.kaggle.com/datasets/iamsouravbanerjee/heart-attack-prediction-dataset

**Project Overview:**
This project involves building machine learning models to predict heart attack risk based on patient health data.

**Three Main Tasks:**
1. **Task 1**: Achieve the best prediction performance
2. **Task 2**: Minimize memory usage while maintaining accuracy ≥ 60%
3. **Task 3**: Achieve the best performance without using neural networks

**Additional Requirements:**
- Error analysis of your models
- Data visualization using clustering

## 1. Library Imports

Import all necessary libraries for data processing, visualization, and machine learning.

**Key libraries:**
- `numpy`, `pandas`: Data manipulation
- `matplotlib`, `seaborn`: Visualization
- `sklearn`: Preprocessing and evaluation tools (You MUST implement all models yourself, without using `sklearn`)

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import confusion_matrix, classification_report
import pickle
import sys
import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)

print("All libraries imported successfully!")

## 2. Data Loading from Google Drive

**Instructions:**
1. Upload the dataset `heart_attack_prediction_dataset.csv` to your Google Drive
2. Update the `DATA_PATH` variable with your file location
3. Run the cell to mount Google Drive and load the dataset

**Expected output:**
- Dataset shape and basic information
- Confirmation of successful loading

In [None]:
# Mount Google Drive and Load Dataset
from google.colab import drive
drive.mount('/content/drive')

# TODO: Update this path to your dataset location in Google Drive
DATA_PATH = '/content/drive/MyDrive/EE331/heart_attack_dataset_updated.csv'

print("\nLoading dataset...")
try:
    df = pd.read_csv(DATA_PATH)
    print(f"  Dataset loaded successfully!")
    print(f"  Shape: {df.shape}")
    print(f"  Samples: {len(df)}")
    print(f"  Features: {len(df.columns)-2} (excluding Patient ID and Label)")
except FileNotFoundError:
    print("  Error: Dataset file not found!")
    print("  Please update DATA_PATH with your correct file location.")
    raise

## 3. Data Analysis

Understanding your data is crucial before building models.

**What to look for:**
- Dataset size and structure
- Data types of each feature
- Missing values
- Label distribution (class imbalance)
- Statistical summary of numerical features

In [None]:
# Data Analysis
print("="*80)
print("DATA ANALYSIS")
print("="*80)

# Display basic information
print("\n1. Dataset Information:")
print(f"   Total samples: {len(df)}")
print(f"   Total columns: {len(df.columns)}")
print(f"\n   Column names:")
for i, col in enumerate(df.columns, 1):
    print(f"   {i:2d}. {col}")

# Check data types
print(f"\n2. Data Types:")
print(df.dtypes)

# Check for missing values
print(f"\n3. Missing Values:")
missing = df.isnull().sum()
if missing.sum() == 0:
    print("   No missing values found!")
else:
    print(missing[missing > 0])

# Label distribution
print(f"\n4. Label Distribution (Heart Attack Risk):")
label_counts = df['Heart Attack Risk'].value_counts()
print(f"   Class 0 (No Risk): {label_counts[0]} ({label_counts[0]/len(df)*100:.2f}%)")
print(f"   Class 1 (Risk):    {label_counts[1]} ({label_counts[1]/len(df)*100:.2f}%)")
print(f"   Class Imbalance Ratio: {label_counts[0]/label_counts[1]:.2f}:1")

# Statistical summary for numerical features
print(f"\n5. Numerical Features Statistics:")
numerical_cols = df.select_dtypes(include=[np.number]).columns
print(df[numerical_cols].describe())

## 4. Data Visualization

Visualize the data to gain insights.

**Visualizations included:**
- Label distribution (bar plot and pie chart)

In [None]:
# Data Visualization
print("="*80)
print("DATA VISUALIZATION")
print("="*80)

# Label distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Bar plot
label_counts.plot(kind='bar', ax=axes[0], color=['green', 'red'])
axes[0].set_title('Heart Attack Risk Distribution')
axes[0].set_xlabel('Heart Attack Risk')
axes[0].set_ylabel('Count')
axes[0].set_xticklabels(['No Risk (0)', 'Risk (1)'], rotation=0)

# Pie chart
axes[1].pie(label_counts, labels=['No Risk (0)', 'Risk (1)'], autopct='%1.1f%%',
            colors=['green', 'red'], startangle=90)
axes[1].set_title('Heart Attack Risk Proportion')

plt.tight_layout()
plt.show()

## 5. Data Preprocessing (This is just an example. You may modify this if you want.)

Clean and transform the data for machine learning.

**Preprocessing steps:**
1. Remove Patient ID (not a predictive feature)
2. Split Blood Pressure into Systolic and Diastolic components

**(We recommend re-encoding it using your own method.)**

In [None]:
# Data Preprocessing
print("="*80)
print("DATA PREPROCESSING")
print("="*80)

# Create a copy for preprocessing
df_processed = df.copy()

# Step 1: Remove Patient ID
print("\n1. Removing Patient ID...")
df_processed = df_processed.drop('Patient ID', axis=1)
print("     Patient ID removed")

# Step 2: Split Blood Pressure into Systolic and Diastolic
print("\n2. Splitting Blood Pressure...")
df_processed[['Systolic_BP', 'Diastolic_BP']] = df_processed['Blood Pressure'].str.split('/', expand=True).astype(int)
df_processed = df_processed.drop('Blood Pressure', axis=1)
print("     Blood Pressure split into Systolic_BP and Diastolic_BP")

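# Step 3: One-hot encode the categorical columns
print("\n3. One-hot encoding categorical data...")
categorical_columns = ["Sex", "Diet", "Country", "Continent", "Hemisphere"]

# Note: the trailing .astype(int) casts the whole DataFrame, so any float-valued
# columns are truncated to integers as well, not just the new dummy columns.
df_processed = pd.get_dummies(
    df_processed,
    columns=categorical_columns,
    drop_first=True
).astype(int)

print("   -> One-hot encoding completed.")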


print("\n4. Processed data summary:")
print(f"   Shape after preprocessing: {df_processed.shape}")
print("\n   Column names after preprocessing:")
for i, col in enumerate(df_processed.columns, 1):
    print(f"   {i:2d}. {col}")

# Check for non-numeric columns (confirms that every model input is numeric)
non_numeric = df_processed.select_dtypes(exclude=[np.number]).columns.tolist()
if len(non_numeric) == 0:
    print("\n   All remaining features are numeric (good for most ML models).")
else:
    print("\n   Warning: Non-numeric columns still present:")
    print("   ", non_numeric)


In [None]:
# 1) All numeric columns
numeric_cols = df_processed.select_dtypes(include=[np.number]).columns

# 2) Find the binary (0/1) columns
binary_cols = [col for col in numeric_cols
               if df_processed[col].dropna().nunique() == 2
               and set(df_processed[col].dropna().unique()) <= {0, 1}]

# 3) Continuous columns (clipping targets) = numeric columns minus binary columns
continuous_cols = [col for col in numeric_cols if col not in binary_cols]
print(continuous_cols)

In [None]:
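# Clip only the continuous columns found above; binary 0/1 columns are left untouched.
# Compute the 1st and 99th percentiles of each column.
# Note: these quantiles use the full dataset; computing them on the training
# split only would avoid leaking test-set information.
lower = df_processed[continuous_cols].quantile(0.01)
upper = df_processed[continuous_cols].quantile(0.99)

# Clip values outside the 1st-99th percentile range, column by column
df_processed[continuous_cols] = df_processed[continuous_cols].clip(lower=lower, upper=upper, axis=1)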


## 6. Train/Test Split (Maintain the train/test split ratio at 8:2)

Split data into training and test sets.

**Split ratio:** 80% train, 20% test

**Important:**
- `stratify=y` maintains class balance in both sets
- Test set simulates unseen data
- **NEVER** TRAIN ON THE TEST DATA

In [None]:
# Train/Test Split
print("="*80)
print("TRAIN/TEST SPLIT")
print("="*80)

# Separate features and labels
X = df_processed.drop('Heart Attack Risk', axis=1)
y = df_processed['Heart Attack Risk']

# Split into train and test sets (80/20)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=RANDOM_SEED, stratify=y
)

print(f"\n  Data split completed:")
print(f"  Training samples:   {len(X_train)} ({len(X_train)/len(X)*100:.1f}%)")
print(f"  Test samples:       {len(X_test)} ({len(X_test)/len(X)*100:.1f}%)")
print(f"\n  Train label distribution:")
print(f"    No Risk (0): {sum(y_train==0)} ({sum(y_train==0)/len(y_train)*100:.1f}%)")
print(f"    Risk (1):    {sum(y_train==1)} ({sum(y_train==1)/len(y_train)*100:.1f}%)")
print(f"\n  Test label distribution:")
print(f"    No Risk (0): {sum(y_test==0)} ({sum(y_test==0)/len(y_test)*100:.1f}%)")
print(f"    Risk (1):    {sum(y_test==1)} ({sum(y_test==1)/len(y_test)*100:.1f}%)")

## 7. Feature engineering (TODO)

Feature engineering is the process of creating new features or transforming existing features to improve model performance. This includes encoding categorical variables, creating interaction features, and feature scaling.
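
As a minimal sketch of the feature-scaling step mentioned above, the snippet below standardizes features with statistics taken from the training split only and then reuses them on the test split. The helper names `fit_standardizer` and `apply_standardizer` and the toy arrays are illustrative assumptions, not part of the project template.

```python
import numpy as np

def fit_standardizer(X_train):
    """Return per-column mean and std computed on the training data only."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant columns
    return mean, std

def apply_standardizer(X, mean, std):
    """Scale X with previously fitted statistics."""
    return (X - mean) / std

# Toy data (hypothetical values, e.g. age and cholesterol)
X_train_toy = np.array([[50.0, 200.0], [60.0, 240.0], [70.0, 280.0]])
X_test_toy = np.array([[65.0, 260.0]])

mean, std = fit_standardizer(X_train_toy)
X_train_scaled = apply_standardizer(X_train_toy, mean, std)
X_test_scaled = apply_standardizer(X_test_toy, mean, std)  # reuses train statistics

print(X_train_scaled.mean(axis=0))
print(X_test_scaled)
```

Fitting on the training split and only applying to the test split keeps test information out of the preprocessing, which is in line with the rule of never training on the test data.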

### Examples of features that require feature engineering
**1. Binary Variables (2 categories)**
- Example: Sex (Male/Female)

**2. Nominal Variables (no inherent order)**
- Examples: Diet, Country, Continent, Hemisphere


1. Age --> split into senior / non-senior
2. Cholesterol --> flag high-cholesterol patients
3. Heart Rate --> flag hypertensive patients (the code below derives this flag from Systolic_BP and Diastolic_BP)
4. Diabetes --> diabetes (already encoded as 0/1)
5. Family History --> 0/1
6. Smoking --> 0/1
7. Obesity --> 0/1
8. Alcohol Consumption --> 0/1
9. Exercise Hours Per Week --> flag people who exercise frequently
10. Previous Heart Problems
11. Medication Use
12. Stress Level --> flag the high-stress risk group
13. Sedentary Hours Per Day --> flag people who sit for long periods
14. Income --> flag high earners
15. BMI --> categorize by BMI
16. Triglycerides
17. Physical Activity Days Per Week
18. Sleep Hours Per Day
19. Heart Attack Risk
20. Systolic_BP
21. Diastolic_BP
22. Sex_Male
23. Diet_Healthy
24. Diet_Unhealthy
25. Country_Australia
26. Country_Brazil
27. Country_Canada
28. Country_China
29. Country_Colombia
30. Country_France
31. Country_Germany
32. Country_India
33. Country_Italy
34. Country_Japan
35. Country_New Zealand
36. Country_Nigeria
37. Country_South Africa
38. Country_South Korea
39. Country_Spain
40. Country_Thailand
41. Country_United Kingdom
42. Country_United States
43. Country_Vietnam
44. Continent_Asia
45. Continent_Australia
46. Continent_Europe
47. Continent_North America
48. Continent_South America
49. Hemisphere_Southern Hemisphere


In [None]:
print("="*80)
print("FEATURE ENGINEERING")
print("="*80)

def add_engineered_features(X):
    """X: DataFrame (train or test)
       Returns: a DataFrame with the new derived features added
    """
    X = X.copy()

    # 1) Senior flag
    X['Is_Senior'] = (X['Age'] >= 65).astype(int)

    # 2) High-cholesterol flag (roughly 240 or above)
    X['High_Chol'] = (X['Cholesterol'] >= 240).astype(int)

    # 3) Hypertension flag
    X['High_BP'] = ((X['Systolic_BP'] >= 140) | (X['Diastolic_BP'] >= 90)).astype(int)

    # 4) Activity score and sedentary-to-activity ratio
    X['Exercise_Frequency'] = (X['Exercise Hours Per Week'] * X['Physical Activity Days Per Week']).astype(int)
    X['Sedentary_per_Activity'] = X['Sedentary Hours Per Day'] / (X['Exercise_Frequency'] + 1e-3)

    # 5) Stress x sedentary time
    X['Stress_Sedentary'] = X['Stress Level'] * X['Sedentary Hours Per Day']

    # 6) Obesity flag based on BMI
    X['BMI_Fat'] = (X["BMI"]>30).astype(int)

    # 7) Assumption: high earners can usually manage their health better, so being
    #    obese despite a high income may indicate especially high cardiovascular risk
    X['HighIncome_Obesity'] = (X['Income'] > 200000).astype(int) * X['BMI_Fat']

    # 8) Age x BMI interaction
    X['Age_BMI_Interaction'] = X['Age'] * X['BMI']

    # 9) People who have hypertension even though they exercise frequently
    X['Exercise_HighBP'] = X['Exercise_Frequency'] * X['High_BP']

    return X


In [None]:
X_train_fe = add_engineered_features(X_train)
X_test_fe  = add_engineered_features(X_test)

print(X_train_fe.columns)

In [None]:
def check_feature_types(df):
    """
    This function checks whether the features are binary (0/1) or continuous variables.

    Args:
    - df: DataFrame with the features to check.

    Returns:
    - A tuple containing lists of binary and continuous feature names.
    """
    binary_features = []    # List to store binary features (0/1)
    continuous_features = []  # List to store continuous features (numeric)

    for col in df.columns:
        # Treat int/bool columns with exactly two distinct values as binary
        if df[col].dtype in ['int64', 'bool'] and df[col].nunique() == 2:
            binary_features.append(col)
        # Check if the feature is a continuous variable (int or float)
        elif df[col].dtype in ['int64', 'float64']:
            continuous_features.append(col)

    print(f"Binary Features (0/1): {binary_features}")
    print(f"Continuous Features: {continuous_features}")

    return binary_features, continuous_features

# Example usage with the engineered training features
binary_features, continuous_features = check_feature_types(X_train_fe)


In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

X_train_fe[continuous_features] = scaler.fit_transform(X_train_fe[continuous_features])
X_test_fe[continuous_features] = scaler.transform(X_test_fe[continuous_features])

print(X_train_fe[continuous_features].head())

## 8. Model Evaluation Functions (This is just an example. You may modify this if you want.)

Pre-built functions to evaluate your models consistently.

**Functions provided:**
- `evaluate_model()`: Calculate accuracy, precision, recall, F1-score
- `plot_confusion_matrix()`: Visualize prediction errors

**Evaluation metrics explained:**
- **Accuracy**: Overall correctness
- **Precision**: Of predicted positives, how many are correct?
- **Recall**: Of actual positives, how many did we find?
- **F1-Score**: Harmonic mean of precision and recall
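
To make the metric definitions above concrete, here is a small worked sketch that computes them directly from confusion-matrix counts with NumPy. The function name `binary_metrics` and the toy labels are illustrative assumptions; the template itself uses the `sklearn.metrics` functions.

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for 0/1 labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # true positives
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))  # true negatives
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # false positives
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # false negatives

    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Toy labels (hypothetical): 3 TP, 3 TN, 1 FP, 1 FN
y_true_toy = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_toy = [1, 0, 0, 1, 0, 1, 1, 0]
toy_metrics = binary_metrics(y_true_toy, y_pred_toy)
print(toy_metrics)  # all four metrics equal 0.75 for this toy case
```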

In [None]:
# Cell 8: Model Evaluation Functions
print("="*80)
print("MODEL EVALUATION FUNCTIONS")
print("="*80)

def evaluate_model(y_true, y_pred, model_name="Model"):
    """
    Evaluate model performance with multiple metrics

    Args:
        y_true: True labels
        y_pred: Predicted labels
        model_name: Name of the model for display

    Returns:
        Dictionary containing all metrics
    """
    accuracy = accuracy_score(y_true, y_pred)
    precision = precision_score(y_true, y_pred, zero_division=0)
    recall = recall_score(y_true, y_pred)
    f1 = f1_score(y_true, y_pred)

    print(f"\n{model_name} Performance:")
    print(f"  Accuracy:  {accuracy:.4f} ({accuracy*100:.2f}%)")
    print(f"  Precision: {precision:.4f}")
    print(f"  Recall:    {recall:.4f}")
    print(f"  F1-Score:  {f1:.4f}")

    # Confusion Matrix
    cm = confusion_matrix(y_true, y_pred)
    print(f"\n  Confusion Matrix:")
    print(f"    TN={cm[0,0]:4d}  FP={cm[0,1]:4d}")
    print(f"    FN={cm[1,0]:4d}  TP={cm[1,1]:4d}")

    return {
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'f1_score': f1,
        'confusion_matrix': cm
    }

def plot_confusion_matrix(cm, model_name="Model"):
    """
    Plot confusion matrix as heatmap

    Args:
        cm: Confusion matrix
        model_name: Name of the model for title
    """
    plt.figure(figsize=(8, 6))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
                xticklabels=['No Risk', 'Risk'],
                yticklabels=['No Risk', 'Risk'])
    plt.title(f'Confusion Matrix - {model_name}')
    plt.ylabel('True Label')
    plt.xlabel('Predicted Label')
    plt.tight_layout()
    plt.show()

print("  Evaluation functions defined:")
print("  - evaluate_model(y_true, y_pred, model_name)")
print("  - plot_confusion_matrix(cm, model_name)")

## 9. Memory Measurement Functions (This is just an example. You may modify this if you want.)

For Task 2, you need to minimize model size.

**Functions provided:**
- `measure_model_memory()`: Get model size
- `evaluate_model_with_memory()`: Combined performance and memory evaluation
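
The sketch below illustrates what a pickle-based size measurement responds to, using a plain dictionary of NumPy arrays as a stand-in for a trained model. The helper `pickled_size_kb` and the stand-in "models" are assumptions made for illustration only.

```python
import pickle
import sys
import numpy as np

def pickled_size_kb(obj):
    """Size of the pickled byte string in KB (payload only)."""
    return len(pickle.dumps(obj)) / 1024

# Hypothetical stand-in models: a weight vector plus a bias
weights64 = {"w": np.zeros(1000, dtype=np.float64), "b": 0.0}
weights32 = {"w": np.zeros(1000, dtype=np.float32), "b": 0.0}

size64 = pickled_size_kb(weights64)
size32 = pickled_size_kb(weights32)

# sys.getsizeof on the bytes object adds a small, constant object overhead
payload = pickle.dumps(weights64)
overhead_bytes = sys.getsizeof(payload) - len(payload)

print(f"float64 weights: {size64:.2f} KB")
print(f"float32 weights: {size32:.2f} KB")
print(f"bytes-object overhead: {overhead_bytes} bytes")
```

Under this kind of measurement, storing fewer parameters or a narrower dtype shrinks the reported size, which is one possible direction for Task 2.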

In [None]:
# Memory Measurement Functions
print("="*80)
print("MEMORY MEASUREMENT FUNCTIONS")
print("="*80)

def measure_model_memory(model):
    """
    Measure model size as the size of its pickled representation

    Args:
        model: Trained model object

    Returns:
        Model size in KB
    """
    model_size = sys.getsizeof(pickle.dumps(model)) / 1024  # Model size in KB
    return model_size

def evaluate_model_with_memory(model, X_test, y_test, model_name="Model"):
    """
    Evaluate model performance and memory usage

    Args:
        model: Trained model
        X_test: Test features
        y_test: Test labels
        model_name: Name of the model

    Returns:
        Dictionary with metrics and memory info
    """
    # Predictions
    y_pred = model.predict(X_test)

    # Performance metrics
    metrics = evaluate_model(y_test, y_pred, model_name)

    # Memory measurement
    memory = measure_model_memory(model)

    metrics['memory'] = memory

    return metrics

print("  Memory measurement functions defined:")
print("  - measure_model_memory(model)")
print("  - evaluate_model_with_memory(model, X_test, y_test, model_name)")




## 10. Model Implementation (TODO) (This is just an example. You may modify this if you want.)

**This is where you implement your models**

**Tips:**
- Start with simple models
- Gradually increase complexity

TODO: Implement your models here

Task 1: Best Prediction Performance
- Goal: Achieve the highest accuracy possible
- Suggested approaches:
  * Try feature engineering
  * Experiment with different hyperparameters

Task 2: Minimize Memory with Accuracy >= 60%
- Goal: Smallest model size while maintaining at least 60% accuracy
- Suggested approaches:
  * Try feature selection

Task 3: Best Performance without Neural Networks
- Goal: Highest accuracy using classical ML algorithms only
- Suggested approaches:
  * Feature engineering and selection

Example Model Implementation Structure:


TODO: Create and train your models here
Example:
model_task1 = YourBestModel()
model_task1.fit(X_train, y_train)

TODO: For Task 2
model_task2 = YourMemoryEfficientModel()
model_task2.fit(X_train, y_train)

TODO: For Task 3, implement non-neural network models
model_task3 = YourBestNonNNModel()
model_task3.fit(X_train, y_train)
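
One possible way to approach the feature selection suggested for Task 2 is to rank features by their absolute correlation with the label and keep only the top few. The sketch below is a simple filter method under that assumption; `top_k_by_correlation` and the synthetic data are hypothetical and not part of the template.

```python
import numpy as np
import pandas as pd

def top_k_by_correlation(X, y, k):
    """Return the names of the k columns with the largest |Pearson correlation| with y."""
    corr = X.apply(lambda col: np.corrcoef(col, y)[0, 1])
    corr = corr.fillna(0.0)  # constant columns give NaN; treat them as uninformative
    return corr.abs().sort_values(ascending=False).head(k).index.tolist()

# Synthetic data for illustration: one informative column, two noise columns
rng = np.random.default_rng(0)
n = 200
y_toy = pd.Series(rng.integers(0, 2, size=n))
X_toy = pd.DataFrame({
    "informative": y_toy + rng.normal(0, 0.1, size=n),
    "noise_a": rng.normal(size=n),
    "noise_b": rng.normal(size=n),
})

selected = top_k_by_correlation(X_toy, y_toy, k=1)
print(selected)
```

To stay consistent with the train/test rule, the ranking would be computed on the training split only and the same column subset then applied to the test split.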

In [None]:
print(X_test_fe.shape)

In [None]:
def make_pca_features(X_train_fe, X_test_fe, n_components=20):
    X_train_np = X_train_fe.values.astype(float)
    X_test_np = X_test_fe.values.astype(float)

    # 1) centering
    mean = X_train_np.mean(axis=0, keepdims=True)
    X_train_centered = X_train_np - mean
    X_test_centered = X_test_np - mean

    # 2) covariance
    cov = np.cov(X_train_centered, rowvar=False)

    # 3) eigen decomposition
    eigvals, eigvecs = np.linalg.eigh(cov)

    # 4) sort by eigenvalue (desc)
    idx = np.argsort(eigvals)[::-1]
    eigvals = eigvals[idx]
    eigvecs = eigvecs[:, idx]

    # 5) explained variance ratio
    explained_variance_ratio = eigvals / eigvals.sum()

    # 6) keep only the top n_components
    eigvecs_top = eigvecs[:, :n_components]
    evr_top = explained_variance_ratio[:n_components]

    X_train_pca_np = X_train_centered @ eigvecs_top
    X_test_pca_np = X_test_centered @ eigvecs_top

    pc_cols = [f"PC{i+1}" for i in range(n_components)]
    X_train_pca = pd.DataFrame(X_train_pca_np, columns=pc_cols, index=X_train_fe.index)
    X_test_pca = pd.DataFrame(X_test_pca_np, columns=pc_cols, index=X_test_fe.index)

    # Loading matrix: rows = original features, columns = principal components
    loadings = pd.DataFrame(
        eigvecs_top,
        index=X_train_fe.columns,
        columns=pc_cols,
    )

    return X_train_pca, X_test_pca, loadings, evr_top

In [None]:
import inspect # Import inspect for get_params
from sklearn.base import BaseEstimator, ClassifierMixin

class LogisticRegressionScratch(BaseEstimator, ClassifierMixin):
    """
    Simple L2-regularized logistic regression (batch gradient descent)
    Improvements: weight initialization, class weights, learning-rate scheduling
    """
    def __init__(self, lr=0.1, epochs=1000, reg_lambda=0.01, class_weight=None, threshold=0.5, random_state=None):
        self.lr = lr
        self.epochs = epochs
        self.reg_lambda = reg_lambda
        self.class_weight = class_weight  # None or 'balanced' or dict
        self.threshold = threshold
        self.random_state = random_state # Added random_state
        self.w = None
        self.b = 0.0

    @staticmethod
    def _sigmoid(z):
        # Clip for numerical stability
        z = np.clip(z, -500, 500)
        return 1 / (1 + np.exp(-z))

    def fit(self, X, y):
        if self.random_state is not None:
            np.random.seed(self.random_state)

        X = np.asarray(X)
        y = np.asarray(y).reshape(-1, 1)
        n_samples, n_features = X.shape

        # Weight initialization: small Gaussian noise (mean 0, std 0.01)
        self.w = np.random.normal(0, 0.01, (n_features, 1))
        self.b = 0.0

        # Compute class weights
        sample_weights = np.ones(n_samples)
        if self.class_weight == 'balanced':
            n_pos = np.sum(y == 1)
            n_neg = np.sum(y == 0)
            if n_pos > 0 and n_neg > 0:
                sample_weights[y.ravel() == 0] = n_samples / (2 * n_neg)
                sample_weights[y.ravel() == 1] = n_samples / (2 * n_pos)
        elif isinstance(self.class_weight, dict):
            sample_weights[y.ravel() == 0] = self.class_weight.get(0, 1.0)
            sample_weights[y.ravel() == 1] = self.class_weight.get(1, 1.0)

        sample_weights = sample_weights.reshape(-1, 1)

        # Store the initial learning rate for scheduling
        initial_lr = self.lr

        for epoch in range(self.epochs):
            z = X @ self.w + self.b
            y_hat = self._sigmoid(z)

            # Gradient of the sample-weighted loss
            error = (y_hat - y) * sample_weights
            grad_w = (X.T @ error) / n_samples + self.reg_lambda * self.w / n_samples
            grad_b = np.mean(error)

            # Learning-rate decay (linear schedule from initial_lr down to 0.5 * initial_lr)
            current_lr = initial_lr * (1 - epoch / self.epochs) * 0.5 + initial_lr * 0.5

            self.w -= current_lr * grad_w
            self.b -= current_lr * grad_b
        return self # Return self for sklearn compatibility

    def predict_proba(self, X):
        X = np.asarray(X)
        return self._sigmoid(X @ self.w + self.b).ravel()

    def predict(self, X):
        proba = self.predict_proba(X)
        return (proba >= self.threshold).astype(int)

    # Implement get_params and set_params for sklearn compatibility
    def get_params(self, deep=True):
        return {
            'lr': self.lr,
            'epochs': self.epochs,
            'reg_lambda': self.reg_lambda,
            'class_weight': self.class_weight,
            'threshold': self.threshold,
            'random_state': self.random_state
        }

    def set_params(self, **parameters):
        for parameter, value in parameters.items():
            setattr(self, parameter, value)
        return self

class DecisionTreeNode:
    def __init__(self, gini, num_samples, num_pos, prediction, random_state=42):
        self.gini = gini
        self.num_samples = num_samples
        self.num_pos = num_pos
        self.prediction = prediction
        self.feature_index = None
        self.threshold = None
        self.left = None
        self.right = None


class DecisionTreeClassifierScratch(BaseEstimator, ClassifierMixin):
    """
    CART decision tree (Gini criterion)
    Improvements: class weights, min_samples_leaf added
    """
    def __init__(self, max_depth=None, min_samples_split=2, min_samples_leaf=1,
                 max_features=None, class_weight=None, random_state=None):
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.min_samples_leaf = min_samples_leaf
        self.max_features = max_features  # None → use all features
        self.class_weight = class_weight  # None or 'balanced' or dict
        self.random_state = random_state # Added random_state
        self.n_features_ = None
        self.root = None

    def _gini(self, y, sample_weights=None):
        if len(y) == 0:
            return 0
        if sample_weights is None:
            p = np.mean(y)
        else:
            total_weight = np.sum(sample_weights)
            if total_weight == 0:
                return 0
            p = np.sum(y * sample_weights) / total_weight
        return 2 * p * (1 - p)

    def _best_split(self, X, y, feature_indices, sample_weights=None):
        m = y.size
        if m < self.min_samples_split:
            return None, None

        best_gini = 1.0
        best_idx, best_thr = None, None

        for idx in feature_indices:
            x_sorted = X[:, idx].argsort()
            X_i = X[x_sorted, idx]
            y_i = y[x_sorted]
            w_i = sample_weights[x_sorted] if sample_weights is not None else None

            # Candidate thresholds: midpoints between adjacent unique values
            unique_vals = np.unique(X_i)
            if unique_vals.size == 1:
                continue
            thresholds = (unique_vals[:-1] + unique_vals[1:]) / 2

            left_count = 0
            left_pos = 0
            left_weight = 0
            right_count = m
            right_pos = np.sum(y_i)
            right_weight = np.sum(w_i) if w_i is not None else m

            j = 0
            for thr in thresholds:
                while j < m and X_i[j] <= thr:
                    left_count += 1
                    left_pos += y_i[j]
                    if w_i is not None:
                        left_weight += w_i[j]
                        right_weight -= w_i[j]
                    right_count -= 1
                    right_pos -= y_i[j]
                    j += 1

                if left_count < self.min_samples_leaf or right_count < self.min_samples_leaf:
                    continue

                gini_left = self._gini(y_i[:left_count], w_i[:left_count] if w_i is not None else None)
                gini_right = self._gini(y_i[left_count:], w_i[left_count:] if w_i is not None else None)

                if w_i is not None:
                    total_weight = left_weight + right_weight
                    if total_weight == 0:
                        continue
                    gini = (left_weight * gini_left + right_weight * gini_right) / total_weight
                else:
                    gini = (left_count * gini_left + right_count * gini_right) / m

                if gini < best_gini:
                    best_gini = gini
                    best_idx = idx
                    best_thr = thr

        return best_idx, best_thr

    def _build(self, X, y, depth, sample_weights=None):
        num_samples = y.size
        if sample_weights is None:
            num_pos = np.sum(y)
            total_weight = num_samples
        else:
            num_pos = np.sum(y * sample_weights)
            total_weight = np.sum(sample_weights)

        node = DecisionTreeNode(
            gini=self._gini(y, sample_weights),
            num_samples=num_samples,
            num_pos=num_pos,
            prediction=1 if num_pos >= total_weight / 2 else 0
        )

        if self.max_depth is not None and depth >= self.max_depth:
            return node
        if num_samples < self.min_samples_split or node.gini == 0.0:
            return node

        if self.max_features is None:
            feat_idx = np.arange(self.n_features_)
        else:
            # Use self.random_state for reproducible feature selection
            rng = np.random.RandomState(self.random_state + depth if self.random_state is not None else None)
            feat_idx = rng.choice(self.n_features_, self.max_features, replace=False)

        idx, thr = self._best_split(X, y, feat_idx, sample_weights)
        if idx is None:
            return node

        indices_left = X[:, idx] <= thr
        X_left, y_left = X[indices_left], y[indices_left]
        X_right, y_right = X[~indices_left], y[~indices_left]
        w_left = sample_weights[indices_left] if sample_weights is not None else None
        w_right = sample_weights[~indices_left] if sample_weights is not None else None

        node.feature_index = idx
        node.threshold = thr
        node.left = self._build(X_left, y_left, depth + 1, w_left)
        node.right = self._build(X_right, y_right, depth + 1, w_right)
        return node

    def fit(self, X, y):
        if self.random_state is not None:
            np.random.seed(self.random_state)

        X = np.asarray(X)
        y = np.asarray(y)
        self.n_features_ = X.shape[1]

        # Compute class weights
        sample_weights = None
        if self.class_weight == 'balanced':
            n_pos = np.sum(y == 1)
            n_neg = np.sum(y == 0)
            n_samples = len(y)
            if n_pos > 0 and n_neg > 0:
                sample_weights = np.ones(n_samples)
                sample_weights[y == 0] = n_samples / (2 * n_neg)
                sample_weights[y == 1] = n_samples / (2 * n_pos)
        elif isinstance(self.class_weight, dict):
            sample_weights = np.ones(len(y))
            sample_weights[y == 0] = self.class_weight.get(0, 1.0)
            sample_weights[y == 1] = self.class_weight.get(1, 1.0)

        self.root = self._build(X, y, depth=0, sample_weights=sample_weights)
        return self # Return self for sklearn compatibility

    def _predict_one(self, x, node):
        if node.feature_index is None:
            return node.prediction
        if x[node.feature_index] <= node.threshold:
            return self._predict_one(x, node.left)
        else:
            return self._predict_one(x, node.right)

    def predict(self, X):
        X = np.asarray(X)
        return np.array([self._predict_one(x, self.root) for x in X])

    # Implement get_params and set_params for sklearn compatibility
    def get_params(self, deep=True):
        # Get all parameters from the constructor
        params = inspect.signature(self.__init__).parameters
        return {param: getattr(self, param) for param in params if param != 'self'}

    def set_params(self, **parameters):
        for parameter, value in parameters.items():
            setattr(self, parameter, value)
        return self


class RandomForestClassifierScratch(BaseEstimator, ClassifierMixin):
    """
    Simple random forest (bagging + random feature subsets)
    Improvements: class weights, min_samples_leaf added
    """
    def __init__(self, n_estimators=20, max_depth=None, min_samples_split=2, min_samples_leaf=1,
                 max_features='sqrt', bootstrap=True, class_weight=None, random_state=42):
        self.n_estimators = n_estimators
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.min_samples_leaf = min_samples_leaf
        self.max_features = max_features  # 'sqrt' or int or None
        self.bootstrap = bootstrap
        self.class_weight = class_weight
        self.random_state = random_state
        self.trees = []

    def _get_max_features(self, n_features):
        if self.max_features == 'sqrt':
            return max(1, int(np.sqrt(n_features)))
        if isinstance(self.max_features, int):
            return min(n_features, self.max_features)
        return n_features

    def fit(self, X, y):
        rng = np.random.RandomState(self.random_state)
        X = np.asarray(X)
        y = np.asarray(y)
        n_samples, n_features = X.shape
        m_features = self._get_max_features(n_features)

        self.trees = []
        for i in range(self.n_estimators):
            if self.bootstrap:
                indices = rng.randint(0, n_samples, n_samples)
            else:
                indices = np.arange(n_samples)
            X_sample = X[indices]
            y_sample = y[indices]

            # Pass random_state to DecisionTreeClassifierScratch for reproducibility
            tree_random_state = self.random_state + i if self.random_state is not None else None

            tree = DecisionTreeClassifierScratch(
                max_depth=self.max_depth,
                min_samples_split=self.min_samples_split,
                min_samples_leaf=self.min_samples_leaf,
                max_features=m_features,
                class_weight=self.class_weight,
                random_state=tree_random_state
            )
            tree.fit(X_sample, y_sample)
            self.trees.append(tree)
        return self # Return self for sklearn compatibility

    def predict(self, X):
        X = np.asarray(X)
        preds = np.array([tree.predict(X) for tree in self.trees])
        # Majority vote: the mean of the 0/1 tree predictions is the fraction of trees voting 1
        votes = np.mean(preds, axis=0)
        return (votes >= 0.5).astype(int)

    # Implement get_params and set_params for sklearn compatibility
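    def get_params(self, deep=True):
        # Collect the constructor arguments by name; sklearn's clone() and GridSearchCV rely on this
        import inspect  # local import so this method works even if `inspect` was not imported earlier
        params = inspect.signature(self.__init__).parameters
        return {param: getattr(self, param) for param in params if param != 'self'}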

    def set_params(self, **parameters):
        for parameter, value in parameters.items():
            setattr(self, parameter, value)
        return self

## 11. Model Evaluation (TODO) (This is just an example. You may modify this if you want.)

After implementing your models, evaluate them here.

**What to do:**
1. Train each model on training data
2. Evaluate on test data using provided functions
3. Plot confusion matrices
4. Compare results across tasks
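
The metrics behind steps 2 and 3 can also be computed directly from the four confusion-matrix counts. The following is a minimal sketch using only NumPy; the helper names `confusion_counts` and `binary_metrics` are hypothetical and are not the notebook's provided evaluation functions. The dictionary keys are chosen to mirror the ones used later in the results summary.

```python
import numpy as np

def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels in {0, 1}."""
    y_true = np.asarray(y_true).astype(int)
    y_pred = np.asarray(y_pred).astype(int)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    return tn, fp, fn, tp

def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for the positive class (label 1)."""
    tn, fp, fn, tp = confusion_counts(y_true, y_pred)
    total = tn + fp + fn + tp
    # Guard against division by zero when a class is never predicted or never present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1_score": f1,
        "confusion_matrix": np.array([[tn, fp], [fn, tp]]),
    }

# Toy labels for illustration only (not project data)
y_true_demo = [1, 0, 1, 1, 0, 0]
y_pred_demo = [1, 0, 0, 1, 1, 0]
demo_metrics = binary_metrics(y_true_demo, y_pred_demo)
print(demo_metrics)
```

Because the dataset may be imbalanced, comparing models on F1 or recall alongside accuracy gives a fuller picture than accuracy alone.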

**Task-specific checks:**
- Task 2: Verify accuracy ≥ 60%
- Task 3: Confirm no neural networks used
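
The Task 2 check can be sketched as follows. How `evaluate_model_with_memory` measures memory is not shown in this section, so this example assumes one possible proxy: the size in bytes of the pickled model parameters. The helper names and the 25-feature parameter dictionary are hypothetical.

```python
import pickle
import numpy as np

def serialized_size_bytes(obj):
    """Size of the pickled object in bytes (one possible memory proxy)."""
    return len(pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL))

def meets_accuracy_floor(y_true, y_pred, floor=0.60):
    """Return (accuracy, passed) for the Task 2 constraint."""
    accuracy = float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))
    return accuracy, accuracy >= floor

# Hypothetical learned parameters of a linear model with 25 features
params64 = {"weights": np.zeros(25, dtype=np.float64), "bias": 0.0}
# Casting weights to float32 halves the raw weight storage
params32 = {"weights": params64["weights"].astype(np.float32), "bias": 0.0}

size64 = serialized_size_bytes(params64)
size32 = serialized_size_bytes(params32)
print(f"float64 parameters: {size64} bytes, float32 parameters: {size32} bytes")

# Toy labels for illustration only
acc, passed = meets_accuracy_floor([1, 0, 1, 0, 1], [1, 0, 0, 0, 1])
print(f"accuracy = {acc:.2f}, meets 60% floor: {passed}")
```

Any size reduction of this kind changes the stored parameters slightly, so the accuracy floor should be re-checked on the test set after the change.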

In [None]:
from sklearn.model_selection import GridSearchCV

print("="*80)
print("DEFINING HYPERPARAMETER GRIDS")
print("="*80)

# Parameter Grid for Logistic Regression
logreg_param_grid = {
    'lr': [0.01, 0.05, 0.1],
    'epochs': [1000, 1500],
    'reg_lambda': [0.001, 0.01, 0.1]
}

# Parameter Grid for Decision Tree
tree_param_grid = {
    'max_depth': [10, 15, 20, None],
    'min_samples_leaf': [1, 2, 4]
}

# Parameter Grid for Random Forest (reduced for quicker execution, can be expanded)
rf_param_grid = {
    'n_estimators': [20, 30, 40],
    'max_depth': [20, None],
    'min_samples_leaf': [1, 2, 4]
}

print("Parameter grids defined for Logistic Regression, Decision Tree, and Random Forest.")

In [None]:
print("="*80)
print("EXECUTING GRID SEARCH FOR HYPERPARAMETER TUNING")
print("="*80)

# --- Logistic Regression GridSearchCV ---
print("\n1. Running GridSearchCV for Logistic Regression...")
logreg_grid_search = GridSearchCV(
    estimator=LogisticRegressionScratch(class_weight='balanced', random_state=RANDOM_SEED),
    param_grid=logreg_param_grid,
    scoring='f1',
    cv=3, # Using 3-fold cross-validation for speed
    n_jobs=-1, # Use all available cores
    verbose=1
)
logreg_grid_search.fit(X_train_fe, y_train)

print("   Best parameters for Logistic Regression:", logreg_grid_search.best_params_)
print("   Best F1 score for Logistic Regression:", logreg_grid_search.best_score_)

# --- Decision Tree GridSearchCV ---
print("\n2. Running GridSearchCV for Decision Tree...")
tree_grid_search = GridSearchCV(
    estimator=DecisionTreeClassifierScratch(class_weight='balanced', random_state=RANDOM_SEED),
    param_grid=tree_param_grid,
    scoring='f1',
    cv=3,
    n_jobs=-1,
    verbose=1
)
tree_grid_search.fit(X_train_fe, y_train)

print("   Best parameters for Decision Tree:", tree_grid_search.best_params_)
print("   Best F1 score for Decision Tree:", tree_grid_search.best_score_)

# --- Random Forest GridSearchCV ---
print("\n3. Running GridSearchCV for Random Forest...")
rf_grid_search = GridSearchCV(
    estimator=RandomForestClassifierScratch(class_weight='balanced', random_state=RANDOM_SEED),
    param_grid=rf_param_grid,
    scoring='f1',
    cv=3,
    n_jobs=-1,
    verbose=1
)
rf_grid_search.fit(X_train_fe, y_train)

print("   Best parameters for Random Forest:", rf_grid_search.best_params_)
print("   Best F1 score for Random Forest:", rf_grid_search.best_score_)

print("\nGrid search completed for all models.")

In [None]:
print("="*80)
print("EVALUATING TUNED MODELS ON TEST DATA")
print("="*80)

results={}

print("\n" + "="*80)
print("TASK 1: BEST PREDICTION PERFORMANCE")
print("="*80)

best_rf_model = rf_grid_search.best_estimator_
metrics_rf_tuned = evaluate_model_with_memory(best_rf_model, X_test_fe, y_test, "RF-Tuned")
results["RF-Tuned"] = metrics_rf_tuned
plot_confusion_matrix(metrics_rf_tuned['confusion_matrix'], "Task 1 Model")

print("\n" + "="*80)
print("TASK 2: MEMORY EFFICIENCY (Accuracy >= 60%)")
print("="*80)

best_logreg_model = logreg_grid_search.best_estimator_
metrics_logreg_tuned = evaluate_model_with_memory(best_logreg_model, X_test_fe, y_test, "LogReg-Tuned")
results["LogReg-Tuned"] = metrics_logreg_tuned
plot_confusion_matrix(metrics_logreg_tuned['confusion_matrix'], "Task 2 Model")

print("\n" + "="*80)
print("TASK 3: BEST PERFORMANCE WITHOUT NEURAL NETWORKS")
print("="*80)

best_tree_model = tree_grid_search.best_estimator_
metrics_tree_tuned = evaluate_model_with_memory(best_tree_model, X_test_fe, y_test, "Tree-Tuned")
results["Tree-Tuned"] = metrics_tree_tuned
plot_confusion_matrix(metrics_tree_tuned['confusion_matrix'], "Task 3 Model")



print("\nEvaluation of tuned models completed.")

## 12. Error Analysis (This is just an example. You may modify this if you want.)

Understand where and why your models fail.

**Analysis includes:**
- False Positives: Predicted risk, but no actual risk
- False Negatives: Predicted no risk, but actual risk
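
The two error types can be isolated with boolean masks and then compared against the overall feature averages. This is a minimal sketch on a made-up six-row table; the column names, values, and the helper `split_errors` are illustrative assumptions, not the project data or the notebook's analysis function.

```python
import numpy as np
import pandas as pd

def split_errors(y_true, y_pred):
    """Return boolean masks (false_positives, false_negatives)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    fp_mask = (y_pred == 1) & (y_true == 0)
    fn_mask = (y_pred == 0) & (y_true == 1)
    return fp_mask, fn_mask

# Made-up feature values for illustration only
features = pd.DataFrame({
    "age": [40, 55, 63, 70, 35, 58],
    "cholesterol": [180, 240, 260, 300, 170, 210],
})
y_true = np.array([0, 1, 1, 1, 0, 0])
y_pred = np.array([0, 0, 1, 1, 1, 0])

fp_mask, fn_mask = split_errors(y_true, y_pred)

# Side-by-side feature means: all rows vs. each error type
comparison = pd.DataFrame({
    "all": features.mean(),
    "false_positives": features.loc[fp_mask].mean(),
    "false_negatives": features.loc[fn_mask].mean(),
})
print(comparison)
```

Converting the labels to NumPy arrays first keeps the masks positional, which avoids index misalignment when the labels are a shuffled pandas Series. In a risk-screening setting, false negatives are usually the more costly error, so features whose false-negative mean differs clearly from the overall mean are a natural place to look first.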

In [None]:
# Sanity check: the feature-engineered test and train sets must have the same number of columns
print(X_test_fe.shape)
print(X_train_fe.shape)

In [None]:
# Cell 12: Error Analysis
print("="*80)
print("ERROR ANALYSIS")
print("="*80)

"""
Analyze where your model makes mistakes
This helps you understand model limitations and potential improvements
"""

def perform_error_analysis(model, X_test, y_test, model_name="Model"):
    """
    Perform detailed error analysis

    Args:
        model: Trained model
        X_test: Test features (numpy array or DataFrame)
        y_test: True labels
        model_name: Name of the model
    """
    print(f"\nError Analysis for {model_name}")
    print("-"*80)

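    # Convert to DataFrame if numpy array; fall back to generic column names
    # when the feature-engineered matrix no longer matches the columns of X
    if isinstance(X_test, np.ndarray):
        if X_test.shape[1] == len(X.columns):
            columns = X.columns
        else:
            columns = [f"feature_{i}" for i in range(X_test.shape[1])]
        X_test_df = pd.DataFrame(X_test, columns=columns)
    else:
        X_test_df = X_test

    # Get predictions
    y_pred = np.asarray(model.predict(X_test))

    # Use a plain array for the labels so the boolean masks select rows by position
    y_true = np.asarray(y_test)

    # Identify errors
    errors = y_pred != y_true
    false_positives = (y_pred == 1) & (y_true == 0)
    false_negatives = (y_pred == 0) & (y_true == 1)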

    print(f"\nError Summary:")
    print(f"  Total errors: {sum(errors)} ({sum(errors)/len(y_test)*100:.2f}%)")
    print(f"  False Positives: {sum(false_positives)} (predicted Risk, actually No Risk)")
    print(f"  False Negatives: {sum(false_negatives)} (predicted No Risk, actually Risk)")

    # Analyze false positives
    if sum(false_positives) > 0:
        print(f"\nFalse Positive Analysis:")
        fp_data = X_test_df[false_positives]
        print(f"  Average feature values for False Positives:")
        print(fp_data.mean())

    # Analyze false negatives
    if sum(false_negatives) > 0:
        print(f"\nFalse Negative Analysis:")
        fn_data = X_test_df[false_negatives]
        print(f"  Average feature values for False Negatives:")
        print(fn_data.mean())

    return {
        'false_positives': sum(false_positives),
        'false_negatives': sum(false_negatives),
        'total_errors': sum(errors)
    }


# The tuned estimators from Section 11 serve as the Task 1-3 models
error_analysis_task1 = perform_error_analysis(
    best_rf_model, X_test_fe, y_test, "Task 1 Model"
)

error_analysis_task2 = perform_error_analysis(
    best_logreg_model, X_test_fe, y_test, "Task 2 Model"
)

error_analysis_task3 = perform_error_analysis(
    best_tree_model, X_test_fe, y_test, "Task 3 Model"
)

## 13. Clustering Visualization (This is just an example. You may modify this if you want.)

In this section, you will implement K-means clustering from scratch and visualize the data using Principal Component Analysis (PCA).

**Goals:**
- Implement K-means clustering algorithm
- Implement PCA for dimensionality reduction
- Reduce dimensionality to 2D for visualization
- Understand natural groupings in the data
- Analyze the relationship between clusters and heart attack risk

**What you need to implement:**
- K-means clustering algorithm (initialization, assignment, update steps)
- PCA for dimensionality reduction (covariance matrix, eigenvalues/eigenvectors)
- Cluster visualization
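
The steps listed above can be sketched in a few lines of NumPy. This is a compact functional outline under stated assumptions, not the required class-based solution: the function names `pca_fit_transform` and `kmeans` are hypothetical, `np.linalg.eigh` is used because a covariance matrix is symmetric, centroids are initialized by sampling data points, and the demo data are synthetic blobs rather than the heart attack dataset.

```python
import numpy as np

def pca_fit_transform(X, n_components=2):
    """Project X onto its top principal components via the covariance matrix."""
    X = np.asarray(X, dtype=float)
    X_centered = X - X.mean(axis=0)                  # 1. center the data
    cov = np.cov(X_centered, rowvar=False)           # 2. covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)           # 3. eigen-decomposition (ascending)
    order = np.argsort(eigvals)[::-1][:n_components] # 4. sort descending, keep top k
    components = eigvecs[:, order].T                 # rows are principal directions
    explained_ratio = eigvals[order] / eigvals.sum()
    return X_centered @ components.T, components, explained_ratio

def kmeans(X, n_clusters=3, max_iters=100, tol=1e-4, seed=42):
    """Lloyd's algorithm: alternate assignment and centroid update."""
    X = np.asarray(X, dtype=float)
    rng = np.random.RandomState(seed)
    # Initialization: pick n_clusters distinct data points as starting centroids
    centroids = X[rng.choice(len(X), n_clusters, replace=False)]
    for _ in range(max_iters):
        # Assignment: Euclidean distance from every point to every centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update: mean of assigned points; keep the old centroid if a cluster is empty
        new_centroids = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(n_clusters)
        ])
        # Convergence: total centroid movement below the tolerance
        converged = np.linalg.norm(new_centroids - centroids) < tol
        centroids = new_centroids
        if converged:
            break
    # Final assignment against the final centroids
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return distances.argmin(axis=1), centroids

# Synthetic demo data: two well-separated blobs in 4 dimensions
demo_rng = np.random.RandomState(0)
X_demo = np.vstack([
    demo_rng.normal(0.0, 0.5, size=(50, 4)),
    demo_rng.normal(5.0, 0.5, size=(50, 4)),
])

X_2d, components, ratio = pca_fit_transform(X_demo, n_components=2)
labels, centroids = kmeans(X_demo, n_clusters=2)
print("Projected shape:", X_2d.shape)
print("Explained variance ratio:", ratio)
print("Cluster sizes:", np.bincount(labels))
```

The sign of each principal direction is arbitrary, so plots from different implementations may appear mirrored. K-means is also sensitive to feature scale, so standardizing the features before clustering is usually worthwhile.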

In [None]:
# Cell 13: Clustering Visualization (Student Implementation)
print("="*80)
print("CLUSTERING VISUALIZATION")
print("="*80)

"""
TODO: Implement K-means clustering and PCA from scratch

You need to implement the following:
1. PCA (Principal Component Analysis)
   - Compute covariance matrix
   - Calculate eigenvalues and eigenvectors
   - Project data onto principal components
2. K-means clustering algorithm
   - Cluster initialization
   - Cluster assignment
   - Centroid update
   - Convergence check
"""

class PCAImplementation:
    """
    TODO: Implement PCA (Principal Component Analysis)

    Your implementation should include:
    - __init__(self, n_components): Initialize parameters
    - fit(self, X): Fit PCA to data (compute principal components)
    - transform(self, X): Project data onto principal components
    - fit_transform(self, X): Fit and transform in one step
    """

    def __init__(self, n_components=2):
        """
        Initialize PCA

        Args:
            n_components: Number of principal components to keep
        """
        self.n_components = n_components
        self.mean = None
        self.components = None  # Principal components (eigenvectors)
        self.explained_variance_ratio = None

        # TODO: Add any additional attributes you need

    def fit(self, X):
        """
        TODO: Fit PCA to the data

        Algorithm steps:
        1. Center the data (subtract mean)
        2. Compute covariance matrix
        3. Calculate eigenvalues and eigenvectors
        4. Sort eigenvectors by eigenvalues (descending)
        5. Select top n_components eigenvectors
        6. Calculate the explained variance ratio of the selected components

        Hint: Use np.linalg.eigh() for the eigenvalue decomposition. The covariance
        matrix is symmetric, so eigh() is the appropriate routine; it returns
        eigenvalues in ascending order.

        Args:
            X: Data matrix (n_samples, n_features)
        """
        # TODO: Implement PCA fitting
        # Step 1: Center the data
        # Step 2: Compute covariance matrix
        # Step 3: Compute eigenvalues and eigenvectors
        # Step 4: Sort eigenvectors by eigenvalues (descending)
        # Step 5: Select top n_components eigenvectors
        # Step 6: Calculate explained variance ratio

        pass

    def transform(self, X):
        """
        TODO: Transform data using fitted principal components

        Args:
            X: Data matrix (n_samples, n_features)

        Returns:
            X_transformed: Projected data (n_samples, n_components)
        """
        # TODO: Project data onto principal components
        # Hint: (X - self.mean) @ self.components.T, with components stored as rows
        #       of shape (n_components, n_features)
        pass

    def fit_transform(self, X):
        """
        Fit PCA and transform data in one step

        Args:
            X: Data matrix

        Returns:
            X_transformed: Projected data
        """
        self.fit(X)
        return self.transform(X)


class KMeansClustering:
    """
    TODO: Implement K-means clustering algorithm

    Your implementation should include:
    - __init__(self, n_clusters, max_iters, random_state): Initialize parameters
    - fit(self, X): Fit the model to data
    - predict(self, X): Predict cluster labels for data
    - _initialize_centroids(self, X): Initialize cluster centroids
    - _assign_clusters(self, X): Assign points to nearest centroid
    - _update_centroids(self, X, labels): Update centroid positions
    - _has_converged(self, old_centroids, new_centroids): Check convergence
    """

    def __init__(self, n_clusters=3, max_iters=100, random_state=42):
        """
        Initialize K-means clustering

        Args:
            n_clusters: Number of clusters
            max_iters: Maximum number of iterations
            random_state: Random seed for reproducibility
        """
        self.n_clusters = n_clusters
        self.max_iters = max_iters
        self.random_state = random_state
        self.centroids = None
        self.labels_ = None

        # TODO: Add any additional attributes you need

    def fit(self, X):
        """
        TODO: Fit K-means to the data

        Algorithm steps:
        1. Initialize centroids randomly
        2. Repeat until convergence or max_iters:
           a. Assign each point to nearest centroid
           b. Update centroids as mean of assigned points
           c. Check for convergence

        Args:
            X: Data matrix (n_samples, n_features)
        """
        np.random.seed(self.random_state)

        # TODO: Implement K-means algorithm
        # Step 1: Initialize centroids
        # Step 2: Iterate until convergence
        #   - Assign points to clusters
        #   - Update centroids
        #   - Check convergence

        pass

    def predict(self, X):
        """
        TODO: Predict cluster labels for data

        Args:
            X: Data matrix (n_samples, n_features)

        Returns:
            labels: Cluster labels for each sample
        """
        # TODO: Assign each point to nearest centroid
        pass

    def _initialize_centroids(self, X):
        """
        TODO: Initialize centroids randomly

        Hint: You can randomly select n_clusters points from X as initial centroids

        Args:
            X: Data matrix

        Returns:
            centroids: Initial centroid positions
        """
        # TODO: Implement centroid initialization
        pass

    def _assign_clusters(self, X):
        """
        TODO: Assign each point to the nearest centroid

        Hint: Calculate Euclidean distance from each point to each centroid
              Assign point to cluster with minimum distance

        Args:
            X: Data matrix

        Returns:
            labels: Cluster assignment for each point
        """
        # TODO: Implement cluster assignment
        pass

    def _update_centroids(self, X, labels):
        """
        TODO: Update centroid positions as mean of assigned points

        Args:
            X: Data matrix
            labels: Current cluster assignments

        Returns:
            centroids: Updated centroid positions
        """
        # TODO: Implement centroid update
        pass

    def _has_converged(self, old_centroids, new_centroids, tolerance=1e-4):
        """
        TODO: Check if centroids have converged

        Hint: Check if the change in centroid positions is below tolerance

        Args:
            old_centroids: Previous centroid positions
            new_centroids: Current centroid positions
            tolerance: Convergence threshold

        Returns:
            Boolean indicating convergence
        """
        # TODO: Implement convergence check
        pass


def visualize_clusters_with_pca(X, y, n_clusters=3):
    """
    Visualize data using your K-means and PCA implementations

    Args:
        X: Feature matrix (numpy array or DataFrame)
        y: Labels (for comparison)
        n_clusters: Number of clusters
    """
    print(f"\nPerforming clustering visualization...")
    print(f"  Number of clusters: {n_clusters}")

    # Convert to numpy array if DataFrame
    if isinstance(X, pd.DataFrame):
        X_array = X.values
    else:
        X_array = X

    # TODO: Use your PCA implementation
    # pca = PCAImplementation(n_components=2)
    # X_pca = pca.fit_transform(X_array)

    # print(f"\n  PCA Explained Variance Ratio:")
    # print(f"    PC1: {pca.explained_variance_ratio[0]*100:.2f}%")
    # print(f"    PC2: {pca.explained_variance_ratio[1]*100:.2f}%")
    # print(f"    Total: {sum(pca.explained_variance_ratio)*100:.2f}%")

    # TODO: Use your K-means implementation
    # kmeans = KMeansClustering(n_clusters=n_clusters, random_state=RANDOM_SEED)
    # kmeans.fit(X_array)
    # cluster_labels = kmeans.predict(X_array)

    # Placeholder for visualization (uncomment after implementing K-means and PCA)
    # Create visualization
    # fig, axes = plt.subplots(1, 2, figsize=(16, 6))

    # # Plot 1: Colored by true labels
    # scatter1 = axes[0].scatter(X_pca[:, 0], X_pca[:, 1], c=y,
    #                            cmap='coolwarm', alpha=0.6, edgecolors='k', linewidth=0.5)
    # axes[0].set_xlabel(f'First Principal Component ({pca.explained_variance_ratio[0]*100:.1f}%)')
    # axes[0].set_ylabel(f'Second Principal Component ({pca.explained_variance_ratio[1]*100:.1f}%)')
    # axes[0].set_title('Data Distribution by True Labels')
    # axes[0].legend(*scatter1.legend_elements(), title="Heart Attack Risk", loc='best')
    # axes[0].grid(True, alpha=0.3)

    # # Plot 2: Colored by cluster labels
    # scatter2 = axes[1].scatter(X_pca[:, 0], X_pca[:, 1], c=cluster_labels,
    #                            cmap='viridis', alpha=0.6, edgecolors='k', linewidth=0.5)
    # axes[1].set_xlabel(f'First Principal Component ({pca.explained_variance_ratio[0]*100:.1f}%)')
    # axes[1].set_ylabel(f'Second Principal Component ({pca.explained_variance_ratio[1]*100:.1f}%)')
    # axes[1].set_title(f'Data Distribution by Clusters (K={n_clusters})')
    # axes[1].legend(*scatter2.legend_elements(), title="Cluster", loc='best')
    # axes[1].grid(True, alpha=0.3)

    # # Plot cluster centers in PCA space
    # centers_pca = pca.transform(kmeans.centroids)
    # axes[1].scatter(centers_pca[:, 0], centers_pca[:, 1],
    #                c='red', marker='X', s=200, edgecolors='black', linewidth=2,
    #                label='Cluster Centers')
    # axes[1].legend()

    # plt.tight_layout()
    # plt.show()

    # # Analyze cluster-label correspondence
    # print(f"\n  Cluster-Label Correspondence:")
    # for i in range(n_clusters):
    #     cluster_mask = cluster_labels == i
    #     cluster_risk_rate = np.mean(y[cluster_mask])
    #     print(f"    Cluster {i}: {sum(cluster_mask)} samples, "
    #           f"Risk rate: {cluster_risk_rate*100:.1f}%")

    # return pca, X_pca

# TODO: After implementing PCA and K-means, test them on training data
# Example usage:
# visualize_clusters_with_pca(X_train, y_train.values, n_clusters=3)
# Try different numbers of clusters: n_clusters=2, 4, 5

## 14. Results Summary and Comparison (This is just an example. You may modify this if you want.)

Compare all your models side-by-side.

**What to compare:**
- Accuracy, Precision, Recall, F1-Score
- Memory usage
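
A comparison table of this kind can be built by keeping only the scalar metrics from each model's result dictionary. The sketch below assumes each metrics dictionary uses the keys `accuracy`, `precision`, `recall`, `f1_score`, and `memory`; the helper name `summarize`, the model names, and all numbers are made-up placeholders, not results from this project.

```python
import numpy as np
import pandas as pd

METRIC_COLS = ("accuracy", "precision", "recall", "f1_score", "memory")

def summarize(results_dict, metric_cols=METRIC_COLS):
    """Build a float-typed table of scalar metrics, best accuracy first."""
    rows = {
        name: {col: metrics[col] for col in metric_cols}
        for name, metrics in results_dict.items()
    }
    table = pd.DataFrame.from_dict(rows, orient="index").astype(float)
    return table.sort_values("accuracy", ascending=False)

# Placeholder values for illustration only
demo_results = {
    "Model A": {"accuracy": 0.62, "precision": 0.40, "recall": 0.55,
                "f1_score": 0.46, "memory": 2048,
                "confusion_matrix": np.zeros((2, 2), dtype=int)},
    "Model B": {"accuracy": 0.65, "precision": 0.42, "recall": 0.30,
                "f1_score": 0.35, "memory": 150000,
                "confusion_matrix": np.zeros((2, 2), dtype=int)},
}

summary = summarize(demo_results)
print(summary)
```

Dropping non-scalar entries such as the confusion matrix before building the table keeps every column numeric, which makes sorting and bar plots straightforward.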

In [None]:
# Summary and Results Comparison
print("="*80)
print("RESULTS SUMMARY")
print("="*80)

"""
Compare all your models and summarize the results
"""

def create_results_summary(results_dict):
    """
    Create a summary table of all model results

    Args:
        results_dict: Dictionary with format {model_name: metrics_dict}
    """
    if not results_dict:
        print("  No results to summarize yet")
        return

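    # Keep only the scalar metrics as floats; entries such as 'confusion_matrix'
    # would otherwise leave the whole table object-typed
    metric_cols = ['accuracy', 'precision', 'recall', 'f1_score', 'memory']
    summary_df = pd.DataFrame(
        {name: {col: metrics[col] for col in metric_cols}
         for name, metrics in results_dict.items()}
    ).T.astype(float)
    summary_df = summary_df.sort_values('accuracy', ascending=False)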

    print("\nModel Performance Comparison:")
    print("="*80)
    print(summary_df[['accuracy', 'precision', 'recall', 'f1_score', 'memory']])

    # Visualize comparison
    fig, axes = plt.subplots(1, 2, figsize=(16, 6))

    # Plot 1: Performance metrics
    metrics_to_plot = ['accuracy', 'precision', 'recall', 'f1_score']
    summary_df[metrics_to_plot].plot(kind='bar', ax=axes[0], width=0.8)
    axes[0].set_title('Model Performance Comparison')
    axes[0].set_ylabel('Score')
    axes[0].set_xlabel('Model')
    axes[0].set_xticklabels(summary_df.index, rotation=45, ha='right')
    axes[0].grid(True, alpha=0.3)
    axes[0].axhline(y=0.60, color='r', linestyle='--', label='Task 2 Accuracy Threshold (60%)')
    # Build the legend after drawing the threshold line so its label is included
    axes[0].legend(loc='lower right')

    # Plot 2: Memory usage
    summary_df['memory'].plot(kind='bar', ax=axes[1], color='orange', width=0.8)
    axes[1].set_title('Model Memory Usage')
    axes[1].set_ylabel('Memory')
    axes[1].set_xlabel('Model')
    axes[1].set_xticklabels(summary_df.index, rotation=45, ha='right')
    axes[1].grid(True, alpha=0.3)

    plt.tight_layout()
    plt.show()

    return summary_df

# TODO: After evaluating all your models, create a summary
# Example:
# results = {
#     'Task 1 Model': metrics_task1,
#     'Task 2 Model': metrics_task2,
#     'Task 3 Model': metrics_task3,
# }
# summary = create_results_summary(results)