# Linear Discriminant Analysis on Bank Marketing Dataset
## Business Prediction: Term Deposit Subscription

**Dataset Overview:**
- Marketing campaigns of a Portuguese banking institution
- 45,211 samples with 17 columns (16 input features plus the target `y`)
- Binary classification: Will client subscribe to term deposit?
- Features: Age, job, marital status, education, balance, duration, campaign contacts

**Focus:** Business prediction and imbalanced classification

In [None]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    accuracy_score,
    roc_auc_score,
    roc_curve,
    precision_recall_curve,
    precision_recall_fscore_support,  # used in the threshold analysis (Section 8)
    f1_score,
    matthews_corrcoef
)
from scipy import stats
from imblearn.over_sampling import SMOTE
from imblearn.metrics import classification_report_imbalanced

# Plotting style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (14, 8)

import warnings
warnings.filterwarnings('ignore')

print("Libraries imported successfully!")

## 1. Data Loading

**Note:** This notebook expects the Bank Marketing dataset. You can download it from:
- UCI ML Repository: https://archive.ics.uci.edu/ml/datasets/bank+marketing
- File: `bank-full.csv` (the 45,211-row version, which contains the `balance` and `day` columns used below)

If the file is not found, the next cell falls back to a synthetic dataset with similar characteristics.

In [None]:
# Load or create synthetic bank marketing data
try:
    # Try to load real dataset (semicolon-separated)
    df = pd.read_csv('bank-full.csv', sep=';')
    print("Real dataset loaded successfully!")
except FileNotFoundError:
    print("Creating synthetic dataset...")
    # Create synthetic data with similar characteristics
    np.random.seed(42)
    n_samples = 5000
    
    # Create features
    age = np.random.normal(40, 10, n_samples).astype(int)
    age = np.clip(age, 18, 95)
    
    balance = np.random.exponential(1500, n_samples).astype(int)
    duration = np.random.exponential(250, n_samples).astype(int)
    campaign = np.random.poisson(2.5, n_samples)
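    # Most clients were never contacted before (sentinel 999); the rest get a day count.
    # The 0.86 share is chosen to mirror the 'nonexistent' share of poutcome below.
    pdays = np.where(np.random.random(n_samples) < 0.86, 999,
                     np.random.randint(0, 500, n_samples))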
    previous = np.random.poisson(0.5, n_samples)
    
    # Categorical features
    jobs = np.random.choice(['admin.', 'technician', 'services', 'management', 
                            'retired', 'blue-collar', 'unemployed', 'entrepreneur',
                            'housemaid', 'self-employed', 'student'], n_samples)
    marital = np.random.choice(['married', 'single', 'divorced'], n_samples, p=[0.6, 0.3, 0.1])
    education = np.random.choice(['university.degree', 'high.school', 'basic.9y', 
                                 'professional.course', 'basic.4y', 'basic.6y'], n_samples)
    default = np.random.choice(['no', 'yes', 'unknown'], n_samples, p=[0.97, 0.01, 0.02])
    housing = np.random.choice(['yes', 'no', 'unknown'], n_samples, p=[0.55, 0.43, 0.02])
    loan = np.random.choice(['yes', 'no', 'unknown'], n_samples, p=[0.15, 0.83, 0.02])
    contact = np.random.choice(['cellular', 'telephone'], n_samples, p=[0.65, 0.35])
    month = np.random.choice(['may', 'jun', 'jul', 'aug', 'oct', 'nov', 'dec', 'mar', 'apr', 'sep'], n_samples)
    poutcome = np.random.choice(['nonexistent', 'failure', 'success'], n_samples, p=[0.86, 0.10, 0.04])
    
    # Target (imbalanced - 11% positive class)
    # Make it dependent on features
    prob_y = 0.05 + (duration > 300) * 0.15 + (poutcome == 'success') * 0.20
    prob_y += (age > 60) * 0.05 + (balance > 2000) * 0.03
    y = (np.random.random(n_samples) < prob_y).astype(int)
    y_labels = np.where(y == 1, 'yes', 'no')
    
    df = pd.DataFrame({
        'age': age,
        'job': jobs,
        'marital': marital,
        'education': education,
        'default': default,
        'balance': balance,
        'housing': housing,
        'loan': loan,
        'contact': contact,
        'day': np.random.randint(1, 32, n_samples),
        'month': month,
        'duration': duration,
        'campaign': campaign,
        'pdays': pdays,
        'previous': previous,
        'poutcome': poutcome,
        'y': y_labels
    })
    print("Synthetic dataset created!")

print(f"\nDataset shape: {df.shape}")
print(f"\nTarget distribution:")
print(df['y'].value_counts())
print(f"\nPositive class percentage: {(df['y']=='yes').mean()*100:.2f}%")

In [None]:
# Display basic information
print("\nDataset Info:")
print(df.info())

print("\nFirst 10 rows:")
display(df.head(10))

print("\nStatistical Summary (Numerical Features):")
display(df.describe())

## 2. Exploratory Data Analysis

In [None]:
# Identify numeric and categorical columns
numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
categorical_cols = df.select_dtypes(include=['object']).columns.tolist()
categorical_cols.remove('y')  # Remove target

print(f"Numeric features ({len(numeric_cols)}): {numeric_cols}")
print(f"\nCategorical features ({len(categorical_cols)}): {categorical_cols}")

In [None]:
# Class imbalance visualization
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# Count plot
df['y'].value_counts().plot(kind='bar', ax=axes[0], color=['steelblue', 'coral'])
axes[0].set_title('Target Variable Distribution (Imbalanced)')
axes[0].set_xlabel('Subscribed to Term Deposit')
axes[0].set_ylabel('Count')
axes[0].set_xticklabels(['No', 'Yes'], rotation=0)

# Add count labels
for i, v in enumerate(df['y'].value_counts().values):
    axes[0].text(i, v + 100, str(v), ha='center', va='bottom', fontweight='bold')

# Pie chart
df['y'].value_counts().plot(kind='pie', ax=axes[1], autopct='%1.1f%%', 
                            colors=['steelblue', 'coral'], startangle=90)
axes[1].set_title('Target Class Proportions')
axes[1].set_ylabel('')

plt.tight_layout()
plt.show()

imbalance_ratio = df['y'].value_counts()['no'] / df['y'].value_counts()['yes']
print(f"\nImbalance ratio (no:yes): {imbalance_ratio:.2f}:1")
print(f"This is a {'highly' if imbalance_ratio > 5 else 'moderately'} imbalanced dataset")

In [None]:
# Numeric features distribution by target
fig, axes = plt.subplots(2, 3, figsize=(18, 10))
axes = axes.ravel()

for idx, col in enumerate(numeric_cols[:6]):
    for label in ['no', 'yes']:
        data = df[df['y'] == label][col]
        axes[idx].hist(data, alpha=0.6, label=label, bins=30)
    axes[idx].set_xlabel(col)
    axes[idx].set_ylabel('Frequency')
    axes[idx].legend()
    axes[idx].set_title(f'Distribution of {col} by Target')

plt.tight_layout()
plt.show()

In [None]:
# Box plots for key features
key_features = ['age', 'balance', 'duration', 'campaign']

fig, axes = plt.subplots(2, 2, figsize=(16, 12))
axes = axes.ravel()

for idx, feature in enumerate(key_features):
    df.boxplot(column=feature, by='y', ax=axes[idx])
    axes[idx].set_title(f'Box Plot: {feature} by Subscription')
    axes[idx].set_xlabel('Subscribed to Term Deposit')
    axes[idx].set_ylabel(feature)

plt.suptitle('')
plt.tight_layout()
plt.show()

In [None]:
# Categorical features vs target
fig, axes = plt.subplots(2, 3, figsize=(18, 10))
axes = axes.ravel()

for idx, col in enumerate(categorical_cols[:6]):
    ct = pd.crosstab(df[col], df['y'], normalize='index')
    ct.plot(kind='bar', ax=axes[idx], color=['steelblue', 'coral'])
    axes[idx].set_title(f'{col} vs Subscription Rate')
    axes[idx].set_xlabel(col)
    axes[idx].set_ylabel('Proportion')
    axes[idx].legend(title='Subscribed', labels=['No', 'Yes'])
    axes[idx].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

In [None]:
# Correlation matrix for numeric features
plt.figure(figsize=(12, 10))
corr = df[numeric_cols].corr()
sns.heatmap(corr, annot=True, cmap='coolwarm', center=0, square=True, 
            linewidths=0.5, fmt='.2f', cbar_kws={'shrink': 0.8})
plt.title('Correlation Matrix - Numeric Features', fontsize=14, pad=20)
plt.tight_layout()
plt.show()

## 3. Data Preprocessing

In [None]:
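# Encode categorical variables as integer codes.
# Note: LabelEncoder assigns codes in alphabetical order, so nominal categories
# (job, month, ...) get an arbitrary ordering that LDA will treat as numeric.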
le_dict = {}
df_encoded = df.copy()

for col in categorical_cols:
    le = LabelEncoder()
    df_encoded[col] = le.fit_transform(df[col])
    le_dict[col] = le

# Encode target
df_encoded['y'] = (df['y'] == 'yes').astype(int)

print("Categorical encoding complete!")
print(f"\nEncoded dataset shape: {df_encoded.shape}")
print(f"\nSample of encoded data:")
display(df_encoded.head())
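
Because LDA treats every input as a numeric variable, integer codes for nominal categories such as `job` or `marital` imply an ordering that does not exist. One common alternative is one-hot encoding. The snippet below is a minimal sketch on a made-up toy frame (the values and the `toy` / `toy_onehot` names are illustrative assumptions, not part of the notebook's pipeline):

```python
import pandas as pd

# Hypothetical toy data; column names mirror the notebook, values are made up.
toy = pd.DataFrame({
    "age": [34, 51, 28, 45],
    "marital": ["married", "single", "divorced", "single"],
    "contact": ["cellular", "telephone", "cellular", "cellular"],
})

# One indicator column per category. drop_first=True drops one level per
# feature so the indicators are not perfectly collinear, which would make
# the pooled covariance matrix used by LDA singular.
toy_onehot = pd.get_dummies(
    toy, columns=["marital", "contact"], drop_first=True, dtype=int
)

print(toy_onehot)
```

Swapping this into the notebook would change the number of features, so `feature_names` and the coefficient plot would list one entry per category level instead of one per original column.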

In [None]:
# Prepare features and target
X = df_encoded.drop('y', axis=1).values
y = df_encoded['y'].values
feature_names = df_encoded.drop('y', axis=1).columns.tolist()

print(f"Feature matrix shape: {X.shape}")
print(f"Target vector shape: {y.shape}")
print(f"Number of features: {len(feature_names)}")

In [None]:
# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

print(f"Training set: {X_train.shape[0]} samples")
print(f"Test set: {X_test.shape[0]} samples")
print(f"\nTraining set class distribution:")
unique, counts = np.unique(y_train, return_counts=True)
for cls, count in zip(unique, counts):
    print(f"  Class {cls}: {count} ({count/len(y_train)*100:.1f}%)")

print(f"\nTest set class distribution:")
unique, counts = np.unique(y_test, return_counts=True)
for cls, count in zip(unique, counts):
    print(f"  Class {cls}: {count} ({count/len(y_test)*100:.1f}%)")

In [None]:
# Standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("Feature scaling complete!")
print(f"\nScaled training set mean: {X_train_scaled.mean(axis=0).mean():.6f}")
print(f"Scaled training set std: {X_train_scaled.std(axis=0).mean():.6f}")
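
`StratifiedKFold` and `cross_val_score` are imported at the top of the notebook but not used. The sketch below shows one way they could give a more stable performance estimate than a single split. It runs on synthetic data from `make_classification` (an assumption made here so the example is self-contained) and wraps the scaler and LDA in a pipeline so that scaling statistics are re-estimated inside each training fold:

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the bank data: roughly 11% positives.
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, n_informative=5, n_redundant=0,
    n_clusters_per_class=1, weights=[0.89, 0.11], random_state=42,
)

# The pipeline fits StandardScaler on each training fold only,
# so no information from the validation fold leaks into scaling.
pipe = make_pipeline(StandardScaler(), LinearDiscriminantAnalysis())

# Stratification keeps the class ratio roughly constant across folds.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_auc = cross_val_score(pipe, X_demo, y_demo, cv=cv, scoring="roc_auc")

print(f"ROC-AUC per fold: {cv_auc.round(3)}")
print(f"Mean ROC-AUC: {cv_auc.mean():.3f} (+/- {cv_auc.std():.3f})")
```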

## 4. LDA on Imbalanced Data (Baseline)

In [None]:
# Train baseline LDA
lda_baseline = LinearDiscriminantAnalysis()
lda_baseline.fit(X_train_scaled, y_train)

# Predictions
y_train_pred = lda_baseline.predict(X_train_scaled)
y_test_pred = lda_baseline.predict(X_test_scaled)

# Probabilities
y_train_proba = lda_baseline.predict_proba(X_train_scaled)[:, 1]
y_test_proba = lda_baseline.predict_proba(X_test_scaled)[:, 1]

print("BASELINE LDA (No Resampling)")
print("=" * 70)
print(f"Training Accuracy: {accuracy_score(y_train, y_train_pred):.4f}")
print(f"Test Accuracy: {accuracy_score(y_test, y_test_pred):.4f}")
print(f"Test F1-Score: {f1_score(y_test, y_test_pred):.4f}")
print(f"Test ROC-AUC: {roc_auc_score(y_test, y_test_proba):.4f}")
print(f"Matthews Correlation Coefficient: {matthews_corrcoef(y_test, y_test_pred):.4f}")

In [None]:
# Classification report for imbalanced data
print("\nClassification Report (Test Set):")
print(classification_report(y_test, y_test_pred, target_names=['No', 'Yes']))

# Confusion matrix
cm = confusion_matrix(y_test, y_test_pred)

plt.figure(figsize=(10, 8))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=['No', 'Yes'], yticklabels=['No', 'Yes'])
plt.title('Confusion Matrix - Baseline LDA')
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.show()

print(f"\nTrue Negatives: {cm[0,0]}")
print(f"False Positives: {cm[0,1]}")
print(f"False Negatives: {cm[1,0]}")
print(f"True Positives: {cm[1,1]}")
print(f"\nSensitivity (Recall): {cm[1,1]/(cm[1,0]+cm[1,1]):.4f}")
print(f"Specificity: {cm[0,0]/(cm[0,0]+cm[0,1]):.4f}")
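
As a worked check of the formulas used above, the following sketch computes sensitivity, specificity, and precision from a small hand-made label vector (the arrays are illustrative, not model output). For binary labels, scikit-learn lays the confusion matrix out as `[[TN, FP], [FN, TP]]`, so `ravel()` unpacks it in that order:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Illustrative labels: 6 true negatives-class samples, 4 positive-class samples.
y_true_demo = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred_demo = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])

# Rows are true classes, columns are predicted classes.
tn_demo, fp_demo, fn_demo, tp_demo = confusion_matrix(y_true_demo, y_pred_demo).ravel()

sensitivity_demo = tp_demo / (tp_demo + fn_demo)  # share of actual positives found
specificity_demo = tn_demo / (tn_demo + fp_demo)  # share of actual negatives kept
precision_demo = tp_demo / (tp_demo + fp_demo)    # share of predicted positives that are right

print(f"TN={tn_demo}, FP={fp_demo}, FN={fn_demo}, TP={tp_demo}")
print(f"Sensitivity={sensitivity_demo:.3f}, "
      f"Specificity={specificity_demo:.3f}, Precision={precision_demo:.3f}")
```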

## 5. LDA with SMOTE (Handling Imbalance)

In [None]:
# Apply SMOTE to balance classes
smote = SMOTE(random_state=42)
X_train_balanced, y_train_balanced = smote.fit_resample(X_train_scaled, y_train)

print("SMOTE Resampling Applied")
print("=" * 70)
print(f"Original training set: {X_train_scaled.shape[0]} samples")
print(f"Balanced training set: {X_train_balanced.shape[0]} samples")

print(f"\nOriginal class distribution:")
unique, counts = np.unique(y_train, return_counts=True)
for cls, count in zip(unique, counts):
    print(f"  Class {cls}: {count} ({count/len(y_train)*100:.1f}%)")

print(f"\nBalanced class distribution:")
unique, counts = np.unique(y_train_balanced, return_counts=True)
for cls, count in zip(unique, counts):
    print(f"  Class {cls}: {count} ({count/len(y_train_balanced)*100:.1f}%)")
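
To make the resampling step less of a black box, here is a simplified NumPy sketch of the core SMOTE idea: pick a minority sample, pick one of its nearest minority neighbours, and create a new point somewhere on the line segment between them. This is an illustration only, not the imbalanced-learn implementation, and `smote_like_sample` is a hypothetical helper name:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical minority-class samples in two dimensions.
X_min = rng.normal(size=(20, 2))

def smote_like_sample(X_minority, k=5, rng=rng):
    """Create one synthetic point by interpolating towards a near neighbour."""
    i = rng.integers(len(X_minority))
    dists = np.linalg.norm(X_minority - X_minority[i], axis=1)
    neighbours = np.argsort(dists)[1:k + 1]  # position 0 is the point itself
    j = rng.choice(neighbours)
    lam = rng.random()                       # interpolation weight in [0, 1)
    return X_minority[i] + lam * (X_minority[j] - X_minority[i])

synthetic = np.array([smote_like_sample(X_min) for _ in range(50)])
print(synthetic.shape)
```

Since every synthetic point is an interpolation between two real samples, applying this to integer-coded categorical columns produces in-between values that correspond to no real category. imbalanced-learn provides `SMOTENC` for mixed numeric and categorical data.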

In [None]:
# Train LDA on balanced data
lda_balanced = LinearDiscriminantAnalysis()
lda_balanced.fit(X_train_balanced, y_train_balanced)

# Predictions
y_test_pred_balanced = lda_balanced.predict(X_test_scaled)
y_test_proba_balanced = lda_balanced.predict_proba(X_test_scaled)[:, 1]

print("LDA WITH SMOTE RESAMPLING")
print("=" * 70)
print(f"Test Accuracy: {accuracy_score(y_test, y_test_pred_balanced):.4f}")
print(f"Test F1-Score: {f1_score(y_test, y_test_pred_balanced):.4f}")
print(f"Test ROC-AUC: {roc_auc_score(y_test, y_test_proba_balanced):.4f}")
print(f"Matthews Correlation Coefficient: {matthews_corrcoef(y_test, y_test_pred_balanced):.4f}")

In [None]:
# Comparison of baseline vs SMOTE
print("\nCOMPARISON: Baseline vs SMOTE")
print("=" * 70)
print(f"{'Metric':<30} {'Baseline':>15} {'SMOTE':>15} {'Improvement':>15}")
print("-" * 70)

metrics = [
    ('Accuracy', accuracy_score(y_test, y_test_pred), accuracy_score(y_test, y_test_pred_balanced)),
    ('F1-Score', f1_score(y_test, y_test_pred), f1_score(y_test, y_test_pred_balanced)),
    ('ROC-AUC', roc_auc_score(y_test, y_test_proba), roc_auc_score(y_test, y_test_proba_balanced)),
    ('MCC', matthews_corrcoef(y_test, y_test_pred), matthews_corrcoef(y_test, y_test_pred_balanced))
]

for metric_name, baseline_val, smote_val in metrics:
    improvement = smote_val - baseline_val
    print(f"{metric_name:<30} {baseline_val:>15.4f} {smote_val:>15.4f} {improvement:>+15.4f}")
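
Resampling is not the only way to shift LDA towards the minority class. `LinearDiscriminantAnalysis` accepts a `priors` argument, and setting equal priors moves the decision boundary without creating synthetic rows. The sketch below compares the two settings on synthetic data from `make_classification` (an assumption for self-containment); it is an illustration of the mechanism, not a result for the bank dataset:

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (about 10% positives).
X_demo, y_demo = make_classification(
    n_samples=3000, n_features=8, n_informative=4, n_redundant=0,
    n_clusters_per_class=1, weights=[0.9, 0.1], random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0
)

# Default: priors are estimated from the (imbalanced) class frequencies.
lda_default = LinearDiscriminantAnalysis().fit(X_tr, y_tr)
# Equal priors: the model no longer favours the majority class a priori.
lda_equal = LinearDiscriminantAnalysis(priors=[0.5, 0.5]).fit(X_tr, y_tr)

recall_default = recall_score(y_te, lda_default.predict(X_te))
recall_equal = recall_score(y_te, lda_equal.predict(X_te))
precision_default = precision_score(y_te, lda_default.predict(X_te), zero_division=0)
precision_equal = precision_score(y_te, lda_equal.predict(X_te), zero_division=0)

print(f"Empirical priors: recall={recall_default:.3f}, precision={precision_default:.3f}")
print(f"Equal priors:     recall={recall_equal:.3f}, precision={precision_equal:.3f}")
```

With the default `svd` solver, changing the priors only shifts the intercept of the binary decision function, so recall cannot decrease. Precision usually falls in exchange.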

In [None]:
# Classification report with SMOTE
print("\nClassification Report (SMOTE):")
print(classification_report(y_test, y_test_pred_balanced, target_names=['No', 'Yes']))

# Confusion matrices comparison
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

cm_baseline = confusion_matrix(y_test, y_test_pred)
cm_smote = confusion_matrix(y_test, y_test_pred_balanced)

sns.heatmap(cm_baseline, annot=True, fmt='d', cmap='Blues', ax=axes[0],
            xticklabels=['No', 'Yes'], yticklabels=['No', 'Yes'])
axes[0].set_title('Baseline LDA')
axes[0].set_ylabel('True Label')
axes[0].set_xlabel('Predicted Label')

sns.heatmap(cm_smote, annot=True, fmt='d', cmap='Greens', ax=axes[1],
            xticklabels=['No', 'Yes'], yticklabels=['No', 'Yes'])
axes[1].set_title('LDA with SMOTE')
axes[1].set_ylabel('True Label')
axes[1].set_xlabel('Predicted Label')

plt.tight_layout()
plt.show()

## 6. ROC and Precision-Recall Curves

In [None]:
# ROC curves
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# ROC Curve
fpr_baseline, tpr_baseline, _ = roc_curve(y_test, y_test_proba)
fpr_smote, tpr_smote, _ = roc_curve(y_test, y_test_proba_balanced)

axes[0].plot(fpr_baseline, tpr_baseline, label=f'Baseline (AUC={roc_auc_score(y_test, y_test_proba):.3f})', linewidth=2)
axes[0].plot(fpr_smote, tpr_smote, label=f'SMOTE (AUC={roc_auc_score(y_test, y_test_proba_balanced):.3f})', linewidth=2)
axes[0].plot([0, 1], [0, 1], 'k--', label='Random')
axes[0].set_xlabel('False Positive Rate')
axes[0].set_ylabel('True Positive Rate')
axes[0].set_title('ROC Curve Comparison')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Precision-Recall Curve
precision_baseline, recall_baseline, _ = precision_recall_curve(y_test, y_test_proba)
precision_smote, recall_smote, _ = precision_recall_curve(y_test, y_test_proba_balanced)

axes[1].plot(recall_baseline, precision_baseline, label='Baseline', linewidth=2)
axes[1].plot(recall_smote, precision_smote, label='SMOTE', linewidth=2)
axes[1].set_xlabel('Recall')
axes[1].set_ylabel('Precision')
axes[1].set_title('Precision-Recall Curve Comparison')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
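
The ROC plot reports its AUC in the legend, but the precision-recall plot has no single-number summary. Average precision is one common choice, and its no-skill baseline is the positive-class prevalence rather than 0.5. A small sketch on hand-made scores (illustrative values, not model output):

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

# Illustrative labels and scores.
y_true_demo = np.array([0, 0, 1, 1])
scores_demo = np.array([0.1, 0.4, 0.35, 0.8])

ap_demo = average_precision_score(y_true_demo, scores_demo)  # area-like summary of the PR curve
auc_demo = roc_auc_score(y_true_demo, scores_demo)           # area under the ROC curve
prevalence_demo = y_true_demo.mean()                         # AP of a random ranking, approximately

print(f"Average precision: {ap_demo:.3f}")
print(f"ROC-AUC:           {auc_demo:.3f}")
print(f"Prevalence (PR baseline): {prevalence_demo:.3f}")
```

In the notebook, the same call could be applied to `y_test` with `y_test_proba` and `y_test_proba_balanced` to label the two PR curves.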

## 7. Feature Importance Analysis

In [None]:
# LDA coefficients (using balanced model)
coefficients = lda_balanced.coef_[0]

feature_importance = pd.DataFrame({
    'Feature': feature_names,
    'Coefficient': coefficients,
    'Abs_Coefficient': np.abs(coefficients)
}).sort_values('Abs_Coefficient', ascending=False)

print("Feature Importance (LDA Coefficients):")
print("=" * 70)
display(feature_importance)

# Plot top features
top_n = min(15, len(feature_importance))  # avoid a length mismatch if there are fewer than 15 features
top_features = feature_importance.head(top_n)

plt.figure(figsize=(12, 8))
colors = ['red' if x < 0 else 'blue' for x in top_features['Coefficient'].values]
plt.barh(range(top_n), top_features['Coefficient'].values[::-1], color=colors[::-1], alpha=0.7)
plt.yticks(range(top_n), top_features['Feature'].values[::-1])
plt.xlabel('LDA Coefficient')
plt.title(f'Top {top_n} Most Important Features for Subscription Prediction')
plt.axvline(x=0, color='black', linestyle='--', linewidth=1)
plt.grid(True, alpha=0.3, axis='x')
plt.tight_layout()
plt.show()

## 8. Business Insights and Decision Thresholds

In [None]:
# Analyze prediction probabilities
plt.figure(figsize=(14, 6))

# Distribution of predicted probabilities
plt.subplot(1, 2, 1)
for label, name in zip([0, 1], ['No', 'Yes']):
    mask = y_test == label
    plt.hist(y_test_proba_balanced[mask], bins=30, alpha=0.6, label=name)
plt.xlabel('Predicted Probability')
plt.ylabel('Frequency')
plt.title('Distribution of Predicted Probabilities by True Class')
plt.legend()
plt.grid(True, alpha=0.3)

# Threshold analysis
plt.subplot(1, 2, 2)
thresholds = np.linspace(0, 1, 100)
f1_scores = []
precisions = []
recalls = []

for threshold in thresholds:
    y_pred_thresh = (y_test_proba_balanced >= threshold).astype(int)
    f1_scores.append(f1_score(y_test, y_pred_thresh))
    precision, recall, _, _ = precision_recall_fscore_support(y_test, y_pred_thresh, average='binary')
    precisions.append(precision)
    recalls.append(recall)

plt.plot(thresholds, f1_scores, label='F1-Score', linewidth=2)
plt.plot(thresholds, precisions, label='Precision', linewidth=2)
plt.plot(thresholds, recalls, label='Recall', linewidth=2)
# Draw the reference line before legend() so that its label is included
plt.axvline(x=0.5, color='red', linestyle='--', label='Default (0.5)')
plt.xlabel('Decision Threshold')
plt.ylabel('Score')
plt.title('Performance Metrics vs Decision Threshold')
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Find the F1-maximizing threshold (chosen on the test set here, so the reported F1 is optimistic)
optimal_idx = np.argmax(f1_scores)
optimal_threshold = thresholds[optimal_idx]
print(f"\nOptimal threshold (max F1): {optimal_threshold:.3f}")
print(f"F1-Score at optimal threshold: {f1_scores[optimal_idx]:.4f}")
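
Picking the threshold on the same data used to report the final score tends to overstate performance. A common remedy is to tune the threshold on a separate validation split and then evaluate once on the test split. The sketch below illustrates this on synthetic data; the three-way split sizes and the threshold grid are assumptions chosen for the example:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X_demo, y_demo = make_classification(
    n_samples=4000, n_features=8, n_informative=4, n_redundant=0,
    n_clusters_per_class=1, weights=[0.89, 0.11], random_state=1,
)

# 60% fit / 20% validation / 20% test, all stratified on the target.
X_fit, X_hold, y_fit, y_hold = train_test_split(
    X_demo, y_demo, test_size=0.4, stratify=y_demo, random_state=1
)
X_val, X_te, y_val, y_te = train_test_split(
    X_hold, y_hold, test_size=0.5, stratify=y_hold, random_state=1
)

model = LinearDiscriminantAnalysis().fit(X_fit, y_fit)

# Tune the threshold on the validation split only.
grid = np.linspace(0.05, 0.95, 19)
val_proba = model.predict_proba(X_val)[:, 1]
val_f1 = [f1_score(y_val, (val_proba >= t).astype(int), zero_division=0) for t in grid]
best_t = grid[int(np.argmax(val_f1))]

# Report the score once, on data that played no part in choosing best_t.
test_pred = (model.predict_proba(X_te)[:, 1] >= best_t).astype(int)
test_f1 = f1_score(y_te, test_pred, zero_division=0)

print(f"Threshold chosen on validation: {best_t:.2f}")
print(f"F1 on untouched test split:     {test_f1:.3f}")
```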

In [None]:
# Business-oriented metrics
print("\nBUSINESS METRICS")
print("=" * 70)

# Using optimal threshold
y_pred_optimal = (y_test_proba_balanced >= optimal_threshold).astype(int)
cm_optimal = confusion_matrix(y_test, y_pred_optimal)

tn, fp, fn, tp = cm_optimal.ravel()

print(f"\nUsing threshold: {optimal_threshold:.3f}")
print(f"\nConfusion Matrix:")
print(f"  True Negatives: {tn}")
print(f"  False Positives: {fp}")
print(f"  False Negatives: {fn}")
print(f"  True Positives: {tp}")

print(f"\nKey Metrics:")
print(f"  Capture Rate (Recall/Sensitivity): {tp/(tp+fn):.2%}")
print(f"  Conversion Rate among contacted (Precision): {tp/(tp+fp):.2%}")
print(f"  Contacts Needed: {tp+fp} (vs {len(y_test)} if contacting all)")
print(f"  Reduction in contacts: {(1-(tp+fp)/len(y_test))*100:.1f}%")
print(f"  True positives found: {tp} out of {tp+fn} total")
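
Maximizing F1 treats a missed subscriber and a wasted call as equally important, which rarely matches campaign economics. If a value per subscriber and a cost per contact are known, the threshold can be chosen to maximize profit instead. The sketch below uses made-up numbers throughout: `VALUE_PER_SUBSCRIBER`, `COST_PER_CONTACT`, the labels, and the probabilities are all hypothetical, and `campaign_profit` is an illustrative helper rather than part of the notebook:

```python
import numpy as np

def campaign_profit(y_true, proba, threshold, value_per_subscriber, cost_per_contact):
    """Profit from contacting everyone whose predicted probability meets the threshold."""
    contacted = proba >= threshold
    conversions = np.sum(contacted & (y_true == 1))
    return conversions * value_per_subscriber - contacted.sum() * cost_per_contact

# Hypothetical outcomes and predicted probabilities for ten clients.
y_true_demo = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 1])
proba_demo = np.array([0.05, 0.10, 0.15, 0.20, 0.30, 0.45, 0.40, 0.60, 0.70, 0.90])

VALUE_PER_SUBSCRIBER = 100.0  # assumed revenue per subscription
COST_PER_CONTACT = 10.0       # assumed cost per call

profits = {
    t: campaign_profit(y_true_demo, proba_demo, t, VALUE_PER_SUBSCRIBER, COST_PER_CONTACT)
    for t in [0.0, 0.25, 0.5, 0.75]
}
best_threshold = max(profits, key=profits.get)

for t, p in profits.items():
    print(f"threshold={t:.2f}  profit={p:.0f}")
print(f"Most profitable threshold: {best_threshold}")
```

In this toy case, contacting everyone (threshold 0.0) earns less than a moderate threshold, while a strict threshold misses too many subscribers. The best threshold depends entirely on the assumed cost ratio.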

## 9. Key Insights and Recommendations

In [None]:
print("\n" + "="*70)
print("KEY INSIGHTS: LDA FOR BANK MARKETING PREDICTION")
print("="*70)

print("\n1. CLASS IMBALANCE HANDLING")
print(f"   - Original imbalance ratio: {(y==0).sum()/(y==1).sum():.1f}:1")
print(f"   - F1-score change with SMOTE: {f1_score(y_test, y_test_pred_balanced) - f1_score(y_test, y_test_pred):+.3f}")
print(f"   - Critical for minority class detection (subscribers)")

print("\n2. MODEL PERFORMANCE")
print(f"   - Test ROC-AUC: {roc_auc_score(y_test, y_test_proba_balanced):.3f}")
print(f"   - Precision at threshold {optimal_threshold:.3f}: {tp/(tp+fp):.2%} (how many predicted subscribers actually subscribe)")
print(f"   - Recall at threshold {optimal_threshold:.3f}: {tp/(tp+fn):.2%} (how many actual subscribers we identify)")

print("\n3. TOP PREDICTIVE FEATURES")
top_3 = feature_importance.head(3)
for i, (_, row) in enumerate(top_3.iterrows(), 1):
    print(f"   {i}. {row['Feature']}: {row['Coefficient']:.4f}")

print("\n4. BUSINESS IMPACT")
print(f"   - Can reduce contacts by {(1-(tp+fp)/len(y_test))*100:.1f}% while maintaining {tp/(tp+fn):.0%} capture rate")
print(f"   - Optimal threshold: {optimal_threshold:.3f} (vs default 0.5)")
print(f"   - Focus on high-probability customers for better ROI")

print("\n5. RECOMMENDATIONS")
print("   - Use LDA with SMOTE for better minority class detection")
print("   - Adjust threshold based on business costs (contact vs lost customer)")
print(f"   - Focus marketing on customers with probability > {optimal_threshold:.2f}")
print("   - Monitor top features for campaign optimization")

print("\n" + "="*70)

## Summary

### What We Learned:

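1. **Handling Imbalanced Data**: SMOTE rebalances the training set so that LDA flags more of the minority class (subscribers). This usually raises recall at some cost in precision and accuracy, so judge it by F1, MCC, and the comparison table rather than by accuracy alone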

2. **Business-Oriented Metrics**: ROC-AUC and precision-recall curves are more informative than accuracy for imbalanced datasets

3. **Feature Importance**: Identified key predictors for term deposit subscription

4. **Threshold Optimization**: Adjusting decision threshold based on business costs can improve ROI

5. **Practical Application**: Model can guide targeted marketing campaigns

### Why Bank Marketing is Good for LDA:
- Real-world business problem with direct ROI implications
- Class imbalance teaches proper evaluation techniques
- Mix of numerical and categorical features
- Interpretable coefficients guide business decisions
- Demonstrates importance of preprocessing and resampling

### Next Steps:
- Experiment with different SMOTE variants (ADASYN, BorderlineSMOTE)
- Try cost-sensitive learning approaches
- Compare with ensemble methods (Random Forest, XGBoost)
- Perform feature engineering for better predictions
- Conduct temporal validation (time-based splits)