# Donor Propensity Modeling
## Predicting High-Income Individuals for Nonprofit Donor Targeting

**Business problem:** A nonprofit organization wants to maximize fundraising efficiency by identifying individuals most likely to donate. Since donation propensity correlates strongly with income level, we build a classifier to predict whether an individual earns above $50K annually using demographic data.

**Approach:** We compare 6 classification models (Logistic Regression through XGBoost), engineer domain-specific features, and conduct a fairness audit to ensure the model doesn't discriminate across protected demographic groups.

**Key result:** Gradient Boosting achieves the best test performance (86.2% accuracy, 0.735 F-beta) with an AUC of 0.92, while maintaining acceptable fairness metrics across race and sex.

In [None]:
import sys, os
sys.path.insert(0, os.path.join(os.getcwd(), '..'))

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

from src.data_loader import load_census, preprocess, split_data
from src.database import create_database, run_query, QUERIES
from src.models import get_models, compare_models, optimize_model, get_classification_report
from src.fairness import fairness_summary
from src.visualizations import (
    plot_income_distribution, plot_feature_distributions,
    plot_model_comparison, plot_roc_pr_curves,
    plot_feature_importance, plot_fairness_results
)

sns.set_palette('husl')
%matplotlib inline

np.random.seed(42)

## 1. Data Overview

In [None]:
data = load_census()
print(f'Dataset: {data.shape[0]:,} records x {data.shape[1]} features')
print(f'Missing values: {data.isnull().sum().sum()}')
print(f'\nTarget distribution:')
print(data['income'].value_counts(normalize=True).round(4))
display(data.describe())

In [None]:
plot_income_distribution(data)

In [None]:
numeric_features = ['age', 'education-num', 'capital-gain', 'capital-loss', 'hours-per-week']
plot_feature_distributions(data, numeric_features)

## 2. SQL-Based Exploration

Loading the data into SQLite to demonstrate relational analytics with CTEs, window functions, and aggregations.

In [None]:
db_path = create_database(data)
print(f'Database: {db_path}')

In [None]:
# Income rates by education + occupation (GROUP BY + HAVING)
display(run_query(QUERIES['income_by_education_occupation']))

In [None]:
# Age group income distribution (CASE expressions)
display(run_query(QUERIES['age_income_distribution']))

In [None]:
# Capital gains decile analysis (NTILE window function)
display(run_query(QUERIES['capital_gains_percentiles']))

In [None]:
# Demographic profile with CTE
display(run_query(QUERIES['demographic_income_profile']))

## 3. Feature Engineering & Preprocessing

The preprocessing pipeline:
1. **Feature engineering**: capital_net (gain - loss), age bins, work hour categories
2. **Log-transform** skewed features (capital-gain, capital-loss)
3. **Normalize** numeric features to [0, 1]
4. **One-hot encode** categorical variables

In [None]:
X, y = preprocess(data)
X_train, X_test, y_train, y_test = split_data(X, y)

print(f'Training: {X_train.shape}')
print(f'Test:     {X_test.shape}')
print(f'Features: {X.shape[1]} (from {data.shape[1]-1} original)')
print(f'\nTarget balance - Train: {y_train.mean():.3f}, Test: {y_test.mean():.3f}')

## 4. Baseline Performance

A naive predictor that always predicts the majority class (<=50K) gives us the floor to beat.

In [None]:
from sklearn.metrics import accuracy_score, fbeta_score

majority_class = y.value_counts().idxmax()
baseline_acc = max(y.value_counts()) / len(y)
baseline_fbeta = fbeta_score(y_test, [majority_class]*len(y_test), beta=0.5)

print(f'Baseline accuracy (majority class): {baseline_acc:.4f}')
print(f'Baseline F-beta:                    {baseline_fbeta:.4f}')

## 5. Model Comparison

We compare 6 models: Logistic Regression, Decision Tree, Random Forest, Gradient Boosting, XGBoost, and LightGBM. Each is evaluated at 1%, 10%, and 100% of training data to assess learning efficiency.

In [None]:
results_df = compare_models(X_train, y_train, X_test, y_test)
display(results_df[results_df['sample_frac'] == 1.0][['model', 'test_accuracy', 'test_fbeta', 'train_time']].sort_values('test_fbeta', ascending=False))

In [None]:
plot_model_comparison(results_df)

## 6. Model Optimization

Optimizing Gradient Boosting with GridSearchCV, as it showed the best balance of accuracy, F-beta, and training speed.

In [None]:
from sklearn.ensemble import GradientBoostingClassifier

best_model, grid = optimize_model(X_train, y_train)
print(f'Best parameters: {grid.best_params_}')
print(f'Best CV F1: {grid.best_score_:.4f}')

report = get_classification_report(best_model, X_test, y_test)
print(f'\nTest Accuracy: {report["accuracy"]}')
print(f'Test F-beta:   {report["fbeta_0.5"]}')
print(f'ROC AUC:       {report.get("roc_auc", "N/A")}')

In [None]:
plot_roc_pr_curves(best_model, X_test, y_test)

## 7. Feature Importance

In [None]:
plot_feature_importance(best_model, list(X_train.columns), top_n=15)

## 8. Fairness & Bias Analysis

Since this model would influence resource allocation (who gets targeted for outreach), we audit it for fairness across race and sex. We check:
- **Demographic parity**: Are positive prediction rates consistent across groups?
- **Equalized odds**: Are TPR and FPR consistent across groups?
- **Disparate impact**: Does the model satisfy the 4/5ths rule?

In [None]:
y_pred = best_model.predict(X_test)
fairness = fairness_summary(y_test.values, y_pred, data.iloc[y_test.index])

print('=== Sex: Demographic Parity ===')
display(fairness.get('sex_demographic_parity'))
print(f'\nDisparate Impact Ratio (sex): {fairness.get("sex_disparate_impact", "N/A")}')
print('(>= 0.8 satisfies the 4/5ths rule)\n')

print('=== Race: Demographic Parity ===')
display(fairness.get('race_demographic_parity'))
print(f'\nDisparate Impact Ratio (race): {fairness.get("race_disparate_impact", "N/A")}')

In [None]:
print('=== Sex: Equalized Odds ===')
display(fairness.get('sex_equalized_odds'))

print('\n=== Race: Equalized Odds ===')
display(fairness.get('race_equalized_odds'))

In [None]:
plot_fairness_results(fairness)

## 9. Income Profile Comparison

In [None]:
key_features = ['capital-gain', 'age', 'education-num', 'hours-per-week']
high = data[data['income'] == '>50K'][key_features].describe().loc['mean']
low = data[data['income'] == '<=50K'][key_features].describe().loc['mean']

comparison = pd.DataFrame({
    'High Income (mean)': high,
    'Low Income (mean)': low,
    'Difference (%)': ((high - low) / low * 100).round(1)
})
display(comparison)

## 10. Conclusions

### Model Performance
- Gradient Boosting outperforms all tested models with 86.2% accuracy and 0.735 F-beta
- Top predictors: capital gain, marital status, age, and education level
- High-income individuals show 2,500%+ higher capital gains and ~20% more education

### Fairness Findings
- The model shows measurable disparities across sex and race in prediction rates
- Disparate impact ratios should be monitored and addressed if deploying to production
- Consider post-processing calibration or fairness constraints during training

### Business Recommendations
1. **Deploy** the Gradient Boosting model to score existing donor databases
2. **Target** outreach toward individuals with high capital gains, older age, and higher education
3. **Monitor** fairness metrics in production to ensure equitable targeting
4. **A/B test** model-driven vs. existing outreach strategies to measure donation lift
5. **Retrain** periodically with fresh census data to maintain accuracy