# Feature Engineering — Fraud Detection

This notebook demonstrates feature engineering techniques for the fraud detection
dataset using the project's `FeatureEngineer` class and `ExperimentTracker`.

We cover:
1. **Time feature creation** — extract hour, day-of-week, weekend, and night indicators
2. **Amount feature creation** — log, squared, and square-root transformations
3. **Interaction feature creation** — pairwise products of selected features
4. **Feature selection** — univariate (SelectKBest) and RFE methods
5. **Feature importance analysis** — Random Forest importance ranking
6. **Impact analysis** — compare model performance with selected vs all features

The two XGBoost runs from the impact analysis are logged to the `ExperimentTracker` for reproducibility.

**Requirements covered:** 5.1 (derived features), 5.2 (feature selection), 5.3 (experiment logging), 5.4 (feature impact analysis)

## 1. Setup and Imports

In [None]:
import sys
import io

import boto3
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

# Add project src to path
sys.path.insert(0, '../src')
from feature_engineering import FeatureEngineer
from experiment_tracking import ExperimentTracker

sns.set_theme(style='whitegrid')
%matplotlib inline

## 2. Load Data from S3

Load the processed fraud detection dataset from the `fraud-detection-data` bucket.
The dataset contains columns `Time`, `Amount`, `V1`–`V28`, and `Class` (0 = legitimate, 1 = fraud).
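
Before engineering features, it can help to confirm that a loaded frame matches the schema described above. The following is a minimal sketch; `EXPECTED_COLUMNS` and `check_schema` are hypothetical names introduced here for illustration and are not part of the project code.

```python
import pandas as pd

# Columns named in the dataset description: Time, Amount, V1-V28, and Class.
EXPECTED_COLUMNS = ["Time", "Amount", "Class"] + [f"V{i}" for i in range(1, 29)]


def check_schema(df: pd.DataFrame):
    """Hypothetical helper: report missing columns and unexpected labels."""
    missing = [c for c in EXPECTED_COLUMNS if c not in df.columns]
    bad_labels = []
    if "Class" in df.columns:
        # Class should only contain 0 (legitimate) and 1 (fraud).
        bad_labels = sorted(set(df["Class"].unique().tolist()) - {0, 1})
    return missing, bad_labels


# Toy frame with the right columns; the values are placeholders.
toy_df = pd.DataFrame({c: [0, 1] for c in EXPECTED_COLUMNS})
print(check_schema(toy_df))
print(check_schema(toy_df.drop(columns=["V28"])))
```

In the notebook this could be called as `check_schema(train_df)` right after loading.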

In [None]:
BUCKET_NAME = 'fraud-detection-data'
DATA_PREFIX = 'processed'

s3_client = boto3.client('s3')


def load_parquet_from_s3(bucket: str, key: str) -> pd.DataFrame:
    """Load a Parquet file from S3 into a pandas DataFrame."""
    response = s3_client.get_object(Bucket=bucket, Key=key)
    return pd.read_parquet(io.BytesIO(response['Body'].read()))


train_df = load_parquet_from_s3(BUCKET_NAME, f'{DATA_PREFIX}/train.parquet')
test_df = load_parquet_from_s3(BUCKET_NAME, f'{DATA_PREFIX}/test.parquet')

print(f'Training set:  {train_df.shape[0]:,} rows, {train_df.shape[1]} columns')
print(f'Test set:      {test_df.shape[0]:,} rows, {test_df.shape[1]} columns')
train_df.head()

## 3. Time Feature Creation

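The `Time` column represents seconds since the first transaction in the dataset.
Because it is an offset rather than a timestamp, the derived values are relative
to that first transaction; they match true clock hours and weekdays only if the
recording started at a known reference point.
`create_time_features` converts it to human-readable temporal signals: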
- `hour` — hour of the day (0–23)
- `day_of_week` — day of the week (0 = Monday, 6 = Sunday)
- `is_weekend` — 1 if Saturday or Sunday, 0 otherwise
- `is_night` — 1 if hour is between 22:00 and 06:00

**Requirement 5.1**: Provide utilities for creating derived features (time-based)
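
The derivation of these four columns can be sketched as follows. This is an illustrative stand-in, not the project's `create_time_features`: the function name `sketch_time_features`, the choice to treat offset 0 as Monday 00:00, and the exclusive upper bound at 06:00 are all assumptions.

```python
import pandas as pd

SECONDS_PER_HOUR = 3600
SECONDS_PER_DAY = 86400


def sketch_time_features(df: pd.DataFrame, time_col: str = "Time") -> pd.DataFrame:
    """Hypothetical stand-in for create_time_features (not the project's code)."""
    out = df.copy()
    seconds = out[time_col]
    # Hour within the day, 0-23.
    out["hour"] = ((seconds % SECONDS_PER_DAY) // SECONDS_PER_HOUR).astype(int)
    # Assumption: offset 0 is treated as Monday 00:00, so day 0 maps to Monday.
    out["day_of_week"] = ((seconds // SECONDS_PER_DAY) % 7).astype(int)
    out["is_weekend"] = (out["day_of_week"] >= 5).astype(int)
    # The night window wraps past midnight, so it is a union of two ranges.
    out["is_night"] = ((out["hour"] >= 22) | (out["hour"] < 6)).astype(int)
    return out


# Three toy offsets: start of day 0, 23:00 on day 0, and noon on day 5.
demo = sketch_time_features(
    pd.DataFrame({"Time": [0, 23 * 3600, 5 * 86400 + 12 * 3600]})
)
print(demo)
```

Note that the night indicator cannot be written as a single range check because 22:00 to 06:00 crosses midnight.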

In [None]:
engineer = FeatureEngineer()

# Create time features on both the training and test sets
train_df = engineer.create_time_features(train_df)
test_df = engineer.create_time_features(test_df)

time_features = ['hour', 'day_of_week', 'is_weekend', 'is_night']
print('New time features:')
train_df[time_features].describe()

In [None]:
# Visualize time feature distributions
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Hour distribution by class
ax = axes[0, 0]
for label, group in train_df.groupby('Class'):
    ax.hist(group['hour'], bins=24, alpha=0.6,
            label='Legitimate' if label == 0 else 'Fraud', density=True)
ax.set_xlabel('Hour of Day')
ax.set_ylabel('Density')
ax.set_title('Transaction Hour Distribution by Class')
ax.legend()

# Day of week distribution by class
ax = axes[0, 1]
day_labels = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
for label, group in train_df.groupby('Class'):
    counts = group['day_of_week'].value_counts().sort_index()
    ax.bar(counts.index + (-0.2 if label == 0 else 0.2), counts.values / counts.sum(),
           width=0.4, alpha=0.7, label='Legitimate' if label == 0 else 'Fraud')
ax.set_xticks(range(7))
ax.set_xticklabels(day_labels)
ax.set_xlabel('Day of Week')
ax.set_ylabel('Proportion')
ax.set_title('Day of Week Distribution by Class')
ax.legend()

# Weekend vs weekday fraud rate
ax = axes[1, 0]
weekend_fraud = train_df.groupby('is_weekend')['Class'].mean()
bars = ax.bar(['Weekday', 'Weekend'], weekend_fraud.values, color=['steelblue', 'coral'])
for bar, val in zip(bars, weekend_fraud.values):
    ax.text(bar.get_x() + bar.get_width() / 2, bar.get_height() + 0.0002,
            f'{val:.4f}', ha='center')
ax.set_ylabel('Fraud Rate')
ax.set_title('Fraud Rate: Weekday vs Weekend')

# Night vs day fraud rate
ax = axes[1, 1]
night_fraud = train_df.groupby('is_night')['Class'].mean()
bars = ax.bar(['Daytime', 'Nighttime'], night_fraud.values, color=['steelblue', 'coral'])
for bar, val in zip(bars, night_fraud.values):
    ax.text(bar.get_x() + bar.get_width() / 2, bar.get_height() + 0.0002,
            f'{val:.4f}', ha='center')
ax.set_ylabel('Fraud Rate')
ax.set_title('Fraud Rate: Daytime vs Nighttime')

plt.tight_layout()
plt.savefig('time_features.png', dpi=150)
plt.show()

## 4. Amount Feature Creation

`create_amount_features` derives three transformations of the `Amount` column:
- `amount_log` — log(1 + Amount), reduces skewness
- `amount_squared` — Amount², amplifies large-value differences
- `amount_sqrt` — √Amount, compresses large values

**Requirement 5.1**: Provide utilities for creating derived features (aggregations)
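
As a rough illustration of the three transformations, the sketch below applies them to a few toy amounts. The function `sketch_amount_features` is a hypothetical stand-in and may differ from the project's `create_amount_features`, for example in how it handles missing or negative values.

```python
import numpy as np
import pandas as pd


def sketch_amount_features(df: pd.DataFrame, amount_col: str = "Amount") -> pd.DataFrame:
    """Hypothetical stand-in for create_amount_features (not the project's code)."""
    out = df.copy()
    amount = out[amount_col]
    # log1p computes log(1 + x), which is defined at Amount == 0.
    out["amount_log"] = np.log1p(amount)
    out["amount_squared"] = amount ** 2
    # Assumes non-negative amounts; sqrt of a negative value would be NaN.
    out["amount_sqrt"] = np.sqrt(amount)
    return out


amounts = sketch_amount_features(pd.DataFrame({"Amount": [0.0, 1.0, 99.0, 10_000.0]}))
print(amounts.round(3))
```

The toy output makes the effect on scale visible: the squared column spans eight orders of magnitude, while the log column stays below 10.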

In [None]:
train_df = engineer.create_amount_features(train_df)
test_df = engineer.create_amount_features(test_df)

amount_features = ['Amount', 'amount_log', 'amount_squared', 'amount_sqrt']
print('Amount feature statistics:')
train_df[amount_features].describe()

In [None]:
# Visualize amount feature distributions
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

for idx, feat in enumerate(amount_features):
    ax = axes[idx // 2, idx % 2]
    for label, group in train_df.groupby('Class'):
        ax.hist(group[feat], bins=50, alpha=0.6, density=True,
                label='Legitimate' if label == 0 else 'Fraud')
    ax.set_xlabel(feat)
    ax.set_ylabel('Density')
    ax.set_title(f'{feat} Distribution by Class')
    ax.legend()

plt.tight_layout()
plt.savefig('amount_features.png', dpi=150)
plt.show()

## 5. Interaction Feature Creation

`create_interaction_features` computes element-wise products for specified
feature pairs. This captures non-linear relationships between features.

**Requirement 5.1**: Provide utilities for creating derived features (interactions)
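
An element-wise product for a list of pairs can be sketched in a few lines. The helper `sketch_interaction_features` below is hypothetical; the `{a}_x_{b}` column naming is taken from how this notebook refers to the resulting columns.

```python
import pandas as pd


def sketch_interaction_features(df: pd.DataFrame, pairs) -> pd.DataFrame:
    """Hypothetical stand-in for create_interaction_features."""
    out = df.copy()
    for a, b in pairs:
        # Row-by-row product of the two columns.
        out[f"{a}_x_{b}"] = out[a] * out[b]
    return out


toy = pd.DataFrame({"V1": [1.0, -2.0], "V2": [3.0, 4.0], "Amount": [10.0, 0.5]})
result = sketch_interaction_features(toy, [("V1", "V2"), ("V1", "Amount")])
print(result)
```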

In [None]:
interaction_pairs = [('V1', 'V2'), ('V3', 'V4'), ('V1', 'Amount')]

train_df = engineer.create_interaction_features(train_df, interaction_pairs)
test_df = engineer.create_interaction_features(test_df, interaction_pairs)

interaction_cols = [f'{a}_x_{b}' for a, b in interaction_pairs]
print('Interaction features created:', interaction_cols)
train_df[interaction_cols].describe()

## 6. Feature Selection — Univariate (SelectKBest)

`select_features_univariate` uses the ANOVA F-value (`f_classif`) to rank
features by their individual predictive power and selects the top *k*.

**Requirement 5.2**: Provide feature selection methods (feature importance, correlation analysis)
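
For readers unfamiliar with `SelectKBest`, the sketch below shows one way such a method could be built on scikit-learn. The function name `sketch_select_univariate` and the toy data are assumptions; the return shape (a list of names plus a `feature`/`score` table) mirrors how the notebook uses the project method, but the real implementation may differ.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif


def sketch_select_univariate(X: pd.DataFrame, y: pd.Series, k: int):
    """Hypothetical stand-in for select_features_univariate."""
    selector = SelectKBest(score_func=f_classif, k=k).fit(X, y)
    scores = (
        pd.DataFrame({"feature": X.columns, "score": selector.scores_})
        .sort_values("score", ascending=False)
        .reset_index(drop=True)
    )
    # get_support() returns a boolean mask over the input columns.
    selected = X.columns[selector.get_support()].tolist()
    return selected, scores


# Toy data: one column shifted by class, two columns of pure noise.
rng = np.random.default_rng(0)
y_toy = pd.Series(np.repeat([0, 1], 100))
X_toy = pd.DataFrame({
    "informative": y_toy * 3.0 + rng.normal(size=200),
    "noise_a": rng.normal(size=200),
    "noise_b": rng.normal(size=200),
})
selected, scores = sketch_select_univariate(X_toy, y_toy, k=1)
print(selected)
print(scores)
```

Because the F-test scores each feature in isolation, it can miss features that are only useful in combination with others.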

In [None]:
# Prepare feature matrix (exclude target and raw Time)
TARGET = 'Class'
EXCLUDE = [TARGET, 'Time']
ALL_FEATURES = [c for c in train_df.columns if c not in EXCLUDE]

X_train = train_df[ALL_FEATURES]
y_train = train_df[TARGET]
X_test = test_df[ALL_FEATURES]
y_test = test_df[TARGET]

print(f'Total features available: {len(ALL_FEATURES)}')

In [None]:
K = 20
selected_univariate, scores_df = engineer.select_features_univariate(X_train, y_train, k=K)

print(f'Top {K} features (univariate):')
print(selected_univariate)
scores_df.head(K)

In [None]:
# Visualize univariate feature scores
top_scores = scores_df.head(K)

plt.figure(figsize=(12, 8))
sns.barplot(data=top_scores, x='score', y='feature', palette='viridis')
plt.title(f'Top {K} Features — Univariate F-Score (SelectKBest)')
plt.xlabel('F-Score')
plt.ylabel('Feature')
plt.tight_layout()
plt.savefig('univariate_feature_scores.png', dpi=150)
plt.show()

## 7. Feature Selection — Recursive Feature Elimination (RFE)

`select_features_rfe` uses a Random Forest estimator to iteratively remove the
least important features until the desired count is reached. Features with
ranking = 1 are selected.

**Requirement 5.2**: Provide feature selection methods (recursive elimination)
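
The elimination loop described above is available in scikit-learn as `RFE`. The following sketch wraps it in a hypothetical `sketch_select_rfe` function; the forest size, the `step=1` setting, and the toy data are assumptions rather than the project's configuration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE


def sketch_select_rfe(X: pd.DataFrame, y: pd.Series, n_features: int):
    """Hypothetical stand-in for select_features_rfe."""
    estimator = RandomForestClassifier(n_estimators=50, random_state=0)
    # step=1 drops one feature per round, refitting the forest each time.
    rfe = RFE(estimator=estimator, n_features_to_select=n_features, step=1).fit(X, y)
    ranking = (
        pd.DataFrame({"feature": X.columns, "ranking": rfe.ranking_})
        .sort_values("ranking")
        .reset_index(drop=True)
    )
    selected = X.columns[rfe.support_].tolist()
    return selected, ranking


# Toy data: two columns shifted by class, two columns of pure noise.
rng = np.random.default_rng(0)
y_toy = pd.Series(np.repeat([0, 1], 150))
X_toy = pd.DataFrame({
    "signal_a": y_toy * 4.0 + rng.normal(size=300),
    "noise_a": rng.normal(size=300),
    "signal_b": y_toy * -4.0 + rng.normal(size=300),
    "noise_b": rng.normal(size=300),
})
selected_toy, ranking_toy = sketch_select_rfe(X_toy, y_toy, n_features=2)
print(selected_toy)
print(ranking_toy)
```

Each round refits the estimator, so RFE on the full training set is considerably slower than the univariate filter.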

In [None]:
N_FEATURES_RFE = 20
selected_rfe, ranking_df = engineer.select_features_rfe(X_train, y_train, n_features=N_FEATURES_RFE)

print(f'Top {N_FEATURES_RFE} features (RFE):')
print(selected_rfe)
ranking_df.head(N_FEATURES_RFE)

In [None]:
# Visualize RFE rankings
plt.figure(figsize=(12, 8))
sorted_ranking = ranking_df.sort_values('ranking')
colors = ['forestgreen' if r == 1 else 'lightgray' for r in sorted_ranking['ranking']]
sns.barplot(data=sorted_ranking, x='ranking', y='feature', palette=colors)
plt.title('Feature Rankings — Recursive Feature Elimination')
plt.xlabel('Ranking (1 = selected)')
plt.ylabel('Feature')
plt.tight_layout()
plt.savefig('rfe_feature_rankings.png', dpi=150)
plt.show()

## 8. Feature Importance Analysis

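`analyze_feature_importance` trains a Random Forest on the full feature set and
returns Gini (impurity-based) importance scores. Unlike the univariate F-score,
these reflect each feature's contribution in the presence of all the others,
although impurity-based importance tends to favor continuous, high-cardinality
features.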

**Requirement 5.2**: Provide feature selection methods (feature importance)
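
A minimal sketch of this kind of ranking, assuming scikit-learn's `RandomForestClassifier` and its `feature_importances_` attribute, is shown below. The function name `sketch_feature_importance`, the forest size, and the toy data are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier


def sketch_feature_importance(X: pd.DataFrame, y: pd.Series) -> pd.DataFrame:
    """Hypothetical stand-in for analyze_feature_importance."""
    model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
    # feature_importances_ holds the normalized mean decrease in impurity.
    return (
        pd.DataFrame({"feature": X.columns, "importance": model.feature_importances_})
        .sort_values("importance", ascending=False)
        .reset_index(drop=True)
    )


# Toy data: one column shifted by class, two columns of pure noise.
rng = np.random.default_rng(1)
y_imp = pd.Series(np.repeat([0, 1], 100))
X_imp = pd.DataFrame({
    "noise_a": rng.normal(size=200),
    "signal": y_imp * 3.0 + rng.normal(size=200),
    "noise_b": rng.normal(size=200),
})
importance_toy = sketch_feature_importance(X_imp, y_imp)
print(importance_toy)
```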

In [None]:
importance_df = engineer.analyze_feature_importance(X_train, y_train)

print('Top 20 features by Random Forest importance:')
importance_df.head(20)

In [None]:
# Bar chart of feature importance
top_importance = importance_df.head(20)

plt.figure(figsize=(12, 8))
sns.barplot(data=top_importance, x='importance', y='feature', palette='magma')
plt.title('Top 20 Features — Random Forest Importance')
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.tight_layout()
plt.savefig('feature_importance.png', dpi=150)
plt.show()

## 9. Feature Impact Analysis — Selected vs All Features

To quantify the impact of feature selection, we train an XGBoost model twice:
1. Using **all** available features
2. Using only the **selected** features from univariate selection

This shows whether feature selection improves, maintains, or degrades performance
as dimensionality is reduced.

**Requirement 5.4**: Provide feature impact analysis showing performance changes from feature additions or removals
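
One way to turn "improves or maintains" into an explicit check is to compare the two metric dictionaries against a tolerance. The helper `compare_metrics` and its default tolerance of 0.005 are assumptions made for illustration, and the numbers in the toy call are placeholders, not results from this dataset.

```python
def compare_metrics(baseline: dict, candidate: dict, tolerance: float = 0.005) -> dict:
    """Hypothetical helper: label each metric change relative to a tolerance."""
    report = {}
    for name, base_value in baseline.items():
        delta = candidate[name] - base_value
        if delta > tolerance:
            verdict = "improved"
        elif delta < -tolerance:
            verdict = "degraded"
        else:
            verdict = "maintained"
        report[name] = (round(delta, 4), verdict)
    return report


# Placeholder values chosen only to exercise all three verdicts.
toy_all = {"precision": 0.90, "recall": 0.80, "auc_roc": 0.970}
toy_selected = {"precision": 0.92, "recall": 0.78, "auc_roc": 0.971}
report = compare_metrics(toy_all, toy_selected)
for metric, (delta, verdict) in report.items():
    print(f"{metric}: {delta:+.4f} ({verdict})")
```

In the notebook the same helper could be applied to `metrics_all` and `metrics_selected` once they have been computed.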

In [None]:
def evaluate_model(model, X_tr, y_tr, X_te, y_te):
    """Train a model and return classification metrics."""
    model.fit(X_tr, y_tr)
    y_pred = model.predict(X_te)
    y_proba = model.predict_proba(X_te)[:, 1]
    return {
        'accuracy': accuracy_score(y_te, y_pred),
        'precision': precision_score(y_te, y_pred),
        'recall': recall_score(y_te, y_pred),
        'f1': f1_score(y_te, y_pred),
        'auc_roc': roc_auc_score(y_te, y_proba),
    }


# Shared hyperparameters so both models differ only in their input features.
# A fixed random_state keeps the subsampled runs repeatable.
XGB_PARAMS = dict(
    max_depth=5, learning_rate=0.2, n_estimators=100,
    subsample=0.8, colsample_bytree=0.8,
    eval_metric='logloss', random_state=42,
)

# Model with ALL features
model_all = XGBClassifier(**XGB_PARAMS)
metrics_all = evaluate_model(model_all, X_train, y_train, X_test, y_test)

# Model with SELECTED features (univariate top-K)
model_selected = XGBClassifier(**XGB_PARAMS)
metrics_selected = evaluate_model(
    model_selected,
    X_train[selected_univariate], y_train,
    X_test[selected_univariate], y_test,
)

print(f'All features ({len(ALL_FEATURES)}):      {metrics_all}')
print(f'Selected features ({K}): {metrics_selected}')

In [None]:
# Compare metrics side by side
comparison = pd.DataFrame({
    'Metric': list(metrics_all.keys()),
    f'All Features ({len(ALL_FEATURES)})': list(metrics_all.values()),
    f'Selected Features ({K})': list(metrics_selected.values()),
})
comparison['Difference'] = comparison[f'Selected Features ({K})'] - comparison[f'All Features ({len(ALL_FEATURES)})']
comparison

In [None]:
# Grouped bar chart: all features vs selected features
metric_names = list(metrics_all.keys())
x = np.arange(len(metric_names))
width = 0.35

fig, ax = plt.subplots(figsize=(12, 6))
bars1 = ax.bar(x - width / 2, list(metrics_all.values()), width,
               label=f'All Features ({len(ALL_FEATURES)})', color='steelblue')
bars2 = ax.bar(x + width / 2, list(metrics_selected.values()), width,
               label=f'Selected Features ({K})', color='coral')

ax.set_ylabel('Score')
ax.set_title('Feature Impact Analysis — All vs Selected Features')
ax.set_xticks(x)
ax.set_xticklabels([m.replace('_', ' ').title() for m in metric_names])
ax.set_ylim(0, 1.05)
ax.legend()

# Add value labels
for bar in bars1:
    ax.text(bar.get_x() + bar.get_width() / 2, bar.get_height() + 0.01,
            f'{bar.get_height():.3f}', ha='center', va='bottom', fontsize=9)
for bar in bars2:
    ax.text(bar.get_x() + bar.get_width() / 2, bar.get_height() + 0.01,
            f'{bar.get_height():.3f}', ha='center', va='bottom', fontsize=9)

plt.tight_layout()
plt.savefig('feature_impact_analysis.png', dpi=150)
plt.show()

## 10. Log Feature Engineering Experiment

Log the feature engineering experiment to the `ExperimentTracker` so that the
feature set, selection method, and resulting metrics are recorded for
reproducibility.

**Requirement 5.3**: Log the feature set used and resulting model performance
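
To make concrete what a logged run needs to capture, the sketch below models a single record with the standard library only. `RunRecord` is a hypothetical illustration and says nothing about how the project's `ExperimentTracker` stores data; the metric value is a placeholder.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field


@dataclass
class RunRecord:
    """Hypothetical record of one run; not the project's ExperimentTracker."""
    experiment_name: str
    algorithm: str
    description: str
    run_id: str = field(default_factory=lambda: uuid.uuid4().hex[:8])
    parameters: dict = field(default_factory=dict)
    metrics: dict = field(default_factory=dict)

    def to_json(self) -> str:
        # Sorted keys keep the serialized form stable across runs.
        return json.dumps(asdict(self), sort_keys=True)


record = RunRecord("feature-engineering", "XGBoost", "toy run")
record.parameters.update({"n_features": 3, "selection_method": "univariate_f_classif"})
record.metrics.update({"f1": 0.5})  # placeholder value, not a real result
restored = json.loads(record.to_json())
print(restored)
```

Storing the feature list and selection method alongside the metrics is what lets a later reader reconstruct which inputs produced which scores.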

In [None]:
tracker = ExperimentTracker(region_name='us-east-1')

# Log experiment with all features
exp_all = tracker.start_experiment(
    experiment_name='feature-engineering',
    algorithm='XGBoost',
    description='Baseline with all engineered features',
)
tracker.log_parameters(exp_all, {
    'n_features': len(ALL_FEATURES),
    'selection_method': 'none',
    'feature_set': 'all',
    'max_depth': 5,
    'learning_rate': 0.2,
    'n_estimators': 100,
})
tracker.log_metrics(exp_all, metrics_all)
tracker.close_experiment(exp_all)
print(f'Logged all-features experiment: {exp_all}')

# Log experiment with selected features
exp_selected = tracker.start_experiment(
    experiment_name='feature-engineering',
    algorithm='XGBoost',
    description=f'Univariate top-{K} selected features',
)
tracker.log_parameters(exp_selected, {
    'n_features': len(selected_univariate),
    'selection_method': 'univariate_f_classif',
    'feature_set': ', '.join(selected_univariate),
    'max_depth': 5,
    'learning_rate': 0.2,
    'n_estimators': 100,
})
tracker.log_metrics(exp_selected, metrics_selected)
tracker.close_experiment(exp_selected)
print(f'Logged selected-features experiment: {exp_selected}')

## 11. Summary

### What we covered

| Step | Technique | New Features / Output |
|------|-----------|----------------------|
| Time features | `create_time_features` | `hour`, `day_of_week`, `is_weekend`, `is_night` |
| Amount features | `create_amount_features` | `amount_log`, `amount_squared`, `amount_sqrt` |
| Interaction features | `create_interaction_features` | `V1_x_V2`, `V3_x_V4`, `V1_x_Amount` |
| Univariate selection | `select_features_univariate` | Top-K features by F-score |
| RFE selection | `select_features_rfe` | Top-N features by recursive elimination |
| Importance analysis | `analyze_feature_importance` | Random Forest Gini importance ranking |
| Impact analysis | XGBoost comparison | All features vs selected features |

### Key takeaways

- Feature selection can maintain (or improve) model performance while reducing
  dimensionality, leading to faster training and simpler models.
- Different selection methods may agree on the most important features but
  diverge on borderline ones — combining methods gives a more robust picture.
- Both comparison runs are logged to the `ExperimentTracker`, so the feature
  set, parameters, and metrics of each can be reproduced.
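
The agreement between two selection methods mentioned in the second takeaway can be quantified with plain set operations. The helper `selection_overlap` is a hypothetical name, and the feature lists in the toy call are placeholders rather than the notebook's actual selections.

```python
def selection_overlap(first, second) -> dict:
    """Hypothetical helper: summarize agreement between two feature lists."""
    a, b = set(first), set(second)
    union = a | b
    return {
        "agreed": sorted(a & b),
        "only_first": sorted(a - b),
        "only_second": sorted(b - a),
        # Jaccard similarity: shared features divided by all features seen.
        "jaccard": len(a & b) / len(union) if union else 1.0,
    }


# Placeholder lists; in the notebook these would be the two selections.
overlap = selection_overlap(["V14", "V17", "amount_log"], ["V14", "V17", "V4"])
print(overlap)
```

In the notebook, `selection_overlap(selected_univariate, selected_rfe)` would show which features both methods keep and which are borderline.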

### Next steps

1. **Tune hyperparameters** on the selected feature set using
   `02_hyperparameter_tuning.ipynb`.
2. **Compare algorithms** with the engineered features using
   `03_algorithm_comparison.ipynb`.
3. **Promote to production** via `05_production_promotion.ipynb` once the best
   configuration is identified.