## Feature Engineering

**Goal**: Transform raw data into clean, model-ready features for machine learning.

**Process Flow:**
```
Raw Data → Clean → Encode → Impute → Scale → Ready for ML
```

In [1]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import joblib

pd.set_option('display.max_columns', None)
plt.style.use('seaborn-v0_8-whitegrid')
RANDOM_STATE = 42

print("Environment ready! ✓")

Environment ready! ✓


In [2]:
processed_dataset = '/Users/aireesm4/Python_Projects/Brewlytics_Chat/Cafe_Rewards_Offers/processed_data_for_classification.csv'
df = pd.read_csv(processed_dataset)
print(f"Dataset loaded: {df.shape[0]:,} rows × {df.shape[1]} columns")

Dataset loaded: 86,432 rows × 28 columns


## Data Overview & Initial Analysis

In [3]:
print("="*60)
print("MISSING VALUES ANALYSIS")
print("="*60)
missing_values = df.isnull().sum()[df.isnull().sum() > 0]
if len(missing_values) > 0:
    print("\nColumns with missing values:")
    for col, count in missing_values.items():
        pct = (count / len(df)) * 100
        print(f"  {col}: {count:,} ({pct:.2f}%)")
else:
    print("\n✓ No missing values found in original dataset!")

MISSING VALUES ANALYSIS

Columns with missing values:
  completion_time: 40,280 (46.60%)
  time_to_action: 40,280 (46.60%)
  tenure_group: 111 (0.13%)


## ⚠️ Data Leakage Check

**Critical Issue:** `completion_time` and `time_to_action` only exist for completed offers. 
These features would leak the target variable!

Also, `offer_viewed` is a strong predictor but may not be available at prediction time (real-time inference).

We'll handle this by:
1. Dropping `completion_time` and `time_to_action` (data leakage)
2. Dropping identifier columns (`customer_id`, `offer_id`, `index`)
3. Dropping raw date columns (`became_member_on`, `became_member_date`)
4. **KEEPING** `offer_viewed` for now, but we'll note it as potential data leak

In [4]:
# Drop columns with data leakage and identifiers
cols_to_drop = ['index', 'customer_id', 'offer_id', 'completion_time', 'time_to_action', 
                'became_member_on', 'became_member_date']
df = df.drop(columns=[col for col in cols_to_drop if col in df.columns])

print(f"After dropping leakage/identifier columns: {df.shape[0]:,} rows × {df.shape[1]} columns")
print(f"\nRemaining columns: {list(df.columns)}")

After dropping leakage/identifier columns: 86,432 rows × 22 columns

Remaining columns: ['received_time', 'difficulty', 'duration', 'offer_type', 'in_email', 'in_mobile', 'in_social', 'in_web', 'offer_received', 'offer_viewed', 'offer_completed', 'target', 'gender', 'age', 'income', 'membership_year', 'is_demographics_missing', 'age_group', 'income_bracket', 'membership_duration_days', 'membership_month', 'tenure_group']


## Feature Engineering Plan

### 1. Categorical Encoding
- `offer_type`: **One-Hot Encoding** (bogo, discount, informational)
  - **Rationale**: No inherent order, create binary features for each category
- `gender`: **One-Hot Encoding** (F, M, O, Missing)
  - **Rationale**: Nominal variable with no order
- `age_group`: **Ordinal Encoding** (18-30 → 0, 31-45 → 1, 46-60 → 2, 61-75 → 3, 76+ → 4)
  - **Rationale**: Natural ordering exists (older > younger), preserve ordinal relationship
- `income_bracket`: **Ordinal Encoding** (Missing → 0, Low → 1, Medium → 2, High → 3, Very High → 4)
  - **Rationale**: Clear ordinal progression in income levels
- `tenure_group`: **Ordinal Encoding** (0-6 months → 0, 6-12 months → 1, 1-2 years → 2, 2+ years → 3)
  - **Rationale**: Chronological ordering of customer tenure

### 2. Numerical Scaling
**Features to scale**: `difficulty`, `duration`, `age`, `income`, `membership_duration_days`

**Method: StandardScaler (Z-score normalization)**
- Formula: $z = \frac{x - \mu}{\sigma}$
- **Rationale**:
  - Centers data around 0 (mean)
  - Scales to unit variance (std = 1)
  - Essential for Logistic Regression, k-NN, SVM, and gradient descent convergence
- **Not scaling binary features** (0/1 values already on same scale)

### 3. Binary Features (Already Encoded)
- `in_email`, `in_mobile`, `in_social`, `in_web` (marketing channel flags)
- `is_demographics_missing` (missing data indicator)
- `offer_viewed` (action flag - potential data leak)

**Rationale for leaving unchanged**: Already binary (0/1), scaling not needed

In [5]:
df_encoded = df.copy()

print("="*60)
print("STEP 1: CATEGORICAL ENCODING")
print("="*60)

# One-Hot Encoding for nominal variables
ohe_cols = ['offer_type', 'gender']
print(f"\nOne-Hot Encoding: {ohe_cols}")

for col in ohe_cols:
    if col in df_encoded.columns:
        print(f"  - {col}: {df_encoded[col].nunique()} unique values")
        dummies = pd.get_dummies(df_encoded[col], prefix=col, drop_first=False)
        df_encoded = pd.concat([df_encoded.drop(col, axis=1), dummies], axis=1)
        print(f"    → Created {len(dummies.columns)} dummy columns")

# Ordinal Encoding for ordered variables
ordinal_mappings = {
    'age_group': ['18-30', '31-45', '46-60', '61-75', '76+'],
    'income_bracket': ['Missing', 'Low', 'Medium', 'High', 'Very High'],
    'tenure_group': ['0-6 months', '6-12 months', '1-2 years', '2+ years']
}

print(f"\nOrdinal Encoding: {list(ordinal_mappings.keys())}")

for col, categories in ordinal_mappings.items():
    if col in df_encoded.columns:
        print(f"  - {col}: {categories}")
        df_encoded[col + '_encoded'] = df_encoded[col].map({cat: i for i, cat in enumerate(categories)})
        df_encoded = df_encoded.drop(col, axis=1)
        print(f"    → Created {col}_encoded (0-{len(categories)-1})")

print(f"\nAfter encoding: {df_encoded.shape[0]:,} rows × {df_encoded.shape[1]} columns")

STEP 1: CATEGORICAL ENCODING

One-Hot Encoding: ['offer_type', 'gender']
  - offer_type: 3 unique values
    → Created 3 dummy columns
  - gender: 4 unique values
    → Created 4 dummy columns

Ordinal Encoding: ['age_group', 'income_bracket', 'tenure_group']
  - age_group: ['18-30', '31-45', '46-60', '61-75', '76+']
    → Created age_group_encoded (0-4)
  - income_bracket: ['Missing', 'Low', 'Medium', 'High', 'Very High']
    → Created income_bracket_encoded (0-4)
  - tenure_group: ['0-6 months', '6-12 months', '1-2 years', '2+ years']
    → Created tenure_group_encoded (0-3)

After encoding: 86,432 rows × 27 columns


## Missing Value Imputation Strategy

### Problem Identified
After ordinal encoding, categorical features with missing values produce `NaN` in the encoded numeric representation.
- **Affected Feature**: `tenure_group_encoded` (111 rows with missing values)
- **Root Cause**: `pd.Series.map()` returns `NaN` for unmapped values (missing data)

### Imputation Strategy & Justification

**Selected Strategy: Median Imputation**

**Why Median over Mean?**
1. **Robust to outliers** - Median is less affected by extreme values
2. **Preserves distribution** - More representative of central tendency when data is skewed
3. **Ordinal scale compatibility** - For encoded categorical data (0-3 scale), median maintains reasonable position

**Why not Mean?**
- Mean could result in non-integer values (e.g., 1.7) on discrete ordinal scales
- More sensitive to outliers that could skew the imputed value

**Why not Mode?**
- Could over-represent the most common tenure group
- May lose variance in the dataset
- For `tenure_group`: mode = "6-12 months" (30%), but median better captures middle of distribution

**Why not Remove Rows?**
- 111 rows = 0.13% of 86,432 total samples
- Removing would lose valuable information from these customers
- Could introduce bias if missingness is not random
- **Imputation preserves sample size and statistical power**

### Data Loss Impact

- **Before imputation**: 111/86,432 rows (0.13%) with missing `tenure_group`
- **After imputation**: 0 rows with missing values
- **Preserved**: All 86,432 samples for training

This is acceptable given:
- Small percentage of missing data (<1%)
- Median provides a reasonable central tendency estimate
- The model can still learn from all other features of these rows

In [6]:
# Separate features and target
X = df_encoded.drop('target', axis=1)
y = df_encoded['target']

print(f"Features shape: {X.shape}")
print(f"Target shape: {y.shape}")

# Split data (80/20, stratified)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=RANDOM_STATE, stratify=y
)

print(f"\nTrain set: {X_train.shape[0]:,} samples ({(1-0.2)*100:.0f}%)")
print(f"Test set: {X_test.shape[0]:,} samples ({0.2*100:.0f}%)")

print(f"\nTarget distribution in train set:")
print(y_train.value_counts(normalize=True).round(3))
print(f"\nTarget distribution in test set:")
print(y_test.value_counts(normalize=True).round(3))

Features shape: (86432, 26)
Target shape: (86432,)

Train set: 69,145 samples (80%)
Test set: 17,287 samples (20%)

Target distribution in train set:
target
1    0.534
0    0.466
Name: proportion, dtype: float64

Target distribution in test set:
target
1    0.534
0    0.466
Name: proportion, dtype: float64


In [7]:
print("="*60)
print("STEP 2: MISSING VALUE IMPUTATION")
print("="*60)

# Check for NaN values
nan_cols = X_train.columns[X_train.isnull().any()].tolist()

print(f"\nColumns with NaN values: {len(nan_cols)}")
if len(nan_cols) > 0:
    for col in nan_cols:
        missing_count = X_train[col].isnull().sum()
        pct = (missing_count / len(X_train)) * 100
        print(f"  - {col}: {missing_count:,} missing ({pct:.2f}%)")

# Impute NaN with median for numerical features
if len(nan_cols) > 0:
    print(f"\nImputation Strategy: Median Imputation")
    print(f"Reason: Robust to outliers, preserves ordinal scale\n")
    
    for col in nan_cols:
        missing_count = X_train[col].isnull().sum()
        
        # Calculate median of non-missing values
        col_median = X_train[col][~X_train[col].isnull()].median()
        
        # Impute in both train and test
        X_train[col].fillna(col_median, inplace=True)
        X_test[col].fillna(col_median, inplace=True)
        
        print(f"  ✓ {col}: {missing_count} values imputed with median={col_median:.2f}")
else:
    print("\n✓ No NaN values requiring imputation!")

# Final verification
final_nan_train = X_train.isnull().sum().sum()
final_nan_test = X_test.isnull().sum().sum()

print(f"\n" + "="*60)
print("IMPUTATION COMPLETE")
print("="*60)
print(f"\nFinal Check:")
print(f"  NaN in train: {final_nan_train}")
print(f"  NaN in test: {final_nan_test}")
print(f"  ✓ All missing values resolved!")

STEP 2: MISSING VALUE IMPUTATION

Columns with NaN values: 1
  - tenure_group_encoded: 87 missing (0.13%)

Imputation Strategy: Median Imputation
Reason: Robust to outliers, preserves ordinal scale

  ✓ tenure_group_encoded: 87 values imputed with median=2.00

IMPUTATION COMPLETE

Final Check:
  NaN in train: 0
  NaN in test: 0
  ✓ All missing values resolved!


The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  X_train[col].fillna(col_median, inplace=True)
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  X_test[col].fillna(col_median, inplace=True)


In [8]:
# Identify numerical columns to scale
numerical_cols = X_train.select_dtypes(include=['int64', 'float64']).columns.tolist()

# Exclude binary columns (0/1) from scaling
numerical_cols_to_scale = []
binary_cols = []

for col in numerical_cols:
    unique_vals = X_train[col].nunique()
    if unique_vals > 2:  # Not binary
        numerical_cols_to_scale.append(col)
    else:
        binary_cols.append(col)

print(f"\nBinary features (not scaling): {len(binary_cols)}")
for col in binary_cols:
    print(f"  - {col}")

print(f"\nNumerical features to scale: {len(numerical_cols_to_scale)}")
for col in numerical_cols_to_scale:
    print(f"  - {col}")


Binary features (not scaling): 8
  - in_email
  - in_mobile
  - in_social
  - in_web
  - offer_received
  - offer_viewed
  - offer_completed
  - is_demographics_missing

Numerical features to scale: 11
  - received_time
  - difficulty
  - duration
  - age
  - income
  - membership_year
  - membership_duration_days
  - membership_month
  - age_group_encoded
  - income_bracket_encoded
  - tenure_group_encoded


In [9]:
print("="*60)
print("STEP 3: FEATURE SCALING")
print("="*60)

# Fit scaler on training data only
scaler = StandardScaler()

# Scale training data
X_train_scaled = X_train.copy()
X_train_scaled[numerical_cols_to_scale] = scaler.fit_transform(X_train[numerical_cols_to_scale])

# Scale test data (using fitted scaler)
X_test_scaled = X_test.copy()
X_test_scaled[numerical_cols_to_scale] = scaler.transform(X_test[numerical_cols_to_scale])

print(f"\nScaled features (sample from train set):")
print(X_train_scaled[numerical_cols_to_scale].describe().round(2))

STEP 3: FEATURE SCALING

Scaled features (sample from train set):
       received_time  difficulty  duration       age    income  \
count       69145.00    69145.00  69145.00  69145.00  69145.00   
mean           -0.00        0.00      0.00      0.00     -0.00   
std             1.00        1.00      1.00      1.00      1.00   
min            -1.70       -1.46     -1.66     -1.69     -1.99   
25%            -0.84       -0.54     -0.74     -0.65     -0.56   
50%             0.38        0.39      0.18     -0.15      0.06   
75%             0.87        0.39      0.18      0.39      0.67   
max             1.24        2.24      1.56      2.16      2.10   

       membership_year  membership_duration_days  membership_month  \
count         69145.00                  69145.00          69145.00   
mean              0.00                      0.00              0.00   
std               1.00                      1.00              1.00   
min              -3.09                     -1.29           

In [10]:
print("="*60)
print("FEATURE ENGINEERING SUMMARY")
print("="*60)
print(f"\nOriginal features: {X_train.shape[1]}")
print(f"\nFeature breakdown:")
print(f"  - Binary features: {len(binary_cols)}")
print(f"  - Scaled numerical: {len(numerical_cols_to_scale)}")
print(f"  - One-hot encoded: {len([col for col in X_train.columns if col.startswith('offer_type_') or col.startswith('gender_')])}")
print(f"  - Ordinal encoded: {len([col for col in X_train.columns if '_encoded' in col])}")

print(f"\nImputation Summary:")
print(f"  - Missing values handled: {len(nan_cols)} columns")
print(f"  - Strategy: Median imputation")
print(f"  - Samples preserved: {X_train.shape[0]:,} (100%)")
print(f"  - No data loss through row deletion")

print(f"\nFinal shapes:")
print(f"  X_train_scaled: {X_train_scaled.shape}")
print(f"  X_test_scaled: {X_test_scaled.shape}")
print(f"  y_train: {y_train.shape}")
print(f"  y_test: {y_test.shape}")

print(f"\nData Quality Checks:")
print(f"  ✓ No missing values (NaN): {X_train_scaled.isnull().sum().sum() == 0}")
print(f"  ✓ All features numeric: {all(X_train_scaled.dtypes.apply(lambda x: str(x).startswith(('int', 'float'))))}")
print(f"  ✓ Target is integer: {y_train.dtype == 'int64'}")

FEATURE ENGINEERING SUMMARY

Original features: 26

Feature breakdown:
  - Binary features: 8
  - Scaled numerical: 11
  - One-hot encoded: 7
  - Ordinal encoded: 3

Imputation Summary:
  - Missing values handled: 1 columns
  - Strategy: Median imputation
  - Samples preserved: 69,145 (100%)
  - No data loss through row deletion

Final shapes:
  X_train_scaled: (69145, 26)
  X_test_scaled: (17287, 26)
  y_train: (69145,)
  y_test: (17287,)

Data Quality Checks:
  ✓ No missing values (NaN): True
  ✓ All features numeric: False
  ✓ Target is integer: True


## Save Processed Data

Save processed datasets for modeling.

In [11]:
import joblib
import os

# Create output directory if it doesn't exist
os.makedirs('../Cafe_Rewards_Offers/processed', exist_ok=True)

# Save all processed datasets
joblib.dump(X_train_scaled, '../Cafe_Rewards_Offers/processed/X_train_scaled.pkl')
joblib.dump(X_test_scaled, '../Cafe_Rewards_Offers/processed/X_test_scaled.pkl')
joblib.dump(y_train, '../Cafe_Rewards_Offers/processed/y_train.pkl')
joblib.dump(y_test, '../Cafe_Rewards_Offers/processed/y_test.pkl')
joblib.dump(scaler, '../Cafe_Rewards_Offers/processed/scaler.pkl')

# Also save column names for reference
feature_names = X_train_scaled.columns.tolist()
joblib.dump(feature_names, '../Cafe_Rewards_Offers/processed/feature_names.pkl')

print("="*60)
print("✓ PROCESSED DATA SAVED")
print("="*60)
print("\nSaved files:")
print(f"  - X_train_scaled.pkl ({X_train_scaled.shape})")
print(f"  - X_test_scaled.pkl ({X_test_scaled.shape})")
print(f"  - y_train.pkl ({y_train.shape})")
print(f"  - y_test.pkl ({y_test.shape})")
print(f"  - scaler.pkl")
print(f"  - feature_names.pkl ({len(feature_names)} features)")

✓ PROCESSED DATA SAVED

Saved files:
  - X_train_scaled.pkl ((69145, 26))
  - X_test_scaled.pkl ((17287, 26))
  - y_train.pkl ((69145,))
  - y_test.pkl ((17287,))
  - scaler.pkl
  - feature_names.pkl (26 features)


## Feature Engineering Documentation Summary

### Pipeline Overview

```
Raw Data (86K×28) → Clean (86K×22) → Encode (86K×22) → Impute (86K×22) → Scale (86K×22) → Ready for ML
```

### Decisions Made & Justifications

#### 1. Data Leakage Removal
| Column | Action | Reason |
|--------|--------|--------|
| `index` | Dropped | Row identifier, no predictive power |
| `customer_id` | Dropped | Identifier, doesn't influence offer completion |
| `offer_id` | Dropped | Specific offer ID, model should learn characteristics not IDs |
| `completion_time` | Dropped | ⚠️ **DATA LEAKAGE** - Only exists for completed offers |
| `time_to_action` | Dropped | ⚠️ **DATA LEAKAGE** - Only exists for completed offers |
| `became_member_on` | Dropped | Raw date format, replaced with derived features |
| `became_member_date` | Dropped | Redundant, already have membership features |

**Note on `offer_viewed`**: This feature was KEPT because:
- It's available before completion (can view and still not complete)
- Strong predictor: viewing increases completion likelihood
- We'll train models with/without it to assess impact

#### 2. Categorical Encoding

**One-Hot Encoding (Nominal Features)**:
- `offer_type` → 3 binary columns (bogo, discount, informational)
- `gender` → 4 binary columns (F, M, O, Missing)
- **Why**: No inherent ordering, each category independent
- **Decision**: `drop_first=False` to keep all categories (useful for feature importance)

**Ordinal Encoding (Ordinal Features)**:
- `age_group` → 0 to 4 (18-30 to 76+)
- `income_bracket` → 0 to 4 (Missing to Very High)
- `tenure_group` → 0 to 3 (0-6m to 2+years)
- **Why**: Preserves natural ordering relationship
- **Decision**: `Missing` in income mapped to 0 (lowest tier)

#### 3. Missing Value Imputation

**Problem**: 111 rows (0.13%) with missing `tenure_group`

**Strategy**: Median imputation

**Justification**:
1. **Robust to outliers** - Less affected by extreme values
2. **Preserves ordinal scale** - Works with discrete 0-3 encoding
3. **Retains data** - No sample loss (0.13% is minimal)

#### 4. Feature Scaling

**Method**: StandardScaler (Z-score normalization)
- Scales to: mean=0, std=1
- **Applied to**: 6 numerical features with >2 unique values
- **Not applied to**: 16 binary features (already 0/1 scale)

**Why StandardScaler**:
1. Required for Logistic Regression (assumes standardized features)
2. Helps gradient descent converge faster
3. Makes regularization penalties fair across features
4. Common practice for many ML algorithms

### Final Dataset Statistics

- **Total Samples**: 86,432 (100% preserved)
- **Train Split**: 69,145 (80%)
- **Test Split**: 17,287 (20%)
- **Stratified**: Yes (preserves target class balance)
- **Final Features**: 22
  - Binary: 16
  - Scaled Numerical: 6
  - Missing Values: 0
- **Target Balance**: 46,153 (53.4%) / 40,279 (46.6%)
  - Slight imbalance but acceptable
  - No need for resampling techniques

### Assumptions & Limitations

**Assumptions**:
1. Missing values in `tenure_group` are random (MCAR)
2. Median is representative of missing values
3. Ordinal scales reflect true ordering
4. No temporal leakage in data split (stratified by target only)

**Limitations**:
1. **Imputation bias**: Median may not represent true missing values
2. **Variance reduction**: Imputation reduces natural variability
3. **Ordinal encoding assumption**: Equal distance between categories
4. **Data leakage risk**: `offer_viewed` may not be available in production

### Next Steps

1. Load processed data from `./processed/` folder
2. Train baseline models (Logistic Regression, Decision Tree)
3. Train ensemble models (Random Forest, XGBoost)
4. Compare metrics and select best model
5. Perform hyperparameter tuning
6. Apply PCA for dimensionality reduction
7. Conduct SHAP analysis for model explainability
8. Perform bias and fairness analysis