# 🔧 Feature Engineering - E-commerce Customer Churn

**Philosophy:** Restraint > Complexity (5.6k dataset)  
**Approach:** Phased (Baseline → Controlled → Experimental)  
**Goal:** 15-18 production-safe features

---

## 📋 Plan Overview

**Phase 1 (MANDATORY):** Baseline features + missing flags (24 features: 18 original + 6 flags)  
**Phase 2 (CONTROLLED):** Add 2-3 features one at a time  
**Phase 3 (EXPERIMENTAL):** Optional composite features

**Critical Rules:**
- ✓ Train-test split FIRST
- ✓ Fit on train, apply to test
- ✓ No data leakage
- ✓ Keep features NUMERIC (no binning)

---

## 📦 Step 1: Import Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
import warnings
warnings.filterwarnings('ignore')

pd.set_option('display.max_columns', None)
sns.set_style('whitegrid')

print("Libraries imported successfully!")

## 📥 Step 2: Load Data

In [None]:
# Load dataset
df = pd.read_csv('../data/raw/ecommerce_churn.csv')

print(f"Dataset Shape: {df.shape}")
print(f"\nColumns: {list(df.columns)}")
print(f"\nChurn Distribution:")
print(df['Churn'].value_counts())

df.head()

## 🔍 Step 3: Missing Values Analysis

**Strategy:**
- Median imputation for numerical
- Create missing flags (signal value)
- Mode imputation for categorical

**⚠️ CRITICAL:** We'll fit imputers AFTER the train-test split to avoid leakage.
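The strategy above can be sketched as a pair of helpers: one learns fill values from the training frame only, the other adds missing flags and then applies those values to any frame. The function names (`fit_fill_values`, `apply_fill_values`) and the toy data are illustrative assumptions, not part of this notebook's pipeline.

```python
import numpy as np
import pandas as pd

def fit_fill_values(train: pd.DataFrame) -> dict:
    """Learn one fill value per column from the training data only."""
    fill = {}
    for col in train.columns:
        if pd.api.types.is_numeric_dtype(train[col]):
            fill[col] = train[col].median()   # median for numerical columns
        else:
            fill[col] = train[col].mode()[0]  # mode for categorical columns
    return fill

def apply_fill_values(frame: pd.DataFrame, fill: dict, flag_cols: list) -> pd.DataFrame:
    """Add missing flags first, then impute with the train-derived values."""
    out = frame.copy()
    for col in flag_cols:
        out[f"{col}_was_missing"] = out[col].isnull().astype(int)
    return out.fillna(fill)

# Toy data (hypothetical values)
train = pd.DataFrame({"Tenure": [1.0, np.nan, 5.0, 9.0], "Gender": ["F", "M", None, "M"]})
test = pd.DataFrame({"Tenure": [np.nan, 3.0], "Gender": [None, "F"]})

fill = fit_fill_values(train)
train_imp = apply_fill_values(train, fill, flag_cols=["Tenure"])
test_imp = apply_fill_values(test, fill, flag_cols=["Tenure"])
print(fill)
print(test_imp)
```

The flags are created before `fillna` so that they still record which rows were originally missing; the test frame never contributes to the fill values.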

In [None]:
# Check missing values
missing = df.isnull().sum()
missing_pct = (df.isnull().sum() / len(df) * 100).round(2)
missing_df = pd.DataFrame({'Missing_Count': missing, 'Missing_%': missing_pct})
missing_df = missing_df[missing_df['Missing_Count'] > 0].sort_values('Missing_Count', ascending=False)

print("Missing Values:")
print(missing_df)

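# Numerical columns expected to contain missing values (hard-coded list)
numerical_missing = ['Tenure', 'HourSpendOnApp', 'OrderCount', 'DaySinceLastOrder', 
                     'OrderAmountHikeFromlastYear', 'CouponUsed']
print(f"\nNumerical columns with missing values: {numerical_missing}")

# Sanity check: report any column with missing values that the list above omits
unlisted = sorted(set(missing_df.index) - set(numerical_missing))
if unlisted:
    print(f"Columns with missing values not in the list: {unlisted}")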

## ✂️ Step 4: Train-Test Split (BEFORE Feature Engineering)

**⚠️ CRITICAL STEP:**  
Split data FIRST to prevent train-test contamination.

**Strategy:**
- Stratified split (preserve churn ratio)
- 80-20 split
- Random state for reproducibility

In [None]:
# Separate features and target
X = df.drop(['Churn', 'CustomerID'], axis=1)
y = df['Churn']

# Stratified train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.2, 
    stratify=y, 
    random_state=42
)

print(f"Train set: {X_train.shape}")
print(f"Test set: {X_test.shape}")
print(f"\nTrain churn rate: {y_train.mean()*100:.2f}%")
print(f"Test churn rate: {y_test.mean()*100:.2f}%")

# Store original indices
train_idx = X_train.index
test_idx = X_test.index

## 🔧 PHASE 1: Baseline Feature Engineering

### Step 5a: Handle Missing Values

**Approach:**
1. Create missing flags (computed per row, so nothing needs fitting)
2. Median imputation for numerical (fit on train)
3. Mode imputation for categorical (fit on train)

**⚠️ Fit on TRAIN only, apply to BOTH train and test**

In [None]:
# Create copies to avoid modifying originals
X_train_fe = X_train.copy()
X_test_fe = X_test.copy()

# 1. Create missing flags (BEFORE imputation)
for col in numerical_missing:
    if col in X_train_fe.columns:
        # Row-wise indicator: nothing is learned from the data, so no fit step is needed
        X_train_fe[f'{col}_was_missing'] = X_train_fe[col].isnull().astype(int)
        X_test_fe[f'{col}_was_missing'] = X_test_fe[col].isnull().astype(int)

print("Missing flags created:")
missing_flag_cols = [col for col in X_train_fe.columns if '_was_missing' in col]
print(missing_flag_cols)

# 2. Median imputation for numerical
for col in numerical_missing:
    if col in X_train_fe.columns:
        # Fit on train
        median_val = X_train_fe[col].median()
        # Apply to both (assign back; chained inplace fillna is unreliable in recent pandas)
        X_train_fe[col] = X_train_fe[col].fillna(median_val)
        X_test_fe[col] = X_test_fe[col].fillna(median_val)
        print(f"Imputed {col} with median: {median_val:.2f}")

# 3. Mode imputation for categorical
categorical_cols = X_train_fe.select_dtypes(include=['object']).columns.tolist()
for col in categorical_cols:
    # Impute if either split has gaps; the fill value always comes from train
    if X_train_fe[col].isnull().any() or X_test_fe[col].isnull().any():
        # Fit on train
        mode_val = X_train_fe[col].mode()[0]
        # Apply to both
        X_train_fe[col] = X_train_fe[col].fillna(mode_val)
        X_test_fe[col] = X_test_fe[col].fillna(mode_val)
        print(f"Imputed {col} with mode: {mode_val}")

# Verify no missing values
print(f"\nTrain missing values: {X_train_fe.isnull().sum().sum()}")
print(f"Test missing values: {X_test_fe.isnull().sum().sum()}")

### Step 5b: Encode Categorical Features

**Approach:** Label Encoding (tree-based models handle this well)

**⚠️ Fit on TRAIN, apply to TEST**
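As an alternative to per-column `LabelEncoder` objects, scikit-learn's `OrdinalEncoder` can encode several feature columns at once and map categories unseen in training to a sentinel value. The sketch below assumes scikit-learn 0.24 or later; the column name `PaymentMode` and the toy values are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy data (hypothetical values); "UPI" appears only in the test set
train = pd.DataFrame({"PaymentMode": ["Card", "Cash", "Card"], "Gender": ["F", "M", "M"]})
test = pd.DataFrame({"PaymentMode": ["UPI", "Cash"], "Gender": ["M", "F"]})
cat_cols = ["PaymentMode", "Gender"]

# Unseen categories are encoded as -1 instead of raising an error
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)

# Fit on train only, then reuse the learned categories for test
train_enc = pd.DataFrame(encoder.fit_transform(train[cat_cols]),
                         columns=cat_cols, index=train.index)
test_enc = pd.DataFrame(encoder.transform(test[cat_cols]),
                        columns=cat_cols, index=test.index)
print(test_enc)
```

Either way the integer codes carry no real ordering, which tree-based models tolerate better than linear ones.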

In [None]:
# Label encode categorical features
categorical_cols = X_train_fe.select_dtypes(include=['object']).columns.tolist()

label_encoders = {}
for col in categorical_cols:
    le = LabelEncoder()
    # Fit on train
    X_train_fe[col] = le.fit_transform(X_train_fe[col])
    # Apply to test via a lookup table; categories unseen in train become -1
    code_map = {cls: code for code, cls in enumerate(le.classes_)}
    X_test_fe[col] = X_test_fe[col].map(code_map).fillna(-1).astype(int)
    label_encoders[col] = le
    print(f"Encoded {col}: {len(le.classes_)} categories")

print(f"\nTotal categorical features encoded: {len(categorical_cols)}")

### Phase 1 Summary: Baseline Features

**Features Created:**
- Original features: 18 (after dropping CustomerID, Churn)
- Missing flags: 6
- **Total Phase 1: 24 features**

**Next:** Phase 2 will add 2-3 controlled features

In [None]:
print("="*80)
print("PHASE 1: BASELINE FEATURES COMPLETE")
print("="*80)
print(f"\nTrain shape: {X_train_fe.shape}")
print(f"Test shape: {X_test_fe.shape}")
print(f"\nFeature list:")
print(list(X_train_fe.columns))

# Save Phase 1 features for baseline model
X_train_phase1 = X_train_fe.copy()
X_test_phase1 = X_test_fe.copy()

## 🔧 PHASE 2: Controlled Feature Addition

**Strategy:** Add features ONE AT A TIME, measure impact
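The notebook does not fix how impact is measured. One possible sketch, assuming cross-validated ROC-AUC on the training set and a random forest as a stand-in model (both are assumptions, as is the helper name `score_feature_additions`), scores the base feature set and then the base set plus exactly one candidate at a time:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def score_feature_additions(X_base, candidates, y, cv=5):
    """Mean cross-validated ROC-AUC for the base set and for base + one candidate."""
    model = RandomForestClassifier(n_estimators=50, random_state=42)
    scores = {"baseline": cross_val_score(model, X_base, y, cv=cv, scoring="roc_auc").mean()}
    for name, values in candidates.items():
        X_plus = X_base.assign(**{name: values})  # base features + exactly one candidate
        scores[f"+{name}"] = cross_val_score(model, X_plus, y, cv=cv, scoring="roc_auc").mean()
    return pd.Series(scores)

# Synthetic stand-in for the training set (random values, for illustration only)
rng = np.random.default_rng(0)
X_base = pd.DataFrame({
    "Tenure": rng.integers(0, 36, size=300),
    "OrderCount": rng.integers(1, 15, size=300),
})
y = pd.Series(rng.integers(0, 2, size=300), name="Churn")

candidates = {"order_frequency": X_base["OrderCount"] / (X_base["Tenure"] + 1)}
scores = score_feature_additions(X_base, candidates, y)
print(scores)
```

Running the comparison on the training split only keeps the test set untouched until the final evaluation.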

### Step 6a: Order Frequency

**Business Logic:** Frequent buyers = loyal customers  
**Leakage Risk:** Low (historical behavior)
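A small worked example of the ratio used for this feature, `OrderCount / (Tenure + 1)`, on toy values (hypothetical, not taken from the dataset):

```python
import pandas as pd

# Toy customers (hypothetical values); the first one has Tenure = 0
toy = pd.DataFrame({"OrderCount": [2, 6, 1], "Tenure": [0, 11, 1]})

# The +1 in the denominator keeps the ratio finite when Tenure is 0
toy["order_frequency"] = toy["OrderCount"] / (toy["Tenure"] + 1)
print(toy)
```

The `+ 1` avoids division by zero for brand-new customers, at the cost of understating the rate for short tenures: one order over a tenure of 1 yields 0.5, not 1.0.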

In [None]:
# Create order_frequency (NUMERIC, not categorical)
X_train_fe['order_frequency'] = X_train_fe['OrderCount'] / (X_train_fe['Tenure'] + 1)
X_test_fe['order_frequency'] = X_test_fe['OrderCount'] / (X_test_fe['Tenure'] + 1)

print("Order Frequency Statistics (Train):")
print(X_train_fe['order_frequency'].describe())

# Visualize
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.hist(X_train_fe[y_train==0]['order_frequency'], bins=30, alpha=0.6, label='Not Churned', color='green')
plt.hist(X_train_fe[y_train==1]['order_frequency'], bins=30, alpha=0.6, label='Churned', color='red')
plt.xlabel('Order Frequency (orders/month)')
plt.ylabel('Count')
plt.title('Order Frequency Distribution by Churn')
plt.legend()

plt.subplot(1, 2, 2)
pd.DataFrame({'Churn': y_train, 'OrderFreq': X_train_fe['order_frequency']}).boxplot(
    column='OrderFreq', by='Churn')
plt.title('Order Frequency vs Churn')
plt.suptitle('')
plt.tight_layout()
plt.show()

print(f"\nCorrelation with Churn: {X_train_fe['order_frequency'].corr(y_train):.3f}")

### Step 6b: Complaint Rate

**Business Logic:** Complaints indicate dissatisfaction  
**⚠️ Leakage Risk:** MODERATE (complaints may be post-churn signal)

**Decision:** Include but DOCUMENT leakage risk

In [None]:
# Create complaint_rate
X_train_fe['complaint_rate'] = X_train_fe['Complain'] / (X_train_fe['OrderCount'] + 1)
X_test_fe['complaint_rate'] = X_test_fe['Complain'] / (X_test_fe['OrderCount'] + 1)

print("Complaint Rate Statistics (Train):")
print(X_train_fe['complaint_rate'].describe())

# Visualize
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.hist(X_train_fe[y_train==0]['complaint_rate'], bins=30, alpha=0.6, label='Not Churned', color='green')
plt.hist(X_train_fe[y_train==1]['complaint_rate'], bins=30, alpha=0.6, label='Churned', color='red')
plt.xlabel('Complaint Rate (complaints/order)')
plt.ylabel('Count')
plt.title('Complaint Rate Distribution by Churn')
plt.legend()

plt.subplot(1, 2, 2)
pd.DataFrame({'Churn': y_train, 'ComplaintRate': X_train_fe['complaint_rate']}).boxplot(
    column='ComplaintRate', by='Churn')
plt.title('Complaint Rate vs Churn')
plt.suptitle('')
plt.tight_layout()
plt.show()

print(f"\nCorrelation with Churn: {X_train_fe['complaint_rate'].corr(y_train):.3f}")
print("\n⚠️ LEAKAGE WARNING: Monitor this feature's importance in model")

### Phase 2 Summary: Controlled Features

**Features Added:**
- order_frequency (NUMERIC)
- complaint_rate (NUMERIC, ⚠️ leakage risk)

**Total after Phase 2: 26 features**

**Decision:** These will be evaluated in baseline model. Keep if importance > 0.01
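The keep rule can be sketched as a simple filter over a Series of importances. The values below are hypothetical placeholders for a subset of features; in practice they might come from a fitted tree ensemble's `feature_importances_` attribute, and the helper name `select_by_importance` is an assumption.

```python
import pandas as pd

def select_by_importance(importances: pd.Series, threshold: float = 0.01) -> list:
    """Return feature names whose importance is strictly above the threshold."""
    kept = importances[importances > threshold]
    return kept.sort_values(ascending=False).index.tolist()

# Hypothetical importances for a handful of features
importances = pd.Series({
    "Tenure": 0.30,
    "order_frequency": 0.04,
    "complaint_rate": 0.12,
    "CouponUsed_was_missing": 0.004,
})

print(select_by_importance(importances))
```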

In [None]:
print("="*80)
print("PHASE 2: CONTROLLED FEATURES COMPLETE")
print("="*80)
print(f"\nTrain shape: {X_train_fe.shape}")
print(f"Test shape: {X_test_fe.shape}")

# Save Phase 2 features
X_train_phase2 = X_train_fe.copy()
X_test_phase2 = X_test_fe.copy()

## 🔧 PHASE 3: Experimental Features (OPTIONAL)

**⚠️ These are NOT for baseline model**

### Step 7a: Engagement Score (EXPERIMENTAL)

**Concept:** Composite metric combining tenure and orders  
**⚠️ Issues:** Reduces interpretability, may not improve over raw features

**Decision:** Create but DON'T use in baseline
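To make the composite concrete, here is a NumPy-only sketch of min-max scaling with a range learned from train, followed by the equal-weight average. The helper names and toy values are assumptions for illustration.

```python
import numpy as np

def fit_min_max(train_values):
    """Learn the min and max from the training values only."""
    return float(np.min(train_values)), float(np.max(train_values))

def apply_min_max(values, lo, hi):
    """Scale with the train-derived range: (x - lo) / (hi - lo)."""
    return (np.asarray(values, dtype=float) - lo) / (hi - lo)

# Toy values (hypothetical)
tenure_train = np.array([0, 10, 20])
orders_train = np.array([1, 3, 5])
tenure_test = np.array([5, 30])   # 30 lies outside the train range
orders_test = np.array([2, 5])

t_lo, t_hi = fit_min_max(tenure_train)
o_lo, o_hi = fit_min_max(orders_train)

engagement_test = (0.5 * apply_min_max(tenure_test, t_lo, t_hi)
                   + 0.5 * apply_min_max(orders_test, o_lo, o_hi))
print(engagement_test)
```

Because the range comes from train, a test value beyond the train maximum scales above 1, so the score is not guaranteed to stay in [0, 1]. scikit-learn's `MinMaxScaler` behaves the same way unless `clip=True` is set.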

In [None]:
from sklearn.preprocessing import MinMaxScaler

# Normalize components
scaler_tenure = MinMaxScaler()
scaler_orders = MinMaxScaler()

# Fit on train, apply to both
X_train_fe['tenure_norm'] = scaler_tenure.fit_transform(X_train_fe[['Tenure']])
X_test_fe['tenure_norm'] = scaler_tenure.transform(X_test_fe[['Tenure']])

X_train_fe['orders_norm'] = scaler_orders.fit_transform(X_train_fe[['OrderCount']])
X_test_fe['orders_norm'] = scaler_orders.transform(X_test_fe[['OrderCount']])

# Simple engagement score (NO satisfaction - leakage risk)
X_train_fe['engagement_score'] = 0.5 * X_train_fe['tenure_norm'] + 0.5 * X_train_fe['orders_norm']
X_test_fe['engagement_score'] = 0.5 * X_test_fe['tenure_norm'] + 0.5 * X_test_fe['orders_norm']

print("Engagement Score Statistics (Train):")
print(X_train_fe['engagement_score'].describe())
print(f"\nCorrelation with Churn: {X_train_fe['engagement_score'].corr(y_train):.3f}")
print("\n⚠️ EXPERIMENTAL: Compare model with/without this feature")

### Step 7b: CLV Proxy (EXPERIMENTAL)

**Concept:** Rough customer lifetime value estimate  
**⚠️ Issues:** Cashback ≠ revenue, tenure appears twice, amplifies noise

**Positioning:** "Rough heuristic for prioritization, not final modeling"

**Decision:** Create but DON'T use in baseline
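Substituting the Step 6a definition of `order_frequency` shows what "tenure appears twice" means in practice. Writing $T$ for `Tenure`, $O$ for `OrderCount`, and $C$ for `CashbackAmount` (symbols introduced here only for brevity):

$$
\mathrm{CLV}_{\text{proxy}} = T \cdot \frac{O}{T + 1} \cdot C = O \cdot C \cdot \frac{T}{T + 1}
$$

Since $T/(T+1)$ approaches 1 as $T$ grows, the proxy is close to $O \cdot C$ for long-tenured customers, while for $T = 0$ it is exactly 0 regardless of order count or cashback.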

In [None]:
# CLV Proxy = Tenure × OrderFrequency × Cashback
X_train_fe['clv_proxy'] = (
    X_train_fe['Tenure'] * 
    X_train_fe['order_frequency'] * 
    X_train_fe['CashbackAmount']
)

X_test_fe['clv_proxy'] = (
    X_test_fe['Tenure'] * 
    X_test_fe['order_frequency'] * 
    X_test_fe['CashbackAmount']
)

print("CLV Proxy Statistics (Train):")
print(X_train_fe['clv_proxy'].describe())
print(f"\nCorrelation with Churn: {X_train_fe['clv_proxy'].corr(y_train):.3f}")
print("\n⚠️ EXPERIMENTAL: Weak proxy, use with caution")

## ✅ Feature Engineering Complete

### Final Feature Sets

**Phase 1 (Baseline):** 24 features  
- Original: 18
- Missing flags: 6

**Phase 2 (Controlled):** +2 features  
- order_frequency
- complaint_rate (⚠️ leakage risk)

**Phase 3 (Experimental):** +4 features  
- tenure_norm, orders_norm, engagement_score
- clv_proxy

**Total Available:** 30 features

**For Baseline Model:** Use Phase 1 + Phase 2 = **26 features**  
**After Feature Selection:** Target **15-18 features**

---

### Next Steps

1. ✓ Feature engineering complete
2. → Train baseline model (Phase 1 + Phase 2 features)
3. → Feature importance analysis
4. → Remove low-importance features (< 0.01)
5. → Final model with 15-18 features

In [None]:
print("="*80)
print("FEATURE ENGINEERING COMPLETE")
print("="*80)

print(f"\nPhase 1 (Baseline): {X_train_phase1.shape[1]} features")
print(f"Phase 2 (Controlled): {X_train_phase2.shape[1]} features")
print(f"Phase 3 (All features): {X_train_fe.shape[1]} features")

print(f"\n✓ Train set: {X_train_fe.shape}")
print(f"✓ Test set: {X_test_fe.shape}")
print(f"✓ No missing values")
print(f"✓ No data leakage (fit on train, apply to test)")

print("\n📊 Ready for modeling!")

## 💾 Save Processed Data

Save different feature sets for modeling experiments
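Saving with `index=False` drops the shuffled row index from the split, so features and target can only be re-aligned by row position after reloading. A minimal round-trip sketch on toy frames (hypothetical values, written to a temporary directory instead of `../data/processed`):

```python
import os
import tempfile
import pandas as pd

# Toy frames (hypothetical values) with a shuffled index, as a random split produces
X_toy = pd.DataFrame({"Tenure": [3.0, 12.0, 7.0]}, index=[42, 7, 19])
y_toy = pd.Series([1, 0, 0], index=[42, 7, 19], name="Churn")

with tempfile.TemporaryDirectory() as tmp:
    X_path = os.path.join(tmp, "X_toy.csv")
    y_path = os.path.join(tmp, "y_toy.csv")
    X_toy.to_csv(X_path, index=False)  # index is dropped on purpose
    y_toy.to_csv(y_path, index=False)

    X_loaded = pd.read_csv(X_path)
    y_loaded = pd.read_csv(y_path)["Churn"]  # single-column CSV back to a Series

# After reload both get a fresh 0..n-1 index; rows still line up by position
print(X_loaded.index.equals(y_loaded.index))
print(y_loaded.tolist())
```

This works as long as the feature and target files are written from objects in the same row order, which holds for the outputs of a single `train_test_split` call.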

In [None]:
# Create processed data directory
import os
os.makedirs('../data/processed', exist_ok=True)

# Save Phase 1 (Baseline)
X_train_phase1.to_csv('../data/processed/X_train_phase1.csv', index=False)
X_test_phase1.to_csv('../data/processed/X_test_phase1.csv', index=False)
y_train.to_csv('../data/processed/y_train.csv', index=False)
y_test.to_csv('../data/processed/y_test.csv', index=False)

# Save Phase 2 (Baseline + Controlled)
X_train_phase2.to_csv('../data/processed/X_train_phase2.csv', index=False)
X_test_phase2.to_csv('../data/processed/X_test_phase2.csv', index=False)

# Save Phase 3 (All features)
X_train_fe.to_csv('../data/processed/X_train_all.csv', index=False)
X_test_fe.to_csv('../data/processed/X_test_all.csv', index=False)

print("✓ Saved processed data to data/processed/")
print("\nFiles created:")
print("  - X_train_phase1.csv (baseline)")
print("  - X_train_phase2.csv (baseline + controlled)")
print("  - X_train_all.csv (all features)")
print("  - y_train.csv, y_test.csv")