# Wallet Churn Prediction: Feature Engineering

## Notebook Purpose

This notebook applies feature transformations and derives additional metrics, guided by the exploratory analysis in Notebook 01, to prepare the final modeling dataset.

---

## Key Findings from Notebook 01 (Driving Feature Engineering)

From our EDA, we identified:

1. **`days_since_last_tx`** is the strongest individual churn signal (correlation: 0.54)
2. **`tx_per_day`** provides additional separation beyond raw transaction counts
3. **Transaction count alone** is insufficient; it needs context from activity intensity and recency
4. **`total_value` and `avg_tx_value`** are perfectly collinear (correlation: 1.0), so we drop one

**Implication:** Our feature engineering should focus on combining recency, intensity, and lifetime metrics while removing redundancy.

---

## Feature Engineering Strategy

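Based on EDA insights, we consider the following candidate features. The cells in this notebook implement the log transforms, the composite ratios, and the redundancy removal.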

### **1. Interaction Features**
- `recency_intensity_interaction`: `days_since_last_tx * tx_per_day`
  - Hypothesis: Wallets with long inactivity AND low historical activity carry the highest churn risk
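
As a minimal sketch of the interaction described above, the snippet below computes the plain product on the five preview rows shown later in this notebook. The `sample` frame is an illustrative stand-in for the full dataset, not the notebook's actual `df`.

```python
import pandas as pd

# Five preview rows from this notebook's dataset (subset of columns).
sample = pd.DataFrame({
    "days_since_last_tx": [1126, 15, 1200, 869, 1124],
    "tx_per_day": [0.6662, 0.3952, 58.8235, 37.0370, 1.8248],
})

# Plain product of recency and intensity, as in the bullet above.
sample["recency_intensity_interaction"] = (
    sample["days_since_last_tx"] * sample["tx_per_day"]
)

print(sample)
```

A plain product is largest when both factors are large, so whether it isolates the long-inactivity, low-activity wallets named in the hypothesis is something to verify empirically before relying on it.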

### **2. Binned/Categorical Features**
- `activity_tier`: Categorize wallets by `tx_per_day` (dormant, casual, active, whale)
- `recency_bucket`: Categorize `days_since_last_tx` into time windows
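
The binned features could be built with `pd.cut`, as sketched below on the five preview rows from this notebook. The bin edges and the recency labels are assumptions chosen for illustration; the source does not specify thresholds, so they would need tuning against the real distributions.

```python
import numpy as np
import pandas as pd

# Five preview rows from this notebook's dataset (subset of columns).
sample = pd.DataFrame({
    "days_since_last_tx": [1126, 15, 1200, 869, 1124],
    "tx_per_day": [0.6662, 0.3952, 58.8235, 37.0370, 1.8248],
})

# Hypothetical thresholds (transactions per day) for the four tiers.
activity_edges = [0, 0.1, 1, 10, np.inf]
activity_labels = ["dormant", "casual", "active", "whale"]

# Hypothetical time windows (days since last transaction).
recency_edges = [0, 30, 90, 365, np.inf]
recency_labels = ["0-30d", "31-90d", "91-365d", "365d+"]

# include_lowest=True keeps exact zeros inside the first bin.
sample["activity_tier"] = pd.cut(
    sample["tx_per_day"], bins=activity_edges,
    labels=activity_labels, include_lowest=True,
)
sample["recency_bucket"] = pd.cut(
    sample["days_since_last_tx"], bins=recency_edges,
    labels=recency_labels, include_lowest=True,
)

print(sample)
```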

### **3. Ratio Features**
- `value_per_lifetime_day`: `total_value / wallet_lifetime_days`
  - Measures "value intensity" over wallet lifespan
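
A possible implementation of this ratio is sketched below on the five preview rows from this notebook. Mapping a zero-day lifetime to `NaN` is an assumption made here to avoid division by zero; adding 1 to the denominator, as the composite ratios later in this notebook do, would be an alternative.

```python
import numpy as np
import pandas as pd

# Five preview rows from this notebook's dataset (subset of columns).
sample = pd.DataFrame({
    "total_value": [3.751824e+11, 1.963961e+06, 3.642930e+05,
                    1.403524e+04, 4.120587e+09],
    "wallet_lifetime_days": [1501, 1121, 17, 27, 548],
})

# Assumption: a zero-day lifetime yields NaN instead of inf.
lifetime = sample["wallet_lifetime_days"].replace(0, np.nan)
sample["value_per_lifetime_day"] = sample["total_value"] / lifetime

print(sample)
```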

### **4. Redundancy Removal**
- Drop `avg_tx_value` (perfectly correlated with `total_value`)
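
The redundancy check and drop can be sketched as follows, again on the five preview rows from this notebook. With so few rows the correlation is only illustrative; the 1.0 figure quoted above comes from the EDA on the full dataset.

```python
import pandas as pd

# Five preview rows from this notebook's dataset (subset of columns).
sample = pd.DataFrame({
    "total_value": [3.751824e+11, 1.963961e+06, 3.642930e+05,
                    1.403524e+04, 4.120587e+09],
    "avg_tx_value": [3.751824e+08, 4.433320e+03, 3.642930e+02,
                     1.403520e+01, 4.120587e+06],
})

# Pearson correlation between the two value columns.
corr = sample["total_value"].corr(sample["avg_tx_value"])
print(f"corr(total_value, avg_tx_value) = {corr:.4f}")

# Keep total_value and drop the redundant column.
reduced = sample.drop(columns=["avg_tx_value"])
```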

---

## Expected Outcome

A refined feature set optimized for churn prediction modeling, ready for train/test split and baseline model evaluation in Notebook 03.


## Notebook Objectives

- Apply log transformations to skewed behavioral features
- Engineer composite engagement metrics
- Reduce feature redundancy
- Produce a final feature set for churn modeling


## Load Dataset

We load the wallet-level feature dataset generated from the ETL pipeline.
This dataset has already been validated during exploratory analysis and serves
as the baseline input for feature transformations.


In [1]:
import json
import pandas as pd
import numpy as np

# load wallet features

with open(r"C:\Users\samis\projects\wallet-churn-pipeline\data_samples\wallet_features.json") as f:
    wallet_features = json.load(f)

df = pd.DataFrame(wallet_features)

df.shape, df.head()

((424, 8),
                                wallet_address  tx_count   total_value  \
 0  0xBE0eB53F46cd790Cd13851d5EFF43D12404d33E8      1000  3.751824e+11   
 1  0x0E58e8993100F1CBe45376c410F97f4893d9BfCD       443  1.963961e+06   
 2  0x8315177aB297bA92A06054cE80a67Ed4DBd7ed3a      1000  3.642930e+05   
 3  0x49048044D57e1C92A77f79988d21Fa8fAF74E97e      1000  1.403524e+04   
 4  0x47ac0Fb4F2D84898e4D9E7b4DaB3C24507a6D503      1000  4.120587e+09   
 
    avg_tx_value  wallet_lifetime_days  days_since_last_tx  tx_per_day  churned  
 0  3.751824e+08                  1501                1126      0.6662        1  
 1  4.433320e+03                  1121                  15      0.3952        0  
 2  3.642930e+02                    17                1200     58.8235        1  
 3  1.403520e+01                    27                 869     37.0370        1  
 4  4.120587e+06                   548                1124      1.8248        1  )

## Log Transform Skewed Features

Several behavioral features exhibit heavy right skew, which can negatively impact
model stability and interpretability.

Based on exploratory analysis, we apply log transformations to the following features:
- `days_since_last_tx`
- `tx_per_day`
- `wallet_lifetime_days`
- `total_value`

A log(1 + x) transformation (`np.log1p`) is used so that zero values map to 0 instead of negative infinity.
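
To illustrate what the log(1 + x) transform does, the sketch below applies it to the five `total_value` entries from the dataset preview. It checks the zero case, compares skewness before and after, and confirms that `np.expm1` inverts it. Skewness on five values is only indicative.

```python
import numpy as np
import pandas as pd

# total_value for the five preview rows shown in this notebook.
total_value = pd.Series(
    [3.751824e+11, 1.963961e+06, 3.642930e+05, 1.403524e+04, 4.120587e+09]
)

log_total_value = np.log1p(total_value)

# log1p(0) is exactly 0, whereas a plain log of 0 diverges.
zero_ok = np.log1p(0.0)

# Sample skewness before and after the transform.
raw_skew = total_value.skew()
log_skew = log_total_value.skew()
print(f"skew raw={raw_skew:.2f}, log1p={log_skew:.2f}")

# expm1 is the inverse of log1p, so the original scale is recoverable.
recovered = np.expm1(log_total_value)
```

Note that `np.log1p` is only defined for inputs greater than -1, which holds for the non-negative counts, durations, and values used here.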


In [2]:
# Log-transformed features
df["log_days_since_last_tx"] = np.log1p(df["days_since_last_tx"])
df["log_tx_per_day"] = np.log1p(df["tx_per_day"])
df["log_wallet_lifetime_days"] = np.log1p(df["wallet_lifetime_days"])
df["log_total_value"] = np.log1p(df["total_value"])

# Quick sanity check
df[[
    "days_since_last_tx", "log_days_since_last_tx",
    "tx_per_day", "log_tx_per_day",
    "wallet_lifetime_days", "log_wallet_lifetime_days",
    "total_value", "log_total_value"
]].head()


| | days_since_last_tx | log_days_since_last_tx | tx_per_day | log_tx_per_day | wallet_lifetime_days | log_wallet_lifetime_days | total_value | log_total_value |
|---|---|---|---|---|---|---|---|---|
| 0 | 1126 | 7.027315 | 0.6662 | 0.510546 | 1501 | 7.314553 | 375182400000.0 | 26.650678 |
| 1 | 15 | 2.772589 | 0.3952 | 0.333038 | 1121 | 7.022868 | 1963961.0 | 14.490474 |
| 2 | 1200 | 7.09091 | 58.8235 | 4.091399 | 17 | 2.890372 | 364293.0 | 12.805716 |
| 3 | 869 | 6.768493 | 37.037 | 3.638559 | 27 | 3.332205 | 14035.24 | 9.549398 |
| 4 | 1124 | 7.025538 | 1.8248 | 1.038438 | 548 | 6.308098 | 4120587000.0 | 22.139261 |


## Composite Engagement Features

Individual features capture specific aspects of wallet behavior, but relationships
between features often provide stronger churn signals.

Based on exploratory analysis, we engineer composite metrics that combine
activity, recency, and lifetime to better represent sustained engagement patterns.


In [3]:
# Composite engagement features

# Activity relative to recency (the +1 avoids division by zero for wallets active today)
df["activity_recency_ratio"] = df["tx_per_day"] / (df["days_since_last_tx"] + 1)

# Activity normalized by lifetime (the +1 avoids division by zero for zero-day lifetimes)
df["lifetime_activity_ratio"] = df["tx_count"] / (df["wallet_lifetime_days"] + 1)

# Log versions, to compress the right tail of each ratio
df["log_activity_recency_ratio"] = np.log1p(df["activity_recency_ratio"])
df["log_lifetime_activity_ratio"] = np.log1p(df["lifetime_activity_ratio"])

# Sanity check
df[[
    "tx_per_day", "days_since_last_tx", "activity_recency_ratio",
    "tx_count", "wallet_lifetime_days", "lifetime_activity_ratio"
]].head()


| | tx_per_day | days_since_last_tx | activity_recency_ratio | tx_count | wallet_lifetime_days | lifetime_activity_ratio |
|---|---|---|---|---|---|---|
| 0 | 0.6662 | 1126 | 0.000591 | 1000 | 1501 | 0.665779 |
| 1 | 0.3952 | 15 | 0.0247 | 443 | 1121 | 0.394831 |
| 2 | 58.8235 | 1200 | 0.048979 | 1000 | 17 | 55.555556 |
| 3 | 37.037 | 869 | 0.042571 | 1000 | 27 | 35.714286 |
| 4 | 1.8248 | 1124 | 0.001622 | 1000 | 548 | 1.821494 |


## Feature Selection for Modeling

To reduce redundancy and improve model interpretability, we select a final
subset of features for churn modeling.

Feature selection is informed by:
- Exploratory analysis
- Correlation structure
- Feature interpretability
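
One way to make the correlation-structure criterion concrete is to list feature pairs whose absolute correlation exceeds a cutoff. The sketch below does this for three of the engineered features on the five preview rows. The 0.9 threshold is an assumption, and results on five rows are illustrative only.

```python
import numpy as np
import pandas as pd

# Five preview rows from this notebook's dataset (subset of columns).
sample = pd.DataFrame({
    "tx_count": [1000, 443, 1000, 1000, 1000],
    "wallet_lifetime_days": [1501, 1121, 17, 27, 548],
    "tx_per_day": [0.6662, 0.3952, 58.8235, 37.0370, 1.8248],
})

# Rebuild a few engineered features with the same formulas as this notebook.
features = pd.DataFrame({
    "log_tx_per_day": np.log1p(sample["tx_per_day"]),
    "log_wallet_lifetime_days": np.log1p(sample["wallet_lifetime_days"]),
    "log_lifetime_activity_ratio": np.log1p(
        sample["tx_count"] / (sample["wallet_lifetime_days"] + 1)
    ),
})

corr = features.corr().abs()
threshold = 0.9  # hypothetical cutoff for "highly correlated"

# Collect each unordered pair once, keeping only those above the cutoff.
cols = list(corr.columns)
flagged = [
    (a, b, round(float(corr.loc[a, b]), 4))
    for i, a in enumerate(cols)
    for b in cols[i + 1:]
    if corr.loc[a, b] > threshold
]

for a, b, r in flagged:
    print(f"{a} vs {b}: |r| = {r}")
```

On these rows `log_tx_per_day` and `log_lifetime_activity_ratio` are flagged, which is expected because their underlying ratios are nearly identical in the preview (for example 0.6662 and 0.665779 for the first wallet). Running the same check on the full dataset would show whether both are worth keeping.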


In [4]:
# Final feature set for modeling
feature_cols = [
    "log_days_since_last_tx",
    "log_tx_per_day",
    "log_wallet_lifetime_days",
    "log_total_value",
    "log_activity_recency_ratio",
    "log_lifetime_activity_ratio"
]

X = df[feature_cols]
y = df["churned"]

X.shape, y.shape


((424, 6), (424,))

## Feature Engineering Summary

Key outcomes from this notebook:

- Applied log transformations to address heavy skew in behavioral features
- Engineered composite metrics to capture engagement dynamics
- Reduced feature redundancy by excluding `avg_tx_value`, which is perfectly correlated with `total_value`
- Produced a clean, interpretable feature set for churn modeling

The resulting dataset is used as input for model training and evaluation
in subsequent notebooks.


In [6]:
engineered_df = df.copy()

engineered_df.to_parquet(
    "data_samples/wallet_features_engineered.parquet",
    index=False
)
