## Olist Dataset Data Dictionary

The Olist dataset comprises several tables, each covering one aspect of an e-commerce order. The five tables used in this notebook are described below.

### `olist_orders_dataset.csv`

This table contains core information about each order placed on the Olist platform.

| Column Name | Description |
|---|---|
| `order_id` | Unique identifier of the order. |
| `customer_id` | Key to the customers table; each order carries its own `customer_id`. |
| `order_status` | Order status (`delivered`, `shipped`, `canceled`, etc.). |
| `order_purchase_timestamp` | Timestamp of the purchase. |
| `order_approved_at` | Timestamp of payment approval. |
| `order_delivered_carrier_date` | Timestamp when the order was handed to the carrier. |
| `order_delivered_customer_date` | Timestamp when the order was delivered to the customer. |
| `order_estimated_delivery_date` | Estimated delivery date shown to the customer at purchase. |

### `olist_order_reviews_dataset.csv`

This table contains customer reviews for orders.

| Column Name | Description |
|---|---|
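| `review_id` | Unique identifier of the review. |
| `order_id` | Identifier of the reviewed order. |
| `review_score` | Satisfaction score from 1 (worst) to 5 (best). |
| `review_comment_title` | Title of the review comment (optional, in Portuguese). |
| `review_comment_message` | Text of the review comment (optional, in Portuguese). |
| `review_creation_date` | Date the satisfaction survey was sent to the customer. |
| `review_answer_timestamp` | Timestamp of the customer's survey answer. |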

### `olist_order_payments_dataset.csv`

This table contains information about the payment details for each order.

| Column Name | Description |
|---|---|
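| `order_id` | Identifier of the order being paid. |
| `payment_sequential` | Sequence number of the payment; an order paid with several methods has one row per payment. |
| `payment_type` | Payment method chosen by the customer. |
| `payment_installments` | Number of installments chosen by the customer. |
| `payment_value` | Amount of the payment. |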

### `olist_order_items_dataset.csv`

This table contains information about the individual items within each order.

| Column Name | Description |
|---|---|
| `order_id` | Order identifier. |
| `order_item_id` | Sequential number of the item within its order. |
| `product_id` | Product identifier. |
| `seller_id` | Seller identifier. |
| `shipping_limit_date` | Seller's deadline for handing the item to the logistics partner. |
| `price` | Item price. |
| `freight_value` | Freight charged for the item. |

### `olist_customers_dataset.csv`

This table contains information about the customers.

| Column Name | Description |
|---|---|
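| `customer_id` | Key to the orders table; each order carries its own `customer_id`. |
| `customer_unique_id` | Identifier of the actual customer, shared across that customer's orders. |
| `customer_zip_code_prefix` | First five digits of the customer's zip code. |
| `customer_city` | Customer's city. |
| `customer_state` | Customer's state. |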
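
Since the feature cells in this notebook group by `customer_id`, it may help to see how that key relates to `customer_unique_id`. The sketch below uses a tiny hypothetical customers/orders pair (all IDs are invented). It shows that counting orders per `customer_id` gives 1 for everyone, while joining through `customer_unique_id` recovers repeat buyers.

```python
import pandas as pd

# Hypothetical toy data: the IDs are invented for illustration only.
customers = pd.DataFrame({
    "customer_id":        ["c1", "c2", "c3"],
    "customer_unique_id": ["u1", "u1", "u2"],   # u1 placed two orders
})
orders = pd.DataFrame({
    "order_id":    ["o1", "o2", "o3"],
    "customer_id": ["c1", "c2", "c3"],
})

# Orders per customer_id: one each, because the key is issued per order.
per_customer_id = orders.groupby("customer_id")["order_id"].count()

# Orders per customer_unique_id: join first, then group.
per_unique = (
    orders.merge(customers, on="customer_id", how="left")
          .groupby("customer_unique_id")["order_id"]
          .count()
)

print(per_customer_id.to_dict())
print(per_unique.to_dict())
```

Whether to model churn per `customer_id` or per `customer_unique_id` is a design choice; the second option is the one that can see repeat purchases.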

---

In [4]:
import sys
from pathlib import Path

import pandas as pd

# allow imports from project root (same path fix as in features_churn.py below)
sys.path.append(str(Path().resolve().parent))
from customer_ai.config import PROCESSED_DATA_DIR

def load_data():
    """
    Reads the parquet files we produced in data/processed and returns:
      - orders
      - reviews
      - order items
      - payments
      - customers
    """
    orders    = pd.read_parquet(PROCESSED_DATA_DIR / "olist_orders_dataset.parquet")
    reviews   = pd.read_parquet(PROCESSED_DATA_DIR / "olist_order_reviews_dataset.parquet")
    items     = pd.read_parquet(PROCESSED_DATA_DIR / "olist_order_items_dataset.parquet")
    payments  = pd.read_parquet(PROCESSED_DATA_DIR / "olist_order_payments_dataset.parquet")
    customers = pd.read_parquet(PROCESSED_DATA_DIR / "olist_customers_dataset.parquet")
    return orders, reviews, items, payments, customers

orders, reviews, items, payments, customers = load_data()
print("Orders shape:", orders.shape)
orders.head(3)

In [19]:
import pandas as pd
from pandas.tseries.offsets import DateOffset

# 1. Convert to datetime
orders["order_purchase_timestamp"] = pd.to_datetime(orders["order_purchase_timestamp"])

# 2. Find the latest purchase date
latest_date = orders["order_purchase_timestamp"].max()
print("Latest order date:", latest_date)

# 3. Define cutoff = 6 months before latest
cutoff_date = latest_date - DateOffset(months=6)
print("Churn cutoff date:", cutoff_date)

# 4. Extract each customer’s last purchase
last = (
    orders
    .groupby("customer_id")["order_purchase_timestamp"]
    .max()
    .reset_index(name="last_purchase")
)

# 5. Flag churn and compute recency
last["is_churned"] = (last["last_purchase"] < cutoff_date).astype(int)
last["days_since_last_order"] = (latest_date - last["last_purchase"]).dt.days

# 6. Quick sanity check
print("Total customers:", len(last))
print("Churn rate:", last["is_churned"].mean().round(3))
last.head()

Latest order date: 2018-10-17 17:30:18
Churn cutoff date: 2018-04-17 17:30:18
Total customers: 99441
Churn rate: 0.709


| | customer_id | last_purchase | is_churned | days_since_last_order |
|---|---|---|---|---|
| 0 | 00012a2ce6f8dcda20d059ce98491703 | 2017-11-14 16:08:26 | 1 | 337 |
| 1 | 000161a058600d5901f007fab4c27140 | 2017-07-16 09:40:32 | 1 | 458 |
| 2 | 0001fd6190edaaf884bcaf3d49edf079 | 2017-02-28 11:06:43 | 1 | 596 |
| 3 | 0002414f95344307404f0ace7a26f1d5 | 2017-08-16 13:09:20 | 1 | 427 |
| 4 | 000379cdec625522490c315e70c7a9fb | 2018-04-02 13:42:17 | 1 | 198 |
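
Note that `is_churned` and `days_since_last_order` are both derived from `last_purchase` and the same reference date, so the recency column on its own nearly determines the label. A minimal sketch with made-up timestamps, assuming the same six-month cutoff as above:

```python
import pandas as pd
from pandas.tseries.offsets import DateOffset

# Made-up last-purchase timestamps (not taken from the dataset).
latest_date = pd.Timestamp("2018-10-17 17:30:18")
cutoff_date = latest_date - DateOffset(months=6)

toy = pd.DataFrame({
    "last_purchase": pd.to_datetime([
        "2017-11-14", "2018-03-01", "2018-05-20", "2018-09-30",
    ])
})
toy["is_churned"] = (toy["last_purchase"] < cutoff_date).astype(int)
toy["days_since_last_order"] = (latest_date - toy["last_purchase"]).dt.days

# Every churned row has a larger recency than every retained row,
# so a single threshold on recency reproduces the label.
min_churned = toy.loc[toy["is_churned"] == 1, "days_since_last_order"].min()
max_retained = toy.loc[toy["is_churned"] == 0, "days_since_last_order"].max()
print(min_churned > max_retained)
```

For that reason a recency column measured against the same reference date is probably best kept out of the feature matrix, or computed from an earlier observation window.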


In [20]:
from sklearn.model_selection import train_test_split

# 1. Hold out test (10%)
train_val, test = train_test_split(
    last,
    test_size=0.10,
    stratify=last["is_churned"],
    random_state=42
)

# 2. From the remaining 90%, carve out validation (so val ≈10% of total)
val_relative = 0.10 / 0.90
train, val = train_test_split(
    train_val,
    test_size=val_relative,
    stratify=train_val["is_churned"],
    random_state=42
)

# 3. Check sizes and churn rates
for name, df in [("train", train), ("val", val), ("test", test)]:
    print(f"{name:>5} shape: {df.shape}, churn rate: {df['is_churned'].mean():.3f}")

train shape: (79552, 4), churn rate: 0.709
  val shape: (9944, 4), churn rate: 0.709
 test shape: (9945, 4), churn rate: 0.709
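
The second split uses a relative fraction because it is applied to what remains after the test set is removed. The arithmetic can be sketched as follows; `relative_val_fraction` is a hypothetical helper name, not something defined elsewhere in this notebook.

```python
def relative_val_fraction(test_size: float, val_size: float) -> float:
    """Fraction of the post-test remainder that must go to validation
    so that validation ends up as `val_size` of the full dataset."""
    return val_size / (1.0 - test_size)

test_size, val_size = 0.10, 0.10
rel = relative_val_fraction(test_size, val_size)

# Share of the full dataset that each split receives.
val_share = (1.0 - test_size) * rel
train_share = (1.0 - test_size) * (1.0 - rel)
print(round(rel, 4), round(val_share, 2), round(train_share, 2))
```

This is consistent with the roughly 80/10/10 shapes printed above.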


In [22]:
# ensure these are proper timestamps
for col in [
    "order_purchase_timestamp",
    "order_delivered_customer_date",
    "order_estimated_delivery_date",
]:
    orders[col] = pd.to_datetime(orders[col], errors="coerce")

# double-check
print(orders.dtypes[[
    "order_purchase_timestamp",
    "order_delivered_customer_date",
    "order_estimated_delivery_date",
]])

order_purchase_timestamp         datetime64[ns]
order_delivered_customer_date    datetime64[ns]
order_estimated_delivery_date    datetime64[ns]
dtype: object
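
The `errors="coerce"` argument matters because it turns unparseable values into `NaT` instead of raising. A small sketch with made-up values:

```python
import pandas as pd

# Made-up values: one valid timestamp, one malformed string, one missing.
raw = pd.Series(["2018-01-05 10:30:00", "not a date", None])
parsed = pd.to_datetime(raw, errors="coerce")

# Malformed and missing entries both become NaT.
print(parsed.isna().tolist())
```

Rows that become `NaT` this way are the ones the later `.notna()` filters drop, so it can be worth counting them before moving on.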


In [26]:
# ── DELIVERY FEATURES ────────────────────────────────────────────────────────────
# Filter only delivered orders with both actual and estimated dates
delivered = orders[
    (orders.order_status == 'delivered') &
    orders.order_delivered_customer_date.notna() &
    orders.order_estimated_delivery_date.notna()
].copy()

# Compute day-difference columns
delivered['delivery_days_diff'] = (
    delivered['order_delivered_customer_date']
    - delivered['order_estimated_delivery_date']
).dt.days
delivered['delivery_actual_days'] = (
    delivered['order_delivered_customer_date']
    - delivered['order_purchase_timestamp']
).dt.days
delivered['delivery_promised_days'] = (
    delivered['order_estimated_delivery_date']
    - delivered['order_purchase_timestamp']
).dt.days

# Aggregate per customer
delivery_feats = delivered.groupby('customer_id').agg({
    'delivery_days_diff': ['mean', 'max'],
    'delivery_actual_days': 'mean',
    'delivery_promised_days': 'mean',
    'order_id': 'count'
}).round(2)

# Flatten column names
delivery_feats.columns = [
    f"delivery_{col[0]}_{col[1]}"
    for col in delivery_feats.columns
]

# Additional ratio metrics
late_counts      = delivered[delivered.delivery_days_diff > 0].groupby('customer_id').size()
very_late_counts = delivered[delivered.delivery_days_diff > 5].groupby('customer_id').size()
long_prom_counts = delivered[delivered.delivery_promised_days > 20].groupby('customer_id').size()

delivery_feats['delivery_late_ratio']         = (late_counts      / delivery_feats.delivery_order_id_count).fillna(0)
delivery_feats['delivery_very_late_ratio']    = (very_late_counts / delivery_feats.delivery_order_id_count).fillna(0)
delivery_feats['delivery_long_promise_ratio'] = (long_prom_counts / delivery_feats.delivery_order_id_count).fillna(0)

# Preview
delivery_feats.reset_index().head()

| | customer_id | delivery_delivery_days_diff_mean | delivery_delivery_days_diff_max | delivery_delivery_actual_days_mean | delivery_delivery_promised_days_mean | delivery_order_id_count | delivery_late_ratio | delivery_very_late_ratio | delivery_long_promise_ratio |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 00012a2ce6f8dcda20d059ce98491703 | -6.0 | -6 | 13.0 | 19.0 | 1 | 0.0 | 0.0 | 0.0 |
| 1 | 000161a058600d5901f007fab4c27140 | -10.0 | -10 | 9.0 | 18.0 | 1 | 0.0 | 0.0 | 0.0 |
| 2 | 0001fd6190edaaf884bcaf3d49edf079 | -16.0 | -16 | 5.0 | 21.0 | 1 | 0.0 | 0.0 | 1.0 |
| 3 | 0002414f95344307404f0ace7a26f1d5 | -1.0 | -1 | 28.0 | 28.0 | 1 | 0.0 | 0.0 | 1.0 |
| 4 | 000379cdec625522490c315e70c7a9fb | -5.0 | -5 | 11.0 | 15.0 | 1 | 0.0 | 0.0 | 0.0 |


In [27]:
delivery_feats.describe()

| | delivery_delivery_days_diff_mean | delivery_delivery_days_diff_max | delivery_delivery_actual_days_mean | delivery_delivery_promised_days_mean | delivery_order_id_count | delivery_late_ratio | delivery_very_late_ratio | delivery_long_promise_ratio |
|---|---|---|---|---|---|---|---|---|
| count | 96470.0 | 96470.0 | 96470.0 | 96470.0 | 96470.0 | 96470.0 | 96470.0 | 96470.0 |
| mean | -11.875889 | -11.875889 | 12.093604 | 23.372748 | 1.0 | 0.067731 | 0.039017 | 0.634021 |
| std | 10.182105 | 10.182105 | 9.55138 | 8.758421 | 0.0 | 0.251285 | 0.193637 | 0.481706 |
| min | -147.0 | -147.0 | 0.0 | 2.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 25% | -17.0 | -17.0 | 6.0 | 18.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 50% | -12.0 | -12.0 | 10.0 | 23.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 75% | -7.0 | -7.0 | 15.0 | 28.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| max | 188.0 | 188.0 | 209.0 | 155.0 | 1.0 | 1.0 | 1.0 | 1.0 |
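
The day-difference columns are built with `.dt.days`, which keeps only the whole-day component of a timedelta and rounds toward negative infinity. A short sketch with made-up timestamps (the estimated date is assumed to fall at midnight) shows one consequence: an order arriving a few hours after the estimate gets a difference of 0 and is not counted as late by the `> 0` test.

```python
import pandas as pd

# Made-up timestamps; the estimate is assumed to be a midnight date.
estimated = pd.Timestamp("2018-01-11 00:00:00")
delivered = pd.Series(pd.to_datetime([
    "2018-01-10 18:00:00",   # 6 hours early
    "2018-01-11 06:00:00",   # 6 hours late
    "2018-01-12 06:00:00",   # 30 hours late
]))

days_diff = (delivered - estimated).dt.days
is_late = (days_diff > 0).tolist()

print(days_diff.tolist())   # [-1, 0, 1]
print(is_late)              # [False, False, True]
```

If same-day lateness should count, comparing the raw timedeltas against `pd.Timedelta(0)` would be one alternative.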


In [28]:
# ── DELIVERY FEATURES  (clean-up & scaling version) ─────────────────────────────
import numpy as np
from sklearn.preprocessing import RobustScaler

# 0.  Filter only delivered orders with both actual & promised dates
delivered = orders[
    (orders.order_status == "delivered")
    & orders.order_delivered_customer_date.notna()
    & orders.order_estimated_delivery_date.notna()
].copy()

# 1.  Day-difference columns
delivered["delivery_days_diff"] = (
    delivered["order_delivered_customer_date"]
    - delivered["order_estimated_delivery_date"]
).dt.days
delivered["delivery_actual_days"] = (
    delivered["order_delivered_customer_date"]
    - delivered["order_purchase_timestamp"]
).dt.days
delivered["delivery_promised_days"] = (
    delivered["order_estimated_delivery_date"]
    - delivered["order_purchase_timestamp"]
).dt.days

# 2.  Aggregate to customer level  ▸ mean + max for lateness, mean for durations
delivery_feats = (
    delivered.groupby("customer_id")
    .agg(
        delivery_days_diff_mean=("delivery_days_diff", "mean"),
        delivery_days_diff_max=("delivery_days_diff", "max"),
        delivery_delivery_actual_days_mean=("delivery_actual_days", "mean"),
        delivery_delivery_promised_days_mean=("delivery_promised_days", "mean"),
        delivery_order_id_count=("order_id", "count"),
    )
    .round(2)
)

# 3.  Ratio metrics
late_counts      = delivered[delivered.delivery_days_diff > 0].groupby("customer_id").size()
very_late_counts = delivered[delivered.delivery_days_diff > 5].groupby("customer_id").size()
long_prom_counts = delivered[delivered.delivery_promised_days > 20].groupby("customer_id").size()

delivery_feats["delivery_late_ratio"]         = (late_counts      / delivery_feats.delivery_order_id_count).fillna(0)
delivery_feats["delivery_very_late_ratio"]    = (very_late_counts / delivery_feats.delivery_order_id_count).fillna(0)
delivery_feats["delivery_long_promise_ratio"] = (long_prom_counts / delivery_feats.delivery_order_id_count).fillna(0)

# ── CLEAN-UP STEPS ─────────────────────────────────────────────────────────────
# A.  Drop constant column ( =1 for everyone )
delivery_feats = delivery_feats.drop(columns=["delivery_order_id_count"])

# B.  Winsorise/cap extreme lateness ±30 days
for col in ["delivery_days_diff_mean", "delivery_days_diff_max"]:
    delivery_feats[col] = delivery_feats[col].clip(-30, 30)

# C.  Log-transform skewed ratios
EPS = 1e-3
for col in ["delivery_late_ratio", "delivery_very_late_ratio", "delivery_long_promise_ratio"]:
    delivery_feats[f"{col}_log"] = np.log(delivery_feats[col] + EPS)

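# D.  Make sure **every** customer appears. Customers with no delivered order
#     get 0 for the raw features and log(EPS) for the *_log columns, so that a
#     missing ratio means "ratio 0" on both scales.
all_customers = customers[["customer_id"]].drop_duplicates().set_index("customer_id")
delivery_feats = all_customers.join(delivery_feats)
log_cols = [c for c in delivery_feats.columns if c.endswith("_log")]
delivery_feats[log_cols] = delivery_feats[log_cols].fillna(np.log(EPS))
delivery_feats = delivery_feats.fillna(0)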

# E.  Robust-scale numeric features  (good for logistic regression / GLMs)
num_cols = delivery_feats.select_dtypes(float).columns
scaler = RobustScaler()
delivery_feats[num_cols] = scaler.fit_transform(delivery_feats[num_cols])

# Preview
delivery_feats.reset_index().head()

| | customer_id | delivery_days_diff_mean | delivery_days_diff_max | delivery_delivery_actual_days_mean | delivery_delivery_promised_days_mean | delivery_late_ratio | delivery_very_late_ratio | delivery_long_promise_ratio | delivery_late_ratio_log | delivery_very_late_ratio_log | delivery_long_promise_ratio_log |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 06b8999e2fba1a1fbc88172c00ba8bc7 | 0.1 | 0.1 | -0.222222 | -0.363636 | 0.0 | 0.0 | -1.0 | 0.0 | 0.0 | -1.0 |
| 1 | 18955e83d337fd6b2def6b18a428ac77 | 0.4 | 0.4 | 0.666667 | 0.090909 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 4e7b3e00288586ebd08712fdd0374a03 | 1.3 | 1.3 | 1.777778 | 0.090909 | 1.0 | 0.0 | 0.0 | 6.908755 | 0.0 | 0.0 |
| 3 | b2b6027bc5c5109e529d4dc6358b12c3 | -0.1 | -0.1 | 0.444444 | 0.363636 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 4f2d8ab171c80ec8364f7c12e35b23ad | 0.6 | 0.6 | 0.111111 | -0.636364 | 0.0 | 0.0 | -1.0 | 0.0 | 0.0 | -1.0 |
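
The value 6.908755 in `delivery_late_ratio_log` is consistent with how robust scaling works: each column is centred on its median and divided by its interquartile range (IQR). The sketch below reimplements that rule with NumPy on a made-up column. It assumes that a zero IQR is replaced by 1, which is how scikit-learn's `RobustScaler` appears to handle constant-spread columns; `robust_scale` is a hypothetical helper, not the library function.

```python
import numpy as np

EPS = 1e-3

def robust_scale(x: np.ndarray) -> np.ndarray:
    """Centre on the median and divide by the IQR (hypothetical helper)."""
    median = np.median(x)
    q25, q75 = np.percentile(x, [25, 75])
    iqr = q75 - q25
    if iqr == 0:          # assumption: fall back to unit scale
        iqr = 1.0
    return (x - median) / iqr

# Made-up column: nine customers never late (ratio 0), one always late (ratio 1).
ratios = np.array([0.0] * 9 + [1.0])
logged = np.log(ratios + EPS)
scaled = robust_scale(logged)

print(scaled.round(6))
```

Because most customers have a ratio of 0, the median is `log(EPS)` and the IQR is 0, so the single late customer ends up at `log(1 + EPS) - log(EPS)`, about 6.908755.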


In [30]:
# ── REVIEW SENTIMENT FEATURES with “no review” flag ─────────────────────────────

# 1. Base set of all customers
all_customers = orders[["customer_id"]].drop_duplicates().set_index("customer_id")

# 2. Keep only reviews with text and merge customer_id
rv = (
    reviews
    .dropna(subset=["review_comment_message"])
    .merge(
        orders[["order_id", "customer_id"]],
        on="order_id",
        how="left",
    )
    .copy()
)

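# 3. Compute polarity and mismatch flag
#    TextBlob's default analyzer is built for English; Olist review text is
#    largely Portuguese, so many polarity scores may come out as 0.
from textblob import TextBlob

rv["sentiment_score"] = rv["review_comment_message"].apply(
    lambda txt: TextBlob(txt).sentiment.polarity if isinstance(txt, str) else 0
)
rv["sentiment_class"] = pd.cut(
    rv["sentiment_score"],
    bins=[-1.0, -0.1, 0.1, 1.0],
    labels=["negative", "neutral", "positive"],
    include_lowest=True,  # keep a polarity of exactly -1.0 in the "negative" bin
)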
rv["mismatch_flag"] = (
    ((rv.review_score >= 4) & (rv.sentiment_class == "negative")) |
    ((rv.review_score <= 2) & (rv.sentiment_class == "positive"))
).astype(int)

# 4. Aggregate per customer using named aggregation: name = (column, func)
review_feats = rv.groupby("customer_id").agg(
    total_reviews      = ("order_id",      "count"),
    review_score_mean  = ("review_score",  "mean"),
    review_score_min   = ("review_score",  "min"),
    sentiment_mean     = ("sentiment_score","mean"),
    sentiment_std      = ("sentiment_score","std"),
    mismatch_count     = ("mismatch_flag", "sum"),
    negative_count     = ("sentiment_class", lambda s: (s == "negative").sum()),
).round(3)

# 5. Derive ratios and “has_review” flag
review_feats["has_review"]      = (review_feats.total_reviews > 0).astype(int)
review_feats["negative_ratio"]  = (review_feats.negative_count / review_feats.total_reviews).fillna(0)
review_feats["mismatch_ratio"]  = (review_feats.mismatch_count / review_feats.total_reviews).fillna(0)

# 6. Drop intermediate counts
review_feats = review_feats.drop(columns=["total_reviews","negative_count","mismatch_count"])

# 7. Reindex to include everyone
review_feats = all_customers.join(review_feats).fillna({
    "review_score_mean": 0,
    "review_score_min":  0,
    "sentiment_mean":    0,
    "sentiment_std":     0,
    "negative_ratio":    0,
    "mismatch_ratio":    0,
    "has_review":        0,
})

# 8. (Optional) scale numeric columns if you plan to feed into a model
from sklearn.preprocessing import RobustScaler
num_cols = [
    "review_score_mean",
    "review_score_min",
    "sentiment_mean",
    "sentiment_std",
    "negative_ratio",
    "mismatch_ratio",
]
scaler = RobustScaler()
review_feats[num_cols] = scaler.fit_transform(review_feats[num_cols])

# Preview
review_feats.reset_index().head()

| | customer_id | review_score_mean | review_score_min | sentiment_mean | sentiment_std | has_review | negative_ratio | mismatch_ratio |
|---|---|---|---|---|---|---|---|---|
| 0 | 9ef432eb6251297304e76186b10a928d | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 1 | b0830fb4747a6c6d20dea0b8c802d7ef | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 2 | 41ce2a54c0b03bf3443c3d931a367089 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | f88197465ea7920adcdbec7375364d82 | 1.25 | 1.25 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 4 | 8ab97904e6daea8866dbdbc4fb7aad2c | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
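
One detail of the sentiment binning is easy to miss: `pd.cut` builds right-closed intervals by default, so a polarity of exactly -1.0 falls outside the first bin unless `include_lowest=True` is passed. A small sketch with made-up polarity values:

```python
import pandas as pd

scores = pd.Series([-1.0, -0.5, 0.0, 0.5, 1.0])   # made-up polarity values
bins = [-1.0, -0.1, 0.1, 1.0]
labels = ["negative", "neutral", "positive"]

default_cut = pd.cut(scores, bins=bins, labels=labels)
closed_cut = pd.cut(scores, bins=bins, labels=labels, include_lowest=True)

print(default_cut.isna().tolist())       # [True, False, False, False, False]
print(closed_cut.astype(str).tolist())
```

With the default, the most negative reviews would silently drop out of `negative_count` and the mismatch flag.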


In [32]:
# ── RFM FEATURES ──────────────────────────────────────────────────────────────

# 1️⃣ Get the “latest” date overall for recency calculation
latest_date = orders["order_purchase_timestamp"].max()

# 2️⃣ Sum payments per order
order_pay = (
    payments
    .groupby("order_id")["payment_value"]
    .sum()
    .reset_index()
)

# 3️⃣ Combine orders, items, payments (one row per order item, so payment_value repeats for multi-item orders)
rfm_df = (
    orders[["customer_id", "order_id", "order_purchase_timestamp"]]
    .merge(order_pay, on="order_id", how="left")
    .merge(items[["order_id", "price", "freight_value"]], on="order_id", how="left")
)

# 4️⃣ Aggregate per customer
rfm_feats = (
    rfm_df
    .groupby("customer_id")
    .agg(
        last_purchase   = ("order_purchase_timestamp", "max"),
        frequency       = ("order_id",                "count"),
        total_spend     = ("payment_value",          "sum"),
        avg_order_value = ("payment_value",          "mean"),
        avg_shipping    = ("freight_value",          "mean"),
    )
    .round(2)
)

# 5️⃣ Compute recency and shipping ratio
rfm_feats["recency_days"]    = (latest_date - rfm_feats["last_purchase"]).dt.days
rfm_feats["shipping_ratio"]  = (rfm_feats["avg_shipping"] / rfm_feats["avg_order_value"]).fillna(0)

# 6️⃣ Static shipping flags
for pct in [5, 10, 15, 20]:
    rfm_feats[f"ship_above_{pct}pct"] = (rfm_feats["shipping_ratio"] > pct/100).astype(int)

# 7️⃣ Train-only percentile flags (example: 75th & 90th)
#    You’d compute t75/t90 on your TRAIN split only and then reuse on val/test.
t75 = rfm_feats["shipping_ratio"].quantile(0.75)
t90 = rfm_feats["shipping_ratio"].quantile(0.90)
rfm_feats["ship_above_75pct"] = (rfm_feats["shipping_ratio"] > t75).astype(int)
rfm_feats["ship_above_90pct"] = (rfm_feats["shipping_ratio"] > t90).astype(int)

# 8️⃣ Drop the raw timestamp
rfm_feats = rfm_feats.drop(columns=["last_purchase"])

# 9️⃣ Scale numeric columns for modeling
from sklearn.preprocessing import RobustScaler
num_cols = ["recency_days", "frequency", "total_spend", "avg_order_value", "avg_shipping", "shipping_ratio"]
scaler = RobustScaler()
rfm_feats[num_cols] = scaler.fit_transform(rfm_feats[num_cols])

# 10️⃣ Preview
rfm_feats.reset_index().head()

| | customer_id | frequency | total_spend | avg_order_value | avg_shipping | recency_days | shipping_ratio | ship_above_5pct | ship_above_10pct | ship_above_15pct | ship_above_20pct | ship_above_75pct | ship_above_90pct |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 00012a2ce6f8dcda20d059ce98491703 | 0.0 | 0.038453 | 0.082203 | 1.098592 | 0.282051 | 0.328934 | 1 | 1 | 1 | 1 | 0 | 0 |
| 1 | 000161a058600d5901f007fab4c27140 | 0.0 | -0.317011 | -0.329506 | -0.492958 | 0.799145 | 0.124352 | 1 | 1 | 1 | 0 | 0 | 0 |
| 2 | 0001fd6190edaaf884bcaf3d49edf079 | 0.0 | 0.644386 | 0.784012 | -0.119078 | 1.388889 | -0.562021 | 1 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0002414f95344307404f0ace7a26f1d5 | 0.0 | 0.523695 | 0.644224 | 1.676056 | 0.666667 | -0.013258 | 1 | 1 | 1 | 0 | 0 | 0 |
| 4 | 000379cdec625522490c315e70c7a9fb | 0.0 | -0.019602 | 0.014962 | -0.300896 | -0.311966 | -0.227506 | 1 | 1 | 0 | 0 | 0 | 0 |
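
Joining order-level payments to item-level rows repeats the payment once per item, which affects `frequency` and `total_spend` for multi-item orders. The sketch below uses a made-up order with two items to show the effect, along with one possible remedy: collapsing items to one row per order before the merge.

```python
import pandas as pd

# Made-up order with two items and a single payment of 100.
orders = pd.DataFrame({"customer_id": ["c1"], "order_id": ["o1"]})
order_pay = pd.DataFrame({"order_id": ["o1"], "payment_value": [100.0]})
items = pd.DataFrame({
    "order_id": ["o1", "o1"],
    "price": [40.0, 50.0],
    "freight_value": [5.0, 5.0],
})

# Item-level join: the payment row is repeated once per item.
item_level = orders.merge(order_pay, on="order_id", how="left").merge(
    items, on="order_id", how="left"
)
inflated_spend = item_level.groupby("customer_id")["payment_value"].sum()

# Order-level join: collapse items to one row per order first.
items_per_order = items.groupby("order_id", as_index=False)[["price", "freight_value"]].sum()
order_level = orders.merge(order_pay, on="order_id", how="left").merge(
    items_per_order, on="order_id", how="left"
)
spend = order_level.groupby("customer_id")["payment_value"].sum()

print(inflated_spend["c1"], spend["c1"])   # 200.0 100.0
```

Note that summing freight per order changes what `avg_shipping` means (per order rather than per item), so the choice depends on which quantity the feature is meant to capture.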


In [40]:
import pandas as pd
import numpy as np

def build_payment_features(payments: pd.DataFrame, orders: pd.DataFrame) -> pd.DataFrame:
    """Return a tidy DataFrame of customer-level payment-complexity features."""

    # ── 1. Per-order aggregation ──────────────────────────────────────────────
    order_pay = (
        payments
        .groupby("order_id")
        .agg(
            num_methods      = ("payment_sequential",   "max"),     # highest payment sequence number
            num_types        = ("payment_type",         "nunique"), # distinct payment types
            max_installments = ("payment_installments", "max"),     # largest installment count
            mean_installments= ("payment_installments", "mean"),    # average installment count
            total_payment    = ("payment_value",        "sum"),     # total paid for the order
            count_payment    = ("payment_value",        "count"),   # number of payment rows
        )
        .reset_index()   # keep order_id as a column for the merge
    )

    # ── 2. Order-level flags ──────────────────────────────────────────────────
    order_pay = order_pay.assign(
        multiple_methods  = lambda df: (df.num_methods      > 1).astype(int),
        multiple_types    = lambda df: (df.num_types        > 1).astype(int),
        high_installments = lambda df: (df.max_installments > 6).astype(int),
    )

    # ── 3. Attach customer_id ────────────────────────────────────────────────
    cust_pay = (
        orders[["order_id", "customer_id"]]
        .merge(order_pay, how="left", on="order_id")
    )

    # ── 4. Aggregate per customer ────────────────────────────────────────────
    pay_feats = (
        cust_pay
        .groupby("customer_id")
        .agg(
            pm_mean_methods    = ("multiple_methods",  "mean"),
            pm_sum_methods     = ("multiple_methods",  "sum"),
            pm_mean_types      = ("multiple_types",    "mean"),
            pm_mean_high_inst  = ("high_installments", "mean"),
            pm_mean_max_inst   = ("max_installments",  "mean"),
            pm_max_max_inst    = ("max_installments",  "max"),
            num_orders         = ("order_id",          "nunique"),   # for ratios
            avg_pay_records    = ("count_payment",     "mean"),      # optional
        )
        .reset_index()
    )

    # ── 5. Derived customer-level metrics ────────────────────────────────────
    pay_feats["complex_payment_ratio"] = (
        pay_feats.pm_sum_methods / pay_feats.num_orders.replace(0, 1)
    )

    pay_feats["financial_stress_score"] = (
          0.4 * pay_feats.pm_mean_high_inst
        + 0.3 * pay_feats.pm_mean_methods
        + 0.3 * (pay_feats.pm_mean_max_inst / 12)
    ).round(3)

    # ── 6. Final cleanup – drop low-information or redundant columns ─────────
    pay_feats = pay_feats.drop(
        columns=[
            "pm_sum_methods",   # nearly all zeros; used only for the ratio
            "pm_max_max_inst",  # equals pm_mean_max_inst when a customer_id has a single order
            "avg_pay_records",  # can be re-added if you later find it useful
        ]
    )

    return pay_feats

customer_payment_features = build_payment_features(payments, orders)
cols_to_remove = ["num_orders", "complex_payment_ratio"]  # redundant / zero-variance
pay_feats = customer_payment_features.drop(columns=cols_to_remove)
pay_feats.head()

| | customer_id | pm_mean_methods | pm_mean_types | pm_mean_high_inst | pm_mean_max_inst | financial_stress_score |
|---|---|---|---|---|---|---|
| 0 | 00012a2ce6f8dcda20d059ce98491703 | 0.0 | 0.0 | 1.0 | 8.0 | 0.6 |
| 1 | 000161a058600d5901f007fab4c27140 | 0.0 | 0.0 | 0.0 | 5.0 | 0.125 |
| 2 | 0001fd6190edaaf884bcaf3d49edf079 | 0.0 | 0.0 | 1.0 | 10.0 | 0.65 |
| 3 | 0002414f95344307404f0ace7a26f1d5 | 0.0 | 0.0 | 0.0 | 1.0 | 0.025 |
| 4 | 000379cdec625522490c315e70c7a9fb | 0.0 | 0.0 | 0.0 | 1.0 | 0.025 |


In [41]:
pay_feats.describe()

| | pm_mean_methods | pm_mean_types | pm_mean_high_inst | pm_mean_max_inst | financial_stress_score |
|---|---|---|---|---|---|
| count | 99440.0 | 99440.0 | 99440.0 | 99440.0 | 99440.0 |
| mean | 0.030561 | 0.022586 | 0.122416 | 2.930521 | 0.131398 |
| std | 0.172126 | 0.148582 | 0.327767 | 2.715685 | 0.196512 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.025 |
| 25% | 0.0 | 0.0 | 0.0 | 1.0 | 0.025 |
| 50% | 0.0 | 0.0 | 0.0 | 2.0 | 0.05 |
| 75% | 0.0 | 0.0 | 0.0 | 4.0 | 0.1 |
| max | 1.0 | 1.0 | 1.0 | 24.0 | 1.3 |
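
The `financial_stress_score` values can be checked by hand. The sketch below restates the weighted sum from `build_payment_features` as a standalone function (a hypothetical restatement for illustration) and applies it to the first two preview rows and to the column maxima.

```python
def financial_stress_score(mean_high_inst: float,
                           mean_methods: float,
                           mean_max_inst: float) -> float:
    """Weighted sum with the same weights as above; installments scaled by 12."""
    return round(
        0.4 * mean_high_inst
        + 0.3 * mean_methods
        + 0.3 * (mean_max_inst / 12),
        3,
    )

# Inputs taken from the first two preview rows, then from the column maxima.
print(financial_stress_score(1.0, 0.0, 8.0))    # 0.6
print(financial_stress_score(0.0, 0.0, 5.0))    # 0.125
print(financial_stress_score(1.0, 1.0, 24.0))   # 1.3
```

Since installment counts go up to 24, the third term can reach 0.6 and the score is not bounded by 1, which matches the reported maximum of 1.3.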


In [42]:
def cancellation_features(orders: pd.DataFrame) -> pd.DataFrame:
    """Return the share of each customer's orders whose status is 'canceled'."""
    cf = orders.groupby("customer_id").agg(
        total_orders    = ("order_id","count"),
        canceled_orders = ("order_status", lambda x: (x=="canceled").sum())
    )
    cf["cancellation_ratio"] = cf.canceled_orders / cf.total_orders
    return cf[["cancellation_ratio"]]

In [43]:
cancellation_features(orders)

| customer_id | cancellation_ratio |
|---|---|
| 00012a2ce6f8dcda20d059ce98491703 | 0.0 |
| 000161a058600d5901f007fab4c27140 | 0.0 |
| 0001fd6190edaaf884bcaf3d49edf079 | 0.0 |
| 0002414f95344307404f0ace7a26f1d5 | 0.0 |
| 000379cdec625522490c315e70c7a9fb | 0.0 |
| ... | ... |
| fffecc9f79fd8c764f843e9951b11341 | 0.0 |
| fffeda5b6d849fbd39689bb92087f431 | 0.0 |
| ffff42319e9b2d713724ae527742af25 | 0.0 |
| ffffa3172527f765de70084a7e53aae8 | 0.0 |
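
To see what `cancellation_features` returns when a customer does have several orders, here is a self-contained sketch on made-up data. The function body is repeated so the example runs on its own; the toy IDs are invented.

```python
import pandas as pd

def cancellation_features(orders: pd.DataFrame) -> pd.DataFrame:
    """Share of each customer's orders whose status is 'canceled'."""
    cf = orders.groupby("customer_id").agg(
        total_orders    = ("order_id", "count"),
        canceled_orders = ("order_status", lambda x: (x == "canceled").sum()),
    )
    cf["cancellation_ratio"] = cf.canceled_orders / cf.total_orders
    return cf[["cancellation_ratio"]]

# Made-up orders: c1 has two orders (one canceled), c2 has one delivered order.
toy_orders = pd.DataFrame({
    "customer_id":  ["c1", "c1", "c2"],
    "order_id":     ["o1", "o2", "o3"],
    "order_status": ["delivered", "canceled", "delivered"],
})

cf = cancellation_features(toy_orders)
print(cf["cancellation_ratio"].to_dict())   # {'c1': 0.5, 'c2': 0.0}
```

When every `customer_id` maps to a single order, as the constant order count in the delivery summary suggests, the ratio can only be 0 or 1 and behaves like a binary flag.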


In [6]:
#!/usr/bin/env python
# features_churn.py

import sys
from pathlib import Path
import pandas as pd
import numpy as np
from pandas.tseries.offsets import DateOffset
from textblob import TextBlob
from sklearn.model_selection import train_test_split

# allow imports from project root
sys.path.append(str(Path().resolve().parent))
from customer_ai.config import PROCESSED_DATA_DIR

# ───────────────────────────────────────────────────────────────────────────────
# CONFIG
# ───────────────────────────────────────────────────────────────────────────────
SEED = 42
CHURN_MONTHS = 6
TEST_SIZE = 0.10
VAL_SIZE = 0.10  # fraction of total

# ───────────────────────────────────────────────────────────────────────────────
# 1. LOAD DATA
# ───────────────────────────────────────────────────────────────────────────────
def load_data():
    """
    Load all Olist dataset tables from processed parquet files.
    """
    orders = pd.read_parquet(PROCESSED_DATA_DIR / "olist_orders_dataset.parquet")
    reviews = pd.read_parquet(PROCESSED_DATA_DIR / "olist_order_reviews_dataset.parquet")
    items = pd.read_parquet(PROCESSED_DATA_DIR / "olist_order_items_dataset.parquet")
    payments = pd.read_parquet(PROCESSED_DATA_DIR / "olist_order_payments_dataset.parquet")
    customers = pd.read_parquet(PROCESSED_DATA_DIR / "olist_customers_dataset.parquet")

    # parse datetime columns in orders
    for col in [
        "order_purchase_timestamp",
        "order_approved_at",
        "order_delivered_carrier_date",
        "order_delivered_customer_date",
        "order_estimated_delivery_date",
    ]:
        if col in orders.columns:
            orders[col] = pd.to_datetime(orders[col], errors="coerce")

    # parse review dates
    if "review_creation_date" in reviews.columns:
        reviews["review_creation_date"] = pd.to_datetime(reviews["review_creation_date"], errors="coerce")

    # parse items shipping limit
    if "shipping_limit_date" in items.columns:
        items["shipping_limit_date"] = pd.to_datetime(items["shipping_limit_date"], errors="coerce")

    return orders, reviews, items, payments, customers

# ───────────────────────────────────────────────────────────────────────────────
# 2. DEFINE CHURN LABELS
# ───────────────────────────────────────────────────────────────────────────────
def define_churn_labels(orders: pd.DataFrame, churn_months: int = CHURN_MONTHS) -> pd.DataFrame:
    latest = orders["order_purchase_timestamp"].max()
    cutoff = latest - DateOffset(months=churn_months)

    last = (
        orders
        .groupby("customer_id")["order_purchase_timestamp"]
        .max()
        .reset_index(name="last_purchase")
    )
    last["is_churned"] = (last["last_purchase"] < cutoff).astype(int)
    last["days_since_last_order"] = (latest - last["last_purchase"]).dt.days
    return last

# ───────────────────────────────────────────────────────────────────────────────
# 3. TRAIN/VAL/TEST SPLIT
# ───────────────────────────────────────────────────────────────────────────────
def split_customers(labels: pd.DataFrame,
                    test_size: float = TEST_SIZE,
                    val_size: float = VAL_SIZE,
                    seed: int = SEED):
    train_val, test = train_test_split(
        labels,
        test_size=test_size,
        stratify=labels["is_churned"],
        random_state=seed,
    )
    rel_val = val_size / (1 - test_size)
    train, val = train_test_split(
        train_val,
        test_size=rel_val,
        stratify=train_val["is_churned"],
        random_state=seed,
    )
    return train, val, test

# ───────────────────────────────────────────────────────────────────────────────
# 4. ENHANCED FEATURE FUNCTIONS
# ───────────────────────────────────────────────────────────────────────────────
def delivery_features(orders: pd.DataFrame) -> pd.DataFrame:
    df = (
        orders.query("order_status=='delivered'")
              .dropna(subset=["order_delivered_customer_date","order_estimated_delivery_date"])
              .copy()
    )
    
    if df.empty:
        return pd.DataFrame()
    
    df["days_diff"] = (df.order_delivered_customer_date - df.order_estimated_delivery_date).dt.days
    df["actual_days"] = (df.order_delivered_customer_date - df.order_purchase_timestamp).dt.days

    # Focus on both promised delivery performance AND absolute delivery speed
    g = df.groupby("customer_id").agg({
        "days_diff": ["mean", "std"],     # Average delay and consistency
        "actual_days": ["mean", "std"],   # Average delivery time and consistency
        "order_id": "count",              # Number of delivered orders
    }).round(2)
    
    # Flatten column names
    g.columns = [f"delivery_{c[0]}_{c[1]}" for c in g.columns]
    
    # Late delivery metrics (relative to promise)
    late_orders = df[df.days_diff > 0].groupby("customer_id").size()
    g["delivery_late_ratio"] = (late_orders / g.delivery_order_id_count).fillna(0)
    g["delivery_consistently_late"] = (g.delivery_late_ratio > 0.5).astype(int)
    
    # NEW: Absolute delivery speed metrics (regardless of promise)
    # These capture customer frustration with slow delivery even if "on time"
    
    # Flag customers who experienced slow deliveries (>7 days is treated as slow here; the threshold is a heuristic)
    slow_orders = df[df.actual_days > 7].groupby("customer_id").size()
    g["delivery_slow_ratio"] = (slow_orders / g.delivery_order_id_count).fillna(0)
    g["delivery_frequently_slow"] = (g.delivery_slow_ratio > 0.3).astype(int)
    
    # Flag customers who experienced very slow deliveries (>14 days)
    very_slow_orders = df[df.actual_days > 14].groupby("customer_id").size()
    g["delivery_very_slow_ratio"] = (very_slow_orders / g.delivery_order_id_count).fillna(0)
    g["delivery_has_very_slow"] = (g.delivery_very_slow_ratio > 0).astype(int)
    
    # Average delivery speed categories - create binary flags directly
    g["delivery_speed_fast"] = (g.delivery_actual_days_mean <= 3).astype(int)
    g["delivery_speed_normal"] = ((g.delivery_actual_days_mean > 3) & (g.delivery_actual_days_mean <= 7)).astype(int)
    g["delivery_speed_slow"] = ((g.delivery_actual_days_mean > 7) & (g.delivery_actual_days_mean <= 14)).astype(int)
    g["delivery_speed_very_slow"] = (g.delivery_actual_days_mean > 14).astype(int)
    
    # Delivery consistency (high std means unpredictable delivery times)
    g["delivery_inconsistent"] = (g.delivery_actual_days_std > 5).astype(int)
    
    # Combined dissatisfaction score (accounts for both late and slow deliveries)
    g["delivery_dissatisfaction_score"] = (
        g.delivery_late_ratio * 0.3 +           # Being late vs promise
        g.delivery_slow_ratio * 0.4 +           # Being slow in absolute terms  
        g.delivery_very_slow_ratio * 0.3        # Having very slow orders
    )
    
    g["delivery_likely_dissatisfied"] = (g.delivery_dissatisfaction_score > 0.3).astype(int)
    
    # Clean up - keep most important columns
    final_cols = [
        "delivery_days_diff_mean",        # Average delay vs promise
        "delivery_actual_days_mean",      # Average actual delivery time
        "delivery_actual_days_std",       # Delivery time consistency
        "delivery_late_ratio",            # Ratio of late orders
        "delivery_slow_ratio",            # Ratio of slow orders (>7 days)
        "delivery_very_slow_ratio",       # Ratio of very slow orders (>14 days)
        "delivery_consistently_late",     # Binary: often late vs promise
        "delivery_frequently_slow",       # Binary: often slow in absolute terms
        "delivery_has_very_slow",         # Binary: has very slow orders
        "delivery_inconsistent",          # Binary: inconsistent delivery times
        "delivery_dissatisfaction_score", # Combined dissatisfaction metric
        "delivery_likely_dissatisfied",   # Binary: likely dissatisfied overall
        "delivery_speed_fast",            # Binary: average delivery <= 3 days
        "delivery_speed_normal",          # Binary: average delivery 3-7 days
        "delivery_speed_slow",            # Binary: average delivery 7-14 days
        "delivery_speed_very_slow",       # Binary: average delivery > 14 days
    ]
    
    return g[final_cols].fillna(0)


def review_features(reviews: pd.DataFrame, orders: pd.DataFrame) -> pd.DataFrame:
    """Aggregate review scores into per-customer satisfaction features."""
    if reviews.empty:
        return pd.DataFrame()

    # Keep only reviews that carry a score; the score is the main signal here
    rv = reviews.dropna(subset=["review_score"]).copy()

    if rv.empty:
        return pd.DataFrame()

    # Attach customer_id via the order each review belongs to
    rv = rv.merge(orders[["order_id", "customer_id"]], on="order_id", how="left")

    # Per-customer mean score, worst score, and review count
    g = rv.groupby("customer_id").agg({
        "review_score": ["mean", "min"],  # Average satisfaction and worst experience
        "order_id": "count",              # Number of reviews (denominator only, dropped below)
    }).round(2)

    g.columns = [f"review_{c[0]}_{c[1]}" for c in g.columns]

    # Share of reviews scored 1 or 2, plus a binary flag for a low average score
    bad_reviews = rv[rv.review_score <= 2].groupby("customer_id").size()
    g["review_bad_ratio"] = (bad_reviews / g.review_order_id_count).fillna(0)
    g["review_dissatisfied"] = (g.review_review_score_mean < 3.5).astype(int)

    return g.drop(columns=["review_order_id_count"])


def rfm_features_historical_window(orders: pd.DataFrame,
                                  items: pd.DataFrame,
                                  payments: pd.DataFrame,
                                  is_train: bool,
                                  thresholds: dict = None,
                                  window_months: int = 6):
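    """
    Build RFM-style features from a fixed historical window that ends
    `window_months` before the most recent purchase in `orders`.

    This creates a gap between the features and the churn definition period,
    so the features do not overlap the period used to define the label.

    Returns a (features, thresholds) tuple. With `is_train=True` the segment
    thresholds are fitted on this data; otherwise the supplied `thresholds`
    are reused.
    """
    if orders.empty:
        return pd.DataFrame(), thresholds

    # Window covers [latest - 2 * window_months, latest - window_months],
    # i.e. it ends before the churn definition period starts
    latest = orders.order_purchase_timestamp.max()
    window_end = latest - pd.DateOffset(months=window_months)
    window_start = window_end - pd.DateOffset(months=window_months)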
    
    # Only use data from this historical window
    window_orders = orders[
        (orders.order_purchase_timestamp >= window_start) & 
        (orders.order_purchase_timestamp <= window_end)
    ].copy()
    
    if window_orders.empty:
        return pd.DataFrame(), thresholds
    
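    # Build features on this historical window.
    # Aggregate items and payments to one row per order first: merging raw item
    # rows would repeat each order's payment_value once per item and inflate
    # both the order count and the payment sums.
    item_totals = items.groupby("order_id")[["price", "freight_value"]].sum().reset_index()
    pay = payments.groupby("order_id")["payment_value"].sum().reset_index()
    od = (window_orders
          .merge(item_totals, on="order_id", how="left")
          .merge(pay, on="order_id", how="left"))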

    # Historical behavior patterns
    g = od.groupby("customer_id").agg({
        "order_purchase_timestamp": "count",
        "payment_value": ["sum", "mean"],
        "freight_value": "sum",
        "price": "sum",
    }).round(2)
    g.columns = [f"rfm_hist_{c[0]}_{c[1]}" for c in g.columns]

    # Historical patterns (safe to use)
    g["rfm_hist_monthly_orders"] = g.rfm_hist_order_purchase_timestamp_count / window_months
    g["rfm_hist_shipping_ratio"] = (g.rfm_hist_freight_value_sum / 
                                   g.rfm_hist_price_sum.replace({0: np.nan})).fillna(0)
    
    # Customer segments based on historical behavior
    if is_train:
        value_thresh = g.rfm_hist_payment_value_sum.quantile(0.75)
        freq_thresh = g.rfm_hist_monthly_orders.quantile(0.75)
        thresholds = {"hist_value": value_thresh, "hist_freq": freq_thresh}
    else:
        if thresholds is None:
            # No fitted thresholds supplied: fall back to hard-coded defaults
            thresholds = {"hist_value": 50, "hist_freq": 1}
        value_thresh = thresholds["hist_value"]
        freq_thresh = thresholds["hist_freq"]
    
    g["rfm_hist_was_valuable"] = (g.rfm_hist_payment_value_sum > value_thresh).astype(int)
    g["rfm_hist_was_frequent"] = (g.rfm_hist_monthly_orders > freq_thresh).astype(int)
    
    return g[["rfm_hist_monthly_orders", "rfm_hist_payment_value_mean", 
             "rfm_hist_shipping_ratio", "rfm_hist_was_valuable", 
             "rfm_hist_was_frequent"]], thresholds


def payment_features(payments: pd.DataFrame, orders: pd.DataFrame) -> pd.DataFrame:
    """Summarize each customer's payment behavior (installments, methods, value)."""
    if payments.empty:
        return pd.DataFrame()
    
    # Order-level payment aggregation
    pc = payments.groupby("order_id").agg({
        "payment_installments": "max",
        "payment_type": "nunique",
        "payment_value": "sum",
    })
    
    # Customer-level features: average the order-level values per customer
    cpf = (
        orders[["order_id", "customer_id"]]
        .merge(pc.reset_index(), on="order_id", how="left")
        .groupby("customer_id")
        .agg({
            "payment_installments": "mean",  # Mean of the per-order maximum installments
            "payment_type": "mean",          # Mean number of distinct payment methods per order
            "payment_value": "mean",         # Mean total paid per order
        })
        .round(2)
    )

    # Simple binary flags
    cpf["payment_high_installments"] = (cpf.payment_installments > 6).astype(int)
    cpf["payment_multiple_types"] = (cpf.payment_type > 1).astype(int)
    
    return cpf


def cancellation_features(orders: pd.DataFrame) -> pd.DataFrame:
    """Compute each customer's share of canceled orders and a has-canceled flag."""
    if orders.empty:
        return pd.DataFrame()
    
    cf = orders.groupby("customer_id").agg(
        total_orders=("order_id","count"),
        canceled_orders=("order_status", lambda x: (x=="canceled").sum())
    )
    cf["cancellation_ratio"] = cf.canceled_orders / cf.total_orders
    cf["has_cancellations"] = (cf.canceled_orders > 0).astype(int)
    
    return cf[["cancellation_ratio", "has_cancellations"]]


def build_for(cust_ids, orders, reviews, items, payments,
              is_train=False, thresholds=None):
    """Build the feature table for `cust_ids`; returns (features, thresholds)."""
    if len(cust_ids) == 0:
        return pd.DataFrame(), thresholds
    
    o = orders[orders.customer_id.isin(cust_ids)]
    r = reviews[reviews.order_id.isin(o.order_id)]
    i = items[items.order_id.isin(o.order_id)]
    p = payments[payments.order_id.isin(o.order_id)]

    D = delivery_features(o)
    V = review_features(r, o)
    R, thresholds = rfm_features_historical_window(o, i, p, is_train, thresholds)
    P = payment_features(p, o)
    C = cancellation_features(o)

    df = pd.DataFrame({"customer_id": cust_ids}).set_index("customer_id")
    for X in (D, V, R, P, C):
        if not X.empty:
            df = df.join(X, how="left")
    return df.fillna(0).reset_index(), thresholds

# ───────────────────────────────────────────────────────────────────────────────
# 6. MAIN: BUILD & SAVE WITH LABELS
# ───────────────────────────────────────────────────────────────────────────────
if __name__ == "__main__":
    orders, reviews, items, payments, customers = load_data()
    labels = define_churn_labels(orders)

    train_ids, val_ids, test_ids = split_customers(labels)

    # Build features
    train_df, thr = build_for(train_ids.customer_id, orders, reviews, items, payments,
                              is_train=True, thresholds=None)
    val_df, _ = build_for(val_ids.customer_id, orders, reviews, items, payments,
                           is_train=False, thresholds=thr)
    test_df, _ = build_for(test_ids.customer_id, orders, reviews, items, payments,
                            is_train=False, thresholds=thr)

    # Merge with labels
    label_cols = ["customer_id", "is_churned", "days_since_last_order"]
    train_df = train_df.merge(train_ids[label_cols], on="customer_id", how="left")
    val_df = val_df.merge(val_ids[label_cols], on="customer_id", how="left")
    test_df = test_df.merge(test_ids[label_cols], on="customer_id", how="left")

    print("▶ Shapes:", train_df.shape, val_df.shape, test_df.shape)
    feat_cols = [c for c in train_df.columns if c not in ("customer_id","is_churned","days_since_last_order")]
    print(f"✅ Built {len(feat_cols)} features")

    # Save datasets
    for df, prefix in [(train_df, "train"), (val_df, "val"), (test_df, "test")]:
        name = f"churn_{prefix}_seed{SEED}.parquet"
        path = PROCESSED_DATA_DIR / name
        df.to_parquet(path, index=False)
        print("Saved →", name)

2025-07-04 19:31:41.607 | INFO     | customer_ai.config:<module>:11 - PROJ_ROOT path is: /home/gwei4/e_commerce_customer_ai_platform


▶ Shapes: (79552, 35) (9944, 35) (9945, 35)
✅ Built 32 features
Saved → churn_train_seed42.parquet
Saved → churn_val_seed42.parquet
Saved → churn_test_seed42.parquet
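
The `is_train` / `thresholds` arguments passed to `build_for` follow a fit-on-train, reuse-elsewhere pattern: the 75th-percentile cutoffs are computed once on the training customers and then applied unchanged to the validation and test customers. Below is a minimal sketch of that pattern on toy data. The helper `flag_valuable` and the spend values are hypothetical and are not part of the pipeline.

```python
import pandas as pd


def flag_valuable(spend: pd.Series, threshold=None, q: float = 0.75):
    """Return (flags, threshold); fit the threshold only when none is given."""
    if threshold is None:
        # "Training" path: derive the cutoff from this data
        threshold = spend.quantile(q)
    # Both paths: apply the cutoff as a strict greater-than comparison
    return (spend > threshold).astype(int), threshold


# Toy spend totals (illustrative values only)
train_spend = pd.Series([10.0, 20.0, 30.0, 40.0, 200.0])
val_spend = pd.Series([35.0, 45.0])

train_flags, thr = flag_valuable(train_spend)          # fits thr on train
val_flags, _ = flag_valuable(val_spend, threshold=thr)  # reuses thr on val

print(thr, train_flags.tolist(), val_flags.tolist())
```

Reusing the training cutoff keeps the meaning of "valuable" identical across splits, so the validation and test distributions never influence the feature definition.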

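
The feature window used by `rfm_features_historical_window` can be checked in isolation. The sketch below reproduces only the two `pd.DateOffset` steps from that function; the helper name `historical_window` and the latest purchase date are assumptions chosen for illustration.

```python
import pandas as pd


def historical_window(latest: pd.Timestamp, window_months: int = 6):
    """Return (window_start, window_end) for the feature window."""
    # The window ends `window_months` before the latest purchase...
    window_end = latest - pd.DateOffset(months=window_months)
    # ...and spans `window_months` back from that point
    window_start = window_end - pd.DateOffset(months=window_months)
    return window_start, window_end


# Hypothetical latest purchase date, used only for illustration
start, end = historical_window(pd.Timestamp("2018-10-17"))
print(start.date(), end.date())  # 2017-10-17 2018-04-17
```

With the default of six months, orders from the most recent six months are excluded from the features, and only the six months before that are aggregated.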