# 🏷️ Objective: Customer Segmentation

## 1. RFM Analysis  
- **Recency**  
  – Days since last purchase (reference date = one day after the most recent order)  
- **Frequency**  
  – Total number of orders per customer  
- **Monetary**  
  – Total amount spent  
- **Outcome**  
  – Assign RFM scores (1–5 quintiles) and aggregate into RFM codes  
  – Label segments such as **Champions**, **Loyal Customers**, **At Risk**, etc.
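
The notebook computes the RFM scores but does not spell out the labeling rules. A minimal sketch of one possible rule set, assuming `R_score` and `F_score` columns holding scores from 1 to 5. The helper name `label_rfm_segment`, the thresholds, and the catch-all label "Others" are illustrative assumptions, not an established standard:

```python
import pandas as pd

def label_rfm_segment(r: int, f: int) -> str:
    """Map recency/frequency scores (1-5) to a segment name.

    Thresholds are illustrative assumptions, not a fixed standard.
    """
    if r >= 4 and f >= 4:
        return "Champions"          # bought recently and often
    if r >= 3 and f >= 3:
        return "Loyal Customers"    # solid on both dimensions
    if r <= 2 and f >= 3:
        return "At Risk"            # used to buy often, but not lately
    return "Others"                 # hypothetical catch-all label

# toy scores, not taken from the dataset
scores = pd.DataFrame({"R_score": [5, 3, 1, 5], "F_score": [5, 4, 5, 1]})
scores["segment"] = [
    label_rfm_segment(r, f) for r, f in zip(scores["R_score"], scores["F_score"])
]
print(scores)
```

In practice the thresholds would be tuned to the score distribution of the data at hand.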

---

## 2. Value-Based Segments  
- **VIP**  
  – Top 10% of spenders  
- **High Value**  
  – 75th–90th percentile  
- **Medium Value**  
  – 50th–75th percentile  
- **Low Value**  
  – 25th–50th percentile  
- **Entry Level**  
  – Bottom 25%

---

## 3. Behavioral Segments  
Cluster customers on standardized behavioral features:  
1. **Purchase Frequency**  
2. **Average Order Value**  
3. **Payment Preferences** (e.g. installments, payment type diversity)  
4. **Review Behavior** (average score, review rate)  
5. **Product Category Diversity**  
6. **Customer Tenure**  

Use K-Means (or similar) to discover groups like “High-Frequency High-Value,” “Big Spenders,” “Occasional Shoppers,” etc.
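
The clustering step is not implemented in this notebook. The sketch below shows the idea on a toy feature matrix using plain NumPy: standardize the columns, then run Lloyd's algorithm (the classic K-Means iteration). The helper names `standardize` and `kmeans`, the toy data, and the deterministic farthest-point initialization are assumptions made for illustration; a library implementation such as scikit-learn's `KMeans` would be the usual choice in practice.

```python
import numpy as np

def standardize(X):
    """Scale each column to zero mean and unit variance."""
    std = X.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant columns
    return (X - X.mean(axis=0)) / std

def kmeans(X, k, n_iter=100):
    """Lloyd's algorithm with a deterministic farthest-point start."""
    centers = [X[0]]
    for _ in range(1, k):
        # next center: the point farthest from all centers chosen so far
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])
    centers = np.array(centers, dtype=float)

    for _ in range(n_iter):
        # assign each point to its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # recompute centers; keep the old one if a cluster ends up empty
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# toy feature matrix: columns = [purchase frequency, average order value]
X = np.array([
    [1, 50.0], [1, 60.0], [2, 55.0],       # occasional, low-value shoppers
    [8, 400.0], [9, 420.0], [10, 390.0],   # frequent, high-value shoppers
])
labels, centers = kmeans(standardize(X), k=2)
print(labels)
```

Standardizing first matters because K-Means uses Euclidean distance: without it, a feature measured in currency units would dominate one measured in order counts.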

---

## 🔑 Key Features Created  
- **Transaction Metrics**  
  – Frequency, Recency, Monetary  
- **Behavioral Patterns**  
  – Preferred payment type, average installments, review scores, category preferences  
- **Engagement Metrics**  
  – Review rate (reviews/orders), payment diversity (# payment types)  
- **Geographic Info**  
  – Customer city & state  

In [1]:
import pandas as pd
from pathlib import Path

PROJECT_ROOT = Path().resolve().parents[0]
PROC_DIR     = PROJECT_ROOT / "data" / "processed"

# load each table
orders    = pd.read_parquet(PROC_DIR / "olist_orders_dataset.parquet")
items     = pd.read_parquet(PROC_DIR / "olist_order_items_dataset.parquet")
payments  = pd.read_parquet(PROC_DIR / "olist_order_payments_dataset.parquet")
reviews   = pd.read_parquet(PROC_DIR / "olist_order_reviews_dataset.parquet")
customers = pd.read_parquet(PROC_DIR / "olist_customers_dataset.parquet")
products  = pd.read_parquet(PROC_DIR / "olist_products_dataset.parquet")
geoloc    = pd.read_parquet(PROC_DIR / "olist_geolocation_dataset.parquet")
cats      = pd.read_parquet(PROC_DIR / "product_category_name_translation.parquet")

# Compute RFM

In [2]:
# convert the purchase timestamp to datetime
orders["order_purchase_timestamp"] = pd.to_datetime(orders["order_purchase_timestamp"])

# choose an analysis date (e.g. one day after last order)
analysis_date = orders["order_purchase_timestamp"].max() + pd.Timedelta(days=1)

# Frequency, first & last purchase
rfm = (
    orders
      .groupby("customer_id")                        # Group the data by each customer
      ["order_purchase_timestamp"]                   # Focus on the purchase‐date column
      .agg(
          first_order = "min",                       # “first_order”: take the earliest date per customer
          last_order  = "max",                       # “last_order”: take the latest date per customer
          frequency   = "count"                      # “frequency”: count how many purchases each customer made
      )
)

# Recency & tenure
rfm["recency_days"] = (analysis_date - rfm["last_order"]).dt.days
rfm["tenure_days"] = (rfm["last_order"] - rfm["first_order"]).dt.days

rfm = rfm.reset_index()
rfm.head()

Unnamed: 0,customer_id,first_order,last_order,frequency,recency_days,tenure_days
0,00012a2ce6f8dcda20d059ce98491703,2017-11-14 16:08:26,2017-11-14 16:08:26,1,338,0
1,000161a058600d5901f007fab4c27140,2017-07-16 09:40:32,2017-07-16 09:40:32,1,459,0
2,0001fd6190edaaf884bcaf3d49edf079,2017-02-28 11:06:43,2017-02-28 11:06:43,1,597,0
3,0002414f95344307404f0ace7a26f1d5,2017-08-16 13:09:20,2017-08-16 13:09:20,1,428,0
4,000379cdec625522490c315e70c7a9fb,2018-04-02 13:42:17,2018-04-02 13:42:17,1,199,0


In [3]:
# Compute total value per order
order_vals = (
    items
      .groupby("order_id")
      .agg(
        total_price    = ("price", "sum"),
        total_freight  = ("freight_value", "sum")
      )
      .assign(total_value=lambda d: d["total_price"] + d["total_freight"])
      .reset_index()
)

# Attach customer_id then roll up to customer
ord_cust_vals = orders[["order_id","customer_id"]].merge(order_vals, on="order_id")

monetary = (
    ord_cust_vals
      .groupby("customer_id")
      .agg(
        total_spent     = ("total_value", "sum"),
        avg_order_value = ("total_value", "mean"),
        total_freight   = ("total_freight", "sum"),
        avg_freight     = ("total_freight", "mean")
      )
      .reset_index()
)

monetary.head()

Unnamed: 0,customer_id,total_spent,avg_order_value,total_freight,avg_freight
0,00012a2ce6f8dcda20d059ce98491703,114.74,114.74,24.94,24.94
1,000161a058600d5901f007fab4c27140,67.41,67.41,12.51,12.51
2,0001fd6190edaaf884bcaf3d49edf079,195.42,195.42,15.43,15.43
3,0002414f95344307404f0ace7a26f1d5,179.35,179.35,29.45,29.45
4,000379cdec625522490c315e70c7a9fb,107.01,107.01,14.01,14.01


In [4]:
cust_master = (
    rfm
      .merge(monetary, on="customer_id", how="left")
)

# fill missing values with 0 (e.g. customers whose orders have no item rows)
cust_master.fillna(0, inplace=True)
cust_master.head(3)

Unnamed: 0,customer_id,first_order,last_order,frequency,recency_days,tenure_days,total_spent,avg_order_value,total_freight,avg_freight
0,00012a2ce6f8dcda20d059ce98491703,2017-11-14 16:08:26,2017-11-14 16:08:26,1,338,0,114.74,114.74,24.94,24.94
1,000161a058600d5901f007fab4c27140,2017-07-16 09:40:32,2017-07-16 09:40:32,1,459,0,67.41,67.41,12.51,12.51
2,0001fd6190edaaf884bcaf3d49edf079,2017-02-28 11:06:43,2017-02-28 11:06:43,1,597,0,195.42,195.42,15.43,15.43


# Payment behaviors

Per-customer payment features: preferred payment type, average installments, total payment value, and payment diversity (number of distinct payment types).


In [5]:
# Merge payments with orders to get customer_id
pay_cust = (
    payments[["order_id", "payment_type", "payment_installments", "payment_value"]]
      .merge(orders[["order_id", "customer_id"]], on="order_id", how="left")
)

# Aggregate per customer
payment_behavior = (
    pay_cust
      .groupby("customer_id")
      .agg(
          preferred_payment_type = ("payment_type", lambda x: x.mode().iat[0] if not x.mode().empty else None),
          avg_installments       = ("payment_installments", "mean"),
          total_payment_value    = ("payment_value", "sum"),
          payment_diversity      = ("payment_type", "nunique")
      )
      .round(2)
      .reset_index()
)

payment_behavior.head()

Unnamed: 0,customer_id,preferred_payment_type,avg_installments,total_payment_value,payment_diversity
0,00012a2ce6f8dcda20d059ce98491703,credit_card,8.0,114.74,1
1,000161a058600d5901f007fab4c27140,credit_card,5.0,67.41,1
2,0001fd6190edaaf884bcaf3d49edf079,credit_card,10.0,195.42,1
3,0002414f95344307404f0ace7a26f1d5,boleto,1.0,179.35,1
4,000379cdec625522490c315e70c7a9fb,boleto,1.0,107.01,1
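
One detail of the `preferred_payment_type` lambda is worth seeing on its own: `Series.mode()` returns all tied values in sorted order, so `.iat[0]` breaks ties alphabetically. The toy series below is made up for illustration:

```python
import pandas as pd

# a customer who used each payment type once: a two-way tie
payments_toy = pd.Series(["credit_card", "boleto"])

modes = payments_toy.mode()   # all tied values, in sorted order
preferred = modes.iat[0]      # first entry wins, i.e. alphabetical tie-break

print(modes.tolist(), preferred)
```

For customers with `payment_diversity` greater than 1 and evenly split usage, the "preferred" type is therefore decided by spelling rather than by behavior.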


In [6]:
cust_master = cust_master.merge(payment_behavior, on="customer_id", how="left").fillna({
    "preferred_payment_type": "unknown", 
    "avg_installments": 0, 
    "total_payment_value": 0, 
    "payment_diversity": 0
})

cust_master.head()

Unnamed: 0,customer_id,first_order,last_order,frequency,recency_days,tenure_days,total_spent,avg_order_value,total_freight,avg_freight,preferred_payment_type,avg_installments,total_payment_value,payment_diversity
0,00012a2ce6f8dcda20d059ce98491703,2017-11-14 16:08:26,2017-11-14 16:08:26,1,338,0,114.74,114.74,24.94,24.94,credit_card,8.0,114.74,1.0
1,000161a058600d5901f007fab4c27140,2017-07-16 09:40:32,2017-07-16 09:40:32,1,459,0,67.41,67.41,12.51,12.51,credit_card,5.0,67.41,1.0
2,0001fd6190edaaf884bcaf3d49edf079,2017-02-28 11:06:43,2017-02-28 11:06:43,1,597,0,195.42,195.42,15.43,15.43,credit_card,10.0,195.42,1.0
3,0002414f95344307404f0ace7a26f1d5,2017-08-16 13:09:20,2017-08-16 13:09:20,1,428,0,179.35,179.35,29.45,29.45,boleto,1.0,179.35,1.0
4,000379cdec625522490c315e70c7a9fb,2018-04-02 13:42:17,2018-04-02 13:42:17,1,199,0,107.01,107.01,14.01,14.01,boleto,1.0,107.01,1.0


# Review Behaviors

In [7]:
# Merge reviews with orders to get customer_id on each review
review_cust = (
    orders[["order_id", "customer_id"]]
      .merge(reviews, on="order_id", how="left")
)

# Aggregate per customer
review_behavior = (
    review_cust
      .groupby("customer_id")
      .agg(
          avg_review_score    = ("review_score",      "mean"),  # average star rating
          review_count        = ("review_score",      "count"), # total reviews written
          reviews_with_text   = (
              "review_comment_message",
              lambda x: x.notna().sum()
          )  # how many reviews included comments
      )
      .round(2)
      .reset_index()
)

review_behavior = (
    review_behavior
      .merge(rfm[["customer_id", "frequency"]], on="customer_id", how="left")
      .assign(
          review_rate = lambda df: (df["review_count"] / df["frequency"]).fillna(0).round(2)
      )
)

In [8]:
print("Nulls:\n", review_behavior.isna().mean().round(3))

Nulls:
 customer_id          0.000
avg_review_score     0.008
review_count         0.000
reviews_with_text    0.000
frequency            0.000
review_rate          0.000
dtype: float64
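
The nulls in `avg_review_score` alongside zero nulls in `review_count` likely come from orders that have no review row: after the left join their `review_score` is `NaN`, `count` skips it and returns 0, while `mean` has nothing to average. A small sketch with made-up tables (`orders_toy` and `reviews_toy` are hypothetical names):

```python
import pandas as pd

orders_toy = pd.DataFrame({"order_id": ["o1", "o2"], "customer_id": ["c1", "c2"]})
reviews_toy = pd.DataFrame({"order_id": ["o1"], "review_score": [5]})  # o2 was never reviewed

merged = orders_toy.merge(reviews_toy, on="order_id", how="left")
agg = merged.groupby("customer_id").agg(
    avg_review_score=("review_score", "mean"),   # NaN when every score is missing
    review_count=("review_score", "count"),      # counts non-null values only
)
print(agg)
```

This is also why `review_rate` has no nulls: a count of 0 divided by a positive frequency is simply 0.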


In [9]:
review_behavior.head()

Unnamed: 0,customer_id,avg_review_score,review_count,reviews_with_text,frequency,review_rate
0,00012a2ce6f8dcda20d059ce98491703,1.0,1,1,1,1.0
1,000161a058600d5901f007fab4c27140,4.0,1,0,1,1.0
2,0001fd6190edaaf884bcaf3d49edf079,5.0,1,1,1,1.0
3,0002414f95344307404f0ace7a26f1d5,5.0,1,0,1,1.0
4,000379cdec625522490c315e70c7a9fb,4.0,1,0,1,1.0


# Category Behavior

In [10]:
# Merge order items with products & category lookup
items_with_cats = (
    items         # order_items DataFrame
      .merge(products[['product_id', 'product_category_name']], on='product_id', how='left')
      .merge(cats[['product_category_name', 'product_category_name_english']],
             on='product_category_name', how='left')
      .merge(orders[['order_id', 'customer_id']], on='order_id', how='left')
)

# Aggregate per customer
category_behavior = (
    items_with_cats
      .groupby('customer_id')
      .agg(
          category_diversity = ('product_category_name_english', 'nunique'),
          preferred_category = ('product_category_name_english',
                                lambda s: s.mode().iat[0] if not s.mode().empty else None)
      )
      .reset_index()
)

In [11]:
print("\nNulls per column:\n", category_behavior.isna().mean().round(3))


Nulls per column:
 customer_id           0.000
category_diversity    0.000
preferred_category    0.014
dtype: float64


In [12]:
category_behavior.head()

Unnamed: 0,customer_id,category_diversity,preferred_category
0,00012a2ce6f8dcda20d059ce98491703,1,toys
1,000161a058600d5901f007fab4c27140,1,health_beauty
2,0001fd6190edaaf884bcaf3d49edf079,1,baby
3,0002414f95344307404f0ace7a26f1d5,1,cool_stuff
4,000379cdec625522490c315e70c7a9fb,1,bed_bath_table


# Geography

In [13]:
from functools import reduce

# geographic info
geo_info = customers[['customer_id', 'customer_city', 'customer_state']]

dfs = [
    rfm,
    monetary,
    payment_behavior,
    review_behavior.drop(columns='frequency', errors='ignore'),  # freq already in rfm
    category_behavior,
    geo_info,
]

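customer_master = reduce(
    lambda left, right: left.merge(right, on='customer_id', how='left'),
    dfs
)

# text columns get "unknown" (as in cell 6) so they stay pure strings;
# any remaining numeric gaps are filled with 0
customer_master = customer_master.fillna(
    {'preferred_payment_type': 'unknown', 'preferred_category': 'unknown'}
).fillna(0)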

print("Master shape :", customer_master.shape)
display(customer_master.head(3))

Master shape : (99441, 22)


Unnamed: 0,customer_id,first_order,last_order,frequency,recency_days,tenure_days,total_spent,avg_order_value,total_freight,avg_freight,...,total_payment_value,payment_diversity,avg_review_score,review_count,reviews_with_text,review_rate,category_diversity,preferred_category,customer_city,customer_state
0,00012a2ce6f8dcda20d059ce98491703,2017-11-14 16:08:26,2017-11-14 16:08:26,1,338,0,114.74,114.74,24.94,24.94,...,114.74,1.0,1.0,1,1,1.0,1.0,toys,osasco,SP
1,000161a058600d5901f007fab4c27140,2017-07-16 09:40:32,2017-07-16 09:40:32,1,459,0,67.41,67.41,12.51,12.51,...,67.41,1.0,4.0,1,0,1.0,1.0,health_beauty,itapecerica,MG
2,0001fd6190edaaf884bcaf3d49edf079,2017-02-28 11:06:43,2017-02-28 11:06:43,1,597,0,195.42,195.42,15.43,15.43,...,195.42,1.0,5.0,1,1,1.0,1.0,baby,nova venecia,ES


# RFM quintile scores

In [14]:
# quintile labels: 5 (best) → 1 (worst) for Recency
customer_master['R_score'] = pd.qcut(
    customer_master['recency_days'],
    q=5,
    labels=[5,4,3,2,1]     # smaller recency ⇒ better score
).astype(int)

customer_master['F_score'] = pd.qcut(
    customer_master['frequency'].rank(method='first'),
    q=5,
    labels=[1,2,3,4,5]      # larger freq ⇒ better score
).astype(int)

customer_master['M_score'] = pd.qcut(
    customer_master['total_spent'].rank(method='first'),
    q=5,
    labels=[1,2,3,4,5]      # larger spend ⇒ better score
).astype(int)

customer_master['RFM_score'] = (
    customer_master['R_score'].astype(str) +
    customer_master['F_score'].astype(str) +
    customer_master['M_score'].astype(str)
)

customer_master[['customer_id','R_score','F_score','M_score','RFM_score']].head()


Unnamed: 0,customer_id,R_score,F_score,M_score,RFM_score
0,00012a2ce6f8dcda20d059ce98491703,2,1,3,213
1,000161a058600d5901f007fab4c27140,1,1,2,112
2,0001fd6190edaaf884bcaf3d49edf079,1,1,4,114
3,0002414f95344307404f0ace7a26f1d5,2,1,4,214
4,000379cdec625522490c315e70c7a9fb,4,1,3,413
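
The `.rank(method='first')` step before `pd.qcut` is what makes the F and M scores computable when many customers share the same value. A sketch on a made-up, heavily tied frequency series:

```python
import pandas as pd

# toy frequencies: most customers ordered exactly once
freq = pd.Series([1, 1, 1, 1, 1, 1, 1, 1, 2, 3])

try:
    pd.qcut(freq, q=5, labels=[1, 2, 3, 4, 5])
    direct_ok = True
except ValueError:
    direct_ok = False   # several quantile edges coincide at 1

# ranking first gives every row a distinct value, so the edges are unique
ranked = freq.rank(method="first")   # 1.0 .. 10.0, ties broken by row position
f_score = pd.qcut(ranked, q=5, labels=[1, 2, 3, 4, 5]).astype(int)

print(direct_ok, f_score.tolist())
```

The trade-off is that tied customers receive different scores depending only on row order, which would explain why customers with `frequency` 1 in the output above can still land in different F quintiles.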


# Value buckets

In [15]:
quantiles = customer_master['total_spent'].quantile([.25,.5,.75,.9])

def value_bucket(x):
    if x >= quantiles[.90]: return 'VIP'
    if x >= quantiles[.75]: return 'High Value'
    if x >= quantiles[.50]: return 'Medium Value'
    if x >= quantiles[.25]: return 'Low Value'
    return 'Entry Level'

customer_master['value_segment'] = customer_master['total_spent'].apply(value_bucket)
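
The same bucketing can be written without a row-wise `apply`. A sketch using `np.select` on a made-up spend series; the conditions are checked in order, so it mirrors the `if` chain in `value_bucket`:

```python
import numpy as np
import pandas as pd

# toy spend values, not taken from the dataset
spend = pd.Series([10, 20, 30, 40, 50, 60, 70, 80, 90, 100], dtype=float)
q = spend.quantile([.25, .5, .75, .9])

# first matching condition wins, exactly like the chained ifs
conditions = [spend >= q[.9], spend >= q[.75], spend >= q[.5], spend >= q[.25]]
labels = ["VIP", "High Value", "Medium Value", "Low Value"]

segment = pd.Series(
    np.select(conditions, labels, default="Entry Level"),
    index=spend.index,
)
print(segment.value_counts())
```

On a table of roughly 100,000 customers the vectorized form avoids one Python function call per row, though both versions should produce the same labels.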

# Save Customer_Master Dataset

In [18]:
for col in customer_master.select_dtypes(include="object").columns:
    customer_master[col] = customer_master[col].astype(str)

customer_master.to_parquet(PROC_DIR / "customer_master.parquet")