### Problem statement 
Identify which customers are most likely to abandon their cart during a **new product launch** and determine **data-driven interventions** to reduce abandonment **without eroding profit margins**.

This project frames cart abandonment as a **binary classification problem**, using aggregated behavioral, engagement, and psychographic signals while strictly preventing target leakage.

In [None]:
# Libraries
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler  # min-max scaling of features to the [0, 1] range

Let us load the dataset from `data/raw/e_commerce_shopper_behaviour_and_lifestyle.csv` into a pandas DataFrame.

In [2]:
# Load the dataset and check its shape
data_path = "../data/raw/e_commerce_shopper_behaviour_and_lifestyle.csv"
ecom_df = pd.read_csv(data_path)
ecom_df.shape

(1000000, 60)

- The dataset has 60 columns and 1M rows.
- Not all of these columns are necessary for the problem statement at hand, so we can drop some of them.
- Let us first look at all the columns to understand which ones need to be dropped.
- Let us also look at the unique values in each object-type column to understand which standardization techniques should be used.
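
The inspection described in the bullets above could look roughly like the sketch below. It runs on a small hypothetical frame (`sample_df` and its values are made up for illustration); in the notebook the same calls would be applied to `ecom_df`.

```python
import pandas as pd

# Hypothetical stand-in for ecom_df; the real frame has 60 columns.
sample_df = pd.DataFrame({
    "country": ["USA", "India", "USA", "UK"],
    "gender": ["Male", "Female", "Female", "Male"],
    "monthly_spend": [120.0, 80.5, 300.0, 45.0],
})

# Column names and dtypes give a first idea of what can be dropped.
print(sample_df.dtypes)

# Count distinct values in every non-numeric column to judge how each
# one might be encoded or filtered.
non_numeric_cols = sample_df.select_dtypes(exclude="number").columns
unique_counts = {col: sample_df[col].nunique() for col in non_numeric_cols}
print(unique_counts)
```

A column with many distinct values, such as `country`, is a candidate for filtering rather than one-hot encoding.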

#### Findings:
Because the dataset spans many countries, we restrict this analysis to USA users. Dropping the `country` column reduces dimensionality, and the aggregated features built below reduce it further, which helps the model avoid overfitting and bias.

In [5]:
# Filter for USA only and drop country column
usa_df = ecom_df[ecom_df['country'] == 'USA'].drop('country', axis=1)

# Verify
print(f"Original shape: {ecom_df.shape}")
print(f"USA only shape: {usa_df.shape}")


Original shape: (1000000, 60)
USA only shape: (99996, 59)


In [6]:
# Create a copy for feature engineering
df_features = usa_df.copy()

# Initialize scaler for normalization
scaler = MinMaxScaler()
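
One point to keep in mind when using a scaler: if it is fitted on the full dataset before a train/test split, the minimum and maximum of the test rows influence the training features. The sketch below shows one common way to avoid this, assuming a later split with `train_test_split`. The array `X`, the split ratio, and the name `demo_scaler` are hypothetical and not part of this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Hypothetical numeric feature matrix standing in for the engineered features.
rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(200, 3))

X_train, X_test = train_test_split(X, test_size=0.25, random_state=42)

# Fit the scaler on the training rows only, then reuse the fitted
# min/max values to transform the test rows.
demo_scaler = MinMaxScaler()
X_train_scaled = demo_scaler.fit_transform(X_train)
X_test_scaled = demo_scaler.transform(X_test)

print(X_train_scaled.min(), X_train_scaled.max())
```

With this approach the test values can fall slightly outside [0, 1], which is expected.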

#### 1. Engagement Intensity

**avg_daily_engagement_score**: Combines app time, product views, and usage frequency. Measures overall daily activity level.
**weekly_engagement_index**: Scales daily engagement by purchase conversion and projects to weekly. Shows if engagement translates to purchases.

| Metric | Formula | What It Means |
|--------|---------|---------------|
| **avg_daily_engagement_score** | (session_time + views + frequency) ÷ 3 | How active user is daily |
| **weekly_engagement_index** | daily_score × 7 × conversion_rate | Weekly engagement that leads to purchases |

---

| User Type | Engagement | Conversion | Weekly Index | Behavior |
|-----------|-----------|------------|--------------|----------|
| **Power Buyer** | High (0.8) | High (0.8) | **4.48** | Engaged + buys often |
| **Window Shopper** | High (0.8) | Low (0.2) | **1.12** | Browses but doesn't buy |
| **Impulse Buyer** | Low (0.3) | High (0.8) | **1.68** | Quick visits, buys fast |
| **Disengaged** | Low (0.2) | Low (0.2) | **0.28** | Barely uses, rarely buys |
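
The Weekly Index column in the table above follows directly from the formula `daily_score × 7 × conversion_rate`. As a quick check, the sketch below recomputes it for the four illustrative user types. The helper name `weekly_engagement_index` is introduced here for illustration only.

```python
def weekly_engagement_index(daily_score: float, conversion_rate: float) -> float:
    """Project a normalized daily engagement score to a week, weighted by conversion."""
    return daily_score * 7 * conversion_rate

# (engagement, conversion) pairs taken from the table above.
profiles = {
    "Power Buyer": (0.8, 0.8),
    "Window Shopper": (0.8, 0.2),
    "Impulse Buyer": (0.3, 0.8),
    "Disengaged": (0.2, 0.2),
}

for name, (engagement, conversion) in profiles.items():
    print(f"{name}: {weekly_engagement_index(engagement, conversion):.2f}")
```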

In [None]:
# ============================================================================
# ENGAGEMENT INTENSITY → PURCHASE CONVERSION
# ============================================================================

# Step 1: Normalize raw engagement metrics
engagement_cols = ['daily_session_time_minutes', 'product_views_per_day', 'app_usage_frequency']
df_features[engagement_cols] = scaler.fit_transform(df_features[engagement_cols])

# Step 2: Calculate average daily engagement score
df_features['avg_daily_engagement_score'] = (
    df_features['daily_session_time_minutes'] +
    df_features['product_views_per_day'] +
    df_features['app_usage_frequency']
) / 3

# Step 3: Normalize purchase conversion rate
df_features[['purchase_conversion_rate_normalized']] = scaler.fit_transform(df_features[['purchase_conversion_rate']])

# Step 4: Engagement to Purchase Effectiveness Score
# Formula: How much engagement translates to actual purchases
df_features['engagement_to_purchase_score'] = (
    df_features['avg_daily_engagement_score'] * df_features['purchase_conversion_rate_normalized']
)

# Step 5: Weekly Engagement Index (scaled to weekly)
df_features['weekly_engagement_index'] = df_features['engagement_to_purchase_score'] * 7

#### 2. Advertising Responsiveness

**ad_response_rate**: Measures how often users click on ads they see (clicks ÷ views). High rate = responsive to ads.
**ad_exposure_score**: Combines ad visibility with click behavior. High score = sees many ads AND clicks them often.

| Metric | Formula | What It Means |
|--------|---------|---------------|
| **ad_response_rate** | ad_clicks ÷ ad_views | Click-through rate (CTR) |
| **ad_exposure_score** | normalized_ad_views × ad_response_rate | Total ad effectiveness |

---

| User Type | Ad Views | Click Rate | Exposure Score | Behavior |
|-----------|----------|------------|----------------|----------|
| **Ad Responsive** | High (0.8) | High (0.7) | **0.56** | Sees ads, clicks often |
| **Ad Ignorer** | High (0.8) | Low (0.1) | **0.08** | Sees ads, ignores them |
| **Low Exposure** | Low (0.2) | High (0.7) | **0.14** | Rarely sees ads, clicks when shown |
| **Ad Blind** | Low (0.2) | Low (0.1) | **0.02** | Rarely sees or clicks ads |

In [8]:
# ============================================================================
# 2. ADVERTISING RESPONSIVENESS
# ============================================================================

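# Click-through rate; users with zero ad views get a rate of 0 instead of inf/NaN.
# The result is assigned back rather than using fillna(inplace=True) on a column,
# which is deprecated in recent pandas versions.
df_features['ad_response_rate'] = (
    df_features['ad_clicks_per_day'] / df_features['ad_views_per_day'].replace(0, np.nan)
).fillna(0)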

df_features[['ad_views_per_day']] = scaler.fit_transform(df_features[['ad_views_per_day']])
df_features['ad_exposure_score'] = df_features['ad_views_per_day'] * df_features['ad_response_rate']
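
To make the zero-view handling concrete, here is a minimal sketch on a hypothetical three-row frame (`ads` and its values are made up). It uses `Series.where` to mask zero denominators, which has the same effect as the `replace(0, np.nan)` step in the cell above.

```python
import pandas as pd

# Hypothetical users: the second one saw no ads at all.
ads = pd.DataFrame({
    "ad_clicks_per_day": [3, 0, 2],
    "ad_views_per_day": [10, 0, 4],
})

# Mask zero view counts as NaN so the division cannot produce inf,
# then treat "no views" as a response rate of 0.
views = ads["ad_views_per_day"].where(ads["ad_views_per_day"] != 0)
ads["ad_response_rate"] = (ads["ad_clicks_per_day"] / views).fillna(0)

print(ads)
```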

#### 3. Purchase Intent

**purchase_intent_score**: Combines cart size, browse-to-buy efficiency, and purchase frequency. Predicts likelihood to complete purchase.

| Metric | Formula | What It Means |
|--------|---------|---------------|
| **browse_to_buy_inverse** | 1 ÷ browse_to_buy_ratio | Lower browsing per purchase = higher intent |
| **purchase_intent_score** | (cart_size + browse_inverse + weekly_purchases) ÷ 3 | Overall purchase readiness |

---

| User Type | Cart Size | Browse Ratio | Weekly Purchases | Intent Score | Behavior |
|-----------|-----------|--------------|------------------|--------------|----------|
| **Ready Buyer** | High (0.8) | Low/High inverse (0.9) | High (0.8) | **0.83** | Large carts, efficient, buys often |
| **Browser** | Low (0.3) | High/Low inverse (0.2) | Low (0.2) | **0.23** | Small carts, browses a lot |
| **Moderate** | Med (0.5) | Med (0.5) | Med (0.5) | **0.50** | Average behavior |


In [9]:
# ============================================================================
# 3. PURCHASE INTENT (REVISED - NO TARGET LEAKAGE)
# ============================================================================

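# Invert the browse-to-buy ratio; a ratio of 0 maps to an inverse of 0 instead of inf.
# Assigned back rather than using fillna(inplace=True) on a column selection.
df_features['browse_to_buy_inverse'] = (
    1 / df_features['browse_to_buy_ratio'].replace(0, np.nan)
).fillna(0)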

temp_intent = pd.DataFrame({
    'cart': df_features['cart_items_average'],
    'browse': df_features['browse_to_buy_inverse'],
    'weekly': df_features['weekly_purchases']
})
temp_intent_scaled = scaler.fit_transform(temp_intent)

df_features['purchase_intent_score'] = (
    temp_intent_scaled[:, 0] +
    temp_intent_scaled[:, 1] +
    temp_intent_scaled[:, 2]
) / 3
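
The cell above scales the three inputs column by column and averages them. The sketch below reproduces that logic on three hypothetical users with plain NumPy. The helper `min_max` is written here for illustration and is meant to mirror what `MinMaxScaler` does by default on a non-constant column.

```python
import numpy as np

def min_max(column: np.ndarray) -> np.ndarray:
    """Rescale a 1-D array to [0, 1]; assumes the column is not constant."""
    return (column - column.min()) / (column.max() - column.min())

# Hypothetical raw values for three users.
cart_items = np.array([1.0, 3.0, 5.0])
browse_to_buy_ratio = np.array([2.0, 4.0, 10.0])
weekly_purchases = np.array([0.0, 2.0, 4.0])

# A lower browse-to-buy ratio means higher intent, so invert it before scaling.
browse_inverse = 1.0 / browse_to_buy_ratio

intent = (min_max(cart_items) + min_max(browse_inverse) + min_max(weekly_purchases)) / 3
print(np.round(intent, 3))
```

Note how the first user has the smallest cart and no weekly purchases yet still scores above zero, because the efficient browse-to-buy ratio contributes a full point to the average.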

#### 4. Discount Sensitivity

**discount_sensitivity_index**: Measures responsiveness to price incentives. Combines coupon usage and impulse buying behavior.

| Metric | Formula | What It Means |
|--------|---------|---------------|
| **discount_sensitivity_index** | (coupon_usage + impulse_purchases) ÷ 2 | How much discounts drive purchases |

---

| User Type | Coupon Usage | Impulse Buys | Sensitivity | Best Intervention |
|-----------|--------------|--------------|-------------|-------------------|
| **Deal Hunter** | High (0.9) | High (0.8) | **0.85** | Discounts, limited offers |
| **Price Conscious** | High (0.8) | Low (0.2) | **0.50** | Coupons, free shipping |
| **Premium Buyer** | Low (0.2) | Low (0.3) | **0.25** | Exclusivity, early access |
| **Impulse Only** | Low (0.2) | High (0.9) | **0.55** | Flash sales, urgency |


In [10]:
# ============================================================================
# 4. DISCOUNT SENSITIVITY
# ============================================================================

discount_cols = ['coupon_usage_frequency', 'impulse_purchases_per_month']
df_features[discount_cols] = scaler.fit_transform(df_features[discount_cols])

df_features['discount_sensitivity_index'] = (
    df_features['coupon_usage_frequency'] +
    df_features['impulse_purchases_per_month']
) / 2

#### 5. Revenue Strength

**normalized_spend_score**: Standardized monthly spending. Shows purchasing power.
**customer_value_tier**: Categorizes customers by spend level (Low/Mid/High).

| Metric | Formula | What It Means |
|--------|---------|---------------|
| **normalized_spend_score** | normalized(monthly_spend) | Relative spending power (0-1) |
| **customer_value_tier** | Binned spend score | Customer segment |

---

| Tier | Spend Score | Position in Scaled Spend Range | Strategy |
|------|-------------|--------------------------------|----------|
| **High** | > 0.75 | Top 25% of the range | Premium products, VIP access |
| **Mid** | > 0.40 and ≤ 0.75 | Middle 35% of the range | Standard offers, upsells |
| **Low** | ≤ 0.40 | Bottom 40% of the range | Entry products, bundles |

In [11]:
# ============================================================================
# 5. REVENUE STRENGTH
# ============================================================================

df_features[['monthly_spend']] = scaler.fit_transform(df_features[['monthly_spend']])
df_features['normalized_spend_score'] = df_features['monthly_spend']

df_features['customer_value_tier'] = pd.cut(
    df_features['normalized_spend_score'],
    bins=[-np.inf, 0.40, 0.75, np.inf],
    labels=['Low', 'Mid', 'High']
)
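
Fixed cut points on a min-max scaled score split the value range, not the population, so the share of users in each tier depends on how spend is distributed. The sketch below contrasts `pd.cut` (as used above) with `pd.qcut`, which bins by quantile. The `scores` series is hypothetical and deliberately right-skewed, and the use of `pd.qcut` is an alternative shown for comparison, not something this notebook does.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed spend scores in [0, 1].
scores = pd.Series([0.02, 0.05, 0.08, 0.10, 0.15, 0.20, 0.30, 0.45, 0.80, 1.00])
labels = ["Low", "Mid", "High"]

# Fixed thresholds on the scaled value, as in the cell above.
fixed_tiers = pd.cut(scores, bins=[-np.inf, 0.40, 0.75, np.inf], labels=labels)

# Quantile-based tiers: bottom 40%, middle 35%, top 25% of users.
quantile_tiers = pd.qcut(scores, q=[0, 0.40, 0.75, 1.0], labels=labels)

print(fixed_tiers.value_counts().sort_index())
print(quantile_tiers.value_counts().sort_index())
```

On skewed data the fixed thresholds put most users in the Low tier, while the quantile version keeps the tier sizes close to the intended shares.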

#### 6. Recency

**days_since_last_purchase**: Time elapsed since last order. Shows engagement freshness.
**recency_bucket**: Categorizes users by purchase recency (Active/Warm/Cold/Dormant).

| Metric | Formula | What It Means |
|--------|---------|---------------|
| **days_since_last_purchase** | today - last_purchase_date | Days since last order |
| **recency_bucket** | Binned by days | Engagement status |

---

| Bucket | Days Since Purchase | Re-engagement Risk | Strategy |
|--------|---------------------|-------------------|----------|
| **Active** | 0-7 days | Very Low | New launches, cross-sells |
| **Warm** | 8-30 days | Low | Reminders, personalized offers |
| **Cold** | 31-90 days | Medium | Win-back campaigns, discounts |
| **Dormant** | 91+ days | High | Deep discounts, surveys |

In [12]:
# ============================================================================
# 6. RECENCY (LEAKAGE-CONTROLLED)
# ============================================================================

df_features['last_purchase_date'] = pd.to_datetime(df_features['last_purchase_date'])
# Note: using the current time means the day counts (and buckets) depend on when the notebook is run
today = pd.Timestamp.now()
df_features['days_since_last_purchase'] = (today - df_features['last_purchase_date']).dt.days

df_features['recency_bucket'] = pd.cut(
    df_features['days_since_last_purchase'],
    bins=[-np.inf, 7, 30, 90, np.inf],
    labels=['Active', 'Warm', 'Cold', 'Dormant']
)
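
Because the recency calculation uses the current time, re-running the notebook on a different day can move users between buckets, and purchase dates later than the run date produce negative day counts. One possible way to make the feature reproducible is to measure recency against a fixed snapshot date. The sketch below assumes the latest purchase date in the data is a reasonable snapshot. That choice, the `purchases` frame, and its dates are all hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical purchase dates.
purchases = pd.DataFrame({
    "last_purchase_date": ["2026-09-27", "2026-09-01", "2026-06-01", "2026-01-15"],
})
purchases["last_purchase_date"] = pd.to_datetime(purchases["last_purchase_date"])

# Assumption: treat the most recent purchase in the data as the snapshot date.
reference_date = purchases["last_purchase_date"].max()
purchases["days_since_last_purchase"] = (
    reference_date - purchases["last_purchase_date"]
).dt.days

# Same bin edges as the notebook cell above.
purchases["recency_bucket"] = pd.cut(
    purchases["days_since_last_purchase"],
    bins=[-np.inf, 7, 30, 90, np.inf],
    labels=["Active", "Warm", "Cold", "Dormant"],
)
print(purchases)
```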

#### 7. Loyalty & Advocacy

**advocacy_score**: Measures brand attachment through loyalty, reviews, social sharing, and referrals. High score = brand champion.

| Metric | Formula | What It Means |
|--------|---------|---------------|
| **advocacy_score** | (loyalty + reviews + sharing + referrals) ÷ 4 | Overall brand advocacy |

---

| User Type | Loyalty | Reviews | Social Sharing | Referrals | Advocacy Score | Behavior |
|-----------|---------|---------|----------------|-----------|----------------|----------|
| **Brand Champion** | High (0.9) | High (0.8) | High (0.9) | High (0.8) | **0.85** | Promotes brand actively |
| **Satisfied Customer** | High (0.8) | Med (0.5) | Low (0.2) | Low (0.3) | **0.45** | Loyal but quiet |
| **Casual User** | Low (0.3) | Low (0.2) | Low (0.1) | Low (0.1) | **0.18** | Uses but not attached |


In [13]:
# ============================================================================
# 7. LOYALTY & ADVOCACY
# ============================================================================

advocacy_cols = ['brand_loyalty_score', 'review_writing_frequency', 'social_sharing_frequency', 'referral_count']
df_features[advocacy_cols] = scaler.fit_transform(df_features[advocacy_cols])

df_features['advocacy_score'] = (
    df_features['brand_loyalty_score'] +
    df_features['review_writing_frequency'] +
    df_features['social_sharing_frequency'] +
    df_features['referral_count']
) / 4

#### 8. Lifestyle & Stress Impact

**stress_impact_index**: Combines financial stress, overall stress, mental health, and sleep quality. High score = stressed user (may affect purchasing).

| Metric | Formula | What It Means |
|--------|---------|---------------|
| **stress_impact_index** | (financial_stress + overall_stress + (1-mental_health) + (1-sleep)) ÷ 4 | Overall stress level |

---

| User Type | Financial Stress | Overall Stress | Mental Health | Sleep | Stress Index | Shopping Behavior |
|-----------|-----------------|----------------|---------------|-------|--------------|-------------------|
| **High Stress** | High (0.8) | High (0.9) | Low (0.3) | Low (0.2) | **0.80** | Erratic, impulse or avoidance |
| **Moderate Stress** | Med (0.5) | Med (0.5) | Med (0.5) | Med (0.5) | **0.50** | Normal patterns |
| **Low Stress** | Low (0.2) | Low (0.2) | High (0.8) | High (0.9) | **0.18** | Rational, planned purchases |
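
Mental health and sleep quality are "higher is better" measures, so the formula inverts them before averaging with the two stress measures. The sketch below applies the formula to the illustrative rows of the table above. The function name mirrors the feature name but is defined here only for illustration.

```python
def stress_impact_index(
    financial_stress: float,
    overall_stress: float,
    mental_health: float,
    sleep_quality: float,
) -> float:
    """Average four normalized inputs, inverting the two where higher means less stress."""
    return (
        financial_stress
        + overall_stress
        + (1 - mental_health)
        + (1 - sleep_quality)
    ) / 4

high = stress_impact_index(0.8, 0.9, 0.3, 0.2)
moderate = stress_impact_index(0.5, 0.5, 0.5, 0.5)
low = stress_impact_index(0.2, 0.2, 0.8, 0.9)
print(high, moderate, low)
```

The Low Stress row works out to 0.175, which the table shows rounded to 0.18.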


In [14]:
# ============================================================================
# 8. LIFESTYLE & STRESS IMPACT
# ============================================================================

stress_cols = ['stress_from_financial_decisions', 'overall_stress_level', 'mental_health_score', 'sleep_quality']
df_features[stress_cols] = scaler.fit_transform(df_features[stress_cols])

df_features['stress_impact_index'] = (
    df_features['stress_from_financial_decisions'] +
    df_features['overall_stress_level'] +
    (1 - df_features['mental_health_score']) +
    (1 - df_features['sleep_quality'])
) / 4

#### 9. Shopping Regularity

**shopping_consistency_score**: A rough proxy for the predictability of shopping patterns. Combines an integer encoding of the preferred shopping time of day with the weekend-shopper flag.

| Metric | Formula | What It Means |
|--------|---------|---------------|
| **shopping_consistency_score** | time_consistency × (1 + weekend_shopper) | How predictable shopping is |

---

| User Type | Time Pattern | Weekend Shopper | Consistency | Behavior |
|-----------|--------------|----------------|-------------|----------|
| **Routine Shopper** | Consistent | Yes (1) | High | Shops same time, predictable |
| **Random Shopper** | Varied | No (0) | Low | Unpredictable timing |
| **Weekend Warrior** | Varied | Yes (1) | Medium | Primarily weekend shopping |

In [15]:
# ============================================================================
# 9. SHOPPING REGULARITY
# ============================================================================

# Encode the categorical time-of-day column as integers.
# Note: the codes follow the order of first appearance in the data,
# so they are arbitrary labels rather than an ordinal scale.
shopping_time_mapping = {time: idx for idx, time in enumerate(df_features['shopping_time_of_day'].unique())}
df_features['shopping_time_numeric'] = df_features['shopping_time_of_day'].map(shopping_time_mapping)

# Weekend shoppers (weekend_shopper == 1) get their encoded value doubled
df_features['shopping_consistency_score'] = df_features['shopping_time_numeric'] * (1 + df_features['weekend_shopper'])
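
Since the integer codes depend on the order in which categories first appear, the resulting score can change if the rows are reordered. A sketch of one way to make the encoding deterministic is to map the categories through an explicit ordered list. The labels in `time_order` are hypothetical, because the actual values of `shopping_time_of_day` are not shown in this notebook.

```python
import pandas as pd

# Hypothetical category labels in a fixed, explicit order.
time_order = ["Morning", "Afternoon", "Evening", "Night"]
mapping = {label: idx for idx, label in enumerate(time_order)}

# Hypothetical users.
shoppers = pd.DataFrame({
    "shopping_time_of_day": ["Evening", "Morning", "Night"],
    "weekend_shopper": [1, 0, 1],
})

shoppers["shopping_time_numeric"] = shoppers["shopping_time_of_day"].map(mapping)
shoppers["shopping_consistency_score"] = (
    shoppers["shopping_time_numeric"] * (1 + shoppers["weekend_shopper"])
)
print(shoppers)
```

With any integer encoding, the category mapped to 0 always scores 0 regardless of the weekend flag, which is worth keeping in mind when interpreting this feature.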

**cart_abandonment_flag**: Binary classification target. Users with abandonment rate ≥ 60% are flagged as high-risk (1), others as low-risk (0).

| Metric | Formula | What It Means |
|--------|---------|---------------|
| **cart_abandonment_flag** | 1 if cart_abandonment_rate ≥ 0.6, else 0 | High-risk abandoner classification |

---

| Flag | Abandonment Rate | Risk Level | Action Needed |
|------|------------------|------------|---------------|
| **1 (High Risk)** | ≥ 60% | High | Strong intervention required |
| **0 (Low Risk)** | < 60% | Low | Standard engagement |

In [16]:
# ============================================================================
# CREATE TARGET VARIABLE (BINARY CLASSIFICATION)
# ============================================================================

threshold = 0.6  # Adjust based on business needs
df_features['cart_abandonment_flag'] = (df_features['cart_abandonment_rate'] >= threshold).astype(int)

print(f"\nAfter feature engineering shape: {df_features.shape}")


After feature engineering shape: (99996, 75)
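
Before applying the 0.6 threshold it may be worth confirming the scale of `cart_abandonment_rate`. Other rate columns in this dataset appear as whole numbers in the `usa_df.head(1)` output (for example, `purchase_conversion_rate` is 95), and if the abandonment rate is stored the same way, a threshold of 0.6 would flag nearly every user. The sketch below is one possible guard. The helper `abandonment_flag` and its "maximum above 1 means percent" rule are assumptions made for illustration.

```python
import pandas as pd

def abandonment_flag(rate: pd.Series, threshold: float = 0.6) -> pd.Series:
    """Flag high-risk users, rescaling percent-style rates to fractions first."""
    # Assumption: a maximum above 1 means the column is stored on a 0-100 scale.
    if rate.max() > 1:
        rate = rate / 100
    return (rate >= threshold).astype(int)

# Hypothetical rates expressed both ways; the flags should agree.
percent_rates = pd.Series([5, 45, 60, 95])
fraction_rates = pd.Series([0.05, 0.45, 0.60, 0.95])

print(abandonment_flag(percent_rates).tolist())
print(abandonment_flag(fraction_rates).tolist())
```

Checking `df_features['cart_abandonment_rate'].describe()` would show which scale actually applies here.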


#### Column Dropping Strategy

**Purpose**: Remove raw columns used in aggregations, low-signal features, and target leakage to prevent overfitting and improve model generalization.

**Dropped Categories**:
- Raw aggregation inputs (engagement, advocacy, stress components)
- Identifiers (user_id)
- Low-value demographics (ethnicity, occupation, relationship_status)
- Lifestyle noise (reading_habits, hobbies, exercise)
- Target leakage (cart_abandonment_rate, checkout_abandonments_per_month)

In [17]:
# ============================================================================
# DROP RAW COLUMNS AFTER AGGREGATION
# ============================================================================

columns_to_drop_after_aggregation = [
    # Identifiers
    'user_id',
    
    # Engagement
    'daily_session_time_minutes',
    'product_views_per_day',
    'app_usage_frequency',
    
    # Advertising
    'ad_views_per_day',
    'ad_clicks_per_day',
    'notification_response_rate',
    
    # Purchase Intent
    'cart_items_average',
    'browse_to_buy_ratio',
    'weekly_purchases',
    'browse_to_buy_inverse',
    
    # Discount
    'coupon_usage_frequency',
    'impulse_purchases_per_month',
    
    # Revenue
    'monthly_spend',
    'average_order_value',
    
    # Recency
    'last_purchase_date',
    'days_since_last_purchase',
    'account_age_months',
    
    # Advocacy
    'brand_loyalty_score',
    'review_writing_frequency',
    'social_sharing_frequency',
    'referral_count',
    
    # Stress
    'stress_from_financial_decisions',
    'overall_stress_level',
    'mental_health_score',
    'sleep_quality',
    
    # Shopping
    'shopping_time_of_day',
    'weekend_shopper',
    'shopping_time_numeric',
    
    # Low-value demographics
    'ethnicity',
    'language_preference',
    'occupation',
    'relationship_status',
    'urban_rural',
    'household_size',
    
    # Lifestyle noise
    'reading_habits',
    'hobby_count',
    'travel_frequency',
    'exercise_frequency',
    'physical_activity_level',
    
    # TARGET LEAKAGE - CRITICAL
    'cart_abandonment_rate',
    'checkout_abandonments_per_month',
]

df_final = df_features.drop(columns=columns_to_drop_after_aggregation, errors='ignore')

print(f"\nFinal dataset shape: {df_final.shape}")
print(f"\nFinal features ({len(df_final.columns)} columns):")
print(df_final.columns.tolist())
print(f"\nTarget distribution:")
print(df_final['cart_abandonment_flag'].value_counts())


Final dataset shape: (99996, 33)

Final features (33 columns):
['age', 'gender', 'income_level', 'employment_status', 'education_level', 'has_children', 'device_type', 'preferred_payment_method', 'loyalty_program_member', 'product_category_preference', 'return_frequency', 'budgeting_style', 'impulse_buying_score', 'environmental_consciousness', 'health_conscious_shopping', 'social_media_influence_score', 'wishlist_items_count', 'purchase_conversion_rate', 'premium_subscription', 'return_rate', 'avg_daily_engagement_score', 'weekly_engagement_index', 'ad_response_rate', 'ad_exposure_score', 'purchase_intent_score', 'discount_sensitivity_index', 'normalized_spend_score', 'customer_value_tier', 'recency_bucket', 'advocacy_score', 'stress_impact_index', 'shopping_consistency_score', 'cart_abandonment_flag']

Target distribution:
cart_abandonment_flag
1    94495
0     5501
Name: count, dtype: int64
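
Because the drop call uses `errors='ignore'`, a misspelled column name is skipped silently and the column stays in the data. A small check like the sketch below can confirm that no leakage column survived. The helper `find_leakage` and the two demo frames are hypothetical. In the notebook the check would be run on `df_final`.

```python
import pandas as pd

# Columns identified as target leakage in the dropping strategy above.
LEAKAGE_COLS = {"cart_abandonment_rate", "checkout_abandonments_per_month"}

def find_leakage(frame: pd.DataFrame) -> set:
    """Return any leakage columns still present in the frame."""
    return LEAKAGE_COLS & set(frame.columns)

# Hypothetical frames: one clean, one where a leakage column was missed.
clean_df = pd.DataFrame({"age": [38], "advocacy_score": [0.4], "cart_abandonment_flag": [1]})
leaky_df = clean_df.assign(cart_abandonment_rate=[0.9])

print(find_leakage(clean_df))
print(find_leakage(leaky_df))
```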


In [18]:
# Additional columns to drop
additional_drops = [
    'purchase_conversion_rate',
    'return_rate',
    'return_frequency',
    'wishlist_items_count'
]

df_final = df_final.drop(columns=additional_drops, errors='ignore')

print(f"Final dataset shape: {df_final.shape}")
print(f"\nFinal features ({len(df_final.columns)} columns):")
print(df_final.columns.tolist())
print(f"\nTarget distribution:")
print(df_final['cart_abandonment_flag'].value_counts())

Final dataset shape: (99996, 29)

Final features (29 columns):
['age', 'gender', 'income_level', 'employment_status', 'education_level', 'has_children', 'device_type', 'preferred_payment_method', 'loyalty_program_member', 'product_category_preference', 'budgeting_style', 'impulse_buying_score', 'environmental_consciousness', 'health_conscious_shopping', 'social_media_influence_score', 'premium_subscription', 'avg_daily_engagement_score', 'weekly_engagement_index', 'ad_response_rate', 'ad_exposure_score', 'purchase_intent_score', 'discount_sensitivity_index', 'normalized_spend_score', 'customer_value_tier', 'recency_bucket', 'advocacy_score', 'stress_impact_index', 'shopping_consistency_score', 'cart_abandonment_flag']

Target distribution:
cart_abandonment_flag
1    94495
0     5501
Name: count, dtype: int64


In [19]:
usa_df.head(1)

| | user_id | age | gender | urban_rural | income_level | employment_status | education_level | relationship_status | has_children | household_size | ... | cart_items_average | checkout_abandonments_per_month | purchase_conversion_rate | app_usage_frequency | notification_response_rate | account_age_months | last_purchase_date | social_sharing_frequency | premium_subscription | return_rate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7 | 8 | 38 | Male | Rural | 72818 | Retired | High School | Divorced | 1 | 8 | ... | 3 | 8 | 95 | 6 | 83 | 6 | 2026-09-27 | 4 | 0 | 90 |


In [20]:
print(usa_df.columns.tolist())

['user_id', 'age', 'gender', 'urban_rural', 'income_level', 'employment_status', 'education_level', 'relationship_status', 'has_children', 'household_size', 'occupation', 'ethnicity', 'language_preference', 'device_type', 'weekly_purchases', 'monthly_spend', 'cart_abandonment_rate', 'review_writing_frequency', 'average_order_value', 'preferred_payment_method', 'coupon_usage_frequency', 'loyalty_program_member', 'referral_count', 'product_category_preference', 'shopping_time_of_day', 'weekend_shopper', 'impulse_purchases_per_month', 'browse_to_buy_ratio', 'return_frequency', 'budgeting_style', 'brand_loyalty_score', 'impulse_buying_score', 'environmental_consciousness', 'health_conscious_shopping', 'travel_frequency', 'hobby_count', 'social_media_influence_score', 'reading_habits', 'exercise_frequency', 'stress_from_financial_decisions', 'overall_stress_level', 'sleep_quality', 'physical_activity_level', 'mental_health_score', 'daily_session_time_minutes', 'product_views_per_day', 'ad_v