# Comprehensive Lift Analysis Notebook

This notebook covers three common types of lift analysis:
1. **Market Basket Analysis** - Association Rule Lift
2. **Predictive Modeling** - Targeting Efficiency Lift
3. **A/B Testing** - Incremental Impact Lift

## Lift Definition

**Lift is a "Multiplier of Success"** - it measures how much better your specific approach performs compared to a baseline of "business as usual" or "random chance."

---

## Setup and Imports

In [None]:
# Install required packages (uncomment if needed)
# !pip install pandas numpy matplotlib seaborn mlxtend scikit-learn scipy statsmodels

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from mlxtend.frequent_patterns import apriori, association_rules
from mlxtend.preprocessing import TransactionEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score, roc_curve
import warnings
warnings.filterwarnings('ignore')

# Set style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

print("✓ All packages imported successfully!")

---
# Part 1: Market Basket Analysis - Association Rule Lift

## Definition
Measures how much more likely a customer is to buy Item B given that they bought Item A, compared with how often B is bought overall.

## Formula
$$\text{Lift} = \frac{P(A \cap B)}{P(A) \times P(B)}$$
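
An equivalent form can make the formula easier to read. It follows from the usual association-rule definitions, $\text{support}(X) = P(X)$ and $\text{confidence}(A \rightarrow B) = P(B \mid A)$:

$$\text{Lift}(A \rightarrow B) = \frac{P(A \cap B)}{P(A) \times P(B)} = \frac{P(B \mid A)}{P(B)} = \frac{\text{confidence}(A \rightarrow B)}{\text{support}(B)}$$

Because the first expression is symmetric in A and B, the rules A → B and B → A always have the same lift, even though their confidences can differ.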

## Interpretation
- **Lift = 1.0**: Items are independent (no relationship)
- **Lift > 1.0**: Items are positively correlated (bought together more often than expected under independence)
- **Lift < 1.0**: Items are negatively correlated (bought together less often than expected under independence)
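
As a quick sanity check of these cases, lift can be computed by hand from basket frequencies. The sketch below uses a tiny made-up basket list and two hypothetical helpers (`support` and `lift`) that are not part of mlxtend; it assumes every item involved appears in at least one basket, so no division by zero occurs.

```python
# Hand-computed lift on a tiny illustrative basket list (made-up data).
transactions = [
    {"beer", "diapers", "bread"},
    {"beer", "diapers"},
    {"beer", "chips"},
    {"milk", "bread"},
    {"milk", "cheese"},
    {"bread", "butter"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def lift(antecedent, consequent, baskets):
    """P(A and B) / (P(A) * P(B)), estimated from basket frequencies."""
    joint = support(set(antecedent) | set(consequent), baskets)
    return joint / (support(antecedent, baskets) * support(consequent, baskets))

beer_diapers = lift({"beer"}, {"diapers"}, transactions)
beer_milk = lift({"beer"}, {"milk"}, transactions)
print(f"beer -> diapers: {beer_diapers:.2f}")  # above 1: bought together often
print(f"beer -> milk:    {beer_milk:.2f}")     # below 1: never bought together here
```

Here beer appears in 3 of 6 baskets, diapers in 2 of 6, and both together in 2 of 6, so the lift is (2/6) / ((3/6) × (2/6)) = 2.0.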

---

In [None]:
def load_groceries_data():
    """
    Load and prepare groceries dataset
    
    For real data, load from:
    https://www.kaggle.com/datasets/heeraldedhia/groceries-dataset
    
    Expected format: CSV with 'Member_number' and 'itemDescription' columns
    OR: List of transactions where each transaction is a list of items
    """
    # Sample transactions for demonstration
    # Replace this with your actual data loading code
    transactions = [
        ['milk', 'bread', 'butter'],
        ['beer', 'diapers', 'bread'],
        ['milk', 'bread', 'butter', 'cheese'],
        ['beer', 'diapers'],
        ['milk', 'bread', 'butter', 'eggs'],
        ['beer', 'diapers', 'chips'],
        ['milk', 'cheese'],
        ['bread', 'butter', 'eggs'],
        ['beer', 'diapers', 'bread', 'chips'],
        ['milk', 'bread', 'cheese', 'eggs'],
        ['coffee', 'sugar', 'milk'],
        ['wine', 'cheese', 'crackers'],
        ['beer', 'chips', 'salsa'],
        ['pasta', 'tomato sauce', 'cheese'],
        ['chicken', 'rice', 'vegetables'],
    ]
    
    # If loading from CSV:
    # df = pd.read_csv('groceries.csv')
    # transactions = df.groupby('Member_number')['itemDescription'].apply(list).tolist()
    
    return transactions

In [None]:
def perform_market_basket_analysis(transactions, min_support=0.2, min_threshold=1.0):
    """
    Perform market basket analysis and calculate lift
    
    Parameters:
    -----------
    transactions : list of lists
        Each inner list represents items in a transaction
    min_support : float
        Minimum support threshold for frequent itemsets (0-1)
    min_threshold : float
        Minimum lift threshold for association rules
    """
    # Transform transactions to one-hot encoded DataFrame
    te = TransactionEncoder()
    te_ary = te.fit(transactions).transform(transactions)
    df = pd.DataFrame(te_ary, columns=te.columns_)
    
    print("=" * 80)
    print("MARKET BASKET ANALYSIS - GROCERIES")
    print("=" * 80)
    print(f"\nTotal Transactions: {len(transactions)}")
    print(f"Unique Items: {len(df.columns)}")
    print(f"\nTop 10 Item Frequencies:")
    print(df.sum().sort_values(ascending=False).head(10))
    
    # Generate frequent itemsets
    frequent_itemsets = apriori(df, min_support=min_support, use_colnames=True)
    print(f"\n\nFrequent Itemsets (support >= {min_support}):")
    print(frequent_itemsets.sort_values('support', ascending=False).to_string())
    
    # Generate association rules
    if len(frequent_itemsets) > 0:
        rules = association_rules(frequent_itemsets, metric="lift", min_threshold=min_threshold)
        rules = rules.sort_values('lift', ascending=False)
        
        print(f"\n\nAssociation Rules (lift >= {min_threshold}):")
        print(f"\nTotal Rules Found: {len(rules)}")
        print("\nTop 15 Rules by Lift:")
        display_cols = ['antecedents', 'consequents', 'support', 'confidence', 'lift']
        print(rules[display_cols].head(15).to_string(index=False))
        
        # Visualizations
        create_market_basket_visualizations(rules)
        
        # Interpretation
        print_market_basket_interpretation(rules)
        
        return rules, frequent_itemsets
    else:
        print("\nNo frequent itemsets found. Try lowering min_support.")
        return None, frequent_itemsets

In [None]:
def create_market_basket_visualizations(rules):
    """Create comprehensive market basket analysis visualizations"""
    fig, axes = plt.subplots(2, 2, figsize=(16, 12))
    
    # 1. Lift distribution
    axes[0, 0].hist(rules['lift'], bins=30, edgecolor='black', alpha=0.7, color='steelblue')
    axes[0, 0].axvline(x=1, color='red', linestyle='--', linewidth=2, label='Lift = 1 (Independence)')
    axes[0, 0].set_xlabel('Lift', fontsize=12)
    axes[0, 0].set_ylabel('Frequency', fontsize=12)
    axes[0, 0].set_title('Distribution of Lift Values', fontsize=14, fontweight='bold')
    axes[0, 0].legend()
    axes[0, 0].grid(True, alpha=0.3)
    
    # 2. Support vs Confidence (colored by Lift)
    scatter = axes[0, 1].scatter(rules['support'], rules['confidence'], 
                                  c=rules['lift'], s=100, alpha=0.6, 
                                  cmap='viridis', edgecolors='black')
    axes[0, 1].set_xlabel('Support', fontsize=12)
    axes[0, 1].set_ylabel('Confidence', fontsize=12)
    axes[0, 1].set_title('Support vs Confidence (colored by Lift)', fontsize=14, fontweight='bold')
    plt.colorbar(scatter, ax=axes[0, 1], label='Lift')
    axes[0, 1].grid(True, alpha=0.3)
    
    # 3. Top rules by lift
    top_rules = rules.nlargest(10, 'lift')
    # Join all items so multi-item rules are not truncated to their first item
    rule_labels = [f"{', '.join(sorted(ant))} → {', '.join(sorted(cons))}" 
                   for ant, cons in zip(top_rules['antecedents'], top_rules['consequents'])]
    
    axes[1, 0].barh(range(len(top_rules)), top_rules['lift'], color='coral', edgecolor='black')
    axes[1, 0].set_yticks(range(len(top_rules)))
    axes[1, 0].set_yticklabels(rule_labels, fontsize=10)
    axes[1, 0].set_xlabel('Lift', fontsize=12)
    axes[1, 0].set_title('Top 10 Association Rules by Lift', fontsize=14, fontweight='bold')
    axes[1, 0].axvline(x=1, color='red', linestyle='--', linewidth=2, alpha=0.7)
    axes[1, 0].grid(True, alpha=0.3, axis='x')
    
    # 4. Lift vs Confidence
    axes[1, 1].scatter(rules['confidence'], rules['lift'], s=100, alpha=0.6, 
                       c='darkgreen', edgecolors='black')
    axes[1, 1].axhline(y=1, color='red', linestyle='--', linewidth=2, 
                       label='Lift = 1 (No relationship)', alpha=0.7)
    axes[1, 1].set_xlabel('Confidence', fontsize=12)
    axes[1, 1].set_ylabel('Lift', fontsize=12)
    axes[1, 1].set_title('Confidence vs Lift', fontsize=14, fontweight='bold')
    axes[1, 1].legend()
    axes[1, 1].grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.savefig('market_basket_lift_analysis.png', dpi=300, bbox_inches='tight')
    plt.show()
    
    print("\n✓ Visualizations saved as 'market_basket_lift_analysis.png'")

In [None]:
def print_market_basket_interpretation(rules):
    """Print interpretation guide for market basket analysis"""
    print("\n" + "=" * 80)
    print("INTERPRETATION GUIDE - MARKET BASKET LIFT")
    print("=" * 80)
    print("\nWhat does Lift mean?")
    print("- Lift = 1.0: Items are independent (no relationship)")
    print("- Lift > 1.0: Items are positively correlated (bought together more than random)")
    print("- Lift < 1.0: Items are negatively correlated (bought together less than random)")
    
    if len(rules) > 0:
        print("\n" + "=" * 80)
        print("EXAMPLE INTERPRETATION")
        print("=" * 80)
        top_rule = rules.iloc[0]
        # Join all items so multi-item rules are not truncated to their first item
        ant = ', '.join(sorted(top_rule['antecedents']))
        cons = ', '.join(sorted(top_rule['consequents']))
        lift = top_rule['lift']
        conf = top_rule['confidence']
        supp = top_rule['support']
        
        print(f"\nRule: {ant} → {cons}")
        print(f"Lift: {lift:.2f}")
        print(f"Confidence: {conf:.2%}")
        print(f"Support: {supp:.2%}")
        print(f"\nInterpretation:")
        print(f"Customers who buy '{ant}' are {lift:.2f}x as likely to buy '{cons}'")
        print(f"as the average customer.")
        print(f"\n{conf:.1%} of customers who buy '{ant}' also buy '{cons}'.")
        print(f"This pattern appears in {supp:.1%} of all transactions.")
        
        print("\n" + "=" * 80)
        print("BUSINESS RECOMMENDATIONS")
        print("=" * 80)
        print(f"\n1. Product Placement: Place '{cons}' near '{ant}' in store")
        print(f"2. Cross-Selling: Recommend '{cons}' to customers buying '{ant}'")
        print(f"3. Bundle Offers: Create bundle deals with '{ant}' and '{cons}'")
        print(f"4. Promotional Strategy: Discount '{ant}' to drive sales of '{cons}'")

### Run Market Basket Analysis

In [None]:
# Load data
transactions = load_groceries_data()

# Perform analysis
# Adjust min_support and min_threshold based on your data
rules, frequent_itemsets = perform_market_basket_analysis(
    transactions, 
    min_support=0.15,  # Lower for larger datasets
    min_threshold=1.0   # Keep rules with lift >= 1.0
)

---
# Part 2: Predictive Modeling Lift - Targeting Efficiency

## Definition
Measures how much better a model is at identifying targets (e.g., churners, buyers) in a specific segment compared to random selection.

## Formula
$$\text{Lift} = \frac{\% \text{ of Targets in Segment}}{\% \text{ of Population in Segment}}$$

## Interpretation
If Lift = 3 in the top decile (the 10% of customers with the highest predicted scores), that decile contains **3x as many targets** as a randomly selected 10% of the population.
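
To make the formula concrete, here is a minimal sketch on made-up scores and labels. The helper `top_fraction_lift` is hypothetical (it is not a scikit-learn function) and assumes the data contain at least one target.

```python
import numpy as np

# Illustrative data: 20 customers, 4 of whom are true targets (20% base rate).
scores = np.array([0.95, 0.90, 0.80, 0.75, 0.70, 0.65, 0.60, 0.55, 0.50, 0.45,
                   0.40, 0.35, 0.30, 0.25, 0.20, 0.15, 0.10, 0.08, 0.05, 0.02])
labels = np.array([1, 1, 0, 1, 0, 0, 0, 0, 1, 0,
                   0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

def top_fraction_lift(y_true, y_score, fraction=0.10):
    """Lift of the highest-scored `fraction` of rows versus random selection."""
    order = np.argsort(-y_score)                 # highest scores first
    n_top = max(1, int(round(fraction * len(y_true))))
    top = y_true[order][:n_top]
    pct_targets = top.sum() / y_true.sum()       # share of all targets in the segment
    pct_population = n_top / len(y_true)         # share of the population in the segment
    return pct_targets / pct_population

lift_10 = top_fraction_lift(labels, scores, 0.10)
lift_50 = top_fraction_lift(labels, scores, 0.50)
print(f"Top 10% lift: {lift_10:.2f}")
print(f"Top 50% lift: {lift_50:.2f}")
```

The top 10% (2 customers) holds 2 of the 4 targets, so the lift is 50% / 10% = 5.0. The top 50% holds all 4 targets, giving 100% / 50% = 2.0. Cumulative lift always falls to 1.0 once the whole population is included.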

---

In [None]:
def load_churn_data():
    """
    Load Telco Customer Churn dataset
    
    For real data, download from:
    https://www.kaggle.com/datasets/blastchar/telco-customer-churn
    
    Expected columns: tenure, MonthlyCharges, TotalCharges, Contract, 
                     InternetService, TechSupport, PaymentMethod, Churn
    """
    # For demo, create synthetic churn data
    np.random.seed(42)
    n_samples = 5000
    
    data = {
        'tenure': np.random.randint(0, 72, n_samples),
        'MonthlyCharges': np.random.uniform(20, 120, n_samples),
        'TotalCharges': np.random.uniform(20, 8000, n_samples),
        'Contract': np.random.choice(['Month-to-month', 'One year', 'Two year'], n_samples, p=[0.5, 0.3, 0.2]),
        'InternetService': np.random.choice(['DSL', 'Fiber optic', 'No'], n_samples, p=[0.35, 0.45, 0.2]),
        'TechSupport': np.random.choice(['Yes', 'No'], n_samples),
        'PaymentMethod': np.random.choice(['Electronic check', 'Mailed check', 'Bank transfer', 'Credit card'], n_samples)
    }
    
    df = pd.DataFrame(data)
    
    # Create churn based on logical rules
    churn_prob = 0.1 + (df['Contract'] == 'Month-to-month') * 0.3 + \
                 (df['tenure'] < 12) * 0.2 + \
                 (df['MonthlyCharges'] > 80) * 0.15 + \
                 (df['TechSupport'] == 'No') * 0.1
    
    df['Churn'] = (np.random.random(n_samples) < churn_prob).astype(int)
    
    # If loading from CSV:
    # df = pd.read_csv('telco_churn.csv')
    # df['Churn'] = (df['Churn'] == 'Yes').astype(int)
    
    return df

In [None]:
def calculate_lift_curve(y_true, y_pred_proba, n_deciles=10):
    """
    Calculate lift curve for predictive model
    
    Parameters:
    -----------
    y_true : array-like
        True target values (0 or 1)
    y_pred_proba : array-like
        Predicted probabilities (0 to 1)
    n_deciles : int
        Number of deciles to divide population into
        
    Returns:
    --------
    DataFrame with lift metrics per decile
    """
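    # Create DataFrame from plain arrays so a pandas index on either input
    # cannot misalign labels and scores
    df = pd.DataFrame({
        'y_true': np.asarray(y_true),
        'y_pred_proba': np.asarray(y_pred_proba)
    })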
    
    # Sort by predicted probability (descending)
    df = df.sort_values('y_pred_proba', ascending=False).reset_index(drop=True)
    
    # Assign deciles
    df['decile'] = pd.qcut(df.index, n_deciles, labels=False, duplicates='drop') + 1
    
    # Calculate metrics per decile
    lift_data = []
    cumulative_targets = 0
    cumulative_population = 0
    total_targets = df['y_true'].sum()
    total_population = len(df)
    
    for decile in range(1, n_deciles + 1):
        decile_df = df[df['decile'] == decile]
        
        # Decile metrics
        decile_population = len(decile_df)
        decile_targets = decile_df['y_true'].sum()
        decile_target_rate = decile_targets / decile_population if decile_population > 0 else 0
        
        # Cumulative metrics
        cumulative_population += decile_population
        cumulative_targets += decile_targets
        cumulative_target_rate = cumulative_targets / cumulative_population
        
        # Overall baseline rate
        baseline_rate = total_targets / total_population
        
        # Lift calculations
        decile_lift = decile_target_rate / baseline_rate if baseline_rate > 0 else 0
        cumulative_lift = cumulative_target_rate / baseline_rate if baseline_rate > 0 else 0
        
        # % of total targets captured
        pct_targets_captured = (cumulative_targets / total_targets) * 100 if total_targets > 0 else 0
        pct_population = (cumulative_population / total_population) * 100
        
        lift_data.append({
            'Decile': decile,
            'Population': decile_population,
            'Targets': int(decile_targets),
            'Target_Rate_%': decile_target_rate * 100,
            'Decile_Lift': decile_lift,
            'Cumulative_Population_%': pct_population,
            'Cumulative_Targets': int(cumulative_targets),
            'Cumulative_Targets_%': pct_targets_captured,
            'Cumulative_Lift': cumulative_lift
        })
    
    return pd.DataFrame(lift_data)

In [None]:
def plot_lift_analysis(lift_df, model_name="Model"):
    """Create comprehensive lift analysis visualizations"""
    fig, axes = plt.subplots(2, 2, figsize=(16, 12))
    
    # 1. Decile-wise Lift
    axes[0, 0].bar(lift_df['Decile'], lift_df['Decile_Lift'], 
                   color='steelblue', edgecolor='black', alpha=0.7)
    axes[0, 0].axhline(y=1, color='red', linestyle='--', linewidth=2, 
                       label='Baseline (Random)', alpha=0.7)
    axes[0, 0].set_xlabel('Decile (1 = Highest Predicted Probability)', fontsize=12)
    axes[0, 0].set_ylabel('Lift', fontsize=12)
    axes[0, 0].set_title(f'Decile-wise Lift - {model_name}', fontsize=14, fontweight='bold')
    axes[0, 0].legend()
    axes[0, 0].grid(True, alpha=0.3, axis='y')
    axes[0, 0].set_xticks(lift_df['Decile'])
    
    # Add value labels
    for idx, row in lift_df.iterrows():
        axes[0, 0].text(row['Decile'], row['Decile_Lift'] + 0.1, 
                        f"{row['Decile_Lift']:.2f}", 
                        ha='center', va='bottom', fontsize=9, fontweight='bold')
    
    # 2. Cumulative Lift Curve
    axes[0, 1].plot(lift_df['Cumulative_Population_%'], lift_df['Cumulative_Lift'], 
                    marker='o', linewidth=2, markersize=8, color='darkgreen', label='Model Lift')
    axes[0, 1].axhline(y=1, color='red', linestyle='--', linewidth=2, 
                       label='Random Selection', alpha=0.7)
    axes[0, 1].set_xlabel('% of Population Contacted', fontsize=12)
    axes[0, 1].set_ylabel('Cumulative Lift', fontsize=12)
    axes[0, 1].set_title('Cumulative Lift Curve', fontsize=14, fontweight='bold')
    axes[0, 1].legend()
    axes[0, 1].grid(True, alpha=0.3)
    axes[0, 1].set_xlim(0, 100)
    
    # 3. Gains Chart
    axes[1, 0].plot(lift_df['Cumulative_Population_%'], lift_df['Cumulative_Targets_%'], 
                    marker='o', linewidth=2, markersize=8, color='darkorange', label='Model')
    axes[1, 0].plot([0, 100], [0, 100], 'r--', linewidth=2, label='Random', alpha=0.7)
    axes[1, 0].fill_between(lift_df['Cumulative_Population_%'], 
                             lift_df['Cumulative_Targets_%'], 
                             lift_df['Cumulative_Population_%'],
                             alpha=0.2, color='darkorange', label='Lift Area')
    axes[1, 0].set_xlabel('% of Population Contacted', fontsize=12)
    axes[1, 0].set_ylabel('% of Targets Captured', fontsize=12)
    axes[1, 0].set_title('Gains Chart (Cumulative)', fontsize=14, fontweight='bold')
    axes[1, 0].legend()
    axes[1, 0].grid(True, alpha=0.3)
    axes[1, 0].set_xlim(0, 100)
    axes[1, 0].set_ylim(0, 100)
    
    # 4. Target Rate by Decile
    axes[1, 1].bar(lift_df['Decile'], lift_df['Target_Rate_%'], 
                   color='coral', edgecolor='black', alpha=0.7)
    baseline_rate = lift_df['Targets'].sum() / lift_df['Population'].sum() * 100
    axes[1, 1].axhline(y=baseline_rate, color='red', linestyle='--', linewidth=2, 
                       label=f'Overall Rate: {baseline_rate:.2f}%', alpha=0.7)
    axes[1, 1].set_xlabel('Decile', fontsize=12)
    axes[1, 1].set_ylabel('Target Rate (%)', fontsize=12)
    axes[1, 1].set_title('Target Rate by Decile', fontsize=14, fontweight='bold')
    axes[1, 1].legend()
    axes[1, 1].grid(True, alpha=0.3, axis='y')
    axes[1, 1].set_xticks(lift_df['Decile'])
    
    plt.tight_layout()
    filename = f'predictive_lift_{model_name.lower().replace(" ", "_")}.png'
    plt.savefig(filename, dpi=300, bbox_inches='tight')
    plt.show()
    
    print(f"\n✓ Visualizations saved as '{filename}'")

In [None]:
def perform_churn_prediction_analysis(df):
    """
    Perform churn prediction and lift analysis
    """
    print("\n" + "=" * 80)
    print("PREDICTIVE MODELING LIFT - CHURN PREDICTION")
    print("=" * 80)
    
    # Basic statistics
    print(f"\nDataset Shape: {df.shape}")
    print(f"Churn Rate: {df['Churn'].mean():.2%}")
    print(f"\nChurn Distribution:")
    print(df['Churn'].value_counts())
    
    # Prepare data
    df_encoded = pd.get_dummies(df.drop('Churn', axis=1), drop_first=True)
    X = df_encoded
    y = df['Churn']
    
    # Split data
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42, stratify=y
    )
    
    # Train models
    models = {
        'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
        'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1),
        'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, random_state=42)
    }
    
    results = {}
    
    for name, model in models.items():
        print(f"\n{'='*80}")
        print(f"Training {name}...")
        print(f"{'='*80}")
        
        # Train
        model.fit(X_train, y_train)
        
        # Predict probabilities
        y_pred_proba = model.predict_proba(X_test)[:, 1]
        
        # Calculate AUC
        auc = roc_auc_score(y_test, y_pred_proba)
        print(f"AUC-ROC: {auc:.4f}")
        
        # Calculate lift
        lift_df = calculate_lift_curve(y_test, y_pred_proba, n_deciles=10)
        
        print(f"\nLift Table for {name}:")
        print(lift_df.to_string(index=False))
        
        # Plot
        plot_lift_analysis(lift_df, model_name=name)
        
        # Store results
        results[name] = {
            'model': model,
            'auc': auc,
            'lift_df': lift_df,
            'y_pred_proba': y_pred_proba
        }
        
        # Key insights
        print(f"\n{'='*80}")
        print(f"KEY INSIGHTS - {name}")
        print(f"{'='*80}")
        top_decile = lift_df.iloc[0]
        print(f"\nTop 10% of Customers (Decile 1):")
        print(f"  - Lift: {top_decile['Decile_Lift']:.2f}x")
        print(f"  - Interpretation: The top 10% highest-risk customers contain")
        print(f"    {top_decile['Decile_Lift']:.2f}x as many churners as a random 10% sample")
        print(f"  - Target Rate: {top_decile['Target_Rate_%']:.2f}%")
        print(f"  - Cumulative Targets Captured: {top_decile['Cumulative_Targets_%']:.1f}%")
    
    # Compare models
    print("\n" + "=" * 80)
    print("MODEL COMPARISON")
    print("=" * 80)
    comparison_df = pd.DataFrame({
        'Model': list(results.keys()),
        'AUC': [r['auc'] for r in results.values()],
        'Top_Decile_Lift': [r['lift_df'].iloc[0]['Decile_Lift'] for r in results.values()],
        'Top_30pct_Targets_%': [
            # Small tolerance guards against floating-point error in the cumulative percentage
            r['lift_df'][r['lift_df']['Cumulative_Population_%'] <= 30 + 1e-9].iloc[-1]['Cumulative_Targets_%'] 
            for r in results.values()
        ]
    })
    print(comparison_df.to_string(index=False))
    
    return results

### Run Churn Prediction Analysis

In [None]:
# Load data
churn_df = load_churn_data()

# Perform analysis
churn_results = perform_churn_prediction_analysis(churn_df)

---
# Part 3: A/B Testing Lift - Incremental Impact

## Definition
The percentage increase in a metric caused by a treatment compared to a control group.

## Formula
$$\text{Lift} = \frac{\text{Treatment} - \text{Control}}{\text{Control}} \times 100\%$$
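
A short worked example may help here. The sketch below applies the formula to made-up conversion counts and adds a pooled two-proportion z-test written with the standard library only. The helper names `relative_lift_pct` and `two_proportion_z` are hypothetical, and the p-value relies on the normal approximation, which assumes reasonably large samples.

```python
import math

def relative_lift_pct(control_rate, treatment_rate):
    """(Treatment - Control) / Control, expressed as a percentage."""
    return (treatment_rate - control_rate) / control_rate * 100

def two_proportion_z(conv_c, n_c, conv_t, n_t):
    """Pooled two-proportion z statistic and two-sided p-value."""
    p_c, p_t = conv_c / n_c, conv_t / n_t
    p_pool = (conv_c + conv_t) / (n_c + n_t)          # common rate under H0
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_t))
    z = (p_t - p_c) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))        # two-sided normal tail
    return z, p_value

# Illustrative counts: 600 of 5,000 control users and 700 of 5,000 treated users convert.
lift_pct = relative_lift_pct(600 / 5000, 700 / 5000)
z, p_value = two_proportion_z(600, 5000, 700, 5000)
print(f"Lift: {lift_pct:+.1f}%")
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

Moving from a 12% to a 14% conversion rate is a 2 percentage point change but a relative lift of about 16.7%, which is why it helps to state which of the two is being reported.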

## Interpretation
- **Positive lift**: Treatment improved the metric
- **Negative lift**: Treatment hurt the metric  
- **Lift ≈ 0%**: No meaningful difference
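
Whether an observed lift is distinguishable from 0% depends on sampling noise. One common way to quantify that noise is a percentile bootstrap interval, sketched below on simulated data. The conversion rates, sample sizes, seed, and the helper name `bootstrap_lift_ci` are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Simulated binary outcomes; rates and sample sizes are illustrative assumptions.
control = rng.binomial(1, 0.12, size=5000)
treatment = rng.binomial(1, 0.14, size=5000)

def bootstrap_lift_ci(control, treatment, n_boot=1000, alpha=0.05, rng=rng):
    """Percentile bootstrap interval for relative lift (%)."""
    lifts = np.empty(n_boot)
    for i in range(n_boot):
        # Resample each group with replacement and recompute the lift
        c_mean = rng.choice(control, size=len(control), replace=True).mean()
        t_mean = rng.choice(treatment, size=len(treatment), replace=True).mean()
        lifts[i] = (t_mean - c_mean) / c_mean * 100
    lower, upper = np.percentile(lifts, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lower, upper

ci_lower, ci_upper = bootstrap_lift_ci(control, treatment)
print(f"95% CI for lift: [{ci_lower:+.1f}%, {ci_upper:+.1f}%]")
print("Interval excludes 0%" if ci_lower > 0 or ci_upper < 0 else "Interval includes 0%")
```

If the interval includes 0%, the data do not rule out "no effect", however large the point estimate looks.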

---

In [None]:
def load_ab_test_data():
    """
    Load A/B test dataset
    
    For real data, download from:
    https://www.kaggle.com/datasets/zhangluyuan/ab-testing
    
    Expected columns: user_id, group (control/treatment), converted (0/1)
    """
    # Create synthetic A/B test data
    np.random.seed(42)
    n_samples = 10000
    
    # Control group
    control_size = n_samples // 2
    control_conversion_rate = 0.12
    
    # Treatment group
    treatment_size = n_samples // 2
    treatment_conversion_rate = 0.14  # 16.7% lift
    
    data = {
        'user_id': range(n_samples),
        'group': ['control'] * control_size + ['treatment'] * treatment_size,
        'converted': (
            list(np.random.binomial(1, control_conversion_rate, control_size)) +
            list(np.random.binomial(1, treatment_conversion_rate, treatment_size))
        ),
        'time_on_page_sec': np.concatenate([
            np.random.normal(180, 50, control_size),
            np.random.normal(200, 50, treatment_size)
        ]),
        'pages_viewed': np.concatenate([
            np.random.poisson(3, control_size),
            np.random.poisson(3.5, treatment_size)
        ])
    }
    
    df = pd.DataFrame(data)
    
    # If loading from CSV:
    # df = pd.read_csv('ab_test_data.csv')
    
    return df

In [None]:
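def calculate_ab_lift(control_metric, treatment_metric):
    """Calculate relative lift (%) of treatment over control"""
    if control_metric == 0:
        return np.nan  # Relative lift is undefined when the control metric is zero
    lift = ((treatment_metric - control_metric) / control_metric) * 100
    return lift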

def perform_statistical_test(control_data, treatment_data, metric_name="Conversion"):
    """Perform statistical significance test"""
    if metric_name == "Conversion":
        # For binary outcomes, use proportions z-test
        from statsmodels.stats.proportion import proportions_ztest
        
        count = np.array([treatment_data.sum(), control_data.sum()])
        nobs = np.array([len(treatment_data), len(control_data)])
        
        stat, pval = proportions_ztest(count, nobs)
        test_name = "Proportions Z-Test"
    else:
        # For continuous metrics, use t-test
        stat, pval = stats.ttest_ind(treatment_data, control_data)
        test_name = "Independent T-Test"
    
    return stat, pval, test_name

In [None]:
def perform_ab_test_analysis(df):
    """
    Perform comprehensive A/B test analysis with lift calculations
    """
    print("\n" + "=" * 80)
    print("A/B TESTING LIFT - CONVERSION OPTIMIZATION")
    print("=" * 80)
    
    # Split data
    control = df[df['group'] == 'control']
    treatment = df[df['group'] == 'treatment']
    
    print(f"\nSample Sizes:")
    print(f"  Control: {len(control):,}")
    print(f"  Treatment: {len(treatment):,}")
    
    # Define metrics
    metrics = {
        'Conversion Rate': ('converted', 'mean'),
        'Avg Time on Page (sec)': ('time_on_page_sec', 'mean'),
        'Avg Pages Viewed': ('pages_viewed', 'mean')
    }
    
    results = []
    
    for metric_name, (column, agg_func) in metrics.items():
        print(f"\n{'='*80}")
        print(f"METRIC: {metric_name}")
        print(f"{'='*80}")
        
        # Calculate metrics
        control_metric = control[column].mean()
        treatment_metric = treatment[column].mean()
        control_std = control[column].std()
        treatment_std = treatment[column].std()
        
        # Calculate lift
        lift = calculate_ab_lift(control_metric, treatment_metric)
        
        # Statistical test
        stat, pval, test_name = perform_statistical_test(
            control[column], treatment[column], 
            metric_name="Conversion" if column == 'converted' else "Other"
        )
        
        # Bootstrap confidence interval
        n_bootstrap = 1000
        bootstrap_lifts = []
        for _ in range(n_bootstrap):
            c_sample = control[column].sample(len(control), replace=True).mean()
            t_sample = treatment[column].sample(len(treatment), replace=True).mean()
            bootstrap_lifts.append(calculate_ab_lift(c_sample, t_sample))
        
        ci_lower = np.percentile(bootstrap_lifts, 2.5)
        ci_upper = np.percentile(bootstrap_lifts, 97.5)
        
        # Print results
        if 'Rate' in metric_name:
            print(f"\nControl:    {control_metric:.2%} (n={len(control):,})")
            print(f"Treatment:  {treatment_metric:.2%} (n={len(treatment):,})")
        else:
            print(f"\nControl:    {control_metric:.2f} ± {control_std:.2f} (n={len(control):,})")
            print(f"Treatment:  {treatment_metric:.2f} ± {treatment_std:.2f} (n={len(treatment):,})")
        
        print(f"\nLIFT: {lift:+.2f}%")
        print(f"95% CI: [{ci_lower:+.2f}%, {ci_upper:+.2f}%]")
        print(f"\n{test_name}:")
        print(f"  Test Statistic: {stat:.4f}")
        print(f"  P-value: {pval:.4f}")
        
        is_significant = pval < 0.05
        print(f"  Result: {'SIGNIFICANT' if is_significant else 'NOT SIGNIFICANT'} (α=0.05)")
        
        if is_significant:
            direction = "increase" if lift > 0 else "decrease"
            print(f"\n✓ The treatment caused a statistically significant {direction}")
            print(f"  of {abs(lift):.2f}% in {metric_name}")
        else:
            print(f"\n✗ No statistically significant difference detected")
        
        results.append({
            'Metric': metric_name,
            'Control': control_metric,
            'Treatment': treatment_metric,
            'Lift_%': lift,
            'CI_Lower_%': ci_lower,
            'CI_Upper_%': ci_upper,
            'P_value': pval,
            'Significant': is_significant
        })
    
    results_df = pd.DataFrame(results)
    
    # Create visualizations
    create_ab_test_visualizations(results_df, control, treatment)
    
    # Summary
    print("\n" + "=" * 80)
    print("SUMMARY TABLE - ALL METRICS")
    print("=" * 80)
    print(results_df.to_string(index=False))
    
    print_ab_test_interpretation()
    
    return results_df

In [None]:
def create_ab_test_visualizations(results_df, control, treatment):
    """Create A/B test visualizations"""
    fig, axes = plt.subplots(2, 2, figsize=(16, 12))
    
    # 1. Lift by Metric
    colors = ['green' if s else 'red' for s in results_df['Significant']]
    axes[0, 0].barh(results_df['Metric'], results_df['Lift_%'], 
                    color=colors, edgecolor='black', alpha=0.7)
    axes[0, 0].axvline(x=0, color='black', linestyle='-', linewidth=1)
    axes[0, 0].set_xlabel('Lift (%)', fontsize=12)
    axes[0, 0].set_title('A/B Test Lift by Metric', fontsize=14, fontweight='bold')
    axes[0, 0].grid(True, alpha=0.3, axis='x')
    
    for i, row in results_df.iterrows():
        axes[0, 0].text(row['Lift_%'], i, f"  {row['Lift_%']:+.2f}%  ", 
                        ha='left' if row['Lift_%'] > 0 else 'right', 
                        va='center', fontsize=10, fontweight='bold')
    
    # 2. Lift with Confidence Intervals
    axes[0, 1].barh(results_df['Metric'], results_df['Lift_%'], 
                    color=colors, edgecolor='black', alpha=0.7)
    
    for i, row in results_df.iterrows():
        axes[0, 1].plot([row['CI_Lower_%'], row['CI_Upper_%']], [i, i], 
                        'k-', linewidth=2, marker='|', markersize=10)
    
    axes[0, 1].axvline(x=0, color='black', linestyle='-', linewidth=1)
    axes[0, 1].set_xlabel('Lift (%) with 95% CI', fontsize=12)
    axes[0, 1].set_title('Lift with Confidence Intervals', fontsize=14, fontweight='bold')
    axes[0, 1].grid(True, alpha=0.3, axis='x')
    
    # 3. Conversion Rate Comparison
    conv_row = results_df[results_df['Metric'].str.contains('Conversion')].iloc[0]
    x = np.arange(2)
    values = [conv_row['Control'], conv_row['Treatment']]
    
    bars = axes[1, 0].bar(x, [v * 100 for v in values], 
                          color=['steelblue', 'coral'], edgecolor='black', alpha=0.7, width=0.5)
    axes[1, 0].set_ylabel('Conversion Rate (%)', fontsize=12)
    axes[1, 0].set_title('Conversion Rate: Control vs Treatment', fontsize=14, fontweight='bold')
    axes[1, 0].set_xticks(x)
    axes[1, 0].set_xticklabels(['Control', 'Treatment'])
    axes[1, 0].grid(True, alpha=0.3, axis='y')
    
    for bar, val in zip(bars, values):
        height = bar.get_height()
        axes[1, 0].text(bar.get_x() + bar.get_width()/2., height,
                       f'{val*100:.2f}%', ha='center', va='bottom', 
                       fontsize=12, fontweight='bold')
    
    # 4. Statistical Significance Summary
    sig_counts = results_df['Significant'].value_counts()
    colors_pie = ['green', 'red']
    labels = ['Significant', 'Not Significant']
    
    wedges, texts, autotexts = axes[1, 1].pie(
        [sig_counts.get(True, 0), sig_counts.get(False, 0)],
        labels=labels, colors=colors_pie, autopct='%1.0f%%',
        startangle=90, explode=[0.05, 0]
    )
    
    for autotext in autotexts:
        autotext.set_color('white')
        autotext.set_fontsize(12)
        autotext.set_fontweight('bold')
    
    axes[1, 1].set_title('Statistical Significance Summary', fontsize=14, fontweight='bold')
    
    plt.tight_layout()
    plt.savefig('ab_test_lift_analysis.png', dpi=300, bbox_inches='tight')
    plt.show()
    
    print("\n✓ Visualizations saved as 'ab_test_lift_analysis.png'")

In [None]:
def print_ab_test_interpretation():
    """Print interpretation guide for A/B testing"""
    print("\n" + "=" * 80)
    print("INTERPRETATION GUIDE - A/B TEST LIFT")
    print("=" * 80)
    print("\nWhat does Lift mean in A/B Testing?")
    print("- Lift shows the % change in a metric caused by the treatment")
    print("- Positive lift = Treatment improved the metric")
    print("- Negative lift = Treatment hurt the metric")
    print("- Lift near 0% = No meaningful difference")
    print("\nStatistical Significance:")
    print("- P-value < 0.05: A difference this large would be unlikely if the treatment had no effect")
    print("- P-value >= 0.05: The observed difference is consistent with chance variation")
    print("\nConfidence Intervals:")
    print("- If CI doesn't cross 0%, the effect is statistically significant")
    print("- Wider CI = More uncertainty in the estimate")
    print("\nBusiness Decision:")
    print("- Roll out treatment if: Positive lift AND p-value < 0.05")
    print("- Don't roll out if: Negative lift OR not significant")
    print("- Consider costs: Small lift may not justify implementation costs")
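
The interpretation rules above can be sketched in a few lines for a single conversion-rate metric. This is a minimal illustration, not the notebook's `perform_ab_test_analysis`: the function name `ab_lift_summary`, its arguments, and the example counts are all hypothetical, and it assumes a two-proportion z-test with a normal approximation. The confidence interval is built on the absolute difference and then divided by the control rate, which treats the control rate as fixed (a simplification).

```python
from math import sqrt
from statistics import NormalDist


def ab_lift_summary(conv_c, n_c, conv_t, n_t, alpha=0.05):
    """Relative lift, two-sided p-value, and CI for a conversion-rate A/B test.

    Hypothetical helper: conv_* are conversion counts, n_* are group sizes.
    """
    p_c = conv_c / n_c
    p_t = conv_t / n_t
    diff = p_t - p_c
    lift_pct = diff / p_c * 100  # (Treat - Ctrl) / Ctrl x 100%

    # Pooled standard error for the test of H0: both rates are equal
    p_pool = (conv_c + conv_t) / (n_c + n_t)
    se_pool = sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_t))
    z = diff / se_pool
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))

    # Unpooled standard error for the CI on the absolute difference
    se = sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    ci_low, ci_high = diff - z_crit * se, diff + z_crit * se

    return {
        "lift_pct": lift_pct,
        "p_value": p_value,
        # CI rescaled to relative lift, holding the control rate fixed
        "ci_lift_pct": (ci_low / p_c * 100, ci_high / p_c * 100),
        "significant": p_value < alpha,
    }


# Made-up counts: 12% control vs 14% treatment conversion, 10,000 users each
result = ab_lift_summary(conv_c=1200, n_c=10000, conv_t=1400, n_t=10000)
print(f"Lift: {result['lift_pct']:+.1f}%")
print(f"P-value: {result['p_value']:.5f}")
print(f"CI for lift: {result['ci_lift_pct'][0]:+.1f}% to {result['ci_lift_pct'][1]:+.1f}%")
print(f"Significant: {result['significant']}")
```

With these made-up counts the lift is about +16.7%, the interval runs from roughly +9% to +24%, and the result is significant at the 5% level. The interval excluding 0% and the p-value falling below 0.05 tell the same story, as the guide describes.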

### Run A/B Test Analysis

In [None]:
# Load data
ab_test_df = load_ab_test_data()

# Perform analysis
ab_results = perform_ab_test_analysis(ab_test_df)

---
# Summary: Comparing All Three Lift Types

## Quick Reference Table

In [None]:
def create_lift_comparison_summary():
    """Create comprehensive comparison of all three lift types"""
    
    print("\n" + "=" * 80)
    print("COMPREHENSIVE LIFT ANALYSIS SUMMARY")
    print("=" * 80)
    
    summary_data = {
        'Lift Type': [
            'Market Basket\n(Association)',
            'Predictive Model\n(Targeting)',
            'A/B Testing\n(Incremental)'
        ],
        'Question Answered': [
            'What products are\nbought together?',
            'How well can I\nidentify targets?',
            'Did my change\nimprove the metric?'
        ],
        'Formula': [
            'P(A∩B) /\n[P(A)×P(B)]',
            '% Targets /\n% Population',
            '(Treat - Ctrl) /\nCtrl × 100%'
        ],
        'Baseline': [
            'Independence\n(Lift = 1)',
            'Random\n(Lift = 1)',
            'Control\n(Lift = 0%)'
        ],
        'Use Case': [
            'Cross-selling,\nProduct placement',
            'Campaign targeting,\nChurn prevention',
            'Feature testing,\nUI/UX changes'
        ]
    }
    
    summary_df = pd.DataFrame(summary_data)
    print("\n")
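    # Embedded newlines are kept in the DataFrame but flattened here, since they break to_string() alignment
    print(summary_df.replace(r'\n', ' ', regex=True).to_string(index=False))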
    
    print("\n" + "=" * 80)
    print("KEY TAKEAWAYS")
    print("=" * 80)
    print("""
1. MARKET BASKET LIFT (Association Rules)
   - Answers: "What products are bought together?"
   - Action: Place related items near each other, create bundles
   - Success: Lift > 1 means positive association

2. PREDICTIVE MODEL LIFT (Targeting Efficiency)
   - Answers: "How well can I identify high-value customers?"
   - Action: Target top deciles for campaigns to maximize ROI
   - Success: High lift in top deciles = efficient targeting

3. A/B TEST LIFT (Incremental Impact)
   - Answers: "Did my change actually improve the metric?"
   - Action: Roll out treatment if lift is positive and significant
   - Success: Positive lift with p-value < 0.05
    """)
    
    # Create comparison visualization (hardcoded illustrative values, not results computed above)
    fig, axes = plt.subplots(1, 3, figsize=(18, 6))
    
    # 1. Market Basket
    rules_example = ['Milk→Bread', 'Beer→Diapers', 'Wine→Cheese', 'Coffee→Sugar']
    lifts_example = [2.1, 3.5, 2.8, 1.9]
    axes[0].barh(rules_example, lifts_example, color='steelblue', edgecolor='black', alpha=0.7)
    axes[0].axvline(x=1, color='red', linestyle='--', linewidth=2, label='Independence')
    axes[0].set_xlabel('Lift (Association Strength)', fontsize=11)
    axes[0].set_title('Market Basket Lift\n(Product Associations)', fontsize=13, fontweight='bold')
    axes[0].legend()
    axes[0].grid(True, alpha=0.3, axis='x')
    
    # 2. Predictive Model
    deciles = list(range(1, 11))
    model_lifts = [4.2, 3.1, 2.5, 2.0, 1.6, 1.3, 1.0, 0.8, 0.6, 0.4]
    axes[1].plot(deciles, model_lifts, marker='o', linewidth=2, markersize=8, 
                 color='darkgreen', label='Model')
    axes[1].axhline(y=1, color='red', linestyle='--', linewidth=2, label='Random')
    axes[1].set_xlabel('Decile (1 = Highest Risk)', fontsize=11)
    axes[1].set_ylabel('Lift', fontsize=11)
    axes[1].set_title('Predictive Model Lift\n(Targeting Efficiency)', fontsize=13, fontweight='bold')
    axes[1].legend()
    axes[1].grid(True, alpha=0.3)
    axes[1].set_xticks(deciles)
    
    # 3. A/B Test
    metrics_ab = ['Conversion\nRate', 'Time on\nPage', 'Pages\nViewed']
    lifts_ab = [16.7, 11.1, 8.3]
    colors = ['green', 'green', 'coral']
    axes[2].bar(metrics_ab, lifts_ab, color=colors, edgecolor='black', alpha=0.7)
    axes[2].axhline(y=0, color='black', linestyle='-', linewidth=1)
    axes[2].set_ylabel('Lift (%)', fontsize=11)
    axes[2].set_title('A/B Test Lift\n(Incremental Impact)', fontsize=13, fontweight='bold')
    axes[2].grid(True, alpha=0.3, axis='y')
    
    for i, lift in enumerate(lifts_ab):
        axes[2].text(i, lift + 1, f'{lift:+.1f}%', 
                    ha='center', va='bottom', fontsize=10, fontweight='bold')
    
    plt.tight_layout()
    plt.savefig('lift_types_comparison.png', dpi=300, bbox_inches='tight')
    plt.show()
    
    print("\n✓ Comparison visualization saved as 'lift_types_comparison.png'")
    
    return summary_df

# Create summary
summary_df = create_lift_comparison_summary()
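
To make the three formulas in the summary table concrete, here is a small worked example. All counts and rates are hypothetical and chosen only so the arithmetic is easy to follow. The targeting lift is read here as "share of all targets captured divided by share of population contacted", which is one common interpretation of `% Targets / % Population`.

```python
# Hypothetical numbers, used only to illustrate the three formulas.

# 1. Association lift: P(A and B) / (P(A) * P(B))
n_baskets = 1000
n_a, n_b, n_both = 200, 250, 100  # baskets with A, with B, with both
association_lift = (n_both / n_baskets) / ((n_a / n_baskets) * (n_b / n_baskets))

# 2. Targeting lift: % of targets captured / % of population contacted
total_targets, total_customers = 500, 10000
targets_in_top_decile, customers_in_top_decile = 150, 1000
targeting_lift = (targets_in_top_decile / total_targets) / (
    customers_in_top_decile / total_customers
)

# 3. A/B lift: (Treatment - Control) / Control x 100%
control_rate, treatment_rate = 0.12, 0.14
ab_lift_pct = (treatment_rate - control_rate) / control_rate * 100

print(f"Association lift: {association_lift:.2f}  (baseline 1.0)")
print(f"Targeting lift:   {targeting_lift:.2f}  (baseline 1.0)")
print(f"A/B lift:         {ab_lift_pct:+.1f}% (baseline 0%)")
```

The first two lifts are ratios with a baseline of 1 (here 2.00 and 3.00), while the A/B lift is a percentage change with a baseline of 0% (here +16.7%). Keep that difference in scale in mind when comparing lifts across analyses.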

---
## Recommendations and Next Steps

In [None]:
print("\n" + "=" * 80)
print("NEXT STEPS AND RECOMMENDATIONS")
print("=" * 80)
print("""
1. MARKET BASKET ANALYSIS
   → Implement: Use high-lift rules for product recommendations
   → Monitor: Track conversion rates on recommended bundles
   → Iterate: Update rules quarterly with fresh transaction data

2. PREDICTIVE MODELING
   → Implement: Target top 2-3 deciles for retention campaigns
   → Monitor: Track actual churn rate in targeted segments
   → Iterate: Retrain models monthly with new outcomes

3. A/B TESTING
   → Implement: Roll out significant positive lifts gradually
   → Monitor: Track metrics over extended period
   → Iterate: Run follow-up tests to optimize further

GENERAL BEST PRACTICES:
- Always compare lift to a baseline (random/control)
- Use statistical tests to validate findings
- Document assumptions and limitations
- Combine multiple lift analyses for holistic insights
- Consider business context and implementation costs
""")

print("\n" + "=" * 80)
print("FILES GENERATED")
print("=" * 80)
print("""
1. market_basket_lift_analysis.png - Association rules visualization
2. predictive_lift_*.png - Model targeting efficiency charts
3. ab_test_lift_analysis.png - A/B test results
4. lift_types_comparison.png - Side-by-side comparison
""")