In [1]:
from pathlib import Path  # needed by the fallback branch below

%store -r

print("Project configuration:")
print(f"SLUG = {SLUG}")
print(f"DATA_DIR = {DATA_DIR}")
print(f"DATASET_KEY = {DATASET_KEY}")
print(f"FIG_DIR = {FIG_DIR}")
print(f"REP_DIR = {REP_DIR}")
print(f"NOTEBOOK_DIR = {NOTEBOOK_DIR}")

missing_vars = [var for var in ['SLUG', 'DATA_DIR', 'FIG_DIR', 'REP_DIR', 'NOTEBOOK_DIR', 'DATASET_KEY'] if var not in globals()]
print(f"Vars not found in globals: {missing_vars}")

# Set default values if variables are not found in store or are empty
if not SLUG:  # Check if empty string
    print(f"{SLUG=} is empty, initializing everything explicitly")
    SLUG = 'customer-segmentation'
    DATASET_KEY = 'vjchoudhary7/customer-segmentation-tutorial-in-python'
    GIT_ROOT = Path.cwd().parent.parent
    DATA_DIR = GIT_ROOT / 'data' / SLUG
    FIG_DIR = GIT_ROOT / 'figures' / SLUG
    REP_DIR = GIT_ROOT / 'reports' / SLUG
    NOTEBOOK_DIR = GIT_ROOT / 'notebooks' / SLUG


Project configuration:
SLUG = customer-segmentation
DATA_DIR = /Users/ravisharma/workdir/eda_practice/data/customer-segmentation
DATASET_KEY = vjchoudhary7/customer-segmentation-tutorial-in-python
FIG_DIR = /Users/ravisharma/workdir/eda_practice/figures/customer-segmentation
REP_DIR = /Users/ravisharma/workdir/eda_practice/reports/customer-segmentation
NOTEBOOK_DIR = /Users/ravisharma/workdir/eda_practice/notebooks/customer-segmentation
Vars not found in globals: []


In [2]:
from pathlib import Path
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from IPython.display import display

In [3]:
# Load the data from the local CSV (base_df stays empty if the file is missing)

base_df = pd.DataFrame()

CSV_PATH = Path(DATA_DIR) / "Mall_Customers.csv"
if not CSV_PATH.exists():
    print(f"CSV {CSV_PATH} does not exist. base_df will remain empty.")
else:
    base_df = pd.read_csv(CSV_PATH)
    print(f"CSV {CSV_PATH} loaded successfully.")

base_df.head()

CSV /Users/ravisharma/workdir/eda_practice/data/customer-segmentation/Mall_Customers.csv loaded successfully.


|   | CustomerID | Gender | Age | Annual Income (k$) | Spending Score (1-100) |
|---|---|---|---|---|---|
| 0 | 1 | Male | 19 | 15 | 39 |
| 1 | 2 | Male | 21 | 15 | 81 |
| 2 | 3 | Female | 20 | 16 | 6 |
| 3 | 4 | Female | 23 | 16 | 77 |
| 4 | 5 | Female | 31 | 17 | 40 |


## ✅ **Plan Overview**

The plan covers **10 major categories** of bivariate analysis techniques, each with detailed explanations of:
- **What** each technique does
- **When** to use it
- **Why** it's important
- **Relevance** to your specific dataset (marked with ✅, ⚠️, or ❌)

## 🎯 **Key Features**

1. **Comprehensive Coverage**: Includes techniques even when they are not directly applicable to this dataset
2. **Relevance Indicators**: 
   - ✅ Highly relevant/applicable
   - ⚠️ Limited relevance or optional
   - ❌ Not applicable but included for reference

3. **Prioritized Approach**: Organized by importance for your customer segmentation problem

## 📊 **Most Relevant Techniques for Your Dataset**

**High Priority:**
- Correlation analysis between Age, Income, and Spending Score
- Scatter plots to identify customer clusters
- Gender-based comparisons using t-tests and box plots
- Joint distribution analysis
- Clustering tendency validation

**Medium Priority:**
- Regression modeling
- Outlier detection
- Effect size calculations
- Assumption testing

## 🔍 **Included for Learning/Reference**

The following are covered even though they are inapplicable or only optional for this dataset:
- **Time series analysis** (no temporal data)
- **Categorical vs categorical analysis** (only one categorical variable)
- **Advanced techniques** like mutual information
- **Power analysis** methods

Work through the high-priority techniques first, then the medium-priority ones as needed. The remaining material serves as a reference for future EDA projects.

# Comprehensive Bivariate Analysis Plan

## Dataset Overview
**Problem**: Customer Segmentation for Mall Customers
- **Variables**: CustomerID, Gender, Age, Annual Income (k$), Spending Score (1-100)
- **Data Types**: 
  - Categorical: Gender
  - Numerical: Age, Annual Income, Spending Score
  - Identifier: CustomerID (excluded from analysis)

## Bivariate Analysis Framework

### 1. Numerical vs Numerical Analysis

#### 1.1 Correlation Analysis ✅ *Highly Relevant*
- **Pearson Correlation Coefficient**: Linear relationships between continuous variables
  - Age vs Annual Income
  - Age vs Spending Score  
  - Annual Income vs Spending Score
- **Spearman Rank Correlation**: Monotonic relationships, whether linear or not
- **Kendall's Tau**: Alternative rank correlation for small samples
- **When to use**: Always start with correlation for numerical pairs
- **Interpretation** (rule of thumb, on the absolute value): |r| > 0.7 (strong), 0.3-0.7 (moderate), < 0.3 (weak)
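
As a minimal sketch of the three coefficients listed above, the following uses synthetic data rather than the mall dataset. The names `income` and `spending` and the way the data is generated are assumptions made purely for illustration.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in data (an assumption); swap in real columns such as
# base_df["Annual Income (k$)"] and base_df["Spending Score (1-100)"].
rng = np.random.default_rng(42)
income = rng.normal(60, 25, size=200)
spending = 0.5 * income + rng.normal(0, 10, size=200)

# Each call returns the coefficient and a two-sided p-value.
pearson_r, pearson_p = stats.pearsonr(income, spending)
spearman_rho, spearman_p = stats.spearmanr(income, spending)
kendall_tau, kendall_p = stats.kendalltau(income, spending)

print(f"Pearson r    = {pearson_r:.3f} (p = {pearson_p:.3g})")
print(f"Spearman rho = {spearman_rho:.3f} (p = {spearman_p:.3g})")
print(f"Kendall tau  = {kendall_tau:.3f} (p = {kendall_p:.3g})")
```

Comparing Pearson with the two rank-based coefficients is a quick way to spot relationships that are monotonic but not linear.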

#### 1.2 Scatter Plot Analysis ✅ *Highly Relevant*
- **Basic scatter plots** with trend lines
- **Bubble plots** (3rd variable as size/color)
- **Marginal plots** (histograms on axes)
- **When to use**: Visualize relationships, identify outliers, non-linear patterns
- **Look for**: Clusters, outliers, heteroscedasticity, non-linear patterns

#### 1.3 Regression Analysis ✅ *Relevant*
- **Simple Linear Regression**: Model relationships
- **Polynomial Regression**: Capture non-linear relationships
- **Residual Analysis**: Check assumptions
- **When to use**: When one variable predicts another
- **Metrics**: R², RMSE, residual patterns
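
The metrics above can be sketched with `scipy.stats.linregress`. The data here is synthetic (a known line plus noise), so the true slope and intercept are assumptions of the example, not findings about the dataset.

```python
import numpy as np
from scipy import stats

# Synthetic data: y = 2x + 1 plus unit-variance noise (illustrative only).
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2.0 * x + 1.0 + rng.normal(0, 1.0, size=100)

fit = stats.linregress(x, y)

# Residuals are the observed values minus the fitted line.
residuals = y - (fit.intercept + fit.slope * x)
r_squared = fit.rvalue ** 2
rmse = np.sqrt(np.mean(residuals ** 2))

print(f"slope = {fit.slope:.3f}, intercept = {fit.intercept:.3f}")
print(f"R^2 = {r_squared:.3f}, RMSE = {rmse:.3f}")
```

Plotting `residuals` against `x` is the usual next step: a visible curve or funnel shape suggests non-linearity or heteroscedasticity.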

#### 1.4 Joint Distribution Analysis ✅ *Relevant*
- **2D Histograms/Heatmaps**: Density of point clusters
- **Contour plots**: Probability density contours
- **Hexbin plots**: For large datasets
- **When to use**: Understand joint probability distributions

### 2. Categorical vs Numerical Analysis

#### 2.1 Group Comparison Tests ✅ *Highly Relevant*
- **Independent t-test**: Compare means between two groups (Gender vs Income/Spending/Age)
- **Welch's t-test**: When variances are unequal
- **Mann-Whitney U test**: Non-parametric alternative
- **When to use**: Compare numerical variable across categorical groups
- **Assumptions**: Normality, equal variances (for t-test)
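
A short sketch of the three tests, assuming two synthetic groups with unequal variances. The group names are hypothetical; with the real data they would be a numerical column split by `Gender`.

```python
import numpy as np
from scipy import stats

# Synthetic groups with equal means but different spreads (an assumption).
rng = np.random.default_rng(1)
group_a = rng.normal(50, 10, size=90)
group_b = rng.normal(50, 20, size=110)

# Student's t-test assumes equal variances; Welch's version does not.
student = stats.ttest_ind(group_a, group_b)
welch = stats.ttest_ind(group_a, group_b, equal_var=False)

# Mann-Whitney U compares the groups without assuming normality.
mwu = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")

print(f"Student t: t = {student.statistic:.3f}, p = {student.pvalue:.3f}")
print(f"Welch t:   t = {welch.statistic:.3f}, p = {welch.pvalue:.3f}")
print(f"Mann-Whitney U = {mwu.statistic:.1f}, p = {mwu.pvalue:.3f}")
```

When the variances differ noticeably, Welch's test is generally the safer default of the two t-tests.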

#### 2.2 Multiple Group Comparisons ⚠️ *Limited Relevance*
- **One-way ANOVA**: Compare means across multiple groups
- **Kruskal-Wallis test**: Non-parametric ANOVA
- **Post-hoc tests**: Tukey HSD, Bonferroni
- **When to use**: When categorical variable has >2 levels
- **Note**: Limited relevance as Gender only has 2 levels

#### 2.3 Visual Comparisons ✅ *Highly Relevant*
- **Box plots**: Distribution comparison by groups
- **Violin plots**: Density + box plot information
- **Strip plots**: Individual data points
- **Swarm plots**: Non-overlapping points
- **When to use**: Visualize distribution differences across groups

#### 2.4 Effect Size Measures ✅ *Relevant*
- **Cohen's d**: Standardized mean difference
- **Eta-squared (η²)**: Proportion of variance explained
- **When to use**: Quantify practical significance beyond statistical significance
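
Cohen's d has no dedicated SciPy function that I would rely on here, so the sketch below computes it directly from the pooled standard deviation. The helper name `cohens_d` and the toy inputs are assumptions for illustration.

```python
import numpy as np

def cohens_d(a, b):
    """Standardized mean difference using the pooled sample standard deviation."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n_a, n_b = len(a), len(b)
    # Pooled variance weights each group's sample variance by its degrees of freedom.
    pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

# Toy example: means differ by 1, both sample variances are 2.5.
d = cohens_d([1, 2, 3, 4, 5], [2, 3, 4, 5, 6])
print(f"Cohen's d = {d:.4f}")  # -1 / sqrt(2.5), about -0.6325
```

By Cohen's conventional thresholds for d (0.2 small, 0.5 medium, 0.8 large), this toy difference would count as medium.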

### 3. Categorical vs Categorical Analysis

#### 3.1 Contingency Table Analysis ❌ *Not Applicable*
- **Cross-tabulation**: Frequency tables
- **Chi-square test of independence**: Test association
- **Fisher's exact test**: Small sample alternative
- **When to use**: Two categorical variables
- **Note**: Only one categorical variable (Gender) in dataset

#### 3.2 Association Measures ❌ *Not Applicable*
- **Cramér's V**: Strength of association
- **Phi coefficient**: For 2x2 tables
- **Lambda**: Proportional reduction in error
- **When to use**: Measure strength of categorical associations
- **Note**: Need multiple categorical variables

### 4. Time Series Analysis ❌ *Not Applicable*
- **Autocorrelation Function (ACF)**: Serial correlation
- **Cross-correlation**: Relationship between time series
- **Lag plots**: Temporal relationships
- **Seasonal decomposition**: Trend, seasonal, residual components
- **When to use**: Time-indexed data
- **Note**: No temporal variables in dataset

### 5. Advanced Bivariate Techniques

#### 5.1 Non-parametric Methods ✅ *Relevant*
- **Kernel Density Estimation**: Smooth density estimates
- **Quantile-Quantile (Q-Q) plots**: Compare distributions
- **Empirical Cumulative Distribution**: Distribution comparison
- **When to use**: Non-normal data, distribution-free analysis

#### 5.2 Robust Statistics ✅ *Relevant*
- **Robust correlation**: Spearman, Kendall
- **Median-based tests**: Mood's median test
- **Trimmed means**: Reduce outlier influence
- **When to use**: Presence of outliers, non-normal data

#### 5.3 Information Theory Measures ⚠️ *Advanced/Optional*
- **Mutual Information**: Non-linear dependencies
- **Normalized Mutual Information**: Scaled version
- **When to use**: Capture complex, non-linear relationships
- **Note**: More advanced, may be overkill for this dataset

### 6. Assumption Testing

#### 6.1 Normality Tests ✅ *Important*
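- **Shapiro-Wilk test**: Well suited to small and moderate samples (SciPy notes that the p-value may be inaccurate for n > 5000)
- **Kolmogorov-Smirnov test**: Larger samples; use the Lilliefors variant when the mean and variance are estimated from the data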
- **Anderson-Darling test**: More sensitive to tails
- **Visual**: Q-Q plots, histograms
- **When to use**: Before parametric tests

#### 6.2 Homogeneity of Variance ✅ *Important*
- **Levene's test**: Equal variances across groups
- **Bartlett's test**: Assumes normality
- **Brown-Forsythe test**: Robust alternative
- **When to use**: Before ANOVA, t-tests
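
The normality and equal-variance checks from 6.1 and 6.2 can be sketched with SciPy as follows. The two groups are synthetic, with deliberately different spreads, so the outcome of the variance tests is built in by assumption.

```python
import numpy as np
from scipy import stats

# Synthetic groups: same mean, standard deviations of 10 and 25 (an assumption).
rng = np.random.default_rng(7)
group_a = rng.normal(50, 10, size=100)
group_b = rng.normal(50, 25, size=100)

# Normality check on one group.
shapiro_a = stats.shapiro(group_a)

# Levene's test centered on the mean; centering on the median gives Brown-Forsythe.
levene_mean = stats.levene(group_a, group_b, center="mean")
brown_forsythe = stats.levene(group_a, group_b, center="median")

# Bartlett's test is more powerful but sensitive to non-normality.
bartlett = stats.bartlett(group_a, group_b)

print(f"Shapiro-Wilk (group_a): p = {shapiro_a.pvalue:.3f}")
print(f"Levene (mean):          p = {levene_mean.pvalue:.3g}")
print(f"Brown-Forsythe:         p = {brown_forsythe.pvalue:.3g}")
print(f"Bartlett:               p = {bartlett.pvalue:.3g}")
```

A small p-value from the variance tests is a signal to prefer Welch's t-test over the equal-variance version.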

#### 6.3 Independence Tests ✅ *Important*
- **Durbin-Watson test**: Serial correlation
- **Runs test**: Randomness
- **When to use**: Verify independence assumption

### 7. Outlier Detection in Bivariate Context

#### 7.1 Bivariate Outlier Methods ✅ *Relevant*
- **Mahalanobis Distance**: Multivariate outliers
- **Cook's Distance**: Influential points in regression
- **Leverage plots**: High-leverage points
- **When to use**: Identify points affecting bivariate relationships
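
As a rough sketch of Mahalanobis-distance screening, the code below flags points whose squared distance exceeds a chi-square cutoff. The data is synthetic, one point is planted on purpose, and the 0.999 quantile is an arbitrary choice for this example. The cutoff relies on the squared distances being approximately chi-square distributed under multivariate normality.

```python
import numpy as np
from scipy import stats

# Synthetic, positively correlated 2-D data (an assumption for illustration).
rng = np.random.default_rng(3)
X = rng.multivariate_normal([50, 50], [[100, 60], [60, 100]], size=200)
X[0] = [20, 80]  # planted point that runs against the positive correlation

# Squared Mahalanobis distance of every row from the sample mean.
mean = X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
diff = X - mean
d_squared = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)

# Chi-square cutoff with degrees of freedom equal to the number of variables.
threshold = stats.chi2.ppf(0.999, df=X.shape[1])
outliers = np.flatnonzero(d_squared > threshold)
print(f"threshold = {threshold:.2f}, flagged rows = {outliers.tolist()}")
```

The planted point is unremarkable on each axis alone, which is why a bivariate distance catches it when univariate checks would not.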

### 8. Clustering Tendency Analysis

#### 8.1 Cluster Validation ✅ *Highly Relevant*
- **Hopkins statistic**: Clustering tendency
- **Gap statistic**: Optimal cluster number
- **Silhouette analysis**: Cluster quality
- **When to use**: Before customer segmentation
- **Note**: Directly relevant to customer segmentation problem
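
The Hopkins statistic is not part of SciPy or scikit-learn as far as I know, so here is one possible implementation. Conventions differ between sources: in this version values near 0.5 suggest spatially random data and values near 1 suggest clustering, while some references report the complement. The function name, the sample size `m`, and both test datasets are assumptions.

```python
import numpy as np
from scipy.spatial import cKDTree

def hopkins_statistic(X, m=50, seed=0):
    """Hopkins statistic: about 0.5 for uniform data, close to 1 for clustered data."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    tree = cKDTree(X)

    # w: distance from m sampled real points to their nearest *other* real point
    # (k=2 because the nearest neighbour of a data point is the point itself).
    sample_idx = rng.choice(n, size=m, replace=False)
    w = tree.query(X[sample_idx], k=2)[0][:, 1]

    # u: distance from m uniform points in the bounding box to the nearest real point.
    uniform_pts = rng.uniform(X.min(axis=0), X.max(axis=0), size=(m, d))
    u = tree.query(uniform_pts, k=1)[0]

    return (u ** d).sum() / ((u ** d).sum() + (w ** d).sum())

rng = np.random.default_rng(0)
clustered = np.vstack([
    rng.normal(0.2, 0.03, size=(100, 2)),
    rng.normal(0.8, 0.03, size=(100, 2)),
])
uniform = rng.uniform(0, 1, size=(200, 2))

h_clustered = hopkins_statistic(clustered)
h_uniform = hopkins_statistic(uniform)
print(f"clustered: {h_clustered:.3f}, uniform: {h_uniform:.3f}")
```

Features on very different scales (for example age in years and income in k$) should be standardized first, since the statistic depends on Euclidean distances.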

### 9. Specialized Visualizations

#### 9.1 Advanced Plots ✅ *Relevant*
- **Pair plots**: All variable combinations
- **Correlation heatmaps**: Matrix visualization
- **Andrews curves**: Multivariate visualization
- **Parallel coordinates**: High-dimensional relationships
- **When to use**: Comprehensive relationship overview

### 10. Statistical Power and Sample Size

#### 10.1 Power Analysis ⚠️ *Optional*
- **Post-hoc power**: Achieved power
- **Effect size estimation**: Practical significance
- **When to use**: Interpret non-significant results
- **Note**: With n=200, likely adequate power for most tests

## Analysis Priority for Customer Segmentation

### High Priority ✅
1. Correlation analysis (all numerical pairs)
2. Scatter plots with clustering overlay
3. Gender-based group comparisons
4. Box plots by gender
5. Joint distribution analysis
6. Clustering tendency tests

### Medium Priority ⚠️
1. Regression analysis
2. Robust correlation methods
3. Outlier detection
4. Effect size calculations
5. Assumption testing

### Low Priority/Reference ❌
1. Time series methods
2. Multiple categorical analysis
3. Advanced information theory
4. Power analysis

## Expected Insights
- **Income vs Spending**: Potential customer segments
- **Age patterns**: Life-stage based behavior
- **Gender differences**: Shopping behavior variations
- **Clustering structure**: Natural customer groups for segmentation


# Overview of Missing/Inapplicable Techniques

## ✅ **Comprehensive Coverage**

### **1. Categorical vs Categorical Analysis**
- **Cross-tabulation methods** with formulas and Python implementations
- **Chi-square tests** including Fisher's exact and McNemar's tests
- **Association measures** like Cramér's V, Phi coefficient, Lambda
- **Ordinal analysis** with Gamma, Kendall's Tau variants, Somers' D

### **2. Time Series Bivariate Analysis**
- **Cross-correlation functions** with lag analysis
- **Cointegration testing** (Engle-Granger, Johansen)
- **Granger causality** and VAR models
- **Spectral analysis** and wavelet methods

### **3. Advanced Multivariate Techniques**
- **Information theory measures** (Mutual Information, Transfer Entropy)
- **Distance-based methods** (Distance correlation, MIC)
- **Copula analysis** for dependence modeling

### **4. Robust and Non-parametric Methods**
- **Robust correlations** (Winsorized, Biweight, Percentage Bend)
- **Detailed rank methods** (Spearman, Kendall variants)
- **Distribution-free tests** (K-S, Anderson-Darling, Energy statistics)

### **5. Specialized Domain Applications**
- **Survival analysis** (Log-rank test, Cox regression)
- **Spatial analysis** (Moran's I, Cross-variogram)
- **Network analysis** methods

### **6. Power Analysis and Sample Size**
- **Effect size calculations** with Cohen's guidelines
- **Sample size formulas** for different test types
- **Multiple comparison corrections** (Bonferroni, FDR)

## 🎯 **Key Features**

- **Formulas included** for mathematical understanding
- **Python implementations** specified where available
- **When to use** guidance for each technique
- **Assumptions and limitations** clearly stated
- **Interpretation guidelines** provided

This reference guide is intended for datasets with time series data, multiple categorical variables, or domain-specific structure. Together with the plan above, it gives both a practical roadmap for the current customer segmentation project and a reference for future EDA work.

# Detailed Reference Guide for Less Applicable Techniques

*This section provides comprehensive details for bivariate analysis techniques that are not directly applicable to the current customer segmentation dataset but are important for other types of data analysis projects.*

---

## 1. Categorical vs Categorical Analysis (Detailed Reference)

### 1.1 Contingency Table Analysis
**When applicable**: Two or more categorical variables

#### Cross-tabulation (Crosstabs)
- **Purpose**: Display frequency distribution of variables
- **Output**: 2x2, 2xN, or NxM frequency tables
- **Calculations**:
  - Observed frequencies
  - Expected frequencies: `(row_total × column_total) / grand_total`
  - Row percentages, column percentages, total percentages
- **Python**: `pd.crosstab()`, `pd.pivot_table()`

#### Chi-Square Test of Independence
- **Null Hypothesis**: Variables are independent
- **Test Statistic**: `χ² = Σ[(Observed - Expected)² / Expected]`
- **Degrees of Freedom**: `(rows-1) × (columns-1)`
- **Assumptions**:
  - Expected frequency ≥ 5 in at least 80% of cells
  - No expected frequency < 1
- **Python**: `scipy.stats.chi2_contingency()`
- **Interpretation**: p < 0.05 suggests association exists
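
A minimal sketch of the cross-tabulation and chi-square workflow. The mall dataset has only one categorical column, so the `member` column and all counts below are invented purely for illustration.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical data: 30 "M" rows and 70 "F" rows with a made-up membership flag.
df = pd.DataFrame({
    "gender": ["M"] * 30 + ["F"] * 70,
    "member": ["yes"] * 10 + ["no"] * 20 + ["yes"] * 30 + ["no"] * 40,
})

# Observed frequencies; crosstab sorts the labels (rows F, M; columns no, yes).
table = pd.crosstab(df["gender"], df["member"])

# correction=False disables Yates' continuity correction for 2x2 tables,
# so the statistic matches the plain formula given above.
chi2, p_value, dof, expected = chi2_contingency(table, correction=False)

# Expected frequencies by hand: (row_total x column_total) / grand_total.
observed = table.to_numpy()
expected_manual = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()

print(table)
print(f"chi2 = {chi2:.4f}, dof = {dof}, p = {p_value:.4f}")
```

Inspecting `expected` before trusting the p-value is a direct way to check the minimum expected frequency assumptions listed above.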

#### Fisher's Exact Test
- **When to use**: Small sample sizes, 2x2 tables
- **Advantage**: Exact p-values, no minimum expected frequency requirement
- **Limitation**: Computationally intensive for large tables
- **Python**: `scipy.stats.fisher_exact()`

#### McNemar's Test
- **Purpose**: Paired categorical data (before/after comparisons)
- **Structure**: 2x2 table with matched pairs
- **Test Statistic**: `χ² = (b-c)² / (b+c)` where b,c are off-diagonal cells
- **Python**: `statsmodels.stats.contingency_tables.mcnemar()`
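
The test statistic above is simple enough to compute directly. This sketch implements the uncorrected chi-square version of the formula; the function name and the counts `b=15`, `c=5` are assumptions. Results from a library routine may differ if it applies a continuity correction or an exact binomial test.

```python
from scipy.stats import chi2

def mcnemar_chi2(b, c):
    """McNemar statistic (b - c)^2 / (b + c) with a chi-square(1) p-value."""
    statistic = (b - c) ** 2 / (b + c)
    p_value = chi2.sf(statistic, df=1)  # upper-tail probability
    return statistic, p_value

# b and c are the discordant (off-diagonal) pair counts of the 2x2 table.
statistic, p_value = mcnemar_chi2(b=15, c=5)
print(f"chi2 = {statistic:.2f}, p = {p_value:.4f}")
```

Only the discordant pairs enter the statistic; pairs that agree before and after carry no information about a change.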

### 1.2 Association Measures

#### Cramér's V
- **Formula**: `V = √(χ² / (n × min(k-1, r-1)))`
- **Range**: 0 (no association) to 1 (perfect association)
- **Advantage**: Standardized, comparable across different table sizes
- **Interpretation**: 0.1 (small), 0.3 (medium), 0.5 (large)
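
The formula for V can be sketched as a small helper on top of `scipy.stats.chi2_contingency`. The function name `cramers_v` and the example tables are assumptions; no bias correction is applied.

```python
import numpy as np
from scipy.stats import chi2_contingency

def cramers_v(table):
    """Cramér's V from a contingency table of observed counts."""
    table = np.asarray(table)
    # Uncorrected chi-square, so that V matches the formula above.
    chi2 = chi2_contingency(table, correction=False)[0]
    n = table.sum()
    r, k = table.shape
    return np.sqrt(chi2 / (n * min(k - 1, r - 1)))

v_weak = cramers_v([[10, 20], [30, 40]])
v_perfect = cramers_v([[50, 0], [0, 50]])
print(f"weak association:    V = {v_weak:.4f}")
print(f"perfect association: V = {v_perfect:.4f}")
```

For a 2x2 table `min(k-1, r-1)` equals 1, so this value coincides with the phi coefficient.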

#### Phi Coefficient (φ)
- **For**: 2x2 tables only
- **Formula**: `φ = √(χ² / n)`
- **Range**: 0 to 1
- **Relationship**: φ = Cramér's V for 2x2 tables

#### Lambda (λ)
- **Concept**: Proportional reduction in error
- **Formula**: `λ = (E₁ - E₂) / E₁`
- **Range**: 0 (no improvement) to 1 (perfect prediction)
- **Types**: Symmetric lambda, asymmetric lambda

#### Goodman and Kruskal's Tau
- **Purpose**: Measure predictive association
- **Advantage**: Asymmetric measure for nominal variables, based on proportional reduction in prediction error
- **Range**: 0 to 1

#### Uncertainty Coefficient (Theil's U)
- **Based on**: Information theory
- **Formula**: Uses entropy calculations
- **Advantage**: Handles nominal variables well

### 1.3 Ordinal Categorical Analysis

#### Gamma (γ)
- **For**: Two ordinal variables
- **Formula**: `γ = (Concordant - Discordant) / (Concordant + Discordant)`
- **Range**: -1 to +1
- **Interpretation**: Direction and strength of monotonic association
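
Gamma can be computed by counting concordant and discordant pairs directly. The sketch below is a simple O(n²) version suited to small samples; the function name and the ordinal scores are made up for illustration.

```python
from itertools import combinations

def goodman_kruskal_gamma(x, y):
    """Gamma = (C - D) / (C + D); pairs tied on either variable are ignored."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        product = (x1 - x2) * (y1 - y2)
        if product > 0:
            concordant += 1   # both variables move in the same direction
        elif product < 0:
            discordant += 1   # the variables move in opposite directions
    return (concordant - discordant) / (concordant + discordant)

# Hypothetical ordinal scores on a 1-3 scale.
satisfaction = [1, 1, 2, 2, 3, 3]
loyalty = [1, 2, 2, 3, 2, 3]
gamma = goodman_kruskal_gamma(satisfaction, loyalty)
print(f"gamma = {gamma:.2f}")  # 7 concordant, 1 discordant -> 0.75
```

Because tied pairs are dropped entirely, gamma tends to be larger in magnitude than Kendall's tau-b on the same data.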

#### Kendall's Tau-b and Tau-c
- **Tau-b**: For square tables
- **Tau-c**: For rectangular tables
- **Advantage**: Handles ties better than Gamma

#### Somers' D
- **Asymmetric**: Distinguishes dependent/independent variable
- **Formula**: Based on concordant/discordant pairs
- **Range**: -1 to +1

---

## 2. Time Series Bivariate Analysis (Detailed Reference)

### 2.1 Cross-Correlation Analysis
**When applicable**: Two time series variables

#### Cross-Correlation Function (CCF)
- **Purpose**: Measure correlation at different lags
- **Formula**: `CCF(k) = (1/n) Σ[(X_t - μ_X)(Y_{t+k} - μ_Y)] / (σ_X × σ_Y)`
- **Output**: Correlation coefficients for lags k = ..., -2, -1, 0, 1, 2, ...
- **Interpretation**:
  - Positive lag: X leads Y
  - Negative lag: Y leads X
  - Zero lag: Contemporaneous correlation
- **Python**: `statsmodels.tsa.stattools.ccf()`
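
To make the lag convention concrete, here is a NumPy-only sketch that correlates `x[t]` with `y[t + lag]` and recovers a known delay. The series are synthetic, and the function is an illustration rather than a substitute for a library routine, whose lag convention should be checked separately.

```python
import numpy as np

def cross_correlation(x, y, lag):
    """Sample correlation between x[t] and y[t + lag], for lag >= 0."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    xc = x - x.mean()
    yc = y - y.mean()
    # Normalize by n and the two standard deviations so values lie in [-1, 1].
    return np.sum(xc[: n - lag] * yc[lag:]) / (n * x.std() * y.std())

# Build y as a copy of x delayed by 3 steps: y[t + 3] == x[t].
rng = np.random.default_rng(5)
signal = rng.normal(size=503)
x = signal[3:]
y = signal[:-3]

ccf_values = [cross_correlation(x, y, k) for k in range(10)]
best_lag = int(np.argmax(ccf_values))
print(f"best lag = {best_lag}, CCF = {ccf_values[best_lag]:.3f}")
```

The peak at a positive lag matches the interpretation above: `x` leads `y`.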

#### Lead-Lag Analysis
- **Maximum CCF**: Identifies optimal lag relationship
- **Applications**: Economic indicators, stock prices, sensor data
- **Considerations**: Spurious correlations in trending data

### 2.2 Cointegration Analysis

#### Engle-Granger Test
- **Purpose**: Test for long-run equilibrium relationship
- **Steps**:
  1. Test each series for unit root (ADF test)
  2. Estimate cointegrating regression: `Y_t = α + βX_t + ε_t`
  3. Test residuals for stationarity
- **Python**: `statsmodels.tsa.stattools.coint()`

#### Johansen Test
- **Advantage**: Multiple cointegrating relationships
- **Output**: Trace statistic, maximum eigenvalue statistic
- **Applications**: Portfolio analysis, economic modeling

### 2.3 Granger Causality

#### Granger Causality Test
- **Concept**: X Granger-causes Y if past values of X improve prediction of Y
- **Method**: Compare restricted vs unrestricted VAR models
- **Test Statistic**: F-test on lagged coefficients
- **Python**: `statsmodels.tsa.stattools.grangercausalitytests()`
- **Limitation**: Statistical causality ≠ true causality

### 2.4 Vector Autoregression (VAR)

#### VAR Model
- **Structure**: Each variable regressed on lags of all variables
- **Equation**: `Y_t = A₁Y_{t-1} + A₂Y_{t-2} + ... + A_pY_{t-p} + ε_t`
- **Applications**: Forecasting, impulse response analysis
- **Python**: `statsmodels.tsa.vector_ar.var_model.VAR()`

#### Impulse Response Functions
- **Purpose**: Trace effect of shock in one variable on others
- **Output**: Response over time horizons
- **Confidence Intervals**: Bootstrap or analytical methods

### 2.5 Spectral Analysis

#### Cross-Spectral Density
- **Purpose**: Frequency domain relationship analysis
- **Components**: Co-spectrum, quadrature spectrum
- **Coherence**: Frequency-specific correlation
- **Phase Spectrum**: Lead-lag relationships by frequency

#### Wavelet Cross-Correlation
- **Advantage**: Time-frequency analysis
- **Applications**: Non-stationary relationships
- **Output**: Correlation varying over time and frequency

---

## 3. Advanced Multivariate Techniques (Detailed Reference)

### 3.1 Information Theory Measures

#### Mutual Information (MI)
- **Formula**: `MI(X,Y) = ΣΣ p(x,y) log(p(x,y) / (p(x)p(y)))`
- **Range**: 0 (independent) to ∞
- **Advantage**: Captures non-linear relationships
- **Python**: `sklearn.feature_selection.mutual_info_regression()`
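
For discrete variables the MI formula can be evaluated directly from a joint probability table. This sketch uses natural logarithms, so the result is in nats; the function name and the two toy tables are assumptions.

```python
import numpy as np

def mutual_information(joint):
    """MI (in nats) of two discrete variables from their joint count or probability table."""
    joint = np.asarray(joint, dtype=float)
    joint = joint / joint.sum()                  # normalize to probabilities
    px = joint.sum(axis=1, keepdims=True)        # marginal of X, shape (rows, 1)
    py = joint.sum(axis=0, keepdims=True)        # marginal of Y, shape (1, cols)
    nonzero = joint > 0                          # treat 0 * log(0) as 0
    return float(np.sum(joint[nonzero] * np.log(joint[nonzero] / (px @ py)[nonzero])))

mi_independent = mutual_information([[0.25, 0.25], [0.25, 0.25]])
mi_dependent = mutual_information([[0.5, 0.0], [0.0, 0.5]])
print(f"independent: {mi_independent:.4f} nats")
print(f"fully dependent: {mi_dependent:.4f} nats")  # ln(2)
```

Continuous variables need binning or a nearest-neighbour estimator first, which is what the scikit-learn function handles.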

#### Normalized Mutual Information
- **Formula**: `NMI = MI(X,Y) / √(H(X)H(Y))`
- **Range**: 0 to 1
- **Advantage**: Standardized for comparison

#### Transfer Entropy
- **Purpose**: Directional information transfer
- **Formula**: Based on conditional mutual information
- **Applications**: Causality detection in complex systems

### 3.2 Distance-Based Methods

#### Distance Correlation
- **Advantage**: Detects all types of dependence
- **Range**: 0 (independent) to 1 (dependent)
- **Test**: Permutation-based significance testing
- **Python**: `dcor` package

#### Maximal Information Coefficient (MIC)
- **Purpose**: Measure strength of relationship
- **Range**: 0 to 1
- **Advantage**: Equitability property
- **Python**: `minepy` package

### 3.3 Copula Analysis

#### Copula Functions
- **Purpose**: Model dependence structure separately from marginals
- **Types**: Gaussian, t-copula, Archimedean copulas
- **Applications**: Risk management, finance
- **Python**: `copulas` package

#### Kendall's Tau from Copula
- **Relationship**: `τ = 4∫∫ C(u,v) dC(u,v) - 1`
- **Advantage**: Distribution-free dependence measure

---

## 4. Robust and Non-parametric Methods (Detailed Reference)

### 4.1 Robust Correlation Methods

#### Winsorized Correlation
- **Method**: Replace extreme values with percentile values
- **Typical**: 5th and 95th percentiles
- **Advantage**: Reduces outlier influence

#### Biweight Midcorrelation
- **Advantage**: Robust to outliers, efficient
- **Formula**: Uses biweight estimates of covariance and variance
- **Python**: `astropy.stats.biweight_midcorrelation()`

#### Percentage Bend Correlation
- **Method**: Based on percentage bend estimators
- **Parameter**: β (bending constant)
- **Advantage**: Good breakdown point

### 4.2 Rank-Based Methods

#### Spearman's Rank Correlation (Detailed)
- **Formula**: `ρ = 1 - (6Σd²) / (n(n²-1))`
- **Tied Ranks**: Correction formula for ties
- **Assumptions**: Monotonic relationship
- **Significance Test**: t-test or exact distribution
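
A small worked example of the rank-difference formula, checked against SciPy. The five data points are made up and contain no ties, which is the case where the shortcut formula is exact.

```python
import numpy as np
from scipy import stats

x = np.array([10, 20, 30, 40, 50])
y = np.array([1, 3, 2, 5, 4])

# d is the difference between the ranks of each paired observation.
d = stats.rankdata(x) - stats.rankdata(y)
n = len(x)
rho_formula = 1 - 6 * np.sum(d ** 2) / (n * (n ** 2 - 1))

rho_scipy, p_value = stats.spearmanr(x, y)
print(f"formula: {rho_formula:.3f}, scipy: {rho_scipy:.3f}")  # both 0.800
```

With ties the two values can diverge, because SciPy computes Pearson's correlation on the ranks instead of using the shortcut.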

#### Kendall's Tau (Detailed)
- **Tau-a**: `τ_a = (C - D) / (n(n-1)/2)`
- **Tau-b**: Adjusts for ties in both variables
- **Tau-c**: For rectangular tables
- **Advantage**: Better for small samples

### 4.3 Distribution-Free Tests

#### Kolmogorov-Smirnov Two-Sample Test
- **Purpose**: Compare two distributions
- **Test Statistic**: `D = max|F₁(x) - F₂(x)|`
- **Advantage**: Sensitive to any difference in distributions
- **Python**: `scipy.stats.ks_2samp()`

#### Anderson-Darling Two-Sample Test
- **Advantage**: More sensitive to tail differences
- **Test Statistic**: Weighted squared distance between the empirical distribution functions, with extra weight in the tails
- **Python**: `scipy.stats.anderson_ksamp()`

#### Energy Statistics
- **E-statistic**: Distance-based test for equal distributions
- **Advantage**: Consistent against all alternatives
- **Applications**: High-dimensional data

---

## 5. Specialized Domain Applications

### 5.1 Survival Analysis Bivariate Methods

#### Log-Rank Test
- **Purpose**: Compare survival curves between groups
- **Assumption**: Proportional hazards
- **Python**: `lifelines.statistics.logrank_test()`

#### Cox Proportional Hazards
- **Bivariate**: Include interaction terms
- **Hazard Ratio**: Measure of relative risk
- **Python**: `lifelines.CoxPHFitter()`

### 5.2 Spatial Analysis

#### Spatial Autocorrelation
- **Moran's I**: Global spatial autocorrelation
- **Local Indicators**: LISA statistics
- **Applications**: Geographic data analysis

#### Cross-Variogram
- **Purpose**: Spatial correlation between two variables
- **Applications**: Geostatistics, environmental modeling

### 5.3 Network Analysis

#### Network Correlation
- **Purpose**: Correlation in network-structured data
- **Methods**: QAP correlation, network autocorrelation
- **Applications**: Social networks, biological networks

---

## 6. Power Analysis and Sample Size (Detailed Reference)

### 6.1 Power Analysis Components

#### Effect Size Measures
- **Correlation**: r itself is the effect size
- **Cohen's Guidelines**: 0.1 (small), 0.3 (medium), 0.5 (large)
- **t-test**: Cohen's d = (μ₁ - μ₂) / σ_pooled
- **ANOVA**: η² = SS_between / SS_total

#### Power Calculation
- **Formula**: Function of α, effect size, sample size
- **Software**: G*Power, Python `statsmodels.stats.power`
- **Types**: 
  - A priori: Determine required sample size
  - Post hoc: Calculate achieved power
  - Sensitivity: Determine detectable effect size

### 6.2 Sample Size Determination

#### Correlation Analysis
- **Formula**: `n = (Z_α/2 + Z_β)² / (0.5 × ln((1+r)/(1-r)))² + 3`
- **Fisher's Z-transformation**: Stabilizes variance
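- **Python**: Evaluate the formula directly with `scipy.stats.norm.ppf()`; `statsmodels.stats.power.ttest_power()` covers t-tests rather than correlations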
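
The sample size formula above can be evaluated with a few lines of SciPy. This is a sketch of the Fisher z approximation for a two-sided test; the function name and the defaults of α = 0.05 and 80% power are assumptions.

```python
import math
from scipy.stats import norm

def n_for_correlation(r, alpha=0.05, power=0.80):
    """Approximate sample size to detect correlation r (two-sided, Fisher z)."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    fisher_z = 0.5 * math.log((1 + r) / (1 - r))
    return math.ceil(((z_alpha + z_beta) / fisher_z) ** 2 + 3)

n_needed = n_for_correlation(0.3)
print(f"n for r = 0.3: {n_needed}")  # 85
```

By this approximation, a sample of 200 is comfortably enough to detect a medium correlation of 0.3.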

#### Two-Sample t-test
- **Formula**: `n = 2σ²(Z_α/2 + Z_β)² / δ²` per group
- **Equal vs Unequal**: Different formulas for equal/unequal group sizes
- **Welch's t-test**: Adjustment for unequal variances
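
The two-sample formula can be sketched in terms of the standardized effect size d = δ/σ. This uses the normal approximation, which slightly underestimates the size given by an exact t-based calculation; the function name and defaults are assumptions.

```python
import math
from scipy.stats import norm

def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate per-group n for a two-sided, equal-size two-sample t-test."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    # n = 2 * sigma^2 * (z_alpha + z_beta)^2 / delta^2, with d = delta / sigma.
    return math.ceil(2 * (z_alpha + z_beta) ** 2 / effect_size ** 2)

n_medium = n_per_group(0.5)
print(f"n per group for d = 0.5: {n_medium}")  # 63
```

Halving the effect size roughly quadruples the required sample, since n scales with 1/d².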

#### Chi-Square Test
- **Effect Size**: w = √(χ²/n)
- **Sample Size**: Function of degrees of freedom and effect size
- **Minimum Expected Frequency**: Rule of thumb ≥ 5

### 6.3 Multiple Comparisons

#### Bonferroni Correction
- **Adjusted α**: α_adj = α / m (m = number of tests)
- **Conservative**: Controls family-wise error rate
- **Power Loss**: Reduced power with many comparisons

#### False Discovery Rate (FDR)
- **Benjamini-Hochberg**: Less conservative than Bonferroni
- **Q-value**: FDR-adjusted p-value
- **Python**: `statsmodels.stats.multitest.multipletests()`
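
To show how the two corrections differ, here is a NumPy-only sketch of both decision rules. The function names and the five p-values are invented for illustration; in practice the statsmodels function above is the more convenient route.

```python
import numpy as np

def bonferroni_reject(p_values, alpha=0.05):
    """Reject when p <= alpha / m."""
    p = np.asarray(p_values, dtype=float)
    return p <= alpha / len(p)

def benjamini_hochberg_reject(p_values, alpha=0.05):
    """Reject the k smallest p-values, where k is the largest rank with p_(k) <= alpha * k / m."""
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)
    thresholds = alpha * np.arange(1, m + 1) / m
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.flatnonzero(below))   # index of the largest passing rank
        reject[order[: k + 1]] = True
    return reject

p_values = [0.001, 0.008, 0.012, 0.041, 0.20]
bonf = bonferroni_reject(p_values)
bh = benjamini_hochberg_reject(p_values)
print(f"Bonferroni rejects {bonf.sum()}, Benjamini-Hochberg rejects {bh.sum()}")
```

On these values Bonferroni rejects two hypotheses and Benjamini-Hochberg three, which illustrates the power gained by controlling the FDR instead of the family-wise error rate.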

---

*This reference guide provides detailed information for techniques not directly applicable to the current customer segmentation dataset but essential for comprehensive bivariate analysis across different data types and research contexts.*
