# In this notebook we study two power transformations: Box-Cox and Yeo-Johnson

# Understanding Box-Cox and Yeo-Johnson Transformations

## Introduction

Data transformation is a crucial preprocessing step in statistical modeling and machine learning. Two of the most popular power transformations for making data more nearly normal and stabilizing variance are the **Box-Cox transformation** and the **Yeo-Johnson transformation**. Both make data more suitable for linear regression and other statistical techniques that assume approximately normal errors.

## Box-Cox Transformation

The Box-Cox transformation, introduced by George Box and David Cox in 1964, is a family of power transformations designed to stabilize variance and make data more closely conform to a normal distribution.

### Formula

The Box-Cox transformation is defined as:

$$
y(\lambda) = \begin{cases}
\frac{y^\lambda - 1}{\lambda} & \text{if } \lambda \neq 0 \\
\ln(y) & \text{if } \lambda = 0
\end{cases}
$$

where:
- $y$ is the response variable (must be positive)
- $\lambda$ is the transformation parameter
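
The formula above can be sketched directly in NumPy and checked against SciPy. The helper name `box_cox` and the sample values are illustrative assumptions, not part of the original notebook; `scipy.stats.boxcox` with a fixed `lmbda` serves as the reference.

```python
import numpy as np
from scipy import stats


def box_cox(y, lam):
    """Apply the Box-Cox formula elementwise for a fixed lambda (hypothetical helper)."""
    y = np.asarray(y, dtype=float)
    if np.any(y <= 0):
        raise ValueError("Box-Cox requires strictly positive data")
    if lam == 0:
        return np.log(y)           # the lambda = 0 branch
    return (y**lam - 1.0) / lam    # the lambda != 0 branch


y = np.array([0.5, 1.0, 2.0, 4.0, 8.0])  # illustrative positive values
manual = box_cox(y, 0.5)
reference = stats.boxcox(y, lmbda=0.5)   # with lmbda given, SciPy returns only the array

print("manual   :", manual)
print("reference:", reference)
print("match    :", np.allclose(manual, reference))
```

Note that $y = 1$ maps to $0$ for every $\lambda$, which is one reason the formula subtracts 1 in the numerator.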

### Key Characteristics

- **Limitation**: Only works with strictly positive data ($y > 0$)
- **Parameter**: The optimal $\lambda$ is typically found through maximum likelihood estimation
- **Special cases** (each up to a constant shift and scale):
  - $\lambda = 1$: No change of shape (the result is $y - 1$)
  - $\lambda = 0$: Natural log transformation
  - $\lambda = 0.5$: Square root transformation
  - $\lambda = -1$: Inverse (reciprocal) transformation
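
As a rough illustration of the characteristics listed above, the sketch below estimates $\lambda$ by maximum likelihood with `scipy.stats.boxcox` and evaluates a few fixed values of $\lambda$. The log-normal sample is synthetic and chosen only because its ideal $\lambda$ is 0; it is an assumption for demonstration, not data from this notebook.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic, strictly positive, right-skewed sample (assumed for illustration)
y = rng.lognormal(mean=0.0, sigma=0.8, size=2000)

# With lmbda left at its default (None), SciPy estimates lambda by
# maximizing the log-likelihood and returns (transformed, lambda)
y_bc, lam_hat = stats.boxcox(y)
print(f"estimated lambda: {lam_hat:.3f}")
print(f"skewness before: {stats.skew(y):.3f}, after: {stats.skew(y_bc):.3f}")

# Fixed-lambda evaluations on a small sample
sample = np.array([1.0, 4.0, 9.0])
shifted = stats.boxcox(sample, lmbda=1.0)      # y - 1
logged = stats.boxcox(sample, lmbda=0.0)       # ln(y)
rooted = stats.boxcox(sample, lmbda=0.5)       # 2 * (sqrt(y) - 1)
reciprocal = stats.boxcox(sample, lmbda=-1.0)  # 1 - 1/y
print(shifted, logged, rooted, reciprocal)
```

For log-normal data the estimate should land close to 0, which corresponds to the log transformation.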

### When to Use

Use Box-Cox when:
- Your data is strictly positive
- You need to normalize skewed distributions
- You want to stabilize variance across the range of data
- You're preparing data for linear regression or ANOVA
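
To make "stabilize variance" concrete, here is a small sketch with synthetic groups whose spread grows in proportion to their mean. The group means, noise level, and the choice of $\lambda = 0$ are assumptions picked for illustration; real data would usually need $\lambda$ estimated from the sample.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Synthetic groups with multiplicative noise: spread scales with the mean
means = [1.0, 10.0, 100.0]
groups = [m * rng.lognormal(mean=0.0, sigma=0.3, size=500) for m in means]

std_raw = [g.std(ddof=1) for g in groups]
# lmbda=0.0 selects the log member of the Box-Cox family
std_log = [stats.boxcox(g, lmbda=0.0).std(ddof=1) for g in groups]

ratio_raw = max(std_raw) / min(std_raw)
ratio_log = max(std_log) / min(std_log)
print(f"largest/smallest std before: {ratio_raw:.1f}")
print(f"largest/smallest std after : {ratio_log:.2f}")
```

Before the transformation the group standard deviations differ by roughly two orders of magnitude; afterwards they are nearly equal, which is the property that regression and ANOVA rely on.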

## Yeo-Johnson Transformation

The Yeo-Johnson transformation, proposed by In-Kwon Yeo and Richard Johnson in 2000, extends the Box-Cox transformation to handle both positive and negative values.

### Formula

The Yeo-Johnson transformation is defined as:

$$
y(\lambda) = \begin{cases}
\frac{(y+1)^\lambda - 1}{\lambda} & \text{if } \lambda \neq 0, y \geq 0 \\
\ln(y+1) & \text{if } \lambda = 0, y \geq 0 \\
-\frac{(-y+1)^{2-\lambda} - 1}{2-\lambda} & \text{if } \lambda \neq 2, y < 0 \\
-\ln(-y+1) & \text{if } \lambda = 2, y < 0
\end{cases}
$$
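
The four branches above can be sketched in NumPy and compared against `scipy.stats.yeojohnson`. The helper name `yeo_johnson` and the test values are illustrative assumptions.

```python
import numpy as np
from scipy import stats


def yeo_johnson(y, lam):
    """Apply the Yeo-Johnson formula elementwise for a fixed lambda (hypothetical helper)."""
    y = np.asarray(y, dtype=float)
    out = np.empty_like(y)
    pos = y >= 0

    # Branches for y >= 0
    if lam == 0:
        out[pos] = np.log1p(y[pos])
    else:
        out[pos] = ((y[pos] + 1.0) ** lam - 1.0) / lam

    # Branches for y < 0
    if lam == 2:
        out[~pos] = -np.log1p(-y[~pos])
    else:
        out[~pos] = -((1.0 - y[~pos]) ** (2.0 - lam) - 1.0) / (2.0 - lam)
    return out


y = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])  # mixed-sign illustrative values
results = {}
for lam in (0.0, 0.5, 1.0, 2.0):
    manual = yeo_johnson(y, lam)
    reference = stats.yeojohnson(y, lmbda=lam)  # fixed lambda: returns only the array
    results[lam] = np.allclose(manual, reference)
    print(f"lambda={lam}: match={results[lam]}")
```

With $\lambda = 1$ both branches reduce to the identity, so the data is returned unchanged.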

### Key Characteristics

- **Advantage**: Works with both positive and negative values, as well as zero
- **Flexibility**: More versatile than Box-Cox for real-world datasets
- **Parameter**: Like Box-Cox, $\lambda$ is found through maximum likelihood estimation

### When to Use

Use Yeo-Johnson when:
- Your data contains negative values or zeros
- You need the flexibility to transform any real-valued data
- Box-Cox is not applicable due to data constraints
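
A minimal sketch of the situation described above, assuming a synthetic sample that contains negative values: `scipy.stats.boxcox` rejects the data when asked to estimate $\lambda$, while `scipy.stats.yeojohnson` handles it.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
# Synthetic right-skewed sample shifted so that some values are negative
y = rng.exponential(scale=2.0, size=500) - 1.0

try:
    stats.boxcox(y)  # lambda estimation requires strictly positive data
    boxcox_ok = True
except ValueError as err:
    boxcox_ok = False
    print(f"Box-Cox rejected the data: {err}")

# Yeo-Johnson accepts any real values and returns (transformed, lambda)
y_yj, lam_hat = stats.yeojohnson(y)
print(f"Yeo-Johnson lambda: {lam_hat:.3f}")
print(f"skewness before: {stats.skew(y):.3f}, after: {stats.skew(y_yj):.3f}")
```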

## Comparison

| Feature | Box-Cox | Yeo-Johnson |
|---------|---------|-------------|
| Data requirements | Strictly positive ($y > 0$) | Any real values |
| Complexity | Simpler formula | More complex, piecewise formula |
| Use case | Positive-only datasets | General-purpose |
| Year introduced | 1964 | 2000 |

## Practical Implementation

Both transformations are widely available in statistical software:

- **Python**: `scipy.stats.boxcox()` and `scipy.stats.yeojohnson()`, or `sklearn.preprocessing.PowerTransformer` with `method="box-cox"` or `method="yeo-johnson"`
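
As a sketch of the scikit-learn route, the example below fits a `PowerTransformer` for each method on synthetic two-column data. The arrays are assumptions for illustration. By default the transformer also standardizes the output to zero mean and unit variance, and it fits one $\lambda$ per column.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(42)
X_pos = rng.lognormal(size=(500, 2))          # strictly positive, skewed
X_any = rng.exponential(size=(500, 2)) - 1.0  # contains negative values

bc = PowerTransformer(method="box-cox")      # requires strictly positive input
X_bc = bc.fit_transform(X_pos)

yj = PowerTransformer(method="yeo-johnson")  # the default method
X_yj = yj.fit_transform(X_any)

print("Box-Cox lambdas    :", bc.lambdas_)
print("Yeo-Johnson lambdas:", yj.lambdas_)

# The fitted transformer can map transformed values back to the original scale
X_back = bc.inverse_transform(X_bc)
print("round trip ok:", np.allclose(X_back, X_pos))
```

In a modeling pipeline the transformer would normally be fit on training data only and then applied to validation or test data with `transform`.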

### Example Workflow

1. **Assess normality**: Check whether your data is approximately normal, for example with a histogram, a Q-Q plot, or a Shapiro-Wilk test
2. **Choose transformation**: Select Box-Cox (positive data) or Yeo-Johnson (any data)
3. **Find optimal λ**: Use maximum likelihood to determine the best parameter
4. **Apply transformation**: Transform your data using the optimal λ
5. **Verify**: Check if the transformed data better meets normality assumptions
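
The five steps above can be sketched end to end with SciPy. The gamma-distributed sample is synthetic, and using skewness plus a Shapiro-Wilk test as the normality check is one reasonable choice among several, not a prescribed method.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
data = rng.gamma(shape=2.0, scale=3.0, size=1000)  # synthetic, positive, right-skewed

# 1. Assess normality
skew_before = stats.skew(data)
_, p_before = stats.shapiro(data)

# 2. Choose the transformation based on the sign of the data
use_boxcox = bool(np.all(data > 0))

# 3-4. Estimate lambda by maximum likelihood and apply the transformation
if use_boxcox:
    transformed, lam = stats.boxcox(data)
else:
    transformed, lam = stats.yeojohnson(data)

# 5. Verify that the transformed data is closer to normal
skew_after = stats.skew(transformed)
_, p_after = stats.shapiro(transformed)

print(f"method: {'Box-Cox' if use_boxcox else 'Yeo-Johnson'}, lambda = {lam:.3f}")
print(f"skewness      : {skew_before:.3f} -> {skew_after:.3f}")
print(f"Shapiro-Wilk p: {p_before:.3g} -> {p_after:.3g}")
```

A Q-Q plot (for example with `scipy.stats.probplot`) is a useful visual complement to the numeric checks in step 5.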

## Conclusion

Both Box-Cox and Yeo-Johnson transformations are powerful tools for data preprocessing. Box-Cox is limited to strictly positive values, but it is simpler and well established. Yeo-Johnson offers greater flexibility by handling any real-valued data. Choose based on your data characteristics and modeling requirements. Remember that transformation is not always necessary; apply it only when your analysis assumptions require more normally distributed data.

---

*These transformations are essential tools in the modern data scientist's toolkit, helping bridge the gap between raw data and the assumptions underlying many statistical models.*

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

import scipy.stats as stats