# Chapter 10 - Re-expressing Data: Get It Straight!

## Goals of Re-Expression

1. Make a **more symmetric distribution** of a variable (e.g., as seen in its histogram)
2. Make **more similar spreads of several groups** (e.g., as seen in side-by-side boxplots), even if centers differ
3. Make a **more linear form of a scatterplot**
4. Make a **more evenly spread out scatter** in a scatterplot, rather than thickening at one end
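
As a rough illustration of the first goal, the sketch below compares the gap between mean and median before and after a logarithm re-expression. The data values and the helper `mean_minus_median` are made up for this example; a gap near zero is only a crude hint of symmetry, not a substitute for looking at the histogram.

```python
import math
import statistics

# Hypothetical right-skewed values (made up for illustration)
raw = [1, 10, 100, 1000, 10000]
logged = [math.log10(v) for v in raw]

def mean_minus_median(values):
    """Crude skewness indicator: a gap near 0 suggests symmetry."""
    return statistics.mean(values) - statistics.median(values)

raw_gap = mean_minus_median(raw)        # large positive gap: long right tail
logged_gap = mean_minus_median(logged)  # gap near zero after re-expression
print(raw_gap, logged_gap)
```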

## The Ladder of Powers

Collection of re-expressions to consider:

|Power|Name|Comment|
|-----|----|-------|
|2    |The **square** of the data values, $y^2$|good for unimodal distributions that are left skewed|
|1    |The **raw data** (no change)            |data that take both positive and negative values with no bounds are less likely to benefit from re-expression|
|1/2  |The **square root** of the data values, $\sqrt{y}$|counts often benefit from a square root re-expression|
|"0"  |The **logarithm** of the data values, $\log{(y)}$|measurements that cannot be negative; values that grow by percentage increases; if the data contain zeros, consider adding a small $\epsilon{}$ first|
|-1/2 |The **(negative) reciprocal square root**, $\frac{-1}{\sqrt{y}}$|uncommon; negative sign preserves direction|
|-1   |The **(negative) reciprocal**, $\frac{-1}{y}$|ratios of two quantities (e.g. miles per hour) often benefit; negative preserves direction; consider adding small $\epsilon{}$ if data has '0's|
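
The ladder can be sketched as a single helper that applies one rung at a time. The function name `ladder` and the sample values are assumptions for illustration, and the sketch assumes strictly positive data (no zeros or negatives), so no $\epsilon$ adjustment is shown.

```python
import math

def ladder(y, power):
    """Re-express a list of positive values using one rung of the ladder."""
    if power == 0:
        # the "0" rung stands for the logarithm
        return [math.log10(v) for v in y]
    if power < 0:
        # the negative sign keeps the original ordering of the values
        return [-(v ** power) for v in y]
    return [v ** power for v in y]

# Hypothetical data values (made up for illustration)
values = [1.0, 4.0, 9.0, 100.0]

sqrt_values = ladder(values, 0.5)   # square root rung
log_values = ladder(values, 0)      # logarithm rung
recip_values = ladder(values, -1)   # (negative) reciprocal rung
print(sqrt_values, log_values, recip_values)
```

Moving down the ladder pulls in the large values more and more strongly, while the sign flip on the negative rungs means larger raw values still map to larger re-expressed values.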

## Plan B: Logarithms

If the Ladder of Powers doesn't provide enough to resolve a challenging curvature, consider:

|Model Name|$x$-axis|$y$-axis|Comment|
|-|-|-|-|
|**Exponential**|$x$|$\log{(y)}$|This is the same as the "0" power in the ladder approach.|
|**Logarithmic**|$\log{(x)}$|$y$|May benefit i) a wide range of $x$-values, or ii) a scatterplot that descends rapidly at the left but levels off towards the right.|
|**Power**|$\log{(x)}$|$\log{(y)}$|May help when one of the ladder's powers is too big and the next is too small.|
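
A minimal sketch of trying all three models on the same data, assuming positive $x$ and $y$ values. The data are synthetic and noise-free (generated as $y = 3x^{1.5}$), and `fit_line` is a hand-written least-squares helper introduced only for this example. The model whose residuals show no leftover pattern is the one that straightened the relationship.

```python
import math

def fit_line(u, v):
    """Ordinary least-squares line v = intercept + slope * u."""
    n = len(u)
    mean_u = sum(u) / n
    mean_v = sum(v) / n
    sxx = sum((a - mean_u) ** 2 for a in u)
    sxy = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    slope = sxy / sxx
    intercept = mean_v - slope * mean_u
    residuals = [b - (intercept + slope * a) for a, b in zip(u, v)]
    return slope, intercept, residuals

# Synthetic, noise-free power relationship (made up for illustration)
x = [1.0, 2.0, 4.0, 8.0, 16.0, 32.0]
y = [3.0 * v ** 1.5 for v in x]

log_x = [math.log10(v) for v in x]
log_y = [math.log10(v) for v in y]

exp_fit = fit_line(x, log_y)      # exponential model: x vs log(y)
log_fit = fit_line(log_x, y)      # logarithmic model: log(x) vs y
pow_fit = fit_line(log_x, log_y)  # power model: log(x) vs log(y)

for name, (slope, intercept, residuals) in [
    ("exponential", exp_fit), ("logarithmic", log_fit), ("power", pow_fit)
]:
    largest = max(abs(r) for r in residuals)
    print(f"{name}: slope={slope:.3f}, largest residual={largest:.3g}")
```

For these data only the power model leaves essentially zero residuals, and its slope recovers the exponent 1.5. With real data the residual plot, not a single summary number, should guide the choice.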

## What Can Go Wrong?

* Don't expect your model to be perfect.
    * strive for a _useful_ model
* Don't stray too far from the ladder approach.
    * stick to powers between -2 and 2
    * prefer simpler powers
* Don't choose a model based on $R^2$ alone.
* Beware of multiple modes.
    * re-expression won't resolve multiple modes, but might help see how to isolate each for separate analyses
* Watch out for scatterplots that turn around.
    * re-expression cannot straighten a relationship that goes up and then down (or down and then up), so a linear model will not fit these
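* Watch out for negative data values.
    * negative values cannot be re-expressed by logarithms or fractional powers, and zeros cannot be re-expressed by logarithms or negative powers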
* Watch for data far from 1.
    * consider subtracting a constant to bring all values back near 1
    * in the process, avoid creating a zero value
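
The last point can be illustrated with a small sketch. The values and the constant 1000 are made up for this example. When every value is far from 1, taking logarithms barely changes the shape of the data; after subtracting a constant, the same re-expression has a strong effect.

```python
import math

# Hypothetical values, all far from 1 (made up for illustration)
raw = [1001, 1002, 1004, 1008, 1016]
shifted = [v - 1000 for v in raw]  # 1, 2, 4, 8, 16: near 1, and no zeros

def steps(values):
    """Differences between consecutive values."""
    return [b - a for a, b in zip(values, values[1:])]

# Steps still roughly double each time, just like the raw gaps 1, 2, 4, 8
log_raw_steps = steps([math.log10(v) for v in raw])

# Steps are all equal to log10(2): the curvature has been removed
log_shifted_steps = steps([math.log10(v) for v in shifted])

print(log_raw_steps)
print(log_shifted_steps)
```

Subtracting 1001 instead of 1000 would have produced a zero, which the logarithm cannot handle, so the constant needs to be chosen with that in mind.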