# Confidence Intervals, Effect Sizes & Power Analysis in R

## Overview

This notebook covers three closely related concepts that together provide a complete picture of statistical results — going beyond the binary signal of a p-value.

| Concept | What It Tells You |
|---|---|
| **Confidence interval (CI)** | The plausible range for the true parameter value given the data |
| **Effect size** | The magnitude of a difference or association, independent of sample size |
| **Power analysis** | Whether a study has adequate sample size to detect an effect of a given size |

> A p-value answers "is there an effect?". An effect size answers "how big is it?". A confidence interval answers "how precisely have we estimated it?". Power analysis answers "could we have detected it if it existed?". All four are needed to fully evaluate a result.

## Applications by Sector

| Sector | Example |
|---|---|
| **Ecology** | What is the estimated difference in invertebrate density between acidified and reference sediments, and how precisely is it estimated? Was the field survey adequately powered to detect a biologically meaningful difference? |
| **Healthcare** | What is the effect size for a new treatment vs. standard care? How many patients are needed in a clinical trial to detect a clinically meaningful improvement? |
| **Finance** | What is the 95% CI for mean portfolio return? How large a sample is needed to detect a 2% lift in conversion rate from a product change? |
| **Insurance** | What is the estimated reduction in mean claim amount under a new policy, and is it practically meaningful? How many policyholders are needed to detect a 5% change in claim rate? |

---

## Assumptions Checklist

**Confidence intervals (parametric):**
- [ ] Data are approximately normally distributed (or n is large enough for CLT)
- [ ] Observations are independent
- [ ] The appropriate standard error formula is used for the parameter of interest

**Effect sizes:**
- [ ] The chosen effect size measure is appropriate for the test used (see table below)
- [ ] Effect sizes are interpreted in context — a "small" effect may be practically important or unimportant depending on the domain

**Power analysis:**
- [ ] Effect size is specified a priori — from prior literature, pilot data, or a minimum meaningful difference
- [ ] Alpha level (type I error rate) is specified (typically 0.05)
- [ ] Desired power is specified (typically 0.80 or 0.90)
- [ ] The test type matches the planned analysis

---

## Setup

In [None]:
# ── Libraries ────────────────────────────────────────────────────────────────
library(tidyverse)    # data manipulation and visualization
library(effectsize)   # Cohen's d, eta-squared, omega-squared, Cramér's V
library(pwr)          # power analysis for common tests
library(ggplot2)      # visualization (already attached by tidyverse; listed for clarity)
library(boot)         # bootstrap confidence intervals

# ── Reproducibility ──────────────────────────────────────────────────────────
set.seed(42)

---

## Confidence Intervals

A 95% confidence interval means: if we repeated the study many times and computed a CI each time, 95% of those intervals would contain the true parameter value. It does **not** mean there is a 95% probability the true value lies in *this particular* interval.
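
The repeated-sampling interpretation can be checked by simulation. The following is a minimal sketch, not part of the original analysis: it assumes normally distributed data with an arbitrarily chosen true mean of 10 and SD of 2, draws 2,000 samples of size 25, and counts how often the t-based 95% CI contains the true mean. The names `true_mean`, `n_sims`, and `covered` are illustrative.

```r
set.seed(42)

true_mean <- 10    # assumed "true" parameter, chosen for illustration
n_sims    <- 2000  # number of repeated studies
n         <- 25    # sample size per study

# For each simulated study, check whether its 95% CI contains the true mean
covered <- replicate(n_sims, {
  x  <- rnorm(n, mean = true_mean, sd = 2)
  ci <- t.test(x)$conf.int
  ci[1] <= true_mean && true_mean <= ci[2]
})

coverage <- mean(covered)
cat(sprintf("Empirical coverage: %.3f\n", coverage))
```

The empirical coverage should land close to 0.95. Any single interval either contains the true mean or does not, which is why the 95% describes the procedure rather than one interval.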

### CI for a Mean

In [None]:
# ── CI for a single mean (from t-test output) ─────────────────────────────────
result <- t.test(iris$Sepal.Length)
cat(sprintf("Mean: %.2f, 95%% CI [%.2f, %.2f]\n",
            result$estimate,
            result$conf.int[1],
            result$conf.int[2]))

# ── Manual calculation ────────────────────────────────────────────────────────
# CI = x̄ ± t(α/2, df) * (s / √n)
n    <- length(iris$Sepal.Length)
xbar <- mean(iris$Sepal.Length)
se   <- sd(iris$Sepal.Length) / sqrt(n)
t_crit <- qt(0.975, df = n - 1)  # two-tailed, alpha = 0.05

cat(sprintf("Manual CI: [%.2f, %.2f]\n",
            xbar - t_crit * se,
            xbar + t_crit * se))

# ── CIs for group means ───────────────────────────────────────────────────────
iris %>%
  group_by(Species) %>%
  summarise(
    n    = n(),
    mean = mean(Sepal.Length),
    se   = sd(Sepal.Length) / sqrt(n),
    lower_95 = mean - qt(0.975, df = n-1) * se,
    upper_95 = mean + qt(0.975, df = n-1) * se
  )

### Visualizing Confidence Intervals

In [None]:
# ── Mean ± 95% CI plot ────────────────────────────────────────────────────────
ci_df <- iris %>%
  group_by(Species) %>%
  summarise(
    mean  = mean(Sepal.Length),
    se    = sd(Sepal.Length) / sqrt(n()),
    lower = mean - qt(0.975, df = n()-1) * se,
    upper = mean + qt(0.975, df = n()-1) * se
  )

ggplot(ci_df, aes(x = Species, y = mean, color = Species)) +
  geom_point(size = 4) +
  geom_errorbar(aes(ymin = lower, ymax = upper), width = 0.15, linewidth = 1) +
  scale_color_manual(values = c("#4a8fff", "#4fffb0", "#ffd166")) +
  labs(title = "Mean Sepal Length by Species",
       subtitle = "Points = mean, bars = 95% confidence interval",
       y = "Sepal Length (cm)", x = "Species") +
  theme_minimal() +
  theme(legend.position = "none")
# Non-overlapping CIs suggest significant differences;
# overlapping CIs do NOT guarantee non-significance — run the test
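
The caveat about overlapping intervals can be made concrete. This is a sketch on constructed data, not on `iris`: two hypothetical groups of 50 share the same shape (normal quantiles rescaled to SD 1) and differ in mean by 0.5, so the result is deterministic. The names `group_a` and `group_b` are illustrative.

```r
n <- 50

# Normal quantiles rescaled to have mean 0 and SD 1 in-sample
z <- as.numeric(scale(qnorm(ppoints(n))))

group_a <- 10.0 + z
group_b <- 10.5 + z

ci_a <- t.test(group_a)$conf.int
ci_b <- t.test(group_b)$conf.int

# Do the two 95% CIs overlap?
intervals_overlap <- ci_a[2] > ci_b[1]

# Welch two-sample t-test on the same data
p_value <- t.test(group_a, group_b)$p.value

cat(sprintf("CI A: [%.3f, %.3f]\n", ci_a[1], ci_a[2]))
cat(sprintf("CI B: [%.3f, %.3f]\n", ci_b[1], ci_b[2]))
cat(sprintf("Overlap: %s, p = %.4f\n", intervals_overlap, p_value))
```

The two intervals overlap, yet the test rejects at the 0.05 level. The standard error of a difference is smaller than the sum of the two individual standard errors, so comparing separate CIs by eye is a conservative check.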

### Bootstrap Confidence Intervals

Use bootstrap CIs when the normality assumption is not met or when you need CIs for statistics without closed-form standard errors (medians, correlations, custom metrics).

In [None]:
# ── Bootstrap CI for the median ───────────────────────────────────────────────
median_boot <- function(data, indices) median(data[indices])

boot_result <- boot::boot(
  data      = iris$Sepal.Length,
  statistic = median_boot,
  R         = 2000  # number of bootstrap resamples
)

boot::boot.ci(boot_result, type = "perc")  # percentile CI
# 'bca' (bias-corrected and accelerated) adjusts for bias and skewness in the
# bootstrap distribution and is often preferred; it needs a reasonably large R
boot::boot.ci(boot_result, type = "bca")
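
The same pattern extends to statistics computed from more than one column, such as a correlation. This is a minimal sketch under the assumption that rows are resampled as whole units; `cor_boot` is an illustrative helper name, and the choice of `Sepal.Length` and `Petal.Length` is only for demonstration.

```r
library(boot)
set.seed(42)

# Statistic function: boot() passes the data and a vector of resampled row indices
cor_boot <- function(data, indices) {
  d <- data[indices, ]
  cor(d$Sepal.Length, d$Petal.Length)
}

boot_cor <- boot(data = iris, statistic = cor_boot, R = 2000)

# Percentile CI; columns 4 and 5 of $percent hold the lower and upper limits
ci       <- boot.ci(boot_cor, type = "perc")
ci_lower <- ci$percent[4]
ci_upper <- ci$percent[5]

cat(sprintf("r = %.3f, 95%% bootstrap CI [%.3f, %.3f]\n",
            boot_cor$t0, ci_lower, ci_upper))
```

Resampling whole rows keeps each pair of measurements together, which preserves the dependence between the two variables.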

---

## Effect Sizes

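Effect sizes quantify the practical magnitude of a result. Unlike p-values, standardized effect sizes do not shrink or grow systematically with sample size, which makes them comparable across studies. The estimate is still subject to sampling error, so report it with a confidence interval.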

| Test | Effect Size | Measure | Small | Medium | Large |
|---|---|---|---|---|---|
| t-test | Cohen's *d* | Standardized mean difference | 0.2 | 0.5 | 0.8 |
| ANOVA | Eta-squared η² | Proportion of variance explained | 0.01 | 0.06 | 0.14 |
| ANOVA | Omega-squared ω² | Less biased version of η² | 0.01 | 0.06 | 0.14 |
| Chi-square | Cramér's *V* | Association strength (0–1) | 0.1 | 0.3 | 0.5 |
| Correlation | *r* | Pearson/Spearman correlation | 0.1 | 0.3 | 0.5 |
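
To make the first row of the table concrete, Cohen's *d* can be computed by hand as the mean difference divided by the pooled standard deviation. This is a base-R sketch of the classical pooled-SD formula for two independent groups; the variable names are illustrative.

```r
setosa     <- iris$Sepal.Length[iris$Species == "setosa"]
versicolor <- iris$Sepal.Length[iris$Species == "versicolor"]

n1 <- length(setosa)
n2 <- length(versicolor)

# Pooled SD: variances weighted by their degrees of freedom
pooled_sd <- sqrt(((n1 - 1) * var(setosa) + (n2 - 1) * var(versicolor)) /
                    (n1 + n2 - 2))

# Standardized mean difference (first group minus second group)
d_manual <- (mean(setosa) - mean(versicolor)) / pooled_sd

cat(sprintf("Cohen's d (setosa - versicolor): %.2f\n", d_manual))
```

The value is negative because setosa, the first group, has the smaller mean. Its magnitude is far beyond the "large" benchmark of 0.8.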

In [None]:
# ── Cohen's d: two-sample comparison ─────────────────────────────────────────
iris_two <- iris %>% filter(Species %in% c("setosa", "versicolor"))
effectsize::cohens_d(Sepal.Length ~ Species, data = iris_two)
# Positive d: first group has larger mean; negative: second group larger
# Report with sign and 95% CI

# ── Cohen's d: one-sample ─────────────────────────────────────────────────────
effectsize::cohens_d(iris$Sepal.Length, mu = 5.0)  # vs. hypothesized mean of 5

# ── Eta-squared: ANOVA ────────────────────────────────────────────────────────
model <- aov(Sepal.Length ~ Species, data = iris)
effectsize::eta_squared(model, partial = FALSE)  # total eta-squared
effectsize::eta_squared(model, partial = TRUE)   # partial (the default; identical to total for one-way designs)
effectsize::omega_squared(model)       # omega-squared (preferred for small n)

# ── Cramér's V: chi-square ────────────────────────────────────────────────────
hair_eye <- margin.table(HairEyeColor, margin = c(1,2))
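effectsize::cramers_v(hair_eye)
# Recent effectsize versions apply a small-sample bias correction by default;
# pass adjust = FALSE for the classical (uncorrected) V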

# ── Correlation effect size ───────────────────────────────────────────────────
cor.test(iris$Sepal.Length, iris$Petal.Length)
# r is already the effect size; extract with $estimate

---

## Power Analysis

Statistical power is the probability of correctly rejecting H₀ when it is false (i.e., detecting an effect that truly exists). Four quantities are related — specify any three to solve for the fourth:

- **n** — sample size per group
- **d** (or f, w) — effect size
- **α** — significance level (type I error rate, typically 0.05)
- **power** — target power (typically 0.80 or 0.90)
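
The relationship among these four quantities can be checked by simulation. The sketch below is illustrative rather than part of the original workflow: it assumes two normal populations with SD 1 whose means differ by d = 0.5, simulates 2,000 studies with 64 observations per group, and compares the rejection rate with the analytic power from base R's `power.t.test()`.

```r
set.seed(42)

n_per_group <- 64
d           <- 0.5   # true standardized mean difference (assumed)
alpha       <- 0.05
n_sims      <- 2000

# Simulate many studies and record each p-value
p_values <- replicate(n_sims, {
  control   <- rnorm(n_per_group, mean = 0, sd = 1)
  treatment <- rnorm(n_per_group, mean = d, sd = 1)
  t.test(treatment, control, var.equal = TRUE)$p.value
})

# Power = proportion of studies that reject H0 at level alpha
sim_power <- mean(p_values < alpha)

# Analytic power for the same design
analytic_power <- power.t.test(n = n_per_group, delta = d, sd = 1,
                               sig.level = alpha)$power

cat(sprintf("Simulated power: %.3f, analytic power: %.3f\n",
            sim_power, analytic_power))
```

The two values should agree to within simulation error. The simulation approach also carries over to designs that have no closed-form power formula.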

> Always conduct power analysis *before* data collection. Post-hoc power analysis on a non-significant result is not meaningful.

In [None]:
# ── Required n for a two-sample t-test ────────────────────────────────────────
# Detect a medium effect (d = 0.5) at alpha = 0.05 with 80% power
pwr::pwr.t.test(
  d           = 0.5,    # Cohen's d
  sig.level   = 0.05,
  power       = 0.80,
  type        = "two.sample",
  alternative = "two.sided"
)
# n = required sample size PER GROUP

# ── Power given a fixed n ─────────────────────────────────────────────────────
pwr::pwr.t.test(
  n           = 30,     # available n per group
  d           = 0.5,
  sig.level   = 0.05,
  type        = "two.sample",
  alternative = "two.sided"
)

# ── Minimum detectable effect given n and power ───────────────────────────────
pwr::pwr.t.test(
  n           = 30,
  sig.level   = 0.05,
  power       = 0.80,
  type        = "two.sample",
  alternative = "two.sided"
)
# d returned = smallest effect detectable with this design
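
The per-group sample size returned by the first call above is fractional, so it has to be rounded up before converting to a total. A small sketch using base R's `power.t.test()` as a cross-check of the same design (d = 0.5 expressed as `delta = 0.5` with `sd = 1`); `n_per_group` and `n_total` are illustrative names.

```r
res <- power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.80,
                    type = "two.sample", alternative = "two.sided")

# Always round up: rounding down would leave the study slightly underpowered
n_per_group <- ceiling(res$n)
n_total     <- 2 * n_per_group

cat(sprintf("n per group: %d, total n: %d\n", n_per_group, n_total))
```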

In [None]:
# ── Power for one-way ANOVA ───────────────────────────────────────────────────
# f = sqrt(eta² / (1 - eta²)); small = 0.10, medium = 0.25, large = 0.40
# n returned = required sample size PER GROUP
pwr::pwr.anova.test(
  k           = 3,      # number of groups
  f           = 0.25,   # medium effect
  sig.level   = 0.05,
  power       = 0.80
)

# ── Power for chi-square test ─────────────────────────────────────────────────
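# w = Cramér's V when the smaller table dimension is 2 (e.g. 2x2);
# in general w = V * sqrt(min(rows, cols) - 1); small = 0.1, medium = 0.3, large = 0.5
# N returned = TOTAL number of observations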
pwr::pwr.chisq.test(
  w           = 0.3,    # medium effect
  df          = 1,      # (rows-1) * (cols-1)
  sig.level   = 0.05,
  power       = 0.80
)

# ── Power for correlation ─────────────────────────────────────────────────────
# n returned = number of paired observations
pwr::pwr.r.test(
  r           = 0.3,    # medium correlation
  sig.level   = 0.05,
  power       = 0.80,
  alternative = "two.sided"
)
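
When the effect size for an ANOVA comes from pilot data, η² has to be converted to Cohen's *f* before it can be passed to `pwr.anova.test()`. The sketch below treats `iris` as a stand-in for pilot data, purely for illustration, and computes η² directly from the ANOVA sums of squares in base R.

```r
model <- aov(Sepal.Length ~ Species, data = iris)

# Sums of squares from the ANOVA table: c(between groups, residual)
ss <- summary(model)[[1]][["Sum Sq"]]

eta2     <- ss[1] / sum(ss)            # proportion of variance explained
cohens_f <- sqrt(eta2 / (1 - eta2))    # conversion used by pwr.anova.test()

cat(sprintf("eta-squared = %.3f, Cohen's f = %.3f\n", eta2, cohens_f))
```

The resulting `cohens_f` is far above the "large" benchmark of 0.40 because the iris species differ so strongly. Effect sizes from small pilot studies tend to be noisy, so a minimum meaningful difference is often a safer planning input.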

In [None]:
# ── Power curve: power as a function of n ─────────────────────────────────────
n_seq    <- seq(10, 150, by = 5)
power_df <- map_dfr(c(0.2, 0.5, 0.8), function(d) {
  tibble(
    n     = n_seq,
    power = map_dbl(n_seq, ~ pwr::pwr.t.test(
                                n = .x, d = d,
                                sig.level = 0.05,
                                type = "two.sample")$power),
    effect = paste0("d = ", d)
  )
})

ggplot(power_df, aes(x = n, y = power, color = effect)) +
  geom_line(linewidth = 1) +
  geom_hline(yintercept = 0.80, linetype = "dashed", color = "gray50") +
  scale_color_manual(values = c("#4a8fff", "#4fffb0", "#ffd166")) +
  scale_y_continuous(labels = scales::percent, limits = c(0, 1)) +
  labs(title = "Power Curves for Two-Sample t-Test",
       subtitle = "Dashed line = 80% power threshold",
       x = "Sample Size per Group (n)",
       y = "Statistical Power",
       color = "Effect Size") +
  theme_minimal()
# Use this to communicate sample size requirements to collaborators or stakeholders

---

## Common Pitfalls

**1. Reporting p-values without effect sizes**  
Statistical significance depends on sample size — with large n, trivially small effects become significant. Always pair p-values with an effect size and CI.

**2. Interpreting overlapping CIs as non-significant**  
Two 95% CIs can overlap and the difference can still be statistically significant. For two independent groups with similar standard errors, it is non-overlap of roughly 83–84% CIs, not 95% CIs, that corresponds to p < 0.05. Run the test.

**3. Using Cohen's benchmarks without domain context**  
A "small" effect in psychology (d = 0.2) might be clinically meaningful in medicine or economically meaningful in finance. Define what constitutes a meaningful effect in your domain before collecting data.

**4. Post-hoc power analysis on non-significant results**  
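Power computed from the observed effect size is a direct function of the p-value, so calculating it after a non-significant result and then concluding "we were underpowered" is circular. Power analysis should be prospective; after the fact, the confidence interval shows which effect sizes remain plausible.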

**5. Forgetting that power analysis gives n per group**  
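For a two-sample test, `pwr.t.test()` returns the required sample size per group, as does `pwr.anova.test()`. Round n up, then total n = n × number of groups. `pwr.chisq.test()` returns the total N, and `pwr.r.test()` returns the number of paired observations.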

**6. Treating confidence intervals as Bayesian credible intervals**  
A 95% CI does not mean "there is a 95% probability the true value is in this range." It is a frequentist statement about the long-run coverage of the procedure.
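
Pitfall 1 can be illustrated with constructed data. In this sketch two hypothetical groups of 100,000 observations have identical shape and differ by only 0.02 standard deviations. The data are built from normal quantiles instead of random draws, so the result is deterministic, and the names `control` and `treatment` are illustrative.

```r
n <- 100000

# Normal quantiles rescaled to have mean 0 and SD 1 in-sample
z <- as.numeric(scale(qnorm(ppoints(n))))

control   <- z
treatment <- z + 0.02   # a shift of 0.02 SD: negligible in most domains

res <- t.test(treatment, control)

# Cohen's d with a pooled SD (equal group sizes)
pooled_sd <- sqrt((var(treatment) + var(control)) / 2)
d         <- (mean(treatment) - mean(control)) / pooled_sd

cat(sprintf("p = %.2g, Cohen's d = %.3f\n", res$p.value, d))
```

The p-value is far below 0.05 even though d is a tenth of the conventional "small" benchmark. That is why the p-value alone says little about practical importance.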

---
*r_methods_library · Samantha McGarrigle · [github.com/samantha-mcgarrigle](https://github.com/samantha-mcgarrigle)*