## References

Brunner, E. and Munzel, U. “The nonparametric Behrens-Fisher problem: Asymptotic theory and a small-sample approximation”. Biometrical Journal. Vol. 42(2000): 17-25.

Neubert, K. and Brunner, E. “A studentized permutation test for the non-parametric Behrens-Fisher problem”. Computational Statistics and Data Analysis. Vol. 51(2007): 5192-5204.

## Behrens-Fisher Problem

The Behrens–Fisher problem, named after Walter-Ulrich Behrens and Ronald Fisher, involves the challenge of estimating confidence intervals and testing hypotheses for the difference in means between two normally distributed populations when their variances are not assumed to be equal, using independent samples from each population.

A more generalized approach extends the Behrens-Fisher problem to a nonparametric setting, where the underlying distribution functions are not assumed to be normal or continuous. This allows the model to handle data with arbitrary ties. In this framework, a rank-based test is used, where the test statistic's asymptotic variance is estimated consistently. The estimation relies on two components: the ranks of data points from the experiment groups across the combined set of observations and the ranks within each individual experiment group. 

## Nonparametric Model & Hypothesis

Let $N = n_1 + n_2$ be the total number of independent random variables:

$$
\begin{align*}
X_{ik} &\sim F_i, \quad k = 1, \ldots, n_i \quad \& \quad i = 1, 2, 
\end{align*}
$$

where $F_i$ is the cumulative distribution function of the $i$-th group and the index $i \in \{1, 2\}$ denotes the group. Typically, there are two groups: a **control group** ($i = 1$) and a **treatment group** ($i = 2$).

### Null Hypothesis

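The null hypothesis of no treatment effect, $H_0^p: p = \frac{1}{2}$, is formulated in terms of the relative treatment effect $p$ defined below. This formulation includes the parametric Behrens-Fisher problem with normality assumptions as a special case: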

$$
\begin{align*}
p = P(X_{11} < X_{21}) + \frac{1}{2}P(X_{11} = X_{21})
\end{align*}
$$

Here, $X_{11}$ and $X_{21}$ are random variables from the two treatment groups.

#### Case $p < \frac{1}{2}$ (Samples From Group I (Control) Tend to be Larger)

- **Formula Breakdown:**

  $$
  p = \textcolor{red}{P(X_{11} < X_{21})} + \frac{1}{2}P(X_{11} = X_{21})
  $$

- For $p < \frac{1}{2}$, this implies:

  - $\textcolor{red}{P(X_{11} < X_{21})}$ must be small.

  - Even when adding $\frac{1}{2}P(X_{11} = X_{21})$, the total value remains less than $\frac{1}{2}$.

- **Interpretation:**

  - A small $P(X_{11} < X_{21})$ suggests that it is less likely for $X_{11}$ to be less than $X_{21}$.

  - This means $X_{11}$ is more likely to be **greater** than $X_{21}$, indicating that values from group 1 (typically the control group) tend to be larger.

#### Case $p > \frac{1}{2}$ (Samples From Group I (Control) Tend to be Smaller)

- **Formula Breakdown:**

  $$
  p = \textcolor{red}{P(X_{11} < X_{21})} + \frac{1}{2}P(X_{11} = X_{21})
  $$

- For $p > \frac{1}{2}$, this implies:

  - $\textcolor{red}{P(X_{11} < X_{21})}$ must be relatively large.

  - If we add $\frac{1}{2}P(X_{11} = X_{21})$, the total exceeds $\frac{1}{2}$.

- **Interpretation:**

  - A large $P(X_{11} < X_{21})$ suggests that it is more likely for $X_{11}$ to be less than $X_{21}$.

  - This means $X_{11}$ is more likely to be **smaller** than $X_{21}$, indicating that values from group 1 tend to be smaller.


#### Case $p = \frac{1}{2}$ (No Treatment Effect)

- **Formula Breakdown:**

  $$
  p = \textcolor{red}{P(X_{11} < X_{21})} + \frac{1}{2}\textcolor{orange}{P(X_{11} = X_{21})}
  $$

- **Interpretation:**

  - This indicates that $X_{11}$ is as likely to be smaller than $X_{21}$ as it is to be larger, that is, $P(X_{11} < X_{21}) = P(X_{11} > X_{21})$. Ties, if any, are split evenly between the two directions.

In the context of normal distributions $F_1$ and $F_2$ with means $\mu_1, \mu_2$ and variances $\sigma_1^2, \sigma_2^2$, $p = \frac{1}{2}$ holds if and only if $\mu_1 = \mu_2$, regardless of whether $\sigma_1^2$ and $\sigma_2^2$ are different. When $\mu_1 = \mu_2$, the symmetry of the difference $X_{11} - X_{21} \sim N(0, \sigma_1^2 + \sigma_2^2)$ implies that $P(X_{11} < X_{21}) = P(X_{11} > X_{21}) = \frac{1}{2}$. Additionally, $\textcolor{orange}{P(X_{11} = X_{21})}$ is zero for continuous distributions, so the first term $\textcolor{red}{P(X_{11} < X_{21})}$ alone determines $p = \frac{1}{2}$.

This matches the parametric Behrens-Fisher problem, where the focus is on comparing means even if variances differ. Thus, the nonparametric hypothesis $H_0^p: p = \frac{1}{2}$ generalizes the Behrens-Fisher problem, encompassing it as a special case when the distributions are normal.
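As a quick numerical check of the normal special case, the sketch below evaluates $p$ in closed form. For independent normal variables, $X_{11} - X_{21} \sim N(\mu_1 - \mu_2, \sigma_1^2 + \sigma_2^2)$, so $p = \Phi\left((\mu_2 - \mu_1) / \sqrt{\sigma_1^2 + \sigma_2^2}\right)$. The function name `relative_effect_normal` and the parameter values are illustrative assumptions, not part of the original analysis.

```python
import math

from scipy.stats import norm


def relative_effect_normal(
    mu1: float, sigma1: float, mu2: float, sigma2: float
) -> float:
    """Return p = P(X11 < X21) for independent normal X11 and X21."""
    # X11 - X21 ~ N(mu1 - mu2, sigma1^2 + sigma2^2); ties have probability zero.
    return float(norm.cdf((mu2 - mu1) / math.hypot(sigma1, sigma2)))


# Equal means, very different variances: p stays at 1/2.
p_equal_means = relative_effect_normal(mu1=0.0, sigma1=1.0, mu2=0.0, sigma2=5.0)

# Group 2 shifted upwards: p exceeds 1/2.
p_shifted = relative_effect_normal(mu1=0.0, sigma1=1.0, mu2=1.0, sigma2=5.0)

print(p_equal_means, p_shifted)
```

The first call illustrates that unequal variances alone do not move $p$ away from $\frac{1}{2}$; only a difference in means does.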

## Estimation of the Relative Treatment Effect

The relative treatment effect $p$ compares two distributions, typically those of a control group and a treatment group. It is the probability that a randomly selected observation from group 1 (say, control) is smaller than a randomly selected observation from group 2 (treatment), plus half the probability that the two observations are equal.

**Normalized Version of the Distribution Function**

- To express $p$ in terms of the distribution functions, the normalized version of the distribution function is used:
  $$
  F_i(x) = \frac{1}{2}\left[F_i^{-}(x) + F_i^{+}(x)\right] \quad \text{for } i = 1, 2
  $$
  Here:
  - $F_i(x)$ is the distribution function for group $i$.
  - $F_i^{-}(x) = P(X_{i1} < x)$ is the left-continuous version of the distribution function. It calculates the probability of observing a value less than $x$.
  - $F_i^{+}(x) = P(X_{i1} \leq x)$ is the right-continuous version, calculating the probability of observing a value less than or equal to $x$.

  The average of these two versions, $\frac{1}{2}(F_i^{-}(x) + F_i^{+}(x))$, provides a normalized version of the distribution function, making it convenient for theoretical derivations and avoiding ambiguity at discontinuity points. **This averaged CDF can be interpreted as giving half the probability mass to each side of a tie at $x$, if one exists.**

**Expression for the Relative Treatment Effect ($p$)**

- The relative treatment effect $p$ can then be written as:
  $$
  p = \int F_1 \, dF_2
  $$
  This integral is the average value of $F_1$ weighted by the distribution $F_2$, that is, $E\left[F_1(X_{21})\right]$. With the normalized distribution function, it equals $P(X_{11} < X_{21}) + \frac{1}{2}P(X_{11} = X_{21})$: the probability that a value drawn from $F_1$ is smaller than a value drawn from $F_2$, with ties counted as one half. This expression allows $p$ to be computed under different distributional assumptions.
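To make the integral concrete, the following sketch evaluates $\int F_1 \, dF_2$ for two discrete distributions on the scores 1 to 5 and compares it with the direct definition $P(X_{11} < X_{21}) + \frac{1}{2}P(X_{11} = X_{21})$. The probability mass functions `pmf1` and `pmf2` are made-up values chosen for illustration only.

```python
import numpy as np

support = np.array([1, 2, 3, 4, 5])
pmf1 = np.array([0.5, 0.2, 0.1, 0.1, 0.1])  # hypothetical distribution of group 1
pmf2 = np.array([0.1, 0.2, 0.2, 0.2, 0.3])  # hypothetical distribution of group 2

# Normalized CDF of group 1: F1(x) = P(X1 < x) + 0.5 * P(X1 = x)
cdf1_left = np.cumsum(pmf1) - pmf1
F1 = cdf1_left + 0.5 * pmf1

# For a discrete F2, the integral of F1 with respect to F2 is a weighted sum.
p_integral = float(np.sum(F1 * pmf2))

# Direct evaluation from the joint distribution of two independent draws.
joint = np.outer(pmf1, pmf2)  # joint[a, b] = P(X1 = support[a], X2 = support[b])
p_direct = float(np.sum(np.triu(joint, k=1)) + 0.5 * np.trace(joint))

print(p_integral, p_direct)
```

Both routes give the same number, which is why the half weight on ties in the normalized distribution function is the natural convention here.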

**Hypothesis of No Treatment Effect ($H_0^p$)**

- The null hypothesis of no treatment effect can now be expressed as:
  $$
  H_0^p: p = \int F_1 \, dF_2 = \frac{1}{2}
  $$
  This states that, if the treatment has no effect, the probability that a value from the treatment distribution exceeds a value from the control distribution, with ties counted as one half, is 0.5. Neither group then tends to yield larger values than the other.


## Test Statistic

### Empirical Distribution Function

   - The empirical distribution function is defined as:
   $$
   \hat{F}_i(x) = \frac{1}{2}\left[\hat{F}_i^{-}(x) + \hat{F}_i^{+}(x)\right]
   $$
   - **Explanation**:

     - $ \hat{F}_i(x) $: Empirical distribution function for group $ i $.

     - $ \hat{F}_i^{-}(x) $: Left-continuous version of the empirical distribution function, the proportion of observations in group $ i $ that are strictly less than $ x $.

     - $ \hat{F}_i^{+}(x) $: Right-continuous version of the empirical distribution function, the proportion of observations in group $ i $ that are less than or equal to $ x $.

     - The average of these two versions gives the normalized empirical distribution function. It is still a step function, but at each jump it takes the midpoint value, which is what produces mid-ranks for tied observations.

### Combined Distribution Function

   - The combined distribution function is given by:
   $$
   H(x) = \sum_{i=1}^2 \frac{n_i}{N} F_i(x)
   $$
   - The empirical version:
   $$
   \hat{H}(x) = \sum_{i=1}^2 \frac{n_i}{N} \hat{F}_i(x)
   $$
   - **Explanation**:

     - $ H(x) $ and $ \hat{H}(x) $: Combined (true and empirical) distribution functions, weighted by the sample sizes of each group.

     - $ n_i $: The size of group $ i $.

     - $ N = n_1 + n_2 $: Total number of observations from both groups.

     - $ F_i(x) $ and $ \hat{F}_i(x) $: Distribution function and empirical distribution function for group $ i $.

### Rank of an Observation

   - The rank of an observation $ X_{ik} $ among all $ N $ observations is:
   $$
   R_{ik} = N \cdot \hat{H}\left(X_{ik}\right) + \frac{1}{2}
   $$
   - **Explanation**:

     - $ R_{ik} $: Rank of the $ k $-th observation in group $ i $ among all $ N $ observations.

     - $ \hat{H}\left(X_{ik}\right) $: Value of the empirical combined distribution function evaluated at $ X_{ik} $.

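     - The term $ \frac{1}{2} $ arises because the normalized empirical distribution function counts the observation $ X_{ik} $ itself with weight $ \frac{1}{2} $. Adding it back yields the usual rank, and tied observations automatically receive mid-ranks.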
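The definitions above can be checked numerically. The sketch below builds the normalized empirical distribution functions, combines them into $\hat{H}$, and confirms that $N \cdot \hat{H}(X_{ik}) + \frac{1}{2}$ reproduces the mid-ranks returned by `scipy.stats.rankdata`. The helper `normalized_ecdf` and the two small samples are illustrative assumptions.

```python
import numpy as np
from scipy.stats import rankdata


def normalized_ecdf(sample: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Normalized empirical CDF of `sample`, evaluated at each point of `x`."""
    sample = np.asarray(sample)[:, None]
    x = np.asarray(x)[None, :]
    left = np.mean(sample < x, axis=0)    # proportion strictly below x
    right = np.mean(sample <= x, axis=0)  # proportion at or below x
    return 0.5 * (left + right)


x1 = np.array([1, 2, 2, 4])     # hypothetical group 1
x2 = np.array([2, 3, 4, 4, 5])  # hypothetical group 2
n1, n2 = len(x1), len(x2)
N = n1 + n2
pooled = np.concatenate((x1, x2))

# Combined empirical distribution function, weighted by the group sizes.
H_hat = (n1 * normalized_ecdf(x1, pooled) + n2 * normalized_ecdf(x2, pooled)) / N

ranks_from_H = N * H_hat + 0.5
midranks = rankdata(pooled, method="average")

print(ranks_from_H)
print(midranks)
```

The two printed arrays agree, including for the tied values 2 and 4, so no separate tie handling is needed once the normalized version is used.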

### Mean Rank of a Group

   - The mean rank in group $ i $ is:
   $$
   \bar{R}_{i\cdot} = \frac{1}{n_i} \sum_{k=1}^{n_i} R_{ik}
   $$
   - **Explanation**:

     - $\bar{R}_{i\cdot}$: Mean of the ranks for observations in group $ i $.

     - $ R_{ik} $: Rank of each observation in group $ i $.

     - The mean rank is calculated by averaging the ranks of all observations within group $ i $.

### Estimator for $ p $

   - The estimator for the relative treatment effect $ p $ is therefore:
   $$
   \hat{p}=\int \hat{F}_1 \mathrm{~d} \hat{F}_2=\frac{1}{n_1}\left(\bar{R}_{2\cdot}-\frac{n_2+1}{2}\right)
   $$
   - **Explanation**:

     - $ \hat{p} $: Estimator for the relative treatment effect.

     - $ \bar{R}_{2\cdot} $: Mean rank of group 2 (typically the treatment group but can be reversed depending on the context).

     - $ n_2 $: Size of group 2.
     
     - $ \frac{n_2 + 1}{2} $: The mean of the **within-group ranks** of group 2. Ranking the $n_2$ observations of group 2 among themselves always gives ranks that average to $\frac{n_2 + 1}{2}$, with or without ties. Subtracting this value from $\bar{R}_{2\cdot}$ removes the contribution of comparisons within group 2, so that only comparisons against group 1 remain; dividing by $n_1$ turns the result into a proportion. (Under the null hypothesis, the expected mean **overall** rank of either group is $\frac{N + 1}{2}$, which gives $\hat{p} = \frac{1}{2}$.)
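As an illustration of the rank formula, the sketch below computes $\hat{p}$ from the mean overall rank of group 2 and compares it with the proportion obtained by comparing every pair of observations directly (1 if the group 2 value is larger, $\frac{1}{2}$ for a tie, 0 otherwise). The two small samples are made up for the example.

```python
import numpy as np
from scipy.stats import rankdata

x1 = np.array([1, 2, 2, 4])     # hypothetical group 1
x2 = np.array([2, 3, 4, 4, 5])  # hypothetical group 2
n1, n2 = len(x1), len(x2)

# Rank-based estimator: (mean overall rank of group 2 - (n2 + 1) / 2) / n1
pooled_ranks = rankdata(np.concatenate((x1, x2)), method="average")
mean_rank_2 = pooled_ranks[n1:].mean()
p_hat_ranks = (mean_rank_2 - (n2 + 1) / 2) / n1

# Pairwise estimator: average score over all n1 * n2 pairs.
diff = x2[None, :] - x1[:, None]
p_hat_pairs = np.mean((diff > 0) + 0.5 * (diff == 0))

print(p_hat_ranks, p_hat_pairs)
```

Both computations give the same value; the rank version is simply a cheaper way to count the pairwise comparisons.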

### Unbiasedness of $ \hat{p} $

   - The unbiasedness follows from:
   $$
   E\left[c\left(X_{21} - X_{11}\right)\right] = \int F_1 \, dF_2
   $$
   - **Explanation**:
     - $ E\left[c\left(X_{21} - X_{11}\right)\right] $: Expectation of a count function comparing observations from the two groups.
     - $ c(u) = 0, \frac{1}{2}, 1 $ depending on whether $ u <, =, > 0 $. This is the normalized count function defined as:

       $$
        c(u) = 
        \begin{cases} 
        0 & \text{if } u < 0 \\
        \frac{1}{2} & \text{if } u = 0 \\
        1 & \text{if } u > 0
        \end{cases}
      $$

        Where:
        $$
        u = X_{21} - X_{11}
        $$
        - $X_{21}$: Observation from the treatment group.
        - $X_{11}$: Observation from the control group.

        This function assigns:

        - $0$ if the treatment group observation is less than the control group’s.
        - $\frac{1}{2}$ if the observations are equal (a tie).
        - $1$ if the treatment group observation is greater.

        Averaged over all $n_1 n_2$ pairs of observations from the two groups, this function gives $\hat{p}$, an estimate of the probability that a treatment group observation is greater than a control group observation, with ties counted as one half.
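Unbiasedness can be illustrated with a small simulation: draw many pairs of samples from two known discrete distributions, compute $\hat{p}$ for each pair, and compare the average with the true $p$. The distributions, sample sizes, seed, and number of replications below are arbitrary choices for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
support = np.array([1, 2, 3, 4, 5])
pmf1 = np.array([0.5, 0.2, 0.1, 0.1, 0.1])  # hypothetical control distribution
pmf2 = np.array([0.1, 0.2, 0.2, 0.2, 0.3])  # hypothetical treatment distribution

# True relative effect: integral of the normalized F1 with respect to F2.
F1 = np.cumsum(pmf1) - 0.5 * pmf1
p_true = float(np.sum(F1 * pmf2))


def p_hat(x1: np.ndarray, x2: np.ndarray) -> float:
    """Average of c(X2 - X1) over all pairs, with c = 0, 1/2, 1."""
    diff = x2[None, :] - x1[:, None]
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))


n1, n2, n_sim = 6, 8, 20_000
estimates = [
    p_hat(rng.choice(support, size=n1, p=pmf1), rng.choice(support, size=n2, p=pmf2))
    for _ in range(n_sim)
]
mean_estimate = float(np.mean(estimates))

print(p_true, mean_estimate)
```

Even with samples as small as 6 and 8, the simulated mean of $\hat{p}$ should sit close to the true value, up to Monte Carlo error.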

## Asymptotic Normality

Under $H_0^p$, the standardized estimator is asymptotically standard normal:

$$
\frac{\sqrt{N}\left(\hat{p} - \frac{1}{2}\right)}{\sigma_N} = \frac{(\bar{R}_{2\cdot} - \bar{R}_{1\cdot})}{\sqrt{N\sigma^{2}_{N}}} \quad \xrightarrow{d} \quad \mathcal{N}(0, 1)
$$

where:
$$
\sigma_N^2 = N\left[\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}\right]
$$

and:

- $\sigma_1^2 = \operatorname{Var}\left(F_2(X_{11})\right)$: Variance of the cumulative distribution function $F_2$ evaluated at observations from the control group.
   
- $\sigma_2^2 = \operatorname{Var}\left(F_1(X_{21})\right)$: Variance of the cumulative distribution function $F_1$ evaluated at observations from the treatment group.

The statistic $\frac{\sqrt{N}\left(\hat{p} - \frac{1}{2}\right)}{\sigma_N}$ follows an asymptotic standard normal distribution $\mathcal{N}(0, 1)$ as $N \to \infty$, under the null hypothesis $H_0^p: p = \frac{1}{2}$.

### Variance Estimation

Note that, even under $H_0^p$, the variances $\sigma_1^2$ and $\sigma_2^2$ are unknown and must be estimated from the data. To do this, let 

$$
\begin{align*}
Y_{1k} = F_2\left(X_{1k}\right), k=1, \ldots, n_1
\end{align*}
$$ 

$$
\begin{align*}
Y_{2k} = F_1\left(X_{2k}\right), k=1, \ldots, n_2
\end{align*}
$$ 

The random variables $Y_{1k}$ and $Y_{2k}$ are independent by assumption. Given these, the following expression is an unbiased and consistent estimator of $\sigma_i^2, i=1,2$:

$$
\begin{align*}
\tilde{\sigma}_i^2 = \frac{1}{n_i-1} \sum_{k=1}^{n_i}\left(\textcolor{red}{Y_{ik}} - \textcolor{orange}{\bar{Y}_{i\cdot}}\right)^2
\end{align*}
$$

Since the random variables $Y_{ik}$ are unobservable, they must be replaced with observable variables "close enough" to the $Y_{ik}$. We replace the distribution functions $F_i(x)$ with their empirical counterparts $\hat{F}_i(x), i=1,2$. By definition:

$$
\begin{align*}
n_1 \hat{F}_1\left(X_{2k}\right) &= N \hat{H}\left(X_{2k}\right) - n_2 \hat{F}_2\left(X_{2k}\right) = R_{2k} - R_{2k}^{(2)} \\
n_2 \hat{F}_2\left(X_{1k}\right) &= N \hat{H}\left(X_{1k}\right) - n_1 \hat{F}_1\left(X_{1k}\right) = R_{1k} - R_{1k}^{(1)}
\end{align*}
$$

where 

* $R_{ik} = N \hat{H}\left(X_{ik}\right) + \frac{1}{2}$ denotes the (overall) rank of $X_{ik}$ among all $N$ observations
* $R_{ik}^{(i)} = n_i \hat{F}_i\left(X_{ik}\right) + \frac{1}{2}$ denotes the (within) rank of $X_{ik}$ among the $n_i$ observations within the $i$-th group $X_{i1}, \ldots, X_{in_i}, i=1,2$. 

In the case of ties, the mid-ranks come out automatically as already noted for the (overall) ranks $R_{ik} = N \hat{H}\left(X_{ik}\right) + \frac{1}{2}$. Then, the variances $\sigma_i^2$ are estimated by:

$$
\begin{align*}
\hat{\sigma}_i^2 = \frac{S_i^2}{\left(N - n_i\right)^2}
\end{align*}
$$

where:

* $S_i^2 = \frac{1}{n_i-1} \sum_{k=1}^{n_i}\left([\textcolor{red}{R_{ik} - R_{ik}^{(i)}}] - [\textcolor{orange}{\bar{R}_{i\cdot} - \frac{n_i + 1}{2}}]\right)^2$ is the empirical variance of $R_{ik} - R_{ik}^{(i)}, k=1, \ldots, n_i, i=1,2$.

Under $H_0^p$:

$$
\begin{align*}
\hat{\sigma}_N^2 = N \cdot\left[\frac{\hat{\sigma}_1^2}{n_1} + \frac{\hat{\sigma}_2^2}{n_2}\right]
\end{align*}
$$

is consistent for $\sigma_N^2$, and it follows that the statistic:

$$
\begin{align*}
W_N^{BF} = \frac{\sqrt{N}\left(\hat{p} - \frac{1}{2}\right)}{\hat{\sigma}_N} = \frac{1}{\sqrt{N}} \cdot \frac{\bar{R}_{2\cdot} - \bar{R}_{1\cdot}}{\hat{\sigma}_N}
\end{align*}
$$

has, asymptotically, a standard normal distribution under the hypothesis $H_0^p: p = \frac{1}{2}$.
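The steps above can be sketched in a few lines: overall and within-group mid-ranks, the empirical variances $S_i^2$, the estimates $\hat{\sigma}_i^2$ and $\hat{\sigma}_N^2$, and finally $W_N^{BF}$. The function name `brunner_munzel_statistic` is an assumption of this sketch; the data are the pain scores from the clinical trial example in this document, with the active treatment as group 1 and the control as group 2. The result is compared with `scipy.stats.brunnermunzel`.

```python
import numpy as np
from scipy.stats import brunnermunzel, rankdata


def brunner_munzel_statistic(x1, x2) -> float:
    """Compute W_N^BF from overall and within-group mid-ranks."""
    x1, x2 = np.asarray(x1), np.asarray(x2)
    n1, n2 = len(x1), len(x2)
    N = n1 + n2

    overall = rankdata(np.concatenate((x1, x2)), method="average")
    r1, r2 = overall[:n1], overall[n1:]

    # Differences R_ik - R_ik^(i) between overall and within-group ranks.
    d1 = r1 - rankdata(x1, method="average")
    d2 = r2 - rankdata(x2, method="average")

    # S_i^2 is the empirical variance of these differences.
    s1_sq = np.var(d1, ddof=1)
    s2_sq = np.var(d2, ddof=1)

    sigma1_sq = s1_sq / (N - n1) ** 2
    sigma2_sq = s2_sq / (N - n2) ** 2
    sigma_N = np.sqrt(N * (sigma1_sq / n1 + sigma2_sq / n2))

    return float((r2.mean() - r1.mean()) / (np.sqrt(N) * sigma_N))


active = np.array([1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2, 4, 1, 1])
control = np.array([3, 3, 4, 3, 1, 2, 3, 1, 1, 5, 4])

w_manual = brunner_munzel_statistic(active, control)
w_scipy = float(brunnermunzel(active, control).statistic)

print(w_manual, w_scipy)
```

Passing the samples in the same order to both functions gives matching values; swapping the two samples flips the sign of the statistic.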

## Small Sample Approximation (Central $t$-Distribution)

Based on simulation studies, the distribution of $\hat{\sigma}_N^2 = N \cdot\left[\frac{\hat{\sigma}_1^2}{n_1} + \frac{\hat{\sigma}_2^2}{n_2}\right]$ becomes degenerate quickly, at the rate of $1 / N$, since $\hat{\sigma}_N^2$ is consistent for $\sigma_N^2$. 

As a result, the small sample distribution of $W_N^{BF}$ may be approximated by a distribution that converges to the standard normal distribution as the sample size increases. A simulation study showed that the quality of this approximation depends mainly on the **ratio of the variances $\sigma^{2}_{i}$**, the **individual sample sizes $n_i$**, and the **total sample size** $N$.

To address small sample sizes, an approximation using a $t$-distribution is employed, where the degrees of freedom are based on the parametric Satterthwaite-Smith-Welch (SSW) approximation. Specifically, for small samples, the null distribution of $W_N^{BF}$ is approximated by a central $t$-distribution with:

$$
\begin{align*}
\hat{f} &= \frac{\left(\sum_{i=1}^2 \frac{\hat{\sigma}_i^2}{n_i}\right)^2}{\sum_{i=1}^2 \frac{\left(\hat{\sigma}_i^2 / n_i\right)^2}{\left(n_i-1\right)}} \\
&= \frac{\left(\sum_{i=1}^2 \frac{S_i^2}{\left(N-n_i\right)}\right)^2}{\sum_{i=1}^2 \frac{\left[S_i^2 /\left(N-n_i\right)\right]^2}{\left(n_i-1\right)}}
\end{align*}
$$

degrees of freedom, where $S_i^2$ is, as defined above:

$$
\begin{align*}
S_i^2 = \frac{1}{n_i-1} \sum_{k=1}^{n_i}\left([\textcolor{red}{R_{ik} - R_{ik}^{(i)}}] - [\textcolor{orange}{\bar{R}_{i\cdot} - \frac{n_i + 1}{2}}]\right)^2
\end{align*}
$$

As $\hat{f} \rightarrow \infty$, the $t_{\hat{f}}$-distribution converges to the standard normal distribution, making the approximation asymptotically accurate.
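A minimal sketch of the degrees-of-freedom formula follows, using the second form of $\hat{f}$ in terms of $S_i^2 / (N - n_i)$. The function name `satterthwaite_df` is an assumption of this sketch, and the data are the pain scores from the clinical trial example in this document (active treatment as group 1, control as group 2). The one-sided $p$-value for $H_1^p: p > \frac{1}{2}$ is then the upper tail of the $t_{\hat{f}}$-distribution at $W_N^{BF}$, which is compared against SciPy's own result.

```python
import numpy as np
from scipy.stats import brunnermunzel, rankdata
from scipy.stats import t as t_dist


def satterthwaite_df(x1, x2) -> float:
    """Estimated degrees of freedom f_hat for the t-approximation."""
    x1, x2 = np.asarray(x1), np.asarray(x2)
    n = np.array([len(x1), len(x2)])
    N = n.sum()

    overall = rankdata(np.concatenate((x1, x2)), method="average")
    d1 = overall[: n[0]] - rankdata(x1, method="average")
    d2 = overall[n[0] :] - rankdata(x2, method="average")

    # S_i^2: empirical variances of the rank differences R_ik - R_ik^(i).
    s_sq = np.array([np.var(d1, ddof=1), np.var(d2, ddof=1)])

    terms = s_sq / (N - n)
    return float(terms.sum() ** 2 / np.sum(terms**2 / (n - 1)))


active = np.array([1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2, 4, 1, 1])
control = np.array([3, 3, 4, 3, 1, 2, 3, 1, 1, 5, 4])

f_hat = satterthwaite_df(active, control)
w = float(brunnermunzel(active, control).statistic)  # W_N^BF, group 2 = control

# Upper-tail probability of the t distribution with f_hat degrees of freedom.
p_one_sided = float(t_dist.sf(w, f_hat))

# SciPy's one-sided p-value for "control tends to be greater than active".
p_scipy = float(
    brunnermunzel(control, active, alternative="greater", distribution="t").pvalue
)

print(f_hat, p_one_sided, p_scipy)
```

The statistic is taken from SciPy here only to keep the sketch short; any implementation of $W_N^{BF}$ could be substituted.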

### Rough Practical Rule of Thumb 

> The test based on the statistic $W_{N}^{BF}$ however, was more or less liberal for medium or small sample sizes (smaller than about 50) and was quite accurate for larger sample sizes.

Therefore, for $n_i \leq 50$, the $t$-distribution approximation is recommended, while for larger sample sizes, the standard normal distribution approximation is sufficient.

For $n_i < 10$, the [permutation test](https://www.sciencedirect.com/science/article/abs/pii/S0167947306001885?via%3Dihub) can be considered, as the $t$-distribution approximation cannot be expected to be accurate.
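The idea of a studentized permutation test can be sketched as follows: compute $W_N^{BF}$ on the observed data, then repeatedly shuffle the pooled observations, split them back into groups of sizes $n_1$ and $n_2$, and recompute the statistic. The one-sided $p$-value is the proportion of permuted statistics at least as large as the observed one. This is a rough illustration under simplifying assumptions (random rather than exhaustive permutations, a fixed seed, infinite statistics kept as they are when both variance estimates are zero); it is not claimed to reproduce the exact procedure of Neubert and Brunner. The names `bm_statistic` and `permutation_pvalue` are hypothetical, and the data are the pain scores from the clinical trial example in this document.

```python
import numpy as np
from scipy.stats import rankdata


def bm_statistic(x1: np.ndarray, x2: np.ndarray) -> float:
    """W_N^BF written as n1*n2*(R2bar - R1bar) / (N * sqrt(n1*S1^2 + n2*S2^2))."""
    n1, n2 = len(x1), len(x2)
    overall = rankdata(np.concatenate((x1, x2)), method="average")
    r1, r2 = overall[:n1], overall[n1:]
    s1_sq = np.var(r1 - rankdata(x1, method="average"), ddof=1)
    s2_sq = np.var(r2 - rankdata(x2, method="average"), ddof=1)
    # A zero variance estimate yields +/-inf; silence the resulting warnings.
    with np.errstate(divide="ignore", invalid="ignore"):
        denom = (n1 + n2) * np.sqrt(n1 * s1_sq + n2 * s2_sq)
        return float(n1 * n2 * (r2.mean() - r1.mean()) / denom)


def permutation_pvalue(x1, x2, n_perm: int = 2000, seed: int = 0) -> float:
    """One-sided Monte Carlo permutation p-value for H1: p > 1/2."""
    rng = np.random.default_rng(seed)
    x1, x2 = np.asarray(x1), np.asarray(x2)
    n1 = len(x1)
    pooled = np.concatenate((x1, x2))
    observed = bm_statistic(x1, x2)

    count = 0
    for _ in range(n_perm):
        shuffled = rng.permutation(pooled)
        if bm_statistic(shuffled[:n1], shuffled[n1:]) >= observed:
            count += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (count + 1) / (n_perm + 1)


active = np.array([1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2, 4, 1, 1])
control = np.array([3, 3, 4, 3, 1, 2, 3, 1, 1, 5, 4])

p_perm = permutation_pvalue(active, control)
print(p_perm)
```

The closed form used in `bm_statistic` is algebraically the same as $\frac{1}{\sqrt{N}} \cdot \frac{\bar{R}_{2\cdot} - \bar{R}_{1\cdot}}{\hat{\sigma}_N}$ after substituting $\hat{\sigma}_i^2 = S_i^2 / (N - n_i)^2$.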

## Clinical Trial Example

In this example, the statistic $W_N^{BF}$ and its $t$-distribution approximation with $\hat{f}$ degrees of freedom are applied to a clinical trial from [Lumley (1996)](https://pubmed.ncbi.nlm.nih.gov/9032714/). The trial observed ordinal pain scores (1–5) for 25 female patients post-laparoscopic surgery. Two treatments (active $Y$ and control $N$) were randomly assigned: 14 patients received $Y$ and 11 received $N$. The table below shows the pain scores on the third day after surgery.

<center>

| Treatment | Pain Scores          |
|-----------|----------------------|
| $Y$       | 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2, 4, 1, 1 |
| $N$       | 3, 3, 4, 3, 1, 2, 3, 1, 1, 5, 4          |

</center>

The physician aimed to determine if the pain scores for $Y$ were generally lower than for $N$. The relative treatment effect is defined as $p = \int F_1 \, dF_2$, where $F_1$ and $F_2$ are distributions for $Y$ and $N$, respectively. 

Testing $H_0^p: p = \frac{1}{2}$ against the one-sided alternative $H_1^p: p > \frac{1}{2}$, where $p = \textcolor{red}{P(X_{11} < X_{21})} + \frac{1}{2}P(X_{11} = X_{21})$, is appropriate. The benefit of $Y$ can be estimated by $\hat{p} = \frac{1}{n_1}\left(\bar{R}_{2\cdot} - \frac{n_2+1}{2}\right)$, where $\bar{R}_{2\cdot}$ is the mean of the mid-ranks of the observed pain scores under the control treatment $N$. Because group 2 is the control group in this example, $p > \frac{1}{2}$ corresponds to the hypothesis that pain scores under the control tend to be higher than under the active treatment, which is why the test is one-sided and $\bar{R}_{2\cdot}$ is the mean mid-rank of the control group rather than the treatment group.

```python
from typing import Tuple

import numpy as np
from scipy.stats import brunnermunzel, rankdata
```

```python
treatment = np.array([1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2, 4, 1, 1])
control = np.array([3, 3, 4, 3, 1, 2, 3, 1, 1, 5, 4])

treatment.shape, control.shape
```

```
((14,), (11,))
```

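```python
# One-sided alternative: the control group tends to be greater than the treatment group.
# With x=control and y=treatment, SciPy's statistic has the sign of
# (mean rank of y - mean rank of x), so it is negative here.
brunnermunzel(x=control, y=treatment, alternative="greater", distribution="t")
```

```
BrunnerMunzelResult(statistic=np.float64(-3.1374674823029505), pvalue=np.float64(0.002893104333075734))
```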

To compute $\hat{p} = \frac{1}{n_1}\left(\bar{R}_{2\cdot} - \frac{n_2+1}{2}\right)$, we use a helper adapted from the [statsmodels](https://www.statsmodels.org/stable/generated/statsmodels.stats.nonparametric.rankdata_2samp.html#statsmodels.stats.nonparametric.rankdata_2samp) function `rankdata_2samp`:

```python
def rankdata_2samp(
    x1: np.ndarray, x2: np.ndarray
) -> Tuple[np.ndarray, np.ndarray, np.ndarray, np.ndarray]:
    """
    Compute midranks for two samples.

    Parameters
    ----------
    x1, x2 : array_like
        Original data for two samples that will be converted to midranks.

    Returns
    -------
    rank1 : ndarray
        Midranks of the first sample in the pooled sample.
    rank2 : ndarray
        Midranks of the second sample in the pooled sample.
    ranki1 : ndarray
        Internal midranks of the first sample.
    ranki2 : ndarray
        Internal midranks of the second sample.
    """
    x1 = np.asarray(x1)
    x2 = np.asarray(x2)

    nobs1 = len(x1)
    nobs2 = len(x2)
    if nobs1 == 0 or nobs2 == 0:
        raise ValueError("One sample has zero length")

    # Midranks in the pooled sample, split back into the two groups
    x_combined = np.concatenate((x1, x2))
    rank = rankdata(x_combined, method="average")
    rank1 = rank[:nobs1]
    rank2 = rank[nobs1:]

    # Internal midranks: the rank of each element within its own sample
    ranki1 = rankdata(x1)
    ranki2 = rankdata(x2)

    return rank1, rank2, ranki1, ranki2
```

```python
rank_control, rank_treatment, _, _ = rankdata_2samp(control, treatment)

n_control, n_treatment = len(control), len(treatment)

mean_rank_control = np.mean(rank_control)
mean_rank_treatment = np.mean(rank_treatment)

mean_rank_control, mean_rank_treatment
```

```
(np.float64(17.045454545454547), np.float64(9.821428571428571))
```

```python
relative_treatment_effect_estimate = (
    mean_rank_control - (n_control + 1) / 2
) / n_treatment

print(f"Relative treatment effect estimate: {relative_treatment_effect_estimate:.4f}")
```

```
Relative treatment effect estimate: 0.7890
```
