## Fukunaga (1990), *Introduction to Statistical Pattern Recognition (2nd ed.)*
***

Fukunaga applies simultaneous diagonalization to several problems in multivariate statistics. We begin with the simultaneous diagonalization method itself.

### Introduction

In this document we consider the problem of diagonalizing $k$ matrices of dimension $n \times n$. That is, let $\textbf{A}_1, \ldots, \textbf{A}_k \in \mathbb{R}^{n \times n}$. We say that $\textbf{Q}$ *simultaneously diagonalizes* $\textbf{A}_1, \ldots, \textbf{A}_k$ if

$$\textbf{Q}'\textbf{A}_i \textbf{Q} = \textbf{D}_i,$$

where $\textbf{D}_i$ is diagonal for all $i = 1, \ldots, k$.


### Theorem (p. 31)

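Let $\boldsymbol{\Sigma}_1$ and $\boldsymbol{\Sigma}_2$ be $p \times p$ symmetric matrices, with $\boldsymbol{\Sigma}_1$ positive definite so that $\boldsymbol{\Theta}^{-1/2}$ exists (e.g., nonsingular covariance matrices).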

(1) First, we whiten $\boldsymbol{\Sigma}_1$ by

$$
\textbf{Y} = \boldsymbol{\Theta}^{-1/2} \boldsymbol{\Phi}' \textbf{X},
$$

where $\boldsymbol{\Theta}$ and $\boldsymbol{\Phi}$ are the eigenvalue and eigenvector matrices of $\boldsymbol{\Sigma}_1$, respectively, as

$$
\boldsymbol{\Sigma}_1 \boldsymbol{\Phi} = \boldsymbol{\Phi} \boldsymbol{\Theta} \quad \text{and} \quad \boldsymbol{\Phi}'\boldsymbol{\Phi} = \textbf{I}_p.
$$

Then, $\boldsymbol{\Sigma}_1$ and $\boldsymbol{\Sigma}_2$ are transformed to

\begin{align}
    \boldsymbol{\Theta}^{-1/2} \boldsymbol{\Phi}' \boldsymbol{\Sigma}_1 \boldsymbol{\Phi} \boldsymbol{\Theta}^{-1/2} &= \textbf{I}_p \\
    \boldsymbol{\Theta}^{-1/2} \boldsymbol{\Phi}' \boldsymbol{\Sigma}_2 \boldsymbol{\Phi} \boldsymbol{\Theta}^{-1/2} &= \textbf{K}.
\end{align}

In general, $\textbf{K}$ is not a diagonal matrix.

(2) Second, we apply the orthonormal transformation to diagonalize $\textbf{K}$. That is,

$$
\textbf{Z} = \boldsymbol{\Psi}' \textbf{Y},
$$

where $\boldsymbol{\Psi}$ and $\boldsymbol{\Lambda}$ are the eigenvector and eigenvalue matrices of $\textbf{K}$ as

$$
\textbf{K} \boldsymbol{\Psi} = \boldsymbol{\Psi} \boldsymbol{\Lambda} \quad \text{and} \quad \boldsymbol{\Psi}'\boldsymbol{\Psi} = \textbf{I}_p.
$$

Equation 2.92 states that a covariance matrix is invariant under any orthonormal transformation after a whitening transformation. Hence, the whitened $\boldsymbol{\Sigma}_1$, i.e., $\boldsymbol{\Theta}^{-1/2} \boldsymbol{\Phi}' \boldsymbol{\Sigma}_1 \boldsymbol{\Phi} \boldsymbol{\Theta}^{-1/2}$, is invariant under the transformation $\boldsymbol{\Psi}$. Thus,

\begin{align}
\boldsymbol{\Psi}'\boldsymbol{\Theta}^{-1/2} \boldsymbol{\Phi}' \boldsymbol{\Sigma}_1 \boldsymbol{\Phi} \boldsymbol{\Theta}^{-1/2}\boldsymbol{\Psi} &= \boldsymbol{\Psi}'\boldsymbol{\Psi} = \textbf{I}_p, \\
\boldsymbol{\Psi}' \textbf{K} \boldsymbol{\Psi} &= \boldsymbol{\Lambda}.
\end{align}

Thus, both matrices, $\boldsymbol{\Sigma}_1$ and $\boldsymbol{\Sigma}_2$, are diagonalized. The combination of steps (1) and (2) gives the overall transformation matrix $\boldsymbol{\Phi} \boldsymbol{\Theta}^{-1/2} \boldsymbol{\Psi}$. The following figure shows a 2-dimensional example of this process.

![Figure](simultaneous-diagonalization-example.png)
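
The two steps above can be sketched numerically. The following is a minimal NumPy sketch, not taken from Fukunaga: the function name `simultaneous_diagonalization` and the example matrices are hypothetical, and it assumes $\boldsymbol{\Sigma}_1$ is symmetric positive definite so that $\boldsymbol{\Theta}^{-1/2}$ exists.

```python
import numpy as np

def simultaneous_diagonalization(sigma1, sigma2):
    """Two-step simultaneous diagonalization (hypothetical helper).

    Assumes sigma1 is symmetric positive definite and sigma2 is symmetric.
    Returns (A, lam) with A' sigma1 A = I and A' sigma2 A = diag(lam).
    """
    # Step 1: whiten sigma1 using sigma1 Phi = Phi Theta.
    theta, phi = np.linalg.eigh(sigma1)
    whitener = phi @ np.diag(theta ** -0.5)   # Phi Theta^{-1/2}
    k = whitener.T @ sigma2 @ whitener        # K, not diagonal in general

    # Step 2: rotate with the orthonormal eigenvectors Psi of K.
    lam, psi = np.linalg.eigh(k)
    return whitener @ psi, lam                # A = Phi Theta^{-1/2} Psi

# Made-up 2 x 2 example matrices.
sigma1 = np.array([[2.0, 0.5], [0.5, 1.0]])
sigma2 = np.array([[1.0, -0.3], [-0.3, 3.0]])
A, lam = simultaneous_diagonalization(sigma1, sigma2)
```

For these inputs, `A.T @ sigma1 @ A` should equal the identity and `A.T @ sigma2 @ A` should equal `np.diag(lam)`, both up to floating-point error.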


The matrices $\boldsymbol{\Phi} \boldsymbol{\Theta}^{-1/2} \boldsymbol{\Psi}$ and $\boldsymbol{\Lambda}$ can be calculated directly from $\boldsymbol{\Sigma}_1$ and $\boldsymbol{\Sigma}_2$ without going through the two steps above as shown in the following theorem.
***

### Theorem (p. 32)

Let $\boldsymbol{\Sigma}_1$ and $\boldsymbol{\Sigma}_2$ be $p \times p$ symmetric matrices, with $\boldsymbol{\Sigma}_1$ positive definite.
Then $\boldsymbol{\Sigma}_1$ and $\boldsymbol{\Sigma}_2$ are simultaneously diagonalized by $\textbf{A}$ as

$$
\textbf{A}' \boldsymbol{\Sigma}_1 \textbf{A} = \textbf{I}_p \quad \text{and} \quad \textbf{A}' \boldsymbol{\Sigma}_2 \textbf{A} = \boldsymbol{\Lambda},
$$

where $\textbf{A}$ and $\boldsymbol{\Lambda}$ are the eigenvector and eigenvalue matrices of $\boldsymbol{\Sigma}_1^{-1} \boldsymbol{\Sigma}_2$, respectively, such that

$$
\boldsymbol{\Sigma}_1^{-1} \boldsymbol{\Sigma}_2 \textbf{A} = \textbf{A} \boldsymbol{\Lambda}.
$$

#### Proof

Because $\textbf{K} \boldsymbol{\Psi} = \boldsymbol{\Psi} \boldsymbol{\Lambda}$, we know that the eigenvalues of $\textbf{K}$ satisfy

$$
|\textbf{K} - \lambda \textbf{I}_p| = 0.
$$

Hence, recalling that $\boldsymbol{\Theta}^{-1/2} \boldsymbol{\Phi}' \boldsymbol{\Sigma}_2 \boldsymbol{\Phi} \boldsymbol{\Theta}^{-1/2} = \textbf{K}$ and $\boldsymbol{\Theta}^{-1/2} \boldsymbol{\Phi}' \boldsymbol{\Sigma}_1 \boldsymbol{\Phi} \boldsymbol{\Theta}^{-1/2} = \textbf{I}_p$, we have

\begin{align}
0
    &= \left|\boldsymbol{\Theta}^{-1/2} \boldsymbol{\Phi}' \boldsymbol{\Sigma}_2 \boldsymbol{\Phi} \boldsymbol{\Theta}^{-1/2} - \lambda \boldsymbol{\Theta}^{-1/2} \boldsymbol{\Phi}' \boldsymbol{\Sigma}_1 \boldsymbol{\Phi} \boldsymbol{\Theta}^{-1/2} \right| \\
    &= \left|\boldsymbol{\Theta}^{-1/2} \boldsymbol{\Phi}'(\boldsymbol{\Sigma}_2 - \lambda \boldsymbol{\Sigma}_1 )\boldsymbol{\Phi} \boldsymbol{\Theta}^{-1/2}\right| \\
    &= \left|\boldsymbol{\Theta}^{-1/2} \boldsymbol{\Phi}'\right| \left| \boldsymbol{\Sigma}_2 - \lambda \boldsymbol{\Sigma}_1 \right| \left|\boldsymbol{\Phi} \boldsymbol{\Theta}^{-1/2}\right|.
\end{align}

Because $\boldsymbol{\Theta}^{-1/2} \boldsymbol{\Phi}'$ is nonsingular, $\left|\boldsymbol{\Theta}^{-1/2} \boldsymbol{\Phi}'\right| \ne 0$, and likewise $\left|\boldsymbol{\Phi} \boldsymbol{\Theta}^{-1/2}\right| \ne 0$. Hence, $\left| \boldsymbol{\Sigma}_2 - \lambda \boldsymbol{\Sigma}_1 \right| = 0$, which, after multiplying by $\left|\boldsymbol{\Sigma}_1^{-1}\right|$, implies $\left| \boldsymbol{\Sigma}_1^{-1} \boldsymbol{\Sigma}_2 - \lambda \textbf{I}_p \right| = 0$. Therefore, $\boldsymbol{\Lambda}$ is the eigenvalue matrix of $\boldsymbol{\Sigma}_1^{-1} \boldsymbol{\Sigma}_2$.

Next, we show that $\textbf{A} = \boldsymbol{\Phi} \boldsymbol{\Theta}^{-1/2} \boldsymbol{\Psi}$ is the eigenvector matrix of $\boldsymbol{\Sigma}_1^{-1} \boldsymbol{\Sigma}_2$. Substituting $\boldsymbol{\Theta}^{-1/2} \boldsymbol{\Phi}' \boldsymbol{\Sigma}_2 \boldsymbol{\Phi} \boldsymbol{\Theta}^{-1/2} = \textbf{K}$ into $\textbf{K} \boldsymbol{\Psi} = \boldsymbol{\Psi} \boldsymbol{\Lambda}$, we see that

$$
\boldsymbol{\Theta}^{-1/2} \boldsymbol{\Phi}' \boldsymbol{\Sigma}_2 \boldsymbol{\Phi} \boldsymbol{\Theta}^{-1/2} \boldsymbol{\Psi} = \boldsymbol{\Psi} \boldsymbol{\Lambda},
$$

which implies

$$
\boldsymbol{\Sigma}_2 \boldsymbol{\Phi} \boldsymbol{\Theta}^{-1/2} \boldsymbol{\Psi} = \left( \boldsymbol{\Theta}^{-1/2} \boldsymbol{\Phi}' \right)^{-1} \boldsymbol{\Psi} \boldsymbol{\Lambda}.
$$

Because $\boldsymbol{\Theta}^{-1/2} \boldsymbol{\Phi}' \boldsymbol{\Sigma}_1 \boldsymbol{\Phi} \boldsymbol{\Theta}^{-1/2} = \textbf{I}_p$, it follows that

$$
\left( \boldsymbol{\Theta}^{-1/2} \boldsymbol{\Phi}' \right)^{-1} = \boldsymbol{\Sigma}_1 \boldsymbol{\Phi} \boldsymbol{\Theta}^{-1/2}.
$$

Thus,

\begin{align}
\boldsymbol{\Sigma}_2 \boldsymbol{\Phi} \boldsymbol{\Theta}^{-1/2} \boldsymbol{\Psi} &= \boldsymbol{\Sigma}_1 \boldsymbol{\Phi} \boldsymbol{\Theta}^{-1/2} \boldsymbol{\Psi} \boldsymbol{\Lambda} \\
\Rightarrow
\boldsymbol{\Sigma}_1^{-1} \boldsymbol{\Sigma}_2 \boldsymbol{\Phi} \boldsymbol{\Theta}^{-1/2} \boldsymbol{\Psi} &= \boldsymbol{\Phi} \boldsymbol{\Theta}^{-1/2} \boldsymbol{\Psi} \boldsymbol{\Lambda} \\
\Rightarrow
\boldsymbol{\Sigma}_1^{-1} \boldsymbol{\Sigma}_2 \textbf{A} &= \textbf{A} \boldsymbol{\Lambda}.
\end{align}

$$\tag*{$\blacksquare$}$$

It is important to note that $\boldsymbol{\Sigma}_1^{-1} \boldsymbol{\Sigma}_2$ is not symmetric in general, and consequently its eigenvectors $\textbf{a}_j$ (the columns of $\textbf{A}$) are not mutually orthogonal in general, i.e., $\textbf{a}_i'\textbf{a}_j \ne 0$ for $i \ne j$. Instead, the $\textbf{a}_j$'s are orthogonal with respect to $\boldsymbol{\Sigma}_1$, i.e., $\textbf{a}_i' \boldsymbol{\Sigma}_1 \textbf{a}_j = 0$ for $i \ne j$. Furthermore, to make the $\textbf{a}_j$'s orthonormal with respect to $\boldsymbol{\Sigma}_1$, so that $\textbf{A}' \boldsymbol{\Sigma}_1 \textbf{A} = \textbf{I}_p$, each $\textbf{a}_j$ must be divided by $\sqrt{\textbf{a}_j' \boldsymbol{\Sigma}_1 \textbf{a}_j}$ such that

$$
\dfrac{\textbf{a}_j'}{\sqrt{\textbf{a}_j' \boldsymbol{\Sigma}_1 \textbf{a}_j}} \boldsymbol{\Sigma}_1 \dfrac{\textbf{a}_j}{\sqrt{\textbf{a}_j' \boldsymbol{\Sigma}_1 \textbf{a}_j}} = 1.
$$
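
As a rough numerical illustration of the direct method and the rescaling above, one can take the eigenvectors of $\boldsymbol{\Sigma}_1^{-1} \boldsymbol{\Sigma}_2$ and normalize each one with respect to $\boldsymbol{\Sigma}_1$. This is only a sketch with made-up matrices, not code from the text. It assumes $\boldsymbol{\Sigma}_1$ is positive definite and that the eigenvalues are distinct, so that the eigenvectors are already orthogonal with respect to $\boldsymbol{\Sigma}_1$.

```python
import numpy as np

# Made-up symmetric matrices; sigma1 is positive definite.
sigma1 = np.array([[2.0, 0.5], [0.5, 1.0]])
sigma2 = np.array([[1.0, -0.3], [-0.3, 3.0]])

# Eigen-decomposition of the (generally non-symmetric) Sigma_1^{-1} Sigma_2.
lam, vecs = np.linalg.eig(np.linalg.solve(sigma1, sigma2))
lam, vecs = lam.real, vecs.real   # eigenvalues are real under the assumptions above

# Divide each column by sqrt(column' Sigma_1 column) so that A' Sigma_1 A = I.
scale = np.sqrt(np.diag(vecs.T @ sigma1 @ vecs))
A = vecs / scale

# The columns of A are Sigma_1-orthonormal but not orthogonal in the usual sense.
inner = A[:, 0] @ A[:, 1]
```

Here `inner` is nonzero, while `A.T @ sigma1 @ A` is the identity up to rounding.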

> Simultaneous diagonalization of two matrices is a very powerful tool in pattern recognition because many problems of pattern recognition consider two distributions for classification purposes. Also, there are many possible modifications of the above discussion. These depend on what kind of properties we are interested in, what kind of matrices are used, etc. In this section we will show one of the modifications that will be used in later chapters.

***

### Theorem (p. 33)

Let a matrix $\textbf{Q}$ be given by a linear combination of two symmetric matrices $\textbf{Q}_1$ and $\textbf{Q}_2$ as

$$
\textbf{Q} = a_1 \textbf{Q}_1 + a_2 \textbf{Q}_2,
$$

where $a_1, a_2 > 0$. If we normalize the eigenvectors with respect to $\textbf{Q}$ so that $\textbf{A}' \textbf{Q} \textbf{A} = \textbf{I}_p$, then $\textbf{Q}_1$ and $\textbf{Q}_2$ share the same eigenvectors, and their eigenvalues are reversely ordered as

\begin{align}
\lambda_1^{(1)} > \lambda_2^{(1)} > \ldots > \lambda_p^{(1)} &\text{ for } \textbf{Q}_1, \\
\lambda_1^{(2)} < \lambda_2^{(2)} < \ldots < \lambda_p^{(2)} &\text{ for } \textbf{Q}_2.
\end{align}

#### Proof

Let $\textbf{Q}$ and $\textbf{Q}_1$ be diagonalized simultaneously such that

$$
\textbf{A}' \textbf{Q} \textbf{A} = \textbf{I}_p \quad \text{and} \quad \textbf{A}' \textbf{Q}_1 \textbf{A} = \boldsymbol{\Lambda}^{(1)},
$$

where

$$
\textbf{Q}^{-1} \textbf{Q}_1 \textbf{A} = \textbf{A} \boldsymbol{\Lambda}^{(1)}.
$$

Then $\textbf{Q}_2$ is also diagonalized because

\begin{align}
    \textbf{I}_p
    &= \textbf{A}' \textbf{Q} \textbf{A} \\
    &= \textbf{A}' (a_1 \textbf{Q}_1 + a_2 \textbf{Q}_2) \textbf{A} \\
    &= a_1 \textbf{A}' \textbf{Q}_1 \textbf{A} + a_2 \textbf{A}' \textbf{Q}_2 \textbf{A} \\
    &= a_1 \boldsymbol{\Lambda}^{(1)} + a_2 \textbf{A}' \textbf{Q}_2 \textbf{A},
\end{align}

which implies that

$$
\textbf{A}' \textbf{Q}_2 \textbf{A} = \dfrac{1}{a_2} \left(\textbf{I}_p - a_1 \boldsymbol{\Lambda}^{(1)} \right).
$$

That is,

$$
\lambda_j^{(2)} = \dfrac{1 - a_1 \lambda_j^{(1)}}{a_2},
$$

which implies that, if $\lambda_i^{(1)} > \lambda_j^{(1)}$, then $\lambda_i^{(2)} < \lambda_j^{(2)}$. Furthermore, $\textbf{Q}_1$ and $\textbf{Q}_2$ share the same eigenvectors that are normalized with respect to $\textbf{Q}$.

$$\tag*{$\blacksquare$}$$
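
The reversed ordering can be checked numerically. The sketch below uses made-up matrices `q1`, `q2` and weights `a1`, `a2` (all hypothetical) and assumes $\textbf{Q}$ is positive definite so that it can be whitened.

```python
import numpy as np

# Made-up symmetric matrices and positive weights.
q1 = np.array([[3.0, 1.0], [1.0, 2.0]])
q2 = np.array([[1.0, -0.5], [-0.5, 4.0]])
a1, a2 = 0.4, 0.6
q = a1 * q1 + a2 * q2

# Whiten Q, then diagonalize the whitened Q1, so that A' Q A = I.
theta, phi = np.linalg.eigh(q)
w = phi @ np.diag(theta ** -0.5)
lam1, psi = np.linalg.eigh(w.T @ q1 @ w)   # eigenvalues in ascending order
A = w @ psi

# The same A diagonalizes Q2; read its eigenvalues off the diagonal.
q2_transformed = A.T @ q2 @ A
lam2 = np.diag(q2_transformed)
```

With `lam1` in ascending order, `lam2` comes out in descending order and matches `(1 - a1 * lam1) / a2`.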

***

Fukunaga defines the autocorrelation matrix $\textbf{S}$ and the covariance matrix $\boldsymbol{\Sigma}$ as

\begin{align}
\textbf{S} &= E[\textbf{XX}'], \\
\boldsymbol{\Sigma} &= \textbf{S} - \textbf{mm}',
\end{align}

where $\textbf{m} = E[\textbf{X}]$.
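
For intuition, the two definitions can be estimated from samples. This is a minimal sketch with synthetic data, assuming expectations are replaced by sample averages (so the covariance is the biased, divide-by-$N$ estimate); the variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: each row is one sample of X.
X = rng.normal(loc=[1.0, -2.0], scale=[1.0, 0.5], size=(1000, 2))

m = X.mean(axis=0)             # sample estimate of E[X]
S = X.T @ X / len(X)           # sample estimate of E[XX']
Sigma = S - np.outer(m, m)     # covariance via S - mm'
```

The result agrees with `np.cov(X, rowvar=False, bias=True)`.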

The following example uses the Theorem above.

### Example (p. 34)

Let $\textbf{S}$ be the mixture autocorrelation matrix of two distributions whose autocorrelation matrices are $\textbf{S}_1$ and $\textbf{S}_2$ and whose a priori probabilities are $P_1$ and $P_2$ (for classes $\omega_1$ and $\omega_2$). Then

\begin{align}
    \textbf{S}
    &= E[\textbf{XX}'] \\
    &= P_1 E[\textbf{XX}' | \omega_1] + P_2 E[\textbf{XX}' | \omega_2] \\
    &= P_1 \textbf{S}_1 + P_2 \textbf{S}_2.
\end{align}

> Thus, by the above theorem, we can diagonalize $\textbf{S}_1$ and $\textbf{S}_2$ with the same set of eigenvectors. Since the eigenvalues are ordered in reverse, the eigenvector with the largest eigenvalue for the first distribution has the least eigenvalue for the second, and vice versa. **This property can be used to extract features important to distinguish two distributions.**

This important finding is the so-called [Fukunaga–Koontz Transform](http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=1671511). Here are a couple of examples that cite this transform.

* [Paper #1](http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=7044575)
* [Paper #2](https://users.ece.cmu.edu/~juefeix/felix_pr16_fkda.pdf)

***

### Relationship between $|\textbf{S}|$ and $|\boldsymbol{\Sigma}|$ (p. 38)

Simultaneous diagonalization enables us to establish the relationship between the determinants of the autocorrelation matrix $\textbf{S}$ and the covariance matrix $\boldsymbol{\Sigma}$.

Note that $\textbf{S} = \boldsymbol{\Sigma} + \textbf{mm}'$. Applying the simultaneous diagonalization $\textbf{A}' \boldsymbol{\Sigma}_1 \textbf{A} = \textbf{I}_p$ and $\textbf{A}' \boldsymbol{\Sigma}_2 \textbf{A} = \boldsymbol{\Lambda}$ with $\boldsymbol{\Sigma}_1 = \boldsymbol{\Sigma}$ and $\boldsymbol{\Sigma}_2 = \textbf{mm}'$, we have $\textbf{A}'(\boldsymbol{\Sigma} + \textbf{mm}')\textbf{A} = \textbf{I}_p + \boldsymbol{\Lambda}$. Taking determinants of the first relation gives $|\textbf{A}'| |\boldsymbol{\Sigma}| |\textbf{A}| = |\textbf{I}_p| = 1$, which implies that $|\boldsymbol{\Sigma}| = 1 / |\textbf{A}|^2$. Similarly, $|\textbf{A}'| |\boldsymbol{\Sigma} + \textbf{mm}'| |\textbf{A}| = |\textbf{I}_p + \boldsymbol{\Lambda}|$, which implies

\begin{align}
    |\boldsymbol{\Sigma} + \textbf{mm}'| &= \dfrac{|\textbf{I}_p + \boldsymbol{\Lambda}|}{|\textbf{A}|^2} \\
    &= |\textbf{I}_p + \boldsymbol{\Lambda}| |\boldsymbol{\Sigma}| \\
    &= |\boldsymbol{\Sigma}| \prod_{j=1}^p (1 + \lambda_j).
\end{align}

Notice that $\text{rank}(\textbf{mm}') = 1$ if $\textbf{m} \ne \textbf{0}$, so that

$$
\lambda_1 \ne 0, \quad \lambda_2 = \ldots = \lambda_p = 0,
$$

implying that $|\boldsymbol{\Sigma} + \textbf{mm}'| = |\boldsymbol{\Sigma}| (1 + \lambda_1)$. Also, notice that, because $\textbf{A}'\boldsymbol{\Sigma}\textbf{A} = \textbf{I}_p$ and $\textbf{A}$ is nonsingular, $\boldsymbol{\Sigma}^{-1} = \textbf{AA}'$. Hence,

\begin{align}
    \lambda_1
    &= \sum_{j=1}^p \lambda_j \\
    &= \text{tr}\{\boldsymbol{\Lambda}\} \\
    &= \text{tr}\{ \textbf{A}'\textbf{mm}'\textbf{A} \} \\
    &= \text{tr}\{ \textbf{m}'\textbf{AA}'\textbf{m} \} \\
    &= \text{tr}\{ \textbf{m}'\boldsymbol{\Sigma}^{-1}\textbf{m} \} \\
    &= \textbf{m}'\boldsymbol{\Sigma}^{-1}\textbf{m}.
\end{align}

Therefore,

$$
    |\boldsymbol{\Sigma} + \textbf{mm}'| = |\boldsymbol{\Sigma}| (1 + \textbf{m}'\boldsymbol{\Sigma}^{-1}\textbf{m}).
$$
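
The determinant identity is easy to sanity-check numerically. The values of `Sigma` and `m` below are made up for illustration. The sketch also checks that the single nonzero eigenvalue of $\boldsymbol{\Sigma}^{-1}\textbf{mm}'$ equals $\textbf{m}'\boldsymbol{\Sigma}^{-1}\textbf{m}$.

```python
import numpy as np

# Made-up covariance matrix and mean vector.
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
m = np.array([1.0, -2.0])

quad = m @ np.linalg.solve(Sigma, m)                 # m' Sigma^{-1} m
lhs = np.linalg.det(Sigma + np.outer(m, m))          # |Sigma + mm'|
rhs = np.linalg.det(Sigma) * (1.0 + quad)            # |Sigma| (1 + m' Sigma^{-1} m)

# Sigma^{-1} mm' has rank one, so only one eigenvalue is nonzero.
eigvals = np.linalg.eigvals(np.linalg.solve(Sigma, np.outer(m, m))).real
lambda_1 = eigvals.max()
```

For this example both `lhs` and `rhs` equal 12.75 up to rounding.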

### Relationship between $\textbf{S}^{-1}$ and $\boldsymbol{\Sigma}^{-1}$ (p. 42)

Similar to above, simultaneous diagonalization enables us to establish the relationship between $\textbf{S}^{-1}$ and $\boldsymbol{\Sigma}^{-1}$.

Recall that $\textbf{A}'(\boldsymbol{\Sigma} + \textbf{mm}')\textbf{A} = \textbf{I}_p + \boldsymbol{\Lambda}$. Because $\textbf{A}$ is nonsingular, $\boldsymbol{\Sigma} + \textbf{mm}' = (\textbf{A}')^{-1}(\textbf{I}_p + \boldsymbol{\Lambda})\textbf{A}^{-1}$, which implies

$$
(\boldsymbol{\Sigma} + \textbf{mm}')^{-1} = \textbf{A}(\textbf{I}_p + \boldsymbol{\Lambda})^{-1}\textbf{A}'.
$$

Because $\lambda_2 = \ldots = \lambda_p = 0$ (the rank-one result above), we have that

\begin{align}
    (\textbf{I}_p + \boldsymbol{\Lambda})^{-1}
    &= \text{diag}\left(\frac{1}{1 + \lambda_1}, 1, \ldots, 1 \right) \\
    &= \text{diag}\left(1 - \frac{\lambda_1}{1 + \lambda_1}, 1, \ldots, 1 \right) \\
    &= \textbf{I}_p - \dfrac{1}{1 + \lambda_1} \boldsymbol{\Lambda}.
\end{align}


Because $\textbf{A}'\textbf{mm}'\textbf{A} = \boldsymbol{\Lambda}$ from the simultaneous diagonalization, it follows that

$$
    \textbf{A}\boldsymbol{\Lambda}\textbf{A}'
    = \textbf{AA}'\textbf{mm}'\textbf{AA}'
    = \boldsymbol{\Sigma}^{-1}\textbf{mm}'\boldsymbol{\Sigma}^{-1},
$$

where the second equality follows from $\boldsymbol{\Sigma}^{-1} = \textbf{AA}'$. Substituting both results into $\textbf{S}^{-1} = (\boldsymbol{\Sigma} + \textbf{mm}')^{-1}$, we have that

\begin{align}
    \textbf{S}^{-1}
    &= \textbf{A}\left( \textbf{I}_p - \dfrac{1}{1 + \lambda_1} \boldsymbol{\Lambda} \right)\textbf{A}' \\
    &= \textbf{AA}' - \dfrac{1}{1 + \lambda_1} \textbf{A}\boldsymbol{\Lambda}\textbf{A}' \\
    &= \boldsymbol{\Sigma}^{-1} - \dfrac{1}{1 + \lambda_1} \boldsymbol{\Sigma}^{-1} \textbf{mm}'\boldsymbol{\Sigma}^{-1} \\
    &= \boldsymbol{\Sigma}^{-1} - \dfrac{1}{1 + \textbf{m}'\boldsymbol{\Sigma}^{-1}\textbf{m}} \boldsymbol{\Sigma}^{-1} \textbf{mm}'\boldsymbol{\Sigma}^{-1} \quad (\text{recall: } \lambda_1 = \textbf{m}'\boldsymbol{\Sigma}^{-1}\textbf{m}).
\end{align}

If we would further like to calculate the quadratic form $\textbf{m}'\textbf{S}^{-1}\textbf{m}$ in terms of $\textbf{m}'\boldsymbol{\Sigma}^{-1}\textbf{m}$, then

\begin{align}
    \textbf{m}'\textbf{S}^{-1}\textbf{m}
    &= \textbf{m}'\boldsymbol{\Sigma}^{-1}\textbf{m} - \dfrac{1}{1 + \textbf{m}'\boldsymbol{\Sigma}^{-1}\textbf{m}}  (\textbf{m}'\boldsymbol{\Sigma}^{-1}\textbf{m})^2 \\
    &= \dfrac{\textbf{m}'\boldsymbol{\Sigma}^{-1}\textbf{m}}{1 + \textbf{m}'\boldsymbol{\Sigma}^{-1}\textbf{m}}.
\end{align}

Similarly,

$$
    \textbf{m}'\boldsymbol{\Sigma}^{-1}\textbf{m} = \dfrac{\textbf{m}'\textbf{S}^{-1}\textbf{m}}{1 - \textbf{m}'\textbf{S}^{-1}\textbf{m}}.
$$
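
A quick numerical check of the inverse formula and of the two quadratic-form relations, using made-up values for `Sigma` and `m` (a sketch for verification only, not part of the text):

```python
import numpy as np

# Made-up covariance matrix and mean vector.
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
m = np.array([1.0, -2.0])
S = Sigma + np.outer(m, m)

Sigma_inv = np.linalg.inv(Sigma)
t = m @ Sigma_inv @ m                      # m' Sigma^{-1} m

# S^{-1} = Sigma^{-1} - Sigma^{-1} mm' Sigma^{-1} / (1 + m' Sigma^{-1} m)
S_inv = Sigma_inv - (Sigma_inv @ np.outer(m, m) @ Sigma_inv) / (1.0 + t)

s = m @ S_inv @ m                          # m' S^{-1} m
```

Here `S_inv` matches `np.linalg.inv(S)`, `s` equals `t / (1 + t)`, and `t` equals `s / (1 - s)`.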

***

### Matrix Inversion (p. 41-42)

Simultaneous diagonalization can yield a significant reduction in computation by preprocessing the data when computing a distance function involves a matrix inverse. For two distributions, the distance functions are, by simultaneous diagonalization,

\begin{align}
    d_1(\textbf{x})
    &= (\textbf{x} - \textbf{m}_1)'\boldsymbol{\Sigma}_1^{-1}(\textbf{x} - \textbf{m}_1) \\
    &= (\textbf{y} - \textbf{d}_1)'\textbf{I}_p^{-1}(\textbf{y} - \textbf{d}_1) \\
    &= \sum_{j=1}^p (y_j - d_{1j})^2, \\
    d_2(\textbf{x})
    &= (\textbf{x} - \textbf{m}_2)'\boldsymbol{\Sigma}_2^{-1}(\textbf{x} - \textbf{m}_2) \\
    &= (\textbf{y} - \textbf{d}_2)'\boldsymbol{\Lambda}^{-1}(\textbf{y} - \textbf{d}_2) \\
    &= \sum_{j=1}^p \dfrac{(y_j - d_{2j})^2}{\lambda_j},
\end{align}

where $\textbf{y} = \textbf{A}'\textbf{x}$ and $\textbf{d}_{k} = \textbf{A}'\textbf{m}_k$, $k = 1, 2$.
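
The saving comes from computing $\textbf{A}$ and $\boldsymbol{\Lambda}$ once, after which each distance is a plain sum over coordinates. Below is a minimal sketch under the assumption that both covariance matrices are positive definite; the matrices, means, and test point `x` are made up.

```python
import numpy as np

# Made-up covariance matrices, class means, and a test point.
sigma1 = np.array([[2.0, 0.5], [0.5, 1.0]])
sigma2 = np.array([[1.0, -0.3], [-0.3, 3.0]])
m1 = np.array([0.0, 0.0])
m2 = np.array([2.0, 1.0])
x = np.array([1.0, -1.0])

# Preprocessing (done once): A' sigma1 A = I and A' sigma2 A = diag(lam).
theta, phi = np.linalg.eigh(sigma1)
w = phi @ np.diag(theta ** -0.5)
lam, psi = np.linalg.eigh(w.T @ sigma2 @ w)
A = w @ psi

# Per-sample work: one transformation, then inverse-free distances.
y = A.T @ x
dist1 = np.sum((y - A.T @ m1) ** 2)
dist2 = np.sum((y - A.T @ m2) ** 2 / lam)
```

Both values agree with the direct forms $(\textbf{x} - \textbf{m}_k)'\boldsymbol{\Sigma}_k^{-1}(\textbf{x} - \textbf{m}_k)$.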

***

### Optimum Linear Transformation (p. 448)

Feature extraction for classification consists of choosing those features which are most effective for preserving class separability. We can consider feature extraction for classification as a search, among all possible singular transformations, for the best subspace which preserves class separability as much as possible in the lowest possible dimensional space.

A linear transformation from a $p$-dimensional $\textbf{x}$ to a $q$-dimensional $\textbf{y}$ $(q < p)$ is expressed by

$$
\textbf{y} = \textbf{A}'\textbf{x},
$$

where $\textbf{A}$ is a $p \times q$ rectangular matrix whose column vectors are linearly independent. These column vectors do not need to be orthonormal. Moreover, because $q < p$, the transformation is singular (not invertible) and yields a lower-dimensional projection.

The within-class $\textbf{S}_w$, between-class $\textbf{S}_b$, and mixture (total) $\textbf{S}_m$ scatter matrices are used to formulate criteria of class separability. The within-class scatter matrix $\textbf{S}_w$ shows the scatter of samples around their respective class expected vectors. On the other hand, the between-class scatter matrix $\textbf{S}_b$ is the scatter of the expected vectors around the mixture (grand) mean. The mixture (total) scatter matrix $\textbf{S}_m$ is the covariance matrix of all samples regardless of their class assignments such that

$$
\textbf{S}_m = \textbf{S}_b + \textbf{S}_w.
$$
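
The decomposition can be illustrated with sample estimates. The sketch below assumes the usual definitions $\textbf{S}_w = \sum_i P_i \boldsymbol{\Sigma}_i$ and $\textbf{S}_b = \sum_i P_i (\textbf{m}_i - \textbf{m}_0)(\textbf{m}_i - \textbf{m}_0)'$, which are not written out in these notes, with priors estimated by class frequencies and covariances by divide-by-$N$ estimates. The data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
X1 = rng.normal(loc=[0.0, 0.0], size=(60, 2))   # class 1 samples (rows)
X2 = rng.normal(loc=[3.0, 1.0], size=(40, 2))   # class 2 samples (rows)
X = np.vstack([X1, X2])
m0 = X.mean(axis=0)                              # mixture (grand) mean

S_w = np.zeros((2, 2))
S_b = np.zeros((2, 2))
for Xi in (X1, X2):
    P_i = len(Xi) / len(X)                       # prior estimated by class frequency
    m_i = Xi.mean(axis=0)
    S_w += P_i * np.cov(Xi, rowvar=False, bias=True)   # scatter around class means
    S_b += P_i * np.outer(m_i - m0, m_i - m0)          # scatter of class means

S_m = np.cov(X, rowvar=False, bias=True)         # covariance of all samples
```

Under these assumptions `S_m` equals `S_b + S_w` up to floating-point error.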

Fukunaga's notation can be confusing at times because $\textbf{S}_1$ and $\textbf{S}_2$ serve as placeholders for any of these three matrices when defining class separability. That is, let $\textbf{S}_1$ and $\textbf{S}_2$ each be one of $\textbf{S}_m$, $\textbf{S}_b$, and $\textbf{S}_w$.

> In order to formulate criteria for class separability, we need to convert these matrices to a number. This number should be larger when the between-class scatter is larger or the within-class scatter is smaller. There are several ways to do this:

1. $J_1 = \text{tr }(\textbf{S}_2^{-1} \textbf{S}_1)$.
2. $J_2 = \ln|\textbf{S}_2^{-1} \textbf{S}_1| = \ln|\textbf{S}_1| - \ln|\textbf{S}_2|$.
3. $J_3 = \text{tr }\textbf{S}_1 - \mu( \text{tr }\textbf{S}_2 - c)$.
4. $J_4 = \dfrac{\text{tr }\textbf{S}_1}{\text{tr }\textbf{S}_2}$.

Many combinations of $\textbf{S}_m$, $\textbf{S}_b$, and $\textbf{S}_w$ for $\textbf{S}_1$ and $\textbf{S}_2$ are possible. Typical examples for $\{\textbf{S}_1, \textbf{S}_2\}$ are

* $\{\textbf{S}_b, \textbf{S}_w\}$
* $\{\textbf{S}_b, \textbf{S}_m\}$
* $\{\textbf{S}_w, \textbf{S}_m\}$.

Fukunaga further distinguishes the matrix $\textbf{S}_i$ in the $\textbf{X}$-space from its counterpart in the $\textbf{Y}$-space with a second subscript, where

$$
\textbf{S}_{iY} = \textbf{A}'\textbf{S}_{iX}\textbf{A}.
$$

Thus, the problem of feature extraction for classification is to find the $\textbf{A}$ which optimizes one of the $J$s in the $\textbf{Y}$-space.

#### Optimization of $J_1$

Let $J_1(q)$ be the value of $J_1$ in a $q$-dimensional $\textbf{Y}$-space. Then

$$
J_1(q) = \text{tr }(\textbf{S}_{2Y}^{-1} \textbf{S}_{1Y}) = \text{tr }\{(\textbf{A}'\textbf{S}_{2X}\textbf{A})^{-1} \textbf{A}'\textbf{S}_{1X}\textbf{A}\}.
$$
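
To make the criterion concrete, here is a small sketch that evaluates $J_1$ in the $\textbf{Y}$-space for a given $\textbf{A}$, using $\textbf{S}_{iY} = \textbf{A}'\textbf{S}_{iX}\textbf{A}$. The function name `j1` and the example matrices are hypothetical; this only evaluates the criterion and does not optimize it.

```python
import numpy as np

def j1(A, S1_x, S2_x):
    """Evaluate tr{(A' S2 A)^{-1} (A' S1 A)} for a p x q matrix A (hypothetical helper)."""
    S1_y = A.T @ S1_x @ A
    S2_y = A.T @ S2_x @ A
    return np.trace(np.linalg.solve(S2_y, S1_y))

# Made-up X-space scatter matrices with p = 3, using the pair {S_b, S_w}.
S_b = np.array([[4.0, 1.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 0.1]])
S_w = np.array([[1.0, 0.2, 0.0], [0.2, 2.0, 0.3], [0.0, 0.3, 1.5]])

# A keeps the first two coordinates (q = 2).
A = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
value = j1(A, S_b, S_w)
full = j1(np.eye(3), S_b, S_w)   # no dimensionality reduction
```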

TODO