# Simultaneous Diagonalization: Derivations, Properties, and Examples

John A. Ramey

## Introduction
***

In this document we consider the problem of simultaneously diagonalizing $k$ matrices of dimension $n \times n$. That is, let $\textbf{A}_1, \ldots, \textbf{A}_k \in \mathbb{R}^{n \times n}$. We say that $\textbf{Q}$ *simultaneously diagonalizes* $\textbf{A}_1, \ldots, \textbf{A}_k$ if

$$\textbf{Q}'\textbf{A}_i \textbf{Q} = \textbf{D}_i,$$

where $\textbf{D}_i$ is diagonal for all $i = 1, \ldots, k$.

## Harville (2008), *Matrix Algebra from a Statistician's Perspective*
***

In Section 21.13, Harville first considers the case where $\textbf{Q}$ is nonsingular. Suppose that there exists a nonsingular matrix $\textbf{Q}$ such that $\textbf{Q}^{-1}\textbf{A}_i \textbf{Q} = \textbf{D}_i$ for some diagonal matrices $\textbf{D}_i$, $i = 1, \ldots, k$. Then, for $s \ne i$,

\begin{align}
    \textbf{Q}^{-1} \textbf{A}_s \textbf{A}_i \textbf{Q} &= \textbf{Q}^{-1} \textbf{A}_s \textbf{Q} \textbf{Q}^{-1} \textbf{A}_i \textbf{Q} \\
    &= \textbf{D}_s \textbf{D}_i \\
    &= \textbf{D}_i \textbf{D}_s \\
    &= \textbf{Q}^{-1} \textbf{A}_i \textbf{Q} \textbf{Q}^{-1} \textbf{A}_s \textbf{Q} \\
    &= \textbf{Q}^{-1} \textbf{A}_i \textbf{A}_s \textbf{Q},
\end{align}
which implies
\begin{align}
    \textbf{A}_s \textbf{A}_i
    &= \textbf{Q} (\textbf{Q}^{-1} \textbf{A}_s \textbf{A}_i \textbf{Q}) \textbf{Q}^{-1} \\
    &= \textbf{Q} (\textbf{Q}^{-1} \textbf{A}_i \textbf{A}_s \textbf{Q}) \textbf{Q}^{-1} \\
    &= \textbf{A}_i \textbf{A}_s.
\end{align}

Thus, a necessary condition for $\textbf{A}_1, \ldots, \textbf{A}_k$ to be simultaneously diagonalizable is that $\textbf{A}_1, \ldots, \textbf{A}_k$ commute in pairs, i.e., 

$$
\textbf{A}_i \textbf{A}_s = \textbf{A}_s \textbf{A}_i \quad (s > i = 1, \ldots, k).
$$

Harville then states that commuting in pairs is a *necessary and sufficient* condition for symmetric matrices $\textbf{A}_1, \ldots, \textbf{A}_k$ to be simultaneously diagonalizable. Rather than providing a theorem and then proof, the author derives the result by induction before stating the following theorem. **NOTE**: we do not include the lengthy constructive derivation.

### Theorem 21.13.1 (p. 568)

If $n \times n$ matrices $\textbf{A}_1, \ldots, \textbf{A}_k$ are simultaneously diagonalizable, then they commute in pairs, that is, for $s > i = 1, \ldots, k$, $\textbf{A}_i \textbf{A}_s = \textbf{A}_s \textbf{A}_i$. If $n \times n$ symmetric matrices $\textbf{A}_1, \ldots, \textbf{A}_k$ commute in pairs, then they can be simultaneously diagonalized by an orthogonal matrix; that is, there exists an $n \times n$ orthogonal matrix $\textbf{P}$ and diagonal matrices $\textbf{D}_1, \ldots, \textbf{D}_k$ such that, for $i = 1, \ldots, k$,

$$
\textbf{P}'\textbf{A}_i \textbf{P} = \textbf{D}_i.
$$

#### Note

Note that symmetric matrices $\textbf{A}_1, \ldots, \textbf{A}_k$ commute in pairs if and only if each of the $k(k - 1) / 2$ matrix products $\textbf{A}_1 \textbf{A}_2, \textbf{A}_1 \textbf{A}_3, \ldots, \textbf{A}_{k-1} \textbf{A}_k$ is symmetric.
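The equivalence in this note can be checked in one line. As a short worked step, using only the symmetry of $\textbf{A}_i$ and $\textbf{A}_s$:

$$
(\textbf{A}_i \textbf{A}_s)' = \textbf{A}_s' \textbf{A}_i' = \textbf{A}_s \textbf{A}_i,
$$

so $\textbf{A}_i \textbf{A}_s$ equals its own transpose exactly when $\textbf{A}_i \textbf{A}_s = \textbf{A}_s \textbf{A}_i$.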

In the special case where $k = 2$, Theorem 21.13.1 can be restated as the following corollary.

### Corollary 21.13.2 (p. 568)

If two $n \times n$ matrices $\textbf{A}$ and $\textbf{B}$ are simultaneously diagonalizable, then they commute (i.e., $\textbf{BA} = \textbf{AB}$). If two $n \times n$ *symmetric* matrices $\textbf{A}$ and $\textbf{B}$ commute (or, equivalently, if their product $\textbf{AB}$ is symmetric), then they can be simultaneously diagonalized by an orthogonal matrix; that is, there exists an $n \times n$ orthogonal matrix $\textbf{P}$ such that $\textbf{P}'\textbf{A} \textbf{P} = \textbf{D}_1$ and $\textbf{P}'\textbf{B} \textbf{P} = \textbf{D}_2$ for some diagonal matrices $\textbf{D}_1$ and $\textbf{D}_2$.
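The corollary can be illustrated numerically. The following is a minimal NumPy sketch, not Harville's inductive construction: the matrices, the basis `P0`, and the weight `t` are illustrative assumptions, and the orthogonal matrix is obtained by a swapped-in shortcut (eigenvectors of the combination $\textbf{A} + t\textbf{B}$), which works only when that combination has distinct eigenvalues.

```python
import numpy as np

# Fixed orthogonal basis (Q factor of an arbitrary nonsingular matrix).
P0, _ = np.linalg.qr(np.array([[2.0, 1.0, 0.0],
                               [1.0, 3.0, 1.0],
                               [0.0, 1.0, 4.0]]))

# Two symmetric matrices sharing the eigenvectors in P0, so they commute.
# Each has a repeated eigenvalue, so neither one alone determines P.
A = P0 @ np.diag([1.0, 1.0, 2.0]) @ P0.T
B = P0 @ np.diag([3.0, 4.0, 4.0]) @ P0.T
assert np.allclose(A @ B, B @ A)

# Assumption: t = 0.5 gives A + t*B the distinct eigenvalues 2.5, 3, 4.
t = 0.5
_, P = np.linalg.eigh(A + t * B)

def off_diagonal_norm(M):
    """Frobenius norm of the off-diagonal part of M."""
    return np.linalg.norm(M - np.diag(np.diag(M)))

D1 = P.T @ A @ P
D2 = P.T @ B @ P
print(off_diagonal_norm(D1), off_diagonal_norm(D2))
```

Both printed norms should be at the level of rounding error, indicating that the single orthogonal matrix `P` diagonalizes `A` and `B` together.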

### Exercise 29 (p. 588)

Let $\textbf{A}_1, \ldots, \textbf{A}_k$ represent $n \times n$ not-necessarily-symmetric matrices, each of which is diagonalizable. Show that if $\textbf{A}_1, \ldots, \textbf{A}_k$ commute in pairs, then $\textbf{A}_1, \ldots, \textbf{A}_k$ are simultaneously diagonalizable.

## Horn and Johnson (1985), *Matrix Analysis*
***

In Section 4.5 (pp. 227-228), Horn and Johnson consider simultaneous diagonalization in the context of mechanics problems (e.g., kinetic and potential energy) and enumerate several cases of two simultaneously diagonalizable matrices. Necessary and sufficient conditions are given (see table below) for these cases:

* Hermitian matrices $\textbf{A}$ and $\textbf{B}$ with some unitary matrix $\textbf{U}$
* Hermitian matrices $\textbf{A}$ and $\textbf{B}$ with some nonsingular matrix $\textbf{S}$ (weaker result)
* Symmetric matrices $\textbf{A}$ and $\textbf{B}$ with some unitary matrix $\textbf{U}$
* Symmetric matrices $\textbf{A}$ and $\textbf{B}$ with some nonsingular matrix $\textbf{S}$
* Hermitian matrix $\textbf{A}$ and symmetric matrix $\textbf{B}$ with either unitary $\textbf{U}$ or nonsingular matrix $\textbf{S}$ (mixed problem)

After listing the various cases, the authors state:

> In each case, the natural congruence to consider is one that preserves the special algebraic character of the respective matrix. All of these situations arise in the applications. Fortunately, they can all be treated with the same techniques. The simplest case to consider is when one of the two matrices is nonsingular.

### Table 4.5.15T (p. 229)

| Assumptions on A and B | Diagonalized Matrices | Equivalent necessary and sufficient conditions for simultaneous diagonalization |
|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| \begin{align} \textbf{A} &= \textbf{A}^* \\ \textbf{B} &= \textbf{B}^* \\ \textbf{A} &\text{ nonsingular} \\ \textbf{C} &= \textbf{A}^{-1} \textbf{B} \end{align} | (a) $\textbf{UAU}^*$ and $\textbf{UBU}^*$ | \begin{align} &\text{(1) There is a unitary } \textbf{V} \in M_n \text{ such that } \textbf{V}^*\textbf{CV} \text{ is a real diagonal matrix} \\ &\text{(2) } \textbf{C} \text{ has real eigenvalues and is unitarily diagonalizable} \\ &\text{(3) } \textbf{C} \text{ is Hermitian} \\ &\text{(4) } \textbf{AB} = \textbf{BA} \end{align} |
| \begin{align} \textbf{A} &= \textbf{A}^* \\ \textbf{B} &= \textbf{B}^* \\ \textbf{A} &\text{ nonsingular} \\ \textbf{C} &= \textbf{A}^{-1} \textbf{B} \end{align} | (b) $\textbf{SAS}^*$ and $\textbf{SBS}^*$ | \begin{align}&\text{(1) There is a nonsingular } \textbf{R} \in M_n \text{ such that } \textbf{R}^{-1}\textbf{CR} \text{ is real diagonal} \\&\text{(2) } \textbf{C} \text{ has real eigenvalues and is diagonalizable}\end{align} |
| \begin{align} \textbf{A} &= \textbf{A}' \\ \textbf{B} &= \textbf{B}' \\ \textbf{A} &\text{ nonsingular} \\ \textbf{C} &= \textbf{A}^{-1} \textbf{B} \end{align} | (a) $\textbf{UAU}'$ and $\textbf{UBU}'$ | \begin{align}&\text{(1) There is a unitary } \textbf{V} \in M_n \text{ such that } \textbf{V}^*\textbf{CV} \text{ is a real diagonal matrix} \\&\text{(2) } \textbf{C} \text{ is unitarily diagonalizable} \\&\text{(3) } \textbf{C} \text{ is normal}\end{align} |
| \begin{align} \textbf{A} &= \textbf{A}' \\ \textbf{B} &= \textbf{B}' \\ \textbf{A} &\text{ nonsingular} \\ \textbf{C} &= \textbf{A}^{-1} \textbf{B} \end{align} | (b) $\textbf{SAS}'$ and $\textbf{SBS}'$ | \begin{align}&\text{(1) There is a nonsingular } \textbf{R} \in M_n \text{ such that } \textbf{R}^{-1}\textbf{CR} \text{ is diagonal} \\&\text{(2) } \textbf{C} \text{ is diagonalizable}\end{align} |
| \begin{align} \textbf{A} &= \textbf{A}^* \\ \textbf{B} &= \textbf{B}' \\ \text{If } \textbf{A} &\text{ is nonsingular} \\ &\text{set } \textbf{C} = \textbf{A}^{-1} \textbf{B} \\ \text{If } \textbf{B} &\text{ is nonsingular} \\ &\text{set } \textbf{C} = \textbf{B}^{-1} \textbf{A}\end{align} | (a) $\textbf{UAU}^*$ and $\textbf{UBU}'$ | \begin{align}&\text{(1) There is a nonsingular } \textbf{W} \in M_n \text{ such that } \textbf{W}^{-1}\textbf{C}\bar{\textbf{W}} \text{ is diagonal} \\&\text{(3) } \textbf{C} \text{ is symmetric}\\&\text{(4) } \textbf{AB} = \textbf{B} \bar{\textbf{A}}\end{align} |
| \begin{align} \textbf{A} &= \textbf{A}^* \\ \textbf{B} &= \textbf{B}' \\ \text{If } \textbf{A} &\text{ is nonsingular} \\ &\text{set } \textbf{C} = \textbf{A}^{-1} \textbf{B} \\ \text{If } \textbf{B} &\text{ is nonsingular} \\ &\text{set } \textbf{C} = \textbf{B}^{-1} \textbf{A}\end{align} | (b) $\textbf{SAS}^*$ and $\textbf{SBS}'$ | \begin{align}&\text{(1) There is a nonsingular } \textbf{R} \in M_n \text{ such that } \textbf{R}^{-1}\textbf{C}\bar{\textbf{R}} \text{ is diagonal} \\&\text{(5) There is a nonsingular } \textbf{R} \in M_n \text{ such that } \textbf{R}^{-1}\textbf{C}\bar{\textbf{R}} \text{ is symmetric}\end{align} |

Section 7.6 provides some results regarding simultaneous diagonalization after the following note.
> Simultaneous diagonalizability of two matrices by similarity is a rare event, requiring the strong joint assumption of commutativity. Simultaneous diagonalization of two Hermitian matrices by joint star-congruence, however, requires much less. Simultaneous diagonalization by star-congruence corresponds to transforming two Hermitian quadratic forms into a linear combination of squares by a single linear change of variables.

### Definition 7.6.2 (p. 464)

Two matrices $\textbf{A}, \textbf{B} \in M_n$ are star-congruent if there exists a nonsingular matrix $\textbf{C} \in M_n$ such that $\textbf{B} = \textbf{C}^*\textbf{A}\textbf{C}$.

The following theorem (without proof) is used in the proof of **Theorem 7.6.4** below.

### Theorem 7.2.7 (p. 406)

A matrix $\textbf B \in M_n$ is positive definite if and only if there is a nonsingular matrix $\textbf{C} \in M_n$ such that $\textbf{B} = \textbf{C}^* \textbf{C}$.

> The following result is classical; for a generalization see 4.5.15 (table and results above).

### Theorem 7.6.4 (p. 465)

Let $\textbf{A}, \textbf{B} \in M_n$ be two Hermitian matrices and suppose that there is a real linear combination of $\textbf{A}$ and $\textbf{B}$ that is positive definite. Then there exists a nonsingular matrix $\textbf{C} \in M_n$ such that $\textbf{C}^*\textbf{A}\textbf{C}$ and $\textbf{C}^*\textbf{B}\textbf{C}$ are diagonal.

#### Proof

Suppose that $\textbf{P} = \alpha \textbf{A} + \beta \textbf{B}$ is positive definite for some $\alpha, \beta \in \mathbb{R}$. At least one of $\alpha$ and $\beta$ must be nonzero, so we may assume $\beta \ne 0$. But since $\textbf{B} = \beta^{-1}(\textbf{P} - \alpha \textbf{A})$, if we can show that $\textbf{A}$ and $\textbf{P}$ are simultaneously diagonalizable by star-congruence, then it will follow that $\textbf{A}$ and $\textbf{B}$ are also. By Theorem 7.2.7 (see above) we know that $\textbf{P}$ is star-congruent to the identity, so there is some nonsingular $\textbf{C}_1 \in M_n$ such that $\textbf{C}_1^* \textbf{P} \textbf{C}_1 = \textbf{I}$. Since $\textbf{C}_1^* \textbf{A} \textbf{C}_1$ is Hermitian, there exists a unitary matrix $\textbf{U}$ such that $\textbf{U}^* \textbf{C}_1^* \textbf{A} \textbf{C}_1 \textbf{U} = \textbf{D}$ is diagonal. Letting $\textbf{C} \equiv \textbf{C}_1 \textbf{U}$, we have $\textbf{C}^* \textbf{P} \textbf{C} = \textbf{U}^* \textbf{U} = \textbf{I}$ and $\textbf{C}^* \textbf{A} \textbf{C} = \textbf{D}$, so that $\textbf{C}^* \textbf{B} \textbf{C} = \beta^{-1}(\textbf{I} - \alpha \textbf{D})$ is diagonal.

$$\tag*{$\blacksquare$}$$

***

Corollary 7.6.5 provides the result for the most common application of Theorem 7.6.4 to the classical situation in mechanics, in which two real symmetric quadratic forms are given, one of which is positive definite.

### Corollary 7.6.5 (p. 466)

If $\textbf{A} \in M_n$ is positive definite and $\textbf{B} \in M_n$ is Hermitian, then there exists a nonsingular matrix $\textbf{C} \in M_n$ such that $\textbf{C}^*\textbf{A}\textbf{C} = \textbf{I}$ and $\textbf{C}^*\textbf{B}\textbf{C}$ is diagonal.
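The construction behind this corollary (the proof of Theorem 7.6.4 with $\textbf{P} = \textbf{A}$) can be sketched numerically. This is a minimal illustration under stated assumptions: the matrices `M`, `A`, and `B` are made up for the example, and a Cholesky factor is used as one convenient choice of $\textbf{C}_1$; the text does not prescribe how $\textbf{C}_1$ is computed.

```python
import numpy as np

# Illustrative inputs (assumptions): A is Hermitian positive definite, B is Hermitian.
M = np.array([[2.0 + 0.0j, 1.0 - 1.0j],
              [0.0 + 1.0j, 3.0 + 0.0j]])
A = M.conj().T @ M + np.eye(2)           # positive definite by construction
B = np.array([[1.0, 2.0 - 1.0j],
              [2.0 + 1.0j, -3.0]])        # Hermitian, indefinite

# Step 1: A = L L^*, so C1 = (L^*)^{-1} gives C1^* A C1 = I.
L = np.linalg.cholesky(A)
C1 = np.linalg.inv(L).conj().T

# Step 2: C1^* B C1 is Hermitian; diagonalize it with a unitary U.
d, U = np.linalg.eigh(C1.conj().T @ B @ C1)

# Step 3: C = C1 U diagonalizes both A and B by star-congruence.
C = C1 @ U
print(np.round(C.conj().T @ A @ C, 10))
print(np.round(C.conj().T @ B @ C, 10))
```

The first product should be the identity and the second should be diagonal with the real entries `d`, up to rounding.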

Theorem 7.6.6 provides an analogous result for a pair of matrices, one of which is positive definite and the other (complex) symmetric. This result is also generalized in Table 4.5.15 (table above).

### Corollary 4.4.4 -- Takagi's Factorization (p. 204)

If $\textbf{A} \in M_n$ is symmetric, then there exists a unitary matrix $\textbf{U} \in M_n$ and a real nonnegative diagonal matrix $\boldsymbol{\Sigma} = \text{diag}(\sigma_1, \ldots, \sigma_n)$ such that $\textbf{A} = \textbf{U} \boldsymbol{\Sigma} \textbf{U}'$. The columns of $\textbf{U}$ are an orthonormal set of eigenvectors for $\textbf{A}\bar{\textbf{A}}$, and the corresponding diagonal entries of $\boldsymbol{\Sigma}$ are the nonnegative square roots of the corresponding eigenvalues of $\textbf{A}\bar{\textbf{A}}$.


### Theorem 7.6.6 (p. 466)

If $\textbf{A} \in M_n$ is positive definite and $\textbf{B} \in M_n$ is a symmetric complex matrix, then there is a nonsingular matrix $\textbf{C} \in M_n$ such that $\textbf{C}^*\textbf{A}\textbf{C}$ and $\textbf{C}'\textbf{B}\textbf{C}$ are both diagonal.

#### Proof

Choose a nonsingular matrix $\textbf{C}_1 \in M_n$ such that $\textbf{C}_1^*\textbf{A}\textbf{C}_1 = \textbf{I}$. Then $\textbf{C}_1'\textbf{B}\textbf{C}_1$ is symmetric, so by Takagi's factorization (see above) there is a unitary matrix $\textbf{U}$ such that $\textbf{U}'(\textbf{C}_1'\textbf{B}\textbf{C}_1)\textbf{U} = \textbf{D}$, where $\textbf{D}$ is diagonal. Then $\textbf{U}^*\textbf{C}_1^*\textbf{A}\textbf{C}_1\textbf{U} = \textbf{I}$, too, so we may take $\textbf{C} \equiv \textbf{C}_1 \textbf{U}$.

$$\tag*{$\blacksquare$}$$

***


### Theorem 7.6.7 -- Application (p. 466)

The function $f(\textbf{A}) = \log \det \textbf{A}$ is a strictly concave function on the convex set of positive definite Hermitian matrices in $M_n$.

#### Proof

For any two given positive definite matrices $\textbf{A}, \textbf{B} \in M_n$, we must show that

$$f(\alpha \textbf{A} + (1 - \alpha) \textbf{B}) \ge \alpha f(\textbf{A}) + (1 - \alpha) f(\textbf{B})$$

for all $\alpha \in (0, 1)$, with equality if and only if $\textbf{A} = \textbf{B}$. By Corollary 7.6.5 above (applied with $(\textbf{C}^*)^{-1}$ in place of $\textbf{C}$), we may write $\textbf{A} = \textbf{CIC}^*$ and $\textbf{B} = \textbf{C} \boldsymbol{\Lambda} \textbf{C}^*$ for some nonsingular $\textbf{C} \in M_n$ and $\boldsymbol{\Lambda} = \text{diag}(\lambda_1, \ldots, \lambda_n)$ with all $\lambda_i > 0$. Then

\begin{align}
    f(\alpha \textbf{A} + (1 - \alpha) \textbf{B})
    &= f(\textbf{C}[\alpha \textbf{I} + (1 - \alpha) \boldsymbol{\Lambda}]\textbf{C}^*) \\
    &= f(\textbf{C}\textbf{C}^*) + f(\alpha \textbf{I} + (1 - \alpha) \boldsymbol{\Lambda}) \\
    &= f(\textbf{A}) + f(\alpha \textbf{I} + (1 - \alpha) \boldsymbol{\Lambda})
\end{align}
and
\begin{align}
    \alpha f(\textbf{A}) + (1 - \alpha) f(\textbf{B})
    &= \alpha f(\textbf{A}) + (1 - \alpha) f(\textbf{C} \boldsymbol{\Lambda} \textbf{C}^*) \\
    &= \alpha f(\textbf{A}) + (1 - \alpha) [f(\textbf{CC}^*) + f(\boldsymbol{\Lambda})] \qquad \text{(because } f(\textbf{A}) = \log \det \textbf{A}) \\
    &= \alpha f(\textbf{A}) + (1 - \alpha) f(\textbf{A}) + (1 - \alpha) f(\boldsymbol{\Lambda}) \\
    &= f(\textbf{A}) + (1 - \alpha) f(\boldsymbol{\Lambda}).
\end{align}

Thus, it suffices to show that $f(\alpha \textbf{I} + (1 - \alpha) \boldsymbol{\Lambda}) \ge (1 - \alpha) f(\boldsymbol{\Lambda})$ for all $\alpha \in (0, 1)$ for any diagonal matrix $\boldsymbol{\Lambda}$ with positive diagonal entries. This follows easily from the strict concavity of the logarithm function since

\begin{align}
    f(\alpha \textbf{I} + (1 - \alpha) \boldsymbol{\Lambda})
    &= \log \prod_{i=1}^n [\alpha + (1 - \alpha) \lambda_i] \\
    &= \sum_{i=1}^n \log [\alpha + (1 - \alpha) \lambda_i] \\
    &\ge \sum_{i=1}^n [\alpha \log 1 + (1 - \alpha) \log \lambda_i] \\
    &= (1 - \alpha) \sum_{i=1}^n \log \lambda_i \\
    &= (1 - \alpha) \log \prod_{i=1}^n \lambda_i \\
    &= (1 - \alpha) \log \det \boldsymbol{\Lambda} \\
    &= (1 - \alpha) f(\boldsymbol{\Lambda}).
\end{align}

Equality holds in this inequality if and only if every $\lambda_i = 1$, which can happen if and only if $\boldsymbol{\Lambda} = \textbf{I}$ and $\textbf{B} = \textbf{CIC}^* = \textbf{A}$.

$$\tag*{$\blacksquare$}$$

***

By exponentiating the inequality in Theorem 7.6.7, we get a more commonly used quantitative expression for the fact that a convex combination of positive definite matrices is positive definite, and hence must be nonsingular.

### Corollary 7.6.8 -- Application (p. 467)

Let $\textbf{A}, \textbf{B} \in M_n$ be positive definite, and let $0 < \alpha < 1$. Then

$$
\det [\alpha \textbf{A} + (1 - \alpha) \textbf{B}] \ge [\det \textbf{A}]^{\alpha} [\det \textbf{B}]^{1 - \alpha}
$$

with equality if and only if $\textbf{A} = \textbf{B}$.
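As a quick numerical sanity check of the inequality, the following sketch evaluates both sides for a few values of $\alpha$. The two positive definite matrices and the helper name `det_gap` are illustrative assumptions, not taken from Horn and Johnson.

```python
import numpy as np

# Two illustrative positive definite matrices (assumptions).
A = np.array([[4.0, 1.0], [1.0, 3.0]])
B = np.array([[2.0, -1.0], [-1.0, 5.0]])

def det_gap(A, B, alpha):
    """det(alpha*A + (1-alpha)*B) - det(A)**alpha * det(B)**(1-alpha)."""
    lhs = np.linalg.det(alpha * A + (1.0 - alpha) * B)
    rhs = np.linalg.det(A) ** alpha * np.linalg.det(B) ** (1.0 - alpha)
    return lhs - rhs

# Strictly positive gaps when A != B; the gap vanishes when the matrices agree.
gaps = [det_gap(A, B, a) for a in (0.1, 0.25, 0.5, 0.75, 0.9)]
same = det_gap(A, A, 0.5)
print(gaps, same)
```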

## Golub and Van Loan (1996), *Matrix Computations*
***

In Section 8.7, Golub and Van Loan consider some generalized eigenvalue problems, which are closely related to simultaneous diagonalization. The goal is to find a nonzero vector $\textbf{x}$ and a scalar $\lambda$ such that $$\textbf{Ax} = \lambda \textbf{Bx},$$ where $\textbf{A} \in \mathbb{R}^{n \times n}$ is symmetric and $\textbf{B} \in \mathbb{R}^{n \times n}$ is symmetric positive definite. The scalar $\lambda$ can be thought of as a *generalized eigenvalue*. The problem reduces to the standard eigenvalue problem when $\textbf{B} = \textbf{I}_n$.

The matrices of the form $\textbf{A} - \lambda \textbf{B}$ define a *pencil*, whose eigenvalues are the elements of the set $$\lambda(\textbf{A}, \textbf{B}) = \{ \lambda\ | \ \text{det}(\textbf{A} - \lambda \textbf{B}) = 0 \}.$$

A symmetric-definite generalized eigenproblem can be transformed to an equivalent problem with a congruence transformation: for any nonsingular $\textbf{X}$, $$ \textbf{A} - \lambda \textbf{B} \text{ is singular} \Leftrightarrow (\textbf{X}'\textbf{AX}) - \lambda (\textbf{X}'\textbf{BX}) \text{ is singular}.$$

We seek a stable, efficient algorithm that computes $\textbf{X}$ such that $\textbf{X}'\textbf{AX}$ and $\textbf{X}'\textbf{BX}$ are both in *canonical form* (i.e., diagonal form).

### Theorem 8.7.1 (p. 461)

Suppose $\textbf{A}, \textbf{B} \in \mathbb{R}^{n \times n}$ are symmetric, and define

$$\textbf{C}(\mu) = \mu \textbf{A} + (1 - \mu) \textbf{B}, \quad \mu \in \mathbb{R}.$$

If there exists a $\mu \in [0, 1]$ such that $\textbf{C}(\mu)$ is non-negative definite and

$$\text{null}(\textbf{C}(\mu)) = \text{null}(\textbf{A}) \cap \text{null}(\textbf{B}),$$

then there exists a nonsingular $\textbf{X}$ such that both $\textbf{X}'\textbf{AX}$ and $\textbf{X}'\textbf{BX}$ are diagonal.

#### Proof

Let $\mu \in [0, 1]$ be chosen so that $\textbf{C}(\mu)$ is non-negative definite with the property that $\text{null}(\textbf{C}(\mu)) = \text{null}(\textbf{A}) \cap \text{null}(\textbf{B})$. Let

$$
\textbf{Q}_1' \textbf{C}(\mu) \textbf{Q}_1 = \begin{bmatrix} \textbf{D} & \textbf{0} \\ \textbf{0} & \textbf{0}_{n-k} \end{bmatrix}, \quad \textbf{D} =  \text{diag}(d_1, \ldots, d_k), d_i > 0
$$

be the [Schur decomposition](https://en.wikipedia.org/wiki/Schur_decomposition) of $\textbf{C}(\mu)$ and define $\textbf{X}_1 = \textbf{Q}_1 \, \text{diag}(\textbf{D}^{-1/2}, \textbf{I}_{n - k})$. If $\textbf{A}_1 = \textbf{X}_1' \textbf{A X}_1$, $\textbf{B}_1 = \textbf{X}_1' \textbf{B X}_1$, and $\textbf{C}_1 = \textbf{X}_1' \textbf{C}(\mu) \textbf{X}_1$, then

$$
\textbf{C}_1 = \begin{bmatrix} \textbf{I}_k & \textbf{0} \\ \textbf{0} & \textbf{0}_{n-k} \end{bmatrix} = \mu \textbf{A}_1 + (1 - \mu) \textbf{B}_1.
$$

Since $\text{span}\{e_{k+1}, \ldots, e_n\} = \text{null}(\textbf{C}_1) = \text{null}(\textbf{A}_1) \cap \text{null}(\textbf{B}_1)$, it follows that $\textbf{A}_1$ and $\textbf{B}_1$ have the following block structure:

$$
\textbf{A}_1 = \begin{bmatrix} \textbf{A}_{11} & \textbf{0} \\ \textbf{0} & \textbf{0}_{n-k} \end{bmatrix}, \quad
\textbf{B}_1 = \begin{bmatrix} \textbf{B}_{11} & \textbf{0} \\ \textbf{0} & \textbf{0}_{n-k} \end{bmatrix}.
$$

Moreover, $\textbf{I}_k = \mu \textbf{A}_{11} + (1 - \mu) \textbf{B}_{11}$.

Suppose $\mu \ne 0$. It then follows that if $\textbf{Z}' \textbf{B}_{11} \textbf{Z} = \text{diag}(b_1, \ldots, b_k)$ is the [Schur decomposition](https://en.wikipedia.org/wiki/Schur_decomposition) of $\textbf{B}_{11}$, and we set $\textbf{X} = \textbf{X}_1 \text{diag}(\textbf{Z}, \textbf{I}_{n - k})$, then

$$
\textbf{X}' \textbf{B} \textbf{X} = \text{diag}(b_1, \ldots, b_k, 0, \ldots, 0) \equiv \textbf{D}_{\textbf{B}}
$$

and

\begin{align}
    \textbf{X}' \textbf{A} \textbf{X}
    &= \frac{1}{\mu} \textbf{X}' \{ \textbf{C}(\mu) - (1 - \mu) \textbf{B} \} \textbf{X} \\
    &= \frac{1}{\mu} \left\{ \begin{bmatrix} \textbf{I}_k & \textbf{0} \\ \textbf{0} & \textbf{0}_{n-k} \end{bmatrix} - (1 - \mu) \textbf{D}_{\textbf{B}} \right\} \equiv \textbf{D}_{\textbf{A}}.
\end{align}

On the other hand, if $\mu = 0$, then let $\textbf{Z}' \textbf{A}_{11} \textbf{Z} = \text{diag}(a_1, \ldots, a_k)$ be the [Schur decomposition](https://en.wikipedia.org/wiki/Schur_decomposition) of $\textbf{A}_{11}$ and set $\textbf{X} = \textbf{X}_1 \text{diag}(\textbf{Z}, \textbf{I}_{n - k})$. It is easy to verify that in this case as well, both $\textbf{X}'\textbf{AX}$ and $\textbf{X}'\textbf{BX}$ are diagonal.

$$\tag*{$\blacksquare$}$$
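The steps of this proof translate fairly directly into code. The following NumPy sketch covers only the $\mu \ne 0$ branch; the function name `simdiag_psd_combination`, the rank tolerance `tol`, and the test matrices are assumptions made for illustration, and `numpy.linalg.eigh` stands in for the symmetric Schur decomposition.

```python
import numpy as np

def simdiag_psd_combination(A, B, mu, tol=1e-10):
    """Follow the construction in the proof for a given mu != 0.

    Assumes C(mu) = mu*A + (1-mu)*B is non-negative definite with
    null(C(mu)) equal to the intersection of null(A) and null(B);
    neither condition is checked here.
    """
    n = A.shape[0]
    C = mu * A + (1.0 - mu) * B
    d, Q1 = np.linalg.eigh(C)
    order = np.argsort(d)[::-1]            # positive eigenvalues first
    d, Q1 = d[order], Q1[:, order]
    k = int(np.sum(d > tol))               # numerical rank of C(mu)

    scale = np.ones(n)
    scale[:k] = 1.0 / np.sqrt(d[:k])
    X1 = Q1 * scale                        # Q1 @ diag(D^{-1/2}, I_{n-k})

    B11 = (X1.T @ B @ X1)[:k, :k]
    _, Z = np.linalg.eigh(B11)
    W = np.eye(n)
    W[:k, :k] = Z
    return X1 @ W                          # X = X1 diag(Z, I_{n-k})

# Illustrative singular pair sharing a one-dimensional null space.
M = np.array([[1.0, 2.0, 0.0], [0.0, 1.0, 1.0], [1.0, 0.0, 1.0]])
A = M.T @ np.diag([1.0, -1.0, 0.0]) @ M
B = M.T @ np.diag([1.0, 2.0, 0.0]) @ M
X = simdiag_psd_combination(A, B, mu=0.25)
DA, DB = X.T @ A @ X, X.T @ B @ X
print(np.round(DA, 10), np.round(DB, 10), sep="\n")
```

Here $\textbf{C}(0.25)$ is non-negative definite of rank 2, so both printed matrices should be diagonal with a zero in the last position.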

### Corollary 8.7.2 (p. 462)

If $\textbf{A} - \lambda \textbf{B} \in \mathbb{R}^{n \times n}$ is symmetric-definite, then there exists a nonsingular $\textbf{X} = [\textbf{x}_1, \ldots, \textbf{x}_n]$ such that

$$
\textbf{X}'\textbf{A}\textbf{X} = \text{diag}(a_1, \ldots, a_n) \quad \text{and} \quad
\textbf{X}'\textbf{B}\textbf{X} = \text{diag}(b_1, \ldots, b_n).
$$

Moreover, $\textbf{A} \textbf{x}_i = \lambda_i \textbf{B} \textbf{x}_i$ for $i = 1, \ldots, n$, where $\lambda_i = a_i / b_i$.

#### Proof

By setting $\mu = 0$ in Theorem 8.7.1, we see that symmetric-definite pencils can be simultaneously diagonalized. The rest of the corollary is easily verified.
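For a symmetric-definite pencil, one common way to compute such an $\textbf{X}$ is through a Cholesky factorization of $\textbf{B}$. The sketch below is a hedged illustration: the function name `symmetric_definite_eig` and the test matrices are assumptions, the scaling is chosen so that $\textbf{X}'\textbf{B}\textbf{X} = \textbf{I}$ (i.e., $b_i = 1$), and the explicit inverse is used for brevity even though it may lose accuracy when $\textbf{B}$ is ill-conditioned.

```python
import numpy as np

def symmetric_definite_eig(A, B):
    """Return (lam, X) with X.T @ A @ X = diag(lam) and X.T @ B @ X = I.

    Assumes A is symmetric and B is symmetric positive definite.
    """
    G = np.linalg.cholesky(B)              # B = G G'
    Ginv = np.linalg.inv(G)
    C = Ginv @ A @ Ginv.T                  # symmetric, same eigenvalues as the pencil
    lam, Q = np.linalg.eigh(C)
    X = Ginv.T @ Q
    return lam, X

# Illustrative inputs (assumptions): A symmetric indefinite, B positive definite.
A = np.array([[2.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, -1.0]])
B = np.array([[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]])
lam, X = symmetric_definite_eig(A, B)
print(lam)
```

Each column $\textbf{x}_i$ of `X` then satisfies $\textbf{A}\textbf{x}_i = \lambda_i \textbf{B}\textbf{x}_i$ with $\lambda_i$ given by `lam[i]`.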

### Example 8.7.1 (p. 462)

If

$$
\textbf{A} = \begin{bmatrix} 229 & 163 \\ 163 & 116 \end{bmatrix}, \quad
\textbf{B} = \begin{bmatrix} 81 & 59 \\ 59 & 43 \end{bmatrix},
$$

then $\textbf{A} - \lambda \textbf{B}$ is symmetric-definite and $\lambda(\textbf{A}, \textbf{B}) = \{ 5, -1/2 \}$. If

$$
\textbf{X} = \begin{bmatrix} 3 & -5 \\ -4 & 7 \end{bmatrix},
$$

then

\begin{align}
\textbf{X}'\textbf{AX} &= \begin{bmatrix} 5 & 0 \\ 0 & -1 \end{bmatrix}, \\
\textbf{X}'\textbf{BX} &= \begin{bmatrix} 1 & 0 \\ 0 & 2 \end{bmatrix}.
\end{align}
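The example can be verified exactly with integer arithmetic. The sketch below uses only the Python standard library; the helper names `matmul` and `transpose` are introduced here for illustration.

```python
from fractions import Fraction

def matmul(P, Q):
    """Multiply two small matrices stored as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)] for row in P]

def transpose(P):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*P)]

A = [[229, 163], [163, 116]]
B = [[81, 59], [59, 43]]
X = [[3, -5], [-4, 7]]

XtAX = matmul(transpose(X), matmul(A, X))
XtBX = matmul(transpose(X), matmul(B, X))

# Generalized eigenvalues lambda_i = a_i / b_i from the diagonal entries.
eigenvalues = [Fraction(XtAX[i][i], XtBX[i][i]) for i in range(2)]
print(XtAX, XtBX, eigenvalues)
```

The ratios of the diagonal entries, $5/1$ and $-1/2$, reproduce $\lambda(\textbf{A}, \textbf{B}) = \{5, -1/2\}$ as stated in Corollary 8.7.2.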

## Anderson (2003), *An Introduction to Multivariate Statistical Analysis*
***

TODO

## Fukunaga (1990), *Introduction to Statistical Pattern Recognition (2nd ed.)*
***

The Fukunaga text considers simultaneous diagonalization in multiple statistics problems. First, we begin with their simultaneous diagonalization method.

### Theorem (p. 31)

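Let $\boldsymbol{\Sigma}_1$ and $\boldsymbol{\Sigma}_2$ be $p \times p$ symmetric matrices (e.g., covariance matrices), with $\boldsymbol{\Sigma}_1$ positive definite so that the whitening step below is well defined.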

(1) First, we whiten $\boldsymbol{\Sigma}_1$ by

$$
\textbf{Y} = \boldsymbol{\Theta}^{-1/2} \boldsymbol{\Phi}' \textbf{X},
$$

where $\boldsymbol{\Theta}$ and $\boldsymbol{\Phi}$ are the eigenvalue and eigenvector matrices of $\boldsymbol{\Sigma}_1$, respectively, as

$$
\boldsymbol{\Sigma}_1 \boldsymbol{\Phi} = \boldsymbol{\Phi} \boldsymbol{\Theta} \quad \text{and} \quad \boldsymbol{\Phi}'\boldsymbol{\Phi} = \textbf{I}_p.
$$

Then, $\boldsymbol{\Sigma}_1$ and $\boldsymbol{\Sigma}_2$ are transformed to

\begin{align}
    \boldsymbol{\Theta}^{-1/2} \boldsymbol{\Phi}' \boldsymbol{\Sigma}_1 \boldsymbol{\Phi} \boldsymbol{\Theta}^{-1/2} &= \textbf{I}_p \\
    \boldsymbol{\Theta}^{-1/2} \boldsymbol{\Phi}' \boldsymbol{\Sigma}_2 \boldsymbol{\Phi} \boldsymbol{\Theta}^{-1/2} &= \textbf{K}.
\end{align}

In general, $\textbf{K}$ is not a diagonal matrix.

(2) Second, we apply the orthonormal transformation to diagonalize $\textbf{K}$. That is,

$$
\textbf{Z} = \boldsymbol{\Psi}' \textbf{Y},
$$

where $\boldsymbol{\Psi}$ and $\boldsymbol{\Lambda}$ are the eigenvector and eigenvalue matrices of $\textbf{K}$ as

$$
\textbf{K} \boldsymbol{\Psi} = \boldsymbol{\Psi} \boldsymbol{\Lambda} \quad \text{and} \quad \boldsymbol{\Psi}'\boldsymbol{\Psi} = \textbf{I}_p.
$$

Equation 2.92 states that a covariance matrix is invariant under any orthonormal transformation after a whitening transformation. Hence, the whitened $\boldsymbol{\Sigma}_1$, i.e., $\boldsymbol{\Theta}^{-1/2} \boldsymbol{\Phi}' \boldsymbol{\Sigma}_1 \boldsymbol{\Phi} \boldsymbol{\Theta}^{-1/2}$, is invariant under the transformation $\boldsymbol{\Psi}$. Thus,

\begin{align}
\boldsymbol{\Psi}'\boldsymbol{\Theta}^{-1/2} \boldsymbol{\Phi}' \boldsymbol{\Sigma}_1 \boldsymbol{\Phi} \boldsymbol{\Theta}^{-1/2}\boldsymbol{\Psi} &= \boldsymbol{\Psi}'\boldsymbol{\Psi} = \textbf{I}_p, \\
\boldsymbol{\Psi}' \textbf{K} \boldsymbol{\Psi} &= \boldsymbol{\Lambda}.
\end{align}

Thus, both matrices, $\boldsymbol{\Sigma}_1$ and $\boldsymbol{\Sigma}_2$, are diagonalized. The combination of steps (1) and (2) gives the overall transformation matrix $\boldsymbol{\Phi} \boldsymbol{\Theta}^{-1/2} \boldsymbol{\Psi}$. The following figure shows a 2-dimensional example of this process.

![Figure](simultaneous-diagonalization-example.png)
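Steps (1) and (2) can be sketched in a few lines of NumPy. The two covariance matrices below are illustrative assumptions (they are not the ones behind the figure), and the variable names mirror the symbols used above.

```python
import numpy as np

# Illustrative covariance matrices (assumptions, not taken from the text).
Sigma1 = np.array([[4.0, 1.0], [1.0, 2.0]])
Sigma2 = np.array([[1.0, -0.5], [-0.5, 3.0]])

# Step (1): whiten Sigma1 with its eigenvalues Theta and eigenvectors Phi.
theta, Phi = np.linalg.eigh(Sigma1)
W = Phi @ np.diag(theta ** -0.5)          # Phi Theta^{-1/2}
K = W.T @ Sigma2 @ W                      # transformed Sigma2, generally not diagonal

# Step (2): diagonalize K with its orthonormal eigenvectors Psi.
lam, Psi = np.linalg.eigh(K)

# Overall transformation matrix Phi Theta^{-1/2} Psi.
A = W @ Psi
print(np.round(A.T @ Sigma1 @ A, 10))
print(np.round(A.T @ Sigma2 @ A, 10))
```

The first product should be the identity and the second should be diagonal with entries `lam`.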


The matrices $\boldsymbol{\Phi} \boldsymbol{\Theta}^{-1/2} \boldsymbol{\Psi}$ and $\boldsymbol{\Lambda}$ can be calculated directly from $\boldsymbol{\Sigma}_1$ and $\boldsymbol{\Sigma}_2$ without going through the two steps above as shown in the following theorem.
***

### Theorem (p. 32)

Let $\boldsymbol{\Sigma}_1$ and $\boldsymbol{\Sigma}_2$ be $p \times p$ symmetric matrices, with $\boldsymbol{\Sigma}_1$ positive definite.
Then $\boldsymbol{\Sigma}_1$ and $\boldsymbol{\Sigma}_2$ are simultaneously diagonalized by $\textbf{A}$ as

$$
\textbf{A}' \boldsymbol{\Sigma}_1 \textbf{A} = \textbf{I}_p \quad \text{and} \quad \textbf{A}' \boldsymbol{\Sigma}_2 \textbf{A} = \boldsymbol{\Lambda},
$$

where $\textbf{A}$ and $\boldsymbol{\Lambda}$ are the eigenvector and eigenvalue matrices of $\boldsymbol{\Sigma}_1^{-1} \boldsymbol{\Sigma}_2$, respectively, such that

$$
\boldsymbol{\Sigma}_1^{-1} \boldsymbol{\Sigma}_2 \textbf{A} = \textbf{A} \boldsymbol{\Lambda}.
$$

#### Proof

Because $\textbf{K} \boldsymbol{\Psi} = \boldsymbol{\Psi} \boldsymbol{\Lambda}$, we know that the eigenvalues of $\textbf{K}$ satisfy

$$
|\textbf{K} - \lambda \textbf{I}_p| = 0.
$$

Hence, recalling that $\boldsymbol{\Theta}^{-1/2} \boldsymbol{\Phi}' \boldsymbol{\Sigma}_2 \boldsymbol{\Phi} \boldsymbol{\Theta}^{-1/2} = \textbf{K}$ and $\boldsymbol{\Theta}^{-1/2} \boldsymbol{\Phi}' \boldsymbol{\Sigma}_1 \boldsymbol{\Phi} \boldsymbol{\Theta}^{-1/2} = \textbf{I}_p$, we have

\begin{align}
0
    &= \left|\boldsymbol{\Theta}^{-1/2} \boldsymbol{\Phi}' \boldsymbol{\Sigma}_2 \boldsymbol{\Phi} \boldsymbol{\Theta}^{-1/2} - \lambda \boldsymbol{\Theta}^{-1/2} \boldsymbol{\Phi}' \boldsymbol{\Sigma}_1 \boldsymbol{\Phi} \boldsymbol{\Theta}^{-1/2} \right| \\
    &= \left|\boldsymbol{\Theta}^{-1/2} \boldsymbol{\Phi}'(\boldsymbol{\Sigma}_2 - \lambda \boldsymbol{\Sigma}_1 )\boldsymbol{\Phi} \boldsymbol{\Theta}^{-1/2}\right| \\
    &= \left|\boldsymbol{\Theta}^{-1/2} \boldsymbol{\Phi}'\right| \left| \boldsymbol{\Sigma}_2 - \lambda \boldsymbol{\Sigma}_1 \right| \left|\boldsymbol{\Phi} \boldsymbol{\Theta}^{-1/2}\right|.
\end{align}

Because $\boldsymbol{\Theta}^{-1/2} \boldsymbol{\Phi}'$ is nonsingular, $\left|\boldsymbol{\Theta}^{-1/2} \boldsymbol{\Phi}'\right| \ne 0$. Hence, $\left| \boldsymbol{\Sigma}_2 - \lambda \boldsymbol{\Sigma}_1 \right| = 0$, which implies $\left| \boldsymbol{\Sigma}_1^{-1} \boldsymbol{\Sigma}_2 - \lambda \textbf{I}_p \right| = 0$. Therefore, $\boldsymbol{\Lambda}$ is the eigenvalue matrix of $\boldsymbol{\Sigma}_1^{-1} \boldsymbol{\Sigma}_2$.

Next, we show that $\textbf{A} = \boldsymbol{\Phi} \boldsymbol{\Theta}^{-1/2} \boldsymbol{\Psi}$ is the eigenvector matrix of $\boldsymbol{\Sigma}_1^{-1} \boldsymbol{\Sigma}_2$. Substituting $\boldsymbol{\Theta}^{-1/2} \boldsymbol{\Phi}' \boldsymbol{\Sigma}_2 \boldsymbol{\Phi} \boldsymbol{\Theta}^{-1/2} = \textbf{K}$ into $\textbf{K} \boldsymbol{\Psi} = \boldsymbol{\Psi} \boldsymbol{\Lambda}$, we see that

$$
\boldsymbol{\Theta}^{-1/2} \boldsymbol{\Phi}' \boldsymbol{\Sigma}_2 \boldsymbol{\Phi} \boldsymbol{\Theta}^{-1/2} \boldsymbol{\Psi} = \boldsymbol{\Psi} \boldsymbol{\Lambda},
$$

which implies

$$
\boldsymbol{\Sigma}_2 \boldsymbol{\Phi} \boldsymbol{\Theta}^{-1/2} \boldsymbol{\Psi} = \left( \boldsymbol{\Theta}^{-1/2} \boldsymbol{\Phi}' \right)^{-1} \boldsymbol{\Psi} \boldsymbol{\Lambda}.
$$

Because $\boldsymbol{\Theta}^{-1/2} \boldsymbol{\Phi}' \boldsymbol{\Sigma}_1 \boldsymbol{\Phi} \boldsymbol{\Theta}^{-1/2} = \textbf{I}_p$, it follows that

$$
\left( \boldsymbol{\Theta}^{-1/2} \boldsymbol{\Phi}' \right)^{-1} = \boldsymbol{\Sigma}_1 \boldsymbol{\Phi} \boldsymbol{\Theta}^{-1/2}.
$$

Thus,

\begin{align}
\boldsymbol{\Sigma}_2 \boldsymbol{\Phi} \boldsymbol{\Theta}^{-1/2} \boldsymbol{\Psi} &= \boldsymbol{\Sigma}_1 \boldsymbol{\Phi} \boldsymbol{\Theta}^{-1/2} \boldsymbol{\Psi} \boldsymbol{\Lambda} \\
\Rightarrow
\boldsymbol{\Sigma}_1^{-1} \boldsymbol{\Sigma}_2 \boldsymbol{\Phi} \boldsymbol{\Theta}^{-1/2} \boldsymbol{\Psi} &= \boldsymbol{\Phi} \boldsymbol{\Theta}^{-1/2} \boldsymbol{\Psi} \boldsymbol{\Lambda} \\
\Rightarrow
\boldsymbol{\Sigma}_1^{-1} \boldsymbol{\Sigma}_2 \textbf{A} &= \textbf{A} \boldsymbol{\Lambda}.
\end{align}

$$\tag*{$\blacksquare$}$$

It is important to note that $\boldsymbol{\Sigma}_1^{-1} \boldsymbol{\Sigma}_2$ is not symmetric in general, and consequently its eigenvectors $\boldsymbol{\psi}_j$ are not mutually orthogonal in general, i.e., $\boldsymbol{\psi}_i'\boldsymbol{\psi}_j \ne 0$ for $i \ne j$. Instead, the $\boldsymbol{\psi}_j$'s are orthogonal with respect to $\boldsymbol{\Sigma}_1$, so that $\boldsymbol{\psi}_i' \boldsymbol{\Sigma}_1 \boldsymbol{\psi}_j = 0$ for $i \ne j$. Furthermore, to make the $\boldsymbol{\psi}_j$'s orthonormal with respect to $\boldsymbol{\Sigma}_1$ and thus satisfy $\textbf{A}' \boldsymbol{\Sigma}_1 \textbf{A} = \textbf{I}_p$, each $\boldsymbol{\psi}_j$ must be rescaled by $\sqrt{\boldsymbol{\psi}_j' \boldsymbol{\Sigma}_1 \boldsymbol{\psi}_j}$ such that

$$
\dfrac{\boldsymbol{\psi}_j'}{\sqrt{\boldsymbol{\psi}_j' \boldsymbol{\Sigma}_1 \boldsymbol{\psi}_j}} \boldsymbol{\Sigma}_1 \dfrac{\boldsymbol{\psi}_j}{\sqrt{\boldsymbol{\psi}_j' \boldsymbol{\Sigma}_1 \boldsymbol{\psi}_j}} = 1.
$$
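The direct route through $\boldsymbol{\Sigma}_1^{-1} \boldsymbol{\Sigma}_2$, including the rescaling above, can be sketched as follows. The matrices are illustrative assumptions, and the sketch relies on the eigenvalues being distinct so that the eigenvectors returned by `numpy.linalg.eig` are automatically $\boldsymbol{\Sigma}_1$-orthogonal.

```python
import numpy as np

# Illustrative covariance matrices (assumptions, not taken from the text).
Sigma1 = np.array([[4.0, 1.0], [1.0, 2.0]])
Sigma2 = np.array([[1.0, -0.5], [-0.5, 3.0]])

# Eigen-decomposition of Sigma1^{-1} Sigma2 (not symmetric in general).
lam, V = np.linalg.eig(np.linalg.solve(Sigma1, Sigma2))
lam, V = lam.real, V.real                 # eigenvalues are real in exact arithmetic

# Rescale each eigenvector by sqrt(psi_j' Sigma1 psi_j).
norms = np.sqrt(np.einsum("ij,ik,kj->j", V, Sigma1, V))
A = V / norms

gram = V.T @ V                            # ordinary inner products of the eigenvectors
print(np.round(A.T @ Sigma1 @ A, 10))
print(np.round(A.T @ Sigma2 @ A, 10))
print(gram[0, 1])
```

The last printed value is nonzero for these inputs, showing that the eigenvectors are orthogonal with respect to $\boldsymbol{\Sigma}_1$ but not in the ordinary sense.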

> Simultaneous diagonalization of two matrices is a very powerful tool in pattern recognition because many problems of pattern recognition consider two distributions for classification purposes. Also, there are many possible modifications of the above discussion. These depend on what kind of properties we are interested in, what kind of matrices are used, etc. In this section we will show one of the modifications that will be used in later chapters.

***

### Theorem (p. 33)

Let a matrix $\textbf{Q}$ be given by a linear combination of two symmetric matrices $\textbf{Q}_1$ and $\textbf{Q}_2$ as

$$
\textbf{Q} = a_1 \textbf{Q}_1 + a_2 \textbf{Q}_2,
$$

where $a_1, a_2 > 0$. If we normalize the eigenvectors with respect to $\textbf{Q}$ to satisfy $\textbf{A}' \textbf{Q} \textbf{A} = \textbf{I}_p$ (the analogue of $\textbf{A}' \boldsymbol{\Sigma}_1 \textbf{A} = \textbf{I}_p$ above), then $\textbf{Q}_1$ and $\textbf{Q}_2$ share the same eigenvectors, and their eigenvalues are reversely ordered as

\begin{align}
\lambda_1^{(1)} > \lambda_2^{(1)} > \ldots > \lambda_p^{(1)} &\text{ for } \textbf{Q}_1, \\
\lambda_1^{(2)} < \lambda_2^{(2)} < \ldots < \lambda_p^{(2)} &\text{ for } \textbf{Q}_2.
\end{align}

#### Proof

Let $\textbf{Q}$ and $\textbf{Q}_1$ be diagonalized simultaneously such that

$$
\textbf{A}' \textbf{Q} \textbf{A} = \textbf{I}_p \quad \text{and} \quad \textbf{A}' \textbf{Q}_1 \textbf{A} = \boldsymbol{\Lambda}^{(1)},
$$

where

$$
\textbf{Q}^{-1} \textbf{Q}_1 \textbf{A} = \textbf{A} \boldsymbol{\Lambda}^{(1)}.
$$

Then $\textbf{Q}_2$ is also diagonalized because

\begin{align}
    \textbf{I}_p
    &= \textbf{A}' \textbf{Q} \textbf{A} \\
    &= \textbf{A}' (a_1 \textbf{Q}_1 + a_2 \textbf{Q}_2) \textbf{A} \\
    &= a_1 \textbf{A}' \textbf{Q}_1 \textbf{A} + a_2 \textbf{A}' \textbf{Q}_2 \textbf{A} \\
    &= a_1 \boldsymbol{\Lambda}^{(1)} + a_2 \textbf{A}' \textbf{Q}_2 \textbf{A},
\end{align}

which implies that

$$
\textbf{A}' \textbf{Q}_2 \textbf{A} = \dfrac{1}{a_2} \left(\textbf{I}_p - a_1 \boldsymbol{\Lambda}^{(1)} \right).
$$

That is,

$$
\lambda_j^{(2)} = \dfrac{1 - a_1 \lambda_j^{(1)}}{a_2},
$$

which, because $a_1, a_2 > 0$, implies that if $\lambda_i^{(1)} > \lambda_j^{(1)}$, then $\lambda_i^{(2)} < \lambda_j^{(2)}$. Furthermore, $\textbf{Q}_1$ and $\textbf{Q}_2$ share the same eigenvectors, namely the columns of $\textbf{A}$, which are normalized with respect to $\textbf{Q}$.

$$\tag*{$\blacksquare$}$$
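The theorem can be checked numerically. The sketch below is only an illustration under stated assumptions: the matrices `Q1` and `Q2`, the weights `a1` and `a2`, and the helper `random_spd` are hypothetical, and the matrix $\textbf{A}$ is obtained by a two-step "whiten, then rotate" construction rather than by the eigen-equation used in the proof.

```python
import numpy as np

rng = np.random.default_rng(1)
p, a1, a2 = 4, 0.3, 0.7

def random_spd(p):
    """Hypothetical helper: a random symmetric positive definite matrix."""
    x = rng.standard_normal((10 * p, p))
    return x.T @ x / (10 * p)

Q1, Q2 = random_spd(p), random_spd(p)
Q = a1 * Q1 + a2 * Q2

# Step 1: whiten Q so that W' Q W = I.
evals, evecs = np.linalg.eigh(Q)
W = evecs / np.sqrt(evals)

# Step 2: rotate with an orthogonal V that diagonalizes the whitened Q1.
# Because V is orthogonal, A' Q A remains the identity.
lam1, V = np.linalg.eigh(W.T @ Q1 @ W)  # eigenvalues in ascending order
A = W @ V

lam2 = np.diag(A.T @ Q2 @ A)
print(lam1)                   # ascending
print(lam2)                   # descending
print((1 - a1 * lam1) / a2)   # agrees with lam2
```

Since `np.linalg.eigh` returns eigenvalues in ascending order, `lam1` is increasing while `lam2` is decreasing, which is the reverse ordering stated in the theorem.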

***

Fukunaga defines the autocorrelation matrix $\textbf{S}$ and the covariance matrix $\boldsymbol{\Sigma}$ as

\begin{align}
\textbf{S} &= E[\textbf{XX}'], \\
\boldsymbol{\Sigma} &= \textbf{S} - \textbf{mm}',
\end{align}

where $\textbf{m} = E[\textbf{X}]$.

The following example uses the Theorem above.

### Example (p. 34)

Let $\textbf{S}$ be the mixture autocorrelation matrix of two distributions whose autocorrelation matrices are $\textbf{S}_1$ and $\textbf{S}_2$ and whose prior probabilities are $P_1$ and $P_2$ for the classes $\omega_1$ and $\omega_2$. Then

\begin{align}
    \textbf{S}
    &= E[\textbf{XX}'] \\
    &= P_1 E[\textbf{XX}' | \omega_1] + P_2 E[\textbf{XX}' | \omega_2] \\
    &= P_1 \textbf{S}_1 + P_2 \textbf{S}_2.
\end{align}
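As a rough illustration of this decomposition together with the theorem above, the following sketch uses simulated data. The sample sizes, the scaling of each class, and all variable names are assumptions chosen for this example, and the priors are simply the empirical class proportions.

```python
import numpy as np

rng = np.random.default_rng(2)
n1, n2, p = 300, 700, 3

# Hypothetical samples from two classes (rows are observations).
X1 = rng.standard_normal((n1, p)) * np.array([3.0, 1.0, 0.5])
X2 = rng.standard_normal((n2, p)) * np.array([0.5, 1.0, 3.0]) + 1.0

P1, P2 = n1 / (n1 + n2), n2 / (n1 + n2)  # empirical priors
S1 = X1.T @ X1 / n1                       # class autocorrelation matrices
S2 = X2.T @ X2 / n2
X = np.vstack([X1, X2])
S = X.T @ X / (n1 + n2)                   # mixture autocorrelation matrix

# Whiten S, then rotate so that the whitened S1 becomes diagonal.
evals, evecs = np.linalg.eigh(S)
W = evecs / np.sqrt(evals)
lam1, V = np.linalg.eigh(W.T @ S1 @ W)    # ascending order
A = W @ V
lam2 = np.diag(A.T @ S2 @ A)

# Mean squared projection of each class onto the shared eigenvectors.
energy1 = np.mean((X1 @ A) ** 2, axis=0)  # equals lam1
energy2 = np.mean((X2 @ A) ** 2, axis=0)  # equals lam2
print(lam1, lam2, P1 * lam1 + P2 * lam2)
```

Here `P1 * lam1 + P2 * lam2` is a vector of ones, so a direction that carries a large share of the first class's energy necessarily carries a small share of the second class's energy.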

> Thus, by the above theorem, we can diagonalize $\textbf{S}_1$ and $\textbf{S}_2$ with the same set of eigenvectors. Since the eigenvalues are ordered in reverse, the eigenvector with the largest eigenvalue for the first distribution has the least eigenvalue for the second, and vice versa. **This property can be used to extract features important to distinguish two distributions.**

This important finding is the so-called [Fukunaga–Koontz Transform](http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=1671511). Here are a couple of examples that cite this transform.

* [Paper #1](http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=7044575)
* [Paper #2](https://users.ece.cmu.edu/~juefeix/felix_pr16_fkda.pdf)

***

### Relationship between $|\textbf{S}|$ and $|\boldsymbol{\Sigma}|$ (p. 38)

Simultaneous diagonalization enables us to establish the relationship between the determinants of the autocorrelation matrix $\textbf{S}$ and the covariance matrix $\boldsymbol{\Sigma}$.

Note that $\textbf{S} = \boldsymbol{\Sigma} + \textbf{mm}'$. Applying the simultaneous diagonalization of $\textbf{A}' \boldsymbol{\Sigma}_1 \textbf{A} = \textbf{I}_p$ and $\textbf{A}' \boldsymbol{\Sigma}_2 \textbf{A} = \boldsymbol{\Lambda}$ with $\boldsymbol{\Sigma}_1 = \boldsymbol{\Sigma}$ and $\boldsymbol{\Sigma}_2 = \textbf{mm}'$, we have $\textbf{A}'(\boldsymbol{\Sigma} + \textbf{mm}')\textbf{A} = \textbf{I}_p + \boldsymbol{\Lambda}$. Notice that $|\textbf{A}'| |\boldsymbol{\Sigma}| |\textbf{A}| = |\textbf{I}_p|$, which implies that $|\boldsymbol{\Sigma}| = 1 / |\textbf{A}|^2$. Therefore, $|\textbf{A}'| |\boldsymbol{\Sigma} + \textbf{mm}'| |\textbf{A}| = |\textbf{I}_p + \boldsymbol{\Lambda}|$, which implies

\begin{align}
    |\boldsymbol{\Sigma} + \textbf{mm}'| &= \dfrac{|\textbf{I}_p + \boldsymbol{\Lambda}|}{|\textbf{A}|^2} \\
    &= |\textbf{I}_p + \boldsymbol{\Lambda}| |\boldsymbol{\Sigma}| \\
    &= |\boldsymbol{\Sigma}| \prod_{j=1}^p (1 + \lambda_j).
\end{align}

Notice that $\text{rank}(\textbf{mm}') = 1$ if $\textbf{m} \ne \textbf{0}$, so that

$$
\lambda_1 \ne 0, \quad \lambda_2 = \ldots = \lambda_p = 0,
$$

implying that $|\boldsymbol{\Sigma} + \textbf{mm}'| = |\boldsymbol{\Sigma}| (1 + \lambda_1)$. Also, notice that, because $\textbf{A}$ is nonsingular, $\textbf{A}' \boldsymbol{\Sigma} \textbf{A} = \textbf{I}_p$ gives $\boldsymbol{\Sigma}^{-1} = \textbf{AA}'$. Hence,

\begin{align}
    \lambda_1
    &= \sum_{j=1}^p \lambda_j \\
    &= \text{tr}\{\boldsymbol{\Lambda}\} \\
    &= \text{tr}\{ \textbf{A}'\textbf{mm}'\textbf{A} \} \\
    &= \text{tr}\{ \textbf{m}'\textbf{AA}'\textbf{m} \} \\
    &= \text{tr}\{ \textbf{m}'\boldsymbol{\Sigma}^{-1}\textbf{m} \} \\
    &= \textbf{m}'\boldsymbol{\Sigma}^{-1}\textbf{m}.
\end{align}

Therefore,

$$
    |\boldsymbol{\Sigma} + \textbf{mm}'| = |\boldsymbol{\Sigma}| (1 + \textbf{m}'\boldsymbol{\Sigma}^{-1}\textbf{m}).
$$
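A quick numerical check of this determinant identity is sketched below. The simulated data and variable names are assumptions for illustration, and the covariance is computed with divisor $N$ (`bias=True`) so that $\textbf{S} = \boldsymbol{\Sigma} + \textbf{mm}'$ holds exactly for the sample quantities.

```python
import numpy as np

rng = np.random.default_rng(3)
p = 5

# Hypothetical data with a nonzero mean vector.
X = rng.standard_normal((200, p)) + np.arange(p)
m = X.mean(axis=0)
Sigma = np.cov(X, rowvar=False, bias=True)  # covariance with divisor N
S = X.T @ X / X.shape[0]                    # autocorrelation matrix

maha = m @ np.linalg.solve(Sigma, m)        # m' Sigma^{-1} m
lhs = np.linalg.det(S)
rhs = np.linalg.det(Sigma) * (1.0 + maha)
print(lhs, rhs)
```

The two printed values agree up to floating-point error, without ever forming the eigen-decomposition explicitly.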

### Relationship between $\textbf{S}^{-1}$ and $\boldsymbol{\Sigma}^{-1}$ (p. 42)

Similar to above, simultaneous diagonalization enables us to establish the relationship between $\textbf{S}^{-1}$ and $\boldsymbol{\Sigma}^{-1}$.

Recall that $\textbf{A}'(\boldsymbol{\Sigma} + \textbf{mm}')\textbf{A} = \textbf{I}_p + \boldsymbol{\Lambda}$. Because $\textbf{A}$ is nonsingular, $\boldsymbol{\Sigma} + \textbf{mm}' = (\textbf{A}')^{-1}(\textbf{I}_p + \boldsymbol{\Lambda})\textbf{A}^{-1}$, which implies

$$
(\boldsymbol{\Sigma} + \textbf{mm}')^{-1} = \textbf{A}(\textbf{I}_p + \boldsymbol{\Lambda})^{-1}\textbf{A}'.
$$

Using the result above that $\lambda_2 = \ldots = \lambda_p = 0$, we have that

\begin{align}
    (\textbf{I}_p + \boldsymbol{\Lambda})^{-1}
    &= \text{diag}\left(\frac{1}{1 + \lambda_1}, 1, \ldots, 1 \right) \\
    &= \text{diag}\left(1 - \frac{\lambda_1}{1 + \lambda_1}, 1, \ldots, 1 \right) \\
    &= \textbf{I}_p - \dfrac{1}{1 + \lambda_1} \boldsymbol{\Lambda}.
\end{align}


Because $\textbf{A}'\textbf{mm}'\textbf{A} = \boldsymbol{\Lambda}$ from the simultaneous diagonalization, it follows that

$$
    \textbf{A}\boldsymbol{\Lambda}\textbf{A}'
    = \textbf{AA}'\textbf{mm}'\textbf{AA}'
    = \boldsymbol{\Sigma}^{-1}\textbf{mm}'\boldsymbol{\Sigma}^{-1},
$$

where the second equality follows recalling that $\boldsymbol{\Sigma}^{-1} = \textbf{AA}'$. Combining the results above, we have that

\begin{align}
    \textbf{S}^{-1}
    &= \textbf{A}\left( \textbf{I}_p - \dfrac{1}{1 + \lambda_1} \boldsymbol{\Lambda} \right)\textbf{A}' \\
    &= \textbf{AA}' - \dfrac{1}{1 + \lambda_1} \textbf{A}\boldsymbol{\Lambda}\textbf{A}' \\
    &= \boldsymbol{\Sigma}^{-1} - \dfrac{1}{1 + \lambda_1} \boldsymbol{\Sigma}^{-1} \textbf{mm}'\boldsymbol{\Sigma}^{-1} \\
    &= \boldsymbol{\Sigma}^{-1} - \dfrac{1}{1 + \textbf{m}'\boldsymbol{\Sigma}^{-1}\textbf{m}} \boldsymbol{\Sigma}^{-1} \textbf{mm}'\boldsymbol{\Sigma}^{-1} \quad (\text{recall: } \lambda_1 = \textbf{m}'\boldsymbol{\Sigma}^{-1}\textbf{m}).
\end{align}

If we would further like to calculate the quadratic form $\textbf{m}'\textbf{S}^{-1}\textbf{m}$ in terms of $\textbf{m}'\boldsymbol{\Sigma}^{-1}\textbf{m}$, then

\begin{align}
    \textbf{m}'\textbf{S}^{-1}\textbf{m}
    &= \textbf{m}'\boldsymbol{\Sigma}^{-1}\textbf{m} - \dfrac{1}{1 + \textbf{m}'\boldsymbol{\Sigma}^{-1}\textbf{m}}  (\textbf{m}'\boldsymbol{\Sigma}^{-1}\textbf{m})^2 \\
    &= \dfrac{\textbf{m}'\boldsymbol{\Sigma}^{-1}\textbf{m}}{1 + \textbf{m}'\boldsymbol{\Sigma}^{-1}\textbf{m}}.
\end{align}

Similarly,

$$
    \textbf{m}'\boldsymbol{\Sigma}^{-1}\textbf{m} = \dfrac{\textbf{m}'\textbf{S}^{-1}\textbf{m}}{1 - \textbf{m}'\textbf{S}^{-1}\textbf{m}}.
$$
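The inverse and the two quadratic-form relations above can be verified numerically. The sketch below is an illustration with a hypothetical positive definite $\boldsymbol{\Sigma}$ and mean vector $\textbf{m}$, and the variable names are assumptions. The expression for $\textbf{S}^{-1}$ is the rank-one update commonly known as the Sherman-Morrison formula.

```python
import numpy as np

rng = np.random.default_rng(4)
p = 4

# Hypothetical positive definite covariance matrix and mean vector.
B = rng.standard_normal((p, p))
Sigma = B @ B.T + p * np.eye(p)
m = rng.standard_normal(p)
S = Sigma + np.outer(m, m)

Sigma_inv = np.linalg.inv(Sigma)
q_sigma = m @ Sigma_inv @ m  # m' Sigma^{-1} m

# S^{-1} expressed through Sigma^{-1} only.
S_inv_formula = Sigma_inv - (Sigma_inv @ np.outer(m, m) @ Sigma_inv) / (1.0 + q_sigma)

q_s = m @ np.linalg.solve(S, m)  # m' S^{-1} m
print(np.max(np.abs(S_inv_formula - np.linalg.inv(S))))
print(q_s, q_sigma / (1.0 + q_sigma))
print(q_sigma, q_s / (1.0 - q_s))
```

Each printed pair agrees up to floating-point error, and the first printed value is close to zero.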

***

### Matrix Inversion (p. 41-42)

Simultaneous diagonalization can yield a significant reduction in computation by preprocessing the data when computing a distance function involves a matrix inverse. For two distributions, the distance functions are, by simultaneous diagonalization,

\begin{align}
    d_1(\textbf{x})
    &= (\textbf{x} - \textbf{m}_1)'\boldsymbol{\Sigma}_1^{-1}(\textbf{x} - \textbf{m}_1) \\
    &= (\textbf{y} - \textbf{d}_1)'\textbf{I}_p^{-1}(\textbf{y} - \textbf{d}_1) \\
    &= \sum_{j=1}^p (y_j - d_{1j})^2, \\
    d_2(\textbf{x})
    &= (\textbf{x} - \textbf{m}_2)'\boldsymbol{\Sigma}_2^{-1}(\textbf{x} - \textbf{m}_2) \\
    &= (\textbf{y} - \textbf{d}_2)'\boldsymbol{\Lambda}^{-1}(\textbf{y} - \textbf{d}_2) \\
    &= \sum_{j=1}^p \dfrac{(y_j - d_{2j})^2}{\lambda_j},
\end{align}

where $\textbf{y} = \textbf{A}'\textbf{x}$ and $\textbf{d}_{k} = \textbf{A}'\textbf{m}_k$, $k = 1, 2$.
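The computational saving can be sketched as follows: once $\textbf{A}$ and $\boldsymbol{\Lambda}$ are available, both distances reduce to sums over the transformed coordinates. This is a minimal illustration under stated assumptions: the helper `simultaneous_diag` (a "whiten, then rotate" construction), the random covariance matrices, and the variable names are all hypothetical.

```python
import numpy as np

def simultaneous_diag(sigma1, sigma2):
    """Return (A, lam) with A' sigma1 A = I and A' sigma2 A = diag(lam)."""
    evals, evecs = np.linalg.eigh(sigma1)
    W = evecs / np.sqrt(evals)                 # W' sigma1 W = I
    lam, V = np.linalg.eigh(W.T @ sigma2 @ W)  # V is orthogonal
    return W @ V, lam

rng = np.random.default_rng(5)
p = 4
B1, B2 = rng.standard_normal((2, p, p))
Sigma1 = B1 @ B1.T + np.eye(p)  # hypothetical covariance matrices
Sigma2 = B2 @ B2.T + np.eye(p)
m1, m2 = rng.standard_normal((2, p))
x = rng.standard_normal(p)

# Preprocessing, done once for all observations.
A, lam = simultaneous_diag(Sigma1, Sigma2)
d1, d2 = A.T @ m1, A.T @ m2

# Per-observation work: one matrix-vector product and two sums.
y = A.T @ x
dist1 = np.sum((y - d1) ** 2)
dist2 = np.sum((y - d2) ** 2 / lam)

# Direct computation for comparison.
direct1 = (x - m1) @ np.linalg.solve(Sigma1, x - m1)
direct2 = (x - m2) @ np.linalg.solve(Sigma2, x - m2)
print(dist1, direct1)
print(dist2, direct2)
```

The transformed distances match the direct quadratic forms, but after preprocessing they require no linear solve or matrix inverse per observation.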

***

### Optimum Linear Transformation (p. 448)

A linear transformation from a $p$-dimensional $\textbf{x}$ to a $q$-dimensional $\textbf{y}$ $(q < p)$ is expressed by

$$
\textbf{y} = \textbf{A}'\textbf{x},
$$

where $\textbf{A}$ is a $p \times q$ rectangular matrix and the column vectors are linearly independent. These column vectors do not need to be orthonormal.

The within-class $\textbf{S}_w$, between-class $\textbf{S}_b$, and mixture (total) $\textbf{S}_m$ scatter matrices are used to formulate criteria of class separability. The within-class scatter matrix $\textbf{S}_w$ shows the scatter of samples around their respective class expected vectors. On the other hand, the between-class scatter matrix $\textbf{S}_b$ is the scatter of the expected vectors around the mixture (grand) mean. The mixture (total) scatter matrix $\textbf{S}_m$ is the covariance matrix of all samples regardless of their class assignments such that

$$
\textbf{S}_m = \textbf{S}_w + \textbf{S}_b.
$$