In [None]:
'''
 * Copyright (c) 2018 Radhamadhab Dalai
 *
 * Permission is hereby granted, free of charge, to any person obtaining a copy
 * of this software and associated documentation files (the "Software"), to deal
 * in the Software without restriction, including without limitation the rights
 * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 * copies of the Software, and to permit persons to whom the Software is
 * furnished to do so, subject to the following conditions:
 *
 * The above copyright notice and this permission notice shall be included in
 * all copies or substantial portions of the Software.
 *
 * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
 * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
 * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
 * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
 * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
 * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
 * THE SOFTWARE.
'''

# 2.7 Vector and Matrix Differentiation

We define derivatives of vectors, matrices, products of vectors and matrices, as well as scalar functions of vectors and matrices, results which commonly appear in linear model theory; see Magnus and Neudecker (1988) for more details. We assume throughout that all the derivatives exist and are continuous.

## Definition 2.7.1

Let $\boldsymbol{\beta} = (\beta_1, \ldots, \beta_k)^T$ be a $k$-dimensional vector and let $f(\boldsymbol{\beta})$ denote a scalar function of $\boldsymbol{\beta}$. The first partial derivative of $f$ with respect to $\boldsymbol{\beta}$ is defined to be the $k$-dimensional vector of partial derivatives $\partial f/\partial \beta_i$:

$$
\frac{\partial f(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}} = \frac{\partial f}{\partial \boldsymbol{\beta}} = \begin{pmatrix}
\partial f/\partial \beta_1 \\
\partial f/\partial \beta_2 \\
\vdots \\
\partial f/\partial \beta_k
\end{pmatrix} \tag{2.7.1}
$$

Also,
$$
\frac{\partial f}{\partial \boldsymbol{\beta}^T} = \left(\frac{\partial f}{\partial \beta_1}, \frac{\partial f}{\partial \beta_2}, \ldots, \frac{\partial f}{\partial \beta_k}\right) \tag{2.7.2}
$$

## Definition 2.7.2

The second partial derivative of $f$ with respect to $\boldsymbol{\beta}$ is defined to be the $k \times k$ matrix of partial derivatives $\partial^2 f/\partial \beta_i \partial \beta_j = \partial^2 f/\partial \beta_j \partial \beta_i$:

$$
\frac{\partial^2 f}{\partial \boldsymbol{\beta} \partial \boldsymbol{\beta}^T} = \begin{pmatrix}
\partial^2 f/\partial \beta_1^2 & \partial^2 f/\partial \beta_1 \partial \beta_2 & \cdots & \partial^2 f/\partial \beta_1 \partial \beta_k \\
\partial^2 f/\partial \beta_1 \partial \beta_2 & \partial^2 f/\partial \beta_2^2 & \cdots & \partial^2 f/\partial \beta_2 \partial \beta_k \\
\vdots & \vdots & \ddots & \vdots \\
\partial^2 f/\partial \beta_1 \partial \beta_k & \partial^2 f/\partial \beta_2 \partial \beta_k & \cdots & \partial^2 f/\partial \beta_k^2
\end{pmatrix} \tag{2.7.3}
$$

## Example 2.7.1

Let $\boldsymbol{\beta} = (\beta_1, \beta_2)^T$ and let $f(\boldsymbol{\beta}) = \beta_1^2 - 2\beta_1\beta_2$. Then,

$$
\frac{\partial f}{\partial \boldsymbol{\beta}^T} = (2\beta_1 - 2\beta_2, -2\beta_1)
$$

and

$$
\frac{\partial^2 f}{\partial \boldsymbol{\beta} \partial \boldsymbol{\beta}^T} = \begin{pmatrix}
2 & -2 \\
-2 & 0
\end{pmatrix}
$$

## Result 2.7.1

Let $f$ and $g$ represent scalar functions of a $k$-dimensional vector $\boldsymbol{\beta}$, and let $a$ and $b$ be real constants. Then

$$
\begin{align}
\frac{\partial(af + bg)}{\partial \beta_j} &= a\frac{\partial f}{\partial \beta_j} + b\frac{\partial g}{\partial \beta_j} \\
\frac{\partial(fg)}{\partial \beta_j} &= f\frac{\partial g}{\partial \beta_j} + g\frac{\partial f}{\partial \beta_j} \\
\frac{\partial(f/g)}{\partial \beta_j} &= \frac{1}{g^2}\left(g\frac{\partial f}{\partial \beta_j} - f\frac{\partial g}{\partial \beta_j}\right)
\end{align} \tag{2.7.4}
$$

## Definition 2.7.3

Let $\mathbf{A} = \{a_{ij}\}$ be an $m \times n$ matrix and let $f(\mathbf{A})$ be a real function of $\mathbf{A}$. The first partial derivative of $f$ with respect to $\mathbf{A}$ is defined as the $m \times n$ matrix of partial derivatives $\partial f/\partial a_{ij}$:

$$
\frac{\partial f(\mathbf{A})}{\partial \mathbf{A}} = \{\partial f/\partial a_{ij}\}, \quad i = 1, \ldots, m, \quad j = 1, \ldots, n
$$

$$
= \begin{pmatrix}
\partial f/\partial a_{11} & \partial f/\partial a_{12} & \cdots & \partial f/\partial a_{1n} \\
\vdots & \vdots & \ddots & \vdots \\
\partial f/\partial a_{m1} & \partial f/\partial a_{m2} & \cdots & \partial f/\partial a_{mn}
\end{pmatrix} \tag{2.7.5}
$$

The results that follow give rules for finding partial derivatives of vector or matrix functions of matrices and vectors.

## Result 2.7.2

Let $\boldsymbol{\beta} \in \mathbb{R}^n$ and let $\mathbf{A} \in \mathbb{R}^{m \times n}$. Then,

$$
\frac{\partial \mathbf{A}\boldsymbol{\beta}}{\partial \boldsymbol{\beta}^T} = \mathbf{A}, \quad \text{and} \quad \frac{\partial \boldsymbol{\beta}^T \mathbf{A}^T}{\partial \boldsymbol{\beta}} = \mathbf{A}^T \tag{2.7.6}
$$

---

*This section presents the fundamental definitions and results for vector and matrix differentiation, which are essential tools in linear model theory and optimization.*

# 2.7 Vector and Matrix Differentiation

We define derivatives of vectors, matrices, products of vectors and matrices, as well as scalar functions of vectors and matrices, results which commonly appear in linear model theory; see Magnus and Neudecker (1988) for more details. We assume throughout that all the derivatives exist and are continuous.

## Definition 2.7.1

Let $\boldsymbol{\beta} = (\beta_1, \ldots, \beta_k)^T$ be a $k$-dimensional vector and let $f(\boldsymbol{\beta})$ denote a scalar function of $\boldsymbol{\beta}$. The first partial derivative of $f$ with respect to $\boldsymbol{\beta}$ is defined to be the $k$-dimensional vector of partial derivatives $\partial f/\partial \beta_i$:

$$
\frac{\partial f(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}} = \frac{\partial f}{\partial \boldsymbol{\beta}} = \begin{pmatrix}
\partial f/\partial \beta_1 \\
\partial f/\partial \beta_2 \\
\vdots \\
\partial f/\partial \beta_k
\end{pmatrix} \tag{2.7.1}
$$

Also,
$$
\frac{\partial f}{\partial \boldsymbol{\beta}^T} = \left(\frac{\partial f}{\partial \beta_1}, \frac{\partial f}{\partial \beta_2}, \ldots, \frac{\partial f}{\partial \beta_k}\right) \tag{2.7.2}
$$

## Definition 2.7.2

The second partial derivative of $f$ with respect to $\boldsymbol{\beta}$ is defined to be the $k \times k$ matrix of partial derivatives $\partial^2 f/\partial \beta_i \partial \beta_j = \partial^2 f/\partial \beta_j \partial \beta_i$:

$$
\frac{\partial^2 f}{\partial \boldsymbol{\beta} \partial \boldsymbol{\beta}^T} = \begin{pmatrix}
\partial^2 f/\partial \beta_1^2 & \partial^2 f/\partial \beta_1 \partial \beta_2 & \cdots & \partial^2 f/\partial \beta_1 \partial \beta_k \\
\partial^2 f/\partial \beta_1 \partial \beta_2 & \partial^2 f/\partial \beta_2^2 & \cdots & \partial^2 f/\partial \beta_2 \partial \beta_k \\
\vdots & \vdots & \ddots & \vdots \\
\partial^2 f/\partial \beta_1 \partial \beta_k & \partial^2 f/\partial \beta_2 \partial \beta_k & \cdots & \partial^2 f/\partial \beta_k^2
\end{pmatrix} \tag{2.7.3}
$$

## Example 2.7.1

Let $\boldsymbol{\beta} = (\beta_1, \beta_2)^T$ and let $f(\boldsymbol{\beta}) = \beta_1^2 - 2\beta_1\beta_2$. Then,

$$
\frac{\partial f}{\partial \boldsymbol{\beta}^T} = (2\beta_1 - 2\beta_2, -2\beta_1)
$$

and

$$
\frac{\partial^2 f}{\partial \boldsymbol{\beta} \partial \boldsymbol{\beta}^T} = \begin{pmatrix}
2 & -2 \\
-2 & 0
\end{pmatrix}
$$

## Result 2.7.1

Let $f$ and $g$ represent scalar functions of a $k$-dimensional vector $\boldsymbol{\beta}$, and let $a$ and $b$ be real constants. Then

$$
\begin{align}
\frac{\partial(af + bg)}{\partial \beta_j} &= a\frac{\partial f}{\partial \beta_j} + b\frac{\partial g}{\partial \beta_j} \\
\frac{\partial(fg)}{\partial \beta_j} &= f\frac{\partial g}{\partial \beta_j} + g\frac{\partial f}{\partial \beta_j} \\
\frac{\partial(f/g)}{\partial \beta_j} &= \frac{1}{g^2}\left(g\frac{\partial f}{\partial \beta_j} - f\frac{\partial g}{\partial \beta_j}\right)
\end{align} \tag{2.7.4}
$$

## Definition 2.7.3

Let $\mathbf{A} = \{a_{ij}\}$ be an $m \times n$ matrix and let $f(\mathbf{A})$ be a real function of $\mathbf{A}$. The first partial derivative of $f$ with respect to $\mathbf{A}$ is defined as the $m \times n$ matrix of partial derivatives $\partial f/\partial a_{ij}$:

$$
\frac{\partial f(\mathbf{A})}{\partial \mathbf{A}} = \{\partial f/\partial a_{ij}\}, \quad i = 1, \ldots, m, \quad j = 1, \ldots, n
$$

$$
= \begin{pmatrix}
\partial f/\partial a_{11} & \partial f/\partial a_{12} & \cdots & \partial f/\partial a_{1n} \\
\vdots & \vdots & \ddots & \vdots \\
\partial f/\partial a_{m1} & \partial f/\partial a_{m2} & \cdots & \partial f/\partial a_{mn}
\end{pmatrix} \tag{2.7.5}
$$

The results that follow give rules for finding partial derivatives of vector or matrix functions of matrices and vectors.

## Result 2.7.2

Let $\boldsymbol{\beta} \in \mathbb{R}^n$ and let $\mathbf{A} \in \mathbb{R}^{m \times n}$. Then,

$$
\frac{\partial \mathbf{A}\boldsymbol{\beta}}{\partial \boldsymbol{\beta}^T} = \mathbf{A}, \quad \text{and} \quad \frac{\partial \boldsymbol{\beta}^T \mathbf{A}^T}{\partial \boldsymbol{\beta}} = \mathbf{A}^T \tag{2.7.6}
$$

## Proof of Result 2.7.2

We may write
$
\mathbf{A}\boldsymbol{\beta} = \begin{pmatrix}
a_{11}\beta_1 + \cdots + a_{1n}\beta_n \\
a_{21}\beta_1 + \cdots + a_{2n}\beta_n \\
\vdots \\
a_{m1}\beta_1 + \cdots + a_{mn}\beta_n
\end{pmatrix}
$

so that by Definition 2.7.2, $\partial \mathbf{A}\boldsymbol{\beta}/\partial \boldsymbol{\beta}^T$ is given by

$
\begin{pmatrix}
\partial(a_{11}\beta_1 + \cdots + a_{1n}\beta_n)/\partial\beta_1 & \cdots & \partial(a_{11}\beta_1 + \cdots + a_{1n}\beta_n)/\partial\beta_n \\
\partial(a_{21}\beta_1 + \cdots + a_{2n}\beta_n)/\partial\beta_1 & \cdots & \partial(a_{21}\beta_1 + \cdots + a_{2n}\beta_n)/\partial\beta_n \\
\vdots & \ddots & \vdots \\
\partial(a_{m1}\beta_1 + \cdots + a_{mn}\beta_n)/\partial\beta_1 & \cdots & \partial(a_{m1}\beta_1 + \cdots + a_{mn}\beta_n)/\partial\beta_n
\end{pmatrix}
$

$
= \begin{pmatrix}
a_{11} & \cdots & a_{1n} \\
\vdots & \ddots & \vdots \\
a_{m1} & \cdots & a_{mn}
\end{pmatrix} = \mathbf{A} \tag{2.7.7}
$

That $\partial \boldsymbol{\beta}^T \mathbf{A}^T/\partial \boldsymbol{\beta} = \mathbf{A}^T$ follows by transposing both sides of (2.7.7).

## Result 2.7.3

Let $\boldsymbol{\beta} \in \mathbb{R}^n$ and $\mathbf{A} \in \mathbb{R}^{n \times n}$. Then,

$
\begin{align}
\frac{\partial \boldsymbol{\beta}^T \mathbf{A}\boldsymbol{\beta}}{\partial \boldsymbol{\beta}} &= (\mathbf{A} + \mathbf{A}^T)\boldsymbol{\beta} \\
\frac{\partial \boldsymbol{\beta}^T \mathbf{A}\boldsymbol{\beta}}{\partial \boldsymbol{\beta}^T} &= \boldsymbol{\beta}^T(\mathbf{A} + \mathbf{A}^T) \\
\frac{\partial^2 \boldsymbol{\beta}^T \mathbf{A}\boldsymbol{\beta}}{\partial \boldsymbol{\beta} \partial \boldsymbol{\beta}^T} &= \mathbf{A} + \mathbf{A}^T
\end{align} \tag{2.7.8}
$

Further, if $\mathbf{A}$ is a symmetric matrix,

$
\begin{align}
\frac{\partial \boldsymbol{\beta}^T \mathbf{A}\boldsymbol{\beta}}{\partial \boldsymbol{\beta}} &= 2\mathbf{A}\boldsymbol{\beta} \\
\frac{\partial \boldsymbol{\beta}^T \mathbf{A}\boldsymbol{\beta}}{\partial \boldsymbol{\beta}^T} &= 2\boldsymbol{\beta}^T\mathbf{A} \\
\frac{\partial^2 \boldsymbol{\beta}^T \mathbf{A}\boldsymbol{\beta}}{\partial \boldsymbol{\beta} \partial \boldsymbol{\beta}^T} &= 2\mathbf{A}
\end{align} \tag{2.7.9}
$

### Proof

We prove the result for a symmetric matrix $\mathbf{A}$. Clearly,

$
\boldsymbol{\beta}^T \mathbf{A}\boldsymbol{\beta} = \sum_{i,j=1}^n a_{ij}\beta_i\beta_j
$

so that

$
\frac{\partial \boldsymbol{\beta}^T \mathbf{A}\boldsymbol{\beta}}{\partial \beta_r} = \sum_{\substack{j=1\\j \neq r}}^n a_{rj}\beta_j + \sum_{\substack{i=1\\i \neq r}}^n a_{ir}\beta_i + 2a_{rr}\beta_r
$

$
= 2\sum_{j=1}^n a_{rj}\beta_j \quad \text{(by symmetry of } \mathbf{A}\text{)}
$

$
= 2\mathbf{a}_r^T\boldsymbol{\beta}
$

where $\mathbf{a}_r^T$ denotes the $r$th row vector of $\mathbf{A}$. By Definition 2.7.3, we get

$
\frac{\partial \boldsymbol{\beta}^T \mathbf{A}\boldsymbol{\beta}}{\partial \boldsymbol{\beta}} = \begin{pmatrix}
\partial \boldsymbol{\beta}^T \mathbf{A}\boldsymbol{\beta}/\partial \beta_1 \\
\partial \boldsymbol{\beta}^T \mathbf{A}\boldsymbol{\beta}/\partial \beta_2 \\
\vdots \\
\partial \boldsymbol{\beta}^T \mathbf{A}\boldsymbol{\beta}/\partial \beta_n
\end{pmatrix} = 2\begin{pmatrix}
\mathbf{a}_1^T \\
\mathbf{a}_2^T \\
\vdots \\
\mathbf{a}_n^T
\end{pmatrix}\boldsymbol{\beta} = 2\mathbf{A}\boldsymbol{\beta} \tag{2.7.10}
$

The second result follows by transposing both sides of (2.7.10). To show the last result, we again take the first partial derivative of $\partial \boldsymbol{\beta}^T \mathbf{A}\boldsymbol{\beta}/\partial \boldsymbol{\beta}^T = 2\boldsymbol{\beta}^T\mathbf{A}$, and use Result 2.7.2.

## Result 2.7.4

Let $\mathbf{C} \in \mathbb{R}^{m \times n}$, $\boldsymbol{\alpha} \in \mathbb{R}^m$, and $\boldsymbol{\beta} \in \mathbb{R}^n$. Then,

$
\frac{\partial \boldsymbol{\alpha}^T \mathbf{C}\boldsymbol{\beta}}{\partial \mathbf{C}} = \boldsymbol{\alpha}\boldsymbol{\beta}^T \tag{2.7.11}
$

### Proof

Since
$
\boldsymbol{\alpha}^T \mathbf{C}\boldsymbol{\beta} = \sum_{i=1}^m \sum_{j=1}^n c_{ij}\alpha_i\beta_j
$

we have
$
\frac{\partial(\boldsymbol{\alpha}^T \mathbf{C}\boldsymbol{\beta})}{\partial c_{kl}} = \alpha_k\beta_l
$

which is the $(k,l)$th element of $\boldsymbol{\alpha}\boldsymbol{\beta}^T$, from which the result follows.

## Additional Results

We next give a few useful results without proof.

### Result 2.7.5

Let $\mathbf{A}$ be an $n \times n$ matrix. Then,

$
\frac{\partial \text{tr}(\mathbf{A})}{\partial \mathbf{A}} = \mathbf{I}_n \quad \text{and} \quad \frac{\partial |\mathbf{A}|}{\partial \mathbf{A}} = \text{Adj}(\mathbf{A}) \tag{2.7.12}
$

where $\text{Adj}(\mathbf{A})$ denotes the adjoint of $\mathbf{A}$.

### Result 2.7.6

Suppose $\mathbf{A}$ is an $n \times n$ matrix with $|\mathbf{A}| > 0$. Then

$
\frac{\partial \ln |\mathbf{A}|}{\partial \mathbf{A}} = (\mathbf{A}^T)^{-1} \tag{2.7.13}
$

### Result 2.7.7

Let $\mathbf{A} \in \mathbb{R}^{m \times n}$ and $\mathbf{B} \in \mathbb{R}^{n \times m}$. Then,

$
\frac{\partial \text{tr}(\mathbf{A}\mathbf{B})}{\partial \mathbf{A}} = \mathbf{B}^T \tag{2.7.14}
$

### Result 2.7.8

Let $\boldsymbol{\Omega}$ be a symmetric matrix, $\mathbf{y} \in \mathbb{R}^n$, $\boldsymbol{\beta} \in \mathbb{R}^k$, and $\mathbf{X} \in \mathbb{R}^{n \times k}$. Then,

$
\begin{align}
\frac{\partial (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^T \boldsymbol{\Omega}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})}{\partial \boldsymbol{\beta}} &= -2\mathbf{X}^T\boldsymbol{\Omega}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) \\
\frac{\partial^2 (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^T \boldsymbol{\Omega}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})}{\partial \boldsymbol{\beta} \partial \boldsymbol{\beta}^T} &= 2\mathbf{X}^T\boldsymbol{\Omega}\mathbf{X}
\end{align} \tag{2.7.15}
$

## Definition 2.7.4

The next definition deals with partial derivatives of a matrix (or a vector) with respect to some scalar $\theta$. We see that in this case, the partial derivative is itself a matrix or vector of the same dimension whose elements are the partial derivatives with respect to $\theta$ of each element of that matrix or vector.

Let $\mathbf{A} \in \mathbb{R}^{m \times n}$ be a function of a scalar $\theta$, then

$
\frac{\partial \mathbf{A}}{\partial \theta} = \{\partial a_{ij}/\partial \theta\}, \quad i = 1, \ldots, m, \quad j = 1, \ldots, n
$

$
= \begin{pmatrix}
\partial a_{11}/\partial \theta & \partial a_{12}/\partial \theta & \cdots & \partial a_{1n}/\partial \theta \\
\vdots & \vdots & \ddots & \vdots \\
\partial a_{m1}/\partial \theta & \partial a_{m2}/\partial \theta & \cdots & \partial a_{mn}/\partial \theta
\end{pmatrix} \tag{2.7.16}
$

---

*This section presents the fundamental definitions and results for vector and matrix differentiation, which are essential tools in linear model theory and optimization.*

In [1]:
import numpy as np
from typing import Callable, Tuple, Union
import sympy as sp
from sympy import symbols, diff, Matrix, derive_by_array
import warnings

class VectorMatrixDifferentiation:
    """
    Implementation of vector and matrix differentiation operations
    as defined in Section 2.7
    """
    
    def __init__(self):
        pass
    
    @staticmethod
    def gradient_scalar_wrt_vector(f: Callable, beta: np.ndarray, h: float = 1e-8) -> np.ndarray:
        """
        Compute the gradient of a scalar function f with respect to vector beta
        using finite differences (Definition 2.7.1)
        
        Args:
            f: Scalar function of vector
            beta: k-dimensional vector
            h: Step size for finite differences
            
        Returns:
            k-dimensional gradient vector
        """
        k = len(beta)
        grad = np.zeros(k)
        
        for i in range(k):
            beta_plus = beta.copy()
            beta_minus = beta.copy()
            beta_plus[i] += h
            beta_minus[i] -= h
            
            grad[i] = (f(beta_plus) - f(beta_minus)) / (2 * h)
        
        return grad
    
    @staticmethod
    def hessian_scalar_wrt_vector(f: Callable, beta: np.ndarray, h: float = 1e-6) -> np.ndarray:
        """
        Compute the Hessian matrix of a scalar function f with respect to vector beta
        using finite differences (Definition 2.7.2)
        
        Args:
            f: Scalar function of vector
            beta: k-dimensional vector
            h: Step size for finite differences
            
        Returns:
            k x k Hessian matrix
        """
        k = len(beta)
        hessian = np.zeros((k, k))
        
        for i in range(k):
            for j in range(k):
                # Compute second partial derivative using finite differences
                beta_pp = beta.copy()
                beta_pm = beta.copy()
                beta_mp = beta.copy()
                beta_mm = beta.copy()
                
                beta_pp[i] += h
                beta_pp[j] += h
                
                beta_pm[i] += h
                beta_pm[j] -= h
                
                beta_mp[i] -= h
                beta_mp[j] += h
                
                beta_mm[i] -= h
                beta_mm[j] -= h
                
                hessian[i, j] = (f(beta_pp) - f(beta_pm) - f(beta_mp) + f(beta_mm)) / (4 * h**2)
        
        return hessian
    
    @staticmethod
    def derivative_A_beta_wrt_beta(A: np.ndarray) -> np.ndarray:
        """
        Compute ∂(Aβ)/∂β^T = A (Result 2.7.2)
        
        Args:
            A: m x n matrix
            
        Returns:
            Matrix A
        """
        return A
    
    @staticmethod
    def derivative_beta_AT_wrt_beta(A: np.ndarray) -> np.ndarray:
        """
        Compute ∂(β^T A^T)/∂β = A^T (Result 2.7.2)
        
        Args:
            A: m x n matrix
            
        Returns:
            Matrix A^T
        """
        return A.T
    
    @staticmethod
    def quadratic_form_derivatives(A: np.ndarray, beta: np.ndarray, 
                                 is_symmetric: bool = None) -> Tuple[np.ndarray, np.ndarray, np.ndarray]:
        """
        Compute derivatives of quadratic form β^T A β (Result 2.7.3)
        
        Args:
            A: n x n matrix
            beta: n-dimensional vector
            is_symmetric: Whether A is symmetric (auto-detect if None)
            
        Returns:
            Tuple of (first_derivative, first_derivative_transpose, hessian)
        """
        if is_symmetric is None:
            is_symmetric = np.allclose(A, A.T)
        
        if is_symmetric:
            # For symmetric matrix
            first_deriv = 2 * A @ beta
            first_deriv_T = 2 * beta.T @ A
            hessian = 2 * A
        else:
            # For general matrix
            first_deriv = (A + A.T) @ beta
            first_deriv_T = beta.T @ (A + A.T)
            hessian = A + A.T
        
        return first_deriv, first_deriv_T, hessian
    
    @staticmethod
    def derivative_alpha_C_beta_wrt_C(alpha: np.ndarray, beta: np.ndarray) -> np.ndarray:
        """
        Compute ∂(α^T C β)/∂C = α β^T (Result 2.7.4)
        
        Args:
            alpha: m-dimensional vector
            beta: n-dimensional vector
            
        Returns:
            m x n matrix α β^T
        """
        return np.outer(alpha, beta)
    
    @staticmethod
    def trace_derivative(A: np.ndarray) -> np.ndarray:
        """
        Compute ∂tr(A)/∂A = I (Result 2.7.5)
        
        Args:
            A: n x n matrix
            
        Returns:
            n x n identity matrix
        """
        n = A.shape[0]
        return np.eye(n)
    
    @staticmethod
    def log_determinant_derivative(A: np.ndarray) -> np.ndarray:
        """
        Compute ∂ln|A|/∂A = (A^T)^{-1} (Result 2.7.6)
        
        Args:
            A: n x n invertible matrix
            
        Returns:
            (A^T)^{-1}
        """
        if np.linalg.det(A) <= 0:
            raise ValueError("Matrix must have positive determinant")
        
        return np.linalg.inv(A.T)
    
    @staticmethod
    def trace_product_derivative(A: np.ndarray, B: np.ndarray) -> np.ndarray:
        """
        Compute ∂tr(AB)/∂A = B^T (Result 2.7.7)
        
        Args:
            A: m x n matrix
            B: n x m matrix
            
        Returns:
            B^T
        """
        return B.T
    
    @staticmethod
    def regression_derivatives(y: np.ndarray, X: np.ndarray, beta: np.ndarray, 
                             Omega: np.ndarray = None) -> Tuple[np.ndarray, np.ndarray]:
        """
        Compute derivatives for regression objective (y - Xβ)^T Ω (y - Xβ) (Result 2.7.8)
        
        Args:
            y: n-dimensional response vector
            X: n x k design matrix
            beta: k-dimensional parameter vector
            Omega: n x n symmetric weight matrix (default: identity)
            
        Returns:
            Tuple of (first_derivative, hessian)
        """
        if Omega is None:
            Omega = np.eye(len(y))
        
        residual = y - X @ beta
        first_deriv = -2 * X.T @ Omega @ residual
        hessian = 2 * X.T @ Omega @ X
        
        return first_deriv, hessian


class SymbolicDifferentiation:
    """
    Symbolic computation of vector and matrix derivatives using SymPy
    """
    
    @staticmethod
    def symbolic_gradient_example():
        """
        Example 2.7.1: f(β) = β₁² - 2β₁β₂
        """
        # Define symbols
        beta1, beta2 = symbols('beta1 beta2')
        beta = Matrix([beta1, beta2])
        
        # Define function
        f = beta1**2 - 2*beta1*beta2
        
        # Compute gradient
        gradient = Matrix([diff(f, beta1), diff(f, beta2)])
        
        # Compute Hessian
        hessian = Matrix([
            [diff(f, beta1, beta1), diff(f, beta1, beta2)],
            [diff(f, beta2, beta1), diff(f, beta2, beta2)]
        ])
        
        print("Example 2.7.1:")
        print(f"f(β) = {f}")
        print(f"∇f = {gradient}")
        print(f"∇²f = {hessian}")
        
        return f, gradient, hessian
    
    @staticmethod
    def symbolic_quadratic_form():
        """
        Symbolic computation of quadratic form derivatives
        """
        # Define symbols
        n = 3  # Example dimension
        beta_symbols = [symbols(f'beta{i+1}') for i in range(n)]
        beta = Matrix(beta_symbols)
        
        # Define a symbolic symmetric matrix
        A = Matrix([
            [symbols('a11'), symbols('a12'), symbols('a13')],
            [symbols('a12'), symbols('a22'), symbols('a23')],
            [symbols('a13'), symbols('a23'), symbols('a33')]
        ])
        
        # Quadratic form
        quad_form = beta.T @ A @ beta
        quad_form = quad_form[0]  # Extract scalar
        
        # Compute gradient
        gradient = Matrix([diff(quad_form, b) for b in beta_symbols])
        
        print("\nSymbolic Quadratic Form:")
        print(f"β^T A β = {quad_form}")
        print(f"∇(β^T A β) = {gradient}")
        
        return quad_form, gradient


def demonstrate_examples():
    """
    Demonstrate the implementations with concrete examples
    """
    print("=== Vector and Matrix Differentiation Examples ===\n")
    
    # Initialize the class
    vmd = VectorMatrixDifferentiation()
    
    # Example 2.7.1
    print("1. Example 2.7.1: f(β) = β₁² - 2β₁β₂")
    
    def f_example(beta):
        return beta[0]**2 - 2*beta[0]*beta[1]
    
    beta_test = np.array([1.0, 2.0])
    grad_numerical = vmd.gradient_scalar_wrt_vector(f_example, beta_test)
    hess_numerical = vmd.hessian_scalar_wrt_vector(f_example, beta_test)
    
    # Analytical results
    grad_analytical = np.array([2*beta_test[0] - 2*beta_test[1], -2*beta_test[0]])
    hess_analytical = np.array([[2, -2], [-2, 0]])
    
    print(f"β = {beta_test}")
    print(f"Gradient (numerical): {grad_numerical}")
    print(f"Gradient (analytical): {grad_analytical}")
    print(f"Hessian (numerical):\n{hess_numerical}")
    print(f"Hessian (analytical):\n{hess_analytical}")
    print(f"Gradient error: {np.linalg.norm(grad_numerical - grad_analytical):.2e}")
    print(f"Hessian error: {np.linalg.norm(hess_numerical - hess_analytical):.2e}")
    
    # Result 2.7.2 demonstration
    print("\n2. Result 2.7.2: ∂(Aβ)/∂β^T = A")
    A = np.array([[1, 2, 3], [4, 5, 6]])
    result = vmd.derivative_A_beta_wrt_beta(A)
    print(f"A =\n{A}")
    print(f"∂(Aβ)/∂β^T =\n{result}")
    print(f"Verification: A == result? {np.allclose(A, result)}")
    
    # Result 2.7.3 demonstration
    print("\n3. Result 2.7.3: Quadratic form derivatives")
    A_sym = np.array([[2, 1], [1, 3]])
    beta = np.array([1, -1])
    
    first_deriv, first_deriv_T, hessian = vmd.quadratic_form_derivatives(A_sym, beta, is_symmetric=True)
    
    print(f"A (symmetric) =\n{A_sym}")
    print(f"β = {beta}")
    print(f"∂(β^T A β)/∂β = {first_deriv}")
    print(f"∂(β^T A β)/∂β^T = {first_deriv_T}")
    print(f"∂²(β^T A β)/∂β∂β^T =\n{hessian}")
    
    # Verify with analytical: 2Aβ and 2A
    expected_grad = 2 * A_sym @ beta
    expected_hess = 2 * A_sym
    print(f"Expected gradient: {expected_grad}")
    print(f"Expected Hessian:\n{expected_hess}")
    print(f"Gradient match: {np.allclose(first_deriv, expected_grad)}")
    print(f"Hessian match: {np.allclose(hessian, expected_hess)}")
    
    # Result 2.7.4 demonstration
    print("\n4. Result 2.7.4: ∂(α^T C β)/∂C = α β^T")
    alpha = np.array([1, 2, 3])
    beta = np.array([4, 5])
    result = vmd.derivative_alpha_C_beta_wrt_C(alpha, beta)
    expected = np.outer(alpha, beta)
    
    print(f"α = {alpha}")
    print(f"β = {beta}")
    print(f"∂(α^T C β)/∂C =\n{result}")
    print(f"α β^T =\n{expected}")
    print(f"Match: {np.allclose(result, expected)}")
    
    # Regression example (Result 2.7.8)
    print("\n5. Result 2.7.8: Regression derivatives")
    np.random.seed(42)
    n, k = 10, 3
    X = np.random.randn(n, k)
    y = np.random.randn(n)
    beta = np.random.randn(k)
    
    first_deriv, hessian = vmd.regression_derivatives(y, X, beta)
    
    print(f"Design matrix X shape: {X.shape}")
    print(f"Response vector y shape: {y.shape}")
    print(f"Parameter vector β shape: {beta.shape}")
    print(f"First derivative shape: {first_deriv.shape}")
    print(f"Hessian shape: {hessian.shape}")
    
    # Check if Hessian is positive definite (should be for X^T X)
    eigenvals = np.linalg.eigvals(hessian)
    print(f"Hessian eigenvalues: {eigenvals}")
    print(f"Positive definite: {np.all(eigenvals > 0)}")


def symbolic_examples():
    """
    Run symbolic differentiation examples
    """
    print("\n=== Symbolic Differentiation Examples ===\n")
    
    sym_diff = SymbolicDifferentiation()
    sym_diff.symbolic_gradient_example()
    sym_diff.symbolic_quadratic_form()


if __name__ == "__main__":
    # Run numerical examples
    demonstrate_examples()
    
    # Run symbolic examples
    try:
        symbolic_examples()
    except ImportError:
        print("\nSymPy not available for symbolic examples")
    
    print("\n=== Implementation Complete ===")

ModuleNotFoundError: No module named 'sympy'

In [2]:
import numpy as np
from typing import Callable, Tuple, Union
import warnings

# Try to import sympy, but make it optional
try:
    import sympy as sp
    from sympy import symbols, diff, Matrix, derive_by_array
    SYMPY_AVAILABLE = True
except ImportError:
    SYMPY_AVAILABLE = False
    warnings.warn("SymPy not available. Symbolic functionality will be disabled.")

class VectorMatrixDifferentiation:
    """
    Implementation of vector and matrix differentiation operations
    as defined in Section 2.7
    """
    
    def __init__(self):
        pass
    
    @staticmethod
    def gradient_scalar_wrt_vector(f: Callable, beta: np.ndarray, h: float = 1e-8) -> np.ndarray:
        """
        Compute the gradient of a scalar function f with respect to vector beta
        using finite differences (Definition 2.7.1)
        
        Args:
            f: Scalar function of vector
            beta: k-dimensional vector
            h: Step size for finite differences
            
        Returns:
            k-dimensional gradient vector
        """
        k = len(beta)
        grad = np.zeros(k)
        
        for i in range(k):
            beta_plus = beta.copy()
            beta_minus = beta.copy()
            beta_plus[i] += h
            beta_minus[i] -= h
            
            grad[i] = (f(beta_plus) - f(beta_minus)) / (2 * h)
        
        return grad
    
    @staticmethod
    def hessian_scalar_wrt_vector(f: Callable, beta: np.ndarray, h: float = 1e-6) -> np.ndarray:
        """
        Compute the Hessian matrix of a scalar function f with respect to vector beta
        using finite differences (Definition 2.7.2)
        
        Args:
            f: Scalar function of vector
            beta: k-dimensional vector
            h: Step size for finite differences
            
        Returns:
            k x k Hessian matrix
        """
        k = len(beta)
        hessian = np.zeros((k, k))
        
        for i in range(k):
            for j in range(k):
                # Compute second partial derivative using finite differences
                beta_pp = beta.copy()
                beta_pm = beta.copy()
                beta_mp = beta.copy()
                beta_mm = beta.copy()
                
                beta_pp[i] += h
                beta_pp[j] += h
                
                beta_pm[i] += h
                beta_pm[j] -= h
                
                beta_mp[i] -= h
                beta_mp[j] += h
                
                beta_mm[i] -= h
                beta_mm[j] -= h
                
                hessian[i, j] = (f(beta_pp) - f(beta_pm) - f(beta_mp) + f(beta_mm)) / (4 * h**2)
        
        return hessian
    
    @staticmethod
    def derivative_A_beta_wrt_beta(A: np.ndarray) -> np.ndarray:
        """
        Compute ∂(Aβ)/∂β^T = A (Result 2.7.2)
        
        Args:
            A: m x n matrix
            
        Returns:
            Matrix A
        """
        return A
    
    @staticmethod
    def derivative_beta_AT_wrt_beta(A: np.ndarray) -> np.ndarray:
        """
        Compute ∂(β^T A^T)/∂β = A^T (Result 2.7.2)
        
        Args:
            A: m x n matrix
            
        Returns:
            Matrix A^T
        """
        return A.T
    
    @staticmethod
    def quadratic_form_derivatives(A: np.ndarray, beta: np.ndarray, 
                                 is_symmetric: bool = None) -> Tuple[np.ndarray, np.ndarray, np.ndarray]:
        """
        Compute derivatives of quadratic form β^T A β (Result 2.7.3)
        
        Args:
            A: n x n matrix
            beta: n-dimensional vector
            is_symmetric: Whether A is symmetric (auto-detect if None)
            
        Returns:
            Tuple of (first_derivative, first_derivative_transpose, hessian)
        """
        if is_symmetric is None:
            is_symmetric = np.allclose(A, A.T)
        
        if is_symmetric:
            # For symmetric matrix
            first_deriv = 2 * A @ beta
            first_deriv_T = 2 * beta.T @ A
            hessian = 2 * A
        else:
            # For general matrix
            first_deriv = (A + A.T) @ beta
            first_deriv_T = beta.T @ (A + A.T)
            hessian = A + A.T
        
        return first_deriv, first_deriv_T, hessian
    
    @staticmethod
    def derivative_alpha_C_beta_wrt_C(alpha: np.ndarray, beta: np.ndarray) -> np.ndarray:
        """
        Compute ∂(α^T C β)/∂C = α β^T (Result 2.7.4)
        
        Args:
            alpha: m-dimensional vector
            beta: n-dimensional vector
            
        Returns:
            m x n matrix α β^T
        """
        return np.outer(alpha, beta)
    
    @staticmethod
    def trace_derivative(A: np.ndarray) -> np.ndarray:
        """
        Compute ∂tr(A)/∂A = I (Result 2.7.5)
        
        Args:
            A: n x n matrix
            
        Returns:
            n x n identity matrix
        """
        n = A.shape[0]
        return np.eye(n)
    
    @staticmethod
    def log_determinant_derivative(A: np.ndarray) -> np.ndarray:
        """
        Compute ∂ln|A|/∂A = (A^T)^{-1} (Result 2.7.6)
        
        Args:
            A: n x n invertible matrix
            
        Returns:
            (A^T)^{-1}
        """
        if np.linalg.det(A) <= 0:
            raise ValueError("Matrix must have positive determinant")
        
        return np.linalg.inv(A.T)
    
    @staticmethod
    def trace_product_derivative(A: np.ndarray, B: np.ndarray) -> np.ndarray:
        """
        Compute ∂tr(AB)/∂A = B^T (Result 2.7.7)
        
        Args:
            A: m x n matrix
            B: n x m matrix
            
        Returns:
            B^T
        """
        return B.T
    
    @staticmethod
    def regression_derivatives(y: np.ndarray, X: np.ndarray, beta: np.ndarray, 
                             Omega: np.ndarray = None) -> Tuple[np.ndarray, np.ndarray]:
        """
        Compute derivatives for regression objective (y - Xβ)^T Ω (y - Xβ) (Result 2.7.8)
        
        Args:
            y: n-dimensional response vector
            X: n x k design matrix
            beta: k-dimensional parameter vector
            Omega: n x n symmetric weight matrix (default: identity)
            
        Returns:
            Tuple of (first_derivative, hessian)
        """
        if Omega is None:
            Omega = np.eye(len(y))
        
        residual = y - X @ beta
        first_deriv = -2 * X.T @ Omega @ residual
        hessian = 2 * X.T @ Omega @ X
        
        return first_deriv, hessian


class SymbolicDifferentiation:
    """
    Symbolic computation of vector and matrix derivatives using SymPy
    """
    
    @staticmethod
    def symbolic_gradient_example():
        """
        Example 2.7.1: f(β) = β₁² - 2β₁β₂
        """
        if not SYMPY_AVAILABLE:
            print("SymPy not available. Cannot run symbolic examples.")
            return None, None, None
            
        # Define symbols
        beta1, beta2 = symbols('beta1 beta2')
        beta = Matrix([beta1, beta2])
        
        # Define function
        f = beta1**2 - 2*beta1*beta2
        
        # Compute gradient
        gradient = Matrix([diff(f, beta1), diff(f, beta2)])
        
        # Compute Hessian
        hessian = Matrix([
            [diff(f, beta1, beta1), diff(f, beta1, beta2)],
            [diff(f, beta2, beta1), diff(f, beta2, beta2)]
        ])
        
        print("Example 2.7.1:")
        print(f"f(β) = {f}")
        print(f"∇f = {gradient}")
        print(f"∇²f = {hessian}")
        
        return f, gradient, hessian
    
    @staticmethod
    def symbolic_quadratic_form():
        """
        Symbolic computation of quadratic form derivatives
        """
        if not SYMPY_AVAILABLE:
            print("SymPy not available. Cannot run symbolic quadratic form example.")
            return None, None
            
        # Define symbols
        n = 3  # Example dimension
        beta_symbols = [symbols(f'beta{i+1}') for i in range(n)]
        beta = Matrix(beta_symbols)
        
        # Define a symbolic symmetric matrix
        A = Matrix([
            [symbols('a11'), symbols('a12'), symbols('a13')],
            [symbols('a12'), symbols('a22'), symbols('a23')],
            [symbols('a13'), symbols('a23'), symbols('a33')]
        ])
        
        # Quadratic form
        quad_form = beta.T @ A @ beta
        quad_form = quad_form[0]  # Extract scalar
        
        # Compute gradient
        gradient = Matrix([diff(quad_form, b) for b in beta_symbols])
        
        print("\nSymbolic Quadratic Form:")
        print(f"β^T A β = {quad_form}")
        print(f"∇(β^T A β) = {gradient}")
        
        return quad_form, gradient


def demonstrate_examples():
    """
    Demonstrate the implementations with concrete examples
    """
    print("=== Vector and Matrix Differentiation Examples ===\n")
    
    # Initialize the class
    vmd = VectorMatrixDifferentiation()
    
    # Example 2.7.1
    print("1. Example 2.7.1: f(β) = β₁² - 2β₁β₂")
    
    def f_example(beta):
        return beta[0]**2 - 2*beta[0]*beta[1]
    
    beta_test = np.array([1.0, 2.0])
    grad_numerical = vmd.gradient_scalar_wrt_vector(f_example, beta_test)
    hess_numerical = vmd.hessian_scalar_wrt_vector(f_example, beta_test)
    
    # Analytical results
    grad_analytical = np.array([2*beta_test[0] - 2*beta_test[1], -2*beta_test[0]])
    hess_analytical = np.array([[2, -2], [-2, 0]])
    
    print(f"β = {beta_test}")
    print(f"Gradient (numerical): {grad_numerical}")
    print(f"Gradient (analytical): {grad_analytical}")
    print(f"Hessian (numerical):\n{hess_numerical}")
    print(f"Hessian (analytical):\n{hess_analytical}")
    print(f"Gradient error: {np.linalg.norm(grad_numerical - grad_analytical):.2e}")
    print(f"Hessian error: {np.linalg.norm(hess_numerical - hess_analytical):.2e}")
    
    # Result 2.7.2 demonstration
    print("\n2. Result 2.7.2: ∂(Aβ)/∂β^T = A")
    A = np.array([[1, 2, 3], [4, 5, 6]])
    result = vmd.derivative_A_beta_wrt_beta(A)
    print(f"A =\n{A}")
    print(f"∂(Aβ)/∂β^T =\n{result}")
    print(f"Verification: A == result? {np.allclose(A, result)}")
    
    # Result 2.7.3 demonstration
    print("\n3. Result 2.7.3: Quadratic form derivatives")
    A_sym = np.array([[2, 1], [1, 3]])
    beta = np.array([1, -1])
    
    first_deriv, first_deriv_T, hessian = vmd.quadratic_form_derivatives(A_sym, beta, is_symmetric=True)
    
    print(f"A (symmetric) =\n{A_sym}")
    print(f"β = {beta}")
    print(f"∂(β^T A β)/∂β = {first_deriv}")
    print(f"∂(β^T A β)/∂β^T = {first_deriv_T}")
    print(f"∂²(β^T A β)/∂β∂β^T =\n{hessian}")
    
    # Verify with analytical: 2Aβ and 2A
    expected_grad = 2 * A_sym @ beta
    expected_hess = 2 * A_sym
    print(f"Expected gradient: {expected_grad}")
    print(f"Expected Hessian:\n{expected_hess}")
    print(f"Gradient match: {np.allclose(first_deriv, expected_grad)}")
    print(f"Hessian match: {np.allclose(hessian, expected_hess)}")
    
    # Result 2.7.4 demonstration
    print("\n4. Result 2.7.4: ∂(α^T C β)/∂C = α β^T")
    alpha = np.array([1, 2, 3])
    beta = np.array([4, 5])
    result = vmd.derivative_alpha_C_beta_wrt_C(alpha, beta)
    expected = np.outer(alpha, beta)
    
    print(f"α = {alpha}")
    print(f"β = {beta}")
    print(f"∂(α^T C β)/∂C =\n{result}")
    print(f"α β^T =\n{expected}")
    print(f"Match: {np.allclose(result, expected)}")
    
    # Regression example (Result 2.7.8)
    print("\n5. Result 2.7.8: Regression derivatives")
    np.random.seed(42)
    n, k = 10, 3
    X = np.random.randn(n, k)
    y = np.random.randn(n)
    beta = np.random.randn(k)
    
    first_deriv, hessian = vmd.regression_derivatives(y, X, beta)
    
    print(f"Design matrix X shape: {X.shape}")
    print(f"Response vector y shape: {y.shape}")
    print(f"Parameter vector β shape: {beta.shape}")
    print(f"First derivative shape: {first_deriv.shape}")
    print(f"Hessian shape: {hessian.shape}")
    
    # Check if Hessian is positive definite (should be for X^T X)
    eigenvals = np.linalg.eigvals(hessian)
    print(f"Hessian eigenvalues: {eigenvals}")
    print(f"Positive definite: {np.all(eigenvals > 0)}")


def symbolic_examples():
    """
    Run symbolic differentiation examples
    """
    print("\n=== Symbolic Differentiation Examples ===\n")
    
    sym_diff = SymbolicDifferentiation()
    sym_diff.symbolic_gradient_example()
    sym_diff.symbolic_quadratic_form()


if __name__ == "__main__":
    # Run numerical examples
    demonstrate_examples()
    
    # Run symbolic examples
    try:
        symbolic_examples()
    except ImportError:
        print("\nSymPy not available for symbolic examples")
    
    print("\n=== Implementation Complete ===")

  if sys.path[0] == "":


=== Vector and Matrix Differentiation Examples ===

1. Example 2.7.1: f(β) = β₁² - 2β₁β₂
β = [1. 2.]
Gradient (numerical): [-2.00000001 -1.99999999]
Gradient (analytical): [-2. -2.]
Hessian (numerical):
[[ 2.00006678e+00 -2.00006678e+00]
 [-2.00006678e+00 -2.22044605e-04]]
Hessian (analytical):
[[ 2 -2]
 [-2  0]]
Gradient error: 1.58e-08
Hessian error: 2.50e-04

2. Result 2.7.2: ∂(Aβ)/∂β^T = A
A =
[[1 2 3]
 [4 5 6]]
∂(Aβ)/∂β^T =
[[1 2 3]
 [4 5 6]]
Verification: A == result? True

3. Result 2.7.3: Quadratic form derivatives
A (symmetric) =
[[2 1]
 [1 3]]
β = [ 1 -1]
∂(β^T A β)/∂β = [ 2 -4]
∂(β^T A β)/∂β^T = [ 2 -4]
∂²(β^T A β)/∂β∂β^T =
[[4 2]
 [2 6]]
Expected gradient: [ 2 -4]
Expected Hessian:
[[4 2]
 [2 6]]
Gradient match: True
Hessian match: True

4. Result 2.7.4: ∂(α^T C β)/∂C = α β^T
α = [1 2 3]
β = [4 5]
∂(α^T C β)/∂C =
[[ 4  5]
 [ 8 10]
 [12 15]]
α β^T =
[[ 4  5]
 [ 8 10]
 [12 15]]
Match: True

5. Result 2.7.8: Regression derivatives
Design matrix X shape: (10, 3)
Response vector

# 2.8 Special Operations on Matrices

This section covers important matrix operations including the Kronecker product, vectorization, and direct sum operations that are fundamental in linear algebra and statistical computing.

## Definition 2.8.1: Kronecker Product of Matrices

Let $\mathbf{A} = \{a_{ij}\}$ be an $m \times n$ matrix and $\mathbf{B} = \{b_{ij}\}$ be a $p \times q$ matrix. The **Kronecker product** of $\mathbf{A}$ and $\mathbf{B}$ is denoted by $\mathbf{A} \otimes \mathbf{B}$ and is defined to be the $mp \times nq$ matrix

$$
\mathbf{A} \otimes \mathbf{B} = \begin{pmatrix}
a_{11}\mathbf{B} & a_{12}\mathbf{B} & \cdots & a_{1n}\mathbf{B} \\
a_{21}\mathbf{B} & a_{22}\mathbf{B} & \cdots & a_{2n}\mathbf{B} \\
\vdots & \vdots & \ddots & \vdots \\
a_{m1}\mathbf{B} & a_{m2}\mathbf{B} & \cdots & a_{mn}\mathbf{B}
\end{pmatrix} \tag{2.8.1}
$$

The matrix in (2.8.1) is a partitioned matrix whose $(i,j)$th entry is a $p \times q$ submatrix $a_{ij}\mathbf{B}$. The Kronecker product $\mathbf{A} \otimes \mathbf{B}$ can be defined regardless of the dimensions of $\mathbf{A}$ and $\mathbf{B}$. The Kronecker product is also referred to in the literature as the **direct product** or the **tensor product**.

## Example 2.8.1

Consider two matrices $\mathbf{A}$ and $\mathbf{B}$ where

$$
\mathbf{A} = \begin{pmatrix}
3 & 4 \\
2 & 0
\end{pmatrix}, \quad \text{and} \quad \mathbf{B} = \begin{pmatrix}
-1 & 5 & -1 \\
0 & 3 & 3
\end{pmatrix}
$$

Then,

$$
\mathbf{A} \otimes \mathbf{B} = \begin{pmatrix}
-3 & 15 & -3 & -4 & 20 & -4 \\
0 & 9 & 9 & 0 & 12 & 12 \\
-2 & 10 & -2 & 0 & 0 & 0 \\
0 & 6 & 6 & 0 & 0 & 0
\end{pmatrix}
$$

and

$$
\mathbf{B} \otimes \mathbf{A} = \begin{pmatrix}
-3 & -4 & 15 & 20 & -3 & -4 \\
-2 & 0 & 10 & 0 & -2 & 0 \\
0 & 0 & 9 & 12 & 9 & 12 \\
0 & 0 & 6 & 0 & 6 & 0
\end{pmatrix}
$$

In general, $\mathbf{A} \otimes \mathbf{B}$ is not equal to $\mathbf{B} \otimes \mathbf{A}$. The elements in these two products are the same, except that they are in different positions.

The definition $\mathbf{A} \otimes \mathbf{B}$ extends naturally to more than two matrices:

$$
\mathbf{A} \otimes \mathbf{B} \otimes \mathbf{C} = \mathbf{A} \otimes (\mathbf{B} \otimes \mathbf{C}) \quad \text{and} \quad \bigotimes_{i=1}^k \mathbf{A}_i = \mathbf{A}_1 \otimes \mathbf{A}_2 \otimes \cdots \otimes \mathbf{A}_k \tag{2.8.2}
$$

## Result 2.8.1: Properties of Kronecker Product

Let $\mathbf{A}$ be an $m \times n$ matrix. Then:

1. For a positive scalar $c$, we have $c \otimes \mathbf{A} = \mathbf{A} \otimes c = c\mathbf{A}$.

2. For any diagonal matrix $\mathbf{D} = \text{diag}(d_1, \ldots, d_k)$, $\mathbf{D} \otimes \mathbf{A} = \text{diag}(d_1\mathbf{A}, \ldots, d_k\mathbf{A})$.

3. $\mathbf{I} \otimes \mathbf{A} = \text{diag}(\mathbf{A}, \mathbf{A}, \ldots, \mathbf{A})$.

4. $\mathbf{I}_m \otimes \mathbf{I}_p = \mathbf{I}_{mp}$.

5. For a $p \times q$ matrix $\mathbf{B}$, we have $(\mathbf{A} \otimes \mathbf{B})^T = \mathbf{A}^T \otimes \mathbf{B}^T$.

6. $(\mathbf{A} \otimes \mathbf{B})(\mathbf{C} \otimes \mathbf{D}) = (\mathbf{A}\mathbf{C}) \otimes (\mathbf{B}\mathbf{D})$, where we assume that relevant matrices are conformal for multiplication.

7. $\text{rank}(\mathbf{A} \otimes \mathbf{B}) = \text{rank}(\mathbf{A}) \cdot \text{rank}(\mathbf{B})$.

8. $(\mathbf{A} + \mathbf{B}) \otimes (\mathbf{C} + \mathbf{D}) = (\mathbf{A} \otimes \mathbf{C}) + (\mathbf{A} \otimes \mathbf{D}) + (\mathbf{B} \otimes \mathbf{C}) + (\mathbf{B} \otimes \mathbf{D})$.

9. Suppose $\mathbf{A}$ is an $n \times n$ matrix, and $\mathbf{B}$ is an $m \times m$ matrix. The $nm$ eigenvalues of $\mathbf{A} \otimes \mathbf{B}$ are products of the $n$ eigenvalues $\lambda_i, i = 1, \ldots, n$ of $\mathbf{A}$ and the $m$ eigenvalues $\gamma_j, j = 1, \ldots, m$ of $\mathbf{B}$.

10. $|\mathbf{A} \otimes \mathbf{B}| = |\mathbf{A}|^m |\mathbf{B}|^n = \left(\prod_{i=1}^n \lambda_i\right)^m \left(\prod_{j=1}^m \gamma_j\right)^n$.

11. Provided all the inverses exist, $(\mathbf{A} \otimes \mathbf{B})^{-1} = \mathbf{A}^{-1} \otimes \mathbf{B}^{-1}$.

## Definition 2.8.2: Vectorization of Matrices

Given an $m \times n$ matrix $\mathbf{A}$ with columns $\mathbf{a}_1, \ldots, \mathbf{a}_n$, we define $\text{vec}(\mathbf{A}) = (\mathbf{a}_1^T, \ldots, \mathbf{a}_n^T)^T$ to be an $mn$-dimensional column vector.

## Result 2.8.2: Properties of the vec Operator

1. Given $m \times n$ matrices $\mathbf{A}$ and $\mathbf{B}$, $\text{vec}(\mathbf{A} + \mathbf{B}) = \text{vec}(\mathbf{A}) + \text{vec}(\mathbf{B})$.

2. If $\mathbf{A}$, $\mathbf{B}$ and $\mathbf{C}$ are respectively $m \times n$, $n \times p$ and $p \times q$ matrices, then:

   (i) $\text{vec}(\mathbf{A}\mathbf{B}) = (\mathbf{I}_p \otimes \mathbf{A})\text{vec}(\mathbf{B}) = (\mathbf{B}^T \otimes \mathbf{I}_m)\text{vec}(\mathbf{A})$

   (ii) $\text{vec}(\mathbf{A}\mathbf{B}\mathbf{C}) = (\mathbf{C}^T \otimes \mathbf{A})\text{vec}(\mathbf{B})$

   (iii) $\text{vec}(\mathbf{A}\mathbf{B}\mathbf{C}) = (\mathbf{I}_q \otimes \mathbf{A}\mathbf{B})\text{vec}(\mathbf{C}) = (\mathbf{C}^T\mathbf{B}^T \otimes \mathbf{I}_n)\text{vec}(\mathbf{A})$

3. If $\mathbf{A}$ is $m \times n$ and $\mathbf{B}$ is $n \times m$, 
   $$\text{vec}(\mathbf{B}^T)^T\text{vec}(\mathbf{A}) = \text{vec}(\mathbf{A}^T)^T\text{vec}(\mathbf{B}) = \text{tr}(\mathbf{A}\mathbf{B})$$

4. If $\mathbf{A}$, $\mathbf{B}$ and $\mathbf{C}$ are respectively $m \times n$, $n \times p$ and $p \times m$ matrices,
   $$\begin{align}
   \text{tr}(\mathbf{A}\mathbf{B}\mathbf{C}) &= \text{vec}(\mathbf{A}^T)^T(\mathbf{C}^T \otimes \mathbf{I}_n)\text{vec}(\mathbf{B}) \\
   &= \text{vec}(\mathbf{A}^T)^T(\mathbf{I}_m \otimes \mathbf{B})\text{vec}(\mathbf{C}) \\
   &= \text{vec}(\mathbf{B}^T)^T(\mathbf{A} \otimes \mathbf{I}_p)\text{vec}(\mathbf{C}) \\
   &= \text{vec}(\mathbf{B}^T)^T(\mathbf{I}_n \otimes \mathbf{C})\text{vec}(\mathbf{A}) \\
   &= \text{vec}(\mathbf{C}^T)^T(\mathbf{B}^T \otimes \mathbf{I}_m)\text{vec}(\mathbf{A}) \\
   &= \text{vec}(\mathbf{C}^T)^T(\mathbf{I}_p \otimes \mathbf{A})\text{vec}(\mathbf{B})
   \end{align}$$

## Definition 2.8.3: Direct Sum of Matrices

The **direct sum** of two matrices $\mathbf{A}$ and $\mathbf{B}$ (which can be of any dimension) is defined as

$$
\mathbf{A} \oplus \mathbf{B} = \begin{pmatrix}
\mathbf{A} & \mathbf{O} \\
\mathbf{O} & \mathbf{B}
\end{pmatrix} \tag{2.8.3}
$$

This operation extends naturally to more than two matrices:

$$
\bigoplus_{i=1}^k \mathbf{A}_i = \mathbf{A}_1 \oplus \mathbf{A}_2 \oplus \cdots \oplus \mathbf{A}_k = \begin{pmatrix}
\mathbf{A}_1 & \mathbf{O} & \cdots & \mathbf{O} \\
\mathbf{O} & \mathbf{A}_2 & \cdots & \mathbf{O} \\
\vdots & \vdots & \ddots & \vdots \\
\mathbf{O} & \mathbf{O} & \cdots & \mathbf{A}_k
\end{pmatrix} \tag{2.8.4}
$$

This definition applies to vectors as well.

---

*These special matrix operations are fundamental tools in multivariate statistics, quantum mechanics, signal processing, and many other areas of applied mathematics. The Kronecker product is particularly important in the analysis of structured matrices and tensor operations, while vectorization provides a bridge between matrix operations and vector spaces.*