# DATA 1010

#### 21 September 2020

#### Matrix differentiation

Just as elementary differentiation rules are helpful for optimizing single-variable functions, *matrix* differentiation rules are helpful for optimizing expressions written in matrix form. This technique is used often in statistics. 

Suppose $\mathbf{f}$ is a function from $\mathbb{R}^n$ to $\mathbb{R}^m$ with component functions $f_1, \ldots, f_m$. Writing $\mathbf{f}(\mathbf{x}) = \mathbf{f}(x_1, \ldots, x_n)$, we define the Jacobian matrix (or *derivative* matrix) to be 

$$
\frac{\partial \mathbf{f}}{\partial \mathbf{x}} = \left[
  \begin{array}{cccc}
    \frac{\partial f_1}{\partial x_1} & \frac{\partial f_1}{\partial x_2} & \cdots & \frac{\partial f_1}{\partial x_n}  \\ 
    \frac{\partial f_2}{\partial x_1} & \frac{\partial f_2}{\partial x_2} & \cdots & \frac{\partial f_2}{\partial x_n}  \\ 
    \vdots & & \ddots & \vdots \\ 
    \frac{\partial f_m}{\partial x_1} & \frac{\partial f_m}{\partial x_2} & \cdots & \frac{\partial f_m}{\partial x_n}
  \end{array}\right]
$$
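
To make the definition concrete, here is a minimal sketch that approximates the Jacobian with forward differences. The helper name `jacobian_fd`, the step size `h`, and the example map `g` are assumptions made for illustration; they are not part of the course materials.

```julia
# Approximate the m×n Jacobian of f at x, one column per input coordinate.
# `jacobian_fd` is a hypothetical helper; h is an assumed step size.
function jacobian_fd(f, x; h=1e-6)
    fx = f(x)
    J = zeros(length(fx), length(x))
    for j in eachindex(x)
        xh = float.(x)                      # fresh floating-point copy of x
        xh[j] += h                          # nudge only the j-th coordinate
        J[:, j] .= (f(xh) .- fx) ./ h       # j-th column: partials with respect to x_j
    end
    return J
end

# Example map from R^2 to R^3 (made up for this sketch)
g(x) = [x[1] * x[2], x[1]^2, sin(x[2])]

J = jacobian_fd(g, [1.0, 2.0])

# Jacobian of g at (1, 2) computed by hand, for comparison
J_exact = [2.0 1.0
           2.0 0.0
           0.0 cos(2.0)]
```

Row $i$ of `J` holds the partial derivatives of the $i$th component and column $j$ corresponds to $x_j$, which is the layout in the displayed matrix.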

### Problem 1
(a) The derivative with respect to $\mathbf{x}$ goes by a more common name when $m = 1$. What is it?

(b) In single-variable calculus, we learn that $f(x+h) - f(x) \approx f'(x)h$ if $f$ is differentiable. Show that the vector derivative behaves similarly: if $f:\mathbb{R}^n \to \mathbb{R}$, then $f(\mathbf{x}+\mathbf{h}) - f(\mathbf{x}) \approx \frac{\partial f}{\partial \mathbf{x}} \mathbf{h}$. Check this result numerically below.

In [26]:
f(x) = x[1]^2 + x[2]^2

f([3.101,0.51]) - f([3.100,0.5])

0.016300999999998567

In [28]:
# replace verbal descriptions:
#dfdx(x) = [first component of derivative, second component]'
#dfdx([base point]) * [vector representing how far we moved from the base point]
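
One possible way to fill in the placeholders, as a sketch: the derivative of $x_1^2 + x_2^2$ is the row vector $[2x_1 \;\; 2x_2]$. The names `x0`, `h`, `actual`, and `linear` are chosen here for illustration.

```julia
f(x) = x[1]^2 + x[2]^2

# Derivative of f as a 1×2 row vector (the ' turns the column into a row)
dfdx(x) = [2x[1], 2x[2]]'

x0 = [3.100, 0.5]      # base point
h  = [0.001, 0.01]     # how far we moved from the base point

actual = f(x0 + h) - f(x0)    # true change in f
linear = dfdx(x0) * h         # first-order estimate of that change (a scalar)
```

Comparing `actual` with `linear` shows how close the linear estimate is; for this particular $f$ the gap is exactly $|\mathbf{h}|^2$, up to rounding.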

### Problem 2
Show that if $A$ is a constant matrix, then 

$$
\begin{align*}
\frac{\partial}{\partial \mathbf{x}} (A \mathbf{x}) &= A, \text{ and}\\
\frac{\partial}{\partial \mathbf{x}} (\mathbf{x}' A) &= A'
\end{align*}
$$

Hint: use the definition and work out the various partial derivatives with respect to the components of $\mathbf{x}$. If you feel stuck, make up a small example and try it.
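
In the spirit of the hint, here is a small made-up example for the first identity. Because $\mathbf{x} \mapsto A\mathbf{x}$ is linear, increasing $x_j$ by 1 changes the output by exactly the $j$th column of the Jacobian, so no small step size is needed. The matrix, the point `x`, and the name `F` are assumptions for illustration; this checks one instance and does not replace the proof.

```julia
A = [1.0 2.0 3.0
     4.0 5.0 6.0]          # a made-up 2×3 constant matrix
x = [0.5, -1.0, 2.0]

F(x) = A * x               # map from R^3 to R^2

# Column j of the Jacobian is the change in F when x_j increases by 1,
# which is exact here because F is linear.
cols = map(1:3) do j
    e = zeros(3); e[j] = 1.0
    F(x + e) - F(x)
end
J = hcat(cols...)          # assemble the columns into a 2×3 matrix
```

Comparing `J` with `A` entry by entry suggests the pattern that the proof needs to establish in general.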

### Problem 3

Show that the Hessian of a function $f:\mathbb{R}^n \to \mathbb{R}$ can be written as 

$$
\mathbf{H}(\mathbf{x}) = \frac{\partial}{\partial \mathbf{x}}
\left(\frac{\partial f}{\partial \mathbf{x}}\right)'.
$$

### Problem 4

Let $Q: \mathbb{R}^n \to \mathbb{R}$ be defined by $Q(\mathbf{x}) = \mathbf{x}' A \mathbf{x}$ where $A$ is a symmetric matrix. Find $\frac{\partial Q}{\partial \mathbf{x}}.$ 

Note: the product rule for vector differentiation says that $\frac{\partial}{\partial \mathbf{x}} (\mathbf{u}' \mathbf{v}) = \mathbf{u}'\frac{\partial \mathbf{v}}{\partial \mathbf{x}} + \mathbf{v}'\frac{\partial \mathbf{u}}{\partial \mathbf{x}}$ if $\mathbf{u}$ and $\mathbf{v}$ are vector-valued functions of $\mathbf{x}$. 
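
As a quick sanity check of the product rule, take $\mathbf{u} = \mathbf{v} = \mathbf{x}$. Since $\frac{\partial \mathbf{x}}{\partial \mathbf{x}} = I$ (the case $A = I$ of Problem 2), the rule gives

$$
\frac{\partial}{\partial \mathbf{x}} (\mathbf{x}'\mathbf{x}) = \mathbf{x}' I + \mathbf{x}' I = 2\mathbf{x}',
$$

which agrees with differentiating $x_1^2 + \cdots + x_n^2$ one coordinate at a time.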

### Problem 5

One way to think about Taylor polynomials is to observe that they **match derivatives** of a function at a given base point. For example, given a function $f$ from $\mathbb{R}$ to $\mathbb{R}$, 

$$
f(0) + f'(0)x + \frac{1}{2} f''(0) x^2
$$

is the unique quadratic polynomial whose zeroth, first, and second derivatives at $0$ match those of $f$. 

Use matrix differentiation to show that the same is true for functions of multiple variables with the quadratic function

$$
f(\mathbf{0}) + \frac{\partial f}{\partial \mathbf{x}}(\mathbf{0})\mathbf{x} + \frac{1}{2}\mathbf{x}'H\mathbf{x}, 
$$

where $H$ is the Hessian of $f$ evaluated at the origin.
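
The following sketch evaluates this quadratic for one example and compares it with the function itself near the origin. The function `p`, its hand-computed derivative `dp` and Hessian `H`, and the test point are assumptions chosen for illustration; the comparison illustrates the claim and does not prove it.

```julia
# Example function (an assumed test case): p(x) = exp(x1 + 2 x2)
p(x) = exp(x[1] + 2x[2])

p0 = 1.0                   # p at the origin
dp = [1.0 2.0]             # derivative (1×2 row vector) at the origin
H  = [1.0 2.0
      2.0 4.0]             # Hessian at the origin

# Quadratic approximation built from the formula above
q(x) = p0 + (dp * x)[1] + 0.5 * x' * H * x

x = [0.1, 0.05]
err = abs(p(x) - q(x))     # gap between the function and its quadratic approximation
```

Shrinking `x` toward the origin and watching how quickly `err` falls is a useful way to see that the first and second derivatives have been matched.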

### Problem 6

Suppose that $A$ is an $m\times n$ matrix, that $\mathbf{b} \in \mathbb{R}^m$, and that $\lambda > 0$. Find a formula for the value of $\mathbf{x}$ which minimizes $|A\mathbf{x} - \mathbf{b}|^2 + \lambda |\mathbf{x}|^2$.

(You may assume that any square matrix you would like to invert is invertible.)

Describe what happens (either to the minimizer, or to the original optimization problem) as $\lambda$ increases from 0.
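
Once you have a candidate formula, a numerical spot check can catch algebra mistakes. The sketch below defines the objective and a hypothetical helper, `looks_like_minimizer`, which tests whether any small random perturbation of a candidate lowers the objective. The helper name, the number of trials, the perturbation scale, and the example values of `A`, `b`, and `λ` are all assumptions; passing the check is evidence, not proof.

```julia
using LinearAlgebra

# Objective from this problem; norm(·)^2 is the squared Euclidean length
objective(x, A, b, λ) = norm(A * x - b)^2 + λ * norm(x)^2

# Hypothetical helper: true if none of `trials` small random perturbations
# of the candidate x gives a lower objective value.
function looks_like_minimizer(x, A, b, λ; trials=1000, scale=1e-3)
    base = objective(x, A, b, λ)
    all(objective(x + scale * randn(length(x)), A, b, λ) >= base for _ in 1:trials)
end

# Example usage with made-up data: test the (poor) candidate x = 0
A = [1.1 1; 1 1]
b = [0, 1]
looks_like_minimizer([0.0, 0.0], A, b, 0.1)
```

Replace `[0.0, 0.0]` with the value your formula produces, and repeat for several values of `λ` to see how the minimizer moves.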

### Problem 7
The result from the previous problem has a valuable data science interpretation. If we're trying to model the relationship between $A$ and $\mathbf{b}$ as linear, then we're looking to find $\mathbf{x}$ so as to minimize $|A\mathbf{x} - \mathbf{b}|^2$. 

However, if the columns of $A$ are close to being linearly dependent, then minimizing $|A\mathbf{x} - \mathbf{b}|^2$ often means exploiting directions that would vanish if the columns of $A$ were adjusted slightly to make them exactly linearly dependent. The resulting fit is better than the data can meaningfully support. 

(a) Using the matrix $A$ and vector $\mathbf{b}$ given below, solve $A\mathbf{x} = \mathbf{b}$. Try adjusting the entry 1.1 to be even closer to 1. What happens to the coefficients? 


(b) Adding the term $\lambda|\mathbf{x}|^2$ tempers the coefficients and enables us to get a more meaningful result. In the plot below, show the minimizer of $|A\mathbf{x} - \mathbf{b}|^2 + \lambda |\mathbf{x}|^2$ for various values of $\lambda$. Find one that you think is pretty good.

In [30]:
using Plots
gr(size=(400,400))
A = [1.1 1
      1  1]
b = [0, 1]
plot([(0,0),(A[1,1],A[2,1])], 
     label="first column of A", 
     legend = :bottomright, 
     aspect_ratio = :equal)
plot!([(0,0),(A[1,2],A[2,2])], label="second column of A")
plot!([(0,0),(b[1],b[2])], label="target (b)")

A \ b

2-element Array{Float64,1}:
 -9.999999999999996
 10.999999999999996
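
For the experiment in part (a), a short loop saves editing the matrix by hand. In this sketch the top-left entry is written as $1 + \varepsilon$ and $\varepsilon$ is moved toward 0; the function name `solve_near_singular` and the particular values of `ε` are assumptions for illustration.

```julia
b = [0, 1]

# Solve Ax = b with the top-left entry 1 + ε moving toward 1
solve_near_singular(ε) = [1+ε 1; 1 1] \ b

solutions = [solve_near_singular(ε) for ε in (0.1, 0.01, 0.001)]
foreach(println, solutions)    # one coefficient vector per value of ε
```

Solving the $2 \times 2$ system by hand gives $\mathbf{x} = (-1/\varepsilon,\; 1 + 1/\varepsilon)$, so the printed coefficients should grow in proportion to $1/\varepsilon$.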

### Problem 8 (challenge)

We can also differentiate functions with respect to higher-order arrays like matrices. Just as the derivative of a real-valued function with respect to a vector is a vector, the derivative of a real-valued function with respect to a matrix is a matrix: given $f:\mathbb{R}^{m\times n} \to \mathbb{R}$, we define
  $$
    \frac{\partial}{\partial W}f(W) =
    \begin{bmatrix}
      \frac{\partial f}{\partial w_{1,1}} & \cdots & \frac{\partial f}{\partial w_{1,n}} \\
      \vdots & \ddots & \vdots \\
      \frac{\partial f}{\partial w_{m,1}} & \cdots & \frac{\partial f}{\partial w_{m,n}}
    \end{bmatrix}, 
  $$
  where $w_{i,j}$ is the entry in the $i$th row and $j$th column of
  $W$. Suppose that $\mathbf{u}$ is a $1 \times m$ row vector and
  $\mathbf{v}$ is an $n \times 1$ column vector. Show that
  $$
  \frac{\partial}{\partial W}(\mathbf{u}W \mathbf{v}) = \mathbf{u}'
    \mathbf{v}'. 
  $$
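
The identity above can be checked on a small example before attempting the proof. Because $\mathbf{u}W\mathbf{v}$ is linear in $W$, increasing a single entry of $W$ by 1 changes the value by exactly the corresponding partial derivative. The particular `u`, `v`, and `W` below are made up for illustration.

```julia
u = [1.0 2.0]              # 1×2 row vector
v = [3.0, -1.0, 0.5]       # column vector of length 3
W = [1.0 0.0 2.0
     -1.0 3.0 1.0]         # arbitrary 2×3 matrix

f(W) = (u * W * v)[1]      # u*W*v is a 1-element array; take its entry

# Entry (i, j) of the derivative: change in f when W[i, j] increases by 1
# (exact because f is linear in W)
D = zeros(size(W))
for i in axes(W, 1), j in axes(W, 2)
    E = zeros(size(W)); E[i, j] = 1.0
    D[i, j] = f(W + E) - f(W)
end

D ≈ u' * v'                # compare with the claimed formula
```

Note that `D` has the same shape as `W`, and that its $(i,j)$ entry depends only on the $i$th entry of $\mathbf{u}$ and the $j$th entry of $\mathbf{v}$, which is the observation the proof rests on.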