# Optimization #

Optimization is the process of finding a maximum or minimum of a function. Adding constraints on the variables turns this into a constrained optimization problem.

#### Least Squares Problems ####

Many statistical optimization problems involve adjusting the parameters of a model so that it fits a given set of data as closely as possible. Least squares problems belong to this class: the objective function measures the sum of squared errors (up to a constant factor, the mean squared error) between the actual data and the fitted data.

The linear regression done in part 1.4 is a special case of the least squares problem.

Let $\vec{y}$ denote the data that we are aiming to fit, and $\vec{\theta}$ the parameters that we can change. $\vec{f}(\vec{\theta}|\vec{x})$ is the function that we are trying to fit to $\vec{y}$, evaluated at the given inputs $\vec{x}$. We can now define the error function as:

$$
\vec{\epsilon}(\vec{\theta}|\vec{x}) = \vec{y} - \vec{f}(\vec{\theta}|\vec{x}).
$$
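
As a small illustration, the error vector can be computed directly from its definition. The exponential model, the parameter names `a` and `b`, and the data below are assumptions chosen for the example; they do not come from the text.

```python
import math

def model(theta, x):
    """Hypothetical model f(theta | x) = a * exp(-b * x), used only for illustration."""
    a, b = theta
    return [a * math.exp(-b * xi) for xi in x]

def residuals(theta, x, y):
    """Error vector epsilon = y - f(theta | x)."""
    return [yi - fi for yi, fi in zip(y, model(theta, x))]

def sum_of_squares(theta, x, y):
    """Squared length of the error vector, the quantity a least squares fit minimizes."""
    return sum(e * e for e in residuals(theta, x, y))

x = [0.0, 1.0, 2.0, 3.0]
y = model((2.0, 0.5), x)  # noise-free data generated from known parameters

exact = sum_of_squares((2.0, 0.5), x, y)  # zero: these parameters reproduce the data
off = sum_of_squares((1.0, 0.5), x, y)    # positive: wrong amplitude leaves a residual
```

The sum of squares vanishes only when the parameters reproduce the data exactly, so smaller values indicate a closer fit.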

Now the general least squares objective function is:

$$
\min_{\vec{\theta}} (\vec{\epsilon}^\dag(\vec{\theta}| \vec{x}) \vec{\epsilon}(\vec{\theta}| \vec{x}) ) \text{ s.t. } \vec{h} (\vec{\theta}) \leq 0,
$$

where $\vec{h} (\vec{\theta})$ encodes the constraints on the values of the parameters. Unlike the examples discussed in part 1.4, when the function is not linear it is typically not possible to obtain an analytical solution. In the general case we minimize the objective by applying an iterative gradient method. One popular algorithm for this is the *Levenberg-Marquardt algorithm*. It is built on a first-order (linear) approximation of the error around the current iterate $\vec{\theta}_n$:

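$$
\vec{\epsilon}(\vec{\theta}_{n+1}) \approx \vec{\epsilon}(\vec{\theta}_{n}) + J(\vec{\theta}_n)(\vec{\theta}_{n+1} - \vec{\theta}_n).
$$

In this, $J$ is the Jacobian matrix of partial derivatives of the error, $j_{ij} = \frac{\partial \epsilon_i}{\partial \theta_j} = -\frac{\partial f_i}{\partial \theta_j}$. Minimizing the squared length of this linearized error, with a damping term added to keep the step small, gives the update $(J^\dag J + \lambda \mathbb{1})(\vec{\theta}_{n+1} - \vec{\theta}_n) = -J^\dag \vec{\epsilon}(\vec{\theta}_n)$, where $\lambda \geq 0$ is the damping parameter and $\mathbb{1}$ is the identity matrix.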
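
The iteration can be sketched in a few lines of plain Python. This is a minimal sketch under stated assumptions, not a reference implementation: the two-parameter exponential model, the starting point, and the rule of multiplying or dividing the damping parameter `lam` by 10 are illustrative choices, and the damping scales the diagonal of the normal matrix (Marquardt's variant). Here `jacobian` returns the derivatives of the model $\vec{f}$, so with $\vec{\epsilon} = \vec{y} - \vec{f}$ the damped step solves $(J_f^\dag J_f + \lambda D)\,\delta = J_f^\dag \vec{\epsilon}$, with $D$ the diagonal of $J_f^\dag J_f$.

```python
import math

def model(theta, x):
    """Hypothetical model f(theta | x) = a * exp(-b * x)."""
    a, b = theta
    return [a * math.exp(-b * xi) for xi in x]

def jacobian(theta, x):
    """Rows are data points; columns are d f_i / d a and d f_i / d b."""
    a, b = theta
    return [[math.exp(-b * xi), -a * xi * math.exp(-b * xi)] for xi in x]

def sse(theta, x, y):
    """Sum of squared errors for the given parameters."""
    return sum((yi - fi) ** 2 for yi, fi in zip(y, model(theta, x)))

def levenberg_marquardt(theta, x, y, lam=1e-3, max_iter=200, tol=1e-10):
    theta = list(theta)
    cost = sse(theta, x, y)
    for _ in range(max_iter):
        eps = [yi - fi for yi, fi in zip(y, model(theta, x))]
        J = jacobian(theta, x)
        # Normal-equation pieces: A = J^T J and g = J^T eps
        a00 = sum(r[0] * r[0] for r in J)
        a01 = sum(r[0] * r[1] for r in J)
        a11 = sum(r[1] * r[1] for r in J)
        g0 = sum(r[0] * e for r, e in zip(J, eps))
        g1 = sum(r[1] * e for r, e in zip(J, eps))
        if max(abs(g0), abs(g1)) < tol:
            break  # gradient is (numerically) zero: stationary point reached
        # Damped 2x2 system (A + lam * diag(A)) delta = g, solved by Cramer's rule
        d00 = a00 * (1.0 + lam)
        d11 = a11 * (1.0 + lam)
        det = d00 * d11 - a01 * a01
        delta = [(g0 * d11 - a01 * g1) / det, (d00 * g1 - a01 * g0) / det]
        trial = [theta[0] + delta[0], theta[1] + delta[1]]
        trial_cost = sse(trial, x, y)
        if trial_cost < cost:
            theta, cost = trial, trial_cost
            lam /= 10.0  # step accepted: behave more like Gauss-Newton
        else:
            lam *= 10.0  # step rejected: behave more like gradient descent
    return theta, cost

x = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
y = model((2.0, 0.5), x)  # noise-free data from known parameters
theta_hat, cost = levenberg_marquardt((1.0, 1.0), x, y)
```

Large values of `lam` shorten the step and turn it toward the gradient direction, while small values recover the Gauss-Newton step; adjusting `lam` after each trial is what lets the method recover from a poor starting point.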

#### Likelihood methods ####

Maximizing a likelihood function is a very flexible method for estimating the parameters of a distribution, and *maximum likelihood estimation* (MLE) is a standard procedure for doing so. Typically, algorithms for optimizing the likelihood function are more complex than least squares algorithms, and gradient methods will not always converge. The likelihood can be a very flat surface, which makes it extremely hard to converge on a global optimum. Thus, one needs to take more care when using a maximum likelihood procedure.

In general, the (asymptotic) covariance matrix of the estimators is given as:

$$
V(\vec{\theta}) = I(\vec{\theta})^{-1}, \quad I(\vec{\theta}) = - E\left[ \frac{\partial^2 \log L}{\partial\vec{\theta} \partial\vec{\theta}^\prime } \right].
$$

$I(\vec{\theta})$ is called the information matrix. For a normal density, it is diagonal, but in general it can be a complicated non-diagonal matrix.
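
The normal case can be checked numerically. The sketch below makes several assumptions for illustration: the data are made up, the parameters are taken to be the mean and the variance, and the expectation is replaced by the Hessian of the log-likelihood evaluated at the maximum likelihood estimates (the observed information). It inverts the negative Hessian and produces a covariance matrix that can be compared with the known closed-form values $\sigma^2/n$ and $2\sigma^4/n$.

```python
import math

def log_likelihood(mu, var, data):
    """Log-likelihood of i.i.d. normal data with mean mu and variance var."""
    n = len(data)
    return (-0.5 * n * math.log(2.0 * math.pi * var)
            - sum((x - mu) ** 2 for x in data) / (2.0 * var))

def hessian(f, p, h=1e-4):
    """Central finite-difference Hessian of f at the 2-vector p."""
    H = [[0.0, 0.0], [0.0, 0.0]]
    for i in range(2):
        for j in range(2):
            def shifted(si, sj):
                q = list(p)
                q[i] += si * h
                q[j] += sj * h
                return f(q[0], q[1])
            H[i][j] = (shifted(1, 1) - shifted(1, -1)
                       - shifted(-1, 1) + shifted(-1, -1)) / (4.0 * h * h)
    return H

data = [4.1, 5.3, 4.8, 6.0, 5.5, 4.6, 5.2, 4.9]  # made-up sample
n = len(data)

# Closed-form maximum likelihood estimates for the normal density
mu_hat = sum(data) / n
var_hat = sum((x - mu_hat) ** 2 for x in data) / n

# Observed information: negative Hessian of log L at the estimates
H = hessian(lambda mu, var: log_likelihood(mu, var, data), [mu_hat, var_hat])
info = [[-H[0][0], -H[0][1]], [-H[1][0], -H[1][1]]]

# Covariance of the estimators: inverse of the 2x2 information matrix
det = info[0][0] * info[1][1] - info[0][1] * info[1][0]
cov = [[info[1][1] / det, -info[0][1] / det],
       [-info[1][0] / det, info[0][0] / det]]
```

The off-diagonal entries of `cov` vanish up to rounding error, which is the diagonal structure mentioned above. For densities without closed-form estimates, the same Hessian would be evaluated at the optimum returned by a numerical optimizer.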