## Numeric Optimization: A Connection to Deep Learning

...Does this even work? It seems that gradient methods fail or produce degenerate solutions for this type of problem...



Up until now, we have avoided the usual forward and backward pass notation for our variables, because I wanted to draw a link to other time-series methods. It turns out that the *smoothed* hidden state distributions are the $\gamma_t(\cdot)$ variables used in the HMM literature. Let's unroll the helper variables one by one, starting with the *forward* variable $\alpha$, which is an **unnormalized likelihood** $\ell(X_t|Y_{0:t})$: normalizing $\alpha_t$ over the states yields the filtering distribution $P(X_t|Y_{0:t})$.

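\begin{align*}
    \alpha_0(j) &\overset{\text{Def.}}{=} \pi(j)\, \mathcal B[j, Y_0] \\
    \alpha_t(j) &\overset{\text{Def.}}{=} \left(\sum_{i=1}^N \alpha_{t-1}(i)\, \mathcal A[i, j]\right) \mathcal B[j, Y_t]
\end{align*}

Here $\pi$ denotes the initial state distribution (`pi` in the code below).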

We obtain the full data likelihood $P(Y|\theta)$ by summing $\alpha_T(i)$ over all states $i$. Since the whole computation is differentiable (only matrix multiplications and element-wise products), we can **back-propagate** through the computation graph and optimize via **gradient descent**, using auto-diff and **optimization constraints** that keep the parameters stochastic (rows summing to one).
<p align="center"><img src="numeric_optim.drawio.png" alt="drawing" width="500"/></p>
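As a sanity check on the recursion, here is a minimal sketch that compares the sum over $\alpha_T$ with a brute-force sum over every hidden state path. It assumes a single emission signal for simplicity; the function names and the toy values of `pi`, `A`, `B` and `Y` are made up for illustration and do not come from the notebook.

```python
import itertools
import numpy as np

def forward_likelihood(pi, A, B, Y):
    """P(Y | theta) via the forward recursion (single emission signal)."""
    alpha = pi * B[:, Y[0]]                # base case alpha_0
    for y in Y[1:]:
        alpha = (A.T @ alpha) * B[:, y]    # alpha_t(j) = sum_i alpha_{t-1}(i) A[i, j] * B[j, y]
    return alpha.sum()

def brute_force_likelihood(pi, A, B, Y):
    """P(Y | theta) by enumerating all N**T hidden paths (tiny problems only)."""
    total = 0.0
    for path in itertools.product(range(len(pi)), repeat=len(Y)):
        p = pi[path[0]] * B[path[0], Y[0]]
        for t in range(1, len(Y)):
            p *= A[path[t - 1], path[t]] * B[path[t], Y[t]]
        total += p
    return total

# hypothetical toy parameters; every row sums to one
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
Y = [0, 1, 1]

assert np.isclose(forward_likelihood(pi, A, B, Y), brute_force_likelihood(pi, A, B, Y))
```

The brute-force version costs $N^T$ path evaluations, while the recursion needs only $T$ matrix-vector products, which is what makes the forward pass usable as an optimization objective.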

In [1]:
import numpy as np
from numpy.typing import NDArray
from scipy.optimize import minimize
from functools import partial

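def likelihood(Bs : np.ndarray, y_i : np.ndarray, lengths : np.ndarray) -> np.ndarray:
    """
    Bs : np.ndarray      of shape (num_states, sum(lengths)); the emission
                         matrices of all signals, concatenated along axis 1.
    y_i : np.ndarray     one observed category index per emission signal.
    lengths : np.ndarray number of categories of each emission signal.
    returns: per-state product of the emission likelihoods for observation y_i
    """
    # column offset at which each signal's emission matrix starts inside Bs
    sections = np.insert(np.cumsum(lengths)[:-1], 0, 0)
    indices = y_i + sections

    # one column per signal: the likelihood of its observed category in every state
    arrs = Bs[:, indices]

    return np.prod(arrs, axis=1).squeeze()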

def neg_likelihood(params : np.ndarray, Y : NDArray[np.int64], num_states : int, lengths : NDArray[np.int64]) -> float:
    """
        params : flat vector [pi, A.ravel(), Bs.ravel()]
        Y : np.ndarray of shape (T, number of emission signals)
        returns: negative data likelihood -P(Y | params)
    """
    # unpack the flat parameter vector
    pi = params[:num_states]
    A = params[num_states: num_states + num_states**2].reshape(num_states, num_states)
    Bs = params[num_states + num_states**2:].reshape(num_states, -1)

    # base case: alpha_0 = pi * B[:, Y_0]
    alpha_tm1 : np.ndarray = pi * likelihood(Bs, Y[0, :], lengths)

    # forward recursion: alpha_t = (A^T alpha_{t-1}) * B[:, Y_t]
    for y_i in Y[1:]:
        alpha_tm1 = (A.T @ alpha_tm1) * likelihood(Bs, y_i, lengths)

    return -np.sum(alpha_tm1)
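
The indexing trick in `likelihood` may be easier to follow on a concrete case. The sketch below uses two hypothetical emission matrices `B1` and `B2` (values invented for illustration) and shows that shifting each observed index by its signal's column offset picks the right entries from the concatenated matrix. Multiplying the per-signal likelihoods treats the signals as conditionally independent given the hidden state, which is what the product in `likelihood` amounts to.

```python
import numpy as np

lengths = np.array([2, 3])                       # categories per emission signal
B1 = np.array([[0.9, 0.1],
               [0.2, 0.8]])                      # signal 1: (num_states, 2)
B2 = np.array([[0.5, 0.3, 0.2],
               [0.1, 0.1, 0.8]])                 # signal 2: (num_states, 3)
Bs = np.concatenate([B1, B2], axis=1)            # (num_states, 5)

y = np.array([1, 2])                             # category 1 of signal 1, category 2 of signal 2
offsets = np.concatenate([[0], np.cumsum(lengths)[:-1]])   # first column of each signal in Bs

joint = Bs[:, y + offsets].prod(axis=1)          # per-state joint emission likelihood
expected = B1[:, 1] * B2[:, 2]                   # same quantity from the separate matrices

assert np.allclose(joint, expected)
```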

In [2]:
num_states = 2
lengths = np.array([2, 3])
observations = np.array([[0, 1], [1, 0], [1, 1], [1, 0]]) # (D, not B), (C, B), (C, not B), (W, B)

init_A = np.random.random(size=(num_states, num_states))
init_A /= init_A.sum(axis=1)[:, None]

Bs = [
    np.random.random(size=(num_states, M)) for M in lengths
]

init_Bs = [
    B / B.sum(axis=1)[:, None] for B in Bs
]

init_pi = np.ones(shape=(num_states,)) / num_states

assert np.all([np.all([np.isclose(B.sum(axis=1), 1)]) for B in init_Bs]), 'Bs not stochastic'
assert np.all(np.isclose(init_A.sum(axis=1), 1)), 'A not stochastic'
assert np.isclose(init_pi.sum(), 1), 'pi not stochastic'

In [4]:
# turn initial guesses into np arrays
A = init_A.reshape(-1)
Bs = np.concatenate(init_Bs, axis=1).reshape(-1)
pi = np.array(init_pi).reshape(-1)

params = np.concatenate([pi, A, Bs])

In [None]:
from scipy.optimize import NonlinearConstraint, Bounds, LinearConstraint

cons = [
    # TODO: sum-to-one constraints for pi, each row of A and each row of every B
]

func = partial(neg_likelihood, Y=observations, num_states=num_states, lengths=lengths)
results = minimize(func, x0=params, bounds=Bounds(lb=0, ub=1), constraints=cons)
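
One possible way to fill in the constraints is to encode every sum-to-one condition as a row of a matrix `C`, so that `C @ params == 1` holds exactly when `pi`, each row of `A` and each row of every emission matrix is a probability vector. The following is only a sketch under that assumption: the helper name `stochastic_constraint_matrix` is hypothetical, and it relies on the parameter layout `[pi, A.ravel(), Bs.ravel()]` used above.

```python
import numpy as np

def stochastic_constraint_matrix(num_states, lengths):
    """Return C such that C @ params == 1 encodes all sum-to-one constraints."""
    lengths = np.asarray(lengths)
    total_obs = int(lengths.sum())
    n_params = num_states + num_states**2 + num_states * total_obs
    rows = []

    # pi sums to one
    row = np.zeros(n_params)
    row[:num_states] = 1.0
    rows.append(row)

    # every row of A sums to one
    offset = num_states
    for i in range(num_states):
        row = np.zeros(n_params)
        row[offset + i * num_states: offset + (i + 1) * num_states] = 1.0
        rows.append(row)

    # every row of every emission block sums to one
    offset = num_states + num_states**2
    starts = np.concatenate([[0], np.cumsum(lengths)[:-1]])
    for i in range(num_states):
        for start, length in zip(starts, lengths):
            row = np.zeros(n_params)
            begin = offset + i * total_obs + int(start)
            row[begin: begin + int(length)] = 1.0
            rows.append(row)

    return np.vstack(rows)

# check on uniform (hence stochastic) parameters
num_states, lengths = 2, np.array([2, 3])
pi = np.full(num_states, 1 / num_states)
A = np.full((num_states, num_states), 1 / num_states)
Bs = np.concatenate([np.full((num_states, M), 1 / M) for M in lengths], axis=1)
params = np.concatenate([pi, A.ravel(), Bs.ravel()])

C = stochastic_constraint_matrix(num_states, lengths)
assert np.allclose(C @ params, 1.0)
```

With such a matrix, `cons` could hold a single `LinearConstraint(C, lb=1.0, ub=1.0)`. Whether the constrained problem then avoids the degenerate solutions mentioned at the top of this section is not something this sketch establishes.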