# L13c: Introduction to Linear Structured State Space Models of Sequences
This lecture introduces linear structured state space models of _long_ sequences. These models use a time-invariant linear state space representation of _hidden_ state dynamics, then some output mapping between the hidden state and the observed data. The key topics we will cover include:
* __Linear time-invariant state space models__: A time-invariant linear state-space model is a mathematical representation of a linear time-invariant (LTI) system using state variables, where the system's dynamics are described by four constant matrices $\mathbf{A}$, $\mathbf{B}$, $\mathbf{C}$, and $\mathbf{D}$. These models characterize systems with fixed parameters and linear relationships between inputs, outputs, and internal states over time.
* __S4 Leg-S models__: Stanford's S4 Leg-S models are structured state-space sequence models that use the HiPPO-LegS matrix to initialize the state dynamics. They model long-range dependencies efficiently by projecting the input history onto an orthogonal polynomial basis. These models excel in tasks with extremely long sequences, such as those in the [Long Range Arena benchmark](https://arxiv.org/abs/2011.04006), by combining linear state space dynamics with deep learning architectures.

The material for this lecture was compiled from the following sources: [click me!](https://github.com/varnerlab/CHEME-5820-Lectures-Spring-2025/tree/main/lectures/week-13/L13c/docs)

Let's go!
___

## Background: Linear Time Invariant State Space Models
Linear time invariant (LTI) state space models are a class of _continuous-time_ models that can represent a system's dynamics over time. The following equations characterize them:
$$
\begin{align*}
\dot{\mathbf{x}} &= \mathbf{A} \mathbf{x} + \mathbf{B} \mathbf{u} \\
\mathbf{y} &= \mathbf{C} \mathbf{x} + \mathbf{D} \mathbf{u}
\end{align*}
$$
where $\mathbf{x}\in\mathbb{R}^{h}$ is an $h$-dimensional state vector, $\mathbf{u}\in\mathbb{R}^{d_{in}}$ is the $d_{in}$-dimensional input vector, and $\mathbf{y}\in\mathbb{R}^{d_{out}}$ is the $d_{out}$-dimensional output vector. The LTI system is defined by the system matrices (together with the initial state and the input):
* The $\mathbf{A}\in\mathbb{R}^{h\times{h}}$ matrix is the state transition matrix, which describes how the state depends upon itself over time.
* The $\mathbf{B}\in\mathbb{R}^{h\times{d_{in}}}$ matrix is the input matrix, which describes how the input vector affects the state.
* The $\mathbf{C}\in\mathbb{R}^{d_{out}\times{h}}$ matrix is the output matrix, which describes how the state affects the output vector.
* The $\mathbf{D}\in\mathbb{R}^{d_{out}\times{d_{in}}}$ matrix is the feedforward matrix, which describes how the input vector affects the output vector.

Linear time-invariant state space models have been widely used in control theory, signal processing, and other fields. They can model a wide range of systems, including mechanical, electrical, and biological systems.
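
To make the roles of the four matrices concrete, here is a minimal sketch that simulates a small continuous-time LTI system with a forward Euler step. The two-state system, the function name `simulate_lti`, the step size `dt`, and the unit step input are all illustrative assumptions, not part of the lecture.

```julia
# Hypothetical example system (not from the lecture): two hidden states,
# one input, one output. The eigenvalues of A are -1 and -2, so it is stable.
A = [0.0 1.0; -2.0 -3.0];        # state matrix (h × h)
B = reshape([0.0, 1.0], 2, 1);   # input matrix (h × d_in)
C = [1.0 0.0];                   # output matrix (d_out × h)
D = zeros(1, 1);                 # feedforward matrix (d_out × d_in)

# Forward Euler integration of ẋ = Ax + Bu, y = Cx + Du (a simple, assumed scheme)
function simulate_lti(A, B, C, D, u; x0 = zeros(size(A, 1)), dt = 1e-3, T = 10.0)
    steps = round(Int, T / dt)
    x = copy(x0)
    y = zeros(size(C, 1), steps)
    for k in 1:steps
        uk = u((k - 1) * dt)           # input vector at the current time
        y[:, k] = C * x + D * uk       # output equation
        x = x + dt * (A * x + B * uk)  # one Euler step of the state equation
    end
    return y
end

u(t) = [1.0]                      # unit step input
y = simulate_lti(A, B, C, D, u)
y_final = y[1, end]               # should approach the steady state -C*inv(A)*B = 0.5
```

For a constant input the state settles where $\mathbf{A}\mathbf{x}+\mathbf{B}\mathbf{u}=0$, which for this toy system gives an output close to 0.5.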

You may be familiar with these models from your automatic control class, where they are used to model system dynamics. In this lecture, we will focus on the discrete-time version of these models, which are often used in machine learning and signal processing applications.
* __Single Input Single Output (SISO)__: The simplest case of a linear time invariant state space model is the single input single output (SISO) case, where there is one input $d_{in} = 1$ and one output $d_{out} = 1$ _per time step_. In this case, the system can be represented by a single transfer function, which describes the relationship between the input and output.
* __Multiple Input Multiple Output (MIMO)__: In the multiple input multiple output (MIMO) case, there are multiple inputs and multiple outputs. In this case, the system can be represented by a matrix of transfer functions, which describes the relationship between the inputs and outputs.
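
For reference, the transfer function mentioned above follows from taking the Laplace transform of the state space equations, assuming a zero initial state $\mathbf{x}(0)=0$:
$$
\begin{align*}
\mathbf{Y}(s) &= \mathbf{G}(s)\,\mathbf{U}(s) \\
\mathbf{G}(s) &= \mathbf{C}\left(s\mathbf{I}-\mathbf{A}\right)^{-1}\mathbf{B}+\mathbf{D}
\end{align*}
$$
In the SISO case $\mathbf{G}(s)$ is a scalar function of $s$; in the MIMO case it is a $d_{out}\times{d_{in}}$ matrix whose $(j,m)$ entry relates input $m$ to output $j$.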

The different versions of this approach for modeling long sequences differ in the structure of the system matrices.

## S4 Methods
The S4 (Structured State Space Sequence) models, [developed by the Re lab at Stanford](https://cs.stanford.edu/~chrismre/), represent a significant advancement in sequence modeling by leveraging the mathematical framework of state space models (SSMs) and the HiPPO (High-order Polynomial Projection Operators) theory.

* _Advantage_: Unlike traditional architectures such as RNNs, CNNs, or Transformers, S4 models are designed to efficiently capture long-range dependencies in sequential data using a continuous-time state space formulation and a specialized state matrix known as the HiPPO matrix. This approach enables S4 to process long sequences with computational and memory cost that scales nearly linearly with the sequence length. It is effective for tasks involving extensive context, such as time series forecasting, audio, and language modeling.
* _Does it work?_ Yes! S4 (and a newer variant called [the S5 approach](https://arxiv.org/pdf/2208.04933)) have set new benchmarks for long-range sequence modeling, demonstrating both state-of-the-art performance and efficiency across a variety of domains.

### SISO Leg-S $\mathbf{A}$ and $\mathbf{B}$ HiPPO matrices
The Leg-S HiPPO matrices are a specific structured choice of the $\mathbf{A}$ and $\mathbf{B}$ matrices, designed so that the hidden state efficiently captures long-range dependencies in sequential data.
* _What_? The Leg-S (scaled Legendre) approach is based on [Legendre polynomials](https://en.wikipedia.org/wiki/Legendre_polynomials), which are a set of _orthogonal polynomials_ that can be used to represent functions over the finite interval $[-1,1]$. The hidden state stores the coefficients of a Legendre expansion of the input history.

The Leg-S HiPPO state transition matrix $\mathbf{A}$ for a `SISO` problem has entries $a_{ik}$ (in the HiPPO paper the indices are zero-based, $i,k = 0,1,\dots,h-1$) constructed as:
$$
\begin{align*}
a_{ik} &= \begin{cases}
    \left(2i+1\right)^{1/2}\left(2k+1\right)^{1/2} & \text{if } i>k \\
    \left(i+1\right) & \text{if } i=k \\
    0 & \text{if } i<k \\
\end{cases}
\end{align*}
$$
and the entries $b_{i}$ of the input matrix $\mathbf{B}$ are constructed as:
$$
\begin{align*}
b_{i} &= \left(2i+1\right)^{1/2} \\
\end{align*}
$$
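
As a small worked example, assume $h = 3$ and zero-based indices $i,k\in\left\{0,1,2\right\}$ (the HiPPO paper's convention; the code cell in this lecture plugs one-based indices into the same formulas, so its entries are shifted). Substituting into the two definitions gives:
$$
\begin{align*}
\mathbf{A} &= \begin{bmatrix}
1 & 0 & 0 \\
\sqrt{3} & 2 & 0 \\
\sqrt{5} & \sqrt{15} & 3
\end{bmatrix} \qquad
\mathbf{B} = \begin{bmatrix}
1 \\
\sqrt{3} \\
\sqrt{5}
\end{bmatrix}
\end{align*}
$$
For instance, $a_{21} = \left(2\cdot{2}+1\right)^{1/2}\left(2\cdot{1}+1\right)^{1/2} = \sqrt{5}\sqrt{3} = \sqrt{15}$. The matrix is lower triangular, so its eigenvalues are the diagonal entries $i+1$.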

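A note on sign conventions: the HiPPO paper writes the Leg-S dynamics with a leading minus sign on $\mathbf{A}$, and the S4 paper uses the negated matrix $-\mathbf{A}$ as the state matrix in $\dot{\mathbf{x}} = \mathbf{A}\mathbf{x}+\mathbf{B}\mathbf{u}$. With that sign, all eigenvalues are negative and the hidden state stays bounded.

This form of $\mathbf{A}$ and $\mathbf{B}$ has some nice theoretical properties, including: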
* _Timescale robustness_: The Leg-S HiPPO matrices are invariant to the input timescale, which means that they can be used to model systems with different time scales without changing the underlying structure of the model.
* _Fast computation and boundedness_: The Leg-S HiPPO matrices can be computed efficiently using fast algorithms, making them suitable for real-time applications. They also give rise to bounded gradients and approximation errors.
* _Alternatives_? Is Leg-S the only approach to building $\mathbf{A}$ and $\mathbf{B}$? No! Other approaches use different polynomials. [For more details, click me!](https://github.com/varnerlab/CHEME-5820-Lectures-Spring-2025/blob/main/lectures/week-13/L13c/docs/Gu-arXix-HiPPO-2020.pdf)

Let's build some example $\mathbf{A}$ and $\mathbf{B}$ matrices using the Leg-S HiPPO approach. 

In [4]:
(A,B) = let

    # initialize -
    h = 10; # internal hidden state memory size
    din = 1; # we are SISO, so single input 
    A = Array{Float64,2}(undef, h, h); # state transition matrix (h × h)
    B = Array{Float64,2}(undef, h, din); # input matrix (h × din)

    # build the A-matrix (lower triangular)
    # note: i and k are used here as one-based Julia indices, so A[1,1] = 2
    for i ∈ 1:h
        for k ∈ 1:h
            
            if (i > k)
                A[i,k] = sqrt((2*i+1))*sqrt((2*k+1)); # below the diagonal
            elseif (i == k)
                A[i,k] = (i+1); # on the diagonal
            else
                A[i,k] = 0.0; # above the diagonal
            end
        end
    end

    # build the B-matrix
    for i ∈ 1:h
        B[i,1] = sqrt((2*i+1));
    end

    # return -
    (A,B)
end

([2.0 0.0 … 0.0 0.0; 3.872983346207417 3.0 … 0.0 0.0; … ; 7.54983443527075 9.746794344808965 … 10.0 0.0; 7.937253933193771 10.246950765959598 … 19.97498435543818 11.0], [1.7320508075688772; 2.23606797749979; … ; 4.358898943540674; 4.58257569495584;;])

In [16]:
B

10×1 Matrix{Float64}:
 1.7320508075688772
 2.23606797749979
 2.6457513110645907
 3.0
 3.3166247903554
 3.605551275463989
 3.872983346207417
 4.123105625617661
 4.358898943540674
 4.58257569495584

So we have $\mathbf{A}$ and $\mathbf{B}$ matrices that are structured in a way that allows us to efficiently compute the state space model. Where do we get $\mathbf{C}$ and $\mathbf{D}$ from? We estimate these matrices from the training data.

### Training an S4 model
We train the S4 model using a standard supervised learning approach, minimizing the difference between the predicted output and the true output (loss function). In particular, we estimate the elements of the $\mathbf{C}$ and (sometimes) the $\mathbf{D}$ matrices.
* _Loss function_: The loss function is typically a mean squared error (MSE) or cross-entropy loss, depending on the output data type. The loss function measures the difference between the predicted and true output.
* _Optimization_: The optimization process is typically done using stochastic gradient descent (SGD) or one of its variants, such as Adam or RMSprop. The optimization process updates the parameters of the model (the $\mathbf{C}$ and $\mathbf{D}$ matrices) to minimize the loss function.

However, there is an interesting wrinkle. The data arrive as a discrete sequence, so we discretize the continuous-time state space model and use the discrete-time hidden state in all calculations.

#### Discretization
The discrete-time state space model is given by:
$$
\begin{align*}
\mathbf{x}_{t+1} &= \mathbf{\bar{A}} \mathbf{x}_{t} + \mathbf{\bar{B}} \mathbf{u}_{t} \\
\mathbf{y}_{t} &= \mathbf{\bar{C}} \mathbf{x}_{t} + \mathbf{\bar{D}} \mathbf{u}_{t}
\end{align*}
$$
where $\mathbf{x}_{t}$ is the hidden state at time $t$, $\mathbf{u}_{t}$ is the input at time $t$, and $\mathbf{y}_{t}$ is the output at time $t$. The discretized matrices $\mathbf{\bar{A}}$, $\mathbf{\bar{B}}$, and $\mathbf{\bar{C}}$ can be obtained by a variety of methods, such as the bilinear method:
$$
\begin{align*}
\mathbf{\bar{A}} &= \left(\mathbf{I}-\left(\Delta/2\right)\cdot\mathbf{A}\right)^{-1}\left(\mathbf{I}+\left(\Delta/2\right)\cdot\mathbf{A}\right) \\
\mathbf{\bar{B}} &= \left(\mathbf{I}-\left(\Delta/2\right)\cdot\mathbf{A}\right)^{-1}\left(\Delta\cdot\mathbf{B}\right) \\
\mathbf{\bar{C}} &= \mathbf{C}
\end{align*}
$$
where $\Delta$ is the time step size (the sampling period, i.e., the reciprocal of the sampling frequency), and $\mathbf{I}$ is the identity matrix. The bilinear method is a standard method for discretizing continuous-time state-space models, and it is used in many applications.
* _Simplification_: In most applications, we set $\mathbf{D} = 0$, so the output depends only on the hidden state and not directly on the input, and thus $\mathbf{\bar{D}} = 0$. Otherwise, we set $\mathbf{\bar{D}} = \mathbf{D}$, and the output depends on both the hidden state and the input.
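
The bilinear formulas above can be sketched in a few lines of Julia. This is an illustrative sketch, not the lecture's implementation: the helper names `legs` and `bilinear`, the state size `h = 4`, and the step size `Δ = 0.1` are assumptions. The `legs` helper also assumes zero-based indices in the Leg-S formulas and the S4 sign convention (the negated matrix), so that the continuous-time system is stable.

```julia
using LinearAlgebra

# Hypothetical helper: Leg-S matrices with zero-based indices and the
# S4 sign convention (state matrix is the negated HiPPO matrix).
function legs(h::Int)
    A = zeros(h, h)
    B = zeros(h, 1)
    for i in 0:h-1, k in 0:h-1
        if i > k
            A[i+1, k+1] = -sqrt(2i + 1) * sqrt(2k + 1)
        elseif i == k
            A[i+1, k+1] = -(i + 1)
        end
    end
    for i in 0:h-1
        B[i+1, 1] = sqrt(2i + 1)
    end
    return A, B
end

# Bilinear discretization of ẋ = Ax + Bu with step size Δ
function bilinear(A, B, Δ)
    h = size(A, 1)
    M = I(h) - (Δ / 2) * A             # (I - Δ/2·A)
    Abar = M \ (I(h) + (Δ / 2) * A)    # solve a linear system instead of inverting M
    Bbar = M \ (Δ * B)
    return Abar, Bbar
end

A, B = legs(4)
Abar, Bbar = bilinear(A, B, 0.1)

# Abar is lower triangular, so its eigenvalues are its diagonal entries
rho = maximum(abs, diag(Abar))
```

Because the continuous-time eigenvalues $-(i+1)$ are negative, the bilinear map sends each of them to $\left(1-\Delta(i+1)/2\right)/\left(1+\Delta(i+1)/2\right)$, which lies inside the unit circle, so `rho < 1` and the discrete recursion does not blow up.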

What problem are we solving in training, e.g., for a regression task? For a `SISO` problem, we want to find the $\mathbf{C}$ matrix (which is a row-vector) that minimizes the following loss function:
$$
\begin{align*}
\mathcal{L}(\mathbf{\bar{C}}) &= \sum_{t=1}^{T}\left(y_{t}-\mathbf{\bar{C}}\mathbf{x}_{t}\right)^{2} \\
\end{align*}
$$
where $T$ is the number of time steps in the sequence. The loss function measures the (squared) difference between the predicted output and the true output, and we minimize this difference by adjusting the $\mathbf{C}$ matrix.
* _Hmmm_: Is that loss just linear regression? Yes! The loss function is a standard linear regression loss function, where we are trying to find the best linear mapping between the hidden state and the output. However, the regressors here are the hidden states $\mathbf{x}_{t}$, which are not observed directly; they are generated by running the S4 Leg-S model on the input sequence.
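
Since the loss is an ordinary least-squares problem in $\mathbf{\bar{C}}$, one way to sketch the fit is to roll the discrete recursion forward, stack the hidden states as rows of a matrix, and solve with Julia's backslash operator. Everything concrete here is an assumption for illustration: the small stable matrices standing in for $\mathbf{\bar{A}}$ and $\mathbf{\bar{B}}$, the synthetic sine input, the "true" row vector used to generate targets, and the helper name `hidden_states`.

```julia
using LinearAlgebra

# Roll the recursion x_{t+1} = Ā x_t + B̄ u_t forward from x_1 = 0 and
# return a T × h matrix whose row t is the hidden state x_t.
function hidden_states(Abar, Bbar, u)
    T = length(u)
    h = size(Abar, 1)
    X = zeros(T, h)
    x = zeros(h)
    for t in 1:T
        X[t, :] = x
        x = Abar * x + Bbar[:, 1] * u[t]
    end
    return X
end

# Hypothetical stable discrete-time system (stand-in for the discretized Leg-S model)
Abar = [0.9 0.0; 0.1 0.8]
Bbar = reshape([1.0, 0.5], 2, 1)

# Synthetic data: a sine input and noise-free targets from an assumed C
u = sin.(0.1 .* (1:200))
C_true = [2.0 -1.0]
X = hidden_states(Abar, Bbar, u)
y = X * vec(C_true)

# Least-squares estimate of the row vector C̄ and the resulting loss
C_hat = transpose(X \ y)
loss = sum(abs2, y - X * vec(C_hat))
```

With noise-free synthetic targets the estimate recovers the assumed row vector up to round-off; with real data the same solve returns the minimizer of the squared-error loss. Gradient-based optimizers reach the same solution iteratively, which matters once other parameters are trained as well.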

## Example
We are working on an example of this approach for modeling return distributions [for the upcoming INFORMS conference](https://meetings.informs.org/wordpress/annual/?_gl=1%2Apww31x%2A_gcl_au%2AMTYxNTU3NjcxOS4xNzQ0MDMwNzA1). Let's check out that (incomplete) example, where we do a sequence-to-sequence modeling task using the S4 Leg-S model.

## Lab
In Lab `L13d`, we will implement (and _hopefully_ train) an S4 model using the Leg-S HiPPO matrices for a natural language processing task. We will use our own implementation and training methods.

# Today?
That's a wrap! What are some of the interesting things we discussed today?