# L13c: Introduction to Linear Structured State Space Models of Sequences
In this lecture, we introduce linear structured state space models of _long_ sequences. These models use a time-invariant linear state space representation of _hidden_ state dynamics, and then some type of output mapping between the hidden state and the observed data. The key topics we will cover include:
* __Linear time-invariant state space models__: Fill me in.
* __S4 models__: Fill me in.
* __Training__: Fill me in.

The material for this lecture was compiled from the following sources:
Fill me in.

Let's go!
___

## Background: Linear Time Invariant State Space Models
Linear time invariant (LTI) state space models are a class of _continuous-time_ models that can be used to represent the dynamics of a system over time. They are characterized by the following equations:
$$
\begin{align*}
\dot{\mathbf{x}} &= \mathbf{A} \mathbf{x} + \mathbf{B} \mathbf{u} \\
\mathbf{y} &= \mathbf{C} \mathbf{x} + \mathbf{D} \mathbf{u}
\end{align*}
$$
where $\mathbf{x}\in\mathbb{R}^{h}$ is an $h$-dimensional state vector, $\mathbf{u}\in\mathbb{R}^{d_{in}}$ is the $d_{in}$-dimensional input vector, and $\mathbf{y}\in\mathbb{R}^{d_{out}}$ is the $d_{out}$-dimensional output vector. The LTI system is defined by the system matrices (together with the initial state and the input):
* The $\mathbf{A}\in\mathbb{R}^{h\times{h}}$ matrix is the state transition matrix, which describes how the state depends upon itself over time.
* The $\mathbf{B}\in\mathbb{R}^{h\times{d_{in}}}$ matrix is the input matrix, which describes how the input vector affects the state.
* The $\mathbf{C}\in\mathbb{R}^{d_{out}\times{h}}$ matrix is the output matrix, which describes how the state affects the output vector.
* The $\mathbf{D}\in\mathbb{R}^{d_{out}\times{d_{in}}}$ matrix is the feedforward matrix, which describes how the input vector affects the output vector.
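
To make the dimensions concrete, the following minimal Julia sketch evaluates the right-hand side of the state equation and the output equation once. The sizes and the matrix values are assumptions chosen purely for illustration; they do not come from any particular system.

```julia
using LinearAlgebra

# hypothetical sizes, chosen only to illustrate the shapes
h, din, dout = 4, 2, 3

# system matrices with arbitrary (illustrative) values
A = -Matrix{Float64}(I, h, h)   # h × h state transition matrix
B = ones(h, din)                # h × din input matrix
C = ones(dout, h)               # dout × h output matrix
D = zeros(dout, din)            # dout × din feedforward matrix

x = ones(h)      # current state
u = ones(din)    # current input

xdot = A*x + B*u   # time derivative of the state, length h
y = C*x + D*u      # output, length dout
```

Here `xdot` has the same length as the state and `y` has length $d_{out}$, which is a quick way to check that the four matrices have compatible shapes.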

Linear time invariant state space models are a powerful tool for modeling dynamic systems, and they have been widely used in control theory, signal processing, and other fields. They can be used to model a wide range of systems, including mechanical systems, electrical systems, and biological systems.

You may be familiar with these models from your automatic control class, where they are used to model the dynamics of physical systems. In this lecture, we will focus on the discrete-time version of these models, which is often used in machine learning and signal processing applications.
* __Single Input Single Output (SISO)__: The simplest case of a linear time invariant state space model is the single input single output (SISO) case, where there is one input $d_{in} = 1$ and one output $d_{out} = 1$ _per time step_. In this case, the system can be represented by a single transfer function, which describes the relationship between the input and output.
* __Multiple Input Multiple Output (MIMO)__: In the multiple input multiple output (MIMO) case, there are multiple inputs and multiple outputs. In this case, the system can be represented by a matrix of transfer functions, which describes the relationship between the inputs and outputs.
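
Moving from the continuous-time equations to a discrete-time model requires a discretization rule. One common choice, and the one used in the S4 paper, is the bilinear (Tustin) transform with a step size $\Delta$. The sketch below is a minimal illustration for the SISO case; the function names `bilinear` and `rollout`, the step size, and the scalar toy system are assumptions, not part of the lecture material.

```julia
using LinearAlgebra

# Bilinear (Tustin) discretization of ẋ = A x + B u with step size Δ:
#   Ad = (I - Δ/2 A)⁻¹ (I + Δ/2 A),   Bd = (I - Δ/2 A)⁻¹ Δ B
function bilinear(A::AbstractMatrix, B::AbstractMatrix, Δ::Real)
    M = I - (Δ/2)*A
    Ad = M \ (I + (Δ/2)*A)
    Bd = M \ (Δ*B)
    return Ad, Bd
end

# Roll out the SISO recurrence x_k = Ad x_{k-1} + Bd u_k, y_k = C x_k + D u_k,
# starting from a zero initial state.
function rollout(Ad, Bd, C, D, u::AbstractVector)
    x = zeros(size(Ad, 1))
    y = zeros(length(u))
    for k in eachindex(u)
        x = Ad*x + Bd[:, 1]*u[k]
        y[k] = (C*x)[1] + D[1, 1]*u[k]
    end
    return y
end

# toy scalar system (assumed for illustration): ẋ = -x + u, y = x
A = fill(-1.0, 1, 1); B = fill(1.0, 1, 1)
C = fill(1.0, 1, 1);  D = fill(0.0, 1, 1)

Ad, Bd = bilinear(A, B, 0.1)
y = rollout(Ad, Bd, C, D, ones(200))   # response to a unit step input
```

For this toy system the step response settles near one, the steady-state gain of the continuous-time model, which is a useful sanity check on the discretization.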

The different versions of this approach for modeling long sequences differ in the structure of the system matrices.

## S4 Methods
The S4 (Structured State Spaces) models, [developed by the Ré lab at Stanford](), represent a significant advancement in sequence modeling by leveraging the mathematical framework of state space models (SSMs) and the HiPPO (High-order Polynomial Projection Operators) theory.

* _Advantage_: Unlike traditional architectures such as RNNs, CNNs, or Transformers, S4 models are designed to capture long-range dependencies in sequential data by using a continuous-time state space formulation and a specialized state matrix known as the HiPPO matrix. This approach lets S4 process very long sequences with computational and memory cost that scales nearly linearly in the sequence length, which makes it well suited to tasks involving extensive context, such as time series forecasting, audio, and language modeling.
* _Key innovation_: The S4 model decomposes the HiPPO matrix into the sum of a normal matrix and a low-rank matrix, which allows for fast and stable computation, while its mathematical foundation provides a principled mechanism for memorizing input history over long time horizons.
* _Does it work?_ Yes! The S4 approach has set new benchmarks for long-range sequence modeling, demonstrating both state-of-the-art performance and efficiency across a variety of domains.

### SISO Leg-S $\mathbf{A}$ and $\mathbf{B}$ HiPPO matrices
The Leg-S HiPPO matrices are a specific structured choice of the $\mathbf{A}$ and $\mathbf{B}$ system matrices, designed to efficiently capture long-range dependencies in sequential data.
* _What_? The Leg-S approach is based on [Legendre polynomials](https://en.wikipedia.org/wiki/Legendre_polynomials), which are a set of orthogonal polynomials that can be used to represent functions over a finite interval. 

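The entries $a_{ik}$ of the Leg-S HiPPO state transition matrix $\mathbf{A}$ for a `SISO` problem are:
$$
\begin{align*}
a_{ik} &= \begin{cases}
    \left(2i+1\right)^{1/2}\left(2k+1\right)^{1/2} & \text{if } i>k \\
    \left(i+1\right) & \text{if } i=k \\
    0 & \text{if } i<k \\
\end{cases}
\end{align*}
$$
and the entries $b_{i}$ of the input matrix $\mathbf{B}$ are:
$$
\begin{align*}
b_{i} &= \left(2i+1\right)^{1/2} \\
\end{align*}
$$
In the S4 paper, the indices are zero-based ($i,k = 0,\dots,h-1$) and the matrix enters the state equation with an overall negative sign, i.e., $\dot{\mathbf{x}} = -\mathbf{A}\mathbf{x} + \mathbf{B}u$ in the notation used here, which makes the state dynamics stable.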

These forms of $\mathbf{A}$ and $\mathbf{B}$ have some nice theoretical properties, including:
* _Timescale invariance_: The Leg-S HiPPO matrices are invariant to the input timescale, which means that they can be used to model systems with different time scales without changing the underlying structure of the model.
* _Fast computation and boundedness_: The Leg-S HiPPO matrices can be computed efficiently using fast algorithms, which makes them suitable for real-time applications. The Leg-S HiPPO matrices also give rise to bounded gradients and approximation errors.

Let's build some example $\mathbf{A}$ and $\mathbf{B}$ matrices using the Leg-S HiPPO approach. 

In [1]:
(A,B) = let

    # initialize -
    h = 10; # internal hidden state memory size
    din = 1; # we are SISO, so single input 
    A = Array{Float64,2}(undef, h, h); # state transition matrix (h × h)
    B = Array{Float64,2}(undef, h, din); # input matrix (h × din)

    # build the A-matrix (lower triangular)
    # note: i and k run over 1:h here, so the entries are the formula evaluated at i,k = 1,…,h
    for i ∈ 1:h
        for k ∈ 1:h
            
            if (i > k)
                A[i,k] = sqrt(2*i+1)*sqrt(2*k+1); # below the diagonal

            elseif (i == k)
                A[i,k] = (i+1); # on the diagonal
            else
                A[i,k] = 0.0; # above the diagonal
            end
        end
    end

    # build the B-matrix (single column, since din = 1)
    for i ∈ 1:h
        B[i,1] = sqrt(2*i+1);
    end

    # return -
    (A,B)
end

([2.0 0.0 … 0.0 0.0; 3.872983346207417 3.0 … 0.0 0.0; … ; 7.54983443527075 9.746794344808965 … 10.0 0.0; 7.937253933193771 10.246950765959598 … 19.97498435543818 11.0], [1.7320508075688772; 2.23606797749979; … ; 4.358898943540674; 4.58257569495584;;])
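
For comparison, the S4 paper states the HiPPO Leg-S matrix with zero-based indices and an overall negative sign. The sketch below builds that form and numerically checks the normal plus low-rank structure: adding the rank-one term $\mathbf{p}\mathbf{p}^{\top}$ with $p_{n} = \sqrt{n + 1/2}$ should leave $-\tfrac{1}{2}\mathbf{I}$ plus a skew-symmetric matrix, which is a normal matrix. The function name `hippo_legs` and the small size $h = 4$ are assumptions for illustration.

```julia
using LinearAlgebra

# HiPPO Leg-S matrices in the zero-based, negative-sign convention of the S4 paper.
function hippo_legs(h::Int)
    A = zeros(h, h)
    B = zeros(h, 1)
    for i in 1:h, k in 1:h
        n, m = i - 1, k - 1              # zero-based indices
        if n > m
            A[i, k] = -sqrt(2n + 1) * sqrt(2m + 1)
        elseif n == m
            A[i, k] = -(n + 1)
        end                              # entries above the diagonal stay zero
    end
    for i in 1:h
        B[i, 1] = sqrt(2(i - 1) + 1)
    end
    return A, B
end

A, B = hippo_legs(4)

# rank-one correction p pᵀ with p_n = sqrt(n + 1/2), n = 0,…,h-1
p = [sqrt(n + 0.5) for n in 0:3]
N = A + p*p'          # candidate normal part
S = N + 0.5I          # remove the -1/2 on the diagonal; should be skew-symmetric

skew_error = norm(S + S')            # ≈ 0 if S is skew-symmetric
normal_error = norm(N*N' - N'*N)     # ≈ 0 if N is normal
```

Because this matrix is lower triangular, its eigenvalues are its diagonal entries, $-1, -2, \dots, -h$, so the continuous-time dynamics are stable under this convention.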

So we have $\mathbf{A}$ and $\mathbf{B}$ matrices that are structured in a way that allows us to efficiently compute the state space model. Where do we get $\mathbf{C}$ and $\mathbf{D}$ from? We estimate these matrices from the training data.

### Training an S4 model
Fill me in