# L15a: What comes after Transformer and LLMs?
In this lecture we'll speculate (wildly) about what might come after Transformers and Large Language Models (LLMs). For background, see the conversation [between Yann LeCun and Bill Dally | NVIDIA GTC 2025](https://www.youtube.com/watch?v=eyrDM3A_YFc).

Transformers have been a huge success, but they are not the end of the line. Many other architectures and techniques could be used to build even more powerful models. Let's explore a few of these possibilities; for a more in-depth discussion, see the review paper [Schneider, J. (2024). What comes after transformers? - A selective survey connecting ideas in deep learning. ArXiv, abs/2408.00386](https://arxiv.org/abs/2408.00386).

The Schneider paper reviews a number of different approaches that have been proposed as alternatives to Transformers, and far-out ideas for future architectures. In this lecture we'll explore a few of these ideas:

* __State Space Models (SSMs)__ are an emerging alternative to transformers for sequence modeling, using a fixed-size latent state that enables efficient processing of extremely long inputs, such as entire books or audio streams, without the quadratic computational cost of attention mechanisms. While SSMs like Mamba can match or even outperform transformers at small to medium scale, they are generally less effective than transformers at tasks requiring selective attention or copying from specific parts of the input, due to their reliance on compressing information into a fixed-size state rather than dynamically attending to all previous tokens.
* __Spiking Neural Networks (SNNs)__ are brain-inspired models that process information using discrete spikes over time, offering energy-efficient and biologically plausible alternatives to transformers, especially for tasks with strong temporal dynamics. By leveraging event-driven computation and sparse communication, SNNs can achieve high computational efficiency and are particularly well-suited for deployment on [neuromorphic hardware](https://en.wikipedia.org/wiki/Neuromorphic_computing), addressing some of the limitations of transformer architectures in terms of power consumption and real-time processing.

The sources used to prepare this lecture are:
* [Schneider, J. (2024). What comes after transformers? - A selective survey connecting ideas in deep learning. ArXiv, abs/2408.00386](https://arxiv.org/abs/2408.00386)
* [Smith, J., Warrington, A., & Linderman, S.W. (2022). Simplified State Space Layers for Sequence Modeling. ArXiv, abs/2208.04933.](https://arxiv.org/abs/2208.04933)
* [Limbacher, T., Özdenizci, O., & Legenstein, R.A. (2022). Memory-enriched computation and learning in spiking neural networks through Hebbian plasticity. ArXiv, abs/2205.11276.](https://arxiv.org/abs/2205.11276)

___

## Review: Linear Time Invariant State Space Models
Linear time invariant (LTI) state space models are a class of _continuous-time_ models that can represent a system's dynamics over time. The following equations characterize them:
$$
\begin{align*}
\dot{\mathbf{x}} &= \mathbf{A} \mathbf{x} + \mathbf{B} \mathbf{u} \\
\mathbf{y} &= \mathbf{C} \mathbf{x} + \mathbf{D} \mathbf{u}
\end{align*}
$$
where $\mathbf{x}\in\mathbb{R}^{h}$ is the $h$-dimensional state vector, $\mathbf{u}\in\mathbb{R}^{d_{in}}$ is the $d_{in}$-dimensional input vector, and $\mathbf{y}\in\mathbb{R}^{d_{out}}$ is the $d_{out}$-dimensional output vector. Given an initial state and an input signal, the LTI system is fully defined by its system matrices:
* The $\mathbf{A}\in\mathbb{R}^{h\times{h}}$ matrix is the state transition matrix, which describes how the state depends upon itself over time.
* The $\mathbf{B}\in\mathbb{R}^{h\times{d_{in}}}$ matrix is the input matrix, which describes how the input vector affects the state.
* The $\mathbf{C}\in\mathbb{R}^{d_{out}\times{h}}$ matrix is the output matrix, which describes how the state affects the output vector.
* The $\mathbf{D}\in\mathbb{R}^{d_{out}\times{d_{in}}}$ matrix is the feedforward matrix, which describes how the input vector affects the output vector.

Linear time-invariant state space models have been widely used in control theory, signal processing, and other fields. You may be familiar with these models from your automatic control class, where they are used to model system dynamics. In this lecture, we will focus on the discrete-time version of these models, which are often used in machine learning and signal processing applications.
* __Single Input Single Output (SISO)__: The simplest case of a linear time invariant state space model is the single input single output (SISO) case, where there is one input $d_{in} = 1$ and one output $d_{out} = 1$ _per time step_. In this case, the system can be represented by a single transfer function, which describes the relationship between the input and output.
* __Multiple Input Multiple Output (MIMO)__: In the multiple input multiple output (MIMO) case, there are multiple inputs and multiple outputs. In this case, the system can be represented by a matrix of transfer functions, which describes the relationship between the inputs and outputs.
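
For reference, the transfer function mentioned in both cases can be written down directly from the system matrices. Assuming a zero initial state and taking the Laplace transform of the continuous-time equations above (the symbols $s$, $\mathbf{G}$, $\mathbf{U}$, and $\mathbf{Y}$ are introduced here for illustration and do not appear elsewhere in these notes):
$$
\begin{align*}
s\mathbf{X}(s) &= \mathbf{A}\mathbf{X}(s) + \mathbf{B}\mathbf{U}(s) \\
\mathbf{Y}(s) &= \mathbf{C}\mathbf{X}(s) + \mathbf{D}\mathbf{U}(s) \\
\mathbf{G}(s) &= \mathbf{C}\left(s\mathbf{I}-\mathbf{A}\right)^{-1}\mathbf{B} + \mathbf{D}
\end{align*}
$$
so that $\mathbf{Y}(s) = \mathbf{G}(s)\mathbf{U}(s)$. In the SISO case $\mathbf{G}(s)$ is a scalar function of $s$; in the MIMO case it is a $d_{out}\times{d_{in}}$ matrix whose $(i,j)$ entry relates input $j$ to output $i$.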

### Discretization
Whether the model is SISO or MIMO, the inputs we care about arrive as discrete sequences (one token or sample per time step), so we discretize the continuous-time state space model and use the discrete hidden state in all calculations.
The discrete-time state space model is given by:
$$
\begin{align*}
\mathbf{x}_{t} &= \mathbf{\bar{A}} \mathbf{x}_{t-1} + \mathbf{\bar{B}} \mathbf{u}_{t} \\
\mathbf{y}_{t} &= \mathbf{\bar{C}} \mathbf{x}_{t} + \mathbf{\bar{D}} \mathbf{u}_{t}
\end{align*}
$$
where $\mathbf{x}_{t}$ is the hidden state at time $t$, $\mathbf{u}_{t}$ is the input at time $t$, and $\mathbf{y}_{t}$ is the output at time $t$. The discretized matrices $\mathbf{\bar{A}}$, $\mathbf{\bar{B}}$, and $\mathbf{\bar{C}}$ can be obtained from a variety of methods, such as the bilinear method:
$$
\begin{align*}
\mathbf{\bar{A}} &= \left(\mathbf{I}-\left(\Delta/2\right)\cdot\mathbf{A}\right)^{-1}\left(\mathbf{I}+\left(\Delta/2\right)\cdot\mathbf{A}\right) \\
\mathbf{\bar{B}} &= \left(\mathbf{I}-\left(\Delta/2\right)\cdot\mathbf{A}\right)^{-1}\left(\Delta\cdot\mathbf{B}\right) \\
\mathbf{\bar{C}} &= \mathbf{C}
\end{align*}
$$
where $\Delta$ is the time step size (the sampling period, i.e., the inverse of the sampling frequency), and $\mathbf{I}$ is the identity matrix. The bilinear method is a standard method for discretizing continuous-time state-space models, and it is used in many applications.
* _Simplification_: In most applications, we set $\mathbf{D} = 0$, so the output depends only on the hidden state and not directly on the input, and thus $\mathbf{\bar{D}} = 0$. Otherwise, we set $\mathbf{\bar{D}} = \mathbf{D}$, so the output depends on both the hidden state and the input.
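
The bilinear formulas above translate almost line by line into NumPy. The following is a minimal sketch, not a reference implementation: the function name `discretize_bilinear` and the toy matrices are made up for illustration, and a linear solve is used in place of an explicit matrix inverse.

```python
import numpy as np

def discretize_bilinear(A, B, C, delta):
    """Bilinear discretization of (A, B, C) with step size delta.

    Hypothetical helper for illustration; names are not from the lecture.
    """
    h = A.shape[0]
    I = np.eye(h)
    left = I - (delta / 2.0) * A
    # Solve left @ X = rhs rather than forming the inverse of `left`.
    A_bar = np.linalg.solve(left, I + (delta / 2.0) * A)
    B_bar = np.linalg.solve(left, delta * B)
    C_bar = C  # the output matrix is unchanged
    return A_bar, B_bar, C_bar

# Toy stable system (h = 2, d_in = 1, d_out = 1); values are arbitrary.
A = np.array([[-1.0, 0.0],
              [1.0, -2.0]])
B = np.array([[1.0],
              [0.0]])
C = np.array([[0.0, 1.0]])

A_bar, B_bar, C_bar = discretize_bilinear(A, B, C, delta=0.1)
print(A_bar)
print(B_bar)
```

For this toy system the eigenvalues of $\mathbf{A}$ have negative real part, and the eigenvalues of the resulting $\mathbf{\bar{A}}$ lie inside the unit circle, which is the discrete-time counterpart of stability.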

### SISO Leg-S HiPPO matrices
The Leg-S HiPPO matrices are a specific choice of system matrices for structured state space models, designed to capture long-range dependencies in sequential data efficiently. The Leg-S approach is based on [Legendre polynomials](https://en.wikipedia.org/wiki/Legendre_polynomials), which are a set of _orthogonal polynomials_ that can be used to represent functions over the finite interval $[-1,1]$. For a `SISO` problem, the elements $a_{ik}$ of the Leg-S HiPPO state transition matrix $\mathbf{A}$ (with zero-based indices $i,k = 0,\dots,h-1$) are constructed as:
$$
\begin{align*}
a_{ik} &= -\begin{cases}
    \left(2i+1\right)^{1/2}\left(2k+1\right)^{1/2} & \text{if } i>k \\
    \left(i+1\right) & \text{if } i=k \\
    0 & \text{if } i<k \\
\end{cases}
\end{align*}
$$
and the elements $b_{i}$ of the input matrix $\mathbf{B}$ are constructed as:
$$
\begin{align*}
b_{i} &= \left(2i+1\right)^{1/2} \\
\end{align*}
$$
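
The two formulas above can be turned into a short constructor. This is a sketch under the assumption that the indices $i$ and $k$ are zero-based; the function name `make_hippo_legs` is hypothetical.

```python
import numpy as np

def make_hippo_legs(h):
    """Build the Leg-S HiPPO (A, B) pair for a SISO model with state size h.

    Assumes zero-based indices i, k = 0, ..., h-1 (an assumption of this sketch).
    """
    idx = np.arange(h)
    p = np.sqrt(2 * idx + 1)               # (2i+1)^{1/2}
    # Strictly lower triangle: (2i+1)^{1/2} (2k+1)^{1/2} for i > k.
    A = np.tril(np.outer(p, p), k=-1)
    # Diagonal: i + 1.
    A = A + np.diag(idx + 1)
    # Overall minus sign; entries above the diagonal stay zero.
    A = -A
    B = p.reshape(h, 1)                    # b_i = (2i+1)^{1/2}
    return A, B

A, B = make_hippo_legs(4)
print(np.round(A, 3))
print(np.round(B, 3))
```

The resulting $\mathbf{A}$ is lower triangular with diagonal entries $-1, -2, \dots, -h$, so all of its eigenvalues are negative real numbers.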

## Two views of the same state space model
Linear state space models, whether SISO or MIMO, can be operated in two ways: (i) they can process one input token at a time, or (ii) they can process all input tokens at once. The first approach is called _sequential operation_, and the second approach is called _convolutional operation_.

### Sequential operation
Imagine that we have [a queue of input tokens](https://en.wikipedia.org/wiki/Queue_(abstract_data_type)) that we want to process one at a time. We can use a linear state space model to process each input token in the queue sequentially, and then place the output token in a corresponding [output queue](https://en.wikipedia.org/wiki/Queue_(abstract_data_type)). The time steps of the input and output queues are aligned, so that the output token at time $t$ corresponds to the input token at time $t$. Let's look at a simple algorithm for sequential processing of input tokens:

__Initialization__: The user gives $\mathbf{A}$, $\mathbf{B}$, $\mathbf{C}$, and $\mathbf{D}$ matrices to the model. The model initializes the hidden state $\mathbf{x}_{0} = \mathbf{0}$.
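
One way the queue-based processing described above might look in code is sketched below. It assumes the matrices have already been discretized, applies the discrete-time update equations from the Discretization section once per token, and uses a hypothetical function name (`run_sequential`) and toy scalar matrices chosen only for illustration.

```python
from collections import deque

import numpy as np

def run_sequential(A_bar, B_bar, C_bar, D_bar, input_queue):
    """Process tokens one at a time; returns an output queue aligned in time.

    Illustrative sketch: assumes already-discretized matrices and x_0 = 0.
    """
    x = np.zeros(A_bar.shape[0])          # hidden state x_0 = 0
    output_queue = deque()
    while input_queue:
        u = input_queue.popleft()         # next input token u_t
        x = A_bar @ x + B_bar @ u         # state update
        y = C_bar @ x + D_bar @ u         # output at the same time step
        output_queue.append(y)
    return output_queue

# Toy scalar system (h = d_in = d_out = 1); values are arbitrary.
A_bar = np.array([[0.5]])
B_bar = np.array([[1.0]])
C_bar = np.array([[1.0]])
D_bar = np.array([[0.0]])

inputs = deque(np.array([v]) for v in [1.0, 0.0, 0.0])  # a unit impulse
outputs = run_sequential(A_bar, B_bar, C_bar, D_bar, inputs)
print([float(y[0]) for y in outputs])  # impulse response: [1.0, 0.5, 0.25]
```

Each token costs one matrix-vector update of a fixed-size state, so memory use does not grow with the length of the sequence.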

### Convolutional operation
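
As a hedged illustration of processing all input tokens at once: with $\mathbf{x}_{0} = \mathbf{0}$ and $\mathbf{\bar{D}} = 0$, unrolling the discrete-time recurrence for a SISO model gives $y_{t} = \sum_{k} \mathbf{\bar{C}}\mathbf{\bar{A}}^{k}\mathbf{\bar{B}}\,u_{t-k}$, i.e., a convolution of the input with a kernel built from the system matrices. The sketch below checks this against the step-by-step recurrence; the function names and toy matrices are assumptions made for this example, and the kernel is built naively rather than with any fast algorithm.

```python
import numpy as np

def ssm_kernel(A_bar, B_bar, C_bar, length):
    """Kernel entries K[k] = C_bar A_bar^k B_bar for k = 0, ..., length-1 (SISO)."""
    kernel = np.empty(length)
    v = B_bar.copy()                       # holds A_bar^k B_bar, starting at k = 0
    for k in range(length):
        kernel[k] = (C_bar @ v).item()
        v = A_bar @ v
    return kernel

def run_convolutional(A_bar, B_bar, C_bar, u):
    """All outputs at once via a causal convolution of u with the kernel."""
    K = ssm_kernel(A_bar, B_bar, C_bar, len(u))
    return np.convolve(u, K)[: len(u)]     # keep only the first len(u) outputs

def run_recurrent(A_bar, B_bar, C_bar, u):
    """Reference: one token at a time, starting from a zero state."""
    x = np.zeros((A_bar.shape[0], 1))
    ys = []
    for u_t in u:
        x = A_bar @ x + B_bar * u_t
        ys.append((C_bar @ x).item())
    return np.array(ys)

# Toy SISO system with h = 2; values are arbitrary.
A_bar = np.array([[0.9, 0.0],
                  [0.1, 0.8]])
B_bar = np.array([[1.0],
                  [0.5]])
C_bar = np.array([[0.3, -0.7]])

rng = np.random.default_rng(0)
u = rng.standard_normal(16)

y_conv = run_convolutional(A_bar, B_bar, C_bar, u)
y_rec = run_recurrent(A_bar, B_bar, C_bar, u)
print(np.allclose(y_conv, y_rec))
```

The two functions compute the same outputs; the difference is that the convolutional form needs the whole input sequence up front, while the recurrent form only needs the current token and the current state.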

## S5 Models: From SISO to MIMO
The SISO model can be extended to the MIMO case by using a block-diagonal structure for the system matrices, i.e., we construct many independent SISO models, one for each input-output pair, where the final output is the concatenation of all outputs. Alternatively, [Smith, Warrington, and Linderman, 2022](https://arxiv.org/abs/2208.04933) proposed a more general approach that models multiple inputs and outputs in a single model, which they called the `S5` model.
* _What is S5?_ The S5 model is a state space model that uses a single set of system matrices to represent the dynamics of the system, rather than a block-diagonal bank of independent SISO models. It can be used to model both SISO and MIMO systems.
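
To make the block-diagonal construction concrete, the sketch below stacks two independent SISO models into one set of block-diagonal matrices and checks that a single step of the stacked model reproduces the two separate models. This only illustrates the block-diagonal baseline described above, not the S5 architecture itself; the helper names and the random toy matrices are assumptions made for this example.

```python
import numpy as np

def block_diag(blocks):
    """Place 2-D arrays along the diagonal of a zero matrix."""
    rows = sum(b.shape[0] for b in blocks)
    cols = sum(b.shape[1] for b in blocks)
    out = np.zeros((rows, cols))
    r = c = 0
    for b in blocks:
        out[r:r + b.shape[0], c:c + b.shape[1]] = b
        r += b.shape[0]
        c += b.shape[1]
    return out

def step(A_bar, B_bar, C_bar, x, u):
    """One discrete-time update with D_bar = 0."""
    x_next = A_bar @ x + B_bar @ u
    return x_next, C_bar @ x_next

rng = np.random.default_rng(1)
h = 2  # state size of each SISO model (arbitrary)

# Two independent SISO models with random toy parameters.
A1, A2 = 0.5 * rng.standard_normal((h, h)), 0.5 * rng.standard_normal((h, h))
B1, B2 = rng.standard_normal((h, 1)), rng.standard_normal((h, 1))
C1, C2 = rng.standard_normal((1, h)), rng.standard_normal((1, h))

# Stacked model: 2h states, 2 inputs, 2 outputs.
A = block_diag([A1, A2])
B = block_diag([B1, B2])
C = block_diag([C1, C2])

u = rng.standard_normal(2)
_, y_stacked = step(A, B, C, np.zeros(2 * h), u)

_, y1 = step(A1, B1, C1, np.zeros(h), u[0:1])
_, y2 = step(A2, B2, C2, np.zeros(h), u[1:2])
y_separate = np.concatenate([y1, y2])

print(np.allclose(y_stacked, y_separate))
```

In the block-diagonal model, output 1 never sees input 2 and vice versa. Replacing the block-diagonal $\mathbf{B}$ and $\mathbf{C}$ with dense matrices acting on one shared state removes that restriction, which is the sense in which a single MIMO model is more general.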