# Intuition

# 1 Introduction

> We identify a weakness of LSTM networks processing continual input streams that are not a priori segmented into subsequences with explicitly marked ends at which the network's internal state could be reset. Without resets, the state may grow indefinitely and eventually cause the network to break down

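The paper's remedy is a forget gate: an extra multiplicative gate that lets each memory cell learn to reset its own state.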

> "Back-Propagation Through Time" (Williams and Zipser, 1992; Werbos, 1988) or "Real-Time Recurrent Learning" (Robinson and Fallside, 1987; Williams and Zipser, 1992) share an important limitation. The temporal evolution of the path integral over all error signals "flowing back in time" exponentially depends on the magnitude of the weights (Hochreiter, 1991). This implies that the backpropagated error quickly either vanishes or blows up

The longer the path back in time, the more weight multiplications the error signal passes through during backprop, so its magnitude scales roughly exponentially with the number of time steps.
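A minimal sketch of that exponential dependence, assuming the simplest possible case: a single linear recurrent unit whose only weight is its self-connection. The function name and the weight values are illustrative and do not come from the paper.

```python
def backpropagated_error_scale(weight: float, steps: int) -> float:
    """Factor applied to an error signal after it flows back `steps` time
    steps through one linear unit with self-connection `weight`
    (a simplifying assumption; real networks also multiply by activation
    derivatives at every step)."""
    scale = 1.0
    for _ in range(steps):
        scale *= weight  # one weight multiplication per time step
    return scale


if __name__ == "__main__":
    for w in (0.9, 1.0, 1.1):
        print(f"weight={w}: scale after 100 steps = "
              f"{backpropagated_error_scale(w, 100):.3e}")
```

Only a weight of exactly 1.0 keeps the error constant, which is the property the CEC enforces by construction.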

> A recent model, "Long Short-Term Memory" (LSTM), (Hochreiter and Schmidhuber, 1997) is not affected by this problem. LSTM can learn to bridge minimal time lags in excess of 1000 discrete time steps by enforcing constant error flow through "constant error carrousels" (CECs)

LSTMs form reliable error pathways called CECs: each memory cell has a self-connection that acts as an identity mapping, so error flows back through it unchanged. Gates control what enters the cell, so it can either take in new information or keep what it already holds.

> The problem is that a continual input stream eventually may cause the internal values of the cells to grow without bound, even if the repetitive nature of the problem suggests they should be reset occasionally. This paper will present a remedy.

A cell state that the input gate keeps writing to retains and accumulates everything it receives. Nothing in the original LSTM releases information from a cell.

# 2 LSTMs

Cell states tend to grow linearly over a continual input stream. Unbounded growth saturates the output squashing function $h$, so its derivative vanishes, incoming error is blocked, and the cell stops learning.
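The growth and saturation can be sketched with a single cell that has no forget gate. This is an illustration under loud assumptions: the gated input is held at a constant hypothetical value (`write`), and `tanh` stands in for the squashing function $h$.

```python
import math


def run_cell_without_forget_gate(steps: int, write: float = 0.5):
    """Accumulate a constant gated input in a cell whose self-connection is
    fixed at 1.0 (the CEC). Returns one (state, h_derivative) pair per step.
    `write` and the use of tanh for h are assumptions for illustration."""
    state = 0.0
    history = []
    for _ in range(steps):
        state = state + write  # s(t) = s(t-1) + gated input, nothing is released
        h_derivative = 1.0 - math.tanh(state) ** 2  # slope of h at the current state
        history.append((state, h_derivative))
    return history


if __name__ == "__main__":
    for step, (state, slope) in enumerate(run_cell_without_forget_gate(100), 1):
        if step in (1, 10, 100):
            print(f"step {step}: state={state:.1f}, h'(state)={slope:.3e}")
```

The state rises linearly with the number of steps while the slope of `h` collapses toward zero, so almost no error can pass back through the cell output.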

One fix is to have a teacher reset the internal states whenever a new sequence starts. That requires the teacher to know where each sequence begins, which is exactly the segmentation a continual input stream does not provide.
We want a solution built into the architecture instead, so that the network learns by itself when to forget.

# 3 Forget Gates

## Forward Pass of extended LSTM with Forget Gates

$\begin{gathered}
y^{\varphi_j}(t)=f_{\varphi_j}\left(\operatorname{net}_{\varphi_j}(t)\right) \\
\operatorname{net}_{\varphi_j}(t)=\sum_m w_{\varphi_j m} y^m(t-1)
\end{gathered}$
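The forget gate activation $y^{\varphi_j}$ is computed like the other gates' activations: a logistic sigmoid $f_{\varphi_j}$ with range $[0, 1]$ applied to the weighted sum of the previous step's unit outputs $y^m(t-1)$.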
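For context, here is how the forget gate activation enters the cell state update. The symbols $s_{c_j^v}$ (state of cell $v$ in block $j$), $y^{in_j}$ (input gate activation), and $g$ (input squashing function) follow the standard LSTM formulation and are not defined elsewhere in these notes, so treat the exact notation as an assumption:

$\begin{gathered}
s_{c_j^v}(t)=y^{\varphi_j}(t)\, s_{c_j^v}(t-1)+y^{in_j}(t)\, g\left(\operatorname{net}_{c_j^v}(t)\right)
\end{gathered}$

With $y^{\varphi_j}(t)=1$ this reduces to the original CEC update, and with $y^{\varphi_j}(t)=0$ the previous state is erased.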

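The forget gate is initialized to always remember: its bias starts positive, so its activation is close to 1.0 and the cell initially behaves like a standard LSTM cell. It then learns when to forget.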
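A minimal sketch of one cell update showing the effect of that initialization. The bias value `3.0`, the function names, and the use of `tanh` for the input squashing function are all hypothetical choices for illustration, not the paper's settings.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def cell_step(state, net_forget, net_in, net_cell, forget_bias=3.0):
    """One cell-state update with a forget gate.
    `forget_bias` is a hypothetical positive initial bias; tanh stands in
    for the input squashing function. Returns (new_state, forget_activation)."""
    y_forget = sigmoid(net_forget + forget_bias)  # near 1.0 when the bias is positive
    y_in = sigmoid(net_in)
    new_state = y_forget * state + y_in * math.tanh(net_cell)
    return new_state, y_forget


if __name__ == "__main__":
    # Positive bias: the old state is almost fully kept (standard LSTM behaviour).
    print(cell_step(10.0, 0.0, 0.0, 0.0, forget_bias=3.0))
    # Negative bias: the old state is almost fully discarded.
    print(cell_step(10.0, 0.0, 0.0, 0.0, forget_bias=-3.0))
```

Starting near 1.0 means training begins from the behaviour of the original LSTM and only moves away from it where forgetting helps.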

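Error flow is truncated at the forget gates just as at the other gates: errors reaching a gate update its incoming weights but are not propagated further back in time. The CEC remains the only path through which error flows back across time steps.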
