# The HyperCampus 1 Algorithm

In the following discussion, I will denote the *forgetting index*, the percentage of acceptable forgetting as specified by the user, with $\varphi$. Thus, the algorithm tries to keep memory retention at an average of $1-\varphi$ by scheduling cards such that they are reviewed again as soon as they are in danger of being forgotten with a probability of $\varphi$.

## The nature of forgetting

HyperCampus relies on the idea that forgetting is primarily caused by increased difficulty of access to a memory rather than damage to the memory itself. A memory is constantly threatened by the danger of some of its access pathways being blocked by other memories. If we think in terms of discrete events that reduce the accessibility of a memory, we might expect such events to happen at a constant rate. Then, the time until such an event happens is distributed according to an exponential distribution
$$p(t) = \lambda e^{-\lambda t}$$
with a rate parameter $\lambda$. Then, the share of connections that are blocked by time $T$ is given by
$$\int_0^T p(t)\,dt = 1-e^{-\lambda T}.$$
Thus, the "intactness" of a memory by the time $t$ can be described as
$$\rho(t) = e^{-\lambda t}.$$
We will call $\rho$ the *retrievability* of the memory. Now there are two fundamentally different interpretations of $\rho$:
1. It indicates the *probability of recall*, meaning that if we test a person on a certain memorized item after a time $t$, the probability that they will answer correctly is $\rho(t)$. In this case, recall is binary: An item is either recalled or not. Access to it is either entirely possible with probability $\rho$ or impossible with probability $1-\rho$.
1. It is a measure of *retrieval effort*, describing how difficult it is to retrieve a memory, by the time that has to be spent on retrieving it. Access to the memory becomes progressively more difficult, as described by $\rho(t)$.

Almost all spaced repetition systems use interpretation 1. Indeed, it is empirically well documented that recall probability decays along a negative exponential in time as in the above formula, at least for simple memories. However, it is usually difficult to measure the probability of recall directly, since the result of a single observation is either success or failure. When conducting a study, recall probability can be determined because a large amount of data is available, but on a trial-to-trial basis, as in individual spaced repetition, this is, in my eyes, unreliable. Sometimes an item isn't recalled even though retrievability was high (*tip of the tongue* phenomenon), and sometimes an item is successfully recalled even though retrievability was low (by spending a lot of time trying to retrieve it).

For this reason, HyperCampus relies on the second interpretation. The continuous grade that the user provides about how difficult the retrieval was is mapped onto a retrievability estimate, irrespective of whether the item was recalled or not. We will discuss empirical evidence for this later. For now, note that both interpretations can be reconciled if access to memory is interpreted as being a statistical search: Low $\rho$ indicates difficult access to the memory (interpretation 2), meaning that the search is more likely to be aborted before the item is found, i.e. recall fails (interpretation 1).

This way, HyperCampus still calculates the probability of recall as $\rho$, but $\rho$ is measured as retrieval effort. Note that the formula for $\rho(t)$ depends on an unknown rate parameter $\lambda$. Let us assume for a moment that we know how to estimate $\lambda$. Then the algorithm for a given item works like this:
1. Estimate $\lambda$ for that item and derive the retrievability formula $\rho(t) = e^{-\lambda t}$ for it.
1. To find the point in time when probability of forgetting rises above the $\varphi$ requested by the user, solve $\rho(T) = 1-\varphi$ for $T$:
$$T = \frac{1}{\lambda}\ln{\frac{1}{1-\varphi}}$$
1. Test the user on the item after time $T$. If the grade doesn't match the expected retrievability of $1-\varphi$, update the estimate of $\lambda$ and start again at 1.
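
The scheduling step above can be sketched in a few lines of Python. This is only an illustration of the formula for $T$; the function names and the choice of days as the time unit are my assumptions and are not taken from the HyperCampus code.

```python
import math

def retrievability(lam: float, t: float) -> float:
    """Forgetting curve rho(t) = exp(-lambda * t)."""
    return math.exp(-lam * t)

def review_interval(lam: float, forgetting_index: float) -> float:
    """Time T at which rho(T) has dropped to 1 - forgetting_index.

    Solves exp(-lam * T) = 1 - phi for T. The time unit is whatever unit
    1/lam is expressed in (assumed here to be days).
    """
    if lam <= 0:
        raise ValueError("lam must be positive")
    if not 0 < forgetting_index < 1:
        raise ValueError("forgetting_index must lie strictly between 0 and 1")
    return math.log(1.0 / (1.0 - forgetting_index)) / lam

# Example: decay rate of 0.05 per day and a forgetting index of 10 %.
T = review_interval(lam=0.05, forgetting_index=0.1)
print(f"review after {T:.2f} days, rho(T) = {retrievability(0.05, T):.2f}")
```

Note that a larger forgetting index always yields a longer interval for the same $\lambda$, since the user tolerates a lower retrievability at review time.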

Now let us have a closer look at $\lambda$. It indicates how fast $\rho$ decays. To be precise, after a time $\Delta t = \frac{1}{\lambda}$, $\rho$ has fallen to $\frac{1}{e}$ of its original value. Obviously, $\lambda$ is different for different items stored in memory: Some remain accessible longer than others. A bit less obviously, $\lambda$ is not constant. Spaced repetition is based on the assumption that reviews can be more and more spaced out over time, indicating that $\lambda$ (the decay rate) should decrease if we have reviewed an item more often.

Thus, the difficulty lies in predicting how $\lambda$ changes with reviews. Literature indicates that effects on $\lambda$ are generally multiplicative, e.g. $\lambda$ might be halved by a review. To convert multiplication to addition, we will not deal with $\lambda$ directly but with a quantity
$$\sigma = -\ln\lambda,$$
which, following [Piotr Woźniak from SuperMemo](https://supermemo.guru/wiki/Two_component_model_of_memory), I will refer to as *stability* (however, note that SuperMemo's stability is different, namely $\frac{1}{\lambda}$), as it measures how stable the storage of a memory is.

## The stability increase formula

If we assume that directly after a review, $\rho$ goes back to $1$, and at a later time $t$ we find that $\rho$ has decreased to $\rho'$, we can calculate $\sigma$ using the forgetting curve from above:
$$\sigma = -\ln{\left(\frac{1}{t}\ln\frac{1}{\rho'}\right)}.$$
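
As a small sketch, this formula translates directly into Python. The function name and the clipping constant `eps` are hypothetical; the clipping is my own addition, because $\rho' = 1$ would give $\lambda = 0$ (infinite stability) and $\rho' = 0$ would give an infinite decay rate.

```python
import math

def stability_from_observation(t: float, rho_observed: float, eps: float = 1e-6) -> float:
    """Stability sigma = -ln(lambda) implied by observing rho' after time t.

    Assumes rho was 1 directly after the previous review and decayed as
    exp(-lambda * t) since then. `eps` is a hypothetical safeguard that
    keeps rho' strictly inside (0, 1).
    """
    if t <= 0:
        raise ValueError("t must be positive")
    rho = min(max(rho_observed, eps), 1.0 - eps)
    lam = math.log(1.0 / rho) / t
    return -math.log(lam)

# Example: retrievability of 0.9 observed 10 time units after the last review.
sigma = stability_from_observation(t=10.0, rho_observed=0.9)
print(sigma)
```

Plugging the result back into the forgetting curve, $e^{-e^{-\sigma} t}$, reproduces the observed $\rho'$, which is a quick way to check the sign conventions.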

Now, we need a model on how $\sigma$ is changed by a review. A simple model is used in the [Leitner system](https://en.wikipedia.org/wiki/Leitner_system), where a successful review doubles the interval until the next review and an unsuccessful review halves the interval. Thus, the update rule is
$$\Delta\sigma = \delta \ln 2,$$
with $\delta = 1$ for a successful review and $\delta = -1$ for an unsuccessful review.
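
To see where the factor $\ln 2$ comes from, we can write the review interval in terms of $\sigma$, using $\lambda = e^{-\sigma}$ and the formula for $T$ from above (this short derivation assumes that the forgetting index $\varphi$ stays fixed between reviews):
$$T = \frac{1}{\lambda}\ln\frac{1}{1-\varphi} = e^{\sigma}\ln\frac{1}{1-\varphi} \quad\Rightarrow\quad \frac{T'}{T} = e^{\sigma'-\sigma} = e^{\Delta\sigma}.$$
Doubling the interval thus corresponds to $\Delta\sigma = \ln 2$ and halving it to $\Delta\sigma = -\ln 2$.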

However, there are many more factors that influence the change in stability than just whether or not the review was successful. Let us consider that memories are encoded in the brain as [*engrams*](https://en.wikipedia.org/wiki/Engram_(neuropsychology)), i.e. certain cellular structures, and that $\rho$ indicates how easily the engram can be accessed. Which factors contribute to $\sigma$?
* *synaptic strength*: how strong is the connection of the engram to the rest of the brain via its synapses?
* *complexity*: how many cells or synapses are necessary to encode the knowledge?
* *integratedness*: how well is the engram associated with other brain content in structural terms?

A common theory on the neurobiology of learning is that it is mediated by [long-term potentiation](https://en.wikipedia.org/wiki/Long-term_potentiation), which describes the phenomenon that synaptic strength increases between two cells after certain patterns of activation of the two. Thus, exposure to a fact, for example in a review, should increase the stability of the memory. Presumably, there is a maximum possible synaptic strength, and in analogy to the [Rescorla-Wagner model](https://en.wikipedia.org/wiki/Rescorla%E2%80%93Wagner_model), I hypothesize that
$$\Delta\sigma \propto \sigma_{max} - \sigma,$$
i.e., the increase of stability slows down as $\sigma$ approaches a certain maximum possible stability $\sigma_{max}$.

*Complexity* and *integratedness* are certainly hard to quantify, but their effect on stability might be characterized. Complexity depends on the item and probably does not change significantly over time. As complex items are supposed to have a lot of connections that would have to be strengthened for an increase in stability, complexity probably cuts down stability increase by the number of required connections. Integratedness, on the other hand, can change over time and facilitate stability enhancement: For example, making up a mnemonic might lead to better integration of a memory into the existing concepts, countering the effect that the complexity might have. Because this is all very speculative, I did not try to model these considerations explicitly. Rather, I included a factor $\alpha$ in the model that scales the stability increase and is supposed to depend on the individual relationship between learner and item, which can change over time.

Additional empirical evidence about the nature of stability increase comes from the *retrieval effort hypothesis*, according to which facts are learnt more efficiently when retrieval was difficult. This suggests that stability increase might depend inversely on retrievability,
$$\Delta\sigma \propto 1 - \rho.$$
Thus, we can hypothesize that the stability increase formula has the following form:
$$\Delta\sigma = \alpha(1-\rho)(\sigma_{max}-\sigma).$$

By expanding this formula, we obtain a model that is linear in each of $\rho$ and $\sigma$, with an additional interaction term $\rho\sigma$:
$$\Delta\sigma = \alpha\sigma_{max} - \alpha\sigma_{max}\rho - \alpha\sigma + \alpha\rho\sigma.$$

Now, to make the model more flexible and let it absorb further unknown factors, we simply replace the coefficients in the above model by four independent parameters:
$$\Delta\sigma = a + b\rho + c\sigma + d\rho\sigma.$$

[Evidence from SuperMemo data](https://www.supermemo.com/en/archives1990-2015/articles/stability) does indeed suggest that stability increase depends on $\rho$ and $\sigma$ in this functional way (although the formula published there does not include an interaction term, i.e. $d = 0$).
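
The relationship between the constrained form $\alpha(1-\rho)(\sigma_{max}-\sigma)$ and the four-parameter model can be checked with a short sketch. The function names below are hypothetical and the numbers are made up for illustration only.

```python
import numpy as np

def features(rho: float, sigma: float) -> np.ndarray:
    """Feature vector (1, rho, sigma, rho*sigma) of the four-parameter model."""
    return np.array([1.0, rho, sigma, rho * sigma])

def stability_increase(rho: float, sigma: float, theta) -> float:
    """Delta sigma = a + b*rho + c*sigma + d*rho*sigma with theta = (a, b, c, d)."""
    return float(features(rho, sigma) @ np.asarray(theta, dtype=float))

def theta_from_constrained(alpha: float, sigma_max: float) -> np.ndarray:
    """Coefficients (a, b, c, d) that reproduce alpha*(1-rho)*(sigma_max-sigma)."""
    return np.array([alpha * sigma_max, -alpha * sigma_max, -alpha, alpha])

# Illustrative values only: the constrained formula is a special case of the general one.
alpha, sigma_max, rho, sigma = 0.5, 10.0, 0.9, 4.0
general = stability_increase(rho, sigma, theta_from_constrained(alpha, sigma_max))
constrained = alpha * (1 - rho) * (sigma_max - sigma)
print(general, constrained)
```

The general model has two more degrees of freedom than the constrained one, which is what allows it to absorb effects that the $\alpha$ and $\sigma_{max}$ reasoning does not capture.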

To summarize, assuming that we know the parameters $\vartheta = (a,b,c,d)$, our algorithm now looks like this:
1. Using the current stability estimate $\sigma$, find the point in time when probability of forgetting rises above the $\varphi$ requested by the user as above.
1. Test the user on the item after time $T$. If the grade doesn't match the expected retrievability of $1-\varphi$, update the estimate of $\sigma$ and calculate the actual stability increase from the last review.
1. Using that stability increase $\Delta\sigma$, update the model parameters $\vartheta$.
1. Using the new parameters $\vartheta'$, calculate the new stability $\sigma' = \sigma + f(\rho,\sigma,\vartheta')$ with $f$ being the model from above and start again at 1.

Naturally, the next question is: How to find the appropriate values of $\vartheta$?

## Determining the model parameters

The problem in determining appropriate values for the model parameters $\vartheta = (a,b,c,d)$ is that it has to be done even when little data is available. The solution implemented by HyperCampus uses [Bayesian linear regression](https://en.wikipedia.org/wiki/Bayesian_linear_regression). The general scheme works like this:
1. Guess initial values for $\vartheta$ by specifying a prior probability distribution $p(\vartheta)$.
1. Specify a likelihood $p(\Delta\sigma\mid\vartheta)$ for how probable any observed value of the stability increase is given specific values $\vartheta$.
1. Measure $\Delta\sigma$ and use [Bayes' Theorem](https://en.wikipedia.org/wiki/Bayes%27_theorem) to obtain a probability distribution of $\vartheta$ given that observation:
$$p(\vartheta\mid\Delta\sigma) = \frac{p(\vartheta)p(\Delta\sigma\mid\vartheta)}{\int p(\vartheta)p(\Delta\sigma\mid\vartheta) d\vartheta}$$
1. Use $p(\vartheta\mid\Delta\sigma)$ as the new prior and repeat from 3.

Since our model is linear, the problem becomes analytically tractable by using Gaussian probability densities for prior and likelihood. Let's say our prior has as mean the current estimate of the model parameters $\hat{\vartheta}$ and some covariance $\Psi$. The likelihood mean is given by $f(\rho,\sigma,\vartheta) = a+b\rho+c\sigma+d\rho\sigma$ and the variance depends on how accurately we can measure $\Delta\sigma$, denoted by $\gamma$. Thus:
$$p(\vartheta) = \mathcal{N}(\hat{\vartheta},\Psi)$$
$$p(\Delta\sigma\mid\vartheta) = \mathcal{N}(f(\rho,\sigma,\vartheta),\gamma)$$

As a well-known result, the posterior $p(\vartheta\mid\Delta\sigma) = \mathcal{N}(\hat{\vartheta}',\Psi')$ is a Gaussian distribution with
$$\Psi' = \left(\Psi^{-1}+\gamma^{-1}h(\rho,\sigma)h(\rho,\sigma)^T\right)^{-1}$$
$$\hat{\vartheta}' = \Psi'\left(\Psi^{-1}\hat{\vartheta} + \gamma^{-1}\Delta\sigma h(\rho,\sigma)\right)$$
$$h(\rho,\sigma) = (1,\rho,\sigma,\rho\sigma)^T.$$

The following is a quick implementation of the model in Python. It is fed with random data generated by a process that behaves as the model predicts, with fixed parameter values $\vartheta$ saved in the variable *w* in the code.

```python
import numpy as np

n = 20  # number of trials to sample

# feature vector h(rho, sigma) = (1, rho, sigma, rho*sigma)^T as a column
h = lambda x: np.array([[1, x[0], x[1], x[0]*x[1]]]).T
w = np.array([-2, 7, 0.4, 0.01])  # true values of theta
noise = 0.1  # true standard deviation of measurement

eta = 100  # = 1/gamma, with gamma = noise**2 the measurement variance
# prior initialization:
theta = np.array([[0, 0, 0, 0]]).T
Psi = 1*np.eye(4)

# returns a random sample: x = (rho, sigma), y = noisy stability increase
def sample():
    r = np.random.random()       # rho in [0, 1)
    s = np.random.random()*1000  # sigma in [0, 1000)
    x = np.array([r, s])
    y = w @ h(x) + np.random.normal(0, noise)
    return x, y

# simulate trials: one posterior update per observation
for i in range(0, n):
    x, y = sample()
    invPsi = np.linalg.inv(Psi)
    Psi = np.linalg.inv(invPsi + eta * h(x) @ h(x).T)
    theta = Psi @ ((invPsi @ theta) + eta * y * h(x))
    print("#"+str(i+1)+" "+str(theta.T[0]))

print("true:"+str(w))
```

```
#1 [0.00034003 0.00027948 0.24698609 0.2029098 ]
#2 [3.99886290e-04 1.77676950e-05 3.97778718e-01 1.93621072e-02]
#3 [0.6450641  0.24190646 0.39715269 0.018713  ]
#4 [0.34243981 1.14589255 0.39732033 0.01767813]
#5 [0.22954578 1.56594703 0.39738423 0.01713927]
#6 [0.22392551 1.75933819 0.3974689  0.0164476 ]
#7 [-0.19003248  3.20892489  0.39800344  0.01433425]
#8 [-0.46678489  4.29925699  0.39830235  0.01305939]
#9 [-0.47337404  4.38829909  0.39831073  0.01295625]
#10 [-0.62020224  4.63020862  0.39842367  0.01274032]
#11 [-1.70007682  6.39284375  0.3996081   0.010787  ]
#12 [-1.75894907  6.590248    0.39967516  0.01055353]
#13 [-1.79376965  6.68459364  0.39970887  0.01046837]
#14 [-1.78576375  6.68023551  0.3996785   0.01048591]
#15 [-1.80762139  6.75871908  0.39970406  0.01039235]
#16 [-1.80411908  6.75191732  0.39970404  0.01040925]
#17 [-1.80969343  6.76081425  0.39970769  0.01038273]
#18 [-1.8105603   6.76426549  0.39971093  0.01037086]
#19 [-1.80148659  6.74479913  0.39971207  0.0
```

As you can see, the model converges relatively fast to the true values, even though the measurement noise is large compared to the smallest parameter and $\rho$ is sampled as a number between $0$ and $1$ while $\sigma$ is sampled between $0$ and $1000$.
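
As a sanity check on the update equations for $\Psi'$ and $\hat{\vartheta}'$, the following sketch wraps a single update in a function and compares the result of 20 sequential updates with the batch posterior computed from all observations at once. For a Gaussian prior and likelihood the two must agree. The function name `posterior_update`, the random seed, and the sampling range for $\sigma$ (0 to 10, chosen only to keep the matrices well conditioned) are my assumptions.

```python
import numpy as np

def h(rho, sigma):
    """Feature vector (1, rho, sigma, rho*sigma)."""
    return np.array([1.0, rho, sigma, rho * sigma])

def posterior_update(theta, Psi, rho, sigma, delta_sigma, gamma):
    """One Bayesian linear regression step for a single observation.

    theta, Psi: prior mean and covariance; gamma: measurement variance.
    Returns the posterior mean and covariance.
    """
    x = h(rho, sigma)
    inv_Psi = np.linalg.inv(Psi)
    Psi_new = np.linalg.inv(inv_Psi + np.outer(x, x) / gamma)
    theta_new = Psi_new @ (inv_Psi @ theta + x * delta_sigma / gamma)
    return theta_new, Psi_new

rng = np.random.default_rng(0)
true_theta = np.array([-2.0, 7.0, 0.4, 0.01])  # same values as w above
gamma = 0.01                                    # variance, i.e. noise std 0.1

theta_seq, Psi_seq = np.zeros(4), np.eye(4)     # zero-mean prior, unit covariance
X, y = [], []
for _ in range(20):
    rho, sigma = rng.random(), rng.random() * 10
    d = h(rho, sigma) @ true_theta + rng.normal(0, np.sqrt(gamma))
    theta_seq, Psi_seq = posterior_update(theta_seq, Psi_seq, rho, sigma, d, gamma)
    X.append(h(rho, sigma))
    y.append(d)

# Batch posterior from the same prior and all 20 observations at once.
X, y = np.array(X), np.array(y)
Psi_batch = np.linalg.inv(np.eye(4) + X.T @ X / gamma)
theta_batch = Psi_batch @ (X.T @ y / gamma)

print(np.allclose(theta_seq, theta_batch), np.allclose(Psi_seq, Psi_batch))
```

The practical consequence is that only the current $\hat{\vartheta}$ and $\Psi$ need to be stored between reviews; the full review history is not required to continue the regression.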

The algorithm now looks like this:
1. Choose a prior for $\vartheta$.
1. Using the current stability estimate $\sigma$, find the point in time when probability of forgetting rises above the $\varphi$ requested by the user as above. A random dispersal is added to this interval so that different values of $\rho$ are tested, which accelerates convergence.
1. Test the user on the item after time $T$. If the grade doesn't match the expected retrievability of $\rho(T)$, update the estimate of $\sigma$ and calculate the actual stability increase from the last review.
1. Perform Bayesian linear regression to find the posterior $p(\vartheta\mid\Delta\sigma)$.
1. Using the new parameters $\vartheta'$, calculate the new stability.
1. Using the posterior as the new prior for $\vartheta$, repeat from 2.
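
Putting the steps above together, one review cycle for a single item might look roughly like the following sketch. It is my reading of the steps, not the HyperCampus implementation, and it rests on several assumptions that are all hypothetical: the grade is used directly as $\rho$ (clipped to stay inside $(0,1)$), the "actual stability increase from the last review" is taken as the difference between the stability observed now and the stability observed at the previous review, the prior mean $(\ln 2, 0, 0, 0)$ mimics the Leitner rule, and the random dispersal is left out.

```python
import math
import numpy as np

def h(rho, sigma):
    """Feature vector (1, rho, sigma, rho*sigma)."""
    return np.array([1.0, rho, sigma, rho * sigma])

class Item:
    def __init__(self, sigma):
        self.sigma = sigma  # current stability estimate
        self.last = None    # (rho, sigma) observed at the previous review

def next_interval(sigma, phi):
    """T = (1/lambda) * ln(1/(1-phi)) with lambda = exp(-sigma)."""
    return math.exp(sigma) * math.log(1.0 / (1.0 - phi))

def review(item, t, grade, theta, Psi, gamma, eps=1e-3):
    """Process one review that happened t time units after the previous one."""
    rho = min(max(grade, eps), 1.0 - eps)            # assumed grade -> rho mapping
    sigma_obs = -math.log(math.log(1.0 / rho) / t)   # stability implied by rho at t
    if item.last is not None:
        # Regression step: the last review turned item.last[1] into sigma_obs.
        x = h(*item.last)
        delta = sigma_obs - item.last[1]
        inv_Psi = np.linalg.inv(Psi)
        Psi = np.linalg.inv(inv_Psi + np.outer(x, x) / gamma)
        theta = Psi @ (inv_Psi @ theta + x * delta / gamma)
    # Predict the stability after this review with the updated parameters.
    item.last = (rho, sigma_obs)
    item.sigma = sigma_obs + float(h(rho, sigma_obs) @ theta)
    return theta, Psi

phi = 0.1
theta = np.array([math.log(2), 0.0, 0.0, 0.0])  # hypothetical Leitner-like prior mean
Psi = np.eye(4)
item = Item(sigma=0.0)

t1 = next_interval(item.sigma, phi)
theta, Psi = review(item, t1, grade=0.9, theta=theta, Psi=Psi, gamma=0.01)
t2 = next_interval(item.sigma, phi)
theta, Psi = review(item, t2, grade=0.95, theta=theta, Psi=Psi, gamma=0.01)
print(t1, t2, theta)
```

In this toy run the first grade matches the expected retrievability exactly, so the interval simply doubles as the prior predicts. The second grade is better than expected, which means the first review increased stability by more than $\ln 2$, and the regression shifts the parameters accordingly.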

## What happens in case of a lapse?

The above version of the algorithm does not differentiate between successful and unsuccessful reviews, since the retrievability is determined from the retrieval effort grading alone. Intuitively, it seems that items that were forgotten should be reviewed again after a short interval. Empirical evidence on this is inconclusive:
* Retrieval attempts enhance learning regardless of whether they were successful or not. Attempting to retrieve an item from memory and subsequent activation of it seems to strengthen the relevant connections, whether the activation was triggered from "inside" (by finding it on one's own) or "outside" (by being told the solution).
* Retrieval effort correlates with gain of memory strength.

Now the interesting question would be the combination of the two: Does the gain of learning in unsuccessful retrieval attempts correlate with retrieval effort? I couldn't find any study that addressed this question. Also, none of the studies I have seen on this so far have investigated whether these effects also hold true for long intervals (such as months or years). Therefore, it might not be wise to treat lapses entirely on the basis of these findings.

On the speculative side, here is a description of what might happen in an unsuccessful retrieval attempt: Memory contents are browsed for the item to recall. Related concepts are activated and strengthened during the retrieval attempt to a greater or lesser extent, depending on retrievability. When the search is aborted and the solution is shown, there are two possibilities: The solution is recognized as being the correct one or it seems completely unfamiliar. If it is recognized, it was successfully relocated in memory, and arguably some strengthening of the memory occurs as both related concepts and the concept itself were activated. If it was completely forgotten, however, it has to be relearned from scratch and cannot benefit from existing stability.

How could we discern between the two possibilities? Retrievability offers a perspective on the issue: If an item was completely forgotten, it means that $\rho$ must have been close to $0$. Where a higher $\rho$ was expected, it means that $\sigma$ must have been catastrophically low as well. Accordingly, the post-lapse interval will automatically be short. Now we could set a certain retrievability grade cut-off below which we deem an item as completely forgotten in case it wasn't recalled. However, there are some practical considerations that might override these reflections:
* Typically, a low forgetting index is desirable. Lapses should comprise only a small portion of reviews, and if the goal is just to keep the retention rate high, the way of dealing with them might not matter much.
* In the long run, longer gaps will always outperform shorter gaps. Thus, if we review a lapsed card after a long interval, it will always benefit the long-term storage of the item more.
* As a consequence, what should really matter is until when the user deems it acceptable not to know a card with high confidence.

In practice, this means: As of now, the algorithm does not discern between successfully and unsuccessfully retrieved items, but a future version of the HyperCampus app will support cramming (e.g. before exams), such that the user can choose to review cards (including the lapsed ones) prematurely. Furthermore, changes to this will be made if the algorithm turns out to perform badly with lapsed cards.

In my eyes, spaced repetition works well for consolidating items in memory, but not so well for learning them initially. This may also apply to relearning lapsed cards. Thus, the default mode of the algorithm is to leave the responsibility for learning and relearning items to the user. The algorithm treats new cards as though they have already been learnt. However, in the future, there will be features to aid in acquisition of the items, such as displaying relevant context information and allowing the use of mnemonics.