# Simulating Language 12, Iterated learning (lecture)

*This is a first draft of lecture notes for the Simulating Language course. It probably contains lots of typos!*

## How good is our model of learning?

In the previous classes, we set out to create the simplest possible model of associative learning, inspired by the Hebbian model of learning in the nervous system. The model simply increased the connection weight between meanings and signals that occurred together, and then used the winner-take-all algorithm on rows or columns of the weight matrix for production and reception respectively.

Is this a good model of learning? There are a number of reasonable ways to go about answering this question. First, we could look at whether two agents who were given the same input training data would be able to **communicate** successfully (or whether an agent could communicate with its own teacher). Alternatively, we could ask whether an agent given some training data can **recall** that training data correctly (e.g. produce the correct signals for the meanings seen). Or, perhaps, we might consider **generalisation** to be critical for learning: given some training data, can the learner generalise correctly to unseen data?

Importantly, as we shall see, the answer to these kinds of questions is not absolute for a given learning algorithm. Instead, it crucially depends on exactly what is being learned. This leads to a new kind of research question for our models. Instead of asking how good two innate signalling systems are for communication, we now want to know what kinds of errors a particular type of learner might make given a particular language. In other words, if a learner is given a set of signal/meaning pairs as input, how will it transform those signal/meaning pairs when it produces them as output?

### Our simple Hebbian-like learner

Given an optimal language, the learner seems to be able to recall its training data well. For example, with one example for each meaning of a language that maps $m_1\rightarrow s_1, m_2\rightarrow s_2, m_3\rightarrow s_3$, the resulting matrix is:


|.    |$s_1$|$s_2$|$s_3$|
|-----|-----|-----|-----|
|$m_1$|1    |0    |0    |
|$m_2$|0    |1    |0    |
|$m_3$|0    |0    |1    |

This gives the output: $m_1\rightarrow s_1, m_2\rightarrow s_2, m_3\rightarrow s_3$

But, what about a language with synonymy? Given training: $m_1\rightarrow s_1, m_1\rightarrow s_2, m_1\rightarrow s_2$, the matrix is:

|.    |$s_1$|$s_2$|$s_3$|
|-----|-----|-----|-----|
|$m_1$|1    |2    |0    |
|$m_2$|0    |0    |0    |
|$m_3$|0    |0    |0    |

The output behaviour is now $m_1\rightarrow s_2$ only. In other words, if there is a frequency asymmetry, it fails to reproduce synonymy.

How about generalisation? The learner fails to correctly generalise the optimal language. From $m_1\rightarrow s_1, m_2\rightarrow s_2$ we get:

|.    |$s_1$|$s_2$|$s_3$|
|-----|-----|-----|-----|
|$m_1$|1    |0    |0    |
|$m_2$|0    |1    |0    |
|$m_3$|0    |0    |0    |

Which results in: $m_1\rightarrow s_1, m_2\rightarrow s_2, m_3\rightarrow s_1, s_2, s_3$.

Similar generalisation behaviour is produced for a maximally ambiguous language. $m_1\rightarrow s_1, m_2\rightarrow s_1$ leads to:

|.    |$s_1$|$s_2$|$s_3$|
|-----|-----|-----|-----|
|$m_1$|1    |0    |0    |
|$m_2$|1    |0    |0    |
|$m_3$|0    |0    |0    |

The output is therefore: $m_1\rightarrow s_1, m_2\rightarrow s_1, m_3\rightarrow s_1, s_2, s_3$.
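
The four examples above can be reproduced with a minimal sketch of the learner. The function names (`new_matrix`, `learn`, `winners`, `produce`, `receive`, `output_language`) and the zero-based indexing (index 0 stands for $m_1$ or $s_1$) are assumptions made for illustration rather than the lab code, and ties are reported as a list of all tied winners instead of being broken at random.

```python
def new_matrix(n_meanings=3, n_signals=3):
    """All connection weights start at zero."""
    return [[0] * n_signals for _ in range(n_meanings)]

def learn(weights, meaning, signal):
    """Hebbian update: add one to the weight of the observed pair only."""
    weights[meaning][signal] += 1

def winners(values):
    """Winner-take-all: indices of every maximal value (ties all returned)."""
    best = max(values)
    return [i for i, v in enumerate(values) if v == best]

def produce(weights, meaning):
    """Production looks along the row for the meaning."""
    return winners(weights[meaning])

def receive(weights, signal):
    """Reception looks down the column for the signal."""
    return winners([row[signal] for row in weights])

def output_language(training_pairs):
    """Train a fresh learner, then list the signals it produces per meaning."""
    weights = new_matrix()
    for meaning, signal in training_pairs:
        learn(weights, meaning, signal)
    return [produce(weights, m) for m in range(len(weights))]

# Index 0 stands for m1/s1, index 1 for m2/s2, index 2 for m3/s3.
print(output_language([(0, 0), (1, 1), (2, 2)]))  # [[0], [1], [2]]
print(output_language([(0, 0), (0, 1), (0, 1)]))  # [[1], [0, 1, 2], [0, 1, 2]]
print(output_language([(0, 0), (1, 1)]))          # [[0], [1], [0, 1, 2]]
print(output_language([(0, 0), (1, 0)]))          # [[0], [0], [0, 1, 2]]
```

Treating the learner as a function from an input language to an output language, as `output_language` does, is the view of learning that the rest of these notes build on.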

## Bias

Our learner, though very simple, is absolutely not a "blank slate". It responds differently to different training sets. The Hebbian learner struggles with synonyms, but is otherwise faithful to its training data to the extent that it misses 'obvious' generalisations.

Where does this behaviour come from? Somehow, features of the architecture of our model create an inherent *learning bias* which may favour some languages over others. We've created an agent for whom some languages are easier to learn (in this case, recall and generalise) than others. Since some languages are better for communication, the knock-on effect of learning bias is that different learners may ultimately favour or disfavour communication too.

It is important, therefore, to consider which features of the architecture of our model could be modified to manipulate this learning bias. One possibility is the way we update weights.

### Lateral inhibition

The Hebbian update rule is: **if signal node and meaning node are both active, increase connection weight by one**. What this means is that the learner does not directly change the connection weights to any *other* meanings or signals when a particular meaning/signal pair is observed. It's reasonable to imagine that a learner might, on seeing evidence of a particular pairing, also infer something about other possible pairings for that meaning or signal. We can add something like this by adding: **also *reduce* the connection weight between competing meanings and signals (i.e., between the observed signal and every other meaning, and between the observed meaning and every other signal)**.

We call this approach *lateral inhibition*. This type of inhibition between active neurons and inactive ones has been known about for a long time in the visual system, where activity in one area inhibits activity in neighbouring areas. For example, the physicist [Ernst Mach](https://wikipedia.org/wiki/Mach_bands) suggested in 1865 that it was this kind of inhibition that explained why the contrast between areas of different shades of grey is increased in pictures like this:

![img](img/Bandes_de_mach.PNG)


How well does the Hebbian learner with lateral inhibition generalise? It is able to correctly generalise an optimal language! Given $m_1\rightarrow s_1, m_2\rightarrow s_2$, the matrix is:

|.    |$s_1$|$s_2$|$s_3$|
|-----|-----|-----|-----|
|$m_1$|1    |-2   |-1   |
|$m_2$|-2   |1    |-1   |
|$m_3$|-1   |-1   |0    |

This gives the output behaviour $m_1\rightarrow s_1, m_2\rightarrow s_2, m_3\rightarrow s_3$.

Importantly, given the less communicative maximally ambiguous language, it can recall the training data but seems positively unwilling to correctly generalise to ambiguous unseen items. $m_1\rightarrow s_1, m_2\rightarrow s_1$ leads to:

|.    |$s_1$|$s_2$|$s_3$|
|-----|-----|-----|-----|
|$m_1$|0    |-1   |-1   |
|$m_2$|0    |-1   |-1   |
|$m_3$|-2   |0    |0    |

This produces the resulting language: $m_1\rightarrow s_1, m_2\rightarrow s_1, m_3\rightarrow s_2, s_3$.
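
Both matrices above can be reproduced with a small sketch of the modified update rule. The text only says that competing connections are reduced. The decrement of exactly one used here is an assumption inferred from the numbers in the two tables, and the names `learn_with_inhibition`, `train` and `produce` are hypothetical. Index 0 stands for $m_1$ or $s_1$.

```python
def learn_with_inhibition(weights, meaning, signal):
    """Strengthen the observed pair and weaken its direct competitors."""
    for m, row in enumerate(weights):
        for s in range(len(row)):
            if m == meaning and s == signal:
                row[s] += 1   # the observed pairing
            elif m == meaning or s == signal:
                row[s] -= 1   # shares the meaning or the signal, but not both
            # pairs sharing neither the meaning nor the signal are unchanged

def train(training_pairs, n_meanings=3, n_signals=3):
    """Start from an all-zero matrix and apply the update to each pair."""
    weights = [[0] * n_signals for _ in range(n_meanings)]
    for meaning, signal in training_pairs:
        learn_with_inhibition(weights, meaning, signal)
    return weights

def produce(weights, meaning):
    """Winner-take-all over the meaning's row; all tied winners are returned."""
    best = max(weights[meaning])
    return [s for s, w in enumerate(weights[meaning]) if w == best]

optimal = train([(0, 0), (1, 1)])
ambiguous = train([(0, 0), (1, 0)])

print(optimal)                                    # [[1, -2, -1], [-2, 1, -1], [-1, -1, 0]]
print([produce(optimal, m) for m in range(3)])    # [[0], [1], [2]]
print(ambiguous)                                  # [[0, -1, -1], [0, -1, -1], [-2, 0, 0]]
print([produce(ambiguous, m) for m in range(3)])  # [[0], [0], [1, 2]]
```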

### A route to optimal communication?

Looking at this generalisation behaviour, we get the first hint that changing learning bias might result in differences in the communicative optimality of the *output* of learners. The Hebbian learner’s output given a subset of an optimal language is likely to include some ambiguity, whereas the addition of lateral inhibition creates generalisation to the optimal unseen items. Equally, given a communicatively suboptimal language, the Hebbian learner is still likely to produce at least some ambiguous output, but the *failure* of the learner with lateral inhibition to generalise the ambiguous language means that its output is guaranteed to be more like an optimal language than its input. (Note, however, that this does not mean it will be able to communicate optimally with its teacher!)

## Rational speakers

Modifying learning bias may not be the only route to optimal signalling, however. An alternative is to explore the way that a learner *uses* its learned knowledge while communicating. Imagine a speaker has somehow acquired the following matrix:

|.    |$s_1$|$s_2$|
|-----|-----|-----|
|$m_1$|1    |1    |
|$m_2$|0    |2    |

This speaker would naturally produce $m_1\rightarrow s_1, s_2, m_2\rightarrow s_2$. If there were another identical agent listening to the speaker’s language, half the time for $m_1$ there would be miscommunication, since such an agent would always understand $s_2$ to mean $m_2$.
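
As a rough worked figure, suppose that both meanings are equally likely to be talked about and that the speaker breaks the tie for $m_1$ uniformly at random (both are assumptions for illustration, not part of the model description above). The expected communicative success between two such agents is then:

$$
P(\text{success}) = \tfrac{1}{2}\left(\tfrac{1}{2}\cdot 1 + \tfrac{1}{2}\cdot 0\right) + \tfrac{1}{2}\cdot 1 = \tfrac{3}{4}
$$

The first term covers $m_1$, which is understood when $s_1$ is produced and misunderstood when $s_2$ is produced. The second term covers $m_2$, which is always expressed as $s_2$ and always understood.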

But what if such a speaker could somehow realise this risk of miscommunication and choose not to use $s_2$ when communicating $m_1$ and produce the other option, $s_1$, instead? We call this kind of behaviour *communicatively rational*. A rational speaker produces behaviour that optimises communicative success by figuring out how the hearer might respond to its signals.

How could this work in practice in our model? Instead of just producing the signal that is indicated by its matrix, a rational speaker needs to consider what meaning would be understood by a hearer and then adjust its behaviour accordingly. To do that requires that speakers somehow have a model of hearers that they can internally “test” signals on before they go ahead and produce them in the real world. On the face of it, this seems like a big challenge. How on earth could speakers acquire such a model of a hearer? But any such model must be an approximation, and a reasonable approximation is to imagine that a hearer has been exposed to similar training data as yourself, and therefore has a similar signalling matrix as yourself. A rational speaker can therefore use its *own* signalling matrix as a model of the hearer and can, in essence, ask before uttering a signal: would *I* understand this signal correctly if I heard it?

Returning to our example matrix above, say the speaker wants to communicate about $m_1$ and runs the winner-take-all algorithm, which (breaking the tie between $s_1$ and $s_2$) happens to generate $s_2$. Rather than just produce this signal, the speaker first checks what meaning they’d understand if they heard $s_2$. In this case, they’d understand $m_2$, which wasn’t the intended meaning. They therefore take action to produce a different signal.

We’re going to build this rational speaker model in the lab, and as usual we’re going to make the simplest one we can think of! Our (minimally) rational speaker takes the following steps:

1. Use matrix to choose signal
2. Use matrix to see what meaning I’d understand if I heard that signal
3. If the understood meaning and the intended meaning do not match, choose another signal at random
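
The three steps can be sketched as follows, using the example matrix from earlier in this section. The function names, the random tie-breaking in `produce` and `receive`, and the seeded random number generator are assumptions made for illustration, not the lab implementation. Under these assumptions the speaker should use $s_1$ for $m_1$ about three quarters of the time (half the time directly, plus half of the random retries), compared with half the time without the check.

```python
import random

def winners(values):
    """Indices of every maximal value."""
    best = max(values)
    return [i for i, v in enumerate(values) if v == best]

def produce(weights, meaning, rng):
    """Step 1: winner-take-all on the meaning's row, ties broken at random."""
    return rng.choice(winners(weights[meaning]))

def receive(weights, signal, rng):
    """Step 2: winner-take-all on the signal's column, ties broken at random."""
    return rng.choice(winners([row[signal] for row in weights]))

def rational_produce(weights, meaning, rng):
    """Test the chosen signal on the speaker's own matrix before using it."""
    signal = produce(weights, meaning, rng)
    understood = receive(weights, signal, rng)
    if understood != meaning:
        # Step 3: pick any signal at random (it may be the same one again).
        signal = rng.randrange(len(weights[meaning]))
    return signal

rng = random.Random(0)  # seeded only so that repeated runs give the same counts
weights = [[1, 1],      # m1: s1 and s2 tie
           [0, 2]]      # m2: s2 wins

counts = [0, 0]         # how often s1 and s2 are used for m1
for _ in range(10000):
    counts[rational_produce(weights, 0, rng)] += 1
print(counts)
```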

This is obviously not a very clever algorithm (apart from anything else, the randomly chosen signal might be the same one again!). However, it captures the key insight of the rational model of speakers: it adjusts behaviour by taking into account the goal of communication and using a model of the hearer. In fact, this insight is the core of much of pragmatics, which considers the communicative effect of utterances in context. A fairly recent theoretical advance has been the *Rational Speech Act* (RSA) model of communication, which uses an elaboration of the approach we’re developing here in which both speakers *and* hearers use models of each other to reason (recursively) about which signals to produce in context, and which meanings to infer when a signal is heard. There is a lot of exciting ongoing work in this area, but I recommend you read at least [Goodman & Frank (2016)](https://www.sciencedirect.com/science/article/pii/S136466131630122X?via%3Dihub) (don’t worry too much about the mathematics in this paper at this stage - just have a look at the kinds of linguistic phenomena that can be captured by RSA.)

## The problem of linkage

We’ve seen now how differences in **learning bias** or **interaction strategies** *might* change the type of languages that are preferred by agents in some way. It seems likely that either lateral inhibition or speaker rationality may lead somehow to languages being more communicatively optimal.

But this raises a couple of serious questions:

- Where do these languages that we are testing our agents with come from in the first place?
- And, relatedly, how do we bridge the gap between these differences in learning bias or interaction strategies and the universal properties of language structure that we’re trying to explain in the first place?

Ultimately, we’re interested in explaining why language is the way it is, and we’re doing this by saying that it arises from properties of the individuals who learn and use those languages. But this leads to what I have called the “Problem of Linkage” ([Kirby, 1999](https://scholar.google.co.uk/scholar?cluster=17515968785376094851&hl=en&as_sdt=0,5)).

![img](img/Linkage.png)

### Iterated learning

The solution to the problem of linkage is the (kind of obvious!) realisation that the language a learner acquires is the product of the output of other learners: the signal/meaning pairs that one agent produces are the signal/meaning pairs that the next learner will learn from. In other words, language persists over time by repeatedly being used by multiple individuals in a population. It is therefore constantly being transformed between two different domains: observable behaviours (signals in context), and internal representations (grammars, or in our simple model, signalling matrices). The process that maps the former to the latter is perception and learning, and the process that maps the latter to the former is production:

![img](img/Iterated.png)

It is out of this continual process of **iterated learning** that the structure of language emerges. Note that this is evolution, but it is a process of *cultural* evolution, quite distinct from the biological process that we modelled earlier in this course. Here, inheritance is enabled by observing behaviours and learning from them rather than passing on genetic information directly.

## Modelling iterated learning

Iterated learning has developed into a very active area of research over the last 20 years or so, much of it pioneered in Edinburgh. We will look into some of this work in more detail in the coming weeks. But for now it is sufficient to say that the primary research goal of this work is to uncover precisely the relationship between properties of individual learners and the emergent universal properties of language structure. In other words, research in iterated learning seeks to solve the problem of linkage in order to build a genuinely explanatory theory of language, which links the population level (languages) to the individual level (the cognitive makeup of language learners).

We can start to develop an insight into what iterated learning does just as we did when we investigated what biological evolution did: by building a computational model. Such a model will place agents in a population in which they learn from each other’s utterances. We can then start the simulation with a random language and observe what languages emerge over time through cultural evolution given different possible learning rules and communicative strategies.

To do this, we need a model of the population. One such model works as follows:

1. Start with a collection of agents with random languages
2. Remove an agent (death)
3. Add a “child” agent (birth)
4. All the “adults” in the population speak
5. The child learns from the adults’ utterances
6. The child enters the population to become a new adult
7. Repeat steps 2 - 6 for multiple generations
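
The steps above can be sketched as a short simulation. Everything not stated in the list is an assumption made for illustration: the `Agent` class and the `random_agent` and `simulate` functions, the plain Hebbian update, the parameters `pop_size` and `utterances_per_adult`, choosing who dies and which meaning is talked about uniformly at random, and treating a "random language" as one randomly chosen signal per meaning.

```python
import random

def winners(values):
    """Indices of every maximal value."""
    best = max(values)
    return [i for i, v in enumerate(values) if v == best]

class Agent:
    def __init__(self, n_meanings, n_signals):
        self.weights = [[0] * n_signals for _ in range(n_meanings)]

    def learn(self, meaning, signal):
        self.weights[meaning][signal] += 1  # plain Hebbian update

    def produce(self, meaning, rng):
        # Winner-take-all on the meaning's row, ties broken at random.
        return rng.choice(winners(self.weights[meaning]))

def random_agent(n_meanings, n_signals, rng):
    """An agent trained on one randomly chosen signal per meaning."""
    agent = Agent(n_meanings, n_signals)
    for meaning in range(n_meanings):
        agent.learn(meaning, rng.randrange(n_signals))
    return agent

def simulate(generations, pop_size=5, n_meanings=3, n_signals=3,
             utterances_per_adult=10, seed=0):
    rng = random.Random(seed)
    # 1. Start with a collection of agents with random languages.
    population = [random_agent(n_meanings, n_signals, rng)
                  for _ in range(pop_size)]
    for _ in range(generations):
        # 2. Remove an agent (death).
        population.pop(rng.randrange(len(population)))
        # 3. Add a "child" agent (birth); every child starts out identical.
        child = Agent(n_meanings, n_signals)
        # 4 and 5. All the adults speak, and the child learns from them.
        for adult in population:
            for _ in range(utterances_per_adult):
                meaning = rng.randrange(n_meanings)
                child.learn(meaning, adult.produce(meaning, rng))
        # 6. The child enters the population as a new adult.
        population.append(child)
    # 7. The loop above repeats steps 2 to 6 once per generation.
    return population

population = simulate(generations=50)
for agent in population:
    print(agent.weights)
```

Swapping the body of `learn` or `produce` for a different update rule or production strategy is one way to compare their effects on the languages that emerge.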

In this model, the agents will always be born identical to each other. Unlike in a model of biological evolution, there is no genetic inheritance (or indeed variation) here! The only thing that evolves is the behaviour that is inherited culturally.

This is only one possible population model. [Mesoudi & Whiten (2008)](https://royalsocietypublishing.org/doi/full/10.1098/rstb.2008.0129), in a review of cultural evolution experiments, outline a range of different possible population models that are worth considering:

- The replacement method (gradually replace individuals - this is what I described above)
- Transmission chains (the whole population learns from the previous generation, which is then replaced)
- Closed groups (no turnover of the population at all, individuals just constantly learn from each other)

In the lab we’ll implement cultural evolution by iterated learning, and use it to test the effect of **lateral inhibition**, **speaker rationality**, and **population dynamics** on the emerging language of the population. We’ll ask the question: can cultural evolution provide an alternative route to adaptation that does not rely on biological evolution by natural selection?
