# Differentiable Neural Computer

![DNC architecture](https://storage.googleapis.com/deepmind-live-cms/images/dnc_figure1.width-1500_Zfxk87k.png "DNC architecture")

![DNC model](images/dnc_model.png "DNC model")


Original paper:
[Graves, Alex, et al. "Hybrid computing using a neural network with dynamic external memory." Nature 538.7626 (2016): 471-476.](https://www.nature.com/nature/journal/v538/n7626/abs/nature20101.html)

Author blog post:
[Differentiable neural computers](https://deepmind.com/blog/differentiable-neural-computers/)

Author implementation (TensorFlow + Sonnet):
[deepmind/dnc](https://github.com/deepmind/dnc)

## Definitions

$t \in \mathbb{N}$: time step

$\mathcal{N}$: controller network

$X$: input domain

$x_t \in \mathbb{R}^X$: input vector at time $t$

$Y$: output domain

$z_t \in \mathbb{R}^Y$: target vector at time $t$

$N \in \mathbb{N}$: number of memory rows or locations

$W \in \mathbb{N}$: number of memory columns or word length

$M_t \in \mathbb{R}^{N \times W}$: memory matrix at time $t$

$R \in \mathbb{N}$: number of read heads

$w^r \in \mathbb{R}^{N \times R}$: read weighting

$w^{r, i} \in \mathbb{R}^N$: read weighting for read head $i$

$r_t^i \in \mathbb{R}^W$: read vector at time $t$ from read head $i$

$\mathcal{X}_t$: controller input vector at time $t$; $\mathcal{X}_t = [x_t; r_{t-1}^1; \ldots; r_{t-1}^R]$

$L \in \mathbb{N}$: number of controller network layers

$h_t^L$: activation of the last controller network layer at time $t$

$W_{\nu}$: output weight matrix

$\nu_t \in \mathbb{R}^Y$: (intermediate) output vector at time $t$; $\nu_t = W_{\nu}h_t^L$

$W_{\xi}$: interface weight matrix

$\xi_t \in \mathbb{R}^{(W \times R) + 3W + 5R + 3}$:
interface vector (interactions with the memory) at time $t$;
$\xi_t = W_{\xi}h_t^L$

$\theta$: controller network parameters

$[r_t^1; \ldots; r_t^R] \in \mathbb{R}^{RW}$: concatenation of read vectors at time $t$

$W_r \in \mathbb{R}^{RW \times Y}$: read vectors to output weight matrix

$y_t \in \mathbb{R}^Y$: output vector at time $t$; $y_t = \nu_t + W_r[r_t^1; \ldots; r_t^R]$

$\mathcal{S}_N$: $(N-1)$-dimensional unit simplex;
$\mathcal{S}_N = \{\alpha \in \mathbb{R}^N: \alpha_i \in [0, 1], \sum_{i=1}^N \alpha_i = 1\}$

$\circ$: elementwise multiplication

### Activations

* $\sigma(x) = (1 + e^{-x})^{-1}$
* $oneplus(x) = 1 + \log(1 + e^x)$
* $softmax(x)[i] = e^{x[i]} (\sum_{j=1}^n e^{x[j]})^{-1}$
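
The three activations can be sketched in plain Python as below. This is a minimal, dependency-free illustration rather than code from the author implementation; the function names and the overflow-avoiding rewrites are assumptions made for this example.

```python
import math


def sigmoid(x):
    """Logistic sigmoid: maps a real number into [0, 1]."""
    # Branch on the sign so that math.exp never receives a large positive argument.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def oneplus(x):
    """oneplus(x) = 1 + log(1 + e^x): maps a real number into [1, inf)."""
    # log(1 + e^x) rewritten as max(x, 0) + log1p(e^{-|x|}) to avoid overflow.
    return 1.0 + max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def softmax(xs):
    """Softmax over a list of reals: non-negative entries that sum to 1."""
    m = max(xs)  # subtracting the maximum leaves the result unchanged
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]
```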

## Controller networks

* Recurrent Neural Network: $(\nu_t, \xi_t) = \mathcal{N}([\mathcal{X}_1; \ldots; \mathcal{X}_t]; \theta)$

* Feed-forward Neural Network: $(\nu_t, \xi_t) = \mathcal{N}(\mathcal{X}_t; \theta)$

## Interface parameters

$\xi_t= [
k_t^{r,1}; \ldots; k_t^{r,R};
\hat{\beta}_t^{r,1}; \ldots; \hat{\beta}_t^{r,R};
k_t^w; \hat{\beta}_t^w; \hat{e}_t; v_t;
\hat{f}_t^1; \ldots; \hat{f}_t^R;
\hat{g}_t^a; \hat{g}_t^w;
\hat{\pi}_t^1; \ldots; \hat{\pi}_t^R
]$

$\{k_t^{r,i} \in \mathbb{R}^W; 1 \leq i \leq R\}$: read keys

$\{\beta_t^{r,i} = oneplus(\hat{\beta}_t^{r,i}) \in [1, \infty); 1 \leq i \leq R\}$: read strengths

$k_t^w \in \mathbb{R}^W$: write key

$\beta_t^w = oneplus(\hat{\beta}_t^w) \in [1, \infty)$: write strength

$e_t = \sigma(\hat{e}_t) \in [0, 1]^W$: erase vector

$v_t \in \mathbb{R}^W$: write vector

$\{f_t^i = \sigma(\hat{f}_t^i) \in [0, 1]; 1 \leq i \leq R\}$: free gates

$g_t^a = \sigma(\hat{g}_t^a) \in [0, 1]$: allocation gate

$g_t^w = \sigma(\hat{g}_t^w) \in [0, 1]$: write gate

$\{\pi_t^i = softmax(\hat{\pi}_t^i) \in \mathcal{S}_3; 1 \leq i \leq R\}$: read modes
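
The layout of $\xi_t$ above can be made concrete by slicing a flat vector in the listed order and applying the constraining activations. The following is a sketch under assumptions: `parse_interface`, the dictionary keys and the simple (not overflow-hardened) activation helpers are hypothetical names chosen for this example, not part of the author implementation.

```python
import math


def sigmoid(x):
    # Naive form; adequate for the moderate values used in this sketch.
    return 1.0 / (1.0 + math.exp(-x))


def oneplus(x):
    return 1.0 + math.log1p(math.exp(x))


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def parse_interface(xi, W, R):
    """Split a flat interface vector into named, constrained parameters."""
    expected = W * R + 3 * W + 5 * R + 3
    if len(xi) != expected:
        raise ValueError(f"expected {expected} entries, got {len(xi)}")
    pos = 0

    def take(n):
        # Consume the next n entries of xi.
        nonlocal pos
        chunk = xi[pos:pos + n]
        pos += n
        return chunk

    # The order below follows the definition of xi_t.
    read_keys = [take(W) for _ in range(R)]
    read_strengths = [oneplus(b) for b in take(R)]
    write_key = take(W)
    write_strength = oneplus(take(1)[0])
    erase_vector = [sigmoid(e) for e in take(W)]
    write_vector = take(W)
    free_gates = [sigmoid(f) for f in take(R)]
    allocation_gate = sigmoid(take(1)[0])
    write_gate = sigmoid(take(1)[0])
    read_modes = [softmax(take(3)) for _ in range(R)]
    return {
        "read_keys": read_keys,
        "read_strengths": read_strengths,
        "write_key": write_key,
        "write_strength": write_strength,
        "erase_vector": erase_vector,
        "write_vector": write_vector,
        "free_gates": free_gates,
        "allocation_gate": allocation_gate,
        "write_gate": write_gate,
        "read_modes": read_modes,
    }


# Example: W = 4 and R = 2 give a vector of 8 + 12 + 10 + 3 = 33 entries.
params = parse_interface([0.0] * 33, W=4, R=2)
```

With an all-zero vector, every gate evaluates to 0.5, every strength to $1 + \log 2$, and every read mode to the uniform distribution over its three entries.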

## Reading and writing

$\Delta_N$: complete set of allowed weightings over $N$ locations;
$\Delta_N = \{\alpha \in \mathbb{R}^N: \alpha_i \in [0, 1], \sum_{i=1}^N \alpha_i \leq 1\}$

### Calculate read vectors

$r_t^i = M_t^{\top} w_t^{r, i}$, where $w_t^{r, i} \in \Delta_N$ and $1 \leq i \leq R$

The read vectors are appended to the controller input at the next time step (see $\mathcal{X}_t$ in the definitions).
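
As a small illustration of the read operation, the sketch below computes $M^{\top} w$ for one read head using plain Python lists and then builds the next controller input by concatenation. The helper name `read_vector` and the toy values are assumptions for this example only.

```python
def read_vector(M, w):
    """Return r = M^T w, the weighted sum of the rows of M."""
    N, W = len(M), len(M[0])
    if len(w) != N:
        raise ValueError("weighting must have one entry per memory row")
    return [sum(w[i] * M[i][j] for i in range(N)) for j in range(W)]


# Toy memory with N = 3 rows and W = 2 columns.
M = [[1.0, 0.0],
     [0.0, 1.0],
     [2.0, 2.0]]
w = [0.5, 0.5, 0.0]  # read weighting that splits attention over rows 0 and 1

r = read_vector(M, w)

# The next controller input is the next input vector followed by all read vectors.
x_next = [0.1, 0.2, 0.3]
reads = [r]  # R = 1 read head in this example
controller_input = x_next + [value for read in reads for value in read]
```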

### Calculate next state of memory (erase and write)

$M_t = M_{t - 1} \circ (E - w_t^w e_t^{\top}) + w_t^w v_t^{\top}$

Where $E$ is an $N \times W$ matrix of ones.
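
The erase-and-write update can be sketched elementwise: entry $(i, j)$ of the memory is scaled by $1 - w_t^w[i]\,e_t[j]$ and then incremented by $w_t^w[i]\,v_t[j]$. The function name `write_memory` and the toy values below are assumptions for illustration.

```python
def write_memory(M_prev, w_w, e, v):
    """Apply M_t = M_{t-1} o (E - w e^T) + w v^T using nested lists."""
    N, W = len(M_prev), len(M_prev[0])
    if len(w_w) != N or len(e) != W or len(v) != W:
        raise ValueError("shape mismatch between memory and write parameters")
    return [
        [M_prev[i][j] * (1.0 - w_w[i] * e[j]) + w_w[i] * v[j] for j in range(W)]
        for i in range(N)
    ]


M_prev = [[1.0, 1.0],
          [2.0, 2.0]]
w_w = [1.0, 0.0]  # write only to row 0
e = [1.0, 0.0]    # erase column 0 fully, keep column 1
v = [5.0, 7.0]    # content to add

M_next = write_memory(M_prev, w_w, e, v)
```

In this example row 0 becomes `[5.0, 8.0]`: the first column is erased and overwritten, while the second keeps its old value and has the new content added. Row 1 is unchanged because its write weight is zero.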

## Read modes (content attention mechanisms)

### Similarity measure (cosine similarity)

Used for both reading and writing; it supports associative (content-based) lookup.

$$\mathcal{C}(M, k, \beta)[i] = \frac{\exp\{\mathcal{D}(k, M[i, \cdot])\beta\}}
{\sum_{j=1}^N \exp\{\mathcal{D}(k, M[j, \cdot])\beta\}}$$

Where:

$\mathcal{C}(M, k, \beta) \in \mathcal{S}_N$,

$k \in \mathbb{R}^W$ is the lookup key,

$\beta \in [1, \infty)$ is the key strength,

and $\mathcal{D}$ is the cosine similarity defined as:

$$\mathcal{D}(u, v) = \frac{u \cdot v}{|u||v|}$$
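
Content-based weighting amounts to a softmax over the rows of $M$ of the cosine similarity to the key, scaled by $\beta$. The sketch below assumes plain Python lists. The small `eps` added to the denominator is an assumption made here to avoid division by zero for all-zero rows and is not part of the formula above; the function names are also illustrative.

```python
import math


def cosine_similarity(u, v, eps=1e-8):
    """D(u, v) = u.v / (|u||v|), with eps guarding against zero norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)


def content_weighting(M, k, beta):
    """C(M, k, beta): softmax over rows of beta * cosine similarity to k."""
    scores = [beta * cosine_similarity(k, row) for row in M]
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]


M = [[1.0, 0.0],
     [0.0, 1.0],
     [-1.0, 0.0]]
k = [1.0, 0.0]

weights = content_weighting(M, k, beta=10.0)
```

A larger $\beta$ sharpens the weighting towards the best-matching row, while $\beta = 1$ gives the softest distribution the constraint $\beta \in [1, \infty)$ allows.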
  
### Usage vector

Used for memory allocation. Usage increases at locations that have just been written to and decreases at locations that the free gates release after a read.

$$u_t = (u_{t-1} + w_{t-1}^w - u_{t-1} \circ w_{t-1}^w) \circ \psi_t  \in [0, 1]^N$$

Where $\psi_t \in [0, 1]^N$ is the retention vector: the degree to which each location is retained (not freed) by the free gates. The product is taken elementwise over the read heads:

$$\psi_t = \prod_{i=1}^R (1 - f_t^i w_{t-1}^{r, i})$$

A location $i$ is in use if it has been retained by the free gates ($\psi_t[i] \approx 1$) and was either already in use or has just been written to.
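
The two formulas above can be traced on a toy example. The sketch below uses plain Python lists; the function names `retention` and `update_usage` and the example values are assumptions for illustration.

```python
def retention(free_gates, read_weights_prev):
    """psi[n] = product over read heads i of (1 - f^i * w^{r,i}[n])."""
    N = len(read_weights_prev[0])
    psi = [1.0] * N
    for f, w in zip(free_gates, read_weights_prev):
        for n in range(N):
            psi[n] *= 1.0 - f * w[n]
    return psi


def update_usage(u_prev, w_w_prev, psi):
    """u_t = (u_{t-1} + w^w_{t-1} - u_{t-1} o w^w_{t-1}) o psi_t."""
    return [(u + w - u * w) * p for u, w, p in zip(u_prev, w_w_prev, psi)]


u_prev = [0.0, 1.0, 0.5]             # location 1 is fully in use
w_w_prev = [1.0, 0.0, 0.0]           # location 0 was just written to
free_gates = [1.0]                   # the single read head frees what it read
read_weights_prev = [[0.0, 1.0, 0.0]]  # that head read location 1

psi = retention(free_gates, read_weights_prev)
u = update_usage(u_prev, w_w_prev, psi)
```

Here location 0 becomes fully used because it was just written to, location 1 drops to zero because it was read and then freed, and location 2 keeps its previous usage.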

### Temporal link matrix

Used for sequential retrieval.

$$L \in [0, 1]^{N \times N}$$

If $L[i, j] \approx 1$ then $i$ was written after $j$, otherwise $L[i, j] \approx 0$.

$Lw$ smoothly shifts the focus forwards to the locations written after those emphasized in $w$,
whereas $L^{\top}w$ shifts the focus backwards.
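
The shifting behaviour can be checked on an idealised link matrix. The sketch below assumes three locations written in the order 0, 1, 2 with perfectly sharp links, so $L[1, 0] = L[2, 1] = 1$ and all other entries are zero; the helper names and the matrix itself are illustrative assumptions, not values from a trained model.

```python
def matvec(A, x):
    """Matrix-vector product for a nested-list matrix."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def transpose(A):
    """Transpose of a nested-list matrix."""
    return [list(col) for col in zip(*A)]


# L[i][j] = 1 means location i was written right after location j.
L = [[0.0, 0.0, 0.0],
     [1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]

w = [0.0, 1.0, 0.0]  # weighting focused on location 1

forward = matvec(L, w)              # focus moves to the location written next
backward = matvec(transpose(L), w)  # focus moves to the location written before
```

With the focus on location 1, the forward weighting lands on location 2 and the backward weighting on location 0, matching the write order.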