# Talk like the President, part 1

In this notebook I build up towards a simple char RNN architecture using the [fastai library](https://github.com/fastai/).

The code is taken nearly line for line from [Lesson 6](https://github.com/fastai/fastai/blob/master/courses/dl1/lesson6-rnn.ipynb) of the FastAI Part 1 v2 course. The diagrams come from the course videos.

Jeremy provides a great overview of each of the architectures in the [lecture 6 video](https://youtu.be/sHcLkfRrgoQ?t=1h12m40s). 

In [1]:
import glob
from utils import *

In [2]:
from fastai.io import *
from fastai.conv_learner import *

from fastai.column_data import *

In [None]:
# create the data dir if it doesn't exist
!mkdir -p data

# Download and unzip corpus of speeches
!wget 'http://www.thegrammarlab.com/?wpdmpro=corpus-of-presidential-speeches&wpdmdl=595' -O "data/corpus.zip"
!unzip data/corpus.zip -d data/

# Data from:
#   Brown, D. W. (2016) Corpus of Presidential Speeches. Retrieved from http://www.thegrammarlab.com

In [4]:
# Data taken from:
#   Brown, D. W. (2016) Corpus of Presidential Speeches. Retrieved from http://www.thegrammarlab.com
get_data('http://www.thegrammarlab.com/?wpdmpro=corpus-of-presidential-speeches&wpdmdl=595', 'data')

In [5]:
speeches = ' '.join(preprocess_speech(file) for file in glob('data/Corpus of Presential Speeches/obama/*'))

In [6]:
len(speeches)

1027441

Just over 1 million chars! This should be perfect for our purposes!

Let's start with creating a vocabulary of all the chars that are used.

In [7]:
chars = sorted(list(set(speeches)))
vocab_size = len(chars)
print('total chars:', vocab_size)

total chars: 74


In [8]:
''.join(chars)

' !"$\',-./0123456789:;>?ABCDEFGHIJKLMNOPQRSTUVWYZabcdefghijklmnopqrstuvwxyz'

We want to be able to easily go from char to its index in the vocabulary and back. Let's use dictionaries for that.

In [9]:
char2idx = {c: idx for idx, c in enumerate(chars)}
idx2char = {idx: c for idx, c in enumerate(chars)}

Now we need to convert the speeches into a list of the corresponding indexes.

In [10]:
idxs = [char2idx[char] for char in speeches]

In [11]:
idxs[:11]

[35, 65, 7, 0, 41, 63, 52, 48, 58, 52, 65]

In [12]:
''.join(idx2char[idx] for idx in idxs[:11])

'Mr. Speaker'
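
As a quick sanity check of the two dictionaries, here is a minimal, self-contained sketch of the round trip on a toy string. The helper names `encode` and `decode`, the `toy_` variables and the sample text are made up for illustration and are not part of the notebook.

```python
# Toy round trip: text -> indices -> text.
# `sample`, `encode` and `decode` are illustrative names, not from the notebook.
sample = 'Mr. Speaker'

toy_chars = sorted(set(sample))
toy_char2idx = {c: idx for idx, c in enumerate(toy_chars)}
toy_idx2char = {idx: c for idx, c in enumerate(toy_chars)}

def encode(text):
    """Map each character to its index in the toy vocabulary."""
    return [toy_char2idx[c] for c in text]

def decode(indices):
    """Map indices back to characters and join them into a string."""
    return ''.join(toy_idx2char[i] for i in indices)

encoded = encode(sample)
print(encoded)
print(decode(encoded))
```

Note that the indices depend on the vocabulary they were built from, so the numbers printed here differ from the ones above even though the text is the same.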

## First model - simple FC NN

Our first model will be very simple - we will use three chars to predict the fourth char. In order to do so, we embed each char using a rank 1 tensor of length 42. We concatenate the embeddings together and we slap a fully connected layer on top.

It is a simple architecture that I wanted to try before we move onto RNNs!

<img src='images/basic_nn.png'>

In [13]:
n_fac = 42 # number of latent factors, embedding length
n_hidden = 256

In [14]:
class FCModel(nn.Module):
    def __init__(self, vocab_size, n_fac, n_hidden):
        super().__init__()
        self.n_fac = n_fac
        self.e = nn.Embedding(vocab_size, n_fac)

        # fully connected layer
        self.l_in = nn.Linear(n_fac * 3, n_hidden)

        self.l_out = nn.Linear(n_hidden, vocab_size)
        
    def forward(self, c1, c2, c3):
        inp = torch.cat([self.e(c1), self.e(c2), self.e(c3)], dim=1)
        h = F.relu(self.l_in(inp))
        
        return F.log_softmax(self.l_out(h))
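
To make the layer sizes concrete, the arithmetic below counts the parameters of `FCModel` for the values used in this notebook (`vocab_size = 74`, `n_fac = 42`, `n_hidden = 256`). It assumes the usual layout of an embedding (one row per vocabulary entry) and of a fully connected layer (a weight matrix plus a bias vector); `linear_params` is a helper invented for this sketch.

```python
vocab_size, n_fac, n_hidden = 74, 42, 256

def linear_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

embedding = vocab_size * n_fac               # one length-42 vector per char
l_in = linear_params(n_fac * 3, n_hidden)    # three concatenated embeddings in
l_out = linear_params(n_hidden, vocab_size)  # one score per char out

total = embedding + l_in + l_out
print(embedding, l_in, l_out, total)  # 3108 32512 19018 54638
```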

### Create inputs

In [15]:
cs=3
c1_dat = idxs[0:-cs]
c2_dat = idxs[1:-cs+1]
c3_dat = idxs[2:-cs+2]
c4_dat = idxs[3:]

Visually inspecting whether the above worked!

In [16]:
print(list(idx2char[idx] for idx in c1_dat[-10:]))
print(list(idx2char[idx] for idx in c2_dat[-10:]))
print(list(idx2char[idx] for idx in c3_dat[-10:]))
print(list(idx2char[idx] for idx in c4_dat[-10:]))

['a', '.', ' ', 'T', 'h', 'a', 'n', 'k', ' ', 'y']
['.', ' ', 'T', 'h', 'a', 'n', 'k', ' ', 'y', 'o']
[' ', 'T', 'h', 'a', 'n', 'k', ' ', 'y', 'o', 'u']
['T', 'h', 'a', 'n', 'k', ' ', 'y', 'o', 'u', '.']
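
The slicing is easier to follow on a short string. This sketch applies the same four offsets to a toy sequence (the `toy_` variable names are made up here) and prints each triple of input chars next to the char that follows it.

```python
toy = list('Thank you.')
cs = 3

toy_c1 = toy[0:-cs]        # first char of every triple
toy_c2 = toy[1:-cs + 1]    # second char, shifted by one
toy_c3 = toy[2:-cs + 2]    # third char, shifted by two
toy_c4 = toy[3:]           # the char to predict

for a, b, c, target in zip(toy_c1, toy_c2, toy_c3, toy_c4):
    print(repr(a + b + c), '->', repr(target))
```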


In [17]:
x1 = np.stack(c1_dat)
x2 = np.stack(c2_dat)
x3 = np.stack(c3_dat)
y = np.stack(c4_dat)

In [18]:
x1.shape, x2.shape, x3.shape, y.shape

((1027438,), (1027438,), (1027438,), (1027438,))

Let's create our dataset from these four arrays.

In [19]:
# This one line allows me to run my models on a subset of data while I work on this notebook
# md = ColumnarModelData.from_arrays('.', [-20_000], np.stack([x1,x2,x3], axis=1)[:200_000, :], y[:200_000], bs=2**9)

md = ColumnarModelData.from_arrays('.', [-100_000], np.stack([x1,x2,x3], axis=1), y, bs=2**9)

In [20]:
m = FCModel(vocab_size, n_fac, n_hidden).cuda()

In [21]:
opt = optim.Adam(m.parameters(), 1e-2)

fit(m, md, 1, opt, F.nll_loss)
set_lrs(opt, 1e-3)
fit(m, md, 1, opt, F.nll_loss)

[ 0.       1.61072  1.07665]                                   



[ 0.       1.51321  1.00247]                                   



Let's see if our model learned anything useful!

In [22]:
def get_next(inp):
    idxs = T(np.array([char2idx[c] for c in inp]))
    p = m(*VV(idxs)).exp()
    r = torch.multinomial(p, 1)
    return idx2char[r.data[0, 0]]

In [23]:
get_next('pre')

's'

In [24]:
get_next(' an')

'd'
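
Note that `get_next` does not pick the most likely char. It exponentiates the log-probabilities returned by the model and draws a sample from the resulting distribution, which is why repeated calls can return different chars. Here is a framework-free sketch of that step using only the standard library. The log-probabilities are invented numbers, and `sample_next` is a hypothetical stand-in for the `torch.multinomial` call.

```python
import math
import random

def sample_next(log_probs, vocab, rng):
    """Draw one character according to exp(log_probs)."""
    probs = [math.exp(lp) for lp in log_probs]
    return rng.choices(vocab, weights=probs, k=1)[0]

vocab = ['d', 't', 'y']
# Made-up model output: the log of (0.7, 0.2, 0.1)
log_probs = [math.log(0.7), math.log(0.2), math.log(0.1)]

rng = random.Random(0)
draws = [sample_next(log_probs, vocab, rng) for _ in range(1000)]
print({c: draws.count(c) for c in vocab})
```

The counts should roughly follow the 70 / 20 / 10 split, with the less likely chars still showing up from time to time.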

Let's try to produce something a bit longer.

In [25]:
text = 'Goo'
for i in range(200): text += get_next(text[-3:])

print(text)

Goods, forman like are. And his now. I'm conomy belim 169000 sign-pers on here, decisiance is in studer of his ting ours America's who we can rultions in you lart gat was cal wanted forth be are strusine


Well, this does sound like a president, but not the one whose manner of speech we are trying to emulate here. Let's try a slightly more complex architecture.

## Char3Model

Technically, I am not sure if this is an RNN. It looks to me like an unrolled RNN but maybe it is missing something that it needs to be called an unrolled RNN. Either way, this is just semantics and we are very close to having a 'vanilla' RNN.

<img src='images/3char.png'>

In [26]:
class Char3Model(nn.Module):
    def __init__(self, vocab_size, n_fac):
        super().__init__()
        self.e = nn.Embedding(vocab_size, n_fac)

        # The 'green arrow' from our diagram - the layer operation from input to hidden
        # We are now feeding in chars one at a time! The input dimension of the first linear layer
        # is (1, 42) vs the (1, 3 * 42) we used earlier.
        self.l_in = nn.Linear(n_fac, n_hidden)

        # The 'orange arrow' from our diagram - the layer operation from hidden to hidden
        self.l_hidden = nn.Linear(n_hidden, n_hidden)
        
        # The 'blue arrow' from our diagram - the layer operation from hidden to output
        self.l_out = nn.Linear(n_hidden, vocab_size)
        
    def forward(self, c1, c2, c3):
        in1 = F.relu(self.l_in(self.e(c1)))
        in2 = F.relu(self.l_in(self.e(c2)))
        in3 = F.relu(self.l_in(self.e(c3)))
        
        h = V(torch.zeros(in1.size()).cuda()) # initiating our hidden state to all zeros
        h = F.tanh(self.l_hidden(h+in1))
        h = F.tanh(self.l_hidden(h+in2))
        h = F.tanh(self.l_hidden(h+in3))
        
        return F.log_softmax(self.l_out(h))

In [27]:
m = Char3Model(vocab_size, n_fac).cuda()

In [28]:
opt = optim.Adam(m.parameters(), 1e-2)

fit(m, md, 1, opt, F.nll_loss)
set_lrs(opt, 1e-3)
fit(m, md, 1, opt, F.nll_loss)

[ 0.       1.84013  3.70968]                                   



[ 0.       1.60283  2.86376]                                   



Let's generate some text again to see if it learned anything!

In [29]:
text = 'Goo'
for i in range(200): text += get_next(text[-3:])

print(text)

Gooteched to there hous crease bazensudenting an hado, will should if to cut distcover progle. But sumen I stant this thances more hown is in of to beliremocrate if hoperman if American't it's seears, re


Time to change our Char3Model into a 'real' RNN!

## CharLoopModel

<img src='images/simple_RNN.png'>

In [30]:
cs=8 # our model will generate predictions based on it seeing this many chars 

c_in_dat = [[idxs[i+j] for i in range(cs)] for j in range(len(idxs)-cs-1)]
c_out_dat = [idxs[j+cs] for j in range(len(idxs)-cs-1)]

xs = np.stack(c_in_dat, axis=0)
y = np.stack(c_out_dat)

In [31]:
xs.shape

(1027432, 8)

In [32]:
[[idx for idx in row] for row in xs[:8]]

[[35, 65, 7, 0, 41, 63, 52, 48],
 [65, 7, 0, 41, 63, 52, 48, 58],
 [7, 0, 41, 63, 52, 48, 58, 52],
 [0, 41, 63, 52, 48, 58, 52, 65],
 [41, 63, 52, 48, 58, 52, 65, 5],
 [63, 52, 48, 58, 52, 65, 5, 0],
 [52, 48, 58, 52, 65, 5, 0, 35],
 [48, 58, 52, 65, 5, 0, 35, 65]]

In [33]:
[[idx2char[idx] for idx in row] for row in xs[:cs]]

[['M', 'r', '.', ' ', 'S', 'p', 'e', 'a'],
 ['r', '.', ' ', 'S', 'p', 'e', 'a', 'k'],
 ['.', ' ', 'S', 'p', 'e', 'a', 'k', 'e'],
 [' ', 'S', 'p', 'e', 'a', 'k', 'e', 'r'],
 ['S', 'p', 'e', 'a', 'k', 'e', 'r', ','],
 ['p', 'e', 'a', 'k', 'e', 'r', ',', ' '],
 ['e', 'a', 'k', 'e', 'r', ',', ' ', 'M'],
 ['a', 'k', 'e', 'r', ',', ' ', 'M', 'r']]

In [34]:
class CharLoopModel(nn.Module):
    # This is an RNN!
    def __init__(self, vocab_size, n_fac):
        super().__init__()
        self.e = nn.Embedding(vocab_size, n_fac)
        self.l_in = nn.Linear(n_fac, n_hidden)
        self.l_hidden = nn.Linear(n_hidden, n_hidden)
        self.l_out = nn.Linear(n_hidden, vocab_size)
        
    def forward(self, *cs):
        bs = cs[0].size(0)
        h = V(torch.zeros(bs, n_hidden).cuda())
        for c in cs:
            inp = F.relu(self.l_in(self.e(c)))
            h = F.tanh(self.l_hidden(h+inp))
        
        return F.log_softmax(self.l_out(h))
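
The `for` loop is the main structural change from `Char3Model`: the same input and hidden layers are applied once per char, so unrolling the loop for three chars gives back the three explicit steps. The sketch below checks this with scalar stand-ins for the layers. The weights `w_in` and `w_h` and the inputs are arbitrary numbers chosen for illustration, not trained values.

```python
import math

w_in, w_h = 0.5, -0.8   # arbitrary scalar "layers" (assumptions, not trained)

def relu(x):
    return max(0.0, x)

def step(h, x):
    """One recurrence step: h <- tanh(w_h * (h + relu(w_in * x)))."""
    return math.tanh(w_h * (h + relu(w_in * x)))

def unrolled(x1, x2, x3):
    """Three explicit steps, in the style of Char3Model."""
    h = 0.0
    h = step(h, x1)
    h = step(h, x2)
    h = step(h, x3)
    return h

def looped(*xs):
    """The same recurrence written as a loop, in the style of CharLoopModel."""
    h = 0.0
    for x in xs:
        h = step(h, x)
    return h

print(unrolled(1.0, -2.0, 3.0) == looped(1.0, -2.0, 3.0))  # True
```

Unlike the unrolled version, `looped` accepts any number of inputs, which is what lets us feed 8 chars instead of 3.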

In [35]:
# This one line allows me to run my models on a subset of data while I work on this notebook
# md = ColumnarModelData.from_arrays('.', [-20_000], xs[:200_000, :], y[:200_000], bs=2**9)

md = ColumnarModelData.from_arrays('.', [-100_000], xs, y, bs=2**9)

In [36]:
m = CharLoopModel(vocab_size, n_fac).cuda()

In [37]:
opt = optim.Adam(m.parameters(), 1e-2)

fit(m, md, 1, opt, F.nll_loss)
set_lrs(opt, 1e-3)
fit(m, md, 1, opt, F.nll_loss)

[ 0.       1.85732  4.13745]                                  



[ 0.       1.54626  4.06182]                                  



In [38]:
get_next('Good mo')

'v'

In [39]:
text = 'Good mo'
for i in range(200): text += get_next(text[-8:])

print(text)

Good mould Afghanistant so not our inear ive or his know than is indurtment that of his that a businen doday be the for thirled withinefeumple, - Now, America. And eachorde that and the part their part once 


Looking a bit better but still not that great. Next step - replacing the for loop with a PyTorch component.

## CharRnn

This model does essentially the same thing as the model above, the difference being that it no longer relies on the manual for loop. We will replace it with an nn.RNN layer.

This is what that layer does (taken from [PyTorch documentation](http://pytorch.org/docs/master/nn.html#recurrent-layers)):

$$h_t = \tanh(w_{ih} * x_t + b_{ih}  +  w_{hh} * h_{(t-1)} + b_{hh})$$
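
To make the formula concrete, here is a plain-Python sketch of that recurrence for a single sequence (no batch dimension). The weights are tiny hand-written matrices rather than anything learned, and `matvec` and `rnn_forward` are names made up for this example. It also illustrates that, for a single-layer RNN run in one direction, the final hidden state is simply the last element of the per-step outputs.

```python
import math

def matvec(m, v):
    """Multiply matrix m (a list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def rnn_forward(xs, h0, w_ih, b_ih, w_hh, b_hh):
    """Apply h_t = tanh(w_ih x_t + b_ih + w_hh h_(t-1) + b_hh) over xs."""
    h, outputs = h0, []
    for x in xs:
        pre = [i + bi + r + bh
               for i, bi, r, bh in zip(matvec(w_ih, x), b_ih, matvec(w_hh, h), b_hh)]
        h = [math.tanh(p) for p in pre]
        outputs.append(h)
    return outputs, h

# Input size 1, hidden size 2; all numbers are arbitrary.
w_ih = [[0.5], [-0.3]]
w_hh = [[0.1, 0.2], [0.0, -0.4]]
b_ih = [0.0, 0.1]
b_hh = [0.05, 0.0]

outputs, hn = rnn_forward([[1.0], [2.0], [3.0]], [0.0, 0.0], w_ih, b_ih, w_hh, b_hh)
print(len(outputs), hn == outputs[-1])  # 3 True
```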

In [40]:
class CharRNN(nn.Module):
    def __init__(self, vocab_size, n_fac):
        super().__init__()
        self.e = nn.Embedding(vocab_size, n_fac)
#         self.l_in = nn.Linear(n_fac, n_hidden) <- nn.RNN will do this for us already so this can go
        self.rnn = nn.RNN(n_fac, n_hidden)
        self.l_out = nn.Linear(n_hidden, vocab_size)
        
    def forward(self, *cs):
        bs = cs[0].size(0)
        h0 = V(torch.zeros(1, bs, n_hidden).cuda())
        inp = self.e(torch.stack(cs))
        
        # output.shape = [8, bs, n_hidden]
        # hn.shape = [1, bs, n_hidden]
        #   turns out hn is the 'tensor containing the hidden state for k=seq_len', but for a plain
        #   single-layer nn.RNN that is the same as output[-1] (shape [bs, n_hidden]), which can be quite confusing!
        output, hn = self.rnn(inp, h0)
        return F.log_softmax(self.l_out(output[-1]))

In [41]:
m = CharRNN(vocab_size, n_fac).cuda()

In [42]:
opt = optim.Adam(m.parameters(), 1e-2)

fit(m, md, 1, opt, F.nll_loss)
set_lrs(opt, 1e-3)
fit(m, md, 1, opt, F.nll_loss)

[ 0.       1.83428  6.31489]                                  



[ 0.       1.5256   5.58316]                                  



In [43]:
get_next('Good mo')

'r'

In [44]:
text = 'Good mo'
for i in range(200): text += get_next(text[-8:])

print(text)

Good more States. Agen that be -adentind taxned our should is a coaligy. But surderfulloustand sume. You child. And so doebrets is wen's captory. And shory been mar ca supelesses fromightment intent terredia


## What is going on here? 

I did some copying of code from Jeremy's notebook. Things seem to work. I am starting to use nn.RNN which hides some of the complexity (and I think I understand what it does).

But something feels a bit off. Would I be able to create a model from scratch?

At this point, all this seems quite fuzzy. The only way to really learn this is to build a model from scratch myself.

## Adding a string of numbers

I will build a very simple RNN for evaluating a simple calculation contained in a string, for example '2+3'. For this example, the network should output the number 5.

I understand that this will be doing very little, but that is the plan. There are so many ways that this could be extended: other types of calculations, summing up the digits in a string until some special character resets the total to zero (for testing forget gates), and so on.

Let's see if I can get the most basic of models to work.

### Creating the dataset 

In [45]:
summands = [[x1, x2] for x1 in range(10) for x2 in range(10)]

# Build real lists first: newer NumPy versions reject a bare map object in np.stack
y = np.array([x1 + x2 for x1, x2 in summands]).reshape((-1, 1))

xs = np.array([[str(x1), '+', str(x2)] for x1, x2 in summands])

In [46]:
y[::16]

array([[ 0],
       [ 7],
       [ 5],
       [12],
       [10],
       [ 8],
       [15]])

In [47]:
xs[::16]

array([['0', '+', '0'],
       ['1', '+', '6'],
       ['3', '+', '2'],
       ['4', '+', '8'],
       ['6', '+', '4'],
       ['8', '+', '0'],
       ['9', '+', '6']],
      dtype='<U1')

In [48]:
chars = np.unique(xs)
chars

array(['+', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9'],
      dtype='<U1')

In [49]:
vocab_size = len(chars)
print('total chars:', vocab_size)

total chars: 11


We want to be able to easily go from char to its index in the vocabulary and back. Let's use dictionaries for that.

In [50]:
char2idx = {c: idx for idx, c in enumerate(chars)}
idx2char = {idx: c for idx, c in enumerate(chars)}

Now we need to convert our xs to an array with indices.

In [51]:
char2idx_ufunc = np.vectorize(lambda c: char2idx[c])

In [52]:
x_idxs = char2idx_ufunc(xs)

In [53]:
x_idxs[::16]

array([[ 1,  0,  1],
       [ 2,  0,  7],
       [ 4,  0,  3],
       [ 5,  0,  9],
       [ 7,  0,  5],
       [ 9,  0,  1],
       [10,  0,  7]])

We are ready to rock :)

In [54]:
# Both xs and y are ordered - this sort of selection of the val set will lead to the model never seeing '9' as x1
md = ColumnarModelData.from_arrays('.', [-10], x_idxs, y.astype(np.float32), bs=5)
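
One way around the ordering problem mentioned in the comment is to hold out one example per first summand instead of the last rows. This is only a sketch of picking the row indices. It assumes the rows are ordered as in `summands` (row `10 * x1 + x2`), uses made-up names (`val_rows`, `train_rows`), and leaves out how the indices would be handed to fastai.

```python
import random

rng = random.Random(42)

# Row i of the dataset corresponds to the pair (i // 10, i % 10).
n_rows = 100

# Hold out one randomly chosen second summand for every first summand.
val_rows = sorted(10 * x1 + rng.randrange(10) for x1 in range(10))
val_set = set(val_rows)
train_rows = [i for i in range(n_rows) if i not in val_set]

# Every digit still shows up as the first summand during training.
first_summands_in_train = {i // 10 for i in train_rows}
print(len(train_rows), len(val_rows), sorted(first_summands_in_train))
```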

Now for some incremental model construction.

This part will not go into my next post as it would make the code too messy, but I wanted to share it as I find the approach very useful.

In [55]:
class Calculator(nn.Module):
    def __init__(self, vocab_size, n_fac):
        super().__init__()
        
        # many simplifications here, which is good -
        #   I'll make the model only as complex as it needs to be to learn
        self.n_hidden = n_fac
        self.e = nn.Embedding(vocab_size, n_fac)
        self.l_hidden = nn.Linear(self.n_hidden, self.n_hidden)
        self.l_out = nn.Linear(self.n_hidden, 1)
    
    def forward(self, *cs):
        bs = cs[0].size(0)
        h = V(torch.zeros(bs, self.n_hidden).cuda())
        
        for c in cs:
            inp = self.e(c)
            h = inp + self.l_hidden(h)
        
        return self.l_out(h)

In [56]:
m = Calculator(vocab_size, 1).cuda()

In [57]:
opt = optim.Adam(m.parameters(), 1)

fit(m, md, 4, opt, F.mse_loss)
set_lrs(opt, 1e-1)
fit(m, md, 6, opt, F.mse_loss)

[  0.       81.94344   3.13505]                  
[  1.       44.22871   1.96429]                  
[  2.       28.90028   3.84676]                  
[  3.       18.40855   9.18894]                  



[ 0.       1.64152  6.71108]                     
[ 1.       1.46116  4.95674]                     
[ 2.       1.27369  3.99521]                     
[ 3.       1.08349  2.33228]                     
[ 4.       0.91851  1.82099]                      
[ 5.       0.76727  1.46149]                      



The NN has very few parameters so it is quite susceptible to local minima. You might have to restart the training a couple of times to get nice results as above.

Now the fun thing! What has our NN learned?

In [58]:
chars

array(['+', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9'],
      dtype='<U1')

In [59]:
m.e.weight

Parameter containing:
-3.9842
 2.4200
 2.3855
 0.9966
-1.4621
-2.4677
-3.6912
-4.8159
-4.8263
-6.6776
-9.0398
[torch.cuda.FloatTensor of size 11x1 (GPU 0)]

In [60]:
m.l_hidden.weight

Parameter containing:
-1.0264
[torch.cuda.FloatTensor of size 1x1 (GPU 0)]

In [61]:
m.l_out.weight

Parameter containing:
-0.7582
[torch.cuda.FloatTensor of size 1x1 (GPU 0)]

The network can converge to various solutions - with the hidden weight being a very small number close to zero, or one close to +1 or -1, etc. You might want to rerun the training and see what happens!
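
To see why a one-dimensional linear recurrence is enough for this task, here is a hand-built solution of the same form as `Calculator`: `h = e(c) + w * h` at each step, followed by `out = o * h + b`. The weights below are picked by hand for illustration (they are not the values learned above), and the bias of the hidden layer is assumed to be zero. With `w = -1`, the hidden state after three chars is `e(x1) - e('+') + e(x2)`, so any embedding that is linear in the digit value gives the exact sum.

```python
# Hand-picked weights (assumptions for illustration, not the trained values).
embedding = {str(d): -float(d) for d in range(10)}  # e(d) = -d
embedding['+'] = 0.0
w = -1.0          # hidden-to-hidden weight, hidden bias assumed to be 0
o, b = -1.0, 0.0  # output layer: out = o * h + b

def calculator(chars):
    """Run the linear recurrence over the chars and map h to the output."""
    h = 0.0
    for c in chars:
        h = embedding[c] + w * h
    return o * h + b

print(calculator(['2', '+', '3']))  # 5.0
```

This is just one of many weight settings that solve the task, which fits with training runs ending up in different places.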

I tried adding a tanh activation between the hidden states but it didn't work. That makes sense if you think about what the network is doing: with a single hidden unit, tanh squashes the running total into the range (-1, 1).

Can we get this to work with a larger hidden state and a tanh nonlinearity?

In [62]:
class CalculatorWithNonlinearities(nn.Module):
    def __init__(self, vocab_size, n_fac, n_hidden):
        super().__init__()

        self.n_hidden = n_hidden
        self.e = nn.Embedding(vocab_size, n_fac)
        self.l_in = nn.Linear(n_fac, n_hidden) # we need this because now n_fac != n_hidden
        self.l_hidden = nn.Linear(n_hidden, n_hidden)
        self.l_out = nn.Linear(n_hidden, 1)
    
    def forward(self, *cs):
        bs = cs[0].size(0)
        h = V(torch.zeros(bs, self.n_hidden).cuda())
        for c in cs:
            inp = self.l_in(self.e(c))
            h = F.tanh(self.l_hidden(h + inp))
        return self.l_out(h)

I know that 1 should work for the n_fac - the NN just needs to learn the number corresponding to each char. But what about n_hidden?

How small can I make the network for it to learn?

In [63]:
m = CalculatorWithNonlinearities(vocab_size, 1, 20).cuda()

In [64]:
opt = optim.Adam(m.parameters(), 1e-2)

fit(m, md, 30, opt, F.mse_loss)

[  0.       66.14835  21.62745]                  
[  1.       42.2609    1.72222]                  
[  2.       30.39179   0.08944]                  
[  3.       24.26544   0.03263]                  
[  4.       21.06459   0.14275]                  
[  5.       18.88719   0.03486]                  
[  6.       16.91121   0.06657]                  
[  7.       14.98446   1.69308]                  
[  8.       13.17066   0.76843]                  
[  9.       11.54898   7.11219]                  
[ 10.       10.06652   3.6616 ]                  
[ 11.        8.28587   4.35489]                  
[ 12.        6.70898   3.56428]                  
[ 13.        5.34851   2.9674 ]                  
[ 14.        4.13392   2.51921]                  
[ 15.        3.08958   0.42672]                  
[ 16.        2.29914   0.30198]                  
[ 17.        1.65553   0.3035 ]                  
[ 18.        1.1885    0.52241]                  
[ 19.        0.86083   0.49692]                   

For what it's worth, a 10-dimensional hidden state seems to be too small for this method of training / architecture. 20 seems to do the trick.

Let's see what embeddings it learned.

In [65]:
chars

array(['+', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9'],
      dtype='<U1')

In [66]:
m.e.weight

Parameter containing:
 0.7059
 1.4007
 1.1018
 0.8484
 0.6089
 0.3554
 0.1097
-0.1371
-0.3852
-0.6220
-0.9086
[torch.cuda.FloatTensor of size 11x1 (GPU 0)]

Makes sense! Now time for getting it to work with the nn.RNN black box.

Pasting the equation summarizing what it does here so that I know what to feed it and what I can expect to come out of it:
$$h_t = \tanh(w_{ih} * x_t + b_{ih}  +  w_{hh} * h_{(t-1)} + b_{hh})$$

In [67]:
class CalculatorWithRNNCell(nn.Module):
    def __init__(self, vocab_size, n_fac, n_hidden):
        super().__init__()

        self.n_hidden = n_hidden
        self.e = nn.Embedding(vocab_size, n_fac)   
        self.rnn = nn.RNN(n_fac, n_hidden)
        self.l_out = nn.Linear(n_hidden, 1)
    
    def forward(self, *cs):
        bs = cs[0].size(0)
        inp = self.e(torch.stack(cs))
        h0 = V(torch.zeros(1, bs, self.n_hidden).cuda())
        out, _ = self.rnn(inp, h0)
        
        return self.l_out(out[-1])

In [68]:
m = CalculatorWithRNNCell(vocab_size, 1, 20).cuda()

In [69]:
opt = optim.Adam(m.parameters(), 1e-2)

fit(m, md, 30, opt, F.mse_loss)

[  0.       49.11596   9.42365]                            
[  1.       31.13147   0.05916]                            
[  2.       24.25076   0.14015]                            
[  3.       21.08268   0.01221]                            
[  4.       19.50281   0.04864]                            
[  5.       18.45021   0.00196]                            
[  6.       17.89862   0.00804]                            
[  7.       17.73083   0.02894]                            
[  8.       17.52084   0.05901]                            
[  9.       17.16917   0.00533]                            
[ 10.       16.74032   0.0291 ]                            
[ 11.       16.58917   0.74846]                  
[ 12.       16.12264   1.24357]                            
[ 13.       15.27243  11.13585]                            
[ 14.       14.131     1.42411]                            
[ 15.       11.71144   1.18931]                            
[ 16.        8.90745   2.72275]                   

In [70]:
chars

array(['+', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9'],
      dtype='<U1')

In [71]:
m.e.weight

Parameter containing:
-1.1853
 2.2100
 1.2919
 0.8742
 0.5553
 0.2421
-0.1066
-0.4738
-0.8615
-1.4669
-2.3471
[torch.cuda.FloatTensor of size 11x1 (GPU 0)]

## End of part 1

And that's it for part 1! In part 2 we will look at longer sequences and specifically the LSTM and GRU cells!

See you in part 2!

### Fun things to try

* The RNNs for the learning from speeches part have not been tuned. Can you lower the validation loss by picking different embedding / hidden state dimensions? What about learning rate annealing (decreasing the learning rate as you train) or adding batch norm / dropout?
* Can you teach the network to perform other calculations, such as subtraction or multiplication? How big of a hidden state do you need to make it work?
* What about a network that would take in long sequences of ones and zeros and count the ones? How long can we make such a sequence before the network is unable to learn?
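
As a starting point for the last idea, here is a small sketch of how such a dataset could be generated. Everything in it (the function name `make_counting_data`, the sequence length, the seed) is an assumption for illustration, and the resulting lists would still need to be converted to arrays before training.

```python
import random

def make_counting_data(n_samples, seq_len, seed=0):
    """Return (xs, ys): random 0/1 sequences and the number of ones in each."""
    rng = random.Random(seed)
    xs = [[rng.randint(0, 1) for _ in range(seq_len)] for _ in range(n_samples)]
    ys = [sum(seq) for seq in xs]
    return xs, ys

xs, ys = make_counting_data(n_samples=5, seq_len=12)
for seq, count in zip(xs, ys):
    print(''.join(map(str, seq)), '->', count)
```

Increasing `seq_len` step by step is one way to look for the point where the network stops learning.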