# Human numbers

We're now going to jump into human numbers, which is `lesson7-human-numbers.ipynb`. This is a dataset that I created which literally just contains all the numbers from 1 to 9,999 written out in English.

We're going to try to create a language model that can predict the next word in this document. It's just a toy example for this purpose. In this case, we only have one document: the list of numbers. So we can use a `TextList` to create an item list of text for the training set and another for the validation set.
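To get a feel for what such a document looks like, here is a minimal sketch that spells out the numbers in the same style (no "and", no hyphens). The helper `spell` is hypothetical and written only for illustration; it is not the code that produced the dataset.

```python
# Hypothetical helper: spell an integer from 1 to 9,999 in the dataset's style.
ONES = ["", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine",
        "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen", "sixteen",
        "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"]

def spell(n):
    if not 1 <= n <= 9999:
        raise ValueError("n must be between 1 and 9,999")
    words = []
    thousands, rest = divmod(n, 1000)
    hundreds, rest = divmod(rest, 100)
    if thousands:
        words += [ONES[thousands], "thousand"]
    if hundreds:
        words += [ONES[hundreds], "hundred"]
    if rest >= 20:
        tens, ones = divmod(rest, 10)
        words.append(TENS[tens])
        if ones:
            words.append(ONES[ones])
    elif rest:
        words.append(ONES[rest])
    return " ".join(words)

# One single "document": every number, separated by commas.
document = ", ".join(spell(n) for n in range(1, 10000))
print(document[:80])
print(spell(8124))
```

The point of the toy dataset is that the next word is highly predictable from the words before it, which makes it easy to see whether a language model is learning anything.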

## Lesson 7 Notes

[from hiromis](https://github.com/hiromis/notes/blob/master/Lesson7.md)


[discussion thread](https://forums.fast.ai/t/lesson-7-in-class-chat/32554/118)

[advanced discussion thread](https://forums.fast.ai/t/lesson-7-further-discussion/32555)

## Recurrent Neural Network - RNN

[video timing](https://www.youtube.com/watch?v=nWpdkZE2_cc&feature=youtu.be&t=5911)

See Hiromi's notes: [from hiromis](https://github.com/hiromis/notes/blob/master/Lesson7.md)


In [0]:
from fastai.text import *

In [0]:
bs=64

## Data

In [0]:
path = untar_data(URLs.HUMAN_NUMBERS)
path.ls()

[PosixPath('/home/ubuntu/.fastai/data/human_numbers/train.txt'),
 PosixPath('/home/ubuntu/.fastai/data/human_numbers/valid.txt')]

In [0]:
def readnums(d): return [', '.join(o.strip() for o in open(path/d).readlines())]

In [0]:
train_txt = readnums('train.txt'); train_txt[0][:80]

'one, two, three, four, five, six, seven, eight, nine, ten, eleven, twelve, thirt'

NOTE: we check the first 80 characters from the training data set.

In [0]:
valid_txt = readnums('valid.txt'); valid_txt[0][-80:]

' nine thousand nine hundred ninety eight, nine thousand nine hundred ninety nine'

NOTE: we check the last 80 characters from the validation data set.

In [0]:
train = TextList(train_txt, path=path)
valid = TextList(valid_txt, path=path)

src = ItemLists(path=path, train=train, valid=valid).label_for_lm()
data = src.databunch(bs=bs)

NOTE:  In this case, the training set is the numbers from 1 to 8,000 and the validation set is the numbers from 8,001 onwards. We can combine them together and turn that into a data bunch.

`label_for_lm` is a special labelling method for language models. 




In [0]:
# display the first 80 characters
train[0].text[:80]

'xxbos one , two , three , four , five , six , seven , eight , nine , ten , eleve'

NOTE:  We only have one document, so `train[0]` is the document. Grab its `.text` (that's how you grab the contents of a text list) and here are the first 80 characters. It starts with a special token, `xxbos`. Anything starting with `xx` is a special fast.ai token, and `bos` is the beginning-of-stream token. It basically says this is the start of a document, and it's very helpful in NLP to know when documents start so that your models can learn to recognize them.

In [0]:
len(data.valid_ds[0][0].data)

13017

NOTE:  The validation set contains 13,017 tokens. That is 13,017 words or punctuation marks, because everything between spaces is a separate token.

In [0]:
# bptt = sequence length (back prop through time)
data.bptt, len(data.valid_dl)

(70, 3)

In [0]:
# total batches (total number of tokens / sequence length / batch size)
13017/70/bs

2.905580357142857

NOTE:  The batch size that we asked for was 64, and then by default it uses something called `bptt` of 70. `bptt`, as we briefly mentioned, stands for **"back prop through time"**. **That's the sequence length**. 

For each of our 64 document segments, we split it up into lists of 70 words that we look at at one time. So what we do for the validation set is we grab this entire string of 13,000 tokens, and then we split it into 64 roughly equal sized sections. **They're 64 roughly equally sized segments**. So we take the first 1/64 of the document - piece 1. The second 1/64 - piece 2.

Then for each of those 1/64 of the document, we then split those into pieces of length 70. So let's now say for those 13,000 tokens, how many batches are there? Well, divide by batch size and divide by 70, so there's going to be 3 batches.
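The batch count can be checked with plain arithmetic. This is only a sketch of the calculation described above, using the numbers from the notebook; it does not call fastai.

```python
import math

n_tokens = 13017    # validation tokens, from len(data.valid_ds[0][0].data)
bs, bptt = 64, 70   # batch size and sequence length

tokens_per_batch = bs * bptt                        # 4480 tokens in one full batch
n_batches = math.ceil(n_tokens / tokens_per_batch)  # round the 2.9 up

print(n_tokens / tokens_per_batch)   # about 2.9056
print(n_batches)                     # 3
print(n_batches * tokens_per_batch)  # 13440 elements if all three batches are full
```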

In [0]:
# iterator for the DataLoader
it = iter(data.valid_dl)
x1,y1 = next(it)
x2,y2 = next(it)
x3,y3 = next(it)
it.close()

In [0]:
# then add the number of elements
x1.numel()+x2.numel()+x3.numel()

13440

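NOTE:  Let's grab an iterator for the data loader, grab batches 1, 2 and 3 (the X and the Y), and add up the number of elements. The total is 13,440 = 3 × 64 × 70, i.e. three full batches, which is slightly more than the 13,017 tokens in the validation set.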

In [0]:
x1.shape,y1.shape

(torch.Size([64, 70]), torch.Size([64, 70]))

In [0]:
x2.shape,y2.shape

(torch.Size([64, 70]), torch.Size([64, 70]))

NOTE:  As you can see, it's 64 by 70. 
`bptt` adds a bit of shuffling.

In [0]:
# first token of each of the 64 rows in the first batch of x - numericalized
x1[:,0]

tensor([ 2,  8, 10, 11, 12, 10,  9,  8,  9, 13, 18, 24, 18, 14, 15, 10, 18,  8,
         9,  8, 18, 24, 18, 10, 18, 10,  9,  8, 18, 19, 10, 25, 19, 22, 19, 19,
        23, 19, 10, 13, 10, 10,  8, 13,  8, 19,  9, 19, 34, 16, 10,  9,  8, 16,
         8, 19,  9, 19, 10, 19, 10, 19, 19, 19], device='cuda:0')

In [0]:
# first token of each of the 64 rows in the first batch of y - numericalized
y1[:,0]

tensor([18, 18, 26,  9,  8, 11, 31, 18, 25,  9, 10, 14, 10,  9,  8, 14, 10, 18,
        25, 18, 10, 17, 10, 17,  8, 17, 20, 18,  9,  9, 19,  8, 10, 15, 10, 10,
        12, 10, 12,  8, 12, 13, 19,  9, 19, 10, 23, 10,  8,  8, 15, 16, 19,  9,
        19, 10, 23, 10, 18,  8, 18, 10, 10,  9], device='cuda:0')

NOTE:  So here, you can see the first batch of X (remember, we've numericalized all these) and here's the first batch of Y. The cells above print the first column, i.e. the first token of each of the 64 rows. Reading along the first row instead, `x1[0]` is `[2, 18, 10, 11, 8, ...]` and `y1[0]` is `[18, 10, 11, 8, ...]`. So **y1 is offset by 1 from x1**, because that's what you want to do with a language model: we want to predict the next word. So after 2 (`x1[0,0]`) should come 18 (`y1[0,0]`), and after 8 (`x1[1,0]`) should come 18 (`y1[1,0]`).
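The offset-by-one relationship is easy to reproduce on a plain list. A minimal sketch, using the token ids quoted in the note (the ids stand for `xxbos eight thousand one ,`):

```python
# Token ids for the start of the validation document, as quoted in the note.
ids = [2, 18, 10, 11, 8]

x = ids[:-1]  # inputs:  every token except the last
y = ids[1:]   # targets: the same stream shifted left by one

for inp, tgt in zip(x, y):
    print(f"after {inp} comes {tgt}")
```

Each target is simply the token that follows the corresponding input, which is all a language model is asked to predict.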



In [0]:
v = data.valid_ds.vocab

### batch 1

In [0]:
v.textify(x1[0])

'xxbos eight thousand one , eight thousand two , eight thousand three , eight thousand four , eight thousand five , eight thousand six , eight thousand seven , eight thousand eight , eight thousand nine , eight thousand ten , eight thousand eleven , eight thousand twelve , eight thousand thirteen , eight thousand fourteen , eight thousand fifteen , eight thousand sixteen , eight thousand seventeen , eight'

In [0]:
v.textify(y1[0])

'eight thousand one , eight thousand two , eight thousand three , eight thousand four , eight thousand five , eight thousand six , eight thousand seven , eight thousand eight , eight thousand nine , eight thousand ten , eight thousand eleven , eight thousand twelve , eight thousand thirteen , eight thousand fourteen , eight thousand fifteen , eight thousand sixteen , eight thousand seventeen , eight thousand'

NOTE:  You can grab the vocab for this dataset, and a vocab has a `textify` method, so if we look at exactly the same thing but with `textify`, that will just look it up in the vocab. So here you can see `xxbos eight thousand one`, whereas in the y there's no `xxbos`, it's just `eight thousand one`. So after `xxbos` is `eight`, after `eight` is `thousand`, after `thousand` is `one`.
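Conceptually, a vocab is just a two-way lookup between tokens and integer ids. The class below is a toy sketch of that idea; `ToyVocab` is a hypothetical name, it only borrows the `itos` and `textify` naming, and it is not fastai's implementation (so the ids it assigns differ from the real ones shown above).

```python
class ToyVocab:
    """Illustrative two-way mapping between tokens and integer ids."""
    def __init__(self, tokens):
        # itos: id -> string, in first-seen order without duplicates
        self.itos = list(dict.fromkeys(tokens))
        # stoi: string -> id
        self.stoi = {s: i for i, s in enumerate(self.itos)}

    def numericalize(self, tokens):
        return [self.stoi[t] for t in tokens]

    def textify(self, ids):
        return " ".join(self.itos[i] for i in ids)

tokens = "xxbos eight thousand one , eight thousand two".split()
vocab = ToyVocab(tokens)
ids = vocab.numericalize(tokens)

print(ids)                 # [0, 1, 2, 3, 4, 1, 2, 5]
print(vocab.textify(ids))  # xxbos eight thousand one , eight thousand two
```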

In [0]:
v.textify(x2[0])

'thousand eighteen , eight thousand nineteen , eight thousand twenty , eight thousand twenty one , eight thousand twenty two , eight thousand twenty three , eight thousand twenty four , eight thousand twenty five , eight thousand twenty six , eight thousand twenty seven , eight thousand twenty eight , eight thousand twenty nine , eight thousand thirty , eight thousand thirty one , eight thousand thirty two ,'

In [0]:
v.textify(x3[0])

'eight thousand thirty three , eight thousand thirty four , eight thousand thirty five , eight thousand thirty six , eight thousand thirty seven , eight thousand thirty eight , eight thousand thirty nine , eight thousand forty , eight thousand forty one , eight thousand forty two , eight thousand forty three , eight thousand forty four , eight thousand forty five , eight thousand forty six , eight'

### batch 2

In [0]:
v.textify(x1[1])

', eight thousand forty six , eight thousand forty seven , eight thousand forty eight , eight thousand forty nine , eight thousand fifty , eight thousand fifty one , eight thousand fifty two , eight thousand fifty three , eight thousand fifty four , eight thousand fifty five , eight thousand fifty six , eight thousand fifty seven , eight thousand fifty eight , eight thousand fifty nine ,'

In [0]:
v.textify(x2[1])

'eight thousand sixty , eight thousand sixty one , eight thousand sixty two , eight thousand sixty three , eight thousand sixty four , eight thousand sixty five , eight thousand sixty six , eight thousand sixty seven , eight thousand sixty eight , eight thousand sixty nine , eight thousand seventy , eight thousand seventy one , eight thousand seventy two , eight thousand seventy three , eight thousand'

In [0]:
v.textify(x3[1])

'seventy four , eight thousand seventy five , eight thousand seventy six , eight thousand seventy seven , eight thousand seventy eight , eight thousand seventy nine , eight thousand eighty , eight thousand eighty one , eight thousand eighty two , eight thousand eighty three , eight thousand eighty four , eight thousand eighty five , eight thousand eighty six , eight thousand eighty seven , eight thousand eighty'

In [0]:
v.textify(x3[-1])

'ninety , nine thousand nine hundred ninety one , nine thousand nine hundred ninety two , nine thousand nine hundred ninety three , nine thousand nine hundred ninety four , nine thousand nine hundred ninety five , nine thousand nine hundred ninety six , nine thousand nine hundred ninety seven , nine thousand nine hundred ninety eight , nine thousand nine hundred ninety nine xxbos eight thousand one , eight'

NOTE:  Then we can go right back to the start, but look at row index 1 of each batch (the second row). Now we can continue. The join between the end of `x3[0]` and the start of `x1[1]` isn't perfectly clean (both sit around 8,046); that's because the last mini batch wasn't quite complete. What this means is that every mini batch joins up with the previous mini batch. So you can go straight from `x1[0]` to `x2[0]`: it continues 8,017, 8,018. If you do the same thing for row 1 (`x1[1]`, `x2[1]`, `x3[1]`), you'll also see they join up. **So all the mini batches join up.**
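The way rows continue from one mini batch to the next can be sketched in plain Python. The function `batchify` below is a hypothetical illustration of the idea (split the stream into `bs` contiguous rows, then cut the rows into `bptt`-wide pieces). It is not fastai's data loader: in particular, it simply lets the last batch be shorter.

```python
def batchify(stream, bs, bptt):
    """Split one token stream into `bs` contiguous rows, then cut the rows
    into consecutive mini batches that are at most `bptt` tokens wide."""
    row_len = (len(stream) - 1) // bs   # keep one extra token for the final target
    rows = [stream[r * row_len:(r + 1) * row_len + 1] for r in range(bs)]
    batches = []
    for start in range(0, row_len, bptt):
        end = min(start + bptt, row_len)
        x = [row[start:end] for row in rows]          # inputs
        y = [row[start + 1:end + 1] for row in rows]  # targets, offset by one
        batches.append((x, y))
    return batches

stream = list(range(100))               # stand-in for a numericalized document
batches = batchify(stream, bs=4, bptt=5)
(x1, y1), (x2, y2) = batches[0], batches[1]

print(x1[0], x2[0])  # [0, 1, 2, 3, 4] [5, 6, 7, 8, 9]: row 0 continues across batches
print(x1[1])         # [24, 25, 26, 27, 28]: row 1 starts a quarter of the way in
```

Because row `i` of batch `k+1` picks up exactly where row `i` of batch `k` stopped, a model that keeps its hidden state between batches sees one long, unbroken sequence per row.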

In [0]:
data.show_batch(ds_type=DatasetType.Valid)

idx,text
0,"thousand forty seven , eight thousand forty eight , eight thousand forty nine , eight thousand fifty , eight thousand fifty one , eight thousand fifty two , eight thousand fifty three , eight thousand fifty four , eight thousand fifty five , eight thousand fifty six , eight thousand fifty seven , eight thousand fifty eight , eight thousand fifty nine , eight thousand sixty , eight thousand sixty"
1,"eight , eight thousand eighty nine , eight thousand ninety , eight thousand ninety one , eight thousand ninety two , eight thousand ninety three , eight thousand ninety four , eight thousand ninety five , eight thousand ninety six , eight thousand ninety seven , eight thousand ninety eight , eight thousand ninety nine , eight thousand one hundred , eight thousand one hundred one , eight thousand one"
2,"thousand one hundred twenty four , eight thousand one hundred twenty five , eight thousand one hundred twenty six , eight thousand one hundred twenty seven , eight thousand one hundred twenty eight , eight thousand one hundred twenty nine , eight thousand one hundred thirty , eight thousand one hundred thirty one , eight thousand one hundred thirty two , eight thousand one hundred thirty three , eight thousand"
3,"three , eight thousand one hundred fifty four , eight thousand one hundred fifty five , eight thousand one hundred fifty six , eight thousand one hundred fifty seven , eight thousand one hundred fifty eight , eight thousand one hundred fifty nine , eight thousand one hundred sixty , eight thousand one hundred sixty one , eight thousand one hundred sixty two , eight thousand one hundred sixty three"
4,"thousand one hundred eighty three , eight thousand one hundred eighty four , eight thousand one hundred eighty five , eight thousand one hundred eighty six , eight thousand one hundred eighty seven , eight thousand one hundred eighty eight , eight thousand one hundred eighty nine , eight thousand one hundred ninety , eight thousand one hundred ninety one , eight thousand one hundred ninety two , eight thousand"


That's the data. We can do show_batch to see it.

## Single fully connected model

In [0]:
data = src.databunch(bs=bs, bptt=3)

NOTE: here we have a sequence length of 3, not 70 as before.

In [0]:
x,y = data.one_batch()
x.shape,y.shape

(torch.Size([64, 3]), torch.Size([64, 3]))

In [0]:
nv = len(v.itos); nv

38

In [0]:
nh=64

In [0]:
def loss4(input,target): return F.cross_entropy(input, target[:,-1])
def acc4 (input,target): return accuracy(input, target[:,-1])

NOTE:  `F.cross_entropy(input, target[:,-1])` compares the result of our model (`input`) to the last word in the sequence (`target[:,-1]`).
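To see what the `target[:,-1]` indexing does, here is a small self-contained check with random tensors standing in for the model output and the targets (the shapes and the random data are assumptions made for illustration):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
bs, bptt, nv = 4, 3, 38

logits = torch.randn(bs, nv)               # one prediction per row, like Model0's output
target = torch.randint(0, nv, (bs, bptt))  # three target tokens per row

def loss4(input, target):
    # Score the prediction against the last target token of each row only.
    return F.cross_entropy(input, target[:, -1])

last = target[:, -1]                       # shape [bs]
# Cross entropy by hand: mean negative log-probability of the correct token.
manual = -F.log_softmax(logits, dim=1)[torch.arange(bs), last].mean()

print(last.shape)
print(loss4(logits, target).item(), manual.item())
```

The two printed loss values should agree, which confirms that only the final word of each 3-token target sequence is being scored.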


Here is our model which is doing what we saw in the diagram:

In [0]:
# model definition as per diagram
class Model0(nn.Module):
    def __init__(self):
        super().__init__()
        self.i_h = nn.Embedding(nv,nh)  # green arrow
        self.h_h = nn.Linear(nh,nh)     # brown arrow
        self.h_o = nn.Linear(nh,nv)     # blue arrow
        self.bn = nn.BatchNorm1d(nh)
        
    def forward(self, x):
        h = self.bn(F.relu(self.i_h(x[:,0])))
        if x.shape[1]>1:
            h = h + self.i_h(x[:,1])
            h = self.bn(F.relu(self.h_h(h)))
        if x.shape[1]>2:
            h = h + self.i_h(x[:,2])
            h = self.bn(F.relu(self.h_h(h)))
        return self.h_o(h)

NOTE:  It contains one embedding (the green arrow), one hidden-to-hidden layer (the brown arrow, also referred to as orange), and one hidden-to-output layer (the blue arrow). So each colored arrow has a single matrix. Then in the forward pass, we take our first input `x[:,0]` and put it through input-to-hidden (the green arrow) to create our first set of activations, which we call `h`. Next, assume that there is a second word (sometimes we might be at the end of a batch where there isn't one). Then we add to `h` the result of `x[:,1]` put through the green arrow (that's `i_h`). Our new `h` is the result of those two added together, put through our hidden-to-hidden layer (the brown arrow), then ReLU, then batch norm. Then for the third word, we do exactly the same thing. Finally, the blue arrow: put it through `h_o`.

So that's how we convert our diagram to code. Nothing new here at all. We can chuck that in a learner and train it: about 41% accuracy in the run below.

In [0]:
learn = Learner(data, Model0(), loss_func=loss4, metrics=acc4)

In [0]:
learn.fit_one_cycle(6, 1e-4)

epoch,train_loss,valid_loss,acc4
1,3.596286,3.588869,0.046645
2,3.086100,3.205763,0.274816
3,2.494411,2.749365,0.392004
4,2.144753,2.463537,0.415671
5,2.010915,2.352887,0.409237
6,1.983992,2.336967,0.408778


## Same thing with a loop

[video timing](https://www.youtube.com/watch?v=nWpdkZE2_cc&feature=youtu.be&t=6648)

Let's take this code and recognize it's pretty awful. There's a lot of duplicate code, and as coders, when we see duplicate code, what do we do? We refactor. So we should refactor this into a loop.



In [0]:
class Model1(nn.Module):
    def __init__(self):
        super().__init__()
        self.i_h = nn.Embedding(nv,nh)  # green arrow
        self.h_h = nn.Linear(nh,nh)     # brown arrow
        self.h_o = nn.Linear(nh,nv)     # blue arrow
        self.bn = nn.BatchNorm1d(nh)
        
    def forward(self, x):
        h = torch.zeros(x.shape[0], nh).to(device=x.device)
        for i in range(x.shape[1]):
            h = h + self.i_h(x[:,i])
            h = self.bn(F.relu(self.h_h(h)))
        return self.h_o(h)

NOTE:  Here we are. We've refactored it into a loop. So now we're going for each `xi` in `x`, and doing it in the loop. Guess what? **That's an RNN**. An RNN is just a refactoring. It's not anything new. This is now an RNN. And let's refactor our diagram:

This is the same diagram, but I've just replaced it with my loop. It does the same thing, so here it is. It's got exactly the same `__init__`, literally exactly the same; I've just popped a loop in `forward`. Before I start, I just have to make sure that I've got a bunch of zeros to add to. And of course, I get roughly the same result when I train it.

Now, this code will work for a sequence of any length. So for the next section of this notebook, we will use `bptt = 20`.
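The claim that the loop is only a refactoring can be checked numerically. The sketch below compares a hand-unrolled three-step version with a looped version that share the same layers. Batch norm is left out to keep the comparison simple, and the function names `unrolled` and `looped` are made up for this example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
nv, nh = 38, 64
i_h = nn.Embedding(nv, nh)   # input to hidden (green arrow)
h_h = nn.Linear(nh, nh)      # hidden to hidden (brown arrow)

def unrolled(x):
    # Three steps written out by hand.
    h = F.relu(h_h(i_h(x[:, 0])))
    h = F.relu(h_h(h + i_h(x[:, 1])))
    h = F.relu(h_h(h + i_h(x[:, 2])))
    return h

def looped(x):
    # The same computation, starting from zeros and looping over the sequence.
    h = torch.zeros(x.shape[0], nh)
    for i in range(x.shape[1]):
        h = F.relu(h_h(h + i_h(x[:, i])))
    return h

x = torch.randint(0, nv, (8, 3))
print(torch.allclose(unrolled(x), looped(x)))       # True
print(looped(torch.randint(0, nv, (8, 9))).shape)   # torch.Size([8, 64])
```

The looped version also accepts a 9-token sequence without any change, which is the practical benefit of the refactoring.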

In [0]:
learn = Learner(data, Model1(), loss_func=loss4, metrics=acc4)

In [0]:
learn.fit_one_cycle(6, 1e-4)

epoch,train_loss,valid_loss,acc4
1,3.493525,3.420231,0.156250
2,2.987600,2.937893,0.376149
3,2.440199,2.477995,0.388787
4,2.132837,2.256569,0.391774
5,2.011305,2.181337,0.392923
6,1.985913,2.170874,0.393153


NOTE:  One nice thing about the loop, though, is that it will now work even if I'm not predicting the fourth word from the previous three, but the ninth word from the previous eight. It'll work for a sequence of any length, which is nice.

So let's up the `bptt` to 20 since we can now. And let's now say, okay, instead of just predicting the `n`th word from the previous `n-1`, let's try to predict the second word from the first, the third from the second, and the fourth from the third, and so forth. Look at our loss function.

## Multi fully connected model

In other words, after every loop, predict, loop, predict, loop, predict.



In [0]:
data = src.databunch(bs=bs, bptt=20)

With `bptt = 20`, the sequence length is now 20.

In [0]:
x,y = data.one_batch()
x.shape,y.shape

(torch.Size([64, 20]), torch.Size([64, 20]))

In [0]:
class Model2(nn.Module):
    def __init__(self):
        super().__init__()
        self.i_h = nn.Embedding(nv,nh)
        self.h_h = nn.Linear(nh,nh)
        self.h_o = nn.Linear(nh,nv)
        self.bn = nn.BatchNorm1d(nh)
        
    def forward(self, x):
        h = torch.zeros(x.shape[0], nh).to(device=x.device)
        res = []
        for i in range(x.shape[1]):
            h = h + self.i_h(x[:,i])
            h = F.relu(self.h_h(h))
            res.append(self.h_o(self.bn(h)))
        return torch.stack(res, dim=1)

NOTE:  Previously we were comparing the result of our model to just the last word of the sequence. That is very wasteful, because there are a lot of words in the sequence. So let's compare every word in x to every word in y. To do that, we need to change the diagram so it's not just one triangle at the end of the loop, but the triangle is inside the loop: 

Here's this code. It's the same as the previous code, but now I've created **an array** (`res`), and every time I go through the loop, I append `h_o(bn(h))` to the array. At the end, `torch.stack` joins them along dimension 1. So now, for `n` inputs, I create `n` outputs. **So I'm predicting after every word**.
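The shapes involved are worth seeing once. In this sketch, random tensors stand in for the per-step predictions, and the loss is computed by flattening batch and time together. That flattening is my assumption about what a per-token language-model loss has to do; it is written out by hand here rather than taken from fastai.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
bs, bptt, nv = 64, 20, 38

# Pretend the loop produced one [bs, nv] prediction per time step.
res = [torch.randn(bs, nv) for _ in range(bptt)]
out = torch.stack(res, dim=1)            # [bs, bptt, nv]
y = torch.randint(0, nv, (bs, bptt))     # one target per input token

# Flatten batch and time so that every position contributes to the loss.
loss = F.cross_entropy(out.reshape(-1, nv), y.reshape(-1))
acc = (out.argmax(dim=-1) == y).float().mean()

print(out.shape)   # torch.Size([64, 20, 38])
print(loss.item(), acc.item())
```

With 20 predictions per row instead of one, each mini batch now provides 20 times as many training signals.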

In [0]:
learn = Learner(data, Model2(), metrics=accuracy)

In [0]:
learn.fit_one_cycle(10, 1e-4, pct_start=0.1)

epoch,train_loss,valid_loss,accuracy
1,3.639285,3.709278,0.058949
2,3.551151,3.565677,0.151776
3,3.439908,3.431850,0.207741
4,3.323083,3.314237,0.283949
5,3.213422,3.219906,0.321662
6,3.119673,3.151162,0.336790
7,3.046645,3.106630,0.341690
8,2.995379,3.082552,0.346662
9,2.963800,3.073327,0.349645
10,2.947312,3.071951,0.349787


NOTE:  Previously I had 39.3%, now I have 35.0%. **Why is it worse? It's worse because now when I'm trying to predict the second word, I only have one word of state to use.** When I'm looking at the third word, I only have two words of state to use. So it's a much harder problem for it to solve. The key problem is here:

`h = torch.zeros(x.shape[0], nh).to(device=x.device)`

```
def forward(self, x):
    h = torch.zeros(x.shape[0], nh).to(device=x.device)  # <-- state is reset on every call
    for i in range(x.shape[1]):
        h = h + self.i_h(x[:,i])
        h = self.bn(F.relu(self.h_h(h)))
    return self.h_o(h)
```
I am restarting the state after every bptt sequence. So let's take this line out of the `forward` method and put it in `__init__`.




## Maintain state

In [0]:
class Model3(nn.Module):
    def __init__(self):
        super().__init__()
        self.i_h = nn.Embedding(nv,nh)
        self.h_h = nn.Linear(nh,nh)
        self.h_o = nn.Linear(nh,nv)
        self.bn = nn.BatchNorm1d(nh)
        self.h = torch.zeros(bs, nh).cuda()
        
    def forward(self, x):
        res = []
        h = self.h
        for i in range(x.shape[1]):
            h = h + self.i_h(x[:,i])
            h = F.relu(self.h_h(h))
            res.append(self.bn(h))
        self.h = h.detach()
        res = torch.stack(res, dim=1)
        res = self.h_o(res)
        return res

NOTE:  Previously I went `h = torch.zeros`, so I reset my state to zero every time I started another BPTT sequence. Let's not do that. Let's keep `h`. And we can, because remember, each batch connects to the previous batch. It's not shuffled as happens in image classification. So let's take this exact model and replicate it again, but move the creation of `h` into the constructor.

NOTE:  There it is. So it's now `self.h`. This is the same code, but at the end we put the new `h` back into `self.h` (detached, so the values are kept but gradients don't flow back into earlier sequences). It's now doing the same thing, but it's not throwing away that state.
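The `h.detach()` call in `Model3` is what makes it practical to carry state across batches. A tiny sketch of its effect, with a 3-element tensor standing in for the weights and the hidden state (names and numbers are made up for illustration):

```python
import torch

w = torch.ones(3, requires_grad=True)   # stands in for the model weights
h = w * 2                                # state computed from the weights
kept = h.detach()                        # same values, but no gradient history

print(h.requires_grad, kept.requires_grad)  # True False
print(torch.equal(h, kept))                 # True

# Use the kept state in the "next sequence" and backpropagate.
loss = (kept * w).sum()
loss.backward()
print(w.grad)  # tensor([2., 2., 2.]): only the current step contributes
```

Without `detach()`, the gradient would also flow through `h = w * 2` and come out as 4 per element. In a real model that would mean backpropagating through every earlier batch, which gets more expensive with each step.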

In [0]:
learn = Learner(data, Model3(), metrics=accuracy)

In [0]:
learn.fit_one_cycle(20, 3e-3)

epoch,train_loss,valid_loss,accuracy
1,3.598183,3.556362,0.050710
2,3.274616,2.975699,0.401634
3,2.624206,2.036894,0.467330
4,2.022702,1.956439,0.316193
5,1.681813,1.934952,0.336861
6,1.453007,1.948201,0.351349
7,1.276971,2.005776,0.368679
8,1.138499,2.081261,0.360156
9,1.029217,2.145853,0.360795
10,0.939949,2.215388,0.372230
11,0.865441,2.240438,0.401491
12,0.805310,2.195846,0.409375
13,0.755035,2.324373,0.422727
14,0.713073,2.305542,0.449716
15,0.677393,2.350155,0.446449
16,0.645841,2.418738,0.446591
17,0.621809,2.456903,0.446165
18,0.605300,2.541699,0.443040
19,0.594099,2.539824,0.443040
20,0.587563,2.551423,0.442827


NOTE:  Now we actually get above the original: all the way up to 44% accuracy. So this is what a real RNN looks like. You always want to keep that state. But just keep remembering, there's nothing different about an RNN: it's a totally normal fully connected neural net. It's just that you've got a loop that you refactored.

## nn.RNN

NOTE:  What you could do, though, is at the end of every loop, you could not just spit out an output, but spit it out into another RNN. So you have an RNN going into an RNN. That's nice because we've now got more layers of computation, and you would expect that to work better.

To get there, let's do some more refactoring. Let's take this code (`Model3`) and replace it with the equivalent built-in PyTorch code:

In [0]:
class Model4(nn.Module):
    def __init__(self):
        super().__init__()
        self.i_h = nn.Embedding(nv,nh)
        self.rnn = nn.RNN(nh,nh, batch_first=True)
        self.h_o = nn.Linear(nh,nv)
        self.bn = BatchNorm1dFlat(nh)
        self.h = torch.zeros(1, bs, nh).cuda()
        
    def forward(self, x):
        res,h = self.rnn(self.i_h(x), self.h)
        self.h = h.detach()
        return self.h_o(self.bn(res))

NOTE:  So `nn.RNN` basically says do the loop for me. We've still got the same embedding, the same output layer, the same batch norm and the same initialization of `h`, but we just got rid of the loop. One of the nice things about `nn.RNN` is that you can now say how many layers you want. It is the same idea as `Model3`, although in the run below the accuracy comes out higher, at about 61%:
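The shapes that `nn.RNN` expects and returns explain why `self.h` has a leading dimension of 1 in `Model4`. A short sketch with random input standing in for the embedding output; note that `nn.RNN` uses a tanh non-linearity by default, so it is not numerically identical to the hand-written ReLU loop.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bs, bptt, nh = 64, 20, 64

rnn = nn.RNN(nh, nh, batch_first=True)   # one layer, tanh by default
emb = torch.randn(bs, bptt, nh)          # stands in for self.i_h(x)
h0 = torch.zeros(1, bs, nh)              # [num_layers, bs, nh]

res, h = rnn(emb, h0)
print(res.shape)                         # torch.Size([64, 20, 64]): one output per time step
print(h.shape)                           # torch.Size([1, 64, 64]): final hidden state
print(torch.allclose(res[:, -1], h[0]))  # True: the last output is the final state

# Asking for two layers only changes the leading dimension of the state.
rnn2 = nn.RNN(nh, nh, 2, batch_first=True)
res2, h2 = rnn2(emb)
print(h2.shape)                          # torch.Size([2, 64, 64])
```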



In [0]:
learn = Learner(data, Model4(), metrics=accuracy)

In [0]:
learn.fit_one_cycle(20, 3e-3)

epoch,train_loss,valid_loss,accuracy
1,3.451432,3.268344,0.224148
2,2.974938,2.456569,0.466051
3,2.316732,1.946969,0.465625
4,1.866151,1.991952,0.314702
5,1.618516,1.802403,0.437216
6,1.411517,1.731107,0.436293
7,1.171916,1.655979,0.504048
8,0.965887,1.579963,0.522088
9,0.797046,1.479819,0.565057
10,0.659378,1.487831,0.579048
11,0.553282,1.441922,0.597798
12,0.475167,1.498148,0.600781
13,0.416131,1.546984,0.606463
14,0.372395,1.594261,0.607386
15,0.337093,1.578321,0.613352
16,0.311385,1.580973,0.623366
17,0.292869,1.625745,0.618253
18,0.279486,1.623960,0.626065
19,0.270054,1.682090,0.611719
20,0.263857,1.675676,0.614702


So here, I'm going to do it with two layers:

## 2-layer GRU

But here's the thing. When you think about this: (see picture)

It keeps on going, and we've got a BPTT of 20, so there are 20 layers of this. And we know from the paper on visualizing loss landscapes that deep networks have awful, bumpy loss surfaces. So when you start creating long timescales and multiple layers, these things get impossible to train. There are a few tricks you can do. One thing is you can add skip connections, of course. But what people normally do is, instead of just adding these together (the green and orange arrows), they use a little mini neural net to decide how much of the green arrow to keep and how much of the orange arrow to keep. When you do that, you get something that's called either a GRU or an LSTM, depending on the details of that little neural net. And we'll learn about the details of those neural nets in part 2. They really don't matter though, frankly.
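For the curious, here is a sketch of what that "little mini neural net" looks like in the GRU case. It reimplements a single step of PyTorch's `nn.GRUCell` by hand and checks that the result matches. The function name `gru_step` is made up for this example, and the lecture itself defers these details to part 2.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
nh = 64
cell = nn.GRUCell(nh, nh)

def gru_step(x, h):
    # PyTorch stacks the reset (r), update (z) and candidate (n) weights in that order.
    gi = x @ cell.weight_ih.t() + cell.bias_ih
    gh = h @ cell.weight_hh.t() + cell.bias_hh
    i_r, i_z, i_n = gi.chunk(3, dim=1)
    h_r, h_z, h_n = gh.chunk(3, dim=1)
    r = torch.sigmoid(i_r + h_r)     # how much of the old state feeds the candidate
    z = torch.sigmoid(i_z + h_z)     # how much of the old state to keep
    n = torch.tanh(i_n + r * h_n)    # candidate new state
    return (1 - z) * n + z * h       # gated mix instead of a plain addition

x = torch.randn(8, nh)   # stands in for the embedded input (green arrow)
h = torch.randn(8, nh)   # previous hidden state
print(torch.allclose(gru_step(x, h), cell(x, h), atol=1e-5))
```

The last line of `gru_step` is the key difference from the earlier models: rather than `h + input`, the gate `z` decides per element how much of the old state survives.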

So we can now say let's create a GRU instead. It's just like what we had before, but it'll handle longer sequences in deeper networks. Let's use two layers.

In [0]:
class Model5(nn.Module):
    def __init__(self):
        super().__init__()
        self.i_h = nn.Embedding(nv,nh)
        self.rnn = nn.GRU(nh, nh, 2, batch_first=True)
        self.h_o = nn.Linear(nh,nv)
        self.bn = BatchNorm1dFlat(nh)
        self.h = torch.zeros(2, bs, nh).cuda()
        
    def forward(self, x):
        res,h = self.rnn(self.i_h(x), self.h)
        self.h = h.detach()
        return self.h_o(self.bn(res))

In [0]:
learn = Learner(data, Model5(), metrics=accuracy)

In [0]:
learn.fit_one_cycle(10, 1e-2)

epoch,train_loss,valid_loss,accuracy
1,2.864854,2.314943,0.454545
2,1.798988,1.357116,0.629688
3,0.932729,1.307463,0.796733
4,0.451969,1.329699,0.788636
5,0.225787,1.293570,0.800142
6,0.118085,1.265926,0.803338
7,0.065306,1.207096,0.806960
8,0.038098,1.205361,0.813920
9,0.024069,1.239411,0.807813
10,0.017078,1.253409,0.807102


NOTE:  And we're up to about 81%. That's RNNs, and the main reason I wanted to show them to you was to remove the last remaining piece of magic. This is one of the least magical things we have in deep learning: it's just a refactored fully connected network. So don't let RNNs ever put you off. This approach, where you have a sequence of n inputs and a sequence of n outputs, is what we've been using for language modeling, and you can use it for other tasks too.

For example, for every word the output could say whether it is something sensitive that I want to anonymize, i.e. private data or not. Or it could be a part-of-speech tag for that word, or something saying how that word should be formatted, or whatever. These are called sequence labeling tasks, and you can use this same approach for pretty much any sequence labeling task. Or you can do what I did in the earlier lesson: once you finish building your language model, you can throw away the `h_o` bit and pop a standard classification head there instead, and then you can do NLP classification, which as you saw earlier will give you state-of-the-art results even on long documents. So this is a super valuable technique, and not remotely magical.

# Fast.AI Wrap up

## What we did cover in Part 1
- Affine Functions & Non-Linearities
- Batch Norm
- Image classification & Regression
- Segmentation, U-Net, GANs
- Parameters and Activations
- Dropout
- Embeddings
- SGD, Momentum, Adam
- Convolutions
- Res/Dense Block
- Language Models, NLP classification
- Weight Decay
- Collaborative Filtering
- Random init & Transfer Learning
- Data augmentation
- Continuous and Categorical variables


## Advice
That's it. That's deep learning, or at least the practical pieces from my point of view. Having watched this one time, you won't get it all. I don't recommend that you watch this so slowly that you get it all the first time. Instead, go back and look at it again, take your time, and there'll be bits where you go "oh, now I see what he's saying", and then you'll be able to implement things you couldn't before or dig in more than before. So definitely go back and do it again. And as you do, write code, not just for yourself but put it on GitHub. It doesn't matter if you think it's great code or not. The fact that you're writing code and sharing it is impressive, and you'll get feedback if you tell people on the forum "hey, I wrote this code. It's not great but it's my first effort. Anything you see jump out at you?" People will say things like "oh, that bit was done well. But you know, for this bit, you could have used this library and saved yourself some time." You'll learn a lot by interacting with your peers.

As you've noticed, I've started introducing more and more papers. Part 2 will be a lot of papers, so it's a good time to start reading some of the papers that have been introduced in this section. **All the bits that say things like derivations and theorems and lemmas, you can skip them. I do. They add almost nothing to your understanding of practical deep learning**. **But the bits that say why are we solving this problem, and what are the results, and so forth, are really interesting. Then try and write English prose.** Not English prose that you want to be read by Geoffrey Hinton and Yann LeCun, but English prose you want to be read by you as of six months ago. Because there are a lot more people in the audience of you as of six months ago than there are of Geoffrey Hinton and Yann LeCun. That's the person you best understand. You know what they need.

Go and get help, and help others. Tell us about your success stories. But perhaps the most important one is to get together with others. People's learning works much better if you've got that social experience. So start a book club, get involved in meetups, create study groups, and build things. Again, it doesn't have to be amazing. Just build something that you think would make the world a little bit better if it existed. Or something you think would be kind of slightly delightful to your two-year-old. Or something you just want to show to your brother the next time they come around to see what you're doing, whatever. Just finish something. Finish something and then try to make it a bit better.



## Fast.AI Part 2

What is coming up?

- Deep dive into Fast.ai codebase
- Development and Research Process
- Reading Academic Papers
- Translating Math to Code
- Attentional Models
- Speech Recognition
- Translation
- Multi-Modal Models
- CycleGAN
- Object Detection
- Large and Distributed Training
- and more ...
