# Recurrent neural network

### Resources
1. [Understanding LSTM Networks](https://colah.github.io/posts/2015-08-Understanding-LSTMs/)
2. [The Unreasonable Effectiveness of Recurrent Neural Networks](https://karpathy.github.io/2015/05/21/rnn-effectiveness/)
3. [Neural machine translation by jointly learning to align and translate](https://arxiv.org/pdf/1409.3215.pdf)
4. [Translation with a sequence to sequence network and attention (pytorch)](https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html)

### 1. RNN
Recurrent neural networks were based on David Rumelhart's work in 1986. A recurrent neural network (RNN) is a class of artificial neural networks where connections between nodes form a directed graph along a temporal sequence. This allows it to exhibit temporal dynamic behavior. Derived from feedforward neural networks, RNNs can use their internal state (memory) to process variable length sequences of inputs. This makes them applicable to tasks such as unsegmented, connected handwriting recognition or speech recognition.

$$y_t, h_t = f(x_t, h_{t-1})$$

An Elman network is a three-layer network with the addition of a set of context units. The middle (hidden) layer is connected to these context units fixed with a weight of one. At each time step, the input is fed forward and a learning rule is applied. The fixed back-connections save a copy of the previous values of the hidden units in the context units (since they propagate over the connections before the learning rule is applied). Thus the network can maintain a sort of state, allowing it to perform such tasks as sequence-prediction that are beyond the power of a standard multilayer perceptron.

$$h_t = \sigma_{h}(W_h x_t + U_h h_{t-1} + b_h)$$
$$y_t = \sigma_{y}(W_y h_t + b_y)$$
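
To make the recurrence concrete, here is a minimal sketch of one Elman step in plain Python, with the output read from the new hidden state, $y_t = \sigma_y(W_y h_t + b_y)$. The choice of `tanh` for $\sigma_h$ and the logistic sigmoid for $\sigma_y$, the helper names and all weight values are assumptions made for illustration only.

```python
import math

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector (list)."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def elman_step(x_t, h_prev, W_h, U_h, b_h, W_y, b_y):
    """One Elman step: update the hidden state, then read the output from it."""
    pre_h = [a + b + c for a, b, c in zip(matvec(W_h, x_t), matvec(U_h, h_prev), b_h)]
    h_t = [math.tanh(a) for a in pre_h]          # sigma_h = tanh (assumed)
    pre_y = [a + b for a, b in zip(matvec(W_y, h_t), b_y)]
    y_t = [sigmoid(a) for a in pre_y]            # sigma_y = sigmoid (assumed)
    return y_t, h_t

# Toy sizes: 2 inputs, 3 hidden units, 1 output (all weights are made up).
W_h = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.2]]
U_h = [[0.1, 0.0, 0.2], [0.0, 0.1, 0.0], [0.3, 0.0, 0.1]]
b_h = [0.0, 0.0, 0.0]
W_y = [[1.0, -1.0, 0.5]]
b_y = [0.0]

h = [0.0, 0.0, 0.0]                              # initial hidden state
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]        # a sequence of three inputs
ys = []
for x_t in xs:
    y_t, h = elman_step(x_t, h, W_h, U_h, b_h, W_y, b_y)
    ys.append(y_t[0])
```

Because `h` is carried from one call to the next, feeding the same input twice generally gives two different outputs. This is the "state" that a plain multilayer perceptron lacks.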

#### Training
Backpropagation through time (BPTT) is a gradient-based technique for training certain types of recurrent neural networks. It can be used to train Elman networks. BPTT has difficulty with local optima. With recurrent neural networks, local optima are a much more significant problem than with feed-forward neural networks. The recurrent feedback in such networks tends to create chaotic responses in the error surface which cause local optima to occur frequently, and in poor locations on the error surface.

$$y_t, h_t = f(x_t, h_{t-1}, w)$$
$$y_{t-1}, h_{t-1} = f(x_{t-1}, h_{t-2}, w)$$
$$y_{t-2}, h_{t-2} = f(x_{t-2}, h_{t-3}, w)$$

$$\frac{d y_t}{d w} = \frac{\partial y_t}{\partial w} + \frac{\partial y_t}{\partial h_{t-1}} \frac{d h_{t-1}}{d w}$$
$$\frac{d h_{t-1}}{d w} = \frac{\partial h_{t-1}}{\partial w} + \frac{\partial h_{t-1}}{\partial h_{t-2}} \frac{d h_{t-2}}{d w}$$
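
The recursion can be checked numerically. The sketch below assumes a scalar RNN $h_t = \tanh(a x_t + w h_{t-1})$, where `w` is the recurrent weight and `a` is a fixed input weight (both invented for this example). It accumulates the total derivative of the last hidden state with respect to `w` one step at a time, as a direct term plus a term that flows through the previous state, and compares the result with a finite-difference estimate.

```python
import math

def forward(w, xs, a=0.7, h0=0.0):
    """Run h_t = tanh(a * x_t + w * h_{t-1}) and return all hidden states."""
    hs = [h0]
    for x_t in xs:
        hs.append(math.tanh(a * x_t + w * hs[-1]))
    return hs

def bptt_grad(w, xs, a=0.7, h0=0.0):
    """Total derivative dh_T/dw, built step by step with the chain rule."""
    hs = forward(w, xs, a, h0)
    dh_dw = 0.0                                # h_0 does not depend on w
    for t in range(1, len(hs)):
        local = 1.0 - hs[t] ** 2               # tanh'(pre-activation)
        direct = local * hs[t - 1]             # partial wrt w with h_{t-1} held fixed
        through_state = local * w * dh_dw      # (dh_t/dh_{t-1}) * (dh_{t-1}/dw)
        dh_dw = direct + through_state
    return dh_dw

xs = [0.5, -1.0, 0.3, 0.8]
w = 0.9
analytic = bptt_grad(w, xs)

# Central finite difference as an independent check.
eps = 1e-6
numeric = (forward(w + eps, xs)[-1] - forward(w - eps, xs)[-1]) / (2 * eps)
```

The two values should agree to several decimal places. An autograd framework performs the same accumulation automatically when `backward()` is called on a loss computed from an unrolled sequence.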

#### Problems
1. **[Vanishing gradient problem](https://en.wikipedia.org/wiki/Vanishing_gradient_problem)**. In machine learning, the vanishing gradient problem is encountered when training artificial neural networks with gradient-based learning methods and backpropagation. In such methods, each of the neural network's weights receives an update proportional to the partial derivative of the error function with respect to the current weight in each iteration of training. The problem is that in some cases the gradient will be vanishingly small, effectively preventing the weight from changing its value. In the worst case, this may completely stop the neural network from further training. As one example of the cause, traditional activation functions such as the hyperbolic tangent have derivatives in the range (0, 1], and backpropagation computes gradients by the chain rule. This has the effect of multiplying n of these small numbers to compute gradients of the early layers in an n-layer network (or of the early time steps in an RNN unrolled over n steps), so the gradient (error signal) decreases exponentially with n and the early layers train very slowly.
2. **Exploding gradient problem.** When activation functions are used whose derivatives can take on larger values, one risks encountering the related exploding gradient problem.
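
Both effects come from the same product of per-step factors, and a toy calculation shows how quickly it shrinks or grows. The sketch assumes a scalar recurrence $h_t = \mathrm{act}(u\,h_{t-1})$ with no input; the function name, the weights 0.5 and 1.5, and the 50-step horizon are arbitrary choices for illustration.

```python
import math

def state_jacobian(u, steps, activation="tanh", h0=0.1):
    """Return dh_T/dh_0 for the scalar recurrence h_t = act(u * h_{t-1})."""
    h, grad = h0, 1.0
    for _ in range(steps):
        pre = u * h
        if activation == "tanh":
            h = math.tanh(pre)
            local = 1.0 - h ** 2      # tanh' lies in (0, 1]
        else:                         # identity activation
            h = pre
            local = 1.0
        grad *= local * u             # chain rule: one factor per time step
    return grad

# |u * tanh'| <= 0.5 at every step, so the product is at most 0.5 ** 50.
vanishing = state_jacobian(u=0.5, steps=50)

# With an identity activation every factor is exactly 1.5.
exploding = state_jacobian(u=1.5, steps=50, activation="linear")
```

Here `vanishing` is bounded by $0.5^{50} \approx 10^{-15}$, while `exploding` equals $1.5^{50} \approx 6 \times 10^{8}$. The training loop later in this notebook limits the second effect with `torch.nn.utils.clip_grad_norm_`.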

In [1]:
import torch
import torch.nn as nn
from torch import optim
import torch.nn.functional as F
import time
from tqdm import tqdm
import random
import math

if torch.cuda.is_available():
    device = torch.device('cuda')
else:
    device = torch.device('cpu')

print('Using PyTorch version:', torch.__version__, ' Device:', device)

Using PyTorch version: 1.7.0  Device: cuda


## Sounds

In [2]:
# Read the whole training file; the context manager closes it automatically.
with open('train.txt') as f:
    text = f.read()

In [3]:
voc = text.split('\n')
voc = [v.split(' ') for v in voc]
voc.remove([''])

In [4]:
train_pairs = list()
for w in voc:
    train_pairs.append({'word': w[0], 'sounds': w[1]})
    if len(w) > 2:
        train_pairs.append({'word': w[0], 'sounds': w[2]})

In [5]:
import json

with open('pairs.json', 'w') as fp:
    fp.write('\n'.join(json.dumps(i) for i in train_pairs))

In [2]:
def tokenize_inp(w):
    return [l for l in w]

def tokenize_out(w):
    return w.split('_')

In [3]:
from torchtext.data import Field, TabularDataset, BucketIterator

SRC = Field(tokenize=tokenize_inp, init_token = '<sos>', eos_token = '<eos>')
TRG = Field(tokenize=tokenize_out, init_token = '<sos>', eos_token = '<eos>', is_target=True)

In [4]:
train_data = TabularDataset(path='pairs.json', format='json', fields={'word': ('src', SRC), 'sounds': ('trg', TRG)})

In [5]:
SRC.build_vocab(train_data)
TRG.build_vocab(train_data)

In [6]:
torch.random.manual_seed(42)
train_data, val_data = train_data.split(split_ratio=0.9)

In [7]:
train_iterator = BucketIterator(train_data, batch_size=8, device=device)
val_iterator = BucketIterator(val_data, batch_size=1, device=device)

In [12]:
# import torch
# import random
# from torch import nn
# from torch.autograd import Variable
# import torch.nn.functional as F


# class Encoder(nn.Module):
#     def __init__(self, input_size, embed_size, hidden_size,
#                  n_layers=1, dropout=0.5):
#         super(Encoder, self).__init__()
#         self.input_size = input_size
#         self.hidden_size = hidden_size
#         self.embed_size = embed_size
#         self.embed = nn.Embedding(input_size, embed_size)
#         self.gru = nn.GRU(embed_size, hidden_size, n_layers,
#                           dropout=dropout, bidirectional=True)

#     def forward(self, src, hidden=None):
#         embedded = self.embed(src)
#         outputs, hidden = self.gru(embedded, hidden)
#         # sum bidirectional outputs
#         outputs = (outputs[:, :, :self.hidden_size] +
#                    outputs[:, :, self.hidden_size:])
#         return outputs, hidden


# class Attention(nn.Module):
#     def __init__(self, hidden_size):
#         super(Attention, self).__init__()
#         self.hidden_size = hidden_size
#         self.attn = nn.Linear(self.hidden_size * 2, hidden_size)
#         self.v = nn.Parameter(torch.rand(hidden_size))
#         stdv = 1. / math.sqrt(self.v.size(0))
#         self.v.data.uniform_(-stdv, stdv)

#     def forward(self, hidden, encoder_outputs):
#         timestep = encoder_outputs.size(0)
#         h = hidden.repeat(timestep, 1, 1).transpose(0, 1)
#         encoder_outputs = encoder_outputs.transpose(0, 1)  # [B*T*H]
#         attn_energies = self.score(h, encoder_outputs)
#         return F.softmax(attn_energies, dim=1).unsqueeze(1)

#     def score(self, hidden, encoder_outputs):
#         # [B*T*2H]->[B*T*H]
#         energy = F.relu(self.attn(torch.cat([hidden, encoder_outputs], 2)))
#         energy = energy.transpose(1, 2)  # [B*H*T]
#         v = self.v.repeat(encoder_outputs.size(0), 1).unsqueeze(1)  # [B*1*H]
#         energy = torch.bmm(v, energy)  # [B*1*T]
#         return energy.squeeze(1)  # [B*T]


# class Decoder(nn.Module):
#     def __init__(self, output_size, embed_size, hidden_size,
#                  n_layers=1, dropout=0.2):
#         super(Decoder, self).__init__()
#         self.embed_size = embed_size
#         self.hidden_size = hidden_size
#         self.output_size = output_size
#         self.n_layers = n_layers

#         self.embed = nn.Embedding(output_size, embed_size)
#         self.dropout = nn.Dropout(dropout, inplace=True)
#         self.attention = Attention(hidden_size)
#         self.gru = nn.GRU(hidden_size + embed_size, hidden_size,
#                           n_layers, dropout=dropout)
#         self.out = nn.Linear(hidden_size * 2, output_size)

#     def forward(self, input, last_hidden, encoder_outputs):
#         # Get the embedding of the current input word (last output word)
#         embedded = self.embed(input).unsqueeze(0)  # (1,B,N)
#         embedded = self.dropout(embedded)
#         # Calculate attention weights and apply to encoder outputs
#         attn_weights = self.attention(last_hidden[-1], encoder_outputs)
#         context = attn_weights.bmm(encoder_outputs.transpose(0, 1))  # (B,1,N)
#         context = context.transpose(0, 1)  # (1,B,N)
#         # Combine embedded input word and attended context, run through RNN
#         rnn_input = torch.cat([embedded, context], 2)
#         output, hidden = self.gru(rnn_input, last_hidden)
#         output = output.squeeze(0)  # (1,B,N) -> (B,N)
#         context = context.squeeze(0)
#         output = self.out(torch.cat([output, context], 1))
#         output = F.log_softmax(output, dim=1)
#         return output, hidden, attn_weights


# class Seq2Seq(nn.Module):
#     def __init__(self, encoder, decoder, device):
#         super(Seq2Seq, self).__init__()
#         self.encoder = encoder
#         self.decoder = decoder
#         self.device = device

#     def forward(self, src, trg, teacher_forcing_ratio=0.5):
#         batch_size = src.size(1)
#         max_len = trg.size(0)
#         vocab_size = self.decoder.output_size
#         outputs = Variable(torch.zeros(max_len, batch_size, vocab_size)).to(self.device)

#         encoder_output, hidden = self.encoder(src)
#         hidden = hidden[:self.decoder.n_layers]
#         output = Variable(trg.data[0, :])  # sos
#         for t in range(1, max_len):
#             output, hidden, attn_weights = self.decoder(
#                     output, hidden, encoder_output)
#             outputs[t] = output
#             is_teacher = random.random() < teacher_forcing_ratio
#             top1 = output.data.max(1)[1]
#             output = Variable(trg.data[t] if is_teacher else top1).cuda()
#         return outputs

In [88]:
# # class Encoder(nn.Module):
# #     def __init__(self, input_dim, emb_dim, hid_dim, n_layers, dropout):
# #         super().__init__()
        
# #         self.hid_dim = hid_dim
# #         self.n_layers = n_layers
        
# #         self.embedding = nn.Embedding(input_dim, emb_dim)
        
# #         self.rnn = nn.LSTM(emb_dim, hid_dim, n_layers, dropout = dropout)
        
# #         self.dropout = nn.Dropout(dropout)
        
# #     def forward(self, src):
        
# #         #src = [src len, batch size]
        
# #         embedded = self.dropout(self.embedding(src))
        
# #         #embedded = [src len, batch size, emb dim]
        
# #         outputs, (hidden, cell) = self.rnn(embedded)
        
# #         #outputs = [src len, batch size, hid dim * n directions]
# #         #hidden = [n layers * n directions, batch size, hid dim]
# #         #cell = [n layers * n directions, batch size, hid dim]
        
# #         #outputs are always from the top hidden layer
        
# #         return hidden, cell

# class Decoder(nn.Module):
#     def __init__(self, output_dim, emb_dim, hid_dim, n_layers, dropout):
#         super().__init__()
        
#         self.output_dim = output_dim
#         self.hid_dim = hid_dim
#         self.n_layers = n_layers
        
#         self.embedding = nn.Embedding(output_dim, emb_dim)
        
#         self.rnn = nn.LSTM(emb_dim, hid_dim, n_layers, dropout = dropout)
        
#         self.fc_out = nn.Linear(hid_dim, output_dim)
        
#         self.dropout = nn.Dropout(dropout)
        
#     def forward(self, input, hidden, cell):
        
#         #input = [batch size]
#         #hidden = [n layers * n directions, batch size, hid dim]
#         #cell = [n layers * n directions, batch size, hid dim]
        
#         #n directions in the decoder will both always be 1, therefore:
#         #hidden = [n layers, batch size, hid dim]
#         #context = [n layers, batch size, hid dim]
        
#         input = input.unsqueeze(0)
        
#         #input = [1, batch size]
        
#         embedded = self.dropout(self.embedding(input))
        
#         #embedded = [1, batch size, emb dim]
                
#         output, (hidden, cell) = self.rnn(embedded, (hidden, cell))
        
#         #output = [seq len, batch size, hid dim * n directions]
#         #hidden = [n layers * n directions, batch size, hid dim]
#         #cell = [n layers * n directions, batch size, hid dim]
        
#         #seq len and n directions will always be 1 in the decoder, therefore:
#         #output = [1, batch size, hid dim]
#         #hidden = [n layers, batch size, hid dim]
#         #cell = [n layers, batch size, hid dim]
        
#         prediction = self.fc_out(output.squeeze(0))
        
#         #prediction = [batch size, output dim]
        
#         return prediction, hidden, cell

# # class Seq2Seq(nn.Module):
# #     def __init__(self, encoder, decoder, device):
# #         super().__init__()
        
# #         self.encoder = encoder
# #         self.decoder = decoder
# #         self.device = device
        
# #         assert encoder.hid_dim == decoder.hid_dim, \
# #             "Hidden dimensions of encoder and decoder must be equal!"
# #         assert encoder.n_layers == decoder.n_layers, \
# #             "Encoder and decoder must have equal number of layers!"
        
# #     def forward(self, src, trg, teacher_forcing_ratio = 0.5):
        
# #         #src = [src len, batch size]
# #         #trg = [trg len, batch size]
# #         #teacher_forcing_ratio is probability to use teacher forcing
# #         #e.g. if teacher_forcing_ratio is 0.75 we use ground-truth inputs 75% of the time
        
# #         batch_size = trg.shape[1]
# #         trg_len = trg.shape[0]
# #         trg_vocab_size = self.decoder.output_dim
        
# #         #tensor to store decoder outputs
# #         outputs = torch.zeros(trg_len, batch_size, trg_vocab_size).to(self.device)
        
# #         #last hidden state of the encoder is used as the initial hidden state of the decoder
# #         hidden, cell = self.encoder(src)
        
# #         #first input to the decoder is the <sos> tokens
# #         input = trg[0,:]
        
# #         for t in range(1, trg_len):
            
# #             #insert input token embedding, previous hidden and previous cell states
# #             #receive output tensor (predictions) and new hidden and cell states
# #             output, hidden, cell = self.decoder(input, hidden, cell)
            
# #             #place predictions in a tensor holding predictions for each token
# #             outputs[t] = output
            
# #             #decide if we are going to use teacher forcing or not
# #             teacher_force = random.random() < teacher_forcing_ratio
            
# #             #get the highest predicted token from our predictions
# #             top1 = output.argmax(1) 
            
# #             #if teacher forcing, use actual next token as next input
# #             #if not, use predicted token
# #             input = trg[t] if teacher_force else top1
        
# #         return outputs

In [123]:
class MyRnn(nn.Module):
    def __init__(self, output_dim, emb_dim, hid_dim, n_layers):
        super().__init__()
        
        self.output_dim = output_dim
        self.hid_dim = hid_dim
        self.n_layers = n_layers
        
        self.embedding = nn.Embedding(output_dim, emb_dim)
        
        self.rnn = nn.LSTM(emb_dim, hid_dim, n_layers)
        
        self.fc_out = nn.Linear(hid_dim, output_dim)
        
        
    def forward(self, input, hidden):

        input = input.unsqueeze(0)
        
        embedded = self.embedding(input)
                
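        # nn.LSTM returns (output, (h_n, c_n)); a tuple has no .to(), and the
        # tensors are already on the same device as the module and its input.
        output, hidden = self.rnn(embedded, hidden)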
        
        prediction = self.fc_out(output.squeeze(0))
        
        return prediction, hidden
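
A module whose `forward` starts with `unsqueeze(0)` expects a single time step of shape `[batch size]` per call. Passing a whole `[len, batch size]` tensor produces a 4-dimensional input to the LSTM after embedding. The sketch below shows one way such a one-step module could be unrolled over a target sequence with teacher forcing. `StepLSTM`, the toy sizes and the random data are hypothetical, and the loop ignores the source word entirely (an encoder would have to supply the initial state), so this illustrates the calling convention and is not the notebook's training procedure.

```python
import torch
import torch.nn as nn

class StepLSTM(nn.Module):
    """Hypothetical one-step LSTM model over target tokens."""
    def __init__(self, vocab_size, emb_dim, hid_dim, n_layers):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.LSTM(emb_dim, hid_dim, n_layers)
        self.fc_out = nn.Linear(hid_dim, vocab_size)

    def forward(self, tokens, state=None):
        # tokens: [batch] -> [1, batch], i.e. a sequence of length 1
        embedded = self.embedding(tokens.unsqueeze(0))     # [1, batch, emb_dim]
        output, state = self.rnn(embedded, state)          # state=None means zeros
        return self.fc_out(output.squeeze(0)), state       # [batch, vocab_size]

torch.manual_seed(0)
vocab_size, batch_size, trg_len = 10, 4, 6                 # toy sizes
step_model = StepLSTM(vocab_size, emb_dim=8, hid_dim=16, n_layers=2)
trg = torch.randint(0, vocab_size, (trg_len, batch_size))  # [trg len, batch size]

logits = torch.zeros(trg_len, batch_size, vocab_size)
state = None
for t in range(1, trg_len):
    # Teacher forcing: the previous ground-truth token is the input at step t.
    logits[t], state = step_model(trg[t - 1], state)

# Position 0 holds no prediction, so it is dropped before computing the loss.
loss = nn.CrossEntropyLoss()(logits[1:].reshape(-1, vocab_size),
                             trg[1:].reshape(-1))
```

The `(h, c)` tuple returned by `nn.LSTM` is threaded from one call to the next, which is what lets the module remember earlier tokens even though it sees only one at a time.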

In [124]:
import time 
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

In [125]:
INPUT_DIM = len(SRC.vocab)
OUTPUT_DIM = len(TRG.vocab)
ENC_EMB_DIM = 128
DEC_EMB_DIM = 128
HID_DIM = 256
N_LAYERS = 2

model = MyRnn(OUTPUT_DIM, ENC_EMB_DIM, HID_DIM, N_LAYERS).to(device)

In [126]:
# INPUT_DIM = len(SRC.vocab)
# OUTPUT_DIM = len(TRG.vocab)
# ENC_EMB_DIM = 128
# DEC_EMB_DIM = 128
# HID_DIM = 256
# N_LAYERS = 3
# ENC_DROPOUT = 0.3
# DEC_DROPOUT = 0.3

# enc = Encoder(INPUT_DIM, ENC_EMB_DIM, HID_DIM, N_LAYERS, ENC_DROPOUT)
# dec = Decoder(OUTPUT_DIM, DEC_EMB_DIM, HID_DIM, N_LAYERS, DEC_DROPOUT)

# model = Seq2Seq(enc, dec, device).to(device)

In [127]:
def init_weights(m: nn.Module):
    for name, param in m.named_parameters():
        if 'weight' in name:
            nn.init.normal_(param.data, mean=0, std=0.01)
        else:
            nn.init.constant_(param.data, 0)
        
model.apply(init_weights)

MyRnn(
  (embedding): Embedding(43, 128)
  (rnn): LSTM(128, 256, num_layers=2)
  (fc_out): Linear(in_features=256, out_features=43, bias=True)
)

In [128]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')

The model has 938,155 trainable parameters


In [129]:
optimizer = optim.RMSprop(model.parameters(), lr=0.001)
TRG_PAD_IDX = TRG.vocab.stoi[TRG.pad_token]
criterion = nn.CrossEntropyLoss(ignore_index = TRG_PAD_IDX)

In [130]:
def train(model: nn.Module,
          iterator: torch.utils.data.DataLoader,
          optimizer: optim.Optimizer,
          criterion: nn.Module,
          clip: float):

    model.train()

    epoch_loss = 0

    for b, batch in enumerate(iterator):
        src, trg = batch.src.to(device), batch.trg.to(device)

        optimizer.zero_grad()

        output = model(src, trg)

        output = output[1:].view(-1, output.shape[-1])
        trg = trg[1:].view(-1)

        loss = criterion(output, trg)

        loss.backward()

        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)

        optimizer.step()

        epoch_loss += loss.item()
        if b % 500 == 0 and b != 0:
            print("[%d][loss:%5.2f][pp:%5.2f]" %
                  (b, epoch_loss / (b + 1), math.exp(epoch_loss / (b + 1))))

    return epoch_loss / len(iterator)

In [131]:
def evaluate_val(model: nn.Module,
             iterator: torch.utils.data.DataLoader,
             criterion: nn.Module):
    model.eval()
    epoch_loss = 0
    with torch.no_grad():
        for _, batch in enumerate(iterator):
            src, trg = batch.src.to(device), batch.trg.to(device)
            output = model(src, trg, 0) #turn off teacher forcing
            output = output[1:].view(-1, output.shape[-1])
            trg = trg[1:].view(-1)
            loss = criterion(output, trg)
            epoch_loss += loss.item()

    return epoch_loss / len(iterator)


In [132]:
from copy import deepcopy

N_EPOCHS = 15
CLIP = 1

best_valid_loss = float('inf')
best_model  = None
for epoch in range(N_EPOCHS):
    
    start_time = time.time()
    
    train_loss = train(model, train_iterator, optimizer, criterion, CLIP)
    valid_loss = evaluate_val(model, val_iterator, criterion)
    
    end_time = time.time()
    
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        best_model = deepcopy(model)
    print(f'Epoch: {epoch+1:02} | Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. PPL: {math.exp(valid_loss):7.3f}')

RuntimeError: input must have 3 dimensions, got 4

In [97]:
test_data = TabularDataset(path='test_pairs.json', format='json', fields={'word': ('src', SRC), 'sounds': ('trg', TRG)})

In [98]:
test_iterator = BucketIterator(test_data, batch_size=1, device=device, shuffle=False)

In [99]:
def src_to_word(vec):
    for el in vec:
        print(SRC.vocab.itos[el], end='')
    print('\n')

def trg_to_word(vec):
    res = ''
    for el in vec:
        if TRG.vocab.itos[el] == '<eos>':
            break
        if el > 4:
            res += TRG.vocab.itos[el] + '_'
    return res[:-1]

In [100]:
def evaluate_test(model, iterator, criterion):
    
    model.eval()
    
    epoch_loss = 0
    
    with torch.no_grad():
    
        for i, batch in enumerate(iterator):

            src = batch.src
            trg = torch.transpose(torch.tensor([[2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3]]), 0, 1).to(device)
#             src_to_word(src[:,0])
#             print(trg_to_word(trg[:,0]))
#             print('src:\n', src.shape)
#             print('trg:\n', trg.shape)
            

            output = model(src, trg, 0) #turn off teacher forcing
            #trg = [trg len, batch size]
            #output = [trg len, batch size, output dim]
#             print('predict:\n', output.argmax(2)[:,1])

            output_dim = output.shape[-1]
            df.at[i, 'Transcription'] = trg_to_word(output.argmax(2))
            
            output = output[1:].view(-1, output_dim)
            trg = trg[1:].view(-1)

            #trg = [(trg len - 1) * batch size]
            #output = [(trg len - 1) * batch size, output dim]

            loss = criterion(output, trg)
            
            epoch_loss += loss.item()
    return epoch_loss / len(iterator)

In [101]:
import pandas as pd
df = pd.read_csv('submission.csv')

In [102]:
evaluate_test(best_model.to(device), test_iterator, criterion)

23.87385980251314

In [103]:
df

Unnamed: 0,Id,Transcription
0,1,P_IH_CH_T
1,2,D_IH_S_AA_L_V_ER_Z
2,3,S_K_R_AO_N_IY
3,4,B_AA_N_N_F_N_T
4,5,IH_K_S_IY_D_Z
...,...,...
41592,41593,IH_N_AA_K_Y_L_EY_SH_N
41593,41594,N_T_OW
41594,41595,S_K_OW_G_IH_N
41595,41596,HH_EH_SH_N


In [104]:
df2 = pd.read_csv('submission.csv')

In [105]:
len(df[df['Transcription'] != df2['Transcription']])

15482

In [107]:
df[df['Transcription'] != df2['Transcription']][:30]

Unnamed: 0,Id,Transcription
1,2,D_IH_S_AA_L_V_ER_Z
3,4,B_AA_N_N_F_N_T
7,8,K_P_IH_CH_L_EY_T
15,16,M_EY_P_L
19,20,P_ER_S_P_AY_R
22,23,T_IY_N
28,29,M_AA_M_B_AA_S
30,31,P_EY_Y_UW
31,32,K_R_IH_S_M_N_Z
36,37,S_B_AA_L_OW_Z


In [108]:
df2[df['Transcription'] != df2['Transcription']][:30]

Unnamed: 0,Id,Transcription
1,2,D_IH_Z_AA_L_V_ER_Z
3,4,B_OW_N_IH_N_F_N_T
7,8,K_AE_P_IH_T_L_EY_T
15,16,M_AE_P_L
19,20,P_ER_S_P_AY_ER
22,23,T_AY_N
28,29,M_OW_M_B_AE_S
30,31,P_EY_Y
31,32,K_R_IH_S_T_M_N_Z
36,37,S_B_AE_L_OW_Z


In [109]:
df.to_csv('submission.csv', columns=['Id', 'Transcription'], index=False)    

### 2. LSTM (long short-term memory)
LSTM was proposed by Sepp Hochreiter and Jürgen Schmidhuber. By introducing Constant Error Carousel (CEC) units, LSTM deals with the vanishing gradient problem. The initial version of LSTM block included cells, input and output gates.

In theory, classic RNNs can keep track of arbitrary long-term dependencies in the input sequences. The problem with vanilla RNNs is computational (or practical) in nature: when training a vanilla RNN using back-propagation, the gradients which are back-propagated can "vanish" (that is, they can tend to zero) or "explode" (that is, they can tend to infinity), because of the computations involved in the process, which use finite-precision numbers. RNNs using LSTM units partially solve the vanishing gradient problem, because LSTM units allow gradients to also flow unchanged. However, LSTM networks can still suffer from the exploding gradient problem.

The compact forms of the equations for the forward pass of an LSTM unit with a forget gate are below.
$$f_t = \sigma\big(W_{f} x_t + U_{f} h_{t-1} + b_f\big)$$
$$i_t = \sigma\big(W_{i} x_t + U_{i} h_{t-1} + b_i\big)$$
$$o_t = \sigma\big(W_{o} x_t + U_{o} h_{t-1} + b_o\big)$$
$$\tilde{c}_t = \tanh\big(W_{c} x_t + U_{c} h_{t-1} + b_c\big)$$
$$c_t = f_t \ast c_{t-1} + i_t \ast \tilde{c}_t$$
$$h_t = o_t \ast \tanh(c_t)$$
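
The equations translate almost line for line into code. The sketch below assumes a scalar input and a one-dimensional state so that every gate is a single number; the parameter dictionary and its values are invented. The biases are chosen so that the forget gate sits near 1 and the input gate near 0, which is the regime in which the cell state is carried along unchanged.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step with scalar input and state; p maps names to scalar weights."""
    f_t = sigmoid(p["W_f"] * x_t + p["U_f"] * h_prev + p["b_f"])        # forget gate
    i_t = sigmoid(p["W_i"] * x_t + p["U_i"] * h_prev + p["b_i"])        # input gate
    o_t = sigmoid(p["W_o"] * x_t + p["U_o"] * h_prev + p["b_o"])        # output gate
    c_tilde = math.tanh(p["W_c"] * x_t + p["U_c"] * h_prev + p["b_c"])  # candidate
    c_t = f_t * c_prev + i_t * c_tilde                                  # new cell state
    h_t = o_t * math.tanh(c_t)                                          # new hidden state
    return h_t, c_t

# Made-up weights: a large b_f keeps f_t near 1, a very negative b_i keeps i_t near 0.
params = {"W_f": 0.0, "U_f": 0.0, "b_f": 20.0,
          "W_i": 0.0, "U_i": 0.0, "b_i": -20.0,
          "W_o": 1.0, "U_o": 0.5, "b_o": 0.0,
          "W_c": 1.0, "U_c": 0.5, "b_c": 0.0}

h, c = 0.0, 0.8
for x_t in [1.0, -2.0, 0.5, 3.0, -1.0] * 20:     # 100 time steps
    h, c = lstm_step(x_t, h, c, params)
```

After 100 steps `c` is still essentially 0.8. Since $\partial c_t / \partial c_{t-1} = f_t \approx 1$, the error signal along the cell state is neither shrunk nor amplified, in contrast to the repeated multiplication by small factors in a vanilla RNN.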

### 3. GRU (gated recurrent unit)
Gated recurrent units are a gating mechanism in recurrent neural networks, introduced in 2014 by Kyunghyun Cho et al. The GRU is like an LSTM with a forget gate, but has fewer parameters than an LSTM, as it lacks an output gate. GRU's performance on certain tasks of polyphonic music modeling, speech signal modeling and natural language processing was found to be similar to that of LSTM. GRUs have been shown to exhibit better performance on certain smaller and less frequent datasets.

$$z_t = \sigma\big(W_z x_t + U_z h_{t-1} + b_z\big)$$
$$r_t = \sigma\big(W_r x_t + U_r h_{t-1} + b_r\big)$$
$$\hat{h}_t = \tanh\big(W_h x_t + U_h (r_t \ast h_{t-1}) + b_h\big)$$
$$h_t = (1-z_t) \ast h_{t-1} + z_t \ast \hat{h}_t$$
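
A scalar sketch of one GRU step, written in the same convention as the equations (the update gate $z_t$ weights the candidate, and `U_z` denotes the update gate's recurrent weight). The parameter names and values are assumptions for illustration. Pushing the update-gate bias to either extreme shows the two limiting behaviours: keep the old state, or overwrite it with the candidate.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def gru_step(x_t, h_prev, p):
    """One GRU step with scalar input and state; p maps names to scalar weights."""
    z_t = sigmoid(p["W_z"] * x_t + p["U_z"] * h_prev + p["b_z"])              # update gate
    r_t = sigmoid(p["W_r"] * x_t + p["U_r"] * h_prev + p["b_r"])              # reset gate
    h_hat = math.tanh(p["W_h"] * x_t + p["U_h"] * (r_t * h_prev) + p["b_h"])  # candidate
    return (1.0 - z_t) * h_prev + z_t * h_hat

# Made-up weights shared by both configurations.
base = {"W_z": 0.0, "U_z": 0.0,
        "W_r": 1.0, "U_r": 1.0, "b_r": 0.0,
        "W_h": 1.0, "U_h": 1.0, "b_h": 0.0}
keep = dict(base, b_z=-20.0)        # z_t close to 0: keep the old state
overwrite = dict(base, b_z=20.0)    # z_t close to 1: take the candidate

h_keep = gru_step(2.0, 0.3, keep)
h_new = gru_step(2.0, 0.3, overwrite)
```

`h_keep` stays at the previous state 0.3, while `h_new` equals the candidate $\tanh(W_h x_t + U_h (r_t \ast h_{t-1}) + b_h)$. The step uses three weight groups (`z`, `r`, `h`) where the LSTM step uses four (`f`, `i`, `o`, `c`), which is where the parameter saving comes from.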


**Exercises**
1. Try GRU and LSTM models on our sum problem. Which performs better?
2. Train a model on 3-digit numbers.
3. Add the subtraction (minus) operation.
4. Work through this PyTorch [tutorial](https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html).
5. Take part in the transcription contest.