# Recurrent Neural Networks 
These notes cover text generation and text translation.
What is an RNN? We want to include time in our model, i.e. to model the correlation between samples through time. This is done by means of "loops": the output at each step is computed not only on the basis of the current input, but also using information coming from the past (i.e. the state). The way in which the loop is implemented varies between cell types: 
- RNN
- LSTM
- GRU
Many variants are possible, but these 3 are the basic cells used. <br>

We have seen the basic RNN (slide 2): at each time step the state is updated using the state coming from the past plus the current input. As we move on through the sequence, the state keeps accumulating past information. Training uses backpropagation through time: we can imagine the loop unrolled over the time steps and, once unrolled, go backward through it to compute the gradient. The problem with this kind of network is the vanishing gradient: going back in time, the gradient is repeatedly multiplied by the recurrent weights (and by the activation derivatives), so it tends to shrink and long-term dependencies are hard to learn. <br> <br>
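
The recurrence described above can be sketched in a few lines of NumPy. This is only an illustration under simple assumptions (a tanh cell, random weights, hypothetical names `rnn_forward`, `W_x`, `W_h`), not the way Keras implements it:

```python
import numpy as np

def rnn_forward(xs, W_x, W_h, b, h0=None):
    """Run a plain tanh RNN over one sequence of shape (time, input_dim)."""
    h = np.zeros(W_h.shape[0]) if h0 is None else h0
    states = []
    for x_t in xs:  # the "unrolled" loop over time
        # New state = function of the current input and of the previous state
        h = np.tanh(W_x @ x_t + W_h @ h + b)
        states.append(h)
    return np.stack(states), h

rng = np.random.default_rng(0)
input_dim, units, T = 3, 4, 6
W_x = rng.normal(scale=0.5, size=(units, input_dim))
W_h = rng.normal(scale=0.5, size=(units, units))
b = np.zeros(units)
xs = rng.normal(size=(T, input_dim))

states, last = rnn_forward(xs, W_x, W_h, b)
print(states.shape)  # (6, 4): one state per time step
```

The same weights `W_x` and `W_h` are reused at every step, which is why the backward pass multiplies by `W_h` once per time step.
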
LSTMs are an evolution of this basic cell, with two kinds of components. The first is the cell state, which is the memory of our network. The second is a set of learnable gates that control how the memory is updated: what to forget from the memory, what to write into it, and which output we want to produce. GRUs merge the forget and input gates into a single gate and therefore have fewer parameters. <br>
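
As a rough sketch of how the gates act on the memory, here is one LSTM step in NumPy. The weight layout (`W`, `U`, `b` stacked in the order input, forget, candidate, output) and the function names are assumptions made for this example:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step. W: (4*units, input_dim), U: (4*units, units), b: (4*units,)."""
    units = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b
    i = sigmoid(z[0 * units:1 * units])  # input gate: what to write
    f = sigmoid(z[1 * units:2 * units])  # forget gate: what to keep from memory
    g = np.tanh(z[2 * units:3 * units])  # candidate content
    o = sigmoid(z[3 * units:4 * units])  # output gate: what to expose
    c = f * c_prev + i * g               # new cell state (memory)
    h = o * np.tanh(c)                   # new hidden state / output
    return h, c

rng = np.random.default_rng(0)
input_dim, units = 3, 4
W = rng.normal(size=(4 * units, input_dim))
U = rng.normal(size=(4 * units, units))
b = np.zeros(4 * units)

h, c = lstm_step(rng.normal(size=input_dim), np.zeros(units), np.zeros(units), W, U, b)
print(h.shape, c.shape)
```

The cell state is updated additively (`f * c_prev + i * g`), which is what lets information survive over many steps.
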

##### RNN in Keras
In Keras there is a layer implementing each of these cells. units is the dimension of the state, i.e. how big the encoding of past information is. Then there are parameters typical of recurrent layers: given an input sequence of a certain length, return_state makes the layer return the final state in addition to the output. <br>
###### LSTM in Keras
In an LSTM it is different, since the state is given by the pair [s_h, s_c], where s_h is the hidden state and s_c is the state of the memory (cell state).
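
A small sketch of what these layers return, assuming TensorFlow 2.x with `tf.keras`; the tensor sizes are arbitrary values chosen for the example:

```python
import tensorflow as tf

batch, timesteps, features, units = 2, 5, 3, 4
x = tf.random.normal([batch, timesteps, features])

# SimpleRNN with return_state=True: the output plus a single state tensor.
out, s = tf.keras.layers.SimpleRNN(units, return_state=True)(x)
# out: (batch, units), s: (batch, units)

# LSTM with return_state=True: the output plus the pair (s_h, s_c).
seq_out, s_h, s_c = tf.keras.layers.LSTM(
    units, return_sequences=True, return_state=True)(x)
# seq_out: (batch, timesteps, units), s_h and s_c: (batch, units)
```

With return_sequences=True the layer emits one output per time step, and the last of them coincides with s_h.
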
Another important parameter is stateful: if it is True, the layer does not discard its state when a new batch arrives, and the final state of one batch becomes the initial state of the next one; otherwise the state is re-initialized every time we give a new batch. 
Suppose we want to model the temporal correlation between 3 samples: we pick the 3 samples, give them to the LSTM, and then shift to the next 3. If stateful is False, the new batch starts with s_h and s_c set to 0. If stateful is True, it starts instead from the state reached at time 2, i.e. at the end of the previous 3 samples.
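
The difference between resetting and carrying the state can be illustrated without Keras. The sketch below uses a toy tanh RNN with random weights (the names `run_rnn`, `W_x`, `W_h` are made up for the example) and two consecutive windows of 3 samples:

```python
import numpy as np

def run_rnn(xs, W_x, W_h, h0):
    """Process a window of inputs and return the final state."""
    h = h0
    for x_t in xs:
        h = np.tanh(W_x @ x_t + W_h @ h)
    return h

rng = np.random.default_rng(1)
units, input_dim = 4, 2
W_x = rng.normal(scale=0.5, size=(units, input_dim))
W_h = rng.normal(scale=0.5, size=(units, units))
sequence = rng.normal(size=(6, input_dim))
first, second = sequence[:3], sequence[3:]
zero = np.zeros(units)

# Behaviour analogous to stateful=False: every window starts from a zero state.
h_reset = run_rnn(second, W_x, W_h, zero)

# Behaviour analogous to stateful=True: the second window starts where the first ended.
h_carried = run_rnn(second, W_x, W_h, run_rnn(first, W_x, W_h, zero))

# Carrying the state is equivalent to processing all 6 samples in one go.
h_full = run_rnn(sequence, W_x, W_h, zero)
print(np.allclose(h_carried, h_full))  # True
print(np.abs(h_reset - h_full).max())  # how much information the reset loses
```
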

## Example
Suppose that we have a sentence as input (e.g. "hello") and, given the previous chars, we want to predict the next one.
The samples of our dataset are the chars, and the task is next-char prediction.
Our dataset is a book, and we split it in this way: we start at the beginning, take 3 chars as input and the following char as target, then we move one step ahead and repeat, so: <br>
hel-l <br>
ell-o <br>
and so on.<br>
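
The split can be sketched as a sliding window over the text. `make_windows` is a hypothetical helper written for this illustration:

```python
def make_windows(text, seq_len, step=1):
    """Return (input, target) pairs: seq_len chars and the char that follows them."""
    pairs = []
    for i in range(0, len(text) - seq_len, step):
        pairs.append((text[i:i + seq_len], text[i + seq_len]))
    return pairs

print(make_windows("hello", seq_len=3))
# [('hel', 'l'), ('ell', 'o')]
```
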



In [1]:
import numpy as np
import tensorflow as tf

# Data set up
full_text = "Sono un cordero, cioe? asbfoabb asufb uosbfuabuofb aubfuoasbufb auofbusabfu baofsaubf asobubsu.Sono un cordero, cioe? asbfoabb asufb uosbfuabuofb aubfuoasbufb auofbusabfu baofsaubf asobubsu Sono un cordero, cioe? asbfoabb asufb uosbfuabuofb aubfuoasbufb auofbusabfu baofsaubf asobubsu Sono un cordero, cioe? asbfoabb asufb uosbfuabuofb aubfuoasbufb auofbusabfu baofsaubf asobubsuSono un cordero, cioe? asbfoabb asufb uosbfuabuofb aubfuoasbufb auofbusabfu baofsaubf asobubsu Sono un cordero, cioe? asbfoabb asufb uosbfuabuofb aubfuoasbufb auofbusabfu baofsaubf asobubsu Sono un cordero, cioe? asbfoabb asufb uosbfuabuofb aubfuoasbufb auofbusabfu baofsaubf asobubsu Sono un cordero, cioe? asbfoabb asufb uosbfuabuofb aubfuoasbufb auofbusabfu baofsaubf asobubsu Sono un cordero, cioe? asbfoabb asufb uosbfuabuofb aubfuoasbufb auofbusabfu baofsaubf asobubsu Sono un cordero, cioe? asbfoabb asufb uosbfuabuofb aubfuoasbufb auofbusabfu baofsaubf asobubsu "
vocabulary = sorted(list(set(full_text)))

ctoi = {c:i for i,c in enumerate(vocabulary)}
itoc = {i:c for i,c in enumerate(vocabulary)}

seq_lenght = 100 # how many previous chars we use to predict the next char

# Create dataset
full_text_lenght = len(full_text)
step = 1
X = []
Y = []
for i in range(0, full_text_lenght - (seq_lenght), step):
    sequence = full_text[i: i+seq_lenght]
    target = full_text[i+seq_lenght]
    X.append([ctoi[c] for c in sequence])
    Y.append(ctoi[target])

X = np.array(X)
Y = np.array(Y)

indices = np.arange(len(X)) # shuffle the indices once, so that inputs and targets are permuted in the same way
np.random.shuffle(indices)

X = X[indices]
Y = Y[indices]

num_train = int(0.9 * len(X))
x_train = np.array(X[:num_train])
y_train = np.array(Y[:num_train])
x_valid = np.array(X[num_train:])
y_valid = np.array(Y[num_train:])

def char_encode(x_, y_):
    return tf.one_hot(x_, len(vocabulary)), tf.one_hot(y_, len(vocabulary))

bs = 256
train_dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train))
train_dataset = train_dataset.shuffle(buffer_size=x_train.shape[0])
train_dataset = train_dataset.map(char_encode)
train_dataset = train_dataset.batch(bs)
train_dataset = train_dataset.repeat()

valid_dataset = tf.data.Dataset.from_tensor_slices((x_valid, y_valid))
valid_dataset = valid_dataset.shuffle(buffer_size=x_valid.shape[0])
valid_dataset = valid_dataset.map(char_encode)
valid_dataset = valid_dataset.batch(bs)
valid_dataset = valid_dataset.repeat()

We do not want to feed the integer indices directly: every char should be treated as equally distant from the others, since there is no ordinal relation between them. Each char therefore becomes a vector of length len(vocabulary), with a 1 in the position given by the integer of the char and 0 everywhere else (one-hot encoding).
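
A minimal NumPy illustration of this encoding on a toy vocabulary (`vocab` and `char_to_idx` are names chosen for the example; the notebook itself uses `tf.one_hot`):

```python
import numpy as np

vocab = sorted(set("hello"))                 # ['e', 'h', 'l', 'o']
char_to_idx = {c: i for i, c in enumerate(vocab)}

indices = np.array([char_to_idx[c] for c in "hel"])  # [1, 0, 2]
one_hot = np.eye(len(vocab))[indices]                 # shape (3, 4)
print(one_hot)
```

Every row contains a single 1, so all chars are at the same distance from each other, unlike the raw integers.
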


In [2]:
h_size = 128 # hidden size 
# Going from time t to time t+1, the layer carries a state tensor of dimension h_size.
# This is also the dimension of the output at each time step.
# With an LSTM, it is also the dimension of the cell state.

model = tf.keras.Sequential()
model.add(tf.keras.layers.LSTM(units=h_size, 
                               return_sequences=True,
                               stateful=False,
                               batch_input_shape=[None, seq_lenght, len(vocabulary)]))
model.add(tf.keras.layers.Dropout(0.4))
model.add(tf.keras.layers.LSTM(units=h_size, return_sequences=False, stateful=False))
model.add(tf.keras.layers.Dropout(0.2))
model.add(tf.keras.layers.Dense(activation="softmax", units=len(vocabulary)))

There are two ways of running the recurrence:
- Create an LSTM cell and give it one element at a time: give x(t), produce h, c and o, store them in variables, and when x(t+1) arrives feed them into the cell again. This can be done with a for loop.
- Give all the inputs at once: if return_sequences is False the output is directly the final one; otherwise we get an output for each time step.

Other comments:
- stateful=False: we have shuffled the dataset. In the training loop (for batch in dataset:) consecutive batches contain unrelated sequences, so preserving the state would connect the state at the end of one sequence with the beginning of an unrelated one. We want to reset the state every time. 
- You can stack LSTMs: as in a feed-forward network, we can have multiple layers. The output of the first LSTM can be the input of another LSTM layer. Do not confuse this with unrolling through time. Stacking allows us to implement an encoder: at each LSTM level we can exploit a different feature/abstraction level. We could set a different size for the second LSTM, for example a smaller one to create a bottleneck, as in U-Net. However, since the second LSTM needs the output of the first one at every time step, we must set return_sequences to True in the first layer. In this way the outputs for h, e, l, l and so on go into the input of the second LSTM. 
- At the end of the recurrent layers, the output is a tensor of shape (batch_size, units), where units is the dimension of the state and therefore of the output. This is not yet a prediction of the char: it is an encoding of the information contained in the input. A last step is needed: the Dense layer with softmax maps the encoding to a distribution over the vocabulary.

In [3]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm (LSTM)                  (None, 100, 128)          74752     
_________________________________________________________________
dropout (Dropout)            (None, 100, 128)          0         
_________________________________________________________________
lstm_1 (LSTM)                (None, 128)               131584    
_________________________________________________________________
dropout_1 (Dropout)          (None, 128)               0         
_________________________________________________________________
dense (Dense)                (None, 17)                2193      
Total params: 208,529
Trainable params: 208,529
Non-trainable params: 0
_________________________________________________________________
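
The parameter counts in the summary can be checked by hand. The sketch below assumes the standard LSTM layout (four weight blocks, each with input weights, recurrent weights and a bias) and a vocabulary of 17 chars, as read from the Dense output shape; the helper names are made up:

```python
def lstm_params(input_dim, units):
    # 4 blocks (3 gates + candidate), each with input weights,
    # recurrent weights and a bias vector.
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    return input_dim * units + units

vocab_size, h_size = 17, 128
first = lstm_params(vocab_size, h_size)
second = lstm_params(h_size, h_size)
last = dense_params(h_size, vocab_size)
print(first, second, last, first + second + last)
# 74752 131584 2193 208529
```
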


In [5]:
loss = tf.keras.losses.CategoricalCrossentropy()
opt = tf.keras.optimizers.Adam()
metrics = ["accuracy"]
model.compile(loss=loss, optimizer=opt, metrics=metrics)

# Text generation

In [10]:
generation_len = 100
start_idx = np.random.randint(0, full_text_lenght - seq_lenght)
seed_sentence = full_text[start_idx:start_idx+seq_lenght]

In [11]:
seed_sentence

'rdero, cioe? asbfoabb asufb uosbfuabuofb aubfuoasbufb auofbusabfu baofsaubf asobubsu Sono un cordero'

I give this sentence of 100 chars, and I want to generate another 100 chars. How? By shifting: from the seed sequence I predict the next char, then I shift the window by one position, appending the prediction just made.


In [12]:
# One-hot encode the seed sentence: shape (1, seq_lenght, len(vocabulary))
in_onehot = np.zeros([1, seq_lenght, len(vocabulary)])
for t_idx, c in enumerate(seed_sentence):
    in_onehot[:, t_idx, ctoi[c]] = 1
    
generated_sentence = seed_sentence

for i in range(generation_len):
    preds = model.predict(in_onehot)[0]  # distribution over the vocabulary
    next_char = np.argmax(preds, -1)     # greedy choice: most probable char
    next_char_onehot = np.zeros([1,1,len(vocabulary)])
    next_char_onehot[:, :, next_char] = 1
    # Append the prediction and drop the oldest char, so the window length stays fixed
    in_onehot = np.concatenate([in_onehot, next_char_onehot], axis=1)[:, 1:, :]
    generated_sentence += itoc[int(next_char)]
    


To avoid getting stuck generating the same text over and over: <br>
- With the naive approach the prediction is deterministic: the same input always gives the same probabilities, and we always choose the best one.
- Instead of always keeping the best char, we can keep the ones with a probability over a threshold and randomly sample among them. How much randomness is allowed can be controlled with a "temperature", which rescales the predicted distribution before sampling. The higher the temperature, the higher the probability that the network generates nonsense, since we are allowed to sample from almost the whole vector. The lower the temperature, the more confident the generation. It is a hyper-parameter.
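
One common way to implement temperature, sketched here under the assumption that `probs` is the softmax output of the model, is to divide the log-probabilities by the temperature and renormalize before sampling. The function names are illustrative:

```python
import numpy as np

def apply_temperature(probs, temperature):
    """Rescale a probability vector: T < 1 sharpens it, T > 1 flattens it."""
    logits = np.log(np.asarray(probs, dtype=np.float64) + 1e-12) / temperature
    scaled = np.exp(logits - logits.max())  # subtract the max for numerical stability
    return scaled / scaled.sum()

def sample_index(probs, temperature=1.0, rng=None):
    """Sample one index from the temperature-scaled distribution."""
    rng = np.random.default_rng() if rng is None else rng
    scaled = apply_temperature(probs, temperature)
    return int(rng.choice(len(scaled), p=scaled))

probs = [0.6, 0.3, 0.1]
cold = apply_temperature(probs, 0.2)  # almost all the mass on the best char
hot = apply_temperature(probs, 5.0)   # close to uniform
print(cold.round(3), hot.round(3))
print(sample_index(probs, temperature=1.0, rng=np.random.default_rng(0)))
```

In the generation loop, replacing `np.argmax(preds, -1)` with a call such as `sample_index(preds, temperature)` would turn the greedy choice into a sampled one.
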

# Translation
Given a sentence, I want to encode it into an abstract representation and use that representation to output the translation.
This can be done with a sequence-to-sequence (S2S) model, made of an encoder and a decoder.
One path compresses the information, and the encoded information is then used to decode and predict the translation.
It is very similar to what we have done with chars, but we have to use some kind of embedding, since we want to capture the semantics of the sentence: one-hot encoding does not allow this.
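
To make the difference from one-hot encoding concrete: an embedding can be seen as a learned matrix with one dense row per word, and encoding a word means looking up its row. In this toy sketch the matrix `E` is random (in a real model it is trained) and all sizes are arbitrary:

```python
import numpy as np

vocab_size, embed_dim = 5, 3
rng = np.random.default_rng(0)
E = rng.normal(size=(vocab_size, embed_dim))  # one row per word; learned in practice

word_ids = np.array([2, 0, 4])                # a sentence as integer word indices
embedded = E[word_ids]                        # (3, embed_dim) dense vectors

# Same result as multiplying one-hot vectors by E, without building them.
one_hot = np.eye(vocab_size)[word_ids]
print(np.allclose(embedded, one_hot @ E))     # True
```

Unlike one-hot vectors, the rows of `E` can end up close to each other for words with similar meaning.
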