Language models take sequences of embedded vectors of arbitrary length as input, and the prediction target depends on the history. Neural networks, however, accept inputs of a fixed size, so we need a way to handle context and history of varying size. One way of modeling this is to feed the network only a fixed window of history. This motivates the introduction of encoders and decoders:

Encoder : Sends the input through a smaller-than-necessary layer, forcing the neural network to find a small set of parameters that produce intermediate activations from which the output can be approximated

Decoder : A set of parameters that recovers information from those intermediate activations to produce the output
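
The bottleneck idea can be illustrated without any training. The sketch below is an assumption-laden toy, not the method used later in this notebook: for a purely linear encoder and decoder without bias terms, the best low-dimensional code under squared error is spanned by the top singular vectors of the data, so the weights can be read off an SVD. A neural encoder would instead learn its weights by gradient descent. The data, dimensions, and names `W_enc` and `W_dec` are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 50 samples in 4 dimensions that really only vary along 2 directions.
latent = rng.normal(size=(50, 2))
mixing = rng.normal(size=(2, 4))
X = latent @ mixing

# For a linear encoder/decoder pair, the best 2-dimensional bottleneck
# is spanned by the top-2 right singular vectors of the data.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
W_enc = Vt[:2].T      # encoder weights: 4 -> 2
W_dec = Vt[:2]        # decoder weights: 2 -> 4

code = X @ W_enc      # compressed intermediate activations, shape (50, 2)
X_hat = code @ W_dec  # reconstruction from the code, shape (50, 4)

print(code.shape, X_hat.shape)
print("max reconstruction error:", np.abs(X - X_hat).max())
```

Because the toy data has only two underlying directions, the two-dimensional code loses essentially nothing; with real data the bottleneck discards whatever it cannot fit.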

In [1]:
## read the text file
with open('Europarl-french-v7/europarl-en.txt', 'r') as file:
    lines_en = file.readlines()[:5000]
with open('Europarl-french-v7/europarl-fr.txt', 'r') as file:
    lines_fr = file.readlines()[:5000]
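
The fixed-window approach mentioned in the introduction can be sketched in a few lines. This is a minimal illustration rather than part of the original pipeline: the helper name `fixed_window_pairs`, the window size, and the toy sentence are all assumptions chosen for the example.

```python
def fixed_window_pairs(tokens, window):
    """Return (context, target) pairs where each context holds exactly `window` tokens."""
    pairs = []
    for end in range(window, len(tokens)):
        context = tokens[end - window:end]   # the last `window` tokens of history
        target = tokens[end]                 # the token the model should predict
        pairs.append((context, target))
    return pairs

# Toy sentence (hypothetical); a real pipeline would use tokenizer output instead.
tokens = "we do not want an energy tax".split()
pairs = fixed_window_pairs(tokens, window=3)
for context, target in pairs:
    print(context, "->", target)
```

Every input has the same size, which suits a fixed-size network, but anything older than the window is invisible to the model.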

## RNN and LSTM
RNN structures : handle histories of different lengths by recurrently processing one time slice at a time, carrying a hidden state from each slice to the next
![RNN](RNN.png)
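
As a rough sketch of the recurrence, the following NumPy loop applies the same weights at every time slice, so sequences of different lengths end in a hidden state of the same size. It assumes a tanh nonlinearity and randomly initialized weights; the function name `rnn_forward` and all dimensions are illustrative.

```python
import numpy as np

def rnn_forward(inputs, W_x, W_h, b):
    """Run a simple RNN over a sequence and return the final hidden state."""
    h = np.zeros(W_h.shape[0])          # initial hidden state
    for x in inputs:                    # one time slice at a time
        h = np.tanh(W_x @ x + W_h @ h + b)
    return h

rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 5
W_x = rng.normal(scale=0.5, size=(hidden_dim, input_dim))
W_h = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))
b = np.zeros(hidden_dim)

short_seq = rng.normal(size=(2, input_dim))   # 2 time slices
long_seq = rng.normal(size=(7, input_dim))    # 7 time slices

h_short = rnn_forward(short_seq, W_x, W_h, b)
h_long = rnn_forward(long_seq, W_x, W_h, b)
print(h_short.shape, h_long.shape)
```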

LSTM : instead of replacing the hidden state at each time slice, adds a memory cell whose gates decide which parts to forget and which to memorize
![LSTM](LSTM.png)

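a. forget gate : $f = \sigma(W_{x,f}x+b_{x,f}+W_{h,f}h+b_{h,f})$, output range (0,1)

b. input gate : $i = \sigma(W_{x,i}x+b_{x,i}+W_{h,i}h+b_{h,i})$, output range (0,1)

c. cell memory : $g = \tanh(W_{x,g}x+b_{x,g}+W_{h,g}h+b_{h,g})$, output range (-1,1)

d. update cell state : $c_t = (f \odot c_{t-1})+(i \odot g)$

e. output gate : $o = \sigma(W_{x,o}x+b_{x,o}+W_{h,o}h+b_{h,o})$, output range (0,1)

f. update hidden state : $h_t = o \odot \tanh(c_t)$

Here $x$ is the input at time slice $t$, $h$ is the previous hidden state $h_{t-1}$, and $\odot$ denotes elementwise multiplication.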
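
Steps a to f above can be written out as a single NumPy function. This is a minimal sketch under stated assumptions: the weights are random, the biases are zero, and the names `lstm_step` and `params` are illustrative rather than taken from any library.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time slice following steps a-f; returns the new (h, c)."""
    def affine(gate):
        W_x, b_x, W_h, b_h = params[gate]
        return W_x @ x + b_x + W_h @ h_prev + b_h

    f = sigmoid(affine("f"))      # a. forget gate
    i = sigmoid(affine("i"))      # b. input gate
    g = np.tanh(affine("g"))      # c. candidate cell memory
    c = f * c_prev + i * g        # d. update cell state (elementwise)
    o = sigmoid(affine("o"))      # e. output gate
    h = o * np.tanh(c)            # f. update hidden state
    return h, c

rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 4
params = {
    gate: (
        rng.normal(size=(hidden_dim, input_dim)),   # W_x
        np.zeros(hidden_dim),                       # b_x
        rng.normal(size=(hidden_dim, hidden_dim)),  # W_h
        np.zeros(hidden_dim),                       # b_h
    )
    for gate in "figo"
}

h = np.zeros(hidden_dim)
c = np.zeros(hidden_dim)
for x in rng.normal(size=(6, input_dim)):   # a sequence of 6 time slices
    h, c = lstm_step(x, h, c, params)

print(h.shape, c.shape)
```

The cell state is only ever scaled by the forget gate and added to, never overwritten outright, which is what lets information persist across many time slices.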

In [43]:
import numpy as np
import tensorflow as tf
from transformers import BertTokenizer
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import SimpleRNN, LSTM, Dense, Input
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Create input sequences and labels for training
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
input_sequences = []
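for line in lines_en[:500]:
    # Strip the newline and cap each line at 512 characters before tokenizing.
    token_list = tokenizer.encode(line.replace("\n", "")[:512])
    # encode() wraps the ids in [CLS] ... [SEP]; [:-2] drops [SEP] and the last token.
    input_sequences.append(token_list[:-2])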

max_sequence_length = max(len(seq) for seq in input_sequences)
input_sequences = pad_sequences(input_sequences, maxlen=max_sequence_length, padding='pre')

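# The last token of each padded sequence is the label; everything before it is the input.
X, y = input_sequences[:, :-1], input_sequences[:, -1]
y = tf.keras.utils.to_categorical(y, num_classes=tokenizer.vocab_size)
# Note: this reshape yields (samples, 1, sequence_length), i.e. a single time step
# whose features are the raw token ids, so the recurrent layers below see one step.
X_ = X.reshape((X.shape[0], 1, X.shape[1]))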

In [15]:
X_.shape

(500, 1, 104)

In [17]:
y.shape

(500, 28996)
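
The shapes above follow from pre-padding every sequence to a common length and splitting off the last token as the label. The library-free sketch below mimics that on made-up token ids; the helper `pad_pre` and the ids are illustrative and do not come from BERT's vocabulary.

```python
def pad_pre(seq, maxlen, pad_id=0):
    """Left-pad `seq` with `pad_id` up to `maxlen` (mirrors padding='pre')."""
    return [pad_id] * (maxlen - len(seq)) + list(seq)

sequences = [[101, 7, 8, 9], [101, 5, 6], [101, 4, 3, 2, 1]]  # made-up token ids
maxlen = max(len(s) for s in sequences)
padded = [pad_pre(s, maxlen) for s in sequences]

X = [row[:-1] for row in padded]   # every token except the last
y = [row[-1] for row in padded]    # the last token is the prediction target

for inputs, label in zip(X, y):
    print(inputs, "->", label)
```

Pre-padding keeps a real token in the final column of every row, which is why slicing the last column yields a usable label; with post-padding, shorter sequences would end in the pad id instead.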

In [47]:
# Build and train the SimpleRNN model
model_rnn = Sequential()
model_rnn.add(SimpleRNN(100, input_shape=(1, X_.shape[2])))
model_rnn.add(Dense(tokenizer.vocab_size, activation='softmax'))
model_rnn.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model_rnn.fit(X_, y, epochs=100, verbose=0)

# Build and train the LSTM model
model_lstm = Sequential()
model_lstm.add(LSTM(100, input_shape=(1, X_.shape[2])))
model_lstm.add(Dense(tokenizer.vocab_size, activation='softmax'))
model_lstm.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model_lstm.fit(X_, y, epochs=100, verbose=0)



In [49]:
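# Generate text using the trained models
def generate_text(seed_text, model, max_sequence_len, num_words):
    """Greedily append `num_words` predicted tokens to `seed_text`."""
    for _ in range(num_words):
        # Note: encode() adds [CLS] and [SEP] here, whereas training stripped [SEP].
        token_list = tokenizer.encode(seed_text)
        token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
        # Match the (samples, 1, features) layout used for training.
        token_list = token_list.reshape((token_list.shape[0], 1, token_list.shape[1]))
        # Pick the most likely next token id (greedy decoding).
        predicted = np.argmax(model.predict(token_list), axis=-1)
        output_word = tokenizer.decode(predicted)
        seed_text += " " + output_word
    return seed_text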

# Example of generating text with each model
generated_text_rnn = generate_text("There should be no confusion in this debate. As environmentalists, we do not want an", model_rnn, max_sequence_length, num_words=5)
generated_text_lstm = generate_text("There should be no confusion in this debate. As environmentalists, we do not want an", model_lstm, max_sequence_length, num_words=5)

print("Generated Text (SimpleRNN):", generated_text_rnn)
print("Generated Text (LSTM):", generated_text_lstm)

Generated Text (SimpleRNN): There should be no confusion in this debate. As environmentalists, we do not want an programmes employment them ##ity report
Generated Text (LSTM): There should be no confusion in this debate. As environmentalists, we do not want an safety Europe them citi
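
The `##ity` in the SimpleRNN output is a WordPiece continuation token: BERT's tokenizer marks pieces that continue the previous word with `##`, and decoding one id at a time leaves that marker in place. A possible post-processing step is sketched below; `merge_wordpieces` is a hypothetical helper, not a function from `transformers`, and the pieces are copied from the output above.

```python
def merge_wordpieces(pieces):
    """Join WordPiece tokens, attaching '##' continuation pieces to the previous word."""
    words = []
    for piece in pieces:
        if piece.startswith("##") and words:
            words[-1] += piece[2:]      # continuation: glue onto the previous word
        else:
            words.append(piece)
    return " ".join(words)

pieces = ["programmes", "employment", "them", "##ity", "report"]
merged = merge_wordpieces(pieces)
print(merged)
```

An alternative, under the same assumptions, would be to collect the predicted ids in a list and decode them in one call at the end rather than token by token.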

## Sequence to Sequence models
When dealing with translation tasks, the input and output look like

$input_i = \text{SOS}\, x_{i,1}\, x_{i,2} \dots x_{i,n_i}\, \text{EOS}$

$output_i = \text{SOS}\, y_{i,1}\, y_{i,2} \dots y_{i,m_i}\, \text{EOS}$

where $n_i$ and $m_i$ are the input and output lengths, which generally differ. There is no one-to-one mapping between input and output tokens, the output can be of arbitrary length, and the entire input context is needed for translation.

A sequence-to-sequence model structure is therefore used: it encodes all the words until EOS is reached, then decodes the output words one by one, and after all words are decoded a final layer puts the results together
![s2s](s2s.png)
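
During training, the decoder in a model like this is typically fed the correct previous token and asked to predict the next one, so its input and target are the same sequence offset by one step. The character-level sketch below shows only that offset; `teacher_forcing_pair` is an illustrative name, and the tab and newline markers are assumed to play the roles of SOS and EOS.

```python
SOS, EOS = "\t", "\n"   # assumed start and end markers

def teacher_forcing_pair(target_sentence):
    """Return (decoder_input, decoder_target), offset by one step."""
    tokens = [SOS] + list(target_sentence) + [EOS]
    decoder_input = tokens[:-1]    # what the decoder is fed at each step
    decoder_target = tokens[1:]    # what it should predict at that step
    return decoder_input, decoder_target

dec_in, dec_out = teacher_forcing_pair("oui")
for fed, expected in zip(dec_in, dec_out):
    print(repr(fed), "->", repr(expected))
```

At inference time the correct previous token is not available, so the decoder's own prediction is fed back in instead.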

In [19]:
import numpy as np
import keras
import os
from pathlib import Path

In [35]:
batch_size = 64  # Batch size for training.
epochs = 16  # Number of epochs to train for.
latent_dim = 256  # Latent dimensionality of the encoding space.
num_samples = 1000  # Number of samples to train on.

In [37]:
# Vectorize the data.
input_texts = []
target_texts = []
input_characters = set()
target_characters = set()

for i in range(num_samples):
    input_text = lines_en[i]
    target_text = lines_fr[i]
    target_text = "\t" + target_text + "\n"
    input_texts.append(input_text)
    target_texts.append(target_text)
    for char in input_text:
        if char not in input_characters:
            input_characters.add(char)
    for char in target_text:
        if char not in target_characters:
            target_characters.add(char)

input_characters = sorted(list(input_characters))
target_characters = sorted(list(target_characters))
num_encoder_tokens = len(input_characters)
num_decoder_tokens = len(target_characters)
max_encoder_seq_length = max([len(txt) for txt in input_texts])
max_decoder_seq_length = max([len(txt) for txt in target_texts])

print("Number of samples:", len(input_texts))
print("Number of unique input tokens:", num_encoder_tokens)
print("Number of unique output tokens:", num_decoder_tokens)
print("Max sequence length for inputs:", max_encoder_seq_length)
print("Max sequence length for outputs:", max_decoder_seq_length)

Number of samples: 1000
Number of unique input tokens: 86
Number of unique output tokens: 99
Max sequence length for inputs: 683
Max sequence length for outputs: 877


In [39]:
input_token_index = dict([(char, i) for i, char in enumerate(input_characters)])
target_token_index = dict([(char, i) for i, char in enumerate(target_characters)])

encoder_input_data = np.zeros(
    (len(input_texts), max_encoder_seq_length, num_encoder_tokens),
    dtype="float32",
)
decoder_input_data = np.zeros(
    (len(input_texts), max_decoder_seq_length, num_decoder_tokens),
    dtype="float32",
)
decoder_target_data = np.zeros(
    (len(input_texts), max_decoder_seq_length, num_decoder_tokens),
    dtype="float32",
)

for i, (input_text, target_text) in enumerate(zip(input_texts, target_texts)):
    for t, char in enumerate(input_text):
        encoder_input_data[i, t, input_token_index[char]] = 1.0
    encoder_input_data[i, t + 1 :, input_token_index[" "]] = 1.0
    for t, char in enumerate(target_text):
        # decoder_target_data is ahead of decoder_input_data by one timestep
        decoder_input_data[i, t, target_token_index[char]] = 1.0
        if t > 0:
            # decoder_target_data will be ahead by one timestep
            # and will not include the start character.
            decoder_target_data[i, t - 1, target_token_index[char]] = 1.0
    decoder_input_data[i, t + 1 :, target_token_index[" "]] = 1.0
    decoder_target_data[i, t:, target_token_index[" "]] = 1.0
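
Dense one-hot arrays of this shape grow quickly with sequence length. As a back-of-the-envelope check, the sketch below computes their size from the sample count, maximum lengths, and vocabulary sizes printed earlier, assuming 4 bytes per `float32` value; `one_hot_bytes` is an illustrative helper.

```python
def one_hot_bytes(num_samples, seq_len, vocab_size, bytes_per_value=4):
    """Size in bytes of a dense float32 one-hot array of the given shape."""
    return num_samples * seq_len * vocab_size * bytes_per_value

encoder_bytes = one_hot_bytes(1000, 683, 86)   # encoder_input_data
decoder_bytes = one_hot_bytes(1000, 877, 99)   # decoder_input_data or decoder_target_data

total = encoder_bytes + 2 * decoder_bytes
print(f"encoder: {encoder_bytes / 1e6:.0f} MB")
print(f"each decoder array: {decoder_bytes / 1e6:.0f} MB")
print(f"total: {total / 1e6:.0f} MB")
```

That is roughly 930 MB for 1000 sentence pairs, most of it padding. Storing integer token indices and using a sparse loss such as Keras's `sparse_categorical_crossentropy` would be one way to reduce this, though that changes the model inputs and is not what this notebook does.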

In [41]:
# Define an input sequence and process it.
encoder_inputs = keras.Input(shape=(None, num_encoder_tokens))
encoder = keras.layers.LSTM(latent_dim, return_state=True)
encoder_outputs, state_h, state_c = encoder(encoder_inputs)

# We discard `encoder_outputs` and only keep the states.
encoder_states = [state_h, state_c]

# Set up the decoder, using `encoder_states` as initial state.
decoder_inputs = keras.Input(shape=(None, num_decoder_tokens))

# We set up our decoder to return full output sequences,
# and to return internal states as well. We don't use the
# return states in the training model, but we will use them in inference.
decoder_lstm = keras.layers.LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=encoder_states)
decoder_dense = keras.layers.Dense(num_decoder_tokens, activation="softmax")
decoder_outputs = decoder_dense(decoder_outputs)

# Define the model that will turn
# `encoder_input_data` & `decoder_input_data` into `decoder_target_data`
model = keras.Model([encoder_inputs, decoder_inputs], decoder_outputs)

In [43]:
model.compile(
    optimizer="rmsprop", loss="categorical_crossentropy", metrics=["accuracy"]
)
model.fit(
    [encoder_input_data, decoder_input_data],
    decoder_target_data,
    batch_size=batch_size,
    epochs=epochs,
    validation_split=0.2,
)
# Save model
# model.save("s2s_model.keras")

Epoch 1/16
13/13 ━━━━━━━━━━━━━━━━━━━━ 31s 2s/step - accuracy: 0.6275 - loss: 2.6168 - val_accuracy: 0.8410 - val_loss: 0.9426
Epoch 2/16
13/13 ━━━━━━━━━━━━━━━━━━━━ 30s 2s/step - accuracy: 0.8151 - loss: 1.0525 - val_accuracy: 0.8409 - val_loss: 0.9267
Epoch 3/16
13/13 ━━━━━━━━━━━━━━━━━━━━ 31s 2s/step - accuracy: 0.8183 - loss: 1.0300 - val_accuracy: 0.8409 - val_loss: 0.8873
Epoch 4/16
13/13 ━━━━━━━━━━━━━━━━━━━━ 31s 2s/step - accuracy: 0.8169 - loss: 1.0070 - val_accuracy: 0.8409 - val_loss: 0.8595
Epoch 5/16
13/13 ━━━━━━━━━━━━━━━━━━━━ 31s 2s/step - accuracy: 0.8182 - loss: 0.9703 - val_accuracy: 0.8405 - val_loss: 0.9735
Epoch 6/16
13/13 ━━━━━━━━━━━━━━━━━━━━ 31s 2s/step - accuracy: 0.8213 - loss: 0.9795 - val_accuracy: 0.8409 - val_loss: 0.7349
Epoch 7/16
13/13 ━━━━━━━━━━


In [61]:
encoder_inputs = model.input[0]  # input_1
encoder_outputs, state_h_enc, state_c_enc = model.layers[2].output  # lstm_1
encoder_states = [state_h_enc, state_c_enc]
encoder_model = keras.Model(encoder_inputs, encoder_states)

decoder_inputs = model.input[1]  # input_2
decoder_state_input_h = keras.Input(shape=(latent_dim,))
decoder_state_input_c = keras.Input(shape=(latent_dim,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]
decoder_lstm = model.layers[3]
decoder_outputs, state_h_dec, state_c_dec = decoder_lstm(
    decoder_inputs, initial_state=decoder_states_inputs
)
decoder_states = [state_h_dec, state_c_dec]
decoder_dense = model.layers[4]
decoder_outputs = decoder_dense(decoder_outputs)
decoder_model = keras.Model(
    [decoder_inputs] + decoder_states_inputs, [decoder_outputs] + decoder_states
)

## translation with this model
reverse_input_char_index = dict((i, char) for char, i in input_token_index.items())
reverse_target_char_index = dict((i, char) for char, i in target_token_index.items())


def decode_sequence(input_seq):
    # Encode the input as state vectors.
    states_value = encoder_model.predict(input_seq, verbose=0)
    
    # Generate empty target sequence of length 1.
    target_seq = np.zeros((1, 1, num_decoder_tokens))
    # Populate the first character of target sequence with the start character.
    target_seq[0, 0, target_token_index["\t"]] = 1.0

    # Sampling loop for a batch of sequences
    # (to simplify, here we assume a batch of size 1).
    stop_condition = False
    decoded_sentence = ""
    while not stop_condition:
        output_tokens, h, c = decoder_model.predict(
            [target_seq] + states_value, verbose=0
        )

        # Sample a token
        sampled_token_index = np.argmax(output_tokens[0, -1, :])
        sampled_char = reverse_target_char_index[sampled_token_index]
        decoded_sentence += sampled_char

        # Exit condition: either hit max length
        # or find stop character.
        if sampled_char == "\n" or len(decoded_sentence) > max_decoder_seq_length:
            stop_condition = True

        # Update the target sequence (of length 1).
        target_seq = np.zeros((1, 1, num_decoder_tokens))
        target_seq[0, 0, sampled_token_index] = 1.0

        # Update states
        states_value = [h, c]
    return decoded_sentence


# Take one sequence 
# for trying out decoding.
input_seq = encoder_input_data[num_samples - 1:num_samples]
decoded_sentence = decode_sequence(input_seq)
print("-")
print("Input sentence:", input_texts[num_samples - 1])
print("Decoded sentence:", decoded_sentence)

-
Input sentence: Are state aid to business or inter-company agreements legitimate in a market economy, and who must supervise these exceptions to the absolute rules of the market economy?

Decoded sentence: eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee