# Lab 10 - Recurrent Networks 2
## Text vectorization

Text data can be converted to numeric tensors with the Keras `TextVectorization` layer (`tf.keras.layers.TextVectorization`).

### "Bag-of-words" models
- Binary encoding: the text is encoded as a binary vector in which a 1 at the i-th position means that the text contains the i-th word of the generated vocabulary (`output_mode="multi_hot"`).
- Count encoding: the i-th position contains the number of occurrences of the i-th word in the text (`output_mode="count"`).
- N-grams (sequences of n consecutive words) can be encoded as well; for example, `ngrams=2` adds bigrams alongside single words.
- Counts can be reweighted using TF-IDF (`output_mode="tf_idf"`).
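
The binary, count, and n-gram encodings listed above can be illustrated without Keras. The following is a minimal pure-Python sketch, not the `TextVectorization` implementation: the helper names (`tokenize`, `add_bigrams`, `build_vocabulary`, `encode`) and the toy corpus are hypothetical, tokenization is reduced to lowercasing and whitespace splitting, and index 0 is assumed to be reserved for out-of-vocabulary words. TF-IDF weighting is left out.

```python
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a simplification of real standardization)."""
    return text.lower().split()


def add_bigrams(tokens):
    """Return the single words followed by all pairs of adjacent words."""
    return tokens + [" ".join(pair) for pair in zip(tokens, tokens[1:])]


def build_vocabulary(texts, max_tokens, use_bigrams=False):
    """Keep the most frequent terms; index 0 is reserved for unknown terms (assumption)."""
    counts = Counter()
    for text in texts:
        tokens = tokenize(text)
        counts.update(add_bigrams(tokens) if use_bigrams else tokens)
    # Sort by descending frequency, then alphabetically so the order is deterministic.
    ranked = sorted(counts, key=lambda term: (-counts[term], term))
    return ["[UNK]"] + ranked[: max_tokens - 1]


def encode(text, vocabulary, mode="multi_hot", use_bigrams=False):
    """Encode one text as a fixed-length vector over the vocabulary."""
    index = {term: i for i, term in enumerate(vocabulary)}
    tokens = tokenize(text)
    if use_bigrams:
        tokens = add_bigrams(tokens)
    vector = [0] * len(vocabulary)
    for token in tokens:
        i = index.get(token, 0)  # unknown terms fall into slot 0
        if mode == "multi_hot":
            vector[i] = 1        # presence only
        else:
            vector[i] += 1       # number of occurrences
    return vector


corpus = ["a good movie", "a bad movie", "a good good cast"]
vocab = build_vocabulary(corpus, max_tokens=6)
print(vocab)  # ['[UNK]', 'a', 'good', 'movie', 'bad', 'cast']

print(encode("a good good film", vocab, mode="multi_hot"))  # [1, 1, 1, 0, 0, 0]
print(encode("a good good film", vocab, mode="count"))      # [1, 1, 2, 0, 0, 0]

# With bigrams, word pairs such as "a good" become vocabulary entries too.
bigram_vocab = build_vocabulary(corpus, max_tokens=20, use_bigrams=True)
print(encode("a good movie", bigram_vocab, mode="count", use_bigrams=True))
```

Note that both encodings discard word order: "a good good film" and "good film a good" produce the same vector unless n-grams are used.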

### Sequence models

- Texts can be encoded as sequences of integers (`output_mode="int"`).
- The numbers in each sequence can be vectorized using one-hot encoding or with an `Embedding` layer.
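
To make the sequence representation concrete, here is a small pure-Python sketch of integer encoding with padding and truncation, followed by the two ways of turning the integers into vectors. It is an illustration under stated assumptions rather than the Keras implementation: the function names, the toy vocabulary, and the embedding values are made up, index 0 is assumed to be the padding index, and index 1 the out-of-vocabulary index.

```python
def encode_sequence(text, vocabulary, max_len):
    """Map each word to its vocabulary index, then truncate or pad to max_len."""
    index = {term: i for i, term in enumerate(vocabulary)}
    ids = [index.get(token, 1) for token in text.lower().split()]  # 1 = unknown word
    ids = ids[:max_len]                         # truncate long texts
    return ids + [0] * (max_len - len(ids))     # pad short texts with 0


def one_hot(ids, vocab_size):
    """Replace every index by a vector with a single 1 at that position."""
    return [[1 if j == i else 0 for j in range(vocab_size)] for i in ids]


# Toy vocabulary (assumption): slot 0 = padding, slot 1 = unknown word.
vocabulary = ["", "[UNK]", "a", "good", "movie"]

seq = encode_sequence("A good film", vocabulary, max_len=5)
print(seq)  # [2, 3, 1, 0, 0]

# One-hot encoding: each position becomes a sparse vector of length vocab_size.
print(one_hot(seq, len(vocabulary))[0])  # [0, 0, 1, 0, 0]

# Embedding: each index selects a row of a (normally trainable) table.
# The values below are invented for illustration only.
embedding_table = [
    [0.0, 0.0],    # padding
    [0.1, 0.1],    # unknown word
    [0.5, -0.2],   # "a"
    [0.9, 0.3],    # "good"
    [-0.4, 0.7],   # "movie"
]
embedded = [embedding_table[i] for i in seq]
print(embedded[0])  # [0.5, -0.2]
```

One-hot vectors grow with the vocabulary size, whereas an embedding keeps a short dense vector per word whose values are learned during training, which is why an `Embedding` layer is the usual choice for larger vocabularies.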

## Tasks

1. The `aclImdb` dataset contains examples of positive (`pos`) and negative (`neg`) movie reviews. Load the dataset using the `tf.keras.utils.text_dataset_from_directory()` function.
2. Create a `TextVectorization` layer with `output_mode='multi_hot'`/`'count'`/`'tf_idf'` and generate the vocabulary using the `adapt()` method.
3. Create a dense network to classify the reviews. Insert the vectorization layer directly after the input layer.
4. Create a `TextVectorization` layer with `output_mode='int'` and generate the vocabulary using the `adapt()` method.
5. Create an LSTM network to classify the reviews. Insert the vectorization layer directly after the input layer, followed by an `Embedding` layer.
6. Compare several recurrent networks with different architectures. You can use GRU layers instead of LSTM layers. Recurrent layers can be regularized by setting the `recurrent_dropout` parameter. Plot the learning curves for the training and validation data.

In [None]:
import tensorflow as tf

batch_size = 32

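# Assumes the archive is unpacked into ./aclImdb and that the unlabeled
# aclImdb/train/unsup folder has been removed; otherwise it would be
# picked up as a third class.
train_data = tf.keras.utils.text_dataset_from_directory(
    "aclImdb/train", batch_size=batch_size
)

# The test split serves as validation data in this lab.
valid_data = tf.keras.utils.text_dataset_from_directory(
    "aclImdb/test", batch_size=batch_size
)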

### Dense network (example)

In [None]:
max_tokens = 5000  # vocabulary size limit

text_vectorization = tf.keras.layers.TextVectorization(output_mode="multi_hot", max_tokens=max_tokens)
# Build the vocabulary from the training texts only (labels are dropped)
text_vectorization.adapt(train_data.map(lambda x, y: x))

In [None]:
model = tf.keras.models.Sequential([
    tf.keras.layers.Input(shape=(1,), dtype="string"),
    text_vectorization,
    # Add your layers here
])

### Recurrent network (example)

In [None]:
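max_tokens = 5000  # vocabulary size limit
max_len = 600      # every review is truncated or padded to this many tokens

text_vectorization = tf.keras.layers.TextVectorization(output_mode="int", max_tokens=max_tokens, output_sequence_length=max_len)
# Build the vocabulary from the training texts only (labels are dropped)
text_vectorization.adapt(train_data.map(lambda x, y: x))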

In [None]:
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(1,), dtype="string"),
    text_vectorization,
    tf.keras.layers.Embedding(input_dim=max_tokens, output_dim=128),
    # Add your layers here
])