<a href="https://colab.research.google.com/github/rahiakela/transformers-research-and-practice/blob/main/huggingface-transformers/huggingface-course-event/huggingface_talk_nlp_workflows_with_keras.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## NLP workflows with Keras

This colab shows several ways to build simple NLP models with Keras, as a companion to the Hugging Face talk on NLP workflows with Keras. We will build four very simple NLP models and see how they compare.

<img src='https://i.imgur.com/1vD2az8.png?raw=1' width='800'/>

**Reference**:

https://www.youtube.com/watch?v=gZIP-_2XYMM

https://huggingface.co/course/event/1?fw=pt



## Setup


To start, let's download the dataset.

In [None]:
!curl -O https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar -xf aclImdb_v1.tar.gz
!rm -r aclImdb/train/unsup  # Delete unlabeled data.

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 80.2M  100 80.2M    0     0  25.7M      0  0:00:03  0:00:03 --:--:-- 25.7M


We will convert our text file inputs into batched datasets of size 32, and make a small utility to efficiently apply and cache our preprocessing with `tf.data`. This allows us to avoid recomputing our preprocessed dataset each epoch.

You can learn more about `tf.data` [here](https://www.tensorflow.org/guide/data) and about Keras preprocessing layers [here](https://www.tensorflow.org/guide/keras/preprocessing_layers).

In [None]:
import tensorflow as tf

train_ds = tf.keras.utils.text_dataset_from_directory(
    "aclImdb/train", batch_size=32)
test_ds = tf.keras.utils.text_dataset_from_directory(
    "aclImdb/test", batch_size=32)

def apply_preprocessing(ds, preprocessing_model):
  ds = ds.map(
      lambda x, y: (preprocessing_model(x), y),
      num_parallel_calls=tf.data.AUTOTUNE)
  # Cache and prefetch the data.
  return ds.cache().prefetch(tf.data.AUTOTUNE)

Found 25000 files belonging to 2 classes.
Found 25000 files belonging to 2 classes.
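
As a quick sanity check on the batching, the arithmetic below works out how many batches one pass over each split should produce. It is only a sketch: it assumes the final, partially filled batch is kept rather than dropped, and the variable names are illustrative.

```python
import math

num_examples = 25000  # files reported per split in the output above
batch_size = 32       # the batch_size passed to text_dataset_from_directory

# Number of completely filled batches, and how many examples are left over.
full_batches, remainder = divmod(num_examples, batch_size)

# Assumption: the leftover examples form one final, smaller batch.
steps_per_epoch = math.ceil(num_examples / batch_size)

print(full_batches, remainder, steps_per_epoch)  # 781 8 782
```

So each epoch should run 782 steps, the last of which holds only 8 reviews.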


## Take 1: Unigram model

Our first attempt at a model is a simple unigram model, also called a bag of words model. We will build up a vocabulary of the 10K most popular words in our dataset, and we will only track which words from our vocabulary are present in each review.

We will multi-hot encode a simple unigram representation of our inputs using the [TextVectorization](https://www.tensorflow.org/api_docs/python/tf/keras/layers/TextVectorization) layer, and build a simple logistic regression over the outputs with the [Dense](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dense) layer.
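
To make "multi-hot" concrete, here is a minimal plain-Python sketch of the encoding. It is not how `TextVectorization` is implemented: the `tokenize` and `multi_hot` helpers and the five-word vocabulary are made up for illustration, and the tokenizer is only a rough stand-in for the layer's default lowercasing and punctuation stripping.

```python
import re

def tokenize(text):
    # Rough stand-in for the default standardization:
    # lowercase, drop punctuation, split on whitespace.
    return re.sub(r"[^\w\s]", "", text.lower()).split()

def multi_hot(text, vocabulary):
    # One slot per vocabulary entry: 1 if the word occurs at all, else 0.
    index = {word: i for i, word in enumerate(vocabulary)}
    vector = [0] * len(vocabulary)
    for token in tokenize(text):
        # Words outside the vocabulary share the "[UNK]" slot at index 0.
        vector[index.get(token, 0)] = 1
    return vector

# Hypothetical tiny vocabulary; the real one is learned by adapt().
vocabulary = ["[UNK]", "movie", "great", "terrible", "the"]
print(multi_hot("The movie was great, great, great!", vocabulary))
# [1, 1, 1, 0, 1]
```

Note that "great" appearing three times still yields a single 1: the representation records presence, not counts or word order.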

In [None]:
inputs = tf.keras.Input(shape=(1,), dtype='string')

# Preprocess inputs.
features = train_ds.map(lambda x, y: x)
text_vectorizer = tf.keras.layers.TextVectorization(
    output_mode='multi_hot', max_tokens=10000)
text_vectorizer.adapt(features)
preprocessed_inputs = text_vectorizer(inputs)

# Apply model layers.
outputs = tf.keras.layers.Dense(1, activation='sigmoid')(preprocessed_inputs)

# Split the preprocessing into a separate model to apply it with tf.data.
unigram_preprocessing = tf.keras.Model(inputs, preprocessed_inputs)
unigram_model = tf.keras.Model(preprocessed_inputs, outputs)

# Apply preprocessing asynchronously with tf.data.
preprocessed_train_ds = apply_preprocessing(train_ds, unigram_preprocessing)
preprocessed_test_ds = apply_preprocessing(test_ds, unigram_preprocessing)

# Train the model.
unigram_model.compile(loss='binary_crossentropy', metrics=['accuracy'])
unigram_model.summary()
unigram_model.fit(
    preprocessed_train_ds, validation_data=preprocessed_test_ds, epochs=5)

Model: "model_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_2 (InputLayer)        [(None, 10000)]           0         
                                                                 
 dense (Dense)               (None, 1)                 10001     
                                                                 
Total params: 10,001
Trainable params: 10,001
Non-trainable params: 0
_________________________________________________________________
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7f2170136710>

## Take 2: Bigram model.

As a next step, we can try using a little more of the sequence data available to us by also considering bigrams, that is, pairs of adjacent words. Passing `ngrams=2` to the `TextVectorization` layer will have it build a vocabulary of the most common words as well as pairs of words in the dataset. We can then train a logistic regression exactly as before.

Because our new vocabulary space (all words and pairs of words) is much bigger, we will double our learned vocabulary size.
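
As a rough illustration of the candidate tokens that `ngrams=2` gives the vocabulary to choose from, the sketch below lists every word plus every pair of adjacent words for one short phrase. The helper name `unigrams_and_bigrams` is made up for this example.

```python
def unigrams_and_bigrams(tokens):
    # Every single token, followed by every pair of adjacent tokens.
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams

tokens = "not a good movie".split()
print(unigrams_and_bigrams(tokens))
# ['not', 'a', 'good', 'movie', 'not a', 'a good', 'good movie']
```

A phrase of four words already produces seven candidates, which is why the space of possible vocabulary entries grows so quickly once pairs are included.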

In [None]:
inputs = tf.keras.Input(shape=(1,), dtype='string')

# Preprocess inputs.
features = train_ds.map(lambda x, y: x)
text_vectorizer = tf.keras.layers.TextVectorization(
    output_mode='multi_hot', max_tokens=20000, ngrams=2)
text_vectorizer.adapt(features)
preprocessed_inputs = text_vectorizer(inputs)

# Apply model layers.
outputs = tf.keras.layers.Dense(1, activation='sigmoid')(preprocessed_inputs)

# Split the preprocessing into a separate model to apply it with tf.data.
bigram_preprocessing = tf.keras.Model(inputs, preprocessed_inputs)
bigram_model = tf.keras.Model(preprocessed_inputs, outputs)

# Apply preprocessing asynchronously with tf.data.
preprocessed_train_ds = apply_preprocessing(train_ds, bigram_preprocessing)
preprocessed_test_ds = apply_preprocessing(test_ds, bigram_preprocessing)

# Train the model.
bigram_model.compile(loss='binary_crossentropy', metrics=['accuracy'])
bigram_model.summary()
bigram_model.fit(
    preprocessed_train_ds, validation_data=preprocessed_test_ds, epochs=5)

Model: "model_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_4 (InputLayer)        [(None, 20000)]           0         
                                                                 
 dense_1 (Dense)             (None, 1)                 20001     
                                                                 
Total params: 20,001
Trainable params: 20,001
Non-trainable params: 0
_________________________________________________________________
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7f20eeee5110>

## Take 3: Embedding and pooling.

Let's now switch up our preprocessed text representation to preserve the sequence of words in each review. By setting `output_mode='int'` on the `TextVectorization` layer (the default), the layer will output an integer index for each word of the input, corresponding to the index of the word in our learned vocabulary.
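
The integer representation can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the layer's implementation: the `encode` helper and the tiny vocabulary are hypothetical, with index 0 reserved for padding and index 1 for out-of-vocabulary words, and sequences padded or truncated to a fixed length.

```python
def encode(tokens, vocabulary, sequence_length):
    # Map each token to its vocabulary index; unknown words get index 1.
    index = {word: i for i, word in enumerate(vocabulary)}
    ids = [index.get(token, 1) for token in tokens]
    # Truncate long sequences, then pad short ones with the padding id 0.
    ids = ids[:sequence_length]
    return ids + [0] * (sequence_length - len(ids))

# Hypothetical vocabulary: "" is the padding entry, "[UNK]" the unknown entry.
vocabulary = ["", "[UNK]", "the", "movie", "great"]
tokens = "the movie was great".split()

print(encode(tokens, vocabulary, 6))  # [2, 3, 1, 4, 0, 0]
print(encode(tokens, vocabulary, 3))  # [2, 3, 1]
```

Unlike the multi-hot encoding, this output keeps the order of the words and repeats an index whenever a word repeats.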

We can use this representation to train an [Embedding](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding) for each word. A very simple approach we can try with our embedded words is to simply average them all together, which we can do with the [GlobalAveragePooling1D](https://www.tensorflow.org/api_docs/python/tf/keras/layers/GlobalAveragePooling1D) layer. To avoid overfitting, we will use two [Dropout](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dropout) layers.
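
The averaging step can be sketched in plain Python. The example assumes that padded positions (token id 0) are excluded from the average, which is the effect of masking; the `masked_average` helper and the two-dimensional embedding values are invented for illustration.

```python
def masked_average(embedded, token_ids):
    # Keep only the vectors whose token id is not the padding id 0.
    kept = [vec for vec, tid in zip(embedded, token_ids) if tid != 0]
    if not kept:
        # Guard for an all-padding sequence (a choice made for this sketch).
        return [0.0] * len(embedded[0])
    # Average each embedding dimension over the kept positions.
    return [sum(column) / len(kept) for column in zip(*kept)]

token_ids = [2, 3, 0, 0]
embedded = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0], [9.0, 9.0]]  # made-up values
print(masked_average(embedded, token_ids))  # [2.0, 3.0]
```

Whatever the review length, the result is a single vector with one value per embedding dimension, which is what the final `Dense` layer consumes.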

In [None]:
inputs = tf.keras.Input(shape=(1,), dtype='string')

# Preprocess inputs.
features = train_ds.map(lambda x, y: x)
text_vectorizer = tf.keras.layers.TextVectorization(
    output_mode='int', max_tokens=10000, output_sequence_length=250)
text_vectorizer.adapt(features)
preprocessed_inputs = text_vectorizer(inputs)

# Apply model layers.
embedding = tf.keras.layers.Embedding(
    text_vectorizer.vocabulary_size(), 32, mask_zero=True)
x = embedding(preprocessed_inputs)
x = tf.keras.layers.Dropout(0.2)(x)
x = tf.keras.layers.GlobalAveragePooling1D()(x)
x = tf.keras.layers.Dropout(0.2)(x)
outputs = tf.keras.layers.Dense(1, activation='sigmoid')(x)

# Split the preprocessing into a separate model to apply it with tf.data.
pooling_preprocessing = tf.keras.Model(inputs, preprocessed_inputs)
pooling_model = tf.keras.Model(preprocessed_inputs, outputs)

# Apply preprocessing asynchronously with tf.data.
preprocessed_train_ds = apply_preprocessing(train_ds, pooling_preprocessing)
preprocessed_test_ds = apply_preprocessing(test_ds, pooling_preprocessing)

# Train the model.
pooling_model.compile(loss='binary_crossentropy', metrics=['accuracy'])
pooling_model.summary()
pooling_model.fit(
    preprocessed_train_ds, validation_data=preprocessed_test_ds, epochs=5)

Model: "model_5"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_6 (InputLayer)        [(None, 250)]             0         
                                                                 
 embedding (Embedding)       (None, 250, 32)           320000    
                                                                 
 dropout (Dropout)           (None, 250, 32)           0         
                                                                 
 global_average_pooling1d (G  (None, 32)               0         
 lobalAveragePooling1D)                                          
                                                                 
 dropout_1 (Dropout)         (None, 32)                0         
                                                                 
 dense_2 (Dense)             (None, 1)                 33        
                                                           

<keras.callbacks.History at 0x7f2170629890>

## Take 4: LSTM.

The last thing we will try is learning an [LSTM](https://www.tensorflow.org/api_docs/python/tf/keras/layers/LSTM) layer over our embedded words. This is almost a line-for-line equivalent of our pooling model above, but instead of the pooling layer, we swap in an LSTM. This way we can try to allow our model to learn from the order in which our review words appeared.
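
To see why a recurrent layer can use word order while averaging cannot, consider the toy comparison below. The recurrence is deliberately simplistic and is not an LSTM; both helper functions and the decay value of 0.5 are made up for this illustration.

```python
def mean_pool(values):
    # Averaging ignores order: any permutation gives the same result.
    return sum(values) / len(values)

def toy_recurrence(values, decay=0.5):
    # Toy stand-in for a recurrent layer (not an LSTM): the state is
    # updated one step at a time, so later inputs weigh more than earlier ones.
    state = 0.0
    for value in values:
        state = decay * state + value
    return state

forward = [1.0, 2.0, 3.0]
backward = [3.0, 2.0, 1.0]

print(mean_pool(forward), mean_pool(backward))            # 2.0 2.0
print(toy_recurrence(forward), toy_recurrence(backward))  # 4.25 2.75
```

The same three inputs in a different order produce the same average but a different recurrent state, which is the extra signal a sequence model has access to.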

In [None]:
inputs = tf.keras.Input(shape=(1,), dtype='string')

# Preprocess inputs.
features = train_ds.map(lambda x, y: x)
text_vectorizer = tf.keras.layers.TextVectorization(
    output_mode='int', max_tokens=10000, output_sequence_length=250)
text_vectorizer.adapt(features)
preprocessed_inputs = text_vectorizer(inputs)

# Apply model layers.
embedding = tf.keras.layers.Embedding(
    text_vectorizer.vocabulary_size(), 32, mask_zero=True)
x = embedding(preprocessed_inputs)
x = tf.keras.layers.Dropout(0.2)(x)
x = tf.keras.layers.LSTM(32)(x)
x = tf.keras.layers.Dropout(0.2)(x)
outputs = tf.keras.layers.Dense(1, activation='sigmoid')(x)

# Split the preprocessing into a separate model to apply it with tf.data.
lstm_preprocessing = tf.keras.Model(inputs, preprocessed_inputs)
lstm_model = tf.keras.Model(preprocessed_inputs, outputs)

# Apply preprocessing asynchronously with tf.data.
preprocessed_train_ds = apply_preprocessing(train_ds, lstm_preprocessing)
preprocessed_test_ds = apply_preprocessing(test_ds, lstm_preprocessing)

# Train the model.
lstm_model.compile(loss='binary_crossentropy', metrics=['accuracy'])
lstm_model.summary()
lstm_model.fit(
    preprocessed_train_ds, validation_data=preprocessed_test_ds, epochs=5)

Model: "model_7"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_8 (InputLayer)        [(None, 250)]             0         
                                                                 
 embedding_1 (Embedding)     (None, 250, 32)           320000    
                                                                 
 dropout_2 (Dropout)         (None, 250, 32)           0         
                                                                 
 lstm (LSTM)                 (None, 32)                8320      
                                                                 
 dropout_3 (Dropout)         (None, 32)                0         
                                                                 
 dense_3 (Dense)             (None, 1)                 33        
                                                                 
Total params: 328,353
Trainable params: 328,353
Non-trainab

<keras.callbacks.History at 0x7f20f748e9d0>

## Results and inference

That's it for four quick experiments with different model architectures. Interestingly, the largest model we tried (LSTMs) ended up performing the worst. The bigram model performed the best at over 90% validation accuracy.

This is a good insight to gain: without pretraining, on a dataset of this size, simple n-gram models are going to be very hard to beat. There's a bit more accuracy we could gain using larger vocabularies and larger n-grams, so feel free to try it out in this colab!

The last thing we will do is demo saving a model for later inference. We will save the bigram model with preprocessing and the trained layers grouped together, so our single saved model can run directly on raw text.

In [None]:
# Group our preprocessing and model into a single model that handles raw data.
inputs = bigram_preprocessing.input
outputs = bigram_model(bigram_preprocessing(inputs))
inference_model = tf.keras.Model(inputs, outputs)
inference_model.save("final_model")

# Load the model back from disk and make a prediction.
loaded_model = tf.keras.models.load_model("final_model")
loaded_model.predict(
    tf.constant(['Terrible, no good, trash.', 'I loved this movie!']))

INFO:tensorflow:Assets written to: final_model/assets


array([[0.27302995],
       [0.78468144]], dtype=float32)
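
The sigmoid outputs above are scores between 0 and 1. A minimal sketch of turning them into labels follows; it assumes the labels were inferred from the `neg` and `pos` subdirectories in alphabetical order (so 0 is negative and 1 is positive) and uses the conventional 0.5 threshold, which is a choice rather than something fixed by the model.

```python
# Assumption: class indices follow the alphabetical order of the subdirectories.
class_names = ["neg", "pos"]

# Scores copied from the prediction output above.
scores = [0.27302995, 0.78468144]

# A score above the 0.5 threshold is read as the positive class.
labels = [class_names[int(score > 0.5)] for score in scores]
print(labels)  # ['neg', 'pos']
```

This matches intuition: the first review is read as negative and the second as positive.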