- Built a custom RNN classifier on the combined EPIE corpus that distinguishes sentences containing idioms from sentences without them, reaching ~67% test accuracy.
- Each model consists of a text-vectorization layer, 1 embedding layer, and 2 dense layers; the second model adds 50% dropout between the dense layers.
- Compared the accuracy of models with 1 and 2 bidirectional LSTM layers between the embedding and dense layers (test accuracy 67.3% vs. 66.1%).

In [1]:
import tensorflow as tf
import tensorflow_hub as hub

In [2]:
import numpy as np

In [3]:
from official.nlp import optimization  # to create AdamW optimizer


Warning on import: TensorFlow Addons (TFA) is in minimal-maintenance mode until its planned end of life in May 2024, and the installed TensorFlow 2.10.0 is not a version it supports.
See https://github.com/tensorflow/addons/issues/2807 and the compatibility matrix in the README at https://github.com/tensorflow/addons


In [4]:
import tensorflow_text as text

In [5]:
# Load the text data from the file
with open('combined.txt', 'r', encoding='utf-8') as file:
    text_data = file.readlines()

# Create labels: the first half of the file is 1 (idiom), the second half is 0 (not)
labels = [1] * (len(text_data) // 2) + [0] * (len(text_data) // 2)

# Combine text and labels
data = list(zip(text_data, labels))

In [6]:
labels[4000]

0

In [7]:
import random

In [18]:
# Load the text data from the file
with open('combined.txt', 'r', encoding='utf-8') as file:
    text_data = file.read()

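# Repair mojibake: UTF-8 curly-quote bytes that were decoded as Latin-1
text_data = text_data.replace("\xe2\x80\x98", "‘")
text_data = text_data.replace("\xe2\x80\x99", "’")

# One sentence per line; drop blank lines such as the one after a trailing newline
text_data = [line for line in text_data.split('\n') if line.strip()]

# Create labels: the first half of the file is 1 (idiom), the second half is 0 (not)
labels = [1] * (len(text_data) // 2) + [0] * (len(text_data) // 2)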

# Combine text and labels
data = list(zip(text_data, labels))

# Shuffle the combined data
random.seed(42)  # For reproducibility
random.shuffle(data)

# Split the combined data into training, validation, and test sets
num_samples = len(data)
num_train = int(0.6 * num_samples)
num_val = int(0.2 * num_samples)

train_data = data[:num_train]
val_data = data[num_train:num_train+num_val]
test_data = data[num_train+num_val:]

# Create TensorFlow datasets
batch_size = 32

def text_label_generator(data):
    for text, label in data:
        yield text, label

train_ds = tf.data.Dataset.from_generator(
    lambda: text_label_generator(train_data),
    output_signature=(
        tf.TensorSpec(shape=(), dtype=tf.string),
        tf.TensorSpec(shape=(), dtype=tf.int32)
    )
)
train_ds = train_ds.shuffle(buffer_size=num_train)
train_ds = train_ds.batch(batch_size)

val_ds = tf.data.Dataset.from_generator(
    lambda: text_label_generator(val_data),
    output_signature=(
        tf.TensorSpec(shape=(), dtype=tf.string),
        tf.TensorSpec(shape=(), dtype=tf.int32)
    )
)
val_ds = val_ds.batch(batch_size)

test_ds = tf.data.Dataset.from_generator(
    lambda: text_label_generator(test_data),
    output_signature=(
        tf.TensorSpec(shape=(), dtype=tf.string),
        tf.TensorSpec(shape=(), dtype=tf.int32)
    )
)
test_ds = test_ds.batch(batch_size)

class_names = ['Not', 'Idiom']  # Labels for 0 and 1

for text_batch, label_batch in train_ds.take(1):
    for i in range(3):
        sentence = text_batch[i].numpy().decode('utf-8')  # Decoding the bytes to string
        label = label_batch[i].numpy()
        print(f'Sentence: {sentence}')
        print(f'Label: {label} ({class_names[label]})')

Sentence: A person who prioritizes actions and deeds over discussion or contemplation all his life , David went into the Army in the RAMC as a young man and later trained as a State Registered Nurse in a civilian hospital .
Label: 0 (Not)
Sentence: Route-finding is easy to begin with , though enthusiastic signposting unfortunately ceases to function just when paths become undefined , across fields .
Label: 0 (Not)
Sentence: But he 's my husband , and even if we do stray now and again , we will always be an item . ’
Label: 1 (Idiom)
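
The shuffle-and-split step can be sanity-checked on toy data. The sketch below is a minimal stand-in rather than the notebook's own code: `split_60_20_20` and the toy sentences are hypothetical, and it only assumes the same 60/20/20 proportions and seed 42 used above. It prints the size and label counts of each split, so a badly unbalanced split would be easy to spot.

```python
import random
from collections import Counter

def split_60_20_20(pairs, seed=42):
    """Shuffle (text, label) pairs and split them 60/20/20 (hypothetical helper)."""
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)
    n_train = int(0.6 * len(pairs))
    n_val = int(0.2 * len(pairs))
    train = pairs[:n_train]
    val = pairs[n_train:n_train + n_val]
    test = pairs[n_train + n_val:]  # the remainder goes to the test set
    return train, val, test

# Toy stand-in for the corpus: 50 idiom sentences followed by 50 non-idiom ones.
toy = [(f"sentence {i}", 1 if i < 50 else 0) for i in range(100)]
train, val, test = split_60_20_20(toy)

for name, part in [("train", train), ("val", val), ("test", test)]:
    counts = Counter(label for _, label in part)
    print(name, len(part), dict(counts))
```

Because the sizes are computed with `int()`, any rounding remainder ends up in the test split.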


In [19]:
for example, label in train_ds.take(2):
  print('texts: ', example.numpy()[:3])
  print()
  print('labels: ', label.numpy()[:3])

texts:  [b'Your quick response in an emergency could be a life-saver for your child .'
 b"Wolves may not howl here in the moonlight , as they did in the journal of Jonathan Harker , but I have no difficulty in seeing Slains as he saw Count Dracula 's castle in Bukovina , the tall black windows from which not a glimmer of light came , and the jagged battlements glimpsed when the moon CAME OUT FROM BEHIND the fitful clouds ."
 b'The Minister had planned a speech of thanks himself during a visit to Stoke Mandeville Hospital \xe2\x80\xa6 but Adis Avdic lessened his praise, force or authority .']

labels:  [1 0 0]
texts:  [b"And then we 're going to use a very nice cream erm called moisture balance and that 's a dermatological product and , and that 's for keeping the skin nice and soft , and keeping the wrinkles at a safe distance ."
 b'If it is too drastic to begin without guidance, assistance or preparation with such a sweeping change , why not try it out in experimental matches , festiv

In [20]:
VOCAB_SIZE = 1000
encoder = tf.keras.layers.TextVectorization(
    max_tokens=VOCAB_SIZE)
encoder.adapt(train_ds.map(lambda text, label: text))

In [21]:
vocab = np.array(encoder.get_vocabulary())
vocab[:20]

array(['', '[UNK]', 'the', 'to', 'and', 'a', 'of', 'in', 'that', 'it',
       'i', 'was', 'he', 'for', '’', 'with', 'not', '‘', 'is', 's'],
      dtype='<U14')
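
To illustrate where a vocabulary like the one above comes from, here is a rough pure-Python approximation of a frequency-capped vocabulary. It is a simplified sketch, not the `TextVectorization` implementation: `build_vocab` is a hypothetical helper, it strips only ASCII punctuation, and its tie-breaking between equally frequent words may differ from Keras. It assumes index 0 is reserved for padding and index 1 for `[UNK]`, as in the printed vocabulary.

```python
import string
from collections import Counter

def build_vocab(sentences, max_tokens):
    """Keep the most frequent words, reserving slots for '' and '[UNK]'."""
    strip_punct = str.maketrans("", "", string.punctuation)
    counts = Counter(
        token
        for sentence in sentences
        for token in sentence.lower().translate(strip_punct).split()
    )
    most_common = [token for token, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common

# Toy sentences (not from the corpus).
toy_sentences = ["The cat sat on the mat .", "The dog sat ."]
toy_vocab = build_vocab(toy_sentences, max_tokens=4)
print(toy_vocab)
```

With `max_tokens=4` only two real words fit, so every other word in the toy sentences would be encoded as `[UNK]`.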

In [22]:
encoded_example = encoder(example)[:3].numpy()
encoded_example

array([[  4,  71,  32, 119, 105,   3, 253,   5,  77, 772,   1, 226, 228,
          1, 600,   4,   8,  19,   5,   1,   1,   4,   4,   8,  19,  13,
        121,   2,   1, 772,   4,   1,   4, 121,   2,   1,  24,   5, 112,
        122,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0],
       [ 52,   9,  18, 118,   1,   3, 680, 156,   1,   1,  44,   1,  15,
        131,   5,   1, 501, 272,  16, 323,   9,  42,   7,   1,   1,   1,
         44, 174,   1,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0],
       [ 23,   1, 235,   1,   1,  15,   1,   1, 180,  62,   3, 880,   2,
          1,   4,   1,   2,   1,   1,  49,   1, 569,   4,   1,   9, 208,
         12,  29, 410,   1,   9,  98,  33, 204,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0]], dtype=

In [23]:
for n in range(3):
  print("Original: ", example[n].numpy())
  print("Round-trip: ", " ".join(vocab[encoded_example[n]]))
  print()

Original:  b"And then we 're going to use a very nice cream erm called moisture balance and that 's a dermatological product and , and that 's for keeping the skin nice and soft , and keeping the wrinkles at a safe distance ."
Round-trip:  and then we re going to use a very nice [UNK] erm called [UNK] balance and that s a [UNK] [UNK] and and that s for keeping the [UNK] nice and [UNK] and keeping the [UNK] at a safe distance                  

Original:  b'If it is too drastic to begin without guidance, assistance or preparation with such a sweeping change , why not try it out in experimental matches , festival or night matches ?'
Round-trip:  if it is too [UNK] to begin without [UNK] [UNK] or [UNK] with such a [UNK] change why not try it out in [UNK] [UNK] [UNK] or night [UNK]                             

Original:  b'As presentation day grew closer with Mark racing against time to complete the plan and finalise the slide illustrations , so Klepner read and re-read it until he had al
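
The round trips above lose many content words to `[UNK]`. One way to quantify that is the fraction of non-padding token ids equal to 1. The sketch below is a minimal illustration on made-up rows; `unk_rate` is a hypothetical helper, and the ids 0 (padding) and 1 (`[UNK]`) are taken from the vocabulary printed earlier.

```python
PAD_ID, UNK_ID = 0, 1  # positions of '' and '[UNK]' in the vocabulary

def unk_rate(encoded_rows):
    """Fraction of non-padding tokens that were mapped to [UNK]."""
    tokens = [t for row in encoded_rows for t in row if t != PAD_ID]
    if not tokens:
        return 0.0
    return sum(t == UNK_ID for t in tokens) / len(tokens)

# Toy rows in the same layout as the encoder output (not real data).
toy_rows = [
    [4, 71, 1, 1, 0, 0],
    [52, 9, 18, 1, 0, 0],
]
print(unk_rate(toy_rows))  # 3 of the 8 real tokens are [UNK] -> 0.375
```

On the real data the same function could be applied to `encoder(example).numpy().tolist()`. A high rate would suggest that `VOCAB_SIZE = 1000` is discarding words that matter for idiom detection.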

In [24]:
model = tf.keras.Sequential([
    encoder,
    tf.keras.layers.Embedding(
        input_dim=len(encoder.get_vocabulary()),
        output_dim=64,
        # Use masking to handle the variable sequence lengths
        mask_zero=True),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1)
])

In [25]:
print([layer.supports_masking for layer in model.layers])

[False, True, True, True, True]


In [26]:
# predict on a sample text without padding.

sample_text = ('The movie was cool. The animation and the graphics '
               'were out of this world. I would recommend this movie.')
predictions = model.predict(np.array([sample_text]))
print(predictions[0])

[0.00442513]


In [27]:
# predict on a sample text with padding

padding = "the " * 2000
predictions = model.predict(np.array([sample_text, padding]))
print(predictions[0])

[0.00442513]


In [28]:
model.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              optimizer=tf.keras.optimizers.Adam(1e-4),
              metrics=['accuracy'])

In [30]:
history = model.fit(train_ds, epochs=10,
                    validation_data=val_ds,
                    validation_steps=30)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [31]:
test_loss, test_acc = model.evaluate(test_ds)

print('Test Loss:', test_loss)
print('Test Accuracy:', test_acc)

Test Loss: 0.6361322999000549
Test Accuracy: 0.6733067631721497
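
For context, the labels are built as half 1s and half 0s, so a classifier that always predicts the same class should score close to 50% on a shuffled split. The sketch below computes that majority-class baseline for a list of labels. `majority_baseline` is a hypothetical helper and the label list is a toy example, not the actual test set.

```python
from collections import Counter

def majority_baseline(labels):
    """Accuracy of always predicting the most frequent label."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)

# Toy balanced label list standing in for the test labels.
toy_labels = [1] * 500 + [0] * 500
print(majority_baseline(toy_labels))  # 0.5
```

Against a baseline near 0.5, a test accuracy of about 0.67 shows the model has learned some signal, though it leaves a lot of room for improvement.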


In [33]:
sample_text = ('its raining cats and dogs')
predictions = model.predict(np.array([sample_text]))
predictions



array([[-0.35265735]], dtype=float32)
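
The first model's final layer has no activation and it is trained with `from_logits=True`, so the value above is a logit rather than a probability. A minimal sketch of the conversion, using only the standard library (the 0.5 threshold is an assumption, not something the notebook sets):

```python
import math

def sigmoid(logit):
    """Map a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-logit))

logit = -0.35265735            # value printed above
prob = sigmoid(logit)
label = "Idiom" if prob >= 0.5 else "Not"  # assumed 0.5 decision threshold
print(f"P(idiom) = {prob:.3f} -> {label}")
```

A negative logit corresponds to a probability below 0.5 (roughly 0.41 here), so this sentence would be classified as "Not" even though it contains an idiom.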

In [42]:
model = tf.keras.Sequential([
    encoder,
    tf.keras.layers.Embedding(len(encoder.get_vocabulary()), 64, mask_zero=True),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, return_sequences=True)),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

In [45]:
model.compile(loss=tf.keras.losses.BinaryCrossentropy(),
              optimizer=tf.keras.optimizers.Adam(1e-4),
              metrics=['accuracy'])
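
This model ends in a sigmoid, so the loss is given probabilities and `from_logits` is left at its default, whereas the first model passed raw logits with `from_logits=True`. Mathematically the two setups compute the same binary cross-entropy. The sketch below checks this in plain Python. The function names are hypothetical and it ignores any clipping or numerical safeguards a library implementation may apply.

```python
import math

def bce_from_prob(y, p):
    """Binary cross-entropy given a probability p."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def bce_from_logit(y, z):
    """Same loss given a logit z, in a numerically stable form."""
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))

z = -0.35                       # example logit
p = 1.0 / (1.0 + math.exp(-z))  # the probability a sigmoid layer would output
for y in (0, 1):
    print(y, bce_from_prob(y, p), bce_from_logit(y, z))
```

The practical point is that the loss setting must match the output layer: a sigmoid output with `from_logits=True`, or a raw logit output without it, would apply the sigmoid twice or not at all.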

In [46]:
history = model.fit(train_ds, epochs=10,
                    validation_data=val_ds,
                    validation_steps=30)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [47]:
test_loss, test_acc = model.evaluate(test_ds)

print('Test Loss:', test_loss)
print('Test Accuracy:', test_acc)

Test Loss: 0.6798563003540039
Test Accuracy: 0.6613546013832092
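
The two test accuracies (about 0.673 and 0.661) are close, so it is worth asking whether the gap exceeds sampling noise. The sketch below is a rough check using binomial standard errors. The test-set size `n_test = 1000` is an assumption, since the notebook does not print it, and treating the two accuracies as independent is only an approximation because both models were scored on the same test set (a paired test such as McNemar's would be more appropriate).

```python
import math

def accuracy_diff_z(acc_a, acc_b, n):
    """Approximate z-score for the difference of two accuracies on n samples each."""
    var_a = acc_a * (1 - acc_a) / n
    var_b = acc_b * (1 - acc_b) / n
    return (acc_a - acc_b) / math.sqrt(var_a + var_b)

n_test = 1000  # ASSUMPTION: the actual test-set size is not shown in the notebook
z_score = accuracy_diff_z(0.6733, 0.6614, n_test)
print(f"z = {z_score:.2f}")
```

Under these assumptions the z-score is well below 1, so the difference between the one-layer and two-layer models may simply be noise.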
