-----------------------
#### Word embeddings 
--------------------------

- Objective

    - Build a binary classification model
    - Perform sentiment analysis on the IMDb dataset
    

**Data Download and Extraction:**

Downloads a sentiment analysis dataset (IMDb reviews) from a specified URL.
Extracts the dataset from the downloaded tar.gz file.

**Data Preparation:**

Locates the `train` directory and removes its unlabelled `unsup` subfolder.
Loads the training data using `text_dataset_from_directory` from TensorFlow, splitting it into training and validation subsets.

**Data Preprocessing and Optimization:**

Defines the `custom_standardization` function to perform text preprocessing: converting text to lowercase, replacing HTML break tags with spaces, and stripping punctuation.
Uses the TextVectorization layer to normalize, split, and map strings to integers, adapting it to the training data.
Sets up the AUTOTUNE constant and applies caching and prefetching to the training and validation datasets for optimized performance.

**Model Definition:**

Creates a text classification model using TensorFlow's Keras API.
Comprises layers for text vectorization, embedding, global average pooling, and two dense layers for classification.
Specifies the vocabulary size, sequence length, and embedding dimension.

**Model Training:**

Compiles and trains the defined model on the preprocessed training dataset.
Utilizes the fit method with specified parameters such as training data, validation data, number of epochs, and callbacks.

**TensorBoard Callback:**

Creates a `tensorboard_callback` (a `tf.keras.callbacks.TensorBoard` instance that logs to the `logs` directory) and passes it to `fit` so that training can be visualized in TensorBoard.

In [1]:
import io
import os
import re
import shutil
import string

import tensorflow as tf

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Embedding, GlobalAveragePooling1D
from tensorflow.keras.layers import TextVectorization

#### Download the IMDb Dataset

In [18]:
%%time
url = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"

# 82 MB file
dataset = tf.keras.utils.get_file("aclImdb_v1.tar.gz", 
                                  url,
                                  untar       = True, 
                                  cache_dir   = r'D:\AI-DATASETS\02-MISC-large\keras\datasets',
                                  cache_subdir= '')

CPU times: total: 2min 54s
Wall time: 5min 4s


In [19]:
dataset_dir = os.path.join(os.path.dirname(dataset), 'aclImdb')
os.listdir(dataset_dir)

['imdb.vocab', 'imdbEr.txt', 'README', 'test', 'train']

In [20]:
dataset_dir

'D:\\AI-DATASETS\\02-MISC-large\\keras\\datasets\\aclImdb'

**train/ directory**

- `pos` and `neg` folders with movie reviews labelled as positive and negative respectively. 

In [21]:
train_dir = os.path.join(dataset_dir, 'train')
os.listdir(train_dir)

['labeledBow.feat',
 'neg',
 'pos',
 'unsup',
 'unsupBow.feat',
 'urls_neg.txt',
 'urls_pos.txt',
 'urls_unsup.txt']

In [22]:
train_dir

'D:\\AI-DATASETS\\02-MISC-large\\keras\\datasets\\aclImdb\\train'

The train directory also contains an `unsup` folder of unlabelled reviews. Remove it before creating the training dataset, because `text_dataset_from_directory` treats every subdirectory as a class.

In [23]:
%%time
remove_dir = os.path.join(train_dir, 'unsup')
shutil.rmtree(remove_dir)

CPU times: total: 4.14 s
Wall time: 6.11 s


Next, create a `tf.data.Dataset` using `tf.keras.utils.text_dataset_from_directory`.

Use the train directory to create both train and validation datasets with a split of 20% for validation.

In [24]:
%%time
batch_size = 1024
seed       = 123

train_ds = tf.keras.utils.text_dataset_from_directory(
                    train_dir, 
                    batch_size      = batch_size, 
                    validation_split= 0.2,
                    subset          = 'training', 
                    seed            = seed)

val_ds = tf.keras.utils.text_dataset_from_directory(
                    train_dir, 
                    batch_size      = batch_size, 
                    validation_split= 0.2,
                    subset          = 'validation', 
                    seed            = seed)

Found 25000 files belonging to 2 classes.
Using 20000 files for training.
Found 25000 files belonging to 2 classes.
Using 5000 files for validation.
CPU times: total: 3.34 s
Wall time: 3.48 s
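As a quick check on the numbers reported above, the split sizes and the number of batches per epoch follow from simple arithmetic. This is a minimal sketch that uses only values shown in this notebook (25,000 files, a 20% validation split, and a batch size of 1024); the variable names are illustrative.

```python
import math

total_files      = 25000   # files found under train/pos and train/neg
validation_split = 0.2
batch_size       = 1024

val_files   = round(total_files * validation_split)
train_files = total_files - val_files

# The last batch may be smaller than batch_size, hence the ceiling.
train_batches = math.ceil(train_files / batch_size)
val_batches   = math.ceil(val_files / batch_size)

print(train_files, val_files)      # 20000 5000
print(train_batches, val_batches)  # 20 5
```

The 20 training batches agree with the `20/20` step counter in the training output further down.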


Take a look at a few movie reviews and their labels (1: positive, 0: negative) from the train dataset.

In [25]:
%%time
for text_batch, label_batch in train_ds.take(1):
    for i in range(2):
        print(label_batch[i].numpy(), text_batch.numpy()[i])
        print()

0 b"Oh My God! Please, for the love of all that is holy, Do Not Watch This Movie! It it 82 minutes of my life I will never get back. Sure, I could have stopped watching half way through. But I thought it might get better. It Didn't. Anyone who actually enjoyed this movie is one seriously sick and twisted individual. No wonder us Australians/New Zealanders have a terrible reputation when it comes to making movies. Everything about this movie is horrible, from the acting to the editing. I don't even normally write reviews on here, but in this case I'll make an exception. I only wish someone had of warned me before I hired this catastrophe"

1 b'This movie is SOOOO funny!!! The acting is WONDERFUL, the Ramones are sexy, the jokes are subtle, and the plot is just what every high schooler dreams of doing to his/her school. I absolutely loved the soundtrack as well as the carefully placed cynicism. If you like monty python, You will love this film. This movie is a tad bit "grease"esk (withou

In [17]:
# tf.data.AUTOTUNE asks tf.data to choose the prefetch buffer size
# dynamically at runtime instead of using a fixed value.
AUTOTUNE = tf.data.AUTOTUNE

# cache() keeps the dataset in memory after the first pass over it;
# prefetch() prepares upcoming batches while the current one is consumed.
train_ds = train_ds.cache().prefetch(buffer_size=AUTOTUNE)
val_ds   = val_ds.cache().prefetch(buffer_size=AUTOTUNE)

#### How are we going to create embeddings?

Given the review text: "Some movies just leave me speechless.. any other cast-member of SNL"

1. First, tokenize the text into words.
2. Then assign a unique integer (think of it as a code) to every word.
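
The two steps above can be sketched in plain Python. This is only an illustration: it lowercases, strips punctuation, splits on whitespace, and assigns indices in order of first appearance, whereas the `TextVectorization` layer used later builds its vocabulary from the whole training set. The names `tokenize` and `word_to_index` are hypothetical.

```python
import re
import string

review = "Some movies just leave me speechless.. any other cast-member of SNL"

def tokenize(text):
    """Lowercase, drop punctuation, and split on whitespace."""
    text = text.lower()
    text = re.sub("[%s]" % re.escape(string.punctuation), "", text)
    return text.split()

tokens = tokenize(review)

# Assign each distinct word an integer code, in order of first appearance.
# Index 0 is left unused here, since it is commonly reserved for padding.
word_to_index = {}
for word in tokens:
    if word not in word_to_index:
        word_to_index[word] = len(word_to_index) + 1

encoded = [word_to_index[word] for word in tokens]
print(tokens)
print(encoded)
```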

#### Using the Embedding layer

- The `Embedding` layer is a lookup table that maps integer indices (each standing for a specific word) to dense vectors (their embeddings).
- The entries of this table are weights that are learned during training.
- The dimensionality (or width) of the embedding is a parameter you can experiment with to see what works well for your problem, much as you would experiment with the number of neurons in a `Dense` layer.

In [26]:
# Embed a 1,000 word vocabulary into 5 dimensions.
embedding_layer = tf.keras.layers.Embedding(input_dim=1000, output_dim=5)

If you pass an integer to an embedding layer, the result replaces each integer with the vector from the embedding table:

In [27]:
result = embedding_layer(tf.constant([1, 2, 3]))
result.numpy()

array([[ 0.00515721, -0.01837242, -0.02512432,  0.0366399 , -0.04934866],
       [-0.02547615, -0.01515649, -0.03694643,  0.03688324, -0.03789264],
       [-0.04340212,  0.03707488,  0.01986596, -0.00517799,  0.00846215]], dtype=float32)

**For text data**


- For text or sequence problems, the `Embedding` layer takes a 2D tensor of integers of shape (`samples`, `sequence_length`), where each row is one sequence of integers.

- The layer can embed sequences whose length differs from batch to batch, although all sequences within a single batch share the same length.

- For example, batches of shape (32, 10) or (64, 15) can both be fed into the layer above, where 32 or 64 is the number of sequences in the batch and 10 or 15 is the length of each sequence.

- The returned tensor has one more axis than the input; the embedding vectors are aligned along this new last axis.
- Passing a batch of shape (2, 3) produces an output of shape (2, 3, N), where N is the dimensionality of the embedding space.
- The sequence structure is preserved: position (i, j) of the output holds the embedding vector for the integer at position (i, j) of the input.
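
To make the lookup-table idea concrete, here is a minimal pure-Python sketch of what an embedding lookup does to a batch of shape (2, 3). The table values are made up for illustration (a real `Embedding` layer initializes them randomly and learns them), and `embed` is a hypothetical helper, not a Keras function.

```python
# A toy embedding table: 6 words in the vocabulary, 2 dimensions per word.
# Row i holds the vector for word index i (the values are arbitrary).
embedding_table = [
    [0.0, 0.1],   # index 0
    [1.0, 1.1],   # index 1
    [2.0, 2.1],   # index 2
    [3.0, 3.1],   # index 3
    [4.0, 4.1],   # index 4
    [5.0, 5.1],   # index 5
]

def embed(batch, table):
    """Replace every integer in a 2D batch with its row from the table."""
    return [[table[index] for index in sequence] for sequence in batch]

batch  = [[0, 1, 2],
          [3, 4, 5]]          # shape (2, 3)
output = embed(batch, embedding_table)

# The output has one extra axis: (samples, sequence_length, embedding_dim).
shape = (len(output), len(output[0]), len(output[0][0]))
print(shape)         # (2, 3, 2)
print(output[1][0])  # [3.0, 3.1]
```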






In [28]:
import numpy as np
np.set_printoptions(linewidth=140)

In [29]:
result = embedding_layer(tf.constant([[0, 1, 2], 
                                      [3, 4, 5]]))
result.shape

TensorShape([2, 3, 5])

In [30]:
result

<tf.Tensor: shape=(2, 3, 5), dtype=float32, numpy=
array([[[-0.03659021,  0.01373824, -0.01638372,  0.014025  ,  0.00111081],
        [ 0.00515721, -0.01837242, -0.02512432,  0.0366399 , -0.04934866],
        [-0.02547615, -0.01515649, -0.03694643,  0.03688324, -0.03789264]],

       [[-0.04340212,  0.03707488,  0.01986596, -0.00517799,  0.00846215],
        [ 0.02256364,  0.00312041, -0.03689418,  0.02967275,  0.00882412],
        [-0.01595887, -0.02671605, -0.01254153, -0.03929228,  0.01542492]]], dtype=float32)>

- When given a batch of sequences as input, an embedding layer returns a 3D floating point tensor, of shape (`samples`, `sequence_length`, `embedding_dimensionality`). 
- To convert from this sequence of variable length to a fixed representation there are a variety of standard approaches. 
- You could use an RNN, Attention, or pooling layer before passing it to a Dense layer.
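
Of these options, pooling is the simplest. The sketch below averages over the sequence axis in plain Python to show how sequences of different lengths collapse to vectors of the same size. `average_pool` is a hypothetical helper that mirrors the idea behind a global average pooling layer; it ignores masking and batching details.

```python
def average_pool(sequence_of_vectors):
    """Average a list of equal-sized vectors element-wise."""
    length = len(sequence_of_vectors)
    dim = len(sequence_of_vectors[0])
    return [sum(vec[d] for vec in sequence_of_vectors) / length
            for d in range(dim)]

# Two embedded sequences with different lengths but the same embedding size.
short_seq = [[1.0, 2.0], [3.0, 4.0]]
long_seq  = [[1.0, 0.0], [2.0, 0.0], [3.0, 6.0]]

print(average_pool(short_seq))  # [2.0, 3.0]
print(average_pool(long_seq))   # [2.0, 2.0]
```

Both results have two elements, one per embedding dimension, regardless of how many words the sequence contained.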

#### Text preprocessing
Next, define the dataset preprocessing steps required for your sentiment classification model. 

Initialize a `TextVectorization` layer with the desired parameters to vectorize movie reviews. The next few cells first try the layer on a small toy example.

In [31]:
from tensorflow.keras.layers import TextVectorization

In [32]:
# Sample training data
train_texts = ["This is a sample sentence.", 
               "Another example sentence.", 
               "TensorFlow is great!"]

In [33]:
# Create a TextVectorization layer
text_vectorizer = TextVectorization(max_tokens            = 100, 
                                    output_mode           = 'int', 
                                    output_sequence_length= 5)

In [34]:
# Adapt the layer to your training text data
text_vectorizer.adapt(train_texts)

In [35]:
# Transform input text into numerical vectors
input_texts = ["Sample sentence for testing.", "TensorFlow example."]
numerical_vectors = text_vectorizer(input_texts)

In [36]:
# Print the texts that were vectorized (input_texts, not train_texts)
print("Original texts:")
print(input_texts)

Original texts:
['Sample sentence for testing.', 'TensorFlow example.']


In [37]:
print("\nNumerical vectors:")
print(numerical_vectors.numpy())


Numerical vectors:
[[6 2 1 1 0]
 [5 8 0 0 0]]
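
In the output above, index 0 is padding and index 1 marks out-of-vocabulary words such as "for" and "testing", which never appeared in the adapted texts. The pure-Python sketch below reproduces the same vectors. It assumes lowercasing, punctuation stripping, and whitespace splitting, and it orders the vocabulary by descending frequency with ties broken in reverse alphabetical order. That tie-break is an assumption chosen to match the output shown here, not documented behavior, and the helper names are hypothetical.

```python
import re
import string
from collections import Counter

train_texts = ["This is a sample sentence.",
               "Another example sentence.",
               "TensorFlow is great!"]
input_texts = ["Sample sentence for testing.", "TensorFlow example."]

def standardize(text):
    """Lowercase and strip punctuation."""
    return re.sub("[%s]" % re.escape(string.punctuation), "", text.lower())

# Count tokens, then sort by (count, token) in descending order.
counts = Counter(tok for text in train_texts for tok in standardize(text).split())
ordered = sorted(counts, key=lambda tok: (counts[tok], tok), reverse=True)

# Index 0 is reserved for padding and index 1 for out-of-vocabulary tokens.
vocab = ["", "[UNK]"] + ordered
token_to_index = {tok: i for i, tok in enumerate(vocab)}

def vectorize(text, length=5):
    """Map tokens to indices, then truncate or zero-pad to a fixed length."""
    ids = [token_to_index.get(tok, 1) for tok in standardize(text).split()]
    ids = ids[:length]
    return ids + [0] * (length - len(ids))

vectors = [vectorize(text) for text in input_texts]
print(vectors)  # [[6, 2, 1, 1, 0], [5, 8, 0, 0, 0]]
```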


Now back to the IMDb reviews.

In [38]:
# Custom standardization: lowercase the text, replace HTML break tags
# '<br />' with a space, and strip punctuation.
def custom_standardization(input_data):
    lowercase     = tf.strings.lower(input_data)
    stripped_html = tf.strings.regex_replace(lowercase, '<br />', ' ')
    
    return tf.strings.regex_replace(stripped_html,
                                    '[%s]' % re.escape(string.punctuation), '')
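
To see what this standardization does to a review, here is a plain-Python counterpart that works on a single string. It is a sketch for illustration only: `standardize_py` is a hypothetical name, and the sample text is made up.

```python
import re
import string

def standardize_py(text):
    """Plain-Python counterpart of custom_standardization for one string."""
    lowercase = text.lower()
    stripped_html = lowercase.replace("<br />", " ")
    return re.sub("[%s]" % re.escape(string.punctuation), "", stripped_html)

sample = "Great film!<br /><br />Loved it."
cleaned = standardize_py(sample)
print(cleaned.split())  # ['great', 'film', 'loved', 'it']
```

Replacing each break tag with a space, rather than deleting it, keeps "film" and "loved" from fusing into a single token.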

In [39]:
# Vocabulary size and number of words in a sequence.
vocab_size      = 10000
sequence_length = 100

In [40]:
# Use the text vectorization layer to normalize, split, and map strings to
# integers. 
# Note that the layer uses the custom standardization defined above.
# Set output_sequence_length because not all reviews have the same length:
# shorter ones are padded and longer ones are truncated to sequence_length.
vectorize_layer = TextVectorization(
                        standardize           = custom_standardization,
                        max_tokens            = vocab_size,
                        output_mode           = 'int',
                        output_sequence_length= sequence_length
)

In [41]:
%%time
# Make a text-only dataset (no labels) and call adapt to build the vocabulary.
text_ds = train_ds.map(lambda x, y: x)
vectorize_layer.adapt(text_ds)

Cause: could not parse the source code of <function <lambda> at 0x000002384CE86160>: no matching AST found among candidates:

CPU times: total: 34.6 s
Wall time: 49.1 s


#### Constructing a Sentiment Classification Model

- Use the Keras Sequential API to define the sentiment classification model. In this case it is a `Continuous Bag of Words` style model.

- The `TextVectorization` layer transforms strings into vocabulary indices. `vectorize_layer` has already been initialized and its vocabulary built by calling `adapt` on `text_ds`, so it can serve as the first layer of the end-to-end model and feed the transformed strings into the `Embedding` layer.

- The `Embedding` layer takes the integer-encoded vocabulary and looks up the embedding vector for each word index. These vectors are learned as the model trains. The lookup adds a dimension to the output array, so the resulting dimensions are (`batch`, `sequence`, `embedding`).

- The `GlobalAveragePooling1D` layer returns a fixed-length output vector for each example by averaging over the sequence dimension. This is the simplest way for the model to handle input of variable length.

- The fixed-length output vector is passed through a fully connected (`Dense`) layer with 16 hidden units.

- The last layer is densely connected with a single output node. It has no activation, so its output is a raw score (logit).

In [42]:
embedding_dim = 16

model = Sequential([
  vectorize_layer,
  Embedding(vocab_size, embedding_dim, name="embedding"),
  GlobalAveragePooling1D(),
  Dense(16, activation='relu'), # hidden layer (optional)
  Dense(1)                      # single logit for binary classification
])

#### Compile and train the model

In [43]:
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir="logs")

In [44]:
model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=['accuracy'])
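
Because the final `Dense(1)` layer has no activation, the model outputs raw logits, which is why the loss is created with `from_logits=True`. The sketch below, in plain Python, shows how a logit maps to a probability and one numerically stable way to compute binary cross-entropy directly from a logit. It illustrates the math only; it is not a description of how Keras implements the loss internally, and the function names are hypothetical.

```python
import math

def sigmoid(logit):
    """Map a raw logit to a probability in (0, 1) without overflow."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    exp_logit = math.exp(logit)
    return exp_logit / (1.0 + exp_logit)

def bce_from_logit(label, logit):
    """Binary cross-entropy for one example, computed from the logit.

    Equivalent to -[y*log(p) + (1-y)*log(1-p)] with p = sigmoid(logit),
    rewritten to stay stable for large |logit|.
    """
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))

print(sigmoid(0.0))                      # 0.5
print(round(bce_from_logit(1, 0.0), 4))  # 0.6931
```

A loss near 0.69 (that is, ln 2) is what a model whose logits are still close to zero would produce, which is in line with the first-epoch values shown in the training output.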

In [45]:
model.summary()
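
The output of `model.summary()` is not shown here. As a rough cross-check, the trainable parameter counts can be worked out by hand from the layer sizes defined in this notebook. This is a sketch of the arithmetic, assuming the layers are built exactly as listed; the `TextVectorization` and pooling layers contribute no trainable weights.

```python
vocab_size    = 10000
embedding_dim = 16
hidden_units  = 16

embedding_params = vocab_size * embedding_dim                   # one vector per token
hidden_params    = embedding_dim * hidden_units + hidden_units  # weights + biases
output_params    = hidden_units * 1 + 1                         # weights + bias

total_params = embedding_params + hidden_params + output_params
print(embedding_params, hidden_params, output_params, total_params)
# 160000 272 17 160289
```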

In [None]:
%%time
model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=15,
    callbacks=[tensorboard_callback])

Epoch 1/15
20/20 ━━━━━━━━━━━━━━━━━━━━ 0s 1s/step - accuracy: 0.4985 - loss: 0.6915

In [41]:
# %load_ext tensorboard
# %tensorboard --logdir logs

#### Retrieve the trained word embeddings and save them to disk
Next, retrieve the word embeddings learned during training. 

The embeddings are weights of the Embedding layer in the model. The weights matrix is of shape (vocab_size, embedding_dimension).

Obtain the weights from the model using `get_layer()` and `get_weights()`. 

The `get_vocabulary()` method of `vectorize_layer` provides the vocabulary to build a metadata file with one token per line.

In [40]:
weights = model.get_layer('embedding').get_weights()[0]
vocab = vectorize_layer.get_vocabulary()

In [41]:
weights.shape

(10000, 16)
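
The cells above retrieve the weights and the vocabulary but stop before writing anything to disk. A minimal sketch of the saving step follows. It assumes a tab-separated layout with one vector per line in `vectors.tsv` and one token per line in `metadata.tsv` (file names chosen here for illustration), and it skips index 0 because that slot is padding. The helper `save_embeddings` is hypothetical, and the tiny `demo_weights` and `demo_vocab` stand in for the real `weights` and `vocab`.

```python
import io
import os
import tempfile

def save_embeddings(weights, vocab, out_dir):
    """Write one embedding vector per line and the matching token per line."""
    vectors_path = os.path.join(out_dir, "vectors.tsv")
    metadata_path = os.path.join(out_dir, "metadata.tsv")
    with io.open(vectors_path, "w", encoding="utf-8") as out_v, \
         io.open(metadata_path, "w", encoding="utf-8") as out_m:
        for index, word in enumerate(vocab):
            if index == 0:
                continue  # index 0 is padding, so it has no meaningful token
            out_v.write("\t".join(str(x) for x in weights[index]) + "\n")
            out_m.write(word + "\n")
    return vectors_path, metadata_path

# Stand-in data; in the notebook, pass the real `weights` and `vocab` instead.
demo_vocab   = ["", "[UNK]", "the", "movie"]
demo_weights = [[0.0, 0.0], [0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]

out_dir = tempfile.mkdtemp()
vectors_path, metadata_path = save_embeddings(demo_weights, demo_vocab, out_dir)
```

Files in this layout (a vectors file plus a metadata file with one token per line) are the kind of input that embedding visualization tools such as the TensorFlow Embedding Projector accept.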