## Goals
* Apply sentiment analysis to the IMDB dataset using sequence-based models

### Preparing the IMDB movie reviews data
Download the IMDB dataset from the Stanford page of Andrew Maas and uncompress it.

In [4]:
!curl -O https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar -xf aclImdb_v1.tar.gz

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 80.2M  100 80.2M    0     0  14.1M      0  0:00:05  0:00:05 --:--:-- 16.2M


The directory structure:
* aclImdb/
* ...train/
* ......pos/ [contains a set of 12,500 text files with the text body of a positive-sentiment movie review to be used as training data]
* ......neg/ [contains a set of 12,500 text files with the text body of a negative-sentiment movie review to be used as training data]
* ...test/
* ......pos/ [contains a set of 12,500 text files with the text body of a positive-sentiment movie review to be used as testing data]
* ......neg/ [contains a set of 12,500 text files with the text body of a negative-sentiment movie review to be used as testing data]
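
As a quick sanity check on this layout, a small sketch like the following counts the `.txt` files in each split/label folder. The helper name `count_reviews` is hypothetical (it is not part of the original notebook), and the default path assumes the archive was extracted into `aclImdb/` under the current working directory.

```python
from pathlib import Path

def count_reviews(base_dir="aclImdb", splits=("train", "test"), labels=("pos", "neg")):
    """Return {(split, label): number of .txt files}; missing folders count as 0."""
    base = Path(base_dir)
    counts = {}
    for split in splits:
        for label in labels:
            folder = base / split / label
            counts[(split, label)] = len(list(folder.glob("*.txt"))) if folder.is_dir() else 0
    return counts

# Print one line per folder, e.g. "train/pos: <count> files".
for (split, label), n in count_reviews().items():
    print(f"{split}/{label}: {n} files")
```

If the extraction succeeded, each of the four folders should report the 12,500 files listed above.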

In [5]:
# check the files in the directory
! ls -ltr

total 82164
drwxr-xr-x 4 7297 1000     4096 Jun 26  2011 aclImdb
drwxr-xr-x 1 root root     4096 Oct 28 13:37 sample_data
-rw-r--r-- 1 root root 84125825 Nov  7 19:31 aclImdb_v1.tar.gz


In [8]:
# check the sub-folders in the directory tree
import os
 
rootdir = '/content'
# use a separate loop variable so that rootdir is not overwritten by os.walk
for dirpath, dirs, files in os.walk(rootdir):
    for subdir in dirs:
        print(os.path.join(dirpath, subdir))  # files are skipped because there are too many to list

/content/.config
/content/aclImdb
/content/sample_data
/content/.config/configurations
/content/.config/logs
/content/.config/logs/2022.10.28
/content/aclImdb/test
/content/aclImdb/train
/content/aclImdb/test/pos
/content/aclImdb/test/neg
/content/aclImdb/train/pos
/content/aclImdb/train/unsup
/content/aclImdb/train/neg


In [9]:
# remove the folder /content/aclImdb/train/unsup (unlabeled reviews, not needed here)
!rm -r aclImdb/train/unsup

In [10]:
rootdir = '/content'
# use a separate loop variable so that rootdir is not overwritten by os.walk
for dirpath, dirs, files in os.walk(rootdir):
    for subdir in dirs:
        print(os.path.join(dirpath, subdir))

/content/.config
/content/aclImdb
/content/sample_data
/content/.config/configurations
/content/.config/logs
/content/.config/logs/2022.10.28
/content/aclImdb/test
/content/aclImdb/train
/content/aclImdb/test/pos
/content/aclImdb/test/neg
/content/aclImdb/train/pos
/content/aclImdb/train/neg


In [11]:
# glimpse through a movie review
!cat aclImdb/train/pos/4077_10.txt

I first saw this back in the early 90s on UK TV, i did like it then but i missed the chance to tape it, many years passed but the film always stuck with me and i lost hope of seeing it TV again, the main thing that stuck with me was the end, the hole castle part really touched me, its easy to watch, has a great story, great music, the list goes on and on, its OK me saying how good it is but everyone will take there own best bits away with them once they have seen it, yes the animation is top notch and beautiful to watch, it does show its age in a very few parts but that has now become part of it beauty, i am so glad it has came out on DVD as it is one of my top 10 films of all time. Buy it or rent it just see it, best viewing is at night alone with drink and food in reach so you don't have to stop the film.<br /><br />Enjoy

Prepare a validation set by setting apart 20% of the training text files in a new directory: *aclImdb/val*

In [22]:
import os, pathlib, shutil, random

base_dir = pathlib.Path("aclImdb")
val_dir = base_dir / "val"
train_dir = base_dir / "train"



for category in ("neg", "pos"):
  os.makedirs(val_dir / category)
  files = os.listdir(train_dir / category)
  # Shuffle the list of training files using a seed, to get the same validation set every time the code is executed
  random.Random(1337).shuffle(files)
  # Take 20% of the training files to use for validation.
  num_val_samples = int(0.2 * len(files))
  val_files = files[-num_val_samples:]
  
  # Move the files to aclImdb/val/neg and aclImdb/val/pos.
  for fname in val_files:
    shutil.move(train_dir / category / fname,
                val_dir / category / fname)
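
Note that `os.listdir` returns entries in arbitrary order, so a fixed seed reproduces the same split only when the listing order is itself stable. The sketch below shows one way to make the split independent of listing order by sorting the names before shuffling. The function name `split_filenames` is hypothetical, and it works on plain lists of names rather than moving any files.

```python
import random

def split_filenames(filenames, val_fraction=0.2, seed=1337):
    """Return (train_names, val_names) using a reproducible shuffle."""
    names = sorted(filenames)            # fix the order before shuffling
    random.Random(seed).shuffle(names)   # same seed + same order -> same shuffle
    num_val = int(val_fraction * len(names))
    if num_val == 0:
        # names[-0:] would return the whole list, so handle this case explicitly
        return names, []
    return names[:-num_val], names[-num_val:]

# Toy file names; with 10 names and a 20% split, 2 go to validation.
train_names, val_names = split_filenames([f"{i}_7.txt" for i in range(10)])
print(len(train_names), len(val_names))
```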

Create three Dataset objects for training, validation, and testing of text files using the text_dataset_from_directory utility.


In [23]:

from tensorflow import keras

batch_size = 32

train_ds = keras.utils.text_dataset_from_directory("aclImdb/train", 
                                                   batch_size=batch_size)

val_ds = keras.utils.text_dataset_from_directory("aclImdb/val", 
                                                 batch_size=batch_size)

test_ds = keras.utils.text_dataset_from_directory("aclImdb/test", 
                                                  batch_size=batch_size)

Found 20000 files belonging to 2 classes.
Found 5000 files belonging to 2 classes.
Found 25000 files belonging to 2 classes.


The datasets yield 
* inputs as TensorFlow tf.string tensors and 
* targets as int32 tensors encoding the value 0 or 1.

In [24]:
# Display the shapes and dtypes of the first batch
for inputs, targets in train_ds:
  print("inputs.shape:", inputs.shape)
  print("inputs.dtype:", inputs.dtype)
  print("targets.shape:", targets.shape)
  print("targets.dtype:", targets.dtype)
  print("inputs[0]:", inputs[0])
  print("targets[0]:", targets[0])
  break

inputs.shape: (32,)
inputs.dtype: <dtype: 'string'>
targets.shape: (32,)
targets.dtype: <dtype: 'int32'>
inputs[0]: tf.Tensor(b'I never attended the midnight showing of a movie before "Dick Tracy" came out.<br /><br />I still have the "t-shirt ticket" I had to wear to get admitted to the showing around here somewhere and, like that shirt, "Dick Tracy" has stuck with me ever since.<br /><br />If you\'ve seen the movie, the sharp visuals, bright primary colors and strong characters have no doubt been etched into your brain. It\'s a wonder to behold.<br /><br />As director/star/co-writer/producer, Beatty knows what works in a film and shows it here, taking a familiar American icon and re-creating him for a whole new era. Still set in the \'30s, "Tracy" has a kind of timeless quality like all good films do. I\'ve lost track of how many times I\'ve watched "Tracy" and I still catch something new every time I do.<br /><br />The others are all top notch, starting with Pacino\'s Big Boy Capric

* Process the raw text datasets with a TextVectorization layer.
* The layer outputs multi-hot encoded binary word vectors.
* This first version uses single words, i.e., unigrams.
* Limit the vocabulary to the 20,000 most frequent words. In general, 20,000 is a reasonable vocabulary size for text classification.
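
To make the idea of multi-hot encoding concrete, here is a minimal pure-Python sketch on a toy vocabulary. It is an illustration only and not the TextVectorization implementation: the function `multi_hot`, the six-word vocabulary, and the choice of slot 0 for unknown words are assumptions made for this example, and the sketch only lowercases and splits on whitespace.

```python
def multi_hot(text, vocabulary):
    """Encode text as a 0/1 vector with one slot per vocabulary entry."""
    index = {word: i for i, word in enumerate(vocabulary)}
    vector = [0.0] * len(vocabulary)
    for token in text.lower().split():
        # Out-of-vocabulary words map to slot 0, reserved here for "[UNK]".
        vector[index.get(token, 0)] = 1.0
    return vector

vocab = ["[UNK]", "the", "movie", "was", "great", "bad"]
# "great" appears twice but still yields a single 1.0: only presence is recorded.
print(multi_hot("The movie was great great", vocab))
# prints [0.0, 1.0, 1.0, 1.0, 1.0, 0.0]
```

Word order and word counts are both discarded, which is why this representation is called a bag of words.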


In [30]:
# Import layers from tensorflow.keras, the same package used for keras above
from tensorflow.keras import layers

text_vectorization = layers.TextVectorization(max_tokens=20000,
                                       # Encode the output tokens as multi-hot binary vectors
                                       output_mode="multi_hot")

# Prepare a dataset that only yields raw text inputs (no labels).
text_only_train_ds = train_ds.map(lambda x, y: x)

# Use that dataset to index the dataset vocabulary via the adapt() method.
text_vectorization.adapt(text_only_train_ds)

# Prepare processed versions of the training, validation, and test datasets.

binary_1gram_train_ds = train_ds.map(lambda x, y: (text_vectorization(x), y),
                                     # Specify num_parallel_calls to leverage multiple CPU cores.
                                     num_parallel_calls=4)

binary_1gram_val_ds = val_ds.map(lambda x, y: (text_vectorization(x), y),
                                 num_parallel_calls=4)

binary_1gram_test_ds = test_ds.map(lambda x, y: (text_vectorization(x), y),
                                   num_parallel_calls=4)

In [31]:
# Inspect the output of our binary unigram dataset
for inputs, targets in binary_1gram_train_ds:
  print("inputs.shape:", inputs.shape)
  print("inputs.dtype:", inputs.dtype)
  print("targets.shape:", targets.shape)
  print("targets.dtype:", targets.dtype)
  print("inputs[0]:", inputs[0])
  print("targets[0]:", targets[0])
  break

inputs.shape: (32, 20000)
inputs.dtype: <dtype: 'float32'>
targets.shape: (32,)
targets.dtype: <dtype: 'int32'>
inputs[0]: tf.Tensor([1. 1. 1. ... 0. 0. 0.], shape=(20000,), dtype=float32)
targets[0]: tf.Tensor(1, shape=(), dtype=int32)


In [32]:
#  model-building utility
from tensorflow import keras
from tensorflow.keras import layers

def get_model(max_tokens=20000, hidden_dim=16):
  inputs = keras.Input(shape=(max_tokens,))
  x = layers.Dense(hidden_dim, activation="relu")(inputs)
  x = layers.Dropout(0.5)(x)
  outputs = layers.Dense(1, activation="sigmoid")(x)
  model = keras.Model(inputs, outputs)
  model.compile(optimizer="rmsprop",
                loss="binary_crossentropy",
                metrics=["accuracy"])
  
  return model

In [33]:
# Train and test the binary unigram model

model = get_model()
model.summary()

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, 20000)]           0         
                                                                 
 dense (Dense)               (None, 16)                320016    
                                                                 
 dropout (Dropout)           (None, 16)                0         
                                                                 
 dense_1 (Dense)             (None, 1)                 17        
                                                                 
Total params: 320,033
Trainable params: 320,033
Non-trainable params: 0
_________________________________________________________________
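
The parameter counts in the summary can be checked by hand: a Dense layer has one weight per input-unit pair plus one bias per unit. The short sketch below reproduces the numbers; the helper `dense_params` is a hypothetical name introduced only for this check.

```python
def dense_params(input_dim, units):
    """Weights (input_dim * units) plus one bias per unit."""
    return input_dim * units + units

hidden = dense_params(20000, 16)   # 20000 * 16 + 16 = 320016
output = dense_params(16, 1)       # 16 * 1 + 1 = 17
print(hidden, output, hidden + output)
# prints 320016 17 320033
```

The Dropout layer has no trainable parameters, so the total matches the 320,033 reported above.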


In [34]:
callbacks = [keras.callbacks.ModelCheckpoint("binary_1gram.keras",
                                             save_best_only=True)]

# Call cache() on the datasets to cache them in memory.
# This preprocesses the texts once, during the first epoch, and reuses the preprocessed texts for the following epochs.
# This can only be done if the data is small enough to fit in memory.

model.fit(binary_1gram_train_ds.cache(),
          validation_data=binary_1gram_val_ds.cache(),
          epochs=10,
          callbacks=callbacks)

model = keras.models.load_model("binary_1gram.keras")
print(f"Test acc: {model.evaluate(binary_1gram_test_ds)[1]:.3f}")

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test acc: 0.888


The unigram model reaches a test accuracy of 88.8%. Try to improve on it by using bi-grams in the text vectorization, so that some local word order is preserved.
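
As an illustration of what bi-grams add, the sketch below lists the unigrams and bigrams of a short sentence. It is a toy tokenizer written for this example (the function `word_ngrams` is hypothetical), not the layer's own code; it assumes, as the Keras documentation describes for an integer `ngrams` value, that n-grams up to that length are produced.

```python
def word_ngrams(text, max_n=2):
    """Return all word n-grams of length 1..max_n, shortest first."""
    tokens = text.lower().split()
    grams = []
    for n in range(1, max_n + 1):
        # Slide a window of n consecutive tokens over the sentence.
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

print(word_ngrams("this movie was not good"))
```

A pair such as "not good" becomes a single token, which a unigram model cannot distinguish from "good" appearing on its own.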

In [35]:
# Text vectorization with bi-grams
text_vectorization = layers.TextVectorization(ngrams=2,
                                              max_tokens=20000,
                                              output_mode="multi_hot",)



In [36]:
text_vectorization.adapt(text_only_train_ds)

binary_2gram_train_ds = train_ds.map(lambda x, y: (text_vectorization(x), y),
                                     num_parallel_calls=4)

binary_2gram_val_ds = val_ds.map(lambda x, y: (text_vectorization(x), y),
                                 num_parallel_calls=4)

binary_2gram_test_ds = test_ds.map(lambda x, y: (text_vectorization(x), y),
                                   num_parallel_calls=4)

# invoke the model and summary
model = get_model()
model.summary()

Model: "model_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_2 (InputLayer)        [(None, 20000)]           0         
                                                                 
 dense_2 (Dense)             (None, 16)                320016    
                                                                 
 dropout_1 (Dropout)         (None, 16)                0         
                                                                 
 dense_3 (Dense)             (None, 1)                 17        
                                                                 
Total params: 320,033
Trainable params: 320,033
Non-trainable params: 0
_________________________________________________________________


In [37]:
# create checkpoint using callback
callbacks = [keras.callbacks.ModelCheckpoint("binary_2gram.keras",
                                             save_best_only=True)]
# Train the model and perform validation
model.fit(binary_2gram_train_ds.cache(),
          validation_data=binary_2gram_val_ds.cache(),
          epochs=10,
          callbacks=callbacks)


model = keras.models.load_model("binary_2gram.keras")

# verify accuracy of the bigram model
print(f"Test acc: {model.evaluate(binary_2gram_test_ds)[1]:.3f}")

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test acc: 0.898


### Bi-grams with TF-IDF encoding
* Add a bit more information to the bi-gram representation by counting how many times each word or N-gram occurs.
* In other words, take the histogram of the words over the text.
* In text classification, knowing how many times a word occurs in a sample is critical.
* TF-IDF then scales each count down for terms that appear in many documents, so that very common terms carry less weight.
* For more on this topic, refer to the class notes [NLP] covered by Mr. Ashutosh Vyas.
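
The weighting can be sketched on a toy corpus as follows. This uses the textbook form tf * log(N / df), which is an assumption chosen for clarity; the exact smoothing used by the TextVectorization layer may differ. The function `tf_idf` and the three sample documents are invented for this example.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    num_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        term_freq = Counter(tokens)  # raw count of each term in this document
        weights.append({term: count * math.log(num_docs / doc_freq[term])
                        for term, count in term_freq.items()})
    return weights

docs = ["great movie great cast", "bad movie", "great fun"]
for doc, w in zip(docs, tf_idf(docs)):
    print(doc, "->", {t: round(v, 3) for t, v in w.items()})
```

In the first document, "cast" occurs once in the whole corpus and so gets a higher weight than "movie", which occurs in two of the three documents.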

In [38]:
# Configure the TextVectorization layer to return TF-IDF
text_vectorization = layers.TextVectorization(ngrams=2,
                                       max_tokens=20000,
                                       output_mode="tf_idf")



In [39]:
text_vectorization.adapt(text_only_train_ds)

tfidf_2gram_train_ds = train_ds.map(lambda x, y: (text_vectorization(x), y),
                                    num_parallel_calls=4)

tfidf_2gram_val_ds = val_ds.map(lambda x, y: (text_vectorization(x), y),
                                num_parallel_calls=4)

tfidf_2gram_test_ds = test_ds.map(lambda x, y: (text_vectorization(x), y),
                                  num_parallel_calls=4)

model = get_model()
model.summary()

Model: "model_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_3 (InputLayer)        [(None, 20000)]           0         
                                                                 
 dense_4 (Dense)             (None, 16)                320016    
                                                                 
 dropout_2 (Dropout)         (None, 16)                0         
                                                                 
 dense_5 (Dense)             (None, 1)                 17        
                                                                 
Total params: 320,033
Trainable params: 320,033
Non-trainable params: 0
_________________________________________________________________


In [40]:
callbacks = [keras.callbacks.ModelCheckpoint("tfidf_2gram.keras",
                                             save_best_only=True)]

model.fit(tfidf_2gram_train_ds.cache(),
          validation_data=tfidf_2gram_val_ds.cache(),
          epochs=10,
          callbacks=callbacks)

model = keras.models.load_model("tfidf_2gram.keras")
print(f"Test acc: {model.evaluate(tfidf_2gram_test_ds)[1]:.3f}")

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test acc: 0.892


With TF-IDF encoding, the bi-gram model reaches a test accuracy of 89.2%, slightly lower than the 89.8% obtained with binary bi-gram vectorization without TF-IDF.

In [41]:
## Incorporating text preprocessing in model development

# One input sample would be one string
inputs = keras.Input(shape=(1,), dtype="string")

# Apply text preprocessing.
processed_inputs = text_vectorization(inputs)

# Apply the previously trained model.
outputs = model(processed_inputs)

# Instantiate the end-to-end model.
inference_model = keras.Model(inputs, outputs)

In [42]:
import tensorflow as tf
raw_text_data = tf.convert_to_tensor([
["That was an excellent movie, I loved it."],
])

predictions = inference_model(raw_text_data)

print(f"{float(predictions[0] * 100):.2f} percent positive")

96.40 percent positive
