# Simple Sentiment Classification: Bag of Words

In this notebook, we consider *sentiment classification*, a standard task in natural language processing. Based on a review of a movie (or a restaurant, hotel, etc.), we want to predict whether the person liked the movie or not. As an example, we use a data set provided by the Internet Movie Database website www.imdb.com. The provided reviews carry a binary label indicating whether they are positive (label 1) or negative (label 0).

## Set-up
First of all, we load the libraries needed for this task. We use Keras and TensorFlow for this code example, so we load the relevant parts of these frameworks:

In [None]:
import numpy as np
import tensorflow as tf
import tensorflow_datasets as tfds

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation, Dropout
from tensorflow.keras.layers import Input, TextVectorization, Embedding, Conv1D, Flatten, LSTM, Bidirectional
from tensorflow.keras.losses import BinaryCrossentropy
from tensorflow.keras.optimizers import Adam, RMSprop
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.initializers import Constant

In [None]:
# initialize random number generators to ensure reproducibility:
tf.random.set_seed(123)
np.random.seed(123)

In [None]:
# some more general libraries for evaluation purposes:
import matplotlib.pyplot as plt
import datetime

In [None]:
# set some model parameters
VOCAB_SIZE = 1000
NUM_EPOCHS = 50 # set lower for fast results - set higher for good results
BUFFER_SIZE = 10000
BATCH_SIZE = 512
EMBED_DIM = 100

In [None]:
# Configurations
tf.config.run_functions_eagerly(True)

import os
os.environ["KMP_DUPLICATE_LIB_OK"]="TRUE"

## Loading the Data
The IMDb data set is available via the library `tensorflow_datasets`, which allows for easy access. The data is already split into a TRAIN and a TEST set, each containing 25,000 labeled reviews. To also obtain a validation set, we further divide the TRAIN data set into 80% training and 20% validation data. The labels come directly with the respective texts:

In [None]:
train_ds, val_ds, test_ds = tfds.load(
    name = "imdb_reviews",
    split = [ 'train[:80%]', 'train[80%:]', 'test' ],
    as_supervised = True)
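
As a quick sanity check on the split, the arithmetic behind `train[:80%]` and `train[80%:]` can be sketched as follows. The helper `split_sizes` is a hypothetical name introduced only for this illustration; it reproduces the percentages, not the internals of `tensorflow_datasets`, whose rounding for sizes that do not divide evenly may differ.

```python
def split_sizes(num_examples, train_percent):
    """Return (training size, validation size) for a percentage split.

    Hypothetical helper for illustration; integer arithmetic avoids
    floating-point rounding surprises.
    """
    num_train = num_examples * train_percent // 100
    return num_train, num_examples - num_train


# The TRAIN set contains 25,000 reviews, split 80% / 20%.
num_train, num_val = split_sizes(25000, 80)
print(num_train, num_val)
```

With the numbers from the text, this gives 20,000 training and 5,000 validation reviews, which is what `train_ds.cardinality()` and `val_ds.cardinality()` should report.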

The data sets are of a special data type that is optimized for handling large amounts of data and processing them on multiple machines in parallel. We can get the number of samples as follows:

In [None]:
train_ds.cardinality()

First we look at some examples from the training data set:

In [None]:
for example, label in train_ds.take(5):
  print("Input: ", example)
  print(10*".")
  print('Target labels: ', label)
  print(50*"-")

## Text Representation: Bag of Words
The text is raw text without much formatting, but it also includes some HTML tags. For a first try, we start with a simple `TextVectorization` layer, which is similar to a bag of words. We use a vocabulary size of 1000 words:

In [None]:
encoderBoW = TextVectorization(output_mode = "count", max_tokens=VOCAB_SIZE)
encoderBoW.adapt(train_ds.map(lambda text, label: text))

The `.adapt()` method chooses the vocabulary of the layer based on the training data - this corresponds to a kind of training, but is done in an explicit step (essentially counting and sorting). The function `map(lambda text, label: text)` ensures that we only use the texts from the `train_ds` data set (and leave out the labels for the moment).
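
The counting-and-sorting idea can be sketched in plain Python. This is a simplified illustration, not the layer's actual implementation: the function `build_vocabulary` and the toy texts are assumptions made here, the real layer also strips punctuation, and its tie-breaking between equally frequent words may differ.

```python
from collections import Counter


def build_vocabulary(texts, max_tokens):
    """Return '[UNK]' followed by the (max_tokens - 1) most frequent words."""
    counts = Counter()
    for text in texts:
        # Simplified tokenization: lowercase and split on whitespace.
        counts.update(text.lower().split())
    # Descending frequency; ties broken alphabetically (an assumption
    # made here to keep the result deterministic).
    ranked = sorted(counts, key=lambda word: (-counts[word], word))
    return ["[UNK]"] + ranked[: max_tokens - 1]


toy_texts = ["the movie was great", "the film was bad", "the plot was thin"]
toy_vocab = build_vocabulary(toy_texts, max_tokens=4)
print(toy_vocab)
```

With `max_tokens=4`, only three real words fit next to `[UNK]`; all remaining words would later be counted as unknown.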

As we work with a `VOCAB_SIZE` of 1000, chances are high that some of the words will not be represented in the vocabulary. `.adapt()` keeps the 999 most common words, and all other words are represented as `[UNK]` (for *unknown*). The tokens are sorted in descending order of frequency. Here are the first 20 tokens (i.e., the most frequent ones):

In [None]:
vocab = np.array(encoderBoW.get_vocabulary())
vocab[:20]

The first word in the vocabulary is `[UNK]`, the token for the unknown words. Afterwards, we have a number of tokens for very common words, the so-called **stop words**, the first one being 'the'. So, in the numerical vector that we get after encoding, the first column corresponds to all unknown words (i.e. all words that do not appear in the vocabulary), and the second column corresponds to the word 'the'. Some *domain-specific* words also occur frequently: `movie` and `film` indicate that the vocabulary was built on movie reviews.

We can now get an example encoding:

In [None]:
encoderBoW("the")

In [None]:
print(example)

In [None]:
encoderBoW(example).numpy()
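
To make the structure of this count vector concrete, here is a minimal pure-Python sketch of the same encoding on a toy vocabulary. The function `count_encode` and the five-token vocabulary are assumptions for illustration; the real layer additionally removes punctuation before counting.

```python
def count_encode(text, vocab):
    """Count how often each vocabulary token occurs in the text.

    Position 0 collects all words that are not in the vocabulary.
    """
    index = {token: i for i, token in enumerate(vocab)}
    counts = [0] * len(vocab)
    for word in text.lower().split():
        counts[index.get(word, 0)] += 1
    return counts


toy_vocab = ["[UNK]", "the", "movie", "was", "good"]
encoded = count_encode("The movie was good the ending was odd", toy_vocab)
print(encoded)  # one count per vocabulary entry
```

The words "ending" and "odd" are not in the toy vocabulary, so both are added to the first position. Note that the order of the words in the sentence is lost entirely.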

# A first Model: Linear Regression
We start with a first, very simple model. It corresponds to a multiple linear regression model, with the 1000-dimensional vector representation of the sentence as input (independent variables, predictors), and the rating (0 or 1) as output (dependent variable, target variable). While this could also be done with a classical regression, we start right away with a neural network, and will extend it gradually as we add more advanced concepts.
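
In other words, the model computes a weighted sum of the word counts plus a bias. A minimal sketch of this computation, with hypothetical weights chosen only for illustration (they are not learned values):

```python
def linear_score(counts, weights, bias):
    """Multiple linear regression: weighted sum of the inputs plus intercept."""
    return sum(c * w for c, w in zip(counts, weights)) + bias


# Toy count vector for the vocabulary ["[UNK]", "the", "movie", "was", "good"].
toy_counts = [2, 2, 1, 2, 1]
# Hypothetical weights: only "movie" and "good" influence the score.
toy_weights = [0.0, 0.0, 0.1, 0.0, 0.5]
toy_bias = 0.2

score = linear_score(toy_counts, toy_weights, toy_bias)
print(round(score, 3))
```

A single `Dense(1)` layer without activation performs exactly this kind of computation, with one weight per vocabulary entry.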

## Model Definition

In [None]:
model_BoW_1L_lin = Sequential()
model_BoW_1L_lin.add(Input(shape=(1,), dtype='string'))
model_BoW_1L_lin.add(encoderBoW)
model_BoW_1L_lin.add(Dense(1))

In [None]:
model_BoW_1L_lin.compile(loss = 'binary_crossentropy', metrics = ['accuracy'])

In [None]:
model_BoW_1L_lin.summary()

## Training
Now, let's train the model. We limit the number of epochs and include early stopping so as not to spend too much time on training.
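
The stopping rule can be sketched as follows: training stops once the monitored value has not improved for `patience` consecutive epochs, and the best epoch is remembered so that its weights can be restored. The function `early_stopping_epoch` and the validation accuracies are hypothetical and only illustrate the logic; the sketch assumes that only a strict improvement resets the counter.

```python
def early_stopping_epoch(val_scores, patience):
    """Return (stop_epoch, best_epoch), both counted from 1."""
    best_score = float("-inf")
    best_epoch = 0
    for epoch, score in enumerate(val_scores, start=1):
        if score > best_score:
            # Strict improvement: remember this epoch as the best one.
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs in a row: stop.
            return epoch, best_epoch
    return len(val_scores), best_epoch


# Hypothetical validation accuracies, one value per epoch.
toy_val_acc = [0.80, 0.83, 0.84, 0.84, 0.83, 0.83, 0.82, 0.82, 0.81, 0.80]
stop_epoch, best_epoch = early_stopping_epoch(toy_val_acc, patience=5)
print(stop_epoch, best_epoch)
```

In this toy run the best value is reached in epoch 3, so with a patience of 5 the training ends after epoch 8 and the weights of epoch 3 would be restored.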

In [None]:
# initialize random number generators to ensure reproducibility:
tf.random.set_seed(123)
np.random.seed(123)

In [None]:
history_BoW_1L_lin = model_BoW_1L_lin.fit(
    train_ds.shuffle(buffer_size=BUFFER_SIZE).batch(BATCH_SIZE),
    validation_data = val_ds.batch(BATCH_SIZE),
    epochs = NUM_EPOCHS, verbose = 1,
    callbacks = [ EarlyStopping(monitor='val_accuracy', patience=5,
                                verbose=False, restore_best_weights=True)])

In [None]:
model_BoW_1L_lin.save_weights('model_BoW_1L_lin')

## Evaluation
Our simple model already has an accuracy of some 84% on the validation data. Let's check the performance on the test data:

In [None]:
results1 = model_BoW_1L_lin.evaluate(test_ds.batch(BATCH_SIZE), verbose=2)

for name, value in zip(model_BoW_1L_lin.metrics_names, results1):
    print("%s: %.3f" % (name, value))

We get similar performance on the previously unseen test data.

## Interpretation
We want to try to interpret what the model has learned. To do so, we look at the weights that have been inferred.

In [None]:
model_BoW_1L_lin_weights = model_BoW_1L_lin.layers[1].get_weights()[0]
model_BoW_1L_lin_weights

The list returned by `get_weights()` has two entries: `get_weights()[0]` contains the 1000 weights for the words in the vocabulary, while `get_weights()[1]` is the bias or intercept of the linear regression. Above, we kept only the first entry in `model_BoW_1L_lin_weights`. We now search for the indices with the largest values -- these will be the words that are the most positive:


In [None]:
model_BoW_1L_lin_sortOrder = np.argsort(model_BoW_1L_lin_weights, axis=0)
vocab[model_BoW_1L_lin_sortOrder[-5:]]

That seems plausible! Similarly, we can look for the words that best indicate a bad review:

In [None]:
vocab[model_BoW_1L_lin_sortOrder[:5]]
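
The ranking step is easiest to follow on a toy example. The sketch below uses a plain-Python stand-in for `np.argsort` together with a hypothetical six-word vocabulary and made-up weights; none of these values come from the trained model.

```python
def argsort(values):
    """Indices that would sort the values in ascending order."""
    return sorted(range(len(values)), key=lambda i: values[i])


toy_vocab = ["[UNK]", "the", "great", "awful", "boring", "excellent"]
# Hypothetical weights: positive values push towards a positive review.
toy_weights = [0.01, -0.02, 0.9, -1.1, -0.7, 1.2]

order = argsort(toy_weights)
most_negative = [toy_vocab[i] for i in order[:2]]
most_positive = [toy_vocab[i] for i in order[-2:]]
print(most_negative, most_positive)
```

Because the order is ascending, the first indices belong to the most negative words and the last indices to the most positive ones. Stop words such as "the" typically end up with weights close to zero.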

## Evaluate Performance Development
Using the history (which we got from the `fit()` function), we can also check the evolution of the performance over the training epochs:

In [None]:
history_BoW_1L_lin_dict = history_BoW_1L_lin.history

train_acc_BoW_1L_lin = history_BoW_1L_lin_dict['accuracy']
val_acc_BoW_1L_lin   = history_BoW_1L_lin_dict['val_accuracy']

train_loss_BoW_1L_lin = history_BoW_1L_lin_dict['loss']
val_loss_BoW_1L_lin   = history_BoW_1L_lin_dict['val_loss']

epochs_BoW_1L_lin    = range(1, len(train_acc_BoW_1L_lin) + 1)

In [None]:
# upper panel: loss
plt.subplot(2, 1, 1)
plt.plot(epochs_BoW_1L_lin, train_loss_BoW_1L_lin, 'b:', label='model_BoW_1L_lin, Training loss')
plt.plot(epochs_BoW_1L_lin, val_loss_BoW_1L_lin,   'b',  label='model_BoW_1L_lin, Validation loss')
plt.title('Training and validation loss')
plt.legend()  # show the labels defined above
plt.grid(True)

# lower panel: accuracy
plt.subplot(2, 1, 2)
plt.plot(epochs_BoW_1L_lin, train_acc_BoW_1L_lin, 'r:', label='model_BoW_1L_lin, Training acc')
plt.plot(epochs_BoW_1L_lin, val_acc_BoW_1L_lin,   'r',  label='model_BoW_1L_lin, Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.legend()
plt.grid(True)
plt.tight_layout()  # avoid overlapping titles and axis labels
plt.show()

# A slightly more advanced Model: Logistic Regression
As the output is binary, we might also see the task as binary classification, and logistic regression seems a natural choice. In the model above, we only need to change the activation function to `sigmoid`, and we already have a neural network doing logistic regression:

In [None]:
# initialize random number generators to ensure reproducibility:
tf.random.set_seed(123)
np.random.seed(123)

In [None]:
model_BoW_1L_sig = Sequential()
model_BoW_1L_sig.add(Input(shape=(1,), dtype='string'))
model_BoW_1L_sig.add(encoderBoW)
model_BoW_1L_sig.add(Dense(1, activation="sigmoid"))

In [None]:
model_BoW_1L_sig.compile(loss = 'binary_crossentropy', metrics = ['accuracy'])
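
What the sigmoid activation and the binary cross-entropy loss compute can be sketched in a few lines of plain Python. The score of 2.0 is a hypothetical value for illustration; the formulas are the standard definitions, not code taken from Keras.

```python
import math


def sigmoid(z):
    """Map a real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def binary_cross_entropy(label, prob):
    """Loss for one example: small if prob agrees with the label (0 or 1)."""
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))


score = 2.0  # hypothetical output of the linear part, w * x + b
prob = sigmoid(score)
print(round(prob, 3))
print(round(binary_cross_entropy(1, prob), 3))  # label agrees: small loss
print(round(binary_cross_entropy(0, prob), 3))  # label disagrees: large loss
```

A score of 0 corresponds to a probability of exactly 0.5, i.e. the model is undecided; positive scores lean towards a positive review.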

In [None]:
history_BoW_1L_sig = model_BoW_1L_sig.fit(
    train_ds.shuffle(buffer_size=BUFFER_SIZE).batch(BATCH_SIZE),
    validation_data = val_ds.batch(BATCH_SIZE),
    epochs = NUM_EPOCHS, verbose = 1,
    callbacks = [ EarlyStopping(monitor='val_accuracy', patience=5,
                                verbose=False, restore_best_weights=True)])

# Evaluate Training Progress
history_BoW_1L_sig_dict = history_BoW_1L_sig.history

train_acc_BoW_1L_sig = history_BoW_1L_sig_dict['accuracy']
val_acc_BoW_1L_sig   = history_BoW_1L_sig_dict['val_accuracy']
epochs_BoW_1L_sig    = range(1, len(train_acc_BoW_1L_sig) + 1)

In [None]:
model_BoW_1L_sig.save_weights('model_BoW_1L_sig')

In [None]:
plt.plot(epochs_BoW_1L_lin, train_acc_BoW_1L_lin, 'r:', label='model_BoW_1L_lin, Training acc')
plt.plot(epochs_BoW_1L_lin, val_acc_BoW_1L_lin,   'r',  label='model_BoW_1L_lin, Validation acc')
plt.plot(epochs_BoW_1L_sig, train_acc_BoW_1L_sig, 'g:', label='model_BoW_1L_sig, Training acc')
plt.plot(epochs_BoW_1L_sig, val_acc_BoW_1L_sig,   'g',  label='model_BoW_1L_sig, Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend(loc='lower right')
plt.grid(True)
plt.show()

# Word Embeddings
Until now, we have represented all words as an index (represented as an integer number or as one-hot encoding). This representation does not take similarities in the meaning of words into account - two words are either the same or different.

Using a word embedding, we get a more detailed notion of word similarity. In a high-dimensional representation, the distance between two words will be defined by how similar they are (as inferred from the training data). While word embeddings are usually used in connection with more complex models, we start here with a simple model: the logistic regression from above, extended by an embedding layer.
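
The difference between the two representations can be illustrated with cosine similarity, one common way to compare vectors. The three-dimensional embedding vectors below are made up for this sketch; real embeddings are learned during training and have `EMBED_DIM` dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1 = same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# One-hot vectors: any two different words have similarity 0.
one_hot_good = [1, 0, 0]
one_hot_great = [0, 1, 0]
print(cosine_similarity(one_hot_good, one_hot_great))

# Hypothetical embedding vectors (illustrative values only).
toy_embeddings = {
    "good": [0.9, 0.1, 0.3],
    "great": [0.8, 0.2, 0.4],
    "awful": [-0.7, 0.1, -0.5],
}
sim_close = cosine_similarity(toy_embeddings["good"], toy_embeddings["great"])
sim_far = cosine_similarity(toy_embeddings["good"], toy_embeddings["awful"])
print(round(sim_close, 2), round(sim_far, 2))
```

With one-hot vectors, "good" and "great" are as unrelated as any other pair of words, while dense vectors can place them close together and far away from "awful".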

In [None]:
# initialize random number generators to ensure reproducibility:
tf.random.set_seed(123)
np.random.seed(123)

In [None]:
model_embed_BoW_1L_sig = Sequential()
model_embed_BoW_1L_sig.add(Input(shape=(1,), dtype='string'))
model_embed_BoW_1L_sig.add(encoderBoW)
model_embed_BoW_1L_sig.add(Embedding(VOCAB_SIZE, EMBED_DIM))
model_embed_BoW_1L_sig.add(Flatten())
model_embed_BoW_1L_sig.add(Dense(1, activation="sigmoid"))

In [None]:
model_embed_BoW_1L_sig.compile(loss = 'binary_crossentropy', metrics = ['accuracy'])

In [None]:
model_embed_BoW_1L_sig.summary()

In [None]:
history_embed_BoW_1L_sig = model_embed_BoW_1L_sig.fit(
    train_ds.shuffle(buffer_size=BUFFER_SIZE).batch(BATCH_SIZE),
    validation_data = val_ds.batch(BATCH_SIZE),
    epochs = NUM_EPOCHS, verbose = 1,
    callbacks = [ EarlyStopping(monitor='val_accuracy', patience=5,
                                verbose=False, restore_best_weights=True)])

In [None]:
model_embed_BoW_1L_sig.save_weights('model_embed_BoW_1L_sig')

In [None]:
# Evaluate Training Progress
history_embed_BoW_1L_sig_dict = history_embed_BoW_1L_sig.history

train_acc_embed_BoW_1L_sig_dict = history_embed_BoW_1L_sig_dict['accuracy']
val_acc_embed_BoW_1L_sig_dict   = history_embed_BoW_1L_sig_dict['val_accuracy']
epochs_embed_BoW_1L_sig_dict    = range(1, len(train_acc_embed_BoW_1L_sig_dict) + 1)

In [None]:
plt.plot(epochs_BoW_1L_lin, train_acc_BoW_1L_lin, 'r:', label='model_BoW_1L_lin, Training acc')
plt.plot(epochs_BoW_1L_lin, val_acc_BoW_1L_lin,   'r',  label='model_BoW_1L_lin, Validation acc')

plt.plot(epochs_BoW_1L_sig, train_acc_BoW_1L_sig, 'g:', label='model_BoW_1L_sig, Training acc')
plt.plot(epochs_BoW_1L_sig, val_acc_BoW_1L_sig,   'g',  label='model_BoW_1L_sig, Validation acc')

plt.plot(epochs_embed_BoW_1L_sig_dict, train_acc_embed_BoW_1L_sig_dict, 'b:', label='model_embed_BoW_1L_sig, Training acc')
plt.plot(epochs_embed_BoW_1L_sig_dict, val_acc_embed_BoW_1L_sig_dict,   'b',  label='model_embed_BoW_1L_sig, Validation acc')

plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend(loc='lower right')
plt.grid(True)
plt.show()

The plot shows us several things:
* Embeddings do yield somewhat better results, but the improvement is not exciting - we might have expected more.
* Overfitting is becoming an issue: we reach the optimal performance on the validation data after only a few epochs - afterwards, training performance continues to go up, but validation performance decreases.

Hence, we need some more advanced concepts - especially, as we still ignore the order of the words.
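
The overfitting pattern described above can also be read off the history numerically. The following sketch uses hypothetical accuracy curves (not the values from this notebook) and an illustrative helper `best_epoch_and_gap`:

```python
def best_epoch_and_gap(train_acc, val_acc):
    """Return the epoch (from 1) with the best validation accuracy and the
    final difference between training and validation accuracy."""
    best_index = max(range(len(val_acc)), key=lambda i: val_acc[i])
    final_gap = train_acc[-1] - val_acc[-1]
    return best_index + 1, final_gap


# Hypothetical curves: training keeps improving, validation peaks early.
toy_train_acc = [0.70, 0.80, 0.86, 0.90, 0.93, 0.95]
toy_val_acc = [0.72, 0.80, 0.84, 0.85, 0.84, 0.83]

best_epoch, final_gap = best_epoch_and_gap(toy_train_acc, toy_val_acc)
print(best_epoch, round(final_gap, 2))
```

An early best epoch combined with a growing gap between the two curves is the typical signature of overfitting.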

## Saving (for later use)

In [None]:
BoW_Vec_Results = {
    'train_acc_BoW_1L_lin': train_acc_BoW_1L_lin,
    'val_acc_BoW_1L_lin'  : val_acc_BoW_1L_lin,
    'epochs_BoW_1L_lin'   : epochs_BoW_1L_lin,

    'train_acc_BoW_1L_sig': train_acc_BoW_1L_sig,
    'val_acc_BoW_1L_sig'  : val_acc_BoW_1L_sig,
    'epochs_BoW_1L_sig'   : epochs_BoW_1L_sig,

    'train_acc_embed_BoW_1L_sig_dict': train_acc_embed_BoW_1L_sig_dict,
    'val_acc_embed_BoW_1L_sig_dict': val_acc_embed_BoW_1L_sig_dict,
    'epochs_embed_BoW_1L_sig_dict': epochs_embed_BoW_1L_sig_dict
}

In [None]:
import pickle

with open('BoW_Vec_Results.pkl', 'wb') as f:
    pickle.dump(BoW_Vec_Results, f)
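
For later use, the results can be read back with `pickle.load`. The following round trip uses a small hypothetical dictionary and a temporary file rather than `BoW_Vec_Results.pkl`, so that it runs on its own:

```python
import os
import pickle
import tempfile

# Hypothetical results with the same kinds of values as above
# (lists of accuracies and a range of epochs).
toy_results = {"val_acc": [0.80, 0.84], "epochs": range(1, 3)}

# Write to a temporary directory so no existing file is overwritten.
path = os.path.join(tempfile.mkdtemp(), "toy_results.pkl")
with open(path, "wb") as f:
    pickle.dump(toy_results, f)

# Read the dictionary back, e.g. in another notebook.
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored["val_acc"], list(restored["epochs"]))
```

Pickle files should only be loaded from trusted sources, as loading can execute arbitrary code.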