# Spam Classifier Word Embeddings - Solution

Let's take a look at word embeddings. In this notebook we want to:

1. Apply the [TensorFlow tutorial](https://www.tensorflow.org/tutorials/text/word_embeddings) to the Spam/Ham dataset.

2. Load the embeddings into the [Embedding Projector](http://projector.tensorflow.org/) to visualize them.

This notebook shows one way to load and prepare the Spam/Ham data so that these NLP models can be applied to it.

In [None]:
# importing all needed libraries and functions
import io
import re
import string
import tensorflow as tf

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

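from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Embedding, GlobalAveragePooling1D

try:  # newer TensorFlow releases expose the layer directly
    from tensorflow.keras.layers import TextVectorization
except ImportError:  # older releases only ship the experimental path
    from tensorflow.keras.layers.experimental.preprocessing import TextVectorization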

from tensorflow import keras
import matplotlib.pyplot as plt
from IPython.display import clear_output

## Loading data with pandas

We load the Spam/Ham data with pandas and then split it into a training, a validation and a test set, as usual with scikit-learn's train_test_split function.
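
The split in this notebook is done in two steps: 20 % of the full data is held out for testing, and 25 % of the remainder for validation. A small plain-Python sketch of that arithmetic shows that this amounts to a 60/20/20 split (the helper name split_fractions is made up for illustration):

```python
def split_fractions(test_size, val_size):
    """Return (train, val, test) fractions for a two-step split.

    test_size is taken from the full data, val_size from what remains.
    """
    test = test_size
    val = (1 - test_size) * val_size
    train = 1 - test - val
    return train, val, test


train, val, test = split_fractions(test_size=0.2, val_size=0.25)
print(f"train={train:.2f}, val={val:.2f}, test={test:.2f}")
# train=0.60, val=0.20, test=0.20
```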


In [None]:
# Load spam/ham data
dataframe_full = pd.read_csv(
    "./data/SMSSpamCollection.txt",
    encoding="utf-8",
    header=None,
    delimiter="\t",
    names=["target", "text"],
)

# Encoding target variable
dataframe_full["target"] = np.where(dataframe_full["target"] == "spam", 1, 0)

In [None]:
# First look at the data
dataframe_full.sample(5)

In [None]:
# Splitting data in train, validation and test set
dataframe_train, dataframe_test = train_test_split(dataframe_full, test_size=0.2, random_state=42)
dataframe_train, dataframe_val = train_test_split(dataframe_train, test_size=0.25, random_state=42)

In [None]:
def print_shape(dataframe):
    """Print number of observations and number of columns of the given dataframe.

    Args:
        dataframe (pandas DataFrame): Any pandas DataFrame
    """
    name = [x for x in globals() if globals()[x] is dataframe][0]
    print(f'There are {dataframe.shape[0]} observations and {dataframe.shape[1]} columns in the {name}.')


In [None]:
for i in [dataframe_train, dataframe_val, dataframe_test]:
    print_shape(i)


## Convert pandas DataFrames to tf.data.Dataset

After splitting the data, we need to convert the pandas DataFrames into tf.data.Dataset objects, which provide the texts and labels as TensorFlow tensors.

In [None]:
dataset_train = tf.data.Dataset.from_tensor_slices((dataframe_train.text, dataframe_train.target))
dataset_val = tf.data.Dataset.from_tensor_slices((dataframe_val.text, dataframe_val.target))
dataset_test = tf.data.Dataset.from_tensor_slices((dataframe_test.text, dataframe_test.target))

In [None]:
AUTOTUNE = tf.data.AUTOTUNE

dataset_train = dataset_train.cache().prefetch(buffer_size=AUTOTUNE)
dataset_val = dataset_val.cache().prefetch(buffer_size=AUTOTUNE)

## Text preprocessing
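
The standardization used in this notebook lowercases each message and deletes punctuation. A minimal plain-Python sketch of the same idea, without TensorFlow (the function name standardize and the sample message are made up for illustration):

```python
import re
import string

# One character class that matches every ASCII punctuation mark.
PUNCTUATION_PATTERN = re.compile("[%s]" % re.escape(string.punctuation))


def standardize(text):
    """Lowercase the text and delete all ASCII punctuation."""
    return PUNCTUATION_PATTERN.sub("", text.lower())


print(standardize("WINNER!! Claim your FREE prize now: call 0800-123"))
# winner claim your free prize now call 0800123
```

Note that punctuation is deleted rather than replaced by a space, so pieces on either side of it are merged: 0800-123 becomes 0800123.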

In [None]:
def custom_standardization(input_data):
    """Text preprocessing: lowercase the text and remove punctuation.

    Args:
        input_data (tf.Tensor): text, formatted as tf.string

    Returns:
        tf.Tensor: preprocessed text
    """
    text_lower = tf.strings.lower(input_data)
    return tf.strings.regex_replace(text_lower,
                                  '[%s]' % re.escape(string.punctuation), '')

In [None]:
# Vocabulary size and number of words in a sequence.
vocab_size = 7546  # taken from notebook 1
sequence_length = int(dataframe_train.text.apply(lambda x: len(x.split())).max())

# Use the text vectorization layer to normalize, split, and map strings to
# integers. Note that the layer uses the custom_standardization function defined above.
# Set output_sequence_length because the samples do not all have the same length.
vectorize_layer = TextVectorization(
    standardize=custom_standardization,
    max_tokens=vocab_size,
    output_mode='int',
    output_sequence_length=sequence_length)


In [None]:
# Make a text-only dataset (without labels), then call adapt
train_text = dataset_train.map(lambda x, y: x)
vectorize_layer.adapt(train_text)

In [None]:
print("1287 ---> ",vectorize_layer.get_vocabulary()[1287])
print(" 313 ---> ",vectorize_layer.get_vocabulary()[313])
print('Vocabulary size: {}'.format(len(vectorize_layer.get_vocabulary())))
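
To make the role of the vectorization layer more concrete, here is a simplified plain-Python sketch of what a layer like TextVectorization does in int mode: build a vocabulary ordered by frequency, reserve index 0 for padding and index 1 for unknown tokens, and pad or truncate every message to the same length. The helper names, the toy texts and the alphabetical tie-breaking are assumptions of this sketch, not the Keras implementation.

```python
from collections import Counter


def build_vocabulary(texts, max_tokens):
    """Return a token list: index 0 is padding, index 1 is the unknown token."""
    counts = Counter(token for text in texts for token in text.split())
    # Most frequent tokens first; ties broken alphabetically (assumption of this sketch).
    ranked = sorted(counts, key=lambda token: (-counts[token], token))
    return ["", "[UNK]"] + ranked[: max_tokens - 2]


def vectorize(text, vocabulary, sequence_length):
    """Map tokens to indices, then pad with 0 or truncate to sequence_length."""
    index = {token: i for i, token in enumerate(vocabulary)}
    ids = [index.get(token, 1) for token in text.split()]
    ids = ids[:sequence_length]
    return ids + [0] * (sequence_length - len(ids))


texts = ["free prize now", "call me now", "see you now"]
vocabulary = build_vocabulary(texts, max_tokens=6)
print(vocabulary)
# ['', '[UNK]', 'now', 'call', 'free', 'me']
print(vectorize("call now for free", vocabulary, sequence_length=6))
# [3, 2, 1, 4, 0, 0]
```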

## Create a classification model

In the tutorial, the batch size is set when the data is loaded as a tf.data.Dataset. Since we built our datasets from pandas DataFrames, we have to set it ourselves now. This is especially important for training the model.
You can create the batches as shown here: [tf.data.Dataset.batch() method, combined with repeat() method](https://www.gcptutorials.com/article/how-to-use-batch-method-in-tensorflow).
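
As a rough illustration of what combining repeat() with batch() means, the following plain-Python sketch mimics the behaviour with itertools. The generator repeat_and_batch is a made-up stand-in, not the tf.data implementation.

```python
from itertools import cycle, islice


def repeat_and_batch(items, batch_size):
    """Yield batches forever, cycling through items like repeat().batch()."""
    stream = cycle(items)
    while True:
        yield list(islice(stream, batch_size))


samples = list(range(10))  # stand-in for 10 messages
batches = repeat_and_batch(samples, batch_size=4)

# The stream never ends, so the number of batches has to be limited explicitly.
for step, batch in enumerate(islice(batches, 3), start=1):
    print(step, batch)
# 1 [0, 1, 2, 3]
# 2 [4, 5, 6, 7]
# 3 [8, 9, 0, 1]
```

Because the repeated stream is endless, a batch can straddle the end of one pass over the data (the third batch above). This is also why model.fit and model.evaluate need steps_per_epoch, validation_steps or steps to know when to stop.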

In [None]:
dataset_train_batch = dataset_train.repeat().batch(batch_size=32)
dataset_val_batch = dataset_val.repeat().batch(batch_size=32)
dataset_test_batch = dataset_test.repeat().batch(batch_size=32)

In [None]:
# checking shape and type of batched dataset 
dataset_train_batch

**Defining the model structure:**
- first, the vectorization layer turns each text into a sequence of integers
- then, the embedding layer maps each integer to a vector
- the GlobalAveragePooling1D layer returns a fixed-length output vector even though the input may vary in length
- a fully connected layer
- a last layer with a single output node
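
The pooling step in this list can be sketched in plain Python: averaging the token vectors position by position always yields one vector of the embedding dimension, no matter how many tokens the message has. The function name and the toy vectors below are made up for illustration.

```python
def global_average_pooling_1d(sequence):
    """Average a list of equal-length vectors position by position."""
    length = len(sequence)
    return [sum(column) / length for column in zip(*sequence)]


# Two "messages" of different length, each token a 2-dimensional embedding.
short_message = [[1.0, 2.0], [3.0, 4.0]]
long_message = [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0], [6.0, 4.0]]

print(global_average_pooling_1d(short_message))  # [2.0, 3.0]
print(global_average_pooling_1d(long_message))   # [3.0, 1.0]
```

In this notebook the vectorization layer already pads every message to sequence_length, so the padded positions presumably enter the average as well, since the embedding layer is created without mask_zero=True.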

In [None]:
#model structure
embedding_dim=16
model = Sequential([
  vectorize_layer,
  Embedding(vocab_size, embedding_dim, name="embedding"),
  GlobalAveragePooling1D(),
  Dense(16, activation='relu'),
  Dense(1)
])

In [None]:
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir="logs")

In [None]:
# model compiling using Adam optimizer and BinaryCrossentropy loss
model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=['accuracy'])
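
The last Dense layer has no activation, so the model outputs logits rather than probabilities, and from_logits=True tells the loss to apply the sigmoid itself. A minimal plain-Python sketch of that computation for a single example, using a numerically stable formulation (the function names are illustrative, not Keras internals):

```python
import math


def sigmoid(logit):
    """Map a logit to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-logit))


def binary_crossentropy_from_logit(label, logit):
    """Log loss of sigmoid(logit) for a 0/1 label, written in a stable form."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


# A confident correct prediction, a confident wrong one, and an undecided one.
for label, logit in [(1, 4.0), (0, 4.0), (1, 0.0)]:
    loss = binary_crossentropy_from_logit(label, logit)
    print(f"label={label} logit={logit:+.1f} p={sigmoid(logit):.3f} loss={loss:.3f}")
```

For predictions, the same sigmoid turns a logit returned by the model into a spam probability.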

In [None]:
# plotting model loss during training, created by Daniel: https://medium.com/geekculture/how-to-plot-model-loss-while-training-in-tensorflow-9fa1a1875a5
class PlotLearning(keras.callbacks.Callback):
    """
    Callback to plot the learning curves of the model during training.
    """
    def on_train_begin(self, logs=None):
        self.metrics = {}
        for metric in logs or {}:
            self.metrics[metric] = []
            

    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}
        # Storing metrics
        for metric in logs:
            if metric in self.metrics:
                self.metrics[metric].append(logs.get(metric))
            else:
                self.metrics[metric] = [logs.get(metric)]
        
        # Plotting
        metrics = [x for x in logs if 'val' not in x]
        
        f, axs = plt.subplots(1, len(metrics), figsize=(15,5))
        clear_output(wait=True)

        for i, metric in enumerate(metrics):
            axs[i].plot(range(1, epoch + 2), 
                        self.metrics[metric], 
                        label=metric)
            if 'val_' + metric in logs:
                axs[i].plot(range(1, epoch + 2), 
                            self.metrics['val_' + metric], 
                            label='val_' + metric)
                
            axs[i].legend()
            axs[i].grid()

        plt.tight_layout()
        plt.show()

In [None]:
# training the model
callbacks_list = [PlotLearning(), tensorboard_callback]
model.fit(
    dataset_train_batch,
    validation_data=dataset_val_batch,
    epochs=20,
    steps_per_epoch=240,
    validation_steps=25,
    callbacks=callbacks_list
    )

In [None]:
# calculating the loss and accuracy on 25 batches (of 32 messages each) from the test set.
loss, accuracy = model.evaluate(dataset_test_batch, verbose=2, steps=25)
print(f'Model accuracy: {accuracy}')

In [None]:
model.summary()

## Retrieve the trained word embeddings and save them to disk

In [None]:
weights = model.get_layer('embedding').get_weights()[0]
vocab = vectorize_layer.get_vocabulary()

In [None]:
# the with statement closes both files even if writing fails
with io.open('vectors.tsv', 'w', encoding='utf-8') as out_v, \
     io.open('metadata.tsv', 'w', encoding='utf-8') as out_m:
    for index, word in enumerate(vocab):
        if index == 0:
            continue  # skip 0, it's padding.
        vec = weights[index]
        out_v.write('\t'.join([str(x) for x in vec]) + "\n")
        out_m.write(word + "\n")

## Visualize the embeddings
To visualize the embeddings, upload them to the embedding projector.

Open the [Embedding Projector](http://projector.tensorflow.org/) (this can also run in a local TensorBoard instance).

Click on "Load data".

Upload the two files you created above: vectors.tsv and metadata.tsv.
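
Before uploading, it can help to check that the two files line up: one vector per label, and the same number of dimensions in every row. The following sketch is one possible sanity check; the helper check_projector_files and the tiny example files are made up for illustration and are not part of the tutorial.

```python
import os
import tempfile


def check_projector_files(vectors_path, metadata_path):
    """Return (rows, dimensions) if both files are consistent, else raise ValueError."""
    with open(vectors_path, encoding="utf-8") as f:
        vectors = [line.rstrip("\n").split("\t") for line in f]
    with open(metadata_path, encoding="utf-8") as f:
        labels = [line.rstrip("\n") for line in f]

    if len(vectors) != len(labels):
        raise ValueError(f"{len(vectors)} vectors but {len(labels)} labels")
    dimensions = {len(vector) for vector in vectors}
    if len(dimensions) != 1:
        raise ValueError(f"vectors have differing lengths: {sorted(dimensions)}")
    return len(vectors), dimensions.pop()


# Tiny made-up example so the sketch runs on its own.
with tempfile.TemporaryDirectory() as folder:
    vectors_path = os.path.join(folder, "vectors.tsv")
    metadata_path = os.path.join(folder, "metadata.tsv")
    with open(vectors_path, "w", encoding="utf-8") as f:
        f.write("0.1\t0.2\n0.3\t0.4\n")
    with open(metadata_path, "w", encoding="utf-8") as f:
        f.write("free\nprize\n")
    print(check_projector_files(vectors_path, metadata_path))  # (2, 2)
```

For the files written in this notebook, the call would be check_projector_files('vectors.tsv', 'metadata.tsv').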

