### References

**Book:**
- Deep Learning with Python, Second Edition
  - Book by François Chollet
  - François Chollet is a French software engineer and artificial intelligence researcher currently working at Google. Chollet is the creator of the Keras deep-learning library, released in 2015, and a main contributor to the TensorFlow machine learning framework.


### Pre-processing

Modern NLP is about using machine learning and large datasets to give computers the ability not to understand language, which is a loftier goal, but to ingest a piece of language as input and return something useful, such as answers to the following questions:
- “What’s the topic of this text?” (text classification)
- “Does this text contain abuse?” (content filtering)
- “Does this text sound positive or negative?” (sentiment analysis)
- “What should be the next word in this incomplete sentence?” (language modeling)
- “How would you say this in German?” (translation)
- “How would you summarize this article in one paragraph?” (summarization)

etc.

**Pre-processing template**
- First, you *standardize* the text to make it easier to process, such as by converting 
it to lowercase or removing punctuation.
- You split the text into units (called tokens), such as characters, words, or groups
of words. This is called *tokenization*.
- You convert each such token into a numerical vector. This will usually involve
first *indexing* all tokens present in the data.

**Model Type**
There are two kinds of text-processing models: 
- Those that care about word order, called **sequence models**.
- Those that treat input words as a set, discarding their original order, called **bag-of-words models**.

If you’re building a sequence model, you’ll use word-level tokenization, and if you’re building a bag-of-words model, you’ll use N-gram tokenization.
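
To make the difference concrete, here is a minimal pure-Python sketch of word-level versus N-gram tokenization. The helper names `word_tokens` and `ngram_tokens` are illustrative and not part of any library. The N-grams are collected into a set, so their order is discarded.

```python
def word_tokens(text):
    # Word-level tokenization: lowercase, split on whitespace, keep the order.
    return text.lower().split()

def ngram_tokens(text, n=2):
    # Collect every run of 1..n consecutive words as one token.
    words = word_tokens(text)
    grams = set()
    for size in range(1, n + 1):
        for start in range(len(words) - size + 1):
            grams.add(" ".join(words[start:start + size]))
    return grams

sentence = "the cat sat on the mat"
sequence = word_tokens(sentence)    # ordered list, as a sequence model would see it
bag = ngram_tokens(sentence, n=2)   # unordered set of unigrams and bigrams
print(sequence)
print(sorted(bag))
```

The bigrams keep a little local order ("the cat", "sat on") even though the set as a whole has none.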

### Vocabulary Indexing

```
vocabulary = {}

for text in dataset:
    text = standardize(text)
    tokens = tokenize(text)
    for token in tokens:
        if token not in vocabulary:
            vocabulary[token] = len(vocabulary)
```

You can then convert that integer into a vector encoding that can be processed by a neural network, like a one-hot vector.

```
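import numpy as np

def one_hot_encode_token(token):
    # All-zero vector with one slot per vocabulary entry
    vector = np.zeros((len(vocabulary),))
    token_index = vocabulary[token]
    # Set the slot for this token to 1
    vector[token_index] = 1
    return vector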
```

Note that at this step it’s common to restrict the vocabulary to only the top 20,000 or 30,000 most common words found in the training data. Any text dataset tends to feature an extremely large number of unique terms, most of which only show up once or
twice—indexing those rare terms would result in an excessively large feature space, where most features would have almost no information content.
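
One way to apply such a cap, sketched with the standard library only: count token frequencies and index just the most common ones. The function name `build_vocabulary` and the toy texts are assumptions made for illustration.

```python
from collections import Counter

def build_vocabulary(texts, max_tokens):
    # Count how often each lowercased, whitespace-separated token occurs.
    counts = Counter(token for text in texts for token in text.lower().split())
    # Index only the max_tokens most frequent tokens; rarer ones are left out.
    most_common = counts.most_common(max_tokens)
    return {token: index for index, (token, _) in enumerate(most_common)}

texts = ["the cat sat", "the dog sat", "the bird flew"]
vocabulary = build_vocabulary(texts, max_tokens=2)
print(vocabulary)  # {'the': 0, 'sat': 1}
```

Tokens that fall outside the cap, such as "cat" here, are simply absent from the index.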

**OOV - Out of vocabulary**

Now, there’s an important detail here that we shouldn’t overlook: when we look up a new token in our vocabulary index, it may not necessarily exist. 

Your training data may not have contained any instance of the word “cherimoya” (or maybe you excluded it from your index because it was too rare), so doing `token_index = vocabulary["cherimoya"]` may result in a KeyError. 

To handle this, you should use an “out of vocabulary” index (abbreviated as OOV index)—a catch-all for any token that wasn’t in the index. 

It’s usually index 1: you’re actually doing `token_index = vocabulary.get(token, 1)`. 

When decoding a sequence of integers back into words, you’ll replace 1 with something like “[UNK]” (which you’d call an “OOV token”).

**Mask Token or Padding**

There are two special tokens that you will commonly use: the OOV token (index 1), and the mask token (index 0). 

While the OOV token means “here was a word we did not recognize,” the mask token tells us “ignore me, I’m not a word.” You’d use it in particular to pad sequence data: because data batches need to be contiguous, all sequences in a batch of sequence data must have the same length, so shorter sequences should be padded to the length of the longest sequence. 
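
A small sketch of both special indices at work, assuming a toy vocabulary and right-padding to the longest sequence in the batch (the helper names `encode` and `pad_batch` are illustrative):

```python
MASK_INDEX = 0  # padding: "ignore me, I'm not a word"
OOV_INDEX = 1   # unknown: "here was a word we did not recognize"

vocabulary = {"": MASK_INDEX, "[UNK]": OOV_INDEX, "the": 2, "cat": 3, "sat": 4}

def encode(text):
    # Unknown tokens fall back to the OOV index instead of raising KeyError.
    return [vocabulary.get(token, OOV_INDEX) for token in text.lower().split()]

def pad_batch(sequences):
    # Right-pad every sequence with the mask index up to the longest length.
    max_len = max(len(seq) for seq in sequences)
    return [seq + [MASK_INDEX] * (max_len - len(seq)) for seq in sequences]

batch = pad_batch([encode("the cat sat"), encode("the cherimoya"), encode("cat")])
print(batch)  # [[2, 3, 4], [2, 1, 0], [3, 0, 0]]
```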

### TextVectorization layer

In [2]:
import string

# Vectorizer class
class Vectorizer:

    # Standardization method
    def standardize(self, text):
        # Lowercase the text
        text = text.lower()
        # Keep only the characters that are not punctuation
        return "".join(char for char in text
                       if char not in string.punctuation)
    
    # Tokenization method
    def tokenize(self, text):
        # Calling the standardization method
        text = self.standardize(text)
        # Tokenizing and returning the text
        return text.split()
    
    # Build the word-to-index vocabulary (and its inverse)
    def make_vocabulary(self, dataset):
        # Masking and OOV indices
        self.vocabulary = {"": 0, "[UNK]": 1}

        # Iterating over each separate text (sentence, paragraph) in the dataset
        for text in dataset:
            # Standardize and Tokenize
            text = self.standardize(text)
            tokens = self.tokenize(text)

            # Iterating over each token
            for token in tokens:
                # If token not in the vocabulary already
                if token not in self.vocabulary:
                    self.vocabulary[token] = len(self.vocabulary)

        # Inverting (word -> index) into (index -> word)
        self.inverse_vocabulary = dict(
            (v, k) for k, v in self.vocabulary.items()
            )

    # Token to index
    def encode(self, text):
        text = self.standardize(text)
        tokens = self.tokenize(text)
        return [self.vocabulary.get(token, 1) for token in tokens]

    # Index to token
    def decode(self, int_sequence):
        return " ".join(
        self.inverse_vocabulary.get(i, "[UNK]") for i in int_sequence)

In [3]:
# Vectorizer object
vectorizer = Vectorizer()

# Dataset
dataset = [
    "I write, erase, rewrite",
    "Erase again, and then",
    "A poppy blooms.",
    ]

# Vocabulary created
vectorizer.make_vocabulary(dataset)

In [4]:
test_sentence = "I write, rewrite, and still rewrite again"

# Encode test sentence
encoded_sentence = vectorizer.encode(test_sentence)
print(encoded_sentence)

# Decode encoded test sentence
decoded_sentence = vectorizer.decode(encoded_sentence)
print(decoded_sentence)

[2, 3, 5, 7, 1, 5, 6]
i write rewrite and [UNK] rewrite again


**Keras built-in class**

In [5]:
import re
import tensorflow as tf
from tensorflow.keras.layers import TextVectorization




In [6]:
# Custom standardization function
def custom_standardization_fn(string_tensor):
    # Lowercasing
    lowercase_string = tf.strings.lower(string_tensor)
    
    # Remove punctuation characters (replace them with the empty string)
    return tf.strings.regex_replace(
        lowercase_string, f"[{re.escape(string.punctuation)}]", "")

# Custom split function
def custom_split_fn(string_tensor):
    return tf.strings.split(string_tensor)

# Custom TextVectorizer
text_vectorization = TextVectorization(
    output_mode="int",
    standardize=custom_standardization_fn,
    split=custom_split_fn,
    )




In [7]:
dataset = [
"I write, erase, rewrite",
"Erase again, and then",
"A poppy blooms.",
]

# Creating vocabulary
text_vectorization.adapt(dataset)




In [8]:
# Display vocab
print(text_vectorization.get_vocabulary())

['', '[UNK]', 'erase', 'write', 'then', 'rewrite', 'poppy', 'i', 'blooms', 'and', 'again', 'a']


In [9]:
# Encode

test_sentence = "I write, rewrite, and still rewrite again"

encoded_sentence = text_vectorization(test_sentence)
print(encoded_sentence)

tf.Tensor([ 7  3  5  9  1  5 10], shape=(7,), dtype=int64)


In [10]:
# Decode

# Getting list of vocab words
vocabulary = text_vectorization.get_vocabulary()

# Creating index-word dictionary
inverse_vocab = dict(enumerate(vocabulary))

decoded_sentence = " ".join(inverse_vocab[int(i)] for i in encoded_sentence)
print(decoded_sentence)

i write rewrite and [UNK] rewrite again


Importantly, because TextVectorization is mostly a dictionary lookup operation, it can’t be executed on a GPU (or TPU)—only on a CPU.

There are two ways we could use our TextVectorization layer. 
- The first option is to put it in the tf.data pipeline.
- The second option is to make it part of the model.

```
# Pipeline

# string_dataset would be a dataset that yields string tensors.
int_sequence_dataset = string_dataset.map(
    text_vectorization,
    # The num_parallel_calls argument is used to parallelize the map() call across multiple CPU cores.
    num_parallel_calls=4)
```

```
# Part of Model

# Create a symbolic input that expects strings
text_input = keras.Input(shape=(), dtype="string")

# Apply the text vectorization layer to it
vectorized_text = text_vectorization(text_input)

# You can keep chaining new layers on top, just like in a regular Functional API model.
embedded_input = keras.layers.Embedding(...)(vectorized_text)
output = ...

model = keras.Model(text_input, output)
```

**Note**
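- If the layer is part of the model, vectorization runs synchronously with the rest of the model: at each training step, the GPU has to wait for the CPU to finish vectorizing the batch. In the tf.data pipeline, preprocessing runs asynchronously on the CPU while the GPU works on the previous batch. So if you’re training the model on GPU or TPU, you’ll probably want to go with the first option to get the best performance. 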

- When training on a CPU, though, synchronous processing is fine: you will get 100% utilization of your cores regardless of which option you go with.

### Prepare IMDB Dataset

Let’s start by downloading the dataset from the Stanford page of Andrew Maas and uncompressing it.

`!curl -O https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz`

`!tar -xf aclImdb_v1.tar.gz`

There’s also a train/unsup subdirectory in there, which we don’t need. Let’s delete it:

`!rm -r aclImdb/train/unsup`

#### Prepare Validation Data

In [11]:
import os, pathlib, shutil, random

In [None]:
# Defining base directory
base_dir = pathlib.Path("aclImdb")

# Val dir and Train dir
val_dir = base_dir / "val"
train_dir = base_dir / "train"

# Category wise iterating
for category in ("neg", "pos"):
    # Creating folder
    os.makedirs(val_dir / category)

    # Getting all the file names from train directory
    files = os.listdir(train_dir / category)

    # Random shuffling of file names
    random.Random(1337).shuffle(files)

    # Number of samples
    num_val_samples = int(0.2 * len(files))

    # Getting the num_val_samples number of files from the end
    val_files = files[-num_val_samples:]

    # Moving the actual files from training folder to validation folder
    for fname in val_files:
        shutil.move(train_dir / category / fname,
                    val_dir / category / fname)
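
One caveat: `os.listdir` returns entries in arbitrary order, so seeding the shuffle alone may not produce the same split on every machine. Below is a minimal sketch of a reproducible variant, assuming it is acceptable to sort the file names before shuffling. The helper `split_file_names` and the fake file names are hypothetical, not part of the notebook.

```python
import random

def split_file_names(file_names, val_fraction=0.2, seed=1337):
    # Sort first so the seeded shuffle starts from the same order everywhere.
    files = sorted(file_names)
    random.Random(seed).shuffle(files)
    num_val_samples = int(val_fraction * len(files))
    # Hold out the last num_val_samples names for validation.
    split = len(files) - num_val_samples
    return files[:split], files[split:]

names = [f"{i}_7.txt" for i in range(10)]
train_a, val_a = split_file_names(names)
train_b, val_b = split_file_names(list(reversed(names)))
print(len(train_a), len(val_a))
print(val_a == val_b)  # True: the split no longer depends on listing order
```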

#### Batched Dataset

In [13]:
from tensorflow import keras

In [35]:
batch_size = 32

train_ds = keras.utils.text_dataset_from_directory(
    "aclImdb/train", batch_size=batch_size
)
val_ds = keras.utils.text_dataset_from_directory(
    "aclImdb/val", batch_size=batch_size
)
test_ds = keras.utils.text_dataset_from_directory(
    "aclImdb/test", batch_size=batch_size
)

Found 20000 files belonging to 2 classes.
Found 5000 files belonging to 2 classes.
Found 25000 files belonging to 2 classes.


In [None]:
# Displaying the shapes and dtypes of the first training batch

for inputs, targets in train_ds:
    print("inputs.shape:", inputs.shape)
    print("inputs.dtype:", inputs.dtype)

    print("targets.shape:", targets.shape)
    print("targets.dtype:", targets.dtype)
    
    print("inputs[0]:", inputs[0])
    print("targets[0]:", targets[0])
    
    break

inputs.shape: (32,)
inputs.dtype: <dtype: 'string'>
targets.shape: (32,)
targets.dtype: <dtype: 'int32'>
inputs[0]: tf.Tensor(b"I haven't yet read the Kurt Vonnegut book this was adapted from, but I am familiar with some of his other work and was interested to see how it would be translated to the screen. Overall, I think this is a very successful adaptation of one of Vonnegut's novels. It concerns the story of an American living in Germany who is recruited as a spy for the US. His job is to ingratiate himself with high ranked Nazi's and send secret messages to the American's via his weekly radio show. But when the war ends he is denounced as a war criminal but escapes to New York, where various odd plot twists await.<br /><br />If Mother Night has a problem it's that it tends to get a little too sentimental at times. But for most of the film the schmaltz is kept to a minimum and the very strange plot is carried through with skill and aplomb. And there are some fabulous moments of blac

#### Processing words as a set: The bag-of-words approach

**SINGLE WORDS (UNIGRAMS) WITH BINARY ENCODING**

The main advantage of this encoding is that you can represent an entire text as a single vector, where each entry is a presence indicator for a given word. 

For instance, using binary encoding (multi-hot), you’d encode a text as a vector with as many dimensions as there are words in your vocabulary—with 0s almost everywhere and some 1s for dimensions that encode words present in the text.
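
As a rough illustration, the sketch below builds multi-hot vectors by hand for a toy six-entry vocabulary (an assumption for the example; index 0 plays the role of `[UNK]`). It also shows that two texts containing the same words in a different order get the same vector.

```python
vocabulary = {"[UNK]": 0, "the": 1, "cat": 2, "sat": 3, "on": 4, "mat": 5}

def multi_hot(text):
    vector = [0] * len(vocabulary)
    for token in text.lower().split():
        # Every unknown word is mapped to the [UNK] slot (index 0 here).
        vector[vocabulary.get(token, 0)] = 1
    return vector

a = multi_hot("the cat sat on the mat")
b = multi_hot("the mat sat on the cat")
c = multi_hot("the dog sat")
print(a)       # [0, 1, 1, 1, 1, 1]
print(a == b)  # True: word order is not encoded
print(c)       # [1, 1, 0, 1, 0, 0]
```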

In [36]:
# TextVectorization object

text_vectorization = TextVectorization(
    max_tokens=20000,
    output_mode="multi_hot",
    )

In [37]:
# Prepare a dataset that only yields raw text inputs (no labels)
text_only_train_ds = train_ds.map(lambda x, y: x)

# Use that dataset to index the dataset vocabulary via the adapt() method
text_vectorization.adapt(text_only_train_ds)

In [None]:
# Looking at text_only_train_ds first batch

for i in text_only_train_ds:
    print(len(i))
    print(i)
    break

32
tf.Tensor(
[b'So, what\'s the reason? Is there some sort of vendetta against this AWESOME show or somebody involved therein? Why would the best show I\'ve seen in years be canceled? I\'m addicted. I saw this show on randomly last fall, and immediately loved it, and watched it every week. Then it went away, and I tried to Tivo it, but it wasn\'t being aired. So I forgot about it for awhile, until I found the episodes on ABC\'s website. Now I want MORE. I agree with everybody else - with the rest of the junk on TV today, it was refreshing to see something as well-rounded and developed as this. I watch Boston Legal for my eccentric-comedic fix, and House for my intellectual-mystery-jackass fix. My wife loves Grey\'s Anatomy for its "realism", and I do love/hate the show, but it could not be farther from real for me. WAY too much drama. Everything that can go wrong, does. But for once, there\'s a drama that\'s REALLY real. Real people, real problems. Sure, there are some extremes like a

In [None]:
count = 0

for i in text_only_train_ds:
    count += 1

# Total number of batches in the training set
print(count)

# Crosschecking the total count of samples in the training dataset
print(count*32)

625
20000


In [None]:
# First 20 words in the vocab list
print(text_vectorization.get_vocabulary()[:20])

['[UNK]', 'the', 'and', 'a', 'of', 'to', 'is', 'in', 'it', 'i', 'this', 'that', 'br', 'was', 'as', 'for', 'with', 'movie', 'but', 'film']


In [None]:
# Sample text vectorization

sample_text = "The is asdasd in."

text_vectorization(sample_text)[:20]

<tf.Tensor: shape=(20,), dtype=float32, numpy=
array([1., 1., 0., 0., 0., 0., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0.], dtype=float32)>

In [38]:
# Prepare processed versions of our training, validation, and test dataset
# Make sure to specify num_parallel_calls to leverage multiple CPU cores

binary_1gram_train_ds = train_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)

binary_1gram_val_ds = val_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)

binary_1gram_test_ds = test_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)

In [None]:
# Each element of binary_1gram_train_ds is a tuple of 2 tensors
# 1st tensor - vectorized input texts
# 2nd tensor - corresponding labels
for i in binary_1gram_train_ds:
    print(len(i))
    print(i)
    break

2
(<tf.Tensor: shape=(32, 20000), dtype=float32, numpy=
array([[1., 1., 1., ..., 0., 0., 0.],
       [1., 1., 1., ..., 0., 0., 0.],
       [1., 1., 1., ..., 0., 0., 0.],
       ...,
       [1., 1., 1., ..., 0., 0., 0.],
       [1., 1., 1., ..., 0., 0., 0.],
       [1., 1., 1., ..., 0., 0., 0.]], dtype=float32)>, <tf.Tensor: shape=(32,), dtype=int32, numpy=
array([1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0,
       0, 1, 0, 1, 0, 0, 1, 0, 0, 0])>)


In [None]:
# Inputs - 32 vectors, each of length 20000 (vocab size)
# Labels - 32 labels, one per input
for i in binary_1gram_train_ds:
    for j in i:
        print(len(j))
        print(j)
    break

32
tf.Tensor(
[[1. 1. 1. ... 0. 0. 0.]
 [1. 1. 1. ... 0. 0. 0.]
 [1. 1. 1. ... 0. 0. 0.]
 ...
 [1. 1. 1. ... 0. 0. 0.]
 [1. 1. 1. ... 0. 0. 0.]
 [1. 1. 1. ... 0. 0. 0.]], shape=(32, 20000), dtype=float32)
32
tf.Tensor([0 0 0 0 0 0 0 1 0 0 1 1 1 0 1 0 0 1 1 1 0 0 0 0 0 1 0 0 1 1 1 0], shape=(32,), dtype=int32)


In [None]:
# Inspecting a single batch

for inputs, targets in binary_1gram_train_ds:
    print("inputs.shape:", inputs.shape)
    print("inputs.dtype:", inputs.dtype)

    print("targets.shape:", targets.shape)
    print("targets.dtype:", targets.dtype)

    print("inputs[0]:", inputs[0])
    print("targets[0]:", targets[0])
    break

inputs.shape: (32, 20000)
inputs.dtype: <dtype: 'float32'>
targets.shape: (32,)
targets.dtype: <dtype: 'int32'>
inputs[0]: tf.Tensor([1. 1. 1. ... 0. 0. 0.], shape=(20000,), dtype=float32)
targets[0]: tf.Tensor(1, shape=(), dtype=int32)


**Note**

This vectorization encodes only which words are present in the text; it does not encode their order (the position of each word in the sentence).

### Model-building utility

In [19]:
from tensorflow import keras
from tensorflow.keras import layers

In [39]:
def get_model(max_tokens=20000, hidden_dim=16):
    # Flattened vector representation of a text (vector of shape max_tokens)
    inputs = keras.Input(shape=(max_tokens,))

    # Hidden layer
    x = layers.Dense(hidden_dim, activation="relu")(inputs)
    # Dropout for regularization (it has no trainable weights)
    x = layers.Dropout(0.5)(x)
    
    # Output layer
    outputs = layers.Dense(1, activation="sigmoid")(x)
    
    # Building and compiling the model
    model = keras.Model(inputs, outputs)
    model.compile(optimizer="rmsprop",
                  loss="binary_crossentropy",
                  metrics=["accuracy"])
    
    return model

In [40]:
model = get_model()
model.summary()

Model: "model_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_4 (InputLayer)        [(None, 20000)]           0         
                                                                 
 dense_6 (Dense)             (None, 16)                320016    
                                                                 
 dropout_3 (Dropout)         (None, 16)                0         
                                                                 
 dense_7 (Dense)             (None, 1)                 17        
                                                                 
Total params: 320033 (1.22 MB)
Trainable params: 320033 (1.22 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
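
The parameter counts in the summary can be checked by hand. The sketch below assumes 4 bytes per float32 parameter and reads "MB" as mebibytes.

```python
max_tokens, hidden_dim = 20000, 16

# A Dense layer has one weight per (input, unit) pair plus one bias per unit.
hidden_params = max_tokens * hidden_dim + hidden_dim  # 320016
output_params = hidden_dim * 1 + 1                    # 17
# Dropout has no parameters, so it adds nothing to the total.
total_params = hidden_params + output_params          # 320033

# Assumption: 4 bytes per float32 parameter, 1024**2 bytes per MB.
size_mb = total_params * 4 / 1024 ** 2
print(total_params, round(size_mb, 2))  # 320033 1.22
```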


In [41]:
# Training and testing the binary unigram model

callbacks = [
    keras.callbacks.ModelCheckpoint("binary_1gram.keras",
                                    save_best_only=True)
                                    ]

model.fit(binary_1gram_train_ds.cache(),
          validation_data=binary_1gram_val_ds.cache(),
          epochs=10,
          callbacks=callbacks)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.src.callbacks.History at 0x1f8bc3be4a0>

In [42]:
model = keras.models.load_model("binary_1gram.keras")

print(f"Test acc: {model.evaluate(binary_1gram_test_ds)[1]:.3f}")

Test acc: 0.886


Note that in this case, since the dataset is a balanced two-class classification dataset (there are as many positive samples as negative samples), the “naive baseline” we could reach without training an actual model would only be 50%. 

Meanwhile, the best score that can be achieved on this dataset without leveraging external data is around 95% test accuracy.

### BIGRAMS WITH BINARY ENCODING

In [43]:
text_vectorization = TextVectorization(
    ngrams=2,
    max_tokens=20000,
    output_mode="multi_hot",
)

In [44]:
text_vectorization.adapt(text_only_train_ds)

binary_2gram_train_ds = train_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)

binary_2gram_val_ds = val_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)

binary_2gram_test_ds = test_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)

In [45]:
model = get_model()
model.summary()

Model: "model_4"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_5 (InputLayer)        [(None, 20000)]           0         
                                                                 
 dense_8 (Dense)             (None, 16)                320016    
                                                                 
 dropout_4 (Dropout)         (None, 16)                0         
                                                                 
 dense_9 (Dense)             (None, 1)                 17        
                                                                 
Total params: 320033 (1.22 MB)
Trainable params: 320033 (1.22 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


In [46]:
callbacks = [
    keras.callbacks.ModelCheckpoint("binary_2gram.keras",
    save_best_only=True)
    ]

model.fit(binary_2gram_train_ds.cache(),
          validation_data=binary_2gram_val_ds.cache(),
          epochs=10,
          callbacks=callbacks)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.src.callbacks.History at 0x1f8c86ec100>

In [47]:
model = keras.models.load_model("binary_2gram.keras")
print(f"Test acc: {model.evaluate(binary_2gram_test_ds)[1]:.3f}")

Test acc: 0.895


Test accuracy rose from 88.6% to 89.5%, a gain of 0.9 percentage points, which suggests that local word order carries useful signal.

### BIGRAMS WITH TF-IDF ENCODING

You can also add a bit more information to this representation by counting how many times each word or N-gram occurs, that is to say, by taking the histogram of the words over the text.

If you’re doing text classification, knowing how many times a word occurs in a sample is critical: any sufficiently long movie review may contain the word “terrible” regardless of sentiment, but a review that contains many instances of the word “terrible” is
likely a negative one.
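
A count vector can be sketched by hand as follows, using a toy five-entry vocabulary chosen for the example (index 0 stands in for `[UNK]`):

```python
from collections import Counter

vocabulary = {"[UNK]": 0, "the": 1, "movie": 2, "was": 3, "terrible": 4}

def count_encode(text):
    # Count how many times each vocabulary index occurs in the text.
    counts = Counter(vocabulary.get(token, 0) for token in text.lower().split())
    return [counts[index] for index in range(len(vocabulary))]

print(count_encode("the movie was terrible terrible terrible"))  # [0, 1, 1, 1, 3]
print(count_encode("a terrible movie"))                          # [1, 0, 1, 0, 1]
```

Unlike the binary encoding, the first review's three uses of "terrible" are visible in the vector.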

In [29]:
text_vectorization = TextVectorization(
    ngrams=2,
    max_tokens=20000,
    output_mode="count"
    )

Now, of course, some words are bound to occur more often than others no matter what the text is about. The words “the,” “a,” “is,” and “are” will always dominate your word count histograms, drowning out other words—despite being pretty much useless features in a classification context. How could we address this?

We could just normalize word counts by subtracting the mean and dividing by the standard deviation (both computed across the entire training dataset). That would make sense. Except most vectorized sentences consist almost entirely of zeros (our previous example features 12 non-zero entries and 19,988 zero entries), a property called “sparsity.” That’s a great property to have, as it dramatically reduces compute load and reduces the risk of overfitting. If we subtracted the mean from each feature, we’d wreck sparsity. Thus, whatever normalization scheme we use should be divide-only.
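
The effect on sparsity can be seen on a tiny count matrix. In this sketch, dividing each feature by its maximum is only a stand-in for a divide-only scheme (an assumption for illustration, not the normalization used later).

```python
# Each row is the count vector of one document; half of the entries are zero.
counts = [
    [3, 0, 0, 1],
    [2, 0, 1, 0],
    [5, 1, 0, 0],
]

def column(matrix, j):
    return [row[j] for row in matrix]

def count_zeros(matrix):
    return sum(value == 0 for row in matrix for value in row)

num_features = len(counts[0])
means = [sum(column(counts, j)) / len(counts) for j in range(num_features)]
maxima = [max(column(counts, j)) or 1 for j in range(num_features)]  # never divide by 0

# Mean subtraction turns zero entries into non-zero ones.
centered = [[v - means[j] for j, v in enumerate(row)] for row in counts]
# Divide-only scaling leaves every zero entry at zero.
scaled = [[v / maxima[j] for j, v in enumerate(row)] for row in counts]

print(count_zeros(counts), count_zeros(centered), count_zeros(scaled))  # 6 0 6
```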

The best practice is to go with something called TF-IDF normalization.

```
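import math

def tfidf(term, document, dataset):
    # Term frequency: how often the term occurs in this document
    term_freq = document.count(term)
    # Log-scaled count of the term across the whole dataset (+1 avoids log(0))
    doc_freq = math.log(sum(doc.count(term) for doc in dataset) + 1)
    return term_freq / doc_freq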
```
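
To see what this weighting does, here is a small usage sketch on three toy documents, each represented as a list of tokens (the documents are made up for the example). The function is repeated so the block runs on its own. The exact formula a given library uses may differ from this simplified one.

```python
import math

def tfidf(term, document, dataset):
    term_freq = document.count(term)
    doc_freq = math.log(sum(doc.count(term) for doc in dataset) + 1)
    return term_freq / doc_freq

dataset = [
    "the movie was terrible and the plot was terrible".split(),
    "the cast was great and the music was great".split(),
    "the ending of the movie was fine".split(),
]

# Both terms occur twice in the first document, but "the" is common everywhere.
score_the = tfidf("the", dataset[0], dataset)
score_terrible = tfidf("terrible", dataset[0], dataset)
print(f"the: {score_the:.3f}, terrible: {score_terrible:.3f}")
```

With equal raw counts, the rarer term "terrible" gets the higher weight. Note that a term that never occurs in `dataset` gives `math.log(1) == 0` and therefore a `ZeroDivisionError`, so this sketch assumes the term appears at least once.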

In [30]:
text_vectorization = TextVectorization(
    ngrams=2,
    max_tokens=20000,
    output_mode="tf_idf",
    )

In [31]:
text_vectorization.adapt(text_only_train_ds)

tfidf_2gram_train_ds = train_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)

tfidf_2gram_val_ds = val_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)

tfidf_2gram_test_ds = test_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)

In [32]:
model = get_model()
model.summary()

Model: "model_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_3 (InputLayer)        [(None, 20000)]           0         
                                                                 
 dense_4 (Dense)             (None, 16)                320016    
                                                                 
 dropout_2 (Dropout)         (None, 16)                0         
                                                                 
 dense_5 (Dense)             (None, 1)                 17        
                                                                 
Total params: 320033 (1.22 MB)
Trainable params: 320033 (1.22 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


In [33]:
# Training and testing the TF-IDF bigram model

callbacks = [
    keras.callbacks.ModelCheckpoint("tfidf_2gram.keras",
                                    save_best_only=True)
                                    ]

model.fit(tfidf_2gram_train_ds.cache(),
          validation_data=tfidf_2gram_val_ds.cache(),
          epochs=10,
          callbacks=callbacks)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.src.callbacks.History at 0x1f86745d3c0>

In [34]:
model = keras.models.load_model("tfidf_2gram.keras")
print(f"Test acc: {model.evaluate(tfidf_2gram_test_ds)[1]:.3f}")

Test acc: 0.882


Test accuracy decreased from 89.5% (binary bigrams) to 88.2%, so TF-IDF is not helpful in this case. However, for many text-classification datasets, it would be typical to see a one-percentage-point increase when using TF-IDF compared to plain binary encoding.

In the preceding examples, we did our text standardization, splitting, and indexing as part of the tf.data pipeline. 

But if we want to export a standalone model independent of this pipeline, we should make sure that it incorporates its own text preprocessing (otherwise, we’d have to reimplement it in the production environment, which can be challenging or can lead to subtle discrepancies between the training data and the production data).

Just create a new model that reuses your TextVectorization layer and adds to it the model you just trained.

In [51]:
text_vectorization = TextVectorization(
    ngrams=2,
    max_tokens=20000,
    output_mode="multi_hot",
)

text_vectorization.adapt(text_only_train_ds)

In [53]:
model = keras.models.load_model("binary_2gram.keras")

In [54]:
# One input sample would be one string
inputs = keras.Input(shape=(1,), dtype="string")

# Apply text preprocessing
processed_inputs = text_vectorization(inputs)

# Apply the previously trained model
outputs = model(processed_inputs)

# Instantiate the end-to-end model
inference_model = keras.Model(inputs, outputs)

In [58]:
raw_text_data = tf.convert_to_tensor([
    ["That was an excellent movie, I loved it."],
    ])

predictions = inference_model(raw_text_data)
print(f"{float(predictions[0] * 100):.2f} percent positive")

82.35 percent positive
