<a href="https://colab.research.google.com/github/rahiakela/data-learning-research-and-practice/blob/main/deep-learning-with-python-by-francois-chollet/11-deep-learning-for-text/01_preparing_text_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##Preparing text data

Deep learning models, being differentiable functions, can only process numeric tensors: they can’t take raw text as input. 

**Vectorizing text is the process of transforming
text into numeric tensors.**

Text vectorization processes come in many shapes and
forms, but they all follow the same template:

- First, you standardize the text to make it easier to process, such as by converting
it to lowercase or removing punctuation.
- Next, you split the text into units (called tokens), such as characters, words, or groups
of words. This is called tokenization.
- Finally, you convert each token into a numerical vector. This will usually involve
first indexing all tokens present in the data.

<img src='https://github.com/rahiakela/data-learning-research-and-practice/blob/main/deep-learning-with-python-by-francois-chollet/11-deep-learning-for-text/images/1.png?raw=1' width='800'/>

##Text standardization

Text standardization is a basic form of feature engineering that aims to erase
encoding differences that you don’t want your model to have to deal with. It’s not
exclusive to machine learning, either—you’d have to do the same thing if you were
building a search engine.

One of the simplest and most widespread standardization schemes is “convert to
lowercase and remove punctuation characters.”

Of course, standardization may
also erase some amount of information, so always keep the context in mind: for
instance, if you’re writing a model that extracts questions from interview articles, it
should definitely treat “?” as a separate token instead of dropping it, because it’s a useful
signal for this specific task.
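
As a rough illustration of such a task-specific choice, the sketch below lowercases the text and strips punctuation but keeps "?" and sets it off as its own token. The function name `standardize_keep_questions` and the sample sentence are made up for this example; this is one possible way to do it, not a prescribed recipe.

```python
import string

def standardize_keep_questions(text):
    """Lowercase, drop punctuation except '?', and isolate '?' as its own token."""
    text = text.lower()
    # Remove every punctuation character except the question mark.
    to_remove = string.punctuation.replace("?", "")
    text = "".join(ch for ch in text if ch not in to_remove)
    # Surround '?' with spaces so that whitespace splitting yields it as a token.
    return text.replace("?", " ? ")

tokens = standardize_keep_questions("So, why did you leave? Was it the pay?").split()
print(tokens)
# ['so', 'why', 'did', 'you', 'leave', '?', 'was', 'it', 'the', 'pay', '?']
```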

##Text splitting (tokenization)

Once your text is standardized, you need to break it up into units to be vectorized
(tokens), a step called tokenization. You could do this in three different ways:

- Word-level tokenization
- N-gram tokenization
- Character-level tokenization

In practice, you’ll almost always use either word-level or N-gram tokenization. Which one depends on the kind of text-processing model you’re building. There are two kinds:

- those that care about word order, called **sequence models**, and
- those that treat input words as a set, discarding their original order, called **bag-of-words models**.

If you’re building a sequence model, you’ll use word-level tokenization,
and if you’re building a bag-of-words model, you’ll use N-gram tokenization.

N-grams are a way to artificially inject a small amount of local word order information into the model.
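
To make this concrete, here is a minimal sketch of word-level N-gram extraction. The helper `word_ngrams` and the sample sentence are illustrative assumptions; including the single words alongside the 2-grams is a common convention rather than a requirement.

```python
def word_ngrams(tokens, n):
    """Return the contiguous word n-grams in `tokens`, each joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the cat sat on the mat".split()

# A bag of 2-grams: a set, so order and duplicates are discarded.
# The single words (1-grams) are kept alongside the word pairs.
bag_of_2grams = set(word_ngrams(tokens, 1)) | set(word_ngrams(tokens, 2))
print(sorted(bag_of_2grams))
# ['cat', 'cat sat', 'mat', 'on', 'on the', 'sat', 'sat on', 'the', 'the cat', 'the mat']
```

Each 2-gram such as "the cat" records that two words appeared next to each other, which is the local word order information a plain set of words would lose.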




##Vocabulary indexing

Once your text is split into tokens, you need to encode each token into a numerical
representation. 

You could potentially do this in a stateless way, such as by hashing each
token into a fixed binary vector, but in practice, the way you’d go about it is to build
an index of all terms found in the training data (the “vocabulary”), and assign a
unique integer to each entry in the vocabulary.

In [None]:
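# Sketch only: `dataset`, `standardize`, and `tokenize` are placeholders here.
# Concrete versions appear in the Vectorizer class further down.
vocabulary = {}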
for text in dataset:
  text = standardize(text)
  tokens = tokenize(text)
  for token in tokens:
    if token not in vocabulary:
      vocabulary[token] = len(vocabulary)

You can then convert that integer into a vector encoding that can be processed by a
neural network, like a one-hot vector:

In [None]:
import numpy as np

def one_hot_encode_token(token):
  # All zeros, except for a 1 at the token's vocabulary index
  vector = np.zeros((len(vocabulary), ))
  token_index = vocabulary[token]
  vector[token_index] = 1
  return vector
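
For comparison, the stateless hashing approach mentioned earlier could be sketched as follows. This is an illustration under assumptions: the function name `hash_encode_token`, the use of MD5 as a stable hash, and the default size of 1000 are arbitrary choices, and a plain Python list stands in for a NumPy array to keep the snippet dependency-free.

```python
import hashlib

def hash_encode_token(token, dimensionality=1000):
    """Map a token to a fixed-size binary vector without building a vocabulary."""
    # Use a stable hash: Python's built-in hash() is randomized per process for strings.
    digest = hashlib.md5(token.encode("utf-8")).hexdigest()
    index = int(digest, 16) % dimensionality
    vector = [0] * dimensionality
    vector[index] = 1
    return vector

vec = hash_encode_token("poppy")
print(len(vec), sum(vec))  # 1000 1
```

No index needs to be stored, but two different tokens can hash to the same position (a collision), and the mapping cannot be inverted to recover the original token.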

##Using the TextVectorization layer

Before turning to the Keras layer, let's put the steps above together in a simple pure-Python `Vectorizer` class.

In [1]:
import string

In [7]:
class Vectorizer:

  def standardize(self, text):
    text = text.lower()
    return "".join(char for char in text if char not in string.punctuation)

  def tokenize(self, text):
    text = self.standardize(text)
    return text.split()

  def make_vocabulary(self, dataset):
    self.vocabulary = {"": 0, "[UNK]": 1}
    for text in dataset:
      text = self.standardize(text)
      tokens = self.tokenize(text)
      for token in tokens:
        if token not in self.vocabulary:
          self.vocabulary[token] = len(self.vocabulary)
    self.inverse_vocabulary = {v: k for k, v in self.vocabulary.items()}

  def encode(self, text):
    text = self.standardize(text)
    tokens = self.tokenize(text)
    return [self.vocabulary.get(token, 1) for token in tokens]

  def decode(self, int_seq):
    return " ".join(self.inverse_vocabulary.get(i, "[UNK]") for i in int_seq)

In [8]:
vectorizer = Vectorizer()

In [9]:
dataset = [
   "I write, erase, rewrite",
   "Erase again, and then",
   "A poppy blooms."        
]

vectorizer.make_vocabulary(dataset)

It does the job:

In [10]:
test_sentence = "I write, rewrite, and still rewrite again"
encoded_sentence = vectorizer.encode(test_sentence)
print(encoded_sentence)

[2, 3, 5, 7, 1, 5, 6]


In [11]:
decoded_sentence = vectorizer.decode(encoded_sentence)
print(decoded_sentence)

i write rewrite and [UNK] rewrite again


A pure-Python implementation like this wouldn’t be very performant, however. In practice, you’ll
work with the Keras `TextVectorization` layer, which is fast and efficient and can be
dropped directly into a `tf.data` pipeline or a Keras model.

This is what the TextVectorization layer looks like:

In [12]:
from tensorflow.keras.layers import TextVectorization

In [13]:
# Configures the layer to return sequences of words encoded as integer indices.
text_vectorization = TextVectorization(output_mode="int")

By default, the `TextVectorization` layer will use the setting “convert to lowercase and
remove punctuation” for text standardization, and “split on whitespace” for tokenization.

But importantly, you can provide custom functions for standardization and tokenization,
which means the layer is flexible enough to handle any use case. 

Note that such custom functions should operate on `tf.string` tensors, not regular Python
strings!

In [14]:
import re
import tensorflow as tf

In [19]:
def custom_standardization_fn(string_tensor): 
  # Convert strings to lowercase
  lowercase_string = tf.strings.lower(string_tensor)
  # Replace punctuation characters with the empty string
  return tf.strings.regex_replace(lowercase_string, f"[{re.escape(string.punctuation)}]", "")

def custom_split_fn(string_tensor):
  # Split strings on whitespace.
  return tf.strings.split(string_tensor)

In [20]:
text_vectorization = TextVectorization(output_mode="int",
                                       standardize=custom_standardization_fn,
                                       split=custom_split_fn)

To index the vocabulary of a text corpus, call the layer's `adapt()` method
with a `Dataset` object that yields strings, or with a list of Python strings:

In [21]:
dataset = [
   "I write, erase, rewrite",
   "Erase again, and then",
   "A poppy blooms."        
]

text_vectorization.adapt(dataset)

Note that you can retrieve the computed vocabulary via `get_vocabulary()`—this can be useful if you need to convert text encoded as integer sequences back into words.

The first two entries in the vocabulary are the mask token (index 0) and the OOV
token (index 1). 

Entries in the vocabulary list are sorted by frequency, so with a real-world
dataset, very common words like “the” or “a” would come first.

In [22]:
text_vectorization.get_vocabulary()

['',
 '[UNK]',
 'erase',
 'write',
 'then',
 'rewrite',
 'poppy',
 'i',
 'blooms',
 'and',
 'again',
 'a']
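
The frequency ordering seen in this listing can be mimicked in plain Python. The sketch below is a rough approximation, not how the layer is implemented: the helper `build_frequency_vocabulary` is hypothetical, and the order of tokens with equal counts may differ from the one Keras produces.

```python
import string
from collections import Counter

def build_frequency_vocabulary(dataset):
    """Hypothetical helper: index tokens by descending frequency."""
    counts = Counter()
    for text in dataset:
        text = text.lower()
        text = "".join(ch for ch in text if ch not in string.punctuation)
        counts.update(text.split())
    # Index 0 is reserved for the mask token and index 1 for the OOV token.
    # most_common() sorts by descending count; tied tokens stay in first-seen
    # order, which need not match the order Keras picks for ties.
    return ["", "[UNK]"] + [token for token, _ in counts.most_common()]

dataset = [
    "I write, erase, rewrite",
    "Erase again, and then",
    "A poppy blooms.",
]
vocab = build_frequency_vocabulary(dataset)
print(vocab[:3])  # ['', '[UNK]', 'erase']
```

"erase" is the only word that appears twice in this tiny dataset, so it gets the first index after the two reserved entries.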

For a demonstration, let’s try to encode and then decode an example sentence:

In [23]:
vocabulary = text_vectorization.get_vocabulary()

test_sentence = "I write, rewrite, and still rewrite again"
encoded_sentence = text_vectorization(test_sentence)
print(encoded_sentence)

tf.Tensor([ 7  3  5  9  1  5 10], shape=(7,), dtype=int64)


In [24]:
inverse_vocab = dict(enumerate(vocabulary))

decoded_sentence = " ".join(inverse_vocab[int(i)] for i in encoded_sentence)
print(decoded_sentence)

i write rewrite and [UNK] rewrite again


Importantly, because `TextVectorization` is mostly a dictionary lookup operation, it
can’t be executed on a GPU (or TPU)—only on a CPU. 

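So if you’re training your model
on a GPU, your `TextVectorization` layer will run on the CPU before sending its output
to the GPU. This has important performance implications: if the layer is part of the model
itself, each training step has to wait for the CPU to finish vectorizing the batch, whereas
applying it inside a `tf.data` pipeline lets that preprocessing run asynchronously on the CPU
while the GPU works on the previous batch.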