<a href="https://colab.research.google.com/github/rahiakela/data-learning-research-and-practice/blob/main/deep-learning-with-python-by-francois-chollet/11-deep-learning-for-text/01_preparing_text_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##Preparing text data

Deep learning models, being differentiable functions, can only process numeric tensors: they can’t take raw text as input. 

**Vectorizing text is the process of transforming
text into numeric tensors.**

 Text vectorization processes come in many shapes and
forms, but they all follow the same template.

- First, you standardize the text to make it easier to process, for example by
converting it to lowercase or removing punctuation.
- Then, you split the text into units (called tokens), such as characters, words, or
groups of words. This is called tokenization.
- Finally, you convert each token into a numerical vector. This usually involves
first indexing all tokens present in the data.

<img src='images/1.png?raw=1' width='800'/>

##Text standardization

Text standardization is a basic form of feature engineering that aims to erase
encoding differences that you don’t want your model to have to deal with. It’s not
exclusive to machine learning, either—you’d have to do the same thing if you were
building a search engine.

One of the simplest and most widespread standardization schemes is “convert to
lowercase and remove punctuation characters.”

Of course, standardization may
also erase some amount of information, so always keep the context in mind: for
instance, if you’re writing a model that extracts questions from interview articles, it
should definitely treat “?” as a separate token instead of dropping it, because it’s a useful
signal for this specific task.
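
The "lowercase and remove punctuation" scheme described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function names `standardize` and `standardize_keep_question_marks` are hypothetical, and only ASCII punctuation (`string.punctuation`) is handled.

```python
import string

def standardize(text):
    """Lowercase the text and drop ASCII punctuation characters."""
    text = text.lower()
    return "".join(ch for ch in text if ch not in string.punctuation)

def standardize_keep_question_marks(text):
    """Variant for the interview example: keep "?" as a separate token."""
    # Pad "?" with spaces so a whitespace split isolates it later.
    text = text.lower().replace("?", " ? ")
    to_drop = string.punctuation.replace("?", "")
    return "".join(ch for ch in text if ch not in to_drop)

print(standardize("Hello, World!"))
print(standardize_keep_question_marks("Why? Because.").split())
```

The second variant shows how the task context can change the rule: the question mark survives standardization and becomes its own token after splitting on whitespace.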

##Text splitting (tokenization)

Once your text is standardized, you need to break it up into units to be vectorized
(tokens), a step called tokenization. You could do this in three different ways:

- Word-level tokenization
- N-gram tokenization
- Character-level tokenization

In practice, you’ll almost always use either word-level or N-gram tokenization. Which one you pick depends on the kind of model you are building. There are two kinds of text-processing models:

- those that care about word order, called **sequence models**, and
- those that treat input words as a set, discarding their original order, called **bag-of-words models**.

If you’re building a sequence model, you’ll use word-level tokenization, and if
you’re building a bag-of-words model, you’ll use N-gram tokenization.

N-grams are a way to artificially inject a small amount of local word order information into the model.
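
To make the difference concrete, here is a small sketch of word-level and N-gram tokenization. It assumes the text has already been standardized and that splitting on whitespace is good enough; the function names `tokenize_words` and `tokenize_ngrams` are made up for this illustration.

```python
def tokenize_words(text):
    """Word-level tokenization: split on whitespace."""
    return text.split()

def tokenize_ngrams(text, n=2):
    """Return all word N-grams of size 1 up to n, in order of size."""
    words = text.split()
    tokens = []
    for size in range(1, n + 1):
        # Slide a window of `size` consecutive words over the sentence.
        for i in range(len(words) - size + 1):
            tokens.append(" ".join(words[i:i + size]))
    return tokens

print(tokenize_words("the cat sat"))
print(tokenize_ngrams("the cat sat", n=2))
```

A bag-of-words model would treat the N-gram output as an unordered set. Tokens such as `"the cat"` are what preserve a little local word order in that setting.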




##Vocabulary indexing

Once your text is split into tokens, you need to encode each token into a numerical
representation. 

You could do this in a stateless way, such as by hashing each token into a fixed
binary vector. In practice, though, you’d build an index of all terms found in the
training data (the “vocabulary”) and assign a unique integer to each entry in the
vocabulary.

In [None]:
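# `dataset` is an iterable of raw strings; `standardize` and `tokenize`
# are placeholders for the standardization and tokenization steps above.
vocabulary = {}
for text in dataset:
  text = standardize(text)
  tokens = tokenize(text)
  for token in tokens:
    # Assign the next free integer index to each previously unseen token.
    if token not in vocabulary:
      vocabulary[token] = len(vocabulary)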

You can then convert that integer into a vector encoding that can be processed by a
neural network, like a one-hot vector:

In [None]:
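import numpy as np

def one_hot_encode_token(token):
  # All-zeros vector with one slot per vocabulary entry
  vector = np.zeros((len(vocabulary),))
  token_index = vocabulary[token]
  # Set the slot at the token's index to 1
  vector[token_index] = 1
  return vector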
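
The snippets above rely on `dataset`, `standardize`, and `tokenize` being defined elsewhere. The following self-contained sketch fills them in with simple stand-ins so the whole pipeline can be run end to end. The two-sentence corpus, the helper implementations, and `encode_sentence` are assumptions made for illustration only.

```python
import string
import numpy as np

# Toy corpus, made up for this example.
dataset = [
    "The cat sat on the mat.",
    "The dog ate my homework!",
]

def standardize(text):
    # Lowercase and strip ASCII punctuation.
    text = text.lower()
    return "".join(ch for ch in text if ch not in string.punctuation)

def tokenize(text):
    # Word-level tokenization on whitespace.
    return text.split()

# Build the vocabulary: each new token gets the next integer index.
vocabulary = {}
for text in dataset:
    for token in tokenize(standardize(text)):
        if token not in vocabulary:
            vocabulary[token] = len(vocabulary)

def one_hot_encode_token(token):
    vector = np.zeros((len(vocabulary),))
    vector[vocabulary[token]] = 1
    return vector

def encode_sentence(text):
    # One row per token, one column per vocabulary entry.
    tokens = tokenize(standardize(text))
    return np.stack([one_hot_encode_token(t) for t in tokens])

encoded = encode_sentence("The dog sat")
print(vocabulary)
print(encoded.shape)  # (3, 9)
```

Note that this sketch raises a `KeyError` for any token that was not seen while building the vocabulary, so it only works on text made of known words.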

##Using the TextVectorization layer