# Tokenization

## What is Tokenization?
Representing words in a form that a computer can process is called **Tokenization**. Different words can contain the same letters in a different order, for example "listen" and "silent". This makes it hard to understand the sentiment of a word just from the letters in it, so it is often easier to encode whole words rather than individual letters.
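
To see why letters alone are a weak signal, here is a small plain-Python illustration. The word pair and the toy vocabulary are chosen for this example only and do not come from any dataset: two words with different meanings share exactly the same character codes, while a word-level encoding gives each its own number.

```python
# Two words with different meanings but the same letters.
a, b = "listen", "silent"

# Character-level view: the sorted character codes are identical.
codes_a = sorted(ord(ch) for ch in a)
codes_b = sorted(ord(ch) for ch in b)
print(codes_a == codes_b)

# Word-level view: each distinct word gets its own integer token.
toy_index = {"listen": 1, "silent": 2}  # hypothetical vocabulary
print(toy_index[a], toy_index[b])
```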

## Tokenization in TensorFlow
TensorFlow has a **Tokenizer** API for tokenization.

* First, create an instance of the `Tokenizer` object. The `num_words` parameter is the maximum number of words to keep, ranked by frequency. Imagine we have hundreds of books to tokenize but only want the hundred most frequent words across all of them; `num_words` applies that limit automatically when the text is later turned into sequences.
* Next, tell the `tokenizer` to go through all the text and fit itself to it, using `fit_on_texts`.
* The full list of words is then available as the tokenizer's `word_index` property, which we can print out.
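
The idea behind these steps can be pictured with a plain-Python sketch. It is not the TensorFlow implementation: `build_word_index` is a hypothetical helper and the two sentences are made up. It only mirrors the principle of counting words and numbering them from 1 in order of descending frequency.

```python
from collections import Counter

def build_word_index(sentences):
    """Map each word to an integer, most frequent word first (1-based)."""
    counts = Counter()
    for sentence in sentences:
        counts.update(sentence.lower().split())
    # most_common() orders by count; ties keep first-seen order.
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

sentences = ["I love my dog", "I love my cat"]
word_index = build_word_index(sentences)
print(word_index)

# A frequency cap: keep only the 3 most frequent words of this toy corpus.
top_words = {w: i for w, i in word_index.items() if i <= 3}
print(top_words)
```

Because the lowest numbers go to the most frequent words, a frequency cap reduces to a simple comparison on the index.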

**Note:**
The tokenizer also normalizes the text before assigning tokens: by default it lowercases words and strips punctuation. This is why **"dog!"** gets the same token as **"dog"** instead of a new one.
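
One way to picture this behaviour is a normalization step that runs before the word lookup. The sketch below is plain Python with a hypothetical `normalize` helper. It strips every character in `string.punctuation`, which is an assumption made for the example and may differ from the exact character set the Tokenizer filters.

```python
import string

# Translation table that deletes ASCII punctuation (assumed filter set).
_STRIP = str.maketrans("", "", string.punctuation)

def normalize(sentence):
    """Lowercase, drop punctuation, split on whitespace."""
    return sentence.lower().translate(_STRIP).split()

with_punct = normalize("You love my dog!")
without_punct = normalize("I love my dog")
print(with_punct)
print(without_punct)
```

After normalization both sentences end in the same word, so both would map to the same token.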


# Turning sequences into data
We will add another sentence to our set of text. We do this because the existing sentences all have four words, and it is important to see how to manage sentences, or sequences, of different lengths.

The tokenizer supports a method called `texts_to_sequences`, which performs most of the work for you. It creates sequences of tokens representing each sentence.
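
As a rough picture of what this conversion does, the plain-Python sketch below replaces each word with its integer from a word index. The `to_sequences` helper, the sentences, and the index values are all hypothetical; this is not the library code.

```python
# Hypothetical word index, as if produced by fitting on some sentences.
word_index = {"my": 1, "love": 2, "dog": 3, "i": 4, "you": 5, "cat": 6,
              "do": 7, "think": 8, "is": 9, "amazing": 10}

def to_sequences(sentences, word_index):
    """Replace every known word with its integer token."""
    return [[word_index[w] for w in s.lower().split() if w in word_index]
            for s in sentences]

sentences = ["I love my dog", "Do you think my dog is amazing"]
sequences = to_sequences(sentences, word_index)
print(sequences)
```

The two resulting sequences have different lengths (four and seven tokens), which is exactly the situation that has to be handled before training.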

## Handling unseen words
Now we have the basic tokenization done, but there is a catch. This is all very well for getting data ready for training a neural network, but what happens when that neural network needs to classify text containing words it has never seen before?

We have used a set of sentences for training a neural network. The tokenizer builds its `word_index` from these and creates sequences for us. So what happens if we want to sequence sentences containing new words that were not present in our initial data, and hence are not in the `word_index`?

-> A five-word sentence will end up as a four-word sequence, because the new word was not present in the `word_index`.

To avoid losing the length of the sequences, we can use a little trick. By setting the tokenizer's `oov_token` parameter to something we would not expect to see in the corpus (for example `<OOV>`), the tokenizer creates a token for it and then replaces every word it does not recognize with this **Out of Vocabulary (OOV)** token. It is simple but effective: the meaning may not be preserved, but at least the length is.
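
The difference between dropping unknown words and replacing them can be sketched in plain Python. The `to_sequences` helper, the `<OOV>` marker string, and the vocabulary below are assumptions for illustration and are not taken from the library.

```python
def to_sequences(sentences, word_index, oov_token=None):
    """Tokenize sentences; unknown words are dropped unless oov_token is set."""
    oov_id = word_index.get(oov_token)  # None when no OOV token is configured
    result = []
    for sentence in sentences:
        ids = []
        for word in sentence.lower().split():
            if word in word_index:
                ids.append(word_index[word])
            elif oov_id is not None:
                ids.append(oov_id)
        result.append(ids)
    return result

# Hypothetical vocabulary; index 1 is reserved for the OOV token.
word_index = {"<OOV>": 1, "my": 2, "love": 3, "dog": 4, "i": 5}

unseen = ["I really love my dog"]  # "really" was never seen during fitting
dropped = to_sequences(unseen, word_index)
replaced = to_sequences(unseen, word_index, oov_token="<OOV>")
print(dropped)
print(replaced)
```

Without the OOV token the five-word sentence shrinks to four tokens. With it, the unknown word becomes token 1 and the length stays at five.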

## How to handle sentences of different lengths
Images are usually all the same size, so we can feed them to a neural network easily. How do we solve this for text data, where sentences differ in length?

The advanced answer is to use something called a `RaggedTensor`, but that is out of scope for now. We will use a simpler and different solution called `padding`.

* Import `pad_sequences` from `tensorflow.keras.preprocessing.sequence`. Then pass the sequences to this function and the rest is done automatically for you.

**Note:**
* Our longest sentence has seven words, so `pad_sequences` measures that and pads the remaining sentences with zeros at the front to make all lengths equal.
* **OOV** isn't `0`, it is `1`. `0` means padding.
* If you want the zeros after the sentence, set the `padding` parameter of `pad_sequences` to `post`.
* You can also use the `maxlen` parameter to set the desired length of the padded sequences.

What if sentences are longer than the specified maximum length? Then you can use the `truncating` parameter to choose whether words are chopped off from the end (`post`) or from the beginning (`pre`).
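
The padding and truncation rules above can be pictured with a plain-Python sketch. The `pad` helper is hypothetical and only imitates the described behaviour. Its parameter names follow the ones mentioned above, and the `pre` defaults are chosen to match padding at the front.

```python
def pad(sequences, maxlen=None, padding="pre", truncating="pre", value=0):
    """Pad or truncate every sequence to the same length."""
    if maxlen is None:
        maxlen = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        # Truncate: "pre" drops tokens from the front, "post" from the end.
        if len(seq) > maxlen:
            seq = seq[-maxlen:] if truncating == "pre" else seq[:maxlen]
        fill = [value] * (maxlen - len(seq))
        # Pad: "pre" puts zeros in front, "post" puts them after.
        padded.append(fill + list(seq) if padding == "pre" else list(seq) + fill)
    return padded

sequences = [[5, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11]]
front_padded = pad(sequences)
post_padded = pad(sequences, padding="post", truncating="post", maxlen=5)
print(front_padded)
print(post_padded)
```

With no `maxlen`, the shorter sequence is padded up to the longest one (seven tokens). With `maxlen=5` and `post` settings, the short sequence gets a trailing zero and the long one loses its last two tokens.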


# Training a model to recognize sentiment
## Dataset

We use Rishabh Misra's [dataset](https://www.kaggle.com/rmisra/news-headlines-dataset-for-sarcasm-detection) from Kaggle. The dataset is nice and simple.

* The `is_sarcastic` field is 1 if the record is sarcastic, otherwise 0
* There is a `headline` of the news article that we will train on
* There is also an `article_link` to the original news article

### Format
The data is stored in JSON format. We have to convert it to Python data structures for training: every JSON element becomes a Python list element.

Python's built-in `json` module can do this!
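
A minimal sketch of that conversion with the standard-library `json` module is shown below. The two records are made up and only imitate the three fields described above. It assumes the data is a single JSON array; the real file may be laid out differently.

```python
import json

# Hypothetical two-record sample in the shape described above.
raw = """[
  {"article_link": "https://example.com/a", "headline": "first made-up headline", "is_sarcastic": 0},
  {"article_link": "https://example.com/b", "headline": "second made-up headline", "is_sarcastic": 1}
]"""

records = json.loads(raw)  # list of dicts, one per JSON element

# Split the records into parallel Python lists.
sentences = [r["headline"] for r in records]
labels = [r["is_sarcastic"] for r in records]
urls = [r["article_link"] for r in records]
print(sentences)
print(labels)
```

When reading from disk, `json.load(f)` on an open file does the same job. If a file stores one JSON object per line instead of one array, each line can be parsed separately with `json.loads`.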

## Preprocessing
* Create a `token` for every word in the corpus
* Convert sentences into `sequences` of tokens and `pad` them to the same length
* Slice the data into training and test sets

**Note:** We need to make sure that our `tokenizer` is fitted on the training data only, not on the test data. Call `fit_on_texts` only on the training sentences, which means the split has to happen before the tokenizer is fitted.
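
The ordering matters, as the plain-Python sketch below shows. The sentences, labels, and the `training_size` split point are made up, and a simple set stands in for the fitted vocabulary.

```python
sentences = ["i love my dog", "i love my cat", "you love my dog", "cats are amazing"]
labels = [0, 0, 1, 1]  # made-up labels

training_size = 3  # hypothetical split point

# Slice first, so the vocabulary never sees the test sentences.
train_sentences, test_sentences = sentences[:training_size], sentences[training_size:]
train_labels, test_labels = labels[:training_size], labels[training_size:]

# Vocabulary built from the training slice only.
train_vocab = {word for s in train_sentences for word in s.split()}

# Words in the test set that the vocabulary has never seen.
unseen = [w for s in test_sentences for w in s.split() if w not in train_vocab]
print(unseen)
```

Words that occur only in the test slice stay unknown to the vocabulary, which mirrors how the model will meet genuinely new text later.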

## Embeddings
`Word embedding`, `word vector` or `embeddings` is a term used for the representation of words for text analysis, typically in the form of a `real-valued vector` that encodes the meaning of the word such that the words that are closer in the vector space are expected to be similar in meaning. Word embeddings allow words with similar meaning to have a similar representation.
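
To make "closer in the vector space" concrete, the sketch below compares made-up three-dimensional vectors with cosine similarity, which is one common closeness measure. The vectors and the `cosine_similarity` helper are assumptions for this example; real embeddings are learned during training and usually have many more dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up vectors, not learned values.
vectors = {
    "good":  [0.9, 0.1, 0.3],
    "great": [0.8, 0.2, 0.4],
    "awful": [-0.7, 0.1, -0.5],
}

sim_close = cosine_similarity(vectors["good"], vectors["great"])
sim_far = cosine_similarity(vectors["good"], vectors["awful"])
print(sim_close, sim_far)
```

In this toy setup the two words with similar meaning point in nearly the same direction, while the word with opposite meaning points away.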

In the code, the first layer of the model is an `Embedding` layer.
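
Conceptually, an embedding layer is a lookup table with one vector per token id. The plain-Python sketch below uses a hypothetical table with fixed made-up values; in a real layer the values are learned during training.

```python
# Hypothetical table: one row per token id, 2 dimensions per row.
embedding_table = [
    [0.0, 0.0],    # 0: padding
    [0.1, -0.2],   # 1
    [0.4, 0.3],    # 2
    [-0.5, 0.2],   # 3
]

def embed(sequence, table):
    """Look up the vector for every token id in the sequence."""
    return [table[token_id] for token_id in sequence]

padded_sequence = [0, 0, 2, 3, 1]
embedded = embed(padded_sequence, embedding_table)
print(len(embedded), len(embedded[0]))  # sequence length, embedding dimension
```

A padded sequence of integers therefore becomes a sequence of vectors of the same length, which is what the rest of the network works with.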

## Train the model
Use `model.fit` to train the network.

## Test
Convert the new sentences into sequences using the same `tokenizer`, `pad` them with the same padding type as the training data, and then use `model.predict`.


