<a href="https://colab.research.google.com/github/rafi007akhtar/coursera-tensorflow/blob/master/NLP_practise.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tokenizer practice

Colab notebook containing code for basic NLP practice using Keras.

* Topics covered:
  - word index
  - sequences
  - out-of-vocabulary (OOV) words
  - padding

* Class used: `Tokenizer`

* Functions / methods used
  - `fit_on_texts`
  - `texts_to_sequences`
  - `pad_sequences`

## Import `Tokenizer` from `tensorflow.keras`

In [0]:
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer

## Generate word index

In [0]:
# consider the following list of sentences
sentences = [
    "winter is coming",
    "winter is here"
]

# create an instance of the Tokenizer class
tkz = Tokenizer(num_words=100, oov_token="<missing>")

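Parameters used to create the tokenizer above:
- `num_words`: the maximum vocabulary size, ranked by word frequency. When text is converted to sequences, only words whose index is below `num_words` are kept; `word_index` itself still lists every word seen.
- `oov_token`: the token that replaces unrecognized ("out-of-vocabulary") words when text is converted to sequences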
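
To make the effect of `num_words` concrete, here is a minimal pure-Python sketch. It is not the Keras implementation: `encode` is a hypothetical helper, and the word index is assumed for illustration. It only mimics the rule that an index at or above `num_words` is treated as out-of-vocabulary.

```python
# Assumed word index, ranked by frequency; index 1 is the OOV token.
word_index = {"<missing>": 1, "winter": 2, "is": 3, "coming": 4, "here": 5}


def encode(text, word_index, num_words=None, oov_token="<missing>"):
    """Hypothetical helper: map words to indices, honouring num_words."""
    oov_index = word_index[oov_token]
    encoded = []
    for word in text.lower().split():
        idx = word_index.get(word, oov_index)
        # Words ranked at or beyond num_words fall back to the OOV index.
        if num_words is not None and idx >= num_words:
            idx = oov_index
        encoded.append(idx)
    return encoded


print(encode("winter is coming", word_index, num_words=100))  # [2, 3, 4]
print(encode("winter is coming", word_index, num_words=4))    # [2, 3, 1]
```

With `num_words=100` the limit never bites for such a small vocabulary, which is why it has no visible effect in this notebook.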

In [11]:
tkz.fit_on_texts(sentences)  # builds the vocabulary from the sentences (returns nothing)
ind = tkz.word_index  # dictionary mapping each word to its index
print(ind)

{'<missing>': 1, 'winter': 2, 'is': 3, 'coming': 4, 'here': 5}
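
The ordering in the output above follows word frequency, with ties kept in first-seen order and the OOV token placed at index 1. A rough pure-Python sketch of that ranking is shown below. `build_word_index` is a hypothetical helper, not part of Keras, and it only lowercases and splits on whitespace, whereas the real `Tokenizer` also strips punctuation by default.

```python
from collections import Counter


def build_word_index(texts, oov_token="<missing>"):
    """Hypothetical helper: rank words by frequency, OOV token first."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    # sorted() is stable, so equally frequent words keep first-seen order.
    ranked = sorted(counts, key=lambda word: -counts[word])
    vocab = [oov_token] + ranked
    # Indices start at 1; 0 is left free for padding.
    return {word: i for i, word in enumerate(vocab, start=1)}


sentences = ["winter is coming", "winter is here"]
word_index = build_word_index(sentences)
print(word_index)
# {'<missing>': 1, 'winter': 2, 'is': 3, 'coming': 4, 'here': 5}
```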


## Generate sequences

In [12]:
seq = tkz.texts_to_sequences(sentences) # returns a list with each word replaced by its index from the word_index
seq

[[2, 3, 4], [2, 3, 5]]

## Missing text

This is where the `oov_token` parameter comes into play. `texts_to_sequences` replaces every out-of-vocabulary word with the index of that token (`1` here).

If `oov_token` had not been provided, the unrecognized words would simply have been dropped from the sequences.

In [13]:
test_text = [
    "when winter comes and",
    "here lies my grave",
    "the lone wolf dies but the pack survives"
]
seq = tkz.texts_to_sequences(test_text)
seq

[[1, 2, 1, 1], [5, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1]]
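
To compare the two behaviours side by side, here is a small pure-Python sketch. `to_sequences` is a hypothetical stand-in for `texts_to_sequences`, and both word indices are written out by hand (the second is what the same two sentences would give if index 1 were not reserved for an OOV token).

```python
def to_sequences(texts, word_index, oov_token=None):
    """Hypothetical helper: replace words by indices, handling unknown words."""
    # None when no OOV token is configured.
    oov_index = word_index.get(oov_token)
    sequences = []
    for text in texts:
        seq = []
        for word in text.lower().split():
            idx = word_index.get(word, oov_index)
            if idx is not None:  # unknown word and no OOV token: drop it
                seq.append(idx)
        sequences.append(seq)
    return sequences


test_text = [
    "when winter comes and",
    "here lies my grave",
    "the lone wolf dies but the pack survives",
]

index_with_oov = {"<missing>": 1, "winter": 2, "is": 3, "coming": 4, "here": 5}
index_without_oov = {"winter": 1, "is": 2, "coming": 3, "here": 4}

with_oov = to_sequences(test_text, index_with_oov, oov_token="<missing>")
without_oov = to_sequences(test_text, index_without_oov)

print(with_oov)     # [[1, 2, 1, 1], [5, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1]]
print(without_oov)  # [[1], [4], []]
```

Without an OOV token the sequences lose their original lengths, and the last sentence collapses to an empty list.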

In [14]:
tkz.fit_on_texts(test_text)
ind = tkz.word_index
print(ind)

{'<missing>': 1, 'winter': 2, 'is': 3, 'here': 4, 'the': 5, 'coming': 6, 'when': 7, 'comes': 8, 'and': 9, 'lies': 10, 'my': 11, 'grave': 12, 'lone': 13, 'wolf': 14, 'dies': 15, 'but': 16, 'pack': 17, 'survives': 18}


## Padding

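Here, zeros are added to the shorter sequences so that all of them have the same length. By default the zeros go at the beginning, and the target length is that of the longest sequence.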

In [0]:
from tensorflow.keras.preprocessing.sequence import pad_sequences
pads = pad_sequences(seq) 


In [0]:
pads

array([[0, 0, 0, 0, 1, 2, 1, 1],
       [0, 0, 0, 0, 5, 1, 1, 1],
       [1, 1, 1, 1, 1, 1, 1, 1]], dtype=int32)

Additional parameters that can be used:
- use `padding="post"` to add the zeros at the end of each sequence instead of the beginning
- use `maxlen=5` to cap each sequence at 5 entries; longer sequences lose entries from the beginning
- use `truncating="post"` along with `maxlen` to drop the extra entries from the end instead
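
The options above can be sketched in plain Python as follows. `pad` is a hypothetical helper that mimics the behaviour described here; it returns nested lists rather than the NumPy array that `pad_sequences` produces. The input sequences are worked out by hand from the refitted word index, so treat them as illustrative.

```python
def pad(sequences, maxlen=None, padding="pre", truncating="pre", value=0):
    """Hypothetical helper: pad or truncate every sequence to the same length."""
    if maxlen is None:
        maxlen = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        seq = list(seq)
        if len(seq) > maxlen:
            # "pre" drops entries from the start, "post" from the end.
            seq = seq[len(seq) - maxlen:] if truncating == "pre" else seq[:maxlen]
        fill = [value] * (maxlen - len(seq))
        padded.append(fill + seq if padding == "pre" else seq + fill)
    return padded


seqs = [[7, 2, 8, 9], [4, 10, 11, 12], [5, 13, 14, 15, 16, 5, 17, 18]]

print(pad(seqs)[0])                                 # [0, 0, 0, 0, 7, 2, 8, 9]
print(pad(seqs, padding="post")[0])                 # [7, 2, 8, 9, 0, 0, 0, 0]
print(pad(seqs, maxlen=5)[2])                       # [15, 16, 5, 17, 18]
print(pad(seqs, maxlen=5, truncating="post")[2])    # [5, 13, 14, 15, 16]
```

Note that `padding` and `truncating` are independent: one controls where zeros are added to short sequences, the other controls which end of a long sequence is cut.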