### https://machinelearningmastery.com/prepare-text-data-deep-learning-keras/

### *** https://towardsdatascience.com/natural-language-processing-with-tensorflow-e0a701ef5cef

### text_to_word_sequence

In [2]:
from keras.preprocessing.text import text_to_word_sequence

# define the document
text = 'The quick brown fox jumped over the lazy dog. Dog is big in size'
# tokenize the document
vocab_words = text_to_word_sequence(text)
print(len(vocab_words))
print(vocab_words)

14
['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog', 'dog', 'is', 'big', 'in', 'size']
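
The output above follows from three steps: lowercasing, filtering punctuation, and splitting on whitespace. Below is a minimal pure-Python sketch of those steps. `simple_word_sequence` is a hypothetical helper, not part of Keras, and the `FILTERS` string is assumed to match the library's default `filters` argument.

```python
# Characters to strip; assumed to mirror the Keras default `filters` value.
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def simple_word_sequence(text, filters=FILTERS, lower=True, split=' '):
    """Hypothetical stand-in for text_to_word_sequence."""
    if lower:
        text = text.lower()
    # Replace every filtered character with the split character, then split.
    text = text.translate(str.maketrans({ch: split for ch in filters}))
    # Drop the empty strings produced by consecutive separators.
    return [token for token in text.split(split) if token]

text = 'The quick brown fox jumped over the lazy dog. Dog is big in size'
tokens = simple_word_sequence(text)
print(len(tokens))
print(tokens)
```

Note that duplicates such as 'the' and 'dog' are kept: the function returns a word sequence, not a vocabulary.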


### one_hot Encoding

Keras provides the one_hot() function, which tokenizes and integer encodes a text document in one step. Despite its name, it does not produce one-hot vectors: it is a wrapper around hashing_trick() that returns one integer per word.

As with the text_to_word_sequence() function in the previous section, the one_hot() function will make the text **lower case, filter out punctuation, and split words based on white space.**

In addition to the text, the vocabulary size (total words) must be specified. This can be the number of unique words in the document, or more if you intend to encode additional documents that contain additional words. The vocabulary size defines the hashing space into which words are hashed. Ideally, it should be larger than the actual vocabulary by some percentage (perhaps 25%) to reduce the number of collisions. Collisions can still occur: in the output below, 'jumped', 'over', and 'in' all map to 4. By default, the 'hash' function is used, although as we will see in the next section, alternate hash functions can be specified when calling the hashing_trick() function directly.
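
To see why a larger hashing space reduces collisions, here is a small pure-Python sketch. `stable_hash_encode` and `count_collisions` are hypothetical helpers, and the mapping `hash % (n - 1) + 1` (which keeps index 0 reserved) is an assumption about how the hashing is applied rather than a copy of the library code. MD5 is used instead of Python's built-in `hash`, because the built-in is randomized per process for strings and would give different integers on each run.

```python
import hashlib

def stable_hash_encode(words, n):
    """Map each word to an integer in [1, n - 1] using MD5 (hypothetical helper)."""
    return [int(hashlib.md5(w.encode('utf-8')).hexdigest(), 16) % (n - 1) + 1
            for w in words]

def count_collisions(vocab, n):
    """Number of words minus the number of distinct indices they were mapped to."""
    codes = stable_hash_encode(sorted(vocab), n)
    return len(codes) - len(set(codes))

vocab = {'the', 'quick', 'brown', 'fox', 'jumped', 'over',
         'lazy', 'dog', 'is', 'big', 'in', 'size'}

# Collisions for increasingly large hashing spaces.
for n in (4, 16, 64, 1024):
    print(n, count_collisions(vocab, n))
```

With `n = 4` there are only 3 usable indices for 12 words, so at least 9 collisions are unavoidable. Larger values of `n` make collisions less likely but never rule them out.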

In [6]:
from keras.preprocessing.text import one_hot
vocab = set(vocab_words) #To remove duplicate words
vocab_size = len(vocab)
# The hashing space is made about 30% larger than the vocabulary to reduce collisions when hashing words.
one_hot_encoding = one_hot(text, round(vocab_size*1.3))
print(vocab)
print(one_hot_encoding)

{'is', 'over', 'size', 'dog', 'fox', 'in', 'jumped', 'the', 'brown', 'lazy', 'big', 'quick'}
[10, 13, 14, 12, 4, 4, 10, 3, 15, 15, 2, 8, 4, 14]


### Hash Encoding with hashing_trick

In [9]:
from keras.preprocessing.text import hashing_trick
from keras.preprocessing.text import text_to_word_sequence
# define the document
text = 'The quick brown fox jumped over the lazy dog.'
# estimate the size of the vocabulary
words = set(text_to_word_sequence(text))
vocab_size = len(words)
print(vocab_size)
# integer encode the document
result = hashing_trick(text, round(vocab_size*1.3), hash_function='md5')
print(words)
print(result)

8
{'over', 'dog', 'fox', 'jumped', 'the', 'brown', 'lazy', 'quick'}
[6, 4, 1, 2, 7, 5, 6, 2, 6]


### Tokenizer API


### https://towardsdatascience.com/natural-language-processing-with-tensorflow-e0a701ef5cef

The Tokenizer class does the heavy lifting. It assigns an integer index to each word and builds a dictionary of the words used in the sentences. In the later examples, the tokenizer is created with the num_words hyperparameter set to 100, which keeps only the most frequent words when texts are converted to sequences.

The word_index attribute returns the word-to-index key-value pairs. Note that 'I' is stored as 'i' and 'dog!' as 'dog': the tokenizer lowercases the text and strips punctuation.

In [11]:
from keras.preprocessing.text import Tokenizer
# define 5 documents
docs = ['Well done!',
		'Good work',
		'Great effort',
		'nice work',
		'Excellent!']
# create the tokenizer
t = Tokenizer()
# fit the tokenizer on the documents
t.fit_on_texts(docs)
# summarize what was learned
print(t.word_counts)
print(t.document_count)
print(t.word_index)
print(t.word_docs)

OrderedDict([('well', 1), ('done', 1), ('good', 1), ('work', 2), ('great', 1), ('effort', 1), ('nice', 1), ('excellent', 1)])
5
{'work': 1, 'well': 2, 'done': 3, 'good': 4, 'great': 5, 'effort': 6, 'nice': 7, 'excellent': 8}
defaultdict(<class 'int'>, {'well': 1, 'done': 1, 'work': 2, 'good': 1, 'effort': 1, 'great': 1, 'nice': 1, 'excellent': 1})
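
The word_index printed above ranks words by frequency, with ties kept in first-seen order, and numbers them starting from 1. The following pure-Python sketch reproduces that ordering under those assumptions. `build_word_index` is a hypothetical helper, and the documents are pre-tokenized by hand to keep the example self-contained.

```python
from collections import Counter

def build_word_index(tokenized_docs):
    """Rank words by descending count; ties keep first-seen order (hypothetical helper)."""
    counts = Counter()
    for doc in tokenized_docs:
        counts.update(doc)
    # sorted() is stable, so equally frequent words stay in insertion order.
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    # Index 0 is left unused, so numbering starts at 1.
    return {word: i for i, (word, _) in enumerate(ranked, start=1)}

tokenized_docs = [
    ['well', 'done'],
    ['good', 'work'],
    ['great', 'effort'],
    ['nice', 'work'],
    ['excellent'],
]
word_index = build_word_index(tokenized_docs)
print(word_index)
```

'work' appears twice, so it receives index 1. The remaining words each appear once and keep the order in which they were first seen.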


#### https://colab.research.google.com/github/lmoroney/dlaicourse/blob/master/TensorFlow%20In%20Practice/Course%203%20-%20NLP/Course%203%20-%20Week%201%20-%20Lesson%201.ipynb

In [5]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

sentences = [
    'i love my dog',
    'I, love my cat',
    'You love my dog!'
]

tokenizer = Tokenizer(num_words = 100)
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
print('Training set')
print(word_index)

# Try with words that the tokenizer wasn't fit to
test_data = [
    'i really love my dog',
    'my dog loves my manatee'
]

test_seq = tokenizer.texts_to_sequences(test_data)
print("\nTest Sequence = ", test_seq)

padded = pad_sequences(test_seq, maxlen=10)
print("\nPadded Test Sequence: ")
print(padded)

Training set
{'love': 1, 'my': 2, 'i': 3, 'dog': 4, 'cat': 5, 'you': 6}

Test Sequence =  [[3, 1, 2, 4], [2, 4, 2]]

Padded Test Sequence: 
[[0 0 0 0 0 0 3 1 2 4]
 [0 0 0 0 0 0 0 2 4 2]]


In test_data, the sentence "i really love my dog" has 5 words, but texts_to_sequences returns only 4 indices, because "really" does not appear in the data passed to fit_on_texts and is silently dropped.

Likewise, in "my dog loves my manatee", the words "loves" and "manatee" are missing from the fit_on_texts training data, so only 3 indices are returned.

<b>This can be fixed by setting the oov_token argument of the Tokenizer.</b> Unseen words are then mapped to the index of that token (1 in the example below) instead of being dropped.
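
To make the difference concrete, here is a minimal pure-Python sketch of the lookup with and without an out-of-vocabulary token. `encode` is a hypothetical helper, and both dictionaries are written by hand for illustration. The second one assumes the OOV token takes index 1 and shifts every other word up by one.

```python
def encode(sentence, word_index, oov_token=None):
    """Look up each word; unknown words are dropped unless an OOV token is given."""
    fallback = word_index.get(oov_token) if oov_token is not None else None
    encoded = []
    for word in sentence.lower().split():
        index = word_index.get(word, fallback)
        if index is not None:  # without a fallback, unknown words are skipped
            encoded.append(index)
    return encoded

# Hand-written indices for illustration only.
index_without_oov = {'love': 1, 'my': 2, 'i': 3, 'dog': 4, 'cat': 5, 'you': 6}
index_with_oov = {'<OOV>': 1, 'love': 2, 'my': 3, 'i': 4, 'dog': 5, 'cat': 6, 'you': 7}

for sentence in ('i really love my dog', 'my dog loves my manatee'):
    print(encode(sentence, index_without_oov))
    print(encode(sentence, index_with_oov, oov_token='<OOV>'))
```

With the OOV token, each encoded sequence has the same length as the input sentence, so the position of every word is preserved.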

In [8]:
sentences = [
    'I love my dog',
    'I love my cat',
    'You love my dog!',
    'Do you think my dog is amazing?'
]

tokenizer = Tokenizer(num_words = 100, oov_token="<OOV>")
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index

sequences = tokenizer.texts_to_sequences(sentences)

padded = pad_sequences(sequences, maxlen=5)
print("\nWord Index = " , word_index)
print("\nSequences = " , sequences)
print("\nPadded Sequences:")
print(padded)


# Try with words that the tokenizer wasn't fit to
test_data = [
    'i really love my dog',
    'my dog loves my manatee'
]

test_seq = tokenizer.texts_to_sequences(test_data)
print("\nTest Sequence = ", test_seq)

padded = pad_sequences(test_seq, maxlen=10)
print("\nPadded Test Sequence: ")
print(padded)


Word Index =  {'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6, 'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}

Sequences =  [[5, 3, 2, 4], [5, 3, 2, 7], [6, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11]]

Padded Sequences:
[[ 0  5  3  2  4]
 [ 0  5  3  2  7]
 [ 0  6  3  2  4]
 [ 9  2  4 10 11]]

Test Sequence =  [[5, 1, 3, 2, 4], [2, 4, 1, 2, 1]]

Padded Test Sequence: 
[[0 0 0 0 0 5 1 3 2 4]
 [0 0 0 0 0 2 4 1 2 1]]
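
The padded outputs above show two behaviours: short sequences are padded with zeros at the front, and sequences longer than maxlen lose their leading elements. The sketch below reproduces this in pure Python. `pad_pre` is a hypothetical helper that assumes maxlen is at least 1 and covers only the "pre" padding and "pre" truncation case seen in these outputs.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad with `value` and keep only the last `maxlen` items of each sequence."""
    padded = []
    for seq in sequences:
        tail = list(seq)[-maxlen:]                      # drop leading items if too long
        padded.append([value] * (maxlen - len(tail)) + tail)
    return padded

sequences = [[5, 3, 2, 4], [5, 3, 2, 7], [6, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11]]
for row in pad_pre(sequences, maxlen=5):
    print(row)

test_seq = [[5, 1, 3, 2, 4], [2, 4, 1, 2, 1]]
for row in pad_pre(test_seq, maxlen=10):
    print(row)
```

The last training sentence has 7 tokens, so with maxlen=5 its first two indices (8 and 6) are cut off. If the start of a sentence matters more than its end, a larger maxlen avoids losing it.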
