# Text Preparation

In [10]:
from tensorflow.keras.preprocessing.text import Tokenizer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

In [11]:
import nltk
nltk.download('punkt_tab')
nltk.download('stopwords')

[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/devonrasch/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/devonrasch/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [16]:
lines = [
    "the quick brown fox",
    "Jumps over $$$ the lazy brown dog",
    "who jumps high into the blue sky after counting 123",
    "And quickly returns to earth"
]

tokenizer = Tokenizer()
tokenizer.fit_on_texts(lines)
sequences = tokenizer.texts_to_sequences(lines)

In [17]:
sequences

[[1, 4, 2, 5],
 [3, 6, 1, 7, 2, 8],
 [9, 3, 10, 11, 1, 12, 13, 14, 15, 16],
 [17, 18, 19, 20, 21]]

The word `brown` appears in lines one and two and is represented by the index 2, so 2 appears in both of those sequences. Similarly, 3, which represents the word `jumps`, appears in sequences two and three. The index 0 is never used to denote a word; it is reserved to serve as padding.

TL;DR:
* Index 0 is reserved to serve as padding.
* `the` is the most frequent word, so it is represented by 1; `$$$` is filtered out as punctuation and gets no index.
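
To make the indexing rule concrete, here is a minimal pure-Python sketch of frequency-ranked indexing. It is a simplified stand-in, not the Keras implementation: the helper names `simple_tokens` and `build_word_index` are hypothetical, and `FILTERS` is assumed to match the default `filters` argument of `Tokenizer`. Words are ranked by count, with ties broken by first appearance, and numbering starts at 1 so that 0 stays free for padding.

```python
from collections import Counter

# Characters treated as separators. Assumed to mirror the default
# `filters` argument of the Keras Tokenizer.
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'


def simple_tokens(text):
    """Lowercase the text, turn filtered characters into spaces, and split."""
    table = str.maketrans({ch: " " for ch in FILTERS})
    return text.lower().translate(table).split()


def build_word_index(texts):
    """Map each word to a rank: the most frequent word gets 1, and so on."""
    counts = Counter()
    for text in texts:
        counts.update(simple_tokens(text))
    # sorted() is stable, so words with equal counts keep first-seen order.
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    # Start at 1: index 0 is left unused so it can serve as padding.
    return {word: i for i, (word, _) in enumerate(ranked, start=1)}


example_lines = [
    "the quick brown fox",
    "Jumps over $$$ the lazy brown dog",
    "who jumps high into the blue sky after counting 123",
    "And quickly returns to earth",
]

word_index = build_word_index(example_lines)
example_sequences = [
    [word_index[w] for w in simple_tokens(line)] for line in example_lines
]

print(word_index["the"], word_index["brown"], word_index["jumps"])
print(example_sequences[0])
```

With the real tokenizer, the same mapping can be inspected through `tokenizer.word_index` after calling `fit_on_texts`.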

In [18]:
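# Build the stop-word set once instead of on every call
STOP_WORDS = set(stopwords.words("english"))

def remove_stopwords(text):
    # Lowercase the text and split it into word tokens
    tokens = word_tokenize(text.lower())
    # Keep purely alphabetic tokens that are not stop words
    tokens = [word for word in tokens if word.isalpha() and word not in STOP_WORDS]
    return " ".join(tokens)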

lines = list(map(remove_stopwords, lines))

tokenizer = Tokenizer()
tokenizer.fit_on_texts(lines)
tokenizer.texts_to_sequences(lines)

[[3, 1, 4], [2, 5, 1, 6], [2, 7, 8, 9, 10], [11, 12, 13]]

Neural networks expect all sequences to be the same length. Keras's `pad_sequences` performs the final step, truncating sequences longer than the specified length and padding sequences shorter than the specified length with zeros.

In [21]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

padded_sequences = pad_sequences(sequences, maxlen=10)
padded_sequences

array([[ 0,  0,  0,  0,  0,  0,  1,  4,  2,  5],
       [ 0,  0,  0,  0,  3,  6,  1,  7,  2,  8],
       [ 9,  3, 10, 11,  1, 12, 13, 14, 15, 16],
       [ 0,  0,  0,  0,  0, 17, 18, 19, 20, 21]], dtype=int32)
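
The zeros in the output above sit on the left of each row. The following pure-Python sketch mimics that behaviour for illustration only. `pad_pre` is a hypothetical helper, not part of Keras, and it assumes the `pad_sequences` defaults `padding="pre"` and `truncating="pre"`, meaning zeros are added at the front and overlong sequences lose their earliest items.

```python
def pad_pre(seqs, maxlen, value=0):
    """Left-pad or left-truncate each sequence to exactly `maxlen` items."""
    padded = []
    for seq in seqs:
        # Keep only the last `maxlen` items (drop from the front if too long).
        kept = seq[max(len(seq) - maxlen, 0):]
        # Fill the remaining positions at the front with the padding value.
        padded.append([value] * (maxlen - len(kept)) + kept)
    return padded


short = pad_pre([[1, 4, 2, 5]], maxlen=10)
long_ = pad_pre([[1, 2, 3, 4, 5]], maxlen=3)
print(short)
print(long_)
```

To pad or truncate at the end instead, `pad_sequences` accepts `padding="post"` and `truncating="post"`.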

In [None]:
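# CHEAT CODE: add digits to the Tokenizer filters so that numbers such as 123
# are stripped along with punctuation, with no separate cleaning pass.
# tokenizer = Tokenizer(filters='!"@#$%^&*()-_=+:;[{]}\\|<>,./?\t\n`~ '
#                       '0123456789')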