# Word-Based Text Generator in Keras

This notebook describes a word-based text generator based on <a href='https://machinelearningmastery.com/how-to-develop-a-word-level-neural-language-model-in-keras/'>this tutorial</a> written by Jason Brownlee. Like the Char-Based text generator, this model is intended to be exported to TensorFlow.js and hosted on a web server for MIT App Inventor in order to teach middle and high school students the fundamental concepts of AI and machine learning.

In this document, I have combined three almost independent parts of the process of training this word-based text generator. These parts are:
* Process the corpus
* Train the Model
* Use the Model

Basically, you can process the corpus with one Python script, train a new (or existing) model with another, and generate text from an exported model with a third.

This explains some of the redundancy in the code, such as saving text to a file and loading it back when you could simply keep using the variable.

## Imports and Parameters

This model is also trained using the Keras library. 

The parameters are similar to the ones used in the Char-Based model. Here, the variable `in_filename` is the path to the body of text (corpus) on which we will train the model. Because of the way we process the corpus, described in a later section, we also need an `out_filename`, the path of the new text file that holds the processed sequences. The constant `TEXT_NAME` is the base name of the files in which we save the model, its tokenizer, and the processed output file.

The `input_size` is the same as the `look_back` in the Char-Based generator. Previously, the Char-Based generator assumed that the output size of the model would just be 1. In this project, there is now an explicit variable, `output_size`, stating that the expected output size is 1. I'm not sure what would happen if we increased `output_size`. It may or may not create a model that produces the next two words, or it may break...

For the sake of this example, we will be training using the Dr. Seuss text file saved in the datasets folder. 



In [None]:
import string
from numpy import array
from random import randint
import keras
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras.callbacks import LambdaCallback

from pickle import dump

in_filename = '../datasets/drseuss.txt'
TEXT_NAME = 'drseuss'
out_filename = '{}-lines.txt'.format(TEXT_NAME)

input_size = 50
output_size = 1

### Filtered Punctuation

One parameter that I added is `filtered_punctuation` which is a string containing all of the punctuation which we will want to remove from the text. This cleans up the text significantly and reduces the vocabulary that the model may need to learn. 

After some training, I noticed that the models were not generating complete thoughts and seemed to get caught in run-on sentences. This makes sense, since the model never knew where to complete a thought and move on to the next one. As a result, I decided to remove commas and periods from `filtered_punctuation`. My reasoning was that if the models treated periods and commas as words of their own, they would be able to generate more coherent thoughts. Some post-processing needs to be applied to the model's outputs in order to remove the space before the commas and periods, but aside from that, the resulting models showed a significant improvement in coherence.

In [None]:
filtered_punctuation = string.punctuation.replace(',','').replace('.','')
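The post-processing mentioned above could look something like the following. This is only a minimal sketch: the function name `tidy_punctuation` is hypothetical and not part of the notebook, and it assumes the generated text separates every token with a single space.

```python
import re

def tidy_punctuation(text):
    """Remove the whitespace that precedes a comma or period."""
    return re.sub(r'\s+([,.])', r'\1', text)

generated = 'i am not right to get back to kansas . oh , yes asked dorothy .'
tidied = tidy_punctuation(generated)
print(tidied)
```

Capitalizing the first word of each sentence would be a separate step and is not attempted here.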

## Load the Document

Here, Brownlee defines a helper function `load_doc` that loads the source text from the provided file path and returns the entire document as one large string. We can then take a look at the first handful of characters in our document.

In [None]:
# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# load document
doc = load_doc(in_filename)
print(doc[:200])

## Clean the Document

Now that we've loaded the document into memory, we want to clean and process the document. 

For example, before splitting the document into words, we replace all "-" with spaces so that the words split more cleanly.

We also remove the punctuation marks from the text, excluding those we mentioned earlier.

We can then see the "tokens" in the document, which are simply the words in the text. In our case, this also includes any commas and periods, since we kept them as tokens of their own.

In [None]:
# turn a doc into clean tokens
def clean_doc(doc):
    # make lower case
    doc = doc.lower()
    # replace '--' and '-' with a space ' '
    doc = doc.replace('--', ' ')
    doc = doc.replace('-', ' ')
    # I put a space before the punctuation so that words like "however," 
    # are not treated differently from ones like "however " due to the comma
    doc = doc.replace('.', ' .')
    doc = doc.replace(',', ' ,')
    doc = doc.replace('  ', ' ')
    # split into tokens by white space
    tokens = doc.split()
    # remove punctuation from each token
    table = str.maketrans('', '', filtered_punctuation)
    tokens = [w.translate(table) for w in tokens]
    # remove remaining tokens that are not alphabetic, keeping ',' and '.'
    tokens = [word for word in tokens if word.isalpha() or word in (',', '.')]
    return tokens

# clean document
tokens = clean_doc(doc)
print(tokens[:20])
print('Total Tokens: %d' % len(tokens))
print('Unique Tokens: %d' % len(set(tokens)))
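To see what the cleaning steps do, it can help to run them on a single made-up sentence. The sketch below repeats the same operations as `clean_doc` on a sample string of my own choosing (not taken from the corpus), so the exact tokens are only illustrative.

```python
import string

# Same definition as in the notebook: every punctuation mark except ',' and '.'
filtered_punctuation = string.punctuation.replace(',', '').replace('.', '')
table = str.maketrans('', '', filtered_punctuation)

sample = "I do not like them, Sam-I-am!"
text = sample.lower().replace('-', ' ')
# put a space before ',' and '.' so they split off as their own tokens
text = text.replace('.', ' .').replace(',', ' ,')
sample_tokens = [w.translate(table) for w in text.split()]
sample_tokens = [w for w in sample_tokens if w.isalpha() or w in (',', '.')]
print(sample_tokens)
```

The comma survives as a separate token, while the exclamation mark is stripped from the final word.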

## Dataset Preparation

Now that we've tokenized our data (that is, separated the document into a list of words), we can organize the dataset into input and output words. In this case, we've set the input to be 50 words followed by the next word. In other words, sequences of length 51, or `input_size + output_size`.

The resulting list, `sequences`, is a list of strings with exactly 51 words each.

Once we have finished creating this list of sequences, we want to save this to a different file with the name `out_filename`. This will become the actual training data for our new model once we create it. The helper function `save_doc` will help us do this.

In [92]:
# organize into sequences of tokens
length = input_size + output_size
sequences = list()
for i in range(length, len(tokens)):
    # select sequence of tokens
    seq = tokens[i-length:i]
    # convert into a line
    line = ' '.join(seq)
    # store
    sequences.append(line)
print('Total Sequences: %d' % len(sequences))

Total Sequences: 13783
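The sliding window is easier to see on a toy example. This sketch uses a made-up six-token list and a window of 3 input words plus 1 output word, and otherwise mirrors the loop above.

```python
toy_tokens = ['the', 'sun', 'did', 'not', 'shine', '.']
toy_input_size, toy_output_size = 3, 1
toy_length = toy_input_size + toy_output_size

toy_sequences = []
for i in range(toy_length, len(toy_tokens)):
    # each window holds toy_input_size context words plus the word to predict
    toy_sequences.append(' '.join(toy_tokens[i - toy_length:i]))

for line in toy_sequences:
    print(line)
```

Note that with this loop bound the final window (`did not shine .`) is never emitted. Using `range(toy_length, len(toy_tokens) + 1)` would include it, at the cost of one extra sequence.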


In [93]:
# save sequences to file, one sequence per line
def save_doc(lines, filename):
    data = '\n'.join(lines)
    file = open(filename, 'w')
    file.write(data)
    file.close()
# save sequences to file
save_doc(sequences, out_filename)

# Training The Model

From this point on, we can almost separate this code from what came above. Now that we've done the data preparation, we can load the dataset and prepare it for the model by integer encoding the words and one-hot encoding the outputs.

## Load the Sequences

Here, we reuse Brownlee's helper function `load_doc`. Given the path to the file containing the lines of length `input_size + output_size`, it returns the full document as one large string, which we then split into an array of lines.

Now we should have an array called `lines` where every element is one string with 51 words.

In [95]:
# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# load
doc = load_doc(out_filename)
lines = doc.split('\n')
lines[0]

'the cat in the hat by dr seuss the sun did not shine it was too wet to play so we sat in the house all that cold cold wet day i sat there with sally we sat there we two and i said how i wish we had something to'

## Encode the Sequences

Before we can train the model on the data, we need to tokenize the data again, this time mapping each word to an integer. We fit a Keras `Tokenizer` on the lines and use it to encode them, saving the encoded sentences into an array of sequences.

Finally, we export this tokenizer as a pickle (`.pkl`) file, which we can load back into a different Python script should we so choose.

In [96]:
from keras.preprocessing.text import Tokenizer
# integer encode sequences of words
tokenizer = Tokenizer(filters=filtered_punctuation)
tokenizer.fit_on_texts(lines)
sequences = tokenizer.texts_to_sequences(lines)
# save the tokenizer
dump(tokenizer, open(TEXT_NAME + '-tokenizer.pkl', 'wb'))

Let's take a quick look at the vocabulary size of the model. This basically shows how many unique words the model has available to choose from when predicting the next word.

We'll also store this vocabulary size into the variable `vocab_size` for defining our model later.

In [97]:
# vocabulary size
vocab_size = len(tokenizer.word_index) + 1
print(vocab_size)

1870
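To make the integer encoding concrete, here is a rough pure-Python sketch of the idea behind the `Tokenizer`: words are numbered from 1 in order of decreasing frequency, and index 0 is left unused (Keras reserves it for padding), which is why `vocab_size` is `len(word_index) + 1`. The toy lines and variable names are my own, and the real `Tokenizer` does more than this (lowercasing, filtering, and so on).

```python
from collections import Counter

toy_lines = ['the cat sat', 'the hat sat on the cat']

# count every word, then number the words from 1 by decreasing frequency
counts = Counter(word for line in toy_lines for word in line.split())
toy_word_index = {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

# replace each word with its integer index
toy_encoded = [[toy_word_index[w] for w in line.split()] for line in toy_lines]

# one extra slot for the unused index 0
toy_vocab_size = len(toy_word_index) + 1

print(toy_word_index)
print(toy_encoded)
print(toy_vocab_size)
```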


Now that we've encoded the data, we need to separate the dataset into input `X` and output `y` elements.

In [98]:
# separate into input and output
sequences = array(sequences)
X, y = sequences[:,:-1], sequences[:,-1]
y = to_categorical(y, num_classes=vocab_size)
seq_length = X.shape[1]
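The slicing and one-hot step can be checked on a tiny array. This sketch uses made-up integer sequences and builds the one-hot matrix with `np.eye` as a stand-in for `to_categorical`, which I am assuming produces the same layout for this simple case.

```python
import numpy as np

toy_sequences = np.array([[1, 2, 3, 4],
                          [2, 3, 4, 5]])
toy_vocab_size = 6

# every column but the last is input, the last column is the word to predict
toy_X, toy_y = toy_sequences[:, :-1], toy_sequences[:, -1]

# row i of the identity matrix is the one-hot vector for class i
toy_y_onehot = np.eye(toy_vocab_size, dtype='float32')[toy_y]

print(toy_X.shape, toy_y_onehot.shape)
print(toy_y_onehot)
```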

## Fit the Model

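Now, we get to the meat of things: defining the model. The model that I plan on using has the following architecture: an embedding layer, two stacked LSTM layers with 100 units each, a fully connected layer with 100 ReLU units, and a softmax output layer with one unit per word in the vocabulary. Each LSTM layer and the hidden dense layer is followed by dropout with a rate of 0.1.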



In [99]:
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Embedding
from keras.layers import Dropout

# define model
model = Sequential()
model.add(Embedding(vocab_size, input_size, input_length=seq_length))
model.add(LSTM(100, return_sequences=True))
model.add(Dropout(0.1))
model.add(LSTM(100))
model.add(Dropout(0.1))
model.add(Dense(100, activation='relu'))
model.add(Dropout(0.1))
model.add(Dense(vocab_size, activation='softmax'))
print(model.summary())
# compile model
model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_5 (Embedding)      (None, 50, 50)            93500     
_________________________________________________________________
lstm_9 (LSTM)                (None, 50, 100)           60400     
_________________________________________________________________
dropout_13 (Dropout)         (None, 50, 100)           0         
_________________________________________________________________
lstm_10 (LSTM)               (None, 100)               80400     
_________________________________________________________________
dropout_14 (Dropout)         (None, 100)               0         
_________________________________________________________________
dense_9 (Dense)              (None, 100)               10100     
_________________________________________________________________
dropout_15 (Dropout)         (None, 100)               0         
__________
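The parameter counts in the visible part of the summary can be checked by hand. The sketch below assumes the usual formulas for embedding, LSTM, and dense layers; the helper names are hypothetical and exist only for this illustration.

```python
def embedding_params(vocab_size, embed_dim):
    # one embed_dim-sized vector per word index
    return vocab_size * embed_dim

def lstm_params(input_dim, units):
    # four gates, each with input weights, recurrent weights, and a bias
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # a weight per input-unit pair plus one bias per unit
    return input_dim * units + units

print(embedding_params(1870, 50))  # embedding layer
print(lstm_params(50, 100))        # first LSTM, fed 50-dimensional embeddings
print(lstm_params(100, 100))       # second LSTM, fed the first LSTM's 100 outputs
print(dense_params(100, 100))      # hidden dense layer
```

These reproduce the 93500, 60400, 80400, and 10100 shown in the summary above.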

## Helper Function and Callbacks

Finally, the model is ready to be fit on the data for some amount of epochs. This takes quite a long time. You can speed up the training by increasing the `batch_size` or decreasing the number of `epochs`.

We then also save this model. This is the point where the tutorial that I've been following ends, but personally I like to save my models periodically, so before we train, I'm going to define a callback function which will checkpoint the model every 25 epochs in case I get impatient. The callback also prints a sample of generated text at each checkpoint. This uses the Keras `LambdaCallback` class.

In [100]:
# ===================================================================
# Define function to generate text from the current model

SAVE_FILE_NAME = TEXT_NAME + "-{}.h5"

def generate_text(words_to_generate=80):
    
    result = list()
    # select a seed text (randint includes both endpoints, hence the - 1)
    seed_text = lines[randint(0, len(lines) - 1)]
    
    for i in range(words_to_generate):
        # encode the seed text
        encoded = tokenizer.texts_to_sequences([seed_text])[0]
        # truncate sequences to a fixed length
        encoded = pad_sequences([encoded], maxlen=seq_length, truncating='pre')
        # predict the index of the most likely next word
        # (predict_classes has been removed from newer Keras releases)
        yhat = model.predict_classes(encoded, verbose=0)

        # map predicted word index to word
        out_word = ''
        for word, index in tokenizer.word_index.items():
            if index == yhat:
                out_word = word
                break

        # append to input
        seed_text += ' ' + out_word
        result.append(out_word)
    return ' '.join(result)

def on_epoch_end (epoch, _):
    if (epoch + 1) % 25 == 0:
        print("Checkpointing the model...")
        model.save(SAVE_FILE_NAME.format(epoch+1))
        # save the tokenizer
        dump(tokenizer, open('tokenizer.pkl', 'wb'))
        print("Generating Text...")
        print(generate_text())
        
print_callback = LambdaCallback(on_epoch_end=on_epoch_end)


In [101]:
model.save(SAVE_FILE_NAME.format(0))

In [102]:
# fit model
model.fit(X, y, batch_size=256, epochs=500, callbacks=[print_callback])

Epoch 1/500
Epoch 2/500
Epoch 3/500
Epoch 4/500
Epoch 5/500
Epoch 6/500
Epoch 7/500
Epoch 8/500
Epoch 9/500
Epoch 10/500
Epoch 11/500
Epoch 12/500
Epoch 13/500
Epoch 14/500
Epoch 15/500
Epoch 16/500
Epoch 17/500
Epoch 18/500
Epoch 19/500
Epoch 20/500
Epoch 21/500
Epoch 22/500
Epoch 23/500
Epoch 24/500
Epoch 25/500
Checkpointing the model...
Generating Text...
zoo and and and trees and and and and and and and and and and a grinch and and and and and the grinch and and and and the grinch of and and and and the grinch and and and and and and the grinch and and and and the grinch and and and and and and and the grinch and and and and the grinch and and and and and the grinch and and and and and the grinch
Epoch 26/500
Epoch 27/500
Epoch 28/500
Epoch 29/500
Epoch 30/500
Epoch 31/500
Epoch 32/500
Epoch 33/500
Epoch 34/500
Epoch 35/500
Epoch 36/500
Epoch 37/500
Epoch 38/500
Epoch 39/500
Epoch 40/500
Epoch 41/500
Epoch 42/500
Epoch 43/500
Epoch 44/500
Epoch 45/500
Epoch 46/500
Epoch 47/500
Epo

Epoch 140/500
Epoch 141/500
Epoch 142/500
Epoch 143/500
Epoch 144/500
Epoch 145/500
Epoch 146/500
Epoch 147/500
Epoch 148/500
Epoch 149/500
Epoch 150/500
Checkpointing the model...
Generating Text...
and go down back and the lazy air and mayzie it opener and down it with snoots in the plain belly sneetches why all something but my egg why sit again it wire but the wire now the plain belly sneetches they roast beast its my grinch were aiming making thneeds making flip flapping north chugs and fast but he sing ready small youre the run of the lorax ler lifted down back the chimney on the plain belly sneetches
Epoch 151/500
Epoch 152/500
Epoch 153/500
Epoch 154/500
Epoch 155/500
Epoch 156/500
Epoch 157/500
Epoch 158/500
Epoch 159/500
Epoch 160/500
Epoch 161/500
Epoch 162/500
Epoch 163/500
Epoch 164/500
Epoch 165/500
Epoch 166/500
Epoch 167/500
Epoch 168/500
Epoch 169/500
Epoch 170/500
Epoch 171/500
Epoch 172/500
Epoch 173/500
Epoch 174/500
Epoch 175/500
Checkpointing the model...
Generati

blocks bricks and blocks on knox on box now now now chicks and clocks knox bricks and blocks sir lets do tricks tricks in the minute to ticks and clocks sir first do the most trick its a mind sir you call it some yes hes youll lot and no idea whether the tree that you will find some beasts and i let them in near you meant it all one hundred per cent one uncles and i have me
Epoch 276/500
Epoch 277/500
Epoch 278/500
Epoch 279/500
Epoch 280/500
Epoch 281/500
Epoch 282/500
Epoch 283/500
Epoch 284/500
Epoch 285/500
Epoch 286/500
Epoch 287/500
Epoch 288/500
Epoch 289/500
Epoch 290/500
Epoch 291/500
Epoch 292/500
Epoch 293/500
Epoch 294/500
Epoch 295/500
Epoch 296/500
Epoch 297/500
Epoch 298/500
Epoch 299/500
Epoch 300/500
Checkpointing the model...
Generating Text...
for the trees have no tongues and im asking you sir at a top if my lungs he was upset as went with he grim hung by the egg where the men had around at he had but my small all without brand nothing to christmas other hole in the

Epoch 410/500
Epoch 411/500
Epoch 412/500
Epoch 413/500
Epoch 414/500
Epoch 415/500
Epoch 416/500
Epoch 417/500
Epoch 418/500
Epoch 419/500
Epoch 420/500
Epoch 421/500
Epoch 422/500
Epoch 423/500
Epoch 424/500
Epoch 425/500
Checkpointing the model...
Generating Text...
balls and little pink pills oh the things that they did them they did and like it make that one cat but now again at away you have gone out of his head now all pop voom i saw have me is me he should have this fear cat me about said the fish in the hat but all right hands in a box i will show to fall you will be just two thing then the big bed with
Epoch 426/500
Epoch 427/500
Epoch 428/500
Epoch 429/500
Epoch 430/500
Epoch 431/500
Epoch 432/500
Epoch 433/500
Epoch 434/500
Epoch 435/500
Epoch 436/500
Epoch 437/500
Epoch 438/500
Epoch 439/500
Epoch 440/500
Epoch 441/500
Epoch 442/500
Epoch 443/500
Epoch 444/500
Epoch 445/500
Epoch 446/500
Epoch 447/500
Epoch 448/500
Epoch 449/500
Epoch 450/500
Checkpointing the model...
Gen

Epoch 477/500
Epoch 478/500
Epoch 479/500
Epoch 480/500
Epoch 481/500
Epoch 482/500
Epoch 483/500
Epoch 484/500
Epoch 485/500
Epoch 486/500
Epoch 487/500
Epoch 488/500
Epoch 489/500
Epoch 490/500
Epoch 491/500
Epoch 492/500
Epoch 493/500
Epoch 494/500
Epoch 495/500
Epoch 496/500
Epoch 497/500
Epoch 498/500
Epoch 499/500
Epoch 500/500
Checkpointing the model...
Generating Text...
places youll go youll be on y our way up youll be seeing great sights youll join the high fliers who soar to high heights you wont lag behind because youll have the speed youll pass the whole gang and youll soon take the lead wherever you fly youll be best of the best wherever you go you will top all the rest except when you dont because sometimes you wont im sorry to say so but sadly its true


<keras.callbacks.History at 0x7f9747bcf1d0>

In [61]:
# fit model
model.fit(X, y, batch_size=256, epochs=500, callbacks=[print_callback])

Epoch 1/500
Epoch 2/500
Epoch 3/500
Epoch 4/500
Epoch 5/500
Epoch 6/500
Epoch 7/500
Epoch 8/500
Epoch 9/500
Epoch 10/500
Epoch 11/500
Epoch 12/500
Epoch 13/500
Epoch 14/500
Epoch 15/500
Epoch 16/500
Epoch 17/500
Epoch 18/500
Epoch 19/500
Epoch 20/500
Epoch 21/500
Epoch 22/500
Epoch 23/500
Epoch 24/500
Epoch 25/500
Checkpointing the model...
Generating Text...
is you to be a great , and the lion was a great , and the scarecrow was a great , and the scarecrow was a great , and the scarecrow was a great , and the scarecrow was a great , and the scarecrow was the emerald city , and the scarecrow was a great , and the scarecrow was a great , and the scarecrow was a great , and the scarecrow was a great , and the
Epoch 26/500
Epoch 27/500
Epoch 28/500
Epoch 29/500
Epoch 30/500
Epoch 31/500
Epoch 32/500
Epoch 33/500
Epoch 34/500
Epoch 35/500
Epoch 36/500
Epoch 37/500
Epoch 38/500
Epoch 39/500
Epoch 40/500
Epoch 41/500
Epoch 42/500
Epoch 43/500
Epoch 44/500
Epoch 45/500
Epoch 46/500
Epoch 47/5

Epoch 140/500
Epoch 141/500
Epoch 142/500
Epoch 143/500
Epoch 144/500
Epoch 145/500
Epoch 146/500
Epoch 147/500
Epoch 148/500
Epoch 149/500
Epoch 150/500
Checkpointing the model...
Generating Text...
am sure how are we have been an other way to us , said dorothy . i am not right to get back to kansas . so that im so much happy , answered the scarecrow . the scarecrow is so much so if i should tell you asked dorothy . no one so that i have been afraid , replied dorothy . oh , yes asked dorothy . aunt em is very good beautiful way to give me courage
Epoch 151/500
Epoch 152/500
Epoch 153/500
Epoch 154/500
Epoch 155/500
Epoch 156/500
Epoch 157/500
Epoch 158/500
Epoch 159/500
Epoch 160/500
Epoch 161/500
Epoch 162/500
Epoch 163/500
Epoch 164/500
Epoch 165/500
Epoch 166/500
Epoch 167/500
Epoch 168/500
Epoch 169/500
Epoch 170/500
Epoch 171/500
Epoch 172/500
Epoch 173/500
Epoch 174/500
Epoch 175/500
Checkpointing the model...
Generating Text...
friends were quite much greatly by the people or a

Checkpointing the model...
Generating Text...
next morning they came the golden cap where the wicked witch had carried her would do next . this was only one much made him many years but this would have come into the world , for if i am more right that i am only a friend , although i am not so a great man . it was been so than he said . after out here the first time dorothy answered . if i am a humbug
Epoch 276/500
Epoch 277/500
Epoch 278/500
Epoch 279/500
Epoch 280/500
Epoch 281/500
Epoch 282/500
Epoch 283/500
Epoch 284/500
Epoch 285/500
Epoch 286/500
Epoch 287/500
Epoch 288/500
Epoch 289/500
Epoch 290/500
Epoch 291/500
Epoch 292/500
Epoch 293/500
Epoch 294/500
Epoch 295/500
Epoch 296/500
Epoch 297/500
Epoch 298/500
Epoch 299/500
Epoch 300/500
Checkpointing the model...
Generating Text...
take me some aunt em being so much too too made the way to get back to kansas . when he did nothing out and a new place near the wonderful gray time the wicked witch being too too good a good beast , 

Epoch 408/500
Epoch 409/500
Epoch 410/500
Epoch 411/500
Epoch 412/500
Epoch 413/500
Epoch 414/500
Epoch 415/500
Epoch 416/500
Epoch 417/500
Epoch 418/500
Epoch 419/500
Epoch 420/500
Epoch 421/500
Epoch 422/500
Epoch 423/500
Epoch 424/500
Epoch 425/500
Checkpointing the model...
Generating Text...
looking at once without the other travelers she did come over , where dorothy was ready . but the next morning the soldier had toto and started back to make his way to sight , so that told them hard at the winged monkeys came being down , and at last came back to see the house full of its silk like his body , and he told her in the world she seemed up to do next , replied
Epoch 426/500
Epoch 427/500
Epoch 428/500
Epoch 429/500
Epoch 430/500
Epoch 431/500
Epoch 432/500
Epoch 433/500
Epoch 434/500
Epoch 435/500
Epoch 436/500
Epoch 437/500
Epoch 438/500
Epoch 439/500
Epoch 440/500
Epoch 441/500
Epoch 442/500
Epoch 443/500
Epoch 444/500
Epoch 445/500
Epoch 446/500
Epoch 447/500
Epoch 448/500
Epoch

Epoch 477/500
Epoch 478/500
Epoch 479/500
Epoch 480/500
Epoch 481/500
Epoch 482/500
Epoch 483/500
Epoch 484/500
Epoch 485/500
Epoch 486/500
Epoch 487/500
Epoch 488/500
Epoch 489/500
Epoch 490/500
Epoch 491/500
Epoch 492/500
Epoch 493/500
Epoch 494/500
Epoch 495/500
Epoch 496/500
Epoch 497/500
Epoch 498/500
Epoch 499/500
Epoch 500/500
Checkpointing the model...
Generating Text...
a strange roar and which she called her could be at once , so that dorothy walked near , dorothy could see how any had made in an pretty rooms than home , and after toto and ever would run away from his beautiful country . dorothy told his years being well , while he said where the winged monkeys was stuffed with straw , even three one so one that the first old place before him , for he


<keras.callbacks.History at 0x7f96676bbcc0>

Because I am still trying to export this model to JavaScript, I also need to be able to replicate the tokenizer there. To do that, I take the tokenizer's vocabulary and construct an array. Every word is placed in the array at the position given by the number the Tokenizer associates with that word, counting from 1. Since JavaScript arrays count from 0, the word with Tokenizer index `i` ends up at array index `i - 1`.

In [62]:
js = "["
for word, index in tokenizer.word_index.items():
    js += '"' + word + '",'
js += "]"
f = open(TEXT_NAME + "-js-tokenizer.txt", "w")
f.write(js)

25804
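An alternative way to write the same export is sketched below. It assumes nothing beyond a `word_index` dictionary like the one the Tokenizer exposes (a toy one is used here), sorts explicitly by index instead of relying on dictionary order, and uses `json.dumps` so that quoting and escaping are handled and no trailing comma is left in the array.

```python
import json

# toy stand-in for tokenizer.word_index
toy_word_index = {'the': 1, 'cat': 2, 'sat': 3}

# order the words by their tokenizer index so that position i - 1 holds index i
ordered_words = [word for word, _ in sorted(toy_word_index.items(), key=lambda kv: kv[1])]
js_array = json.dumps(ordered_words)
print(js_array)

# round trip: Tokenizer index i maps to array position i - 1
restored = json.loads(js_array)
print(restored[toy_word_index['cat'] - 1])
```

Writing `js_array` to a file inside a `with open(...)` block would also make sure the file is closed.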

# Using the Model!

## Load the Models and the Tokenizer

In [55]:
# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# load cleaned text sequences
in_filename = 'harry-potter.txt'
doc = load_doc(in_filename)
lines = doc.split('\n')

seq_length = len(lines[0].split()) - 1

In [56]:
from keras.models import load_model
from pickle import load

# load the model
model = load_model('model.h5')
# load the tokenizer
tokenizer = load(open('tokenizer.pkl', 'rb'))

## Generate Text

Before we generate text, we first need to have a model loaded, which we already do, since we either just finished training it or loaded it above. Then, the generation algorithm is as follows.

1. Randomly choose a line of text to be the seed text
2. Encode the words into numbers using the Tokenizer
3. Ensure that the encoded seed text is the right length
4. Have the model predict the class (word index) of the next word
5. Convert the number back into word format
6. Append the new word to the seed text
7. Repeat with the newly generated word in the seed text
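Steps 3 and 5 can be illustrated without a model. The sketch below is a pure-Python approximation with hypothetical names: `fit_to_length` mimics what `pad_sequences(..., maxlen=..., truncating='pre')` does to a single sequence under its default left-padding with zeros, and `index_word` is a reverse dictionary that replaces the linear search over `word_index`.

```python
def fit_to_length(encoded, maxlen):
    """Keep the last maxlen ids, left-padding with 0 when there are too few."""
    return ([0] * maxlen + list(encoded))[-maxlen:]

# toy stand-in for tokenizer.word_index
toy_word_index = {'the': 1, 'cat': 2, 'sat': 3}
# reverse mapping from index to word, built once instead of searching each time
index_word = {index: word for word, index in toy_word_index.items()}

print(fit_to_length([1, 2, 3], 2))  # too long: the oldest id is dropped
print(fit_to_length([1, 2], 4))     # too short: zeros are added on the left
print(index_word[3])
```

Because the oldest words are dropped first, the seed text can keep growing while the model always sees only the most recent `seq_length` words.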


In [57]:
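from random import randint
from keras.preprocessing.sequence import pad_sequences

def generate_text(words_to_generate=80):
    
    result = list()
    # select a seed text (randint includes both endpoints, hence the - 1)
    seed_text = lines[randint(0, len(lines) - 1)]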
    
    for i in range(words_to_generate):
        # encode the seed text
        encoded = tokenizer.texts_to_sequences([seed_text])[0]
        # truncate sequences to a fixed length
        encoded = pad_sequences([encoded], maxlen=seq_length, truncating='pre')
        # predict the index of the most likely next word
        # (predict_classes has been removed from newer Keras releases)
        yhat = model.predict_classes(encoded, verbose=0)

        # map predicted word index to word
        out_word = ''
        for word, index in tokenizer.word_index.items():
            if index == yhat:
                out_word = word
                break

        # append to input
        seed_text += ' ' + out_word
        result.append(out_word)
    return ' '.join(result)

arms around the trolls neck from behind the troll couldnt feel harry hanging there but even a troll will notice if you stick a long bit of wood up its nose and harrys wand had still been in his hand when hed jumped it had gone straight up one of the



In [60]:
tokenizer.texts_to_sequences(['penis'])

[[]]