### Defining a simple function to read a file

In [1]:
def read_file(filepath):
    with open(filepath,encoding="utf8") as f:
        str_text = f.read()
    return str_text  

### Load Spacy to do some text cleaning
We will load spaCy and disable the parser, tagger and named entity recognition.
We only need tokenization here, so disabling these components makes processing a lot faster.

In [2]:
import spacy

In [3]:
nlp = spacy.load('en', disable=['parser','tagger','ner'])

Set the max length above the default limit of 1 million characters, so that the entire text file can be processed in one call.

In [4]:
nlp.max_length = 1200000

Create a function which will omit the unwanted characters

In [5]:
def seperate_punc(doc_text):
    return [token.text.lower() for token in nlp(doc_text) if token.text not in '\ufeffjoao 1111 111111 \n\n\n\n \n\n\n\n\n\n\n \n\n \n\n\n!"-#$%&()--.*+,-/:;<=>?@[\\]^_`{|}~\t\n ']
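One caveat about the filter above: `token.text not in '...'` tests whether the token is a substring of that long string, not whether it equals one of the listed characters. Short tokens such as "a" or "o" are therefore dropped as well (the string contains "joao"), which appears to be why the article "a" is missing from the sequences printed later. A minimal sketch of the difference, using a plain token list instead of spaCy; `filter_string` is shortened and `unwanted` is a hypothetical alternative, not part of the original notebook:

```python
# Substring test: the same kind of check the notebook's filter performs.
filter_string = '\ufeffjoao 1111 \n!"-#$%&()--.*+,-/:;<=>?@[\\]^_`{|}~\t\n '
tokens = ['call', 'me', 'a', 'sailor', ',', 'o', '!']

# 'a' and 'o' are substrings of 'joao', so they are filtered out too.
kept_substring = [t for t in tokens if t not in filter_string]

# Set membership (hypothetical alternative): only exact matches are dropped.
unwanted = set('!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n ') | {'\ufeff', '--'}
kept_exact = [t for t in tokens if t not in unwanted]

print(kept_substring)
print(kept_exact)
```

Switching to the set version would change the token count and every output shown further down, so treat it as an option to consider rather than a drop-in fix.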

### Read the file

In [6]:
d = read_file('moby_dick_four_chapters.txt')

In [7]:
tokens = seperate_punc(d)

In [8]:
len(tokens)

10944

#50 words --> network predicts #51

Here we will create text sequences of 51 words: the first 50 are the input, and the 51st is the word to predict.

We will append each sequence of 51 words to the list text_sequences.

##### Note that 50 words should be long enough to capture the context of a sentence

The right length depends on the document you are working with, and a shorter sequence may be enough for some texts.

In [49]:
train_len = 50 + 1
text_sequences = []

for i in range(train_len,len(tokens)):    
    seq = tokens[i-train_len:i]
    text_sequences.append(seq)
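To see what the sliding window produces, here is the same loop on a toy token list. The names `toy_tokens`, `toy_len` and `toy_sequences` are made up for this illustration, and the window is 3 input words plus 1 target instead of 50 plus 1:

```python
toy_tokens = ['call', 'me', 'ishmael', 'some', 'years', 'ago']
toy_len = 3 + 1  # 3 input words plus 1 word to predict

toy_sequences = []
for i in range(toy_len, len(toy_tokens)):
    # Window of toy_len consecutive tokens ending just before position i
    toy_sequences.append(toy_tokens[i - toy_len:i])

for seq in toy_sequences:
    # Everything but the last word is the input; the last word is the target
    print(seq[:-1], '->', seq[-1])
```

Each window is shifted by one word relative to the previous one. Because the range stops at `len(toy_tokens) - 1`, the very last possible window (the one ending in 'ago') is not generated; the same holds for the loop above, which yields 10944 - 51 = 10893 sequences.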

Grabbing the item at index 300 in the list.
The next sequence (index 301) starts one word later, at 'of', and ends one word further on.

In [50]:
' '.join(text_sequences[300])

'thousands of mortal men fixed in ocean reveries some leaning against the spiles some seated upon the pier heads some looking over the bulwarks of ships from china some high aloft in the rigging as if striving to get still better seaward peep but these are all landsmen of week days'

###### So given the first 50 words above, the word to predict is "days"

In [51]:
' '.join(text_sequences[301])

'of mortal men fixed in ocean reveries some leaning against the spiles some seated upon the pier heads some looking over the bulwarks of ships from china some high aloft in the rigging as if striving to get still better seaward peep but these are all landsmen of week days pent'

#### Using Keras tokenization to convert these sequences into integer indices that Keras can work with

In [14]:
from keras.preprocessing.text import Tokenizer

Using TensorFlow backend.


In [52]:
tokenizer = Tokenizer()
tokenizer.fit_on_texts(text_sequences)

In [53]:
sequences = tokenizer.texts_to_sequences(text_sequences)

#### Printing the first of the sequences (51 numbers each) to see how the text is converted into a numerical format
Each number is the index assigned to the corresponding word

In [54]:
sequences[0]

[956,
 13,
 262,
 49,
 261,
 408,
 86,
 218,
 134,
 119,
 954,
 260,
 51,
 42,
 37,
 315,
 6,
 22,
 546,
 2,
 149,
 259,
 5,
 2711,
 13,
 24,
 2710,
 4,
 59,
 4,
 58,
 406,
 36,
 51,
 2,
 100,
 1,
 2709,
 175,
 3,
 1,
 174,
 7,
 19,
 108,
 4,
 47,
 3,
 2706,
 99,
 1]
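The indices are frequency ranks: the most frequent word gets index 1, the next one index 2, and so on, with index 0 left unused. A rough pure-Python approximation of that mapping (this is a sketch of the idea on made-up data, not the Keras implementation; `toy_sequences` and `word_index` here are illustrative names):

```python
from collections import Counter

toy_sequences = [['the', 'sea', 'and', 'the', 'ship'],
                 ['the', 'ship', 'and', 'me']]

# Count every word across all sequences
counts = Counter(word for seq in toy_sequences for word in seq)

# Rank by descending count; ties keep first-seen order because sorted() is stable
ranked = sorted(counts, key=lambda w: -counts[w])

# Indices start at 1, so 0 stays free (it is used for padding later on)
word_index = {word: rank for rank, word in enumerate(ranked, start=1)}

encoded = [[word_index[w] for w in seq] for seq in toy_sequences]
print(word_index)
print(encoded)
```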

We can use the index_word property of the tokenizer to know which word is at which index

In [18]:
tokenizer.index_word

{1: 'the',
 2: 'and',
 3: 'of',
 4: 'i',
 5: 'to',
 6: 'in',
 7: 'it',
 8: 'that',
 9: 'he',
 10: 'his',
 11: 'was',
 12: 'but',
 13: 'me',
 14: 'with',
 15: 'as',
 16: 'at',
 17: 'this',
 18: 'you',
 19: 'is',
 20: 'all',
 21: 'for',
 22: 'my',
 23: 'on',
 24: 'be',
 25: "'s",
 26: 'not',
 27: 'from',
 28: 'there',
 29: 'one',
 30: 'up',
 31: 'what',
 32: 'him',
 33: 'so',
 34: 'bed',
 35: 'now',
 36: 'about',
 37: 'no',
 38: 'into',
 39: 'by',
 40: 'were',
 41: 'out',
 42: 'or',
 43: 'harpooneer',
 44: 'had',
 45: 'then',
 46: 'have',
 47: 'an',
 48: 'upon',
 49: 'little',
 50: 'some',
 51: 'old',
 52: 'like',
 53: 'if',
 54: 'they',
 55: 'would',
 56: 'do',
 57: 'over',
 58: 'landlord',
 59: 'thought',
 60: 'room',
 61: 'when',
 62: 'could',
 63: "n't",
 64: 'night',
 65: 'here',
 66: 'head',
 67: 'such',
 68: 'which',
 69: 'man',
 70: 'did',
 71: 'sea',
 72: 'time',
 73: 'other',
 74: 'very',
 75: 'go',
 76: 'these',
 77: 'more',
 78: 'though',
 79: 'first',
 80: 'sort',
 81: 'said

To know how many words appeared how many times, we can use the word_counts property of the tokenizer. 

In [19]:
tokenizer.word_counts

OrderedDict([('call', 27),
             ('me', 2471),
             ('ishmael', 133),
             ('some', 758),
             ('years', 135),
             ('ago', 84),
             ('never', 449),
             ('mind', 164),
             ('how', 321),
             ('long', 374),
             ('precisely', 37),
             ('having', 142),
             ('little', 767),
             ('or', 950),
             ('no', 1003),
             ('money', 120),
             ('in', 5646),
             ('my', 1786),
             ('purse', 71),
             ('and', 9644),
             ('nothing', 281),
             ('particular', 152),
             ('to', 6497),
             ('interest', 24),
             ('on', 1716),
             ('shore', 26),
             ('i', 7150),
             ('thought', 676),
             ('would', 702),
             ('sail', 104),
             ('about', 1014),
             ('see', 416),
             ('the', 15539),
             ('watery', 26),
             ('part', 234),
 

Getting the size of our vocabulary, to get the unique words across the whole document

In [55]:
vocabulary_size = len(tokenizer.word_counts)

In [56]:
vocabulary_size

2717

Since sequences is a list of lists, we need to convert it into a format the model understands.

For that we use the numpy library to convert it into an array,
where each row represents one sequence of 51 word indices.

In [111]:
import numpy as np
sequences = np.array(sequences)
sequences

array([[ 956,   13,  262, ..., 2706,   99,    1],
       [  13,  262,   49, ...,   99,    1,  957],
       [ 262,   49,  261, ...,    1,  957,    2],
       ...,
       [ 386,    3,   31, ...,   10,  313,   52],
       [   3,   31,  403, ...,  313,   52, 2717],
       [  31,  403, 2707, ...,   52, 2717,   25]])

### Create the LSTM Based Model
##### Split the Data into Features and Labels
##### X will be the first n words of sequence
##### y will be the next word after the sequence
##### Fit the model

In [112]:
from keras.utils import to_categorical

In [113]:
X = sequences[:,:-1]

In [114]:
y = sequences[:,-1]

Convert y to one-hot (0/1) format with to_categorical. The number of classes is the vocabulary size plus 1: Keras word indices start at 1 and index 0 is reserved for padding, so an extra class is needed to hold the zero.

In [115]:
y = to_categorical(y,num_classes=vocabulary_size+1)

In [117]:
y.shape

(10893, 2718)
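The reason for the extra class can be checked with a hand-rolled one-hot encoder. This is a minimal sketch in plain Python; `one_hot` and `toy_vocabulary_size` are hypothetical names, and the function only mimics the idea behind `to_categorical` for a single label:

```python
def one_hot(index, num_classes):
    """Return a list of zeros with a single 1 at position `index`."""
    row = [0] * num_classes
    row[index] = 1
    return row

toy_vocabulary_size = 4   # word indices run from 1 to 4; 0 is reserved
labels = [1, 4, 2]

# With vocabulary_size + 1 classes, the highest index (4) still fits
encoded = [one_hot(i, toy_vocabulary_size + 1) for i in labels]
print(encoded)

# With only vocabulary_size classes, the highest index falls off the end
try:
    one_hot(4, toy_vocabulary_size)
    fits_without_extra_class = True
except IndexError:
    fits_without_extra_class = False
print(fits_without_extra_class)
```

The same reasoning gives the second dimension of `y.shape` above: 2717 + 1 = 2718.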

In [118]:
seq_len = X.shape[1] #setting the seq_len to 50

In [119]:
X.shape #10893 sequences, each containing 50 words

(10893, 50)

### Training the model 
An LSTM layer deals with the sequences, and an Embedding layer deals with the vocabulary
##### Note: when defining the LSTM, the number of neurons used here is seq_len*2, which is 100. You can choose your own number; there is no fixed rule for how many neurons to use, and the suggestion here is to pick some multiple of seq_len

In [83]:
from keras.models import Sequential
from keras.layers import Dense, LSTM, Embedding

In [120]:
def create_model(vocabulary_size, seq_len):
    model = Sequential()
    model.add(Embedding(vocabulary_size,seq_len,input_length=seq_len))
    model.add(LSTM(seq_len*2, return_sequences = True))
    model.add(LSTM(seq_len*2))
    model.add(Dense(50,activation='relu'))
    model.add(Dense(vocabulary_size,activation='softmax'))
    model.compile(loss='categorical_crossentropy',optimizer='adam',metrics=['accuracy'])
    model.summary()
    return model

In [121]:
model = create_model(vocabulary_size+1,seq_len)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_9 (Embedding)      (None, 50, 50)            135900    
_________________________________________________________________
lstm_17 (LSTM)               (None, 50, 100)           60400     
_________________________________________________________________
lstm_18 (LSTM)               (None, 100)               80400     
_________________________________________________________________
dense_17 (Dense)             (None, 50)                5050      
_________________________________________________________________
dense_18 (Dense)             (None, 2718)              138618    
Total params: 420,368
Trainable params: 420,368
Non-trainable params: 0
_________________________________________________________________
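The parameter counts in the summary can be reproduced by hand, which is a useful sanity check on the layer sizes. The sketch below assumes the standard formulas: an embedding has `vocab * dim` weights, an LSTM has 4 gates each with `units * (input_dim + units) + units` weights, and a dense layer has `inputs * outputs + outputs`:

```python
vocab = 2717 + 1      # vocabulary_size + 1
seq_len = 50          # also used as the embedding dimension
units = seq_len * 2   # LSTM units per layer

embedding = vocab * seq_len
lstm_1 = 4 * (units * (seq_len + units) + units)  # input is the 50-dim embedding
lstm_2 = 4 * (units * (units + units) + units)    # input is the 100-dim LSTM output
dense_1 = units * 50 + 50
dense_2 = 50 * vocab + vocab

total = embedding + lstm_1 + lstm_2 + dense_1 + dense_2
print(embedding, lstm_1, lstm_2, dense_1, dense_2, total)
```

These match the `Param #` column above: 135900, 60400, 80400, 5050, 138618, for a total of 420,368.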


In [109]:
from pickle import dump, load 

In [122]:
#batch_size is how many sequences are passed through the network at once; passing all the sequences in a single batch is usually impractical
#epochs is how many passes are made over the training data; expect to need far more than the 2 used here (on the order of 200 or more) to get reasonable output, and that will take time
#verbose controls how much progress output is printed during training
model.fit(X,y,batch_size=128, epochs=2, verbose=1)

Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x2029f012c18>

In [123]:
model.save('deep_learning_model_11_aug19.h5')

In [37]:
dump(tokenizer,open('demo_tokenizer','wb'))
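A slightly safer way to write and read the pickle is to use `with` blocks, so the file is closed even if an error occurs. The sketch below uses a small dictionary as a stand-in for the fitted tokenizer and a temporary directory instead of the notebook's working directory; both are assumptions made for illustration:

```python
import os
import pickle
import tempfile

toy_tokenizer_state = {'the': 1, 'and': 2}  # stand-in for the fitted tokenizer

path = os.path.join(tempfile.mkdtemp(), 'demo_tokenizer')

# Write the object; the file is closed automatically on leaving the block
with open(path, 'wb') as f:
    pickle.dump(toy_tokenizer_state, f)

# Read it back the same way
with open(path, 'rb') as f:
    restored = pickle.load(f)

print(restored == toy_tokenizer_state)
```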

### Generating new Text based on Seed Input

In [124]:
from random import randint
from pickle import load
from keras.models import load_model
from keras.preprocessing.sequence import pad_sequences

In [125]:
def generate_text(model, tokenizer, seq_len, seed_text, num_gen_words):
    '''
    INPUTS:
    model : model that was trained on text data
    tokenizer : tokenizer that was fit on text data
    seq_len : length of training sequence
    seed_text : raw string text to serve as the seed
    num_gen_words : number of words to be generated by model
    '''
    
    # Final Output
    output_text = []
    
    # Initial Seed Sequence
    input_text = seed_text
    
    # Create num_gen_words
    for i in range(num_gen_words):
        
        # Take the input text string and encode it to a sequence
        encoded_text = tokenizer.texts_to_sequences([input_text])[0]
        
        # Pad or truncate the sequence to the length the model was trained on (seq_len, 50 here)
        pad_encoded = pad_sequences([encoded_text], maxlen=seq_len, truncating='pre')
        
        # Predict the class index of the most likely next word
        pred_word_ind = model.predict_classes(pad_encoded, verbose=0)[0]
        
        # Grab word
        pred_word = tokenizer.index_word[pred_word_ind] 
        
        # Update the sequence of input text (shifting one over with the new word)
        input_text += ' ' + pred_word
        
        output_text.append(pred_word)
        
    # Make it look like a sentence.
    return ' '.join(output_text)
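Two steps inside the loop are worth spelling out. With `truncating='pre'` (and the default pre-padding), `pad_sequences` keeps the last `maxlen` indices and left-fills shorter inputs with zeros, so the model always sees the most recent words. And `predict_classes` returns the index with the highest predicted probability; on newer Keras releases where that method is no longer available, the usual replacement is an argmax over the output of `model.predict`. A pure-Python sketch of both ideas, where `pad_pre` and `argmax` are hypothetical helpers and not Keras functions:

```python
def pad_pre(seq, maxlen):
    """Keep the last `maxlen` items; left-pad with zeros if shorter (maxlen > 0)."""
    seq = seq[-maxlen:]
    return [0] * (maxlen - len(seq)) + seq

def argmax(probs):
    """Index of the largest value, i.e. the predicted class."""
    return max(range(len(probs)), key=probs.__getitem__)

too_long = pad_pre([1, 2, 3, 4, 5], 3)   # oldest items are dropped from the front
too_short = pad_pre([7, 8], 4)           # zeros are added at the front
predicted = argmax([0.1, 0.6, 0.3])      # class 1 has the highest probability

print(too_long, too_short, predicted)
```

Because the seed grows by one word per iteration and is truncated from the front, each prediction is conditioned on the latest `seq_len` words, including the ones the model has just generated.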

In [126]:
import random
random.seed(101)
random_pick = random.randint(0,len(text_sequences)-1) # randint includes both endpoints
random_seed_text = text_sequences[15]
random_seed_text

['money',
 'in',
 'my',
 'purse',
 'and',
 'nothing',
 'particular',
 'to',
 'interest',
 'me',
 'on',
 'shore',
 'i',
 'thought',
 'i',
 'would',
 'sail',
 'about',
 'little',
 'and',
 'see',
 'the',
 'watery',
 'part',
 'of',
 'the',
 'world',
 'it',
 'is',
 'way',
 'i',
 'have',
 'of',
 'driving',
 'off',
 'the',
 'spleen',
 'and',
 'regulating',
 'the',
 'circulation',
 'whenever',
 'i',
 'find',
 'myself',
 'growing',
 'grim',
 'about',
 'the',
 'mouth',
 'whenever']

In [127]:
seed_text = ' '.join(random_seed_text)
seed_text

'money in my purse and nothing particular to interest me on shore i thought i would sail about little and see the watery part of the world it is way i have of driving off the spleen and regulating the circulation whenever i find myself growing grim about the mouth whenever'

In [128]:
generate_text(model,tokenizer,seq_len,seed_text=seed_text,num_gen_words=25)

'the the the the the the the the the the the the the the the the the the the the the the the the the'

### Load a saved model instead if you do not want to train, or if training takes too long

In [129]:
model = load_model('epoch250.h5')
tokenizer = load(open('epoch250','rb'))

In [130]:
generate_text(model,tokenizer,seq_len,seed_text=seed_text,num_gen_words=50)

'while help must new narrow of cover besides shaking the ship i going is cheerless staving make glasses will or wake up with and getting might specimens desired i ready to working you stood and long magnetic in possibly looking have for is just leave mystifying i incessant him it'

In [106]:
full_text = read_file('moby_dick_four_chapters.txt')
for i,word in enumerate(full_text.split()):
    if word == 'interest':
        print(' '.join(full_text.split()[i-5:i+5]))
        print('\n')

purse, and nothing particular to interest me on shore, I


