Link to one of the datasets used: https://ocw.mit.edu/ans7870/6/6.006/s08/lecturenotes/files/t8.shakespeare.txt

In [3]:
%tensorflow_version 2.x
import tensorflow as tf
import string
import requests

Colab only includes TensorFlow 2.x; %tensorflow_version has no effect.


In [4]:
response = requests.get('https://ocw.mit.edu/ans7870/6/6.006/s08/lecturenotes/files/t8.shakespeare.txt')
# Replace the URL above with the link to whichever dataset you want to download.

In [5]:
response.text[:1500]

'This is the 100th Etext file presented by Project Gutenberg, and\nis presented in cooperation with World Library, Inc., from their\nLibrary of the Future and Shakespeare CDROMS.  Project Gutenberg\noften releases Etexts that are NOT placed in the Public Domain!!\n\nShakespeare\n\n*This Etext has certain copyright implications you should read!*\n\n<<THIS ELECTRONIC VERSION OF THE COMPLETE WORKS OF WILLIAM\nSHAKESPEARE IS COPYRIGHT 1990-1993 BY WORLD LIBRARY, INC., AND IS\nPROVIDED BY PROJECT GUTENBERG ETEXT OF ILLINOIS BENEDICTINE COLLEGE\nWITH PERMISSION.  ELECTRONIC AND MACHINE READABLE COPIES MAY BE\nDISTRIBUTED SO LONG AS SUCH COPIES (1) ARE FOR YOUR OR OTHERS\nPERSONAL USE ONLY, AND (2) ARE NOT DISTRIBUTED OR USED\nCOMMERCIALLY.  PROHIBITED COMMERCIAL DISTRIBUTION INCLUDES BY ANY\nSERVICE THAT CHARGES FOR DOWNLOAD TIME OR FOR MEMBERSHIP.>>\n\n*Project Gutenberg is proud to cooperate with The World Library*\nin the presentation of The Complete Works of William Shakespeare\nfor your

In [6]:
data = response.text.split('\n')
data[0]
# Split the text into lines. Check where the actual content starts: the file opens with a Project Gutenberg notice.

'This is the 100th Etext file presented by Project Gutenberg, and'
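
The comment above points out that the real content starts somewhere after the first line. One way to handle that is sketched below, assuming you have picked a marker string that appears on the first line you want to keep. The helper name drop_preamble and the toy lines are hypothetical, not part of the original notebook.

```python
def drop_preamble(lines, marker):
    """Return the lines from the first one containing `marker` onward.

    If the marker never appears, the input list is returned unchanged.
    """
    for i, line in enumerate(lines):
        if marker in line:
            return lines[i:]
    return lines

# Toy data standing in for response.text.split('\n')
sample = ["License notice", "More notice", "FIRST REAL LINE", "second real line"]
body = drop_preamble(sample, "FIRST REAL LINE")
print(body)
```

The notebook itself keeps the preamble, which is why the first tokens further down are words from the notice rather than from the plays.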

In [7]:
data = " ".join(data)
data[:1000]

'This is the 100th Etext file presented by Project Gutenberg, and is presented in cooperation with World Library, Inc., from their Library of the Future and Shakespeare CDROMS.  Project Gutenberg often releases Etexts that are NOT placed in the Public Domain!!  Shakespeare  *This Etext has certain copyright implications you should read!*  <<THIS ELECTRONIC VERSION OF THE COMPLETE WORKS OF WILLIAM SHAKESPEARE IS COPYRIGHT 1990-1993 BY WORLD LIBRARY, INC., AND IS PROVIDED BY PROJECT GUTENBERG ETEXT OF ILLINOIS BENEDICTINE COLLEGE WITH PERMISSION.  ELECTRONIC AND MACHINE READABLE COPIES MAY BE DISTRIBUTED SO LONG AS SUCH COPIES (1) ARE FOR YOUR OR OTHERS PERSONAL USE ONLY, AND (2) ARE NOT DISTRIBUTED OR USED COMMERCIALLY.  PROHIBITED COMMERCIAL DISTRIBUTION INCLUDES BY ANY SERVICE THAT CHARGES FOR DOWNLOAD TIME OR FOR MEMBERSHIP.>>  *Project Gutenberg is proud to cooperate with The World Library* in the presentation of The Complete Works of William Shakespeare for your reading for educati

In [8]:
def clean_text(doc):
  tokens = doc.split()
  table = str.maketrans('', '', string.punctuation)
  tokens = [w.translate(table) for w in tokens]
  tokens = [word for word in tokens if word.isalpha()]
  tokens = [word.lower() for word in tokens]
  return tokens

tokens = clean_text(data)
print(tokens[:50])
print(len(tokens))
print(len(set(tokens)))

['this', 'is', 'the', 'etext', 'file', 'presented', 'by', 'project', 'gutenberg', 'and', 'is', 'presented', 'in', 'cooperation', 'with', 'world', 'library', 'inc', 'from', 'their', 'library', 'of', 'the', 'future', 'and', 'shakespeare', 'cdroms', 'project', 'gutenberg', 'often', 'releases', 'etexts', 'that', 'are', 'not', 'placed', 'in', 'the', 'public', 'domain', 'shakespeare', 'this', 'etext', 'has', 'certain', 'copyright', 'implications', 'you', 'should', 'read']
899788
28113
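
To see what clean_text() keeps and drops, here is the same function run on a short made-up sentence. Tokens that still contain non-letters after punctuation is stripped (such as "100th") are discarded, which is why "100th" is missing from the output above. Hyphenated words are merged because the hyphen is removed.

```python
import string

def clean_text(doc):
    tokens = doc.split()
    # Remove every punctuation character from each token
    table = str.maketrans('', '', string.punctuation)
    tokens = [w.translate(table) for w in tokens]
    # Keep purely alphabetic tokens only (drops '100th' and empty strings)
    tokens = [word for word in tokens if word.isalpha()]
    return [word.lower() for word in tokens]

sample = "This is the 100th Etext file, well-known & FREE!"
result = clean_text(sample)
print(result)
```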


We use a set of previous words to predict the next word in the sentence. To be precise, we use 50 words to predict the 51st. Hence we slide a window of 51 words over the data, moving it one word at a time, and later separate the last word from every line. We limit the dataset to roughly the first 200000 words.

In [9]:
length = 50 + 1
lines = []

for i in range(length, len(tokens)):
  seq = tokens[i-length:i]
  line = ' '.join(seq)
  lines.append(line)
  if i > 200000:
    break

print(len(lines))
print(lines[0])

199951
this is the etext file presented by project gutenberg and is presented in cooperation with world library inc from their library of the future and shakespeare cdroms project gutenberg often releases etexts that are not placed in the public domain shakespeare this etext has certain copyright implications you should read this


In [10]:
print(tokens[50])
print(lines[1])

this
is the etext file presented by project gutenberg and is presented in cooperation with world library inc from their library of the future and shakespeare cdroms project gutenberg often releases etexts that are not placed in the public domain shakespeare this etext has certain copyright implications you should read this electronic
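
The windowing is easier to follow on a tiny example. The sketch below uses a context of 3 words instead of 50 and a made-up token list. The helper name make_windows is hypothetical. Unlike the cell above, it runs to len(tokens) + 1 so that the final window is included, and it has no 200000-word cap.

```python
def make_windows(tokens, context):
    """Return every run of context + 1 consecutive tokens, shifted by one."""
    length = context + 1
    return [tokens[i - length:i] for i in range(length, len(tokens) + 1)]

toy_tokens = "a b c d e f".split()
windows = make_windows(toy_tokens, context=3)
for w in windows:
    print(' '.join(w))
```

Each window starts one token later than the previous one, which matches lines[0] and lines[1] above.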


In [11]:
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM, Embedding
from tensorflow.keras.preprocessing.sequence import pad_sequences

We are going to create a unique integer token for each unique word in the dataset. fit_on_texts() updates the tokenizer's internal vocabulary based on a list of texts. texts_to_sequences() transforms each text in texts into a sequence of integers.

In [12]:
tokenizer = Tokenizer()
tokenizer.fit_on_texts(lines)
sequences = tokenizer.texts_to_sequences(lines)
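
As a rough picture of what the tokenizer does, here is a pure-Python sketch, not the Keras implementation. It assumes the text is already lowercased and free of punctuation, and it assigns indices by descending word frequency starting at 1, as the Keras Tokenizer does. The function names are hypothetical.

```python
from collections import Counter

def fit_word_index(texts):
    """Map each word to an integer rank: 1 for the most frequent word."""
    counts = Counter(word for text in texts for word in text.split())
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def encode_texts(texts, word_index):
    """Replace each known word with its index; unknown words are skipped."""
    return [[word_index[w] for w in text.split() if w in word_index]
            for text in texts]

texts = ["the cat sat", "the cat ran", "the dog"]
word_index = fit_word_index(texts)
encoded = encode_texts(texts, word_index)
print(word_index)
print(encoded)
```

Index 0 is never assigned to a word, which matters when the vocabulary size is computed later.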

sequences contains the lists of integers created by the tokenizer. Each line in sequences has 51 integers, one per word. Now we will split each line such that the first 50 are in X and the last one is in y.

In [13]:
sequences = np.array(sequences)
X, y = sequences[:, :-1], sequences[:,-1]
X[0]

array([   22,    11,     1,   397,  2006,  6743,    30,   398,   455,
           2,    11,  6743,    10,  6744,    15,   122,   486,   520,
          47,    65,   486,     5,     1,  2775,     2,   418, 13163,
         398,   455,   678, 13162,  2187,     9,    33,    12,  4811,
          10,     1,   848,  3823,   418,    22,   397,   236,   611,
         485, 13161,     6,    86,   610])
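
The same split on a toy set of sequences, written with plain lists so it runs without NumPy. The NumPy slices sequences[:, :-1] and sequences[:, -1] used above select the same elements.

```python
# Three toy windows of 4 integers each (3 context words + 1 target)
toy_sequences = [[1, 2, 3, 4], [2, 3, 4, 5], [3, 4, 5, 6]]

X_toy = [row[:-1] for row in toy_sequences]  # everything except the last item
y_toy = [row[-1] for row in toy_sequences]   # the last item only

print(X_toy)
print(y_toy)
```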

vocab_size is the number of unique words in the dataset plus one. tokenizer.word_index maps each unique word to its integer index, starting at 1; index 0 is never assigned to a word and is used for padding. Hence vocab_size is len(tokenizer.word_index) + 1.

In [14]:
vocab_size = len(tokenizer.word_index) + 1
y = to_categorical(y, num_classes=vocab_size)
seq_length = X.shape[1]
seq_length

50
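
A minimal sketch of what one-hot encoding the targets means, and a back-of-the-envelope estimate of how large the encoded y is. The one_hot helper is a hypothetical stand-in for to_categorical. The memory estimate assumes 4-byte float32 values and uses the 199951 sequences and 13165 classes reported in this notebook.

```python
def one_hot(indices, num_classes):
    """Return one row per index with a single 1.0 at that index."""
    rows = []
    for idx in indices:
        row = [0.0] * num_classes
        row[idx] = 1.0
        rows.append(row)
    return rows

toy_word_index = {"the": 1, "cat": 2, "sat": 3}
toy_vocab_size = len(toy_word_index) + 1   # +1 because index 0 is unused
encoded_targets = one_hot([1, 3], toy_vocab_size)
print(encoded_targets)

# Rough size of the real one-hot matrix, assuming float32 (4 bytes per value)
n_samples, n_classes = 199951, 13165
bytes_needed = n_samples * n_classes * 4
print(round(bytes_needed / 2**30, 2), "GiB")
```

If that much memory is a problem, one option is to keep y as integer indices and compile with the 'sparse_categorical_crossentropy' loss, which avoids building the one-hot matrix.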

Using the LSTM Model:

In [15]:
model = Sequential()
model.add(Embedding(vocab_size, 50, input_length=seq_length))
model.add(LSTM(100, return_sequences=True))
model.add(LSTM(100))
model.add(Dense(100, activation='relu'))
model.add(Dense(vocab_size, activation='softmax'))
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 50, 50)            658250    
                                                                 
 lstm (LSTM)                 (None, 50, 100)           60400     
                                                                 
 lstm_1 (LSTM)               (None, 100)               80400     
                                                                 
 dense (Dense)               (None, 100)               10100     
                                                                 
 dense_1 (Dense)             (None, 13165)             1329665   
                                                                 
Total params: 2138815 (8.16 MB)
Trainable params: 2138815 (8.16 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
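
The parameter counts in the summary can be reproduced by hand. The sketch below uses the standard formulas for embedding, LSTM, and dense layers; the helper names are hypothetical.

```python
def embedding_params(vocab_size, dim):
    # One dim-sized vector per vocabulary index
    return vocab_size * dim

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights, and a bias
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per unit
    return input_dim * units + units

counts = [
    embedding_params(13165, 50),  # embedding
    lstm_params(50, 100),         # lstm
    lstm_params(100, 100),        # lstm_1
    dense_params(100, 100),       # dense
    dense_params(100, 13165),     # dense_1
]
print(counts, sum(counts))
```

More than half of the parameters sit in the final softmax layer, because it has one output per vocabulary entry.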


In [16]:
# Compile the model: choose the loss, the optimizer, and the metric to report
model.compile(loss = 'categorical_crossentropy', optimizer = 'adam', metrics = ['accuracy'])
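
For intuition about the loss chosen here, the sketch below computes categorical cross-entropy for a single sample in plain Python. It is a simplification of what Keras does (it only clips probabilities from below), and the function name is hypothetical.

```python
import math

def cross_entropy(y_true, y_pred, eps=1e-7):
    """Negative log-probability assigned to the true class."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))

# The true class is index 1 and the model gives it probability 0.5
loss = cross_entropy([0, 1, 0], [0.25, 0.5, 0.25])
print(loss)

# Loss of a model that guesses uniformly over the 13165 classes
baseline = math.log(13165)
print(baseline)
```

A training loss well below the uniform baseline of about 9.49 indicates the model has learned something beyond random guessing.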

In [None]:
# Train the model on the prepared sequences
model.fit(X, y, batch_size = 256, epochs = 100)
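
A quick estimate of how much work this call does, using the 199951 training sequences reported earlier. It assumes the last, smaller batch of each epoch is still processed.

```python
import math

n_samples = 199951
batch_size = 256
epochs = 100

steps_per_epoch = math.ceil(n_samples / batch_size)
total_updates = steps_per_epoch * epochs
print(steps_per_epoch, total_updates)
```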

In [None]:
seed_text=lines[12343]
seed_text

generate_text_seq() generates n_words words after the given seed_text. We pre-process the seed_text before predicting: first we encode it with the same tokenizer used for the training data, then we pad or truncate it to 50 tokens with pad_sequences(). Next the model predicts the index of the most probable next word. (model.predict_classes() was removed in TensorFlow 2.6; taking np.argmax() of model.predict() gives the same index.) We then look up the word for that index in the tokenizer. Finally we append the predicted word to seed_text and text and repeat the process.

In [None]:
def generate_text_seq(model, tokenizer, text_seq_length, seed_text, n_words):
  text = []

  for _ in range(n_words):
    # Encode the running text and keep only the last text_seq_length tokens
    encoded = tokenizer.texts_to_sequences([seed_text])[0]
    encoded = pad_sequences([encoded], maxlen = text_seq_length, truncating='pre')

    # predict_classes() was removed in TF 2.6; take the argmax of the softmax output
    y_predict = int(np.argmax(model.predict(encoded, verbose=0), axis=-1)[0])

    # Map the predicted index back to its word ('' if no word has that index)
    predicted_word = tokenizer.index_word.get(y_predict, '')
    seed_text = seed_text + ' ' + predicted_word
    text.append(predicted_word)
  return ' '.join(text)
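
The padding step can be pictured with the pure-Python sketch below. It mimics pad_sequences() with its default padding='pre' and with truncating='pre' for a single sequence: short inputs get zeros on the left, long inputs lose their oldest tokens. The helper name pad_pre is hypothetical and maxlen is assumed to be positive.

```python
def pad_pre(seq, maxlen):
    """Keep the last maxlen items and left-pad with zeros if too short."""
    seq = seq[-maxlen:]
    return [0] * (maxlen - len(seq)) + seq

short = pad_pre([1, 2, 3], 5)
long = pad_pre([1, 2, 3, 4, 5, 6, 7], 5)
print(short)
print(long)
```

Since each entry of lines holds 51 words, the seed text is truncated to its last 50 tokens on the first prediction, and the window keeps sliding forward as predicted words are appended.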

In [None]:
# Generate the next 100 words after the seed text
generate_text_seq(model, tokenizer, seq_length, seed_text, 100)