<a href="https://colab.research.google.com/github/siddheshsathe/google-colab-repos/blob/master/Generating_text_from_Shakespeare's_Art.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Predicting Words with Shakespeare's Art

The steps we'll follow to achieve this:

1. Download the dataset and remove the parts of the text file that are mostly licensing and other information. (I'm doing this because I'm not distributing the file to anybody; it's only for my own study.)
2. Read the lines from the file and store them all in a list.
3. Convert all lines to lower case and strip the punctuation. This will be our corpus.
4. Initialize a `Tokenizer()` instance and fit it on the corpus.
5. The step above gives us the total number of words.
6. Using the `tokenizer` instance, convert the corpus lines to `input_sequences`, which are lists of numbers rather than words; this is what a neural network expects.
7. Since the lines have different lengths, pad the input sequences with a filler value, conventionally 0. This gives us a rectangular `input_sequences` array.
8. Our training input is now in the format the model expects.
9. Create a model using `LSTM`, `Dense`, and `Embedding` layers. There's no predefined model structure; we'll pick a simple layer stack by experiment.
10. Fit the model for a certain number of epochs and plot how training went in terms of loss and accuracy.
11. Now it's time to generate text.
12. Since this is an LSTM model, we need to give it some seed text to generate from.
13. Convert the seed text (which the model obviously can't read directly) to sequences, as we did for the training data.
14. Pad it to the expected input shape.
15. Run a prediction on it.

## 1. Download dataset
Download the dataset from [this link](https://ocw.mit.edu/ans7870/6/6.006/s08/lecturenotes/files/t8.shakespeare.txt).
<br>
You may edit the file as described in the first step of the outline above. I edited my copy and renamed it to `shakespeare.txt`; the cells below use the file as downloaded, `t8.shakespeare.txt`.

In [1]:
!wget https://ocw.mit.edu/ans7870/6/6.006/s08/lecturenotes/files/t8.shakespeare.txt

--2020-02-17 04:14:35--  https://ocw.mit.edu/ans7870/6/6.006/s08/lecturenotes/files/t8.shakespeare.txt
Resolving ocw.mit.edu (ocw.mit.edu)... 23.62.77.179, 2600:1409:a:39c::18a8, 2600:1409:a:39a::18a8
Connecting to ocw.mit.edu (ocw.mit.edu)|23.62.77.179|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5458199 (5.2M) [text/plain]
Saving to: ‘t8.shakespeare.txt’


2020-02-17 04:14:37 (4.96 MB/s) - ‘t8.shakespeare.txt’ saved [5458199/5458199]



In [0]:
file_path = 't8.shakespeare.txt'
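
The manual edit described above could also be scripted. The following is a minimal sketch, assuming the licensing notices in the file are wrapped in `<<` ... `>>` markers; check this against your own copy before relying on it. The helper name `strip_marked_blocks` is hypothetical and not part of the original notebook.

```python
def strip_marked_blocks(lines, start="<<", end=">>"):
    """Drop every run of lines from one starting with `start` to one ending with `end`.

    Assumption: license notices are delimited by these markers.
    """
    kept, inside = [], False
    for line in lines:
        stripped = line.strip()
        if not inside and stripped.startswith(start):
            inside = True          # a marked block begins on this line
        if not inside:
            kept.append(line)      # ordinary text, keep it
        if inside and stripped.endswith(end):
            inside = False         # the marked block ends on this line
    return kept


sample = ["THE SONNETS\n", "<<THIS IS A\n", "LICENSE NOTICE>>\n", "From fairest creatures\n"]
print(strip_marked_blocks(sample))
# ['THE SONNETS\n', 'From fairest creatures\n']
```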

## 2. Read the file
Read the lines from the file and store them in a list.

In [3]:
with open(file_path, 'r') as f:  # close the file once it has been read
    complete_file = f.readlines()
type(complete_file)

list

## 3. Create corpus
Convert the lines to lower case and remove all punctuation. <br>
This is our corpus.

In [0]:
import string

def remove_punc_and_lower(line):
    # Drop punctuation and non-ASCII characters, then lower-case the line
    line = [character.encode("utf8").decode("ascii", 'ignore')
            for character in line if character not in string.punctuation]
    return "".join(line).lower()

corpus = [remove_punc_and_lower(line) for line in complete_file]
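
To see what this cleaning step does to a single line, here is a small standalone sketch. It uses `str.translate` rather than the per-character loop above, which is an alternative formulation and not the notebook's own code; `clean_line` is a hypothetical name.

```python
import string

# Translation table that deletes every ASCII punctuation character
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def clean_line(line):
    """Remove punctuation and non-ASCII characters, then lower-case."""
    no_punct = line.translate(_PUNCT_TABLE)
    ascii_only = no_punct.encode("ascii", "ignore").decode("ascii")
    return ascii_only.lower()

print(clean_line("Shall I compare thee to a summer's day?"))
# prints: shall i compare thee to a summers day
```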

## 4. Initialize tokenizer
This will help us convert our text-based input data to number-based sequences.

In [5]:
%tensorflow_version 2.0.1
from tensorflow.keras.preprocessing.text import Tokenizer

`%tensorflow_version` only switches the major version: `1.x` or `2.x`.
You set: `2.0.1`. This will be interpreted as: `2.x`.


TensorFlow 2.x selected.


In [0]:
max_corpus_length = 7000
# Number of corpus lines used for fitting the tokenizer and creating labels.
# Chosen by experiment: using the whole corpus gives a memory error on my setup.

In [0]:
tokenizer = Tokenizer()

In [0]:
tokenizer.fit_on_texts(corpus[:max_corpus_length])

In [9]:
total_words = len(tokenizer.word_index) + 1
total_words

6298
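
The `+ 1` is there because the tokenizer numbers words starting from 1, leaving index 0 free for padding. As a rough illustration of the idea (a simplified pure-Python sketch, not Keras's actual implementation), words can be ranked by frequency like this; `build_word_index` is a hypothetical helper.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to a 1-based rank, most frequent word first."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # most_common() sorts by descending count; ties keep first-seen order
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

word_index = build_word_index(["to be or not to be"])
print(word_index)           # {'to': 1, 'be': 2, 'or': 3, 'not': 4}
print(len(word_index) + 1)  # 5: four words plus the reserved index 0
```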

## 5. Converting to `input_sequences`

In [10]:
input_sequences = []
for line in corpus[:max_corpus_length]:
    token_list = tokenizer.texts_to_sequences([line])[0]
    for i in range(1, len(token_list)):
        seq = token_list[:i+1]
        input_sequences.append(seq)
input_sequences[:10]

[[23, 11],
 [23, 11, 1],
 [23, 11, 1, 2790],
 [23, 11, 1, 2790, 209],
 [23, 11, 1, 2790, 209, 868],
 [23, 11, 1, 2790, 209, 868, 1861],
 [23, 11, 1, 2790, 209, 868, 1861, 28],
 [23, 11, 1, 2790, 209, 868, 1861, 28, 222],
 [23, 11, 1, 2790, 209, 868, 1861, 28, 222, 264],
 [23, 11, 1, 2790, 209, 868, 1861, 28, 222, 264, 2]]
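
Each line is expanded into all of its prefixes of length two or more, so that every prefix can later be split into "words so far" and "next word". A self-contained sketch of that inner loop, using a hypothetical helper name and token ids taken from the output above:

```python
def prefix_sequences(token_list):
    """Return every prefix of token_list that has at least two tokens."""
    return [token_list[:i + 1] for i in range(1, len(token_list))]

print(prefix_sequences([23, 11, 1, 2790]))
# [[23, 11], [23, 11, 1], [23, 11, 1, 2790]]
print(prefix_sequences([23]))
# [] (a one-word line yields no training example)
```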

## 6. Pad sequence
We need to pad the sequences because the `input_sequences` have different lengths and so don't form a rectangular array.

In [0]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [12]:
longest_sequence = max([len(seq) for seq in input_sequences])
print('Longest Seq: ', longest_sequence)

Longest Seq:  16


In [0]:
input_sequences = pad_sequences(input_sequences, maxlen=longest_sequence)

In [14]:
input_sequences.shape

(43401, 16)

In [0]:
X_train = input_sequences[:, :-1]
y_train = input_sequences[:, -1]
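
`pad_sequences` pads at the front by default, so the last column always holds a real word; that is why the last column can serve as the label. A plain-Python sketch of the same idea follows. The helper names are hypothetical, and the sketch assumes no sequence is longer than `maxlen` (it skips the truncation that `pad_sequences` also handles).

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad each sequence with `value` up to `maxlen` items."""
    return [[value] * (maxlen - len(seq)) + list(seq) for seq in sequences]

def split_features_labels(padded):
    """Last column is the label, everything before it is the input."""
    features = [row[:-1] for row in padded]
    labels = [row[-1] for row in padded]
    return features, labels

padded = pad_pre([[23, 11], [23, 11, 1]], maxlen=4)
print(padded)  # [[0, 0, 23, 11], [0, 23, 11, 1]]
X, y = split_features_labels(padded)
print(X)  # [[0, 0, 23], [0, 23, 11]]
print(y)  # [11, 1]
```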

## 7. Converting labels to one-hot encoding

In [0]:
from tensorflow.keras.utils import to_categorical

In [0]:
y_train = to_categorical(y_train, num_classes=total_words)
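
One-hot encoding turns each label into a vector of length `total_words` with a single 1. With this vocabulary that is a large array, which is probably part of why the whole corpus does not fit in memory. A quick sketch of the encoding and of the resulting size, assuming 4-byte floats (`float32`, which I believe is the `to_categorical` default):

```python
def one_hot(index, num_classes):
    """Vector of zeros with a 1.0 at position `index`."""
    vec = [0.0] * num_classes
    vec[index] = 1.0
    return vec

print(one_hot(2, 5))  # [0.0, 0.0, 1.0, 0.0, 0.0]

# Size of the full label matrix in this notebook: 43401 rows x 6298 classes
rows, classes, bytes_per_float = 43401, 6298, 4
print(rows * classes * bytes_per_float / 1e9)  # about 1.09, i.e. roughly 1.09 GB
```

If memory is the bottleneck, Keras also offers the `sparse_categorical_crossentropy` loss, which takes the integer labels directly and avoids building this matrix.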

In [0]:
from sklearn.model_selection import train_test_split

In [0]:
X_train, X_test, y_train, y_test = train_test_split(X_train, y_train, test_size=0.2)

In [20]:
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(34720, 15) (8681, 15) (34720, 6298) (8681, 6298)


## 8. Model creation
Now we have all the required data in the proper format for feeding into the network.
<br>
`X_train`: Our input training data <br>
`y_train`: Our training labels <br>
`longest_sequence`: length of the longest sequence in our input data. `X_train` has `longest_sequence - 1` columns, since the last token of each sequence is split off as the label; the one-hot encoded `y_train` has `total_words` columns

In [0]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM, Embedding, Dropout

In [22]:
model = Sequential()

model.add(Embedding(total_words, 10, input_length=longest_sequence-1))

model.add(LSTM(512))
model.add(Dropout(0.4))

model.add(Dense(total_words, activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 15, 10)            62980     
_________________________________________________________________
lstm (LSTM)                  (None, 512)               1071104   
_________________________________________________________________
dropout (Dropout)            (None, 512)               0         
_________________________________________________________________
dense (Dense)                (None, 6298)              3230874   
Total params: 4,364,958
Trainable params: 4,364,958
Non-trainable params: 0
_________________________________________________________________
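
The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas for these layer types (an LSTM has four gates, each with input weights, recurrent weights, and a bias); the variable names are my own.

```python
vocab, embed_dim, units = 6298, 10, 512

# Embedding: one embed_dim-sized vector per vocabulary entry
embedding_params = vocab * embed_dim

# LSTM: 4 gates, each with (input + recurrent) weights plus a bias per unit
lstm_params = 4 * ((embed_dim + units) * units + units)

# Dense: one weight per (unit, class) pair plus one bias per class
dense_params = units * vocab + vocab

print(embedding_params, lstm_params, dense_params)    # 62980 1071104 3230874
print(embedding_params + lstm_params + dense_params)  # 4364958
```

Note that the final `Dense` layer holds roughly three quarters of all parameters, because its size scales with the vocabulary.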


In [23]:
model.fit(X_train, y_train, epochs=20, validation_data=(X_test, y_test))

Train on 34720 samples, validate on 8681 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<tensorflow.python.keras.callbacks.History at 0x7fd275beb860>
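
The `History` object returned by `model.fit` carries the per-epoch metrics in its `.history` dict, which is what you would plot or inspect to see how training went. The cell above does not keep the return value, so the sketch below assumes it was captured as `history = model.fit(...)`. The numbers in `example` are made-up placeholders, not results from this run, and `best_epoch` is a hypothetical helper.

```python
def best_epoch(history_dict, key="val_loss"):
    """Return the 1-based epoch with the lowest value of `key`."""
    values = history_dict[key]
    return min(range(len(values)), key=values.__getitem__) + 1

# Placeholder values for illustration only
example = {"loss": [6.9, 6.2, 5.8, 5.3], "val_loss": [6.6, 6.4, 6.5, 6.8]}
print(best_epoch(example))          # 2
print(best_epoch(example, "loss"))  # 4
```

With the real object this would be called as `best_epoch(history.history)`. A validation loss that bottoms out early while the training loss keeps falling is a sign of overfitting.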

## 9. Generate Text
Now it's time to generate text from our trained model.
<br>
We need to give the model a seed text from which to start predicting the next words.

In [0]:
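import numpy as np

def generate_text(seed_word):
    # Convert the seed text to token indices
    tokens = tokenizer.texts_to_sequences([seed_word])[0]
    # Pad to the input length the model was trained on
    tokens = pad_sequences([tokens], maxlen=longest_sequence-1)
    # Prediction: index of the most probable next word
    # (`model.predict_classes` was removed in later TensorFlow 2.x releases)
    pred = np.argmax(model.predict(tokens), axis=-1)
    return pred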
    

Here we create a dict that maps each token index back to its word, so that a prediction can be turned directly into text.

In [0]:
wordDict = {}
for k, v in tokenizer.word_index.items():
    wordDict[v] = k

In [26]:
seed_word = "Siddhesh"
for i in range(50):
    word = wordDict[generate_text(seed_word)[0]]
    seed_word += " " + word

print(seed_word)
    

Siddhesh i have no precious pilgrim at all a fruitful wise and a great wise shakes in a thousand letters and a velvet as may be as as as a more and in the king a end of a mans as of a mans and of a mans as of a
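
Two things about this loop are worth noting. The seed `Siddhesh` is presumably not in the tokenizer's vocabulary, so it is dropped and the first prediction is made from an all-padding input. And because the most probable word is always chosen (greedy decoding), the output tends to fall into repeating phrases, as the tail of the text above shows. The sketch below reproduces the loop in plain Python with a stand-in `predict_next` function in place of the trained model; every name and the tiny vocabulary are hypothetical.

```python
index_word = {1: "a", 2: "mans", 3: "as", 4: "of"}
word_index = {word: idx for idx, word in index_word.items()}

def predict_next(tokens):
    """Stand-in for the model: a fixed successor table, 0 = padding/empty."""
    successor = {0: 1, 1: 2, 2: 3, 3: 4, 4: 1}
    return successor[tokens[-1] if tokens else 0]

def generate(seed_text, n_words, maxlen=3):
    text = seed_text
    for _ in range(n_words):
        # Unknown words are skipped, as the Keras tokenizer does without an OOV token
        tokens = [word_index[w] for w in text.lower().split() if w in word_index]
        tokens = tokens[-maxlen:]          # keep only the most recent tokens
        next_index = predict_next(tokens)  # greedy: always the single best guess
        # .get avoids a KeyError if the padding index 0 is ever predicted
        text += " " + index_word.get(next_index, "")
    return text

print(generate("Siddhesh", 6))
# Siddhesh a mans as of a mans
```

The `.get` lookup is a small safeguard: index 0 is never assigned to a word, so a direct `wordDict[...]` lookup would fail if the model ever predicted it.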
