## Char Prediction using LSTM

1. Download Alice in Wonderland or Dracula in plain text format from https://www.gutenberg.org/browse/scores/top
2. Create a char_to_int map that assigns an integer to each character used in the novel, for example {a: 3}.
3. Read the data from the text file and do the following:
    3.1 Slide a window over the text: the first 100 characters form the input sequence and the 101st character is the output. The window advances one character at a time.
    For example: 
        "Avul Pakir Jainulabdeen Abdul Kalam better known as A.P.J. Abdul Kalam"
        The first window runs from "A" through the 100th character, and the 101st character is its output.
        The second window starts at "v" and runs through the 101st character, and the 102nd character is its output.
    Convert the input and output sequences to their integer representation using the char_to_int map.
    This gives two arrays, seqIn and seqOut, whose elements hold the integer representation of 100 characters and of 1 character respectively.
    seqIn = [[10........15], [5.....25]...] seqOut = [5, 2, 5]
4. Reshape seqIn to (NumberOfSamples, 100, 1), which gives [[[10]........[15]], [[5]..... [25]]...]
5. One-hot encode seqOut using np_utils.to_categorical

6. Now create a simple model with LSTM followed by a Dense layer.

7. Then, given a seed sentence predict the next character using the model created.
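
Steps 2 and 3 can be sketched in plain Python as follows. This is a minimal illustration rather than the notebook's own implementation: the function name `make_windows` and the short sample string are assumptions, and a window length of 5 is used so the result is easy to inspect (the exercise uses 100).

```python
def make_windows(text, seq_len):
    """Build (seq_in, seq_out, char_to_int) with a window that slides one character at a time."""
    # Map every distinct character to an integer, in sorted order for reproducibility.
    chars = sorted(set(text))
    char_to_int = {ch: i for i, ch in enumerate(chars)}

    seq_in, seq_out = [], []
    # Window i covers text[i : i + seq_len]; the character right after it is the target.
    for i in range(len(text) - seq_len):
        window = text[i:i + seq_len]
        target = text[i + seq_len]
        seq_in.append([char_to_int[ch] for ch in window])
        seq_out.append(char_to_int[target])
    return seq_in, seq_out, char_to_int


sample = "abdul kalam"          # hypothetical sample text
seq_in, seq_out, char_to_int = make_windows(sample, seq_len=5)
print(len(seq_in))              # number of windows: len(sample) - seq_len
print(seq_in[0], seq_out[0])    # first window and its target
```

Because the window advances by one character, a text of N characters yields N - 100 training pairs when the window length is 100.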


In [54]:
from keras.layers import LSTM, Dense
from keras.layers import Embedding
from keras.models import Sequential
from keras.layers import Dropout
from keras.utils import np_utils
import numpy as np
import sklearn.model_selection as m_sel

In [2]:
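def storedata(fname):
    """Read fname one character at a time and build training pairs.

    Only ASCII letters are kept; they are lower-cased and mapped to 0..25.
    Note: this collects consecutive, non-overlapping blocks of 101 letters
    (100 inputs + 1 output) rather than a window that slides by one character.
    """
    seq_in = []      # list of 100-integer input sequences
    seq_out = []     # integer target that follows each input sequence
    inp = []         # input sequence currently being filled
    c = 1            # 1-based position inside the current 101-letter block
    all_ch = set()   # distinct integer codes seen
    with open(fname, "r") as f:
        while True:
            ch = f.read(1)
            if ch:
                asci_ch = ord(ch)
                # Keep only A-Z and a-z.
                if (65 <= asci_ch <= 90) or (97 <= asci_ch <= 122):
                    if 65 <= asci_ch <= 90:
                        asci_ch = asci_ch + 32   # upper case -> lower case
                    asci_ch = asci_ch % 97       # 'a'..'z' -> 0..25
                    if c <= 100:
                        c = c + 1
                        inp.append(asci_ch)
                    elif c == 101:
                        seq_in.append(inp)
                        inp = []
                        seq_out.append(asci_ch)
                        c = 1
                    all_ch.add(asci_ch)
            else:
                print("End Of File")
                print("Extraction of Data Complete")
                break   # the with block closes the file
    return seq_in, seq_out, all_ch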

In [3]:
fname = "345-8.txt"
seq_in, seq_out, all_ch = storedata(fname)

End Of File
Extraction of Data Complete


In [4]:
num_alpha = len(all_ch)

In [5]:
num_alpha

26

In [6]:
len(seq_in)

6522

In [7]:
len(seq_in[0])

100

In [8]:
len(seq_in[6521])

100

In [9]:
len(seq_out)

6522

In [10]:
x = np.array(seq_in)

In [11]:
np.shape(x)

(6522, 100)

In [12]:
y = np.array(seq_out)

In [13]:
np.shape(y)

(6522,)

In [14]:
x_train, x_test, y_train, y_test = m_sel.train_test_split(x, y, test_size=0.30, random_state=20)

In [15]:
print(type(x_train))
print(np.shape(x_train))
print(np.shape(x_test))
print(type(y_train))
print(np.shape(y_train))
print(np.shape(y_test))

<class 'numpy.ndarray'>
(4565, 100)
(1957, 100)
<class 'numpy.ndarray'>
(4565,)
(1957,)


In [16]:
y.max()

25

In [17]:
y.min()

0

In [18]:
y_train = np_utils.to_categorical(y_train, num_classes=num_alpha)
y_test = np_utils.to_categorical(y_test, num_classes=num_alpha)

In [19]:
ycat = np_utils.to_categorical(y, num_classes=num_alpha)
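
For reference, the effect of to_categorical can be reproduced with plain NumPy. The sketch below uses made-up labels; `labels` and `num_classes` are illustrative values, not the notebook's data.

```python
import numpy as np

labels = np.array([5, 2, 5])     # illustrative integer targets, like seqOut in step 3
num_classes = 26                 # one class per lower-case letter

# Row k of the identity matrix is the one-hot vector for class k,
# so indexing with the label array one-hot encodes every label at once.
one_hot = np.eye(num_classes, dtype="float32")[labels]

print(one_hot.shape)             # (3, 26)
print(one_hot.argmax(axis=1))    # recovers the original labels
```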

In [20]:
print(type(x))
print(np.shape(x))
print(type(y))
print(np.shape(y))
print(type(ycat))
print(np.shape(ycat))

<class 'numpy.ndarray'>
(6522, 100)
<class 'numpy.ndarray'>
(6522,)
<class 'numpy.ndarray'>
(6522, 26)


In [21]:
# Add a trailing feature axis: (samples, 100) -> (samples, 100, 1)
x1 = x.reshape((x.shape[0], 100, 1))
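
A Keras LSTM layer expects input of shape (samples, time steps, features), which is why step 4 adds a trailing axis of size 1. The small sketch below shows the effect on a toy array; the array contents are illustrative only.

```python
import numpy as np

seq_in = np.arange(12).reshape(3, 4)   # 3 illustrative samples of 4 time steps each
x3d = seq_in.reshape((seq_in.shape[0], seq_in.shape[1], 1))

print(seq_in.shape)      # (3, 4)
print(x3d.shape)         # (3, 4, 1)
print(x3d[0].tolist())   # [[0], [1], [2], [3]]
```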

In [28]:
model = Sequential()
model.add(Embedding(num_alpha, 1,input_length = 100))
model.add(LSTM(64))
model.add(Dense(num_alpha, activation = "softmax"))  # softmax gives a probability distribution over the letters, matching categorical_crossentropy
model.compile(loss = "categorical_crossentropy", optimizer = "adam", metrics = ["accuracy"])
model.fit(x, ycat, batch_size = 25, epochs = 10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7ff662fa09e8>

In [23]:
model1 = Sequential()
model1.add(LSTM(64,return_sequences=True,input_shape=(100, 1)))
model1.add(LSTM(64))
model1.add(Dense(num_alpha, activation = "softmax"))  # softmax gives a probability distribution over the letters, matching categorical_crossentropy
model1.compile(loss = "categorical_crossentropy", optimizer = "adam", metrics = ["accuracy"])
model1.fit(x1, ycat, batch_size = 25, epochs = 10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7ff689c89a20>

In [None]:
data = """I soon lost sight and recollection of ghostly fears in the beauty of the
scene as we drove along, although had I known the language, or rather
languages, which my fellow-passengers were speaking, I might not have
been able to throw them off so easily. Before us lay a green sloping
land full of forests and woods, with here and there steep hills, crowned
with clumps of trees or with farmhouses, the blank gable end to the
road. There was everywhere a bewildering mass of fruit blossom--apple,
plum, pear, cherry; and as we drove by I could see the green grass under
the trees spangled with the fallen petals. In and out amongst these
green hills of what they call here the "Mittel Land" ran the road,
losing itself as it swept round the grassy curve, or was shut out by the
straggling ends of pine woods, which here and there ran down the
hillsides like tongues of flame. The road was rugged, but still we
seemed to fly over it with a feverish haste. I could not understand then
what the haste meant, but the driver was evidently bent on losing no
time in reaching Borgo Prund. I was told that this road is in summertime
excellent, but that it had not yet been put in order after the winter
snows. In this respect it is different from the general run of roads in
the Carpathians, for it is an old tradition that they are not to be kept
in too good order. Of old the Hospadars would not repair them, lest the
Turk should think that they were preparing to bring in foreign troops,
and so hasten the war which was always really at loading point."""
fname_test = "test.txt"
# Write the seed passage to a scratch file so storedata can read it.
with open(fname_test, "w") as out:
    out.write(data + "\n")
seq_in_test, seq_out_test, all_ch_test = storedata(fname_test)
print("seq_in_test shape ", np.shape(seq_in_test))
print("seq_out_test shape ", np.shape(seq_out_test))
seq_in_test_arr = np.array(seq_in_test[0])
seq_out_test_arr = np.array(seq_out_test[0])
print("seq_out_test[0] ", seq_out_test[0])
print("seq_in_test_arr shape ", np.shape(seq_in_test_arr))
print("seq_out_test_arr shape ", np.shape(seq_out_test_arr))
seq_in_test_arr = seq_in_test_arr.reshape((1, 100, 1))   # (samples, time steps, features)
open(fname_test, "w").close()   # empty the scratch file
probs = model1.predict(seq_in_test_arr, verbose = 0)
print(np.argmax(probs))     # predicted letter index (0..25)
print(probs.max())          # score assigned to that letter
print(seq_out_test_arr)     # actual next letter index
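
The cell above predicts a single character. To generate longer text, each prediction can be fed back into the input window. The sketch below shows that loop in plain Python; `generate` and `predict_fn` are hypothetical names, and `dummy_predict` is a stand-in for something like model1.predict, used only so the example runs without a trained model.

```python
def generate(seed, predict_fn, n_chars):
    """Greedy generation: append the most likely next code, then slide the window by one."""
    window = list(seed)          # integer codes of the seed
    generated = []
    for _ in range(n_chars):
        probs = predict_fn(window)                            # one score per class
        nxt = max(range(len(probs)), key=probs.__getitem__)   # index of the highest score
        generated.append(nxt)
        window = window[1:] + [nxt]                           # keep the window length fixed
    return generated


def dummy_predict(window):
    """Stand-in model: always favours the letter after the last one (wrapping z -> a)."""
    probs = [0.0] * 26
    probs[(window[-1] + 1) % 26] = 1.0
    return probs


int_to_char = {i: chr(i + 97) for i in range(26)}   # inverse of the 0..25 letter coding
codes = generate([0, 1, 2], dummy_predict, n_chars=4)
print("".join(int_to_char[c] for c in codes))
```

With a real model, `predict_fn` would reshape the window to (1, 100, 1) before calling predict and return the first row of the result.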


In [34]:
import h5py
import numpy as np
f = h5py.File("mytestfile.hdf5", "r")

In [53]:
for i in f.keys():
    print(i)
for i in f.values():
    print(i)    

mydataset
<HDF5 dataset "mydataset": shape (100,), type "<i4">


In [50]:
data = f["mydataset"]
att = data.keys()

AttributeError: 'Dataset' object has no attribute 'keys'

In [61]:
model3 = Sequential()
model3.add(LSTM(256,return_sequences=True,input_shape=(100, 1)))
model3.add(Dropout(0.2))
model3.add(LSTM(64))
model3.add(Dropout(0.2))
model3.add(Dense(num_alpha, activation = "softmax"))  # softmax gives a probability distribution over the letters, matching categorical_crossentropy
model3.compile(loss = "categorical_crossentropy", optimizer = "adam", metrics = ["accuracy"])
#model3.fit(x1, ycat, batch_size = 25, epochs = 10)
# load_weights takes the path of a weights file written by save_weights, not an open h5py.File object.
model3.load_weights("mytestfile.hdf5")

ImportError: `load_weights` requires h5py.

In [57]:
import h5py
with h5py.File('mytestfile.hdf5', 'r') as f:
    my_array = f['mydataset'][()]

In [60]:
type(f)

h5py._hl.files.File