### Text Generation using Character Level Language Modelling.
**Here we are creating a Generative Text Models**


**Approach**
- loading the text
- converting in to lower case
- creating vocalbulary
- preparing the data set for the many to one model
- Building the Lstm many to One Model
- Generating the Text

In [1]:
### importing the libraries
import numpy as np
import os
import string
import re
from tqdm import tqdm

In [2]:
### loading the text from the given text file
file_path="./input/Alice_wonder_land.txt"
f=open(file_path,"r",encoding="utf-8")
text=f.read()

In [3]:
## Converting the text to lower case
text=text.lower()

### Creating teh vocabulary from the text
chars=list(set(text))
vocab_chars=dict((c,i) for i,c in enumerate(chars))

## summary 
print("Number of Characters in the Whole text :",len(text))
print("Number of unique charactes :",len(vocab_chars))

Number of Characters in the Whole text : 162939
Number of unique charactes : 63


### Experiment 1: with out One-Hot Encoding

- Preparing the training dataset
- Just we do here is assigning the values of the encodings.

In [4]:
def Prepare_dataset(corpus,max_length):
  x=[]
  y=[]
  length=len(corpus)
  len_vocab=len(vocab_chars)
  for i in tqdm(range(0,length-max_length)):
    inp=corpus[i:i+max_length]
    out=corpus[i+max_length]
    x.append([vocab_chars[c] for c in inp])
    ## one hot encoding for the y
    a=np.zeros(len_vocab)
    a[vocab_chars[out]]=1
    y.append(a)
  
  return np.array(x),np.array(y)




In [5]:
### preparing teh datase 
max_length=100
x_train,y_train=Prepare_dataset(text,max_length)
print("Shape of Inputs :",x_train.shape)
print("Shape of Ouputs :",y_train.shape)


100%|██████████| 162839/162839 [00:03<00:00, 52568.59it/s]


Shape of Inputs : (162839, 100)
Shape of Ouputs : (162839, 63)


In [6]:
## reshaping the dataset as suitable for the LSTMs
x_train=x_train.reshape(x_train.shape[0],x_train.shape[1],1)

In [7]:
x_train=x_train/len(vocab_chars)

In [8]:
x_train.shape

(162839, 100, 1)

### Building the Model for version 1

In [15]:
## importing the libraries
from keras.layers import Dense,LSTM,Embedding,Dropout
from keras.models import Sequential,load_model

In [16]:
model_1=Sequential()
model_1.add(LSTM(512,input_shape=(100,1)))
model_1.add(Dropout(0.2))
model_1.add(Dense(len(vocab_chars),activation="softmax"))

model_1.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm (LSTM)                  (None, 512)               1052672   
_________________________________________________________________
dropout (Dropout)            (None, 512)               0         
_________________________________________________________________
dense (Dense)                (None, 63)                32319     
Total params: 1,084,991
Trainable params: 1,084,991
Non-trainable params: 0
_________________________________________________________________


In [None]:
## compile the model
model_1.compile(loss="categorical_crossentropy",optimizer="adam",metrics=["accuracy"])
model_1.fit(x_train,y_train,epochs=20,batch_size=128)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<tensorflow.python.keras.callbacks.History at 0x7f714664a6d0>

In [None]:
### model saving
model_1.save("model_text_gen.h5")

In [21]:
from tensorflow.keras.models import load_model

In [22]:
model_1 = load_model("./models/model_text_gen.h5")
model_1.compile(loss="categorical_crossentropy",optimizer="adam",metrics=["accuracy"])
#model_1.fit(x_train,y_train,epochs=1,batch_size=128)

In [11]:
import keras
keras.__version__

'2.4.3'

In [30]:
### Generating the Text
index=1
sample_text=text[index:index+max_length]
sample_text=[vocab_chars[c] for c in sample_text]
sample_text=np.array(sample_text).reshape(1,max_length,1)
int_char=dict((i,c) for c,i in vocab_chars.items())

num_chars=500
generate_string = ""
for _ in tqdm(range(num_chars)):
    y_pre=model_1.predict(sample_text/len(int_char))
    pre = np.argmax(y_pre)
    generate_string += int_char[pre]
    pre = pre.reshape((1,1,1))
   
    sample_text=np.concatenate([sample_text,pre],axis=1)
    sample_text=sample_text[:,1:,:]
  




  0%|          | 0/500 [00:00<?, ?it/s][A[A[A


  0%|          | 1/500 [00:00<01:05,  7.63it/s][A[A[A


  0%|          | 2/500 [00:00<01:03,  7.80it/s][A[A[A


  1%|          | 3/500 [00:00<01:03,  7.85it/s][A[A[A


  1%|          | 4/500 [00:00<01:02,  7.94it/s][A[A[A


  1%|          | 5/500 [00:00<01:02,  7.89it/s][A[A[A


  1%|          | 6/500 [00:00<01:02,  7.95it/s][A[A[A


  1%|▏         | 7/500 [00:00<01:01,  8.03it/s][A[A[A


  2%|▏         | 8/500 [00:00<00:59,  8.20it/s][A[A[A


  2%|▏         | 9/500 [00:01<01:01,  8.01it/s][A[A[A


  2%|▏         | 10/500 [00:01<01:04,  7.65it/s][A[A[A


  2%|▏         | 11/500 [00:01<01:03,  7.69it/s][A[A[A


  2%|▏         | 12/500 [00:01<01:02,  7.86it/s][A[A[A


  3%|▎         | 13/500 [00:01<01:01,  7.86it/s][A[A[A


  3%|▎         | 14/500 [00:01<01:03,  7.69it/s][A[A[A


  3%|▎         | 15/500 [00:01<01:01,  7.85it/s][A[A[A


  3%|▎         | 16/500 [00:02<01:01,  7.91it/s][A[A

 27%|██▋       | 136/500 [00:17<00:45,  7.92it/s][A[A[A


 27%|██▋       | 137/500 [00:17<00:45,  8.02it/s][A[A[A


 28%|██▊       | 138/500 [00:17<00:43,  8.25it/s][A[A[A


 28%|██▊       | 139/500 [00:17<00:43,  8.38it/s][A[A[A


 28%|██▊       | 140/500 [00:17<00:43,  8.37it/s][A[A[A


 28%|██▊       | 141/500 [00:17<00:42,  8.36it/s][A[A[A


 28%|██▊       | 142/500 [00:17<00:42,  8.48it/s][A[A[A


 29%|██▊       | 143/500 [00:17<00:41,  8.59it/s][A[A[A


 29%|██▉       | 144/500 [00:18<00:41,  8.52it/s][A[A[A


 29%|██▉       | 145/500 [00:18<00:41,  8.46it/s][A[A[A


 29%|██▉       | 146/500 [00:18<00:48,  7.32it/s][A[A[A


 29%|██▉       | 147/500 [00:18<01:04,  5.51it/s][A[A[A


 30%|██▉       | 148/500 [00:18<01:06,  5.30it/s][A[A[A


 30%|██▉       | 149/500 [00:19<01:00,  5.84it/s][A[A[A


 30%|███       | 150/500 [00:19<00:53,  6.52it/s][A[A[A


 30%|███       | 151/500 [00:19<00:50,  6.90it/s][A[A[A


 30%|███       | 152/500

 54%|█████▍    | 270/500 [00:35<00:31,  7.23it/s][A[A[A


 54%|█████▍    | 271/500 [00:35<00:32,  6.99it/s][A[A[A


 54%|█████▍    | 272/500 [00:35<00:32,  7.00it/s][A[A[A


 55%|█████▍    | 273/500 [00:35<00:31,  7.23it/s][A[A[A


 55%|█████▍    | 274/500 [00:36<00:29,  7.56it/s][A[A[A


 55%|█████▌    | 275/500 [00:36<00:29,  7.65it/s][A[A[A


 55%|█████▌    | 276/500 [00:36<00:30,  7.23it/s][A[A[A


 55%|█████▌    | 277/500 [00:36<00:31,  7.12it/s][A[A[A


 56%|█████▌    | 278/500 [00:36<00:29,  7.47it/s][A[A[A


 56%|█████▌    | 279/500 [00:36<00:28,  7.67it/s][A[A[A


 56%|█████▌    | 280/500 [00:36<00:29,  7.57it/s][A[A[A


 56%|█████▌    | 281/500 [00:36<00:30,  7.24it/s][A[A[A


 56%|█████▋    | 282/500 [00:37<00:30,  7.19it/s][A[A[A


 57%|█████▋    | 283/500 [00:37<00:30,  7.23it/s][A[A[A


 57%|█████▋    | 284/500 [00:37<00:28,  7.45it/s][A[A[A


 57%|█████▋    | 285/500 [00:37<00:27,  7.75it/s][A[A[A


 57%|█████▋    | 286/500

 81%|████████  | 404/500 [00:52<00:11,  8.34it/s][A[A[A


 81%|████████  | 405/500 [00:53<00:11,  8.12it/s][A[A[A


 81%|████████  | 406/500 [00:53<00:15,  6.26it/s][A[A[A


 81%|████████▏ | 407/500 [00:53<00:13,  6.79it/s][A[A[A


 82%|████████▏ | 408/500 [00:53<00:12,  7.16it/s][A[A[A


 82%|████████▏ | 409/500 [00:53<00:12,  7.55it/s][A[A[A


 82%|████████▏ | 410/500 [00:53<00:11,  7.84it/s][A[A[A


 82%|████████▏ | 411/500 [00:53<00:10,  8.12it/s][A[A[A


 82%|████████▏ | 412/500 [00:54<00:10,  8.34it/s][A[A[A


 83%|████████▎ | 413/500 [00:54<00:10,  8.50it/s][A[A[A


 83%|████████▎ | 414/500 [00:54<00:10,  8.46it/s][A[A[A


 83%|████████▎ | 415/500 [00:54<00:09,  8.61it/s][A[A[A


 83%|████████▎ | 416/500 [00:54<00:09,  8.50it/s][A[A[A


 83%|████████▎ | 417/500 [00:54<00:11,  6.93it/s][A[A[A


 84%|████████▎ | 418/500 [00:54<00:12,  6.81it/s][A[A[A


 84%|████████▍ | 419/500 [00:54<00:11,  7.23it/s][A[A[A


 84%|████████▍ | 420/500

In [31]:
print(generate_string)

35 3 y3-yw r”r5kd55 q8 q8y5 8i q@5 y88q 3ks” r8 9-8 k@5 q5k ‘$q8- @5o t5 q@5 t58$ y”$tr5 ”q 3rr q5rr”55 3k/ q8-$5 d5 ”q5- q8 d5 q8‘k/ ”q”— n@3q 3 /”5$ r”qqr5 d”5q @3-l 3s3”k— 3k/ $88l i8o @5- q8 y”k5 q@5 o88l 93/ 38l  88 5k55 q8 /”y5 q@5k— 3‘/ ”85 $k q@5 t5-5q5k$k  3r”q5 :8yr/ s5- i8 y3-k5ks q8 3r”:5— q@5 o‘$@5$$ 8i 5f-:‘q”8k$ 9'”@3q$ h”q5 q@3q—n $3”/ q@5 y3q:@ @3-5w99'”” q8‘// q8-3@ /” s5q ”kq8 q@5 y8-l q‘-qr5 $8‘ lk8o—n $3”/ 3r”:5— 33/ q@5 y88l 3‘- q85$5 8ki q@”k— 3k/ q@5k 3r”:5 :8‘r/ @5- i88 


## Experiment 1:

### Preparing the Training dataset 
Here we are planning to build the many to one model . So 

Data set will be : input(c1c2c3c4c5...) --> ouput(Co)

i.e we are passing the multiple characters as input but we expecting single character.

In [None]:
### Preparing the Training dataset 
## step1 : we will extract the 100 character at each time
## step 2 : we will encode the input
## step 3 : Do one hot encoding
max_length=100  ## this was an input length
total_len=len(text)
Input=[]
Output=[]
for i in tqdm(range(0,total_len-max_length-1)):
  x=text[i:i+max_length]
  y=text[i+max_length]
  Input.append([vocab_chars[c] for c in x])
  Output.append(vocab_chars[y])

print("Number of pattern is :",len(Input))

100%|██████████| 162838/162838 [00:01<00:00, 101400.38it/s]

Number of pattern is : 162838





In [None]:
len_inp=len(Input)

In [None]:
### defining the Generator to get effieentlt for by the obe-hot encoding
def Data_generator():
  batch_size=128
  i=0
  len_vocab=len(vocab_chars)
  while(True):
    
    x=Input[i:i+batch_size]
    y=Output[i:i+batch_size]

    gen_x=[]
    gen_y=[]
    for j in x:
      ohe=[]
      for m in j:
        k=np.zeros(len_vocab)
        k[m]=1
        ohe.append(k)
      gen_x.append(ohe)
    
    for j in y:
      k=np.zeros(len_vocab)
      k[j]=1
      gen_y.append(k)
    i=i+batch_size
    yield (np.array(gen_x),np.array(gen_y))


### Building the Model

In [None]:
from keras.layers import LSTM,Dense,Dropout
from keras.models import Sequential


## len of the feaures
len_vocab=len(vocab_chars)

In [None]:
model=Sequential()
model.add(LSTM(256,input_shape=(max_length,len_vocab)))
model.add(Dropout(0.2))
model.add(Dense(len_vocab,activation="softmax"))


model.summary()

Model: "sequential_8"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_8 (LSTM)                (None, 256)               327680    
_________________________________________________________________
dropout_7 (Dropout)          (None, 256)               0         
_________________________________________________________________
dense_8 (Dense)              (None, 63)                16191     
Total params: 343,871
Trainable params: 343,871
Non-trainable params: 0
_________________________________________________________________


In [None]:
## compiling the model
model.compile(loss="categorical_crossentropy",optimizer="adam",metrics=["accuracy"])
for i in range(5):
  data_gen=Data_generator()
  model.fit(data_gen,epochs=1,steps_per_epoch=(len_inp/128)-1)
model.save("generation_model.h5")



### Generating the Text Using Trained Model

In [None]:
test_text=text[1000:1000+max_length]
## converting the text to one hot code 
test=[vocab_chars[c] for c in test_text]
test_x=[]
for i in test:
  k=np.zeros(len(vocab_chars))
  k[i]=1
  test_x.append(k)

test_x=np.array(test_x)
test_x=test_x.reshape(1,test_x.shape[0],test_x.shape[1])
print("Shape of test :",test_x.shape)

Shape of test : (1, 100, 63)


In [None]:
### integer to character 
int_char=dict((i,c) for c,i in vocab_chars.items())

In [None]:
gen_len_text=100
tt=[]
for i in range(gen_len_text):
  y=model.predict(test_x)
  test_x=test_x.reshape((test_x.shape[1],test_x.shape[2]))
  k=np.zeros(len(vocab_chars))
  k[np.argmax(y)]=1
  k=k.reshape((1,63))
  test_x=np.vstack((test_x,k))
  test_x=test_x[1:,:]
  test_x=test_x.reshape((1,test_x.shape[0],test_x.shape[1]))
  tt.append(int_char[np.argmax(y)])

print("Generated Text is ")
print("="*50)
generated_text="".join(tt)
print(generated_text)

Generated Text is 
 the project gutenberg-tm electronic works and project gutenberg-tm electronic works and project gut


In [None]:
y=model.predict(test_x)

In [None]:
np.argmax(y)

4

In [None]:
print(test_text)

cket_, and looked at it, and then hurried
on, alice started to her feet, for it flashed across her m
