Run the cell below only if you are using Google Colab; it mounts Google Drive.

In [63]:
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


**Required imports**

In [0]:
import json
import os
import numpy as np
from tqdm import tqdm
from sklearn.model_selection import train_test_split
from keras.utils import np_utils
import pandas as pd

import keras
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.layers import LSTM
from keras.layers import Conv1D, MaxPool1D
from keras.callbacks import ModelCheckpoint
from keras.models import load_model

Set the root path accordingly. The root path cannot be relative; it must be absolute, i.e. beginning with C:, /usr/ etc.

In [0]:
root_path = "/content/gdrive/My Drive/colab_notebooks/LM-SampleGeneration"

Read the text and vocabulary files; make sure they exist at the paths below.

In [0]:
with open(root_path + "/data/all_story_cleaned.txt", encoding = "utf-8") as f:
  raw_text = f.read()
  
with open(root_path + "/data/vocab_clenaed-50.json", encoding="utf-8") as f:
  json_dump = json.load(f)
  char_to_int = json_dump["char_to_int"]
  int_to_char = json_dump["int_to_char"]
  int_to_char = {int(k):int_to_char[k] for k in int_to_char}
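
The last line of the cell above converts the keys of `int_to_char` back to integers. A minimal sketch of why that is needed, using a hypothetical three-character vocabulary (not the real vocab file): JSON object keys are always strings, so integer ids do not survive a save/load round trip unchanged.

```python
import json

# hypothetical three-character vocabulary, for illustration only
int_to_char = {0: " ", 1: "a", 2: "b"}
char_to_int = {c: i for i, c in int_to_char.items()}

# save and reload, as would happen with the vocab JSON file
dumped = json.dumps({"char_to_int": char_to_int, "int_to_char": int_to_char})
loaded = json.loads(dumped)

# JSON object keys are always strings, so the integer ids come back as strings
string_keys = list(loaded["int_to_char"].keys())
print(string_keys)

# restore integer keys, mirroring the dict comprehension in the cell above
restored = {int(k): v for k, v in loaded["int_to_char"].items()}
print(restored == int_to_char)
```

Without this conversion, a lookup such as `int_to_char[next_index]` with an integer index would raise a `KeyError`.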

Global variables that define the experiment: the vocabulary size and the length of the seed window.

In [0]:
n_vocab = len(int_to_char)
max_len = 50

Returns the predicted character for the given probability distribution. This method stays the same irrespective of the model used for prediction.  <br>
Argument ***preds:*** the probability scores output by the model. <br>
Argument ***temperature:*** controls how adventurous the neural network can be when predicting the next character. The higher the temperature, the greater the chance of an infrequent character being selected. <br>
**Returns:** integer id of the character selected for prediction

Nice explanation of temperature-based sampling <br>
https://stats.stackexchange.com/questions/255223/the-effect-of-temperature-in-temperature-sampling

In [0]:
def sample(preds, temperature=1.0):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    # select a character based on probability
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)
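
To see what the temperature does before any sampling happens, the rescaling step of `sample` can be isolated. The sketch below is illustrative only: `apply_temperature` is a hypothetical helper (not part of the notebook), and the four-element distribution is made up.

```python
import numpy as np

def apply_temperature(preds, temperature=1.0):
    """Rescale a probability vector the same way `sample` does, without sampling."""
    preds = np.asarray(preds, dtype="float64")
    logits = np.log(preds) / temperature
    exp_logits = np.exp(logits)
    return exp_logits / np.sum(exp_logits)

# hypothetical distribution over four characters, for illustration only
probs = [0.5, 0.3, 0.15, 0.05]

for t in (0.5, 1.0, 2.0):
    # low temperature sharpens the distribution, high temperature flattens it
    print("temperature", t, "->", np.round(apply_temperature(probs, t), 3))
```

A temperature of 1.0 leaves the distribution unchanged; values below 1.0 concentrate probability on the most likely character, while values above 1.0 give infrequent characters a larger share.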

Method to generate samples for a given model and text. <br>
Argument ***model:*** the model on which prediction is called; cannot be **None**. <br>
Argument ***model_desc:*** a short description of the model, stored with each generated sample. <br>
Argument ***text:*** the text to be used as the seed for generating samples. **Optional**; if not provided, a random seed is selected from the raw text for each sample. <br>
Argument ***temperatures:*** the temperature/diversity for each generated text. Its length also governs the number of texts generated per method call. **Optional**; if not provided, 8 samples are generated for a mix of values.<br>
**Returns:** a list of tuples, each containing the model description, temperature, seed and corresponding generated text.

In [0]:
def generate_sample(model, model_desc, text=None, 
                    temperatures=[0.7, 0.7, 0.6, 0.6,0.5, 0.5, 0.4, 0.4]):
  
  samples = []
  
  if text is not None:
    custom = True
    original_text = text
  else: custom = False
  
  # run for each temperature value
  for diversity in temperatures:
    
    # if custom text is not provided
    if not custom:
      # select a random start point
      start = np.random.randint(len(raw_text)-(max_len + 1))
      # select portion of text till max length
      text = raw_text [start : start + max_len]
    else:
      # if custom text is provided, use lower case chars
      # make sure the given text contains only characters present in vocab (50 chars)
      # left-pad it with spaces, in case the custom text is shorter than max length
      text = original_text
      text = " "*max_len + text.lower()
      # slice last max len chars
      text = text[-max_len:]
    
    # set the seed
    seed = text
    # tokenize the text
    tokenized_text = [char_to_int[c] for c in text]
    
    # one-hot encode the tokens: shape (max_len, n_vocab)
    x_seq = np_utils.to_categorical(tokenized_text, num_classes=n_vocab)
    # add the missing batch axis: shape (1, max_len, n_vocab)
    x_seq = np.expand_dims(x_seq, axis = 0)
    
    # generate 200 chars
    for i in range(200):
      
      # predictions have shape (1, n_vocab); take the only sample on the batch axis, giving shape (n_vocab,)
      preds = model.predict(x_seq, verbose=0)[0]
      
      # use temperature based sampling
      next_index = sample(preds, diversity)
      # untokenize the index to get the character
      next_char = int_to_char[next_index]
      # append the char
      text = text + next_char
      # one-hot encode the new char for prediction. shape (1, 1, n_vocab).
      new_arr = np_utils.to_categorical(next_index, num_classes=n_vocab).reshape(1, 1, -1)
      # append it to current X, and take all elements other than first. (sliding window approach)
      # use this for prediction next iteration
      x_seq = np.concatenate([x_seq, new_arr], axis = 1)[:, 1:, :]
      
    # append to array to return
    samples.append((model_desc, diversity, seed, text))
    print("generated for temp:", diversity)
  
  # return all the samples
  return samples
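
The sliding-window update inside the generation loop can be checked in isolation. This is a minimal sketch with deliberately tiny, hypothetical sizes (`n_vocab_demo`, `max_len_demo`) and a plain NumPy `one_hot` helper standing in for `np_utils.to_categorical`; it is not the notebook's own code.

```python
import numpy as np

# small hypothetical sizes so the arrays are easy to inspect
n_vocab_demo, max_len_demo = 4, 3

def one_hot(indices, num_classes):
    """One-hot encode a list of integer ids with plain NumPy."""
    return np.eye(num_classes, dtype="float32")[indices]

# current window of token ids, with the batch axis added
window = [0, 1, 2]
x_seq = one_hot(window, n_vocab_demo)[np.newaxis, ...]  # (1, max_len_demo, n_vocab_demo)

# pretend the model sampled id 3 as the next character
next_index = 3
new_arr = one_hot([next_index], n_vocab_demo).reshape(1, 1, -1)  # (1, 1, n_vocab_demo)

# append on the time axis, then drop the oldest step to keep the length fixed
x_next = np.concatenate([x_seq, new_arr], axis=1)[:, 1:, :]
decoded = x_next[0].argmax(axis=-1).tolist()
print(x_next.shape, decoded)
```

The window keeps its length while its contents shift left by one character, so the model always sees the most recent `max_len` characters.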

Set the path and description of each model for which text needs to be generated.

In [0]:
two_layer_lstm = [root_path + "/models/best_model-2layer-512-256.h5", "2 Layer LSTM"]
three_layer_lstm = [root_path + "/models/best_model-3layer-512-256-256.h5", "3 Layer LSTM"]
two_layer_cnn_lstm = [root_path + "/models/best_model-2cnn-2layer-1stride-50len.h5", "2 Layer LSTM with Char-CNN"]
pair_embedding = [root_path + "/models/best_model-pair-embedding.h5", "3 Layer LSTM with Pair Embeddings"]

**Generating a few samples for fun**

In [79]:
good_model = load_model(three_layer_lstm[0])

# do not pass text if you want to try random generation.
# char generation takes a few minutes.
generated_text = generate_sample(good_model, three_layer_lstm[1], text = "you shall know a word by ", temperatures=[0.6, 0.6])
for text_tuple in generated_text:
  print("seed:")
  print(text_tuple[2])
  print("generated text:")
  print(text_tuple[3])

generated for temp: 0.6
generated for temp: 0.6
seed:
                         you shall know a word by 
generated text:
                         you shall know a word by the latter world, or i do not come in to see him in the first tage to the promise of the unportrait days of the content in the same distance, and has properted by her shoes, and in the new day of by t
seed:
                         you shall know a word by 
generated text:
                         you shall know a word by the servant, my love, we looked at me and find for the practice of means in the barnacle, with a deserted great lian, and to tell me what i should make the stranger when you are satisfied in the mains


In [80]:
generate_sample(good_model, three_layer_lstm[1], temperatures=[0.6, 0.6])

generated for temp: 0.6
generated for temp: 0.6


[('3 Layer LSTM',
  0.6,
  'board, to make my supper on when i came back at ni',
  'board, to make my supper on when i came back at night. what i find them being one of the houses, and there had felt an appearance of the boy, and before me who had come upon him to the proposal of his common, had come to him to be revariously and mos'),
 ('3 Layer LSTM',
  0.6,
  ' kit. ‘but i am to give it to himself, if you plea',
  ' kit. ‘but i am to give it to himself, if you please you know the son was to be a destruction of the table, he had the bell and a subject of which the sense of little children did i couldn’t have no servant and her dear gentleman and mr. nickleby in ')]

**Batch Generation for Human Evaluation**

In [0]:
all_models = [two_layer_lstm, three_layer_lstm , two_layer_cnn_lstm, pair_embedding]

For each model, generate 25 seed/generated-text pairs (5 batches of 5 temperatures each) and save them to a `#`-separated CSV file for human evaluation.

In [0]:
gen_samples = []
for model_item in all_models:
  model_path = model_item[0]
  model_desc = model_item[1]
  
  model = load_model(model_path)
  
  batches = 5
  print("generating " + str(batches) + " batches for model:", model_desc)
  
  for num in range(batches):
    print("started for batch:", num)
    gen_samples = gen_samples + generate_sample(model, model_desc, 
                                temperatures=[0.7, 0.6, 0.6, 0.5, 0.5])
    

df = pd.DataFrame(gen_samples, columns = ["model", "temperature", "seed", "generated text"])
df.to_csv(root_path+"/generated_sample-2.csv", sep="#", index=False)
df.head()

generating 5 batches for model: 2 Layer LSTM
started for batch: 0
generated for temp: 0.7
generated for temp: 0.6
generated for temp: 0.6
generated for temp: 0.5
generated for temp: 0.5
started for batch: 1
generated for temp: 0.7
generated for temp: 0.6
generated for temp: 0.6
generated for temp: 0.5
generated for temp: 0.5
started for batch: 2
generated for temp: 0.7
generated for temp: 0.6
generated for temp: 0.6
generated for temp: 0.5
generated for temp: 0.5
started for batch: 3
generated for temp: 0.7
generated for temp: 0.6
generated for temp: 0.6
generated for temp: 0.5
generated for temp: 0.5
started for batch: 4
generated for temp: 0.7
generated for temp: 0.6
generated for temp: 0.6
generated for temp: 0.5
generated for temp: 0.5
generating 5 batches for model: 3 Layer LSTM
started for batch: 0
generated for temp: 0.7
generated for temp: 0.6
generated for temp: 0.6
generated for temp: 0.5
generated for temp: 0.5
started for batch: 1
generated for temp: 0.7
generated for temp: 0.6
generated for temp: 0.6
generated for temp: 0.5
generated for temp: 0.5
started for b

| | model | temperature | seed | generated text |
|---|---|---|---|---|
| 0 | 2 Layer LSTM | 0.7 | ht and to suffer her to talk on, as it was evi... | ht and to suffer her to talk on, as it was evi... |
| 1 | 2 Layer LSTM | 0.6 | d, and i can you take it for in this man, of c... | d, and i can you take it for in this man, of c... |
| 2 | 2 Layer LSTM | 0.6 | d in the street, mr. jarndyce as she said to b... | d in the street, mr. jarndyce as she said to b... |
| 3 | 2 Layer LSTM | 0.5 | ld not have the little moment to every walk, a... | ld not have the little moment to every walk, a... |
| 4 | 2 Layer LSTM | 0.5 | n of end in the next side of the most self-str... | n of end in the next side of the most self-str... |
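
Because the samples are saved with `#` as the separator, the same separator has to be passed when reading the file back for evaluation. A minimal round-trip sketch, assuming a made-up row and a temporary file rather than the real `generated_sample-2.csv`:

```python
import os
import tempfile

import pandas as pd

# one made-up row with the same columns as the saved samples
rows = [("demo model", 0.6, "seed text", "seed text plus generated # text")]
df_demo = pd.DataFrame(rows, columns=["model", "temperature", "seed", "generated text"])

with tempfile.TemporaryDirectory() as tmp_dir:
    # hypothetical file name, written to a temporary directory
    csv_path = os.path.join(tmp_dir, "generated_sample-demo.csv")
    df_demo.to_csv(csv_path, sep="#", index=False)
    # read back with the same separator used for writing
    df_loaded = pd.read_csv(csv_path, sep="#")

print(df_loaded)
```

With pandas' default quoting, a field that itself contains `#` is wrapped in quotes on write, so it should survive the round trip as a single column.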
