### Registration Number : 2007253
### Project : seq2seq : Natural Language to SQL Query Conversion


##### References: The code and architecture used in this notebook are adapted from the book Deep Learning with Python (Second Edition) by Francois Chollet.

##### In Chapter 11, Deep Learning for Text, the author proposes a GRU-based sequence-to-sequence neural network for English-to-Spanish translation. Using this work as a baseline, I have tried to replicate the network on the Yale Spider dataset.

##### I have also extended the original experiment to use bidirectional LSTMs (single and multi-layer) and computed BLEU scores for the different models trained on the Yale Spider dataset.

### Results obtained from the different sequence-to-sequence models

#### Model-1 : Single Layer BiGRU
1. Test Set BLEU Score : 0.145068
2. Validation Set BLEU Score : 0.141787

#### Model-2 : Single Layer BiLSTM
1. Test Set BLEU Score : 0.154295
2. Validation Set BLEU Score : 0.150324

#### Model-3 : Two Layer BiLSTM
1. Test Set BLEU Score : 0.165978
2. Validation Set BLEU Score : 0.161283

#### Model-4 : Three Layer BiLSTM
1. Test Set BLEU Score : 0.153336
2. Validation Set BLEU Score : not computed (GPU hours ran out before the evaluation cell could be executed)

### Import Packages

In [2]:
import re
import pathlib
import random
import string
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from nltk.translate.bleu_score import corpus_bleu
from tensorflow.keras.layers import TextVectorization
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

### Retrieve Datasets

In [3]:
## These paths are from my Google Drive
## (note that train_path points to train_others.json and train_other_path to train_spider.json)

train_path = '/content/drive/MyDrive/CE888/Spider/train_others.json'
train_other_path = '/content/drive/MyDrive/CE888/Spider/train_spider.json'

train_data = pd.read_json(train_path)
train_other_data = pd.read_json(train_other_path)

### Concatenate Datasets

In [4]:
train_data = pd.concat([train_data, train_other_data], axis=0, ignore_index=True)
train_data.head()

| | db_id | query | query_toks | query_toks_no_value | question | question_toks | sql |
|---|---|---|---|---|---|---|---|
| 0 | geo | SELECT city_name FROM city WHERE population =... | [SELECT, city_name, FROM, city, WHERE, populat... | [select, city_name, from, city, where, populat... | what is the biggest city in wyoming | [what, is, the, biggest, city, in, wyoming] | {'from': {'table_units': [['table_unit', 1]], ... |
| 1 | geo | SELECT city_name FROM city WHERE population =... | [SELECT, city_name, FROM, city, WHERE, populat... | [select, city_name, from, city, where, populat... | what wyoming city has the largest population | [what, wyoming, city, has, the, largest, popul... | {'from': {'table_units': [['table_unit', 1]], ... |
| 2 | geo | SELECT city_name FROM city WHERE population =... | [SELECT, city_name, FROM, city, WHERE, populat... | [select, city_name, from, city, where, populat... | what is the largest city in wyoming | [what, is, the, largest, city, in, wyoming] | {'from': {'table_units': [['table_unit', 1]], ... |
| 3 | geo | SELECT city_name FROM city WHERE population =... | [SELECT, city_name, FROM, city, WHERE, populat... | [select, city_name, from, city, where, populat... | where is the most populated area of wyoming | [where, is, the, most, populated, area, of, wy... | {'from': {'table_units': [['table_unit', 1]], ... |
| 4 | geo | SELECT city_name FROM city WHERE population =... | [SELECT, city_name, FROM, city, WHERE, populat... | [select, city_name, from, city, where, populat... | which city in wyoming has the largest population | [which, city, in, wyoming, has, the, largest, ... | {'from': {'table_units': [['table_unit', 1]], ... |


### Create a List of 'Question' and 'Query' values from the dataset

In [5]:
text_pairs = []
for i in range(len(train_data)):
    question = train_data.loc[i, "question"]
    query = train_data.loc[i, "query"]
    # Wrap each SQL query in start/end markers for the decoder
    query = "[start] " + query + " [end]"
    text_pairs.append((question, query))

### Create Training, Testing & Validation Sets

In [6]:
num_val_samples = int(0.15 * len(text_pairs))
num_train_samples = len(text_pairs) - 2 * num_val_samples
train_pairs = text_pairs[:num_train_samples]
val_pairs = text_pairs[num_train_samples : num_train_samples + num_val_samples]
test_pairs = text_pairs[num_train_samples + num_val_samples :]

In [7]:
print ('Total Training Pairs:', len(train_pairs))
print ('Total Testing Pairs:', len(test_pairs))
print ('Total Validation Pairs:', len(val_pairs))

Total Training Pairs: 6063
Total Testing Pairs: 1298
Total Validation Pairs: 1298
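
The split above slices the pairs in file order, so the validation and test sets come entirely from the tail of the concatenated data. The following is a minimal sketch of the same 70/15/15 arithmetic with an optional seeded shuffle applied first. The helper `split_pairs`, its `seed` parameter, and the toy pairs are hypothetical and not part of the notebook; the reported scores were obtained without shuffling.

```python
import random

def split_pairs(pairs, holdout_fraction=0.15, seed=None):
    """Split pairs into train/val/test using the notebook's arithmetic.

    `seed` is a hypothetical addition: when given, a copy of the pairs is
    shuffled before slicing; when None, file order is kept as in the notebook.
    """
    pairs = list(pairs)
    if seed is not None:
        random.Random(seed).shuffle(pairs)
    num_val = int(holdout_fraction * len(pairs))
    num_train = len(pairs) - 2 * num_val
    train = pairs[:num_train]
    val = pairs[num_train:num_train + num_val]
    test = pairs[num_train + num_val:]
    return train, val, test

# Toy stand-in for text_pairs with the same number of examples as above.
toy_pairs = [(f"question {i}", f"[start] query {i} [end]") for i in range(8659)]
train, val, test = split_pairs(toy_pairs, seed=42)
print(len(train), len(val), len(test))
```

Shuffling before slicing is the usual way to keep all three sets drawn from the same mix of databases, but it would change every score reported in this notebook.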


### Perform Data Preprocessing: Case Folding and Text Vectorization

In [8]:
vocab_size = 15000
sequence_length = 64
batch_size = 128

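def custom_standardization(input_string):
    # Lowercase only, so that SQL punctuation and the [start]/[end] markers are preserved
    lowercase = tf.strings.lower(input_string)
    return lowercase

# Questions use the default standardization (lowercasing and punctuation stripping)
question_vectorization = TextVectorization(
    max_tokens=vocab_size, output_mode="int", output_sequence_length=sequence_length)
# Queries get one extra position because they are later split into decoder inputs and targets offset by one token
query_vectorization = TextVectorization(
    max_tokens=vocab_size, output_mode="int", output_sequence_length=sequence_length + 1,
    standardize=custom_standardization)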
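
To see why the query vectorizer uses a lowercase-only standardization, the sketch below approximates both standardization modes in plain Python. It assumes that the characters stripped by the Keras default (`lower_and_strip_punctuation`) match `string.punctuation`; the two helper functions and the example query are hypothetical illustrations, not the Keras implementation.

```python
import string

def lower_and_strip_punctuation(text):
    """Plain-Python approximation of TextVectorization's default standardization."""
    return "".join(ch for ch in text.lower() if ch not in string.punctuation)

def lowercase_only(text):
    """Plain-Python counterpart of custom_standardization above."""
    return text.lower()

# Hypothetical query in the same "[start] ... [end]" format as the training pairs.
query = "[start] SELECT T1.name FROM city AS T1 WHERE population > 100 [end]"

# Default mode: the brackets, the dot in "t1.name", and the ">" operator are lost.
print(lower_and_strip_punctuation(query).split())

# Lowercase-only mode: the SQL tokens and the start/end markers survive intact.
print(lowercase_only(query).split())
```

Under the default mode the `[end]` marker would become `end`, so the stopping test in the decoding loop could never match.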


In [9]:
train_question_texts = [pair[0] for pair in train_pairs]
train_query_texts = [pair[1] for pair in train_pairs]
question_vectorization.adapt(train_question_texts)
query_vectorization.adapt(train_query_texts)

In [10]:
def format_dataset(question, query):
    question = question_vectorization(question)
    query = query_vectorization(query)
    return ({"encoder_inputs": question, "decoder_inputs": query[:, :-1],}, query[:, 1:])

def make_dataset(pairs):
    question_texts, query_texts = zip(*pairs)
    question_texts = list(question_texts)
    query_texts = list(query_texts)
    dataset = tf.data.Dataset.from_tensor_slices((question_texts, query_texts))
    dataset = dataset.batch(batch_size)
    dataset = dataset.map(format_dataset)
    return dataset.shuffle(2048).prefetch(16).cache()

In [11]:
train_ds = make_dataset(train_pairs)
test_ds = make_dataset(test_pairs)
val_ds = make_dataset(val_pairs)

In [12]:
for inputs, targets in train_ds.take(1):
    print(f'inputs["encoder_inputs"].shape: {inputs["encoder_inputs"].shape}')
    print(f'inputs["decoder_inputs"].shape: {inputs["decoder_inputs"].shape}')
    print(f"targets.shape: {targets.shape}")

inputs["encoder_inputs"].shape: (128, 64)
inputs["decoder_inputs"].shape: (128, 64)
targets.shape: (128, 64)
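
Queries are vectorized to `sequence_length + 1 = 65` positions and then sliced into two views of 64 positions each, which is why both shapes above end in 64. A minimal sketch of that one-token offset on a short token list (the helper name and the example tokens are hypothetical):

```python
def shift_for_teacher_forcing(tokens):
    """Return (decoder_inputs, targets), offset by one position.

    Mirrors query[:, :-1] and query[:, 1:] in format_dataset, but for a
    single unbatched Python list.
    """
    decoder_inputs = tokens[:-1]   # everything except the last token
    targets = tokens[1:]           # everything except the first token
    return decoder_inputs, targets

tokens = ["[start]", "select", "count(*)", "from", "city", "[end]"]
decoder_inputs, targets = shift_for_teacher_forcing(tokens)

# At each position the decoder sees the token on the left and must predict the one on the right.
for seen, expected in zip(decoder_inputs, targets):
    print(f"{seen:>10} -> {expected}")
```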


### Setting Parameters for the Embeddings

In [13]:
embed_dim = 256
latent_dim = 1024

### Model 1: Single Layer Bidirectional GRU

In [14]:
source = keras.Input(shape=(None,), dtype="int64", name="encoder_inputs")
x = layers.Embedding(vocab_size, embed_dim, mask_zero=True)(source)
encoded_source = layers.Bidirectional(layers.GRU(latent_dim), merge_mode="sum")(x)

In [15]:
past_target = keras.Input(shape=(None,), dtype="int64", name="decoder_inputs")
x = layers.Embedding(vocab_size, embed_dim, mask_zero=True)(past_target)
decoder_gru = layers.GRU(latent_dim, return_sequences=True)
x = decoder_gru(x, initial_state=encoded_source)
x = layers.Dropout(0.5)(x)
target_next_step = layers.Dense(vocab_size, activation="softmax")(x)

In [16]:
seq2seq_01 = keras.Model([source, past_target], target_next_step)

### Setup Callbacks

In [17]:
callbacks = [EarlyStopping(monitor='val_accuracy', patience=1),
             ModelCheckpoint("seq2seq_01.keras", save_best_only=True, monitor="val_accuracy", mode='max')]
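
With `patience=1`, training stops at the first epoch whose `val_accuracy` fails to improve on the best value seen so far. The sketch below is a simplified plain-Python model of that rule, not the Keras implementation; the function name and the accuracy values are hypothetical and are not taken from the training logs.

```python
def stopping_epoch(val_accuracies, patience=1):
    """Return the 1-based epoch at which training would stop, or None.

    Simplified model of early stopping in "max" mode: an epoch counts as
    an improvement only if it strictly beats the best value so far.
    """
    best = float("-inf")
    wait = 0
    for epoch, accuracy in enumerate(val_accuracies, start=1):
        if accuracy > best:
            best = accuracy
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

# Hypothetical validation accuracies, one per epoch.
history = [0.50, 0.55, 0.58, 0.60, 0.61, 0.60, 0.63]
print(stopping_epoch(history, patience=1))
print(stopping_epoch(history, patience=2))
```

A single noisy epoch is enough to end training with `patience=1`, so a larger patience may be worth trying when GPU time allows.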


### Compile and Run the model

In [18]:
seq2seq_01.compile(
    optimizer="rmsprop",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"])

seq2seq_01.fit(train_ds, epochs=30, validation_data=val_ds, callbacks=callbacks)

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30


<keras.callbacks.History at 0x7f113ef1d550>

### Inference/Decode the Input for making predictions

In [19]:
query_vocab = query_vectorization.get_vocabulary()
query_index_lookup = dict(zip(range(len(query_vocab)), query_vocab))
# Decoding stops after 20 generated tokens, so longer queries are cut off before "[end]"
max_decoded_sentence_length = 20

In [20]:
def decode_sequence_01(input_sentence):
  # Vectorize the question once; the decoder input grows by one token per step
  tokenized_input_sentence = question_vectorization([input_sentence])
  decoded_sentence = "[start]"
  for i in range(max_decoded_sentence_length):
    tokenized_target_sentence = query_vectorization([decoded_sentence])
    next_token_predictions = seq2seq_01.predict([tokenized_input_sentence, tokenized_target_sentence])
    # Greedy decoding: take the most probable token at position i
    sampled_token_index = np.argmax(next_token_predictions[0, i, :])

    sampled_token = query_index_lookup[sampled_token_index]
    decoded_sentence += " " + sampled_token
    # Stop as soon as the end marker is produced
    if sampled_token == "[end]":
      break
  return decoded_sentence
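
The control flow of the decoding loop can be studied without a trained network. The sketch below keeps the same structure (start marker, one token per step, early exit on the end marker, hard cap on the number of steps) and replaces the model with a canned predictor. `greedy_decode`, `toy_next_token`, and the canned tokens are hypothetical stand-ins, not the notebook's model.

```python
def greedy_decode(next_token_fn, max_steps=20, start="[start]", end="[end]"):
    """Append one predicted token per step until `end` appears or max_steps is reached."""
    tokens = [start]
    for _ in range(max_steps):
        token = next_token_fn(tokens)
        tokens.append(token)
        if token == end:
            break
    return " ".join(tokens)

# Hypothetical stand-in for the trained network: replays a fixed query.
canned = ["select", "count(*)", "from", "city", "[end]"]

def toy_next_token(tokens):
    # tokens already holds "[start]", so the next canned token sits at len(tokens) - 1
    return canned[len(tokens) - 1]

print(greedy_decode(toy_next_token))
print(greedy_decode(toy_next_token, max_steps=3))
```

The second call shows the effect of a step cap that is too small: the output ends without `[end]`, as in several of the sample predictions in this notebook.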

### Compute BLEU Score on Test Dataset

In [22]:
actual, predicted = list(), list()
question_texts = [pair[0] for pair in test_pairs]
query_texts = [pair[1] for pair in test_pairs]
for i in range(len(question_texts)):
  input_sentence = question_texts[i]
  output_sentence = query_texts[i]
  y_pred= decode_sequence_01(input_sentence)
  actual.append(output_sentence)
  predicted.append(y_pred)
    
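# calculate BLEU score
# (corpus_bleu receives raw strings here, so NLTK iterates over characters rather than word tokens)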
print("BLEU-1 Score on Test Dataset: %f" % corpus_bleu(actual, predicted, weights=(1.0, 0, 0, 0)))

BLEU-1 Score on Test Dataset: 0.145068


Corpus/Sentence contains 0 counts of 2-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().
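
NLTK documents `corpus_bleu` as taking a list of reference lists (each reference a list of tokens) and a list of tokenized hypotheses. The cell above passes plain strings, so the comparison runs over characters, which is consistent with the warning about zero 2-gram overlaps. The following is a minimal plain-Python sketch of token-level BLEU-1 (clipped unigram precision times the brevity penalty) for one reference per hypothesis. `bleu1_corpus` and `tokenize` are hypothetical helpers, lowercasing the references is an assumption made so that they are comparable with the lowercase predictions, and scores computed this way would differ from the ones reported in this notebook.

```python
import math
from collections import Counter

def bleu1_corpus(references, hypotheses):
    """Corpus-level BLEU-1 with a single reference per hypothesis.

    Both arguments are lists of token lists of equal length.
    """
    clipped, hyp_len, ref_len = 0, 0, 0
    for ref, hyp in zip(references, hypotheses):
        ref_counts, hyp_counts = Counter(ref), Counter(hyp)
        # Clip each hypothesis token count by its count in the reference
        clipped += sum(min(count, ref_counts[tok]) for tok, count in hyp_counts.items())
        hyp_len += len(hyp)
        ref_len += len(ref)
    if hyp_len == 0 or clipped == 0:
        return 0.0
    # Brevity penalty: penalise hypotheses that are not longer than the references
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * clipped / hyp_len

def tokenize(sentence):
    # Assumption: lowercase and split on whitespace, like the query vectorizer
    return sentence.lower().split()

# Hypothetical reference/prediction pair, not taken from the dataset.
reference_texts = ["[start] SELECT count(*) FROM city [end]"]
predicted_texts = ["[start] select count(*) from state [end]"]
references = [tokenize(s) for s in reference_texts]
hypotheses = [tokenize(s) for s in predicted_texts]
print(bleu1_corpus(references, hypotheses))
```

With NLTK itself, the equivalent call on the same token lists would be `corpus_bleu([[ref] for ref in references], hypotheses, weights=(1.0, 0, 0, 0))`.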


### Some Predictions simulated on Test set

In [23]:
test_questions_texts = [pair[0] for pair in test_pairs]
for _ in range(5):
  input_sentence = random.choice(test_questions_texts)
  print("-")
  print(input_sentence)
  print(decode_sequence_01(input_sentence))

-
Find the name, type, and flag of the ship that is built in the most recent year.
[start] select t1.name from category as t2 join business as t1 on t2.business_id = t1.business_id join student as t3 on t3.business_id
-
Find the phone number of all the customers and staff.
[start] select t1.name from category as t2 join business as t1 on t2.business_id = t1.business_id join student as t3 on t3.business_id
-
Which three cities have the largest regional population?
[start] select state_name from state where area = ( select max ( area ) from state ); [end]
-
Who made the latest order?
[start] select t1.name from category as t2 join business as t1 on t2.business_id = t1.business_id join student as t3 on t3.business_id
-
List the cities which have more than 2 airports sorted by the number of airports.
[start] select t1.name from category as t2 join business as t1 on t2.business_id = t1.business_id join student as t3 on t3.business_id


### Compute BLEU Score on Validation Dataset

In [24]:
actual, predicted = list(), list()
question_texts = [pair[0] for pair in val_pairs]
query_texts = [pair[1] for pair in val_pairs]
for i in range(len(question_texts)):
  input_sentence = question_texts[i]
  output_sentence = query_texts[i]
  y_pred= decode_sequence_01(input_sentence)
  actual.append(output_sentence)
  predicted.append(y_pred)
    
# calculate BLEU score
print("BLEU-1 Score on Validation Dataset: %f" % corpus_bleu(actual, predicted, weights=(1.0, 0, 0, 0)))

BLEU-1 Score on Validation Dataset: 0.141787


Corpus/Sentence contains 0 counts of 2-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().


### Model 2: Single Layer Bidirectional LSTM

In [33]:
source = keras.Input(shape=(None,), dtype="int64", name="encoder_inputs")
x = layers.Embedding(vocab_size, embed_dim, mask_zero=True)(source)
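# Bidirectional LSTM encoder; return_state=True also returns the final hidden (h) and cell (c) states of both directions
out_encoder, state_h_forward, state_c_forward, state_h_backward, state_c_backward = layers.Bidirectional(layers.LSTM(latent_dim, return_sequences=True, return_state=True), merge_mode="sum")(x)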


In [35]:
past_target = keras.Input(shape=(None,), dtype="int64", name="decoder_inputs")
x = layers.Embedding(vocab_size, embed_dim, mask_zero=True)(past_target)
decoder_LSTM = layers.LSTM(latent_dim, return_sequences=True)

# Only the forward-direction states initialise the decoder; the backward states are not used
encoder_state = [state_h_forward, state_c_forward]

x = decoder_LSTM(x, initial_state=encoder_state)
x = layers.Dropout(0.5)(x)
target_next_step = layers.Dense(vocab_size, activation="softmax")(x)

In [37]:
seq2seq_02 = keras.Model([source, past_target], target_next_step)

In [36]:
callbacks = [EarlyStopping(monitor='val_accuracy', patience=1),
             ModelCheckpoint("seq2seq_02.keras", save_best_only=True, monitor="val_accuracy", mode='max')]


In [38]:
seq2seq_02.compile(
    optimizer="rmsprop",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"])

seq2seq_02.fit(train_ds, epochs=30, validation_data=val_ds, callbacks=callbacks)

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30


<keras.callbacks.History at 0x7f10955b0990>

In [39]:
query_vocab = query_vectorization.get_vocabulary()
query_index_lookup = dict(zip(range(len(query_vocab)), query_vocab))
max_decoded_sentence_length = 20

In [40]:
def decode_sequence_02(input_sentence):
  tokenized_input_sentence = question_vectorization([input_sentence])
  decoded_sentence = "[start]"
  for i in range(max_decoded_sentence_length):
    tokenized_target_sentence = query_vectorization([decoded_sentence])
    next_token_predictions = seq2seq_02.predict([tokenized_input_sentence, tokenized_target_sentence])
    sampled_token_index = np.argmax(next_token_predictions[0, i, :])

    sampled_token = query_index_lookup[sampled_token_index]
    decoded_sentence += " " + sampled_token
    if sampled_token == "[end]":
      break
  return decoded_sentence

In [41]:
actual, predicted = list(), list()
question_texts = [pair[0] for pair in test_pairs]
query_texts = [pair[1] for pair in test_pairs]
for i in range(len(question_texts)):
  input_sentence = question_texts[i]
  output_sentence = query_texts[i]
  y_pred= decode_sequence_02(input_sentence)
  actual.append(output_sentence)
  predicted.append(y_pred)
    
# calculate BLEU score
print("BLEU-1 Score on Test Dataset: %f" % corpus_bleu(actual, predicted, weights=(1.0, 0, 0, 0)))

BLEU-1 Score on Test Dataset: 0.154295


Corpus/Sentence contains 0 counts of 2-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().


In [42]:
test_questions_texts = [pair[0] for pair in test_pairs]
for _ in range(5):
  input_sentence = random.choice(test_questions_texts)
  print("-")
  print(input_sentence)
  print(decode_sequence_02(input_sentence))

-
How many airports haven't the pilot 'Thompson' driven an aircraft?
[start] select distinct ( ( ) from paperkeyphrase as t2 join keyphrase as t1 on t2.authorid = t1.authorid where t1.authorname =
-
Show the average price of hotels for different pet policy.
[start] select t1.name from publication as t1 join author as t2 on t1.id = t2.customer_id where t1.name = "san and t3.name
-
Return complaint status codes have more than 3 corresponding complaints?
[start] select distinct ( ( ) from writes as t2 join author as t1 on t2.authorid = t1.authorid where t1.authorname =
-
Return the description of the product called "Chocolate".
[start] select distinct ( ( ) from paperkeyphrase as t2 join keyphrase as t1 on t2.authorid = t1.authorid where t1.authorname =
-
What are the country names, area and population which has both roller coasters with speed higher
[start] select t1.name from publication as t1 join author as t2 on t1.id = t2.customer_id where t1.name = "san and t3.name


In [43]:
actual, predicted = list(), list()
question_texts = [pair[0] for pair in val_pairs]
query_texts = [pair[1] for pair in val_pairs]
for i in range(len(question_texts)):
  input_sentence = question_texts[i]
  output_sentence = query_texts[i]
  y_pred= decode_sequence_02(input_sentence)
  actual.append(output_sentence)
  predicted.append(y_pred)
    
# calculate BLEU score
print("BLEU-1 Score on Validation Dataset: %f" % corpus_bleu(actual, predicted, weights=(1.0, 0, 0, 0)))

BLEU-1 Score on Validation Dataset: 0.150324


Corpus/Sentence contains 0 counts of 2-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().


### Model 3: Two Layer Bidirectional LSTM

In [44]:
source = keras.Input(shape=(None,), dtype="int64", name="encoder_inputs")
x = layers.Embedding(vocab_size, embed_dim, mask_zero=True)(source)
# Two stacked BiLSTM layers: the first returns its full output sequence, which feeds the second
out_encoder1, state_h_forward1, state_c_forward1, state_h_backward1, state_c_backward1 = layers.Bidirectional(layers.LSTM(latent_dim, return_sequences=True, return_state=True), merge_mode="sum")(x)
out_encoder2, state_h_forward2, state_c_forward2, state_h_backward2, state_c_backward2 = layers.Bidirectional(layers.LSTM(latent_dim, return_sequences=True, return_state=True), merge_mode="sum")(out_encoder1)


In [45]:
past_target = keras.Input(shape=(None,), dtype="int64", name="decoder_inputs")
x = layers.Embedding(vocab_size, embed_dim, mask_zero=True)(past_target)
decoder_LSTM = layers.LSTM(latent_dim, return_sequences=True)

encoder_state=[state_h_forward2, state_c_forward2]

x = decoder_LSTM(x, initial_state=encoder_state)
x = layers.Dropout(0.5)(x)
target_next_step = layers.Dense(vocab_size, activation="softmax")(x)

In [46]:
seq2seq_03 = keras.Model([source, past_target], target_next_step)

In [47]:
callbacks = [EarlyStopping(monitor='val_accuracy', patience=1),
             ModelCheckpoint("seq2seq_03.keras", save_best_only=True, monitor="val_accuracy", mode='max')]


In [48]:
seq2seq_03.compile(
    optimizer="rmsprop",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"])

seq2seq_03.fit(train_ds, epochs=30, validation_data=val_ds, callbacks=callbacks)

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30


<keras.callbacks.History at 0x7f1091406fd0>

In [49]:
query_vocab = query_vectorization.get_vocabulary()
query_index_lookup = dict(zip(range(len(query_vocab)), query_vocab))
max_decoded_sentence_length = 20

In [51]:
def decode_sequence_03(input_sentence):
  tokenized_input_sentence = question_vectorization([input_sentence])
  decoded_sentence = "[start]"
  for i in range(max_decoded_sentence_length):
    tokenized_target_sentence = query_vectorization([decoded_sentence])
    next_token_predictions = seq2seq_03.predict([tokenized_input_sentence, tokenized_target_sentence])
    sampled_token_index = np.argmax(next_token_predictions[0, i, :])

    sampled_token = query_index_lookup[sampled_token_index]
    decoded_sentence += " " + sampled_token
    if sampled_token == "[end]":
      break
  return decoded_sentence

In [52]:
actual, predicted = list(), list()
question_texts = [pair[0] for pair in test_pairs]
query_texts = [pair[1] for pair in test_pairs]
for i in range(len(question_texts)):
  input_sentence = question_texts[i]
  output_sentence = query_texts[i]
  y_pred= decode_sequence_03(input_sentence)
  actual.append(output_sentence)
  predicted.append(y_pred)
    
# calculate BLEU score
print("BLEU-1 Score on Test Dataset: %f" % corpus_bleu(actual, predicted, weights=(1.0, 0, 0, 0)))

BLEU-1 Score on Test Dataset: 0.165978


Corpus/Sentence contains 0 counts of 2-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().


In [53]:
test_questions_texts = [pair[0] for pair in test_pairs]
for _ in range(5):
  input_sentence = random.choice(test_questions_texts)
  print("-")
  print(input_sentence)
  print(decode_sequence_03(input_sentence))

-
When did Carole Bernhard first become a customer?
[start] select distinct ( ( ) ) from paperkeyphrase as t2 join author as t1 on t2.authorid = t1.authorid join writes
-
Find the number of investors in total.
[start] select distinct ( ( ) ) from paperkeyphrase as t2 join author as t1 on t2.authorid = t1.authorid join writes
-
What are the names and descriptions of the photos taken at the tourist attraction "film festival"?
[start] select count(*) from state where name = (select select = ( and state = ( select max ( ( )
-
List ids and details for all projects.
[start] select distinct ( ( ) ) from paperkeyphrase as t2 join author as t1 on t2.authorid = t1.authorid join writes
-
How many activities does Mark Giuliano participate in?
[start] select distinct ( ( ) ) from paperkeyphrase as t2 join author as t1 on t2.authorid = t1.authorid join writes


In [54]:
actual, predicted = list(), list()
question_texts = [pair[0] for pair in val_pairs]
query_texts = [pair[1] for pair in val_pairs]
for i in range(len(question_texts)):
  input_sentence = question_texts[i]
  output_sentence = query_texts[i]
  y_pred= decode_sequence_03(input_sentence)
  actual.append(output_sentence)
  predicted.append(y_pred)
    
# calculate BLEU score
print("BLEU-1 Score on Validation Dataset: %f" % corpus_bleu(actual, predicted, weights=(1.0, 0, 0, 0)))

BLEU-1 Score on Validation Dataset: 0.161283


Corpus/Sentence contains 0 counts of 2-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().


### Model 4: Three Layer Bidirectional LSTM

In [55]:
source = keras.Input(shape=(None,), dtype="int64", name="encoder_inputs")
x = layers.Embedding(vocab_size, embed_dim, mask_zero=True)(source)
# Three stacked BiLSTM layers: each returns its full output sequence, which feeds the next layer
out_encoder1, state_h_forward1, state_c_forward1, state_h_backward1, state_c_backward1 = layers.Bidirectional(layers.LSTM(latent_dim, return_sequences=True, return_state=True), merge_mode="sum")(x)
out_encoder2, state_h_forward2, state_c_forward2, state_h_backward2, state_c_backward2 = layers.Bidirectional(layers.LSTM(latent_dim, return_sequences=True, return_state=True), merge_mode="sum")(out_encoder1)
out_encoder3, state_h_forward3, state_c_forward3, state_h_backward3, state_c_backward3 = layers.Bidirectional(layers.LSTM(latent_dim, return_sequences=True, return_state=True), merge_mode="sum")(out_encoder2)


In [56]:
past_target = keras.Input(shape=(None,), dtype="int64", name="decoder_inputs")
x = layers.Embedding(vocab_size, embed_dim, mask_zero=True)(past_target)
decoder_LSTM = layers.LSTM(latent_dim, return_sequences=True)

encoder_state=[state_h_forward3, state_c_forward3]

x = decoder_LSTM(x, initial_state=encoder_state)
x = layers.Dropout(0.5)(x)
target_next_step = layers.Dense(vocab_size, activation="softmax")(x)

In [57]:
seq2seq_04 = keras.Model([source, past_target], target_next_step)

In [58]:
callbacks = [EarlyStopping(monitor='val_accuracy', patience=1),
             ModelCheckpoint("seq2seq_04.keras", save_best_only=True, monitor="val_accuracy", mode='max')]


In [59]:
seq2seq_04.compile(
    optimizer="rmsprop",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"])

seq2seq_04.fit(train_ds, epochs=30, validation_data=val_ds, callbacks=callbacks)

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30


<keras.callbacks.History at 0x7f108a9b7650>

In [60]:
query_vocab = query_vectorization.get_vocabulary()
query_index_lookup = dict(zip(range(len(query_vocab)), query_vocab))
max_decoded_sentence_length = 20

In [61]:
def decode_sequence_04(input_sentence):
  tokenized_input_sentence = question_vectorization([input_sentence])
  decoded_sentence = "[start]"
  for i in range(max_decoded_sentence_length):
    tokenized_target_sentence = query_vectorization([decoded_sentence])
    next_token_predictions = seq2seq_04.predict([tokenized_input_sentence, tokenized_target_sentence])
    sampled_token_index = np.argmax(next_token_predictions[0, i, :])

    sampled_token = query_index_lookup[sampled_token_index]
    decoded_sentence += " " + sampled_token
    if sampled_token == "[end]":
      break
  return decoded_sentence

In [62]:
actual, predicted = list(), list()
question_texts = [pair[0] for pair in test_pairs]
query_texts = [pair[1] for pair in test_pairs]
for i in range(len(question_texts)):
  input_sentence = question_texts[i]
  output_sentence = query_texts[i]
  y_pred= decode_sequence_04(input_sentence)
  actual.append(output_sentence)
  predicted.append(y_pred)
    
# calculate BLEU score
print("BLEU-1 Score on Test Dataset: %f" % corpus_bleu(actual, predicted, weights=(1.0, 0, 0, 0)))

BLEU-1 Score on Test Dataset: 0.153336


Corpus/Sentence contains 0 counts of 2-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().


In [63]:
test_questions_texts = [pair[0] for pair in test_pairs]
for _ in range(5):
  input_sentence = random.choice(test_questions_texts)
  print("-")
  print(input_sentence)
  print(decode_sequence_04(input_sentence))

-
What are the distinct names of wines with prices higher than any wine from John Anthony winery.
[start] select name from customer where name = (select select ( ( ) from state where state_name = ( select max
-
Show the movie titles and book titles for all companies in China.
[start] select name from customer where name = (select select ( ( ) from state where state_name = ( select max
-
Find the average elevation of all airports for each country.
[start] select name from customer as t1 join author as t1 on t1.id = t2.customer_id where t1.name = "san and t2.year
-
What are the names of the services that have never been used?
[start] select name from state where state_name = ( select max ( area ) from state ); [end]
-
Show the budget type code and description and the corresponding document id.
[start] select name from customer where name = (select select ( ( ) from state where state_name = ( select max


### Note: my GPU hours were exhausted before I could execute the cell below, so no validation BLEU score is available for Model 4

In [None]:
actual, predicted = list(), list()
question_texts = [pair[0] for pair in val_pairs]
query_texts = [pair[1] for pair in val_pairs]
for i in range(len(question_texts)):
  input_sentence = question_texts[i]
  output_sentence = query_texts[i]
  y_pred= decode_sequence_04(input_sentence)
  actual.append(output_sentence)
  predicted.append(y_pred)
    
# calculate BLEU score
print("BLEU-1 Score on Validation Dataset: %f" % corpus_bleu(actual, predicted, weights=(1.0, 0, 0, 0)))