# Spelling Bee

This notebook starts our deep dive (no pun intended) into NLP by introducing sequence-to-sequence learning on Spelling Bee.

## Data Stuff

We take our data set from the [CMU Pronouncing Dictionary](https://en.wikipedia.org/wiki/CMU_Pronouncing_Dictionary).

[Dataset Download Page](http://svn.code.sf.net/p/cmusphinx/code/trunk/cmudict/)

In [7]:
%matplotlib inline
from importlib import reload
import utils2; reload(utils2)
from utils2 import *
np.set_printoptions(precision=4)
PATH = 'data/spellbee/'

In [8]:
limit_mem()

In [9]:
from sklearn.model_selection import train_test_split

The CMU Pronouncing Dictionary maps words to their phonetic descriptions (American pronunciation).

Each phonetic description is a sequence of phonemes. Note that the vowels end with a digit that marks stress: 0 for no stress, 1 for primary stress, and 2 for secondary stress.

Our goal is to learn how to spell these words given the sequence of phonemes.

The preparation of this data set follows the same pattern we've seen before for NLP tasks.

Here we iterate through each line of the file and keep each word/phoneme pair whose line starts with an uppercase letter, which skips comment lines and entries that begin with punctuation.

In [10]:
fname = get_file('cmudict-0.7b', 
                 origin='http://svn.code.sf.net/p/cmusphinx/code/trunk/cmudict/cmudict-0.7b',
                cache_subdir=os.path.join(os.getcwd(),PATH))

In [17]:
lines = [l.strip().split("  ") for l in open(fname, encoding='latin1') if re.match("^[A-Z]", l)]
lines = [(w, ps.split()) for w, ps in lines]
lines[0], lines[-1]

(('A', ['AH0']), ('ZYWICKI', ['Z', 'IH0', 'W', 'IH1', 'K', 'IY0']))

In [18]:
len(lines)

133779
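
To make the filtering step concrete, here is a minimal sketch that runs the same two comprehensions on a few in-memory lines instead of the downloaded file. The comment line and the punctuation entry are illustrative stand-ins written in the style of the dictionary file (assumptions, not copied from it); the `A` and `ZYWICKI` entries are the ones shown in the output above.

```python
import re

# Sample lines: the first two are illustrative stand-ins (assumptions) for
# comment and punctuation entries; the last two match the output shown above.
sample_lines = [
    ";;; a comment line\n",
    "!EXCLAMATION-POINT  EH2 K S K L AH0 M EY1 SH AH0 N\n",
    "A  AH0\n",
    "ZYWICKI  Z IH0 W IH1 K IY0\n",
]

# Keep only lines that begin with an uppercase letter, then split the word
# from its pronunciation on the two-space separator.
parsed = [l.strip().split("  ") for l in sample_lines if re.match("^[A-Z]", l)]
# Split each pronunciation string into a list of phonemes.
parsed = [(w, ps.split()) for w, ps in parsed]
print(parsed)
```

Only the two entries that start with a letter survive the filter.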

Next we're going to get a list of the unique phonemes in our vocabulary, as well as add a null "_" for zero-padding.

In [23]:
phonemes = ["_"] + sorted(set(p for w,ps in lines for p in ps))
phonemes[:5]

['_', 'AA0', 'AA1', 'AA2', 'AE0']

In [24]:
len(phonemes)

70

Then we create mappings of phonemes and letters to respective indices.

Our letters include the padding element "_" and also "*", which serves as the go token that starts each decoder input sequence.

In [26]:
p2i = {w:idx for idx, w in enumerate(phonemes)}
letters = "_abcdefghijklmnopqrstuvwxyz*"
l2i = {l:idx for idx, l in enumerate(letters)}
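
As a small illustration of how these mappings get used, the sketch below turns one word into a fixed-length row of indices and back again. `encode_word` and `decode_indices` are hypothetical helpers introduced only for this example, and the row length of 15 is an assumption made here.

```python
import numpy as np

letters = "_abcdefghijklmnopqrstuvwxyz*"
l2i = {l: idx for idx, l in enumerate(letters)}

def encode_word(word, length):
    """Hypothetical helper: map letters to indices and right-pad with 0 ("_")."""
    row = np.zeros(length, np.int32)
    for j, letter in enumerate(word):
        row[j] = l2i[letter]
    return row

def decode_indices(row):
    """Hypothetical helper: map indices back to letters and drop the padding."""
    return "".join(letters[i] for i in row).rstrip("_")

row = encode_word("zywicki", 15)
print(row)
print(decode_indices(row))
```

Because "_" sits at index 0, a zero-initialized array is already fully padded, and only the real letters need to be written in.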

Let's create a dictionary mapping each word to the sequence of indices corresponding to its phonemes, and let's do it only for words between 5 and 15 characters long.

In [35]:
maxlen = 15
pronounce_dict = {w.lower(): [p2i[p] for p in ps]    for w, ps in lines 
 if 5<=len(w)<=maxlen and re.match("^[A-Z]+$", w)}
print(len(pronounce_dict))

108006


Aside on various forms of Python's list comprehensions:
* the first is a typical list comprehension with a conditional
* the second is a list comprehension inside a list comprehension, which returns a list of lists
* the third is similar to the second, but reads and behaves like a nested loop
    * Since there is no inner bracket, no lists wrap the inner loop

In [33]:
a=['xyz','abc']
[o.upper() for o in a if o[0]=='x'], [[p for p in o] for o in a], [p for o in a for p in o]

(['XYZ'], [['x', 'y', 'z'], ['a', 'b', 'c']], ['x', 'y', 'z', 'a', 'b', 'c'])

Next we convert the words and phonemes to index arrays (with zero-padding) and split them into training and test sets. We also find the maximum phoneme sequence length, which sets the padding width for the inputs.

In [40]:
maxlen_p = max([len(v) for k,v in pronounce_dict.items()]); print(maxlen_p)

16


In [51]:
pairs = np.random.permutation(list(pronounce_dict.keys()))
n = len(pairs) # len(pronounce_dict)
input_ = np.zeros((n, maxlen_p), np.int32)
labels_ = np.zeros((n, maxlen), np.int32)

for i, k in enumerate(pairs):
    # input_[i] <= pronounce_dict[k]
    for j, pi in enumerate(pronounce_dict[k]): input_[i][j] = pi
    #labels_[i] <= l2i[k]
    for j, letter in enumerate(k): labels_[i][j] = l2i[letter]

In [52]:
go_token = l2i["*"]
dec_input_ = np.concatenate([np.ones((n,1)) * go_token, labels_[:,:-1]], axis=1)
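
The decoder input is simply the label sequence shifted right by one position, with the go token "*" placed in front. The sketch below shows the shift on two toy rows; the words and the row length are made up for illustration, and `np.full` is used here (instead of `np.ones(...) * go_token`) only to keep an integer dtype.

```python
import numpy as np

go_token = 27  # index of "*" in letters = "_abcdefghijklmnopqrstuvwxyz*"

# Two toy label rows, already padded with 0 ("_") to length 5: "cab" and "be".
labels = np.array([[3, 1, 2, 0, 0],
                   [2, 5, 0, 0, 0]], np.int32)

# One go token per row, then every label except the last one.
go_column = np.full((len(labels), 1), go_token, np.int32)
dec_input = np.concatenate([go_column, labels[:, :-1]], axis=1)
print(dec_input)
```

At step t the decoder input holds the correct letter from step t-1, which is the usual setup for teacher forcing.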

Sklearn's <tt>train_test_split</tt> is an easy way to split data into training and testing sets.

In [57]:
(input_train, input_test, labels_train, labels_test, dec_input_train, dec_input_test
    ) = train_test_split(input_, labels_, dec_input_, test_size=0.1)

In [61]:
input_train.shape, labels_train.shape

((97205, 16), (97205, 15))
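
For intuition, the split amounts to shuffling the row indices once and cutting every array at the same point, so inputs and labels stay aligned. The sketch below is a NumPy approximation, not scikit-learn's implementation; `split_together` is a hypothetical helper, and rounding the test count up is an assumption that happens to reproduce the 97205 training rows shown above.

```python
import math
import numpy as np

def split_together(arrays, test_size, seed=0):
    """Hypothetical helper: shuffle once, then cut every array at the same point."""
    n = len(arrays[0])
    n_test = math.ceil(test_size * n)  # assumption: the test count is rounded up
    perm = np.random.default_rng(seed).permutation(n)
    train_idx, test_idx = perm[n_test:], perm[:n_test]
    return [(a[train_idx], a[test_idx]) for a in arrays]

# Toy arrays with the same number of rows as pronounce_dict (108006).
x = np.arange(108006)
y = x * 2
(x_train, x_test), (y_train, y_test) = split_together([x, y], test_size=0.1)
print(x_train.shape, x_test.shape)
```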

In [64]:
input_vocab_size, output_vocab_size = len(phonemes), len(letters)
input_vocab_size, output_vocab_size

(70, 28)

Next we proceed to build our model.

## Keras code

### Without attention

The model has three parts:
* We first pass a list of phonemes through an embedding function to get a list of phoneme embeddings. Our goal is to turn this sequence of embeddings into a single distributed representation that captures what our phonemes say.
* Turning a sequence into a representation can be done using an RNN. This approach is useful because RNNs keep track of state and memory, which is important for forming a complete understanding of a pronunciation.
    * <tt>Bidirectional</tt> passes the original sequence through an RNN and the reversed sequence through a different RNN, then concatenates the results. This allows us to look forward and backwards.
    * We do this because in language, things that come later often influence what came before (e.g. in Spanish, "el chico, la chica" means "the boy, the girl"; the word for "the" is determined by the gender of the noun, which comes after).
* Finally, we arrive at a vector representation of the sequence which captures everything we need to spell it. We feed this vector into more RNNs, which try to generate the labels. After this, we classify each letter in the output sequence.
    * We use <tt>RepeatVector</tt> to help our RNN remember at each point what the original word is that it's trying to translate.
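
The three parts above can be traced with a small NumPy sketch of the data flow. This is not the Keras model: the weights are random, a plain tanh RNN stands in for the LSTM layers, and the embedding and hidden sizes are toy values chosen only for illustration. The phoneme indices are made up as well.

```python
import numpy as np

rng = np.random.default_rng(0)
Tx, Ty = 16, 15                  # input / output lengths used in this notebook
n_phonemes, n_letters = 70, 28   # vocabulary sizes used in this notebook
d_emb, d_h = 8, 6                # toy sizes, chosen only for illustration

def simple_rnn(xs, W, U):
    """Plain tanh RNN returning the hidden state at every step, shape (T, d_h)."""
    h = np.zeros(U.shape[0])
    states = []
    for x in xs:
        h = np.tanh(x @ W + h @ U)
        states.append(h)
    return np.stack(states)

def rnn_weights(d_in, d_out):
    """Random input-to-hidden and hidden-to-hidden matrices (untrained)."""
    return rng.normal(size=(d_in, d_out)), rng.normal(size=(d_out, d_out))

# Part 1: embedding lookup for a made-up, zero-padded phoneme sequence.
embedding = rng.normal(size=(n_phonemes, d_emb))
phoneme_ids = np.array([32, 9, 51] + [0] * (Tx - 3))
x = embedding[phoneme_ids]                              # (Tx, d_emb)

# Part 2: one RNN reads left to right, another reads the reversed input;
# the final state of each is concatenated into a single summary vector.
fwd = simple_rnn(x, *rnn_weights(d_emb, d_h))
bwd = simple_rnn(x[::-1], *rnn_weights(d_emb, d_h))
summary = np.concatenate([fwd[-1], bwd[-1]])            # (2 * d_h,)

# Part 3: RepeatVector hands the same summary to the decoder at every step,
# and a softmax classifies each output position over the letters.
repeated = np.tile(summary, (Ty, 1))                    # (Ty, 2 * d_h)
dec = simple_rnn(repeated, *rnn_weights(2 * d_h, d_h))  # (Ty, d_h)
logits = dec @ rng.normal(size=(d_h, n_letters))
probs = np.exp(logits - logits.max(axis=1, keepdims=True))
probs /= probs.sum(axis=1, keepdims=True)
print(probs.shape)
```

The point of the sketch is the shapes: a whole input sequence collapses to one vector, and that single vector is all the decoder sees when it predicts the 15 output letters.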
    


### With attention

In [66]:
Tx = maxlen_p
Ty = maxlen
Tx, Ty

(16, 15)

In [68]:
def softmax(x, axis=1):
    """Softmax activation function.
    # Arguments
        x : Tensor.
        axis: Integer, axis along which the softmax normalization is applied.
    # Returns
        Tensor, output of softmax transformation.
    # Raises
        ValueError: In case `dim(x) == 1`.
    """
    ndim = K.ndim(x)
    if ndim == 2:
        return K.softmax(x)
    elif ndim > 2:
        e = K.exp(x - K.max(x, axis=axis, keepdims=True))
        s = K.sum(e, axis=axis, keepdims=True)
        return e / s
    else:
        raise ValueError('Cannot apply softmax to a tensor that is 1D')
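
The reason for normalizing along axis 1 is that the attention energies have shape (m, Tx, 1), and we want the weights to sum to 1 across the Tx time steps. A softmax over the last axis would be useless here because that axis has size 1. The NumPy sketch below shows the difference; `softmax_np` is a hypothetical stand-in for the backend version above and the energies are random.

```python
import numpy as np

def softmax_np(x, axis=1):
    """NumPy counterpart of the max-subtracted softmax above (hypothetical name)."""
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / np.sum(e, axis=axis, keepdims=True)

m, Tx = 2, 16
energies = np.random.default_rng(0).normal(size=(m, Tx, 1))

# Normalizing over the time axis gives one distribution per example.
alphas = softmax_np(energies, axis=1)
print(alphas.sum(axis=1).ravel())

# Normalizing over the last axis (size 1) turns every weight into 1.
last_axis = softmax_np(energies, axis=-1)
print(last_axis.ravel()[:3])
```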

In [67]:
def one_step_attention(a, s_prev, Tx):
    """
    Performs one step of attention: Outputs a context vector computed as a dot product of the attention weights
    "alphas" and the hidden states "a" of the Bi-LSTM.
    
    Arguments:
    a -- hidden state output of the Bi-LSTM, numpy-array of shape (m, Tx, 2*n_a)
    s_prev -- previous hidden state of the (post-attention) LSTM, numpy-array of shape (m, n_s)
    
    Returns:
    context -- context vector, input of the next (post-attention) LSTM cell
    """
    
    # Use repeator to repeat s_prev to be of shape (m, Tx, n_s) so that you can concatenate it with all hidden states "a" (≈ 1 line)
    s_prev = RepeatVector(Tx)(s_prev) #repeator(s_prev)
    # Use concatenator to concatenate a and s_prev on the last axis (≈ 1 line)
    concat = Concatenate(axis=-1)([a, s_prev]) #concatenator([a, s_prev])
    # Use densor to propagate concat through a small fully-connected neural network to compute the "energies" variable e. (≈1 lines)
    e = Dense(1, activation = "relu")(concat) #densor(concat)
    # Use activator and e to compute the attention weights "alphas" (≈ 1 line) 
    # We are using a custom softmax(axis = 1) loaded in this notebook
    alphas = Activation(softmax)(e) #activator(e) # Activation(softmax, name='attention_weights')(e)
    # Use dotor together with "alphas" and "a" to compute the context vector to be given to the next (post-attention) LSTM-cell (≈ 1 line)
    context = Dot(axes = 1)([alphas, a]) #dotor([alphas, a])

    return context
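
To see what the final `Dot(axes = 1)` step computes, here is a NumPy sketch with random stand-in values. It contracts the time axis of the attention weights against the time axis of the hidden states, which is the same as a weighted sum of the hidden states over time. The shapes follow the ones used in this notebook (Tx = 16, n_a = 64); everything else is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
m, Tx, n_a = 2, 16, 64

a = rng.normal(size=(m, Tx, 2 * n_a))    # stand-in for the Bi-LSTM hidden states
energies = rng.normal(size=(m, Tx, 1))   # stand-in for the Dense(1) output

# Softmax over the time axis so the weights for each example sum to 1.
alphas = np.exp(energies - energies.max(axis=1, keepdims=True))
alphas /= alphas.sum(axis=1, keepdims=True)

# Contract the Tx axis of alphas with the Tx axis of a.
context = np.einsum("mti,mtj->mij", alphas, a)   # (m, 1, 2 * n_a)

# The same result written as an explicit weighted sum over time steps.
weighted_sum = (alphas * a).sum(axis=1, keepdims=True)
print(context.shape, np.allclose(context, weighted_sum))
```

The resulting shape, (m, 1, 2 * n_a), is a length-1 sequence, which is what the post-attention LSTM takes as its input at each output step.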

In [71]:
n_a = 64
n_s = 128

def model(Tx, Ty, n_a, n_s, input_vocab_size, output_vocab_size):
    """
    Arguments:
    Tx -- length of the input sequence
    Ty -- length of the output sequence
    n_a -- hidden state size of the Bi-LSTM
    n_s -- hidden state size of the post-attention LSTM
    input_vocab_size -- number of distinct phonemes, including padding (len(phonemes))
    output_vocab_size -- number of distinct letters, including padding and the go token (len(letters))

    Returns:
    model -- Keras model instance
    """
    
    # Define the inputs of your model with a shape (Tx, input_vocab_size), since the phonemes are one-hot encoded
    # Define s0 and c0, initial hidden state and cell state for the decoder LSTM of shape (n_s,)
    X = Input(shape=(Tx, input_vocab_size))
    s0 = Input(shape=(n_s,), name='s0')
    c0 = Input(shape=(n_s,), name='c0')
    s = s0
    c = c0
    
    # Initialize empty list of outputs
    outputs = []
    
    # Step 1: Define your pre-attention Bi-LSTM. Remember to use return_sequences=True. (≈ 1 line)
    a = Bidirectional(LSTM(n_a, return_sequences=True))(X)
    
    # Step 2: Iterate for Ty steps
    for t in range(Ty):
    
        # Step 2.A: Perform one step of the attention mechanism to get back the context vector at step t (≈ 1 line)
        context = one_step_attention(a, s, Tx)
        
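        # Step 2.B: Apply the post-attention LSTM cell to the "context" vector.
        # Don't forget to pass: initial_state = [hidden state, cell state] (≈ 1 line)
        # Note: a new LSTM layer is created on every iteration (as are the layers in
        # one_step_attention and the Dense below), so the Ty steps do not share weights.
        s, _, c = LSTM(n_s, return_state = True)(context, initial_state = [s, c]) # post_activation_LSTM_cell=LSTM(n_s, return_state = True)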
        
        # Step 2.C: Apply Dense layer to the hidden state output of the post-attention LSTM (≈ 1 line)
        out = Dense(output_vocab_size, activation=softmax)(s) # output_layer = Dense(len(machine_vocab), activation=softmax)
        
        # Step 2.D: Append "out" to the "outputs" list (≈ 1 line)
        outputs.append(out)
    
    # Step 3: Create model instance taking three inputs and returning the list of outputs. (≈ 1 line)
    model = Model([X, s0, c0], outputs)
    
    return model

In [72]:
model = model(Tx, Ty, n_a, n_s, input_vocab_size, output_vocab_size)

In [73]:
model.summary()

____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
input_1 (InputLayer)             (None, 16, 70)        0                                            
____________________________________________________________________________________________________
s0 (InputLayer)                  (None, 128)           0                                            
____________________________________________________________________________________________________
bidirectional_1 (Bidirectional)  (None, 16, 128)       69120       input_1[0][0]                    
____________________________________________________________________________________________________
repeat_vector_1 (RepeatVector)   (None, 16, 128)       0           s0[0][0]                         
___________________________________________________________________________________________

____________________________________________________________________________________________________
activation_5 (Activation)        (None, 16, 1)         0           dense_9[0][0]                    
____________________________________________________________________________________________________
dot_5 (Dot)                      (None, 1, 128)        0           activation_5[0][0]               
                                                                   bidirectional_1[0][0]            
____________________________________________________________________________________________________
lstm_6 (LSTM)                    [(None, 128), (None,  131584      dot_5[0][0]                      
                                                                   lstm_5[0][0]                     
                                                                   lstm_5[0][2]                     
___________________________________________________________________________________________

activation_10 (Activation)       (None, 16, 1)         0           dense_19[0][0]                   
____________________________________________________________________________________________________
dot_10 (Dot)                     (None, 1, 128)        0           activation_10[0][0]              
                                                                   bidirectional_1[0][0]            
____________________________________________________________________________________________________
lstm_11 (LSTM)                   [(None, 128), (None,  131584      dot_10[0][0]                     
                                                                   lstm_10[0][0]                    
                                                                   lstm_10[0][2]                    
____________________________________________________________________________________________________
repeat_vector_11 (RepeatVector)  (None, 16, 128)       0           lstm_11[0][0]           

____________________________________________________________________________________________________
dot_15 (Dot)                     (None, 1, 128)        0           activation_15[0][0]              
                                                                   bidirectional_1[0][0]            
____________________________________________________________________________________________________
lstm_16 (LSTM)                   [(None, 128), (None,  131584      dot_15[0][0]                     
                                                                   lstm_15[0][0]                    
                                                                   lstm_15[0][2]                    
____________________________________________________________________________________________________
dense_2 (Dense)                  (None, 28)            3612        lstm_2[0][0]                     
___________________________________________________________________________________________

In [117]:
opt = Adam(lr=0.005, beta_1=0.9, beta_2=0.999, decay=0.01)
model.compile(opt, loss='categorical_crossentropy', metrics=['accuracy'])

In [122]:
m = len(input_train)
s0 = np.zeros((m, n_s))
c0 = np.zeros((m, n_s))

In [124]:
from keras.utils import to_categorical

In [113]:
print(input_train.shape)
input_train_oh = np.array(list(map(lambda x: to_categorical(x, num_classes=input_vocab_size), input_train)))
print(input_train_oh.shape) # input shape: (m, Tx, input_vocab_size)

(97205, 16)
(97205, 16, 70)


In [114]:
labels_train_oh = np.array(list(map(lambda x: to_categorical(x, num_classes=output_vocab_size), labels_train)))
print(labels_train_oh.shape)
outputs = list(labels_train_oh.swapaxes(0,1))
print(len(outputs), outputs[0].shape) # outputs list len: Ty, each shape: (m, output_vocab_size)

(97205, 15, 28)
15 (97205, 28)
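
The reshaping above is easier to follow on a tiny example. The sketch below one-hot encodes two toy label rows by indexing an identity matrix (a NumPy shortcut used here instead of `to_categorical`), then swaps the first two axes so that there is one array per output position, which is the list format the multi-output model expects. The toy labels are made up.

```python
import numpy as np

output_vocab_size = 28

# Two toy label rows padded to length 5: "cab" and "be".
labels = np.array([[3, 1, 2, 0, 0],
                   [2, 5, 0, 0, 0]], np.int32)

# Row i of the identity matrix is the one-hot vector for class i.
one_hot = np.eye(output_vocab_size, dtype=np.float32)[labels]  # (m, Ty, vocab)

# One (m, vocab) target array per output position.
outputs = list(one_hot.swapaxes(0, 1))
print(one_hot.shape, len(outputs), outputs[0].shape)
```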


In [156]:
model.fit([input_train_oh, s0, c0], outputs, epochs=1, batch_size=128, verbose=0)

<keras.callbacks.History at 0x24e89c71e10>

In [159]:
model.save_weights(PATH+'attention_ted1.h5')

In [129]:
input_test_oh = np.array(list(map(lambda x: to_categorical(x, num_classes=input_vocab_size), input_test)))
#labels_test_oh = np.array(list(map(lambda x: to_categorical(x, num_classes=output_vocab_size), labels_test)))
#outputs_test = list(labels_test_oh.swapaxes(0,1))

In [166]:
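def eval_attention():
    # The initial decoder states need one row per test example, not per training example.
    m_test = len(input_test_oh)
    s0_test, c0_test = np.zeros((m_test, n_s)), np.zeros((m_test, n_s))
    preds = model.predict([input_test_oh, s0_test, c0_test], batch_size=128)
    # preds is a list of Ty arrays of shape (m_test, output_vocab_size); argmax gives (Ty, m_test).
    predict = np.transpose(np.argmax(preds, axis = 2))
    return (np.mean([all(real==p) for real, p in zip(labels_test, predict)]), predict)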

In [167]:
eval_attention()[0]

0.015831867419683364
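
This score is an exact-match accuracy: a word counts as correct only if every one of its positions, padding included, is predicted correctly. A per-letter accuracy is a softer companion metric that can help when judging a model after a single epoch. The sketch below compares the two on toy arrays, which are made up and are not model output; `to_word` is a hypothetical helper for reading predictions back as strings.

```python
import numpy as np

letters = "_abcdefghijklmnopqrstuvwxyz*"

def to_word(row):
    """Hypothetical helper: indices back to a string, with padding stripped."""
    return "".join(letters[i] for i in row).rstrip("_")

# Toy stand-ins for labels_test and the predicted indices (not real model output).
real = np.array([[3, 1, 2, 0], [2, 5, 5, 0]])    # "cab", "bee"
pred = np.array([[3, 1, 2, 0], [2, 5, 20, 0]])   # "cab", "bet"

# Fraction of words with every position correct.
exact_match = np.mean([all(r == p) for r, p in zip(real, pred)])
# Fraction of individual positions that are correct.
per_letter = np.mean(real == pred)
print(exact_match, per_letter)
print([(to_word(r), to_word(p)) for r, p in zip(real, pred)])
```

In the toy case one wrong letter halves the exact-match score while the per-letter score stays high, so the two numbers are best read together.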