## Embedding Model Artefacts

This notebook shows how to produce the artefacts for a model that predicts embedding components. The final result is computed by finding the elements of an embedding dataset at minimum distance from the prediction, using a configurable distance function. The artefacts produced are:
* TensorFlow SavedModel
* embedding dataset
* verification samples
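
As a rough illustration of the lookup described above, the sketch below finds the nearest rows of a toy embedding table under a selectable distance function. The helper names (`cosine_distance`, `euclidean_distance`, `DISTANCES`, `nearest_terms`) and the toy data are assumptions made for this example; they are not part of the service or of the notebook's artefacts.

```python
import numpy as np

def cosine_distance(query, table):
    # 1 - cosine similarity between one query vector and every row of the table
    q = query / np.linalg.norm(query)
    t = table / np.linalg.norm(table, axis=-1, keepdims=True)
    return 1.0 - t @ q

def euclidean_distance(query, table):
    # Straight-line distance between the query and every row of the table
    return np.linalg.norm(table - query, axis=-1)

# Hypothetical registry that maps a distance name to its implementation
DISTANCES = {'cosine': cosine_distance, 'euclidean': euclidean_distance}

def nearest_terms(query, table, terms, num=1, distance='cosine'):
    """Return the `num` (term, distance) pairs closest to `query`."""
    d = DISTANCES[distance](query, table)
    order = np.argsort(d)[:num]  # smallest distance first
    return [(terms[i], float(d[i])) for i in order]

# Toy embedding dataset: three 2-dimensional vectors with made-up terms
toy_table = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
toy_terms = ['east', 'north', 'north-east']
toy_query = np.array([2.0, 0.1])

nearest_cosine = nearest_terms(toy_query, toy_table, toy_terms, num=2)
nearest_euclid = nearest_terms(toy_query, toy_table, toy_terms, num=3,
                               distance='euclidean')
```

Keeping the distance functions in a dictionary is just one way to make the choice configurable by name, in the spirit of the `'distance': 'cosine'` field used in the verification requests.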

In [None]:
import json
import tensorflow as tf
import tensorflow.keras as K
import tensorflow_datasets as tfds
import pandas as pd
import numpy as np

### Example Word2Vec Model Construction

In [None]:
(train_data, test_data), info = tfds.load(
    'imdb_reviews/subwords8k', 
    split = (tfds.Split.TRAIN, tfds.Split.TEST), 
    with_info=True, as_supervised=True)

In [None]:
train_batches = train_data.shuffle(1000).padded_batch(10)
test_batches = test_data.shuffle(1000).padded_batch(10)

In [None]:
encoder = info.features['text'].encoder
embedding_dim=16

model = K.Sequential([
  K.layers.Embedding(encoder.vocab_size, embedding_dim),
  K.layers.GlobalAveragePooling1D(),
  K.layers.Dense(16, activation='relu'),
  K.layers.Dense(1)
])

model.summary()

In [None]:
model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=['accuracy'])

history = model.fit(
    train_batches,
    epochs=10,
    validation_data=test_batches, validation_steps=20)

#### Make a model that predicts the embedding components

In [None]:
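prediction_model = K.models.clone_model(model)
# clone_model creates new layers with freshly initialised weights,
# so copy the trained weights across before removing the Dense layers.
prediction_model.set_weights(model.get_weights())
prediction_model.pop()
prediction_model.pop()
prediction_model.summary()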

#### Extract embedding weights from the model

In [None]:
embedding_weights = model.layers[0].get_weights()[0][1:,:]
embedding_weights.shape
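
The `[1:, :]` slice drops the first row of the embedding matrix, so row `i` of the result corresponds to token id `i + 1`. The sketch below shows that offset on a toy matrix. It assumes id 0 is reserved for padding, which is what the slice and the later `i + 1` decoding suggest; the toy values are made up.

```python
import numpy as np

# Toy embedding matrix for a hypothetical vocabulary of 4 ids (id 0 assumed to be padding)
toy_weights = np.arange(8, dtype=float).reshape(4, 2)

# Drop the row for id 0, as done for embedding_weights above
toy_no_padding = toy_weights[1:, :]

row = 2              # row index in the trimmed matrix
token_id = row + 1   # matching token id in the full vocabulary
```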

#### Generate verification data from model and embeddings

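In this example we use the cosine measure for verification. The scores computed below are cosine similarities, so the highest scores correspond to the smallest cosine distances.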

In [None]:
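# test_batches is reshuffled on every iteration; cache() pins the ten samples
# so that the requests and the predictions below refer to the same inputs.
verification_input = test_batches.unbatch().batch(1).take(10).cache()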
num_results = 5
requests = [{
    'input': [[int(x) for x in e[0][0]]],
    'num': num_results,
    'distance': 'cosine'
} for e in list(verification_input.as_numpy_iterator())]

prediction_output = prediction_model.predict(verification_input)

def norm(m):
    # Scale each row to unit length so that a dot product gives the cosine similarity
    return m / np.sqrt(np.sum(m * m, axis=-1, keepdims=True))

scores = norm(prediction_output) @ norm(embedding_weights).T

examples = prediction_output.shape[0]
scored_ix = np.arange(examples).reshape(-1, 1)
top_k = scores.argpartition(-num_results)[:,-num_results:]
sorted_k = top_k[scored_ix, (scores[scored_ix, top_k]).argsort()]
scores_k = scores[scored_ix, sorted_k]

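# Pair each score with its own term: scores_k follows the order of sorted_k, not top_k
responses = [
    {'result': [{'term': encoder.decode([i + 1]).rstrip(), 'score': float(s)}
                for i, s in zip(terms, term_scores)]}
    for terms, term_scores in zip(sorted_k, scores_k)]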
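
The top-k selection used in the cell above can be checked on a small score matrix. The sketch below repeats the same `argpartition` and `argsort` steps on made-up numbers and shows that the indices and the scores have to be reordered together for each term to line up with its score. All names prefixed with `toy_` are illustrative.

```python
import numpy as np

# Made-up similarity scores: 2 examples scored against 4 terms
toy_scores = np.array([[0.1, 0.9, 0.4, 0.7],
                       [0.8, 0.2, 0.6, 0.3]])
k = 2
rows = np.arange(toy_scores.shape[0]).reshape(-1, 1)

# argpartition puts the k largest entries in the last k columns, in no particular order
top = toy_scores.argpartition(-k)[:, -k:]

# Reorder those k column indices by ascending score
ordered = top[rows, toy_scores[rows, top].argsort()]

# Scores gathered with the reordered indices, so ordered[i, j] pairs with ordered_scores[i, j]
ordered_scores = toy_scores[rows, ordered]
```

`argpartition` avoids a full sort of every row, which matters when the vocabulary is large and only a handful of results are needed.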

#### Update the layers of the created model

For our service to work out which parameter passed to the predict API should go to the model, the model's input layer must be named. Below, we use an `InputLayer` with the name `tokens`, so we can expect the API call to look something like `/api/predict?tokens=[[111, 222, 333]]`. Later in the notebook we explain how to verify this.

In [None]:
model_to_save = K.Sequential([
    K.layers.InputLayer(input_shape=(None,), name='tokens'),
    prediction_model,
    K.layers.Lambda(lambda x: tf.reduce_mean(x, axis=0), name='result')
])
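
To make the naming requirement concrete, here is a hedged sketch of how a client might build a request whose query parameter matches the `tokens` input layer. The host, the port and the JSON encoding of the parameter value are assumptions made for this example; only the `/api/predict` path and the `tokens` name come from the text above.

```python
import json
from urllib.parse import parse_qs, urlencode, urlsplit

# Placeholder host and port; only the path is taken from the example above
base_url = 'http://localhost:8080/api/predict'
tokens = [[111, 222, 333]]

# The query parameter name must match the name of the model's input layer
query = urlencode({'tokens': json.dumps(tokens)})
url = f'{base_url}?{query}'

# Round-trip check: decoding the query string recovers the same nested list
recovered = json.loads(parse_qs(urlsplit(url).query)['tokens'][0])
```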

#### Write out all Artefacts

In [None]:
!mkdir -p /tmp/word2vec/model
saved_model_dir = '/tmp/word2vec/model'
model_to_save.save(saved_model_dir)

!tar -cvzf /tmp/word2vec/model.tgz -C /tmp/word2vec/model .

pd.DataFrame(
    embedding_weights,
    index=pd.Index(
        [encoder.decode([i]).rstrip() for i in range(1, encoder.vocab_size)],
        name='term')
).to_csv('/tmp/word2vec/embedding.csv')

with open('/tmp/word2vec/verification.jsonl', 'wt') as f:
    for req, resp in zip(requests, responses):
        json.dump({'request': req, 'response': resp}, f)
        f.write('\n')

!ls -l /tmp/word2vec
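
A quick way to sanity-check a verification file is to read it back line by line. The sketch below writes one made-up record in the same shape as the records above to a temporary file and reloads it. The record contents and the `counts_ok` check are illustrative assumptions, not something the service requires.

```python
import json
import os
import tempfile

# Made-up record with the same structure as the ones written above
record = {
    'request': {'input': [[111, 222, 333]], 'num': 1, 'distance': 'cosine'},
    'response': {'result': [{'term': 'example', 'score': 0.5}]},
}

# Write to a temporary location so the real artefacts are left untouched
path = os.path.join(tempfile.mkdtemp(), 'verification.jsonl')
with open(path, 'wt') as f:
    json.dump(record, f)
    f.write('\n')

# JSON Lines: one JSON document per non-empty line
with open(path, 'rt') as f:
    loaded = [json.loads(line) for line in f if line.strip()]

# Each response should carry exactly `num` results
counts_ok = all(
    len(r['response']['result']) == r['request']['num'] for r in loaded)
```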

#### Verify saved model

Some models may require the definition of Keras layers that perform custom transformations. A good check is to reload the model from disk as follows:

In [None]:
model_from_disk = tf.keras.models.load_model(saved_model_dir)
model_from_disk.summary()

Next we'll double-check the input tensor and its name. A little data cleaning is needed because of how TensorFlow stores its input signatures, but the printed output shows that the input tensor is named `tokens`, which is exactly what we wanted.

In [None]:
signature = model_from_disk.signatures['serving_default']
# Keep only the non-empty parts of the input signature
print('Input Tensors: ', [tensor for tensor in signature.structured_input_signature if tensor])
print('Output Tensors: ', signature.structured_outputs)
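
The cleanup in the cell above relies on the structured input signature being an `(args, kwargs)` pair in which the positional part is empty. The sketch below mimics that shape with plain Python objects to show what the filter keeps. The `toy_signature` value is a stand-in; a real signature would hold `tf.TensorSpec` objects rather than strings.

```python
# Stand-in for structured_input_signature: empty positional args, one named input
toy_signature = ((), {'tokens': 'TensorSpec placeholder'})

# Empty containers are falsy, so the filter drops the empty positional part
non_empty = [part for part in toy_signature if part]

# Iterating over the remaining dictionary yields the input names
input_names = sorted(name for part in non_empty for name in part)
```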