## Letter Embeddings
* Notebook: https://github.com/RasaHQ/algorithm-whiteboard-resources/blob/master/letter-embeddings/algo_whiteboard_letter_embeddings_v2.ipynb
* Video: https://www.youtube.com/watch?v=mWvnlVw_LiY&index=5

For some reason this notebook doesn't use the GPU, even when the GPU accelerator is enabled and TensorFlow can see the GPU ¯\_(ツ)_/¯

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
try:
    # %tensorflow_version only exists in Colab.
    %tensorflow_version 2.x
except Exception:
    pass

In [None]:
import tensorflow as tf
# list_physical_devices is available outside the experimental namespace in recent TF 2.x releases
print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))

In [None]:
import numpy as np
import tensorflow as tf

## Fetching the Data 

This is a bit annoying. To download from Kaggle, we need to upload the API key here, move the file to the correct folder, and then change its permissions. The error messages will not provide helpful information. Then again, this code works:

In [None]:
import pandas as pd
headlines = pd.read_csv('/kaggle/input/million-headlines/abcnews-date-text.csv')['headline_text']
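
When running somewhere the dataset is not already mounted, the key-handling steps described above (move the file, then restrict its permissions) could look roughly like the sketch below. The helper name `install_kaggle_key` and its `home` parameter are hypothetical, and the `~/.kaggle/kaggle.json` location and owner-only permissions are assumptions based on the usual Kaggle client setup, not something this notebook specifies.

```python
import os
import shutil
import stat
from pathlib import Path


def install_kaggle_key(uploaded_key, home=None):
    """Move an uploaded kaggle.json into <home>/.kaggle and restrict its permissions.

    Hypothetical helper; the target location is an assumption about the Kaggle client.
    """
    home = Path(home) if home is not None else Path.home()
    target_dir = home / ".kaggle"
    target_dir.mkdir(parents=True, exist_ok=True)

    target = target_dir / "kaggle.json"
    shutil.move(str(uploaded_key), str(target))

    # Owner read/write only (equivalent to chmod 600).
    os.chmod(target, stat.S_IRUSR | stat.S_IWUSR)
    return target


# Example call, assuming the key was uploaded to the working directory:
# install_kaggle_key("kaggle.json")
```

Passing `home` explicitly makes it possible to try the helper against a scratch directory before touching the real home folder.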

## Sequence of Letters 

Let's now take these headlines and grab sequences of letters out of them.

In [None]:
headlines[0]

In [None]:
import itertools as it 

def sliding_window(txt):
    for i in range(len(txt) - 2):
        yield txt[i], txt[i + 1], txt[i + 2]

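# Flatten the per-headline generators into one list of (first, second, third) letter triples
window = list(it.chain.from_iterable(sliding_window(h) for h in headlines[:10000]))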

In [None]:
for win in window[:10]:
    print(win)
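
To see what the generator yields without loading the dataset, here is a small standalone sketch that repeats `sliding_window` on a toy string. The inputs `"hello"` and `"hi"` are illustrative only.

```python
def sliding_window(txt):
    # Same generator as above: every run of three consecutive characters.
    for i in range(len(txt) - 2):
        yield txt[i], txt[i + 1], txt[i + 2]


triples = list(sliding_window("hello"))
print(triples)
# [('h', 'e', 'l'), ('e', 'l', 'l'), ('l', 'l', 'o')]

# Strings shorter than three characters yield nothing.
short = list(sliding_window("hi"))
print(short)
# []
```

A string of length `n` produces `n - 2` triples, so very short headlines contribute no training examples.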

In [None]:
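# Index every character seen in any window position (first-seen order),
# so a character that only occurs as second or third letter still has a key.
mapping = {c: i for i, c in enumerate(dict.fromkeys(c for w in window for c in w))}
# Inputs: the first two letters of each triple; target: the third letter.
integers_in = np.array([[mapping[w[0]], mapping[w[1]]] for w in window])
integers_out = np.array([mapping[w[2]] for w in window]).reshape(-1, 1)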

In [None]:
mapping

In [None]:
integers_in.shape
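
The integer encoding can be checked on a toy example. The sketch below uses two made-up triples and builds its mapping from every position in the triple (an assumption of this example), so that a character appearing only as a target, `'c'` here, still gets an index.

```python
import numpy as np

# Two hypothetical triples; 'c' only ever appears as the letter to predict.
toy_window = [("a", "b", "a"), ("b", "a", "c")]

# dict.fromkeys keeps first-seen order while dropping duplicates.
toy_mapping = {
    c: i for i, c in enumerate(dict.fromkeys(c for w in toy_window for c in w))
}
toy_in = np.array([[toy_mapping[w[0]], toy_mapping[w[1]]] for w in toy_window])
toy_out = np.array([toy_mapping[w[2]] for w in toy_window]).reshape(-1, 1)

print(toy_mapping)      # {'a': 0, 'b': 1, 'c': 2}
print(toy_in.shape)     # (2, 2)
print(toy_out.ravel())  # [0 2]
```

Each input row holds the indices of the two context letters, and each target row holds the index of the letter that follows them.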

In [None]:
from tensorflow.keras.layers import Embedding, Dense, Flatten
from tensorflow.keras.models import Sequential

num_letters = len(mapping)  # vocabulary size: 26 letters + 10 digits, plus the space and any other characters in the data

# this one is so we might grab the embeddings
model_emb = Sequential()
embedding = Embedding(num_letters, 2, input_length=2)
model_emb.add(embedding)
output_array = model_emb.predict(integers_in)
output_array.shape

## Initialized Letters Visualization
Shows the "random" distribution of letters after initialization

In [None]:
import matplotlib.pyplot as plt

idx_to_calc = list(mapping.values())
idx_to_calc = np.array([idx_to_calc, idx_to_calc]).T

translator = {v:k for k,v in mapping.items()}
preds = model_emb.predict(idx_to_calc)

# The invisible scatter (alpha=0) only sets the axis limits; each letter is drawn as a text label.
plt.scatter(preds[:, 0, 0], preds[:, 0, 1], alpha=0)
for i, idx in enumerate(idx_to_calc):
    plt.text(preds[i, 0, 0], preds[i, 0, 1], translator[idx[0]])

In [None]:
from tensorflow.keras.optimizers import Adam

# this one is so we might learn the mapping
model_pred = Sequential()
model_pred.add(embedding)
model_pred.add(Flatten())
model_pred.add(Dense(num_letters, activation="softmax"))

adam = Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999, amsgrad=False)

model_pred.compile(adam, 'categorical_crossentropy', metrics=['accuracy'])

output_array = model_pred.predict(integers_in)
output_array.shape
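
To make the shapes in this model concrete, here is a rough NumPy sketch of the same forward pass: embedding lookup, flatten, then a dense softmax layer. The sizes and random weights are toy assumptions, and the names `E`, `W`, `b`, and `forward` are hypothetical. It illustrates the computation only; it is not how Keras implements it.

```python
import numpy as np

rng = np.random.default_rng(0)
num_letters, emb_dim, context = 5, 2, 2  # toy sizes, not the notebook's values

E = rng.normal(size=(num_letters, emb_dim))            # embedding table: one 2-d vector per letter
W = rng.normal(size=(context * emb_dim, num_letters))  # dense kernel
b = np.zeros(num_letters)                              # dense bias


def forward(int_pairs):
    looked_up = E[int_pairs]                      # (batch, 2, 2): Embedding lookup
    flat = looked_up.reshape(len(int_pairs), -1)  # (batch, 4): Flatten
    logits = flat @ W + b                         # (batch, num_letters): Dense
    logits -= logits.max(axis=1, keepdims=True)   # shift for numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)   # softmax over letters


probs = forward(np.array([[0, 1], [3, 4]]))
print(probs.shape)        # (2, 5)
print(probs.sum(axis=1))  # each row sums to 1
```

Training adjusts `E` together with the dense weights, which is why the embedding plot changes after fitting.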

In [None]:
from tensorflow.keras.utils import to_categorical

# One-hot encode the targets; num_classes pins the width to the softmax layer's output size.
to_predict = to_categorical(integers_out, num_classes=num_letters)
model_pred.fit(integers_in, to_predict, epochs=10, verbose=1)
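
The targets passed to `fit` are one-hot rows, one column per letter. A minimal NumPy sketch of that encoding, with made-up target values and a toy class count, shows why the width should be fixed to the number of classes rather than inferred from the data:

```python
import numpy as np

targets = np.array([[0], [2], [1]])  # same column shape as integers_out
num_classes = 4                      # toy value; class 3 never occurs in targets

# Row k of the identity matrix is the one-hot vector for class k.
one_hot = np.eye(num_classes)[targets.ravel()]
print(one_hot)
# [[1. 0. 0. 0.]
#  [0. 0. 1. 0.]
#  [0. 1. 0. 0.]]
```

Because class 3 is absent from `targets`, inferring the width from the data would give three columns, which would not match a four-way softmax output.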

## Learned Letters Relationship Visualization
Shows the learned distribution of letters after using the previous two letters to predict the third

In [None]:
# model_emb shares its Embedding layer with model_pred, so this now returns the trained vectors
preds = model_emb.predict(idx_to_calc)
plt.scatter(preds[:, 0, 0], preds[:, 0, 1], alpha=0)
for i, idx in enumerate(idx_to_calc):
    plt.text(preds[i, 0, 0], preds[i, 0, 1], translator[idx[0]])