#### Structure of notebook
- This notebook will serve as an introduction to the word embedding process in Keras.
- Here we introduce:
    1. Tokenization
    2. Embedding
- The following notebook will show you how to:
    1. Load a trained vector model
    2. Use embedding for building a simple model

#### One hot encoding and text tokenization
Here we tokenize our text and one-hot encode the words using Keras's built-in `Tokenizer`.

In [1]:
from keras.preprocessing.text import Tokenizer
import pandas as pd
import numpy as np
samples = ['The quick brown fox.', 'Jumped over the lazy fox.']
# Creates a tokenizer configured to keep only the 1000 most common words
# Note that our two samples contain only 7 unique words
tokenizer = Tokenizer(num_words = 1000)
# Building the word index
tokenizer.fit_on_texts(samples)

Using TensorFlow backend.


In [2]:
# Turns strings into lists of integer indices
sequences = tokenizer.texts_to_sequences(samples)
sequences

[[1, 3, 4, 2], [5, 6, 1, 7, 2]]

In [3]:
# Turns each string into a binary vector of dimension 1000 (set by num_words above)
one_hot_results = tokenizer.texts_to_matrix(samples, mode='binary')
pd.DataFrame(one_hot_results)

,0,1,2,3,4,5,6,7,8,9,...,990,991,992,993,994,995,996,997,998,999
0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [4]:
# Dictionary mapping each word to its integer index (its position in the one-hot vector)
word_index = tokenizer.word_index
print('Found {} unique tokens'.format(len(word_index)))
print('The dictionary mapping of tokens is\n {}'.format(word_index))

Found 7 unique tokens
The dictionary mapping of tokens is
 {'over': 6, 'brown': 4, 'the': 1, 'quick': 3, 'fox': 2, 'jumped': 5, 'lazy': 7}
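
To make the mapping above concrete, here is a minimal pure-Python sketch of roughly what the tokenizer does for these two samples. It is an illustration under simplifying assumptions, not the Keras implementation: the helper `simple_tokenize` and its regular expression are hypothetical, and `num_words` is set to 10 only to keep the vectors short.

```python
import re
from collections import Counter

samples = ['The quick brown fox.', 'Jumped over the lazy fox.']
num_words = 10  # assumption: small value so the vectors are easy to read

def simple_tokenize(text):
    # Hypothetical helper: lowercase, then keep runs of letters and digits
    return re.findall(r"[a-z0-9]+", text.lower())

# Count word frequencies across all samples
counts = Counter()
for sample in samples:
    counts.update(simple_tokenize(sample))

# The most frequent word gets index 1; index 0 is left unused
word_index = {word: i + 1 for i, (word, _) in enumerate(counts.most_common())}

# Integer sequences: one list of indices per sample
sequences = [[word_index[w] for w in simple_tokenize(s)] for s in samples]

# Binary matrix: position k of a row is 1 if the word with index k occurs in that sample
matrix = [[0] * num_words for _ in samples]
for row, seq in zip(matrix, sequences):
    for idx in seq:
        if idx < num_words:
            row[idx] = 1

print(word_index)
print(sequences)
print(matrix)
```

For these samples the sequences come out as `[[1, 3, 4, 2], [5, 6, 1, 7, 2]]`, the same as the tokenizer output above. Note that each row of the binary matrix describes a whole sample (which words occur in it), not a single word.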


#### Hashing trick
- For very large vocabularies, one-hot encoding with an explicit word index becomes impractical. 
- Instead we use one-hot hashing, which applies a lightweight hashing function to map words into vectors of fixed size (rather than maintaining an index).

- Advantages:
    - Do not need to maintain word index
    - Saves memory
    - Allows online encoding of data (can generate token vectors on the fly before all data has been seen)

- Disadvantages:
    - Hash collisions (two different words end up with the same hash, so they can no longer be told apart)

- Practicalities:
    - If the dimensionality of the hash space is much larger than the number of unique tokens, hash collisions are unlikely
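
As a rough check on the last point, the sketch below computes the chance that at least two words share a bucket. It assumes an idealized hash that spreads words uniformly over the buckets, which real hash functions only approximate; the function name `collision_probability` is hypothetical.

```python
def collision_probability(n_words, dimensionality):
    """Chance that at least two of n_words share a bucket, assuming uniform hashing."""
    p_no_collision = 1.0
    for k in range(n_words):
        # The (k+1)-th word must avoid the k buckets that are already taken
        p_no_collision *= max(0.0, 1.0 - k / dimensionality)
    return 1.0 - p_no_collision

# 7 unique words, as in the samples used in this notebook
for d in (10, 1000, 100000):
    print(d, collision_probability(7, d))
```

With 7 words and only 10 buckets a collision is very likely (about 0.94), and the probability drops quickly as the hash space grows.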

In [5]:
# Example of one-hot hashing
samples = ['The quick brown fox.', 'Jumped over the lazy fox.']
dimensionality = 10
max_length = 10

# Pre-allocation
results = np.zeros((len(samples), max_length, dimensionality))

# Hashing function 
for i, sample in enumerate(samples):
    for j, word in list(enumerate(sample.split()))[:max_length]:
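        # Hashes the word into an integer index between 0 and dimensionality - 1
        # Note: Python 3 salts hash() for strings per process, so the indices
        # change between runs unless PYTHONHASHSEED is set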
        index = abs(hash(word)) % dimensionality
        results[i, j, index] = 1

In [6]:
# One (max_length, dimensionality) array of one-hot rows per sample
results

array([[[0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 1., 0., 0., 0., 0.],
        [0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 1.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]],

       [[0., 0., 1., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 1.],
        [0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 1., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 1.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]]])
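
Because the built-in `hash()` is not stable across Python processes, the same word can land in a different bucket on every run. One possible workaround, sketched here as an assumption rather than something the original cell does, is a deterministic hash from `hashlib`. The helper name `stable_bucket` is hypothetical, and MD5 is used only as a convenient fixed hash, not for security.

```python
import hashlib

samples = ['The quick brown fox.', 'Jumped over the lazy fox.']
dimensionality = 10
max_length = 10

def stable_bucket(word, dimensionality):
    # Hypothetical helper: same word always maps to the same bucket, in every run
    digest = hashlib.md5(word.encode('utf-8')).hexdigest()
    return int(digest, 16) % dimensionality

# results[i][j][k] is 1 if word j of sample i hashes to bucket k
results = [[[0] * dimensionality for _ in range(max_length)] for _ in samples]
for i, sample in enumerate(samples):
    for j, word in enumerate(sample.split()[:max_length]):
        results[i][j][stable_bucket(word, dimensionality)] = 1

print(results[0][:4])
```

As in the cell above, `sample.split()` keeps punctuation attached, so `'fox.'` is hashed rather than `'fox'`.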

#### Setting up an embedding layer.
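- The embedding layer takes a 2D tensor of integers of shape `(samples, sequence_length)` 
    - It accepts batches of `samples` sequences, and all sequences in a batch must have the same length
    - Each sequence therefore needs to be zero-padded or truncated to `sequence_length`
    - It returns a 3D floating-point tensor of shape `(samples, sequence_length, embedding_dimension)`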

In [2]:
from keras.layers import Embedding
# Embedding takes at least two arguments
# Embedding(n, d) = (number of possible tokens, i.e. max token index + 1, embedding dimension)
embedding_layer = Embedding(1000, 64)
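
Conceptually, an embedding layer is a lookup table: row `k` of its weight matrix is the vector for token index `k`. The NumPy sketch below illustrates that idea with random numbers standing in for learned weights; it is not how Keras implements the layer internally, and the variable names are chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_dim = 1000, 64

# Stand-in for the learned weight matrix: one 64-dimensional row per token index
weights = rng.normal(size=(vocab_size, embedding_dim))

# A batch of 2 sequences, each of length 4: shape (samples, sequence_length)
batch = np.array([[1, 3, 4, 2],
                  [5, 6, 1, 7]])

# Integer indexing looks up one row per token
embedded = weights[batch]
print(embedded.shape)  # (2, 4, 64)
```

The same token index always yields the same vector, wherever it appears in the batch.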

- Here we'll use the IMDB dataset
    - The `x` values are the reviews, already tokenized into integer word indices
    - The `y` values are the sentiment labels (0 = negative, 1 = positive)
- We'll restrict our vocabulary to the 1000 most common words and cut off reviews after 20 words


In [19]:
# --- Loading libraries ---
from keras.datasets import imdb
from keras import preprocessing

# --- Setting up constants ---
# Number of words to use as features; we keep only the max_features most common words
max_features = 1000
# Max number of words in a review (truncate the rest)
maxlen = 20
# --- Reading in data ---
# Loads the data as lists of integers
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words = max_features)

In [21]:
# What do our x_train, y_train look like?
# Note: index 1 selects the second review (indexing starts at 0)
print("The tokenized vector for the second review:")
print(pd.DataFrame(x_train[1]).head(10))
print("The sentiment for the second review" + str(pd.DataFrame(y_train[[1]])))


The tokenized vector for the second review:
     0
0    1
1  194
2    2
3  194
4    2
5   78
6  228
7    5
8    6
9    2
The sentiment for the second review   0
0  0


In [22]:
# --- Preprocessing data to pad/truncate sequences ---
# Turns the lists of integers into a 2d integer tensor of shape (samples, maxlen)
x_train = preprocessing.sequence.pad_sequences(x_train, maxlen = maxlen)
x_test = preprocessing.sequence.pad_sequences(x_test, maxlen = maxlen)

In [23]:
x_train.shape

(25000, 20)

In [24]:
x_train[1,:]

array([ 23,   4,   2,  15,  16,   4,   2,   5,  28,   6,  52, 154, 462,
        33,  89,  78, 285,  16, 145,  95], dtype=int32)
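
With its default settings, `pad_sequences` pads and truncates at the front of each sequence, so the last `maxlen` tokens of a long review are the ones that survive. The pure-Python function below is a sketch of that default behaviour for a single sequence; the name `pad_or_truncate` is hypothetical and it assumes `maxlen` is at least 1.

```python
def pad_or_truncate(seq, maxlen, value=0):
    # Keep only the last maxlen tokens (truncation at the front)
    seq = list(seq)[-maxlen:]
    # Fill up with the padding value at the front
    return [value] * (maxlen - len(seq)) + seq

print(pad_or_truncate([1, 2, 3], 5))            # [0, 0, 1, 2, 3]
print(pad_or_truncate([1, 2, 3, 4, 5, 6], 4))   # [3, 4, 5, 6]
```

This explains why the row shown above does not start with the same tokens as the untruncated review.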

#### Training a model and embedding layer
- Let's now train the classifier together with the weights of the embedding layer
- Note:
    - The Embedding layer weights, like all other weights in the network, are learned during training (e.g. with stochastic gradient descent)
    - Alternatively, word embeddings can be pretrained (e.g. with word2vec) and used as the initial weights of the Embedding layer
    - You can then keep those weights static (frozen) or leave them trainable, depending on your preference

- The model we will train puts a single dense layer on top of the flattened embeddings for classification 
    - This is equivalent to a simple logistic regression on the concatenated word vectors
    - Each embedded word is treated separately, so we do not consider inter-word relationships
    - Recurrent nets take word relations into account
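
The forward pass of such a model can be sketched in a few lines of NumPy: look up the embeddings, flatten them, and apply one logistic-regression unit. The weights here are random placeholders rather than trained values, and the names `E`, `w`, `b` and `predict` are chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_dim, maxlen = 10000, 8, 20

E = rng.normal(scale=0.05, size=(vocab_size, embedding_dim))   # embedding matrix
w = rng.normal(scale=0.05, size=(maxlen * embedding_dim,))     # dense layer weights
b = 0.0                                                        # dense layer bias

def predict(x):
    embedded = E[x]                        # (samples, maxlen, embedding_dim)
    flat = embedded.reshape(len(x), -1)    # (samples, maxlen * embedding_dim)
    logits = flat @ w + b                  # one score per sample
    return 1.0 / (1.0 + np.exp(-logits))   # sigmoid turns scores into probabilities

# Three fake "reviews" of 20 token indices each
x = rng.integers(0, 1000, size=(3, maxlen))
probs = predict(x)
print(probs.shape)  # (3,)
```

Each position in the review gets its own slice of `w`, which is why the model sees words individually and has no notion of word order beyond position.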

In [25]:
# Load libraries
from keras.models import Sequential
from keras.layers import Flatten, Dense

# Setting up keras model
model = Sequential()

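# Create embedding layer as input
# 10000 - vocabulary size of the embedding (must exceed the largest token index;
#         the data above was loaded with max_features = 1000, so 1000 would suffice)
# 8 - embedding dimension
# input_length - number of tokens per review (maxlen)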
model.add(Embedding(10000,8, input_length = maxlen))
model.add(Flatten())

# Add a single-unit sigmoid output layer for binary classification
model.add(Dense(1, activation = 'sigmoid'))
model.compile(optimizer = 'rmsprop', loss='binary_crossentropy', metrics=['acc'])
model.summary()


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_6 (Embedding)      (None, 20, 8)             80000     
_________________________________________________________________
flatten_5 (Flatten)          (None, 160)               0         
_________________________________________________________________
dense_5 (Dense)              (None, 1)                 161       
Total params: 80,161
Trainable params: 80,161
Non-trainable params: 0
_________________________________________________________________
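
The parameter counts in the summary can be checked by hand. The short calculation below uses the values passed to the layers above.

```python
vocab_size, embedding_dim, maxlen = 10000, 8, 20

# One embedding_dim-sized vector per token index
embedding_params = vocab_size * embedding_dim

# Flatten has no parameters; it only reshapes (20, 8) into 160 values
flattened_size = maxlen * embedding_dim

# Dense(1): one weight per flattened input plus one bias
dense_params = flattened_size * 1 + 1

total_params = embedding_params + dense_params
print(embedding_params, flattened_size, dense_params, total_params)
# 80000 160 161 80161
```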


In [28]:

model.fit(x_train, y_train,
                   epochs = 5,
                   batch_size = 32,
                   validation_split = 0.2)

Train on 20000 samples, validate on 5000 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x1208b8d50>

- Our validation accuracy is not bad, at around 75%
- Since random guessing on this balanced dataset is correct 50% of the time, that is a gain of 25 percentage points (a 50% relative improvement)

Note that in this example we trained our own word embeddings for this specific classification task. The alternative is to learn embeddings in an unsupervised way, for example with the skip-gram or CBOW approach. Embeddings learned that way can then be reused across many different classification tasks.
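
To give a feel for the skip-gram idea, the sketch below only generates the (target, context) training pairs from a tokenized sentence; training a model on those pairs is not shown. The function name `skipgram_pairs` and the window size are assumptions made for illustration.

```python
def skipgram_pairs(tokens, window=1):
    """Pair every word with each neighbour at most `window` positions away."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

pairs = skipgram_pairs(['the', 'quick', 'brown', 'fox'], window=1)
print(pairs)
# [('the', 'quick'), ('quick', 'the'), ('quick', 'brown'),
#  ('brown', 'quick'), ('brown', 'fox'), ('fox', 'brown')]
```

A skip-gram model learns embeddings by predicting the context word from the target word, while CBOW goes the other way and predicts the target from its context.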