## Word embeddings
A word embedding is a dense vector representation of words that captures something about their meaning, in contrast to a sparse bag-of-words encoding. The position of a word within the vector space is learned from text and is based on the words that surround it when it is used. Simply put, a set of fixed-length, dense, continuous-valued vectors is trained from a large corpus of text. The **gensim** Python library provides built-in functions for this task. Two common methods are Word2Vec and GloVe.
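
As a rough illustration of why dense vectors can capture meaning where bag-of-words indicators cannot, the sketch below compares cosine similarity in both representations. The three-dimensional dense vectors are hand-picked for illustration and are not learned from any corpus; `cosine` is a helper defined here, not a library function.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# One-hot indicators: every pair of distinct words is orthogonal.
one_hot_vecs = {"good": [1, 0, 0], "great": [0, 1, 0], "poor": [0, 0, 1]}

# Hypothetical dense vectors, chosen by hand so that similar words point the same way.
dense_vecs = {
    "good": [0.9, 0.8, 0.1],
    "great": [0.8, 0.9, 0.2],
    "poor": [-0.7, -0.6, 0.3],
}

sim_one_hot = cosine(one_hot_vecs["good"], one_hot_vecs["great"])
sim_dense_close = cosine(dense_vecs["good"], dense_vecs["great"])
sim_dense_far = cosine(dense_vecs["good"], dense_vecs["poor"])
print(sim_one_hot, sim_dense_close, sim_dense_far)
```

With one-hot indicators, "good" and "great" look as unrelated as any other pair of words, while the dense vectors can place them close together and far from "poor".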

Keras provides an **Embedding** layer for neural networks on text data. The input must be integer encoded, so that each word is represented by a unique integer. It is used as the first layer of a network and takes the following arguments:   
* **input_dim**: the size of the vocabulary in the text data.   
* **output_dim**: the size of the vector space in which words will be embedded (the size of the output vector from this layer for each word).   
* **input_length**: the length of the input sequences. E.g., if all of your input documents consist of 1000 words, input_length=1000.   

For example, we define an **Embedding** layer with a vocabulary of 200 (i.e., words integer encoded from 0 to 199), a vector space of 32 dimensions in which words will be embedded, and input documents of 50 words each.

In [2]:
from keras.layers import Embedding
e = Embedding(200,32,input_length=50)

Using TensorFlow backend.
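
Conceptually, an embedding layer of this size can be thought of as a lookup table with one row per vocabulary index. The NumPy sketch below mimics that idea with random numbers standing in for the learned weights; the names `weights`, `batch` and `embedded` are illustrative and this is not how Keras stores or initializes the layer.

```python
import numpy as np

input_dim, output_dim, input_length = 200, 32, 50

# Stand-in for the layer's weight matrix: one row per vocabulary index.
rng = np.random.default_rng(0)
weights = rng.normal(size=(input_dim, output_dim))

# A batch of 3 integer-encoded documents, each 50 tokens long.
batch = rng.integers(low=0, high=input_dim, size=(3, input_length))

# The forward pass amounts to row indexing: each integer selects its vector.
embedded = weights[batch]
print(embedded.shape)  # (3, 50, 32)
```

Each document of 50 integers becomes a 50 x 32 matrix, which is why a flattening step is needed before a fully connected layer.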


The output of the Embedding layer is a 2D matrix with one embedding vector for each word in the input sequence (input document). As a result, to connect a Dense layer directly to an Embedding layer, we need to flatten the 2D output matrix to a 1D vector using the **Flatten** layer. Let's bring the code together.

In [2]:
import numpy as np
from keras.models import Sequential
from keras.preprocessing.text import one_hot
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Dense, Flatten, Embedding

Using TensorFlow backend.


Let's begin with a tiny collection of 10 text documents, each a comment about a student's work. Each document is labeled as positive (1) or negative (0).

In [3]:
docs =[ 'Well done!', 'Good work', 'Great effort', 'nice work','Excellent!','Weak',
       'Poor effort','not good','poor work','Could have done better.']
labels = np.array([1,1,1,1,1,0,0,0,0,0])

Next, we need to encode each document so that the **Embedding** layer receives sequences of integers. **one_hot** assigns the integers with a hash function, so we use a vocabulary size of 50, much larger than the number of distinct words, to reduce the probability of collisions.

In [4]:
vocab_size = 50
encoded_docs = [one_hot(d,vocab_size) for d in docs]
print(encoded_docs)

[[16, 1], [28, 38], [32, 33], [38, 38], [40], [7], [34, 33], [31, 28], [34, 38], [17, 9, 1, 22]]
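
In the output above, "nice" and "work" both received the index 38, which is a hash collision. The sketch below illustrates the hashing idea with a hypothetical `hash_encode` helper built on `hashlib.md5`. It is not Keras's implementation, so the indices will differ from those printed above, but it shows how the number of collisions depends on the size of the index range.

```python
import hashlib

def hash_encode(text, n):
    """Map each lowercased word to an index in 1..n-1 (0 is left free for padding)."""
    words = text.lower().replace("!", " ").replace(".", " ").split()
    return [
        int(hashlib.md5(w.encode("utf-8")).hexdigest(), 16) % (n - 1) + 1
        for w in words
    ]

def count_collisions(words, n):
    """Number of distinct words that end up sharing an index with another word."""
    indices = {w: hash_encode(w, n)[0] for w in words}
    return len(indices) - len(set(indices.values()))

# The 14 distinct words of the toy corpus.
vocab = ["well", "done", "good", "work", "great", "effort", "nice", "excellent",
         "weak", "poor", "not", "could", "have", "better"]

for n in (15, 50, 5000):
    print(n, count_collisions(vocab, n))
```

A larger `n` spreads the words over more buckets, so collisions become rarer, although a hash-based encoding can never rule them out entirely.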


The sequences have different lengths, but Keras requires inputs of a fixed length, so shorter sequences are filled with zeros. This is known as padding. Here we pad all input sequences to length 4 with **pad_sequences()**.

In [5]:
max_length = 4
padded_docs = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
print(padded_docs)

[[16  1  0  0]
 [28 38  0  0]
 [32 33  0  0]
 [38 38  0  0]
 [40  0  0  0]
 [ 7  0  0  0]
 [34 33  0  0]
 [31 28  0  0]
 [34 38  0  0]
 [17  9  1 22]]
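
What post-padding does can be sketched in plain Python. The `pad_post` helper below is hypothetical and only mimics `pad_sequences(..., padding='post')`; for sequences longer than `maxlen` it keeps the last `maxlen` items, which, as far as I know, matches the Keras default of `truncating='pre'`.

```python
def pad_post(sequences, maxlen, value=0):
    """Pad on the right with `value`; keep the last `maxlen` items of longer
    sequences. Assumes maxlen >= 1."""
    padded = []
    for seq in sequences:
        kept = list(seq)[-maxlen:]  # drop leading items if the sequence is too long
        padded.append(kept + [value] * (maxlen - len(kept)))
    return padded

print(pad_post([[16, 1], [40], [17, 9, 1, 22], [1, 2, 3, 4, 5]], maxlen=4))
# [[16, 1, 0, 0], [40, 0, 0, 0], [17, 9, 1, 22], [2, 3, 4, 5]]
```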


The Embedding layer has a vocabulary of 50 and an input length of 4. We choose an embedding size of 8, so the output of the Embedding layer is 4 vectors of 8 dimensions each, one per word; we then flatten this into a single 32-element vector to pass to the **Dense** output layer.

In [7]:
def create_model():
    model =Sequential()
    model.add(Embedding(vocab_size,8,input_length = max_length))
    model.add(Flatten())
    model.add(Dense(1,activation='sigmoid'))
    # compile model
    model.compile(loss='binary_crossentropy',optimizer='adam',metrics =['acc'])
    model.summary()
    model.fit(padded_docs,labels,epochs=50,verbose = 0)
    return model

model = create_model()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 4, 8)              400       
_________________________________________________________________
flatten_2 (Flatten)          (None, 32)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 33        
Total params: 433
Trainable params: 433
Non-trainable params: 0
_________________________________________________________________
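
The parameter counts in the summary above can be checked by hand. The short calculation below uses only the sizes already chosen for this model; the variable names are illustrative.

```python
vocab_size, embed_dim, max_length = 50, 8, 4

embedding_params = vocab_size * embed_dim   # one 8-dimensional vector per vocabulary index
flatten_units = max_length * embed_dim      # 4 vectors of 8 values laid end to end
dense_params = flatten_units * 1 + 1        # one weight per input plus one bias

print(embedding_params, flatten_units, dense_params, embedding_params + dense_params)
# 400 32 33 433
```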


In [8]:
loss, accuracy = model.evaluate(padded_docs,labels,verbose=0)
print('Accuracy: %f'% (accuracy*100))

Accuracy: 80.000001


## Using Pre-Trained GloVe Embedding   
The Keras Embedding layer can also use a word embedding learned elsewhere. For example, we can download the smallest pre-trained GloVe package, **glove.6B.zip**, which includes 50-, 100-, 200- and 300-dimensional vectors. Interested readers can refer to the example **pretrained_word_embeddings.py**; the file used here is **glove.6B.100d.txt**.

In [None]:
# Load the pre-trained GloVe vectors into a dict: word -> coefficient vector
import numpy as np

embeddings_index = dict()
with open('glove.6B.100d.txt', mode='rt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

Loading the whole embedding this way is pretty slow. It is better to keep only the embeddings for the unique words in the training data. We create a matrix with one embedding per word in the training set by enumerating all unique words in **Tokenizer.word_index** and looking up each word's weight vector in the loaded GloVe embedding. The result is a matrix of weights only for the words seen in training.

In [None]:
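# create a weight matrix for the words in the training data
# (row i holds the GloVe vector of the word with index i; unknown words stay all-zero)
embedding_matrix = np.zeros((vocab_size, 100))
for word, i in t.word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector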
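
The parsing and lookup steps can be tried without downloading the GloVe file. The sketch below uses two made-up lines in the same text format (a word followed by its coefficients) and a hypothetical word index of the kind **Tokenizer.word_index** produces; the numbers are invented and only 3 dimensions are used for brevity.

```python
import io
import numpy as np

# Two made-up lines in GloVe text format; real files have 50 to 300 coefficients per word.
sample = io.StringIO("good 0.1 0.2 0.3\nwork -0.4 0.5 0.6\n")

embeddings_index = {}
for line in sample:
    values = line.split()
    embeddings_index[values[0]] = np.asarray(values[1:], dtype='float32')

# Hypothetical word index (indices start at 1; 0 is reserved for padding).
word_index = {"work": 1, "good": 2, "excellent": 3}
vocab_size = len(word_index) + 1

embedding_matrix = np.zeros((vocab_size, 3))
for word, i in word_index.items():
    vector = embeddings_index.get(word)
    if vector is not None:  # words missing from the sample keep an all-zero row
        embedding_matrix[i] = vector

print(embedding_matrix)
```

Row 0 (padding) and the row for "excellent", which is absent from the sample, stay at zero, while the other rows hold the parsed vectors.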

The following **Embedding** layer uses the 100-dimensional version, so output_dim=100. To avoid updating the pre-trained word weights during training, we set **trainable**=False when creating the model. The full example follows:

In [None]:
import numpy as np
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense, Flatten, Embedding

docs = ['Well done!','Good work','Great effort','nice work','Excellent!','Weak',
        'Poor effort!','not good','poor work','Could have done better.']
labels = np.array([1,1,1,1,1,0,0,0,0,0])

t = Tokenizer()
t.fit_on_texts(docs)
vocab_size = len(t.word_index) +1
# integer encode the documents
encoded_docs = t.texts_to_sequences(docs)
print(encoded_docs)

# padding docs to a max length of 4 words
max_length =4
padded_docs = pad_sequences(encoded_docs, maxlen = max_length, padding='post')
print(padded_docs)


# load the whole embedding 
embeddings_index = dict()
f = open('glove.6B.100d.txt', mode='rt', encoding='utf-8')
for line in f:
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:],dtype='float32')
    embeddings_index[word] = coefs
f.close()
print('Loaded %s word vectors'% len(embeddings_index))


# create a weight matrix for words in training
embedding_matrix = np.zeros((vocab_size,100))
for word, i in t.word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
        
# create model
def create_model():
    model =Sequential()
    e = Embedding(vocab_size,100, weights=[embedding_matrix],input_length=4, trainable = False)
    model.add(e)
    model.add(Flatten())
    model.add(Dense(1,activation='sigmoid'))
    # compile
    model.compile(loss='binary_crossentropy',optimizer='adam', metrics=['acc'])
    model.summary()
    # fit model
    model.fit(padded_docs,labels,epochs=50,verbose=0)
    return model

# build and train the model, then evaluate it on the training data
model = create_model()
loss, accuracy = model.evaluate(padded_docs,labels, verbose=0)
print('Accuracy %f'% (accuracy*100))

