# Pretrained Embedding layer

In Keras, the embedding matrix is represented as a "layer" that maps non-negative integers (indices corresponding to words) to dense vectors of fixed size (the embedding vectors). It can be trained from scratch or initialized with a pretrained embedding. In this part, you will learn how to create an [Embedding()](https://keras.io/layers/embeddings/) layer in Keras and initialize it with the GloVe 50-dimensional vectors loaded earlier in the notebook. Because our training set is quite small, we will not update the word embeddings and will instead leave their values fixed. The code below shows how Keras lets you either train this layer or keep it fixed.

* The `Embedding()` layer takes an integer matrix of size (batch size, max input length) as input. This corresponds to sentences converted into lists of indices (integers), as shown in the figure below.
    * The `batch size` is the number of training examples in each mini-batch
    * The `max input length` is the maximum number of words among all training examples in the mini-batch

<img src="../images/embedding1.png" style="width:700px;height:250px;">
<caption><center> **Figure 4**: Embedding layer. This example shows the propagation of two examples through the embedding layer. Both have been zero-padded to a length of `max_len=5`. The output has shape `(2, max_len, 50)` because the word embeddings we are using are 50-dimensional. </center></caption>

* The embedding layer contains an embedding matrix with shape (vocabulary size, dimension of word vectors)
* Every integer (i.e. word index) in the input must be strictly smaller than the vocabulary size given to the layer, so valid indices run from 0 to vocabulary size - 1.
* The layer outputs an array of shape (batch size, max input length, dimension of word vectors).
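
The shape bookkeeping in the bullets above can be checked with plain NumPy, since an embedding lookup amounts to indexing rows of a matrix. This is only a sketch: the matrix `E` is filled with random numbers rather than GloVe vectors, the sizes are illustrative, and zeroing row 0 for padding is an assumption made for this example.

```python
import numpy as np

vocab_size, embed_dim = 19, 50            # illustrative sizes, not tied to any dataset
rng = np.random.default_rng(0)
E = rng.normal(size=(vocab_size, embed_dim))
E[0] = 0.0                                # assumption: index 0 is the padding row

# Two sentences, zero-padded to max_len = 5: shape (batch size, max input length)
indices = np.array([[12, 13, 14, 0, 0],
                    [6, 1, 10, 0, 0]])

# Row lookup: each index is replaced by the matching row of E
embedded = E[indices]
print(embedded.shape)                     # (2, 5, 50)
```

Each index must be a valid row of `E`; an index equal to `vocab_size` would raise an `IndexError` here.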

The first step is to convert all your training sentences into lists of indices, and then zero-pad all these lists so that their length is the length of the longest sentence. 

**Exercise**: Implement the function below to convert X (array of sentences as strings) into an array of indices corresponding to words in the sentences. The output shape should be such that it can be given to `Embedding()` (described in Figure 4).
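
One possible shape for such a function is sketched below in plain NumPy. The name `sentences_to_indices`, the `word_to_index` argument, and the toy vocabulary are hypothetical, and mapping unknown words to 0 is an assumption of this sketch rather than something the exercise prescribes.

```python
import numpy as np

def sentences_to_indices(X, word_to_index, max_len):
    """Convert an array of sentences into a zero-padded (m, max_len) index array."""
    X_indices = np.zeros((len(X), max_len), dtype=np.int64)
    for i, sentence in enumerate(X):
        words = sentence.lower().split()
        # Truncate sentences longer than max_len; shorter ones stay zero-padded
        for j, word in enumerate(words[:max_len]):
            # Assumption: words missing from the vocabulary map to 0
            X_indices[i, j] = word_to_index.get(word, 0)
    return X_indices

# Toy vocabulary, for illustration only
toy_word_to_index = {"i": 1, "like": 2, "it": 3, "love": 4, "you": 5}
X = np.array(["I like it", "love you"])
X_indices = sentences_to_indices(X, toy_word_to_index, max_len=5)
print(X_indices)
```

The result has shape `(2, 5)`, which matches the (batch size, max input length) input that `Embedding()` expects.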

In [16]:
from keras.preprocessing.text import Tokenizer
from keras.layers import Embedding
import numpy as np

In [8]:
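def get_word2vec(file_name):
    """Load GloVe vectors from a text file into a {word: vector} dict."""
    word2vec = {}
    # GloVe files are UTF-8; be explicit so the platform default is not used.
    with open(file_name, 'r', encoding='utf-8') as f:
        for line in f:
            # Each line is: word v1 v2 ... vD
            values = line.strip().split()
            word2vec[values[0]] = np.array(values[1:], dtype=np.float64)
    return word2vec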

In [9]:
tokenizer = Tokenizer()
texts = ["The sun is shining in June!","September is grey.","Life is beautiful in August.","I like it","This and other things?"]
tokenizer.fit_on_texts(texts)
word2index = tokenizer.word_index
print(word2index)

{'life': 9, 'june': 6, 'in': 2, 'things': 18, 'it': 14, 'other': 17, 'is': 1, 'the': 3, 'and': 16, 'august': 11, 'sun': 4, 'this': 15, 'i': 12, 'beautiful': 10, 'september': 7, 'shining': 5, 'like': 13, 'grey': 8}


In [10]:
tokenizer.texts_to_sequences(["June is beautiful and I like it!"])

[[6, 1, 10, 16, 12, 13, 14]]
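
To see where these indices come from, here is a plain-Python sketch that approximates the tokenizer's default behavior: lowercase the text, treat punctuation as separators, and rank words by frequency, with index 0 left unused. It is not the Keras implementation, and the helper names `simple_tokenize`, `build_word_index`, and `to_sequences` are hypothetical.

```python
from collections import Counter
import string

def simple_tokenize(text):
    # Lowercase, replace punctuation with spaces, split on whitespace
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return text.lower().translate(table).split()

def build_word_index(texts):
    counts = Counter()
    for text in texts:
        counts.update(simple_tokenize(text))
    # Most frequent word gets index 1; ties keep first-seen order (stable sort)
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return {word: rank + 1 for rank, (word, _) in enumerate(ranked)}

def to_sequences(texts, word_index):
    # Words that are not in the index are skipped in this sketch
    return [[word_index[w] for w in simple_tokenize(t) if w in word_index]
            for t in texts]

texts = ["The sun is shining in June!", "September is grey.",
         "Life is beautiful in August.", "I like it", "This and other things?"]
word_index = build_word_index(texts)
sequences = to_sequences(["June is beautiful and I like it!"], word_index)
print(sequences)  # [[6, 1, 10, 16, 12, 13, 14]]
```

"is" appears three times and "in" twice, which is why they receive indices 1 and 2; every other word appears once and is numbered in order of first appearance.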

In [12]:
word2vec = get_word2vec('./data/glove.6B.50d.txt')
print(len(word2vec))

400000


In [31]:
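MAX_SEQUENCE_LENGTH = 100
embed_dim = 50
# Index 0 is never assigned by the Tokenizer, so reserve one extra row for it (padding).
vocab_size = len(word2index) + 1
embed_matrix = np.zeros((vocab_size, embed_dim))

# Copy the GloVe vector for every known word; words without one keep an all-zero row.
for word, index in word2index.items():
    vector = word2vec.get(word)
    if vector is not None:
        embed_matrix[index] = vector[:embed_dim]

print(embed_matrix.shape)

# trainable=False keeps the pretrained vectors fixed; set trainable=True to fine-tune them.
embedding_layer = Embedding(embed_matrix.shape[0],
                            embed_matrix.shape[1],
                            weights=[embed_matrix],
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=False)

# A layer object has no .shape attribute; print the matrix it was initialized with.
print(embed_matrix)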

(19, 50)
[[  0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00]
 [  6.18500000e-01   6.42540000e-01  -4.65520000e-01   3.75700000e-01
    7.48380000e-01   5.37390000e-01   2.2239
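
The all-zero first row in the output above is row 0, which no word maps to. The same matrix construction can be tried on a toy vocabulary to see which rows stay zero. In this sketch the 3-dimensional vectors, the `toy_` names, and the made-up word `zzzunknown` are all hypothetical; the `coverage` figure is an extra diagnostic, not part of the original cell.

```python
import numpy as np

embed_dim = 3
# Hypothetical pretrained vectors (stand-in for the GloVe dictionary)
toy_word2vec = {"sun": np.array([0.1, 0.2, 0.3]),
                "june": np.array([0.4, 0.5, 0.6])}
# Hypothetical tokenizer index; "zzzunknown" has no pretrained vector
toy_word2index = {"sun": 1, "june": 2, "zzzunknown": 3}

vocab_size = len(toy_word2index) + 1      # +1 for the unused index 0
embed_matrix = np.zeros((vocab_size, embed_dim))

missing = []
for word, index in toy_word2index.items():
    vector = toy_word2vec.get(word)
    if vector is None:
        missing.append(word)              # row stays all zeros
    else:
        embed_matrix[index] = vector

coverage = 1 - len(missing) / len(toy_word2index)
print(embed_matrix)
print(missing, coverage)
```

Rows 0 and 3 remain zero: the first because no word uses index 0, the second because the word has no pretrained vector.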

* data (the comments) -- N x T
* targets -- N x 6
* embedding -- V x D

where 
* N is the number of samples
* T is the sequence length
* V is the vocabulary size
* D is the embedding dimension

In [23]:
from keras.preprocessing import sequence
from keras.models import Model
from keras.layers import Dense, Input, GlobalMaxPooling1D
from keras.layers import Conv1D, MaxPooling1D, Embedding



In [30]:
input_ = Input(shape=(MAX_SEQUENCE_LENGTH,))
x = embedding_layer(input_)
x = Conv1D(128, 3, activation='relu')(x)
x = MaxPooling1D(3)(x)
x = Conv1D(128, 3, activation='relu')(x)
x = MaxPooling1D(3)(x)
x = Conv1D(128, 3, activation='relu')(x)
x = GlobalMaxPooling1D()(x)
x = Dense(128, activation='relu')(x)
output = Dense(6, activation='sigmoid')(x)

model = Model(input_, output)

model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_6 (InputLayer)         (None, 100)               0         
_________________________________________________________________
embedding_5 (Embedding)      (None, 100, 50)           950       
_________________________________________________________________
conv1d_14 (Conv1D)           (None, 98, 128)           19328     
_________________________________________________________________
max_pooling1d_10 (MaxPooling (None, 32, 128)           0         
_________________________________________________________________
conv1d_15 (Conv1D)           (None, 30, 128)           49280     
_________________________________________________________________
max_pooling1d_11 (MaxPooling (None, 10, 128)           0         
_________________________________________________________________
conv1d_16 (Conv1D)           (None, 8, 128)            49280     
__________
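
The output shapes and parameter counts in the visible part of this summary can be reproduced by hand. The sketch below assumes the layers use "valid" padding, a convolution stride of 1, and a pooling stride equal to the pool size; the helper names are hypothetical.

```python
def conv1d_out_len(length, kernel_size):
    # "valid" padding, stride 1 (assumed): every window must fit entirely
    return length - kernel_size + 1

def pool1d_out_len(length, pool_size):
    # Non-overlapping windows (stride assumed equal to pool_size); leftovers are dropped
    return length // pool_size

def conv1d_params(in_channels, filters, kernel_size):
    # kernel_size x in_channels weights per filter, plus one bias per filter
    return kernel_size * in_channels * filters + filters

T, V, D = 100, 19, 50                     # sequence length, vocabulary size, embedding dim
embedding_params = V * D                  # one D-dimensional row per index

rows = []
length, channels = T, D
for layer in ["conv", "pool", "conv", "pool", "conv"]:
    if layer == "conv":
        params = conv1d_params(channels, 128, 3)
        length, channels = conv1d_out_len(length, 3), 128
    else:
        params = 0                        # pooling has no weights
        length = pool1d_out_len(length, 3)
    rows.append((layer, length, channels, params))

print(embedding_params)                   # 950
for row in rows:
    print(row)
```

The first convolution sees 50 input channels, which gives 3 x 50 x 128 + 128 = 19328 parameters; the later ones see 128 channels, which gives 3 x 128 x 128 + 128 = 49280.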