# Pretrained Embedding layer

In Keras, the embedding matrix is represented as a "layer" that maps positive integers (indices corresponding to words) to dense vectors of fixed size (the embedding vectors). It can be trained from scratch or initialized with pretrained embeddings. In this part, you will learn how to create an [Embedding()](https://keras.io/layers/embeddings/) layer in Keras and initialize it with the GloVe 50-dimensional vectors loaded earlier in the notebook. Because our training set is quite small, we will not update the word embeddings and will instead leave their values fixed. The code below nevertheless shows how Keras lets you either train this layer or keep it fixed.

* The `Embedding()` layer takes an integer matrix of size (batch size, max input length) as input. This corresponds to sentences converted into lists of indices (integers), as shown in the figure below.
    * The `batch size` is the number of training examples in each mini-batch
    * The `max input length` is the maximum number of words among all training examples in the mini-batch

<img src="images/embedding1.png" style="width:700px;height:250px;">
<caption><center> **Figure 4**: Embedding layer. This example shows the propagation of two examples through the embedding layer. Both have been zero-padded to a length of `max_len=5`. The final dimension of the representation is  `(2,max_len,50)` because the word embeddings we are using are 50 dimensional. </center></caption>

* The embedding layer contains an embedding matrix with shape (vocabulary size, dimension of word vectors).
* The largest integer (i.e. word index) in the input must be smaller than the vocabulary size, because the rows of the embedding matrix are indexed from 0 to vocabulary size - 1.
* The layer outputs an array of shape (batch size, max input length, dimension of word vectors).
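
To make these shapes concrete, the lookup can be sketched with plain NumPy indexing. This is only an illustration under assumed toy sizes (a vocabulary of 6 indices and random 50-dimensional vectors named `toy_embedding_matrix`); it is not how Keras implements the layer internally.

```python
import numpy as np

vocab_size, embed_dim = 6, 50  # assumed toy sizes, not the real GloVe vocabulary
rng = np.random.default_rng(0)
toy_embedding_matrix = rng.normal(size=(vocab_size, embed_dim))

# Two zero-padded sentences of length max_len = 5 (index 0 is used as padding)
indices = np.array([[1, 2, 3, 0, 0],
                    [4, 5, 0, 0, 0]])

# Integer indexing picks one row of the matrix for every word index
embedded = toy_embedding_matrix[indices]
print(embedded.shape)  # (2, 5, 50)
```

Each entry of `indices` is replaced by the matching row of the matrix, which is why the output gains a trailing dimension equal to the word-vector size.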

The first step is to convert all your training sentences into lists of indices, and then zero-pad all these lists so that their length is the length of the longest sentence. 

**Exercise**: Implement the function below to convert X (array of sentences as strings) into an array of indices corresponding to words in the sentences. The output shape should be such that it can be given to `Embedding()` (described in Figure 4).

In [14]:
import numpy as np
from keras.preprocessing.text import Tokenizer

In [15]:
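def get_word2vec(file_name):
    """Read a GloVe text file into a dict mapping each word to its vector."""
    word2vec = {}
    # Each line holds a word followed by the components of its vector
    with open(file_name, 'r', encoding='utf-8') as f:
        for line in f:
            line = line.strip().split()
            word2vec[line[0]] = np.array(line[1:], dtype=np.float64)
    return word2vec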
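
The parsing above assumes that every line of the file has the layout `word v1 v2 ... vn`, separated by whitespace. As a small sketch of that assumption, the same loop can be run on an in-memory stand-in for the file; the two lines and their 3-dimensional values below are invented for illustration and are not real GloVe vectors.

```python
import io
import numpy as np

# Made-up lines in the "word v1 v2 ... vn" layout (values are invented)
toy_file = io.StringIO("sun 0.1 0.2 0.3\njune -0.4 0.5 0.6\n")

toy_word2vec = {}
for line in toy_file:
    parts = line.strip().split()
    # First token is the word, the remaining tokens are the vector components
    toy_word2vec[parts[0]] = np.array(parts[1:], dtype=np.float64)

print(toy_word2vec["june"])
```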


In [16]:
tokenizer = Tokenizer()
texts = ["The sun is shining in June!","September is grey.","Life is beautiful in August.","I like it","This and other things?"]
tokenizer.fit_on_texts(texts)
word2index = tokenizer.word_index
print(word2index)

{'shining': 5, 'is': 1, 'i': 12, 'things': 18, 'august': 11, 'beautiful': 10, 'like': 13, 'other': 17, 'the': 3, 'september': 7, 'this': 15, 'it': 14, 'and': 16, 'grey': 8, 'life': 9, 'june': 6, 'in': 2, 'sun': 4}


In [17]:
tokenizer.texts_to_sequences(["June is beautiful and I like it!"])

[[6, 1, 10, 16, 12, 13, 14]]
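
Sequences such as the one above have different lengths, so they still need zero-padding before they can be stacked into a (batch size, max input length) array. A minimal hand-written sketch follows; `pad_sequences_with_zeros` is a hypothetical helper introduced here for illustration, and padding at the end of each sequence is an assumed choice. Index 0 is free to act as padding because the `Tokenizer` indices printed above start at 1.

```python
def pad_sequences_with_zeros(sequences, max_len):
    """Right-pad each list of indices with 0 up to max_len (hypothetical helper)."""
    padded = []
    for seq in sequences:
        truncated = seq[:max_len]  # cut sequences that are longer than max_len
        padded.append(truncated + [0] * (max_len - len(truncated)))
    return padded

# "June is beautiful and I like it!" and "I like it" as index lists
sequences = [[6, 1, 10, 16, 12, 13, 14], [12, 13, 14]]
max_len = max(len(seq) for seq in sequences)

padded = pad_sequences_with_zeros(sequences, max_len)
print(padded)  # [[6, 1, 10, 16, 12, 13, 14], [12, 13, 14, 0, 0, 0, 0]]
```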

In [19]:
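from keras.layers import Embedding

# Keep the returned dictionary so it can be queried below
word2vec = get_word2vec('./data/glove.6B.50d.txt')

embed_dim = 50  # must match the dimension of the GloVe vectors (50d)
vocab_size = len(word2index) + 1  # Tokenizer indices start at 1; index 0 is left for padding
embedding_matrix = np.zeros((vocab_size, embed_dim))

# Copy the pretrained vector of every known word into its row;
# words missing from GloVe keep an all-zero row
for word, index in word2index.items():
    vector = word2vec.get(word)
    if vector is not None:
        embedding_matrix[index] = vector

print(embedding_matrix.shape)

# trainable=False keeps the pretrained vectors fixed; set it to True to fine-tune them
embedding_layer = Embedding(embedding_matrix.shape[0],
                            embedding_matrix.shape[1],
                            weights=[embedding_matrix],
                            input_length=12,
                            trainable=False)

print(embedding_layer)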

FileNotFoundError: [Errno 2] No such file or directory: './data/glove.6B.50d.txt'