#### Building a model from a pretrained GloVe model
- Training your own embedding model is possible; however, it frequently requires a lot of data
- More often we start with an embedding model trained by Facebook, Google or Stanford and then either:
    1. Freeze the embedding layer, assuming the linguistic regularity of our corpus/documents matches that of the embeddings generated by the pretrained models
    2. Use the pretrained weights as a starting point and update them for our model

- In this class we won't go over fitting skipgram, word2vec or GloVe models from scratch because they require a lot of data and computational resources
- For a more thorough exposition see the following links:
    - [Word2Vec](https://radimrehurek.com/gensim/models/word2vec.html)
    - [Python gensim Word2Vec tutorial with TensorFlow and Keras](http://adventuresinmachinelearning.com/gensim-word2vec-tutorial/)
    - [A Word2Vec Keras tutorial](http://adventuresinmachinelearning.com/word2vec-keras-tutorial/)
    - [Code example: Word2Vec (skipgram) in Keras with Gensim](https://github.com/nzw0301/keras-examples/blob/master/Skip-gram-with-NS.ipynb)
    

#### What we are going to be doing in this notebook
- In this notebook we'll go through using pretrained embeddings on our IMDB data set
- When there isn't sufficient data to learn good embeddings from scratch, starting from a pre-trained model (transfer learning) usually helps

- In sequence we will:
    - Download the raw IMDB data set, unzip it, read it and assign pos/neg target values
    - Tokenize the IMDB data set
    - Download GloVe word embeddings
    - Prepare the GloVe word-embedding matrix
    - Set up our neural network
    - Load the GloVe embeddings into the model
        - Note we will freeze this layer so that training does not disrupt the pre-trained weights
    - Train the model
    - Compare with the model without pretrained word embeddings

### Setting up file imports for IMDB data
- Here we download the raw individual comments from IMDB users from `https://www.kaggle.com/iarunava/imdb-movie-reviews-dataset`
- Note that the code cells below load the Keras copy of the IMDB data (`tensorflow.keras.datasets.imdb`), which is already encoded as integer word indices
- The file structure of the raw download is:
    - /aclImdb
        - /test
            - /neg, /pos, urls_neg.txt, urls_pos.txt
        - /train
            - /neg, /pos, urls_neg.txt, urls_pos.txt
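
Reading the raw layout above into texts and labels can be sketched as follows. This is a minimal sketch, assuming the archive has been unpacked so that `aclImdb/train/neg` and `aclImdb/train/pos` each hold one `.txt` file per review. The helper name `load_imdb_split` and the label convention (neg = 0, pos = 1) are illustrative assumptions, not part of the data set.

```python
import os

def load_imdb_split(root_dir, split):
    """Read reviews from <root_dir>/<split>/{neg,pos} into parallel lists.

    Hypothetical helper: labels are assigned as neg -> 0, pos -> 1.
    """
    texts, labels = [], []
    for label_value, label_name in enumerate(["neg", "pos"]):
        folder = os.path.join(root_dir, split, label_name)
        # Sort file names so the result is reproducible across runs
        for file_name in sorted(os.listdir(folder)):
            if not file_name.endswith(".txt"):
                continue
            with open(os.path.join(folder, file_name), encoding="utf-8") as f:
                texts.append(f.read())
            labels.append(label_value)
    return texts, labels
```

A call such as `texts, labels = load_imdb_split("aclImdb", "train")` would then return the training reviews with their targets.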

In [None]:
# -- Import libraries --
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.datasets import imdb
import numpy as np
import os
# Restrict to 10000 most common words
max_words = 10000

In [None]:
# LOAD IMDB DATA
(x_train_val, y_train_val), (x_test, y_test) = imdb.load_data(num_words = max_words)
max_val_word_index = max([max(sequence) for sequence in x_train_val])
max_length_review = max([len(sequence) for sequence in x_train_val])

# Printing output
print(" Train and Validation data {x}\n Train labels {y}".format(x = x_train_val.shape, y = y_train_val.shape))
print("_"*20)
print(" Test data {x}\n Test labels {y}".format(x = x_test.shape, y = y_test.shape))
print("_"*20)
print("Maximum value of a word index {}".format(max_val_word_index))
print("Maximum length num words of review in train {}".format(max_length_review))


In [None]:
# --- Getting reviews ---
# Reverse from integers to words using the DICTIONARY (given by keras...need to do nothing to create it)
# note that words without a mapping are given by a "?"
word_index = imdb.get_word_index()

reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])

decoded_review = ' '.join([reverse_word_index.get(i - 3, '?') for i in x_train_val[123]])

print(decoded_review)
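
The `i - 3` shift above exists because `imdb.load_data` reserves the first indices by default (0 for padding, 1 for the start of a review, 2 for out-of-vocabulary words), so a word with rank `r` in `word_index` is stored as `r + 3`. A minimal sketch with a made-up three-word vocabulary (the `toy_word_index` dictionary and the helper names are illustrative, not part of Keras):

```python
INDEX_FROM = 3  # offset used by imdb.load_data (assumption: default arguments)

toy_word_index = {"the": 1, "movie": 2, "great": 3}  # word -> frequency rank
toy_reverse_index = {rank: word for word, rank in toy_word_index.items()}

def encode(words):
    """Map words to ids the way the Keras IMDB data does: start marker, then rank + 3."""
    ids = [1]  # 1 marks the start of a review
    for w in words:
        # 2 is the id reserved for out-of-vocabulary words
        ids.append(toy_word_index[w] + INDEX_FROM if w in toy_word_index else 2)
    return ids

def decode(ids):
    """Undo the offset; reserved ids (0, 1, 2) have no word and become '?'."""
    return " ".join(toy_reverse_index.get(i - INDEX_FROM, "?") for i in ids)

encoded = encode(["the", "movie", "was", "great"])
print(encoded)          # [1, 4, 5, 2, 6]
print(decode(encoded))  # ? the movie ? great
```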

In [None]:
# -- Parameter setup --
# Cutoff reviews after 100 words
maxlen = 100
# Train on 20000 samples
training_samples = 20000
# Validation on 5000 samples
validation_samples = 5000

In [None]:
# Padding/bounding number of words in a sequence
data = pad_sequences(x_train_val, maxlen = maxlen)
labels = np.asarray(y_train_val)
print('Shape of data tensor:', data.shape)
print('Shape of label tensor:', labels.shape)
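
With its default settings, `pad_sequences` pads and truncates at the front of each sequence, so only the last `maxlen` tokens of a long review are kept. A minimal NumPy sketch of that default behaviour (`pad_pre` is a hypothetical stand-in, not the Keras implementation, and it assumes `maxlen` is at least 1):

```python
import numpy as np

def pad_pre(sequences, maxlen, value=0):
    """Imitate pad_sequences defaults: pad and truncate at the front."""
    out = np.full((len(sequences), maxlen), value, dtype="int32")
    for row, seq in enumerate(sequences):
        kept = list(seq)[-maxlen:]        # keep the last maxlen tokens
        if kept:
            out[row, -len(kept):] = kept  # right-align, leaving padding on the left
    return out

padded = pad_pre([[1, 2, 3, 4], [5]], maxlen=3)
print(padded)
```

The first toy sequence loses its first token, while the second is padded with zeros on the left.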

In [None]:
# -- Shuffling indices --
np.random.seed(1234)
# Shuffling the data set in case the samples are ordered
indices = np.arange(data.shape[0])
np.random.shuffle(indices)
# --- Training and validation set ---
data = data[indices]
labels = labels[indices]

# - Setting up train/validation split -
x_train = data[:training_samples]
y_train = labels[:training_samples]
x_val = data[training_samples:(training_samples + validation_samples)]
y_val = labels[training_samples:(training_samples + validation_samples)]

In [None]:
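# Downloading the GloVe file
# WARNING - this is a big file, about 1GB.
# Import the submodule explicitly: a bare `import urllib` does not guarantee urllib.request is loaded
import urllib.request
urllib.request.urlretrieve("http://nlp.stanford.edu/data/glove.6B.zip", '/tmp/glove')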

In [None]:
import zipfile
# The context manager closes the archive once extraction finishes
with zipfile.ZipFile("/tmp/glove", 'r') as zip_ref:
    zip_ref.extractall("/tmp/glove.6B")
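
Because the archive is large, it can be worth skipping the download when the file is already on disk. A minimal sketch, assuming a hypothetical helper `download_if_missing` (the `fetch` parameter is only there so the logic can be exercised without network access):

```python
import os
import urllib.request

def download_if_missing(url, dest_path, fetch=urllib.request.urlretrieve):
    """Download url to dest_path unless a non-empty file is already there."""
    if os.path.exists(dest_path) and os.path.getsize(dest_path) > 0:
        return dest_path  # reuse the cached copy
    fetch(url, dest_path)
    return dest_path
```

For this notebook the call would be `download_if_missing("http://nlp.stanford.edu/data/glove.6B.zip", "/tmp/glove")`.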

#### Build a vector map
- We are going to use the 100-dimensional GloVe embedding
- Before we do so we need to build a dictionary of words:vector_embeddings
- Next we build an index that maps words (as strings) to their vector representation in 100D
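
Each line of a GloVe text file holds a word followed by its vector components, separated by spaces. A minimal sketch of the parsing step on an in-memory sample (the three-dimensional vectors are made-up toy values, and `parse_glove_lines` is a hypothetical helper name):

```python
import io
import numpy as np

def parse_glove_lines(lines):
    """Build a {word: vector} dictionary from GloVe-formatted text lines."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = np.asarray(parts[1:], dtype="float32")
    return vectors

# Toy stand-in for the real file; the values are invented for illustration
sample = io.StringIO("happy 0.1 0.2 0.3\nsad -0.1 -0.2 -0.3\n")
toy_vectors = parse_glove_lines(sample)
print(toy_vectors["happy"].shape)  # (3,)
```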

In [None]:
# --- Parsing the GloVe word-embeddings file --
# After unzipping file 

glove_dir = os.path.join('/tmp/', 'glove.6B')

# Dictionary where we store the word:vector_embedding map
embeddings_index = {}
# Position (line number) of each word in the GloVe file.
# NOTE: this rebinds word_index, which earlier held the IMDB vocabulary
word_index = {}
count = 0

# Setting up embedding array
with open(os.path.join(glove_dir, 'glove.6B.100d.txt'), encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        # embeddings_index is a dictionary of words:word_vector_embeddings
        embeddings_index[word] = coefs
        word_index[word] = count
        count += 1

print('Found {} word vectors.'.format(len(embeddings_index)))

In [None]:
# What is the embedding of the word 'happy'
embeddings_index['happy']

- Here we set up the embedding matrix with only the words that we need
- Note that we loop over the vocabulary, selecting only the most frequently occurring words
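
For a frozen embedding layer to be meaningful, row `i` of the matrix has to hold the vector of the word that the data set encodes as integer `i`. A minimal sketch of that alignment on toy data, assuming the Keras IMDB convention that a word of frequency rank `r` is stored as `r + 3` (the helper `build_embedding_matrix` and all toy values are illustrative):

```python
import numpy as np

def build_embedding_matrix(word_rank, vectors, max_words, dim, index_from=3):
    """Place each known word's vector in the row matching its integer id."""
    matrix = np.zeros((max_words, dim), dtype="float32")
    for word, rank in word_rank.items():
        row = rank + index_from          # id used in the encoded reviews
        vector = vectors.get(word)
        if row < max_words and vector is not None:
            matrix[row] = vector         # unknown words keep an all-zero row
    return matrix

toy_rank = {"the": 1, "happy": 2, "zzz": 3}                  # word -> rank
toy_glove = {"the": np.ones(4), "happy": np.full(4, 0.5)}    # word -> vector
toy_matrix = build_embedding_matrix(toy_rank, toy_glove, max_words=6, dim=4)
print(toy_matrix)
```

Rows 0 to 3 stay at zero here because they correspond to the reserved ids and to rank positions that no toy word maps to.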

In [None]:
print('Our word index dictionary is given by: (word, index), a sample 10 entries are:')
list(word_index.items())[:10]

In [None]:
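# --- Preprocessing the GloVe word-embeddings matrix --
embedding_dim = 100

# The reviews are encoded with the IMDB vocabulary, so the matrix rows must
# follow that vocabulary (not the GloVe file order).
# imdb.load_data shifts each word rank by 3 (0 = padding, 1 = start, 2 = unknown)
index_from = 3
imdb_word_index = imdb.get_word_index()

# Instantiating a 10000 x 100 matrix
embedding_matrix = np.zeros((max_words, embedding_dim))
for word, rank in imdb_word_index.items():
    i = rank + index_from
    # Make sure that we are not exceeding the max token size
    if i < max_words:
        # Get the embedded vector for the word
        embedding_vector = embeddings_index.get(word)
        # Provided that a word is known, store it in the
        # embedding matrix at position i
        if embedding_vector is not None:
            embedding_matrix[i, :] = embedding_vector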

In [None]:
print("The size of the word embedding matrix is:" + str(embedding_matrix.shape))

In [None]:
# --- Defining our model ---
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Flatten, Dense

# -- Using the GloVe Embedding to train our model ---
model = Sequential()
model.add(Embedding(max_words, embedding_dim, input_length = maxlen))
model.add(Flatten())
model.add(Dense(32, activation='relu'))
model.add(Dense(1,activation = 'sigmoid'))
model.summary()

Note that we already have over 1 million parameters in this network. Most of those are contained in the embedding layer. 

But those embedding weights won't be free parameters. Instead, we'll take the pre-trained GloVe embeddings, insert them into the model, and then freeze those weights so that they aren't touched during gradient descent.
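
It can help to think of the embedding layer as a lookup table: each integer in a padded review selects one row of the weight matrix, and `Flatten` then concatenates those rows into a single vector per review. A rough NumPy sketch of that forward pass with small made-up sizes (this imitates the shapes only, not the Keras implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
toy_vocab, toy_dim, toy_len = 50, 4, 6

toy_weights = rng.normal(size=(toy_vocab, toy_dim))        # stands in for the embedding matrix
toy_batch = rng.integers(0, toy_vocab, size=(2, toy_len))  # two padded "reviews"

embedded = toy_weights[toy_batch]                   # row lookup -> (2, 6, 4)
flattened = embedded.reshape(len(toy_batch), -1)    # Flatten    -> (2, 24)
print(embedded.shape, flattened.shape)
```

Freezing the layer simply means the table stays fixed while the dense layers after it are trained.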

- Since we want to use our pretrained embeddings we need to:
    1. Load the weights into the first layer
    2. Freeze that layer to make sure that it does not get updated during training
- Here we load the pretrained weights into the first layer (the embedding layer) and then freeze the layer

In [None]:
# Setting up the weights
model.layers[0].set_weights([embedding_matrix])

# Freeze the GloVe layer (set trainable = True instead to fine-tune it)
model.layers[0].trainable = False

In [None]:
model.summary()

Note the big decrease in the number of trainable parameters.
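
The counts reported by `model.summary()` can be checked by hand. A short sketch of the arithmetic, using the values set earlier in the notebook (`max_words = 10000`, `embedding_dim = 100`, `maxlen = 100`) and the layer sizes of the model above:

```python
max_words, embedding_dim, maxlen = 10000, 100, 100

embedding_params = max_words * embedding_dim           # one 100-d vector per word
dense_1_params = (maxlen * embedding_dim) * 32 + 32    # flattened input x 32 units, plus biases
dense_2_params = 32 * 1 + 1                            # 32 inputs x 1 unit, plus bias

total_params = embedding_params + dense_1_params + dense_2_params
trainable_after_freeze = total_params - embedding_params

print(total_params)            # 1320065
print(trainable_after_freeze)  # 320065
```

Freezing the embedding layer therefore removes exactly the 1,000,000 embedding weights from the trainable set.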

In [None]:
# -- Training our model --
model.compile(optimizer = 'rmsprop',
              loss = 'binary_crossentropy',
              metrics=['acc'])
history = model.fit(x_train, y_train,
                    epochs = 50,
                    batch_size = 256,
                    validation_data = (x_val, y_val))

In [None]:
#-- Plotting the results --
import matplotlib.pyplot as plt

# ~ Plotting parameters ~
# Pulling out :
#   - Training: accuracy and loss
#   - validation: accuracy and loss
acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']

In [None]:
# Plotting the data 

# Training + Validation Accuracy
epochs = range(1,len(acc) + 1)
plt.plot( epochs, acc, 'bo', label = 'Training Accuracy')
plt.plot( epochs, val_acc, 'b', label = 'Validation Accuracy')
plt.title('Training and validation accuracy')
plt.legend()
plt.figure()

# Training + Validation Loss
plt.plot( epochs, loss, 'bo', label = 'Training Loss')
plt.plot( epochs, val_loss, 'b', label = 'Validation Loss')
plt.title('Training and validation loss')
plt.legend()
plt.show()

What do you think of these loss curves? What might we change in the model?

**Exercise**: Change the model using some techniques we've learned about in this class and refit. 
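
As one possible starting point for the exercise, it helps to locate the epoch where the validation loss stops improving, since training past that point typically means the model is overfitting. A minimal sketch, using made-up loss values in place of `history.history['val_loss']` (the helper `best_epoch` is hypothetical):

```python
def best_epoch(val_loss):
    """Return the 1-based epoch with the lowest validation loss."""
    return min(range(len(val_loss)), key=lambda i: val_loss[i]) + 1

toy_val_loss = [0.69, 0.61, 0.58, 0.60, 0.66, 0.75]  # made-up values
print(best_epoch(toy_val_loss))  # 3
```

Training for fewer epochs or adding some form of regularization are common options to try once that epoch is known.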