# Embedding Projector

http://projector.tensorflow.org/

## IMDB dataset

![image.png](./image/imdbdataset.png)

http://ai.stanford.edu/~amaas/data/sentiment/

In [1]:
import tensorflow as tf

In [2]:
import tensorflow_datasets as tfds




In [5]:
imdb, info = tfds.load("imdb_reviews",with_info=True,as_supervised=True)



In [11]:
import numpy as np

In [7]:
train_data, test_data = imdb['train'] , imdb['test']

### Convert the sentences to numpy arrays

In [8]:
training_sentences = []
training_labels = []

In [9]:
test_sentences = []
test_labels = []

In [12]:

# s.numpy() returns bytes; in Python 3, str(s.numpy()) turns it into a string (the b'...' prefix is kept)
for s,l in train_data:
  training_sentences.append(str(s.numpy()))
  training_labels.append(l.numpy())
  
for s,l in test_data:
  test_sentences.append(str(s.numpy()))
  test_labels.append(l.numpy())
  
# Convert the Python lists to NumPy arrays
training_labels_final = np.array(training_labels)
training_sentences_final = np.array(training_sentences)
testing_labels_final = np.array(test_labels)
testing_sentences_final = np.array(test_sentences)

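AttributeError: 'Tensor' object has no attribute 'numpy'

This error typically means the cell ran without eager execution, which is off by default in TensorFlow 1.x. Call `tf.compat.v1.enable_eager_execution()` right after importing TensorFlow, before loading the dataset, and rerun the cell.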

In [10]:
print(training_labels_final.shape)
print(training_sentences_final.shape)
print(testing_labels_final.shape)
print(testing_sentences_final.shape)

(25000,)
(25000,)
(25000,)
(25000,)


In [11]:
print(training_labels[2])
print(training_sentences[2])

0
b"Any movie that portrays the hard-working responsible husband as the person who has to change because of bored, cheating wife is an obvious result of 8 years of the Clinton era.<br /><br />It's little wonder that this movie was written by a woman."


In [12]:
item=24999
print(training_labels_final[item])
print(training_sentences[item])

0
b'The orange tone to everything was just yucky. Oh yeah, the main character lives in a ghetto that is all orange-tinted with orange-tinted people. Meanwhile, to mentally escape from this crushing poverty of the body, she plays a full-immersion video game (which sucks in that no rules are clear and no logic follows the gameplay). She apparently earns an income playing the game but she is revealed to not be an employee of the game company?. Lots of non-speaking pauses later the story drags on slowly. She uses a glitchy orange computer interface with an operating interface that is so visually annoying and I can only suspect a Microsoft future release.<br /><br />Meanwhile, I the viewer, ask basically why she is wasting her precious time in some moronic game when she barely has the necessities of life? Oh yeah, playing games is fun, but what is the point when you\'re almost starving? While she is piddling her life away playing some lousy even-more-orange-tinted lame full-immersion video 
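The `b'...'` wrapper in the printed reviews comes from calling `str()` on a `bytes` object, which returns its printable representation instead of decoding it. The sketch below shows the difference on a made-up byte string. The name `raw` is illustrative and not part of the notebook, and decoding as UTF-8 is an assumption about the text encoding.

```python
# A made-up review stored as bytes, standing in for the value returned by s.numpy()
raw = b"A great movie.<br /><br />Loved it."

as_repr = str(raw)             # keeps the b'...' wrapper as literal characters
as_text = raw.decode("utf-8")  # assumes the bytes are UTF-8 encoded text

print(as_repr)
print(as_text)
```

If clean text is preferred, decoding is one option; the notebook keeps `str(...)`, so its outputs show the wrapper.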

## Tokenization of the sentences

In [13]:
vocab_size = 10000
embedding_dim = 16
max_length = 120
trunc_type='post'
oov_tok = "<OOV>"


from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Create an instance of the tokenizer
tokenizer = Tokenizer(num_words = vocab_size, oov_token=oov_tok)

# Fit the tokenizer on the training set
tokenizer.fit_on_texts(training_sentences)

# Build the word index (word -> integer)
word_index = tokenizer.word_index

# Replace each word with its index
sequences = tokenizer.texts_to_sequences(training_sentences)

# Pad or truncate the sentences to max_length
padded = pad_sequences(sequences,maxlen=max_length, truncating=trunc_type)

# Testing set
# Tokenize the testing sentences with the tokenizer fitted on the training sentences
testing_sequences = tokenizer.texts_to_sequences(test_sentences)
# Pad or truncate to max_length; truncating is left at its default ('pre') here
testing_padded = pad_sequences(testing_sequences,maxlen=max_length)


In [14]:
testing_padded.shape

(25000, 120)
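To make the padding step concrete, here is a simplified pure-Python imitation of what `pad_sequences` does to a single sequence. It is a sketch, not the Keras implementation: `pad_or_truncate` is a hypothetical helper, and it assumes `maxlen` is positive. The defaults mirror those of `pad_sequences` (`padding='pre'`, `truncating='pre'`, `value=0`).

```python
def pad_or_truncate(seq, maxlen, padding="pre", truncating="pre", value=0):
    """Simplified imitation of pad_sequences for one sequence (maxlen > 0)."""
    if len(seq) > maxlen:
        # 'pre' drops tokens from the front, 'post' drops them from the end
        seq = seq[-maxlen:] if truncating == "pre" else seq[:maxlen]
    fill = [value] * (maxlen - len(seq))
    # 'pre' puts the padding before the tokens, 'post' after them
    return fill + list(seq) if padding == "pre" else list(seq) + fill

long_seq = [1, 2, 3, 4, 5, 6]
short_seq = [7, 8]

print(pad_or_truncate(long_seq, 4, truncating="post"))  # keeps the first 4 tokens
print(pad_or_truncate(long_seq, 4))                     # keeps the last 4 tokens
print(pad_or_truncate(short_seq, 4))                    # zeros are added in front
```

With `truncating='post'`, a long review keeps its opening words; with the default it keeps its closing words.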

### Create the model

In [15]:
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(6, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
model.summary()


Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 120, 16)           160000    
_________________________________________________________________
flatten (Flatten)            (None, 1920)              0         
_________________________________________________________________
dense (Dense)                (None, 6)                 11526     
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 7         
Total params: 171,533
Trainable params: 171,533
Non-trainable params: 0
_________________________________________________________________
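The parameter counts in the summary can be checked by hand. A short sketch of the arithmetic, using the hyperparameters defined earlier (the variable names for the intermediate results are illustrative):

```python
vocab_size = 10000
embedding_dim = 16
max_length = 120

embedding_params = vocab_size * embedding_dim   # one 16-dimensional vector per word
flatten_units = max_length * embedding_dim      # 120 tokens * 16 values per review
dense_params = flatten_units * 6 + 6            # weights plus one bias per unit
output_params = 6 * 1 + 1                       # 6 weights plus 1 bias

total = embedding_params + dense_params + output_params
print(embedding_params, flatten_units, dense_params, output_params, total)
```

Almost all parameters sit in the embedding layer, which is the part exported to the projector later.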


In [16]:
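# Eager execution has to be enabled at program startup, before any other TensorFlow
# call; in a fresh session, run this right after `import tensorflow as tf`.
tf.compat.v1.enable_eager_execution()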

In [17]:
training_labels_final.shape

(25000,)

In [18]:
reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])

def decode_review(text):
    return ' '.join([reverse_word_index.get(i, '?') for i in text])

print(decode_review(padded[3]))
print(training_sentences[3])

<OOV> young iranian women dress as boys and try to get into a world cup <OOV> match between iran and <OOV> when they 're caught they 're penned in an area where the match remains within <OOV> but out of sight the prisoners <OOV> to be let go but rules are rules br br given the <OOV> of its director <OOV> panahi it was <OOV> to discover that <OOV> is a comedy and a frequently hilarious one in 1997 's the mirror panahi presents two versions of iranian <OOV> and leaves the audience to wonder which one is real in 2000 's the circle several iranian women step outside the system their <OOV> are different but they all end up
b'Several young Iranian women dress as boys and try to get into a World Cup qualifying match between Iran and Bahrain. When they\'re caught, they\'re penned in an area where the match remains within earshot, but out of sight. The prisoners plead to be let go, but rules are rules.<br /><br />Given the pedigree of its director, Jafar Panahi, it was disarming to discover tha
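The `<OOV>` markers in the decoded review stand for words that are either unknown to the tokenizer or ranked outside the `num_words` most frequent ones. The toy functions below imitate that behavior in plain Python. They are a simplified sketch, not the Keras `Tokenizer`: `build_word_index` and `to_sequence` are hypothetical helpers, and punctuation filtering is left out.

```python
from collections import Counter

def build_word_index(texts, oov_token="<OOV>"):
    """Toy word index: the OOV token gets 1, then words by descending frequency."""
    counts = Counter(word for text in texts for word in text.lower().split())
    index = {oov_token: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        index[word] = rank
    return index

def to_sequence(text, word_index, num_words, oov_token="<OOV>"):
    """Map words to indexes; unknown or too-rare words become the OOV index."""
    oov = word_index[oov_token]
    seq = []
    for word in text.lower().split():
        i = word_index.get(word)
        seq.append(i if i is not None and i < num_words else oov)
    return seq

corpus = ["the match was good", "the match was long", "the rules are rules"]
toy_index = build_word_index(corpus)
toy_seq = to_sequence("the match was disarming", toy_index, num_words=4)
print(toy_index)
print(toy_seq)
```

The leading `<OOV>` in the decoded review is probably a side effect of the `b'` prefix kept by `str(s.numpy())`: the first word then reaches the tokenizer with `b'` attached, which is unlikely to be in the vocabulary.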

In [22]:
num_epochs = 10
history = model.fit(padded, training_labels_final, epochs=num_epochs, validation_data=(testing_padded, testing_labels_final))

Train on 25000 samples, validate on 25000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


## View the embedding layers

In [23]:
history.history

{'loss': [9.220737520372495e-05,
  5.72577488841489e-05,
  3.618520189542323e-05,
  2.2955438579665498e-05,
  1.4180951579182874e-05,
  8.97175200516358e-06,
  5.721847122476902e-06,
  3.619031097041443e-06,
  2.3019605360605054e-06,
  1.4763939488329926e-06],
 'acc': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
 'val_loss': [0.8279541819947958,
  0.8599751739788055,
  0.8919651262843609,
  0.9226111056375503,
  0.9542629728603363,
  0.9830620627832413,
  1.0115778042697907,
  1.0385503787845838,
  1.0651467025566101,
  1.091361799912001],
 'val_acc': [0.82668,
  0.82648,
  0.827,
  0.82712,
  0.8272,
  0.82652,
  0.8266,
  0.8264,
  0.82612,
  0.82592]}
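In the history above, training accuracy is already 1.0 while validation loss keeps growing, a pattern usually read as overfitting. A small sketch for locating the best epoch, using validation values copied from the output above (losses rounded to four decimals):

```python
# Validation values copied from the history output (losses rounded)
val_acc = [0.82668, 0.82648, 0.827, 0.82712, 0.8272,
           0.82652, 0.8266, 0.8264, 0.82612, 0.82592]
val_loss = [0.8280, 0.8600, 0.8920, 0.9226, 0.9543,
            0.9831, 1.0116, 1.0386, 1.0651, 1.0914]

# Epochs are 1-based in the Keras log, list indexes are 0-based
best_epoch = max(range(len(val_acc)), key=lambda i: val_acc[i]) + 1
loss_always_rising = all(a < b for a, b in zip(val_loss, val_loss[1:]))

print("best validation accuracy at epoch", best_epoch)
print("validation loss rises every epoch:", loss_always_rising)
```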

In [None]:
import matplotlib.pyplot as plt

history_dict = history.history

acc = history_dict['acc']
val_acc = history_dict['val_acc']
loss=history_dict['loss']
val_loss=history_dict['val_loss']

epochs = range(1, len(acc) + 1)

plt.figure(figsize=(12,9))
plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()

plt.figure(figsize=(12,9))
plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend(loc='lower right')
plt.ylim((0.5,1))
plt.show()

In [19]:
e = model.layers[0]
weights = e.get_weights()[0]
print(weights.shape) # shape: (vocab_size, embedding_dim)

(10000, 16)


![image.png](./image/mapthewords.png)

## View in the Embedding Projector

In [22]:
import io

# Write one embedding vector per line to vecs.tsv and the matching word to meta.tsv.
# The tokenizer never assigns index 0 to a word, so start at 1.
with io.open('vecs.tsv', 'w', encoding='utf-8') as out_v, \
     io.open('meta.tsv', 'w', encoding='utf-8') as out_m:
  for word_num in range(1, vocab_size):
    word = reverse_word_index[word_num]
    embeddings = weights[word_num]
    out_m.write(word + "\n")
    out_v.write('\t'.join([str(x) for x in embeddings]) + "\n")
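The projector pairs the two files line by line: line i of `meta.tsv` labels the vector on line i of `vecs.tsv`. The sketch below writes the same layout for a tiny made-up embedding and reads it back to check the alignment. `toy_weights` and `toy_reverse_index` are hypothetical stand-ins for `weights` and `reverse_word_index`, and the files go to a temporary directory.

```python
import os
import tempfile

# Toy stand-ins for `weights` and `reverse_word_index` (hypothetical values)
toy_weights = [[0.0, 0.0], [0.1, -0.2], [0.3, 0.4]]   # row 0 has no word
toy_reverse_index = {1: "<OOV>", 2: "movie"}

out_dir = tempfile.mkdtemp()
vec_path = os.path.join(out_dir, "vecs.tsv")
meta_path = os.path.join(out_dir, "meta.tsv")

with open(vec_path, "w", encoding="utf-8") as out_v, \
     open(meta_path, "w", encoding="utf-8") as out_m:
    for word_num in range(1, len(toy_weights)):
        out_m.write(toy_reverse_index[word_num] + "\n")
        out_v.write("\t".join(str(x) for x in toy_weights[word_num]) + "\n")

with open(vec_path, encoding="utf-8") as f:
    vec_lines = f.read().splitlines()
with open(meta_path, encoding="utf-8") as f:
    meta_lines = f.read().splitlines()

# Line i of meta.tsv labels line i of vecs.tsv
for label, vec in zip(meta_lines, vec_lines):
    print(label, vec.split("\t"))
```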

## For Google Colab

In [None]:

try:
  from google.colab import files
except ImportError:
  pass
else:
  files.download('vecs.tsv')
  files.download('meta.tsv')

In [23]:
sentence = "I really think this is amazing. honest."
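# Caution: texts_to_sequences expects a list of texts. A bare string is iterated
# character by character, so the result has one entry per character.
# Use tokenizer.texts_to_sequences([sentence]) to tokenize the sentence as a whole.
sequence = tokenizer.texts_to_sequences(sentence)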
print(sequence)

[[11], [], [1430], [968], [4], [1537], [1537], [4739], [], [790], [2015], [11], [2922], [2190], [], [790], [2015], [11], [579], [], [11], [579], [], [4], [1783], [4], [4508], [11], [2922], [1277], [], [], [2015], [1005], [2922], [968], [579], [790], []]
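The result above contains 39 entries, one for each character of the 39-character sentence, because a Python string passed where a list of texts is expected gets iterated character by character. The toy function below reproduces the effect without TensorFlow. `toy_texts_to_sequences` and its word index are hypothetical stand-ins, not the Keras implementation.

```python
def toy_texts_to_sequences(texts, word_index):
    """Toy stand-in: returns one list of indexes per element of `texts`."""
    return [[word_index[w] for w in text.lower().split() if w in word_index]
            for text in texts]

toy_index = {"i": 1, "really": 2, "think": 3}   # hypothetical word index
sentence = "I really think"

per_char = toy_texts_to_sequences(sentence, toy_index)     # iterates over characters
per_text = toy_texts_to_sequences([sentence], toy_index)   # iterates over one sentence

print(len(per_char), per_char[0])
print(per_text)
```

Wrapping the sentence in a list yields a single sequence for the whole sentence, which is the shape `pad_sequences` and the model expect.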
