# Word Embeddings
In the previous notebook we saw tokenization, where we built a dictionary from a corpus and gave each word a numerical value so that the machine could work with it.
Consider the IMDb dataset, where we work with a fairly large vocabulary of about 10000 words. If we tokenize each word, we get about 10000 key-value pairs in which each word has a unique value.
These words are what we call 'categorical data': input features that represent one or more discrete items from a finite set of choices. Categorical data is most efficiently represented by sparse tensors, that is, tensors with very few non-zero elements (they are mostly filled with zeros).
A closely related representation is Bag of Words: the model processes the text to count how many times each word appears in a sentence. This step is also called vectorization.
There are a few challenges with this approach:
1. Size of the network: the more words in the vocabulary, the larger the input vectors and the network. A larger network has more weights to store and needs more computation.
2. Lack of meaningful relations between vectors.
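Both challenges can be seen on a tiny example. The sketch below uses a made-up five-word vocabulary (an assumption for illustration, not the IMDb vocabulary) and represents each word as a one-hot vector, the simplest sparse representation.

```python
import numpy as np

# Hypothetical 5-word vocabulary, used only for illustration
vocab = {"good": 0, "great": 1, "bad": 2, "movie": 3, "film": 4}

def one_hot(word, vocab):
    """Return a vector of zeros with a single 1 at the word's index."""
    vec = np.zeros(len(vocab), dtype=np.float32)
    vec[vocab[word]] = 1.0
    return vec

good, great, bad = (one_hot(w, vocab) for w in ("good", "great", "bad"))

# Only one entry out of len(vocab) is non-zero
sparsity = 1.0 - np.count_nonzero(good) / good.size

# Dot products between different one-hot vectors are always 0,
# so "good" looks exactly as unrelated to "great" as it does to "bad"
sim_good_great = float(good @ great)
sim_good_bad = float(good @ bad)

print(good, sparsity, sim_good_great, sim_good_bad)
```

With a 10000-word vocabulary each vector would have 10000 entries, only one of them non-zero.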

The solution?
Embeddings. 

Embeddings translate large sparse vectors into a lower-dimensional space that preserves semantic relationships.

An embedding is a relatively low-dimensional space into which you can translate high-dimensional vectors. Embeddings make it easier to do machine learning on large inputs like sparse vectors representing words. Ideally, an embedding captures some of the semantics of the input by placing semantically similar inputs close together in the embedding space. An embedding can be learned and reused across models.

Embeddings solve both of the problems mentioned above. Using a lower-dimensional space drastically reduces the size of the network, and with it the amount of data and computation. The vectors are no longer sparse. Because each word is represented as a vector in the lower-dimensional space, and semantically similar words end up close to each other, the relations between words are reflected in the vectors.
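One way to picture an embedding is as a lookup table with one dense row per word. The sketch below uses toy sizes and random values (both assumptions, since real embeddings are learned during training). It shows that multiplying a one-hot vector by the table gives the same result as simply indexing the row.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_dim = 10, 4   # toy sizes, chosen for illustration

# Embedding matrix: one dense row per word (random here, learned in practice)
embedding_matrix = rng.normal(size=(vocab_size, embedding_dim))

word_id = 7
one_hot = np.zeros(vocab_size)
one_hot[word_id] = 1.0

# Multiplying a one-hot vector by the matrix selects a single row ...
via_matmul = one_hot @ embedding_matrix
# ... so the same vector can be obtained by indexing that row directly
via_lookup = embedding_matrix[word_id]

print(via_lookup.shape, np.allclose(via_matmul, via_lookup))
```

The lookup never needs to build the large sparse vector, which is where the savings in memory and computation come from.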


The two cells below build a Bag of Words representation.
As you can see, the resulting matrix is sparse.
There is one row for each sentence.
Each column represents one word from the dictionary.
If the word appears once in a sentence the value is 1, if it appears twice the value is 2, and so on.
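To make the counting rule concrete, here is a minimal pure-Python sketch on two made-up sentences. Unlike CountVectorizer with `stop_words='english'`, it does not remove stop words, and the sentences and variable names are assumptions for illustration only.

```python
from collections import Counter

# Two toy sentences, already lowercased and split on whitespace
sentences = ["challenge the challenge", "challenges make us grow"]

# Vocabulary: every distinct word, sorted so the column order is stable
vocabulary = sorted({word for s in sentences for word in s.split()})

# One row per sentence, one column per vocabulary word
rows = []
for s in sentences:
    counts = Counter(s.split())
    rows.append([counts[word] for word in vocabulary])

print(vocabulary)   # ['challenge', 'challenges', 'grow', 'make', 'the', 'us']
for row in rows:
    print(row)      # [2, 0, 0, 0, 1, 0] then [0, 1, 1, 1, 0, 1]
```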



In [1]:
import nltk
nltk.download('punkt')

#Creating frequency distribution of words using nltk
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
text="""Achievers are not afraid of Challenges, rather they relish them, thrive in them, use them. Challenges makes is stronger.
        Challenges makes us uncomfortable. If you get comfortable with uncomfort then you will grow. Challenge the challenge """
# Split the text corpus into sentences
tokenized_text=sent_tokenize(text)
# Use CountVectorizer, lowercasing the text and removing English stop words
cv1= CountVectorizer(lowercase=True,stop_words='english')
# Fit the vectorizer on the tokenized sentences and count the words
text_counts=cv1.fit_transform(tokenized_text)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [2]:
# Words with their indices
print(cv1.vocabulary_)
# Bag of Words or Vectorized representation
print(text_counts.toarray())

{'achievers': 0, 'afraid': 1, 'challenges': 3, 'relish': 7, 'thrive': 9, 'use': 12, 'makes': 6, 'stronger': 8, 'uncomfortable': 11, 'comfortable': 4, 'uncomfort': 10, 'grow': 5, 'challenge': 2}
[[1 1 0 1 0 0 0 1 0 1 0 0 1]
 [0 0 0 1 0 0 1 0 1 0 0 0 0]
 [0 0 0 1 0 0 1 0 0 0 0 1 0]
 [0 0 0 0 1 1 0 0 0 0 1 0 0]
 [0 0 2 0 0 0 0 0 0 0 0 0 0]]


# IMDb review dataset
TensorFlow makes many popular datasets available through tensorflow_datasets, such as MNIST and Fashion MNIST for image recognition. Similarly, the IMDb dataset is available for NLP.
We are going to use this dataset to build a model that classifies movie reviews as either positive (1) or negative (0). Each example in the dataset is a review and its label: (text, label).

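The implementation will be as follows:

* Load the dataset from tensorflow_datasets
* Decode the reviews and collect the sentences and labels
* Tokenize and pad the sentences
* Build and train a model with an Embedding layer
* Export the learned embeddings for visualization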

In [3]:
import tensorflow_datasets as tfds
# imdb is the data - train test and unsupervised.
# info is the information about the dataset - authors and all.
imdb, info = tfds.load("imdb_reviews", with_info=True, as_supervised=True)

Downloading and preparing dataset imdb_reviews/plain_text/1.0.0 (download: 80.23 MiB, generated: Unknown size, total: 80.23 MiB) to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0...



Dataset imdb_reviews downloaded and prepared to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0. Subsequent calls will reuse this data.


In [4]:
# IMPORTANT DETAILS TO UNDERSTAND THE NEXT CELL

import sys

# str is a sequence of Unicode code points (text).
# How Python stores it in memory is an implementation detail.
string = "my name is razi and i am amazing"
# bytes is a sequence of integers between 0 and 255 (also known as octets).
# A b prefix on a literal creates a bytes object; such literals may only
# contain ASCII characters. Text becomes bytes through an encoding such as UTF-8.
b_string = b"my name is razi and i am amazing"

print(type(string))
print(type(b_string))

print(sys.getsizeof(string))
print(sys.getsizeof(b_string))

print(string[0])
print(b_string[0])

<class 'str'>
<class 'bytes'>
81
65
m
109


The data in the IMDb dataset is stored as tf.Tensor objects, and each review is stored as a bytes object rather than a str. To access the value inside a tf.Tensor we use the .numpy() method.

To convert the review from bytes to str we then call the decode() method with 'utf-8', the encoding the text was stored in.
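The round trip between str and bytes can be tried without TensorFlow. The snippet below is a small sketch with a made-up word containing one non-ASCII character, which shows why the number of bytes can differ from the number of characters.

```python
# A str containing a non-ASCII character (written as an escape for clarity)
text = "caf\u00e9"

# Encoding turns the str into bytes; the accented letter takes two bytes in UTF-8
encoded = text.encode("utf-8")

# Decoding with the same codec recovers the original str
decoded = encoded.decode("utf-8")

print(encoded, len(text), len(encoded), decoded == text)
```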

In [5]:
for s,l in imdb["train"]:
  print(s)
  print(s.numpy())
  print(s.numpy().decode('utf-8'))
  print(l)
  print(l.numpy())
  break

tf.Tensor(b"This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning. I am disappointed that there are movies like this, ruining actor's like Christopher Walken's good name. I could barely sit through it.", shape=(), dtype=string)
b"This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline

In [6]:
import numpy as np

train_data = imdb["train"]
test_data = imdb["test"]

training_sentences = []
training_labels = []

testing_sentences = []
testing_labels = []

for s,l in train_data:
  training_sentences.append(s.numpy().decode('utf-8'))
  training_labels.append(l.numpy())

for s,l in test_data:
  testing_sentences.append(s.numpy().decode('utf-8'))
  testing_labels.append(l.numpy())

# Convert the labels to a numpy array for our model
training_labels_final = np.array(training_labels)
testing_labels_final = np.array(testing_labels)

Now that we have our data ready, let us tokenize the reviews and pad or truncate each sequence to a length of 120 tokens. The vocabulary is limited to 10000 words, which is also the number of rows in the embedding layer of our model.
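Before using the Keras utilities, it may help to see the idea in plain Python. The sketch below is not the Keras implementation. It is a simplified stand-in with a hypothetical word index, and it mimics padding at the front and truncating at the end, the settings used for the training sequences in this notebook.

```python
def pad_or_truncate(sequence, max_length, truncating="post"):
    """Left-pad with 0 up to max_length; cut long sequences as requested."""
    if len(sequence) > max_length:
        if truncating == "post":
            sequence = sequence[:max_length]      # drop tokens from the end
        else:
            sequence = sequence[-max_length:]     # drop tokens from the start
    return [0] * (max_length - len(sequence)) + list(sequence)

# Hypothetical word index: 1 is reserved for out-of-vocabulary words
word_index_demo = {"<OOV>": 1, "the": 2, "movie": 3, "was": 4, "great": 5}

def to_sequence(text):
    """Map each lowercase word to its index, or to the OOV index if unknown."""
    oov = word_index_demo["<OOV>"]
    return [word_index_demo.get(w, oov) for w in text.lower().split()]

short = pad_or_truncate(to_sequence("the movie was great"), max_length=6)
long_ = pad_or_truncate(to_sequence("the movie was really great fun"), max_length=4)
print(short)   # [0, 0, 2, 3, 4, 5]
print(long_)   # [2, 3, 4, 1]
```

Words that are not in the index ("really", "fun") map to the out-of-vocabulary index 1, and 0 is used only for padding.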



In [7]:
vocab_size = 10000
embedding_dim = 16
max_length = 120
trunc_type='post'
oov_tok = "<OOV>"


from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer(num_words = vocab_size, oov_token=oov_tok)
tokenizer.fit_on_texts(training_sentences)
word_index = tokenizer.word_index
sequences = tokenizer.texts_to_sequences(training_sentences)
padded = pad_sequences(sequences,maxlen=max_length, truncating=trunc_type)

testing_sequences = tokenizer.texts_to_sequences(testing_sentences)
# Use the same truncation setting as for the training sequences
testing_padded = pad_sequences(testing_sequences,maxlen=max_length, truncating=trunc_type)

In [9]:
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(6, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 120, 16)           160000    
_________________________________________________________________
flatten (Flatten)            (None, 1920)              0         
_________________________________________________________________
dense (Dense)                (None, 6)                 11526     
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 7         
Total params: 171,533
Trainable params: 171,533
Non-trainable params: 0
_________________________________________________________________
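The parameter counts in the summary can be checked by hand. The arithmetic below assumes the layer sizes used in this notebook (10000 words, 16 dimensions, 120 tokens, a hidden layer of 6 units).

```python
vocab_size, embedding_dim, max_length = 10000, 16, 120

embedding_params = vocab_size * embedding_dim   # one 16-d vector per word
flatten_units = max_length * embedding_dim      # 120 * 16 values per review
dense_params = flatten_units * 6 + 6            # weights + biases of Dense(6)
output_params = 6 * 1 + 1                       # weights + bias of Dense(1)

total = embedding_params + dense_params + output_params
print(embedding_params, flatten_units, dense_params, output_params, total)
# 160000 1920 11526 7 171533
```

Most of the parameters sit in the embedding layer, which is why the vocabulary size and embedding dimension dominate the size of this model.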


In [10]:
model.fit(padded, training_labels_final,
          epochs=10,
          validation_data=(testing_padded, testing_labels_final))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7f4781a25898>

We have successfully trained a model that learns an embedding for each word in our vocabulary and classifies the reviews as either positive (1) or negative (0).

Let's explore the details now.

In [11]:
e = model.layers[0]
weights = e.get_weights()[0]
print(weights.shape) # shape: (vocab_size, embedding_dim)

(10000, 16)


In [12]:
# Each word out of the 10000 is represented by a 16 dimensional vector
print(weights[0])

[ 1.2022594e-02 -2.9514054e-02  3.8469184e-02 -7.5785047e-03
 -2.6300007e-03  1.2467058e-02  7.4244060e-02  3.1351380e-02
  2.4856718e-03 -3.0818164e-02 -2.5466124e-02  2.1721730e-02
 -2.6210058e-03 -4.9586773e-02 -5.6827346e-05 -1.1848134e-02]


In [13]:
print("Input shape of the model: ", model.input_shape)


Input shape of the model:  (None, 120)


This means that the model expects each review as a sequence of 120 integer word indices.
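The shapes that flow through the first two layers can be reproduced with NumPy alone. The sketch below uses random stand-ins for the trained weights and for a batch of two padded reviews (both are assumptions, not values from the model).

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_dim, max_length = 10000, 16, 120

# Stand-ins for the trained weights and a batch of two padded reviews
weights_demo = rng.normal(size=(vocab_size, embedding_dim))
batch = rng.integers(0, vocab_size, size=(2, max_length))

# Integer indexing replaces every word index with its 16-dimensional row
embedded = weights_demo[batch]
# Flattening concatenates the 120 vectors of each review into one long vector
flattened = embedded.reshape(len(batch), -1)

print(embedded.shape, flattened.shape)   # (2, 120, 16) (2, 1920)
```

These match the `(None, 120, 16)` and `(None, 1920)` output shapes in the model summary, with `None` standing for the batch size.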

In [14]:
# Something like this
padded[0]

array([   0,    0,    0,   12,   14,   33,  425,  392,   18,   90,   28,
          1,    9,   32, 1366, 3585,   40,  486,    1,  197,   24,   85,
        154,   19,   12,  213,  329,   28,   66,  247,  215,    9,  477,
         58,   66,   85,  114,   98,   22, 5675,   12, 1322,  643,  767,
         12,   18,    7,   33,  400, 8170,  176, 2455,  416,    2,   89,
       1231,  137,   69,  146,   52,    2,    1, 7577,   69,  229,   66,
       2933,   16,    1, 2904,    1,    1, 1479, 4940,    3,   39, 3900,
        117, 1584,   17, 3585,   14,  162,   19,    4, 1231,  917, 7917,
          9,    4,   18,   13,   14, 4139,    5,   99,  145, 1214,   11,
        242,  683,   13,   48,   24,  100,   38,   12, 7181, 5515,   38,
       1366,    1,   50,  401,   11,   98, 1197,  867,  141,   10],
      dtype=int32)

In [15]:
model.output_shape

(None, 1)

The model outputs a single value between 0 and 1 for each review (the result of the sigmoid activation). Values close to 1 mean positive and values close to 0 mean negative.
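To turn such outputs into hard labels, a common choice is to threshold at 0.5. The values below are made up for illustration and are not predictions from the model; the threshold itself is also an assumption that can be tuned.

```python
import numpy as np

# Hypothetical sigmoid outputs for three reviews (not real model predictions)
probabilities = np.array([[0.93], [0.08], [0.51]], dtype=np.float32)

# Treat anything above 0.5 as positive (1), otherwise negative (0)
labels = (probabilities > 0.5).astype(int).ravel()
print(labels)   # [1 0 1]
```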

In [16]:
# Something like this
training_labels[0]

0

In [17]:
# Let's pass a review from the test set and see what the model predicts
print("Sentence: ",testing_sentences[1])
print("Label: ",testing_labels[1])
print("Sequence: \n", testing_padded[1])
# A single padded review is a numpy array of shape (120,), but the
# model's input shape is (None, 120). Slicing with [1:2] instead of
# indexing with [1] keeps the batch dimension, giving shape (1, 120).
model.predict(testing_padded[1:2])

Sentence:  A blackly comic tale of a down-trodden priest, Nazarin showcases the economy that Luis Bunuel was able to achieve in being able to tell a deeply humanist fable with a minimum of fuss. As an output from his Mexican era of film making, it was an invaluable talent to possess, with little money and extremely tight schedules. Nazarin, however, surpasses many of Bunuel's previous Mexican films in terms of the acting (Francisco Rabal is excellent), narrative and theme.<br /><br />The theme, interestingly, is something that was explored again in Viridiana, made three years later in Spain. It concerns the individual's struggle for humanity and altruism amongst a society that rejects any notion of virtue. Father Nazarin, however, is portrayed more sympathetically than Sister Viridiana. Whereas the latter seems to choose charity because she wishes to atone for her (perceived) sins, Nazarin's whole existence and reason for being seems to be to help others, whether they (or we) like it o

array([[1.]], dtype=float32)

In [18]:
# Instead of a word mapping to a number
# A number will map to a word
reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])
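A reversed index also makes it possible to turn a padded sequence back into readable text. The sketch below uses a small hypothetical word index rather than the one built by the tokenizer, and the `decode_sequence` helper is an illustrative name, not part of Keras.

```python
# Hypothetical word index, standing in for tokenizer.word_index
word_index_demo = {"<OOV>": 1, "the": 2, "movie": 3}

# Swap keys and values so that numbers map back to words
reverse_demo = {index: word for word, index in word_index_demo.items()}

def decode_sequence(sequence):
    """Turn word indices back into text, skipping the 0 used for padding."""
    return " ".join(reverse_demo.get(i, "?") for i in sequence if i != 0)

print(decode_sequence([0, 0, 2, 3, 1]))   # the movie <OOV>
```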

In [19]:
import io

# Create the two output files
out_v = io.open('vecs.tsv', 'w', encoding='utf-8')
out_m = io.open('meta.tsv', 'w', encoding='utf-8')
for word_num in range(1, vocab_size):
  # Getting the word from the number
  word = reverse_word_index[word_num]
  # Getting the 16 dimensional vector for that word
  embeddings = weights[word_num]
  # Writing the word to meta.tsv
  out_m.write(word + "\n")
  # Writing the embeddings to vecs.tsv 
  # 16 values in 1 row separated by tab
  out_v.write('\t'.join([str(x) for x in embeddings]) + "\n")
out_v.close()
out_m.close()

In [20]:
# Download the files

try:
  from google.colab import files
except ImportError:
  pass
else:
  files.download('vecs.tsv')
  files.download('meta.tsv')


The .tsv extension stands for Tab Separated Values.

vecs.tsv holds one 16 dimensional vector per word.
meta.tsv holds the corresponding words, one per line, in the same order.

Go to [TensorFlow Projector](http://projector.tensorflow.org/) to see a visualization of our embeddings. It's really cool.
Steps:
* Go to the site and click the Load button on the left.
* Load both files (vecs.tsv and meta.tsv).
* Check "Sphereize data" and enjoy.
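The same kind of neighbour search can be sketched offline with cosine similarity. The example below uses four made-up 2-dimensional vectors in place of the contents of vecs.tsv and meta.tsv, so the words, values, and the `nearest` helper are all assumptions for illustration.

```python
import numpy as np

# Toy 2-d embeddings standing in for the contents of vecs.tsv / meta.tsv
words = ["good", "great", "bad", "awful"]
vectors = np.array([[0.9, 0.1], [0.8, 0.2], [-0.9, 0.1], [-0.8, 0.3]])

def nearest(word, words, vectors, k=1):
    """Return the k words whose vectors have the highest cosine similarity."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit[words.index(word)]
    order = np.argsort(-sims)   # most similar first
    return [words[i] for i in order if words[i] != word][:k]

print(nearest("good", words, vectors))   # ['great']
print(nearest("bad", words, vectors))    # ['awful']
```

For the real files, the vectors could be loaded with something like `np.loadtxt('vecs.tsv', delimiter='\t')` and the words read line by line from meta.tsv.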