<a href="https://colab.research.google.com/github/space-owner/Tensorflow-2/blob/main/Word_Embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### ***Word embeddings***
This post is **based on the TensorFlow tutorials** and was written for study purposes. [Link](https://www.tensorflow.org/tutorials)

***Learning Points:***
- **```tensorflow.keras.layers.Embedding```**
- **```tensorflow.keras.layers.GlobalAveragePooling1D```**
- **```tensorflow.keras.layers.TextVectorization```**


In [None]:
import io
import os
import re
import shutil
import string
import tensorflow as tf

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Embedding, GlobalAveragePooling1D
from tensorflow.keras.layers import TextVectorization

In [None]:
url = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"

dataset = tf.keras.utils.get_file(
    fname="aclImdb_v1.tar.gz", origin=url, untar=True, cache_dir=".", cache_subdir=""
)

dataset_dir = os.path.join(os.path.dirname(dataset), "aclImdb")

print(">>> dataset dir =", os.listdir(dataset_dir))

Downloading data from https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
>>> dataset dir = ['test', 'imdb.vocab', 'train', 'imdbEr.txt', 'README']


In [None]:
train_dir = os.path.join(dataset_dir, "train")

print(">>> train dir =", os.listdir(train_dir))

>>> train dir = ['pos', 'unsupBow.feat', 'labeledBow.feat', 'urls_neg.txt', 'neg', 'unsup', 'urls_unsup.txt', 'urls_pos.txt']


In [None]:
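# Remove the unlabeled "unsup" reviews: the dataset loader treats every
# subdirectory of train_dir as a class, and only "pos" and "neg" are wanted.
remove_dir = os.path.join(train_dir, "unsup")
shutil.rmtree(remove_dir)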

print(">>> train dir =", os.listdir(train_dir))

>>> train dir = ['pos', 'unsupBow.feat', 'labeledBow.feat', 'urls_neg.txt', 'neg', 'urls_unsup.txt', 'urls_pos.txt']


In [None]:
batch_size = 1024
seed = 48

# Use the same seed for both calls so the 80/20 split does not overlap.
train_dataset = tf.keras.utils.text_dataset_from_directory(
    train_dir, batch_size=batch_size, validation_split=0.2,
    subset="training", seed=seed
)

val_dataset = tf.keras.utils.text_dataset_from_directory(
    train_dir, batch_size=batch_size, validation_split=0.2,
    subset="validation", seed=seed
)

Found 25000 files belonging to 2 classes.
Using 20000 files for training.
Found 25000 files belonging to 2 classes.
Using 5000 files for validation.


In [None]:
for text_batch, label_batch in train_dataset.take(1):
    for i in range(5):
        print(label_batch[i].numpy(), text_batch.numpy()[i])

0 b'I give this movie a ONE, for it is truly an awful movie. Sound track of the DVD is so bad, it actually hurts my ear. But the vision, no matter how disjointed, does show something really fancy in the Italian society. I will not go into detail what actually was so shocking , but the various incidents are absolutely abnormal. So for the kink value, i give it one.Otherwise, the video, photography, acting of the adults actors /actresses are simply substandard, a practical jock to people who love foreign movies.Roberto, the main character, has full spectrum of emotions but exaggerated to the point of being unbelievable.however, the children in the movie are mostly 3/4 years old, and they are genuine and the movie provides glimpse of the Italian life..'
1 b"Prior to seeing Show People, my impression of silent comedy was essentially slapstick, and slapstick only. I could not imagine how screen comedy could be possible without relying heavily on spoken word or numerous pratfalls. But this m

In [None]:
AUTOTUNE = tf.data.AUTOTUNE

# Cache the data after the first pass and overlap loading with training.
train_dataset = train_dataset.cache().prefetch(buffer_size=AUTOTUNE)
val_dataset = val_dataset.cache().prefetch(buffer_size=AUTOTUNE)

In [None]:
# Vocabulary of 1,000 token ids, each mapped to a 5-dimensional vector.
embedding_layer = tf.keras.layers.Embedding(input_dim=1000, output_dim=5)

In [None]:
ids = tf.constant([1, 2, 3])
print(">>> input ids =", ids)

result = embedding_layer(ids)

print(">>> result =\n{}".format(result.numpy()))

>>> input ids = tf.Tensor([1 2 3], shape=(3,), dtype=int32)
>>> result =
[[ 0.01199257  0.02671323  0.02542325 -0.01883491 -0.04637767]
 [ 0.01570792 -0.04970592  0.03695065  0.02329585 -0.04684756]
 [ 0.04838461  0.0031688   0.0089935  -0.030189   -0.0320302 ]]
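
The three rows printed above are rows 1, 2 and 3 of the layer's weight matrix: an embedding layer behaves like a trainable lookup table indexed by integer id. As a rough illustration, the plain-Python sketch below mimics that lookup without TensorFlow. The `table` values and the `embed` helper are made up for this example and are not part of the Keras API.

```python
# Toy lookup table: 4 token ids, embedding dimension 2 (values are made up).
table = [
    [0.0, 0.1],
    [1.0, 1.1],
    [2.0, 2.1],
    [3.0, 3.1],
]


def embed(ids):
    """Return one table row per id; nested lists of ids are handled recursively."""
    if isinstance(ids, int):
        return table[ids]
    return [embed(i) for i in ids]


flat = embed([1, 2, 3])
print(flat)  # [[1.0, 1.1], [2.0, 2.1], [3.0, 3.1]]
```

Passing a nested list such as `[[0, 1], [2, 3]]` to `embed` returns a nested result with one extra trailing axis, which matches how the real layer appends the embedding dimension to the input shape.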


In [None]:
ids = tf.constant([[0, 1, 2], [3, 4, 5]])
print(">>> input ids =", ids)

result = embedding_layer(ids)

print(">>> result =\n{}".format(result.numpy()))
print(">>> result shape =", result.shape)

>>> input ids = tf.Tensor(
[[0 1 2]
 [3 4 5]], shape=(2, 3), dtype=int32)
>>> result =
[[[-0.02006651  0.04897007  0.04261822 -0.00602493 -0.04881423]
  [ 0.01199257  0.02671323  0.02542325 -0.01883491 -0.04637767]
  [ 0.01570792 -0.04970592  0.03695065  0.02329585 -0.04684756]]

 [[ 0.04838461  0.0031688   0.0089935  -0.030189   -0.0320302 ]
  [-0.03019472 -0.04646552 -0.02069526 -0.0038862  -0.03209398]
  [-0.02491921 -0.01457477  0.04100542  0.03982289  0.02374678]]]
>>> result shape = (2, 3, 5)
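
The output shape is the input shape with the embedding dimension appended, so a `(2, 3)` batch of ids becomes `(2, 3, 5)`. `GlobalAveragePooling1D`, listed in the learning points, would then average over the sequence axis and return `(2, 5)`. Below is a minimal plain-Python sketch of that averaging, assuming no masking. The helper name and the toy values are hypothetical and only illustrate the idea; this is not the Keras implementation.

```python
def global_average_pool_1d(batch):
    """Average each sequence over its steps: (batch, steps, dim) -> (batch, dim)."""
    return [
        [sum(column) / len(sequence) for column in zip(*sequence)]
        for sequence in batch
    ]


# Two sequences of three steps each, embedding dimension 2 (made-up values).
batch = [
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    [[0.0, 0.0], [2.0, 4.0], [4.0, 8.0]],
]

pooled = global_average_pool_1d(batch)
print(pooled)  # [[3.0, 4.0], [2.0, 4.0]]
```

Averaging gives every review a fixed-length vector regardless of how many tokens it contains, which is what lets a `Dense` layer follow the embedding.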
