# Word Embeddings

In [38]:
from tensorflow.keras.preprocessing.text import one_hot
from tensorflow.keras.layers import Embedding
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
import numpy as np
import pickle
import joblib

In [25]:
### sentences
sent = ['the glass of milk',
        'the glass of juice',
        'the cup of tea',
        'I am a good boy',
        'I am a good developer',
        'understand the meaning of words',
        'my name is rondon',
        'my life is very good',
        'my life is bad',
        ]

In [26]:
sent

['the glass of milk',
 'the glass of juice',
 'the cup of tea',
 'I am a good boy',
 'I am a good developer',
 'understand the meaning of words',
 'my name is rondon',
 'my life is very good',
 'my life is bad']

In [27]:
## Define the vocabulary size

vocab_size = 10000

In [28]:
### One Hot Representation
one_hot_repr = []

for words in sent:
    one_hot_repr.append(one_hot(words, vocab_size))

one_hot_repr



[[914, 7786, 7674, 3473],
 [914, 7786, 7674, 5858],
 [914, 5780, 7674, 7146],
 [4639, 42, 6084, 8211, 3672],
 [4639, 42, 6084, 8211, 7236],
 [5761, 914, 3651, 7674, 2477],
 [3402, 1857, 4223, 2162],
 [3402, 6721, 4223, 3133, 8211],
 [3402, 6721, 4223, 399]]

Traditional one-hot encoding turns each word into a sparse vector whose length equals the vocabulary size (e.g., 10,000), with a single 1 among zeros. The `tensorflow.keras.preprocessing.text.one_hot()` function is more compact: instead of building the full binary vector, it returns only the index where the 1 would be.
Each word is mapped to an integer in the range [1, vocab_size) by a hashing function, so no vocabulary has to be built first. This saves memory and avoids handling large sparse matrices during preprocessing. The trade-off is that two different words can hash to the same index (a collision), which is why `vocab_size` is usually set well above the number of distinct words.
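
The hashing idea can be sketched in plain Python. This is an illustration only, not the Keras implementation: the helper names `hashed_index` and `encode` are hypothetical, and MD5 is an assumption chosen because it gives the same result on every run.

```python
import hashlib


def hashed_index(word, n):
    """Map a word to an integer in [1, n - 1] using a stable hash (illustrative helper)."""
    digest = hashlib.md5(word.lower().encode("utf-8")).hexdigest()
    return int(digest, 16) % (n - 1) + 1


def encode(sentence, n):
    """Encode a whitespace-separated sentence as a list of hashed indices."""
    return [hashed_index(word, n) for word in sentence.split()]


milk = encode("the glass of milk", 10000)
juice = encode("the glass of juice", 10000)
print(milk)
print(juice)  # the first three indices match those of `milk`

# With a tiny hash space, collisions are unavoidable:
# five words must share the two available indices (1 and 2).
tiny = encode("understand the meaning of words", 3)
print(tiny)
```

The same word always receives the same index, which is what lets the three shared words line up across the two sentences. The last call shows why a generous hash space matters.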

A remaining problem when preparing text for a neural network is that sentences have different lengths: here the encoded sentences contain four or five indices. Models expect inputs of a uniform shape, so the sequences must be brought to a common length before they can be batched and fed to a standard neural network architecture.

In [29]:
## Word Embedding Representation

sent_length = 8
embedded_docs = pad_sequences(one_hot_repr, padding='pre', maxlen=sent_length)
print(embedded_docs)

[[   0    0    0    0  914 7786 7674 3473]
 [   0    0    0    0  914 7786 7674 5858]
 [   0    0    0    0  914 5780 7674 7146]
 [   0    0    0 4639   42 6084 8211 3672]
 [   0    0    0 4639   42 6084 8211 7236]
 [   0    0    0 5761  914 3651 7674 2477]
 [   0    0    0    0 3402 1857 4223 2162]
 [   0    0    0 3402 6721 4223 3133 8211]
 [   0    0    0    0 3402 6721 4223  399]]


In [30]:
embedded_docs

array([[   0,    0,    0,    0,  914, 7786, 7674, 3473],
       [   0,    0,    0,    0,  914, 7786, 7674, 5858],
       [   0,    0,    0,    0,  914, 5780, 7674, 7146],
       [   0,    0,    0, 4639,   42, 6084, 8211, 3672],
       [   0,    0,    0, 4639,   42, 6084, 8211, 7236],
       [   0,    0,    0, 5761,  914, 3651, 7674, 2477],
       [   0,    0,    0,    0, 3402, 1857, 4223, 2162],
       [   0,    0,    0, 3402, 6721, 4223, 3133, 8211],
       [   0,    0,    0,    0, 3402, 6721, 4223,  399]], dtype=int32)

The `pad_sequences` function solves the variable-length problem by standardizing all input sequences. With `padding='pre'` it adds zeros at the beginning of each sequence until it reaches the length set by `maxlen` (here 8), so every sequence has the same length and can be processed by the model.
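
A minimal pure-Python sketch of pre-padding follows. The helper `pad_pre` is hypothetical and only mimics the behaviour used in this notebook; it assumes that sequences longer than `maxlen` keep their last `maxlen` items.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad each sequence with `value` up to `maxlen` (illustrative helper)."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # assumption: overlong sequences keep their tail
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded


short = pad_pre([[914, 7786, 7674, 3473]], maxlen=8)
print(short)  # [[0, 0, 0, 0, 914, 7786, 7674, 3473]]

too_long = pad_pre([[1, 2, 3, 4, 5]], maxlen=3)
print(too_long)  # [[3, 4, 5]]
```

The first result matches the first row of `embedded_docs` above.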

In [31]:
## Feature representation

dim = 10

In [32]:
model = Sequential()
model.add(Embedding(vocab_size, dim, input_length=sent_length))
model.compile('adam', 'mse')





1. **Initialization**:
   - Creating an Embedding layer with `Embedding(vocab_size, dim)` creates a weight matrix of size (vocab_size × dim)
   - For example, with vocab_size=10000 and dim=10, the matrix has size 10000×10
   - This matrix is initialized with small random values

2. **Lookup Operation**:
   - Given a word index (like 7786 for "glass"), the Embedding layer performs a lookup operation
   - It works like a dictionary lookup where:
     - The word index is the key
     - The corresponding row in the weight matrix is the value
   - For example, for input index 7786 it returns row 7786 of the weight matrix

3. **Mathematical Representation**:
   - The operation is equivalent to multiplying a one-hot vector by the weight matrix
   - Represent the word index as a one-hot vector (all zeros except a 1 at the word's position)
   - The embedding operation is then: `embedding_vector = one_hot_vector × weight_matrix`
   - The product selects the corresponding row of the weight matrix

4. **Learning Process**:
   - During training, the weights in this matrix are updated using backpropagation
   - When the layer is trained as part of a model on a task, words that appear in similar contexts tend to end up with similar vectors
   - The model above is only compiled, never trained, so its vectors are still the random initial values

5. **Efficiency**:
   - This is much more compact than a traditional one-hot vector
   - A one-hot vector has size vocab_size (e.g., 10000) and is almost entirely zeros
   - An embedding is a dense vector of size dim (e.g., 10)

6. **Semantic Relationships**:
   - After training, words with similar meanings tend to lie close to each other in this vector space
   - For example, "king" and "queen" might have similar vectors
   - The difference between their vectors might represent the concept of gender
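
The initialization and lookup steps above can be sketched with NumPy alone. This is a stand-in for the layer, not its actual code: the name `weights` and the uniform range are assumptions, the latter chosen only to resemble the magnitudes printed later in this notebook.

```python
import numpy as np

vocab_size, dim = 10000, 10

# Assumption: small uniform random values; the real layer's initializer may differ.
rng = np.random.default_rng(0)
weights = rng.uniform(-0.05, 0.05, size=(vocab_size, dim))

print(weights.shape)  # (10000, 10)
print(weights.size)   # 100000

# Lookup: the word index is the key, the matching row is the value.
glass_vector = weights[7786]
print(glass_vector.shape)  # (10,)

# A one-hot vector for the same word needs vocab_size entries instead of dim.
one_hot_vector = np.zeros(vocab_size)
one_hot_vector[7786] = 1.0
print(one_hot_vector.size, "entries versus", glass_vector.size)
```

The matrix holds 100,000 values in total, but representing one word only requires reading a single row of 10 values.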




In [33]:
model.summary()

In [34]:
model.predict(embedded_docs)


[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 60ms/step

[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 71ms/step


array([[[ 0.01382059,  0.01172507, -0.04027963,  0.03795722,
         -0.02197634, -0.01471256,  0.04129604,  0.00802499,
          0.0404079 ,  0.03234955],
        [ 0.01382059,  0.01172507, -0.04027963,  0.03795722,
         -0.02197634, -0.01471256,  0.04129604,  0.00802499,
          0.0404079 ,  0.03234955],
        [ 0.01382059,  0.01172507, -0.04027963,  0.03795722,
         -0.02197634, -0.01471256,  0.04129604,  0.00802499,
          0.0404079 ,  0.03234955],
        [ 0.01382059,  0.01172507, -0.04027963,  0.03795722,
         -0.02197634, -0.01471256,  0.04129604,  0.00802499,
          0.0404079 ,  0.03234955],
        [ 0.00583773,  0.02295682,  0.04650483, -0.02558252,
         -0.03492404, -0.04526548,  0.03072326,  0.02082846,
          0.02799947, -0.02984726],
        [-0.03549889, -0.01597271,  0.01658494, -0.03778584,
          0.00234162,  0.03394393,  0.04135187,  0.04740966,
         -0.04382951, -0.00191938],
        [ 0.03295279, -0.02001818, -0.00102039,  0.0

In [35]:
embedded_docs[0]

array([   0,    0,    0,    0,  914, 7786, 7674, 3473], dtype=int32)

In [36]:
model.predict(embedded_docs[0].reshape(1, -1))

[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 34ms/step


array([[[ 0.01382059,  0.01172507, -0.04027963,  0.03795722,
         -0.02197634, -0.01471256,  0.04129604,  0.00802499,
          0.0404079 ,  0.03234955],
        [ 0.01382059,  0.01172507, -0.04027963,  0.03795722,
         -0.02197634, -0.01471256,  0.04129604,  0.00802499,
          0.0404079 ,  0.03234955],
        [ 0.01382059,  0.01172507, -0.04027963,  0.03795722,
         -0.02197634, -0.01471256,  0.04129604,  0.00802499,
          0.0404079 ,  0.03234955],
        [ 0.01382059,  0.01172507, -0.04027963,  0.03795722,
         -0.02197634, -0.01471256,  0.04129604,  0.00802499,
          0.0404079 ,  0.03234955],
        [ 0.00583773,  0.02295682,  0.04650483, -0.02558252,
         -0.03492404, -0.04526548,  0.03072326,  0.02082846,
          0.02799947, -0.02984726],
        [-0.03549889, -0.01597271,  0.01658494, -0.03778584,
          0.00234162,  0.03394393,  0.04135187,  0.04740966,
         -0.04382951, -0.00191938],
        [ 0.03295279, -0.02001818, -0.00102039,  0.0

In [37]:
model.predict(embedded_docs[0].reshape(1, -1)).shape

[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 28ms/step


(1, 8, 10)


1. **Input Processing**:
   - The pipeline starts with raw text sentences
   - Each word is converted to an integer index using the `one_hot()` function (hash-based, so indices are not guaranteed to be unique)
   - The sequences are padded to the same length (8 indices) using `pad_sequences()`

2. **Embedding Layer**:
   - The `Embedding` layer here is given three parameters:
     - `vocab_size`: 10000 (number of rows in the embedding matrix; every input index must be below it)
     - `dim`: 10 (size of the embedding vector for each word)
     - `input_length`: 8 (length of each padded sequence)

3. **Transformation Process**:
   - Each word (represented as an integer) is mapped to a 10-dimensional vector
   - For example, the word "glass" with index 7786 is transformed into a vector of 10 numbers
   - In a trained model these vectors capture semantic relationships between words; here the model has not been trained, so they are still the random initial values

4. **Output**:
   - The model outputs a 3D tensor with shape (number_of_sentences, sequence_length, embedding_dimension)
   - For the full batch it is (9, 8, 10): 9 sentences, each with 8 positions, and each position represented by 10 numbers. For the single sentence predicted above it is (1, 8, 10)

5. **What's Special**:
   - Unlike one-hot encoding, which creates sparse vectors (mostly zeros), embeddings create dense vectors
   - After training, words with similar meanings tend to have similar vector representations
   - The model learns these representations during training

For example, in the output above:
- The first sentence "the glass of milk" was padded to [0, 0, 0, 0, 914, 7786, 7674, 3473], and each of the 8 indices was replaced by a 10-dimensional vector
- The first four rows of the output are identical because they all correspond to the padding index 0
- Each index always maps to the same vector regardless of the sentence it appears in; because the model is untrained, the values do not yet carry meaning

This is much more compact than one-hot encoding and, once trained, allows the model to capture meaningful relationships between words. The embedding layer is essentially a lookup table where each word index maps to a learned vector representation.
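
The output shape can be reproduced with NumPy indexing. In this sketch `weights` is a random stand-in for the untrained embedding matrix (an assumption, not the layer's real values), while `embedded_docs` is copied from the padded output earlier in the notebook.

```python
import numpy as np

embedded_docs = np.array([
    [0, 0, 0, 0, 914, 7786, 7674, 3473],
    [0, 0, 0, 0, 914, 7786, 7674, 5858],
    [0, 0, 0, 0, 914, 5780, 7674, 7146],
    [0, 0, 0, 4639, 42, 6084, 8211, 3672],
    [0, 0, 0, 4639, 42, 6084, 8211, 7236],
    [0, 0, 0, 5761, 914, 3651, 7674, 2477],
    [0, 0, 0, 0, 3402, 1857, 4223, 2162],
    [0, 0, 0, 3402, 6721, 4223, 3133, 8211],
    [0, 0, 0, 0, 3402, 6721, 4223, 399],
], dtype=np.int32)

# Stand-in for the untrained embedding matrix (random values are an assumption).
rng = np.random.default_rng(0)
weights = rng.uniform(-0.05, 0.05, size=(10000, 10))

# Indexing with a (9, 8) integer array returns one row per index.
output = weights[embedded_docs]
print(output.shape)  # (9, 8, 10)

# Padding positions all use index 0, so their vectors are identical.
print(np.array_equal(output[0, 0], output[0, 3]))  # True

# The same index yields the same vector in every sentence (914 is "the").
print(np.array_equal(output[0, 4], output[1, 4]))  # True
```

This mirrors the repeated rows at the start of the `model.predict` output: they are the vector for the padding index.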



1. **One-Hot Vector Representation**:
   - Consider a vocabulary of 3 words: ["cat", "dog", "bird"]
   - Each word is represented as a one-hot vector:
     ```
     cat:  [1, 0, 0]
     dog:  [0, 1, 0]
     bird: [0, 0, 1]
     ```

2. **Embedding Matrix**:
   - Let's create a small embedding matrix W of size (vocabulary_size × embedding_dimension)
   - For our example, let's use embedding dimension = 2:
     ```
     W = [
         [0.1, 0.2],  # cat
         [0.3, 0.4],  # dog
         [0.5, 0.6]   # bird
     ]
     ```

3. **Matrix Multiplication**:
   - The embedding operation is a matrix multiplication between the one-hot vector and the embedding matrix
   - For the word "cat":
     ```
     [1, 0, 0] × [
         [0.1, 0.2],
         [0.3, 0.4],
         [0.5, 0.6]
     ] = [0.1, 0.2]
     ```

4. **Mathematical Formula**:
   - Let's represent this formally:
     ```
     e = o × W
     where:
     e = embedding vector
     o = one-hot vector
     W = embedding matrix
     ```

5. **Numerical Example**:

   a. One-hot vector for "dog":
   ```
   o = [0, 1, 0]
   ```

   b. Embedding matrix:
   ```
   W = [
       [0.1, 0.2],  # cat
       [0.3, 0.4],  # dog
       [0.5, 0.6]   # bird
   ]
   ```

   c. Matrix multiplication:
   ```
   e = o × W
   e = [0, 1, 0] × [
       [0.1, 0.2],
       [0.3, 0.4],
       [0.5, 0.6]
   ]
   ```

   d. Calculating each element of the resulting vector:
   ```
   e[0] = (0 × 0.1) + (1 × 0.3) + (0 × 0.5) = 0.3
   e[1] = (0 × 0.2) + (1 × 0.4) + (0 × 0.6) = 0.4
   ```

   e. Final embedding vector:
   ```
   e = [0.3, 0.4]
   ```

6. **Efficiency Trick**:
   - In practice, the layer doesn't actually perform the matrix multiplication
   - The one-hot vector has a single 1 and the rest are 0s, so the product only ever picks out one row
   - The layer can simply select the corresponding row from the embedding matrix
   - This is why embedding layers are so efficient


7. **Learning Process**:
   - During training, the values in the embedding matrix are updated
   - This is done through backpropagation and gradient descent
   - The updates are driven by the training loss, not by an explicit similarity goal
   - As a result, words used in similar contexts tend to end up close to each other in this vector space
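
The worked example above can be checked with NumPy, followed by a single gradient-descent step. The loss, `target`, and learning rate below are hypothetical choices made only for illustration; they are not part of the notebook's model.

```python
import numpy as np

W = np.array([[0.1, 0.2],   # cat
              [0.3, 0.4],   # dog
              [0.5, 0.6]])  # bird
o = np.array([0.0, 1.0, 0.0])  # one-hot vector for "dog"

e_matmul = o @ W   # matrix multiplication
e_lookup = W[1]    # direct row selection
print(np.allclose(e_matmul, e_lookup))  # True

# One gradient-descent step on a hypothetical loss 0.5 * ||e - target||^2.
target = np.array([1.0, 1.0])
learning_rate = 0.1
grad_e = e_matmul - target      # gradient of the loss with respect to e
grad_W = np.outer(o, grad_e)    # gradient with respect to W: nonzero only in the "dog" row
W_new = W - learning_rate * grad_W
print(W_new)
```

Only the row for "dog" moves (toward the target); the rows for "cat" and "bird" are unchanged because their entries in the one-hot vector are zero.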




# Save Model

In [39]:
import os
os.makedirs('pickle', exist_ok=True)  # both save calls need this directory to exist
joblib.dump(embedded_docs, 'pickle/embedded_docs.joblib')
model.save('pickle/model_embedding.h5')


