In [1]:
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences



In [3]:
# Sample text data
texts = [
    "This is the first document. ",
    "This document is the second document. ",
    "And this is the third one. ",
    "Is this the first document?",
]

Tokenize the text: split the text into individual words (tokens). By default the Keras Tokenizer lowercases the text and strips punctuation first.
Create a vocabulary: map each unique token to an integer index.
Convert text to sequences: replace each token in the text with its integer index from the vocabulary.
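
The three steps above can be sketched in plain Python to show roughly what happens under the hood. This is an approximation for illustration, not the Keras implementation: the regex-based `simple_tokenize` helper and the `build_vocab` and `to_sequences` names are hypothetical, and the sketch assumes that ranking tokens by frequency (ties broken by first appearance) is the ordering you want.

```python
import re
from collections import Counter

def simple_tokenize(text):
    # Hypothetical helper: lowercase, then keep runs of letters, digits, and apostrophes.
    return re.findall(r"[a-z0-9']+", text.lower())

def build_vocab(texts):
    # Count every token, then rank by frequency (ties keep first-seen order).
    counts = Counter(tok for text in texts for tok in simple_tokenize(text))
    # Indices start at 1 so that 0 stays free for padding.
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(texts, vocab):
    # Replace each known token with its index; unknown tokens are skipped.
    return [[vocab[t] for t in simple_tokenize(text) if t in vocab] for text in texts]

sample_texts = [
    "This is the first document. ",
    "This document is the second document. ",
    "And this is the third one. ",
    "Is this the first document?",
]
vocab = build_vocab(sample_texts)
seqs = to_sequences(sample_texts, vocab)
print(vocab)
print(seqs)
```

Ranking by frequency is why "document" (four occurrences) receives a lower index than "first" (two occurrences), even though "first" appears earlier in the text.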


In [4]:
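# Step 1: Tokenize the text
tokenizer = Tokenizer()

# Build the vocabulary from the corpus; indices start at 1 and more frequent words get lower indices
tokenizer.fit_on_texts(texts)
word_index = tokenizer.word_index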

In [5]:
word_index

{'this': 1,
 'is': 2,
 'the': 3,
 'document': 4,
 'first': 5,
 'second': 6,
 'and': 7,
 'third': 8,
 'one': 9}

In [6]:
# Step 2: Convert text to sequences
sequences = tokenizer.texts_to_sequences(texts)



In [7]:
sequences

[[1, 2, 3, 5, 4], [1, 4, 2, 3, 6, 4], [7, 1, 2, 3, 8, 9], [2, 1, 3, 5, 4]]

So far, the text has been tokenized with Tokenizer and converted into sequences of integers. Next, pad the sequences so that they all have the same length.
Then pass the padded sequences through an Embedding layer, which maps each word index in the sequences to a dense vector representation.


In [9]:
# Step 3: Pad sequences
# Give all sequences the same length by padding short ones with zeros (longer ones would be truncated).
# padding='post' appends the zeros at the end of each sequence.

max_sequence_length = max(len(seq) for seq in sequences)

sequences_padded = pad_sequences(sequences, maxlen=max_sequence_length, padding='post')

In [10]:
sequences_padded

array([[1, 2, 3, 5, 4, 0],
       [1, 4, 2, 3, 6, 4],
       [7, 1, 2, 3, 8, 9],
       [2, 1, 3, 5, 4, 0]], dtype=int32)
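
To make the padding step concrete, here is a minimal pure-Python sketch of post-padding. The `pad_post` helper is hypothetical and only illustrates the idea: it returns plain lists rather than a NumPy array, and it truncates over-long sequences at the end, which is a simplification (check the `truncating` argument of `pad_sequences` for the library's behaviour).

```python
def pad_post(sequences, maxlen=None, value=0):
    # Hypothetical helper: right-pad each sequence with `value` up to `maxlen`.
    if maxlen is None:
        maxlen = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        kept = list(seq)[:maxlen]  # drop anything beyond maxlen (truncates at the end)
        padded.append(kept + [value] * (maxlen - len(kept)))
    return padded

example = [[1, 2, 3, 5, 4], [1, 4, 2, 3, 6, 4], [7, 1, 2, 3, 8, 9], [2, 1, 3, 5, 4]]
for row in pad_post(example):
    print(row)
```

Only the first and last sequences have five tokens, so only those two rows gain a trailing 0.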

In [11]:
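# Step 4: Apply the Embedding layer
vocab_size = len(word_index) + 1  # Add 1 because index 0 is reserved for the padding token
embedding_dim = 10  # Dimensionality of the dense embedding
# Note: despite its name, embedding_matrix holds the layer's output (one vector per token),
# not the layer's weight matrix. The layer is untrained, so the vectors are randomly initialized.
embedding_matrix = tf.keras.layers.Embedding(vocab_size, embedding_dim)(sequences_padded)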

In [12]:
vocab_size

10

In [13]:
embedding_matrix

<tf.Tensor: shape=(4, 6, 10), dtype=float32, numpy=
array([[[-0.02462852, -0.03239276, -0.01633903, -0.02879429,
         -0.00597594,  0.00267736,  0.03933405, -0.00447339,
          0.01258976, -0.00805227],
        [ 0.01320871,  0.04057591,  0.04310811, -0.02795578,
          0.03285135, -0.0220477 , -0.0276749 ,  0.00879598,
         -0.01257243, -0.00731296],
        [-0.00112033, -0.0274989 , -0.01941402,  0.03740039,
          0.02167271,  0.043897  , -0.01676438,  0.03806067,
         -0.04262164,  0.01647612],
        [ 0.01266922, -0.02711682, -0.01906847, -0.02273291,
          0.02866424, -0.00551039,  0.01910396,  0.02294696,
          0.04056872,  0.04903822],
        [ 0.01748944, -0.0207118 , -0.0221781 ,  0.02035685,
         -0.04420285, -0.02716148,  0.03583176,  0.03661368,
         -0.00591414,  0.02284663],
        [ 0.00608613, -0.04886366, -0.02538626,  0.03250617,
          0.03394547,  0.02606091, -0.02088164, -0.0145107 ,
          0.01942391, -0.03785391]],

In [14]:
# Print the shape of the embedded output
print(embedding_matrix.shape)  # Output: (num_samples, max_sequence_length, embedding_dim)

(4, 6, 10)


Dimension 1 (4): the number of samples in the input data. Here there are 4 samples (sentences/documents).
Dimension 2 (6): the length of the sequences after padding. The sequences were padded to the length of the longest one, 6, so this dimension is 6.
Dimension 3 (10): the dimensionality of the dense embedding vectors. embedding_dim was set to 10, so each word is represented by a dense vector of length 10.
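
The output shape can be reproduced with a plain NumPy lookup, which may help show where each dimension comes from. This is a sketch under assumptions, not how Keras is implemented internally: `weights` is a hypothetical stand-in for the layer's trainable table, filled with random values from an arbitrarily chosen range.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size = 10      # 9 words + 1 for the padding index 0
embedding_dim = 10   # length of each dense vector

# Stand-in for the layer's weights: one row per vocabulary index.
# Random values here; a trained layer would learn them.
weights = rng.uniform(-0.05, 0.05, size=(vocab_size, embedding_dim))

padded = np.array([
    [1, 2, 3, 5, 4, 0],
    [1, 4, 2, 3, 6, 4],
    [7, 1, 2, 3, 8, 9],
    [2, 1, 3, 5, 4, 0],
])

# Integer-array indexing picks one row of `weights` per token.
embedded = weights[padded]
print(embedded.shape)  # (num_samples, max_sequence_length, embedding_dim)

# The same word index always maps to the same vector.
print(np.array_equal(embedded[0, 0], embedded[1, 0]))
```

Because the lookup is just row selection, the first two dimensions of the result copy the shape of the padded input, and the last one is the width of the table.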
