In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

#### Word embeddings
Word embeddings represent words as vectors in a continuous vector space. Unlike one-hot encoding, where every word is an isolated unit that is equally distant from every other word, embeddings capture semantic relationships by placing similar words closer together. The result is a dense representation in which each word is a real-valued vector, typically with tens to a few hundred dimensions.

#### Steps Involved and Uses:
Preprocessing Text: The first step involves cleaning and preparing the text data. This may include removing punctuation, converting text to lowercase, tokenizing (splitting text into words or tokens), and possibly removing stop words (common words that are often filtered out).
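The preprocessing step can be sketched in plain Python. This is a minimal illustration, not a full pipeline: the `preprocess` helper and the tiny `STOP_WORDS` set are hypothetical names introduced here, and real projects usually rely on a tokenizer from an NLP library.

```python
import string

# Hypothetical, deliberately tiny stop-word list (illustration only).
STOP_WORDS = {"the", "of", "a", "an", "is", "are"}

def preprocess(text, remove_stop_words=False):
    """Lowercase, strip ASCII punctuation, and split on whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens

print(preprocess("The glass of milk!"))
print(preprocess("The glass of milk!", remove_stop_words=True))
```

Whether to drop stop words depends on the task; for short sentences they can carry much of the structure.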

Choosing Vocabulary Size: Decide on the vocabulary size, which includes determining the set of words that will be represented in the embedding space. This often involves selecting the most frequent words in the dataset.

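Word Encoding: Each word in the vocabulary is assigned an integer ID. Conceptually, an ID corresponds to a one-hot vector with a single 1 at that position, and the embedding layer maps each ID to a dense vector. Note that the Keras `one_hot` function, despite its name, returns integer indices produced by hashing each word, so two different words can collide on the same index.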
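As a rough sketch of the last two steps, the snippet below keeps the most frequent words and gives each one its own integer ID. The helper names `build_vocab` and `encode` are hypothetical, and reserving ID 0 for padding is a common convention assumed here, not something the text prescribes.

```python
from collections import Counter

corpus = [
    "the glass of milk",
    "the glass of juice",
    "the cup of tea",
]

def build_vocab(sentences, max_words):
    """Map the max_words most frequent words to IDs 1..max_words (0 = padding)."""
    counts = Counter(word for s in sentences for word in s.lower().split())
    most_common = [word for word, _ in counts.most_common(max_words)]
    return {word: idx for idx, word in enumerate(most_common, start=1)}

def encode(sentence, vocab):
    """Replace each in-vocabulary word with its integer ID."""
    return [vocab[word] for word in sentence.lower().split()]

vocab = build_vocab(corpus, max_words=5)
print(vocab)
print(encode("the glass of milk", vocab))
```

Unlike a hashing approach, an explicit dictionary guarantees that no two words share an ID, at the cost of having to store the mapping.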

Training the Embedding Layer: The embedding vectors can be learned in two ways: by training a model on a specific task (such as text classification or sentiment analysis) and learning the embeddings within this model, or by using pre-trained embeddings obtained by training on large text corpora like Google News or Wikipedia.
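For the pre-trained route, one common pattern is to copy the available vectors into a matrix whose rows line up with the integer IDs. The sketch below assumes a plain-text layout of `word v1 v2 ... vN` per line (the layout GloVe's text files use); the in-memory `fake_file` and the `word_to_id` mapping are made up for illustration.

```python
import io
import numpy as np

# Made-up stand-in for a pre-trained vector file ("word v1 v2 v3" per line).
fake_file = io.StringIO(
    "glass 0.1 0.2 0.3\n"
    "milk 0.4 0.5 0.6\n"
    "tea 0.7 0.8 0.9\n"
)

# Hypothetical vocabulary; ID 0 is reserved for padding.
word_to_id = {"glass": 1, "milk": 2, "juice": 3}
embedding_dim = 3

# Row i holds the vector for the word with ID i; rows stay zero when the
# word has no pre-trained vector ("juice" in this toy example).
embedding_matrix = np.zeros((len(word_to_id) + 1, embedding_dim))
for line in fake_file:
    word, *values = line.split()
    if word in word_to_id:
        embedding_matrix[word_to_id[word]] = np.asarray(values, dtype="float32")

print(embedding_matrix)
```

A matrix like this can then be used to initialize an embedding layer, which is either kept frozen or fine-tuned on the task.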

Using Embeddings: Once trained, these embeddings can be used as input to various NLP tasks, such as text classification, sentiment analysis, machine translation, and more. The embeddings capture semantic meanings, allowing the model to understand context and similarity between words.
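One simple way to feed embeddings to a downstream task is to average the word vectors of a sentence and compare the results. The vectors below are invented for illustration, and averaging is just one baseline among many; it ignores word order.

```python
import numpy as np

# Hypothetical 3-dimensional vectors, made up for illustration.
embeddings = {
    "glass": np.array([0.9, 0.1, 0.0]),
    "cup":   np.array([0.8, 0.2, 0.1]),
    "milk":  np.array([0.1, 0.9, 0.0]),
    "tea":   np.array([0.2, 0.8, 0.1]),
    "boy":   np.array([0.0, 0.1, 0.9]),
}

def sentence_vector(tokens):
    """Average the vectors of the tokens that have an embedding."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    return np.mean(vectors, axis=0)

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

a = sentence_vector(["glass", "milk"])
b = sentence_vector(["cup", "tea"])
c = sentence_vector(["boy"])

# With these toy vectors, a is much closer to b than to c.
print(cosine(a, b), cosine(a, c))
```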

#### Example with the Sentence "I like you":
Suppose each word in the sentence "I like you" is represented in a 3-dimensional embedding space. After preprocessing and assigning an integer ID to each unique word, the embeddings for these words are learned or retrieved from a pre-trained set. Instead of representing "I", "like", and "you" as one-hot vectors (which would be sparse and high-dimensional), each word is represented as a dense vector, e.g.,

"I" might be represented as [0.5, 0.1, -0.4],
"like" as [0.7, -0.3, 0.2],
"you" as [0.4, 0.9, -0.5].
These values are made up for illustration. In a trained model, the vectors encode each word's relationships to other words, so related words end up with similar vectors.
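The example can be written as a lookup table, using the three illustrative vectors from the text. The sketch also checks that multiplying a one-hot row by the matrix selects the same row as direct indexing, which is the sense in which an embedding layer replaces one-hot inputs. The `word_to_id` mapping is an assumption made for this example. (With only three words the one-hot vectors also have three dimensions; with a realistic vocabulary they would be far wider.)

```python
import numpy as np

word_to_id = {"I": 0, "like": 1, "you": 2}  # hypothetical ID assignment

# Rows are the illustrative vectors from the text.
embedding_matrix = np.array([
    [0.5, 0.1, -0.4],   # "I"
    [0.7, -0.3, 0.2],   # "like"
    [0.4, 0.9, -0.5],   # "you"
])

ids = [word_to_id[w] for w in "I like you".split()]
dense = embedding_matrix[ids]            # one row per word, shape (3, 3)

# A one-hot row times the matrix picks out the same row as indexing does.
one_hot_rows = np.eye(len(word_to_id))[ids]
assert np.allclose(one_hot_rows @ embedding_matrix, dense)

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

print("I vs you: ", cosine(dense[0], dense[2]))
print("I vs like:", cosine(dense[0], dense[1]))
```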

#### Advantages:
Semantic Meaning: Word embeddings capture the semantic relationships between words, meaning that words with similar meanings are closer in the vector space. This allows models to better understand context and nuances in language.

Efficiency: Embeddings provide a dense, low-dimensional representation of words, which is far more compact than high-dimensional sparse vectors such as one-hot encodings. In practice this usually means fewer parameters in downstream layers and faster training, and it often improves generalization.
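To make the size difference concrete, the sketch below encodes one padded sentence both ways, using the vocabulary size (500) and embedding dimension (10) from the code cell later in this notebook. The embedding matrix here is random, standing in for learned weights.

```python
import numpy as np

vocabulary_size = 500
embedding_dim = 10

# One padded sentence of 8 token IDs (first row of the notebook's output).
token_ids = np.array([0, 0, 0, 0, 430, 463, 110, 311])

# One-hot: one 500-wide row per token, almost entirely zeros.
one_hot_vectors = np.eye(vocabulary_size)[token_ids]

# Dense: one 10-wide row per token, looked up from a (random) matrix.
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(vocabulary_size, embedding_dim))
dense_vectors = embedding_matrix[token_ids]

print(one_hot_vectors.shape, one_hot_vectors.size)
print(dense_vectors.shape, dense_vectors.size)
```

For this sentence the one-hot form holds 50 times as many values as the dense form, and the gap grows with the vocabulary.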

#### Limitations:
Out-of-Vocabulary (OOV) Words: Words not seen during training (OOV words) are not represented in the embedding space. This can limit the model's ability to understand and process new or rare words.
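A common workaround, sketched below, is to reserve one ID for unknown words so that unseen words at least map to some vector. The `<pad>` and `<unk>` tokens and the `lookup` helper are hypothetical names, and the matrix is random rather than trained.

```python
import numpy as np

# Hypothetical vocabulary: ID 0 for padding, ID 1 for unknown words.
vocab = {"<pad>": 0, "<unk>": 1, "glass": 2, "milk": 3}

rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(len(vocab), 4))

def lookup(word):
    """Return the word's vector, falling back to the <unk> row."""
    return embedding_matrix[vocab.get(word, vocab["<unk>"])]

# Two different unseen words collapse onto the same vector.
print(np.array_equal(lookup("kombucha"), lookup("smoothie")))
```

The fallback keeps the model running, but every unseen word looks identical to it, which is why subword-based approaches are often preferred when rare words matter.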

Static Representations: Traditional word embeddings provide a static representation for each word, meaning that the context in which a word is used is not considered. This can be a limitation for words with multiple meanings based on context (polysemy). However, more recent models like BERT and ELMo attempt to address this by providing context-dependent embeddings.

In [2]:
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Dense, Flatten
from tensorflow.keras.preprocessing.text import one_hot
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.regularizers import l2  # Example regularizer

# Check TensorFlow version (Ensure it's > 2.0)
print(f"TensorFlow Version: {tf.__version__}")

# Define sentences to be encoded and embedded
sentences = [
    'the glass of milk',
    'the glass of juice',
    'the cup of tea',
    'I am a good boy',
    'I am a good developer',
    'understand the meaning of words',
    'your videos are good'
]

# Specify the vocabulary size
vocabulary_size = 500

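# Map each word to an integer index (0 is left free for padding).
# Despite its name, one_hot hashes words to integers rather than building
# one-hot vectors, so different words can collide on the same index.
def encode_sentences(sentences, vocab_size):
    return [one_hot(sentence, vocab_size) for sentence in sentences]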

# Encode the sentences
one_hot_encoded = encode_sentences(sentences, vocabulary_size)

# Pre-pad the sequences to ensure uniform length
sequence_length = 8
padded_sequences = pad_sequences(one_hot_encoded, padding='pre', maxlen=sequence_length)
print("Padded Sequences:\n", padded_sequences)

# Define embedding dimensions
embedding_dim = 10

# Create a Sequential model
model = Sequential([
    Embedding(
        input_dim=vocabulary_size,
        output_dim=embedding_dim,
        embeddings_initializer="uniform",  # Default, but specified for clarity
        embeddings_regularizer=l2(0.01),  # Example L2 regularization
        # embeddings_constraint= <YourConstraintHere>, Optional: specify if needed
        # mask_zero=True, Uncomment if you are going to use RNNs and need to mask zero padding
    ),
    Flatten(),  # Flatten the output for Dense layer
    Dense(1, activation='sigmoid')  # Example of adding a Dense layer for a potential classification task
])

# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy')  # Assuming a binary classification task

# Display model summary
model.summary()

# Example: Run the first sentence through the whole model.
# Note: this returns the sigmoid output of the untrained Dense layer
# (one value per sentence), not the word embedding vectors themselves.
first_sentence_embedding = model.predict(padded_sequences[0].reshape(1, -1))
print("Embedding for the first sentence:\n", first_sentence_embedding)

# Example: Same for all sentences (one untrained prediction per row)
all_embeddings = model.predict(padded_sequences)
print("Embeddings for all sentences:\n", all_embeddings)


TensorFlow Version: 2.15.0
Padded Sequences:
 [[  0   0   0   0 430 463 110 311]
 [  0   0   0   0 430 463 110 390]
 [  0   0   0   0 430 173 110 419]
 [  0   0   0 436 185  11 314 277]
 [  0   0   0 436 185  11 314  88]
 [  0   0   0  74 430 332 110 467]
 [  0   0   0   0 469 230 354 314]]


1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 273ms/step
Embedding for the first sentence:
 [[0.5006255]]
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 63ms/step
Embeddings for all sentences:
 [[0.5006255 ]
 [0.4971775 ]
 [0.4944139 ]
 [0.49560043]
 [0.49277392]
 [0.49107322]
 [0.48755872]]
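The values printed above are single numbers near 0.5 because they come from the untrained sigmoid output, not from the embedding layer. The NumPy sketch below imitates what the embedding layer itself computes on the padded sequences: a row lookup that yields one 10-dimensional vector per token. The weight matrix is a random stand-in drawn from [-0.05, 0.05], which is assumed here to match Keras's default range for the "uniform" initializer; in the notebook's model, the real vectors could be obtained by calling the first layer directly, for example `model.layers[0](padded_sequences)`.

```python
import numpy as np

vocabulary_size, embedding_dim = 500, 10

# Padded sequences copied from the output above.
padded_sequences = np.array([
    [0, 0, 0, 0, 430, 463, 110, 311],
    [0, 0, 0, 0, 430, 463, 110, 390],
    [0, 0, 0, 0, 430, 173, 110, 419],
    [0, 0, 0, 436, 185, 11, 314, 277],
    [0, 0, 0, 436, 185, 11, 314, 88],
    [0, 0, 0, 74, 430, 332, 110, 467],
    [0, 0, 0, 0, 469, 230, 354, 314],
])

# Random stand-in for the layer's weights (assumption: uniform in [-0.05, 0.05]).
rng = np.random.default_rng(0)
weights = rng.uniform(-0.05, 0.05, size=(vocabulary_size, embedding_dim))

# Embedding lookup: each token ID selects one row of the weight matrix.
embedded = weights[padded_sequences]
print(embedded.shape)       # (sentences, tokens, embedding_dim)

# Flatten concatenates the token vectors of each sentence.
flattened = embedded.reshape(len(padded_sequences), -1)
print(flattened.shape)      # (sentences, tokens * embedding_dim)
```

Tokens that share an ID, such as 430 in the first two sentences, receive identical vectors regardless of the surrounding words.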
