
Embedding vectors and one-hot encoding both represent categorical data numerically.
Embeddings are learned, dense, and compact: they capture semantic relationships in a lower-dimensional space.
One-hot encoding creates sparse, high-dimensional vectors in which each category gets its own orthogonal dimension.
Embeddings therefore suit large vocabularies and semantic tasks, while one-hot encoding suits small, fixed sets
of categories without inherent order.


One-Hot Encoding
* How it works: Creates a binary vector for each category, where only one element (the "hot" one) is 1,
 and the rest are 0. 
* Pros: Simple, deterministic, and effective for nominal (no natural order) categorical variables. 
* Cons: Vector length equals the number of categories, so large vocabularies produce very high-dimensional, sparse vectors that waste memory and can trigger the "curse of dimensionality". Every pair of distinct categories is equally far apart, so no semantic relationship between categories is encoded. 
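
The lack of semantic information can be checked directly. The sketch below is a minimal illustration, not part of the original notes: the category names and the helper functions `one_hot` and `cosine_similarity` are made up for the example. It shows that any two distinct one-hot vectors have cosine similarity 0, so "cat" is no closer to "dog" than to "car".

```python
import math

def one_hot(index, size):
    """Return a list of `size` zeros with a single 1 at `index`."""
    vec = [0] * size
    vec[index] = 1
    return vec

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Illustrative categories (an assumption for this example).
categories = ["cat", "dog", "car"]
vectors = {name: one_hot(i, len(categories)) for i, name in enumerate(categories)}

sim_cat_dog = cosine_similarity(vectors["cat"], vectors["dog"])
sim_cat_car = cosine_similarity(vectors["cat"], vectors["car"])
print(sim_cat_dog, sim_cat_car)  # 0.0 0.0
```

Both similarities are identical because the vectors are orthogonal by construction, whatever the categories mean.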

Embeddings
* How it works: Maps data into dense, lower-dimensional vectors where each dimension captures latent features or semantic meaning. 
* Pros: Dense vectors need far less memory per item than one-hot vectors and are cheaper to compute with. They can capture complex relationships and patterns, which often leads to better model performance and generalization. 
* Cons: Learning good vectors requires significant training data and computational resources. The quality depends on the underlying training algorithm (e.g., Word2Vec, GloVe). 
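
One way to see how the two representations relate: looking up a row in an embedding table gives the same result as multiplying the one-hot vector by that table. The sketch below assumes a hypothetical 4-category, 2-dimensional table with made-up values. In a real model these numbers would be learned during training.

```python
def one_hot(index, size):
    """Return a list of `size` zeros with a single 1.0 at `index`."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec

def vector_matrix_product(vec, matrix):
    """Multiply a row vector of length V by a V x D matrix (list of rows)."""
    n_cols = len(matrix[0])
    return [sum(vec[i] * matrix[i][j] for i in range(len(vec))) for j in range(n_cols)]

# Hypothetical embedding table: 4 categories, 2 dimensions, made-up values.
embedding_table = [
    [0.1, 0.9],
    [0.2, 0.8],
    [0.7, 0.3],
    [0.5, 0.5],
]

category_index = 2
via_lookup = embedding_table[category_index]
via_matmul = vector_matrix_product(one_hot(category_index, 4), embedding_table)
print(via_lookup, via_matmul)  # [0.7, 0.3] [0.7, 0.3]
```

The lookup avoids building the sparse vector at all, which is why embedding layers are usually implemented as index lookups.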

When to Use Which

* Use One-Hot Encoding When: You have a small, fixed number of categories and don't need to capture semantic relationships between them. 
* Use Embeddings When: You have a large number of categories, or you need to leverage the semantic meaning and context of the data, as in natural language processing (NLP). 

Key Differences
* Dimensionality: Embeddings are lower-dimensional and dense; one-hot encoding creates high-dimensional and sparse vectors. 
* Information Content: Embeddings encode semantic relationships; one-hot encoding provides no semantic information. 
* Learning: Embeddings are learned through machine learning models; one-hot encoding is a fixed, deterministic process. 

Word2Vec: captures semantic relationships and meaning
* Developed by Google (Mikolov et al., 2013).
* It converts words into dense vectors (called embeddings), where similar words are close to each other in vector space.
* Example: king - man + woman ≈ queen.
How it works:
1. Two architectures:
    * CBOW (Continuous Bag of Words): predicts a word from its surrounding words.
    * Skip-Gram: predicts surrounding words given a word.
2. The model trains on large text data and learns vector representations.
3. After training, each word is represented as a fixed-size vector (e.g., 100-dimensional).
Limitation: It learns one vector per whole word, so out-of-vocabulary (OOV) words, i.e. words not seen during training, have no representation.
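
The analogy king - man + woman ≈ queen mentioned above can be illustrated with vector arithmetic. The vectors below are hand-made 2-D toy values chosen so the analogy works (axis 0 loosely means "royalty", axis 1 "gender"). They are not trained Word2Vec embeddings, and the `analogy` helper is a hypothetical name for this sketch.

```python
import math

# Hand-made toy vectors (an assumption for illustration, NOT trained embeddings).
toy_vectors = {
    "king":  [1.0,  1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0,  1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.0],
}

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding a, b, and c."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in {a, b, c}]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

result = analogy("king", "man", "woman", toy_vectors)
print(result)  # queen
```

With real trained vectors the match is approximate rather than exact, which is why the notes write ≈ instead of =.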

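Examples:
* CBOW: given the context "I ___ eating", predict the missing word "love".
* Skip-Gram: given the word "eating" in "I love eating icecream alone", predict its neighbors "I", "love", "icecream", "alone".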
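
The difference between the two architectures shows up in the training examples they are fed. The sketch below builds those examples from the sentence above. The function names and the window size of 2 are assumptions for illustration. Real implementations add further details, such as subsampling of frequent words and negative sampling, that are omitted here.

```python
def cbow_pairs(tokens, window=2):
    """Build (context_words, target_word) pairs: CBOW predicts the target from its context."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        pairs.append((left + right, target))
    return pairs

def skipgram_pairs(tokens, window=2):
    """Build (center_word, context_word) pairs: Skip-Gram predicts each neighbor from the center."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

sentence = "I love eating icecream alone".split()

cbow_example = cbow_pairs(sentence)[2]
print(cbow_example)
# (['I', 'love', 'icecream', 'alone'], 'eating')

eating_pairs = [p for p in skipgram_pairs(sentence) if p[0] == "eating"]
print(eating_pairs)
# [('eating', 'I'), ('eating', 'love'), ('eating', 'icecream'), ('eating', 'alone')]
```

CBOW produces one example per position, while Skip-Gram produces one per (center, neighbor) pair, which is one reason Skip-Gram trains more slowly.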
 
FastText
* Developed by Facebook AI Research (2016).
* It is an extension of Word2Vec.
* Key difference: Instead of representing each word as a whole, FastText breaks words into subword units (character n-grams).
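Example:
Word = “playing”
* Subwords (character n-grams): the 4-grams “play”, “layi”, “ayin”, “ying”, plus 3-grams such as “ing”. FastText also wraps the word in the boundary markers “<” and “>” and by default uses n-grams of length 3 to 6.
* The word's embedding is built by combining the vectors of these pieces (the original paper sums them).
Advantages of FastText: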
* Handles rare words better.
* Can create embeddings for out-of-vocabulary (OOV) words (e.g., misspellings, new words).
* Especially useful for morphologically rich languages (like Hindi, Tamil, Turkish, Finnish, etc.).
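
The subword idea can be sketched in a few lines. The helper `char_ngrams` below is a hypothetical name for this example. It follows the scheme described in the FastText paper (boundary markers plus all n-grams in a length range), but the real implementation also hashes n-grams into a fixed number of buckets, which is omitted here.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of `word` wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    ngrams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams

four_grams = char_ngrams("playing", min_n=4, max_n=4)
print(four_grams)
# ['<pla', 'play', 'layi', 'ayin', 'ying', 'ing>']

# An unseen word shares many n-grams with a related known word,
# which is what lets FastText assemble a vector for it.
shared = sorted(set(char_ngrams("night")) & set(char_ngrams("nights")))
print(shared)
```

Because "nights" reuses most of the n-grams of "night", a vector built from those pieces lands close to the vector of "night" even if "nights" never appeared in training.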


Word2Vec = word-level embeddings.
FastText = word + subword embeddings (smarter for rare/OOV words).


Feature     CBOW                        Skip-Gram
Predicts    Word from context           Context from word
Speed       Faster                      Slower
Accuracy    Good for frequent words     Better for rare words
Best for    Larger datasets             Smaller datasets


2017: the Transformer architecture is introduced (Word2Vec dates from 2013).

In [1]:
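def one_hot_encode(text):
    """One-hot encode each whitespace-separated word in `text`.

    Returns the encoded vectors (one per word, in order), the word-to-index
    mapping, and the sorted vocabulary.
    """
    words = text.split()
    # Sort the unique words so the index assignment is deterministic.
    vocabulary = sorted(set(words))
    word_to_index = {word: i for i, word in enumerate(vocabulary)}
    one_hot_encoded = []
    for word in words:
        # All zeros except a single 1 at the word's vocabulary index.
        one_hot_vector = [0] * len(vocabulary)
        one_hot_vector[word_to_index[word]] = 1
        one_hot_encoded.append(one_hot_vector)
    return one_hot_encoded, word_to_index, vocabulary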

example_text = "cat in the hat dog on the mat bird in the tree"

one_hot_encoded, word_to_index, vocabulary = one_hot_encode(example_text)

print("Vocabulary:", vocabulary)
print("Word to Index Mapping:", word_to_index)
print("One-Hot Encoded Matrix:")
for word, encoding in zip(example_text.split(), one_hot_encoded):
    print(f"{word}: {encoding}")

Vocabulary: ['bird', 'cat', 'dog', 'hat', 'in', 'mat', 'on', 'the', 'tree']
Word to Index Mapping: {'bird': 0, 'cat': 1, 'dog': 2, 'hat': 3, 'in': 4, 'mat': 5, 'on': 6, 'the': 7, 'tree': 8}
One-Hot Encoded Matrix:
cat: [0, 1, 0, 0, 0, 0, 0, 0, 0]
in: [0, 0, 0, 0, 1, 0, 0, 0, 0]
the: [0, 0, 0, 0, 0, 0, 0, 1, 0]
hat: [0, 0, 0, 1, 0, 0, 0, 0, 0]
dog: [0, 0, 1, 0, 0, 0, 0, 0, 0]
on: [0, 0, 0, 0, 0, 0, 1, 0, 0]
the: [0, 0, 0, 0, 0, 0, 0, 1, 0]
mat: [0, 0, 0, 0, 0, 1, 0, 0, 0]
bird: [1, 0, 0, 0, 0, 0, 0, 0, 0]
in: [0, 0, 0, 0, 1, 0, 0, 0, 0]
the: [0, 0, 0, 0, 0, 0, 0, 1, 0]
tree: [0, 0, 0, 0, 0, 0, 0, 0, 1]


In [1]:
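# Shadow the built-in print with pprint so that outputs are shown as pretty-printed reprs.
from pprint import pprint as print
from gensim.models.fasttext import FastText
from gensim.test.utils import datapath

# Path to the training corpus that ships with gensim's test data
corpus_file = datapath('lee_background.cor')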

model = FastText(vector_size=100)

# build the vocabulary
model.build_vocab(corpus_file=corpus_file)

# train the model
model.train(
    corpus_file=corpus_file, epochs=model.epochs,
    total_examples=model.corpus_count, total_words=model.corpus_total_words,
)

print(model)


<gensim.models.fasttext.FastText object at 0x1692e5a10>


In [2]:
import tempfile
import os
with tempfile.NamedTemporaryFile(prefix='saved_model_gensim-', delete=False) as tmp:
    model.save(tmp.name, separately=[])

# Load back the same model.
loaded_model = FastText.load(tmp.name)
print(loaded_model)

os.unlink(tmp.name)  

<gensim.models.fasttext.FastText object at 0x169851bd0>


In [3]:
wv = model.wv
print(wv)

# FastText models support vector lookups for out-of-vocabulary words
# by combining the character n-grams that belong to the word.
# First, check a word that is in the training vocabulary.
print('night' in wv.key_to_index)

<gensim.models.fasttext.FastTextKeyedVectors object at 0x169a4dc50>
True


In [4]:
print('nights' in wv.key_to_index)

False


In [5]:
print(wv['night'])

array([-0.20445213,  0.19205473, -0.2696594 , -0.08837921,  0.06665937,
        0.37713125,  0.29362696,  0.4960144 ,  0.25551304, -0.2335439 ,
        0.02672886, -0.16090696, -0.22832492,  0.510989  , -0.40147826,
       -0.5596598 ,  0.18688893, -0.25138256, -0.42957833, -0.54374593,
       -0.4709174 , -0.05898951, -0.45171773, -0.12916653, -0.20273305,
       -0.3277675 , -0.6941753 , -0.1159768 , -0.33495227,  0.2764851 ,
       -0.33012283,  0.30461985,  0.8438509 , -0.26730436,  0.18636127,
        0.4047822 ,  0.3868081 , -0.1048426 , -0.3803687 , -0.34521735,
        0.4718249 , -0.42686203,  0.02969932, -0.41831368, -0.5224052 ,
       -0.304583  , -0.08432893,  0.11869861,  0.3761345 , -0.00140285,
        0.35731208, -0.43189424,  0.296017  , -0.41103962, -0.1897895 ,
       -0.18131128, -0.15968084, -0.13176629,  0.04600557, -0.35803026,
       -0.3385793 , -0.44628662, -0.17897253,  0.34664026, -0.12272559,
        0.69051284,  0.06279059,  0.06756178,  0.4308072 ,  0.24

In [6]:
print(wv['nights'])

array([-1.77679002e-01,  1.67121261e-01, -2.33521938e-01, -7.63154328e-02,
        5.64601831e-02,  3.25218409e-01,  2.55535722e-01,  4.31267142e-01,
        2.21683949e-01, -2.03850344e-01,  2.48830821e-02, -1.37788236e-01,
       -1.98772654e-01,  4.40243781e-01, -3.48890901e-01, -4.85076696e-01,
        1.61376417e-01, -2.17509687e-01, -3.71195912e-01, -4.72282678e-01,
       -4.05491233e-01, -5.22639528e-02, -3.91675442e-01, -1.13440156e-01,
       -1.74479946e-01, -2.82782376e-01, -6.00582361e-01, -9.83458012e-02,
       -2.90395141e-01,  2.41430253e-01, -2.84523755e-01,  2.63830066e-01,
        7.29589701e-01, -2.31260464e-01,  1.61623687e-01,  3.50224078e-01,
        3.36478353e-01, -9.08171460e-02, -3.29726756e-01, -2.99643755e-01,
        4.07969803e-01, -3.68960917e-01,  2.52312347e-02, -3.61906409e-01,
       -4.53014016e-01, -2.62427658e-01, -7.00611174e-02,  1.03351928e-01,
        3.26833248e-01, -1.33855865e-04,  3.10725361e-01, -3.73938143e-01,
        2.56851554e-01, -

In [7]:
print(wv.similarity("night", "nights"))

0.99999166


In [8]:
print(wv.most_similar("nights"))

[('night', 0.999991774559021),
 ('flights', 0.9999871850013733),
 ('rights', 0.9999870657920837),
 ('overnight', 0.9999867677688599),
 ('fight', 0.999985933303833),
 ('fighting', 0.9999855160713196),
 ('entered', 0.9999850392341614),
 ('fighters', 0.9999849200248718),
 ('starting', 0.9999844431877136),
 ('fighter', 0.9999843835830688)]
