
## What is Embedding in NLP?

In Natural Language Processing (NLP), **embeddings** are dense vector representations of words, phrases, or even entire documents. They capture semantic relationships between words, meaning that words with similar meanings will have similar vector representations in a high-dimensional space.

Think of it like this: instead of representing each word as a unique, isolated symbol (like a one-hot encoded vector where each word has its own dimension and all other dimensions are zero), embeddings map words into a continuous vector space.
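The contrast can be made concrete with a small sketch. The three-word vocabulary and the dense values below are hand-picked for illustration (they are not learned embeddings), and `cosine` is a helper written for this example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

vocab = ["cat", "dog", "car"]

# One-hot: one dimension per vocabulary word, with a single 1 at the word's index.
one_hot = {
    word: [1.0 if i == j else 0.0 for j in range(len(vocab))]
    for i, word in enumerate(vocab)
}

# Dense: hypothetical 2-D values chosen by hand, NOT trained embeddings.
dense = {
    "cat": [0.9, 0.8],
    "dog": [0.8, 0.9],
    "car": [-0.7, 0.1],
}

# Distinct one-hot vectors are always orthogonal, so they carry no similarity signal.
print("one-hot cat vs dog:", cosine(one_hot["cat"], one_hot["dog"]))

# Dense vectors can place related words close together and unrelated words apart.
print("dense   cat vs dog:", round(cosine(dense["cat"], dense["dog"]), 3))
print("dense   cat vs car:", round(cosine(dense["cat"], dense["car"]), 3))
```

With one-hot vectors every pair of different words has similarity 0, whereas the dense vectors can encode that "cat" is nearer to "dog" than to "car".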

Here's why they are important:

* **Capture Semantic Meaning:** Words with similar meanings are closer together in the vector space. For example, the vectors for "king" and "queen" might be similar, and the vector for "man" minus the vector for "woman" might be roughly equal to the vector for "king" minus the vector for "queen". This allows models to understand relationships between words.
* **Dimensionality Reduction:** Embeddings typically have tens to hundreds of dimensions, whereas a one-hot vector needs one dimension per vocabulary word, so embeddings are far cheaper to store and compute with.
* **Improved Performance:** Using embeddings as input to NLP models (like neural networks) generally leads to better performance on various tasks, such as text classification, sentiment analysis, machine translation, and question answering.
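The "king"/"queen" arithmetic from the first bullet can be sketched with toy vectors. The 2-D values below are hypothetical (axis 0 loosely stands for "royalty", axis 1 for "gender") and are chosen so that the relationship holds exactly; real learned embeddings only satisfy it approximately.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Hypothetical hand-made vectors, NOT trained embeddings.
toy = {
    "king":  [1.0,  1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0,  1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.1],
}

def analogy(a, b, c, vectors):
    """Return the word closest to vec(b) - vec(a) + vec(c), excluding a, b and c."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in {a, b, c}]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

# "man" is to "king" as "woman" is to ...?
print(analogy("man", "king", "woman", toy))

# The two difference vectors are equal in this toy space.
man_minus_woman = [m - w for m, w in zip(toy["man"], toy["woman"])]
king_minus_queen = [k - q for k, q in zip(toy["king"], toy["queen"])]
print(man_minus_woman, king_minus_queen)
```

Excluding the three query words from the candidates is a common convention; without it, one of the inputs often ends up as the nearest neighbor.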

Examples of popular word embedding techniques include:

* **Word2Vec:** Learns word embeddings by predicting neighboring words in a sentence.
* **GloVe (Global Vectors for Word Representation):** Learns embeddings by considering global word-word co-occurrence statistics from a corpus.
* **FastText:** An extension of Word2Vec that considers character n-grams, allowing it to handle out-of-vocabulary words and morphological variations better.
* **Contextual Embeddings (like ELMo, BERT, GPT):** These embeddings are dynamic and change based on the context in which a word appears in a sentence, capturing more nuanced meanings.

In essence, embeddings provide a way for computers to understand and process human language in a more meaningful and efficient way.

In [None]:
corpus = [
    "The quick brown fox jumps over the lazy dog.",
    "Never jump over the lazy dog quickly.",
    "A fast brown fox leaps across lazy hounds.",
    "The sleepy cat curled up on the warm mat.",
    "A sly fox sneaked past the watchful owl.",
    "Gentle deer graze peacefully near the quiet stream.",
    "Two clever foxes raced across the frosty field.",
    "The curious rabbit hopped into the garden silently.",
    "A playful puppy chased the fluttering butterflies.",
    "Brown bears hibernate through the cold winter nights.",
    "Crows cawed loudly above the tall pine trees.",
    "A pack of wolves howled at the pale moonlight.",
    "The silent snake slithered across the dusty path.",
    "Early birds chirped merrily as the sun rose.",
    "The strong lion roared from his rocky throne."
]


### TF-IDF (Term Frequency-Inverse Document Frequency)

TF-IDF is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. It is calculated by multiplying two metrics:

*   **Term Frequency (TF):** How often a word appears in a document.
*   **Inverse Document Frequency (IDF):** A measure of how much information the word provides, i.e., if it's common or rare across all documents.

The resulting TF-IDF score increases proportionally to the number of times a word appears in the document and is offset by the frequency of the word in the corpus, which helps to adjust for the fact that some words appear more frequently in general.
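To make the calculation concrete, here is a minimal from-scratch sketch. It assumes the variant that scikit-learn's `TfidfVectorizer` uses by default: raw counts for TF, a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each document vector. The three-sentence `docs` list is a subset of the corpus above, and the helper names are made up for this example.

```python
import math
import re
from collections import Counter

docs = [
    "The quick brown fox jumps over the lazy dog.",
    "Never jump over the lazy dog quickly.",
    "A fast brown fox leaps across lazy hounds.",
]

def tokenize(text):
    # Lowercase and keep tokens of two or more word characters,
    # so punctuation and one-letter words such as "a" are dropped.
    return re.findall(r"\b\w\w+\b", text.lower())

tokenized = [tokenize(d) for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term occur?
df = Counter(term for tokens in tokenized for term in set(tokens))

def idf(term):
    # Smoothed inverse document frequency.
    return math.log((1 + n_docs) / (1 + df[term])) + 1

def tfidf(tokens):
    tf = Counter(tokens)                                  # raw term counts
    raw = {t: count * idf(t) for t, count in tf.items()}  # TF * IDF
    norm = math.sqrt(sum(v * v for v in raw.values()))    # L2 norm of the document vector
    return {t: v / norm for t, v in raw.items()}

scores = tfidf(tokenized[0])
for term, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{term:6s} {score:.3f}")
```

In this tiny subset "lazy" appears in every document and gets the lowest score, while "the" still ranks first in the first sentence because it occurs twice there and only three documents are available to discount it. With a larger corpus the IDF term pushes such common words further down.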

In [None]:
#@title TF-IDF (Term Frequency-Inverse Document Frequency)

from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

# Initialize TF-IDF Vectorizer
vectorizer = TfidfVectorizer()

# Fit and transform
tfidf_matrix = vectorizer.fit_transform(corpus)

# Convert to DataFrame for readability
df_tfidf = pd.DataFrame(tfidf_matrix.toarray(), columns=vectorizer.get_feature_names_out())
print(df_tfidf.round(2))

    above  across    as    at  bears  birds  brown  butterflies   cat  cawed  \
0    0.00    0.00  0.00  0.00   0.00   0.00   0.32         0.00  0.00   0.00   
1    0.00    0.00  0.00  0.00   0.00   0.00   0.00         0.00  0.00   0.00   
2    0.00    0.33  0.00  0.00   0.00   0.00   0.33         0.00  0.00   0.00   
3    0.00    0.00  0.00  0.00   0.00   0.00   0.00         0.00  0.37   0.00   
4    0.00    0.00  0.00  0.00   0.00   0.00   0.00         0.00  0.00   0.00   
5    0.00    0.00  0.00  0.00   0.00   0.00   0.00         0.00  0.00   0.00   
6    0.00    0.30  0.00  0.00   0.00   0.00   0.00         0.00  0.00   0.00   
7    0.00    0.00  0.00  0.00   0.00   0.00   0.00         0.00  0.00   0.00   
8    0.00    0.00  0.00  0.00   0.00   0.00   0.00         0.44  0.00   0.00   
9    0.00    0.00  0.00  0.00   0.39   0.00   0.30         0.00  0.00   0.00   
10   0.37    0.00  0.00  0.00   0.00   0.00   0.00         0.00  0.00   0.37   
11   0.00    0.00  0.00  0.37   0.00   0

In [None]:
#@title Get feature importance for first sentence
first_doc_scores = df_tfidf.iloc[0]
important_words = first_doc_scores.sort_values(ascending=False)
print("Top TF-IDF words in first sentence:\n", important_words.head())

Top TF-IDF words in first sentence:
 jumps    0.415676
quick    0.415676
over     0.360945
dog      0.360945
fox      0.322112
Name: 0, dtype: float64


In [None]:
pip install gensim

In [None]:
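import nltk
# word_tokenize only needs the Punkt tokenizer data, not the full NLTK collection
nltk.download('punkt')
nltk.download('punkt_tab')  # required by newer NLTK releases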

### Word2Vec

Word2Vec is a popular technique for learning word embeddings. It creates dense vector representations of words where words with similar meanings are located closer to each other in the vector space. There are two main architectures:

*   **Skip-gram:** Predicts the surrounding context words given a target word.
*   **CBOW (Continuous Bag of Words):** Predicts the target word given the surrounding context words.

Word2Vec models learn these embeddings by training on a large corpus of text, where the relationships between words are captured through their co-occurrence patterns.
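The difference between the two architectures is easiest to see in the training examples each one derives from a sentence. The sketch below only generates those examples; it does not train anything, and it ignores details that real implementations add, such as negative sampling and subsampling of frequent words. The function names are made up for this illustration.

```python
def skipgram_pairs(tokens, window=2):
    """(target, context) pairs: the target word is used to predict each neighbor."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

def cbow_examples(tokens, window=2):
    """(context words, target) examples: the neighbors jointly predict the target."""
    examples = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, target))
    return examples

tokens = ["a", "sly", "fox", "sneaked", "past"]

sg = skipgram_pairs(tokens, window=1)
cb = cbow_examples(tokens, window=1)

print("Skip-gram:", sg)
print("CBOW:     ", cb)
```

Skip-gram yields one example per (word, neighbor) pair, whereas CBOW yields one example per position with all neighbors grouped as the input.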

In [None]:
#@title Word2Vec (Static Neural Embeddings)
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize

# Tokenize corpus
tokenized_corpus = [word_tokenize(sentence.lower()) for sentence in corpus]

# Train Word2Vec model (sg=1 selects skip-gram; the default sg=0 is CBOW)
model_w2v = Word2Vec(sentences=tokenized_corpus, vector_size=50, window=5, min_count=1, workers=2, sg=1)

# Access vector
print("Vector for 'fox':\n", model_w2v.wv['fox'])

Vector for 'fox':
 [ 2.85719102e-03 -5.28241089e-03 -1.41586214e-02 -1.55909630e-02
 -1.82540510e-02 -1.18924184e-02 -3.64811346e-03 -8.61516781e-03
 -1.29679115e-02 -7.51042739e-03  8.60515144e-03 -7.51296477e-03
  1.68587286e-02  3.11068213e-03 -1.44905010e-02  1.89288426e-02
  1.52591188e-02  1.09936399e-02 -1.36847515e-02  1.15767354e-02
  8.07453226e-03  1.03750220e-02  8.53674766e-03  3.90454801e-03
 -6.24129549e-03  1.67949796e-02  1.92348473e-02  7.60872057e-03
 -5.67133445e-03  4.46162303e-05  2.40645162e-03 -1.69198960e-02
 -1.64732635e-02 -4.68570011e-04  2.44278298e-03 -1.14404494e-02
 -9.33911931e-03 -1.46774286e-02  1.66406762e-02  2.19345166e-04
 -8.98859743e-03  1.14497757e-02  1.84087288e-02 -8.21412727e-03
  1.59900170e-02  1.06961364e-02  1.17508778e-02  9.77434451e-04
  1.63874421e-02 -1.40219862e-02]


In [None]:
# Most similar words
print("Most similar to 'lazy':", model_w2v.wv.most_similar('lazy'))

# Cosine similarity between words
similarity = model_w2v.wv.similarity('fox', 'dog')
print("Similarity between 'fox' and 'dog':", similarity)


Most similar to 'lazy': [('nights', 0.37047961354255676), ('sun', 0.24188484251499176), ('chirped', 0.23741601407527924), ('silent', 0.2007860392332077), ('curled', 0.19357748329639435), ('roared', 0.17555497586727142), ('moonlight', 0.168952077627182), ('a', 0.16599375009536743), ('leaps', 0.16575321555137634), ('merrily', 0.1603947877883911)]
Similarity between 'fox' and 'dog': -0.10234931


To use GloVe vectors, download pretrained embeddings such as `GloVe 6B (50D)` and place the extracted `glove.6B.50d.txt` file where the loading code expects it.

Download link: https://nlp.stanford.edu/data/glove.6B.zip


### GloVe (Global Vectors for Word Representation)

GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Unlike Word2Vec which focuses on local context, GloVe learns word embeddings by considering global word-word co-occurrence statistics from a large corpus.

It essentially combines the advantages of two main models:

*   **Global Matrix Factorization:** Like Latent Semantic Analysis (LSA), it captures global statistics of how often words appear together across the entire corpus.
*   **Local Context Window Methods:** Like Word2Vec, it also considers the context of words within a sliding window.

The training objective of GloVe is to learn word vectors whose dot product approximates the logarithm of the words' co-occurrence probability. Because ratios of co-occurrence probabilities then correspond to vector differences, GloVe can capture both semantic and syntactic relationships between words.
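The global statistic GloVe starts from can be sketched in a few lines. The example below builds a symmetric co-occurrence count table `X` from a toy tokenized corpus and prints the log count for one pair. In the GloVe paper the fitted quantity is the log count, with per-word bias terms absorbing the difference between log counts and log probabilities. The toy sentences, the window of 1, and the unweighted counting are simplifying assumptions for this illustration, and no vectors are trained here.

```python
import math
from collections import Counter

# Toy tokenized corpus (illustrative only).
sentences = [
    ["the", "lazy", "dog"],
    ["the", "quick", "fox"],
    ["the", "lazy", "fox"],
]

def cooccurrence(sentences, window=1):
    """Count how often each ordered word pair appears within `window` positions."""
    counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts

X = cooccurrence(sentences)

pair = ("the", "lazy")
print("count:", X[pair])
print("log count (target for the dot product plus biases):", math.log(X[pair]))

# Pairs that never co-occur have count 0; GloVe's weighting gives them zero weight.
print("never co-occur:", X[("the", "dog")])
```

Unlike the sliding-window examples used by Word2Vec, these counts are aggregated over the whole corpus once, and training then fits vectors to the aggregated table.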

In [None]:
#@title GloVe (Pretrained Global Vectors)
import numpy as np

# Load pretrained GloVe vectors
def load_glove(path):
    embeddings = {}
    with open(path, 'r', encoding='utf8') as file:
        for line in file:
            parts = line.split()
            word = parts[0]
            vector = np.array(parts[1:], dtype='float32')
            embeddings[word] = vector
    return embeddings

glove_path = "/content/glove.6B.50d.txt"  # Download and place here
glove_embeddings = load_glove(glove_path)

# Check vector for a word
print("GloVe vector for 'fox':", glove_embeddings['fox'])

GloVe vector for 'fox': [ 0.44206   0.059552  0.15861   0.92777   0.1876    0.24256  -1.593
 -0.79847  -0.34099  -0.24021  -0.32756   0.43639  -0.11057   0.50472
  0.43853   0.19738  -0.1498   -0.046979 -0.83286   0.39878   0.062174
  0.28803   0.79134   0.31798  -0.21933  -1.1015   -0.080309  0.39122
  0.19503  -0.5936    1.7921    0.3826   -0.30509  -0.58686  -0.76935
 -0.61914  -0.61771  -0.68484  -0.67919  -0.74626  -0.036646  0.78251
 -1.0072   -0.59057  -0.7849   -0.39113  -0.49727  -0.4283   -0.15204
  1.5064  ]


In [None]:
#@title Explore Important GloVe Features
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Cosine similarity between two vectors
vec_fox = glove_embeddings['fox']
vec_dog = glove_embeddings['dog']
similarity = cosine_similarity([vec_fox], [vec_dog])[0][0]
print("Similarity (fox vs dog):", similarity)

# Word analogy (king - man + woman ≈ queen)
def analogy(word_a, word_b, word_c):
    """Return vec(word_b) - vec(word_a) + vec(word_c)."""
    return glove_embeddings[word_b] - glove_embeddings[word_a] + glove_embeddings[word_c]

analogy_vec = analogy('man', 'king', 'woman')

# Find closest words (the query words are not excluded, so 'king' itself can rank first)
def find_closest(vec, embeddings, top_n=5):
    # Keep only vectors whose dimension matches the query
    words = [word for word, vector in embeddings.items() if vector.shape[0] == vec.shape[0]]
    matrix = np.stack([embeddings[word] for word in words])
    # One vectorized call instead of one cosine_similarity call per word
    sims = cosine_similarity([vec], matrix)[0]
    top = np.argsort(-sims)[:top_n]
    return [(words[i], sims[i]) for i in top]

print("Analogy result (king - man + woman):", find_closest(analogy_vec, glove_embeddings))

Similarity (fox vs dog): 0.45282266
Analogy result (king - man + woman): [('king', 0.8859834), ('queen', 0.8609581), ('daughter', 0.76845115), ('prince', 0.76407), ('throne', 0.763497)]


### FastText

FastText is another word embedding model that builds upon Word2Vec. A key difference is that FastText considers subword information (character n-grams) in addition to words. This has several advantages:

*   **Handles Out-of-Vocabulary (OOV) words:** FastText can create vectors for words it hasn't seen during training by combining the vectors of its character n-grams.
*   **Better for morphologically rich languages:** Languages with many word variations (like Spanish or German) benefit from the subword information.
*   **Learns representations for suffixes and prefixes:** This can help in understanding the meaning of words.

The fastText library also includes a supervised text classifier, where subword features can help with rare or misspelled words.
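The subword idea can be illustrated by extracting character n-grams directly. The sketch below wraps each word in `<` and `>` boundary markers and collects n-grams of length 3 to 6, which matches the defaults described for FastText. `char_ngrams` is a helper written for this example, and real implementations additionally hash n-grams into a fixed number of buckets, which is omitted here.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Character n-grams of `word` wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    return {
        wrapped[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(wrapped) - n + 1)
    }

fox = char_ngrams("fox")
foxlike = char_ngrams("foxlike")

print("n-grams of 'fox':", sorted(fox))

# An unseen word still shares n-grams with known words,
# so a vector can be composed for it from those shared pieces.
shared = fox & foxlike
print("shared with 'foxlike':", sorted(shared))
```

Because "foxlike" shares several n-grams with "fox", a model that has learned vectors for those n-grams can assemble a vector for "foxlike" even though the full word never appeared in training.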

In [None]:
#@title FastText Embedding
from gensim.models import FastText
from nltk.tokenize import word_tokenize

# Tokenize sentences
tokenized_corpus = [word_tokenize(sent.lower()) for sent in corpus]

# Train FastText model
fasttext_model = FastText(sentences=tokenized_corpus, vector_size=50, window=5, min_count=1, workers=2, sg=1)

# Get word vector
print("Vector for 'fox':\n", fasttext_model.wv['fox'])

# Handle out-of-vocabulary word
print("Vector for 'foxlike' (OOV):\n", fasttext_model.wv['foxlike'])

Vector for 'fox':
 [ 3.2321583e-03 -2.9412520e-03 -6.8576261e-03 -4.6252571e-03
 -4.4154222e-03 -1.4195185e-04  3.1816140e-03 -8.4446119e-03
 -1.2518661e-03 -3.4119654e-03 -2.3245369e-03  7.8332582e-03
  8.5215410e-03  1.0361478e-03 -6.3343653e-03  1.1195343e-03
  1.1074417e-02 -2.8426494e-04 -2.2770849e-03  2.3481403e-03
 -3.1608285e-03  7.7675688e-03  3.7854540e-03  3.4268529e-03
 -7.3994296e-03  7.5485385e-03  8.1852730e-03 -1.1603630e-03
 -3.0495930e-03 -2.4095501e-03  2.6099731e-03 -2.7533751e-03
 -9.3813123e-06  1.1778839e-03  8.4802480e-03 -5.9449342e-03
 -9.5305042e-03 -5.0317021e-03  3.9002087e-03  4.1881201e-04
 -7.7766986e-03  4.4137714e-03  1.1148970e-02  5.3869882e-03
  7.6330313e-03 -4.7343769e-03  5.8285235e-03 -2.2695637e-03
 -2.7858932e-03 -5.7004625e-03]
Vector for 'foxlike' (OOV):
 [ 3.6175190e-03  1.0286048e-03  1.3084356e-03 -6.7578652e-04
  5.0787901e-04 -1.3946147e-03  1.2139681e-03 -1.0449693e-05
 -1.1531699e-03 -1.9276165e-03  9.9236891e-04  1.9493692e-03
  2.2

In [None]:
# Similar words
print("Words similar to 'lazy':", fasttext_model.wv.most_similar('lazy'))

# Cosine similarity
print("Similarity between 'fox' and 'dog':", fasttext_model.wv.similarity('fox', 'dog'))

Words similar to 'lazy': [('of', 0.29647505283355713), ('clever', 0.28435948491096497), ('mat', 0.2834983170032501), ('pine', 0.2832966446876526), ('into', 0.22485658526420593), ('watchful', 0.21624240279197693), ('winter', 0.2126881629228592), ('brown', 0.19851846992969513), ('strong', 0.19307392835617065), ('a', 0.18323905766010284)]
Similarity between 'fox' and 'dog': 0.11950576


In [None]:
pip install -q transformers sentence-transformers

### BERT (Bidirectional Encoder Representations from Transformers)

BERT is a pre-trained, transformer-based model for NLP. Unlike static word embeddings such as Word2Vec or GloVe, which assign each word a single vector regardless of context, BERT produces contextualized embeddings: the vector for a word changes with the surrounding words in the sentence.

Key features of BERT:

*   **Bidirectional:** BERT is trained to consider the context from both the left and right sides of a word simultaneously, which leads to a deeper understanding of the word's meaning.
*   **Transformer Architecture:** BERT utilizes the transformer architecture, which is highly effective at capturing long-range dependencies in text.
*   **Pre-trained:** BERT is pre-trained on a massive dataset of text (like books and Wikipedia), allowing it to learn a rich understanding of language. This pre-training can then be fine-tuned for specific downstream NLP tasks.

Contextual embeddings like BERT have significantly advanced the performance of various NLP tasks by providing more nuanced and context-aware representations of words and sentences.

In [None]:
#@title BERT-style Sentence Embeddings (Contextual Embedding using Transformers)
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Load a pre-trained sentence-embedding model
# (all-MiniLM-L6-v2 is a compact transformer encoder, not the original BERT checkpoint)
model = SentenceTransformer('all-MiniLM-L6-v2')

# Get one embedding vector per sentence
embeddings = model.encode(corpus)

# Cosine similarity between first and second sentence
sim_score = cosine_similarity([embeddings[0]], [embeddings[1]])[0][0]
print(f"Cosine similarity (Sentence 1 vs 2): {sim_score:.4f}")

Cosine similarity (Sentence 1 vs 2): 0.5738


In [None]:
# Sanity check: a sentence compared with itself has cosine similarity 1
sim_score = cosine_similarity([embeddings[3]], [embeddings[3]])[0][0]
print(f"Cosine similarity (Sentence 4 vs itself): {sim_score:.4f}")

Cosine similarity (Sentence 4 vs itself): 1.0000
