The table below summarizes widely used embedding models in NLP:

| **Model**               | **Developer**        | **Key Features**                                                                 | **Use Cases**                           |
|--------------------------|----------------------|---------------------------------------------------------------------------------|-----------------------------------------|
| **Word2Vec**            | Google              | Static word vectors learned from local context windows (CBOW, skip-gram).      | Word similarity, analogy tasks.         |
| **GloVe**               | Stanford            | Global word-word co-occurrence statistics for embedding generation.            | Analogy tasks, semantic relationships.  |
| **FastText**            | Facebook            | Uses subword information; handles OOV words effectively.                       | Morphologically rich languages.         |
| **BERT**                | Google              | Context-aware embeddings using transformers; bidirectional processing.         | Question answering, text classification.|
| **RoBERTa**             | Facebook            | Optimized BERT with better pretraining techniques.                             | Improved NLP task performance.          |
| **GPT-3**               | OpenAI              | Large-scale transformer used for text generation and semantic embeddings.      | Text generation, semantic search.       |
| **Universal Sentence Encoder** | Google       | Encodes sentences into fixed-length high-dimensional vectors.                  | Semantic similarity, clustering.        |
| **NV-Embed**            | NVIDIA              | Advanced embedding designs for generalist NLP tasks.                           | Multitask text embeddings.              |


#Word2Vec

![](https://lena-voita.github.io/resources/lectures/word_emb/w2v/cbow_skip-min.png)
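The two Word2Vec training setups differ only in which side is the input and which is the prediction target. The sketch below builds the training examples each setup would see for a short sentence. The function names and the `window` parameter are illustrative assumptions, not part of gensim's API, and real implementations add details such as subsampling and negative sampling.

```python
def skipgram_pairs(tokens, window=2):
    """Skip-gram: each center word predicts every neighbour within `window`."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_examples(tokens, window=2):
    """CBOW: the surrounding context words jointly predict the center word."""
    examples = []
    for i, center in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        examples.append((context, center))
    return examples


sentence = "the quick brown fox".split()
print(skipgram_pairs(sentence, window=1))
print(cbow_examples(sentence, window=1))
```

Skip-gram produces one example per (center, neighbour) pair, while CBOW produces one example per position with the whole context as input.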

In [None]:
!pip install gensim




In [None]:
import gensim.downloader as api

# Download (on first use) and load the 300-dimensional Google News Word2Vec vectors
model = api.load('word2vec-google-news-300')



![](https://www.pinecone.io/_next/image/?url=https%3A%2F%2Fcdn.sanity.io%2Fimages%2Fvr8gru94%2Fproduction%2Fcbc4321d285b982ff987720350f349e6cafc35d5-1920x1020.png&w=3840&q=75)

In [None]:


# Check the vector for a word
word = "king"
king_vector = model[word]
print(f"Vector for '{word}':\n{king_vector[:10]}... (truncated for display)")

# Find most similar words
similar_words = model.most_similar("king", topn=5)
print("\nWords most similar to 'king':")
for similar_word, similarity in similar_words:
    print(f"{similar_word}: {similarity:.4f}")

# Word analogy example: king - man + woman = ?
result = model.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
print(f"\n'King' - 'Man' + 'Woman' ≈ {result[0][0]} (Similarity: {result[0][1]:.4f})")


Vector for 'king':
[ 0.12597656  0.02978516  0.00860596  0.13964844 -0.02563477 -0.03613281
  0.11181641 -0.19824219  0.05126953  0.36328125]... (truncated for display)

Words most similar to 'king':
kings: 0.7138
queen: 0.6511
monarch: 0.6413
crown_prince: 0.6204
prince: 0.6160

'King' - 'Man' + 'Woman' ≈ queen (Similarity: 0.7118)
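The analogy query is vector arithmetic followed by a nearest-neighbour search: add the positive word vectors, subtract the negative ones, and return the closest remaining word by cosine similarity. The sketch below reproduces that idea on hypothetical 2-dimensional toy vectors, which are made up for illustration and are not values from the Google News model. It is a simplified approximation of what `most_similar` does, not gensim's actual implementation.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    a, b = normalize(a), normalize(b)
    return sum(x * y for x, y in zip(a, b))


def analogy(vectors, positive, negative):
    """Return the word closest to sum(positive) - sum(negative), excluding the query words."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for w in positive:
        target = [t + x for t, x in zip(target, normalize(vectors[w]))]
    for w in negative:
        target = [t - x for t, x in zip(target, normalize(vectors[w]))]
    candidates = [w for w in vectors if w not in positive and w not in negative]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))


# Hypothetical toy vectors: axis 0 loosely encodes "royalty", axis 1 "gender".
toy = {
    "king": [0.9, 0.8],
    "queen": [0.9, -0.8],
    "man": [0.1, 0.9],
    "woman": [0.1, -0.9],
    "apple": [-0.7, 0.05],
}
print(analogy(toy, positive=["king", "woman"], negative=["man"]))  # queen
```

Excluding the query words from the candidates matters: without that step, the nearest neighbour of the combined vector is often one of the input words themselves.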


#Hugging Face embeddings

The training objective of **BERT (Bidirectional Encoder Representations from Transformers)** consists of two key tasks:

1. **Masked Language Modeling (MLM)**:
   - Randomly selects 15% of the input tokens; most of the selected tokens are replaced with `[MASK]`.
   - The model predicts the original tokens based on their context (both left and right).
   - Enables bidirectional understanding of text.

2. **Next Sentence Prediction (NSP)**:
   - Given two sentences, the model predicts if the second sentence follows the first in the original text.
   - Captures relationships between sentences, which helps sentence-pair tasks such as question answering and natural language inference.

These objectives enable BERT to generate deep, context-aware embeddings suitable for various NLP tasks.
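To make the MLM corruption step concrete, here is a minimal sketch of how a token sequence might be prepared for training. The 80/10/10 split (mask, random token, unchanged) follows the original BERT paper and is not stated in this notebook. The function name, the toy vocabulary, and the per-token independent sampling are simplifying assumptions; actual implementations work on token IDs and skip special tokens.

```python
import random


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (corrupted_tokens, labels); labels are None where no prediction is required."""
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # the model must recover this token
            roll = rng.random()
            if roll < 0.8:
                corrupted.append("[MASK]")           # 80%: replace with the mask token
            elif roll < 0.9:
                corrupted.append(rng.choice(vocab))  # 10%: replace with a random token
            else:
                corrupted.append(tok)                # 10%: keep the original token
        else:
            labels.append(None)  # position not selected, no loss computed here
            corrupted.append(tok)
    return corrupted, labels


tokens = "the cat sat on the mat".split()
vocab = ["dog", "tree", "ran", "blue"]  # hypothetical toy vocabulary
corrupted, labels = mask_tokens(tokens, vocab, mask_prob=0.15, seed=0)
print(corrupted)
print(labels)
```

The loss is computed only at positions with a non-`None` label, so the model still reads the full bidirectional context while predicting a small subset of tokens.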

In [None]:
from transformers import BertTokenizer, BertModel
import torch

# Load pre-trained BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

print("BERT model and tokenizer loaded.")


BERT model and tokenizer loaded.


In [None]:
# Example query and documents
query = "What is machine learning?"
documents = [
    "Machine learning is a branch of artificial intelligence.",
    "Deep learning is a subset of machine learning.",
    "The capital of France is Paris.",
    "Supervised learning requires labeled data."
]


In [None]:
def get_sentence_embedding(sentence):
    # Tokenize the input sentence and convert to tensors
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True, max_length=512, padding=True)
    with torch.no_grad():
        outputs = model(**inputs)

    # Get embeddings for all tokens and average them
    token_embeddings = outputs.last_hidden_state  # Shape: [1, seq_len, hidden_dim]
    sentence_embedding = torch.mean(token_embeddings, dim=1)  # Shape: [1, hidden_dim]
    return sentence_embedding
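Because `get_sentence_embedding` encodes one sentence at a time, no padding tokens are present and a plain mean is safe. If several sentences were encoded in one batch, shorter ones would be padded, and those padding positions should be left out of the average. The sketch below shows that arithmetic on nested Python lists so it runs without a model; the function name and the toy numbers are assumptions for illustration. With tensors, the same logic would apply to `outputs.last_hidden_state` and `inputs["attention_mask"]`.

```python
def masked_mean_pool(token_embeddings, attention_mask):
    """Average token vectors per sequence, skipping positions where the mask is 0."""
    pooled = []
    for vectors, mask in zip(token_embeddings, attention_mask):
        kept = [vec for vec, m in zip(vectors, mask) if m == 1]
        count = max(len(kept), 1)  # avoid division by zero for an all-padding row
        dim = len(vectors[0])
        pooled.append([sum(vec[d] for vec in kept) / count for d in range(dim)])
    return pooled


# Toy batch: 2 sequences, 3 positions each, hidden size 2.
batch = [
    [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]],  # last position is padding
    [[2.0, 2.0], [4.0, 4.0], [6.0, 6.0]],  # no padding
]
mask = [[1, 1, 0], [1, 1, 1]]
print(masked_mean_pool(batch, mask))  # [[2.0, 3.0], [4.0, 4.0]]
```

A plain mean over the first sequence would be pulled toward the padding vector `[9.0, 9.0]`; the masked version ignores it.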


In [None]:
from torch.nn.functional import cosine_similarity

# Get the embedding for the query
query_embedding = get_sentence_embedding(query)

# Get embeddings for all documents
document_embeddings = [get_sentence_embedding(doc) for doc in documents]

# Compute similarity scores
print("\nSimilarity Scores:")
for i, doc_embedding in enumerate(document_embeddings):
    similarity = cosine_similarity(query_embedding, doc_embedding)
    print(f"Query: '{query}'\nDocument: '{documents[i]}'\nSimilarity: {similarity.item():.4f}\n")



Similarity Scores:
Query: 'What is machine learning?'
Document: 'Machine learning is a branch of artificial intelligence.'
Similarity: 0.7353

Query: 'What is machine learning?'
Document: 'Deep learning is a subset of machine learning.'
Similarity: 0.7279

Query: 'What is machine learning?'
Document: 'The capital of France is Paris.'
Similarity: 0.4916

Query: 'What is machine learning?'
Document: 'Supervised learning requires labeled data.'
Similarity: 0.7069



#Sentence embeddings from Hugging Face

In [None]:
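from sentence_transformers import SentenceTransformer

# trust_remote_code=True runs the custom model code shipped in the model repository.
# Note: this rebinds `model`, replacing the BERT model loaded earlier.
model = SentenceTransformer("jinaai/jina-embeddings-v3", trust_remote_code=True)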

# Get the embedding for the query
query_embedding = model.encode(query)

# Get embeddings for all documents
document_embeddings = [model.encode(doc) for doc in documents]

# Compute similarity scores
print("\nSimilarity Scores:")
for i, doc_embedding in enumerate(document_embeddings):
    similarity = model.similarity(query_embedding, doc_embedding)
    print(f"Query: '{query}'\nDocument: '{documents[i]}'\nSimilarity: {similarity.item():.4f}\n")



Similarity Scores:
Query: 'What is machine learning?'
Document: 'Machine learning is a branch of artificial intelligence.'
Similarity: 0.8677

Query: 'What is machine learning?'
Document: 'Deep learning is a subset of machine learning.'
Similarity: 0.7666

Query: 'What is machine learning?'
Document: 'The capital of France is Paris.'
Similarity: 0.1962

Query: 'What is machine learning?'
Document: 'Supervised learning requires labeled data.'
Similarity: 0.4676
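For retrieval, the ordering of the scores and the gap between relevant and irrelevant documents matter more than the absolute values. The sketch below ranks the documents using the scores printed in the two outputs above (copied by hand, not recomputed) and reports the spread between the best and worst match. The `rank` helper is a hypothetical name introduced here for illustration.

```python
documents = [
    "Machine learning is a branch of artificial intelligence.",
    "Deep learning is a subset of machine learning.",
    "The capital of France is Paris.",
    "Supervised learning requires labeled data.",
]

# Scores copied from the outputs shown earlier in this notebook.
bert_scores = [0.7353, 0.7279, 0.4916, 0.7069]
jina_scores = [0.8677, 0.7666, 0.1962, 0.4676]


def rank(documents, scores):
    """Return (score, document) pairs sorted from most to least similar."""
    return sorted(zip(scores, documents), reverse=True)


for name, scores in [("BERT mean pooling", bert_scores), ("jina-embeddings-v3", jina_scores)]:
    ranked = rank(documents, scores)
    spread = ranked[0][0] - ranked[-1][0]
    print(f"{name}: spread between best and worst match = {spread:.4f}")
    for score, doc in ranked:
        print(f"  {score:.4f}  {doc}")
```

Both models produce the same ordering here, but the sentence-embedding model separates the unrelated sentence about Paris far more clearly (a spread of about 0.67 versus about 0.24 for mean-pooled BERT).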

