# **Sentence Embeddings**

**Sentence Embeddings** are vector representations of entire **sentences or documents**,  
designed to capture their **semantic meaning** in a continuous vector space.

While **Word Embeddings** represent individual words,  
**Sentence Embeddings** summarize the **meaning of a whole sentence** as a single dense vector.

- Sentences with **similar meanings** → have **similar vectors**
- Useful for:
  - Semantic search  
  - Sentence similarity and clustering  
  - Text classification  
  - Question answering and retrieval systems

**Advantages**
- Captures **sentence-level semantics**  
- Enables meaningful comparison and clustering of text  
- Suitable for downstream NLP tasks (retrieval, QA, classification)

**Limitations**
- Some models are **computationally expensive**  
- May still lose **fine-grained syntactic nuances**

## **AvgWord2Vec**

**Average Word2Vec (AvgWord2Vec)** is a simple technique to create a **fixed-length vector representation for an entire sentence or document** by averaging the **Word2Vec embeddings** of all words in it.  
Averaging the embeddings smooths out noise and gives a **semantic summary** of the sentence.  
Similar sentences will have **similar average vectors**.
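
Written as a formula, the averaging step looks like this. The symbols are notation introduced here for illustration: $W$ is the list of in-vocabulary tokens of the sentence, $\mathbf{v}_w$ is the Word2Vec vector of token $w$, and $\mathbf{s}$ is the resulting sentence vector.

$$
\mathbf{s} = \frac{1}{|W|} \sum_{w \in W} \mathbf{v}_w
$$

Because every $\mathbf{v}_w$ has the same dimension $d$, the mean $\mathbf{s}$ also has dimension $d$, no matter how many tokens the sentence contains.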


**Advantages**
- Simple and efficient  
- Converts variable-length text into fixed-size vectors  
- Works well as baseline features for classification or clustering

**Limitations**
- Ignores **word order and syntax**  
- All words contribute **equally**, regardless of importance
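
The first limitation can be checked with a tiny sketch. The three-dimensional vectors below are made-up values (not taken from any real model), and `average_vector` is a hypothetical helper written only for this illustration.

```python
import numpy as np

# Toy 3-dimensional "word vectors" (made-up values, not from a real model).
toy_vectors = {
    "dog": np.array([1.0, 0.0, 0.0]),
    "bites": np.array([0.0, 1.0, 0.0]),
    "man": np.array([0.0, 0.0, 1.0]),
}

def average_vector(tokens, vectors):
    """Mean of the vectors of all tokens (every token is weighted equally)."""
    return np.mean([vectors[t] for t in tokens], axis=0)

a = average_vector(["dog", "bites", "man"], toy_vectors)
b = average_vector(["man", "bites", "dog"], toy_vectors)

# The same words in a different order give exactly the same average.
order_is_ignored = bool(np.allclose(a, b))
print(order_is_ignored)
```

Two sentences with opposite meanings end up with identical vectors here, because a mean does not depend on the order of its terms.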

For a simple demonstration, we will load a pre-trained Word2Vec model and build **AvgWord2Vec** sentence vectors from it.

In [1]:
# importing libraries

import gensim.downloader as api
from gensim.models import KeyedVectors
import numpy as np

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
import re

In [3]:
# Download necessary NLTK components if you haven't already

nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\ptpl-652\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ptpl-652\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\ptpl-652\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [4]:
# some sample sentences
sentences = [
    "The horse is running fast across the wide field.",
    "Data science is a fantastic field for women and men.",
    "The old man fed the cat a fish."
]

In [5]:
# Loading the Pre-trained Word2Vec Model

print("Loading pre-trained Word2Vec model (word2vec-google-news-300)...")

try:
    model: KeyedVectors = api.load('word2vec-google-news-300')
    VECTOR_SIZE = model.vector_size
    print(f"Model loaded successfully. Vector dimension: {VECTOR_SIZE}")

except Exception as e:
    print(f"Error loading model: {e}")
    print("Please ensure you have internet access and sufficient memory (model is ~3.4GB uncompressed).")
    raise  # re-raise instead of exit(), which would shut down the notebook kernel

Loading pre-trained Word2Vec model (word2vec-google-news-300)...
Model loaded successfully. Vector dimension: 300


In [6]:
# Initialize NLTK tools

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

In [7]:
# Preprocessing

def preprocess_text(sentence):

    # Convert to lowercase
    text = sentence.lower()
    
    # Keep only lowercase letters and whitespace (drops punctuation, digits, and other symbols)
    text = re.sub(r'[^a-z\s]', '', text)
    
    # Tokenization
    tokens = word_tokenize(text)
    
    # Remove Stopwords and Lemmatize
    processed_tokens = []
    for word in tokens:
        if word not in stop_words:
            # Lemmatize as a verb (pos='v'): 'running' -> 'run'; nouns such as 'women' stay unchanged
            processed_tokens.append(lemmatizer.lemmatize(word, pos='v'))
            
    return processed_tokens


In [8]:
# Average Word2Vec

def get_avg_word2vec_vector(words, w2v_model, vector_size):

    # Removing the OOV words (Out Of Vocabulary)
    word_vectors = [w2v_model[word] for word in words if word in w2v_model]
    
    # If no words in the sentence are in the vocabulary, return a zero vector
    if not word_vectors:
        return np.zeros(vector_size, dtype=np.float32)  # same dtype as the model's vectors
    
    # Convert list of vectors to a NumPy array for efficient averaging
    vectors = np.array(word_vectors)
    
    # average vector across all found word vectors (axis=0)
    avg_vector = np.mean(vectors, axis=0)
    
    return avg_vector


In [9]:
print("--- Sentence Vector Results ---")
for sentence in sentences:
    # Preprocess
    tokenized_words = preprocess_text(sentence)
    
    # Get the average vector
    sentence_vector = get_avg_word2vec_vector(
        tokenized_words, 
        model, 
        VECTOR_SIZE
    )
    
    # Print the sentence and its vector (showing only first 5 dimensions)
    print(f"\nSentence: {sentence}")
    print(f"Tokens Used: {tokenized_words}")
    print(f"Vector (First 5 dimensions of {VECTOR_SIZE}):")
    # NumPy array formatting to show a cleaner output
    print(sentence_vector[:5]) 
    print(f"Vector Shape: {sentence_vector.shape}")

--- Sentence Vector Results ---

Sentence: The horse is running fast across the wide field.
Tokens Used: ['horse', 'run', 'fast', 'across', 'wide', 'field']
Vector (First 5 dimensions of 300):
[-0.0418218   0.01549276  0.04768372  0.02396647 -0.02583504]
Vector Shape: (300,)

Sentence: Data science is a fantastic field for women and men.
Tokens Used: ['data', 'science', 'fantastic', 'field', 'women', 'men']
Vector (First 5 dimensions of 300):
[-0.12813313  0.06978353  0.0892334   0.00399272  0.10479736]
Vector Shape: (300,)

Sentence: The old man fed the cat a fish.
Tokens Used: ['old', 'man', 'feed', 'cat', 'fish']
Vector (First 5 dimensions of 300):
[ 0.06906738  0.1616211  -0.05200195  0.05683594  0.00507812]
Vector Shape: (300,)


In [11]:
# sample similarity observation between the texts

texts = [
    "I love this product",
    "I hate this product",
    "This product is working excellently"
]

In [12]:
vectors = []
for sentence in texts:
    # Preprocess
    tokenized_words = preprocess_text(sentence)
    
    # Get the average vector
    sentence_vector = get_avg_word2vec_vector(
        tokenized_words, 
        model, 
        VECTOR_SIZE
    )
    vectors.append(sentence_vector)

In [17]:
from sklearn.metrics.pairwise import cosine_similarity

print(f"Similarity between 'text1' and 'text2'\ntext 1: {texts[0]}\ntext 2: {texts[1]}")
print('similarity:',cosine_similarity(vectors[0].reshape(1, -1), vectors[1].reshape(1, -1))[0][0])

Similarity between 'text1' and 'text2'
text 1: I love this product
text 2: I hate this product
similarity: 0.77934355


In [18]:
print(f"Similarity between 'text1' and 'text3'\ntext 1: {texts[0]}\ntext 3: {texts[2]}")
print('similarity:', cosine_similarity(vectors[0].reshape(1, -1), vectors[2].reshape(1, -1))[0][0])

Similarity between 'text1' and 'text3'
text 1: I love this product
text 3: This product is working excellently
similarity: 0.5230589
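
For reference, the cosine similarity used in these comparisons is the dot product of two vectors divided by the product of their lengths. The following is a minimal NumPy sketch of that definition, not the scikit-learn implementation. The `cosine` helper and the zero-vector guard are assumptions added here; the guard matters because `get_avg_word2vec_vector` returns a zero vector when no token is in the vocabulary.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    if denom == 0.0:
        # Assumption for this sketch: treat a zero vector as "no similarity".
        return 0.0
    return float(np.dot(u, v) / denom)

u = np.array([1.0, 2.0, 3.0])

same_direction = cosine(u, 2 * u)                                  # length is ignored
orthogonal = cosine(np.array([1.0, 0.0]), np.array([0.0, 1.0]))    # unrelated directions
opposite = cosine(u, -u)                                           # opposite directions
with_zero = cosine(u, np.zeros(3))                                 # guarded case

print(same_direction, orthogonal, opposite, with_zero)
```

Scores close to 1 mean the vectors point in nearly the same direction, scores near 0 mean they are unrelated, and negative scores mean they point in opposing directions.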


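* Because **AvgWord2Vec** gives equal importance to every word, the results are unsatisfying: "I love this product" scores higher against "I hate this product" (0.78) than against "This product is working excellently" (0.52)
* Still, it was the starting point for the later, much stronger sentence embedding models
* The same averaging can be applied to GloVe or FastText vectors (**AvgGlove**, **AvgFasttext**)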
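
The averaging step does not depend on where the word vectors come from, which is why it carries over to GloVe or FastText. The sketch below assumes only that the vector table supports `word in table` and `table[word]`, as both a Python `dict` and a gensim `KeyedVectors` object do. `average_embedding` and `toy_glove` are hypothetical names, and the values are made up. With gensim, pre-trained tables such as `glove-wiki-gigaword-100` or `fasttext-wiki-news-subwords-300` could be loaded through `api.load(...)` in place of the toy dictionary.

```python
import numpy as np

def average_embedding(tokens, vectors, dim):
    """Average the vectors of in-vocabulary tokens; zero vector if none are found."""
    found = [vectors[t] for t in tokens if t in vectors]
    if not found:
        return np.zeros(dim)
    return np.mean(found, axis=0)

# Hypothetical stand-in for a pre-trained GloVe/FastText table (made-up values).
toy_glove = {
    "love": np.array([0.9, 0.1]),
    "product": np.array([0.2, 0.8]),
}

# "unseenword" is out of vocabulary and is skipped.
vec = average_embedding(["love", "product", "unseenword"], toy_glove, dim=2)
empty = average_embedding(["unseenword"], toy_glove, dim=2)

print(vec, empty)
```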

## **SBERT**

**Sentence-BERT (SBERT)** is a modification of **BERT (Bidirectional Encoder Representations from Transformers)**  
designed specifically to create **sentence-level embeddings** that capture **semantic similarity** efficiently.

It was introduced by **Reimers and Gurevych (2019)** in the paper:  
> “Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks”

While **BERT** produces powerful contextual embeddings,  
comparing many sentences with it is computationally expensive because every sentence pair needs its **own forward pass**.
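
To get a feel for how the per-pair cost grows, the sketch below counts forward passes for an illustrative collection of 10,000 sentences. The collection size and the helper name `pairwise_comparisons` are assumptions chosen for illustration. It compares one pass per unordered pair against one pass per sentence, which is all that is needed when each sentence is embedded independently.

```python
def pairwise_comparisons(n):
    """Number of unordered sentence pairs among n sentences: n * (n - 1) / 2."""
    return n * (n - 1) // 2

n_sentences = 10_000  # illustrative collection size

# Pairwise scoring with plain BERT: one forward pass per sentence pair.
pair_passes = pairwise_comparisons(n_sentences)

# Independent sentence embeddings: one forward pass per sentence,
# followed by cheap vector comparisons.
encoder_passes = n_sentences

print(f"per-pair passes:     {pair_passes:,}")
print(f"per-sentence passes: {encoder_passes:,}")
```

The per-pair count grows quadratically with the number of sentences, while the per-sentence count grows linearly.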

SBERT solves this by:
- Using a **Siamese network architecture**
- Generating **fixed-length sentence embeddings**
- Allowing **cosine similarity** or **Euclidean distance** to measure semantic similarity directly
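
The step from token-level outputs to one fixed-length sentence vector can be sketched with mean pooling, which the SBERT paper uses as its default pooling strategy. This is a NumPy illustration under stated assumptions, not the library's implementation: the token embeddings are random stand-ins for the output of a BERT-style encoder, and `mean_pool` is a hypothetical helper.

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors per sentence, ignoring padding positions (mask == 0)."""
    mask = attention_mask[:, :, None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid division by zero
    return summed / counts

# Made-up encoder output: 2 sentences, 3 token slots each, 4 dimensions per token.
rng = np.random.default_rng(0)
token_embeddings = rng.normal(size=(2, 3, 4))
attention_mask = np.array([[1, 1, 1],
                           [1, 1, 0]])  # the second sentence has one padding slot

sentence_embeddings = mean_pool(token_embeddings, attention_mask)
print(sentence_embeddings.shape)  # one fixed-length vector per sentence
```

In a Siamese setup the same encoder weights produce the token embeddings for both sentences, so the pooled vectors live in the same space and can be compared directly with cosine similarity or Euclidean distance.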

**Popular Pre-trained SBERT Models**

| Model Name | Description |
|-------------|--------------|
| `all-MiniLM-L6-v2` | Lightweight, fast, 384-dim embeddings |
| `all-mpnet-base-v2` | High-performance general-purpose model |
| `paraphrase-MiniLM-L6-v2` | Optimized for paraphrase detection |
| `multi-qa-MiniLM-L6-cos-v1` | For question-answer semantic search |
| `distiluse-base-multilingual-cased-v2` | Multilingual version for 50+ languages |

All models are available via the **`sentence-transformers`** library.

**Advantages**
- Produces **high-quality sentence embeddings**
- Enables **semantic similarity**, **search**, and **clustering**
- Much **faster** than vanilla BERT for pairwise sentence comparison
- Many **pre-trained models** for different domains and languages

**Limitations**
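- Input length is limited (512 tokens for BERT; some pre-trained models truncate earlier)  
- Produces **one fixed vector per sentence**, so token-level contextual detail is not available from the embedding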

The example below uses `all-MiniLM-L6-v2`.

In [21]:
# importing libraries
import warnings
warnings.filterwarnings('ignore')

from sentence_transformers import SentenceTransformer, util
import numpy as np

In [22]:
# downloading the model

model_name = 'all-MiniLM-L6-v2'
model = SentenceTransformer(model_name)
print(f"Loaded SentenceTransformer model: {model_name}")

Loaded SentenceTransformer model: all-MiniLM-L6-v2


In [32]:
# sample sentences

sentences = [
    "The weather is lovely today.",
    "It's so sunny and beautiful outside!",
    "I'm driving to the grocery store now.",
    "A cat chases a mouse."
]

print("--- Sample Texts ---")
for i, s in enumerate(sentences):
    print(f"Sentence {i+1}: {s}")

--- Sample Texts ---
Sentence 1: The weather is lovely today.
Sentence 2: It's so sunny and beautiful outside!
Sentence 3: I'm driving to the grocery store now.
Sentence 4: A cat chases a mouse.


In [33]:
embeddings = model.encode(sentences, convert_to_tensor=True)

print("--- Sentence Vectors (Embeddings) ---")
print(f"Shape of embeddings tensor: {embeddings.shape}")
print(f"Embedding for Sentence 1 (first 5 dimensions): {embeddings[0][:5].cpu().numpy()}")
print(f"Embedding for Sentence 2 (first 5 dimensions): {embeddings[1][:5].cpu().numpy()}")

--- Sentence Vectors (Embeddings) ---
Shape of embeddings tensor: torch.Size([4, 384])
Embedding for Sentence 1 (first 5 dimensions): [0.01919573 0.1200854  0.15959834 0.0670659  0.0500748 ]
Embedding for Sentence 2 (first 5 dimensions): [0.01488302 0.0534854  0.09693496 0.05794089 0.05688087]


In [34]:
# performing cosine similarity

cosine_scores = util.cos_sim(embeddings, embeddings)

print("\n--- Similarity Scores (Cosine Similarity Matrix) ---")
print(cosine_scores)


--- Similarity Scores (Cosine Similarity Matrix) ---
tensor([[ 1.0000,  0.7014,  0.1942, -0.0239],
        [ 0.7014,  1.0000,  0.1868,  0.0015],
        [ 0.1942,  0.1868,  1.0000,  0.0564],
        [-0.0239,  0.0015,  0.0564,  1.0000]])


* We can observe that
    - As expected, Sentence 1 and Sentence 2 are the most similar pair
    - Sentence 3 is much less similar to both Sentence 1 and Sentence 2
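
The same observation can be read off the matrix programmatically. The sketch below copies the scores from the printed output into a NumPy array (so it runs without the model) and looks for the largest off-diagonal entry. The variable names are chosen here for illustration.

```python
import numpy as np

# Cosine similarity matrix copied from the output printed above.
scores = np.array([
    [ 1.0000,  0.7014,  0.1942, -0.0239],
    [ 0.7014,  1.0000,  0.1868,  0.0015],
    [ 0.1942,  0.1868,  1.0000,  0.0564],
    [-0.0239,  0.0015,  0.0564,  1.0000],
])

# Every sentence is perfectly similar to itself, so mask the diagonal.
masked = scores.copy()
np.fill_diagonal(masked, -np.inf)

# Position of the largest remaining score.
i, j = np.unravel_index(np.argmax(masked), masked.shape)
best_pair = (int(i) + 1, int(j) + 1)  # 1-based sentence numbers
best_score = float(scores[i, j])

print(f"Most similar pair: Sentence {best_pair[0]} and Sentence {best_pair[1]} ({best_score:.4f})")
```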

In [35]:
sim_1_2 = cosine_scores[0, 1].item()
sim_1_3 = cosine_scores[0, 2].item()

print(f"Similarity (Sentence 1 vs. Sentence 2): {sim_1_2:.4f} (High)")
print(f"Similarity (Sentence 1 vs. Sentence 3): {sim_1_3:.4f} (Low)")

Similarity (Sentence 1 vs. Sentence 2): 0.7014 (High)
Similarity (Sentence 1 vs. Sentence 3): 0.1942 (Low)


**Simple Semantic Search**

In [39]:
# Performing simple Semantic Search

query = input("Enter the Query: ")
query_embedding = model.encode(query, convert_to_tensor=True)

# computing similarity score
query_scores = util.cos_sim(query_embedding, embeddings)[0]

print(f"\n--- Semantic Search for Query: '{query}' ---")
for i, sentence in enumerate(sentences):
    print(f"Score: {query_scores[i].item():.4f}, Sentence: {sentence}")


--- Semantic Search for Query: 'How was the whether today ?' ---
Score: 0.3728, Sentence: The weather is lovely today.
Score: 0.1642, Sentence: It's so sunny and beautiful outside!
Score: 0.1908, Sentence: I'm driving to the grocery store now.
Score: -0.0323, Sentence: A cat chases a mouse.
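
A search result is usually shown as a ranked list rather than in corpus order. The sketch below sorts the scores from the run above (copied into a NumPy array so it runs without the model) and keeps the best matches. `top_k` is a hypothetical cutoff chosen for illustration. The `sentence-transformers` library also provides a `util.semantic_search` helper for this purpose; the version here sticks to NumPy to keep the ranking step visible.

```python
import numpy as np

sentences = [
    "The weather is lovely today.",
    "It's so sunny and beautiful outside!",
    "I'm driving to the grocery store now.",
    "A cat chases a mouse.",
]

# Scores copied from the semantic search output above.
query_scores = np.array([0.3728, 0.1642, 0.1908, -0.0323])

top_k = 2  # hypothetical cutoff: how many results to keep

# argsort sorts ascending, so negate the scores to rank from best to worst.
ranked = np.argsort(-query_scores)[:top_k]
results = [(sentences[i], float(query_scores[i])) for i in ranked]

for sentence, score in results:
    print(f"{score:.4f}  {sentence}")
```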
