# Siamese Network for Semantic Question Similarity

## 📋 Overview
This notebook implements a **Bi-Encoder (Siamese Network)** using **DistilBERT** to detect semantic similarity between questions. Unlike keyword matching, this approach maps each question to a vector in a high-dimensional embedding space, where semantically similar questions lie closer together.



---

## 🛠️ Workflow Steps

### 1. Setup & Synthetic Data
* **Environment**: Configures PyTorch to use the GPU (CUDA) when one is available, falling back to the CPU otherwise.
* **Triplet Data**: Generates a synthetic dataset consisting of:
    * **Anchor**: The reference question.
    * **Positive**: A duplicate/paraphrased version of the anchor.
    * **Negative**: A completely unrelated question.

### 2. Custom Dataset & Tokenization
* **TripletDataset**: A custom PyTorch class that handles on-the-fly tokenization using `distilbert-base-uncased`.
* **Padding & Truncation**: Ensures all sequences are normalized to a fixed length for batch processing.
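
To make the fixed-length normalization concrete, here is a minimal pure-Python sketch of what padding and truncation do to a sequence of token IDs. The function name `pad_or_truncate`, the token IDs, and `pad_id=0` are illustrative assumptions, not part of the notebook; the real tokenizer also keeps the closing special token when it truncates, which this sketch does not model.

```python
def pad_or_truncate(token_ids, max_len, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_len long."""
    # Truncation: keep only the first max_len tokens.
    ids = list(token_ids[:max_len])
    # The attention mask marks real tokens with 1.
    mask = [1] * len(ids)
    # Padding: fill the remaining positions with pad_id, masked out with 0.
    n_pad = max_len - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad


# Hypothetical token IDs, not real tokenizer output.
short_ids, short_mask = pad_or_truncate([101, 7, 8, 102], max_len=6)
long_ids, long_mask = pad_or_truncate([101, 1, 2, 3, 4, 5, 6, 102], max_len=6)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

Both results have length 6, which is what allows questions of different lengths to be stacked into a single batch tensor.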

### 3. Model Architecture
* **Siamese Network**: Implements a Bi-Encoder in which a single DistilBERT encoder processes the anchor, positive, and negative in turn, so all three share the same weights.
* **CLS Pooling**: Uses the hidden state of the `[CLS]` token as the 768-dimensional vector representation of the sentence.
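
The two ideas above, weight sharing and CLS pooling, can be illustrated with a toy stand-in for the encoder. Everything here is an assumption made for illustration: `shared_weights`, `encode`, the 2-D inputs, and the 3-D outputs are hypothetical, and a per-token linear projection replaces DistilBERT (where the position-0 state also depends on every other token through self-attention).

```python
# One set of parameters, created once. "Sharing weights" means every input
# passes through this same object; there are not three separate models.
shared_weights = [
    [1.0, 0.0],
    [0.0, 2.0],
    [1.0, 1.0],
]  # 3 output dimensions x 2 input dimensions


def encode(token_vectors, weights=shared_weights):
    """Project each token vector, then keep only the first position."""
    hidden = [
        [sum(w * x for w, x in zip(row, vec)) for row in weights]
        for vec in token_vectors
    ]
    # CLS pooling: the vector at position 0 represents the whole sequence.
    return hidden[0]


# Hypothetical "token vectors" for an anchor, a positive, and a negative.
a_emb = encode([[1.0, 2.0], [3.0, 4.0]])
p_emb = encode([[2.0, 0.0], [9.0, 9.0]])
n_emb = encode([[0.0, 1.0]])

print(a_emb, p_emb, n_emb)
```

Each call returns one fixed-size vector regardless of how many tokens went in, and because all three calls use the same parameters, an update to those parameters changes all three embeddings at once.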

### 4. Training with Triplet Loss
* **Objective**: Uses **Triplet Margin Loss** to minimize the distance between the Anchor and Positive while maximizing the distance between the Anchor and Negative.
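* **Formula**: $$\text{Loss} = \max\big(d(\text{Anchor}, \text{Positive}) - d(\text{Anchor}, \text{Negative}) + \text{margin},\ 0\big)$$ where $d$ is the Euclidean distance.
* **Optimization**: Employs the `AdamW` optimizer with a constant learning rate of `2e-5`; no learning-rate scheduler is used.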
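
As a worked example of the formula, the sketch below evaluates the loss for a single triplet in plain Python. The helper names and the 2-D vectors are made up for illustration; PyTorch's `nn.TripletMarginLoss` additionally adds a small epsilon inside the distance and averages over the batch, which is omitted here.

```python
import math


def euclidean(u, v):
    """Euclidean (p=2) distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """max(d(anchor, positive) - d(anchor, negative) + margin, 0)."""
    d_pos = euclidean(anchor, positive)
    d_neg = euclidean(anchor, negative)
    return max(d_pos - d_neg + margin, 0.0)


anchor = [0.0, 0.0]
positive = [3.0, 4.0]       # distance 5 from the anchor
far_negative = [6.0, 8.0]   # distance 10 from the anchor
near_negative = [0.0, 5.5]  # distance 5.5 from the anchor

loss_far = triplet_margin_loss(anchor, positive, far_negative)
loss_near = triplet_margin_loss(anchor, positive, near_negative)
print(loss_far, loss_near)
```

With the far negative, 5 - 10 + 1 is negative, so the loss is clipped to 0 and the triplet contributes no gradient. With the near negative, 5 - 5.5 + 1 = 0.5: the negative is farther away than the positive, but not by the full margin, so the model is still penalized.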



### 5. Semantic Search Inference
* **Knowledge Base Indexing**: Pre-calculates and stores embeddings for a "database" of known questions.
* **Cosine Similarity**: When a user submits a query, it is encoded into a vector and compared against the index using Cosine Similarity.
* **Threshold Logic**: Provides a suggested answer if the similarity score exceeds a predefined confidence threshold (e.g., 0.7).
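
The lookup described above can be sketched without any libraries. The names `cosine_similarity` and `best_match` and the 3-D "embeddings" are hypothetical stand-ins chosen so the arithmetic is easy to follow; the notebook itself uses scikit-learn on 768-dimensional vectors.

```python
import math


def cosine_similarity(u, v):
    """Dot product divided by the product of the vector norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def best_match(query_vec, index, threshold=0.7):
    """Return (position, score) of the closest vector, or (None, score)."""
    scores = [cosine_similarity(query_vec, vec) for vec in index]
    best = max(range(len(scores)), key=scores.__getitem__)
    if scores[best] > threshold:
        return best, scores[best]
    return None, scores[best]


# Made-up index of two pre-computed "question embeddings".
index = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

hit = best_match([4.0, 3.0, 0.0], index)   # cosine 0.8 with entry 0
miss = best_match([1.0, 1.0, 1.0], index)  # cosine ~0.577 with both entries
print(hit, miss)
```

The first query clears the threshold and returns position 0; the second does not, so the caller would fall back to treating it as a new question.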

---

## 📚 Key Tech Stack
| Category | Tools |
| :--- | :--- |
| **Deep Learning** | `PyTorch`, `torch.nn` |
| **Transformers** | `Hugging Face (transformers)`, `DistilBERT` |
| **Analysis** | `pandas`, `numpy`, `scikit-learn` |

# 1. Imports & Setup

We start by importing PyTorch and the Hugging Face Transformers library. We also set the device to GPU if available.


In [2]:
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Dataset
from transformers import AutoTokenizer, AutoModel
from torch.optim import AdamW

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import random

# Set random seed for reproducibility
def set_seed(seed):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)

set_seed(42)

# Check for GPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

Using device: cuda


# 2. Data Generation (Dummy Data)

Instead of downloading a massive dataset, we will create a small synthetic dataset of "Triplets".

- Anchor: The user's question.

- Positive: A different way of asking the same thing (Duplicate).

- Negative: A completely unrelated question.


In [3]:
# Create dummy data
data = [
    # (Anchor, Positive, Negative)
    ("How do I install PyTorch?", "Installation guide for PyTorch", "How to make pasta?"),
    ("What is the capital of France?", "France capital city name", "How do I install PyTorch?"),
    ("Python list vs tuple difference", "Difference between list and tuple in Python", "What is the capital of France?"),
    ("How to fix 404 error?", "Resolving HTTP 404 not found", "Python list vs tuple difference"),
    ("Best way to learn machine learning", "Guide to start ML career", "How to fix 404 error?"),
    ("How to center a div in CSS?", "CSS centering div techniques", "Best way to learn machine learning"),
]

# Convert to DataFrame
df = pd.DataFrame(data, columns=['anchor', 'positive', 'negative'])

# Split into Train and Validation
train_df, val_df = train_test_split(df, test_size=0.2, random_state=42)

print(f"Training Samples: {len(train_df)}")
print(train_df.head())

Training Samples: 4
                               anchor  \
5         How to center a div in CSS?   
2     Python list vs tuple difference   
4  Best way to learn machine learning   
3               How to fix 404 error?   

                                      positive  \
5                 CSS centering div techniques   
2  Difference between list and tuple in Python   
4                     Guide to start ML career   
3                 Resolving HTTP 404 not found   

                             negative  
5  Best way to learn machine learning  
2      What is the capital of France?  
4               How to fix 404 error?  
3     Python list vs tuple difference  
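
The count of four training samples follows from how the split sizes are rounded. The short calculation below assumes that `train_test_split` rounds the test fraction up and gives the remainder to the training set, which matches the output above.

```python
import math

n_samples = 6    # rows in the synthetic DataFrame
test_size = 0.2  # fraction passed to train_test_split

# 0.2 * 6 = 1.2 is rounded up, so two rows go to validation.
n_val = math.ceil(test_size * n_samples)
# The remaining rows form the training set.
n_train = n_samples - n_val

print(n_train, n_val)
```

So on a dataset this small, a nominal 20% validation split actually holds out a third of the data.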


# 3. Text Preprocessing & Dataset Class

We need a custom PyTorch `Dataset` to handle tokenization on the fly. We use `distilbert-base-uncased` because it is smaller and faster than BERT-base while remaining effective for sentence-level tasks.


In [None]:
# Load Tokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

class TripletDataset(Dataset):
    def __init__(self, df, tokenizer, max_len=32):
        self.df = df
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.df)

    def __getitem__(self, index):
        row = self.df.iloc[index]
        
        # Tokenize all three sentences
        anchor = self.tokenize(row['anchor'])
        positive = self.tokenize(row['positive'])
        negative = self.tokenize(row['negative'])

        return {
            'anchor_ids': anchor['input_ids'].flatten(),
            'anchor_mask': anchor['attention_mask'].flatten(),
            'pos_ids': positive['input_ids'].flatten(),
            'pos_mask': positive['attention_mask'].flatten(),
            'neg_ids': negative['input_ids'].flatten(),
            'neg_mask': negative['attention_mask'].flatten()
        }

    def tokenize(self, text):
        return self.tokenizer(
            text,
            padding='max_length',
            truncation=True,
            max_length=self.max_len,
            return_tensors="pt"
        )

# Create DataLoaders
train_dataset = TripletDataset(train_df, tokenizer)
val_dataset = TripletDataset(val_df, tokenizer)

train_loader = DataLoader(train_dataset, batch_size=2, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=2)

# 4. The Model (Siamese Network)
This is the core of the notebook: a Bi-Encoder.

1. Pass the sentence through DistilBERT.

2. Take the output of the `[CLS]` token (the first token) as the representation of the entire sentence.

3. Do this for the Anchor, Positive, and Negative separately, using the same model instance so that the weights are shared.

In [5]:
class SiameseNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        # Load the pre-trained DistilBERT encoder
        self.bert = AutoModel.from_pretrained("distilbert-base-uncased")

    def forward(self, input_ids, attention_mask):
        # Pass input through BERT
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        
        # Extract the [CLS] token (first token) hidden state
        # Shape: (Batch_Size, Hidden_Dim) -> (2, 768)
        cls_embedding = outputs.last_hidden_state[:, 0, :]
        
        return cls_embedding

model = SiameseNetwork().to(device)

Loading weights: 100%|██████████| 100/100 [00:00<00:00, 137.06it/s, Materializing param=transformer.layer.5.sa_layer_norm.weight]
DistilBertModel LOAD REPORT from: distilbert-base-uncased
Key                     | Status     |  | 
------------------------+------------+--+-
vocab_projector.bias    | UNEXPECTED |  | 
vocab_transform.weight  | UNEXPECTED |  | 
vocab_layer_norm.weight | UNEXPECTED |  | 
vocab_layer_norm.bias   | UNEXPECTED |  | 
vocab_transform.bias    | UNEXPECTED |  | 

Notes:
- UNEXPECTED: can be ignored when loading from different task/architecture; not ok if you expect identical arch.


# 5. Training Loop

We use Triplet Margin Loss.

$$\text{Loss} = \max\big(d(\text{Anchor}, \text{Positive}) - d(\text{Anchor}, \text{Negative}) + \text{margin},\ 0\big)$$

This forces the Positive to be closer to the Anchor than the Negative is, by at least the margin.

In [6]:
# Hyperparameters
epochs = 3
optimizer = AdamW(model.parameters(), lr=2e-5)
criterion = nn.TripletMarginLoss(margin=1.0, p=2) # p=2 is Euclidean Distance

# Training Function
def train_one_epoch(model, loader, optimizer, criterion):
    model.train()
    total_loss = 0
    
    for batch in loader:
        # Move batch to GPU
        a_ids = batch['anchor_ids'].to(device)
        a_mask = batch['anchor_mask'].to(device)
        p_ids = batch['pos_ids'].to(device)
        p_mask = batch['pos_mask'].to(device)
        n_ids = batch['neg_ids'].to(device)
        n_mask = batch['neg_mask'].to(device)

        optimizer.zero_grad()

        # Forward Pass (Get embeddings for all 3)
        a_emb = model(a_ids, a_mask)
        p_emb = model(p_ids, p_mask)
        n_emb = model(n_ids, n_mask)

        # Calculate Loss
        loss = criterion(a_emb, p_emb, n_emb)
        
        # Backward Pass
        loss.backward()
        optimizer.step()
        
        total_loss += loss.item()
        
    return total_loss / len(loader)

# Run Training
print("Starting Training...")
for epoch in range(epochs):
    loss = train_one_epoch(model, train_loader, optimizer, criterion)
    print(f"Epoch {epoch+1}/{epochs} | Loss: {loss:.4f}")

print("Training Complete!")

Starting Training...
Epoch 1/3 | Loss: 1.1935
Epoch 2/3 | Loss: 0.9432
Epoch 3/3 | Loss: 0.2229
Training Complete!


# 6. Inference Phase
Now we act like Quora or Stack Overflow. We have a database of answered questions, and we need to index them.

### Step 6a: Create the "Database" Index
We pre-calculate the embeddings for our known questions.

In [8]:
# The "Database" of existing questions and their answers
knowledge_base = [
    {"id": 1, "question": "How do I install PyTorch?", "answer": "Run `pip install torch`."},
    {"id": 2, "question": "What is the capital of France?", "answer": "The capital is Paris."},
    {"id": 3, "question": "Python list vs tuple difference", "answer": "Lists are mutable, tuples are immutable."},
    {"id": 4, "question": "How to fix 404 error?", "answer": "Check your URL or server configuration."},
]

# Function to encode text to vector

def get_embedding(text):
    model.eval()
    # Tokenize the input text
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=32)
    
    # Move specific tensors to GPU
    input_ids = inputs['input_ids'].to(device)
    attention_mask = inputs['attention_mask'].to(device)
    
    with torch.no_grad():
        # EXPLICITLY pass only the arguments the model expects
        # We do NOT pass **inputs here to avoid the 'token_type_ids' error
        embedding = model(input_ids=input_ids, attention_mask=attention_mask)
        
    return embedding.cpu().numpy()

# Index the Knowledge Base
print("Indexing Knowledge Base...")
kb_embeddings = []
for item in knowledge_base:
    vec = get_embedding(item['question'])
    kb_embeddings.append(vec)

# Convert list to numpy array for fast search
kb_embeddings = np.vstack(kb_embeddings) # Shape: (4, 768)
print("Indexing Complete.")

Indexing Knowledge Base...
Indexing Complete.


### Step 6b: The Search Function 
When a user asks a new question, we convert it to a vector and find the closest vector in our "Database" using Cosine Similarity.

In [9]:
from sklearn.metrics.pairwise import cosine_similarity

def search_similar_question(user_query, threshold=0.7):
    # 1. Convert user query to vector
    query_vec = get_embedding(user_query)
    
    # 2. Calculate similarity with ALL database questions
    # (In production, use FAISS for this step)
    similarities = cosine_similarity(query_vec, kb_embeddings) # Shape: (1, 4)
    
    # 3. Find the best match
    best_idx = np.argmax(similarities)
    best_score = similarities[0, best_idx]
    
    print(f"User Query: '{user_query}'")
    
    if best_score > threshold:
        matched_item = knowledge_base[best_idx]
        print(f"Found Similar Question (Score: {best_score:.4f}): '{matched_item['question']}'")
        print(f"Suggested Answer: {matched_item['answer']}")
    else:
        print(f"No similar question found. (Best Score: {best_score:.4f})")
        print("Please post this as a new question.")
    print("-" * 50)

# Test Cases
print("\n--- TESTING THE PIPELINE ---\n")

# Case 1: Semantic Match (Different wording, same meaning)
search_similar_question("guide to installing torch python library")

# Case 2: Exact Semantic Match
search_similar_question("What is the capital city of France?")

# Case 3: Completely New Question
search_similar_question("How do I bake a chocolate cake?")


--- TESTING THE PIPELINE ---

User Query: 'guide to installing torch python library'
Found Similar Question (Score: 0.8046): 'Python list vs tuple difference'
Suggested Answer: Lists are mutable, tuples are immutable.
--------------------------------------------------
User Query: 'What is the capital city of France?'
Found Similar Question (Score: 0.9975): 'What is the capital of France?'
Suggested Answer: The capital is Paris.
--------------------------------------------------
User Query: 'How do I bake a chocolate cake?'
Found Similar Question (Score: 0.9638): 'How do I install PyTorch?'
Suggested Answer: Run `pip install torch`.
--------------------------------------------------

Only the second query retrieves the intended question. The first is matched to the unrelated "Python list vs tuple difference" (0.8046), and the third, which has no counterpart in the knowledge base, still scores 0.9638 against "How do I install PyTorch?". Both clear the 0.7 threshold, so after three epochs on four triplets the cosine scores do not yet separate matches from non-matches; the threshold would likely need to be recalibrated once the model is trained on a larger dataset.
