# Semantic Matching service for e-Commerce

- Requirements
- Architecture Design
- Data 
- Models
- Serving

Dataset: https://huggingface.co/datasets/Studeni/amazon-esci-data
- ESCI dataset, which includes queries and products with relevance labels (Exact, Substitute, Complement, Irrelevant)

# Requirements


Build a semantic search engine for an e-commerce site.

# Architecture Design

We will use a "two tower" architecture and bi-encoder models for this task. 

**API:**
The client interacts with the service via a REST API. The request body contains the query as free text. The response contains a list of product IDs sorted by relevance to the query.
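
The request and response shapes described above can be sketched with plain dataclasses. This is only an illustration: the field names `query`, `top_n`, and `product_ids` are assumptions, not a defined contract, and the product IDs are copied from the dataset sample in an arbitrary order.

```python
import json
from dataclasses import dataclass, asdict
from typing import List


@dataclass
class SearchRequest:
    query: str        # free-text query from the client
    top_n: int = 10   # hypothetical parameter: how many products to return


@dataclass
class SearchResponse:
    product_ids: List[str]  # expected to be sorted by descending relevance


def parse_request(body: str) -> SearchRequest:
    """Parse and validate a JSON request body."""
    payload = json.loads(body)
    query = payload.get("query", "")
    if not isinstance(query, str) or not query.strip():
        raise ValueError("'query' must be a non-empty string")
    return SearchRequest(query=query.strip(), top_n=int(payload.get("top_n", 10)))


def render_response(product_ids: List[str]) -> str:
    """Serialize the ranked product IDs as a JSON response body."""
    return json.dumps(asdict(SearchResponse(product_ids=list(product_ids))))


request = parse_request('{"query": "revent 80 cfm", "top_n": 3}')
response_body = render_response(["B07X3Y6B1V", "B07WDM7MQQ", "B07RH6Z8KW"])
print(request)
print(response_body)
```

Keeping the payload types separate from the web framework makes the contract easy to test before any server code exists.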

**Inference:**
The product data is represented as vector embeddings stored in a vector database and indexed with an HNSW graph.
When a user submits a query, the query encoder converts it into a vector. Then we call the vector database to retrieve the top N most similar products.
The resulting match set is sorted by similarity score and returned to the user.
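
The retrieval step can be illustrated with an exact, brute-force search. Note that this is a stand-in and not HNSW: an HNSW index returns approximate neighbors much faster on large catalogs, while the sketch below scores every product. The product IDs and vectors are made up for the example.

```python
import heapq
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_n(query_vec, index, n=3):
    """Return (product_id, score) pairs sorted by descending similarity."""
    scored = ((pid, cosine(query_vec, vec)) for pid, vec in index.items())
    return heapq.nlargest(n, scored, key=lambda pair: pair[1])


# Hypothetical product embeddings keyed by product ID
index = {
    "P1": [1.0, 0.0, 0.0],
    "P2": [0.7, 0.7, 0.0],
    "P3": [0.0, 1.0, 0.0],
    "P4": [0.0, 0.0, 1.0],
}

# Hypothetical query embedding produced by the query encoder
results = top_n([1.0, 0.1, 0.0], index, n=2)
print(results)
```

Because `heapq.nlargest` already returns its results in descending order, no separate sort of the match set is needed in this sketch.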

**Product Embeddings Generation:**
We use an HNSW index for vector storage. The data source for product data is a Parquet file with the following columns:
- product_id
- product_title
- product_description
- product_bullet_point
- product_brand
- product_color

The offline job reads the Parquet file and converts the product data into vector embeddings. The embeddings are stored in the vector database.
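
The shape of such an offline job might look like the sketch below. Several things here are assumptions: the helper names are hypothetical, all five text columns are simply joined with spaces, the rows are plain dicts rather than a DataFrame, and `toy_embed_batch` is a placeholder that stands in for the trained product encoder and does not produce meaningful embeddings. The two sample rows are taken from the products preview.

```python
from itertools import islice

TEXT_COLUMNS = [
    "product_title",
    "product_description",
    "product_bullet_point",
    "product_brand",
    "product_color",
]


def product_text(row):
    """Join the non-empty text columns of one product row."""
    parts = []
    for col in TEXT_COLUMNS:
        value = row.get(col)
        if isinstance(value, str) and value.strip():
            parts.append(value.strip())
    return " ".join(parts)


def batched(rows, size):
    """Yield consecutive lists of at most `size` rows."""
    it = iter(rows)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def toy_embed_batch(texts):
    """Placeholder encoder: returns [char count, word count] per text."""
    return [[float(len(t)), float(len(t.split()))] for t in texts]


def build_index(rows, embed_batch, batch_size=64):
    """Embed products batch by batch and map product_id -> vector."""
    index = {}
    for batch in batched(rows, batch_size):
        vectors = embed_batch([product_text(r) for r in batch])
        for row, vec in zip(batch, vectors):
            index[row["product_id"]] = vec
    return index


rows = [
    {"product_id": "B079Y9VRKS",
     "product_title": "Camiseta Eleven Degrees Core TS White (M)",
     "product_description": None, "product_bullet_point": None,
     "product_brand": "11 Degrees", "product_color": "Blanco"},
    {"product_id": "B07G37B9HP",
     "product_title": "11 Degrees Poli Panel Track Pant XL Black",
     "product_description": None, "product_bullet_point": None,
     "product_brand": "11 Degrees", "product_color": None},
]
index = build_index(rows, toy_embed_batch, batch_size=1)
print(index)
```

In a real job, `embed_batch` would wrap the product tower and the resulting vectors would be written to the vector database instead of a dict.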

**Model Training:**
We use a bi-encoder model for this task. The model is trained on the training set and evaluated on the validation set.
Base model: TK:all-MiniLM-L6-v2
Tokenizer: TK:SentencePiece
Model validation metrics: ESCI dataset, TK:(MAP@10, MRR@10, NDCG@10, Precision@10, Recall@10, F1@10)    
Training algorithm: Contrastive loss
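
The design names a contrastive loss without fixing its exact form. One common choice for bi-encoders, assumed here purely for illustration, is the in-batch softmax loss: for each query, its paired product is the positive and the other products in the batch act as negatives. The `temperature` value and the toy vectors are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def in_batch_contrastive_loss(query_embs, product_embs, temperature=0.05):
    """Mean cross-entropy where product i is the positive for query i."""
    losses = []
    for i, q in enumerate(query_embs):
        logits = [cosine(q, p) / temperature for p in product_embs]
        # Numerically stable log-sum-exp over all products in the batch
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)


queries = [[1.0, 0.0], [0.0, 1.0]]

# Each query points at its own product: the loss is close to zero
aligned = in_batch_contrastive_loss(queries, [[1.0, 0.0], [0.0, 1.0]])

# Each query points at the other query's product: the loss is large
swapped = in_batch_contrastive_loss(queries, [[0.0, 1.0], [1.0, 0.0]])

print(aligned, swapped)
```

This formulation needs only positive pairs, so with ESCI data one would have to decide which labels count as positives.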

**Datasets:**
- ESCI dataset, which includes queries and products with relevance labels (Exact, Substitute, Complement, Irrelevant)


# Data

In [59]:
!ls ~/datasets/shopping_queries_dataset

import os
import pandas as pd


config = {
    'data_path': '/Users/vladimirkroz/datasets/shopping_queries_dataset',
}

config = {
    **config,
    'ds_queries': os.path.join(config['data_path'], 'shopping_queries_dataset_examples.parquet'),
    'ds_products': os.path.join(config['data_path'], 'shopping_queries_dataset_products.parquet'),
}

print('QUERIES: ', config['ds_queries'])
df = pd.read_parquet(config['ds_queries'])
df.head()



README.md
shopping_queries_dataset_examples.parquet
shopping_queries_dataset_products.parquet
shopping_queries_dataset_sources.csv
QUERIES:  /Users/vladimirkroz/datasets/shopping_queries_dataset/shopping_queries_dataset_examples.parquet


| | example_id | query | query_id | product_id | product_locale | esci_label | small_version | large_version | split |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | revent 80 cfm | 0 | B000MOO21W | us | I | 0 | 1 | train |
| 1 | 1 | revent 80 cfm | 0 | B07X3Y6B1V | us | E | 0 | 1 | train |
| 2 | 2 | revent 80 cfm | 0 | B07WDM7MQQ | us | E | 0 | 1 | train |
| 3 | 3 | revent 80 cfm | 0 | B07RH6Z8KW | us | E | 0 | 1 | train |
| 4 | 4 | revent 80 cfm | 0 | B07QJ7WYFQ | us | E | 0 | 1 | train |


In [60]:
print('PRODUCTS: ', config['ds_products'])
df = pd.read_parquet(config['ds_products'])
df.head()


PRODUCTS:  /Users/vladimirkroz/datasets/shopping_queries_dataset/shopping_queries_dataset_products.parquet


| | product_id | product_title | product_description | product_bullet_point | product_brand | product_color | product_locale |
|---|---|---|---|---|---|---|---|
| 0 | B079VKKJN7 | 11 Degrees de los Hombres Playera con Logo, Ne... | Esta playera con el logo de la marca Carrier d... | 11 Degrees Negro Playera con logo\nA estrenar ... | 11 Degrees | Negro | es |
| 1 | B079Y9VRKS | Camiseta Eleven Degrees Core TS White (M) | | | 11 Degrees | Blanco | es |
| 2 | B07DP4LM9H | 11 Degrees de los Hombres Core Pull Over Hoodi... | La sudadera con capucha Core Pull Over de 11 G... | 11 Degrees Azul Core Pull Over Hoodie\nA estre... | 11 Degrees | Azul | es |
| 3 | B07G37B9HP | 11 Degrees Poli Panel Track Pant XL Black | | | 11 Degrees | | es |
| 4 | B07LCTGDHY | 11 Degrees Gorra Trucker Negro OSFA (Talla úni... | | | 11 Degrees | Negro ( | es |


# Model training



In [61]:
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from transformers import AutoTokenizer, AutoModel
import pandas as pd
import os

# Configuration
# Merge with existing config dictionary
config = {
    **config,  # Existing config with data paths
    'model_name': 'xlm-roberta-base',
    'batch_size': 32, 
    'epochs': 7,
    'learning_rate': 2e-5,
    'max_length': 128
}

config

{'data_path': '/Users/vladimirkroz/datasets/shopping_queries_dataset',
 'ds_queries': '/Users/vladimirkroz/datasets/shopping_queries_dataset/shopping_queries_dataset_examples.parquet',
 'ds_products': '/Users/vladimirkroz/datasets/shopping_queries_dataset/shopping_queries_dataset_products.parquet',
 'model_name': 'xlm-roberta-base',
 'batch_size': 32,
 'epochs': 7,
 'learning_rate': 2e-05,
 'max_length': 128}

In [62]:
# BiEncoder Model

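class BiEncoder(nn.Module):
    """Two-tower encoder: separate query and product transformers with one shared projection."""

    def __init__(self, model_name, embedding_dim=256):
        super().__init__()
        self.query_encoder = AutoModel.from_pretrained(model_name)
        self.product_encoder = AutoModel.from_pretrained(model_name)
        # Both towers share this layer, which projects the encoder hidden size
        # down to a smaller embedding size.
        self.projection = nn.Linear(self.query_encoder.config.hidden_size, embedding_dim)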
        
    def encode_query(self, query_inputs):
        outputs = self.query_encoder(**query_inputs)
        cls_embedding = outputs.last_hidden_state[:, 0, :]
        return self.projection(cls_embedding)
        
    def encode_product(self, product_inputs):
        outputs = self.product_encoder(**product_inputs)
        cls_embedding = outputs.last_hidden_state[:, 0, :]
        return self.projection(cls_embedding)
        
    def forward(self, query_inputs, product_inputs):
        query_emb = self.encode_query(query_inputs)
        product_emb = self.encode_product(product_inputs)
        return query_emb, product_emb

In [63]:
# Dataset class

class AmazonESCIDataset(Dataset):
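    def __init__(self, queries_df, products_df, tokenizer, max_length=128):
        self.queries = queries_df
        # Index products by (product_id, product_locale): each lookup becomes a direct
        # index access instead of a full scan, and the product locale always matches
        # the query row (a product_id may be listed under more than one locale).
        self.products = (
            products_df
            .drop_duplicates(subset=['product_id', 'product_locale'])
            .set_index(['product_id', 'product_locale'])
            .sort_index()
        )
        self.tokenizer = tokenizer
        self.max_length = max_length
        # Graded relevance targets for the ESCI labels
        self.label_map = {'E': 1.0, 'S': 0.8, 'C': 0.3, 'I': 0.0}
        
    def __len__(self):
        return len(self.queries)
    
    def __getitem__(self, idx):
        row = self.queries.iloc[idx]
        query = row['query']
        
        # Get product information for the same locale as the query
        product = self.products.loc[(row['product_id'], row['product_locale']), :]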
        
        # Handle missing values safely
        def safe_get(field):
            value = product[field]
            return value if isinstance(value, str) else ''
            
        product_text = (
            "Title: " + safe_get('product_title') + "; " +
            "Description: " + safe_get('product_description') + "; " +
            "Bullet Points: " + safe_get('product_bullet_point')
        )
        
        # Tokenize inputs
        query_inputs = self.tokenizer(
            query, 
            max_length=self.max_length, 
            padding='max_length', 
            truncation=True, 
            return_tensors='pt'
        )
        
        product_inputs = self.tokenizer(
            product_text,
            max_length=self.max_length,
            padding='max_length',
            truncation=True,
            return_tensors='pt'
        )
        
        label = torch.tensor(self.label_map[row['esci_label']], dtype=torch.float)
        
        return {
            'query_inputs': {k: v.squeeze(0) for k, v in query_inputs.items()},
            'product_inputs': {k: v.squeeze(0) for k, v in product_inputs.items()},
            'label': label
        }

In [64]:
# Training Function

def train_epoch(model, dataloader, optimizer, criterion, device):
    model.train()
    total_loss = 0
    
    for batch in dataloader:
        optimizer.zero_grad()
        
        query_inputs = {k: v.to(device) for k, v in batch['query_inputs'].items()}
        product_inputs = {k: v.to(device) for k, v in batch['product_inputs'].items()}
        labels = batch['label'].to(device)
        
        query_emb, product_emb = model(query_inputs, product_inputs)
        cos_sim = torch.nn.functional.cosine_similarity(query_emb, product_emb)
        loss = criterion(cos_sim, labels)
        
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    
    return total_loss / len(dataloader)
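
The loss in the training function regresses the cosine similarity of each query/product pair onto its graded ESCI label. The arithmetic can be checked by hand on a tiny batch; the sketch below uses plain Python instead of PyTorch, and the two-dimensional vectors are made up for the example.

```python
import math

LABEL_MAP = {'E': 1.0, 'S': 0.8, 'C': 0.3, 'I': 0.0}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mse(preds, targets):
    """Mean squared error, matching the default 'mean' reduction of nn.MSELoss."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)


# (query embedding, product embedding, ESCI label)
pairs = [
    ([1.0, 0.0], [1.0, 0.0], 'E'),  # cosine 1.0, target 1.0: squared error 0
    ([1.0, 0.0], [0.0, 1.0], 'I'),  # cosine 0.0, target 0.0: squared error 0
    ([1.0, 0.0], [0.0, 1.0], 'E'),  # cosine 0.0, target 1.0: squared error 1
]

cos_sims = [cosine(q, p) for q, p, _ in pairs]
targets = [LABEL_MAP[label] for _, _, label in pairs]
loss = mse(cos_sims, targets)
print(loss)  # (0 + 0 + 1) / 3
```

Only the third pair contributes to the loss, which is what pushes an exact-match product toward its query during training.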

In [66]:
# Data preparation

# Load data
products = pd.read_parquet(config['ds_products'])
queries = pd.read_parquet(config['ds_queries'])

# Filter and merge data
queries = queries[queries['small_version'] == 1]
merged_df = queries.merge(products, on=['product_id', 'product_locale'], how='inner')

# Split data (the held-out 'test' split serves as the validation set here)
train_df = merged_df[merged_df['split'] == 'train']
val_df = merged_df[merged_df['split'] == 'test']

# Initialize tokenizer
tokenizer = AutoTokenizer.from_pretrained(config['model_name'])

In [67]:
# Training setup

# Create datasets
train_dataset = AmazonESCIDataset(train_df, products, tokenizer, config['max_length'])
val_dataset = AmazonESCIDataset(val_df, products, tokenizer, config['max_length'])

# Create dataloaders
train_loader = DataLoader(train_dataset, batch_size=config['batch_size'], shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=config['batch_size'])

# Initialize model and optimizer
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = BiEncoder(config['model_name']).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=config['learning_rate'])
criterion = nn.MSELoss()

In [68]:
# Training loop

for epoch in range(config['epochs']):
    train_loss = train_epoch(model, train_loader, optimizer, criterion, device)
    print(f'Epoch {epoch + 1}/{config["epochs"]}, Loss: {train_loss:.4f}')

# Save model
torch.save(model.state_dict(), 'bi_encoder_model.pth')

KeyboardInterrupt: 
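
The design section lists ranking metrics such as MRR@10 and NDCG@10, but the notebook does not compute them. A minimal sketch of two of them follows. Two assumptions are made: only the label `E` counts as relevant for MRR, and the graded values from `label_map` are reused as NDCG gains. The ideal ordering is computed from the retrieved list alone, which is a simplification.

```python
import math

LABEL_MAP = {'E': 1.0, 'S': 0.8, 'C': 0.3, 'I': 0.0}


def mrr_at_k(relevant_flags, k=10):
    """Reciprocal rank of the first relevant item within the top k, else 0."""
    for rank, is_relevant in enumerate(relevant_flags[:k], start=1):
        if is_relevant:
            return 1.0 / rank
    return 0.0


def dcg_at_k(gains, k):
    """Discounted cumulative gain with a log2(rank + 1) discount."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains[:k], start=1))


def ndcg_at_k(gains, k=10):
    """DCG normalized by the DCG of the same gains in ideal order."""
    ideal = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0


# ESCI labels of the products returned for one query, in ranked order
ranked_labels = ['I', 'E', 'S']

mrr = mrr_at_k([label == 'E' for label in ranked_labels])
ndcg = ndcg_at_k([LABEL_MAP[label] for label in ranked_labels])
print(mrr, ndcg)
```

Averaging these per-query values over all validation queries gives the dataset-level metric.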