# Semantic Matching service for e-Commerce

- Requirements
- Architecture Design
- Data 
- Models
- Serving

Dataset: https://huggingface.co/datasets/Studeni/amazon-esci-data
- ESCI dataset, which includes queries and products with relevance labels

# Requirements


Build a semantic search engine for an e-commerce catalog: given a free-text query, return the most relevant products.

# Architecture Design

We will use a "two tower" architecture and bi-encoder models for this task. 

**API:**
Clients interact with the service via a REST API. The request body contains the query as free text. The response contains a list of product IDs sorted by relevance to the query.
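
A minimal sketch of the request and response bodies, assuming JSON payloads. The field names `query`, `top_n`, and `product_ids`, the `handle_search` helper, and the `fake_search` backend with its scores are all hypothetical; the design does not fix a schema.

```python
import json
from typing import Callable, List, Tuple

# Hypothetical search backend: takes (query, top_n), returns (product_id, score) pairs.
SearchFn = Callable[[str, int], List[Tuple[str, float]]]


def handle_search(body: str, search: SearchFn, default_top_n: int = 10) -> str:
    """Parse a JSON request body and return a JSON response body."""
    request = json.loads(body)
    query = request.get("query", "")
    if not isinstance(query, str) or not query.strip():
        return json.dumps({"error": "field 'query' must be a non-empty string"})
    top_n = int(request.get("top_n", default_top_n))
    hits = search(query.strip(), top_n)
    # Sort by descending similarity so the most relevant product comes first.
    hits = sorted(hits, key=lambda hit: hit[1], reverse=True)[:top_n]
    return json.dumps({"product_ids": [product_id for product_id, _ in hits]})


def fake_search(query: str, top_n: int) -> List[Tuple[str, float]]:
    """Stand-in backend with made-up scores, for illustration only."""
    return [("B07X3Y6B1V", 0.71), ("B000MOO21W", 0.35), ("B07WDM7MQQ", 0.64)]


print(handle_search('{"query": "revent 80 cfm", "top_n": 2}', fake_search))
```

Returning only IDs keeps the response small; the client can fetch product details from its own catalog service.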

**Inference:**
Product data is represented as vector embeddings and stored in a vector database backed by an HNSW graph index.
When a user submits a query, the query encoder converts it into a vector. Then we call the vector database to retrieve the top N most similar products.
The resulting matchset is sorted by the similarity score and returned to the user.
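
The retrieval step can be illustrated with an exact, brute-force search. This is only a stand-in for the HNSW index, which returns approximate neighbours much faster on large catalogs. The toy vectors and the `cosine` and `top_n` helpers are assumptions made for the example.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_n(query_vec: Sequence[float], index: Dict[str, Sequence[float]], n: int) -> List[Tuple[str, float]]:
    """Score every product against the query and keep the n most similar."""
    scored = [(product_id, cosine(query_vec, vec)) for product_id, vec in index.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:n]


# Toy 3-dimensional vectors; real embeddings would come from the encoders.
toy_index = {
    "p1": [1.0, 0.0, 0.0],
    "p2": [0.6, 0.8, 0.0],
    "p3": [0.0, 0.0, 1.0],
}
query_vec = [0.9, 0.1, 0.0]

for product_id, score in top_n(query_vec, toy_index, n=2):
    print(product_id, round(score, 3))
```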

**Product Embeddings Generation:**
We use an HNSW index for vector storage. The data source for product data is a parquet file with the following columns:
- product_id
- product_title
- product_description
- product_bullet_point
- product_brand
- product_color

The offline job reads the parquet file and converts the product data into vector embeddings. The embeddings are stored in the vector database.
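
The offline job might be structured roughly as below. This is a sketch under assumptions: rows are plain dictionaries, `fake_embed` is a placeholder for the product encoder, and a Python dict stands in for the vector database. The helper names `product_text`, `batches`, and `build_index` are hypothetical.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Optional

TEXT_FIELDS = [
    "product_title",
    "product_description",
    "product_bullet_point",
    "product_brand",
    "product_color",
]

Row = Dict[str, Optional[str]]
Embedder = Callable[[List[str]], List[List[float]]]


def product_text(row: Row) -> str:
    """Join the non-empty text fields of one product row."""
    parts = [row.get(field) for field in TEXT_FIELDS]
    return " ".join(part.strip() for part in parts if isinstance(part, str) and part.strip())


def batches(rows: Iterable, size: int) -> Iterator[list]:
    """Yield consecutive chunks of at most `size` rows."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def build_index(rows: Iterable[Row], embed_batch: Embedder, batch_size: int = 256) -> Dict[str, List[float]]:
    """Embed products batch by batch and key the vectors by product_id."""
    index: Dict[str, List[float]] = {}
    for batch in batches(rows, batch_size):
        vectors = embed_batch([product_text(row) for row in batch])
        for row, vector in zip(batch, vectors):
            index[row["product_id"]] = vector
    return index


def fake_embed(texts: List[str]) -> List[List[float]]:
    """Placeholder encoder: uses the text length as a 1-d 'embedding'."""
    return [[float(len(text))] for text in texts]


rows = [
    {"product_id": "B000MOO21W", "product_title": "Panasonic FV-20VQ3",
     "product_description": None, "product_brand": "Panasonic", "product_color": "White"},
    {"product_id": "B07X3Y6B1V", "product_title": "Homewerks 7141-80", "product_brand": "Homewerks"},
]
index = build_index(rows, fake_embed, batch_size=1)
print(index)
```

With pandas available, the rows could come from `pd.read_parquet(path).to_dict("records")`; missing fields then arrive as `None` or `NaN`, which the `isinstance` check skips.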

**Model Training:**
We use a bi-encoder model for this task. The model is trained on the training set and evaluated on the validation set.
- Base model: TK:all-MiniLM-L6-v2
- Tokenizer: TK:SentencePiece
- Model validation metrics: ESCI dataset, TK:(MAP@10, MRR@10, NDCG@10, Precision@10, Recall@10, F1@10)
- Training algorithm: Contrastive loss
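
Two of the ranking metrics named above can be computed per query as sketched here. The gain values are illustrative and not an official ESCI weighting. As a simplification, the ideal ranking is taken from the retrieved list itself rather than from all judged products for the query. MRR@10 is the mean of the reciprocal rank over all queries.

```python
import math
from typing import List


def reciprocal_rank_at_k(relevant: List[bool], k: int = 10) -> float:
    """1 / rank of the first relevant item within the top k, else 0."""
    for rank, is_relevant in enumerate(relevant[:k], start=1):
        if is_relevant:
            return 1.0 / rank
    return 0.0


def dcg_at_k(gains: List[float], k: int) -> float:
    """Discounted cumulative gain with a log2(rank + 1) discount."""
    return sum(gain / math.log2(rank + 1) for rank, gain in enumerate(gains[:k], start=1))


def ndcg_at_k(gains: List[float], k: int = 10) -> float:
    """DCG of the ranking divided by the DCG of the same gains sorted ideally."""
    ideal = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0


# Graded gains of the returned products, in ranked order, for one query.
ranked_gains = [0.0, 1.0, 0.1, 1.0]

print(reciprocal_rank_at_k([gain > 0.5 for gain in ranked_gains]))
print(round(ndcg_at_k(ranked_gains), 3))
```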

**Datasets:**
- ESCI dataset, which includes queries and products with relevance labels


In [1]:
import datasets
print(datasets.__version__)


3.3.2


In [2]:
from datasets import load_dataset

esci = load_dataset("tasksource/esci")
# Reuse the splits of the already-loaded DatasetDict instead of loading them again.
esci_train = esci["train"]
esci_test = esci["test"]


In [19]:
print(f'\n\nesci: {esci}')  # If you loaded the entire DatasetDict
print(f'\n\nesci_train[0]: {esci_train[0]}')



esci: DatasetDict({
    train: Dataset({
        features: ['example_id', 'query', 'query_id', 'product_id', 'product_locale', 'esci_label', 'small_version', 'large_version', 'product_title', 'product_description', 'product_bullet_point', 'product_brand', 'product_color', 'product_text'],
        num_rows: 2027874
    })
    test: Dataset({
        features: ['example_id', 'query', 'query_id', 'product_id', 'product_locale', 'esci_label', 'small_version', 'large_version', 'product_title', 'product_description', 'product_bullet_point', 'product_brand', 'product_color', 'product_text'],
        num_rows: 652490
    })
})


esci_train[0]: {'example_id': 0, 'query': ' revent 80 cfm', 'query_id': 0, 'product_id': 'B000MOO21W', 'product_locale': 'us', 'esci_label': 'Irrelevant', 'small_version': 0, 'large_version': 1, 'product_title': 'Panasonic FV-20VQ3 WhisperCeiling 190 CFM Ceiling Mounted Fan', 'product_description': None, 'product_bullet_point': 'WhisperCeiling fans feature a totally 

In [20]:
for item in esci_train.take(5):  # take first 5 examples
    print(item["query"], "->", item["esci_label"])

 revent 80 cfm -> Irrelevant
bathroom fan without light -> Exact
 revent 80 cfm -> Exact
 revent 80 cfm -> Exact
 revent 80 cfm -> Exact


In [21]:
esci_us = esci_train.filter(lambda x: x["product_locale"] == "us")
print(len(esci_us))

1420372


# Data

In [22]:
# Convert the splits to pandas DataFrames (the "test" split serves as the validation set here)
train_df = esci["train"].to_pandas()
val_df = esci["test"].to_pandas()



In [23]:
train_df

Unnamed: 0,example_id,query,query_id,product_id,product_locale,esci_label,small_version,large_version,product_title,product_description,product_bullet_point,product_brand,product_color,product_text
0,0,revent 80 cfm,0,B000MOO21W,us,Irrelevant,0,1,Panasonic FV-20VQ3 WhisperCeiling 190 CFM Ceil...,,WhisperCeiling fans feature a totally enclosed...,Panasonic,White,Panasonic FV-20VQ3 WhisperCeiling 190 CFM Ceil...
1,291891,bathroom fan without light,13723,B000MOO21W,us,Exact,1,1,Panasonic FV-20VQ3 WhisperCeiling 190 CFM Ceil...,,WhisperCeiling fans feature a totally enclosed...,Panasonic,White,Panasonic FV-20VQ3 WhisperCeiling 190 CFM Ceil...
2,1,revent 80 cfm,0,B07X3Y6B1V,us,Exact,0,1,Homewerks 7141-80 Bathroom Fan Integrated LED ...,,OUTSTANDING PERFORMANCE: This Homewerk's bath ...,Homewerks,80 CFM,Homewerks 7141-80 Bathroom Fan Integrated LED ...
3,2,revent 80 cfm,0,B07WDM7MQQ,us,Exact,0,1,Homewerks 7140-80 Bathroom Fan Ceiling Mount E...,,OUTSTANDING PERFORMANCE: This Homewerk's bath ...,Homewerks,White,Homewerks 7140-80 Bathroom Fan Ceiling Mount E...
4,3,revent 80 cfm,0,B07RH6Z8KW,us,Exact,0,1,Delta Electronics RAD80L BreezRadiance 80 CFM ...,This pre-owned or refurbished product has been...,Quiet operation at 1.5 sones\nBuilt-in thermos...,DELTA ELECTRONICS (AMERICAS) LTD.,White,Delta Electronics RAD80L BreezRadiance 80 CFM ...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2027869,2621262,ﾚﾃﾞｨｰｽ水着,130650,B07VDSXYYC,jp,Exact,0,1,ZHPUAT ラッシュガード レディース 水着 タンキニ セパレーツ 長袖 前開き ショート...,,,ZHPUAT,,ZHPUAT ラッシュガード レディース 水着 タンキニ セパレーツ 長袖 前開き ショート...
2027870,2621263,ﾚﾃﾞｨｰｽ水着,130650,B07TMNGGCH,jp,Exact,0,1,Yun-Wear（ユンウェア） 水着 レディース ビキニ セクシー シンプル 無地 クール ...,✅【原産国】 中国 ✅【素材構成】ナイロン ✅【サイズ】M：バスト83～88cm、ウエスト6...,✅【原産国】 中国\n✅【素材構成】ナイロン\n✅【サイズ】M：バスト83～88cm、ウエス...,Yun-Wear,ブラック,Yun-Wear（ユンウェア） 水着 レディース ビキニ セクシー シンプル 無地 クール ...
2027871,2621264,ﾚﾃﾞｨｰｽ水着,130650,B079FXWL9J,jp,Exact,0,1,[COTARON] 水着 レディース 体型カバー タンキニ カバーアップ オーバーTシャツ ...,こなれた大人のおしゃれ感が印象的な、オーバーTシャツ、タンキニ上下、フレアパンツの体型カバー...,✅【Amazon限定価格で販売中】【 高品質な水着セットを低価格で】”自分らしくを応援します...,COTARON,1ブラック×ボヘミアン,[COTARON] 水着 レディース 体型カバー タンキニ カバーアップ オーバーTシャツ ...
2027872,2621267,ﾚﾃﾞｨｰｽ水着,130650,B07GR94N75,jp,Exact,0,1,レディース 水着 オーバーウェア ビキニ セパレーツ 無地 二点セット 海水浴 水泳 温泉 ...,,★メイン素材：90%Polyester、10%spandex\n★人気の女性用セクシーワイヤ...,kayiyasu,ブラック,レディース 水着 オーバーウェア ビキニ セパレーツ 無地 二点セット 海水浴 水泳 温泉 ...


In [24]:
val_df


Unnamed: 0,example_id,query,query_id,product_id,product_locale,esci_label,small_version,large_version,product_title,product_description,product_bullet_point,product_brand,product_color,product_text
0,291854,bathroom fan with light,13722,B07X3Y6B1V,us,Exact,0,1,Homewerks 7141-80 Bathroom Fan Integrated LED ...,,OUTSTANDING PERFORMANCE: This Homewerk's bath ...,Homewerks,80 CFM,Homewerks 7141-80 Bathroom Fan Integrated LED ...
1,48617,110cfm bathroom exhaust fan without light,1750,B076Q7V5WX,us,Exact,1,1,Panasonic FV-0511VQ1 WhisperCeiling DC Ventila...,,Installation: Features a 4-inch or 6-inch duct...,Panasonic,White,Panasonic FV-0511VQ1 WhisperCeiling DC Ventila...
2,291841,bathroom fan quiet,13721,B076Q7V5WX,us,Exact,0,1,Panasonic FV-0511VQ1 WhisperCeiling DC Ventila...,,Installation: Features a 4-inch or 6-inch duct...,Panasonic,White,Panasonic FV-0511VQ1 WhisperCeiling DC Ventila...
3,1021805,household ventilation fans,51549,B076Q7V5WX,us,Exact,1,1,Panasonic FV-0511VQ1 WhisperCeiling DC Ventila...,,Installation: Features a 4-inch or 6-inch duct...,Panasonic,White,Panasonic FV-0511VQ1 WhisperCeiling DC Ventila...
4,48636,110cfm bathroom exhaust fan without light,1750,B075ZBF9HG,us,Irrelevant,1,1,Panasonic FV-0510VSL1 WhisperValue DC Ventilat...,,Installation: Features a low profile can ideal...,Panasonic,White,Panasonic FV-0510VSL1 WhisperValue DC Ventilat...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
652485,2621283,�����j�[�h�p�[�x abrasus,130651,B0063ASUY4,jp,Exact,0,1,薄い財布 abrAsus ブッテーロレザーエディション ブラック,2013年グッドデザイン賞受賞。<br/> 「薄い財布abrAsus」は、財布をゼロベースで...,イタリア産の上質なヌメ革を使用した最上級ブッテーロレザーエディション。\nグッドデザイン賞を...,abrAsus(アブラサス),ブラック,薄い財布 abrAsus ブッテーロレザーエディション ブラック\nabrAsus(アブラサ...
652486,2621284,�����j�[�h�p�[�x abrasus,130651,B0062EZYIG,jp,Exact,0,1,アブラサス (abrAsus) 薄い財布 ブラック,グッドデザイン賞受賞。 <br /> 「薄い財布abrAsus」は、財布をゼロベースで考えた...,グッドデザイン賞を受賞しました。\n特別な構造（特許取得済み）で、圧倒的な薄さを実現しました...,abrAsus(アブラサス),ブラック,アブラサス (abrAsus) 薄い財布 ブラック\nabrAsus(アブラサス)\nブラッ...
652487,2621285,�����j�[�h�p�[�x abrasus,130651,B07H8MWBZN,jp,Substitute,0,1,Bellroy Hide & Seek Wallet - スリムなレザー製二つ折り財布、RF...,<b>スリムな財布へ</b><br> ベルロイは、毎日の必需品をスマートに持ち運べる製品の開...,ポケットをフラットに保てるスリムなシルエットかつクラシックなデザイン\nRFID保護（スキミ...,Bellroy(ベルロイ),Black - RFID (New),Bellroy Hide & Seek Wallet - スリムなレザー製二つ折り財布、RF...
652488,2621286,�����j�[�h�p�[�x abrasus,130651,B00IZH4T9S,jp,Exact,0,1,小さい小銭入れ abrAsus (アブラサス) ダークグリーン,<b>「コイン」「紙幣」「キー」だけを持ち運ぶ キーホルダーみたいな財布</b> <br> ...,「コイン」「紙幣」「キー」だけを一緒に持ち歩く、キーホルダーのような財布です。\n500円玉...,abrAsus(アブラサス),ダークグリーン,小さい小銭入れ abrAsus (アブラサス) ダークグリーン\nabrAsus(アブラサス...


# Model training



In [None]:
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from transformers import AutoTokenizer, AutoModel
import pandas as pd
import os

# Configuration
# Merge with existing config dictionary
config = {
    **config,  # Existing config with data paths
    'model_name': 'xlm-roberta-base',
    'batch_size': 32, 
    'epochs': 7,
    'learning_rate': 2e-5,
    'max_length': 128
}

config

In [62]:
# BiEncoder Model

class BiEncoder(nn.Module):
    def __init__(self, model_name):
        super().__init__()
        self.query_encoder = AutoModel.from_pretrained(model_name)
        self.product_encoder = AutoModel.from_pretrained(model_name)
        self.projection = nn.Linear(self.query_encoder.config.hidden_size, 256)  # Project to smaller embedding size
        
    def encode_query(self, query_inputs):
        outputs = self.query_encoder(**query_inputs)
        cls_embedding = outputs.last_hidden_state[:, 0, :]
        return self.projection(cls_embedding)
        
    def encode_product(self, product_inputs):
        outputs = self.product_encoder(**product_inputs)
        cls_embedding = outputs.last_hidden_state[:, 0, :]
        return self.projection(cls_embedding)
        
    def forward(self, query_inputs, product_inputs):
        query_emb = self.encode_query(query_inputs)
        product_emb = self.encode_product(product_inputs)
        return query_emb, product_emb

In [63]:
# Dataset class

class AmazonESCIDataset(Dataset):
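    def __init__(self, queries_df, products_df, tokenizer, max_length=128):
        self.queries = queries_df
        # Index products by ID once so each lookup avoids a full DataFrame scan.
        # Keeping the first row per product_id matches the earlier `.iloc[0]` choice.
        self.products = products_df.drop_duplicates('product_id').set_index('product_id')
        self.tokenizer = tokenizer
        self.max_length = max_length
        # Target similarity per ESCI label: Exact, Substitute, Complement, Irrelevant
        self.label_map = {'E': 1.0, 'S': 0.8, 'C': 0.3, 'I': 0.0}
        
    def __len__(self):
        return len(self.queries)
    
    def __getitem__(self, idx):
        row = self.queries.iloc[idx]
        query = row['query']
        product_id = row['product_id']
        
        # Get product information
        product = self.products.loc[product_id]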
        
        # Handle missing values safely
        def safe_get(field):
            value = product[field]
            return value if isinstance(value, str) else ''
            
        product_text = (
            "Title: " + safe_get('product_title') + "; " +
            "Description: " + safe_get('product_description') + "; " +
            "Bullet Points: " + safe_get('product_bullet_point')
        )
        
        # Tokenize inputs
        query_inputs = self.tokenizer(
            query, 
            max_length=self.max_length, 
            padding='max_length', 
            truncation=True, 
            return_tensors='pt'
        )
        
        product_inputs = self.tokenizer(
            product_text,
            max_length=self.max_length,
            padding='max_length',
            truncation=True,
            return_tensors='pt'
        )
        
        label = torch.tensor(self.label_map[row['esci_label']], dtype=torch.float)
        
        return {
            'query_inputs': {k: v.squeeze(0) for k, v in query_inputs.items()},
            'product_inputs': {k: v.squeeze(0) for k, v in product_inputs.items()},
            'label': label
        }
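
The dataset preview earlier in this notebook prints long-form labels such as `Exact` and `Irrelevant`, while `label_map` is keyed by single letters. A small normalisation helper, sketched below, would accept both spellings. `LONG_TO_SHORT` and `label_to_score` are hypothetical names, and the scores are copied from `label_map`.

```python
from typing import Dict

# Assumption: long-form names as printed in the dataset preview, plus
# "Complement" for the C class of ESCI.
LONG_TO_SHORT: Dict[str, str] = {
    "Exact": "E",
    "Substitute": "S",
    "Complement": "C",
    "Irrelevant": "I",
}
LABEL_SCORES: Dict[str, float] = {"E": 1.0, "S": 0.8, "C": 0.3, "I": 0.0}


def label_to_score(label: str) -> float:
    """Accept 'E' or 'Exact' style labels and return the target score."""
    short = LONG_TO_SHORT.get(label, label)
    if short not in LABEL_SCORES:
        raise ValueError(f"unknown ESCI label: {label!r}")
    return LABEL_SCORES[short]


print(label_to_score("Exact"), label_to_score("I"))
```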

In [64]:
# Training Function

def train_epoch(model, dataloader, optimizer, criterion, device):
    model.train()
    total_loss = 0
    
    for batch in dataloader:
        optimizer.zero_grad()
        
        query_inputs = {k: v.to(device) for k, v in batch['query_inputs'].items()}
        product_inputs = {k: v.to(device) for k, v in batch['product_inputs'].items()}
        labels = batch['label'].to(device)
        
        query_emb, product_emb = model(query_inputs, product_inputs)
        cos_sim = torch.nn.functional.cosine_similarity(query_emb, product_emb)
        loss = criterion(cos_sim, labels)
        
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    
    return total_loss / len(dataloader)
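
Written out, when `criterion` is `nn.MSELoss()` (as configured in the training setup cell), each step of `train_epoch` minimises the following over a batch of size $B$, where $q_i$ and $p_i$ are the projected query and product embeddings and $y_i \in \{1.0, 0.8, 0.3, 0.0\}$ is the score from `label_map`:

```latex
\mathcal{L}_{\text{MSE}}
  = \frac{1}{B} \sum_{i=1}^{B} \bigl( \cos(q_i, p_i) - y_i \bigr)^2,
\qquad
\cos(q, p) = \frac{q^\top p}{\lVert q \rVert \, \lVert p \rVert}
```

The architecture section names a contrastive loss instead. One common in-batch form, given here only as an assumption about what that could look like, treats the other products in the batch as negatives and uses a temperature $\tau$ (a hypothetical hyperparameter not present in `config`):

```latex
\mathcal{L}_{\text{contrastive}}
  = -\frac{1}{B} \sum_{i=1}^{B}
    \log \frac{\exp\bigl(\cos(q_i, p_i) / \tau\bigr)}
              {\sum_{j=1}^{B} \exp\bigl(\cos(q_i, p_j) / \tau\bigr)}
```

This second form only makes sense for pairs that are known positives, so it would need the training rows filtered to relevant query-product pairs first.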

In [None]:
config

In [66]:
# Data preparation

# Load data
products = pd.read_parquet(config['ds_products'])
queries = pd.read_parquet(config['ds_queries'])

# Filter and merge data
queries = queries[queries['small_version'] == 1]
merged_df = queries.merge(products, on=['product_id', 'product_locale'], how='inner')

# Split data
train_df = merged_df[merged_df['split'] == 'train']
val_df = merged_df[merged_df['split'] == 'test']

# Initialize tokenizer
tokenizer = AutoTokenizer.from_pretrained(config['model_name'])

In [67]:
# Training setup

# Create datasets
train_dataset = AmazonESCIDataset(train_df, products, tokenizer, config['max_length'])
val_dataset = AmazonESCIDataset(val_df, products, tokenizer, config['max_length'])

# Create dataloaders
train_loader = DataLoader(train_dataset, batch_size=config['batch_size'], shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=config['batch_size'])

# Initialize model and optimizer
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = BiEncoder(config['model_name']).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=config['learning_rate'])
criterion = nn.MSELoss()

In [None]:
# Training loop

# for epoch in range(config['epochs']):
#     train_loss = train_epoch(model, train_loader, optimizer, criterion, device)
#     print(f'Epoch {epoch + 1}/{config["epochs"]}, Loss: {train_loss:.4f}')

# # Save model
# torch.save(model.state_dict(), 'bi_encoder_model.pth')