# Evaluating a Pre-trained Model for Semantic Search (Retrieval)

## Import libraries and make GPU available

In [1]:
# Standard library imports
import argparse
import json
import time
from datetime import datetime

# Related third-party imports
import matplotlib.pyplot as plt
import pandas as pd

# Local application/library specific imports
from src.data_loader import DataLoader
from src.data_preprocessor import DataPreprocessor
from src.model_evaluator import ModelEvaluator



In [1]:
# Check if GPU is available and set device accordingly
import torch
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU is available")
    print(torch.cuda.get_device_name(0))
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

True
GPU is available
NVIDIA GeForce RTX 4070 SUPER


## Dataset Overview

### ESCI Data 

The **Shopping Queries Data Set** (https://github.com/amazon-science/esci-data) is a large-scale, manually annotated collection of challenging search queries, created to support research in semantic matching between queries and products. For each query, the dataset includes up to 40 potentially relevant products, each annotated with an ESCI relevance judgment:

- **Exact match (E)**
- **Substitute (S)**
- **Complement (C)**
- **Irrelevant (I)**

These labels indicate how well each product aligns with the query. Each query-product pair is also accompanied by additional contextual information. The dataset is **multilingual**, featuring queries in **English (us)**, **Japanese (jp)**, and **Spanish (es)**.
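
For ranking metrics, the four labels are usually converted into graded relevance scores. The sketch below assumes the same mapping this notebook applies when loading the data (E=4, S=3, C=2, I=1); the helper name `to_relevance` is hypothetical and only used for illustration.

```python
# Graded relevance for ESCI labels (mapping taken from the notebook's loader).
ESCI_TO_RELEVANCE = {"E": 4, "S": 3, "C": 2, "I": 1}


def to_relevance(labels):
    """Map ESCI labels to numeric scores; unknown labels become None."""
    return [ESCI_TO_RELEVANCE.get(label) for label in labels]


scores = to_relevance(["E", "I", "S", "C"])
print(scores)  # [4, 1, 3, 2]
```

With this mapping, a threshold of 3 treats Exact and Substitute products as relevant and Complement and Irrelevant products as non-relevant.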

#### Full Dataset

- **130,652** unique queries
- **2,621,738** query-item pairs
- Provides a broader and more comprehensive set of examples for extensive training and evaluation

#### Reduced Version Provided by the Author 

- **48,300** unique queries
- **1,118,011** query-item pairs



The dataset is stratified by queries into **train** and **test** splits, ensuring a balanced distribution of query types across both splits.

### Data Fields

Each entry in the dataset contains the following fields:

`example_id`, `query`, `query_id`, `product_id`, `product_locale`, `esci_label`, `small_version`, `large_version`, `split`, `product_title`, `product_description`, `product_bullet_point`, `product_brand`, `product_color`, `source`

### Columns Used for This Task

For this particular task, we focus on the following columns:

- `query`
- `product_title`
- `product_description`
- `product_locale`
- `small_version`
- `large_version`
- `split`
- `esci_label`

These fields provide the necessary information for training and evaluating semantic search models, combining textual features from queries and product metadata with relevance labels for supervised learning.

In [2]:
# Take a look at the ESCI dataset
loaded_data = DataLoader(
    example_path="dataset/shopping_queries_dataset_examples.parquet",
    products_path="dataset/shopping_queries_dataset_products.parquet",
    small_version=True,  # Load the reduced version of the dataset for data exploration
)
loaded_data.load_data()
loaded_data.df.head(5)

| | query | title | description | relevance | split | product_locale |
|---|---|---|---|---|---|---|
| 16 | !awnmower tires without rims | RamPro 10" All Purpose Utility Air Tires/Wheel... | `<b>About The Ram-Pro All Purpose Utility 10" A...` | 1 | train | us |
| 17 | !awnmower tires without rims | MaxAuto 2-Pack 13x5.00-6 2PLY Turf Mower Tract... | MaxAuto 2-Pack 13x5.00-6 2PLY Turf Mower Tract... | 4 | train | us |
| 18 | !awnmower tires without rims | NEIKO 20601A 14.5 inch Steel Tire Spoon Lever ... | | 1 | train | us |
| 19 | !awnmower tires without rims | 2PK 13x5.00-6 13x5.00x6 13x5x6 13x5-6 2PLY Tur... | Tire Size: 13 x 5.00 - 6 Axle: 3/4" inside dia... | 3 | train | us |
| 20 | !awnmower tires without rims | (Set of 2) 15x6.00-6 Husqvarna/Poulan Tire Whe... | No fuss. Just take off your old assembly and r... | 4 | train | us |


### Data Preprocessing

In preparation for semantic search and product retrieval tasks, I performed a comprehensive data cleaning process on the **Shopping Queries Data Set**. This was necessary to ensure that the input text is clean, meaningful, and suitable for downstream modeling.

---

#### Key Data Cleaning Steps

##### 1. Removal of HTML Tags

Upon inspecting the extracted vocabulary, it became clear that a significant portion of the words were HTML tags rather than meaningful content. These tags introduce noise into both bag-of-words models and semantic search embeddings.

To address this, I systematically removed all HTML tags from the following text fields:
- **query**
- **product_title**
- **product_description**

This ensured that only the true textual content was retained for model training and evaluation.
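
The notebook calls a helper named `remove_html_tags`, but its implementation is not shown. The following is only a sketch of how such a helper might look, assuming a simple regular expression is sufficient for the markup found in product descriptions (tags such as `<b>`, `<br>`, `<li>`); it is not a full HTML parser.

```python
import html
import re

# Matches anything that looks like an HTML tag, e.g. <b>, </li>, <br />.
TAG_RE = re.compile(r"<[^>]+>")


def remove_html_tags(text):
    """Strip HTML tags and decode entities; non-string input (e.g. NaN) becomes ''."""
    if not isinstance(text, str):
        return ""
    no_tags = TAG_RE.sub(" ", text)      # replace tags with a space to keep word boundaries
    decoded = html.unescape(no_tags)     # e.g. "&quot;" -> '"'
    return re.sub(r"\s+", " ", decoded).strip()


print(remove_html_tags("<b>About</b> the tire<br>10&quot; wheel"))
```

Replacing each tag with a space rather than deleting it avoids gluing together words that were separated only by a `<br>` or `<li>`.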

---

##### 2. Removal of Special Characters

For English-language (`us`) data, I further cleaned the text by removing special characters and symbols that do not contribute to the semantic meaning of the text. This included punctuation marks, brackets, and other non-alphanumeric characters, while preserving spaces to maintain word boundaries.

This step helped reduce vocabulary fragmentation and improved the quality of both traditional tokenization and semantic embeddings.
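
As with the HTML step, the `remove_special_characters` helper used in the code cells is not shown in this notebook. A minimal sketch, assuming ASCII letters, digits, and whitespace are the only characters worth keeping for the `us` locale:

```python
import re

# Anything that is not an ASCII letter, digit, or whitespace character.
NON_ALNUM_RE = re.compile(r"[^A-Za-z0-9\s]")


def remove_special_characters(text):
    """Replace punctuation and symbols with spaces; non-string input becomes ''."""
    if not isinstance(text, str):
        return ""
    cleaned = NON_ALNUM_RE.sub(" ", text)
    return re.sub(r"\s+", " ", cleaned).strip()


print(remove_special_characters('RamPro 10" All-Purpose (Utility) Tires!'))
# RamPro 10 All Purpose Utility Tires
```

Substituting a space (instead of deleting the character) is a design choice: it keeps `All-Purpose` as two tokens, at the cost of splitting sizes such as `13x5.00-6` into several pieces.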

---

##### 3. Text Normalization

As part of the cleaning process:
- All text was converted to **lowercase** to ensure case-insensitive matching.
- Tokenization steps (for bag-of-words analysis) were performed on clean, normalized text.
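
To illustrate the normalization and bag-of-words counting described above, here is a small sketch using only the standard library. The function names `tokenize` and `top_words` are hypothetical; the notebook's own `get_bag_of_words` method may work differently.

```python
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (assumes the text was already cleaned)."""
    return text.lower().split()


def top_words(texts, top_n=3):
    """Count tokens across all texts and return the most frequent ones."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return counts.most_common(top_n)


titles = ["Mower Tire 2 Pack", "mower tire tube", "Tire Lever"]
print(top_words(titles, top_n=2))  # [('tire', 3), ('mower', 2)]
```

Because of lowercasing, `Tire` and `tire` are counted as the same word.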

---

##### 4. Locale Filtering

Since the dataset is multilingual (English, Japanese, Spanish), I filtered the data to focus only on the desired locale (e.g., English `"us"`) when necessary.  
Different languages may require different tokenization and preprocessing strategies, so this filtering step ensured consistent and comparable training and evaluation.

---

##### 5. Creation of Combined Text Field

For feature extraction, I created a new field, **`combined_text`**, by concatenating the cleaned `product_title` and `product_description`.  
This allowed the models to leverage richer product information and better capture the full semantic context during training and inference.
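
Some products have no description (see the empty cell in the preview above), and in pandas, concatenating a string with a missing value yields a missing value. The sketch below shows the intended behavior on plain Python values; `combine_text` is a hypothetical helper, not part of the notebook's code.

```python
def combine_text(title, description):
    """Join title and description; missing or empty parts are skipped."""
    parts = [
        part.strip()
        for part in (title, description)
        if isinstance(part, str) and part.strip()
    ]
    return " ".join(parts)


print(combine_text("turf mower tire", "tire size 13 x 5"))  # turf mower tire tire size 13 x 5
print(combine_text("neiko tire spoon lever", None))         # neiko tire spoon lever
```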

---


In [None]:
# Run preprocessing on the loaded data
data_preprocessor = DataPreprocessor(loaded_data.df,language="us")
print(data_preprocessor.get_bag_of_words(top_n=10))

# Clean the data
data_preprocessor.data_preprocessing()

     word  frequency
18     br    4056430
48     la     438325
66   para     414585
30     el     331113
53     li     321154
..    ...        ...
6      18      46060
13   baby      46022
95  waist      45831
81   skin      45089
45   just      44797

[100 rows x 2 columns]


In [None]:
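# Initial data analysis on product titles and descriptions
import seaborn as sns
from sklearn.feature_extraction.text import CountVectorizer
from wordcloud import WordCloud

# Note: remove_special_characters is a text-cleaning helper whose definition is not shown in this notebook.
def create_bag_of_words_analysis(df_products):
    # Filter for US products; copy so the new columns are not written to a view of df_products
    df_products_us = df_products[df_products['product_locale'] == 'us'].copy()
    
    print(f"Analyzing {len(df_products_us)} US products")
    
    # Clean title and description columns (missing values are treated as empty strings)
    df_products_us['clean_title'] = df_products_us['product_title'].fillna("").apply(remove_special_characters)
    df_products_us['clean_description'] = df_products_us['product_description'].fillna("").apply(remove_special_characters)
    
    # Combine title and description for analysis
    df_products_us['combined_text'] = df_products_us['clean_title'] + " " + df_products_us['clean_description']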
    
    # Create bag of words separately for titles, descriptions, and combined
    print("\n===== BAG OF WORDS ANALYSIS =====")
    
    # Function to create and analyze bag of words
    def analyze_bow(text_series, name, min_df=2, max_features=100):
        print(f"\n--- {name} Analysis ---")
        
        # Remove empty texts
        valid_texts = text_series[text_series != ""].tolist()
        
        if not valid_texts:
            print(f"No valid texts found for {name}")
            return None
        
        # Create bag of words
        vectorizer = CountVectorizer(stop_words='english', min_df=min_df, max_features=max_features)
        X = vectorizer.fit_transform(valid_texts)
        
        # Get feature names and counts
        feature_names = vectorizer.get_feature_names_out()
        counts = X.sum(axis=0).A1
        
        # Create DataFrame with word frequencies
        word_freq = pd.DataFrame({'word': feature_names, 'frequency': counts})
        word_freq = word_freq.sort_values('frequency', ascending=False)
        
        print(f"Top 20 most frequent words for {name}:")
        print(word_freq.head(20))
        
        # Generate word cloud
        create_wordcloud(word_freq, name)
        
        return word_freq
    
    # Function to create word cloud
    def create_wordcloud(word_freq_df, name):
        word_dict = dict(zip(word_freq_df['word'], word_freq_df['frequency']))
        
        plt.figure(figsize=(12, 8))
        wordcloud = WordCloud(width=800, height=400, background_color='white', 
                             max_words=100, contour_width=3, contour_color='steelblue')
        wordcloud.generate_from_frequencies(word_dict)
        
        plt.imshow(wordcloud, interpolation='bilinear')
        plt.title(f'Word Cloud - {name}')
        plt.axis('off')
        plt.tight_layout()
        # Save the figure so the message below is accurate
        filename = f"wordcloud_{name.lower().replace(' ', '_')}.png"
        plt.savefig(filename)
        print(f"Word cloud saved as '{filename}'")
    
    # Analyze titles
    title_bow = analyze_bow(df_products_us['clean_title'], 'Product Titles')
    
    # Analyze descriptions
    desc_bow = analyze_bow(df_products_us['clean_description'], 'Product Descriptions')
    
    # Analyze combined text
    combined_bow = analyze_bow(df_products_us['combined_text'], 'Combined Title & Description')
    
    # Create bar plots for top words
    def plot_top_words(word_freq_df, name, n=15):
        plt.figure(figsize=(12, 6))
        top_words = word_freq_df.head(n)
        sns.barplot(x='frequency', y='word', data=top_words)
        plt.title(f'Top {n} Words in {name}')
        plt.tight_layout()
    
    # Plot bar charts for top words
    if title_bow is not None:
        plot_top_words(title_bow, 'Product Titles')
    
    if desc_bow is not None:
        plot_top_words(desc_bow, 'Product Descriptions')
    
    if combined_bow is not None:
        plot_top_words(combined_bow, 'Combined Text')
    
    # Return the dataframes for further analysis
    return {
        'us_products': df_products_us,
        'title_bow': title_bow,
        'description_bow': desc_bow,
        'combined_bow': combined_bow
    }

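# Load the products table (same path as used by the loaders in this notebook) and run the analysis
df_products = pd.read_parquet("dataset/shopping_queries_dataset_products.parquet")
products_results = create_bag_of_words_analysis(df_products)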

## Data preprocessing

In [None]:
class DataPreprocessor:
    def __init__(self, language="us", small_version=True):
        """
        Initialize the data preprocessor
        """
        self.language = language
        self.small_version = small_version
        self.df = None
    
    def load_data(self, examples_path="dataset/shopping_queries_dataset_examples.parquet", 
                  products_path="dataset/shopping_queries_dataset_products.parquet"):
        """
        Load the datasets from the specified paths.
        """
        df_examples = pd.read_parquet(examples_path)
        df_products = pd.read_parquet(products_path)

        # Merge the examples and products dataframes on product_locale and product_id
        self.df = pd.merge(
            df_examples,
            df_products,
            how='left',
            left_on=['product_locale','product_id'],
            right_on=['product_locale', 'product_id']
        )

        # Map ESCI labels to numeric relevance scores (higher means more relevant)
        esci_dict = {"E":4,"S":3,"C":2,"I":1}
        self.df['esci_label'] = self.df['esci_label'].map(esci_dict)

        # Rename the columns for clarity
        self.df.rename(columns={
            'esci_label': 'relevance',
            'product_title': 'title',
            'product_description': 'description',
        }, inplace=True)

        # Drop rows with missing values in the 'relevance', 'title', or 'query' columns
        self.df = self.df.dropna(subset=['relevance','title','query'])

        if self.small_version:
            # Create a small version of the dataset for testing purposes
            self.df = self.df[self.df['small_version'] == 1]

        # Only keep the relevant columns
        self.df = self.df[['query', 'title', 'description', 'relevance','split','product_locale']]
    
    def get_dataset_stats(self):
        """
        Get statistics about the dataset
        """
        if self.df is None:
            raise ValueError("Dataset not loaded. Call load_data() first.")
            
        stats = {
            'total_samples': len(self.df),
            'unique_queries': self.df['query'].nunique(),
            'avg_products_per_query': len(self.df) / self.df['query'].nunique(),
            'relevance_distribution': self.df['relevance'].value_counts(normalize=True).sort_index().to_dict(),
        }
        
        if 'product_locale' in self.df.columns:
            stats['locale_distribution'] = self.df['product_locale'].value_counts(normalize=True).to_dict()
        
        return stats

    def split_data(self, split_column):
        """
        Split the DataFrame into train and test sets based on split_column.
        """
        df_train = self.df[self.df[split_column] == 'train']
        df_test = self.df[self.df[split_column] == 'test']

        print(f"Train set size: {len(df_train)} rows, {df_train['query'].nunique()} unique queries")
        print(f"Test set size: {len(df_test)} rows, {df_test['query'].nunique()} unique queries")
        return df_train, df_test
    
    def data_preprocessing(self):
        """
        Preprocess the data by removing HTML tags and cleaning text.
        """
        # Treat missing descriptions as empty strings so combined_text is never NaN
        self.df['description'] = self.df['description'].fillna("")

        # Remove HTML tags
        self.df['query'] = self.df['query'].apply(remove_html_tags)
        self.df['title'] = self.df['title'].apply(remove_html_tags)
        self.df['description'] = self.df['description'].apply(remove_html_tags)

        # Only remove special characters for US (English) data
        if self.language == 'us':
            self.df['query'] = self.df['query'].apply(remove_special_characters)
            self.df['title'] = self.df['title'].apply(remove_special_characters)
            self.df['description'] = self.df['description'].apply(remove_special_characters)

        # Concatenate title and description into a single text field
        self.df['combined_text'] = self.df['title'] + " " + self.df['description']

    def filter_by_locale(self):
        """
        Filter the dataset by product locale
        """
        if self.df is None:
            raise ValueError("Dataset not loaded. Call load_data() first.")
            
        if 'product_locale' not in self.df.columns:
            raise ValueError("'product_locale' column not found in dataset")
            
        filtered_df = self.df[self.df['product_locale'] == self.language]
        print(f"Filtered dataset to {len(filtered_df)} rows with locale '{self.language}'")
        
        # Set the filtered DataFrame as the current DataFrame
        self.df = filtered_df

In [48]:
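import numpy as np
import torch
from sentence_transformers import SentenceTransformer, util
from tqdm import tqdm


class ModelEvaluator: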
    def __init__(self, model_name="all-MiniLM-L6-v2", use_gpu=True):
        """
        Initialize the model evaluator
        """
        # Check if CUDA is available
        self.device = torch.device("cuda" if torch.cuda.is_available() and use_gpu else "cpu")
        self.model_name = model_name
        self.model = SentenceTransformer(model_name)
        print(f"Using device: {self.device}")
        print(f"Loaded model: {model_name}")
        
    @staticmethod
    def calculate_ndcg(relevance_scores, k=10):
        """
        Calculate NDCG@k for a single query
        """
        if len(relevance_scores) == 0:
            return 0.0
        
        # Limit to top k scores
        relevance_scores = relevance_scores[:k]
        
        # Calculate DCG
        dcg = 0
        for i, score in enumerate(relevance_scores):
            dcg += (2 ** score - 1) / np.log2(i + 2)  # i+2 because i is 0-indexed
        
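        # Calculate ideal DCG. Note: the ideal ordering is built from the same
        # top-k scores selected above, not from the full candidate list, so this
        # measures how well the top k items are ordered among themselves.
        ideal_scores = sorted(relevance_scores, reverse=True)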
        idcg = 0
        for i, score in enumerate(ideal_scores):
            idcg += (2 ** score - 1) / np.log2(i + 2)
        
        # Avoid division by zero
        if idcg == 0:
            return 0.0
        
        # Return NDCG
        return dcg / idcg
    
    @staticmethod
    def calculate_recall_at_k(relevant_items, retrieved_items, k=10):
        """
        Calculate Recall@k for a single query
        """
        if len(relevant_items) == 0:
            return 0.0
        
        # Limit to top k retrieved items
        retrieved_items_at_k = set(retrieved_items[:k])
        
        # Calculate recall
        return len(retrieved_items_at_k.intersection(relevant_items)) / len(relevant_items)
    
    @staticmethod
    def calculate_mrr(retrieved_items, relevant_items, k=10):
        """
        Calculate MRR@k for a single query
        """
        # Limit to top k retrieved items
        retrieved_items = retrieved_items[:k]
        
        # Find the first relevant item
        for i, item in enumerate(retrieved_items):
            if item in relevant_items:
                return 1.0 / (i + 1)  # +1 for 1-based rank
        
        return 0.0
    
    def evaluate(self, test_df, threshold=3.0, k=10, batch_size=128):
        """
        Evaluate the model on the test set
        """
        # Group the test data by query
        query_groups = test_df.groupby('query')

        # Cap the batch size on GPU to avoid out-of-memory errors
        if self.device.type == 'cuda' and batch_size > 128:
            print(f"Reducing batch size from {batch_size} to 128 for GPU processing")
            batch_size = 128
        
        # Track metrics for each query
        ndcg_scores = []
        recall_scores = []
        mrr_scores = []
        
        # Process queries in batches
        unique_queries = list(query_groups.groups.keys())
        
        for i in tqdm(range(0, len(unique_queries), batch_size), desc="Evaluating queries"):
            batch_queries = unique_queries[i:i+batch_size]
            
            # Collect all products for the batch queries
            batch_data = []
            for query in batch_queries:
                query_df = query_groups.get_group(query)
                query_products = query_df['combined_text'].tolist()
                query_relevances = query_df['relevance'].tolist()
                batch_data.append((query, query_products, query_relevances))
            
            # Encode all unique queries in the batch
            query_texts = [item[0] for item in batch_data]

            query_embeddings = self.model.encode(
                query_texts, 
                convert_to_tensor=True, 
                show_progress_bar=False,
                device=self.device
            )
            
            # Process each query in the batch
            for idx, (query, products, relevances) in enumerate(batch_data):
                # Skip if no products
                if len(products) == 0:
                    continue
                
                # Encode products
                product_embeddings = self.model.encode(
                    products,
                    convert_to_tensor=True,
                    show_progress_bar=False,
                    device=self.device
                )
                
                # Calculate similarities
                q_embedding = query_embeddings[idx].unsqueeze(0)  # Add batch dimension
                similarities = util.pytorch_cos_sim(q_embedding, product_embeddings)[0].cpu().numpy()
                
                # Sort products by similarity (descending)
                sorted_indices = np.argsort(-similarities)
                
                # Get relevance scores in ranked order
                sorted_relevances = [relevances[j] for j in sorted_indices]
                
                # Calculate NDCG@k
                ndcg = self.calculate_ndcg(sorted_relevances, k)
                ndcg_scores.append(ndcg)
                
                # Calculate Recall@k and MRR@k
                relevant_indices = {i for i, rel in enumerate(relevances) if rel >= threshold}
                retrieved_indices = sorted_indices.tolist()
                
                recall = self.calculate_recall_at_k(relevant_indices, retrieved_indices, k)
                recall_scores.append(recall)
                
                mrr = self.calculate_mrr(retrieved_indices, relevant_indices, k)
                mrr_scores.append(mrr)
                if self.device.type == 'cuda':
                    torch.cuda.empty_cache()
        
        # Calculate average metrics
        avg_ndcg = np.mean(ndcg_scores)
        avg_recall = np.mean(recall_scores)
        avg_mrr = np.mean(mrr_scores)
        
        # Return metrics
        return {
            'NDCG@10': avg_ndcg,
            'Recall@10': avg_recall,
            'MRR@10': avg_mrr,
            'Number of Queries': len(unique_queries)
        }

    def print_results(self, results, language=None, small_version=None):
        """
        Print evaluation results
        """
        print("\nEvaluation Results:")
        print("-" * 50)
        print(f"Model: {self.model_name}")
        print(f"Dataset: ESCI {language.upper() if language else 'All Languages'} {'Small' if small_version else 'Large'}")
        print(f"Number of test queries: {results['Number of Queries']}")
        print("-" * 50)
        print(f"NDCG@10: {results['NDCG@10']:.4f}")
        print(f"Recall@10: {results['Recall@10']:.4f}")
        print(f"MRR@10: {results['MRR@10']:.4f}")
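
In `calculate_ndcg` above, the ideal ordering is derived from the top-k slice of the model's ranking. A commonly used alternative builds the ideal ranking from the full candidate list before truncating to k, which penalizes a model that leaves highly relevant products outside the top k. The sketch below illustrates that variant together with Recall@k and MRR@k on a toy ranking. The function names are hypothetical, and the input is assumed to be a list of relevance scores already sorted by model similarity.

```python
import math


def dcg(scores):
    """Discounted cumulative gain with exponential gain (2**rel - 1)."""
    return sum((2 ** s - 1) / math.log2(rank + 2) for rank, s in enumerate(scores))


def ndcg_at_k(ranked_scores, k=10):
    """NDCG@k where the ideal ranking is built from the full candidate list."""
    ideal = sorted(ranked_scores, reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(ranked_scores[:k]) / ideal_dcg if ideal_dcg > 0 else 0.0


def recall_at_k(ranked_scores, k=10, threshold=3):
    """Fraction of all relevant items (score >= threshold) found in the top k."""
    relevant_total = sum(s >= threshold for s in ranked_scores)
    if relevant_total == 0:
        return 0.0
    return sum(s >= threshold for s in ranked_scores[:k]) / relevant_total


def mrr_at_k(ranked_scores, k=10, threshold=3):
    """Reciprocal rank of the first relevant item within the top k."""
    for rank, s in enumerate(ranked_scores[:k], start=1):
        if s >= threshold:
            return 1.0 / rank
    return 0.0


# Relevance scores in the order the model ranked the products.
ranked = [1, 4, 3, 1, 4]
print(ndcg_at_k(ranked, k=3), recall_at_k(ranked, k=3), mrr_at_k(ranked, k=3))
```

In this toy ranking, two of the three relevant products appear in the top 3 and the first relevant product is at rank 2, so Recall@3 is 2/3 and MRR@3 is 0.5. The NDCG@3 value is below what the top-k-only definition would give, because the second product with score 4 sits at rank 5.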

In [49]:
def run_evaluation(language=None, model_name="all-MiniLM-L6-v2", small_version=True,split_column='split',batch_size=64):
    """
    Run the complete evaluation pipeline
    """
    # Initialize data preprocessor
    if language is None:
        print(f"Initializing data preprocessor for all languages")
    else:
        print(f"Initializing data preprocessor for language: {language}")
    preprocessor = DataPreprocessor(language=language, small_version=small_version)
    
    # Download and prepare dataset
    preprocessor.load_data()
    
    # Preprocess data
    preprocessor.data_preprocessing()

    # Filter by locale
    if (language is not None):
        preprocessor.filter_by_locale()
    
    # Get dataset statistics
    stats = preprocessor.get_dataset_stats()
    print("\nDataset Statistics:")
    print(f"Total samples: {stats['total_samples']}")
    print(f"Unique queries: {stats['unique_queries']}")
    print(f"Average products per query: {stats['avg_products_per_query']:.2f}")
    print("Relevance distribution:")
    for score, pct in stats['relevance_distribution'].items():
        print(f"  Score {score}: {pct:.2%}")
    
    # Split data
    train_df, test_df = preprocessor.split_data(split_column)
    
    # Initialize model evaluator
    print(f"\nInitializing model evaluator with model: {model_name}")
    evaluator = ModelEvaluator(model_name=model_name)
    
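    # Label used in log messages when no single locale is selected
    dataset_label = language if language is not None else "all-language"

    # Evaluate on the test set
    print(f"Evaluating model on {dataset_label} test set...")
    test_results = evaluator.evaluate(test_df, threshold=3.0, batch_size=batch_size)  # Consider 'E' and 'S' as relevant (4 and 3)
    evaluator.print_results(test_results, language=language, small_version=small_version)

    # Evaluate on the train set
    print(f"\nEvaluating model on {dataset_label} train set...")
    train_results = evaluator.evaluate(train_df, threshold=3.0, batch_size=batch_size)  # Consider 'E' and 'S' as relevant (4 and 3)
    evaluator.print_results(train_results, language=language, small_version=small_version)

    print("\n")

    # Return both result sets so the test metrics are not overwritten by the train metrics
    return {"test": test_results, "train": train_results}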

In [40]:
torch.cuda.empty_cache()

In [41]:
# Run evaluation for US language with small and full dataset versions
us_results_small = run_evaluation(language="us", model_name="all-MiniLM-L6-v2", small_version=True,split_column='split',batch_size=64)
us_results_full = run_evaluation(language="us", model_name="all-MiniLM-L6-v2", small_version=False,split_column='split',batch_size=64)

Initializing data preprocessor for language: us
Filtered dataset to 601354 rows with locale 'us'

Dataset Statistics:
Total samples: 601354
Unique queries: 29842
Average products per query: 20.15
Relevance distribution:
  Score 1: 16.87%
  Score 2: 4.52%
  Score 3: 35.12%
  Score 4: 43.49%
Train set size: 419653 rows, 20887 unique queries
Test set size: 181701 rows, 8955 unique queries

Initializing model evaluator with model: all-MiniLM-L6-v2
Using device: cuda
Loaded model: all-MiniLM-L6-v2
Evaluating model on us test set...
Reducing batch size from 128 to 64 for GPU processing


Evaluating queries: 100%|██████████| 140/140 [02:43<00:00,  1.17s/it]



Evaluation Results:
--------------------------------------------------
Model: all-MiniLM-L6-v2
Dataset: ESCI All Languages Large
Number of test queries: 8955
--------------------------------------------------
NDCG@10: 0.9072
Recall@10: 0.6034
MRR@10: 0.9244

Evaluating model on us train set...
Reducing batch size from 128 to 64 for GPU processing


Evaluating queries: 100%|██████████| 327/327 [09:03<00:00,  1.66s/it]



Evaluation Results:
--------------------------------------------------
Model: all-MiniLM-L6-v2
Dataset: ESCI All Languages Large
Number of test queries: 20887
--------------------------------------------------
NDCG@10: 0.9060
Recall@10: 0.6077
MRR@10: 0.9242


Initializing data preprocessor for language: us
Filtered dataset to 1818825 rows with locale 'us'

Dataset Statistics:
Total samples: 1818825
Unique queries: 97307
Average products per query: 18.69
Relevance distribution:
  Score 1: 8.90%
  Score 2: 2.20%
  Score 3: 20.31%
  Score 4: 68.59%
Train set size: 1393063 rows, 74865 unique queries
Test set size: 425762 rows, 22455 unique queries

Initializing model evaluator with model: all-MiniLM-L6-v2
Using device: cuda
Loaded model: all-MiniLM-L6-v2
Evaluating model on us test set...
Reducing batch size from 128 to 64 for GPU processing


Evaluating queries: 100%|██████████| 351/351 [09:06<00:00,  1.56s/it]



Evaluation Results:
--------------------------------------------------
Model: all-MiniLM-L6-v2
Dataset: ESCI All Languages Large
Number of test queries: 22455
--------------------------------------------------
NDCG@10: 0.9556
Recall@10: 0.6136
MRR@10: 0.9659

Evaluating model on us train set...
Reducing batch size from 128 to 64 for GPU processing


Evaluating queries: 100%|██████████| 1170/1170 [28:43<00:00,  1.47s/it]



Evaluation Results:
--------------------------------------------------
Model: all-MiniLM-L6-v2
Dataset: ESCI All Languages Large
Number of test queries: 74865
--------------------------------------------------
NDCG@10: 0.9652
Recall@10: 0.6177
MRR@10: 0.9743




In [None]:
# Run evaluation for all languages (us, jp, es) with small and full dataset versions
all_results_small = run_evaluation(model_name="all-MiniLM-L6-v2", small_version=True,split_column='split',batch_size=128)
all_results_full = run_evaluation(model_name="all-MiniLM-L6-v2", small_version=False,split_column='split',batch_size=128)

Initializing data preprocessor for all languages

Dataset Statistics:
Total samples: 1118011
Unique queries: 48249
Average products per query: 23.17
Relevance distribution:
  Score 1: 16.76%
  Score 2: 5.15%
  Score 3: 34.30%
  Score 4: 43.78%
Train set size: 781638 rows, 33776 unique queries
Test set size: 336373 rows, 14488 unique queries

Initializing model evaluator with model: all-MiniLM-L6-v2
Using device: cuda
Loaded model: all-MiniLM-L6-v2
Evaluating model on None test set...


Evaluating queries: 100%|██████████| 114/114 [08:36<00:00,  4.53s/it]



Evaluation Results:
--------------------------------------------------
Model: all-MiniLM-L6-v2
Dataset: ESCI All Languages Large
Number of test queries: 14488
--------------------------------------------------
NDCG@10: 0.9027
Recall@10: 0.5502
MRR@10: 0.9147

Evaluating model on None train set...


Evaluating queries:  83%|████████▎ | 219/264 [13:49<03:21,  4.48s/it]

## Limitations

In [None]:
# from langdetect import detect

# def detect_language(text):
#     try:
#         return detect(text)
#     except:
#         return 'unknown'

# # Add language detection to preprocessing
# df_examples_products_small['query_lang'] = df_examples_products_small['query'].apply(detect_language)
# df_examples_products_small['title_lang'] = df_examples_products_small['title'].apply(detect_language)
# df_examples_products_small['description_lang'] = df_examples_products_small['description'].apply(detect_language)

# # Create language match flags
# df_examples_products_small['query_title_same_lang'] = df_examples_products_small['query_lang'] == df_examples_products_small['title_lang']

# # Analyze language distribution
# print(df_examples_products_small['query_lang'].value_counts())
# print(df_examples_products_small['title_lang'].value_counts())
# print(df_examples_products_small.groupby(['query_lang', 'title_lang']).size())

# # Evaluate separately by language match
# english_matches = df_examples_products_small[(df_examples_products_small['query_lang'] == 'en') & (df_examples_products_small['title_lang'] == 'en')]
# non_english_matches = df_examples_products_small[(df_examples_products_small['query_lang'] != 'en') & (df_examples_products_small['query_lang'] == df_examples_products_small['title_lang'])]
# cross_language = df_examples_products_small[df_examples_products_small['query_lang'] != df_examples_products_small['title_lang']]