### Test: RoBERTa and DeBERTa without further fine-tuning as a baseline
This notebook evaluates the pretrained RoBERTa and DeBERTa NLI cross-encoders available through the sentence_transformers library on our binary classification task: deciding whether a query matches a vehicle description. These models output three scores (entailment, neutral, and contradiction), so we reduce them to a single decision with `prediction = (a * entailment + b * neutral + c * contradiction) > threshold`, which determines whether a query is labeled 'true'. We test different combinations of the weights a, b, c and the threshold on the train-and-val set, and the best combination is used to make the final prediction on the test set, which serves as our baseline.
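
As a minimal sketch of the decision rule above, the snippet below applies the weighted sum to a single probability triple. The function name `weighted_decision` is our own illustration and not part of the notebook's class. The sample values are the first RoBERTa prediction and the best RoBERTa configuration reported in the output further down, read using the column labels printed there.

```python
def weighted_decision(entailment, neutral, contradiction, a, b, c, threshold):
    """Return (score, label) for a*entailment + b*neutral + c*contradiction > threshold."""
    score = a * entailment + b * neutral + c * contradiction
    return score, int(score > threshold)

# Probabilities and weights copied from the RoBERTa output in this notebook
score, label = weighted_decision(
    entailment=0.9772215, neutral=0.01020252, contradiction=0.01257597,
    a=0.6, b=1.0, c=-0.4, threshold=0.2,
)
print(f"{score:.4f}", label)  # 0.5915 1
```

The score agrees with the "Weighted Score: 0.5915" line printed for the first RoBERTa example.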

In [1]:
import json
import yaml
import numpy as np
from sentence_transformers import CrossEncoder
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, classification_report
from sklearn.model_selection import ParameterGrid
import logging
from typing import Dict, List, Tuple, Optional

# Set up logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

class CrossEncoderEvaluator:
    """Evaluator for pretrained CrossEncoder on vehicle matching task"""
    
    def __init__(self, model_name: str):
        """Initialize with pretrained CrossEncoder model"""
        self.model = CrossEncoder(model_name)
        logger.info(f"Loaded CrossEncoder model: {model_name}")
    
    def load_test_data(self, test_vehicles_file: str, test_questions_file: str) -> List[Tuple[str, str, int]]:
        """Load test data and return test pairs"""
        
        # Load test vehicle data
        with open(test_vehicles_file, 'r', encoding='utf-8') as f:
            vehicles_data = yaml.safe_load(f)

        # Load test questions data
        with open(test_questions_file, 'r', encoding='utf-8') as f:
            questions_data = json.load(f)

        # Prepare test pairs
        test_pairs = []
        for vehicle_url, vehicle_info in vehicles_data.items():
            vehicle_text = self._create_vehicle_description(vehicle_info)
            if vehicle_url in questions_data:
                questions = questions_data[vehicle_url]
                vehicle_pairs = [(q, vehicle_text, int(label)) for q, label in questions.items()]
                test_pairs.extend(vehicle_pairs)

        logger.info(f"Test set: {len(vehicles_data)} vehicles → {len(test_pairs)} pairs")
        
        return test_pairs
    
    def _create_vehicle_description(self, vehicle_info: Dict) -> str:
        """Create a comprehensive vehicle description from the data"""
        description_parts = []
        
        # Add information dictionary details
        if 'information_dict' in vehicle_info:
            info_dict = vehicle_info['information_dict']
            for key, value in info_dict.items():
                description_parts.append(f"{key}: {value}")
        
        # Add details list
        if 'details_list' in vehicle_info:
            details = " | ".join(vehicle_info['details_list'])
            description_parts.append(details)
        
        # Add details text if available
        if 'details_text' in vehicle_info:
            description_parts.append(vehicle_info['details_text'])
        
        return " | ".join(description_parts)
    
    def predict_probabilities(self, test_pairs: List[Tuple[str, str, int]]) -> Tuple[np.ndarray, np.ndarray]:
        """Get predictions from the CrossEncoder model"""
        
        # Prepare input pairs for the model
        input_pairs = [(query, vehicle_text) for query, vehicle_text, _ in test_pairs]
        true_labels = np.array([label for _, _, label in test_pairs])
        
        # Get predictions: one softmax probability per NLI class.
        # The column order [contradiction, neutral, entailment] is assumed throughout this class.
        logger.info("Getting predictions from CrossEncoder...")
        predictions = self.model.predict(input_pairs, apply_softmax=True)
        
        logger.info(f"Predictions shape: {predictions.shape}")
        logger.info(f"Prediction sample: {predictions[0]}")
        
        return predictions, true_labels
    
    def optimize_binary_classification(self, predictions: np.ndarray, true_labels: np.ndarray) -> Tuple[Optional[Dict], List[Dict]]:
        """Grid-search the class weights and threshold that maximize F1 for binary classification"""
        
        # Extract individual probabilities
        contradiction_probs = predictions[:, 0]  # Index 0: contradiction
        neutral_probs = predictions[:, 1]       # Index 1: neutral
        entailment_probs = predictions[:, 2]    # Index 2: entailment
        
        best_f1 = 0
        best_config = None
        results = []
        
        # Define parameter grid for optimization
        param_grid = {
            'entailment_weight': [0.0, 0.2, 0.4, 0.6, 0.8, 1.0],
            'contradiction_weight': [0.0, 0.2, 0.4, 0.6, 0.8, 1.0, -0.2, -0.4, -0.6, -0.8, -1.0],
            'neutral_weight': [0.0, 0.2, 0.4, 0.6, 0.8, 1.0, -0.2, -0.4, -0.6, -0.8, -1.0],
            'threshold': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
        }
        
        logger.info("Optimizing binary classification parameters...")
        
        for params in ParameterGrid(param_grid):
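            # Skip combinations whose weights sum to exactly zero (covers the all-zero case;
            # float rounding means some nominally zero-sum combinations still pass)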
            if params['entailment_weight'] + params['contradiction_weight'] + params['neutral_weight'] == 0:
                continue
            
            # Calculate weighted score
            weighted_score = (
                params['entailment_weight'] * entailment_probs +
                params['contradiction_weight'] * contradiction_probs +
                params['neutral_weight'] * neutral_probs
            )
            
            # Apply threshold to get binary predictions
            binary_predictions = (weighted_score > params['threshold']).astype(int)
            
            # Calculate metrics
            f1 = f1_score(true_labels, binary_predictions, zero_division=0)
            accuracy = accuracy_score(true_labels, binary_predictions)
            precision = precision_score(true_labels, binary_predictions, zero_division=0)
            recall = recall_score(true_labels, binary_predictions, zero_division=0)
            
            result = {
                'entailment_weight': params['entailment_weight'],
                'contradiction_weight': params['contradiction_weight'],
                'neutral_weight': params['neutral_weight'],
                'threshold': params['threshold'],
                'f1': f1,
                'accuracy': accuracy,
                'precision': precision,
                'recall': recall
            }
            
            results.append(result)
            
            # Track best F1 score
            if f1 > best_f1:
                best_f1 = f1
                best_config = result.copy()
        
        return best_config, results
    
    def evaluate_with_config(self, predictions: np.ndarray, true_labels: np.ndarray, config: Dict) -> Tuple[Dict, np.ndarray, np.ndarray]:
        """Evaluate using specific configuration"""
        
        # Extract probabilities
        contradiction_probs = predictions[:, 0]
        neutral_probs = predictions[:, 1]
        entailment_probs = predictions[:, 2]
        
        # Calculate weighted score
        weighted_score = (
            config['entailment_weight'] * entailment_probs +
            config['contradiction_weight'] * contradiction_probs +
            config['neutral_weight'] * neutral_probs
        )
        
        # Apply threshold
        binary_predictions = (weighted_score > config['threshold']).astype(int)
        
        # Calculate detailed metrics
        metrics = {
            'accuracy': accuracy_score(true_labels, binary_predictions),
            'f1': f1_score(true_labels, binary_predictions),
            'precision': precision_score(true_labels, binary_predictions, zero_division=0),
            'recall': recall_score(true_labels, binary_predictions, zero_division=0),
            'classification_report': classification_report(true_labels, binary_predictions)
        }
        
        return metrics, binary_predictions, weighted_score
    
    def run_evaluation(self, test_vehicles_file: str, test_questions_file: str) -> Dict:
        """Run the complete pipeline: tune weights and threshold on the given data, then report metrics on that same data"""
        
        # Load test data
        test_pairs = self.load_test_data(test_vehicles_file, test_questions_file)
        
        # Get predictions
        predictions, true_labels = self.predict_probabilities(test_pairs)
        
        # Optimize binary classification
        best_config, all_results = self.optimize_binary_classification(predictions, true_labels)
        
        # Evaluate with best configuration
        final_metrics, binary_predictions, weighted_scores = self.evaluate_with_config(
            predictions, true_labels, best_config
        )
        
        # Create results summary
        results = {
            'best_config': best_config,
            'final_metrics': final_metrics,
            'num_test_pairs': len(test_pairs),
            'predictions': predictions,
            'true_labels': true_labels,
            'binary_predictions': binary_predictions,
            'weighted_scores': weighted_scores,
            'optimization_results': all_results
        }
        
        return results

    def display_results(self, test_vehicles_path, test_queries_path, results, model_name=None):
        """Print a summary of the results returned by run_evaluation."""
        title = model_name if model_name else "CROSSENCODER"
        print("\n" + "="*50)
        print(f"{title.upper()} EVALUATION RESULTS")
        print("="*50)

        print(f"\nDataset Size: {results['num_test_pairs']} test pairs")

        print(f"\nBest Configuration:")
        best_config = results['best_config']
        print(f"  Entailment Weight: {best_config['entailment_weight']:.2f}")
        print(f"  Contradiction Weight: {best_config['contradiction_weight']:.2f}")
        print(f"  Neutral Weight: {best_config['neutral_weight']:.2f}")
        print(f"  Threshold: {best_config['threshold']:.2f}")

        print(f"\nFinal Metrics:")
        final_metrics = results['final_metrics']
        print(f"  Accuracy: {final_metrics['accuracy']:.4f}")
        print(f"  F1 Score: {final_metrics['f1']:.4f}")
        print(f"  Precision: {final_metrics['precision']:.4f}")
        print(f"  Recall: {final_metrics['recall']:.4f}")

        print(f"\nDetailed Classification Report:")
        print(final_metrics['classification_report'])

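        # Show some example predictions.
        # The pairs are reloaded from the paths given here, so they line up with
        # results['predictions'] only if these are the same files passed to run_evaluation.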
        print(f"\nExample Predictions (first 10):")
        test_pairs = self.load_test_data(test_vehicles_path, test_queries_path)
        for i in range(min(10, len(test_pairs))):
            query, vehicle_text, true_label = test_pairs[i]
            pred_probs = results['predictions'][i]
            binary_pred = results['binary_predictions'][i]
            weighted_score = results['weighted_scores'][i]

            print(f"\nExample {i+1}:")
            print(f"  Query: {query[:100]}...")
            print(f"  True Label: {true_label}")
            print(f"  Probabilities [contradiction, neutral, entailment]: {pred_probs}")
            print(f"  Weighted Score: {weighted_score:.4f}")
            print(f"  Binary Prediction: {binary_pred}")
            print(f"  Correct: {'✓' if binary_pred == true_label else '✗'}")
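
The text at the top describes tuning on one split and scoring on another. The following is a minimal, self-contained sketch of that two-step pattern. It uses synthetic probabilities rather than the vehicle dataset, a much smaller grid than the one above, and helper names (`f1_binary`, `apply_config`, `tune`, `make_split`) that are our own illustrations and not part of `CrossEncoderEvaluator`.

```python
import itertools
import numpy as np

def f1_binary(y_true, y_pred):
    """F1 for the positive class; returns 0.0 when there are no true positives."""
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    return 0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn)

def apply_config(probs, weights, threshold):
    """Weighted sum over the three probability columns, then threshold."""
    return (probs @ np.asarray(weights, dtype=float) > threshold).astype(int)

def tune(probs, labels, weight_grid, thresholds):
    """Return the (weights, threshold) pair with the highest F1 on the given split."""
    best, best_f1 = None, -1.0
    for weights in itertools.product(weight_grid, repeat=3):
        for threshold in thresholds:
            f1 = f1_binary(labels, apply_config(probs, weights, threshold))
            if f1 > best_f1:
                best, best_f1 = (weights, threshold), f1
    return best, best_f1

# Synthetic stand-in data (not the vehicle dataset): column 2 is high for positives
rng = np.random.default_rng(0)

def make_split(n):
    labels = rng.integers(0, 2, size=n)
    logits = rng.normal(size=(n, 3))
    logits[:, 2] += 2.0 * labels
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    return probs, labels

train_probs, train_labels = make_split(400)
test_probs, test_labels = make_split(200)

# Step 1: pick the configuration on the training split only
(weights, threshold), train_f1 = tune(
    train_probs, train_labels, weight_grid=[-1.0, 0.0, 1.0], thresholds=[0.1, 0.3, 0.5]
)
# Step 2: score the frozen configuration on the held-out split
test_f1 = f1_binary(test_labels, apply_config(test_probs, weights, threshold))
print(weights, threshold, round(train_f1, 3), round(test_f1, 3))
```

In the notebook's terms, this would correspond to calling `optimize_binary_classification` on the training predictions and `evaluate_with_config` on the test predictions, so that the reported metrics come from data the grid search never saw.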


In [2]:
# File paths
test_vehicles_path = "../../data/test_vehicles_info.yaml"
test_queries_path = "../../data/test_generated_questions.json"
train_vehicles_path = "../../data/train_vehicles_info.yaml"
train_queries_path = "../../data/train_generated_questions.json"

# Initialize evaluators
evaluator_roberta = CrossEncoderEvaluator(model_name='cross-encoder/nli-roberta-base')
evaluator_deberta = CrossEncoderEvaluator(model_name='cross-encoder/nli-deberta-v3-base')

INFO:sentence_transformers.cross_encoder.CrossEncoder:Use pytorch device: cuda:0
INFO:__main__:Loaded CrossEncoder model: cross-encoder/nli-roberta-base
INFO:sentence_transformers.cross_encoder.CrossEncoder:Use pytorch device: cuda:0
INFO:__main__:Loaded CrossEncoder model: cross-encoder/nli-deberta-v3-base
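
The class above hard-codes the output column order as [contradiction, neutral, entailment]. The model cards for the `cross-encoder/nli-*` checkpoints appear to list the order as contradiction, entailment, neutral instead, so it may be worth reading the order from the model configuration rather than assuming it. The sketch below shows one way to do that. The helper `label_columns` is hypothetical, the mapping is typed in by hand for illustration, and the attribute path mentioned in the comment is an assumption that should be checked against the installed sentence_transformers version.

```python
import numpy as np

def label_columns(id2label):
    """Map lowercase NLI label names to their column index in the model output."""
    return {name.lower(): int(idx) for idx, name in id2label.items()}

# Hand-written mapping for illustration. With a loaded model it would come from
# something like `evaluator_roberta.model.model.config.id2label` (path assumed).
example_id2label = {0: "contradiction", 1: "entailment", 2: "neutral"}
cols = label_columns(example_id2label)
print(cols)

# Select columns by name instead of by a fixed index
predictions = np.array([[0.1, 0.7, 0.2]])  # made-up probabilities
entailment_probs = predictions[:, cols["entailment"]]
neutral_probs = predictions[:, cols["neutral"]]
contradiction_probs = predictions[:, cols["contradiction"]]
```

Selecting columns by name keeps the weight labels in the results summary tied to the classes the model actually emits.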


In [3]:
# Tune and evaluate RoBERTa on the training data, then display the results
results_roberta = evaluator_roberta.run_evaluation(train_vehicles_path, train_queries_path)
evaluator_roberta.display_results(test_vehicles_path, test_queries_path, results_roberta, model_name='RoBERTa')

INFO:__main__:Test set: 471 vehicles → 4710 pairs
INFO:__main__:Getting predictions from CrossEncoder...
Batches: 100%|██████████| 148/148 [00:14<00:00,  9.99it/s]
INFO:__main__:Predictions shape: (4710, 3)
INFO:__main__:Prediction sample: [0.01257597 0.01020252 0.9772215 ]
INFO:__main__:Optimizing binary classification parameters...
INFO:__main__:Test set: 82 vehicles → 820 pairs



ROBERTA EVALUATION RESULTS

Dataset Size: 4710 test pairs

Best Configuration:
  Entailment Weight: 0.60
  Contradiction Weight: -0.40
  Neutral Weight: 1.00
  Threshold: 0.20

Final Metrics:
  Accuracy: 0.6501
  F1 Score: 0.7171
  Precision: 0.6017
  Recall: 0.8874

Detailed Classification Report:
              precision    recall  f1-score   support

           0       0.79      0.41      0.54      2356
           1       0.60      0.89      0.72      2354

    accuracy                           0.65      4710
   macro avg       0.69      0.65      0.63      4710
weighted avg       0.69      0.65      0.63      4710


Example Predictions (first 10):

Example 1:
  Query: Looking for a 5-door all-terrain electric vehicle with automatic transmission, black metallic color,...
  True Label: 1
  Probabilities [contradiction, neutral, entailment]: [0.01257597 0.01020252 0.9772215 ]
  Weighted Score: 0.5915
  Binary Prediction: 1
  Correct: ✓

Example 2:
  Query: Searching for an electric v
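
As a quick sanity check on the metrics printed above, F1 is the harmonic mean of precision and recall, so it can be recomputed from the two reported values. This is a small sketch with a helper name of our own choosing; the inputs are the rounded RoBERTa precision and recall, so agreement is only expected up to rounding.

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Rounded values copied from the RoBERTa output above
f1 = f1_from_precision_recall(0.6017, 0.8874)
print(f"{f1:.4f}")  # 0.7171
```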

In [5]:
# Tune and evaluate DeBERTa on the training data, then display the results
results_deberta = evaluator_deberta.run_evaluation(train_vehicles_path, train_queries_path)
evaluator_deberta.display_results(test_vehicles_path, test_queries_path, results_deberta, model_name='DeBERTa')

INFO:__main__:Test set: 471 vehicles → 4710 pairs
INFO:__main__:Getting predictions from CrossEncoder...
Batches: 100%|██████████| 148/148 [00:36<00:00,  4.02it/s]
INFO:__main__:Predictions shape: (4710, 3)
INFO:__main__:Prediction sample: [3.5967646e-04 3.6983230e-04 9.9927050e-01]
INFO:__main__:Optimizing binary classification parameters...
INFO:__main__:Test set: 82 vehicles → 820 pairs



DEBERTA EVALUATION RESULTS

Dataset Size: 4710 test pairs

Best Configuration:
  Entailment Weight: 0.60
  Contradiction Weight: -1.00
  Neutral Weight: 1.00
  Threshold: 0.50

Final Metrics:
  Accuracy: 0.7968
  F1 Score: 0.8146
  Precision: 0.7487
  Recall: 0.8934

Detailed Classification Report:
              precision    recall  f1-score   support

           0       0.87      0.70      0.78      2356
           1       0.75      0.89      0.81      2354

    accuracy                           0.80      4710
   macro avg       0.81      0.80      0.79      4710
weighted avg       0.81      0.80      0.79      4710


Example Predictions (first 10):

Example 1:
  Query: Looking for a 5-door all-terrain electric vehicle with automatic transmission, black metallic color,...
  True Label: 1
  Probabilities [contradiction, neutral, entailment]: [3.5967646e-04 3.6983230e-04 9.9927050e-01]
  Weighted Score: 0.5996
  Binary Prediction: 1
  Correct: ✓

Example 2:
  Query: Searching for an e