# GATE: General Arabic Text Embedding for Enhanced Semantic Textual Similarity

## Paper Information
- **Title**: GATE: General Arabic Text Embedding for Enhanced Semantic Textual Similarity with Matryoshka Representation Learning and Hybrid Loss Training
- **Authors**: Omer Nacar, Anis Koubaa, Serry Sibaee, Yasser Al-Habashi, Adel Ammar, Wadii Boulila
- **ArXiv ID**: 2505.24581v1
- **Publication Date**: May 30, 2025
- **Paper Link**: https://arxiv.org/abs/2505.24581v1
- **Domain**: Natural Language Processing, Arabic Text Processing, Semantic Textual Similarity

## Abstract Summary
> This paper introduces General Arabic Text Embedding (GATE) models that achieve state-of-the-art performance on the Semantic Textual Similarity task within the MTEB benchmark. GATE leverages Matryoshka Representation Learning and a hybrid loss training approach with Arabic triplet datasets for Natural Language Inference, which are essential for enhancing model performance in tasks that demand fine-grained semantic understanding. GATE outperforms larger models, including OpenAI, with a 20-25% performance improvement on STS benchmarks, effectively capturing the unique semantic nuances of Arabic.

## Key Contributions
1. **Hybrid Loss Strategy**: Combines cosine similarity for semantic tasks and softmax-based classification
2. **Enhanced Model Robustness**: Incorporates curated Arabic NLI triplet and labeled pair datasets
3. **Scalable Arabic Embeddings**: Adapts Matryoshka Representation Learning to Arabic (768, 512, 256, 128, 64 dimensions)
4. **State-of-the-Art Performance**: 20-25% improvement over larger models on Arabic STS benchmarks

## Environment Setup and Dependencies

### Installing Required Libraries
We'll use the LangChain ecosystem for text processing and evaluation, along with specialized libraries for Arabic NLP and Matryoshka embeddings.
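Before running the install cell, it can help to see which of these packages are already present. The following is a minimal sketch that uses only the standard library; the helper name `find_missing_packages` and the `REQUIRED` list of import names are illustrative assumptions, not part of the paper or of any of these libraries.

```python
import importlib.util

def find_missing_packages(import_names):
    """Return the top-level import names that cannot be resolved here."""
    missing = []
    for name in import_names:
        # find_spec returns None when a top-level module is not installed
        if importlib.util.find_spec(name) is None:
            missing.append(name)
    return missing

# Import names (which can differ from pip names); this mapping is an assumption
REQUIRED = ["torch", "transformers", "sentence_transformers", "langchain",
            "deepeval", "mteb", "datasets", "faiss", "pandas", "numpy"]

print("Missing packages:", find_missing_packages(REQUIRED))
```

Anything reported as missing still needs the matching `pip install` line from the cell below.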

In [None]:
# Install required dependencies
!pip install torch transformers sentence-transformers
!pip install langchain langchain-community langchain-openai
!pip install deepeval mteb datasets
!pip install chromadb faiss-cpu
!pip install pandas numpy matplotlib seaborn plotly
!pip install arabic-reshaper python-bidi  # Arabic text processing
!pip install ctranslate2  # For neural machine translation

In [None]:
# Import essential libraries
import torch
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from typing import List, Dict, Tuple, Optional, Union
import warnings
warnings.filterwarnings('ignore')

# Transformers and Sentence Transformers
from transformers import AutoTokenizer, AutoModel
from sentence_transformers import SentenceTransformer, InputExample, losses
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator
from sentence_transformers.trainer import SentenceTransformerTrainer

# LangChain imports
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma, FAISS
from langchain.schema import Document
from langchain_community.embeddings import HuggingFaceEmbeddings

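# DeepEval for evaluation
# Note: SemanticSimilarityMetric may not exist in the installed DeepEval version,
# so its import is guarded to keep the rest of this cell usable
from deepeval import evaluate
from deepeval.test_case import LLMTestCase
try:
    from deepeval.metrics import SemanticSimilarityMetric
except ImportError:
    SemanticSimilarityMetric = None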

# Data processing
from datasets import Dataset, load_dataset
import json
from sklearn.metrics.pairwise import cosine_similarity
from scipy.stats import pearsonr, spearmanr

print("✅ All dependencies imported successfully!")
print(f"🔥 PyTorch version: {torch.__version__}")
import transformers
print(f"🤗 Transformers version: {transformers.__version__}")
print(f"🚀 CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"📊 GPU: {torch.cuda.get_device_name(0)}")

## Data Preparation

### Loading Arabic NLI Datasets
According to the paper, GATE uses Arabic-adapted subsets of the Stanford NLI (SNLI) and MultiNLI datasets:
- **STS Subset**: 8.63K training, 1.68K test samples
- **Triplet Subset**: 571K training, 6.58K test samples  
- **Pair Classification**: 981K training, 19.7K test samples

In [None]:
# Create synthetic Arabic NLI dataset for demonstration
# In real implementation, you would use translated SNLI/MultiNLI datasets

def create_arabic_sample_data():
    """Create sample Arabic text pairs for demonstration"""
    
    # Sample Arabic sentence pairs with similarity scores
    arabic_sts_samples = [
        {
            "sentence1": "رجل يعزف على الجيتار",  # A man playing the guitar
            "sentence2": "شخص يعزف على آلة موسيقية",  # A person playing a musical instrument
            "score": 0.8
        },
        {
            "sentence1": "الرجال يلعبون كرة القدم",  # Men are playing football
            "sentence2": "الأولاد يلعبون كرة القدم",  # Boys are playing football
            "score": 0.72
        },
        {
            "sentence1": "رجل يقوم بخدعة بالورق",  # A man doing a card trick
            "sentence2": "رجل يؤدي خدعة بالورق",  # A man performing a card trick
            "score": 1.0
        },
        {
            "sentence1": "رجل يعزف على الجيتار",  # A man playing the guitar
            "sentence2": "رجل يقود سيارة",  # A man driving a car
            "score": 0.1
        }
    ]
    
    # Sample Arabic NLI triplets (anchor, positive, negative)
    arabic_triplet_samples = [
        {
            "anchor": "الطالب يدرس في المكتبة",  # The student studies in the library
            "positive": "شخص يقرأ كتاباً في مكان هادئ",  # A person reads a book in a quiet place
            "negative": "الرجل يلعب كرة السلة",  # The man plays basketball
        },
        {
            "anchor": "المرأة تطبخ الطعام",  # The woman cooks food
            "positive": "شخص يحضر وجبة في المطبخ",  # Someone prepares a meal in the kitchen
            "negative": "الطفل يلعب في الحديقة",  # The child plays in the garden
        }
    ]
    
    # Sample classification pairs (premise, hypothesis, label)
    arabic_classification_samples = [
        {
            "premise": "الرجل يجلس على الكرسي",  # The man sits on the chair
            "hypothesis": "الشخص يجلس",  # The person sits
            "label": "entailment"
        },
        {
            "premise": "القطة تنام على السرير",  # The cat sleeps on the bed
            "hypothesis": "الكلب ينام على السرير",  # The dog sleeps on the bed
            "label": "contradiction"
        },
        {
            "premise": "الطلاب يذهبون إلى المدرسة",  # Students go to school
            "hypothesis": "الناس يتحركون",  # People are moving
            "label": "neutral"
        }
    ]
    
    return {
        "sts_data": arabic_sts_samples,
        "triplet_data": arabic_triplet_samples,
        "classification_data": arabic_classification_samples
    }

# Load sample data
sample_data = create_arabic_sample_data()

print("📊 Sample Arabic Dataset Created:")
print(f"   - STS samples: {len(sample_data['sts_data'])}")
print(f"   - Triplet samples: {len(sample_data['triplet_data'])}")
print(f"   - Classification samples: {len(sample_data['classification_data'])}")

# Display sample data
print("\n🔍 Sample STS Data:")
for i, sample in enumerate(sample_data['sts_data'][:2]):
    print(f"   {i+1}. Sentence 1: {sample['sentence1']}")
    print(f"      Sentence 2: {sample['sentence2']}")
    print(f"      Similarity Score: {sample['score']}\n")

## Matryoshka Representation Learning Implementation

### Mathematical Foundation
According to the paper, MRL optimizes multi-class classification loss for each dimension subset:

$$L_{MRL} = \sum_{m \in M} c_m L_{CE}(W^{(m)} z_{1:m}, y)$$

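Where:
- $M$ is the set of nested embedding sizes (here 768, 512, 256, 128, 64) and $y$ is the class label
- $z_{1:m} \in \mathbb{R}^m$ is the embedding truncated to its first $m$ components
- $W^{(m)} \in \mathbb{R}^{L \times m}$ are the classifier weights for dimension $m$, with $L$ the number of classes
- $c_m$ is the relative importance weight of dimension $m$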

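To make the formula concrete, here is a small worked sketch in plain Python, with no model involved. The toy embedding, the classifier matrices, and the helper names `cross_entropy` and `mrl_loss` are all made up for illustration; the sketch only shows how the per-dimension cross-entropy terms are computed on prefixes of one vector and then summed with weights $c_m$.

```python
import math

def cross_entropy(logits, label):
    """Cross-entropy of one example from raw logits (stable log-sum-exp)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[label]

def mrl_loss(z, label, classifiers, weights):
    """Weighted sum of cross-entropy losses over nested prefixes of z.

    classifiers maps m -> W^(m), given as L rows of m weights each.
    weights maps m -> c_m.
    """
    total = 0.0
    for m, W in classifiers.items():
        prefix = z[:m]  # z_{1:m}
        logits = [sum(w * x for w, x in zip(row, prefix)) for row in W]
        total += weights[m] * cross_entropy(logits, label)
    return total

# Toy setup (hypothetical values): 4-dim embedding, M = {2, 4}, L = 3 classes
z = [0.5, -1.0, 0.25, 2.0]
classifiers = {
    2: [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]],
    4: [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]],
}
weights = {2: 1.0, 4: 1.0}
print(f"L_MRL = {mrl_loss(z, 0, classifiers, weights):.4f}")
```

Because every prefix shares the same leading components, lowering the loss for small $m$ pushes the most useful information toward the front of the vector.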
In [None]:
import torch.nn as nn

class MatryoshkaEmbeddingModel(nn.Module):
    """Matryoshka Representation Learning Model for Arabic Text Embeddings"""
    
    def __init__(self, base_model_name: str, dimensions: List[int] = [768, 512, 256, 128, 64]):
        super().__init__()
        self.dimensions = sorted(dimensions, reverse=True)  # Largest to smallest
        self.max_dim = max(dimensions)
        
        # Load base model (AraBERT or similar)
        self.base_model = AutoModel.from_pretrained(base_model_name)
        self.tokenizer = AutoTokenizer.from_pretrained(base_model_name)
        
        # Matryoshka classifiers for each dimension
        self.classifiers = nn.ModuleDict({
            str(dim): nn.Linear(dim, 3)  # 3 classes: entailment, neutral, contradiction
            for dim in self.dimensions
        })
        
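        # Dimension importance weights c_m, kept fixed as a buffer: if they were
        # trainable, the optimizer could lower the loss simply by shrinking them
        self.register_buffer("dim_weights", torch.ones(len(self.dimensions)))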
        
    def mean_pooling(self, model_output, attention_mask):
        """Apply mean pooling to get sentence embeddings"""
        token_embeddings = model_output[0]  # First element contains token embeddings
        input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
        return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
    
    def forward(self, input_ids, attention_mask, return_all_dims=False):
        """Forward pass through the model"""
        # Get base model output
        outputs = self.base_model(input_ids=input_ids, attention_mask=attention_mask)
        
        # Apply mean pooling
        embeddings = self.mean_pooling(outputs, attention_mask)
        
        if return_all_dims:
            # Return embeddings for all dimensions
            dim_embeddings = {}
            for dim in self.dimensions:
                dim_embeddings[dim] = embeddings[:, :dim]
            return dim_embeddings
        else:
            # Return full dimensional embedding
            return embeddings
    
    def compute_matryoshka_loss(self, embeddings, labels):
        """Compute Matryoshka loss across all dimensions"""
        total_loss = 0.0
        
        for i, dim in enumerate(self.dimensions):
            # Get embeddings for this dimension
            dim_embeddings = embeddings[:, :dim]
            
            # Get classifier predictions
            logits = self.classifiers[str(dim)](dim_embeddings)
            
            # Compute cross-entropy loss
            loss = nn.CrossEntropyLoss()(logits, labels)
            
            # Weight by dimension importance
            total_loss += self.dim_weights[i] * loss
        
        return total_loss / len(self.dimensions)

# Initialize model with AraBERT base
model_name = "aubmindlab/bert-base-arabertv02"  # Popular Arabic BERT model
matryoshka_model = MatryoshkaEmbeddingModel(model_name)

print("🎯 Matryoshka Model Initialized!")
print(f"   - Base Model: {model_name}")
print(f"   - Dimensions: {matryoshka_model.dimensions}")
print(f"   - Parameters: {sum(p.numel() for p in matryoshka_model.parameters())/1e6:.1f}M")

## Hybrid Loss Training Implementation

### Mathematical Foundation
The paper proposes two specialized loss functions:

1. **Classification Loss** (SoftmaxLoss):
$$L_{cls} = -\frac{1}{n} \sum_{i=1}^{n} \log \frac{e^{s(x_i,y^+)/\tau}}{e^{s(x_i,y^+)/\tau} + \sum_{j=1}^{k} e^{s(x_i,y_j^-)/\tau}}$$

2. **STS Loss** (CoSENTLoss):
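$$L_{sts} = \log \left(1 + \sum_{s(x_i,x_j) > s(x_m,x_n)} \exp \frac{\cos(x_m,x_n) - \cos(x_i,x_j)}{\tau}\right)$$

In both formulas, $s(\cdot,\cdot)$ and $\cos(\cdot,\cdot)$ score the similarity of two embeddings and $\tau$ is a temperature; $y^+$ and $y_j^-$ are the positive example and the $k$ negative examples for anchor $x_i$.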
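The two formulas can be checked by hand on a few numbers. The sketch below is a plain-Python reading of them that takes precomputed similarity values as input; the function names and the default `tau=0.05` (the value used by the trainer in this notebook) are assumptions for illustration, not the paper's implementation.

```python
import math

def contrastive_cls_loss(pos_sims, neg_sims, tau=0.05):
    """L_cls: mean over anchors of -log softmax of the positive similarity.

    pos_sims[i] is s(x_i, y+); neg_sims[i] is the list of s(x_i, y_j^-).
    """
    total = 0.0
    for pos, negs in zip(pos_sims, neg_sims):
        scaled = [pos / tau] + [n / tau for n in negs]
        m = max(scaled)
        log_z = m + math.log(sum(math.exp(v - m) for v in scaled))
        total += log_z - pos / tau
    return total / len(pos_sims)

def cosent_loss(pred_cos, gold, tau=0.05):
    """L_sts: log(1 + sum of exp((cos_low - cos_high) / tau)) over ordered pairs."""
    terms = []
    for i in range(len(gold)):
        for j in range(len(gold)):
            if gold[i] > gold[j]:
                # Pair i has the higher gold score, so its cosine should be higher
                terms.append((pred_cos[j] - pred_cos[i]) / tau)
    if not terms:
        return 0.0
    # log(1 + sum exp(t)) as a log-sum-exp that includes the constant term exp(0)
    m = max(0.0, max(terms))
    return m + math.log(math.exp(-m) + sum(math.exp(t - m) for t in terms))

# Correctly ordered predictions give a much smaller loss than reversed ones
print(cosent_loss([0.9, 0.1], [1.0, 0.0]), cosent_loss([0.1, 0.9], [1.0, 0.0]))
```

Note that `cosent_loss` depends only on the ordering of the gold scores, so it works the same whether they are on a 0-1 or a 0-5 scale.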

In [None]:
class HybridLossTrainer:
    """Hybrid Loss Training for GATE Model"""
    
    def __init__(self, model, temperature=0.05):
        self.model = model
        self.temperature = temperature
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        self.model.to(self.device)
        
    def classification_loss(self, premise_embeddings, hypothesis_embeddings, labels):
        """Compute a simplified classification loss for the NLI task"""
        # Cosine similarity per premise/hypothesis pair, in [-1, 1]
        similarities = torch.cosine_similarity(premise_embeddings, hypothesis_embeddings, dim=1)
        
        # Temperature-scaled copy used for the logit values
        scaled = similarities / self.temperature
        
        # Convert similarities to logits for 3-class classification
        # This is a simplified stand-in: in practice, you'd use proper classifier heads
        batch_size = similarities.size(0)
        logits = torch.zeros(batch_size, 3, device=similarities.device)
        
        # Bucket on the unscaled similarity so the thresholds stay within [-1, 1]
        # High similarity -> entailment, medium -> neutral, low -> contradiction
        for i, sim in enumerate(similarities):
            if sim > 0.7:
                logits[i, 0] = scaled[i]  # entailment
            elif sim > 0.3:
                logits[i, 1] = scaled[i]  # neutral
            else:
                logits[i, 2] = (1 - sim) / self.temperature  # contradiction
        
        # Compute cross-entropy loss (labels must be on the same device as the logits)
        return nn.CrossEntropyLoss()(logits, labels.to(logits.device))
    
    def sts_loss(self, sentence1_embeddings, sentence2_embeddings, similarity_scores):
        """Compute STS loss using CoSENT approach"""
        # Compute cosine similarities
        predicted_similarities = torch.cosine_similarity(sentence1_embeddings, sentence2_embeddings, dim=1)
        
        # CoSENT only uses the ordering of the gold scores, so no rescaling is needed
        # (the sample scores in this notebook are already in [0, 1])
        target_similarities = similarity_scores
        
        # Compute pairwise ranking loss (simplified CoSENT)
        batch_size = predicted_similarities.size(0)
        loss = 0.0
        count = 0
        
        for i in range(batch_size):
            for j in range(batch_size):
                if i != j and target_similarities[i] > target_similarities[j]:
                    # If target i > target j, then predicted i should > predicted j
                    diff = predicted_similarities[j] - predicted_similarities[i]
                    loss += torch.log(1 + torch.exp(diff / self.temperature))
                    count += 1
        
        return loss / max(count, 1)
    
    def compute_hybrid_loss(self, batch, task_type):
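        """Compute hybrid loss based on task type"""
        # Move the batch tensors to the model's device before the forward pass
        batch = {key: value.to(self.device) for key, value in batch.items()}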
        if task_type == 'classification':
            # NLI classification task
            premise_embeddings = self.model(batch['premise_input_ids'], batch['premise_attention_mask'])
            hypothesis_embeddings = self.model(batch['hypothesis_input_ids'], batch['hypothesis_attention_mask'])
            return self.classification_loss(premise_embeddings, hypothesis_embeddings, batch['labels'])
        
        elif task_type == 'sts':
            # Semantic textual similarity task
            sentence1_embeddings = self.model(batch['sentence1_input_ids'], batch['sentence1_attention_mask'])
            sentence2_embeddings = self.model(batch['sentence2_input_ids'], batch['sentence2_attention_mask'])
            return self.sts_loss(sentence1_embeddings, sentence2_embeddings, batch['similarity_scores'])
        
        else:
            raise ValueError(f"Unknown task type: {task_type}")

# Initialize hybrid loss trainer
hybrid_trainer = HybridLossTrainer(matryoshka_model)
print("🔥 Hybrid Loss Trainer initialized!")
print(f"   - Device: {hybrid_trainer.device}")
print(f"   - Temperature: {hybrid_trainer.temperature}")

## Training Pipeline Implementation

### Multi-Task Training Loop
Based on the paper's methodology, we implement a multi-dataset training strategy that alternates between classification and STS tasks.
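One simple way to alternate between the two tasks is a round-robin over their batches. The sketch below is an assumption about scheduling, not the paper's exact procedure; `batches` and `alternate_tasks` are hypothetical helper names, and the task labels `"sts"` and `"classification"` are chosen to match the `task_type` argument of `compute_hybrid_loss` above.

```python
from itertools import zip_longest

def batches(examples, batch_size):
    """Split a list into consecutive batches."""
    return [examples[i:i + batch_size] for i in range(0, len(examples), batch_size)]

def alternate_tasks(sts_examples, cls_examples, batch_size):
    """Yield (task_type, batch) pairs, alternating STS and classification.

    When one task runs out of batches, the remaining batches of the other
    task are yielded back to back.
    """
    sts_batches = batches(sts_examples, batch_size)
    cls_batches = batches(cls_examples, batch_size)
    for sts_batch, cls_batch in zip_longest(sts_batches, cls_batches):
        if sts_batch is not None:
            yield "sts", sts_batch
        if cls_batch is not None:
            yield "classification", cls_batch

# Placeholder examples: integers stand in for STS pairs, letters for NLI pairs
schedule = list(alternate_tasks(list(range(5)), list("abc"), batch_size=2))
for task_type, batch in schedule:
    print(task_type, batch)
```

With datasets of very different sizes, as in the paper's subsets, a sampling scheme weighted by dataset size may be a better fit than strict alternation.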

In [None]:
def prepare_training_data(sample_data):
    """Prepare training data for both tasks"""
    
    # Prepare STS data
    sts_examples = []
    for sample in sample_data['sts_data']:
        sts_examples.append({
            'sentence1': sample['sentence1'],
            'sentence2': sample['sentence2'],
            'score': sample['score']
        })
    
    # Prepare classification data
    classification_examples = []
    label_map = {'entailment': 0, 'neutral': 1, 'contradiction': 2}
    
    for sample in sample_data['classification_data']:
        classification_examples.append({
            'premise': sample['premise'],
            'hypothesis': sample['hypothesis'],
            'label': label_map[sample['label']]
        })
    
    return sts_examples, classification_examples

def tokenize_batch(sentences, tokenizer, max_length=512):
    """Tokenize a batch of sentences"""
    return tokenizer(
        sentences,
        padding=True,
        truncation=True,
        max_length=max_length,
        return_tensors='pt'
    )

def training_step_demo():
    """Demonstrate a training step with sample data"""
    
    # Prepare sample data
    sts_examples, classification_examples = prepare_training_data(sample_data)
    
    print("🚂 Starting Training Demonstration...")
    
    # Set model to training mode
    matryoshka_model.train()
    optimizer = torch.optim.AdamW(matryoshka_model.parameters(), lr=2e-5)
    
    # Training step for STS task
    print("\n📊 STS Training Step:")
    if sts_examples:
        # Prepare STS batch
        sentences1 = [ex['sentence1'] for ex in sts_examples]
        sentences2 = [ex['sentence2'] for ex in sts_examples]
        scores = torch.tensor([ex['score'] for ex in sts_examples], dtype=torch.float)
        
        # Tokenize
        tokens1 = tokenize_batch(sentences1, matryoshka_model.tokenizer)
        tokens2 = tokenize_batch(sentences2, matryoshka_model.tokenizer)
        
        # Prepare batch for hybrid trainer
        sts_batch = {
            'sentence1_input_ids': tokens1['input_ids'],
            'sentence1_attention_mask': tokens1['attention_mask'],
            'sentence2_input_ids': tokens2['input_ids'],
            'sentence2_attention_mask': tokens2['attention_mask'],
            'similarity_scores': scores
        }
        
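        # Compute loss, then backpropagate and update the weights
        sts_loss = hybrid_trainer.compute_hybrid_loss(sts_batch, 'sts')
        optimizer.zero_grad()
        sts_loss.backward()
        optimizer.step()
        print(f"   STS Loss: {sts_loss.item():.4f}")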
    
    # Training step for classification task
    print("\n🔍 Classification Training Step:")
    if classification_examples:
        # Prepare classification batch
        premises = [ex['premise'] for ex in classification_examples]
        hypotheses = [ex['hypothesis'] for ex in classification_examples]
        labels = torch.tensor([ex['label'] for ex in classification_examples], dtype=torch.long)
        
        # Tokenize
        premise_tokens = tokenize_batch(premises, matryoshka_model.tokenizer)
        hypothesis_tokens = tokenize_batch(hypotheses, matryoshka_model.tokenizer)
        
        # Prepare batch for hybrid trainer
        cls_batch = {
            'premise_input_ids': premise_tokens['input_ids'],
            'premise_attention_mask': premise_tokens['attention_mask'],
            'hypothesis_input_ids': hypothesis_tokens['input_ids'],
            'hypothesis_attention_mask': hypothesis_tokens['attention_mask'],
            'labels': labels
        }
        
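        # Compute loss, then backpropagate and update the weights
        cls_loss = hybrid_trainer.compute_hybrid_loss(cls_batch, 'classification')
        optimizer.zero_grad()
        cls_loss.backward()
        optimizer.step()
        print(f"   Classification Loss: {cls_loss.item():.4f}")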
    
    print("\n✅ Training step demonstration completed!")
    print("💡 In full training, you would:")
    print("   1. Alternate between STS and classification batches")
    print("   2. Apply Matryoshka loss across multiple dimensions")
    print("   3. Use proper data loaders with larger datasets")
    print("   4. Implement learning rate scheduling")
    print("   5. Add validation and checkpointing")

# Run training demonstration
training_step_demo()

## Evaluation with MTEB and DeepEval

### MTEB Benchmarks
The paper evaluates on Arabic STS benchmarks: STS17, STS22, and STS22-v2.
We'll implement evaluation metrics following the paper's methodology.
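STS evaluation reduces to two steps: score each sentence pair with cosine similarity, then correlate those scores with the gold labels. The plain-Python sketch below spells out both correlations so the SciPy calls in the next cell are less of a black box; the gold scores are the four sample labels from this notebook, while the predicted cosines are made-up numbers for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of spreads."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def ranks(values):
    """Average ranks (1-based), so tied values share the same rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def spearman(xs, ys):
    """Spearman correlation is the Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))

gold = [0.8, 0.72, 1.0, 0.1]   # sample labels from this notebook
predicted = [0.6, 0.7, 0.9, 0.2]  # hypothetical cosine scores
print(f"Pearson={pearson(gold, predicted):.4f}, Spearman={spearman(gold, predicted):.4f}")
```

Spearman looks only at rank order, so it is unaffected by any monotonic rescaling of the cosine scores.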

In [None]:
from scipy.stats import pearsonr, spearmanr
import numpy as np

class GATEEvaluator:
    """Evaluation toolkit for GATE models"""
    
    def __init__(self, model, tokenizer):
        self.model = model
        self.tokenizer = tokenizer
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    
    def encode_sentences(self, sentences, dimension=768):
        """Encode sentences to embeddings of specified dimension"""
        self.model.eval()
        embeddings = []
        
        with torch.no_grad():
            for sentence in sentences:
                # Tokenize
                tokens = self.tokenizer(
                    sentence, 
                    padding=True, 
                    truncation=True, 
                    max_length=512, 
                    return_tensors='pt'
                ).to(self.device)
                
                # Get embeddings
                embedding = self.model(tokens['input_ids'], tokens['attention_mask'])
                
                # Truncate to specified dimension
                if dimension < embedding.size(1):
                    embedding = embedding[:, :dimension]
                
                embeddings.append(embedding.cpu().numpy())
        
        return np.vstack(embeddings)
    
    def evaluate_sts(self, test_data, dimensions=[768, 512, 256, 128, 64]):
        """Evaluate semantic textual similarity across multiple dimensions"""
        results = {}
        
        # Extract sentences and scores
        sentences1 = [item['sentence1'] for item in test_data]
        sentences2 = [item['sentence2'] for item in test_data]
        true_scores = [item['score'] for item in test_data]
        
        for dim in dimensions:
            print(f"\n🔍 Evaluating dimension {dim}...")
            
            # Encode sentences
            embeddings1 = self.encode_sentences(sentences1, dim)
            embeddings2 = self.encode_sentences(sentences2, dim)
            
            # Compute similarity scores
            predicted_scores = []
            for emb1, emb2 in zip(embeddings1, embeddings2):
                similarity = cosine_similarity([emb1], [emb2])[0, 0]
                # Scale from [-1, 1] to [0, 1] for comparison
                similarity = (similarity + 1) / 2
                predicted_scores.append(similarity)
            
            # Compute correlation metrics
            pearson_corr, _ = pearsonr(true_scores, predicted_scores)
            spearman_corr, _ = spearmanr(true_scores, predicted_scores)
            
            results[dim] = {
                'pearson': pearson_corr,
                'spearman': spearman_corr,
                'predicted_scores': predicted_scores
            }
            
            print(f"   Pearson: {pearson_corr:.4f}")
            print(f"   Spearman: {spearman_corr:.4f}")
        
        return results
    
    def deepeval_assessment(self, test_data):
        """Evaluate using DeepEval framework"""
        print("\n🎯 DeepEval Assessment:")
        
        # Create test cases for DeepEval
        test_cases = []
        for item in test_data[:2]:  # Limit for demo
            test_case = LLMTestCase(
                input=item['sentence1'],
                actual_output=item['sentence2'],
                expected_output=f"Similarity: {item['score']}"
            )
            test_cases.append(test_case)
        
        # Define the metric and evaluate; both steps are guarded because the
        # metric class may be missing from the installed DeepEval version
        try:
            semantic_similarity_metric = SemanticSimilarityMetric(threshold=0.7)
            results = evaluate(test_cases, [semantic_similarity_metric])
            print(f"   DeepEval Score: {results}")
            return results
        except Exception as e:
            print(f"   DeepEval assessment skipped: {e}")
            return None

# Initialize evaluator
evaluator = GATEEvaluator(matryoshka_model, matryoshka_model.tokenizer)

# Run evaluation on sample data
print("🚀 Starting GATE Model Evaluation...")
evaluation_results = evaluator.evaluate_sts(sample_data['sts_data'])

# DeepEval assessment
deepeval_results = evaluator.deepeval_assessment(sample_data['sts_data'])

## Results Analysis and Visualization

### Performance Comparison Across Dimensions
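The plots below show correlation-based similarity metrics across embedding dimensions, in the spirit of Figure 1 from the paper. The numbers come from the four-pair sample set and a base model that has not been fully trained, so they will not match the paper's results.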

In [None]:
def visualize_results(evaluation_results):
    """Visualize evaluation results across dimensions"""
    
    dimensions = list(evaluation_results.keys())
    pearson_scores = [evaluation_results[dim]['pearson'] for dim in dimensions]
    spearman_scores = [evaluation_results[dim]['spearman'] for dim in dimensions]
    
    # Create subplots
    fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(15, 10))
    
    # Plot 1: Pearson Correlation across dimensions
    ax1.plot(dimensions, pearson_scores, marker='o', linewidth=2, markersize=8)
    ax1.set_title('Pearson Correlation vs Embedding Dimensions', fontsize=14, fontweight='bold')
    ax1.set_xlabel('Embedding Dimensions')
    ax1.set_ylabel('Pearson Correlation')
    ax1.grid(True, alpha=0.3)
    ax1.set_ylim(0, 1)
    
    # Plot 2: Spearman Correlation across dimensions
    ax2.plot(dimensions, spearman_scores, marker='s', linewidth=2, markersize=8, color='orange')
    ax2.set_title('Spearman Correlation vs Embedding Dimensions', fontsize=14, fontweight='bold')
    ax2.set_xlabel('Embedding Dimensions')
    ax2.set_ylabel('Spearman Correlation')
    ax2.grid(True, alpha=0.3)
    ax2.set_ylim(0, 1)
    
    # Plot 3: Comparison of both metrics
    x = np.arange(len(dimensions))
    width = 0.35
    
    ax3.bar(x - width/2, pearson_scores, width, label='Pearson', alpha=0.8)
    ax3.bar(x + width/2, spearman_scores, width, label='Spearman', alpha=0.8)
    ax3.set_title('Correlation Metrics Comparison', fontsize=14, fontweight='bold')
    ax3.set_xlabel('Embedding Dimensions')
    ax3.set_ylabel('Correlation Score')
    ax3.set_xticks(x)
    ax3.set_xticklabels(dimensions)
    ax3.legend()
    ax3.grid(True, alpha=0.3)
    
    # Plot 4: Predicted vs True scores for largest dimension
    largest_dim = max(dimensions)
    true_scores = [item['score'] for item in sample_data['sts_data']]
    pred_scores = evaluation_results[largest_dim]['predicted_scores']
    
    ax4.scatter(true_scores, pred_scores, alpha=0.7, s=100)
    ax4.plot([0, 1], [0, 1], 'r--', alpha=0.8)  # Perfect correlation line
    ax4.set_title(f'Predicted vs True Similarity (Dim {largest_dim})', fontsize=14, fontweight='bold')
    ax4.set_xlabel('True Similarity Score')
    ax4.set_ylabel('Predicted Similarity Score')
    ax4.grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    # Print summary statistics
    print("\n📊 GATE Model Performance Summary:")
    print("="*50)
    for dim in dimensions:
        pearson = evaluation_results[dim]['pearson']
        spearman = evaluation_results[dim]['spearman']
        print(f"Dimension {dim:3d}: Pearson={pearson:.4f}, Spearman={spearman:.4f}")
    
    # Performance degradation analysis
    print("\n📈 Performance Degradation Analysis:")
    base_pearson = evaluation_results[max(dimensions)]['pearson']
    base_spearman = evaluation_results[max(dimensions)]['spearman']
    
    for dim in sorted(dimensions, reverse=True)[1:]:
        pearson_drop = (base_pearson - evaluation_results[dim]['pearson']) / base_pearson * 100
        spearman_drop = (base_spearman - evaluation_results[dim]['spearman']) / base_spearman * 100
        print(f"Dimension {dim}: Pearson drop {pearson_drop:.1f}%, Spearman drop {spearman_drop:.1f}%")

# Visualize results
if evaluation_results:
    visualize_results(evaluation_results)
else:
    print("⚠️ No evaluation results to visualize")

## LangChain Integration for RAG Applications

### Using GATE Embeddings in LangChain Pipeline
This section demonstrates how to integrate the GATE model into a LangChain RAG system for Arabic text retrieval.
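When a Matryoshka embedding is cut down to its first few components, the shortened vector is generally no longer unit-length, so it is common to renormalize before comparing vectors. The sketch below shows that step together with a brute-force top-k search in plain Python; the vectors are made up, and `truncate_and_normalize` and `top_k` are hypothetical helpers rather than LangChain or FAISS APIs.

```python
import math

def truncate_and_normalize(vec, dim):
    """Keep the first `dim` components and rescale them to unit length."""
    prefix = vec[:dim]
    norm = math.sqrt(sum(x * x for x in prefix))
    if norm == 0.0:
        return prefix
    return [x / norm for x in prefix]

def top_k(query, docs, k, dim):
    """Return (index, cosine) for the k documents closest to the query."""
    q = truncate_and_normalize(query, dim)
    scored = []
    for idx, doc in enumerate(docs):
        d = truncate_and_normalize(doc, dim)
        # Both vectors are unit-length, so the dot product is the cosine
        scored.append((idx, sum(a * b for a, b in zip(q, d))))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Toy 4-dimensional "embeddings" (hypothetical values), searched at dim=2
docs = [[1.0, 0.0, 0.3, 0.1], [0.0, 1.0, 0.2, 0.9], [0.7, 0.7, 0.0, 0.0]]
query = [0.9, 0.1, 0.5, 0.5]
print(top_k(query, docs, k=2, dim=2))
```

The wrapper in the next cell truncates by slicing only; adding a normalization step like the one above is worth considering if the index is meant to rank by cosine similarity.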

In [None]:
from langchain_core.embeddings import Embeddings

class GATELangChainEmbeddings(Embeddings):
    """LangChain-compatible wrapper for GATE embeddings"""
    
    def __init__(self, gate_model, tokenizer, dimension=768):
        self.gate_model = gate_model
        self.tokenizer = tokenizer
        self.dimension = dimension
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    
    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        """Embed a list of documents"""
        embeddings = []
        
        self.gate_model.eval()
        with torch.no_grad():
            for text in texts:
                tokens = self.tokenizer(
                    text,
                    padding=True,
                    truncation=True,
                    max_length=512,
                    return_tensors='pt'
                ).to(self.device)
                
                embedding = self.gate_model(tokens['input_ids'], tokens['attention_mask'])
                
                # Truncate to specified dimension
                if self.dimension < embedding.size(1):
                    embedding = embedding[:, :self.dimension]
                
                embeddings.append(embedding.cpu().numpy().flatten().tolist())
        
        return embeddings
    
    def embed_query(self, text: str) -> List[float]:
        """Embed a single query"""
        return self.embed_documents([text])[0]

def create_arabic_rag_demo():
    """Demonstrate Arabic RAG system with GATE embeddings"""
    
    # Sample Arabic documents
    arabic_documents = [
        "الذكاء الاصطناعي هو محاكاة عمليات الذكاء البشري بواسطة الآلات والأنظمة الحاسوبية.",
        "التعلم الآلي هو فرع من فروع الذكاء الاصطناعي يركز على تطوير خوارزميات يمكنها التعلم من البيانات.",
        "الشبكات العصبية الاصطناعية مستوحاة من طريقة عمل الدماغ البشري في معالجة المعلومات.",
        "معالجة اللغة الطبيعية تتيح للحاسوب فهم وتحليل وتوليد اللغة البشرية بطريقة طبيعية.",
        "الرؤية الحاسوبية تمكن الآلات من تفسير وفهم المحتوى البصري مثل الصور والفيديو."
    ]
    
    print("🚀 Creating Arabic RAG System with GATE Embeddings...")
    
    # Initialize GATE embeddings for LangChain
    gate_embeddings = GATELangChainEmbeddings(
        matryoshka_model, 
        matryoshka_model.tokenizer, 
        dimension=256  # Use smaller dimension for efficiency
    )
    
    # Create documents
    documents = [Document(page_content=text, metadata={"source": f"doc_{i}"}) 
                for i, text in enumerate(arabic_documents)]
    
    # Create vector store
    try:
        vectorstore = FAISS.from_documents(documents, gate_embeddings)
        print("✅ Vector store created successfully!")
        
        # Test similarity search
        query = "ما هو التعلم الآلي؟"  # What is machine learning?
        print(f"\n🔍 Query: {query}")
        
        # Perform similarity search
        similar_docs = vectorstore.similarity_search(query, k=3)
        
        print("\n📋 Most Similar Documents:")
        for i, doc in enumerate(similar_docs, 1):
            print(f"   {i}. {doc.page_content}")
            print(f"      Source: {doc.metadata['source']}\n")
        
        # Test with scores; with the default FAISS index these are L2 distances,
        # so a lower value means a closer match
        similar_docs_with_scores = vectorstore.similarity_search_with_score(query, k=3)
        
        print("🎯 Distance Scores (lower is more similar):")
        for doc, score in similar_docs_with_scores:
            print(f"   Distance: {score:.4f} - {doc.page_content[:50]}...")
        
        return vectorstore
        
    except Exception as e:
        print(f"❌ Error creating vector store: {e}")
        print("💡 This might be due to dimension mismatch or missing dependencies")
        return None

# Create Arabic RAG demonstration
arabic_vectorstore = create_arabic_rag_demo()

## Personal Research Template

### Extending GATE for Your Research
This section provides a template for extending the GATE framework for your own research projects.

In [None]:
# Personal Research Template
print("""🔬 GATE Research Extension Template
=====================================

This template guides you through extending GATE for your research:

1. 📊 CUSTOM DATASET INTEGRATION
   - Replace sample_data with your Arabic text corpus
   - Implement domain-specific preprocessing
   - Create evaluation benchmarks for your domain

2. 🏗️ ARCHITECTURE MODIFICATIONS
   - Experiment with different base models (MARBERT, ARBERT, etc.)
   - Modify Matryoshka dimensions for your use case
   - Add domain-specific loss functions

3. 🎯 TASK-SPECIFIC ADAPTATIONS
   - Information Retrieval: Optimize for document ranking
   - Question Answering: Add QA-specific training objectives
   - Summarization: Include summarization quality metrics
   - Classification: Add multi-label classification support

4. 🔧 TRAINING OPTIMIZATIONS
   - Implement curriculum learning strategies
   - Add adversarial training for robustness
   - Experiment with different optimizers and schedules
   - Add gradient clipping and regularization

5. 📈 EVALUATION ENHANCEMENTS
   - Add domain-specific evaluation metrics
   - Implement cross-lingual evaluation
   - Create visualization dashboards
   - Add statistical significance testing

6. 🚀 DEPLOYMENT CONSIDERATIONS
   - Model quantization for production
   - API endpoint development
   - Batch processing optimization
   - Memory usage monitoring

7. 📝 RESEARCH EXTENSIONS
   - Multi-modal Arabic embeddings (text + images)
   - Cross-lingual transfer learning
   - Few-shot learning capabilities
   - Interpretability analysis
""")

class ResearchExtensionTemplate:
    """Template class for extending GATE research"""
    
    def __init__(self):
        self.research_config = {
            'domain': 'your_domain_here',  # e.g., 'medical', 'legal', 'news'
            'base_model': 'aubmindlab/bert-base-arabertv02',
            'dimensions': [768, 512, 256, 128, 64],
            'batch_size': 64,
            'learning_rate': 2e-5,
            'epochs': 5,
            'max_length': 512
        }
    
    def load_custom_dataset(self, dataset_path):
        """Load your custom Arabic dataset"""
        # TODO: Implement your dataset loading logic
        print(f"📁 Loading dataset from: {dataset_path}")
        pass
    
    def custom_preprocessing(self, texts):
        """Apply domain-specific preprocessing"""
        # TODO: Add your preprocessing steps
        # - Diacritization
        # - Normalization
        # - Domain-specific cleaning
        print("🔧 Applying custom preprocessing...")
        return texts
    
    def define_custom_loss(self):
        """Define domain-specific loss functions"""
        # TODO: Implement custom loss functions
        print("🎯 Defining custom loss function...")
        pass
    
    def custom_evaluation_metrics(self):
        """Define domain-specific evaluation metrics"""
        # TODO: Implement custom metrics
        print("📊 Defining custom evaluation metrics...")
        pass
    
    def hyperparameter_tuning(self):
        """Perform hyperparameter optimization"""
        # TODO: Implement hyperparameter search
        print("🔍 Starting hyperparameter tuning...")
        pass
    
    def model_interpretation(self):
        """Analyze model behavior and interpretability"""
        # TODO: Add interpretation techniques
        # - Attention visualization
        # - Embedding analysis
        # - Error analysis
        print("🔬 Analyzing model interpretability...")
        pass

# Initialize research template
research_template = ResearchExtensionTemplate()
print("\n🎓 Research extension template initialized!")
print("   Customize the methods above for your specific research needs.")
print("   Check the focused learning notebooks for deep dives into complex concepts.")
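
As a starting point for the `custom_preprocessing` TODO, here is a minimal sketch of light Arabic normalization using only the standard library. It is not part of the GATE paper: the function names are hypothetical, and the chosen steps (stripping diacritics, removing tatweel, unifying alef variants) are assumptions about a typical pipeline. Whether each step helps depends on the domain and on how the base model was pretrained.

```python
import re

# Arabic diacritics (tashkeel): fathatan through sukun, plus superscript alef.
_DIACRITICS = re.compile("[\u064B-\u0652\u0670]")
_TATWEEL = "\u0640"  # elongation character
_ALEF_VARIANTS = str.maketrans({
    "\u0622": "\u0627",  # alef with madda above -> bare alef
    "\u0623": "\u0627",  # alef with hamza above -> bare alef
    "\u0625": "\u0627",  # alef with hamza below -> bare alef
})

def normalize_arabic(text):
    """Strip diacritics and tatweel, unify alef forms, collapse whitespace."""
    text = _DIACRITICS.sub("", text)
    text = text.replace(_TATWEEL, "")
    text = text.translate(_ALEF_VARIANTS)
    return re.sub(r"\s+", " ", text).strip()

def preprocess_batch(texts):
    """Apply the same normalization to every text in a list."""
    return [normalize_arabic(t) for t in texts]

samples = ["الذَّكَاءُ   الاصْطِنَاعِيّ", "أحمـــد"]
for original, cleaned in zip(samples, preprocess_batch(samples)):
    print(f"{original!r} -> {cleaned!r}")
```

A function like `preprocess_batch` could be called from `custom_preprocessing` in place of the bare `return texts`, provided the same normalization is applied at training and at query time.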

## Next Steps and Resources

### Recommended Learning Path

1. **📚 Study the Focused Learning Notebooks**:
   - `MRL_Deep_Learning.ipynb` - Deep dive into Matryoshka Representation Learning
   - `Hybrid_Loss_Architecture.ipynb` - Master multi-task loss functions
   - `Arabic_NLP_Challenges.ipynb` - Handle Arabic language complexities
   - `Contrastive_Triplet_Learning.ipynb` - Advanced contrastive learning

2. **🔬 Paper References**:
   - Original GATE Paper: [arXiv:2505.24581v1](https://arxiv.org/abs/2505.24581v1)
   - Matryoshka Representation Learning: [Kusupati et al., 2022](https://arxiv.org/abs/2205.13147)
   - Sentence-BERT: [Reimers & Gurevych, 2019](https://arxiv.org/abs/1908.10084)

3. **🛠️ Implementation Resources**:
   - GATE Models: [Hugging Face Hub](https://huggingface.co/collections/gate-models)
   - Arabic NLP Datasets: [MTEB Benchmark](https://github.com/embeddings-benchmark/mteb)
   - LangChain Documentation: [LangChain Docs](https://python.langchain.com/)

4. **💡 Research Opportunities**:
   - Multi-modal Arabic embeddings
   - Cross-lingual transfer learning
   - Domain adaptation techniques
   - Efficiency optimization methods

### Key Takeaways

✅ **GATE achieves state-of-the-art Arabic STS performance on the MTEB benchmark**  
✅ **Matryoshka learning enables flexible embedding dimensions (768 down to 64)**  
✅ **Hybrid loss training combines a cosine-similarity objective with softmax-based classification**  
✅ **Integration with LangChain enables RAG applications**  
✅ **DeepEval provides a comprehensive evaluation framework**  
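
The flexible-dimension takeaway can be illustrated with a short sketch: a Matryoshka-trained embedding is shortened by keeping its leading components and renormalizing. The toy 8-dimensional vector below stands in for a 768-dimensional GATE embedding, and `truncate_and_normalize` is a hypothetical helper, not an API from the paper. Truncation of this kind is only meaningful for models trained with a Matryoshka objective.

```python
import math

def truncate_and_normalize(embedding, dim):
    """Keep the first `dim` components and rescale them to unit length."""
    if not 0 < dim <= len(embedding):
        raise ValueError(f"dim must be between 1 and {len(embedding)}")
    head = embedding[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("truncated embedding has zero norm")
    return [x / norm for x in head]

# Toy vector; real GATE embeddings would use 768, 512, 256, 128, or 64 dims.
full = [0.5, -0.5, 0.5, 0.5, 0.1, -0.2, 0.05, 0.0]

for dim in (8, 4, 2):
    vec = truncate_and_normalize(full, dim)
    print(dim, [round(x, 3) for x in vec])
```

Renormalizing after truncation keeps cosine similarity and dot product equivalent at every dimension, which matters when the vectors feed a distance-based index.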

---

*This notebook implements the GATE framework as described in "GATE: General Arabic Text Embedding for Enhanced Semantic Textual Similarity with Matryoshka Representation Learning and Hybrid Loss Training" (arXiv:2505.24581v1)*

**Authors**: Omer Nacar, Anis Koubaa, Serry Sibaee, Yasser Al-Habashi, Adel Ammar, Wadii Boulila  
**Implementation**: Educational Research Notebook  
**Framework**: LangChain + DeepEval + PyTorch