# Task 1: Sentence Transformer Implementation

Implement a sentence transformer model using any deep learning framework of your choice.
This model should be able to encode input sentences into fixed-length embeddings. Test your
implementation with a few sample sentences and showcase the obtained embeddings.
Describe any choices you had to make regarding the model architecture outside of the
transformer backbone.

# 1. Import Libraries

We import PyTorch and the Hugging Face `transformers` package, as well as NumPy for optional post-processing.


In [1]:
import torch
from transformers import AutoTokenizer, AutoModel
import numpy as np




# 2. Define the Sentence Transformer Model

We wrap the pretrained transformer in a simple `nn.Module` and add two pooling strategies:
- **CLS pooling**: Use the [CLS] token embedding.
- **Mean pooling**: Average token embeddings (excluding padding).


In [9]:
import torch
from transformers import AutoModel

class SentenceTransformerModel(torch.nn.Module):
    """
    A simple Sentence Transformer that wraps a pretrained transformer and 
    applies a pooling strategy to convert token-level embeddings into fixed-length 
    sentence embeddings.

    Supported pooling strategies:
      - 'cls': use the [CLS] token embedding.
      - 'mean': compute the mean of token embeddings, ignoring padding.
    """
    
    def __init__(self, model_name='bert-base-uncased', pooling='cls'):
        """
        Initializes the model.

        Args:
            model_name (str): Name of the pretrained model to load from HuggingFace.
            pooling (str): Pooling strategy to use. Either 'cls' or 'mean'.
        """
        super().__init__()
        # Load the pretrained transformer (e.g., BERT)
        self.encoder = AutoModel.from_pretrained(model_name)
        # Store the pooling method (used later in forward pass)
        self.pooling = pooling

    def forward(self, input_ids, attention_mask):
        """
        Forward pass of the model. Encodes inputs and reduces token embeddings
        to a single sentence-level embedding.

        Args:
            input_ids (Tensor): Input token IDs, shape (batch_size, seq_len)
            attention_mask (Tensor): Attention mask, shape (batch_size, seq_len)

        Returns:
            Tensor: Sentence embeddings, shape (batch_size, hidden_size)
        """
        # Step 1: Pass input through the transformer encoder
        outputs = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        last_hidden = outputs.last_hidden_state  # shape: (B, L, H)

        if self.pooling == 'cls':
            # Step 2A: Use the embedding of the [CLS] token (first position)
            return last_hidden[:, 0]  # shape: (B, H)

        elif self.pooling == 'mean':
            # Step 2B: Compute the mean of all token embeddings
            # Exclude padding tokens using attention_mask
            # Expand mask to match hidden size: (B, L) → (B, L, H)
            mask = attention_mask.unsqueeze(-1).expand(last_hidden.size()).float()
            # Apply mask and sum across sequence length
            summed = torch.sum(last_hidden * mask, dim=1)  # shape: (B, H)
            # Count non-padding tokens per sentence; clamp guards against division by zero
            counts = torch.clamp(mask.sum(dim=1), min=1e-9)  # shape: (B, H)
            # Divide sum by count to get mean
            return summed / counts  # shape: (B, H)

        else:
            raise ValueError("Unsupported pooling type. Choose 'cls' or 'mean'.")
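
To make the difference between the two pooling strategies concrete, here is a minimal sketch that applies the same masking arithmetic as `forward` to a hand-built tensor. The toy values, shapes, and the variable names `mean_pooled` and `cls_pooled` are assumptions chosen for illustration; no pretrained model is involved.

```python
import torch

# Toy batch: 2 sentences, 3 token positions, hidden size 2 (illustrative values).
# The second sentence has one padding position (mask = 0) holding junk values.
last_hidden = torch.tensor([
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    [[1.0, 1.0], [3.0, 3.0], [100.0, 100.0]],
])
attention_mask = torch.tensor([[1, 1, 1], [1, 1, 0]])

# Mean pooling: zero out padded positions, then divide by the real token count.
mask = attention_mask.unsqueeze(-1).expand(last_hidden.size()).float()
summed = torch.sum(last_hidden * mask, dim=1)
counts = torch.clamp(mask.sum(dim=1), min=1e-9)
mean_pooled = summed / counts

# CLS pooling: take the vector at the first position of each sentence.
cls_pooled = last_hidden[:, 0]

print(mean_pooled)  # rows [3., 4.] and [2., 2.]; the padded 100s are ignored
print(cls_pooled)   # rows [1., 2.] and [1., 1.]
```

Without the mask, the second row of the mean would be pulled toward the padding values, which is why the attention mask is threaded through to the pooling step.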


# 3. Initialize Tokenizer and Model

We choose `bert-base-uncased` and default to `cls` pooling. 


In [10]:
model_name = 'bert-base-uncased'
pooling_strategy = 'cls'  # or 'mean'

# Initialize tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = SentenceTransformerModel(model_name=model_name, pooling=pooling_strategy)




# 4. Prepare Sample Sentences

We’ll encode these three example sentences and inspect their embeddings.


In [11]:
sentences = [
    "The cat sat on the mat.",
    "Artificial intelligence is transforming the world.",
    "I love machine learning!"
]

# 5. Tokenize Sentences

The tokenizer handles padding and truncation so that all inputs share the same shape.


In [12]:
encoded = tokenizer(
    sentences,
    padding=True,        # Pad to longest in batch
    truncation=True,     # Truncate to model’s max length
    return_tensors='pt'  # Return PyTorch tensors
)
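
As a rough illustration of what `padding=True` does conceptually, the sketch below pads lists of token IDs to the longest sequence in the batch and builds the matching attention mask. The helper `pad_batch` and the short ID lists are hypothetical and written in plain Python; the real tokenizer also handles subword splitting, special tokens, and truncation.

```python
def pad_batch(token_id_lists, pad_id=0):
    """Pad each ID list to the batch maximum and build attention masks.

    Hypothetical helper for illustration only; not part of `transformers`.
    """
    max_len = max(len(ids) for ids in token_id_lists)
    input_ids, attention_mask = [], []
    for ids in token_id_lists:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)             # real tokens, then padding
        attention_mask.append([1] * len(ids) + [0] * n_pad)  # 1 = real token, 0 = padding
    return input_ids, attention_mask


# Made-up token IDs for two sentences of different lengths.
batch = [[101, 7, 8, 102], [101, 9, 102]]
ids, mask = pad_batch(batch)
print(ids)   # [[101, 7, 8, 102], [101, 9, 102, 0]]
print(mask)  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```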


# 6. Generate Embeddings

We run a forward pass under `torch.no_grad()` to disable gradient computation.


In [13]:
model.eval()  # Set model to evaluation mode

with torch.no_grad():
    embeddings = model(
        input_ids=encoded['input_ids'],
        attention_mask=encoded['attention_mask']
    )
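
The effect of `model.eval()` and `torch.no_grad()` can be checked on a small stand-in network. The sketch below uses a hypothetical `tiny` module (a linear layer followed by dropout) instead of BERT so that it runs without downloading weights.

```python
import torch

# Hypothetical stand-in for the encoder: a linear layer followed by dropout.
tiny = torch.nn.Sequential(torch.nn.Linear(4, 2), torch.nn.Dropout(p=0.5))
x = torch.randn(3, 4)

# Outside no_grad, outputs are attached to the autograd graph.
with_grad = tiny(x)

# eval() turns dropout off, so repeated passes give identical outputs;
# no_grad() skips building the autograd graph.
tiny.eval()
with torch.no_grad():
    first = tiny(x)
    second = tiny(x)

print(with_grad.requires_grad)     # True
print(first.requires_grad)         # False
print(torch.equal(first, second))  # True
```

The two calls are complementary: `eval()` changes layer behavior (dropout, batch norm), while `no_grad()` only affects gradient bookkeeping and memory use.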


# 7. Inspect the Embeddings

We print the raw tensor and its shape.


In [14]:
print("Fixed-length sentence embeddings (one per sentence):")
print(embeddings)
print("\nEmbeddings shape:", embeddings.shape)


Fixed-length sentence embeddings (one per sentence):
tensor([[-0.3642, -0.0531, -0.3673,  ..., -0.3797,  0.5818,  0.4386],
        [-0.0397,  0.1847,  0.0180,  ..., -0.5428,  0.3945,  0.3015],
        [ 0.2184,  0.2637,  0.0406,  ..., -0.2125,  0.1841,  0.3618]])

Embeddings shape: torch.Size([3, 768])


# 8. Convert to NumPy & Show Sample Values

For downstream tasks or visualization, you may want to convert the tensor to a NumPy array.


In [15]:
embeddings_np = embeddings.cpu().numpy()

for idx, sentence in enumerate(sentences):
    print(f"\nSentence: \"{sentence}\"")
    # Display first 8 dimensions as a quick sanity check
    print(f"Embedding (first 8 values): {embeddings_np[idx][:8]} ...")



Sentence: "The cat sat on the mat."
Embedding (first 8 values): [-0.36422354 -0.05305347 -0.36732256 -0.02967383 -0.46078447 -0.1010612
  0.01669887  0.5957765 ] ...

Sentence: "Artificial intelligence is transforming the world."
Embedding (first 8 values): [-0.03967224  0.18470013  0.01796044 -0.06997652 -0.3907642  -0.6572221
  0.8504888   1.0212841 ] ...

Sentence: "I love machine learning!"
Embedding (first 8 values): [ 0.2184243   0.26371536  0.04059215 -0.11266454 -0.32512203 -0.5171517
  0.30736122  0.7386393 ] ...
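
One common downstream use of such arrays is comparing sentences by cosine similarity. The following is a minimal sketch with a hypothetical helper, `cosine_similarity_matrix`, applied to toy 2-D vectors; passing `embeddings_np` instead would give a 3×3 matrix for the sentences above, though no values for that case are claimed here.

```python
import numpy as np


def cosine_similarity_matrix(vectors):
    """Return pairwise cosine similarities between the rows of `vectors`.

    Hypothetical helper for illustration; assumes a 2-D array of shape (n, d).
    """
    vectors = np.asarray(vectors, dtype=np.float64)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.clip(norms, 1e-12, None)  # guard against zero-length rows
    return unit @ unit.T


# Toy vectors: the first two point in the same direction, the third is orthogonal.
toy = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]])
sims = cosine_similarity_matrix(toy)
print(sims)
# [[1. 1. 0.]
#  [1. 1. 0.]
#  [0. 0. 1.]]
```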


# Design Choices Explanation

1. **Transformer Backbone**  
   We chose `bert-base-uncased`, a widely used 12-layer encoder with 768-dimensional hidden states that loads directly through `AutoModel.from_pretrained`.

2. **Pooling Strategy**  
   - **CLS pooling** (default) takes the final hidden state at the first position, the [CLS] token.  
   - **Mean pooling** averages the final hidden states of the non-padding tokens, so every real token contributes to the sentence embedding.

3. **Tokenization**  
   The Hugging Face tokenizer ensures correct token IDs, padding, and truncation in line with the pretrained model’s expectations.

4. **Architectural Simplicity**  
   No additional projection or normalization layers are added, preserving raw model outputs for pure embedding extraction. Adaptation layers could be added later if needed.

5. **Batching & Padding**  
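   Passing `padding=True` to the tokenizer pads every sentence to the longest one in the batch and returns a matching attention mask, so multiple sentences can be encoded in one forward pass without manual padding logic.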
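
Item 4 notes that adaptation layers could be added later. As one possible shape for such a layer, the sketch below defines a hypothetical `ProjectionHead` that maps 768-dimensional sentence embeddings to a smaller space and L2-normalizes the result. The class name, the output size of 256, and the random input are all assumptions for illustration; the head is untrained and is not part of the model above.

```python
import torch
import torch.nn.functional as F


class ProjectionHead(torch.nn.Module):
    """Hypothetical adaptation layer: linear projection plus L2 normalization."""

    def __init__(self, hidden_size=768, out_dim=256):  # out_dim is an arbitrary choice
        super().__init__()
        self.proj = torch.nn.Linear(hidden_size, out_dim)

    def forward(self, sentence_embeddings):
        projected = self.proj(sentence_embeddings)
        # Unit-length rows make dot products equal to cosine similarities.
        return F.normalize(projected, p=2, dim=-1)


head = ProjectionHead()
fake_embeddings = torch.randn(3, 768)  # stand-in for the model's (3, 768) output
projected = head(fake_embeddings)
print(projected.shape)        # torch.Size([3, 256])
print(projected.norm(dim=-1)) # each value is approximately 1.0
```

A head like this would normally be trained on a task-specific objective; with random weights it only demonstrates the shapes and the normalization step.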
