<a href="https://colab.research.google.com/github/ravisankarg/notebooks/blob/main/mininLM_classifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [23]:
!pip install -q transformers torch

from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

# 1. Download all-MiniLM encoder model
model_name = "sentence-transformers/all-MiniLM-L12-v2" # Using a common sentence transformer model with an encoder
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

Some weights of BertForMaskedLM were not initialized from the model checkpoint at sentence-transformers/all-MiniLM-L12-v2 and are newly initialized: ['cls.predictions.bias', 'cls.predictions.decoder.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [22]:
# 2. Test masked language model capability
sentence = "swimming in water."
mask_token = tokenizer.mask_token
masked_sentence = sentence.replace("swimming", mask_token) # Mask the word "swimming"

print(f"Original sentence: {sentence}")
print(f"Masked sentence: {masked_sentence}")

# Tokenize the masked sentence
inputs = tokenizer(masked_sentence, return_tensors="pt")
input_ids = inputs["input_ids"]
mask_token_index = torch.where(input_ids == tokenizer.mask_token_id)[1]

# Get model predictions
with torch.no_grad():
    outputs = model(**inputs)
    predictions = outputs.logits

# Get the top-k predicted tokens for the masked position
top_k = 10
top_k_predicted_token_ids = torch.topk(predictions[0, mask_token_index], top_k).indices.tolist()
top_k_predicted_tokens = [tokenizer.decode(token_id) for token_id in top_k_predicted_token_ids]


print(f"Top {top_k} predicted tokens: {top_k_predicted_tokens}")

Original sentence: swimming in water.
Masked sentence: [MASK] in water.
Top 10 predicted tokens: ['drowning swimming living walking breathing floating lost drowned covered bathing']
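
The output above is a list holding one string rather than ten separate tokens. `predictions[0, mask_token_index]` keeps a leading dimension for the mask position, so `.indices.tolist()` returns a nested list, and decoding the inner list joins all ids into one string. The following is a minimal sketch of decoding each id on its own. It uses a toy logits tensor and a hypothetical `toy_vocab` mapping as stand-ins for the real model output and tokenizer.

```python
import torch

# Toy stand-ins (assumptions): one masked position and a 5-token vocabulary.
toy_vocab = {0: "swimming", 1: "drowning", 2: "walking", 3: "floating", 4: "lost"}
logits_at_mask = torch.tensor([[0.2, 2.5, 0.1, 1.7, 0.9]])  # shape [1, vocab_size]

top_k = 3
# topk keeps the leading dimension, so tolist() yields one inner list per mask position.
nested_ids = torch.topk(logits_at_mask, top_k).indices.tolist()
flat_ids = nested_ids[0]  # select the single mask position

# Decode one id at a time so each candidate becomes its own list element.
top_tokens = [toy_vocab[token_id] for token_id in flat_ids]
print(top_tokens)
```

With the real tokenizer, the same idea would be to iterate over `top_k_predicted_token_ids[0]` and call `tokenizer.decode([token_id])` or `tokenizer.convert_ids_to_tokens` on each id.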


# Task
Create an end-to-end example using the "sentence-transformers/all-MiniLM-L12-v2" model. The example should demonstrate how to fine-tune this model for a sentence meaningfulness classification task. This involves adding a classification layer on top of the pre-trained model, generating synthetic training data (around 1000 samples) with multiple sentences per input and sequence IDs, training the modified model on this data, and finally testing the trained model with new examples.

## Load the pre-trained model and tokenizer

### Subtask:
Load the `sentence-transformers/all-MiniLM-L12-v2` model and its corresponding tokenizer.


**Reasoning**:
The first step is to import the necessary libraries and load the tokenizer and model as specified in the instructions.



In [24]:
from transformers import AutoTokenizer, AutoModel

# Specify the pre-trained model name
model_name = "sentence-transformers/all-MiniLM-L12-v2"

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load the pre-trained model
model = AutoModel.from_pretrained(model_name)

## Define the custom model

### Subtask:
Create a new model by adding a classification layer on top of the pre-trained MiniLM model. This layer will predict whether a sentence is meaningful or not.


**Reasoning**:
Define a custom PyTorch model with a classification layer on top of the pre-trained MiniLM model.



In [25]:
import torch.nn as nn

class MeaningfulnessClassifier(nn.Module):
    def __init__(self, pretrained_model):
        super(MeaningfulnessClassifier, self).__init__()
        self.pretrained_model = pretrained_model
        # Add a linear layer for binary classification (meaningful/not meaningful)
        self.classifier = nn.Linear(pretrained_model.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        # Pass input through the pre-trained model
        outputs = self.pretrained_model(input_ids=input_ids, attention_mask=attention_mask)
        # Get the pooled output (usually the representation of the [CLS] token)
        # Sentence-BERT models typically use mean pooling over all tokens, but for simplicity here we'll use the first token's output
        pooled_output = outputs.last_hidden_state[:, 0]
        # Pass the pooled output through the classification layer
        logits = self.classifier(pooled_output)
        return logits

# Instantiate the custom model
meaningfulness_model = MeaningfulnessClassifier(model)
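
The comment in `forward` notes that Sentence-BERT models typically use mean pooling, while the classifier takes the first token's output for simplicity. The following is a minimal sketch of attention-masked mean pooling, checked on toy tensors instead of real model output. `mean_pool` is a hypothetical helper and is not part of the notebook.

```python
import torch

def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token embeddings per sequence, ignoring padded positions."""
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)  # [batch, seq, 1]
    summed = (last_hidden_state * mask).sum(dim=1)                   # [batch, hidden]
    counts = mask.sum(dim=1).clamp(min=1e-9)                         # guard against division by zero
    return summed / counts

# Toy check: batch of 1, three positions (the last is padding), hidden size 2.
hidden = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = torch.tensor([[1, 1, 0]])
pooled = mean_pool(hidden, mask)
print(pooled)  # the padded position does not affect the average
```

Inside `forward`, this would replace the `[:, 0]` slice with `mean_pool(outputs.last_hidden_state, attention_mask)`. Whether it helps on this task is untested here.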

## Prepare synthetic data

### Subtask:
Generate a synthetic dataset of around 1000 samples. Each sample should consist of multiple sentences, with at least one meaningful sentence and some non-meaningful ones. Label each sentence as meaningful or not.


**Reasoning**:
Generate synthetic data and store it in a pandas DataFrame as instructed.



In [26]:
import pandas as pd
import random

# 1. Create lists of meaningful and non-meaningful sentences
meaningful_sentences = [
    "The sun rises in the east.",
    "Birds fly in the sky.",
    "Water is essential for life.",
    "Computers process information.",
    "Humans breathe oxygen.",
    "Fish swim in the ocean.",
    "The Earth revolves around the sun.",
    "Plants perform photosynthesis.",
    "Dogs are mammals.",
    "Fire produces heat.",
    "Rain falls from clouds.",
    "The moon orbits the Earth.",
    "Books contain stories.",
    "Cars have wheels.",
    "Music can evoke emotions.",
    "Time moves forward.",
    "Food provides energy.",
    "Languages are used for communication.",
    "Science seeks to understand the world.",
    "Art can be a form of expression."
]

non_meaningful_sentences = [
    "The square circle sang a loud silence.",
    "Purple ideas sleep furiously.",
    "A colorless green idea sleeps furiously.",
    "The talking rock told a joke to the wind.",
    "Invisible elephants danced on the ceiling.",
    "The concept of blue tastes like numbers.",
    "Butterflies whisper secrets to the trees.",
    "My shoes are made of abstract thoughts.",
    "The empty box was full of nothing.",
    "Yesterday's future is today's past.",
    "A dream within a dream is a reality.",
    "The sound of silence is deafening.",
    "He felt a cold warmth.",
    "The light was a dark brightness.",
    "She spoke in a silent shout.",
    "The beginning of the end is near.",
    "He ran at a walking pace.",
    "The small giant stood tall.",
    "It was a truthful lie.",
    "The friendly enemy smiled."
]

# 2. Generate approximately 1000 samples
data = []
num_samples = 1000

for _ in range(num_samples):
    sample_sentences = []
    sample_labels = []

    # Always include one meaningful sentence
    meaningful_sentence = random.choice(meaningful_sentences)
    sample_sentences.append(meaningful_sentence)
    sample_labels.append(1) # 1 for meaningful

    # Randomly select 1 to 3 non-meaningful sentences
    num_non_meaningful = random.randint(1, 3)
    selected_non_meaningful = random.sample(non_meaningful_sentences, num_non_meaningful)

    for non_meaningful_sentence in selected_non_meaningful:
        sample_sentences.append(non_meaningful_sentence)
        sample_labels.append(0) # 0 for non-meaningful

    # Shuffle the sentences within the sample to mix meaningful and non-meaningful ones
    combined = list(zip(sample_sentences, sample_labels))
    random.shuffle(combined)
    shuffled_sentences, shuffled_labels = zip(*combined)

    data.append({
        'sentences': shuffled_sentences,
        'labels': shuffled_labels
    })

# 3. Store in a pandas DataFrame
synthetic_df = pd.DataFrame(data)

display(synthetic_df.head())

| | sentences | labels |
|---|---|---|
| 0 | (The small giant stood tall., Time moves forwa... | (0, 1) |
| 1 | (Art can be a form of expression., A dream wit... | (1, 0) |
| 2 | (The empty box was full of nothing., Rain fall... | (0, 1, 0) |
| 3 | (Butterflies whisper secrets to the trees., Do... | (0, 1) |
| 4 | (The talking rock told a joke to the wind., My... | (0, 0, 1, 0) |


## Tokenize and prepare data for training

### Subtask:
Tokenize the synthetic sentences and prepare them in a format suitable for training the custom model. This might involve creating input tensors with appropriate sequence IDs.


**Reasoning**:
Tokenize the sentences in the synthetic data and create input tensors and label tensors for training.



In [27]:
import torch

# Tokenize sentences and prepare data
tokenized_inputs = []
labels = []

max_seq_length = 128 # Define a maximum sequence length

for index, row in synthetic_df.iterrows():
    sentences = list(row['sentences']) # Convert tuple to list for tokenization
    sample_labels = list(row['labels']) # Convert tuple to list

    # Tokenize each sentence in the sample
    # Use truncation and padding to handle variable sentence lengths within the sample
    # The tokenizer handles the addition of special tokens like [CLS] and [SEP]
    encoded_inputs = tokenizer(
        sentences,
        padding='max_length',
        truncation=True,
        max_length=max_seq_length,
        return_tensors='pt', # Return PyTorch tensors
        is_split_into_words=False # Indicate that input is a list of strings
    )

    # The tokenizer returns a batch of inputs where each sentence is padded/truncated
    # We need to process these to match the structure expected by our model
    # For this task, we treat each sentence within the sample independently for classification
    # So, the tokenized inputs will be a list of dictionaries, each corresponding to a sentence
    # The labels will be a flattened list corresponding to each sentence
    num_sentences_in_sample = len(sentences)
    for i in range(num_sentences_in_sample):
        # Extract input_ids and attention_mask for each sentence
        # unsqueeze(0) adds a batch dimension of 1
        sentence_input_ids = encoded_inputs['input_ids'][i].unsqueeze(0)
        sentence_attention_mask = encoded_inputs['attention_mask'][i].unsqueeze(0)

        tokenized_inputs.append({
            'input_ids': sentence_input_ids,
            'attention_mask': sentence_attention_mask
        })
        labels.append(sample_labels[i])

# Convert the labels list to a PyTorch tensor
labels_tensor = torch.tensor(labels, dtype=torch.float)

print(f"Number of tokenized sentences across all samples: {len(tokenized_inputs)}")
print(f"Shape of the labels tensor: {labels_tensor.shape}")

Number of tokenized sentences across all samples: 2964
Shape of the labels tensor: torch.Size([2964])
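
Padding every sentence to `max_seq_length` (128) means most positions are padding for sentences this short. The following sketch pads only to the longest sequence in a batch. The token ids are made up, `pad_batch` is a hypothetical helper, and a pad id of 0 is an assumption.

```python
def pad_batch(sequences, pad_id=0):
    """Pad lists of token ids to the longest sequence in the batch."""
    longest = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        padding = longest - len(seq)
        input_ids.append(seq + [pad_id] * padding)
        # 1 marks real tokens, 0 marks padding.
        attention_mask.append([1] * len(seq) + [0] * padding)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Made-up token ids for two sentences of different lengths.
batch = pad_batch([[101, 2023, 102], [101, 2023, 2003, 1037, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

With `transformers`, a comparable effect is usually obtained by tokenizing with `padding=True` per batch or by using `DataCollatorWithPadding` in the `DataLoader`.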


## Train the model

### Subtask:
Train the custom model using the synthetic data.


In [42]:
import torch
from torch.utils.data import Dataset, DataLoader
from torch.optim import AdamW # AdamW is provided by torch.optim
import torch.nn as nn

# 2. Create a custom PyTorch Dataset class
class MeaningfulnessDataset(Dataset):
    def __init__(self, tokenized_inputs, labels):
        self.tokenized_inputs = tokenized_inputs
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return {
            'input_ids': self.tokenized_inputs[idx]['input_ids'].squeeze(0), # Remove the batch dimension added during initial tokenization
            'attention_mask': self.tokenized_inputs[idx]['attention_mask'].squeeze(0), # Remove the batch dimension
            'labels': self.labels[idx]
        }

# 3. Instantiate the custom Dataset
dataset = MeaningfulnessDataset(tokenized_inputs, labels_tensor)

# 4. Create a PyTorch DataLoader
batch_size = 16 # Define batch size
dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

# 6. Instantiate an optimizer and define a loss function
optimizer = AdamW(meaningfulness_model.parameters(), lr=5e-5) # Using AdamW as optimizer
loss_fn = nn.BCEWithLogitsLoss() # Binary Cross Entropy with Logits Loss for binary classification

# 5. Define the training loop and 7. Iterate through the DataLoader
num_epochs = 3 # Define number of epochs

# Move model to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
meaningfulness_model.to(device)

print(f"Training on device: {device}")

meaningfulness_model.train() # Set model to training mode

for epoch in range(num_epochs):
    total_loss = 0
    for batch in dataloader:
        # Move batch to device
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)

        optimizer.zero_grad() # Zero out gradients

        # 8. Perform a forward pass
        logits = meaningfulness_model(input_ids=input_ids, attention_mask=attention_mask)

        # Ensure labels have the correct shape [batch_size, 1] for BCEWithLogitsLoss
        labels = labels.unsqueeze(1)

        # 8. Calculate the loss
        loss = loss_fn(logits, labels)
        total_loss += loss.item()

        # 8. Perform backpropagation
        loss.backward()

        # 8. Update the model's weights
        optimizer.step()

    avg_loss = total_loss / len(dataloader)
    # 9. Track and print the loss
    print(f"Epoch {epoch+1}/{num_epochs}, Average Loss: {avg_loss:.4f}")

print("Training finished.")

Training on device: cuda
Epoch 1/3, Average Loss: 0.0010
Epoch 2/3, Average Loss: 0.0004
Epoch 3/3, Average Loss: 0.0002
Training finished.
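
To make the loss values above easier to interpret, here is a pure-Python sketch of what `BCEWithLogitsLoss` computes for a single example. It compares a numerically stable form against the textbook definition. `bce_with_logits` and `sigmoid` are illustrative helpers, not PyTorch APIs.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def bce_with_logits(logit: float, target: float) -> float:
    """Numerically stable binary cross-entropy computed on a raw logit."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

# Compare with the textbook form -[y*log(p) + (1-y)*log(1-p)].
logit, target = 2.0, 1.0
p = sigmoid(logit)
naive = -(target * math.log(p) + (1 - target) * math.log(1 - p))
stable = bce_with_logits(logit, target)
print(round(naive, 6), round(stable, 6))

# A confident, correct prediction gives a loss close to zero.
confident_loss = bce_with_logits(8.5, 1.0)
print(f"{confident_loss:.4f}")
```

A single example with a logit of 8.5 and target 1 has a loss of about 0.0002, the same order of magnitude as the epoch 3 average, which suggests the model is extremely confident on the training sentences.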


## Evaluate the model

### Subtask:
Evaluate the trained model on a separate test set (either part of the synthetic data or newly generated) to assess its performance in identifying meaningful sentences.


**Reasoning**:
Split the original samples into training and testing sets, re-tokenize each split, create a test dataset and dataloader, set the model to evaluation mode, and then iterate through the test set to calculate evaluation metrics.



In [35]:
from sklearn.model_selection import train_test_split
from torch.utils.data import Dataset, DataLoader
import torch
import torch.nn as nn
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# 1. Split the data into training and testing sets
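# Split at the sample level so sentences from one sample stay together, then re-tokenize each split.
# Caveat: the training cell used the full dataset, and every sample draws from the same
# 40 sentences, so the sentences in this test split were already seen during training.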
train_samples, test_samples = train_test_split(synthetic_df, test_size=0.2, random_state=42)

def tokenize_samples(samples_df, tokenizer, max_seq_length):
    """Helper function to tokenize samples from a DataFrame."""
    tokenized_inputs = []
    labels = []
    for index, row in samples_df.iterrows():
        sentences = list(row['sentences'])
        sample_labels = list(row['labels'])

        encoded_inputs = tokenizer(
            sentences,
            padding='max_length',
            truncation=True,
            max_length=max_seq_length,
            return_tensors='pt',
            is_split_into_words=False
        )

        num_sentences_in_sample = len(sentences)
        for i in range(num_sentences_in_sample):
            sentence_input_ids = encoded_inputs['input_ids'][i].unsqueeze(0)
            sentence_attention_mask = encoded_inputs['attention_mask'][i].unsqueeze(0)

            tokenized_inputs.append({
                'input_ids': sentence_input_ids,
                'attention_mask': sentence_attention_mask
            })
            labels.append(sample_labels[i])
    return tokenized_inputs, torch.tensor(labels, dtype=torch.float)

# Tokenize the train and test sets
train_tokenized_inputs, train_labels_tensor = tokenize_samples(train_samples, tokenizer, max_seq_length)
test_tokenized_inputs, test_labels_tensor = tokenize_samples(test_samples, tokenizer, max_seq_length)

# 2. Create a new PyTorch Dataset and DataLoader for the test set
# Using the same Dataset class defined previously
test_dataset = MeaningfulnessDataset(test_tokenized_inputs, test_labels_tensor)
test_dataloader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False) # No need to shuffle test data

# Move model to the same device as during training
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
meaningfulness_model.to(device)

# 3. Set the meaningfulness_model to evaluation mode
meaningfulness_model.eval()

# Lists to store predictions and true labels
all_predictions = []
all_true_labels = []

# 4. Iterate through the test DataLoader with torch.no_grad()
with torch.no_grad():
    for batch in test_dataloader:
        # Move batch to device
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)

        # 5. Perform a forward pass
        logits = meaningfulness_model(input_ids=input_ids, attention_mask=attention_mask)

        # 6. Apply a sigmoid function to the logits to get probabilities
        probabilities = torch.sigmoid(logits).squeeze(1) # Squeeze to remove the last dimension

        # 7. Convert the probabilities to binary predictions (0 or 1) using a threshold
        predictions = (probabilities > 0.5).int() # Threshold at 0.5

        # Store predictions and true labels
        all_predictions.extend(predictions.cpu().tolist())
        all_true_labels.extend(labels.cpu().tolist())

# 8. Calculate evaluation metrics
accuracy = accuracy_score(all_true_labels, all_predictions)
precision = precision_score(all_true_labels, all_predictions)
recall = recall_score(all_true_labels, all_predictions)
f1 = f1_score(all_true_labels, all_predictions)

# 9. Print the calculated evaluation metrics
print("\nEvaluation Metrics:")
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-score: {f1:.4f}")


Evaluation Metrics:
Accuracy: 1.0000
Precision: 1.0000
Recall: 1.0000
F1-score: 1.0000
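
For reference, the four metrics above can be computed directly from confusion-matrix counts. The sketch below does this in pure Python on made-up labels. `binary_metrics` is a hypothetical helper that mirrors what the scikit-learn functions report for binary 0/1 labels.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Made-up labels: two true positives, one false positive, one false negative, two true negatives.
metrics = binary_metrics([1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 1, 0])
print(metrics)
```

All four values reach 1.0 only when there are no false positives and no false negatives, which is what the test run above reports.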


## Test with examples

### Subtask:
Use the trained model to predict the meaningfulness of new, unseen sentences to demonstrate its capability.


**Reasoning**:
Create a list of new sentences and then tokenize them and pass them through the trained model to get predictions. Finally, print the sentences and their predicted labels.



In [37]:
# 1. Create a list of new, unseen sentences
new_sentences = [
    "The quick brown fox jumps over the lazy dog.",  # Meaningful
    "Silence tastes like purple.",                  # Non-meaningful
    "Mount Everest is the highest mountain.",       # Meaningful
    "Ideas can sing in the rain.",                  # Non-meaningful
    "The capital of France is Paris.",              # Meaningful
    "The number seven is green.",                   # Non-meaningful
    "Water boils at 100 degrees Celsius.",          # Meaningful
    "My cat barked at the moon.",                   # Non-meaningful
    "Photosynthesis requires sunlight.",            # Meaningful
    "Invisible concepts dance on the ceiling."      # Non-meaningful
]

# 2. Tokenize these new sentences
# Use the same tokenizer and max_seq_length as during training
encoded_new_inputs = tokenizer(
    new_sentences,
    padding='max_length',
    truncation=True,
    max_length=max_seq_length,
    return_tensors='pt',  # Return PyTorch tensors
    is_split_into_words=False # Indicate that input is a list of strings
)

# Move tokenized inputs to the same device as the model
input_ids = encoded_new_inputs['input_ids'].to(device)
attention_mask = encoded_new_inputs['attention_mask'].to(device)

# 3. Set the trained meaningfulness_model to evaluation mode
meaningfulness_model.eval()

# 4. Perform a forward pass with torch.no_grad()
with torch.no_grad():
    # 5. Perform a forward pass
    logits = meaningfulness_model(input_ids=input_ids, attention_mask=attention_mask)

# 6. Apply a sigmoid function to the logits to convert them into probabilities
probabilities = torch.sigmoid(logits).squeeze(1) # Squeeze to remove the last dimension

# 7. Convert the probabilities into binary predictions (0 or 1) using a threshold (e.g., 0.5)
predictions = (probabilities > 0.5).int() # Threshold at 0.5

# 8. Print each original new sentence along with its predicted meaningfulness label
print("\nPredictions for new sentences:")
for i, sentence in enumerate(new_sentences):
    predicted_label = "Meaningful" if predictions[i].item() == 1 else "Not Meaningful"
    print(f"Sentence: \"{sentence}\" -> Predicted Label: {predicted_label}")


Predictions for new sentences:
Sentence: "The quick brown fox jumps over the lazy dog." -> Predicted Label: Not Meaningful
Sentence: "Silence tastes like purple." -> Predicted Label: Not Meaningful
Sentence: "Mount Everest is the highest mountain." -> Predicted Label: Meaningful
Sentence: "Ideas can sing in the rain." -> Predicted Label: Meaningful
Sentence: "The capital of France is Paris." -> Predicted Label: Meaningful
Sentence: "The number seven is green." -> Predicted Label: Not Meaningful
Sentence: "Water boils at 100 degrees Celsius." -> Predicted Label: Meaningful
Sentence: "My cat barked at the moon." -> Predicted Label: Not Meaningful
Sentence: "Photosynthesis requires sunlight." -> Predicted Label: Meaningful
Sentence: "Invisible concepts dance on the ceiling." -> Predicted Label: Not Meaningful
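
The predictions above can be scored against the intended labels given in the comments of the cell. The sketch below does this with both lists transcribed by hand from the cell and its printed output.

```python
# Intended labels from the comments in the cell above (1 = meaningful, 0 = not meaningful).
intended = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
# Predicted labels transcribed from the printed output above.
predicted = [0, 0, 1, 1, 1, 0, 1, 0, 1, 0]

# Collect the positions where the prediction disagrees with the intended label.
mismatches = [i for i, (t, p) in enumerate(zip(intended, predicted)) if t != p]
accuracy = 1 - len(mismatches) / len(intended)
print(f"Accuracy: {accuracy:.2f}, mismatched positions: {mismatches}")
```

The two mismatches are position 0 ("The quick brown fox jumps over the lazy dog.") and position 3 ("Ideas can sing in the rain.").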


## Summary:

### Data Analysis Key Findings

*   The `sentence-transformers/all-MiniLM-L12-v2` model and its tokenizer were successfully loaded for the task.
*   A custom PyTorch model (`MeaningfulnessClassifier`) was successfully created by adding a linear classification layer on top of the pre-trained MiniLM model for binary classification.
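*   A synthetic dataset of 1000 samples (2964 labeled sentences in total) was generated from a pool of 20 meaningful and 20 non-meaningful sentences, with each sample containing one meaningful sentence and one to three non-meaningful ones, along with their labels.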
*   The sentences from the synthetic data were tokenized using the loaded tokenizer, applying padding and truncation to prepare them for model input.
*   A custom PyTorch `Dataset` and `DataLoader` were created to handle the tokenized data during training and evaluation.
*   The custom model was trained for 3 epochs on the synthetic data, and the average training loss fell from 0.0010 to 0.0002.
*   The synthetic data was split into training and testing sets, and the trained model scored 1.0000 for accuracy, precision, recall, and F1 on the test split. These scores overstate generalization: training used the full dataset before the split, and every sample draws from the same 40 sentences, so each test sentence had already been seen during training.
*   The trained model was used to predict the meaningfulness of new, unseen sentences, and 8 of the 10 predictions matched the intended labels. The model misclassified "The quick brown fox jumps over the lazy dog." as "Not Meaningful" and "Ideas can sing in the rain." as "Meaningful", indicating limitations with the current synthetic data or training duration.

### Insights or Next Steps

*   The model shows promising results on the synthetic data it was trained on, but its performance on slightly different or more complex unseen sentences suggests that the synthetic data may not fully represent the nuances of real-world sentence meaningfulness.
*   To improve the model's generalization capabilities, the next step should involve training on a larger and more diverse dataset, potentially using real-world examples of meaningful and non-meaningful sentences.
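
Because all 1000 samples are drawn from only 40 distinct sentences, one way to measure generalization on the synthetic data itself is to hold out whole sentences before building samples. The sketch below shows such a split. `split_sentence_pool` is a hypothetical helper and the pool is a placeholder; in the notebook it would be applied to `meaningful_sentences` and `non_meaningful_sentences` separately, before training.

```python
import random

def split_sentence_pool(sentences, test_fraction=0.2, seed=42):
    """Hold out whole sentences so none appears in both train and test."""
    rng = random.Random(seed)
    shuffled = sentences[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Placeholder pool (assumption) standing in for one of the notebook's sentence lists.
pool = [f"sentence {i}" for i in range(20)]
train_pool, test_pool = split_sentence_pool(pool)

print(len(train_pool), len(test_pool))
print(set(train_pool) & set(test_pool))  # empty set: no sentence is shared
```

Training samples would then be generated only from the train pools and test samples only from the test pools, so the evaluation covers sentences the model has never seen.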
