# Siamese BiLSTM Neural Network with Attention

<p>A Siamese BiLSTM for sentence similarity is a deep learning model that compares two input sentences and produces a score indicating how similar they are.</p>
<p>The architecture passes the two sentences separately through the same Bidirectional Long Short-Term Memory (BiLSTM) encoder, so both branches share their weights. The BiLSTM reads each sentence in the forward and backward directions and produces a sequence of hidden states that capture the context on both sides of every token. In this notebook, an attention layer pools those hidden states into one vector per sentence, the two vectors are concatenated, and a fully connected layer (a linear transformation) maps the result to a scalar similarity score.</p>
<p>Loss function applied:</p>
<ul>
    <li>BCE loss on the raw score (<code>nn.BCEWithLogitsLoss</code>), which is differentiable and matches the binary labels</li>
</ul>
<p>During training, the model adjusts its parameters to minimize the difference between the predicted similarity scores and the true labels (1 for a paraphrase pair, 0 otherwise).</p>

<p>Pretrained 300-dimensional word2vec embeddings (Google News) are fed as input to the BiLSTM and are kept frozen during training.</p>
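As a compact summary of the description above, the forward pass implemented later in `SiameseBiLSTM` can be written as follows. The symbols are introduced here for illustration only and do not appear elsewhere in the notebook: $H$ is the hidden size, $T$ the sentence length, $h_t$ the (dropout-regularized) BiLSTM state at position $t$, $v^{(1)}$ and $v^{(2)}$ the pooled vectors of the two sentences, and $y \in \{0, 1\}$ the label.

```latex
\begin{aligned}
h_t &= [\overrightarrow{h}_t \,;\, \overleftarrow{h}_t] \in \mathbb{R}^{2H} \\
\alpha_t &= \frac{\exp(w^\top h_t + b)}{\sum_{k=1}^{T} \exp(w^\top h_k + b)} \\
v &= \sum_{t=1}^{T} \alpha_t \, h_t \\
z &= u^\top [v^{(1)} \,;\, v^{(2)}] + c, \qquad u \in \mathbb{R}^{4H} \\
\mathcal{L} &= -\big[\, y \log \sigma(z) + (1 - y) \log\big(1 - \sigma(z)\big) \big]
\end{aligned}
```

Here $w, b$ correspond to `attention_fc`, $u, c$ to `fc`, and $\mathcal{L}$ to `nn.BCEWithLogitsLoss` for a single example.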

   

In [1]:
pip install gensim



In [1]:
pip install datasets

Collecting datasets
  Downloading datasets-2.16.1-py3-none-any.whl (507 kB)
Collecting dill<0.3.8,>=0.3.0 (from datasets)
  Downloading dill-0.3.7-py3-none-any.whl (115 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.15-py310-none-any.whl (134 kB)
Installing collected packages: dill, multiprocess, datasets
Successfully installed datasets-2.16.1 dill-0.3.7 multiprocess-0.70.15


In [3]:
from gensim.models import KeyedVectors
import pandas as pd
import pickle

In [4]:
import gensim.downloader as api

word2vec = api.load('word2vec-google-news-300')



In [None]:
# word2vec_path = "../data/GoogleNews-vectors-negative300.bin"
# word2vec = KeyedVectors.load_word2vec_format(word2vec_path, binary=True)

In [5]:
embedding_matrix = word2vec.vectors

In [6]:
word2idx = {word: i for i, word in enumerate(word2vec.index_to_key)}

In [7]:
from datasets import load_dataset
dataset = load_dataset("paws", "labeled_final")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Downloading readme:   0%|          | 0.00/9.79k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/8.43M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/1.24M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/1.23M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/49401 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/8000 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/8000 [00:00<?, ? examples/s]

In [8]:
train_dataset = dataset['train']
val_dataset = dataset['validation']
test_dataset = dataset['test']


sample = train_dataset[0]

In [10]:
sample

{'id': 1,
 'sentence1': 'In Paris , in October 1560 , he secretly met the English ambassador , Nicolas Throckmorton , asking him for a passport to return to England through Scotland .',
 'sentence2': 'In October 1560 , he secretly met with the English ambassador , Nicolas Throckmorton , in Paris , and asked him for a passport to return to Scotland through England .',
 'label': 0}

In [11]:
import pandas as pd
df = train_dataset.to_pandas()
val_df = val_dataset.to_pandas()
test_df = test_dataset.to_pandas()

In [12]:
df.head()

Unnamed: 0,id,sentence1,sentence2,label
0,1,"In Paris , in October 1560 , he secretly met t...","In October 1560 , he secretly met with the Eng...",0
1,2,The NBA season of 1975 -- 76 was the 30th seas...,The 1975 -- 76 season of the National Basketba...,1
2,3,"There are also specific discussions , public p...","There are also public discussions , profile sp...",0
3,4,When comparable rates of flow can be maintaine...,The results are high when comparable flow rate...,1
4,5,It is the seat of Zerendi District in Akmola R...,It is the seat of the district of Zerendi in A...,1


In [13]:
import re
def remove_abbrevations(text):
    text = re.sub(r"he's", "he is", text)
    text = re.sub(r"there's", "there is", text)
    text = re.sub(r"We're", "We are", text)
    text = re.sub(r"That's", "That is", text)
    text = re.sub(r"won't", "will not", text)
    text = re.sub(r"they're", "they are", text)
    text = re.sub(r"Can't", "Cannot", text)
    text = re.sub(r"wasn't", "was not", text)
    text = re.sub(r"don\x89Ûªt", "do not", text)
    text = re.sub(r"aren't", "are not", text)
    text = re.sub(r"isn't", "is not", text)
    text = re.sub(r"What's", "What is", text)
    text = re.sub(r"haven't", "have not", text)
    text = re.sub(r"hasn't", "has not", text)
    text = re.sub(r"There's", "There is", text)
    text = re.sub(r"He's", "He is", text)
    text = re.sub(r"It's", "It is", text)
    text = re.sub(r"You're", "You are", text)
    text = re.sub(r"I'M", "I am", text)
    text = re.sub(r"shouldn't", "should not", text)
    text = re.sub(r"wouldn't", "would not", text)
    text = re.sub(r"i'm", "I am", text)
    text = re.sub(r"I\x89Ûªm", "I am", text)
    text = re.sub(r"I'm", "I am", text)
    text = re.sub(r"Isn't", "is not", text)
    text = re.sub(r"Here's", "Here is", text)
    text = re.sub(r"you've", "you have", text)
    text = re.sub(r"you\x89Ûªve", "you have", text)
    text = re.sub(r"we're", "we are", text)
    text = re.sub(r"what's", "what is", text)
    text = re.sub(r"couldn't", "could not", text)
    text = re.sub(r"we've", "we have", text)
    text = re.sub(r"it\x89Ûªs", "it is", text)
    text = re.sub(r"doesn\x89Ûªt", "does not", text)
    text = re.sub(r"It\x89Ûªs", "It is", text)
    text = re.sub(r"Here\x89Ûªs", "Here is", text)
    text = re.sub(r"who's", "who is", text)
    text = re.sub(r"I\x89Ûªve", "I have", text)
    text = re.sub(r"y'all", "you all", text)
    text = re.sub(r"can\x89Ûªt", "cannot", text)
    text = re.sub(r"would've", "would have", text)
    text = re.sub(r"it'll", "it will", text)
    text = re.sub(r"we'll", "we will", text)
    text = re.sub(r"wouldn\x89Ûªt", "would not", text)
    text = re.sub(r"We've", "We have", text)
    text = re.sub(r"he'll", "he will", text)
    text = re.sub(r"Y'all", "You all", text)
    text = re.sub(r"Weren't", "Were not", text)
    text = re.sub(r"Didn't", "Did not", text)
    text = re.sub(r"they'll", "they will", text)
    text = re.sub(r"they'd", "they would", text)
    text = re.sub(r"DON'T", "DO NOT", text)
    text = re.sub(r"That\x89Ûªs", "That is", text)
    text = re.sub(r"they've", "they have", text)
    text = re.sub(r"i'd", "I would", text)
    text = re.sub(r"should've", "should have", text)
    text = re.sub(r"You\x89Ûªre", "You are", text)
    text = re.sub(r"where's", "where is", text)
    text = re.sub(r"Don\x89Ûªt", "Do not", text)
    text = re.sub(r"we'd", "we would", text)
    text = re.sub(r"i'll", "I will", text)
    text = re.sub(r"weren't", "were not", text)
    text = re.sub(r"They're", "They are", text)
    text = re.sub(r"Can\x89Ûªt", "Cannot", text)
    text = re.sub(r"you\x89Ûªll", "you will", text)
    text = re.sub(r"I\x89Ûªd", "I would", text)
    text = re.sub(r"let's", "let us", text)
    text = re.sub(r"it's", "it is", text)
    text = re.sub(r"can't", "cannot", text)
    text = re.sub(r"don't", "do not", text)
    text = re.sub(r"you're", "you are", text)
    text = re.sub(r"i've", "I have", text)
    text = re.sub(r"that's", "that is", text)
    text = re.sub(r"i'll", "I will", text)
    text = re.sub(r"doesn't", "does not", text)
    text = re.sub(r"i'd", "I would", text)
    text = re.sub(r"didn't", "did not", text)
    text = re.sub(r"ain't", "am not", text)
    text = re.sub(r"you'll", "you will", text)
    text = re.sub(r"I've", "I have", text)
    text = re.sub(r"Don't", "do not", text)
    text = re.sub(r"I'll", "I will", text)
    text = re.sub(r"I'd", "I would", text)
    text = re.sub(r"Let's", "Let us", text)
    text = re.sub(r"you'd", "You would", text)
    text = re.sub(r"It's", "It is", text)
    text = re.sub(r"Ain't", "am not", text)
    text = re.sub(r"Haven't", "Have not", text)
    text = re.sub(r"Could've", "Could have", text)
    text = re.sub(r"youve", "you have", text)
    text = re.sub(r"donå«t", "do not", text)
    text = re.sub(r"shan't", "shall not", text)
    text = re.sub(r"'Tis", "it is", text)
    text = re.sub(r"what's", "what is", text)
    return text
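The long chain of `re.sub` calls above contains a few repeated entries and handles each casing separately. A table-driven variant is sketched below as one possible alternative; the function name `expand_contractions` and the shortened `CONTRACTIONS` mapping are hypothetical and only illustrate the idea. The notebook's full list, including the mis-encoded apostrophe variants, would have to be carried over.

```python
import re

# Hypothetical, heavily shortened mapping; keys are lowercase.
CONTRACTIONS = {
    "won't": "will not",
    "can't": "cannot",
    "don't": "do not",
    "it's": "it is",
    "i'm": "I am",
}

# One compiled alternation with word boundaries, matched case-insensitively.
_CONTRACTION_RE = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTIONS) + r")\b",
    flags=re.IGNORECASE,
)


def expand_contractions(text):
    """Expand every contraction listed in CONTRACTIONS in a single pass."""

    def _replace(match):
        found = match.group(0)
        expanded = CONTRACTIONS[found.lower()]
        # Keep a leading capital letter if the original had one.
        if found[0].isupper():
            expanded = expanded[0].upper() + expanded[1:]
        return expanded

    return _CONTRACTION_RE.sub(_replace, text)


example = expand_contractions("It's late and I can't stay")
print(example)
```

Unlike the original function, this sketch maps an all-caps form such as `DON'T` to `Do not` rather than `DO NOT`; since `clean_text` lowercases the text afterwards, that difference should not matter for the pipeline.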

In [14]:
def remove_punctuation(sentence):

    return re.sub(r'[^\w\s]', '', sentence)

def lower_text(str1):
    return str1.lower()

def replace_numbers(text):
    text =  re.sub(r'\d+(,(\d+))*(\.(\d+))?%?\s',  'num', text)
    replaced_sentence = re.sub(r'\d+', 'num', text)
    return replaced_sentence

def replace_email(text):
    return re.sub(r'[a-zA-Z\.]+@[a-zA-Z\.\d]+',  'email', text)
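The tokens `numwas` and `numseason` that show up in the `df.head()` output further down appear to come from the first pattern in `replace_numbers`: it matches the whitespace after a number and replaces number and space together with `num`. A possible variant that leaves the whitespace untouched is sketched below. The name `replace_numbers_keep_space` is hypothetical, and switching to it would change the vocabulary and therefore the numbers reported in the rest of the notebook.

```python
import re

# Hypothetical variant of replace_numbers: the trailing \s is no longer part
# of the match, so the space after a number survives the substitution.
NUMBER_PATTERN = re.compile(r"\d+(?:,\d+)*(?:\.\d+)?%?")


def replace_numbers_keep_space(text):
    """Replace integers, decimals, thousands-separated numbers and percentages by 'num'."""
    return NUMBER_PATTERN.sub("num", text)


example = "The NBA season of 1975 -- 76 was the 30th season"
cleaned = replace_numbers_keep_space(example)
print(cleaned)
```

A number directly followed by letters, as in `30th`, still becomes `numth`, which matches the behavior of the original function.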

In [15]:
def clean_text(text):
    text = replace_email(text)
    text = replace_numbers(text)
    text = remove_abbrevations(text)
    text = lower_text(text)
    text = remove_punctuation(text)
    # Split on whitespace; this also collapses repeated spaces
    sent_tokens = text.split()
    return sent_tokens

In [16]:

df['sent1'] =df['sentence1'].apply(lambda x: clean_text(x))
df['sent2'] =df['sentence2'].apply(lambda x: clean_text(x))

In [17]:

val_df['sent1'] = val_df['sentence1'].apply(lambda x: clean_text(x))
val_df['sent2'] = val_df['sentence2'].apply(lambda x: clean_text(x))

In [18]:
test_df['sent1'] = test_df['sentence1'].apply(lambda x: clean_text(x))
test_df['sent2'] = test_df['sentence2'].apply(lambda x: clean_text(x))

In [None]:
# df = pd.read_csv('../data/cleaned_train_df1.csv')

In [None]:
# val_df = pd.read_csv('../data/cleaned_val_df1.csv')
# test_df = pd.read_csv('../data/cleaned_test_df1.csv')

In [19]:
df.head()

Unnamed: 0,id,sentence1,sentence2,label,sent1,sent2
0,1,"In Paris , in October 1560 , he secretly met t...","In October 1560 , he secretly met with the Eng...",0,"[in, paris, in, october, num, he, secretly, me...","[in, october, num, he, secretly, met, with, th..."
1,2,The NBA season of 1975 -- 76 was the 30th seas...,The 1975 -- 76 season of the National Basketba...,1,"[the, nba, season, of, num, numwas, the, numth...","[the, num, numseason, of, the, national, baske..."
2,3,"There are also specific discussions , public p...","There are also public discussions , profile sp...",0,"[there, are, also, specific, discussions, publ...","[there, are, also, public, discussions, profil..."
3,4,When comparable rates of flow can be maintaine...,The results are high when comparable flow rate...,1,"[when, comparable, rates, of, flow, can, be, m...","[the, results, are, high, when, comparable, fl..."
4,5,It is the seat of Zerendi District in Akmola R...,It is the seat of the district of Zerendi in A...,1,"[it, is, the, seat, of, zerendi, district, in,...","[it, is, the, seat, of, the, district, of, zer..."


In [21]:
# df['sent1'] = df['sent1'].apply(eval)
# df['sent2'] = df['sent2'].apply(eval)

In [22]:
sent1 = list(df['sent1'])
sent2 = list(df['sent2'])

In [82]:
total_sents =  list(df['sent1'])
total_sents.extend( list(df['sent2']))

In [83]:
len(total_sents)

98802

In [112]:
word_dict = {}
for word_tokens in total_sents:
    for word in word_tokens:
        if word in word_dict:
            word_dict[word] += 1
        else:
            word_dict[word] = 1

In [113]:
vocab = word_dict

In [114]:
vocab_list = list(vocab.keys())

In [115]:
len(vocab_list)

30995

In [116]:
vocab_list.append("unk")

In [118]:
vocab_list = set(vocab_list)

<p>Create Subset Embedding matrix</p>

In [2]:
# Create Subset embedding matrix

In [120]:
vocab_list = [i for i in vocab_list if i in word2idx]

In [121]:
len(vocab_list)

16804

In [122]:
vocab_dict = {k:i for i,k in enumerate(vocab_list)}

In [124]:
vocab_dict['unk']

7635

In [125]:
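# Despite its name, word2idx_trunc maps each subset index to the word's row in the full word2vec matrix
word2idx_trunc = {}
for word, subset_idx in vocab_dict.items():
    word2idx_trunc[subset_idx] = word2idx[word]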

In [103]:
#word2idx_trunc

In [127]:
word2idx['unk']

1459665

In [128]:
word2idx_trunc[7635]

1459665

In [129]:
#word_indexes = list(word2idx_trunc.values())
word_indexes = [value for key, value in sorted(word2idx_trunc.items())]

In [130]:
word_indexes.index(1459665)

7635

In [133]:
subset_embedding_matrix = word2vec.vectors[word_indexes]

In [134]:
len(subset_embedding_matrix)

16804

In [135]:
import numpy as np

In [136]:
np.array_equal(subset_embedding_matrix[7635], embedding_matrix[1459665])

True
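Two properties of the mapping built above are worth keeping in mind. `vocab_dict` is enumerated from a `set` of strings, whose iteration order can differ between Python sessions, so the indices (for example 7635 for `unk`) are not guaranteed to be reproducible. In addition, index 0 belongs to a real word, while the `collate_fn` defined later also uses 0 as the padding value. The sketch below shows one way to address both, assuming a sorted vocabulary and a reserved all-zero padding row. The function name, the `<pad>` token, and the toy data are hypothetical and not part of the notebook.

```python
import numpy as np


def build_subset_embeddings(corpus_vocab, full_word2idx, full_vectors,
                            pad_token="<pad>", unk_token="unk"):
    """Return (word_to_ix, matrix) where row 0 is an all-zero padding row."""
    # Sorting makes the index assignment reproducible across sessions.
    kept = sorted(w for w in set(corpus_vocab) | {unk_token} if w in full_word2idx)

    word_to_ix = {pad_token: 0}
    for word in kept:
        word_to_ix[word] = len(word_to_ix)

    matrix = np.zeros((len(word_to_ix), full_vectors.shape[1]), dtype=full_vectors.dtype)
    matrix[1:] = full_vectors[[full_word2idx[w] for w in kept]]
    return word_to_ix, matrix


# Toy stand-ins for word2vec.vectors and word2idx.
full_vectors = np.arange(12, dtype=np.float32).reshape(4, 3)
full_word2idx = {"paris": 0, "unk": 1, "october": 2, "met": 3}

word_to_ix, matrix = build_subset_embeddings(
    ["paris", "met", "zerendi"], full_word2idx, full_vectors
)
print(word_to_ix)
```

With such a layout, `padding_value=0` in `pad_sequence` would point at the dedicated padding row instead of a real word.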

In [152]:
import torch
from torch.utils.data import Dataset, DataLoader
from torch.nn.utils.rnn import pad_sequence

class MyDataset(Dataset):
    def __init__(self, sentences1, sentences2, labels, word_to_ix):
        self.sentences1 = sentences1
        self.sentences2 = sentences2
        self.labels = labels
        self.word_to_ix = word_to_ix

    def __len__(self):
        return len(self.labels)  # all three lists are expected to have the same length

    def __getitem__(self, idx):
        unk_token = self.word_to_ix['unk']
        sentence1 = self.sentences1[idx]
        sentence2 = self.sentences2[idx]
        score = self.labels[idx]
        seq1 = [self.word_to_ix[word] if word in self.word_to_ix else unk_token for word in sentence1]
        seq2 = [self.word_to_ix[word] if word in self.word_to_ix else unk_token for word in sentence2]
        #seq1 = [self.word_to_ix[word] for word in sentence1 if word in self.word_to_ix]
        #seq2 = [self.word_to_ix[word] for word in sentence2 if word in self.word_to_ix]
        return seq1, seq2, score

    def collate_fn(self, batch):
        sequences1, sequences2, labels = zip(*batch)
        padded_seqs1 = pad_sequence([torch.LongTensor(seq) for seq in sequences1], batch_first=True, padding_value=0)
        padded_seqs2 = pad_sequence([torch.LongTensor(seq) for seq in sequences2], batch_first=True, padding_value=0)
        #return padded_seqs1, padded_seqs2, torch.tensor(scores, dtype=torch.float)
        return padded_seqs1, padded_seqs2, torch.LongTensor(labels)

In [153]:
sent1_tokens = list(df['sent1'])
sent2_tokens = list(df['sent2'])
scores = list(df['label'])

In [165]:
word_to_ix = vocab_dict
train_dataset = MyDataset(sent1_tokens, sent2_tokens, scores, word_to_ix)
train_dataloader = DataLoader(train_dataset, batch_size=64, shuffle=True, collate_fn=train_dataset.collate_fn)

In [155]:
val_sent1_tokens = list(val_df['sent1'])
val_sent2_tokens = list(val_df['sent2'])
val_scores = list(val_df['label'])

In [166]:
val_dataset = MyDataset(val_sent1_tokens, val_sent2_tokens, val_scores, word_to_ix)
val_dataloader = DataLoader(val_dataset, batch_size=64, shuffle=True, collate_fn=val_dataset.collate_fn)

In [157]:
test_sent1_tokens = list(test_df['sent1'])
test_sent2_tokens = list(test_df['sent2'])
test_scores = list(test_df['label'])

In [167]:
test_dataset = MyDataset(test_sent1_tokens, test_sent2_tokens, test_scores, word_to_ix)
test_dataloader = DataLoader(test_dataset, batch_size=64, shuffle=True, collate_fn=test_dataset.collate_fn)

In [147]:
len(subset_embedding_matrix)

16804

In [148]:
subset_embedding_matrix.shape

(16804, 300)

In [182]:
import torch
import torch.nn as nn

class SiameseBiLSTM(nn.Module):
    def __init__(self, hidden_size, num_layers, embedding_dim, embd_matrix, dropout=0.2):
        super(SiameseBiLSTM, self).__init__()

        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.embedding_dim = embedding_dim
        self.embd_matrix = embd_matrix

        self.word_embeddings = nn.Embedding(len(embd_matrix), embedding_dim)
        self.word_embeddings.weight = nn.Parameter(torch.from_numpy(self.embd_matrix))
        self.word_embeddings.weight.requires_grad = False

        self.bilstm = nn.LSTM(input_size=embedding_dim, hidden_size=hidden_size, num_layers=num_layers,
                              batch_first=True, bidirectional=True)

        self.dropout = nn.Dropout(dropout)

        self.attention_fc = nn.Linear(hidden_size * 2, 1)
        self.attention_softmax = nn.Softmax(dim=1)

        self.fc = nn.Linear(hidden_size * 4, 1)  # 4 = 2 directions (forward, backward) x 2 sentences; the single BiLSTM is shared by both

    def forward_once(self, sentence):
        embeds = self.word_embeddings(sentence)

        lstm_out, _ = self.bilstm(embeds)

        lstm_out = self.dropout(lstm_out)

        attention_weights = self.attention_softmax(self.attention_fc(lstm_out))
        lstm_out = lstm_out * attention_weights
        lstm_out = lstm_out.sum(dim=1)

        return lstm_out

    def forward(self, sentence1, sentence2):
        output1 = self.forward_once(sentence1)

        output2 = self.forward_once(sentence2)

        concatenated = torch.cat((output1, output2), dim=1)

        similarity_score = self.fc(concatenated)

        return similarity_score
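In `forward_once`, the softmax runs over every position of the padded batch, so padded positions also receive attention weight. A masked variant of the pooling step is sketched below under the assumption that a boolean padding mask is available (for example derived from the sequence lengths; it cannot be derived from `sentence == 0` in this notebook because index 0 is also a real word). The helper name `masked_attention_pool` is hypothetical.

```python
import torch


def masked_attention_pool(lstm_out, scores, pad_mask):
    """Attention pooling that ignores padded positions.

    lstm_out: (batch, seq_len, features) BiLSTM outputs
    scores:   (batch, seq_len, 1) unnormalized attention scores
    pad_mask: (batch, seq_len) boolean, True where the position is padding
    """
    # Padded positions get -inf so that softmax assigns them zero weight.
    scores = scores.masked_fill(pad_mask.unsqueeze(-1), float("-inf"))
    weights = torch.softmax(scores, dim=1)
    return (lstm_out * weights).sum(dim=1)


torch.manual_seed(0)
lstm_out = torch.randn(2, 4, 6)
scores = torch.randn(2, 4, 1)
pad_mask = torch.tensor([[False, False, True, True],
                         [False, False, False, False]])

pooled = masked_attention_pool(lstm_out, scores, pad_mask)
print(pooled.shape)
```

A row that consists only of padding would produce NaN weights with this approach, so empty sentences would need to be filtered out or handled separately.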



In [160]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [191]:
import torch.nn as nn


model = SiameseBiLSTM(hidden_size=50, num_layers=2, embedding_dim=300, embd_matrix=subset_embedding_matrix).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)


criterion = nn.BCEWithLogitsLoss()

num_epochs = 10

# Train model
for epoch in range(num_epochs):
    model.train()  # Ensure the model is in training mode
    epoch_loss = 0.0
    for i, (sentence1, sentence2, label) in enumerate(train_dataloader):
        
        sentence1_tensor = sentence1.to(device)
        sentence2_tensor = sentence2.to(device)
        label_tensor = label.float().to(device)  # labels already arrive as a LongTensor from collate_fn

        
        optimizer.zero_grad()

        
        output = model(sentence1_tensor, sentence2_tensor)

        
        loss = criterion(output.squeeze(1), label_tensor)  # squeeze only the last dim so a batch of size 1 keeps its shape

        
        loss.backward()
        optimizer.step()

        epoch_loss += loss.item()

    # Validation loop
    model.eval()  # Set the model to evaluation mode
    total_val_loss = 0
    with torch.no_grad():
        for j, (val_sentence1, val_sentence2, val_label) in enumerate(val_dataloader):
            val_sentence1_tensor = val_sentence1.to(device)
            val_sentence2_tensor = val_sentence2.to(device)
            val_label_tensor = val_label.float().to(device)
            outputs = model(val_sentence1_tensor, val_sentence2_tensor)
            val_loss = criterion(outputs.squeeze(1), val_label_tensor)
            total_val_loss += val_loss.item()

    avg_train_loss = epoch_loss / len(train_dataloader)
    avg_val_loss = total_val_loss / len(val_dataloader)
    print('Epoch [{}/{}], Train Loss: {:.4f}, Val Loss: {:.4f}'.format(epoch+1, num_epochs, avg_train_loss, avg_val_loss))




Epoch [1/10], Train Loss: 0.6862, Val Loss: 0.6852
Epoch [2/10], Train Loss: 0.6818, Val Loss: 0.6821
Epoch [3/10], Train Loss: 0.6741, Val Loss: 0.6857
Epoch [4/10], Train Loss: 0.6652, Val Loss: 0.6850
Epoch [5/10], Train Loss: 0.6589, Val Loss: 0.6905
Epoch [6/10], Train Loss: 0.6520, Val Loss: 0.6925
Epoch [7/10], Train Loss: 0.6467, Val Loss: 0.6946
Epoch [8/10], Train Loss: 0.6396, Val Loss: 0.7009
Epoch [9/10], Train Loss: 0.6318, Val Loss: 0.7045
Epoch [10/10], Train Loss: 0.6219, Val Loss: 0.7150
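In the log above, the validation loss reaches its minimum at epoch 2 and rises afterwards while the training loss keeps falling, which suggests overfitting. One common response is early stopping. The sketch below replays the logged validation losses through a small helper; the `EarlyStopping` class and its `patience` value are hypothetical and are not part of the notebook's training loop.

```python
class EarlyStopping:
    """Signal a stop once the validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=2, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.best_epoch = None
        self.bad_epochs = 0

    def step(self, epoch, val_loss):
        """Record one epoch; return True if training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.best_epoch = epoch
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Validation losses copied from the training log above.
val_losses = [0.6852, 0.6821, 0.6857, 0.6850, 0.6905,
              0.6925, 0.6946, 0.7009, 0.7045, 0.7150]

stopper = EarlyStopping(patience=2)
stopped_at = None
for epoch, loss in enumerate(val_losses, start=1):
    if stopper.step(epoch, loss):
        stopped_at = epoch
        break

print(f"best epoch: {stopper.best_epoch}, stopped at epoch: {stopped_at}")
```

In a real loop one would typically also save the model weights whenever `best_epoch` is updated and restore them after stopping.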


In [194]:
df['label'].value_counts()

0    27572
1    21829
Name: label, dtype: int64
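The class counts above give a useful reference point: a classifier that always predicts the majority class would already reach the accuracy computed below on the training split. The class balance of the test split is not shown in the notebook and may differ, so this is only a rough baseline for the test accuracy reported further down.

```python
# Label counts copied from the value_counts() output above.
label_counts = {0: 27572, 1: 21829}

total = sum(label_counts.values())
majority_label = max(label_counts, key=label_counts.get)
majority_baseline = label_counts[majority_label] / total

print(f"majority class: {majority_label}, baseline accuracy: {majority_baseline:.4f}")
```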

In [192]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
import numpy as np

test_predictions = []
test_pred_probas = []
test_labels = []
model.eval()  

with torch.no_grad():
    for k, (test_sentence1, test_sentence2, test_label) in enumerate(test_dataloader):
        test_sentence1_tensor = test_sentence1.to(device)
        test_sentence2_tensor = test_sentence2.to(device)
        test_label_tensor = test_label.float().to(device)

        test_output = model(test_sentence1_tensor, test_sentence2_tensor)

        
        test_output_binary = torch.sigmoid(test_output).cpu().numpy()
        test_pred_probas.append(test_output_binary)
        test_output_binary = (test_output_binary > 0.5).astype(int)

        test_predictions.extend(test_output_binary.flatten())
        test_labels.extend(test_label_tensor.cpu().numpy())


test_predictions = np.array(test_predictions)
test_labels = np.array(test_labels)


accuracy = accuracy_score(test_labels, test_predictions)
# Note: this AUC is computed on the thresholded 0/1 predictions, not on the sigmoid probabilities
roc_auc = roc_auc_score(test_labels, test_predictions)

print(f'Test Accuracy: {accuracy:.4f}')
print(f'Test ROC-AUC Score: {roc_auc:.4f}')




Test Accuracy: 0.5369
Test ROC-AUC Score: 0.5029
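The ROC-AUC above is computed from `test_predictions`, which holds thresholded 0/1 labels, while the probabilities collected in `test_pred_probas` are never used. ROC-AUC is a ranking metric, so it is normally computed on the continuous scores. The dependency-free sketch below illustrates the difference on toy data; `pairwise_auc` is a hypothetical helper that computes the same quantity as `roc_auc_score` for binary labels (the probability that a positive example is scored above a negative one, with ties counted as one half).

```python
def pairwise_auc(labels, scores):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Toy data: the scores rank every positive above every negative,
# but all of them fall below the 0.5 threshold.
labels = [0, 0, 1, 1]
probas = [0.20, 0.30, 0.40, 0.45]
hard_predictions = [int(p > 0.5) for p in probas]

auc_from_probas = pairwise_auc(labels, probas)
auc_from_hard = pairwise_auc(labels, hard_predictions)

print(f"AUC on probabilities: {auc_from_probas:.2f}")
print(f"AUC on thresholded predictions: {auc_from_hard:.2f}")
```

In the notebook this would correspond to passing the flattened sigmoid outputs, rather than `test_predictions`, as the second argument of `roc_auc_score`. The reported value would then likely change, so the number above should be read with that caveat.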


In [193]:
1 in test_predictions

True

In [None]:
# Saving the state_dict is more portable than pickling the whole module object
torch.save(model.state_dict(), "../data/siamese_model_v1.pt")

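### Bonus task: fine-tuning Sentence-BERT (SBERT)

SBERT also uses a Siamese network to derive semantically meaningful sentence embeddings. Note that the code below feeds both sentences jointly to `AutoModelForSequenceClassification` as a single sequence pair, so it fine-tunes the SBERT checkpoint as a pair classifier rather than as a Siamese bi-encoder.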

In [2]:
pip install transformers



In [195]:
# SBert Finetuning

In [7]:
import torch
from torch.utils.data import Dataset, DataLoader
from datasets import load_dataset
from transformers import AutoTokenizer

In [4]:

model_name = "sentence-transformers/bert-base-nli-mean-tokens"  # or any other model you want to use
tokenizer = AutoTokenizer.from_pretrained(model_name)




In [8]:

dataset = load_dataset("paws", "labeled_final")
train_dataset = dataset['train']
val_dataset = dataset['validation']
test_dataset = dataset['test']


In [9]:
train_dataset

Dataset({
    features: ['id', 'sentence1', 'sentence2', 'label'],
    num_rows: 49401
})

In [18]:

class SBERTDataset(Dataset):
    def __init__(self, dataset):
        self.sentences1 = dataset["sentence1"]
        self.sentences2 = dataset["sentence2"]
        self.labels = dataset["label"]

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        
        encoding = tokenizer(self.sentences1[idx], self.sentences2[idx],
                             padding='max_length', truncation=True,
                             max_length=128, return_tensors='pt')
        return {
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'labels': torch.tensor(self.labels[idx], dtype=torch.long)
        }


batch_size = 64  
train_loader = DataLoader(SBERTDataset(train_dataset), batch_size=batch_size, shuffle=True)
val_loader = DataLoader(SBERTDataset(val_dataset), batch_size=batch_size)
test_loader = DataLoader(SBERTDataset(test_dataset), batch_size=batch_size)
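`SBERTDataset` encodes the two sentences together as one sequence pair. A Siamese (bi-encoder) setup would instead encode each sentence separately with the same model, pool the token embeddings into one vector per sentence, and compare the two vectors. The checkpoint name `bert-base-nli-mean-tokens` suggests mean pooling, which is sketched below. Random tensors stand in for the encoder's token embeddings, and the helper name `mean_pool` is hypothetical.

```python
import torch
import torch.nn.functional as F


def mean_pool(token_embeddings, attention_mask):
    """Average token embeddings over the non-padded positions.

    token_embeddings: (batch, seq_len, hidden)
    attention_mask:   (batch, seq_len), 1 for real tokens and 0 for padding
    """
    mask = attention_mask.unsqueeze(-1).to(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1e-9)  # avoid division by zero
    return summed / counts


# Stand-ins for the encoder outputs of sentence A and sentence B.
torch.manual_seed(0)
emb_a = torch.randn(2, 5, 8)
emb_b = torch.randn(2, 5, 8)
mask_a = torch.tensor([[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]])
mask_b = torch.tensor([[1, 1, 0, 0, 0], [1, 1, 1, 1, 0]])

sent_a = mean_pool(emb_a, mask_a)
sent_b = mean_pool(emb_b, mask_b)
cosine = F.cosine_similarity(sent_a, sent_b, dim=1)
print(cosine.shape)
```

With a real encoder, `emb_a` and `emb_b` would be the last hidden states returned for each sentence, and the cosine similarity (or a classifier on top of the two pooled vectors) would serve as the similarity score.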

In [12]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [19]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2).to(device)


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at sentence-transformers/bert-base-nli-mean-tokens and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [20]:
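from torch.optim import AdamW  # the AdamW class in transformers is deprecated in favor of this one


# weight_decay=0.0 mirrors the default of the deprecated transformers implementation
optimizer = AdamW(model.parameters(), lr=5e-5, weight_decay=0.0)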

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

num_epochs = 2 

for epoch in range(num_epochs):
    model.train()
    total_train_loss = 0

    for batch in train_loader:
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)

        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        total_train_loss += loss.item()

    avg_train_loss = total_train_loss / len(train_loader)
    print(f"Epoch {epoch + 1}/{num_epochs}, Training Loss: {avg_train_loss:.4f}")

    # Validation loop
    model.eval()
    total_val_loss = 0

    with torch.no_grad():
        for batch in val_loader:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)

            outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
            loss = outputs.loss

            total_val_loss += loss.item()

    avg_val_loss = total_val_loss / len(val_loader)
    print(f"Epoch {epoch + 1}/{num_epochs}, Validation Loss: {avg_val_loss:.4f}")


Epoch 1/2, Training Loss: 0.4390
Epoch 1/2, Validation Loss: 0.2736
Epoch 2/2, Training Loss: 0.1761
Epoch 2/2, Validation Loss: 0.2385


In [21]:
from sklearn.metrics import roc_auc_score, accuracy_score
import numpy as np


model.eval()
test_predictions = []
test_labels = []

with torch.no_grad():
    for batch in test_loader:
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)

        outputs = model(input_ids, attention_mask=attention_mask)
        logits = outputs.logits

        
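        # Note: sigmoid treats each logit independently; softmax over the two logits would give class probabilities
        probabilities = torch.sigmoid(logits).cpu().numpy()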
        test_predictions.extend(probabilities[:, 1])  
        test_labels.extend(labels.cpu().numpy())

# Compute metrics
test_predictions = np.array(test_predictions)
test_labels = np.array(test_labels)
auc = roc_auc_score(test_labels, test_predictions)
accuracy = accuracy_score(test_labels, (test_predictions > 0.5).astype(int)) 

print(f"Test AUC: {auc:.4f}")
print(f"Test Accuracy: {accuracy:.4f}")


Test AUC: 0.9658
Test Accuracy: 0.9077
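The evaluation cell above applies `torch.sigmoid` to the two-class logits and thresholds the second column at 0.5. For a model trained with cross-entropy over two logits, class probabilities come from a softmax across the logits, and thresholding the softmax probability of class 1 at 0.5 is equivalent to taking the argmax. The dependency-free sketch below shows on toy logits that the two approaches can disagree; the helper functions are written out by hand for illustration only.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of logits."""
    shift = max(row)
    exps = [math.exp(x - shift) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


# Toy logits for two examples: columns are class 0 and class 1.
logits = [[2.0, 1.0], [-1.0, -0.5]]

p_softmax = [softmax(row)[1] for row in logits]
p_sigmoid = [sigmoid(row[1]) for row in logits]

pred_softmax = [int(p > 0.5) for p in p_softmax]
pred_sigmoid = [int(p > 0.5) for p in p_sigmoid]
pred_argmax = [max(range(len(row)), key=row.__getitem__) for row in logits]

print("softmax:", pred_softmax, "sigmoid:", pred_sigmoid, "argmax:", pred_argmax)
```

In the notebook the corresponding expression would be `torch.softmax(logits, dim=1)[:, 1]`. The metrics reported above were produced with the sigmoid variant and might shift somewhat if the cell were rerun with softmax.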
