# Siamese Network with BERT Pooling: Softmax Loss Function

- We train our siamese network with the training data from SemEval 2014.
- We use the **softmax loss function**.
- We then run k-NN search with test queries (previously generated for BM25) to produce test query results.
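
As a rough illustration of the softmax loss named above: in the Sentence-BERT setup, the two pooled sentence embeddings `u` and `v` are combined into `(u, v, |u - v|)`, passed through a linear classifier, and trained with cross-entropy. The sketch below shows that computation in plain Python. The toy vectors and all-zero weights are assumptions for illustration only, and the exact feature combination inside `losses.SoftmaxLoss` may differ between library versions.

```python
import math

def pair_features(u, v):
    """Concatenate (u, v, |u - v|) into one feature vector for the pair."""
    return list(u) + list(v) + [abs(a - b) for a, b in zip(u, v)]

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_loss(u, v, weights, label):
    """Cross-entropy of the true label under a linear classifier.

    weights: one row of length 3 * dim per class (hypothetical toy parameters).
    label:   zero-based class index.
    """
    features = pair_features(u, v)
    logits = [sum(w * x for w, x in zip(row, features)) for row in weights]
    probs = softmax(logits)
    return -math.log(probs[label])

# Toy example: 2-dimensional embeddings, 4 classes, all-zero weights
u, v = [1.0, 0.0], [0.0, 1.0]
weights = [[0.0] * 6 for _ in range(4)]
loss = softmax_loss(u, v, weights, label=0)
print(round(loss, 4))  # uniform prediction over 4 classes: ln(4), i.e. 1.3863
```

With untrained (zero) weights every class gets probability 1/4, so the loss starts at ln(4); training lowers it by moving probability mass toward the correct label.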

## Google Colab setup

This part only runs when the notebook is executed under Google Colab. **Please change the working directory path below in advance!**

In [2]:
# Use Google Colab
use_colab = True

# Is this notebook running on Colab?
# If so, then google.colab package (github.com/googlecolab/colabtools)
# should be available in this environment

# Previous version used importlib, but we could do the same thing with
# just attempting to import google.colab
try:
    from google.colab import drive
    colab_available = True
except ImportError:
    colab_available = False

if use_colab and colab_available:
    drive.mount('/content/drive')
    
    # If there's a package I need to install separately, do it here
    !pip install sentence-transformers==0.3.9 transformers==3.4.0 jsonlines==1.2.0

    # cd to the appropriate working directory under my Google Drive
    %cd '/content/drive/My Drive/CS646_Final_Project/siamese'
    
    # List the directory contents
    !ls

## PyTorch GPU setup

In [4]:
# torch.device / CUDA Setup
import os

import torch

use_cuda = True
use_colab_tpu = False
colab_tpu_available = False

if use_colab_tpu:
    # Colab sets COLAB_TPU_ADDR only when a TPU runtime is attached
    colab_tpu_available = bool(os.environ.get('COLAB_TPU_ADDR'))

if use_cuda and torch.cuda.is_available():
    
    torch_device = torch.device('cuda:0')

    # Set this to True to make your output immediately reproducible
    # Note: https://pytorch.org/docs/stable/notes/randomness.html
    torch.backends.cudnn.deterministic = False
    
    # cuDNN 'benchmark' mode (autotuner) is enabled here: set this to False if you want to measure running times more fairly
    # Note: https://discuss.pytorch.org/t/what-does-torch-backends-cudnn-benchmark-do/5936
    torch.backends.cudnn.benchmark = True
    
    # Faster Host to GPU copies with page-locked memory
    use_pin_memory = True 

    # CUDA libraries version information
    print("CUDA Version: " + str(torch.version.cuda))
    print("cuDNN Version: " + str(torch.backends.cudnn.version()))
    print("CUDA Device Name: " + str(torch.cuda.get_device_name()))
    print("CUDA Capabilities: "+ str(torch.cuda.get_device_capability()))

elif use_colab_tpu and colab_tpu_available:
    # This needs to be installed separately
    # https://github.com/pytorch/xla/blob/master/contrib/colab/getting-started.ipynb
    import torch_xla 
    import torch_xla.core.xla_model as xm

    torch_device = xm.xla_device()
    use_pin_memory = False

else:
    torch_device = torch.device('cpu')
    use_pin_memory = False

CUDA Version: 11.0
cuDNN Version: 8004
CUDA Device Name: GeForce RTX 2080 Ti
CUDA Capabilities: (7, 5)


## Import packages

In [5]:
import os
import random
import json
import pathlib

import sentence_transformers
from sentence_transformers import losses
import numpy as np
import jsonlines

In [6]:
# Random seed settings
random_seed = 646
random.seed(random_seed) # Python
np.random.seed(random_seed) # NumPy
torch.manual_seed(random_seed) # PyTorch

<torch._C.Generator at 0x7f9e628dde58>

## Load the dataset

In [7]:
# 4 labels (1: Relevant, 2: Aspect only, 3: Sentiment only, 4: Not Relevant): Softmax Loss
with open(os.path.join('..', 'data', 'our_datasets_partially_correct_labels', 'laptop_train.json')) as laptop_train_file:
    laptop_train = json.load(laptop_train_file)

with open(os.path.join('..', 'data', 'our_datasets_partially_correct_labels', 'restaurant_train.json')) as restaurants_train_file:
    restaurants_train = json.load(restaurants_train_file)
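
Because the softmax loss is a cross-entropy over class indices, PyTorch expects every label to lie in `[0, num_labels)`. The comment above lists the classes as 1 through 4; whether the JSON files store them that way or zero-based is not shown here, so a quick check before training may be worthwhile. The helper below is a hypothetical sketch, and the sample rows are made up to mirror the shape of the training files.

```python
def check_labels(rows, num_labels):
    """Return the sorted distinct labels, raising if any is outside [0, num_labels)."""
    labels = sorted({row['label'] for row in rows})
    out_of_range = [label for label in labels if not 0 <= label < num_labels]
    if out_of_range:
        raise ValueError(
            'labels %s fall outside [0, %d)' % (out_of_range, num_labels))
    return labels

# Hypothetical rows in the same shape as the training files
sample_rows = [
    {'query': ['battery', 'positive'], 'doc': 'example text', 'label': 0},
    {'query': ['screen', 'negative'], 'doc': 'example text', 'label': 3},
]
print(check_labels(sample_rows, num_labels=4))  # [0, 3]

# In the notebook this could be called as, for example:
# check_labels(laptop_train + restaurants_train, num_labels=4)
```

If the check fails because the labels run from 1 to 4, subtracting 1 from each label when building the `InputExample` objects would be one way to bring them into range.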

### Training set: Joint = Laptop + Restaurants

In [8]:
train_combined_examples = []

for row in laptop_train:
    example = sentence_transformers.InputExample(
        texts=[row['query'][0] + ', ' + row['query'][1], row['doc']], label=row['label'])
    
    train_combined_examples.append(example)

for row in restaurants_train:
    example = sentence_transformers.InputExample(
        texts=[row['query'][0] + ', ' + row['query'][1], row['doc']], label=row['label'])
    
    train_combined_examples.append(example)

In [9]:
print(train_combined_examples[0])

<InputExample> label: 1, texts: charges, positive; It fires up in the morning in less than 30 seconds and I have never had any issues with it freezing.


## Siamese Network with BERT Pooling (SBERT) Model

- We use the pretrained weights released by the BERT-ADA authors.
- Please download and extract them to the same directory as this notebook: https://github.com/deepopinion/domain-adapted-atsc#release-of-bert-language-models-finetuned-on-a-specific-domain
    - **NOTE**: Because BERT-ADA was trained with an older version of `transformers`, you need to add `"model_type": "bert"` to `config.json`.
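
The `config.json` change in the note above can be made by hand or scripted. The following is a minimal sketch using only the standard library; the helper name `add_model_type` is hypothetical, and the path in the usage comment assumes the archive was extracted next to this notebook.

```python
import json
import pathlib

def add_model_type(config_path, model_type='bert'):
    """Add a "model_type" key to a config.json file if it is missing.

    Returns True if the file was modified, False if the key already existed.
    """
    path = pathlib.Path(config_path)
    config = json.loads(path.read_text(encoding='utf-8'))
    if 'model_type' in config:
        return False
    config['model_type'] = model_type
    path.write_text(json.dumps(config, indent=2) + '\n', encoding='utf-8')
    return True

# Example call (assumed location of the extracted weights):
# add_model_type('laptops_and_restaurants_2mio_ep15/config.json')
```

The function leaves the file untouched when the key is already present, so it is safe to run more than once.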

In [10]:
# Load the pretrained BERT-ADA model
# Extract the tar.xz file
#!tar -xf laptops_and_restaurants_2mio_ep15.tar.xz

pretrained_model_name = 'laptops_and_restaurants_2mio_ep15'

In [11]:
sbert_new_model_name = 'sbert_bert_ada_joint_partially_correct_softmax'

In [12]:
word_embedding_model = sentence_transformers.models.Transformer(
    pretrained_model_name, max_seq_length=256)

pooling_model = sentence_transformers.models.Pooling(
    word_embedding_model.get_word_embedding_dimension())

model = sentence_transformers.SentenceTransformer(
    modules=[word_embedding_model, pooling_model])

### Training

In [None]:
# PyTorch DataLoader
train_dataset = sentence_transformers.SentencesDataset(train_combined_examples, model)
train_dataloader = torch.utils.data.DataLoader(train_dataset, shuffle=True, batch_size=16)

# Loss function
# Tuples of (DataLoader, LossFunction)
train_softmax_loss = (train_dataloader, losses.SoftmaxLoss(model, sentence_embedding_dimension=model.get_sentence_embedding_dimension(), num_labels=4))

# Tune the model
model.fit(
    train_objectives=[train_softmax_loss], 
    epochs=20,
    warmup_steps=1200,
    weight_decay=0.01,
    use_amp=True)
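
To give a feel for what `warmup_steps=1200` means: assuming `model.fit` keeps its default linear-warmup scheduler, the learning rate ramps up linearly from zero over the first 1200 steps and then decays linearly to zero by the end of training. The sketch below reproduces only that multiplier in plain Python; the function name is hypothetical, and 1514 iterations per epoch is taken from this particular run.

```python
def warmup_linear_factor(step, warmup_steps, total_steps):
    """Learning-rate multiplier: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

epochs = 20
iterations_per_epoch = 1514  # assumption: value observed for this run
total_steps = epochs * iterations_per_epoch

for step in (0, 600, 1200, total_steps):
    print(step, warmup_linear_factor(step, 1200, total_steps))
```

The multiplier is 0.0 at step 0, 0.5 halfway through warmup, 1.0 at step 1200, and 0.0 again at the final step.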

Training progress bars (widget output not rendered): 20 epochs, 1514 iterations per epoch.

In [None]:
model.save(sbert_new_model_name)

### Play with my own sentences

In [None]:
# Uncomment the following line to load the existing trained model.
# model = sentence_transformers.SentenceTransformer(sbert_new_model_name)

In [None]:
query_embedding = model.encode('Windows 8, Positive')
passage_embedding = model.encode("This laptop's design is amazing")

print("Similarity:", sentence_transformers.util.pytorch_cos_sim(query_embedding, passage_embedding))
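
The similarity printed above is a cosine similarity: the dot product of the two embeddings divided by the product of their lengths. A plain-Python sketch of that formula, assuming non-zero vectors and using made-up toy inputs, looks like this:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors (not real embeddings)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors: 0.0
print(cosine_similarity([3.0, 4.0], [3.0, 4.0]))  # identical vectors: 1.0
```

Values close to 1 mean the query and the passage point in nearly the same direction in embedding space, which is what the k-NN search below ranks on.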

## k-NN Search

In [None]:
# Get the top k matches
top_k = 800

### Generate query results file for `trec_eval` evaluation: Laptop

In [None]:
test_laptop_documents_path = os.path.join('..', 'bm25', 'collection', 'laptop_test.jsonl')
test_laptop_documents_file = jsonlines.open(test_laptop_documents_path)

In [None]:
test_laptop_documents_id = []
test_laptop_documents = []

for d in test_laptop_documents_file:
    test_laptop_documents_id.append(d['id'])
    test_laptop_documents.append(d['contents'])

In [None]:
# Obtain embedding vector of test documents
test_laptop_embeddings = model.encode(test_laptop_documents, convert_to_tensor=True)

In [None]:
test_laptop_queries_path = os.path.join('..', 'bm25', 'test_queries_laptop.txt')
with open(test_laptop_queries_path, 'r') as f:
    # One query per line; drop the trailing newline
    test_laptop_queries = [line.rstrip('\n') for line in f]

In [None]:
test_laptop_result_path = os.path.join('.', 'query_results', sbert_new_model_name, 'top_' + str(top_k))
pathlib.Path(test_laptop_result_path).mkdir(parents=True, exist_ok=True)
test_laptop_result_file = 'test_results_laptop_' + sbert_new_model_name + '.txt'

In [None]:
test_laptop_result_full_path = os.path.join(test_laptop_result_path, test_laptop_result_file)

# Open in write mode so that results from a previous run are overwritten
with open(test_laptop_result_full_path, 'w') as f:
    for q_num, q in enumerate(test_laptop_queries):
        print("Processing query", q_num, ":", q)

        query_embedding = model.encode(q, convert_to_tensor=True)

        cos_scores = sentence_transformers.util.pytorch_cos_sim(query_embedding, test_laptop_embeddings)[0]

        # Retrieve at most top_k documents (fewer if the collection is smaller)
        top_k_retrieved = min(top_k, len(cos_scores))

        # We use torch.topk to find the top_k_retrieved highest scores
        top_results = torch.topk(cos_scores, k=top_k_retrieved)

        # print("\n\n======================\n\n")
        # print("Query:", q)
        # print("\nMost similar sentences in corpus:")

        # for score, idx in zip(top_results[0], top_results[1]):
        #     print(test_laptop_documents[idx], "(Score: %.4f)" % (score))

        # trec_eval query results file: query_id Q0 doc_id rank score run_tag
        for rank, (score, idx) in enumerate(zip(top_results[0], top_results[1]), start=1):
            line = str(q_num+1) + ' Q0 ' + test_laptop_documents_id[idx] + ' ' + str(rank) + ' ' + '%.8f' % score + ' ' + sbert_new_model_name
            f.write("%s\n" % line)
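
The retrieval step above boils down to two things: pick the `k` highest-scoring documents for a query, and write one line per hit in the six-column run format that `trec_eval` reads (`query_id Q0 doc_id rank score run_tag`). The sketch below shows both in plain Python with made-up document IDs and scores; the helper names and the `demo_run` tag are hypothetical.

```python
import heapq

def top_k_indices(scores, k):
    """Indices of the k highest scores, best first (fewer if there are fewer scores)."""
    k = min(k, len(scores))
    return heapq.nlargest(k, range(len(scores)), key=lambda i: scores[i])

def trec_lines(query_id, doc_ids, scores, k, run_tag):
    """Format the top-k hits for one query as trec_eval run-file lines."""
    lines = []
    for rank, idx in enumerate(top_k_indices(scores, k), start=1):
        lines.append('%s Q0 %s %d %.8f %s' % (
            query_id, doc_ids[idx], rank, scores[idx], run_tag))
    return lines

# Toy data (not taken from the test collection)
doc_ids = ['d1', 'd2', 'd3']
scores = [0.1, 0.9, 0.5]
for line in trec_lines(1, doc_ids, scores, k=2, run_tag='demo_run'):
    print(line)
# 1 Q0 d2 1 0.90000000 demo_run
# 1 Q0 d3 2 0.50000000 demo_run
```

Ranks start at 1 and scores are written in descending order, matching what the notebook cell produces with `torch.topk`.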

### Generate query results file for `trec_eval` evaluation: Restaurant

In [None]:
test_restaurants_documents_path = os.path.join('..', 'bm25', 'collection', 'restaurant_test.jsonl')
test_restaurants_documents_file = jsonlines.open(test_restaurants_documents_path)

In [None]:
test_restaurants_documents_id = []
test_restaurants_documents = []

for d in test_restaurants_documents_file:
    test_restaurants_documents_id.append(d['id'])
    test_restaurants_documents.append(d['contents'])

test_restaurants_embeddings = model.encode(test_restaurants_documents, convert_to_tensor=True)

In [None]:
test_restaurants_queries_path = os.path.join('..', 'bm25', 'test_queries_restaurant.txt')
with open(test_restaurants_queries_path, 'r') as f:
    # One query per line; drop the trailing newline
    test_restaurants_queries = [line.rstrip('\n') for line in f]

In [None]:
test_restaurants_result_path = os.path.join('.', 'query_results', sbert_new_model_name, 'top_' + str(top_k))
pathlib.Path(test_restaurants_result_path).mkdir(parents=True, exist_ok=True)
test_restaurants_result_file = 'test_results_restaurant_' + sbert_new_model_name + '.txt'

In [None]:
test_restaurants_result_full_path = os.path.join(test_restaurants_result_path, test_restaurants_result_file)

# Open in write mode so that results from a previous run are overwritten
with open(test_restaurants_result_full_path, 'w') as f:
    for q_num, q in enumerate(test_restaurants_queries):
        print("Processing query", q_num, ":", q)

        query_embedding = model.encode(q, convert_to_tensor=True)

        cos_scores = sentence_transformers.util.pytorch_cos_sim(query_embedding, test_restaurants_embeddings)[0]

        # Retrieve at most top_k documents (fewer if the collection is smaller)
        top_k_retrieved = min(top_k, len(cos_scores))

        # We use torch.topk to find the top_k_retrieved highest scores
        top_results = torch.topk(cos_scores, k=top_k_retrieved)

        # print("\n\n======================\n\n")
        # print("Query:", q)
        # print("\nMost similar sentences in corpus:")

        # for score, idx in zip(top_results[0], top_results[1]):
        #     print(test_restaurants_documents[idx], "(Score: %.4f)" % (score))

        # trec_eval query results file: query_id Q0 doc_id rank score run_tag
        for rank, (score, idx) in enumerate(zip(top_results[0], top_results[1]), start=1):
            line = str(q_num+1) + ' Q0 ' + test_restaurants_documents_id[idx] + ' ' + str(rank) + ' ' + '%.8f' % score + ' ' + sbert_new_model_name
            f.write("%s\n" % line)