## Learning to rank with Transformer models

This notebook demonstrates how to train a cross-encoder and a bi-encoder for product ranking. It is part
of the [commerce product ranking sample app](https://github.com/vespa-engine/sample-apps/tree/master/commerce-product-ranking). 

Blog post series:

* [Improving Product Search with Learning to Rank - part one](https://blog.vespa.ai/improving-product-search-with-ltr/)
* [Improving Product Search with Learning to Rank - part two](https://blog.vespa.ai/improving-product-search-with-ltr-part-two/)

This work uses the largest product relevance dataset released by Amazon:

>We introduce the “Shopping Queries Data Set”, a large dataset of difficult search queries, released with the aim of fostering research in the area of semantic matching of queries and products. For each query, the dataset provides a list of up to 40 potentially relevant results, together with ESCI relevance judgements (Exact, Substitute, Complement, Irrelevant) indicating the relevance of the product to the query. Each query-product pair is accompanied by additional information. The dataset is multilingual, as it contains queries in English, Japanese, and Spanish.

The dataset is found at [amazon-science/esci-data](https://github.com/amazon-science/esci-data). 
The dataset is released under the [Apache 2.0 license](https://github.com/amazon-science/esci-data/blob/main/LICENSE).

In [None]:
!pip3 install --upgrade pandas requests sentence-transformers transformers pyarrow

In [None]:
!git lfs clone https://github.com/amazon-science/esci-data.git

Select the document field to train both models on, and the batch size.

In [None]:
document_field="product_title"
batch_size=128

In [None]:
from sentence_transformers.cross_encoder import CrossEncoder
from sentence_transformers.cross_encoder.evaluation import CERerankingEvaluator
from sentence_transformers import SentenceTransformer, InputExample, losses
from sentence_transformers import evaluation
import os
import pandas as pd
import torch
from torch.utils.data import DataLoader
from sklearn.model_selection import train_test_split

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

Data file pre-processing

In [None]:
df_examples = pd.read_parquet('esci-data/shopping_queries_dataset/shopping_queries_dataset_examples.parquet')

In [None]:
df_products = pd.read_parquet('esci-data/shopping_queries_dataset/shopping_queries_dataset_products.parquet')

In [None]:
df_examples_products = pd.merge(
        df_examples,
        df_products,
        how='left',
        left_on=['product_locale', 'product_id'],
        right_on=['product_locale', 'product_id']
    )

Map the ESCI labels to a numeric gain, which is later used as the training label.

In [None]:
esci_label2gain = {
        'E' : 1,
        'S' : 0.1,
        'C' : 0.01,
        'I' : 0,
    }
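Graded gains like these are also a natural input to NDCG-style evaluation. The sketch below is a minimal illustration, assuming linear gain and a log2 position discount; the helper names `dcg` and `ndcg` and the example rankings are hypothetical and are not part of the notebook.

```python
import math

# Same label-to-gain mapping as in the notebook.
esci_label2gain = {"E": 1, "S": 0.1, "C": 0.01, "I": 0}

def dcg(gains):
    """Discounted cumulative gain with a log2 position discount (ranks start at 1)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))

def ndcg(ranked_labels):
    """NDCG of a ranked list of ESCI labels; 0.0 when no result has positive gain."""
    gains = [esci_label2gain[label] for label in ranked_labels]
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0

# Hypothetical rankings for one query (illustrative labels, not taken from the dataset).
perfect = ndcg(["E", "S", "C", "I"])
swapped = ndcg(["I", "C", "S", "E"])
```

Because an Exact match is worth ten times a Substitute, placing an `E` product low in the list costs far more than misplacing an `S` or `C` product.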

Keep only the small version of the dataset, the train split, and English (US) queries, then add the gain column.

In [None]:
df_examples_products = df_examples_products[df_examples_products['small_version'] == 1]
df_examples_products = df_examples_products[df_examples_products['split'] == "train"]
df_examples_products = df_examples_products[df_examples_products['product_locale'] == 'us']
df_examples_products['gain'] = df_examples_products['esci_label'].apply(lambda esci_label: esci_label2gain[esci_label])

Download the training query ids from our own train/dev split.

In [None]:
train_queries = pd.read_parquet("https://data.vespa.oath.cloud/sample-apps-data/train_query_ids.parquet")['query_id'].unique()

In [None]:
df_examples_products = df_examples_products[['query_id', 'query', 'product_title','product_description', 'product_bullet_point', 'gain']]
df_train = df_examples_products[df_examples_products['query_id'].isin(train_queries)]
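The split is defined by query id, so every product judged for a query lands on the same side. How the downloaded id list was produced is not shown in this notebook; the sketch below is one assumed way to build such a split, with a hypothetical helper `split_query_ids` and made-up toy rows.

```python
import random

def split_query_ids(query_ids, dev_fraction=0.2, seed=42):
    """Split unique query ids into (train_ids, dev_ids). Hypothetical helper."""
    unique_ids = sorted(set(query_ids))
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    rng.shuffle(unique_ids)
    n_dev = int(len(unique_ids) * dev_fraction)
    return set(unique_ids[n_dev:]), set(unique_ids[:n_dev])

# Toy (query_id, product_title) rows, for illustration only.
rows = [(1, "a"), (1, "b"), (2, "c"), (3, "d"), (3, "e"), (4, "f"), (5, "g")]

train_ids, dev_ids = split_query_ids([qid for qid, _ in rows])
train_rows = [row for row in rows if row[0] in train_ids]
dev_rows = [row for row in rows if row[0] in dev_ids]
```

Splitting rows at random instead would let products for the same query appear in both sets, which tends to inflate dev metrics.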

In [None]:
def replace_none(text):
  # Missing text fields are loaded as None; use an empty string instead
  if text is None:
    text = ''
  return text

In [None]:
train_samples = []
for (_, row) in df_train.iterrows():
  train_samples.append(InputExample(texts=[row['query'], replace_none(row[document_field])], label=float(row['gain'])))
train_dataloader = DataLoader(train_samples, shuffle=True, batch_size=batch_size, drop_last=True)

## Train cross-encoder 
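Define the model and training parameters. Notice the number of labels is one: the model outputs a single relevance score per query-document pair, which is trained as a regression against the gain using MSE loss.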

In [None]:
model_name = 'cross-encoder/ms-marco-MiniLM-L-6-v2'
num_epochs = 2
num_labels = 1
max_length = 96
    
model = CrossEncoder(
  model_name, 
  num_labels=num_labels, 
  max_length=max_length, 
  default_activation_function=torch.nn.Identity(), 
  device=device
)
loss_fct=torch.nn.MSELoss()
warmup_steps = 10
lr = 4e-6

In [None]:
model.fit(
  train_dataloader=train_dataloader,
  loss_fct=loss_fct,
  epochs=num_epochs,
  warmup_steps=warmup_steps,  # use the value defined above instead of the library default
  optimizer_params={'lr': lr},
)
model.save("model")
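With a single output, ranking a query's candidates comes down to scoring each (query, text) pair and sorting by score. The sketch below takes the scorer as a callable so it runs without a trained model; `rerank` and `toy_score` are hypothetical names. In this notebook the callable would plausibly be `model.predict`, which accepts a list of text pairs.

```python
def rerank(score_fn, query, titles):
    """Return (title, score) pairs ordered by descending score.

    score_fn takes a list of (query, title) pairs and returns one score per pair.
    """
    pairs = [(query, title) for title in titles]
    scores = [float(s) for s in score_fn(pairs)]
    return sorted(zip(titles, scores), key=lambda item: item[1], reverse=True)

def toy_score(pairs):
    """Stand-in scorer: fraction of query tokens that occur in the title."""
    scores = []
    for query, title in pairs:
        q_tokens = query.lower().split()
        t_tokens = set(title.lower().split())
        scores.append(sum(tok in t_tokens for tok in q_tokens) / len(q_tokens))
    return scores

# Made-up product titles, for illustration only.
ranked = rerank(toy_score, "usb c cable", ["HDMI cable 2m", "USB C cable 1m", "Phone case"])
```

A cross-encoder reads the query and the text together, so it has to be run once per candidate at query time; this is why it is usually applied to a short list of candidates rather than the whole catalog.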

Training is done. Next, upload the model weights to the Hugging Face Hub.

In [None]:
token = 'HF_TOKEN'  # Placeholder: replace with your Hugging Face access token to upload the model

In [None]:
from transformers import AutoModelForSequenceClassification
from transformers import AutoTokenizer

In [None]:
automodel = AutoModelForSequenceClassification.from_pretrained("./model/")

In [None]:
autotokenizer = AutoTokenizer.from_pretrained("./model/")

In [None]:
name = document_field + "_ranker"

In [None]:
automodel.push_to_hub(name, token=token)  # `use_auth_token` is deprecated in recent transformers releases

In [None]:
autotokenizer.push_to_hub(name, token=token)  # `use_auth_token` is deprecated in recent transformers releases

## Train bi-encoder with mean-pooling and Cosine Similarity (angular)

In [None]:
model_name = 'sentence-transformers/all-MiniLM-L6-v2'
model = SentenceTransformer(model_name)
train_loss = losses.CosineSimilarityLoss(model=model)
num_epochs = 2
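Unlike the cross-encoder, the bi-encoder embeds the query and the product text separately. Mean pooling averages the token vectors into one embedding per text, and `CosineSimilarityLoss` (with its default settings) penalizes the squared difference between the cosine similarity of the two embeddings and the label, which here is the gain. The sketch below walks through that arithmetic in plain Python with made-up 2-dimensional vectors; the helper names are hypothetical.

```python
import math

def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask is 1, so padding is ignored."""
    dim = len(token_vectors[0])
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy token embeddings (made-up numbers); the last query token is padding.
query_vec = mean_pool([[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]], [1, 1, 0])
title_vec = mean_pool([[2.0, 2.0], [4.0, 4.0]], [1, 1])

similarity = cosine_similarity(query_vec, title_vec)
gain = 1.0  # label for an Exact match
squared_error = (similarity - gain) ** 2
```

Cosine similarity ignores vector length, so only the direction of the embeddings matters, which is why the heading describes the similarity as angular.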

In [None]:
model.fit(
  train_objectives=[(train_dataloader, train_loss)],
  epochs=num_epochs,
  output_path="bi-encoder",
)

In [None]:
from transformers import BertModel 

In [None]:
autmodel = BertModel.from_pretrained("./bi-encoder")

In [None]:
name = document_field + "_encoder"

In [None]:
autmodel.push_to_hub(name, token=token)  # `use_auth_token` is deprecated in recent transformers releases

In [None]:
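# Load the tokenizer saved with the bi-encoder instead of reusing the cross-encoder tokenizer
autotokenizer = AutoTokenizer.from_pretrained("./bi-encoder")
autotokenizer.push_to_hub(name, token=token)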