# Overview

This notebook works through the week 2 task from the MLX apprenticeship: building a multi-stage search model with a two-tower architecture and deploying it via Gradio and tmux.

# Initial Imports

In [1]:
from datasets import load_dataset_builder, load_dataset # HuggingFace
import pandas as pd
import csv
import torch
import string
import tqdm

# Load Dataset

In [2]:
dataset = load_dataset("ms_marco", 'v1.1', split="train")
# df_train = pd.DataFrame(dataset)
# Restrict dataset for testing purposes
df_train = pd.DataFrame(dataset)[:100]

# Tokenise via SentencePiece

## Prepare data

In [3]:
import sentencepiece as spm
from tokenizer import prepare_sentencepiece_dataset, train_sentencepiece

# Write a csv file to disk, in the format expected by the SentencePieceTrainer
prepare_sentencepiece_dataset(df_train, output_file = 'sentence_piece_input.csv')
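
`prepare_sentencepiece_dataset` lives in the local `tokenizer` module and isn't shown here. Below is a minimal sketch of what such a helper might do, assuming it writes every query and passage as its own single-column CSV row. The function name `write_sentencepiece_input` and the row layout are illustrative assumptions, based on the `query` and `passages['passage_text']` fields of MS MARCO v1.1.

```python
import csv

def write_sentencepiece_input(rows, output_file):
    """Write each query and passage text as its own single-column CSV row.

    `rows` is an iterable of dicts shaped like MS MARCO v1.1 records:
    {'query': str, 'passages': {'passage_text': [str, ...]}}.
    Returns the number of sentences written.
    """
    n_written = 0
    with open(output_file, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        for row in rows:
            sentences = [row['query']] + list(row['passages']['passage_text'])
            for sentence in sentences:
                # Collapse runs of whitespace (including newlines) so each
                # sentence stays on a single line
                writer.writerow([' '.join(str(sentence).split())])
                n_written += 1
    return n_written
```

With a DataFrame, something like `write_sentencepiece_input(df_train.to_dict('records'), 'sentence_piece_input.csv')` would feed it the same rows.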

## Train SP Model

In [4]:
# Define parameters for SP training
from config import Config
input_file = 'sentence_piece_input.csv'  # not `input`, which would shadow the built-in
model_prefix = 'mymodel'
vocab_size = Config.SP_VOCAB_SIZE
character_coverage = Config.SP_CHARACTER_COVERAGE
model_type = Config.SP_MODEL_TYPE

train_sentencepiece(input_file, model_prefix, vocab_size, character_coverage, model_type)

print("Model trained and saved as mymodel.model and mymodel.vocab!")

Model trained and saved as mymodel.model and mymodel.vocab!


sentencepiece_trainer.cc(177) LOG(INFO) Running command: --input=sentence_piece_input.csv --model_prefix=mymodel --vocab_size=1000 --character_coverage=0.999 --model_type=unigram
sentencepiece_trainer.cc(77) LOG(INFO) Starts training with : 
trainer_spec {
  input: sentence_piece_input.csv
  input_format: 
  model_prefix: mymodel
  model_type: UNIGRAM
  vocab_size: 1000
  self_test_sample_size: 0
  character_coverage: 0.999
  input_sentence_size: 0
  shuffle_input_sentence: 1
  seed_sentencepiece_size: 1000000
  shrinking_factor: 0.75
  max_sentence_length: 4192
  num_threads: 16
  num_sub_iterations: 2
  max_sentencepiece_length: 16
  split_by_unicode_script: 1
  split_by_number: 1
  split_by_whitespace: 1
  split_digits: 0
  pretokenization_delimiter: 
  treat_whitespace_as_suffix: 0
  allow_whitespace_only_pieces: 0
  required_chars: 
  byte_fallback: 0
  vocabulary_output_piece_score: 1
  train_extremely_large_corpus: 0
  hard_vocab_limit: 1
  use_all_vocab: 0
  unk_id: 0
  bos_id:

## Run on Dataset

In [5]:
import sentencepiece as spm
# Load the trained SentencePiece model
sp = spm.SentencePieceProcessor()
sp.Load('mymodel.model')
# Read in prepared SP input
sentence_piece_input = pd.read_csv('sentence_piece_input.csv', header=None, names=['sentence'])
# Tokenize each sentence into tokens and token ids
sentence_piece_input['tokenized'] = sentence_piece_input['sentence'].apply(lambda x: sp.EncodeAsPieces(str(x)))
sentence_piece_input['tokenized_ids'] = sentence_piece_input['sentence'].apply(lambda x: sp.EncodeAsIds(str(x)))
sentence_piece_input.to_csv('ms_marco_tokenised.csv')

# Output Token Embeddings

## Construct Word2Vec Dataset via CBOW

Some W2V Notes
- generate CBOW table
- initialise embedding matrix and linear layer
- for each loop:
    - grab embedding vectors for context words
    - sum into one embedding vector
    - multiply by linear layer
    - softmax the result
    - calc loss against target
    - backprop
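
The first note, generating the CBOW table, amounts to sliding a window over each tokenised sentence. Here is a minimal sketch, assuming `W2VData` does something along these lines. The function name `cbow_pairs` and the choice to skip positions without a full window are illustrative, not taken from `two_tower_datasets`.

```python
def cbow_pairs(token_ids, window_size):
    """Return (context_ids, target_id) pairs for one tokenised sentence.

    Only positions with a full window on both sides are used, so every
    context has exactly 2 * window_size tokens and needs no padding.
    """
    pairs = []
    for i in range(window_size, len(token_ids) - window_size):
        context = token_ids[i - window_size:i] + token_ids[i + 1:i + window_size + 1]
        pairs.append((context, token_ids[i]))
    return pairs

print(cbow_pairs([10, 11, 12, 13, 14], window_size=2))
# [([10, 11, 13, 14], 12)]
```

Fixed-length contexts keep batching simple, at the cost of dropping targets near the start and end of each sentence.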

In [6]:
from two_tower_datasets import W2VData

dataset = W2VData(sentence_piece_input, Config.CBOW_WINDOW_SIZE)

0.0
0.0012285012285012285
0.002457002457002457
...

In [7]:
# Set a high batch size for the DataLoader
dataloader = torch.utils.data.DataLoader(dataset, batch_size=1024, shuffle=True)

## Initialise W2V Model

In [8]:
from model import CBOW
vocab_size = sp.GetPieceSize()
# Initialise CBOW model (vocab_size x embedding_dim)
cbow = CBOW(vocab_size, Config.W2V_EMBEDDING_DIM)
loss_function = torch.nn.NLLLoss()
optimizer = torch.optim.SGD(cbow.parameters(), lr=0.001)
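
The `CBOW` class comes from the local `model` module and isn't shown in this notebook. Below is a minimal sketch of a model that would fit the cell above, assuming it sums the context embeddings and feeds them through one linear layer. The class name `CBOWSketch` and the toy sizes are hypothetical; only the `embeddings` attribute name is taken from how the notebook later reads `cbow.embeddings.weight`. Since `torch.nn.NLLLoss` expects log-probabilities, the sketch returns `log_softmax` output rather than a plain softmax.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CBOWSketch(nn.Module):
    """Hypothetical stand-in for the CBOW class in model.py."""

    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear = nn.Linear(embedding_dim, vocab_size)

    def forward(self, context_ids):
        # context_ids: (batch, n_context) token ids
        summed = self.embeddings(context_ids).sum(dim=1)  # (batch, embedding_dim)
        logits = self.linear(summed)                      # (batch, vocab_size)
        # NLLLoss expects log-probabilities, hence log_softmax
        return F.log_softmax(logits, dim=1)

torch.manual_seed(0)
sketch = CBOWSketch(vocab_size=50, embedding_dim=8)
context = torch.randint(0, 50, (4, 4))   # 4 examples, 4 context tokens each
target = torch.randint(0, 50, (4,))      # one target token per example
log_probs = sketch(context)
loss = nn.NLLLoss()(log_probs, target)
loss.backward()                          # gradients reach the embedding matrix
```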

## Train Word2Vec Model (to give token embeddings)

In [9]:
from train import train_cbow
# Run CBOW training, to get embedding matrix
# This will be passed to two-tower model
train_cbow(n_epochs=1, model=cbow, loss_function=loss_function, optimizer=optimizer, dataloader=dataloader)

Epoch 1/1: 100%|██████████| 118/118 [00:02<00:00, 56.79batch/s]

Epoch 1/1, Loss: 1016.2908248901367





Now we have trained an embedding matrix via the CBOW method, giving us a (vocab_size, embedding_dim) matrix. We have two options now:
1. Use an RNN/LSTM to convert these token embeddings into sentence embeddings, for all of our query and document sentences. Follow this up with a two-tower architecture.
2. Skip the sentence embedding step, and use the embedding matrix directly in a two-tower (RNN/LSTM) architecture. 

In this notebook I'll do the latter, because of time constraints, the simpler architecture, and possibly better performance, at the cost of longer training time (I think).
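
To make the second option concrete, here is a rough sketch of a two-tower model that embeds token ids with the CBOW matrix and runs a separate LSTM per tower. It is not the `TwoTowerModel` from `model.py`: the class name `TwoTowerSketch`, the use of LSTMs with the last hidden state, and the toy sizes are all assumptions, and padding is ignored for brevity.

```python
import torch
import torch.nn as nn

class TwoTowerSketch(nn.Module):
    """Hypothetical two-tower encoder; not the TwoTowerModel from model.py."""

    def __init__(self, embedding_matrix, hidden_size, output_size):
        super().__init__()
        # Shared token embeddings initialised from the CBOW matrix; freeze=False
        # lets them be fine-tuned together with the towers
        self.embedding = nn.Embedding.from_pretrained(embedding_matrix, freeze=False)
        emb_dim = embedding_matrix.shape[1]
        self.query_rnn = nn.LSTM(emb_dim, hidden_size, batch_first=True)
        self.doc_rnn = nn.LSTM(emb_dim, hidden_size, batch_first=True)
        self.query_out = nn.Linear(hidden_size, output_size)
        self.doc_out = nn.Linear(hidden_size, output_size)

    def encode(self, token_ids, rnn, out):
        _, (h_n, _) = rnn(self.embedding(token_ids))  # h_n: (1, batch, hidden)
        return out(h_n[-1])                           # (batch, output_size)

    def forward(self, query_ids, doc_ids):
        return (self.encode(query_ids, self.query_rnn, self.query_out),
                self.encode(doc_ids, self.doc_rnn, self.doc_out))

torch.manual_seed(0)
towers = TwoTowerSketch(torch.randn(50, 8), hidden_size=16, output_size=4)
q_emb, d_emb = towers(torch.randint(0, 50, (3, 5)), torch.randint(0, 50, (3, 12)))
```

Queries and documents can have different lengths because each tower reduces its sequence to a single fixed-size vector.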

# Build Dataset for Two-Towers Architecture

In [10]:
# from two_tower_datasets import two_tower_dataset
# Use faster version of this function
from two_tower_datasets import two_tower_dataset_optimized

# Reload MS Marco dataset, to create two-tower dataset
dataset = load_dataset("ms_marco", 'v1.1', split="train")

# Again, restrict training data
df_train = pd.DataFrame(dataset)[:100]
print(len(df_train))

# The non-optimised function also uses the positive and negative user
# labels, rather than a boolean for whether Bing returned the doc
# result_df = two_tower_dataset(df_train)
result_df = two_tower_dataset_optimized(df_train)

100
0.0
0.01
0.02
...
0.99


## Tokenise queries and passages

In [11]:
result_df['query'] = result_df['query'].apply(lambda x: sp.EncodeAsIds(str(x)))
result_df['passage_text'] = result_df['passage_text'].apply(lambda x: sp.EncodeAsIds(str(x)))

## Convert to dataset and dataloader

In [12]:
from torch.utils.data import DataLoader
from two_tower_datasets import TwoTowerData, collate_fn, pad_sequence

two_tower_dataset = TwoTowerData(result_df)
batch_size = 512
two_tower_dataloader = DataLoader(two_tower_dataset, batch_size = batch_size, shuffle=True, collate_fn=collate_fn)
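
A custom `collate_fn` is needed because tokenised queries and passages have different lengths and cannot be stacked directly. The real `collate_fn` is in `two_tower_datasets` and isn't shown, so the sketch below is only one plausible shape for it. It assumes each example is a `(query_ids, passage_ids, label)` tuple and pads with id 0; both are assumptions, as is the name `collate_sketch`.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

def collate_sketch(batch, pad_id=0):
    """Pad variable-length (query_ids, passage_ids, label) examples into tensors."""
    queries, passages, labels = zip(*batch)
    query_batch = pad_sequence([torch.tensor(q) for q in queries],
                               batch_first=True, padding_value=pad_id)
    passage_batch = pad_sequence([torch.tensor(p) for p in passages],
                                 batch_first=True, padding_value=pad_id)
    return query_batch, passage_batch, torch.tensor(labels, dtype=torch.float)

batch = [([5, 6], [7, 8, 9, 10], 1), ([3], [4, 4], 0)]
q, p, y = collate_sketch(batch)
print(q.tolist())  # [[5, 6], [3, 0]]
```

Each batch is padded only to the length of its own longest sequence, which wastes less compute than padding everything to a global maximum.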

# Initialise Two-Tower and Train

In [13]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from model import TwoTowerModel, CBOW
from loss import contrastive_loss
from train import train_two_tower

# Take a detached copy of the trained CBOW embedding matrix
embedding_weights = cbow.embeddings.weight.detach().clone()
# Initialise two-tower model
model = TwoTowerModel(
    embedding_matrix=embedding_weights,
    hidden_size=Config.TWO_TOWER_HIDDEN_DIM,
    output_size=Config.TWO_TOWER_OUTPUT_DIM)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
n_epochs_two_tower = 2

# Run two-tower training
train_two_tower(n_epochs_two_tower, model, contrastive_loss, optimizer, two_tower_dataloader)

Epoch 1/2: 100%|██████████| 4/4 [00:17<00:00,  4.42s/batch]
Epoch 2/2: 100%|██████████| 4/4 [00:24<00:00,  6.06s/batch]
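
The `contrastive_loss` passed to `train_two_tower` comes from the local `loss` module and isn't shown. For reference, one common margin-based form over cosine distance is sketched below. The function name `contrastive_loss_sketch`, the margin of 0.5, and the use of `1 - cosine similarity` as the distance are assumptions, not the notebook's actual loss.

```python
import torch
import torch.nn.functional as F

def contrastive_loss_sketch(query_emb, doc_emb, labels, margin=0.5):
    """Pull matching pairs together and push non-matching pairs apart.

    labels: 1.0 for a relevant (query, doc) pair, 0.0 otherwise.
    Distance is 1 - cosine similarity, so it lies in [0, 2].
    """
    distance = 1.0 - F.cosine_similarity(query_emb, doc_emb, dim=1)
    positive_term = labels * distance.pow(2)
    # Non-matching pairs are only penalised while closer than the margin
    negative_term = (1.0 - labels) * torch.clamp(margin - distance, min=0.0).pow(2)
    return (positive_term + negative_term).mean()

q = torch.tensor([[1.0, 0.0], [1.0, 0.0]])
d = torch.tensor([[1.0, 0.0], [0.0, 1.0]])
labels = torch.tensor([1.0, 0.0])
print(contrastive_loss_sketch(q, d, labels).item())  # 0.0
```

In the toy example the relevant pair is identical and the irrelevant pair is orthogonal, so neither term contributes.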


Speed improvements to training:
1. GPU
2. num_workers in dataloaders
3. mixed precision training
4. set requires_grad = false on embedding (or, do I unfreeze?)
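
The four ideas above could look roughly like this in PyTorch. This is a toy sketch on random data, not the notebook's training loop: the model, dataset, and sizes are made up, and `num_workers=0` is used only so the snippet runs anywhere (a positive value would load batches in background processes).

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# 1. Use a GPU when one is available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# 4. Freeze the pretrained embedding so no gradients are computed for it
embedding = nn.Embedding.from_pretrained(torch.randn(50, 8), freeze=True)
head = nn.Linear(8, 1)
toy_model = nn.Sequential(embedding, head).to(device)

# 2. num_workers > 0 loads batches in worker processes; 0 uses the main process
data = TensorDataset(torch.randint(0, 50, (64, 5)), torch.rand(64, 5, 1))
loader = DataLoader(data, batch_size=16, shuffle=True, num_workers=0,
                    pin_memory=(device.type == 'cuda'))

# Only hand trainable parameters to the optimizer
optimizer = torch.optim.Adam((p for p in toy_model.parameters() if p.requires_grad), lr=0.001)

for token_ids, targets in loader:
    token_ids, targets = token_ids.to(device), targets.to(device)
    optimizer.zero_grad()
    # 3. Mixed precision: autocast the forward pass, enabled only on GPU here
    with torch.autocast(device_type=device.type, enabled=(device.type == 'cuda')):
        loss = nn.functional.mse_loss(toy_model(token_ids), targets)
    loss.backward()
    optimizer.step()
```

On the freeze-or-unfreeze question, freezing is faster and keeps the CBOW embeddings intact, while unfreezing lets them adapt to the retrieval task. With float16 autocast on a GPU, a gradient scaler is usually added as well to avoid underflow.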

# Store Offline Doc Embeddings

In [14]:
import torch.nn.functional as F
import inference
from inference import create_offline_sentence_embeddings

# Ensure model is in evaluation mode
model.eval()
torch.save(model.state_dict(), 'two_tower.pth')

# Test
sentences = list(sentence_piece_input['sentence'].values)
tokenizer = sp

offline_embeddings_dict = create_offline_sentence_embeddings(sentences, model, tokenizer)

0.0012285012285012285
0.002457002457002457
0.0036855036855036856
...

In [15]:
offline_embeddings_dict

{"Since 2007, the RBA's outstanding reputation has been affected by the 'Securency' or NPA scandal. These RBA subsidiaries were involved in bribing overseas officials so that Australia might win lucrative note-printing contracts. The assets of the bank include the gold and foreign exchange reserves of Australia, which is estimated to have a net worth of A$101 billion. Nearly 94% of the RBA's employees work at its headquarters in Sydney, New South Wales and at the Business Resumption Site.": [[-0.10234706103801727,
   -0.03262746334075928,
   0.02449534460902214,
   -0.17178037762641907,
   -0.055499300360679626,
   -0.0063294824212789536,
   -0.07076375186443329,
   -0.19031566381454468,
   0.017942674458026886,
   0.07381822168827057,
   -0.024957720190286636,
   0.07642833888530731,
   -0.04250793531537056,
   -0.03328115865588188,
   -0.02559100277721882,
   -0.022549506276845932,
   0.14754050970077515,
   0.007015685550868511,
   0.08526171743869781,
   -0.008984530344605446,
   0

In [16]:
# Dump offline embeddings in json (doesn't work yet)
import json
converted_dict = {k: [v] if not isinstance(v, list) else v for k, v in offline_embeddings_dict.items()}

with open('offline_embeddings_dict.json', 'w') as f:
    json.dump(converted_dict, f)
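
The notebook doesn't record why the dump fails. One common cause with embedding dictionaries is that some values are tensors, arrays, or NumPy scalars, which `json` cannot serialise. A defensive converter such as the hypothetical `to_jsonable` below turns anything with a `tolist()` method into plain Python lists and floats first; whether this is the actual problem here is an assumption.

```python
import json

def to_jsonable(value):
    """Recursively convert tensors, arrays and NumPy scalars to plain Python types."""
    if hasattr(value, 'tolist'):          # torch.Tensor, np.ndarray, np.float32, ...
        return value.tolist()
    if isinstance(value, dict):
        return {str(k): to_jsonable(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_jsonable(v) for v in value]
    return value

example = {'some passage': [[0.1, -0.2], (0.3, 0.4)]}
print(json.dumps(to_jsonable(example)))
# {"some passage": [[0.1, -0.2], [0.3, 0.4]]}
```

Applied here, it would be `json.dump(to_jsonable(offline_embeddings_dict), f)`.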

# Test Search Manually
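
The helpers `get_query_embedding` and `compute_similarities` used in this section come from the local `inference` module, which isn't shown. The core of the search step is presumably a similarity score between the query vector and every stored document vector, followed by a sort. Below is a dependency-free sketch assuming cosine similarity and flat vectors; the names `cosine` and `rank_documents` are illustrative.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def rank_documents(query_vec, doc_vecs, top_k=10):
    """Return the top_k (document, score) pairs, highest similarity first."""
    scores = {doc: cosine(query_vec, vec) for doc, vec in doc_vecs.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top_k]

docs = {'doc a': [1.0, 0.0], 'doc b': [0.0, 1.0], 'doc c': [1.0, 1.0]}
print([doc for doc, _ in rank_documents([2.0, 0.0], docs, top_k=2)])
# ['doc a', 'doc c']
```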

In [17]:
from inference import get_query_embedding, compute_similarities
query = "Sunlight in August at Orlando lasts for 13 hours and 7 minutes a day on average"
query_embedding = get_query_embedding(query, model, tokenizer)
similarities = compute_similarities(query_embedding, offline_embeddings_dict, model, tokenizer)

# Sort (sentence, similarity) pairs by similarity, highest first,
# and keep the top 10 matches (adjust as needed)
sorted_matches = sorted(similarities.items(), key=lambda item: item[1], reverse=True)
top_matches = sorted_matches[:10]

for match in top_matches:
    print(match)

('Long range weather outlook for Alcudia includes 14 day forecast summary: The outlook for Alcudia in the two weeks ahead shows the average daytime maximum temperature will be around 23°C, with a high for the two weeks of 25°C expected on the afternoon of Sunday 18th.', 0.34296828508377075)
('1. Select a suitable pan for poaching. The pan must be shallow and wide, as the trick to poaching well, without an egg poacher is to gently slip the egg into a wide, shallow pan filled with simmering water. The pan should be able to take about 1.5 liters (2 3/4 pints) of water, or 10cm (4) depth of water. Remove the poached egg with a slotted spoon. Work quickly to transfer each egg onto the plate, letting excess water drip back into the pan. Larousse Gastronomique advises to refresh the egg in cold water and then drain on a cloth.', 0.3104567229747772)
("Method 3 of 5: Using a silicon egg poacher cup. 1. If you have access to a good kitchen store, consider purchasing a small silicon egg poacher c