# Overview

This notebook works through the week 2 task from the MLX apprenticeship: building a multi-stage search model with a two-tower architecture, and deploying it with Gradio and tmux.

# Initial Imports

In [1]:
from datasets import load_dataset_builder, load_dataset # HuggingFace
import pandas as pd
import csv
import torch
import string
import tqdm

# Load Dataset

In [3]:
dataset = load_dataset("ms_marco", 'v1.1', split="train")
df_train = pd.DataFrame(dataset)

# Tokenise via SentencePiece

## Prepare data

In [None]:
from tokenizer import prepare_sentencepiece_dataset, train_sentencepiece

# Write a csv file to disk, in the format expected by the SentencePieceTrainer
prepare_sentencepiece_dataset(df_train, output_file='sentence_piece_input.csv')

## Train SP Model

In [None]:
# Define parameters for SP training
input_file = 'sentence_piece_input.csv'  # renamed from `input` to avoid shadowing the builtin
model_prefix = 'mymodel'
vocab_size = 1000
character_coverage = 0.9990
model_type = 'unigram'

train_sentencepiece(input_file, model_prefix, vocab_size, character_coverage, model_type)

print(f"Model trained and saved as {model_prefix}.model and {model_prefix}.vocab!")

## Run on Dataset

In [None]:
import sentencepiece as spm
# Load the trained SentencePiece model
sp = spm.SentencePieceProcessor()
sp.Load('mymodel.model')
# Read in prepared SP input
sentence_piece_input = pd.read_csv('sentence_piece_input.csv', header=None, names=['sentence'])
# Tokenize each sentence into tokens and token ids
sentence_piece_input['tokenized'] = sentence_piece_input['sentence'].apply(lambda x: sp.EncodeAsPieces(str(x)))
sentence_piece_input['tokenized_ids'] = sentence_piece_input['sentence'].apply(lambda x: sp.EncodeAsIds(str(x)))
sentence_piece_input.to_csv('ms_marco_tokenised.csv')

# Output Token Embeddings

## Construct Word2Vec Dataset via CBOW

Some W2V notes:
- generate the CBOW table of (context, target) rows
- initialise the embedding matrix and a linear layer
- for each training step:
    - grab the embedding vectors for the context words
    - sum them into one embedding vector
    - multiply by the linear layer
    - softmax the result
    - calculate the loss against the target
    - backprop
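
The steps above can be sketched roughly as follows. This is a minimal illustration on a toy token sequence, not the contents of `two_tower_datasets` or `train`: `make_cbow_pairs`, the `CBOW` class, the window size, and all hyperparameters are assumptions made up for the example.

```python
import torch
import torch.nn as nn


def make_cbow_pairs(token_ids, window=2):
    """Build (context, target) rows: `window` tokens on each side of the target."""
    pairs = []
    for i in range(window, len(token_ids) - window):
        context = token_ids[i - window:i] + token_ids[i + 1:i + window + 1]
        pairs.append((context, token_ids[i]))
    return pairs


class CBOW(nn.Module):
    """Hypothetical CBOW model: embedding matrix followed by a linear layer."""

    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)  # embedding matrix
        self.linear = nn.Linear(embedding_dim, vocab_size)         # projects back to the vocab

    def forward(self, context):
        # context: (batch, 2 * window); sum the context embeddings into one vector
        summed = self.embeddings(context).sum(dim=1)
        return self.linear(summed)  # raw logits; the softmax lives inside the loss


torch.manual_seed(0)
token_ids = [3, 7, 1, 9, 4, 7, 2, 5]  # toy sequence, illustrative only
pairs = make_cbow_pairs(token_ids, window=2)
contexts = torch.tensor([c for c, _ in pairs])
targets = torch.tensor([t for _, t in pairs])

cbow = CBOW(vocab_size=10, embedding_dim=8)
loss_function = nn.CrossEntropyLoss()  # log-softmax + negative log-likelihood
optimizer = torch.optim.Adam(cbow.parameters(), lr=0.01)

for step in range(5):
    optimizer.zero_grad()
    logits = cbow(contexts)
    loss = loss_function(logits, targets)  # loss against the target token
    loss.backward()                        # backprop
    optimizer.step()

embedding_matrix = cbow.embeddings.weight.detach()  # (vocab_size, embedding_dim)
```

`nn.CrossEntropyLoss` takes raw logits, so the softmax step from the notes does not appear as a separate call.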

In [None]:
from two_tower_datasets import W2VData

dataset = W2VData(sentence_piece_input, 5)

In [None]:
# Set a high batch size for the DataLoader
dataloader = torch.utils.data.DataLoader(dataset, batch_size=1024, shuffle=True)

## Train Word2Vec Model (to give token embeddings)

In [None]:
from train import train_cbow
# Run CBOW training, to get the embedding matrix
# This will be passed to the two-tower model
# `cbow`, `loss_function` and `optimizer` need to be defined before this cell runs
train_cbow(n_epochs=1, model=cbow, loss_function=loss_function, optimizer=optimizer, dataloader=dataloader)

We now have an embedding matrix of shape (vocab_size, embedding_dim), trained via the CBOW method. There are two options from here:
1. Use an RNN/LSTM to convert the token embeddings into sentence embeddings for every query and document sentence, then feed those into a two-tower architecture.
2. Skip the separate sentence-embedding step and use the embedding matrix directly inside a two-tower (RNN/LSTM) architecture.

In this notebook I'll do the latter: time is limited, the architecture is simpler, and performance may improve, at the cost of longer training time (I think).
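
To make the second option concrete, here is a rough sketch of how the embedding matrix could sit directly inside each tower. It is not the model trained for this notebook: the `Tower` and `TwoTower` classes, the LSTM hidden size, the cosine-similarity scoring, and the random stand-in weights are all assumptions for illustration, and padding is ignored.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Tower(nn.Module):
    """Encodes a batch of token-id sequences into one vector per sequence."""

    def __init__(self, embedding_matrix, hidden_dim):
        super().__init__()
        # Start from a copy of the CBOW weights; freeze=False lets them fine-tune.
        self.embedding = nn.Embedding.from_pretrained(embedding_matrix.clone(), freeze=False)
        self.rnn = nn.LSTM(embedding_matrix.shape[1], hidden_dim, batch_first=True)

    def forward(self, token_ids):
        _, (last_hidden, _) = self.rnn(self.embedding(token_ids))
        return last_hidden.squeeze(0)  # (batch, hidden_dim)


class TwoTower(nn.Module):
    """One tower for queries, one for documents, scored by cosine similarity."""

    def __init__(self, embedding_matrix, hidden_dim=16):
        super().__init__()
        self.query_tower = Tower(embedding_matrix, hidden_dim)
        self.doc_tower = Tower(embedding_matrix, hidden_dim)

    def forward(self, query_ids, doc_ids):
        q = self.query_tower(query_ids)
        d = self.doc_tower(doc_ids)
        return F.cosine_similarity(q, d, dim=1)  # one score per (query, doc) pair


torch.manual_seed(0)
embedding_matrix = torch.randn(1000, 8)      # stand-in for the trained CBOW weights
sketch_model = TwoTower(embedding_matrix)
query_ids = torch.randint(0, 1000, (4, 6))   # 4 queries, 6 tokens each
doc_ids = torch.randint(0, 1000, (4, 20))    # 4 documents, 20 tokens each
scores = sketch_model(query_ids, doc_ids)    # shape (4,)
```

Because each tower only ever sees its own input, the document tower can be run once over the whole corpus and its outputs cached, which is what the offline embeddings step later in the notebook relies on.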

# Build Dataset for Two-Towers Architecture

In [None]:
from two_tower_datasets import two_tower_dataset, two_tower_dataset_optimized

# Reload MS Marco dataset, to create two-tower dataset
dataset = load_dataset("ms_marco", 'v1.1', split="train")
df_train = pd.DataFrame(dataset)
print(len(df_train))
# The non-optimised function also uses the positive and negative user labels,
# rather than just a boolean for whether Bing returned the doc
# result_df = two_tower_dataset(df_train)
result_df = two_tower_dataset_optimized(df_train)

Possible speed improvements to training:
1. Train on a GPU
2. Set num_workers in the dataloaders
3. Use mixed precision training
4. Set requires_grad = False on the embedding (or, do I unfreeze?)
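
A rough sketch of how the four items might look in one PyTorch training loop. The tiny model, random data, and hyperparameters are placeholders invented for the example, not the two-tower training code. It falls back to CPU with mixed precision disabled when no GPU is available.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# 1. Use the GPU if one is present
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
use_amp = device.type == "cuda"

torch.manual_seed(0)
pretrained = torch.randn(1000, 8)  # stand-in for the CBOW embedding matrix
# 4. freeze=True sets requires_grad=False on the embedding weights;
#    freeze=False would be the "unfreeze" alternative
embedding = nn.Embedding.from_pretrained(pretrained, freeze=True)
model = nn.Sequential(embedding, nn.Flatten(), nn.Linear(8 * 6, 1)).to(device)

data = TensorDataset(torch.randint(0, 1000, (64, 6)), torch.rand(64, 1))
# 2. num_workers > 0 loads batches in background processes;
#    0 keeps this sketch portable, raise it on a multi-core machine
loader = DataLoader(data, batch_size=16, shuffle=True, num_workers=0, pin_memory=use_amp)

# Only hand trainable parameters to the optimizer
optimizer = torch.optim.Adam([p for p in model.parameters() if p.requires_grad], lr=1e-3)
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)  # no-op when disabled
loss_function = nn.MSELoss()

for token_ids, labels in loader:
    token_ids, labels = token_ids.to(device), labels.to(device)
    optimizer.zero_grad()
    # 3. Mixed precision: autocast runs eligible ops in reduced precision on the GPU
    with torch.autocast(device_type=device.type, enabled=use_amp):
        loss = loss_function(model(token_ids), labels)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```

Whether to freeze the embedding is a trade-off I have not measured here: freezing removes its gradient computation, while unfreezing lets the embeddings adapt to the search task.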

# Store Offline Doc Embeddings

In [None]:
import torch.nn.functional as F
from inference import create_offline_sentence_embeddings

# Ensure model is in evaluation mode
model.eval()
# torch.save(model, 'two_tower.pth')

# Test
sentences = list(sentence_piece_input['sentence'].values)
tokenizer = sp

offline_embeddings_dict = create_offline_sentence_embeddings(sentences, model, tokenizer)

In [None]:
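# Dump offline embeddings to JSON.
# json cannot serialise tensors or arrays directly, so convert each value with
# .tolist() first (this assumes the values are tensors, arrays, or plain lists)
import json
converted_dict = {k: v.tolist() if hasattr(v, "tolist") else v for k, v in offline_embeddings_dict.items()}

with open('offline_embeddings_dict.json', 'w') as f:
    json.dump(converted_dict, f)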

# Test Search Manually

In [None]:
from inference import get_query_embedding, compute_similarities

query = "Sunlight in August at Orlando lasts for 13 hours and 7 minutes a day on average"
query_embedding = get_query_embedding(query, model, tokenizer)
similarities = compute_similarities(query_embedding, offline_embeddings_dict, model, tokenizer)

# Get top 10 matches (adjust as needed), as (key, similarity) pairs sorted by similarity
sorted_matches = sorted(similarities.items(), key=lambda item: item[1], reverse=True)
top_matches = sorted_matches[:10]

for match in top_matches:
    print(match)
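
For reference, the ranking step can also be done in one matrix multiply instead of sorting a dictionary. This is a sketch under assumptions: it uses cosine similarity, which may or may not be what `compute_similarities` computes, and `doc_embeddings` is a random stand-in for `offline_embeddings_dict`.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
# Stand-in for offline_embeddings_dict: sentence -> embedding vector
doc_embeddings = {f"doc {i}": torch.randn(16) for i in range(100)}
# Fake query embedding: a slightly perturbed copy of one document's vector
query_embedding = doc_embeddings["doc 42"] + 0.01 * torch.randn(16)

sentences = list(doc_embeddings.keys())
# L2-normalise so that a dot product equals cosine similarity
doc_matrix = F.normalize(torch.stack(list(doc_embeddings.values())), dim=1)  # (100, 16)
query_vec = F.normalize(query_embedding, dim=0)                              # (16,)

scores = doc_matrix @ query_vec                  # similarity against every document
top_scores, top_idx = torch.topk(scores, k=10)   # highest scores first
top_matches = [(sentences[i], s) for i, s in zip(top_idx.tolist(), top_scores.tolist())]

for match in top_matches:
    print(match)
```

Stacking the stored embeddings once and reusing `doc_matrix` across queries avoids recomputing anything on the document side at query time.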