#### E2E Demo: Augmented Query Generation and Semantic Search with Cross-Encoder Refinement

    0) Tokenization of long passages using window and stride
    1) Unsupervised Query Generation 
    2) Fine Tuning Bi-Encoder to use for semantic search 
    3) Creation of Faiss index using bi-encoder encoded passages
    4) Incorporation of Cross-Encoder on top of results returned from Bi-Encoder retrieval 

In [2]:
import sys,os,logging, gc
import pandas as pd
from pathlib import Path
from transformers import AutoTokenizer, T5Tokenizer,T5TokenizerFast, T5ForConditionalGeneration
from sentence_transformers import util , CrossEncoder
import torch 
#set up basic logging
logging.basicConfig()
logging.getLogger().setLevel(logging.INFO)
logger =  logging.getLogger(__name__)

#config path 
root_ = os.path.abspath("")
cfg_path = Path(root_) / "config.yaml"

#custom imports
sys.path.append(root_)
from util.misc import LoadCFG, seed_all, create_output_dirs
from util.data import load_data
from util.embedding_ops import query_ops
from util.model_ops import build_model , load_model
from util.index_ops import faiss_index 

#set seed 
SEED = 42
seed_all(SEED)


INFO:faiss.loader:Loading faiss with AVX2 support.
INFO:faiss.loader:Could not load library with AVX2 support due to:
ModuleNotFoundError("No module named 'faiss.swigfaiss_avx2'")
INFO:faiss.loader:Loading faiss.
INFO:faiss.loader:Successfully loaded faiss.


### LOAD CONFIG PARAMETERS

In [3]:

#load cfg params
cfg = LoadCFG(cfg_path, base_dir = root_).load()
DATA_PATH = cfg.data.input.data_path
SAVE_DIR = Path(cfg.data.output.data_save_dir)
MODEL_SAVE_DIR = Path(cfg.model.model_save_dir)
INDEX_SAVE_DIR = Path(cfg.model.ir.faiss_index.out_dir)

NSAMPS = cfg.model.n_samps
TOK_BATCH_SIZE = cfg.model.tokenizer.batch_size
BI_ENCODER_MODEL_NAME = cfg.model.ir.bi_encoder.model_name
EPOCHS = cfg.model.n_epochs
BI_ENCODER_BATCH_SIZE =  cfg.model.ir.bi_encoder.batch_size
CROSS_ENCODER_MODEL_NAME = cfg.model.ir.cross_encoder.model_name

#tokenizer setup
RETURN_TENSORS = cfg.model.tokenizer.return_tensors
PADDING =  cfg.model.tokenizer.padding
RETURN_OVERFLOW_TOKENS= cfg.model.tokenizer.return_overflow_tokens
MAX_SEQ_LEN = cfg.model.tokenizer.max_seq_len
TRUNCATION = cfg.model.tokenizer.truncation 
STRIDE = cfg.model.tokenizer.stride 

#query generator setup
GENQ_MODEL_NAME = cfg.model.query_gen.model_name 
N_QUERIES_PER_PASSAGE =  cfg.model.query_gen.n_queries_per_passage 

#device
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")

#clean up gpu (guarded, since memory_summary expects a CUDA device)
gc.collect()
if torch.cuda.is_available():
    torch.cuda.empty_cache()
    logger.info(torch.cuda.memory_summary(device=DEVICE, abbreviated=True))

#create output dirs
create_output_dirs(SAVE_DIR, MODEL_SAVE_DIR)

|                  PyTorch CUDA memory summary, device ID 0                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 0            |        cudaMalloc retries: 0         |
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |       0 B  |       0 B  |       0 B  |       0 B  |
|---------------------------------------------------------------------------|
| Active memory         |       0 B  |       0 B  |       0 B  |       0 B  |
|---------------------------------------------------------------------------|
| GPU reserved memory   |       0 B  |       0 B  |       0 B  |       0 B  |
|---------------------------------------------------------------------------|
| Non-releasable memory |       0 B  |       0 B  |       0 B  |       0 B  |
|---------------------------------------------------------------------------|

File DOES exist:
	 c:\Users\zjc10\Desktop\Projects\code\MyModules\semantic_search\genq_pinecone\config.yaml
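
The attribute paths read in the cell above imply a nested layout for `config.yaml`. The sketch below mirrors that layout as a plain dict and converts it to attribute-style access. It is only an illustration: `to_namespace` is a hypothetical stand-in for whatever `LoadCFG` does internally, and every value is a placeholder except `n_queries_per_passage` (3) and `n_epochs` (1), which this demo states elsewhere.

```python
from types import SimpleNamespace


def to_namespace(node):
    """Recursively convert nested dicts into attribute-style namespaces."""
    if isinstance(node, dict):
        return SimpleNamespace(**{key: to_namespace(value) for key, value in node.items()})
    return node


# Key names follow the attribute paths read in the cell above.
# Values are placeholders (assumptions), not the author's settings.
raw_cfg = {
    "data": {
        "input": {"data_path": "data/input"},
        "output": {"data_save_dir": "data/output"},
    },
    "model": {
        "model_save_dir": "models",
        "n_samps": 10,
        "n_epochs": 1,
        "tokenizer": {
            "batch_size": 8,
            "return_tensors": "pt",
            "padding": "max_length",
            "return_overflow_tokens": True,
            "max_seq_len": 256,
            "truncation": True,
            "stride": 32,
        },
        "query_gen": {"model_name": "<t5-query-generator>", "n_queries_per_passage": 3},
        "ir": {
            "bi_encoder": {"model_name": "<bi-encoder>", "batch_size": 16},
            "cross_encoder": {"model_name": "<cross-encoder>"},
            "faiss_index": {"out_dir": "index"},
        },
    },
}

cfg = to_namespace(raw_cfg)
print(cfg.model.ir.bi_encoder.model_name)
print(cfg.model.tokenizer.stride)
```

Keeping the tokenizer, query generator, and retrieval settings in separate sub-trees makes it easy to change one stage of the pipeline without touching the others.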


### Load models used in demo

In [4]:
qgen_model = T5ForConditionalGeneration.from_pretrained(GENQ_MODEL_NAME)
tokenizer = T5TokenizerFast.from_pretrained(GENQ_MODEL_NAME, do_lower_case=False)
ir_model = load_model(str(MODEL_SAVE_DIR / 'fine_tuned_biencoder'), DEVICE)
_ce = CrossEncoder(CROSS_ENCODER_MODEL_NAME)

#call eval() so the model runs in inference mode (dropout and other training-only behaviour disabled)
logger.info('forcing model into eval mode')
qgen_model.eval()
model = qgen_model.to(DEVICE)
print(DEVICE)

You are using the legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This means that tokens that come after special tokens will not be properly handled. We recommend you to read the related pull request available at https://github.com/huggingface/transformers/pull/24565
INFO:sentence_transformers.SentenceTransformer:Load pretrained SentenceTransformer: C:\Users\zjc10\Desktop\Projects\data\semantic_search\models\fine_tuned_biencoder
INFO:sentence_transformers.cross_encoder.CrossEncoder:Use pytorch device: cuda
INFO:__main__:forcing model into eval mode


cuda


#### LOAD DATA FOR QUERY GENERATION AND FINE TUNING 

In [6]:
logging.info('loading data from Hugging Face')
df = load_data( load_from_directory=False 
               , hf_dataset_name = 'squad' 
               , split ='train') 
df[:1]


# We want to emulate the scenario in which we do not have queries,
# so we keep only the 'context' column (i.e. all that is passed into the framework is a list of texts).
# Note: set() ordering of strings is not stable across interpreter sessions unless PYTHONHASHSEED is fixed,
# so this slice can select different passages from run to run.
logging.info('extracting text passages to generate queries for')
passages = list(set(df['context']))[2300:2300+ NSAMPS]
passages = [(idx,txt) for idx,txt in enumerate(passages)]
print(len(passages))

INFO:root:loading data from Hugging Face
INFO:root:extracting text passages to generate queries for


33


### Initialize Query Generation Instance 
0) tokenize text into overlapping windows 
1) create a mapping from each original input doc to its tokenized chunks 
2) for each chunk, generate 3 queries using T5
3) return output data with 3 queries per input chunk
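
Steps 0 and 1 can be illustrated without any model. The sketch below splits a token list into overlapping windows and keys each window as `<doc>_<chunk>`, the same shape as the `_index` values shown in the output further down. It assumes `stride` means the number of tokens shared by consecutive windows, which is how the Hugging Face tokenizers document `stride` when `return_overflowing_tokens=True`. The function names are hypothetical and are not part of `query_ops`.

```python
from typing import Dict, List


def window_tokens(tokens: List[str], max_len: int, overlap: int) -> List[List[str]]:
    """Split tokens into windows of at most max_len that share `overlap` tokens."""
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must satisfy 0 <= overlap < max_len")
    step = max_len - overlap
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + max_len])
        # Stop once the current window reaches the end of the sequence.
        if start + max_len >= len(tokens):
            break
        start += step
    return windows


def chunk_documents(docs: List[List[str]], max_len: int, overlap: int) -> Dict[str, List[str]]:
    """Map '<doc>_<chunk>' keys to token windows, keeping the link to the source doc."""
    chunks = {}
    for doc_idx, tokens in enumerate(docs):
        for chunk_idx, window in enumerate(window_tokens(tokens, max_len, overlap)):
            chunks[f"{doc_idx}_{chunk_idx}"] = window
    return chunks


# Single characters stand in for tokens to keep the example readable.
docs = [list("abcdefghij"), list("xyz")]
chunks = chunk_documents(docs, max_len=4, overlap=2)
for key, window in chunks.items():
    print(key, "".join(window))
```

The overlap means a sentence cut at a window boundary still appears whole in the neighbouring chunk, at the cost of some duplicated text across chunks.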

In [7]:
#initialize class to generate queries from passages
logger.info('initalize embedding querier')
queryer = query_ops(
     tokenizer
    , qgen_model 
    , SAVE_DIR
    , n_queries_per_passage = N_QUERIES_PER_PASSAGE
    , save_batch_size = 1000
    , train_batch_size = TOK_BATCH_SIZE    
    , return_tensors = RETURN_TENSORS
    , padding =  PADDING
    , return_overflowing_tokens= RETURN_OVERFLOW_TOKENS
    , max_seq_len = MAX_SEQ_LEN
    , truncation = TRUNCATION 
    , stride = STRIDE 
    )

#generate query,passage key value pairs , save to disk , return paths 
logger.info('generating query, passage key value pairs')
query_passage_outpaths = queryer.gen_queries_from_passages(passages)
                  
#create sentence_transformers compatible training dataset of InputExample() instances
logger.info('creating training data for bi-encoder fine tuning')

#create train df, including docidx, and chunk idx information
train_df , pairs= queryer.create_training_data( query_passage_outpaths)

#create object to handle loading of InputExample() instances in batches of 50 
logger.info('creating loader to handle loading batches of data for model training')

#show mapping of chunks back to original doc 
pd.set_option('display.max_colwidth', None)
train_df[train_df['doc']==0]

INFO:__main__:initalize embedding querier
INFO:__main__:generating query, passage key value pairs
4it [00:23,  5.84s/it]
INFO:__main__:creating training data for bi-encoder fine tuning
INFO:__main__:creating loader to handle loading batches of data for model training


Unnamed: 0,text,doc,chunk,ec_query_txt,_index,passage
0,"On the assumption that intellectual property rights are actual rights Stallman argues that this claim does not live to the historical intentions behind these laws, which in the case of copyright served as a censorship system, and later on, a regulatory model for the printing press that may have benefited authors incidentally, but never interfered with the freedom of average readers. Still referring to copyright, he cites legal literature such as the United States Constitution and case law to demonstrate that it is meant to be an optional and experimental bargain that temporarily trades property rights and free speech for public, not private, benefit in the form of increased artistic production and knowledge. He mentions that ""if copyright were a natural right nothing could justify terminating this right after a certain period of time"".</s><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad>",0,0,how does the copyright rule affect the legal system,0_0,"0_0:On the assumption that intellectual property rights are actual rights Stallman argues that this claim does not live to the historical intentions behind these laws, which in the case of copyright served as a censorship system, and later on, a regulatory model for the printing press that may have benefited authors incidentally, but never interfered with the freedom of average readers. Still referring to copyright, he cites legal literature such as the United States Constitution and case law to demonstrate that it is meant to be an optional and experimental bargain that temporarily trades property rights and free speech for public, not private, benefit in the form of increased artistic production and knowledge. 
He mentions that ""if copyright were a natural right nothing could justify terminating this right after a certain period of time"".</s><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad>"
1,"On the assumption that intellectual property rights are actual rights Stallman argues that this claim does not live to the historical intentions behind these laws, which in the case of copyright served as a censorship system, and later on, a regulatory model for the printing press that may have benefited authors incidentally, but never interfered with the freedom of average readers. Still referring to copyright, he cites legal literature such as the United States Constitution and case law to demonstrate that it is meant to be an optional and experimental bargain that temporarily trades property rights and free speech for public, not private, benefit in the form of increased artistic production and knowledge. He mentions that ""if copyright were a natural right nothing could justify terminating this right after a certain period of time"".</s><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad>",0,0,what is the purpose of copyright law,0_0,"0_0:On the assumption that intellectual property rights are actual rights Stallman argues that this claim does not live to the historical intentions behind these laws, which in the case of copyright served as a censorship system, and later on, a regulatory model for the printing press that may have benefited authors incidentally, but never interfered with the freedom of average readers. Still referring to copyright, he cites legal literature such as the United States Constitution and case law to demonstrate that it is meant to be an optional and experimental bargain that temporarily trades property rights and free speech for public, not private, benefit in the form of increased artistic production and knowledge. 
He mentions that ""if copyright were a natural right nothing could justify terminating this right after a certain period of time"".</s><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad>"
2,"On the assumption that intellectual property rights are actual rights Stallman argues that this claim does not live to the historical intentions behind these laws, which in the case of copyright served as a censorship system, and later on, a regulatory model for the printing press that may have benefited authors incidentally, but never interfered with the freedom of average readers. Still referring to copyright, he cites legal literature such as the United States Constitution and case law to demonstrate that it is meant to be an optional and experimental bargain that temporarily trades property rights and free speech for public, not private, benefit in the form of increased artistic production and knowledge. He mentions that ""if copyright were a natural right nothing could justify terminating this right after a certain period of time"".</s><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad>",0,0,what type of right is copyright,0_0,"0_0:On the assumption that intellectual property rights are actual rights Stallman argues that this claim does not live to the historical intentions behind these laws, which in the case of copyright served as a censorship system, and later on, a regulatory model for the printing press that may have benefited authors incidentally, but never interfered with the freedom of average readers. Still referring to copyright, he cites legal literature such as the United States Constitution and case law to demonstrate that it is meant to be an optional and experimental bargain that temporarily trades property rights and free speech for public, not private, benefit in the form of increased artistic production and knowledge. 
He mentions that ""if copyright were a natural right nothing could justify terminating this right after a certain period of time"".</s><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad>"


In [12]:
train_df[train_df['doc']==7]

Unnamed: 0,text,doc,chunk,ec_query_txt,_index,passage
24,"By August 1805, Napoleon had realized that the strategic situation had changed fundamentally. Facing a potential invasion from his continental enemies, he decided to strike first and turned his army's sights from the English Channel to the Rhine. His basic objective was to destroy the isolated Austrian armies in Southern Germany before their Russian allies could arrive. On 25 September, after great secrecy and feverish marching, 200,000 French troops began to cross the Rhine on a front of 260 km (160 mi). Austrian commander Karl Mack had gathered the greater part of the Austrian army at the fortress of Ulm in Swabia. Napoleon swung his forces to the southeast and the Grande Armée performed an elaborate wheeling movement that outflanked the Austrian positions. The Ulm Maneuver completely surprised General Mack, who belatedly understood that his army had been cut off. After some minor engagement</s>",7,0,why did the ulm maneuver occur,7_0,"7_0:By August 1805, Napoleon had realized that the strategic situation had changed fundamentally. Facing a potential invasion from his continental enemies, he decided to strike first and turned his army's sights from the English Channel to the Rhine. His basic objective was to destroy the isolated Austrian armies in Southern Germany before their Russian allies could arrive. On 25 September, after great secrecy and feverish marching, 200,000 French troops began to cross the Rhine on a front of 260 km (160 mi). Austrian commander Karl Mack had gathered the greater part of the Austrian army at the fortress of Ulm in Swabia. Napoleon swung his forces to the southeast and the Grande Armée performed an elaborate wheeling movement that outflanked the Austrian positions. The Ulm Maneuver completely surprised General Mack, who belatedly understood that his army had been cut off. After some minor engagement</s>"
25,"By August 1805, Napoleon had realized that the strategic situation had changed fundamentally. Facing a potential invasion from his continental enemies, he decided to strike first and turned his army's sights from the English Channel to the Rhine. His basic objective was to destroy the isolated Austrian armies in Southern Germany before their Russian allies could arrive. On 25 September, after great secrecy and feverish marching, 200,000 French troops began to cross the Rhine on a front of 260 km (160 mi). Austrian commander Karl Mack had gathered the greater part of the Austrian army at the fortress of Ulm in Swabia. Napoleon swung his forces to the southeast and the Grande Armée performed an elaborate wheeling movement that outflanked the Austrian positions. The Ulm Maneuver completely surprised General Mack, who belatedly understood that his army had been cut off. After some minor engagement</s>",7,0,what was napoleon's objective in fighting the austrianian army?,7_0,"7_0:By August 1805, Napoleon had realized that the strategic situation had changed fundamentally. Facing a potential invasion from his continental enemies, he decided to strike first and turned his army's sights from the English Channel to the Rhine. His basic objective was to destroy the isolated Austrian armies in Southern Germany before their Russian allies could arrive. On 25 September, after great secrecy and feverish marching, 200,000 French troops began to cross the Rhine on a front of 260 km (160 mi). Austrian commander Karl Mack had gathered the greater part of the Austrian army at the fortress of Ulm in Swabia. Napoleon swung his forces to the southeast and the Grande Armée performed an elaborate wheeling movement that outflanked the Austrian positions. The Ulm Maneuver completely surprised General Mack, who belatedly understood that his army had been cut off. After some minor engagement</s>"
26,"By August 1805, Napoleon had realized that the strategic situation had changed fundamentally. Facing a potential invasion from his continental enemies, he decided to strike first and turned his army's sights from the English Channel to the Rhine. His basic objective was to destroy the isolated Austrian armies in Southern Germany before their Russian allies could arrive. On 25 September, after great secrecy and feverish marching, 200,000 French troops began to cross the Rhine on a front of 260 km (160 mi). Austrian commander Karl Mack had gathered the greater part of the Austrian army at the fortress of Ulm in Swabia. Napoleon swung his forces to the southeast and the Grande Armée performed an elaborate wheeling movement that outflanked the Austrian positions. The Ulm Maneuver completely surprised General Mack, who belatedly understood that his army had been cut off. After some minor engagement</s>",7,0,which strategic advance by napoleon was an example of the grande armee maneuver?,7_0,"7_0:By August 1805, Napoleon had realized that the strategic situation had changed fundamentally. Facing a potential invasion from his continental enemies, he decided to strike first and turned his army's sights from the English Channel to the Rhine. His basic objective was to destroy the isolated Austrian armies in Southern Germany before their Russian allies could arrive. On 25 September, after great secrecy and feverish marching, 200,000 French troops began to cross the Rhine on a front of 260 km (160 mi). Austrian commander Karl Mack had gathered the greater part of the Austrian army at the fortress of Ulm in Swabia. Napoleon swung his forces to the southeast and the Grande Armée performed an elaborate wheeling movement that outflanked the Austrian positions. The Ulm Maneuver completely surprised General Mack, who belatedly understood that his army had been cut off. After some minor engagement</s>"
27,"Napoleon swung his forces to the southeast and the Grande Armée performed an elaborate wheeling movement that outflanked the Austrian positions. The Ulm Maneuver completely surprised General Mack, who belatedly understood that his army had been cut off. After some minor engagements that culminated in the Battle of Ulm, Mack finally surrendered after realizing that there was no way to break out of the French encirclement. For just 2000 French casualties, Napoleon had managed to capture a total of 60,000 Austrian soldiers through his army's rapid marching. The Ulm Campaign is generally regarded as a strategic masterpiece and was influential in the development of the Schlieffen Plan in the late 19th century. For the French, this spectacular victory on land was soured by the decisive victory that the Royal Navy attained at the Battle of Trafalgar on 21 October. After Trafalgar, Britain</s>",7,1,what was the ulm maneuver,7_1,"7_1:Napoleon swung his forces to the southeast and the Grande Armée performed an elaborate wheeling movement that outflanked the Austrian positions. The Ulm Maneuver completely surprised General Mack, who belatedly understood that his army had been cut off. After some minor engagements that culminated in the Battle of Ulm, Mack finally surrendered after realizing that there was no way to break out of the French encirclement. For just 2000 French casualties, Napoleon had managed to capture a total of 60,000 Austrian soldiers through his army's rapid marching. The Ulm Campaign is generally regarded as a strategic masterpiece and was influential in the development of the Schlieffen Plan in the late 19th century. For the French, this spectacular victory on land was soured by the decisive victory that the Royal Navy attained at the Battle of Trafalgar on 21 October. After Trafalgar, Britain</s>"
28,"Napoleon swung his forces to the southeast and the Grande Armée performed an elaborate wheeling movement that outflanked the Austrian positions. The Ulm Maneuver completely surprised General Mack, who belatedly understood that his army had been cut off. After some minor engagements that culminated in the Battle of Ulm, Mack finally surrendered after realizing that there was no way to break out of the French encirclement. For just 2000 French casualties, Napoleon had managed to capture a total of 60,000 Austrian soldiers through his army's rapid marching. The Ulm Campaign is generally regarded as a strategic masterpiece and was influential in the development of the Schlieffen Plan in the late 19th century. For the French, this spectacular victory on land was soured by the decisive victory that the Royal Navy attained at the Battle of Trafalgar on 21 October. After Trafalgar, Britain</s>",7,1,"what was the ulm maneuver, when did it end?",7_1,"7_1:Napoleon swung his forces to the southeast and the Grande Armée performed an elaborate wheeling movement that outflanked the Austrian positions. The Ulm Maneuver completely surprised General Mack, who belatedly understood that his army had been cut off. After some minor engagements that culminated in the Battle of Ulm, Mack finally surrendered after realizing that there was no way to break out of the French encirclement. For just 2000 French casualties, Napoleon had managed to capture a total of 60,000 Austrian soldiers through his army's rapid marching. The Ulm Campaign is generally regarded as a strategic masterpiece and was influential in the development of the Schlieffen Plan in the late 19th century. For the French, this spectacular victory on land was soured by the decisive victory that the Royal Navy attained at the Battle of Trafalgar on 21 October. After Trafalgar, Britain</s>"
29,"Napoleon swung his forces to the southeast and the Grande Armée performed an elaborate wheeling movement that outflanked the Austrian positions. The Ulm Maneuver completely surprised General Mack, who belatedly understood that his army had been cut off. After some minor engagements that culminated in the Battle of Ulm, Mack finally surrendered after realizing that there was no way to break out of the French encirclement. For just 2000 French casualties, Napoleon had managed to capture a total of 60,000 Austrian soldiers through his army's rapid marching. The Ulm Campaign is generally regarded as a strategic masterpiece and was influential in the development of the Schlieffen Plan in the late 19th century. For the French, this spectacular victory on land was soured by the decisive victory that the Royal Navy attained at the Battle of Trafalgar on 21 October. After Trafalgar, Britain</s>",7,1,what was the result of napoleon's invasion of austria?,7_1,"7_1:Napoleon swung his forces to the southeast and the Grande Armée performed an elaborate wheeling movement that outflanked the Austrian positions. The Ulm Maneuver completely surprised General Mack, who belatedly understood that his army had been cut off. After some minor engagements that culminated in the Battle of Ulm, Mack finally surrendered after realizing that there was no way to break out of the French encirclement. For just 2000 French casualties, Napoleon had managed to capture a total of 60,000 Austrian soldiers through his army's rapid marching. The Ulm Campaign is generally regarded as a strategic masterpiece and was influential in the development of the Schlieffen Plan in the late 19th century. For the French, this spectacular victory on land was soured by the decisive victory that the Royal Navy attained at the Battle of Trafalgar on 21 October. After Trafalgar, Britain</s>"
30,"was influential in the development of the Schlieffen Plan in the late 19th century. For the French, this spectacular victory on land was soured by the decisive victory that the Royal Navy attained at the Battle of Trafalgar on 21 October. After Trafalgar, Britain had total domination of the seas for the duration of the Napoleonic Wars.</s><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad>",7,2,why did trafalgar take place,7_2,"7_2:was influential in the development of the Schlieffen Plan in the late 19th century. For the French, this spectacular victory on land was soured by the decisive victory that the Royal Navy attained at the Battle of Trafalgar on 21 October. After Trafalgar, Britain had total domination of the seas for the duration of the Napoleonic Wars.</s><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad>"
31,"was influential in the development of the Schlieffen Plan in the late 19th century. For the French, this spectacular victory on land was soured by the decisive victory that the Royal Navy attained at the Battle of Trafalgar on 21 October. After Trafalgar, Britain had total domination of the seas for the duration of the Napoleonic Wars.</s><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad>",7,2,when was the battle of trafalgar,7_2,"7_2:was influential in the development of the Schlieffen Plan in the late 19th century. For the French, this spectacular victory on land was soured by the decisive victory that the Royal Navy attained at the Battle of Trafalgar on 21 October. After Trafalgar, Britain had total domination of the seas for the duration of the Napoleonic Wars.</s><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad>"
32,"was influential in the development of the Schlieffen Plan in the late 19th century. For the French, this spectacular victory on land was soured by the decisive victory that the Royal Navy attained at the Battle of Trafalgar on 21 October. After Trafalgar, Britain had total domination of the seas for the duration of the Napoleonic Wars.</s><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad>",7,2,which event in the world is most associated with the battle of trafalgar?,7_2,"7_2:was influential in the development of the Schlieffen Plan in the late 19th century. For the French, this spectacular victory on land was soured by the decisive victory that the Royal Navy attained at the Battle of Trafalgar on 21 October. After Trafalgar, Britain had total domination of the seas for the duration of the Napoleonic Wars.</s><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad>"


### Train Information Retrieval model using Augmented (query, chunk) inputs  
###### NOTE: using 1 epoch for the demo; real training should use more than 1

In [6]:
#build and train the bi-encoder to be used for asymmetric search (information retrieval)
#the trained model encodes passages into embeddings that are tuned to be matched against short questions (as opposed to blindly taking the cosine similarity between a short and a long sequence of text)
logger.info('building model')
ir_model  = build_model(pairs
                    , BI_ENCODER_MODEL_NAME
                    , str(MODEL_SAVE_DIR / 'fine_tuned_biencoder')
                    , epochs=EPOCHS
                    , batch_size = BI_ENCODER_BATCH_SIZE
                    )

del ir_model
#reload the fine-tuned bi-encoder from disk; its asymmetric, query-trained embeddings are what the index is built from
ir_model = load_model(str(MODEL_SAVE_DIR / 'fine_tuned_biencoder'), DEVICE)
ir_model.eval()

INFO:__main__:building model
Some weights of MPNetModel were not initialized from the model checkpoint at microsoft/mpnet-base and are newly initialized: ['mpnet.pooler.dense.weight', 'mpnet.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
INFO:sentence_transformers.SentenceTransformer:Use pytorch device: cuda
Iteration: 100%|██████████| 778/778 [12:13<00:00,  1.06it/s]
Epoch: 100%|██████████| 1/1 [12:13<00:00, 733.96s/it]
INFO:sentence_transformers.SentenceTransformer:Save model to C:\Users\zjc10\Desktop\Projects\data\semantic_search\models\fine_tuned_biencoder
INFO:sentence_transformers.SentenceTransformer:Load pretrained SentenceTransformer: C:\Users\zjc10\Desktop\Projects\data\semantic_search\models\fine_tuned_biencoder


SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: MPNetModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)
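
The sketch below shows one common objective for training on (query, chunk) pairs in asymmetric retrieval: an in-batch-negatives ranking loss, similar in spirit to `MultipleNegativesRankingLoss` in sentence-transformers. Each query's paired passage is the positive, and every other passage in the batch acts as a negative. This is an assumption about the kind of loss involved, since the source does not show what `build_model` does internally. The function names, the `scale` value, and the toy embeddings are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def in_batch_negatives_loss(query_embs, passage_embs, scale=20.0):
    """Mean cross-entropy where passage i is the positive for query i
    and all other passages in the batch act as negatives."""
    losses = []
    for i, q in enumerate(query_embs):
        logits = [scale * cosine(q, p) for p in passage_embs]
        # numerically stable log-sum-exp over the batch
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)

# Toy 2-d embeddings (illustrative values only, not model output)
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # each query is close to its own passage
swapped = [[0.1, 0.9], [0.9, 0.1]]   # each query is close to the wrong passage

loss_aligned = in_batch_negatives_loss(queries, aligned)
loss_swapped = in_batch_negatives_loss(queries, swapped)
```

The loss is small when each query sits closest to its own passage and large when it sits closer to another passage in the batch, which is the signal that pulls short questions toward the long chunks they were generated from.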

### Create FAISS INDEX WITH FINE-TUNED BI-ENCODER and optional CROSS ENCODER FOR REFINING RESULTS

##### Note: using basic flat L2 indexing approach (aka brute force); updating to a quantized implementation will speed up querying for large datasets
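
To make "flat L2 (brute force)" concrete, here is a minimal pure-Python sketch of what an exhaustive index does at query time: it computes the squared L2 distance from the query to every stored vector and keeps the `k` smallest. The function name `flat_l2_search` and the toy vectors are illustrative. They are not part of this project's `faiss_index` class or of FAISS itself.

```python
import heapq

def flat_l2_search(vectors, query, k):
    """Exhaustively score every stored vector and return the k nearest
    as (squared_l2_distance, position) pairs, closest first."""
    scored = (
        (sum((a - b) ** 2 for a, b in zip(vec, query)), pos)
        for pos, vec in enumerate(vectors)
    )
    return heapq.nsmallest(k, scored)

# Toy 2-d "passage embeddings" (illustrative values only)
stored = [[0.0, 0.0], [1.0, 1.0], [0.9, 1.2], [5.0, 5.0]]
hits = flat_l2_search(stored, query=[1.0, 1.0], k=2)
```

The cost of each query grows linearly with the number of stored vectors, which is why approximate or quantized indexes become attractive for large collections.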

In [16]:
#create passage embeddings using the new fine-tuned bi-encoder
#define index object parameters
f_idx = faiss_index(train_df[['_index','passage']].drop_duplicates().reset_index(drop=True) #df
                    , ir_model #model
                    , ir_model[1].word_embedding_dimension
                    , text_col = 'passage'
                    , id_col = '_index'
                    , index_outpath = INDEX_SAVE_DIR
                    , cross_encoder_model_name = CROSS_ENCODER_MODEL_NAME
                    )
#create index
index_outp , data_outp, id_outp = f_idx.create_index()

INFO:sentence_transformers.cross_encoder.CrossEncoder:Use pytorch device: cuda
Batches: 100%|██████████| 41/41 [00:50<00:00,  1.24s/it]


##### Query FAISS INDEX without use of cross-encoder

In [16]:
#search index 
query_ = 'who is the king of pop?'
results = f_idx.search(query_,10, refine_with_crossencoder=False)


NameError: name 'f_idx' is not defined

In [19]:
f_idx.search(query_,10, refine_with_crossencoder=True)

Batches: 100%|██████████| 1/1 [00:00<00:00, 60.56it/s]


>>>> Results in Total Time: 0.0260775089263916
(array([[3.549967 , 3.5291357, 3.4253078, 3.2928545, 3.2901855, 3.159152 ,
        3.089158 , 3.0736406, 3.0254188, 3.0220432]], dtype=float32), array([[226, 259, 162, 625, 514, 421, 835, 258, 225, 690]], dtype=int64))
[162, 225, 226, 258, 259, 421, 514, 625, 690, 835]
[3.549967050552368, 3.5291357040405273, 3.4253077507019043, 3.2928545475006104, 3.2901854515075684, 3.159152030944824, 3.089158058166504, 3.0736405849456787, 3.025418758392334, 3.022043228149414]
{'125_0': {'text': '125_0:In Evita (1996), Madonna played the title role of Eva Perón. For a long time, Madonna had desired to play Perón and wrote to director Alan Parker to explain why she would be perfect for the part. She said later, "This is the role I was born to play. I put everything of me into this because it was much more than a role in a movie. It was exhilarating and intimidating at the same time..... And I am prouder of Evita than anything else I have done." After secur

Batches: 100%|██████████| 1/1 [00:00<00:00, 73.91it/s]


[(-10.946937,
  '\t177_0:Various music journalists, critical theorists, and authors have deemed Madonna the most influential female recording artist of all time. Author Carol Clerk wrote that "during her career, Madonna has transcended the term \'pop star\' to become a global cultural icon." Rolling Stone of Spain wrote that "She became the first viral Master of Pop in history, years before the Internet was massively used. Madonna was everywhere; in the almighty music television channels, \'radio formulas\', magazine covers and even in bookshops. A pop dialectic, never seen since The Beatles\'s reign, which allowed her to keep on the edge of tendency and commerciality." Laura Barcella in her book Madonna and Me: Women Writers on the Queen of Pop (2012) wrote that "really, Madonna changed everything the musical landscape, the \'80s look du jour, and most significantly, what a mainstream</s>'),
 (-11.169013,
  '\t469_1:sion in 1936, King George VI and his consort, Queen Elizabeth, allowe
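
The two-stage pattern behind `refine_with_crossencoder=True` can be sketched as follows: a fast first stage returns candidate passages, and a slower pairwise scorer re-orders them. This is a hedged illustration, not the `faiss_index.search` implementation, which is not shown in the source. `rerank` and `toy_pair_score` are hypothetical names, and the word-overlap scorer merely stands in for a cross-encoder's relevance score.

```python
def rerank(query, candidates, pair_score, top_k=None):
    """Score each (query, passage) pair jointly and sort best-first."""
    scored = [(pair_score(query, text), text) for text in candidates]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top_k] if top_k is not None else scored

def toy_pair_score(query, passage):
    """Stand-in scorer: fraction of query words that appear in the passage."""
    q_words = set(query.lower().split())
    p_words = set(passage.lower().split())
    return len(q_words & p_words) / len(q_words)

# Candidates as they might come back from the first-stage retriever
candidates = [
    "the queen of england visited france",
    "he was crowned the king of pop",
    "pop music charts in 1984",
]
ranked = rerank("who is the king of pop", candidates, toy_pair_score, top_k=2)
```

With a real model, `pair_score` would wrap a call such as `CrossEncoder.predict` on `(query, passage)` pairs. The output shape here, a list of `(score, text)` tuples sorted best-first, mirrors the refined results printed above.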

##### Query FAISS INDEX with use of cross-encoder

In [15]:
#del f_idx 
index_ ,data,_ids = faiss_index.load_index(index_outp, data_outp, id_outp)
_ce = CrossEncoder(CROSS_ENCODER_MODEL_NAME)
faiss_index.static_search(index_
                      , ir_model
                      , query_
                      , 10 
                      , data
                      , '_index'
                      , 'passage'
                      , _ce = _ce)

#del index_, data, _ids, _ce

NameError: name 'index_outp' is not defined