# Bayesian Reranker

In [None]:
%pip install -r requirements.txt

In [25]:
import pandas as pd
import boto3
import openai
from azure.core.credentials import AzureKeyCredential
import more_itertools
import chromadb
import pickle
import random
from multiprocessing import Pool
from bayesian_reranker import batch_bayesian_optimization as bbo


## Miniwiki Corpus

In [4]:
bucket = 'sagemaker-us-east-2-344400919253'
path = 'batch_jobs/promptimizer/rag/2025-09-24'
df = pd.read_csv(f's3://{bucket}/{path}/output/0AIiuigo92kaCmY7/demonstrations.csv')

In [6]:
collection_name = 'query'
chroma_client = chromadb.PersistentClient(path="mini_wiki")
if collection_name in [c.name for c in chroma_client.list_collections()]:
    collection = chroma_client.get_collection(name=collection_name)
else:
    collection = chroma_client.create_collection(name=collection_name)
    collection.add(
        ids=[str(x) for x in df['id']],
        documents = df['passage'].to_list()
    )

### Convert question to search queries suitable for a vector db

In [7]:
from bayesian_reranker.prompt_library import wiki_demo as wd

Read the passages and come up with a question

In [14]:

# Two example questions; only the second assignment is used below.
# question = "What do wolves usually eat?"
question = "Tell me about Michael Faraday's personal life and career. Where was he born? Did he marry? Did he have kids?"

1. Expand the query so that it asks for more detail. A more specific query makes it easier to judge whether a passage is relevant.
2. Generate search terms from the expanded query.

In [15]:
improved_question = bbo.call_gpt({'system': 'You are a filing clerk. Your job is come up wi', 
                              'user':  wd.improve_query + question})
print(improved_question)
search_terms = bbo.call_gpt({'system': 'You are a filing clerk. Your job is come up with search terms to help find answers to queries', 
                              'user': wd.search_term_prompt + improved_question})

Provide an in-depth overview of Michael Faraday's personal life and professional achievements. Include details about his birthplace, early life, educational background, and key moments in his career as a scientist. Additionally, cover aspects of his personal life, such as whether he was married, details about his spouse, whether he had children, and insights into his personality, interests, or hobbies beyond his scientific work. Aim for a balanced narrative that highlights both his contributions to science and his personal experiences.


Search the vector db

In [16]:
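import ast

# The model returns the search terms as the text of a Python list;
# literal_eval parses it without executing arbitrary code.
search_results = collection.query(
    query_texts = ast.literal_eval(search_terms),
    n_results = 10
    )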
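
The search terms come back from the model as a string, so they have to be parsed before they reach `collection.query`. A minimal sketch, assuming the model was asked for a Python-style list of strings; `parse_search_terms` is a hypothetical helper and not part of `bayesian_reranker`:

```python
import ast

def parse_search_terms(raw):
    """Parse an LLM reply that should contain a list of strings.

    Hypothetical helper: literal_eval accepts only Python literals, so a
    malformed reply cannot execute code.
    """
    try:
        terms = ast.literal_eval(raw.strip())
    except (ValueError, SyntaxError):
        # Fall back to treating each non-empty line as one search term.
        lines = (line.strip("-* \t") for line in raw.splitlines())
        return [line for line in lines if line]
    if not isinstance(terms, (list, tuple, set)):
        # A bare string or number: wrap it as a single term.
        return [str(terms)]
    return [str(t) for t in terms]

print(parse_search_terms('["Michael Faraday birthplace", "Faraday marriage"]'))
print(parse_search_terms("- Faraday early life\n- Faraday children"))
```

The line-by-line fallback is a design choice for replies that arrive as a bulleted list instead of a literal; dropping it and raising an error would be equally reasonable.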

Build pairs of passages for cross-hop reasoning

In [19]:
S = {}
for i, d in zip(search_results['ids'], search_results['documents']):
    for x, y in zip(i, d):
        S[x] = y  # keyed by id, so a passage returned for several terms appears once
n = len(S)
print("Number of possible pairs of selections", n * (n - 1) // 2)

J = {}
K = list(S.keys())
for i, p in enumerate(K):
    for q in K[i + 1:]:
        # Key each unordered pair as "<id>_<id>" and join the two passages
        J[f'{p}_{q}'] = f"page {p}:\n" + S[p] + f"\n\npage {q}:\n" + S[q]


Number of possible pairs of selections 105
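
The same pairing can be written with `itertools.combinations`, which yields each unordered pair exactly once. A small sketch on toy data; the `passages` dictionary is a made-up stand-in for `S`, with 15 entries to match the count printed above:

```python
from itertools import combinations

# Toy stand-in for S: 15 passage ids mapped to their text.
passages = {str(i): f"text of passage {i}" for i in range(15)}

# One entry per unordered pair, keyed "<id>_<id>" as in the notebook.
pairs = {
    f"{p}_{q}": f"page {p}:\n{passages[p]}\n\npage {q}:\n{passages[q]}"
    for p, q in combinations(passages, 2)
}
print(len(pairs))  # n * (n - 1) // 2 = 105 for n = 15
```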


In [52]:
from importlib import reload
reload(wd)

<module 'bayesian_reranker.prompt_library.wiki_demo' from '/mnt/custom-file-systems/efs/fs-0efe0723c8fe23def_fsap-009815092600551f0/projects/bayesian_reranker/bayesian_reranker/prompt_library/wiki_demo.py'>

## Bayesian Optimization

### Initialize

1. Generate Embeddings for possible "few shots" or "augmentations"

In [22]:
singles = bbo.get_embedding([S[k] for k in K])
E = []
for e in more_itertools.batched([J[k] for k in J.keys()], 50):
    E += bbo.get_embedding(e)
    print(len(E))
 

{'prompt_tokens': 833, 'completion_tokens': 0, 'total_tokens': 833}
{'prompt_tokens': 7096, 'completion_tokens': 0, 'total_tokens': 7096}
50
{'prompt_tokens': 5368, 'completion_tokens': 0, 'total_tokens': 5368}
100
{'prompt_tokens': 421, 'completion_tokens': 0, 'total_tokens': 421}
105
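
The running totals above (50, 100, 105) come from sending the pair texts to the embedding call in chunks of 50. A minimal sketch of that batching pattern using only the standard library; `batched` mimics `more_itertools.batched`, and `fake_embed` is a placeholder that stands in for `bbo.get_embedding`, whose real behavior is not shown here:

```python
from itertools import islice

def batched(items, size):
    """Yield consecutive lists of at most `size` items."""
    it = iter(items)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

def fake_embed(texts):
    """Placeholder embedding call: one dummy vector per input text."""
    return [[float(len(t))] for t in texts]

texts = [f"pair {i}" for i in range(105)]
embeddings = []
for chunk in batched(texts, 50):
    embeddings += fake_embed(chunk)
    print(len(embeddings))  # running total after each batch
```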


2. Save the pair embeddings to disk in case the session is lost

In [26]:
with open('/tmp/query.mbd', 'wb') as f:
    pickle.dump(E, f)

3. Combine all embeddings into a single dictionary

In [27]:
combined_embeddings = {}
combined_text = {}
for k, s in zip(S.keys(), singles):
    combined_embeddings[k] = s
    combined_text[k] = S[k]
pair_embeddings = {}
for k, s in zip(J.keys(),  E):
    combined_embeddings[k] = s
    combined_text[k] = J[k]


4. Start with a random sample

In [33]:
initial_sample = random.sample([s for s in combined_text.keys()], 10)
answer = [bbo.call_gpt({'system': "You are a librarian. Your job is to determine if a reference is relevant to a query",
                          'user': wd.relevance_prompt.format(improved_question, combined_text[s])}) for s in initial_sample]

scored_answers = {}
Q = [bbo.x_relevance(a) for a in answer]
for k, q in zip(initial_sample, Q):
    if q > 0:
        scored_answers[k] = q

### Calculate Batch Expected Improvement

1.  Get best batch for next iteration

In [39]:
B = bbo.best_batch_finder(400, 4)
x2id, unscored_embeddings = B.fit(scored_answers, combined_embeddings)
B.create_batches(unscored_embeddings)
best_idx = B.get_best_batch()


-1000 0.33
0.33 0.65
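
The internals of `bbo.best_batch_finder` are not shown in this notebook. For orientation, the textbook closed-form expected improvement for a single candidate under a Gaussian posterior can be sketched as below. This illustrates the general technique, not the library's implementation, and it scores one candidate at a time, not a batch. The function name and the toy numbers are illustrative assumptions.

```python
import math

def expected_improvement(mu, sigma, best):
    """Closed-form EI for maximization when the score is N(mu, sigma^2).

    mu, sigma: posterior mean and standard deviation for one candidate.
    best: highest score observed so far.
    """
    if sigma <= 0.0:
        # No uncertainty left: the improvement is deterministic.
        return max(mu - best, 0.0)
    z = (mu - best) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (mu - best) * cdf + sigma * pdf

# Same predicted mean, different uncertainty: the less certain
# candidate has the larger expected improvement.
print(expected_improvement(0.30, 0.05, 0.33))
print(expected_improvement(0.30, 0.20, 0.33))
```

Candidates with a high predicted score or high uncertainty both get a large value, which is how the acquisition step trades off exploitation against exploration.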


Take a look at text in our initial sample

In [40]:
for x in initial_sample:
    print(combined_text[x][:200])
    print('---')


page 126:
Michael Faraday's signature

page 116:
Michael Faraday's grave at Highgate Cemetery
---
page 135:
* Agassi, Joseph (1971), Faraday as a Natural Philosopher, Chicago: University of Chicago Press.

page 132:
* Gladstone, J. H. (1872). Michael Faraday, Macmillan.
---
page 111:
Education was another area of service for Faraday. He lectured on the topic in 1854 at the Royal Institution, and in 1862 he appeared before a Public Schools Commission to give his views on 
---
page 74:
Michael Faraday, FRS (September 22, 1791 – August 25, 1867) was an English chemist and physicist (or natural philosopher, in the terminology of that time) who contributed to the fields of e
---
page 80:
Michael Faraday was born in Newington Butts, near present-day South London, England. His family was not well off. His father, James, was a member of the Sandemanian sect of Christianity. Jame
---
page 106:
Michael Faraday meets Father Thames, from Punch (July 21, 1855)

page 111:
Education was another are

Take a look at the *best* text after the first random sample

In [41]:
for x in B.batch_idx[best_idx]:
    print(combined_text[x2id[x]][:200])
    print('---')

As a respected scientist in a nation with strong maritime interests, Faraday spent extensive amounts of time on projects such as the construction and operation of light houses and protecting the botto
---
Michael Faraday's signature
---
page 135:
* Agassi, Joseph (1971), Faraday as a Natural Philosopher, Chicago: University of Chicago Press.

page 84:
Faraday was a devout Christian and a member of the small Sandemanian denomination, 
---
page 132:
* Gladstone, J. H. (1872). Michael Faraday, Macmillan.

page 108:
As a respected scientist in a nation with strong maritime interests, Faraday spent extensive amounts of time on projects suc
---


2. Prepare for multiprocessing

In [44]:
parameters = [{'system': "You are a librarian. Your job is to determine if a reference is relevant to a query",
                          'user': bbo.relevance_prompt.format(improved_question, combined_text[x2id[s]])} for s in B.batch_idx[best_idx]]

3. Evaluate Queries in parallel

In [47]:
# One worker per request; the context manager shuts the pool down afterwards
with Pool(len(parameters)) as p:
    xwer = p.map(bbo.call_gpt, parameters)

for k, q in zip(B.batch_idx[best_idx], [bbo.x_relevance(a) for a in xwer]):
    if q > 0:
        scored_answers[x2id[k]] = q


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
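
The tokenizers warning is triggered by forking worker processes. Because the requests are I/O-bound, a thread pool is a possible alternative that does not fork. A minimal sketch, assuming the call being parallelized is safe to run from several threads; `fake_call_gpt` is a stand-in for `bbo.call_gpt` and the parameters are toy values:

```python
from concurrent.futures import ThreadPoolExecutor

def fake_call_gpt(params):
    """Stand-in for a network call; returns a canned reply."""
    return f"reply to: {params['user']}"

parameters = [
    {"system": "You are a librarian.", "user": f"passage {i}"}
    for i in range(4)
]

# Threads suit I/O-bound requests and do not fork the process.
with ThreadPoolExecutor(max_workers=len(parameters)) as pool:
    # map returns results in the same order as the inputs.
    replies = list(pool.map(fake_call_gpt, parameters))

print(replies[0])
```

Because `map` preserves input order, the replies can still be zipped with the batch indices when recording scores.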

4. Extract best passages
- including pairs
- remove duplicates

In [48]:
df = pd.DataFrame({'id': [s for s in scored_answers.keys()],
                   'score': [scored_answers[s] for s in scored_answers.keys()]})
relevant_ids = []
for x in df.sort_values('score', ascending = False).head(3)['id']:
    relevant_ids += x.split('_')
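
Pair keys contribute two passage ids each, so the same id can appear more than once in `relevant_ids`. A small sketch of the extraction step that removes repeats while keeping the best-first order; `top_passage_ids` and the toy scores are hypothetical and not taken from the run above:

```python
def top_passage_ids(scores, k=3):
    """Return unique passage ids from the k best-scoring keys, best first."""
    ranked = sorted(scores, key=scores.get, reverse=True)[:k]
    # A pair key such as "80_111" expands to its two passage ids.
    ids = [pid for key in ranked for pid in key.split("_")]
    # dict.fromkeys drops repeats and keeps first-seen order.
    return list(dict.fromkeys(ids))

toy_scores = {"74": 0.9, "80_111": 0.8, "74_80": 0.7, "132": 0.1}
print(top_passage_ids(toy_scores))  # ['74', '80', '111']
```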

What the prompt looks like with the new references (`references` is built in step 5 below, which was executed before this cell)

In [57]:
print(wd.rag.format(improved_question, references))

I'm going to give you a question with some references. Try to answer the question. If you use the references, indicate that you have used them. 
If you cannot answer the question from the references or if the references are only partially helpful, indicate that as well. Do not speculate beyond what is contained in the references.

#### QUESTION ###
Provide an in-depth overview of Michael Faraday's personal life and professional achievements. Include details about his birthplace, early life, educational background, and key moments in his career as a scientist. Additionally, cover aspects of his personal life, such as whether he was married, details about his spouse, whether he had children, and insights into his personality, interests, or hobbies beyond his scientific work. Aim for a balanced narrative that highlights both his contributions to science and his personal experiences.

### REFERENCES ###
page 132:
* Gladstone, J. H. (1872). Michael Faraday, Macmillan.

page 74:
Michael Fara

5. Evaluate answer with new additions *so far*

In [53]:
references = "\n\n".join([f"page {i}:\n" + combined_text[i] for i in set(relevant_ids)])

final_answer = bbo.call_gpt({'system': 'You are a helpful assistant.',
                              'user': wd.rag.format(improved_question, references)})


In [54]:
print(final_answer)

Using the provided references, here is an in-depth overview of Michael Faraday's personal life and professional achievements. The information is primarily sourced from pages 74, 80, and 111, with no additional input from pages 106, 116, or 132 as they do not provide further relevant details on Faraday's biography.

---

### **Personal Life**
Michael Faraday was born on **September 22, 1791**, in **Newington Butts**, near present-day South London, England (page 80). Faraday grew up in a poor family; his father, **James Faraday**, was a blacksmith who had moved to London around 1790 from **Outhgill in Westmorland**. His family was connected to the **Sandemanian sect of Christianity**, a strict and devout religious community. Michael was one of **four children** and had only basic schooling. Much of his education came through self-study, beginning when he worked as an **apprentice** at the age of 14 for **George Riebau**, a bookseller and bookbinder (page 80).

Faraday's apprenticeship pr