# Bayesian Reranker

In [14]:
import pandas as pd
import openai
from azure.core.credentials import AzureKeyCredential
import more_itertools
import chromadb
import pickle
import random
from multiprocessing import Pool
from bayesian_reranker import batch_bayesian_optimization as bbo
from bayesian_reranker.prompt_library import wiki_demo as wd

## Miniwiki Corpus

In [12]:
df = pd.read_csv('mini_wiki.csv')

In [13]:
collection_name = 'query'
chroma_client = chromadb.PersistentClient(path="mini_wiki")
if collection_name in [c.name for c in chroma_client.list_collections()]:
    collection = chroma_client.get_collection(name=collection_name)
else:
    collection = chroma_client.create_collection(name=collection_name)
    collection.add(
        ids=[str(x) for x in df['id']],
        documents = df['passage'].to_list()
    )

### Convert question to search queries suitable for a vector db

Read the passages and come up with a question

In [14]:
question = "What do wolves usually eat?"
question = "Tell me about Michael Faraday's personal life and career. Where was he born? Did he marry? Did he have kids?"
question = 'Compare and contrast economic conditions in Egypt with other countries in that area in general'

1. Expand the query so that it asks for specific details. A more detailed query makes relevance easier to judge.
2. Generate search terms from the improved query.

In [21]:
improved_question = bbo.call_gpt({'system': 'You are a filing clerk. Your job is come up wi', 
                              'user':  wd.improve_query + question})
print(improved_question)
search_terms = bbo.call_gpt({'system': 'You are a filing clerk. Your job is to come up with search terms to help find answers to queries', 
                              'user': wd.search_term_prompt + improved_question})

"Provide a detailed comparison of the economic conditions in Egypt with those of other countries in the Middle East and North Africa (MENA) region. Discuss aspects such as GDP, economic growth rates, key industries, unemployment rates, income inequality, inflation, trade balances, and government policies impacting the economy. Highlight similarities and differences, and provide data or examples to support the analysis."
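The search step below passes the model's reply to the vector store as a list of query strings, so the reply has to be parsed first. Here is a minimal sketch, assuming the model returns a Python-style list literal such as `["a", "b"]`. `parse_search_terms` is a hypothetical helper, not part of `bayesian_reranker`. `ast.literal_eval` accepts only literals, so unlike `eval` it cannot run arbitrary code contained in the reply.

```python
import ast


def parse_search_terms(reply: str) -> list[str]:
    """Parse an LLM reply that should contain a Python list of strings."""
    text = reply.strip()
    try:
        parsed = ast.literal_eval(text)
    except (ValueError, SyntaxError):
        # Not a valid literal: fall back to using the whole reply as one query.
        return [text]
    if isinstance(parsed, str):
        return [parsed]
    if isinstance(parsed, (list, tuple)):
        return [str(term) for term in parsed]
    # Any other literal (number, dict, ...) is unexpected; use the raw text.
    return [text]


terms = parse_search_terms('["Egypt GDP growth", "MENA unemployment rates"]')
fallback = parse_search_terms("Egypt economy overview")
```

The fallback keeps the pipeline running when the model ignores the requested output format, at the cost of a single, less targeted query.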


Search the vector db

In [22]:
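import ast

# Parse the model's reply as a list literal instead of executing it with eval().
search_results = collection.query(
    query_texts = ast.literal_eval(search_terms),
    n_results = 10
    )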

2025-10-08 08:19:45.254597377 [W:onnxruntime:Default, device_discovery.cc:164 DiscoverDevicesForPlatform] GPU device discovery failed: device_discovery.cc:89 ReadFileContents Failed to open file: "/sys/class/drm/card0/device/vendor"


Build pairs of passages for cross-hop reasoning

In [23]:
# Flatten the per-query results into one dict: passage id -> passage text.
# Passages returned by more than one query are kept only once.
S = {}
for i, d in zip(search_results['ids'], search_results['documents']):
    for x, y in zip(i, d):
        S[x] = y
n = len(S)
print("Number of possible pairs of selections", n * (n - 1) // 2)

# Build every unordered pair of passages, keyed "<id_p>_<id_q>".
J = {}
K = list(S.keys())
for i, p in enumerate(K):
    for q in K[i + 1:]:
        J[f'{p}_{q}'] = f"page {p}:\n" + S[p] + f"\n\npage {q}:\n" + S[q]


Number of possible pairs of selections 406
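The pair count follows from the number of unique passages: n passages give n(n-1)/2 unordered pairs, so the 406 pairs reported above correspond to 29 unique passages. The sketch below shows the same construction with `itertools.combinations` on made-up data. The passage ids and texts are placeholders, not rows from the corpus.

```python
from itertools import combinations
from math import comb

# Toy stand-in for S: passage id -> passage text (contents are made up).
passages = {"900": "text A", "901": "text B", "902": "text C"}

# combinations() yields each unordered pair exactly once, in key order.
pairs = {
    f"{p}_{q}": f"page {p}:\n{passages[p]}\n\npage {q}:\n{passages[q]}"
    for p, q in combinations(passages, 2)
}

n_pairs_for_29 = comb(29, 2)  # number of pairs for 29 unique passages
```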


## Bayesian Optimization

### Initialize

1. Generate embeddings for the candidate "few shots" or "augmentations": single passages and pairs of passages

In [25]:
singles = bbo.get_embedding([S[k] for k in K])
E = []
for e in more_itertools.batched([J[k] for k in J.keys()], 50):
    E += bbo.get_embedding(e)
    print(len(E))
 

{'prompt_tokens': 3400, 'completion_tokens': 0, 'total_tokens': 3400}
{'prompt_tokens': 12149, 'completion_tokens': 0, 'total_tokens': 12149}
50
{'prompt_tokens': 9301, 'completion_tokens': 0, 'total_tokens': 9301}
100
{'prompt_tokens': 10957, 'completion_tokens': 0, 'total_tokens': 10957}
150
{'prompt_tokens': 11271, 'completion_tokens': 0, 'total_tokens': 11271}
200
{'prompt_tokens': 13271, 'completion_tokens': 0, 'total_tokens': 13271}
250
{'prompt_tokens': 18544, 'completion_tokens': 0, 'total_tokens': 18544}
300
{'prompt_tokens': 17695, 'completion_tokens': 0, 'total_tokens': 17695}
350
{'prompt_tokens': 5895, 'completion_tokens': 0, 'total_tokens': 5895}
400
{'prompt_tokens': 1109, 'completion_tokens': 0, 'total_tokens': 1109}
406
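The running totals in the output (50, 100, ..., 400, 406) come from embedding the 406 pair texts in batches of 50. The sketch below reproduces that pattern with the standard library only. `batched` here is a local helper with the same chunking behaviour as `more_itertools.batched`, and `fake_embed` is a hypothetical stand-in for the real embedding call.

```python
from itertools import islice


def batched(items, size):
    """Yield successive lists of at most `size` items."""
    it = iter(items)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def fake_embed(texts):
    # Hypothetical stand-in for the embedding call: one small vector per text.
    return [[float(len(t)), 0.0] for t in texts]


texts = [f"pair {i}" for i in range(406)]
embeddings = []
progress = []
for chunk in batched(texts, 50):
    embeddings += fake_embed(chunk)
    progress.append(len(embeddings))  # running total, as printed in the notebook
```

Batching keeps each request under the service's input limits while preserving the order of the results, which matters because the embeddings are later zipped back onto the pair keys.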


2. Save the pair embeddings to disk so they do not have to be recomputed if the session is lost

In [26]:
with open('/tmp/query.mbd', 'wb') as f:
    pickle.dump(E, f)
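Reloading the saved embeddings is the mirror image of the dump. Below is a minimal round-trip sketch with toy vectors. The file name is a hypothetical placeholder, and a temporary directory stands in for the real path.

```python
import os
import pickle
import tempfile

vectors = [[0.1, 0.2], [0.3, 0.4]]  # toy embeddings, not real model output

# Hypothetical file name inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "query_embeddings.pkl")

with open(path, "wb") as f:
    pickle.dump(vectors, f)

with open(path, "rb") as f:
    restored = pickle.load(f)
```

Only unpickle files you created yourself, since `pickle.load` can execute code embedded in a malicious file.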

3. Combine all embeddings into a single dictionary

In [27]:
combined_embeddings = {}
combined_text = {}
# Single passages, keyed by passage id.
for k, s in zip(S.keys(), singles):
    combined_embeddings[k] = s
    combined_text[k] = S[k]
# Passage pairs, keyed by "<id_p>_<id_q>".
for k, s in zip(J.keys(),  E):
    combined_embeddings[k] = s
    combined_text[k] = J[k]


4. Start with a random sample

In [28]:
initial_sample = random.sample([s for s in combined_text.keys()], 10)
answer = [bbo.call_gpt({'system': "You are a librarian. Your job is to determine if a reference is relevant to a query",
                          'user': wd.relevance_prompt.format(improved_question, combined_text[s])}) for s in initial_sample]

scored_answers = {}
Q = [bbo.x_relevance(a) for a in answer]
for k, q in zip(initial_sample, Q):
    if q > 0:
        scored_answers[k] = q

### Calculate Batch Expected Improvement

1. Get the best batch for the next iteration

In [29]:
B = bbo.best_batch_finder(400, 4)
x2id, unscored_embeddings = B.fit(scored_answers, combined_embeddings)
B.create_batches(unscored_embeddings)
best_idx = B.get_best_batch()


-1000 0.33
0.33 0.5
0.5 0.65
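The internals of `best_batch_finder` are not shown in this notebook. For orientation only, the sketch below computes the textbook single-point expected improvement for a Gaussian posterior under maximization: EI = (mu - best) * Phi(z) + sigma * phi(z), with z = (mu - best) / sigma. It is an assumption that the library does something comparable, and a batch criterion generally differs from ranking candidates one at a time. The candidate means and standard deviations are made up.

```python
import math


def expected_improvement(mu: float, sigma: float, best: float) -> float:
    """Closed-form EI for a Gaussian posterior N(mu, sigma^2), maximization."""
    if sigma <= 0.0:
        # No uncertainty: the improvement is deterministic.
        return max(mu - best, 0.0)
    z = (mu - best) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (mu - best) * cdf + sigma * pdf


# Made-up posterior (mean, std) per candidate and a made-up incumbent score.
candidates = {"a": (0.60, 0.05), "b": (0.50, 0.30), "c": (0.40, 0.01)}
best_so_far = 0.65

ranked = sorted(
    candidates,
    key=lambda k: expected_improvement(*candidates[k], best_so_far),
    reverse=True,
)
```

Candidate `b` ranks first despite its lower mean because its larger uncertainty leaves more room to beat the incumbent, which is the exploration behaviour EI is designed to reward.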


Take a look at text in our initial sample

In [30]:
for x in initial_sample:
    print(combined_text[x][:200])
    print('---')


Egypt is one of the most populous countries in Africa. The great majority of its estimated 78 million people (2007) live near the banks of the Nile River in an area of about   where the only arable ag
---
page 901:
Egypt has a developed energy market based on coal, oil, natural gas, and hydro power.  Substantial coal deposits are in the north-east Sinai, and are mined at the rate of about   per year.  
---
page 907:
Egypt is the most populated country in the Middle East and the second-most populous on the African continent, with an estimated 78 million people. Almost all the population is concentrated a
---
page 900:
The government has struggled to prepare the economy for the new millennium through economic reform and massive investments in communications and physical infrastructure. Egypt has been recei
---
page 900:
The government has struggled to prepare the economy for the new millennium through economic reform and massive investments in communications and physical infrastructure. 

Take a look at the *best* text after the first random sample

In [31]:
for x in B.batch_idx[best_idx]:
    print(combined_text[x2id[x]][:200])
    print('---')

page 902:
Economic conditions have started to improve considerably after a period of stagnation from the adoption of more liberal economic policies by the government, as well as increased revenues fro
---
page 890:
Egypt's foreign policy operates along moderate lines. Factors such as population size, historical events, military strength, diplomatic expertise and a strategic geographical position give E
---
page 905:
The best known examples of Egyptian companies that have expanded regionally and globally are the Orascom Group and Raya. The IT sector has been expanding rapidly in the past few years, with 
---
page 808:
Indonesia's estimated Gross Domestic Product (GDP) for 2007 is US$408 billion (US$1,038 bn PPP).    In 2007, estimated nominal per capita GDP is US$1,812, and per capita GDP PPP was US$4,616
---


2. Prepare for multiprocessing

In [33]:
parameters = [{'system': "You are a librarian. Your job is to determine if a reference is relevant to a query",
                          'user': wd.relevance_prompt.format(improved_question, combined_text[x2id[s]])} for s in B.batch_idx[best_idx]]

3. Evaluate Queries in parallel

In [34]:
# One worker per request; the context manager cleans up the pool afterwards.
with Pool(len(parameters)) as p:
    xwer = p.map(bbo.call_gpt, parameters)

for k, q in zip(B.batch_idx[best_idx], [bbo.x_relevance(a) for a in xwer]):
    if q > 0:
        scored_answers[x2id[k]] = q


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
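The warning above appears because the process is forked after the tokenizer library has already used parallelism. Two common ways to avoid it are to set `TOKENIZERS_PARALLELISM` before any tokenizer work, as the message itself suggests, or to use threads instead of processes, which is usually sufficient for I/O-bound API calls. The sketch below swaps in `concurrent.futures.ThreadPoolExecutor` for `multiprocessing.Pool`. `fake_relevance_call` is a hypothetical stand-in for the real request function.

```python
import os
from concurrent.futures import ThreadPoolExecutor

# Set this before any tokenizer work or forking, as the warning suggests.
os.environ["TOKENIZERS_PARALLELISM"] = "false"


def fake_relevance_call(params: dict) -> str:
    # Hypothetical stand-in for a network-bound model call.
    return f"relevance for: {params['user']}"


parameters = [
    {"system": "You are a librarian.", "user": f"passage {i}"} for i in range(4)
]

# Threads share the parent process, so nothing is forked.
with ThreadPoolExecutor(max_workers=len(parameters)) as pool:
    # Executor.map returns results in the same order as the inputs.
    replies = list(pool.map(fake_relevance_call, parameters))
```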

4. Extract best passages
- including pairs
- remove duplicates

In [35]:
df = pd.DataFrame({'id': [s for s in scored_answers.keys()],
                   'score': [scored_answers[s] for s in scored_answers.keys()]})
relevant_ids = []
for x in df.sort_values('score', ascending = False).head(3)['id']:
    relevant_ids += x.split('_')
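Because a pair key such as `902_808` expands to two passage ids, the same passage can appear more than once among the top results. The sketch below shows the select, split, and deduplicate steps on made-up scores, using `dict.fromkeys` so that the ids keep their rank order (a plain `set` removes duplicates but does not preserve order).

```python
# Made-up relevance scores keyed by passage id or "<id_p>_<id_q>" pair key.
scored = {"902": 0.65, "900_905": 0.5, "902_808": 0.5, "890": 0.33}

# Top three keys by score; sorted() is stable, so ties keep insertion order.
top_keys = sorted(scored, key=scored.get, reverse=True)[:3]

# Expand pair keys into their individual passage ids.
ids_in_rank_order = [pid for key in top_keys for pid in key.split("_")]

# Drop duplicates while keeping the first (highest-ranked) occurrence.
unique_ids = list(dict.fromkeys(ids_in_rank_order))
```

Keeping the order means the most relevant passages come first in the assembled prompt.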

What the question looks like with the new references

In [37]:
references = "\n\n".join([f"page {i}:\n" + combined_text[i] for i in set(relevant_ids)])
print(wd.rag.format(improved_question, references))

I'm going to give you a question with some references. Try to answer the question. If you use the references, indicate that you have used them. 
If you cannot answer the question from the references of if the references are only partially helpful, indicate that as well. Do not speculate beyond what is contained in the references.

#### QUESTION ###
"Provide a detailed comparison of the economic conditions in Egypt with those of other countries in the Middle East and North Africa (MENA) region. Discuss aspects such as GDP, economic growth rates, key industries, unemployment rates, income inequality, inflation, trade balances, and government policies impacting the economy. Highlight similarities and differences, and provide data or examples to support the analysis."

### REFERENCES ###
page 902:
Economic conditions have started to improve considerably after a period of stagnation from the adoption of more liberal economic policies by the government, as well as increased revenues from tou

5. Evaluate answer with new additions *so far*

In [38]:
final_answer = bbo.call_gpt({'system': 'You are a helpful assistant.',
                              'user': wd.rag.format(improved_question, references)})


In [39]:
print(final_answer)

The information in the provided references is insufficient to deliver a comprehensive comparison of Egypt's economic conditions with other countries in the Middle East and North Africa (MENA) region. While the references offer some insights into Egypt’s economy, they lack data or context about other MENA countries, thus making it impossible to draw a detailed or balanced comparison.

### Analysis Based on the Provided References:
1. **Economic Growth and Reform in Egypt**:
   - According to page 902, Egypt has implemented significant economic reforms since the early 2000s, such as reducing customs/tariffs and cutting corporate taxes to encourage investment. By 2006, these reforms reportedly doubled tax revenue, and Egypt was recognized by the IMF as a top reformer globally.
   - Other stimulus factors include revenue from tourism and the Suez Canal (page 900), alongside foreign aid from the United States, which averages $2.2 billion annually.

2. **Challenges in Wealth Distribution and