# Contextual Bayesian Optimisation via Large Language Models

This notebook will:
- Demonstrate the work in https://github.com/ur-whitelab/BO-LIFT, which focuses on few-shot/in-context learning (FS/ICL) for estimating the aqueous solubility of a compound (ESOL: Estimated SOLubility) and the yield of chemical reactions. For background on ICL, see https://en.wikipedia.org/wiki/In-context_learning_(natural_language_processing).
- Attempt to extend the work of https://arxiv.org/pdf/2304.05341.pdf via (non-exhaustive):
  1. Implementing richer contextual prompting (not just compound + solubility or compound + yield).
  2. Implementing further acquisition functions for the Bayesian optimisation protocol.
- If possible, try alternative frameworks:
  - Chain-of-thought prompting variations (https://www.promptingguide.ai/techniques/cot).
  - Tree-of-thought prompting (https://www.promptingguide.ai/techniques/tot).
  - Multi-task Bayesian optimisation: rather than optimising for solubility alone, optimise jointly for solubility and yield, or some other property.

<DIV STYLE="background-color:#000000; height:10px; width:100%;"></DIV>

# Import Libraries

In [None]:
# Standard Library
import json
import itertools
import os

# Third Party
import numpy as np
import pandas as pd
import openai
import requests
import matplotlib.pyplot as plt
from langchain.prompts.prompt import PromptTemplate

# Private
import bolift
import qafnet
from bolift.llm_model import GaussDist, DiscreteDist

In [None]:
# Seed results
np.random.seed(0)

In [None]:
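# OpenAI API key: read it from the environment; never hardcode a secret in the notebook
if not os.environ.get("OPENAI_API_KEY"):
    raise RuntimeError("Set the OPENAI_API_KEY environment variable before running this notebook.")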

# Data Preparation

In [None]:
# Load the two solubility datasets
esol_df = pd.read_csv("paper/data/esol_iupac.csv")
aqsol_df = pd.read_csv("paper/data/full_solubility.csv")

In [None]:
# Clean
aqsol_df = aqsol_df.dropna()
aqsol_df = aqsol_df.drop_duplicates().reset_index(drop=True)
aqsol_df

In [None]:
# Rename column
aqsol_df.rename(columns={'Name': 'Compound ID'}, inplace=True)

In [None]:
# Clean
esol_df = esol_df.dropna()
esol_df = esol_df.drop_duplicates().reset_index(drop=True)
esol_df

In [None]:
# Merge the two datasets, keeping only compounds present in both
final_df = aqsol_df.merge(esol_df, on='Compound ID')

In [None]:
np.random.seed(0)
# Establish path to solubility data
esol_df = pd.read_csv("../paper/data/esol_iupac.csv")
aqsol_df = pd.read_csv("../paper/data/full_solubility.csv")
# Clean
aqsol_df = aqsol_df.dropna()
aqsol_df = aqsol_df.drop_duplicates().reset_index(drop=True)
aqsol_df.rename(columns={'Name': 'Compound ID'}, inplace=True)
esol_df = esol_df.dropna()
esol_df = esol_df.drop_duplicates().reset_index(drop=True)
final_df = aqsol_df.merge(esol_df, on='Compound ID')
final_df = final_df.drop(["SMILES_x"], axis=1)
final_df.rename(columns={'SMILES_y': 'SMILES', 'MolWt': 'Molecular Weight', 'BalabanJ': 'Balaban J',
                         'BertzCT': 'Complexity Index'}, inplace=True)
# Obtain extra context to provide new interface
new_df = final_df[["Compound ID", "SMILES", "Molecular Weight", "Complexity Index", "Solubility"]]
new_df.columns = [col.lower() if col != "SMILES" else col for col in new_df.columns]
# Instantiate LLM model through ask-tell interface
bolift_at = bolift.AskTellFewShotTopk()
qaf_at = qafnet.QAFFewShotTopK()
# Tell the model some points (few-shot/ICL)
mini_df = new_df.head(5)
icl_examples = []
icl_values = []
# BO-Lift model
for _, row in mini_df.iterrows():
    icl_examples.append(row["compound id"])
    icl_values.append(row["solubility"])
    bolift_at.tell(row["compound id"], row["solubility"])
# QAFNet model
qaf_at.tell(data=mini_df)
# Make a prediction for a molecule
molecule_name = new_df.iloc[6]["compound id"]
molecule_sol = new_df.iloc[6]["solubility"]
bolift_pred = bolift_at.predict(molecule_name)
# Pass every column except the last (the solubility label) as context
qaf_pred = qaf_at.predict(new_df.iloc[6][:-1])
# Find the absolute error and standard deviation (predictions may come back as a list)
if isinstance(bolift_pred, list):
    bolift_mae = np.abs(molecule_sol - bolift_pred[0].mean())
    bolift_std = bolift_pred[0].std()
else:
    bolift_mae = np.abs(molecule_sol - bolift_pred.mean())
    bolift_std = bolift_pred.std()
if isinstance(qaf_pred, list):
    qaf_mae = np.abs(molecule_sol - qaf_pred[0].mean())
    qaf_std = qaf_pred[0].std()
else:
    qaf_mae = np.abs(molecule_sol - qaf_pred.mean())
    qaf_std = qaf_pred.std()
print(f"Molecule {molecule_name} has solubility level {molecule_sol}. The following algorithms return: \n")
print(f"BO-Lift: MAE = {bolift_mae} | Standard Deviation = {bolift_std}.")
print(f"QAF-Net: MAE = {qaf_mae} | Standard Deviation = {qaf_std}.")

# ICL

## Ask-Tell

In [None]:
# Instantiate LLM model through ask-tell interface
asktell = bolift.AskTellFewShotTopk()
# Tell the model some points (few-shot/ICL)
mini_df = final_df.head(5)
icl_examples = []
icl_values = []
# Tell the LLM model each example and record it for later reuse
for _, row in mini_df.iterrows():
    icl_examples.append(row["Compound ID"])
    icl_values.append(row["Solubility"])
    asktell.tell(row["Compound ID"], row["Solubility"])

In [None]:
# Make a prediction for a molecule
yhat = asktell.predict(final_df.iloc[6]["Compound ID"])
print(f"Y_Hat for ICL (before BO): {yhat}")
print(f"Y_Hat Mean: {yhat.mean()}")
print(f"Y_Hat Standard Deviation: {yhat.std()}")

In [None]:
final_df.iloc[6][["Compound ID", "Solubility"]]
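
To make the ask-tell idea concrete, the sketch below shows one way told examples could be turned into a few-shot prompt. It is not BO-LIFT's actual implementation: the class `ToyAskTell`, its method names and the prompt wording are hypothetical, the numeric values are placeholders, and no LLM is called.

```python
class ToyAskTell:
    """Hypothetical stand-in showing how told points become a few-shot prompt."""

    def __init__(self, y_name="solubility"):
        self.y_name = y_name
        self.examples = []  # (x, y) pairs supplied via tell()

    def tell(self, x, y):
        """Store one labelled example for later use as an in-context demonstration."""
        self.examples.append((x, y))

    def build_prompt(self, query):
        """Concatenate all told examples, then append the unanswered query."""
        shots = [
            f"Q: What is the {self.y_name} of {x}?\nA: {y:.2f}\n"
            for x, y in self.examples
        ]
        return "\n".join(shots + [f"Q: What is the {self.y_name} of {query}?\nA:"])


toy = ToyAskTell()
toy.tell("1-bromohexane", -1.0)  # placeholder value, not taken from the datasets
toy.tell("butan-1-ol", -2.0)     # placeholder value, not taken from the datasets
prompt = toy.build_prompt("1-bromobutane")
print(prompt)
```

The model's completion after the final `A:` would then be parsed into a prediction; how BO-LIFT turns completions into a distribution with a mean and standard deviation is not shown here.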

## LLM as BO

In [None]:
# Now treat the LLM model as a BO protocol
pool_list_1 = [
    "1-bromoheptane",
    "1-bromohexane",
    "1-bromo-2-methylpropane",
    "butan-1-ol"
]
# Create the pool object
pool_1 = bolift.Pool(pool_list_1)
# Ask for the next most likely point (found through using UCB as the acquisition function on the previous points)
next_point_1 = asktell.ask(pool_1)
print(f"The next point for the optimiser to try is: {next_point_1}")
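
As a rough illustration of what a UCB-style `ask` does, the sketch below scores each pool candidate by its predicted mean plus an exploration bonus and returns the best one. The function names, the weighting parameter `lam` and the numbers are assumptions made for illustration; BO-LIFT's own `ask` may differ in detail.

```python
def upper_confidence_bound(mean, std, lam=0.5):
    """UCB score: predicted mean plus an exploration bonus proportional to the uncertainty."""
    return mean + lam * std


def ask(pool_predictions, lam=0.5):
    """Return the (candidate, score) pair with the highest UCB score.

    pool_predictions maps each candidate name to a (mean, std) tuple.
    """
    scored = {
        x: upper_confidence_bound(m, s, lam) for x, (m, s) in pool_predictions.items()
    }
    best = max(scored, key=scored.get)
    return best, scored[best]


# Placeholder predictive means and standard deviations (not model output)
pool_predictions = {
    "candidate_a": (-3.0, 0.2),
    "candidate_b": (-3.5, 2.0),
    "candidate_c": (-4.0, 0.5),
}
best, score = ask(pool_predictions, lam=0.5)
print(best, score)
```

With `lam=0.5` the uncertain `candidate_b` wins despite its lower mean; with `lam=0` the choice reduces to pure exploitation and `candidate_a` is selected.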

In [None]:
# Tell the LLM the "actual" solubility value
asktell.tell(next_point_1[0][0], esol_1[esol_1["IUPAC"]==next_point_1[0][0]].values[0][1])
yhat = asktell.predict("1-bromobutane")
print(f"Y_Hat for ICL+BO: {yhat}")
print(f"Y_Hat Mean: {yhat.mean()}")
print(f"Y_Hat Standard Deviation: {yhat.std()}")

In [None]:
# The actual solubility
esol_1[esol_1["IUPAC"]=="1-bromobutane"]

<DIV STYLE="background-color:#000000; height:10px; width:100%;"></DIV>

## Idea 1:

Here, we aim to improve the "LLM as BO" approach by adding more contextual information. Context can be added in many ways; we consider three:

1. Change the prompt template:

`prompt_template = PromptTemplate(input_variables=["x", "Answer", "y_name"] + self._answer_choices,
                                   template="Q: Given {x}. What is {y_name}?\n"
                                   + "\n".join([f"{a}. {{{a}}}" for a in self._answer_choices])
                                   + "\nAnswer: {Answer}\n\n")`
                                  
2. Change the acquisition function to incorporate context (UCB -> C-UCB).
3. Use policy learning. For example, policy learning with reinforcement learning uses a function approximator such as a neural network to predict actions, then updates the network's weights based on the observed reward. Here, the "actions" would be the solubility predictions and the "reward" would be how close those predictions are to the true solubility.

Note that these ideas can also be combined.
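
As a minimal sketch of the first option, the template below adds the extra columns prepared earlier (SMILES, molecular weight, complexity index) to each question. It uses plain `str.format` rather than `PromptTemplate` to stay dependency-free; the wording, the name `CONTEXT_TEMPLATE` and the helper `format_example` are hypothetical, and the complexity value in the example row is a placeholder.

```python
CONTEXT_TEMPLATE = (
    "Q: Given the compound {name} with SMILES {smiles}, "
    "molecular weight {mol_wt:.1f} g/mol and complexity index {complexity:.1f}. "
    "What is {y_name}?\n"
    "A: {answer}\n\n"
)


def format_example(row, y_name="solubility", answer=""):
    """Render one row (a mapping with the lowercased column names) as a prompt block.

    Leave `answer` empty for the query; fill it in for few-shot examples.
    """
    return CONTEXT_TEMPLATE.format(
        name=row["compound id"],
        smiles=row["SMILES"],
        mol_wt=row["molecular weight"],
        complexity=row["complexity index"],
        y_name=y_name,
        answer=answer,
    )


# Example row; the complexity index is a placeholder, not a dataset value
row = {
    "compound id": "1-bromobutane",
    "SMILES": "CCCCBr",
    "molecular weight": 137.0,
    "complexity index": 10.0,
}
text = format_example(row)
print(text)
```

Keeping the context fields in a fixed order across all few-shot examples and the query gives the model a consistent pattern to complete.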

In [None]:
# Use another modified dataset
esol_2 = esol_data[["IUPAC", "measured log(solubility:mol/L)", "SMILES"]]
esol_2 = esol_2.dropna()
esol_2 = esol_2.drop_duplicates().reset_index(drop=True)
esol_2

In [None]:
# Instantiate LLM model through ask-tell interface
asktell_2 = bolift.AskTellFewShotTopk_Template()
# Tell the model some points (few-shot/ICL)
for molecule in icl_examples:
    asktell_2.tell(esol_2[esol_2["IUPAC"]==molecule].values[0][2], 
                   esol_2[esol_2["IUPAC"]==molecule].values[0][1])

In [None]:
# Make a prediction for a molecule
yhat = asktell_2.predict("CCCCBr")
print(f"Y_Hat for ICL (before BO): {yhat}")
print(f"Y_Hat Mean: {yhat.mean()}")
print(f"Y_Hat Standard Deviation: {yhat.std()}")

In [None]:
# Now treat the LLM model as a BO protocol
pool_examples = ["1-bromoheptane", "1-bromohexane", "1-bromo-2-methylpropane", "butan-1-ol"]
pool_list_2 = []
for molecule in pool_examples:
    pool_list_2.append(esol_2[esol_2["IUPAC"]==molecule].values[0][2])
# Create the pool object
pool_2 = bolift.Pool(pool_list_2)
# Ask for the next most likely point (found through using UCB as the acquisition function on the previous points)
next_point_2 = asktell_2.ask(pool_2)
print(f"The next point for the optimiser to try is: {next_point_2}")

In [None]:
# Tell the LLM the "actual" solubility value
asktell_2.tell(next_point_2[0][0], esol_2[esol_2["SMILES"]==next_point_2[0][0]].values[0][1])
yhat = asktell_2.predict("CCCCBr")
print(f"Y_Hat for ICL+BO: {yhat}")
print(f"Y_Hat Mean: {yhat.mean()}")
print(f"Y_Hat Standard Deviation: {yhat.std()}")

## Idea 2:

Here, we aim to improve the "LLM as BO" approach by noting that chemical compounds with similar SMILES structures are likely to have similar chemical characteristics. We can exploit this by feeding specific clusters of compounds to the LLM, which may improve prediction accuracy. Two steps are involved:

1. Encoding SMILES strings into a numeric format (an embedding), which requires choosing an appropriate representation for the chemical structures. One such representation is a fingerprint (e.g. the Morgan fingerprint). A machine learning model can then learn a mapping from the fingerprint to a scalar value that is predictive of the molecule's solubility.
2. Applying chain-of-thought reasoning (and its variations), i.e. feeding examples in a particular order rather than at random.
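
The selection-and-ordering idea can be sketched without any chemistry toolkit. Morgan fingerprints need a cheminformatics library such as RDKit, so this example substitutes a much cruder representation: the set of character bigrams in the SMILES string, compared with Tanimoto similarity. The function names are hypothetical and the bigram features are an assumption made purely for illustration.

```python
def bigrams(smiles):
    """Set of overlapping character pairs; a crude stand-in for a structural fingerprint."""
    return {smiles[i:i + 2] for i in range(len(smiles) - 1)}


def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two feature sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def select_examples(query, labelled, k=2):
    """Return the k labelled SMILES most similar to the query, least similar first."""
    q = bigrams(query)
    ranked = sorted(labelled, key=lambda s: tanimoto(q, bigrams(s)), reverse=True)[:k]
    return ranked[::-1]  # the most similar example ends up closest to the query


# 1-bromohexane, butan-1-ol, benzene, 1-bromo-2-methylpropane
labelled = ["CCCCCCBr", "CCCCO", "c1ccccc1", "CC(C)CBr"]
selected = select_examples("CCCCBr", labelled, k=2)
print(selected)
```

Bigram sets ignore chain length, so 1-bromohexane scores as identical to the 1-bromobutane query; a real fingerprint would separate them, which is one reason to prefer it in practice.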

1. Try to find more context data and improve the prompt design (for this dataset).
2. Try other contextual BO applications and see how the LLM performs on them.