# Wazy

Wazy optimizes sequences against a numeric property, such as measured activity or solubility. It uses Bayesian optimization to propose which new sequences to try next. The method is designed for cases where you have few (1-100) starting sequences and want to know which additional sequences to test in order to find the best one. See the [paper](https://www.biorxiv.org/content/10.1101/2022.08.05.502972v1) and the [code](https://github.com/ur-whitelab/wazy) for complete details on how the algorithm works.
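
To make the idea of "proposing which sequence to try" more concrete, here is a minimal sketch of two common Bayesian optimization scoring rules: upper confidence bound (UCB) and expected improvement (EI). It is a generic illustration, not Wazy's internal implementation. The candidate names, predicted means, standard deviations, `beta`, and `best_so_far` are all made-up values.

```python
import math


def ucb(mean, std, beta=2.0):
    """Upper confidence bound: favors high predictions and high uncertainty."""
    return mean + beta * std


def expected_improvement(mean, std, best_so_far):
    """Expected improvement over the best label seen so far (maximization)."""
    if std <= 0:
        return max(mean - best_so_far, 0.0)
    z = (mean - best_so_far) / std
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)
    cdf = 0.5 * (1 + math.erf(z / math.sqrt(2)))
    return (mean - best_so_far) * cdf + std * pdf


# Hypothetical model outputs: name -> (predicted mean, predicted std)
candidates = {
    "safe": (4.0, 0.1),
    "uncertain": (3.5, 1.0),
    "poor": (1.0, 0.2),
}
best_so_far = 3.8  # assumed best label measured so far

best_by_mean = max(candidates, key=lambda s: candidates[s][0])
best_by_ucb = max(candidates, key=lambda s: ucb(*candidates[s]))
best_by_ei = max(
    candidates, key=lambda s: expected_improvement(*candidates[s], best_so_far)
)
print(best_by_mean, best_by_ucb, best_by_ei)
```

With these made-up numbers, ranking by predicted mean alone picks `safe`, while UCB and EI both pick `uncertain`, because its larger uncertainty leaves more room for improvement.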

Instructions:

1. Provide at least one sequence with its numeric label (e.g., activity), either by typing it in (Option A) or by uploading a spreadsheet (Option B).
2. You can then use `predict` to get a prediction for an unknown sequence, or `ask` to find out which sequence you should try next.
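
If you plan to upload a spreadsheet in step 1, the layout is two columns: sequence first, numeric label second. The sketch below builds such a file in memory with Python's standard `csv` module and checks that every label parses as a number. The sequences and labels are made up for illustration.

```python
import csv
import io

# Made-up sequence/label pairs, for illustration only
rows = [("GGGGG", 2.5), ("FFATP", 4.0), ("SEQ", 4)]

# Write the two-column layout: sequence first, numeric label second, no header
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)

# Read it back and confirm that every label converts to a float
parsed = [(seq, float(label)) for seq, label in csv.reader(io.StringIO(csv_text))]
print(parsed)
```

A file like this has no header row, so the header checkbox in Option B would stay unchecked.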

Credit:

* This doc authored by [@andrewwhite01](https://twitter.com/andrewwhite01)
* Wazy authored by [@andrewwhite01](https://twitter.com/andrewwhite01) and [@ZiyueYang37](https://twitter.com/ZiyueYang37)

In [None]:
# @title Install Dependencies and Set Seed
# @markdown Changing the seed changes the random outcomes in this notebook. You can leave it as 0, or change it if you want the proposed sequences to be different.
seed = 0  # @param {type:"integer"}
!pip install -q wazy pandas odfpy openpyxl xlrd
import wazy
import jax
import numpy as np
import functools

np.random.seed(seed)
key = jax.random.PRNGKey(seed)

In [None]:
# @title Option A: Type out results
# @markdown Double click this cell (or click "Show Code") and follow the example to type out sequence/labels
boa = wazy.BOAlgorithm()
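# Example: replace "SEQ" and 4 with your own sequence and its measured label.
# Add one boa.tell(...) line per measurement.
boa.tell(key, "SEQ", label=4)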

In [None]:
# @title Option B: Upload Spreadsheet
# @markdown csv, xls, xlsx, xlsm, xlsb, odf, ods and odt are supported. The first column should be the sequence and the second its numeric label.
from google.colab import files
import pandas as pd

uploaded = files.upload()
# @markdown *Check the header box if there is a header row in your file*
header = False  # @param {type:"boolean"}
if header:
    header = 0
else:
    header = None

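# If several files are uploaded, only the last one read is kept in `data`
for fn in uploaded.keys():
    # Match on the file extension rather than any ".csv" substring in the name
    if fn.lower().endswith(".csv"):
        data = pd.read_csv(fn, header=header)
    else:
        data = pd.read_excel(fn, header=header)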
print("Loaded:")
boa = wazy.BOAlgorithm()
for i in range(data.shape[0]):
    seq, label = data.iloc[i, 0], data.iloc[i, 1]
    seq = str(seq)
    label = float(label)
    if i < 10:
        print(seq, label)
    elif i == 10:
        print("...")
    boa.tell(key, seq, label)

In [None]:
# @title Predict
seq = "TEST"  # @param {type:"string"}
# Use a descriptive name instead of the easily misread single letter `l`
pred, v, _ = boa.predict(key, seq)
print(f"Predicted label for {seq} is {pred:.2f} ± {v:.2f}")

In [None]:
# @title Ask
# @markdown The proposed sequences balance gathering more information against being optimal. If you just want the best predicted sequences, choose "best" from the dropdown.
acquisition_fxn = "bo-ucb"  # @param ["bo-ucb", "best", "bo-ei"]
seq_length = 10  # @param {type:"slider", min:0, max:100, step:1}
num_sequences = 3  # @param {type:"slider", min:1, max:100, step:1}
taf = {"bo-ucb": "ucb", "best": "max", "bo-ei": "ei"}
batch_s = 1
if num_sequences > 10:
    _num_sequences = max(4, num_sequences // boa.aconfig.bo_batch_size)
    while _num_sequences * batch_s < num_sequences:
        batch_s += 1
    num_sequences = _num_sequences
key = jax.random.split(key)[0]
if num_sequences == 1:
    result, score = boa.ask(
        key, length=seq_length, return_seqs=batch_s, aq_fxn=taf[acquisition_fxn]
    )
    print(result)
else:
    result, score = boa.batch_ask(
        key,
        num_sequences,
        lengths=[seq_length] * num_sequences,
        return_seqs=batch_s,
        aq_fxn=taf[acquisition_fxn],
    )
    for r in result:
        print(r)

As you gather more results, add them in Option A or to your spreadsheet and re-run everything! If you have a problem, [share your issue here](https://github.com/ur-whitelab/wazy/issues/new).
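
For readers curious how the Ask cell handles requests for more than 10 sequences, the sketch below reproduces its batching arithmetic as a standalone function. `plan_batches` is a hypothetical helper name. In the notebook, the batch size comes from `boa.aconfig.bo_batch_size`; the value 16 used here is only an assumed example.

```python
def plan_batches(num_sequences, bo_batch_size):
    """Mirror the Ask cell: return (number of proposals, return_seqs value)."""
    batch_s = 1
    if num_sequences > 10:
        # Use at least 4 proposals, then grow batch_s until the request is covered
        rounds = max(4, num_sequences // bo_batch_size)
        while rounds * batch_s < num_sequences:
            batch_s += 1
        num_sequences = rounds
    return num_sequences, batch_s


# bo_batch_size=16 is an assumed value for illustration
for requested in (3, 11, 50, 100):
    rounds, batch_s = plan_batches(requested, bo_batch_size=16)
    print(requested, "->", rounds, "x", batch_s, "=", rounds * batch_s)
```

Because `batch_s` is rounded up, the product can slightly exceed the requested count. For example, 50 requested sequences become 4 x 13 = 52 under the assumed batch size.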