## Example study and method comparisons--GB1_2016
The GB1_2016 dataset contains combinatorial measurements of all possible mutations and has been studied extensively in the previous literature. Here, we use GB1_2016 as an example to explore the different settings available in ODBO and collect the corresponding results. The notebook is commented in detail throughout.
To use this notebook, change the global control variables and read the comments.

In [1]:
import numpy as np
import torch
import pandas as pd
import odbo
import os

### Global control variables
This section describes the parameters used for three different datasets. A separate notebook gives a detailed walkthrough for the other datasets, with less exploration of the experimental settings. For those datasets, we recommend checking out [ODBO_for_different_datasets.ipynb](./ODBO_for_different_datasets.ipynb) instead.
We did not do extensive hyperparameter tuning, so other values can work and might even work better.
The values used in our data collection are listed in the comments. Because random number generation differs across devices, your results might differ slightly from those we obtained on a local computer.

In [16]:
# Experiment settings 
dataset_name ='GB1_2016'
random_seed = 0 #Random seed for the trial
search_iter = 50 #Number of new observations, GB1_2014=100, BRCA1=50, avGFP_2016=50
# Initialization method protocol
update_method='independent' # How the Round 0 experiments that initiate BO are found. For datasets with few changes in the sequences, 'correlate' mode is recommended.
allow_abundance=True # Whether the top-scoring experiments take the abundance of a mutation at different sites into account.
# Featurization settings
method=['Avg','Max','Avg','Max'] #switching order for feature spaces to overcome local maxima in one certain representation
mode='independent' #Feature computing mode. 
# Adaptive search space predicted by XGBOD model (Prescreening step)
threshold = 0.05 # Use 0.05 as the threshold for labeling outliers and inliers.
cMat_plot = True #Plot the confusion matrix to check the accuracy of search space prescreening or not
# BO method settings (Optimization step)
BO_method = 'BO' # One of 'ODBO_BO', 'ODBO_TuRBO', 'BO', or 'TuRBO'. Only the 'ODBO_*' options run the prescreening step
gp_method='robust_regression' #Must be 'gp_regression' or 'robust_regression'
tr_length = [3.2] #Trust region length, used in the TuRBO. 
batch_size = 1 # Number of new observations provided by BO per iteration. We found 1 to be the most cost-effective experimentally
failure_tolerance =10 #Number of failure iterations to change TR length in TuRBO
save_files = False #Save files or not
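
The settings above are plain strings and numbers, so a typo in `BO_method` or `gp_method` would only surface once the later cells run (or would silently skip a branch). A minimal sketch of an up-front check follows, assuming the allowed values are exactly the ones the later cells branch on. `validate_settings` and the two `ALLOWED_*` sets are hypothetical helpers for illustration and are not part of `odbo`.

```python
# Hypothetical helper (not part of odbo): fail fast on mistyped settings.
# The allowed values are inferred from the branches used later in this notebook.
ALLOWED_BO_METHODS = {"ODBO_BO", "ODBO_TuRBO", "BO", "TuRBO"}
ALLOWED_GP_METHODS = {"gp_regression", "robust_regression"}


def validate_settings(bo_method, gp_method, batch_size, tr_length):
    """Raise ValueError if a setting is outside the values this notebook handles."""
    if bo_method not in ALLOWED_BO_METHODS:
        raise ValueError(f"Unknown BO_method: {bo_method!r}")
    if gp_method not in ALLOWED_GP_METHODS:
        raise ValueError(f"Unknown gp_method: {gp_method!r}")
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    # Trust region lengths are only needed by the TuRBO variants.
    if bo_method.endswith("TuRBO") and len(tr_length) == 0:
        raise ValueError("TuRBO needs at least one trust region length")
    return True


# Same values as in the settings cell above.
validate_settings("BO", "robust_regression", batch_size=1, tr_length=[3.2])
```

Calling the check right after the settings cell keeps configuration errors separate from errors raised inside the optimization loop.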


### Data initialization

In [17]:
# Load dataset
np.random.seed(random_seed)
data_test = pd.read_csv('../datasets/GB1_2016_149361.csv', sep=',')
name_pre, Y_test = np.array(data_test['AACombo']), np.array(data_test['Fitness'])
shuffle_order = np.arange(len(Y_test))
np.random.shuffle(shuffle_order[1:])
name_pre[1:], Y_test[1:] = name_pre[shuffle_order[1:]], Y_test[shuffle_order[1:]]
name = odbo.utils.code_to_array(name_pre)
# Load the preselected experiments for a fixed shuffling order, so that the Round 0 experiments are the same across trials
if os.path.isfile('sele_experiment_GB1_2016.npy'):
    name_sele = np.load('sele_experiment_GB1_2016.npy')
    Y_train = np.load('sele_fitness_GB1_2016.npy')
else:
    # Require each of the 20 AA codes to show up at least twice at every site
    sele_indices = odbo.initialization.initial_design(name, least_occurance=2*np.ones(name.shape[1]),allow_abundance=allow_abundance, update_method=update_method,verbose = True)
    # Initial experiments are name_sele with fitness Y_train
    name_sele, Y_train = name[sele_indices, :], Y_test[sele_indices]
print('Selected initial experiments no. is ', len(Y_train))
print('Select max Y: ', Y_train.max())

Selected initial experiments no. is  40
Select max Y:  1.320616068
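
The initial design asks for every amino acid code to appear at least twice at every site. The sketch below shows one way to verify that property on a selected set of variants. It is only a check on the result, not a reimplementation of `odbo.initialization.initial_design`; `site_counts` is a hypothetical helper, and the two-letter alphabet is a toy stand-in for the 20 amino acid codes.

```python
import numpy as np


def site_counts(variants, alphabet):
    """Count how often each symbol of `alphabet` occurs at each site.

    variants: array of shape (n_variants, n_sites) holding one character per entry.
    Returns an integer array of shape (n_sites, len(alphabet)).
    """
    variants = np.asarray(variants)
    return np.array([[int(np.sum(variants[:, site] == symbol)) for symbol in alphabet]
                     for site in range(variants.shape[1])])


# Toy example: 4 variants, 2 sites, alphabet of 2 symbols.
toy_variants = np.array([list("AC"), list("CA"), list("AA"), list("CC")])
counts = site_counts(toy_variants, alphabet="AC")
covered = bool((counts >= 2).all())  # True when every symbol appears at least twice per site
print(counts, covered)
```

Running the same check on `name_sele` with the full amino acid alphabet would confirm whether the Round 0 set meets the requested minimum occurrence.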


### Featurization and finding the adaptive search space model

In [18]:
# Use the MassiveFeatureTransform method to transform features (since GB1_2016 mutates all the sites)
threshold = 0.05
feature_model = odbo.featurization.MassiveFeatureTransform(raw_vars=name_sele, Y=Y_train, method = method[0], mode=mode)
X_train = feature_model.transform(name_sele)
X_test = feature_model.transform(name)
if BO_method == 'ODBO_BO' or BO_method == 'ODBO_TuRBO':
    # Get outliers or inliers using the threshold
    labels_train = odbo.prescreening.sp_label(X_train, Y_train, thres=threshold)
    # Find the XGBOD adaptive search space model
    pre_model = odbo.prescreening.XGBOD(eval_metric = 'error', random_state = random_seed)
    pre_model.fit(X_train, labels_train)
    # Predict on the entire search space to get the adaptive search space
    labels_test = odbo.prescreening.sp_label(X_test, Y_test, thres=threshold)
    pred_test_labels = pre_model.predict(X_test)
    sele_id_test = list(np.where(pred_test_labels == 0)[0])
    # Plot the confusion matrix to check the accuracy of search space prescreening
    if cMat_plot:
        out_outlier, in_outlier, out_inlier, in_inlier = odbo.plot.plot_cm(labels_test, pred_test_labels, Y_test)
        print("Correct ratio: {0:.3%}".format((len(out_outlier)+len(in_inlier))/len(labels_test)))
        print("FN ratio: {0:.3%}".format(len(out_inlier)/len(labels_test)))
        print("FP ratio: {0:.3%}".format(len(in_outlier)/len(labels_test)))
else:
    sele_id_test = np.arange(len(Y_test.ravel()))
print("Adapt space size, Entire space size: ", len(sele_id_test), name.shape[0])



Adapt space size, Entire space size:  149361 149361
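
With `BO_method = 'BO'` the prescreening branch is skipped, which is why the adaptive space above equals the entire space. When prescreening does run, the cell reports a correct ratio, an FN ratio, and an FP ratio. The sketch below shows how such ratios can be computed from two label arrays. It assumes that label 0 marks an inlier kept in the search space (consistent with `pred_test_labels == 0` above) and that a false negative is an inlier predicted as an outlier; `prescreening_ratios` is a hypothetical helper and does not reproduce `odbo.plot.plot_cm`.

```python
import numpy as np


def prescreening_ratios(true_labels, pred_labels):
    """Return (correct, false_negative, false_positive) ratios over all samples.

    Assumed convention: 0 = inlier (kept in the search space), 1 = outlier.
    """
    true_labels = np.asarray(true_labels)
    pred_labels = np.asarray(pred_labels)
    n = len(true_labels)
    correct = np.sum(true_labels == pred_labels) / n
    # Inlier wrongly predicted as outlier: a usable variant is removed from the search space.
    false_negative = np.sum((true_labels == 0) & (pred_labels == 1)) / n
    # Outlier wrongly predicted as inlier: an unwanted variant stays in the search space.
    false_positive = np.sum((true_labels == 1) & (pred_labels == 0)) / n
    return correct, false_negative, false_positive


true_toy = [0, 0, 0, 1, 1, 0, 1, 0]
pred_toy = [0, 1, 0, 1, 0, 0, 1, 0]
correct, fn, fp = prescreening_ratios(true_toy, pred_toy)
print("Correct ratio: {0:.3%}".format(correct))
print("FN ratio: {0:.3%}".format(fn))
print("FP ratio: {0:.3%}".format(fp))
```

The three ratios do not sum to more than one, because each sample is either correctly labeled, a false negative, or a false positive.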


### Run optimizations on the datasets

In [None]:
# Create search data
X_train_sele, Y_train_sele = torch.tensor(X_train), torch.tensor(Y_train.reshape(len(Y_train),1))
search_name_sele, name_sele_temp = name[sele_id_test, :], name_sele
X_test_sele, Y_test_sele = torch.tensor(X_test[sele_id_test, :]), torch.tensor(Y_test[sele_id_test].reshape(len(sele_id_test),1))

## Run the BO experiment with robust regression or plain GP regression
l, failure_count = 0, 0 #l is the current iteration, failure_count controls when to switch to another featurization
if BO_method == 'ODBO_TuRBO' or BO_method == 'TuRBO':
    state = odbo.turbo.TurboState(dim=X_train_sele.shape[1], batch_size=batch_size, length=tr_length, n_trust_regions=len(tr_length), failure_tolerance = failure_tolerance)
    state.best_value = Y_train_sele.max()

while l < search_iter:
    print("Iter: ", l, "Current Max: ", Y_train_sele.max().detach().numpy(), "Test max: ", Y_test_sele.max().detach().numpy())
    # Run optimization using different methods    
    if BO_method == 'ODBO_BO' or BO_method == 'BO':
        X_next, acq_value, next_exp_id = odbo.bo_design(X=X_train_sele, Y=Y_train_sele, X_pending=X_test_sele, gp_method=gp_method, batch_size=batch_size)
    elif BO_method == 'ODBO_TuRBO' or BO_method == 'TuRBO':
        X_next, acq_value, raw_next_exp_id = odbo.turbo_design(state=state, X=X_train_sele, Y=Y_train_sele, X_pending=X_test_sele, n_trust_regions=len(tr_length), batch_size=batch_size, gp_method=gp_method)
        Y_next_m = torch.zeros((len(tr_length), batch_size, 1), device=Y_train_sele.device, dtype=Y_train_sele.dtype)
        next_exp_id = []  
        for i in range(batch_size):
            next_exp_id_m = raw_next_exp_id[:, i]
            Y_next_m[:, i, 0], idtoadd = Y_test_sele[next_exp_id_m].reshape(len(tr_length)), next_exp_id_m[np.argmax(Y_test_sele[next_exp_id_m])]
            next_exp_id.append(idtoadd)
    Y_train_sele = torch.cat([Y_train_sele, Y_test_sele[next_exp_id]])
    ids_keep = list(np.delete(range(Y_test_sele.shape[0]), next_exp_id))
    Y_test_sele = Y_test_sele[ids_keep]
    name_sele_temp = np.concatenate((name_sele_temp, search_name_sele[next_exp_id]))
    search_name_sele = search_name_sele[ids_keep]
    print("Newly added value: ", Y_train_sele[-batch_size:].detach().numpy(), ''.join(name_sele_temp[-1, :]))
    if BO_method == 'ODBO_TuRBO'or BO_method == 'TuRBO':
        # Update the TuRBO state with the newly added Y values
        state = odbo.turbo.update_state(state=state, Y_next=Y_next_m)

    # Switch to a different representation if the current one fails to find a new maximum
    if Y_train_sele[-batch_size:].detach().numpy().max() > Y_train_sele[:-batch_size].max():
        failure_count = 0
        feature_model1 = odbo.featurization.MassiveFeatureTransform(raw_vars=name_sele_temp, 
                                                                Y=Y_train_sele.detach().numpy(), method=method[1], mode=mode)
    else:
        failure_count = failure_count + 1
        if failure_count >= 3:
            feature_model1 = odbo.featurization.MassiveFeatureTransform(raw_vars=name_sele_temp, 
                                                                    Y=Y_train_sele.detach().numpy(), method=method[2], mode=mode)
        else:
            feature_model1 = odbo.featurization.MassiveFeatureTransform(raw_vars=name_sele_temp, 
                                                                    Y=Y_train_sele.detach().numpy(), method=method[3], mode=mode)
    X_test_sele= torch.tensor(feature_model1.transform(search_name_sele))
    X_train_sele = torch.tensor(feature_model1.transform(name_sele_temp))

    l = l + 1

# Save the BO results. Note we save all the observations including the Round 0 ones

if save_files:
    if gp_method == 'robust_regression':
        np.save('results/{}/{}_{}_RobustGP_batch{}_{}.npy'.format(dataset_name, dataset_name, BO_method, batch_size, random_seed), Y_train_sele)
    elif gp_method == 'gp_regression':
        np.save('results/{}/{}_{}_GP_batch{}_{}.npy'.format(dataset_name, dataset_name, BO_method, batch_size, random_seed), Y_train_sele)



Iter:  0 Current Max:  1.320616068 Test max:  8.761965656
Newly added value:  [[0.02815925]] VIMV
Iter:  1 Current Max:  1.320616068 Test max:  8.761965656
Newly added value:  [[0.01501826]] VNWC
Iter:  2 Current Max:  1.320616068 Test max:  8.761965656
Newly added value:  [[0.02128803]] VIMM
Iter:  3 Current Max:  1.320616068 Test max:  8.761965656
Newly added value:  [[0.03257972]] VIVC
Iter:  4 Current Max:  1.320616068 Test max:  8.761965656
Newly added value:  [[0.05564098]] VYMC
Iter:  5 Current Max:  1.320616068 Test max:  8.761965656
Newly added value:  [[1.34186633]] FIMC
Iter:  6 Current Max:  1.341866327 Test max:  8.761965656
Newly added value:  [[1.20626245]] VIMA
Iter:  7 Current Max:  1.341866327 Test max:  8.761965656
Newly added value:  [[1.]] VDGV
Iter:  8 Current Max:  1.341866327 Test max:  8.761965656
Newly added value:  [[1.16199914]] FDMA
Iter:  9 Current Max:  1.341866327 Test max:  8.761965656
Newly added value:  [[1.73291399]] FDGA
Iter:  10 Current Max:  1.73
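
The loop above picks the featurization for the next iteration from the `method` list, depending on whether the newest observation improved on the previous best and on how many consecutive failures have accumulated. The sketch below isolates that rule as a small pure function so it can be traced by hand. `next_representation` and its `patience` argument are hypothetical names; the value 3 and the index choices mirror the cell above.

```python
def next_representation(improved, failure_count, method, patience=3):
    """Pick the featurization for the next iteration.

    improved: True if the newest observation beat the previous best.
    failure_count: number of consecutive non-improving iterations so far.
    method: list of four representation names, as in the settings cell.
    Returns (representation_name, updated_failure_count).
    """
    if improved:
        # Success: reset the counter and use method[1].
        return method[1], 0
    failure_count += 1
    if failure_count >= patience:
        # Too many failures in a row: switch to method[2].
        return method[2], failure_count
    # Fewer failures than the patience limit: use method[3].
    return method[3], failure_count


method = ['Avg', 'Max', 'Avg', 'Max']
failure_count, history = 0, []
for improved in [False, False, False, True, False]:
    name_next, failure_count = next_representation(improved, failure_count, method)
    history.append(name_next)
print(history)
```

As in the cell above, the counter is not reset when the switch happens, so the `method[2]` representation stays in use until a new maximum is found.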

### Run random selection on the datasets

In [None]:
sele_Y, Y_train_sele = list(np.random.choice(Y_test, search_iter, replace = False)), list(Y_train.copy())
Y_train_sele.extend(sele_Y)
print('Max Y', max(sele_Y))
if save_files:
    np.save('results/{}/{}_random_{}.npy'.format(dataset_name, dataset_name, random_seed), Y_train_sele)
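
To compare the random baseline with the BO trajectory, it helps to look at the best value found so far after each observation rather than only the final maximum. The sketch below draws a random baseline without replacement and computes that running maximum. It uses synthetic fitness values and `np.random.default_rng` instead of the global `np.random.seed` used above; `random_baseline` is a hypothetical helper.

```python
import numpy as np


def random_baseline(y_pool, y_init, n_draws, seed=0):
    """Append `n_draws` random picks (without replacement) to the initial observations.

    Returns the full observation sequence and its running maximum.
    """
    rng = np.random.default_rng(seed)
    draws = rng.choice(np.asarray(y_pool, dtype=float), size=n_draws, replace=False)
    observed = np.concatenate([np.asarray(y_init, dtype=float), draws])
    running_max = np.maximum.accumulate(observed)
    return observed, running_max


# Synthetic stand-ins for Y_test and Y_train.
pool = np.linspace(0.0, 9.0, 100)
initial = [0.5, 1.3]
observed, running_max = random_baseline(pool, initial, n_draws=10, seed=0)
print(running_max[-1])
```

Applying `np.maximum.accumulate` to a saved BO result array gives a curve that can be plotted against this baseline on the same axis.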