This is a brief tutorial on how to use LLAMBO for your own black-box functions. First, please make sure you follow the steps in ```README``` to set up your environment.

This tutorial provides:
1. An overview of the ```LLAMBO``` class,
2. A walkthrough of running ```LLAMBO``` on your own black-box function by
    - Providing ```task_context```, which semantically describes the problem space
    - Providing ```init_f``` (an initialization function with a customizable initialization strategy) and ```bbox_eval_f``` (a function that evaluates a proposed point on the black-box function)

### Overview of LLAMBO

```
llambo_opt = LLAMBO(
    task_context: dict,         # dictionary describing task
    sm_mode,                    # either 'generative' or 'discriminative', for generative or discriminative surrogate model
    n_candidates,               # number of candidate points to sample at each iteration
    n_templates,                # number of different prompts (or templates) used for LLM queries
    n_gens,                     # number of generations for LLM, set at 5
    alpha,                      # alpha for candidate point sampler, recommended to be -0.2
    n_initial_samples,          # number of initialization points to evaluate
    n_trials,                   # number of trials to run,
    init_f,                     # function to generate initial configurations
    bbox_eval_f,                # bbox function to evaluate a point
    chat_engine,                # LLM chat engine
    top_pct=None,               # only used for generative SM, top percentage of points to consider for generative SM
    use_input_warping=False,    # whether to use input warping
    prompt_setting=None,        # ablation on prompt design, either 'full_context' or 'partial_context' or 'no_context' (only used for ablation experiments)
    shuffle_features=False      # whether to shuffle features in prompt generation (only used for ablation experiments)
)
```

To use the LLAMBO optimizer, you need to provide three key components:
- ```task_context```: contextual information about the optimization problem that is used to construct prompts
- ```init_f```: function to generate ```n_initial_samples``` (e.g. 5) initial points used to initialize the BO search process
- ```bbox_eval_f```: bbox function that evaluates a proposed point

1. **task_context**: Here is an example ```task_context```, extracted automatically for the Bayesmark task [```RandomForest``` (model), ```breast``` (dataset)]:

In [None]:
task_context = {
    'model': 'RandomForest', 
    'task': 'classification', 
    'tot_feats': 30, 
    'cat_feats': 0, 
    'num_feats': 30, 
    'n_classes': 2, 
    'metric': 'accuracy', 
    'lower_is_better': False, 
    'num_samples': 455, 
    'hyperparameter_constraints': {
        'max_depth': ['int', 'linear', [1, 15]],        # [type, transform, [min_value, max_value]]
        'max_features': ['float', 'logit', [0.01, 0.99]], 
        'min_impurity_decrease': ['float', 'linear', [0.0, 0.5]], 
        'min_samples_leaf': ['float', 'logit', [0.01, 0.49]], 
        'min_samples_split': ['float', 'logit', [0.01, 0.99]], 
        'min_weight_fraction_leaf': ['float', 'logit', [0.01, 0.49]]
    }
}
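
Before handing a ```task_context``` to the optimizer, it can help to sanity-check the ```hyperparameter_constraints``` entries. The sketch below is not part of LLAMBO: ```check_task_constraints``` and ```config_in_bounds``` are hypothetical helpers that only assume the ```[type, transform, [min_value, max_value]]``` layout shown above.

```python
def check_task_constraints(constraints):
    """Raise ValueError unless every entry looks like [type, transform, [min_value, max_value]]."""
    for name, spec in constraints.items():
        if len(spec) != 3:
            raise ValueError(f"{name}: expected [type, transform, [min, max]], got {spec!r}")
        _dtype, _transform, bounds = spec
        if len(bounds) != 2 or not bounds[0] < bounds[1]:
            raise ValueError(f"{name}: bounds must be [min, max] with min < max, got {bounds!r}")


def config_in_bounds(config, constraints):
    """Return True if config has exactly the constrained keys and every value lies in [min, max]."""
    if set(config) != set(constraints):
        return False
    return all(constraints[k][2][0] <= v <= constraints[k][2][1] for k, v in config.items())


# Toy subset of the constraints above, used only for illustration
example_constraints = {
    'max_depth': ['int', 'linear', [1, 15]],
    'max_features': ['float', 'logit', [0.01, 0.99]],
}
check_task_constraints(example_constraints)
ok = config_in_bounds({'max_depth': 8, 'max_features': 0.5}, example_constraints)
bad = config_in_bounds({'max_depth': 20, 'max_features': 0.5}, example_constraints)
```

Here ```ok``` is ```True``` and ```bad``` is ```False``` because ```max_depth=20``` falls outside ```[1, 15]```.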

2. **init_f**:
```
def init_f(n_samples: int):
    '''
    Generate initialization points for BO search
    Args: n_samples (int)
    Returns: init_configs (list of dictionaries, each dictionary is a point to be evaluated)
    '''
    return initial_samples
```
The initialization function should accept ```n_samples (int)```, the number of initial points to generate, and return them as a list of dictionaries, where each dictionary is a point to be evaluated.
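
As one possible initialization strategy, the sketch below draws points uniformly at random from the ranges in ```hyperparameter_constraints```. It is an illustration, not the strategy used in the Bayesmark experiments: ```make_random_init_f``` is a hypothetical helper, and it ignores the transform field (```'linear'```, ```'logit'```) by sampling directly in the raw value range.

```python
import random


def make_random_init_f(constraints, seed=0):
    """Build an init_f that samples uniformly within each hyperparameter's [min, max] range."""
    rng = random.Random(seed)  # private RNG so results are reproducible for a given seed

    def init_f(n_samples):
        init_configs = []
        for _ in range(n_samples):
            config = {}
            for name, (dtype, _transform, (low, high)) in constraints.items():
                if dtype == 'int':
                    config[name] = rng.randint(low, high)   # both bounds inclusive
                else:
                    config[name] = rng.uniform(low, high)
            init_configs.append(config)
        return init_configs

    return init_f


# Toy constraints in the same format as task_context['hyperparameter_constraints']
toy_constraints = {
    'max_depth': ['int', 'linear', [1, 15]],
    'max_features': ['float', 'logit', [0.01, 0.99]],
}
init_f = make_random_init_f(toy_constraints, seed=0)
samples = init_f(5)  # list of 5 dictionaries, one per initial point
```

The returned ```init_f``` has the signature described above, so it could be passed as the ```init_f``` argument when constructing the optimizer.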

3. **bbox_eval_f**: 

```
def bbox_eval_f(point_to_evaluate: dict):
    '''
    Evaluate a single point on bbox function
    Args: point_to_evaluate (dict), dictionary containing point to be evaluated
    Returns: (point_to_evaluate, f_vals) (dict, dict)
             point_to_evaluate (dict) is the point evaluated
             f_vals (dict) is a dictionary that can track an arbitrary number of metrics, but must contain 'score' which is what LLAMBO optimizer tries to optimize by default

    Example f_vals:
    f_vals = {
        'score': float,                     -> 'score' is what the LLAMBO optimizer tries to optimize
        'generalization_score': float,
        'acc': float,
        ...
        'f1': float
    }
    '''
```
The black-box evaluation function should accept ```point_to_evaluate (dict)```, a dictionary containing the point to be evaluated, and return this point together with a dictionary of evaluation results.
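
To make the contract concrete, here is a minimal sketch of a ```bbox_eval_f``` for a toy two-dimensional quadratic. The objective, the parameter names ```x``` and ```y```, and the extra ```distance_to_optimum``` metric are all made up for illustration; only the return shape (the point, plus a dictionary containing ```'score'```) follows the description above.

```python
def bbox_eval_f(point_to_evaluate: dict):
    """Evaluate a toy quadratic with its minimum at x = 1, y = -2."""
    x = point_to_evaluate['x']
    y = point_to_evaluate['y']
    score = (x - 1.0) ** 2 + (y + 2.0) ** 2

    f_vals = {
        'score': score,                       # required key: the value being optimized
        'distance_to_optimum': score ** 0.5,  # optional extra metric, tracked but not optimized
    }
    return point_to_evaluate, f_vals


point, f_vals = bbox_eval_f({'x': 1.0, 'y': -2.0})
```

Because this toy objective should be minimized, the matching ```task_context``` would presumably set ```'lower_is_better': True```.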


Below is an example class written to run Bayesmark BO tasks. It contains two methods, ```generate_initialization``` (init_f) and ```evaluate_point``` (bbox_eval_f):

In [None]:
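import pickle
import random
import warnings

import numpy as np
import pandas as pd
from sklearn.metrics import get_scorer
from sklearn.model_selection import cross_val_score

# get_bayesmark_func and LLAMBO come from the LLAMBO repository (see the README for setup);
# their import paths are not shown in this tutorial.

class BayesmarkExpRunner: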
    def __init__(self, task_context, dataset, seed):
        self.seed = seed
        self.model = task_context['model']
        self.task = task_context['task']
        self.metric = task_context['metric']
        self.dataset = dataset
        self.hyperparameter_constraints = task_context['hyperparameter_constraints']
        self.bbox_func = get_bayesmark_func(self.model, self.task, dataset['test_y'])
    
    def generate_initialization(self, n_samples):
        '''
        Generate initialization points for BO search
        Args: n_samples (int)
        Returns: init_configs (list of dictionaries, each dictionary is a point to be evaluated)
        '''

        # Read from fixed initialization points (all baselines see same init points)
        init_configs = pd.read_json(f'bayesmark/configs/{self.model}/{self.seed}.json').head(n_samples)
        init_configs = init_configs.to_dict(orient='records')

        assert len(init_configs) == n_samples

        return init_configs
        
    def evaluate_point(self, candidate_config):
        '''
        Evaluate a single point on bbox
        Args: candidate_config (dict), dictionary containing point to be evaluated
        Returns: (dict, dict), first dictionary is candidate_config (the evaluated point), second dictionary is fvals (the evaluation results)
        '''
        np.random.seed(self.seed)
        random.seed(self.seed)

        X_train, X_test, y_train, y_test = self.dataset['train_x'], self.dataset['test_x'], self.dataset['train_y'], self.dataset['test_y']

        for hyperparam, value in candidate_config.items():
            if self.hyperparameter_constraints[hyperparam][0] == 'int':
                candidate_config[hyperparam] = int(value)

        if self.task == 'regression':
            mean_ = np.mean(y_train)
            std_ = np.std(y_train)
            y_train = (y_train - mean_) / std_
            y_test = (y_test - mean_) / std_

        model = self.bbox_func(**candidate_config)
        scorer = get_scorer(self.metric)

        with warnings.catch_warnings():
            warnings.filterwarnings('ignore', category=UserWarning)
            S = cross_val_score(model, X_train, y_train, scoring=scorer, cv=5)
        cv_score = np.mean(S)
        
        model = self.bbox_func(**candidate_config)  
        model.fit(X_train, y_train)
        generalization_score = scorer(model, X_test, y_test)

        if self.metric == 'neg_mean_squared_error':
            cv_score = -cv_score
            generalization_score = -generalization_score

        return candidate_config, {'score': cv_score, 'generalization_score': generalization_score}

## Putting it Together!

After preparing ```task_context```, your search initialization function (```init_f```), and your black box function (```bbox_eval_f```), you can run LLAMBO optimization with a few lines of code:

In [None]:
dataset = 'breast'
seed = 0
chat_engine = None  # placeholder: replace with your LLM chat engine (the code currently supports only the OpenAI LLM API)

# load data
pickle_fpath = f'bayesmark/data/{dataset}.pickle'
with open(pickle_fpath, 'rb') as f:
    data = pickle.load(f)

# instantiate BayesmarkExpRunner
benchmark = BayesmarkExpRunner(task_context, data, seed)

# instantiate LLAMBO
llambo = LLAMBO(task_context, sm_mode='discriminative', n_candidates=10, n_templates=2, n_gens=10, 
                alpha=0.1, n_initial_samples=5, n_trials=25, 
                init_f=benchmark.generate_initialization,
                bbox_eval_f=benchmark.evaluate_point, 
                chat_engine=chat_engine)
llambo.seed = seed

# run optimization
configs, fvals = llambo.optimize()
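
Once the run finishes, you will usually want the best observed point. This tutorial does not state the return types of ```optimize()```, so the sketch below assumes ```configs``` and ```fvals``` are parallel lists of dictionaries; if they are pandas DataFrames, ```.to_dict(orient='records')``` converts them to that form. ```best_observation``` is a hypothetical helper, and the toy values stand in for real results.

```python
def best_observation(configs, fvals, lower_is_better):
    """Return the (config, f_vals) pair with the best 'score'."""
    scores = [f['score'] for f in fvals]
    pick = min if lower_is_better else max
    best_idx = pick(range(len(scores)), key=scores.__getitem__)
    return configs[best_idx], fvals[best_idx]


# Toy stand-ins for the outputs of an optimization run
toy_configs = [{'max_depth': 3}, {'max_depth': 8}, {'max_depth': 12}]
toy_fvals = [{'score': 0.91}, {'score': 0.95}, {'score': 0.93}]

# For the accuracy metric above, task_context['lower_is_better'] is False
best_config, best_fvals = best_observation(toy_configs, toy_fvals, lower_is_better=False)
```

With these toy values, ```best_config``` is ```{'max_depth': 8}``` with a score of ```0.95```.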