
Could you release the notebook to reproduce figure 10 of the paper? #8

Closed
Thomaswbt opened this issue Mar 6, 2023 · 17 comments

@Thomaswbt

Hi Samuel!

Thank you for your repo and great work! The paper reports that single-objective LaMBO can greatly outperform LSBO on the penalized logP task. I found your released notebooks reproducing figures 1 and 3 of the paper very helpful, and it would also be of great help if you could release the notebook and the wandb logs needed to reproduce figure 10 in the appendix.

[screenshot: 2023-03-06 1:33 PM]

Thanks a lot in advance!

@Thomaswbt
Author

The reason I raised this issue is that I tried to train a single-objective LaMBO model with the exact command from the README:
python scripts/black_box_opt.py optimizer=lambo optimizer.encoder_obj=lanmt task=chem_lsbo tokenizer=selfies surrogate=single_task_svgp acquisition=ei encoder=lanmt_cnn

but this is the wandb log I get for the penalized logP metric:
[screenshot: 2023-03-07 3:41 PM]

The black-box evaluations totaled 64 × 50 = 3,200, but the best score was just above 6, which differs from the results in figure 10, so I wonder whether I missed some extra processing steps needed to reproduce the results. Thanks!

@samuelstanton
Owner

Sorry for the delayed response, would you mind sharing the link to the wandb data for your run?

@Thomaswbt
Author

Sure! This is the link to this particular run:
https://wandb.ai/thomaswang/lambo_replicate/runs/3shmruby?workspace=user-thomaswang

@samuelstanton
Owner

samuelstanton commented Mar 7, 2023

Thanks for sharing. There are three things that account for the discrepancy here:

  1. For consistency with PyMOO I followed the convention in the code that all objectives are minimized, so you need to account for the sign difference for maximized properties like penalized logP.

  2. The candidates/obj_val_* field shows the best objective value within each query batch over time, so to produce a plot like the one in the paper you need to apply a cummin transform to get the best-so-far value as a function of time.

  3. I recall there being pretty substantial variance in performance across seeds (which is why I plotted quantiles), so you'd want to run at least 5 trials, apply the cummin transform, and compute the quantiles to reproduce the plot in the paper.
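The three steps above can be sketched in NumPy; the numbers here are made up for illustration, not real run data:

```python
import numpy as np

# Hypothetical per-batch best values of candidates/obj_val_0
# (negative penalized logP, since all objectives are minimized);
# one row per seed, one column per query batch.
obj_val_0 = np.array([
    [-3.1, -4.2, -3.8, -5.0, -4.9],
    [-2.9, -3.0, -4.5, -4.4, -5.3],
    [-3.5, -3.4, -4.0, -4.8, -5.1],
])

# Steps 1 + 2: best-so-far via cummin, then flip the sign
# to recover penalized logP (which is maximized).
best_so_far = -np.minimum.accumulate(obj_val_0, axis=1)

# Step 3: quantiles across seeds at each optimization step.
q40, q60, q80 = np.quantile(best_so_far, [0.4, 0.6, 0.8], axis=0)
```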

I will see what I can do about getting you a notebook to reproduce this, but it may have to wait a bit while I deal with other work on my plate. In any case, I'm delighted you're taking the time to reproduce these experiments. If you have any further questions, don't hesitate to ask :)

@Thomaswbt
Author

Thank you for your suggestions! I will fix these differences first.

@samuelstanton
Owner

samuelstanton commented Mar 8, 2023

Sounds good. Note that the seed is fixed in the config, so you'll want to be sure to override it, e.g.

python scripts/black_box_opt.py -m optimizer=lambo optimizer.encoder_obj=lanmt task=chem_lsbo tokenizer=selfies surrogate=single_task_svgp acquisition=ei encoder=lanmt_cnn seed=1,2,3,4

@samuelstanton
Owner

samuelstanton commented Mar 8, 2023

One more thing: obj_val_0 is actually the negative penalized logP, so you'll want to either apply cummin and then negate, or negate and then apply cummax. I've edited my previous response to reflect this.

https://github.com/samuelstanton/lambo/blob/main/lambo/tasks/chem/chem.py#L105
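A quick sanity check (with made-up numbers) that the two orderings give the same curve:

```python
import numpy as np

x = np.array([-3.1, -4.2, -3.8, -5.0])  # hypothetical obj_val_0 values

a = -np.minimum.accumulate(x)  # cummin, then negate
b = np.maximum.accumulate(-x)  # negate, then cummax
assert np.array_equal(a, b)
```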

@Thomaswbt
Author

Sure, thanks for the reminder! The experiments are still running. I also wonder whether it's reasonable that the single-objective experiments take 1 day 12 hours to finish, while the multi-objective experiments take just 5 hours. Intuitively, shouldn't the single-objective runs be faster than the multi-objective ones?

@samuelstanton
Owner

samuelstanton commented Mar 10, 2023

Fair question. The single-objective experiment collects bigger batches of data over more rounds than the multi-objective experiments, so exact GP inference would require a lot of GPU memory and would likely be numerically unstable. Instead, for this task I use a variational GP, which has a constant memory footprint and is more numerically stable on large datasets. Unfortunately, variational GPs are fairly slow to train, which leads to the dramatic increase in runtime. There is probably room for optimization here; the current training recipe is tuned more for stability than speed.
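A back-of-the-envelope sketch of the memory argument: the 3,200-point dataset size comes from the run discussed above, while the inducing-point count of 64 is an illustrative assumption, not the repo's actual setting.

```python
def gram_matrix_bytes(n, itemsize=8):
    """Memory for an n x n float64 kernel (Gram) matrix."""
    return n * n * itemsize

# Exact GP inference: Gram matrix over every evaluated point,
# so memory grows quadratically with the dataset.
exact_gp = gram_matrix_bytes(3200)

# Variational (sparse) GP: Gram matrix over a fixed set of
# inducing points, independent of how much data is collected.
svgp = gram_matrix_bytes(64)

print(exact_gp / 1e6, "MB vs", svgp / 1e3, "KB")  # → 81.92 MB vs 32.768 KB
```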

@Thomaswbt
Author

Thanks for the reply! It makes sense now.

However, when I re-ran the experiments with seeds 1, 2, 3, and 4, I found that the optimization performance is still below expectation. The wandb logs are the runs with IDs 12, 13, 14, and 15 in the project:

https://wandb.ai/thomaswang/lambo_replicate/groups/test/table?workspace=user-thomaswang

I did not apply the cummin operation to the logged outputs, but we can see that the minimum values of obj_val_0 are around -7 in all runs, i.e., around 7 for penalized logP.

I wonder whether there is a problem with the default configuration for this setting. Would it be possible for you to double-check the configuration? On my side, I will also double-check whether something is wrong with my reproduction.

Thanks very much!

@samuelstanton
Owner

Hm, OK, I'll take a look. Thanks for raising the issue.

@jasonkyuyim

Hi! I am also interested in the single-objective use case for LaMBO. Is there any update on reproducing the published numbers?

@kirjner

kirjner commented Mar 27, 2023

@samuelstanton I'm also having some trouble reproducing the results. I ran the following command:
python scripts/black_box_opt.py -m optimizer=lambo optimizer.encoder_obj=lanmt task=chem_lsbo tokenizer=selfies surrogate=single_task_svgp acquisition=ei encoder=lanmt_cnn seed=1,2,3,4
and, while the script is still running, I'm getting results similar to @Thomaswbt's above (in fact, slightly worse).
[screenshot: 2023-03-27 10:26 AM]

It would be great to get an update on this, thank you!

@samuelstanton
Owner

Thank you all for your patience. I've determined that some of the default hyperparameters were indeed misconfigured, and I have updated the command in the README. That said, the results I'm getting now are not quite what I expect, and I will continue to investigate. Here's what I'm getting now:

40%, 60%, and 80% quantiles across 5 seeds (0-4)
[image]

Performance by seed
[image]

While this is much better than the results you were seeing, and the algorithm does "solve" the problem for 3/5 seeds (i.e., it learns to output long hydrocarbon chains), this is not as good as what I was seeing before and is more sensitive to the random seed than I'd like. In any case, I wanted to share an update while I continue looking into this. I've also pushed the notebook I used to create these plots to notebooks/plot_lsbo_comparison.ipynb.

@samuelstanton
Owner

The major hypers that have been corrected are:

  • optimizer.window_size=1 --> optimizer.window_size=8. This hyperparameter controls how many corruptions are made to the seed sequence and can have a major effect when the optimal solution requires large increases in sequence length.
  • surrogate.bs=32 --> surrogate.bs=256. With a larger dataset, increasing the batch size decreases the runtime significantly; I was seeing about 6 hours per seed on an A100 after this change.
  • optimizer.resampling_weight=1.0 --> optimizer.resampling_weight=0.5. This change makes the optimizer sample "good" seeds more aggressively when constructing batches of candidates.
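Combining these corrected overrides with the multirun command from earlier in the thread, the invocation would look something like this (an unverified sketch; check the updated README for the authoritative command):

```shell
python scripts/black_box_opt.py -m optimizer=lambo optimizer.encoder_obj=lanmt \
    task=chem_lsbo tokenizer=selfies surrogate=single_task_svgp acquisition=ei \
    encoder=lanmt_cnn optimizer.window_size=8 surrogate.bs=256 \
    optimizer.resampling_weight=0.5 seed=0,1,2,3,4
```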

@samuelstanton
Owner

samuelstanton commented May 1, 2023

Increasing the max context length to 256 (task.max_len=256) improves performance on this benchmark, as I noted in the paper, but variance across seeds is still an issue.

[image]

@Thomaswbt
Author

Sorry for the late response, and thank you for your effort! Previously, I also found that the choice of starting sequences matters a lot for the final results. I will close this issue.
