CombinGym is a curated collection of protein combinatorial variant datasets and benchmarks for evaluating how well models predict combinatorial mutation effects. The tasks include zero-shot prediction and predicting higher-order mutants (HOMs) from lower-order ones, covering the 0-vs-rest, 1-vs-rest, 2-vs-rest, and 3-vs-rest scenarios. Currently, CombinGym offers a comprehensive suite of 14 combinatorial variant datasets, featuring 4 to 16 mutations across 9 proteins and encompassing over 400,000 meticulously curated data points. These datasets cover a range of protein properties, including protein-protein interaction, fluorescence, and enzymatic activities targeting nucleic acids, proteins, and small molecules. CombinGym supports 9 models, spanning sequence-based, structure-based, sequence-label machine learning, and statistical approaches, and provides metrics for both mutational effect prediction and design tasks.
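The n-vs-rest tasks train on mutants carrying at most n mutations and evaluate on all higher-order ones. A minimal sketch of this split (the `num_mutations` field name is an assumption; check each clean dataset for its actual column names):

```python
def n_vs_rest_split(records, n):
    """Train on mutants with at most n mutations; test on the rest.

    For n = 0 the training set is empty (the zero-shot setting).
    `records` is any iterable of dicts with a `num_mutations` key.
    """
    train = [r for r in records if r["num_mutations"] <= n]
    test = [r for r in records if r["num_mutations"] > n]
    return train, test
```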
The `data` folder provides all input data for all baselines.
- For the DMS data: We provide both the raw data from the original publication and the clean data, which includes at least the genotype, number of mutations, and fitness for each mutant.
- For the WT sequence: We provide the FASTA file of the full amino acid sequence for each mutated protein, corresponding to the starting sequence for deep mutational scanning library construction, multiple sequence alignment file generation, and structure prediction.
- For the MSA data: We provide the A2M file of multiple sequence alignment for each protein, generated by the local EVcouplings application using the WT sequence as input. Multiple A2M files with different bitscore thresholds for specific proteins are included for comparison.
- For the structure data: We provide the PDB file of each protein, predicted by the AlphaFold3 model.
For more detailed information, please see `Data_summary.csv`. The above files are also accessible via our dedicated website: https://www.combingym.org/.
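As a sketch of how these files might be consumed (the paths and column names here are assumptions; see Data_summary.csv for each dataset's actual layout):

```python
import pandas as pd

def load_clean_dms(csv_path):
    """Load a cleaned DMS table and check the minimal columns.

    Column names are hypothetical; every clean file carries at least
    the genotype, number of mutations, and fitness for each mutant.
    """
    df = pd.read_csv(csv_path)
    missing = {"genotype", "num_mutations", "fitness"} - set(df.columns)
    if missing:
        raise ValueError(f"missing expected columns: {missing}")
    return df

def load_wt_sequence(fasta_path):
    """Concatenate the sequence lines of a single-record WT FASTA file."""
    with open(fasta_path) as fh:
        return "".join(
            line.strip() for line in fh if not line.startswith(">")
        )
```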
The `benchmarks` folder provides detailed performance files for all baselines on the tasks of 0-vs-rest, 1-vs-rest, 2-vs-rest, and 3-vs-rest.
- For mutational effect prediction: We report Spearman's rank correlation coefficient.
- For mutant design: We report the normalized discounted cumulative gain (NDCG) metric.
For the supervised models GVP-Mut, MAVE-NN, CNN, and Ridge, the metrics of the 0-vs-rest tasks are absent, since these models require labeled training data. For the unsupervised models EVmutation and DeepSequence, the metrics of the 1/2/3-vs-rest tasks represent the performance on the “rest” mutants without supervision from the labels of single, double, or triple mutants.
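Both metrics can be computed as in the following sketch (the repo's own `evaluate.py` is the reference implementation; this NDCG uses linear gains and assumes non-negative fitness values, shift or rescale otherwise):

```python
import numpy as np
from scipy.stats import spearmanr

def ndcg(y_true, y_pred, k=None):
    """NDCG: DCG of the true fitness values in model-predicted order,
    normalized by the DCG of the ideal (fitness-sorted) order.
    Assumes non-negative fitness values as relevance gains."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    k = len(y_true) if k is None else k
    discounts = 1.0 / np.log2(np.arange(2, k + 2))
    pred_order = y_true[np.argsort(-y_pred)][:k]
    ideal_order = np.sort(y_true)[::-1][:k]
    return float(pred_order @ discounts) / float(ideal_order @ discounts)

def evaluate(y_true, y_pred):
    """Spearman's rho for effect prediction, NDCG for mutant design."""
    rho = spearmanr(y_true, y_pred).correlation
    return {"spearman": float(rho), "ndcg": ndcg(y_true, y_pred)}
```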
The full benchmark performance and leaderboard are also available on the CombinGym website: https://www.combingym.org/. It includes leaderboards for both the mutational effect prediction and mutant design benchmarks, as well as detailed performance for all proteins, baselines, and tasks. The current version includes the following baselines:
| Model Name | Input | Reference |
|---|---|---|
| EVmutation | MSA | Hopf, T.A., et al. (2017). Mutation effects predicted from sequence co-variation. Nature Biotechnology, 35, 128-135. |
| DeepSequence | MSA | Riesselman, A.J., et al. (2018). Deep generative models of genetic variation capture the effects of mutations. Nature Methods, 15, 816-822. |
| ESM-1b | Protein language & sequence-label | Rives, A., et al. (2021). Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. PNAS, 118. |
| ESM-1v | Protein language & sequence-label | Meier, J., et al. (2021). Language models enable zero-shot prediction of the effects of mutations on protein function. NeurIPS. |
| GVP-Mut | Structure & sequence-label | Chen, L., et al. (2023). Learning protein fitness landscapes with deep mutational scanning data from multiple sources. Cell Systems, 14(8), 706-721.e5. |
| MAVE-NN | Sequence-label | Tareen, A., et al. (2022). MAVE-NN: learning genotype-phenotype maps from multiplex assays of variant effect. Genome Biology, 23(1), 98. |
| CNN | Sequence-label | Dallago, C., et al. (2021). FLIP: Benchmark tasks in fitness landscape inference for proteins. bioRxiv, 2021.11.09.467890. |
| Ridge | Sequence-label | Dallago, C., et al. (2021). FLIP: Benchmark tasks in fitness landscape inference for proteins. bioRxiv, 2021.11.09.467890. |
| BLOSUM62 | Substitution | Dallago, C., et al. (2021). FLIP: Benchmark tasks in fitness landscape inference for proteins. bioRxiv, 2021.11.09.467890. |
If you would like to contribute new datasets to CombinGym, please submit them via “Datasets Submit” on the CombinGym website (https://www.combingym.org/). We are also open to worldwide collaboration on building and testing protein combinatorial libraries with our automated biofoundry, the world’s largest. The criteria we typically consider for inclusion are as follows:
- The corresponding raw dataset needs to be publicly available.
- The assay needs to be a protein-related combinatorial library.
- The dataset needs to have a sufficient number of measurements and be well-documented with at least the genotype, number of mutations, and fitness.
If you would like new baselines to be included in CombinGym, please follow these steps:
- Prepare your model code: Make sure your model code is complete and includes the necessary dependencies and environment configuration so that we can reproduce your results if needed.
- Run the model and record the results: Run your model using the same dataset and data splits as the CombinGym project. Record the performance scores and other relevant metrics of the model for comparison and verification.
Please use a pull request to submit your model code and metrics.
**Ridge, CNN, ESM-1b, ESM-1v, BLOSUM62**
- Source: FLIP by Christian Dallago et al.
- GitHub Link: FLIP
- Provided Scripts:
  - `mutant_to_seq.ipynb`: Generates mutants from sequences.
  - `seq_to_mutant.ipynb`: Generates sequences from mutants.
  - `split.ipynb`: Splits the data into training and testing sets.
  - `dictionary.ipynb`: Generates a dictionary to be added to the `train_all.py` file.
  - `run_train_all.sh`: Executes predictions on files within a directory in batch mode.
  - `evaluate.py`: Evaluates Spearman and NDCG metrics in batch mode.
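The mutant-to-sequence conversion above amounts to applying substitutions to the WT sequence. A sketch, assuming a genotype notation of wild-type residue, 1-based position, and new residue, joined by colons (e.g. `A24G:S45T`); check each dataset's clean file for its actual format:

```python
def apply_mutations(wt_seq, genotype, sep=":"):
    """Apply substitutions like 'A24G' (wt aa, 1-based position, new aa)
    to the WT sequence; the notation is an assumption, not CombinGym's
    documented format."""
    if not genotype or genotype.upper() == "WT":
        return wt_seq
    seq = list(wt_seq)
    for mut in genotype.split(sep):
        wt_aa, pos, new_aa = mut[0], int(mut[1:-1]) - 1, mut[-1]
        if seq[pos] != wt_aa:
            raise ValueError(f"WT mismatch at {mut}: found {seq[pos]}")
        seq[pos] = new_aa
    return "".join(seq)
```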
**GVP-Mut**
- Source: GVP-MSA by Lin Chen et al.
- GitHub Link: GVP-MSA
- Provided Script:
  - `gvpmutsplit.ipynb`: Splits data using a strategy based on `train_single2multi.py` from GVP-MSA, but without using MSA.
**MAVE-NN**
- Source: MAVE-NN by Ammar Tareen et al.
- GitHub Link: MAVE-NN
- Provided Scripts:
  - `mavenn_prediction.ipynb`: Used for prediction and visualization on a single dataset.
  - `run_mavenn.py`: Used for batch prediction on multiple files.
**DeepSequence**
- Source: DeepSequence by Adam J. Riesselman et al.
- GitHub Link: DeepSequence
- Provided Script:
  - `run_predict.py`: Used for making predictions.
**EVmutation**
- Source: EVmutation by Thomas A Hopf et al.
- GitHub Link: EVmutation
- Provided Script:
  - `multi_prediction.ipynb`: Used for making predictions on multipoint datasets.
Please refer to the instructions on the respective GitHub pages of each model for installation and usage.
Our codebase leverages code from the following repositories to compute baselines:
| Model | Repository |
|---|---|
| EVmutation | EVmutation GitHub |
| DeepSequence | DeepSequence GitHub |
| ESM-1b | FLIP GitHub |
| ESM-1v | FLIP GitHub |
| GVP-Mut | GVP-MSA GitHub |
| MAVE-NN | MAVE-NN GitHub |
| CNN | FLIP GitHub |
| Ridge | FLIP GitHub |
| BLOSUM62 | FLIP GitHub |
This project is available under the MIT license found in the LICENSE file in this GitHub repository.
This work is currently under submission.
- Website: https://www.combingym.org/
- CombinGym paper: (link to be added upon publication)