CombinGym is a curated collection of protein combinatorial variant datasets and benchmarks for evaluating how well models predict combinatorial mutation effects. The tasks include zero-shot prediction and predicting higher-order mutants (HOMs) from lower-order ones, covering the 0-vs-rest, 1-vs-rest, 2-vs-rest, and 3-vs-rest scenarios. Currently, CombinGym offers a comprehensive suite of 14 combinatorial variant datasets, featuring 4 to 16 mutations across 9 proteins and encompassing over 400,000 meticulously curated data points. These datasets cover a range of protein properties, including protein-protein interaction, fluorescence, and enzymatic activities targeting nucleic acids, proteins, and small molecules. CombinGym supports 9 models, spanning sequence-based, structure-based, sequence-label machine learning, and statistical approaches, and provides metrics for both mutational effect prediction and design tasks.
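The n-vs-rest tasks train on mutants carrying at most n mutations and evaluate on all higher-order ones. A minimal sketch of this split (the `num_mutations` field name is an assumption; check each clean dataset for its actual column names):

```python
def n_vs_rest_split(records, n):
    """Train on mutants with at most n mutations; test on the rest.

    For n = 0 the training set is empty (the zero-shot setting).
    `records` is any iterable of dicts with a `num_mutations` key.
    """
    train = [r for r in records if r["num_mutations"] <= n]
    test = [r for r in records if r["num_mutations"] > n]
    return train, test
```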
The `data` folder provides all input data for all baselines.
- For the DMS data: We provide both the raw data from the original publication and the clean data, which includes at least the genotype, number of mutations, and fitness for each mutant.
- For the WT sequence: We provide the FASTA file of the full amino acid sequence for each mutated protein, corresponding to the starting sequence for deep mutational scanning library construction, multiple sequence alignment file generation, and structure prediction.
- For the MSA data: We provide the A2M file of multiple sequence alignment for each protein, generated by the local EVcouplings application using the WT sequence as input. Multiple A2M files with different bitscore thresholds for specific proteins are included for comparison.
- For the structure data: We provide the PDB file of each protein, predicted by the AlphaFold3 model.
For more detailed information, please see `Data_summary.csv`. The above files are also accessible via our dedicated website: https://www.combingym.org/.
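As a sketch of how these files might be consumed (the paths and column names here are assumptions; see Data_summary.csv for each dataset's actual layout):

```python
import pandas as pd

def load_clean_dms(csv_path):
    """Load a cleaned DMS table and check the minimal columns.

    Column names are hypothetical; every clean file carries at least
    the genotype, number of mutations, and fitness for each mutant.
    """
    df = pd.read_csv(csv_path)
    missing = {"genotype", "num_mutations", "fitness"} - set(df.columns)
    if missing:
        raise ValueError(f"missing expected columns: {missing}")
    return df

def load_wt_sequence(fasta_path):
    """Concatenate the sequence lines of a single-record WT FASTA file."""
    with open(fasta_path) as fh:
        return "".join(
            line.strip() for line in fh if not line.startswith(">")
        )
```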
The `benchmarks` folder provides detailed performance files for all baselines on the tasks of 0-vs-rest, 1-vs-rest, 2-vs-rest, and 3-vs-rest.
- For mutational effect prediction: We report Spearman's rank correlation coefficient.
- For mutant design: We report the normalized discounted cumulative gain (NDCG) metric.
For the supervised models GVP-Mut, MAVE-NN, CNN, and Ridge, the metrics of the 0-vs-rest tasks are absent, since these models require labeled training data. For the unsupervised models EVmutation and DeepSequence, the metrics of the 1/2/3-vs-rest tasks represent the performance on the “rest” mutants without supervision from the labels of single, double, or triple mutants.
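Both metrics can be computed as in the following sketch (the repo's own `evaluate.py` is the reference implementation; this NDCG uses linear gains and assumes non-negative fitness values, shift or rescale otherwise):

```python
import numpy as np
from scipy.stats import spearmanr

def ndcg(y_true, y_pred, k=None):
    """NDCG: DCG of the true fitness values in model-predicted order,
    normalized by the DCG of the ideal (fitness-sorted) order.
    Assumes non-negative fitness values as relevance gains."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    k = len(y_true) if k is None else k
    discounts = 1.0 / np.log2(np.arange(2, k + 2))
    pred_order = y_true[np.argsort(-y_pred)][:k]
    ideal_order = np.sort(y_true)[::-1][:k]
    return float(pred_order @ discounts) / float(ideal_order @ discounts)

def evaluate(y_true, y_pred):
    """Spearman's rho for effect prediction, NDCG for mutant design."""
    rho = spearmanr(y_true, y_pred).correlation
    return {"spearman": float(rho), "ndcg": ndcg(y_true, y_pred)}
```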
The full benchmark performance and leaderboard are also available on the CombinGym website: https://www.combingym.org/. It includes leaderboards for both the mutational effect prediction and mutant design benchmarks, as well as detailed performance for all proteins, baselines, and tasks. The current version includes the following baselines:
| Model Name | Input | Reference |
|---|---|---|
| EVmutation | MSA | Hopf, T.A., et al. (2017). Mutation effects predicted from sequence co-variation. Nature Biotechnology, 35, 128-135. |
| DeepSequence | MSA | Riesselman, A.J., et al. (2018). Deep generative models of genetic variation capture the effects of mutations. Nature Methods, 15, 816-822. |
| ESM-1b | Protein language & sequence-label | Rives, A., et al. (2021). Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. PNAS, 118. |
| ESM-1v | Protein language & sequence-label | Meier, J., et al. (2021). Language models enable zero-shot prediction of the effects of mutations on protein function. NeurIPS. |
| GVP-Mut | Structure & sequence-label | Chen, L., et al. (2023). Learning protein fitness landscapes with deep mutational scanning data from multiple sources. Cell Systems, 14(8), 706-721.e5. |
| MAVE-NN | Sequence-label | Tareen, A., et al. (2022). MAVE-NN: learning genotype-phenotype maps from multiplex assays of variant effect. Genome Biology, 23(1), 98. |
| CNN | Sequence-label | Dallago, C., et al. (2021). FLIP: Benchmark tasks in fitness landscape inference for proteins. bioRxiv, 2021.11.09.467890. |
| Ridge | Sequence-label | Dallago, C., et al. (2021). FLIP: Benchmark tasks in fitness landscape inference for proteins. bioRxiv, 2021.11.09.467890. |
| BLOSUM62 | Substitution | Dallago, C., et al. (2021). FLIP: Benchmark tasks in fitness landscape inference for proteins. bioRxiv, 2021.11.09.467890. |
If you would like to contribute new datasets to CombinGym, please submit them via “Datasets Submit” on the CombinGym website (https://www.combingym.org/). We are also open to worldwide collaboration on building and testing protein combinatorial libraries with our automated biofoundry, the world’s largest. The criteria we typically consider for inclusion are as follows:
- The corresponding raw dataset needs to be publicly available.
- The assay needs to be a protein-related combinatorial library.
- The dataset needs to have a sufficient number of measurements and be well-documented with at least the genotype, number of mutations, and fitness.
If you would like new baselines to be included in CombinGym, please follow these steps:
- Prepare your model code: Make sure your model code is complete and includes the necessary dependencies and environment configuration so that we can reproduce your results if needed.
- Run the model and record the results: Run your model using the same dataset and data splits as the CombinGym project. Record the performance scores and other relevant metrics of the model for comparison and verification.
Please use a pull request to submit your model code and metrics.
**Ridge, CNN, ESM-1b, ESM-1v, BLOSUM62**
- Source: FLIP by Christian Dallago et al.
- GitHub Link: FLIP
- Provided Scripts:
  - `mutant_to_seq.ipynb`: Generates mutants from sequences.
  - `seq_to_mutant.ipynb`: Generates sequences from mutants.
  - `split.ipynb`: Splits the data into training and testing sets.
  - `dictionary.ipynb`: Generates a dictionary to be added to the `train_all.py` file.
  - `run_train_all.sh`: Executes predictions on files within a directory in batch mode.
  - `evaluate.py`: Evaluates Spearman and NDCG metrics in batch mode.
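The mutant-to-sequence conversion above amounts to applying substitutions to the WT sequence. A sketch, assuming a genotype notation of wild-type residue, 1-based position, and new residue, joined by colons (e.g. `A24G:S45T`); check each dataset's clean file for its actual format:

```python
def apply_mutations(wt_seq, genotype, sep=":"):
    """Apply substitutions like 'A24G' (wt aa, 1-based position, new aa)
    to the WT sequence; the notation is an assumption, not CombinGym's
    documented format."""
    if not genotype or genotype.upper() == "WT":
        return wt_seq
    seq = list(wt_seq)
    for mut in genotype.split(sep):
        wt_aa, pos, new_aa = mut[0], int(mut[1:-1]) - 1, mut[-1]
        if seq[pos] != wt_aa:
            raise ValueError(f"WT mismatch at {mut}: found {seq[pos]}")
        seq[pos] = new_aa
    return "".join(seq)
```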
**GVP-Mut**
- Source: GVP-MSA by Lin Chen et al.
- GitHub Link: GVP-MSA
- Provided Script:
  - `gvpmutsplit.ipynb`: Splits data using a strategy based on `train_single2multi.py` from GVP-MSA, but without using MSA.
**MAVE-NN**
- Source: MAVE-NN by Ammar Tareen et al.
- GitHub Link: MAVE-NN
- Provided Scripts:
  - `mavenn_prediction.ipynb`: Used for prediction and visualization on a single dataset.
  - `run_mavenn.py`: Used for batch prediction on multiple files.
**DeepSequence**
- Source: DeepSequence by Adam J. Riesselman et al.
- GitHub Link: DeepSequence
- Provided Script:
  - `run_predict.py`: Used for making predictions.
**EVmutation**
- Source: EVmutation by Thomas A Hopf et al.
- GitHub Link: EVmutation
- Provided Script:
  - `multi_prediction.ipynb`: Used for making predictions on multipoint datasets.
Please refer to the instructions on the respective GitHub pages of each model for installation and usage.
Our codebase leverages code from the following repositories to compute baselines:
| Model | Repository |
|---|---|
| EVmutation | EVmutation GitHub |
| DeepSequence | DeepSequence GitHub |
| ESM-1b | FLIP GitHub |
| ESM-1v | FLIP GitHub |
| GVP-Mut | GVP-MSA GitHub |
| MAVE-NN | MAVE-NN GitHub |
| CNN | FLIP GitHub |
| Ridge | FLIP GitHub |
| BLOSUM62 | FLIP GitHub |
This project is available under the MIT license found in the LICENSE file in this GitHub repository.
This work is currently under submission.
- Website: https://www.combingym.org/
- CombinGym paper: (link to be added upon publication)