CombinGym


Overview

CombinGym is a curated collection of protein combinatorial variant datasets and benchmarks for evaluating how well models predict combinatorial mutation effects. The tasks include zero-shot prediction and predicting higher-order mutants (HOMs) from lower-order ones, covering the 0-vs-rest, 1-vs-rest, 2-vs-rest, and 3-vs-rest scenarios. CombinGym currently offers a comprehensive suite of 14 combinatorial variant datasets, featuring 4 to 16 mutations across 9 proteins and encompassing over 400,000 meticulously curated data points. These datasets cover a range of protein properties, including protein-protein interaction, fluorescence, and enzymatic activities targeting nucleic acids, proteins, and small molecules. CombinGym supports 9 models, spanning sequence-based, structure-based, sequence-label machine learning, and statistical approaches, and provides metrics for both mutational effect prediction and design tasks.

Data

The data folder provides all input data for all baselines.

  • For the DMS data: We provide both the raw data from the original publications and the clean data, which includes at least the genotype, number of mutations, and fitness for each mutant.
  • For the WT sequence: We provide a FASTA file of the full amino acid sequence for each mutated protein, which serves as the starting sequence for deep mutational scanning library construction, multiple sequence alignment generation, and structure prediction.
  • For the MSA data: We provide an A2M multiple sequence alignment file for each protein, generated with the local EVcouplings application using the WT sequence as input. For some proteins, multiple A2M files with different bitscore thresholds are included for comparison.
  • For the structure data: We provide a PDB file for each protein, predicted with the AlphaFold3 model.

For more detailed information, please see Data_summary.csv. The above files are also accessible via our dedicated website: https://www.combingym.org/.
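As a sketch, the clean DMS files can be loaded and grouped by mutation order with pandas. The column names `genotype`, `n_mutations`, and `fitness` are assumptions based on the description above; the actual headers in the data folder may differ.

```python
import pandas as pd
from io import StringIO

# Hypothetical excerpt of a clean DMS file; real files live in the data folder,
# and the exact column names may differ from those assumed here.
clean_csv = StringIO(
    "genotype,n_mutations,fitness\n"
    "WT,0,1.00\n"
    "V39A,1,0.82\n"
    "D40G,1,0.95\n"
    "V39A:D40G,2,1.10\n"
)
df = pd.read_csv(clean_csv)

# Group mutants by mutation order, e.g. singles vs. higher-order mutants.
singles = df[df["n_mutations"] == 1]
higher_order = df[df["n_mutations"] >= 2]
print(len(singles), len(higher_order))
```

Grouping by `n_mutations` like this is the basic operation behind the n-vs-rest task definitions.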

Benchmarks

The benchmarks folder provides detailed performance files for all baselines on the tasks of 0-vs-rest, 1-vs-rest, 2-vs-rest, and 3-vs-rest.

  • For mutational effect prediction: We report Spearman’s rank correlation coefficient.
  • For mutant design: We report the NDCG (normalized discounted cumulative gain) metric.

For the supervised models GVP-Mut, MAVE-NN, CNN, and Ridge, metrics for the 0-vs-rest task are absent, as these models require labeled training data. For the unsupervised models EVmutation and DeepSequence, the 1/2/3-vs-rest metrics reflect performance on the “rest” mutants without any supervision from the labels of single, double, or triple mutants.
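Both metrics compare predicted scores against measured fitness values. A minimal sketch follows; note that the exact NDCG variant used by CombinGym (gain definition, any top-k cutoff, handling of negative fitness) is an assumption here, with measured fitness used directly as the gain.

```python
import numpy as np
from scipy.stats import spearmanr

def ndcg(y_true, y_pred):
    """NDCG with measured fitness as gain: rank mutants by the model's
    prediction and compare the discounted gain to the ideal ranking.
    Assumes nonnegative fitness values."""
    y_true = np.asarray(y_true, dtype=float)
    order = np.argsort(y_pred)[::-1]                      # best-predicted first
    discounts = 1.0 / np.log2(np.arange(2, y_true.size + 2))
    dcg = np.sum(y_true[order] * discounts)
    idcg = np.sum(np.sort(y_true)[::-1] * discounts)      # ideal ordering
    return dcg / idcg if idcg > 0 else 0.0

y_true = [0.2, 1.3, 0.8, 0.1]   # measured fitness
y_pred = [0.1, 1.1, 0.9, 0.0]   # model scores
rho, _ = spearmanr(y_true, y_pred)
print(f"Spearman rho = {rho:.3f}, NDCG = {ndcg(y_true, y_pred):.3f}")
```

Spearman rewards getting the global ranking right everywhere, while NDCG weights the top of the predicted ranking most heavily, which matches the design use case of picking a few mutants to build.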

Leaderboard

The full benchmark performance and leaderboard are also available on the CombinGym website: https://www.combingym.org/. It includes leaderboards for both the mutational effect prediction and mutant design benchmarks, as well as detailed performance for all proteins, baselines, and tasks. The current version includes the following baselines:

| Model Name | Input | Reference |
| --- | --- | --- |
| EVmutation | MSA | Hopf, T.A., et al. (2017). Mutation effects predicted from sequence co-variation. Nature Biotechnology, 35, 128-135. |
| DeepSequence | MSA | Riesselman, A.J., et al. (2018). Deep generative models of genetic variation capture the effects of mutations. Nature Methods, 15, 816-822. |
| ESM-1b | Protein language & sequence-label | Rives, A., et al. (2019). Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. PNAS, 118. |
| ESM-1v | Protein language & sequence-label | Meier, J., et al. (2021). Language models enable zero-shot prediction of the effects of mutations on protein function. NeurIPS. |
| GVP-Mut | Structure & sequence-label | Chen, L., et al. (2023). Learning protein fitness landscapes with deep mutational scanning data from multiple sources. Cell Systems, 14(8), 706-721.e5. |
| MAVE-NN | Sequence-label | Tareen, A., et al. (2022). MAVE-NN: learning genotype-phenotype maps from multiplex assays of variant effect. Genome Biology, 23(1), 98. |
| CNN | Sequence-label | Dallago, C., et al. (2021). FLIP: Benchmark tasks in fitness landscape inference for proteins. bioRxiv, 2021.11.09.467890. |
| Ridge | Sequence-label | Dallago, C., et al. (2021). FLIP: Benchmark tasks in fitness landscape inference for proteins. bioRxiv, 2021.11.09.467890. |
| BLOSUM62 | Substitution | Dallago, C., et al. (2021). FLIP: Benchmark tasks in fitness landscape inference for proteins. bioRxiv, 2021.11.09.467890. |

How to Contribute

New Datasets

If you would like to contribute new datasets to CombinGym, please submit them via “Datasets Submit” on the CombinGym website (https://www.combingym.org/). We also welcome collaborations worldwide on building and testing protein combinatorial libraries with our automated biofoundry, the world’s largest. The criteria we typically consider for inclusion are as follows:

  1. The corresponding raw dataset needs to be publicly available.
  2. The assay needs to be a protein-related combinatorial library.
  3. The dataset needs to have a sufficient number of measurements and be well-documented with at least the genotype, number of mutations, and fitness.

New Baselines

If you would like new baselines to be included in CombinGym, please follow these steps:

  1. Prepare your model code: Make sure your model code is complete and includes necessary dependencies and environment configuration so that we can reproduce the results if necessary.
  2. Run the model and record the results: Run your model using the same dataset and data splits as the CombinGym project. Record the performance scores and other relevant metrics of the model for comparison and verification.

Please use a pull request to submit your model code and metrics.
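For reference, the n-vs-rest splits amount to training on all mutants of order at most n and evaluating on everything of higher order. A sketch (the column name `n_mutations` is an assumption about the clean-data format):

```python
import pandas as pd

def n_vs_rest_split(df: pd.DataFrame, n: int):
    """Train on mutants with at most n mutations; test on all higher-order
    mutants. For n = 0 there is no labeled training set and models are
    evaluated zero-shot."""
    train = df[df["n_mutations"] <= n]
    test = df[df["n_mutations"] > n]
    return train, test

# Tiny illustrative dataset (hypothetical genotypes).
df = pd.DataFrame({
    "genotype": ["A1G", "A1G:C2T", "A1G:C2T:D3E"],
    "n_mutations": [1, 2, 3],
    "fitness": [0.9, 1.2, 0.7],
})
train, test = n_vs_rest_split(df, 1)   # 1-vs-rest
print(len(train), len(test))
```

Using the same split boundaries as CombinGym is what makes submitted baseline numbers comparable to the published leaderboard.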

Code

  • Ridge, CNN, ESM-1b, ESM-1v, BLOSUM62

    • Source: FLIP by Christian Dallago et al.
    • GitHub Link: FLIP
    • Provided Scripts:
      • mutant_to_seq.ipynb: Generates full sequences from mutant genotypes.
      • seq_to_mutant.ipynb: Extracts mutant genotypes from full sequences.
      • split.ipynb: Splits the data into training and testing sets.
      • dictionary.ipynb: Generates a dictionary to be added to the train_all.py file.
      • run_train_all.sh: Executes predictions on files within a directory in batch mode.
      • evaluate.py: Evaluates Spearman and NDCG metrics in batch mode.
  • GVP-Mut

    • Source: GVP-MSA by Lin Chen et al.
    • GitHub Link: GVP-MSA
    • Provided Script:
      • gvpmutsplit.ipynb: Splits data using a strategy based on train_single2multi.py from GVP-MSA, but without using MSA.
  • MAVE-NN

    • Source: MAVE-NN by Ammar Tareen et al.
    • GitHub Link: MAVE-NN
    • Provided Scripts:
      • mavenn_prediction.ipynb: Used for prediction and visualization on a single dataset.
      • run_mavenn.py: Used for batch prediction on multiple files.
  • DeepSequence

    • Source: DeepSequence by Thomas A Hopf et al.
    • GitHub Link: DeepSequence
    • Provided Script:
      • run_predict.py: Used for making predictions.
  • EVmutation

    • Source: EVmutation by Thomas A Hopf et al.
    • GitHub Link: EVmutation
    • Provided Script:
      • multi_prediction.ipynb: Used for making predictions on multipoint datasets.
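Converting between genotype strings and full sequences is a recurring step in the scripts above. A sketch of the forward direction, assuming “V39A”-style, colon-separated genotypes with 1-based positions (the actual notebook format may differ):

```python
import re

def apply_mutations(wt_seq: str, genotype: str) -> str:
    """Apply a colon-separated genotype such as 'V39A:D40G' to a WT sequence.
    Assumes 1-based positions and that 'WT' denotes the unmutated sequence."""
    if genotype in ("", "WT"):
        return wt_seq
    seq = list(wt_seq)
    for mut in genotype.split(":"):
        m = re.fullmatch(r"([A-Z])(\d+)([A-Z])", mut)
        if m is None:
            raise ValueError(f"unparsable mutation: {mut}")
        wt_aa, pos, mut_aa = m.group(1), int(m.group(2)), m.group(3)
        if seq[pos - 1] != wt_aa:
            raise ValueError(f"WT mismatch at position {pos}")
        seq[pos - 1] = mut_aa
    return "".join(seq)
```

The WT-residue check catches off-by-one errors between the genotype numbering and the FASTA sequence, which is a common failure mode when combining data from different sources.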

Installation and Usage

Please refer to the instructions on the respective GitHub pages of each model for installation and usage.

Acknowledgements

Our codebase leverages code from the following repositories to compute baselines:

| Model | Repository |
| --- | --- |
| EVmutation | EVmutation GitHub |
| DeepSequence | DeepSequence GitHub |
| ESM-1b | FLIP GitHub |
| ESM-1v | FLIP GitHub |
| GVP-Mut | GVP-MSA GitHub |
| MAVE-NN | MAVE-NN GitHub |
| CNN | FLIP GitHub |
| Ridge | FLIP GitHub |
| BLOSUM62 | FLIP GitHub |

License

This project is available under the MIT license found in the LICENSE file in this GitHub repository.

Citation

This work is currently under submission.
