This repository contains the code for the paper "What cleaves? Is proteasomal cleavage prediction reaching a ceiling?", accepted at the NeurIPS'22 workshop Learning Meaningful Representations of Life.
Epitope vaccines are a promising direction to enable precision treatment for cancer, autoimmune diseases, and allergies. Effectively designing such vaccines requires accurate prediction of proteasomal cleavage in order to ensure that the epitopes in the vaccine are presented to T cells by the major histocompatibility complex (MHC). While direct identification of proteasomal cleavage in vitro is cumbersome and low throughput, it is possible to implicitly infer cleavage events from the termini of MHC-presented epitopes, which can be detected in large amounts thanks to recent advances in high-throughput MHC ligandomics. Inferring cleavage events in such a way provides an inherently noisy signal which can be tackled with new developments in the field of deep learning that supposedly make it possible to learn predictors from noisy labels. Inspired by such innovations, we sought to modernize proteasomal cleavage predictors by benchmarking a wide range of recent methods, including LSTMs, transformers, CNNs, and denoising methods, on a recently introduced cleavage dataset. We found that increasing model scale and complexity appeared to deliver limited performance gains, as several methods reached about 88.5% AUC on C-terminal and 79.5% AUC on N-terminal cleavage prediction. This suggests that the noise and/or complexity of proteasomal cleavage and the subsequent biological processes of the antigen processing pathway are the major limiting factors for predictive performance rather than the specific modeling approach used. While biological complexity can be tackled by more data and better models, noise and randomness inherently limit the maximum achievable predictive performance. All our datasets and experiments are available at https://github.com/ziegler-ingo/cleavage_prediction.
## Repository Structure

- `preprocessing` includes the notebooks that shuffle the data and split it into C- and N-terminal train, evaluation, and test sets, and that perform 3-mer splitting and other basic preprocessing steps, such as tokenization (see the k-mer sketch after this list)
- `data` holds the `.csv` and `.tsv` train, evaluation, and test files, as well as a vocabulary file
- `denoise/divide_mix` holds our adjusted implementation of the DivideMix algorithm
    - try-out runs (e.g. testing the impact of varying epochs, the weighting of the unlabeled loss samples, and the distributions after Gaussian Mixture Model separation) can be found under `denoise/dividemix_tryout_debug`
    - all other tested denoising methods are implemented directly in the notebooks
- `models/hyperparam_search` holds the training and hyperparameter search implementation of the asynchronous hyperband algorithm using Ray Tune (see the Ray Tune sketch after this list)
- `models/final` holds the final training and evaluation structure for the model architectures paired with all denoising approaches
    - All notebooks are named as follows: the applicable terminus, i.e. `c` or `n`, followed by the model architecture, e.g. `bilstm`, followed by the denoising method, e.g. `dividemix`
    - Example: `c_bilstm_dividemix.ipynb`
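For orientation, here is a minimal sketch of the 3-mer step referenced above. `to_kmers` is a hypothetical helper for illustration, not the exact function used in the notebooks:

```python
def to_kmers(sequence, k=3):
    """Split an amino acid sequence into overlapping k-mers.

    Example: to_kmers('MKVLAA') -> ['MKV', 'KVL', 'VLA', 'LAA']
    """
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
```

The tokenization step then maps such k-mers to integer ids, e.g. via the vocabulary file in `data`.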
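The search in `models/hyperparam_search` uses Ray Tune's asynchronous hyperband (ASHA) scheduler. Below is a minimal sketch of how such a search can be wired up with Ray Tune's classic function-based API; `train_model` and the search space are illustrative placeholders, not the tuned setup from the paper:

```python
from ray import tune
from ray.tune.schedulers import ASHAScheduler

def train_model(config):
    # Hypothetical trainable: in the real notebooks this would build the
    # model from `config`, train one epoch at a time, and evaluate on the
    # validation split. A placeholder stands in for the real AUC here.
    for epoch in range(30):
        val_auc = 0.5 + 0.01 * epoch      # placeholder; report the real AUC instead
        tune.report(val_auc=val_auc)      # lets ASHA stop weak trials early

analysis = tune.run(
    train_model,
    config={
        "lr": tune.loguniform(1e-4, 1e-2),          # illustrative search space
        "hidden_dim": tune.choice([64, 128, 256]),
        "dropout": tune.uniform(0.0, 0.5),
    },
    num_samples=50,                # number of sampled configurations
    scheduler=ASHAScheduler(
        metric="val_auc",
        mode="max",                # promote trials with high validation AUC
        max_t=30,                  # maximum epochs per trial
        grace_period=3,            # minimum epochs before a trial can be stopped
        reduction_factor=3,
    ),
)
print(analysis.get_best_config(metric="val_auc", mode="max"))
```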
## Available Model Architectures

- BiLSTM, called `bilstm` (a minimal sketch of this baseline follows the list)
- BiLSTM with Attention, called `bilstm_att`
- BiLSTM with pre-trained Prot2Vec embeddings, called `bilstm_prot2vec`
- Attention-enhanced CNN, called `cnn`
- BiLSTM with ESM2 representations as embeddings, called `bilstm_esm2`
- Fine-tuning of ESM2, called `esm2`
- BiLSTM with T5 representations as embeddings, called `bilstm_t5`
- Base BiLSTM with various trained tokenizers
    - Byte-level byte-pair encoders with vocabulary sizes 1000 and 50000, called `bilstm_bbpe1` and `bilstm_bbpe50`
    - WordPiece tokenizer with vocabulary size 50000, called `bilstm_wp50`
- BiLSTM with forward-backward representations as embeddings, called `bilstm_fwbw`
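For reference, here is a minimal sketch of the kind of BiLSTM baseline listed above, assuming integer-encoded k-mer tokens; the class name and layer sizes are illustrative, not the tuned values from the paper:

```python
import torch
import torch.nn as nn

class BiLSTMClassifier(nn.Module):
    """Bidirectional LSTM over token embeddings with a binary cleavage head."""

    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256,
                 num_layers=2, dropout=0.3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(
            embed_dim, hidden_dim, num_layers=num_layers,
            batch_first=True, bidirectional=True,
            dropout=dropout if num_layers > 1 else 0.0,
        )
        # forward and backward final hidden states are concatenated
        self.head = nn.Linear(2 * hidden_dim, 1)

    def forward(self, tokens):
        # tokens: (batch, seq_len) integer-encoded k-mers
        embedded = self.embedding(tokens)
        _, (hidden, _) = self.lstm(embedded)
        # hidden: (num_layers * 2, batch, hidden_dim);
        # take the last layer's forward and backward states
        final = torch.cat([hidden[-2], hidden[-1]], dim=-1)
        return self.head(final).squeeze(-1)  # logit for cleaved vs. not cleaved
```

For the `bilstm_esm2` and `bilstm_t5` variants, the embedding lookup would be replaced by precomputed language-model representations.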
## Available Denoising Methods

- Co-Teaching, called `coteaching` (a condensed sketch of its small-loss selection step follows this list)
- Co-Teaching+, called `coteaching_plus`
- JoCoR, called `jocor`
- Noise Adaptation Layer, called `nad`
- DivideMix, called `dividemix`
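As quick orientation, here is a condensed sketch of the co-teaching selection step from Han et al., 2018 (the full adaptation, including the forget-rate schedule, lives in the notebooks); `forget_rate` is the fraction of high-loss samples assumed to be noisy:

```python
import torch
import torch.nn.functional as F

def coteaching_losses(logits1, logits2, labels, forget_rate):
    """One co-teaching step: each network trains on the peer's small-loss picks."""
    # per-sample losses for both networks
    loss1 = F.cross_entropy(logits1, labels, reduction="none")
    loss2 = F.cross_entropy(logits2, labels, reduction="none")
    # keep the (1 - forget_rate) fraction with the smallest loss,
    # treating high-loss samples as likely mislabeled
    num_keep = int((1.0 - forget_rate) * labels.size(0))
    keep1 = torch.argsort(loss1)[:num_keep]  # net 1's clean candidates
    keep2 = torch.argsort(loss2)[:num_keep]  # net 2's clean candidates
    # cross-update: each network learns from the *other* network's selection
    update1 = F.cross_entropy(logits1[keep2], labels[keep2])
    update2 = F.cross_entropy(logits2[keep1], labels[keep1])
    return update1, update2
```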
## Credits

- BiLSTM model architecture based on Ozols et al., 2021
- Model architecture based on Liu and Gong, 2019, GitHub
- Model architecture based on Li et al., 2020, repository available via the download section on the homepage
- Prot2Vec embeddings based on Asgari and Mofrad, 2015, available on GitHub
- Sequence encoder model architecture based on Heigold et al., 2016
- Model architecture based on DeepCalpain (Liu et al., 2019) and Terminitor (Yang et al., 2020)
- T5 encoder taken from Elnaggar et al., 2020, GitHub; model on the Hugging Face Hub
- ESM2 taken from Lin et al., 2022, GitHub
- Noise adaptation layer implementation based on Goldberger and Ben-Reuven, 2017, and an unofficial implementation on GitHub
- Co-teaching loss function and training process adaptations based on Han et al., 2018, and the official implementation on GitHub
- Co-teaching+ loss function and training process adaptations based on Yu et al., 2019, and the official implementation on GitHub
- JoCoR loss function and training process adaptations based on Wei et al., 2020, and the official implementation on GitHub
- DivideMix structure based on Li et al., 2020, GitHub
    - As DivideMix was originally implemented for image data, we adjusted the MixMatch and mixup parts for sequential data, based on Guo et al., 2019 (see the mixup sketch below)
    - This part is implemented directly in the respective forward pass in the notebooks and thus cannot be found in the DivideMix section
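To illustrate the adjustment mentioned above: instead of mixing pixels, sequences are interpolated at the embedding level, in the spirit of Guo et al., 2019. A minimal sketch, assuming two batches of padded sequence embeddings with identical shapes and soft label vectors; the `lam = max(lam, 1 - lam)` line follows DivideMix's Beta-distribution mixing, and the function name is hypothetical:

```python
import torch

def mixup_sequences(emb_a, emb_b, targets_a, targets_b, alpha=0.75):
    """Interpolate sequence embeddings and soft labels (mixup for text).

    emb_*:     (batch, seq_len, embed_dim) token embeddings
    targets_*: (batch, num_classes) soft label vectors
    """
    lam = torch.distributions.Beta(alpha, alpha).sample()
    lam = torch.max(lam, 1 - lam)  # DivideMix trick: keep the mix >= 0.5
    mixed_emb = lam * emb_a + (1 - lam) * emb_b
    mixed_targets = lam * targets_a + (1 - lam) * targets_b
    return mixed_emb, mixed_targets
```

Because the mixing happens after the embedding lookup, it slots into the model's forward pass, which is why it appears in the notebooks rather than under `denoise/divide_mix`.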