This repository contains the code for the paper "What cleaves? Is proteasomal cleavage prediction reaching a ceiling?", accepted at the NeurIPS'22 workshop Learning Meaningful Representations of Life.
Epitope vaccines are a promising direction to enable precision treatment for cancer, autoimmune diseases, and allergies. Effectively designing such vaccines requires accurate prediction of proteasomal cleavage in order to ensure that the epitopes in the vaccine are presented to T cells by the major histocompatibility complex (MHC). While direct identification of proteasomal cleavage in vitro is cumbersome and low throughput, it is possible to implicitly infer cleavage events from the termini of MHC-presented epitopes, which can be detected in large amounts thanks to recent advances in high-throughput MHC ligandomics. Inferring cleavage events in such a way provides an inherently noisy signal which can be tackled with new developments in the field of deep learning that supposedly make it possible to learn predictors from noisy labels. Inspired by such innovations, we sought to modernize proteasomal cleavage predictors by benchmarking a wide range of recent methods, including LSTMs, transformers, CNNs, and denoising methods, on a recently introduced cleavage dataset. We found that increasing model scale and complexity appeared to deliver limited performance gains, as several methods reached about 88.5% AUC on C-terminal and 79.5% AUC on N-terminal cleavage prediction. This suggests that the noise and/or complexity of proteasomal cleavage and the subsequent biological processes of the antigen processing pathway are the major limiting factors for predictive performance rather than the specific modeling approach used. While biological complexity can be tackled by more data and better models, noise and randomness inherently limit the maximum achievable predictive performance. All our datasets and experiments are available at https://github.com/ziegler-ingo/cleavage_prediction.
## Repository Structure

- `preprocessing` includes the notebooks that shuffle the data and split it into C- and N-terminal train, evaluation, and test sets, and that perform 3-mer splitting and other basic preprocessing steps, such as tokenization (see the k-mer sketch after this list)
- `data` holds the `.csv` and `.tsv` train, evaluation, and test files, as well as a vocabulary file
- `denoise/divide_mix` holds our adjusted implementation of the DivideMix algorithm
    - try-out runs (e.g. testing the impact of varying epochs, the weighting of the unlabeled loss samples, and the distributions after Gaussian Mixture Model separation) can be found under `denoise/dividemix_tryout_debug`
    - all other tested denoising methods are implemented directly in the notebooks
- `models/hyperparam_search` holds the training and hyperparameter search implementation of the asynchronous hyperband algorithm using Ray Tune (see the Ray Tune sketch after this list)
- `models/final` holds the final training and evaluation structure for the model architectures paired with all denoising approaches
    - All notebooks are named as follows: the applicable terminus, i.e. `c` or `n`, followed by the model architecture, e.g. `bilstm`, followed by the denoising method, e.g. `dividemix`
    - Example: `c_bilstm_dividemix.ipynb`
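For orientation, here is a minimal sketch of the 3-mer step referenced above. `to_kmers` is a hypothetical helper for illustration, not the exact function used in the notebooks:

```python
def to_kmers(sequence, k=3):
    """Split an amino acid sequence into overlapping k-mers.

    Example: to_kmers('MKVLAA') -> ['MKV', 'KVL', 'VLA', 'LAA']
    """
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
```

The tokenization step then maps such k-mers to integer ids, e.g. via the vocabulary file in `data`.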
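The search in `models/hyperparam_search` uses Ray Tune's asynchronous hyperband (ASHA) scheduler. Below is a minimal sketch of how such a search can be wired up with Ray Tune's classic function-based API; `train_model` and the search space are illustrative placeholders, not the tuned setup from the paper:

```python
from ray import tune
from ray.tune.schedulers import ASHAScheduler

def train_model(config):
    # Hypothetical trainable: in the real notebooks this would build the
    # model from `config`, train one epoch at a time, and evaluate on the
    # validation split. A placeholder stands in for the real AUC here.
    for epoch in range(30):
        val_auc = 0.5 + 0.01 * epoch      # placeholder; report the real AUC instead
        tune.report(val_auc=val_auc)      # lets ASHA stop weak trials early

analysis = tune.run(
    train_model,
    config={
        "lr": tune.loguniform(1e-4, 1e-2),          # illustrative search space
        "hidden_dim": tune.choice([64, 128, 256]),
        "dropout": tune.uniform(0.0, 0.5),
    },
    num_samples=50,                # number of sampled configurations
    scheduler=ASHAScheduler(
        metric="val_auc",
        mode="max",                # promote trials with high validation AUC
        max_t=30,                  # maximum epochs per trial
        grace_period=3,            # minimum epochs before a trial can be stopped
        reduction_factor=3,
    ),
)
print(analysis.get_best_config(metric="val_auc", mode="max"))
```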
## Available Model Architectures

- BiLSTM, called `bilstm` (a minimal sketch of this baseline follows the list)
- BiLSTM with Attention, called `bilstm_att`
- BiLSTM with pre-trained Prot2Vec embeddings, called `bilstm_prot2vec`
- Attention-enhanced CNN, called `cnn`
- BiLSTM with ESM2 representations as embeddings, called `bilstm_esm2`
- Fine-tuning of ESM2, called `esm2`
- BiLSTM with T5 representations as embeddings, called `bilstm_t5`
- Base BiLSTM with various trained tokenizers
    - Byte-level byte-pair encoders with vocabulary sizes 1000 and 50000, called `bilstm_bbpe1` and `bilstm_bbpe50`
    - WordPiece tokenizer with vocabulary size 50000, called `bilstm_wp50`
- BiLSTM with forward-backward representations as embeddings, called `bilstm_fwbw`
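For reference, here is a minimal sketch of the kind of BiLSTM baseline listed above, assuming integer-encoded k-mer tokens; the class name and layer sizes are illustrative, not the tuned values from the paper:

```python
import torch
import torch.nn as nn

class BiLSTMClassifier(nn.Module):
    """Bidirectional LSTM over token embeddings with a binary cleavage head."""

    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256,
                 num_layers=2, dropout=0.3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(
            embed_dim, hidden_dim, num_layers=num_layers,
            batch_first=True, bidirectional=True,
            dropout=dropout if num_layers > 1 else 0.0,
        )
        # forward and backward final hidden states are concatenated
        self.head = nn.Linear(2 * hidden_dim, 1)

    def forward(self, tokens):
        # tokens: (batch, seq_len) integer-encoded k-mers
        embedded = self.embedding(tokens)
        _, (hidden, _) = self.lstm(embedded)
        # hidden: (num_layers * 2, batch, hidden_dim);
        # take the last layer's forward and backward states
        final = torch.cat([hidden[-2], hidden[-1]], dim=-1)
        return self.head(final).squeeze(-1)  # logit for cleaved vs. not cleaved
```

For the `bilstm_esm2` and `bilstm_t5` variants, the embedding lookup would be replaced by precomputed language-model representations.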
## Available Denoising Methods

- Co-Teaching, called `coteaching` (a condensed sketch of its small-loss selection step follows this list)
- Co-Teaching+, called `coteaching_plus`
- JoCoR, called `jocor`
- Noise Adaptation Layer, called `nad`
- DivideMix, called `dividemix`
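As quick orientation, here is a condensed sketch of the co-teaching selection step from Han et al., 2018 (the full adaptation, including the forget-rate schedule, lives in the notebooks); `forget_rate` is the fraction of high-loss samples assumed to be noisy:

```python
import torch
import torch.nn.functional as F

def coteaching_losses(logits1, logits2, labels, forget_rate):
    """One co-teaching step: each network trains on the peer's small-loss picks."""
    # per-sample losses for both networks
    loss1 = F.cross_entropy(logits1, labels, reduction="none")
    loss2 = F.cross_entropy(logits2, labels, reduction="none")
    # keep the (1 - forget_rate) fraction with the smallest loss,
    # treating high-loss samples as likely mislabeled
    num_keep = int((1.0 - forget_rate) * labels.size(0))
    keep1 = torch.argsort(loss1)[:num_keep]  # net 1's clean candidates
    keep2 = torch.argsort(loss2)[:num_keep]  # net 2's clean candidates
    # cross-update: each network learns from the *other* network's selection
    update1 = F.cross_entropy(logits1[keep2], labels[keep2])
    update2 = F.cross_entropy(logits2[keep1], labels[keep1])
    return update1, update2
```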
## Credits

- BiLSTM model architecture based on Ozols et al., 2021
- Model architecture based on Liu and Gong, 2019, GitHub
- Model architecture based on Li et al., 2020, repository available via the download section on the homepage
- Prot2Vec embeddings based on Asgari and Mofrad, 2015, available on GitHub
- Sequence encoder model architecture based on Heigold et al., 2016
- Model architecture based on DeepCalpain (Liu et al., 2019) and Terminitor (Yang et al., 2020)
- T5 encoder taken from Elnaggar et al., 2020, GitHub; model on the Hugging Face Hub
- ESM2 taken from Lin et al., 2022, GitHub
- Noise adaptation layer implementation based on Goldberger and Ben-Reuven, 2017, and an unofficial implementation on GitHub
- Co-teaching loss function and training process adaptations based on Han et al., 2018, and the official implementation on GitHub
- Co-teaching+ loss function and training process adaptations based on Yu et al., 2019, and the official implementation on GitHub
- JoCoR loss function and training process adaptations based on Wei et al., 2020, and the official implementation on GitHub
- DivideMix structure based on Li et al., 2020, GitHub
    - As DivideMix was originally implemented for image data, we adjusted the MixMatch and mixup parts for sequential data, based on Guo et al., 2019 (see the mixup sketch below)
    - This part is implemented directly in the respective forward pass in the notebooks and thus cannot be found in the DivideMix section
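To illustrate the adjustment mentioned above: instead of mixing pixels, sequences are interpolated at the embedding level, in the spirit of Guo et al., 2019. A minimal sketch, assuming two batches of padded sequence embeddings with identical shapes and soft label vectors; the `lam = max(lam, 1 - lam)` line follows DivideMix's Beta-distribution mixing, and the function name is hypothetical:

```python
import torch

def mixup_sequences(emb_a, emb_b, targets_a, targets_b, alpha=0.75):
    """Interpolate sequence embeddings and soft labels (mixup for text).

    emb_*:     (batch, seq_len, embed_dim) token embeddings
    targets_*: (batch, num_classes) soft label vectors
    """
    lam = torch.distributions.Beta(alpha, alpha).sample()
    lam = torch.max(lam, 1 - lam)  # DivideMix trick: keep the mix >= 0.5
    mixed_emb = lam * emb_a + (1 - lam) * emb_b
    mixed_targets = lam * targets_a + (1 - lam) * targets_b
    return mixed_emb, mixed_targets
```

Because the mixing happens after the embedding lookup, it slots into the model's forward pass, which is why it appears in the notebooks rather than under `denoise/divide_mix`.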