Code and dataset for the NeurIPS'22 paper "What cleaves? Is proteasomal cleavage prediction reaching a ceiling?" https://arxiv.org/abs/2210.12991

Cleavage Prediction

This repository contains the code for the paper "What cleaves? Is proteasomal cleavage prediction reaching a ceiling?", accepted at the NeurIPS'22 workshop Learning Meaningful Representations of Life (LMRL).

Abstract

Epitope vaccines are a promising direction to enable precision treatment for cancer, autoimmune diseases, and allergies. Effectively designing such vaccines requires accurate prediction of proteasomal cleavage in order to ensure that the epitopes in the vaccine are presented to T cells by the major histocompatibility complex (MHC). While direct identification of proteasomal cleavage in vitro is cumbersome and low throughput, it is possible to implicitly infer cleavage events from the termini of MHC-presented epitopes, which can be detected in large amounts thanks to recent advances in high-throughput MHC ligandomics. Inferring cleavage events in such a way provides an inherently noisy signal which can be tackled with new developments in the field of deep learning that supposedly make it possible to learn predictors from noisy labels. Inspired by such innovations, we sought to modernize proteasomal cleavage predictors by benchmarking a wide range of recent methods, including LSTMs, transformers, CNNs, and denoising methods, on a recently introduced cleavage dataset. We found that increasing model scale and complexity appeared to deliver limited performance gains, as several methods reached about 88.5% AUC on C-terminal and 79.5% AUC on N-terminal cleavage prediction. This suggests that the noise and/or complexity of proteasomal cleavage and the subsequent biological processes of the antigen processing pathway are the major limiting factors for predictive performance rather than the specific modeling approach used. While biological complexity can be tackled by more data and better models, noise and randomness inherently limit the maximum achievable predictive performance. All our datasets and experiments are available at https://github.com/ziegler-ingo/cleavage_prediction.

Repository Structure

  • preprocessing contains the notebooks that shuffle and split the data into C- and N-terminal train, evaluation, and test splits, and that perform 3-mer generation and other basic preprocessing steps such as tokenization
  • data holds the .csv and .tsv train, evaluation, and test files, as well as a vocabulary file
  • denoise/divide_mix holds our adjusted implementation of the DivideMix algorithm
    • try-out runs (e.g. testing impact of varying epochs, weight of unlabeled loss samples and distributions after Gaussian Mixture Model separation) can be found under denoise/dividemix_tryout_debug
    • all other tested denoising methods are implemented directly in the notebooks
  • models/hyperparam_search holds the training and hyperparameter search implementation of the asynchronous hyperband algorithm using Ray Tune
  • models/final holds the final training and evaluation structure for model architectures paired with all denoising approaches
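The 3-mer step mentioned above can be illustrated with a short sketch. The function name and window handling here are hypothetical, not the repository's actual preprocessing code; it simply shows how a peptide sequence is split into overlapping 3-mers before tokenization.

```python
def kmers(sequence: str, k: int = 3) -> list[str]:
    """Split an amino-acid sequence into overlapping k-mers."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

# e.g. a 6-residue fragment around a putative cleavage site
print(kmers("ACDEFG"))  # ['ACD', 'CDE', 'DEF', 'EFG']
```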

Naming structure of final notebooks

  • All notebooks are named as follows: the applicable terminal (c or n), followed by the model architecture (e.g. bilstm), followed by the denoising method (e.g. dividemix)
  • Example: c_bilstm_dividemix.ipynb
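The naming convention above can be captured in a tiny helper; the function name is illustrative, not part of the repository.

```python
def notebook_name(terminal: str, model: str, denoiser: str) -> str:
    """Compose a final-notebook filename from its three components."""
    assert terminal in ("c", "n"), "terminal must be 'c' or 'n'"
    return f"{terminal}_{model}_{denoiser}.ipynb"

print(notebook_name("c", "bilstm", "dividemix"))  # c_bilstm_dividemix.ipynb
```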

Available model architectures

  • BiLSTM, called bilstm
  • BiLSTM with Attention, called bilstm_att
  • BiLSTM with pre-trained Prot2Vec embeddings, called bilstm_prot2vec
  • Attention enhanced CNN, called cnn
  • BiLSTM with ESM2 representations as embeddings, called bilstm_esm2
  • Fine-tuning of ESM2, called esm2
  • BiLSTM with T5 representations as embeddings, called bilstm_t5
  • Base BiLSTM with various trained tokenizers
    • Byte-level byte-pair encoder with vocabulary sizes 1,000 and 50,000, called bilstm_bbpe1 and bilstm_bbpe50
    • WordPiece tokenizer with vocabulary size 50,000, called bilstm_wp50
  • BiLSTM with forward-backward representations as embeddings, called bilstm_fwbw
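For the attention-enhanced variants, per-position BiLSTM states are pooled into a single sequence vector before classification. A minimal numpy sketch of dot-product attention pooling, where the scoring vector `w` stands in for learned parameters (this is a generic illustration, not the repository's exact layer):

```python
import numpy as np

def attention_pool(H: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Pool per-position hidden states H of shape (T, d) into one (d,) vector.

    Scores each position with vector w (d,), softmax-normalizes the scores,
    and returns the attention-weighted sum of the hidden states."""
    scores = H @ w                  # (T,) one score per sequence position
    scores = scores - scores.max()  # subtract max for numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()  # attention weights
    return alpha @ H                # (d,) weighted sum of hidden states

rng = np.random.default_rng(0)
H = rng.normal(size=(10, 8))  # e.g. 10 residues, hidden size 8
w = rng.normal(size=8)
context = attention_pool(H, w)
print(context.shape)  # (8,)
```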

Available denoising architectures

  • Co-Teaching, called coteaching
  • Co-Teaching+, called coteaching_plus
  • JoCoR, called jocor
  • Noise Adaptation Layer, called nad
  • DivideMix, called dividemix
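Several of these methods (Co-Teaching, Co-Teaching+, JoCoR) share the small-loss selection idea: samples with small training loss are assumed to be more likely clean, and each network forwards only that fraction of a mini-batch to its peer. A schematic numpy sketch, with an illustrative remember rate (not the schedule used in the experiments):

```python
import numpy as np

def small_loss_indices(losses: np.ndarray, remember_rate: float) -> np.ndarray:
    """Indices of the `remember_rate` fraction of smallest-loss samples.

    In Co-Teaching, each network computes this on a mini-batch and hands
    the selected indices to its peer network for the gradient update."""
    n_keep = int(remember_rate * len(losses))
    return np.argsort(losses)[:n_keep]  # ascending: smallest losses first

losses_net1 = np.array([0.2, 1.5, 0.1, 0.9, 0.3, 2.0])
idx_for_net2 = small_loss_indices(losses_net1, remember_rate=0.5)
print(sorted(idx_for_net2.tolist()))  # [0, 2, 4]
```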

Achieved performances

Performance comparison of all models and denoising architectures for C- and N-terminal cleavage prediction (see figure in the repository)

Sources for model architectures and denoising approaches

LSTM Architecture

LSTM Attention Architecture

CNN Architecture

Prot2Vec Embeddings

FwBw Architecture

MLP Architecture

T5 Architecture

ESM2 Architecture

Noise Adaptation Layer

Co-teaching

  • Co-teaching loss function and training process adaptations are based on Han et al., 2018, and the official implementation on GitHub

Co-teaching+

  • Co-teaching+ loss function and training process adaptations are based on Yu et al., 2019, and the official implementation on GitHub

JoCoR

  • JoCoR loss function and training process adaptations are based on Wei et al., 2020, and the official implementation on GitHub

DivideMix

  • The DivideMix structure is based on Li et al., 2020, and the official implementation on GitHub
  • As DivideMix was originally implemented for image data, we adjusted the MixMatch and mixup parts for sequential data, based on Guo et al., 2019
    • This part is implemented directly in the respective forward pass in the notebooks, and thus is not found in the denoise/divide_mix directory
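The adjustment for sequential data can be sketched as mixing sequence representations (e.g. pooled BiLSTM embeddings) rather than raw pixels. Following DivideMix, the mixing coefficient is drawn from a Beta distribution and clamped so the mixed sample stays dominated by the first input. All names here are illustrative, and real mixup mixes the labels with the same coefficient:

```python
import numpy as np

def mixup_embeddings(emb_a: np.ndarray, emb_b: np.ndarray,
                     alpha: float = 4.0, rng=None):
    """Mixup for sequence representations.

    Draws lam ~ Beta(alpha, alpha) and keeps max(lam, 1 - lam), as in
    DivideMix, so the mixed embedding stays closer to emb_a."""
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    lam = max(lam, 1.0 - lam)  # clamp so emb_a dominates the mixture
    return lam * emb_a + (1.0 - lam) * emb_b, lam

emb_a = np.ones(4)
emb_b = np.zeros(4)
mixed, lam = mixup_embeddings(emb_a, emb_b, rng=np.random.default_rng(0))
print(lam >= 0.5, mixed.shape)  # True (4,)
```

The target labels of both samples would be combined with the same `lam` to form the training target for the mixed embedding.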
