Deep Peptide Representation Learning

We present an interface for training, inferring representations, generative modelling (aka "babbling"), and data management. All three architectures (64, 256, and 1900 units) are provided, along with the trained weights, the random initializations used to begin evotuning (to ensure reproducibility), and the evotuned parameters.

For training/finetuning: note that backpropagation through an mLSTM of this size is very memory intensive, and the primary determinant of memory use is the maximum length of the input sequences rather than the batch size. We have successfully finetuned on GFP-like fluorescent proteins (~120-280 aa) on an AWS p3.2xlarge instance with 16 GB of GPU memory. Hardware with more memory should accommodate longer sequences, as will using one of the smaller pre-trained models (64 or 256 units).
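
Because sequence length drives memory use, one common way to limit wasted padding is to batch sequences of similar length together. The sketch below is illustrative only: the helper names are hypothetical and are not part of unirep.py or data_utils.py. It compares the number of padded positions for naive batching and length-sorted batching.

```python
def chunk(seqs, batch_size):
    """Cut a list of sequences into consecutive batches."""
    return [seqs[i:i + batch_size] for i in range(0, len(seqs), batch_size)]


def batch_by_length(seqs, batch_size):
    """Sort by length first so each batch holds sequences of similar length."""
    return chunk(sorted(seqs, key=len), batch_size)


def padded_positions(batches):
    """Total positions processed if every batch is padded to its longest member."""
    return sum(len(batch) * max(len(s) for s in batch) for batch in batches)


# Four dummy sequences spanning the ~120-280 aa range mentioned above.
seqs = ["M" * n for n in (120, 280, 125, 275)]
naive = chunk(seqs, 2)
bucketed = batch_by_length(seqs, 2)
print(padded_positions(naive), padded_positions(bucketed))  # 1110 810
```

Peak memory is still set by the batch containing the longest sequence, so length-sorted batching reduces wasted computation but does not replace a cap on sequence length.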

Overview

Start with the model trained on ~24 million UniRef50 sequences. This model has learned a general representation of protein sequences.
Find sequences of proteins known to bind to our target sequences.
Remove sequences longer than 500 amino acids or containing invalid letters (~100 sequences).
Use JackHMMER (20 iterations) to iteratively find protein sequences that are likely evolutionarily related (~25k).
Run 13k unsupervised weight updates (65 epochs) of the sequence-trained model on these ~25k sequences.
Train a sparse linear regression model to predict binding to GDF8, GDF11, activin A, and BMP9. The inputs are the sequence-trained model's representations of the proteins, and the targets are whether they bind.
Now we can predict binding for peptides that have not yet been tested.
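
The two steps above that need no GPU, sequence filtering and the sparse linear model, can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code used in this repository: the valid alphabet is assumed to be the 20 standard amino acids, the sparse model is assumed to be a Lasso (L1-penalized least squares) fitted by coordinate descent, and the three toy features stand in for the model's representations. All function names are hypothetical.

```python
from itertools import product

VALID_AA = set("ACDEFGHIKLMNPQRSTVWY")  # assumption: the 20 standard residues
MAX_LEN = 500                           # length cap from the overview


def filter_sequences(seqs, max_len=MAX_LEN, alphabet=VALID_AA):
    """Keep non-empty sequences within the cap that use only valid letters."""
    return [s for s in seqs if 0 < len(s) <= max_len and set(s) <= alphabet]


def soft_threshold(value, threshold):
    """Shrink value toward zero; anything within the threshold becomes 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_lasso(X, y, alpha=0.1, n_iters=100):
    """Coordinate descent for (1/2n)*||y - Xw - b||^2 + alpha*||w||_1."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    b = sum(y) / n
    residual = [yi - b for yi in y]  # y - Xw - b, with w = 0 initially
    col_sq = [sum(row[j] ** 2 for row in X) / n for j in range(d)]
    for _ in range(n_iters):
        for j in range(d):
            if col_sq[j] == 0.0:
                continue  # constant-zero feature carries no information
            # Correlation of feature j with the residual that excludes feature j.
            rho = sum(X[i][j] * (residual[i] + w[j] * X[i][j]) for i in range(n)) / n
            new_w = soft_threshold(rho, alpha) / col_sq[j]
            delta = new_w - w[j]
            for i in range(n):
                residual[i] -= delta * X[i][j]
            w[j] = new_w
        shift = sum(residual) / n  # the intercept is not penalized
        b += shift
        residual = [r - shift for r in residual]
    return w, b


def predict(X, w, b):
    """Linear predictions b + X.w for each row of X."""
    return [b + sum(wj * xj for wj, xj in zip(w, row)) for row in X]


# Filtering: "MKX" has a non-standard letter, the 501-residue sequence is too long.
kept = filter_sequences(["MKV", "MKX", "A" * 501, "A" * 500])

# Toy regression: only the first of three features determines binding (1.0 or 0.0).
X = [list(row) for row in product((-1.0, 1.0), repeat=3)]
y = [1.0 if row[0] > 0 else 0.0 for row in X]
w, b = fit_lasso(X, y)
binds = [p > 0.5 for p in predict(X, w, b)]
```

On the toy data the L1 penalty sets the weights of the two uninformative features to exactly zero, which is the behavior that makes a sparse model attractive when the inputs are high-dimensional representations.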

Setup

System requirements: NVIDIA CUDA 8.0 (V8.0.61), NVIDIA cuDNN 6.0.21, NVIDIA GPU Driver 410.79 (361.93 or >= 375.51 should also work, but this is untested), and nvidia-docker. The 64-unit model should run on any machine. The full-sized model will require a machine with more than 16GB of GPU RAM.
Build docker: docker build -f docker/Dockerfile.gpu -t eclrep-gpu . (the trailing dot is part of the command). This step pulls the TensorFlow 1.3 GPU image and installs a few required Python packages. Note that the TensorFlow image is based on Ubuntu 16.04.
Run docker: docker/run_gpu_docker.sh. This will launch Jupyter. Copy and paste the provided URL into your browser. Note that if you are running this code on a remote machine, you will need to set up port forwarding between your local machine and the remote machine. See this example (note that in our case Jupyter is serving on port 8888, not 8889 as the example describes).
Open up the unirep_tutorial.ipynb notebook and get started.
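
For the port forwarding mentioned above, a standard approach is an SSH local forward. The sketch below only assembles the command; the helper name and the host user@gpu-host are hypothetical placeholders, and port 8888 is the Jupyter port stated above.

```python
import shlex


def tunnel_command(remote, local_port=8888, remote_port=8888):
    """Build an ssh command forwarding local_port to remote_port on the remote host.

    -N opens the tunnel without running a remote command;
    -L sets up the local port forward.
    """
    return ["ssh", "-N", "-L", f"{local_port}:localhost:{remote_port}", remote]


cmd = tunnel_command("user@gpu-host")  # hypothetical host name
print(" ".join(shlex.quote(part) for part in cmd))
# ssh -N -L 8888:localhost:8888 user@gpu-host
```

Running the printed command on the local machine (or passing cmd to subprocess.run) keeps the tunnel open, after which the Jupyter URL can be opened in a local browser.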

Data Sources

The stability data are from the Baker lab's Science paper (Rocklin et al., 2017), "Global analysis of protein folding using massively parallel design, synthesis, and testing".

Weight files

The unirep_tutorial.ipynb notebook downloads the needed weight files for the 64-unit and 1900-unit UniRep models.

Weight directories are in data/

1900_weights/: trained weights for the 1900-unit (full) UniRep model
256_weights/: trained weights for the 256-unit UniRep model
64_weights/: trained weights for the 64-unit UniRep model
1900_weights_random/: random weights that were used to initialize the 1900-unit (full) UniRep model for Random Evotuned.
256_weights_random/: random weights that could be used to initialize the 256-unit UniRep model (e.g. for evotuning).
64_weights_random/: random weights that could be used to initialize the 64-unit UniRep model (e.g. for evotuning).
evotuned/unirep/: the weights of the globally pre-trained UniRep (1900-unit model), as a TensorFlow checkpoint file, after 13k unsupervised weight updates on fluorescent protein homologs obtained with JackHMMER.
evotuned/random_init/: the weights of a randomly initialized UniRep (1900-unit model, initialized with 1900_weights_random and not pre-trained at all), as a TensorFlow checkpoint file, after 13k unsupervised weight updates on fluorescent protein homologs obtained with JackHMMER.
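
The directory naming above is regular enough to resolve in code. The following is a small sketch, assuming the data/ layout listed here; the helper weights_dir is hypothetical and not part of this repository.

```python
from pathlib import Path

DATA_DIR = Path("data")        # weight directories live under data/
MODEL_SIZES = (64, 256, 1900)  # the three provided architectures


def weights_dir(units, random_init=False, data_dir=DATA_DIR):
    """Return the directory holding trained or random-initialization weights."""
    if units not in MODEL_SIZES:
        raise ValueError(f"units must be one of {MODEL_SIZES}, got {units}")
    suffix = "_weights_random" if random_init else "_weights"
    return Path(data_dir) / f"{units}{suffix}"


trained = weights_dir(1900)
random_small = weights_dir(64, random_init=True)
# The directory may not exist until the weights have been downloaded.
print(trained, trained.is_dir())
```

Checking is_dir() before loading gives a clearer error than a failed checkpoint read when the weights have not been downloaded yet.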

Description of files in this repository

unirep_tutorial.ipynb - Start here for examples on loading the model, preparing data, training, and running inference.
unirep_tutorial.py - A pure Python script version of the tutorial notebook.
unirep.py - Interface for most use cases.
custom_models.py - Custom implementations of GRU, LSTM, and mLSTM cells, as used in representation training on UniRef50.
data_utils.py - Convenience functions for data management.
formatted.txt and seqs.txt - Tutorial files.

xanderdunn/eclrep
