Pretrained models using the Human Protein Atlas:

All required packages are listed in package-list.txt and can be installed via conda. Model training requires GPU.

The ./checkpoints directory contains the model parameters for all 22 epochs.
All results in the paper are based on the model parameters in "splice_isoform_dataset_cell_line_and_gene_split_full-epoch=01-val_combined_loss=0.18.ckpt".
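
To see which checkpoints are available, the epoch and validation loss can be read straight from the file names. The following is a minimal sketch, assuming every file in ./checkpoints follows the same naming pattern as the single checkpoint quoted above; the helper names are hypothetical and not part of the repository.

```python
import re
from pathlib import Path

# Pattern inferred from the one checkpoint name quoted in this README (assumption:
# the other 21 checkpoints are named the same way).
CKPT_PATTERN = re.compile(r"epoch=(\d+)-val_combined_loss=(\d+(?:\.\d+)?)\.ckpt$")


def parse_checkpoint_name(name):
    """Return (epoch, val_combined_loss) parsed from a file name, or None if it does not match."""
    match = CKPT_PATTERN.search(name)
    if match is None:
        return None
    return int(match.group(1)), float(match.group(2))


def list_checkpoints(directory="./checkpoints"):
    """Return [(epoch, loss, path), ...] sorted by epoch for every parsable .ckpt file."""
    found = []
    for path in Path(directory).glob("*.ckpt"):
        parsed = parse_checkpoint_name(path.name)
        if parsed is not None:
            found.append((parsed[0], parsed[1], path))
    return sorted(found, key=lambda item: item[0])
```

Calling list_checkpoints() from the repository root should then give one entry per epoch, which makes it easy to locate the checkpoint used for the paper.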

Compiling & accessing the dataset:

The repository is configured to use the free MongoDB Service to store and serve all data for training and validation. Not only does MongoDB scale well for the hundreds of thousands of immunofluorescent images & ESM-2 feature binaries, it also natively supports hierarchical data structures, which is crucial for clean data handling.

Whenever the dataset is updated or accessed, the Mongo server must be up and running. To start it, run the command sudo mongod --dbpath <MONGO_PATH>
It is recommended to also download MongoDB Compass to visualize the datasets directly from a GUI (screenshots below show the hierarchical structure of the dataset as seen from Compass).
From the Compass UI it is possible to create a database directly (https://www.mongodb.com/docs/compass/current/databases/). Before any data can be downloaded or accessed, you must first create a database named 'hpa'.
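
Before starting a long download it can help to confirm that the server is reachable and that the 'hpa' database exists. The sketch below uses pymongo and assumes the server runs locally on MongoDB's default port 27017; the function names are illustrative and are not taken from the repository.

```python
def connect(uri="mongodb://localhost:27017", timeout_ms=2000):
    """Open a client and fail fast if mongod is not reachable.

    The URI is an assumption (local server, default port); adjust it to your setup.
    """
    from pymongo import MongoClient  # third-party package, imported lazily

    client = MongoClient(uri, serverSelectionTimeoutMS=timeout_ms)
    client.admin.command("ping")  # raises an exception if the server cannot be reached
    return client


def has_database(client, name="hpa"):
    """Return True if the server already holds a database called `name`."""
    return name in client.list_database_names()
```

For example, has_database(connect()) should return True once the database has been created in Compass. Note that MongoDB only lists a database after it holds at least one collection.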

Dataset preparation:

Download the public tabular subcellular localization dataset from HPA (https://www.proteinatlas.org/download/subcellular_location.tsv.zip). Note: the tabular dataset is used only as an index of genes for the rest of the pipeline to reference. The rest of the information in the tabular dataset, including the localization labels, is not used because it only describes coarse gene-level information.
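
Since only the gene index is needed from this file, it can be extracted with the standard library alone. This is a rough sketch rather than the code in dataset/download_data.py; it assumes the archive contains a single tab-separated file whose gene identifier column is called "Gene", so check the header of your download before relying on it.

```python
import csv
import io
import urllib.request
import zipfile

HPA_URL = "https://www.proteinatlas.org/download/subcellular_location.tsv.zip"


def download_zip(url=HPA_URL):
    """Fetch the zipped TSV and return its raw bytes."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def read_gene_index(zip_bytes, column="Gene"):
    """Return the unique values of `column`, in first-seen order, from the first .tsv in the archive.

    The column name "Gene" is an assumption about the HPA file layout.
    """
    seen = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        tsv_name = next(n for n in archive.namelist() if n.endswith(".tsv"))
        with archive.open(tsv_name) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            for row in csv.DictReader(text, delimiter="\t"):
                seen.setdefault(row[column], None)
    return list(seen)
```

A typical call would be read_gene_index(download_zip()), which returns the list of gene identifiers that the rest of the pipeline can iterate over.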
To populate the proteoform-level information (localization labels, cell images, amino acid sequence, ESM-2 representation), refer to dataset/download_data.py. The Python code is ready to run but may take a few days to finish pulling data from the web.
To create dataset splits for training & evaluation refer to dataset/create_datasets.py. A number of different training / evaluation splits are offered.
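
To illustrate what a gene-level split means, the sketch below holds out whole genes so that no gene contributes records to both the training and the evaluation set. It is not the logic of dataset/create_datasets.py; the record layout (dictionaries with a "gene" key), the function name, and the default test fraction are all assumptions made for the example.

```python
import random


def split_by_gene(records, test_fraction=0.2, seed=0):
    """Split records so that every gene ends up entirely in train or entirely in test."""
    genes = sorted({record["gene"] for record in records})
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    rng.shuffle(genes)
    n_test = max(1, round(len(genes) * test_fraction)) if genes else 0
    test_genes = set(genes[:n_test])
    train = [r for r in records if r["gene"] not in test_genes]
    test = [r for r in records if r["gene"] in test_genes]
    return train, test
```

The same idea extends to cell lines: holding out a set of cell lines in addition to a set of genes would test generalization along both axes.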

Visualization of train/test data

datasplits_matrix_visualization.ipynb (Figure 2a)

Model training:

The model is defined in src/model:

Image inpainting: src/model/nn_unet.py
Localization prediction using the sequence representation: src/model/nn_multilabel_mlp.py
Learning protein sequence representation: src/model/nn_light_attention.py
Full model: src/model/full_model.py

For model training, run 'python train.py'. Model training progress can be monitored with TensorBoard.

Visualize model performance and protein localization variability

The following plots are generated using plotLoss.ipynb and plotLoss_holdout2.ipynb (same procedure separately applied to training/Holdout 1 and Holdout 2)
- Computing protein image prediction losses for all held-out proteins in the test set of Holdout 1 and in Holdout 2 (Figure 2b)
- Plotting examples of protein image predictions (Figure 2c)
- Plotting predicted intra-nuclear proportions (Figure 3a)
- Computing the variability of intra-nuclear proportions across cell lines (Figure 3)
- Computing the variability of intra-nuclear proportions across single cells of the same cell lines (Figure 4)
- Gene ontology of the most variable proteins: go_variableProteins.ipynb (Training and Holdout 1); go_variableProteins_holdout2.ipynb (Holdout 2)
- spectral_bleed_through.ipynb: guided backprop attribution of model attention and a visualization of model predictions across different proteoforms and cell lines (Supplementary Figure 4b)
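
As a rough illustration of the quantities named in the list above, the sketch below computes an intra-nuclear proportion and its spread across cell lines. It assumes the proportion is the share of total protein intensity that falls inside a binary nuclear mask, and it uses the population standard deviation as the variability measure; the notebooks may define both differently, and the function names are hypothetical.

```python
from statistics import pstdev


def intra_nuclear_proportion(protein, nucleus_mask):
    """Fraction of total protein intensity that lies inside the nuclear mask.

    `protein` and `nucleus_mask` are 2D row-major sequences of equal shape;
    mask entries are truthy inside the nucleus.
    """
    total = 0.0
    inside = 0.0
    for protein_row, mask_row in zip(protein, nucleus_mask):
        for intensity, in_nucleus in zip(protein_row, mask_row):
            total += intensity
            if in_nucleus:
                inside += intensity
    return inside / total if total > 0 else 0.0


def variability_across_cell_lines(proportions_by_cell_line):
    """Population standard deviation of the per-cell-line proportions."""
    return pstdev(list(proportions_by_cell_line.values()))
```

A protein with a proportion near 1 in one cell line and near 0 in another would score a high variability, which is the kind of protein the gene ontology notebooks focus on.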

Evaluations of the experimental validation

plotLoss_experiment.ipynb (Figure 5)

Visualization of cell and protein representations (Figure 6)

latent_proteoform_representation_visualization.ipynb - PCA of protein sequence representations (Figure 6a left panel, Supplemental Figure 12)
latent_proteoform_nucCytosol.ipynb - PCA of proteins in nucleoplasm, cytosol, or both (Figure 6a right panel)
shap.ipynb - Shapley analysis (Figure 6b)
jointEmbedding.ipynb - Image representation (Figure 6c) and joint representation (Figure 6d)
latent_cell_representation_visualization.ipynb - PCA of image representations for 36 cell lines (Supplemental Figure 13)

uhlerlab/PUPS

Folders and files

Latest commit

History

Repository files navigation

Pretrained models using the Human Protein Atlas:

Compiling & accessing the dataset:

Dataset preparation:

Visualization of train/test data

Model training:

Visualize model performance and protein localization variability

Evaluations of the experimental validation

Visualization of cell and protein representations (Figure 6)

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages