# `ProteinWorkshop` Tutorial, Part 1 - Training a New Model
![Models](../docs/source/_static/box_models.png)

## Train a new model using the `ProteinWorkshop`

In [12]:
%load_ext autoreload
%autoreload 2
# %load_ext blackcellmagic


Welcome to the tutorial series for the `ProteinWorkshop`! 

In the `ProteinWorkshop`, we implement numerous [featurisation](https://www.proteins.sh/configs/features) schemes, [datasets](https://www.proteins.sh/configs/dataset) for [self-supervised pre-training](https://proteins.sh/quickstart_component/pretrain.html) and [downstream evaluation](https://proteins.sh/quickstart_component/downstream.html), [pre-training](https://proteins.sh/configs/task) tasks, and [auxiliary tasks](https://proteins.sh/configs/task.html#auxiliary-tasks).

[Processed datasets](https://drive.google.com/drive/folders/18i8rLST6ZICTBu6Q67ClT0KqN9AHeqoW?usp=sharing) and [pre-trained weights](https://drive.google.com/drive/folders/1zK1r8FpmGaqV_QwUJuvDacwSL0RW-Vw9?usp=sharing) are made available. Downloading the datasets manually is not required: on first run, each dataset is downloaded and processed from its respective source.

The `ProteinWorkshop` includes several models, together with pre-trained weights, so that you can use them right away.

In this tutorial, we show how to use what is already available in the `ProteinWorkshop` to train and use models for specific tasks. The `ProteinWorkshop` is a modular package, so we also cover how to swap its individual parts: the model, training task, dataset, featurization scheme, and so on.

Besides using all the different options we provide, you can make use of the modular nature of the `ProteinWorkshop` to add your own models, datasets, featurization schemes, and training tasks. We will show you how to do this in the next tutorials.

To train a new model, follow this 3-step procedure:

1. Choose the parts you want to consider: model, training task, dataset, featurization scheme and auxiliary tasks
2. Validate the designed training config
3. Use the designed config to train a new model

### 1. Choose the parts you want to consider: model, training task, dataset, featurization scheme and auxiliary tasks

You can switch out any of these for another available option by replacing the corresponding argument's value in `overrides`:

`cfg = hydra.compose("template", overrides=["encoder=schnet", "task=inverse_folding", "dataset=afdb_swissprot_v4", "features=ca_base", "+aux_task=none"], return_hydra_config=True)`
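Each entry in `overrides` is a plain `key=value` string, and a leading `+` adds a key that the base config does not define yet. The following is a minimal sketch of assembling such a list from dictionaries; `build_overrides` is a hypothetical helper written for this tutorial, not part of Hydra or the `ProteinWorkshop`.

```python
def build_overrides(groups, params=None, additions=None):
    """Turn plain dictionaries into a list of Hydra-style override strings.

    groups:    config-group choices, e.g. {"encoder": "schnet"}
    params:    dotted parameter overrides, e.g. {"encoder.out_dim": 32}
    additions: keys to append with a leading "+", e.g. {"aux_task": "none"}
    """
    overrides = [f"{key}={value}" for key, value in groups.items()]
    overrides += [f"{key}={value}" for key, value in (params or {}).items()]
    overrides += [f"+{key}={value}" for key, value in (additions or {}).items()]
    return overrides


overrides = build_overrides(
    groups={
        "encoder": "schnet",
        "task": "inverse_folding",
        "dataset": "afdb_swissprot_v4",
        "features": "ca_base",
    },
    additions={"aux_task": "none"},
)
print(overrides)
```

The resulting list can be passed as the `overrides` argument of `hydra.compose`, exactly like the hand-written list above.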

In [13]:
import os

os.environ['HYDRA_FULL_ERROR'] = '1'

In [14]:
# Misc. tools
import os

# Hydra tools
import hydra

from hydra.core.global_hydra import GlobalHydra
from hydra.core.hydra_config import HydraConfig

from proteinworkshop.constants import HYDRA_CONFIG_PATH
from proteinworkshop.utils.notebook import init_hydra_singleton

version_base = "1.2"  # Note: Need to update whenever Hydra is upgraded
init_hydra_singleton(reload=True, version_base=version_base)

path = HYDRA_CONFIG_PATH
rel_path = os.path.relpath(path, start=".")
# print(rel_path)
GlobalHydra.instance().clear()
hydra.initialize(rel_path, version_base=version_base)

cfg = hydra.compose(
    config_name="train",
    overrides=[
        "encoder=schnet",
        "encoder.hidden_channels=512", # Number of channels in the hidden layers
        "encoder.out_dim=32", # Output dimension of the model
        "encoder.num_layers=6", # Number of convolutional layers in the model
        "encoder.num_filters=128", # Number of filters used in convolutional layers
        "encoder.num_gaussians=50", # Number of Gaussian functions used for radial filters
        "encoder.cutoff=10.0", # Cutoff distance for interactions
        "encoder.max_num_neighbors=32", # Maximum number of neighboring atoms to consider
        "encoder.readout=add", # Global pooling method to be used
        "encoder.dipole=False",
        "encoder.mean=null",
        "encoder.std=null",
        "encoder.atomref=null",
        "encoder.pretraining=True",

        # "decoder.graph_label.dummy=True",
        "decoder=subgraph_distances", # Decoder head for the subgraph distance task
        "decoder.subgraph_distances.hidden_channels=32", # Hidden channels of the decoder head

        "task=subgraph_distance_prediction", # Training task
        "dataset=afdb_swissprot_v4", # Dataset
        "dataset.datamodule.batch_size=32", # Batch size used by the datamodule
        # "dataset.datamodule.train_split=0.5", # Optional: fraction of data used for training
        # "dataset.datamodule.val_split=0.05", # Optional: fraction of data used for validation
        "features=fe_subgraph",  # Featurisation scheme
        "+aux_task=none",

        "trainer.max_epochs=10",
        "optimiser=adam",
        "optimiser.optimizer.lr=3e-4",
        "callbacks.early_stopping.patience=10", # Adjust the value as needed, but keep this override
        "test=True",
        "scheduler=linear_warmup_cosine_decay", # Scheduler parameters are set in its YAML config
        # "optimizer.weight_decay=0.5"
    ],
    return_hydra_config=True,
)

# Note: Customize as needed e.g., when running a sweep
cfg.hydra.job.num = 0
cfg.hydra.job.id = 0
cfg.hydra.hydra_help.hydra_help = False
cfg.hydra.runtime.output_dir = "outputs"

HydraConfig.instance().set_config(cfg)
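To make `encoder.num_gaussians` and `encoder.cutoff` more concrete, here is a minimal sketch of the Gaussian radial expansion commonly used in SchNet-style models: a scalar distance is expanded into one value per Gaussian, with centres spread evenly between 0 and the cutoff. The function name `gaussian_smearing` and the even spacing are assumptions made for illustration; the encoder in the `ProteinWorkshop` may differ in detail.

```python
import math


def gaussian_smearing(distance, start=0.0, stop=10.0, num_gaussians=50):
    """Expand a scalar distance into `num_gaussians` radial basis values."""
    # Evenly spaced centres between `start` and `stop` (the cutoff).
    step = (stop - start) / (num_gaussians - 1)
    centers = [start + i * step for i in range(num_gaussians)]
    # Width tied to the spacing, so neighbouring Gaussians overlap.
    coeff = -0.5 / step**2
    return [math.exp(coeff * (distance - mu) ** 2) for mu in centers]


features = gaussian_smearing(3.7)  # a distance of 3.7, well inside the cutoff
peak = max(range(len(features)), key=features.__getitem__)
print(len(features), peak)
```

The strongest response comes from the Gaussian whose centre is closest to the input distance, so nearby distances produce similar feature vectors.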

### 2. Validate the designed training config

This is not strictly necessary, but it is a good idea to validate the config before training. This will check that all the arguments you have provided are valid and that the config is complete.

In [15]:
from proteinworkshop.configs import config

cfg = config.validate_config(cfg)
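As an illustration of what "complete" can mean for a config, the sketch below walks a nested dictionary and reports every entry still set to `???`, the marker OmegaConf and Hydra use for a mandatory value that was never filled in. This is not what `validate_config` does internally; `find_missing` and `toy_cfg` are hypothetical names and data used only for this example.

```python
MISSING = "???"  # OmegaConf/Hydra marker for a mandatory value that was never set


def find_missing(node, prefix=""):
    """Return the dotted paths of all entries whose value is still '???'."""
    paths = []
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{prefix}.{key}" if prefix else str(key)
            paths += find_missing(value, child)
    elif isinstance(node, (list, tuple)):
        for index, value in enumerate(node):
            paths += find_missing(value, f"{prefix}[{index}]")
    elif node == MISSING:
        paths.append(prefix)
    return paths


# Hypothetical toy config: `data_dir` was deliberately left unset.
toy_cfg = {
    "encoder": {"hidden_channels": 512, "out_dim": 32},
    "dataset": {"datamodule": {"batch_size": 32, "data_dir": "???"}},
}
missing = find_missing(toy_cfg)
print(missing)
```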

In [16]:
print(cfg.keys())
for key in cfg.keys():
    print(key)
    print(cfg[key])

dict_keys(['hydra', 'env', 'dataset', 'features', 'encoder', 'decoder', 'transforms', 'callbacks', 'optimiser', 'scheduler', 'trainer', 'extras', 'metrics', 'task', 'logger', 'name', 'seed', 'num_workers', 'task_name', 'ckpt_path_test', 'test', 'aux_task'])
hydra
env
{'paths': {'root_dir': '${oc.env:ROOT_DIR}', 'data': '${oc.env:DATA_PATH}', 'output_dir': '${hydra:runtime.output_dir}', 'work_dir': '${hydra:runtime.cwd}', 'log_dir': '${oc.env:RUNS_PATH}', 'runs': '${oc.env:RUNS_PATH}', 'run_dir': '${env.paths.runs}/${name}/${env.init_time}'}, 'python': {'version': '${python_version:micro}'}, 'init_time': '${now:%y-%m-%d_%H:%M:%S}'}
dataset
{'datamodule': {'_target_': 'graphein.ml.datasets.foldcomp_dataset.FoldCompLightningDataModule', 'data_dir': '${env.paths.data}/afdb_swissprot_v4/', 'database': 'afdb_swissprot_v4', 'batch_size': 32, 'num_workers': 32, 'train_split': 0.8, 'val_split': 0.1, 'test_split': 0.1, 'pin_memory': True, 'use_graphein': True, 'transform': '${transforms}'}, 'dat
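The datamodule config printed above requests `num_workers: 32`. PyTorch warns when this exceeds the number of CPUs available to the process, as the training log in the next step shows (32 requested, 28 suggested). A minimal sketch of capping the value before composing the config follows; `usable_cpus` and `capped_workers` are hypothetical helpers, and the override key is taken from the printed config.

```python
import os


def usable_cpus():
    """Number of CPUs this process may run on (falls back to the total count)."""
    if hasattr(os, "sched_getaffinity"):  # available on Linux
        return len(os.sched_getaffinity(0))
    return os.cpu_count() or 1


def capped_workers(requested, available=None):
    """Never request more DataLoader workers than there are usable CPUs."""
    available = usable_cpus() if available is None else available
    return max(0, min(requested, available))


# With the 28 CPUs reported in the warning, 32 requested workers become 28.
override = f"dataset.datamodule.num_workers={capped_workers(32, available=28)}"
print(override)
```

The resulting string can be added to the `overrides` list used in step 1.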

### 3. Use the designed config to train a new model

Now with the config you have designed, you can train a new model. You can also use the `ProteinWorkshop` to evaluate the model on a downstream task.

During training, each batch passed to the model is a `ProteinBatch` such as the following:

ProteinBatch(fill_value=[32], atom_list=[32], coords=[15602, 37, 3], residues=[32], residue_id=[32], 
chains=[15602], residue_type=[15602], b_factor=[15602], id=[32], x=[15602, 23], seq_pos=[15602, 1], batch=[15602], 
ptr=[33], pos=[15602, 3], edge_index=[2, 247543], subgraphs=[1548, 149], subgraph_distances=[1548], 
subgraph_lengths=[1548])
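In the batch above, 32 proteins are concatenated into one large graph of 15602 nodes. The `batch` vector records which protein each node belongs to, and `ptr` holds the cumulative node offsets, which is why it has 33 entries (batch size plus one). The sketch below reproduces this bookkeeping for three tiny graphs; `batch_vectors` is a hypothetical helper that follows the usual PyTorch Geometric convention and is not code from the `ProteinWorkshop`.

```python
from itertools import accumulate


def batch_vectors(nodes_per_graph):
    """Build PyG-style `ptr` and `batch` vectors from per-graph node counts."""
    # ptr[i] is the index of the first node of graph i; ptr[-1] is the total node count.
    ptr = [0] + list(accumulate(nodes_per_graph))
    # batch[j] is the index of the graph that node j belongs to.
    batch = [graph for graph, count in enumerate(nodes_per_graph) for _ in range(count)]
    return ptr, batch


ptr, batch = batch_vectors([3, 2, 4])
print(ptr)    # [0, 3, 5, 9]
print(batch)  # [0, 0, 0, 1, 1, 2, 2, 2, 2]
```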


In [17]:
from proteinworkshop.train import train_model

train_model(cfg)

Seed set to 52


100%|██████████| 542378/542378 [00:00<00:00, 4592302.78it/s]
Processing...
Done!


Trainer already configured with model summary callbacks: [<class 'lightning.pytorch.callbacks.rich_model_summary.RichModelSummary'>]. Skipping setting a default `ModelSummary` callback.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
100%|██████████| 542378/542378 [00:00<00:00, 4662335.89it/s]
Processing...
Done!
100%|██████████| 542378/542378 [00:00<00:00, 4781910.36it/s]
Processing...
Done!
100%|██████████| 542378/542378 [00:00<00:00, 4658669.54it/s]
Processing...
Done!



Lazy modules are a new feature under heavy development so changes to the API or functionality can happen at any moment.



100%|██████████| 542378/542378 [00:00<00:00, 4449078.20it/s]
Processing...
Done!
100%|██████████| 542378/542378 [00:00<00:00, 4591422.26it/s]
Processing...
Done!
100%|██████████| 542378/542378 [00:00<00:00, 4554779.36it/s]
Processing...
Done!

This DataLoader will create 32 worker processes in total. Our suggested max number of worker in current system is 28, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.



100%|██████████| 542378/542378 [00:00<00:00, 4520045.49it/s]
Processing...
Done!
100%|██████████| 542378/542378 [00:00<00:00, 4756972.08it/s]
Processing...
Done!
100%|██████████| 542378/542378 [00:00<00:00, 4736529.36it/s]
Processing...
Done!
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]



You have overridden `on_after_batch_transfer` in `LightningModule` but have passed in a `LightningDataModule`. It will use the implementation from `LightningModule` instance.



Output()

Metric val/subgraph_distances/mse improved. New best score: 0.024
Metric train/loss/total improved. New best score: 0.024
Epoch 0, global step 13560: 'val/subgraph_distances/mse' reached 0.02420 (best 0.02420), saving model to '/home/zhang/Projects/3d/ProteinWorkshop/notebooks/outputs/checkpoints/epoch_000.ckpt' as top 1


Metric val/subgraph_distances/mse improved by 0.000 >= min_delta = 0.0. New best score: 0.024
Metric train/loss/total improved by 0.004 >= min_delta = 0.0. New best score: 0.020
Epoch 1, global step 27120: 'val/subgraph_distances/mse' reached 0.02417 (best 0.02417), saving model to '/home/zhang/Projects/3d/ProteinWorkshop/notebooks/outputs/checkpoints/epoch_001.ckpt' as top 1


Epoch 2, global step 40680: 'val/subgraph_distances/mse' was not in top 1
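The global step counts in the log can be cross-checked against the config. Assuming the progress-bar total (542378) is the number of structures in the dataset, that the training split size is rounded down, and that the final partial batch is kept, the arithmetic below reproduces the logged steps per epoch. These three assumptions are ours; the values themselves come from the log and the printed config.

```python
import math

num_structures = 542_378  # total shown by the progress bars in the log
train_split = 0.8         # from the printed datamodule config
batch_size = 32           # from the `dataset.datamodule.batch_size` override

# Assumption: the training split size is rounded down to a whole number of samples.
train_samples = int(num_structures * train_split)
# Assumption: the last, partially filled batch is not dropped.
steps_per_epoch = math.ceil(train_samples / batch_size)

print(train_samples)  # 433902
print([steps_per_epoch * epoch for epoch in (1, 2, 3)])  # [13560, 27120, 40680]
```

The three values match the global steps reported at the end of epochs 0, 1 and 2.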


### 4. Wrapping up

Have any additional questions about using the components provided in the `ProteinWorkshop`? [Create a new issue](https://github.com/a-r-j/ProteinWorkshop/issues/new/choose) on our [GitHub repository](https://github.com/a-r-j/ProteinWorkshop). We would be happy to work with you to leverage the full power of the repository!
