## Take Home Assessment

**Disclaimer**: This assessment is a work in progress, so we apologise in advance for any hiccups. Any feedback is valuable!

**Setup**: You are provided with training code for a model that takes a protein's 3D structure and predicts the associated amino acid sequence. This notebook provides the steps required to download the code repository and training data (a subset of the Protein Data Bank), alongside minimal code to call the training loop. Please fork the repository that you can find below and edit your own version.

**Compute**: You will be provided with a [Lambda](https://cloud.lambdalabs.com/) instance with an A10 GPU on an agreed day. To set this up, we need your public key; we will then share an IP address for accessing the compute instance.

**Evaluation**: The following questions are deliberately open-ended, and no specific answer is expected. The aim is to provide a semi-realistic setup of the kind you might encounter if you joined our team. We want to assess your ability to probe deep learning models and to come up with solutions that alleviate the limitations you identify. Please write down your answers (e.g. with plots, tables, etc.) in your copy of the repository (in this notebook or in any other format of your choice) and push them to your fork. Please also document the steps you took to arrive at your answers. We will discuss them during the onsite interview. Please keep the time commitment under 4 hours.

**Questions**:
1. Log and profile the training loop.  What would you recommend if we wanted to train more quickly? Implement some of your proposals.
2. What kinds of issues will arise as model size increases? How could these be partially alleviated? Implement some of your proposed solutions.
3. The way the dataloader is organized in this project is unusual.  What will happen as we increase the size of the training dataset (e.g. using the AlphaFold database)?  How would you re-organize the code to avoid these issues?  What techniques would you consider using to ensure training scales efficiently with the dataset size?
4. Log the average norm of the weights & activations through training. How would you organize this information to help diagnose training dynamics?  How would you characterize the values you observe here?
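
As a possible starting point for question 1, the sketch below accumulates wall-clock time per named phase of a training step using only the standard library. The `PhaseTimer` class and the phase names are hypothetical and not part of the provided repository, and the `time.sleep` calls are stand-ins for real work. It is a rough first pass, not a substitute for a proper profiler.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class PhaseTimer:
    """Hypothetical helper: accumulates wall-clock time per named phase."""

    def __init__(self):
        self.totals = defaultdict(float)  # phase name -> total seconds
        self.counts = defaultdict(int)    # phase name -> number of calls

    @contextmanager
    def phase(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.totals[name] += time.perf_counter() - start
            self.counts[name] += 1

    def summary(self):
        """Return (name, total_seconds, share_of_total) rows, slowest first."""
        grand_total = sum(self.totals.values()) or 1.0
        rows = [(name, t, t / grand_total) for name, t in self.totals.items()]
        return sorted(rows, key=lambda row: row[1], reverse=True)


# Stand-in phases. In the real loop these blocks would wrap data loading
# and the forward/backward pass instead of time.sleep.
timer = PhaseTimer()
for _ in range(3):
    with timer.phase("data"):
        time.sleep(0.05)
    with timer.phase("forward_backward"):
        time.sleep(0.001)

for name, seconds, share in timer.summary():
    print(f"{name:>18s}  {seconds:.3f}s  {share:.0%}")
```

Note that CUDA kernels run asynchronously, so GPU phases timed this way are only meaningful if the device is synchronized (e.g. with `torch.cuda.synchronize()`) before the clock is read.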

In [None]:
# Download subset of training data
!wget https://files.ipd.uw.edu/pub/training_sets/pdb_2021aug02_sample.tar.gz
!tar xvf "pdb_2021aug02_sample.tar.gz"
!rm pdb_2021aug02_sample.tar.gz

In [None]:
from training.training import main as run_training
import random
import numpy as np
import torch

torch.manual_seed(0)
np.random.seed(0)
random.seed(0)

class MyArgs:
  def __init__(self):
    self.path_for_training_data = "/tmp/content/pdb_2021aug02_sample"
    self.path_for_outputs = "/tmp/content/test"
    self.previous_checkpoint = ""
    self.num_epochs = 2
    self.save_model_every_n_epochs = 5
    self.reload_data_every_n_epochs = 4
    self.num_examples_per_epoch = 200
    self.batch_size = 2000
    self.max_protein_length = 2000
    self.hidden_dim = 128
    self.num_encoder_layers = 3
    self.num_decoder_layers = 3
    self.num_neighbors = 32
    self.dropout = 0.1
    self.backbone_noise = 0.1
    self.rescut = 3.5
    self.debug = False
    self.gradient_norm = -1.0  # no norm

args = MyArgs()
run_training(args)
