# Retention Time Prediction

This notebook is designed to run in Google [Colaboratory](https://colab.research.google.com/). To train the model faster, change the Colab runtime to use a hardware accelerator (GPU or TPU).

This is an extension of the original walkthrough example available [here](https://github.com/wilhelm-lab/dlomix-resources/blob/main/notebooks/RetentionTime/Example_RTModel_Walkthrough_colab.ipynb).

### Task 4: Data Split
As in the initial notebook, we will initialize our model and train it. The goal here is to experiment with different data splits, observe their impact on performance, and judge whether each split gives a realistic evaluation.

In [None]:
# install the dlomix package in the current environment using pip

!python -m pip install -q dlomix==0.0.3

In [None]:
!python -m pip install -q wandb

In [None]:
import numpy as np
import pandas as pd
import dlomix
from dlomix.models import RetentionTimePredictor
import tensorflow as tf
from dlomix.eval import TimeDeltaMetric

import wandb
from wandb.keras import WandbCallback

In [None]:
# enter project name for weights and biases
project_name = 'dlomix_retention_time'

The code below creates a dataset, creates the model, and trains it. Try both available data splits (`feature_a` and `feature_b`). Please refer to the initial notebook to analyze the results.

Hint: Use the paths available below. The path suffixes map to the features as follows:
- suffix `_A`: feature A
- suffix `_B`: feature B

**Different features span different ranges, so the absolute losses/metrics might be misleading.**
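
One way to compare runs whose values live on different scales is to divide the error by the range of the targets. The sketch below is only an illustration: it assumes the note above refers to the range of the target values in each split, the helper name `range_normalized_mae` is hypothetical (not part of dlomix), and the arrays are toy values rather than notebook results.

```python
import numpy as np


def range_normalized_mae(y_true, y_pred):
    """Mean absolute error divided by the range of the true targets.

    Hypothetical helper for illustration; not part of dlomix.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    target_range = y_true.max() - y_true.min()
    if target_range == 0:
        raise ValueError("targets are constant, so the range is zero")
    return float(np.mean(np.abs(y_true - y_pred)) / target_range)


# Toy targets on two very different scales (made-up numbers).
a_true = np.array([0.0, 50.0, 100.0])
a_pred = np.array([5.0, 45.0, 95.0])
b_true = np.array([0.0, 0.5, 1.0])
b_pred = np.array([0.05, 0.45, 0.95])

# The absolute errors differ by a factor of 100 ...
mae_a = float(np.mean(np.abs(a_true - a_pred)))
mae_b = float(np.mean(np.abs(b_true - b_pred)))

# ... but relative to each target range the two are equally good.
rel_a = range_normalized_mae(a_true, a_pred)
rel_b = range_normalized_mae(b_true, b_pred)
print(mae_a, mae_b, rel_a, rel_b)
```

Normalizing by the range is just one option; dividing by the standard deviation of the targets would serve the same purpose.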

In [None]:
TRAIN_DATAPATH_A = 'https://raw.githubusercontent.com/wilhelm-lab/dlomix-resources/main/example_datasets/RetentionTime/feature_a/proteomeTools_train_val_a.csv'
TRAIN_DATAPATH_B = 'https://raw.githubusercontent.com/wilhelm-lab/dlomix-resources/main/example_datasets/RetentionTime/feature_b/proteomeTools_train_val_b.csv'

TEST_DATAPATH_A = 'https://raw.githubusercontent.com/wilhelm-lab/dlomix-resources/main/example_datasets/RetentionTime/feature_a/proteomeTools_test_a.csv'
TEST_DATAPATH_B = 'https://raw.githubusercontent.com/wilhelm-lab/dlomix-resources/main/example_datasets/RetentionTime/feature_b/proteomeTools_test_b.csv'

In [None]:
from dlomix.data import RetentionTimeDataset

BATCH_SIZE = 64

rtdata = RetentionTimeDataset(data_source=TRAIN_DATAPATH_B, target_col='retention_time',
                              seq_length=30, batch_size=BATCH_SIZE, val_ratio=0.2)


# this is the test dataset object, do not forget to change it to the respective suffix (A or B)
# when you change the training dataset

test_rtdata = RetentionTimeDataset(data_source=TEST_DATAPATH_B, target_col='retention_time',
                                   seq_length=30, batch_size=BATCH_SIZE, test=True)

In [None]:
# Enter the Weights & Biases run name. Make sure that different splits have different run names.
wandb.init(project=project_name, name='data_split_b')

# create model
model = RetentionTimePredictor(seq_length=30)

optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)

# compile the model with the optimizer and the metrics we want to use; we can also add our custom time-delta metric

model.compile(optimizer=optimizer, 
            loss='mse', metrics=['mean_absolute_error', TimeDeltaMetric()])

history = model.fit(rtdata.train_data, validation_data=rtdata.val_data, epochs=15,
                    callbacks=[WandbCallback(save_model=False)])


# Mark the run as finished
wandb.finish()

### Bonus:
After analyzing the results, can you figure out what is wrong with these splits and how they differ from each other?
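
As a possible starting point, you could measure how much a test set overlaps with its training set. The sketch below is a minimal illustration under stated assumptions: the helper `split_overlap` is hypothetical, the column name `sequence` is an assumption about the CSV layout, and the toy data frames stand in for the real files. In the notebook you would load the files with `pd.read_csv(TRAIN_DATAPATH_A)` and so on.

```python
import pandas as pd


def split_overlap(train_df, test_df, column="sequence"):
    """Fraction of unique test values in `column` that also occur in training.

    Hypothetical helper; the default column name is an assumption.
    """
    train_values = set(train_df[column])
    test_values = set(test_df[column])
    if not test_values:
        return 0.0
    return len(train_values & test_values) / len(test_values)


# Toy stand-ins for the real CSV files (made-up sequences and targets).
train_df = pd.DataFrame({"sequence": ["AAK", "CDE", "FGH"],
                         "retention_time": [10.0, 20.0, 30.0]})
test_df = pd.DataFrame({"sequence": ["AAK", "XYZ"],
                        "retention_time": [11.0, 40.0]})

overlap = split_overlap(train_df, test_df)
print(f"share of test sequences also seen in training: {overlap:.2f}")

# Comparing the target distributions of the two sets is another quick check.
print(train_df["retention_time"].describe())
print(test_df["retention_time"].describe())
```

Running the same checks on both splits and comparing the numbers may help explain any differences you see in the metrics.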