# Fragment Ion Intensity Prediction

This notebook is prepared to be run in Google [Colaboratory](https://colab.research.google.com/). To train the model faster, change the Colab runtime type to use a hardware accelerator (GPU or TPU).

This is an extension of the original walkthrough example available [here](https://github.com/wilhelm-lab/dlomix-resources/tree/tasks/intensity/notebooks/Intensity/Example_IntensityModel_Walkthrough_colab.ipynb).

### Task 4: Data Split
As in the initial notebook, we will initialize our model and train it. The goal here is to experiment with different data splits, observe their impact on performance, and judge whether each split gives a realistic evaluation.

In [1]:
# install the dlomix package in the current environment using pip

!python -m pip install -q git+https://github.com/wilhelm-lab/dlomix.git@feature/intensity_tutorial



In [None]:
!python -m pip install -q wandb

In [3]:
import numpy as np
import pandas as pd
import dlomix
from dlomix.models import PrositIntensityPredictor
import tensorflow as tf
from dlomix.losses import masked_spectral_distance, masked_pearson_correlation_distance
tf.get_logger().setLevel('ERROR')

import wandb
from wandb.keras import WandbCallback

The code below creates a dataset, creates the model, and trains it. Try both of the available data splits (`split_a` and `split_b`). Please refer to the initial notebook to analyze the results.

Hint: Use the paths available below. The suffix of each path variable indicates the split:
- suffix `_A`: split A
- suffix `_B`: split B

In [2]:
TRAIN_DATAPATH_A = 'https://raw.githubusercontent.com/wilhelm-lab/dlomix-resources/tasks/intensity/example_datasets/Intensity/split_a/proteomeTools_train_val_a.csv'
TRAIN_DATAPATH_B = 'https://raw.githubusercontent.com/wilhelm-lab/dlomix-resources/tasks/intensity/example_datasets/Intensity/split_b/proteomeTools_train_val_b.csv'

TEST_DATAPATH_A = 'https://raw.githubusercontent.com/wilhelm-lab/dlomix-resources/tasks/intensity/example_datasets/Intensity/split_a/proteomeTools_test_a.csv'
TEST_DATAPATH_B = 'https://raw.githubusercontent.com/wilhelm-lab/dlomix-resources/tasks/intensity/example_datasets/Intensity/split_b/proteomeTools_test_b.csv'

In [None]:
from dlomix.data import IntensityDataset

BATCH_SIZE = 64

# training/validation dataset object (uses the IntensityDataset class imported above)
int_data = IntensityDataset(data_source=TRAIN_DATAPATH_A,
                            seq_length=30, collision_energy_col='collision_energy',
                            batch_size=BATCH_SIZE, val_ratio=0.2)


# this is the test dataset object; do not forget to change it to the respective suffix (A or B)
# when you change the training dataset. It gets its own name so that it does not
# overwrite the training dataset above.

test_int_data = IntensityDataset(data_source=TEST_DATAPATH_A,
                                 seq_length=30, collision_energy_col='collision_energy',
                                 batch_size=BATCH_SIZE, test=True)

In [None]:
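# Enter the Weights & Biases project and run names. Make sure that different dataset splits have different run names.
project_name = 'intensity-data-splits'  # placeholder, not from the original notebook: use your own W&B project name
wandb.init(project=project_name, name='splits_a')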

# create model
model = PrositIntensityPredictor(seq_length=30)

optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)

# compile the model with the optimizer, the loss, and the metrics we want to track

model.compile(optimizer=optimizer, 
            loss=masked_spectral_distance, metrics=[masked_pearson_correlation_distance,'mean_absolute_error', 'mse'])

history = model.fit(int_data.train_data, validation_data=int_data.val_data, epochs=15,
                    callbacks=[WandbCallback(save_model=False)])


# Mark the run as finished
wandb.finish()
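
The loss used above, `masked_spectral_distance`, scores a prediction by the angle between the predicted and the measured spectrum. The following NumPy sketch illustrates that idea on small arrays. It is not the dlomix implementation: the function name `masked_spectral_distance_np` is hypothetical, and the sketch assumes that peaks which cannot exist for a peptide are marked with `-1` in the targets and should be ignored.

```python
import numpy as np


def masked_spectral_distance_np(y_true, y_pred, eps=1e-12):
    """Normalized spectral angle (0 = identical direction, 1 = orthogonal).

    Assumption: entries of y_true equal to -1 mark invalid peaks and are
    excluded from both the target and the prediction.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)

    # Zero out positions that are marked as invalid in the target.
    mask = y_true != -1
    true_masked = np.where(mask, y_true, 0.0)
    pred_masked = np.where(mask, y_pred, 0.0)

    # L2-normalize each spectrum; eps guards against division by zero.
    true_len = np.linalg.norm(true_masked, axis=-1, keepdims=True)
    pred_len = np.linalg.norm(pred_masked, axis=-1, keepdims=True)
    true_unit = true_masked / np.maximum(true_len, eps)
    pred_unit = pred_masked / np.maximum(pred_len, eps)

    # Cosine similarity, clipped so arccos stays within its domain.
    cosine = np.clip(np.sum(true_unit * pred_unit, axis=-1), -1.0, 1.0)

    # Scale the angle from [0, pi/2] to [0, 1].
    return 2.0 * np.arccos(cosine) / np.pi


# Toy batch of two spectra with three fragment positions each.
y_true = np.array([[0.5, 0.5, -1.0], [1.0, 0.0, 0.0]])
y_pred = np.array([[0.5, 0.5, 9.0], [0.0, 1.0, 0.0]])
distances = masked_spectral_distance_np(y_true, y_pred)
print(distances)
```

In the toy batch, the first prediction matches the target on every valid position, so its distance is close to 0; the second prediction puts all intensity on the wrong peak, so its distance is 1.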

### Bonus:
After analyzing the results, can you figure out what is wrong with these splits and how they differ from each other?
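
One possible starting point for comparing the splits (not necessarily the intended answer) is to count how many peptide sequences appear in both the training and the test file. The sketch below does this on toy data. The helper `sequence_overlap` is hypothetical, and the column name `sequence` is an assumption; check the actual column names in the CSV files before applying it.

```python
import pandas as pd


def sequence_overlap(train_df, test_df, sequence_col="sequence"):
    """Count unique sequences per set and how many are shared.

    sequence_col is an assumed column name; adjust it to the real data.
    """
    train_seqs = set(train_df[sequence_col])
    test_seqs = set(test_df[sequence_col])
    shared = train_seqs & test_seqs
    return {
        "n_train_unique": len(train_seqs),
        "n_test_unique": len(test_seqs),
        "n_shared": len(shared),
        # Fraction of unique test sequences that also occur in training.
        "fraction_of_test_shared": len(shared) / len(test_seqs) if test_seqs else 0.0,
    }


# Toy data frames standing in for the real train and test files.
train_toy = pd.DataFrame({"sequence": ["PEPTIDE", "PEPTIDE", "ACDEFGHIK", "LMNPQR"]})
test_toy = pd.DataFrame({"sequence": ["PEPTIDE", "STVWY"]})

stats = sequence_overlap(train_toy, test_toy)
print(stats)
```

To try this on the real data, the files could be loaded with `pd.read_csv(TRAIN_DATAPATH_A)` and `pd.read_csv(TEST_DATAPATH_A)`, and likewise for split B, so that the statistics of the two splits can be compared side by side.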