# Peptide Detectability Prediction 

This notebook is prepared to be run in Google [Colaboratory](https://colab.research.google.com/).

One of the example datasets used in this notebook is deposited in the ProteomeXchange Consortium via the MassIVE partner repository with the identifier PXD024364. The other dataset is deposited in the ProteomeXchange Consortium via the PRIDE partner repository with the identifier PXD010154.


#### Installing the DLOmix Package

If you have not installed the DLOmix package yet, you need to do so before running the code. 

You can install the DLOmix package using pip.

In [None]:
# uncomment the following line to install the DLOmix package in the current environment using pip

# the version specifier is quoted so that the shell does not treat ">" as an output redirect
#!python -m pip install "dlomix>0.1.3"

#### Importing Required Libraries

Before running the code, ensure you import all the necessary libraries. These imports are essential for accessing the functionalities needed for data processing, model training, and evaluation.

In [None]:
import numpy as np
import pandas as pd
import tensorflow as tf
import dlomix
import sys
import os

## Model

We can now create the model. The architecture is an encoder-decoder with an attention mechanism, based on bidirectional recurrent neural networks (BRNN) with gated recurrent units (GRU). Both the encoder and the decoder consist of a single layer, and the decoder additionally includes a dense layer. Apart from the number of units and the number of classes set below, the model uses its default arguments.

In [None]:
from dlomix.models import DetectabilityModel
from dlomix.constants import CLASSES_LABELS, alphabet, aa_to_int_dict

In [None]:
CLASSES_LABELS, len(alphabet), aa_to_int_dict

In [None]:
total_num_classes = len(CLASSES_LABELS)
input_dimension = len(alphabet)
num_cells = 64

model = DetectabilityModel(num_units = num_cells, num_clases = total_num_classes)

#### Model Weights Configuration

In the following section, you need to specify the path to the model weights you wish to use. The default path provided is set to the weights for the **Pfly** model, which is the fine-tuned model mentioned in the publication associated with this notebook.

- **Using the Default Pfly Model**: If you are utilizing the fine-tuned Pfly model as described in the publication, you can keep the default path unchanged. This will load the model weights for Pfly.

- **Using the Base Model or Different Weights**: If you intend to use the base model or have different weights (e.g., for a custom model), you should update the path to reflect the location of these weights.

In [None]:
## Loading model weights 

model_save_path = 'output/weights/new_fine_tuned_model/fine_tuned_model_weights_detectability'

model.load_weights(model_save_path)
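
Loading weights typically fails if the path does not point to existing weight files, so it can help to check for them first. The sketch below assumes the weights were saved in the TensorFlow checkpoint format, where a prefix such as the one above corresponds to files named `<prefix>.index` and `<prefix>.data-*`. `checkpoint_exists` is a hypothetical helper, not a DLOmix or Keras function.

```python
import glob
import os

def checkpoint_exists(prefix):
    """Return True if an index file and at least one data shard exist for a checkpoint prefix."""
    has_index = os.path.isfile(prefix + ".index")
    has_data = len(glob.glob(glob.escape(prefix) + ".data-*")) > 0
    return has_index and has_data

# Same prefix as in the cell above
model_save_path = "output/weights/new_fine_tuned_model/fine_tuned_model_weights_detectability"
if not checkpoint_exists(model_save_path):
    print(f"No checkpoint files found for prefix: {model_save_path}")
```

If the check reports missing files, update `model_save_path` before calling `model.load_weights`.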

## Workflow Overview

This notebook supports two different workflows depending on your dataset:

- **Labeled Data**: Use this pipeline when your dataset includes ground truth labels. This setup not only makes predictions but also allows for detailed evaluation by comparing the true labels with the predicted values, facilitating the generation of a comprehensive evaluation report.

- **Unlabeled Data**: Use this pipeline when your dataset does not include labels. Here, the focus is on making predictions only, without generating a detailed performance report, as there are no labels to compare against.

### Notebook Structure

Subtitles throughout the notebook indicate the sections for each type of data:

- **Labeled Data Section**: Follow these when your dataset includes labels to receive predictions and a comprehensive evaluation report.

- **Unlabeled Data Section**: Use these when your dataset lacks labels, focusing solely on generating predictions.

Make sure to select the appropriate pipeline based on your dataset.

# Labeled Data

## 1. Load Data 

You can import the `DetectabilityDataset` class and create an instance to manage data for training, validation, and testing. This instance handles TensorFlow dataset objects and simplifies configuring and controlling how your data is preprocessed and split.

For the parameters of the dataset class, please refer to the DLOmix documentation: https://dlomix.readthedocs.io/en/main/dlomix.data.html#


**Note**: If class labels are provided, the following encoding scheme should be used:
- **Non-Flyer**: 0
- **Weak Flyer**: 1
- **Intermediate Flyer**: 2
- **Strong Flyer**: 3
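
If your labels are stored as class names rather than integers, they need to be converted first. A minimal sketch, assuming the labels are given as the strings listed above; `LABEL_TO_INT` and `encode_labels` are hypothetical names, not part of DLOmix:

```python
# Hypothetical helper (not part of DLOmix): map class names to the integer codes listed above.
LABEL_TO_INT = {
    "Non-Flyer": 0,
    "Weak Flyer": 1,
    "Intermediate Flyer": 2,
    "Strong Flyer": 3,
}

def encode_labels(labels):
    """Convert a list of class names to integer class labels."""
    unknown = sorted(set(labels) - set(LABEL_TO_INT))
    if unknown:
        raise ValueError(f"Unknown class labels: {unknown}")
    return [LABEL_TO_INT[label] for label in labels]

print(encode_labels(["Strong Flyer", "Non-Flyer", "Weak Flyer"]))  # [3, 0, 1]
```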

In [None]:
from dlomix.data import DetectabilityDataset

In [None]:
# load the dataset from huggingface for prediction

from datasets import load_dataset, DatasetDict

# pick one of the available datasets on the HuggingFace Hub
# Collection: https://huggingface.co/collections/Wilhelmlab/detectability-datasets-671e76fb77035878c50a9c1d

hf_data_name = "Wilhelmlab/detectability-sinitcyn"
#hf_data_name = "Wilhelmlab/detectability-wang"

hf_dataset_split = load_dataset(hf_data_name, split="test")
hf_dataset = DatasetDict({"test": hf_dataset_split})
hf_dataset

In [None]:

max_pep_length = 40
BATCH_SIZE = 128

detectability_data = DetectabilityDataset(data_source=hf_dataset,
                                          data_format='hf',
                                          max_seq_len=max_pep_length,
                                          label_column="Classes",
                                          sequence_column="Sequences",
                                          dataset_columns_to_keep=['Proteins'],
                                          batch_size=BATCH_SIZE,
                                          with_termini=False,
                                          alphabet=aa_to_int_dict)


In [None]:
# This is the dataset with the test split
# You can see the column names under each split. Columns starting with _ are internal, but they can still be used, for example "_parsed_sequence" to look up the original sequences.
detectability_data

In [None]:
# Accessing elements in the dataset is done by specifying the split name and then the column name
# Example here for one sequence after encoding & padding compared to the original sequence

detectability_data["test"]["Sequences"][0], "".join(detectability_data["test"]["_parsed_sequence"][0])
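
The encoded sequence shown by the cell above results from mapping each amino acid to an integer and padding to `max_seq_len`. The idea can be sketched in plain Python. The toy alphabet, the use of 0 as the padding value, and the helper name `encode_and_pad` are assumptions made for illustration and may differ from what DLOmix does internally:

```python
# Toy alphabet for illustration only; the notebook uses dlomix.constants.aa_to_int_dict instead.
toy_alphabet = {aa: i for i, aa in enumerate("ACDEFGHIKLMNPQRSTVWY", start=1)}

def encode_and_pad(sequence, alphabet, max_len, pad_value=0):
    """Map each residue to its integer code and right-pad the result to max_len."""
    if len(sequence) > max_len:
        raise ValueError(f"Sequence longer than {max_len} residues: {sequence}")
    encoded = [alphabet[aa] for aa in sequence]
    return encoded + [pad_value] * (max_len - len(encoded))

encoded = encode_and_pad("PEPTIDE", toy_alphabet, max_len=10)
print(encoded)  # [13, 4, 13, 17, 8, 3, 4, 0, 0, 0]
```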

## 2. Testing and Reporting

We use the test dataset to assess our model's performance, which is only applicable if labels are available. The `DetectabilityReport` class allows us to compute various metrics, generate reports, and create plots for a comprehensive evaluation of the model.

Note: The reporting module is currently under development, so some features may be unstable or subject to change.

##### Generate Predictions on Test Data Using `model.predict`

To obtain predictions for your test data, use the Keras `model.predict` method. Simply pass your test dataset to this method, and it will return the model's predictions.

In [None]:
predictions = model.predict(detectability_data.tensor_test_data)

In [None]:
predictions.shape
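
Assuming each row of `predictions` holds one probability per class (an assumption based on the one-hot targets used later in this notebook), the predicted class is the column with the highest value. A small sketch with made-up numbers; `toy_predictions` and `class_names` are illustrative stand-ins, not DLOmix objects:

```python
import numpy as np

# Made-up probabilities for three peptides over the four classes (illustration only).
toy_predictions = np.array([
    [0.70, 0.20, 0.05, 0.05],
    [0.10, 0.15, 0.25, 0.50],
    [0.05, 0.60, 0.30, 0.05],
])
class_names = ["Non-Flyer", "Weak Flyer", "Intermediate Flyer", "Strong Flyer"]

# Index of the most probable class in each row
predicted_classes = toy_predictions.argmax(axis=1)
predicted_names = [class_names[i] for i in predicted_classes]
print(predicted_classes.tolist(), predicted_names)
```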

To generate reports and calculate evaluation metrics against predictions, we obtain the targets and the data for the specific dataset split. This can be achieved using the `DetectabilityDataset` class directly.

In [None]:
# access the test dataset and get the Classes column
test_targets = detectability_data["test"]["Classes"]


# if needed, the decoded version of the classes can be retrieved by looking up the class names
test_targets_decoded = [CLASSES_LABELS[x] for x in test_targets]


test_targets[0:5], test_targets_decoded[0:5]

In [None]:
# The dataframe needed for the report

test_data_df = pd.DataFrame(
    {
        "Sequences": detectability_data["test"]["_parsed_sequence"], # get the raw parsed sequences
        "Classes": test_targets, # get the test targets from above
        "Proteins": detectability_data["test"]["Proteins"] # get the Proteins column from the dataset object
    }
)

test_data_df.Sequences = test_data_df.Sequences.apply(lambda x: "".join(x)) # join the sequences since they are a list of string amino acids.
test_data_df.head(5)

In [None]:
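# Set the environment variable before the import; a plain Python assignment would not be visible to wandb
os.environ["WANDB_REPORT_API_DISABLE_MESSAGE"] = "True"
from dlomix.reports.DetectabilityReport import DetectabilityReport, predictions_report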

#### Generate a Report Using the `DetectabilityReport` Class

The `DetectabilityReport` class provides a comprehensive way to evaluate your model by generating detailed reports and visualizations. The outputs include:

1. **A PDF Report**: This includes evaluation metrics and plots.
2. **A CSV File**: Contains the model’s predictions.
3. **Independent Image Files**: Visualizations are saved as separate image files.

To generate a report, provide the following parameters to the `DetectabilityReport` class:

- **targets**: The true labels for the dataset, which are used to assess the model’s performance.
- **predictions**: The model’s output predictions for the dataset, which will be compared against the true labels.
- **input_data_df**: The DataFrame containing the input data used for generating predictions.
- **output_path**: The directory path where the generated reports, images, and CSV file will be saved.
- **history**: The training history object (e.g., containing metrics from training) if available. Set this to `None` if not applicable, such as when the report is generated for predictions without training.
- **rank_by_prot**: A boolean indicating whether to rank peptides based on their associated proteins (`True` or `False`). Defaults to `False`.
- **threshold**: The classification threshold used to adjust the decision boundary for predictions. By default, this is set to `None`, meaning no specific threshold is applied.
- **name_of_dataset**: The name of the dataset used for generating predictions, which will be included in the report to provide context.
- **name_of_model**: The name of the model used to generate the predictions, which will be specified in the report for reference.
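
To make the role of `threshold` more concrete, here is one plausible way a four-class output could be reduced to a binary flyer/non-flyer call. This is a sketch under stated assumptions and not necessarily how `DetectabilityReport` computes it: the flyer probability is taken as the summed probability of classes 1 to 3, and `binarize` is a hypothetical helper.

```python
import numpy as np

def binarize(probabilities, threshold=None):
    """Return 1 (flyer) or 0 (non-flyer) per row of class probabilities.

    Assumption: column 0 is Non-Flyer and columns 1-3 are the flyer classes.
    Without a threshold, a peptide counts as a flyer when its most probable class is not 0.
    """
    probabilities = np.asarray(probabilities)
    if threshold is None:
        return (probabilities.argmax(axis=1) != 0).astype(int)
    flyer_probability = probabilities[:, 1:].sum(axis=1)
    return (flyer_probability >= threshold).astype(int)

# Made-up probabilities for two peptides
toy = np.array([[0.45, 0.25, 0.20, 0.10],
                [0.20, 0.30, 0.30, 0.20]])
print(binarize(toy).tolist())                 # [0, 1]
print(binarize(toy, threshold=0.5).tolist())  # [1, 1]
```

In this toy example the first peptide changes from non-flyer to flyer once a threshold of 0.5 is applied, because its combined flyer probability (0.55) exceeds the threshold even though Non-Flyer is its single most probable class.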

In [None]:
# Since the detectability report expects the true labels in one-hot encoded format, we expand them here.

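# use the full class set so the one-hot width matches the model output even if a class is absent from the test split
num_classes = len(CLASSES_LABELS)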
test_targets_one_hot = np.eye(num_classes)[test_targets]
test_targets_one_hot.shape, len(test_targets)
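
The `np.eye(n)[indices]` idiom used above selects one row of the identity matrix per label. A tiny worked example with made-up labels (the `toy_` names are for illustration only):

```python
import numpy as np

toy_labels = [0, 3, 1]   # integer class labels, made up for illustration
num_toy_classes = 4      # Non-Flyer, Weak Flyer, Intermediate Flyer, Strong Flyer

# Row i of the identity matrix is the one-hot vector for class i.
toy_one_hot = np.eye(num_toy_classes)[toy_labels]
print(toy_one_hot)

# argmax along the class axis recovers the original integer labels.
print(toy_one_hot.argmax(axis=1).tolist())  # [0, 3, 1]
```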

In [None]:
report = DetectabilityReport(targets = test_targets_one_hot, 
                             predictions = predictions, 
                             input_data_df = test_data_df, 
                             output_path = "./output/report_on_Sinitcyn_2000_proteins_test_set_labeled", 
                             history = None, 
                             rank_by_prot = True,
                             threshold = None,
                             name_of_dataset = 'Sinitcyn 2000 proteins test set',
                             name_of_model = 'Fine-tuned model (Original)')

#### Predictions report

In [None]:
results_df = report.detectability_report_table
results_df

### Generating Evaluation Plots with `DetectabilityReport`

The `DetectabilityReport` class enables you to generate a range of plots to visualize and evaluate model performance. It offers a comprehensive suite of visualizations to help you interpret the results of your model's predictions. Here’s how to use it:

#### ROC curve (Binary)

In [None]:
report.plot_roc_curve_binary()

#### Confusion matrix (Binary)

In [None]:
report.plot_confusion_matrix_binary()

#### ROC curve (Multi-class)

In [None]:
report.plot_roc_curve()

#### Confusion matrix (Multi-class)

In [None]:
report.plot_confusion_matrix_multiclass()

#### Heatmap of Average Error Between Actual and Predicted Classes

In [None]:
report.plot_heatmap_prediction_prob_error()

We can also produce a complete evaluation report with all the relevant plots in one PDF file by calling the `generate_report` method.

In [None]:
report.generate_report()

# Unlabeled Data

## 1. Load data

To predict on unlabeled data, follow the same workflow as described earlier (refer to the "Load Data" section for labeled data). Specifically, create an instance of the `DetectabilityDataset` class using your unlabeled data. The configuration below ensures that the entire dataset is treated as test data without generating additional splits (i.e., training and validation sets).

In [None]:
# load the dataset from huggingface for prediction

from datasets import load_dataset, DatasetDict

# pick one of the available datasets on the HuggingFace Hub
# Collection: https://huggingface.co/collections/Wilhelmlab/detectability-datasets-671e76fb77035878c50a9c1d

hf_data_name = "Wilhelmlab/detectability-sinitcyn"
#hf_data_name = "Wilhelmlab/detectability-wang"

hf_dataset_split = load_dataset(hf_data_name, split="test")
hf_dataset = DatasetDict({"test": hf_dataset_split})
hf_dataset

In [None]:
# simulate that the class labels are not there (insert None), but we keep the column since it is needed for the dataset class

hf_dataset = hf_dataset.map(lambda example: {**example, 'Classes': None})

In [None]:

max_pep_length = 40
BATCH_SIZE = 128
       
test_data_unlabeled = DetectabilityDataset(data_source=hf_dataset,
                                           data_format='hf',
                                           max_seq_len=max_pep_length,
                                           label_column='Classes',
                                           sequence_column="Sequences",
                                           dataset_columns_to_keep=['Proteins'],
                                           batch_size=BATCH_SIZE,
                                           with_termini=False,
                                           alphabet=aa_to_int_dict)

In [None]:
test_data_unlabeled["test"]["Classes"][0:5]

## 2. Predicting and reporting

We use the previously loaded model to generate predictions on the dataset. If labels are not available, you can use the `predictions_report` class to produce a clear and organized report based on these predictions. Note that `predictions_report` is specifically designed for scenarios where labels are not present.

##### Generate Predictions on Test Data Using `model.predict`

To obtain predictions for your test data, use the Keras `model.predict` method. Simply pass your test dataset to this method, and it will return the model's predictions.

In [None]:
predictions_unlabeled = model.predict(test_data_unlabeled.tensor_test_data)

To generate reports we obtain the data for the specific dataset split. This can be achieved using the `DetectabilityDataset` class directly.

In [None]:
# The dataframe needed for the report

test_data_unlabeled_df = pd.DataFrame(
    {
        "Sequences": test_data_unlabeled["test"]["_parsed_sequence"], # get the raw parsed sequences
        "Proteins": test_data_unlabeled["test"]["Proteins"] # get the Proteins column from the dataset object
    }
)

test_data_unlabeled_df.Sequences = test_data_unlabeled_df.Sequences.apply(lambda x: "".join(x)) # join the sequences since they are a list of string amino acids.
test_data_unlabeled_df.head(5)

#### Generate a report using the `predictions_report` class by providing the following parameters:

- **predictions**: The model's output predictions for the dataset.
- **input_data_df**: The DataFrame containing the input data used for generating the predictions.
- **output_path**: The path where the generated report (in CSV format) will be saved.
- **rank_by_prot**: A boolean indicating whether to rank peptides based on their associated proteins (`True` or `False`). Defaults to `False`.
- **threshold**: The classification threshold used to adjust the decision boundary for predictions. By default, this is set to `None`, meaning no specific threshold is applied.

The `predictions_report` class processes the model’s predictions and generates a comprehensive CSV report with the results, including any specified settings, which facilitates evaluation and interpretation of the predictions.
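
As an illustration of what ranking peptides within their proteins could look like, the sketch below uses pandas directly. The column names `Proteins` and `Sequences` follow the DataFrame built above. The `flyer_score` column, its values, and the ranking rule are assumptions for this example and do not describe the internals of `predictions_report`.

```python
import pandas as pd

# Toy data for illustration; flyer_score is a made-up per-peptide score.
toy_df = pd.DataFrame({
    "Proteins": ["P1", "P1", "P2", "P2", "P2"],
    "Sequences": ["AAK", "CDEK", "FGHR", "IKLR", "MNPK"],
    "flyer_score": [0.2, 0.9, 0.5, 0.7, 0.1],
})

# Rank peptides within each protein, highest score first (rank 1 = highest score).
toy_df["rank_in_protein"] = (
    toy_df.groupby("Proteins")["flyer_score"]
    .rank(ascending=False, method="first")
    .astype(int)
)
ranked = toy_df.sort_values(["Proteins", "rank_in_protein"]).reset_index(drop=True)
print(ranked)
```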

In [None]:
new_predictions_report = predictions_report(predictions = predictions_unlabeled, 
                                            input_data_df = test_data_unlabeled_df, 
                                            output_path = "./output/report_on_Sinitcyn_2000_proteins_test_set_unlabeled", 
                                            rank_by_prot = True,
                                            threshold = None)

In [None]:
results_unlabeled_df = new_predictions_report.predictions_report
results_unlabeled_df