# Inference and Evaluation with the Finetuned Models

In this notebook, we demonstrate how one can evaluate various finetuned models on both the promoter and phage test datasets.

The main steps are:
  * Preparing the models and datasets
  * Setting up the parameters for the evaluation
  * Running inference and collecting the results for each dataset


## Setting Up the Environment

While ProkBERT can operate on CPUs, leveraging GPUs significantly accelerates the process. Google Colab offers free GPU usage (subject to time and memory limits), making it an ideal platform for trying and experimenting with ProkBERT models.


### Enabling and Testing the GPU (if you are using Google Colab)

First, you'll need to enable GPUs for the notebook:

- Navigate to Edit→Notebook Settings
- Select GPU from the Hardware Accelerator drop-down



### Installing the Packages
First, we'll install the ProkBERT package directly from its GitHub repository:


In [None]:
# ProkBERT
!pip install git+https://github.com/nbrg-ppcu/prokbert

# Imports
import torch
import numpy as np
from transformers import TrainingArguments, Trainer
from datasets import load_dataset
from prokbert.models import BertForBinaryClassificationWithPooling
from prokbert.prok_datasets import ProkBERTTrainingDatasetPT
from prokbert.config_utils import ProkBERTConfig
from prokbert.prokbert_tokenizer import ProkBERTTokenizer
from prokbert.training_utils import (
    compute_metrics_eval_prediction,
    get_torch_data_from_segmentdb_classification,
    evaluate_binary_classification_bert_build_pred_results,
    evaluate_binary_classification_bert,
)
from os.path import join
from torch.utils.data import Dataset, DataLoader
import pandas as pd


Next, we'll confirm that PyTorch can access the GPU:

In [None]:
import torch
# Check whether CUDA (GPU support) is available
if torch.cuda.is_available():
    device_name = torch.cuda.get_device_name(0)
    print(f'Found GPU at: {device_name}')
else:
    # ProkBERT also runs on CPU, only slower, so warn instead of aborting
    print('GPU device not found; falling back to CPU (inference will be slower)')

## Loading the Finetuned Model for Promoter Identification

Next, we will download the finetuned model for promoter identification. Our binary classification model utilizes the base model enhanced with a pooling layer, which integrates information across all nucleotides in the sequence. This approach leads to better performance compared to traditional Hugging Face sentence classification models, which only consider the embedding of the special starting token [CLS].
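To illustrate the difference between the two strategies, here is a minimal sketch in plain Python. It contrasts taking only the first ([CLS]) embedding with averaging over all non-padding positions. The helper names `cls_embedding` and `masked_mean_pool` are hypothetical, and the unweighted mean is an assumption chosen for simplicity; the pooling layer inside `BertForBinaryClassificationWithPooling` may weight positions differently.

```python
def cls_embedding(token_embeddings):
    """Sequence representation taken from the first ([CLS]) position only."""
    return token_embeddings[0]


def masked_mean_pool(token_embeddings, attention_mask):
    """Average the embeddings of all non-padding positions (unweighted; an assumption)."""
    kept = [vec for vec, keep in zip(token_embeddings, attention_mask) if keep]
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


# Toy example: 4 positions with 2-dimensional embeddings; the last position is padding
embeddings = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 1, 0]

print(cls_embedding(embeddings))           # [1.0, 0.0]
print(masked_mean_pool(embeddings, mask))  # [3.0, 2.0]
```

The pooled vector reflects every real position in the sequence, whereas the [CLS] vector depends on a single position.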

In addition, we create the corresponding tokenizer with matching parameters. Here, we use the 'mini' model, which was trained with a k-mer size of 6 and a shift of 1. Other finetuned models, such as 'mini-c' and 'mini-long', are also available; if you wish to try them, adjust the tokenizer parameters (`kmer` and `shift`) to match the chosen model.
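To make the `kmer` and `shift` parameters concrete, the following minimal sketch splits a sequence into overlapping k-mers. The helper `kmerize` is hypothetical and written only for illustration; it is not part of the ProkBERT API, and the real `ProkBERTTokenizer` also handles vocabulary lookup and special tokens, which are omitted here.

```python
def kmerize(sequence, kmer=6, shift=1):
    """Split a sequence into k-mers of length `kmer`, moving `shift` bases each step."""
    last_start = len(sequence) - kmer
    return [sequence[i:i + kmer] for i in range(0, last_start + 1, shift)]


tokens = kmerize("ATGCATGC", kmer=6, shift=1)
print(tokens)  # ['ATGCAT', 'TGCATG', 'GCATGC']
```

With `shift=2`, the same sequence yields only every second k-mer, which is why the tokenizer parameters have to match the model variant.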

Then, we move the model to the GPU, if available, and set the model to 'evaluation' mode (we only run forward passes).


In [None]:

finetuned_model = "neuralbioinfo/prokbert-mini-promoter"
kmer = 6
shift= 1

tok_params = {'kmer' : kmer,
             'shift' : shift}
tokenizer = ProkBERTTokenizer(tokenization_params=tok_params)
model = BertForBinaryClassificationWithPooling.from_pretrained(finetuned_model)

# Get the device. 
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")
model = model.to(device)
_=model.eval()


## Loading the Dataset

In this section, we will evaluate the test sets of the promoter dataset. We have two different types of tests: a sigma70 test (known E. coli promoters) referred to as the 'test_sigma70' set, and a multispecies dataset, which consists of promoters from various species as well as CDS sequences (non-promoters) and randomly generated sequences. For a more detailed description, see: [Bacterial Promoters Dataset](https://huggingface.co/datasets/neuralbioinfo/bacterial_promoters)

Here, we convert the Hugging Face datasets into pandas DataFrames. If your own data has no ground truth labels, add a placeholder label column so that the downstream functions still receive the expected input.


In [None]:
# Loading the predefined dataset
dataset = load_dataset("neuralbioinfo/bacterial_promoters")

test_sigma70_set = dataset["test_sigma70"]
multispecies_set = dataset["test_multispecies"]

test_sigma70_db = test_sigma70_set.to_pandas()
test_ms_db = multispecies_set.to_pandas()



### Tokenization and PyTorch Dataset Creation

Now that we have the tokenizer, the dataset can be prepared for evaluation. Here, we prepare the data for both sigma70 and multispecies datasets.

#### Creating Datasets

We tokenize the sequences and wrap them in PyTorch datasets that the model can consume. The following code converts `test_ms_db` (multispecies dataset) and `test_sigma70_db` (sigma70 dataset) into PyTorch datasets.


In [None]:
# Tokenize the multispecies test set
[X_test, y_test, torchdb_test] = get_torch_data_from_segmentdb_classification(tokenizer, test_ms_db)
print('Processing the sigma70 test set!')
# Tokenize the sigma70 test set
[X_val, y_val, torchdb_val] = get_torch_data_from_segmentdb_classification(tokenizer, test_sigma70_db)

test_ds = ProkBERTTrainingDatasetPT(X_test, y_test, AddAttentionMask=True)
val_ds = ProkBERTTrainingDatasetPT(X_val, y_val, AddAttentionMask=True)


## Performing Inference and Evaluation

Now that both the model and the data are prepared, we will perform inference and process the dataset.

For simplicity, this demonstration runs on a single GPU. An important parameter to consider is the batch size, which is set to 4096 in the code below. Lower this value if your GPU runs out of memory, or raise it if memory allows.
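As a rough guide to how the batch size affects the number of forward passes, the sketch below computes the batch count for a few batch sizes. It mirrors what `len(dataloader)` reports for a PyTorch `DataLoader` with the default `drop_last=False`. The helper `number_of_batches` and the sample count of 10,000 are made up for illustration.

```python
import math


def number_of_batches(n_samples, batch_size):
    """Batches needed to cover n_samples when the last batch may be smaller."""
    return math.ceil(n_samples / batch_size)


# Hypothetical dataset of 10,000 samples
for batch_size in (4096, 1024, 256):
    print(batch_size, number_of_batches(10_000, batch_size))
```

Smaller batches need less GPU memory per step but require more steps to cover the same dataset.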

In real-world scenarios, especially when evaluating large datasets (exceeding 1,000,000 samples), we recommend using PyTorch Distributed Data Parallel (DDP) and compiled models for better throughput.

The prediction results are collected into a list of NumPy arrays, each containing the predicted label, the ground truth label, and the logits for each class (promoter vs. non-promoter). The evaluation metrics, including AUC, Matthews Correlation Coefficient (MCC), and accuracy, are summarized and returned as a dictionary.
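For readers unfamiliar with these metrics, here is a minimal sketch of how accuracy and MCC follow from the confusion matrix of a binary classifier. It uses the standard textbook definitions and made-up labels. The helper names are hypothetical, and this is not the implementation used by `evaluate_binary_classification_bert`.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Fraction of all samples that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient; set to 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Made-up labels for illustration
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

counts = confusion_counts(y_true, y_pred)
print(counts)                             # (3, 3, 1, 1)
print(accuracy(*counts), mcc(*counts))    # 0.75 0.5
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, so it stays informative when the classes are imbalanced.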

Below is the code to perform batch-wise inference and accumulate the prediction results:


In [None]:
batch_size = 4096
# Evaluate the multispecies test set; pass val_ds instead to evaluate the sigma70 set
dataloader = DataLoader(test_ds, batch_size=batch_size)

# Calculate the total number of batches
total_batches = len(dataloader)

pred_results_ls = []
processed_batches = 0

for batch in dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}  # Move batch data to the appropriate device
    with torch.no_grad():  # Inference mode: no gradient computation
        outputs = model(**batch)
    pred_results = evaluate_binary_classification_bert_build_pred_results(outputs['logits'], batch['labels'])
    pred_results_ls.append(pred_results)  # Collecting prediction results
    
    processed_batches += 1  # Increment the count of processed batches
    percent_complete = (processed_batches / total_batches) * 100  # Calculate the percentage of completion
    print(f'Batch {processed_batches}/{total_batches} processed, {percent_complete:.2f}% complete.')  # Print the progress

# Combine all batch results into one array for evaluation
pred_results = np.concatenate(pred_results_ls)
# Calculate and retrieve evaluation metrics
eval_results, eval_results_ls = evaluate_binary_classification_bert(pred_results)

# cleanup
del model
del batch
del dataset
torch.cuda.empty_cache()


# Convert the results dictionary to a DataFrame
results_df = pd.DataFrame([eval_results])
# Set more meaningful index name
results_df.index = ['Metrics']
results_df


# Inference with the Phage Models

In this section, we will walk through the evaluation of the phage test dataset, following a similar procedure to the previous example. The workflow includes:
  * Preparing the model
  * Preparing the dataset
  * Evaluating the test set and measuring performance metrics


## Preparing the Model
This is a simple model, finetuned using the `MegatronBertForSequenceClassification` class.


In [None]:
from transformers import MegatronBertForSequenceClassification

finetuned_model = "neuralbioinfo/prokbert-mini-phage"
kmer = 6
shift= 1

tok_params = {'kmer' : kmer,
             'shift' : shift}
tokenizer = ProkBERTTokenizer(tokenization_params=tok_params)
model = MegatronBertForSequenceClassification.from_pretrained(finetuned_model)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")
model = model.to(device)
_=model.eval()


## Preparing the Phage Dataset for Evaluation

In this section, we load and prepare the phage dataset for evaluation. The dataset, named "phage-test-small", is a smaller subset suitable for testing. We will demonstrate how to load it and convert it to a pandas DataFrame.

First, we load the dataset and set the batch size for the DataLoader, which controls how many samples will be processed simultaneously. Then, we convert the loaded dataset to a pandas DataFrame for easier manipulation and processing. Finally, we prepare the data for PyTorch by tokenizing and converting it into a format suitable for our model.


In [None]:
dataset = load_dataset("neuralbioinfo/phage-test-small")
batch_size = 64

# Loading and converting the dataset
test_set = dataset["sample_test_L1024"].to_pandas()
print('Processing the database!')
[X, y, torchdb] = get_torch_data_from_segmentdb_classification(tokenizer, test_set)
# Use a separate name so the Hugging Face dataset above is not overwritten
phage_ds = ProkBERTTrainingDatasetPT(X, y, AddAttentionMask=True)
dataloader = DataLoader(phage_ds, batch_size=batch_size)


## Inference and Evaluation of the Phage Dataset

In this part of the process, we will perform inference on the phage dataset using the prepared DataLoader. The objective is to process each batch from the dataset, make predictions using the finetuned model, and then compile these predictions for evaluation.

We initialize an empty list, `pred_results_ls`, to store the prediction results from each batch. We iterate over each batch in the DataLoader, moving the batch data to the same device as the model (GPU or CPU). With gradient computation disabled (which speeds up inference and reduces memory usage), we pass the batch through the model to obtain the output logits.

For each batch's output, we evaluate the binary classification results using the `evaluate_binary_classification_bert_build_pred_results` function, which processes the model's logits and the actual labels to generate prediction results. These results are then appended to our list.
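How `evaluate_binary_classification_bert_build_pred_results` turns logits into labels is not shown in this notebook. A common approach, sketched below in plain Python, is to take the index of the largest logit as the predicted class and, if probabilities are needed, apply a softmax. The helper names and the class ordering (0 = negative, 1 = positive) are assumptions for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one row of logits."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_label(logits):
    """Index of the largest logit (assumed: 0 = negative class, 1 = positive class)."""
    return max(range(len(logits)), key=lambda i: logits[i])


# Made-up logits for a batch of three sequences
batch_logits = [[2.0, -1.0], [0.5, 0.5], [-0.3, 1.7]]
for row in batch_logits:
    print(predict_label(row), [round(p, 3) for p in softmax(row)])
```

Because softmax preserves the ordering of the logits, the predicted label is the same whether it is taken from the logits or from the probabilities.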

After processing all batches, we concatenate the list of arrays into a single NumPy array. This aggregated result allows us to evaluate the model on the entire test set. We use the `evaluate_binary_classification_bert` function to calculate evaluation metrics such as accuracy, precision, recall, and F1 score from the compiled prediction results. The `eval_results` dictionary stores these metrics for review and analysis.
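As a reference for interpreting the reported values, the sketch below computes precision, recall, and F1 score from binary labels using their standard definitions. The helper `precision_recall_f1` and the labels are made up for illustration; the values returned by `evaluate_binary_classification_bert` come from the library's own implementation.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 for binary labels (1 = positive class)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Guard against division by zero when a class is never predicted or never present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Made-up labels: 5 positives, of which 3 are found, plus 1 false alarm
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 0, 0]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Precision penalizes false alarms and recall penalizes missed positives; F1 is their harmonic mean.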


In [None]:
# Calculate the total number of batches
total_batches = len(dataloader)

pred_results_ls = []
processed_batches = 0

for batch in dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}  # Move batch data to the appropriate device
    with torch.no_grad():  # Inference mode: no gradient computation
        outputs = model(**batch)
    pred_results = evaluate_binary_classification_bert_build_pred_results(outputs['logits'], batch['labels'])
    pred_results_ls.append(pred_results)  # Collecting prediction results
    
    processed_batches += 1  # Increment the count of processed batches
    percent_complete = (processed_batches / total_batches) * 100  # Calculate the percentage of completion
    print(f'Batch {processed_batches}/{total_batches} processed, {percent_complete:.2f}% complete.')  # Print the progress

# Combine all batch results into one array for evaluation
pred_results = np.concatenate(pred_results_ls)
# Calculate and retrieve evaluation metrics
eval_results, eval_results_ls = evaluate_binary_classification_bert(pred_results)

# Convert the results dictionary to a DataFrame
results_df = pd.DataFrame([eval_results])
# Set more meaningful index name
results_df.index = ['Metrics']
results_df


## Final Remarks

In this evaluation, we ran inference with the finetuned phage model and assessed its performance on the test dataset. Key metrics such as accuracy, F1 score, Matthews Correlation Coefficient (MCC), and Area Under the Curve (AUC) provide a comprehensive view of the model's ability to distinguish between classes.

Additionally, while these results are promising, further validation and testing on independent datasets are recommended to ensure the model's generalizability and robustness. This could involve cross-validation, additional external datasets, or real-world applications.

Finally, this notebook serves as a foundation for further exploration and refinement. Researchers are encouraged to experiment with different models, parameters, and datasets to enhance understanding and performance. The ultimate goal is to leverage these computational tools to advance our knowledge in microbiology and related fields.
