# Pfam Protein Sequence Classification with TensorFlow and Keras

Adapted from Saleh Alkhalifa. [Machine Learning in Biotechnology and Life Sciences](https://github.com/PacktPublishing/Machine-Learning-in-Biotechnology-and-Life-Sciences).

## Overview

This tutorial demonstrates how to develop a protein sequence classification model using deep learning. We will classify protein sequences based on their known family accession using the Pfam dataset. The model will use TensorFlow and Keras to process amino acid sequences and predict their protein family classifications.

### Pfam

The Pfam dataset consists of several columns, as follows:

- *Family_id*: The name of the family that the sequence belongs to (for example, filamin).
- *Family Accession*: The class label that our model will aim to predict.
- *Sequence*: The amino acid sequence we will use as input for our model.

Reference: J. Mistry, S. Chuguransky, L. Williams, M. Qureshi, G.A. Salazar, E.L.L. Sonnhammer, S.C.E. Tosatto, L. Paladin, S. Raj, L.J. Richardson, R.D. Finn, A. Bateman. Pfam: The protein families database in 2021. Nucleic Acids Research (2020). doi: 10.1093/nar/gkaa913

### TensorFlow and Keras

- **TensorFlow** is an end-to-end open source platform for machine learning. It has a comprehensive, flexible ecosystem of tools, libraries, and community resources.
- **Keras** is a deep learning API written in Python, running on top of the machine learning platform TensorFlow.


## Learning Objectives

- Learn how to preprocess protein sequence data for deep learning
    - Encode amino acid sequences
    - Pad sequences to uniform length
    - Convert labels to categorical format
- Build and train a deep learning model for sequence classification using TensorFlow/Keras
    - Implement embedding layers
    - Use bidirectional LSTM architecture
    - Apply dropout for regularization
- Evaluate model performance using standard metrics
    - Analyze accuracy and loss curves
    - Interpret classification reports
    - Visualize confusion matrices

### Tasks to be completed

- Download and prepare Pfam dataset
- Preprocess protein sequences and labels
- Build and train deep learning model
- Evaluate model performance
- Generate predictions and visualize results

## Prerequisites

- A working Python environment and familiarity with Python
- Basic understanding of machine learning concepts
- Familiarity with pandas and numpy libraries
- Knowledge of basic statistical concepts

## Get Started

- Select the `conda_tensorflow2_p310` kernel on your SageMaker notebook instance.
- Import the necessary libraries and download the needed data.

### Import necessary libraries

In [None]:
# Import the pyplot module from matplotlib for plotting.
import matplotlib.pyplot as plt
# Import the numpy library for numerical operations, often used for array manipulations.
import numpy as np
# Import the pandas library for data manipulation and analysis, especially for DataFrames.
import pandas as pd
# Import the seaborn library for statistical data visualization, built on top of matplotlib.
import seaborn as sns
# Import specific layers from keras.layers for building neural networks.
from keras.layers import (
    LSTM, # Import LSTM layer for Long Short-Term Memory networks.
    Bidirectional, # Import Bidirectional layer for bidirectional processing in RNNs.
    Conv1D, # Import Conv1D layer for 1D convolutional neural networks.
    Dense, # Import Dense layer for fully connected neural networks.
    Dropout, # Import Dropout layer for regularization to prevent overfitting.
    Embedding, # Import Embedding layer for creating word embeddings.
    Flatten, # Import Flatten layer to flatten the input tensor.
    Input, # Import Input layer to instantiate a Keras tensor.
    MaxPooling1D, # Import MaxPooling1D layer for 1D max pooling.
)
# Import the Model class and Sequential class from keras.models to define neural network models.
from keras.models import Model, Sequential
# Import the pad_sequences function from keras.preprocessing.sequence for padding sequences to the same length.
from keras.preprocessing.sequence import pad_sequences
# Import the l2 regularizer from keras.regularizers for applying L2 regularization to layers.
from keras.regularizers import l2
# Import classification_report and confusion_matrix from sklearn.metrics for model evaluation.
from sklearn.metrics import classification_report, confusion_matrix
# Import train_test_split from sklearn.model_selection for splitting data into training and testing sets.
from sklearn.model_selection import train_test_split
# Import LabelEncoder from sklearn.preprocessing for encoding categorical labels into numerical form.
from sklearn.preprocessing import LabelEncoder

# Import the tensorflow library, the main deep learning framework.
import tensorflow as tf
# Import EarlyStopping callback from tensorflow.keras.callbacks to stop training early when validation loss stops improving.
from tensorflow.keras.callbacks import EarlyStopping
# Import to_categorical from tensorflow.keras.utils for one-hot encoding of categorical variables.
from tensorflow.keras.utils import to_categorical
# Sequential and the Embedding, Bidirectional, LSTM, Dropout, and Dense layers are already imported from keras above.

# Set the default style for seaborn plots to "darkgrid" for better visualization.
sns.set_style("darkgrid")

# Import the os module to interact with the operating system, used here for environment variables.
import os
# Hide all GPUs from TensorFlow so that the model runs on the CPU.
# This setting is unconditional; comment out the next line to use an available GPU.
os.environ['CUDA_VISIBLE_DEVICES'] = '-1'

# Print the number of GPUs visible to TensorFlow (this should be 0 with the setting above).
print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))

### Download dataset

Here, we download the data directly into a pandas dataframe.

In [None]:
# Define the base URL for the dataset files hosted on GitHub.
URL = "https://raw.githubusercontent.com/PacktPublishing/Machine-Learning-in-Biotechnology-and-Life-Sciences/main/datasets/dataset_pfam"

# Initialize an empty list called 'files' to store DataFrames.
files = []
# Loop 8 times to read in 8 different CSV files.
for i in range(8):
    # Read each CSV file from the specified URL pattern into a pandas DataFrame.
    # The filename is constructed by appending 'dataset_pfam_seq_sd' and the loop index (i+1), followed by '.csv'.
    # 'index_col=None' prevents pandas from using the first column as index.
    # 'header=0' sets the first row as the header of the DataFrame.
    df = pd.read_csv(f"{URL}/dataset_pfam_seq_sd{i+1}.csv", index_col=None, header=0)
    # Append the DataFrame read from the CSV file to the 'files' list.
    files.append(df)

# Concatenate all DataFrames stored in the 'files' list into a single DataFrame 'df'.
# 'axis=0' concatenates along rows (vertically).
# 'ignore_index=True' resets the index of the resulting DataFrame to a new sequential index.
df = pd.concat(files, axis=0, ignore_index=True)
# Print the shape (number of rows and columns) of the concatenated DataFrame 'df'.
df.shape

### Examine the data

Peek into the data.

In [None]:
df.head()

Check missing data.

In [None]:
df.isna().sum()

Get the 10 most abundant family IDs.

In [None]:
df["family_id"].value_counts().nlargest(10)

Get the 10 most abundant family accessions.

In [None]:
df["family_accession"].value_counts().nlargest(10)

Plot the sequence length frequency distribution.

In [None]:
sns.displot(df["sequence"].str.len(), bins=75, height=4, aspect=2)

Get mean sequence length.

In [None]:
df["sequence"].str.len().mean()

Get min sequence length.

In [None]:
df["sequence"].str.len().min()

Get max sequence length.

In [None]:
df["sequence"].str.len().max()

Get median sequence length.

In [None]:
df["sequence"].str.len().median()

Keep only the family accessions that have more than 1,200 sequences.

In [None]:
df_filt = df.groupby("family_accession").filter(lambda x: len(x) > 1200)
df_filt

## Create a balanced dataset

In [None]:
df_bal = df_filt.groupby("family_accession").apply(lambda x: x.sample(1200))
df_bal.family_accession.value_counts()

In [None]:
# Peek into the balanced dataset
df_bal.head()
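
The filter-then-sample pattern used above can be illustrated on a toy table. This is only a sketch: the family names, the threshold of 2, and the `random_state` argument are illustrative assumptions, and it uses `DataFrameGroupBy.sample` (available in pandas 1.1 and later) rather than the `apply`-based call in the tutorial.

```python
import pandas as pd

# Toy data (hypothetical): family A has 5 sequences, B has 3, C has 1.
toy = pd.DataFrame({
    "family_accession": ["A"] * 5 + ["B"] * 3 + ["C"],
    "sequence": [f"SEQ{i}" for i in range(9)],
})

# Keep only families with more than 2 members (C is dropped).
toy_filt = toy.groupby("family_accession").filter(lambda g: len(g) > 2)

# Draw the same number of rows from every remaining family.
toy_bal = toy_filt.groupby("family_accession").sample(n=3, random_state=0)

counts = toy_bal["family_accession"].value_counts().to_dict()
print(counts)
```

Fixing `random_state` makes the sampled subset reproducible between runs, which the tutorial's call does not do.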

## Prepare input dataframe for modeling

`reset_index` resets the index of a DataFrame to the default integer index (0 to number of rows minus 1), which also flattens a multi-level index. By default the original index is converted into one or more columns; here we pass `drop=True`, so the original index is discarded instead.

In [None]:
df_red = df_bal[["family_accession", "sequence"]].reset_index(drop=True)
df_red.head()
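
To make the effect of `drop=True` concrete, here is a small sketch on a hypothetical two-level index. The level names `family` and `row` and the toy values are made up for illustration; they are not taken from the Pfam data.

```python
import pandas as pd

# Hypothetical frame with a two-level index, similar in spirit to a grouped result.
idx = pd.MultiIndex.from_tuples(
    [("PF1", 7), ("PF1", 2), ("PF2", 5)], names=["family", "row"]
)
toy = pd.DataFrame({"sequence": ["ACD", "EFG", "HIK"]}, index=idx)

# Default behaviour: the old index levels become ordinary columns.
kept = toy.reset_index()

# drop=True: the old index is discarded and replaced by 0..n-1.
dropped = toy.reset_index(drop=True)

print(list(kept.columns))
print(list(dropped.columns), list(dropped.index))
```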

Compute the number of unique classes.

In [None]:
num_classes = len(df_red.family_accession.value_counts())
num_classes

Get the number of sequences per Pfam family accession.

In [None]:
df_red.family_accession.value_counts()

### Make train and test datasets

Split the data into a 75% training set (`X_train`) and a 25% held-out set (`X_Test`). Then split `X_Test` evenly: 50% for validation (`X_val`) and 50% for testing (`X_test`).

In [None]:
# Splits the DataFrame 'df_red' into training set 'X_train' and a temporary test set 'X_Test', allocating 25% of the data to the test set.
X_train, X_Test = train_test_split(df_red, test_size=0.25)
# Splits the temporary test set 'X_Test' further into validation set 'X_val' and final test set 'X_test', allocating 50% of 'X_Test' to the final test set.
X_val, X_test = train_test_split(X_Test, test_size=0.50)
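
The resulting proportions (75% / 12.5% / 12.5%) can be checked on a toy frame. The sketch below is an illustration under stated assumptions: the data are synthetic, and the `stratify` and `random_state` arguments are additions that the tutorial's own split does not use. Stratifying keeps the class proportions the same in each subset, and a fixed seed makes the split reproducible.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame (hypothetical): 4 classes with 250 rows each.
toy = pd.DataFrame({
    "family_accession": [f"PF{i % 4}" for i in range(1000)],
    "sequence": ["ACDE"] * 1000,
})

# 75% train, 25% held out; stratify keeps the class mix equal in both parts.
train, holdout = train_test_split(
    toy, test_size=0.25, stratify=toy["family_accession"], random_state=42
)
# Split the held-out part evenly into validation and test sets.
val, test = train_test_split(
    holdout, test_size=0.50, stratify=holdout["family_accession"], random_state=42
)

print(len(train), len(val), len(test))  # 750 125 125
```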

Get the train, test, and validation dataset sizes.

In [None]:
print(X_train.shape)
print(X_test.shape)
print(X_val.shape)

### Create amino acid sequence dictionary

In [None]:
aa_seq_dict = {
    "A": 1,
    "C": 2,
    "D": 3,
    "E": 4,
    "F": 5,
    "G": 6,
    "H": 7,
    "I": 8,
    "K": 9,
    "L": 10,
    "M": 11,
    "N": 12,
    "P": 13,
    "Q": 14,
    "R": 15,
    "S": 16,
    "T": 17,
    "V": 18,
    "W": 19,
    "Y": 20,
}

In [None]:
# Encode amino acid sequence using the dictionary above
# Define a function named 'aa_seq_encoder' that takes 'data' as input.
def aa_seq_encoder(data):
    # Initialize an empty list 'full_sequence_list' to store encoded sequences.
    full_sequence_list = []
    # Iterate over each sequence in the 'sequence' column of the input 'data' DataFrame.
    for i in data["sequence"].values:
        # Initialize an empty list 'row_sequence_list' for each individual sequence.
        row_sequence_list = []
        # Iterate over each amino acid 'j' in the current protein sequence 'i'.
        for j in i:
            # Look up the numerical encoding for amino acid 'j' in 'aa_seq_dict' and append it to 'row_sequence_list'. If not found, default to 0.
            row_sequence_list.append(aa_seq_dict.get(j, 0))
        # After processing all amino acids in a sequence, append the 'row_sequence_list' (now a NumPy array) to 'full_sequence_list'.
        full_sequence_list.append(np.array(row_sequence_list))
    # Return the 'full_sequence_list' containing encoded amino acid sequences as NumPy arrays.
    return full_sequence_list

# Encode the 'sequence' column in X_train DataFrame using the 'aa_seq_encoder' function and assign the result to 'X_train_encode'.
X_train_encode = aa_seq_encoder(X_train)
# Encode the 'sequence' column in X_val DataFrame using the 'aa_seq_encoder' function and assign the result to 'X_val_encode'.
X_val_encode = aa_seq_encoder(X_val)
# Encode the 'sequence' column in X_test DataFrame using the 'aa_seq_encoder' function and assign the result to 'X_test_encode'.
X_test_encode = aa_seq_encoder(X_test)
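
As a quick check of the encoding scheme, the following sketch rebuilds the same mapping from an ordered string of the 20 standard amino acids and encodes a short, made-up sequence. The names `AMINO_ACIDS`, `aa_to_int`, and `encode` are hypothetical helpers, not part of the tutorial code.

```python
import numpy as np

# The 20 standard amino acids, in the same alphabetical order as the dictionary above.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
aa_to_int = {aa: i for i, aa in enumerate(AMINO_ACIDS, start=1)}

def encode(sequence):
    """Map each residue to its integer code; unknown symbols map to 0."""
    return np.array([aa_to_int.get(aa, 0) for aa in sequence])

# "X" is not one of the 20 standard residues, so it is encoded as 0.
example = encode("ACDXY")
print(example)  # [ 1  2  3  0 20]
```

Note that code 0 is shared by unknown residues and by the padding value added in the next step, so the model cannot distinguish between the two.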

In [None]:
# Show an example encoded amino acid sequence
X_train_encode[0]

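Pad the sequences to a uniform length of 100. Shorter sequences are filled with zeros at the end, and longer sequences are truncated at the end.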

In [None]:
# Define the maximum length for padding sequences.
max_length = 100

# Pad the training sequences 'X_train_encode' to 'max_length', using 'post' padding and 'post' truncation.
X_train_padded = pad_sequences(
    X_train_encode, maxlen=max_length, padding="post", truncating="post"
)
# Pad the validation sequences 'X_val_encode' to 'max_length', using 'post' padding and 'post' truncation.
X_val_padded = pad_sequences(
    X_val_encode, maxlen=max_length, padding="post", truncating="post"
)
# Pad the test sequences 'X_test_encode' to 'max_length', using 'post' padding and 'post' truncation.
X_test_padded = pad_sequences(
    X_test_encode, maxlen=max_length, padding="post", truncating="post"
)
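
What post-padding and post-truncation do can be sketched with NumPy alone. The function `pad_post` below is a hypothetical stand-in written for illustration; it is not the Keras implementation of `pad_sequences`, only a minimal imitation of its behaviour for `padding="post"` and `truncating="post"`.

```python
import numpy as np

def pad_post(sequences, maxlen, value=0):
    """Post-pad and post-truncate integer sequences to a fixed length."""
    out = np.full((len(sequences), maxlen), value, dtype=np.int32)
    for row, seq in enumerate(sequences):
        # Post-truncation keeps the first maxlen items of a long sequence.
        trimmed = np.asarray(seq[:maxlen], dtype=np.int32)
        # Post-padding leaves the fill value at the end of a short sequence.
        out[row, : len(trimmed)] = trimmed
    return out

padded = pad_post([[1, 2, 3], [4, 5, 6, 7, 8, 9]], maxlen=5)
print(padded)
# [[1 2 3 0 0]
#  [4 5 6 7 8]]
```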

In [None]:
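# Use positional indexing: after the shuffled split, the index label 1 may not be present in X_train.
X_train.sequence.iloc[1]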

In [None]:
X_train_encode[1][:]

In [None]:
X_train_padded[1][:]

Encode the target labels as integers between `0` and `n_classes-1`.

In [None]:
# Initialize a LabelEncoder object to convert categorical labels to numerical values.
le = LabelEncoder()

# Fit the LabelEncoder on the 'family_accession' column of the training data (X_train) and transform it.
y_train_enc = le.fit_transform(X_train["family_accession"])
# Transform the 'family_accession' column of the validation data (X_val) using the fitted LabelEncoder.
y_val_enc = le.transform(X_val["family_accession"])
# Transform the 'family_accession' column of the test data (X_test) using the fitted LabelEncoder.
y_test_enc = le.transform(X_test["family_accession"])
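
The idea behind label encoding can be sketched with `np.unique`: the sorted unique labels play the role of the classes, and each label is replaced by its position in that sorted list. This is a NumPy analogy with made-up accessions, not the scikit-learn implementation.

```python
import numpy as np

# Hypothetical family accessions.
labels = np.array(["PF00005", "PF00001", "PF00005", "PF00003"])

# classes: sorted unique labels; codes: index of each label within classes.
classes, codes = np.unique(labels, return_inverse=True)
print(classes)  # ['PF00001' 'PF00003' 'PF00005']
print(codes)    # [2 0 2 1]

# Indexing classes with the codes maps the integers back to the original labels.
decoded = classes[codes]
```

Fitting on the training labels only, as the tutorial does, means that a label that appears only in the validation or test set would raise an error at `transform` time.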

In [None]:
X_train["family_accession"]

In [None]:
y_train_enc

In [None]:
num_classes = len(le.classes_)
num_classes

In [None]:
# Converts the training class labels (y_train_enc) into a binary class matrix using one-hot encoding.
y_train = to_categorical(y_train_enc)
# Converts the validation class labels (y_val_enc) into a binary class matrix using one-hot encoding.
y_val = to_categorical(y_val_enc)
# Converts the test class labels (y_test_enc) into a binary class matrix using one-hot encoding.
y_test = to_categorical(y_test_enc)
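
One-hot encoding can be reproduced with an identity matrix, which may help when reading the output of the cell above. The sketch uses made-up integer labels and is an illustration of the idea rather than the Keras code path.

```python
import numpy as np

codes = np.array([2, 0, 1, 2])  # hypothetical integer class labels
n_classes = 3

# Row i of the identity matrix is the one-hot vector for class i.
one_hot = np.eye(n_classes, dtype=np.float32)[codes]
print(one_hot)

# argmax along axis 1 recovers the integer labels.
recovered = one_hot.argmax(axis=1)
print(recovered)  # [2 0 1 2]
```

When `num_classes` is not passed, `to_categorical` infers the width from the largest label it sees, so passing `num_classes=num_classes` explicitly is a safer choice if a split could be missing the highest class index.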

In [None]:
y_train

## Build model

In [None]:
# Sequential groups a linear stack of layers into a tf.keras.Model.
# Sequential provides training and inference features on this model.
model = Sequential()

# EmbeddingLayer: Turns non-negative integers (indexes) into dense vectors of fixed size.
#  input_dim: Integer. Size of the vocabulary, i.e. maximum integer index + 1 (20 amino acids plus index 0 = 21).
#  output_dim: Integer. Dimension of the dense embedding (16 here).
model.add(Embedding(21, 16, name="EmbeddingLayer"))  # input_length is omitted; the LSTM accepts variable-length input

# Bidirectional wrapper for RNNs with 16 units of LSTM
model.add(Bidirectional(LSTM(16), name="BidirectionalLayer"))

# Applies Dropout to the input with 20% of the input units to drop.
model.add(Dropout(0.2, name="DropoutLayer"))

# Densely connected output layer with one unit per class (28 units in this tutorial) and softmax activation
model.add(Dense(num_classes, activation="softmax", name="DenseLayer"))

# Optimizer that implements the Adam algorithm
opt = tf.keras.optimizers.legacy.Adam(learning_rate=0.01)  # For TF 2.10+

# Configures the model for training, using Adam as the optimizer, 'categorical_crossentropy'
# as the loss function, and 'accuracy' as the evaluation metric
model.compile(optimizer=opt, loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()
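
The parameter counts reported by `model.summary()` can be cross-checked by hand. The arithmetic below is a sketch that assumes the default Keras settings (LSTM with one bias vector per gate, `Bidirectional` with concatenated outputs) and the layer sizes used in this tutorial.

```python
vocab_size, embed_dim = 21, 16  # Embedding(21, 16)
lstm_units = 16                 # LSTM(16), wrapped in Bidirectional
n_classes = 28                  # output units (28 in this tutorial)

# Embedding: one embed_dim-sized vector per vocabulary index.
embedding_params = vocab_size * embed_dim

# LSTM: 4 gates, each with input weights, recurrent weights, and a bias.
lstm_params = 4 * (lstm_units * (embed_dim + lstm_units) + lstm_units)
# Bidirectional: one forward and one backward copy of the LSTM.
bidirectional_params = 2 * lstm_params

# Dense: its input is the concatenation of both directions (2 * lstm_units).
dense_params = (2 * lstm_units) * n_classes + n_classes

total = embedding_params + bidirectional_params + dense_params
print(embedding_params, bidirectional_params, dense_params, total)
# 336 4224 924 5484
```

The dropout layer has no trainable parameters, so it does not contribute to the total.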

Stop training when the monitored metric (here, the validation loss) has stopped improving.

In [None]:
es = EarlyStopping(
    # monitor: Quantity to be monitored (the validation loss, which is also the default).
    monitor="val_loss",
    # patience: Number of epochs with no improvement after which training will be stopped.
    patience=3,
    # restore_best_weights: Whether to restore model weights from the epoch with the best value of the monitored quantity.
    restore_best_weights=True
)
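
The patience rule can be sketched in plain Python. The function below is a simplified, hypothetical model of the callback's logic: it ignores options such as `min_delta` and `baseline`, and the validation losses are made up.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (stop_epoch, best_epoch) for a list of per-epoch validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                # Training stops here; the weights of best_epoch would be restored.
                return epoch, best_epoch
    # Patience never ran out: training ends after the last epoch.
    return len(val_losses) - 1, best_epoch

stop_epoch, best_epoch = early_stopping_epoch(
    [1.00, 0.80, 0.70, 0.75, 0.72, 0.71, 0.60], patience=3
)
print(stop_epoch, best_epoch)  # 5 2
```

In this example the loss at epoch 6 would have been lower still, which shows the trade-off: a small `patience` saves time but can stop before a later improvement.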

(The following cell may take a few minutes.)

In [None]:
# Trains the model for up to 10 epochs (full passes over the training data);
# the early stopping callback may end training sooner.
history = model.fit(
    # Provides the input training data (features) as 'X_train_padded'.
    X_train_padded,
    # Provides the target training data (labels) as 'y_train'.
    y_train,
    # Specifies the number of training epochs (full passes through the training data) as 10.
    epochs=10,
    # Sets the batch size for training, processing 256 samples at a time.
    batch_size=256,
    # Provides validation data (features and labels) as a tuple for monitoring performance on a separate dataset during training.
    validation_data=(X_val_padded, y_val),
    # Passes the EarlyStopping callback 'es' defined above.
    callbacks=[es],
    # Sets verbosity to 1 to display progress bars and training information during each epoch.
    verbose=1
)

### Plot accuracy and loss

In [None]:
# Create a new figure for plotting with a size of 10x10 inches.
fig = plt.figure(figsize=(10, 10))

# Define the first subplot in a 2x2 grid (top-left subplot).
plt.subplot(2, 2, 1)
# Set the title of the subplot to "Accuracy" with a fontsize of 15.
plt.title("Accuracy", fontsize=15)
# Set the label for the x-axis to "Epochs" with a fontsize of 15.
plt.xlabel("Epochs", fontsize=15)
# Set the label for the y-axis to "Accuracy" with a fontsize of 15 (Keras reports accuracy as a fraction between 0 and 1).
plt.ylabel("Accuracy", fontsize=15)
# Plot the validation accuracy from the training history, label it "Validation Accuracy", and use a dashed line style.
plt.plot(
    history.history["val_accuracy"], label="Validation Accuracy", linestyle="dashed"
)
# Plot the training accuracy from the training history and label it "Training Accuracy".
plt.plot(history.history["accuracy"], label="Training Accuracy")
# Display a legend in the lower right corner to distinguish between validation and training accuracy lines.
plt.legend(["Validation", "Training"], loc="lower right")

# Define the second subplot in a 2x2 grid (top-right subplot).
plt.subplot(2, 2, 2)
# Set the title of the subplot to "Loss" with a fontsize of 15.
plt.title("Loss", fontsize=15)
# Set the label for the x-axis to "Epochs" with a fontsize of 15.
plt.xlabel("Epochs", fontsize=15)
# Set the label for the y-axis to "Loss" with a fontsize of 15.
plt.ylabel("Loss", fontsize=15)
# Plot the validation loss from the training history, label it "Validation loss", and use a dashed line style.
plt.plot(history.history["val_loss"], label="Validation loss", linestyle="dashed")
# Plot the training loss from the training history and label it "Training loss".
plt.plot(history.history["loss"], label="Training loss")
# Display a legend in the upper right corner to distinguish between validation and training loss lines.
plt.legend(["Validation", "Training"], loc="upper right")

## Generate predictions

In [None]:
# Generates output predictions for the input samples using the trained model.
y_pred = model.predict(X_test_padded)

# Build a text report showing the main classification metrics using sklearn's classification_report.
print(
    classification_report(
        # Converts one-hot encoded true labels (y_test) back to class indices using argmax.
        np.argmax(y_test, axis=1),
        # Converts probability predictions (y_pred) to class indices using argmax.
        np.argmax(y_pred, axis=1),
        # Uses class names from the label encoder (le) to label the classes in the report.
        target_names=le.classes_,
    )
)

- **Support** is the number of actual occurrences of the class in the specified dataset.
- **Macro avg** is the unweighted arithmetic mean of the per-class scores.
- **Weighted avg** is the mean of the per-class scores, with each class weighted by its support.
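
The difference between the two averages is easiest to see on imbalanced numbers. The per-class scores and supports below are made up for illustration.

```python
# Hypothetical per-class F1 scores and supports for three classes.
f1_scores = [0.90, 0.60, 0.30]
support = [80, 15, 5]

# Macro average: every class counts equally.
macro_avg = sum(f1_scores) / len(f1_scores)

# Weighted average: each class counts in proportion to its support.
weighted_avg = sum(f * s for f, s in zip(f1_scores, support)) / sum(support)

print(round(macro_avg, 3), round(weighted_avg, 3))  # 0.6 0.825
```

Because the dataset in this tutorial was balanced to 1,200 sequences per family before splitting, the two averages should be close to each other here.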

### Show predicted values

In [None]:
y_pred

### Confusion matrix

Compute confusion matrix to evaluate the accuracy of a classification.

In [None]:
# Calculates the confusion matrix from the true and predicted class indices.
#   - np.argmax(y_test, axis=1): converts the one-hot encoded true labels to class indices.
#   - np.argmax(y_pred, axis=1): converts the predicted class probabilities to class indices (the most probable class per row).
cf_matrix = confusion_matrix(np.argmax(y_test, axis=1), np.argmax(y_pred, axis=1))
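
To see how a confusion matrix is laid out, the sketch below builds one with NumPy from made-up labels, following the same convention as scikit-learn (rows are true classes, columns are predicted classes). It is an illustration, not the scikit-learn implementation.

```python
import numpy as np

# Hypothetical true and predicted class indices for 3 classes.
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])
n_classes = 3

# Count each (true, predicted) pair: rows are true classes, columns are predictions.
cm = np.zeros((n_classes, n_classes), dtype=int)
np.add.at(cm, (y_true, y_pred), 1)
print(cm)
# [[1 1 0]
#  [0 2 0]
#  [1 0 1]]

# Correct predictions lie on the diagonal.
accuracy = np.trace(cm) / cm.sum()
# Per-class recall: correct predictions divided by the row total (the support).
recall_per_class = np.diag(cm) / cm.sum(axis=1)
```

Off-diagonal cells show which families are confused with each other, which the overall accuracy alone does not reveal.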

Plot the confusion matrix as a heatmap.

In [None]:
# Set the figure size for the heatmap plot to 15 inches wide and 10 inches tall.
plt.figure(figsize=(15, 10))
# Create a heatmap using seaborn's heatmap function.
#   - cf_matrix: The confusion matrix data to be visualized as a heatmap.
#   - annot=True: Display numerical values (annotations) in each cell of the heatmap.
#   - fmt="":  Format string for the annotations (empty string means default formatting).
#   - cmap="Blues": Use the "Blues" colormap for the heatmap, representing values with shades of blue.
sns.heatmap(cf_matrix, annot=True, fmt="", cmap="Blues")

## Conclusion

Through this tutorial, we learned how to:
- Process and prepare protein sequence data for deep learning
- Implement a deep learning model for protein family classification
- Train and evaluate the model's performance
- Visualize and interpret the results using various metrics

The classification report and confusion matrix above show how well the model separates the protein families in the held-out test set, and illustrate how deep learning approaches can be applied to protein sequence analysis.

## Clean up

Remember to shut down your Jupyter Notebook environment and delete any unnecessary files or resources once you've completed the tutorial.
