# The following code snippet is a common set of import statements in Python for three popular libraries:

## TensorFlow is an open-source machine learning framework developed by the Google Brain team. It provides tools for building and training various machine learning models.

## Pandas is a powerful data manipulation and analysis library. It provides data structures like DataFrames, making it easy to work with structured data.

## NumPy is a fundamental library for numerical computing in Python. It provides support for large, multi-dimensional arrays and matrices, along with a collection of high-level mathematical functions to operate on these arrays.

In [None]:
import tensorflow as tf
import pandas as pd
import numpy as np

# The following code loads data from a CSV file into a Pandas DataFrame (train_df) and then prints the data types of each column in that DataFrame. The file path points to a CSV file containing the training data.

In [None]:
train_fp = './dataset/train.csv'
train_df = pd.read_csv(train_fp)
train_df.dtypes

# The following function converts an assay result into a binary label. If the assay string (after stripping whitespace and converting to lowercase) is 'positive', it returns 1; otherwise, it returns 0. This kind of function is commonly used in binary classification tasks to map raw outcome values to 0/1 targets.

In [None]:
def target_fn(assay):
    assay = assay.strip().lower()
    if assay == 'positive':
        return 1
    else:
        return 0
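
# As a quick check of the behavior described above, the sketch below redefines target_fn and applies it to a few hypothetical assay strings. The sample values are made up for illustration and are not taken from the dataset. Anything other than 'positive', after trimming whitespace and lowercasing, maps to 0.

```python
def target_fn(assay):
    # Normalize the raw string before comparing it
    assay = assay.strip().lower()
    return 1 if assay == 'positive' else 0

# Hypothetical assay values, including stray whitespace and mixed case
samples = ['Positive', '  POSITIVE ', 'Negative', 'Positive-High']
labels = [target_fn(s) for s in samples]
print(labels)
```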

# x_train contains the input features for the machine learning model: every column of train_df except 'Epitope', 'MHC', and 'Assay'. y_train contains the corresponding binary labels, derived from the 'Assay' column using the target_fn function. Note that pop also removes 'Assay' from train_df itself, while 'Epitope' and 'MHC' remain available for later steps. This kind of data preparation is common in supervised learning, where the goal is to train a model to predict the target variable ('Assay' in this case) based on input features.

In [None]:
x_train = train_df.drop(['Epitope', 'MHC', 'Assay'], axis=1)
y_train = train_df.pop('Assay').apply(target_fn)

# The following will output the first five rows (by default) of the x_train DataFrame, allowing us to inspect the structure and content of the training features. 

In [None]:
x_train.head()

# The following will output the first five elements (by default) of the y_train Series, allowing us to inspect the binary labels associated with the corresponding rows in the training data. 

In [None]:
y_train.head()

# In the following code snippet, we are creating a new Series epitope_and_mhc_comb_train by concatenating the 'Epitope' and 'MHC' columns of the train_df DataFrame, and then converting the resulting strings to uppercase. Additionally, we are extracting the values of this Series into a NumPy array named epitope_and_mhc_train_texts. 

In [None]:
epitope_and_mhc_comb_train = (train_df['Epitope'] + train_df['MHC']).str.upper()
epitope_and_mhc_train_texts = epitope_and_mhc_comb_train.values

# In the following code snippet, we are importing various modules and classes from the TensorFlow Keras library, which is commonly used for building neural network models in machine learning. 

In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Input, Embedding, Activation, Flatten, Dense, concatenate, GRU
from tensorflow.keras.layers import Conv1D, MaxPooling1D, Dropout, AveragePooling1D, BatchNormalization, Bidirectional
from tensorflow.keras.models import Model
from tensorflow.keras.callbacks import EarlyStopping

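# The following code configures a Tokenizer for character-level tokenization and fits it on the provided text data. It then replaces the fitted vocabulary with a predefined character dictionary, so that each character in alphabet maps to a fixed index starting at 1, and assigns the out-of-vocabulary token 'UNK' the next free index. The alphabet is lowercase because the Tokenizer lowercases its input by default, so the uppercased training strings still match.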

In [None]:
tk = Tokenizer(num_words=None, char_level=True, oov_token='UNK')
tk.fit_on_texts(epitope_and_mhc_train_texts)
alphabet = 'abcdefghijklmnopqrstuvwxyz0123456789-*:'
char_dict = {}
for i, char in enumerate(alphabet):
    char_dict[char] = i + 1

# Use char_dict to replace the tk.word_index
tk.word_index = char_dict.copy()
# Add 'UNK' to the vocabulary
tk.word_index[tk.oov_token] = max(char_dict.values()) + 1
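
# To make the resulting mapping concrete, the sketch below rebuilds the same character dictionary in plain Python and encodes one string. It only mirrors what we expect the Tokenizer to produce under its default lowercasing; encode is a hypothetical helper, and the example string is an assumed epitope and MHC combination, not a row from the dataset.

```python
alphabet = 'abcdefghijklmnopqrstuvwxyz0123456789-*:'
char_dict = {char: i + 1 for i, char in enumerate(alphabet)}
# 'UNK' gets the next free index, using the same rule as the cell above
unk_index = max(char_dict.values()) + 1

def encode(text):
    """Lowercase the text and map each character to its index (UNK if unseen)."""
    return [char_dict.get(ch, unk_index) for ch in text.lower()]

example = 'SIINFEKLHLA-A*02:01'  # hypothetical epitope + MHC string
encoded = encode(example)
print(encoded)
```

## Index 0 is never assigned to a character, which leaves it free to act as the padding value.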

# The following code processes the text data (epitope_and_mhc_train_texts) by converting it to integer-encoded sequences using the configured Tokenizer and then pads the sequences to ensure a consistent length of 25. The resulting sequences are stored in train_sequences_pad.

In [None]:
maxlen = 25  # fixed sequence length expected by the model input
train_sequences = tk.texts_to_sequences(epitope_and_mhc_train_texts)
train_sequences_pad = pad_sequences(train_sequences, maxlen=maxlen)
print(train_sequences[0])
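
# With its default arguments (padding='pre', truncating='pre'), pad_sequences adds zeros at the start of short sequences and drops the leading items of long ones. The sketch below imitates that behavior in plain Python for a single sequence; pad_pre is a hypothetical helper written for illustration, not part of Keras, and the input values are made up.

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad with `value`, or keep only the last `maxlen` items."""
    seq = list(seq)[-maxlen:]
    return [value] * (maxlen - len(seq)) + seq

short = pad_pre([19, 9, 9], 5)            # shorter than maxlen: zeros in front
long = pad_pre(list(range(1, 31)), 25)    # longer than maxlen: front is cut off
print(short)
print(long[0], len(long))
```

## Because truncation happens at the front, any combined string longer than 25 characters loses its first characters, which here belong to the epitope.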

# The following code defines a two-branch neural network model (complex_model) using the Keras functional API. The model passes the tokenized text through an embedding layer and convolutional layers, then combines the result with the numerical features. It is designed for binary classification with a sigmoid output. We will explain the final model structure with legends and annotations in the article.

## Tokenizer and Embedding Layer: vocab_size is set to the size of the tokenizer's vocabulary (tk.word_index). embedding_size is set to 32, which is the dimensionality of the embedding space.

## Text Input and Embedding Layer: input_comb is the input layer for the text data, with a shape of maxlen. The Embedding layer converts integer-encoded sequences into dense vectors of fixed size (embedding_size). It is created with vocab_size + 1 rows because index 0 is reserved for padding.

## Convolutional and Pooling Layers: Two 1D convolutional layers, each followed by batch normalization and max-pooling. These layers are commonly used in text or sequence data processing for feature extraction.

## Flatten and Dense Layers: A Flatten layer converts the output of the convolutional layers into a 1D array. Dense layers with ReLU activation follow, each with dropout for regularization.

## Numerical Input and Dense Layer: input_res is the input layer for numerical features with a shape of 3. A dense layer for processing numerical features followed by dropout.

## Concatenation and Output Layers: Concatenation of the outputs from the text and numerical branches. Additional dense layers with ReLU activation, followed by the final output layer with a sigmoid activation for binary classification.

In [None]:
vocab_size = len(tk.word_index)
embedding_size = 32

input_comb = Input(shape=(maxlen,), name='epitope_and_mhc')  # shape given as a tuple
x = Embedding(vocab_size + 1, embedding_size, input_length=maxlen)(input_comb)
x = Conv1D(16, 3, activation='relu')(x)
x = BatchNormalization()(x)
x = MaxPooling1D(pool_size=3)(x)
x = Conv1D(32, 3, activation='relu')(x)
x = BatchNormalization()(x)
x = MaxPooling1D(pool_size=3)(x)
x = Flatten()(x)
x = Dense(256, activation='relu')(x)
x = Dropout(0.2)(x)
x = Dense(128, activation='relu')(x)
x = Dropout(0.2)(x)
x = Dense(64, activation='relu')(x)
x = Dropout(0.2)(x)
x = Model(inputs=input_comb,outputs=x)
input_res = Input(shape=(3,), name='numerical_features')  # shape given as a tuple
y = Dense(128, activation='relu')(input_res)
y = Dropout(0.2)(y)
y = Model(inputs=input_res, outputs=y)
combined_out = concatenate([x.output, y.output])
z = Dense(128, activation='relu')(combined_out)
z = Dense(1, activation='sigmoid')(z)
complex_model = Model(inputs=[x.input, y.input], outputs=z)
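
# The layer sizes above can be followed with simple arithmetic. The sketch below tracks the sequence length through the text branch, assuming the Keras defaults used in the model: 'valid' padding and stride 1 for Conv1D, and a MaxPooling1D stride equal to pool_size. The helper functions are hypothetical and exist only for this illustration.

```python
def conv1d_len(length, kernel_size, stride=1):
    """Output length of a 'valid' 1D convolution."""
    return (length - kernel_size) // stride + 1

def pool1d_len(length, pool_size):
    """Output length of 'valid' max-pooling with stride equal to pool_size."""
    return (length - pool_size) // pool_size + 1

steps = 25                      # maxlen
steps = conv1d_len(steps, 3)    # Conv1D(16, 3)
steps = pool1d_len(steps, 3)    # MaxPooling1D(pool_size=3)
steps = conv1d_len(steps, 3)    # Conv1D(32, 3)
steps = pool1d_len(steps, 3)    # MaxPooling1D(pool_size=3)

flat_features = steps * 32      # Flatten: remaining steps x 32 filters
concat_features = 64 + 128      # text branch (64) + numerical branch (128)
print(steps, flat_features, concat_features)
```

## Under these assumptions the text branch is reduced to a single time step before Flatten, so the first Dense(256) layer receives 32 values, and the layer after the concatenation receives 192.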

# The following code compiles the complex_model using the Adam optimizer, binary cross-entropy loss, and accuracy as the evaluation metric. 

In [None]:
complex_model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
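
# To illustrate what the loss and metric measure, the sketch below computes binary cross-entropy and thresholded accuracy in plain Python. It is a simplified illustration rather than the Keras implementation: the function names, the clipping constant eps, and the sample predictions are assumptions made for this example.

```python
import math

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)], with p clipped away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

def accuracy(y_true, y_pred, threshold=0.5):
    """Fraction of predictions on the correct side of the threshold."""
    hits = sum(int(p > threshold) == y for y, p in zip(y_true, y_pred))
    return hits / len(y_true)

y_true = [1, 0, 1, 0]            # hypothetical labels
y_pred = [0.9, 0.2, 0.4, 0.6]    # hypothetical sigmoid outputs
print(binary_crossentropy(y_true, y_pred), accuracy(y_true, y_pred))
```

## The loss penalizes confident mistakes more heavily than accuracy does, which is one reason the two values can move differently during training.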

# The following code defines an EarlyStopping callback for use during the training of a neural network model. The callback monitors the validation loss during training. If the validation loss does not improve by at least min_delta for patience consecutive epochs, training is stopped early to limit overfitting.

## monitor='val_loss': This specifies the metric to monitor for early stopping. In this case, it's the validation loss (val_loss). The training process will stop when the specified metric stops improving.

## min_delta=0.0001: This parameter sets the minimum change in the monitored metric to qualify as an improvement. If the change is less than this value, it won't be considered an improvement.

## patience=10: This is the number of epochs with no improvement after which training will be stopped. In this case, if there is no improvement in validation loss for 10 consecutive epochs, the training will stop.

In [None]:
earlystop_callback = EarlyStopping(
  monitor='val_loss', min_delta=0.0001,
  patience=10)
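
# The stopping rule described above can be sketched in plain Python. This is a simplified illustration, not the Keras implementation: early_stop_epoch is a hypothetical helper, the loss values are made up, and it assumes that a lower monitored value is better, as with val_loss.

```python
def early_stop_epoch(val_losses, min_delta=0.0001, patience=10):
    """Return the 0-based epoch at which training would stop, or None."""
    best = float('inf')
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            # Improvement of more than min_delta: remember it and reset the counter
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

# Hypothetical validation losses that plateau after the second epoch
losses = [1.0, 0.9, 0.9, 0.9, 0.9]
stop_at = early_stop_epoch(losses, patience=2)
print(stop_at)
```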

# The following code trains complex_model using the provided data and settings. The training history, including loss and accuracy metrics for both training and validation sets, will be stored in the history object.

## x=[train_sequences_pad, x_train]: The input data for the model consists of two components – the padded sequences (train_sequences_pad) and the numerical features (x_train).

## y=y_train: The target data is the binary labels (y_train).

## batch_size=64: The number of samples per gradient update. The model's weights are updated after processing each batch of 64 samples.

## epochs=50: The maximum number of times the entire training dataset is passed forward and backward through the neural network. Training may end sooner if the early stopping callback triggers.

## callbacks=[earlystop_callback]: The early stopping callback is applied during training. It monitors the validation loss and stops training if the loss has not decreased for the configured number of epochs (patience).

## validation_split=0.2: Specifies that 20% of the training data will be used for validation. The model's performance on this validation set is monitored during training.

## verbose="auto": The verbosity mode during training. According to the Keras documentation, 0 is silent, 1 shows a progress bar, and 2 prints one line per epoch; "auto" resolves to 1 in most cases.

In [None]:
history = complex_model.fit(
    [train_sequences_pad, x_train], y_train,
    batch_size=64,
    epochs=50,
    callbacks=[earlystop_callback],
    validation_split=0.2,
    verbose="auto",
)
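
# The effect of validation_split and batch_size on each epoch can be estimated with a little arithmetic. The sketch below assumes a hypothetical dataset of 1,000 rows, since the real row count is not shown here; split_and_steps is an illustrative helper, not a Keras function. Keras takes the validation samples from the end of the provided arrays, before any shuffling.

```python
import math

def split_and_steps(n_samples, validation_split=0.2, batch_size=64):
    """Estimate training rows, validation rows, and weight updates per epoch."""
    n_train = int(math.floor(n_samples * (1.0 - validation_split)))
    n_val = n_samples - n_train
    steps_per_epoch = math.ceil(n_train / batch_size)
    return n_train, n_val, steps_per_epoch

n_train, n_val, steps = split_and_steps(1000)  # hypothetical dataset size
print(n_train, n_val, steps)
```

## Because the validation rows come from the end of the data, a file that is sorted by label or by source may need to be shuffled before training.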