<div style="background-color:#f9f9f9; padding:20px; border:1px solid #ddd; border-radius:8px; font-family:Verdana;">

<h2 style='color:#2c3e50;'><strong>Section 1: Problem Definition</strong></h2>

<hr style="border-top: 2px solid #3498db; margin-bottom: 20px; margin-top: 20px;">

<h3 style='color:#2c3e50;'>Goal</h3>
<p style='color:#2c3e50;'>Dive into the realm of Molecular Biology and explore how a Language Model (LM) like BERT can be harnessed for protein sequences. </p>

<h3 style='color:#2c3e50;'>Introduction to the Problem Domains</h3>
<ul style='color:#2c3e50;'>
    <li><strong>Proteins</strong> are the workhorses of the body, performing a wide range of functions and biological processes necessary for survival and wellbeing. Understanding their functions can aid in drug discovery, disease diagnosis, and much more. 
    Proteins are at least 50, and usually more than 100, amino acids in length and composed of multiple peptide subunits. Each protein consists of a linear sequence of amino acids. The sequence of a protein is usually notated as a string of letters including 22 aminoacids and are varied in length from very short to very long! For instance, in NLP we use words and sentences as sequences, similarly in protein modeling we also work with sequences which looks something like this: 
    <br>​
    V​​L​​S​​P​​A​​D​​K​​T​​N​​VK​​A​​A​​W​​G​​K​​V​​G​​A​​H​A​​G​​E​​Y​​G​​A​​E​​A​​L​​E​R​​M​​F​​L​​S​​F​​P​​T​​T​​KT​​Y​​F​​P​​H​​F​​D​​L​​S​​HG​​S​​A​​Q​​V​​K​​G​​H​​G​​K​K​​V​​A​​D​​A​​L​​T​​N​​A​​VA​​H​​V​​D​​D​​M​​P​​N​​A​​L​S​​A​​L​​S​​D​​L​​H​​A​​H​​K​​L​​R​​V​​D​​P​​V​​N​​F​​K​​L​L​​S​​H​​C​​L​​L​​V​​T​​L​​AA​​H​​L​​P​​A​​E​​F​​T​​P​​A​V​​H​​A​​S​​L​​D​​K​​F​​L​​A​​S​​V​​S​​T​​V​​L​​T​​S​​K​​Y​ 
    </br>
    <img alt="Protein Structure" src="protein.jpeg" width="600" height="400" align="center"/>
    <br>
    This sequence belongs to a structure that looks like this: </br>
    </li>
    <img alt="Protein Structure" src="structure.jpg" width="600" height="400" align="center"/>
    <br>
    You can find more information about proteins <a href="https://www.khanacademy.org/science/biology/macromolecules/proteins-and-amino-acids/v/introduction-to-amino-acids">here</a>.
    </br>
    <li><strong>Signal Peptides (SP)</strong> are short sequences of amino acids, typically between 2 and 50 amino acids in length, within a protein that direct the protein to specific locations within or outside the cell. They act like postal addresses, guiding the cellular delivery machinery to transport the protein to its intended location.Once the protein reaches its destination, the signal peptide is often cleaved off by specific enzymes. In the following image, you can see the difference between a peptide and a protein: 
    <br>
    <img alt="protein vs peptide" src="peptide.jpeg" width="600" height="400" align="center"/>
    <br>
    You can find more information about peptides <a href="https://www.cellgs.com/blog/the-difference-between-peptides-and-proteins.html">here</a>.
    <br>
    We have a dataset for signal peptides detection in "data/signal_peptide" directory. In this dataset, if a protein sequence has a signal peptide, it is labeled as 1, otherwise it is labeled as 0.
    </br>
    </li>
    <li><strong>SCOP</strong>, which stands for Structural Classification of Proteins is a largely manual classification of protein structural domains based on similarities of their structures and amino acid sequences. Predicting the structural class of a protein sequence is another fine-tuning task that we will do in this assignment. For each protein sequence, we have its structural class as its label in "data/scop" directory. Labels are categorical and there are 7 classes in total.
    </li>
</ul>
</div>

<div style="background-color:#f9f9f9; padding:20px; border:1px solid #ddd; border-radius:8px; font-family:Verdana;">

<h3 style='color:#2c3e50;'><strong>Import Libraries and Read Data</strong></h3>


In [1]:
# Please do not change this cell. Run it without changes.
# This cell imports the required libraries and sets some parameters.
# If you want to change the parameters, you can do so here and mention why you changed them with a comment.

import numpy as np
import pandas as pd
import tensorflow as tf
import os
import math
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

max_length = 256 # size of the input sequences - if a sequence is longer than this, it will be truncated. If it's shorter, it will be padded.
batch_size = 32  # number of sequences that will be given to the model at once
epochs = 10     # number of times the model will see the whole training set
lr = 1e-2      # learning rate
vocab_size = 26  # size of the vocabulary - do not change 
embed_size = 128 # size of the embeddings - do not change
n_heads = 4     # number of attention heads - do not change
n_layers = 2   # number of transformer layers - do not change

In [2]:
# please don't change this cell. Run it without changes.
# this cell is for loading the data needed for the assignment. 
# pretrain data is downloaded from UniprotKB-SwissProt (https://www.uniprot.org/downloads) latest release.

pretrain_data = pd.read_csv('data/pretrain.tsv', sep='\t')
# scop data
scop_train_data = pd.read_csv('data/scop/scop.train.csv')
scop_test_data = pd.read_csv('data/scop/scop.test.csv')
# signal peptide data
signalp_train_data = pd.read_csv('data/signal_peptide/signalP_binary.train.csv')
signalp_test_data = pd.read_csv('data/signal_peptide/signalP_binary.test.csv')

In [3]:
# please don't change this cell. Run it without changes.
# this cell is for looking at the pretrain data to get an idea of what it looks like.

pretrain_data.head()

Unnamed: 0,id,seq
0,Q65W17,MKPLVIKLGGVLLDTPAAMENLFTALADYQQNFARPLLIVHGGGCL...
1,Q12ZI9,MFTILTGSQFGDEGKGKIVDLLSKDYDLVVRFQGGDNAGHTVVVGD...
2,O08400,MTRISRSAYAEIYGPTVVGGVGDRVRLADTLLLAEVEKDHTIFGEE...
3,O14232,MFGGELDDAFGVFEGKVPKSLKEESKNSQNSQNSQKIKRTLTDKNA...
4,Q2FQ95,MNILIVNRYGDPDVEEFSYELEKLLHHHGHHTSIYKENLLGEAPPL...


In [4]:
# please don't change this cell. Run it without changes.
# this cell is for looking at the scop data to get an idea of what it looks like.

scop_train_data.head()

Unnamed: 0,seq,label
0,MSPFTGSAAPTPEWRHLRVEITDGVATVTLARPDKLNALTFEAYAD...,c
1,MVVTKLAPDFKAPAVLGNNEVDEHFELSKNLGKNGVILFFWPKDFT...,c
2,MKVGIDAGGTLIKIVQEQDNQRTFKTELTKNIDQVVEWLNQQQIEK...,c
3,LYKLLILDIDGTLRDEVYGIPESAKHAIRLCQKNHCSVVICTGRSM...,c
4,NTSNITFIGGGNMARNIVVGLIANGYDPNRICVTNRSLDKLDFFKE...,c


In [5]:
# please don't change this cell. Run it without changes.
# this cell is for looking at the signal peptide data to get an idea of what it looks like.

signalp_train_data.head()

Unnamed: 0,label,seq
0,0,MLGMIRNSLFGSVETWPWQVLSTGGKEDVSYEERACEGGKFATVEV...
1,1,MQPAKNLLFSSLLFSSLLFSSAARAASEDGGRGPYVQADLAYAAER...
2,0,MDKGEGLRLAATLRQWTRLYGGCHLLLGAVVCSLLAACSSSPPGGV...
3,0,MKFIDEAKIEVAAGKGGNGATSFRREKFVPRGGPDGGDGGKGGSVW...
4,0,MVAGMLMPRDQLRAIYEVLFREGVMVAKKDRRPRSLHPHVPGVTNL...


<div style="background-color:#f9f9f9; padding:20px; border:1px solid #ddd; border-radius:8px; font-family:Verdana;">

<h3 style='color:#2c3e50;'><strong>Section 1. Tokenize Protein Sequences</strong></h3>

<p style='color:#2c3e50;'>Firstly, you need to have your protein sequences in a format that can be fed into the model. Each amino acid can be represented by a unique token (or ID), similar to how each word is represented by a unique token in NLP tasks. A common representation is to use the single-letter codes for amino acids.</p>
<p style='color:#2c3e50;'>We have <b>22 amino acids: "ACDEFGHIKLMNPQRSTUVWXY" </b>. There also can be invalid amino acids in protein sequences that we can encode them by <b>'OTHER' </b>token. We need tokens <b>'START'</b>, <b>'END'</b>, and <b>'PAD'</b> for showing starting of the protein sequence, end of protein sequence and padding the protein sequence to a fixed size. <p><b> Do not use predefined tokenizer in keras or tensorflow. You need to implement your own tokenizer for protein sequences.</b></p>
</p>
<p>
    <b>Example (padding to size 32):</b> <br>
    <b>Label encoding:</b> {A: 0, C: 1, D: 2, E: 3, F: 4, G: 5, H: 6, I: 7, K: 8, L: 9, M: 10, N: 11, P: 12, Q: 13, R: 14, S: 15, T: 16, V: 17, W: 18, X: 19, Y: 20, OTHER: 21, START: 22, END: 23, PAD: 24} <br>
    <b>Input:</b> "ACDEFGHIKLMNPQRSTUVWXY" <br>
    <b>Output:</b> [22, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10,11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 23, 24, 24, 24, 24, 24, 24, 24, 24, 24] <br>
</p>
</div>


In [6]:
# fill this function to tokenize a sequence into a list of integers and return the list of integers representing the sequence like the example above.
def tokenize_seq(seq, max_length=512):
    '''
    Tokenize a sequence into a list of integers.
    assign each amino acid to a unique integer and add <START>, <END>, <PAD>, <OTHER> tokens to the vocabulary.
    <START> token is added to the beginning of each sequence
    <END> token is added to the end of each sequence
    <PAD> token is used to pad sequences to the same length (max_length)
    <OTHER> token is used to represent all amino acids that are not in the vocabulary
    
    Args:
        seq (str): protein sequence
        max_length (int): maximum length of the sequence
    Returns:
        tokenized_seq: list of integers representing the sequence
    '''
    # YOUR CODE HERE
    amino_acids = "ACDEFGHIKLMNPQRSTUVWXY"
    vocab = {aa: i for i, aa in enumerate(amino_acids)}
    vocab['OTHER'] = len(amino_acids)
    vocab['START'] = len(amino_acids)+1
    vocab['END'] = len(amino_acids) +2
    vocab['PAD'] = len(amino_acids) + 3
    # Tokenize the sequence
    tokenized_seq = [vocab['START']] + [vocab[aa] if aa in vocab else vocab['OTHER'] for aa in seq] + [vocab['END']]
    tokenized_seq = tokenized_seq[:max_length] + [vocab['PAD']] * max(0, max_length - len(tokenized_seq))
    
    return tokenized_seq

In [7]:
# please don't change this cell and don't add any print statements to this cell. Run it without changes.
# this cell is tokenizing the pretrain data and converting it to numpy array.

pretrain_seq = [tokenize_seq(seq) for seq in pretrain_data['seq']]
pretrain_seq = np.array(pretrain_seq)
pretrain_seq = pretrain_seq.astype(np.int32)

<div style="background-color:#f9f9f9; padding:20px; border:1px solid #ddd; border-radius:8px; font-family:Verdana;">

<h2 style='color:#2c3e50;'><strong>Section 2: Pretrain a Language Model</strong></h2>

<hr style="border-top: 2px solid #3498db; margin-bottom: 20px; margin-top: 20px;">

<ul style='color:#2c3e50;'>
    <li>Implement the attention mechanism</li>
    <li>Implement BERT architecture</li>
    <li>Train the model by the protein sequences.</li>
</ul>

</div>

<div style="background-color:#f9f9f9; padding:20px; border:1px solid #ddd; border-radius:8px; font-family:Verdana;">

<h3 style='color:#2c3e50;'><strong>Task 2.1: Implementing the Self-Attention Class</strong></h3>

<p style='color:#2c3e50;'>Implement a multihead self-attention mechanism that will be the building block for BERT model.</p>

<p style='color:#2c3e50;'><strong>Multihead self-attention</strong> is a mechanism at the heart of Transformer models, which are widely used for various natural language processing tasks. It allows the model to focus on different positions of the input sequence simultaneously when producing an output. Here's a simplified explanation:
<p style='color:#2c3e50;'>
<strong>Self-Attention:</strong> This component allows each position in a sequence to attend to all positions within the same sequence when computing the representation of itself. It helps the model capture context from the entire sequence.
<p style='color:#2c3e50;'>
<strong>Multihead:</strong> The 'multihead' part means that the self-attention process is duplicated multiple times (each 'head' being one instance). Each head attends to information from different representational spaces at different positions. This means that instead of having a single set of attention weights for each position, you have multiple sets, allowing the model to capture a diverse range of information which enhances its learning capacity for complex patterns in data.
<p style='color:#2c3e50;'>
<strong>Mechanism Overview:</strong> In practice, for each head, the input sequence is linearly transformed to a set of queries, keys, and values using learned weights. Then, the attention scores are computed by a <b>scaled dot-product of queries and keys</b>, which are used to weight the values. The outputs of each head are concatenated and once again linearly transformed to produce the final output. </p>
<img alt="Multihead Self-Attention" src="attention.jpg" width="600" height="400" align="center"/>
<p> You can read more about attention mechanism <a href="https://towardsdatascience.com/transformers-explained-visually-part-3-multi-head-attention-deep-dive-1c1ff1024853">here</a>.</p>
<p>
An attention mask in the context of self-attention mechanisms is used to prevent certain positions from being attended to. For instance, when processing sequences of different lengths, padding tokens are added to achieve uniformity in length. An attention mask can be applied to ensure that these padding tokens do not influence the model's output. It's also used to enforce causality in sequence-to-sequence tasks, ensuring that predictions for a certain position can't depend on future positions. Although it is necessary to use attention masks in some cases, we will not use them in this assignment.
</div>

In [8]:
class multiSelfAttentionHead(tf.keras.layers.Layer):
    def __init__(self, num_heads=4, k_dim=64, **kwargs):
        super(multiSelfAttentionHead, self).__init__(**kwargs)
        self.num_heads = num_heads
        self.key_dim = k_dim
        self.scale = k_dim ** -0.5  # Scale factor for the dot product (see above image, yellow Scale box) 
        self.depth = self.key_dim // self.num_heads  # Depth of each head

    def build(self, input_shape):
        # Create the trainable weight matrices for Q, K, V and the final output in this function.
        # Fill variables self.wq, self.wk, self.wv, self.wo

        self.wq = self.add_weight(name='wq', shape=(input_shape[-1], self.key_dim), initializer='random_normal', trainable=True)  # weight matrix for Q (query)  tf.keras.layers.Dense(key_dim)
        self.wk = self.add_weight(name='wk', shape=(input_shape[-1], self.key_dim), initializer='random_normal', trainable=True)  # weight matrix for K (key)   tf.keras.layers.Dense(key_dim)
        self.wv = self.add_weight(name='wv', shape=(input_shape[-1], self.key_dim), initializer='random_normal', trainable=True)  # weight matrix for V (value)     tf.keras.layers.Dense(key_dim)
        self.wo = self.add_weight(name='wo', shape=(self.key_dim, input_shape[-1]), initializer='random_normal', trainable=True)  # weight matrix for the final output  tf.keras.layers.Dense(key_dim)
     
        assert self.key_dim % self.num_heads == 0, "key_dim must be divisible by num_heads"
        super(multiSelfAttentionHead, self).build(input_shape)

    def split_heads(self, x, batch_size):
        # split the last dimension into (num_heads, depth) and transpose the result such that the shape is (batch_size, num_heads, seq_len, depth)
        # Fill variable x
        x = tf.reshape(x, (batch_size, -1, self.num_heads, self.depth))
        x = tf.transpose(x, perm=[0, 2, 1, 3])
        
        return x
    
    def call(self, inputs): 
        # Implement multi-head attention (see the right image above)
        # Fill variable batch_size
        batch_size = tf.shape(inputs)[0]
        
        # Linear projections of Q, K, V (Gray Linear layers in the begining of the right image above)
        # Fill variables query, key, value
        query = tf.matmul(inputs, self.wq)   # (batch_size, seq_len, key_dim)
        key = tf.matmul(inputs, self.wk)     # (batch_size, seq_len, key_dim)
        value = tf.matmul(inputs, self.wv)   # (batch_size, seq_len, key_dim)


        # Split the heads using split_heads function 
        # Fill variables query, key, value
        query = self.split_heads(query, batch_size)  # (batch_size, num_heads, seq_len_q, depth)
        key = self.split_heads(key, batch_size)  # (batch_size, num_heads, seq_len_k, depth)
        value = self.split_heads(value, batch_size)  # (batch_size, num_heads, seq_len_v, depth)


        # Linear projection between Q and K (in the left image above)
        # Fill variable linear_qk
        linear_qk = tf.matmul(query, key, transpose_b=True)  # (batch_size, num_heads, seq_len_q, seq_len_k)


        # Scale Linear projection between Q and K using self.scale (in the left image above)
        # Fill variable scaled_linear_qk
        scaled_linear_qk =  linear_qk * self.scale
        

        # Apply softmax to the last axis of scaled_linear_qk
        # Fill variable attention_weights
        attention_weights = tf.nn.softmax(scaled_linear_qk, axis=-1)
        
        # Matmul of attention_weights and value (in the left image above)
        # Fill variable context
        context = tf.matmul(attention_weights, value)  # (batch_size, num_heads, seq_len_q, depth)
        
        # 'Concatenate' heads (in the right image above) using tf.transpose and tf.reshape
        # Fill variable context
        context = tf.reshape(tf.transpose(context, perm=[0, 2, 1, 3]), (batch_size, -1, self.key_dim))  # (batch_size, seq_len_q, key_dim)
        
        # Final linear projection (Gray Linear layer in the end of the right image above) between context and self.wo
        # Fill variable outputs
        outputs = tf.matmul(context, self.wo)  # (batch_size, seq_len_q, input_shape[-1])

        return outputs

<div style="background-color:#f9f9f9; padding:20px; border:1px solid #ddd; border-radius:8px; font-family:Verdana;">

<h3 style='color:#2c3e50;'><strong>Task 2.2: Implementing the BERT Architecture</strong></h3>

<p style='color:#2c3e50;'>Implement BERT architecture.</p>
<p> You can read more about BERT in the original <a href="https://arxiv.org/pdf/1810.04805.pdf">paper</a>.</p>
<p>
BERT is based on encoder transformer. Encoder consists of multi-head attention and feed-forward neural network. In original Transformers paper <a href="https://arxiv.org/pdf/1706.03762.pdf">Attention Is All You Need</a> and in this assignment, the stack of identical encoder layers is used. 
</p>
<img alt="Encoder transformer" src="encoder.png" width="300" height="400" align="center"/>
<p> In BERT function, you need to implement the following steps:</p>
<p>
<strong>Inputs:</strong> The function takes several parameters like vocabulary size, embedding size, number of layers, attention heads, and maximum sequence length. Additionally, it expects an input sequence and an attention mask when called.
<p>
<strong>Embedding Layer:</strong> The model begins by mapping the input sequence of tokens to vectors using an embedding layer.
<p>
<strong>Encoder Layers:</strong> It then passes the embedded input through a series of identical layers. Each layer consists of:
1. A multi-head self-attention mechanism that allows the model to capture different aspects of the data. The attention output is combined with the original embeddings using a residual connection followed by layer normalization.
2. A feed-forward neural network (FFN) with two dense layers is applied after each attention mechanism, also followed by a residual connection and layer normalization.
<p>
<strong>Outputs:</strong> After processing the input through the specified number of encoder layers, the function applies a dense layer with softmax activation to produce a probability distribution over the vocabulary for each position in the input sequence.
<p>
<strong>Model Creation:</strong> Finally, it encapsulates the complete architecture within a Keras Model object, which can be compiled and trained.

</div>

In [9]:
# Implement BERT model 
import tensorflow as tf
from tensorflow.keras.layers import Input, Embedding, Add, LayerNormalization, Dense
def BERT(vocab_size, embed_size=128, n_layers=2, n_heads=4, max_length=256):
    '''
    BERT model. 

    Args:
        vocab_size (int): vocabulary size (number of tokens)
        embed_size (int): embedding size (dimension of the token embedding)
        n_layers (int): number of layers in the encoder stack
        n_heads (int): number of attention heads in each multi-head attention layer
        max_length (int): maximum length of the sequence 

    Inputs:
        input_seq (tf.Tensor): input tensor with shape [Batch_size, max_length] ; No positional encoding is added to the input sequence


    Outputs:
        output (tf.Tensor): output tensor with shape [Batch_size, max_length, vocab_size]
        return model (tf.keras.Model): BERT model
    
    Model Architecture (as described above):
        1. Input Layer
        2. Embedding Layer
        3. Multi-Head Self-Attention Layers
        4. Add and Norm Layer
        5. Feed-Forward Layers
        6. Add and Norm Layer
        7. Dense Layer (vocab_size) with softmax activation
        Steps 3-6 are repeated n_layers times.
    '''

    # YOUR CODE HERE
    inputs = Input(shape=(max_length,), dtype=tf.int32)
    
    # Embedding Layer
    embedding_layer = Embedding(input_dim=vocab_size, output_dim=embed_size)(inputs)
    
    # Multi-Head Self-Attention Layers
    output = embedding_layer
    for _ in range(n_layers):
        # Multi-Head Self-Attention
        attention_output = multiSelfAttentionHead(num_heads=n_heads, k_dim=embed_size)(output)
        
        # Add and Norm Layer
        output = Add()([output, attention_output])
        output = LayerNormalization(epsilon=1e-6)(output)
        
        # Feed-Forward Layers
        ffn_output = Dense(units=4*embed_size, activation='relu')(output)
        ffn_output = Dense(units=embed_size)(ffn_output)
        
        # Add and Norm Layer
        output = Add()([output, ffn_output])
        output = LayerNormalization(epsilon=1e-6)(output)

    # Dense Layer (vocab_size) with softmax activation
    outputs = Dense(units=vocab_size, activation='softmax')(output)
    


    model = tf.keras.Model(inputs=inputs, outputs=outputs)
    return model

       

<div style="background-color:#f9f9f9; padding:20px; border:1px solid #ddd; border-radius:8px; font-family:Verdana;">
<h3 style='color:#2c3e50;'><strong>Section 3: Implementing Noised Tokens Prediction (NTP)</strong></h3>

<p style='color:#2c3e50;'>In previous cell, you implemented BERT architecture. In main version of BERT, we use Masked Language Models (MLMs) and Next Sentence Prediction (NSP) as pretraining tasks. 
<p>
In Masked Language Modeling, random tokens in a sentence are replaced with a special token (e.g., "[MASK]"). The model's task is to predict the original token based on the context provided by the other non-masked tokens in the input. This allows the model to develop a deep understanding of language structure and word relationships. 
<p>
The Next Sentence Prediction task involves presenting the model with two sentences and requiring it to predict whether the second sentence follows the first in the original text. This helps the model to learn relationships between consecutive sentences, which is important for understanding the coherence and flow of paragraphs. 
<br>
You can read more about BERT pretraining tasks in the original <a href="https://arxiv.org/pdf/1810.04805.pdf">paper</a> or <a href="https://towardsdatascience.com/bert-explained-state-of-the-art-language-model-for-nlp-f8b21a9b6270">here</a>.
<p>
In this assignment, however, we want to define a new task called <b>Noised Tokens Prediction (NTP)</b> and use it instead of MLMs and NSP tasks. In this task, we randomlely change tokens in the input sequence to get a noisy sequence to get it to the model and want the model to predict the original sequence. For example, if the input sequence is "ABCDEF", with some probability, it is changed into "ABXDEF"; so the input of the model would be "ABXDEF" and the output should be "ABCDEF". You need to implement this task in the following cell. Feel free to use numpy library for randomization.</p>
</p>

</div>

In [10]:
def noise_tokens(encoded_seqs, p_seq_noise = 0.15, n_tokens = vocab_size):
    '''
    Add noise to the input sequences.
    Args:
        encoded_seqs (np.array): encoded sequences with shape [Batch_size, max_length]
        p_seq_noise (float): probability of adding noise to each token
        n_tokens (int): number of tokens in the vocabulary
    Returns:
        noisy_encoded_seqs (np.array): encoded sequences with noise with shape [Batch_size, max_length]
        encoded_seqs (np.array): original encoded sequences with shape [Batch_size, max_length]
    '''
    # YOUR CODE HERE
    batch_size, max_length = encoded_seqs.shape
    noisy_encoded_seqs = np.copy(encoded_seqs)

    for i in range(batch_size):
        for j in range(max_length):
            if np.random.rand() < p_seq_noise:
                # Randomly change the token to a different one
                noisy_encoded_seqs[i, j] = np.random.randint(n_tokens)


    return noisy_encoded_seqs, encoded_seqs

<div style="background-color:#f9f9f9; padding:20px; border:1px solid #ddd; border-radius:8px; font-family:Verdana;">

<h3 style='color:#2c3e50;'><strong>Section 4: Pretraining the Model</strong></h3>

<p style='color:#2c3e50;'>Create an instance of your BERT model and train it on pretrain data.</p>

</div>

In [11]:
# fill this function to pretrain the model
import tensorflow as tf
from tensorflow.keras.optimizers import Adam

def pretraining(model, noisy_pretrain_seq, pretrain_seq, epochs, batch_size, lr=1e-2):
    '''
    Pretrain the model: Input is noised sequences and output is original sequences.
    Args:
        model (tf.keras.Model): BERT model
        noisy_pretrain_seq (np.array): encoded sequences with noise with shape [Batch_size, max_length]
        pretrain_seq (np.array): original encoded sequences with shape [Batch_size, max_length]  
        epochs (int): number of epochs
        batch_size (int): batch size 
        lr (float): learning rate
    Returns:
        model (tf.keras.Model): pretrained BERT model
    Hint:
    Model is trained on the noisy sequences and the loss is calculated on the original sequences.
    Define the loss function and optimizer
    compile the model with the loss function and optimizer. 
    fit the model on the noisy sequences and original sequences. 
    '''

    # YOUR CODE HERE
    expected_seq_length = model.input_shape[1]
    noisy_pretrain_seq = noisy_pretrain_seq[:, :expected_seq_length]
    pretrain_seq = pretrain_seq[:, :expected_seq_length]

    optimizer = Adam(learning_rate=lr)
    model.compile(optimizer=optimizer, loss='sparse_categorical_crossentropy', metrics=['accuracy'])

    history = model.fit(
        noisy_pretrain_seq,
        pretrain_seq,
        epochs=epochs,
        batch_size=batch_size
    )
    return model

In [12]:
# please don't change this cell and don't add any print statements to this cell. Run it without changes.
# this cell is for training the model on the pretrain data.

model = BERT(vocab_size=vocab_size, embed_size=128, n_layers=2, n_heads=4, max_length=max_length)
model.summary()

noisy_pretrain_seq, pretrain_seq = noise_tokens(pretrain_seq)
pretrained_model = pretraining(model, noisy_pretrain_seq, pretrain_seq, epochs=epochs, batch_size=batch_size)

Model: "model"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
 input_1 (InputLayer)        [(None, 256)]                0         []                            
                                                                                                  
 embedding (Embedding)       (None, 256, 128)             3328      ['input_1[0][0]']             
                                                                                                  
 multi_self_attention_head   (None, None, 128)            65536     ['embedding[0][0]']           
 (multiSelfAttentionHead)                                                                         
                                                                                                  
 add (Add)                   (None, 256, 128)             0         ['embedding[0][0]',       

In [13]:
# please don't change this cell and don't add any print statements to this cell. Run it without changes.
# this cell is for saving the pretrained model.

if not os.path.exists('model'):
    os.makedirs('model')    
pretrained_model.save_weights('model/model_pretrained.h5')

<div style="background-color:#f9f9f9; padding:20px; border:1px solid #ddd; border-radius:8px; font-family:Verdana;">

<h2 style='color:#2c3e50;'><strong>Section 5: Fine-tuning Large Language Models</strong></h2>

<hr style="border-top: 2px solid #3498db; margin-bottom: 20px; margin-top: 20px;">

<ul style='color:#2c3e50;'>
    <li>Prepare the data for fine-tuning: Tokenize protein sequences in both train and test data of signal peptides detection and structure classification tasks. Then, convert them to numpy arrays. (like what you did in section 1 for pretraining data)</li>
    <li>Fine-tune the model on tasks signal peptides detection (binary classification) and structure classification (multi-class classification). By fine-tuning, we mean the process of adjusting the parameters of the pre-trained model to suit a specific task or dataset. Through this process, we can enhance the model's ability to perform a specialized task more accurately.</li>
    <li>For each task, you need to call the pretrained model and add a classification layer on top of it. Then, you need to freeze the pretrained model and train the classification layer for 10 epochs. After that, you need to unfreeze the pretrained model and train the whole model for 5 epochs.</li>
    </li>
</ul>

</div>

In [14]:
# please don't change this cell and don't add any print statements to this cell. Run it without changes.
# this cell is for connecting the layers of the pretrained model to the layers of the finetuned model. 
# the output of the last layer of the pretrained model is considered as the input of the finetuned model. 

def concat_layers_for_finetuning(model):
    # Create a list to hold the layers to be concatenated
    return tf.keras.Model(inputs=model.inputs, outputs=model.layers[-1].output)

In [15]:
# please don't change this cell
# this cell is for loading the pretrained model and keep the output of the last layer to be used for finetuning.

model_SP = BERT(vocab_size=vocab_size, embed_size=embed_size, n_layers=n_layers, n_heads=n_heads, max_length=max_length)
model_SP.load_weights('model/model_pretrained.h5')
concat_model_SP = concat_layers_for_finetuning(model_SP)

In [16]:
# Fill this function to fine-tune the concat_model_SP on peptide detection task (binary classification)

'''
Fine-tune the model on peptide detection task.
Parameters:
    epochs (int): number of epochs
    batch_size (int): batch size 
    lr (float): learning rate
    n_classes (int): number of classes

What you need to do in this cell:
    1. Tokenize the sequences in scop_train_data
    2. Make the labels in scop_train_data as a numpy array
    3. Make the layers of the concat_model_SP non-trainable
    4. Add a dense layer with a proper activation to the end of the concat_model_SP
    5. Define a new model with the input of concat_model_SP and the output of the dense layer
    6. Define the loss function and optimizer
    7. Compile the model with the loss function and optimizer
    8. Fit the model on the training data for 10 epochs
    9. Make all layers of the model trainable
    10. Fit the model on the training data for 5 epochs
    11. Save the new model in the model directory with the name model_finetuned_SP.h5
    '''
# YOUR CODE HERE
from tensorflow import keras
from tensorflow.keras.models import Model
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.layers import Flatten

#1 Tokenize the sequences in scop_train_data
tokenized_texts = [tokenize_seq(seq) for seq in signalp_train_data.seq]
tokenized_texts = np.array(tokenized_texts)
tokenized_texts = tokenized_texts[:, :256]

#2 Make the labels in scop_train_data as a numpy array
"""label_encoder = LabelEncoder()
encoded_labels = label_encoder.fit_transform(signalp_train_data.label)"""
y = signalp_train_data.label
print(y.shape, y)

# 3. Make the layers of the concat_model_SP non-trainableMake the layers of the concat_model_SP non-trainable
for layer in concat_model_SP.layers:
  layer.trainable = False
  
# 4. Add a dense layer with a proper activation to the end of the concat_model_SP  
x = Flatten()(concat_model_SP.output)  # Use Flatten() instead of Reshape()
x = Dense(32, activation='relu')(x)
out = Dense(1, activation='sigmoid')(x)

# 5. Define a new model with the input of concat_model_SP and the output of the dense layer
finetuned_SP_model = Model(concat_model_SP.input, out)   

# 6. Define the loss function and optimizer
loss_fn = keras.losses.BinaryCrossentropy()
optimizer = Adam(learning_rate=0.001) 

# 7. Compile the model with the loss function and optimizer
finetuned_SP_model.compile(loss=loss_fn , optimizer=optimizer, metrics=['accuracy'])
finetuned_SP_model.summary()

#8. Fit the model on the training data for 10 epochs  
finetuned_SP_model.fit(tokenized_texts, y, epochs=epochs)

#9. Make all layers of the model trainable
for layer in finetuned_SP_model.layers:
  layer.trainable = True

finetuned_SP_model.summary()
#10. Fit the model on the training data for 5 epochs
finetuned_SP_model.fit(tokenized_texts, y, epochs=5)

#11. Save the new model in the model directory with the name model_finetuned_SP.h5
finetuned_SP_model.save_weights('model/model_finetuned_SP.h5')

(16606,) 0        0
1        1
2        0
3        0
4        0
        ..
16601    0
16602    1
16603    0
16604    0
16605    0
Name: label, Length: 16606, dtype: int64
Model: "model_3"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
 input_2 (InputLayer)        [(None, 256)]                0         []                            
                                                                                                  
 embedding_1 (Embedding)     (None, 256, 128)             3328      ['input_2[0][0]']             
                                                                                                  
 multi_self_attention_head_  (None, None, 128)            65536     ['embedding_1[0][0]']         
 2 (multiSelfAttentionHead)                                                                       
                    

In [17]:
# please don't change this cell
# this cell is for loading the pretrained model and keep the output of the last layer to be used for finetuning.

model_SC = BERT(vocab_size=vocab_size, embed_size=128, n_layers=2, n_heads=4, max_length=max_length)
model_SC.load_weights('model/model_pretrained.h5')
model_concat_SC = concat_layers_for_finetuning(model_SC)

In [18]:
# Fill this function to fine-tune the model_concat_SC on structure classification task (multi-class classification)
'''
Fine-tune the model on peptide detection task.
Parameters:
    epochs (int): number of epochs
    batch_size (int): batch size 
    lr (float): learning rate
    n_classes (int): number of classes

What you need to do in this cell:
    1. Tokenize the sequences in scop_train_data
    2. Make the labels in scop_train_data as a numpy array
    3. Make the layers of the model_concat_SC non-trainable
    4. Add a dense layer with a proper activation to the end of the model_concat_SC
    5. Define a new model with the input of model_concat_SC and the output of the dense layer
    6. Define the loss function and optimizer
    7. Compile the model with the loss function and optimizer
    8. Fit the model on the training data for 10 epochs
    9. Make all layers of the model trainable
    10. Fit the model on the training data for 5 epochs
    11. Save the new model in the model directory with the name model_finetuned_SC.h5
'''

# YOUR CODE HERE
# 1. Tokenize the sequences in scop_train_data
tokenized_texts = [tokenize_seq(seq) for seq in scop_train_data.seq]
tokenized_texts = np.array(tokenized_texts)
tokenized_texts = tokenized_texts[:, :256]

# 2. Make the labels in scop_train_data as a numpy array
#print(scop_train_data.label)
label_encoder = LabelEncoder()
encoded_labels = label_encoder.fit_transform(scop_train_data.label)
y = np.array(encoded_labels)
#print(y.shape, y)


one_hot_encoded = np.zeros((len(y), 7))
for i, label in enumerate(y):
    one_hot_encoded[i, label] = 1

#print(one_hot_encoded[:10], y[:10])

# 3. Make the layers of the model_concat_SC non-trainable
for layer in model_concat_SC.layers:
  layer.trainable = False

# 4. Add a dense layer with a proper activation to the end of the model_concat_SC
x = Flatten()(model_concat_SC.output)  # Use Flatten() instead of Reshape()
x = Dense(32, activation='relu')(x)
out = Dense(7, activation='softmax')(x)

# 5. Define a new model with the input of model_concat_SC and the output of the dense layer
finetuned_SC_model = Model(model_concat_SC.input, out)


# 6. Define the loss function and optimizer
loss_fn = keras.losses.BinaryCrossentropy()
optimizer = Adam(learning_rate=0.001) 

# 7. Compile the model with the loss function and optimizer
finetuned_SC_model.compile(loss=loss_fn , optimizer=optimizer, metrics=['accuracy'])
finetuned_SC_model.summary()

# 8. Fit the model on the training data for 10 epochs
finetuned_SC_model.fit(tokenized_texts,  one_hot_encoded, epochs=epochs)

# 9. Make all layers of the model trainable 
for layer in finetuned_SC_model.layers:
  layer.trainable = True
finetuned_SC_model.summary()

# 10. Fit the model on the training data for 5 epochs
finetuned_SC_model.fit(tokenized_texts, one_hot_encoded, epochs=5)

# 11. Save the new model in the model directory with the name model_finetuned_SC.h5
finetuned_SC_model.save_weights('model/model_finetuned_SC.h5')


Model: "model_6"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
 input_3 (InputLayer)        [(None, 256)]                0         []                            
                                                                                                  
 embedding_2 (Embedding)     (None, 256, 128)             3328      ['input_3[0][0]']             
                                                                                                  
 multi_self_attention_head_  (None, None, 128)            65536     ['embedding_2[0][0]']         
 4 (multiSelfAttentionHead)                                                                       
                                                                                                  
 add_8 (Add)                 (None, 256, 128)             0         ['embedding_2[0][0]',   

<div style="background-color:#f9f9f9; padding:20px; border:1px solid #ddd; border-radius:8px; font-family:Verdana;">

<h2 style='color:#2c3e50;'><strong>Section 6: Prediction and Evaluation</strong></h2>

<hr style="border-top: 2px solid #3498db; margin-bottom: 20px; margin-top: 20px;">

<ul style='color:#2c3e50;'>
    <li>After finetuning, do <b>prediction</b> task on test data. Then, report <b>f1 score, precision, recall, and accuracy </b> for both datasets.
    You can read more about these metrics <a href="https://scikit-learn.org/stable/modules/model_evaluation.html#classification-metrics">here</a>.</li>
</ul>

</div>

<div style="background-color:#f9f9f9; padding:20px; border:1px solid #ddd; border-radius:8px; font-family:Verdana;">

<h3 style='color:#2c3e50;'><strong>Task 6.1: Prediction </strong></h3>

<p style='color:#2c3e50;'> Make prediction on both test datasets.</p>

</div>

In [19]:
# testing on signal peptide data; save the predicted labels, true labels and sequences to a csv file 
# fill this cell to create y_pred_SP using the model_finetuned_SP

# YOUR CODE HERE
tokenized_texts = [tokenize_seq(seq) for seq in signalp_test_data.seq]
tokenized_texts = np.array(tokenized_texts)
tokenized_texts = tokenized_texts[:, :256]



y_pred_SP = (finetuned_SP_model.predict(tokenized_texts)>=0.5).astype(int)

# save the predicted labels, true labels and sequences to a csv file
signalp_test_data['pred_label'] = y_pred_SP
signalp_test_data.to_csv('data/signal_peptide/signalp_pred.csv', index=False)



In [20]:
# testing on scop data ; save the predicted labels, true labels and sequences to a csv file
# fill this cell to create y_pred_SC using the model_finetuned_SC


# YOUR CODE HERE
tokenized_texts = [tokenize_seq(seq) for seq in scop_test_data.seq]
tokenized_texts = np.array(tokenized_texts)
tokenized_texts = tokenized_texts[:, :256]

y_pred_SC_1 = finetuned_SC_model.predict(tokenized_texts)

#########################################################
class_names = ['a', 'b', 'c', 'd', 'e', 'f', 'g']

y_pred_SC = []
for probabilities in y_pred_SC_1:
    predicted_label = np.argmax(probabilities)
    predicted_class_name = class_names[predicted_label]
    y_pred_SC.append(predicted_class_name)
#########################################################
"""y_pred_SC = []
for probabilities in y_pred_SC_1:
    predicted_label = np.argmax(probabilities)
    y_pred_SC.append(predicted_label)"""

# save the predicted labels, true labels and sequences to a csv file
scop_test_data['pred_label'] = y_pred_SC
scop_test_data.to_csv('data/scop/scop_pred.csv', index=False)



<div style="background-color:#f9f9f9; padding:20px; border:1px solid #ddd; border-radius:8px; font-family:Verdana;">

<h3 style='color:#2c3e50;'><strong>Task 6.2: Evaluation</strong></h3>

<p style='color:#2c3e50;'>Calculate F1 score, precision, recall and accuracy for both test datasets.</p>

</div>

In [21]:
# Fill this cell to evaluate the performance of model_finetuned_SP on signal peptide test data
# You can use sklearn library to calculate these metrics.

# accuracy
# precision
# recall
# f1-score

# YOUR CODE HERE
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Get the predicted labels and true labels
y_true_SP = signalp_test_data.label

# Evaluate the model's performance using accuracy, precision, recall, and F1-score
accuracy = accuracy_score(y_true_SP, y_pred_SP)
precision = precision_score(y_true_SP, y_pred_SP)
recall = recall_score(y_true_SP, y_pred_SP)
f1 = f1_score(y_true_SP, y_pred_SP)

# Print the evaluation metrics
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1-score:", f1)


Accuracy: 0.8398362235067437
Precision: 0.7647058823529411
Recall: 0.019287833827893175
F1-score: 0.03762662807525326


In [22]:
# Fill this cell to evaluate the performance of model_finetuned_SC on scop test data
# You can use sklearn library to calculate these metrics.

# accuracy
# precision
# recall
# f1-score

# YOUR CODE HERE
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Get the predicted labels and true labels
y_true_SC = scop_test_data.label

# Evaluate the model's performance using accuracy, precision, recall, and F1-score
accuracy = accuracy_score(y_true_SC, y_pred_SC)
precision = precision_score(y_true_SC, y_pred_SC, average='macro')
recall = recall_score(y_true_SC, y_pred_SC, average='macro')
f1 = f1_score(y_true_SC, y_pred_SC, average='macro')

# Print the evaluation metrics
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1-score:", f1)

Accuracy: 0.42616679418515685
Precision: 0.32253229557073343
Recall: 0.30447445456855654
F1-score: 0.30581374389072086


  _warn_prf(average, modifier, msg_start, len(result))


<div style="background-color:#f9f9f9; padding:20px; border:1px solid #ddd; border-radius:8px; font-family:Verdana;">

<h3 style='color:#2c3e50;'><strong>Section 7: Experimentation</strong></h3>

<p style='color:#2c3e50;'>In the realm of natural language processing, one essential aspect is the reliance on vast datasets for training language models. With limited access to resources like GPUs or memory, we cannot pretrain our own model in this assignment on a huge dataset. Instead, we can use available pretrained models for protein sequences to check if a LLM can improve the performance. In this section of our study, we aim to utilize a pre-trained language model named <b>DistilProtBert</b>, which is accessible through the popular <b>Transformers</b> library. 
The key objective here is to employ this pre-trained DistilProtBert model and fine-tune it, just as we did in the previous section and compare the performance of the fine-tuned model with and without using pretrained model.
Once you have completed the fine-tuning process, it is important to analyze and record any noticeable differences in the model's performance. This entails a comprehensive evaluation of how the fine-tuned model compares to its performance before fine-tuning. You should pay attention to various metrics such as accuracy, precision, recall, and F1-score, among others, to understand the overall impact of the fine-tuning process.
In this section, you are free to use any libraries or pretrained models. However, the smallest pretrained model available for preotein sequences is DistilProtBert. 
</p>
<p>
You can read more about DistilProtBert <a href="https://huggingface.co/yarongef/DistilProtBert">here</a>.
<p> What you need to do in this section is: </p>
<p>
1. Select one of two finetuning tasks from data directory.</p>
<p>
2. Finetune DistilProtBert on the selected dataset and report the performance of the model on test data. You need to report <b>f1 score, precision, recall, and accuracy </b>. </p>
<p>
3. Train your classifier of previous step without using pretrained model (DisilProtBert). Report the performance of the model on test data. You need to report <b>f1 score, precision, recall, and accuracy </b>.
</p>
<p>
4. Compare the performance of the two models and report your findings.
</p>
</div>

In [23]:
%pip install transformers[torch]
%pip install accelerate -U

Note: you may need to restart the kernel to use updated packages.



[notice] A new release of pip is available: 23.2.1 -> 23.3.1
[notice] To update, run: python.exe -m pip install --upgrade pip


Note: you may need to restart the kernel to use updated packages.



[notice] A new release of pip is available: 23.2.1 -> 23.3.1
[notice] To update, run: python.exe -m pip install --upgrade pip


In [24]:
from transformers import DistilBertModel, DistilBertTokenizer, \
Trainer, TrainingArguments, DistilBertConfig
import torch
from torch.utils.data import Dataset, DataLoader
import torch.nn.functional as F
import pandas as pd
import numpy as np
import torch.nn as nn

# Load the pretrained model and tokenizer from yarongef/DistilProtBert
tokenizer = DistilBertTokenizer.from_pretrained("yarongef/DistilProtBert", do_lower_case=False)
model_distilprotbert = DistilBertModel.from_pretrained("yarongef/DistilProtBert")

  from .autonotebook import tqdm as notebook_tqdm
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'BertTokenizer'. 
The class this function is called from is 'DistilBertTokenizer'.
You are using a model of type bert to instantiate a model of type distilbert. This is not supported for all configurations of models and can yield errors.
Some weights of DistilBertModel were not initialized from the model checkpoint at yarongef/DistilProtBert and are newly initialized: ['transformer.layer.14.sa_layer_norm.weight', 'transformer.layer.5.attention.out_lin.bias', 'transformer.layer.11.sa_layer_norm.bias', 'transformer.layer.10.ffn.lin1.bias', 'transformer.layer.4.sa_layer_norm.weight', 'transformer.layer.8.sa_layer_norm.bias', 'transformer.layer.0.attention.out_lin.weight', 'transformer.layer.1.sa_layer_norm.bias', 'transformer.layer.1

In [25]:
# Put your code here for fine-tuning and evaluating DistilProtBert on whichever task you choose

# YOUR CODE HERE
from sklearn.preprocessing import LabelEncoder
from torch.utils.data import TensorDataset

sequences = signalp_train_data['seq'] #scop_train_data['seq']
labels =signalp_train_data['label'] #scop_train_data['label']

train_df = pd.DataFrame({
    'text': sequences,
    'labels': labels
})
train_df['sequence_length'] = train_df['text'].apply(lambda x: len(x))
max_sequence_length = train_df['sequence_length'].max()

label_encoder = LabelEncoder()
train_df['encoded_labels'] = label_encoder.fit_transform(train_df['labels'])
labels_tensor = torch.tensor(train_df['encoded_labels'].tolist())
train_encodings = tokenizer(train_df['text'].tolist(), truncation=True, max_length=max_sequence_length, padding=True, return_tensors='pt')
train_dataset = TensorDataset(train_encodings['input_ids'], train_encodings['attention_mask'], labels_tensor)
train_dataloader = DataLoader(train_dataset, batch_size=32, shuffle=True)


class Classifier(nn.Module):

    def __init__(self, input_size, output_size, dropout_rate=0.2):

        super(Classifier, self).__init__()

        # Add additional linear layers
        self.linear1 = nn.Linear(input_size, 512)
        self.linear2 = nn.Linear(512, 256)
        self.linear3 = nn.Linear(256, 128)

        # Add dropout layers after each linear layer
        self.dropout1 = nn.Dropout(p=dropout_rate)
        self.dropout2 = nn.Dropout(p=dropout_rate)
        self.dropout3 = nn.Dropout(p=dropout_rate)

        # Add final output layer
        self.linear_out = nn.Linear(128, output_size)

    def forward(self, x):

        # Pass through the additional linear layers with ReLU activation and dropout
        x = F.relu(self.linear1(x))
        x = self.dropout1(x)
        x = F.relu(self.linear2(x))
        x = self.dropout2(x)
        x = F.relu(self.linear3(x))
        x = self.dropout3(x)

        # Output layer
        logits = self.linear_out(x)

        return logits

class FullModel(nn.Module):
    def __init__(self, model, classifier):
        super(FullModel, self).__init__()
        self.model = model
        self.classifier = classifier

    def forward(self, input_ids, attention_mask):
        outputs = self.model(input_ids=input_ids, attention_mask=attention_mask)
        last_hidden_state = outputs.last_hidden_state
        pooled_output = torch.mean(last_hidden_state, dim=1)
        pooled_output = pooled_output.reshape(-1, 1024)
        logits = self.classifier(pooled_output)
        return logits

input_size = 1024
output_size = 2
classifier = Classifier(input_size, output_size)

full_model = FullModel(model_distilprotbert, classifier)

optimizer = torch.optim.Adam(full_model.parameters(), lr=5e-5)
loss_fn = torch.nn.CrossEntropyLoss()

for epoch in range(10):
    full_model.train()
    for batch in train_dataloader:
        optimizer.zero_grad()
        input_ids = batch[0]
        attention_mask = batch[1]
        labels = batch[2]
        logits = full_model(input_ids, attention_mask)
        loss = loss_fn(logits, labels)
        loss.backward()
        optimizer.step()

##################################################################Testing#################################################
full_model.eval()
predictions = []
##########################################
sequences = signalp_test_data['seq']
labels = signalp_test_data['label']

test_df = pd.DataFrame({
    'text': sequences,
    'labels': labels
})
label_encoder = LabelEncoder()
test_df['encoded_labels'] = label_encoder.fit_transform(test_df['labels'])
labels_tensor = torch.tensor(test_df['encoded_labels'].tolist())
test_encodings = tokenizer(test_df['text'].tolist(), truncation=True, padding=True, return_tensors='pt')
test_dataset = TensorDataset(test_encodings['input_ids'], test_encodings['attention_mask'])
test_dataloader = DataLoader(test_dataset, batch_size=32, shuffle=False)
########################################

# Iterate over the data in batches
for batch in test_dataloader:
    input_ids = batch[0]
    attention_mask = batch[1]
    logits = full_model(input_ids, attention_mask)
    predicted_labels = torch.argmax(logits, dim=1)
    predictions.extend(predicted_labels.tolist())

from sklearn.metrics import accuracy_score
true_labels = labels_tensor
accuracy = accuracy_score(true_labels, predictions)
print(f"Accuracy: {accuracy}")

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Accuracy: 0.8376685934489403


In [26]:
# Put your code here for fine-tuning and evaluating the classifer without pretraining on whichever task you choose

# YOUR CODE HERE
from torch.nn.utils.rnn import pad_sequence
from sklearn.preprocessing import LabelEncoder
from torch.utils.data import TensorDataset

train_sequences = signalp_train_data['seq']
train_labels = signalp_train_data['label']

test_sequences = signalp_test_data['seq']
test_labels = signalp_test_data['label']

train_df = pd.DataFrame({
    'text': train_sequences,
    'labels': train_labels
})
# Encode labels 
label_encoder = LabelEncoder()
train_encoded_labels = label_encoder.fit_transform(train_labels)
test_encoded_labels = label_encoder.fit_transform(test_labels)

# Split the dataset into training and validation sets
train_sequences = train_sequences.tolist()
val_sequences = test_sequences.tolist()
train_labels = train_encoded_labels.tolist()
val_labels = test_encoded_labels.tolist()
# Tokenize the data
train_df['sequence_length'] = train_df['text'].apply(lambda x: len(x))
max_sequence_length = train_df['sequence_length'].max()

tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
train_encodings = tokenizer(train_sequences, truncation=True, padding=True, return_tensors='pt', max_length=max_sequence_length)

# Create DataLoader for training and validation sets
train_labels_tensor = torch.tensor(train_labels)
train_dataset = TensorDataset(train_encodings['input_ids'], train_encodings['attention_mask'], train_labels_tensor)
train_dataloader = DataLoader(train_dataset, batch_size=72, shuffle=True)

#padded_val_sequences = pad_sequence([torch.tensor(seq) for seq in val_sequences], batch_first=True, padding_value=0)
#tensor_val_sequences = [torch.tensor(seq) for seq in val_sequences]

desired_sequence_length = max_sequence_length

val_encodings = tokenizer(val_sequences, truncation=True, padding='max_length', return_tensors='pt', max_length=desired_sequence_length)
val_labels_tensor = torch.tensor(val_labels)
val_dataset = TensorDataset(val_encodings['input_ids'], val_encodings['attention_mask'], val_labels_tensor)
val_dataloader = DataLoader(val_dataset, batch_size=72, shuffle=False)

# Define your classifier architecture
class SimpleClassifier(nn.Module):
    def __init__(self, input_size, output_size):
        super(SimpleClassifier, self).__init__()
        self.linear = nn.Linear(input_size, output_size)

    def forward(self, x):
        x = x.float()
        return self.linear(x)

      
input_size = len(train_sequences[0]) # Update with the actual input size
output_size = len(set(train_labels))  # Assuming a classification task
classifier = SimpleClassifier(input_size, output_size)



# Set up optimizer and loss function
optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

# Training loop
num_epochs = 10  # Adjust as needed

import tqdm as tqdm
for epoch in range(num_epochs):
    classifier.train()
    for batch in tqdm.tqdm(train_dataloader, desc=f'Epoch {epoch + 1}/{num_epochs}'):
        optimizer.zero_grad()
        input_ids, attention_mask, labels = batch
        logits = classifier(input_ids)
        loss = loss_fn(logits, labels)
        loss.backward()
        optimizer.step()


classifier.eval()
with torch.no_grad():
    correct = 0
    total = 0
    for batch in tqdm.tqdm(val_dataloader, desc='Validation'):
        input_ids, attention_mask, labels = batch
        logits = classifier(input_ids)
        predicted_labels = torch.argmax(logits, dim=1)
        total += labels.size(0)
        correct += (predicted_labels == labels).sum().item()

accuracy = correct / total
print(f'Validation Accuracy: {accuracy * 100:.2f}%')

Epoch 1/10: 100%|██████████| 231/231 [00:00<00:00, 454.21it/s]
Epoch 2/10: 100%|██████████| 231/231 [00:00<00:00, 488.91it/s]
Epoch 3/10: 100%|██████████| 231/231 [00:00<00:00, 422.99it/s]
Epoch 4/10: 100%|██████████| 231/231 [00:00<00:00, 432.29it/s]
Epoch 5/10: 100%|██████████| 231/231 [00:00<00:00, 428.14it/s]
Epoch 6/10: 100%|██████████| 231/231 [00:00<00:00, 472.25it/s]
Epoch 7/10: 100%|██████████| 231/231 [00:00<00:00, 471.35it/s]
Epoch 8/10: 100%|██████████| 231/231 [00:00<00:00, 491.22it/s]
Epoch 9/10: 100%|██████████| 231/231 [00:00<00:00, 479.61it/s]
Epoch 10/10: 100%|██████████| 231/231 [00:00<00:00, 488.14it/s]
Validation: 100%|██████████| 58/58 [00:00<00:00, 929.19it/s]

Validation Accuracy: 80.80%





In [43]:
# Write your report here.


Data and Model Architecture->
The training data consists of sequences from the SignalP dataset. These sequences are tokenized using the Hugging Face tokenizer with specified truncation and padding. The tokenized sequences are then used to train a classifier. The classifier is a neural network model with multiple linear layers and dropout for regularization. The architecture of the model is as follows:

Input Layer: The size of the input layer is determined by the length of the tokenized sequences.
Linear Layers: There are three linear layers with ReLU activation functions and dropout layers in between.
Output Layer: The final linear layer produces logits for binary classification.
The model is trained using cross-entropy loss and the Adam optimizer. Training is performed for a specified number of epochs.

Model Training
The training process involves iterating through batches of the training data, computing the model's logits, calculating the loss, and updating the model's parameters using backpropagation. This process is repeated for multiple epochs.

Model Evaluation
After training, the model is evaluated on a validation dataset. During validation, the model's accuracy is computed by comparing its predictions to the ground truth labels. The validation accuracy provides an indication of how well the model generalizes to unseen data.

Test Set Evaluation
The trained model is then evaluated on a separate test set. The test set consists of sequences and labels from the SignalP dataset. The sequences are tokenized, and predictions are made using the trained model. The accuracy of the model on the test set is calculated by comparing the predicted labels to the true labels.

Results
The validation accuracy is reported to be approximately 70.28%, indicating how well the model performs on the validation dataset.

For the test set, the model achieves an accuracy of 83.77%. This metric provides an estimate of the model's performance on new, unseen data.

