<p style="font-size: 27px;"> 🎬 <b>Project 3: Sentiment analysis with a transformer</b></p>
<p style="font-size: 16px;"> In this project, we perform sentiment analysis on IMDB, a large movie review dataset that pairs each review with a binary sentiment label (1 indicates a positive review and 0 indicates a negative review). We want to train a simple transformer on the data and then use it to predict the sentiment of our own movie reviews! </p>

<p style="font-size: 16px;"> Submission note: We grade the content in this notebook. Make sure outputs are present. You will package everything in a zip file and submit it with the following format:
f"P3_{LastName}_{FirstName}.V{version_number}.zip", e.g. "P3_Smit_John_V1.zip". </p>

<p style="font-size: 16px;"> For the questions in the comments, there is no need to write down the answers in the notebook. You just need to prepare the answers for the project interview. Please note that the interview will cover not just the comment questions, but also your overall understanding of the project and the techniques used. Make sure you understand what each step of the code is doing.</p>

<p style="font-size: 25px;"> Task 0: Setup environment</p>
<p style="font-size: 16px;"> Please install the necessary packages in your environment and import the following libraries. </p>
<p style="font-size: 16px;"> The packages we will need: pytorch, numpy, tqdm, transformers and datasets. </p>

In [60]:
import datasets
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import tqdm
import transformers

In [2]:
seed = 0

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')  # You can switch to 'cpu' for debugging

np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed(seed)
torch.backends.cudnn.deterministic = True

<p style="font-size: 25px;"> Task 1: Load dataset (5pt)</p>
<p style="font-size: 16px;"> Now we need to load the IMDB dataset and set up the dataloaders. </p>

In [3]:
train_data, test_data = datasets.load_dataset("imdb", split=["train", "test"])

In [4]:
# Let's have a look at one example from the dataset
print(train_data[0])
# How many samples are there in the training dataset?
print(len(train_data))


{'text': 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far be

In [5]:
# In the raw dataset, the review texts are strings. We use a tokenizer to convert them to integer token IDs.
tokenizer_name = "bert-base-uncased"
tokenizer = transformers.AutoTokenizer.from_pretrained(tokenizer_name)

In [6]:
# What is stored in the tokenizer? What do the printed fields / parameters mean?
tokenizer

BertTokenizerFast(name_or_path='bert-base-uncased', vocab_size=30522, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=False, added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}
)

In [7]:
def tokenize_and_numericalize_example(example, tokenizer):
    ids = tokenizer(example["text"], truncation=True)["input_ids"]
    return {"ids": ids}

train_data = train_data.map(
    tokenize_and_numericalize_example, fn_kwargs={"tokenizer": tokenizer}
)
test_data = test_data.map(
    tokenize_and_numericalize_example, fn_kwargs={"tokenizer": tokenizer}
)

In [8]:
# What does the processed training data look like?
train_data[0]

{'text': 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far be

In [9]:
valid_size = 0.25

train_valid_data = train_data.train_test_split(test_size=valid_size)
train_data = train_valid_data["train"]
valid_data = train_valid_data["test"]

# We need to convert column 'ids' and 'label' to torch format
train_data = train_data.with_format(type="torch", columns=["ids", "label"])  
valid_data = valid_data.with_format(type="torch", columns=["ids", "label"])
test_data = test_data.with_format(type="torch", columns=["ids", "label"])

def get_collate_fn(pad_index):
    def collate_fn(batch):
        batch_ids = [i["ids"] for i in batch]
        batch_ids = nn.utils.rnn.pad_sequence(
            batch_ids, padding_value=pad_index, batch_first=True
        )
        batch_label = [i["label"] for i in batch]
        batch_label = torch.stack(batch_label)
        batch = {"ids": batch_ids, "label": batch_label}
        return batch

    return collate_fn

def get_data_loader(dataset, batch_size, pad_index, shuffle=False):
    collate_fn = get_collate_fn(pad_index)
    # data_loader = # TODO: Please define the DataLoader here. You need to use the collate_fn and set shuffle.
    data_loader = torch.utils.data.DataLoader(
        dataset, 
        batch_size=batch_size, 
        collate_fn=collate_fn, 
        shuffle=shuffle
    )
    return data_loader
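
<p style="font-size: 16px;"> To make the role of the collate function concrete, here is a minimal sketch of what padding does to a batch of reviews with different lengths. The token IDs and <code>pad_index = 0</code> are made-up toy values for illustration (the notebook reads the real value from <code>tokenizer.pad_token_id</code>). </p>

```python
import torch
import torch.nn as nn

pad_index = 0  # assumed toy value; the notebook uses tokenizer.pad_token_id

# Three toy "reviews" of different lengths, already converted to token IDs.
batch = [
    {"ids": torch.tensor([101, 2023, 102]), "label": torch.tensor(1)},
    {"ids": torch.tensor([101, 2023, 3185, 2001, 102]), "label": torch.tensor(0)},
    {"ids": torch.tensor([101, 102]), "label": torch.tensor(1)},
]

# Pad every sequence on the right up to the longest one in the batch.
batch_ids = nn.utils.rnn.pad_sequence(
    [item["ids"] for item in batch], padding_value=pad_index, batch_first=True
)
# Labels are scalars, so stacking gives a 1-D tensor of length batch_size.
batch_label = torch.stack([item["label"] for item in batch])

print(batch_ids.shape)  # (batch_size, longest sequence in this batch)
print(batch_ids)
print(batch_label)
```

<p style="font-size: 16px;"> Because padding is done per batch, each batch is only as wide as its longest review rather than the global maximum length. </p>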

In [10]:
# batch_size = # TODO: Please set a proper batch size
batch_size = 64
pad_index = tokenizer.pad_token_id
train_data_loader = get_data_loader(train_data, batch_size, pad_index, shuffle=True)
valid_data_loader = get_data_loader(valid_data, batch_size, pad_index)
test_data_loader = get_data_loader(test_data, batch_size, pad_index)
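
<p style="font-size: 16px;"> As a quick sanity check on the batch size, the number of batches per epoch can be worked out by hand. The sketch below assumes the standard IMDB split sizes (25,000 training and 25,000 test reviews) and the 25% validation split used above; <code>num_batches</code> is a helper name introduced only for this illustration. </p>

```python
import math

n_train_full = 25_000  # assumed size of the IMDB training split
n_test = 25_000        # assumed size of the IMDB test split
valid_size = 0.25
batch_size = 64

n_valid = int(n_train_full * valid_size)
n_train = n_train_full - n_valid


def num_batches(n_samples, batch_size):
    # The last batch may be smaller, so round up.
    return math.ceil(n_samples / batch_size)


print("train batches:", num_batches(n_train, batch_size))
print("valid batches:", num_batches(n_valid, batch_size))
print("test batches: ", num_batches(n_test, batch_size))
```

<p style="font-size: 16px;"> With these assumptions the counts come out to 293, 98 and 391, which is what the progress bars of the first training run report. </p>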

<p style="font-size: 25px;"> Task 2: Implement the transformer (55pt)</p>
<p style="font-size: 16px;"> Now it's time to implement a basic transformer by ourselves. We will focus on implementing the ⚡attention⚡ mechanism.</p>

In [11]:
class Embedding(nn.Module):
    def __init__(self, vocab_size, max_length, embed_dim, dropout=0.1):
        super(Embedding, self).__init__()
        # self.word_embed = # TODO: Please define the word embedding layer using vocab_size and embed_dim.
        self.word_embed = nn.Embedding(vocab_size, embed_dim)
        # self.pos_embed = # TODO: Please define the positional embedding layer using max_length and embed_dim.
        self.pos_embed = nn.Embedding(max_length, embed_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        batch_size, seq_length = x.shape
        # positions = # TODO: Fill in the blanks. Hint: positions = torch.arange(0, ___).expand(batch_size, ___).to(device)
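        # Create the positions on the same device as the input instead of relying on the global `device`.
        positions = torch.arange(0, seq_length, device=x.device).expand(batch_size, seq_length)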
        # embedding = # TODO: Compute the final embedding.
        embedding = self.word_embed(x) + self.pos_embed(positions)
        return self.dropout(embedding)

In [12]:
def scaled_dot_product_attention(Q, K, V, mask=None):
    # d_k = # TODO: Key dimension
    d_k = K.size(-1)
    # scores = # TODO: Please compute the attention scores
    scores = torch.matmul(Q, K.transpose(-2, -1)) / (d_k ** 0.5)
    # Why is the masking implemented in this way?
    # Answer: exp(-inf) = 0, so setting masked scores to -inf gives those positions zero weight after softmax

    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))
    attn_weights = F.softmax(scores, dim=-1)
    # output = # TODO: Please compute the attention output
    output = torch.matmul(attn_weights, V)
    return output, attn_weights

class MultiHeadAttention(torch.nn.Module):
    def __init__(self, embed_dim, num_heads):
        super(MultiHeadAttention, self).__init__()
        self.embed_dim = embed_dim
        self.num_heads = num_heads
        assert embed_dim % num_heads == 0  # Ensure the number of heads divides evenly into embedding dimension
        # self.head_dim = # TODO: Please compute the head dimension
        self.head_dim = embed_dim // num_heads

        # Linear layers to project the queries, keys, and values
        # Please define the linear layers for W_q, W_k, and W_v using embed_dim
        # self.W_q = # TODO: 
        # self.W_k = # TODO: 
        # self.W_v = # TODO: 
        self.W_q =  nn.Linear(embed_dim, embed_dim) 
        self.W_k =  nn.Linear(embed_dim, embed_dim) 
        self.W_v =  nn.Linear(embed_dim, embed_dim) 
        # Final output projection
        # Please define the linear layer for W_o using embed_dim
        # self.W_o = # TODO: 
        self.W_o = nn.Linear(embed_dim, embed_dim) 
    def forward(self, Q, K, V, mask=None):
        batch_size = Q.size(0)

        # Step 1: Project Q, K, V using the linear layers
        # Please fill in the blanks. What should be the shape of Q, K, and V after the linear layers?
        # Q = # TODO:
        # K = # TODO:
        # V = # TODO:
        Q = self.W_q(Q)
        K = self.W_k(K)
        V = self.W_v(V)
        # Step 2: Split the projections into multiple heads
        # Please fill in the blanks. What should be the shape of Q, K, and V after the split?
        # Q = # TODO:
        # K = # TODO:
        # V = # TODO:
        Q = Q.view(batch_size, -1, self.num_heads, self.head_dim).transpose(1, 2)
        K = K.view(batch_size, -1, self.num_heads, self.head_dim).transpose(1, 2)
        V = V.view(batch_size, -1, self.num_heads, self.head_dim).transpose(1, 2)

        # Step 3: Apply scaled dot-product attention
        attn_output, attn_weights = scaled_dot_product_attention(Q, K, V, mask)

        # Step 4: Concatenate the attention output from all heads
        attn_output = attn_output.transpose(1, 2).contiguous().view(batch_size, -1, self.embed_dim)  # (batch_size, seq_len, embed_dim)

        # Step 5: Project the concatenated output back to the original embedding dimension
        # Please fill in the blanks. What should be the shape of attn_output after the linear layer?
        # output = # TODO:
        output = self.W_o(attn_output)
        return output
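
<p style="font-size: 16px;"> The shape questions in the comments above can be checked on a toy tensor. This is a small sketch with made-up sizes (batch of 2, sequence length 5, embedding dimension 8, 4 heads); it only traces the reshaping and omits the learned projections. </p>

```python
import torch

batch_size, seq_len, embed_dim, num_heads = 2, 5, 8, 4  # toy sizes
head_dim = embed_dim // num_heads

x = torch.randn(batch_size, seq_len, embed_dim)

# Split: (batch, seq, embed) -> (batch, seq, heads, head_dim) -> (batch, heads, seq, head_dim)
heads = x.view(batch_size, -1, num_heads, head_dim).transpose(1, 2)

# Each head compares every query position with every key position:
# (batch, heads, seq, head_dim) @ (batch, heads, head_dim, seq) -> (batch, heads, seq, seq)
scores = torch.matmul(heads, heads.transpose(-2, -1)) / head_dim ** 0.5

# Merge: undo the transpose and flatten the heads back into the embedding dimension.
merged = heads.transpose(1, 2).contiguous().view(batch_size, -1, embed_dim)

print(heads.shape, scores.shape, merged.shape)
```

<p style="font-size: 16px;"> Splitting and merging are pure reshapes, so <code>merged</code> is identical to <code>x</code> here; the <code>.contiguous()</code> call is needed because <code>view</code> cannot operate on the transposed memory layout. </p>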

In [13]:
# What does the transformer encoder look like?
# You can modify the architecture of the encoder as you need.
class TransformerEncoder(nn.Module):
    def __init__(self, embed_dim, num_heads, forward_expansion, dropout):
        super(TransformerEncoder, self).__init__()

        self.attention = MultiHeadAttention(embed_dim, num_heads)
        # self.norm1 = # TODO: Please define a LayerNorm layer
        # self.norm2 = # TODO: Please define a LayerNorm layer
        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)

        # self.feed_forward = # TODO: Please define a feed-forward network here with at least two nn.Linear layers.
        # Please use forward_expansion * embed_dim as the hidden layer dimension.
        self.feed_forward = nn.Sequential(
            nn.Linear(embed_dim, forward_expansion *embed_dim), 
            nn.ReLU(), 
            nn.Linear(forward_expansion*embed_dim, embed_dim)
        )
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # attention_out = # TODO: Please compute the output of the multi-head attention layer using self.attention and self.dropout.
        attention_out = self.dropout(self.attention(x, x, x))
        # x =  # TODO: Please apply the residual connection and normalization.
        x = self.norm1(attention_out + x)
        forward_out = self.dropout(self.feed_forward(x))
        # out =  # TODO: Please apply the residual connection and normalization. 
        out = self.norm2(forward_out + x)

        return out
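
<p style="font-size: 16px;"> Note that the encoder above calls <code>self.attention(x, x, x)</code> without a mask, so padding tokens take part in attention. One possible way to build a padding mask that matches the <code>mask == 0</code> convention of <code>scaled_dot_product_attention</code> is sketched below. The token IDs, <code>pad_index = 0</code> and the random scores are toy assumptions, and threading the mask through the encoder is left out. </p>

```python
import torch
import torch.nn.functional as F

pad_index = 0  # assumed toy value; the notebook uses tokenizer.pad_token_id
ids = torch.tensor([[101, 2023, 102, 0, 0],
                    [101, 2023, 3185, 2001, 102]])

# True for real tokens, False for padding.
# Shape (batch, 1, 1, seq_len) broadcasts over heads and over query positions.
mask = (ids != pad_index)[:, None, None, :]

num_heads, seq_len = 2, ids.size(1)
scores = torch.randn(ids.size(0), num_heads, seq_len, seq_len)  # stand-in for Q K^T / sqrt(d_k)

# Same masking step as in scaled_dot_product_attention.
scores = scores.masked_fill(mask == 0, float("-inf"))
attn_weights = F.softmax(scores, dim=-1)

print(attn_weights[0, 0])  # the last two columns (padding keys) are all zeros
```

<p style="font-size: 16px;"> Every row still sums to 1, but the weight is distributed only over real tokens. This works as long as each sequence contains at least one non-padding token; a fully masked row would produce NaN after softmax. </p>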

In [62]:
class Transformer(nn.Module):
    def __init__(self, vocab_size, max_length, embed_dim,
                num_heads, forward_expansion, dropout):
        super(Transformer, self).__init__()

        self.embedder = Embedding(vocab_size, max_length, embed_dim)
        self.encoder = TransformerEncoder(embed_dim, num_heads, forward_expansion, dropout)
        # self.fc = # TODO: Please define the final fully connected layer.
        self.fc = nn.Linear(embed_dim, 1)  # Binary classification output (a single logit)
    def forward(self, x):
        embedding = self.embedder(x)
        # encoding = # TODO: Compute the encoding using encoder
        encoding = self.encoder(embedding)
        # Is the max pooling a good choice here? Why? Or what should be used instead?
        # Answer: Max pooling captures the most salient features. Alternatives:
        # Mean pooling: encoding.mean(dim=1) - captures average representation
        # Attention pooling - learnable weighted sum
        compact_encoding = encoding.max(dim=1)[0]

        out = self.fc(compact_encoding)
        return out
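
<p style="font-size: 16px;"> For the pooling question above, here is a small sketch of two alternatives that ignore padding positions: a masked mean and a masked max. The tensors are hand-written toy values and <code>pad_index = 0</code> is an assumption; this is one possible variant, not the notebook's model. </p>

```python
import torch

pad_index = 0  # assumed toy value
ids = torch.tensor([[5, 6, 0, 0],
                    [7, 8, 9, 4]])
# Toy encoder output of shape (batch=2, seq=4, embed=2).
# The 9.0 entries sit on padding positions of the first sequence.
encoding = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [9.0, 9.0], [9.0, 9.0]],
                         [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0], [5.0, 5.0]]])

keep = (ids != pad_index).unsqueeze(-1)  # (batch, seq, 1), True for real tokens

# Masked mean: sum over real tokens only, divide by the number of real tokens.
summed = (encoding * keep).sum(dim=1)
counts = keep.sum(dim=1).clamp(min=1)
mean_pooled = summed / counts

# Masked max: padding positions are set to -inf so they can never win.
max_pooled = encoding.masked_fill(~keep, float("-inf")).max(dim=1)[0]

# Plain max pooling, as in the model above, also looks at padding positions.
plain_max = encoding.max(dim=1)[0]

print(mean_pooled)
print(max_pooled)
print(plain_max)
```

<p style="font-size: 16px;"> In this toy case the plain max for the first sequence is taken from a padding position, while both masked variants only see the two real tokens. </p>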

<p style="font-size: 25px;"> Task 3: Prepare training loop (25pt)</p>
<p style="font-size: 16px;"> In this task, we will set up the training hyperparameters and the training loop.</p>

In [16]:
# What are the meanings of the following parameters?
# embedding_dim = # TODO: Please set the embedding dimension
# num_heads = # TODO: Please set the number of heads
# dropout = # TODO: Please set a dropout rate
embedding_dim = 256 # Dimension of token embeddings (controls model capacity)
num_heads = 8  # Number of attention heads (must divide embedding_dim evenly)
dropout = 0.3  # Dropout rate for regularization (prevents overfitting)
forward_expansion = 3
max_length = 512
vocab_size = 35000

model = Transformer(
    vocab_size, max_length, embedding_dim, num_heads, forward_expansion, dropout)
model = model.to(device)

In [17]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f"The model has {count_parameters(model):,} trainable parameters")

The model has 9,749,761 trainable parameters
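
<p style="font-size: 16px;"> The reported parameter count can be reproduced by hand from the hyperparameters, which is a useful check of what each layer contributes. The sketch below uses plain arithmetic and a helper, <code>linear_params</code>, introduced only for this illustration. </p>

```python
vocab_size, max_length, embed_dim = 35000, 512, 256
forward_expansion = 3
hidden_dim = forward_expansion * embed_dim


def linear_params(n_in, n_out):
    # Weight matrix plus bias vector of an nn.Linear layer.
    return n_in * n_out + n_out


counts = {
    "word_embed": vocab_size * embed_dim,
    "pos_embed": max_length * embed_dim,
    "attention (W_q, W_k, W_v, W_o)": 4 * linear_params(embed_dim, embed_dim),
    "layer norms (norm1, norm2)": 2 * (2 * embed_dim),  # scale and shift each
    "feed_forward": linear_params(embed_dim, hidden_dim)
    + linear_params(hidden_dim, embed_dim),
    "fc": linear_params(embed_dim, 1),
}
total = sum(counts.values())

for name, n in counts.items():
    print(f"{name:32s} {n:>10,}")
print(f"{'total':32s} {total:>10,}")
```

<p style="font-size: 16px;"> The total matches the 9,749,761 printed above. About 92% of the parameters (8,960,000) sit in the word embedding table; the attention and feed-forward layers are comparatively small. </p>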


In [18]:
# lr = # TODO: Please set a proper learning rate
# optimizer = # TODO: Please define an optimizer
# criterion = # TODO: What loss function should we use?
lr = 1e-4  # Learning rate (controls step size in optimization)
optimizer = optim.Adam(model.parameters(), lr=lr)
criterion = nn.BCEWithLogitsLoss()  # Binary Cross Entropy with Logits 
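
<p style="font-size: 16px;"> To see what a loss on raw logits computes, here is a plain-Python sketch of binary cross-entropy for a single example. It uses a standard numerically stable formulation and compares it with the naive "sigmoid, then log" version; it is an illustration of the math, not PyTorch's actual implementation. </p>

```python
import math


def bce_with_logits(logit, label):
    """Numerically stable binary cross-entropy on a raw logit."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


def naive_bce(logit, label):
    """Same quantity via an explicit sigmoid; breaks down for large |logit|."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(label * math.log(p) + (1 - label) * math.log(1 - p))


# Both versions agree for moderate logits.
print(bce_with_logits(2.0, 1.0), naive_bce(2.0, 1.0))

# A logit of 0 means "50/50", which costs ln(2) regardless of the label.
print(bce_with_logits(0.0, 1.0), math.log(2))

# The stable version still works where the naive one would hit log(0).
print(bce_with_logits(100.0, 0.0))
```

<p style="font-size: 16px;"> The value ln(2) ≈ 0.693 is a handy reference point: a loss close to it means the model is still guessing at roughly chance level. </p>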

In [19]:
def binary_accuracy(preds, labels):
    rounded_preds = torch.round(torch.sigmoid(preds))
    correct_preds = (rounded_preds == labels).float()
    # acc = # TODO: accuracy should be a ratio of correct predictions to total predictions
    acc = correct_preds.sum() / len(correct_preds)

    return acc

In [20]:
def train(data_loader, model, criterion, optimizer, device):
    model.train()
    epoch_losses = []
    epoch_accs = []
    for batch in tqdm.tqdm(data_loader, desc="training..."):
        optimizer.zero_grad()
        ids = batch["ids"].to(device)
        label = batch["label"].to(device)
        # TODO: Please complete the training loop
        prediction = model(ids).squeeze(1)

        # loss = # Please compute the loss
        loss = criterion(prediction, label.float())

        # accuracy = # Please use binary_accuracy function to compute the accuracy
        accuracy = binary_accuracy(prediction, label)
        
        loss.backward()
        optimizer.step()
        
        epoch_losses.append(loss.item())
        epoch_accs.append(accuracy.item())
    return np.mean(epoch_losses), np.mean(epoch_accs)

In [21]:
def evaluate(data_loader, model, criterion, device):
    model.eval()
    epoch_losses = []
    epoch_accs = []
    with torch.no_grad():
        for batch in tqdm.tqdm(data_loader, desc="evaluating..."):
            ids = batch["ids"].to(device)
            label = batch["label"].to(device)
            # TODO: Please complete the evaluation loop. You need to compute the loss and accuracy
            prediction = model(ids).squeeze(1)

            # loss = # Please compute the loss
            loss = criterion(prediction, label.float())

            # accuracy = # Please use binary_accuracy function to compute the accuracy
            accuracy = binary_accuracy(prediction, label)

            epoch_losses.append(loss.item())
            epoch_accs.append(accuracy.item())
    return np.mean(epoch_losses), np.mean(epoch_accs)
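
<p style="font-size: 16px;"> Both loops above average per-batch values with <code>np.mean</code>, which gives the last, usually smaller, batch the same weight as a full one. The sketch below contrasts this with a sample-weighted average on made-up numbers; the function names are introduced only for this illustration. </p>

```python
def mean_of_batch_means(batch_accs):
    # What np.mean(epoch_accs) computes: every batch counts equally.
    return sum(batch_accs) / len(batch_accs)


def sample_weighted_mean(batch_accs, batch_sizes):
    # Every sample counts equally, regardless of which batch it was in.
    total = sum(batch_sizes)
    return sum(acc * n for acc, n in zip(batch_accs, batch_sizes)) / total


# Toy epoch: two full batches and one small final batch.
batch_sizes = [64, 64, 22]
batch_accs = [0.5, 0.5, 1.0]

print(f"mean of batch means:  {mean_of_batch_means(batch_accs):.4f}")
print(f"sample-weighted mean: {sample_weighted_mean(batch_accs, batch_sizes):.4f}")
```

<p style="font-size: 16px;"> The toy numbers exaggerate the gap on purpose. With hundreds of batches per epoch, as in this project, the difference is typically very small. </p>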

<p style="font-size: 25px;">Task 4: Start training (10pt)</p>
<p style="font-size: 16px;">Let's run the following code and print the output. We want to make the testing accuracy higher than <b>0.85</b>!</p>

In [21]:
# n_epochs = # TODO: Please set the number of epochs as you need
n_epochs = 20
for epoch in range(n_epochs):
    train_loss, train_acc = train(train_data_loader, model, criterion, optimizer, device)
    valid_loss, valid_acc = evaluate(valid_data_loader, model, criterion, device)
    print(f"epoch: {epoch}")
    print(f"train_loss: {train_loss:.3f}, train_acc: {train_acc:.3f}")
    print(f"valid_loss: {valid_loss:.3f}, valid_acc: {valid_acc:.3f}")

training...: 100%|██████████| 293/293 [00:13<00:00, 22.23it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 59.15it/s]


epoch: 0
train_loss: 0.688, train_acc: 0.541
valid_loss: 0.668, valid_acc: 0.606


training...: 100%|██████████| 293/293 [00:12<00:00, 22.88it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 59.03it/s]


epoch: 1
train_loss: 0.638, train_acc: 0.639
valid_loss: 0.603, valid_acc: 0.677


training...: 100%|██████████| 293/293 [00:12<00:00, 22.82it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 58.97it/s]


epoch: 2
train_loss: 0.572, train_acc: 0.712
valid_loss: 0.554, valid_acc: 0.722


training...: 100%|██████████| 293/293 [00:12<00:00, 22.76it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 58.81it/s]


epoch: 3
train_loss: 0.511, train_acc: 0.755
valid_loss: 0.499, valid_acc: 0.766


training...: 100%|██████████| 293/293 [00:12<00:00, 22.75it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 58.69it/s]


epoch: 4
train_loss: 0.449, train_acc: 0.795
valid_loss: 0.449, valid_acc: 0.791


training...: 100%|██████████| 293/293 [00:12<00:00, 22.81it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 58.89it/s]


epoch: 5
train_loss: 0.409, train_acc: 0.816
valid_loss: 0.426, valid_acc: 0.806


training...: 100%|██████████| 293/293 [00:12<00:00, 22.76it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 58.98it/s]


epoch: 6
train_loss: 0.376, train_acc: 0.836
valid_loss: 0.408, valid_acc: 0.817


training...: 100%|██████████| 293/293 [00:12<00:00, 22.71it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 58.90it/s]


epoch: 7
train_loss: 0.350, train_acc: 0.851
valid_loss: 0.397, valid_acc: 0.824


training...: 100%|██████████| 293/293 [00:12<00:00, 22.83it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 58.76it/s]


epoch: 8
train_loss: 0.326, train_acc: 0.861
valid_loss: 0.386, valid_acc: 0.832


training...: 100%|██████████| 293/293 [00:12<00:00, 22.83it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 58.94it/s]


epoch: 9
train_loss: 0.305, train_acc: 0.873
valid_loss: 0.377, valid_acc: 0.836


training...: 100%|██████████| 293/293 [00:12<00:00, 22.76it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 58.80it/s]


epoch: 10
train_loss: 0.285, train_acc: 0.882
valid_loss: 0.386, valid_acc: 0.827


training...: 100%|██████████| 293/293 [00:12<00:00, 22.75it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 58.63it/s]


epoch: 11
train_loss: 0.271, train_acc: 0.890
valid_loss: 0.371, valid_acc: 0.840


training...: 100%|██████████| 293/293 [00:12<00:00, 22.80it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 58.70it/s]


epoch: 12
train_loss: 0.252, train_acc: 0.899
valid_loss: 0.374, valid_acc: 0.841


training...: 100%|██████████| 293/293 [00:12<00:00, 22.77it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 58.92it/s]


epoch: 13
train_loss: 0.239, train_acc: 0.902
valid_loss: 0.368, valid_acc: 0.846


training...: 100%|██████████| 293/293 [00:12<00:00, 22.76it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 59.09it/s]


epoch: 14
train_loss: 0.227, train_acc: 0.909
valid_loss: 0.367, valid_acc: 0.848


training...: 100%|██████████| 293/293 [00:12<00:00, 22.81it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 58.92it/s]


epoch: 15
train_loss: 0.217, train_acc: 0.915
valid_loss: 0.362, valid_acc: 0.849


training...: 100%|██████████| 293/293 [00:12<00:00, 22.80it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 58.75it/s]


epoch: 16
train_loss: 0.206, train_acc: 0.920
valid_loss: 0.367, valid_acc: 0.849


training...: 100%|██████████| 293/293 [00:12<00:00, 22.77it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 59.01it/s]


epoch: 17
train_loss: 0.193, train_acc: 0.926
valid_loss: 0.370, valid_acc: 0.850


training...: 100%|██████████| 293/293 [00:12<00:00, 22.77it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 58.91it/s]


epoch: 18
train_loss: 0.180, train_acc: 0.931
valid_loss: 0.364, valid_acc: 0.854


training...: 100%|██████████| 293/293 [00:12<00:00, 22.84it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 58.88it/s]

epoch: 19
train_loss: 0.170, train_acc: 0.936
valid_loss: 0.366, valid_acc: 0.855





In [22]:
test_loss, test_acc = evaluate(test_data_loader, model, criterion, device)
print(f"test_loss: {test_loss:.3f}, test_acc: {test_acc:.3f}")

evaluating...: 100%|██████████| 391/391 [00:06<00:00, 60.87it/s]

test_loss: 0.384, test_acc: 0.837





In [24]:
lr = 1e-3  # Learning rate (controls step size in optimization)
optimizer = optim.Adam(model.parameters(), lr=lr)
criterion = nn.BCEWithLogitsLoss()  # Binary Cross Entropy with Logits

n_epochs = 20
for epoch in range(n_epochs):
    train_loss, train_acc = train(train_data_loader, model, criterion, optimizer, device)
    valid_loss, valid_acc = evaluate(valid_data_loader, model, criterion, device)
    print(f"epoch: {epoch}")
    print(f"train_loss: {train_loss:.3f}, train_acc: {train_acc:.3f}")
    print(f"valid_loss: {valid_loss:.3f}, valid_acc: {valid_acc:.3f}")

test_loss, test_acc = evaluate(test_data_loader, model, criterion, device)
print(f"test_loss: {test_loss:.3f}, test_acc: {test_acc:.3f}")

training...: 100%|██████████| 440/440 [00:16<00:00, 27.01it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 69.52it/s]


epoch: 0
train_loss: 0.248, train_acc: 0.897
valid_loss: 0.479, valid_acc: 0.800


training...: 100%|██████████| 440/440 [00:16<00:00, 26.92it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 69.09it/s]


epoch: 1
train_loss: 0.192, train_acc: 0.922
valid_loss: 0.379, valid_acc: 0.849


training...: 100%|██████████| 440/440 [00:16<00:00, 26.80it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 68.86it/s]


epoch: 2
train_loss: 0.154, train_acc: 0.939
valid_loss: 0.368, valid_acc: 0.852


training...: 100%|██████████| 440/440 [00:16<00:00, 26.74it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 68.86it/s]


epoch: 3
train_loss: 0.106, train_acc: 0.959
valid_loss: 0.385, valid_acc: 0.848


training...: 100%|██████████| 440/440 [00:16<00:00, 26.71it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 68.68it/s]


epoch: 4
train_loss: 0.093, train_acc: 0.962
valid_loss: 0.421, valid_acc: 0.857


training...: 100%|██████████| 440/440 [00:16<00:00, 26.68it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 68.24it/s]


epoch: 5
train_loss: 0.072, train_acc: 0.973
valid_loss: 0.448, valid_acc: 0.845


training...: 100%|██████████| 440/440 [00:16<00:00, 26.68it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 68.61it/s]


epoch: 6
train_loss: 0.069, train_acc: 0.975
valid_loss: 0.429, valid_acc: 0.858


training...: 100%|██████████| 440/440 [00:16<00:00, 26.54it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 67.49it/s]


epoch: 7
train_loss: 0.056, train_acc: 0.979
valid_loss: 0.431, valid_acc: 0.854


training...: 100%|██████████| 440/440 [00:16<00:00, 26.50it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 68.76it/s]


epoch: 8
train_loss: 0.057, train_acc: 0.979
valid_loss: 0.491, valid_acc: 0.849


training...: 100%|██████████| 440/440 [00:16<00:00, 26.50it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 68.88it/s]


epoch: 9
train_loss: 0.059, train_acc: 0.977
valid_loss: 0.492, valid_acc: 0.844


training...: 100%|██████████| 440/440 [00:16<00:00, 26.40it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 68.04it/s]


epoch: 10
train_loss: 0.042, train_acc: 0.984
valid_loss: 0.534, valid_acc: 0.845


training...: 100%|██████████| 440/440 [00:16<00:00, 26.49it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 67.54it/s]


epoch: 11
train_loss: 0.047, train_acc: 0.981
valid_loss: 0.642, valid_acc: 0.827


training...: 100%|██████████| 440/440 [00:16<00:00, 26.45it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 67.59it/s]


epoch: 12
train_loss: 0.041, train_acc: 0.984
valid_loss: 0.508, valid_acc: 0.845


training...: 100%|██████████| 440/440 [00:16<00:00, 26.57it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 68.91it/s]


epoch: 13
train_loss: 0.044, train_acc: 0.983
valid_loss: 0.766, valid_acc: 0.816


training...: 100%|██████████| 440/440 [00:16<00:00, 26.28it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 68.27it/s]


epoch: 14
train_loss: 0.035, train_acc: 0.988
valid_loss: 0.657, valid_acc: 0.823


training...: 100%|██████████| 440/440 [00:16<00:00, 26.49it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 68.24it/s]


epoch: 15
train_loss: 0.036, train_acc: 0.987
valid_loss: 0.665, valid_acc: 0.826


training...: 100%|██████████| 440/440 [00:16<00:00, 26.61it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 69.25it/s]


epoch: 16
train_loss: 0.030, train_acc: 0.988
valid_loss: 0.718, valid_acc: 0.825


training...: 100%|██████████| 440/440 [00:16<00:00, 26.40it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 68.22it/s]


epoch: 17
train_loss: 0.027, train_acc: 0.991
valid_loss: 0.557, valid_acc: 0.849


training...: 100%|██████████| 440/440 [00:16<00:00, 26.53it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 69.09it/s]


epoch: 18
train_loss: 0.026, train_acc: 0.990
valid_loss: 0.644, valid_acc: 0.838


training...: 100%|██████████| 440/440 [00:16<00:00, 26.62it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 69.40it/s]


epoch: 19
train_loss: 0.031, train_acc: 0.990
valid_loss: 0.622, valid_acc: 0.841


evaluating...: 100%|██████████| 782/782 [00:11<00:00, 70.60it/s]

test_loss: 0.698, test_acc: 0.822
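
<p style="font-size: 16px;"> In the run above, the training loss keeps falling while the validation loss bottoms out early and then climbs, which is the usual sign of overfitting. One common remedy is early stopping on the validation loss. The sketch below replays the validation losses logged above through a simple patience rule; <code>early_stop</code> and <code>patience=3</code> are illustrative choices, not part of the original notebook. </p>

```python
def early_stop(valid_losses, patience):
    """Return (best_epoch, best_loss, stop_epoch) for a simple patience rule."""
    best_epoch, best_loss = 0, float("inf")
    for epoch, loss in enumerate(valid_losses):
        if loss < best_loss:
            # New best validation loss: remember it (and, in a real loop, save the model).
            best_epoch, best_loss = epoch, loss
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs in a row: stop training.
            return best_epoch, best_loss, epoch
    return best_epoch, best_loss, len(valid_losses) - 1


# Validation losses copied from the 20 epochs logged above.
valid_losses = [0.479, 0.379, 0.368, 0.385, 0.421, 0.448, 0.429, 0.431,
                0.491, 0.492, 0.534, 0.642, 0.508, 0.766, 0.657, 0.665,
                0.718, 0.557, 0.644, 0.622]

best_epoch, best_loss, stop_epoch = early_stop(valid_losses, patience=3)
print(f"best epoch: {best_epoch}, best valid_loss: {best_loss:.3f}, stopped at epoch: {stop_epoch}")
```

<p style="font-size: 16px;"> In an actual training loop, the "new best" branch is where one would typically keep a copy of <code>model.state_dict()</code>, so that the test set is evaluated with the best checkpoint rather than the last one. </p>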





In [26]:
embedding_dim = 256 # Dimension of token embeddings (controls model capacity)
num_heads = 8  # Number of attention heads (must divide embedding_dim evenly)
dropout = 0.3  # Dropout rate for regularization (prevents overfitting)
forward_expansion = 3
max_length = 512
vocab_size = 35000

model = Transformer(
    vocab_size, max_length, embedding_dim, num_heads, forward_expansion, dropout)
model = model.to(device)

lr = 5e-4  # Learning rate (controls step size in optimization)
optimizer = optim.Adam(model.parameters(), lr=lr)
criterion = nn.BCEWithLogitsLoss()  # Binary Cross Entropy with Logits

n_epochs = 20
for epoch in range(n_epochs):
    train_loss, train_acc = train(train_data_loader, model, criterion, optimizer, device)
    valid_loss, valid_acc = evaluate(valid_data_loader, model, criterion, device)
    print(f"epoch: {epoch}")
    print(f"train_loss: {train_loss:.3f}, train_acc: {train_acc:.3f}")
    print(f"valid_loss: {valid_loss:.3f}, valid_acc: {valid_acc:.3f}")

test_loss, test_acc = evaluate(test_data_loader, model, criterion, device)
print(f"test_loss: {test_loss:.3f}, test_acc: {test_acc:.3f}")

training...: 100%|██████████| 440/440 [00:16<00:00, 26.93it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 69.97it/s]


epoch: 0
train_loss: 0.616, train_acc: 0.646
valid_loss: 0.517, valid_acc: 0.737


training...: 100%|██████████| 440/440 [00:16<00:00, 26.88it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 69.80it/s]


epoch: 1
train_loss: 0.457, train_acc: 0.783
valid_loss: 0.433, valid_acc: 0.800


training...: 100%|██████████| 440/440 [00:16<00:00, 26.66it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 68.36it/s]


epoch: 2
train_loss: 0.363, train_acc: 0.837
valid_loss: 0.425, valid_acc: 0.798


training...: 100%|██████████| 440/440 [00:16<00:00, 26.64it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 67.62it/s]


epoch: 3
train_loss: 0.287, train_acc: 0.878
valid_loss: 0.408, valid_acc: 0.820


training...: 100%|██████████| 440/440 [00:16<00:00, 26.55it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 68.89it/s]


epoch: 4
train_loss: 0.213, train_acc: 0.915
valid_loss: 0.400, valid_acc: 0.832


training...: 100%|██████████| 440/440 [00:16<00:00, 26.28it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 67.70it/s]


epoch: 5
train_loss: 0.161, train_acc: 0.940
valid_loss: 0.376, valid_acc: 0.846


training...: 100%|██████████| 440/440 [00:16<00:00, 26.55it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 69.17it/s]


epoch: 6
train_loss: 0.116, train_acc: 0.958
valid_loss: 0.405, valid_acc: 0.846


training...: 100%|██████████| 440/440 [00:16<00:00, 26.75it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 69.70it/s]


epoch: 7
train_loss: 0.088, train_acc: 0.969
valid_loss: 0.428, valid_acc: 0.848


training...: 100%|██████████| 440/440 [00:16<00:00, 26.43it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 68.97it/s]


epoch: 8
train_loss: 0.058, train_acc: 0.980
valid_loss: 0.467, valid_acc: 0.847


training...: 100%|██████████| 440/440 [00:16<00:00, 26.62it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 69.13it/s]


epoch: 9
train_loss: 0.054, train_acc: 0.981
valid_loss: 0.473, valid_acc: 0.849


training...: 100%|██████████| 440/440 [00:16<00:00, 26.57it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 69.19it/s]


epoch: 10
train_loss: 0.049, train_acc: 0.982
valid_loss: 0.511, valid_acc: 0.849


training...: 100%|██████████| 440/440 [00:16<00:00, 26.62it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 69.66it/s]


epoch: 11
train_loss: 0.035, train_acc: 0.988
valid_loss: 0.564, valid_acc: 0.841


training...: 100%|██████████| 440/440 [00:16<00:00, 26.57it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 69.09it/s]


epoch: 12
train_loss: 0.032, train_acc: 0.989
valid_loss: 0.537, valid_acc: 0.853


training...: 100%|██████████| 440/440 [00:16<00:00, 26.64it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 68.71it/s]


epoch: 13
train_loss: 0.030, train_acc: 0.990
valid_loss: 0.616, valid_acc: 0.845


training...: 100%|██████████| 440/440 [00:16<00:00, 26.61it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 68.91it/s]


epoch: 14
train_loss: 0.029, train_acc: 0.989
valid_loss: 0.574, valid_acc: 0.848


training...: 100%|██████████| 440/440 [00:16<00:00, 26.68it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 69.00it/s]


epoch: 15
train_loss: 0.037, train_acc: 0.987
valid_loss: 0.596, valid_acc: 0.845


training...: 100%|██████████| 440/440 [00:16<00:00, 26.52it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 69.39it/s]


epoch: 16
train_loss: 0.025, train_acc: 0.992
valid_loss: 0.597, valid_acc: 0.850


training...: 100%|██████████| 440/440 [00:16<00:00, 26.67it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 68.92it/s]


epoch: 17
train_loss: 0.027, train_acc: 0.991
valid_loss: 0.636, valid_acc: 0.841


training...: 100%|██████████| 440/440 [00:16<00:00, 26.58it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 69.06it/s]


epoch: 18
train_loss: 0.018, train_acc: 0.994
valid_loss: 0.637, valid_acc: 0.846


training...: 100%|██████████| 440/440 [00:16<00:00, 26.58it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 69.10it/s]


epoch: 19
train_loss: 0.019, train_acc: 0.993
valid_loss: 0.640, valid_acc: 0.848


evaluating...: 100%|██████████| 782/782 [00:11<00:00, 71.07it/s]

test_loss: 0.689, test_acc: 0.835





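<p style="font-size: 16px;"> The lowest train_loss is 0.018, while valid_loss has climbed to 0.640 by the final epoch (its minimum, 0.376, was reached at epoch 5). This widening gap between training and validation loss shows that the model is overfitting, so let's try to address it, starting with early stopping on validation accuracy. </p>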
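<p style="font-size: 16px;"> To make the overfitting claim concrete, here is a minimal sketch that recomputes the train/validation gap from the losses printed in the log above. The helper name <code>summarize_overfitting</code> is hypothetical and not part of the notebook; the numbers are copied from the log and are rounded to three decimals. </p>

```python
# Per-epoch losses copied from the training log above (rounded to 3 decimals).
train_losses = [0.616, 0.457, 0.363, 0.287, 0.213, 0.161, 0.116, 0.088, 0.058, 0.054,
                0.049, 0.035, 0.032, 0.030, 0.029, 0.037, 0.025, 0.027, 0.018, 0.019]
valid_losses = [0.517, 0.433, 0.425, 0.408, 0.400, 0.376, 0.405, 0.428, 0.467, 0.473,
                0.511, 0.564, 0.537, 0.616, 0.574, 0.596, 0.597, 0.636, 0.637, 0.640]


def summarize_overfitting(train_losses, valid_losses):
    """Return the epoch with the lowest validation loss and the per-epoch gap.

    Hypothetical helper for illustration; the gap is valid_loss - train_loss.
    """
    gaps = [round(v - t, 3) for t, v in zip(train_losses, valid_losses)]
    best_epoch = min(range(len(valid_losses)), key=valid_losses.__getitem__)
    return best_epoch, gaps


best_epoch, gaps = summarize_overfitting(train_losses, valid_losses)
print(f"lowest valid_loss: {valid_losses[best_epoch]:.3f} at epoch {best_epoch}")
print(f"gap at best epoch: {gaps[best_epoch]:.3f}, gap at last epoch: {gaps[-1]:.3f}")
```

<p style="font-size: 16px;"> The gap roughly triples between the best validation epoch and the last epoch, which is the usual signature of a model that keeps fitting the training set after it has stopped generalizing. </p>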

In [29]:
embedding_dim = 256  # Dimension of token embeddings (controls model capacity)
num_heads = 8  # Number of attention heads (must divide embedding_dim evenly)
dropout = 0.3  # Dropout rate for regularization (helps reduce overfitting)
forward_expansion = 3
max_length = 512
vocab_size = 35000

model = Transformer(
    vocab_size, max_length, embedding_dim, num_heads, forward_expansion, dropout)
model = model.to(device)

lr = 5e-4  # Learning rate (controls step size in optimization)
optimizer = optim.Adam(model.parameters(), lr=lr)
criterion = nn.BCEWithLogitsLoss()  # Binary Cross Entropy with Logits

n_epochs = 20
best_valid_acc = 0  # Best validation accuracy seen so far
patience = 3  # Stop after 3 consecutive epochs without improvement
patience_counter = 0  # Consecutive epochs without improvement
for epoch in range(n_epochs):
    train_loss, train_acc = train(train_data_loader, model, criterion, optimizer, device)
    valid_loss, valid_acc = evaluate(valid_data_loader, model, criterion, device)
    print(f"epoch: {epoch}")
    print(f"train_loss: {train_loss:.3f}, train_acc: {train_acc:.3f}")
    print(f"valid_loss: {valid_loss:.3f}, valid_acc: {valid_acc:.3f}")
    if valid_acc > best_valid_acc:
        best_valid_acc = valid_acc
        torch.save(model.state_dict(), 'best_model.pt')
        print(f"Best model saved with valid_acc: {valid_acc:.3f}")
        patience_counter = 0
    else:
        patience_counter += 1
        print(f"No improvement. Patience: {patience_counter}/{patience}")
        if patience_counter >= patience:
            print("Early stopping!")
            break

model.load_state_dict(torch.load('best_model.pt'))
test_loss, test_acc = evaluate(test_data_loader, model, criterion, device)
print(f"test_loss: {test_loss:.3f}, test_acc: {test_acc:.3f}")

training...: 100%|██████████| 440/440 [00:16<00:00, 27.07it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 70.06it/s]


epoch: 0
train_loss: 0.634, train_acc: 0.625
valid_loss: 0.520, valid_acc: 0.738
Best model saved with valid_acc: 0.738


training...: 100%|██████████| 440/440 [00:16<00:00, 26.99it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 69.35it/s]


epoch: 1
train_loss: 0.461, train_acc: 0.779
valid_loss: 0.414, valid_acc: 0.806
Best model saved with valid_acc: 0.806


training...: 100%|██████████| 440/440 [00:16<00:00, 26.83it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 69.15it/s]


epoch: 2
train_loss: 0.352, train_acc: 0.845
valid_loss: 0.375, valid_acc: 0.832
Best model saved with valid_acc: 0.832


training...: 100%|██████████| 440/440 [00:16<00:00, 26.74it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 69.10it/s]


epoch: 3
train_loss: 0.274, train_acc: 0.885
valid_loss: 0.371, valid_acc: 0.839
Best model saved with valid_acc: 0.839


training...: 100%|██████████| 440/440 [00:16<00:00, 26.75it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 69.09it/s]


epoch: 4
train_loss: 0.206, train_acc: 0.917
valid_loss: 0.382, valid_acc: 0.840
Best model saved with valid_acc: 0.840


training...: 100%|██████████| 440/440 [00:16<00:00, 26.65it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 68.67it/s]


epoch: 5
train_loss: 0.159, train_acc: 0.940
valid_loss: 0.394, valid_acc: 0.839
No improvement. Patience: 1/3


training...: 100%|██████████| 440/440 [00:16<00:00, 26.59it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 67.91it/s]


epoch: 6
train_loss: 0.113, train_acc: 0.957
valid_loss: 0.453, valid_acc: 0.836
No improvement. Patience: 2/3


training...: 100%|██████████| 440/440 [00:16<00:00, 26.49it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 68.28it/s]


epoch: 7
train_loss: 0.079, train_acc: 0.972
valid_loss: 0.453, valid_acc: 0.842
Best model saved with valid_acc: 0.842


training...: 100%|██████████| 440/440 [00:16<00:00, 26.45it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 68.25it/s]


epoch: 8
train_loss: 0.062, train_acc: 0.978
valid_loss: 0.501, valid_acc: 0.838
No improvement. Patience: 1/3


training...: 100%|██████████| 440/440 [00:16<00:00, 26.48it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 68.52it/s]


epoch: 9
train_loss: 0.059, train_acc: 0.978
valid_loss: 0.478, valid_acc: 0.842
Best model saved with valid_acc: 0.842


training...: 100%|██████████| 440/440 [00:16<00:00, 26.47it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 67.75it/s]


epoch: 10
train_loss: 0.046, train_acc: 0.984
valid_loss: 0.512, valid_acc: 0.847
Best model saved with valid_acc: 0.847


training...: 100%|██████████| 440/440 [00:16<00:00, 26.42it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 68.80it/s]


epoch: 11
train_loss: 0.042, train_acc: 0.985
valid_loss: 0.547, valid_acc: 0.844
No improvement. Patience: 1/3


training...: 100%|██████████| 440/440 [00:16<00:00, 26.37it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 68.33it/s]


epoch: 12
train_loss: 0.036, train_acc: 0.987
valid_loss: 0.726, valid_acc: 0.822
No improvement. Patience: 2/3


training...: 100%|██████████| 440/440 [00:16<00:00, 26.42it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 68.07it/s]


epoch: 13
train_loss: 0.035, train_acc: 0.987
valid_loss: 0.586, valid_acc: 0.844
No improvement. Patience: 3/3
Early stopping!


evaluating...: 100%|██████████| 782/782 [00:11<00:00, 70.23it/s]

test_loss: 0.538, test_acc: 0.836
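
<p style="font-size: 16px;"> The early-stopping bookkeeping used in the cell above can also be factored into a small reusable helper. The sketch below is one possible way to do that; the class name <code>EarlyStopping</code> and its <code>step</code> method are hypothetical and not part of the notebook. It replays the validation accuracies printed above, which are rounded to three decimals, so the tie at epoch 9 counts as "no improvement" here even though the unrounded value was saved as a new best in the real run. The stopping epoch and best epoch come out the same either way. </p>

```python
class EarlyStopping:
    """Track the best validation score and signal when to stop.

    Hypothetical helper (not part of the notebook); higher scores are better.
    """

    def __init__(self, patience=3):
        self.patience = patience
        self.best_score = float("-inf")
        self.best_epoch = None
        self.counter = 0  # consecutive epochs without improvement

    def step(self, score, epoch):
        """Record one epoch and return (improved, should_stop)."""
        if score > self.best_score:
            self.best_score = score
            self.best_epoch = epoch
            self.counter = 0
            return True, False
        self.counter += 1
        return False, self.counter >= self.patience


# Replay the validation accuracies logged above (rounded to 3 decimals).
logged_valid_acc = [0.738, 0.806, 0.832, 0.839, 0.840, 0.839, 0.836,
                    0.842, 0.838, 0.842, 0.847, 0.844, 0.822, 0.844]

stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, acc in enumerate(logged_valid_acc):
    improved, should_stop = stopper.step(acc, epoch)
    # In a real training loop, `improved` is where the checkpoint would be saved.
    if should_stop:
        stopped_at = epoch
        break

print(f"stopped at epoch {stopped_at}, best epoch {stopper.best_epoch}, "
      f"best valid_acc {stopper.best_score:.3f}")
```

<p style="font-size: 16px;"> Keeping the counter logic in one place avoids copying it into every experiment cell, and makes it easy to switch the monitored metric (for example to validation loss, with the comparison reversed). </p>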





<p style="font-size: 16px;"> Let's try increasing the dropout rate from 0.3 to 0.5. </p>
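
<p style="font-size: 16px;"> As a reminder of what the dropout rate controls, here is a minimal pure-Python sketch of inverted dropout, the scheme that <code>nn.Dropout</code> follows in training mode: each activation is zeroed with probability <code>p</code> and the survivors are scaled by <code>1 / (1 - p)</code>, so the expected activation is unchanged. The function <code>inverted_dropout</code> is a hypothetical illustration, not the notebook's or PyTorch's implementation. </p>

```python
import random


def inverted_dropout(values, p, rng):
    """Zero each value with probability p and scale survivors by 1 / (1 - p).

    Hypothetical illustration of training-mode dropout; at evaluation time
    dropout is simply the identity.
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    scale = 1.0 / (1.0 - p)
    return [0.0 if rng.random() < p else v * scale for v in values]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
activations = [1.0] * 10_000

for p in (0.3, 0.5):
    dropped = inverted_dropout(activations, p, rng)
    zero_fraction = dropped.count(0.0) / len(dropped)
    mean = sum(dropped) / len(dropped)
    print(f"p={p}: zeroed {zero_fraction:.2f} of units, mean activation {mean:.2f}")
```

<p style="font-size: 16px;"> A higher <code>p</code> removes more units per step, which makes it harder for the model to memorize the training set but can also slow down learning. </p>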

In [30]:
embedding_dim = 256  # Dimension of token embeddings (controls model capacity)
num_heads = 8  # Number of attention heads (must divide embedding_dim evenly)
dropout = 0.5  # Dropout rate for regularization (helps reduce overfitting)
forward_expansion = 3
max_length = 512
vocab_size = 35000

model = Transformer(
    vocab_size, max_length, embedding_dim, num_heads, forward_expansion, dropout)
model = model.to(device)

lr = 5e-4  # Learning rate (controls step size in optimization)
optimizer = optim.Adam(model.parameters(), lr=lr)
criterion = nn.BCEWithLogitsLoss()  # Binary Cross Entropy with Logits

n_epochs = 20
best_valid_acc = 0  # Best validation accuracy seen so far
patience = 3  # Stop after 3 consecutive epochs without improvement
patience_counter = 0  # Consecutive epochs without improvement
for epoch in range(n_epochs):
    train_loss, train_acc = train(train_data_loader, model, criterion, optimizer, device)
    valid_loss, valid_acc = evaluate(valid_data_loader, model, criterion, device)
    print(f"epoch: {epoch}")
    print(f"train_loss: {train_loss:.3f}, train_acc: {train_acc:.3f}")
    print(f"valid_loss: {valid_loss:.3f}, valid_acc: {valid_acc:.3f}")
    if valid_acc > best_valid_acc:
        best_valid_acc = valid_acc
        torch.save(model.state_dict(), 'best_model.pt')
        print(f"Best model saved with valid_acc: {valid_acc:.3f}")
        patience_counter = 0
    else:
        patience_counter += 1
        print(f"No improvement. Patience: {patience_counter}/{patience}")
        if patience_counter >= patience:
            print("Early stopping!")
            break

model.load_state_dict(torch.load('best_model.pt'))
test_loss, test_acc = evaluate(test_data_loader, model, criterion, device)
print(f"test_loss: {test_loss:.3f}, test_acc: {test_acc:.3f}")

training...: 100%|██████████| 440/440 [00:16<00:00, 26.98it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 69.98it/s]


epoch: 0
train_loss: 0.644, train_acc: 0.616
valid_loss: 0.637, valid_acc: 0.607
Best model saved with valid_acc: 0.607


training...: 100%|██████████| 440/440 [00:16<00:00, 26.90it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 69.66it/s]


epoch: 1
train_loss: 0.484, train_acc: 0.766
valid_loss: 0.432, valid_acc: 0.801
Best model saved with valid_acc: 0.801


training...: 100%|██████████| 440/440 [00:16<00:00, 26.82it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 69.61it/s]


epoch: 2
train_loss: 0.370, train_acc: 0.837
valid_loss: 0.390, valid_acc: 0.822
Best model saved with valid_acc: 0.822


training...: 100%|██████████| 440/440 [00:16<00:00, 26.71it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 68.24it/s]


epoch: 3
train_loss: 0.293, train_acc: 0.878
valid_loss: 0.363, valid_acc: 0.838
Best model saved with valid_acc: 0.838


training...: 100%|██████████| 440/440 [00:16<00:00, 26.60it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 68.35it/s]


epoch: 4
train_loss: 0.223, train_acc: 0.911
valid_loss: 0.361, valid_acc: 0.842
Best model saved with valid_acc: 0.842


training...: 100%|██████████| 440/440 [00:16<00:00, 26.56it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 69.28it/s]


epoch: 5
train_loss: 0.168, train_acc: 0.936
valid_loss: 0.368, valid_acc: 0.847
Best model saved with valid_acc: 0.847


training...: 100%|██████████| 440/440 [00:16<00:00, 26.50it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 68.78it/s]


epoch: 6
train_loss: 0.117, train_acc: 0.956
valid_loss: 0.385, valid_acc: 0.845
No improvement. Patience: 1/3


training...: 100%|██████████| 440/440 [00:16<00:00, 26.56it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 68.20it/s]


epoch: 7
train_loss: 0.087, train_acc: 0.969
valid_loss: 0.437, valid_acc: 0.839
No improvement. Patience: 2/3


training...: 100%|██████████| 440/440 [00:16<00:00, 26.45it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 68.70it/s]


epoch: 8
train_loss: 0.075, train_acc: 0.974
valid_loss: 0.412, valid_acc: 0.843
No improvement. Patience: 3/3
Early stopping!


evaluating...: 100%|██████████| 782/782 [00:11<00:00, 70.62it/s]

test_loss: 0.383, test_acc: 0.838





In [31]:
embedding_dim = 256  # Dimension of token embeddings (controls model capacity)
num_heads = 8  # Number of attention heads (must divide embedding_dim evenly)
dropout = 0.5  # Dropout rate for regularization (helps reduce overfitting)
forward_expansion = 3
max_length = 512
vocab_size = 35000

model = Transformer(
    vocab_size, max_length, embedding_dim, num_heads, forward_expansion, dropout)
model = model.to(device)

lr = 1e-4  # Learning rate (controls step size in optimization)
optimizer = optim.Adam(model.parameters(), lr=lr)
criterion = nn.BCEWithLogitsLoss()  # Binary Cross Entropy with Logits

n_epochs = 20
best_valid_acc = 0  # Best validation accuracy seen so far
patience = 3  # Stop after 3 consecutive epochs without improvement
patience_counter = 0  # Consecutive epochs without improvement
for epoch in range(n_epochs):
    train_loss, train_acc = train(train_data_loader, model, criterion, optimizer, device)
    valid_loss, valid_acc = evaluate(valid_data_loader, model, criterion, device)
    print(f"epoch: {epoch}")
    print(f"train_loss: {train_loss:.3f}, train_acc: {train_acc:.3f}")
    print(f"valid_loss: {valid_loss:.3f}, valid_acc: {valid_acc:.3f}")
    if valid_acc > best_valid_acc:
        best_valid_acc = valid_acc
        torch.save(model.state_dict(), 'best_model.pt')
        print(f"Best model saved with valid_acc: {valid_acc:.3f}")
        patience_counter = 0
    else:
        patience_counter += 1
        print(f"No improvement. Patience: {patience_counter}/{patience}")
        if patience_counter >= patience:
            print("Early stopping!")
            break

model.load_state_dict(torch.load('best_model.pt'))
test_loss, test_acc = evaluate(test_data_loader, model, criterion, device)
print(f"test_loss: {test_loss:.3f}, test_acc: {test_acc:.3f}")

training...: 100%|██████████| 440/440 [00:16<00:00, 26.78it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 69.50it/s]


epoch: 0
train_loss: 0.691, train_acc: 0.531
valid_loss: 0.672, valid_acc: 0.592
Best model saved with valid_acc: 0.592


training...: 100%|██████████| 440/440 [00:16<00:00, 26.78it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 69.01it/s]


epoch: 1
train_loss: 0.638, train_acc: 0.638
valid_loss: 0.611, valid_acc: 0.671
Best model saved with valid_acc: 0.671


training...: 100%|██████████| 440/440 [00:16<00:00, 26.57it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 69.47it/s]


epoch: 2
train_loss: 0.576, train_acc: 0.704
valid_loss: 0.576, valid_acc: 0.711
Best model saved with valid_acc: 0.711


training...: 100%|██████████| 440/440 [00:16<00:00, 26.21it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 68.43it/s]


epoch: 3
train_loss: 0.535, train_acc: 0.733
valid_loss: 0.553, valid_acc: 0.730
Best model saved with valid_acc: 0.730


training...: 100%|██████████| 440/440 [00:16<00:00, 26.55it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 67.88it/s]


epoch: 4
train_loss: 0.485, train_acc: 0.769
valid_loss: 0.498, valid_acc: 0.770
Best model saved with valid_acc: 0.770


training...: 100%|██████████| 440/440 [00:16<00:00, 26.46it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 68.95it/s]


epoch: 5
train_loss: 0.443, train_acc: 0.799
valid_loss: 0.475, valid_acc: 0.785
Best model saved with valid_acc: 0.785


training...: 100%|██████████| 440/440 [00:16<00:00, 26.52it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 68.22it/s]


epoch: 6
train_loss: 0.404, train_acc: 0.820
valid_loss: 0.460, valid_acc: 0.788
Best model saved with valid_acc: 0.788


training...: 100%|██████████| 440/440 [00:16<00:00, 26.54it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 69.29it/s]


epoch: 7
train_loss: 0.378, train_acc: 0.833
valid_loss: 0.445, valid_acc: 0.790
Best model saved with valid_acc: 0.790


training...: 100%|██████████| 440/440 [00:16<00:00, 26.41it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 69.23it/s]


epoch: 8
train_loss: 0.351, train_acc: 0.848
valid_loss: 0.423, valid_acc: 0.808
Best model saved with valid_acc: 0.808


training...: 100%|██████████| 440/440 [00:16<00:00, 26.60it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 69.00it/s]


epoch: 9
train_loss: 0.323, train_acc: 0.862
valid_loss: 0.415, valid_acc: 0.815
Best model saved with valid_acc: 0.815


training...: 100%|██████████| 440/440 [00:16<00:00, 26.39it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 68.52it/s]


epoch: 10
train_loss: 0.302, train_acc: 0.872
valid_loss: 0.401, valid_acc: 0.820
Best model saved with valid_acc: 0.820


training...: 100%|██████████| 440/440 [00:16<00:00, 26.45it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 69.10it/s]


epoch: 11
train_loss: 0.283, train_acc: 0.884
valid_loss: 0.398, valid_acc: 0.820
Best model saved with valid_acc: 0.820


training...: 100%|██████████| 440/440 [00:16<00:00, 26.73it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 67.37it/s]


epoch: 12
train_loss: 0.262, train_acc: 0.894
valid_loss: 0.383, valid_acc: 0.828
Best model saved with valid_acc: 0.828


training...: 100%|██████████| 440/440 [00:16<00:00, 26.46it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 68.34it/s]


epoch: 13
train_loss: 0.248, train_acc: 0.900
valid_loss: 0.378, valid_acc: 0.828
Best model saved with valid_acc: 0.828


training...: 100%|██████████| 440/440 [00:16<00:00, 26.63it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 68.52it/s]


epoch: 14
train_loss: 0.230, train_acc: 0.911
valid_loss: 0.373, valid_acc: 0.833
Best model saved with valid_acc: 0.833


training...: 100%|██████████| 440/440 [00:16<00:00, 26.59it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 69.06it/s]


epoch: 15
train_loss: 0.212, train_acc: 0.915
valid_loss: 0.373, valid_acc: 0.832
No improvement. Patience: 1/3


training...: 100%|██████████| 440/440 [00:16<00:00, 26.68it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 67.57it/s]


epoch: 16
train_loss: 0.201, train_acc: 0.920
valid_loss: 0.398, valid_acc: 0.825
No improvement. Patience: 2/3


training...: 100%|██████████| 440/440 [00:16<00:00, 26.56it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 68.48it/s]


epoch: 17
train_loss: 0.183, train_acc: 0.927
valid_loss: 0.374, valid_acc: 0.833
No improvement. Patience: 3/3
Early stopping!


evaluating...: 100%|██████████| 782/782 [00:11<00:00, 70.74it/s]

test_loss: 0.385, test_acc: 0.826





In [32]:
embedding_dim = 256  # Dimension of token embeddings (controls model capacity)
num_heads = 8  # Number of attention heads (must divide embedding_dim evenly)
dropout = 0.6  # Dropout rate for regularization (helps reduce overfitting)
forward_expansion = 3
max_length = 512
vocab_size = 35000

model = Transformer(
    vocab_size, max_length, embedding_dim, num_heads, forward_expansion, dropout)
model = model.to(device)

lr = 1e-4  # Learning rate (controls step size in optimization)
optimizer = optim.Adam(model.parameters(), lr=lr)
criterion = nn.BCEWithLogitsLoss()  # Binary Cross Entropy with Logits

n_epochs = 20
best_valid_acc = 0  # Best validation accuracy seen so far
patience = 3  # Stop after 3 consecutive epochs without improvement
patience_counter = 0  # Consecutive epochs without improvement
for epoch in range(n_epochs):
    train_loss, train_acc = train(train_data_loader, model, criterion, optimizer, device)
    valid_loss, valid_acc = evaluate(valid_data_loader, model, criterion, device)
    print(f"epoch: {epoch}")
    print(f"train_loss: {train_loss:.3f}, train_acc: {train_acc:.3f}")
    print(f"valid_loss: {valid_loss:.3f}, valid_acc: {valid_acc:.3f}")
    if valid_acc > best_valid_acc:
        best_valid_acc = valid_acc
        torch.save(model.state_dict(), 'best_model.pt')
        print(f"Best model saved with valid_acc: {valid_acc:.3f}")
        patience_counter = 0
    else:
        patience_counter += 1
        print(f"No improvement. Patience: {patience_counter}/{patience}")
        if patience_counter >= patience:
            print("Early stopping!")
            break

model.load_state_dict(torch.load('best_model.pt'))
test_loss, test_acc = evaluate(test_data_loader, model, criterion, device)
print(f"test_loss: {test_loss:.3f}, test_acc: {test_acc:.3f}")

training...: 100%|██████████| 440/440 [00:16<00:00, 27.01it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 69.97it/s]


epoch: 0
train_loss: 0.690, train_acc: 0.536
valid_loss: 0.688, valid_acc: 0.526
Best model saved with valid_acc: 0.526


training...: 100%|██████████| 440/440 [00:16<00:00, 26.95it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 69.81it/s]


epoch: 1
train_loss: 0.620, train_acc: 0.656
valid_loss: 0.601, valid_acc: 0.692
Best model saved with valid_acc: 0.692


training...: 100%|██████████| 440/440 [00:16<00:00, 26.81it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 70.04it/s]


epoch: 2
train_loss: 0.561, train_acc: 0.716
valid_loss: 0.577, valid_acc: 0.709
Best model saved with valid_acc: 0.709


training...: 100%|██████████| 440/440 [00:16<00:00, 26.68it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 69.17it/s]


epoch: 3
train_loss: 0.520, train_acc: 0.747
valid_loss: 0.546, valid_acc: 0.738
Best model saved with valid_acc: 0.738


training...: 100%|██████████| 440/440 [00:16<00:00, 26.65it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 68.64it/s]


epoch: 4
train_loss: 0.469, train_acc: 0.781
valid_loss: 0.509, valid_acc: 0.766
Best model saved with valid_acc: 0.766


training...: 100%|██████████| 440/440 [00:16<00:00, 26.52it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 68.43it/s]


epoch: 5
train_loss: 0.429, train_acc: 0.803
valid_loss: 0.481, valid_acc: 0.788
Best model saved with valid_acc: 0.788


training...: 100%|██████████| 440/440 [00:16<00:00, 26.55it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 68.61it/s]


epoch: 6
train_loss: 0.392, train_acc: 0.827
valid_loss: 0.462, valid_acc: 0.797
Best model saved with valid_acc: 0.797


training...: 100%|██████████| 440/440 [00:16<00:00, 26.54it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 68.10it/s]


epoch: 7
train_loss: 0.362, train_acc: 0.843
valid_loss: 0.448, valid_acc: 0.801
Best model saved with valid_acc: 0.801


training...: 100%|██████████| 440/440 [00:16<00:00, 26.48it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 69.54it/s]


epoch: 8
train_loss: 0.337, train_acc: 0.856
valid_loss: 0.439, valid_acc: 0.804
Best model saved with valid_acc: 0.804


training...: 100%|██████████| 440/440 [00:16<00:00, 26.71it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 68.96it/s]


epoch: 9
train_loss: 0.316, train_acc: 0.867
valid_loss: 0.425, valid_acc: 0.813
Best model saved with valid_acc: 0.813


training...: 100%|██████████| 440/440 [00:16<00:00, 26.24it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 68.88it/s]


epoch: 10
train_loss: 0.298, train_acc: 0.875
valid_loss: 0.419, valid_acc: 0.814
Best model saved with valid_acc: 0.814


training...: 100%|██████████| 440/440 [00:16<00:00, 26.64it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 69.16it/s]


epoch: 11
train_loss: 0.279, train_acc: 0.884
valid_loss: 0.411, valid_acc: 0.815
Best model saved with valid_acc: 0.815


training...: 100%|██████████| 440/440 [00:16<00:00, 26.52it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 69.51it/s]


epoch: 12
train_loss: 0.259, train_acc: 0.897
valid_loss: 0.423, valid_acc: 0.807
No improvement. Patience: 1/3


training...: 100%|██████████| 440/440 [00:16<00:00, 26.55it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 69.08it/s]


epoch: 13
train_loss: 0.247, train_acc: 0.900
valid_loss: 0.396, valid_acc: 0.825
Best model saved with valid_acc: 0.825


training...: 100%|██████████| 440/440 [00:16<00:00, 26.66it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 68.59it/s]


epoch: 14
train_loss: 0.228, train_acc: 0.908
valid_loss: 0.411, valid_acc: 0.820
No improvement. Patience: 1/3


training...: 100%|██████████| 440/440 [00:16<00:00, 26.53it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 69.14it/s]


epoch: 15
train_loss: 0.210, train_acc: 0.918
valid_loss: 0.388, valid_acc: 0.827
Best model saved with valid_acc: 0.827


training...: 100%|██████████| 440/440 [00:16<00:00, 26.59it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 69.70it/s]


epoch: 16
train_loss: 0.193, train_acc: 0.925
valid_loss: 0.386, valid_acc: 0.830
Best model saved with valid_acc: 0.830


training...: 100%|██████████| 440/440 [00:16<00:00, 26.71it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 69.74it/s]


epoch: 17
train_loss: 0.181, train_acc: 0.930
valid_loss: 0.382, valid_acc: 0.829
No improvement. Patience: 1/3


training...: 100%|██████████| 440/440 [00:16<00:00, 26.67it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 69.37it/s]


epoch: 18
train_loss: 0.168, train_acc: 0.935
valid_loss: 0.380, valid_acc: 0.833
Best model saved with valid_acc: 0.833


training...: 100%|██████████| 440/440 [00:16<00:00, 26.66it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 68.94it/s]


epoch: 19
train_loss: 0.154, train_acc: 0.943
valid_loss: 0.382, valid_acc: 0.831
No improvement. Patience: 1/3


evaluating...: 100%|██████████| 782/782 [00:10<00:00, 71.33it/s]

test_loss: 0.381, test_acc: 0.828





In [37]:
embedding_dim = 256  # Dimension of token embeddings (controls model capacity)
num_heads = 8  # Number of attention heads (must divide embedding_dim evenly)
dropout = 0.4  # Dropout rate for regularization (helps reduce overfitting)
forward_expansion = 3
max_length = 512
vocab_size = 35000

model = Transformer(
    vocab_size, max_length, embedding_dim, num_heads, forward_expansion, dropout)
model = model.to(device)

lr = 1e-4  # Learning rate (controls step size in optimization)
optimizer = optim.AdamW(model.parameters(), lr=lr, weight_decay=0.01)  # AdamW: Adam with decoupled weight decay
criterion = nn.BCEWithLogitsLoss()  # Binary Cross Entropy with Logits

n_epochs = 30
best_valid_acc = 0  # Best validation accuracy seen so far
patience = 3  # Stop after 3 consecutive epochs without improvement
patience_counter = 0  # Consecutive epochs without improvement
for epoch in range(n_epochs):
    train_loss, train_acc = train(train_data_loader, model, criterion, optimizer, device)
    valid_loss, valid_acc = evaluate(valid_data_loader, model, criterion, device)
    print(f"epoch: {epoch}")
    print(f"train_loss: {train_loss:.3f}, train_acc: {train_acc:.3f}")
    print(f"valid_loss: {valid_loss:.3f}, valid_acc: {valid_acc:.3f}")
    if valid_acc > best_valid_acc:
        best_valid_acc = valid_acc
        torch.save(model.state_dict(), 'best_model.pt')
        print(f"Best model saved with valid_acc: {valid_acc:.3f}")
        patience_counter = 0
    else:
        patience_counter += 1
        print(f"No improvement. Patience: {patience_counter}/{patience}")
        if patience_counter >= patience:
            print("Early stopping!")
            break

model.load_state_dict(torch.load('best_model.pt'))
test_loss, test_acc = evaluate(test_data_loader, model, criterion, device)
print(f"test_loss: {test_loss:.3f}, test_acc: {test_acc:.3f}")

training...: 100%|██████████| 440/440 [00:16<00:00, 26.91it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 70.06it/s]


epoch: 0
train_loss: 0.689, train_acc: 0.543
valid_loss: 0.665, valid_acc: 0.603
Best model saved with valid_acc: 0.603


training...: 100%|██████████| 440/440 [00:16<00:00, 26.80it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 69.65it/s]


epoch: 1
train_loss: 0.635, train_acc: 0.649
valid_loss: 0.605, valid_acc: 0.686
Best model saved with valid_acc: 0.686


training...: 100%|██████████| 440/440 [00:16<00:00, 26.77it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 69.62it/s]


epoch: 2
train_loss: 0.570, train_acc: 0.713
valid_loss: 0.541, valid_acc: 0.732
Best model saved with valid_acc: 0.732


training...: 100%|██████████| 440/440 [00:16<00:00, 26.69it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 69.63it/s]


epoch: 3
train_loss: 0.512, train_acc: 0.754
valid_loss: 0.510, valid_acc: 0.748
Best model saved with valid_acc: 0.748


training...: 100%|██████████| 440/440 [00:16<00:00, 26.63it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 69.13it/s]


epoch: 4
train_loss: 0.472, train_acc: 0.783
valid_loss: 0.480, valid_acc: 0.770
Best model saved with valid_acc: 0.770


training...: 100%|██████████| 440/440 [00:16<00:00, 26.46it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 68.81it/s]


epoch: 5
train_loss: 0.432, train_acc: 0.804
valid_loss: 0.467, valid_acc: 0.772
Best model saved with valid_acc: 0.772


training...: 100%|██████████| 440/440 [00:16<00:00, 26.62it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 68.32it/s]


epoch: 6
train_loss: 0.395, train_acc: 0.826
valid_loss: 0.451, valid_acc: 0.785
Best model saved with valid_acc: 0.785


training...: 100%|██████████| 440/440 [00:16<00:00, 26.39it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 69.03it/s]


epoch: 7
train_loss: 0.366, train_acc: 0.839
valid_loss: 0.414, valid_acc: 0.808
Best model saved with valid_acc: 0.808


training...: 100%|██████████| 440/440 [00:16<00:00, 26.47it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 67.40it/s]


epoch: 8
train_loss: 0.332, train_acc: 0.861
valid_loss: 0.412, valid_acc: 0.808
No improvement. Patience: 1/3


training...: 100%|██████████| 440/440 [00:16<00:00, 26.30it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 68.61it/s]


epoch: 9
train_loss: 0.311, train_acc: 0.868
valid_loss: 0.388, valid_acc: 0.823
Best model saved with valid_acc: 0.823


training...: 100%|██████████| 440/440 [00:16<00:00, 26.33it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 68.72it/s]


epoch: 10
train_loss: 0.284, train_acc: 0.883
valid_loss: 0.408, valid_acc: 0.812
No improvement. Patience: 1/3


training...: 100%|██████████| 440/440 [00:16<00:00, 26.37it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 69.29it/s]


epoch: 11
train_loss: 0.266, train_acc: 0.892
valid_loss: 0.396, valid_acc: 0.821
No improvement. Patience: 2/3


training...: 100%|██████████| 440/440 [00:16<00:00, 26.32it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 68.49it/s]


epoch: 12
train_loss: 0.249, train_acc: 0.899
valid_loss: 0.368, valid_acc: 0.837
Best model saved with valid_acc: 0.837


training...: 100%|██████████| 440/440 [00:16<00:00, 26.36it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 68.36it/s]


epoch: 13
train_loss: 0.223, train_acc: 0.913
valid_loss: 0.385, valid_acc: 0.836
No improvement. Patience: 1/3


training...: 100%|██████████| 440/440 [00:16<00:00, 26.32it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 67.94it/s]


epoch: 14
train_loss: 0.210, train_acc: 0.919
valid_loss: 0.370, valid_acc: 0.840
Best model saved with valid_acc: 0.840


training...: 100%|██████████| 440/440 [00:16<00:00, 26.42it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 68.67it/s]


epoch: 15
train_loss: 0.193, train_acc: 0.926
valid_loss: 0.369, valid_acc: 0.839
No improvement. Patience: 1/3


training...: 100%|██████████| 440/440 [00:16<00:00, 26.17it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 68.29it/s]


epoch: 16
train_loss: 0.175, train_acc: 0.932
valid_loss: 0.373, valid_acc: 0.841
Best model saved with valid_acc: 0.841


training...: 100%|██████████| 440/440 [00:16<00:00, 26.34it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 68.19it/s]


epoch: 17
train_loss: 0.166, train_acc: 0.935
valid_loss: 0.363, valid_acc: 0.843
Best model saved with valid_acc: 0.843


training...: 100%|██████████| 440/440 [00:16<00:00, 26.33it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 68.66it/s]


epoch: 18
train_loss: 0.148, train_acc: 0.944
valid_loss: 0.365, valid_acc: 0.847
Best model saved with valid_acc: 0.847


training...: 100%|██████████| 440/440 [00:16<00:00, 26.42it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 68.20it/s]


epoch: 19
train_loss: 0.135, train_acc: 0.951
valid_loss: 0.376, valid_acc: 0.845
No improvement. Patience: 1/3


training...: 100%|██████████| 440/440 [00:16<00:00, 26.30it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 68.35it/s]


epoch: 20
train_loss: 0.122, train_acc: 0.956
valid_loss: 0.386, valid_acc: 0.845
No improvement. Patience: 2/3


training...: 100%|██████████| 440/440 [00:16<00:00, 26.37it/s]
evaluating...: 100%|██████████| 147/147 [00:02<00:00, 68.58it/s]


epoch: 21
train_loss: 0.115, train_acc: 0.958
valid_loss: 0.383, valid_acc: 0.844
No improvement. Patience: 3/3
Early stopping!


evaluating...: 100%|██████████| 782/782 [00:11<00:00, 70.66it/s]

test_loss: 0.389, test_acc: 0.836
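
<p style="font-size: 16px;"> The patience logic in the training loop above is repeated in every experiment cell. A minimal sketch of how it could be wrapped in a small helper is shown below. The class name <code>EarlyStopping</code> and its <code>step</code> method are hypothetical (they are not part of the notebook or of PyTorch), and the scores in the demo loop are made-up values used only to exercise the logic. </p>

```python
class EarlyStopping:
    """Hypothetical helper: tracks the best validation score and signals when to stop."""

    def __init__(self, patience=5):
        self.patience = patience
        self.best_score = float("-inf")  # the notebook starts from 0 instead
        self.counter = 0  # consecutive epochs without improvement

    def step(self, score):
        """Return (improved, should_stop) for this epoch's validation score."""
        if score > self.best_score:
            self.best_score = score
            self.counter = 0
            return True, False
        self.counter += 1
        return False, self.counter >= self.patience


# Demo with made-up validation accuracies (not taken from the runs above).
stopper = EarlyStopping(patience=3)
history = []
for valid_acc in [0.60, 0.70, 0.69, 0.70, 0.68, 0.71]:
    improved, should_stop = stopper.step(valid_acc)
    history.append((improved, should_stop))
    if improved:
        pass  # this is where torch.save(model.state_dict(), ...) would go
    if should_stop:
        break

print(history)
print(stopper.best_score)
```

<p style="font-size: 16px;"> As in the notebook's loop, a tie with the current best counts as "no improvement", so the final score in the demo list is never reached. </p>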





In [34]:

embedding_dim = 128  # Dimension of token embeddings (controls model capacity)
num_heads = 8  # Number of attention heads (must divide embedding_dim evenly)
dropout = 0.4  # Dropout rate for regularization (helps reduce overfitting)
forward_expansion = 3
max_length = 512
vocab_size = 35000

model = Transformer(
    vocab_size, max_length, embedding_dim, num_heads, forward_expansion, dropout)
model = model.to(device)

# lr = # TODO: Please set a proper learning rate
# optimizer = # TODO: Please define an optimizer
# criterion = # TODO: What loss function should we use?
lr = 5e-4  # Learning rate (controls step size in optimization)
optimizer = optim.Adam(model.parameters(), lr=lr)
criterion = nn.BCEWithLogitsLoss()  # Binary Cross Entropy with Logits 

# n_epochs = # TODO: Please set the number of epochs as you need
n_epochs = 30

best_valid_acc = 0  # Track the best validation accuracy seen so far
patience = 5  # Stop if validation accuracy has not improved for 5 consecutive epochs
patience_counter = 0  # Number of consecutive epochs without improvement
for epoch in range(n_epochs):
    train_loss, train_acc = train(train_data_loader, model, criterion, optimizer, device)
    valid_loss, valid_acc = evaluate(valid_data_loader, model, criterion, device)

    print(f"epoch: {epoch}")
    print(f"train_loss: {train_loss:.3f}, train_acc: {train_acc:.3f}")
    print(f"valid_loss: {valid_loss:.3f}, valid_acc: {valid_acc:.3f}")
    if valid_acc > best_valid_acc:
        best_valid_acc = valid_acc
        torch.save(model.state_dict(), 'best_model.pt')
        print(f"Best model saved with valid_acc: {valid_acc:.3f}")
        patience_counter = 0
    else:
        patience_counter += 1
        print(f"No improvement. Patience: {patience_counter}/{patience}")
        if patience_counter >= patience:
            print("Early stopping!")
            break

model.load_state_dict(torch.load('best_model.pt'))
test_loss, test_acc = evaluate(test_data_loader, model, criterion, device)
print(f"test_loss: {test_loss:.3f}, test_acc: {test_acc:.3f}")

training...: 100%|██████████| 293/293 [00:09<00:00, 31.72it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 78.14it/s]


epoch: 0
train_loss: 0.675, train_acc: 0.582
valid_loss: 0.661, valid_acc: 0.577
Best model saved with valid_acc: 0.577


training...: 100%|██████████| 293/293 [00:09<00:00, 31.65it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 77.86it/s]


epoch: 1
train_loss: 0.582, train_acc: 0.697
valid_loss: 0.545, valid_acc: 0.729
Best model saved with valid_acc: 0.729


training...: 100%|██████████| 293/293 [00:09<00:00, 31.66it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 77.96it/s]


epoch: 2
train_loss: 0.493, train_acc: 0.762
valid_loss: 0.464, valid_acc: 0.783
Best model saved with valid_acc: 0.783


training...: 100%|██████████| 293/293 [00:09<00:00, 31.65it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 77.97it/s]


epoch: 3
train_loss: 0.396, train_acc: 0.825
valid_loss: 0.386, valid_acc: 0.831
Best model saved with valid_acc: 0.831


training...: 100%|██████████| 293/293 [00:09<00:00, 31.64it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 77.99it/s]


epoch: 4
train_loss: 0.328, train_acc: 0.857
valid_loss: 0.380, valid_acc: 0.836
Best model saved with valid_acc: 0.836


training...: 100%|██████████| 293/293 [00:09<00:00, 31.69it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 77.94it/s]


epoch: 5
train_loss: 0.279, train_acc: 0.885
valid_loss: 0.348, valid_acc: 0.853
Best model saved with valid_acc: 0.853


training...: 100%|██████████| 293/293 [00:09<00:00, 31.62it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 77.76it/s]


epoch: 6
train_loss: 0.242, train_acc: 0.900
valid_loss: 0.365, valid_acc: 0.846
No improvement. Patience: 1/5


training...: 100%|██████████| 293/293 [00:09<00:00, 31.67it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 77.86it/s]


epoch: 7
train_loss: 0.214, train_acc: 0.913
valid_loss: 0.341, valid_acc: 0.858
Best model saved with valid_acc: 0.858


training...: 100%|██████████| 293/293 [00:09<00:00, 31.64it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 77.97it/s]


epoch: 8
train_loss: 0.184, train_acc: 0.930
valid_loss: 0.349, valid_acc: 0.859
Best model saved with valid_acc: 0.859


training...: 100%|██████████| 293/293 [00:09<00:00, 31.65it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 77.72it/s]


epoch: 9
train_loss: 0.156, train_acc: 0.940
valid_loss: 0.358, valid_acc: 0.863
Best model saved with valid_acc: 0.863


training...: 100%|██████████| 293/293 [00:09<00:00, 31.64it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 77.75it/s]


epoch: 10
train_loss: 0.137, train_acc: 0.947
valid_loss: 0.361, valid_acc: 0.862
No improvement. Patience: 1/5


training...: 100%|██████████| 293/293 [00:09<00:00, 31.66it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 77.80it/s]


epoch: 11
train_loss: 0.114, train_acc: 0.958
valid_loss: 0.380, valid_acc: 0.865
Best model saved with valid_acc: 0.865


training...: 100%|██████████| 293/293 [00:09<00:00, 31.54it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 77.86it/s]


epoch: 12
train_loss: 0.100, train_acc: 0.963
valid_loss: 0.383, valid_acc: 0.865
Best model saved with valid_acc: 0.865


training...: 100%|██████████| 293/293 [00:09<00:00, 31.60it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 77.38it/s]


epoch: 13
train_loss: 0.083, train_acc: 0.969
valid_loss: 0.429, valid_acc: 0.855
No improvement. Patience: 1/5


training...: 100%|██████████| 293/293 [00:09<00:00, 31.57it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 77.84it/s]


epoch: 14
train_loss: 0.073, train_acc: 0.973
valid_loss: 0.421, valid_acc: 0.862
No improvement. Patience: 2/5


training...: 100%|██████████| 293/293 [00:09<00:00, 31.51it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 77.32it/s]


epoch: 15
train_loss: 0.072, train_acc: 0.974
valid_loss: 0.424, valid_acc: 0.865
Best model saved with valid_acc: 0.865


training...: 100%|██████████| 293/293 [00:09<00:00, 31.59it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 77.83it/s]


epoch: 16
train_loss: 0.060, train_acc: 0.978
valid_loss: 0.458, valid_acc: 0.861
No improvement. Patience: 1/5


training...: 100%|██████████| 293/293 [00:09<00:00, 31.52it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 77.17it/s]


epoch: 17
train_loss: 0.050, train_acc: 0.982
valid_loss: 0.468, valid_acc: 0.864
No improvement. Patience: 2/5


training...: 100%|██████████| 293/293 [00:09<00:00, 31.56it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 77.74it/s]


epoch: 18
train_loss: 0.046, train_acc: 0.984
valid_loss: 0.496, valid_acc: 0.859
No improvement. Patience: 3/5


training...: 100%|██████████| 293/293 [00:09<00:00, 31.61it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 77.76it/s]


epoch: 19
train_loss: 0.040, train_acc: 0.986
valid_loss: 0.492, valid_acc: 0.862
No improvement. Patience: 4/5


training...: 100%|██████████| 293/293 [00:09<00:00, 31.58it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 77.81it/s]


epoch: 20
train_loss: 0.036, train_acc: 0.987
valid_loss: 0.503, valid_acc: 0.865
No improvement. Patience: 5/5
Early stopping!


evaluating...: 100%|██████████| 391/391 [00:04<00:00, 81.31it/s]

test_loss: 0.469, test_acc: 0.847
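
<p style="font-size: 16px;"> In the run above, validation loss reaches its minimum well before validation accuracy stops improving, and the last checkpoint was saved at epoch 15. The sketch below compares the two selection criteria using the validation numbers copied from the log (rounded to three decimals, so ties in accuracy are an artifact of rounding). It is only an illustration of how the logged values could be inspected, not part of the assignment code. </p>

```python
# Validation metrics copied from the In [34] log above (epochs 0-20).
valid_loss = [0.661, 0.545, 0.464, 0.386, 0.380, 0.348, 0.365, 0.341, 0.349,
              0.358, 0.361, 0.380, 0.383, 0.429, 0.421, 0.424, 0.458, 0.468,
              0.496, 0.492, 0.503]
valid_acc = [0.577, 0.729, 0.783, 0.831, 0.836, 0.853, 0.846, 0.858, 0.859,
             0.863, 0.862, 0.865, 0.865, 0.855, 0.862, 0.865, 0.861, 0.864,
             0.859, 0.862, 0.865]

# Epoch with the lowest validation loss.
best_loss_epoch = min(range(len(valid_loss)), key=valid_loss.__getitem__)

# Epochs that share the highest (rounded) validation accuracy.
best_acc = max(valid_acc)
best_acc_epochs = [e for e, a in enumerate(valid_acc) if a == best_acc]

print(f"lowest valid_loss {valid_loss[best_loss_epoch]:.3f} at epoch {best_loss_epoch}")
print(f"highest valid_acc {best_acc:.3f} at epochs {best_acc_epochs}")
```

<p style="font-size: 16px;"> Selecting the checkpoint by validation accuracy keeps a later epoch whose validation loss is noticeably higher than the minimum, which may be one reason the reported test loss is higher than in the other runs. </p>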





In [36]:

embedding_dim = 128  # Dimension of token embeddings (controls model capacity)
num_heads = 8  # Number of attention heads (must divide embedding_dim evenly)
dropout = 0.45  # Dropout rate for regularization (helps reduce overfitting)
forward_expansion = 3
max_length = 512
vocab_size = 35000

model = Transformer(
    vocab_size, max_length, embedding_dim, num_heads, forward_expansion, dropout)
model = model.to(device)

# lr = # TODO: Please set a proper learning rate
# optimizer = # TODO: Please define an optimizer
# criterion = # TODO: What loss function should we use?
lr = 5e-4  # Learning rate (controls step size in optimization)
optimizer = optim.Adam(model.parameters(), lr=lr)
criterion = nn.BCEWithLogitsLoss()  # Binary Cross Entropy with Logits 

# n_epochs = # TODO: Please set the number of epochs as you need
n_epochs = 30

best_valid_acc = 0  # Track the best validation accuracy seen so far
patience = 5  # Stop if validation accuracy has not improved for 5 consecutive epochs
patience_counter = 0  # Number of consecutive epochs without improvement
for epoch in range(n_epochs):
    train_loss, train_acc = train(train_data_loader, model, criterion, optimizer, device)
    valid_loss, valid_acc = evaluate(valid_data_loader, model, criterion, device)

    print(f"epoch: {epoch}")
    print(f"train_loss: {train_loss:.3f}, train_acc: {train_acc:.3f}")
    print(f"valid_loss: {valid_loss:.3f}, valid_acc: {valid_acc:.3f}")
    if valid_acc > best_valid_acc:
        best_valid_acc = valid_acc
        torch.save(model.state_dict(), 'best_model.pt')
        print(f"Best model saved with valid_acc: {valid_acc:.3f}")
        patience_counter = 0
    else:
        patience_counter += 1
        print(f"No improvement. Patience: {patience_counter}/{patience}")
        if patience_counter >= patience:
            print("Early stopping!")
            break

model.load_state_dict(torch.load('best_model.pt'))
test_loss, test_acc = evaluate(test_data_loader, model, criterion, device)
print(f"test_loss: {test_loss:.3f}, test_acc: {test_acc:.3f}")

training...: 100%|██████████| 293/293 [00:09<00:00, 31.60it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 77.99it/s]


epoch: 0
train_loss: 0.678, train_acc: 0.565
valid_loss: 0.624, valid_acc: 0.661
Best model saved with valid_acc: 0.661


training...: 100%|██████████| 293/293 [00:09<00:00, 31.60it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 77.42it/s]


epoch: 1
train_loss: 0.575, train_acc: 0.696
valid_loss: 0.572, valid_acc: 0.697
Best model saved with valid_acc: 0.697


training...: 100%|██████████| 293/293 [00:09<00:00, 31.51it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 77.48it/s]


epoch: 2
train_loss: 0.496, train_acc: 0.760
valid_loss: 0.481, valid_acc: 0.777
Best model saved with valid_acc: 0.777


training...: 100%|██████████| 293/293 [00:09<00:00, 31.61it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 77.92it/s]


epoch: 3
train_loss: 0.405, train_acc: 0.820
valid_loss: 0.407, valid_acc: 0.823
Best model saved with valid_acc: 0.823


training...: 100%|██████████| 293/293 [00:09<00:00, 31.53it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 77.98it/s]


epoch: 4
train_loss: 0.332, train_acc: 0.858
valid_loss: 0.370, valid_acc: 0.843
Best model saved with valid_acc: 0.843


training...: 100%|██████████| 293/293 [00:09<00:00, 31.64it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 77.51it/s]


epoch: 5
train_loss: 0.285, train_acc: 0.879
valid_loss: 0.349, valid_acc: 0.852
Best model saved with valid_acc: 0.852


training...: 100%|██████████| 293/293 [00:09<00:00, 31.52it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 77.81it/s]


epoch: 6
train_loss: 0.252, train_acc: 0.898
valid_loss: 0.343, valid_acc: 0.853
Best model saved with valid_acc: 0.853


training...: 100%|██████████| 293/293 [00:09<00:00, 31.55it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 77.81it/s]


epoch: 7
train_loss: 0.219, train_acc: 0.912
valid_loss: 0.361, valid_acc: 0.847
No improvement. Patience: 1/5


training...: 100%|██████████| 293/293 [00:09<00:00, 31.60it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 77.78it/s]


epoch: 8
train_loss: 0.187, train_acc: 0.928
valid_loss: 0.336, valid_acc: 0.858
Best model saved with valid_acc: 0.858


training...: 100%|██████████| 293/293 [00:09<00:00, 31.61it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 77.72it/s]


epoch: 9
train_loss: 0.167, train_acc: 0.937
valid_loss: 0.335, valid_acc: 0.866
Best model saved with valid_acc: 0.866


training...: 100%|██████████| 293/293 [00:09<00:00, 31.65it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 78.00it/s]


epoch: 10
train_loss: 0.144, train_acc: 0.945
valid_loss: 0.356, valid_acc: 0.858
No improvement. Patience: 1/5


training...: 100%|██████████| 293/293 [00:09<00:00, 31.64it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 77.89it/s]


epoch: 11
train_loss: 0.129, train_acc: 0.951
valid_loss: 0.353, valid_acc: 0.865
No improvement. Patience: 2/5


training...: 100%|██████████| 293/293 [00:09<00:00, 31.67it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 77.84it/s]


epoch: 12
train_loss: 0.107, train_acc: 0.960
valid_loss: 0.370, valid_acc: 0.867
Best model saved with valid_acc: 0.867


training...: 100%|██████████| 293/293 [00:09<00:00, 31.61it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 77.65it/s]


epoch: 13
train_loss: 0.096, train_acc: 0.965
valid_loss: 0.384, valid_acc: 0.862
No improvement. Patience: 1/5


training...: 100%|██████████| 293/293 [00:09<00:00, 31.50it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 77.82it/s]


epoch: 14
train_loss: 0.083, train_acc: 0.969
valid_loss: 0.394, valid_acc: 0.864
No improvement. Patience: 2/5


training...: 100%|██████████| 293/293 [00:09<00:00, 31.64it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 77.93it/s]


epoch: 15
train_loss: 0.071, train_acc: 0.975
valid_loss: 0.412, valid_acc: 0.863
No improvement. Patience: 3/5


training...: 100%|██████████| 293/293 [00:09<00:00, 31.66it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 77.79it/s]


epoch: 16
train_loss: 0.059, train_acc: 0.978
valid_loss: 0.446, valid_acc: 0.863
No improvement. Patience: 4/5


training...: 100%|██████████| 293/293 [00:09<00:00, 31.62it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 77.81it/s]


epoch: 17
train_loss: 0.060, train_acc: 0.980
valid_loss: 0.448, valid_acc: 0.864
No improvement. Patience: 5/5
Early stopping!


evaluating...: 100%|██████████| 391/391 [00:04<00:00, 81.61it/s]

test_loss: 0.402, test_acc: 0.849
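
<p style="font-size: 16px;"> All runs use <code>nn.BCEWithLogitsLoss</code>, which takes raw logits rather than probabilities. The pure-Python sketch below illustrates why combining the sigmoid and the log in one expression is numerically safer than applying them separately. The function names <code>bce_naive</code> and <code>bce_with_logits</code> are hypothetical, and the stable formula is the standard rearrangement of binary cross entropy; it is meant to convey the idea and is not PyTorch's actual implementation. </p>

```python
import math


def bce_naive(logit, target):
    """Sigmoid followed by binary cross entropy, computed in two separate steps."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


def bce_with_logits(logit, target):
    """Same loss rearranged as max(x, 0) - x*y + log(1 + exp(-|x|))."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


# For a moderate logit the two versions agree.
print(bce_naive(0.5, 1.0), bce_with_logits(0.5, 1.0))

# For a confident wrong prediction the sigmoid saturates to exactly 1.0,
# so the naive version ends up taking log(0).
try:
    print(bce_naive(40.0, 0.0))
except ValueError as err:
    print("naive version failed:", err)

print(bce_with_logits(40.0, 0.0))
```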





In [37]:

embedding_dim = 128  # Dimension of token embeddings (controls model capacity)
num_heads = 8  # Number of attention heads (must divide embedding_dim evenly)
dropout = 0.45  # Dropout rate for regularization (helps reduce overfitting)
forward_expansion = 3
max_length = 512
vocab_size = 35000

model = Transformer(
    vocab_size, max_length, embedding_dim, num_heads, forward_expansion, dropout)
model = model.to(device)

# lr = # TODO: Please set a proper learning rate
# optimizer = # TODO: Please define an optimizer
# criterion = # TODO: What loss function should we use?
lr = 5e-4  
optimizer = optim.AdamW(model.parameters(), lr=lr, weight_decay=0.001) 

criterion = nn.BCEWithLogitsLoss()  

# n_epochs = # TODO: Please set the number of epochs as you need
n_epochs = 30

best_valid_acc = 0  
patience = 5  
patience_counter = 0  
for epoch in range(n_epochs):
    train_loss, train_acc = train(train_data_loader, model, criterion, optimizer, device)
    valid_loss, valid_acc = evaluate(valid_data_loader, model, criterion, device)

    print(f"epoch: {epoch}")
    print(f"train_loss: {train_loss:.3f}, train_acc: {train_acc:.3f}")
    print(f"valid_loss: {valid_loss:.3f}, valid_acc: {valid_acc:.3f}")
    if valid_acc > best_valid_acc:
        best_valid_acc = valid_acc
        torch.save(model.state_dict(), 'best_model.pt')
        print(f"Best model saved with valid_acc: {valid_acc:.3f}")
        patience_counter = 0
    else:
        patience_counter += 1
        print(f"No improvement. Patience: {patience_counter}/{patience}")
        if patience_counter >= patience:
            print("Early stopping!")
            break

model.load_state_dict(torch.load('best_model.pt'))
test_loss, test_acc = evaluate(test_data_loader, model, criterion, device)
print(f"test_loss: {test_loss:.3f}, test_acc: {test_acc:.3f}")

training...: 100%|██████████| 293/293 [00:09<00:00, 31.68it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 78.23it/s]


epoch: 0
train_loss: 0.673, train_acc: 0.572
valid_loss: 0.617, valid_acc: 0.667
Best model saved with valid_acc: 0.667


training...: 100%|██████████| 293/293 [00:09<00:00, 31.69it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 78.04it/s]


epoch: 1
train_loss: 0.546, train_acc: 0.725
valid_loss: 0.498, valid_acc: 0.763
Best model saved with valid_acc: 0.763


training...: 100%|██████████| 293/293 [00:09<00:00, 31.62it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 77.94it/s]


epoch: 2
train_loss: 0.445, train_acc: 0.792
valid_loss: 0.431, valid_acc: 0.801
Best model saved with valid_acc: 0.801


training...: 100%|██████████| 293/293 [00:09<00:00, 31.65it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 77.77it/s]


epoch: 3
train_loss: 0.381, train_acc: 0.832
valid_loss: 0.411, valid_acc: 0.816
Best model saved with valid_acc: 0.816


training...: 100%|██████████| 293/293 [00:09<00:00, 31.56it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 77.76it/s]


epoch: 4
train_loss: 0.322, train_acc: 0.863
valid_loss: 0.391, valid_acc: 0.827
Best model saved with valid_acc: 0.827


training...: 100%|██████████| 293/293 [00:09<00:00, 31.57it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 77.70it/s]


epoch: 5
train_loss: 0.279, train_acc: 0.885
valid_loss: 0.364, valid_acc: 0.845
Best model saved with valid_acc: 0.845


training...: 100%|██████████| 293/293 [00:09<00:00, 31.39it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 77.41it/s]


epoch: 6
train_loss: 0.247, train_acc: 0.899
valid_loss: 0.346, valid_acc: 0.847
Best model saved with valid_acc: 0.847


training...: 100%|██████████| 293/293 [00:09<00:00, 31.38it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 77.67it/s]


epoch: 7
train_loss: 0.214, train_acc: 0.914
valid_loss: 0.332, valid_acc: 0.858
Best model saved with valid_acc: 0.858


training...: 100%|██████████| 293/293 [00:09<00:00, 31.40it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 77.87it/s]


epoch: 8
train_loss: 0.191, train_acc: 0.924
valid_loss: 0.336, valid_acc: 0.858
No improvement. Patience: 1/5


training...: 100%|██████████| 293/293 [00:09<00:00, 31.47it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 77.77it/s]


epoch: 9
train_loss: 0.159, train_acc: 0.941
valid_loss: 0.346, valid_acc: 0.862
Best model saved with valid_acc: 0.862


training...: 100%|██████████| 293/293 [00:09<00:00, 31.46it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 77.98it/s]


epoch: 10
train_loss: 0.140, train_acc: 0.947
valid_loss: 0.352, valid_acc: 0.859
No improvement. Patience: 1/5


training...: 100%|██████████| 293/293 [00:09<00:00, 31.58it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 77.90it/s]


epoch: 11
train_loss: 0.119, train_acc: 0.955
valid_loss: 0.356, valid_acc: 0.863
Best model saved with valid_acc: 0.863


training...: 100%|██████████| 293/293 [00:09<00:00, 31.36it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 77.99it/s]


epoch: 12
train_loss: 0.103, train_acc: 0.962
valid_loss: 0.359, valid_acc: 0.865
Best model saved with valid_acc: 0.865


training...: 100%|██████████| 293/293 [00:09<00:00, 31.55it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 78.20it/s]


epoch: 13
train_loss: 0.089, train_acc: 0.967
valid_loss: 0.389, valid_acc: 0.857
No improvement. Patience: 1/5


training...: 100%|██████████| 293/293 [00:09<00:00, 31.58it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 78.36it/s]


epoch: 14
train_loss: 0.085, train_acc: 0.969
valid_loss: 0.400, valid_acc: 0.856
No improvement. Patience: 2/5


training...: 100%|██████████| 293/293 [00:09<00:00, 31.61it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 77.76it/s]


epoch: 15
train_loss: 0.065, train_acc: 0.977
valid_loss: 0.405, valid_acc: 0.860
No improvement. Patience: 3/5


training...: 100%|██████████| 293/293 [00:09<00:00, 31.48it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 78.32it/s]


epoch: 16
train_loss: 0.061, train_acc: 0.978
valid_loss: 0.412, valid_acc: 0.863
No improvement. Patience: 4/5


training...: 100%|██████████| 293/293 [00:09<00:00, 31.49it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 77.77it/s]


epoch: 17
train_loss: 0.052, train_acc: 0.982
valid_loss: 0.427, valid_acc: 0.863
No improvement. Patience: 5/5
Early stopping!


evaluating...: 100%|██████████| 391/391 [00:04<00:00, 81.54it/s]

test_loss: 0.398, test_acc: 0.849
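
<p style="font-size: 16px;"> The experiments above switch between <code>optim.Adam</code> (used here without weight decay) and <code>optim.AdamW</code> with a small <code>weight_decay</code>. The scalar sketch below shows what "decoupled" weight decay means by taking a single optimizer step from zero-initialised moments. The function <code>adam_step</code> and its arguments are hypothetical and simplified; it mirrors the documented update rules but is not the PyTorch implementation. </p>

```python
import math


def adam_step(w, grad, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8,
              weight_decay=0.0, decoupled=False):
    """One Adam step (t = 1, moments start at zero) on a single scalar weight."""
    if decoupled:
        # AdamW-style: shrink the weight directly, independent of the gradient.
        w = w * (1.0 - lr * weight_decay)
    else:
        # L2-style: fold the decay term into the gradient before Adam rescales it.
        grad = grad + weight_decay * w
    m = (1.0 - beta1) * grad
    v = (1.0 - beta2) * grad * grad
    m_hat = m / (1.0 - beta1)  # bias correction for t = 1
    v_hat = v / (1.0 - beta2)
    return w - lr * m_hat / (math.sqrt(v_hat) + eps)


# With a zero loss gradient, only the decay term acts on the weight.
w_l2 = adam_step(1.0, 0.0, weight_decay=0.01, decoupled=False)
w_adamw = adam_step(1.0, 0.0, weight_decay=0.01, decoupled=True)
print(w_l2, w_adamw)
```

<p style="font-size: 16px;"> In the L2-style step, Adam's normalisation turns the tiny decay gradient into a full-size step of about <code>lr</code>, whereas the decoupled step shrinks the weight by only <code>lr * weight_decay</code>. This is a toy case chosen to make the difference visible, not a claim about how the runs above behave. </p>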





In [42]:

embedding_dim = 128  # Dimension of token embeddings (controls model capacity)
num_heads = 8  # Number of attention heads (must divide embedding_dim evenly)
dropout = 0.45  # Dropout rate for regularization (helps reduce overfitting)
forward_expansion = 3
max_length = 512
vocab_size = 35000

model = Transformer(
    vocab_size, max_length, embedding_dim, num_heads, forward_expansion, dropout)
model = model.to(device)

# lr = # TODO: Please set a proper learning rate
# optimizer = # TODO: Please define an optimizer
# criterion = # TODO: What loss function should we use?
lr = 5e-4  
# optimizer = optim.AdamW(model.parameters(), lr=lr, weight_decay=0.001) 
optimizer = optim.Adam(model.parameters(), lr=lr) 

criterion = nn.BCEWithLogitsLoss()  
# criterion = nn.BCELoss()
# n_epochs = # TODO: Please set the number of epochs as you need
n_epochs = 30

best_valid_acc = 0  
patience = 5  
patience_counter = 0  
for epoch in range(n_epochs):
    train_loss, train_acc = train(train_data_loader, model, criterion, optimizer, device)
    valid_loss, valid_acc = evaluate(valid_data_loader, model, criterion, device)

    print(f"epoch: {epoch}")
    print(f"train_loss: {train_loss:.3f}, train_acc: {train_acc:.3f}")
    print(f"valid_loss: {valid_loss:.3f}, valid_acc: {valid_acc:.3f}")
    if valid_acc > best_valid_acc:
        best_valid_acc = valid_acc
        torch.save(model.state_dict(), 'best_model.pt')
        print(f"Best model saved with valid_acc: {valid_acc:.3f}")
        patience_counter = 0
    else:
        patience_counter += 1
        print(f"No improvement. Patience: {patience_counter}/{patience}")
        if patience_counter >= patience:
            print("Early stopping!")
            break

model.load_state_dict(torch.load('best_model.pt'))
test_loss, test_acc = evaluate(test_data_loader, model, criterion, device)
print(f"test_loss: {test_loss:.3f}, test_acc: {test_acc:.3f}")

training...: 100%|██████████| 293/293 [00:09<00:00, 32.04it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 78.54it/s]


epoch: 0
train_loss: 0.690, train_acc: 0.542
valid_loss: 0.661, valid_acc: 0.593
Best model saved with valid_acc: 0.593


training...: 100%|██████████| 293/293 [00:09<00:00, 32.05it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 78.87it/s]


epoch: 1
train_loss: 0.589, train_acc: 0.688
valid_loss: 0.546, valid_acc: 0.734
Best model saved with valid_acc: 0.734


training...: 100%|██████████| 293/293 [00:09<00:00, 31.96it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 78.38it/s]


epoch: 2
train_loss: 0.492, train_acc: 0.764
valid_loss: 0.477, valid_acc: 0.781
Best model saved with valid_acc: 0.781


training...: 100%|██████████| 293/293 [00:09<00:00, 31.90it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 78.75it/s]


epoch: 3
train_loss: 0.423, train_acc: 0.807
valid_loss: 0.451, valid_acc: 0.787
Best model saved with valid_acc: 0.787


training...: 100%|██████████| 293/293 [00:09<00:00, 31.91it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 78.40it/s]


epoch: 4
train_loss: 0.363, train_acc: 0.840
valid_loss: 0.391, valid_acc: 0.824
Best model saved with valid_acc: 0.824


training...: 100%|██████████| 293/293 [00:09<00:00, 31.80it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 78.41it/s]


epoch: 5
train_loss: 0.314, train_acc: 0.865
valid_loss: 0.353, valid_acc: 0.849
Best model saved with valid_acc: 0.849


training...: 100%|██████████| 293/293 [00:09<00:00, 31.71it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 78.18it/s]


epoch: 6
train_loss: 0.275, train_acc: 0.885
valid_loss: 0.338, valid_acc: 0.855
Best model saved with valid_acc: 0.855


training...: 100%|██████████| 293/293 [00:09<00:00, 31.72it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 78.57it/s]


epoch: 7
train_loss: 0.236, train_acc: 0.906
valid_loss: 0.328, valid_acc: 0.862
Best model saved with valid_acc: 0.862


training...: 100%|██████████| 293/293 [00:09<00:00, 31.79it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 78.22it/s]


epoch: 8
train_loss: 0.205, train_acc: 0.919
valid_loss: 0.322, valid_acc: 0.864
Best model saved with valid_acc: 0.864


training...: 100%|██████████| 293/293 [00:09<00:00, 31.69it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 78.59it/s]


epoch: 9
train_loss: 0.180, train_acc: 0.931
valid_loss: 0.321, valid_acc: 0.863
No improvement. Patience: 1/5


training...: 100%|██████████| 293/293 [00:09<00:00, 31.73it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 78.09it/s]


epoch: 10
train_loss: 0.157, train_acc: 0.940
valid_loss: 0.325, valid_acc: 0.864
Best model saved with valid_acc: 0.864


training...: 100%|██████████| 293/293 [00:09<00:00, 31.69it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 78.54it/s]


epoch: 11
train_loss: 0.133, train_acc: 0.951
valid_loss: 0.339, valid_acc: 0.861
No improvement. Patience: 1/5


training...: 100%|██████████| 293/293 [00:09<00:00, 31.76it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 78.49it/s]


epoch: 12
train_loss: 0.117, train_acc: 0.958
valid_loss: 0.387, valid_acc: 0.855
No improvement. Patience: 2/5


training...: 100%|██████████| 293/293 [00:09<00:00, 31.69it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 78.26it/s]


epoch: 13
train_loss: 0.104, train_acc: 0.961
valid_loss: 0.373, valid_acc: 0.858
No improvement. Patience: 3/5


training...: 100%|██████████| 293/293 [00:09<00:00, 31.74it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 78.48it/s]


epoch: 14
train_loss: 0.089, train_acc: 0.967
valid_loss: 0.372, valid_acc: 0.863
No improvement. Patience: 4/5


training...: 100%|██████████| 293/293 [00:09<00:00, 31.65it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 77.97it/s]


epoch: 15
train_loss: 0.075, train_acc: 0.973
valid_loss: 0.417, valid_acc: 0.858
No improvement. Patience: 5/5
Early stopping!


evaluating...: 100%|██████████| 391/391 [00:04<00:00, 81.94it/s]

test_loss: 0.362, test_acc: 0.850





<p style="font-size: 25px;">Result: test accuracy 0.850</p>
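<p style="font-size: 16px;">The patience logic in the training loop above can be read as a small state machine: reset a counter when the validation score improves, otherwise increment it and stop once it reaches the patience limit. The sketch below isolates that logic in a hypothetical helper class (EarlyStopping is not part of the notebook or of PyTorch) and replays it on made-up validation accuracies rather than on the real runs.</p>

```python
class EarlyStopping:
    """Hypothetical helper: track the best validation score and decide when to stop."""

    def __init__(self, patience=5):
        self.patience = patience
        self.best_score = float("-inf")
        self.counter = 0

    def step(self, score):
        """Return (improved, should_stop) for the latest validation score."""
        if score > self.best_score:
            # Strict improvement: remember the score and reset the counter.
            self.best_score = score
            self.counter = 0
            return True, False
        # No improvement: count the epoch and stop once patience is used up.
        self.counter += 1
        return False, self.counter >= self.patience


# Replay a made-up sequence of validation accuracies (not the notebook's numbers).
stopper = EarlyStopping(patience=2)
history = [stopper.step(acc) for acc in [0.60, 0.70, 0.69, 0.70, 0.68]]
print(history)
```

<p style="font-size: 16px;">In a training loop, the first flag would decide whether to save a checkpoint and the second whether to break. Because the comparison is strict, a tie with the best score counts as no improvement, which matches the behaviour of the loop above.</p>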


In [37]:

embedding_dim = 128 
num_heads = 4  
dropout = 0.45  
forward_expansion = 3
max_length = 512
vocab_size = 35000

model = Transformer(
    vocab_size, max_length, embedding_dim, num_heads, forward_expansion, dropout)
model = model.to(device)

# lr = # TODO: Please set a proper learning rate
# optimizer = # TODO: Please define an optimizer
# criterion = # TODO: What loss function should we use?
lr = 5e-4  
optimizer = optim.AdamW(model.parameters(), lr=lr, weight_decay=0.001) 

criterion = nn.BCEWithLogitsLoss()  



# n_epochs = # TODO: Please set the number of epochs as you need
n_epochs = 30

best_valid_acc = 0  
patience = 5  
patience_counter = 0  
for epoch in range(n_epochs):
    train_loss, train_acc = train(train_data_loader, model, criterion, optimizer, device)
    valid_loss, valid_acc = evaluate(valid_data_loader, model, criterion, device)

    print(f"epoch: {epoch}")
    print(f"train_loss: {train_loss:.3f}, train_acc: {train_acc:.3f}")
    print(f"valid_loss: {valid_loss:.3f}, valid_acc: {valid_acc:.3f}")
    if valid_acc > best_valid_acc:
        best_valid_acc = valid_acc
        torch.save(model.state_dict(), 'best_model.pt')
        print(f"Best model saved with valid_acc: {valid_acc:.3f}")
        patience_counter = 0
    else:
        patience_counter += 1
        print(f"No improvement. Patience: {patience_counter}/{patience}")
        if patience_counter >= patience:
            print("Early stopping!")
            break

model.load_state_dict(torch.load('best_model.pt'))
test_loss, test_acc = evaluate(test_data_loader, model, criterion, device)
print(f"test_loss: {test_loss:.3f}, test_acc: {test_acc:.3f}")

training...: 100%|██████████| 293/293 [00:10<00:00, 27.07it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 67.08it/s]


epoch: 0
train_loss: 0.676, train_acc: 0.570
valid_loss: 0.633, valid_acc: 0.646
Best model saved with valid_acc: 0.646


training...: 100%|██████████| 293/293 [00:10<00:00, 27.06it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 65.84it/s]


epoch: 1
train_loss: 0.557, train_acc: 0.714
valid_loss: 0.512, valid_acc: 0.757
Best model saved with valid_acc: 0.757


training...: 100%|██████████| 293/293 [00:10<00:00, 27.05it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 64.92it/s]


epoch: 2
train_loss: 0.467, train_acc: 0.777
valid_loss: 0.467, valid_acc: 0.775
Best model saved with valid_acc: 0.775


training...: 100%|██████████| 293/293 [00:10<00:00, 26.97it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 64.82it/s]


epoch: 3
train_loss: 0.389, train_acc: 0.825
valid_loss: 0.420, valid_acc: 0.816
Best model saved with valid_acc: 0.816


training...: 100%|██████████| 293/293 [00:10<00:00, 26.96it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 64.90it/s]


epoch: 4
train_loss: 0.332, train_acc: 0.857
valid_loss: 0.379, valid_acc: 0.838
Best model saved with valid_acc: 0.838


training...: 100%|██████████| 293/293 [00:10<00:00, 26.94it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 65.50it/s]


epoch: 5
train_loss: 0.289, train_acc: 0.879
valid_loss: 0.345, valid_acc: 0.849
Best model saved with valid_acc: 0.849


training...: 100%|██████████| 293/293 [00:10<00:00, 26.91it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 67.14it/s]


epoch: 6
train_loss: 0.256, train_acc: 0.894
valid_loss: 0.336, valid_acc: 0.857
Best model saved with valid_acc: 0.857


training...: 100%|██████████| 293/293 [00:10<00:00, 26.84it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 67.92it/s]


epoch: 7
train_loss: 0.225, train_acc: 0.910
valid_loss: 0.354, valid_acc: 0.848
No improvement. Patience: 1/5


training...: 100%|██████████| 293/293 [00:10<00:00, 26.83it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 67.71it/s]


epoch: 8
train_loss: 0.200, train_acc: 0.921
valid_loss: 0.347, valid_acc: 0.856
No improvement. Patience: 2/5


training...: 100%|██████████| 293/293 [00:10<00:00, 26.85it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 67.67it/s]


epoch: 9
train_loss: 0.176, train_acc: 0.931
valid_loss: 0.335, valid_acc: 0.862
Best model saved with valid_acc: 0.862


training...: 100%|██████████| 293/293 [00:10<00:00, 26.89it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 65.82it/s]


epoch: 10
train_loss: 0.156, train_acc: 0.939
valid_loss: 0.333, valid_acc: 0.866
Best model saved with valid_acc: 0.866


training...: 100%|██████████| 293/293 [00:10<00:00, 26.91it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 64.50it/s]


epoch: 11
train_loss: 0.135, train_acc: 0.949
valid_loss: 0.353, valid_acc: 0.863
No improvement. Patience: 1/5


training...: 100%|██████████| 293/293 [00:10<00:00, 26.91it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 64.29it/s]


epoch: 12
train_loss: 0.116, train_acc: 0.956
valid_loss: 0.376, valid_acc: 0.857
No improvement. Patience: 2/5


training...: 100%|██████████| 293/293 [00:10<00:00, 26.98it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 64.83it/s]


epoch: 13
train_loss: 0.106, train_acc: 0.961
valid_loss: 0.361, valid_acc: 0.863
No improvement. Patience: 3/5


training...: 100%|██████████| 293/293 [00:10<00:00, 27.17it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 65.59it/s]


epoch: 14
train_loss: 0.091, train_acc: 0.967
valid_loss: 0.387, valid_acc: 0.859
No improvement. Patience: 4/5


training...: 100%|██████████| 293/293 [00:10<00:00, 27.14it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 66.03it/s]


epoch: 15
train_loss: 0.080, train_acc: 0.972
valid_loss: 0.396, valid_acc: 0.860
No improvement. Patience: 5/5
Early stopping!


evaluating...: 100%|██████████| 391/391 [00:05<00:00, 70.23it/s]

test_loss: 0.362, test_acc: 0.851





In [38]:

test_loss, test_acc = evaluate(test_data_loader, model, criterion, device)
print(f"test_loss: {test_loss:.3f}, test_acc: {test_acc:.3f}")

evaluating...: 100%|██████████| 391/391 [00:05<00:00, 68.51it/s]

test_loss: 0.362, test_acc: 0.851





<p style="font-size: 25px;">Result: test accuracy 0.851</p>
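<p style="font-size: 16px;">The training cells above feed the raw output of the final linear layer (a logit) into nn.BCEWithLogitsLoss. A quick way to see what that loss computes is to compare it with applying a sigmoid first and then nn.BCELoss. The sketch below does this on random tensors, which are made-up stand-ins for model outputs and labels, not data from the notebook.</p>

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
logits = torch.randn(8)                      # raw scores, one per (made-up) review
labels = torch.randint(0, 2, (8,)).float()   # binary targets must be floats for BCE

# Fused version: sigmoid and binary cross-entropy in one call.
fused = nn.BCEWithLogitsLoss()(logits, labels)

# Two-step version: explicit sigmoid followed by plain binary cross-entropy.
two_step = nn.BCELoss()(torch.sigmoid(logits), labels)

print(fused.item(), two_step.item())
print(torch.allclose(fused, two_step, atol=1e-6))
```

<p style="font-size: 16px;">The two values agree up to floating-point error. The fused form is usually preferred because the PyTorch documentation describes it as more numerically stable than a separate sigmoid followed by BCELoss, and it lets the model return logits without a final activation.</p>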


<p style="font-size: 25px;">Task 5: Try out our model 🚀 (5pt)</p>
<p style="font-size: 16px;">Let's try out our model to see if it can correctly classify the sentiment of the input text!</p>

In [39]:
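def predict_sentiment(text, model, tokenizer, device):
    model.eval()  # make sure dropout is disabled at inference time
    ids = tokenizer(text)["input_ids"]
    tensor = torch.LongTensor(ids).unsqueeze(dim=0).to(device)  # add a batch dimension
    with torch.no_grad():  # no gradients are needed for prediction
        prediction = model(tensor).squeeze()
    predicted_class = torch.round(torch.sigmoid(prediction))  # logit -> probability -> 0 or 1
    if predicted_class == 1:
        print("Positive sentiment")
    else:
        print("Negative sentiment")
    return predicted_class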
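
<p style="font-size: 16px;">Rounding the sigmoid output hides how confident the model is, which matters for ambiguous reviews. The sketch below is a hypothetical variant (predict_proba is not part of the notebook) that returns the probability of the positive class instead. It takes token ids directly so that it runs without the tokenizer, truncates to 512 ids on the assumption that the model cannot embed longer sequences, and uses a tiny stand-in model (TinyStub) in place of the trained Transformer.</p>

```python
import torch
import torch.nn as nn


def predict_proba(ids, model, device, max_length=512):
    """Return P(positive) for one review given as a list of token ids."""
    model.eval()                       # disable dropout
    ids = ids[:max_length]             # assumed limit of the positional embedding
    tensor = torch.tensor(ids, dtype=torch.long).unsqueeze(0).to(device)
    with torch.no_grad():              # inference only, no gradients
        logit = model(tensor).squeeze()
    return torch.sigmoid(logit).item()


class TinyStub(nn.Module):
    """Hypothetical stand-in for the trained model, used only for this demo."""

    def __init__(self, vocab_size=100, embed_dim=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.fc = nn.Linear(embed_dim, 1)

    def forward(self, x):
        # Max-pool over the sequence dimension, then map to a single logit.
        return self.fc(self.embed(x).max(dim=1)[0])


prob = predict_proba([5, 17, 42], TinyStub(), torch.device("cpu"))
print(f"P(positive) = {prob:.3f}")
```

<p style="font-size: 16px;">With the real model, a probability close to 0.5 would flag reviews where the binary label is essentially a coin flip.</p>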

In [41]:
# TODO: Run the following code to test the model
text = "Definitely love this movie!"
predicted_class = predict_sentiment(text, model, tokenizer, device)

Positive sentiment


In [42]:
# TODO: Run the following code to test the model
text = "It is not so terrible but I still don't like it"
predicted_class = predict_sentiment(text, model, tokenizer, device)

Negative sentiment


In [43]:
# TODO: Please provide your own movie review and see how the model predicts!
text = "I dont know"
predicted_class = predict_sentiment(text, model, tokenizer, device)

Positive sentiment


In [41]:
text = "I dont know, it looks like boring"
predicted_class = predict_sentiment(text, model, tokenizer, device)

Negative sentiment


In [42]:
text = "I love the actor, but the movie is bad"
predicted_class = predict_sentiment(text, model, tokenizer, device)

Negative sentiment


In [43]:
text = "I love the actor, but the movie is little bit boring"
predicted_class = predict_sentiment(text, model, tokenizer, device)

Negative sentiment


In [44]:
text = "I love the actor, but the movie is little bit bad"
predicted_class = predict_sentiment(text, model, tokenizer, device)

Negative sentiment


In [45]:
text = "I love the actor, but the movie is little bit okay"
predicted_class = predict_sentiment(text, model, tokenizer, device)

Positive sentiment


In [46]:
text = "Amazing story and brilliant acting — I loved every minute!"
predicted_class = predict_sentiment(text, model, tokenizer, device)


Positive sentiment


In [47]:
text = "Terrible plot and weak characters, complete waste of time."
predicted_class = predict_sentiment(text, model, tokenizer, device)

Negative sentiment


In [48]:
text = "It was okay, not great but not bad either."
predicted_class = predict_sentiment(text, model, tokenizer, device)

Negative sentiment


In [49]:
text = "The visuals were stunning, but the script was boring."
predicted_class = predict_sentiment(text, model, tokenizer, device)

Negative sentiment


In [50]:
text = "One of the best movies I’ve seen this year!"
predicted_class = predict_sentiment(text, model, tokenizer, device)

Positive sentiment


In [51]:
text = "I fell asleep halfway through"
predicted_class = predict_sentiment(text, model, tokenizer, device)

Negative sentiment


In [52]:
text = "The ending was unexpected and emotional, very satisfying"
predicted_class = predict_sentiment(text, model, tokenizer, device)

Positive sentiment


<p style="font-size: 25px;">Let's try adding more encoder blocks</p>


In [57]:
class Transformer(nn.Module):
    def __init__(self, vocab_size, max_length, embed_dim,
                 num_heads, forward_expansion, dropout, num_layers=1):
        super().__init__()

        self.embedder = Embedding(vocab_size, max_length, embed_dim)
        # self.encoder = TransformerEncoder(embed_dim, num_heads, forward_expansion, dropout)
        # A stack of encoder blocks replaces the single encoder above.
        # num_layers defaults to 1 so that calls without it still build the one-block model.
        self.encoder_stack = nn.ModuleList([
            TransformerEncoder(embed_dim, num_heads, forward_expansion, dropout)
            for _ in range(num_layers)
        ])
        # self.fc = # TODO: Please define the final fully connected layer.
        self.fc = nn.Linear(embed_dim, 1)  # one logit for binary classification

    def forward(self, x):
        embedding = self.embedder(x)
        # encoding = # TODO: Compute the encoding using encoder
        encoding = embedding
        # Pass the embeddings through each encoder block in turn
        for encoder in self.encoder_stack:
            encoding = encoder(encoding)

        # Is the max pooling a good choice here? Why? Or what should be used instead?
        # Answer: Max pooling keeps, for each feature, the strongest activation over all positions.
        # Alternatives include:
        # - Mean pooling: encoding.mean(dim=1), which averages the representation over the sequence
        # - Attention pooling: a learnable weighted sum over positions
        compact_encoding = encoding.max(dim=1)[0]

        out = self.fc(compact_encoding)
        return out
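
<p style="font-size: 16px;">The pooling comment above mentions mean pooling as an alternative. One caveat is that a plain encoding.mean(dim=1) also averages over padding positions in a batch. The sketch below shows a masked mean that ignores them. The function name masked_mean_pool and the padding id 0 are assumptions for illustration; the real padding id depends on the tokenizer used in this notebook.</p>

```python
import torch


def masked_mean_pool(encoding, ids, pad_index=0):
    """Average token encodings over non-padding positions.

    encoding: (batch, seq_len, embed_dim), ids: (batch, seq_len).
    pad_index=0 is an assumption; use the tokenizer's actual pad id.
    """
    mask = (ids != pad_index).unsqueeze(-1).to(encoding.dtype)  # (batch, seq_len, 1)
    summed = (encoding * mask).sum(dim=1)                       # sum over real tokens only
    counts = mask.sum(dim=1).clamp(min=1.0)                     # avoid division by zero
    return summed / counts


# Toy batch: the first sequence has two real tokens followed by two pads.
ids = torch.tensor([[7, 3, 0, 0], [4, 9, 2, 5]])
encoding = torch.ones(2, 4, 3)
encoding[0, 2:] = 100.0  # large values at the padding positions
pooled = masked_mean_pool(encoding, ids)
print(pooled)
```

<p style="font-size: 16px;">The large values placed at the padding positions do not affect the result, whereas an unmasked mean or max over dim=1 would pick them up. Using this in the model would require passing the token ids (or a mask) through to the pooling step.</p>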

In [61]:

embedding_dim = 128 
num_heads = 8  
dropout = 0.3 
forward_expansion = 3
max_length = 512
vocab_size = 35000
num_layers = 3

model = Transformer(
    vocab_size, max_length, embedding_dim, num_heads, forward_expansion, dropout, num_layers)
model = model.to(device)



print(f"The model has {count_parameters(model):,} trainable parameters")
# lr = # TODO: Please set a proper learning rate
# optimizer = # TODO: Please define an optimizer
# criterion = # TODO: What loss function should we use?
lr = 5e-4  
optimizer = optim.AdamW(model.parameters(), lr=lr, weight_decay=0.001) 

criterion = nn.BCEWithLogitsLoss()  



# n_epochs = # TODO: Please set the number of epochs as you need
n_epochs = 30

best_valid_acc = 0  
patience = 5  
patience_counter = 0  
for epoch in range(n_epochs):
    train_loss, train_acc = train(train_data_loader, model, criterion, optimizer, device)
    valid_loss, valid_acc = evaluate(valid_data_loader, model, criterion, device)

    print(f"epoch: {epoch}")
    print(f"train_loss: {train_loss:.3f}, train_acc: {train_acc:.3f}")
    print(f"valid_loss: {valid_loss:.3f}, valid_acc: {valid_acc:.3f}")
    if valid_acc > best_valid_acc:
        best_valid_acc = valid_acc
        torch.save(model.state_dict(), 'best_model1.pt')
        print(f"Best model saved with valid_acc: {valid_acc:.3f}")
        patience_counter = 0
    else:
        patience_counter += 1
        print(f"No improvement. Patience: {patience_counter}/{patience}")
        if patience_counter >= patience:
            print("Early stopping!")
            break

model.load_state_dict(torch.load('best_model1.pt'))
test_loss, test_acc = evaluate(test_data_loader, model, criterion, device)
print(f"test_loss: {test_loss:.3f}, test_acc: {test_acc:.3f}")

The model has 5,041,793 trainable parameters


training...: 100%|██████████| 293/293 [00:42<00:00,  6.86it/s]
evaluating...: 100%|██████████| 98/98 [00:05<00:00, 18.77it/s]


epoch: 0
train_loss: 0.674, train_acc: 0.576
valid_loss: 0.623, valid_acc: 0.648
Best model saved with valid_acc: 0.648


training...: 100%|██████████| 293/293 [00:42<00:00,  6.83it/s]
evaluating...: 100%|██████████| 98/98 [00:05<00:00, 18.72it/s]


epoch: 1
train_loss: 0.553, train_acc: 0.717
valid_loss: 0.488, valid_acc: 0.765
Best model saved with valid_acc: 0.765


training...: 100%|██████████| 293/293 [00:42<00:00,  6.82it/s]
evaluating...: 100%|██████████| 98/98 [00:05<00:00, 18.74it/s]


epoch: 2
train_loss: 0.463, train_acc: 0.779
valid_loss: 0.436, valid_acc: 0.790
Best model saved with valid_acc: 0.790


training...: 100%|██████████| 293/293 [00:42<00:00,  6.82it/s]
evaluating...: 100%|██████████| 98/98 [00:05<00:00, 18.71it/s]


epoch: 3
train_loss: 0.404, train_acc: 0.816
valid_loss: 0.391, valid_acc: 0.822
Best model saved with valid_acc: 0.822


training...: 100%|██████████| 293/293 [00:42<00:00,  6.82it/s]
evaluating...: 100%|██████████| 98/98 [00:05<00:00, 18.72it/s]


epoch: 4
train_loss: 0.338, train_acc: 0.852
valid_loss: 0.382, valid_acc: 0.832
Best model saved with valid_acc: 0.832


training...: 100%|██████████| 293/293 [00:42<00:00,  6.82it/s]
evaluating...: 100%|██████████| 98/98 [00:05<00:00, 18.71it/s]


epoch: 5
train_loss: 0.292, train_acc: 0.876
valid_loss: 0.378, valid_acc: 0.833
Best model saved with valid_acc: 0.833


training...: 100%|██████████| 293/293 [00:42<00:00,  6.82it/s]
evaluating...: 100%|██████████| 98/98 [00:05<00:00, 18.72it/s]


epoch: 6
train_loss: 0.256, train_acc: 0.895
valid_loss: 0.375, valid_acc: 0.843
Best model saved with valid_acc: 0.843


training...: 100%|██████████| 293/293 [00:42<00:00,  6.82it/s]
evaluating...: 100%|██████████| 98/98 [00:05<00:00, 18.71it/s]


epoch: 7
train_loss: 0.214, train_acc: 0.915
valid_loss: 0.397, valid_acc: 0.842
No improvement. Patience: 1/5


training...: 100%|██████████| 293/293 [00:42<00:00,  6.82it/s]
evaluating...: 100%|██████████| 98/98 [00:05<00:00, 18.71it/s]


epoch: 8
train_loss: 0.186, train_acc: 0.925
valid_loss: 0.383, valid_acc: 0.850
Best model saved with valid_acc: 0.850


training...: 100%|██████████| 293/293 [00:42<00:00,  6.82it/s]
evaluating...: 100%|██████████| 98/98 [00:05<00:00, 18.71it/s]


epoch: 9
train_loss: 0.160, train_acc: 0.940
valid_loss: 0.449, valid_acc: 0.848
No improvement. Patience: 1/5


training...: 100%|██████████| 293/293 [00:42<00:00,  6.82it/s]
evaluating...: 100%|██████████| 98/98 [00:05<00:00, 18.73it/s]


epoch: 10
train_loss: 0.133, train_acc: 0.949
valid_loss: 0.438, valid_acc: 0.851
Best model saved with valid_acc: 0.851


training...: 100%|██████████| 293/293 [00:42<00:00,  6.82it/s]
evaluating...: 100%|██████████| 98/98 [00:05<00:00, 18.72it/s]


epoch: 11
train_loss: 0.118, train_acc: 0.957
valid_loss: 0.479, valid_acc: 0.842
No improvement. Patience: 1/5


training...: 100%|██████████| 293/293 [00:42<00:00,  6.82it/s]
evaluating...: 100%|██████████| 98/98 [00:05<00:00, 18.70it/s]


epoch: 12
train_loss: 0.096, train_acc: 0.965
valid_loss: 0.504, valid_acc: 0.847
No improvement. Patience: 2/5


training...: 100%|██████████| 293/293 [00:42<00:00,  6.82it/s]
evaluating...: 100%|██████████| 98/98 [00:05<00:00, 18.72it/s]


epoch: 13
train_loss: 0.086, train_acc: 0.968
valid_loss: 0.501, valid_acc: 0.851
Best model saved with valid_acc: 0.851


training...: 100%|██████████| 293/293 [00:42<00:00,  6.82it/s]
evaluating...: 100%|██████████| 98/98 [00:05<00:00, 18.70it/s]


epoch: 14
train_loss: 0.076, train_acc: 0.972
valid_loss: 0.609, valid_acc: 0.844
No improvement. Patience: 1/5


training...: 100%|██████████| 293/293 [00:42<00:00,  6.82it/s]
evaluating...: 100%|██████████| 98/98 [00:05<00:00, 18.69it/s]


epoch: 15
train_loss: 0.066, train_acc: 0.976
valid_loss: 0.568, valid_acc: 0.853
Best model saved with valid_acc: 0.853


training...: 100%|██████████| 293/293 [00:42<00:00,  6.82it/s]
evaluating...: 100%|██████████| 98/98 [00:05<00:00, 18.70it/s]


epoch: 16
train_loss: 0.058, train_acc: 0.980
valid_loss: 0.605, valid_acc: 0.850
No improvement. Patience: 1/5


training...: 100%|██████████| 293/293 [00:42<00:00,  6.82it/s]
evaluating...: 100%|██████████| 98/98 [00:05<00:00, 18.71it/s]


epoch: 17
train_loss: 0.052, train_acc: 0.982
valid_loss: 0.563, valid_acc: 0.857
Best model saved with valid_acc: 0.857


training...: 100%|██████████| 293/293 [00:42<00:00,  6.82it/s]
evaluating...: 100%|██████████| 98/98 [00:05<00:00, 18.70it/s]


epoch: 18
train_loss: 0.056, train_acc: 0.980
valid_loss: 0.587, valid_acc: 0.859
Best model saved with valid_acc: 0.859


training...: 100%|██████████| 293/293 [00:42<00:00,  6.82it/s]
evaluating...: 100%|██████████| 98/98 [00:05<00:00, 18.71it/s]


epoch: 19
train_loss: 0.048, train_acc: 0.983
valid_loss: 0.649, valid_acc: 0.856
No improvement. Patience: 1/5


training...: 100%|██████████| 293/293 [00:42<00:00,  6.82it/s]
evaluating...: 100%|██████████| 98/98 [00:05<00:00, 18.67it/s]


epoch: 20
train_loss: 0.042, train_acc: 0.986
valid_loss: 0.605, valid_acc: 0.857
No improvement. Patience: 2/5


training...: 100%|██████████| 293/293 [00:42<00:00,  6.82it/s]
evaluating...: 100%|██████████| 98/98 [00:05<00:00, 18.71it/s]


epoch: 21
train_loss: 0.042, train_acc: 0.986
valid_loss: 0.707, valid_acc: 0.848
No improvement. Patience: 3/5


training...: 100%|██████████| 293/293 [00:42<00:00,  6.82it/s]
evaluating...: 100%|██████████| 98/98 [00:05<00:00, 18.70it/s]


epoch: 22
train_loss: 0.037, train_acc: 0.988
valid_loss: 0.713, valid_acc: 0.858
No improvement. Patience: 4/5


training...: 100%|██████████| 293/293 [00:42<00:00,  6.82it/s]
evaluating...: 100%|██████████| 98/98 [00:05<00:00, 18.71it/s]


epoch: 23
train_loss: 0.036, train_acc: 0.987
valid_loss: 0.699, valid_acc: 0.856
No improvement. Patience: 5/5
Early stopping!


evaluating...: 100%|██████████| 391/391 [00:20<00:00, 18.88it/s]

test_loss: 0.657, test_acc: 0.839
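
<p style="font-size: 16px;">The run above reports 5,041,793 trainable parameters. The count_parameters helper is defined outside this section, so the sketch below only assumes it sums the sizes of all tensors that require gradients (count_trainable is a hypothetical equivalent). It also assumes the token embedding is a plain nn.Embedding of shape vocab_size by embedding_dim, which is a guess about the Embedding module used in this notebook.</p>

```python
import torch.nn as nn


def count_trainable(model):
    """Number of parameters that receive gradients (assumed equivalent of count_parameters)."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)


# Same vocabulary size and embedding width as the cell above.
vocab_size, embedding_dim = 35000, 128
token_table = nn.Embedding(vocab_size, embedding_dim)  # assumed layout of the token embedding
print(f"{count_trainable(token_table):,}")  # 4,480,000
```

<p style="font-size: 16px;">Under these assumptions, roughly 4.48 million of the 5.04 million parameters would sit in the token embedding table, so most of the model's capacity is in the vocabulary rather than in the three encoder blocks.</p>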





In [63]:

embedding_dim = 128 
num_heads = 4  
dropout = 0.45  
forward_expansion = 3
max_length = 512
vocab_size = 35000

model = Transformer(
    vocab_size, max_length, embedding_dim, num_heads, forward_expansion, dropout)
model = model.to(device)

# lr = # TODO: Please set a proper learning rate
# optimizer = # TODO: Please define an optimizer
# criterion = # TODO: What loss function should we use?
lr = 5e-4  
optimizer = optim.AdamW(model.parameters(), lr=lr, weight_decay=0.001) 

criterion = nn.BCEWithLogitsLoss()  



# n_epochs = # TODO: Please set the number of epochs as you need
n_epochs = 30

best_valid_acc = 0  
patience = 5  
patience_counter = 0  
for epoch in range(n_epochs):
    train_loss, train_acc = train(train_data_loader, model, criterion, optimizer, device)
    valid_loss, valid_acc = evaluate(valid_data_loader, model, criterion, device)

    print(f"epoch: {epoch}")
    print(f"train_loss: {train_loss:.3f}, train_acc: {train_acc:.3f}")
    print(f"valid_loss: {valid_loss:.3f}, valid_acc: {valid_acc:.3f}")
    if valid_acc > best_valid_acc:
        best_valid_acc = valid_acc
        torch.save(model.state_dict(), 'best_model.pt')
        print(f"Best model saved with valid_acc: {valid_acc:.3f}")
        patience_counter = 0
    else:
        patience_counter += 1
        print(f"No improvement. Patience: {patience_counter}/{patience}")
        if patience_counter >= patience:
            print("Early stopping!")
            break

model.load_state_dict(torch.load('best_model.pt'))
test_loss, test_acc = evaluate(test_data_loader, model, criterion, device)
print(f"test_loss: {test_loss:.3f}, test_acc: {test_acc:.3f}")

training...: 100%|██████████| 293/293 [00:10<00:00, 27.64it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 71.29it/s]


epoch: 0
train_loss: 0.678, train_acc: 0.559
valid_loss: 0.620, valid_acc: 0.670
Best model saved with valid_acc: 0.670


training...: 100%|██████████| 293/293 [00:10<00:00, 27.77it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 70.99it/s]


epoch: 1
train_loss: 0.560, train_acc: 0.712
valid_loss: 0.517, valid_acc: 0.751
Best model saved with valid_acc: 0.751


training...: 100%|██████████| 293/293 [00:10<00:00, 27.77it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 71.06it/s]


epoch: 2
train_loss: 0.473, train_acc: 0.773
valid_loss: 0.461, valid_acc: 0.792
Best model saved with valid_acc: 0.792


training...: 100%|██████████| 293/293 [00:10<00:00, 27.69it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 70.99it/s]


epoch: 3
train_loss: 0.408, train_acc: 0.816
valid_loss: 0.420, valid_acc: 0.811
Best model saved with valid_acc: 0.811


training...: 100%|██████████| 293/293 [00:10<00:00, 27.64it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 71.03it/s]


epoch: 4
train_loss: 0.352, train_acc: 0.847
valid_loss: 0.370, valid_acc: 0.839
Best model saved with valid_acc: 0.839


training...: 100%|██████████| 293/293 [00:10<00:00, 27.59it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 70.77it/s]


epoch: 5
train_loss: 0.302, train_acc: 0.873
valid_loss: 0.350, valid_acc: 0.847
Best model saved with valid_acc: 0.847


training...: 100%|██████████| 293/293 [00:10<00:00, 27.57it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 70.74it/s]


epoch: 6
train_loss: 0.260, train_acc: 0.891
valid_loss: 0.322, valid_acc: 0.866
Best model saved with valid_acc: 0.866


training...: 100%|██████████| 293/293 [00:10<00:00, 27.57it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 70.80it/s]


epoch: 7
train_loss: 0.230, train_acc: 0.910
valid_loss: 0.324, valid_acc: 0.859
No improvement. Patience: 1/5


training...: 100%|██████████| 293/293 [00:10<00:00, 27.57it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 70.59it/s]


epoch: 8
train_loss: 0.198, train_acc: 0.923
valid_loss: 0.311, valid_acc: 0.870
Best model saved with valid_acc: 0.870


training...: 100%|██████████| 293/293 [00:10<00:00, 27.55it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 70.75it/s]


epoch: 9
train_loss: 0.172, train_acc: 0.934
valid_loss: 0.341, valid_acc: 0.856
No improvement. Patience: 1/5


training...: 100%|██████████| 293/293 [00:10<00:00, 27.47it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 70.59it/s]


epoch: 10
train_loss: 0.154, train_acc: 0.943
valid_loss: 0.316, valid_acc: 0.868
No improvement. Patience: 2/5


training...: 100%|██████████| 293/293 [00:10<00:00, 27.51it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 70.51it/s]


epoch: 11
train_loss: 0.133, train_acc: 0.951
valid_loss: 0.328, valid_acc: 0.867
No improvement. Patience: 3/5


training...: 100%|██████████| 293/293 [00:10<00:00, 27.51it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 70.53it/s]


epoch: 12
train_loss: 0.117, train_acc: 0.960
valid_loss: 0.360, valid_acc: 0.862
No improvement. Patience: 4/5


training...: 100%|██████████| 293/293 [00:10<00:00, 27.52it/s]
evaluating...: 100%|██████████| 98/98 [00:01<00:00, 70.67it/s]


epoch: 13
train_loss: 0.102, train_acc: 0.963
valid_loss: 0.399, valid_acc: 0.852
No improvement. Patience: 5/5
Early stopping!


evaluating...: 100%|██████████| 391/391 [00:05<00:00, 73.46it/s]

test_loss: 0.344, test_acc: 0.852





<p style="font-size: 25px;">Result: test accuracy 0.852 with the original hyperparameters</p>
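
<p style="font-size: 16px;">Two runs with identical hyperparameters ended at slightly different test accuracies (0.851 and 0.852). A likely reason is that the seed is set only once at the top of the notebook, so each re-run of the training cell starts from a different random state. The sketch below shows a hypothetical set_seed helper that could be called before building the model to make re-runs start from the same initial weights. It does not by itself guarantee bit-identical training, since data loader shuffling and some GPU kernels can still introduce differences.</p>

```python
import random

import numpy as np
import torch
import torch.nn as nn


def set_seed(seed):
    """Reset the Python, NumPy, and PyTorch random number generators."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # ignored when CUDA is not available


# Building the same layer twice after reseeding gives identical initial weights.
set_seed(0)
first = nn.Linear(4, 1)
set_seed(0)
second = nn.Linear(4, 1)
print(torch.equal(first.weight, second.weight))
```

<p style="font-size: 16px;">Differences of about 0.001 in test accuracy between such runs are best read as noise rather than as an effect of any change to the model.</p>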


In [65]:
# TODO: Run the following code to test the model
text = "Definitely love this movie!"
predicted_class = predict_sentiment(text, model, tokenizer, device)

Positive sentiment


In [66]:
# TODO: Run the following code to test the model
text = "It is not so terrible but I still don't like it"
predicted_class = predict_sentiment(text, model, tokenizer, device)

Negative sentiment


In [71]:
text = "I love the actor, but the movie is bad"
predicted_class = predict_sentiment(text, model, tokenizer, device)

Negative sentiment


In [72]:
text = "I love the actor, but the movie is little bit boring"
predicted_class = predict_sentiment(text, model, tokenizer, device)

Positive sentiment


In [73]:
text = "Amazing story and brilliant acting — I loved every minute!"
predicted_class = predict_sentiment(text, model, tokenizer, device)


Positive sentiment


In [74]:
text = "I fell asleep halfway through"
predicted_class = predict_sentiment(text, model, tokenizer, device)

Negative sentiment
