# Transition to Deep Learning

After evaluating several classical machine learning models (TF-IDF + Linear Models / SVM / LightGBM), we move to a **Deep Learning approach** to better capture the semantic structure of the data.

## Why Deep Learning?

Traditional models based on TF-IDF ignore:

- word order  
- long-range dependencies  
- semantic meaning  
- similarities between exercises  

Although we could use simpler neural architectures such as dense networks or CNNs, these models are **not well suited for sequential textual data**.  

Recurrent architectures (RNNs, LSTMs, GRUs) can handle sequences, but they struggle with long texts, lack parallelization, and are generally outperformed by more modern approaches.

## Transformers: the right architecture for text

Transformers are currently the **state-of-the-art** in natural language processing because they:

- handle long sequences  
- capture global context with self-attention  
- are highly parallelizable  
- perform extremely well in multilabel classification  

Therefore, Transformers are the most appropriate architecture for our task.

## The challenge: limited data & limited compute

Training a Transformer from scratch is **not feasible** in our setting:

- we do not have enough labeled data  
- we cannot train a large model from scratch  
- the computational cost would be prohibitive

## Solution: Transfer Learning

We leverage **pretrained Transformer models** and fine-tune them on our dataset:

- we keep the pretrained encoder (frozen or partially frozen)  
- we add a **custom multilabel classification head** on top  
- we train only the final layers on our dataset  

This approach drastically reduces sample complexity and compute requirements.

## Suitable pretrained models

- **For problem descriptions (natural language):**  
  - *DistilBERT*  
  - *BERT-base*  
  - *RoBERTa-base*

- **For source code (programming languages):**  
  - *CodeBERT (Microsoft)*  
  - *GraphCodeBERT*  
  - *CodeT5*

These pretrained models already encode meaningful representations of text or code, making them ideal for fine-tuning on our multilabel classification task.


In [2]:
import pandas as pd
import numpy as np

In [3]:
import sys
sys.path.append("..")

# Data Loading

In [4]:
# Local Case

from src.processing import load_processed_data

df = load_processed_data()
df.head()

ModuleNotFoundError: No module named 'src'

In [5]:
# Colab Case

dataset_url = (
    "https://raw.githubusercontent.com/valdugay/illuin_interview/main/data/processed/cleaned_code_classification_dataset.jsonl"
)

df = pd.read_json(dataset_url, lines=True)
df.head()


Unnamed: 0,index,src_uid,source_code,tags,full_description
0,0,bb3fc45f903588baf131016bea175a9f,# calculate convex of polygon v.\n# v is list ...,[geometry],Problem Description:\nIahub has drawn a set of...
1,1,7d6faccc88a6839822fa0c0ec8c00251,s = input().strip();N = len(s)\nif len(s) == 1...,[strings],Problem Description:\nSome time ago Lesha foun...
2,2,891fabbb6ee8a4969b6f413120f672a8,"n = int(input())\nfor _ in range(n):\n k,x = m...","[number theory, math]",Problem Description:\nToday at the lesson of m...
3,3,9d46ae53e6dc8dc54f732ec93a82ded3,temp = list(input())\nm = int(input())\ntrans ...,"[math, strings]",Problem Description:\nPasha got a very beautif...
4,4,0e0f30521f9f5eb5cff2549cd391da3c,"N, B, E = input(), [], 0\nfor a in map(int, ra...",[math],Problem Description:\nYou are given an array $...


In [6]:
def get_labels(df):
    """ Return 8-length binary vectors representing the labels """

    focus_tags = ['math', 'graphs', 'strings', 'number theory',
              'trees', 'geometry', 'games', 'probabilities']


    def encode_tags(tag_list):
        return [1 if t in tag_list else 0 for t in focus_tags]

    labels_vector = df["tags"].apply(encode_tags)

    return np.vstack(labels_vector.values)


# To be able to decode the labels later
label_mapping = {
    'math': 0,
    'graphs': 1,
    'strings': 2,
    'number theory': 3,
    'trees': 4,
    'geometry': 5,
    'games': 6,
    'probabilities': 7
}


Y = get_labels(df)


X_descriptions = df["full_description"].values
X_code = df["source_code"].values

In [7]:
# Just as for the ML approach, we have 2 features (text) : the description and the code

# Description

In [8]:
import torch
import torch.nn as nn

from transformers import AutoTokenizer, AutoModelForSequenceClassification

In [9]:
num_labels = len(label_mapping)  # 8

In [10]:
model_name_desc = "distilbert-base-uncased"
tokenizer_desc = AutoTokenizer.from_pretrained(model_name_desc)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/483 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

In [11]:
# Test it
example_text = "Given a tree with n nodes, compute the diameter."
enc = tokenizer_desc(
    example_text,
    padding=True,
    truncation=True,
    max_length=256,
    return_tensors="pt"
)
enc["input_ids"].shape, enc["attention_mask"].shape

(torch.Size([1, 13]), torch.Size([1, 13]))

In [12]:
model_desc = AutoModelForSequenceClassification.from_pretrained(
    model_name_desc,
    num_labels=num_labels,
    problem_type="multi_label_classification"
)

model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [13]:
print(model_desc)


DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): DistilBertSdpaAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)


In [14]:
backbone_params = 0
classifier_params = 0

for name, param in model_desc.named_parameters():
    if "classifier" in name:      # Head
        classifier_params += param.numel()
    else:                         # Pretrained backbone
        backbone_params += param.numel()

print(f"Backbone params: {backbone_params:,}")
print(f"Classifier params: {classifier_params:,}")
print(f"Total params: {backbone_params + classifier_params:,}")


Backbone params: 66,362,880
Classifier params: 596,744
Total params: 66,959,624


In [15]:
# We can check the trainable parameters
trainable_params = 0
for param in model_desc.parameters():
    if param.requires_grad:
        trainable_params += param.numel()
print(f"Trainable params: {trainable_params:,}")

Trainable params: 66,959,624


In [16]:
def predict_tags_for_description(texts, model, tokenizer, threshold=0.5, max_length=256):
    model.eval()
    with torch.no_grad():
        batch = tokenizer(
            texts,
            padding=True,
            truncation=True,
            max_length=max_length,
            return_tensors="pt"
        )
        outputs = model(
            input_ids=batch["input_ids"],
            attention_mask=batch["attention_mask"]
        )
        logits = outputs.logits
        probas = torch.sigmoid(logits)
        preds = (probas > threshold).int()

    return probas, preds

In [17]:
# Example usage
texts = [
    "You are given a tree with n vertices. Find the number of paths...",
    "You are given a string s of length n, consisting of letters 'a' and 'b'..."
]

probas, preds = predict_tags_for_description(texts, model_desc, tokenizer_desc)
print("Probabilities:\n", probas)
print("Predictions:\n", preds)

Probabilities:
 tensor([[0.5130, 0.4848, 0.4457, 0.5280, 0.4948, 0.4680, 0.4894, 0.4857],
        [0.5143, 0.4844, 0.4553, 0.5069, 0.4974, 0.4743, 0.4791, 0.4788]])
Predictions:
 tensor([[1, 0, 0, 1, 0, 0, 0, 0],
        [1, 0, 0, 1, 0, 0, 0, 0]], dtype=torch.int32)


In [18]:
from torch.utils.data import Dataset

class DescriptionDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_length=256):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.texts[idx]
        label = self.labels[idx]

        enc = self.tokenizer(
            text,
            padding='max_length',
            truncation=True,
            max_length=self.max_length,
            return_tensors="pt"
        )

        return {
            "input_ids": enc["input_ids"].squeeze(0),
            "attention_mask": enc["attention_mask"].squeeze(0),
            "labels": torch.tensor(label, dtype=torch.float)
        }


In [19]:
from sklearn.model_selection import train_test_split

X_train_desc, X_val_desc, Y_train, Y_val = train_test_split(
    X_descriptions, Y, test_size=0.2, random_state=42
)


In [20]:
from torch.utils.data import DataLoader

train_dataset = DescriptionDataset(X_train_desc, Y_train, tokenizer_desc)
val_dataset   = DescriptionDataset(X_val_desc,   Y_val,   tokenizer_desc)

train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True)
val_loader   = DataLoader(val_dataset, batch_size=8, shuffle=False)


In [21]:
from torch.optim import AdamW

optimizer = AdamW(model_desc.parameters(), lr=2e-5, weight_decay=1e-2)


In [22]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_desc = model_desc.to(device)

print("GPU :", torch.cuda.is_available())

GPU : True


In [23]:
import torch
import torch.nn.functional as F
from tqdm import tqdm

def train_one_epoch(model, loader, optimizer, scheduler=None):
    model.train()
    total_loss = 0

    for batch in tqdm(loader, desc="Training"):
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        labels = batch["labels"].to(device)

        outputs = model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            labels=labels
        )

        loss = outputs.loss
        loss.backward()
        total_loss += loss.item()

        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        if scheduler:
            scheduler.step()
        optimizer.zero_grad()

    return total_loss / len(loader)


In [24]:
from sklearn.metrics import f1_score
import numpy as np

def validate(model, val_loader, criterion=None, device="cuda"):
    model.eval()
    total_loss = 0.0
    all_preds = []
    all_targets = []

    if criterion is None:
        criterion = torch.nn.BCEWithLogitsLoss()

    with torch.no_grad():
        for batch in val_loader:
            input_ids = batch["input_ids"].to(device)
            attention_mask = batch["attention_mask"].to(device)
            labels = batch["labels"].to(device)

            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
            logits = outputs.logits

            loss = criterion(logits, labels.float())
            total_loss += loss.item() * input_ids.size(0)

            preds = (logits.sigmoid() > 0.5).int().cpu()
            all_preds.append(preds)
            all_targets.append(labels.cpu())

    val_loss = total_loss / len(val_loader.dataset)

    all_preds = torch.cat(all_preds, dim=0)
    all_targets = torch.cat(all_targets, dim=0)

    micro = f1_score(all_targets, all_preds, average="micro", zero_division=0)
    macro = f1_score(all_targets, all_preds, average="macro", zero_division=0)

    return val_loss, micro, macro

In [25]:
from transformers import get_linear_schedule_with_warmup

epochs = 10

total_steps = len(train_loader) * epochs

scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * total_steps),
    num_training_steps=total_steps
)


In [25]:
# A good practice when the dataset is small
# First epoch: train only the classification head (freeze the backbone)
# From the second epoch: unfreeze the backbone and train all layers

train_loss_history = []
test_loss_history = []


for epoch in range(epochs):
    print(f"\n----- Epoch {epoch+1}/{epochs} -----")

    if epoch == 0:
        print("Training ONLY the classification head (backbone frozen).")
        for param in model_desc.base_model.parameters():
            param.requires_grad = False

    if epoch == 1:
        print("Unfreezing the backbone: training ALL layers.")
        for param in model_desc.base_model.parameters():
            param.requires_grad = True

        optimizer = torch.optim.AdamW(model_desc.parameters(), lr=2e-5, weight_decay=1e-2)

    train_loss = train_one_epoch(model_desc, train_loader, optimizer, scheduler)
    val_loss, micro, macro = validate(model_desc, val_loader)

    test_loss_history.append(val_loss)
    train_loss_history.append(train_loss)

    # Early Stopping (custom) with patience of 2
    if len(test_loss_history) > 2 and test_loss_history[-1] > test_loss_history[-2] and test_loss_history[-2] > test_loss_history[-3]:
        print("Early stopping triggered.")
        break

    print(f"Train Loss: {train_loss:.4f}")
    print(f"Val Loss: {val_loss:.4f}")
    print(f"Val Micro F1: {micro:.4f}")
    print(f"Val Macro F1: {macro:.4f}")


----- Epoch 1/10 -----
Training ONLY the classification head (backbone frozen).


Training: 100%|██████████| 268/268 [00:18<00:00, 14.16it/s]


Train Loss: 0.5732
Val Loss: 0.4081
Val Micro F1: 0.2309
Val Macro F1: 0.0581

----- Epoch 2/10 -----
Unfreezing the backbone: training ALL layers.


Training: 100%|██████████| 268/268 [00:54<00:00,  4.87it/s]


Train Loss: 0.3084
Val Loss: 0.2472
Val Micro F1: 0.6731
Val Macro F1: 0.4978

----- Epoch 3/10 -----


Training: 100%|██████████| 268/268 [00:56<00:00,  4.72it/s]


Train Loss: 0.2357
Val Loss: 0.2276
Val Micro F1: 0.6969
Val Macro F1: 0.5304

----- Epoch 4/10 -----


Training: 100%|██████████| 268/268 [00:57<00:00,  4.64it/s]


Train Loss: 0.1966
Val Loss: 0.2211
Val Micro F1: 0.7347
Val Macro F1: 0.6794

----- Epoch 5/10 -----


Training: 100%|██████████| 268/268 [00:52<00:00,  5.15it/s]


Train Loss: 0.1643
Val Loss: 0.2217
Val Micro F1: 0.7380
Val Macro F1: 0.6718

----- Epoch 6/10 -----


Training: 100%|██████████| 268/268 [00:52<00:00,  5.13it/s]


Early stopping triggered.


# For Code

In [26]:
# For Code, we need a similar architecture (bert)
# But with a model pretrained on code data, like CodeBERT

code_model_name = "microsoft/codebert-base"

code_tokenizer = AutoTokenizer.from_pretrained(code_model_name)

code_model = AutoModelForSequenceClassification.from_pretrained(
    code_model_name,
    num_labels=8,
    problem_type="multi_label_classification"
)

tokenizer_config.json:   0%|          | 0.00/25.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/498 [00:00<?, ?B/s]

vocab.json: 0.00B [00:00, ?B/s]

merges.txt: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/150 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/499M [00:00<?, ?B/s]

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at microsoft/codebert-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [27]:
X_code_train, X_code_val, Y_train, Y_val = train_test_split(
    X_code, Y, test_size=0.2, random_state=42
)


code_train_dataset = DescriptionDataset(X_code_train, Y_train, code_tokenizer, max_length=256)
code_val_dataset   = DescriptionDataset(X_code_val,   Y_val,   code_tokenizer, max_length=256)

code_train_loader = DataLoader(code_train_dataset, batch_size=8, shuffle=True)
code_val_loader   = DataLoader(code_val_dataset, batch_size=8, shuffle=False)


In [28]:
code_model.to(device)
code_optimizer = AdamW(code_model.parameters(), lr=2e-5)

In [29]:
train_loss_history = []
test_loss_history = []


for epoch in range(epochs):
    print(f"\n----- Epoch {epoch+1}/{epochs} -----")

    if epoch == 0:
        print("Training ONLY the classification head (backbone frozen).")
        for param in code_model.base_model.parameters():
            param.requires_grad = False

    if epoch == 1:
        print("Unfreezing the backbone: training ALL layers.")
        for param in code_model.base_model.parameters():
            param.requires_grad = True

        optimizer = torch.optim.AdamW(code_model.parameters(), lr=2e-5, weight_decay=1e-2)

    train_loss = train_one_epoch(code_model, train_loader, optimizer, scheduler)
    val_loss, micro, macro = validate(code_model, val_loader)

    test_loss_history.append(val_loss)
    train_loss_history.append(train_loss)

    # Early Stopping (custom) with patience of 2
    if len(test_loss_history) > 2 and test_loss_history[-1] > test_loss_history[-2] and test_loss_history[-2] > test_loss_history[-3]:
        print("Early stopping triggered.")
        break

    print(f"Train Loss: {train_loss:.4f}")
    print(f"Val Loss: {val_loss:.4f}")
    print(f"Val Micro F1: {micro:.4f}")
    print(f"Val Macro F1: {macro:.4f}")



----- Epoch 1/10 -----
Training ONLY the classification head (backbone frozen).



Training:   0%|          | 0/268 [00:00<?, ?it/s][A
Training:   0%|          | 1/268 [00:00<00:58,  4.58it/s][A
Training:   1%|          | 2/268 [00:00<00:48,  5.47it/s][A
Training:   1%|          | 3/268 [00:00<00:45,  5.83it/s][A
Training:   1%|▏         | 4/268 [00:00<00:44,  5.94it/s][A
Training:   2%|▏         | 5/268 [00:00<00:47,  5.59it/s][A
Training:   2%|▏         | 6/268 [00:01<00:48,  5.44it/s][A
Training:   3%|▎         | 7/268 [00:01<00:44,  5.81it/s][A
Training:   3%|▎         | 8/268 [00:01<00:45,  5.68it/s][A
Training:   3%|▎         | 9/268 [00:01<00:53,  4.82it/s][A
Training:   4%|▎         | 10/268 [00:01<00:51,  4.99it/s][A
Training:   4%|▍         | 11/268 [00:02<00:51,  5.01it/s][A
Training:   4%|▍         | 12/268 [00:02<00:57,  4.44it/s][A
Training:   5%|▍         | 13/268 [00:02<00:57,  4.40it/s][A
Training:   5%|▌         | 14/268 [00:02<00:58,  4.32it/s][A
Training:   6%|▌         | 15/268 [00:03<00:55,  4.59it/s][A
Training:   6%|▌         

Train Loss: 0.7411
Val Loss: 0.7451
Val Micro F1: 0.2741
Val Macro F1: 0.2495

----- Epoch 2/10 -----
Unfreezing the backbone: training ALL layers.


Training: 100%|██████████| 268/268 [01:44<00:00,  2.57it/s]


Train Loss: 0.3788
Val Loss: 0.3275
Val Micro F1: 0.5533
Val Macro F1: 0.1876

----- Epoch 3/10 -----


Training: 100%|██████████| 268/268 [01:44<00:00,  2.56it/s]


Train Loss: 0.3053
Val Loss: 0.2707
Val Micro F1: 0.6694
Val Macro F1: 0.3689

----- Epoch 4/10 -----


Training: 100%|██████████| 268/268 [01:44<00:00,  2.57it/s]


Train Loss: 0.2769
Val Loss: 0.2657
Val Micro F1: 0.6599
Val Macro F1: 0.3633

----- Epoch 5/10 -----


Training: 100%|██████████| 268/268 [01:44<00:00,  2.56it/s]


Train Loss: 0.2575
Val Loss: 0.2618
Val Micro F1: 0.6412
Val Macro F1: 0.4032

----- Epoch 6/10 -----


Training: 100%|██████████| 268/268 [01:44<00:00,  2.57it/s]


Train Loss: 0.2348
Val Loss: 0.2637
Val Micro F1: 0.6568
Val Macro F1: 0.4523

----- Epoch 7/10 -----


Training: 100%|██████████| 268/268 [01:44<00:00,  2.56it/s]


Train Loss: 0.2118
Val Loss: 0.2582
Val Micro F1: 0.6745
Val Macro F1: 0.5803

----- Epoch 8/10 -----


Training: 100%|██████████| 268/268 [01:44<00:00,  2.56it/s]


Train Loss: 0.1942
Val Loss: 0.2549
Val Micro F1: 0.6667
Val Macro F1: 0.5382

----- Epoch 9/10 -----


Training: 100%|██████████| 268/268 [01:44<00:00,  2.56it/s]


Train Loss: 0.1722
Val Loss: 0.2804
Val Micro F1: 0.6574
Val Macro F1: 0.5963

----- Epoch 10/10 -----


Training: 100%|██████████| 268/268 [01:44<00:00,  2.56it/s]


Early stopping triggered.


In [30]:
# For code, the results are largely better than ML approach
# Let's try an hybrid approach

# Combining

In [29]:
# Thanks to the architecture of this approach
# We won't just 'mix' the preidction (or probabilities) of both models
# Instead, we will concatenate the embeddings from both models
# And then, use one final classification head

In [30]:
# To do this, we create a custom model :

from transformers import AutoModel

class MultiEncoderClassifier(nn.Module):
    def __init__(self, model_name_1="microsoft/codebert-base",
                       model_name_2="distilbert-base-uncased",
                       num_labels=8):
        super().__init__()

        # Encoder 1: CodeBERT
        self.encoder1 = AutoModel.from_pretrained(model_name_1)

        # Encoder 2: DistilBERT
        self.encoder2 = AutoModel.from_pretrained(model_name_2)

        # Hidden sizes
        h1 = self.encoder1.config.hidden_size
        h2 = self.encoder2.config.hidden_size
        combined = h1 + h2

        # Classification head
        self.classifier = nn.Sequential(
            nn.Linear(combined, 1024),
            nn.Dropout(0.2),
            nn.ReLU(),
            nn.Linear(1024, 512),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(512, num_labels)
        )

    def forward(self, input_ids_1, attention_mask_1,
                      input_ids_2, attention_mask_2):

        # CodeBERT encoding
        out1 = self.encoder1(input_ids=input_ids_1,
                             attention_mask=attention_mask_1)
        cls1 = out1.last_hidden_state[:, 0, :]  # CLS token

        # DistilBERT encoding
        out2 = self.encoder2(input_ids=input_ids_2,
                             attention_mask=attention_mask_2)
        cls2 = out2.last_hidden_state[:, 0, :]

        # Combine embeddings
        combined = torch.cat([cls1, cls2], dim=1)

        # Classification
        logits = self.classifier(combined)
        return logits


In [31]:
class DualEncoderDataset(Dataset):
    def __init__(self, texts, labels, tokenizer1, tokenizer2, max_length=256):
        self.texts = texts
        self.labels = labels
        self.tok1 = tokenizer1
        self.tok2 = tokenizer2
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.texts[idx]
        label = torch.tensor(self.labels[idx], dtype=torch.float32)

        enc1 = self.tok1(text, truncation=True, padding="max_length", max_length=self.max_length, return_tensors="pt")
        enc2 = self.tok2(text, truncation=True, padding="max_length", max_length=self.max_length, return_tensors="pt")

        return {
            "input_ids_1": enc1["input_ids"].squeeze(0),
            "attention_mask_1": enc1["attention_mask"].squeeze(0),
            "input_ids_2": enc2["input_ids"].squeeze(0),
            "attention_mask_2": enc2["attention_mask"].squeeze(0),
            "labels": label
        }


In [32]:
tokenizer1 = AutoTokenizer.from_pretrained("microsoft/codebert-base")
tokenizer2 = AutoTokenizer.from_pretrained("distilbert-base-uncased")

train_dataset = DualEncoderDataset(X_train_desc, Y_train, tokenizer1, tokenizer2)
val_dataset   = DualEncoderDataset(X_val_desc,   Y_val,   tokenizer1, tokenizer2)

train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True)
val_loader   = DataLoader(val_dataset,   batch_size=8)

model.safetensors:   0%|          | 0.00/499M [00:00<?, ?B/s]

In [33]:
device = "cuda" if torch.cuda.is_available() else "cpu"
model = MultiEncoderClassifier().to(device)

In [34]:
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=1e-2)

In [35]:
from tqdm.auto import tqdm

EPOCHS = 20

train_loss_history = []
val_loss_history = []

for epoch in range(EPOCHS):
    model.train()
    total_loss = 0


    # Train for each batch of training data
    pbar = tqdm(train_loader, desc=f"Epoch {epoch+1}/{EPOCHS}")

    for batch in pbar:


        optimizer.zero_grad()

        logits = model(
            batch["input_ids_1"].to(device),
            batch["attention_mask_1"].to(device),
            batch["input_ids_2"].to(device),
            batch["attention_mask_2"].to(device),
        )

        loss = criterion(logits, batch["labels"].to(device))
        loss.backward()
        optimizer.step()

        total_loss += loss.item()

    # Validation
    model.eval()
    val_loss = 0
    with torch.no_grad():
        for batch in val_loader:
            logits = model(
                batch["input_ids_1"].to(device),
                batch["attention_mask_1"].to(device),
                batch["input_ids_2"].to(device),
                batch["attention_mask_2"].to(device),
            )
            loss = criterion(logits, batch["labels"].to(device))
            val_loss += loss.item()

    train_loss_history.append(total_loss / len(train_loader))
    val_loss_history.append(val_loss / len(val_loader))

    # Early Stopping (custom) with patience of 2
    if len(val_loss_history) > 2 and val_loss_history[-1] > val_loss_history[-2] and val_loss_history[-2] > val_loss_history[-3]:
        print("Early stopping triggered.")
        break


    print(f"Epoch {epoch+1} | Loss = {total_loss/len(train_loader):.4f} | Val Loss = {val_loss/len(val_loader):.4f}")

Epoch 1/20:   0%|          | 0/268 [00:00<?, ?it/s]

Epoch 1 | Loss = 0.3723 | Val Loss = 0.2602


Epoch 2/20:   0%|          | 0/268 [00:00<?, ?it/s]

Epoch 2 | Loss = 0.2508 | Val Loss = 0.2237


Epoch 3/20:   0%|          | 0/268 [00:00<?, ?it/s]

Epoch 3 | Loss = 0.2064 | Val Loss = 0.2207


Epoch 4/20:   0%|          | 0/268 [00:00<?, ?it/s]

Epoch 4 | Loss = 0.1744 | Val Loss = 0.2172


Epoch 5/20:   0%|          | 0/268 [00:00<?, ?it/s]

Epoch 5 | Loss = 0.1413 | Val Loss = 0.2299


Epoch 6/20:   0%|          | 0/268 [00:00<?, ?it/s]

Early stopping triggered.


In [36]:
# Evaluate with our metrics

y_preds = []
y_trues = []

for batch in val_loader:
    logits = model(
        batch["input_ids_1"].to(device),
        batch["attention_mask_1"].to(device),
        batch["input_ids_2"].to(device),
        batch["attention_mask_2"].to(device),
    )
    preds = (torch.sigmoid(logits) > 0.5).int().cpu()
    y_preds.append(preds)
    y_trues.append(batch["labels"].cpu())

y_preds = torch.cat(y_preds, dim=0)
y_trues = torch.cat(y_trues, dim=0)

micro = f1_score(y_trues, y_preds, average="micro", zero_division=0)
macro = f1_score(y_trues, y_preds, average="macro", zero_division=0)

print(f"Val Micro F1: {micro:.4f}")
print(f"Val Macro F1: {macro:.4f}")

Val Micro F1: 0.7422
Val Macro F1: 0.7004


In [37]:
from sklearn.metrics import classification_report
print(classification_report(y_trues, y_preds, target_names=label_mapping.keys(), zero_division=0))

               precision    recall  f1-score   support

         math       0.79      0.80      0.80       281
       graphs       0.81      0.45      0.58       110
      strings       0.86      0.90      0.88        90
number theory       0.58      0.58      0.58        76
        trees       0.83      0.75      0.79        60
     geometry       0.70      0.64      0.67        33
        games       0.71      0.77      0.74        22
probabilities       0.55      0.60      0.57        10

    micro avg       0.77      0.72      0.74       682
    macro avg       0.73      0.69      0.70       682
 weighted avg       0.77      0.72      0.74       682
  samples avg       0.78      0.76      0.75       682



In [None]:
# This is the best result we get through the deep learning approach

# The combination of both code and description embeddings seems to help a lot
# But we did not improve if we compare with ML approach
# While the model is much more complex and not interpretable

In [38]:
# Save the model
torch.save(model.state_dict(), "multi_encoder_model.pt")


In [None]:
# TO save it in local (we trained it in Colab)
from google.colab import files
files.download("multi_encoder_model.pt")

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>