# Transition to Deep Learning

After evaluating several classical machine learning models (TF-IDF + Linear Models / SVM / LightGBM), we move to a **Deep Learning approach** to better capture the semantic structure of the data.

## Why Deep Learning?

Traditional models based on TF-IDF ignore:

- word order  
- long-range dependencies  
- semantic meaning  
- similarities between exercises  

Although we could use simpler neural architectures such as dense networks or CNNs, these models are **not well suited for sequential textual data**.  

Recurrent architectures (RNNs, LSTMs, GRUs) can handle sequences, but they struggle with long texts, lack parallelization, and are generally outperformed by more modern approaches.

## Transformers: the right architecture for text

Transformers are currently the **state-of-the-art** in natural language processing because they:

- handle long sequences  
- capture global context with self-attention  
- are highly parallelizable  
- perform extremely well in multilabel classification  

Therefore, Transformers are the most appropriate architecture for our task.

## The challenge: limited data & limited compute

Training a Transformer from scratch is **not feasible** in our setting:

- we do not have enough labeled data  
- we cannot train a large model from scratch  
- the computational cost would be prohibitive

## Solution: Transfer Learning

We leverage **pretrained Transformer models** and fine-tune them on our dataset:

- we keep the pretrained encoder (frozen or partially frozen)  
- we add a **custom multilabel classification head** on top  
- we train only the final layers on our dataset  

This approach drastically reduces sample complexity and compute requirements.

## Suitable pretrained models

- **For problem descriptions (natural language):**  
  - *DistilBERT*  
  - *BERT-base*  
  - *RoBERTa-base*

- **For source code (programming languages):**  
  - *CodeBERT (Microsoft)*  
  - *GraphCodeBERT*  
  - *CodeT5*

These pretrained models already encode meaningful representations of text or code, making them ideal for fine-tuning on our multilabel classification task.


In [42]:
import pandas as pd
import numpy as np

In [43]:
import sys
sys.path.append("..")

# Data Loading

In [2]:
# Local Case

from src.processing import load_processed_data

df = load_processed_data()
df.head()

ModuleNotFoundError: No module named 'src'

In [119]:
# Colab Case

dataset_url = (
    "https://raw.githubusercontent.com/valdugay/illuin_interview/main/data/processed/cleaned_code_classification_dataset.jsonl"
)

df = pd.read_json(dataset_url, lines=True)
df.head()


Unnamed: 0,index,src_uid,source_code,tags,full_description
0,0,bb3fc45f903588baf131016bea175a9f,# calculate convex of polygon v.\n# v is list ...,[geometry],Problem Description:\nIahub has drawn a set of...
1,1,7d6faccc88a6839822fa0c0ec8c00251,s = input().strip();N = len(s)\nif len(s) == 1...,[strings],Problem Description:\nSome time ago Lesha foun...
2,2,891fabbb6ee8a4969b6f413120f672a8,"n = int(input())\nfor _ in range(n):\n k,x = m...","[number theory, math]",Problem Description:\nToday at the lesson of m...
3,3,9d46ae53e6dc8dc54f732ec93a82ded3,temp = list(input())\nm = int(input())\ntrans ...,"[math, strings]",Problem Description:\nPasha got a very beautif...
4,4,0e0f30521f9f5eb5cff2549cd391da3c,"N, B, E = input(), [], 0\nfor a in map(int, ra...",[math],Problem Description:\nYou are given an array $...


In [120]:
def get_labels(df):
    """ Return 8-length binary vectors representing the labels """

    focus_tags = ['math', 'graphs', 'strings', 'number theory',
              'trees', 'geometry', 'games', 'probabilities']

    
    def encode_tags(tag_list):
        return [1 if t in tag_list else 0 for t in focus_tags]

    labels_vector = df["tags"].apply(encode_tags)

    return np.vstack(labels_vector.values)


# To be able to decode the labels later
label_mapping = {
    'math': 0,
    'graphs': 1,
    'strings': 2,
    'number theory': 3,
    'trees': 4,
    'geometry': 5,
    'games': 6,
    'probabilities': 7
}


Y = get_labels(df)


X_descriptions = df["full_description"].values
X_code = df["source_code"].values

In [121]:
# Just as for the ML approach, we have 2 features (text) : the description and the code

# Description

In [122]:
import torch
import torch.nn as nn

from transformers import AutoTokenizer, AutoModelForSequenceClassification

In [123]:
num_labels = len(label_mapping)  # 8

In [124]:
model_name_desc = "distilbert-base-uncased"
tokenizer_desc = AutoTokenizer.from_pretrained(model_name_desc)

In [125]:
# Test it
example_text = "Given a tree with n nodes, compute the diameter."
enc = tokenizer_desc(
    example_text,
    padding=True,
    truncation=True,
    max_length=256,
    return_tensors="pt"
)
enc["input_ids"].shape, enc["attention_mask"].shape

(torch.Size([1, 13]), torch.Size([1, 13]))

In [126]:
model_desc = AutoModelForSequenceClassification.from_pretrained(
    model_name_desc,
    num_labels=num_labels,
    problem_type="multi_label_classification"
)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [127]:
print(model_desc)


DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): DistilBertSdpaAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)


In [128]:
backbone_params = 0
classifier_params = 0

for name, param in model_desc.named_parameters():
    if "classifier" in name:      # Head
        classifier_params += param.numel()
    else:                         # Pretrained backbone
        backbone_params += param.numel()

print(f"Backbone params: {backbone_params:,}")
print(f"Classifier params: {classifier_params:,}")
print(f"Total params: {backbone_params + classifier_params:,}")


Backbone params: 66,362,880
Classifier params: 596,744
Total params: 66,959,624


In [129]:
# We can check the trainable parameters
trainable_params = 0
for param in model_desc.parameters():
    if param.requires_grad:
        trainable_params += param.numel()
print(f"Trainable params: {trainable_params:,}")

Trainable params: 66,959,624


In [130]:
def predict_tags_for_description(texts, model, tokenizer, threshold=0.5, max_length=256):
    model.eval()
    with torch.no_grad():
        batch = tokenizer(
            texts,
            padding=True,
            truncation=True,
            max_length=max_length,
            return_tensors="pt"
        )
        outputs = model(
            input_ids=batch["input_ids"],
            attention_mask=batch["attention_mask"]
        )
        logits = outputs.logits
        probas = torch.sigmoid(logits)
        preds = (probas > threshold).int()

    return probas, preds

In [131]:
# Example usage
texts = [
    "You are given a tree with n vertices. Find the number of paths...",
    "You are given a string s of length n, consisting of letters 'a' and 'b'..."
]

probas, preds = predict_tags_for_description(texts, model_desc, tokenizer_desc)
print("Probabilities:\n", probas)
print("Predictions:\n", preds)

Probabilities:
 tensor([[0.5026, 0.4769, 0.5000, 0.4561, 0.5085, 0.4643, 0.4819, 0.5361],
        [0.5117, 0.4815, 0.4975, 0.4630, 0.5104, 0.4716, 0.4797, 0.5190]])
Predictions:
 tensor([[1, 0, 1, 0, 1, 0, 0, 1],
        [1, 0, 0, 0, 1, 0, 0, 1]], dtype=torch.int32)


In [132]:
from torch.utils.data import Dataset

class DescriptionDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_length=256):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_length = max_length
        
    def __len__(self):
        return len(self.texts)
    
    def __getitem__(self, idx):
        text = self.texts[idx]
        label = self.labels[idx]

        enc = self.tokenizer(
            text,
            padding='max_length',
            truncation=True,
            max_length=self.max_length,
            return_tensors="pt"
        )

        return {
            "input_ids": enc["input_ids"].squeeze(0),
            "attention_mask": enc["attention_mask"].squeeze(0),
            "labels": torch.tensor(label, dtype=torch.float)
        }


In [133]:
from sklearn.model_selection import train_test_split

X_train_desc, X_val_desc, Y_train, Y_val = train_test_split(
    X_descriptions, Y, test_size=0.2, random_state=42
)


In [134]:
from torch.utils.data import DataLoader

train_dataset = DescriptionDataset(X_train_desc, Y_train, tokenizer_desc)
val_dataset   = DescriptionDataset(X_val_desc,   Y_val,   tokenizer_desc)

train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True)
val_loader   = DataLoader(val_dataset, batch_size=8, shuffle=False)


In [135]:
from torch.optim import AdamW

optimizer = AdamW(model_desc.parameters(), lr=2e-5, weight_decay=1e-2)


In [136]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_desc = model_desc.to(device)

print("GPU :", torch.cuda.is_available())

GPU : True


In [137]:
import torch
import torch.nn.functional as F
from tqdm import tqdm

def train_one_epoch(model, loader, optimizer, scheduler=None):
    model.train()
    total_loss = 0

    for batch in tqdm(loader, desc="Training"):
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        labels = batch["labels"].to(device)

        outputs = model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            labels=labels
        )

        loss = outputs.loss
        loss.backward()
        total_loss += loss.item()

        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        if scheduler:
            scheduler.step()
        optimizer.zero_grad()

    return total_loss / len(loader)


In [138]:
from sklearn.metrics import f1_score
import numpy as np

def validate(model, val_loader, criterion=None, device="cuda"):
    model.eval()
    total_loss = 0.0
    all_preds = []
    all_targets = []

    if criterion is None:
        criterion = torch.nn.BCEWithLogitsLoss()

    with torch.no_grad():
        for batch in val_loader:
            input_ids = batch["input_ids"].to(device)
            attention_mask = batch["attention_mask"].to(device)
            labels = batch["labels"].to(device)

            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
            logits = outputs.logits

            loss = criterion(logits, labels.float())
            total_loss += loss.item() * input_ids.size(0)

            preds = (logits.sigmoid() > 0.5).int().cpu()
            all_preds.append(preds)
            all_targets.append(labels.cpu())

    val_loss = total_loss / len(val_loader.dataset)

    all_preds = torch.cat(all_preds, dim=0)
    all_targets = torch.cat(all_targets, dim=0)

    micro = f1_score(all_targets, all_preds, average="micro", zero_division=0)
    macro = f1_score(all_targets, all_preds, average="macro", zero_division=0)

    return val_loss, micro, macro

In [139]:
from transformers import get_linear_schedule_with_warmup

epochs = 10

total_steps = len(train_loader) * epochs

scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * total_steps),
    num_training_steps=total_steps
)


In [None]:
# A good practice when the dataset is small
# First epoch: train only the classification head (freeze the backbone)
# From the second epoch: unfreeze the backbone and train all layers

train_loss_history = []
test_loss_history = []


for epoch in range(epochs):
    print(f"\n----- Epoch {epoch+1}/{epochs} -----")

    if epoch == 0:
        print("Training ONLY the classification head (backbone frozen).")
        for param in model_desc.base_model.parameters():
            param.requires_grad = False

    if epoch == 1:
        print("Unfreezing the backbone: training ALL layers.")
        for param in model_desc.base_model.parameters():
            param.requires_grad = True

        optimizer = torch.optim.AdamW(model_desc.parameters(), lr=2e-5, weight_decay=1e-2)

    train_loss = train_one_epoch(model_desc, train_loader, optimizer, scheduler)
    val_loss, micro, macro = validate(model_desc, val_loader)

    test_loss_history.append(val_loss)
    train_loss_history.append(train_loss)

    # Early Stopping (custom) with patience of 2
    if len(test_loss_history) > 2 and test_loss_history[-1] > test_loss_history[-2] and test_loss_history[-2] > test_loss_history[-3]:
        print("Early stopping triggered.")
        break

    print(f"Train Loss: {train_loss:.4f}")
    print(f"Val Loss: {val_loss:.4f}")
    print(f"Val Micro F1: {micro:.4f}")
    print(f"Val Macro F1: {macro:.4f}")


----- Epoch 1/10 -----
➡️ Training ONLY the classification head (backbone frozen).


Training: 100%|██████████| 268/268 [00:08<00:00, 29.79it/s]


Train Loss: 0.5782
Val Micro F1: 0.4614
Val Macro F1: 0.0860

----- Epoch 2/10 -----
➡️ Unfreezing the backbone: training ALL layers.


Training: 100%|██████████| 268/268 [00:15<00:00, 16.77it/s]


Train Loss: 0.3054
Val Micro F1: 0.6426
Val Macro F1: 0.3399

----- Epoch 3/10 -----


Training: 100%|██████████| 268/268 [00:15<00:00, 16.92it/s]


Train Loss: 0.2348
Val Micro F1: 0.7076
Val Macro F1: 0.5927

----- Epoch 4/10 -----


Training: 100%|██████████| 268/268 [00:15<00:00, 16.96it/s]


Train Loss: 0.1937
Val Micro F1: 0.7290
Val Macro F1: 0.6825

----- Epoch 5/10 -----


Training: 100%|██████████| 268/268 [00:15<00:00, 16.97it/s]


Train Loss: 0.1595
Val Micro F1: 0.7274
Val Macro F1: 0.6684

----- Epoch 6/10 -----


Training: 100%|██████████| 268/268 [00:15<00:00, 16.98it/s]


Early stopping triggered.


# For Code

In [145]:
# For Code, we need a similar architecture (bert)
# But with a model pretrained on code data, like CodeBERT

code_model_name = "microsoft/codebert-base"

code_tokenizer = AutoTokenizer.from_pretrained(code_model_name)

code_model = AutoModelForSequenceClassification.from_pretrained(
    code_model_name,
    num_labels=8,
    problem_type="multi_label_classification"
)

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at microsoft/codebert-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [146]:
X_code_train, X_code_val, Y_train, Y_val = train_test_split(
    X_code, Y, test_size=0.2, random_state=42
)


code_train_dataset = DescriptionDataset(X_code_train, Y_train, code_tokenizer, max_length=256)
code_val_dataset   = DescriptionDataset(X_code_val,   Y_val,   code_tokenizer, max_length=256)

code_train_loader = DataLoader(code_train_dataset, batch_size=8, shuffle=True)
code_val_loader   = DataLoader(code_val_dataset, batch_size=8, shuffle=False)


In [147]:
code_model.to(device)
code_optimizer = AdamW(code_model.parameters(), lr=2e-5)

In [149]:
train_loss_history = []
test_loss_history = []


for epoch in range(epochs):
    print(f"\n----- Epoch {epoch+1}/{epochs} -----")

    if epoch == 0:
        print("Training ONLY the classification head (backbone frozen).")
        for param in code_model.base_model.parameters():
            param.requires_grad = False

    if epoch == 1:
        print("Unfreezing the backbone: training ALL layers.")
        for param in code_model.base_model.parameters():
            param.requires_grad = True

        optimizer = torch.optim.AdamW(code_model.parameters(), lr=2e-5, weight_decay=1e-2)

    train_loss = train_one_epoch(code_model, train_loader, optimizer, scheduler)
    val_loss, micro, macro = validate(code_model, val_loader)

    test_loss_history.append(val_loss)
    train_loss_history.append(train_loss)

    # Early Stopping (custom) with patience of 2
    if len(test_loss_history) > 2 and test_loss_history[-1] > test_loss_history[-2] and test_loss_history[-2] > test_loss_history[-3]:
        print("Early stopping triggered.")
        break

    print(f"Train Loss: {train_loss:.4f}")
    print(f"Val Loss: {val_loss:.4f}")
    print(f"Val Micro F1: {micro:.4f}")
    print(f"Val Macro F1: {macro:.4f}")



----- Epoch 1/10 -----
Training ONLY the classification head (backbone frozen).


Training: 100%|██████████| 268/268 [00:12<00:00, 21.56it/s]


Train Loss: 0.6790
Val Loss: 0.6791
Val Micro F1: 0.3315
Val Macro F1: 0.1715

----- Epoch 2/10 -----
Unfreezing the backbone: training ALL layers.


Training: 100%|██████████| 268/268 [00:26<00:00, 10.05it/s]


Train Loss: 0.3765
Val Loss: 0.3232
Val Micro F1: 0.5654
Val Macro F1: 0.1925

----- Epoch 3/10 -----


Training: 100%|██████████| 268/268 [00:26<00:00, 10.02it/s]


Train Loss: 0.3040
Val Loss: 0.2787
Val Micro F1: 0.6427
Val Macro F1: 0.3344

----- Epoch 4/10 -----


Training: 100%|██████████| 268/268 [00:26<00:00, 10.02it/s]


Train Loss: 0.2771
Val Loss: 0.2617
Val Micro F1: 0.6644
Val Macro F1: 0.3639

----- Epoch 5/10 -----


Training: 100%|██████████| 268/268 [00:26<00:00, 10.02it/s]


Train Loss: 0.2582
Val Loss: 0.2684
Val Micro F1: 0.6304
Val Macro F1: 0.3657

----- Epoch 6/10 -----


Training: 100%|██████████| 268/268 [00:26<00:00, 10.05it/s]


Train Loss: 0.2353
Val Loss: 0.2570
Val Micro F1: 0.6656
Val Macro F1: 0.4276

----- Epoch 7/10 -----


Training: 100%|██████████| 268/268 [00:26<00:00, 10.05it/s]


Train Loss: 0.2137
Val Loss: 0.2551
Val Micro F1: 0.6639
Val Macro F1: 0.4768

----- Epoch 8/10 -----


Training: 100%|██████████| 268/268 [00:26<00:00, 10.06it/s]


Train Loss: 0.1920
Val Loss: 0.2767
Val Micro F1: 0.6651
Val Macro F1: 0.5015

----- Epoch 9/10 -----


Training: 100%|██████████| 268/268 [00:26<00:00, 10.05it/s]


Train Loss: 0.1709
Val Loss: 0.2690
Val Micro F1: 0.6794
Val Macro F1: 0.5160

----- Epoch 10/10 -----


Training: 100%|██████████| 268/268 [00:26<00:00, 10.03it/s]


Train Loss: 0.1515
Val Loss: 0.2757
Val Micro F1: 0.6892
Val Macro F1: 0.5533


In [None]:
# For code, the results are largely better than ML approach
# Let's try an hybrid approach

# Combining

In [None]:
# Thanks to the architecture of this approach
# We won't just 'mix' the preidction (or probabilities) of both models
# Instead, we will concatenate the embeddings from both models
# And then, use one final classification head

In [167]:
# To do this, we create a custom model :

from transformers import AutoModel

class MultiEncoderClassifier(nn.Module):
    def __init__(self, model_name_1="microsoft/codebert-base", 
                       model_name_2="distilbert-base-uncased",
                       num_labels=8):
        super().__init__()

        # Encoder 1: CodeBERT
        self.encoder1 = AutoModel.from_pretrained(model_name_1)

        # Encoder 2: DistilBERT
        self.encoder2 = AutoModel.from_pretrained(model_name_2)

        # Hidden sizes
        h1 = self.encoder1.config.hidden_size
        h2 = self.encoder2.config.hidden_size
        combined = h1 + h2

        # Classification head
        self.classifier = nn.Sequential(
            nn.Linear(combined, 1024),
            nn.Dropout(0.2),
            nn.ReLU(),
            nn.Linear(1024, 512),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(512, num_labels)
        )

    def forward(self, input_ids_1, attention_mask_1,
                      input_ids_2, attention_mask_2):

        # CodeBERT encoding
        out1 = self.encoder1(input_ids=input_ids_1,
                             attention_mask=attention_mask_1)
        cls1 = out1.last_hidden_state[:, 0, :]  # CLS token

        # DistilBERT encoding
        out2 = self.encoder2(input_ids=input_ids_2,
                             attention_mask=attention_mask_2)
        cls2 = out2.last_hidden_state[:, 0, :]

        # Combine embeddings
        combined = torch.cat([cls1, cls2], dim=1)

        # Classification
        logits = self.classifier(combined)
        return logits


In [168]:
class DualEncoderDataset(Dataset):
    def __init__(self, texts, labels, tokenizer1, tokenizer2, max_length=256):
        self.texts = texts
        self.labels = labels
        self.tok1 = tokenizer1
        self.tok2 = tokenizer2
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.texts[idx]
        label = torch.tensor(self.labels[idx], dtype=torch.float32)

        enc1 = self.tok1(text, truncation=True, padding="max_length", max_length=self.max_length, return_tensors="pt")
        enc2 = self.tok2(text, truncation=True, padding="max_length", max_length=self.max_length, return_tensors="pt")

        return {
            "input_ids_1": enc1["input_ids"].squeeze(0),
            "attention_mask_1": enc1["attention_mask"].squeeze(0),
            "input_ids_2": enc2["input_ids"].squeeze(0),
            "attention_mask_2": enc2["attention_mask"].squeeze(0),
            "labels": label
        }


In [169]:
tokenizer1 = AutoTokenizer.from_pretrained("microsoft/codebert-base")
tokenizer2 = AutoTokenizer.from_pretrained("distilbert-base-uncased")

train_dataset = DualEncoderDataset(X_train_desc, Y_train, tokenizer1, tokenizer2)
val_dataset   = DualEncoderDataset(X_val_desc,   Y_val,   tokenizer1, tokenizer2)

train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True)
val_loader   = DataLoader(val_dataset,   batch_size=8)

In [170]:
device = "cuda" if torch.cuda.is_available() else "cpu"
model = MultiEncoderClassifier().to(device)

In [171]:
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=1e-2)

In [172]:
from tqdm.auto import tqdm

EPOCHS = 20

train_loss_history = []
val_loss_history = []

for epoch in range(EPOCHS):
    model.train()
    total_loss = 0


    # Train for each batch of training data
    pbar = tqdm(train_loader, desc=f"Epoch {epoch+1}/{EPOCHS}")
    
    for batch in pbar:

        
        optimizer.zero_grad()

        logits = model(
            batch["input_ids_1"].to(device),
            batch["attention_mask_1"].to(device),
            batch["input_ids_2"].to(device),
            batch["attention_mask_2"].to(device),
        )

        loss = criterion(logits, batch["labels"].to(device))
        loss.backward()
        optimizer.step()

        total_loss += loss.item()

    # Validation
    model.eval()
    val_loss = 0
    with torch.no_grad():
        for batch in val_loader:
            logits = model(
                batch["input_ids_1"].to(device),
                batch["attention_mask_1"].to(device),
                batch["input_ids_2"].to(device),
                batch["attention_mask_2"].to(device),
            )
            loss = criterion(logits, batch["labels"].to(device))
            val_loss += loss.item()

    train_loss_history.append(total_loss / len(train_loader))
    val_loss_history.append(val_loss / len(val_loader))

    # Early Stopping (custom) with patience of 2
    if len(val_loss_history) > 2 and val_loss_history[-1] > val_loss_history[-2] and val_loss_history[-2] > val_loss_history[-3]:
        print("Early stopping triggered.")
        break


    print(f"Epoch {epoch+1} | Loss = {total_loss/len(train_loader):.4f} | Val Loss = {val_loss/len(val_loader):.4f}")

Epoch 1/20:   0%|          | 0/268 [00:00<?, ?it/s]

Epoch 1 | Loss = 0.3740 | Val Loss = 0.2725


Epoch 2/20:   0%|          | 0/268 [00:00<?, ?it/s]

Epoch 2 | Loss = 0.2605 | Val Loss = 0.2322


Epoch 3/20:   0%|          | 0/268 [00:00<?, ?it/s]

Epoch 3 | Loss = 0.2117 | Val Loss = 0.2143


Epoch 4/20:   0%|          | 0/268 [00:00<?, ?it/s]

Epoch 4 | Loss = 0.1753 | Val Loss = 0.2180


Epoch 5/20:   0%|          | 0/268 [00:00<?, ?it/s]

Early stopping triggered.


In [173]:
# Evaluate with our metrics

y_preds = []
y_trues = []

for batch in val_loader:
    logits = model(
        batch["input_ids_1"].to(device),
        batch["attention_mask_1"].to(device),
        batch["input_ids_2"].to(device),
        batch["attention_mask_2"].to(device),
    )
    preds = (torch.sigmoid(logits) > 0.5).int().cpu()
    y_preds.append(preds)
    y_trues.append(batch["labels"].cpu())

y_preds = torch.cat(y_preds, dim=0)
y_trues = torch.cat(y_trues, dim=0)

micro = f1_score(y_trues, y_preds, average="micro", zero_division=0)
macro = f1_score(y_trues, y_preds, average="macro", zero_division=0)

print(f"Val Micro F1: {micro:.4f}")
print(f"Val Macro F1: {macro:.4f}")

Val Micro F1: 0.7492
Val Macro F1: 0.7046


In [174]:
print(classification_report(y_trues, y_preds, target_names=label_mapping.keys(), zero_division=0))

               precision    recall  f1-score   support

         math       0.81      0.84      0.82       281
       graphs       0.76      0.43      0.55       110
      strings       0.88      0.83      0.86        90
number theory       0.69      0.45      0.54        76
        trees       0.86      0.72      0.78        60
     geometry       0.68      0.76      0.71        33
        games       0.76      0.73      0.74        22
probabilities       0.83      0.50      0.62        10

    micro avg       0.80      0.71      0.75       682
    macro avg       0.78      0.66      0.70       682
 weighted avg       0.79      0.71      0.74       682
  samples avg       0.78      0.74      0.74       682



In [None]:
# This is the best result we get through the deep learning approach

# The combination of both code and description embeddings seems to help a lot
# But we did not improve if we compare with ML approach
# While the model is much more complex and not interpretable

In [175]:
# Save the model
torch.save(model.state_dict(), "multi_encoder_model.pt")


In [176]:
# TO save it in local (we trained it in Colab)
from google.colab import files
files.download("multi_encoder_model.pt")


<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

In [1]:
import os
os.listdir("/content")


['.config', 'multi_encoder_model.pt', 'sample_data']