# Transition to Deep Learning

After evaluating several classical machine learning models (TF-IDF + Linear Models / SVM / LightGBM), we move to a **Deep Learning approach** to better capture the semantic structure of the data.

## Why Deep Learning?

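TF-IDF is a bag-of-words representation, so models built on it largely ignore:

- word order (beyond what short n-grams capture)
- long-range dependencies
- semantic meaning (synonyms and paraphrases map to unrelated features)
- similarities between exercises that are worded differently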
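
To make the word-order point concrete, here is a minimal sketch using plain token counts. The two sentences and the `bag_of_words` helper are made up for illustration; unigram TF-IDF is computed from exactly these counts, so it would not distinguish the two sentences either.

```python
from collections import Counter


def bag_of_words(text):
    """Count lowercase whitespace-separated tokens, discarding their positions."""
    return Counter(text.lower().split())


# Two hypothetical statements: same words, different meaning
a = "find the shortest path from node a to node b"
b = "find the shortest path from node b to node a"

# The token counts are identical, so any model that only sees counts
# receives the same input for both sentences.
same_bag = bag_of_words(a) == bag_of_words(b)
print(same_bag)  # True
```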

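Simpler neural architectures such as dense networks or CNNs are an option, but dense networks discard token order and CNNs mainly capture local patterns within a fixed window, so they are **not well suited to sequential text with long-range structure**.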

Recurrent architectures (RNNs, LSTMs, GRUs) process sequences natively, but they struggle to retain information over long texts, must process tokens one at a time (which limits parallelization), and are generally outperformed by Transformer-based models.

## Transformers: the right architecture for text

Transformers are currently the **state-of-the-art** in natural language processing because they:

- handle long sequences, up to their maximum input length (typically 512 tokens for BERT-style encoders)
- capture global context with self-attention  
- are highly parallelizable  
- achieve strong results on text classification tasks, including multilabel classification, once fine-tuned

Therefore, Transformers are the most appropriate architecture for our task.
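
As an illustration of how self-attention gives every token access to the whole sequence, here is a minimal NumPy sketch of single-head scaled dot-product attention. The dimensions and the random projection matrices are placeholders chosen for the example; a real Transformer learns these weights and adds multiple heads, residual connections, and feed-forward layers.

```python
import numpy as np


def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention.

    X has shape (seq_len, d_model); the output has shape (seq_len, d_head).
    """
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])       # (seq_len, seq_len)
    scores -= scores.max(axis=-1, keepdims=True)  # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over each row
    return weights @ V, weights


rng = np.random.default_rng(0)
seq_len, d_model, d_head = 6, 16, 8  # arbitrary toy sizes
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))

out, weights = self_attention(X, W_q, W_k, W_v)
print(out.shape)      # (6, 8)
print(weights.shape)  # (6, 6)
```

Each row of `weights` is a distribution over all positions in the sequence, which is what lets a token draw on context at any distance. All rows are computed in a single matrix product, hence the parallelism.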

## The challenge: limited data & limited compute

Training a Transformer from scratch is **not feasible** in our setting:

- we do not have enough labeled data to learn language representations from scratch
- we do not have the hardware needed to pretrain a large model
- the training time and computational cost would be prohibitive

## Solution: Transfer Learning

We reuse **pretrained Transformer models** and adapt them to our dataset:

- we keep the pretrained encoder, either frozen or partially frozen
- we add a **custom multilabel classification head** on top
- we train only the head (and, optionally, the last encoder layers) on our dataset

Because the encoder weights are reused rather than learned from scratch, this approach greatly reduces both the amount of labeled data and the compute required.
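
The idea of training only a multilabel head on top of frozen features can be sketched in NumPy. This is a toy illustration under stated assumptions, not the notebook's actual pipeline. The `features` array is a random placeholder for frozen encoder outputs, the targets are synthetic, and the hidden size, learning rate, and step count are arbitrary.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def bce_loss(probs, targets, eps=1e-12):
    """Mean binary cross-entropy over all (sample, label) pairs."""
    probs = np.clip(probs, eps, 1 - eps)
    return -np.mean(targets * np.log(probs) + (1 - targets) * np.log(1 - probs))


rng = np.random.default_rng(0)
n_samples, hidden_size, n_labels = 64, 32, 8  # toy sizes (8 labels as in this task)

# Placeholder for frozen encoder outputs: random values, NOT real embeddings
features = rng.normal(size=(n_samples, hidden_size))
# Synthetic multilabel targets: a row may contain zero, one, or several 1s
targets = (features @ rng.normal(size=(hidden_size, n_labels)) > 1.0).astype(float)

# Trainable head: one linear layer with an independent sigmoid per label
W = np.zeros((hidden_size, n_labels))
b = np.zeros(n_labels)
learning_rate = 1.0

initial_loss = bce_loss(sigmoid(features @ W + b), targets)
for _ in range(200):
    probs = sigmoid(features @ W + b)
    grad_logits = (probs - targets) / targets.size  # d(mean BCE) / d(logits)
    W -= learning_rate * features.T @ grad_logits   # only the head is updated
    b -= learning_rate * grad_logits.sum(axis=0)
final_loss = bce_loss(sigmoid(features @ W + b), targets)

# Each label is thresholded independently, so several tags can be predicted at once
predictions = (sigmoid(features @ W + b) >= 0.5).astype(int)
print(initial_loss > final_loss)  # True
```

The important design choice is the per-label sigmoid with binary cross-entropy rather than a softmax: a softmax would force the tags to compete, whereas an exercise can legitimately carry several tags.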

## Suitable pretrained models

- **For problem descriptions (natural language):**  
  - *DistilBERT*  
  - *BERT-base*  
  - *RoBERTa-base*

- **For source code (programming languages):**  
  - *CodeBERT* (Microsoft)
  - *GraphCodeBERT* (Microsoft)
  - *CodeT5* (Salesforce; an encoder-decoder model, unlike the encoder-only models above)

These pretrained models already encode meaningful representations of text or code, making them ideal for fine-tuning on our multilabel classification task.


In [4]:
import pandas as pd
import numpy as np

In [5]:
import sys
sys.path.append("..")

# Data Loading

In [2]:
from src.processing import load_processed_data

df = load_processed_data()
df.head()

| src_uid | source_code | tags | full_description |
|---|---|---|---|
| bb3fc45f903588baf131016bea175a9f | `# calculate convex of polygon v.\n# v is list ...` | [geometry] | Problem Description:\nIahub has drawn a set of... |
| 7d6faccc88a6839822fa0c0ec8c00251 | `s = input().strip();N = len(s)\nif len(s) == 1...` | [strings] | Problem Description:\nSome time ago Lesha foun... |
| 891fabbb6ee8a4969b6f413120f672a8 | `n = int(input())\nfor _ in range(n):\n k,x = m...` | [number theory, math] | Problem Description:\nToday at the lesson of m... |
| 9d46ae53e6dc8dc54f732ec93a82ded3 | `temp = list(input())\nm = int(input())\ntrans ...` | [math, strings] | Problem Description:\nPasha got a very beautif... |
| 0e0f30521f9f5eb5cff2549cd391da3c | `N, B, E = input(), [], 0\nfor a in map(int, ra...` | [math] | Problem Description:\nYou are given an array $... |


In [6]:
def get_labels(df):
    """Return an (n_samples, 8) binary matrix with one column per focus tag."""

    # Column order of the label matrix; must match label_mapping below
    focus_tags = ['math', 'graphs', 'strings', 'number theory',
                  'trees', 'geometry', 'games', 'probabilities']

    def encode_tags(tag_list):
        # Assumes tag_list is a list of tag strings, not its string representation
        return [1 if t in tag_list else 0 for t in focus_tags]

    labels_vector = df["tags"].apply(encode_tags)

    return np.vstack(labels_vector.values)


# Tag -> column index, used to decode the labels later (same order as focus_tags in get_labels)
label_mapping = {
    'math': 0,
    'graphs': 1,
    'strings': 2,
    'number theory': 3,
    'trees': 4,
    'geometry': 5,
    'games': 6,
    'probabilities': 7
}


Y = get_labels(df)


X_descriptions = df["full_description"].values
X_code = df["source_code"].values
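
Decoding a label matrix back into tag names could look like the sketch below. `decode_labels` is a hypothetical helper, not part of the notebook, and the example rows are written by hand to mirror the first and third samples shown in the `df.head()` output. The tag list is repeated so the snippet runs on its own.

```python
import numpy as np

# Same tag order as in get_labels / label_mapping
focus_tags = ['math', 'graphs', 'strings', 'number theory',
              'trees', 'geometry', 'games', 'probabilities']


def decode_labels(label_matrix, threshold=0.5):
    """Map each row of 0/1 labels (or probabilities) to a list of tag names."""
    label_matrix = np.atleast_2d(label_matrix)
    return [[focus_tags[j] for j in np.flatnonzero(row >= threshold)]
            for row in label_matrix]


# Hand-written example rows (not computed from the real dataset)
example = np.array([
    [0, 0, 0, 0, 0, 1, 0, 0],   # geometry
    [1, 0, 0, 1, 0, 0, 0, 0],   # math + number theory
])
decoded = decode_labels(example)
print(decoded)  # [['geometry'], ['math', 'number theory']]
```

Tags come back in column order rather than in their original order in `df["tags"]`. The `threshold` argument lets the same helper decode predicted probabilities.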

In [7]:
# As in the classical ML approach, we have two text features: the description and the code

# Description