## Fine-tuning Transformer Models on the SmellyCode++ Dataset

In this notebook, we fine-tune selected transformer models for code smell detection on source code snippets from the `SmellyCode++` dataset. We load the merged dataset prepared earlier (`merged_for_training.csv`) and keep only the entries that originate from `SmellyCode++`.

We will train each model to perform **multi-label classification** over the following labels (four code smells plus `Clean`):

- Long Method
- God Class / Large Class
- Feature Envy
- Data Class
- Clean
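
To make the multi-label setup concrete, here is a minimal sketch of how per-label scores could be turned into predictions. Each label gets its own sigmoid and threshold, so a snippet can carry several smells at once. The `decode_multilabel` helper, the 0.5 threshold, and the example logits are illustrative assumptions, not part of the training code used in this notebook.

```python
import math

# Label names as they appear in the dataset's label columns
LABELS = ["Long Method", "God/Large Class", "Feature Envy", "Data Class", "Clean"]


def sigmoid(x: float) -> float:
    """Map a raw score (logit) to an independent probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def decode_multilabel(logits, threshold=0.5):
    """Return every label whose sigmoid probability reaches the threshold.

    Hypothetical helper: unlike softmax + argmax, labels do not compete,
    so zero, one, or several labels can be returned.
    """
    return [label for label, z in zip(LABELS, logits) if sigmoid(z) >= threshold]


# Made-up logits for a single snippet, one score per label
example_logits = [2.0, -1.5, 0.3, -3.0, -2.2]
predicted = decode_multilabel(example_logits)
print(predicted)  # ['Long Method', 'Feature Envy']
```

In practice the threshold is a tunable choice; 0.5 is only a common default.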

The selected models for fine-tuning are:

- **CodeBERT** (`microsoft/codebert-base`)
- **GraphCodeBERT** (`microsoft/graphcodebert-base`)
- **CodeT5** (`Salesforce/codet5-base`)

We will follow the steps below:
1. Load and filter the dataset
2. Tokenize source code
3. Format data for multi-label classification
4. Fine-tune selected transformer models
5. Evaluate and compare model performance

In [1]:
from src.training.data_utils import load_and_prepare_dataset

df, label_cols = load_and_prepare_dataset("../data/processed/merged_for_training.csv")

# Display class distribution
print("Number of samples:", len(df))
print("\nClass distribution:")
print(df[label_cols].sum())

Number of samples: 86043

Class distribution:
Long Method         1252
God/Large Class     3467
Feature Envy        1596
Data Class          2628
Clean              78109
dtype: int64
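
The counts above are heavily skewed towards `Clean`. As a rough sketch of how this imbalance could be quantified, the snippet below computes each label's share of the samples and a negatives-per-positive ratio, which is one common way to derive per-label weights for a binary cross-entropy loss. Whether `train_multilabel_transformer` applies any such weighting is not stated here. The `pos_weight` name is an assumption borrowed from PyTorch's `BCEWithLogitsLoss`.

```python
# Counts copied from the output above
n_samples = 86043
label_counts = {
    "Long Method": 1252,
    "God/Large Class": 3467,
    "Feature Envy": 1596,
    "Data Class": 2628,
    "Clean": 78109,
}

# Share of samples carrying each label
share = {name: count / n_samples for name, count in label_counts.items()}

# Negatives per positive: an assumed weighting scheme, not the notebook's confirmed method
pos_weight = {name: (n_samples - count) / count for name, count in label_counts.items()}

# Label assignments exceed the sample count, so at least some snippets carry several labels
extra_assignments = sum(label_counts.values()) - n_samples

for name in label_counts:
    print(f"{name:<16} share={share[name]:.3f}  pos_weight={pos_weight[name]:.1f}")
print("Label assignments beyond one per sample:", extra_assignments)
```

The positive difference between total label assignments and the number of samples confirms that the label columns are not mutually exclusive, which is what motivates the multi-label formulation.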


## Tokenization Strategy

Each model ships with its own tokenizer and vocabulary, so the source code must be tokenized with the tokenizer that matches the checkpoint being fine-tuned. We will use the Hugging Face `transformers` library to load the correct tokenizer for each model.

We will use the `Code` column as input text and the five binary label columns as targets. As this is a **multi-label classification task**, each target is a binary vector with one position per label. Assuming the label order printed above, `[1, 0, 1, 0, 0]` marks a snippet labelled both Long Method and Feature Envy.
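
A minimal sketch of how one record could be turned into such a target vector is shown below. The `to_target_vector` helper and the toy row are hypothetical; the values are cast to floats because binary cross-entropy losses generally expect floating-point targets.

```python
LABEL_COLS = ["Long Method", "God/Large Class", "Feature Envy", "Data Class", "Clean"]


def to_target_vector(row, label_cols=LABEL_COLS):
    """Build a multi-hot float vector from a row's binary label columns."""
    return [float(row[col]) for col in label_cols]


# Toy row standing in for one record of the dataset (values are made up)
row = {
    "Code": "void process() { /* ... */ }",
    "Long Method": 1,
    "God/Large Class": 0,
    "Feature Envy": 1,
    "Data Class": 0,
    "Clean": 0,
}

target = to_target_vector(row)
print(target)  # [1.0, 0.0, 1.0, 0.0, 0.0]
```

Keeping the column order fixed in one place matters, because the model's output positions are only meaningful relative to that order.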

The tokenization will:
- truncate long inputs to the model's `max_length`
- pad shorter inputs
- return `input_ids` and `attention_mask` for training
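
The three points above can be illustrated with a toy function that operates on token ids that have already been produced. This is a simplified sketch, not the tokenizer's actual implementation: `pad_and_truncate` and the `pad_id` value are assumptions, and real tokenizers also handle special tokens when truncating. With Hugging Face tokenizers, the same behaviour is typically requested through the `truncation`, `padding`, and `max_length` arguments.

```python
def pad_and_truncate(token_ids, max_length, pad_id=1):
    """Clip token_ids to max_length, then right-pad with pad_id.

    Returns a dict shaped like a tokenizer's output: `input_ids` plus an
    `attention_mask` holding 1 for real tokens and 0 for padding.
    """
    kept = token_ids[:max_length]
    n_pad = max_length - len(kept)
    return {
        "input_ids": kept + [pad_id] * n_pad,
        "attention_mask": [1] * len(kept) + [0] * n_pad,
    }


# A short input is padded; a long input is truncated (ids are arbitrary)
short = pad_and_truncate([0, 713, 16, 2], max_length=6)
long = pad_and_truncate(list(range(10, 20)), max_length=6)

print(short)  # {'input_ids': [0, 713, 16, 2, 1, 1], 'attention_mask': [1, 1, 1, 1, 0, 0]}
print(long)   # {'input_ids': [10, 11, 12, 13, 14, 15], 'attention_mask': [1, 1, 1, 1, 1, 1]}
```

The attention mask lets the model ignore padded positions, so every sequence in a batch can share one length without the padding affecting the prediction.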

## Fine-tune CodeBERT

In [None]:
from src.training.train_codebert import train_multilabel_transformer

train_multilabel_transformer(
    model_name="microsoft/codebert-base",
    df=df,
    label_cols=label_cols,  # label columns returned by load_and_prepare_dataset above
    output_dir="../models/transformers"
)

## Fine-tune GraphCodeBERT

In [None]:
from src.training.train_codebert import train_multilabel_transformer

train_multilabel_transformer(
    model_name="microsoft/graphcodebert-base",  # GraphCodeBERT checkpoint, not CodeBERT
    df=df,
    label_cols=label_cols,  # label columns returned by load_and_prepare_dataset above
    output_dir="../models/transformers"
)