# Transformer Models — Inference & Evaluation

This notebook evaluates the fine-tuned transformer models:

- **CodeBERT**
- **GraphCodeBERT**
- **CodeT5**

using the prepared dataset: `data/processed/merged_for_evaluation.csv`

We compute:

- Precision
- Recall
- F1-score
- Hamming Loss
- Subset Accuracy
- PR curves for every class

Predictions are also exported to: `data/predictions/`

This experiment allows us to compare transformer-based models with the classical baselines (Random Forest and XGBoost).
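
The two multilabel-specific metrics in the list above can be illustrated on a toy example. This is a minimal sketch with made-up label matrices, not the notebook's `evaluate_predictions` implementation (which is not shown here); the helper names `hamming_loss` and `subset_accuracy` are defined locally for illustration.

```python
import numpy as np


def hamming_loss(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Fraction of individual label slots that are predicted incorrectly."""
    return float(np.mean(y_true != y_pred))


def subset_accuracy(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Fraction of samples whose entire label vector matches exactly."""
    return float(np.mean(np.all(y_true == y_pred, axis=1)))


# Toy data: 4 samples, 3 labels (illustrative values only)
y_true = np.array([[1, 0, 0], [0, 1, 1], [0, 0, 1], [1, 1, 0]])
y_pred = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]])

print(hamming_loss(y_true, y_pred))     # 2 wrong slots out of 12
print(subset_accuracy(y_true, y_pred))  # 2 exact matches out of 4
```

Hamming Loss gives partial credit for a label vector that is mostly right, whereas Subset Accuracy counts a sample as correct only if every label matches.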

## Load dataset

In [1]:
from src.inference.load_evaluation_data import load_and_prepare_evaluation_data

texts, y_true, label_cols = load_and_prepare_evaluation_data("../data/processed/merged_for_evaluation.csv")
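
The implementation of `load_and_prepare_evaluation_data` is not shown in this notebook. As a rough sketch of what a loader with the same return shape might do, the example below reads a CSV with one text column and one 0/1 column per label. The function name `load_eval_csv_sketch`, the column name `code`, and the sample rows are all hypothetical.

```python
import csv
import io

import numpy as np


def load_eval_csv_sketch(file_obj, text_col="code"):
    """Return (texts, y_true, label_cols) from a CSV with 0/1 label columns.

    Assumption: every column other than `text_col` is a binary label column.
    """
    reader = csv.DictReader(file_obj)
    label_cols = [c for c in reader.fieldnames if c != text_col]
    texts, rows = [], []
    for row in reader:
        texts.append(row[text_col])
        rows.append([int(row[c]) for c in label_cols])
    return texts, np.array(rows, dtype=int), label_cols


# In-memory stand-in for a real file (hypothetical contents)
sample = io.StringIO(
    'code,Long Method,Data Class,Clean\n'
    '"class A {}",0,1,0\n'
    '"void f() {}",0,0,1\n'
)
texts, y_true, label_cols = load_eval_csv_sketch(sample)
print(label_cols, y_true.shape)
```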

## Load inference utilities

In [2]:
from src.inference.transformer_models_inference import (
    load_sequence_classification_model,
    load_seq2seq_model,
    predict_multilabel_classification,
    predict_codet5,
    evaluate_predictions,
    plot_pr_curves,
    save_predictions_csv
)

## CodeBERT — Inference & Evaluation

In [3]:
model_path = "../models/transformers/codebert/codebert-base_multilabel_finetuned"

tokenizer, model = load_sequence_classification_model(model_path)

y_pred, probs = predict_multilabel_classification(texts, tokenizer, model)

report, ham_loss, subset_acc = evaluate_predictions(y_true, y_pred, label_cols)

print(report)
print("Hamming Loss:", ham_loss)
print("Subset Accuracy:", subset_acc)

plot_pr_curves(y_true, probs, label_cols, "codebert", "../data/images")
save_predictions_csv(texts, y_true, y_pred, probs, label_cols, "../data/predictions/codebert_predictions.csv")

[INFO] Starting load_sequence_classification_model...
[INFO] Finished load_sequence_classification_model.
[INFO] Starting predict_multilabel_classification...
[INFO] Finished predict_multilabel_classification. Time taken: 1365.92 seconds.
[INFO] Starting evaluate_predictions...
[INFO] Finished evaluate_predictions. Time taken: 0.01 seconds.
                 precision    recall  f1-score   support

    Long Method       0.34      0.36      0.35       314
God/Large Class       0.60      0.05      0.09       866
   Feature Envy       0.39      0.32      0.35       400
     Data Class       0.62      0.61      0.61       656
          Clean       0.94      0.98      0.96     19527

      micro avg       0.91      0.91      0.91     21763
      macro avg       0.58      0.46      0.47     21763
   weighted avg       0.90      0.91      0.89     21763
    samples avg       0.91      0.91      0.91     21763

Hamming Loss: 0.036846264701780485
Subset Accuracy: 0.910510901399284
[INFO] Startin
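
The cell above returns both hard predictions (`y_pred`) and scores (`probs`). A common way to obtain these for multilabel classification is an independent sigmoid per label followed by a fixed threshold. The sketch below assumes that scheme and a threshold of 0.5; whether `predict_multilabel_classification` works this way is not confirmed by this notebook, and `logits_to_multilabel` is a hypothetical helper.

```python
import numpy as np


def logits_to_multilabel(logits, threshold=0.5):
    """Apply a per-label sigmoid, then threshold each label independently."""
    logits = np.asarray(logits, dtype=float)
    probs = 1.0 / (1.0 + np.exp(-logits))
    preds = (probs >= threshold).astype(int)
    return preds, probs


# Illustrative logits for 2 samples and 3 labels
example_logits = np.array([[-2.0, 0.0, 3.0], [1.5, -0.5, -4.0]])
preds, probs = logits_to_multilabel(example_logits)
print(preds)
```

Under this scheme, a class with high precision but very low recall, such as `God/Large Class` in the report above (0.60 and 0.05), might benefit from a lower per-class threshold, at some cost in precision. That was not tested here.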

## GraphCodeBERT — Inference & Evaluation

In [3]:
model_path = "../models/transformers/graphcodebert/graphcodebert-base_multilabel_finetuned"

tokenizer, model = load_sequence_classification_model(model_path)

y_pred, probs = predict_multilabel_classification(texts, tokenizer, model)

report, ham_loss, subset_acc = evaluate_predictions(y_true, y_pred, label_cols)

print(report)
print("Hamming Loss:", ham_loss)
print("Subset Accuracy:", subset_acc)

plot_pr_curves(y_true, probs, label_cols, "graphcodebert", "../data/images")
save_predictions_csv(texts, y_true, y_pred, probs, label_cols, "../data/predictions/graphcodebert_predictions.csv")

[INFO] Starting load_sequence_classification_model...
[INFO] Finished load_sequence_classification_model.
[INFO] Starting predict_multilabel_classification...
[INFO] Finished predict_multilabel_classification. Time taken: 1293.21 seconds.
[INFO] Starting evaluate_predictions...
[INFO] Finished evaluate_predictions. Time taken: 0.01 seconds.
                 precision    recall  f1-score   support

    Long Method       0.00      0.00      0.00       314
God/Large Class       0.00      0.00      0.00       866
   Feature Envy       0.00      0.00      0.00       400
     Data Class       0.00      0.00      0.00       656
          Clean       0.92      0.99      0.95     19527

      micro avg       0.92      0.89      0.90     21763
      macro avg       0.18      0.20      0.19     21763
   weighted avg       0.82      0.89      0.85     21763
    samples avg       0.90      0.90      0.90     21763

Hamming Loss: 0.03864999302682349
Subset Accuracy: 0.901538747617498
[INFO] Starting

## CodeT5 — Inference & Evaluation

In [3]:
import numpy as np

model_path = "../models/transformers/codet5/codet5-base_multilabel_finetuned"

tokenizer, model = load_seq2seq_model(model_path)

decoded, text_to_vec = predict_codet5(texts, tokenizer, model)

# Convert decoded text labels → multilabel vectors
y_pred = np.array([text_to_vec(s, label_cols) for s in decoded])

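# predict_codet5 returns decoded label text, not per-label scores, so the hard 0/1
# predictions are reused as placeholder scores; the resulting PR curves therefore
# show a single operating point per class rather than a threshold sweep
probs = y_pred.astype(float)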

report, ham_loss, subset_acc = evaluate_predictions(y_true, y_pred, label_cols)

print(report)
print("Hamming Loss:", ham_loss)
print("Subset Accuracy:", subset_acc)

plot_pr_curves(y_true, probs, label_cols, "codet5", "../data/images")
save_predictions_csv(texts, y_true, y_pred, probs, label_cols, "../data/predictions/codet5_predictions.csv")

[INFO] Starting load_seq2seq_model...
[INFO] Finished load_seq2seq_model.
[INFO] Starting predict_codet5...
[INFO] Finished predict_codet5. Time taken: 2521.94 seconds.
[INFO] Starting evaluate_predictions...
[INFO] Finished evaluate_predictions. Time taken: 0.01 seconds.
                 precision    recall  f1-score   support

    Long Method       0.65      0.11      0.20       314
God/Large Class       0.61      0.22      0.32       866
   Feature Envy       0.71      0.11      0.19       400
     Data Class       0.79      0.70      0.75       656
          Clean       0.94      0.99      0.96     19527

      micro avg       0.93      0.92      0.93     21763
      macro avg       0.74      0.43      0.48     21763
   weighted avg       0.91      0.92      0.91     21763
    samples avg       0.93      0.93      0.93     21763

Hamming Loss: 0.030142717679326855
Subset Accuracy: 0.9292919901445772
[INFO] Starting plot_pr_curves...
[INFO] Finished plot_pr_curves. Time taken: 0.16 
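
Because CodeT5 generates label text, each decoded string has to be mapped back to a multi-hot vector before scoring. The notebook's own `text_to_vec` is not shown; the sketch below is one plausible approach, assuming the model emits label names separated by commas. The helper name `labels_text_to_vector`, the separator, and the sample strings are assumptions; the label names are taken from the classification report above.

```python
import numpy as np

LABELS = ["Long Method", "God/Large Class", "Feature Envy", "Data Class", "Clean"]


def labels_text_to_vector(decoded, label_names, sep=","):
    """Map e.g. 'Data Class, Feature Envy' to a multi-hot list over label_names."""
    predicted = {part.strip().lower() for part in decoded.split(sep)}
    return [1 if name.lower() in predicted else 0 for name in label_names]


# Hypothetical generated outputs; the last one matches no known label
decoded_outputs = ["Clean", "Data Class, Feature Envy", "unknown text"]
y_pred_demo = np.array([labels_text_to_vector(s, LABELS) for s in decoded_outputs])
print(y_pred_demo)
```

With this design, output that matches no label name yields an all-zero vector, which is scored as a miss for every true label of that sample.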

### Transformer Models Performance Analysis (Code Smell Detection)

We evaluated three transformer-based models — **CodeBERT**, **GraphCodeBERT**, and **CodeT5** — on the multilabel classification task of code smell detection using the test set derived from the `SmellyCode++` dataset.

#### CodeBERT
- **Overall performance** was strong on the `Clean` class (`F1 = 0.96`), which accounts for 19,527 of the 21,763 label instances.
- CodeBERT detected `Data Class` moderately well (F1 = 0.61), was weak on `Long Method` and `Feature Envy` (F1 = 0.35 each), and almost never detected `God/Large Class` (F1 = 0.09, with precision 0.60 but recall only 0.05).
- **Precision-Recall (PR) curves** confirmed the strong separability of `Clean` and `Data Class`, while other smell classes exhibited limited precision across the recall spectrum.

#### GraphCodeBERT
- While achieving strong results on `Clean` code (`F1 = 0.95`), this model failed to identify any of the code smell classes (`F1 = 0.00` for all).
- The PR curves show near-zero performance for the smell labels, suggesting that the model collapsed to predicting the majority class.
- **Subset accuracy** (0.90) and **Hamming Loss** (≈0.039) still look reasonable, but only because `Clean` samples dominate the dataset.
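
The effect in the last bullet can be reproduced with synthetic data. The sketch below uses a made-up two-label dataset (95 clean samples, 5 smelly ones) and a predictor that always outputs `Clean`. The numbers are illustrative and are not drawn from the evaluation set.

```python
import numpy as np

# Synthetic data with label order ["Smell", "Clean"]
n_clean, n_smelly = 95, 5
y_true = np.array([[0, 1]] * n_clean + [[1, 0]] * n_smelly)
y_pred = np.array([[0, 1]] * (n_clean + n_smelly))  # always predict "Clean"

subset_acc = float(np.mean(np.all(y_true == y_pred, axis=1)))
ham_loss = float(np.mean(y_true != y_pred))
# Recall on the smell label: share of truly smelly samples flagged as smelly
smell_recall = float(y_pred[y_true[:, 0] == 1, 0].mean())

print(subset_acc, ham_loss, smell_recall)
```

Subset accuracy and Hamming Loss look healthy even though recall on the minority label is zero, which is why per-class and macro-averaged scores matter on imbalanced data.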

#### CodeT5
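- **Best aggregate multilabel performance** among the three models, though not on every class.
- Achieved the **highest F1 on two smell classes**, `Data Class` (F1 = 0.75) and `God/Large Class` (F1 = 0.32), but trailed CodeBERT on `Long Method` (0.20 vs. 0.35) and `Feature Envy` (0.19 vs. 0.35), where precision was high (0.65 and 0.71) but recall was only 0.11.
- Performed well for `Clean` (F1 = 0.96), with precision 0.94 and recall 0.99.
- The PR curves for CodeT5 are not directly comparable with those of the other models: the scores passed to `plot_pr_curves` are the binary predictions themselves, so each curve reflects a single operating point rather than a threshold sweep.
- **Subset Accuracy: 92.9%**, **lowest Hamming Loss** (0.030), and **macro F1 = 0.48** (vs. 0.47 for CodeBERT), so the gain in class balance over CodeBERT is small.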
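
The CodeT5 cell sets `probs = y_pred.astype(float)`, so the scores used for its PR curves take only the values 0 and 1. The sketch below shows why that limits the curve: a PR curve has one point per distinct score threshold, and binary scores offer only two. The helper `pr_points` and the toy arrays are illustrative and are not part of the notebook's utilities.

```python
import numpy as np


def pr_points(y_true, scores):
    """Precision and recall at each distinct threshold (positive if score >= t)."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    n_pos = int((y_true == 1).sum())
    points = []
    for t in np.unique(scores):
        pred = scores >= t  # never empty, because t is one of the scores
        tp = int(np.sum(pred & (y_true == 1)))
        points.append((float(t), tp / int(pred.sum()), tp / n_pos))
    return points


y_true_demo = np.array([1, 0, 1, 1, 0, 0])
real_scores = np.array([0.9, 0.4, 0.6, 0.2, 0.7, 0.1])  # graded scores
hard_preds = (real_scores >= 0.5).astype(float)          # 0/1 placeholder scores

print(len(pr_points(y_true_demo, real_scores)))  # one point per distinct score
print(len(pr_points(y_true_demo, hard_preds)))   # only thresholds 0.0 and 1.0
```

With 0/1 scores, the only informative point is the model's actual precision and recall; the other point is the trivial "predict everything positive" baseline.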

#### Summary
- **CodeT5** outperformed both CodeBERT and GraphCodeBERT on the aggregate metrics and on `Data Class` and `God/Large Class`, while CodeBERT achieved higher F1 on `Long Method` and `Feature Envy`.
- **GraphCodeBERT** detected none of the smell classes, most likely because of its sensitivity to class imbalance, although other causes were not ruled out.
- **CodeBERT** remains a strong baseline, but its performance drops sharply on `God/Large Class` and is weak on `Long Method`.

These findings suggest that sequence-to-sequence models such as CodeT5 may better capture the inter-token patterns needed to detect some design flaws in source code, although the per-class results are mixed. Comparison with the classical models (Random Forest, XGBoost) is needed to put these results in context.
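
As a cross-check on the comparison, macro F1 can be recomputed as the unweighted mean of the per-class F1 values printed in the three reports above. The sketch below also computes a smell-only macro F1 that excludes `Clean`; that second figure is an extra view added for illustration. Because the inputs are the rounded two-decimal values from the reports, the results are approximate.

```python
from statistics import mean

# Per-class F1 copied from the classification reports above
per_class_f1 = {
    "CodeBERT": {"Long Method": 0.35, "God/Large Class": 0.09,
                 "Feature Envy": 0.35, "Data Class": 0.61, "Clean": 0.96},
    "GraphCodeBERT": {"Long Method": 0.00, "God/Large Class": 0.00,
                      "Feature Envy": 0.00, "Data Class": 0.00, "Clean": 0.95},
    "CodeT5": {"Long Method": 0.20, "God/Large Class": 0.32,
               "Feature Envy": 0.19, "Data Class": 0.75, "Clean": 0.96},
}

macro_all = {}
macro_smells = {}
for model, scores in per_class_f1.items():
    # Unweighted mean over all five classes
    macro_all[model] = mean(list(scores.values()))
    # Same mean, restricted to the four smell classes
    macro_smells[model] = mean([v for k, v in scores.items() if k != "Clean"])
    print(f"{model}: macro F1 = {macro_all[model]:.3f}, "
          f"smell-only macro F1 = {macro_smells[model]:.3f}")
```

The all-class values agree with the `macro avg` rows of the reports (0.47, 0.19, 0.48). On the smell classes alone, CodeT5 and CodeBERT are close, which supports the per-class picture in the summary.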