# Comparative Evaluation of Transformer vs Classical Models in Code Smell Detection

This notebook presents a comparative evaluation of traditional machine learning classifiers (*Random Forest*, *XGBoost*) and transformer-based models (*CodeBERT*, *GraphCodeBERT*, *CodeT5*) for the task of **multi-label code smell detection**. All transformer models were trained using a **self-supervised learning approach** on unlabeled code, and tested on labeled samples for performance benchmarking.

---

## Evaluation Metrics by Model and Code Smell Type

| **Model**       | **Long Method F1** | **God/Large Class F1** | **Feature Envy F1** | **Data Class F1** | **Clean F1** | **Hamming Loss ↓** | **Subset Accuracy ↑** |
|-----------------|--------------------|-------------------------|---------------------|-------------------|--------------|---------------------|------------------------|
| **RandomForest** | 0.04               | 0.56                    | **0.20**            | 0.22              | 0.95         | 0.0426              | 0.8927                 |
| **XGBoost**      | 0.03               | **0.59**                | 0.01                | 0.09              | 0.92         | 0.0463              | 0.8437                 |
| **CodeBERT**     | 0.08               | 0.22                    | 0.11                | 0.66              | **0.97**     | 0.0319              | 0.9235                 |
| **GraphCodeBERT**| 0.08               | 0.24                    | 0.13                | 0.73              | 0.96         | 0.0307              | 0.9256                 |
| **CodeT5**       | **0.20**           | 0.32                    | 0.19                | **0.75**          | 0.96         | **0.0301**          | **0.9293**             |

> **Legend**:
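> - **F1** = Harmonic mean of precision and recall, reported separately for each label.
> - **Hamming Loss** = Fraction of individual label assignments predicted incorrectly, taken over all samples and labels (lower is better).
> - **Subset Accuracy** = Fraction of samples whose full label set is predicted exactly (higher is better).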
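
As a minimal sketch of how the three metrics in the legend are defined for multi-label predictions, the plain-Python functions below compute them from binary indicator rows. The function names and the toy `y_true`/`y_pred` rows are illustrative assumptions, not the notebook's evaluation code or data. For multi-label indicator arrays, scikit-learn's `f1_score(..., average=None)`, `hamming_loss`, and `accuracy_score` follow the same definitions.

```python
from typing import List, Sequence

# Label order assumed to match the table columns.
LABELS = ["Long Method", "God/Large Class", "Feature Envy", "Data Class", "Clean"]

Rows = Sequence[Sequence[int]]  # one row per sample, one 0/1 entry per label


def per_label_f1(y_true: Rows, y_pred: Rows) -> List[float]:
    """F1 per label: 2*TP / (2*TP + FP + FN); 0.0 when the label never occurs."""
    scores = []
    for j in range(len(y_true[0])):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return scores


def hamming_loss(y_true: Rows, y_pred: Rows) -> float:
    """Fraction of individual (sample, label) entries that are wrong."""
    wrong = sum(t[j] != p[j] for t, p in zip(y_true, y_pred) for j in range(len(t)))
    return wrong / (len(y_true) * len(y_true[0]))


def subset_accuracy(y_true: Rows, y_pred: Rows) -> float:
    """Fraction of samples whose whole label row matches exactly."""
    exact = sum(1 for t, p in zip(y_true, y_pred) if list(t) == list(p))
    return exact / len(y_true)


# Toy data, made up purely for illustration.
y_true = [
    [1, 0, 0, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 1, 0, 0],
]
y_pred = [
    [1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 0, 1],
]

if __name__ == "__main__":
    for name, score in zip(LABELS, per_label_f1(y_true, y_pred)):
        print(f"{name}: F1 = {score:.2f}")
    print(f"Hamming loss: {hamming_loss(y_true, y_pred):.4f}")
    print(f"Subset accuracy: {subset_accuracy(y_true, y_pred):.4f}")
```

On these toy rows, 3 of the 20 label entries are wrong (Hamming loss 0.15) while only 2 of the 4 samples match exactly (subset accuracy 0.5). This shows why a model can have a low Hamming loss and still miss many exact matches.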

---

## Observations

### CodeT5 is the best-performing model overall:
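- Achieves the **highest F1-scores** on `Long Method` (0.20) and `Data Class` (0.75), and the best F1 among the transformers on every smell type:
  - `God/Large Class`: 0.32 F1, ahead of CodeBERT (0.22) and GraphCodeBERT (0.24) but below XGBoost (0.59) and RandomForest (0.56)
  - `Feature Envy`: 0.19 F1, roughly level with RandomForest (0.20)
- It also has the **lowest Hamming Loss (0.0301)** and **highest Subset Accuracy (92.93%)**.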

### Transformer models outperform classical ML:
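- All three transformers clearly outperform RandomForest and XGBoost on `Data Class` (0.66 to 0.75 F1 versus 0.22 and 0.09), a smell that plausibly requires semantic understanding. On `Feature Envy` the picture is mixed: every transformer beats XGBoost (0.01), but only CodeT5 (0.19) comes close to RandomForest (0.20).
- All three also achieve lower Hamming Loss (0.0301 to 0.0319 versus 0.0426 and 0.0463) and higher Subset Accuracy (0.9235 to 0.9293 versus 0.8927 and 0.8437).
- Classical models perform **relatively well only on `God/Large Class`** (0.56 and 0.59 F1 versus 0.22 to 0.32 for the transformers), likely because surface-level features are largely sufficient for detecting large class size patterns.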

### Classical models show limitations:
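- Poor performance on underrepresented classes such as `Long Method` (F1 at most 0.04) and `Data Class` (F1 at most 0.22). On `Feature Envy`, XGBoost nearly fails outright (0.01), and RandomForest's 0.20 is low in absolute terms even though it is the best score in that column.
- Without code-aware embeddings, their ability to generalize to complex semantic patterns appears limited.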
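
The per-column comparisons in these observations can be re-derived mechanically from the reported numbers. The sketch below transcribes the results table into a dictionary and picks the best model per metric, treating Hamming Loss as lower-is-better. The names `RESULTS`, `LOWER_IS_BETTER`, and `best_per_metric` are hypothetical helpers introduced here for illustration and are not part of the notebook.

```python
from typing import Dict

# Values transcribed from the results table in this notebook.
RESULTS: Dict[str, Dict[str, float]] = {
    "RandomForest": {
        "Long Method F1": 0.04, "God/Large Class F1": 0.56, "Feature Envy F1": 0.20,
        "Data Class F1": 0.22, "Clean F1": 0.95,
        "Hamming Loss": 0.0426, "Subset Accuracy": 0.8927,
    },
    "XGBoost": {
        "Long Method F1": 0.03, "God/Large Class F1": 0.59, "Feature Envy F1": 0.01,
        "Data Class F1": 0.09, "Clean F1": 0.92,
        "Hamming Loss": 0.0463, "Subset Accuracy": 0.8437,
    },
    "CodeBERT": {
        "Long Method F1": 0.08, "God/Large Class F1": 0.22, "Feature Envy F1": 0.11,
        "Data Class F1": 0.66, "Clean F1": 0.97,
        "Hamming Loss": 0.0319, "Subset Accuracy": 0.9235,
    },
    "GraphCodeBERT": {
        "Long Method F1": 0.08, "God/Large Class F1": 0.24, "Feature Envy F1": 0.13,
        "Data Class F1": 0.73, "Clean F1": 0.96,
        "Hamming Loss": 0.0307, "Subset Accuracy": 0.9256,
    },
    "CodeT5": {
        "Long Method F1": 0.20, "God/Large Class F1": 0.32, "Feature Envy F1": 0.19,
        "Data Class F1": 0.75, "Clean F1": 0.96,
        "Hamming Loss": 0.0301, "Subset Accuracy": 0.9293,
    },
}

# Metrics where a smaller value is better; everything else is higher-is-better.
LOWER_IS_BETTER = {"Hamming Loss"}


def best_per_metric(results: Dict[str, Dict[str, float]]) -> Dict[str, str]:
    """Return the name of the best-scoring model for each metric column."""
    metrics = next(iter(results.values())).keys()
    best = {}
    for metric in metrics:
        pick = min if metric in LOWER_IS_BETTER else max
        best[metric] = pick(results, key=lambda model: results[model][metric])
    return best


if __name__ == "__main__":
    for metric, model in best_per_metric(RESULTS).items():
        print(f"{metric}: {model} ({RESULTS[model][metric]})")
```

A check like this makes it easy to see which claims hold for every column and which hold only for the aggregate metrics.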

---

## Hypothesis Validation

> **Hypothesis**: Self-supervised transformer models can effectively identify code smells without manual annotation and outperform classical approaches in precision and generalization.

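**Largely supported** by results:
- Transformer models, especially **CodeT5**, achieved lower Hamming Loss and higher Subset Accuracy than both classical models, along with higher F1 on `Long Method`, `Data Class`, and `Clean`.
- The exceptions are `God/Large Class`, where both classical models remain well ahead, and `Feature Envy`, where RandomForest is on par with the best transformer.
- The table reports F1 rather than precision, so the precision part of the hypothesis is supported only indirectly.
- The self-supervised setup eliminated the need for manually labeled training data; labeled samples were used only for testing.
- These findings indicate strong potential for using pretrained transformer representations in software quality analysis.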

---

