# Inference Pipeline for Classical Models

This notebook runs inference with trained classical machine learning models (Random Forest, XGBoost) on the evaluation dataset `merged_for_evaluation.csv`. The pipeline supports:
- Model selection (Random Forest or XGBoost)
- Metric calculation (Precision, Recall, F1-score, Hamming Loss, Subset Accuracy)
- Optional saving of predictions to a CSV file
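
Hamming Loss and Subset Accuracy are less familiar than precision and recall, so a small worked example may help. The sketch below uses made-up toy labels (not values from this dataset) and plain Python; it illustrates the definitions and is not the pipeline's actual implementation. For multi-label indicator arrays, scikit-learn's `hamming_loss` and `accuracy_score` should give the same numbers.

```python
# Toy multi-label example: rows are samples, columns are labels.
# The values are illustrative only and are not taken from the evaluation data.
y_true = [
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [0, 0, 0],
]
y_pred = [
    [1, 0, 0],  # all labels correct
    [0, 0, 0],  # one label missed
    [0, 1, 1],  # one spurious label
    [0, 0, 0],  # all labels correct
]


def hamming_loss(y_true, y_pred):
    """Fraction of individual label slots that are predicted incorrectly."""
    total = sum(len(row) for row in y_true)
    wrong = sum(
        t != p
        for true_row, pred_row in zip(y_true, y_pred)
        for t, p in zip(true_row, pred_row)
    )
    return wrong / total


def subset_accuracy(y_true, y_pred):
    """Fraction of samples whose entire label vector is predicted exactly."""
    exact = sum(true_row == pred_row for true_row, pred_row in zip(y_true, y_pred))
    return exact / len(y_true)


print(f"Hamming loss:    {hamming_loss(y_true, y_pred):.4f}")
print(f"Subset accuracy: {subset_accuracy(y_true, y_pred):.2%}")
```

Subset accuracy is the stricter of the two: a sample counts only when every label matches, so it can never exceed 1 minus the Hamming loss.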

This step evaluates the **generalization performance** of classical models on unseen code metrics from *SmellyCode++*, separated earlier for evaluation purposes.

## Load Evaluation Dataset

We load the pre-split evaluation dataset containing only samples from *SmellyCode++*, previously excluded from model training. This dataset includes code metrics and binary labels for multiple code smells.

In [1]:
import pandas as pd


df_eval = pd.read_csv("../data/processed/merged_for_evaluation.csv")
print(f"Evaluation dataset shape: {df_eval.shape}")

Evaluation dataset shape: (21511, 18)
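
Before running inference, it can be useful to check how imbalanced each label is, since this affects how the metrics should be read. The sketch below builds a tiny stand-in DataFrame; the column names (`loc`, `wmc`, `Long Method`, and so on) are assumptions for illustration, not the actual schema of `merged_for_evaluation.csv`.

```python
import pandas as pd

# Tiny stand-in for df_eval; column names are hypothetical.
df_demo = pd.DataFrame(
    {
        "loc": [120, 15, 300, 42],
        "wmc": [10, 2, 35, 4],
        "Long Method": [1, 0, 0, 0],
        "God Class": [0, 0, 1, 0],
        "Clean": [0, 1, 0, 1],
    }
)

# Assumed label columns; adjust to match the real dataset.
label_cols = ["Long Method", "God Class", "Clean"]

# Positive count and positive rate per label.
prevalence = pd.DataFrame(
    {
        "positives": df_demo[label_cols].sum(),
        "rate": df_demo[label_cols].mean(),
    }
)
print(prevalence)
```

The same two lines applied to `df_eval` (with the real label column names) would show how rare each smell is relative to the clean class.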


## Run Inference with Classical Model

Choose a trained model (`random_forest` or `xgboost`) and run the inference pipeline to:
- Predict labels for each code smell
- Evaluate model performance with standard metrics
- Optionally export predictions to a CSV file

In [2]:
from src.inference.classical_models_predict_smells import run_inference_pipeline


# model_type: "random_forest" or "xgboost"
# save_path: file path to save predictions CSV, or None to skip saving
run_inference_pipeline(
    df=df_eval,
    model_type="random_forest",
    save_path="../data/predictions/random_forest_predictions.csv"
)


=== RANDOM_FOREST Inference Evaluation ===

--- Label: Long Method ---
              precision    recall  f1-score   support

           0       0.99      0.98      0.99     21197
           1       0.04      0.04      0.04       314

    accuracy                           0.97     21511
   macro avg       0.51      0.51      0.51     21511
weighted avg       0.97      0.97      0.97     21511


--- Label: God Class ---
              precision    recall  f1-score   support

           0       0.98      0.99      0.98     20645
           1       0.69      0.48      0.56       866

    accuracy                           0.97     21511
   macro avg       0.84      0.73      0.77     21511
weighted avg       0.97      0.97      0.97     21511


--- Label: Feature Envy ---
              precision    recall  f1-score   support

           0       0.99      0.98      0.98     21111
           1       0.17      0.23      0.20       400

    accuracy                           0.97     21511
 

In [4]:
from src.inference.classical_models_predict_smells import run_inference_pipeline


# model_type: "random_forest" or "xgboost"
# save_path: file path to save predictions CSV, or None to skip saving
run_inference_pipeline(
    df=df_eval,
    model_type="xgboost",
    save_path="../data/predictions/xgboost_predictions.csv"
)


=== XGBOOST Inference Evaluation ===

--- Label: Long Method ---
              precision    recall  f1-score   support

           0       0.99      1.00      0.99     21197
           1       0.80      0.01      0.03       314

    accuracy                           0.99     21511
   macro avg       0.89      0.51      0.51     21511
weighted avg       0.98      0.99      0.98     21511


--- Label: God Class ---
              precision    recall  f1-score   support

           0       0.98      0.99      0.99     20645
           1       0.70      0.50      0.59       866

    accuracy                           0.97     21511
   macro avg       0.84      0.75      0.79     21511
weighted avg       0.97      0.97      0.97     21511


--- Label: Feature Envy ---
              precision    recall  f1-score   support

           0       0.98      1.00      0.99     21111
           1       1.00      0.01      0.01       400

    accuracy                           0.98     21511
   macr

## Inference Results Analysis

This section evaluates the **Random Forest** and **XGBoost** models on the unseen `merged_for_evaluation.csv` dataset to show how well they generalize beyond the training set.
The goal is to assess how the classical models perform on new code samples, covering both **smelly** and **clean** classes. These results will serve as a baseline for comparison with transformer-based models in the next phase of the study.

---

### Random Forest Results

| Code Smell     | Precision | Recall | F1-score | Support |
|----------------|-----------|--------|----------|---------|
| Long Method    | 0.04      | 0.04   | 0.04     | 314     |
| God Class      | 0.69      | 0.48   | 0.56     | 866     |
| Feature Envy   | 0.17      | 0.23   | 0.20     | 400     |
| Data Class     | 0.70      | 0.13   | 0.22     | 656     |
| Clean          | 0.93      | 0.97   | 0.95     | 19,527  |
| **Macro Avg**  | **0.51**  | **0.51** | **0.51** | -       |

- **Hamming Loss**: 0.0426
- **Subset Accuracy**: 89.27%

Predictions saved to: `../data/predictions/random_forest_predictions.csv`

---

### XGBoost Results

| Code Smell     | Precision | Recall | F1-score | Support |
|----------------|-----------|--------|----------|---------|
| Long Method    | 0.80      | 0.01   | 0.03     | 314     |
| God Class      | 0.70      | 0.50   | 0.59     | 866     |
| Feature Envy   | 1.00      | 0.01   | 0.01     | 400     |
| Data Class     | 0.44      | 0.05   | 0.09     | 656     |
| Clean          | 0.93      | 0.91   | 0.92     | 19,527  |
| **Macro Avg**  | **0.61**  | **0.62** | **0.62** | -       |

- **Hamming Loss**: 0.0463
- **Subset Accuracy**: 84.37%

Predictions saved to: `../data/predictions/xgboost_predictions.csv`

---

### Insights

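- **Clean samples** are detected reliably by both models (F1 of 0.95 for Random Forest and 0.92 for XGBoost), which is expected given that they account for 19,527 of the 21,511 samples.
- Both models **struggle to detect rare smells**, especially *Long Method* and *Feature Envy*. XGBoost's poor F1-scores on these labels come from near-zero recall (0.01) despite high precision, whereas Random Forest is low on both precision and recall.
- **Random Forest** shows a slightly better balance between recall and precision on *Long Method* and *Feature Envy*, while **XGBoost** rarely predicts the positive class for these smells. *God Class* is the only smell that both models detect with moderate success (F1 of 0.56 and 0.59).
- The large gap between the clean class and the smell classes highlights the limitations of classical models, especially when relying solely on static code metrics.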
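
To see why accuracy alone is uninformative here and how near-zero recall drags F1 down, the arithmetic can be worked through directly. The supports below come from the *Long Method* reports above; the true-positive and false-positive counts are an assumption chosen to be consistent with the rounded XGBoost row (precision 0.80, recall 0.01), not values read from the model output.

```python
# Supports taken from the Long Method report above.
n_total = 21511
n_positive = 314

# A classifier that always predicts "not smelly" is right on every negative.
baseline_accuracy = (n_total - n_positive) / n_total
print(f"Always-negative accuracy: {baseline_accuracy:.4f}")

# Hypothetical counts consistent with the rounded XGBoost row;
# the true confusion matrix is not shown in the notebook.
tp, fp = 4, 1
fn = n_positive - tp

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Under these assumed counts, F1 stays near zero even though precision is 0.80, because the harmonic mean is dominated by the smaller of the two values. The always-negative baseline also reaches roughly 98.5% accuracy, which is why the per-label accuracy figures say little about smell detection.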

These findings further support the need for **context-aware models**, such as **transformer-based architectures**, which can capture semantic and structural nuances of source code that aggregate code metrics miss.