This notebook evaluates classical ML models on the CSQA dataset to detect code smells using only software metrics. The results will later be compared with transformer-based models trained on source code.

## Splitting and Normalizing the CSQA Dataset

Before training classical machine learning models (e.g., Random Forest, SVM, XGBoost), we perform two essential preprocessing steps:

1. **Train/Test Split**
   The dataset is divided into two subsets:
   - **Training set** – used to fit the model.
   - **Test set** – used to evaluate generalization performance.

   We apply *stratified sampling* so that the label distribution stays consistent across both sets. This matters especially for heavily imbalanced data such as CSQA, where only about 0.34% of samples carry the positive label.

2. **Feature Normalization**
   All feature columns are standardized with `StandardScaler`, which rescales each feature to **zero mean** and **unit variance**.
   This step is crucial for scale-sensitive models such as **SVM** and **KNN**; tree-based models such as Random Forest and XGBoost are largely unaffected by feature scaling.
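
The two steps above can be sketched as follows. This is a minimal, hypothetical stand-in for `split_and_scale()`, not the project's actual implementation: the name `split_and_scale_sketch`, the `test_size=0.2` and `random_state=42` defaults, and the choice to fit the scaler on the training set only are all assumptions (the 80/20 ratio is at least consistent with the set sizes printed later in this notebook). The toy data is synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def split_and_scale_sketch(df, label_col="label", test_size=0.2, random_state=42):
    """Hypothetical stand-in for split_and_scale(); defaults are assumptions."""
    X = df.drop(columns=[label_col])
    y = df[label_col]

    # stratify=y keeps the class proportions (nearly) identical in both subsets.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=random_state
    )

    # Assumption: fit the scaler on the training set only and reuse its
    # statistics for the test set, so no test information leaks into training.
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    return X_train_scaled, X_test_scaled, y_train, y_test, scaler


# Synthetic, imbalanced toy data (illustration only, not CSQA).
rng = np.random.default_rng(0)
toy_df = pd.DataFrame(
    rng.normal(loc=5.0, scale=3.0, size=(1000, 4)),
    columns=["m1", "m2", "m3", "m4"],
)
toy_df["label"] = 0
toy_df.loc[:49, "label"] = 1  # first 50 rows become the positive class (5%)

Xtr, Xte, ytr, yte, fitted_scaler = split_and_scale_sketch(toy_df)
print("Shapes:", Xtr.shape, Xte.shape)
print("Positive rate (train, test):", ytr.mean(), yte.mean())
print("Train feature means:", Xtr.mean(axis=0))
print("Train feature stds:", Xtr.std(axis=0))
```

In this sketch only the training features come out with exactly zero mean and unit variance; the test features are transformed with the training statistics and will deviate slightly, which is expected.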

---

Now, let’s apply the `split_and_scale()` function to the merged and cleaned CSQA dataset.

In [1]:
import numpy as np
import pandas as pd
import sys
sys.path.append('../src')
from data_processing.csqa_prepare import split_and_scale

csqa_df = pd.read_csv('../data/processed/csqa_merged_metrics.csv')

X_train, X_test, y_train, y_test, scaler = split_and_scale(csqa_df, label_col='label')

print("Any NaNs in X_train?", np.isnan(X_train).any())
print("Any NaNs in X_test?", np.isnan(X_test).any())
print(f"Train set size: {X_train.shape}, Test set size: {X_test.shape}")
print(f"Label distribution in train set:\n{y_train.value_counts(normalize=True)}")
print(f"Label distribution in test set:\n{y_test.value_counts(normalize=True)}")
print(f"First 5 rows of scaled training features:\n{X_train[:5]}")

Any NaNs in X_train? False
Any NaNs in X_test? False
Train set size: (3244992, 96), Test set size: (811249, 96)
Label distribution in train set:
label
0    0.996595
1    0.003405
Name: proportion, dtype: float64
Label distribution in test set:
label
0    0.996595
1    0.003405
Name: proportion, dtype: float64
First 5 rows of scaled training features:
[[-8.52838794e-02 -2.24522494e-01 -4.98199175e-01 -1.84375228e-01
  -1.37638296e-01  1.52183310e-01 -1.12711349e-01 -5.94996039e-02
  -6.06569271e-02 -8.79119156e-02 -4.29747693e-02 -1.46970637e-01
  -3.04259504e-01 -1.56025490e-02 -2.84386794e-01 -5.76854203e-02
  -4.99891483e-02 -2.92790906e-02 -5.48577744e-02 -2.81944696e-01
  -2.28648136e-01 -1.54594467e-01 -9.42441107e-03 -4.24431379e-02
  -4.17520014e-02 -3.88515044e-01 -7.79583539e-02 -7.66665958e-02
  -7.73489241e-02 -7.00576