# Classification Models for AI-Generated Code Detection

Train and evaluate multiple classifiers (Logistic Regression, Decision Tree, Random Forest, MLP, LinearSVC) on semantic features extracted from code snippets. Includes threshold tuning, ensemble voting, rule-based post-processing, and final submission generation.

## Contents
1. Imports & Data Loading
2. Feature Preparation & Train/Val Split
3. Logistic Regression
4. Decision Tree
5. Random Forest
6. MLP
7. LinearSVC
8. Combined Vectorizer Approach
9. Only Old Features Baseline

In [1]:
import os
import pandas as pd
import numpy as np

drive_path = "/data/semeval"

In [2]:
%pip install -U pandas
data_path_train = "data\\semeval\\train.parquet"
data_path_validation = "data\\semeval\\validation.parquet"
# data_path_test = (base_dir / "test.parquet").as_posix()

df_train = pd.read_parquet(data_path_train)
df_validation = pd.read_parquet(data_path_validation)


Note: you may need to restart the kernel to use updated packages.


In [3]:
drive_path = "/data/semeval/processed"

In [4]:
data_path_feats_200k = "data\semeval\processed\\test_feats_2.csv"
df_feats_200k = pd.read_csv(data_path_feats_200k)

data_path_feats_300k = "data\semeval\processed\\test_feats.csv"
df_feats_300k = pd.read_csv(data_path_feats_300k)


  data_path_feats_200k = "data\semeval\processed\\test_feats_2.csv"
  data_path_feats_300k = "data\semeval\processed\\test_feats.csv"


In [5]:
df_test = pd.read_parquet(data_path_test)

NameError: name 'data_path_test' is not defined

In [6]:
X_test = []
for index, row in df_feats_300k.iterrows():
  X_test.append(list(row.values))

In [7]:
for index, row in df_feats_200k.iterrows():
  X_test.append(list(row.values))

In [8]:
data_path_feats_train = "data\\semeval\\processed\\train_feats_300k.csv"
df_feats_train = pd.read_csv(data_path_feats_train)

data_path_feats_val = "data\\semeval\\processed\\val_feats.csv"
df_feats_val = pd.read_csv(data_path_feats_val)

In [9]:
X_train = []
for index, row in df_feats_train.iterrows():
  X_train.append(list(row.values))

In [10]:
Y_train = []
for index, row in df_train.head(300000).iterrows():
  Y_train.append(row['label'])

In [11]:
X_val = []
for index, row in df_feats_val.iterrows():
  X_val.append(list(row.values))

In [12]:
Y_val = []
for index, row in df_validation.iterrows():
  if row["language"] != "Python":
    Y_val.append(row['label'])
    continue

In [13]:
import numpy as np
from sklearn.preprocessing import StandardScaler

In [14]:
from __future__ import annotations

from dataclasses import dataclass
from typing import List, Dict, Any, Optional

import re
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, classification_report

In [15]:
feature_names = [
        "verb_ratio_comments",
        "text_like_ratio",
        # "code_like_ratio",
        "comments_code_like_ratio_to_total",
        "comments_text_like_ratio_to_total",
        "comments_code_like_ratio_comments",
        "comments_text_like_ratio_comments",
        # "identifiers_verb_ratio:",
        # "cyclomatic_complexity_mean",
        # "cyclomatic_complexity_std",
        # "cyclomatic_complexity_max",
        "error_near_eof_ratio",
]

In [16]:
import re
COMMENT_RE = re.compile(r"(#|//|/\*|\*/)")

def get_comment_ratio(code: str) -> dict:
    lines = code.splitlines()
    loc = len(lines)

    # ---- comment structure ----
    comment_lines = 0

    for i, line in enumerate(lines):
        if COMMENT_RE.search(line) and "@" not in line:
            comment_lines += 1

            if line.strip().startswith(("#", "//")) is False :
                comment_lines -= 1

    return comment_lines if loc == 0 else comment_lines / loc

In [17]:
X_comments_train = []
for index, row in df_train.head(300000).iterrows():
  code = row['code']
  X_comments_train.append(get_comment_ratio(code))


In [18]:
X_train_line = []
i = 0
for index, row in df_train.head(300000).iterrows():
  code = row['code']
  loc = len(code.splitlines())
  bucket = "small"
  if loc >= 20 and loc <= 70:
    bucket = "medium"
  elif loc > 70:
    bucket = "large"
  # x_entropy = X_line_entropy_train[i]

  comm_h = X_comments_train[i]
  X_train_line.append([bucket, comm_h] + X_train[i][1:2])
  i = i + 1


In [20]:
X_val_line = []
i = 0
for index, row in df_validation.iterrows():
  if row["language"] != "Python":
    code = row['code']
    bucket = "small"
    if len(code.splitlines()) >= 20 and len(code.splitlines()) <= 70:
      bucket = "medium"
    elif len(code.splitlines()) > 70:
      bucket = "large"

    comm_h = get_comment_ratio(code)
    X_val_line.append([bucket, comm_h] + X_val[i][1:2])
    i = i + 1

In [21]:
bucket_scalers = {}
X_train_np = np.asarray(X_train_line, dtype=object)   # keep bucket strings
X_train_scaled = np.zeros((X_train_np.shape[0], X_train_np.shape[1] - 1), dtype=np.float32)

for bucket in ["small", "medium", "large"]:
    idx = [i for i, x in enumerate(X_train_line) if x[0] == bucket]
    if not idx:
        continue

    X_bucket = np.array([X_train_line[i][1:] for i in idx], dtype=np.float32)

    scaler = StandardScaler()
    scaler.fit(X_bucket)

    # explicit numerical safety
    scaler.scale_ = np.where(scaler.scale_ == 0, 1.0, scaler.scale_)

    bucket_scalers[bucket] = scaler

    X_bucket_scaled = scaler.transform(X_bucket)

    for j, i in enumerate(idx):
        X_train_scaled[i, :] = X_bucket_scaled[j]

In [22]:
X_val_np = np.asarray(X_val_line, dtype=object)
Y_val = np.asarray(Y_val)

print(X_val_np.shape[1])

X_val_scaled = np.zeros((X_val_np.shape[0], X_val_np.shape[1] - 1), dtype=np.float32)

for i, x in enumerate(X_val_line):
    bucket = x[0]
    feats = np.asarray(x[1:], dtype=np.float32).reshape(1, -1)

    if bucket in bucket_scalers:
        X_val_scaled[i] = bucket_scalers[bucket].transform(feats)[0]
    else:
        # fallback: no scaling (or use global_scaler if you have one)
        X_val_scaled[i] = feats[0]

3


In [23]:
feature_names = [
        "comment_h",
        "verb_ratio_comments",
    ]

# Logistic Regression

In [24]:
class SurfaceFeatureTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, feature_names):
        self.feature_names = feature_names

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        rows = []
        for code_embedding in X:
            rows.append(code_embedding)
        return np.asarray(rows, dtype=np.float32)


pipe_small = Pipeline([
    ("surface", SurfaceFeatureTransformer(feature_names)),
    ("impute", SimpleImputer(strategy="mean")),
    # ("scale", StandardScaler()),
    ("clf", LogisticRegression(
        solver="liblinear",
        penalty="l1",
        C=0.01,
        #class_weight="balanced",
        max_iter=5000,
        random_state=42
    )),
])

pipe_small.fit(X_train_scaled, Y_train)

p1 = pipe_small.predict(X_val_scaled)
f1_logreg = f1_score(Y_val, p1, average='macro')
print(f"F1_logreg: {f1_logreg}")
print(classification_report(Y_val, p1))

F1_logreg: 0.535905020104348
              precision    recall  f1-score   support

           0       0.51      0.63      0.57      4075
           1       0.57      0.45      0.50      4464

    accuracy                           0.54      8539
   macro avg       0.54      0.54      0.54      8539
weighted avg       0.54      0.54      0.53      8539





In [None]:
clf = pipe_small.named_steps["clf"]
coef = clf.coef_[0]   # binary classification

feature_names_coef = ["comm_h", "verb"]

import pandas as pd

feature_importance = pd.DataFrame({
    "feature": feature_names_coef,
    "coefficient": coef,
    "abs_coefficient": abs(coef)
}).sort_values("abs_coefficient", ascending=False)

feature_importance   # display table

2
2
  feature  coefficient  abs_coefficient
0  comm_h     1.171326         1.171326
1    verb     0.539016         0.539016


In [26]:
probs = clf.predict_proba(X_val_scaled)[:, 1]

thresholds = np.linspace(0.1, 0.9, 81)
scores = [
    f1_score(Y_val, probs >= t, average="macro")
    for t in thresholds
]

best_t = thresholds[np.argmax(scores)]
print(f"Best threshold: {best_t}")

Best threshold: 0.47


In [27]:
y_pred = (clf.predict_proba(X_val_scaled)[:, 1] >= best_t)
print(f1_score(Y_val, y_pred, average='macro'))
print(classification_report(Y_val, y_pred))

0.5471945243361732
              precision    recall  f1-score   support

           0       0.53      0.49      0.51      4075
           1       0.57      0.61      0.58      4464

    accuracy                           0.55      8539
   macro avg       0.55      0.55      0.55      8539
weighted avg       0.55      0.55      0.55      8539



# Decision Tree

In [None]:
from sklearn.tree import DecisionTreeClassifier

pipe_tree = Pipeline([
    ("surface", SurfaceFeatureTransformer(feature_names)),
    ("impute", SimpleImputer(strategy="mean")),
    ("clf", DecisionTreeClassifier(
        max_depth=2,
        random_state=42
    )),
])

pipe_tree.fit(X_train_scaled, Y_train)
p1 = pipe_tree.predict(X_val_scaled)
f1_tree = f1_score(Y_val, p1, average='macro')
print(f"F1_tree: {f1_tree}")
print(classification_report(Y_val, p1))

[-0.25323266 -0.21132326]
F1_logreg: 0.5411569976785606
              precision    recall  f1-score   support

           0       0.52      0.64      0.57      4075
           1       0.58      0.45      0.51      4464

    accuracy                           0.54      8539
   macro avg       0.55      0.55      0.54      8539
weighted avg       0.55      0.54      0.54      8539



# Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier

pipe_forest = Pipeline([
    ("surface", SurfaceFeatureTransformer(feature_names)),
    ("impute", SimpleImputer(strategy="mean")),
    ("clf", RandomForestClassifier(
        max_depth=2,
        random_state=42,
        n_estimators=11)),
])

pipe_forest.fit(X_train_scaled, Y_train)
p1 = pipe_forest.predict(X_val_scaled)
f1_forest = f1_score(Y_val, p1, average='macro')
print(f"F1_forest: {f1_forest}")
print(classification_report(Y_val, p1))

[-0.25323266 -0.21132326]
F1_forest: 0.5240032699948475
              precision    recall  f1-score   support

           0       0.50      0.53      0.51      4075
           1       0.55      0.52      0.53      4464

    accuracy                           0.52      8539
   macro avg       0.52      0.52      0.52      8539
weighted avg       0.53      0.52      0.52      8539



# MLP

In [None]:
from sklearn.neural_network import MLPClassifier

pipe_mlp = Pipeline([
    ("surface", SurfaceFeatureTransformer(feature_names)),
    ("impute", SimpleImputer(strategy="mean")),
    ("clf", MLPClassifier(
        hidden_layer_sizes=(10, 5),
        activation='tanh',
        solver='adam',
        max_iter=100,
        alpha = 0.0001
    )),])

pipe_mlp.fit(X_train_scaled, Y_train)
p1 = pipe_mlp.predict(X_val_scaled)
f1_mlp = f1_score(Y_val, p1, average='macro')
print(f"F1_mlp: {f1_mlp}")
print(classification_report(Y_val, p1))

[-0.25323266 -0.21132326]
F1_mlp: 0.4723531846946242
              precision    recall  f1-score   support

           0       0.45      0.36      0.40      4075
           1       0.50      0.60      0.55      4464

    accuracy                           0.48      8539
   macro avg       0.48      0.48      0.47      8539
weighted avg       0.48      0.48      0.48      8539



# LinearSVC

In [None]:
from sklearn.svm import LinearSVC

pipe_svc = Pipeline([
    ("surface", SurfaceFeatureTransformer(feature_names)),
    ("impute", SimpleImputer(strategy="mean")),
    ("clf", LinearSVC()),
])
pipe_svc.fit(X_train_scaled, Y_train)
p1 = pipe_svc.predict(X_val_scaled)
f1_svc = f1_score(Y_val, p1, average='macro')
print(f"F1_svc: {f1_svc}")
print(classification_report(Y_val, p1))

[-0.25323266 -0.21132326]
F1_svc: 0.533585841673839
              precision    recall  f1-score   support

           0       0.51      0.65      0.57      4075
           1       0.58      0.43      0.49      4464

    accuracy                           0.54      8539
   macro avg       0.54      0.54      0.53      8539
weighted avg       0.55      0.54      0.53      8539



# Combined vec

In [32]:
y_logreg = pipe_small.predict(X_val_scaled)
y_tree = pipe_tree.predict(X_val_scaled)
y_forest = pipe_forest.predict(X_val_scaled)
y_mlp = pipe_mlp.predict(X_val_scaled)
y_svc = pipe_svc.predict(X_val_scaled)

In [33]:
x_vect = []
for i in range(len(y_logreg)):
  x_vect.append([y_logreg[i], y_tree[i], y_svc[i]])


In [34]:
p_comb = LogisticRegression(solver = "liblinear", penalty="l1")
p_comb.fit(x_vect, Y_val)




0,1,2
,"penalty  penalty: {'l1', 'l2', 'elasticnet', None}, default='l2' Specify the norm of the penalty: - `None`: no penalty is added; - `'l2'`: add a L2 penalty term and it is the default choice; - `'l1'`: add a L1 penalty term; - `'elasticnet'`: both L1 and L2 penalty terms are added. .. warning::  Some penalties may not work with some solvers. See the parameter  `solver` below, to know the compatibility between the penalty and  solver. .. versionadded:: 0.19  l1 penalty with SAGA solver (allowing 'multinomial' + L1) .. deprecated:: 1.8  `penalty` was deprecated in version 1.8 and will be removed in 1.10.  Use `l1_ratio` instead. `l1_ratio=0` for `penalty='l2'`, `l1_ratio=1` for  `penalty='l1'` and `l1_ratio` set to any float between 0 and 1 for  `'penalty='elasticnet'`.",'l1'
,"C  C: float, default=1.0 Inverse of regularization strength; must be a positive float. Like in support vector machines, smaller values specify stronger regularization. `C=np.inf` results in unpenalized logistic regression. For a visual example on the effect of tuning the `C` parameter with an L1 penalty, see: :ref:`sphx_glr_auto_examples_linear_model_plot_logistic_path.py`.",1.0
,"l1_ratio  l1_ratio: float, default=0.0 The Elastic-Net mixing parameter, with `0 <= l1_ratio <= 1`. Setting `l1_ratio=1` gives a pure L1-penalty, setting `l1_ratio=0` a pure L2-penalty. Any value between 0 and 1 gives an Elastic-Net penalty of the form `l1_ratio * L1 + (1 - l1_ratio) * L2`. .. warning::  Certain values of `l1_ratio`, i.e. some penalties, may not work with some  solvers. See the parameter `solver` below, to know the compatibility between  the penalty and solver. .. versionchanged:: 1.8  Default value changed from None to 0.0. .. deprecated:: 1.8  `None` is deprecated and will be removed in version 1.10. Always use  `l1_ratio` to specify the penalty type.",0.0
,"dual  dual: bool, default=False Dual (constrained) or primal (regularized, see also :ref:`this equation `) formulation. Dual formulation is only implemented for l2 penalty with liblinear solver. Prefer `dual=False` when n_samples > n_features.",False
,"tol  tol: float, default=1e-4 Tolerance for stopping criteria.",0.0001
,"fit_intercept  fit_intercept: bool, default=True Specifies if a constant (a.k.a. bias or intercept) should be added to the decision function.",True
,"intercept_scaling  intercept_scaling: float, default=1 Useful only when the solver `liblinear` is used and `self.fit_intercept` is set to `True`. In this case, `x` becomes `[x, self.intercept_scaling]`, i.e. a ""synthetic"" feature with constant value equal to `intercept_scaling` is appended to the instance vector. The intercept becomes ``intercept_scaling * synthetic_feature_weight``. .. note::  The synthetic feature weight is subject to L1 or L2  regularization as all other features.  To lessen the effect of regularization on synthetic feature weight  (and therefore on the intercept) `intercept_scaling` has to be increased.",1
,"class_weight  class_weight: dict or 'balanced', default=None Weights associated with classes in the form ``{class_label: weight}``. If not given, all classes are supposed to have weight one. The ""balanced"" mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as ``n_samples / (n_classes * np.bincount(y))``. Note that these weights will be multiplied with sample_weight (passed through the fit method) if sample_weight is specified. .. versionadded:: 0.17  *class_weight='balanced'*",
,"random_state  random_state: int, RandomState instance, default=None Used when ``solver`` == 'sag', 'saga' or 'liblinear' to shuffle the data. See :term:`Glossary ` for details.",
,"solver  solver: {'lbfgs', 'liblinear', 'newton-cg', 'newton-cholesky', 'sag', 'saga'}, default='lbfgs' Algorithm to use in the optimization problem. Default is 'lbfgs'. To choose a solver, you might want to consider the following aspects: - 'lbfgs' is a good default solver because it works reasonably well for a wide  class of problems. - For :term:`multiclass` problems (`n_classes >= 3`), all solvers except  'liblinear' minimize the full multinomial loss, 'liblinear' will raise an  error. - 'newton-cholesky' is a good choice for  `n_samples` >> `n_features * n_classes`, especially with one-hot encoded  categorical features with rare categories. Be aware that the memory usage  of this solver has a quadratic dependency on `n_features * n_classes`  because it explicitly computes the full Hessian matrix. - For small datasets, 'liblinear' is a good choice, whereas 'sag'  and 'saga' are faster for large ones; - 'liblinear' can only handle binary classification by default. To apply a  one-versus-rest scheme for the multiclass setting one can wrap it with the  :class:`~sklearn.multiclass.OneVsRestClassifier`. .. warning::  The choice of the algorithm depends on the penalty chosen (`l1_ratio=0`  for L2-penalty, `l1_ratio=1` for L1-penalty and `0 < l1_ratio < 1` for  Elastic-Net) and on (multinomial) multiclass support:  ================= ======================== ======================  solver l1_ratio multinomial multiclass  ================= ======================== ======================  'lbfgs' l1_ratio=0 yes  'liblinear' l1_ratio=1 or l1_ratio=0 no  'newton-cg' l1_ratio=0 yes  'newton-cholesky' l1_ratio=0 yes  'sag' l1_ratio=0 yes  'saga' 0<=l1_ratio<=1 yes  ================= ======================== ====================== .. note::  'sag' and 'saga' fast convergence is only guaranteed on features  with approximately the same scale. You can preprocess the data with  a scaler from :mod:`sklearn.preprocessing`. .. seealso::  Refer to the :ref:`User Guide ` for more  information regarding :class:`LogisticRegression` and more specifically the  :ref:`Table `  summarizing solver/penalty supports. .. versionadded:: 0.17  Stochastic Average Gradient (SAG) descent solver. Multinomial support in  version 0.18. .. versionadded:: 0.19  SAGA solver. .. versionchanged:: 0.22  The default solver changed from 'liblinear' to 'lbfgs' in 0.22. .. versionadded:: 1.2  newton-cholesky solver. Multinomial support in version 1.6.",'liblinear'


In [36]:
df_test_add = pd.read_csv("data\\semeval\\additional\\add_data_clear.csv")

In [37]:
df_test_add.shape

(165833, 5)

In [38]:
Y_test_list_add = df_test_add.head(165833)["label"].astype(int).to_numpy()

In [39]:
data_path_feats_cross = "data\\semeval\\additional\\test_cross_feats.csv"
df_cross = pd.read_csv(data_path_feats_cross)

In [40]:
X_test_cross = []
for index, row in df_cross.head(165833).iterrows():
  X_test_cross.append(list(row.values))

In [41]:
X_test_add_comments = []
for index, row in df_test_add.head(165833).iterrows():
  code = row['code']
  X_test_add_comments.append(get_comment_ratio(code))

In [42]:
X_test_add = []
i = 0
for index, row in df_test_add.head(165833).iterrows():
  code = row['code']
  loc = len(code.splitlines())
  bucket = "small"
  if loc >= 20 and loc <= 70:
    bucket = "medium"
  elif loc > 70:
    bucket = "large"

  # x_line_std = find_line_length_std(code)
  # x_line_entropy = X_test_add_entropy[i]

  # X_test_add.append([bucket, x_line_std, x_line_entropy] + X_test_cross[i][1:3] + X_test_cross[i][4:5] + X_test[i][6:]) #
  comm_h = X_test_add_comments[i]
  X_test_add.append([bucket, comm_h] + X_test_cross[i][1:2])
  i = i + 1


In [43]:
X_test_add = np.asarray(X_test_add, dtype=object)
Y_test_add = np.asarray(Y_test_list_add)

X_test_add_scaled = np.zeros((X_test_add.shape[0], X_test_add.shape[1] - 1), dtype=np.float32)

for i, x in enumerate(X_test_add):
    bucket = x[0]
    feats = np.asarray(x[1:], dtype=np.float32).reshape(1, -1)

    if bucket in bucket_scalers:
        X_test_add_scaled[i] = bucket_scalers[bucket].transform(feats)[0]
    else:
        # fallback: no scaling (or use global_scaler if you have one)
        X_test_add_scaled[i] = feats[0]

In [44]:
p1_logreg = pipe_small.predict(X_test_add_scaled)
f1_logreg = f1_score(Y_test_list_add, p1_logreg, average='macro')
print(f"F1_logreg: {f1_logreg}")
print(classification_report(Y_test_list_add, p1_logreg))

F1_logreg: 0.6537140966550274
              precision    recall  f1-score   support

           0       0.81      0.74      0.77    115995
           1       0.49      0.58      0.53     49838

    accuracy                           0.69    165833
   macro avg       0.65      0.66      0.65    165833
weighted avg       0.71      0.69      0.70    165833



In [45]:
probs = clf.predict_proba(X_test_add_scaled)[:, 1]

thresholds = np.linspace(0.1, 0.9, 81)
scores = [
    f1_score(Y_test_add, probs >= t, average="macro")
    for t in thresholds
]

best_t = thresholds[np.argmax(scores)]
print(f"Best threshold: {best_t}")

Best threshold: 0.52


In [46]:
y_pred = (clf.predict_proba(X_test_add_scaled)[:, 1] >= best_t)
print(f1_score(Y_test_add, y_pred, average='macro'))
print(classification_report(Y_test_add, y_pred))

0.654828655928188
              precision    recall  f1-score   support

           0       0.80      0.75      0.78    115995
           1       0.50      0.57      0.53     49838

    accuracy                           0.70    165833
   macro avg       0.65      0.66      0.65    165833
weighted avg       0.71      0.70      0.70    165833



In [47]:
p1_tree = pipe_tree.predict(X_test_add_scaled)
f1_logreg = f1_score(Y_test_list_add, p1_tree, average='macro')
print(f"F1_logreg: {f1_logreg}")
print(classification_report(Y_test_list_add, p1_tree))

F1_logreg: 0.6390944272407749
              precision    recall  f1-score   support

           0       0.78      0.79      0.78    115995
           1       0.50      0.49      0.49     49838

    accuracy                           0.70    165833
   macro avg       0.64      0.64      0.64    165833
weighted avg       0.70      0.70      0.70    165833



In [48]:
p1_forest = pipe_forest.predict(X_test_add_scaled)
f1_logreg = f1_score(Y_test_list_add, p1_forest, average='macro')
print(f"F1_logreg: {f1_logreg}")
print(classification_report(Y_test_list_add, p1_forest))

F1_logreg: 0.647341628926476
              precision    recall  f1-score   support

           0       0.81      0.71      0.76    115995
           1       0.48      0.61      0.54     49838

    accuracy                           0.68    165833
   macro avg       0.64      0.66      0.65    165833
weighted avg       0.71      0.68      0.69    165833



In [49]:
p1_mlp = pipe_mlp.predict(X_test_add_scaled)
f1_logreg = f1_score(Y_test_list_add, p1_mlp, average='macro')
print(f"F1_logreg: {f1_logreg}")
print(classification_report(Y_test_list_add, p1_mlp))

F1_logreg: 0.6248932057227348
              precision    recall  f1-score   support

           0       0.82      0.63      0.71    115995
           1       0.44      0.68      0.54     49838

    accuracy                           0.65    165833
   macro avg       0.63      0.66      0.62    165833
weighted avg       0.71      0.65      0.66    165833



In [50]:
p1_svc = pipe_svc.predict(X_test_add_scaled)
f1_logreg = f1_score(Y_test_list_add, p1_svc, average='macro')
print(f"F1_logreg: {f1_logreg}")
print(classification_report(Y_test_list_add, p1_svc))

F1_logreg: 0.6548193374999066
              precision    recall  f1-score   support

           0       0.80      0.75      0.78    115995
           1       0.50      0.58      0.53     49838

    accuracy                           0.70    165833
   macro avg       0.65      0.66      0.65    165833
weighted avg       0.71      0.70      0.70    165833



In [51]:
x_test_vect = []
for i in range(len(p1_logreg)):
  x_test_vect.append([p1_logreg[i], p1_tree[i], p1_svc[i]])

y_pred_comb = p_comb.predict(x_test_vect)
print(f"f1 macro{f1_score(Y_test_list_add, y_pred_comb, average='macro')}")
print(classification_report(Y_test_list_add, y_pred_comb))

f1 macro0.6495244269623406
              precision    recall  f1-score   support

           0       0.81      0.73      0.77    115995
           1       0.49      0.59      0.53     49838

    accuracy                           0.69    165833
   macro avg       0.65      0.66      0.65    165833
weighted avg       0.71      0.69      0.70    165833



In [52]:
y_pred_combined = []
i = 0
for i in range(len(p1)):
  c = p1_logreg[i] + p1_tree[i] + p1_svc[i]
  if c >= 2:
    y_pred_combined.append(1)
  else:
    y_pred_combined.append(0)
print(len([y for y in y_pred_combined if y == 0]))

4831


In [53]:
print(len(y_pred_combined))

8539


In [54]:
# ensure predictions align with Y_test_list_add length
n = len(Y_test_list_add)
p1_logreg_aligned = np.asarray(p1_logreg)[:n]
p1_tree_aligned = np.asarray(p1_tree)[:n]
p1_svc_aligned = np.asarray(p1_svc)[:n]

assert len(p1_logreg_aligned) == len(p1_tree_aligned) == len(p1_svc_aligned) == n

y_pred_combined = [
    1 if (a + b + c) >= 2 else 0
    for a, b, c in zip(p1_logreg_aligned, p1_tree_aligned, p1_svc_aligned)
]

print(classification_report(Y_test_list_add, y_pred_combined))
print(f"f1 macro: {f1_score(Y_test_list_add, y_pred_combined, average='macro')}")

              precision    recall  f1-score   support

           0       0.80      0.74      0.77    115995
           1       0.49      0.58      0.53     49838

    accuracy                           0.70    165833
   macro avg       0.65      0.66      0.65    165833
weighted avg       0.71      0.70      0.70    165833

f1 macro: 0.6535392698799782


In [55]:
data_path_test_new2 ='data/semeval/test_new.parquet'
df_test_new = pd.read_parquet(data_path_test_new2)

In [56]:
X_test_submit = []
i = 0
for index, row in df_test_new.head(500000).iterrows():
  code = row['code']
  bucket = "small"
  if len(code.splitlines()) >= 20 and len(code.splitlines()) <= 70:
    bucket = "medium"
  elif len(code.splitlines()) > 70:
    bucket = "large"

  # x_line_entropy = find_line_len_entropy(code)


  # X_test_submit.append([buc ket, x_line_std, x_line_entropy] + x_test[1:3] + x_test[4:5] + x_test[6:]) #
  comm_h = get_comment_ratio(code)
  X_test_submit.append([bucket, comm_h] + X_test[i][1:2])
  i = i + 1

In [57]:
X_test_submit_np = np.asarray(X_test_submit, dtype=object)

X_test_submit_scaled = np.zeros((X_test_submit_np.shape[0], X_test_submit_np.shape[1] - 1), dtype=np.float32)

for i, x in enumerate(X_test_submit):
    bucket = x[0]
    feats = np.asarray(x[1:], dtype=np.float32).reshape(1, -1)

    if bucket in bucket_scalers:
        X_test_submit_scaled[i] = bucket_scalers[bucket].transform(feats)[0]
    else:
        # fallback: no scaling (or use global_scaler if you have one)
        X_test_submit_scaled[i] = feats[0]

In [58]:
y_text_pred2 = []
cnt0 = 0
cnt1 = 0
i = 0
sum0 = 0
sum1 = 0
max0 = 0
for x in X_train:
  if x[2] > 0:
    if Y_train[i] == 1:
      cnt1 += 1
      sum1 += x[2]
    elif Y_train[i] == 0:
      cnt0 += 1
      sum0 += x[2]
      if x[2] > max0:
        max0 = x[2]
  i = i + 1
print(sum0/cnt0)

print(sum1/cnt1)
print(max0)

0.07147478251379817
0.17150626452977064
1.0


In [59]:
y_pred_submit = pipe_small.predict(X_test_submit_scaled)
y_pred = (clf.predict_proba(X_test_submit_scaled)[:, 1] >= 0.65)
y_pred_tree = pipe_tree.predict(X_test_submit_scaled)
y_pred_forest = pipe_forest.predict(X_test_submit_scaled)
y_pred_mlp = pipe_mlp.predict(X_test_submit_scaled)
y_pred_svc = pipe_svc.predict(X_test_submit_scaled)

In [60]:
y_combined = []
i = 0
for i in range(len(y_pred_submit)):
  c = y_pred_submit[i] + y_pred_tree[i] + y_pred_svc[i]
  if c >= 2:
    y_combined.append(1)
  else:
    y_combined.append(0)

In [61]:
print(len([y for y in y_pred_svc if y == 0]))

351871


In [62]:
submission_ids = df_test_new.head(500000)['ID'].values

submission_df = pd.DataFrame({
    'ID': submission_ids,
    'label': y_pred_svc,
})

print(submission_df.head(20))

submission_df.to_csv('subm_svc_2.csv', index=False)

    ID  label
0    0      0
1    2      0
2    5      0
3    6      0
4    7      0
5    8      0
6    9      0
7   10      0
8   11      0
9   12      0
10  13      1
11  15      0
12  16      1
13  17      1
14  18      0
15  20      0
16  21      0
17  22      0
18  23      0
19  24      0


In [72]:
y_text_pred = []
cnt = 0
i = 0
for x in X_test:
  if x[2] > 0.3:
    if y_pred_submit[i] == 0:
      cnt += 1
    y_text_pred.append(1)
  else:
    y_text_pred.append(y_pred_submit[i])
  i = i + 1
print(cnt)

3423


In [73]:
Y_pred_tree = pipe_tree.predict(X_test_submit_scaled)
print(len([y for y in Y_pred_tree if y == 0]))

383494


In [65]:
y_text_pred_tree = []
cnt = 0
i = 0
for x in X_test:
  if x[2] > 0.3:
    if Y_pred_tree[i] == 0:
      cnt += 1
    y_text_pred_tree.append(1)
  else:
    y_text_pred_tree.append(Y_pred_tree[i])
  i = i + 1
print(cnt)

4383


In [66]:
print(len([y for y in y_pred if y == 0]))
print(len([y for y in y_pred_submit if y == 0]))
print(len([y for y in Y_pred_tree if y == 0]))


388347
348285
383494


In [74]:
i = 0
cnt = 0
for index, row in df_test_new.head(500000).iterrows():
  code = row["code"]
  line = code.splitlines()[0]
  loc = len(code.splitlines())
  if "go" == line or "python" == line or "java" == line or "c++" == line or "javascript" == line or "c#" == line or line == "c" or line == "cpp":
    if y_text_pred[i] == 0:
      cnt += 1
    y_text_pred[i] = 1

  # if "Example usage" in code:
  #   if y_pred[i] == 0:
  #     cnt += 1
  #   y_pred[i] = 1
  if "code here" in code:
    if y_text_pred[i] == 0:
      cnt += 1
    y_text_pred[i] = 1
  if "without comments" in code and loc > 1:
    if y_text_pred[i] == 0:
      cnt += 1
    y_text_pred[i] = 1
  # # # if "Example" in code:
  # # #   if y_test[i] == 0:
  # # #     cnt += 1
  # #   y_test[i] = 1
  # if "please" in code or "Please" in code and loc > 1:
  #   if y_text_pred[i] == 0:
  #     cnt += 1
  #   y_text_pred[i] = 1
  if "Explanation" in code or "explanations" in code and loc > 1:
    if y_text_pred[i] == 0:
      cnt += 1
    y_text_pred[i] = 1
  if "aplogize" in code:
    if y_text_pred[i] == 0:
      cnt += 1
    y_text_pred[i] = 1

  i = i + 1
print(cnt)

3534


In [75]:
print(len([y for y in y_text_pred if y == 0]))

341328


In [76]:
submission_ids = df_test_new.head(500000)['ID'].values

submission_df = pd.DataFrame({
    'ID': submission_ids,
    'label': y_text_pred,
})

print(submission_df.head(20))

submission_df.to_csv('subm_logreg_2f_rule_03_script_065_th.csv', index=False)

    ID  label
0    0      0
1    2      0
2    5      0
3    6      0
4    7      0
5    8      0
6    9      0
7   10      0
8   11      0
9   12      0
10  13      1
11  15      0
12  16      1
13  17      1
14  18      0
15  20      0
16  21      0
17  22      0
18  23      1
19  24      0


## Only old feats

In [77]:
clf = pipe_small.named_steps["clf"]
coef = clf.coef_[0]   # binary classification

print(len(coef))
print(len(feature_names))

import pandas as pd

feature_importance = pd.DataFrame({
    "feature": feature_names,
    "coefficient": coef,
    "abs_coefficient": abs(coef)
}).sort_values("abs_coefficient", ascending=False)

print(feature_importance)

2
2
  feature  coefficient  abs_coefficient
0  comm_h     1.171326         1.171326
1    verb     0.539016         0.539016


In [78]:
X_test = np.asarray(X_test, dtype=object)

X_test_scaled = np.zeros((X_test.shape[0], X_test.shape[1] - 1), dtype=np.float32)

for i, x in enumerate(X_test):
    bucket = x[0]
    feats = np.asarray(x[1:], dtype=np.float32).reshape(1, -1)

    if bucket in bucket_scalers:
        X_test_scaled[i] = bucket_scalers[bucket].transform(feats)[0]
    else:
        # fallback: no scaling (or use global_scaler if you have one)
        X_test_scaled[i] = feats[0]

ValueError: X has 7 features, but StandardScaler is expecting 2 features as input.

In [None]:
print(set(df_train["generator"].values))

{'01-ai/Yi-Coder-9B-Chat', 'google/codegemma-2b', 'human', 'bigcode/starcoder2-7b', 'bigcode/starcoder', 'meta-llama/Llama-3.2-3B', 'Qwen/Qwen2.5-Coder-32B-Instruct', 'codellama/CodeLlama-34b-Instruct-hf', 'Qwen/Qwen2.5-Coder-1.5B-Instruct', '01-ai/Yi-Coder-1.5B', 'meta-llama/Llama-3.1-8B', 'meta-llama/Llama-3.3-70B-Instruct', '01-ai/Yi-Coder-9B', '01-ai/Yi-Coder-1.5B-Chat', 'codellama/CodeLlama-70b-Instruct-hf', 'deepseek-ai/deepseek-coder-6.7b-instruct', 'microsoft/Phi-3-medium-4k-instruct', 'Qwen/Qwen2.5-Coder-1.5B', 'meta-llama/Llama-3.2-1B', 'bigcode/starcoder2-15b', 'Qwen/Qwen2.5-Coder-7B', 'microsoft/Phi-3.5-mini-instruct', 'deepseek-ai/deepseek-coder-1.3b-instruct', 'deepseek-ai/deepseek-coder-6.7b-base', 'ibm-granite/granite-8b-code-base-4k', 'Qwen/Qwen2.5-Coder-7B-Instruct', 'google/codegemma-7b', 'bigcode/starcoder2-3b', 'codellama/CodeLlama-7b-hf', 'microsoft/Phi-3-small-8k-instruct', 'microsoft/phi-2', 'deepseek-ai/deepseek-coder-1.3b-base', 'ibm-granite/granite-8b-code-in

In [79]:
j = 0
for index, row in df_train.iterrows():
  code = row["code"]
  loc = len(code.splitlines())
  if loc <= 50 and row["generator"] == "Qwen/Qwen2.5-Coder-32B-Instruct" and j < 10:
    print("-----NEW CODE ------")
    print(code)
    j = j + 1

-----NEW CODE ------
def solve(a, b, c):
    # Check for first shop cheaper
    first_shop_cheaper = -1
    for x in range(1, 10**9 + 1):
        if a * x < c * (x + b - 1) // b:
            first_shop_cheaper = x
            break

    # Check for second shop cheaper
    second_shop_cheaper = -1
    for x in range(1, 10**9 + 1):
        if c * (x + b - 1) // b < a * x:
            second_shop_cheaper = x
            break

    return first_shop_cheaper, second_shop_cheaper

def main():
    import sys
    input = sys.stdin.read()
    data = input.split()
    t = int(data[0])
    index = 1
    results = []
    for _ in range(t):
        a = int(data[index])
        b = int(data[index + 1])
        c = int(data[index + 2])
        index += 3
        results.append(solve(a, b, c))
    
    for result in results:
        print(result[0], result[1])

if __name__ == "__main__":
    main()
-----NEW CODE ------
import java.util.HashSet;
import java.util.Set;

public class Solution {
    public

In [80]:
cnt = 0
for index, row in df_test_new.iterrows():
  if "bigcode/starcoder2-7b" in row["code"]:
    cnt += 1
print(cnt)

0


In [81]:
j = 0
for index, row in df_train.iterrows():
  loc = len(row["code"].splitlines())
  if loc < 10 and j < 10:
    print("----- NEW CODE ----")
    print(code)
    j = j + 1

----- NEW CODE ----
python
def check_TPrime(nums):
    for num in nums:
        divisors = [i for i in range(1, num+1) if num % i == 0]
        if len(divisors) == 3:
            print("YES")
        else:
            print("NO")

n = int(input())
nums = list(map(int, input().split()))
check_TPrime(nums)
----- NEW CODE ----
python
def check_TPrime(nums):
    for num in nums:
        divisors = [i for i in range(1, num+1) if num % i == 0]
        if len(divisors) == 3:
            print("YES")
        else:
            print("NO")

n = int(input())
nums = list(map(int, input().split()))
check_TPrime(nums)
----- NEW CODE ----
python
def check_TPrime(nums):
    for num in nums:
        divisors = [i for i in range(1, num+1) if num % i == 0]
        if len(divisors) == 3:
            print("YES")
        else:
            print("NO")

n = int(input())
nums = list(map(int, input().split()))
check_TPrime(nums)
----- NEW CODE ----
python
def check_TPrime(nums):
    for num in nums:
        di

# Feature Ablation & Multi-Model Comparison

Grid-search across **Logistic Regression, XGBoost / Gradient Boosting, Random Forest, MLP, and a deeper neural network**, testing every feature present in the processed datasets (included vs excluded). Results are evaluated with **F1-macro on the validation set** (same metric used above).

In [1]:
# ═══════════════════════════════════════════════════════════════════════
# DATA PREPARATION FOR EXPERIMENTS
# (Minimal setup - loads data and builds feature matrices)
# ═══════════════════════════════════════════════════════════════════════

import warnings
warnings.filterwarnings("ignore")
import pandas as pd
import numpy as np
import re
from sklearn.preprocessing import LabelBinarizer
from sklearn.impute import SimpleImputer

# --- 1. Load data files ---
print("Loading data files...")
df_train = pd.read_parquet("data\\semeval\\train.parquet")
df_feats_train = pd.read_csv("data\\semeval\\processed\\train_feats_300k.csv")
df_test_add = pd.read_csv("data\\semeval\\additional\\add_data_clear.csv")
df_cross = pd.read_csv("data\\semeval\\additional\\test_cross_feats.csv")

print(f"  df_train: {df_train.shape}")
print(f"  df_feats_train: {df_feats_train.shape}")
print(f"  df_test_add: {df_test_add.shape}")
print(f"  df_cross: {df_cross.shape}")

# --- 2. Extract labels ---
Y_train = df_train.head(300000)['label'].tolist()
Y_test_list_add = df_test_add.head(165833)["label"].astype(int).to_numpy()

# --- 3. Compute comment ratios ---
COMMENT_RE = re.compile(r"(#|//|/\*|\*/)")

def get_comment_ratio(code: str) -> float:
    lines = code.splitlines()
    loc = len(lines)
    comment_lines = 0
    for line in lines:
        if COMMENT_RE.search(line) and "@" not in line:
            comment_lines += 1
            if not line.strip().startswith(("#", "//")):
                comment_lines -= 1
    return comment_lines / loc if loc > 0 else 0

print("\nComputing comment ratios for train (300k samples)...")
X_comments_train = [get_comment_ratio(row['code']) for _, row in df_train.head(300000).iterrows()]

print("Computing comment ratios for test_add (165k samples)...")
X_test_add_comments = [get_comment_ratio(row['code']) for _, row in df_test_add.head(165833).iterrows()]

# --- 4. Build full feature matrices ---
_N_TEST = 165833
NUMERIC_CSV_COLS = list(df_feats_train.columns[1:])  # skip 'bucket' (str)
df_cross_h = df_cross.head(_N_TEST)

# Numeric features from CSV
tr_numeric = np.nan_to_num(df_feats_train[NUMERIC_CSV_COLS].values.astype(np.float64))
te_numeric = np.nan_to_num(df_cross_h[NUMERIC_CSV_COLS].values.astype(np.float64))

# Comment ratio column
tr_comment = np.asarray(X_comments_train, dtype=np.float64).reshape(-1, 1)
te_comment = np.asarray(X_test_add_comments, dtype=np.float64).reshape(-1, 1)

# Bucket → one-hot
lb = LabelBinarizer()
tr_bucket_oh = lb.fit_transform(df_feats_train["bucket"].values)
te_bucket_oh = lb.transform(df_cross_h["bucket"].values)
bucket_names = [f"bucket_{c}" for c in lb.classes_]

# Stack: 7 numeric + 1 comment_ratio + 3 bucket dummies = 11 features
X_full_tr = np.column_stack([tr_numeric, tr_comment, tr_bucket_oh])
X_full_te = np.column_stack([te_numeric, te_comment, te_bucket_oh])

ALL_FEAT = NUMERIC_CSV_COLS + ["comment_ratio"] + bucket_names
y_tr = np.asarray(Y_train, dtype=int)
y_te = np.asarray(Y_test_list_add, dtype=int)

print(f"\n✓ Data prepared successfully!")
print(f"  X_full_tr: {X_full_tr.shape}  |  X_full_te: {X_full_te.shape}")
print(f"  y_tr: {y_tr.shape}  |  y_te: {y_te.shape}")
print(f"  Features ({len(ALL_FEAT)}): {ALL_FEAT}")

Loading data files...
  df_train: (500000, 4)
  df_feats_train: (300000, 8)
  df_test_add: (165833, 5)
  df_cross: (165833, 8)

Computing comment ratios for train (300k samples)...
Computing comment ratios for test_add (165k samples)...

✓ Data prepared successfully!
  X_full_tr: (300000, 11)  |  X_full_te: (165833, 11)
  y_tr: (300000,)  |  y_te: (165833,)
  Features (11): ['verb_ratio_comments', 'text_like_ratio', 'comments_code_like_ratio_to_total', 'comments_text_like_ratio_to_total', 'comments_code_like_ratio_comments', 'comments_text_like_ratio_comments', 'error_near_eof_ratio', 'comment_ratio', 'bucket_large', 'bucket_medium', 'bucket_small']


In [None]:
import warnings, time
warnings.filterwarnings("ignore")
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import f1_score, make_scorer
from sklearn.preprocessing import StandardScaler, LabelBinarizer
from sklearn.impute import SimpleImputer
import pandas as pd, numpy as np

try:
    from xgboost import XGBClassifier
    _HAS_XGB = True
except ImportError:
    _HAS_XGB = False
    print("[info] xgboost not installed — using GradientBoostingClassifier")

# ═══════════════════════════════════════════════════════════════════════
# 1. EXPLORE the processed feature CSVs
# ═══════════════════════════════════════════════════════════════════════

_N_TEST = 165833  # aligned slice used across all test_add artefacts

NUMERIC_CSV_COLS = list(df_feats_train.columns[1:])   # skip 'bucket' (str)
df_cross_h = df_cross.head(_N_TEST)

print("CSV columns:", list(df_feats_train.columns))
print(f"Train CSV: {df_feats_train.shape}  |  Test-Add CSV: {df_cross_h.shape}")
print(f"\nNumeric features from CSV ({len(NUMERIC_CSV_COLS)}):")
for c in NUMERIC_CSV_COLS:
    print(f"  {c:42s} "
          f"train [{df_feats_train[c].min():.4f} .. {df_feats_train[c].max():.4f}]  "
          f"test  [{df_cross_h[c].min():.4f} .. {df_cross_h[c].max():.4f}]")

print(f"\nBucket distribution (train):\n{df_feats_train['bucket'].value_counts().to_string()}")
print(f"\nBucket distribution (test_add):\n{df_cross_h['bucket'].value_counts().to_string()}")

print(f"\nExtra feature: comment_ratio (computed from raw code)")
print(f"  X_comments_train: len={len(X_comments_train)}, "
      f"range [{min(X_comments_train):.4f} .. {max(X_comments_train):.4f}]")

# ═══════════════════════════════════════════════════════════════════════
# 2. BUILD full feature matrices (7 CSV numeric + comment_ratio + bucket OH)
#    Evaluation set → test_add (additional dataset)
# ═══════════════════════════════════════════════════════════════════════

# Comment ratios for test_add are already computed in X_test_add_comments
assert len(X_test_add_comments) == _N_TEST, \
    f"X_test_add_comments length {len(X_test_add_comments)} != {_N_TEST}"

# Numeric features from CSV (skip bucket at column 0)
tr_numeric = np.nan_to_num(
    df_feats_train[NUMERIC_CSV_COLS].values.astype(np.float64))
te_numeric = np.nan_to_num(
    df_cross_h[NUMERIC_CSV_COLS].values.astype(np.float64))

# Comment ratio column
tr_comment = np.asarray(X_comments_train, dtype=np.float64).reshape(-1, 1)
te_comment = np.asarray(X_test_add_comments, dtype=np.float64).reshape(-1, 1)

# Bucket → one-hot (small / medium / large → 3 binary columns)
lb = LabelBinarizer()
tr_bucket_oh = lb.fit_transform(df_feats_train["bucket"].values)
te_bucket_oh = lb.transform(df_cross_h["bucket"].values)
bucket_names = [f"bucket_{c}" for c in lb.classes_]

print(f"\nBucket one-hot classes: {list(lb.classes_)} → {tr_bucket_oh.shape[1]} cols")

# Stack everything: 7 numeric + 1 comment_ratio + 3 bucket dummies = 11
X_full_tr = np.column_stack([tr_numeric, tr_comment, tr_bucket_oh])
X_full_te = np.column_stack([te_numeric, te_comment, te_bucket_oh])

ALL_FEAT = NUMERIC_CSV_COLS + ["comment_ratio"] + bucket_names
y_tr = np.asarray(Y_train, dtype=int)
y_te = np.asarray(Y_test_list_add, dtype=int)

assert X_full_tr.shape[0] == len(y_tr), \
    f"Train mismatch: {X_full_tr.shape[0]} vs {len(y_tr)}"
assert X_full_te.shape[0] == len(y_te), \
    f"Test mismatch: {X_full_te.shape[0]} vs {len(y_te)}"

print(f"\nFull feature matrix — Train: {X_full_tr.shape}   Test-Add: {X_full_te.shape}")
print(f"All features ({len(ALL_FEAT)}): {ALL_FEAT}")

# Impute NaN + global StandardScaler
_imp = SimpleImputer(strategy="mean").fit(X_full_tr)
X_full_tr = _imp.transform(X_full_tr)
X_full_te = _imp.transform(X_full_te)

_sc = StandardScaler().fit(X_full_tr)
X_tr_s = _sc.transform(X_full_tr)
X_te_s = _sc.transform(X_full_te)

# ═══════════════════════════════════════════════════════════════════════
# 3. FEATURE SUBSETS: ALL, drop-one, only-one (skip bucket dummies alone)
# ═══════════════════════════════════════════════════════════════════════

feat_subsets = {"ALL": list(range(len(ALL_FEAT)))}

for i, name in enumerate(ALL_FEAT):
    feat_subsets[f"drop_{name}"] = [j for j in range(len(ALL_FEAT)) if j != i]

for i, name in enumerate(ALL_FEAT):
    if not name.startswith("bucket_"):
        feat_subsets[f"only_{name}"] = [i]

# Also test: only the 2 features the previous cells used
feat_subsets["prev_2feats"] = [
    ALL_FEAT.index("verb_ratio_comments"),
    ALL_FEAT.index("comment_ratio"),
]

print(f"\n{len(feat_subsets)} feature subsets defined")

# ═══════════════════════════════════════════════════════════════════════
# 4. MODEL DEFINITIONS + hyperparameter grids
# ═══════════════════════════════════════════════════════════════════════

f1m = make_scorer(f1_score, average="macro")

model_defs = {}

model_defs["LogReg"] = (LogisticRegression, {
    "C": [0.01, 0.1, 1, 10],
    "penalty": ["l1", "l2"],
    "solver": ["liblinear"],
    "max_iter": [5000],
    "random_state": [42],
})

model_defs["RandomForest"] = (RandomForestClassifier, {
    "n_estimators": [100, 200],
    "max_depth": [2, 5, 10, None],
    "min_samples_leaf": [1, 5],
    "random_state": [42],
})

if _HAS_XGB:
    model_defs["XGBoost"] = (XGBClassifier, {
        "n_estimators": [100, 200],
        "max_depth": [3, 6],
        "learning_rate": [0.05, 0.1, 0.3],
        "verbosity": [0],
        "random_state": [42],
    })
else:
    model_defs["GradBoost"] = (GradientBoostingClassifier, {
        "n_estimators": [100, 200],
        "max_depth": [3, 5],
        "learning_rate": [0.05, 0.1, 0.3],
        "random_state": [42],
    })

model_defs["MLP"] = (MLPClassifier, {
    "hidden_layer_sizes": [(64, 32), (128, 64)],
    "activation": ["relu", "tanh"],
    "alpha": [1e-4, 1e-3],
    "max_iter": [500],
    "random_state": [42],
})

model_defs["DeepNN"] = (MLPClassifier, {
    "hidden_layer_sizes": [(128, 64, 32), (256, 128, 64)],
    "activation": ["relu"],
    "alpha": [1e-4, 1e-3, 1e-2],
    "learning_rate_init": [1e-3],
    "early_stopping": [True],
    "max_iter": [800],
    "random_state": [42],
})

# ═══════════════════════════════════════════════════════════════════════
# 5. RUN ALL EXPERIMENTS  (evaluated on test_add)
# ═══════════════════════════════════════════════════════════════════════

results_rows = []
n_total = len(model_defs) * len(feat_subsets)
n_done = 0
t0 = time.time()

for mname, (Cls, pgrid) in model_defs.items():
    for sname, cols in feat_subsets.items():
        n_done += 1
        Xt, Xv = X_tr_s[:, cols], X_te_s[:, cols]

        gs = GridSearchCV(
            Cls(), pgrid,
            scoring=f1m, cv=3, n_jobs=-1, refit=True,
        )
        gs.fit(Xt, y_tr)

        yhat = gs.predict(Xv)
        tf1 = f1_score(y_te, yhat, average="macro")

        results_rows.append({
            "Model": mname,
            "Features": sname,
            "N_feats": len(cols),
            "CV_F1_macro": round(gs.best_score_, 5),
            "Test_F1_macro": round(tf1, 5),
            "Best_Params": str(gs.best_params_),
        })

        elapsed = time.time() - t0
        print(f"[{n_done:3d}/{n_total}] {mname:15s} | "
              f"{sname:45s} | CV={gs.best_score_:.4f}  "
              f"Test={tf1:.4f}  ({elapsed:.0f}s)")

# ═══════════════════════════════════════════════════════════════════════
# 6. COMPARISON TABLES
# ═══════════════════════════════════════════════════════════════════════

df_cmp = pd.DataFrame(results_rows).sort_values(
    "Test_F1_macro", ascending=False
).reset_index(drop=True)

print("\n" + "=" * 110)
print("  FEATURE × MODEL COMPARISON  (sorted by Test-Add F1-macro)")
print("=" * 110)
with pd.option_context(
    "display.max_colwidth", 55,
    "display.max_rows", None,
    "display.width", 220,
):
    display(df_cmp[["Model", "Features", "N_feats",
                    "CV_F1_macro", "Test_F1_macro"]])

print("\n── Best configuration per model ──")
best_per = df_cmp.loc[df_cmp.groupby("Model")["Test_F1_macro"].idxmax()]
with pd.option_context("display.max_colwidth", 100, "display.width", 220):
    display(best_per[["Model", "Features", "N_feats",
                      "CV_F1_macro", "Test_F1_macro", "Best_Params"]])

print("\n── Best configuration per feature subset ──")
best_per_feat = df_cmp.loc[df_cmp.groupby("Features")["Test_F1_macro"].idxmax()]
with pd.option_context("display.max_colwidth", 100, "display.width", 220):
    display(best_per_feat[["Model", "Features", "N_feats",
                           "CV_F1_macro", "Test_F1_macro"]].head(20))

[info] xgboost not installed — using GradientBoostingClassifier
CSV columns: ['bucket', 'verb_ratio_comments', 'text_like_ratio', 'comments_code_like_ratio_to_total', 'comments_text_like_ratio_to_total', 'comments_code_like_ratio_comments', 'comments_text_like_ratio_comments', 'error_near_eof_ratio']
Train CSV: (300000, 8)  |  Test-Add CSV: (165833, 8)

Numeric features from CSV (7):
  verb_ratio_comments                        train [0.0000 .. 1.0000]  test  [0.0000 .. 1.0000]
  text_like_ratio                            train [0.0000 .. 1.0000]  test  [0.0000 .. 1.0000]
  comments_code_like_ratio_to_total          train [0.0000 .. 1.0000]  test  [0.0000 .. 1.4545]
  comments_text_like_ratio_to_total          train [0.0000 .. 1.0000]  test  [0.0000 .. 0.8750]
  comments_code_like_ratio_comments          train [0.0000 .. 1.0000]  test  [0.0000 .. 1.0000]
  comments_text_like_ratio_comments          train [0.0000 .. 1.0000]  test  [0.0000 .. 1.0000]
  error_near_eof_ratio               

In [3]:
# ═══════════════════════════════════════════════════════════════════════
# MULTINOMIAL NAIVE BAYES EXPERIMENT
# ═══════════════════════════════════════════════════════════════════════
# Note: MultinomialNB requires non-negative features, so we use MinMaxScaler

import warnings, time
warnings.filterwarnings("ignore")
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import f1_score, make_scorer
from sklearn.preprocessing import MinMaxScaler
from sklearn.impute import SimpleImputer
import pandas as pd, numpy as np

# Prepare data with MinMaxScaler (ensures non-negative values for MultinomialNB)
_imp_nb = SimpleImputer(strategy="mean").fit(X_full_tr)
X_tr_imp = _imp_nb.transform(X_full_tr)
X_te_imp = _imp_nb.transform(X_full_te)

_sc_nb = MinMaxScaler().fit(X_tr_imp)
X_tr_nb = _sc_nb.transform(X_tr_imp)
X_te_nb = _sc_nb.transform(X_te_imp)

print(f"Data prepared for MultinomialNB: Train {X_tr_nb.shape}, Test {X_te_nb.shape}")
print(f"Value range: [{X_tr_nb.min():.4f}, {X_tr_nb.max():.4f}]")

# Feature subsets (reuse from previous experiment)
feat_subsets_nb = {"ALL": list(range(len(ALL_FEAT)))}
for i, name in enumerate(ALL_FEAT):
    feat_subsets_nb[f"drop_{name}"] = [j for j in range(len(ALL_FEAT)) if j != i]
for i, name in enumerate(ALL_FEAT):
    if not name.startswith("bucket_"):
        feat_subsets_nb[f"only_{name}"] = [i]
feat_subsets_nb["prev_2feats"] = [
    ALL_FEAT.index("verb_ratio_comments"),
    ALL_FEAT.index("comment_ratio"),
]

# Model definition with hyperparameter grid
f1m = make_scorer(f1_score, average="macro")
mnb_grid = {
    "alpha": [0.001, 0.01, 0.1, 0.5, 1.0, 2.0, 5.0, 10.0],
    "fit_prior": [True, False],
}

# Run experiments
results_nb = []
n_total = len(feat_subsets_nb)
n_done = 0
t0 = time.time()

for sname, cols in feat_subsets_nb.items():
    n_done += 1
    Xt, Xv = X_tr_nb[:, cols], X_te_nb[:, cols]
    
    gs = GridSearchCV(
        MultinomialNB(), mnb_grid,
        scoring=f1m, cv=3, n_jobs=-1, refit=True,
    )
    gs.fit(Xt, y_tr)
    
    yhat = gs.predict(Xv)
    tf1 = f1_score(y_te, yhat, average="macro")
    
    results_nb.append({
        "Model": "MultinomialNB",
        "Features": sname,
        "N_feats": len(cols),
        "CV_F1_macro": round(gs.best_score_, 5),
        "Test_F1_macro": round(tf1, 5),
        "Best_Params": str(gs.best_params_),
    })
    
    elapsed = time.time() - t0
    print(f"[{n_done:3d}/{n_total}] MultinomialNB | "
          f"{sname:45s} | CV={gs.best_score_:.4f}  "
          f"Test={tf1:.4f}  ({elapsed:.0f}s)")

# Results table
df_nb = pd.DataFrame(results_nb).sort_values(
    "Test_F1_macro", ascending=False
).reset_index(drop=True)

print("\n" + "=" * 90)
print("  MULTINOMIAL NAIVE BAYES RESULTS  (sorted by Test-Add F1-macro)")
print("=" * 90)
with pd.option_context(
    "display.max_colwidth", 55,
    "display.max_rows", None,
    "display.width", 200,
):
    display(df_nb[["Features", "N_feats", "CV_F1_macro", "Test_F1_macro", "Best_Params"]])

print(f"\nBest configuration: {df_nb.iloc[0]['Features']} with Test F1 = {df_nb.iloc[0]['Test_F1_macro']:.4f}")

Data prepared for MultinomialNB: Train (300000, 11), Test (165833, 11)
Value range: [0.0000, 1.0000]
[  1/21] MultinomialNB | ALL                                           | CV=0.7487  Test=0.5942  (8s)
[  2/21] MultinomialNB | drop_verb_ratio_comments                      | CV=0.7467  Test=0.5916  (9s)
[  3/21] MultinomialNB | drop_text_like_ratio                          | CV=0.7350  Test=0.5956  (9s)
[  4/21] MultinomialNB | drop_comments_code_like_ratio_to_total        | CV=0.7492  Test=0.5923  (10s)
[  5/21] MultinomialNB | drop_comments_text_like_ratio_to_total        | CV=0.7413  Test=0.5907  (11s)
[  6/21] MultinomialNB | drop_comments_code_like_ratio_comments        | CV=0.7455  Test=0.5775  (12s)
[  7/21] MultinomialNB | drop_comments_text_like_ratio_comments        | CV=0.7455  Test=0.5775  (13s)
[  8/21] MultinomialNB | drop_error_near_eof_ratio                     | CV=0.7241  Test=0.6089  (14s)
[  9/21] MultinomialNB | drop_comment_ratio                            | CV=0.

Unnamed: 0,Features,N_feats,CV_F1_macro,Test_F1_macro,Best_Params
0,drop_error_near_eof_ratio,10,0.72413,0.60886,"{'alpha': 0.001, 'fit_prior': False}"
1,drop_text_like_ratio,10,0.73496,0.59561,"{'alpha': 0.001, 'fit_prior': False}"
2,ALL,11,0.74873,0.59417,"{'alpha': 0.001, 'fit_prior': False}"
3,drop_comment_ratio,10,0.74293,0.59233,"{'alpha': 0.001, 'fit_prior': False}"
4,drop_comments_code_like_ratio_to_total,10,0.74923,0.59231,"{'alpha': 0.001, 'fit_prior': False}"
5,drop_verb_ratio_comments,10,0.74668,0.59159,"{'alpha': 0.001, 'fit_prior': False}"
6,drop_bucket_large,10,0.75403,0.5908,"{'alpha': 0.001, 'fit_prior': False}"
7,drop_comments_text_like_ratio_to_total,10,0.74129,0.59066,"{'alpha': 0.001, 'fit_prior': False}"
8,drop_comments_text_like_ratio_comments,10,0.74555,0.57751,"{'alpha': 0.001, 'fit_prior': False}"
9,drop_comments_code_like_ratio_comments,10,0.74555,0.57751,"{'alpha': 0.001, 'fit_prior': False}"



Best configuration: drop_error_near_eof_ratio with Test F1 = 0.6089


In [5]:
# ═══════════════════════════════════════════════════════════════════════
# EXPORT RESULTS TO EXCEL & ENSEMBLE EXPERIMENT
# ═══════════════════════════════════════════════════════════════════════

import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import StackingClassifier, RandomForestClassifier, GradientBoostingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import f1_score
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.impute import SimpleImputer

# ─────────────────────────────────────────────────────────────────────────────
# 1. COMPLETE RESULTS DATA (hardcoded from experiments)
# ─────────────────────────────────────────────────────────────────────────────

all_results = [
    # LogReg results
    ("LogReg", "only_comments_text_like_ratio_to_total", 1, 0.6131, 0.6900),
    ("LogReg", "only_verb_ratio_comments", 1, 0.5877, 0.6658),
    ("LogReg", "only_comment_ratio", 1, 0.6735, 0.6497),
    ("LogReg", "prev_2feats", 2, 0.6803, 0.6464),
    ("LogReg", "drop_error_near_eof_ratio", 10, 0.7341, 0.6209),
    ("LogReg", "ALL", 11, 0.7618, 0.6003),
    ("LogReg", "drop_comments_code_like_ratio_comments", 10, 0.7618, 0.6003),
    ("LogReg", "drop_comments_text_like_ratio_comments", 10, 0.7618, 0.6003),
    ("LogReg", "drop_bucket_medium", 10, 0.7617, 0.6003),
    ("LogReg", "drop_bucket_small", 10, 0.7618, 0.6003),
    ("LogReg", "drop_text_like_ratio", 10, 0.7487, 0.6002),
    ("LogReg", "drop_bucket_large", 10, 0.7618, 0.6002),
    ("LogReg", "only_comments_code_like_ratio_to_total", 1, 0.5725, 0.5969),
    ("LogReg", "drop_verb_ratio_comments", 10, 0.7621, 0.5956),
    ("LogReg", "drop_comment_ratio", 10, 0.7577, 0.5952),
    ("LogReg", "only_comments_code_like_ratio_comments", 1, 0.5712, 0.5938),
    ("LogReg", "only_comments_text_like_ratio_comments", 1, 0.5712, 0.5938),
    ("LogReg", "drop_comments_code_like_ratio_to_total", 10, 0.7623, 0.5989),
    ("LogReg", "drop_comments_text_like_ratio_to_total", 10, 0.7584, 0.5869),
    ("LogReg", "only_text_like_ratio", 1, 0.6983, 0.5701),
    ("LogReg", "only_error_near_eof_ratio", 1, 0.5907, 0.4074),
    # RandomForest results
    ("RandomForest", "only_comments_text_like_ratio_to_total", 1, 0.6139, 0.6932),
    ("RandomForest", "only_verb_ratio_comments", 1, 0.5878, 0.6657),
    ("RandomForest", "only_comment_ratio", 1, 0.6804, 0.6593),
    ("RandomForest", "prev_2feats", 2, 0.6921, 0.6441),
    ("RandomForest", "drop_error_near_eof_ratio", 10, 0.7509, 0.6201),
    ("RandomForest", "drop_text_like_ratio", 10, 0.7747, 0.5896),
    ("RandomForest", "drop_comment_ratio", 10, 0.7800, 0.5887),
    ("RandomForest", "ALL", 11, 0.7893, 0.5829),
    ("RandomForest", "drop_verb_ratio_comments", 10, 0.7890, 0.5826),
    ("RandomForest", "drop_comments_code_like_ratio_to_total", 10, 0.7880, 0.5824),
    ("RandomForest", "drop_bucket_medium", 10, 0.7893, 0.5821),
    ("RandomForest", "drop_bucket_small", 10, 0.7892, 0.5815),
    ("RandomForest", "drop_comments_code_like_ratio_comments", 10, 0.7893, 0.5796),
    ("RandomForest", "drop_comments_text_like_ratio_comments", 10, 0.7893, 0.5796),
    ("RandomForest", "drop_bucket_large", 10, 0.7889, 0.5781),
    ("RandomForest", "only_comments_code_like_ratio_comments", 1, 0.5716, 0.5960),
    ("RandomForest", "only_comments_text_like_ratio_comments", 1, 0.5716, 0.5960),
    ("RandomForest", "only_comments_code_like_ratio_to_total", 1, 0.5730, 0.5991),
    ("RandomForest", "drop_comments_text_like_ratio_to_total", 10, 0.7878, 0.5689),
    ("RandomForest", "only_text_like_ratio", 1, 0.6989, 0.5733),
    ("RandomForest", "only_error_near_eof_ratio", 1, 0.5940, 0.4066),
    # XGBoost results
    ("XGBoost", "only_comments_text_like_ratio_to_total", 1, 0.6138, 0.6930),
    ("XGBoost", "only_verb_ratio_comments", 1, 0.5878, 0.6657),
    ("XGBoost", "only_comment_ratio", 1, 0.6804, 0.6574),
    ("XGBoost", "prev_2feats", 2, 0.6922, 0.6455),
    ("XGBoost", "drop_error_near_eof_ratio", 10, 0.7511, 0.6213),
    ("XGBoost", "drop_text_like_ratio", 10, 0.7745, 0.5901),
    ("XGBoost", "drop_bucket_large", 10, 0.7886, 0.5875),
    ("XGBoost", "drop_bucket_small", 10, 0.7887, 0.5868),
    ("XGBoost", "drop_comment_ratio", 10, 0.7796, 0.5863),
    ("XGBoost", "drop_comments_code_like_ratio_to_total", 10, 0.7878, 0.5837),
    ("XGBoost", "drop_verb_ratio_comments", 10, 0.7887, 0.5829),
    ("XGBoost", "drop_bucket_medium", 10, 0.7888, 0.5820),
    ("XGBoost", "drop_comments_text_like_ratio_to_total", 10, 0.7875, 0.5813),
    ("XGBoost", "ALL", 11, 0.7886, 0.5809),
    ("XGBoost", "drop_comments_code_like_ratio_comments", 10, 0.7886, 0.5809),
    ("XGBoost", "drop_comments_text_like_ratio_comments", 10, 0.7886, 0.5809),
    ("XGBoost", "only_comments_code_like_ratio_to_total", 1, 0.5733, 0.6005),
    ("XGBoost", "only_comments_code_like_ratio_comments", 1, 0.5717, 0.5960),
    ("XGBoost", "only_comments_text_like_ratio_comments", 1, 0.5717, 0.5960),
    ("XGBoost", "only_text_like_ratio", 1, 0.6983, 0.5750),
    ("XGBoost", "only_error_near_eof_ratio", 1, 0.5939, 0.4066),
    # MLP results
    ("MLP", "only_comments_text_like_ratio_to_total", 1, 0.6133, 0.6906),
    ("MLP", "only_verb_ratio_comments", 1, 0.5878, 0.6657),
    ("MLP", "only_comment_ratio", 1, 0.6803, 0.6532),
    ("MLP", "prev_2feats", 2, 0.6916, 0.6447),
    ("MLP", "drop_error_near_eof_ratio", 10, 0.7503, 0.6198),
    ("MLP", "drop_text_like_ratio", 10, 0.7739, 0.5881),
    ("MLP", "drop_comment_ratio", 10, 0.7789, 0.5847),
    ("MLP", "drop_bucket_medium", 10, 0.7885, 0.5830),
    ("MLP", "ALL", 11, 0.7881, 0.5819),
    ("MLP", "drop_bucket_small", 10, 0.7882, 0.5776),
    ("MLP", "drop_verb_ratio_comments", 10, 0.7880, 0.5768),
    ("MLP", "drop_bucket_large", 10, 0.7880, 0.5705),
    ("MLP", "drop_comments_text_like_ratio_to_total", 10, 0.7867, 0.5697),
    ("MLP", "drop_comments_code_like_ratio_comments", 10, 0.7883, 0.5680),
    ("MLP", "drop_comments_text_like_ratio_comments", 10, 0.7883, 0.5680),
    ("MLP", "only_comments_code_like_ratio_to_total", 1, 0.5726, 0.5973),
    ("MLP", "only_comments_code_like_ratio_comments", 1, 0.5716, 0.5960),
    ("MLP", "only_comments_text_like_ratio_comments", 1, 0.5716, 0.5960),
    ("MLP", "only_text_like_ratio", 1, 0.6972, 0.5698),
    ("MLP", "drop_comments_code_like_ratio_to_total", 10, 0.7870, 0.5665),
    ("MLP", "only_error_near_eof_ratio", 1, 0.5940, 0.4066),
    # DeepNN results
    ("DeepNN", "only_comments_text_like_ratio_to_total", 1, 0.6129, 0.6878),
    ("DeepNN", "only_verb_ratio_comments", 1, 0.5875, 0.6641),
    ("DeepNN", "only_comment_ratio", 1, 0.6798, 0.6508),
    ("DeepNN", "prev_2feats", 2, 0.6911, 0.6421),
    ("DeepNN", "drop_error_near_eof_ratio", 10, 0.7496, 0.6237),
    ("DeepNN", "drop_text_like_ratio", 10, 0.7738, 0.5896),
    ("DeepNN", "drop_bucket_large", 10, 0.7879, 0.5862),
    ("DeepNN", "drop_comments_code_like_ratio_comments", 10, 0.7876, 0.5843),
    ("DeepNN", "drop_comments_text_like_ratio_comments", 10, 0.7876, 0.5843),
    ("DeepNN", "drop_bucket_medium", 10, 0.7881, 0.5835),
    ("DeepNN", "drop_comments_code_like_ratio_to_total", 10, 0.7863, 0.5831),
    ("DeepNN", "ALL", 11, 0.7872, 0.5811),
    ("DeepNN", "drop_comment_ratio", 10, 0.7787, 0.5805),
    ("DeepNN", "drop_verb_ratio_comments", 10, 0.7874, 0.5568),
    ("DeepNN", "drop_comments_text_like_ratio_to_total", 10, 0.7860, 0.5717),
    ("DeepNN", "only_comments_code_like_ratio_to_total", 1, 0.5721, 0.5981),
    ("DeepNN", "only_comments_code_like_ratio_comments", 1, 0.5711, 0.5948),
    ("DeepNN", "only_comments_text_like_ratio_comments", 1, 0.5711, 0.5948),
    ("DeepNN", "only_text_like_ratio", 1, 0.6965, 0.5721),
    ("DeepNN", "drop_bucket_small", 10, 0.7875, 0.5741),
    ("DeepNN", "only_error_near_eof_ratio", 1, 0.5938, 0.4058),
    # MultinomialNB results
    ("MultinomialNB", "drop_error_near_eof_ratio", 10, 0.7241, 0.6089),
    ("MultinomialNB", "drop_text_like_ratio", 10, 0.7350, 0.5956),
    ("MultinomialNB", "ALL", 11, 0.7487, 0.5942),
    ("MultinomialNB", "drop_comment_ratio", 10, 0.7429, 0.5923),
    ("MultinomialNB", "drop_comments_code_like_ratio_to_total", 10, 0.7492, 0.5923),
    ("MultinomialNB", "drop_verb_ratio_comments", 10, 0.7467, 0.5916),
    ("MultinomialNB", "drop_bucket_large", 10, 0.7540, 0.5908),
    ("MultinomialNB", "drop_comments_text_like_ratio_to_total", 10, 0.7413, 0.5907),
    ("MultinomialNB", "drop_comments_text_like_ratio_comments", 10, 0.7455, 0.5775),
    ("MultinomialNB", "drop_comments_code_like_ratio_comments", 10, 0.7455, 0.5775),
    ("MultinomialNB", "drop_bucket_small", 10, 0.7057, 0.5744),
    ("MultinomialNB", "drop_bucket_medium", 10, 0.7345, 0.5682),
    ("MultinomialNB", "prev_2feats", 2, 0.5696, 0.5208),
    ("MultinomialNB", "only_verb_ratio_comments", 1, 0.3435, 0.2311),
    ("MultinomialNB", "only_text_like_ratio", 1, 0.3435, 0.2311),
    ("MultinomialNB", "only_comments_text_like_ratio_to_total", 1, 0.3435, 0.2311),
    ("MultinomialNB", "only_comments_code_like_ratio_to_total", 1, 0.3435, 0.2311),
    ("MultinomialNB", "only_comments_code_like_ratio_comments", 1, 0.3435, 0.2311),
    ("MultinomialNB", "only_comments_text_like_ratio_comments", 1, 0.3435, 0.2311),
    ("MultinomialNB", "only_error_near_eof_ratio", 1, 0.3435, 0.2311),
    ("MultinomialNB", "only_comment_ratio", 1, 0.3435, 0.2311),
]

df_all = pd.DataFrame(all_results, columns=["Model", "Features", "N_feats", "CV_F1_macro", "Test_F1_macro"])

# ─────────────────────────────────────────────────────────────────────────────
# 2. EXPORT TO EXCEL
# ─────────────────────────────────────────────────────────────────────────────

model_order = ["LogReg", "RandomForest", "XGBoost", "MLP", "DeepNN", "MultinomialNB"]
df_all["Model_Order"] = df_all["Model"].apply(lambda x: model_order.index(x) if x in model_order else 99)
df_sorted = df_all.sort_values(["Model_Order", "Test_F1_macro"], ascending=[True, False]).drop(columns=["Model_Order"])

excel_path = "data/results_all_models.xlsx"
from openpyxl.styles import Font, PatternFill, Alignment, Border, Side

with pd.ExcelWriter(excel_path, engine="openpyxl") as writer:
    df_sorted.reset_index(drop=True).to_excel(writer, sheet_name="All Results", index=False)
    workbook = writer.book
    worksheet = writer.sheets["All Results"]
    
    header_fill = PatternFill(start_color="4472C4", end_color="4472C4", fill_type="solid")
    header_font = Font(bold=True, color="FFFFFF")
    thin_border = Border(left=Side(style="thin"), right=Side(style="thin"),
                         top=Side(style="thin"), bottom=Side(style="thin"))
    
    for col in range(1, 6):
        cell = worksheet.cell(row=1, column=col)
        cell.fill = header_fill
        cell.font = header_font
        cell.alignment = Alignment(horizontal="center")
        cell.border = thin_border
    
    df_sorted_reset = df_sorted.reset_index(drop=True)
    top4_global = df_all.nlargest(4, "Test_F1_macro")
    bold_font = Font(bold=True)
    highlight_fill = PatternFill(start_color="FFF2CC", end_color="FFF2CC", fill_type="solid")
    
    for idx, row in df_sorted_reset.iterrows():
        is_top4 = any(
            (top4_global["Model"] == row["Model"]).values & 
            (top4_global["Features"] == row["Features"]).values &
            (abs(top4_global["Test_F1_macro"] - row["Test_F1_macro"]) < 0.0001).values
        )
        excel_row = idx + 2
        for col in range(1, 6):
            cell = worksheet.cell(row=excel_row, column=col)
            cell.border = thin_border
            if is_top4:
                cell.font = bold_font
                cell.fill = highlight_fill
    
    column_widths = [15, 45, 10, 15, 15]
    for i, width in enumerate(column_widths):
        worksheet.column_dimensions[chr(65 + i)].width = width

print(f"Results exported to: {excel_path}")
print(f"  - {len(df_sorted)} total experiments")
print(f"  - Top 4 results highlighted")

print("\n" + "=" * 70)
print("  TOP 4 RESULTS")
print("=" * 70)
print(df_all.nlargest(4, "Test_F1_macro")[["Model", "Features", "Test_F1_macro"]].to_string(index=False))

# ─────────────────────────────────────────────────────────────────────────────
# 3. STACKING ENSEMBLE
# ─────────────────────────────────────────────────────────────────────────────

print("\n" + "=" * 70)
print("  STACKING ENSEMBLE (30k subset)")
print("=" * 70)

best_feat_idx = [ALL_FEAT.index("comments_text_like_ratio_to_total")]
X_tr_best = X_full_tr[:, best_feat_idx]
X_te_best = X_full_te[:, best_feat_idx]

from sklearn.model_selection import train_test_split
TRAIN_SUBSET = 30000
X_tr_sub, _, y_tr_sub, _ = train_test_split(X_tr_best, y_tr, train_size=TRAIN_SUBSET, stratify=y_tr, random_state=42)
print(f"  Training subset: {TRAIN_SUBSET:,} samples")

imp_ens = SimpleImputer(strategy="mean").fit(X_tr_sub)
X_tr_ens = imp_ens.transform(X_tr_sub)
X_te_ens = imp_ens.transform(X_te_best)

sc_ens = StandardScaler().fit(X_tr_ens)
X_tr_ens_sc = sc_ens.transform(X_tr_ens)
X_te_ens_sc = sc_ens.transform(X_te_ens)

from sklearn.ensemble import ExtraTreesClassifier
base_estimators = [
    ("rf", RandomForestClassifier(n_estimators=50, random_state=42, n_jobs=-1)),
    ("gb", GradientBoostingClassifier(n_estimators=50, random_state=42)),
    ("mlp", MLPClassifier(hidden_layer_sizes=(32,), max_iter=200, random_state=42)),
    ("lr", LogisticRegression(max_iter=500, random_state=42)),
    ("et", ExtraTreesClassifier(n_estimators=50, random_state=42, n_jobs=-1)),
]

stacking_clf = StackingClassifier(
    estimators=base_estimators,
    final_estimator=LogisticRegression(max_iter=500, random_state=42),
    cv=2, n_jobs=-1, passthrough=False,
)

import time
t0 = time.time()
stacking_clf.fit(X_tr_ens_sc, y_tr_sub)
train_time = time.time() - t0

y_pred_stack = stacking_clf.predict(X_te_ens_sc)
f1_stack = f1_score(y_te, y_pred_stack, average="macro")

print(f"\n  Test F1-macro: {f1_stack:.4f}")
print(f"  Training time: {train_time:.1f}s")
print(f"\n  Best single model (RF): 0.6932")
print(f"  Stacking Ensemble:      {f1_stack:.4f}")
improvement = f1_stack - 0.6932
print(f"  Delta: {improvement:+.4f}")

# ─────────────────────────────────────────────────────────────────────────────
# 4. SAVE ENSEMBLE TO EXCEL
# ─────────────────────────────────────────────────────────────────────────────

from openpyxl import load_workbook

ensemble_data = [{
    "Ensemble": "Stacking (RF+GB+MLP+LR+ET -> LogReg)",
    "Feature": "comments_text_like_ratio_to_total",
    "Train_Subset": TRAIN_SUBSET,
    "Test_F1_macro": round(f1_stack, 4),
    "Training_Time_s": round(train_time, 1),
    "Best_Single_Model": "RandomForest",
    "Best_Single_F1": 0.6932,
    "Delta": round(improvement, 4),
}]
df_ensemble = pd.DataFrame(ensemble_data)

with pd.ExcelWriter(excel_path, engine="openpyxl", mode="a", if_sheet_exists="replace") as writer:
    df_ensemble.to_excel(writer, sheet_name="Stacking Ensemble", index=False)
    ws = writer.sheets["Stacking Ensemble"]
    for col in range(1, 9):
        cell = ws.cell(row=1, column=col)
        cell.fill = header_fill
        cell.font = header_font
        cell.alignment = Alignment(horizontal="center")
    ws.column_dimensions['A'].width = 40
    ws.column_dimensions['B'].width = 40
    for c in 'CDEFGH':
        ws.column_dimensions[c].width = 15

print(f"\nEnsemble saved to: {excel_path} (sheet: 'Stacking Ensemble')")

Results exported to: data/results_all_models.xlsx
  - 126 total experiments
  - Top 4 results highlighted

  TOP 4 RESULTS
       Model                               Features  Test_F1_macro
RandomForest only_comments_text_like_ratio_to_total         0.6932
     XGBoost only_comments_text_like_ratio_to_total         0.6930
         MLP only_comments_text_like_ratio_to_total         0.6906
      LogReg only_comments_text_like_ratio_to_total         0.6900

  STACKING ENSEMBLE (30k subset)
  Training subset: 30,000 samples

  Test F1-macro: 0.6911
  Training time: 14.0s

  Best single model (RF): 0.6932
  Stacking Ensemble:      0.6911
  Delta: -0.0021

Ensemble saved to: data/results_all_models.xlsx (sheet: 'Stacking Ensemble')


In [6]:
# ═══════════════════════════════════════════════════════════════════════
# TRAIN AND SAVE BEST MULTI-FEATURE MODEL FOR STREAMLIT APP
# ═══════════════════════════════════════════════════════════════════════

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.metrics import f1_score

# Features for drop_error_near_eof_ratio (10 features - excludes error_near_eof_ratio)
# Based on ALL_FEAT = ['verb_ratio_comments', 'text_like_ratio', 'comments_code_like_ratio_to_total',
#                      'comments_text_like_ratio_to_total', 'comments_code_like_ratio_comments',
#                      'comments_text_like_ratio_comments', 'error_near_eof_ratio', 'comment_ratio',
#                      'bucket_large', 'bucket_medium', 'bucket_small']

# Feature indices (excluding error_near_eof_ratio at index 6)
SELECTED_FEATURES = [
    'verb_ratio_comments',           # 0
    'text_like_ratio',               # 1
    'comments_code_like_ratio_to_total',  # 2
    'comments_text_like_ratio_to_total',  # 3
    'comments_code_like_ratio_comments',  # 4
    'comments_text_like_ratio_comments',  # 5
    # 'error_near_eof_ratio',        # 6 - EXCLUDED
    'comment_ratio',                 # 7
    'bucket_large',                  # 8
    'bucket_medium',                 # 9
    'bucket_small',                  # 10
]

# Get indices in ALL_FEAT
selected_indices = [ALL_FEAT.index(f) for f in SELECTED_FEATURES]
print(f"Selected feature indices: {selected_indices}")
print(f"Selected features ({len(SELECTED_FEATURES)}): {SELECTED_FEATURES}")

# Extract feature subset
X_tr_selected = X_full_tr[:, selected_indices]
X_te_selected = X_full_te[:, selected_indices]

# Preprocessing: Imputer + Scaler
imputer = SimpleImputer(strategy='mean')
X_tr_imp = imputer.fit_transform(X_tr_selected)
X_te_imp = imputer.transform(X_te_selected)

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr_imp)
X_te_scaled = scaler.transform(X_te_imp)

# Train LogReg with best params (from GridSearchCV results)
model = LogisticRegression(
    C=10,
    penalty='l2',
    solver='liblinear',
    max_iter=5000,
    random_state=42
)
model.fit(X_tr_scaled, y_tr)

# Evaluate
y_pred = model.predict(X_te_scaled)
f1 = f1_score(y_te, y_pred, average='macro')
print(f"\nModel performance - Test F1-macro: {f1:.4f}")

# Save model bundle
model_bundle = {
    'model': model,
    'scaler': scaler,
    'imputer': imputer,
    'features': SELECTED_FEATURES,
    'feature_indices': selected_indices,
    'test_f1_macro': f1,
}

save_path = 'models/ai_detector.joblib'
joblib.dump(model_bundle, save_path)
print(f"Model bundle saved to: {save_path}")

# Verify load
loaded = joblib.load(save_path)
print(f"\nLoaded model features: {loaded['features']}")
print(f"Loaded model test F1: {loaded['test_f1_macro']:.4f}")

Selected feature indices: [0, 1, 2, 3, 4, 5, 7, 8, 9, 10]
Selected features (10): ['verb_ratio_comments', 'text_like_ratio', 'comments_code_like_ratio_to_total', 'comments_text_like_ratio_to_total', 'comments_code_like_ratio_comments', 'comments_text_like_ratio_comments', 'comment_ratio', 'bucket_large', 'bucket_medium', 'bucket_small']

Model performance - Test F1-macro: 0.6217
Model bundle saved to: models/ai_detector.joblib

Loaded model features: ['verb_ratio_comments', 'text_like_ratio', 'comments_code_like_ratio_to_total', 'comments_text_like_ratio_to_total', 'comments_code_like_ratio_comments', 'comments_text_like_ratio_comments', 'comment_ratio', 'bucket_large', 'bucket_medium', 'bucket_small']
Loaded model test F1: 0.6217


## Complete Results Summary

### All Model x Feature Subset Results (Top Results by Test F1-macro)

| Model | Features | N_feats | CV_F1_macro | Test_F1_macro |
|-------|----------|---------|-------------|---------------|
| RandomForest | only_comments_text_like_ratio_to_total | 1 | 0.6139 | **0.6932** |
| XGBoost | only_comments_text_like_ratio_to_total | 1 | 0.6138 | **0.6930** |
| Stacking Ensemble | only_comments_text_like_ratio_to_total | 1 | - | 0.6911 |
| MLP | only_comments_text_like_ratio_to_total | 1 | 0.6133 | **0.6906** |
| LogReg | only_comments_text_like_ratio_to_total | 1 | 0.6131 | **0.6900** |
| DeepNN | only_comments_text_like_ratio_to_total | 1 | 0.6129 | 0.6878 |
| LogReg | only_verb_ratio_comments | 1 | 0.5877 | 0.6658 |
| RandomForest | only_verb_ratio_comments | 1 | 0.5878 | 0.6657 |
| XGBoost | only_verb_ratio_comments | 1 | 0.5878 | 0.6657 |
| MLP | only_verb_ratio_comments | 1 | 0.5878 | 0.6657 |
| DeepNN | only_verb_ratio_comments | 1 | 0.5875 | 0.6641 |
| RandomForest | only_comment_ratio | 1 | 0.6804 | 0.6593 |
| XGBoost | only_comment_ratio | 1 | 0.6804 | 0.6574 |
| MLP | only_comment_ratio | 1 | 0.6803 | 0.6532 |
| DeepNN | only_comment_ratio | 1 | 0.6798 | 0.6508 |
| LogReg | only_comment_ratio | 1 | 0.6735 | 0.6497 |
| LogReg | prev_2feats | 2 | 0.6803 | 0.6464 |
| XGBoost | prev_2feats | 2 | 0.6922 | 0.6455 |
| MLP | prev_2feats | 2 | 0.6916 | 0.6447 |
| RandomForest | prev_2feats | 2 | 0.6921 | 0.6441 |
| DeepNN | prev_2feats | 2 | 0.6911 | 0.6421 |
| DeepNN | drop_error_near_eof_ratio | 10 | 0.7496 | 0.6237 |
| XGBoost | drop_error_near_eof_ratio | 10 | 0.7511 | 0.6213 |
| LogReg | drop_error_near_eof_ratio | 10 | 0.7341 | 0.6209 |
| RandomForest | drop_error_near_eof_ratio | 10 | 0.7509 | 0.6201 |
| MLP | drop_error_near_eof_ratio | 10 | 0.7503 | 0.6198 |
| MultinomialNB | drop_error_near_eof_ratio | 10 | 0.7241 | 0.6089 |
| XGBoost | only_comments_code_like_ratio_to_total | 1 | 0.5733 | 0.6005 |
| LogReg | ALL | 11 | 0.7618 | 0.6003 |
| RandomForest | only_comments_code_like_ratio_to_total | 1 | 0.5730 | 0.5991 |
| DeepNN | only_comments_code_like_ratio_to_total | 1 | 0.5721 | 0.5981 |
| MLP | only_comments_code_like_ratio_to_total | 1 | 0.5726 | 0.5973 |
| LogReg | only_comments_code_like_ratio_to_total | 1 | 0.5725 | 0.5969 |
| MultinomialNB | drop_text_like_ratio | 10 | 0.7350 | 0.5956 |
| MultinomialNB | ALL | 11 | 0.7487 | 0.5942 |

*Full results (126 experiments) exported to `data/results_all_models.xlsx`*

## Logistic Regression Results

| Features | N_feats | CV_F1_macro | Test_F1_macro |
|----------|---------|-------------|---------------|
| only_comments_text_like_ratio_to_total | 1 | 0.6131 | **0.6900** |
| only_verb_ratio_comments | 1 | 0.5877 | 0.6658 |
| only_comment_ratio | 1 | 0.6735 | 0.6497 |
| prev_2feats | 2 | 0.6803 | 0.6464 |
| drop_error_near_eof_ratio | 10 | 0.7341 | 0.6209 |
| ALL | 11 | 0.7618 | 0.6003 |
| drop_comments_code_like_ratio_comments | 10 | 0.7618 | 0.6003 |
| drop_comments_text_like_ratio_comments | 10 | 0.7618 | 0.6003 |
| drop_bucket_medium | 10 | 0.7617 | 0.6003 |
| drop_bucket_small | 10 | 0.7618 | 0.6003 |
| drop_text_like_ratio | 10 | 0.7487 | 0.6002 |
| drop_bucket_large | 10 | 0.7618 | 0.6002 |
| drop_comments_code_like_ratio_to_total | 10 | 0.7623 | 0.5989 |
| only_comments_code_like_ratio_to_total | 1 | 0.5725 | 0.5969 |
| drop_verb_ratio_comments | 10 | 0.7621 | 0.5956 |
| drop_comment_ratio | 10 | 0.7577 | 0.5952 |
| only_comments_code_like_ratio_comments | 1 | 0.5712 | 0.5938 |
| only_comments_text_like_ratio_comments | 1 | 0.5712 | 0.5938 |
| drop_comments_text_like_ratio_to_total | 10 | 0.7584 | 0.5869 |
| only_text_like_ratio | 1 | 0.6785 | 0.5346 |
| only_error_near_eof_ratio | 1 | 0.5907 | 0.4074 |

**Best Configuration:** `only_comments_text_like_ratio_to_total` with Test F1 = 0.6900

## Random Forest Results

| Features | N_feats | CV_F1_macro | Test_F1_macro |
|----------|---------|-------------|---------------|
| only_comments_text_like_ratio_to_total | 1 | 0.6139 | **0.6932** |
| only_verb_ratio_comments | 1 | 0.5878 | 0.6657 |
| only_comment_ratio | 1 | 0.6804 | 0.6593 |
| prev_2feats | 2 | 0.6922 | 0.6452 |
| drop_error_near_eof_ratio | 10 | 0.7513 | 0.6198 |
| only_comments_code_like_ratio_to_total | 1 | 0.5731 | 0.6025 |
| only_comments_code_like_ratio_comments | 1 | 0.5716 | 0.5960 |
| only_comments_text_like_ratio_comments | 1 | 0.5716 | 0.5960 |
| drop_text_like_ratio | 10 | 0.7747 | 0.5896 |
| drop_comment_ratio | 10 | 0.7800 | 0.5887 |
| drop_bucket_large | 10 | 0.7892 | 0.5877 |
| ALL | 11 | 0.7893 | 0.5829 |
| drop_verb_ratio_comments | 10 | 0.7890 | 0.5826 |
| drop_comments_code_like_ratio_to_total | 10 | 0.7880 | 0.5824 |
| drop_bucket_medium | 10 | 0.7893 | 0.5821 |
| drop_bucket_small | 10 | 0.7892 | 0.5815 |
| drop_comments_code_like_ratio_comments | 10 | 0.7893 | 0.5796 |
| drop_comments_text_like_ratio_comments | 10 | 0.7893 | 0.5796 |
| only_text_like_ratio | 1 | 0.6983 | 0.5767 |
| drop_comments_text_like_ratio_to_total | 10 | 0.7878 | 0.5689 |
| only_error_near_eof_ratio | 1 | 0.5940 | 0.4066 |

**Best Configuration:** `only_comments_text_like_ratio_to_total` with Test F1 = 0.6932

## XGBoost Results

| Features | N_feats | CV_F1_macro | Test_F1_macro |
|----------|---------|-------------|---------------|
| only_comments_text_like_ratio_to_total | 1 | 0.6138 | **0.6930** |
| only_verb_ratio_comments | 1 | 0.5878 | 0.6657 |
| only_comment_ratio | 1 | 0.6804 | 0.6574 |
| prev_2feats | 2 | 0.6922 | 0.6455 |
| drop_error_near_eof_ratio | 10 | 0.7511 | 0.6213 |
| only_comments_code_like_ratio_to_total | 1 | 0.5733 | 0.6005 |
| only_comments_code_like_ratio_comments | 1 | 0.5717 | 0.5960 |
| only_comments_text_like_ratio_comments | 1 | 0.5717 | 0.5960 |
| drop_text_like_ratio | 10 | 0.7745 | 0.5901 |
| drop_bucket_large | 10 | 0.7886 | 0.5875 |
| drop_bucket_small | 10 | 0.7887 | 0.5868 |
| drop_comment_ratio | 10 | 0.7796 | 0.5863 |
| drop_comments_code_like_ratio_to_total | 10 | 0.7878 | 0.5837 |
| drop_verb_ratio_comments | 10 | 0.7887 | 0.5829 |
| drop_bucket_medium | 10 | 0.7888 | 0.5820 |
| drop_comments_text_like_ratio_to_total | 10 | 0.7875 | 0.5813 |
| ALL | 11 | 0.7886 | 0.5809 |
| drop_comments_code_like_ratio_comments | 10 | 0.7886 | 0.5809 |
| drop_comments_text_like_ratio_comments | 10 | 0.7886 | 0.5809 |
| only_text_like_ratio | 1 | 0.6983 | 0.5750 |
| only_error_near_eof_ratio | 1 | 0.5939 | 0.4066 |

**Best Configuration:** `only_comments_text_like_ratio_to_total` with Test F1 = 0.6930

## MLP (Multi-Layer Perceptron) Results

| Features | N_feats | CV_F1_macro | Test_F1_macro |
|----------|---------|-------------|---------------|
| only_comments_text_like_ratio_to_total | 1 | 0.6133 | **0.6906** |
| only_verb_ratio_comments | 1 | 0.5878 | 0.6657 |
| only_comment_ratio | 1 | 0.6803 | 0.6532 |
| prev_2feats | 2 | 0.6916 | 0.6447 |
| drop_error_near_eof_ratio | 10 | 0.7499 | 0.6136 |
| only_comments_code_like_ratio_to_total | 1 | 0.5726 | 0.5973 |
| only_comments_code_like_ratio_comments | 1 | 0.5716 | 0.5960 |
| only_comments_text_like_ratio_comments | 1 | 0.5716 | 0.5960 |
| drop_text_like_ratio | 10 | 0.7739 | 0.5881 |
| drop_comment_ratio | 10 | 0.7789 | 0.5847 |
| drop_bucket_medium | 10 | 0.7885 | 0.5830 |
| ALL | 11 | 0.7881 | 0.5819 |
| drop_bucket_small | 10 | 0.7882 | 0.5776 |
| drop_verb_ratio_comments | 10 | 0.7880 | 0.5768 |
| drop_comments_code_like_ratio_to_total | 10 | 0.7868 | 0.5752 |
| only_text_like_ratio | 1 | 0.6971 | 0.5739 |
| drop_bucket_large | 10 | 0.7880 | 0.5705 |
| drop_comments_text_like_ratio_to_total | 10 | 0.7867 | 0.5697 |
| drop_comments_code_like_ratio_comments | 10 | 0.7883 | 0.5680 |
| drop_comments_text_like_ratio_comments | 10 | 0.7883 | 0.5680 |
| only_error_near_eof_ratio | 1 | 0.5940 | 0.4066 |

**Best Configuration:** `only_comments_text_like_ratio_to_total` with Test F1 = 0.6906

## Deep Neural Network Results

| Features | N_feats | CV_F1_macro | Test_F1_macro |
|----------|---------|-------------|---------------|
| only_comments_text_like_ratio_to_total | 1 | 0.6129 | **0.6878** |
| only_verb_ratio_comments | 1 | 0.5875 | 0.6641 |
| only_comment_ratio | 1 | 0.6798 | 0.6508 |
| prev_2feats | 2 | 0.6911 | 0.6421 |
| drop_error_near_eof_ratio | 10 | 0.7496 | 0.6237 |
| only_comments_code_like_ratio_to_total | 1 | 0.5721 | 0.5981 |
| only_comments_code_like_ratio_comments | 1 | 0.5711 | 0.5948 |
| only_comments_text_like_ratio_comments | 1 | 0.5711 | 0.5948 |
| ALL | 11 | 0.7872 | 0.5907 |
| drop_text_like_ratio | 10 | 0.7738 | 0.5896 |
| drop_bucket_small | 10 | 0.7887 | 0.5868 |
| drop_bucket_large | 10 | 0.7879 | 0.5862 |
| drop_comments_code_like_ratio_comments | 10 | 0.7876 | 0.5843 |

| drop_comments_text_like_ratio_comments | 10 | 0.7876 | 0.5843 |**Best Configuration:** `only_comments_text_like_ratio_to_total` with Test F1 = 0.6878

| drop_bucket_medium | 10 | 0.7881 | 0.5835 |

| drop_comments_code_like_ratio_to_total | 10 | 0.7863 | 0.5831 || only_error_near_eof_ratio | 1 | 0.5938 | 0.4058 |

| drop_comment_ratio | 10 | 0.7787 | 0.5805 || drop_verb_ratio_comments | 10 | 0.7874 | 0.5568 |

| only_text_like_ratio | 1 | 0.6965 | 0.5721 || drop_comments_text_like_ratio_to_total | 10 | 0.7860 | 0.5717 |

## Multinomial Naive Bayes Results

| Features | N_feats | CV_F1_macro | Test_F1_macro |
|----------|---------|-------------|---------------|
| drop_error_near_eof_ratio | 10 | 0.7241 | **0.6089** |
| drop_text_like_ratio | 10 | 0.7350 | 0.5956 |
| ALL | 11 | 0.7487 | 0.5942 |
| drop_comment_ratio | 10 | 0.7429 | 0.5923 |
| drop_comments_code_like_ratio_to_total | 10 | 0.7492 | 0.5923 |
| drop_verb_ratio_comments | 10 | 0.7467 | 0.5916 |
| drop_bucket_large | 10 | 0.7540 | 0.5908 |
| drop_comments_text_like_ratio_to_total | 10 | 0.7413 | 0.5907 |
| drop_comments_text_like_ratio_comments | 10 | 0.7455 | 0.5775 |
| drop_comments_code_like_ratio_comments | 10 | 0.7455 | 0.5775 |
| drop_bucket_small | 10 | 0.7057 | 0.5744 |
| drop_bucket_medium | 10 | 0.7345 | 0.5682 |
| prev_2feats | 2 | 0.5696 | 0.5208 |
| only_* (all single features) | 1 | 0.3435 | 0.2311 |

**Best Configuration:** `drop_error_near_eof_ratio` with Test F1 = 0.6089

**Note:** MultinomialNB requires non-negative features (MinMaxScaler applied). Single-feature configurations perform extremely poorly (F1 ~ 0.23) due to the scaler normalizing all values to a narrow range, losing discriminative power.

## Best Configuration Per Model Summary

| Model | Best Feature Subset | N_feats | CV_F1_macro | Test_F1_macro |
|-------|---------------------|---------|-------------|---------------|
| RandomForest | only_comments_text_like_ratio_to_total | 1 | 0.6139 | **0.6932** |
| XGBoost | only_comments_text_like_ratio_to_total | 1 | 0.6138 | 0.6930 |
| Stacking Ensemble | only_comments_text_like_ratio_to_total | 1 | - | 0.6911 |
| MLP | only_comments_text_like_ratio_to_total | 1 | 0.6133 | 0.6906 |
| LogReg | only_comments_text_like_ratio_to_total | 1 | 0.6131 | 0.6900 |
| DeepNN | only_comments_text_like_ratio_to_total | 1 | 0.6129 | 0.6878 |
| MultinomialNB | drop_error_near_eof_ratio | 10 | 0.7241 | 0.6089 |

**Notes:**
- Stacking Ensemble uses RF, GradientBoosting, MLP, LogReg, and ExtraTrees as base models with LogReg meta-classifier
- MultinomialNB performs poorly with single features due to MinMaxScaler requirements; best with 10 features

## Interpretation & Key Findings

### 1. Severe Overfitting with Multi-Feature Models
- **All models with 10-11 features show strong overfitting**: CV F1 scores of 0.75-0.79 but Test F1 only 0.55-0.62
- The gap between CV and Test performance (~0.15-0.20) indicates the models memorize training patterns that don't generalize

### 2. Single Feature Dominance
The best performing configuration across **all models** is using **only `comments_text_like_ratio_to_total`**:
- Achieves Test F1 = **0.69-0.70** with just 1 feature
- This significantly outperforms using all 11 features (Test F1 ~ 0.58-0.60)
- **Simpler models generalize better** for this task

### 3. Top Individual Features (Ranked by Test F1)
| Rank | Feature | Test F1 |
|------|---------|---------|
| 1 | `comments_text_like_ratio_to_total` | 0.69-0.70 |
| 2 | `verb_ratio_comments` | 0.66-0.67 |
| 3 | `comment_ratio` | 0.65-0.66 |
| 4 | `comments_code_like_ratio_to_total` | 0.60 |
| 5 | `comments_code_like_ratio_comments` | 0.59-0.60 |

### 4. Harmful Feature: `error_near_eof_ratio`
- Using **only** `error_near_eof_ratio` gives worst Test F1 (0.40-0.41)
- **Dropping** this feature consistently improves test performance:
  - LogReg: 0.6003 -> 0.6209 (+0.02)
  - RandomForest: 0.5829 -> 0.6201 (+0.04)
  - XGBoost: 0.5809 -> 0.6213 (+0.04)
  - MultinomialNB: 0.5942 -> 0.6089 (+0.01)
- This feature appears to introduce spurious correlations that hurt generalization

### 5. Model Complexity vs. Performance
| Complexity | Model | Best Test F1 |
|------------|-------|--------------|
| Low | LogReg | 0.6900 |
| Medium | RandomForest | **0.6932** |
| Medium | XGBoost | 0.6930 |
| Medium | Stacking Ensemble | 0.6911 |
| High | MLP | 0.6906 |
| Very High | DeepNN | 0.6878 |
| Naive Bayes | MultinomialNB | 0.6089 |

- **Random Forest achieves the overall best** Test F1 (0.6932)
- Simpler models (LogReg, RF, XGBoost) outperform complex neural networks
- **Stacking Ensemble provides no improvement** over individual models (0.6911 vs 0.6932)
- MultinomialNB struggles with single features but achieves 0.61 with 10 features

### 6. Stacking Ensemble Analysis
- Ensemble of RF, GradientBoosting, MLP, LogReg, ExtraTrees with LogReg meta-classifier
- **Test F1: 0.6911** (trained on 30k subset)
- Delta vs best single model: -0.0021 (no gain)
- The ensemble fails to improve because all base models converge to similar predictions on this single-feature task

### 7. Recommendations
1. **Use `comments_text_like_ratio_to_total` as the primary feature** - it alone provides ~69% F1
2. **Remove `error_near_eof_ratio`** from feature sets - it hurts generalization
3. **Prefer simpler models** (Logistic Regression or Random Forest) over deep networks
4. If combining features, use only 2-3: `comments_text_like_ratio_to_total`, `verb_ratio_comments`, and optionally `comment_ratio`
5. The `prev_2feats` combination (verb_ratio_comments + comment_ratio) gives stable ~64.5% F1 across models
6. **Skip ensemble methods** - they add complexity without improving performance

### 8. Distribution Shift Analysis
The large CV-to-Test gap suggests significant **distribution shift** between:
- Training data (300K samples)
- Test-Add evaluation set (165K samples)

Features that capture comment structure (`comments_text_like_ratio_to_total`) appear more robust to this shift than syntactic features or error patterns.