# Model training

In [1]:
import datetime
import json
import pickle

import numpy as np
import optuna
import pandas as pd
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score, accuracy_score
from sklearn.model_selection import KFold
from sklearn.utils.class_weight import compute_class_weight
from xgboost import XGBClassifier
from challenge.new_or_used import build_dataset

  from .autonotebook import tqdm as notebook_tqdm


In [2]:
train_df = pickle.load(open("../data/train_dataset.pkl", "rb"))
test_df = pickle.load(open("../data/test_dataset.pkl", "rb"))

In [3]:
_, y_train, _, y_test = build_dataset()

y_train = pd.Series([y == "used" for y in y_train])
y_train = y_train.astype(int)

y_test = pd.Series([y == "used" for y in y_test])
y_test = y_test.astype(int)

In [4]:
class_weights = compute_class_weight(
    "balanced", classes=np.unique(y_train), y=y_train
)
print(class_weights)

[0.93067505 1.08048406]


# Hyperparameter Optimization with Optuna  

We use **Optuna** to search for the best hyperparameters for the XGBoost classifier in order to maximize performance.  

To avoid data leakage from the test set, we split the training data into **train** and **validation** subsets, and perform a **5-fold cross-validation** during the search.  

The process is as follows:  
1. Optimize hyperparameters using cross-validation on the training/validation split.  
2. Select the best-performing parameter set based on validation results.  
3. Retrain the model on the **entire training dataset** with the chosen parameters.  
4. Evaluate the final model on the **held-out test dataset**.  


In [8]:
def objective(trial):
    param = {
        "objective": "binary:logistic",
        "eval_metric": "logloss",
        "n_estimators": trial.suggest_int("n_estimators", 300, 800),
        "learning_rate": trial.suggest_float("learning_rate", 0.001, 0.02),
        "max_depth": trial.suggest_int("max_depth", 10, 20),
        "subsample": trial.suggest_float("subsample", 0.6, 1),
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.8, 1),
        "min_child_weight": trial.suggest_int("min_child_weight", 3, 10),
        "gamma": trial.suggest_float("gamma", 0.6, 1),
        "reg_alpha": trial.suggest_float("reg_alpha", 0, 1),
        "reg_lambda": trial.suggest_float("reg_lambda", 0.3, 1),
        "random_state": 42,
        "scale_pos_weight": class_weights[1],
        "grow_policy": trial.suggest_categorical(
            "grow_policy", ["depthwise", "lossguide"]
        ),
        "tree_method": trial.suggest_categorical("tree_method", ["approx", "hist"]),
        "enable_categorical": True,
    }
    clf = XGBClassifier(**param)
    kf = KFold(n_splits=5)

    accuracies = []

    for train_index, val_index in kf.split(train_df):
        clf.fit(train_df.iloc[train_index], y_train.iloc[train_index])
        y_pred = clf.predict(train_df.iloc[val_index])
        accuracy = accuracy_score(y_train.iloc[val_index], y_pred)
        accuracies.append(accuracy)

    return np.mean(accuracies)

study = optuna.create_study(direction="maximize")

[I 2025-09-04 13:05:12,323] A new study created in memory with name: no-name-56e12159-3676-4d17-ad10-c8bde39f66e5


In [9]:
id = datetime.datetime.now().strftime("%Y%m%d%H%M%S")
print(id)

20250904130515


In [11]:
study.optimize(objective, n_trials=50)
print("The best params are: ", study.best_params)
print("The best accuracy in validation is: ", study.best_value)

[I 2025-09-04 13:18:40,615] Trial 9 finished with value: 0.8992999999999999 and parameters: {'n_estimators': 554, 'learning_rate': 0.014261587896416373, 'max_depth': 19, 'subsample': 0.7483008802685539, 'colsample_bytree': 0.9250885866659034, 'min_child_weight': 4, 'gamma': 0.6373414225095054, 'reg_alpha': 0.17505745761749258, 'reg_lambda': 0.6022500002359948, 'grow_policy': 'depthwise', 'tree_method': 'approx'}. Best is trial 9 with value: 0.8992999999999999.
[I 2025-09-04 13:19:41,634] Trial 10 finished with value: 0.8910333333333333 and parameters: {'n_estimators': 607, 'learning_rate': 0.0012865209278401831, 'max_depth': 18, 'subsample': 0.9973797942464315, 'colsample_bytree': 0.9192678064148374, 'min_child_weight': 4, 'gamma': 0.7299534128168896, 'reg_alpha': 0.05910796834308141, 'reg_lambda': 0.43047903520360686, 'grow_policy': 'depthwise', 'tree_method': 'approx'}. Best is trial 9 with value: 0.8992999999999999.
[I 2025-09-04 13:20:54,579] Trial 11 finished with value: 0.8943999

The best params are:  {'n_estimators': 702, 'learning_rate': 0.01916315762126148, 'max_depth': 20, 'subsample': 0.757297605299585, 'colsample_bytree': 0.8712893306756, 'min_child_weight': 3, 'gamma': 0.8753123686891097, 'reg_alpha': 0.8816156436665484, 'reg_lambda': 0.6563661635923242, 'grow_policy': 'depthwise', 'tree_method': 'approx'}
The best accuracy in validation is:  0.9011222222222222


# Actual training and testing

In [14]:
best_params = study.best_params
clf = XGBClassifier(
    enable_categorical=True,
    scale_pos_weight=class_weights[1],
    **best_params,
)
clf.fit(train_df, y_train)
y_pred_train = clf.predict(train_df)
y_pred_test = clf.predict(test_df)

train_recall = recall_score(y_train, y_pred_train)
test_recall = recall_score(y_test, y_pred_test)
train_accuracy = accuracy_score(y_train, y_pred_train)
test_accuracy = accuracy_score(y_test, y_pred_test)
train_precision = precision_score(y_train, y_pred_train)
test_precision = precision_score(y_test, y_pred_test)
train_f1 = f1_score(y_train, y_pred_train)
test_f1 = f1_score(y_test, y_pred_test)
train_confusion_matrix = confusion_matrix(y_train, y_pred_train)
test_confusion_matrix = confusion_matrix(y_test, y_pred_test)

results = {
    "id": id,
    "best_params": study.best_params,
    "train_recall": train_recall,
    "best_average_accuracy_for_validation": study.best_value,
    "train_precision": train_precision,
    "train_accuracy": train_accuracy,
    "train_f1": train_f1,
    "train_confusion_matrix": train_confusion_matrix,
    "test_recall": test_recall,
    "test_precision": test_precision,
    "test_accuracy": test_accuracy,
    "test_f1": test_f1,
    "test_confusion_matrix": test_confusion_matrix,
}

print(results)

{'id': '20250904130515', 'best_params': {'n_estimators': 702, 'learning_rate': 0.01916315762126148, 'max_depth': 20, 'subsample': 0.757297605299585, 'colsample_bytree': 0.8712893306756, 'min_child_weight': 3, 'gamma': 0.8753123686891097, 'reg_alpha': 0.8816156436665484, 'reg_lambda': 0.6563661635923242, 'grow_policy': 'depthwise', 'tree_method': 'approx'}, 'train_recall': 0.961990971955436, 'best_average_accuracy_for_validation': 0.9011222222222222, 'train_precision': 0.91166632534644, 'train_accuracy': 0.9392777777777778, 'train_f1': 0.9361528126642912, 'train_confusion_matrix': array([[44470,  3882],
       [ 1583, 40065]]), 'test_recall': 0.9175010883761427, 'test_precision': 0.867998352553542, 'test_accuracy': 0.898, 'test_f1': 0.8920634920634921, 'test_confusion_matrix': array([[4765,  641],
       [ 379, 4215]])}


#### 💡 Insights  
The model trained with the best hyperparameters (selected on the validation set) achieved:  
- **Validation accuracy**: 93.9%  
- **Test accuracy**: 89.8%  

This indicates a drop in performance when moving from validation to test, suggesting some **overfitting** to the validation folds. Still, the test accuracy remains strong and above the required baseline.

In [19]:
# Save the model and save best_params
import json
test_accuracy = int(test_accuracy*100)

results["test_confusion_matrix"] = results["test_confusion_matrix"].tolist()
results["train_confusion_matrix"] = results["train_confusion_matrix"].tolist()

with open(
    f"../models/params/best_params_{id}_{test_accuracy}.json",
    "w",
) as f:
    print(
        "Saving the best params at: ",
        f"../models/params/best_params_{id}_{test_accuracy}.json",
    )
    json.dump(results, f)
with open(
    f"../models/weights/model_{id}_{test_accuracy}.pkl",
    "wb",
) as f:
    print(
        "Saving the model at: ",
        f"../models/weights/model_{id}_{test_accuracy}.pkl",
    )
    pickle.dump(clf, f)

Saving the best params at:  ../models/params/best_params_20250904130515_89000000.json
Saving the model at:  ../models/weights/model_20250904130515_89000000.pkl


#### 💡 Insights  
Up to this point, we reported **accuracy**, which measures the proportion of correct predictions over the total.  
While useful, accuracy does not capture the model’s weakness with **false negatives** and **false positives** equally.  

An additional and critical metric here is **recall**, since it reflects the ability to correctly identify used items.  
False negatives (classifying a used item as new) are particularly damaging for the business: if a buyer receives a product labeled as *new* but it is actually *used*, they may lose trust and stop using Mercado Libre.  

On the test set, the model achieved:  
- **Recall**: 91.8% → strong performance, reducing the risk of false negatives.  
- **Precision**: 86.8% → ensures that most items predicted as used are truly used.  
- **F1-score**: 89.2% → balances precision and recall into a single metric, showing overall robustness.  

This trade-off demonstrates that the model not only achieves high recall to protect user trust, but also maintains good precision, with the F1-score confirming a balanced performance.  

## Check on the feature importance

In [20]:
feature_importance_df = pd.DataFrame(
    {"feature": train_df.columns, "importance": clf.feature_importances_}
)

# Sort by importance (descending)
feature_importance_df = feature_importance_df.sort_values("importance", ascending=False)

display(feature_importance_df)

Unnamed: 0,feature,importance
23,initial_quantity,0.351135
16,listing_type_id,0.10092
3,seller_id,0.071313
25,available_quantity,0.068248
31,title_nuevo_or_usado,0.042209
18,buying_mode,0.037248
24,sold_quantity,0.034268
19,has_parent_item_id,0.032053
26,title_sex_shop,0.031659
32,total_time_seconds,0.027685


# Feature Importance Analysis  

None of the features have an importance of 0, which validates that the **EDA process was effective** in selecting useful variables.  

As anticipated during the EDA, the `title` clusters stand out:  
- **Cluster 2** and **Cluster 3** are among the most important cluster-based features,  
confirming their predictive value for distinguishing between new and used items.  

# Next Steps  

Future work could focus on improving the use of text encoders and exploring additional modeling strategies.  

### 🔤 Text-related improvements  
1. **Alternative dimensionality reduction**  
   - Explore algorithms such as **t-SNE** or **UMAP** to reduce embedding dimensions before clustering.  

2. **Cluster optimization**  
   - Test additional metrics beyond inertia and silhouette score to determine the optimal number of clusters.  

3. **Experiment with different encoders**  
   - **OpenAI embeddings**: Higher-dimensional representations with strong performance in Spanish, which could capture nuances of **Rioplatense Spanish** better than RoBERTa.  
   - **BETO encoder**: Smaller and faster; although the embeddings are weaker, BETO could be fine-tuned, potentially outperforming RoBERTa in this specific task and be better for production. 

4. **Fine-tuning strategies**  
   - Apply **contrastive learning** to RoBERTa for generating more discriminative embeddings.  

5. **Additional text features**  
   - Encode the `warranty` field, which is free text and contains rich information that may contribute strongly to the model.  

---

### 📊 Non-text improvements  
1. **Feature engineering**  
   - Explore interactions between numerical features (e.g., price × shipping type).  

2. **Modeling strategies**  
   - Compare XGBoost with **LightGBM** and **CatBoost** for potential efficiency or performance gains.  
   - Build **ensemble models** (e.g., stacking with logistic regression or blending with neural networks).  

3. **Explainability and robustness**  
   - Use **SHAP values** or permutation importance to better understand feature contributions.  
   - Perform sensitivity analysis to detect potential biases or instabilities in predictions.  

4. **Data augmentation**  
   - Expand the dataset by generating synthetic samples for underrepresented categories or conditions.  
   - Augment text fields with paraphrasing or back-translation for more robust encoders.  