# Clothes Size Predictor

## Quick Note (PLEASE READ)

This is the new notebook where I did the feature engineering and the model training. Here's why:

I took the time to check where my model fails, and it fails precisely because of the two problems I had already identified: feature overlap and class imbalance. The only way to fix those errors is to attack both problems directly.
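
Both problems can be checked quickly before any modeling. The sketch below uses a small synthetic table to show the two checks: class shares for imbalance, and per-class feature means and standard deviations for overlap. The values are made up; only the column names `size`, `weight`, `height`, and `age` come from this project.

```python
import pandas as pd

# Synthetic stand-in for the real dataset: values are invented for illustration.
toy = pd.DataFrame({
    "size":   ["S", "S", "M", "M", "M", "M", "L", "L"],
    "weight": [55, 60, 58, 63, 66, 61, 64, 70],
    "height": [160, 165, 163, 168, 170, 166, 169, 175],
    "age":    [25, 31, 28, 35, 40, 29, 33, 45],
})

# Class imbalance: share of rows per size.
class_share = toy["size"].value_counts(normalize=True)
print(class_share)

# Feature overlap: if class means sit within about one standard deviation
# of each other, that feature alone cannot separate those classes well.
per_class = toy.groupby("size")[["weight", "height", "age"]].agg(["mean", "std"])
print(per_class.round(1))
```

On the real data, the same two calls on `df` would show how uneven the sizes are and how much the `weight`, `height`, and `age` distributions of neighboring sizes coincide.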

### So let's begin

In [1]:
# 1. Setup and imports
import os
import sys
import pandas as pd
import joblib
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Add the project root to the path so custom modules can be imported
PROJECT_PATH = os.path.abspath(os.pardir)
sys.path.append(PROJECT_PATH)
from src.pipeline.feature_engineering import FeatureEngineer

# Load the dataset
DATA_PATH = '../data/processed/clothes_processed.csv'
df = pd.read_csv(DATA_PATH)

# CRITICAL FILTER: drop the XXL rows, a class with too few samples, to avoid bias
df = df[df['size'] != 'XXL'].copy()
print(f"Rows remaining after dropping XXL (support=14): {len(df)}")


# 2. Run the transformation pipeline
TARGET_COL = 'size'
processor = FeatureEngineer(df, target_col=TARGET_COL)
df_processed = processor.run_all_preprocessing()

# 3. Split into features X and target y
X = df_processed[['weight', 'height', 'age']]
y = df_processed[f'{TARGET_COL}_encoded']

# 4. Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y  # keep class proportions the same in train and test
)

Rows remaining after dropping XXL (support=14): 26284
🚀 Running essential Pre-processing pipeline (Scaling & Encoding)...
🔢 Encoded target column 'size'.
💾 LabelEncoder saved.
📏 Scaled numeric features: ['weight', 'height', 'age']
💾 StandardScaler saved.
✅ Pre-processing pipeline completed.
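
`FeatureEngineer` lives in the project's `src` package and is not shown in this notebook. Judging only from the log above, it label-encodes the target and standardizes the three numeric features. As a rough sketch of those two operations in plain Python, assuming the default behavior of scikit-learn's `LabelEncoder` (classes numbered in sorted order) and `StandardScaler` (mean removed, divided by the population standard deviation); the weight values are made up:

```python
from statistics import mean, pstdev

# The six sizes left after dropping XXL.
sizes = ["XXS", "S", "M", "L", "XL", "XXXL"]

# Label encoding: classes are sorted as strings, then numbered from 0.
classes = sorted(set(sizes))
label_to_code = {label: code for code, label in enumerate(classes)}
print(classes)  # ['L', 'M', 'S', 'XL', 'XXS', 'XXXL']

# Standardization: subtract the mean and divide by the population std.
toy_weights = [50.0, 60.0, 70.0]  # invented values for illustration
mu, sigma = mean(toy_weights), pstdev(toy_weights)
scaled = [(w - mu) / sigma for w in toy_weights]
print([round(z, 3) for z in scaled])
```

The sorted order is why the classification report lists the sizes alphabetically rather than from smallest to largest.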


In [2]:
# Define the Random Forest model
rf_model = RandomForestClassifier(random_state=42, n_jobs=-1)

# Hyperparameter grid for Random Forest
param_grid = {
    # Number of trees in the forest
    'n_estimators': [100, 200], 

    # Maximum depth of the trees
    'max_depth': [5, 10, None],  # None allows nodes to expand until all leaves are pure

    # Class weighting (to continue combating imbalance)
    'class_weight': [None, 'balanced'] 
}

# GridSearchCV
grid_search_rf = GridSearchCV(
    estimator=rf_model,
    param_grid=param_grid,
    scoring='f1_weighted',
    cv=5,
    n_jobs=-1
)

print("\n Initializing grid search for Random Forest...")
grid_search_rf.fit(X_train, y_train)

# Save the best model
best_rf_model = grid_search_rf.best_estimator_
joblib.dump(best_rf_model, '../models/best_rf_model_optimized.pkl')


 Initializing grid search for Random Forest...


['../models/best_rf_model_optimized.pkl']
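
`best_estimator_` only exposes the winner. The grid above has 2 x 3 x 2 = 12 candidates, and `cv_results_` holds the cross-validated score for each of them. The sketch below shows one way to rank them; it runs on synthetic data from `make_classification` with fewer trees and fewer folds so it finishes quickly. The toy data and those reduced settings are assumptions, not this notebook's configuration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Toy 3-class problem with 3 features (stand-in for weight, height, age).
X_toy, y_toy = make_classification(
    n_samples=300, n_features=3, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=42,
)

toy_grid = {
    "n_estimators": [10, 20],            # smaller than the notebook's 100/200
    "max_depth": [5, 10, None],
    "class_weight": [None, "balanced"],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid=toy_grid,
    scoring="f1_weighted",
    cv=3,                                # the notebook uses cv=5
)
search.fit(X_toy, y_toy)

# One row per parameter combination, best rank first.
results = (
    pd.DataFrame(search.cv_results_)
    [["params", "mean_test_score", "std_test_score", "rank_test_score"]]
    .sort_values("rank_test_score")
    .reset_index(drop=True)
)
print(results.head())
```

Applied to `grid_search_rf`, the same table would show whether the top candidates are separated by more than their `std_test_score`, or whether the choice between them is mostly noise.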

In [3]:
print(f"Best parameters: {grid_search_rf.best_params_}")
print(f"Best score (F1-weighted): {grid_search_rf.best_score_:.4f}")

Best parameters: {'class_weight': 'balanced', 'max_depth': 5, 'n_estimators': 200}
Best score (F1-weighted): 0.3792
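
The search selected `class_weight='balanced'`. scikit-learn documents that mode as `n_samples / (n_classes * count_per_class)`. A small sketch of the formula in plain Python follows. It uses the test-set supports from the classification report in this notebook as a stand-in for the class counts: the training counts are not printed here, but the split is stratified, so the proportions should be nearly the same.

```python
# Per-class row counts, taken from the "support" column of the test report.
support = {"L": 827, "M": 958, "S": 805, "XL": 924, "XXS": 506, "XXXL": 1237}

n_samples = sum(support.values())
n_classes = len(support)

# 'balanced' heuristic: rarer classes get proportionally larger weights.
balanced_weight = {
    label: n_samples / (n_classes * count) for label, count in support.items()
}

for label, weight in balanced_weight.items():
    print(f"{label:>4}: {weight:.2f}")
```

With these counts XXS gets the largest weight (about 1.73) and XXXL the smallest (about 0.71), which may partly explain why XXS ends up with high recall but low precision.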


In [4]:
encoder = joblib.load('../models/label_encoder.pkl') 
y_pred = best_rf_model.predict(X_test)

print("\n--- REPORT ---")
print(classification_report(y_test, y_pred, target_names=encoder.classes_))


--- REPORT ---
              precision    recall  f1-score   support

           L       0.28      0.36      0.31       827
           M       0.32      0.12      0.17       958
           S       0.34      0.22      0.27       805
          XL       0.31      0.32      0.31       924
         XXS       0.35      0.78      0.48       506
        XXXL       0.71      0.70      0.70      1237

    accuracy                           0.41      5257
   macro avg       0.38      0.42      0.38      5257
weighted avg       0.41      0.41      0.39      5257
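
To put these numbers in context, the averaged rows can be recomputed from the per-class rows, and accuracy can be compared against always predicting the most frequent class. The sketch below copies the rounded values from the report, so its results can differ from the printed averages in the last digit.

```python
# (precision, recall, f1, support) per class, copied from the report above.
report = {
    "L":    (0.28, 0.36, 0.31, 827),
    "M":    (0.32, 0.12, 0.17, 958),
    "S":    (0.34, 0.22, 0.27, 805),
    "XL":   (0.31, 0.32, 0.31, 924),
    "XXS":  (0.35, 0.78, 0.48, 506),
    "XXXL": (0.71, 0.70, 0.70, 1237),
}
total = sum(row[3] for row in report.values())

# Macro average: plain mean over classes. Weighted: mean weighted by support.
macro_f1 = sum(row[2] for row in report.values()) / len(report)
weighted_f1 = sum(row[2] * row[3] for row in report.values()) / total

# Baseline: accuracy of always predicting the largest class (XXXL).
majority_baseline = max(row[3] for row in report.values()) / total

print(f"macro F1: {macro_f1:.2f}, weighted F1: {weighted_f1:.2f}")
print(f"majority-class baseline accuracy: {majority_baseline:.2f}")
```

The model's 0.41 accuracy is above the majority-class baseline of roughly 0.24, but most of that gain comes from XXS and XXXL. The four middle sizes (S, M, L, XL) stay at an F1 of about 0.3 or below, which is where the feature overlap hurts most.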

