# Rossmann dataset

CatBoost documentation: https://catboost.ai/docs/en/

Run this first:

!pip install xgboost==1.7.6 scikit-learn==1.2.2 catboost

https://www.kaggle.com/competitions/rossmann-store-sales/data

- Id - an Id that represents a (Store, Date) duple within the test set
- Store - a unique Id for each store
- Sales - the turnover for any given day (this is what you are predicting)
- Customers - the number of customers on a given day
- Open - an indicator for whether the store was open: 0 = closed, 1 = open
- StateHoliday - indicates a state holiday. Normally all stores, with few exceptions, are closed on state holidays. Note that all schools are closed on public holidays and weekends. a = public holiday, b = Easter holiday, c = Christmas, 0 = None
- SchoolHoliday - indicates if the (Store, Date) was affected by the closure of public schools
- StoreType - differentiates between 4 different store models: a, b, c, d
- Assortment - describes an assortment level: a = basic, b = extra, c = extended
- CompetitionDistance - distance in meters to the nearest competitor store
- CompetitionOpenSince[Month/Year] - gives the approximate year and month of the time the nearest competitor was opened
- Promo - indicates whether a store is running a promo on that day
- Promo2 - Promo2 is a continuing and consecutive promotion for some stores: 0 = store is not participating, 1 = store is participating
- Promo2Since[Year/Week] - describes the year and calendar week when the store started participating in Promo2
- PromoInterval - describes the consecutive intervals Promo2 is started, naming the months the promotion is started anew. E.g. "Feb,May,Aug,Nov" means each round starts in February, May, August, November of any given year for that store
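
The `PromoInterval` format described above can be turned into a per-row flag. A minimal sketch, assuming month names can be compared on their first three letters; the helper `promo2_restarts_in_month` is hypothetical and not part of the original notebook:

```python
# Lowercase three-letter month abbreviations, index 0 = January.
MONTH_ABBR = ("jan", "feb", "mar", "apr", "may", "jun",
              "jul", "aug", "sep", "oct", "nov", "dec")

def promo2_restarts_in_month(promo_interval, month):
    """Return True if `month` (1-12) is a restart month in `promo_interval`.

    `promo_interval` is a string such as "Feb,May,Aug,Nov". Empty or missing
    values (stores that do not take part in Promo2) yield False.
    """
    if not isinstance(promo_interval, str) or not promo_interval.strip():
        return False
    # Compare on the first three letters so longer spellings still match.
    months = [m.strip()[:3].lower() for m in promo_interval.split(",")]
    return MONTH_ABBR[month - 1] in months

print(promo2_restarts_in_month("Feb,May,Aug,Nov", 5))  # True
print(promo2_restarts_in_month("Feb,May,Aug,Nov", 6))  # False
```

Applied row by row together with the month of `Date`, this would give a binary feature; whether it helps the model is not something the notebook tests.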

In [1]:
import pandas as pd
import numpy as np
from catboost import CatBoostRegressor
from sklearn.metrics import r2_score
import matplotlib.pyplot as plt

# ----------------------------
# LOADING AND PREPROCESSING
# ----------------------------
# Load the data
train = pd.read_csv('Data/rossman.csv')
stores = pd.read_csv('Data/store.csv')

# Merge both datasets
df = pd.merge(train, stores, on='Store')

# Parse the date column
df['Date'] = pd.to_datetime(df['Date'])

# Sort by date so the positional split in the next cell is chronological (avoids leakage)
df = df.sort_values('Date')

# Keep only days when the store was open
df = df[df['Open'] == 1]

# ----------------------------
# FEATURE ENGINEERING
# ----------------------------

# Useful time-based features
df['month'] = df['Date'].dt.month
df['day_of_week'] = df['Date'].dt.dayofweek
df['day_of_month'] = df['Date'].dt.day
df['week_of_year'] = df['Date'].dt.isocalendar().week.astype(int)
df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)

# Drop columns we will not use
df = df.drop(columns=['Date', 'Store', 'Customers'])





In [2]:
# ----------------------------
# TEMPORAL SPLIT: 90% past, 10% future
# ----------------------------
split_index = int(len(df) * 0.9)
train_df = df.iloc[:split_index]
test_df = df.iloc[split_index:]

# Separate X and y
target = 'Sales'
X_train = train_df.drop(columns=target)
y_train = train_df[target]
X_test = test_df.drop(columns=target)
y_test = test_df[target]

# Align test columns with train
X_test = X_test.reindex(columns=X_train.columns, fill_value=0)
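
A positional split on date-sorted rows is chronological, but the cut can fall in the middle of a day, so the same date may end up on both sides. The following sketch shows a date-cutoff alternative on synthetic data. The function `temporal_split` is hypothetical, it splits by fraction of distinct dates rather than rows, and in this notebook it would have to run before `Date` is dropped:

```python
import pandas as pd

def temporal_split(frame, date_col, train_frac=0.9):
    """Split so that every row of a given date lands on the same side."""
    dates = frame[date_col].sort_values().unique()
    # First date that belongs to the test period.
    cutoff = dates[int(len(dates) * train_frac)]
    train_part = frame[frame[date_col] < cutoff]
    test_part = frame[frame[date_col] >= cutoff]
    return train_part, test_part

# Synthetic example: 10 days with 3 rows (stores) per day.
toy = pd.DataFrame({
    "Date": pd.date_range("2015-01-01", periods=10).repeat(3),
    "Sales": range(30),
})
train_part, test_part = temporal_split(toy, "Date")
print(len(train_part), len(test_part))
print(train_part["Date"].max() < test_part["Date"].min())
```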



In [3]:
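# Step 2: Detect categorical columns (low number of distinct values)
cat_cols = [col for col in X_train.columns if X_train[col].nunique() < 50]

# Step 3: Fill nulls first, then cast categorical columns to string.
# astype(str) turns NaN into the literal 'nan', so a fillna after it would do nothing.
# Work on copies to avoid SettingWithCopyWarning on slices of df.
X_train, X_test = X_train.copy(), X_test.copy()
for col in cat_cols:
    X_train[col] = X_train[col].astype(object).fillna('missing').astype(str)
    X_test[col] = X_test[col].astype(object).fillna('missing').astype(str)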
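
When casting a column to string and filling nulls, the order of the two calls matters. A small illustration on a toy Series (the variable names are only for this example):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

# Casting first turns NaN into the literal string 'nan'; nothing is left to fill.
cast_then_fill = s.astype(str).fillna("missing")
# Filling first (on an object copy) replaces the NaN before the cast.
fill_then_cast = s.astype(object).fillna("missing").astype(str)

print(cast_then_fill.tolist())
print(fill_then_cast.tolist())
```

Either way the model sees a single category for missing values, but only the second form gives it the intended `'missing'` label.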


### 🐱 Tuning CatBoost

- **`iterations`**:  
  Total number of trees to train. With `early_stopping_rounds` you can set a high value, because training stops once the validation metric stops improving.

- **`depth`**:  
  Maximum depth of each tree. Deeper trees capture more complexity but can overfit.

- **`learning_rate`**:  
  Step size applied to each tree's contribution. Lower values need more iterations but often generalize better.

- **`subsample`**:  
  Fraction of rows used to train each tree. It only applies with a sampling-based `bootstrap_type` (for example `Bernoulli` or `MVS`).

- **`rsm`** (Random Subspace Method):  
  Fraction of features considered at each split selection. CatBoost accepts `colsample_bylevel` as an alias; the role is similar to `colsample_bytree` in XGBoost.

- **`early_stopping_rounds`**:  
  Stops training if the metric on the validation set does not improve for N iterations. Requires passing `eval_set`.

- **`eval_metric`**:  
  Metric used during training to monitor performance (e.g. `'RMSE'`, `'MAE'`, `'Logloss'`, `'AUC'`). It is also the metric that early stopping watches.

- **`min_data_in_leaf`**:  
  Minimum number of samples in a leaf. Helps against overfitting (similar to `min_child_samples` in LightGBM). In CatBoost it only takes effect with the `Depthwise` and `Lossguide` grow policies.

- **`l2_leaf_reg`**:  
  L2 (Ridge) regularization applied to the leaf values. Helps control model complexity.

- **`random_strength`**:  
  Amount of randomness added to split scores when choosing the tree structure. The higher the value, the more random the splits → useful against overfitting.
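
The stopping rule behind `early_stopping_rounds` can be sketched in plain Python. This is a simplified illustration of the patience idea for a higher-is-better metric such as R², not CatBoost's actual implementation; the function name and the validation scores are hypothetical:

```python
def best_iteration_with_patience(scores, patience):
    """Return (best_index, stop_index) for a higher-is-better metric.

    Training stops at the first iteration that is `patience` steps past the
    best score seen so far; if that never happens, it runs through all scores.
    """
    best_index = 0
    for i, score in enumerate(scores):
        if score > scores[best_index]:
            best_index = i
        elif i - best_index >= patience:
            return best_index, i
    return best_index, len(scores) - 1

# Hypothetical validation R² values, one per iteration.
val_r2 = [0.10, 0.30, 0.45, 0.44, 0.46, 0.455, 0.45, 0.44, 0.47]
print(best_iteration_with_patience(val_r2, patience=3))  # (4, 7)
```

Note that the better score at index 8 is never reached: a small patience trades a possibly better late iteration for shorter training.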


In [4]:
from catboost import CatBoostRegressor
from sklearn.metrics import r2_score

# Step 5: Train the model
model = CatBoostRegressor(
    iterations=10000,           # ≈ n_estimators (upper bound; early stopping decides the actual count)
    depth=5,                    # ≈ max_depth
    learning_rate=0.1,
    subsample=0.5,              # row sampling, same idea as subsample in XGBoost
    rsm=0.8,                    # feature sampling (alias: colsample_bylevel)
    eval_metric='R2',           # metric monitored on eval_set
    l2_leaf_reg=0.1,            # ≈ reg_lambda
    random_strength=5,          # randomness added to split scores (not a direct equivalent of gamma)
    verbose=100,
    early_stopping_rounds=20    # stop if the eval metric does not improve for 20 iterations
)

# Train with an eval_set that contains only the test split.
# Early stopping on the test split makes the R² reported below somewhat optimistic.
model.fit(
    X_train, y_train,
    cat_features=cat_cols,
    eval_set=[(X_test, y_test)],
)

# Step 6: Evaluate
y_pred = model.predict(X_test)
print(f" R²: {r2_score(y_test, y_pred):.4f}")


0:	learn: 0.0435957	test: 0.0306309	best: 0.0306309 (0)	total: 190ms	remaining: 31m 40s
100:	learn: 0.6134502	test: 0.5643274	best: 0.5643274 (100)	total: 9.89s	remaining: 16m 9s
200:	learn: 0.6983887	test: 0.6460173	best: 0.6460173 (200)	total: 20.5s	remaining: 16m 39s
300:	learn: 0.7424881	test: 0.6857417	best: 0.6857417 (300)	total: 31.8s	remaining: 17m 4s
400:	learn: 0.7703974	test: 0.7099062	best: 0.7099062 (400)	total: 43.4s	remaining: 17m 20s
500:	learn: 0.7915378	test: 0.7283160	best: 0.7283160 (500)	total: 55.6s	remaining: 17m 34s
600:	learn: 0.8057578	test: 0.7448941	best: 0.7448941 (600)	total: 1m 7s	remaining: 17m 42s
700:	learn: 0.8172454	test: 0.7544104	best: 0.7546981 (696)	total: 1m 20s	remaining: 17m 43s
800:	learn: 0.8261818	test: 0.7630289	best: 0.7630304 (799)	total: 1m 32s	remaining: 17m 43s
900:	learn: 0.8341629	test: 0.7746682	best: 0.7746728 (899)	total: 1m 45s	remaining: 17m 42s
1000:	learn: 0.8417151	test: 0.7813506	best: 0.7813506 (1000)	total: 1m 57s	remaini
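
The `learn` and `test` columns in the log are R² values, since `eval_metric='R2'`. For reference, a hand-rolled sketch of the usual definition, R² = 1 - SS_res / SS_tot, on toy numbers (the function `r2` is only for illustration; the notebook itself uses `sklearn.metrics.r2_score`):

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

y_true = [3.0, 5.0, 7.0, 9.0]
print(r2(y_true, y_true))      # 1.0 (perfect predictions)
print(r2(y_true, [6.0] * 4))   # 0.0 (always predicting the mean)
```

A value of 0 therefore means "no better than predicting the mean", which is why the first iteration in the log starts close to 0.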

In [5]:
import pandas as pd

df_importances = pd.DataFrame({
    'feature': model.feature_names_,
    'importance': model.feature_importances_
}).sort_values(by='importance', ascending=False)


In [6]:
df_importances

|    | feature                   | importance |
|----|---------------------------|------------|
| 7  | CompetitionDistance       | 27.583348  |
| 2  | Promo                     | 13.433849  |
| 9  | CompetitionOpenSinceYear  | 9.550844   |
| 5  | StoreType                 | 8.85441    |
| 8  | CompetitionOpenSinceMonth | 8.733118   |
| 6  | Assortment                | 6.662655   |
| 11 | Promo2SinceWeek           | 5.742445   |
| 12 | Promo2SinceYear           | 4.789122   |
| 0  | DayOfWeek                 | 2.995582   |
| 13 | PromoInterval             | 2.462945   |
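
One way to read the importances is cumulatively. The sketch below copies the ten values shown in the table above into a small DataFrame and adds a running total. It assumes CatBoost's default importances are scaled to sum to 100 across all features, so the running total can be read as a percentage:

```python
import pandas as pd

# Values copied from the output above (top 10 features only).
top10 = pd.DataFrame({
    "feature": [
        "CompetitionDistance", "Promo", "CompetitionOpenSinceYear",
        "StoreType", "CompetitionOpenSinceMonth", "Assortment",
        "Promo2SinceWeek", "Promo2SinceYear", "DayOfWeek", "PromoInterval",
    ],
    "importance": [
        27.583348, 13.433849, 9.550844, 8.85441, 8.733118,
        6.662655, 5.742445, 4.789122, 2.995582, 2.462945,
    ],
})

# Running total of importance, from most to least important feature.
top10["cumulative"] = top10["importance"].cumsum()
print(top10[["feature", "cumulative"]].round(2))
```

On these numbers the first three features already account for about half of the total, and the ten listed features for roughly 91.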


## Comparison: XGBoost vs LightGBM vs CatBoost

| Feature                      | **XGBoost**                                                | **LightGBM**                                               | **CatBoost**                                                  |
|------------------------------|------------------------------------------------------------|-------------------------------------------------------------|----------------------------------------------------------------|
| **Speed**                    | Fast, but often slower than LightGBM and CatBoost          | 🔥 Very fast thanks to histograms and leaf-wise growth      | Fast, though usually a bit slower than LightGBM                |
| **Accuracy**                 | High                                                       | High, sometimes better with good tuning                     | Very high, especially with categorical features                |
| **Categorical variables**    | ⚠️ Experimental native support (`enable_categorical`); otherwise manual encoding | ⚠️ Native support via `categorical_feature`, but columns must be integer-coded or pandas `category` dtype | ✅ Native support + ordered (sequential) target statistics |
| **Memory usage**             | Moderate                                                   | ✅ Very efficient (binning)                                  | Similar to XGBoost                                             |
| **Missing value handling**   | ✅ Automatic                                                | ✅ Automatic                                                 | ✅ Automatic for numeric features (fill NaN in categorical columns first) |
| **GPU support**              | ✅ Yes (fairly stable)                                      | ✅ Yes (very fast)                                           | ✅ Yes (somewhat more limited)                                  |
| **Installation**             | Easy (`pip install xgboost`)                               | Easy (`pip install lightgbm`)                               | A bit heavier (`pip install catboost`)                         |
| **Documentation**            | Excellent                                                  | Good                                                        | Very good                                                      |
| **sklearn integration**      | Very good                                                  | Very good                                                   | Very good                                                      |
| **Sensitivity to row order** | ✅ Neutral                                                  | ✅ Neutral                                                   | ⚠️ Sensitive (because of ordered encoding)                      |
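
CatBoost's sensitivity to row order comes from how it encodes categorical values: each row is encoded using only the targets of rows that precede it. The following is a simplified sketch of that idea (ordered target statistics), not CatBoost's implementation, which among other things works over random permutations; `ordered_target_encoding` and `prior_weight` are hypothetical names:

```python
from collections import defaultdict

def ordered_target_encoding(categories, targets, prior, prior_weight=1.0):
    """Encode each row using only the targets of rows that came before it."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    encoded = []
    for cat, y in zip(categories, targets):
        # Smoothed mean of previous targets for this category.
        encoded.append((sums[cat] + prior_weight * prior) / (counts[cat] + prior_weight))
        # Update the running statistics only after encoding the current row.
        sums[cat] += y
        counts[cat] += 1
    return encoded

cats = ["a", "b", "a", "a", "b"]
sales = [10.0, 20.0, 30.0, 50.0, 40.0]
prior = sum(sales) / len(sales)  # global mean used as the prior

forward = ordered_target_encoding(cats, sales, prior)
# Same rows processed in reverse order, then flipped back for comparison.
backward = ordered_target_encoding(cats[::-1], sales[::-1], prior)[::-1]
print(forward)
print(backward)
```

The two printed lists differ even though the rows are identical, which is the order sensitivity the table refers to. Because a row never sees its own target, the encoding also avoids the target leakage of a plain per-category mean.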

---

## ✅ When to use XGBoost, LightGBM or CatBoost?

| Situation                                                  | Recommendation                                     |
|------------------------------------------------------------|----------------------------------------------------|
| Small or medium tabular dataset                            | ✅ XGBoost or CatBoost                              |
| Large dataset, many numeric variables                      | ✅ LightGBM                                         |
| Many categorical variables without preprocessing           | ✅ CatBoost (native, robust handling)               |
| You want something robust and stable with good support     | ✅ XGBoost (well tested in production and Kaggle)   |
| Fast training with good performance                        | ✅ LightGBM                                         |
| You want interpretability with SHAP                        | ✅ Any of them, but CatBoost tends to do better with categorical features |
| You need good performance without much tuning              | ✅ CatBoost (good defaults)                         |
| You already have a pipeline with OneHot/Target Encoding    | ✅ XGBoost or LightGBM                              |
| Automated tuning (Optuna, GridSearchCV, etc.)              | ✅ LightGBM (fast to train, so searches finish sooner) |
| Production on legacy systems or well-documented APIs       | ✅ XGBoost (more mature, more integrations)         |
| Multi-label classification or non-standard problems        | ✅ XGBoost (more flexible support)                  |

## 🧠 Tips

- **LightGBM** can overfit easily → keep an eye on `num_leaves` and `min_data_in_leaf`.
- **CatBoost** works very well with its defaults and with little preprocessing.
- **XGBoost** is robust and well balanced, ideal if you already have a pipeline with encoding in place.
