Introduction

# Sales Prediction for an E-commerce

**Objective:** Predict how many units of each product each shop will sell in the next month.

This project tackles a real-world problem using data from the [Kaggle: Predict Future Sales](https://www.kaggle.com/competitions/competitive-data-science-predict-future-sales) competition.

It follows a complete machine learning workflow, from exploratory analysis to generating predictions.


Loading libraries and data

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from xgboost import XGBRegressor
import joblib
from sklearn.metrics import mean_squared_error
from math import sqrt


Exploration and cleaning: summary

In [2]:
# Load the cleaned data
df_sales = pd.read_csv('../data/sales_clean.csv')

# Show basic statistics
df_sales.describe()

| | date_block_num | shop_id | item_id | item_price | item_cnt_day |
|---|---|---|---|---|---|
| count | 2928485.0 | 2928485.0 | 2928485.0 | 2928485.0 | 2928485.0 |
| mean | 14.56976 | 33.00294 | 10200.28 | 889.4673 | 1.247598 |
| std | 9.422951 | 16.22543 | 6324.392 | 1727.5 | 2.292948 |
| min | 0.0 | 0.0 | 0.0 | 0.07 | 1.0 |
| 25% | 7.0 | 22.0 | 4477.0 | 249.0 | 1.0 |
| 50% | 14.0 | 31.0 | 9355.0 | 399.0 | 1.0 |
| 75% | 23.0 | 47.0 | 15691.0 | 999.0 | 1.0 |
| max | 33.0 | 59.0 | 22169.0 | 307980.0 | 1000.0 |


- Negative prices and extreme values were removed.
- Duplicate rows were removed.
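
The cleaning steps listed above could be sketched roughly as follows. This is only an illustration: the helper name `clean_sales` and the cutoffs `max_price` and `max_count` are hypothetical, since the notebook does not state which thresholds were actually used to define "extreme" values.

```python
import pandas as pd

def clean_sales(df, max_price, max_count):
    """Drop rows with non-positive prices, extreme values, and exact duplicates.

    max_price and max_count are illustrative cutoffs, not the notebook's values.
    """
    mask = (
        (df["item_price"] > 0)               # removes negative (and zero) prices
        & (df["item_price"] <= max_price)    # assumed upper bound on price
        & (df["item_cnt_day"] <= max_count)  # assumed upper bound on daily units
    )
    return df[mask].drop_duplicates().reset_index(drop=True)

# Tiny made-up frame: one duplicate row, one negative price, one extreme price
raw = pd.DataFrame({
    "date_block_num": [0, 0, 0, 1, 1],
    "shop_id": [1, 1, 1, 2, 2],
    "item_id": [10, 10, 11, 12, 13],
    "item_price": [100.0, 100.0, -1.0, 250.0, 999999.0],
    "item_cnt_day": [1.0, 1.0, 1.0, 2.0, 1.0],
})
cleaned = clean_sales(raw, max_price=400_000, max_count=1_000)
```

Passing the cutoffs as arguments keeps the choice explicit, which matters here because the summary table still shows large maxima after cleaning.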

Monthly aggregation

In [3]:
# Group by month, shop and item
df_monthly = df_sales.groupby(['date_block_num', 'shop_id', 'item_id'])['item_cnt_day'].sum().reset_index()
df_monthly.rename(columns={'item_cnt_day': 'item_cnt_month'}, inplace=True)
df_monthly['item_cnt_month'] = df_monthly['item_cnt_month'].clip(0, 20)

Building the base matrix

In [4]:
# Matrix with all month-shop-item combinations
df_matrix = pd.read_csv('../data/matriz_features_lags.csv')
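
The matrix is loaded here from a precomputed file, so its construction is not shown. As a rough sketch of how such a grid could be built: for each month, take the Cartesian product of the shops and items seen in that month, left-join the monthly sales, and fill the missing pairs with 0. The helper `build_base_matrix` and the per-month pairing rule are assumptions, not the procedure that produced `matriz_features_lags.csv`.

```python
from itertools import product

import pandas as pd

KEYS = ["date_block_num", "shop_id", "item_id"]

def build_base_matrix(df_monthly):
    """Return one row per (month, shop, item) pair, with 0 sales where none were recorded."""
    rows = []
    for month, grp in df_monthly.groupby("date_block_num"):
        shops = grp["shop_id"].unique()
        items = grp["item_id"].unique()
        # All shop-item pairs active in this month (assumed pairing rule)
        rows.extend(product([month], shops, items))
    grid = pd.DataFrame(rows, columns=KEYS)
    matrix = grid.merge(df_monthly, on=KEYS, how="left")
    matrix["item_cnt_month"] = matrix["item_cnt_month"].fillna(0)
    return matrix

# Made-up monthly sales: two shops and two items in month 0, one pair in month 1
monthly = pd.DataFrame({
    "date_block_num": [0, 0, 1],
    "shop_id": [1, 2, 1],
    "item_id": [10, 11, 10],
    "item_cnt_month": [3.0, 1.0, 2.0],
})
matrix = build_base_matrix(monthly)
```

Adding the zero-sales rows explicitly is what lets a model learn that most shop-item pairs sell nothing in a given month.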

Feature engineering

In [5]:
# Show the key features created
df_matrix[['item_cnt_month', 'item_cnt_month_lag_1', 'item_cnt_month_lag_3']].describe()

| | item_cnt_month | item_cnt_month_lag_1 | item_cnt_month_lag_3 |
|---|---|---|---|
| count | 10884550.0 | 10884550.0 | 10884550.0 |
| mean | 0.2996268 | 0.2833825 | 0.2594169 |
| std | 1.225039 | 1.201368 | 1.161263 |
| min | 0.0 | 0.0 | 0.0 |
| 25% | 0.0 | 0.0 | 0.0 |
| 50% | 0.0 | 0.0 | 0.0 |
| 75% | 0.0 | 0.0 | 0.0 |
| max | 20.0 | 20.0 | 20.0 |
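
Lag columns like the ones summarized above can be derived by shifting the month index and merging the target back onto the matrix. The sketch below shows one common way to do this; the helper `add_lag` is hypothetical, and filling missing history with 0 is an assumption (it is at least consistent with the counts and minimum of 0 in the table).

```python
import pandas as pd

KEYS = ["date_block_num", "shop_id", "item_id"]

def add_lag(matrix, lag, col="item_cnt_month"):
    """Attach the value of `col` from `lag` months earlier as a new column."""
    lag_col = f"{col}_lag_{lag}"
    shifted = matrix[KEYS + [col]].copy()
    # A value observed in month t becomes a feature for month t + lag
    shifted["date_block_num"] += lag
    shifted = shifted.rename(columns={col: lag_col})
    out = matrix.merge(shifted, on=KEYS, how="left")
    out[lag_col] = out[lag_col].fillna(0)  # assumed: no history means 0 sales
    return out

# Made-up history for a single shop-item pair over four months
demo = pd.DataFrame({
    "date_block_num": [0, 1, 2, 3],
    "shop_id": [1, 1, 1, 1],
    "item_id": [10, 10, 10, 10],
    "item_cnt_month": [5.0, 2.0, 0.0, 4.0],
})
demo = add_lag(demo, 1)
demo = add_lag(demo, 3)
```

Because each lag only looks at earlier months, these features do not leak the target of the month being predicted.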


Train / validation split

In [6]:
df_train = df_matrix[df_matrix['date_block_num'] < 33]
df_val = df_matrix[df_matrix['date_block_num'] == 33]

features = [
    'item_id', 'shop_id', 'item_category_id',
    'month', 'year',
    'item_avg_monthly_sales', 'shop_avg_monthly_sales',
    'item_cnt_month_lag_1', 'item_cnt_month_lag_3'
]
target = 'item_cnt_month'

X_train = df_train[features]
y_train = df_train[target]
X_val = df_val[features]
y_val = df_val[target]

Model training

In [7]:
model = XGBRegressor(
    n_estimators=100, max_depth=8,
    learning_rate=0.05, subsample=0.5,
    random_state=42, n_jobs=-1
)

model.fit(X_train, y_train)

Evaluation

In [8]:
y_pred = model.predict(X_val)
rmse = sqrt(mean_squared_error(y_val, y_pred))
print(f'Validation RMSE: {rmse:.4f}')

Validation RMSE: 0.8958
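
For reference, the metric reported above is the root mean squared error, the square root of the mean of the squared differences between true and predicted values. The sketch below computes it with NumPy on made-up numbers and also shows the effect of clipping predictions to the [0, 20] range used for the target. Applying that clip at evaluation time is an assumption; the notebook scores the raw predictions.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up targets and raw predictions, two of them outside [0, 20]
y_true = [0.0, 1.0, 20.0, 0.0]
y_raw = [-0.5, 1.0, 23.0, 0.5]
y_clipped = np.clip(y_raw, 0, 20)  # same bounds as the target clip

score_raw = rmse(y_true, y_raw)
score_clipped = rmse(y_true, y_clipped)
```

Since the target never leaves [0, 20], clipping can only move an out-of-range prediction closer to the true value, so it cannot increase the RMSE.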


Final prediction and submission file generation

In [9]:
submission = pd.read_csv('../data/submission.csv')
submission.head()

| | ID | item_cnt_month |
|---|---|---|
| 0 | 0 | 0.924389 |
| 1 | 1 | 3.073612 |
| 2 | 2 | 0.918653 |
| 3 | 3 | 0.430294 |
| 4 | 4 | 5.292832 |
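
The submission file is read from disk here, and the step that produced it is not shown. One plausible way to generate it is to predict on the test feature matrix, clip to [0, 20], and write the `ID` and `item_cnt_month` columns. In the sketch below, `make_submission` and `ConstantModel` are hypothetical names, and the stand-in model only mimics the `.predict` interface of the trained regressor.

```python
import numpy as np
import pandas as pd

def make_submission(model, X_test, ids, path=None):
    """Predict, clip to the target range, and optionally write a CSV."""
    preds = np.clip(model.predict(X_test), 0, 20)  # same bounds as the target clip
    sub = pd.DataFrame({"ID": np.asarray(ids), "item_cnt_month": preds})
    if path is not None:
        sub.to_csv(path, index=False)  # index=False avoids an extra index column
    return sub

class ConstantModel:
    """Stand-in for the trained XGBRegressor (illustration only)."""

    def __init__(self, value):
        self.value = value

    def predict(self, X):
        return np.full(len(X), self.value, dtype=float)

# Made-up test features; the constant 25.0 is clipped down to 20.0
X_demo = pd.DataFrame({"item_id": [10, 11, 12], "shop_id": [1, 1, 2]})
sub_demo = make_submission(ConstantModel(25.0), X_demo, ids=[0, 1, 2])
```

In the notebook's setting, the call would take the fitted `model` and a test matrix with the same `features` columns used for training.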


In [10]:
joblib.dump(model, '../models/xgb_model_final.joblib')

['../models/xgb_model_final.joblib']

### Conclusions

- An XGBoost regression model was built to predict monthly sales per shop and item.
- Validation RMSE (month 33): **0.8958**
- Temporal aggregation, lag features, and optimization techniques were applied.

