
# Exercise: Gradient Boosting with real Kaggle data (House Prices)

**Goal:** complete an ML workflow with **Gradient Boosting** to predict `SalePrice` using Kaggle's **House Prices** dataset.

---

## How to install and use Jupyter

### 1) Create a virtual environment (optional but recommended)
```bash
# macOS / Linux
python3 -m venv myenv
source myenv/bin/activate

# Windows (PowerShell)
py -m venv myenv
myenv\Scripts\Activate.ps1
```

### 2) Install the minimum dependencies
```bash
pip install --upgrade pip
pip install jupyter pandas numpy matplotlib scikit-learn
```
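
If a later import fails with `ModuleNotFoundError`, the packages were most likely installed into a different environment than the one running Jupyter. The following is a minimal sketch for checking this from a notebook cell. The helper name `installed_versions` is hypothetical; it relies only on the standard-library `importlib.metadata`.

```python
import sys
from importlib import metadata

# Distribution names as passed to pip in the install step.
REQUIRED = ["jupyter", "pandas", "numpy", "matplotlib", "scikit-learn"]

def installed_versions(packages):
    """Return {distribution name: version string, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

# The interpreter path should point inside the virtual environment if it is active.
print("Interpreter:", sys.executable)
for name, version in installed_versions(REQUIRED).items():
    print(f"{name:>14}: {version or 'MISSING'}")
```

If a package shows as `MISSING`, activate the environment and rerun the `pip install` command, then restart the notebook kernel.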

### 3) Open Jupyter
```bash
# Option A: Jupyter Notebook
jupyter notebook

# Option B: JupyterLab
pip install jupyterlab
jupyter lab
```
Jupyter will open a window in your browser. Create or open this notebook and run the cells with **Shift+Enter**.

---

## How to get the dataset from Kaggle

### Option 1: Manual download
1. Go to the competition page: *House Prices - Advanced Regression Techniques* (you need a Kaggle account and must accept the competition rules before downloading).
2. Download `train.csv` and save it in a folder inside your project, for example `data/train.csv`.

### Option 2: Kaggle API (requires configuring credentials)
```bash
pip install kaggle
# Place your kaggle.json at ~/.kaggle/kaggle.json (Linux/macOS) or C:\Users\<USER>\.kaggle\kaggle.json (Windows)
kaggle competitions download -c house-prices-advanced-regression-techniques
unzip house-prices-advanced-regression-techniques.zip -d data
```
> For this exercise, **`train.csv`** is all you need.
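
On systems where the `unzip` command is not available (a default Windows shell, for example), the archive can be unpacked from Python instead. This is a minimal sketch using the standard-library `zipfile` module. The helper name `extract_member` is hypothetical, and it assumes that `train.csv` sits at the top level of the downloaded archive.

```python
import zipfile
from pathlib import Path

def extract_member(zip_path, member, dest_dir):
    """Extract a single file from a zip archive into dest_dir and return its path."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        return Path(archive.extract(member, path=dest))

# Archive name as produced by the `kaggle competitions download` command.
ZIP_PATH = Path("house-prices-advanced-regression-techniques.zip")

if ZIP_PATH.exists():
    print("Extracted:", extract_member(ZIP_PATH, "train.csv", "data"))
else:
    print(f"{ZIP_PATH} not found; download it first.")
```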


## 1) Import libraries

In [1]:

# IMPORT THE REQUIRED LIBRARIES HERE
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import mean_absolute_error, r2_score
# scikit-learn >= 1.4 provides root_mean_squared_error; on older versions we fall back to sqrt(MSE)
try:
    from sklearn.metrics import root_mean_squared_error
    _HAS_RMSE = True
except ImportError:
    from sklearn.metrics import mean_squared_error
    _HAS_RMSE = False

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

print("Libraries imported.")


## 2) Load the Kaggle data (`train.csv`)

In [None]:

# SET THIS PATH TO WHERE YOU SAVED train.csv
DATA_PATH = "data/train.csv"  # <-- change if needed

assert os.path.exists(DATA_PATH), f"{DATA_PATH} was not found. Adjust the path."

df = pd.read_csv(DATA_PATH)
print(df.shape)
df.head()

## 3) Initial exploration

In [None]:

# Dimensions, dtypes, and missing values
df.info()  # prints its summary and returns None, so it should not be wrapped in display()
display(df.isna().sum().sort_values(ascending=False).head(20))
df.describe(include='all').T.head(20)


## 4) Select variables and preprocess (simple version)

To start, use a **subset of numeric variables** (you can change them):
- `OverallQual`, `GrLivArea`, `GarageCars`, `YearBuilt`

Target variable: **`SalePrice`**.


In [None]:

features_numeric = ["OverallQual", "GrLivArea", "GarageCars", "YearBuilt"]
target = "SalePrice"

# Keep only rows with no missing values in these columns
data = df[features_numeric + [target]].dropna().copy()

X = data[features_numeric]
y = data[target]

X.head()

## 5) Split into training and test sets

In [None]:

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=RANDOM_STATE
)
X_train.shape, X_test.shape

## 6) Create and train a Gradient Boosting Regressor (basic)

In [None]:

gbr = GradientBoostingRegressor(
    random_state=RANDOM_STATE,
    n_estimators=300,
    learning_rate=0.05,
    max_depth=3,
    subsample=0.9
)
gbr.fit(X_train, y_train)
print("Model trained.")
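
To make the roles of `n_estimators`, `learning_rate`, and `max_depth` more concrete, the following is a simplified sketch of boosting for squared-error loss on synthetic data. It is not scikit-learn's implementation: the function names `fit_toy_boosting` and `predict_toy_boosting` are hypothetical, the data is made up, and row subsampling (`subsample`) is left out for brevity.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_toy_boosting(X, y, n_estimators=50, learning_rate=0.1, max_depth=2):
    """Fit a simplified boosted ensemble for squared-error loss."""
    init = float(np.mean(y))              # stage 0: predict the mean of y
    pred = np.full(len(y), init)
    trees = []
    for _ in range(n_estimators):
        residuals = y - pred              # negative gradient of 0.5 * (y - pred)**2
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        tree.fit(X, residuals)            # each tree models what is still unexplained
        pred += learning_rate * tree.predict(X)  # add a shrunken correction
        trees.append(tree)
    return init, learning_rate, trees

def predict_toy_boosting(model, X):
    """Sum the initial constant and every shrunken tree correction."""
    init, learning_rate, trees = model
    pred = np.full(X.shape[0], init)
    for tree in trees:
        pred += learning_rate * tree.predict(X)
    return pred

# Synthetic one-feature regression problem (illustrative only).
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(200, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(0, 0.1, size=200)

model = fit_toy_boosting(X_toy, y_toy)
mse_const = np.mean((y_toy - model[0]) ** 2)
mse_boost = np.mean((y_toy - predict_toy_boosting(model, X_toy)) ** 2)
print(f"Training MSE, constant model: {mse_const:.4f}")
print(f"Training MSE, boosted model : {mse_boost:.4f}")
```

A smaller `learning_rate` shrinks each correction, so more trees (`n_estimators`) are typically needed to reach the same training error; `max_depth` limits how complex each individual correction can be.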

## 7) Evaluate the model

In [None]:

y_pred = gbr.predict(X_test)

mae = mean_absolute_error(y_test, y_pred)
if _HAS_RMSE:
    rmse = root_mean_squared_error(y_test, y_pred)
else:
    # np and mean_squared_error were already imported in the first cell
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)

print(f"MAE : {mae:0.4f}")
print(f"RMSE: {rmse:0.4f}")
print(f"R²  : {r2:0.4f}")
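
For reference, the three metrics can be computed by hand on a tiny example. This sketch uses plain Python and made-up numbers; the function name `regression_metrics` is hypothetical and is not part of scikit-learn.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R^2) computed from their definitions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)             # residual sum of squares
    rmse = math.sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return mae, rmse, r2

# Errors are 1, 0 and -2, so MAE = 3/3, RMSE = sqrt(5/3), R^2 = 1 - 5/8.
mae_toy, rmse_toy, r2_toy = regression_metrics([3.0, 5.0, 7.0], [2.0, 5.0, 9.0])
print(f"MAE : {mae_toy:0.4f}")
print(f"RMSE: {rmse_toy:0.4f}")
print(f"R²  : {r2_toy:0.4f}")
```

MAE and RMSE are expressed in the units of the target (here, the units of `SalePrice`), and RMSE weighs large errors more heavily than MAE. R² is unitless and compares the model against always predicting the mean.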

## 8) Visualize predictions

In [None]:

# a) Scatter plot: y_true vs y_pred
plt.figure(figsize=(6,6))
plt.scatter(y_test, y_pred, s=18, alpha=0.8)
min_v = min(y_test.min(), y_pred.min())
max_v = max(y_test.max(), y_pred.max())
# Dashed diagonal: points on this line are predicted exactly
plt.plot([min_v, max_v], [min_v, max_v], linestyle="--")
plt.title("y_true vs y_pred (TEST)")
plt.xlabel("y_true"); plt.ylabel("y_pred")
plt.tight_layout(); plt.show()

# b) Feature importances
importances = pd.Series(gbr.feature_importances_, index=features_numeric).sort_values(ascending=True)
plt.figure(figsize=(6,4))
importances.plot(kind="barh")
plt.title("Feature importances (GBR)")
plt.tight_layout(); plt.show()

## 9) Improve the model (GridSearchCV)

In [None]:

param_grid = {
    "n_estimators": [200, 400, 600],
    "learning_rate": [0.03, 0.06, 0.1],
    "max_depth": [2, 3, 4],
    "subsample": [0.8, 1.0]
}

gbr_base = GradientBoostingRegressor(random_state=RANDOM_STATE)

grid = GridSearchCV(
    estimator=gbr_base,
    param_grid=param_grid,
    scoring="neg_root_mean_squared_error",
    cv=5,
    n_jobs=-1
)
grid.fit(X_train, y_train)
print("Best hyperparameters:", grid.best_params_)

best_model = grid.best_estimator_
y_pred_best = best_model.predict(X_test)

if _HAS_RMSE:
    rmse_best = root_mean_squared_error(y_test, y_pred_best)
else:
    rmse_best = np.sqrt(mean_squared_error(y_test, y_pred_best))
r2_best = r2_score(y_test, y_pred_best)

print(f"RMSE base -> tuned: {rmse:0.4f} -> {rmse_best:0.4f}")
print(f"R²   base -> tuned: {r2:0.4f} -> {r2_best:0.4f}")
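
Before launching a grid search, it helps to estimate how many models it will train. The sketch below enumerates the same grid with the standard library; the variable names `grid_values` and `cv_folds` are chosen here for illustration so they do not overwrite the notebook's `param_grid`.

```python
from itertools import product

# Same values as the grid searched in this step.
grid_values = {
    "n_estimators": [200, 400, 600],
    "learning_rate": [0.03, 0.06, 0.1],
    "max_depth": [2, 3, 4],
    "subsample": [0.8, 1.0],
}
cv_folds = 5

# Every combination of one value per hyperparameter is a candidate model.
candidates = [dict(zip(grid_values, combo)) for combo in product(*grid_values.values())]
cv_fits = len(candidates) * cv_folds

print("Candidates:", len(candidates))
print("Cross-validation fits:", cv_fits)
```

With the default `refit=True`, `GridSearchCV` trains the best candidate once more on the full training set after the cross-validation fits. Because the scoring is `neg_root_mean_squared_error`, `grid.best_score_` is negative, and `-grid.best_score_` gives the mean cross-validated RMSE of the best candidate.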


## 10) (Optional) Pipeline with categorical variables

- Choose a few categorical columns (e.g., `Neighborhood`, `KitchenQual`, etc.).
- Use `OneHotEncoder` + `ColumnTransformer` and train a **Pipeline** with `GradientBoostingRegressor`.
- Compare the metrics with the simple numeric version.
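
As a small illustration of what `OneHotEncoder(handle_unknown="ignore")` does, the sketch below encodes a toy categorical column. The values are illustrative quality codes, not rows read from `train.csv`.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy single-column categorical data (illustrative values only).
train_cat = np.array([["Gd"], ["TA"], ["Ex"], ["Gd"]])
new_cat = np.array([["TA"], ["Fa"]])  # "Fa" was never seen during fit

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train_cat)

# transform() returns a sparse matrix by default; densify it for printing.
encoded = encoder.transform(new_cat).toarray()
print("Learned categories:", list(encoder.categories_[0]))
print(encoded)
```

A category that was not present during `fit` is encoded as a row of zeros instead of raising an error, which matters when the test split contains a level that the training split never saw.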


In [None]:

# EXAMPLE TEMPLATE (fill it in with real columns if you want to try it)

# cols_num = ["OverallQual", "GrLivArea", "GarageCars", "YearBuilt"]
# cols_cat = ["Neighborhood", "KitchenQual"]
# used_cols = cols_num + cols_cat + [target]

# data2 = df[used_cols].dropna().copy()
# X2 = data2[cols_num + cols_cat]
# y2 = data2[target]

# pre = ColumnTransformer(
#     transformers=[
#         ("num", "passthrough", cols_num),
#         ("cat", OneHotEncoder(handle_unknown="ignore"), cols_cat)
#     ]
# )

# pipe = Pipeline([
#     ("prep", pre),
#     ("model", GradientBoostingRegressor(random_state=RANDOM_STATE))
# ])

# X2_train, X2_test, y2_train, y2_test = train_test_split(
#     X2, y2, test_size=0.25, random_state=RANDOM_STATE
# )
# pipe.fit(X2_train, y2_train)
# y2_pred = pipe.predict(X2_test)

# if _HAS_RMSE:
#     rmse2 = root_mean_squared_error(y2_test, y2_pred)
# else:
#     rmse2 = np.sqrt(mean_squared_error(y2_test, y2_pred))
# r2_2 = r2_score(y2_test, y2_pred)

# print(f"RMSE (categorical pipeline): {rmse2:0.4f}")
# print(f"R²   (categorical pipeline): {r2_2:0.4f}")