
# Objetivo 2: Generación de modelos

Pipeline completo para la predicción del crecimiento de datos de biodiversidad en GBIF.

Este script implementa el flujo de trabajo de principio a fin para modelar datos de panel
de series temporales, incluyendo:
1.  Carga datos de PA_dataAnalysis y preparación
2.  Ingeniería de características temporales (lags y ventanas móviles).
3.  Un marco de validación cruzada robusto para series de tiempo (ventana expansiva).
4.  Preprocesamiento (imputación y escalado) dentro del bucle de validación para evitar fuga de datos.
5.  Entrenamiento y evaluación comparativa de cuatro modelos:
    - Prophet.
    - Random Forest.
    - XGBoost.
    - Red Neuronal LSTM (para modelado secuencial).
    - INCLUIR SARIMAX DE TODOS MODOS, INTENTAR HACER UNA TRANSFORMACIÓN LOGARÍTMICA, SE PUEDE PONER SIN LA VALIDACIÓN DE SUPUESTOS. TENER CUIDADO CON EL CROSS VALIDATION.
6.  Selección del mejor modelo basado en métricas de rendimiento (MAE, RMSE, R²).
7.  Reentrenamiento del modelo final y generación de pronósticos para Colombia hasta 2030
    bajo dos escenarios de políticas.

In [112]:
# =============================================================================
# 1. IMPORTACIÓN DE LIBRERÍAS Y CONFIGURACIÓN INICIAL
# =============================================================================
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from tqdm import tqdm

# Preprocesamiento y modelado de Scikit-Learn
from sklearn.model_selection import TimeSeriesSplit
from sklearn.preprocessing import StandardScaler
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV

# Modelos especializados

import xgboost as xgb
from statsmodels.tsa.statespace.sarimax import SARIMAX
from prophet import Prophet

# Configuraciones generales
warnings.filterwarnings('ignore')
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (15, 7)

In [58]:
# Modelado de Deep Learning con TensorFlow/Keras
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping

In [None]:
#Carga de datos

url = "https://raw.githubusercontent.com/rortizgeo/Maestria_CD_Proyecto-Aplicado/main/Data_final.csv"
Data_final = pd.read_csv(url)

# Convertir 'year' a formato de fecha para Prophet
Data_final['ds'] = pd.to_datetime(Data_final['year'], format='%Y')

# Ordenar los datos por país y año, es crucial para series de tiempo
Data_final = Data_final.sort_values(by=['country', 'ds']).reset_index(drop=True) # Falta ajustar para que la fecha sea al 31 de Diciembre de cada año

# Eliminación de columnas por tener muchos vacíos y no ser posible completarlas con imputación. (pensar en otras estrategias)
columns_to_drop = ['Overall score', 'areas_protegidas']
Data_final = Data_final.drop(columns=columns_to_drop)

Para aplicar modelos como Random Forest y XGBoost, es necesario agregar características de temporalidad en los datos, para lo cuál es necesario calcular retardos, que se deben aplicar teniendo en cuenta un análisis del ACF Y PACF, así como la incorporación de los tiempos del retardo como hiperparámetros. 

Los modelos basados en árboles como Random Forest y XGBoost no son conscientes de la secuencia temporal de los datos y no pueden "extrapolar" tendencias más allá de los valores que han visto en el entrenamiento. Por lo tanto, es necesario convertir la información temporal en características que el modelo pueda entender. La creación de retardos (lags) y estadísticas de ventana móvil es la técnica estándar para lograrlo. Se podría identificar el número de retardos como un hiperparámetro, guiado por análisis de ACF y PACF (Ver EDA)

In [None]:
# =============================================================================
# 2. INGENIERÍA DE CARACTERÍSTICAS TEMPORALES (OPTIMIZADA)
# =============================================================================



print("\nPaso 2: Realizando ingeniería de características temporales...")

TARGET = 'occurrenceCount_publisher'

def create_temporal_features_optimized(data, features_to_lag, 
                                     lags=[1, 2, 3, 4, 5], 
                                     roll_windows=[3, 5, 7],
                                     fill_na=0):
    """
    Genera características temporales y completa automáticamente con 0
    los valores NaN generados, según la lógica del negocio.
    
    Parámetros
    ----------
    data : pd.DataFrame
        Dataset con al menos 'country' y variables numéricas.
    features_to_lag : list
        Lista de columnas numéricas a transformar.
    lags : list
        Lista de retardos.
    roll_windows : list
        Lista de ventanas móviles.
    fill_na : int/float
        Valor para completar NaN (0 por defecto según lógica de negocio).
        
    Retorna
    -------
    DataFrame con nuevas características y NaN completados.
    """
    
    df_copy = Data_final.copy()
    
    for feature in features_to_lag:
        # Características de lag
        for lag in lags:
            lag_col = f'{feature}_lag{lag}'
            df_copy[lag_col] = df_copy.groupby('country')[feature].shift(lag)
            df_copy[lag_col] = df_copy[lag_col].fillna(fill_na)
        
        # Características de ventana móvil
        for w in roll_windows:
            # Rolling mean
            mean_col = f'{feature}_rollmean{w}'
            df_copy[mean_col] = (
                df_copy.groupby('country')[feature]
                .shift(1)
                .rolling(window=w, min_periods=1)
                .mean()
            )
            df_copy[mean_col] = df_copy[mean_col].fillna(fill_na)
            
            # Rolling std
            std_col = f'{feature}_rollstd{w}'
            df_copy[std_col] = (
                df_copy.groupby('country')[feature]
                .shift(1)
                .rolling(window=w, min_periods=1)
                .std()
            )
            df_copy[std_col] = df_copy[std_col].fillna(fill_na)
    
    return df_copy

# ===========================
# Uso del código optimizado
# ===========================

# Aplicar rezagos en bloques. Se deben hacer modelos con diferentes rezagos. Tener en cuenta que se pierden observaciones. Tomar todas las X y rezagar todas y hasta 3 periodos. Si incluir y-1.

features_to_lag = [
    "occurrenceCount_publisher", "pib_per_capita",
    "gasto_educacion_gobierno", "gasto_educacion_pib", "PC1", "PC2"
]

# Crear dataset con nuevas features (completando con 0)
df_featured = create_temporal_features_optimized(
    Data_final,
    features_to_lag=features_to_lag,
    lags=[1, 2, 3, 4, 5],  # Reducido para evitar overfitting
    roll_windows=[3, 5, 7],  # Reducido para evitar overfitting
    fill_na=0  # ¡IMPORTANTE! Completar con 0, en este caso mostraría que para ese año no hubo publicación de datos.
)

print("Ingeniería de características completada.")
print(f"Shape del dataset: {df_featured.shape}")
print(f"Valores NaN restantes: {df_featured.isnull().sum().sum()}")

# Mostrar estadísticas de las nuevas características
print("\nEstadísticas de las nuevas características:")
new_features = [col for col in df_featured.columns if any(x in col for x in ['_lag', '_roll'])]
print(df_featured[new_features].describe())



Paso 2: Realizando ingeniería de características temporales...
Ingeniería de características completada.
Shape del dataset: (656, 81)
Valores NaN restantes: 0

Estadísticas de las nuevas características:
       occurrenceCount_publisher_lag1  occurrenceCount_publisher_lag2  \
count                    6.560000e+02                    6.560000e+02   
mean                     1.530224e+07                    1.257584e+07   
std                      5.295431e+07                    4.383457e+07   
min                      0.000000e+00                    0.000000e+00   
25%                      1.549475e+04                    1.335000e+02   
50%                      1.176419e+06                    5.515200e+05   
75%                      9.340858e+06                    7.115654e+06   
max                      7.401771e+08                    5.946568e+08   

       occurrenceCount_publisher_lag3  occurrenceCount_publisher_lag4  \
count                    6.560000e+02                    6.56000

In [None]:
# Verificar distribución de las nuevas features # AJUSTAR Y QUITAR FILAS EN VEZ DE LLENAR CON 0
print("Distribución de valores en características temporales:")
for col in new_features:
    zero_percentage = (df_featured[col] == 0).mean() * 100
    print(f"{col}: {zero_percentage:.1f}% ceros")

Distribución de valores en características temporales:
occurrenceCount_publisher_lag1: 18.8% ceros
occurrenceCount_publisher_lag2: 25.0% ceros
occurrenceCount_publisher_lag3: 31.2% ceros
occurrenceCount_publisher_lag4: 37.5% ceros
occurrenceCount_publisher_lag5: 43.8% ceros
occurrenceCount_publisher_rollmean3: 9.9% ceros
occurrenceCount_publisher_rollstd3: 15.4% ceros
occurrenceCount_publisher_rollmean5: 5.9% ceros
occurrenceCount_publisher_rollstd5: 7.6% ceros
occurrenceCount_publisher_rollmean7: 3.7% ceros
occurrenceCount_publisher_rollstd7: 4.3% ceros
pib_per_capita_lag1: 6.2% ceros
pib_per_capita_lag2: 12.5% ceros
pib_per_capita_lag3: 18.8% ceros
pib_per_capita_lag4: 25.0% ceros
pib_per_capita_lag5: 31.2% ceros
pib_per_capita_rollmean3: 0.2% ceros
pib_per_capita_rollstd3: 0.3% ceros
pib_per_capita_rollmean5: 0.2% ceros
pib_per_capita_rollstd5: 0.3% ceros
pib_per_capita_rollmean7: 0.2% ceros
pib_per_capita_rollstd7: 0.3% ceros
gasto_educacion_gobierno_lag1: 6.2% ceros
gasto_educacio

In [None]:
# =============================================================================
# 3. PREPARACIÓN PARA EL MODELADO Y VALIDACIÓN
# =============================================================================
print("\nPaso 3: Preparando el marco de validación y los datos para el modelado...")

# Definir variable objetivo
TARGET = 'occurrenceCount_publisher'

# Definir variables predictoras (Decidir con Daniel cuáles usar basándose en EDA y disponibilidad. Preguntar si es posible usar todas sin complejizar el modelo) # ANALISIS DE CORRELACIONES CRUZADAS
features = [
    'PC1', 'PC2', 'pib_per_capita', 'gasto_educacion_gobierno',
    'gasto_educacion_pib', 'superficie_total_km2',
    "country", "region", "incomeLevel",
    f"{TARGET}_lag1", f"{TARGET}_lag2", f"{TARGET}_lag3", f"{TARGET}_lag4", f"{TARGET}_lag5",
    f"{TARGET}_rollmean3", f"{TARGET}_rollstd3", f"{TARGET}_rollmean5", f"{TARGET}_rollstd5",
    f"{TARGET}_rollmean7", f"{TARGET}_rollstd7"
    ]

# Filtrar las variables que realmente existen en el DataFrame
features = [f for f in features if f in df_featured.columns]

# Separar tipos de columnas
cat_cols  = [c for c in features if c in ['country', 'region', 'incomeLevel']]
date_cols = [c for c in features if c == 'ds']
num_cols  = [c for c in features if c not in cat_cols + date_cols]

print("\nFeatures numéricas:", num_cols)
print("Features categóricas:", cat_cols)
print("Features de fecha:", date_cols)


# Definir X (predictoras) y (objetivo)
X = df_featured[features].copy()
y = df_featured[TARGET].copy()

# Configurar validación cruzada para series de tiempo
n_splits = 5  # número de pliegues
tscv = TimeSeriesSplit(n_splits=n_splits)

# Extraer años únicos (ordenados) para controlar los folds
unique_years = df_featured['year'].unique()
unique_years.sort()
print(f"\nAños disponibles: {unique_years}")
print(f"Se usará TimeSeriesSplit con {n_splits} folds.")

# Diccionario para almacenar resultados (LIstas por modelo)
results = {
    'Prophet': [],
    'RandomForest':[],
    'XGBoost': [],
    'LSTM': []
}

print("Features seleccionadas:", features)
print("X shape:", X.shape)
print("y shape:", y.shape)



Paso 3: Preparando el marco de validación y los datos para el modelado...

Features numéricas: ['PC1', 'PC2', 'pib_per_capita', 'gasto_educacion_gobierno', 'gasto_educacion_pib', 'superficie_total_km2', 'occurrenceCount_publisher_lag1', 'occurrenceCount_publisher_lag2', 'occurrenceCount_publisher_lag3', 'occurrenceCount_publisher_lag4', 'occurrenceCount_publisher_lag5', 'occurrenceCount_publisher_rollmean3', 'occurrenceCount_publisher_rollstd3', 'occurrenceCount_publisher_rollmean5', 'occurrenceCount_publisher_rollstd5', 'occurrenceCount_publisher_rollmean7', 'occurrenceCount_publisher_rollstd7']
Features categóricas: ['country', 'region', 'incomeLevel']
Features de fecha: []

Años disponibles: [2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020
 2021 2022]
Se usará TimeSeriesSplit con 5 folds.
Features seleccionadas: ['PC1', 'PC2', 'pib_per_capita', 'gasto_educacion_gobierno', 'gasto_educacion_pib', 'superficie_total_km2', 'country', 'region', 'incomeLevel', 'occur

# Pruebas sin optimización de parámetros

In [104]:
# =============================================================================
# 4. CREACIÓN DE SECUENCIAS Y FUNCIONES DE MODELOS
# =============================================================================

def create_lstm_sequences_global(df, features, target, look_back=3):
    """
    Crea secuencias LSTM para todos los países antes del split train/test.
    Retorna X_seq, y_seq, years, countries (alineados).
    """
    X_seq, y_seq, years, countries = [], [], [], []

    for country in df['country'].unique():
        df_country = df[df['country'] == country].sort_values('year')

        # Extraer features y target como float32
        X_country = df_country[features].values.astype(np.float32)
        y_country = df_country[target].values.astype(np.float32)
        years_country = df_country['year'].values

        if len(X_country) > look_back:
            for i in range(len(X_country) - look_back):
                X_seq.append(X_country[i:(i + look_back)])
                y_seq.append(y_country[i + look_back])
                years.append(years_country[i + look_back])
                countries.append(country)

    return (
        np.array(X_seq, dtype=np.float32),
        np.array(y_seq, dtype=np.float32),
        np.array(years),
        np.array(countries),
    )


# ---- Modelos ----
def train_random_forest(X_train, y_train, X_test, y_test):
    """
    Entrena un Random Forest Regressor y devuelve valores reales y predicciones.
    """
    rf = RandomForestRegressor(
        n_estimators=100,
        random_state=42,
        n_jobs=-1,
        min_samples_leaf=3
    )
    rf.fit(X_train, y_train)
    return y_test, rf.predict(X_test)


def train_xgboost(X_train, y_train, X_test, y_test):
    """
    Entrena un XGBoost Regressor y devuelve valores reales y predicciones.
    """
    model = xgb.XGBRegressor(
        n_estimators=100,
        random_state=42,
        n_jobs=-1,
        learning_rate=0.1,
        max_depth=5
    )
    model.fit(X_train, y_train)
    return y_test, model.predict(X_test)


def train_lstm(X_train, y_train, X_test, y_test, look_back):
    """
    Entrena una LSTM básica y devuelve valores reales y predicciones.
    """
    # Asegurar tipos correctos
    X_train = np.asarray(X_train, dtype=np.float32)
    y_train = np.asarray(y_train, dtype=np.float32)
    X_test = np.asarray(X_test, dtype=np.float32)
    y_test = np.asarray(y_test, dtype=np.float32)

    # Evitar entrenar si no hay suficientes datos
    if X_train.shape[0] == 0 or X_test.shape[0] == 0:
        return None, None

    model = tf.keras.Sequential([
        tf.keras.layers.LSTM(
            50,
            activation='relu',
            input_shape=(look_back, X_train.shape[2])
        ),
        tf.keras.layers.Dense(1)
    ])
    model.compile(optimizer='adam', loss='mae')

    # Entrenar con early stopping opcional
    model.fit(X_train, y_train, epochs=10, batch_size=16, verbose=0)

    return y_test, model.predict(X_test, verbose=0).flatten()


def train_prophet(df, train_years, test_years, target, regressors=None):
    # Base Prophet dataframe
    prophet_df = df[['ds', target, 'year']].rename(columns={target: 'y'})
    
    if regressors:
        for r in regressors:
            if r in df.columns:
                prophet_df[r] = df[r]

    # Dividir train/test
    prophet_train = prophet_df[prophet_df['year'].isin(train_years)]
    prophet_test = prophet_df[prophet_df['year'].isin(test_years)]

    # Crear modelo
    m = Prophet(yearly_seasonality=True, daily_seasonality=False)
    if regressors:
        for r in regressors:
            if r in prophet_df.columns:
                m.add_regressor(r)

    # Entrenar
    m.fit(prophet_train)

    # Generar predicciones
    forecast = m.predict(prophet_test[['ds'] + (regressors or [])])

    return prophet_test['y'].values, forecast['yhat'].values



In [105]:
# =============================================================================
# 5. EJECUCIÓN DE MODELOS CON CROSS-VALIDATION (MEJORADO)
# =============================================================================
import warnings
warnings.filterwarnings("ignore", category=UserWarning)
warnings.filterwarnings("ignore", category=FutureWarning)

from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Función de preprocesamiento tabular
def preprocess_tabular(X_train, X_test):
    """
    Aplica imputación, escalado y codificación one-hot
    de forma consistente en train y test.
    """
    # Variables categóricas y numéricas
    cat_features = ["country", "region", "incomeLevel"]
    num_features = [col for col in X_train.columns if col not in cat_features]

    # Transformadores
    preprocessor = ColumnTransformer(
        transformers=[
            ("num", Pipeline([
                ("imputer", IterativeImputer(max_iter=10, random_state=42)),
                ("scaler", StandardScaler())
            ]), num_features),
            ("cat", OneHotEncoder(handle_unknown="ignore"), cat_features)
        ],
        remainder="drop"
    )

    X_train_processed = preprocessor.fit_transform(X_train)
    X_test_processed = preprocessor.transform(X_test)

    return X_train_processed, X_test_processed, preprocessor


# Features numéricas para LSTM
features_lstm = df_featured.select_dtypes(include=np.number).columns.tolist()
features_lstm = [col for col in features_lstm if col != TARGET]

look_back = 3
X_seq, y_seq, years_seq, countries_seq = create_lstm_sequences_global(
    df_featured, features_lstm, TARGET, look_back
)

unique_years = df_featured['year'].unique()
unique_years.sort()

for fold, (train_idx, test_idx) in enumerate(tscv.split(unique_years)):
    print(f"\n===== FOLD {fold+1}/{n_splits} =====")

    train_years = [unique_years[i] for i in train_idx]
    test_years = [unique_years[i] for i in test_idx]

    # === Random Forest / XGBoost ===
    mask_train = df_featured['year'].isin(train_years)
    mask_test = df_featured['year'].isin(test_years)

    X_train_tab = X.loc[mask_train].copy()
    y_train_tab = y.loc[mask_train]
    X_test_tab = X.loc[mask_test].copy()
    y_test_tab = y.loc[mask_test]

    # Preprocesar
    X_train_tab, X_test_tab, preprocessor = preprocess_tabular(X_train_tab, X_test_tab)

    # Random Forest
    y_true_rf, y_pred_rf = train_random_forest(X_train_tab, y_train_tab, X_test_tab, y_test_tab)
    results['RandomForest'].append((y_true_rf, y_pred_rf))

    # XGBoost
    y_true_xgb, y_pred_xgb = train_xgboost(X_train_tab, y_train_tab, X_test_tab, y_test_tab)
    results['XGBoost'].append((y_true_xgb, y_pred_xgb))


    # === LSTM ===
    mask_train_lstm = np.isin(years_seq, train_years)
    mask_test_lstm = np.isin(years_seq, test_years)

    X_train_lstm, y_train_lstm = X_seq[mask_train_lstm], y_seq[mask_train_lstm]
    X_test_lstm, y_test_lstm = X_seq[mask_test_lstm], y_seq[mask_test_lstm]

    y_true_lstm, y_pred_lstm = train_lstm(X_train_lstm, y_train_lstm, X_test_lstm, y_test_lstm, look_back)
    if y_true_lstm is not None:
        results['LSTM'].append((y_true_lstm, y_pred_lstm))
    else:
        results['LSTM'].append(([], []))


    # === Prophet con regresores ===
    try:
        # Variables adicionales como regresores externos
        regressors = ["pib_per_capita", "gasto_educacion_gobierno"]

        y_true_prophet, y_pred_prophet = train_prophet(
            df_featured, train_years, test_years, TARGET, regressors=regressors
        )
        results['Prophet'].append((y_true_prophet, y_pred_prophet))
    except Exception as e:
        results['Prophet'].append(([], []))
        print(f"⚠️ Prophet falló en fold {fold+1}: {e}")



===== FOLD 1/5 =====


2025-09-12 12:35:26.503132: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.
2025-09-12 12:35:26.684366: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:954] model_pruner failed: INVALID_ARGUMENT: Graph does not contain terminal node AssignAddVariableOp_10.
2025-09-12 12:35:30.863810: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.
12:35:31 - cmdstanpy - INFO - Chain [1] start processing
12:35:31 - cmdstanpy - INFO - Chain [1] done processing



===== FOLD 2/5 =====


2025-09-12 12:35:32.225618: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.
2025-09-12 12:35:32.400323: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:954] model_pruner failed: INVALID_ARGUMENT: Graph does not contain terminal node AssignAddVariableOp_10.
2025-09-12 12:35:38.090298: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.
12:35:38 - cmdstanpy - INFO - Chain [1] start processing
12:35:38 - cmdstanpy - INFO - Chain [1] done processing



===== FOLD 3/5 =====


2025-09-12 12:35:39.467136: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.
2025-09-12 12:35:39.625889: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:954] model_pruner failed: INVALID_ARGUMENT: Graph does not contain terminal node AssignAddVariableOp_10.
2025-09-12 12:35:47.108440: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.
12:35:47 - cmdstanpy - INFO - Chain [1] start processing
12:35:47 - cmdstanpy - INFO - Chain [1] done processing



===== FOLD 4/5 =====


2025-09-12 12:35:48.435564: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.
2025-09-12 12:35:48.599590: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:954] model_pruner failed: INVALID_ARGUMENT: Graph does not contain terminal node AssignAddVariableOp_10.
2025-09-12 12:35:58.289794: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.
12:35:58 - cmdstanpy - INFO - Chain [1] start processing
12:35:58 - cmdstanpy - INFO - Chain [1] done processing



===== FOLD 5/5 =====


2025-09-12 12:35:59.633281: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.
2025-09-12 12:35:59.792810: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:954] model_pruner failed: INVALID_ARGUMENT: Graph does not contain terminal node AssignAddVariableOp_10.
2025-09-12 12:36:11.138915: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.
12:36:11 - cmdstanpy - INFO - Chain [1] start processing
12:36:11 - cmdstanpy - INFO - Chain [1] done processing


In [None]:
# =============================================================================
# 6. EVALUACIÓN DE MÉTRICAS (OPTIMIZADA)
# =============================================================================
def compute_metrics(y_true, y_pred):
    y_true = np.array(y_true, dtype=float)
    y_pred = np.array(y_pred, dtype=float)

    if len(y_true) == 0 or len(y_pred) == 0:
        return None

    # Filtrar valores NaN o inf
    mask = ~np.isnan(y_true) & ~np.isnan(y_pred) & ~np.isinf(y_true) & ~np.isinf(y_pred)
    y_true, y_pred = y_true[mask], y_pred[mask]

    if len(y_true) == 0:
        return None

    # MAE, RMSE, R2
    mae = mean_absolute_error(y_true, y_pred)
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    r2 = r2_score(y_true, y_pred)

    # MAPE más estable: ignorar ceros
    non_zero_mask = y_true != 0
    if non_zero_mask.any():
        mape = np.mean(np.abs((y_true[non_zero_mask] - y_pred[non_zero_mask]) / y_true[non_zero_mask])) * 100
    else:
        mape = np.nan  # si todo es cero, no se calcula

    # SMAPE alternativo
    smape = np.mean(2.0 * np.abs(y_pred - y_true) / (np.abs(y_true) + np.abs(y_pred) + 1e-6)) * 100

    return {"MAE": mae, "RMSE": rmse, "R2": r2, "MAPE": mape, "SMAPE": smape}


# Resumen de métricas
summary = {}
print("\n=== Resultados finales ===")
for model, folds in results.items():
    fold_metrics = []
    for (y_true, y_pred) in folds:
        m = compute_metrics(y_true, y_pred)
        if m:
            fold_metrics.append(m)
    if fold_metrics:
        avg_metrics = {k: np.nanmean([fm[k] for fm in fold_metrics]) for k in fold_metrics[0]}
        summary[model] = avg_metrics
        print(f"\n{model}:")
        for k, v in avg_metrics.items():
            print(f"  {k}: {v:.4f}")
    else:
        print(f"\n{model}: sin resultados")



=== Resultados finales ===

Prophet:
  MAE: 31094490.5729
  RMSE: 70200229.4628
  R2: -0.0283
  MAPE: 63084.9990
  SMAPE: 138.0917

RandomForest:
  MAE: 8637875.7723
  RMSE: 36307239.4780
  R2: 0.7189
  MAPE: 340.0481
  SMAPE: 47.8957

XGBoost:
  MAE: 6559937.1293
  RMSE: 25350348.5323
  R2: 0.7965
  MAPE: 234.9338
  SMAPE: 50.2811

LSTM:
  MAE: 6981599.9816
  RMSE: 15296019.9750
  R2: 0.9025
  MAPE: 1859.9406
  SMAPE: 71.0238


# PRUEBA con optimización de parámetros

In [127]:
# =============================================================================
# 4. OPTIMIZACIÓN DE MODELOS
# =============================================================================

def optimize_random_forest(X_train, y_train):
    param_dist = {
        "n_estimators": [100, 200, 300],
        "max_depth": [None, 10, 20],
        "min_samples_leaf": [1, 2, 5],
    }
    rf = RandomForestRegressor(random_state=42, n_jobs=-1)
    search = RandomizedSearchCV(rf, param_distributions=param_dist,
                                n_iter=5, scoring="neg_mean_absolute_error",
                                cv=3, random_state=42)
    search.fit(X_train, y_train)
    return search.best_estimator_

def optimize_xgboost(X_train, y_train):
    param_grid = {
        "n_estimators": [100, 200, 500],
        "max_depth": [3, 5, 7],
        "learning_rate": [0.01, 0.1, 0.2],
        "subsample": [0.7, 0.9, 1.0],
    }

    best_model, best_mae = None, float("inf")

    for n in param_grid["n_estimators"]:
        for d in param_grid["max_depth"]:
            for lr in param_grid["learning_rate"]:
                for subs in param_grid["subsample"]:
                    model = xgb.XGBRegressor(
                        n_estimators=n,
                        max_depth=d,
                        learning_rate=lr,
                        subsample=subs,
                        random_state=42,
                        n_jobs=-1
                    )
                    model.fit(X_train, y_train)
                    preds = model.predict(X_train)
                    mae = mean_absolute_error(y_train, preds)
                    if mae < best_mae:
                        best_mae = mae
                        best_model = model
    return best_model

def optimize_lstm(X_train, y_train, look_back):
    param_grid = [
        {"units": 50, "dropout": 0.2, "epochs": 10, "batch_size": 16},
        {"units": 100, "dropout": 0.3, "epochs": 20, "batch_size": 32},
    ]

    best_model, best_mae = None, float("inf")

    for params in param_grid:
        model = tf.keras.Sequential([
            tf.keras.layers.LSTM(params["units"], activation="relu",
                                 input_shape=(look_back, X_train.shape[2])),
            tf.keras.layers.Dropout(params["dropout"]),
            tf.keras.layers.Dense(1)
        ])
        model.compile(optimizer="adam", loss="mae")
        model.fit(X_train, y_train,
                  epochs=params["epochs"],
                  batch_size=params["batch_size"],
                  verbose=0)

        preds = model.predict(X_train, verbose=0).flatten()
        mae = mean_absolute_error(y_train, preds)

        if mae < best_mae:
            best_mae = mae
            best_model = model

    return best_model

def optimize_prophet(train_df, target):
    param_grid = [
        {"changepoint_prior_scale": 0.01, "seasonality_prior_scale": 5},
        {"changepoint_prior_scale": 0.1, "seasonality_prior_scale": 10},
        {"changepoint_prior_scale": 0.5, "seasonality_prior_scale": 15},
    ]

    best_model, best_mae = None, float("inf")

    for params in param_grid:
        m = Prophet(
            yearly_seasonality=True,
            daily_seasonality=False,
            changepoint_prior_scale=params["changepoint_prior_scale"],
            seasonality_prior_scale=params["seasonality_prior_scale"]
        )
        prophet_train = train_df.rename(columns={"year": "ds", target: "y"})
        m.fit(prophet_train)
        forecast = m.predict(prophet_train[["ds"]])
        mae = mean_absolute_error(prophet_train["y"], forecast["yhat"])
        if mae < best_mae:
            best_mae = mae
            best_model = m

    return best_model

In [128]:
# =============================================================================
# BLOQUE MÉTRICAS
# =============================================================================

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def smape(y_true, y_pred):
    """Symmetric Mean Absolute Percentage Error"""
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    return 100/len(y_true) * np.sum(2 * np.abs(y_pred - y_true) / 
                                    (np.abs(y_true) + np.abs(y_pred) + 1e-6))

def compute_metrics(y_true, y_pred):
    if len(y_true) == 0 or len(y_pred) == 0:
        return None
    return {
        "MAE": mean_absolute_error(y_true, y_pred),
        "RMSE": np.sqrt(mean_squared_error(y_true, y_pred)),
        "R2": r2_score(y_true, y_pred),
        "SMAPE": smape(y_true, y_pred)
    }

In [134]:
# =============================================================================
# 5. EJECUCIÓN DE MODELOS (USA OPTIMIZACIÓN)
# =============================================================================

def run_random_forest(X_train, y_train, X_test, y_test):
    model = optimize_random_forest(X_train, y_train)
    y_pred = model.predict(X_test)
    return y_test, y_pred, model.get_params()

def run_xgboost(X_train, y_train, X_test, y_test):
    model = optimize_xgboost(X_train, y_train)
    y_pred = model.predict(X_test)
    return y_test, y_pred, model.get_params()

def run_lstm(X_train, y_train, X_test, y_test, look_back):
    param_grid = [
        {"units": 50, "dropout": 0.2, "epochs": 10, "batch_size": 16},
        {"units": 100, "dropout": 0.3, "epochs": 20, "batch_size": 32},
    ]

    best_model, best_mae, best_params = None, float("inf"), None

    for params in param_grid:
        model = tf.keras.Sequential([
            tf.keras.layers.LSTM(params["units"], activation="relu",
                                 input_shape=(look_back, X_train.shape[2])),
            tf.keras.layers.Dropout(params["dropout"]),
            tf.keras.layers.Dense(1)
        ])
        model.compile(optimizer="adam", loss="mae")
        history = model.fit(
            X_train, y_train,
            epochs=params["epochs"],
            batch_size=params["batch_size"],
            verbose=0
        )

        preds = model.predict(X_train, verbose=0).flatten()
        mae = mean_absolute_error(y_train, preds)

        if mae < best_mae:
            best_mae = mae
            best_model = model
            best_params = params  # 🔹 Guardamos directamente los parámetros probados

    if best_model is None:
        return [], [], {}

    # Predicciones finales con el mejor modelo
    y_pred = best_model.predict(X_test, verbose=0).flatten()

    return y_test, y_pred, best_params  # 🔹 Retornamos params directamente


def run_prophet(train_df, test_df, target, regressors=None):
    """
    Entrena y evalúa Prophet en un fold.
    - train_df y test_df: subconjuntos de datos (DataFrames).
    - target: variable objetivo.
    - regressors: lista opcional de regresores externos.
    """
    # Preparar datasets
    prophet_train = train_df[['year', target]].rename(columns={'year': 'ds', target: 'y'})
    prophet_test = test_df[['year', target]].rename(columns={'year': 'ds', target: 'y'})

    # Convertir a datetime para Prophet
    prophet_train['ds'] = pd.to_datetime(prophet_train['ds'], format='%Y')
    prophet_test['ds'] = pd.to_datetime(prophet_test['ds'], format='%Y')

    # Inicializar Prophet
    m = Prophet(
        yearly_seasonality=True,
        daily_seasonality=False,
        changepoint_prior_scale=0.1,
        seasonality_prior_scale=10
    )

    # Agregar regresores si existen
    if regressors:
        for reg in regressors:
            if reg in train_df.columns:
                m.add_regressor(reg)

    # Ajustar el modelo
    train_with_regs = prophet_train.copy()
    if regressors:
        for reg in regressors:
            train_with_regs[reg] = train_df[reg].values

    m.fit(train_with_regs)

    # Preparar datos futuros
    future = prophet_test[['ds']].copy()
    if regressors:
        for reg in regressors:
            future[reg] = test_df[reg].values

    forecast = m.predict(future)

    return prophet_test['y'].values, forecast['yhat'].values, {
        "changepoint_prior_scale": m.changepoint_prior_scale,
        "seasonality_prior_scale": m.seasonality_prior_scale,
        "regressors": regressors if regressors else []
    }


In [None]:
# =============================================================================
# 6. LOOP DE CROSS-VALIDATION
# =============================================================================

results_detailed = { "Prophet": [], "RandomForest": [], "XGBoost": [], "LSTM": [] }

for fold, (train_idx, test_idx) in enumerate(tscv.split(unique_years)):
    print(f"\n===== FOLD {fold+1}/{n_splits} =====")
    train_years = [unique_years[i] for i in train_idx]
    test_years = [unique_years[i] for i in test_idx]

    # === Datos tabulares ===
    mask_train = df_featured['year'].isin(train_years)
    mask_test = df_featured['year'].isin(test_years)
    X_train_tab = X.loc[mask_train].select_dtypes(include=np.number)
    y_train_tab = y.loc[mask_train]
    X_test_tab = X.loc[mask_test].select_dtypes(include=np.number)
    y_test_tab = y.loc[mask_test]

    # Preprocesamiento
    imputer = IterativeImputer(max_iter=10, random_state=42)
    scaler = StandardScaler()
    X_train_tab = scaler.fit_transform(imputer.fit_transform(X_train_tab))
    X_test_tab = scaler.transform(imputer.transform(X_test_tab))

    # --- Random Forest ---
    y_true_rf, y_pred_rf, params_rf = run_random_forest(X_train_tab, y_train_tab, X_test_tab, y_test_tab)
    metrics_rf = compute_metrics(y_true_rf, y_pred_rf)
    results_detailed["RandomForest"].append({**metrics_rf, "params": params_rf, "fold": fold+1})

    # --- XGBoost ---
    y_true_xgb, y_pred_xgb, params_xgb = run_xgboost(X_train_tab, y_train_tab, X_test_tab, y_test_tab)
    metrics_xgb = compute_metrics(y_true_xgb, y_pred_xgb)
    results_detailed["XGBoost"].append({**metrics_xgb, "params": params_xgb, "fold": fold+1})

    # --- LSTM ---
    mask_train_lstm = np.isin(years_seq, train_years)
    mask_test_lstm = np.isin(years_seq, test_years)
    X_train_lstm, y_train_lstm = X_seq[mask_train_lstm], y_seq[mask_train_lstm]
    X_test_lstm, y_test_lstm = X_seq[mask_test_lstm], y_seq[mask_test_lstm]
    y_true_lstm, y_pred_lstm, params_lstm = run_lstm(X_train_lstm, y_train_lstm, X_test_lstm, y_test_lstm, look_back)
    if len(y_true_lstm) > 0:
        metrics_lstm = compute_metrics(y_true_lstm, y_pred_lstm)
        results_detailed["LSTM"].append({**metrics_lstm, "params": params_lstm, "fold": fold+1})

    # Definir regresores opcionales (ejemplo: variables socioeconómicas)
prophet_regressors = ["pib_per_capita", "gasto_educacion_gobierno", "PCA1", "PCA2", "superficie_total_km2",  "gasto_educacion_pib", "gasto_educacion_gobierno"]

# Prophet
try:
    prophet_train = df_featured[df_featured["year"].isin(train_years)]
    prophet_test = df_featured[df_featured["year"].isin(test_years)]
    y_true_prophet, y_pred_prophet, params_prophet = run_prophet(
        prophet_train, prophet_test, TARGET, regressors=prophet_regressors
    )
    metrics_prophet = compute_metrics(y_true_prophet, y_pred_prophet)
    results_detailed["Prophet"].append({**metrics_prophet, "params": params_prophet, "fold": fold+1})
except Exception as e:
    print(f"⚠️ Prophet falló en fold {fold+1}: {e}")    


===== FOLD 1/5 =====


2025-09-12 15:13:15.595891: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.
2025-09-12 15:13:15.758547: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:954] model_pruner failed: INVALID_ARGUMENT: Graph does not contain terminal node AssignAddVariableOp_10.
2025-09-12 15:13:20.554036: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.




2025-09-12 15:13:21.514152: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.
2025-09-12 15:13:21.681781: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:954] model_pruner failed: INVALID_ARGUMENT: Graph does not contain terminal node AssignAddVariableOp_10.
2025-09-12 15:13:26.090486: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.



===== FOLD 2/5 =====


2025-09-12 15:13:47.367068: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.
2025-09-12 15:13:47.558714: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:954] model_pruner failed: INVALID_ARGUMENT: Graph does not contain terminal node AssignAddVariableOp_10.
2025-09-12 15:13:54.112326: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.




2025-09-12 15:13:55.074448: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.
2025-09-12 15:13:55.261764: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:954] model_pruner failed: INVALID_ARGUMENT: Graph does not contain terminal node AssignAddVariableOp_10.
2025-09-12 15:14:01.902767: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.



===== FOLD 3/5 =====


2025-09-12 15:14:23.683814: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.
2025-09-12 15:14:23.865095: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:954] model_pruner failed: INVALID_ARGUMENT: Graph does not contain terminal node AssignAddVariableOp_10.
2025-09-12 15:14:31.936144: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.




2025-09-12 15:14:32.915556: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.
2025-09-12 15:14:33.099831: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:954] model_pruner failed: INVALID_ARGUMENT: Graph does not contain terminal node AssignAddVariableOp_10.
2025-09-12 15:14:41.110533: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.



===== FOLD 4/5 =====


2025-09-12 15:15:04.224427: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.
2025-09-12 15:15:04.398472: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:954] model_pruner failed: INVALID_ARGUMENT: Graph does not contain terminal node AssignAddVariableOp_10.
2025-09-12 15:15:14.843554: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.




2025-09-12 15:15:15.834703: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.
2025-09-12 15:15:16.011354: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:954] model_pruner failed: INVALID_ARGUMENT: Graph does not contain terminal node AssignAddVariableOp_10.
2025-09-12 15:15:26.207150: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.



===== FOLD 5/5 =====


2025-09-12 15:15:47.308662: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.
2025-09-12 15:15:47.461187: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:954] model_pruner failed: INVALID_ARGUMENT: Graph does not contain terminal node AssignAddVariableOp_10.
2025-09-12 15:15:59.653423: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.




2025-09-12 15:16:00.702911: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.
2025-09-12 15:16:00.857215: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:954] model_pruner failed: INVALID_ARGUMENT: Graph does not contain terminal node AssignAddVariableOp_10.
2025-09-12 15:16:13.150570: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.
15:16:13 - cmdstanpy - INFO - Chain [1] start processing
15:16:13 - cmdstanpy - INFO - Chain [1] done processing


In [137]:
# =============================================================================
# 7. EVALUACIÓN DE MÉTRICAS
# =============================================================================
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
# =============================================================================
# BLOQUE DE RESUMEN FINAL
# =============================================================================
all_results = []
for model, folds in results_detailed.items():
    for res in folds:
        res["model"] = model
        all_results.append(res)

df_results = pd.DataFrame(all_results)

print("\n=== Resultados Detallados ===")
print(df_results[["model", "fold", "MAE", "RMSE", "R2", "SMAPE", "params"]])

print("\n=== Promedio por modelo ===")
print(df_results.groupby("model")[["MAE", "RMSE", "R2", "SMAPE"]].mean())



=== Resultados Detallados ===
           model  fold           MAE          RMSE        R2       SMAPE  \
0        Prophet     5  5.593730e+07  1.279492e+08 -0.022634  129.957022   
1   RandomForest     1  4.371994e+06  1.812670e+07  0.622490   78.203109   
2   RandomForest     2  6.735821e+06  2.653531e+07  0.691307   66.670576   
3   RandomForest     3  6.285663e+06  1.826668e+07  0.894061   45.563590   
4   RandomForest     4  1.327231e+07  5.075060e+07  0.652979   27.537785   
5   RandomForest     5  1.907566e+07  7.744264e+07  0.625368   26.131990   
6        XGBoost     1  4.234149e+06  1.693988e+07  0.670306   87.087511   
7        XGBoost     2  5.008329e+06  1.688554e+07  0.875001   75.490959   
8        XGBoost     3  4.311654e+06  9.564915e+06  0.970953   63.742064   
9        XGBoost     4  9.817560e+06  3.704550e+07  0.815097   48.023371   
10       XGBoost     5  9.058047e+06  3.694684e+07  0.914729   33.983425   
11          LSTM     1  3.442083e+06  9.463904e+06  0.897