# Modeling: Next Order DIGITAL (100% PySpark)

This notebook implements the **next steps** that follow the EDA:
- Building **behavioral features** (simplified RFM and transactional aggregates).
- Preparing a **client-month dataset** with the **label `next order = DIGITAL`**.
- Training a simple supervised model (Spark ML **Logistic Regression**).
- **Evaluation** with standard metrics (ROC AUC, PR AUC, Accuracy, Precision, Recall, F1).

## 1) Spark session and parameters

In [None]:
from pyspark.sql import SparkSession, functions as F, types as T, Window
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler, Imputer
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

spark = (
    SparkSession.builder
    .appName("modelado-proximo-pedido-digital")
    .master("local[*]")
    .config("spark.sql.execution.arrow.pyspark.enabled", "true")
    .config("spark.driver.memory", "4g")
    .config("spark.sql.warehouse.dir", "./spark-warehouse")
    .getOrCreate()
)
spark.sparkContext.setLogLevel("WARN")

DATA_DIR = "dataset/dataset"     # change this if your path is different
TEST_START_YM = "2024-01"        # train < 2024-01 ; test >= 2024-01

print("Spark version:", spark.version)

## 2) Loading and minimal preparation

In [None]:
df = spark.read.parquet(DATA_DIR)

expected = [
    "cliente_id","pais_cd","region_comercial_txt","agencia_id","ruta_id",
    "tipo_cliente_cd","madurez_digital_cd","estrellas_txt","frecuencia_visitas_cd",
    "fecha_pedido_dt","canal_pedido_cd","facturacion_usd_val",
    "materiales_distintos_val","cajas_fisicas"
]
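present = [c for c in expected if c in df.columns]
missing = [c for c in expected if c not in df.columns]
print("Columns present:", present)
if missing:
    # Later cells reference these columns by name and will fail without them.
    print("Missing columns:", missing)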

df = (
    df
    .withColumn("month_first", F.trunc("fecha_pedido_dt", "month"))
    .withColumn("ym", F.date_format("month_first", "yyyy-MM"))
    .withColumn("is_digital", F.when(F.col("canal_pedido_cd") == "DIGITAL", 1).otherwise(0))
)
df.select("cliente_id","fecha_pedido_dt","ym","canal_pedido_cd","is_digital").show(5, truncate=False)

## 3) Label: **is the client's next order DIGITAL?**

We take the **last order of each client-month** and use whether the client's **next order** is DIGITAL as the label (`label`).

In [None]:
w_client_order = Window.partitionBy("cliente_id").orderBy(F.col("fecha_pedido_dt").asc())
w_client_month_desc = Window.partitionBy("cliente_id","month_first").orderBy(F.col("fecha_pedido_dt").desc())

orders = (
    df
    .withColumn("prev_dt", F.lag("fecha_pedido_dt").over(w_client_order))
    .withColumn("next_canal", F.lead("canal_pedido_cd").over(w_client_order))
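    # Cast the comparison so the label stays null when the client has no next order.
    # With .otherwise(0) those rows would be labeled 0 and the isNotNull filter on
    # `label` in section 4 would never remove them.
    .withColumn("next_is_digital", (F.col("next_canal") == "DIGITAL").cast("int"))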
    .withColumn("recency_days", F.datediff(F.col("fecha_pedido_dt"), F.col("prev_dt")))
    .withColumn("rn_month_desc", F.row_number().over(w_client_month_desc))
)

last_in_month = (
    orders
    .filter(F.col("rn_month_desc")==1)
    .select("cliente_id","month_first","ym",
            F.col("recency_days").alias("recency_days_last"),
            F.col("next_is_digital").alias("label"))
)
last_in_month.show(5, truncate=False)
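
The labeling rule can be easier to check on a toy example. The following is a minimal plain-Python sketch of the same idea, not the Spark implementation: the function name `label_last_order_per_month`, the toy orders, and the non-digital channel names (`TELEFONO`, `VENDEDOR`) are all hypothetical. The sketch marks a client's final order with `None`, since there is no next order to observe.

```python
from collections import defaultdict
from datetime import date


def label_last_order_per_month(orders):
    """orders: iterable of (cliente_id, fecha, canal) tuples.

    Returns {(cliente_id, 'yyyy-MM'): label}, where label is 1 or 0,
    or None when the client has no later order.
    """
    by_client = defaultdict(list)
    for cliente_id, fecha, canal in orders:
        by_client[cliente_id].append((fecha, canal))

    labels = {}
    for cliente_id, rows in by_client.items():
        rows.sort(key=lambda r: r[0])  # chronological order per client
        for i, (fecha, _canal) in enumerate(rows):
            ym = fecha.strftime("%Y-%m")
            # Channel of the following order, if any (the role F.lead plays in Spark).
            nxt = rows[i + 1][1] if i + 1 < len(rows) else None
            # Later orders in the same month overwrite earlier ones, so the value
            # that survives belongs to the last order of that month.
            labels[(cliente_id, ym)] = None if nxt is None else int(nxt == "DIGITAL")
    return labels


# Hypothetical toy data for a single client.
toy_orders = [
    ("c1", date(2023, 11, 3), "TELEFONO"),
    ("c1", date(2023, 11, 20), "DIGITAL"),
    ("c1", date(2023, 12, 5), "DIGITAL"),
    ("c1", date(2024, 1, 9), "VENDEDOR"),
]
toy_labels = label_last_order_per_month(toy_orders)
print(toy_labels)
```

Here November is labeled 1 because the order after November 20 is DIGITAL, December is labeled 0, and January has no label.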

## 4) **Client-month** features

In [None]:
monthly_agg = (
    df.groupBy("cliente_id","month_first","ym")
      .agg(
          F.count("*").alias("n_orders"),
          F.avg("is_digital").alias("digital_ratio"),
          F.sum(F.col("facturacion_usd_val").cast("double")).alias("sum_fact"),
          F.avg(F.col("facturacion_usd_val").cast("double")).alias("avg_fact"),
          F.sum(F.col("cajas_fisicas").cast("double")).alias("sum_cajas"),
          F.avg(F.col("cajas_fisicas").cast("double")).alias("avg_cajas"),
          F.avg(F.col("materiales_distintos_val").cast("double")).alias("avg_mat_dist"),
          F.first("tipo_cliente_cd", ignorenulls=True).alias("tipo_cliente_cd"),
          F.first("madurez_digital_cd", ignorenulls=True).alias("madurez_digital_cd"),
          F.first("frecuencia_visitas_cd", ignorenulls=True).alias("frecuencia_visitas_cd"),
          F.first("pais_cd", ignorenulls=True).alias("pais_cd"),
          F.first("region_comercial_txt", ignorenulls=True).alias("region_comercial_txt")
      )
)

w_client_month = Window.partitionBy("cliente_id").orderBy(F.col("month_first").asc())

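ds = (monthly_agg
      .join(last_in_month, on=["cliente_id","month_first","ym"], how="left")
      # The lag runs over the client's existing rows, so it is the previous month
      # with orders, which is not necessarily the previous calendar month.
      .withColumn("lag1_digital_ratio", F.lag("digital_ratio", 1).over(w_client_month))
      # Keep only client-months that have a label.
      .filter(F.col("label").isNotNull()))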

ds.select("cliente_id","ym","n_orders","digital_ratio","lag1_digital_ratio","sum_fact","avg_cajas","recency_days_last","label").show(5, truncate=False)
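
Because `monthly_agg` only contains months in which the client placed orders, a one-row lag looks back to the previous *active* month. The plain-Python sketch below illustrates that behavior on hypothetical data; `lag1_by_active_month` is an illustrative helper, not part of the notebook.

```python
def lag1_by_active_month(ratios_by_month):
    """ratios_by_month: {'yyyy-MM': digital_ratio} for one client,
    containing only months that have orders.

    Returns {'yyyy-MM': ratio of the previous active month, or None}.
    """
    lagged = {}
    previous = None
    for ym in sorted(ratios_by_month):  # zero-padded 'yyyy-MM' sorts chronologically
        lagged[ym] = previous
        previous = ratios_by_month[ym]
    return lagged


# Hypothetical client with no orders in October or November.
toy_ratios = {"2023-08": 0.5, "2023-09": 1.0, "2023-12": 0.0}
lagged = lag1_by_active_month(toy_ratios)
print(lagged)
```

In this toy case December receives September's ratio. If a strict calendar lag is wanted, one option is to build a dense client-month grid before applying the lag.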

## 5) Temporal split: train / test

In [None]:
train = ds.filter(F.col("ym") < F.lit(TEST_START_YM))
test  = ds.filter(F.col("ym") >= F.lit(TEST_START_YM))

print("Train rows:", train.count(), " | Test rows:", test.count())
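
The split compares `ym` strings directly. That works because the `yyyy-MM` format is zero-padded, so lexicographic order matches chronological order. A small plain-Python sketch with hypothetical rows (`split_by_month` is an illustrative helper):

```python
def split_by_month(rows, test_start_ym):
    """Split dict rows on their 'ym' string: train strictly before the cutoff, test from it onward."""
    train = [r for r in rows if r["ym"] < test_start_ym]
    test = [r for r in rows if r["ym"] >= test_start_ym]
    return train, test


rows = [{"ym": "2023-09"}, {"ym": "2023-12"}, {"ym": "2024-01"}, {"ym": "2024-03"}]
train_rows, test_rows = split_by_month(rows, "2024-01")
print(len(train_rows), len(test_rows))

# Without zero-padding the string comparison breaks: '9' sorts after '1'.
unpadded_is_ordered = "2023-9" < "2023-10"
print(unpadded_is_ordered)
```

The last comparison evaluates to `False`, which is why the month key should always keep the two-digit month.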

## 6) Modeling pipeline (Logistic Regression)

In [None]:
num_cols = ["n_orders","digital_ratio","lag1_digital_ratio","sum_fact","avg_fact","sum_cajas","avg_cajas","avg_mat_dist","recency_days_last"]
cat_cols = ["tipo_cliente_cd","madurez_digital_cd","frecuencia_visitas_cd","pais_cd","region_comercial_txt"]

imputer = Imputer(inputCols=num_cols, outputCols=[c + "_imp" for c in num_cols])
indexers = [StringIndexer(inputCol=c, outputCol=c + "_idx", handleInvalid="keep") for c in cat_cols]
encoders = [OneHotEncoder(inputCols=[c + "_idx"], outputCols=[c + "_oh"]) for c in cat_cols]
assembler = VectorAssembler(inputCols=[c + "_imp" for c in num_cols] + [c + "_oh" for c in cat_cols],
                            outputCol="features")

lr = LogisticRegression(featuresCol="features", labelCol="label", maxIter=50, regParam=0.01)
pipeline = Pipeline(stages=[imputer] + indexers + encoders + [assembler, lr])

model = pipeline.fit(train)
print(model.stages[-1])
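
As a reading aid, the objective behind the `regParam` setting can be written out. This is a sketch under stated assumptions: binary labels $y_i \in \{0, 1\}$, the default `elasticNetParam = 0.0` (pure L2 penalty), and $\lambda$ standing for `regParam` (0.01 here). It leaves out implementation details such as feature standardization.

$$
\min_{w,\,b}\;\; \frac{1}{n}\sum_{i=1}^{n}\Big[-y_i \log p_i - (1-y_i)\log(1-p_i)\Big] \;+\; \frac{\lambda}{2}\,\lVert w\rVert_2^2,
\qquad
p_i = \frac{1}{1+e^{-(w^\top x_i + b)}}
$$

A larger $\lambda$ shrinks the coefficients $w$ toward zero, trading some fit on the training months for stability on later ones.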

## 7) Model evaluation

In [None]:
pred_test = model.transform(test).cache()
pred_test.select("cliente_id","ym","label","probability","prediction").show(5, truncate=False)

e_auc = BinaryClassificationEvaluator(labelCol="label", rawPredictionCol="rawPrediction", metricName="areaUnderROC")
e_pr  = BinaryClassificationEvaluator(labelCol="label", rawPredictionCol="rawPrediction", metricName="areaUnderPR")

auc = e_auc.evaluate(pred_test)
aupr = e_pr.evaluate(pred_test)

# Metrics derived at the default threshold of 0.5
cm = pred_test.groupBy("label","prediction").count().toPandas()

tp = int(cm[(cm["label"]==1) & (cm["prediction"]==1)]["count"].sum())
tn = int(cm[(cm["label"]==0) & (cm["prediction"]==0)]["count"].sum())
fp = int(cm[(cm["label"]==0) & (cm["prediction"]==1)]["count"].sum())
fn = int(cm[(cm["label"]==1) & (cm["prediction"]==0)]["count"].sum())

accuracy  = (tp + tn) / max(tp + tn + fp + fn, 1)
precision = tp / max(tp + fp, 1)
recall    = tp / max(tp + fn, 1)
f1        = (2 * precision * recall) / max(precision + recall, 1e-9)

print(f"AUC ROC : {auc:.4f}")
print(f"AUC PR  : {aupr:.4f}")
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall  : {recall:.4f}")
print(f"F1      : {f1:.4f}")

print("\nConfusion matrix (label x prediction):")
print(cm.sort_values(['label','prediction']))
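
Precision, Recall, and F1 above are tied to the 0.5 threshold. The plain-Python sketch below shows how the same formulas respond to other thresholds. It assumes a small list of hypothetical `(p_digital, label)` pairs and, as a convention of this sketch only, predicts positive when the score is greater than or equal to the threshold.

```python
def metrics_at_threshold(scored, threshold):
    """scored: list of (p_digital, label) pairs. Returns (precision, recall, f1)."""
    tp = sum(1 for p, y in scored if p >= threshold and y == 1)
    fp = sum(1 for p, y in scored if p >= threshold and y == 0)
    fn = sum(1 for p, y in scored if p < threshold and y == 1)
    # Same zero-division guards as the notebook cell above.
    precision = tp / max(tp + fp, 1)
    recall = tp / max(tp + fn, 1)
    f1 = (2 * precision * recall) / max(precision + recall, 1e-9)
    return precision, recall, f1


# Hypothetical scores; in practice they would come from the `probability` column.
toy_scored = [(0.9, 1), (0.7, 1), (0.6, 0), (0.4, 1), (0.2, 0)]
for t in (0.3, 0.5, 0.8):
    p, r, f = metrics_at_threshold(toy_scored, t)
    print(f"threshold={t:.1f} precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

On this toy data, lowering the threshold to 0.3 raises recall to 1.0 at a precision of 0.75, while raising it to 0.8 gives a precision of 1.0 with recall of one third.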

## 8) Next steps

- Tune the **decision threshold** using the PR curve (depending on whether Precision or Recall is the priority).
- More **feature engineering** (3–6 month rolling windows, growth, seasonality).
- Temporal **backtesting** with multiple cutoffs.
- Try **tree/ensemble** models (RandomForest, GBT).
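
One possible shape for the backtesting item is an expanding window: train on every month before a cutoff, test on the cutoff month, then move the cutoff forward. The sketch below only generates the folds in plain Python. The helpers `month_cutoffs` and `expanding_window_folds`, and the choice of a one-month test window, are assumptions for illustration.

```python
def month_cutoffs(first_test_ym, n_folds, step_months=1):
    """Return n_folds 'yyyy-MM' cutoffs, starting at first_test_ym and advancing step_months each time."""
    year, month = (int(part) for part in first_test_ym.split("-"))
    cutoffs = []
    for k in range(n_folds):
        total = year * 12 + (month - 1) + k * step_months  # months since year 0
        cutoffs.append(f"{total // 12:04d}-{total % 12 + 1:02d}")
    return cutoffs


def expanding_window_folds(months, cutoffs):
    """For each cutoff: (training months before it, test months equal to it)."""
    return [
        ([m for m in months if m < c], [m for m in months if m == c])
        for c in cutoffs
    ]


cutoffs = month_cutoffs("2023-11", n_folds=3)
folds = expanding_window_folds(["2023-10", "2023-11", "2023-12", "2024-01"], cutoffs)
for train_months, test_months in folds:
    print(train_months, "->", test_months)
```

Each fold's month lists could then drive `ym` filters like the ones in section 5, with the pipeline refit per fold and the metrics averaged across folds.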

## 9) Wrap-up

In [None]:
spark.stop()
print("Spark session stopped.")