
## Notebook Summary

This notebook covers the following steps for preparing labeled training data for product recommendation:

1. **Set Global Cutoff Timestamp:**  
   Defined a cutoff time (`CUTOFF_TS`) that separates training features from future purchases; it must match the cutoff used in feature engineering.

2. **Load Feature Data:**  
   Loaded leakage-free candidate features from the feature store table.

3. **Extract Future Purchases:**  
   Selected distinct (`CustomerID`, `ProductID`) purchase pairs with `EventTime` strictly after the cutoff timestamp to use as positive labels.

4. **Attach Labels:**  
   Left-joined features with future purchases on `CustomerID` and `ProductID`, assigning label `1` to candidate pairs purchased after the cutoff and `0` to all others.

5. **Sanity Check:**  
   Displayed label distribution to verify correct labeling.

6. **Save Labeled Data:**  
   Saved the final labeled dataset as the Delta table `training_labeled_candidates_v2` (overwriting any previous version) for downstream model training.
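
The labeling rule in steps 3 and 4 can be illustrated with a toy example in plain Python, independent of Spark. This is a minimal sketch under assumed data: the customer and product IDs, the event list, and the helper name `attach_labels` are hypothetical. Only the cutoff value and the rule itself (a purchase strictly after the cutoff gives label `1`, everything else gets `0`) come from the notebook.

```python
from datetime import datetime

CUTOFF_TS = datetime(2025, 12, 1, 0, 0, 0)  # same cutoff value as the notebook

# Hypothetical candidate (CustomerID, ProductID) pairs standing in for the feature table.
candidates = [("C1", "P1"), ("C1", "P2"), ("C2", "P1")]

# Hypothetical events: (CustomerID, ProductID, InteractionType, EventTime).
events = [
    ("C1", "P1", "Purchase", datetime(2025, 12, 3, 10, 0)),  # after cutoff -> positive
    ("C1", "P2", "purchase", datetime(2025, 11, 20, 9, 0)),  # before cutoff -> ignored
    ("C2", "P1", "view", datetime(2025, 12, 5, 8, 0)),       # not a purchase -> ignored
    ("C2", "P1", "purchase", datetime(2025, 12, 1, 0, 0)),   # exactly at cutoff -> ignored
]


def attach_labels(candidates, events, cutoff):
    """Return {(customer, product): label}; 1 if purchased strictly after cutoff, else 0."""
    future_purchases = {
        (cust, prod)
        for cust, prod, kind, ts in events
        if kind.lower() == "purchase" and ts > cutoff
    }
    # Plays the role of the left join followed by coalesce(label, 0).
    return {pair: int(pair in future_purchases) for pair in candidates}


labels = attach_labels(candidates, events, CUTOFF_TS)
print(labels)
```

Only the pair `("C1", "P1")` receives label `1` here. A candidate with no qualifying purchase still appears in the output with label `0`, which is what keeps negatives in the training set.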

In [0]:
from pyspark.sql import functions as F

gold = "kusha_solutions.product_recomendation"

# ============================================================
# 🔒 GLOBAL CUTOFF (MUST MATCH FEATURE ENGINEERING)
# ============================================================
CUTOFF_TS = "2025-12-01 00:00:00"

print("📌 Label cutoff time:", CUTOFF_TS)

# ============================================================
# 1️⃣ READ FINAL FEATURE DATA (LEAKAGE-FREE)
# ============================================================
features = spark.table(
    "kusha_solutions.product_recomendation.fs_canddiate_features"
)

print("✅ Features loaded")

# ============================================================
# 2️⃣ FUTURE PURCHASES (LABEL SOURCE)
#     → STRICTLY AFTER CUTOFF
# ============================================================
future_purchases = (
    spark.table(f"{gold}.gold_sales_enriched")
         .filter(F.col("EventTime") > F.lit(CUTOFF_TS))  # strictly after the cutoff
         .filter(F.lower(F.col("InteractionType")) == "purchase")
         .select("CustomerID", "ProductID")
         .distinct()
         .withColumn("label", F.lit(1))
)

print("✅ Future purchases extracted")

# ============================================================
# 3️⃣ LABEL ATTACHMENT
# ============================================================
labeled_data = (
    features
      .join(
          future_purchases,
          ["CustomerID", "ProductID"],
          "left"
      )
      .withColumn("label", F.coalesce(F.col("label"), F.lit(0)))
)

# ============================================================
# 4️⃣ SANITY CHECK
# ============================================================
print("📊 Label distribution:")
labeled_data.groupBy("label").count().show()

# ============================================================
# 5️⃣ SAVE LABELED DATA
# ============================================================
labeled_data.write \
    .mode("overwrite") \
    .format("delta") \
    .saveAsTable(f"{gold}.training_labeled_candidates_v2")

print("✅ Labeled data saved successfully")
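
The sanity check above only displays the label distribution. One possible extension is to turn it into an explicit check that fails when the labels look wrong. The sketch below is an assumption, not part of the original notebook: `summarize_labels` is a hypothetical helper, and the counts are illustrative numbers rather than real results. It runs in plain Python, and the comment shows how the counts might be collected from `labeled_data`.

```python
def summarize_labels(counts):
    """Validate a {label: row_count} mapping and return the positive rate.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    unexpected = set(counts) - {0, 1}
    if unexpected:
        raise ValueError(f"Unexpected label values: {unexpected}")
    positives = counts.get(1, 0)
    total = positives + counts.get(0, 0)
    if total == 0:
        raise ValueError("Labeled dataset is empty")
    if positives == 0:
        raise ValueError("No positive labels: check CUTOFF_TS and the purchase filter")
    return positives / total


# In the notebook, the counts could be collected from Spark like this:
# counts = {row["label"]: row["count"]
#           for row in labeled_data.groupBy("label").count().collect()}
counts = {0: 9500, 1: 500}  # illustrative numbers only, not real results
rate = summarize_labels(counts)
print(f"Positive rate: {rate:.2%}")
```

A missing positive class usually points to a cutoff later than the latest purchase or to a mismatch in the `InteractionType` filter, so raising an error at this stage would stop an unusable table from being written.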
