
# Product Recommendation Feature Engineering

This notebook builds user-product features for a product recommendation ranking model on Databricks and writes them to the Databricks Feature Store.

---

## 1. Candidate Generation

- Load candidate products from `kusha_solutions.product_recomendation.gold_candidate_products`.

## 2. Candidate Source Flags

- Add a binary flag column named `src_<source>` for each value of `candidate_source`:
  - `user_history`
  - `same_category`
  - `brand_affinity`
  - `fbt`
  - `trending`
  - `age_group`
  - `location`
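
The flag logic can be illustrated outside Spark. This is a minimal plain-Python sketch, not the notebook's PySpark code; the helper name `source_flags` is hypothetical, and the source names are taken from the list above.

```python
# Plain-Python illustration of the one-hot source flags (not the Spark code itself).
SOURCES = [
    "user_history", "same_category", "brand_affinity", "fbt",
    "trending", "age_group", "location",
]

def source_flags(candidate_source):
    """Return {src_<name>: 0 or 1}; at most one flag is set per candidate row."""
    return {f"src_{name}": int(candidate_source == name) for name in SOURCES}

flags = source_flags("fbt")
unknown_flags = source_flags("something_else")  # unrecognized source: all flags 0
```

Because each row carries a single `candidate_source` value, at most one flag is set per row.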

## 3. Historical Interactions

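- Left-join candidates with historical sales data from `gold_sales_enriched` on `CustomerID` and `ProductID`.
- Keep only interactions strictly before the cutoff timestamp (`2025-12-01 00:00:00`), which must match the cutoff used for label generation.
- Aggregate interactions per user-product pair: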
  - `user_views`
  - `user_carts`
  - `user_purchases`
  - `last_interaction_ts`
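
The sketch below illustrates the cutoff filter and the per-pair aggregation in plain Python. The events are made up for illustration and `aggregate` is a hypothetical helper. The interaction type strings (`view`, `add_to_cart`, `purchase`) are the ones the notebook code matches, case-insensitively.

```python
from collections import defaultdict
from datetime import datetime

CUTOFF = datetime(2025, 12, 1)

# Hypothetical events: (CustomerID, ProductID, InteractionType, EventTime)
events = [
    ("c1", "p1", "View", datetime(2025, 11, 28, 9, 0)),
    ("c1", "p1", "purchase", datetime(2025, 11, 29, 10, 0)),
    ("c1", "p1", "view", datetime(2025, 12, 1, 0, 0)),  # exactly at the cutoff: excluded
    ("c2", "p1", "add_to_cart", datetime(2025, 11, 1, 8, 0)),
]

COLUMN_FOR = {"view": "user_views", "add_to_cart": "user_carts", "purchase": "user_purchases"}

def aggregate(events, cutoff):
    """Count interactions per (CustomerID, ProductID) strictly before the cutoff."""
    stats = defaultdict(lambda: {
        "user_views": 0, "user_carts": 0, "user_purchases": 0,
        "last_interaction_ts": None,
    })
    for customer, product, kind, ts in events:
        if ts >= cutoff:  # strict "<" filter: events at or after the cutoff are dropped
            continue
        row = stats[(customer, product)]
        column = COLUMN_FOR.get(kind.lower())
        if column is not None:
            row[column] += 1
        if row["last_interaction_ts"] is None or ts > row["last_interaction_ts"]:
            row["last_interaction_ts"] = ts
    return dict(stats)

history = aggregate(events, CUTOFF)
```

The event stamped exactly at the cutoff is excluded, so it contributes to neither the counts nor `last_interaction_ts`.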

## 4. Feature Transformation

- Add recency flag: `recent_7d_interaction` (1 if the last interaction falls within the 7 days before the cutoff, otherwise 0, including when there is no interaction).
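
As a rough illustration of the recency rule, the sketch below mirrors a calendar-day difference in plain Python. It assumes the comparison is on dates, not exact timestamps, which is how Spark's `datediff` behaves. The function name is hypothetical.

```python
from datetime import datetime

def recent_7d_interaction(last_ts, reference):
    """Return 1 if last_ts is at most 7 calendar days before reference, else 0."""
    if last_ts is None:  # no interaction on record
        return 0
    days = (reference.date() - last_ts.date()).days
    return int(days <= 7)

cutoff = datetime(2025, 12, 1)
flag_recent = recent_7d_interaction(datetime(2025, 11, 24, 23, 59), cutoff)  # 7 days back
flag_stale = recent_7d_interaction(datetime(2025, 11, 23, 12, 0), cutoff)    # 8 days back
flag_missing = recent_7d_interaction(None, cutoff)
```

For online scoring, the same function would take the current time as `reference` instead of the fixed cutoff.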

## 5. Product Features

- Join with product features from `gold_product_features`:
  - `ProductRating`
  - `ReviewsCount` (log-transformed with `log1p` into `log_reviews`)
  - `DiscountPercent` (also reduced to a binary flag, `is_discounted`, set when the discount is above zero)
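
The two derived product columns can be sketched in plain Python as follows. This is an illustration of the transformations only; the helper name is hypothetical, and missing inputs are passed through as `None` on the assumption that they are filled later.

```python
import math

def derive_product_features(reviews_count, discount_percent):
    """Compute log_reviews = log(1 + ReviewsCount) and the is_discounted flag."""
    log_reviews = None if reviews_count is None else math.log1p(reviews_count)
    is_discounted = int(discount_percent is not None and discount_percent > 0)
    return {"log_reviews": log_reviews, "is_discounted": is_discounted}

no_reviews = derive_product_features(0, 0)
popular = derive_product_features(99, 15.0)
missing = derive_product_features(None, None)
```

Using `log1p` instead of a plain logarithm keeps products with zero reviews well defined, since `log1p(0)` is 0.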

## 6. User Profile Features

- Join with user features from `gold_customers_with_age_group`:
  - `AvgReviewRating`
  - `AgeGroup` (ordinal-encoded as `age_group_encoded`; the raw column is dropped before writing)
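
The encoding amounts to a lookup table with a default. The sketch below uses the bucket labels that appear in the notebook code; the function name is hypothetical.

```python
# Ordinal codes for the age buckets used in the notebook code.
AGE_GROUP_CODES = {"18-24": 1, "25-34": 2, "35-44": 3, "45-54": 4, "55-64": 5}

def encode_age_group(age_group):
    """Map an AgeGroup label to its ordinal code; anything else becomes 0."""
    return AGE_GROUP_CODES.get(age_group, 0)

code_young = encode_age_group("18-24")
code_mid = encode_age_group("35-44")
code_other = encode_age_group("65+")   # not one of the five listed buckets
code_missing = encode_age_group(None)
```

Any label outside the five listed buckets shares code 0 with missing values, so the model cannot tell them apart.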

## 7. Final Feature Set

- Combine all features.
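- Add `num_sources` (sum of the source flags, excluding `src_user_history`).
- Drop leakage-prone and unnecessary columns (`candidate_source`, `src_user_history`, `last_interaction_ts`, `AgeGroup`).
- Fill missing numeric values with zero.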
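
A minimal plain-Python sketch of this assembly step is shown below, operating on one hypothetical row represented as a dictionary. It assumes every remaining column with a missing value is numeric, so replacing `None` with 0 is safe.

```python
# Plain-Python illustration of the final assembly step (not the Spark code itself).
COUNTED_FLAGS = [
    "src_same_category", "src_brand_affinity", "src_fbt",
    "src_trending", "src_age_group", "src_location",
]
DROPPED = {"candidate_source", "src_user_history", "last_interaction_ts", "AgeGroup"}

def finalize(row):
    """Add num_sources, drop unwanted columns, and replace missing values with 0."""
    out = {name: value for name, value in row.items() if name not in DROPPED}
    out["num_sources"] = sum(row[flag] for flag in COUNTED_FLAGS)
    return {name: (0 if value is None else value) for name, value in out.items()}

# Hypothetical row for one user-product pair.
row = {
    "CustomerID": "c1", "ProductID": "p1", "candidate_source": "fbt",
    "src_user_history": 0, "src_same_category": 0, "src_brand_affinity": 0,
    "src_fbt": 1, "src_trending": 0, "src_age_group": 0, "src_location": 0,
    "user_views": 3, "ProductRating": None, "last_interaction_ts": None,
}
final_row = finalize(row)
```

Since the flags come from a single `candidate_source` value per row, this sum is 0 or 1 for each row unless flags are combined across sources upstream.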

## 8. Feature Store Integration

- Write final features to Feature Store table:
  - `kusha_solutions.product_recomendation.fs_canddiate_features`
  - Primary keys: `CustomerID`, `ProductID`
  - Description: Leakage-free user-product features for ranking model
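
Since the table is keyed on `CustomerID` and `ProductID`, it may be worth confirming that each pair appears only once before writing. The sketch below shows one way to express that check in plain Python; the helper and sample rows are hypothetical, and whether duplicates occur depends on how `gold_candidate_products` is built.

```python
from collections import Counter

def duplicate_keys(rows, keys=("CustomerID", "ProductID")):
    """Return the primary-key tuples that appear more than once in rows."""
    counts = Counter(tuple(row[k] for k in keys) for row in rows)
    return [key for key, n in counts.items() if n > 1]

# Hypothetical rows: the same pair proposed by two different candidate sources.
rows = [
    {"CustomerID": "c1", "ProductID": "p1", "src_fbt": 1, "src_trending": 0},
    {"CustomerID": "c1", "ProductID": "p1", "src_fbt": 0, "src_trending": 1},
    {"CustomerID": "c2", "ProductID": "p1", "src_fbt": 1, "src_trending": 0},
]
dupes = duplicate_keys(rows)
```

If a pair can arrive from several sources, one option is to collapse its rows first, for example by taking the maximum of each flag per pair.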

---

## Online Feature Updates

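- Repeat the feature engineering steps for online scoring using all interactions (no cutoff); recency is measured against the current timestamp.
- Use `write_table` with `mode="merge"` to upsert the refreshed rows into the existing Feature Store table.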
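
Conceptually, a merge write is an upsert on the primary keys: rows with matching keys are updated and rows with new keys are added. The plain-Python sketch below illustrates that idea only; it does not model the Feature Store implementation, and `merge_rows` is a hypothetical helper.

```python
def merge_rows(existing, updates, keys=("CustomerID", "ProductID")):
    """Upsert updates into existing, matching rows on the primary-key columns."""
    def key_of(row):
        return tuple(row[k] for k in keys)

    merged = {key_of(row): dict(row) for row in existing}
    for row in updates:
        merged[key_of(row)] = dict(row)  # replace a matching row or add a new one
    return list(merged.values())

existing = [
    {"CustomerID": "c1", "ProductID": "p1", "user_views": 2},
    {"CustomerID": "c2", "ProductID": "p1", "user_views": 5},
]
updates = [
    {"CustomerID": "c1", "ProductID": "p1", "user_views": 4},  # existing key: updated
    {"CustomerID": "c3", "ProductID": "p9", "user_views": 1},  # new key: inserted
]
table = merge_rows(existing, updates)
```

Rows whose keys are absent from the update, such as `("c2", "p1")` here, are left unchanged, which is the main difference from overwriting the whole table.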

---

## Display and Validation

- Display final feature set for validation.

---

In [0]:
from pyspark.sql import functions as F

gold = "kusha_solutions.product_recomendation"

# 🔒 GLOBAL CUTOFF (MUST MATCH LABEL GENERATION)
CUTOFF_TS = "2025-12-01 00:00:00"

# --------------------------------------------------
# BASE CANDIDATES
# --------------------------------------------------
candidates = spark.table(f"{gold}.gold_candidate_products")

# --------------------------------------------------
# CANDIDATE SOURCE FLAGS
# --------------------------------------------------
features_added = (
    candidates
    .withColumn("src_user_history", F.when(F.col("candidate_source") == "user_history", 1).otherwise(0))
    .withColumn("src_same_category", F.when(F.col("candidate_source") == "same_category", 1).otherwise(0))
    .withColumn("src_brand_affinity", F.when(F.col("candidate_source") == "brand_affinity", 1).otherwise(0))
    .withColumn("src_fbt", F.when(F.col("candidate_source") == "fbt", 1).otherwise(0))
    .withColumn("src_trending", F.when(F.col("candidate_source") == "trending", 1).otherwise(0))
    .withColumn("src_age_group", F.when(F.col("candidate_source") == "age_group", 1).otherwise(0))
    .withColumn("src_location", F.when(F.col("candidate_source") == "location", 1).otherwise(0))
)

# --------------------------------------------------
# HISTORICAL INTERACTIONS (STRICTLY BEFORE CUTOFF)
# --------------------------------------------------
sales_hist = (
    spark.table(f"{gold}.gold_sales_enriched")
         .select("CustomerID", "ProductID", "InteractionType", "EventTime")
         .filter(F.col("EventTime") < F.lit(CUTOFF_TS))
)

features_added = (
    features_added
    .join(
        sales_hist,
        ["CustomerID", "ProductID"],
        "left"
    )
    .groupBy(
        "CustomerID", "ProductID", "candidate_source",
        "src_user_history", "src_same_category", "src_brand_affinity",
        "src_fbt", "src_trending", "src_age_group", "src_location"
    )
    .agg(
        F.sum(F.when(F.lower("InteractionType") == "view", 1).otherwise(0)).alias("user_views"),
        F.sum(F.when(F.lower("InteractionType") == "add_to_cart", 1).otherwise(0)).alias("user_carts"),
        F.sum(F.when(F.lower("InteractionType") == "purchase", 1).otherwise(0)).alias("user_purchases"),
        F.max("EventTime").alias("last_interaction_ts")
    )
)


In [0]:
# 🚨 DO NOT DROP last_interaction_ts YET
features_dropped = (
    features_added
    .drop(
        "candidate_source",
        "src_user_history"   # leakage-prone
    )
)


In [0]:
from pyspark.sql import functions as F

# --------------------------------------------------
# SAFE RECENCY FLAG (RELATIVE TO CUTOFF)
# --------------------------------------------------
features_transformed = features_dropped.withColumn(
    "recent_7d_interaction",
    F.when(
        F.datediff(F.lit(CUTOFF_TS), F.col("last_interaction_ts")) <= 7,
        1
    ).otherwise(0)
)

# --------------------------------------------------
# PRODUCT FEATURES
# --------------------------------------------------
product_features = (
    spark.table(f"{gold}.gold_product_features")
         .select("ProductID", "ProductRating", "ReviewsCount", "DiscountPercent")
         .withColumn("log_reviews", F.log1p("ReviewsCount"))
         .withColumn("is_discounted", F.when(F.col("DiscountPercent") > 0, 1).otherwise(0))
)

# --------------------------------------------------
# USER PROFILE FEATURES
# --------------------------------------------------
user_features = (
    spark.table(f"{gold}.gold_customers_with_age_group")
         .select("CustomerID", "AvgReviewRating", "AgeGroup")
         .withColumn(
             "age_group_encoded",
             F.when(F.col("AgeGroup") == "18-24", 1)
              .when(F.col("AgeGroup") == "25-34", 2)
              .when(F.col("AgeGroup") == "35-44", 3)
              .when(F.col("AgeGroup") == "45-54", 4)
              .when(F.col("AgeGroup") == "55-64", 5)
              .otherwise(0)
         )
)

# --------------------------------------------------
# FINAL FEATURE SET
# --------------------------------------------------
features_final = (
    features_transformed
    .join(product_features, "ProductID", "left")
    .join(user_features, "CustomerID", "left")
    .withColumn(
        "num_sources",
        F.expr("""
            src_same_category +
            src_brand_affinity +
            src_fbt +
            src_trending +
            src_age_group +
            src_location
        """)
    )
    .drop("last_interaction_ts")   # ✅ DROP HERE
    .fillna(0)
)


In [0]:
display(features_final)

In [0]:
features_final = features_final.drop("AgeGroup")


In [0]:
from databricks.feature_store import FeatureStoreClient

fs = FeatureStoreClient()

fs.create_table(
    name="kusha_solutions.product_recomendation.fs_canddiate_features",
    primary_keys=["CustomerID", "ProductID"],
    df=features_final,
    description="Leakage-free user-product features for ranking model"
)


## **Feature Engineering for New Data**

In [0]:
from pyspark.sql import functions as F
from databricks.feature_store import FeatureStoreClient

gold = "kusha_solutions.product_recomendation"

# ============================================================
# 1️⃣ BASE CANDIDATES (ONLINE)
# ============================================================

candidates = spark.table(f"{gold}.gold_candidate_products")

# ============================================================
# 2️⃣ SOURCE FLAGS
# ============================================================

features_added = (
    candidates
    .withColumn("src_same_category",  F.when(F.col("candidate_source") == "same_category", 1).otherwise(0))
    .withColumn("src_brand_affinity", F.when(F.col("candidate_source") == "brand_affinity", 1).otherwise(0))
    .withColumn("src_fbt",            F.when(F.col("candidate_source") == "fbt", 1).otherwise(0))
    .withColumn("src_trending",       F.when(F.col("candidate_source") == "trending", 1).otherwise(0))
    .withColumn("src_age_group",      F.when(F.col("candidate_source") == "age_group", 1).otherwise(0))
    .withColumn("src_location",       F.when(F.col("candidate_source") == "location", 1).otherwise(0))
)

# ============================================================
# 3️⃣ HISTORICAL + LIVE INTERACTIONS (NO CUTOFF)
# ============================================================

sales_all = (
    spark.table(f"{gold}.gold_sales_enriched")
         .select("CustomerID", "ProductID", "InteractionType", "EventTime")
)

features_agg = (
    features_added
    .join(sales_all, ["CustomerID", "ProductID"], "left")
    .groupBy(
        "CustomerID", "ProductID",
        "src_same_category", "src_brand_affinity",
        "src_fbt", "src_trending",
        "src_age_group", "src_location"
    )
    .agg(
        F.sum(F.when(F.lower("InteractionType") == "view", 1).otherwise(0)).alias("user_views"),
        F.sum(F.when(F.lower("InteractionType") == "add_to_cart", 1).otherwise(0)).alias("user_carts"),
        F.sum(F.when(F.lower("InteractionType") == "purchase", 1).otherwise(0)).alias("user_purchases"),
        F.max("EventTime").alias("last_interaction_ts")
    )
)

# ============================================================
# 4️⃣ RECENCY FEATURE (RELATIVE TO NOW)
# ============================================================

features_agg = features_agg.withColumn(
    "recent_7d_interaction",
    F.when(
        F.datediff(F.current_timestamp(), F.col("last_interaction_ts")) <= 7,
        1
    ).otherwise(0)
)

# ============================================================
# 5️⃣ PRODUCT FEATURES
# ============================================================

product_features = (
    spark.table(f"{gold}.gold_product_features")
         .select("ProductID", "ProductRating", "ReviewsCount", "DiscountPercent")
         .withColumn("log_reviews", F.log1p("ReviewsCount"))
         .withColumn("is_discounted", F.when(F.col("DiscountPercent") > 0, 1).otherwise(0))
)

# ============================================================
# 6️⃣ USER FEATURES
# ============================================================

user_features = (
    spark.table(f"{gold}.gold_customers_with_age_group")
         .select("CustomerID", "AvgReviewRating", "AgeGroup")
         .withColumn(
             "age_group_encoded",
             F.when(F.col("AgeGroup") == "18-24", 1)
              .when(F.col("AgeGroup") == "25-34", 2)
              .when(F.col("AgeGroup") == "35-44", 3)
              .when(F.col("AgeGroup") == "45-54", 4)
              .when(F.col("AgeGroup") == "55-64", 5)
              .otherwise(0)
         )
)

# ============================================================
# 7️⃣ FINAL FEATURE SET
# ============================================================

features_final = (
    features_agg
    .join(product_features, "ProductID", "left")
    .join(user_features, "CustomerID", "left")
    .withColumn(
        "num_sources",
        F.expr("""
            src_same_category +
            src_brand_affinity +
            src_fbt +
            src_trending +
            src_age_group +
            src_location
        """)
    )
    .drop("last_interaction_ts", "AgeGroup")
    .fillna(0)
)

# ============================================================
# 8️⃣ WRITE TO FEATURE STORE (MERGE – VERY IMPORTANT)
# ============================================================

fs = FeatureStoreClient()

fs.write_table(
    name="kusha_solutions.product_recomendation.fs_canddiate_features",
    df=features_final,
    mode="merge"
)

print("✅ ONLINE Feature Store updated successfully")


In [0]:
df = spark.table("kusha_solutions.product_recomendation.fs_canddiate_features")
display(df)