# **Exploratory Data Analysis (EDA)**

**Gold Layer**

This notebook performs exploratory data analysis on the Gold layer tables
produced by the Delta Live Tables (DLT) pipeline for the Product Recommendation System.

## Objectives:
- Understand user behavior and interaction patterns
- Validate multi-product orders for Frequently Bought Together (FBT) recommendations
- Analyze category, age-group, and price behavior
- Prepare insights for feature engineering and model training
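
The multi-product order check in the objectives above can be sketched on toy data. This is a minimal illustration rather than the notebook's actual logic: the rows are made up, `OrderID` is an assumed column name (only `ProductID` appears in the cells below), and `basket_stats` is a hypothetical helper. It reports the share of orders containing at least two distinct products and counts co-purchased product pairs, which are the raw signal an FBT recommender needs.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Toy rows standing in for Gold-layer sales records (made-up data).
# "OrderID" is an assumed column name; "ProductID" appears in the notebook.
rows = [
    {"OrderID": "o1", "ProductID": "p1"},
    {"OrderID": "o1", "ProductID": "p2"},
    {"OrderID": "o1", "ProductID": "p2"},  # repeated line item, counted once
    {"OrderID": "o2", "ProductID": "p1"},
    {"OrderID": "o3", "ProductID": "p2"},
    {"OrderID": "o3", "ProductID": "p3"},
    {"OrderID": "o3", "ProductID": "p1"},
]


def basket_stats(rows):
    """Return (share of multi-product orders, co-purchase pair counts)."""
    # Collect the distinct products of each order.
    baskets = defaultdict(set)
    for row in rows:
        baskets[row["OrderID"]].add(row["ProductID"])

    # Keep only orders with at least two distinct products.
    multi = [items for items in baskets.values() if len(items) >= 2]
    share = len(multi) / len(baskets) if baskets else 0.0

    # Count every unordered product pair that shares an order.
    pair_counts = Counter()
    for items in multi:
        pair_counts.update(combinations(sorted(items), 2))
    return share, pair_counts


multi_share, pair_counts = basket_stats(rows)
print(f"Multi-product order share: {multi_share:.2f}")
print(pair_counts.most_common(3))
```

A low multi-product share would suggest that FBT recommendations rest on few baskets, so it is worth checking before training on pair counts.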

## Summary of Analyses Performed:
- Loaded and displayed enriched sales data from the Gold layer
- Sampled sales data for efficient visualization
- Visualized interaction type distribution
- Analyzed user activity and interaction scores
- Explored product popularity and top-selling products
- Examined recency-weighted user-product interactions
- Investigated product price band distribution
- Summarized purchases by age group
- Visualized hybrid recommendation scores
- Assessed recommendation counts per user
- Explored user recency vs. interaction volume

Each analysis step provides insights for improving recommendation quality and feature engineering.
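
The "recommendation counts per user" step listed above could look roughly like the following. It is a sketch on made-up tuples, not the notebook's own code: the rows stand in for `gold_hybrid_user_product_scores`, the presence of a product column in that table is an assumption, and `MIN_RECS` is a hypothetical threshold chosen only for illustration.

```python
from collections import Counter
from statistics import median

# Toy (CustomerID, ProductID, hybrid_score) rows (made-up data).
rows = [
    ("u1", "p1", 0.9), ("u1", "p2", 0.7), ("u1", "p3", 0.4),
    ("u2", "p1", 0.8),
    ("u3", "p2", 0.6), ("u3", "p4", 0.5),
]


def recommendation_counts(rows):
    """Count scored candidate products per customer."""
    return Counter(customer_id for customer_id, _product_id, _score in rows)


counts = recommendation_counts(rows)

MIN_RECS = 2  # hypothetical minimum number of candidates per user
under_served = sorted(user for user, n in counts.items() if n < MIN_RECS)

summary = {
    "users": len(counts),
    "min": min(counts.values()),
    "median": median(counts.values()),
    "max": max(counts.values()),
}
print(summary)
print("Users below MIN_RECS:", under_served)
```

Users with very few scored candidates are the ones most likely to need a fallback such as popularity-based recommendations.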

In [0]:
gold_sales = spark.table(
    "kusha_solutions.product_recomendation.gold_sales_enriched"
)

gold_sales.display()   # Databricks built-in table view


In [0]:
sales_pd = gold_sales.sample(fraction=0.1, seed=42).toPandas()


In [0]:
import matplotlib.pyplot as plt

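# Draw a reproducible 20% sample; the fixed seed keeps the chart stable across reruns
sales_pd = spark.table(
    "kusha_solutions.product_recomendation.gold_sales_enriched"
).sample(fraction=0.2, seed=42).toPandas()

sales_pd['InteractionType'].value_counts().plot(
    kind='bar',
    title='Interaction Type Distribution'
)
plt.xlabel("Interaction Type")
plt.ylabel("Rows in 20% Sample")
plt.show()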


In [0]:
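from pyspark.sql import functions as F

# Sum scores per user in Spark so only one row per customer reaches the driver
user_activity_pd = (
    spark.table(
        "kusha_solutions.product_recomendation.gold_user_product_interactions"
    )
    .groupBy("CustomerID")
    .agg(F.sum(F.col("interaction_score").cast("double")).alias("total_interaction_score"))
    .toPandas()
)

user_activity_pd["total_interaction_score"].hist(bins=50)
plt.title("User Activity Distribution")
plt.xlabel("Total Interaction Score")
plt.ylabel("Users")
plt.show()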


In [0]:
import matplotlib.pyplot as plt
from pyspark.sql import functions as F

# Aggregate and rank in Spark; only the top 50 rows are collected to the driver
product_popularity = (
    spark.table(
        "kusha_solutions.product_recomendation.gold_sales_enriched"
    )
    .groupBy("ProductID")
    .agg(F.sum("Quantity").alias("Quantity"))
    .orderBy(F.desc("Quantity"))
    .limit(50)
    .toPandas()
    .set_index("ProductID")["Quantity"]
)
plt.figure(figsize=(12,5))
product_popularity.plot(kind="bar")
plt.title("Top 50 Products by Quantity Sold")
plt.xlabel("Product Rank")
plt.ylabel("Total Quantity Sold")
plt.xticks([])
plt.show()


In [0]:
import matplotlib.pyplot as plt

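# Only the score column is needed for the histogram, so collect just that column
weighted_pd = spark.table(
    "kusha_solutions.product_recomendation.gold_user_product_interactions_weighted"
).select("recency_weighted_score").toPandas()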

plt.figure(figsize=(6,4))
plt.hist(weighted_pd["recency_weighted_score"], bins=30)
plt.title("Recency Weighted Interaction Score Distribution")
plt.xlabel("Recency Weighted Score")
plt.ylabel("Number of User-Product Pairs")
plt.show()


In [0]:
price_band_pd = spark.table(
    "kusha_solutions.product_recomendation.gold_product_price_bands"
).toPandas()

price_band_pd['price_band'].value_counts().plot(
    kind='pie', autopct='%1.1f%%'
)
plt.title("Product Price Band Distribution")
plt.ylabel("")
plt.show()


In [0]:
age_pd = spark.table(
    "kusha_solutions.product_recomendation.gold_agegroup_popular_products"
).toPandas()

age_summary = age_pd.groupby("AgeGroup")["purchase_count"].sum()

age_summary.plot(kind='bar')
plt.title("Purchases by Age Group")
plt.xlabel("Age Group")
plt.ylabel("Total Purchases")
plt.show()


In [0]:
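# Reproducible 10% sample of the hybrid scores
hybrid_pd = spark.table(
    "kusha_solutions.product_recomendation.gold_hybrid_user_product_scores"
).sample(fraction=0.1, seed=42).toPandas()

hybrid_pd['hybrid_score'].hist(bins=50)
plt.title("Hybrid Score Distribution")
plt.xlabel("Hybrid Score")
plt.ylabel("User-Product Pairs in 10% Sample")
plt.show()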


In [0]:
import pandas as pd
import matplotlib.pyplot as plt

weighted_pd = spark.table(
    "kusha_solutions.product_recomendation.gold_user_product_interactions_weighted"
).toPandas()

# Aggregate per user
user_summary = weighted_pd.groupby("CustomerID").agg(
    total_interactions=("interaction_events", "sum"),
    last_interaction=("last_interaction_ts", "max")
).reset_index()

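# Recency is measured against the current time, so values shift between runs.
# pd.to_datetime guards against the column arriving as dates or strings.
user_summary["days_since_last_interaction"] = (
    pd.Timestamp.now() - pd.to_datetime(user_summary["last_interaction"])
).dt.days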

plt.figure(figsize=(6,4))
plt.scatter(
    user_summary["days_since_last_interaction"],
    user_summary["total_interactions"],
    alpha=0.5
)
plt.xlabel("Days Since Last Interaction")
plt.ylabel("Total Interaction Events")
plt.title("User Recency vs Interaction Volume")
plt.show()