### The Goal
- Today, I'm bridging the gap between Data Engineering and Machine Learning. 
- Before we can train a model, we need to understand the shape of our data. 
- My goal is to engineer features that capture human behavior, such as weekend shopping habits, so our model can learn from them.

In [0]:
##### 1. Setup
# Standard imports
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

# Use our catalog and schema
spark.sql("USE CATALOG ecommerce")
spark.sql("USE SCHEMA silver")

# Load the Silver Table
df = spark.table("cleaned_traffic")
print(f"Row count: {df.count()}")
display(df.limit(5))


Row count: 109516455


event_time,event_type,product_id,category_id,category_code,brand,price,user_id,user_session,ingestion_ts,source_file,event_date,price_tier
2019-10-04T08:51:16.000Z,view,4802948,2053013554658804075,electronics.audio.headphone,razer,91.64,543272567,b36ef7ec-ee91-40ac-936a-0a3b1c67f3f0,2026-01-17T08:01:59.737Z,dbfs:/Volumes/workspace/ecommerce/ecommerce_data/processed_data/combined_all/part-00004-tid-4230937331832510606-a76a7b7e-af71-40a4-85e1-ec106e94cdf8-1034-1.c000.snappy.parquet,,premium
2019-10-04T08:51:20.000Z,view,5100551,2053013553341792533,electronics.clocks,xiaomi,166.03,518678594,e50be089-a207-4128-8b11-5ee89b462125,2026-01-17T08:01:59.737Z,dbfs:/Volumes/workspace/ecommerce/ecommerce_data/processed_data/combined_all/part-00004-tid-4230937331832510606-a76a7b7e-af71-40a4-85e1-ec106e94cdf8-1034-1.c000.snappy.parquet,,premium
2019-10-04T08:51:28.000Z,view,9300034,2053013554524586339,,sony,97.56,555367266,cfab298c-57ca-4953-970c-fb347908a4aa,2026-01-17T08:01:59.737Z,dbfs:/Volumes/workspace/ecommerce/ecommerce_data/processed_data/combined_all/part-00004-tid-4230937331832510606-a76a7b7e-af71-40a4-85e1-ec106e94cdf8-1034-1.c000.snappy.parquet,,premium
2019-10-04T08:53:28.000Z,view,2402570,2053013563743667055,appliances.kitchen.hood,bosch,102.94,521051958,80f59174-6a43-4d3f-a807-d65eabe89445,2026-01-17T08:01:59.737Z,dbfs:/Volumes/workspace/ecommerce/ecommerce_data/processed_data/combined_all/part-00004-tid-4230937331832510606-a76a7b7e-af71-40a4-85e1-ec106e94cdf8-1034-1.c000.snappy.parquet,,premium
2019-10-04T08:53:44.000Z,view,1306747,2053013558920217191,computers.notebook,acer,450.18,512713549,eef19b82-e063-4e0d-88f0-2afe3000f4c8,2026-01-17T08:01:59.737Z,dbfs:/Volumes/workspace/ecommerce/ecommerce_data/processed_data/combined_all/part-00004-tid-4230937331832510606-a76a7b7e-af71-40a4-85e1-ec106e94cdf8-1034-1.c000.snappy.parquet,,premium


### Feature Engineering (The Transformation)
I am transforming the raw data into "signals" a model can actually learn from:
* **`hour` & `day_of_week`**: Extracting cyclic patterns (e.g., shopping peaks at 8 PM).
* **`is_weekend`**: A binary flag (1/0) to test whether behavior changes on Saturdays and Sundays. Spark's `dayofweek` returns 1 for Sunday through 7 for Saturday, so the weekend values are 1 and 7.
* **`price_log`**: Real-world prices are often heavily right-skewed (long tail). Log-transforming with `log(x+1)` compresses that tail, which often helps linear models perform better.
* **`is_purchased`**: Converting the target variable (`event_type='purchase'`) into a binary 1 or 0 allows us to mathematically correlate "buying" with other variables.
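As a quick sanity check on the definitions above, here is a minimal plain-Python sketch (no Spark) that rebuilds the same features for a single event. The helper names `spark_dayofweek` and `is_weekend_flag` are illustrative, not part of the notebook, and the event values are taken from the first sample row (2019-10-04 08:51:16, price 91.64).

```python
import math
from datetime import datetime

def spark_dayofweek(ts):
    """Mimic Spark's dayofweek convention: 1 = Sunday ... 7 = Saturday."""
    # Python's isoweekday() is 1 = Monday ... 7 = Sunday, so shift by one.
    return ts.isoweekday() % 7 + 1

def is_weekend_flag(ts):
    """Return 1 for Saturday/Sunday, 0 otherwise (hypothetical helper)."""
    return 1 if spark_dayofweek(ts) in (1, 7) else 0

event_time = datetime(2019, 10, 4, 8, 51, 16)  # a Friday
price = 91.64

features = {
    "hour": event_time.hour,
    "day_of_week": spark_dayofweek(event_time),
    "is_weekend": is_weekend_flag(event_time),
    "price_log": math.log1p(price),  # same as log(price + 1)
}
print(features)
```

The shift in `spark_dayofweek` matters because Python and Spark number the week differently; mixing the two conventions would silently flag the wrong days as weekend.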

In [0]:
# Apply transformations
gold_features = df.withColumn("hour", F.hour("event_time")) \
    .withColumn("day_of_week", F.dayofweek("event_time")) \
    .withColumn("is_weekend", F.when(F.dayofweek("event_time").isin([1, 7]), 1).otherwise(0)) \
    .withColumn("price_log", F.log(F.col("price") + 1)) \
    .withColumn("is_purchased", F.when(F.col("event_type") == 'purchase', 1).otherwise(0))

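# Save this table NOW so we can use it for stats immediately (and for ML tomorrow).
# Note: with USE SCHEMA silver active, the unqualified name below resolves to ecommerce.silver.gold_features.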
gold_features.write.mode("overwrite").saveAsTable("gold_features")
print("Features engineered and saved to 'gold_features'.")

display(gold_features.select("event_time", "is_weekend", "hour", "price", "price_log", "is_purchased").limit(5))

Features engineered and saved to 'gold_features'.


event_time,is_weekend,hour,price,price_log,is_purchased
2019-10-04T08:51:16.000Z,0,8,91.64,4.528721013824685,0
2019-10-04T08:51:20.000Z,0,8,166.03,5.118173437001857,0
2019-10-04T08:51:28.000Z,0,8,97.56,4.59066549978521,0
2019-10-04T08:53:28.000Z,0,8,102.94,4.643813809580296,0
2019-10-04T08:53:44.000Z,0,8,450.18,6.111866372960278,0


### 2. Statistical Summaries
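- Now that we have numerical features, let's look at their distribution.
- I am specifically comparing `price` vs `price_log`.
- We expect `price_log` to have a much smaller spread relative to its mean, meaning a less skewed distribution that is safer for Machine Learning. (The two standard deviations are on different scales, so they are not directly comparable on their own.)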

In [0]:
print("--- Price Distribution: Raw vs Log ---")
gold_features.select("price", "price_log").describe().show()

--- Price Distribution: Raw vs Log ---
+-------+-----------------+------------------+
|summary|            price|         price_log|
+-------+-----------------+------------------+
|  count|        109516455|         109516455|
|   mean| 292.341219466026| 5.051637303688198|
| stddev|356.8838583403892|1.1955573680533131|
|    min|             0.77|0.5709795465857378|
|    max|          2574.07| 7.853631997194365|
+-------+-----------------+------------------+



##### Data Distribution Analysis
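* **Raw Price:** The standard deviation (`356.88`) is larger than the mean (`292.34`), which for a strictly positive variable points to a heavily right-skewed, long-tailed distribution.
* **Log Price:** After the log transformation, the standard deviation is `1.20` against a mean of `5.05`, so the spread is now small relative to the mean.
* **Conclusion:** These summary statistics suggest that `price_log` is much closer to a "Normal" (bell curve) shape, which should make it a more stable feature for training our model. `describe()` alone does not confirm normality; a histogram or a skewness check would.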
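To make the "closer to Normal" claim checkable, one option is to compare skewness before and after the transform. The sketch below uses synthetic log-normal prices, not the real dataset, and the `skewness` helper is written here only for illustration; the parameters `5.0` and `1.0` are arbitrary assumptions chosen to give a long right tail.

```python
import math
import random

def skewness(values):
    """Sample skewness: third central moment divided by variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

random.seed(42)

# Synthetic, long-tailed "prices" (an assumption, NOT the real data).
prices = [random.lognormvariate(5.0, 1.0) for _ in range(20_000)]
log_prices = [math.log1p(p) for p in prices]

skew_raw = skewness(prices)
skew_log = skewness(log_prices)

# A symmetric, bell-shaped distribution has skewness near 0.
print(f"skewness of raw prices: {skew_raw:.2f}")
print(f"skewness of log prices: {skew_log:.2f}")
```

On the Spark side, the same comparison could be run on the real columns with `F.skewness("price")` and `F.skewness("price_log")` inside an `agg` call.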

### 3. Hypothesis Testing
**Hypothesis:** *People browse more on weekends, but are they less likely to buy?*

To test this, I will calculate the **Conversion Rate**, `(Total Purchases / Total Events)`, for Weekends (1) vs. Weekdays (0).

In [0]:
print("--- Conversion Analysis: Weekend vs. Weekday ---")

# Group by the feature we engineered
stats_df = gold_features.groupBy("is_weekend") \
    .agg(
        F.count("*").alias("total_events"),
        F.sum("is_purchased").alias("total_purchases"),
        F.round(F.avg("price"), 2).alias("avg_price")
    ) \
    .withColumn("conversion_rate", F.round(F.col("total_purchases") / F.col("total_events") * 100, 2))

stats_df.show()

--- Conversion Analysis: Weekend vs. Weekday ---
+----------+------------+---------------+---------+---------------+
|is_weekend|total_events|total_purchases|avg_price|conversion_rate|
+----------+------------+---------------+---------+---------------+
|         1|    36057235|         613075|   293.81|            1.7|
|         0|    73459220|        1046615|   291.62|           1.42|
+----------+------------+---------------+---------+---------------+



##### Hypothesis Result:
**Hypothesis:** *I initially assumed weekends would have lower conversion rates (window shopping) compared to weekdays.*

**The Data says:**
* **Weekend Conversion:** `1.70%`
* **Weekday Conversion:** `1.42%`

**Interpretation:**
My hypothesis was **incorrect**. In relative terms, an event is **~20% more likely** to be a purchase on weekends than on weekdays (`1.70%` vs. `1.42%`).
* **Actionable Insight:** The `is_weekend` flag looks like a **useful signal**. We would expect our Machine Learning model to learn a positive weight for this feature when predicting purchase probability, although the difference has not yet been tested for statistical significance here.
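The comparison above only reports the two rates. A rough way to check that the gap is not noise is a two-proportion z-test on the counts from the conversion table. This is a sketch under a strong assumption: it treats every event as independent, which is not strictly true because one user session produces many events, so the z-score is likely optimistic.

```python
import math

# Counts copied from the conversion table above.
weekend_events, weekend_purchases = 36_057_235, 613_075
weekday_events, weekday_purchases = 73_459_220, 1_046_615

p_weekend = weekend_purchases / weekend_events
p_weekday = weekday_purchases / weekday_events

# Pooled rate under the null hypothesis that both rates are equal.
p_pooled = (weekend_purchases + weekday_purchases) / (weekend_events + weekday_events)
std_err = math.sqrt(
    p_pooled * (1 - p_pooled) * (1 / weekend_events + 1 / weekday_events)
)

z_score = (p_weekend - p_weekday) / std_err
relative_lift = p_weekend / p_weekday - 1

print(f"weekend conversion: {p_weekend:.2%}")
print(f"weekday conversion: {p_weekday:.2%}")
print(f"relative lift:      {relative_lift:.1%}")
print(f"z-score:            {z_score:.1f}")
```

With more than 100 million events, even tiny differences become statistically significant, so the relative lift is the more informative number for deciding whether the feature is worth keeping.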

### 4. Correlation Analysis
Finally, I want to validate a classic economic theory: **Does a higher price actually reduce the likelihood of a purchase?**

Since I converted `purchase` to a number (0 or 1) in the feature engineering step, I can now mathematically calculate the Pearson correlation coefficient.
* **Result closer to -1:** Strong negative correlation (Price Up = Sales Down).
* **Result closer to 0:** No linear relationship.
* **Result closer to +1:** Strong positive correlation (Price Up = Sales Up).
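To make the reading of the coefficient concrete, here is a small plain-Python sketch of Pearson correlation on toy data. The `pearson` helper and both toy datasets are hypothetical and exist only to show how the sign and size behave with a 0/1 target, including a case where the coefficient is zero even though a clear pattern exists.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Toy price levels and 0/1 purchase flags (hypothetical data).
price_level = [1, 2, 3, 4, 5]
buys_when_cheap = [1, 1, 0, 0, 0]     # purchases only at low prices
buys_mid_range_only = [0, 1, 1, 1, 0]  # purchases only at mid prices

r_cheap = pearson(price_level, buys_when_cheap)
r_mid = pearson(price_level, buys_mid_range_only)

print(f"cheap-only buyers: r = {r_cheap:.3f}")  # strongly negative
print(f"mid-range buyers:  r = {r_mid:.3f}")    # ~0 despite a clear pattern
```

The second case is the reason a near-zero result should be read as "no linear relationship" rather than "no relationship at all".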

In [0]:
# Calculate Pearson Correlation
price_corr = gold_features.stat.corr("price_log", "is_purchased")
hour_corr = gold_features.stat.corr("hour", "is_purchased")

print(f"Correlation (Log Price vs Purchase): {price_corr:.4f}")
print(f"Correlation (Hour vs Purchase):      {hour_corr:.4f}")

Correlation (Log Price vs Purchase): 0.0098
Correlation (Hour vs Purchase):      -0.0185


##### Correlation Analysis
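I tested the economic theory that "Higher Price = Lower Purchase Probability."

* **Result:** The correlation coefficient between `price_log` and `is_purchased` is `0.0098` (effectively zero).
* **Interpretation:** For this specific e-commerce store, there is no meaningful linear relationship between log price and purchase at the level of individual events, so the data does not show **price as a barrier to purchase.**
* **Behavioral Insight:** On average, an event on an expensive item is about as likely to be a purchase as an event on a cheap one. This is consistent with a motivated customer base or a product mix where price sensitivity is low, although a near-zero Pearson coefficient cannot rule out non-linear or category-specific effects.
* **Hour of Day:** The `hour` correlation (`-0.0185`) is also close to zero, but `hour` is cyclic, so a linear coefficient is a weak test of time-of-day effects.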
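One caveat on the `hour` correlation printed above: Pearson treats `hour` as a straight line from 0 to 23, while in reality 23:00 and 00:00 are neighbors. A common workaround is to encode the hour as a point on a circle with sine and cosine. The sketch below is one possible encoding, not something the notebook currently does; `encode_hour` and `circular_gap` are illustrative names.

```python
import math

def encode_hour(hour):
    """Map an hour (0-23) to (sin, cos) coordinates on the unit circle."""
    angle = 2 * math.pi * hour / 24
    return math.sin(angle), math.cos(angle)

def circular_gap(hour_a, hour_b):
    """Euclidean distance between two hours after cyclic encoding."""
    return math.dist(encode_hour(hour_a), encode_hour(hour_b))

# On the raw scale, 23 and 0 are 23 units apart; on the circle they are
# exactly as close as 0 and 1.
gap_midnight = circular_gap(23, 0)
gap_adjacent = circular_gap(0, 1)
gap_opposite = circular_gap(0, 12)

print(f"23h vs 0h:  {gap_midnight:.4f}")
print(f"0h vs 1h:   {gap_adjacent:.4f}")
print(f"0h vs 12h:  {gap_opposite:.4f}")
```

In the Spark pipeline, the equivalent columns could be built with `F.sin` and `F.cos` on `hour`, and a model would then receive both columns in place of the raw integer.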