### Load E-Commerce Transactions Dataset

This step loads the `default.ecommerce_transactions` table from Databricks into a Spark DataFrame.  
The dataset is used throughout Day 11 for statistical analysis and for feature engineering in preparation for ML.

To get a quick view of its structure, the first few rows of the table are displayed.

In [0]:
df = spark.table("default.ecommerce_transactions")
display(df.limit(10))

Transaction_ID,User_Name,Age,Country,Product_Category,Purchase_Amount,Payment_Method,Transaction_Date
1,Ava Hall,63,Mexico,Clothing,780.69,Debit Card,2023-04-14
2,Sophia Hall,59,India,Beauty,738.56,PayPal,2023-07-30
3,Elijah Thompson,26,France,Books,178.34,Credit Card,2023-09-17
4,Elijah White,43,Mexico,Sports,401.09,UPI,2023-06-21
5,Ava Harris,48,Germany,Beauty,594.83,Net Banking,2024-10-29
6,Elijah Harris,51,India,Toys,966.5,Cash on Delivery,2025-01-18
7,Oliver Clark,27,Germany,Home & Kitchen,341.73,Credit Card,2024-03-13
8,Olivia Allen,46,Canada,Home & Kitchen,11.33,Debit Card,2024-01-04
9,Liam Harris,54,France,Beauty,279.43,Cash on Delivery,2023-12-06
10,Liam Allen,60,Canada,Beauty,223.9,Cash on Delivery,2023-08-07


### Descriptive Statistics 

This step focuses on understanding the overall distribution and quality of the data.

It calculates:
- Basic descriptive statistics (count, mean, std, min, max) for numeric columns like `Purchase_Amount` and `Age`
- Total row count, plus non-null and distinct `Transaction_ID` counts
- Missing value checks for key columns
- Approximate quartiles (Q1 and Q3) for `Purchase_Amount` to understand data spread and detect potential outliers



In [0]:
from pyspark.sql import functions as F
# 1) Numeric descriptive stats
df.select("Purchase_Amount", "Age").describe().show()

# 2) Extra summary metrics
df.select(
    F.count("*").alias("rows"),
    F.count("Transaction_ID").alias("txn_id_count"),
    F.countDistinct("Transaction_ID").alias("distinct_txn_id"),
    F.sum(F.when(F.col("Purchase_Amount").isNull(), 1).otherwise(0)).alias("null_purchase_amount"),
    F.sum(F.when(F.col("Transaction_Date").isNull(), 1).otherwise(0)).alias("null_transaction_date"),
    F.expr("percentile_approx(Purchase_Amount, array(0.25,0.75))").alias("q1_q3_purchase")
).show(truncate=False)

+-------+------------------+------------------+
|summary|   Purchase_Amount|               Age|
+-------+------------------+------------------+
|  count|             50000|             50000|
|   mean|503.15979300000197|          43.96868|
| stddev|286.56355761465926|15.260577864626729|
|    min|              5.04|                18|
|    max|            999.98|                70|
+-------+------------------+------------------+

+-----+------------+---------------+--------------------+---------------------+----------------+
|rows |txn_id_count|distinct_txn_id|null_purchase_amount|null_transaction_date|q1_q3_purchase  |
+-----+------------+---------------+--------------------+---------------------+----------------+
|50000|50000       |50000          |0                   |0                    |[255.35, 751.16]|
+-----+------------+---------------+--------------------+---------------------+----------------+
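One common way to turn Q1 and Q3 into an outlier check is Tukey's IQR rule. The sketch below is plain Python rather than Spark and simply reuses the numbers printed above; the 1.5 multiplier is a convention, not something the notebook prescribes, and the quartiles themselves are approximate because they come from `percentile_approx`.

```python
# Quartiles and observed range copied from the outputs above.
q1, q3 = 255.35, 751.16
observed_min, observed_max = 5.04, 999.98

# Interquartile range and Tukey fences (1.5 is the conventional multiplier).
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# A value outside the fences would count as a potential outlier.
has_outliers = observed_min < lower_fence or observed_max > upper_fence
print(f"IQR={iqr:.2f}, lower={lower_fence:.2f}, upper={upper_fence:.2f}, outliers={has_outliers}")
```

With these figures the lower fence is negative and the upper fence is well above the observed maximum of 999.98, so this rule flags no `Purchase_Amount` values as outliers.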



### Weekday vs Weekend Purchase Analysis

This step analyzes customer purchasing behavior by day of the week:

- Derive a `day_of_week` column from `Transaction_Date`
- Create a binary `is_weekend` flag (Saturday and Sunday)
- Compare weekday vs weekend transactions using:
  - Number of transactions
  - Average purchase amount
  - Total revenue
  - Purchase amount variability (standard deviation)

This analysis provides a simple check to see whether weekends show different purchasing patterns compared to weekdays.
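Spark's `dayofweek` numbers days from 1 (Sunday) to 7 (Saturday), which differs from Python's ISO numbering. As a rough illustration of the weekend flag, the following plain-Python sketch reproduces that numbering; the helper names `spark_day_of_week` and `is_weekend` are hypothetical and exist only for this example.

```python
from datetime import date


def spark_day_of_week(d: date) -> int:
    """Mimic Spark's dayofweek numbering: 1=Sunday ... 7=Saturday."""
    # isoweekday() is 1=Monday ... 7=Sunday, so shift by one and wrap around.
    return d.isoweekday() % 7 + 1


def is_weekend(d: date) -> bool:
    """True for Sunday (1) and Saturday (7) in Spark's numbering."""
    return spark_day_of_week(d) in (1, 7)


# Transaction_Date of the first preview row (2023-04-14, a Friday).
sample = date(2023, 4, 14)
print(spark_day_of_week(sample), is_weekend(sample))  # prints: 6 False
```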


In [0]:
from pyspark.sql import functions as F

df = spark.table("default.ecommerce_transactions")

df_week = (df
    .withColumn("day_of_week", F.dayofweek("Transaction_Date"))     # 1=Sun, 7=Sat
    .withColumn("is_weekend", F.col("day_of_week").isin([1, 7]))
)

# Compare weekend vs weekday summary
(df_week.groupBy("is_weekend")
 .agg(
     F.count("*").alias("transactions"),
     F.round(F.avg("Purchase_Amount"), 2).alias("avg_purchase"),
     F.round(F.sum("Purchase_Amount"), 2).alias("total_revenue"),
     F.round(F.stddev_samp("Purchase_Amount"), 2).alias("std_purchase")
 )
).show()

+----------+------------+------------+-------------+------------+
|is_weekend|transactions|avg_purchase|total_revenue|std_purchase|
+----------+------------+------------+-------------+------------+
|     false|       35616|       501.6| 1.78651117E7|      286.88|
|      true|       14384|      507.01|   7292877.95|      285.76|
+----------+------------+------------+-------------+------------+
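To judge whether the gap between the two averages is more than noise, one option is Welch's t-statistic computed directly from the summary table. This is a sketch under assumptions, not part of the original analysis: it treats transactions as independent and uses the rounded values printed above.

```python
import math

# Summary statistics copied from the table above: (n, mean, sample std).
weekday = (35616, 501.60, 286.88)
weekend = (14384, 507.01, 285.76)


def welch_t(a, b):
    """Welch's t-statistic for the difference in means (b minus a)."""
    n_a, mean_a, std_a = a
    n_b, mean_b, std_b = b
    # Standard error of the difference, without assuming equal variances.
    standard_error = math.sqrt(std_a**2 / n_a + std_b**2 / n_b)
    return (mean_b - mean_a) / standard_error


t_stat = welch_t(weekday, weekend)
print(f"t = {t_stat:.2f}")
```

The statistic comes out at roughly 1.9, just under the conventional 1.96 cutoff for a two-sided test at the 5% level, so on this rough check the weekend average is only marginally higher, if at all.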



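### Identify Correlations

This step computes the Pearson correlation between `Purchase_Amount` and `Age`. A value close to 0 indicates little or no linear relationship between the two.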

In [0]:
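# Pearson correlation between Purchase_Amount and Age (rows with nulls dropped).
# Uses df_week from the previous step, which already contains day_of_week.
df_week.select("Purchase_Amount", "Age", "day_of_week").na.drop().stat.corr("Purchase_Amount", "Age")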

-0.003585451717015732
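For reference, the Pearson coefficient is the covariance of the two columns divided by the product of their standard deviations. The plain-Python sketch below spells that formula out on the ten preview rows from the top of this notebook; the `pearson` helper is hypothetical, and ten rows are far too few to say anything about the full table.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations; the 1/n factors cancel.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Age and Purchase_Amount from the ten preview rows shown earlier.
ages = [63, 59, 26, 43, 48, 51, 27, 46, 54, 60]
amounts = [780.69, 738.56, 178.34, 401.09, 594.83, 966.5, 341.73, 11.33, 279.43, 223.9]

r = pearson(amounts, ages)
print(round(r, 3))
```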

### Feature Engineering for Machine Learning Preparation
This step creates a few simple, meaningful features to make the dataset more suitable for machine learning models.

Engineered features include:
- `day_of_week`: captures weekly purchasing patterns
- `purchase_log`: `log(Purchase_Amount + 1)` transformation to reduce skewness
- `days_since_last_purchase`: days since the same user's previous transaction, computed with a window function (null for a user's first transaction)



In [0]:
from pyspark.sql import functions as F
from pyspark.sql.window import Window
df = spark.table("default.ecommerce_transactions")
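# Window per user ordered by date. User_Name stands in for a customer ID here,
# and rows that share a Transaction_Date are ordered arbitrarily within the window.
w = Window.partitionBy("User_Name").orderBy("Transaction_Date")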
df_feat = (df
    .withColumn("day_of_week", F.dayofweek("Transaction_Date"))
    .withColumn("purchase_log", F.log(F.col("Purchase_Amount") + F.lit(1)))
    .withColumn("prev_txn_date", F.lag("Transaction_Date").over(w))
    .withColumn("days_since_last_purchase",
                F.datediff(F.col("Transaction_Date"), F.col("prev_txn_date")))
)
display(df_feat.select(
    "Transaction_ID","User_Name","Purchase_Amount",
    "day_of_week","purchase_log","days_since_last_purchase"
).limit(10))

Transaction_ID,User_Name,Purchase_Amount,day_of_week,purchase_log,days_since_last_purchase
31528,Ava Allen,86.13,6,4.467401256243195,
41902,Ava Allen,496.31,6,6.209213574104885,0.0
33019,Ava Allen,821.93,2,6.712871142381708,3.0
9819,Ava Allen,532.07,3,6.2786527476250935,1.0
27055,Ava Allen,839.22,3,6.733663762308199,0.0
49765,Ava Allen,325.6,3,5.788736180536365,0.0
14314,Ava Allen,762.08,5,6.637362875067317,2.0
47599,Ava Allen,748.44,6,6.619326260969298,1.0
42370,Ava Allen,800.69,4,6.686721999477285,5.0
18967,Ava Allen,718.12,7,6.578028242265144,3.0
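The two derived columns can be sanity-checked by hand. The sketch below is plain Python, not Spark: it recomputes `purchase_log` for the first output row and mimics the lag-and-difference logic on a short list of dates. The dates are hypothetical and are not taken from the table.

```python
import math
from datetime import date

# purchase_log is log(Purchase_Amount + 1); 86.13 is the first output row.
purchase_log = math.log1p(86.13)

# days_since_last_purchase: gap to the previous date in one user's
# date-ordered history. Hypothetical dates, for illustration only.
history = [date(2023, 1, 6), date(2023, 1, 6), date(2023, 1, 9), date(2023, 1, 10)]

# The first transaction has no predecessor, so its value is None (null in Spark).
days_since_last = [None] + [
    (current - previous).days for previous, current in zip(history, history[1:])
]

print(purchase_log, days_since_last)
```

The recomputed log agrees with the 4.4674... shown for transaction 31528, and the gaps follow the same pattern as the output: an empty first value, then 0 for a same-day repeat purchase.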
