#Descriptive Statistics
Descriptive statistics summarize your data so you can answer:

Is the data clean?

Is it skewed?

Are there outliers?

Can this data be safely used for ML?

#Why This Step Is Mandatory (Production View)

Before ML or hypothesis testing:

❌ You don’t trust raw data

❌ You don’t assume distributions

✅ You measure and validate

This prevents:

Wrong assumptions

Bad models

Incorrect business conclusions

#Column We’ll Analyze First
We’ll start with price because:

It’s numeric

It affects conversion

It usually contains skew & outliers

In [0]:
# Load the silver.events Delta table as a Spark DataFrame.
events = spark.table("silver.events")


In [0]:
events.printSchema()

In [0]:
events.select("price").describe().show()


This tells you:

Is price reasonable?

Are there extreme values?

Is the column usable for ML?

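The mean can be misleading when a few extreme prices pull it upward.

Percentiles show the real shape of the distribution.

Reporting percentiles alongside the mean is production-grade and interview-ready.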

In [0]:
from pyspark.sql import functions as F

events.select(
    F.expr("percentile_approx(price, 0.5)").alias("p50_median"),
    F.expr("percentile_approx(price, 0.9)").alias("p90"),
    F.expr("percentile_approx(price, 0.99)").alias("p99")
).show()


#Analysis Outcome
“In our ecommerce data, half of the products are priced below ₹166,
while 90% are below ₹759. A very small fraction (1%) are premium-priced products above ₹1661,
which can influence averages and require special handling in ML.”
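To make the point about averages concrete, here is a minimal sketch in plain Python rather than Spark. The price list is made up for illustration and is not taken from `silver.events`; it only shows how a single extreme value pulls the mean well above the median.

```python
import statistics

# Hypothetical prices (illustration only): mostly modest values plus one extreme outlier
prices = [100, 120, 150, 166, 180, 200, 250, 300, 759, 5000]

mean_price = statistics.mean(prices)
median_price = statistics.median(prices)

# The outlier drags the mean far above the typical price; the median barely moves
print(f"mean   = {mean_price}")
print(f"median = {median_price}")
```

This is why the percentile view is the safer summary for a skewed column such as price.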

#Hypothesis Testing
Weekday vs Weekend (Simple & Practical)


A hypothesis is just a question we want to check using data.

Our question:

Do users buy more on weekends than weekdays?

In [0]:
from pyspark.sql import functions as F

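# Spark's dayofweek returns 1 for Sunday through 7 for Saturday,
# so the values 1 and 7 mark the weekend.
events_wd = events.withColumn(
    "is_weekend",
    F.dayofweek("event_date").isin([1, 7])
)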


In [0]:
events_wd.select("event_date", "is_weekend").show(5)


You should see:

true → weekend

false → weekday
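The weekend flag depends on Spark's day numbering, where `dayofweek` returns 1 for Sunday and 7 for Saturday. As a rough sketch, the same numbering can be reproduced in plain Python to sanity-check a few dates; `spark_style_dayofweek` and `is_weekend` are hypothetical helper names introduced here for illustration only.

```python
from datetime import date


def spark_style_dayofweek(d: date) -> int:
    """Return 1 for Sunday through 7 for Saturday, mirroring Spark's dayofweek numbering."""
    # isoweekday() gives Monday=1 ... Sunday=7, so shift it to Sunday=1 ... Saturday=7
    return d.isoweekday() % 7 + 1


def is_weekend(d: date) -> bool:
    """True when the date falls on a Saturday or Sunday."""
    return spark_style_dayofweek(d) in (1, 7)


# Friday through Monday of one week in January 2024
for d in (date(2024, 1, 5), date(2024, 1, 6), date(2024, 1, 7), date(2024, 1, 8)):
    print(d, d.strftime("%A"), spark_style_dayofweek(d), is_weekend(d))
```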

We say:

event_type = 'purchase' → user bought something

Now count purchases by weekday/weekend.

In [0]:
events_wd.filter(F.col("event_type") == "purchase") \
    .groupBy("is_weekend") \
    .count() \
    .show()

In [0]:
# Count every event type, split by weekday vs. weekend
events_wd.groupBy("is_weekend", "event_type").count().show()

#Fair Comparison using Conversion Rate
Earlier we counted:

Total purchases on weekdays

Total purchases on weekends

❌ This is not a fair comparison, because:

Weekdays = more days

Weekends = fewer days

So instead we compare:

Conversion Rate = Purchases ÷ Total Events

In [0]:
conversion = (
    events_wd
    .groupBy("is_weekend")
    .agg(
        F.sum(F.when(F.col("event_type") == "purchase", 1).otherwise(0)).alias("purchases"),
        F.count("*").alias("total_events")
    )
    .withColumn(
        "conversion_rate",
        F.col("purchases") / F.col("total_events")
    )
)

conversion.show()


Your Hypothesis Results:

Weekend conversion   = 0.0169  (≈ 1.69%)
Weekday conversion   = 0.0118  (≈ 1.18%)

What This Means (Plain English)
On weekends:

Out of 100 user events,

~1.7 events result in a purchase

On weekdays:

Out of 100 user events,

~1.2 events result in a purchase

👉 A weekend event is more likely to be a purchase than a weekday event.

🧠 Important Insight (Why This Matters)

Even though:

Weekdays have more total events

Weekends have fewer total events

Still:

Weekend events convert at a higher rate

This is exactly why we used conversion rate, not total purchases.
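A difference in rates does not by itself show that the gap is more than chance. One common check is a two-proportion z-test, which is not part of the original notebook and is sketched here in plain Python. The event counts below are hypothetical (10,000 events per group); only the rates 1.69% and 1.18% come from the results above, so rerun this with the real `purchases` and `total_events` values from the `conversion` DataFrame before drawing conclusions.

```python
import math


def two_proportion_z_test(x1, n1, x2, n2):
    """Return (z, two-sided p-value) for the difference between two proportions."""
    p1 = x1 / n1
    p2 = x2 / n2
    # Pooled proportion under the null hypothesis that both rates are equal
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts chosen to match the reported rates (assumption, not real data)
weekend_purchases, weekend_events = 169, 10_000
weekday_purchases, weekday_events = 118, 10_000

z, p_value = two_proportion_z_test(
    weekend_purchases, weekend_events, weekday_purchases, weekday_events
)
print(f"z = {z:.3f}, p-value = {p_value:.4f}")
```

A small p-value would suggest the weekend and weekday rates genuinely differ. The result depends heavily on the real sample sizes.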

#Correlation Analysis

We want to see relationships between numerical variables in our data.
For example:

“Does a higher price affect conversion?”
“Do other variables move together?”

This is very important before ML because:

Knowing which features are related helps you choose model inputs.

Strong correlations between features can indicate redundancy.

In [0]:
# Example: correlation between price and conversion
# First, we create a 'conversion' column: 1 for purchase, 0 otherwise
events_corr = events.withColumn(
    "conversion",
    F.when(F.col("event_type") == "purchase", 1).otherwise(0)
)

# Now calculate correlation
corr_price_conversion = events_corr.stat.corr("price", "conversion")
print(f"Correlation between price and conversion: {corr_price_conversion:.4f}")


Correlation ranges from -1 to 1

1 = perfect positive linear relationship

0 = no linear relationship

-1 = perfect negative linear relationship

Your value = 0.0022 ≈ 0

This means: price has almost no linear relationship with whether someone buys or not.

On its own, price is not a useful linear predictor of conversion in this dataset. A value near 0 does not rule out a nonlinear relationship.
ML models should consider multiple features, not just price.
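A minimal sketch of what the correlation coefficient measures, written in plain Python with made-up numbers rather than the events data. The `pearson` helper is a hypothetical name for a hand-written Pearson formula; it shows that a perfectly linear relationship scores 1, while a clear but curved relationship can still score 0.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Illustrative values only
xs = [-2, -1, 0, 1, 2]
linear = [2 * x + 1 for x in xs]   # y rises in a straight line with x
curved = [x ** 2 for x in xs]      # y depends on x, but not linearly

r_linear = pearson(xs, linear)
r_curved = pearson(xs, curved)
print(f"linear: {r_linear:.4f}")
print(f"curved: {r_curved:.4f}")
```

So a near-zero correlation between price and conversion tells you there is no straight-line trend, not that price is irrelevant.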

#Adding New Features (For Practice Only)

In [0]:
# Derive the hour of day, the day of week (1 = Sunday ... 7 = Saturday), and a log-scaled price
events_feat = events_corr.withColumn("hour", F.hour("event_time")) \
                         .withColumn("day_of_week", F.dayofweek("event_date")) \
                         .withColumn("price_log", F.log(F.col("price")+1))


In [0]:
numeric_cols = ["price", "price_log", "conversion", "hour", "day_of_week"]

for i in range(len(numeric_cols)):
    for j in range(i+1, len(numeric_cols)):
        c = events_feat.stat.corr(numeric_cols[i], numeric_cols[j])
        print(f"Correlation between {numeric_cols[i]} and {numeric_cols[j]}: {c:.4f}")


#Feature Engineering for ML
We will create useful features for modeling purchase behavior.

In [0]:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

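# Start from the DataFrame that already carries the 0/1 conversion label
events_feat = events_corr

# Hour of day (0-23) at which the event happened
events_feat = events_feat.withColumn("hour", F.hour("event_time"))

# Day of week (1 = Sunday ... 7 = Saturday)
events_feat = events_feat.withColumn("day_of_week", F.dayofweek("event_date"))

# Log-scaled price; the +1 keeps a price of 0 defined
events_feat = events_feat.withColumn("price_log", F.log(F.col("price")+1))

# Seconds elapsed since the user's first recorded event
# (the first event of any type, not only views)
window_user = Window.partitionBy("user_id").orderBy("event_time")
events_feat = events_feat.withColumn(
    "time_since_first_view",
    F.unix_timestamp("event_time") - F.unix_timestamp(F.first("event_time").over(window_user))
)

# Weekend flag as 0/1 so it can be used as a numeric feature
events_feat = events_feat.withColumn(
    "is_weekend_int",
    F.dayofweek("event_date").isin([1,7]).cast("int")
)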

events_feat.select(
    "user_id",
    "product_id",
    "event_type",
    "price",
    "price_log",
    "hour",
    "day_of_week",
    "is_weekend_int",
    "time_since_first_view",
    "conversion"
).show(5)
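The `price_log` feature uses `log(price + 1)` to compress the long right tail of prices. As a rough illustration in plain Python, the sketch below applies the same transform to the p50, p90 and p99 prices quoted earlier plus one hypothetical extreme price of 50000, which is an assumption and not a value from the data.

```python
import math

# p50, p90 and p99 from the analysis above, plus one hypothetical extreme price
prices = [166, 759, 1661, 50000]

# Same transform as the price_log feature: natural log of (price + 1)
logged = [math.log(p + 1) for p in prices]

# Compare how far apart the largest and smallest values are before and after
raw_ratio = max(prices) / min(prices)
log_ratio = max(logged) / min(logged)

for p, lp in zip(prices, logged):
    print(f"price = {p:>6}  price_log = {lp:.3f}")
print(f"max/min on raw scale: {raw_ratio:.1f}")
print(f"max/min on log scale: {log_ratio:.2f}")
```

The ordering of prices is preserved, but extreme values no longer dominate the scale, which tends to help models that are sensitive to feature magnitude.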
