##### Learn: PySpark vs Pandas comparison
##### Task: Load full e-commerce dataset

### Step 1: Load dataset in PySpark (lazy: no rows are materialized yet)

In [0]:
events = spark.read.csv(
    "/Volumes/workspace/ecommerce/ecommerce_data/2019-Nov.csv",
    header=True,
    inferSchema=True
)


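### Question to yourself: Did Spark read all rows now?
#### Not into the DataFrame: Spark only built a logical plan. However, because inferSchema=True, Spark did run a job that scans the file once to work out the column types.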

### Step 2: Inspect the Schema
### What happened:

### printSchema() prints the schema Spark already inferred during the read

### No data scan is triggered by this call

In [0]:
events.printSchema()


###Inspect the Plan

In [0]:
events.explain(True)

##Step 3: Count rows (Action → Execution)
###What happened:

###Spark executed the full DAG

###Executors read the file

###Data distributed across partitions

###This is where Spark actually works 

In [0]:
events.count()


###Inspect the Plan

In [0]:
events.explain(True)


### Load a sample into a Pandas DataFrame for comparison

In [0]:
sample_pd = events.limit(10000).toPandas()
sample_pd.head()


###What happened:

###Data loaded immediately

###Stored fully in driver memory

###No lazy behavior

### Key realization:

### Pandas loads data eagerly → PySpark defers work

In [0]:
type(sample_pd)

In [0]:
type(events)

###Production-Ready Load (Best Practice)

In [0]:
def load_events(path):
    return spark.read.csv(path, header=True, inferSchema=True)

events = load_events("/Volumes/workspace/ecommerce/ecommerce_data/2019-Nov.csv")
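
One caveat on the loader above: `inferSchema=True` makes Spark scan the file an extra time to guess column types. A common alternative is to pass an explicit schema. The sketch below is a hedged variant, not this notebook's original loader. The column list and types are assumptions about this CSV (only some of these names appear elsewhere in the notebook), `load_events_with_schema` is a hypothetical name, and `event_time` is kept as a string so the later `F.to_timestamp` step still applies.

```python
# Assumed column layout of 2019-Nov.csv; verify it against
# events.printSchema() before relying on it.
EVENTS_SCHEMA_DDL = ", ".join([
    "event_time STRING",
    "event_type STRING",
    "product_id INT",
    "category_id BIGINT",
    "category_code STRING",
    "brand STRING",
    "price DOUBLE",
    "user_id INT",
    "user_session STRING",
])


def load_events_with_schema(spark, path, schema=EVENTS_SCHEMA_DDL):
    """Read the events CSV with a fixed schema instead of inferring one.

    `spark` is an active SparkSession; `schema` is a DDL-formatted string,
    which DataFrameReader.csv accepts in place of a StructType.
    """
    return spark.read.csv(path, header=True, schema=schema)
```

Calling `load_events_with_schema(spark, path)` returns a lazy DataFrame just like `load_events`, but without the inference pass over the file.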


### Learn: Joins (inner, left, right, outer)
### Task: Perform complex joins

Mental Model (Read This First)

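A Spark join usually means a shuffle: rows with the same key are moved to the same executor. Small tables can be broadcast instead, which avoids the shuffle.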

Whenever you join:

Spark may shuffle data

New stages are created

Join type controls which rows survive
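
To make "which rows survive" concrete, here is a plain-Python sketch (no Spark needed) of which join keys appear in the output for each `how` value. `surviving_keys` and the sample ids are hypothetical. Real joins also duplicate rows when a key matches several times, which this sketch ignores.

```python
def surviving_keys(left_keys, right_keys, how):
    """Return the set of join keys present after left.join(right, how=how)."""
    left, right = set(left_keys), set(right_keys)
    if how == "inner":
        return left & right   # only keys found on both sides
    if how == "left":
        return left           # every left key; unmatched right columns are NULL
    if how == "right":
        return right          # every right key; unmatched left columns are NULL
    if how == "outer":
        return left | right   # every key from either side
    raise ValueError(f"unsupported join type: {how}")


event_product_ids = [1, 2, 2, 3]   # product_id values seen in events
catalog_product_ids = [2, 3, 4]    # product_id values in a product table

for how in ("inner", "left", "right", "outer"):
    print(how, sorted(surviving_keys(event_product_ids, catalog_product_ids, how)))
```

In this notebook the product table is derived from the events themselves, so, provided `product_id` is never NULL, every event finds a match and the four join types return the same rows. The differences only show up when one side has keys the other lacks.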

###Prepare Joinable Tables (From Same Dataset)

####We’ll derive tables, just like real production pipelines.


In [0]:
events_df = spark.read.csv(
    "/Volumes/workspace/ecommerce/ecommerce_data/2019-Nov.csv",
    header=True,
    inferSchema=True
)



### Create Product Dimension Table

In [0]:
products_df = events_df.select(
    "product_id",
    "category_code",
    "brand"
).dropDuplicates()




###Always Alias Before Joins (BEST PRACTICE)

In [0]:
e = events_df.alias("e")
p = products_df.alias("p")


###INNER JOIN (Most Common)
Meaning

Keeps only matching product_id

Drops unmatched rows

In [0]:
inner_join_df = e.join(
    p,
    on="product_id",
    how="inner"
)



### Inspect Plan

In [0]:
inner_join_df.explain(True)

##Validate

In [0]:
inner_join_df.select(
    "product_id",
    "e.event_type",
    "p.brand"
).show(5)


###LEFT Outer JOIN
Meaning

Keeps all events

Missing product data → NULL

In [0]:
left_join_df = e.join(
    p,
    on="product_id",
    how="left"
)


In [0]:
left_join_df.select(
    "product_id",
    "e.event_type",
    "p.brand"
).show(5)


## RIGHT OUTER JOIN
Meaning

Keeps all products

Missing event data → NULL

In [0]:
right_join_df = e.join(
    p,
    on="product_id",
    how="right"
)


In [0]:
right_join_df.select(
    "product_id",
    "e.event_type",
    "p.brand"
).show(5)


### FULL OUTER JOIN
Meaning

Keeps all rows from both sides

Whichever side has no match → NULL

In [0]:
full_outer_join_df = e.join(
    p,
    on="product_id",
    how="outer"
)


In [0]:
full_outer_join_df.select(
    "product_id",
    "e.event_type",
    "p.brand"
).show(5)


### Learn: Window functions (running totals, rankings)
### Task: Calculate running totals with window functions

###Window Function

###Keeps every row

###Adds derived analytics columns

###Operates over a logical window of rows

####Think: “Calculate something over a group, but don’t collapse rows.”

In [0]:
from pyspark.sql import functions as F
from pyspark.sql.window import Window


In [0]:
events_df = spark.read.csv(
    "/Volumes/workspace/ecommerce/ecommerce_data/2019-Nov.csv",
    header=True,
    inferSchema=True
)
events_clean = events_df.filter(events_df.brand.isNotNull())


###Keep only required columns (best practice)

In [0]:
events_clean = events_clean.select(
    "event_time",
    "event_type",
    "product_id",
    "brand",
    "price"
)


###Prepare Data for Window Calculations

In [0]:
events_clean = events_clean.withColumn(
    "event_time",
    F.to_timestamp("event_time")
)


###Running total of revenue per brand over time

###Step 1: Filter purchase events

In [0]:
purchases_df = events_clean.filter(
    (F.col("event_type") == "purchase") &
    (F.col("price").isNotNull())
)


##Step 2: Define Window Specification

In [0]:
revenue_window = (
    Window
    .partitionBy("brand")
    .orderBy("event_time")
    .rowsBetween(Window.unboundedPreceding, Window.currentRow)
)


##Step 3: Calculate Running Total

In [0]:
running_revenue_df = purchases_df.withColumn(
    "running_revenue",
    F.sum("price").over(revenue_window)
)



In [0]:
running_revenue_df.select(
    "brand",
    "event_time",
    "price",
    "running_revenue"
).show(10, truncate=False)
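
As a cross-check of what this window computes, the following plain-Python sketch reproduces the same logic on a few in-memory rows: group by brand, sort by event time, then accumulate price. The sample rows and the `running_totals` helper are hypothetical, not taken from the dataset.

```python
from itertools import accumulate, groupby
from operator import itemgetter


def running_totals(rows):
    """rows: iterable of (brand, event_time, price) tuples.

    Returns (brand, event_time, price, running_revenue) tuples, mirroring
    partitionBy("brand").orderBy("event_time") with a frame that runs from
    the start of the partition to the current row.
    """
    ordered = sorted(rows, key=itemgetter(0, 1))          # brand, then time
    result = []
    for _, group in groupby(ordered, key=itemgetter(0)):  # one partition per brand
        group = list(group)
        totals = accumulate(price for _, _, price in group)
        result.extend((b, t, p, total) for (b, t, p), total in zip(group, totals))
    return result


sample = [
    ("acme", "2019-11-01 10:00:00", 10.0),
    ("zeta", "2019-11-01 09:00:00", 5.0),
    ("acme", "2019-11-01 09:00:00", 20.0),
    ("acme", "2019-11-01 11:00:00", 30.0),
]
for row in running_totals(sample):
    print(row)
```

When two purchases for a brand share the same `event_time`, Spark does not guarantee their relative order, so intermediate running totals can differ between runs. Adding a second column to `orderBy` as a tie-breaker makes the result deterministic.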


##Ranking (Top Products per Brand)
###Define Ranking Window

In [0]:
rank_window = (
    Window
    .partitionBy("brand")
    .orderBy(F.col("price").desc())
)


In [0]:
ranked_df = purchases_df.withColumn(
    "row_number", F.row_number().over(rank_window)
).withColumn(
    "rank", F.rank().over(rank_window)
).withColumn(
    "dense_rank", F.dense_rank().over(rank_window)
)


In [0]:
ranked_df.select(
    "brand",
    "product_id",
    "price",
    "row_number",
    "rank",
    "dense_rank"
).show(20)
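
The three ranking columns differ only in how they treat ties. The plain-Python sketch below shows the rule for a single partition sorted by price descending; `rank_prices` and the sample prices are hypothetical and only illustrate the semantics.

```python
def rank_prices(prices):
    """Return (price, row_number, rank, dense_rank) for prices sorted descending."""
    ordered = sorted(prices, reverse=True)
    out = []
    rank = dense = 0
    previous = object()  # sentinel that equals no price
    for row_number, price in enumerate(ordered, start=1):
        if price != previous:
            rank = row_number  # rank jumps to the current position after ties
            dense += 1         # dense_rank advances by exactly one
            previous = price
        out.append((price, row_number, rank, dense))
    return out


for row in rank_prices([100.0, 300.0, 200.0, 300.0]):
    print(row)
```

The two tied rows share `rank` 1 and `dense_rank` 1, the next row gets `rank` 3 but `dense_rank` 2, and `row_number` stays unique throughout.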


###Filter Top-N Using Window Output
### Example: Top 3 most expensive purchase rows per brand (the same product can appear more than once)

In [0]:
top_3_products_df = ranked_df.filter(
    F.col("row_number") <= 3
)

In [0]:
top_3_products_df.select(
    "brand",
    "product_id",
    "price"
).show(10)


#User-Defined Functions (UDFs) in PySpark

What is a UDF (in simple words)?

A UDF lets you write custom Python logic and apply it to Spark DataFrame columns when:

Built-in Spark SQL functions are not enough

You need custom business logic

You want row-by-row transformation

Think of UDF as:

“Let Spark call my Python function on every row”

Important rule (VERY IMPORTANT)

Avoid UDFs unless really needed

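Why?

UDFs are slower: each row is serialized from the JVM to a Python worker and back

They block Spark optimizations such as predicate pushdown

Spark's optimizer treats them as a black box

Always prefer: Built-in functions > SQL expressions > UDF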


#Basic Python UDF – Syntax
Step 1: Write a Python function

In [0]:
def price_category(price):
    if price is None:
        return "UNKNOWN"
    elif price >= 100:
        return "PREMIUM"
    elif price >= 50:
        return "STANDARD"
    else:
        return "BUDGET"
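
Because `price_category` is an ordinary Python function, its boundaries can be checked locally before Spark ever calls it. The sketch below repeats the function so it runs on its own; the `checks` table is a hypothetical set of boundary values chosen for illustration.

```python
def price_category(price):
    """Same logic as the function above, repeated for a standalone check."""
    if price is None:
        return "UNKNOWN"
    elif price >= 100:
        return "PREMIUM"
    elif price >= 50:
        return "STANDARD"
    else:
        return "BUDGET"


# Boundary values on both sides of each threshold, plus the NULL case.
checks = {
    None: "UNKNOWN",
    100: "PREMIUM",
    99.99: "STANDARD",
    50: "STANDARD",
    49.99: "BUDGET",
    0: "BUDGET",
}
for price, expected in checks.items():
    assert price_category(price) == expected, (price, expected)
print("all boundary checks passed")
```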



Step 2: Wrap the function as a UDF

In [0]:
from pyspark.sql.types import StringType
from pyspark.sql.functions import udf

price_category_udf = udf(price_category, StringType())



Apply on YOUR DataFrame (purchases_df)

In [0]:
purchases_with_category_df = purchases_df.withColumn(
    "price_category",          
    price_category_udf(F.col("price"))
)


Validate

In [0]:
purchases_with_category_df.select(
    "brand", "product_id", "price", "price_category"
).show(10)



###Another Derived Feature (Using SAME Dataset)
Revenue Flag

In [0]:
def revenue_flag(price):
    if price is None:
        return "UNKNOWN"
    return "HIGH_VALUE" if price >= 75 else "LOW_VALUE"

revenue_flag_udf = udf(revenue_flag, StringType())

purchases_df = purchases_df.withColumn(
    "revenue_flag",
    revenue_flag_udf(F.col("price"))
)


In [0]:
purchases_df.select(
    "brand", "price", "revenue_flag"
).show(10)


Derived Feature from event_time (REAL column)
Weekend / Weekday

In [0]:
def day_type(event_time):
    if event_time is None:
        return "UNKNOWN"
    return "WEEKEND" if event_time.weekday() >= 5 else "WEEKDAY"

day_type_udf = udf(day_type, StringType())

purchases_df = purchases_df.withColumn(
    "day_type",
    day_type_udf(F.col("event_time"))
)


In [0]:
purchases_df.select(
    "event_time", "day_type"
).show(10)
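
The weekend flag can also be built without a UDF. The sketch below is one possible version under stated assumptions: `add_day_type_native` is a hypothetical helper, it assumes `event_time` is already a timestamp column, and it is only defined here, not executed. The Python `day_type` function is repeated so its outputs can serve as reference values.

```python
from datetime import datetime


def day_type(event_time):
    """Same logic as the UDF above, repeated so this snippet runs on its own."""
    if event_time is None:
        return "UNKNOWN"
    return "WEEKEND" if event_time.weekday() >= 5 else "WEEKDAY"


def add_day_type_native(df):
    """Hypothetical helper: add day_type using only built-in column functions.

    F.dayofweek returns 1 for Sunday through 7 for Saturday, so the weekend
    is {1, 7}. Requires an active Spark session when called.
    """
    from pyspark.sql import functions as F

    return df.withColumn(
        "day_type",
        F.when(F.col("event_time").isNull(), "UNKNOWN")
         .when(F.dayofweek("event_time").isin(1, 7), "WEEKEND")
         .otherwise("WEEKDAY"),
    )


# Reference values from the Python version, useful for spot-checking the
# native column on the same dates.
print(day_type(datetime(2019, 11, 2)))  # a Saturday
print(day_type(datetime(2019, 11, 4)))  # a Monday
```

Note the different numbering: Python's `weekday()` counts Monday as 0, while `F.dayofweek` counts Sunday as 1, which is why the two versions test different values.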


IMPORTANT: Same Result WITHOUT UDF (Best Practice)

Spark-native (FASTER):

In [0]:
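purchases_df = purchases_df.withColumn(
    "price_category",
    # Handle NULL first so the result matches the UDF's "UNKNOWN" branch
    F.when(F.col("price").isNull(), "UNKNOWN")
     .when(F.col("price") >= 100, "PREMIUM")
     .when(F.col("price") >= 50, "STANDARD")
     .otherwise("BUDGET")
)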


In [0]:
# show() prints the rows and returns None, so it should not be wrapped in display()
purchases_df.show(10)