### Uploading a sample e-commerce CSV file
#### The file "ecommerce_transactions.csv" is taken from the Kaggle datasets and was uploaded using Data Ingestion

Load the dataset

In [0]:
spark

<pyspark.sql.connect.session.SparkSession at 0xfff614828e60>

In [0]:
%python
events = spark.table("default.ecommerce_transactions")

Display the dataset

In [0]:
%python
events.limit(10).display()

Transaction_ID,User_Name,Age,Country,Product_Category,Purchase_Amount,Payment_Method,Transaction_Date
1,Ava Hall,63,Mexico,Clothing,780.69,Debit Card,2023-04-14
2,Sophia Hall,59,India,Beauty,738.56,PayPal,2023-07-30
3,Elijah Thompson,26,France,Books,178.34,Credit Card,2023-09-17
4,Elijah White,43,Mexico,Sports,401.09,UPI,2023-06-21
5,Ava Harris,48,Germany,Beauty,594.83,Net Banking,2024-10-29
6,Elijah Harris,51,India,Toys,966.5,Cash on Delivery,2025-01-18
7,Oliver Clark,27,Germany,Home & Kitchen,341.73,Credit Card,2024-03-13
8,Olivia Allen,46,Canada,Home & Kitchen,11.33,Debit Card,2024-01-04
9,Liam Harris,54,France,Beauty,279.43,Cash on Delivery,2023-12-06
10,Liam Allen,60,Canada,Beauty,223.9,Cash on Delivery,2023-08-07


#### Creating Derived Features using UDFs (Age Group Classification)
Create a **User-Defined Function (UDF)** to categorize users into age groups.


In [0]:
from pyspark.sql import functions as F  # needed for F.col below
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Create AGE GROUP using a UDF (user defined function)
def age_bucket(age):
    if age is None:
        return "Unknown"
    if age < 25:
        return "young"
    elif age < 50:
        return "middle_aged"
    elif age < 65:
        return "senior"
    else:
        return "old"

age_bucket_udf = udf(age_bucket, StringType())

events_with_age = events.withColumn("age_group", age_bucket_udf(F.col("Age")))

events_with_age.limit(5).display()

Transaction_ID,User_Name,Age,Country,Product_Category,Purchase_Amount,Payment_Method,Transaction_Date,age_group
1,Ava Hall,63,Mexico,Clothing,780.69,Debit Card,2023-04-14,senior
2,Sophia Hall,59,India,Beauty,738.56,PayPal,2023-07-30,senior
3,Elijah Thompson,26,France,Books,178.34,Credit Card,2023-09-17,middle_aged
4,Elijah White,43,Mexico,Sports,401.09,UPI,2023-06-21,middle_aged
5,Ava Harris,48,Germany,Beauty,594.83,Net Banking,2024-10-29,middle_aged
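
The bucket boundaries are easy to get wrong by one, so it can help to check them in plain Python before wrapping the function in a UDF. The sketch below repeats the `age_bucket` function from the cell above; the sample ages are made up for illustration and are not taken from the dataset.

```python
def age_bucket(age):
    # Same logic as the UDF body above
    if age is None:
        return "Unknown"
    if age < 25:
        return "young"
    elif age < 50:
        return "middle_aged"
    elif age < 65:
        return "senior"
    else:
        return "old"

# Hypothetical sample ages chosen to sit on each boundary
samples = [None, 24, 25, 49, 50, 64, 65]
buckets = {age: age_bucket(age) for age in samples}

for age, bucket in buckets.items():
    print(age, "->", bucket)
```

The same bucketing could probably also be expressed with Spark's built-in `F.when(...).otherwise(...)` expressions, which avoids a Python UDF; that variant is not part of the original notebook.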


#### Window Functions: Running Total per User

Use a **window function** to calculate the cumulative purchase amount for each user.
Because `User_Name` alone is not unique, define a composite user key from
`User_Name`, `Country`, and `Age` so that different users who share a name are not merged into one running total.


In [0]:
from pyspark.sql import functions as F
from pyspark.sql.window import Window
user_key_cols = ["User_Name", "Country", "Age"]
# Running total per (User_Name, Country, Age)
running_total = Window.partitionBy(*user_key_cols).orderBy("Transaction_Date", "Transaction_ID")
events_with_running = events.withColumn(
    "running_total_per_user",
    F.sum("Purchase_Amount").over(running_total)
)
display(
    events_with_running.select(
        "Transaction_ID", "User_Name", "Age", "Country",
        "Purchase_Amount", "Transaction_Date", "running_total_per_user"
    ).orderBy("User_Name", "Country", "Age", "Transaction_Date", "Transaction_ID").limit(10)
)

Transaction_ID,User_Name,Age,Country,Purchase_Amount,Transaction_Date,running_total_per_user
27055,Ava Allen,18,Australia,839.22,2023-03-14,839.22
6654,Ava Allen,20,Australia,141.69,2023-08-10,141.69
20677,Ava Allen,22,Australia,328.2,2023-06-10,328.2
24669,Ava Allen,24,Australia,697.08,2025-02-24,697.08
24861,Ava Allen,26,Australia,212.78,2024-04-17,212.78
14968,Ava Allen,27,Australia,707.59,2024-02-14,707.59
13221,Ava Allen,27,Australia,677.41,2024-04-05,1385.0
9819,Ava Allen,28,Australia,532.07,2023-03-14,532.07
37645,Ava Allen,28,Australia,279.39,2024-01-14,811.46
31161,Ava Allen,32,Australia,861.08,2024-02-26,861.08
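
To make the window semantics concrete, here is a minimal plain-Python sketch of the same computation: partition by the composite key, order by date and transaction ID, and accumulate. The four rows are copied from the output above; the helper names `user_key` and `order_key` are illustrative only.

```python
from itertools import groupby

# Rows copied from the output above:
# (Transaction_ID, User_Name, Age, Country, Purchase_Amount, Transaction_Date)
rows = [
    (13221, "Ava Allen", 27, "Australia", 677.41, "2024-04-05"),
    (14968, "Ava Allen", 27, "Australia", 707.59, "2024-02-14"),
    (37645, "Ava Allen", 28, "Australia", 279.39, "2024-01-14"),
    (9819, "Ava Allen", 28, "Australia", 532.07, "2023-03-14"),
]

def user_key(row):
    # Mirrors partitionBy("User_Name", "Country", "Age")
    return (row[1], row[3], row[2])

def order_key(row):
    # Mirrors orderBy("Transaction_Date", "Transaction_ID");
    # ISO-formatted date strings sort chronologically
    return (row[5], row[0])

running = {}
ordered = sorted(rows, key=lambda r: (user_key(r), order_key(r)))
for _, group in groupby(ordered, key=user_key):
    total = 0.0
    for row in group:
        total += row[4]
        # Round for display; the raw float sum may carry tiny errors
        running[row[0]] = round(total, 2)

for transaction_id, total in running.items():
    print(transaction_id, total)
```

Each partition restarts at zero, which is why the users aged 27 and 28 get separate running totals even though they share a name and country.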


#### Aggregation: Top 5 Product Categories by Total Revenue

Aggregate transaction-level data to identify the
**top 5 product categories** based on total purchase amount.

This is a common business analytics use case.


In [0]:
from pyspark.sql import functions as F

# Top 5 product categories by total revenue (sum of Purchase_Amount)
top_5_products = (
    events
    .groupBy("Product_Category")
    .agg(F.sum("Purchase_Amount").alias("total_revenue"))
    .orderBy(F.desc("total_revenue"))
    .limit(5)
)

display(top_5_products)

Product_Category,total_revenue
Sports,3195335.8999999925
Toys,3185652.36000001
Books,3181897.2999999947
Clothing,3171225.960000004
Electronics,3133965.03999998
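
The long fractional tails in `total_revenue` (for example `3195335.8999999925`) look like binary floating-point accumulation error, which suggests `Purchase_Amount` is stored as a double. That is an inference from the output, not something the notebook states. The effect can be reproduced in plain Python with a small sketch:

```python
from decimal import Decimal

# Summing binary floats accumulates representation error
float_total = sum([0.1] * 10)

# Summing exact decimals does not
decimal_total = sum([Decimal("0.1")] * 10, Decimal("0"))

print(float_total)    # 0.9999999999999999
print(decimal_total)  # 1.0
```

If exact currency totals matter, one option in Spark would be to cast the amount column to a decimal type before aggregating, or to round the aggregated result for display; neither is done in the original notebook.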


#### Ranking Product Categories using Window Functions

Rank product categories by total revenue using
a **window ranking function** (`dense_rank`).

This produces a global ranking of product categories
from highest to lowest revenue.


In [0]:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Total revenue per product category
product_revenue = (
    events
    .groupBy("Product_Category")
    .agg(F.sum("Purchase_Amount").alias("total_revenue"))
)
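# Rank product categories by revenue.
# No partitionBy: the ranking is global, so Spark moves all rows into a single
# partition. That is acceptable here because product_revenue has one row per category.
rank_window = Window.orderBy(F.col("total_revenue").desc())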

ranked_products = product_revenue.withColumn(
    "product_rank_by_revenue",
    F.dense_rank().over(rank_window)
)
# display result
display(
    ranked_products
    .orderBy("product_rank_by_revenue")
)

Product_Category,total_revenue,product_rank_by_revenue
Sports,3195335.8999999925,1
Toys,3185652.36000001,2
Books,3181897.2999999947,3
Clothing,3171225.960000004,4
Electronics,3133965.03999998,5
Grocery,3123579.5200000014,6
Home & Kitchen,3108945.779999996,7
Beauty,3057387.789999989,8
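
No two categories have the same revenue in the output above, so `dense_rank` behaves like a plain row number here. The difference only shows up with ties. The plain-Python sketch below illustrates it with hypothetical revenue values (not taken from the dataset); the function names are illustrative.

```python
def dense_rank_desc(values):
    # Ties share a rank and the next rank follows without a gap
    distinct = sorted(set(values), reverse=True)
    position = {value: i + 1 for i, value in enumerate(distinct)}
    return [position[value] for value in values]

def rank_desc(values):
    # Ties share a rank and the next rank skips ahead (SQL RANK semantics)
    ordered = sorted(values, reverse=True)
    return [ordered.index(value) + 1 for value in values]

# Hypothetical revenues with one tie
revenues = [300, 200, 200, 100]

dense = dense_rank_desc(revenues)
plain = rank_desc(revenues)

print(dense)  # [1, 2, 2, 3]
print(plain)  # [1, 2, 2, 4]
```

With `dense_rank`, the category after a tie gets rank 3 rather than 4, which keeps the rank values contiguous.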


#### Joining Aggregated Product Rankings back to Transaction Data

Perform a **LEFT JOIN** on `Product_Category` to enrich each original transaction
with its category's total revenue and rank.

This is a common feature-enrichment pattern in data engineering.


In [0]:
# Join product ranking back to events
events_with_product_rank = events.join(
    ranked_products,
    on="Product_Category",
    how="left"
)

# Display result
display(
    events_with_product_rank.select(
        "Transaction_ID",
        "Product_Category",
        "Purchase_Amount",
        "total_revenue",
        "product_rank_by_revenue"
    ).orderBy("product_rank_by_revenue").limit(5)
)

Transaction_ID,Product_Category,Purchase_Amount,total_revenue,product_rank_by_revenue
39,Sports,452.37,3195335.8999999925,1
21,Sports,228.8,3195335.8999999925,1
23,Sports,713.29,3195335.8999999925,1
4,Sports,401.09,3195335.8999999925,1
41,Sports,883.32,3195335.8999999925,1
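
A left join keeps every transaction even when its category has no match on the right side. In this notebook the rankings are derived from the same `events` table, so every non-null category should find a match. The plain-Python sketch below shows what would happen otherwise. The first two transactions and both ranks are taken from the outputs above; the third transaction and its `Garden` category are hypothetical.

```python
# Ranks taken from the ranked output above
ranking = {"Sports": 1, "Toys": 2}

# (Transaction_ID, Product_Category, Purchase_Amount)
transactions = [
    (4, "Sports", 401.09),
    (6, "Toys", 966.5),
    (99999, "Garden", 10.0),  # hypothetical row with no matching category
]

# Left join: keep every transaction; a missing match yields None
joined = [
    (transaction_id, category, amount, ranking.get(category))
    for transaction_id, category, amount in transactions
]

for row in joined:
    print(row)
```

An inner join would drop the unmatched row instead, silently shrinking the transaction data, which is the usual reason to prefer a left join for enrichment.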
