
# PySpark Scenarios with Examples

This notebook demonstrates common PySpark data analysis scenarios with practical examples. Each scenario showcases a specific use case, including sales aggregation, identifying top-selling products, analyzing customer purchase patterns, and calculating customer lifetime value.

**Prepared by TR Raveendra**

In [0]:
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.sql.window import *
 

### Scenario 1: Calculating Sales Totals
This scenario calculates the total price and total quantity sold for each product using the `groupBy()` and `agg()` functions.

In [0]:
# 1. Create sample sales data
sales_data = [
    {"product_id": 1, "product_name": "Laptop", "price": 1200, "quantity": 2},
    {"product_id": 2, "product_name": "Smartphone", "price": 800, "quantity": 1},
    {"product_id": 1, "product_name": "Laptop", "price": 1200, "quantity": 1},
]

# 2. Create DataFrame
df = spark.createDataFrame(sales_data)

# 3. Calculate sales totals using groupBy() and agg()
sales_totals = df.groupBy("product_name").agg(
    sum("price").alias("total_price"),
    sum("quantity").alias("total_quantity")
)

# 4. Show the result
sales_totals.show(truncate=False)

+------------+-----------+--------------+
|product_name|total_price|total_quantity|
+------------+-----------+--------------+
|Laptop      |2400       |3             |
|Smartphone  |800        |1             |
+------------+-----------+--------------+



### Scenario 2: Identifying Top-Selling Products
This code identifies the top-selling products by quantity using `groupBy()`, `agg()`, `sort()`, and `limit()`.

In [0]:
# 1. Create sample sales data
sales_data = [
    {"product_id": 1, "product_name": "Laptop", "price": 1200, "quantity": 2},
    {"product_id": 2, "product_name": "Smartphone", "price": 800, "quantity": 1},
    {"product_id": 1, "product_name": "Laptop", "price": 1200, "quantity": 1},
]

# 2. Create DataFrame
df = spark.createDataFrame(sales_data)

# 3. Aggregate, sort, and limit to find top-selling products
top_selling_products = df.groupBy("product_name").agg(
    sum("quantity").alias("total_quantity")
).sort(
    "total_quantity", ascending=False
).limit(3)

# 4. Show the result
top_selling_products.show()

+------------+--------------+
|product_name|total_quantity|
+------------+--------------+
|      Laptop|             3|
|  Smartphone|             1|
+------------+--------------+



### Scenario 3: Analyzing Customer Purchase Patterns
This script joins customer and sales data to analyze purchase patterns using `join()` and `groupBy()`.

In [0]:
# 1. Create sample data
customer_data = [
    {"customer_id": 1, "name": "Raveendra", "email": "Raveendra@gmail.com"},
    {"customer_id": 2, "name": "Reshwanth", "email": "Reshwanth@gmail.com"},
]

sales_data = [
    {"order_id": 1, "customer_id": 1, "product_id": 1, "quantity": 1},
    {"order_id": 2, "customer_id": 2, "product_id": 2, "quantity": 2},
]

# 2. Create DataFrames
customer_df = spark.createDataFrame(customer_data)
sales_df = spark.createDataFrame(sales_data)

# 3. Join and group to analyze purchase patterns
joined_df = customer_df.join(sales_df, on="customer_id")
purchase_patterns = joined_df.groupBy("name").agg(count("order_id"))

# 4. Show the result
purchase_patterns.show()

+---------+---------------+
|     name|count(order_id)|
+---------+---------------+
|Raveendra|              1|
|Reshwanth|              1|
+---------+---------------+



### Scenario 4: Calculating Customer Lifetime Value (CLV)
This scenario calculates the total amount spent and the total number of orders for each customer as a simple measure of lifetime value.

In [0]:
# 1. Create sample data
order_data = [
    {"order_id": 1, "customer_id": 1, "order_amount": 100},
    {"order_id": 2, "customer_id": 1, "order_amount": 50},
    {"order_id": 3, "customer_id": 2, "order_amount": 200},
]

# 2. Create DataFrame
df = spark.createDataFrame(order_data)

# 3. Group by customer to calculate CLV
clv_df = df.groupBy("customer_id").agg(
    sum("order_amount").alias("total_spent"),
    count("order_id").alias("total_orders")
)

# 4. Show the result
clv_df.show()

+-----------+-----------+------------+
|customer_id|total_spent|total_orders|
+-----------+-----------+------------+
|          1|        150|           2|
|          2|        200|           1|
+-----------+-----------+------------+



### Scenario 5: Identifying Repeat Customers
This code counts the orders placed by each customer. Customers whose count is greater than one are repeat customers.

In [0]:
# 1. Create sample data
order_data = [
    {"order_id": 1, "customer_id": 1},
    {"order_id": 2, "customer_id": 2},
    {"order_id": 3, "customer_id": 1},
    {"order_id": 4, "customer_id": 3},
]

# 2. Create DataFrame
df = spark.createDataFrame(order_data)

# 3. Group by customer_id and count orders
repeat_customer_df = df.groupBy("customer_id").count()

# 4. Show the result
repeat_customer_df.show()

+-----------+-----+
|customer_id|count|
+-----------+-----+
|          1|    2|
|          2|    1|
|          3|    1|
+-----------+-----+



### Scenario 6: Analyzing Order Placement Frequency
This script analyzes order frequency by year, demonstrating the use of the `year()` function.

In [0]:
# 1. Create sample data
order_data = [
    {"order_id": 1, "customer_id": 1, "order_date": "2023-10-01"},
    {"order_id": 2, "customer_id": 1, "order_date": "2023-10-05"},
    {"order_id": 3, "customer_id": 2, "order_date": "2023-10-10"},
]

# 2. Create DataFrame
df = spark.createDataFrame(order_data)

# 3. Add a year column and then group to analyze frequency
df = df.withColumn("order_year", year(df["order_date"]))
order_frequency_df = df.groupBy("customer_id", "order_year").count()

# 4. Show the result
order_frequency_df.show()

+-----------+----------+-----+
|customer_id|order_year|count|
+-----------+----------+-----+
|          1|      2023|    2|
|          2|      2023|    1|
+-----------+----------+-----+



### Scenario 7: Calculating Salesman Commission
This code calculates the commission for each salesman by multiplying their total sales by their commission rate.

In [0]:
# 1. Create sample data
data = [
    {"salesman_id": 1, "sales": 1000, "commission_rate": 0.1},
    {"salesman_id": 2, "sales": 2000, "commission_rate": 0.15},
    {"salesman_id": 3, "sales": 3000, "commission_rate": 0.2},
]

# 2. Create DataFrame
df = spark.createDataFrame(data=data)

# 3. Calculate commission using withColumn()
df_with_commission = df.withColumn("commission", col("sales") * col("commission_rate"))

# 4. Show the result
df_with_commission.show()

+---------------+-----+-----------+----------+
|commission_rate|sales|salesman_id|commission|
+---------------+-----+-----------+----------+
|            0.1| 1000|          1|     100.0|
|           0.15| 2000|          2|     300.0|
|            0.2| 3000|          3|     600.0|
+---------------+-----+-----------+----------+



### Scenario 8: Getting Unique Salesman IDs
This scenario shows how to get a list of unique salesman IDs using the `distinct()` transformation.

In [0]:
# 1. Create sample data
data = [
    {"salesman_id": 1, "sales": 1000},
    {"salesman_id": 1, "sales": 2000},
    {"salesman_id": 2, "sales": 3000},
    {"salesman_id": 2, "sales": 4000},
    {"salesman_id": 3, "sales": 5000},
]

# 2. Create DataFrame
df = spark.createDataFrame(data=data)

# 3. Get unique salesman IDs using distinct()
unique_salesman_ids = df.select("salesman_id").distinct()

# 4. Show the result
unique_salesman_ids.show()

+-----------+
|salesman_id|
+-----------+
|          1|
|          2|
|          3|
+-----------+



### Scenario 9: Filtering Salesmen by City
This code filters the DataFrame to find all salesmen from a specific city (Bangalore) and then selects their name and city.

In [0]:
# 1. Create sample data with a predefined schema
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

data = [
    (1, "Reshwanth", "Bangalore"),
    (2, "Vikranth", "New York"),
    (3, "Raveendra", "Bangalore"),
]

schema = StructType([
    StructField("salesman_id", IntegerType(), True),
    StructField("salesman_name", StringType(), True),
    StructField("city", StringType(), True),
])

# 2. Create DataFrame
df = spark.createDataFrame(data=data, schema=schema)

# 3. Filter and select
bangalore_salesmen = df.filter(col("city") == "Bangalore")
salesmen_info = bangalore_salesmen.select("salesman_name", "city")

# 4. Show the result
salesmen_info.show()

+-------------+---------+
|salesman_name|     city|
+-------------+---------+
|    Reshwanth|Bangalore|
|    Raveendra|Bangalore|
+-------------+---------+



### Scenario 10: Filtering Order Data
This scenario filters a DataFrame to find orders from a specific delivery person and selects specific columns.

In [0]:
# 1. Create sample data with a predefined schema
data = [
    (1, "2023-01-01", 10, 100, 100),
    (2, "2023-01-02", 5, 50, 200),
    (3, "2023-01-03", 8, 80, 200),
    (4, "2023-01-04", 12, 120, 300),
]

schema = StructType([
    StructField("order_number", IntegerType()),
    StructField("date", StringType()),
    StructField("qty", IntegerType()),
    StructField("amount", IntegerType()),
    StructField("deliverypersonid", IntegerType()),
])

# 2. Create DataFrame and cast date column
df = spark.createDataFrame(data=data, schema=schema).withColumn("date", col("date").cast("date"))

# 3. Filter and select
filtered_data = df.filter(col("deliverypersonid") == 200)
result = filtered_data.select("order_number", "date", "qty", "amount")

# 4. Show the result
result.show()

+------------+----------+---+------+
|order_number|      date|qty|amount|
+------------+----------+---+------+
|           2|2023-01-02|  5|    50|
|           3|2023-01-03|  8|    80|
+------------+----------+---+------+



### Scenario 11: Getting First and Last Day of the Month
This code demonstrates how to calculate the first and last day of the month for a given date column using trunc() and last_day().

In [0]:
# 1. Create sample data with a date column
data = [
    ("2023-01-15",),
    ("2023-02-20",),
    ("2023-03-05",),
    ("2023-02-10",),
]

columns = ["date"]

# 2. Create DataFrame and convert string to DateType
df = spark.createDataFrame(data, columns).withColumn("date", col("date").cast("date"))

# 3. Calculate first and last dates of the month
df_result = df.withColumn(
    "first_day_month",
    trunc(df["date"], "month")
).withColumn(
    "last_day_month",
    last_day(df["date"])
)

# 4. Show the result
df_result.show()

+----------+---------------+--------------+
|      date|first_day_month|last_day_month|
+----------+---------------+--------------+
|2023-01-15|     2023-01-01|    2023-01-31|
|2023-02-20|     2023-02-01|    2023-02-28|
|2023-03-05|     2023-03-01|    2023-03-31|
|2023-02-10|     2023-02-01|    2023-02-28|
+----------+---------------+--------------+



### Scenario 12: Finding First Visited Customers
This scenario uses a window function to find the first visit for each customer, partitioned by customer ID.



In [0]:
# 1. Create sample data
data = [
    ("Cust1", "StoreA", "2023-01-01"),
    ("Cust2", "StoreB", "2023-01-02"),
    ("Cust1", "StoreA", "2023-01-03"),
    ("Cust3", "StoreB", "2023-01-04"),
    ("Cust2", "StoreA", "2023-01-05"),
]

schema = StructType([
    StructField("customer_id", StringType(), True),
    StructField("store_id", StringType(), True),
    StructField("visit_date", StringType(), True),
])

# 2. Create DataFrame and cast date column
df = spark.createDataFrame(data, schema).withColumn("visit_date", to_date(col("visit_date")))

# 3. Define a window specification
windowSpec = Window.partitionBy("customer_id").orderBy("visit_date")

# 4. Find the first visit date for each customer
df_result = df.withColumn(
    "first_time_visited",
    min(col("visit_date")).over(windowSpec)
).filter(
    col("visit_date") == col("first_time_visited")
)

# 5. Select the required columns and show
result = df_result.select("customer_id", "store_id", "visit_date")
result.show()

+-----------+--------+----------+
|customer_id|store_id|visit_date|
+-----------+--------+----------+
|      Cust1|  StoreA|2023-01-01|
|      Cust2|  StoreB|2023-01-02|
|      Cust3|  StoreB|2023-01-04|
+-----------+--------+----------+



### Scenario 13: Union of Two DataFrames
This scenario demonstrates how to combine two DataFrames with the same schema into a single DataFrame using the `union()` operation.

In [0]:
# 1. Create sample data for two DataFrames
data1 = [("Laptop", 1200), ("Smartphone", 800)]
data2 = [("Tablet", 500), ("Smartwatch", 300)]
columns = ["product_name", "price"]

# 2. Create the DataFrames
df1 = spark.createDataFrame(data=data1, schema=columns)
df2 = spark.createDataFrame(data=data2, schema=columns)

# 3. Union the two DataFrames
union_df = df1.union(df2)

# 4. Show the result
union_df.show()

+------------+-----+
|product_name|price|
+------------+-----+
|      Laptop| 1200|
|  Smartphone|  800|
|      Tablet|  500|
|  Smartwatch|  300|
+------------+-----+



### Scenario 14: Exploding an Array Column
This script shows how to transform a DataFrame with an array column by creating a new row for each element in the array using the `explode()` function.

In [0]:
# 1. Create a DataFrame with an array column
from pyspark.sql.types import ArrayType, StringType, StructType, StructField

data = [
    ("Laptop", ["MSI", "DELL"]),
    ("Smartphone", ["Samsung", "Apple", "Google"]),
]

schema = StructType([
    StructField("product", StringType(), True),
    StructField("brands", ArrayType(StringType()), True),
])

df = spark.createDataFrame(data, schema)

# 2. Explode the 'brands' column
exploded_df = df.withColumn("brand", explode(df.brands))

# 3. Show the result
exploded_df.show()

+----------+--------------------+-------+
|   product|              brands|  brand|
+----------+--------------------+-------+
|    Laptop|         [MSI, DELL]|    MSI|
|    Laptop|         [MSI, DELL]|   DELL|
|Smartphone|[Samsung, Apple, ...|Samsung|
|Smartphone|[Samsung, Apple, ...|  Apple|
|Smartphone|[Samsung, Apple, ...| Google|
+----------+--------------------+-------+



### Scenario 15: Dropping Duplicate Rows
This scenario demonstrates how to remove duplicate rows from a DataFrame based on all columns.

In [0]:
# 1. Create a DataFrame with duplicate rows
data = [
    ("Laptop", 1200),
    ("Smartphone", 800),
    ("Laptop", 1200),
]
columns = ["product_name", "price"]
df = spark.createDataFrame(data=data, schema=columns)

# 2. Drop the duplicate rows
distinct_df = df.dropDuplicates()

# 3. Show the result
distinct_df.show()

+------------+-----+
|product_name|price|
+------------+-----+
|      Laptop| 1200|
|  Smartphone|  800|
+------------+-----+



### Scenario 16: Renaming a Column
This code shows how to rename a column in a DataFrame using `withColumnRenamed()`.

In [0]:
# 1. Create a DataFrame
data = [("Laptop", 1200)]
columns = ["product_name", "price"]
df = spark.createDataFrame(data=data, schema=columns)

# 2. Rename the 'product_name' column to 'item_name'
renamed_df = df.withColumnRenamed("product_name", "item_name")

# 3. Show the result
renamed_df.show()

+---------+-----+
|item_name|price|
+---------+-----+
|   Laptop| 1200|
+---------+-----+



### Scenario 17: Adding a New Column
This script adds a new column to a DataFrame, in this case a `status` column with a fixed value.

In [0]:
# 1. Create a DataFrame
data = [("Laptop", 1200), ("Smartphone", 800)]
columns = ["product_name", "price"]
df = spark.createDataFrame(data=data, schema=columns)

# 2. Add a new 'status' column
df_with_status = df.withColumn("status", lit("New"))

# 3. Show the result
df_with_status.show()

+------------+-----+------+
|product_name|price|status|
+------------+-----+------+
|      Laptop| 1200|   New|
|  Smartphone|  800|   New|
+------------+-----+------+



### **Detailed Documentation: PySpark Window Functions**

**`unboundedPreceding`** and **`currentRow`** are constants on PySpark's `Window` class (`Window.unboundedPreceding`, `Window.currentRow`) used to define a **frame**: the set of rows within a window partition on which a function operates. The frame moves as the function is applied to each row.

---

### **Detailed Overview**

In simple terms, a window function calculates a value for each row based on a set of rows related to the current row. This set of related rows is the **window frame**. The frame is defined by a `start` boundary and an `end` boundary.

* **`unboundedPreceding`**: This is a frame boundary that represents the **first row** of the partition. No matter where the current row is, this boundary will always be at the very beginning of the partition. It is "unbounded" because it doesn't refer to a fixed number of rows; it just means "start from the beginning."
* **`currentRow`**: This is a frame boundary that represents the **current row** being processed. As the window function moves from row to row, the `currentRow` boundary changes to always be at the row currently under consideration.

When you use `rowsBetween(Window.unboundedPreceding, Window.currentRow)`, you are telling Spark to define a frame for each row that starts at the first row of the partition and ends at the current row.

---

### **Analogy and Example**

Imagine you're tracking your daily steps. You want to know your total steps for the year up to the current day. Each day is a row in your data.

* The **partition** is the entire year's worth of data.
* The **window frame** for any given day is from January 1st (the `unboundedPreceding` row) up to that day (the `currentRow`).

This combination is perfect for calculating metrics that are cumulative over time, such as **running totals**, **cumulative averages**, and **running counts**, because the calculation for each row includes all preceding rows in the same partition.

Let's illustrate with a simple table and a running sum calculation:

| store_id | product_id | sales_amount |
| :--- | :--- | :--- |
| Store A | ProductX | 100 |
| Store A | ProductX | 150 |
| Store A | ProductX | 50 |

If you were to calculate a running sum on this data ordered by an implicit date, the window frame would change for each row:

1.  **For the first row (`sales_amount` = 100):**
    * The frame starts at `unboundedPreceding` (the first row).
    * The frame ends at `currentRow` (the first row).
    * The sum is **100**.

2.  **For the second row (`sales_amount` = 150):**
    * The frame starts at `unboundedPreceding` (the first row).
    * The frame ends at `currentRow` (the second row).
    * The sum is 100 + 150 = **250**.

3.  **For the third row (`sales_amount` = 50):**
    * The frame starts at `unboundedPreceding` (the first row).
    * The frame ends at `currentRow` (the third row).
    * The sum is 100 + 150 + 50 = **300**.
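
The three steps above can be reproduced with a few lines of plain Python. This is only a model of the frame logic, not Spark code: the list stands in for one partition that is already sorted, and each frame is the slice from the first element through the current one.

```python
from itertools import accumulate

# One partition, already in sort order (values from the table above).
sales_amounts = [100, 150, 50]

# For row i, the frame is rows 0..i inclusive (unboundedPreceding to currentRow).
frames = [sales_amounts[: i + 1] for i in range(len(sales_amounts))]

running_sum = [sum(frame) for frame in frames]
running_avg = [sum(frame) / len(frame) for frame in frames]
running_count = [len(frame) for frame in frames]

# itertools.accumulate produces the same running sum without rebuilding each frame.
assert running_sum == list(accumulate(sales_amounts))

print(running_sum)    # [100, 250, 300]
print(running_avg)    # [100.0, 125.0, 100.0]
print(running_count)  # [1, 2, 3]
```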

In [0]:
 
from pyspark.sql.functions import sum, avg, count, col, row_number, to_date
from pyspark.sql.window import Window

 
# Create sample sales data
data = [
    ("StoreA", "ProductX", "2023-01-01", 100),
    ("StoreA", "ProductX", "2023-01-02", 150),
    ("StoreA", "ProductX", "2023-01-03", 50),
    ("StoreA", "ProductY", "2023-01-01", 200),
    ("StoreA", "ProductY", "2023-01-02", 250),
    ("StoreB", "ProductX", "2023-01-01", 300),
    ("StoreB", "ProductX", "2023-01-02", 100),
]

columns = ["store_id", "product_id", "sale_date", "sales_amount"]

df = spark.createDataFrame(data, columns).withColumn("sale_date", to_date(col("sale_date")))

df.show()

+--------+----------+----------+------------+
|store_id|product_id| sale_date|sales_amount|
+--------+----------+----------+------------+
|  StoreA|  ProductX|2023-01-01|         100|
|  StoreA|  ProductX|2023-01-02|         150|
|  StoreA|  ProductX|2023-01-03|          50|
|  StoreA|  ProductY|2023-01-01|         200|
|  StoreA|  ProductY|2023-01-02|         250|
|  StoreB|  ProductX|2023-01-01|         300|
|  StoreB|  ProductX|2023-01-02|         100|
+--------+----------+----------+------------+



### Scenario 1: Running Total of Sales
This code calculates the `cumulative sales amount` for each product within each store over time.

`Partitioning`: The data is partitioned by store_id and product_id.

`Ordering`: The rows within each partition are ordered by sale_date.

`Frame`: The frame is defined as unboundedPreceding to currentRow, which includes all preceding rows up to the current row.

In [0]:
# Define the window specification.
# rowsBetween(Window.unboundedPreceding, Window.currentRow) sets the frame to every row
# from the start of the partition up to and including the current row, which is the
# usual frame for cumulative calculations such as running totals.
# The frame clause is optional: with orderBy() and no explicit frame, aggregate functions
# default to rangeBetween(Window.unboundedPreceding, Window.currentRow), which also
# includes any rows that tie with the current row on the ordering column.
window_spec = (
    Window.partitionBy("store_id", "product_id")
    .orderBy("sale_date")
    .rowsBetween(Window.unboundedPreceding, Window.currentRow)
)

# Calculate the running total of sales
df_with_running_total = df.withColumn(
    "running_total_sales",
    sum("sales_amount").over(window_spec)
)

# Show the result
df_with_running_total.show()

+--------+----------+----------+------------+-------------------+
|store_id|product_id| sale_date|sales_amount|running_total_sales|
+--------+----------+----------+------------+-------------------+
|  StoreA|  ProductX|2023-01-01|         100|                100|
|  StoreA|  ProductX|2023-01-02|         150|                250|
|  StoreA|  ProductX|2023-01-03|          50|                300|
|  StoreA|  ProductY|2023-01-01|         200|                200|
|  StoreA|  ProductY|2023-01-02|         250|                450|
|  StoreB|  ProductX|2023-01-01|         300|                300|
|  StoreB|  ProductX|2023-01-02|         100|                400|
+--------+----------+----------+------------+-------------------+
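
The explicit `rowsBetween` frame and the default range-based frame mentioned in the code comments give the same result here because every `sale_date` is unique within its partition. They differ when two rows share the same ordering value. The plain-Python model below illustrates the difference; it is a sketch of the frame semantics rather than Spark code, and the tied rows are hypothetical data.

```python
# One partition sorted by date; the last two rows tie on 2023-01-02 (hypothetical).
rows = [("2023-01-01", 100), ("2023-01-02", 150), ("2023-01-02", 50)]


def rows_frame_sum(partition):
    """ROWS frame: sum from the first row through the current position."""
    return [
        sum(amount for _, amount in partition[: i + 1])
        for i in range(len(partition))
    ]


def range_frame_sum(partition):
    """RANGE frame: sum every row whose ordering value is <= the current row's value."""
    return [
        sum(amount for key, amount in partition if key <= current_key)
        for current_key, _ in partition
    ]


print(rows_frame_sum(rows))   # [100, 250, 300]
print(range_frame_sum(rows))  # [100, 300, 300]
```

With the range-based frame, both rows dated 2023-01-02 see each other, so the running total jumps straight to 300 on the first of them.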



In [0]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum
from pyspark.sql.window import Window
 
# ---------------------------
# 1. Create sample dataset
# ---------------------------
data = [
    ("StoreA", "ProductX", 1000, 100),
    ("StoreA", "ProductX", 1500, 150),
    ("StoreA", "ProductX", 2000, 50),
    ("StoreA", "ProductY", 1200, 200),
    ("StoreA", "ProductY", 2200, 250),
    ("StoreB", "ProductX", 800, 300),
    ("StoreB", "ProductX", 2100, 100),
]
columns = ["store_id", "product_id", "event_id", "sales_amount"]

# Use a separate name so the date-based `df` used by the other window scenarios is not overwritten
df_events = spark.createDataFrame(data, columns)

# ---------------------------
# 2. Define a window using rangeBetween
#    - Partition by store_id + product_id
#    - Order by event_id
#    - For each row, look 500 before and 500 after in event_id
# ---------------------------
window_spec_range = (
    Window.partitionBy("store_id", "product_id")
    .orderBy(col("event_id"))
    .rangeBetween(-500, 500)
)

df_range = df_events.withColumn(
    "sum_sales_eventid_window",
    sum("sales_amount").over(window_spec_range)
)

# ---------------------------
# 3. Define a window over the FULL partition
#    - Uses unboundedPreceding to unboundedFollowing
#    - Includes ALL rows for the same store_id + product_id
# ---------------------------
window_spec_full = (
    Window.partitionBy("store_id", "product_id")
    .orderBy(col("event_id"))
    .rangeBetween(Window.unboundedPreceding, Window.unboundedFollowing)
)

df_full = df_range.withColumn(
    "sum_sales_eventid_full_window",
    sum("sales_amount").over(window_spec_full)
)

# ---------------------------
# 4. Show final result
# ---------------------------
df_full.show(truncate=False)


+--------+----------+--------+------------+------------------------+-----------------------------+
|store_id|product_id|event_id|sales_amount|sum_sales_eventid_window|sum_sales_eventid_full_window|
+--------+----------+--------+------------+------------------------+-----------------------------+
|StoreA  |ProductX  |1000    |100         |250                     |300                          |
|StoreA  |ProductX  |1500    |150         |300                     |300                          |
|StoreA  |ProductX  |2000    |50          |200                     |300                          |
|StoreA  |ProductY  |1200    |200         |200                     |450                          |
|StoreA  |ProductY  |2200    |250         |250                     |450                          |
|StoreB  |ProductX  |800     |300         |300                     |400                          |
|StoreB  |ProductX  |2100    |100         |100                     |400                          |
+--------+----------+--------+------------+------------------------+-----------------------------+
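
Unlike `rowsBetween`, which counts rows, `rangeBetween(-500, 500)` selects rows by the value of the ordering column: a row belongs to the frame when its `event_id` lies within 500 of the current row's `event_id`. The plain-Python model below recomputes the StoreA / ProductX values from the table above; it is a sketch of the frame logic, not Spark code, and the helper name `range_between_sum` is hypothetical.

```python
# (event_id, sales_amount) rows of the StoreA / ProductX partition.
partition = [(1000, 100), (1500, 150), (2000, 50)]


def range_between_sum(rows, lower, upper):
    """Sum amounts whose event_id falls in [current + lower, current + upper]."""
    return [
        sum(
            amount
            for event_id, amount in rows
            if current + lower <= event_id <= current + upper
        )
        for current, _ in rows
    ]


windowed = range_between_sum(partition, -500, 500)

# The unbounded frame covers the whole partition, so every row gets the same total.
full_total = sum(amount for _, amount in partition)
full = [full_total for _ in partition]

print(windowed)  # [250, 300, 200]
print(full)      # [300, 300, 300]
```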

### Scenario 2: Cumulative Average of Sales
This example calculates the `running average of sales` for each product within a store. This is useful for tracking performance trends.

`Partitioning`: By store_id and product_id.

`Ordering`: By sale_date.

`Frame`: The frame is unboundedPreceding to currentRow.

In [0]:
# Define the window specification
window_spec = Window.partitionBy("store_id", "product_id").orderBy("sale_date").rowsBetween(Window.unboundedPreceding, Window.currentRow)

# Calculate the running average of sales
df_with_running_avg = df.withColumn(
    "running_avg_sales",
    avg("sales_amount").over(window_spec)
)

# Show the result
df_with_running_avg.show()

+--------+----------+----------+------------+-----------------+
|store_id|product_id| sale_date|sales_amount|running_avg_sales|
+--------+----------+----------+------------+-----------------+
|  StoreA|  ProductX|2023-01-01|         100|            100.0|
|  StoreA|  ProductX|2023-01-02|         150|            125.0|
|  StoreA|  ProductX|2023-01-03|          50|            100.0|
|  StoreA|  ProductY|2023-01-01|         200|            200.0|
|  StoreA|  ProductY|2023-01-02|         250|            225.0|
|  StoreB|  ProductX|2023-01-01|         300|            300.0|
|  StoreB|  ProductX|2023-01-02|         100|            200.0|
+--------+----------+----------+------------+-----------------+



### Scenario 3: Counting Cumulative Transactions
This code `counts the number of transactions` up to the current row for each product within each store.

`Partitioning`: By store_id and product_id.

`Ordering`: By sale_date.

`Frame`: The frame is unboundedPreceding to currentRow.

In [0]:
# Define the window specification
window_spec = Window.partitionBy("store_id", "product_id").orderBy("sale_date").rowsBetween(Window.unboundedPreceding, Window.currentRow)

# Count the cumulative number of transactions
df_with_cumulative_count = df.withColumn(
    "cumulative_transactions",
    count("*").over(window_spec)
)

# Show the result
df_with_cumulative_count.show()

+--------+----------+----------+------------+-----------------------+
|store_id|product_id| sale_date|sales_amount|cumulative_transactions|
+--------+----------+----------+------------+-----------------------+
|  StoreA|  ProductX|2023-01-01|         100|                      1|
|  StoreA|  ProductX|2023-01-02|         150|                      2|
|  StoreA|  ProductX|2023-01-03|          50|                      3|
|  StoreA|  ProductY|2023-01-01|         200|                      1|
|  StoreA|  ProductY|2023-01-02|         250|                      2|
|  StoreB|  ProductX|2023-01-01|         300|                      1|
|  StoreB|  ProductX|2023-01-02|         100|                      2|
+--------+----------+----------+------------+-----------------------+

