In PySpark, the rollup() function performs multi-level aggregation by computing group-wise subtotals and a grand total.
It is commonly used when you want to aggregate data at multiple levels within a single query, similar to ROLLUP in SQL.

I'll explain step by step with examples.

### 1. Syntax
DataFrame.rollup(*cols)


cols → Columns on which you want to perform the rollup.

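It generates the hierarchical groupings implied by the column order: all columns first, then each shorter prefix of the column list, and finally no columns at all (the grand total). It does not generate every possible combination of columns.

rollup() returns a GroupedData object, so you follow it with agg() (or a shortcut such as sum() or count()) to calculate the aggregations.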
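To make the hierarchy concrete, here is a small plain-Python sketch (no Spark required) that lists the grouping sets a rollup aggregates by. The helper name `rollup_grouping_sets` is hypothetical and is not part of the PySpark API; it only illustrates how the column order determines the levels.

```python
def rollup_grouping_sets(*cols):
    """Return the grouping sets a rollup over `cols` aggregates by.

    These are the prefixes of the column list, from the most detailed
    (all columns) down to the empty set (the grand total).
    """
    return [tuple(cols[:n]) for n in range(len(cols), -1, -1)]


# Illustrative call using the column names from this tutorial.
grouping_sets = rollup_grouping_sets("Category", "SubCategory")
for level, group in enumerate(grouping_sets, start=1):
    print(f"Level {level}: group by {group if group else 'nothing (grand total)'}")
```

Because only prefixes are used, the order of the arguments matters: rollup("Category", "SubCategory") yields per-category subtotals, whereas reversing the arguments would yield per-subcategory subtotals instead.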

### 2. Sample Dataset

Let's create a sample DataFrame:

In [1]:
data = [
    ("Electronics", "Mobile", 1000),
    ("Electronics", "Laptop", 1500),
    ("Electronics", "Tablet", 800),
    ("Clothing", "Shirt", 500),
    ("Clothing", "Jeans", 700),
    ("Clothing", "Shoes", 1200)
]

columns = ["Category", "SubCategory", "Sales"]

df = spark.createDataFrame(data, columns)
df.show()




+-----------+-----------+-----+
|   Category|SubCategory|Sales|
+-----------+-----------+-----+
|Electronics|     Mobile| 1000|
|Electronics|     Laptop| 1500|
|Electronics|     Tablet|  800|
|   Clothing|      Shirt|  500|
|   Clothing|      Jeans|  700|
|   Clothing|      Shoes| 1200|
+-----------+-----------+-----+



### 3. Using rollup() for Subtotals

Let's roll up by Category and SubCategory:

In [2]:
from pyspark.sql import functions as F

df_rollup = df.rollup("Category", "SubCategory") \
              .agg(F.sum("Sales").alias("TotalSales")) \
              .orderBy("Category", "SubCategory")

df_rollup.show()



+-----------+-----------+----------+
|   Category|SubCategory|TotalSales|
+-----------+-----------+----------+
|       NULL|       NULL|      5700|
|   Clothing|       NULL|      2400|
|   Clothing|      Jeans|       700|
|   Clothing|      Shirt|       500|
|   Clothing|      Shoes|      1200|
|Electronics|       NULL|      3300|
|Electronics|     Laptop|      1500|
|Electronics|     Mobile|      1000|
|Electronics|     Tablet|       800|
+-----------+-----------+----------+



### 4. How It Works

Level 1: Aggregates by Category + SubCategory

Level 2: Aggregates by Category only → subtotal per category

Level 3: Aggregates all data → grand total
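The three levels can be checked by hand. The following plain-Python sketch emulates the aggregation on the sample data without Spark; `rollup_sum` is a hypothetical helper written for this illustration, and `None` stands in for the NULL that Spark puts in a rolled-up column.

```python
from collections import defaultdict

# Same rows as the sample dataset in section 2.
data = [
    ("Electronics", "Mobile", 1000),
    ("Electronics", "Laptop", 1500),
    ("Electronics", "Tablet", 800),
    ("Clothing", "Shirt", 500),
    ("Clothing", "Jeans", 700),
    ("Clothing", "Shoes", 1200),
]


def rollup_sum(rows):
    """Emulate rollup("Category", "SubCategory") with sum("Sales")."""
    totals = defaultdict(int)
    for category, sub_category, sales in rows:
        totals[(category, sub_category)] += sales  # Level 1: detail row
        totals[(category, None)] += sales          # Level 2: subtotal per category
        totals[(None, None)] += sales              # Level 3: grand total
    return dict(totals)


result = rollup_sum(data)
for key, total in result.items():
    print(key, total)
```

Each input row contributes to exactly one entry per level, which is why six detail rows, two category subtotals, and one grand total add up to nine output rows.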

### 5. Using grouping() to Identify Subtotals

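A NULL in the rollup output can mean either "this column was rolled up" or "the source value was NULL". grouping(col) removes the ambiguity: it returns 1 when the column has been aggregated away in that row and 0 otherwise, so you can add flags to differentiate detail rows, subtotals, and grand totals: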

In [None]:
df_rollup_flagged = df.rollup("Category", "SubCategory") \
    .agg(
        F.sum("Sales").alias("TotalSales"),
        F.grouping("Category").alias("Category_grouping"),
        F.grouping("SubCategory").alias("SubCategory_grouping")
    ) \
    .orderBy("Category", "SubCategory")

df_rollup_flagged.show()


### 6. Difference Between rollup() and cube()
| Feature     | **rollup()**                | **cube()**                              |
| ----------- | --------------------------- | --------------------------------------- |
| Aggregation | Hierarchical (top → bottom) | All combinations                        |
| Grand Total | Yes                         | Yes                                     |
| Subtotals   | Yes (hierarchical only)     | Yes (all combinations)                  |
| Use Case    | Reports with **subtotals**  | Reports with **all cross-level totals** |
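The "all combinations" row of the table can be made concrete by enumerating the grouping sets. This is a plain-Python sketch under the usual definition of CUBE (every subset of the listed columns); `cube_grouping_sets` is a hypothetical helper, not a PySpark function.

```python
from itertools import combinations


def cube_grouping_sets(*cols):
    """Return every subset of `cols` (in column order), largest first."""
    return [
        combo
        for size in range(len(cols), -1, -1)
        for combo in combinations(cols, size)
    ]


cols = ("Category", "SubCategory")

cube_sets = cube_grouping_sets(*cols)
# A rollup only uses the prefixes of the column list.
rollup_sets = [cols[:n] for n in range(len(cols), -1, -1)]

# Grouping sets that cube() adds on top of rollup().
extra_sets = [s for s in cube_sets if s not in rollup_sets]

print("cube:  ", cube_sets)
print("rollup:", rollup_sets)
print("extra: ", extra_sets)
```

For n columns a rollup produces n + 1 grouping sets while a cube produces 2^n, so with two columns the only addition is a total per SubCategory across all categories. The gap grows quickly as more columns are added.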


### 7. Real-World Use Case

Example:
If you want sales totals per product, per category, and overall in a single query, rollup() is ideal.

### 8. Summary

rollup() → Hierarchical aggregation

Produces detail rows + subtotals + grand total

Use grouping() to identify subtotal and grand-total rows

Best for multi-level reporting in PySpark.