What is cube() in PySpark?
cube() extends groupBy(): instead of grouping only by the given columns, it aggregates over every possible combination of them, so the result contains the full-detail rows, the partial aggregations (subtotals), and the grand total. For n grouping columns, that is 2^n grouping sets.

Think of it like an OLAP cube in data warehousing: you get an aggregated view for every combination of your grouping columns.

In [0]:
# Note: this import shadows Python's built-in sum() for the rest of the notebook.
from pyspark.sql.functions import sum

# Sample data
data = [
    ("East", "ProductA", 100),
    ("East", "ProductB", 150),
    ("West", "ProductA", 200),
    ("West", "ProductB", 300),
    ("West", "ProductC", 250)
]

columns = ["Region", "Product", "Sales"]

df = spark.createDataFrame(data, columns)

# display() is specific to Databricks notebooks; use df.show() in plain PySpark.
df.display()

In [0]:
# Cube on Region and Product
cube_df = df.cube("Region", "Product").agg(sum("Sales").alias("TotalSales"))

cube_df.display()

### What happens here:

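cube() aggregates Sales for every combination of the two grouping columns. A null in a grouping column means that column was aggregated over:

- (Region, Product): normal grouping, the same rows groupBy("Region", "Product") would return
- (Region, null): subtotal per region, e.g. East = 250 and West = 750
- (null, Product): subtotal per product, e.g. ProductA = 300
- (null, null): grand total, 1000 for the sample data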
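
To make the four grouping sets concrete, the sketch below recomputes the same totals in plain Python from the sample rows. It only illustrates what the result contains, not how Spark computes it; `emulate_cube` is a hypothetical helper name, and `None` stands in for Spark's null.

```python
from itertools import product

# Same sample rows as the notebook cell above.
data = [
    ("East", "ProductA", 100),
    ("East", "ProductB", 150),
    ("West", "ProductA", 200),
    ("West", "ProductB", 300),
    ("West", "ProductC", 250),
]

def emulate_cube(rows):
    """Sum Sales for every (Region, Product) grouping set; None plays the role of null."""
    totals = {}
    for region, prod, sales in rows:
        # Each row feeds four output keys: keep or blank out each grouping column.
        for key in product((region, None), (prod, None)):
            totals[key] = totals.get(key, 0) + sales
    return totals

cube_totals = emulate_cube(data)

# Sort on the string form so None values do not break the comparison.
for key in sorted(cube_totals, key=lambda k: (str(k[0]), str(k[1]))):
    print(key, cube_totals[key])
```

Every input row is counted once in each grouping set, which is why the grand total equals the sum of the region subtotals and also the sum of the product subtotals.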

### When to Use cube()

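Use cube() when you need every aggregation combination in a single result: the detailed rows, all subtotals, and the grand total.

It suits multi-dimensional analysis, where the same measure is summarized along several independent dimensions.

Example: sales totals by region, by product category, by both, and overall.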

### Difference Between groupBy(), rollup(), and cube()
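| Function    | Output                                                                                                               |
| ----------- | -------------------------------------------------------------------------------------------------------------------- |
| `groupBy()` | One grouping set: the exact columns given.                                                                           |
| `rollup()`  | Groupings in a **hierarchical order**, dropping columns from right to left: `(Region, Product)`, `(Region)`, grand total. |
| `cube()`    | **All possible** combinations of grouping columns: `(Region, Product)`, `(Region)`, `(Product)`, grand total.        |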
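
The difference in the table comes down to which column subsets each function aggregates over. The plain-Python sketch below lists those subsets; `grouping_sets` is a hypothetical helper written for illustration, not part of the PySpark API.

```python
from itertools import combinations

def grouping_sets(kind, cols):
    """Return the column subsets that the named function aggregates over."""
    cols = tuple(cols)
    if kind == "groupBy":
        return [cols]
    if kind == "rollup":
        # Drop columns from the right, one at a time: (a, b), (a,), ()
        return [cols[:i] for i in range(len(cols), -1, -1)]
    if kind == "cube":
        # Every subset, largest first; combinations() keeps the column order.
        return [c for r in range(len(cols), -1, -1) for c in combinations(cols, r)]
    raise ValueError(f"unknown kind: {kind}")

for kind in ("groupBy", "rollup", "cube"):
    print(kind, grouping_sets(kind, ["Region", "Product"]))
```

The empty tuple is the grand total. With n columns, rollup yields n + 1 grouping sets and cube yields 2^n, so cube output grows much faster as columns are added.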

