# PySpark SQL groupBy() Function: Grouping Data Made Easy

## Introduction to the `groupBy()` Function

The `groupBy()` method of a PySpark DataFrame groups rows by one or more columns, much like the `GROUP BY` clause in SQL. On its own it returns a `GroupedData` object rather than a result; you follow it with an aggregation (such as sum, average, or count) to get one output row per group. This is the usual starting point for summary statistics and for comparing segments of your data.


## Basic Syntax

```
DataFrame.groupBy(*cols).agg(*expressions)
```

### Parameters

- **`cols`**: The column(s) to group by, given as column names (strings) or `Column` objects.
- **`expressions`**: Aggregate expressions such as `sum()`, `avg()`, or `count()` from `pyspark.sql.functions`, each applied once per group.


## How It Works

- The `groupBy()` function creates one group for each unique combination of values in the specified columns.
- After grouping, you apply an aggregate function such as `sum()`, `count()`, or `avg()` to compute a metric for each group.
- The row order of the result is not guaranteed, so sort explicitly (for example with `orderBy()`) if the order matters.


## Practical Examples

### 1. Grouping by a Single Column

**Scenario**: You have a DataFrame with sales data, and you want to group the data by the `ITEM` column and calculate the total sales per item.

**Code Example**:

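```python
from pyspark.sql import SparkSession

# Notebook environments such as Databricks already provide `spark`;
# getOrCreate() reuses that session or starts a new one elsewhere.
spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([
    ("ItemA", 10),
    ("ItemB", 20),
    ("ItemA", 30),
    ("ItemB", 40),
    ("ItemC", 50)
], ["ITEM", "SALES"])

# Group by ITEM and calculate total SALES
df.groupBy("ITEM").sum("SALES").show()
```

```
+-----+----------+
| ITEM|sum(SALES)|
+-----+----------+
|ItemA|        40|
|ItemB|        60|
|ItemC|        50|
+-----+----------+
```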



### 2. Grouping by Multiple Columns

**Scenario**: You have sales data by `ITEM` and `REGION`, and you want to group by both columns to calculate the total sales per item in each region.

**Code Example**:

```python
df_multi = spark.createDataFrame([
    ("ItemA", "Region1", 10),
    ("ItemB", "Region2", 20),
    ("ItemA", "Region1", 30),
    ("ItemB", "Region2", 40),
    ("ItemC", "Region3", 50)
], ["ITEM", "REGION", "SALES"])

# Group by ITEM and REGION and calculate total SALES
df_multi.groupBy("ITEM", "REGION").sum("SALES").show()
```

```
+-----+-------+----------+
| ITEM| REGION|sum(SALES)|
+-----+-------+----------+
|ItemA|Region1|        40|
|ItemB|Region2|        60|
|ItemC|Region3|        50|
+-----+-------+----------+
```



### 3. Using `agg()` with Multiple Aggregation Functions

**Scenario**: You want to group by `ITEM` and calculate both the total sales and the average sales per item.

**Code Example**:

```python
# Import the module under an alias so that Spark's sum() does not
# shadow Python's built-in sum()
from pyspark.sql import functions as F

# Group by ITEM and apply multiple aggregation functions
df.groupBy("ITEM").agg(
    F.sum("SALES").alias("TOTAL_SALES"),
    F.avg("SALES").alias("AVERAGE_SALES")
).show()
```

```
+-----+-----------+-------------+
| ITEM|TOTAL_SALES|AVERAGE_SALES|
+-----+-----------+-------------+
|ItemA|         40|         20.0|
|ItemB|         60|         30.0|
|ItemC|         50|         50.0|
+-----+-----------+-------------+
```



### 4. Grouping and Counting Rows

**Scenario**: You want to group by `ITEM` and count how many times each item appears in the dataset.

**Code Example**:

```python
# Group by ITEM and count the occurrences of each item
df.groupBy("ITEM").count().show()
```

```
+-----+-----+
| ITEM|count|
+-----+-----+
|ItemA|    2|
|ItemB|    2|
|ItemC|    1|
+-----+-----+
```



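### 5. Filtering After Grouping (Equivalent of SQL `HAVING`)

**Scenario**: You want to filter the groups based on the total sales, for example, only showing items with total sales greater than 50. PySpark DataFrames have no `having()` method; instead, you apply `filter()` to the aggregated result.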

**Code Example**:

```python
# Group by ITEM, sum SALES, and filter groups where total SALES > 50
df.groupBy("ITEM").sum("SALES").filter("sum(SALES) > 50").show()
```

```
+-----+----------+
| ITEM|sum(SALES)|
+-----+----------+
|ItemB|        60|
+-----+----------+
```



### 6. Grouping and Sorting

**Scenario**: You want to group by `ITEM`, calculate the total sales, and then sort the results by total sales in descending order.

**Code Example**:

```python
# Group by ITEM, sum SALES, and sort the result in descending order of total SALES
df.groupBy("ITEM").sum("SALES").orderBy("sum(SALES)", ascending=False).show()
```

```
+-----+----------+
| ITEM|sum(SALES)|
+-----+----------+
|ItemB|        60|
|ItemC|        50|
|ItemA|        40|
+-----+----------+
```

