# Internals of groupBy in PySpark

## Table of Contents
1. [Introduction to groupBy](#1-introduction-to-groupby)
2. [How groupBy Works Internally](#2-how-groupby-works-internally)
3. [Shuffling and Partitioning in groupBy](#3-shuffling-and-partitioning-in-groupby)
4. [Impact of Adaptive Query Execution (AQE)](#4-impact-of-adaptive-query-execution-aqe)
5. [Optimizing groupBy Operations](#5-optimizing-groupby-operations)
6. [Properties and Configurations](#6-properties-and-configurations)
7. [Summary](#7-summary)

## 1. Introduction to groupBy

The `groupBy` operation in PySpark is a **wide transformation** that groups rows in a DataFrame based on one or more columns. It is commonly used for aggregations, such as counting, summing, or averaging values within each group.

### Why is groupBy Important?
- **Data Aggregation**: Allows you to summarize data by grouping it based on specific columns.
- **Data Analysis**: Useful for analyzing trends and patterns within subsets of data.
- **Performance Considerations**: Understanding how `groupBy` works internally helps optimize performance, especially for large datasets.

### Key Concepts
- **Wide Transformation**: `groupBy` is a wide transformation because it requires shuffling data across the cluster.
- **Shuffling**: The process of redistributing data across the cluster based on the group key.
- **Partitions**: Data is divided into partitions, and shuffling ensures that rows with the same key are processed together.

## 2. How groupBy Works Internally

A `groupBy` followed by an aggregation runs in three phases: local aggregation, shuffle, and global aggregation. The diagram and the steps below describe them.

### Diagram: groupBy Process

```
┌───────────────┐              ┌───────────────┐
│  DataFrame    │              │  Shuffled Data │
│  (Partitioned)│              │  (Grouped)     │
└───────┬───────┘              └───────┬───────┘
        │                              │
        ▼                              ▼
┌───────────────────────────────────────────────┐
│               Shuffle Data by Group Key       │
└───────┬───────────────────────────────┬───────┘
        │                               │
        ▼                               ▼
┌───────────────┐              ┌───────────────┐
│  Node 1       │              │  Node 2       │
│  Group 1      │              │  Group 2      │
└───────────────┘              └───────────────┘
```

### Steps in groupBy
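1. **Local Aggregation**: Each task computes a partial aggregate (e.g., a per-key count) over its own input partition, so typically at most one record per group key per partition enters the shuffle.
2. **Shuffling**: The partial results are hash-partitioned on the group key and sent across the network, so that all partial results for a given key land in the same shuffle partition.
3. **Global Aggregation**: Each shuffle partition merges the partial results for its keys to produce the final result.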
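
The three steps above can be modeled in plain Python. This is only a sketch of the data flow, not Spark's implementation: the `group_by_count` and `stable_partition` helpers are hypothetical names, and CRC32 stands in for Spark's own hash function.

```python
from collections import Counter, defaultdict
import zlib


def stable_partition(key, num_partitions):
    """Toy hash partitioner; CRC32 keeps results stable across Python runs."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def group_by_count(input_partitions, num_shuffle_partitions):
    # Step 1: local (partial) aggregation inside each input partition.
    partial_counts = [Counter(rows) for rows in input_partitions]

    # Step 2: shuffle. Each (key, partial_count) pair is routed to the
    # shuffle partition chosen by hashing the key.
    shuffled = defaultdict(list)
    for partial in partial_counts:
        for key, count in partial.items():
            target = stable_partition(key, num_shuffle_partitions)
            shuffled[target].append((key, count))

    # Step 3: global (final) aggregation merges the partial counts per key.
    final_counts = Counter()
    for pairs in shuffled.values():
        for key, count in pairs:
            final_counts[key] += count
    return dict(final_counts)


input_partitions = [
    ["Bangalore", "Pune", "Bangalore"],
    ["Mumbai", "Pune", "Mumbai", "Mumbai"],
    ["Bangalore", "Mumbai"],
]
result = group_by_count(input_partitions, num_shuffle_partitions=4)
print(result)
```

Note that only the small `(key, partial_count)` pairs cross the shuffle boundary in this model, not the raw rows, which is why local aggregation reduces shuffle volume.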

### Example: groupBy on Customer Data

Let's assume we have a `customers` DataFrame with 1GB of data, and we want to group by `city` and count the number of customers in each city.

```python
from pyspark.sql import SparkSession

# Initialize Spark session
spark = SparkSession.builder.appName("groupByExample").getOrCreate()

# Load customer data (1GB)
customers_df = spark.read.csv("customers.csv", header=True, inferSchema=True)

# Perform groupBy and count
result_df = customers_df.groupBy("city").count()

# Show results
result_df.show()
```

**Output:**

| city      | count |
|-----------|-------|
| Bangalore | 1000  |
| Pune      | 800   |
| Mumbai    | 1200  |

### Key Points:
- **Local Aggregation**: Each partition performs a local count before shuffling.
- **Shuffling Overhead**: Shuffling is expensive and can impact performance.
- **Global Aggregation**: The final result is computed after shuffling.

## 3. Shuffling and Partitioning in groupBy

### Shuffling in groupBy
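Shuffling is the process of redistributing data across the cluster so that rows with the same group key are processed together. The number of shuffle partitions is set by `spark.sql.shuffle.partitions`, which defaults to **200**, so a wide transformation like `groupBy` produces 200 shuffle partitions unless the setting is changed or AQE coalesces them at runtime.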

### Partitioning in groupBy
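Partitioning divides the data into smaller chunks (partitions) that can be processed in parallel. However, all rows for a given key go to a single shuffle partition, so the number of unique group keys caps the number of partitions that receive data. With 9 unique cities, for example, at most 9 of the 200 shuffle partitions are non-empty and at least 191 stay empty, which is inefficient.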
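
To make the arithmetic concrete, here is a small sketch that hashes 9 keys into 200 partitions and counts how many partitions receive anything. The city list is hypothetical, and CRC32 is used as a stand-in for Spark's hash function, so the exact partition ids will differ from a real job; the upper bound of 9 non-empty partitions holds regardless.

```python
import zlib


def non_empty_partitions(keys, num_partitions):
    """Return the set of partition ids that receive at least one key."""
    return {zlib.crc32(k.encode("utf-8")) % num_partitions for k in keys}


# 9 hypothetical group keys
cities = ["Bangalore", "Pune", "Mumbai", "Delhi", "Chennai",
          "Kolkata", "Hyderabad", "Jaipur", "Surat"]

used = non_empty_partitions(cities, num_partitions=200)
empty = 200 - len(used)
print(f"non-empty: {len(used)}, empty: {empty}")
```

If two keys happen to hash to the same partition, the non-empty count drops below 9, so the number of empty partitions can only grow.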

### Example: Shuffle Partitions in groupBy

```python
# Default shuffle partitions
print("Default shuffle partitions:", spark.conf.get("spark.sql.shuffle.partitions"))

# Perform groupBy and count
result_df = customers_df.groupBy("city").count()

# Trigger execution without writing any output (the noop sink discards all rows)
result_df.write.format("noop").mode("overwrite").save()
```

**Output:**

```
Default shuffle partitions: 200
```

### Key Points:
- **Default Shuffle Partitions**: PySpark creates 200 shuffle partitions by default.
- **Empty Partitions**: If the number of unique group keys is small, many partitions will remain empty, leading to inefficiency.
- **Task Scheduler Overhead**: Spark schedules one task per shuffle partition, so empty partitions still cost task creation and scheduling time even though they process no data.

## 4. Impact of Adaptive Query Execution (AQE)

Adaptive Query Execution (AQE) is a feature introduced in Spark 3.0 that re-optimizes the query plan at runtime using statistics collected from completed shuffle stages. It coalesces shuffle partitions, splits skewed partitions in joins, and switches join strategies. For `groupBy`, partition coalescing is the relevant optimization.

### AQE and groupBy
When AQE is enabled, PySpark dynamically coalesces shuffle partitions to reduce the number of empty partitions. This improves performance by reducing the overhead of the task scheduler.
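
The idea behind coalescing can be sketched as a greedy merge of adjacent shuffle partitions up to a target size. This is a simplified model, not Spark's actual algorithm: `coalesce_partitions` is a hypothetical helper, and the 64 MB target is chosen to mirror the default of `spark.sql.adaptive.advisoryPartitionSizeInBytes`. The real rule also takes other settings into account.

```python
def coalesce_partitions(sizes_in_bytes, target_size):
    """Greedily merge adjacent shuffle partitions up to target_size bytes.

    Returns a list of groups, each a list of original partition indexes.
    """
    groups, current, current_size = [], [], 0
    for index, size in enumerate(sizes_in_bytes):
        # Start a new group when adding this partition would exceed the target.
        if current and current_size + size > target_size:
            groups.append(current)
            current, current_size = [], 0
        current.append(index)
        current_size += size
    if current:
        groups.append(current)
    return groups


MB = 1024 * 1024
# 200 shuffle partitions: 9 hold 10 MB each, the other 191 are empty.
sizes = [10 * MB if i % 23 == 0 else 0 for i in range(200)]

groups = coalesce_partitions(sizes, target_size=64 * MB)
print(len(sizes), "->", len(groups), "partitions")
```

In this model the 200 original partitions collapse into a handful of tasks, each close to the target size, instead of 200 tasks that are mostly empty.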

### Example: Enabling AQE

```python
# Enable AQE
spark.conf.set("spark.sql.adaptive.enabled", "true")

# Perform groupBy and count
result_df = customers_df.groupBy("city").count()

# Trigger execution without writing any output (the noop sink discards all rows)
result_df.write.format("noop").mode("overwrite").save()
```

### Key Points:
- **Dynamic Coalescing**: AQE dynamically reduces the number of shuffle partitions based on runtime statistics.
- **Performance Improvement**: Merging small and empty shuffle partitions into fewer, larger ones means fewer tasks for the scheduler to launch.
- **Enabled by Default**: AQE is enabled by default in Spark 3.2 and later.

## 5. Optimizing groupBy Operations

### Reducing Shuffling
Shuffling is expensive, so reducing the amount of data shuffled can significantly improve performance. Here are some strategies:
- **Filter Early**: Apply filters before the `groupBy` operation to reduce the amount of data shuffled.
- **Reduce Columns**: Select only the necessary columns before the `groupBy` operation.
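
A rough way to see why filtering early helps is to count the records that would enter the shuffle after local aggregation. The sketch below is a plain-Python model with a hypothetical `shuffle_record_count` helper; it assumes each input partition emits one record per distinct key that survives the filter.

```python
def shuffle_record_count(input_partitions, keep=lambda city: True):
    """Count (key, partial_count) records emitted to the shuffle.

    With local aggregation, each input partition emits one record per
    distinct key that survives the filter.
    """
    return sum(len({city for city in rows if keep(city)})
               for rows in input_partitions)


input_partitions = [
    ["Bangalore", "Pune", "Mumbai", "Bangalore"],
    ["Mumbai", "Pune", "Delhi"],
    ["Bangalore", "Delhi", "Mumbai"],
]

unfiltered = shuffle_record_count(input_partitions)
filtered = shuffle_record_count(input_partitions,
                                keep=lambda city: city == "Bangalore")
print("shuffle records without filter:", unfiltered)
print("shuffle records with filter   :", filtered)
```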

### Example: Filtering Before groupBy

```python
# Filter data before groupBy
filtered_df = customers_df.filter(customers_df["city"] == "Bangalore")

# Perform groupBy and count
result_df = filtered_df.groupBy("city").count()

# Show results
result_df.show()
```

### Key Points:
- **Filter Early**: Reduces the amount of data shuffled across the network.
- **Select Columns**: Select only the necessary columns to reduce the size of the data being shuffled.

## 6. Properties and Configurations

### PySpark Properties
- **spark.sql.shuffle.partitions**: Controls the number of shuffle partitions. The default value is 200.
  ```python
  spark.conf.set("spark.sql.shuffle.partitions", 50)
  ```
- **spark.sql.adaptive.enabled**: Enables or disables Adaptive Query Execution (AQE). Set to `true` by default in Spark 3.2 and later.
  ```python
  spark.conf.set("spark.sql.adaptive.enabled", "true")
  ```

### Special Considerations
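- **Memory Constraints**: Give each executor enough memory for its shuffle and aggregation buffers. When they do not fit, Spark spills to disk, which slows the job.
- **Skewed Data**: If a few keys hold most of the rows, consider salting: append a salt value to the key, aggregate by the salted key, then aggregate again by the original key so the hot key's work is spread over several partitions.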
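
The two-stage salting idea can be illustrated with a plain-Python model. Everything here is an assumption made for illustration: the `partition_for` toy partitioner, the salt range of 8, the 4 partitions, and the round-robin salt assignment (real jobs usually draw the salt randomly). The sketch ignores local aggregation and only shows where rows for each key would be routed.

```python
from collections import Counter
import zlib

NUM_PARTITIONS = 4
NUM_SALTS = 8  # hypothetical salt range; tune to the observed skew


def partition_for(key, salt=0):
    """Toy partitioner: stable hash of the key, offset by the salt."""
    return (zlib.crc32(key.encode("utf-8")) + salt) % NUM_PARTITIONS


# Skewed input: one hot key dominates.
rows = ["Mumbai"] * 9000 + ["Pune"] * 600 + ["Bangalore"] * 400

# Without salting: every row for a key lands in the same partition.
plain_load = Counter(partition_for(city) for city in rows)

# Stage 1: aggregate by (key, salt); the salt spreads the hot key.
salted_counts = Counter((city, i % NUM_SALTS) for i, city in enumerate(rows))
salted_load = Counter()
for (city, salt), count in salted_counts.items():
    salted_load[partition_for(city, salt)] += count

# Stage 2: drop the salt and merge the partial counts per key.
final_counts = Counter()
for (city, _salt), count in salted_counts.items():
    final_counts[city] += count

print("max rows per partition, plain :", max(plain_load.values()))
print("max rows per partition, salted:", max(salted_load.values()))
print("final counts:", dict(final_counts))
```

The final counts match an unsalted aggregation, while the largest partition in stage 1 handles far fewer rows. The cost is a second aggregation pass, so salting is worth it only when skew is the actual bottleneck.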

## 7. Summary

In this notebook, we explored the **internals of groupBy in PySpark**, including how it works, the impact of shuffling and partitioning, and how to optimize groupBy operations using Adaptive Query Execution (AQE).

### Key Takeaways:
- **groupBy is a Wide Transformation**: It involves shuffling data across the cluster.
- **Shuffling Overhead**: Shuffling is expensive, so reducing the amount of data shuffled can improve performance.
- **Adaptive Query Execution (AQE)**: Re-optimizes the query at runtime. For `groupBy`, its main benefit is coalescing shuffle partitions; its skew handling targets joins.
- **Optimization Strategies**: Filter early, reduce columns, and use AQE to optimize groupBy operations.

By understanding the internals of groupBy and leveraging optimization techniques, you can significantly improve the performance of your PySpark applications.