# PySpark SQL max() Function: How to Find Maximum Values

## Introduction to the `max()` Function

The `max()` function in PySpark (`pyspark.sql.functions.max`) is an aggregate function that returns the maximum value of a column. It behaves like the SQL `MAX()` function: null values are ignored, and it can be applied to the whole DataFrame or to each group produced by `groupBy()`.


## Basic Syntax

```python
DataFrame.groupBy(*cols).agg(max("columnName"))
```

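### Parameters

- **`cols`**: The column(s) to group by. The `groupBy()` call is optional; omit it and call `agg()` directly on the DataFrame to aggregate over all rows.
- **`max("columnName")`**: The aggregate expression. It takes a column name or a `Column` and returns the maximum value of that column.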


## Why Use `max()`?

- It helps find the highest values in data, such as the maximum salary, price, or the latest date in a dataset.
- You can use `max()` across the entire DataFrame or within groups when combined with `groupBy()`.


## Practical Examples

### 1. Finding the Maximum Value in a Single Column

**Scenario**: You have a DataFrame with sales data, and you want to find the maximum sales value across the dataset.

**Code Example**:

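```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import max  # note: shadows Python's built-in max()

# Reuse the notebook's session if one exists, otherwise start one
spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([
    ("ItemA", 100),
    ("ItemB", 200),
    ("ItemA", 300),
    ("ItemC", 400),
    ("ItemB", 500)
], ["ITEM", "SALES"])

# Find the maximum SALES value
df.agg(max("SALES").alias("MAX_SALES")).show()
```

```
+---------+
|MAX_SALES|
+---------+
|      500|
+---------+
```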



### 2. Grouping and Finding the Maximum Value

**Scenario**: You want to group the data by `ITEM` and calculate the maximum sales value for each item.

**Code Example**:

```python
# Group by ITEM and calculate the maximum SALES
df.groupBy("ITEM").agg(max("SALES").alias("MAX_SALES")).show()
```

```
+-----+---------+
| ITEM|MAX_SALES|
+-----+---------+
|ItemA|      300|
|ItemB|      500|
|ItemC|      400|
+-----+---------+
```



### 3. Finding the Maximum Value of Multiple Columns

**Scenario**: You have both sales and profit data, and you want to find the maximum value of each column in a single pass.

**Code Example**:

```python
df_multi = spark.createDataFrame([
    ("ItemA", 100, 10),
    ("ItemB", 200, 20),
    ("ItemA", 300, 30),
    ("ItemC", 400, 40),
    ("ItemB", 500, 50)
], ["ITEM", "SALES", "PROFIT"])

# Find the maximum values of both SALES and PROFIT
df_multi.agg(
    max("SALES").alias("MAX_SALES"),
    max("PROFIT").alias("MAX_PROFIT")
).show()
```

```
+---------+----------+
|MAX_SALES|MAX_PROFIT|
+---------+----------+
|      500|        50|
+---------+----------+
```



### 4. Finding the Maximum with a Conditional Expression

**Scenario**: You want to find the maximum sales value among rows where `SALES` is greater than 200.

**Code Example**:

```python
from pyspark.sql.functions import when

# Find the maximum SALES where SALES > 200
df.agg(max(when(df.SALES > 200, df.SALES)).alias("MAX_SALES_OVER_200")).show()
```

```
+------------------+
|MAX_SALES_OVER_200|
+------------------+
|               500|
+------------------+
```



### 5. Grouping and Finding Multiple Maximum Values

**Scenario**: You want to group by `ITEM` and find the maximum sales and profit for each item.

**Code Example**:

```python
# Group by ITEM and find the maximum SALES and PROFIT
df_multi.groupBy("ITEM").agg(
    max("SALES").alias("MAX_SALES"),
    max("PROFIT").alias("MAX_PROFIT")
).show()
```

```
+-----+---------+----------+
| ITEM|MAX_SALES|MAX_PROFIT|
+-----+---------+----------+
|ItemA|      300|        30|
|ItemB|      500|        50|
|ItemC|      400|        40|
+-----+---------+----------+
```



### 6. Finding the Global Maximum Value

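**Scenario**: You want to find the maximum sales value across the entire dataset, without any grouping. This is the same pattern as example 1: calling `agg()` directly on the DataFrame, with no `groupBy()`, aggregates over every row.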

**Code Example**:

```python
# Global maximum of SALES without grouping
df.agg(max("SALES").alias("GLOBAL_MAX_SALES")).show()
```

```
+----------------+
|GLOBAL_MAX_SALES|
+----------------+
|             500|
+----------------+
```

