# PySpark SQL min() Function: How to Find Minimum Values

## Introduction to the `min()` Function

The `min()` function in `pyspark.sql.functions` is an aggregate function that returns the smallest value in a column, ignoring nulls. It works like the SQL `MIN()` function and is one of the basic aggregations for finding the smallest value in a dataset.


## Basic Syntax

```
DataFrame.groupBy(*cols).agg(min("columnName"))
```

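### Parameters

- **`cols`**: The column(s) to group by. To aggregate over all rows instead, skip `groupBy()` and call `agg()` directly on the DataFrame.
- **`min("columnName")`**: Returns the minimum value of the specified column. The argument can be a column name or a `Column` expression.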



## Why Use `min()`?

- It’s useful for finding the lowest values, such as the minimum price, minimum salary, or the earliest date in a dataset.
- It can be applied globally (across the entire DataFrame) or within specific groups (using `groupBy()`).


## Practical Examples

### 1. Finding the Minimum of a Single Column

**Scenario**: You have a DataFrame with sales data, and you want to find the minimum sales value across the entire DataFrame.

**Code Example**:

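```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import min  # note: shadows Python's built-in min() in this scope

# Reuse the active session (notebooks usually provide `spark`) or create one
spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([
    ("ItemA", 100),
    ("ItemB", 200),
    ("ItemA", 300),
    ("ItemC", 400),
    ("ItemB", 500)
], ["ITEM", "SALES"])

# Find the minimum SALES value
df.agg(min("SALES").alias("MIN_SALES")).show()
```

```
+---------+
|MIN_SALES|
+---------+
|      100|
+---------+
```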



### 2. Grouping and Finding the Minimum Value

**Scenario**: You want to group the data by `ITEM` and calculate the minimum sales value for each item.

**Code Example**:

```python
# Group by ITEM and calculate the minimum SALES.
# orderBy() only makes the displayed row order deterministic.
df.groupBy("ITEM").agg(min("SALES").alias("MIN_SALES")).orderBy("ITEM").show()
```

```
+-----+---------+
| ITEM|MIN_SALES|
+-----+---------+
|ItemA|      100|
|ItemB|      200|
|ItemC|      400|
+-----+---------+
```



### 3. Finding the Minimum of Multiple Columns

**Scenario**: You have both sales and profit data, and you want to find the minimum value of each column in a single aggregation.

**Code Example**:

```python
df_multi = spark.createDataFrame([
    ("ItemA", 100, 10),
    ("ItemB", 200, 20),
    ("ItemA", 300, 30),
    ("ItemC", 400, 40),
    ("ItemB", 500, 50)
], ["ITEM", "SALES", "PROFIT"])

# Find the minimum value of both SALES and PROFIT
df_multi.agg(
    min("SALES").alias("MIN_SALES"),
    min("PROFIT").alias("MIN_PROFIT")
).show()
```

```
+---------+----------+
|MIN_SALES|MIN_PROFIT|
+---------+----------+
|      100|        10|
+---------+----------+
```



### 4. Finding the Minimum with a Conditional Expression

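**Scenario**: You want to find the minimum sales value among rows where sales are greater than 200. `when()` without an `otherwise()` clause yields null for rows that fail the condition, and `min()` ignores nulls, so only the matching rows contribute to the result.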

**Code Example**:

```python
from pyspark.sql.functions import when

# Find the minimum SALES where SALES > 200
df.agg(min(when(df.SALES > 200, df.SALES)).alias("MIN_SALES_OVER_200")).show()
```

```
+------------------+
|MIN_SALES_OVER_200|
+------------------+
|               300|
+------------------+
```



### 5. Grouping and Finding Multiple Minimum Values

**Scenario**: You want to group by `ITEM` and find the minimum sales and profit for each item.

**Code Example**:

```python
# Group by ITEM and find the minimum SALES and PROFIT.
# orderBy() only makes the displayed row order deterministic.
df_multi.groupBy("ITEM").agg(
    min("SALES").alias("MIN_SALES"),
    min("PROFIT").alias("MIN_PROFIT")
).orderBy("ITEM").show()
```

```
+-----+---------+----------+
| ITEM|MIN_SALES|MIN_PROFIT|
+-----+---------+----------+
|ItemA|      100|        10|
|ItemB|      200|        20|
|ItemC|      400|        40|
+-----+---------+----------+
```



### 6. Finding the Global Minimum Value

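**Scenario**: You want to find the minimum sales value across the entire dataset, without any grouping. This is the same pattern as Example 1: calling `agg()` directly on the DataFrame treats all rows as a single group.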

**Code Example**:

```python
# Global minimum of SALES without grouping
df.agg(min("SALES").alias("GLOBAL_MIN_SALES")).show()
```

```
+----------------+
|GLOBAL_MIN_SALES|
+----------------+
|             100|
+----------------+
```

