# PySpark SQL avg() Function: Calculating Averages in DataFrames

## Introduction to the `avg()` Function

The `avg()` function in `pyspark.sql.functions` calculates the average (mean) of a numeric column. It is an aggregate function that works like the SQL `AVG()` function: you pass it to `agg()`, optionally after `groupBy()`, and it ignores null values when computing the mean.


## Basic Syntax

```
DataFrame.groupBy(*cols).agg(avg("columnName"))
```

### Parameters

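- **`cols`**: The column(s) to group by. To average over all rows, skip `groupBy()` and call `agg()` directly on the DataFrame.
- **`avg("columnName")`**: Computes the mean of the specified column, which can be given as a column name or as a `Column` expression.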


## Why Use `avg()`?

- Averaging is a common aggregation in financial analysis, data profiling, and summary reporting.
- It summarizes the central tendency of numeric data and makes groups directly comparable, for example average sales per item.


## Practical Examples

### 1. Calculating the Average of a Single Column

**Scenario**: You have a DataFrame with sales data, and you want to calculate the average sales across all items.

**Code Example**:

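```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg

# Reuse the active session (notebooks usually predefine `spark`) or create one
spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([
    ("ItemA", 100),
    ("ItemB", 200),
    ("ItemA", 300),
    ("ItemC", 400),
    ("ItemB", 500)
], ["ITEM", "SALES"])

# Calculate the average of the SALES column
df.agg(avg("SALES").alias("AVERAGE_SALES")).show()
```

**Output**:

```
+-------------+
|AVERAGE_SALES|
+-------------+
|        300.0|
+-------------+
```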



### 2. Grouping and Calculating Averages

**Scenario**: You want to group by `ITEM` and calculate the average sales for each item.

**Code Example**:

```python
# Group by ITEM and calculate the average SALES for each item
df.groupBy("ITEM").agg(avg("SALES").alias("AVERAGE_SALES")).show()
```

**Output**:

```
+-----+-------------+
| ITEM|AVERAGE_SALES|
+-----+-------------+
|ItemA|        200.0|
|ItemB|        350.0|
|ItemC|        400.0|
+-----+-------------+
```



### 3. Calculating Averages for Multiple Columns

**Scenario**: You have sales and profit data, and you want to calculate the average sales and the average profit across the entire DataFrame.

**Code Example**:

```python
df_multi = spark.createDataFrame([
    ("ItemA", 100, 10),
    ("ItemB", 200, 20),
    ("ItemA", 300, 30),
    ("ItemC", 400, 40),
    ("ItemB", 500, 50)
], ["ITEM", "SALES", "PROFIT"])

# Calculate the average for both SALES and PROFIT columns
df_multi.agg(
    avg("SALES").alias("AVERAGE_SALES"),
    avg("PROFIT").alias("AVERAGE_PROFIT")
).show()
```

**Output**:

```
+-------------+--------------+
|AVERAGE_SALES|AVERAGE_PROFIT|
+-------------+--------------+
|        300.0|          30.0|
+-------------+--------------+
```



### 4. Grouping and Calculating Multiple Averages

**Scenario**: You want to group by `ITEM` and calculate both the average sales and average profit for each item.

**Code Example**:

```python
# Group by ITEM and calculate both average SALES and average PROFIT
df_multi.groupBy("ITEM").agg(
    avg("SALES").alias("AVERAGE_SALES"),
    avg("PROFIT").alias("AVERAGE_PROFIT")
).show()
```

**Output**:

```
+-----+-------------+--------------+
| ITEM|AVERAGE_SALES|AVERAGE_PROFIT|
+-----+-------------+--------------+
|ItemA|        200.0|          20.0|
|ItemB|        350.0|          35.0|
|ItemC|        400.0|          40.0|
+-----+-------------+--------------+
```



### 5. Using `avg()` with a Conditional Expression

**Scenario**: You want to calculate the average sales only for rows where sales exceed 200.

**Code Example**:

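```python
from pyspark.sql.functions import when

# Calculate the average sales where SALES > 200.
# when() without otherwise() yields null for rows that do not match,
# and avg() ignores nulls, so only the matching rows are averaged.
df.agg(avg(when(df.SALES > 200, df.SALES)).alias("AVERAGE_SALES_OVER_200")).show()
```

**Output**:

```
+----------------------+
|AVERAGE_SALES_OVER_200|
+----------------------+
|                 400.0|
+----------------------+
```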



### 6. Calculating a Global Average

**Scenario**: You want to calculate the average sales across the entire dataset, without any grouping. Calling `agg()` directly on the DataFrame, as in Example 1, aggregates over all rows.

**Code Example**:

```python
# Global average of SALES
df.agg(avg("SALES").alias("GLOBAL_AVERAGE_SALES")).show()
```

**Output**:

```
+--------------------+
|GLOBAL_AVERAGE_SALES|
+--------------------+
|               300.0|
+--------------------+
```

