# Aggregate Functions in PySpark – Part 1



## Sample Data Frame

In [14]:
from pyspark.sql import SparkSession, Row
# Note: importing sum, max, and min by name shadows Python's built-ins in this session
from pyspark.sql.functions import sum, avg, count, max, min, countDistinct

# Create Spark Session
spark = SparkSession.builder.appName("AggregateFunctions").getOrCreate()

In [15]:
# Create sample data
data = [
    Row(id=1, value=10),
    Row(id=2, value=20),
    Row(id=3, value=30),
    Row(id=4, value=None),
    Row(id=5, value=40),
    Row(id=6, value=20)
]

# Create DataFrame
df = spark.createDataFrame(data)

# Show the DataFrame
df.show()

+---+-----+
| id|value|
+---+-----+
|  1|   10|
|  2|   20|
|  3|   30|
|  4| NULL|
|  5|   40|
|  6|   20|
+---+-----+



1. **Summation (`sum`)**: Sums up the values in a specified column.

In [16]:
# Apply the aggregate; show() prints the result and returns None, so nothing is assigned
df.select(sum("value").alias("Total_sum")).show()

+---------+
|Total_sum|
+---------+
|      120|
+---------+



2. **Average (`avg`)**: Computes the mean of the non-null values in a specified column.

In [17]:
df.select(avg("value").alias("Average")).show()

+-------+
|Average|
+-------+
|   24.0|
+-------+
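
The 24.0 above comes from dividing by the five non-null values, not by all six rows. A minimal plain-Python cross-check of that arithmetic follows; the `values` list is a hand-written mirror of the `value` column (an assumption for illustration, not Spark code), and `math.fsum` is used so the sketch still works in a session where `sum` has been shadowed by the PySpark import.

```python
import math

# Hand-written mirror of the `value` column; None stands in for NULL.
values = [10, 20, 30, None, 40, 20]

# avg() skips NULLs, so only the non-null entries contribute.
non_null = [v for v in values if v is not None]
total = math.fsum(non_null)  # 120.0

avg_ignoring_nulls = total / len(non_null)  # 120 / 5, what avg() reports
avg_over_all_rows = total / len(values)     # 120 / 6, what you get if NULL counts as a row

print(avg_ignoring_nulls)  # 24.0
print(avg_over_all_rows)   # 20.0
```

If NULL should count as zero in the average, the column would have to be filled first (for example with `fillna`), which changes the denominator from 5 to 6.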



3. **Count (`count`)**: Counts the number of non-null values in a specified column.

In [18]:
# Do not assign the result to a variable named `count`; that would shadow the imported function
df.select(count("value").alias("Total_Count")).show()

+-----------+
|Total_Count|
+-----------+
|          5|
+-----------+



4. **Maximum (`max`) and Minimum (`min`)**: Finds the maximum and minimum values in a specified column.

In [19]:
df.select(max("value").alias("Max_value"), min("value").alias("Min_value")).show()

+---------+---------+
|Max_value|Min_value|
+---------+---------+
|       40|       10|
+---------+---------+



5. **Distinct Values Count (`countDistinct`)**: Counts the number of distinct non-null values in a specified column.

In [20]:
df.select(countDistinct("value").alias("DistinctCount")).show()

+-------------+
|DistinctCount|
+-------------+
|            4|
+-------------+



### Notes:
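- **Handling Nulls**: `count("value")` counts only the non-null values in the column, and `sum`, `avg`, `max`, `min`, and `countDistinct` likewise skip nulls. In the sample data this is why the average is 120 / 5 = 24.0 rather than 120 / 6 = 20.0. To count every row regardless of nulls, use `count("*")`.
- **Performance Considerations**: Aggregate functions can be resource-intensive on large datasets. Spark typically computes partial results on each partition and then combines them, so evenly sized partitions with little skew tend to help. Distinct counts are usually the most expensive of the functions shown here.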

### Use Cases:
- **Summation**: Useful for calculating total sales, total revenue, etc.
- **Average**: Helpful for finding average metrics like average sales per day.
- **Count**: Useful for counting occurrences, such as the number of transactions.
- **Max/Min**: Helps to determine the highest and lowest values, such as maximum sales on a specific day.
- **Distinct Count**: Useful for finding unique items, like unique customers or products.