## Aggregate functions in PySpark

Aggregate functions in PySpark are functions that operate on a group of rows and return a single value.

These functions live in the `pyspark.sql.functions` module. Use them with `groupBy().agg()` to get one result per group, or with `select()` / `agg()` directly on a DataFrame to aggregate over all rows.

### Common Aggregate Functions:

| **Function**                  | **Description**                                                                        |
|-------------------------------|----------------------------------------------------------------------------------------|
| **`count()`**                 | Counts rows; `count(col)` counts only the non-null values of `col`.                    |
| **`sum()`**                   | Computes the sum of a numeric column, ignoring nulls.                                  |
| **`avg()`**                   | Computes the average (mean) of a numeric column, ignoring nulls.                       |
| **`min()`**                   | Returns the minimum value in a column.                                                 |
| **`max()`**                   | Returns the maximum value in a column.                                                 |
| **`collect_list()`**          | Collects the values of each group into an array, keeping duplicates.                   |
| **`collect_set()`**           | Collects the distinct values of each group into an array (order not guaranteed).       |
| **`first()`**                 | Returns the first value in a group; non-deterministic unless the rows are ordered.     |
| **`last()`**                  | Returns the last value in a group; non-deterministic unless the rows are ordered.      |
| **`stddev()`**                | Computes the sample standard deviation of a numeric column (alias of `stddev_samp()`). |
| **`variance()`**              | Computes the sample variance of a numeric column (alias of `var_samp()`).              |
| **`approx_count_distinct()`** | Approximates the count of distinct values in a column.                                 |
| **`sum_distinct()`**          | Computes the sum of distinct values in a column; named `sumDistinct()` before Spark 3.2. |
| **`skewness()`**              | Computes the skewness of a numeric column.                                             |
| **`kurtosis()`**              | Computes the kurtosis of a numeric column.                                             |


In [0]:
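# Note: these imports shadow Python's built-in sum, min, and max in this notebook
from pyspark.sql.functions import sum, avg, min, max, count

# Sample data: contains nulls, an all-null row, and a duplicated row (id 7)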
data = [
(1,'Rohish',26,20000,'india','IT'),
(2,'Melody',None,40000,'germany','engineering'),
(3,'Pawan',12,60000,'india','sales'),
(4,'Roshini',44,None,'uk','engineering'),
(5,'Raushan',35,70000,'india','sales'),
(6,None,29,200000,'uk','IT'),
(7,'Adam',37,65000,'us','IT'),
(8,'Chris',16,40000,'us','sales'),
(None,None,None,None,None,None),
(7,'Adam',37,65000,'us','IT')]

columns = ["id", "name", "age", "salary", "country", "dept"]

# Assumes `spark` is an existing SparkSession (notebook environments usually predefine it)
emp_df = spark.createDataFrame(data, columns)

emp_df.show()

+----+-------+----+------+-------+-----------+
|  id|   name| age|salary|country|       dept|
+----+-------+----+------+-------+-----------+
|   1| Rohish|  26| 20000|  india|         IT|
|   2| Melody|null| 40000|germany|engineering|
|   3|  Pawan|  12| 60000|  india|      sales|
|   4|Roshini|  44|  null|     uk|engineering|
|   5|Raushan|  35| 70000|  india|      sales|
|   6|   null|  29|200000|     uk|         IT|
|   7|   Adam|  37| 65000|     us|         IT|
|   8|  Chris|  16| 40000|     us|      sales|
|null|   null|null|  null|   null|       null|
|   7|   Adam|  37| 65000|     us|         IT|
+----+-------+----+------+-------+-----------+



#### count():
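- `count()` exists in two forms: `DataFrame.count()` is an action, while `pyspark.sql.functions.count()` is an aggregate expression used inside a transformation.
- Calling `count()` on a DataFrame runs a job and returns the number of rows as an integer. Using `count("column")` inside `select()` or `agg()` returns a new DataFrame, is evaluated lazily, and counts only the non-null values in that column.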

In [0]:
# count() as action
emp_df.count()

Out[5]: 10

In [0]:
# count() inside a transformation: it returns a new DataFrame and nothing is computed yet
emp_df.select(count("id"))

Out[6]: DataFrame[count(id): bigint]

In [0]:
# Save the result to a new DataFrame and display it
id_count_df = emp_df.select(count("id"))  # count("id") skips rows where id is null, so 9 of the 10 rows are counted
id_count_df.show()

+---------+
|count(id)|
+---------+
|        9|
+---------+



#### sum, min, max, and avg

**Basic Aggregation without Grouping**

In [0]:
emp_df.select(
  min("salary").alias("min_salary"),
  max("salary").alias("max_salary"),
  avg("salary").alias("avg_salary"),
  sum("salary").alias("total_salary")
).show()

+----------+----------+----------+------------+
|min_salary|max_salary|avg_salary|total_salary|
+----------+----------+----------+------------+
|     20000|    200000|   70000.0|      560000|
+----------+----------+----------+------------+



In [0]:
# Cast each aggregate to int (avg() returns a double by default), then name the column
emp_df.select(
  min("salary").cast("int").alias("min_salary"),
  max("salary").cast("int").alias("max_salary"),
  avg("salary").cast("int").alias("avg_salary"),
  sum("salary").cast("int").alias("total_salary")
).show()

+----------+----------+----------+------------+
|min_salary|max_salary|avg_salary|total_salary|
+----------+----------+----------+------------+
|     20000|    200000|     70000|      560000|
+----------+----------+----------+------------+



### Group By

#### Aggregation with Grouping
- You can use `groupBy()` to group the data by one or more columns and then apply aggregate functions to each group.


In [0]:
emp_df.groupBy("dept").count().show()

+-----------+-----+
|       dept|count|
+-----------+-----+
|         IT|    4|
|engineering|    2|
|      sales|    3|
|       null|    1|
+-----------+-----+



In [0]:
# Total salary per department (null salaries are ignored by sum)
emp_df.groupBy("dept").agg(sum("salary").alias("dept_wise_salary")).show()

+-----------+----------------+
|       dept|dept_wise_salary|
+-----------+----------------+
|         IT|          350000|
|engineering|           40000|
|      sales|          170000|
|       null|            null|
+-----------+----------------+



In [0]:
# Total salary per country and department, sorted by country
emp_df.groupBy("country","dept").agg(sum("salary").alias("country_wise_dept_salary")).sort("country").show()

+-------+-----------+------------------------+
|country|       dept|country_wise_dept_salary|
+-------+-----------+------------------------+
|   null|       null|                    null|
|germany|engineering|                   40000|
|  india|         IT|                   20000|
|  india|      sales|                  130000|
|     uk|engineering|                    null|
|     uk|         IT|                  200000|
|     us|         IT|                  130000|
|     us|      sales|                   40000|
+-------+-----------+------------------------+



In [0]:
# Group by department; count("id") skips null ids, so the all-null row yields a count of 0
emp_df.groupBy("dept").agg(
    sum("salary").alias("Total_Salary"),
    count("id").alias("Employee_Count")
).show()

+-----------+------------+--------------+
|       dept|Total_Salary|Employee_Count|
+-----------+------------+--------------+
|         IT|      350000|             4|
|engineering|       40000|             2|
|      sales|      170000|             3|
|       null|        null|             0|
+-----------+------------+--------------+



**Using Multiple Aggregates in One Step:**

You can use `agg()` for multiple aggregate functions in a single statement.

In [0]:
emp_df.groupBy("dept").agg(
  sum("salary").alias("total_salary"),
  avg("salary").cast("int").alias("avg_salary"),  # cast truncates the fractional part
  min("salary").alias("min_salary"),
  max("salary").alias("max_salary")
).show()

+-----------+------------+----------+----------+----------+
|       dept|total_salary|avg_salary|min_salary|max_salary|
+-----------+------------+----------+----------+----------+
|         IT|      350000|     87500|     20000|    200000|
|engineering|       40000|     40000|     40000|     40000|
|      sales|      170000|     56666|     40000|     70000|
|       null|        null|      null|      null|      null|
+-----------+------------+----------+----------+----------+

