## Aggregate Data in a DataFrame
The groupBy(~) method groups rows by the specified columns so that statistics, such as the mean, can be computed for each group. groupby(~) is an alias for the same method, and the examples below use both spellings.

### Parameters
- cols | list or string or Column | optional
    - Columns to group by. By default, all rows will be grouped together.

### Return Value
  - [GroupedData object (pyspark.sql.group.GroupedData)](https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.sql.GroupedData.html)

[API Reference](https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.sql.DataFrame.groupBy.html)

In [0]:
df = spark.createDataFrame(
  [
    ('Shiva', 'fresher','IT',30, 5000), 
    ('Teja','experience','IT', 25, 4000), 
    ('Bhavishya','fresher' ,'HR', 23, 7000)
  ],
  ['name','fresh_exp','department','age', 'bonus']
)
df.show(truncate=False)

+---------+----------+----------+---+-----+
|name     |fresh_exp |department|age|bonus|
+---------+----------+----------+---+-----+
|Shiva    |fresher   |IT        |30 |5000 |
|Teja     |experience|IT        |25 |4000 |
|Bhavishya|fresher   |HR        |23 |7000 |
+---------+----------+----------+---+-----+



### groupBy without any arguments

In [0]:
df.groupby().max().show()

+--------+----------+
|max(age)|max(bonus)|
+--------+----------+
|      30|      7000|
+--------+----------+



groupBy() without any arguments puts all rows into a single group, and max() without any arguments computes the maximum of every numeric column.

### groupBy a single column and compute statistics for all numeric columns of each group

In [0]:
df.groupBy("department").max().show()

+----------+--------+----------+
|department|max(age)|max(bonus)|
+----------+--------+----------+
|        IT|      30|      5000|
|        HR|      23|      7000|
+----------+--------+----------+



### groupBy a single column and compute a statistic for specific columns of each group

In [0]:
from pyspark.sql import functions as f
df.groupBy(f.col("department")).max("age").show()

+----------+--------+
|department|max(age)|
+----------+--------+
|        IT|      30|
|        HR|      23|
+----------+--------+



In [0]:
df.groupBy(f.col("department")).agg(f.max("age")).show() # using aggregate functions

+----------+--------+
|department|max(age)|
+----------+--------+
|        IT|      30|
|        HR|      23|
+----------+--------+



In [0]:
df.groupBy(f.col("department")).agg(f.max("age").alias('max_age')).show() # using alias

+----------+-------+
|department|max_age|
+----------+-------+
|        IT|     30|
|        HR|     23|
+----------+-------+



### groupBy a single column and compute multiple statistics for each group

In [0]:
df.groupby(f.col("department")).agg(
    f.count("age").alias("count"),
    f.max("age").alias("max_age"),
    f.min("age").alias("min_age"),
    f.avg("bonus").alias("avg_bonus"),
).show()

+----------+-----+-------+-------+---------+
|department|count|max_age|min_age|avg_bonus|
+----------+-----+-------+-------+---------+
|        IT|    2|     30|     25|   4500.0|
|        HR|    1|     23|     23|   7000.0|
+----------+-----+-------+-------+---------+



### groupBy multiple columns and compute statistics for each group

In [0]:
df.groupby(["department", "fresh_exp"]).max("age").show()

+----------+----------+--------+
|department| fresh_exp|max(age)|
+----------+----------+--------+
|        IT|   fresher|      30|
|        IT|experience|      25|
|        HR|   fresher|      23|
+----------+----------+--------+

