
###groupBy() transformation

Similar to the SQL GROUP BY clause, the PySpark **groupBy**() transformation groups rows that share the same values in the specified columns. It returns a grouped object on which you apply aggregate functions, producing one summary row per group rather than one result per input row.

#####Aggregate functions:
- **count**() Returns the number of rows in each group.
- **max**() Returns the maximum value in each group.
- **min**() Returns the minimum value in each group.
- **sum**() Returns the total of the values in each group.
- **avg**() Returns the average of the values in each group.
- **agg**() Computes more than one aggregate at a time on the grouped data.
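
To make the semantics of these aggregates concrete, here is a minimal plain-Python sketch (no Spark involved) of the two steps that groupBy() followed by an aggregate performs: grouping, then reducing each group to one summary row. The helper names `group_salaries` and `summarize` are hypothetical and used only for illustration; the (department, salary) pairs are taken from the sample rows used in this section.

```python
from collections import defaultdict

# (department, salary) pairs taken from the sample rows in this section
rows = [
    ("Sales", 90000), ("Sales", 86000), ("Sales", 81000),
    ("Finance", 90000), ("Finance", 99000), ("Finance", 83000), ("Finance", 79000),
    ("Marketing", 80000), ("Marketing", 91000),
]

def group_salaries(pairs):
    """Collect the salaries of each department (the grouping step)."""
    groups = defaultdict(list)
    for department, salary in pairs:
        groups[department].append(salary)
    return dict(groups)

def summarize(groups):
    """Reduce each group to a single summary row (the aggregation step)."""
    return {
        department: {
            "count": len(values),
            "sum": sum(values),
            "min": min(values),
            "max": max(values),
            "avg": sum(values) / len(values),
        }
        for department, values in groups.items()
    }

summary = summarize(group_salaries(rows))
for department, stats in summary.items():
    print(department, stats)
```

Note that `sum`, `min`, and `max` here are Python's built-ins, which work on local lists. The PySpark functions of the same names in `pyspark.sql.functions` work on columns instead, so the two should not be confused.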

In [0]:
# Data
simpleData = [
    ("James","Sales","NY",90000,34,10000),
    ("Michael","Sales","NY",86000,56,20000),
    ("Robert","Sales","CA",81000,30,23000),
    ("Maria","Finance","CA",90000,24,23000),
    ("Raman","Finance","CA",99000,40,24000),
    ("Scott","Finance","NY",83000,36,19000),
    ("Jen","Finance","NY",79000,53,15000),
    ("Jeff","Marketing","CA",80000,25,18000),
    ("Kumar","Marketing","NY",91000,50,21000)
]

# Create DataFrame
schema = ["employee_name","department","state","salary","age","bonus"]
df = spark.createDataFrame(data=simpleData, schema=schema)

df.printSchema()
df.show(truncate=False)


In [0]:
# Using groupBy().sum()
df.groupBy("department").sum("salary").show(truncate=False)


In [0]:
# Using groupBy().count()
df.groupBy("department").count().show()

In [0]:
# Using groupBy().min()
df.groupBy("department").min("salary").show()

In [0]:
# Using groupBy().max()
df.groupBy("department").max("salary").show()

In [0]:
# Using groupBy().avg()
df.groupBy("department").avg("salary").show()

In [0]:
# GroupBy on multiple columns
df.groupBy("department","state") \
    .sum("salary","bonus") \
    .show()

In [0]:
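# Running several aggregations at once with agg()
# Import Spark's column functions under an alias so that they do not
# shadow Python's built-in sum() and max()
from pyspark.sql import functions as F

df.groupBy("department") \
    .agg(F.sum("salary").alias("sum_salary"),
         F.avg("salary").alias("avg_salary"),
         F.sum("bonus").alias("sum_bonus"),
         F.max("bonus").alias("max_bonus")) \
    .show(truncate=False)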

In [0]:
# Filtering on aggregated data (the DataFrame equivalent of SQL HAVING)
# The alias avoids shadowing Python's built-in sum() and max()
from pyspark.sql import functions as F

df.groupBy("department") \
    .agg(F.sum("salary").alias("sum_salary"),
         F.avg("salary").alias("avg_salary"),
         F.sum("bonus").alias("sum_bonus"),
         F.max("bonus").alias("max_bonus")) \
    .where(F.col("sum_bonus") >= 50000) \
    .show(truncate=False)

In [0]:
# Use last() with groupBy() to retrieve the last value of a column within each group.
# Note: the result depends on row order, which is not guaranteed after a shuffle.
from pyspark.sql.functions import last
df.groupBy("department").agg(last("salary")).show()

# last() can also be applied to a DataFrame column without groupBy(),
# in which case it behaves as a global aggregate function.
# df.select(last("salary")).show()
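
Because "last" means "last in whatever order the rows arrive", the value returned for each group depends on row order, and Spark does not guarantee that order after a shuffle. The plain-Python sketch below (no Spark involved) shows the same input rows yielding different "last" salaries once their order changes. The helper `last_per_group` is a hypothetical name used only for illustration.

```python
def last_per_group(pairs):
    """Return the last value seen for each key, following iteration order."""
    result = {}
    for key, value in pairs:
        result[key] = value  # later rows overwrite earlier ones
    return result

# (department, salary) pairs in the order of the sample rows in this section
rows = [
    ("Sales", 90000), ("Sales", 86000), ("Sales", 81000),
    ("Finance", 90000), ("Finance", 99000), ("Finance", 83000), ("Finance", 79000),
    ("Marketing", 80000), ("Marketing", 91000),
]

in_order = last_per_group(rows)
in_reverse = last_per_group(reversed(rows))

print(in_order)    # "last" salary per department in the original order
print(in_reverse)  # same rows, reversed order, different "last" values
```

If a specific "last" row is needed, one option is to make the intended ordering explicit rather than relying on the order in which rows happen to arrive.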

####PySpark SQL GROUP BY Query

In [0]:
# Register DataFrame as a temporary view
df.createOrReplaceTempView("employees")

# Using SQL Query
sql_string = """SELECT department,
       SUM(salary) AS sum_salary,
       AVG(salary) AS avg_salary,
       SUM(bonus) AS sum_bonus,
       MAX(bonus) AS max_bonus
FROM employees
GROUP BY department
HAVING SUM(bonus) >= 50000"""

# Execute SQL query against the temporary view
df2 = spark.sql(sql_string)
df2.show()
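
The HAVING clause, like where() applied after agg(), filters the summary rows rather than the input rows. As a rough check of what to expect from the query, the following plain-Python sketch (no Spark involved) sums the bonus per department from the sample rows and then keeps only the groups whose total is at least 50000. The helpers `sum_by_key` and `having` are hypothetical names used only for illustration.

```python
from collections import defaultdict

# (department, bonus) pairs taken from the sample rows in this section
rows = [
    ("Sales", 10000), ("Sales", 20000), ("Sales", 23000),
    ("Finance", 23000), ("Finance", 24000), ("Finance", 19000), ("Finance", 15000),
    ("Marketing", 18000), ("Marketing", 21000),
]

def sum_by_key(pairs):
    """Aggregate first: total the values of each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

def having(aggregated, predicate):
    """Filter second: keep only the groups whose aggregate passes the predicate."""
    return {key: total for key, total in aggregated.items() if predicate(total)}

sum_bonus = sum_by_key(rows)
kept = having(sum_bonus, lambda total: total >= 50000)

print(sum_bonus)  # total bonus per department
print(kept)       # departments whose total bonus is at least 50000
```

With the sample data, Marketing's total bonus is below the threshold, so only Sales and Finance should remain in the result.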