# 2.1 GROUP BY and Aggregations

This section covers grouping data and using aggregate functions to summarize information across multiple rows.

## Learning Objectives
By the end of this section, you will be able to:
- Group data using GROUP BY clause
- Apply aggregate functions (COUNT, SUM, AVG, MIN, MAX)
- Combine grouping with filtering
- Understand the difference between WHERE and HAVING
- Create summary reports from detailed data

## Prerequisites
- Completed Chapter 1 (SQL Fundamentals)
- Understanding of basic SELECT queries
- Familiarity with our sample database schema

## Database Connection

Let's connect to our database and verify our data is ready for aggregation examples.

In [None]:
import sqlite3
import pandas as pd
from IPython.display import display

# Connect to our database
conn = sqlite3.connect('my_database.db')
cursor = conn.cursor()

# Quick data verification
print("✅ Database connection established!")
print("\n📊 Current data summary:")

tables = ['departments', 'employees', 'projects']
for table in tables:
    cursor.execute(f"SELECT COUNT(*) FROM {table}")
    count = cursor.fetchone()[0]
    print(f"  {table:12}: {count:3} records")

## Understanding Aggregate Functions

Aggregate functions perform calculations across multiple rows and return a single result: one value for the whole table when there is no GROUP BY, or one value per group when there is.

### Common Aggregate Functions:
- **COUNT()** - Number of rows (`COUNT(*)`) or number of non-NULL values (`COUNT(column)`)
- **SUM()** - Total sum of values
- **AVG()** - Average value
- **MIN()** - Minimum value  
- **MAX()** - Maximum value
- **GROUP_CONCAT()** - Concatenate text values (available in SQLite and MySQL; not part of standard SQL)
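
The NULL handling of these functions is easy to overlook. The following is a minimal sketch, assuming a hypothetical in-memory `staff` table that is not part of the sample database, showing that `COUNT(*)` counts every row while `COUNT(column)`, `SUM` and `AVG` skip NULL values.

```python
import sqlite3

# Hypothetical in-memory table, used only for this illustration.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE staff (name TEXT, salary REAL)")
demo.executemany(
    "INSERT INTO staff VALUES (?, ?)",
    [("Ana", 60000), ("Ben", 80000), ("Cy", None)],  # Cy has no salary recorded
)

row = demo.execute("""
    SELECT COUNT(*),                 -- counts every row, NULL salaries included
           COUNT(salary),            -- counts only non-NULL salaries
           SUM(salary),              -- NULL values are skipped
           AVG(salary),              -- average over the non-NULL values only
           GROUP_CONCAT(name, ', ')  -- concatenation order is not guaranteed
    FROM staff
""").fetchone()
demo.close()

total_rows, salaried_rows, total_salary, avg_salary, names = row
print(total_rows, salaried_rows, total_salary, avg_salary, names)
```

Here `AVG(salary)` is 70000 rather than about 46667, because the row with the NULL salary is left out of both the sum and the count.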

In [None]:
# Example 1: Basic aggregate functions on employees table
print("📊 Example 1: Basic aggregates on employee salaries")

df = pd.read_sql_query("""
SELECT 
    COUNT(*) AS "Total Employees",
    SUM(salary) AS "Total Payroll",
    AVG(salary) AS "Average Salary",
    MIN(salary) AS "Minimum Salary",
    MAX(salary) AS "Maximum Salary",
    ROUND(AVG(salary), 2) AS "Avg Salary (Rounded)"
FROM employees
""", conn)
display(df)

print("\n" + "="*50 + "\n")

# Example 2: Aggregates on projects
print("📊 Example 2: Project budget statistics")
df = pd.read_sql_query("""
SELECT 
    COUNT(*) AS "Total Projects",
    SUM(budget) AS "Total Budget",
    AVG(budget) AS "Average Budget",
    MIN(budget) AS "Smallest Project",
    MAX(budget) AS "Largest Project"
FROM projects
""", conn)
display(df)

## Introduction to GROUP BY

The GROUP BY clause groups rows that have the same values into summary rows, creating one result row per group.

### GROUP BY Syntax:
```sql
SELECT column, aggregate_function(column)
FROM table
GROUP BY column;
```

### Key Rules:
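- Every column in SELECT should either appear in GROUP BY or be wrapped in an aggregate function (SQLite tolerates other columns, but takes their values from an arbitrary row of each group)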
- GROUP BY creates one row per unique value combination
- Use with aggregate functions to summarize each group
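
To make the "one row per unique value combination" rule concrete, here is a small sketch on a hypothetical in-memory `staff` table (an assumption for illustration, not the sample schema). Grouping by one column gives one row per department; grouping by two columns gives one row per department and role pair.

```python
import sqlite3

# Hypothetical in-memory table, used only for this illustration.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE staff (dept TEXT, role TEXT, salary INTEGER)")
demo.executemany("INSERT INTO staff VALUES (?, ?, ?)", [
    ("Eng", "Dev", 90000),
    ("Eng", "Dev", 80000),
    ("Eng", "QA", 70000),
    ("Ops", "Dev", 60000),
])

# One result row per distinct dept value.
by_dept = demo.execute("""
    SELECT dept, COUNT(*), SUM(salary)
    FROM staff
    GROUP BY dept
    ORDER BY dept
""").fetchall()

# One result row per distinct (dept, role) combination.
by_dept_role = demo.execute("""
    SELECT dept, role, COUNT(*)
    FROM staff
    GROUP BY dept, role
    ORDER BY dept, role
""").fetchall()
demo.close()

print(by_dept)       # two rows: Eng and Ops
print(by_dept_role)  # three rows: Eng/Dev, Eng/QA, Ops/Dev
```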

In [None]:
# Example 1: Count employees by department
print("👥 Example 1: Employee count by department")

df = pd.read_sql_query("""
SELECT 
    dept_id,
    COUNT(*) AS employee_count
FROM employees 
GROUP BY dept_id
ORDER BY employee_count DESC
""", conn)
display(df)

print("\n" + "="*50 + "\n")

# Example 2: Average salary by department
print("💰 Example 2: Average salary by department")
df = pd.read_sql_query("""
SELECT 
    dept_id,
    COUNT(*) AS employee_count,
    ROUND(AVG(salary), 2) AS avg_salary,
    MIN(salary) AS min_salary,
    MAX(salary) AS max_salary
FROM employees 
GROUP BY dept_id
ORDER BY avg_salary DESC
""", conn)
display(df)

print("\n" + "="*50 + "\n")

# Example 3: Project statistics by status
print("🚀 Example 3: Projects grouped by status")
df = pd.read_sql_query("""
SELECT 
    status,
    COUNT(*) AS project_count,
    SUM(budget) AS total_budget,
    AVG(budget) AS avg_budget
FROM projects 
GROUP BY status
ORDER BY total_budget DESC
""", conn)
display(df)

## GROUP BY with JOIN

Combine GROUP BY with JOIN operations to create more meaningful reports using data from multiple tables.
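
One pitfall is worth knowing before grouping over joins: when a parent table is joined to two child tables at once, each child row is repeated once per matching row of the other child, which inflates `SUM` and `COUNT(*)`. The sketch below uses made-up in-memory data with the same column names as our schema (the rows themselves are assumptions) and compares a direct double join with pre-aggregating in a subquery.

```python
import sqlite3

# Made-up data: one department, three employees, two projects.
demo = sqlite3.connect(":memory:")
demo.executescript("""
    CREATE TABLE departments (dept_id INTEGER PRIMARY KEY, dept_name TEXT);
    CREATE TABLE employees (emp_id INTEGER PRIMARY KEY, dept_id INTEGER);
    CREATE TABLE projects (project_id INTEGER PRIMARY KEY, dept_id INTEGER, budget INTEGER);
    INSERT INTO departments VALUES (1, 'Engineering');
    INSERT INTO employees VALUES (1, 1), (2, 1), (3, 1);
    INSERT INTO projects VALUES (1, 1, 1000), (2, 1, 2000);
""")

# Direct double join: every project row appears once per employee (3 times).
naive_total = demo.execute("""
    SELECT SUM(p.budget)
    FROM departments d
    LEFT JOIN employees e ON d.dept_id = e.dept_id
    LEFT JOIN projects p ON d.dept_id = p.dept_id
    GROUP BY d.dept_id
""").fetchone()[0]

# Aggregate projects first, then join the one-row-per-department result.
safe_total = demo.execute("""
    SELECT COALESCE(p.total_budget, 0)
    FROM departments d
    LEFT JOIN (
        SELECT dept_id, SUM(budget) AS total_budget
        FROM projects
        GROUP BY dept_id
    ) p ON d.dept_id = p.dept_id
""").fetchone()[0]
demo.close()

print(naive_total, safe_total)
```

The real budget total is 3000; the direct double join reports three times that because there are three employees. `COUNT(DISTINCT ...)` repairs counts in such a query, but it does not repair sums.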

In [None]:
# Example 1: Employee statistics with department names
print("🏢 Example 1: Employee statistics by department (with names)")

df = pd.read_sql_query("""
SELECT 
    d.dept_name AS department,
    d.location,
    COUNT(e.emp_id) AS employee_count,
    ROUND(AVG(e.salary), 2) AS avg_salary,
    SUM(e.salary) AS total_payroll
FROM departments d
LEFT JOIN employees e ON d.dept_id = e.dept_id
GROUP BY d.dept_id, d.dept_name, d.location
ORDER BY total_payroll DESC
""", conn)
display(df)

print("\n" + "="*50 + "\n")

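# Example 2: Project and budget analysis by department
# Employees and projects are aggregated in separate subqueries before joining.
# Joining both tables directly would repeat each project once per employee
# and inflate SUM(budget).
print("📈 Example 2: Department project and budget analysis")
df = pd.read_sql_query("""
SELECT 
    d.dept_name AS department,
    COALESCE(e.employee_count, 0) AS employee_count,
    COALESCE(p.project_count, 0) AS project_count,
    COALESCE(p.total_budget, 0) AS total_project_budget,
    COALESCE(p.avg_budget, 0) AS avg_project_budget
FROM departments d
LEFT JOIN (
    SELECT dept_id, COUNT(*) AS employee_count
    FROM employees
    GROUP BY dept_id
) e ON d.dept_id = e.dept_id
LEFT JOIN (
    SELECT dept_id,
           COUNT(*) AS project_count,
           SUM(budget) AS total_budget,
           AVG(budget) AS avg_budget
    FROM projects
    GROUP BY dept_id
) p ON d.dept_id = p.dept_id
ORDER BY total_project_budget DESC
""", conn)
display(df)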

## HAVING Clause - Filtering Groups

The HAVING clause filters groups created by GROUP BY, similar to WHERE but for aggregate results.

### HAVING vs WHERE:
- **WHERE**: Filters individual rows before grouping
- **HAVING**: Filters groups after GROUP BY is applied
- **HAVING**: Can use aggregate functions in conditions
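
The difference is easiest to see by running both filters on the same data. This is a minimal sketch on a hypothetical in-memory `staff` table (the table and the thresholds are assumptions for illustration only).

```python
import sqlite3

# Hypothetical in-memory table, used only for this illustration.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE staff (dept TEXT, salary INTEGER)")
demo.executemany("INSERT INTO staff VALUES (?, ?)", [
    ("Eng", 90000), ("Eng", 50000), ("Ops", 60000), ("Ops", 65000),
])

# WHERE removes rows first, so each group is built only from salaries above 55000.
where_rows = demo.execute("""
    SELECT dept, COUNT(*), AVG(salary)
    FROM staff
    WHERE salary > 55000
    GROUP BY dept
    ORDER BY dept
""").fetchall()

# HAVING runs after aggregation and keeps or drops whole groups.
having_rows = demo.execute("""
    SELECT dept, COUNT(*), AVG(salary)
    FROM staff
    GROUP BY dept
    HAVING AVG(salary) > 65000
    ORDER BY dept
""").fetchall()
demo.close()

print(where_rows)   # both departments remain, Eng with a single row
print(having_rows)  # only Eng remains, computed from both of its rows
```

With WHERE, the Eng group loses its 50000 row before the average is computed. With HAVING, Eng keeps both rows and the whole Ops group is dropped because its average is 62500.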

In [None]:
# Example 1: Departments with more than 1 employee
print("👥 Example 1: Departments with multiple employees")

df = pd.read_sql_query("""
SELECT 
    d.dept_name,
    COUNT(e.emp_id) AS employee_count,
    AVG(e.salary) AS avg_salary
FROM departments d
INNER JOIN employees e ON d.dept_id = e.dept_id
GROUP BY d.dept_id, d.dept_name
HAVING COUNT(e.emp_id) > 1
ORDER BY employee_count DESC
""", conn)
display(df)

print("\n" + "="*50 + "\n")

# Example 2: Departments with average salary above 75000
print("💰 Example 2: High-paying departments (avg salary > 75000)")
df = pd.read_sql_query("""
SELECT 
    d.dept_name,
    COUNT(e.emp_id) AS employee_count,
    ROUND(AVG(e.salary), 2) AS avg_salary,
    SUM(e.salary) AS total_payroll
FROM departments d
INNER JOIN employees e ON d.dept_id = e.dept_id
GROUP BY d.dept_id, d.dept_name
HAVING AVG(e.salary) > 75000
ORDER BY avg_salary DESC
""", conn)
display(df)

print("\n" + "="*50 + "\n")

# Example 3: Project status with total budget > 100000
print("🚀 Example 3: High-budget project statuses")
df = pd.read_sql_query("""
SELECT 
    status,
    COUNT(*) AS project_count,
    SUM(budget) AS total_budget,
    AVG(budget) AS avg_budget
FROM projects
GROUP BY status
HAVING SUM(budget) > 100000
ORDER BY total_budget DESC
""", conn)
display(df)

## Advanced Grouping Techniques

Let's explore more sophisticated grouping scenarios and calculations.

In [None]:
# Example 1: Multiple grouping columns
print("📊 Example 1: Projects grouped by department and status")

df = pd.read_sql_query("""
SELECT 
    d.dept_name,
    p.status,
    COUNT(p.project_id) AS project_count,
    SUM(p.budget) AS total_budget,
    AVG(p.priority) AS avg_priority
FROM departments d
INNER JOIN projects p ON d.dept_id = p.dept_id
GROUP BY d.dept_id, d.dept_name, p.status
ORDER BY d.dept_name, total_budget DESC
""", conn)
display(df)

print("\n" + "="*50 + "\n")

# Example 2: Conditional aggregation using CASE
print("📈 Example 2: Salary ranges by department")
df = pd.read_sql_query("""
SELECT 
    d.dept_name,
    COUNT(e.emp_id) AS total_employees,
    SUM(CASE WHEN e.salary < 70000 THEN 1 ELSE 0 END) AS low_salary_count,
    SUM(CASE WHEN e.salary BETWEEN 70000 AND 85000 THEN 1 ELSE 0 END) AS mid_salary_count,
    SUM(CASE WHEN e.salary > 85000 THEN 1 ELSE 0 END) AS high_salary_count,
    ROUND(AVG(e.salary), 2) AS avg_salary
FROM departments d
LEFT JOIN employees e ON d.dept_id = e.dept_id
GROUP BY d.dept_id, d.dept_name
HAVING COUNT(e.emp_id) > 0
ORDER BY avg_salary DESC
""", conn)
display(df)

## Practice Exercises

Practice grouping and aggregation with these exercises:

1. **Basic Grouping**: Count customers by country
2. **Salary Analysis**: Find departments with total payroll > 150000
3. **Project Priority**: Average priority by project status
4. **Complex Analysis**: Department efficiency (budget per employee)
5. **Conditional Grouping**: High vs low priority projects by department

Complete the exercises below:

In [None]:
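# Exercise 1: Count customers by country
# Note: this query assumes a customers table with a country column exists;
# it is not among the tables verified at the start of this section.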
print("Exercise 1: Customer count by country")
df = pd.read_sql_query("""
SELECT 
    country,
    COUNT(*) AS customer_count
FROM customers
GROUP BY country
ORDER BY customer_count DESC
""", conn)
display(df)

print("\n" + "="*30 + "\n")

# Exercise 2: Departments with high payroll
print("Exercise 2: Departments with total payroll > 150,000")
df = pd.read_sql_query("""
SELECT 
    d.dept_name,
    COUNT(e.emp_id) AS employee_count,
    SUM(e.salary) AS total_payroll,
    ROUND(AVG(e.salary), 2) AS avg_salary
FROM departments d
INNER JOIN employees e ON d.dept_id = e.dept_id
GROUP BY d.dept_id, d.dept_name
HAVING SUM(e.salary) > 150000
ORDER BY total_payroll DESC
""", conn)
display(df)

print("\n" + "="*30 + "\n")

# Exercise 3: Average priority by project status
print("Exercise 3: Average priority by project status")
df = pd.read_sql_query("""
SELECT 
    status,
    COUNT(*) AS project_count,
    ROUND(AVG(priority), 2) AS avg_priority,
    MIN(priority) AS min_priority,
    MAX(priority) AS max_priority
FROM projects
GROUP BY status
ORDER BY avg_priority DESC
""", conn)
display(df)

print("\n" + "="*30 + "\n")

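# Exercise 4: Department efficiency (budget per employee)
# Each child table is aggregated in its own subquery so that project rows
# are not multiplied by employee rows (and vice versa).
print("Exercise 4: Department efficiency analysis")
df = pd.read_sql_query("""
SELECT 
    d.dept_name,
    e.employee_count,
    COALESCE(p.total_budget, 0) AS total_project_budget,
    -- 1.0 * forces floating-point division when budget is stored as an integer
    ROUND(1.0 * COALESCE(p.total_budget, 0) / e.employee_count, 2) AS budget_per_employee
FROM departments d
-- INNER JOIN keeps only departments that have at least one employee
INNER JOIN (
    SELECT dept_id, COUNT(*) AS employee_count
    FROM employees
    GROUP BY dept_id
) e ON d.dept_id = e.dept_id
LEFT JOIN (
    SELECT dept_id, SUM(budget) AS total_budget
    FROM projects
    GROUP BY dept_id
) p ON d.dept_id = p.dept_id
ORDER BY budget_per_employee DESC
""", conn)
display(df)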
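
Exercise 5 (high vs low priority projects by department) can be approached with conditional aggregation. The sketch below is one possible shape, not a reference solution: it runs on made-up in-memory data, and it assumes that lower numbers mean higher priority and that "high" means a priority of 2 or less. Both the threshold name `HIGH_PRIORITY_MAX` and the sample rows are hypothetical, so adjust them to the actual `projects` data.

```python
import sqlite3

# Made-up data; the priority scale (1 = most urgent) is an assumption.
demo = sqlite3.connect(":memory:")
demo.executescript("""
    CREATE TABLE departments (dept_id INTEGER PRIMARY KEY, dept_name TEXT);
    CREATE TABLE projects (project_id INTEGER PRIMARY KEY, dept_id INTEGER, priority INTEGER);
    INSERT INTO departments VALUES (1, 'Engineering'), (2, 'Marketing');
    INSERT INTO projects VALUES (1, 1, 1), (2, 1, 2), (3, 1, 4), (4, 2, 5);
""")

HIGH_PRIORITY_MAX = 2  # hypothetical cut-off between "high" and "low"

# Each CASE turns a row into 1 or 0, so SUM counts the rows matching the condition.
priority_split = demo.execute("""
    SELECT d.dept_name,
           SUM(CASE WHEN p.priority <= ? THEN 1 ELSE 0 END) AS high_priority,
           SUM(CASE WHEN p.priority >  ? THEN 1 ELSE 0 END) AS low_priority
    FROM departments d
    INNER JOIN projects p ON d.dept_id = p.dept_id
    GROUP BY d.dept_id, d.dept_name
    ORDER BY d.dept_name
""", (HIGH_PRIORITY_MAX, HIGH_PRIORITY_MAX)).fetchall()
demo.close()

print(priority_split)
```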

## Section Summary

In this section, you mastered data aggregation and grouping:

✅ **Aggregate Functions**: COUNT, SUM, AVG, MIN, MAX for data summarization  
✅ **GROUP BY**: Grouping rows to create summary statistics  
✅ **HAVING Clause**: Filtering grouped results with aggregate conditions  
✅ **Multi-table Grouping**: Combining JOINs with GROUP BY  
✅ **Conditional Aggregation**: Using CASE expressions inside aggregates  
✅ **Complex Analysis**: Creating business intelligence reports  

### Key SQL Commands:
- `GROUP BY column` - Group rows by column values
- `COUNT(*), SUM(), AVG(), MIN(), MAX()` - Aggregate functions
- `HAVING condition` - Filter groups (not individual rows)
- `COALESCE(value, default)` - Handle NULL values
- `CASE WHEN ... THEN ... END` - Conditional logic

### Query Structure with Grouping:
```sql
SELECT column, aggregate_function(column)
FROM table1
[JOIN table2 ON condition]
[WHERE row_conditions]
GROUP BY column
[HAVING group_conditions]
[ORDER BY column];
```

### Best Practices:
- Every non-aggregate column in SELECT must be in GROUP BY
- Use HAVING for group filtering, WHERE for row filtering
- Consider NULL values with COALESCE or IFNULL
- Order results for better readability
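
The NULL advice matters most with LEFT JOIN, where a group can have no matching rows at all. As a rough sketch on made-up in-memory data (the `Archive` department and all rows are assumptions), `COUNT(column)` returns 0 for an empty group, but `SUM` returns NULL unless it is wrapped in `COALESCE`.

```python
import sqlite3

# Made-up data: 'Archive' has no employees.
demo = sqlite3.connect(":memory:")
demo.executescript("""
    CREATE TABLE departments (dept_id INTEGER PRIMARY KEY, dept_name TEXT);
    CREATE TABLE employees (emp_id INTEGER PRIMARY KEY, dept_id INTEGER, salary INTEGER);
    INSERT INTO departments VALUES (1, 'Engineering'), (2, 'Archive');
    INSERT INTO employees VALUES (1, 1, 80000), (2, 1, 70000);
""")

payroll = demo.execute("""
    SELECT d.dept_name,
           COUNT(e.emp_id),             -- 0 for a department without employees
           SUM(e.salary),               -- NULL for a department without employees
           COALESCE(SUM(e.salary), 0)   -- NULL replaced by 0
    FROM departments d
    LEFT JOIN employees e ON d.dept_id = e.dept_id
    GROUP BY d.dept_id, d.dept_name
    ORDER BY d.dept_id
""").fetchall()
demo.close()

for row in payroll:
    print(row)
```

In Python the NULL arrives as `None`, which is why the unwrapped `SUM` column for the empty department can break later arithmetic or sorting.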

### Next Steps
In section 2.2, you'll learn about JOIN operations to combine data from multiple tables effectively.