# 2.1 GROUP BY and Aggregations

This section covers grouping data and using aggregate functions to summarize information across multiple rows.

## Learning Objectives
By the end of this section, you will be able to:
- Group data using the GROUP BY clause
- Apply aggregate functions (COUNT, SUM, AVG, MIN, MAX)
- Combine grouping with filtering
- Understand the difference between WHERE and HAVING
- Create summary reports from detailed data

## Prerequisites
- Completed Chapter 1 (SQL Fundamentals)
- Understanding of basic SELECT queries
- Familiarity with our sample database schema

## Database Connection

Let's connect to our database and verify our data is ready for aggregation examples.

In [1]:
import sqlite3
import pandas as pd
from IPython.display import display

# Connect to our database
conn = sqlite3.connect('my_database.db')
cursor = conn.cursor()

# Quick data verification
print("✅ Database connection established!")
print("\n📊 Current data summary:")

tables = ['departments', 'employees', 'projects']
for table in tables:
    cursor.execute(f"SELECT COUNT(*) FROM {table}")
    count = cursor.fetchone()[0]
    print(f"  {table:12}: {count:3} records")

✅ Database connection established!

📊 Current data summary:
  departments :   5 records
  employees   :   5 records
  projects    :   5 records


## Understanding Aggregate Functions

Aggregate functions perform calculations on multiple rows and return a single result.

### Common Aggregate Functions:
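- **COUNT()** - Number of rows (`COUNT(*)`) or of non-NULL values (`COUNT(column)`)
- **SUM()** - Total sum of values
- **AVG()** - Average value
- **MIN()** - Minimum value
- **MAX()** - Maximum value
- **GROUP_CONCAT()** - Concatenate text values (available in SQLite and MySQL; other databases offer equivalents such as STRING_AGG)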
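The functions listed above treat NULL differently from an ordinary value, which is easy to miss. The following is a minimal sketch, assuming a throwaway in-memory table named `staff` with made-up rows (neither is part of the sample database); it shows `COUNT(*)` counting every row while `COUNT(column)`, `SUM()` and `AVG()` skip NULLs.

```python
import sqlite3

# Hypothetical in-memory table; names and values are illustrative only.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE staff (name TEXT, bonus INTEGER)")
demo.executemany(
    "INSERT INTO staff VALUES (?, ?)",
    [("Ana", 1000), ("Ben", None), ("Cy", 3000)],  # Ben has no bonus (NULL)
)

row = demo.execute("""
    SELECT
        COUNT(*)                 AS all_rows,        -- counts every row
        COUNT(bonus)             AS rows_with_bonus, -- skips NULL bonuses
        SUM(bonus)               AS total_bonus,     -- NULLs are ignored
        AVG(bonus)               AS avg_bonus,       -- averages non-NULL values only
        GROUP_CONCAT(name, ', ') AS names            -- element order is not guaranteed
    FROM staff
""").fetchone()

all_rows, rows_with_bonus, total_bonus, avg_bonus, names = row
print(all_rows, rows_with_bonus, total_bonus, avg_bonus)  # 3 2 4000 2000.0
demo.close()
```

Note that the average is 2000.0 rather than 1333.33: the NULL bonus is left out of both the sum and the row count used by `AVG()`.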

In [2]:
# Example 1: Basic aggregate functions on employees table
print("📊 Example 1: Basic aggregates on employee salaries")

df = pd.read_sql_query("""
SELECT 
    COUNT(*) AS "Total Employees",
    SUM(salary) AS "Total Payroll",
    AVG(salary) AS "Average Salary",
    MIN(salary) AS "Minimum Salary",
    MAX(salary) AS "Maximum Salary",
    ROUND(AVG(salary), 2) AS "Avg Salary (Rounded)"
FROM employees
""", conn)
display(df)

print("\n" + "="*50 + "\n")

# Example 2: Aggregates on projects
print("📊 Example 2: Project budget statistics")
df = pd.read_sql_query("""
SELECT 
    COUNT(*) AS "Total Projects",
    SUM(budget) AS "Total Budget",
    AVG(budget) AS "Average Budget",
    MIN(budget) AS "Smallest Project",
    MAX(budget) AS "Largest Project"
FROM projects
""", conn)
display(df)

📊 Example 1: Basic aggregates on employee salaries


Unnamed: 0,Total Employees,Total Payroll,Average Salary,Minimum Salary,Maximum Salary,Avg Salary (Rounded)
0,5,407000,81400.0,65000,95000,81400.0




📊 Example 2: Project budget statistics


Unnamed: 0,Total Projects,Total Budget,Average Budget,Smallest Project,Largest Project
0,5,775000,155000.0,50000,300000


## Introduction to GROUP BY

The GROUP BY clause groups rows that have the same values into summary rows, creating one result row per group.

### GROUP BY Syntax:
```sql
SELECT column, aggregate_function(column)
FROM table
GROUP BY column;
```

### Key Rules:
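- Every column in SELECT should either appear in GROUP BY or be wrapped in an aggregate function (SQLite tolerates other "bare" columns, but their values generally come from an arbitrary row in each group)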
- GROUP BY creates one row per unique value combination
- Use with aggregate functions to summarize each group
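To make the "one row per unique value combination" rule concrete, here is a small sketch against an in-memory stand-in for the `projects` table. The column names follow the queries in this section, but the rows and `dept_id` values are assumptions chosen for illustration.

```python
import sqlite3

# Stand-in for the projects table; the rows are illustrative assumptions.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE projects (dept_id INTEGER, status TEXT, budget INTEGER)")
demo.executemany(
    "INSERT INTO projects VALUES (?, ?, ?)",
    [
        (2, "Planning", 300000),
        (2, "In Progress", 150000),
        (5, "In Progress", 200000),
        (3, "Active", 75000),
        (4, "Completed", 50000),
    ],
)

# One grouping column: one result row per distinct status.
by_status = demo.execute("""
    SELECT status, COUNT(*), SUM(budget)
    FROM projects
    GROUP BY status
    ORDER BY status
""").fetchall()

# Two grouping columns: one result row per distinct (dept_id, status) pair.
by_dept_status = demo.execute("""
    SELECT dept_id, status, COUNT(*), SUM(budget)
    FROM projects
    GROUP BY dept_id, status
    ORDER BY dept_id, status
""").fetchall()

print(len(by_status), len(by_dept_status))  # 4 5
demo.close()
```

The two "In Progress" projects collapse into a single row when grouping by status alone, but stay separate once `dept_id` is added to the grouping key.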

In [3]:
# Example 1: Count employees by department
print("👥 Example 1: Employee count by department")

df = pd.read_sql_query("""
SELECT 
    dept_id,
    COUNT(*) AS employee_count
FROM employees 
GROUP BY dept_id
ORDER BY employee_count DESC
""", conn)
display(df)

print("\n" + "="*50 + "\n")

# Example 2: Average salary by department
print("💰 Example 2: Average salary by department")
df = pd.read_sql_query("""
SELECT 
    dept_id,
    COUNT(*) AS employee_count,
    ROUND(AVG(salary), 2) AS avg_salary,
    MIN(salary) AS min_salary,
    MAX(salary) AS max_salary
FROM employees 
GROUP BY dept_id
ORDER BY avg_salary DESC
""", conn)
display(df)

print("\n" + "="*50 + "\n")

# Example 3: Project statistics by status
print("🚀 Example 3: Projects grouped by status")
df = pd.read_sql_query("""
SELECT 
    status,
    COUNT(*) AS project_count,
    SUM(budget) AS total_budget,
    AVG(budget) AS avg_budget
FROM projects 
GROUP BY status
ORDER BY total_budget DESC
""", conn)
display(df)

👥 Example 1: Employee count by department


Unnamed: 0,dept_id,employee_count
0,2,2
1,4,1
2,3,1
3,1,1




💰 Example 2: Average salary by department


Unnamed: 0,dept_id,employee_count,avg_salary,min_salary,max_salary
0,4,1,95000.0,95000,95000
1,1,1,90000.0,90000,90000
2,2,2,78500.0,75000,82000
3,3,1,65000.0,65000,65000




🚀 Example 3: Projects grouped by status


Unnamed: 0,status,project_count,total_budget,avg_budget
0,In Progress,2,350000,175000.0
1,Planning,1,300000,300000.0
2,Active,1,75000,75000.0
3,Completed,1,50000,50000.0


## GROUP BY with JOIN

Combine GROUP BY with JOIN operations to create more meaningful reports using data from multiple tables.
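One pitfall to keep in mind when grouping over joins: if a parent table is joined to two independent child tables at once, every child row of one table is paired with every child row of the other, so `SUM()` and `COUNT()` can be inflated. The sketch below uses a hypothetical in-memory database with a single department to show the effect, together with one common workaround (aggregating the child table in a subquery before joining). It is an illustration under those assumptions, not the only way to avoid the problem.

```python
import sqlite3

# Hypothetical data: one department with two employees and two projects.
demo = sqlite3.connect(":memory:")
demo.executescript("""
    CREATE TABLE departments (dept_id INTEGER PRIMARY KEY, dept_name TEXT);
    CREATE TABLE employees (emp_id INTEGER PRIMARY KEY, dept_id INTEGER);
    CREATE TABLE projects (project_id INTEGER PRIMARY KEY, dept_id INTEGER, budget INTEGER);
    INSERT INTO departments VALUES (1, 'Engineering');
    INSERT INTO employees VALUES (1, 1), (2, 1);
    INSERT INTO projects VALUES (1, 1, 300000), (2, 1, 150000);
""")

# Joining both child tables yields 2 x 2 = 4 rows, so each budget is summed twice.
naive_total = demo.execute("""
    SELECT SUM(p.budget)
    FROM departments d
    LEFT JOIN employees e ON d.dept_id = e.dept_id
    LEFT JOIN projects p ON d.dept_id = p.dept_id
    GROUP BY d.dept_id
""").fetchone()[0]

# Aggregating projects per department first keeps one row per department.
safe_total = demo.execute("""
    SELECT COALESCE(p.total_budget, 0)
    FROM departments d
    LEFT JOIN (
        SELECT dept_id, SUM(budget) AS total_budget
        FROM projects
        GROUP BY dept_id
    ) p ON d.dept_id = p.dept_id
""").fetchone()[0]

print(naive_total, safe_total)  # 900000 450000
demo.close()
```

`COUNT(DISTINCT ...)` hides this duplication for counts, and `AVG()` is unaffected when every row is duplicated equally, but `SUM()` has no such protection.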

In [4]:
# Example 1: Employee statistics with department names
print("🏢 Example 1: Employee statistics by department (with names)")

df = pd.read_sql_query("""
SELECT 
    d.dept_name AS department,
    d.location,
    COUNT(e.emp_id) AS employee_count,
    ROUND(AVG(e.salary), 2) AS avg_salary,
    SUM(e.salary) AS total_payroll
FROM departments d
LEFT JOIN employees e ON d.dept_id = e.dept_id
GROUP BY d.dept_id, d.dept_name, d.location
ORDER BY total_payroll DESC
""", conn)
display(df)

print("\n" + "="*50 + "\n")

# Example 2: Project and budget analysis by department
# Employees and projects are aggregated per department in subqueries first.
# Joining both tables directly would pair every employee with every project
# in a department and inflate SUM(budget).
print("📈 Example 2: Department project and budget analysis")
df = pd.read_sql_query("""
SELECT 
    d.dept_name AS department,
    COALESCE(e.employee_count, 0) AS employee_count,
    COALESCE(p.project_count, 0) AS project_count,
    COALESCE(p.total_budget, 0) AS total_project_budget,
    COALESCE(p.avg_budget, 0) AS avg_project_budget
FROM departments d
LEFT JOIN (
    SELECT dept_id, COUNT(*) AS employee_count
    FROM employees
    GROUP BY dept_id
) e ON d.dept_id = e.dept_id
LEFT JOIN (
    SELECT dept_id,
           COUNT(*) AS project_count,
           SUM(budget) AS total_budget,
           AVG(budget) AS avg_budget
    FROM projects
    GROUP BY dept_id
) p ON d.dept_id = p.dept_id
ORDER BY total_project_budget DESC
""", conn)
display(df)

🏢 Example 1: Employee statistics by department (with names)


Unnamed: 0,department,location,employee_count,avg_salary,total_payroll
0,Engineering,San Francisco,2,78500.0,157000.0
1,Sales,Los Angeles,1,95000.0,95000.0
2,Human Resources,New York,1,90000.0,90000.0
3,Marketing,Chicago,1,65000.0,65000.0
4,Finance,Boston,0,,




📈 Example 2: Department project and budget analysis


Unnamed: 0,department,employee_count,project_count,total_project_budget,avg_project_budget
0,Engineering,2,2,450000,225000.0
1,Finance,0,1,200000,200000.0
2,Marketing,1,1,75000,75000.0
3,Sales,1,1,50000,50000.0
4,Human Resources,1,0,0,0.0


## HAVING Clause - Filtering Groups

The HAVING clause filters groups created by GROUP BY, similar to WHERE but for aggregate results.

### HAVING vs WHERE:
- **WHERE**: Filters individual rows before grouping
- **HAVING**: Filters groups after GROUP BY is applied
- **HAVING**: Can use aggregate functions in conditions
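The difference is easiest to see when both clauses appear in one query. The following sketch assumes an in-memory `employees` table with illustrative rows; `WHERE` discards individual rows before the groups are formed, and `HAVING` then discards whole groups based on their aggregates. It also shows that SQLite rejects an aggregate inside `WHERE`.

```python
import sqlite3

# Stand-in employees table; the rows are illustrative assumptions.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE employees (dept_id INTEGER, salary INTEGER)")
demo.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [(1, 90000), (2, 75000), (2, 82000), (3, 65000), (4, 95000)],
)

rows = demo.execute("""
    SELECT dept_id, COUNT(*) AS n, AVG(salary) AS avg_salary
    FROM employees
    WHERE salary >= 70000        -- row filter: applied before grouping
    GROUP BY dept_id
    HAVING AVG(salary) > 80000   -- group filter: applied after aggregation
    ORDER BY dept_id
""").fetchall()
print(rows)  # [(1, 1, 90000.0), (4, 1, 95000.0)]

# An aggregate in WHERE is an error, because WHERE runs before groups exist.
try:
    demo.execute(
        "SELECT dept_id FROM employees WHERE AVG(salary) > 80000 GROUP BY dept_id"
    )
    where_error = None
except sqlite3.Error as exc:
    where_error = str(exc)

demo.close()
```

Department 3 is removed by `WHERE` (its only salary is below 70000), while department 2 survives `WHERE` but is removed by `HAVING` because its average of 78500 does not exceed 80000.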

In [5]:
# Example 1: Departments with more than 1 employee
print("👥 Example 1: Departments with multiple employees")

df = pd.read_sql_query("""
SELECT 
    d.dept_name,
    COUNT(e.emp_id) AS employee_count,
    AVG(e.salary) AS avg_salary
FROM departments d
INNER JOIN employees e ON d.dept_id = e.dept_id
GROUP BY d.dept_id, d.dept_name
HAVING COUNT(e.emp_id) > 1
ORDER BY employee_count DESC
""", conn)
display(df)

print("\n" + "="*50 + "\n")

# Example 2: Departments with average salary above 75000
print("💰 Example 2: High-paying departments (avg salary > 75000)")
df = pd.read_sql_query("""
SELECT 
    d.dept_name,
    COUNT(e.emp_id) AS employee_count,
    ROUND(AVG(e.salary), 2) AS avg_salary,
    SUM(e.salary) AS total_payroll
FROM departments d
INNER JOIN employees e ON d.dept_id = e.dept_id
GROUP BY d.dept_id, d.dept_name
HAVING AVG(e.salary) > 75000
ORDER BY avg_salary DESC
""", conn)
display(df)

print("\n" + "="*50 + "\n")

# Example 3: Project status with total budget > 100000
print("🚀 Example 3: High-budget project statuses")
df = pd.read_sql_query("""
SELECT 
    status,
    COUNT(*) AS project_count,
    SUM(budget) AS total_budget,
    AVG(budget) AS avg_budget
FROM projects
GROUP BY status
HAVING SUM(budget) > 100000
ORDER BY total_budget DESC
""", conn)
display(df)

👥 Example 1: Departments with multiple employees


Unnamed: 0,dept_name,employee_count,avg_salary
0,Engineering,2,78500.0




💰 Example 2: High-paying departments (avg salary > 75000)


Unnamed: 0,dept_name,employee_count,avg_salary,total_payroll
0,Sales,1,95000.0,95000
1,Human Resources,1,90000.0,90000
2,Engineering,2,78500.0,157000




🚀 Example 3: High-budget project statuses


Unnamed: 0,status,project_count,total_budget,avg_budget
0,In Progress,2,350000,175000.0
1,Planning,1,300000,300000.0


## Advanced Grouping Techniques

Let's explore more sophisticated grouping scenarios and calculations.

In [6]:
# Example 1: Multiple grouping columns
print("📊 Example 1: Projects grouped by department and status")

df = pd.read_sql_query("""
SELECT 
    d.dept_name,
    p.status,
    COUNT(p.project_id) AS project_count,
    SUM(p.budget) AS total_budget,
    AVG(p.priority) AS avg_priority
FROM departments d
INNER JOIN projects p ON d.dept_id = p.dept_id
GROUP BY d.dept_name, p.status
ORDER BY d.dept_name, total_budget DESC
""", conn)
display(df)

print("\n" + "="*50 + "\n")

# Example 2: Conditional aggregation using CASE
print("📈 Example 2: Salary ranges by department")
df = pd.read_sql_query("""
SELECT 
    d.dept_name,
    COUNT(e.emp_id) AS total_employees,
    SUM(CASE WHEN e.salary < 70000 THEN 1 ELSE 0 END) AS low_salary_count,
    SUM(CASE WHEN e.salary BETWEEN 70000 AND 85000 THEN 1 ELSE 0 END) AS mid_salary_count,
    SUM(CASE WHEN e.salary > 85000 THEN 1 ELSE 0 END) AS high_salary_count,
    ROUND(AVG(e.salary), 2) AS avg_salary
FROM departments d
LEFT JOIN employees e ON d.dept_id = e.dept_id
GROUP BY d.dept_id, d.dept_name
HAVING COUNT(e.emp_id) > 0
ORDER BY avg_salary DESC
""", conn)
display(df)

📊 Example 1: Projects grouped by department and status


Unnamed: 0,dept_name,status,project_count,total_budget,avg_priority
0,Engineering,Planning,1,300000,5.0
1,Engineering,In Progress,1,150000,4.0
2,Finance,In Progress,1,200000,5.0
3,Marketing,Active,1,75000,3.0
4,Sales,Completed,1,50000,2.0




📈 Example 2: Salary ranges by department


Unnamed: 0,dept_name,total_employees,low_salary_count,mid_salary_count,high_salary_count,avg_salary
0,Sales,1,0,0,1,95000.0
1,Human Resources,1,0,0,1,90000.0
2,Engineering,2,0,2,0,78500.0
3,Marketing,1,1,0,0,65000.0


## Practice Exercises

Practice grouping and aggregation with these exercises:

1. **Basic Grouping**: Count customers by country
2. **Salary Analysis**: Find departments with total payroll > 150000
3. **Project Priority**: Average priority by project status
4. **Complex Analysis**: Department efficiency (budget per employee)
5. **Conditional Grouping**: High vs low priority projects by department
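For the conditional grouping in exercise 5, one possible shape is a `CASE` expression inside `SUM()`. The sketch below is only an outline under stated assumptions: it runs against an in-memory stand-in table with illustrative rows, and the cut-off of 4 for "high" priority is a hypothetical threshold, not something defined by the sample database.

```python
import sqlite3

# Stand-in projects table; rows and dept_id values are illustrative assumptions.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE projects (dept_id INTEGER, priority INTEGER)")
demo.executemany(
    "INSERT INTO projects VALUES (?, ?)",
    [(2, 5), (2, 4), (5, 5), (3, 3), (4, 2)],
)

HIGH_PRIORITY_MIN = 4  # hypothetical threshold for "high" priority

rows = demo.execute("""
    SELECT
        dept_id,
        SUM(CASE WHEN priority >= ? THEN 1 ELSE 0 END) AS high_priority,
        SUM(CASE WHEN priority <  ? THEN 1 ELSE 0 END) AS low_priority
    FROM projects
    GROUP BY dept_id
    ORDER BY dept_id
""", (HIGH_PRIORITY_MIN, HIGH_PRIORITY_MIN)).fetchall()

print(rows)  # [(2, 2, 0), (3, 0, 1), (4, 0, 1), (5, 1, 0)]
demo.close()
```

Against the real database, the same pattern could be joined to `departments` to show names instead of ids.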

Complete the exercises below:

In [7]:
# Exercise 1: Count customers by country
print("Exercise 1: Customer count by country")
df = pd.read_sql_query("""
SELECT 
    country,
    COUNT(*) AS customer_count
FROM customers
GROUP BY country
ORDER BY customer_count DESC
""", conn)
display(df)

print("\n" + "="*30 + "\n")

# Exercise 2: Departments with high payroll
print("Exercise 2: Departments with total payroll > 150,000")
df = pd.read_sql_query("""
SELECT 
    d.dept_name,
    COUNT(e.emp_id) AS employee_count,
    SUM(e.salary) AS total_payroll,
    ROUND(AVG(e.salary), 2) AS avg_salary
FROM departments d
INNER JOIN employees e ON d.dept_id = e.dept_id
GROUP BY d.dept_id, d.dept_name
HAVING SUM(e.salary) > 150000
ORDER BY total_payroll DESC
""", conn)
display(df)

print("\n" + "="*30 + "\n")

# Exercise 3: Average priority by project status
print("Exercise 3: Average priority by project status")
df = pd.read_sql_query("""
SELECT 
    status,
    COUNT(*) AS project_count,
    ROUND(AVG(priority), 2) AS avg_priority,
    MIN(priority) AS min_priority,
    MAX(priority) AS max_priority
FROM projects
GROUP BY status
ORDER BY avg_priority DESC
""", conn)
display(df)

print("\n" + "="*30 + "\n")

# Exercise 4: Department efficiency (budget per employee)
# Each table is aggregated per department before joining, so the two joins
# cannot multiply rows. The INNER JOIN keeps only departments with employees,
# and 1.0 * forces floating-point division.
print("Exercise 4: Department efficiency analysis")
df = pd.read_sql_query("""
SELECT 
    d.dept_name,
    e.employee_count,
    COALESCE(p.total_budget, 0) AS total_project_budget,
    ROUND(1.0 * COALESCE(p.total_budget, 0) / e.employee_count, 2) AS budget_per_employee
FROM departments d
INNER JOIN (
    SELECT dept_id, COUNT(*) AS employee_count
    FROM employees
    GROUP BY dept_id
) e ON d.dept_id = e.dept_id
LEFT JOIN (
    SELECT dept_id, SUM(budget) AS total_budget
    FROM projects
    GROUP BY dept_id
) p ON d.dept_id = p.dept_id
ORDER BY budget_per_employee DESC
""", conn)
display(df)

Exercise 1: Customer count by country


Unnamed: 0,country,customer_count
0,USA,3




Exercise 2: Departments with total payroll > 150,000


Unnamed: 0,dept_name,employee_count,total_payroll,avg_salary
0,Engineering,2,157000,78500.0




Exercise 3: Average priority by project status


Unnamed: 0,status,project_count,avg_priority,min_priority,max_priority
0,Planning,1,5.0,5,5
1,In Progress,2,4.5,4,5
2,Active,1,3.0,3,3
3,Completed,1,2.0,2,2




Exercise 4: Department efficiency analysis


Unnamed: 0,dept_name,employee_count,total_project_budget,budget_per_employee
0,Engineering,2,450000,225000.0
1,Marketing,1,75000,75000.0
2,Sales,1,50000,50000.0
3,Human Resources,1,0,0.0


## Section Summary

In this section, you mastered data aggregation and grouping:

✅ **Aggregate Functions**: COUNT, SUM, AVG, MIN, MAX for data summarization  
✅ **GROUP BY**: Grouping rows to create summary statistics  
✅ **HAVING Clause**: Filtering grouped results with aggregate conditions  
✅ **Multi-table Grouping**: Combining JOINs with GROUP BY  
✅ **Conditional Aggregation**: Using CASE expressions in aggregates  
✅ **Complex Analysis**: Creating business intelligence reports  

### Key SQL Commands:
- `GROUP BY column` - Group rows by column values
- `COUNT(*), SUM(), AVG(), MIN(), MAX()` - Aggregate functions
- `HAVING condition` - Filter groups (not individual rows)
- `COALESCE(value, default)` - Handle NULL values
- `CASE WHEN ... THEN ... END` - Conditional logic

### Query Structure with Grouping:
```sql
SELECT column, aggregate_function(column)
FROM table1
[JOIN table2 ON condition]
[WHERE row_conditions]
GROUP BY column
[HAVING group_conditions]
[ORDER BY column];
```

### Best Practices:
- Every non-aggregate column in SELECT must be in GROUP BY
- Use HAVING for group filtering, WHERE for row filtering
- Consider NULL values with COALESCE or IFNULL
- Order results for better readability
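The NULL-handling advice matters most for groups that have no matching rows after a LEFT JOIN. As a minimal sketch, assuming a hypothetical in-memory database with one staffed and one empty department, the query below compares the raw aggregates with their `COALESCE` and `IFNULL` wrapped versions.

```python
import sqlite3

# Hypothetical data: Engineering has two employees, Finance has none.
demo = sqlite3.connect(":memory:")
demo.executescript("""
    CREATE TABLE departments (dept_id INTEGER PRIMARY KEY, dept_name TEXT);
    CREATE TABLE employees (emp_id INTEGER PRIMARY KEY, dept_id INTEGER, salary INTEGER);
    INSERT INTO departments VALUES (1, 'Engineering'), (2, 'Finance');
    INSERT INTO employees VALUES (1, 1, 75000), (2, 1, 82000);
""")

rows = demo.execute("""
    SELECT
        d.dept_name,
        COUNT(*)                      AS joined_rows,    -- 1 for an empty department
        COUNT(e.emp_id)               AS employee_count, -- 0 for an empty department
        SUM(e.salary)                 AS raw_payroll,    -- NULL when nothing matches
        COALESCE(SUM(e.salary), 0)    AS total_payroll,  -- NULL replaced by 0
        IFNULL(AVG(e.salary), 0)      AS avg_salary      -- IFNULL is the two-argument form
    FROM departments d
    LEFT JOIN employees e ON d.dept_id = e.dept_id
    GROUP BY d.dept_id, d.dept_name
    ORDER BY d.dept_id
""").fetchall()

for r in rows:
    print(r)
# ('Engineering', 2, 2, 157000, 157000, 78500.0)
# ('Finance', 1, 0, None, 0, 0)
demo.close()
```

The Finance row also shows why `COUNT(e.emp_id)` is preferred over `COUNT(*)` after a LEFT JOIN: the unmatched department still produces one joined row, so `COUNT(*)` reports 1 instead of 0.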

### Next Steps
In section 2.2, you'll learn about JOIN operations to combine data from multiple tables effectively.