<a target="_blank" href="https://colab.research.google.com/github/lukebarousse/Int_SQL_Data_Analytics_Course/blob/main/1_Pivot_With_Case_Statements/2_Statistical_Aggregations.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Statistical Aggregations

## Overview

### 🥅 Analysis Goals

- **Analyze average, median, minimum, and maximum net revenue**:
Examine central tendency and revenue extremes to understand category performance and distribution patterns.
- **Compare these metrics for 2022 and 2023**:
Highlight changes in category revenue to identify growth, decline, or stability over time.

### 📘 Concepts Covered

- `AVG`
- `MIN`
- `MAX`
- Median with `PERCENTILE_CONT`

[Source Documentation on Aggregate Functions.](https://www.postgresql.org/docs/9.5/functions-aggregate.html)

---

In [12]:
import sys
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
%matplotlib inline

# If running in Google Colab, install PostgreSQL and restore the database
if 'google.colab' in sys.modules:
    # Install PostgreSQL
    !sudo apt-get install postgresql -qq > /dev/null 2>&1

    # Start PostgreSQL service (suppress output)
    !sudo service postgresql start > /dev/null 2>&1

    # Set password for the 'postgres' user to avoid authentication errors (suppress output)
    !sudo -u postgres psql -c "ALTER USER postgres WITH PASSWORD 'password';" > /dev/null 2>&1

    # Create the 'contoso_100k' database (suppress output)
    !sudo -u postgres psql -c "CREATE DATABASE contoso_100k;" > /dev/null 2>&1

    # Download the PostgreSQL .sql dump
    !wget -q -O contoso_100k.sql https://github.com/lukebarousse/Int_SQL_Data_Analytics_Course/releases/download/v.0.0.0/contoso_100k.sql

    # Restore the dump file into the PostgreSQL database (suppress output)
    !sudo -u postgres psql contoso_100k < contoso_100k.sql > /dev/null 2>&1

    # Shift libraries from ipython-sql to jupysql
    !pip uninstall -y ipython-sql > /dev/null 2>&1
    !pip install jupysql > /dev/null 2>&1

# Load the sql extension for SQL magic
%load_ext sql

# Connect to the PostgreSQL database
%sql postgresql://postgres:password@localhost:5432/contoso_100k

# Enable automatic conversion of SQL results to pandas DataFrames
%config SqlMagic.autopandas = True

# Disable named parameters for SQL magic
%config SqlMagic.named_parameters = "disabled"

# Display pandas number to two decimal places
pd.options.display.float_format = '{:.2f}'.format



---
## Pivot with Statistical Aggregation Functions

### 📝 Notes

You can also pivot with other statistical aggregate functions, although they're used less often than `SUM` or `COUNT`. For example, we'll take the **SUM with Case When** query pattern and replace `SUM` with `AVG`, `MIN`, and `MAX` to pivot by the average, minimum, and maximum.

#### Aggregation Review
- **Average:** The sum of all values divided by the total number of values.  
- **Minimum:** The smallest value in a dataset.  
- **Maximum:** The largest value in a dataset.  

#### Syntax

```sql
SELECT
    column_name,
    AVG(CASE WHEN column1 = 'value1' THEN column2 END) AS avg_value1,
    AVG(CASE WHEN column1 = 'value2' THEN column2 END) AS avg_value2
FROM
    table_name
GROUP BY
    column_name;
```
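
To see the pattern run end to end, here is a minimal sketch using Python's built-in `sqlite3` module. The `toy_sales` table, its columns, and its values are hypothetical stand-ins, not the course's Contoso data. The `AVG(CASE WHEN ...)` pattern behaves the same way in PostgreSQL.

```python
import sqlite3

# Hypothetical toy table standing in for the course's sales data
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE toy_sales (category TEXT, order_year INTEGER, revenue REAL)"
)
conn.executemany(
    "INSERT INTO toy_sales VALUES (?, ?, ?)",
    [
        ("Audio", 2022, 100.0),
        ("Audio", 2022, 300.0),
        ("Audio", 2023, 250.0),
        ("Computers", 2022, 1000.0),
        ("Computers", 2023, 800.0),
        ("Computers", 2023, 1200.0),
    ],
)

# One AVG(CASE WHEN ...) per pivoted column: rows that fail the condition
# evaluate to NULL, and AVG skips NULLs.
pivot_query = """
SELECT
    category,
    AVG(CASE WHEN order_year = 2022 THEN revenue END) AS avg_2022,
    AVG(CASE WHEN order_year = 2023 THEN revenue END) AS avg_2023
FROM toy_sales
GROUP BY category
ORDER BY category;
"""
pivot_rows = conn.execute(pivot_query).fetchall()
for row in pivot_rows:
    print(row)
```

Each output row holds one category with its 2022 and 2023 averages side by side, which is the pivoted shape used throughout this notebook.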

#### More Aggregations

[Source Documentation on Aggregate Functions.](https://www.postgresql.org/docs/9.5/functions-aggregate.html)

There are other aggregate functions you can pivot by, but we won't go into depth on them in this course. The others are listed below (some may not be available, depending on the SQL dialect you're using):

- `VARIANCE`  
- `VAR_POP`  
- `VAR_SAMP`  
- `STDDEV`  
- `STDDEV_POP`  
- `STDDEV_SAMP`  
- `ARRAY_AGG`  
- `STRING_AGG`  
- `BOOL_AND`  
- `BOOL_OR`  
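
As a rough illustration of how the sample and population variants differ, the sketch below uses Python's `statistics` module on a hypothetical list of revenues. In PostgreSQL, `VARIANCE` is an alias for `VAR_SAMP` and `STDDEV` is an alias for `STDDEV_SAMP`. Mapping the Python functions onto the SQL names is an analogy for this example, not something taken from the course.

```python
import statistics

# Hypothetical revenue values, for illustration only
revenues = [100.0, 300.0, 250.0, 350.0]

# Sample statistics divide by (n - 1), like VAR_SAMP / STDDEV_SAMP
var_samp = statistics.variance(revenues)
stddev_samp = statistics.stdev(revenues)

# Population statistics divide by n, like VAR_POP / STDDEV_POP
var_pop = statistics.pvariance(revenues)
stddev_pop = statistics.pstdev(revenues)

print(f"VAR_SAMP-like:    {var_samp:.2f}")
print(f"VAR_POP-like:     {var_pop:.2f}")
print(f"STDDEV_SAMP-like: {stddev_samp:.2f}")
print(f"STDDEV_POP-like:  {stddev_pop:.2f}")
```

The sample versions are always at least as large as the population versions because they divide by a smaller number.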

### 📈 Analysis

- Find the average, minimum, and maximum net revenue by category for 2023 and 2022. This helps us examine central tendency and revenue extremes to understand category performance and distribution patterns.

#### Average Net Revenue by Category

**`AVG`**

1. Find the average net revenue for 2022 vs 2023 by category.

   - Join the `sales` table (`s`) with the `product` table (`p`) on `productkey`.
   - Use `CASE WHEN` to calculate the net revenue only for 2022 and 2023:
     - For 2022, include sales where `orderdate` is between `2022-01-01` and `2022-12-31`.
     - For 2023, include sales where `orderdate` is between `2023-01-01` and `2023-12-31`.
   - Use `AVG` to calculate the average net revenue for each year.
   - Group the data by `categoryname` to get average revenue by category.
   - Order the results alphabetically by `categoryname`.

In [13]:
%%sql 

SELECT
    p.categoryname AS category,
    AVG(CASE WHEN s.orderdate BETWEEN '2022-01-01' AND '2022-12-31' THEN (s.quantity * s.netprice * s.exchangerate) END) AS avg_net_revenue_2022,
    AVG(CASE WHEN s.orderdate BETWEEN '2023-01-01' AND '2023-12-31' THEN (s.quantity * s.netprice * s.exchangerate) END) AS avg_net_revenue_2023
FROM
    sales s
    LEFT JOIN product p ON s.productkey = p.productkey
GROUP BY
    p.categoryname
ORDER BY
    p.categoryname;

| | category | avg_net_revenue_2022 | avg_net_revenue_2023 |
|---|---|---:|---:|
| 0 | Audio | 392.3 | 425.38 |
| 1 | Cameras and camcorders | 1210.02 | 1210.96 |
| 2 | Cell phones | 722.2 | 623.28 |
| 3 | Computers | 1565.62 | 1292.39 |
| 4 | Games and Toys | 81.29 | 80.83 |
| 5 | Home Appliances | 1755.36 | 1886.55 |
| 6 | Music, Movies and Audio Books | 386.61 | 334.58 |
| 7 | TV and Video | 1535.61 | 1687.9 |
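
The `CASE WHEN` expressions in this query have no `ELSE` branch, so rows from other years become `NULL`, and `AVG` ignores `NULL` values. The sketch below, run against a hypothetical `toy_sales` table in SQLite via Python's `sqlite3`, shows how adding `ELSE 0` would change the result. PostgreSQL treats `NULL` in `AVG` the same way.

```python
import sqlite3

# Hypothetical toy data: two sales in 2022 and two in 2023
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toy_sales (order_year INTEGER, revenue REAL)")
conn.executemany(
    "INSERT INTO toy_sales VALUES (?, ?)",
    [(2022, 100.0), (2022, 300.0), (2023, 250.0), (2023, 350.0)],
)

avg_without_else, avg_with_else_zero = conn.execute(
    """
    SELECT
        -- Non-2022 rows become NULL and are skipped by AVG
        AVG(CASE WHEN order_year = 2022 THEN revenue END),
        -- Non-2022 rows become 0 and are counted, dragging the average down
        AVG(CASE WHEN order_year = 2022 THEN revenue ELSE 0 END)
    FROM toy_sales;
    """
).fetchone()

print("Without ELSE:", avg_without_else)
print("With ELSE 0: ", avg_with_else_zero)
```

Without `ELSE`, the average covers only the two 2022 rows. With `ELSE 0`, all four rows are counted, so the figure no longer describes 2022 sales alone.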


#### Minimum Net Revenue by Category

**`MIN`**

1. Find the minimum net revenue for 2022 vs 2023 by category.

   - Join the `sales` table (`s`) with the `product` table (`p`) on `productkey`.
   - Use `CASE WHEN` to calculate the net revenue only for 2022 and 2023:
     - For 2022, include sales where `orderdate` is between `2022-01-01` and `2022-12-31`.
     - For 2023, include sales where `orderdate` is between `2023-01-01` and `2023-12-31`.
   - 🔔 Use `MIN` to calculate the minimum net revenue for each year.
   - Group the data by `categoryname` to get minimum revenue by category.
   - Order the results alphabetically by `categoryname`.

In [14]:
%%sql 

SELECT
    p.categoryname AS category,
    MIN(CASE WHEN s.orderdate BETWEEN '2022-01-01' AND '2022-12-31' THEN (s.quantity * s.netprice * s.exchangerate) END) AS min_net_revenue_2022,
    MIN(CASE WHEN s.orderdate BETWEEN '2023-01-01' AND '2023-12-31' THEN (s.quantity * s.netprice * s.exchangerate) END) AS min_net_revenue_2023
FROM
    sales s
    LEFT JOIN product p ON s.productkey = p.productkey
GROUP BY
    p.categoryname
ORDER BY
    p.categoryname;

| | category | min_net_revenue_2022 | min_net_revenue_2023 |
|---|---|---:|---:|
| 0 | Audio | 9.31 | 10.85 |
| 1 | Cameras and camcorders | 6.74 | 5.98 |
| 2 | Cell phones | 2.53 | 2.28 |
| 3 | Computers | 0.83 | 0.75 |
| 4 | Games and Toys | 2.83 | 3.49 |
| 5 | Home Appliances | 4.04 | 4.54 |
| 6 | Music, Movies and Audio Books | 7.29 | 6.91 |
| 7 | TV and Video | 41.3 | 42.3 |


#### Maximum Net Revenue by Category

**`MAX`**

1. Find the maximum net revenue for 2022 vs 2023 by category.

   - Join the `sales` table (`s`) with the `product` table (`p`) on `productkey`.
   - Use `CASE WHEN` to calculate the net revenue only for 2022 and 2023:
     - For 2022, include sales where `orderdate` is between `2022-01-01` and `2022-12-31`.
     - For 2023, include sales where `orderdate` is between `2023-01-01` and `2023-12-31`.
   - 🔔 Use `MAX` to calculate the maximum net revenue for each year.
   - Group the data by `categoryname` to get maximum revenue by category.
   - Order the results alphabetically by `categoryname`.

In [15]:
%%sql 

SELECT
    p.categoryname AS category,
    MAX(CASE WHEN s.orderdate BETWEEN '2022-01-01' AND '2022-12-31' THEN (s.quantity * s.netprice * s.exchangerate) END) AS max_net_revenue_2022,
    MAX(CASE WHEN s.orderdate BETWEEN '2023-01-01' AND '2023-12-31' THEN (s.quantity * s.netprice * s.exchangerate) END) AS max_net_revenue_2023
FROM
    sales s
    LEFT JOIN product p ON s.productkey = p.productkey
GROUP BY
    p.categoryname
ORDER BY
    p.categoryname;

| | category | max_net_revenue_2022 | max_net_revenue_2023 |
|---|---|---:|---:|
| 0 | Audio | 3473.36 | 2730.87 |
| 1 | Cameras and camcorders | 15008.39 | 13572.0 |
| 2 | Cell phones | 7692.37 | 8912.22 |
| 3 | Computers | 38082.66 | 27611.6 |
| 4 | Games and Toys | 5202.01 | 3357.3 |
| 5 | Home Appliances | 31654.55 | 32915.59 |
| 6 | Music, Movies and Audio Books | 5415.19 | 3804.91 |
| 7 | TV and Video | 30259.41 | 27503.12 |


---
## Pivot with Median

### 📝 Notes

#### Review
The median is the middle number when you sort the values in a set from low to high. If the set has an even number of values, the median is the average of the two middle values.
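
The definition can be written out in a few lines of Python. This is a minimal sketch for intuition, with hypothetical input values; the function name `median` is our own.

```python
def median(values):
    """Return the middle value of a non-empty list of numbers."""
    ordered = sorted(values)          # sort from low to high
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: a single middle value exists
        return ordered[mid]
    # Even count: average the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_median = median([7, 1, 3])        # sorted: 1, 3, 7
even_median = median([9, 1, 7, 3])    # sorted: 1, 3, 7, 9
print(odd_median, even_median)
```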

**For example:** 
> <img src="../Resources/images/1.2_Finding_the_median.png" alt="Median Example" width="50%">

#### Median in Different Databases
- PostgreSQL → Use `PERCENTILE_CONT(0.5)` with `WITHIN GROUP (ORDER BY ...)`
- SQL Server → Use `PERCENTILE_CONT(0.5)` as a window function (`WITHIN GROUP (ORDER BY ...) OVER (...)`)
- MySQL → No native `MEDIAN()`, requires subqueries or window functions
- SQLite → No built-in `MEDIAN()`, requires custom logic
- MariaDB → Provides `MEDIAN()` and `PERCENTILE_CONT()` as window functions (version 10.3.3 and later)

#### Calculate Median in PostgreSQL

`PERCENTILE_CONT`

- **`PERCENTILE_CONT`** calculates a percentile (e.g., 25th, 50th, 75th) over the sorted values, interpolating linearly between the two adjacent values when the percentile falls between data points.  
- Syntax:
  ```sql
  SELECT 
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY column_name) AS median
  FROM table_name
  WHERE column_name IS NOT NULL;
  ```
- Note: Some SQL dialects have a dedicated `MEDIAN()` function, but PostgreSQL doesn't.
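
To make the interpolation concrete, here is a small Python sketch that mimics what `PERCENTILE_CONT` does as described in the PostgreSQL documentation: ignore `NULL`s, sort the values, then interpolate linearly at the requested fraction. The function name `percentile_cont` and the sample values are our own. This is an approximation for intuition, not PostgreSQL's actual implementation.

```python
import math


def percentile_cont(values, fraction):
    """Linear-interpolation percentile over non-None values (0 <= fraction <= 1)."""
    ordered = sorted(v for v in values if v is not None)  # NULLs are ignored
    if not ordered:
        return None  # no input rows -> NULL
    # Zero-based fractional position of the requested percentile
    position = fraction * (len(ordered) - 1)
    lower = math.floor(position)
    upper = math.ceil(position)
    if lower == upper:
        # The position lands exactly on a data point
        return ordered[lower]
    # Otherwise interpolate between the two neighbouring data points
    weight = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * weight


median_even = percentile_cont([10, 20, 30, 40], 0.5)
quartile_1 = percentile_cont([10, 20, 30, 40], 0.25)
median_with_null = percentile_cont([None, 5, 15], 0.5)
print(median_even, quartile_1, median_with_null)
```

With four values, the median falls halfway between the second and third sorted values, so the result is a value that does not appear in the data.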

### 💻 Analysis

- Find the median net revenue for 2022 and 2023 by category. This highlights changes in category revenue so we can identify growth, decline, or stability over time.

#### Median Net Revenue by Category

**`PERCENTILE_CONT`**, **`WITHIN GROUP`**

1. Find the median net revenue across 2022 and 2023 combined.
   - Use the `PERCENTILE_CONT(0.5)` function to calculate the median value (50th percentile) of `net revenue` in the specified date range.
   - Define `net revenue` as the product of `quantity`, `netprice`, and `exchangerate`.
   - Filter rows in the `WHERE` clause where `orderdate` is between `2022-01-01` and `2023-12-31`.

> #### Why You Need `WITHIN GROUP (ORDER BY …)`
> 
> `PERCENTILE_CONT(0.5)` is not a regular aggregate function; it's an ordered-set aggregate. This means:
> 
> - Its result depends on the sorted order of the input values, so you must specify how to order them.
> - Like `AVG()`, it still returns a single value per group, but that value is picked (or interpolated) from a position in the sorted values rather than computed from the values in any order.
> 
> That's why you must include `WITHIN GROUP (ORDER BY ...)` after the function call. The `ORDER BY` expression inside it supplies the values to sort.

In [16]:
%%sql 

SELECT
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY (quantity * netprice * exchangerate)) AS median
FROM
    sales
WHERE
    orderdate BETWEEN '2022-01-01' AND '2023-12-31';

| | median |
|---|---:|
| 0 | 398.0 |


2. Find the median net revenue for 2022 vs 2023 by category.

   - Join the `sales` table (`s`) with the `product` table (`p`) on `productkey`.
   - Use `PERCENTILE_CONT(0.5)` to calculate the median for each year within categories:
     - For 2022, include `net revenue` where `orderdate` is between `2022-01-01` and `2022-12-31`.
     - For 2023, include `net revenue` where `orderdate` is between `2023-01-01` and `2023-12-31`.
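   - Use `CASE WHEN` inside the `ORDER BY` to separate calculations for 2022 and 2023. Rows outside the given year evaluate to `NULL`, which `PERCENTILE_CONT` ignores.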
   - Group the data by `categoryname` to calculate medians for each category.
   - Order the results alphabetically by `categoryname`.

In [17]:
%%sql

SELECT
    p.categoryname AS category,
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY (CASE 
        WHEN s.orderdate BETWEEN '2022-01-01' AND '2022-12-31' THEN (s.quantity * s.netprice * s.exchangerate) 
    END)) AS y2022_median_sales,
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY (CASE 
        WHEN s.orderdate BETWEEN '2023-01-01' AND '2023-12-31' THEN (s.quantity * s.netprice * s.exchangerate) 
    END)) AS y2023_median_sales
FROM
    sales s
    LEFT JOIN product p ON s.productkey = p.productkey
GROUP BY
    p.categoryname
ORDER BY
    p.categoryname; 


| | category | y2022_median_sales | y2023_median_sales |
|---|---|---:|---:|
| 0 | Audio | 257.21 | 266.59 |
| 1 | Cameras and camcorders | 651.46 | 672.6 |
| 2 | Cell phones | 418.6 | 375.88 |
| 3 | Computers | 809.7 | 657.18 |
| 4 | Games and Toys | 33.78 | 32.62 |
| 5 | Home Appliances | 791.0 | 825.25 |
| 6 | Music, Movies and Audio Books | 186.58 | 159.63 |
| 7 | TV and Video | 730.46 | 790.79 |
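
For intuition about what the pivoted median query computes, the sketch below reproduces the same logic in plain Python on hypothetical toy rows (not the Contoso data). Each row's revenue goes into the bucket for its category and year, which plays the role of the `CASE WHEN` inside `ORDER BY`, and a median is taken per bucket.

```python
from collections import defaultdict
from statistics import median

# Hypothetical (category, year, net revenue) rows, for illustration only
toy_sales = [
    ("Audio", 2022, 100.0),
    ("Audio", 2022, 300.0),
    ("Audio", 2022, 200.0),
    ("Audio", 2023, 250.0),
    ("Audio", 2023, 350.0),
    ("Computers", 2022, 1000.0),
    ("Computers", 2023, 800.0),
]

# One list of revenues per (category, year), like one CASE WHEN per column
by_category = defaultdict(lambda: {2022: [], 2023: []})
for category, year, revenue in toy_sales:
    if year in (2022, 2023):  # other years would be NULL in the SQL version
        by_category[category][year].append(revenue)

# Median per bucket; an empty bucket maps to None, mirroring a NULL result
median_pivot = {
    category: {
        year: (median(vals) if vals else None) for year, vals in years.items()
    }
    for category, years in sorted(by_category.items())
}

for category, medians in median_pivot.items():
    print(category, medians[2022], medians[2023])
```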


<img src="../Resources/images/1.2_category_median.png" alt="Median" width="50%">