<a target="_blank" href="https://colab.research.google.com/github/lukebarousse/Int_SQL_Data_Analytics_Course/blob/main/7_Query_Optimization/2_Optimization_Techniques.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Optimization Techniques

## Overview

### 🥅 Analysis Goals
 
- **Optimize Query**: Rewrite queries so they take less time to run.
- **Optimize Cohort Analysis**: Rewrite the query behind the `cohort_analysis` view so it runs more efficiently.

### 📘 Concepts Covered

- Basic query optimization

In [2]:
import sys
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# If running in Google Colab, install PostgreSQL and restore the database
if 'google.colab' in sys.modules:
    # Install PostgreSQL
    !sudo apt-get install postgresql -qq > /dev/null 2>&1

    # Start PostgreSQL service (suppress output)
    !sudo service postgresql start > /dev/null 2>&1

    # Set password for the 'postgres' user to avoid authentication errors (suppress output)
    !sudo -u postgres psql -c "ALTER USER postgres WITH PASSWORD 'password';" > /dev/null 2>&1

    # Create the 'contoso_100k' database (suppress output)
    !sudo -u postgres psql -c "CREATE DATABASE contoso_100k;" > /dev/null 2>&1

    # Download the PostgreSQL .sql dump
    !wget -q -O contoso_100k.sql https://github.com/lukebarousse/Int_SQL_Data_Analytics_Course/releases/download/v.0.0.0/contoso_100k.sql

    # Restore the dump file into the PostgreSQL database (suppress output)
    !sudo -u postgres psql contoso_100k < contoso_100k.sql > /dev/null 2>&1

    # Replace ipython-sql with jupysql
    !pip uninstall -y ipython-sql > /dev/null 2>&1
    !pip install jupysql > /dev/null 2>&1

# Load the sql extension for SQL magic
%load_ext sql

# Connect to the PostgreSQL database
%sql postgresql://postgres:password@localhost:5432/contoso_100k

# Enable automatic conversion of SQL results to pandas DataFrames
%config SqlMagic.autopandas = True

# Disable named parameters for SQL magic
%config SqlMagic.named_parameters = "disabled"

# Display pandas floats with two decimal places
pd.options.display.float_format = '{:.2f}'.format

---

## Optimization Techniques

### 🔰 Beginner (Fundamental Optimizations)
- 🚫 Avoid `SELECT *`: Retrieve only necessary columns.
- 📉 Use `LIMIT` for Large Datasets: Return only the rows you need while exploring large tables.
- 🛠 Use `WHERE` Instead of `HAVING`: Filter before aggregation for efficiency.
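
As a minimal sketch of the `WHERE` versus `HAVING` point, the snippet below runs both forms and checks that they return the same rows. It assumes a tiny in-memory SQLite table as a stand-in for the course's PostgreSQL `sales` table; the sample rows are made up for illustration.

```python
import sqlite3

# Assumption: an in-memory SQLite table stands in for the PostgreSQL `sales` table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (customerkey INTEGER, orderdate TEXT, quantity INTEGER, netprice REAL)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        (1, "2023-01-05", 2, 10.0),
        (1, "2024-02-01", 1, 30.0),
        (2, "2024-03-10", 3, 5.0),
        (3, "2022-07-19", 1, 99.0),
    ],
)

# Filter in HAVING: the condition is written against the groups.
having_sql = """
    SELECT customerkey, SUM(quantity * netprice) AS net_revenue
    FROM sales
    GROUP BY customerkey
    HAVING customerkey IN (1, 2)
    ORDER BY customerkey
"""

# Filter in WHERE: rows for other customers are removed before aggregation.
where_sql = """
    SELECT customerkey, SUM(quantity * netprice) AS net_revenue
    FROM sales
    WHERE customerkey IN (1, 2)
    GROUP BY customerkey
    ORDER BY customerkey
"""

having_rows = conn.execute(having_sql).fetchall()
where_rows = conn.execute(where_sql).fetchall()
print(where_rows)
print(having_rows == where_rows)
```

`HAVING` is still the right tool for conditions on aggregated values, such as `SUM(quantity * netprice) > 100`, because those values do not exist until after grouping.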

### ⚡ Intermediate (Query Structure & Execution Plan Optimizations)
- 📖 Use Query Execution Plans: Identify slow queries and optimize execution paths.
- 📌 Minimize `GROUP BY` Usage: Avoid unnecessary aggregations.
- 🔗 Reduce Joins When Possible: Optimize relationships to prevent expensive joins.
- 📊 Optimize `ORDER BY`: Use indexed columns for sorting.

### 🚀 Advanced (Database-Level Optimizations)
- 🧠 Use Proper Data Types: Store values in appropriate types (for example, numeric keys rather than strings) so filters and comparisons stay efficient.
- ⚡ Use Proper Indexing: Speed up queries with strategic indexes.
- 🗃 Use Partitioning for Large Tables: Improve performance on large datasets.

---

## ⚡ Intermediate (Query Structure & Execution Plan Optimizations)

#### 📌 Minimize `GROUP BY` Usage
- The goal is to reduce unnecessary `GROUP BY` operations, which can slow down queries significantly.
- Here we want `net_revenue` per customer order.

In [3]:
%%sql

EXPLAIN ANALYZE
SELECT 
    customerkey,
    orderdate,
    orderkey,
    linenumber,
    SUM(quantity * netprice * exchangerate) AS net_revenue
FROM sales
GROUP BY 
    customerkey, 
    orderdate,
    orderkey,
    linenumber


Unnamed: 0,QUERY PLAN
0,HashAggregate (cost=7516.82..8157.89 rows=641...
1,"Group Key: orderkey, linenumber"
2,Batches: 5 Memory Usage: 8241kB Disk Usage...
3,-> Seq Scan on sales (cost=0.00..4518.73 r...
4,Planning Time: 0.844 ms
5,Execution Time: 108.461 ms


- If we only care about daily customer orders rather than individual order lines, we can remove `linenumber` from the grouping.

In [4]:
%%sql

EXPLAIN ANALYZE
SELECT 
    customerkey,
    orderdate,
    orderkey,
    SUM(quantity * netprice * exchangerate) AS net_revenue
FROM sales
GROUP BY 
    customerkey, 
    orderdate,
    orderkey

Unnamed: 0,QUERY PLAN
0,HashAggregate (cost=8016.51..8657.57 rows=641...
1,"Group Key: customerkey, orderdate, orderkey"
2,Batches: 5 Memory Usage: 8241kB Disk Usage...
3,-> Seq Scan on sales (cost=0.00..4518.73 r...
4,Planning Time: 0.120 ms
5,Execution Time: 37.593 ms


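> Removing `linenumber` from the `GROUP BY` cut execution time from about 108 ms to about 38 ms in this run, a reduction of roughly 65%. Timings vary between runs, so compare several executions before drawing conclusions.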
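
The comparison can be made reproducible with a small helper. The sketch below parses the `Execution Time` line from `EXPLAIN ANALYZE` output and computes the percentage reduction, using the two timings reported above. The function names `execution_time_ms` and `pct_reduction` are hypothetical helpers, not part of any library.

```python
import re


def execution_time_ms(plan_lines):
    """Return the 'Execution Time' value in ms from EXPLAIN ANALYZE output lines."""
    for line in plan_lines:
        match = re.search(r"Execution Time:\s*([\d.]+)\s*ms", line)
        if match:
            return float(match.group(1))
    raise ValueError("no 'Execution Time' line found")


def pct_reduction(before_ms, after_ms):
    """Percentage by which after_ms is lower than before_ms."""
    return (before_ms - after_ms) / before_ms * 100


# Timing lines copied from the two query plans above.
with_linenumber = ["Planning Time: 0.844 ms", "Execution Time: 108.461 ms"]
without_linenumber = ["Planning Time: 0.120 ms", "Execution Time: 37.593 ms"]

before = execution_time_ms(with_linenumber)
after = execution_time_ms(without_linenumber)
reduction = pct_reduction(before, after)
print(f"{reduction:.1f}% reduction")
```

Because `SqlMagic.autopandas` is enabled in this notebook, the same helper should also accept the `QUERY PLAN` column of the DataFrame a cell returns, assuming that column holds one plan line per row.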

#### 🔗 Reduce Joins When Possible
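- Every join adds work, so drop joins whose columns can be derived another way.
- In the example below, the `date` table is joined only to fetch `year`. Computing it with `EXTRACT(YEAR FROM s.orderdate)` removes one hash join and lowers the planner's estimated total cost from 11792.34 to 11626.55.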


In [5]:
%%sql

EXPLAIN ANALYZE
SELECT 
    c.customerkey,
    c.givenname,
    c.surname,
    p.productname,
    s.orderdate,
    s.orderkey,
    d.year
FROM sales s
INNER JOIN customer c ON s.customerkey = c.customerkey
INNER JOIN product p ON p.productkey = s.productkey
INNER JOIN date d ON d.date = s.orderdate

Unnamed: 0,QUERY PLAN
0,Hash Join (cost=5698.10..11792.34 rows=199873...
1,Hash Cond: (s.orderdate = d.date)
2,-> Hash Join (cost=5557.91..11126.87 rows=...
3,Hash Cond: (s.productkey = p.productkey)
4,-> Hash Join (cost=5442.27..10485.69...
5,Hash Cond: (s.customerkey = c.cu...
6,-> Seq Scan on sales s (cost=0...
7,-> Hash (cost=4129.90..4129.90...
8,Buckets: 131072 Batches: ...
9,-> Seq Scan on customer c...


In [6]:
%%sql

EXPLAIN ANALYZE
SELECT 
    c.customerkey,
    c.givenname,
    c.surname,
    p.productname,
    s.orderdate,
    s.orderkey,
    EXTRACT(YEAR FROM s.orderdate) AS year
FROM sales s
INNER JOIN customer c ON s.customerkey = c.customerkey
INNER JOIN product p ON p.productkey = s.productkey

Unnamed: 0,QUERY PLAN
0,Hash Join (cost=5557.91..11626.55 rows=199873...
1,Hash Cond: (s.productkey = p.productkey)
2,-> Hash Join (cost=5442.27..10485.69 rows=...
3,Hash Cond: (s.customerkey = c.customer...
4,-> Seq Scan on sales s (cost=0.00..4...
5,-> Hash (cost=4129.90..4129.90 rows=...
6,Buckets: 131072 Batches: 1 Mem...
7,-> Seq Scan on customer c (cos...
8,-> Hash (cost=84.17..84.17 rows=2517 width...
9,Buckets: 4096 Batches: 1 Memory Usag...


####   📊 Optimize `ORDER BY`
- Limit number of columns in `ORDER BY` clause
- Avoid sorting on computed columns or function calls
- Remember that `ORDER BY` does not filter rows, and the column order defines the sort order of the result, so list columns in the order the output requires rather than by selectivity
- Use indexed columns for sorting to leverage existing database indexes
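
To illustrate the last bullet, the sketch below shows a query plan losing its separate sort step once an index exists on the sort column. It assumes SQLite and its `EXPLAIN QUERY PLAN` output as a stand-in; PostgreSQL's planner makes its own cost-based choice, so there you would compare `EXPLAIN` output and look for a `Sort` node instead. The index name `idx_sales_orderdate` is hypothetical.

```python
import sqlite3

# Assumption: an in-memory SQLite table stands in for the PostgreSQL `sales` table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (orderkey INTEGER, customerkey INTEGER, orderdate TEXT)")


def plan(sql):
    """Return the detail text of each step in SQLite's query plan."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


query = "SELECT orderkey, customerkey, orderdate FROM sales ORDER BY orderdate"

# Without an index, SQLite has to build a temporary B-tree to sort the rows.
plan_without_index = plan(query)

# With an index on the sort column, rows can be read in index order.
conn.execute("CREATE INDEX idx_sales_orderdate ON sales (orderdate)")
plan_with_index = plan(query)

print(plan_without_index)
print(plan_with_index)
```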


In [7]:
%%sql

EXPLAIN ANALYZE
SELECT 
    customerkey,
    orderdate,
    orderkey,
    SUM(quantity * netprice * exchangerate) AS net_revenue
FROM sales
GROUP BY 
    customerkey, 
    orderdate,
    orderkey
ORDER BY
    net_revenue DESC,
    customerkey,
    orderdate,
    orderkey

Unnamed: 0,QUERY PLAN
0,Sort (cost=13775.85..13936.11 rows=64106 widt...
1,Sort Key: (sum((((quantity)::double precisio...
2,Sort Method: external merge Disk: 2776kB
3,-> HashAggregate (cost=8016.51..8657.57 ro...
4,"Group Key: customerkey, orderdate, ord..."
5,Batches: 5 Memory Usage: 8241kB Disk...
6,-> Seq Scan on sales (cost=0.00..451...
7,Planning Time: 0.048 ms
8,Execution Time: 55.196 ms


In [8]:
%%sql

EXPLAIN ANALYZE
SELECT 
    customerkey,
    orderdate,
    orderkey,
    SUM(quantity * netprice * exchangerate) AS net_revenue
FROM sales
GROUP BY 
    customerkey, 
    orderdate,
    orderkey


Unnamed: 0,QUERY PLAN
0,HashAggregate (cost=8016.51..8657.57 rows=641...
1,"Group Key: customerkey, orderdate, orderkey"
2,Batches: 5 Memory Usage: 8241kB Disk Usage...
3,-> Seq Scan on sales (cost=0.00..4518.73 r...
4,Planning Time: 0.048 ms
5,Execution Time: 64.769 ms


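> Removing `ORDER BY` eliminates the `Sort` step, which here was an external merge that spilled 2776kB to disk. The measured reduction was about 5%, but single-run timings are noisy: in the output shown above, the unsorted query happened to take longer (64.8 ms vs. 55.2 ms), so average several runs before comparing.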

---
## Real World Query Optimization

### 🔑 Key Concepts
- **📊 Business Terms**: 
  - Query Efficiency: Optimized data retrieval methods
  - Resource Management: Efficient use of database resources
  - Performance Scaling: Handling growing data volumes
- **💡 Why It Matters**: 
    - Improves business operations and costs
    - Reduces cloud computing costs through efficient queries
    - Enables faster reporting for business decisions

### 📈 Analysis

- Rewrite the query using the techniques above to reduce its processing time.

#### Optimize Cohort Analysis View

**Query Optimization**

1. Use `EXPLAIN ANALYZE` on the query behind our `cohort_analysis` view to find opportunities for optimization.

In [9]:
%%sql

EXPLAIN ANALYZE
WITH customer_revenue AS (
	SELECT
		s.customerkey,
		s.orderdate,
		SUM(s.quantity * s.netprice * s.exchangerate) AS total_net_revenue,
		COUNT(s.orderkey) AS num_orders,
		c.countryfull,
		c.age,
		c.givenname,
		c.surname
	FROM
		sales s
	LEFT JOIN customer c ON
		c.customerkey = s.customerkey
	GROUP BY
		s.customerkey,
		s.orderdate,
		c.countryfull,
		c.age,
		c.givenname,
		c.surname
)
SELECT
	customerkey,
	orderdate,
	total_net_revenue,
	num_orders,
	countryfull,
	age,
	CONCAT(TRIM(givenname), ' ', TRIM(surname)) AS cleaned_name,
	MIN(orderdate) OVER (PARTITION BY customerkey) AS first_purchase_date,
	EXTRACT(YEAR FROM MIN(orderdate) OVER (PARTITION BY customerkey)) AS cohort_year
FROM
	customer_revenue cr;


Unnamed: 0,QUERY PLAN
0,WindowAgg (cost=35601.24..48592.98 rows=19987...
1,-> GroupAggregate (cost=35601.24..43596.16...
2,"Group Key: s.customerkey, s.orderdate,..."
3,-> Sort (cost=35601.24..36100.92 row...
4,"Sort Key: s.customerkey, s.order..."
5,Sort Method: external merge Dis...
6,-> Hash Left Join (cost=5442.2...
7,Hash Cond: (s.customerkey ...
8,-> Seq Scan on sales s (...
9,-> Hash (cost=4129.90..4...


Below is the query output.

<img src="../Resources/images/7.2_explain_1.png" alt="Query Results 1" style="width: 70%; height: auto;">

2. Use `INNER JOIN` in the `customer_revenue` CTE.
    - If every `sales.customerkey` exists in `customer`, change `LEFT JOIN` to `INNER JOIN`; both then return the same rows.
    - An `INNER JOIN` does not have to preserve unmatched `sales` rows, which gives the planner more freedom in how it executes the join.
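
The precondition in this step can be checked with an anti-join that counts `sales` rows without a matching customer. The sketch below assumes an in-memory SQLite database with a few made-up rows as a stand-in for the PostgreSQL tables; the check query itself is plain SQL and should run unchanged on PostgreSQL.

```python
import sqlite3

# Assumption: in-memory SQLite tables stand in for the PostgreSQL `sales` and `customer` tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (customerkey INTEGER PRIMARY KEY, givenname TEXT);
    CREATE TABLE sales (orderkey INTEGER, customerkey INTEGER);
    INSERT INTO customer VALUES (15, 'Julian'), (180, 'Gabriel');
    INSERT INTO sales VALUES (1, 15), (2, 180), (3, 180);
""")

# Count sales rows whose customerkey has no match in customer.
orphan_sql = """
    SELECT COUNT(*)
    FROM sales s
    LEFT JOIN customer c ON c.customerkey = s.customerkey
    WHERE c.customerkey IS NULL
"""
orphans = conn.execute(orphan_sql).fetchone()[0]

# With zero orphans, LEFT JOIN and INNER JOIN return the same number of rows.
left_count = conn.execute(
    "SELECT COUNT(*) FROM sales s LEFT JOIN customer c ON c.customerkey = s.customerkey"
).fetchone()[0]
inner_count = conn.execute(
    "SELECT COUNT(*) FROM sales s INNER JOIN customer c ON c.customerkey = s.customerkey"
).fetchone()[0]

print(orphans, left_count, inner_count)
```

If `orphans` is greater than zero, switching to `INNER JOIN` would silently drop those sales rows, so the `LEFT JOIN` should stay.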


In [10]:
%%sql

WITH customer_revenue AS (
	SELECT
		s.customerkey,
		s.orderdate,
		SUM(s.quantity * s.netprice * s.exchangerate) AS total_net_revenue,
		COUNT(s.orderkey) AS num_orders,
		c.countryfull,
		c.age,
		c.givenname,
		c.surname
	FROM
		sales s
	INNER JOIN customer c 
		ON c.customerkey = s.customerkey -- Update
	GROUP BY
		s.customerkey,
		s.orderdate,
		c.countryfull,
		c.age,
		c.givenname,
		c.surname
)
SELECT
	customerkey,
	orderdate,
	total_net_revenue,
	num_orders,
	countryfull,
	age,
	CONCAT(TRIM(givenname), ' ', TRIM(surname)) AS cleaned_name,
	MIN(orderdate) OVER (PARTITION BY customerkey) AS first_purchase_date,
	EXTRACT(YEAR FROM MIN(orderdate) OVER (PARTITION BY customerkey)) AS cohort_year
FROM
	customer_revenue cr;


Unnamed: 0,customerkey,orderdate,total_net_revenue,num_orders,countryfull,age,cleaned_name,first_purchase_date,cohort_year
0,15,2021-03-08,2217.41,1,Australia,55,Julian McGuigan,2021-03-08,2021
1,180,2018-07-28,525.31,1,Australia,65,Gabriel Bosanquet,2018-07-28,2018
2,180,2023-08-28,1984.90,2,Australia,65,Gabriel Bosanquet,2018-07-28,2018
3,185,2019-06-01,1395.52,1,Australia,40,Gabrielle Castella,2019-06-01,2019
4,243,2016-05-19,287.67,1,Australia,66,Maya Atherton,2016-05-19,2016
...,...,...,...,...,...,...,...,...,...
83094,2099697,2022-09-13,38.20,3,United States,54,Phillipp Maier,2022-09-13,2022
83095,2099711,2016-08-13,2067.75,1,United States,80,Katerina Pavlícková,2016-08-13,2016
83096,2099711,2017-08-14,3940.92,1,United States,80,Katerina Pavlícková,2016-08-13,2016
83097,2099743,2022-03-17,469.62,2,United States,21,Luciana Almonte,2022-03-17,2022


In [11]:
%%sql

EXPLAIN ANALYZE
WITH customer_revenue AS (
	SELECT
		s.customerkey,
		s.orderdate,
		SUM(s.quantity * s.netprice * s.exchangerate) AS total_net_revenue,
		COUNT(s.orderkey) AS num_orders,
		c.countryfull,
		c.age,
		c.givenname,
		c.surname
	FROM
		sales s
	INNER JOIN customer c 
		ON c.customerkey = s.customerkey -- Update
	GROUP BY
		s.customerkey,
		s.orderdate,
		c.countryfull,
		c.age,
		c.givenname,
		c.surname
)
SELECT
	customerkey,
	orderdate,
	total_net_revenue,
	num_orders,
	countryfull,
	age,
	CONCAT(TRIM(givenname), ' ', TRIM(surname)) AS cleaned_name,
	MIN(orderdate) OVER (PARTITION BY customerkey) AS first_purchase_date,
	EXTRACT(YEAR FROM MIN(orderdate) OVER (PARTITION BY customerkey)) AS cohort_year
FROM
	customer_revenue cr;

Unnamed: 0,QUERY PLAN
0,WindowAgg (cost=35601.24..48592.98 rows=19987...
1,-> GroupAggregate (cost=35601.24..43596.16...
2,"Group Key: s.customerkey, s.orderdate,..."
3,-> Sort (cost=35601.24..36100.92 row...
4,"Sort Key: s.customerkey, s.order..."
5,Sort Method: external merge Dis...
6,-> Hash Join (cost=5442.27..10...
7,Hash Cond: (s.customerkey ...
8,-> Seq Scan on sales s (...
9,-> Hash (cost=4129.90..4...


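> This didn't reduce the execution time: the estimated costs are identical, and only the join node changes from `Hash Left Join` to `Hash Join`. It is still worth doing because it states the intent of the query more precisely.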

3. Wrap the following columns in `MAX()` inside `customer_revenue` and remove them from the `GROUP BY` clause: `countryfull`, `age`, `givenname`, and `surname`.

    - If `countryfull`, `age`, `givenname`, and `surname` never change for a given `customerkey`, `MAX()` returns that single value, so the result is unchanged.
    - The group key shrinks from six columns to two, which reduces sorting and aggregation overhead.

> **⚠️ Note**: Wrapping a column in `MAX()` is only safe when its values are guaranteed to be identical within each group. When that holds, it keeps the column out of the `GROUP BY` and improves query performance.
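
A quick way to gain confidence in this rewrite is to run both forms on sample data and compare the results. The sketch below assumes an in-memory SQLite database with made-up rows as a stand-in for the PostgreSQL tables, and it uses a reduced column list to stay short.

```python
import sqlite3

# Assumption: in-memory SQLite tables stand in for the PostgreSQL `sales` and `customer` tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (customerkey INTEGER PRIMARY KEY, countryfull TEXT, age INTEGER);
    CREATE TABLE sales (customerkey INTEGER, orderdate TEXT, quantity INTEGER, netprice REAL);
    INSERT INTO customer VALUES (15, 'Australia', 55), (180, 'Australia', 65);
    INSERT INTO sales VALUES
        (15, '2021-03-08', 1, 100.0),
        (180, '2018-07-28', 2, 50.0),
        (180, '2018-07-28', 1, 25.0),
        (180, '2023-08-28', 1, 10.0);
""")

# Original form: customer attributes are part of the group key.
wide_sql = """
    SELECT s.customerkey, s.orderdate,
           SUM(s.quantity * s.netprice) AS total_net_revenue,
           c.countryfull, c.age
    FROM sales s
    INNER JOIN customer c ON c.customerkey = s.customerkey
    GROUP BY s.customerkey, s.orderdate, c.countryfull, c.age
    ORDER BY s.customerkey, s.orderdate
"""

# Rewritten form: attributes are wrapped in MAX(), group key is two columns.
narrow_sql = """
    SELECT s.customerkey, s.orderdate,
           SUM(s.quantity * s.netprice) AS total_net_revenue,
           MAX(c.countryfull) AS countryfull, MAX(c.age) AS age
    FROM sales s
    INNER JOIN customer c ON c.customerkey = s.customerkey
    GROUP BY s.customerkey, s.orderdate
    ORDER BY s.customerkey, s.orderdate
"""

wide_rows = conn.execute(wide_sql).fetchall()
narrow_rows = conn.execute(narrow_sql).fetchall()
print(wide_rows == narrow_rows)
```

The two forms agree here because `customerkey` is the primary key of `customer`, so each attribute has exactly one value per customer.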

In [None]:
%%sql

WITH customer_revenue AS (
    SELECT
        s.customerkey,
        s.orderdate,
        SUM(s.quantity * s.netprice * s.exchangerate) AS total_net_revenue,
        COUNT(s.orderkey) AS num_orders,
        MAX(c.countryfull) AS countryfull,
        MAX(c.age) AS age,
        MAX(c.givenname) AS givenname,
        MAX(c.surname) AS surname
    FROM sales s
    INNER JOIN customer c ON c.customerkey = s.customerkey
    GROUP BY
        s.customerkey,
        s.orderdate
)
SELECT
    customerkey,
    orderdate,
    total_net_revenue,
    num_orders,
    countryfull,
    age,
    CONCAT(TRIM(givenname), ' ', TRIM(surname)) AS cleaned_name,
    MIN(orderdate) OVER (PARTITION BY customerkey) AS first_purchase_date,
    EXTRACT(YEAR FROM MIN(orderdate) OVER (PARTITION BY customerkey)) AS cohort_year
FROM customer_revenue cr;


4. Run `EXPLAIN ANALYZE` again on the updated query to see what has improved.

In [13]:
%%sql

EXPLAIN ANALYZE
WITH customer_revenue AS (
    SELECT
        s.customerkey,
        s.orderdate,
        SUM(s.quantity * s.netprice * s.exchangerate) AS total_net_revenue,
        COUNT(s.orderkey) AS num_orders,
        MAX(c.countryfull) AS countryfull,
        MAX(c.age) AS age,
        MAX(c.givenname) AS givenname,
        MAX(c.surname) AS surname
    FROM sales s
    INNER JOIN customer c ON c.customerkey = s.customerkey
    GROUP BY
        s.customerkey,
        s.orderdate
)
SELECT
    customerkey,
    orderdate,
    total_net_revenue,
    num_orders,
    countryfull,
    age,
    CONCAT(TRIM(givenname), ' ', TRIM(surname)) AS cleaned_name,
    MIN(orderdate) OVER (PARTITION BY customerkey) AS first_purchase_date,
    EXTRACT(YEAR FROM MIN(orderdate) OVER (PARTITION BY customerkey)) AS cohort_year
FROM customer_revenue cr;


Unnamed: 0,QUERY PLAN
0,WindowAgg (cost=23390.59..33489.51 rows=37024...
1,-> Finalize GroupAggregate (cost=23390.59....
2,"Group Key: s.customerkey, s.orderdate"
3,-> Gather Merge (cost=23390.59..3145...
4,Workers Planned: 1
5,Workers Launched: 1
6,-> Partial GroupAggregate (cos...
7,"Group Key: s.customerkey, ..."
8,-> Sort (cost=22390.58.....
9,Sort Key: s.customer...


<img src="../Resources/images/7.2_basic_optimization.png" alt="Basic Optimization 1" style="width: 70%; height: auto;">

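> Execution time dropped by about 20%. The plan now groups on only `s.customerkey` and `s.orderdate`, runs the aggregation in parallel (`Gather Merge` with one worker), and its estimated total cost falls from 48592.98 to 33489.51.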

5. Update the view with the new optimized query.


In [15]:
%%sql

-- IF EXISTS avoids an error if the view has not been created yet
DROP VIEW IF EXISTS cohort_analysis;

In [16]:
%%sql

CREATE OR REPLACE VIEW cohort_analysis AS
WITH customer_revenue AS (
    SELECT
        s.customerkey,
        s.orderdate,
        SUM(s.quantity * s.netprice * s.exchangerate) AS total_net_revenue,
        COUNT(s.orderkey) AS num_orders,
        MAX(c.countryfull) AS countryfull,
        MAX(c.age) AS age,
        MAX(c.givenname) AS givenname,
        MAX(c.surname) AS surname
    FROM sales s
    INNER JOIN customer c ON c.customerkey = s.customerkey
    GROUP BY
        s.customerkey,
        s.orderdate
)
SELECT
    customerkey,
    orderdate,
    total_net_revenue,
    num_orders,
    countryfull,
    age,
    CONCAT(TRIM(givenname), ' ', TRIM(surname)) AS cleaned_name,
    MIN(orderdate) OVER (PARTITION BY customerkey) AS first_purchase_date,
    EXTRACT(YEAR FROM MIN(orderdate) OVER (PARTITION BY customerkey)) AS cohort_year
FROM customer_revenue cr;
