### Exploring the Northwind database using SQL

The Northwind database was originally created by Microsoft. It simulates a wholesale business called "Northwind Traders" that imports and exports foods worldwide.

In this exercise, I explore the Northwind database using PostgreSQL.

In [1]:
# Import libraries
import pandas as pd
from sqlalchemy import create_engine

In [2]:
# Create database connection
engine = create_engine('postgresql+psycopg2://tharinduabeysinghe:####@localhost/northwind')

# Run a query and load the result into a DataFrame
def execute_sql_query(sql):
    # Load data into a pandas DataFrame
    df = pd.read_sql_query(sql, con=engine)
    return df

### Window Functions

A window function performs a calculation across a set of rows that are related to the current row. Unlike an aggregate with GROUP BY, it does not collapse the result set: every input row stays in the output.

A window function call has two parts. The first part is the function itself, such as ROW_NUMBER() or AVG(). The second part is the window, written in the OVER clause, which defines the rows the function sees for each row and the order in which it sees them.

The code below applies three different ranking functions over the same window.

In [3]:
sql ='''SELECT product_id,
            quantity,
            unit_price,
            Row_number()
                OVER(
                ORDER BY quantity DESC) AS up,
            Rank()
                OVER(
                ORDER BY quantity DESC) AS r_up,
            Dense_rank()
                OVER(
                ORDER BY quantity DESC) AS dr_up
        FROM   order_details
        WHERE  product_id = 11'''


# Execute query
execute_sql_query(sql)

Unnamed: 0,product_id,quantity,unit_price,up,r_up,dr_up
0,11,50,21.0,1,1,1
1,11,50,21.0,2,1,1
2,11,50,16.8,3,1,1
3,11,40,21.0,4,4,2
4,11,40,21.0,5,4,2
5,11,35,21.0,6,6,3
6,11,30,16.8,7,7,4
7,11,30,16.8,8,7,4
8,11,30,21.0,9,7,4
9,11,25,21.0,10,10,5
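
The three ranking functions differ only in how they handle ties, as the rows with quantity 50 in the output above show. The sketch below reproduces that behavior on a tiny example. It assumes an in-memory SQLite database (version 3.25 or later, which supports window functions) as a stand-in for PostgreSQL, and the `demo` table and its values are hypothetical.

```python
import sqlite3

# Assumption: in-memory SQLite (>= 3.25) stands in for PostgreSQL.
# The demo table and its values are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (quantity INTEGER)")
conn.executemany("INSERT INTO demo VALUES (?)", [(50,), (50,), (40,), (30,)])

ranking_rows = conn.execute("""
    SELECT quantity,
           ROW_NUMBER() OVER (ORDER BY quantity DESC) AS row_num,
           RANK()       OVER (ORDER BY quantity DESC) AS rnk,
           DENSE_RANK() OVER (ORDER BY quantity DESC) AS dense_rnk
    FROM demo
    ORDER BY row_num
""").fetchall()
conn.close()

# ROW_NUMBER always counts 1, 2, 3, ...
# RANK gives ties the same value and then skips (1, 1, 3, ...)
# DENSE_RANK gives ties the same value without skipping (1, 1, 2, ...)
for row in ranking_rows:
    print(row)
```

With two rows tied at 50, ROW_NUMBER still numbers them 1 and 2, RANK gives both 1 and moves on to 3, and DENSE_RANK gives both 1 and moves on to 2.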


Changing the window changes which rows the function is applied to. The PARTITION BY and ORDER BY clauses of the window can each use different columns.

The query below applies the row number function separately within each product, ordering the rows of each product by quantity from largest to smallest.

In [4]:
sql = '''SELECT product_id,
                quantity,
                Row_number()
                OVER(
                    PARTITION BY product_id
                    ORDER BY quantity DESC) AS up
        FROM   order_details
        WHERE  product_id IN ( 11, 12, 13 )
        AND quantity > 20'''

# Execute query
execute_sql_query(sql)

Unnamed: 0,product_id,quantity,up
0,11,50,1
1,11,50,2
2,11,50,3
3,11,40,4
4,11,40,5
5,11,35,6
6,11,30,7
7,11,30,8
8,11,30,9
9,11,25,10
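
Because the displayed output only covers one product, the effect of PARTITION BY is easier to see on a smaller example. The sketch below again assumes an in-memory SQLite database (3.25 or later) in place of PostgreSQL, and the `demo_details` table is hypothetical. It shows the numbering restarting at 1 for each product.

```python
import sqlite3

# Assumption: in-memory SQLite (>= 3.25) stands in for PostgreSQL.
# demo_details is a hypothetical, cut-down version of order_details.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo_details (product_id INTEGER, quantity INTEGER)")
conn.executemany(
    "INSERT INTO demo_details VALUES (?, ?)",
    [(11, 50), (11, 40), (12, 100), (12, 36), (12, 25)],
)

partitioned_rows = conn.execute("""
    SELECT product_id,
           quantity,
           ROW_NUMBER() OVER (
               PARTITION BY product_id
               ORDER BY quantity DESC) AS row_num
    FROM demo_details
    ORDER BY product_id, row_num
""").fetchall()
conn.close()

# The counter restarts whenever product_id changes.
for row in partitioned_rows:
    print(row)
```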


In [5]:
sql='''SELECT product_id,
              quantity,
              ROUND(CAST(AVG(quantity) 
                OVER(PARTITION BY product_id) AS numeric),2) AS Avg_quantity
        FROM   order_details
        WHERE  product_id IN ( 11, 12, 13 )
        AND quantity > 30 '''
        
# Execute query
execute_sql_query(sql)

Unnamed: 0,product_id,quantity,avg_quantity
0,11,40,44.17
1,11,50,44.17
2,11,40,44.17
3,11,50,44.17
4,11,50,44.17
5,11,35,44.17
6,12,36,57.33
7,12,100,57.33
8,12,36,57.33
9,13,80,63.0
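
The query above keeps one row per order line and repeats the product's average on each of them. A GROUP BY would return a single row per product instead. The sketch below contrasts the two, assuming an in-memory SQLite database (3.25 or later) in place of PostgreSQL and a hypothetical `demo_details` table.

```python
import sqlite3

# Assumption: in-memory SQLite (>= 3.25) stands in for PostgreSQL.
# demo_details and its values are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo_details (product_id INTEGER, quantity INTEGER)")
conn.executemany(
    "INSERT INTO demo_details VALUES (?, ?)",
    [(11, 40), (11, 50), (12, 36), (12, 100)],
)

# Window version: every input row survives, with the average attached.
windowed = conn.execute("""
    SELECT product_id,
           quantity,
           AVG(quantity) OVER (PARTITION BY product_id) AS avg_quantity
    FROM demo_details
    ORDER BY product_id, quantity
""").fetchall()

# GROUP BY version: rows collapse to one per product.
grouped = conn.execute("""
    SELECT product_id, AVG(quantity) AS avg_quantity
    FROM demo_details
    GROUP BY product_id
    ORDER BY product_id
""").fetchall()
conn.close()

print(len(windowed), "rows with the window function")
print(len(grouped), "rows with GROUP BY")
```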


Let's answer a list of questions generated by ChatGPT on the Northwind dataset.

Question 1: For each customer, list their orders along with the order date, total order amount, and the running total of their spending over time.

The query below answers it.

In [6]:
sql='''
-- Build a CTE with customer_id, order_id, order date, total quantity ordered,
-- and the discounted amount spent on each order
WITH cte
     AS (SELECT o.customer_id,
                o.order_id,
                order_date,
                Sum(quantity) AS total_order_amount,
                Sum(unit_price * quantity * ( 1 - discount )) AS order_total
         FROM orders o
            JOIN PUBLIC.order_details od
                ON o.order_id = od.order_id
         GROUP BY o.customer_id,
                  o.order_id,
                  order_date
         ORDER BY o.customer_id,
                  order_date)
-- Then use window function to calculate running total
SELECT customer_id,
       order_id,
       order_date,
       total_order_amount,
       Round(Cast(Sum(order_total)
                    OVER (
                      partition BY customer_id
                      ORDER BY order_date) AS NUMERIC), 2) AS running_total
FROM cte
GROUP BY customer_id,
         order_id,
         order_date,
         total_order_amount,
         order_total
ORDER BY customer_id,
          order_date '''
 
# Execute query
execute_sql_query(sql)

Unnamed: 0,customer_id,order_id,order_date,total_order_amount,running_total
0,ALFKI,10643,1997-08-25,38,814.50
1,ALFKI,10692,1997-10-03,20,1692.50
2,ALFKI,10702,1997-10-13,21,2022.50
3,ALFKI,10835,1998-01-15,17,2868.30
4,ALFKI,10952,1998-03-16,18,3339.50
...,...,...,...,...,...
825,WOLZA,10792,1997-12-23,28,1666.85
826,WOLZA,10870,1998-02-04,5,1826.85
827,WOLZA,10906,1998-02-25,15,2254.35
828,WOLZA,10998,1998-04-03,69,2940.35
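
One detail of running totals is worth checking on a small example. When the window has an ORDER BY but no explicit frame, the default frame includes all rows that tie with the current row on the ordering column. If a customer ever had two orders on the same date, both rows would show the same running total. The sketch below illustrates this, assuming an in-memory SQLite database (3.25 or later) in place of PostgreSQL, a hypothetical `demo_orders` table, and an explicit ROWS frame with `order_id` as a tie-breaker as one possible alternative.

```python
import sqlite3

# Assumption: in-memory SQLite (>= 3.25) stands in for PostgreSQL.
# demo_orders is hypothetical; orders 2 and 3 share the same date.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE demo_orders (
        customer_id TEXT, order_id INTEGER, order_date TEXT, order_total INTEGER)
""")
conn.executemany(
    "INSERT INTO demo_orders VALUES (?, ?, ?, ?)",
    [("A", 1, "2024-01-01", 100),
     ("A", 2, "2024-01-02", 50),
     ("A", 3, "2024-01-02", 25),
     ("A", 4, "2024-01-03", 10)],
)

running_rows = conn.execute("""
    SELECT order_id,
           -- Default frame: rows tied on order_date are summed together.
           SUM(order_total) OVER (
               PARTITION BY customer_id
               ORDER BY order_date) AS default_total,
           -- Explicit ROWS frame with a tie-breaker: one step per row.
           SUM(order_total) OVER (
               PARTITION BY customer_id
               ORDER BY order_date, order_id
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS rows_total
    FROM demo_orders
    ORDER BY order_id
""").fetchall()
conn.close()

for row in running_rows:
    print(row)
```

Orders 2 and 3 both show 175 in the default column, while the ROWS version steps through 150 and then 175.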


Question 2: List the top 2 most recent orders for each customer

In [7]:
sql='''
WITH orders_sorted AS (
SELECT 
	customer_id,
	order_id,
	order_date,
	RANK() OVER(PARTITION BY customer_id ORDER BY order_date DESC) 
		AS orders_partitioned
FROM public.orders
ORDER BY customer_id, order_date DESC
	)

SELECT *
FROM orders_sorted
WHERE orders_partitioned <=2;
'''
 
# Execute query
execute_sql_query(sql)

Unnamed: 0,customer_id,order_id,order_date,orders_partitioned
0,ALFKI,11011,1998-04-09,1
1,ALFKI,10952,1998-03-16,2
2,ANATR,10926,1998-03-04,1
3,ANATR,10759,1997-11-28,2
4,ANTON,10856,1998-01-28,1
...,...,...,...,...
173,WHITC,11032,1998-04-17,2
174,WILMK,11005,1998-04-07,1
175,WILMK,10910,1998-02-26,2
176,WOLZA,11044,1998-04-23,1
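
Because RANK() assigns the same value to ties, filtering on rank <= 2 can return more than two rows for a customer whose most recent orders share a date. If exactly two rows per customer are wanted, ROW_NUMBER() with a tie-breaker is one option. The sketch below compares the two, assuming an in-memory SQLite database (3.25 or later) in place of PostgreSQL, a hypothetical `demo_orders` table, and `order_id` as the tie-breaker.

```python
import sqlite3

# Assumption: in-memory SQLite (>= 3.25) stands in for PostgreSQL.
# demo_orders is hypothetical; orders 2, 3 and 4 share the latest date.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE demo_orders (customer_id TEXT, order_id INTEGER, order_date TEXT)"
)
conn.executemany(
    "INSERT INTO demo_orders VALUES (?, ?, ?)",
    [("A", 1, "2024-01-01"),
     ("A", 2, "2024-01-05"),
     ("A", 3, "2024-01-05"),
     ("A", 4, "2024-01-05")],
)

cte = """
    WITH ranked AS (
        SELECT order_id,
               RANK() OVER (
                   PARTITION BY customer_id
                   ORDER BY order_date DESC) AS rnk,
               ROW_NUMBER() OVER (
                   PARTITION BY customer_id
                   ORDER BY order_date DESC, order_id DESC) AS row_num
        FROM demo_orders
    )
"""

# RANK keeps every order tied for first place.
top_by_rank = [r[0] for r in conn.execute(
    cte + "SELECT order_id FROM ranked WHERE rnk <= 2 ORDER BY order_id")]

# ROW_NUMBER keeps exactly two orders per customer.
top_by_row_number = [r[0] for r in conn.execute(
    cte + "SELECT order_id FROM ranked WHERE row_num <= 2 ORDER BY order_id")]
conn.close()

print(top_by_rank)
print(top_by_row_number)
```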


Question 3: Find the first product sold in each category

First, the product name, category name, and order date are combined in a CTE. The rows are partitioned by category and ordered by order date in ascending order, so the first product sold in each category gets row number 1. The final query then keeps only the rows where product_rank is 1.

In [9]:
sql = '''
WITH order_category AS (
    SELECT 
        product_name,
        category_name,
        order_date,
        ROW_NUMBER() OVER(
            PARTITION BY category_name 
            ORDER BY order_date
        ) AS product_rank
    FROM products p
    JOIN order_details od
        ON p.product_id = od.product_id
    JOIN orders o
        ON o.order_id = od.order_id
    JOIN categories c
        ON c.category_id = p.category_id
    GROUP BY category_name, product_name, order_date
    ORDER BY category_name, order_date
)
SELECT *
FROM order_category
WHERE product_rank = 1
ORDER BY category_name
'''
# Execute query
execute_sql_query(sql)

Unnamed: 0,product_name,category_name,order_date,product_rank
0,Chartreuse verte,Beverages,1996-07-10,1
1,Louisiana Fiery Hot Pepper Sauce,Condiments,1996-07-08,1
2,Sir Rodney's Marmalade,Confections,1996-07-09,1
3,Mozzarella di Giovanni,Dairy Products,1996-07-04,1
4,Singaporean Hokkien Fried Mee,Grains/Cereals,1996-07-04,1
5,Pâté chinois,Meat/Poultry,1996-07-11,1
6,Manjimup Dried Apples,Produce,1996-07-05,1
7,Jack's New England Clam Chowder,Seafood,1996-07-08,1


#### Reference:

- [SQL Window Functions](https://www.youtube.com/watch?v=rIcB4zMYMas&t=77s)