### Exploring the Northwind Database Using Subqueries and CTEs

The Northwind database was originally created by Microsoft. It simulates a wholesale business called "Northwind Traders" that imports and exports foods worldwide.

In this exercise, I explore the Northwind database using PostgreSQL.

In [21]:
# Import libraries
import pandas as pd
from sqlalchemy import create_engine

In [22]:
# Create database connection
engine = create_engine('postgresql+psycopg2://tharinduabeysinghe:####@localhost/northwind')

# Run a query and load the result into a DataFrame
def execute_sql_query(sql):
    # Load data into a pandas DataFrame
    df = pd.read_sql_query(sql, con=engine)
    return df

### Subqueries

A subquery is a query nested inside another query. Subqueries are mostly used to add a new column to the main query's result, to build a filter, or to create a consolidated source from which to select data. A subquery is always written in parentheses. Depending on its purpose, it usually appears in the SELECT, FROM, or WHERE clause. You can nest as many subqueries as you need.
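The three placements described above can be sketched as follows. This is a minimal illustration, not part of the original notebook: it uses Python's built-in `sqlite3` with an in-memory database instead of PostgreSQL so that it runs without a server, and the `order_details` rows are made up.

```python
import sqlite3

# Hypothetical miniature of order_details; the values are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE order_details (order_id INTEGER, product_id INTEGER, quantity INTEGER)"
)
conn.executemany(
    "INSERT INTO order_details VALUES (?, ?, ?)",
    [(1, 10, 5), (1, 11, 20), (2, 10, 15), (3, 12, 40)],
)

# SELECT clause: the subquery adds a column (the overall average quantity).
in_select = conn.execute(
    """SELECT order_id, quantity,
              (SELECT AVG(quantity) FROM order_details) AS avg_quantity
       FROM order_details
       ORDER BY order_id, product_id"""
).fetchall()

# FROM clause: the subquery acts as a consolidated source (a derived table).
in_from = conn.execute(
    """SELECT COUNT(*)
       FROM (SELECT order_id, SUM(quantity) AS total_quantity
             FROM order_details
             GROUP BY order_id) AS per_order
       WHERE total_quantity > 20"""
).fetchone()[0]

# WHERE clause: the subquery supplies the value to filter against.
in_where = conn.execute(
    """SELECT order_id, product_id
       FROM order_details
       WHERE quantity > (SELECT AVG(quantity) FROM order_details)
       ORDER BY order_id"""
).fetchall()

print(in_select, in_from, in_where)
```

The same SQL strings should also work when passed to `execute_sql_query` against a PostgreSQL table with these columns.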

##### Scalar vs Multivalued Subqueries

Subqueries are classified as scalar or multivalued based on the number of values they return.

A scalar subquery returns a single value, or NULL when it produces no rows. A multivalued subquery returns a collection of values. If a subquery used where a single value is expected returns more than one row, a runtime error occurs.
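As a rough sketch of the difference, the example below pairs a scalar subquery with a comparison operator and a multivalued subquery with `IN`. It is an illustration under assumptions: it runs on an in-memory SQLite database rather than PostgreSQL, and the sample rows are made up. SQLite is more lenient than PostgreSQL when a scalar subquery returns several rows, so the runtime error is not demonstrated here.

```python
import sqlite3

# Made-up sample rows; table and column names follow the Northwind schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (product_id INTEGER, product_name TEXT, unit_price REAL);
    CREATE TABLE order_details (order_id INTEGER, product_id INTEGER, quantity INTEGER);
    INSERT INTO products VALUES (1, 'Chai', 18.0), (2, 'Chang', 19.0), (3, 'Tofu', 23.25);
    INSERT INTO order_details VALUES (100, 1, 60), (101, 2, 10), (102, 3, 55);
""")

# Scalar subquery: returns one value, so it can follow a comparison operator.
above_avg = conn.execute(
    """SELECT product_name
       FROM products
       WHERE unit_price > (SELECT AVG(unit_price) FROM products)"""
).fetchall()

# Multivalued subquery: returns a set of product_ids, so it is paired with IN.
big_sellers = conn.execute(
    """SELECT product_name
       FROM products
       WHERE product_id IN (SELECT product_id
                            FROM order_details
                            WHERE quantity > 50)
       ORDER BY product_name"""
).fetchall()

print(above_avg, big_sellers)
```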

Subqueries can also be classified by whether they depend on the outer query. This approach divides them into self-contained subqueries and correlated subqueries.


##### Self-Contained Subquery

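A self-contained subquery can be run independently of the outer query. In the query below, the inner query computes the average discount across all order lines, and the outer query keeps the lines whose discount exceeds that average.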

In [23]:
sql = '''SELECT order_id,
                unit_price,
                quantity
         FROM order_details
         WHERE discount >
               (SELECT AVG(discount)
                FROM order_details)'''

# Execute query
execute_sql_query(sql)

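|     | order_id | unit_price | quantity |
|-----|----------|------------|----------|
| 0   | 10250    | 42.40      | 35       |
| 1   | 10250    | 16.80      | 15       |
| 2   | 10254    | 3.60       | 15       |
| 3   | 10254    | 19.20      | 21       |
| 4   | 10258    | 15.20      | 50       |
| ... | ...      | ...        | ...      |
| 641 | 11076    | 23.25      | 20       |
| 642 | 11076    | 9.20       | 10       |
| 643 | 11077    | 19.00      | 24       |
| 644 | 11077    | 40.00      | 2        |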


##### Correlated Subqueries

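The correlated subquery below returns the names and total ordered quantities of the 10 most-ordered products. The inner query references `p.product_id` from the outer query, so it is evaluated for each product.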

In [24]:
sql = '''SELECT product_name,
                (SELECT SUM(quantity)
                 FROM order_details o
                 WHERE o.product_id = p.product_id) AS product_quantity
         FROM products p
         ORDER BY product_quantity DESC
         LIMIT 10'''

# Execute query
execute_sql_query(sql)

|   | product_name           | product_quantity |
|---|------------------------|------------------|
| 0 | Camembert Pierrot      | 1577             |
| 1 | Raclette Courdavault   | 1496             |
| 2 | Gorgonzola Telino      | 1397             |
| 3 | Gnocchi di nonna Alice | 1263             |
| 4 | Pavlova                | 1158             |
| 5 | Rhönbräu Klosterbier   | 1155             |
| 6 | Guaraná Fantástica     | 1125             |
| 7 | Boston Crab Meat       | 1103             |
| 8 | Tarte au sucre         | 1083             |
| 9 | Chang                  | 1057             |


The query below returns the 10 cities with the most order lines shipped, along with each city's share of all order lines. Unlike the previous example, this subquery is self-contained: the nested query counts all rows in `order_details`, and the outer query divides each city's count by that total. The `perc` column is a fraction, so 0.05 means 5%.

In [25]:
sql = '''SELECT 
            ship_city,
            ROUND(cast(count(o.order_id) as numeric) / (SELECT count(*) as total_orders FROM order_details), 2) as perc
        FROM orders o
        INNER JOIN order_details d 
        ON o.order_id = d.order_id
        GROUP BY 1
        ORDER BY 2 DESC
        LIMIT 10'''
        
# Execute query
execute_sql_query(sql)

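|   | ship_city      | perc |
|---|----------------|------|
| 0 | Boise          | 0.05 |
| 1 | Graz           | 0.05 |
| 2 | Rio de Janeiro | 0.04 |
| 3 | Sao Paulo      | 0.04 |
| 4 | London         | 0.04 |
| 5 | Cunewalde      | 0.04 |
| 6 | México D.F.    | 0.03 |
| 7 | Albuquerque    | 0.03 |
| 8 | Cork           | 0.03 |
| 9 | Brandenburg    | 0.02 |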


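##### Correlated Subqueries

In a correlated subquery, the inner query references columns from the outer query, so it cannot be run on its own.

The correlated subquery below returns each employee's most recent order. An employee with several orders on their last order date, such as employee 2, appears more than once.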

In [26]:
sql = '''SELECT order_id, employee_id, order_date  
         FROM Orders AS o1
         WHERE order_date =
            (SELECT Max(order_date)
            FROM Orders AS o2
            WHERE o2.employee_id = o1.employee_id);'''

# Execute query
execute_sql_query(sql)

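|   | order_id | employee_id | order_date |
|---|----------|-------------|------------|
| 0 | 11043    | 5           | 1998-04-22 |
| 1 | 11045    | 6           | 1998-04-23 |
| 2 | 11058    | 9           | 1998-04-29 |
| 3 | 11063    | 3           | 1998-04-30 |
| 4 | 11070    | 2           | 1998-05-05 |
| 5 | 11073    | 2           | 1998-05-05 |
| 6 | 11074    | 7           | 1998-05-06 |
| 7 | 11075    | 8           | 1998-05-06 |
| 8 | 11076    | 4           | 1998-05-06 |
| 9 | 11077    | 1           | 1998-05-06 |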
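One way to see the dependency on the outer query is to try running the inner query of a correlated subquery by itself. The sketch below is an illustration rather than part of the original analysis: it uses an in-memory SQLite database with made-up `orders` rows, and the exact error type would differ under PostgreSQL.

```python
import sqlite3

# Made-up rows; column names follow the Northwind orders table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER, employee_id INTEGER, order_date TEXT);
    INSERT INTO orders VALUES
        (1, 1, '1998-05-01'), (2, 1, '1998-05-06'),
        (3, 2, '1998-04-30'), (4, 2, '1998-05-05');
""")

# Full correlated query: the inner query refers to the outer row's employee_id.
latest = conn.execute(
    """SELECT order_id, employee_id, order_date
       FROM orders AS o1
       WHERE order_date = (SELECT MAX(order_date)
                           FROM orders AS o2
                           WHERE o2.employee_id = o1.employee_id)
       ORDER BY employee_id"""
).fetchall()

# The inner query alone fails because the alias o1 is defined only by the outer query.
try:
    conn.execute(
        "SELECT MAX(order_date) FROM orders AS o2 WHERE o2.employee_id = o1.employee_id"
    )
    inner_runs_alone = True
except sqlite3.OperationalError:
    inner_runs_alone = False

print(latest, inner_runs_alone)
```

A self-contained subquery, by contrast, has no such reference and returns a result when executed on its own.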


### Common Table Expressions (CTEs)
A CTE is a named temporary result set. CTEs are defined with the WITH keyword and used like a subquery. A CTE can be referenced only within the single statement (SELECT, INSERT, UPDATE, or DELETE) that defines it. It exists only while that statement runs and is not stored as a table in the database, so once the statement finishes, the CTE is no longer available.
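The scoping rule can be checked with a small experiment. This is a sketch under assumptions: it runs on an in-memory SQLite database instead of PostgreSQL, the `order_details` rows are made up, and the CTE name `mq` is only illustrative.

```python
import sqlite3

# Made-up rows for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE order_details (product_id INTEGER, quantity INTEGER);
    INSERT INTO order_details VALUES (1, 10), (1, 60), (2, 30), (3, 70);
""")

# The CTE mq exists only for the duration of this one statement.
count_over_50 = conn.execute(
    """WITH mq AS (SELECT product_id, MAX(quantity) AS max_quantity
                   FROM order_details
                   GROUP BY product_id)
       SELECT COUNT(*) FROM mq WHERE max_quantity > 50"""
).fetchone()[0]

# A later statement cannot see mq: no table of that name was ever created.
try:
    conn.execute("SELECT * FROM mq")
    cte_persisted = True
except sqlite3.OperationalError:
    cte_persisted = False

print(count_over_50, cte_persisted)
```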

The query below uses a CTE to pull the order year, unit price, and quantity for each order line, and then aggregates total sales per year in the outer query.

When a subquery is used as a filter for the main query, it appears in the WHERE clause. The outer query uses operators such as IN, >, and < to filter on the subquery's output.

In [27]:
sql = '''WITH yearlysales
              AS (SELECT Date_part('year', o.order_date) AS orderyear,
                    od.unit_price,
                    od.quantity
                FROM orders o
                LEFT JOIN order_details od
                       ON od.order_id = o.order_id)
         SELECT orderyear,
             Sum(unit_price * quantity) AS TotalSales
         FROM   yearlysales
         GROUP  BY orderyear
         ORDER  BY orderyear; '''

# Execute query
execute_sql_query(sql)

|   | orderyear | totalsales    |
|---|-----------|---------------|
| 0 | 1996.0    | 226298.50135  |
| 1 | 1997.0    | 658388.749487 |
| 2 | 1998.0    | 469771.339604 |


Below is a comparison of querying the same scenario using a subquery and a CTE.

In [28]:
sql= '''SELECT Count(*)
        FROM   (SELECT product_id,
                    Max(quantity) AS max_quantity
                FROM   order_details
                GROUP  BY product_id) AS mq
        WHERE  max_quantity > 50 '''
        
# Execute query
execute_sql_query(sql)

|   | count |
|---|-------|
| 0 | 65    |


In [29]:
sql=        '''WITH mq AS
                (SELECT PRODUCT_ID,
                        MAX(QUANTITY) AS max_quantity
                    FROM ORDER_DETAILS
                    GROUP BY PRODUCT_ID)
                    
            SELECT COUNT(*)
            FROM mq
            WHERE max_quantity > 50'''
            
# Execute query
execute_sql_query(sql)

|   | count |
|---|-------|
| 0 | 65    |


### Subqueries vs CTEs and Temp Tables

Subqueries suit simple, one-off cases, whereas CTEs are better for complex queries: each intermediate result gets a name, reads top to bottom, and can be referenced more than once in the same statement.

The query below references the same CTE twice.

In [30]:
sql=        '''WITH mq AS
                (SELECT PRODUCT_ID,
                        MAX(QUANTITY) AS max_quantity
                    FROM ORDER_DETAILS
                    GROUP BY PRODUCT_ID)
                    
            SELECT COUNT(*)
            FROM mq
            WHERE max_quantity < (SELECT AVG(max_quantity) FROM mq)'''
            
# Execute query
execute_sql_query(sql)

|   | count |
|---|-------|
| 0 | 43    |


Here we create two CTEs, each with its own condition, and join them in the final query. This is where CTEs are more powerful than subqueries. With subqueries, both nested queries would have to be written in place of the table names in the final SELECT statement, which makes the query harder to read.

In [31]:
# Join order lines with a discount above 0.20 to each product's maximum ordered quantity
sql = '''WITH mq
            AS (SELECT product_id,
                        Max(quantity) AS max_quantity
                FROM order_details
                GROUP BY product_id),
            non_discount
            AS (SELECT *
                FROM PUBLIC.order_details
                WHERE discount > 0.20)
        SELECT *
        FROM non_discount
            LEFT JOIN mq
                    ON non_discount.product_id = mq.product_id; '''
                    
# Execute query
execute_sql_query(sql)

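|     | order_id | product_id | unit_price | quantity | discount | product_id.1 | max_quantity |
|-----|----------|------------|------------|----------|----------|--------------|--------------|
| 0   | 10258    | 2          | 15.20      | 50       | 0.20     | 2            | 100          |
| 1   | 10258    | 5          | 17.00      | 65       | 0.20     | 5            | 70           |
| 2   | 10258    | 32         | 25.60      | 6        | 0.20     | 32           | 50           |
| 3   | 10260    | 41         | 7.70       | 16       | 0.25     | 41           | 120          |
| 4   | 10260    | 62         | 39.40      | 15       | 0.25     | 62           | 80           |
| ... | ...      | ...        | ...        | ...      | ...      | ...          | ...          |
| 310 | 11065    | 54         | 7.45       | 20       | 0.25     | 54           | 80           |
| 311 | 11076    | 6          | 25.00      | 20       | 0.25     | 6            | 70           |
| 312 | 11076    | 14         | 23.25      | 20       | 0.25     | 14           | 70           |
| 313 | 11076    | 19         | 9.20       | 10       | 0.25     | 19           | 80           |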
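For comparison, here is a sketch of the same kind of join written once with two CTEs and once with the nested queries placed directly in the FROM and JOIN clauses. It is an illustration only: it runs on an in-memory SQLite database with made-up rows, and the CTE name `discounted` is a hypothetical stand-in chosen for this example.

```python
import sqlite3

# Made-up rows for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE order_details
        (order_id INTEGER, product_id INTEGER, quantity INTEGER, discount REAL);
    INSERT INTO order_details VALUES
        (1, 1, 10, 0.0), (2, 1, 60, 0.25), (3, 2, 30, 0.25), (4, 2, 5, 0.1);
""")

# CTE form: each intermediate result is named once, then joined by name.
cte_sql = """
    WITH mq AS (SELECT product_id, MAX(quantity) AS max_quantity
                FROM order_details
                GROUP BY product_id),
         discounted AS (SELECT order_id, product_id, discount
                        FROM order_details
                        WHERE discount > 0.20)
    SELECT d.order_id, d.product_id, d.discount, mq.max_quantity
    FROM discounted AS d
    LEFT JOIN mq ON d.product_id = mq.product_id
    ORDER BY d.order_id
"""

# Subquery form: the same two queries are written in place of the table names.
subquery_sql = """
    SELECT d.order_id, d.product_id, d.discount, mq.max_quantity
    FROM (SELECT order_id, product_id, discount
          FROM order_details
          WHERE discount > 0.20) AS d
    LEFT JOIN (SELECT product_id, MAX(quantity) AS max_quantity
               FROM order_details
               GROUP BY product_id) AS mq
           ON d.product_id = mq.product_id
    ORDER BY d.order_id
"""

cte_rows = conn.execute(cte_sql).fetchall()
subquery_rows = conn.execute(subquery_sql).fetchall()
print(cte_rows == subquery_rows, cte_rows)
```

Both forms return the same rows; the difference is readability, which grows as more intermediate results are added.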


If you need to use certain CTEs again and again, it is a good idea to use temp tables instead. A temp table is stored for the duration of the session and can be queried repeatedly without re-executing the code that built it. If you do not have permission to create temp tables, you can define the query as a view in your database instead. Note that a regular view stores the query rather than its output, so the query runs again each time the view is used.
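A minimal sketch of both options follows. It assumes an in-memory SQLite database with made-up rows so that it runs without a server; PostgreSQL accepts similar `CREATE TEMP TABLE ... AS` and `CREATE VIEW ... AS` statements, but the names `mq` and `mq_view` are only illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE order_details (product_id INTEGER, quantity INTEGER);
    INSERT INTO order_details VALUES (1, 10), (1, 60), (2, 30), (3, 70);

    -- Materialize the intermediate result once; it lives until the connection closes.
    CREATE TEMP TABLE mq AS
        SELECT product_id, MAX(quantity) AS max_quantity
        FROM order_details
        GROUP BY product_id;

    -- Alternative: a view stores only the query text and re-runs it on each use.
    CREATE VIEW mq_view AS
        SELECT product_id, MAX(quantity) AS max_quantity
        FROM order_details
        GROUP BY product_id;
""")

# Separate statements can now reuse mq without repeating its definition.
over_50 = conn.execute(
    "SELECT COUNT(*) FROM mq WHERE max_quantity > 50"
).fetchone()[0]
below_avg = conn.execute(
    "SELECT COUNT(*) FROM mq WHERE max_quantity < (SELECT AVG(max_quantity) FROM mq)"
).fetchone()[0]
via_view = conn.execute(
    "SELECT COUNT(*) FROM mq_view WHERE max_quantity > 50"
).fetchone()[0]

print(over_50, below_avg, via_view)
```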
