#  Common Table Expressions (CTEs) vs. Subqueries

##  Learning Objectives
By the end of this notebook, you should be able to:
- Explain the differences between **CTEs** and **subqueries**.
- Restructure and **optimise SQL queries** using CTEs.
- Write **complex SQL queries** using both CTEs and subqueries.



##  Overview
In this notebook, we explore **Common Table Expressions (CTEs)** and **subqueries** in SQL: two tools that break complex queries into smaller parts, make them easier to read, and can sometimes improve performance.

We’ll examine these concepts through a real-world scenario:

> **Identifying Sub-Saharan African countries with underdeveloped economies that might struggle to gain access to water.**

We’ll start by writing SQL queries using **subqueries**, and then rewrite them using **CTEs**.



##  Connecting to Our MySQL Database

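Since we’re using a **MySQL** database, we’ll load the `sql` IPython extension and connect through the `pymysql` driver. The cell outputs below show the connection as a `mysql+pymysql://` URL pointing at the `united_nations` database.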



In [1]:
%load_ext sql

##  Exercise Overview

We’ll be working with the **`united_nations.Access_to_Basic_Services`** table, which contains data on:
- Country names  
- Access to basic services (like water and sanitation)  
- Estimated GDP  
- Regional classification  

Our goal is to identify **Sub-Saharan African countries** with:
- An estimated GDP **below the regional average**, and  
- **Less than 60%** access to managed drinking water services in **2020**.


###  Task 1: Calculate the Average GDP for Each Region
Use the `AVG(Est_gdp_in_billions) OVER(PARTITION BY Region)` window function to calculate the average GDP for each region in 2020.


In [3]:
%%sql

SELECT 
    Country_name,
    Region,
    Est_gdp_in_billions,
    AVG(Est_gdp_in_billions) OVER(PARTITION BY Region) AS Avg_gdp_for_region
FROM united_nations.Access_to_Basic_Services
WHERE Time_period = 2020
LIMIT 20;


 * mysql+pymysql://root:***@localhost:3306/united_nations
20 rows affected.


Country_name,Region,Est_gdp_in_billions,Avg_gdp_for_region
Kazakhstan,Central and Southern Asia,171.08,338.738182
Kyrgyzstan,Central and Southern Asia,,338.738182
Tajikistan,Central and Southern Asia,8.13,338.738182
Turkmenistan,Central and Southern Asia,,338.738182
Uzbekistan,Central and Southern Asia,59.89,338.738182
Afghanistan,Central and Southern Asia,20.14,338.738182
Bangladesh,Central and Southern Asia,373.9,338.738182
Bhutan,Central and Southern Asia,2.33,338.738182
India,Central and Southern Asia,2667.69,338.738182
Iran (Islamic Republic of),Central and Southern Asia,,338.738182
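
To see on a small scale what the window function in Task 1 does, here is a minimal sketch that runs without the MySQL database. It assumes an in-memory SQLite database (3.25 or later, for window-function support) and a hypothetical `services` table filled with made-up rows; none of these values come from the UN dataset. It shows that every row keeps its own GDP while gaining its region's average, and that `AVG` skips missing (`NULL`) values, like the blank GDP cells above.

```python
import sqlite3

# Made-up illustrative rows, NOT the UN data: (country, region, GDP in billions)
rows = [
    ("A", "North", 10.0),
    ("B", "North", 30.0),
    ("C", "South", 5.0),
    ("D", "South", None),  # missing GDP, like the blank cells in the output above
    ("E", "South", 15.0),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE services ("
    "Country_name TEXT, Region TEXT, Est_gdp_in_billions REAL)"
)
conn.executemany("INSERT INTO services VALUES (?, ?, ?)", rows)

# The window function adds the regional average to every row
# without collapsing the rows the way GROUP BY would.
window_result = conn.execute(
    """
    SELECT
        Country_name,
        Region,
        Est_gdp_in_billions,
        AVG(Est_gdp_in_billions) OVER (PARTITION BY Region) AS Avg_gdp_for_region
    FROM services
    ORDER BY Country_name
    """
).fetchall()

for row in window_result:
    print(row)
```

Country `D` has no GDP value, so the `South` average is computed from `C` and `E` only.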


###  Task 2: Filter the Data
Now, filter for **Sub-Saharan African countries** with:
- **Underdeveloped economies** (GDP below regional average)
- **Low access** to managed drinking water services (<60%)
- **Data for the year 2020**

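You’ll encounter an error:
> `Unknown column 'Avg_gdp_for_region' in 'where clause'`

This happens because `Avg_gdp_for_region` is not a column of the table: it was only an alias defined in the `SELECT` list of the Task 1 query. Even within a single query, `WHERE` is evaluated before the `SELECT` list, so it cannot see select-list aliases, and window functions are not allowed in `WHERE` at all.  
We’ll fix this using **subqueries** or **CTEs** next.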


In [4]:
%%sql

SELECT *
FROM united_nations.Access_to_Basic_Services
WHERE Region = 'Sub-Saharan Africa'
  AND Time_period = 2020
  AND Est_gdp_in_billions < Avg_gdp_for_region
  AND Access_to_managed_drinking_water_services < 60;


 * mysql+pymysql://root:***@localhost:3306/united_nations
(pymysql.err.OperationalError) (1054, "Unknown column 'Avg_gdp_for_region' in 'where clause'")
[SQL: SELECT *
FROM united_nations.Access_to_Basic_Services
WHERE Region = 'Sub-Saharan Africa'
  AND Time_period = 2020
  AND Est_gdp_in_billions < Avg_gdp_for_region
  AND Access_to_managed_drinking_water_services < 60;]
(Background on this error at: https://sqlalche.me/e/14/e3q8)


###  Task 3: Implement the Solution Using a Subquery
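We’ll use a **correlated subquery** in the `WHERE` clause: for each row of the main query, it calculates the average GDP of that row's region and compares the country's GDP against it.  
Because the average is computed inline, the `WHERE` clause no longer needs to reference an alias, which resolves the error we encountered earlier.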


In [11]:
%%sql
SELECT 
    main.Country_name,
    main.Region,
    main.Est_gdp_in_billions,
    main.Pct_managed_drinking_water_services
FROM united_nations.Access_to_Basic_Services AS main
WHERE 
    main.Region = 'Sub-Saharan Africa'
    AND main.Time_period = 2020
    AND main.Est_gdp_in_billions < (
        SELECT AVG(sub.Est_gdp_in_billions)
        FROM united_nations.Access_to_Basic_Services AS sub
        WHERE sub.Region = main.Region
        AND sub.Time_period = 2020
    )
    AND main.Pct_managed_drinking_water_services < 60;


 * mysql+pymysql://root:***@localhost:3306/united_nations
6 rows affected.


Country_name,Region,Est_gdp_in_billions,Pct_managed_drinking_water_services
Madagascar,Sub-Saharan Africa,13.05,56.33
Somalia,Sub-Saharan Africa,6.88,57.33
Central African Republic,Sub-Saharan Africa,2.33,38.33
Chad,Sub-Saharan Africa,10.72,52.67
Burkina Faso,Sub-Saharan Africa,17.93,53.33
Niger,Sub-Saharan Africa,13.74,57.33
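
The correlated subquery above is one way around the alias error. Another option, sketched here, is to keep the window function from Task 1 inside a derived table (a subquery in `FROM`) and filter on its alias in the outer query. This is an illustration under assumptions, not the notebook's own solution: it uses an in-memory SQLite database, a hypothetical `services` table, and made-up rows rather than the UN data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE services (
        Country_name TEXT,
        Region TEXT,
        Time_period INTEGER,
        Est_gdp_in_billions REAL,
        Pct_managed_drinking_water_services REAL
    )
    """
)
# Made-up illustrative rows, NOT the UN data.
conn.executemany(
    "INSERT INTO services VALUES (?, ?, ?, ?, ?)",
    [
        ("A", "North", 2020, 10.0, 50.0),
        ("B", "North", 2020, 30.0, 90.0),
        ("C", "North", 2020, 20.0, 55.0),
        ("D", "South", 2020, 5.0, 40.0),
        ("E", "South", 2020, 15.0, 45.0),
    ],
)

# The inner query computes the alias; the outer WHERE can then see it,
# because the derived table is fully evaluated before the outer filter runs.
filtered = conn.execute(
    """
    SELECT Country_name, Est_gdp_in_billions, Avg_gdp_for_region
    FROM (
        SELECT
            Country_name,
            Region,
            Est_gdp_in_billions,
            Pct_managed_drinking_water_services,
            AVG(Est_gdp_in_billions) OVER (PARTITION BY Region) AS Avg_gdp_for_region
        FROM services
        WHERE Time_period = 2020
    ) AS with_avg
    WHERE Region = 'North'
      AND Est_gdp_in_billions < Avg_gdp_for_region
      AND Pct_managed_drinking_water_services < 60
    """
).fetchall()

print(filtered)
```

Only country `A` passes both filters: `C` has low water access, but its GDP equals the regional average rather than falling below it.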


###  Task 4: Implement the Solution Using a Common Table Expression (CTE)

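Now, we’ll use a **CTE** to achieve the same result in a way that is often easier to read.  
CTEs let us name a piece of logic once and reference it later in the same statement, which helps organise complex SQL queries.

Here, we first define a named temporary result set (`RegionalGDP`) that holds one row per region with that region's **average GDP** for 2020. It exists only for the duration of the statement and is not stored as a table. The main query then joins each country to its region's row.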


In [14]:
%%sql
WITH RegionalGDP AS (
    SELECT 
        Region,
        AVG(Est_gdp_in_billions) AS Avg_Regional_GDP
    FROM united_nations.Access_to_Basic_Services
    WHERE Time_period = 2020
    GROUP BY Region
)
SELECT 
    main.Country_name,
    main.Region,
    main.Est_gdp_in_billions,
    main.Pct_managed_drinking_water_services,
    regional.Avg_Regional_GDP
FROM united_nations.Access_to_Basic_Services AS main
JOIN RegionalGDP AS regional
    ON main.Region = regional.Region
WHERE 
    main.Region = 'Sub-Saharan Africa'
    AND main.Time_period = 2020
    AND main.Est_gdp_in_billions < regional.Avg_Regional_GDP
    AND main.Pct_managed_drinking_water_services < 60;


 * mysql+pymysql://root:***@localhost:3306/united_nations
6 rows affected.


Country_name,Region,Est_gdp_in_billions,Pct_managed_drinking_water_services,Avg_Regional_GDP
Madagascar,Sub-Saharan Africa,13.05,56.33,39.041316
Somalia,Sub-Saharan Africa,6.88,57.33,39.041316
Central African Republic,Sub-Saharan Africa,2.33,38.33,39.041316
Chad,Sub-Saharan Africa,10.72,52.67,39.041316
Burkina Faso,Sub-Saharan Africa,17.93,53.33,39.041316
Niger,Sub-Saharan Africa,13.74,57.33,39.041316
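
To make the shape of the CTE concrete, the sketch below first selects from the CTE on its own and then joins it back to the country rows. It assumes an in-memory SQLite database and a hypothetical `services` table with made-up rows, not the UN data; the CTE and column names mirror the query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE services ("
    "Country_name TEXT, Region TEXT, Est_gdp_in_billions REAL)"
)
# Made-up illustrative rows, NOT the UN data.
conn.executemany(
    "INSERT INTO services VALUES (?, ?, ?)",
    [
        ("A", "North", 10.0),
        ("B", "North", 30.0),
        ("C", "South", 5.0),
        ("D", "South", 15.0),
    ],
)

# Step 1: what the CTE holds by itself, one row per region.
cte_rows = conn.execute(
    """
    WITH RegionalGDP AS (
        SELECT Region, AVG(Est_gdp_in_billions) AS Avg_Regional_GDP
        FROM services
        GROUP BY Region
    )
    SELECT Region, Avg_Regional_GDP
    FROM RegionalGDP
    ORDER BY Region
    """
).fetchall()

# Step 2: joining the CTE attaches each region's average to its countries,
# so the WHERE clause can compare the two ordinary columns.
below_avg = conn.execute(
    """
    WITH RegionalGDP AS (
        SELECT Region, AVG(Est_gdp_in_billions) AS Avg_Regional_GDP
        FROM services
        GROUP BY Region
    )
    SELECT main.Country_name, main.Est_gdp_in_billions, regional.Avg_Regional_GDP
    FROM services AS main
    JOIN RegionalGDP AS regional
        ON main.Region = regional.Region
    WHERE main.Est_gdp_in_billions < regional.Avg_Regional_GDP
    ORDER BY main.Country_name
    """
).fetchall()

print(cte_rows)
print(below_avg)
```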


##  Summary: Subqueries vs. Common Table Expressions (CTEs)

In these exercises, we saw how **subqueries** and **Common Table Expressions (CTEs)** break a complex query into manageable parts. This improves readability, and depending on the database engine and the query plan it may also help performance, although that is not guaranteed.

To identify **Sub-Saharan African countries** with underdeveloped economies and limited access to drinking water services, we first used a correlated **subquery** in the `WHERE` clause to compare each country's GDP with its regional average, alongside the water-access filter. We then used a **CTE** to calculate the regional average GDP separately, as a named block that the final query joins to.

Comparing the two implementations shows how **CTEs** can simplify complex queries: a result is named once and can be referenced several times within the same statement, without saving intermediary tables or repeating the same subquery.

Used well, **CTEs** and **subqueries** make our SQL more **readable** and **organised**, and easier to tune for **efficiency**.

>  **Remember:** The choice between subqueries and CTEs often depends on the specific requirements of your task and the complexity of your SQL queries. It’s always a good idea to test different implementations and choose the one that best meets your needs.
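
One simple way to test different implementations is to run them on the same data and check that they return the same rows before comparing speed. The sketch below does this for a correlated subquery and a CTE. It assumes an in-memory SQLite database, a hypothetical `services` table, and made-up rows (not the UN data); on a real MySQL database you would also inspect the query plans with `EXPLAIN`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE services (
        Country_name TEXT,
        Region TEXT,
        Time_period INTEGER,
        Est_gdp_in_billions REAL,
        Pct_managed_drinking_water_services REAL
    )
    """
)
# Made-up illustrative rows, NOT the UN data. The 2019 row checks that
# both queries restrict the average and the output to 2020.
conn.executemany(
    "INSERT INTO services VALUES (?, ?, ?, ?, ?)",
    [
        ("A", "North", 2019, 1.0, 50.0),
        ("A", "North", 2020, 10.0, 50.0),
        ("B", "North", 2020, 30.0, 90.0),
        ("C", "North", 2020, 20.0, 55.0),
        ("D", "South", 2020, 5.0, 40.0),
    ],
)

SUBQUERY_SQL = """
    SELECT main.Country_name, main.Est_gdp_in_billions
    FROM services AS main
    WHERE main.Region = 'North'
      AND main.Time_period = 2020
      AND main.Est_gdp_in_billions < (
          SELECT AVG(sub.Est_gdp_in_billions)
          FROM services AS sub
          WHERE sub.Region = main.Region
            AND sub.Time_period = 2020
      )
      AND main.Pct_managed_drinking_water_services < 60
    ORDER BY main.Country_name
"""

CTE_SQL = """
    WITH RegionalGDP AS (
        SELECT Region, AVG(Est_gdp_in_billions) AS Avg_Regional_GDP
        FROM services
        WHERE Time_period = 2020
        GROUP BY Region
    )
    SELECT main.Country_name, main.Est_gdp_in_billions
    FROM services AS main
    JOIN RegionalGDP AS regional
        ON main.Region = regional.Region
    WHERE main.Region = 'North'
      AND main.Time_period = 2020
      AND main.Est_gdp_in_billions < regional.Avg_Regional_GDP
      AND main.Pct_managed_drinking_water_services < 60
    ORDER BY main.Country_name
"""

subquery_rows = conn.execute(SUBQUERY_SQL).fetchall()
cte_rows = conn.execute(CTE_SQL).fetchall()

# Both formulations should agree before any performance comparison is made.
print(subquery_rows == cte_rows, subquery_rows)
```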
