# Subquery in the SELECT clause

In this notebook, we look at subqueries, which are powerful tools for more in-depth analysis in SQL. A subquery is essentially an intermediate result set that we access with another query; in other words, **a query inside another query**. We can use subqueries in various places in a query, and their results can take various forms. Here, we look at the **use of a subquery in the `SELECT` clause.**

> ⚠️ This notebook will not run on Google Colab because it cannot connect to a local database. Please make sure that this notebook is running on the same local machine as your MySQL Workbench installation and MySQL `united_nations` database.

## Learning objectives

In this train, we will learn:
- How to use **subqueries** instead of static values to make calculations dynamic.
- How to turn a normal subquery into a **correlated subquery** to perform calculations based on specific criteria.

## Overview

Imagine we want to calculate the share of a sub-region's total land area that each country in that sub-region occupies. That is a challenging question because we need to divide each country's land area by the sum of the land areas of all countries in that sub-region. 
To do it, we can use a subquery. For now, let's look at just one sub-region, `Middle Africa`, and use the `Geographic_location` table to **find each country's land area as a percentage of the total for 'Middle Africa'.** 

## Connecting to our MySQL database

We will use our `Geographic_location` table in our `united_nations` database that we created in MySQL Workbench. We can apply the same queries we used in MySQL Workbench in this notebook if we connect to our MySQL server by running the cells below.


In [1]:
# Load and activate the SQL extension to allow us to execute SQL in a Jupyter notebook. 
# If you get an error here, make sure that mysql and pymysql are installed correctly. 

%load_ext sql

In [2]:
# Establish a connection to the local database using the '%sql' magic command.
# Replace 'password' with our connection password and 'united_nations' with our database name. 
# If you get an error here, please make sure the database name or password is correct.

%sql mysql+pymysql://root:password@localhost:3306/united_nations

'Connected: root@united_nations'

## Exercise

### 1. Calculate the total land area of the 'Middle Africa' sub-region

Write a query that will find the `Land_area` sum for the `Middle Africa` sub-region. Call this column `total_land_area`.

In [3]:
%%sql

SELECT
    SUM(Land_area) as total_land_area
FROM 
    Geographic_location 
WHERE 
    Sub_region = 'Middle Africa';

 * mysql+pymysql://root:***@localhost:3306/united_nations
1 rows affected.


total_land_area
3888270.0


This query returns a single value: the sum of `Land_area` over all rows in the `Middle Africa` sub-region.

### 2. Calculate land area percentages for the Middle African countries using a static value

Recall that, to find the percentage of land area each country in the `Middle Africa` sub-region occupies, we need to **divide each country’s land area by the `total_land_area`**, which we have calculated above. Copy and paste this land area value into a new query and calculate the percentages. Call this calculated column `pct_regional_land`.

To calculate the percentages, we select each country's name and divide its land area by the total area for Middle Africa, multiply the result by 100 to get a percentage, and then round it off. 

**Note:** In Exercise 1, we calculated a `total_land_area` of **"3888270.00"**, which we'll use in the query below to calculate the percentage for each country.

In [4]:
%%sql

SELECT 
    Country_name, 
    ROUND(Land_area/3888270.00*100) AS  pct_regional_land
FROM 
    Geographic_location 
WHERE 
    Sub_region = 'Middle Africa';

 * mysql+pymysql://root:***@localhost:3306/united_nations
9 rows affected.


Country_name,pct_regional_land
Angola,32.0
Cameroon,12.0
Central African Republic,16.0
Chad,32.0
Congo,
Democratic Republic of the Congo,
Equatorial Guinea,1.0
Gabon,7.0
Sao Tome and Principe,0.0
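
The two-step static approach above can be sketched outside MySQL as well. The example below is a minimal illustration, assuming Python's built-in `sqlite3` module as a stand-in for the MySQL connection and a toy `Geographic_location` table whose country names, sub-region and land areas are invented. One row has a missing land area to show how a `NULL` carries through the division. The blank percentages in the output above would be consistent with missing `Land_area` values, although the notebook does not confirm this.

```python
import sqlite3

# Toy stand-in for Geographic_location; every name and number here is invented.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Geographic_location "
    "(Country_name TEXT, Sub_region TEXT, Land_area REAL)"
)
conn.executemany(
    "INSERT INTO Geographic_location VALUES (?, ?, ?)",
    [
        ("Country A", "Region X", 100.0),
        ("Country B", "Region X", 300.0),
        ("Country C", "Region X", None),  # missing land area (NULL)
    ],
)

# Step 1: total land area for the sub-region. SUM skips NULL values.
total = conn.execute(
    "SELECT SUM(Land_area) FROM Geographic_location "
    "WHERE Sub_region = 'Region X'"
).fetchone()[0]

# Step 2: reuse that total as a fixed number, as in the static query.
rows = conn.execute(
    "SELECT Country_name, ROUND(Land_area / ? * 100) "
    "FROM Geographic_location "
    "WHERE Sub_region = 'Region X' "
    "ORDER BY Country_name",
    (total,),
).fetchall()

print(total)
print(rows)  # the row with a NULL land area gets a NULL (None) percentage
```

The weakness of this approach is visible in step 2: the total is frozen at whatever value step 1 returned, so it has to be recomputed by hand whenever the data or the sub-region changes.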


### 3. Calculate land area percentages for the Middle African countries using a subquery

Instead of using the static value above, let's improve our query by using a subquery to achieve the same result.

Hint: The subquery in this case will be the query we created to find the total land area in Exercise 1.

In [5]:
%%sql
SELECT 
    Country_name, 
    ROUND(Land_area / (
                        SELECT 
                            SUM(Land_area)
                        FROM 
                            Geographic_location 
                        WHERE
                            Sub_region = 'Middle Africa') * 100, 2) AS Pct_regional_land
FROM 
    Geographic_location 
WHERE 
    Sub_region = 'Middle Africa';

 * mysql+pymysql://root:***@localhost:3306/united_nations
9 rows affected.


Country_name,Pct_regional_land
Angola,32.06
Cameroon,12.16
Central African Republic,16.02
Chad,32.38
Congo,
Democratic Republic of the Congo,
Equatorial Guinea,0.72
Gabon,6.63
Sao Tome and Principe,0.02


When we execute this, the inner query runs first and calculates the sum value, then “passes” that to the outer query to get the percentages. A subquery that passes a single value to the outer query is known as a **scalar subquery.** 

The benefit of using the subquery instead of the actual value is that if we want to do the same for another sub-region, we only need to change the search string, for example from "Middle Africa" to "Polynesia", in both `WHERE` clauses. The query then performs the same calculation using the Polynesian data. These types of calculations are known as **dynamic**: by changing the filters, we automatically calculate the corresponding sum.
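
To illustrate what "dynamic" means here, the sketch below binds the sub-region name as a parameter so that the same scalar subquery serves any sub-region. It is a minimal example under stated assumptions: Python's built-in `sqlite3` module stands in for the MySQL connection, the table contents are invented, and `regional_shares` is a hypothetical helper name.

```python
import sqlite3

# Toy stand-in for Geographic_location; every name and number here is invented.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Geographic_location "
    "(Country_name TEXT, Sub_region TEXT, Land_area REAL)"
)
conn.executemany(
    "INSERT INTO Geographic_location VALUES (?, ?, ?)",
    [
        ("Country A", "Region X", 100.0),
        ("Country B", "Region X", 300.0),
        ("Country C", "Region Y", 50.0),
        ("Country D", "Region Y", 150.0),
    ],
)

# The same named parameter feeds both the scalar subquery and the outer filter.
QUERY = """
SELECT
    Country_name,
    ROUND(Land_area / (
            SELECT SUM(Land_area)
            FROM Geographic_location
            WHERE Sub_region = :sub_region) * 100, 2) AS pct_regional_land
FROM Geographic_location
WHERE Sub_region = :sub_region
ORDER BY Country_name
"""


def regional_shares(sub_region):
    """Return (country, percentage) rows for one sub-region (hypothetical helper)."""
    return conn.execute(QUERY, {"sub_region": sub_region}).fetchall()


print(regional_shares("Region X"))
print(regional_shares("Region Y"))
```

No total is hard-coded: switching sub-region only changes the bound value, and the subquery recomputes the matching sum.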

### Correlated subqueries

A normal (uncorrelated) nested subquery **runs first and executes once**, returning value(s) to be used by the outer query. This is what we have above: our subquery calculates the land area sum for the `Middle Africa` sub-region, and that value is returned to the outer query to calculate the land area percentages.

If we want to calculate the land area percentages for all the sub-regions, we would need to manually edit the search string in the `WHERE` clauses for each one, which is not very practical.

Instead, we can use a **correlated subquery**. This is a type of subquery that is **executed once for every row processed by the outer query**. It is often used when you need to perform a calculation based on values in the current row of the outer query. In our case, we want the subquery to calculate the land area sum based on the sub-region value of the current row being processed by the outer query. 

Below is a general syntax template where a correlated subquery has been used: 

```
SELECT
    outer_column1, 
    outer_column2
FROM
    outer_table AS outer_alias
WHERE 
    expression operator (
                        SELECT 
                            aggregate_function(inner_column)
                        FROM 
                            inner_table AS inner_alias
                        WHERE 
                            inner_column = outer_alias.outer_column2
                        )
;
```
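
As one possible instance of this template, the sketch below keeps only the countries whose land area is above the average of their own sub-region. It is an illustration, not part of the exercises: Python's built-in `sqlite3` module stands in for the MySQL connection, the table contents are invented, and the alias `i` for the inner table is an arbitrary choice.

```python
import sqlite3

# Toy stand-in for Geographic_location; every name and number here is invented.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Geographic_location "
    "(Country_name TEXT, Sub_region TEXT, Land_area REAL)"
)
conn.executemany(
    "INSERT INTO Geographic_location VALUES (?, ?, ?)",
    [
        ("Country A", "Region X", 100.0),
        ("Country B", "Region X", 300.0),
        ("Country C", "Region Y", 50.0),
        ("Country D", "Region Y", 150.0),
    ],
)

# The inner WHERE refers to g.Sub_region, a column of the outer query's
# current row. That reference is what makes the subquery correlated.
above_average = conn.execute(
    """
    SELECT
        g.Country_name,
        g.Sub_region
    FROM
        Geographic_location AS g
    WHERE
        g.Land_area > (
            SELECT AVG(i.Land_area)
            FROM Geographic_location AS i
            WHERE i.Sub_region = g.Sub_region
        )
    ORDER BY g.Country_name
    """
).fetchall()

print(above_average)
```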

### 4. Calculate country land area percentages for all the sub-regions using a correlated subquery

Transform the subquery in Exercise 3 into a correlated subquery that will calculate land area percentages for all the sub-regions.

Hint: Use the general syntax above to help you figure out how to achieve this.


We replace **“Middle Africa”** with a reference to the field that will change for each row, that is, the `Sub_region`. In order for this to work, we have to give the table in our main query an alias, **`g`**, so that when we refer to it in the subquery, SQL knows we are talking about the outer query’s table.

In [3]:
%%sql
SELECT 
    Country_name, 
    ROUND(Land_area / (
                        SELECT 
                            SUM(Land_area)
                        FROM 
                            Geographic_location 
                        WHERE
                            Sub_region = g.Sub_region) * 100, 2) AS Pct_regional_land
FROM 
    Geographic_location AS g
LIMIT 20;

 * mysql+pymysql://root:***@localhost:3306/united_nations
20 rows affected.


Country_name,Pct_regional_land
Afghanistan,13.67
Algeria,36.03
American Samoa,2.77
Angola,32.06
Anguilla,
Antigua and Barbuda,0.21
Argentina,17.77
Armenia,0.82
Aruba,0.09
Australia,96.69


As SQL looks at the first row in **`g`**, the sub-region of that row is available to the subquery as **`g.Sub_region`**. 

So for the first row, the sub-region will be `Central and Southern Asia`. The inner query will then execute, filtering for the Central and Southern Asia rows, and calculate the sum of the land area in Central and Southern Asia. The main query then uses that value to calculate the percentage. SQL then moves to the second row. This time, the sub-region is `Northern Africa and Western Asia`, so the subquery runs again with that value and passes the new sum back to the main query, and so on. 
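
The row-by-row description above can be mimicked with an explicit loop. The sketch below is a conceptual illustration only, assuming Python's built-in `sqlite3` module and an invented toy table. It runs the inner query once per outer row by hand and checks that the result matches the single correlated query. A real database engine is free to execute the query differently internally.

```python
import sqlite3

# Toy stand-in for Geographic_location; every name and number here is invented.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Geographic_location "
    "(Country_name TEXT, Sub_region TEXT, Land_area REAL)"
)
conn.executemany(
    "INSERT INTO Geographic_location VALUES (?, ?, ?)",
    [
        ("Country A", "Region X", 100.0),
        ("Country B", "Region X", 300.0),
        ("Country C", "Region Y", 50.0),
        ("Country D", "Region Y", 150.0),
    ],
)

# Manual emulation: for each outer row, run the inner query with that
# row's sub-region, then compute the percentage in Python.
outer_rows = conn.execute(
    "SELECT Country_name, Sub_region, Land_area "
    "FROM Geographic_location ORDER BY Country_name"
).fetchall()

manual = []
for country, sub_region, land_area in outer_rows:
    regional_total = conn.execute(
        "SELECT SUM(Land_area) FROM Geographic_location WHERE Sub_region = ?",
        (sub_region,),
    ).fetchone()[0]
    manual.append((country, round(land_area / regional_total * 100, 2)))

# The same calculation expressed as one correlated subquery.
correlated = conn.execute(
    """
    SELECT
        Country_name,
        ROUND(Land_area / (
                SELECT SUM(Land_area)
                FROM Geographic_location
                WHERE Sub_region = g.Sub_region) * 100, 2)
    FROM Geographic_location AS g
    ORDER BY Country_name
    """
).fetchall()

print(manual == correlated)
```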

The downside of using correlated subqueries is that they can be quite inefficient. This is because the subquery may be evaluated once for each row processed by the outer query. For example, once we have calculated the total area for Middle Africa for one country's row, the same calculation may be repeated for every other row that belongs to Middle Africa.
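
One common way to avoid the repeated work is to aggregate once per sub-region and join the totals back to the rows. This technique is not used in this notebook, so the sketch below is only an illustration: it assumes Python's built-in `sqlite3` module, an invented toy table, and the alias `t` for the derived table, and it checks that both forms return the same percentages.

```python
import sqlite3

# Toy stand-in for Geographic_location; every name and number here is invented.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Geographic_location "
    "(Country_name TEXT, Sub_region TEXT, Land_area REAL)"
)
conn.executemany(
    "INSERT INTO Geographic_location VALUES (?, ?, ?)",
    [
        ("Country A", "Region X", 100.0),
        ("Country B", "Region X", 300.0),
        ("Country C", "Region Y", 50.0),
        ("Country D", "Region Y", 150.0),
    ],
)

# Correlated form: the inner SUM is conceptually evaluated for every outer row.
correlated = conn.execute(
    """
    SELECT
        Country_name,
        ROUND(Land_area / (
                SELECT SUM(Land_area)
                FROM Geographic_location
                WHERE Sub_region = g.Sub_region) * 100, 2)
    FROM Geographic_location AS g
    ORDER BY Country_name
    """
).fetchall()

# Alternative form: compute each sub-region's total once in a derived
# table, then join it to the country rows.
joined = conn.execute(
    """
    SELECT
        g.Country_name,
        ROUND(g.Land_area / t.total_land_area * 100, 2)
    FROM Geographic_location AS g
    JOIN (
        SELECT Sub_region, SUM(Land_area) AS total_land_area
        FROM Geographic_location
        GROUP BY Sub_region
    ) AS t ON t.Sub_region = g.Sub_region
    ORDER BY g.Country_name
    """
).fetchall()

print(correlated == joined)
```

In this toy table the correlated form implies four sums (one per row) while the derived table needs only two (one per sub-region). Whether that difference matters in practice depends on the table size and on how the database optimizes the query.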