## Filtering and Analyzing a Summary Statistic Report

### Learning Objectives
By the end of this training, you should be able to:
- Understand how to **analyze a summary statistic report**.
- Use the SQL **`WHERE`** and **`HAVING`** clauses to filter summarized data effectively.


In [1]:
%load_ext sql

In [3]:
%%sql

SELECT
    *
FROM
    Access_to_Basic_Services
LIMIT 5;

 * mysql+pymysql://root:***@localhost/united_nations
5 rows affected.


Region,Sub_region,Country_name,Time_period,Pct_managed_drinking_water_services,Pct_managed_sanitation_services,Est_population_in_millions,Est_gdp_in_billions,Land_area,Pct_unemployment
Central and Southern Asia,Central Asia,Kazakhstan,2015,94.67,98.0,17.542806,184.39,2699700.0,4.93
Central and Southern Asia,Central Asia,Kazakhstan,2016,94.67,98.0,17.794055,137.28,2699700.0,4.96
Central and Southern Asia,Central Asia,Kazakhstan,2017,95.0,98.0,18.037776,166.81,2699700.0,4.9
Central and Southern Asia,Central Asia,Kazakhstan,2018,95.0,98.0,18.276452,179.34,2699700.0,4.85
Central and Southern Asia,Central Asia,Kazakhstan,2019,95.0,98.0,18.513673,181.67,2699700.0,4.8


### Exercise

We started by finding the **minimum**, **maximum**, and **average** percentage of people who have access to **managed drinking water services**,  
along with the **number of country records** (one row per country per year) and the **total GDP** per `Region` and `Sub_region`.

In this exercise, we will **order the summarized data** by the estimated GDP to identify which regions or sub-regions have the highest and lowest total GDP values.

**Goal:**  
- Reinforce understanding of SQL aggregate functions (`MIN()`, `MAX()`, `AVG()`, `COUNT()`, `SUM()`).
- Learn how to use the `ORDER BY` clause to sort grouped results meaningfully.


In [4]:
%%sql

SELECT 
    Region,
    Sub_region,
    MIN(Pct_managed_drinking_water_services) AS Min_Access,
    MAX(Pct_managed_drinking_water_services) AS Max_Access,
    AVG(Pct_managed_drinking_water_services) AS Avg_Access,
    COUNT(Country_name) AS Country_Count,
    SUM(Est_gdp_in_billions) AS Total_GDP
FROM access_to_basic_services
GROUP BY Region, Sub_region
ORDER BY Total_GDP DESC;


 * mysql+pymysql://root:***@localhost/united_nations
18 rows affected.


Region,Sub_region,Min_Access,Max_Access,Avg_Access,Country_Count,Total_GDP
Eastern and South-Eastern Asia,Eastern Asia,75.67,100.0,92.699667,30,107123.37
Latin America and the Caribbean,South America,86.0,100.0,94.880952,84,19959.58
Central and Southern Asia,Southern Asia,67.0,99.67,91.894074,54,19824.66
Eastern and South-Eastern Asia,South-Eastern Asia,73.33,100.0,90.626061,66,15563.18
Northern Africa and Western Asia,Western Asia,59.0,100.0,95.031204,108,13605.83
Europe and Northern America,Northern America,91.0,100.0,97.911333,30,9905.96
Oceania,Australia and New Zealand,100.0,100.0,100.0,12,9241.73
Latin America and the Caribbean,Central America,79.0,100.0,93.798125,48,8524.66
Sub-Saharan Africa,Western Africa,53.33,99.0,72.365686,102,3621.31
Northern Africa and Western Asia,Northern Africa,61.33,100.0,88.906111,36,2736.8
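
A note on reading `Country_Count` in the report above: `COUNT(Country_name)` counts rows, and the table holds one row per country per year, so Eastern Asia shows 30 here but 5 once the data is restricted to 2020. The sketch below illustrates the difference between `COUNT(col)` and `COUNT(DISTINCT col)` using the five Kazakhstan rows from the preview. It runs against a throwaway in-memory SQLite table named `demo` (an assumption made for illustration, not the course's MySQL database).

```python
import sqlite3

# (Country_name, Time_period) pairs copied from the Kazakhstan preview rows
rows = [
    ("Kazakhstan", 2015),
    ("Kazakhstan", 2016),
    ("Kazakhstan", 2017),
    ("Kazakhstan", 2018),
    ("Kazakhstan", 2019),
]

# Throwaway in-memory database; `demo` is a hypothetical table name
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (Country_name TEXT, Time_period INTEGER)")
conn.executemany("INSERT INTO demo VALUES (?, ?)", rows)

row_count, distinct_count = conn.execute(
    """
    SELECT
        COUNT(Country_name),           -- one per non-NULL row
        COUNT(DISTINCT Country_name)   -- one per unique country
    FROM demo
    """
).fetchone()

print(row_count, distinct_count)  # 5 1
conn.close()
```

If the intent is to count countries rather than country-year rows, `COUNT(DISTINCT Country_name)` is one option; filtering to a single year, as the next exercise does, is another.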


### Exercise 1 — Filtering for the Year 2020

**Goal:**  
Restrict the **summary statistic report** to data from the **year 2020**.

**What we will use:**  
The SQL `WHERE` clause to restrict the dataset before aggregation.

**Notes:**
- The column `Time_period` represents the year in the dataset.
- Filtering at this stage ensures that our summary statistics (minimum, maximum, average, total GDP, and country count) are calculated only for records from the year 2020.
- We will keep the same structure as our previous summary query but add a `WHERE` condition to include only `Time_period = 2020`.
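
To see why filtering before aggregation matters, here is a minimal sketch on hypothetical toy rows (the sub-region, country names, and values are made up, and the in-memory SQLite table `demo` stands in for the real MySQL table). The same grouped query is run with and without the year filter.

```python
import sqlite3

# Hypothetical toy rows: (Sub_region, Country_name, Time_period, Pct_access)
rows = [
    ("Sub A", "Country 1", 2019, 50.0),
    ("Sub A", "Country 1", 2020, 60.0),
    ("Sub A", "Country 2", 2019, 70.0),
    ("Sub A", "Country 2", 2020, 80.0),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE demo (Sub_region TEXT, Country_name TEXT, "
    "Time_period INTEGER, Pct_access REAL)"
)
conn.executemany("INSERT INTO demo VALUES (?, ?, ?, ?)", rows)

# No WHERE clause: every year contributes to the aggregates
all_years = conn.execute(
    """
    SELECT Sub_region, MIN(Pct_access), AVG(Pct_access), COUNT(Country_name)
    FROM demo
    GROUP BY Sub_region
    """
).fetchall()

# WHERE removes the 2019 rows before GROUP BY sees them
only_2020 = conn.execute(
    """
    SELECT Sub_region, MIN(Pct_access), AVG(Pct_access), COUNT(Country_name)
    FROM demo
    WHERE Time_period = 2020
    GROUP BY Sub_region
    """
).fetchall()

print(all_years)   # [('Sub A', 50.0, 65.0, 4)]
print(only_2020)   # [('Sub A', 60.0, 70.0, 2)]
conn.close()
```

With the filter in place, the minimum, average, and count all describe 2020 only, and the count now equals the number of countries in the group.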


In [5]:
%%sql

SELECT 
    Region,
    Sub_region,
    MIN(Pct_managed_drinking_water_services) AS Min_Access,
    MAX(Pct_managed_drinking_water_services) AS Max_Access,
    AVG(Pct_managed_drinking_water_services) AS Avg_Access,
    COUNT(Country_name) AS Country_Count,
    SUM(Est_gdp_in_billions) AS Total_GDP
FROM access_to_basic_services
WHERE Time_period = 2020
GROUP BY Region, Sub_region
ORDER BY Total_GDP DESC;


 * mysql+pymysql://root:***@localhost/united_nations
18 rows affected.


Region,Sub_region,Min_Access,Max_Access,Avg_Access,Country_Count,Total_GDP
Eastern and South-Eastern Asia,Eastern Asia,81.0,100.0,93.4,5,19741.09
Central and Southern Asia,Southern Asia,80.33,99.67,93.074444,9,3487.02
Latin America and the Caribbean,South America,90.33,100.0,95.667143,14,2755.5
Eastern and South-Eastern Asia,South-Eastern Asia,75.33,100.0,92.120909,11,2721.73
Northern Africa and Western Asia,Western Asia,63.0,100.0,95.593333,18,2194.63
Europe and Northern America,Northern America,91.0,100.0,98.0,5,1655.39
Oceania,Australia and New Zealand,100.0,100.0,100.0,2,1538.63
Latin America and the Caribbean,Central America,79.33,100.0,94.49875,8,1347.35
Sub-Saharan Africa,Western Africa,53.33,99.0,73.607059,17,631.91
Northern Africa and Western Asia,Northern Africa,62.33,100.0,90.053333,6,386.29


### Exercise 2 — Filtering Countries with Less than 60% Managed Drinking Water Services

**Goal:**  
Focus on countries where the **percentage of managed drinking water services** is **below 60%**.

**What we will use:**  
The SQL `WHERE` clause to limit our data to countries that have **less than 60% access** to managed drinking water services.

**Notes:**
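- This filter helps identify regions and sub-regions that may require urgent attention or investment. Because it is applied before aggregation, the statistics describe only the countries below the 60% threshold.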
- We’ll build on the previous query (which focused on the year 2020) by adding another condition using the `AND` operator.
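
One consequence of combining conditions with `AND` in `WHERE` is that a sub-region with no qualifying rows drops out of the result entirely, rather than appearing with a count of zero. The sketch below shows this on hypothetical toy rows in an in-memory SQLite table (`demo` and all values are assumptions for illustration).

```python
import sqlite3

# Hypothetical toy rows: (Sub_region, Country_name, Time_period, Pct_access)
rows = [
    ("Sub A", "Country 1", 2020, 45.0),  # passes both conditions
    ("Sub A", "Country 2", 2020, 90.0),  # right year, access too high
    ("Sub B", "Country 3", 2020, 95.0),  # right year, access too high
    ("Sub B", "Country 4", 2019, 55.0),  # low access, wrong year
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE demo (Sub_region TEXT, Country_name TEXT, "
    "Time_period INTEGER, Pct_access REAL)"
)
conn.executemany("INSERT INTO demo VALUES (?, ?, ?, ?)", rows)

# A row must satisfy BOTH conditions to reach GROUP BY
result = conn.execute(
    """
    SELECT Sub_region, MAX(Pct_access), COUNT(Country_name)
    FROM demo
    WHERE Time_period = 2020
      AND Pct_access < 60
    GROUP BY Sub_region
    ORDER BY Sub_region
    """
).fetchall()

print(result)  # [('Sub A', 45.0, 1)]
conn.close()
```

"Sub B" has no row that meets both conditions, so it does not appear at all. The maximum reported for "Sub A" is also capped below 60, since only rows under the threshold survive the filter.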


In [6]:
%%sql

SELECT 
    Region,
    Sub_region,
    MIN(Pct_managed_drinking_water_services) AS Min_Access,
    MAX(Pct_managed_drinking_water_services) AS Max_Access,
    AVG(Pct_managed_drinking_water_services) AS Avg_Access,
    COUNT(Country_name) AS Country_Count,
    SUM(Est_gdp_in_billions) AS Total_GDP
FROM access_to_basic_services
WHERE Time_period = 2020
  AND Pct_managed_drinking_water_services < 60
GROUP BY Region, Sub_region
ORDER BY Total_GDP DESC;


 * mysql+pymysql://root:***@localhost/united_nations
4 rows affected.


Region,Sub_region,Min_Access,Max_Access,Avg_Access,Country_Count,Total_GDP
Sub-Saharan Africa,Eastern Africa,48.33,58.0,54.9975,4,127.59
Sub-Saharan Africa,Middle Africa,38.33,52.67,47.75,4,66.67
Sub-Saharan Africa,Western Africa,53.33,57.33,55.33,2,31.67
Oceania,Melanesia,56.67,56.67,56.67,1,23.85


### Exercise 3 — Filtering Sub-Regions with Fewer Than Four Countries

**Goal:**  
Filter the results to include only **regions and sub-regions** in which **fewer than four countries** meet the conditions from Exercise 2 (year 2020, access below 60%).

**What we will use:**  
The SQL `HAVING` clause to filter the aggregated results after grouping.

**Notes:**
- The `WHERE` clause filters rows **before aggregation**, while the `HAVING` clause filters **after aggregation**.
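- The query repeats the `COUNT(Country_name)` expression in `HAVING`. MySQL also accepts the alias (`HAVING Country_Count < 4`), but standard SQL does not guarantee that select-list aliases are visible there.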
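
The difference between the two clauses can be sketched as follows, again on hypothetical toy rows in an in-memory SQLite table (`demo` and the country names are assumptions). Putting an aggregate in `WHERE` fails, because no groups exist yet when `WHERE` is evaluated, whereas the same condition in `HAVING` filters the finished groups.

```python
import sqlite3

# Hypothetical toy rows: (Sub_region, Country_name)
rows = [
    ("Sub A", "Country 1"),
    ("Sub A", "Country 2"),
    ("Sub A", "Country 3"),
    ("Sub A", "Country 4"),
    ("Sub B", "Country 5"),
    ("Sub B", "Country 6"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (Sub_region TEXT, Country_name TEXT)")
conn.executemany("INSERT INTO demo VALUES (?, ?)", rows)

# Aggregate in WHERE: rejected, since WHERE runs on individual rows
try:
    conn.execute(
        "SELECT Sub_region FROM demo "
        "WHERE COUNT(Country_name) < 4 GROUP BY Sub_region"
    )
    where_error = None
except sqlite3.OperationalError as exc:
    where_error = str(exc)

# Aggregate in HAVING: evaluated once per group, after GROUP BY
small_groups = conn.execute(
    """
    SELECT Sub_region, COUNT(Country_name) AS Country_Count
    FROM demo
    GROUP BY Sub_region
    HAVING COUNT(Country_name) < 4
    """
).fetchall()

print(where_error is not None)  # True
print(small_groups)             # [('Sub B', 2)]
conn.close()
```

The exact error text differs between database engines, so the sketch only checks that an error occurred. "Sub A" has four countries and is filtered out by `HAVING`; "Sub B" has two and is kept.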


In [7]:
%%sql

SELECT 
    Region,
    Sub_region,
    MIN(Pct_managed_drinking_water_services) AS Min_Access,
    MAX(Pct_managed_drinking_water_services) AS Max_Access,
    AVG(Pct_managed_drinking_water_services) AS Avg_Access,
    COUNT(Country_name) AS Country_Count,
    SUM(Est_gdp_in_billions) AS Total_GDP
FROM access_to_basic_services
WHERE Time_period = 2020
  AND Pct_managed_drinking_water_services < 60
GROUP BY Region, Sub_region
HAVING COUNT(Country_name) < 4
ORDER BY Total_GDP DESC;


 * mysql+pymysql://root:***@localhost/united_nations
2 rows affected.


Region,Sub_region,Min_Access,Max_Access,Avg_Access,Country_Count,Total_GDP
Sub-Saharan Africa,Western Africa,53.33,57.33,55.33,2,31.67
Oceania,Melanesia,56.67,56.67,56.67,1,23.85
