###  **Learning Objectives**
By the end of this training, you should be able to:
- Use the `GROUP BY` clause to analyze datasets at different levels of granularity.  
- Connect to a MySQL database and query data using SQL within Jupyter Notebook.  



###  **Connecting to the MySQL Database**

We’ll connect to the `united_nations` MySQL database, which contains the `access_to_basic_services` table.  
After loading the SQL extension with `%load_ext sql` and connecting, we can run queries directly from Jupyter Notebook using the `%%sql` cell magic (or `%sql` for a single line).




In [1]:
%load_ext sql

In [3]:
%%sql

SELECT
    *
FROM
    access_to_basic_services
LIMIT 5;

 * mysql+pymysql://root:***@localhost/united_nations
5 rows affected.


| Region | Sub_region | Country_name | Time_period | Pct_managed_drinking_water_services | Pct_managed_sanitation_services | Est_population_in_millions | Est_gdp_in_billions | Land_area | Pct_unemployment |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Central and Southern Asia | Central Asia | Kazakhstan | 2015 | 94.67 | 98.0 | 17.542806 | 184.39 | 2699700.0 | 4.93 |
| Central and Southern Asia | Central Asia | Kazakhstan | 2016 | 94.67 | 98.0 | 17.794055 | 137.28 | 2699700.0 | 4.96 |
| Central and Southern Asia | Central Asia | Kazakhstan | 2017 | 95.0 | 98.0 | 18.037776 | 166.81 | 2699700.0 | 4.9 |
| Central and Southern Asia | Central Asia | Kazakhstan | 2018 | 95.0 | 98.0 | 18.276452 | 179.34 | 2699700.0 | 4.85 |
| Central and Southern Asia | Central Asia | Kazakhstan | 2019 | 95.0 | 98.0 | 18.513673 | 181.67 | 2699700.0 | 4.8 |


###  Exercise 1 — Access to Managed Drinking Water Services (Region / Sub_region)

**Goal:**  
Calculate the **minimum**, **maximum**, and **average** percentage of people who have access to managed drinking water services, grouped by `Region` and `Sub_region`.

**What we will use:**  
SQL aggregate functions `MIN()`, `MAX()`, and `AVG()`, combined with a `GROUP BY` clause.

**Notes:**
- We’ll use the column `Pct_managed_drinking_water_services` to represent the percentage of people with access to managed drinking water services.
- Results will include clear aliases for readability (e.g., `Min_Access`, `Max_Access`, `Avg_Access`).
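- `AVG()` returns an unweighted mean over all rows in the group; wrap it in `ROUND()` or `CAST()` if you want to limit decimal places, for example `ROUND(AVG(Pct_managed_drinking_water_services), 2)`.
- `GROUP BY Region, Sub_region` produces one output row per distinct (`Region`, `Sub_region`) pair, so each aggregate covers every country and every year recorded for that sub-region.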
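
To make the idea of granularity concrete, here is a minimal sketch. It assumes an in-memory SQLite table as a stand-in for the MySQL database and uses a handful of invented rows (the region, sub-region, and country labels are hypothetical, not taken from the dataset). It compares `GROUP BY Region` with `GROUP BY Region, Sub_region` and wraps `AVG()` in `ROUND()`:

```python
import sqlite3

# In-memory SQLite database, used here only as a stand-in for MySQL.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE access_to_basic_services (
        Region TEXT,
        Sub_region TEXT,
        Country_name TEXT,
        Pct_managed_drinking_water_services REAL
    )
    """
)

# Invented rows for illustration; they are not real observations.
rows = [
    ("Region A", "Sub A1", "Country 1", 80.0),
    ("Region A", "Sub A1", "Country 2", 90.0),
    ("Region A", "Sub A2", "Country 3", 100.0),
    ("Region B", "Sub B1", "Country 4", 70.0),
]
conn.executemany("INSERT INTO access_to_basic_services VALUES (?, ?, ?, ?)", rows)

# Coarse granularity: one output row per region.
by_region = conn.execute(
    """
    SELECT Region,
           ROUND(AVG(Pct_managed_drinking_water_services), 2) AS Avg_Access
    FROM access_to_basic_services
    GROUP BY Region
    ORDER BY Region
    """
).fetchall()

# Finer granularity: one output row per (region, sub-region) pair.
by_sub_region = conn.execute(
    """
    SELECT Region,
           Sub_region,
           ROUND(AVG(Pct_managed_drinking_water_services), 2) AS Avg_Access
    FROM access_to_basic_services
    GROUP BY Region, Sub_region
    ORDER BY Region, Sub_region
    """
).fetchall()

print(by_region)      # [('Region A', 90.0), ('Region B', 70.0)]
print(by_sub_region)  # [('Region A', 'Sub A1', 85.0), ('Region A', 'Sub A2', 100.0), ('Region B', 'Sub B1', 70.0)]
```

Adding `Sub_region` to the `GROUP BY` splits the single "Region A" row into one row per sub-region. The same SQL should run in MySQL, though the numeric type returned by `AVG()` and `ROUND()` may differ from SQLite's.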

**Expected output columns:**
- `Region` — top-level geographic region  
- `Sub_region` — more specific geographic area within the region  
- `Min_Access` — minimum percentage of people with managed drinking water access in that group  
- `Max_Access` — maximum percentage of people with managed drinking water access in that group  
- `Avg_Access` — average percentage of people with managed drinking water access in that group  

**Example:**  
This query helps identify which regions and sub-regions have the highest, lowest, and typical levels of access to managed drinking water services.


In [6]:
%%sql
SELECT 
    Region,
    Sub_region,
    MIN(Pct_managed_drinking_water_services) AS Min_Access,
    MAX(Pct_managed_drinking_water_services) AS Max_Access,
    AVG(Pct_managed_drinking_water_services) AS Avg_Access
FROM access_to_basic_services
GROUP BY Region, Sub_region;


 * mysql+pymysql://root:***@localhost/united_nations
18 rows affected.


| Region | Sub_region | Min_Access | Max_Access | Avg_Access |
| --- | --- | --- | --- | --- |
| Central and Southern Asia | Central Asia | 80.33 | 100.0 | 93.144667 |
| Central and Southern Asia | Southern Asia | 67.0 | 99.67 | 91.894074 |
| Eastern and South-Eastern Asia | Eastern Asia | 75.67 | 100.0 | 92.699667 |
| Eastern and South-Eastern Asia | South-Eastern Asia | 73.33 | 100.0 | 90.626061 |
| Europe and Northern America | Northern America | 91.0 | 100.0 | 97.911333 |
| Latin America and the Caribbean | Caribbean | 64.0 | 100.0 | 96.005 |
| Latin America and the Caribbean | Central America | 79.0 | 100.0 | 93.798125 |
| Latin America and the Caribbean | South America | 86.0 | 100.0 | 94.880952 |
| Northern Africa and Western Asia | Northern Africa | 61.33 | 100.0 | 88.906111 |
| Northern Africa and Western Asia | Western Asia | 59.0 | 100.0 | 95.031204 |


### Exercise 2 — Number of Countries per Region and Sub_region

**Goal:**  
Determine the **number of countries** within each `Region` and `Sub_region`.

**What we will use:**  
The SQL aggregate function `COUNT()` in combination with the `GROUP BY` clause.

**Notes:**
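- We’ll apply `COUNT()` to the `Country_name` column. `COUNT(Country_name)` counts every non-NULL row, and the table holds one row per country per `Time_period`, so `Country_Count` here is the number of country-year records in each group, not the number of distinct countries.
- To count each country only once, use `COUNT(DISTINCT Country_name)` instead.
- The result will include an alias `Country_Count` for clarity.
- Grouping by both `Region` and `Sub_region` gives one count per sub-region within each region.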

**Expected output columns:**
- `Region` — top-level geographic region  
- `Sub_region` — more specific geographic area within the region  
- `Country_Count` — number of countries in that region and sub-region  

**Example:**  
This query helps us understand how countries are distributed across different regions and sub-regions.


In [7]:
%%sql

SELECT 
    Region,
    Sub_region,
    COUNT(Country_name) AS Country_Count
FROM access_to_basic_services
GROUP BY Region, Sub_region;


 * mysql+pymysql://root:***@localhost/united_nations
18 rows affected.


| Region | Sub_region | Country_Count |
| --- | --- | --- |
| Central and Southern Asia | Central Asia | 30 |
| Central and Southern Asia | Southern Asia | 54 |
| Eastern and South-Eastern Asia | Eastern Asia | 30 |
| Eastern and South-Eastern Asia | South-Eastern Asia | 66 |
| Europe and Northern America | Northern America | 30 |
| Latin America and the Caribbean | Caribbean | 128 |
| Latin America and the Caribbean | Central America | 48 |
| Latin America and the Caribbean | South America | 84 |
| Northern Africa and Western Asia | Northern Africa | 36 |
| Northern Africa and Western Asia | Western Asia | 108 |
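
`COUNT(Country_name)` counts non-NULL rows, so in a table with one row per country per year the figures above are row counts rather than counts of distinct countries. The sketch below illustrates the difference from `COUNT(DISTINCT Country_name)`. It assumes an in-memory SQLite table in place of MySQL, and the rows and the `Row_Count` alias are invented for illustration:

```python
import sqlite3

# In-memory SQLite database, used here only as a stand-in for MySQL.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE access_to_basic_services (
        Region TEXT,
        Sub_region TEXT,
        Country_name TEXT,
        Time_period INTEGER
    )
    """
)

# Invented rows: two countries observed in two years, one country in one year.
rows = [
    ("Region A", "Sub A1", "Country 1", 2015),
    ("Region A", "Sub A1", "Country 1", 2016),
    ("Region A", "Sub A1", "Country 2", 2015),
    ("Region A", "Sub A1", "Country 2", 2016),
    ("Region A", "Sub A2", "Country 3", 2015),
]
conn.executemany("INSERT INTO access_to_basic_services VALUES (?, ?, ?, ?)", rows)

counts = conn.execute(
    """
    SELECT Region,
           Sub_region,
           COUNT(Country_name) AS Row_Count,               -- one per country-year row
           COUNT(DISTINCT Country_name) AS Country_Count   -- each country counted once
    FROM access_to_basic_services
    GROUP BY Region, Sub_region
    ORDER BY Region, Sub_region
    """
).fetchall()

print(counts)  # [('Region A', 'Sub A1', 4, 2), ('Region A', 'Sub A2', 1, 1)]
```

In this toy data, "Sub A1" has four rows but only two distinct countries, which is the gap between the two counting styles.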


### Exercise 3 — Total GDP per Region and Sub_region

**Goal:**  
Determine the **total GDP** for each `Region` and `Sub_region` by summing all GDP values.

**What we will use:**  
The SQL aggregate function `SUM()` together with the `GROUP BY` clause.

**Notes:**
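- We’ll use the column `Est_gdp_in_billions`, which contains the estimated GDP values (in billions).
- Because the table holds one row per country per `Time_period`, `SUM()` adds up every country and every year in the group, so `Total_GDP` is a cumulative figure across all years in the data, not the GDP of a single year.
- `SUM()` skips `NULL` values, so any missing GDP estimates do not contribute to the total.
- The result will include an alias `Total_GDP` for better readability.
- Grouping by both `Region` and `Sub_region` provides insight into the economic scale of different geographic areas.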

**Expected output columns:**
- `Region` — top-level geographic region  
- `Sub_region` — more specific area within the region  
- `Total_GDP` — total estimated GDP (in billions) for that region and sub-region  

**Example:**  
This query helps us compare economic output across different regions and sub-regions to identify which areas have the largest or smallest total GDP.


In [8]:
%%sql

SELECT 
    Region,
    Sub_region,
    SUM(Est_gdp_in_billions) AS Total_GDP
FROM access_to_basic_services
GROUP BY Region, Sub_region;


 * mysql+pymysql://root:***@localhost/united_nations
18 rows affected.


| Region | Sub_region | Total_GDP |
| --- | --- | --- |
| Central and Southern Asia | Central Asia | 1670.32 |
| Central and Southern Asia | Southern Asia | 19824.66 |
| Eastern and South-Eastern Asia | Eastern Asia | 107123.37 |
| Eastern and South-Eastern Asia | South-Eastern Asia | 15563.18 |
| Europe and Northern America | Northern America | 9905.96 |
| Latin America and the Caribbean | Caribbean | 2070.17 |
| Latin America and the Caribbean | Central America | 8524.66 |
| Latin America and the Caribbean | South America | 19959.58 |
| Northern Africa and Western Asia | Northern Africa | 2736.8 |
| Northern Africa and Western Asia | Western Asia | 13605.83 |
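
Since each country appears once per `Time_period`, summing by `Region` and `Sub_region` alone pools all years together. One way to get a per-year total is to add `Time_period` to the `GROUP BY`. The sketch below shows the effect, assuming an in-memory SQLite table in place of MySQL and invented GDP figures (none of the values come from the dataset):

```python
import sqlite3

# In-memory SQLite database, used here only as a stand-in for MySQL.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE access_to_basic_services (
        Region TEXT,
        Sub_region TEXT,
        Country_name TEXT,
        Time_period INTEGER,
        Est_gdp_in_billions REAL
    )
    """
)

# Invented rows: two countries in one sub-region, observed in two years.
rows = [
    ("Region A", "Sub A1", "Country 1", 2015, 10.0),
    ("Region A", "Sub A1", "Country 1", 2016, 12.0),
    ("Region A", "Sub A1", "Country 2", 2015, 5.0),
    ("Region A", "Sub A1", "Country 2", 2016, 6.0),
]
conn.executemany("INSERT INTO access_to_basic_services VALUES (?, ?, ?, ?, ?)", rows)

# Pooled over all years: a single cumulative total per sub-region.
all_years = conn.execute(
    """
    SELECT Region, Sub_region, SUM(Est_gdp_in_billions) AS Total_GDP
    FROM access_to_basic_services
    GROUP BY Region, Sub_region
    """
).fetchall()

# Per year: adding Time_period to GROUP BY yields one total per sub-region per year.
per_year = conn.execute(
    """
    SELECT Region, Sub_region, Time_period, SUM(Est_gdp_in_billions) AS Total_GDP
    FROM access_to_basic_services
    GROUP BY Region, Sub_region, Time_period
    ORDER BY Region, Sub_region, Time_period
    """
).fetchall()

print(all_years)  # [('Region A', 'Sub A1', 33.0)]
print(per_year)   # [('Region A', 'Sub A1', 2015, 15.0), ('Region A', 'Sub A1', 2016, 18.0)]
```

The pooled total (33.0) is the sum of the two yearly totals (15.0 and 18.0), which is why a multi-year sum can look much larger than any single year's GDP.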


### Summary

We can also combine all of our previous queries into a **single query** to get one consolidated result that includes:
- The **minimum**, **maximum**, and **average** access to managed drinking water services,  
- The **number of countries**, and  
- The **total GDP**  

for each `Region` and `Sub_region`.

Combining these aggregations in one `SELECT` lets us compare several indicators side by side for each sub-region, giving a clearer overview of development indicators across regions.


In [9]:
%%sql

SELECT 
    Region,
    Sub_region,
    MIN(Pct_managed_drinking_water_services) AS Min_Access,
    MAX(Pct_managed_drinking_water_services) AS Max_Access,
    AVG(Pct_managed_drinking_water_services) AS Avg_Access,
    COUNT(Country_name) AS Country_Count,
    SUM(Est_gdp_in_billions) AS Total_GDP
FROM access_to_basic_services
GROUP BY Region, Sub_region;


 * mysql+pymysql://root:***@localhost/united_nations
18 rows affected.


| Region | Sub_region | Min_Access | Max_Access | Avg_Access | Country_Count | Total_GDP |
| --- | --- | --- | --- | --- | --- | --- |
| Central and Southern Asia | Central Asia | 80.33 | 100.0 | 93.144667 | 30 | 1670.32 |
| Central and Southern Asia | Southern Asia | 67.0 | 99.67 | 91.894074 | 54 | 19824.66 |
| Eastern and South-Eastern Asia | Eastern Asia | 75.67 | 100.0 | 92.699667 | 30 | 107123.37 |
| Eastern and South-Eastern Asia | South-Eastern Asia | 73.33 | 100.0 | 90.626061 | 66 | 15563.18 |
| Europe and Northern America | Northern America | 91.0 | 100.0 | 97.911333 | 30 | 9905.96 |
| Latin America and the Caribbean | Caribbean | 64.0 | 100.0 | 96.005 | 128 | 2070.17 |
| Latin America and the Caribbean | Central America | 79.0 | 100.0 | 93.798125 | 48 | 8524.66 |
| Latin America and the Caribbean | South America | 86.0 | 100.0 | 94.880952 | 84 | 19959.58 |
| Northern Africa and Western Asia | Northern Africa | 61.33 | 100.0 | 88.906111 | 36 | 2736.8 |
| Northern Africa and Western Asia | Western Asia | 59.0 | 100.0 | 95.031204 | 108 | 13605.83 |
