## Learning Objectives

In this training session, we will focus on expanding our SQL filtering and conditional logic skills.

By the end of this section, we will be able to:

- **Use `IS NULL` and `IS NOT NULL`** to identify records with missing or non-missing values.  
- **Use `IN` and `NOT IN`** to efficiently include or exclude specific categories or countries in a query.  
- **Explore potential correlations** between a country's **GDP** and the **availability of managed drinking water and sanitation services**, with a focus on **Sub-Saharan Africa**.

---

These skills are essential for assessing **data completeness**, recognizing **patterns**, and performing **data-driven analysis** on real-world datasets.


In [1]:
%load_ext sql

To run a query, start a cell with the `%%sql` cell magic, write the query on the lines that follow (a blank line after `%%sql` is optional), and run the cell.

In [3]:
%%sql

SELECT
    *
FROM
    access_to_basic_services
LIMIT 5;

 * mysql+pymysql://root:***@localhost:3306/united_nations
5 rows affected.


Region,Sub_region,Country_name,Time_period,Pct_managed_drinking_water_services,Pct_managed_sanitation_services,Est_population_in_millions,Est_gdp_in_billions,Land_area,Pct_unemployment
Central and Southern Asia,Central Asia,Kazakhstan,2015,94.67,98.0,17.542806,184.39,2699700.0,4.93
Central and Southern Asia,Central Asia,Kazakhstan,2016,94.67,98.0,17.794055,137.28,2699700.0,4.96
Central and Southern Asia,Central Asia,Kazakhstan,2017,95.0,98.0,18.037776,166.81,2699700.0,4.9
Central and Southern Asia,Central Asia,Kazakhstan,2018,95.0,98.0,18.276452,179.34,2699700.0,4.85
Central and Southern Asia,Central Asia,Kazakhstan,2019,95.0,98.0,18.513673,181.67,2699700.0,4.8


## Exercise: Exploring Correlation Between GDP and Access to Basic Services

In this exercise, we will continue working with the **`united_nations.access_to_basic_services`** table.  
Each row describes one country in one year: its percentages for **managed drinking water** and **sanitation services**, its **estimated GDP** (in billions), and its estimated population (in millions).

### Objective
We want to see whether **a country's GDP is correlated** with its **access to basic services**, focusing on countries in the **Sub-Saharan Africa** region.

---

### Task 1: Filter Data for Sub-Saharan Africa (2020)
We’ll begin by selecting all relevant data for:
- **Region:** Sub-Saharan Africa  
- **Year (Time_period):** 2020  

This will give us a focused dataset to analyze the relationship between economic output (GDP) and access to essential services.


In [4]:
%%sql
-- Task 1: Select data for Sub-Saharan Africa in the year 2020
-- Filtering the dataset to focus only on Sub-Saharan African countries for analysis

SELECT
    Region,
    Country_name,
    Time_period,
    Pct_managed_drinking_water_services,
    Pct_managed_sanitation_services,
    Est_gdp_in_billions,
    Est_population_in_millions
FROM access_to_basic_services
WHERE 
    Region = 'Sub-Saharan Africa'
    AND Time_period = 2020
ORDER BY Country_name ASC;


 * mysql+pymysql://root:***@localhost:3306/united_nations
47 rows affected.


Region,Country_name,Time_period,Pct_managed_drinking_water_services,Pct_managed_sanitation_services,Est_gdp_in_billions,Est_population_in_millions
Sub-Saharan Africa,Angola,2020,52.33,47.0,53.62,33.428486
Sub-Saharan Africa,Benin,2020,65.33,17.33,15.65,12.643123
Sub-Saharan Africa,Botswana,2020,89.67,74.33,14.93,2.546402
Sub-Saharan Africa,Burkina Faso,2020,53.33,25.0,17.93,21.522626
Sub-Saharan Africa,Burundi,2020,70.33,44.33,2.65,12.220227
Sub-Saharan Africa,Cabo Verde,2020,87.33,78.0,1.7,0.58264
Sub-Saharan Africa,Cameroon,2020,64.0,43.0,40.77,26.491087
Sub-Saharan Africa,Central African Republic,2020,38.33,15.0,2.33,5.34302
Sub-Saharan Africa,Chad,2020,52.67,18.67,10.72,16.644701
Sub-Saharan Africa,Congo,2020,69.0,17.67,,


### Task 2: Identifying NULL Values in GDP

Before analyzing correlations, it’s important to ensure our data is complete and reliable.

Some entries in the dataset may have **NULL (missing)** values in the **`Est_gdp_in_billions`** column.  
Countries with missing GDP data cannot contribute meaningfully to our analysis, as we need both GDP and service access values for comparison.

In this task, we will:
- Check if there are any **NULL values** in the GDP column.  
- Identify which countries have missing GDP data within the **Sub-Saharan Africa (2020)** subset.


In [5]:
%%sql
-- Task 2: Checking for NULL values in GDP data
-- Identifying Sub-Saharan African countries with missing GDP values in 2020

SELECT
    Region,
    Country_name,
    Time_period,
    Est_gdp_in_billions
FROM access_to_basic_services
WHERE 
    Region = 'Sub-Saharan Africa'
    AND Time_period = 2020
    AND Est_gdp_in_billions IS NULL;


 * mysql+pymysql://root:***@localhost:3306/united_nations
9 rows affected.


Region,Country_name,Time_period,Est_gdp_in_billions
Sub-Saharan Africa,Mayotte,2020,
Sub-Saharan Africa,Réunion,2020,
Sub-Saharan Africa,South Sudan,2020,
Sub-Saharan Africa,United Republic of Tanzania,2020,
Sub-Saharan Africa,Congo,2020,
Sub-Saharan Africa,Democratic Republic of the Congo,2020,
Sub-Saharan Africa,Côte d'Ivoire,2020,
Sub-Saharan Africa,Gambia,2020,
Sub-Saharan Africa,Saint Helena,2020,

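A common mistake when checking for missing values is to write `= NULL` instead of `IS NULL`. The sketch below illustrates the difference. It is not part of the original notebook: it assumes an in-memory SQLite database as a stand-in for the MySQL `united_nations` database, and it loads only three rows copied from the outputs above.

```python
import sqlite3

# Assumption: SQLite is used only so the example runs without a MySQL server.
# NULL comparison behaves the same way in both systems.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE access_to_basic_services ("
    "Country_name TEXT, Est_gdp_in_billions REAL)"
)
# Three rows taken from the query outputs above; None is stored as NULL.
conn.executemany(
    "INSERT INTO access_to_basic_services VALUES (?, ?)",
    [("Angola", 53.62), ("Congo", None), ("Gambia", None)],
)

# "= NULL" never matches: the comparison evaluates to NULL (unknown),
# and WHERE keeps only rows for which the condition is true.
equals_null = conn.execute(
    "SELECT Country_name FROM access_to_basic_services "
    "WHERE Est_gdp_in_billions = NULL"
).fetchall()

# IS NULL is the dedicated test for missing values.
is_null = conn.execute(
    "SELECT Country_name FROM access_to_basic_services "
    "WHERE Est_gdp_in_billions IS NULL ORDER BY Country_name"
).fetchall()

print(equals_null)  # []
print(is_null)      # [('Congo',), ('Gambia',)]
```

The `= NULL` version runs without an error and simply returns nothing, which is why the mistake is easy to miss.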

### Task 3: Excluding NULL GDP Values

Now that we’ve checked for missing GDP values, we’ll refine our dataset to include **only countries with a recorded (non-NULL) GDP value**.  

By using the **`IS NOT NULL`** condition, we ensure that all selected rows have complete GDP information.  
This step helps us prepare a clean dataset for accurate correlation analysis between GDP and access to basic services.


In [6]:
%%sql
-- Task 3: Excluding countries with NULL GDP values
-- Creating a clean dataset for Sub-Saharan Africa (2020) with complete GDP records

SELECT
    Region,
    Country_name,
    Time_period,
    Pct_managed_drinking_water_services,
    Pct_managed_sanitation_services,
    Est_gdp_in_billions,
    Est_population_in_millions
FROM access_to_basic_services
WHERE 
    Region = 'Sub-Saharan Africa'
    AND Time_period = 2020
    AND Est_gdp_in_billions IS NOT NULL
ORDER BY Country_name ASC;


 * mysql+pymysql://root:***@localhost:3306/united_nations
38 rows affected.


Region,Country_name,Time_period,Pct_managed_drinking_water_services,Pct_managed_sanitation_services,Est_gdp_in_billions,Est_population_in_millions
Sub-Saharan Africa,Angola,2020,52.33,47.0,53.62,33.428486
Sub-Saharan Africa,Benin,2020,65.33,17.33,15.65,12.643123
Sub-Saharan Africa,Botswana,2020,89.67,74.33,14.93,2.546402
Sub-Saharan Africa,Burkina Faso,2020,53.33,25.0,17.93,21.522626
Sub-Saharan Africa,Burundi,2020,70.33,44.33,2.65,12.220227
Sub-Saharan Africa,Cabo Verde,2020,87.33,78.0,1.7,0.58264
Sub-Saharan Africa,Cameroon,2020,64.0,43.0,40.77,26.491087
Sub-Saharan Africa,Central African Republic,2020,38.33,15.0,2.33,5.34302
Sub-Saharan Africa,Chad,2020,52.67,18.67,10.72,16.644701
Sub-Saharan Africa,Djibouti,2020,69.0,56.0,3.18,1.090156

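The row counts reported so far are consistent with each other: 47 rows in total, 9 with a NULL GDP, and 38 with a GDP value (47 − 9 = 38). A quick way to run this kind of check in a single query is to compare `COUNT(*)`, which counts rows, with `COUNT(column)`, which counts only non-NULL values. The following is a minimal sketch under stated assumptions: it uses an in-memory SQLite table named `gdp_sample` (a hypothetical name) holding four rows copied from the outputs above, not the full dataset.

```python
import sqlite3

# Assumption: a small SQLite stand-in table, not the real MySQL table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE gdp_sample (Country_name TEXT, Est_gdp_in_billions REAL)"
)
conn.executemany(
    "INSERT INTO gdp_sample VALUES (?, ?)",
    [("Angola", 53.62), ("Benin", 15.65), ("Chad", 10.72), ("Congo", None)],
)

# COUNT(*) counts every row; COUNT(column) skips rows where the column is NULL.
total, with_gdp = conn.execute(
    "SELECT COUNT(*), COUNT(Est_gdp_in_billions) FROM gdp_sample"
).fetchone()
missing = total - with_gdp

print(total, with_gdp, missing)  # 4 3 1
```

Against the real table, the same two aggregates with the Task 1 `WHERE` clause should reproduce the 47 and 38 reported above.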

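### Task 4: Analyzing GDP and Access to Basic Services in the Top 5 Economies

To better understand potential correlations between **GDP** and **access to basic services**,  
we will focus on the **five largest economies in Sub-Saharan Africa** by estimated 2020 GDP. The ranking covers only countries with a recorded GDP value; countries whose GDP is NULL for 2020, such as the United Republic of Tanzania, cannot be ranked.

- Nigeria
- South Africa
- Ethiopia
- Kenya
- Ghana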

### Objective
Retrieve data for these five countries in the **year 2020**, ensuring only records with **non-null GDP values** are included.

This filtered dataset will help us compare how economic strength may influence access to managed drinking water and sanitation services.


In [7]:
%%sql
-- Task 4: Retrieve data for the top 5 economies in Sub-Saharan Africa (2020)
-- Filtering by specific countries and ensuring GDP values are not null

SELECT
    Region,
    Country_name,
    Time_period,
    Pct_managed_drinking_water_services,
    Pct_managed_sanitation_services,
    Est_gdp_in_billions,
    Est_population_in_millions
FROM access_to_basic_services
WHERE 
    Region = 'Sub-Saharan Africa'
    AND Time_period = 2020
    AND Country_name IN ('Nigeria', 'South Africa', 'Ethiopia', 'Kenya', 'Ghana')
    AND Est_gdp_in_billions IS NOT NULL
ORDER BY Est_gdp_in_billions DESC;


 * mysql+pymysql://root:***@localhost:3306/united_nations
5 rows affected.


Region,Country_name,Time_period,Pct_managed_drinking_water_services,Pct_managed_sanitation_services,Est_gdp_in_billions,Est_population_in_millions
Sub-Saharan Africa,Nigeria,2020,77.33,42.67,432.2,208.327405
Sub-Saharan Africa,South Africa,2020,92.0,78.67,337.62,58.801927
Sub-Saharan Africa,Ethiopia,2020,58.0,11.67,107.66,117.190911
Sub-Saharan Africa,Kenya,2020,67.0,33.67,100.67,51.98578
Sub-Saharan Africa,Ghana,2020,84.67,23.0,70.04,32.180401

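`IN` is shorthand for a chain of `OR` comparisons on the same column. The sketch below shows that the two forms return the same rows. It assumes an in-memory SQLite table named `sample` (a hypothetical name) with four rows copied from the outputs in this notebook; it is an illustration, not the notebook's own code.

```python
import sqlite3

# Assumption: SQLite stand-in with a handful of rows from the outputs above.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sample (Country_name TEXT, Est_gdp_in_billions REAL)"
)
conn.executemany(
    "INSERT INTO sample VALUES (?, ?)",
    [("Nigeria", 432.2), ("Kenya", 100.67), ("Ghana", 70.04), ("Angola", 53.62)],
)

# Membership test written with IN.
in_rows = conn.execute(
    "SELECT Country_name FROM sample "
    "WHERE Country_name IN ('Nigeria', 'Ghana') "
    "ORDER BY Est_gdp_in_billions DESC"
).fetchall()

# The same filter written as explicit OR comparisons.
or_rows = conn.execute(
    "SELECT Country_name FROM sample "
    "WHERE Country_name = 'Nigeria' OR Country_name = 'Ghana' "
    "ORDER BY Est_gdp_in_billions DESC"
).fetchall()

print(in_rows)             # [('Nigeria',), ('Ghana',)]
print(in_rows == or_rows)  # True
```

Besides being shorter, `IN` avoids a precedence trap: when an `OR` chain is combined with `AND` conditions, it needs parentheses around it, whereas `IN` is a single condition.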

### Task 5: Exploring the Rest of Sub-Saharan Africa

In the previous task, we focused on the **top 5 economies** in Sub-Saharan Africa  
(`Nigeria`, `South Africa`, `Ethiopia`, `Kenya`, and `Ghana`).  

Now, we’ll look at **the rest of Sub-Saharan Africa**, leaving out those five major economies.  

### Objective
To analyze GDP and access to basic services for all other Sub-Saharan African countries in **2020**,  
excluding the top 5 economies identified earlier.  

This comparison allows us to explore how the region's smaller economies perform in providing essential services like drinking water and sanitation.


In [8]:
%%sql
-- Task 5: Retrieve data for Sub-Saharan African countries excluding the top 5 economies
-- Focus: Countries other than Nigeria, South Africa, Ethiopia, Kenya, and Ghana (Year 2020)

SELECT
    Region,
    Country_name,
    Time_period,
    Pct_managed_drinking_water_services,
    Pct_managed_sanitation_services,
    Est_gdp_in_billions,
    Est_population_in_millions
FROM access_to_basic_services
WHERE 
    Region = 'Sub-Saharan Africa'
    AND Time_period = 2020
    AND Country_name NOT IN ('Nigeria', 'South Africa', 'Ethiopia', 'Kenya', 'Ghana')
    AND Est_gdp_in_billions IS NOT NULL
ORDER BY Est_gdp_in_billions DESC;


 * mysql+pymysql://root:***@localhost:3306/united_nations
33 rows affected.


Region,Country_name,Time_period,Pct_managed_drinking_water_services,Pct_managed_sanitation_services,Est_gdp_in_billions,Est_population_in_millions
Sub-Saharan Africa,Angola,2020,52.33,47.0,53.62,33.428486
Sub-Saharan Africa,Cameroon,2020,64.0,43.0,40.77,26.491087
Sub-Saharan Africa,Uganda,2020,61.0,21.67,37.6,44.404611
Sub-Saharan Africa,Senegal,2020,85.0,57.0,24.49,16.43612
Sub-Saharan Africa,Zimbabwe,2020,68.0,36.33,21.51,15.669666
Sub-Saharan Africa,Zambia,2020,66.67,32.67,18.11,18.927715
Sub-Saharan Africa,Burkina Faso,2020,53.33,25.0,17.93,21.522626
Sub-Saharan Africa,Mali,2020,83.67,46.0,17.47,21.22404
Sub-Saharan Africa,Benin,2020,65.33,17.33,15.65,12.643123
Sub-Saharan Africa,Gabon,2020,73.33,47.0,15.31,2.292573

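One caveat with `NOT IN`: if the list contains a NULL, the condition can never be true, so the query returns no rows at all. The literal list used in Task 5 contains no NULLs, so that query is unaffected. The issue usually shows up when the list comes from a subquery over a column with missing values. The sketch below demonstrates the behavior on an in-memory SQLite table named `sample` (a hypothetical name), used here as a stand-in for MySQL.

```python
import sqlite3

# Assumption: SQLite stand-in; the three-valued logic shown is standard SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sample (Country_name TEXT)")
conn.executemany(
    "INSERT INTO sample VALUES (?)",
    [("Nigeria",), ("Angola",), ("Uganda",)],
)

# A list without NULLs behaves as expected.
clean = conn.execute(
    "SELECT Country_name FROM sample "
    "WHERE Country_name NOT IN ('Nigeria', 'Ghana') "
    "ORDER BY Country_name"
).fetchall()

# With a NULL in the list, "x NOT IN (...)" is either false or unknown,
# never true, so every row is filtered out.
with_null = conn.execute(
    "SELECT Country_name FROM sample "
    "WHERE Country_name NOT IN ('Nigeria', NULL) "
    "ORDER BY Country_name"
).fetchall()

print(clean)      # [('Angola',), ('Uganda',)]
print(with_null)  # []
```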

### Summary

In this exercise, we explored how to filter rows and handle missing values using conditions in the SQL `WHERE` clause.  

We:
- Used the **`IS NULL`** condition to identify countries with missing GDP values.  
- Applied the **`IS NOT NULL`** condition to exclude those countries from our analysis.  
- Used the **`IN`** operator to focus on the **top 5 GDP economies** in Sub-Saharan Africa (`Nigeria`, `South Africa`, `Ethiopia`, `Kenya`, and `Ghana`).  
- Used the **`NOT IN`** operator to examine the **remaining countries** in the region.  

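Reading through the query results, we did not see a **clear or noticeable correlation** between a country’s GDP and its **access to drinking water and sanitation services** in Sub-Saharan Africa. For example, Ghana (GDP 70.04 billion) reports 84.67% for managed drinking water while Ethiopia (GDP 107.66 billion) reports 58.0%, and Cabo Verde reaches 87.33% with a GDP of only 1.7 billion.  
This was a visual comparison of the result tables, not a computed correlation coefficient, so it is a preliminary observation. It suggests that **economic size alone may not determine service accessibility**, and other factors such as governance, infrastructure investment, and policy priorities likely play a role.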

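The exercise compares GDP and service access by reading the result tables. One way to put a number on the relationship is Pearson's correlation coefficient. The sketch below is an illustration rather than part of the original analysis: `pearson_r` is a hypothetical helper written for this example, and the sample input is just the five rows from the Task 4 output. Five data points are far too few to support a conclusion, so a meaningful check would use all 38 rows returned in Task 3.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Return Pearson's correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (proportional to the covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (proportional to the variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Values copied from the Task 4 output, in the order
# Nigeria, South Africa, Ethiopia, Kenya, Ghana.
gdp = [432.2, 337.62, 107.66, 100.67, 70.04]
water = [77.33, 92.0, 58.0, 67.0, 84.67]
sanitation = [42.67, 78.67, 11.67, 33.67, 23.0]

r_water = pearson_r(gdp, water)
r_sanitation = pearson_r(gdp, sanitation)

print(f"GDP vs drinking water: r = {r_water:.2f}")
print(f"GDP vs sanitation:     r = {r_sanitation:.2f}")
```

A value near 0 indicates little linear relationship, while values near 1 or -1 indicate a strong one. Because total GDP also scales with population, repeating the calculation with GDP divided by `Est_population_in_millions` may give a different picture.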