###  **Learning Objectives**

By the end of this training, you should be able to:

- Understand how to perform an initial data analysis using **SQL numeric functions**.  
- Connect to a **MySQL database** from a Jupyter Notebook using the `sql` extension and a `mysql+pymysql` connection string.  
- Analyze the **Access_to_Basic_Services** table to understand the range and distribution of values in your dataset.  



###  **Connecting to the MySQL Database**

Before running any queries, load the `sql` extension and connect to your MySQL database with the `%sql` line magic, using `pymysql` as the driver. The `%%sql` cell magic is then used to run the queries themselves.

```python
# Load the SQL extension in Jupyter
%load_ext sql

# Connect to the MySQL database (replace YourPassword with your own password)
%sql mysql+pymysql://root:YourPassword@localhost:3306/united_nations
```

In [1]:
%load_ext sql
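
A password typed directly into the connection string ends up saved in the notebook. As a minimal sketch, assuming the password is stored in an environment variable (the name `MYSQL_PASSWORD` and the helper `build_mysql_url` are hypothetical, not part of the original notebook), the URL could be assembled like this:

```python
import os
from urllib.parse import quote_plus


def build_mysql_url(user, password, host="localhost", port=3306,
                    database="united_nations"):
    """Return a mysql+pymysql connection URL (hypothetical helper)."""
    # quote_plus escapes characters such as '@' or '/' that would
    # otherwise be misread as URL delimiters.
    return (
        f"mysql+pymysql://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


# MYSQL_PASSWORD is an assumed variable name; the fallback is the placeholder
# used in the connection string above.
password = os.environ.get("MYSQL_PASSWORD", "YourPassword")
connection_url = build_mysql_url("root", password)
```

In a notebook, IPython normally expands `$connection_url` in line magics, so `%sql $connection_url` should connect with this URL, though it is worth checking against your installed version of the extension.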

In [3]:
%%sql

SELECT
    *
FROM
    Access_to_Basic_Services
LIMIT 5;

 * mysql+pymysql://root:***@localhost:3306/united_nations
5 rows affected.


Region,Sub_region,Country_name,Time_period,Pct_managed_drinking_water_services,Pct_managed_sanitation_services,Est_population_in_millions,Est_gdp_in_billions,Land_area,Pct_unemployment
Central and Southern Asia,Central Asia,Kazakhstan,2015,94.67,98.0,17.542806,184.39,2699700.0,4.93
Central and Southern Asia,Central Asia,Kazakhstan,2016,94.67,98.0,17.794055,137.28,2699700.0,4.96
Central and Southern Asia,Central Asia,Kazakhstan,2017,95.0,98.0,18.037776,166.81,2699700.0,4.9
Central and Southern Asia,Central Asia,Kazakhstan,2018,95.0,98.0,18.276452,179.34,2699700.0,4.85
Central and Southern Asia,Central Asia,Kazakhstan,2019,95.0,98.0,18.513673,181.67,2699700.0,4.8


###  **Exercise**

In this section, we’ll perform an initial data analysis on the `access_to_basic_services` table to answer the following questions:

1. What is the total number of entries in the dataset?  
2. What are the earliest and latest years for which we have data?  
3. How many countries are included in this dataset?  
4. What is the average percentage of people who have access to managed drinking water services across all years and countries?



###  **1. Total Number of Entries**

To determine the total number of records in the dataset, we’ll use the `COUNT()` function.  
`COUNT(*)` returns the number of rows in the table, whereas `COUNT(column)` counts only the rows where that column is not `NULL`.  
We’ll also use the alias `total_entries` to label the result clearly.


In [4]:
%%sql
SELECT COUNT(*) AS total_entries
FROM access_to_basic_services;


 * mysql+pymysql://root:***@localhost:3306/united_nations
1 rows affected.


total_entries
1048
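
The difference between `COUNT(*)` and `COUNT(column)` only shows up when a column contains `NULL` values. The sketch below illustrates it with Python's built-in `sqlite3` module as a stand-in for MySQL; the table `demo_services` and its three rows are invented for illustration and are not taken from the real dataset.

```python
import sqlite3

# In-memory SQLite stands in for MySQL here; table and values are invented.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE demo_services ("
    "Country_name TEXT, Pct_managed_drinking_water_services REAL)"
)
conn.executemany(
    "INSERT INTO demo_services VALUES (?, ?)",
    [("Country A", 94.67), ("Country B", None), ("Country C", 80.0)],
)

# COUNT(*) counts every row; COUNT(column) skips rows where the column is NULL.
total_rows, non_null_values = conn.execute(
    "SELECT COUNT(*), COUNT(Pct_managed_drinking_water_services) "
    "FROM demo_services"
).fetchone()
conn.close()

print(total_rows, non_null_values)  # 3 2
```

If the two numbers differ on the real table, the column has missing values that later aggregates will silently skip.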


###  **2. Earliest and Latest Years of Data**

To find the time range of our dataset, we can use the `MIN()` and `MAX()` functions on the `Time_period` column.  
- `MIN()` returns the **earliest year** available in the dataset.  
- `MAX()` returns the **latest year** available.  

We’ll use aliases to make the output more descriptive.


In [5]:
%%sql
SELECT 
    MIN(Time_period) AS earliest_year,
    MAX(Time_period) AS latest_year
FROM access_to_basic_services;


 * mysql+pymysql://root:***@localhost:3306/united_nations
1 rows affected.


earliest_year,latest_year
2015,2020
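
`MIN()` and `MAX()` report only the endpoints of the range, not whether every year in between is present. One possible follow-up check, sketched here with `sqlite3` and an invented table `demo_years` (an assumption for illustration, not the real data), compares the number of distinct years with the length of the span:

```python
import sqlite3

# In-memory SQLite stands in for MySQL; the rows below are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo_years (Country_name TEXT, Time_period INTEGER)")
conn.executemany(
    "INSERT INTO demo_years VALUES (?, ?)",
    [("Country A", 2015), ("Country A", 2016),
     ("Country B", 2018), ("Country B", 2020)],
)

earliest, latest, distinct_years = conn.execute(
    "SELECT MIN(Time_period), MAX(Time_period), COUNT(DISTINCT Time_period) "
    "FROM demo_years"
).fetchone()
conn.close()

# A span of 2015..2020 covers 6 calendar years; fewer distinct years means gaps.
years_in_span = latest - earliest + 1
has_gaps = distinct_years < years_in_span

print(earliest, latest, distinct_years, has_gaps)  # 2015 2020 4 True
```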


###  **3. Number of Countries Included in the Dataset**

To determine how many unique countries are represented in the dataset, we’ll count the distinct values in the `Country_name` column.

- The `COUNT()` function is used to count entries.  
- The `DISTINCT` keyword ensures that each country is counted only once, no matter how many rows it appears in.  
- We’ll use an alias `total_countries` to make the result more readable.


In [6]:
%%sql
SELECT COUNT(DISTINCT Country_name) AS total_countries
FROM access_to_basic_services;


 * mysql+pymysql://root:***@localhost:3306/united_nations
1 rows affected.


total_countries
182
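
To see why `DISTINCT` matters when each country appears once per year, the following sketch compares the two counts on a tiny invented table (`demo_countries` is a hypothetical name, and SQLite is used only as a runnable stand-in for MySQL):

```python
import sqlite3

# In-memory SQLite stands in for MySQL; the rows below are invented.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE demo_countries (Country_name TEXT, Time_period INTEGER)"
)
conn.executemany(
    "INSERT INTO demo_countries VALUES (?, ?)",
    [("Kazakhstan", 2015), ("Kazakhstan", 2016), ("Country B", 2015)],
)

# Without DISTINCT every row is counted; with DISTINCT each name counts once.
all_names, unique_names = conn.execute(
    "SELECT COUNT(Country_name), COUNT(DISTINCT Country_name) "
    "FROM demo_countries"
).fetchone()
conn.close()

print(all_names, unique_names)  # 3 2
```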


###  **4. Average Access to Managed Drinking Water Services**

To find the overall average percentage of people who have access to **managed drinking water services** across all years and countries,  
we’ll use the `AVG()` function on the `Pct_managed_drinking_water_services` column.

- The `AVG()` function calculates the mean (average) value of a numeric column, ignoring `NULL` values.  
- We’ll use an alias `avg_managed_drinking_water` to label the result clearly.


In [7]:
%%sql
SELECT AVG(Pct_managed_drinking_water_services) AS avg_managed_drinking_water
FROM access_to_basic_services;


 * mysql+pymysql://root:***@localhost:3306/united_nations
1 rows affected.


avg_managed_drinking_water
87.189103
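
Because `AVG()` skips `NULL` values, it divides by the number of non-`NULL` entries rather than by the number of rows. A small sketch with invented values (again using `sqlite3` as a stand-in for MySQL; `demo_water` is a hypothetical table) makes the difference visible:

```python
import sqlite3

# In-memory SQLite stands in for MySQL; the rows below are invented.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE demo_water ("
    "Country_name TEXT, Pct_managed_drinking_water_services REAL)"
)
conn.executemany(
    "INSERT INTO demo_water VALUES (?, ?)",
    [("Country A", 90.0), ("Country B", 80.0), ("Country C", None)],
)

# AVG divides by the 2 non-NULL values; SUM / COUNT(*) divides by all 3 rows.
avg_value, sum_over_all_rows = conn.execute(
    "SELECT AVG(Pct_managed_drinking_water_services), "
    "SUM(Pct_managed_drinking_water_services) / COUNT(*) "
    "FROM demo_water"
).fetchone()
conn.close()

print(avg_value, round(sum_over_all_rows, 2))  # 85.0 56.67
```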


###  **Summary**

We can combine all of our previous queries into a **single SQL query** that returns all the results in one row.  
By using the aggregate functions `COUNT()`, `MIN()`, `MAX()`, and `AVG()` together, we can summarize our dataset efficiently.

This approach provides a compact overview of the total entries, time range, number of countries,  
and the average access to managed drinking water services in one table.


In [8]:
%%sql
SELECT 
    COUNT(*) AS total_entries,
    MIN(Time_period) AS earliest_year,
    MAX(Time_period) AS latest_year,
    COUNT(DISTINCT Country_name) AS total_countries,
    AVG(Pct_managed_drinking_water_services) AS avg_managed_drinking_water
FROM access_to_basic_services;


 * mysql+pymysql://root:***@localhost:3306/united_nations
1 rows affected.


total_entries,earliest_year,latest_year,total_countries,avg_managed_drinking_water
1048,2015,2020,182,87.189103
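
Outside the notebook, the same kind of one-row summary can be read by alias from plain Python. The sketch below is an illustration under stated assumptions: it uses `sqlite3` with an invented table `demo_summary` instead of the real MySQL table, and relies on `sqlite3.Row` so that each alias becomes a dictionary key.

```python
import sqlite3

# In-memory SQLite stands in for MySQL; the rows below are invented.
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # lets result columns be accessed by alias
conn.execute(
    "CREATE TABLE demo_summary ("
    "Country_name TEXT, Time_period INTEGER, "
    "Pct_managed_drinking_water_services REAL)"
)
conn.executemany(
    "INSERT INTO demo_summary VALUES (?, ?, ?)",
    [("Country A", 2015, 90.0), ("Country A", 2016, 92.0),
     ("Country B", 2015, 70.0)],
)

query = """
SELECT
    COUNT(*) AS total_entries,
    MIN(Time_period) AS earliest_year,
    MAX(Time_period) AS latest_year,
    COUNT(DISTINCT Country_name) AS total_countries,
    AVG(Pct_managed_drinking_water_services) AS avg_managed_drinking_water
FROM demo_summary
"""
summary = dict(conn.execute(query).fetchone())
conn.close()

print(summary["total_entries"], summary["total_countries"])  # 3 2
```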
