# Exploration of the CIA World Factbook

In this project, we'll work with data from the [CIA World Factbook](https://www.cia.gov/library/publications/the-world-factbook/), a compendium of statistics about all of the countries on Earth. The Factbook contains demographic information like the following:

- `population`: the country's population.
- `population_growth`: the annual population growth rate, as a percentage.
- `area`: the total land and water area.

In this guided project, we'll use **SQL** in a Jupyter Notebook to analyze the Factbook data, stored in the SQLite database `factbook.db` (available in the project folder).

Let's first connect the Jupyter Notebook to the database file:

In [4]:
%%capture
%load_ext sql
%sql sqlite:///factbook.db

'Connected: None@factbook.db'

In [21]:
#Print some information on the tables in the database

In [23]:
%%sql
SELECT *
  FROM sqlite_master
WHERE type='table';

Done.


type,name,tbl_name,rootpage,sql
table,sqlite_sequence,sqlite_sequence,3,"CREATE TABLE sqlite_sequence(name,seq)"
table,facts,facts,47,"CREATE TABLE ""facts"" (""id"" INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL, ""code"" varchar(255) NOT NULL, ""name"" varchar(255) NOT NULL, ""area"" integer, ""area_land"" integer, ""area_water"" integer, ""population"" integer, ""population_growth"" float, ""birth_rate"" float, ""death_rate"" float, ""migration_rate"" float)"
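
The `sqlite_sequence` row in the output is not part of the Factbook data. SQLite creates that bookkeeping table automatically for any table declared with `AUTOINCREMENT`, as `facts` is. A minimal sketch of the same catalog query from plain Python, assuming the standard-library `sqlite3` module and an in-memory stand-in for `factbook.db` (the trimmed column list is an illustration, not the real schema):

```python
import sqlite3

# Assumption: an in-memory database stands in for factbook.db,
# with only a few of the real columns.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE facts ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL, "
    "code varchar(255) NOT NULL, "
    "name varchar(255) NOT NULL, "
    "area integer, "
    "population integer)"
)

# Same idea as the notebook cell: list every table in the schema catalog.
rows = conn.execute(
    "SELECT name, sql FROM sqlite_master WHERE type = 'table'"
).fetchall()
table_names = sorted(name for name, _ in rows)
print(table_names)
conn.close()
```

The list should contain both `facts` and `sqlite_sequence`, even though only `facts` was created explicitly.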


In [24]:
#Displaying the first five rows of the facts table

In [26]:
%%sql
SELECT *
  FROM facts
LIMIT 5;

Done.


id,code,name,area,area_land,area_water,population,population_growth,birth_rate,death_rate,migration_rate
1,af,Afghanistan,652230,652230,0,32564342,2.32,38.57,13.89,1.51
2,al,Albania,28748,27398,1350,3029278,0.3,12.92,6.58,3.3
3,ag,Algeria,2381741,2381741,0,39542166,1.84,23.67,4.31,0.92
4,an,Andorra,468,468,0,85580,0.12,8.13,6.96,0.0
5,ao,Angola,1246700,1246700,0,19625353,2.78,38.78,11.49,0.46


From the first rows and the column names, we can deduce some basic information about the *facts* table:

- `name`: the name of the country.
- `area`: the country's total area (both land and water) in square kilometers.
- `area_land`: the country's land area in square kilometers.
- `area_water`: the country's water area in square kilometers.
- `population`: the country's population.
- `population_growth`: the country's population growth as a percentage.
- `birth_rate`: the country's birth rate, or the number of births per year per 1,000 people.
- `death_rate`: the country's death rate, or the number of deaths per year per 1,000 people.

With these definitions in hand, let's calculate some basic statistics to get a quick summary of the data.

In [31]:
%%sql
SELECT MIN(population) AS 'Population(min.)',
       MAX(population) AS 'Population(max.)',
       MIN(population_growth) AS 'Pop. growth(min.)',
       MAX(population_growth) AS 'Pop. growth(max.)'
FROM facts;

Done.


Population(min.),Population(max.),Pop. growth(min.),Pop. growth(max.)
0,7256490011,0.0,4.02


The minimum `population` is 0, which is strange, and the maximum `population` is over 7 billion, which is more than any single country.
Let's display the corresponding rows.

In [28]:
%%sql
SELECT name
  FROM facts
WHERE population = 0
   OR population = 7256490011;

Done.


name
Antarctica
World
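
The lookup above hard-codes the two extreme values. One possible alternative is to let subqueries find the minimum and maximum, so the query keeps working if the data changes. The sketch below assumes Python's standard-library `sqlite3` module and a tiny in-memory table holding a handful of rows whose values appear in this notebook; it is not the full `facts` table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE facts (name TEXT, population INTEGER)")

# Illustrative subset: values copied from outputs shown in this notebook.
conn.executemany(
    "INSERT INTO facts (name, population) VALUES (?, ?)",
    [
        ("Andorra", 85580),
        ("Albania", 3029278),
        ("Antarctica", 0),
        ("World", 7256490011),
    ],
)

# Subqueries compute the extremes instead of hard-coded literals.
query = """
SELECT name, population
  FROM facts
 WHERE population = (SELECT MIN(population) FROM facts)
    OR population = (SELECT MAX(population) FROM facts)
 ORDER BY population;
"""
extremes = conn.execute(query).fetchall()
print(extremes)
conn.close()
```

On this subset the query returns the *Antarctica* and *World* rows, the same two names as the hard-coded version.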


It seems that the table contains a row for the whole world, which explains the maximum population of over 7 billion.  
The population of *Antarctica* is indeed 0, according to the CIA Factbook [page for Antarctica](https://www.cia.gov/library/publications/the-world-factbook/geos/ay.html).  

The row for the *whole world* should be excluded from our summary statistics.

In [32]:
%%sql
SELECT MIN(population) AS 'Population(min.)',
       MAX(population) AS 'Population(max.)',
       MIN(population_growth) AS 'Pop. growth(min.)',
       MAX(population_growth) AS 'Pop. growth(max.)'
FROM facts
WHERE name != 'World';

Done.


Population(min.),Population(max.),Pop. growth(min.),Pop. growth(max.)
0,1367485388,0.0,4.02


In [36]:
%%sql
SELECT ROUND(AVG(population), 0) AS "Population (Mean)",
       ROUND(AVG(area), 2) AS "Area (Mean)"
FROM facts
WHERE name != 'World';

Done.


Population (Mean),Area (Mean)
32242667.0,555093.55


The mean population per country is about 32 million inhabitants, which is only about **2%** of the population of the most populated country, **China**.  
Let's use these mean values to find **densely populated countries**, i.e. countries with:
- **Above-average** values for `population`
- **Below-average** values for `area` (land + water)

*Note: `area_water` refers to the sum of the surfaces of all inland water bodies, such as lakes, reservoirs, or rivers, as delimited by international boundaries and/or coastlines.*
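
As a quick check of the 2% figure quoted above, the ratio can be recomputed from the two values printed earlier in the notebook. This is only a sanity check in plain Python, and the variable names are made up for illustration:

```python
# Values copied from the notebook outputs (World row excluded).
mean_population = 32242667    # "Population (Mean)"
max_population = 1367485388   # "Population(max.)"

# Share of the largest country's population represented by the mean, in percent.
share_percent = mean_population / max_population * 100
print(f"{share_percent:.2f}%")
```

The result is a little under 2.4%, which is consistent with the rounded figure in the text.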

In [70]:
%%sql
SELECT name, population, area, population/area AS 'density (hab/km2)'  
  FROM facts
WHERE population > (SELECT AVG(population)
                      FROM facts
                    WHERE name != 'World')
  AND area < (SELECT AVG(area)
                FROM facts
              WHERE name != 'World')
ORDER BY population/area DESC;

Done.


name,population,area,density (hab/km2)
Bangladesh,168957745,148460,1138
"Korea, South",49115196,99720,492
Philippines,100998376,300000,336
Japan,126919659,377915,335
Vietnam,94348835,331210,284
United Kingdom,64088222,243610,263
Germany,80854408,357022,226
Italy,61855120,301340,205
Uganda,37101745,241038,153
Thailand,67976405,513120,132
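
One detail worth keeping in mind: both `population` and `area` are declared as `integer`, so SQLite performs integer division and truncates the density to a whole number. If decimals were wanted, one option is to cast one operand to `REAL`. A small sketch under those assumptions, using Python's standard-library `sqlite3` module and a one-row in-memory table built from the Bangladesh figures shown above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE facts (name TEXT, population INTEGER, area INTEGER)")

# Single illustrative row, copied from the density output above.
conn.execute("INSERT INTO facts VALUES ('Bangladesh', 168957745, 148460)")

# First column: integer division, as in the notebook query.
# Second column: floating-point division after casting one operand to REAL.
int_density, real_density = conn.execute(
    """
    SELECT population / area,
           ROUND(CAST(population AS REAL) / area, 1)
      FROM facts;
    """
).fetchone()
print(int_density, real_density)
conn.close()
```

For a ranking like the one above the truncation hardly matters, but it can hide differences between sparsely populated countries whose densities are all below 1.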


This gives us a list of densely populated countries, ordered by population density. Notice that the top five countries by density are all **Asian countries**.

## Conclusion

This project allowed us to do a basic exploration of the database by applying some **SQL query** techniques, such as:
- *Conditional filtering* with `WHERE`
- *Sorting* with `ORDER BY`
- Use of *aggregation functions* such as `MIN`, `MAX`, and `AVG`
- *Subqueries*

