# Guided Project: Analyzing CIA Factbook Data Using SQL

In this project, we'll work with data from the **CIA World Factbook**, a compendium of statistics about all of the countries on Earth. The Factbook contains demographic information like:

* population - The population as of 2015.
* population_growth - The annual population growth rate, as a percentage.
* area - The total land and water area.

For more details on the Factbook data, see the [CIA World Factbook website](https://www.cia.gov/the-world-factbook/).

In this guided project, we'll use SQL in Jupyter Notebook to explore and analyze data from this database. 

### Installing SQL for Jupyter using pip

If you have not installed the SQL extension for Jupyter (`ipython-sql`), uncomment and run one of the install commands below.

In [1]:
# !pip install ipython-sql
# import sys
# !{sys.executable} -m pip install ipython-sql

In [2]:
%%capture

# Load the SQL extension
%load_ext sql

# Connect to the factbook database
%sql sqlite:///factbook.db

'Connected: None@factbook.db'

## Overview of the table

In [3]:
%%sql
-- List the tables in the factbook database
SELECT * 
  FROM sqlite_master 
 WHERE type='table';

We can see that there are only two tables in our database:
* sqlite_sequence
* facts
<p>We can also see the columns in each table and their respective data types.</p>
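To list the columns and their declared types directly, SQLite offers `PRAGMA table_info`. The following is a minimal sketch in plain Python using the standard `sqlite3` module. It builds a small in-memory stand-in for `factbook.db`, so the column types and the helper name `describe_table` are assumptions for illustration, not taken from the real database.

```python
import sqlite3

# In-memory stand-in for factbook.db; the declared column types are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE facts (
        id INTEGER PRIMARY KEY,
        code TEXT,
        name TEXT,
        area INTEGER,
        population INTEGER,
        population_growth REAL
    )
    """
)


def describe_table(connection, table):
    """Return (column_name, declared_type) pairs for a table (hypothetical helper)."""
    # PRAGMA table_info yields rows of (cid, name, type, notnull, dflt_value, pk).
    rows = connection.execute(f"PRAGMA table_info({table})").fetchall()
    return [(row[1], row[2]) for row in rows]


columns = describe_table(conn, "facts")
for column_name, declared_type in columns:
    print(f"{column_name}: {declared_type}")
```

Pointing `sqlite3.connect` at `factbook.db` instead of `:memory:` should report the real schema.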

We will be working with the **facts** table. <p>Below is the data dictionary for our table:</p>

* name - The name of the country.
* area - The country's total area (both land and water) in square kilometers.
* area_land - The country's land area in square kilometers.
* area_water - The country's water area in square kilometers.
* population - The country's population.
* population_growth - The country's population growth as a percentage.
* birth_rate - The country's birth rate, or the number of births a year per 1,000 people.
* death_rate - The country's death rate, or the number of deaths a year per 1,000 people.

#### Displaying the first 5 rows of the *facts* table

In [4]:
%%sql 
SELECT * 
  FROM facts 
 LIMIT 5;

Done.


| id | code | name | area | area_land | area_water | population | population_growth | birth_rate | death_rate | migration_rate |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | af | Afghanistan | 652230 | 652230 | 0 | 32564342 | 2.32 | 38.57 | 13.89 | 1.51 |
| 2 | al | Albania | 28748 | 27398 | 1350 | 3029278 | 0.3 | 12.92 | 6.58 | 3.3 |
| 3 | ag | Algeria | 2381741 | 2381741 | 0 | 39542166 | 1.84 | 23.67 | 4.31 | 0.92 |
| 4 | an | Andorra | 468 | 468 | 0 | 85580 | 0.12 | 8.13 | 6.96 | 0.0 |
| 5 | ao | Angola | 1246700 | 1246700 | 0 | 19625353 | 2.78 | 38.78 | 11.49 | 0.46 |


## Summary Statistics

Here we are looking for the following:
* Minimum population
* Maximum population
* Minimum population growth
* Maximum population growth

In [None]:
%%sql 
SELECT printf('%,d', MIN(population)) AS "Min population", 
       printf('%,d', MAX(population)) AS "Max population", 
       MIN(population_growth) AS "Min pop. growth", 
       MAX(population_growth) AS "Max pop. growth" 
 FROM facts;

The output is striking: at least one row has a population of zero, and the maximum population is 7,256,490,011 (about 7.3 billion). Both values call for further investigation.

Let's check which row(s) have a population of zero.

In [None]:
%%sql
SELECT * 
  FROM facts
 WHERE population = 0

## Exploring Outliers

Alternatively, let's use subqueries to learn more about the outliers: the rows with the minimum and maximum population.

#### Extracting countries with the minimum population

In [None]:
%%sql
SELECT name 
  FROM facts 
 WHERE population = (SELECT MIN(population) 
                       FROM facts)

#### Extracting countries with the maximum population

In [None]:
%%sql
SELECT name 
  FROM facts 
 WHERE population = (SELECT MAX(population) 
                       FROM facts)

In [None]:
%%sql
SELECT * 
  FROM facts 
 WHERE name = 'Antarctica'

The minimum population is 0. This needs further investigation by inspecting the source of the data, the [CIA Factbook](https://www.cia.gov/library/publications/the-world-factbook/geos/ay.html), so we may need to replace this value appropriately.

The maximum population belongs to the row named World, which holds the total population of all countries rather than that of a single country. This row therefore needs to be excluded.
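The effect of excluding these rows can be sketched outside the notebook with Python's standard `sqlite3` module. The toy table below is an assumption for illustration: it reuses three populations from the first rows shown earlier, plus the World total and the zero-population Antarctica row discussed above. The helper name `most_populous` is hypothetical, and `NOT IN` is used here as an equivalent alternative to chaining `<>` conditions.

```python
import sqlite3

# Toy stand-in for the facts table (only two columns, five rows).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE facts (name TEXT, population INTEGER)")
conn.executemany(
    "INSERT INTO facts VALUES (?, ?)",
    [
        ("Afghanistan", 32564342),
        ("Albania", 3029278),
        ("Algeria", 39542166),
        ("Antarctica", 0),
        ("World", 7256490011),
    ],
)


def most_populous(connection, excluded=()):
    """Name of the row with the largest population, skipping excluded names."""
    placeholders = ", ".join("?" for _ in excluded)
    # With nothing to exclude, "1 = 1" keeps every row.
    condition = f"name NOT IN ({placeholders})" if excluded else "1 = 1"
    query = f"""
        SELECT name
          FROM facts
         WHERE {condition}
           AND population = (SELECT MAX(population)
                               FROM facts
                              WHERE {condition})
    """
    # The condition appears twice, so the parameters are passed twice.
    return connection.execute(query, tuple(excluded) * 2).fetchone()[0]


unfiltered = most_populous(conn)
filtered = most_populous(conn, excluded=("World", "Antarctica"))
print(unfiltered, filtered)
```

Without the filter the aggregate row wins; with it, the largest actual country in the toy data is returned.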

## Exploring Average Population and Area

Let's recalculate the summary statistics, this time excluding the rows for the whole world and Antarctica:

1. Minimum population
2. Maximum population
3. Minimum population growth
4. Maximum population growth

In [None]:
%%sql 
SELECT  printf('%,d', MIN(population)) AS "Min population", 
        printf('%,d', MAX(population)) AS "Max population", 
        MIN(population_growth) AS min_pop_growth, 
        MAX(population_growth) AS max_pop_growth 
  FROM facts
 WHERE name <> 'World'
   AND
       name <> 'Antarctica';

In [None]:
%%sql
SELECT name 
  FROM facts 
 WHERE population = (SELECT MAX(population) 
                       FROM facts 
                      WHERE name <> 'World'
                        AND
                            name <> 'Antarctica')

#### Calculating the average population and area to help identify densely populated countries

In [None]:
%%sql
SELECT AVG(population) AS avg_population, 
       AVG(area) AS avg_area
  FROM facts
 WHERE name <> 'World'
   AND
       name <> 'Antarctica'

## Finding Densely Populated Countries
<p>We will find the densely populated countries by using the averages computed in the cell above as subqueries. That is, we are looking for countries whose population is above the average population and whose area is below the average area.</p>

###### To make the population and area easy to read, we will use a thousands separator for the values in those columns
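As a quick check of how the separator behaves, here is a minimal sketch using Python's standard `sqlite3` module and SQLite's `printf` with the `,` flag. The sample value is Afghanistan's population from the first rows shown earlier, and the variable names are only illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Format an integer with a thousands separator; the result is text, not a number.
formatted, result_type = conn.execute(
    "SELECT printf('%,d', ?), typeof(printf('%,d', ?))",
    (32564342, 32564342),
).fetchone()
print(formatted, result_type)
```

Because the formatted value is text, it is safer to sort and compare on the original numeric column and use the formatted column only for display.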

In [None]:
%%sql
SELECT * 
  FROM facts 
 LIMIT 3

We will also calculate the [population density](https://www.internetgeography.net/topics/what-is-population-density/) and sort the results by it.
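One detail worth checking when computing a density in SQLite: dividing one integer column by another performs integer division and drops the fractional part. The sketch below assumes `population` and `area` are stored as integers and uses the Albania row from the first rows shown earlier. It runs through Python's standard `sqlite3` module against a toy in-memory table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE facts (name TEXT, area INTEGER, population INTEGER)")
conn.execute("INSERT INTO facts VALUES ('Albania', 28748, 3029278)")

# Compare plain integer division with division after casting to REAL.
integer_density, real_density = conn.execute(
    """
    SELECT population / area,
           CAST(population AS REAL) / area
      FROM facts
     WHERE name = 'Albania'
    """
).fetchone()
print(integer_density, round(real_density, 2))
```

Casting one operand to `REAL` keeps the decimals, which matters most for small ratios that would otherwise truncate to 0.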

In [None]:
%%sql
SELECT *,  
        printf('%,d', population) AS population_fmt, 
        printf('%,d', area) AS area_fmt,
        CAST(population AS REAL) / area AS pop_density
  FROM facts
 WHERE population > (SELECT AVG(population) 
                       FROM facts 
                      WHERE name <> 'World' 
                        AND
                            name <> 'Antarctica')
  AND
      area < (SELECT AVG(area) 
                      FROM facts 
                     WHERE name <> 'World'
                       AND
                           name <> 'Antarctica')
 ORDER BY pop_density DESC;

The query returns 14 densely populated countries, mostly South American and Asian countries. One African country (Morocco) is also on the list. Bangladesh is the most densely populated country on this list, followed by South Korea.

Both of these top-ranked countries are in Asia.

In [None]:
%%sql
SELECT * 
  FROM facts 
 WHERE name = 'Macau'

### Country with the highest growth rate

In [None]:
%%sql
SELECT * 
  FROM facts
 WHERE population_growth = (SELECT MAX(population_growth) 
                              FROM facts 
                             WHERE name <> 'World'
                               AND
                                   name <> 'Antarctica')

### Country with highest population

In [None]:
%%sql
SELECT * 
  FROM facts
 WHERE population = (SELECT MAX(population) 
                       FROM facts 
                      WHERE name <> 'World'
                        AND
                            name <> 'Antarctica')

China is the most populous country, with a population of about 1.38 billion.

South Sudan has the highest population growth rate, at 4.02%.

### Countries with the highest ratios of water to land: Which countries have more water than land?

In [None]:
%%sql
SELECT *, 
        CAST(area_water AS REAL) / area_land AS water_land_ratio 
  FROM facts
 WHERE water_land_ratio > (SELECT AVG(CAST(area_water AS REAL) / area_land)
                             FROM facts 
                            WHERE name <> 'World'
                              AND
                                  name <> 'Antarctica')
 ORDER BY water_land_ratio DESC;

The country with the highest water-to-land ratio is the British Indian Ocean Territory.

### Which countries will add the most people to their population next year?

Here we look for the countries with the highest population growth rates, keeping only those above the average rate.
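Note that ranking by growth rate is not the same as ranking by the number of people added, since a large country with a moderate rate can add more people than a small country with a high rate. As a hedged sketch of the difference, the example below estimates the yearly increase as `population * population_growth / 100` using the five rows shown earlier. This estimate is an assumption for illustration, not the query used in this notebook.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE facts (name TEXT, population INTEGER, population_growth REAL)"
)
conn.executemany(
    "INSERT INTO facts VALUES (?, ?, ?)",
    [
        ("Afghanistan", 32564342, 2.32),
        ("Albania", 3029278, 0.3),
        ("Algeria", 39542166, 1.84),
        ("Andorra", 85580, 0.12),
        ("Angola", 19625353, 2.78),
    ],
)

# Rank by growth rate (a percentage).
by_rate = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM facts ORDER BY population_growth DESC"
    )
]

# Rank by the estimated number of people added in a year.
by_people_added = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM facts "
        "ORDER BY population * population_growth / 100.0 DESC"
    )
]
print(by_rate[0], by_people_added[0])
```

In this small sample the two rankings already disagree at the top, so the choice of measure should match the question being asked.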

In [None]:
%%sql
SELECT *
  FROM facts
 WHERE population_growth > (SELECT AVG(population_growth) 
                              FROM facts
                             WHERE name <> 'World'
                               AND
                                   name <> 'Antarctica')
 ORDER BY population_growth DESC;

African countries, namely **South Sudan, Malawi, Burundi, Niger and Uganda**, top the list of countries with the highest population growth rates.

### Countries with higher death rate than birth rate

In [None]:
%%sql
SELECT * 
  FROM facts
 WHERE death_rate > birth_rate
 ORDER BY death_rate DESC;

In [None]:
%%sql
SELECT *, 
        CAST(population AS REAL) / area AS pop_density 
  FROM facts
 WHERE pop_density > (SELECT AVG(CAST(population AS REAL) / area) 
                        FROM facts 
                       WHERE name <> 'World')
 ORDER BY pop_density DESC;