<h1 align='center'>Countries Statistics</h1>

![image.jpg](https://les.mitsubishielectric.co.uk/assets/Uploads/328a039bfe/Changing-view-from-space.jpg)

# 0. About

**In this notebook we will prepare and visualize various statistical data on countries. We will focus on 2015, since that is the year with the most data available.**

Datasets used in this notebook:
1. [Countries of the world](https://www.kaggle.com/fernandol/countries-of-the-world) - 225 countries<br>
    Includes basic information such as area, population, GDP, etc.
2. [Environmental variables for world countries](https://www.kaggle.com/zanderventer/environmental-variables-for-world-countries) - 242 countries<br>
    Contains information about the natural conditions and climate of the countries.
3. [World happiness report](https://www.kaggle.com/unsdsn/world-happiness) - 158 countries<br>
    Assessment of the average level of happiness and the parameters used for the assessment.
4. [CO2 and GHG emission data](https://www.kaggle.com/srikantsahu/co2-and-ghg-emission-data) - 222 countries<br>
    Emissions of countries from 1750 to 2017.
5. [Life expectancy (WHO)](https://www.kaggle.com/kumarajarshi/life-expectancy-who) - 183 countries<br>
    Life expectancy and factors affecting it.
6. [Alcohol consumption](https://www.kaggle.com/sansuthi/alcohol-consumption) - 213 countries<br>
    Alcohol consumption, suicide rate, income, employment and urban rate.
7. [Income by country](https://www.kaggle.com/frankmollard/income-by-country) - 193 countries<br>
    Income, GDP and other economy factors

# 1. Importing libraries

We will use the following libraries:
1. **pandas** and **numpy** to work with data
2. **plotly** and **plotly express** for interactive visualization
3. **pycountry** and **pycountry_convert** for converting country names and codes, and **functools** for caching the lookups
4. **xlrd** and **openpyxl** to work with Excel files
5. **json** to load countries geojson

In [None]:
import pandas as pd
import numpy as np
import plotly.figure_factory as ff
import plotly.express as px
!pip install pycountry_convert
import pycountry, pycountry_convert, functools
!pip install xlrd
!pip install openpyxl
import json

Also, we will create a function to find the country code by name and vice versa.

In [None]:
@functools.lru_cache(None)
def get_country_code(country):
    try:
        result = pycountry.countries.search_fuzzy(country)
    except:
        return 'NaN'
    else:
        return result[0].alpha_3

def get_country_name(alpha_3_code):
    result = pycountry.countries.get(alpha_3=alpha_3_code)
    if result is None:
        return None
    else:
        return result.name

def get_continent_code(country_name):
    code = get_country_code(country_name)
    if code == 'NaN':  # get_country_code signals a failed lookup with the string 'NaN', not None
        return None
    country = pycountry.countries.get(alpha_3=code)
    if country is None:
        return None
    try:
        return pycountry_convert.country_alpha2_to_continent_code(country.alpha_2)
    except:
        return None
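
The `functools.lru_cache` decorator on `get_country_code` means each distinct name is searched only once, which matters because the same names recur across datasets. The following is a minimal sketch of that behaviour; `slow_lookup` is a hypothetical stand-in for the fuzzy search, not part of pycountry.

```python
import functools

calls = []  # records every real (uncached) lookup, for illustration only


@functools.lru_cache(maxsize=None)
def slow_lookup(name):
    # Hypothetical stand-in for an expensive fuzzy search.
    calls.append(name)
    return name[:3].upper()


# 'Laos' appears three times but is looked up only once.
codes = [slow_lookup(name) for name in ['Laos', 'Chad', 'Laos', 'Laos']]
print(codes)
print('real lookups:', len(calls))
print(slow_lookup.cache_info())
```

Note that the cache also remembers failed lookups, so a name that returned `'NaN'` once will keep returning it until the cache is cleared with `cache_clear()`.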

# 2. Data preparing

We need to collect data from seven datasets into one pandas dataframe. Because each dataset spells country names differently, we need a common standard to join them on. We will use the [ISO 3166-1 alpha-3](https://www.iso.org/iso-3166-country-codes.html) standard to encode the country names.

<p style='font-size: 26px'>Countries of the world dataset</p>

Load the file and see what is in it.

In [None]:
countries_dataset = pd.read_csv('../input/countries-of-the-world/countries of the world.csv', decimal=',')
countries_dataset.head()

In [None]:
print('Shape:', countries_dataset.shape, '\n')
print('Missing values:')
print(countries_dataset.isnull().sum(), '\n')
print('Data types:')
print(countries_dataset.dtypes, '\n')

**What we need to do:**
1. Add "Country code (ISO)" column
2. Format float values to 3 decimal places
3. Reformat columns names
4. Reformat units
5. Remove "Climate" column because we don't know what the encodings mean and it has too many missing values
6. Remove "Region", "Infant mortality (per 1000 births)", "GDP ($ per capita)" and "Deathrate" columns

In [None]:
columns_names = ['Country', 'Region', 'Population', 'Area, sq. km.',
       'Pop. Density, per sq. km.', 'Coastline (coast/area ratio)',
       'Net migration, %', 'Infant mortality, per 1000 births',
       'GDP, $ per capita', 'Literacy, %', 'Phones, per 1000', 'Arable, %',
       'Crops, %', 'Other, %', 'Climate', 'Birthrate, per 1000 inhabitants', 'Deathrate, per 1000 inhabitants',
       'Agriculture, %', 'Industry, %', 'Service, %']
countries_dataset.columns = columns_names
countries_dataset['Area, sq. km.'] *= 2.58999
# Density is per unit area, so converting from per sq. mi. to per sq. km. divides by the factor
countries_dataset['Pop. Density, per sq. km.'] /= 2.58999
countries_dataset['Agriculture, %'] *= 100
countries_dataset['Industry, %'] *= 100
countries_dataset['Service, %'] *= 100
countries_dataset = countries_dataset.round(decimals=3)
countries_dataset.drop(['Region', 'Climate', 'Infant mortality, per 1000 births', 'GDP, $ per capita', 'Deathrate, per 1000 inhabitants'], axis=1, inplace=True)
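
The unit step is easy to get backwards, so here is a small worked check. It assumes the source file reports area in square miles and density per square mile, which is what the 2.58999 factor suggests; the helper names and the sample numbers are made up for illustration.

```python
SQ_KM_PER_SQ_MI = 2.58999  # same factor as used in the notebook


def area_to_sq_km(area_sq_mi):
    # An area gets larger in number when expressed in the smaller unit.
    return area_sq_mi * SQ_KM_PER_SQ_MI


def density_to_per_sq_km(density_per_sq_mi):
    # A density has area in the denominator, so it shrinks by the same factor.
    return density_per_sq_mi / SQ_KM_PER_SQ_MI


# Made-up country: 1,000,000 people on 10,000 square miles.
population = 1_000_000
area_sq_mi = 10_000.0
density_per_sq_mi = population / area_sq_mi

area_sq_km = area_to_sq_km(area_sq_mi)
density_per_sq_km = density_to_per_sq_km(density_per_sq_mi)

# The converted density must agree with population / converted area.
print(area_sq_km, density_per_sq_km, population / area_sq_km)
```

The same consistency check (density equals population divided by area) can be run on a few rows of the real table after conversion.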

In [None]:
# Strip surrounding whitespace from the country names instead of cutting off the last character
countries_dataset['Country'] = countries_dataset['Country'].astype(str).str.strip()
countries_codes_map = {country: get_country_code(country) for country in countries_dataset['Country']}

Let's check if all countries got their code

In [None]:
for country in countries_codes_map.keys():
    if countries_codes_map[country] == 'NaN':
        print(country)

As we can see, some countries were left without a code. This may be due to the fact that their names in the dataset are different from the names given by the standard. For simplicity, we'll fill them in manually.<br>
*Note*: the Gaza Strip and the West Bank do not currently have a unified code, so it is easier to simply delete them from the table. More info: https://en.wikipedia.org/wiki/ISO_3166-2:PS<br>

In [None]:
missing_countries_codes = {
    'Antigua & Barbuda': 'ATG',
    'Bahamas, The': 'BHS',
    'Bosnia & Herzegovina': 'BIH',
    'British Virgin Is.': 'VGB',
    'Burma': 'MMR',
    'Cape Verde': 'CPV',
    'Central African Rep.': 'CAF',
    'Congo, Dem. Rep.': 'COD',
    'Congo, Repub. of the': 'COG',
    'East Timor': 'TLS',
    'Gambia, The': 'GMB',
    'Korea, North': 'PRK',
    'Korea, South': 'KOR',
    'Laos': 'LAO',
    'Macau': 'MAC',
    'Micronesia, Fed. St.': 'FSM',
    'Netherlands Antilles': 'ANT',
    'N. Mariana Islands': 'MNP',
    'Saint Kitts & Nevis': 'KNA',
    'St Pierre & Miquelon': 'SPM',
    'Sao Tome & Principe': 'STP',
    'Swaziland': 'SWZ',
    'Trinidad & Tobago': 'TTO',
    'Turks & Caicos Is': 'TCA',
}
for country in missing_countries_codes.keys():
    countries_codes_map[country] = missing_countries_codes[country]

In [None]:
countries_codes_map['Niger'] = 'NER'
countries_codes_map['Virgin Islands'] = 'VIR'
countries_codes_map['Guadeloupe'] = 'GLP'
countries_codes_map['Mayotte'] = 'MYT'
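
The pattern used above (automatic lookup first, then manual corrections that take precedence) can be sketched as a single helper. This is only an illustration: `resolve_codes` and `toy_lookup` are hypothetical names, and `toy_lookup` merely stands in for `get_country_code`.

```python
def resolve_codes(names, lookup, overrides):
    """Map each name to a code; manual overrides win over the automatic lookup."""
    resolved = {}
    for name in names:
        if name in overrides:
            resolved[name] = overrides[name]
        else:
            resolved[name] = lookup(name)
    return resolved


def toy_lookup(name):
    # Hypothetical stand-in for get_country_code: knows only a few names
    # and returns the string 'NaN' on failure, as the notebook's helper does.
    known = {'France': 'FRA', 'Japan': 'JPN'}
    return known.get(name, 'NaN')


overrides = {'Burma': 'MMR', 'Korea, South': 'KOR'}
names = ['France', 'Burma', 'Korea, South', 'Gaza Strip']

codes = resolve_codes(names, toy_lookup, overrides)
unresolved = [name for name, code in codes.items() if code == 'NaN']
print(codes)
print('still unresolved:', unresolved)
```

Names that remain unresolved keep the `'NaN'` marker and are filtered out in the next step, which is how entries without a usable code are dropped.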

In [None]:
countries_dataset['Country code (ISO)'] = countries_dataset['Country'].map(countries_codes_map).astype(str)
countries_dataset = countries_dataset.loc[countries_dataset['Country code (ISO)'] != 'NaN']
countries_dataset.shape

In [None]:
countries_dataset.head()

<p style='font-size: 26px'>Environmental variables for world countries dataset</p>

In [None]:
env_dataset = pd.read_csv('../input/environmental-variables-for-world-countries/World_countries_env_vars.csv')
env_dataset.head()

In [None]:
print('Shape:', env_dataset.shape, '\n')
print('Missing values:')
print(env_dataset.isnull().sum(), '\n')
print('Data types:')
print(env_dataset.dtypes, '\n')

**What we need to do:**
1. Add "Country code (ISO)" column
2. Format float values to 3 decimal places
3. Reformat columns names
4. Remove "aspect", "slope" and "isothermality" columns ("cropland_cover" is removed later, in the feature engineering section)

In [None]:
env_dataset['Country'] = env_dataset['Country'].astype(str)
# Assign the new labels directly; set_axis(..., inplace=True) is not available in recent pandas versions
env_dataset.columns = env_dataset.columns.map(lambda x: x.capitalize().replace('_', ' '))

In [None]:
columns_names = ['Country', 'Accessibility to cities, min', 'Elevation, m', 'Aspect', 'Slope',
               'Cropland cover, %', 'Tree canopy cover, %', 'Isothermality',
               'Rain coldestquart, mm', 'Rain driestmonth, mm', 'Rain driestquart, mm',
               'Rain mean annual, mm', 'Rain seasonailty', 'Rain warmestquart, mm',
               'Rain wettestmonth, mm', 'Rain wettestquart, mm', 'Temp annual range, %',
               'Temp coldestquart, degC', 'Temp diurnal range, degC', 'Temp driestquart, degC',
               'Temp max warmestmonth, degC', 'Temp mean annual, degC', 'Temp min coldestmonth, degC',
               'Temp seasonality, degC', 'Temp warmestquart, degC', 'Temp wettestquart, degC', 'Wind, m/s',
               'Cloudiness, days per year']
env_dataset.columns = columns_names
env_dataset.drop(['Aspect', 'Slope', 'Isothermality'], axis=1, inplace=True)
env_dataset = env_dataset.round(decimals=3)
env_dataset = env_dataset.loc[~(env_dataset['Country'] == 'Indian Ocean Territories')]
env_dataset = env_dataset.loc[~(env_dataset['Country'] == 'Kosovo')]
env_dataset = env_dataset.loc[~(env_dataset['Country'] == 'Northern Cyprus')]
countries_codes_map = {country: get_country_code(country) for country in env_dataset['Country']}

Let's check if all countries got their code

In [None]:
for country in countries_codes_map.keys():
    if countries_codes_map[country] == 'NaN':
        print(country)

As we can see, some countries were left without a code. This may be due to the fact that their names in the dataset are different from the names given by the standard. For simplicity, we'll fill them in manually.<br>
*Note*: Siachen Glacier does not have its own ISO 3166-1 alpha-3 code.

In [None]:
missing_countries_codes = {
    'Democratic Republic of the Congo': 'COD',
    'Indian Ocean Territories': 'IOT',
    'Laos': 'LAO',
    'North Korea': 'PRK',
    'South Georgia and South Sandwich Islands': 'SGS',
    'Ivory Coast': 'CIV',
    'South Korea': 'KOR',
    'Cape Verde': 'CPV',
    'Guinea Bissau': 'GNB',
    'Northern Cyprus': 'CYP',
    'Swaziland': 'SWZ',
    'East Timor': 'TLS',
    'United States Virgin Islands': 'VIR',
    'French Southern and Antarctic Lands': 'ATF',
    'Hong Kong S.A.R.': 'HKG',
    'Pitcairn Islands': 'PCN',
    'Macau S.A.R': 'MAC'
}
for country in missing_countries_codes.keys():
    countries_codes_map[country] = missing_countries_codes[country]

In [None]:
countries_codes_map['Curacao'] = 'CUW'
countries_codes_map['Sint Maarten'] = 'SXM'
countries_codes_map['Niger'] = 'NER'

In [None]:
env_dataset['Country code (ISO)'] = env_dataset['Country'].map(countries_codes_map).astype(str)
env_dataset = env_dataset.loc[env_dataset['Country code (ISO)'] != 'NaN']
env_dataset.shape

In [None]:
env_dataset.head()

<p style='font-size: 26px'>World happiness report dataset</p>

In [None]:
happiness_dataset = pd.read_csv('../input/world-happiness/2015.csv')
happiness_dataset.head()

In [None]:
print('Shape:', happiness_dataset.shape, '\n')
print('Missing values:')
print(happiness_dataset.isnull().sum(), '\n')
print('Data types:')
print(happiness_dataset.dtypes, '\n')

**What we need to do:**
1. Add "Country code (ISO)" column
2. Format float values to 3 decimal places
3. Reformat column names
4. Remove "Region", "Standard Error" and "Happiness Rank"

In [None]:
happiness_dataset = happiness_dataset.round(decimals=3)
happiness_dataset.drop(['Region', 'Standard Error', 'Happiness Rank'], axis=1, inplace=True)

In [None]:
columns_names = ['Country', 'Happiness score',
       'Economy (extent contribution)', 'Family (extent contribution)', 'Life expectancy (extent contribution)',
       'Freedom (extent contribution)', 'Corruption (extent contribution)', 'Generosity (extent contribution)',
       'Dystopia Residual (extent contribution)']
happiness_dataset.columns = columns_names
happiness_dataset['Country'] = happiness_dataset['Country'].astype(str)
happiness_dataset = happiness_dataset.loc[~(happiness_dataset['Country'] == 'Kosovo')]
countries_codes_map = {country: get_country_code(country) for country in happiness_dataset['Country']}

Let's check if all countries got their code

In [None]:
for country in countries_codes_map.keys():
    if countries_codes_map[country] == 'NaN':
        print(country)

As we can see, some countries were left without a code. This may be due to the fact that their names in the dataset are different from the names given by the standard. For simplicity, we'll fill them in manually.

In [None]:
missing_countries_codes = {
    'South Korea': 'KOR',
    'Somaliland region': 'SOM',
    'Laos': 'LAO',
    'Swaziland': 'SWZ',
    'Palestinian Territories': 'PSE',
    'Hong Kong S.A.R. of China': 'HKG',
    'Congo (Kinshasa)': 'COD',
    'Congo (Brazzaville)': 'COG',
    'Ivory Coast': 'CIV',
}
for country in missing_countries_codes.keys():
    countries_codes_map[country] = missing_countries_codes[country]

In [None]:
countries_codes_map['Niger'] = 'NER'

In [None]:
happiness_dataset['Country code (ISO)'] = happiness_dataset['Country'].map(countries_codes_map).astype(str)
happiness_dataset = happiness_dataset.loc[happiness_dataset['Country code (ISO)'] != 'NaN']
happiness_dataset.shape

In [None]:
happiness_dataset.head()

<p style='font-size: 26px'>CO2 and GHG emission dataset</p>

In [None]:
emission_dataset = pd.read_csv('../input/co2-and-ghg-emission-data/emission data.csv')
emission_dataset.head()

In [None]:
print('Shape:', emission_dataset.shape, '\n')
print('Missing values:')
print(emission_dataset[['2014', '2015', '2016', '2017']].isnull().sum(), '\n')
print('Data types:')
print(emission_dataset[['2014', '2015', '2016', '2017']].dtypes, '\n')

**What we need to do:**
1. Add "Country code (ISO)" column
2. Reformat columns names
3. Leave only data for 2015
4. Remove data for regions (not countries)

In [None]:
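# Take an explicit copy so the in-place edits below do not act on a view of the original frame
emission_dataset = emission_dataset[['Country', '2015']].copy()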
emission_dataset.rename({'2015': 'Emission, tons'}, axis=1, inplace=True)
emission_dataset['Country'] = emission_dataset['Country'].astype(str)

In [None]:
for item in emission_dataset['Country']:
    print(item, end=', ')

Removing:
1. Americas (other)
2. Antarctic Fisheries
3. Asia and Pacific (other)
4. EU-28
5. Europe (other)
6. Middle East
7. Palestine (does not have a unified ISO code)
8. South Africa
9. World

Renaming: "Micronesia (country)" to "Micronesia"

In [None]:
emission_dataset = emission_dataset.loc[~emission_dataset['Country'].isin(['Americas (other)', 'Asia and Pacific (other)', 'EU-28', 'Europe (other)', 
                                                                           'Antarctic Fisheries', 'Middle East', 'Palestine', 'South Africa', 'World'])]
emission_dataset.loc[emission_dataset['Country'] == 'Micronesia (country)', 'Country'] = 'Micronesia'
emission_dataset = emission_dataset.loc[~((emission_dataset['Country'] == 'Kyrgysztan') & (emission_dataset['Emission, tons'] == 0.0))]

In [None]:
countries_codes_map = {country: get_country_code(country) for country in emission_dataset['Country']}

Let's check if all countries got their code

In [None]:
for country in countries_codes_map.keys():
    if countries_codes_map[country] == 'NaN':
        print(country)

As we can see, some countries were left without a code. This may be due to the fact that their names in the dataset are different from the names given by the standard. For simplicity, we'll fill them in manually.<br>
*Note*: Czechoslovakia was divided into Czechia (CZ, CZE, 203) and Slovakia (SK, SVK, 703), so we simply remove it. More info: https://www.iso.org/obp/ui/#iso:code:3166:CSHH

In [None]:
missing_countries_codes = {
    'Bonaire Sint Eustatius and Saba': 'BES',
    'Cape Verde': 'CPV',
    'Democratic Republic of Congo': 'COD',
    'Faeroe Islands': 'FRO',
    'Kyrgysztan': 'KGZ',
    'Laos': 'LAO',
    'North Korea': 'PRK',
    'South Korea': 'KOR',
    'Swaziland': 'SWZ',
    'Wallis and Futuna Islands': 'WLF'
}
for country in missing_countries_codes.keys():
    countries_codes_map[country] = missing_countries_codes[country]

In [None]:
countries_codes_map['Niger'] = 'NER'
countries_codes_map['Republic of Korea'] = 'KOR'
countries_codes_map['Guadeloupe'] = 'GLP'
countries_codes_map['Curacao'] = 'CUW'

In [None]:
emission_dataset['Country code (ISO)'] = emission_dataset['Country'].map(countries_codes_map).astype(str)
emission_dataset = emission_dataset.loc[emission_dataset['Country code (ISO)'] != 'NaN']
emission_dataset.shape

In [None]:
emission_dataset.head()

<p style='font-size: 26px'>Life Expectancy (WHO)</p>

In [None]:
life_dataset = pd.read_csv('../input/life-expectancy-who/Life Expectancy Data.csv')
life_dataset.head()

In [None]:
print('Shape:', life_dataset.shape, '\n')
print('Missing values:')
print(life_dataset.isnull().sum(), '\n')
print('Data types:')
print(life_dataset.dtypes, '\n')

**What we need to do:**
1. Add "Country code (ISO)" column
2. Reformat columns names
3. Leave only one entry per country (2015)
4. Remove "Year", "Status", "Alcohol", "GDP", "Total expenditure" and "Population" columns

In [None]:
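# Take an explicit copy so the in-place drop below does not act on a view of the original frame
life_dataset = life_dataset.loc[life_dataset['Year'] == 2015].copy()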
life_dataset.drop(['Year', 'Status', 'Alcohol', 'GDP', 'Total expenditure', 'Population'], axis=1, inplace=True)

In [None]:
columns_names = ['Country', 'Life expectancy, age', 'Adult mortality, per 1000', 'Infant deaths, per 1000',
       'Expenditure on health, % of GDP', 'Hepatitis B immunization, %', 'Measles, per 1000', 'BMI',
       'Under-five deaths, per 1000', 'Polio immunization, %', 'Diphtheria immunization, %',
       'HIV/AIDS infant deaths, per 1000', 'Thinness 1-19 years, %',
       'Thinness 5-9 years, %', 'Human Development Index (0 to 1)', 'School years']
life_dataset.columns = columns_names

In [None]:
countries_codes_map = {country: get_country_code(country) for country in life_dataset['Country']}

Let's check if all countries got their code

In [None]:
for country in countries_codes_map.keys():
    if countries_codes_map[country] == 'NaN':
        print(country)

As we can see, some countries were left without a code. This may be due to the fact that their names in the dataset are different from the names given by the standard. For simplicity, we'll fill them in manually.<br>

In [None]:
missing_countries_codes = {
    'Bolivia (Plurinational State of)': 'BOL',
    'Democratic Republic of the Congo': 'COD',
    'Iran (Islamic Republic of)': 'IRN',
    'Micronesia (Federated States of)': 'FSM',
    'Swaziland': 'SWZ',
    'The former Yugoslav republic of Macedonia': 'MKD',
    'Venezuela (Bolivarian Republic of)': 'VEN'
}
for country in missing_countries_codes.keys():
    countries_codes_map[country] = missing_countries_codes[country]

In [None]:
countries_codes_map['Niger'] = 'NER'
countries_codes_map['Republic of Korea'] = 'KOR'

In [None]:
life_dataset['Country code (ISO)'] = life_dataset['Country'].map(countries_codes_map).astype(str)
life_dataset = life_dataset.loc[life_dataset['Country code (ISO)'] != 'NaN']
life_dataset.shape

Filling in some missing values:
1. USA HDI 2015 - 0.921 (Source: https://countryeconomy.com/hdi/usa?year=2015)

In [None]:
life_dataset.loc[life_dataset['Country code (ISO)'] == 'USA', 'Human Development Index (0 to 1)'] = 0.921

In [None]:
life_dataset.head()

<p style='font-size: 26px'>Alcohol consumption dataset</p>

In [None]:
alcohol_dataset = pd.read_csv('../input/alcohol-consumption/gapminder_alcohol.csv')
alcohol_dataset.head()

In [None]:
print('Shape:', alcohol_dataset.shape, '\n')
print('Missing values:')
print(alcohol_dataset.isnull().sum(), '\n')
print('Data types:')
print(alcohol_dataset.dtypes, '\n')

**What we need to do:**
1. Add "Country code (ISO)" column
2. Reformat columns names
3. Remove "income per person" column

In [None]:
alcohol_dataset.drop(['incomeperperson'], axis=1, inplace=True)
columns_names = ['Country', 'Alcohol consumption', 'Suicides, per 100', 'Employ rate, %', 'Urban rate, %']
alcohol_dataset.columns = columns_names
alcohol_dataset = alcohol_dataset.round(decimals=3)

In [None]:
countries_codes_map = {country: get_country_code(country) for country in alcohol_dataset['Country']}

Let's check if all countries got their code

In [None]:
for country in countries_codes_map.keys():
    if countries_codes_map[country] == 'NaN':
        print(country)

As we can see, some countries were left without a code. This may be due to the fact that their names in the dataset are different from the names given by the standard. For simplicity, we'll fill them in manually.

In [None]:
missing_countries_codes = {
    'Cape Verde': 'CPV',
    'Central African Rep.': 'CAF',
    'Congo, Dem. Rep.': 'COD',
    'Congo, Rep.': 'COG',
    'Czech Rep.': 'CZE',
    'Dominican Rep.': 'DOM',
    'Faeroe Islands': 'FRO',
    'Hong Kong, China': 'HKG',
    'Korea, Dem. Rep.': 'PRK',
    'Korea, Rep.': 'KOR',
    'Laos': 'LAO',
    'Macao, China': 'MAC',
    'Macedonia, FYR': 'MKD',
    'Micronesia, Fed. Sts.': 'FSM',
    'Netherlands Antilles': 'ANT',
    'Serbia and Montenegro': 'SCG',
    'Swaziland': 'SWZ',
    'West Bank and Gaza': 'PSE',
    'Yemen, Rep.': 'YEM'
}
for country in missing_countries_codes.keys():
    countries_codes_map[country] = missing_countries_codes[country]

In [None]:
countries_codes_map['Niger'] = 'NER'
countries_codes_map['Guadeloupe'] = 'GLP'

In [None]:
alcohol_dataset['Country code (ISO)'] = alcohol_dataset['Country'].map(countries_codes_map).astype(str)
alcohol_dataset = alcohol_dataset.loc[alcohol_dataset['Country code (ISO)'] != 'NaN']
alcohol_dataset.shape

In [None]:
alcohol_dataset.head()

<p style='font-size: 26px'>Income by country dataset</p>

In [None]:
income_dataset = pd.read_excel('../input/income-by-country/Income by Country.xlsx', sheet_name=['GDP per capita', 'Income Index'])
income_dataset['GDP per capita']

In [None]:
income_dataset['Income Index']

**What we need to do:**
1. Add "Country code (ISO)" column
2. Reformat column names
3. Combine data into one dataframe
4. Convert columns to *float64*
5. Leave only one entry per country (2015)

In [None]:
income_dataset = income_dataset['GDP per capita'][['Country', 2015]].merge(income_dataset['Income Index'][['Country', 2015]], on='Country', how='outer')

In [None]:
income_dataset.rename({'2015_x': 'GDP, $ per capita', '2015_y': 'Income index (natural log)'}, axis=1, inplace=True)

income_dataset.loc[income_dataset['GDP, $ per capita'] == '..', 'GDP, $ per capita'] = np.nan
income_dataset.loc[income_dataset['Income index (natural log)'] == '..', 'Income index (natural log)'] = np.nan
income_dataset['GDP, $ per capita'] = income_dataset['GDP, $ per capita'].astype(np.float64)
income_dataset['Income index (natural log)'] = income_dataset['Income index (natural log)'].astype(np.float64)

income_dataset = income_dataset.round(decimals=3)
countries_codes_map = {country: get_country_code(country) for country in income_dataset['Country']}
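
The placeholder handling above can also be written with `pd.to_numeric`. This is a sketch on a made-up three-row frame, not a replacement for the cell above. The behaviour differs in one respect: `errors='coerce'` turns every unparsable value into `NaN`, not only the `'..'` placeholder.

```python
import pandas as pd

# Made-up sample that mixes numbers, numeric strings and the '..' placeholder.
raw = pd.DataFrame({
    'Country': ['A', 'B', 'C'],
    'GDP, $ per capita': [1234.5, '..', '987'],
})

# Anything that cannot be parsed as a number becomes NaN.
raw['GDP, $ per capita'] = pd.to_numeric(raw['GDP, $ per capita'], errors='coerce')

print(raw)
print(raw.dtypes)
```

Whether the stricter or the more permissive variant is preferable depends on how much the source file is trusted; the explicit replacement fails loudly on unexpected strings, while coercion silently hides them.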

Let's check if all countries got their code

In [None]:
for country in countries_codes_map.keys():
    if countries_codes_map[country] == 'NaN':
        print(country)

As we can see, some countries were left without a code. This may be due to the fact that their names in the dataset are different from the names given by the standard. For simplicity, we'll fill them in manually. Also, there are samples here that are not directly related to countries. They need to be removed:
1. Human Development
2. Very high human development
3. High human development
4. Medium human development
5. Low human development
6. Developing Countries
7. Regions
8. Arab States
9. East Asia and the Pacific
10. Europe and Central Asia
11. Latin America and the Caribbean
12. South Asia
13. Sub-Saharan Africa
14. Least Developed Countries
15. Small Island Developing States
16. Organization for Economic Co-operation and Development
17. World

In [None]:
missing_countries_codes = {
    'Bolivia (Plurinational State of)': 'BOL',
    'Congo (Democratic Republic of the)': 'COD',
    'CÃ´te d\'Ivoire': 'CIV',
    'Eswatini (Kingdom of)': 'SWZ',
    'Hong Kong; China (SAR)': 'HKG',
    'Iran (Islamic Republic of)': 'IRN',
    'Korea (Republic of)': 'KOR',
    'Micronesia (Federated States of)': 'FSM',
    'Moldova (Republic of)': 'MDA',
    'Palestine; State of': 'PSE',
    'Tanzania (United Republic of)': 'TZA',
    'Venezuela (Bolivarian Republic of)': 'VEN'
}
for country in missing_countries_codes.keys():
    countries_codes_map[country] = missing_countries_codes[country]

In [None]:
countries_codes_map['Niger'] = 'NER'

In [None]:
income_dataset['Country code (ISO)'] = income_dataset['Country'].map(countries_codes_map).astype(str)
income_dataset = income_dataset.loc[income_dataset['Country code (ISO)'] != 'NaN']
income_dataset.shape

In [None]:
income_dataset.head()

<p style='font-size: 26px'>Combining</p>

Let's use the pandas merge function to combine data into one dataframe.
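
As a small illustration of what an outer merge on the code column does, the sketch below combines three made-up frames with `functools.reduce`. The frames and their values are invented; only the key column name is taken from this notebook.

```python
import functools

import pandas as pd

KEY = 'Country code (ISO)'

# Three tiny made-up tables that cover different sets of countries.
a = pd.DataFrame({KEY: ['FRA', 'JPN'], 'Population': [1, 2]})
b = pd.DataFrame({KEY: ['JPN', 'BRA'], 'Happiness score': [5.9, 6.9]})
c = pd.DataFrame({KEY: ['FRA'], 'Emission, tons': [10.0]})

# An outer merge keeps every code that appears in any table;
# values missing from a table are filled with NaN.
combined = functools.reduce(
    lambda left, right: left.merge(right, on=KEY, how='outer'),
    [a, b, c],
)
print(combined)
```

With `how='outer'` no country is lost just because one dataset lacks it, at the cost of more missing values in the combined table.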

In [None]:
datasets = [countries_dataset, env_dataset, happiness_dataset, emission_dataset, life_dataset, alcohol_dataset, income_dataset]
df = datasets[0].drop('Country', axis=1)
for dataset in datasets[1:]:
    df = df.merge(dataset.drop('Country', axis=1), on='Country code (ISO)', how='outer')

In [None]:
df.head()
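
Since the code column serves as the join key, a code that occurs twice in one input produces extra rows after merging. A quick sanity check on the key can catch that. The sketch below uses only the standard library; `check_codes` is a hypothetical helper and the sample list is made up. It checks the shape of a code (three uppercase letters), not whether the code is actually assigned in ISO 3166-1.

```python
import re
from collections import Counter

ALPHA3_SHAPE = re.compile(r'[A-Z]{3}')


def check_codes(codes):
    """Return (malformed, duplicated) codes from an iterable of strings."""
    codes = list(codes)
    malformed = [code for code in codes if not ALPHA3_SHAPE.fullmatch(code)]
    duplicated = sorted(code for code, n in Counter(codes).items() if n > 1)
    return malformed, duplicated


# Made-up sample: one failed lookup marker and one repeated code.
malformed, duplicated = check_codes(['FRA', 'JPN', 'NaN', 'CYP', 'CYP'])
print('malformed:', malformed)
print('duplicated:', duplicated)
```

In the notebook this could be applied to each dataset's "Country code (ISO)" column, for example `check_codes(happiness_dataset['Country code (ISO)'])`.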

We dropped the "Country" columns before merging, so we now use the get_country_name function to recreate one from the codes.

In [None]:
df['Country'] = df['Country code (ISO)'].apply(get_country_name)

Let's check if some countries names are *None*.

In [None]:
df.loc[df['Country'].isnull(), 'Country code (ISO)']

The names of the two countries were not found. Here they are:
1. ANT - Netherlands Antilles
2. SCG - Serbia and Montenegro

Let's fill them in manually.

In [None]:
df.loc[df['Country code (ISO)'] == 'ANT', 'Country'] = 'Netherlands Antilles'
df.loc[df['Country code (ISO)'] == 'SCG', 'Country'] = 'Serbia and Montenegro'

Also, let's add a column with the name of the continent where the country is located.

In [None]:
df['Continent'] = df['Country'].apply(get_continent_code)

Let's check whether every country got a continent code.

In [None]:
df.loc[df['Continent'].isnull(), 'Country']

As we can see, some countries were left without a code. We fill them manually.

In [None]:
missing_values = {
    'Timor-Leste': 'AS',
    'Netherlands Antilles': 'SA',
    'Western Sahara': 'AF',
    'Antarctica': 'AN',
    'French Southern Territories': 'AN',
    'Sint Maarten (Dutch part)': 'NA',
    'Pitcairn': 'OC',
    'Holy See (Vatican City State)': 'EU',
    'Serbia and Montenegro': 'EU'
}
for country in missing_values.keys():
    df.loc[df['Country'] == country, 'Continent'] = missing_values[country]

We can also convert the codes to a more readable form.

In [None]:
continents_codes = {
    'AS': 'Asia', 
    'EU': 'Europe', 
    'AF': 'Africa', 
    'OC': 'Oceania', 
    'NA': 'North america', 
    'SA': 'South america',
    'AN': 'Antarctica'
}
df['Continent'] = df['Continent'].replace(continents_codes)

Okay, let's see what columns we have.

In [None]:
df.columns

We need to change their order to make it easier to navigate. Let's divide the columns into several categories:
1. **Main**<br>
"Country", "Country code (ISO)", "Continent", "Population", "Area, sq. km."
2. **Economics**<br>
"GDP, $ per capita",  "Income index (natural log)", "Pop. Density, per sq. km.", "Net migration, %", "Phones, per 1000", "Urban rate, %", "Agriculture, %", "Industry, %", "Service, %", "Expenditure on health, % of GDP"
3. **Life**<br>
"Literacy, %", "Accessibility to cities, min", "Happiness score", "Economy (extent contribution)", "Family (extent contribution)", "Life expectancy (extent contribution)", "Freedom (extent contribution)", "Corruption (extent contribution)", "Generosity (extent contribution)", "Dystopia Residual (extent contribution)", "Thinness  1-19 years, %", "Thinness 5-9 years, %", "Human Development Index (0 to 1)", "School years", "Alcohol consumption", "Suicides, per 100", "Employ rate, %"
4. **Health**<br>
"Birthrate, per 1000 inhabitants", "Life expectancy, age", "Adult mortality, per 1000", "Infant deaths, per 1000", "Hepatitis B immunization, %", "Measles, per 1000", "BMI", "Under-five deaths, per 1000", "Polio immunization, %","Diphtheria immunization, %", "HIV/AIDS infant deaths, per 1000"
5. **Nature and environment**<br>
"Coastline (coast/area ratio)", "Arable, %", "Crops, %", "Other, %", "Elevation, m", "Cropland cover, %", "Tree canopy cover, %", "Rain coldestquart, mm", "Rain driestmonth, mm", "Rain driestquart, mm", "Rain mean annual, mm", "Rain seasonailty", "Rain warmestquart, mm", "Rain wettestmonth, mm", "Rain wettestquart, mm", "Temp annual range, %", "Temp coldestquart, degC", "Temp diurnal range, degC", "Temp driestquart, degC", "Temp max warmestmonth, degC", "Temp mean annual, degC", "Temp min coldestmonth, degC", "Temp seasonality, degC", "Temp warmestquart, degC", "Temp wettestquart, degC", "Wind, m/s", "Cloudiness, days per year", "Emission, tons"

In [None]:
columns_order = ['Country', 'Country code (ISO)', 'Continent', 'Population', 'Area, sq. km.', 'GDP, $ per capita', 'Income index (natural log)', 'Pop. Density, per sq. km.', 
                 'Net migration, %', 'Phones, per 1000', 'Urban rate, %', 'Agriculture, %', 'Industry, %', 'Service, %', 'Expenditure on health, % of GDP', 
                 'Literacy, %', 'Accessibility to cities, min', 'Happiness score', 
                 'Economy (extent contribution)', 'Family (extent contribution)', 'Life expectancy (extent contribution)', 'Freedom (extent contribution)', 
                 'Corruption (extent contribution)', 'Generosity (extent contribution)', 'Dystopia Residual (extent contribution)', 'Thinness 1-19 years, %', 
                 'Thinness 5-9 years, %', 'Human Development Index (0 to 1)', 'School years', 'Alcohol consumption', 'Suicides, per 100', 'Employ rate, %', 
                 'Birthrate, per 1000 inhabitants', 'Life expectancy, age', 'Adult mortality, per 1000', 'Infant deaths, per 1000', 'Hepatitis B immunization, %', 
                 'Measles, per 1000', 'BMI', 'Under-five deaths, per 1000', 'Polio immunization, %','Diphtheria immunization, %', 'HIV/AIDS infant deaths, per 1000', 
                 'Coastline (coast/area ratio)', 'Arable, %', 'Crops, %', 'Other, %', 'Elevation, m', 'Cropland cover, %', 'Tree canopy cover, %', 
                 'Rain coldestquart, mm', 'Rain driestmonth, mm', 'Rain driestquart, mm', 'Rain mean annual, mm', 'Rain seasonailty', 'Rain warmestquart, mm', 
                 'Rain wettestmonth, mm', 'Rain wettestquart, mm', 'Temp annual range, %', 'Temp coldestquart, degC', 'Temp diurnal range, degC', 
                 'Temp driestquart, degC', 'Temp max warmestmonth, degC', 'Temp mean annual, degC', 'Temp min coldestmonth, degC', 'Temp seasonality, degC', 
                 'Temp warmestquart, degC', 'Temp wettestquart, degC', 'Wind, m/s', 'Cloudiness, days per year', 'Emission, tons']
df = df[columns_order]

# 3. Some feature engineering

Such a number of features makes further analysis difficult, so we should combine or remove some columns:
1. Life<br>
    1. Remove "Economy (extent contribution)", "Family (extent contribution)", "Life expectancy (extent contribution)", "Freedom (extent contribution)", "Corruption (extent contribution)", "Generosity (extent contribution)", "Dystopia Residual (extent contribution)"
    2. Remove "Thinness 5-9 years, %"
2. Health<br>
    1. Remove "Under-five deaths, per 1000"
    2. Add "Population growth vector, per 1000" (Birthrate - Adult mortality - Infant deaths)
3. Nature and environment
    1. Remove "Cropland cover, %"
    2. Leave only "Rain mean annual, mm"
    3. Leave only "Temp mean annual, degC"
    4. Add "Temp range, degC" (Temp max - Temp min)
    5. Remove "Wind, m/s", "Cloudiness, days per year" (too many missing values)

In [None]:
df.drop(['Economy (extent contribution)', 'Family (extent contribution)', 'Life expectancy (extent contribution)', 'Freedom (extent contribution)', 
        'Corruption (extent contribution)', 'Generosity (extent contribution)', 'Dystopia Residual (extent contribution)'], axis=1, inplace=True)
df.drop('Thinness 5-9 years, %', axis=1, inplace=True)

In [None]:
df.drop('Under-five deaths, per 1000', axis=1, inplace=True)
df['Population growth vector, per 1000'] = df['Birthrate, per 1000 inhabitants'] - df['Adult mortality, per 1000'] - df['Infant deaths, per 1000']

In [None]:
df.drop(['Cropland cover, %', 'Wind, m/s', 'Cloudiness, days per year'], axis=1, inplace=True)
df.drop(['Rain coldestquart, mm', 'Rain driestmonth, mm', 'Rain driestquart, mm', 'Rain seasonailty', 'Rain warmestquart, mm', 
         'Rain wettestmonth, mm', 'Rain wettestquart, mm'], axis=1, inplace=True)
df['Temp range, degC'] = df['Temp max warmestmonth, degC'] - df['Temp min coldestmonth, degC']
df.drop(['Temp annual range, %', 'Temp coldestquart, degC', 'Temp diurnal range, degC', 'Temp driestquart, degC', 'Temp max warmestmonth, degC', 
         'Temp min coldestmonth, degC', 'Temp seasonality, degC', 'Temp warmestquart, degC', 'Temp wettestquart, degC'], axis=1, inplace=True)
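
The cells above drop columns in place, so running one of them a second time raises a `KeyError`. Below is a minimal sketch of a re-runnable variant, assuming a toy frame with made-up values (only the column names mirror the notebook). It derives the new feature first and then drops the source columns with `errors='ignore'`.

```python
import pandas as pd

# Toy frame with made-up values; only the column names mirror the notebook.
toy = pd.DataFrame({
    'Country': ['A', 'B'],
    'Temp max warmestmonth, degC': [30.0, 18.5],
    'Temp min coldestmonth, degC': [-5.0, 2.5],
})

# Derive the new feature while its source columns still exist.
toy['Temp range, degC'] = toy['Temp max warmestmonth, degC'] - toy['Temp min coldestmonth, degC']

to_drop = ['Temp max warmestmonth, degC', 'Temp min coldestmonth, degC']

# errors='ignore' skips labels that are already gone, so the second call is a no-op.
toy = toy.drop(columns=to_drop, errors='ignore')
toy = toy.drop(columns=to_drop, errors='ignore')

print(toy)
```

The trade-off is that `errors='ignore'` also hides misspelled column names, so it is best used only once the column list is known to be correct.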

In [None]:
df.head()

By the way, let's see how many countries are assigned to Antarctica.

In [None]:
df.loc[df['Continent'] == 'Antarctica']

Given how few records there are and how many values they are missing, they would only interfere with charting, so let's just remove them.

In [None]:
df = df.loc[~(df['Continent'] == 'Antarctica')]
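
One way to back up a decision like this with numbers is to compare the share of missing cells per continent. The sketch below uses a toy frame with made-up values, not the notebook's data.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values, used only to illustrate the check.
toy = pd.DataFrame({
    'Country': ['A', 'B', 'C', 'D'],
    'Continent': ['Europe', 'Europe', 'Antarctica', 'Antarctica'],
    'Population': [5.0e6, 1.2e7, np.nan, np.nan],
    'GDP, $ per capita': [30000.0, np.nan, np.nan, np.nan],
})

# Fraction of missing cells in each row, averaged within each continent.
missing_share = toy.isna().mean(axis=1).groupby(toy['Continent']).mean()

print(missing_share)
```

A continent whose rows are mostly empty adds little to the charts, which is the argument made above for Antarctica.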

I will save the resulting dataframe to 'result_df.csv' in the notebook's output directory. If you would like to use it in your public work, please cite this notebook as a source.

In [None]:
df.to_csv('./result_df.csv')
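
Note that `to_csv` called this way also writes the dataframe index as the first, unnamed column. The sketch below uses a toy frame and an in-memory buffer instead of a real file. It shows what that means when reading the file back, and how `index_col=0` restores the original layout.

```python
import io

import pandas as pd

# Toy frame with made-up values; a StringIO buffer stands in for the CSV file.
toy = pd.DataFrame({'Country': ['A', 'B'], 'Population': [100, 200]})

buffer = io.StringIO()
toy.to_csv(buffer)  # the index is written as the first, unnamed column
buffer.seek(0)

# A plain read turns the saved index into an extra 'Unnamed: 0' column.
naive = pd.read_csv(buffer)
buffer.seek(0)

# index_col=0 treats the first column as the index again.
restored = pd.read_csv(buffer, index_col=0)

print(list(naive.columns))
print(list(restored.columns))
```

Alternatively, passing `index=False` to `to_csv` leaves the index out of the file altogether.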

# 4. Visualization

So, we have prepared the data. Let's see what we have.<br>
**Features:**
1. *Country* - Name of country
2. *Country code (ISO)* - Country [ISO 3166-1 alpha-3](https://en.wikipedia.org/wiki/ISO_3166-1_alpha-3) code
3. *Continent* - Continent name
4. *Population* - Population of country
5. *Area, sq. km.* - Area in square kilometers
6. *GDP, \$ per capita* - [Gross domestic product](https://en.wikipedia.org/wiki/Gross_domestic_product) per capita in \$
7. *Income index (natural log)* - [Gross national income](https://en.wikipedia.org/wiki/Gross_national_income) (natural logarithm)
8. *Pop. Density, per sq. km.* - Population density per square kilometer
9. *Net migration, %* - Migration, percentage of residents
10. *Phones, per 1000* - Mobile phones per 1000 residents
11. *Urban rate, %* - Percentage of residents living in cities
12. *Agriculture, %* - Percentage of residents engaged in agriculture
13. *Industry, %* - Percentage of residents engaged in industry
14. *Service, %* - Percentage of residents engaged in service
15. *Expenditure on health, % of GDP* - Expenditure on health in percent of GDP
16. *Literacy, %* - Percentage of literate residents
17. *Accessibility to cities, min* - Average driving time to the city in minutes
18. *Happiness rank* (deleted)
19. *Happiness score* - Average level of happiness according to the population survey
20. *Thinness 1-19 years, %* - Prevalence of malnutrition among children aged 1-19
21. *Human Development Index (0 to 1)* - [Human Development Index](https://en.wikipedia.org/wiki/Human_Development_Index) from 0 to 1
22. *School years* - Mean education years
23. *Alcohol consumption* - Alcohol consumption per person
24. *Suicides, per 100* - Suicides per 100 residents
25. *Employ rate, %* - Employment percentage
26. *Birthrate, per 1000 inhabitants* - Birthrate
27. *Life expectancy, age* - Average life expectancy
28. *Adult mortality, per 1000* - Adult (16-65 years) mortality
29. *Infant deaths, per 1000* - Infant deaths
30. *Hepatitis B immunization, %* - Hepatitis B immunization percentage
31. *Measles, per 1000* - Reported measles cases per 1000 people
32. *BMI* - [Body Mass Index](https://en.wikipedia.org/wiki/Body_mass_index)
33. *Polio immunization, %* - Polio immunization percentage
34. *Diphtheria immunization, %* - Diphtheria immunization percentage
35. *HIV/AIDS infant deaths, per 1000* - HIV/AIDS infant deaths per 1000 births
36. *Coastline (coast/area ratio)* - Ratio of coastline length to area
37. *Arable, %* - Percentage of arable land
38. *Crops, %* - Percentage of cropland
39. *Other, %* - Percentage of other land
40. *Elevation, m* - Height above sea level
41. *Tree canopy cover, %* - Percentage of land with trees
42. *Isothermality* (deleted)
43. *Rain mean annual, mm* - Average annual precipitation in millimeters
44. *Temp mean annual, degC* - Average annual temperature
45. *Emission, tons* - Emission per year in tons
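46. *Population growth vector, per 1000* - Birthrate minus deathrate (adult mortality plus infant deaths)
47. *Temp range, degC* - Difference between the warmest-month maximum and the coldest-month minimum temperature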

<p style='font-size: 26px'>Distributions and histograms</p>

**GDP distribution** shows how GDP per capita is distributed across continents

In [None]:
px.histogram(df, x='GDP, $ per capita', color='Continent', barmode='overlay', marginal='box', title='GDP distribution', nbins=50, height=600)

In the **Life expectancy distribution** we can see significant differences in average life expectancy between continents. Later, we will try to understand what causes this.

In [None]:
px.histogram(df, x='Life expectancy, age', color='Continent', barmode='overlay', marginal='box', title='Life expectancy', nbins=20, height=600)
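
To put numbers on the differences seen in the histogram, a per-continent summary table can help. This is a minimal sketch on a toy frame with made-up values. Applied to `df`, the same `groupby` call would summarize the notebook's data.

```python
import pandas as pd

# Toy frame with made-up values; only the column names mirror the notebook.
toy = pd.DataFrame({
    'Continent': ['Africa', 'Africa', 'Africa', 'Europe', 'Europe', 'Europe'],
    'Life expectancy, age': [58.0, 62.0, 66.0, 78.0, 80.0, 82.0],
})

# Count, median and range of life expectancy within each continent.
summary = (toy.groupby('Continent')['Life expectancy, age']
              .agg(['count', 'median', 'min', 'max']))

print(summary)
```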

**Mean education years distribution** shows how many years people spend in education on different continents

In [None]:
px.histogram(df, x='School years', color='Continent', barmode='overlay', marginal='box', title='Mean education years distribution', nbins=30, height=600)

<p style='font-size: 26px'>Correlation</p>

**Correlation matrix** shows the [Pearson correlation coefficient](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) between each pair of features. With it, we can see simple linear correlations. However, it does not always show the real dependencies, because it captures only linear relationships. In this graph, the more intense the color, the stronger the correlation.

In [None]:
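# numeric_only=True skips text columns such as 'Country' (required on pandas >= 2.0)
px.imshow(df.corr(numeric_only=True), color_continuous_scale=['#07f', '#fff', '#07f'], title='Correlation matrix', height=1000)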
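
To illustrate why the Pearson coefficient can miss a real dependency, here is a small sketch on synthetic data (the column names `x`, `linear` and `quadratic` are made up for this example). The quadratic column is fully determined by `x`, yet its correlation with `x` is close to zero because the relationship is not linear.

```python
import numpy as np
import pandas as pd

# Synthetic data, symmetric around zero.
x = np.linspace(-1.0, 1.0, 201)
toy = pd.DataFrame({
    'x': x,
    'linear': 3.0 * x + 2.0,  # perfectly linear in x
    'quadratic': x ** 2,      # fully determined by x, but not linearly
})

corr = toy.corr()

print(corr.round(3))
```

A scatter plot of the two columns would reveal the dependency at once, which is why the scatter charts later in this section are a useful complement to the matrix.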

Feature pairs with the highest correlation

In [None]:
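# numeric_only=True skips text columns (required on pandas >= 2.0)
corr_matrix = df.corr(numeric_only=True).abs()
# Keep the strict upper triangle so each pair appears once; np.bool was removed in NumPy 1.24, so use the builtin bool
top_corr = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool)).stack().sort_values(ascending=False)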
top_corr.to_frame().rename(columns={0: 'Correlation'}).head(20)
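
The upper-triangle mask used above may be easier to follow on a small example. The sketch below uses a toy frame with three made-up columns. The mask is `True` only strictly above the diagonal, so self-correlations and mirrored duplicates are removed and each pair appears exactly once. The explicit `dropna()` is an addition of this sketch, so that the result does not depend on whether `stack()` discards the masked cells itself.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values.
toy = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0],
    'b': [2.0, 4.0, 6.0, 8.5],
    'c': [4.0, 3.0, 2.0, 0.0],
})

corr_abs = toy.corr().abs()

# True strictly above the diagonal (k=1), False on and below it.
mask = np.triu(np.ones(corr_abs.shape, dtype=bool), k=1)

# Masked cells become NaN, then the matrix is flattened to (feature, feature) pairs.
pairs = corr_abs.where(mask).stack().dropna().sort_values(ascending=False)

print(pairs)
```

With three columns this leaves three pairs: (a, b), (a, c) and (b, c).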

Most interesting dependencies:
1. Human Development Index, School years, Life expectancy, Happiness score and Birthrate (pairwise and together)
2. Human Development Index, GDP, Income index and Phones

Let's see how some of these features are jointly distributed.

In [None]:
px.scatter_3d(df.loc[~df['Birthrate, per 1000 inhabitants'].isnull()], x='School years', y='Human Development Index (0 to 1)', z='Life expectancy, age', 
              color='Happiness score', hover_name='Country', size='Birthrate, per 1000 inhabitants', opacity=0, title='School years/HDI/Life expectancy', 
             width=750, height=750)

In [None]:
px.scatter_3d(df.loc[~df['Birthrate, per 1000 inhabitants'].isnull()], x='GDP, $ per capita', y='Human Development Index (0 to 1)', z='Phones, per 1000', 
              color='Happiness score', hover_name='Country', size='Birthrate, per 1000 inhabitants', opacity=0, title='GDP/HDI/Phones', 
             width=750, height=750)

**Pairwise features dependencies**

In [None]:
px.scatter_matrix(df, dimensions=['GDP, $ per capita', 'Phones, per 1000', 'Human Development Index (0 to 1)', 'Birthrate, per 1000 inhabitants', 
                                  'Life expectancy, age', 'Happiness score'], color='School years', hover_name='Country', width=1000, height=1000, 
                  labels={'GDP, $ per capita': 'GDP $', 'Human Development Index (0 to 1)': 'HDI', 'Life expectancy, age': 'Life expectancy', 
                          'Birthrate, per 1000 inhabitants': 'Birthrate per 1000'}, hover_data=['Continent']).update_traces(diagonal_visible=False)

<p style='font-size: 26px'>Maps</p>

Interactive map with main information about countries

In [None]:
px.choropleth(df, locations='Country code (ISO)', color='Human Development Index (0 to 1)', hover_name='Country', 
              hover_data=['Continent', 'Area, sq. km.', 'Population', 'Pop. Density, per sq. km.', 'GDP, $ per capita', 'Happiness score', 
                          'Human Development Index (0 to 1)', 'Life expectancy, age', 'Literacy, %', 'Temp mean annual, degC'], title='Main information')

Map showing different health metrics for each country

In [None]:
px.choropleth(df, locations='Country code (ISO)', color='Life expectancy, age', hover_name='Country', 
              hover_data=['Continent', 'Life expectancy, age', 'BMI', 'Birthrate, per 1000 inhabitants', 'Adult mortality, per 1000', 'Infant deaths, per 1000', 
                          'Population growth vector, per 1000', 'Expenditure on health, % of GDP', 'Measles, per 1000', 'Hepatitis B immunization, %',
                          'Polio immunization, %', 'Diphtheria immunization, %', 'HIV/AIDS infant deaths, per 1000'], title='Health')

Standard of living map

In [None]:
px.choropleth(df, locations='Country code (ISO)', color='Happiness score', hover_name='Country', 
              hover_data=['Happiness score', 'GDP, $ per capita', 'Phones, per 1000', 'Urban rate, %', 'Literacy, %', 
                          'Accessibility to cities, min', 'Human Development Index (0 to 1)', 'School years', 'Alcohol consumption', 
                          'Suicides, per 100', 'Employ rate, %', 'Agriculture, %', 'Industry, %', 'Service, %'], title="Standard of living")

<p style='font-size: 26px'>Thank you for reading!</p> I would be glad to read your opinions and advice in the comments. Please also share what you think about the dependencies we saw in the graphs and how they might be related. And, of course, if you find a mistake, please let me know so that I can fix it. Thanks!