## Assignment 5: Data Science Revision
### Stefan Dimitrov Velev, 0MI3400521, Big Data Technologies
### Faculty of Mathematics and Informatics, Sofia University

### Task 1: Use Python and data sets from *Our World in Data* to produce thoughtful analyses and interesting visualisations

#### 1. Import required Python packages

In [83]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

#### 2. Read the CSV files

**2.1 Self-reported life satisfaction**

*Data sources: World Happiness Report (2012-2024); Wellbeing Research Centre (2024); Population based on various sources (2023)*

https://ourworldindata.org/grapher/happiness-cantril-ladder?time=latest

In [84]:
df_life_satisfaction = pd.read_csv('./data/happiness-cantril-ladder.csv')

In [85]:
df_life_satisfaction.head()

Unnamed: 0,Entity,Code,Year,Cantril ladder score
0,Afghanistan,AFG,2011,4.25835
1,Afghanistan,AFG,2014,3.575
2,Afghanistan,AFG,2015,3.36
3,Afghanistan,AFG,2016,3.794
4,Afghanistan,AFG,2017,3.6315


In [86]:
print("The number of rows in the life satisfaction data frame is:", len(df_life_satisfaction))

The number of rows in the life satisfaction data frame is: 1787


**2.2 Share in extreme poverty vs. life expectancy**

*Data sources: UN, World Population Prospects (2024); World Bank Poverty and Inequality Platform (2024); HYDE (2023); Gapminder - Population v7 (2022); Gapminder - Systema Globalis (2022)*

https://ourworldindata.org/grapher/extreme-poverty-headcount-ratio-vs-life-expectancy-at-birth

In [87]:
df_extreme_poverty_life_expectancy = pd.read_csv('./data/extreme-poverty-headcount-ratio-vs-life-expectancy-at-birth.csv')

In [88]:
df_extreme_poverty_life_expectancy.head()

Unnamed: 0,Entity,Code,Year,Life expectancy - Sex: all - Age: 0 - Variant: estimates,$2.15 a day - Share of population in poverty,990305-annotations,Population (historical),World regions according to OWID
0,Afghanistan,AFG,1950,28.156,,,7776182.0,
1,Afghanistan,AFG,1951,28.584,,,7879343.0,
2,Afghanistan,AFG,1952,29.014,,,7987783.0,
3,Afghanistan,AFG,1953,29.452,,,8096703.0,
4,Afghanistan,AFG,1954,29.698,,,8207953.0,


In [89]:
print("The number of rows in the extreme poverty vs. life expectancy data frame is:", len(df_extreme_poverty_life_expectancy))

The number of rows in the extreme poverty vs. life expectancy data frame is: 60100


**2.3 Political corruption index**

*Data source: V-Dem (2024)*

https://ourworldindata.org/grapher/political-corruption-index

In [90]:
df_political_corruption = pd.read_csv('./data/political-corruption-index.csv')

In [91]:
df_political_corruption.head()

Unnamed: 0,Entity,Code,Year,"Political corruption index (best estimate, aggregate: average)"
0,Afghanistan,AFG,1789,0.438
1,Afghanistan,AFG,1790,0.438
2,Afghanistan,AFG,1791,0.438
3,Afghanistan,AFG,1792,0.438
4,Afghanistan,AFG,1793,0.438


In [92]:
print("The number of rows in the political corruption data frame is:", len(df_political_corruption))

The number of rows in the political corruption data frame is: 33090
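
The three read-and-count steps above follow the same pattern, so they could be folded into one small helper. The sketch below is only an illustration: `load_owid_csv` is a hypothetical name that does not appear in the notebook, and the inline CSV text stands in for the real files under `./data/` so that the snippet runs on its own.

```python
import io

import pandas as pd


def load_owid_csv(source, label):
    """Read a CSV into a data frame and report how many rows it has.

    `source` may be a file path or any file-like object accepted by pd.read_csv.
    """
    df = pd.read_csv(source)
    print(f"The number of rows in the {label} data frame is: {len(df)}")
    return df


# Stand-in for './data/happiness-cantril-ladder.csv' (two rows copied from the output above)
sample_csv = io.StringIO(
    "Entity,Code,Year,Cantril ladder score\n"
    "Afghanistan,AFG,2011,4.25835\n"
    "Afghanistan,AFG,2014,3.575\n"
)

df_sample = load_owid_csv(sample_csv, "life satisfaction")
```

With the real data, the call would presumably be `load_owid_csv('./data/happiness-cantril-ladder.csv', 'life satisfaction')`, and likewise for the other two files.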


#### 3. Data Cleaning

**3.1 Self-reported life satisfaction**

In [93]:
# Remove the unnecessary columns in the data frame
df_life_satisfaction = df_life_satisfaction[['Entity', 'Year', 'Cantril ladder score']]

In [94]:
# Rename the applicable columns
df_life_satisfaction = df_life_satisfaction.rename(columns={'Entity': 'Country'})

In [95]:
# Keep only the rows for the year 2021
df_life_satisfaction = df_life_satisfaction[df_life_satisfaction['Year'] == 2021]

In [96]:
# Remove rows with missing values
df_life_satisfaction = df_life_satisfaction.dropna()

In [97]:
df_life_satisfaction

Unnamed: 0,Country,Year,Cantril ladder score
8,Afghanistan,2021,2.403800
19,Africa,2021,4.517288
30,Albania,2021,5.198800
41,Algeria,2021,5.122300
57,Argentina,2021,5.967000
...,...,...,...
1741,Vietnam,2021,5.485000
1752,World,2021,5.184147
1763,Yemen,2021,4.196900
1773,Zambia,2021,3.759800


In [98]:
# Remove entries that are aggregates rather than individual countries
non_country_entities = [
    'High-income countries',
    'Low-income countries',
    'Lower-middle-income countries',
    'Upper-middle-income countries',
    'World',
]
df_life_satisfaction = df_life_satisfaction[~df_life_satisfaction['Country'].isin(non_country_entities)]

In [99]:
df_life_satisfaction

Unnamed: 0,Country,Year,Cantril ladder score
8,Afghanistan,2021,2.403800
19,Africa,2021,4.517288
30,Albania,2021,5.198800
41,Algeria,2021,5.122300
57,Argentina,2021,5.967000
...,...,...,...
1730,Venezuela,2021,4.925500
1741,Vietnam,2021,5.485000
1763,Yemen,2021,4.196900
1773,Zambia,2021,3.759800
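
The cleaning steps in this subsection (select columns, rename, keep one year, drop missing values, drop aggregate entries) are repeated for the other two data sets below. One possible way to express them once is sketched here. `clean_owid_frame` and its parameters are hypothetical names introduced for illustration, and the toy frame reuses a few values from the outputs above; the `Code` values for World and Zambia are placeholders.

```python
import pandas as pd


def clean_owid_frame(df, columns, year, rename=None, exclude=()):
    """Apply the cleaning steps used in this notebook in one pass.

    Assumes that, after renaming, the frame has 'Country' and 'Year' columns.
    """
    cleaned = df[columns].rename(columns=rename or {})   # keep and rename columns
    cleaned = cleaned[cleaned['Year'] == year]           # keep a single year
    cleaned = cleaned.dropna()                           # drop rows with missing values
    return cleaned[~cleaned['Country'].isin(exclude)]    # drop aggregate entries


# Toy data for illustration only
raw = pd.DataFrame({
    'Entity': ['Afghanistan', 'Afghanistan', 'World', 'Zambia'],
    'Code': ['AFG', 'AFG', 'OWID_WRL', 'ZMB'],
    'Year': [2017, 2021, 2021, 2021],
    'Cantril ladder score': [3.6315, 2.4038, 5.184147, 3.7598],
})

cleaned = clean_owid_frame(
    raw,
    columns=['Entity', 'Year', 'Cantril ladder score'],
    year=2021,
    rename={'Entity': 'Country'},
    exclude=['World'],
)
print(cleaned)
```

Only the 2021 rows for Afghanistan and Zambia remain in `cleaned`, and the original row labels are preserved, as in the notebook outputs.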


**3.2 Share in extreme poverty vs. life expectancy**

In [100]:
# Remove the unnecessary columns in the data frame
df_extreme_poverty_life_expectancy = df_extreme_poverty_life_expectancy[['Entity', 'Year', 'Life expectancy - Sex: all - Age: 0 - Variant: estimates', '$2.15 a day - Share of population in poverty']]

In [101]:
# Rename the applicable columns
df_extreme_poverty_life_expectancy = df_extreme_poverty_life_expectancy.rename(columns={'Entity': 'Country', 'Life expectancy - Sex: all - Age: 0 - Variant: estimates': 'Life expectancy', '$2.15 a day - Share of population in poverty': 'Share in extreme poverty'})

In [102]:
# Keep only the rows for the year 2021
df_extreme_poverty_life_expectancy = df_extreme_poverty_life_expectancy[df_extreme_poverty_life_expectancy['Year'] == 2021]

In [103]:
# Remove rows with missing values
df_extreme_poverty_life_expectancy = df_extreme_poverty_life_expectancy.dropna()

In [104]:
df_extreme_poverty_life_expectancy

Unnamed: 0,Country,Year,Life expectancy,Share in extreme poverty
2588,Armenia,2021,72.552,0.523521
3683,Austria,2021,81.820,0.485822
5502,Belgium,2021,81.659,0.029965
6024,Benin,2021,59.610,12.723279
6636,Bolivia,2021,61.427,1.964501
...,...,...,...,...
54536,Turkey,2021,75.722,0.442865
56400,United Kingdom,2021,80.708,0.245101
56661,United States,2021,76.384,0.248719
57298,Uruguay,2021,75.434,0.112155


In [105]:
# Remove entries that are not individual countries
df_extreme_poverty_life_expectancy = df_extreme_poverty_life_expectancy[df_extreme_poverty_life_expectancy['Country'] != 'World']

In [106]:
df_extreme_poverty_life_expectancy

Unnamed: 0,Country,Year,Life expectancy,Share in extreme poverty
2588,Armenia,2021,72.552,0.523521
3683,Austria,2021,81.820,0.485822
5502,Belgium,2021,81.659,0.029965
6024,Benin,2021,59.610,12.723279
6636,Bolivia,2021,61.427,1.964501
...,...,...,...,...
54275,Tunisia,2021,72.893,0.273734
54536,Turkey,2021,75.722,0.442865
56400,United Kingdom,2021,80.708,0.245101
56661,United States,2021,76.384,0.248719


**3.3 Political corruption index**

In [107]:
# Remove the unnecessary columns in the data frame
df_political_corruption = df_political_corruption[['Entity', 'Year', 'Political corruption index (best estimate, aggregate: average)']]

In [108]:
# Rename the applicable columns
df_political_corruption = df_political_corruption.rename(columns={'Entity': 'Country', 'Political corruption index (best estimate, aggregate: average)': 'Political corruption index'})

In [109]:
# Keep only the rows for the year 2021
df_political_corruption = df_political_corruption[df_political_corruption['Year'] == 2021]

In [110]:
# Remove rows with missing values
df_political_corruption = df_political_corruption.dropna()

In [111]:
df_political_corruption

Unnamed: 0,Country,Year,Political corruption index
232,Afghanistan,2021,0.397000
467,Africa,2021,0.624089
579,Albania,2021,0.609000
703,Algeria,2021,0.693000
827,Angola,2021,0.510000
...,...,...,...
32273,World,2021,0.483179
32518,Yemen,2021,0.936000
32795,Zambia,2021,0.440000
32963,Zanzibar,2021,0.715000


In [112]:
# Remove entries that are not individual countries
df_political_corruption = df_political_corruption[df_political_corruption['Country'] != 'World']

In [113]:
df_political_corruption

Unnamed: 0,Country,Year,Political corruption index
232,Afghanistan,2021,0.397000
467,Africa,2021,0.624089
579,Albania,2021,0.609000
703,Algeria,2021,0.693000
827,Angola,2021,0.510000
...,...,...,...
31996,Vietnam,2021,0.484000
32518,Yemen,2021,0.936000
32795,Zambia,2021,0.440000
32963,Zanzibar,2021,0.715000


#### 4. Data Segregation

In [114]:
continents = ['Africa', 'Asia', 'Australia', 'Europe', 'North America', 'South America']

**4.1 Self-reported life satisfaction**

In [115]:
# Extract data only for continents
df_life_satisfaction_continents = df_life_satisfaction[df_life_satisfaction['Country'].isin(continents)]

In [116]:
df_life_satisfaction_continents

Unnamed: 0,Country,Year,Cantril ladder score
19,Africa,2021,4.517288
79,Asia,2021,4.892916
90,Australia,2021,7.1621
511,Europe,2021,6.338868
1190,North America,2021,6.692469
1465,South America,2021,5.985671


In [117]:
# Remove continent entries from the original data frame, keeping Australia because it is also a country
df_life_satisfaction = df_life_satisfaction[(~df_life_satisfaction['Country'].isin(continents)) | (df_life_satisfaction['Country'] == 'Australia')]

In [118]:
df_life_satisfaction

Unnamed: 0,Country,Year,Cantril ladder score
8,Afghanistan,2021,2.4038
30,Albania,2021,5.1988
41,Algeria,2021,5.1223
57,Argentina,2021,5.9670
68,Armenia,2021,5.3986
...,...,...,...
1730,Venezuela,2021,4.9255
1741,Vietnam,2021,5.4850
1763,Yemen,2021,4.1969
1773,Zambia,2021,3.7598


In [124]:
df_life_satisfaction.count()

Country                 147
Year                    147
Cantril ladder score    147
dtype: int64
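
The split above relies on two boolean masks: one selecting the continent names, and one making an exception for Australia, which is both a continent name in the `continents` list and a country. The following sketch isolates that logic on a toy frame. `split_continents` and `also_countries` are hypothetical names, not part of the notebook; the scores are copied from the outputs above.

```python
import pandas as pd


def split_continents(df, continents, also_countries=('Australia',)):
    """Return (continent rows, country rows) for a frame with a 'Country' column.

    Names listed in `also_countries` are kept in both results.
    """
    is_continent = df['Country'].isin(continents)
    is_country_too = df['Country'].isin(also_countries)
    df_continents = df[is_continent]
    df_countries = df[~is_continent | is_country_too]
    return df_continents, df_countries


continents = ['Africa', 'Asia', 'Australia', 'Europe', 'North America', 'South America']

toy = pd.DataFrame({
    'Country': ['Africa', 'Albania', 'Australia', 'Europe'],
    'Cantril ladder score': [4.517288, 5.1988, 7.1621, 6.338868],
})

toy_continents, toy_countries = split_continents(toy, continents)
print(toy_continents['Country'].tolist())
print(toy_countries['Country'].tolist())
```

As in the notebook, the Australia row ends up in both frames, so its value in the continent frame is the country-level figure rather than a continental aggregate.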

**4.2 Political corruption index**

In [119]:
# Extract data only for continents
df_political_corruption_continents = df_political_corruption[df_political_corruption['Country'].isin(continents)]

In [120]:
df_political_corruption_continents

Unnamed: 0,Country,Year,Political corruption index
467,Africa,2021,0.624089
1403,Asia,2021,0.555878
1638,Australia,2021,0.031
9709,Europe,2021,0.235659
20953,North America,2021,0.458667
26875,South America,2021,0.4725


In [121]:
# Remove continent entries from the original data frame, keeping Australia because it is also a country
df_political_corruption = df_political_corruption[(~df_political_corruption['Country'].isin(continents)) | (df_political_corruption['Country'] == 'Australia')]

In [122]:
df_political_corruption

Unnamed: 0,Country,Year,Political corruption index
232,Afghanistan,2021,0.397
579,Albania,2021,0.609
703,Algeria,2021,0.693
827,Angola,2021,0.510
1062,Argentina,2021,0.471
...,...,...,...
31996,Vietnam,2021,0.484
32518,Yemen,2021,0.936
32795,Zambia,2021,0.440
32963,Zanzibar,2021,0.715


In [123]:
df_political_corruption.count()

Country                       180
Year                          180
Political corruption index    180
dtype: int64
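
The two `count()` outputs differ (147 rows for life satisfaction, 180 for political corruption), so the frames do not cover the same set of countries. For example, Angola and Zanzibar appear in the corruption listing but not in the life satisfaction listing. A set difference is one simple way to list such entries; the sketch below uses small toy frames with hypothetical names (`df_a`, `df_b`) rather than the real data.

```python
import pandas as pd

# Toy stand-ins for the life satisfaction and political corruption frames
df_a = pd.DataFrame({'Country': ['Albania', 'Algeria', 'Argentina', 'Zambia']})
df_b = pd.DataFrame({
    'Country': ['Albania', 'Algeria', 'Angola', 'Argentina', 'Zambia', 'Zanzibar'],
})

# Entries present in df_b but missing from df_a
only_in_b = sorted(set(df_b['Country']) - set(df_a['Country']))
print(only_in_b)
```

Swapping the operands gives the entries present only in the first frame.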