In [None]:
%matplotlib inline

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from collections import Counter
import seaborn as sns
from scipy.stats import kurtosis
import plotly.express as px

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# What makes us happy?
## A take on the World Happiness Report 2015 - 2021, covering the period before and during the COVID-19 pandemic

### Abstract

In this paper, we explore the World Happiness Report for the period 2015 - 2021 and try to answer the question of what makes us happy, using the report's data and models.

The project is divided into two parts, pre-pandemic and pandemic, to observe whether happiness scores changed for the top, middle and bottom countries.

The data is divided into top, middle and bottom countries to create more distinct groups for exploring the effect of GDP per capita, social support, life expectancy, freedom, perception of corruption and generosity on countries' happiness rank.

## Introduction

## What is the World Happiness Report?

The World Happiness Report is a landmark survey of the state of global happiness. The first report was published in 2012, the second in 2013, the third in 2015, and the fourth in the 2016 Update. The World Happiness 2017, which ranks 155 countries by their happiness levels, was released at the United Nations at an event celebrating International Day of Happiness on March 20th. The report continues to gain global recognition as governments, organizations, and civil society increasingly use happiness indicators to inform their policy-making decisions. Leading experts across fields – economics, psychology, survey analysis, national statistics, health, public policy, and more – describe how measurements of well-being can be used effectively to assess the progress of nations. The reports review the state of happiness in the world today and show how the new science of happiness explains personal and national variations in happiness.

The happiness scores and rankings use data from the Gallup World Poll. The scores are based on answers to the main life evaluation question asked in the poll. This question, known as the Cantril ladder, asks respondents to think of a ladder with the best possible life for them being a 10 and the worst possible life being a 0 and to rate their own current lives on that scale. The scores are from nationally representative samples for the years 2015-2021 and use the Gallup weights to make the estimates representative. The columns following the happiness score estimate the extent to which each of six factors – economic production, social support, life expectancy, freedom, absence of corruption, and generosity – contribute to making life evaluations higher in each country than they are in Dystopia, a hypothetical country that has values equal to the world’s lowest national averages for each of the six factors. They have no impact on the total score reported for each country, but they do explain why some countries rank higher than others.

The sample size per country is at least 1,000 people. In some countries, Gallup collects oversamples in major cities or areas of special interest. Additionally, in some large countries, such as China and Russia, sample sizes of at least 2,000 are collected. Although rare, in some instances the sample size is between 500 and 1,000.

More details on the variables:

* GDP per capita - GDP per capita in purchasing power parity (PPP) at constant 2011 international dollar prices
* Corruption Perception: The measure is the national average of the survey responses to two questions in the Gallup World Poll: “Is corruption widespread throughout the government or not” and “Is corruption widespread within businesses or not?” The overall perception is just the average of the two 0-or-1 responses. In case the perception of government corruption is missing, we use the perception of business corruption as the overall perception. The corruption perception at the national level is just the average response of the overall perception at the individual level. 
* Healthy Life Expectancy (HLE). The time series of healthy life expectancy at birth are calculated by the authors based on data from the World Health Organization (WHO), the World Development Indicators (WDI), and statistics published in journal articles.
* Social support (or having someone to count on in times of trouble) is the national average of the binary responses (either 0 or 1) to the question “If you were in trouble, do you have relatives or friends you can count on to help you whenever you need them, or not?”
* Freedom to make life choices is the national average of responses to the question “Are you satisfied or dissatisfied with your freedom to choose what you do with your life?”
* Generosity is the residual of regressing the national average of response to the question “Have you donated money to a charity in the past month?” on GDP per capita.
* dystopia_residual - Dystopia is an imaginary country that has the world’s least happy people. The purpose in establishing Dystopia is to have a benchmark against which all countries can be favorably compared (no country performs more poorly than Dystopia) in terms of each of the six key variables, thus allowing each sub-bar to be of positive width. The residuals, or unexplained components, differ for each country, reflecting the extent to which the six variables either over-or under-explain average 2015-2021 life evaluations. Dystopia_residuals is the sum of residuals + dystopia.
* Happiness score or subjective well-being (SWB) - the sum of all 6 variables + dystopia_residual
* Happiness rank - based on happiness score (target value)
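
The relationship between the happiness score, the six variables and the residual can be illustrated with a small sketch. The numbers below are made up for illustration and are not taken from the report; the column names are the unified names used later in this notebook.

```python
import pandas as pd

# Made-up illustrative values (NOT from the report) for one hypothetical country.
row = pd.Series({
    "GDP_per_capita": 1.30,
    "family": 1.25,
    "life_expectancy": 0.90,
    "freedom": 0.60,
    "generosity": 0.25,
    "trust_gov_corruption": 0.30,
    "dystopia_residual": 2.40,
})

factor_columns = [
    "GDP_per_capita", "family", "life_expectancy",
    "freedom", "generosity", "trust_gov_corruption",
]

# Happiness score = sum of the six factor contributions + dystopia_residual.
happiness_score = row[factor_columns].sum() + row["dystopia_residual"]

# Rearranged, the same identity recovers the residual from a known score.
recovered_residual = happiness_score - row[factor_columns].sum()
print(round(happiness_score, 2), round(recovered_residual, 2))
```

The rearranged form is what makes it possible to rebuild the residual column for years in which it is not published.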


*More information on the World Happiness project can be found [here](https://worldhappiness.report/)*

## Reading, Tidying and Cleaning the Data

In [None]:
data_2015 = pd.read_csv('/kaggle/input/2015csv/2015.csv')
data_2016 = pd.read_csv('/kaggle/input/2016csv/2016.csv')
data_2017 = pd.read_csv('/kaggle/input/2017-csv/2017.csv')
data_2018 = pd.read_csv('/kaggle/input/2018csv/2018.csv')
data_2019 = pd.read_csv('/kaggle/input/2019csv/2019.csv')
data_2020 = pd.read_csv('/kaggle/input/2020csv/2020.csv')
data_2021 = pd.read_csv('/kaggle/input/2021csv/2021.csv')

In [None]:
data_2015.shape, data_2016.shape, data_2017.shape, data_2018.shape, data_2019.shape, data_2020.shape, data_2021.shape 

# We can see that the data sets don't have the same number of observations and features.

In [None]:
data_2015.dtypes # variables are stored in the correct format

In [None]:
data_2016.dtypes # variables are stored in the correct format

In [None]:
data_2017.dtypes # variables are stored in the correct format

In [None]:
data_2018.dtypes # variables are stored in the correct format

In [None]:
data_2019.dtypes # variables are stored in the correct format

In [None]:
data_2020.dtypes # variables are stored in the correct format

In [None]:
data_2021.dtypes # variables are stored in the correct format

In [None]:
def missing_data(data_frame, name):
    # Count missing cells across all columns, not only the second one.
    print(f'Missing data {name} : {data_frame.isnull().sum().sum()}')

In [None]:
missing_data(data_2015, 'dataset_2015')
missing_data(data_2016, 'dataset_2016')
missing_data(data_2017, 'dataset_2017')
missing_data(data_2018, 'dataset_2018')
missing_data(data_2019, 'dataset_2019')
missing_data(data_2020, 'dataset_2020')
missing_data(data_2021, 'dataset_2021')

We have no missing data or NaN values in the datasets.

We do have zeros, which is expected: the model relies on participants self-assessing the state of each variable in their country (from 0, inclusive, to 10), and the Dystopia country is constructed by taking the lowest score for each variable. We will not remove the zeros, as they are part of the model and are taken into account when calculating the happiness score and, in turn, the happiness rank.
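
As a minimal sketch of the difference between missing values and zeros, the toy frame below (made-up values, standing in for one yearly dataset) counts both. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values, standing in for one yearly dataset.
toy = pd.DataFrame({
    "country": ["A", "B", "C"],
    "GDP_per_capita": [1.2, 0.0, 0.7],
    "generosity": [0.3, 0.1, np.nan],
})

# Total number of missing cells across all columns.
total_missing = int(toy.isnull().sum().sum())

# Number of exact zeros per numeric column (zeros are valid values, not gaps).
zeros_per_column = (toy.select_dtypes("number") == 0).sum()

print(total_missing)
print(zeros_per_column)
```

Counting zeros per column gives the same information as `DataFrame.all()`, but also shows how many zeros each column contains.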

In [None]:
data_2015.all()

In [None]:
data_2016.all()

In [None]:
data_2017.all()

In [None]:
data_2018.all()

In [None]:
data_2019.all()

In [None]:
data_2020.all()

In [None]:
data_2021.all()

In [None]:
data_2015.shape, data_2016.shape, data_2017.shape, data_2018.shape, data_2019.shape, data_2020.shape, data_2021.shape

We can see that the data sets don't have the same number of observations and features, so we have to decide which features to keep and which to drop for our task.

We will keep the original variables used in the report to measure happiness:

* GDP per Capita
* Family
* Life Expectancy
* Freedom
* Generosity
* Trust Government Corruption

Also the identification attributes:

* Country and Region

And the target feature:

* Happiness Rank
* Happiness Score
* Dystopia + residual

Dystopia + Residual compares each country's scores to the theoretical unhappiest country in the world and explains why some countries are happier than others. In the datasets available for 2018 and 2019 this column is missing, but we know it is the total happiness score minus the sum of the 6 main variables, so it will be easy to calculate and add.

The remaining features are additional target features that vary from dataset to dataset. We will drop them, as we will draw our own conclusions based on the original features. The names of the columns we want to keep are not the same in all datasets, so we have to unify them before exploring them.
Since we will need the region identifier to group countries more easily, we will have to add that column to the datasets where it is missing.

To compare the pre-pandemic and pandemic states of the countries, we also have to add a column for the year so that all datasets can be compiled together.
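
One possible way to carry out these steps is to rename columns through an explicit mapping instead of by position. This is only a sketch: the raw header names below are hypothetical, and the real CSV headers differ from year to year.

```python
import pandas as pd

# Hypothetical raw column names; the real CSV headers differ from year to year.
raw = pd.DataFrame({
    "Country": ["A", "B"],
    "Happiness Score": [7.1, 4.2],
    "Economy (GDP per Capita)": [1.3, 0.4],
    "Standard Error": [0.03, 0.05],
})

# Map each raw header to its unified name; unmapped columns are dropped below.
rename_map = {
    "Country": "country",
    "Happiness Score": "happiness_score",
    "Economy (GDP per Capita)": "GDP_per_capita",
}

# Rename, keep only the mapped columns, and tag the rows with their year.
unified = (
    raw.rename(columns=rename_map)[list(rename_map.values())]
    .assign(year=2015)
)
print(unified.columns.tolist())
```

An explicit mapping does not depend on the column order in the file, so a reordered CSV would not silently mislabel a column.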

In [None]:
new_colums_names = ['country', 'region', 'GDP_per_capita', 'family', 'life_expectancy', 'freedom', 'generosity','trust_gov_corruption',
                   'happiness_rank', 'happiness_score', 'dystopia_residual']

In [None]:
data_2015.columns = ['country', 'region', 'happiness_rank', 'happiness_score', 'stand_err', 'GDP_per_capita', 'family',
                    'life_expectancy', 'freedom', 'trust_gov_corruption', 'generosity', 'dystopia_residual']

In [None]:
data_2016.columns = ['country', 'region', 'happiness_rank', 'happiness_score', 'lci', 'uci', 
                     'GDP_per_capita', 'family',
                    'life_expectancy', 'freedom', 'trust_gov_corruption', 'generosity', 'dystopia_residual']

In [None]:
data_2017.columns = ['country', 'happiness_rank', 'happiness_score', 'whisker_high', 'whisker_low', 
                     'GDP_per_capita', 'family',
                    'life_expectancy', 'freedom', 'trust_gov_corruption', 'generosity', 'dystopia_residual']

In [None]:
data_2018.columns = ['happiness_rank', 'country','happiness_score', 'GDP_per_capita', 'family',
                     'life_expectancy','freedom', 'generosity', 'trust_gov_corruption']
                     

In [None]:
data_2019.columns = ['happiness_rank', 'country','happiness_score', 'GDP_per_capita', 'family',
                     'life_expectancy','freedom', 'generosity', 'trust_gov_corruption']

In [None]:
data_2020.columns = ['country','region', 'happiness_score', 'std_error_hs', 'whisker_high', 'whisker_low', 
                    'Log_GDP_per_capita','social_support',
                     'health_life_expectancy','freedom_to_make_choices', 'generosity_', 'trust_gov_corruption_',
                    'ladder_score_dystopia', 'GDP_per_capita','family', 'life_expectancy',
                    'freedom', 'generosity', 'trust_gov_corruption', 'dystopia_residual']

In [None]:
data_2021.columns = ['country','region', 'happiness_score', 'std_error_hs', 'whisker_high', 'whisker_low', 
                    'Log_GDP_per_capita','social_support',
                     'health_life_expectancy','freedom_to_make_choices', 'generosity_', 'trust_gov_corruption_',
                    'ladder_score_dystopia', 'GDP_per_capita','family', 'life_expectancy',
                    'freedom', 'generosity', 'trust_gov_corruption', 'dystopia_residual']

In [None]:
def tidy_dataframe(data_set):
    # Keep only the columns listed in new_colums_names and drop everything else.
    extra_columns = [c for c in data_set.columns if c not in new_colums_names]
    return data_set.drop(columns=extra_columns)


In [None]:
data_2015_tidy = tidy_dataframe(data_2015.copy())
data_2016_tidy = tidy_dataframe(data_2016.copy())
data_2017_tidy = tidy_dataframe(data_2017.copy())
data_2018_tidy = tidy_dataframe(data_2018.copy())
data_2019_tidy = tidy_dataframe(data_2019.copy())
data_2020_tidy = tidy_dataframe(data_2020.copy())
data_2021_tidy = tidy_dataframe(data_2021.copy())

In [None]:
# Datasets 2020 and 2021 are missing the rank column; the countries are already ordered by rank.
data_2020_tidy['happiness_rank'] = np.arange(1, len(data_2020_tidy) + 1)
data_2021_tidy['happiness_rank'] = np.arange(1, len(data_2021_tidy) + 1)

In [None]:
# Add a year column to prepare the datasets to be merged into one dataframe over the whole period.
data_2015_tidy['year'] = 2015
data_2016_tidy['year'] = 2016
data_2017_tidy['year'] = 2017
data_2018_tidy['year'] = 2018
data_2019_tidy['year'] = 2019
data_2020_tidy['year'] = 2020
data_2021_tidy['year'] = 2021

In [None]:
# we will add a column dystopia + residuals to datasets 2018 and 2019

data_2019_tidy['dystopia_residual'] = data_2019['happiness_score'] - (data_2019['GDP_per_capita'] + data_2019['family'] + data_2019['life_expectancy'] +data_2019['freedom'] + data_2019['generosity'] + data_2019['trust_gov_corruption'])
data_2018_tidy['dystopia_residual'] = data_2018['happiness_score'] - (data_2018['GDP_per_capita'] + data_2018['family'] + data_2018['life_expectancy'] +data_2018['freedom'] + data_2018['generosity'] + data_2018['trust_gov_corruption'])

In [None]:
data_2015_tidy

To add the 'region' feature to the datasets where it is missing, we first have to unify the country names that differ between datasets (it is easier to do this directly in the csv file with regex).
Notes:

* In all datasets we will unify the name of Taiwan to Taiwan (replacing Taiwan Province of China)
* In all datasets we will unify the name of Hong Kong to Hong Kong (replacing Hong Kong S.A.R., China)
* In all datasets we will unify the name of Somalia to Somalia (replacing Somaliland region). The 2016 data set has both Somalia and Somaliland Region; check the original file to see whether this is a mistake
* In all datasets we will unify the name of Trinidad & Tobago to Trinidad and Tobago
* In all datasets we will unify the name of North Cyprus to North Cyprus (replacing Northern Cyprus)
* In all datasets we will unify the name of North Macedonia to North Macedonia (replacing Macedonia)
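
As an alternative to editing the csv files, the same replacements could be applied in pandas. The sketch below assumes the variant spellings listed in the notes; `COUNTRY_NAME_MAP` and `unify_country_names` are hypothetical names, and the exact spellings should be checked against each file.

```python
import pandas as pd

# Variant spellings mapped to the unified name, following the notes above.
COUNTRY_NAME_MAP = {
    "Taiwan Province of China": "Taiwan",
    "Hong Kong S.A.R., China": "Hong Kong",
    "Somaliland region": "Somalia",
    "Trinidad & Tobago": "Trinidad and Tobago",
    "Northern Cyprus": "North Cyprus",
    "Macedonia": "North Macedonia",
}

def unify_country_names(df):
    """Return a copy of df with variant country names replaced (exact matches only)."""
    out = df.copy()
    out["country"] = out["country"].replace(COUNTRY_NAME_MAP)
    return out

# Toy example; names not present in the mapping are left untouched.
toy = pd.DataFrame({"country": ["Macedonia", "Finland", "Trinidad & Tobago"]})
print(unify_country_names(toy)["country"].tolist())
```

Note that mapping Somaliland region to Somalia would create a duplicate country in any year that lists both, so such a year needs a manual check first.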


In [None]:
## test cell for unifying country names
missing_countries = set(data_2020.country).symmetric_difference(set(data_2021.country))
missing_countries

In [None]:
## test cell
set(data_2020_tidy.region).difference(set(data_2015_tidy.region))

In [None]:
data_2021_tidy[(data_2021_tidy.region == "North America and ANZ")]

In [None]:

def add_region_column(data_set_with_region, data_set_without):
    # Build a country -> region lookup and fill the 'region' column in place.
    # Countries that are absent from the lookup are left as NaN.
    region_by_country = dict(zip(data_set_with_region.country, data_set_with_region.region))
    data_set_without['region'] = data_set_without.country.map(region_by_country)
    return data_set_without



In [None]:
add_region_column(data_2015_tidy, data_2017_tidy)
add_region_column(data_2015_tidy, data_2018_tidy)
add_region_column(data_2015_tidy, data_2019_tidy)

In [None]:
data_2019_tidy.region.isnull().sum(), data_2018_tidy.region.isnull().sum(), data_2017_tidy.region.isnull().sum()

We have 3 missing regions because the set of countries varies between data sets. For now we will leave them as they are; later, if needed, we will assign regions to those countries.
For the visualization of the bottom 30 countries I need the region of South Sudan. I might update the other 2 if I need them.
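
A region can also be assigned by country name rather than by row position, which keeps working if the row order changes. This is a sketch on a toy frame with made-up rows, not the notebook's data.

```python
import pandas as pd

# Toy frame with made-up rows; None marks a country whose region was not found.
toy = pd.DataFrame({
    "country": ["Finland", "South Sudan"],
    "region": ["Western Europe", None],
})

# List the countries that still lack a region.
missing_before = toy.loc[toy["region"].isnull(), "country"].tolist()
print(missing_before)

# Assign by country name rather than by row index.
toy.loc[toy["country"] == "South Sudan", "region"] = "Sub-Saharan Africa"
print(toy)
```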

In [None]:
data_2017_tidy.loc[146,'region'] = 'Sub-Saharan Africa'

In [None]:
data_2018_tidy.loc[153,'region'] = 'Sub-Saharan Africa'

In [None]:
data_2019_tidy.loc[155,'region'] = 'Sub-Saharan Africa'

In [None]:
#check
data_2019_tidy[(data_2019_tidy.region.isnull())]

In [None]:
#check
data_2016_tidy[(data_2016_tidy.country == 'South Sudan')].region

In [None]:
#check
data_2017_tidy.loc[146,'region']

In [None]:
data_2015_tidy.shape, data_2016_tidy.shape, data_2017_tidy.shape, data_2018_tidy.shape, data_2019_tidy.shape, data_2020_tidy.shape, data_2021_tidy.shape 

All datasets have the same number of features, with the same names and now we can explore the data.

## Exploring the Data

Now that we have all data sets tidy and clean, let's see:
* which regions the happiest countries (top 30) were in for the period 2015 - 2021, and which regions the bottom 30 were in
* which countries were the happiest (top 30) for the period 2015 - 2021, and which were in the bottom 30
* which factors benefited the happiest countries the most for the period 2015 - 2021

In [None]:
## check
data_2015_tidy.tail(30).happiness_rank.unique()

In [None]:
bottom_regions_all = pd.concat([data_2015_tidy.tail(30), data_2016_tidy.tail(30), data_2017_tidy.tail(30), data_2018_tidy.tail(30), data_2019_tidy.tail(30),data_2020_tidy.tail(30),data_2021_tidy.tail(30)], 
                     ignore_index=True)

In [None]:
top_regions_all = pd.concat([data_2015_tidy.head(30), data_2016_tidy.head(30), data_2017_tidy.head(30), data_2018_tidy.head(30), data_2019_tidy.head(30),data_2020_tidy.head(30),data_2021_tidy.head(30)], 
                     ignore_index=True)

In [None]:

def regions_rank(dataset,year, position):
    # Count rows per region, largest first.
    total = dataset.groupby('region').year.size().sort_values(ascending = False)
    plt.figure(figsize = (8, 4))
    plt.title(f"{position} Regions Cumulative Count {year}")
    plt.barh(total.index, total.values)
    plt.show()

In [None]:
regions_rank(top_regions_all, '2015 - 2021', 'Top')

In [None]:
regions_rank(bottom_regions_all, '2015 - 2021', 'Bottom')

* We can see that for the period 2015 - 2021, cumulatively, most of the top 30 happiest countries were from Western Europe (a little more than 50%), followed by Latin America (20%) and the Middle East and Northern Africa (10%). North America is in fourth place here because it consists only of Canada and the United States. In the next graphic, where regions are broken down into countries, we can better see the positions of both Canada and the United States.

* For the same period, cumulatively, most of the bottom 30 countries are from Sub-Saharan Africa (72%), followed by the Middle East and Northern Africa (10%). Sub-Saharan Africa is one of the biggest regions in the world.

The cumulative counts of regions in the top and bottom ranks alone do not support many conclusions, as the area, population and number of countries differ greatly between regions. Still, they show where most of the happiest and unhappiest countries in the world are located.
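
Percentages like the ones quoted above can be derived directly from the region counts. The sketch below uses a toy stand-in for `top_regions_all` with made-up rows, so its numbers do not correspond to the real data.

```python
import pandas as pd

# Toy stand-in for top_regions_all, with made-up rows.
toy = pd.DataFrame({
    "region": ["Western Europe", "Western Europe", "Western Europe",
               "Latin America and Caribbean"],
    "year": [2015, 2016, 2016, 2015],
})

# Share of rows per region, as a percentage of all rows.
region_share = toy["region"].value_counts(normalize=True).mul(100).round(1)
print(region_share)
```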

We can also see how the count of countries per region changed through the years 2015 - 2021 and check whether anything interesting stands out (only for the top countries).

In [None]:
final_df = pd.read_csv('/kaggle/input/top-30-regionscsv/top_30_regions.csv')

In [None]:
def plot_top_regions(ds):
    # Use the dataframe passed in rather than the global final_df.
    fig = px.bar(ds, x='region', y='count',animation_frame="year", color='region',animation_group='region', range_y=[0,20])
    fig.update_layout(autosize=True)

    fig.show()


In [None]:
plot_top_regions(final_df)

From 2018 to 2021:

The count of Western European countries in the top 30 happiest countries started growing, with a peak in 2020 - 2021.

The count of Latin America and Caribbean countries started declining, with its lowest point in 2021.

Central and Eastern European countries started to rank among the top 30 happiest countries.


Let's look at the country level and see what the top and bottom 30 countries look like for the period 2015 - 2021.

In [None]:
def plot_countries_rank(ds,start_range, end_range):

    fig = px.bar(ds, x="country", y="happiness_rank",animation_frame="year", color="country",animation_group="country", range_y=[start_range,end_range])
    fig.show()

In [None]:
plot_countries_rank(top_regions_all,0,31)

* All Nordic countries (Denmark, Finland, Norway, Sweden, Iceland) are consistently in the top 10. Moreover, since 2016 a Nordic country has always been the happiest country in the world, with Finland taking first place for the past 4 years in a row (including 2021).

* Bigger moves are observed for the last 10 countries: some countries drop out of the top 30 and others are added.

In [None]:
plot_countries_rank(bottom_regions_all,0,160)

 * We have more variation in the bottom 30 countries through 2015 - 2021 than we had for the top 30 countries.
 
 * Most of the countries (72%, as we observed above) are from the Sub-Saharan Africa region

There may be many external and internal factors that shift countries from one rank to another. In the next part of the project, we will look into the effect of the six main variables (GDP per capita, social support, life expectancy, perception of corruption, freedom, generosity) on countries' happiness rank before and during the COVID-19 pandemic, as we expect such a major health crisis to have an impact on happiness rank.

We will look into pre-pandemic (2015 - 2019) and pandemic (2020 - 2021) data to see how the different variables changed.

Pre-pandemic also includes 2019 because the data for that year was collected up to March of the same year, and the outbreak of the pandemic is deemed to be the end of November 2019.
Pandemic includes data_2020 and data_2021.

First, we will analyze the pre-pandemic data, and then we will compare it with the pandemic data and observe what, if anything, has changed.
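
The split described above could also be expressed as a label on a single combined frame. This is a sketch on made-up scores; the `period` column is a hypothetical addition and is not used elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up scores, one row per country-year.
toy = pd.DataFrame({
    "year": [2015, 2019, 2020, 2021],
    "happiness_score": [5.0, 5.2, 5.1, 5.3],
})

# Years up to and including 2019 count as pre-pandemic, as described above.
toy["period"] = np.where(toy["year"] <= 2019, "pre-pandemic", "pandemic")

# Mean score per period, for a quick side-by-side comparison.
period_means = toy.groupby("period")["happiness_score"].mean()
print(period_means)
```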

### Pre-pandemic and pandemic data analysis

In [None]:
pre_pandemic= pd.concat([data_2015_tidy, data_2016_tidy, data_2017_tidy, data_2018_tidy, data_2019_tidy], 
                     ignore_index=True)


In [None]:
pandemic= pd.concat([data_2020_tidy, data_2021_tidy], 
                     ignore_index=True)

**1. Pre-pandemic**

Let's look into the correlation between the 6 variables and happiness score.

In [None]:
plt.figure(figsize = (10, 6))
sns.heatmap(pre_pandemic.drop(['year', 'happiness_rank'],axis=1).corr(), cmap = "PiYG_r", annot = True)
plt.show()

We can see that the pre-pandemic happiness score correlates quite strongly with GDP per capita, family and life expectancy, less so with freedom, even less with perception of corruption and hardly at all with generosity.
Generosity and perception of corruption have low correlations with all of the other variables; their strongest correlation is with freedom.
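
To read the same information as a ranked list rather than from a heatmap, the correlations with the happiness score can be extracted and sorted. The sketch below uses made-up values purely to show the mechanics.

```python
import pandas as pd

# Made-up values: GDP tracks the score closely, generosity does not.
toy = pd.DataFrame({
    "happiness_score": [7.5, 6.8, 5.9, 5.1, 4.2, 3.6],
    "GDP_per_capita": [1.45, 1.30, 1.05, 0.90, 0.50, 0.30],
    "generosity": [0.20, 0.35, 0.10, 0.30, 0.25, 0.15],
})

# Pearson correlation of every variable with the score, strongest first.
score_corr = (
    toy.corr()["happiness_score"]
    .drop("happiness_score")
    .sort_values(ascending=False)
)
print(score_corr.round(2))
```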

Now that we have the pre-pandemic data set, we can further explore how these variables correlate for the top 30, middle 30 and bottom 30 countries. We divide the dataset into top, middle and bottom groups to look for tendencies and typical values, if any are present, within each group.


In [None]:
top_pre_pandemic = (pre_pandemic.where(pre_pandemic.happiness_rank <31)).dropna()

In [None]:
middle_pre_pandemic = (pre_pandemic.where((pre_pandemic.happiness_rank>60) & (pre_pandemic.happiness_rank<91))).dropna()

In [None]:
bottom_pre_pandemic = (pre_pandemic.where(pre_pandemic.happiness_rank>125)).dropna()

In [None]:
def sort_countries(df):
    ## top, middle and bottom countries are those with mean top,
    ## middle and bottom 30 values for the respective period
    ## in order to have the same number of countries in each category
    # numeric_only=True keeps the mean from failing on text columns such as region
    sorted_df = df.groupby(['country']).mean(numeric_only=True).sort_values(by=['happiness_rank'])
    sorted_df['country'] = sorted_df.index
    return sorted_df[0:30]

In [None]:
def top_mid_bottom(df1,df2,df3,name):
    top_df = df1
    middle_df = df2
    bottom_df = df3
    top_df['position'] = f'top {name}'
    bottom_df['position'] = f'bottom {name}'
    middle_df['position'] = f'middle {name}'
    top_bottom_df = pd.concat([bottom_df.dropna(), top_df.dropna(),middle_df.dropna()], 
                     ignore_index=True)
    
    sns.pairplot(top_bottom_df.drop(['happiness_rank','year'],axis=1),hue ='position',height=2)
    plt.show()

In [None]:
top_mid_bottom(sort_countries(top_pre_pandemic),sort_countries(middle_pre_pandemic),sort_countries(bottom_pre_pandemic),'Pre-pandemic data')

* As we have already observed, GDP per capita has a strong linear correlation with family and life expectancy. 
Also, for GDP per capita, family and life expectancy, the values of the top 30 countries are grouped more closely around the mean, with a lower standard deviation than the middle and bottom 30 countries.
* For perception of corruption, the top 30 countries have a bimodal distribution, with one of the two groups of values overlapping with the bottom and middle 30 countries. The bottom and middle countries have similar distributions.
* For generosity, some bottom and middle countries have higher scores than the top 30 countries.
* For freedom, the top 30 countries have a smaller standard deviation than the bottom and middle countries, with values further to the right of the graph. The middle countries have a bimodal distribution.

Let's dive deeper into the data and analyze in more detail GDP per capita, social support, perception of corruption, freedom, generosity and life expectancy, as well as Dystopia + residuals.


**1.1 GDP per capita and happiness**

In [None]:
def plot_varaibles(df1,df2,df3,df4, name):
    plt.figure(figsize = (10, 4))
    plt.hist(df1, bins = 10,alpha = 0.5,label = 'top 30')
    plt.hist(df2, bins = 10, alpha = 0.5, label = 'middle 30')
    plt.hist(df3, bins = 10,alpha = 0.5, label = 'bottom 30')
    plt.axvline(x =df4, c = 'green', lw =3, label = 'pre_pandemic_mean')
    plt.title(f"Distribution of {name} pre-pandemic")
    plt.xlabel(f'{name}')
    plt.ylabel('Count')
    plt.legend(loc = "upper right")
    plt.show()  

In [None]:
plot_varaibles(sort_countries(top_pre_pandemic).GDP_per_capita,sort_countries(middle_pre_pandemic).GDP_per_capita,sort_countries(bottom_pre_pandemic).GDP_per_capita,pre_pandemic.GDP_per_capita.mean(),'GDP per capita')

In [None]:
def GDP_per_capita(df, rank, var):
    ft_mean = f'{rank} mean {df.GDP_per_capita.mean():.2f}'
    ft_stdv = f'{rank} stdv {df.GDP_per_capita.std():.2f}'
    ft_skew = f'{rank} skew {df.GDP_per_capita.skew():.2f}'
    ft_max = f'{rank} max {df.GDP_per_capita.max():.2f}'
    ft_kurt = f'{rank} kurtosis {kurtosis(df.GDP_per_capita, fisher = False):.2f}'
    print(f'{var}')
    return ft_mean, ft_stdv,ft_skew , ft_max , ft_kurt

In [None]:
print(GDP_per_capita(sort_countries(top_pre_pandemic),'top','GDP_per_capita'))
print(GDP_per_capita(sort_countries(middle_pre_pandemic),'middle', 'GDP_per_capita'))
print(GDP_per_capita(sort_countries(bottom_pre_pandemic),'bottom', 'GDP_per_capita'))

Top countries have a smaller standard deviation than the other 2 groups, with values distributed to the right of the graph and fewer values in the tails than in the center. Most of the countries are grouped around the higher values, with smaller differences between them than in the other 2 groups.

Middle countries also have a negative skew, with values more to the center of the distribution.

Bottom countries have values more to the left of the graph and in the tails; we can see that more groups are formed compared to the top and middle countries.
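
The same summary statistics could be produced for any variable by one generic helper. The sketch below is an assumption about how that might look, not the approach used in this notebook: `describe_variable` is a hypothetical name and the toy values are made up.

```python
import pandas as pd
from scipy.stats import kurtosis

def describe_variable(df, column, rank):
    """Summarise one column of df; a hypothetical generic helper."""
    values = df[column]
    return {
        "rank": rank,
        "mean": round(values.mean(), 2),
        "stdv": round(values.std(), 2),
        # Negative skew: longer tail on the left, values concentrated on the right.
        "skew": round(values.skew(), 2),
        "max": round(values.max(), 2),
        # fisher=False gives Pearson kurtosis, for which a normal distribution scores 3.
        "kurtosis": round(float(kurtosis(values, fisher=False)), 2),
    }

# Made-up values with one low observation, which produces a left tail.
toy = pd.DataFrame({"GDP_per_capita": [1.0, 1.2, 1.3, 1.4, 1.5]})
print(describe_variable(toy, "GDP_per_capita", "top"))
```

Passing the column name as an argument would let one function serve GDP per capita, life expectancy, social support and freedom alike.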

In [None]:
pre_pandemic.where(pre_pandemic.GDP_per_capita >1.64).dropna() # Countries with higher GDP than the top country

In [None]:
def outlier_country(df1,df2,num):
    outlier_country = df1.where(df2 >num).dropna()
    return outlier_country

In [None]:
def plot_outliers(var, df1,df2,df3,rank):
    plt.figure(figsize = (6, 4))
    plt.title(f"{rank} Countries for {var} outliers")
    sns.barplot(x=var, y="country", data=df1, palette='Blues_r')
    plt.axvline(x =df2, c = 'red', lw =3, label = f'{rank} mean')
    plt.axvline(x = df3, c = 'green', lw =3, label = 'pre_pandemic_mean')
    plt.legend(loc = "lower left")
    plt.show()
    

In [None]:
outlier = outlier_country(sort_countries(bottom_pre_pandemic),sort_countries(bottom_pre_pandemic).GDP_per_capita,0.9)

In [None]:
plot_outliers("GDP_per_capita",outlier,sort_countries(bottom_pre_pandemic).GDP_per_capita.mean(),pre_pandemic.GDP_per_capita.mean(),'Bottom')

Among the bottom 30 countries, some have values above the overall pre-pandemic mean.

We will show the outliers for each of the main variables below.
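
Instead of a hand-picked threshold, the cut-off for such outliers could be taken from the overall mean itself. This is a sketch on made-up values; the frames `overall` and `bottom` are illustrative stand-ins.

```python
import pandas as pd

# Toy data with made-up values: an overall sample and a "bottom" subgroup of it.
overall = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "GDP_per_capita": [1.4, 1.0, 1.0, 0.3],
})
bottom = overall[overall["country"].isin(["C", "D"])]

# Members of the bottom group whose value exceeds the mean of the whole sample.
overall_mean = overall["GDP_per_capita"].mean()
above_mean = bottom[bottom["GDP_per_capita"] > overall_mean]
print(above_mean)
```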

**1.2. Life expectancy and happiness**

In [None]:
plot_varaibles(sort_countries(top_pre_pandemic).life_expectancy,sort_countries(middle_pre_pandemic).life_expectancy,sort_countries(bottom_pre_pandemic).life_expectancy,pre_pandemic.life_expectancy.mean(),'Life expectancy')

In [None]:
def life_expectancy(df, rank, var):
    ft_mean = f'{rank} mean {df.life_expectancy.mean():.2f}'
    ft_stdv = f'{rank} stdv {df.life_expectancy.std():.2f}'
    ft_skew = f'{rank} skew {df.life_expectancy.skew():.2f}'
    ft_max = f'{rank} max {df.life_expectancy.max():.2f}'
    ft_kurt = f'{rank} kurtosis {kurtosis(df.life_expectancy, fisher = False):.2f}'
    print(f'{var}')
    return ft_mean, ft_stdv, ft_skew, ft_max, ft_kurt 

In [None]:
print(life_expectancy(sort_countries(top_pre_pandemic),'top', 'life_expectancy'))
print(life_expectancy(sort_countries(middle_pre_pandemic),'middle','life_expectancy'))
print(life_expectancy(sort_countries(bottom_pre_pandemic),'bottom','life_expectancy'))

Here we have the same observation as above: top and middle countries have more values to the right, with fewer values in the tails and more in the center, while for bottom countries the opposite holds.

Top countries are grouped more closely around the mean, with a smaller standard deviation; the opposite is true for bottom countries.


In [None]:
pre_pandemic.where((pre_pandemic.life_expectancy >0.97) & (pre_pandemic.happiness_rank > 31)).dropna() # Countries where life expectancy is higher than the top country

In [None]:
outlier = outlier_country(sort_countries(bottom_pre_pandemic),sort_countries(bottom_pre_pandemic).life_expectancy,0.61)
plot_outliers("life_expectancy",outlier,sort_countries(bottom_pre_pandemic).life_expectancy.mean(),pre_pandemic.life_expectancy.mean(),'Bottom')

In [None]:
outlier = outlier_country(sort_countries(middle_pre_pandemic),sort_countries(middle_pre_pandemic).life_expectancy,0.86)
plot_outliers("life_expectancy",outlier,sort_countries(middle_pre_pandemic).life_expectancy.mean(),pre_pandemic.life_expectancy.mean(),'Middle')

For life expectancy, both middle and bottom countries have values that are above the average.

**1.3 Social Support**

In [None]:
plot_varaibles(sort_countries(top_pre_pandemic).family,sort_countries(middle_pre_pandemic).family,sort_countries(bottom_pre_pandemic).family,pre_pandemic.family.mean(),'Social support')

In [None]:
def family(df, rank, var):
    ft_mean = f'{rank} mean {df.family.mean():.2f}'
    ft_stdv = f'{rank} stdv {df.family.std():.2f}'
    ft_skew = f'{rank} skew {df.family.skew():.2f}'
    ft_max = f'{rank} max {df.family.max():.2f}'
    ft_kurt = f'{rank} kurtosis {kurtosis(df.family, fisher = False):.2f}'
    print(f'{var}')
    return ft_mean, ft_stdv, ft_skew, ft_max, ft_kurt

In [None]:
print(family(sort_countries(top_pre_pandemic),'top', 'social support'))
print(family(sort_countries(middle_pre_pandemic),'middle', 'social support'))
print(family(sort_countries(bottom_pre_pandemic),'bottom', 'social support'))

Here top countries again have a lower standard deviation than the other two groups, and their values lie further to the right.

Bottom and middle countries have more of their values to the right of the graph and around the center of the distribution, with more groups formed around peak values.

In [None]:
outlier = outlier_country(sort_countries(bottom_pre_pandemic),sort_countries(bottom_pre_pandemic).family,1.07)
plot_outliers("family",outlier,sort_countries(bottom_pre_pandemic).family.mean(),pre_pandemic.family.mean(),'bottom')

For the variables most strongly correlated with happiness score, most top countries are grouped around the mean with a lower standard deviation, so top countries differ less from one another on these variables. For bottom countries the opposite holds: more groups form around peak values, so bottom countries differ more from one another on these variables.
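
The per-variable summary helpers in this notebook differ only in the column they read. A single helper parameterised by column name, sketched below, would produce the same five strings. `describe_variable` and the toy frame are hypothetical names introduced for illustration only.

```python
import pandas as pd
from scipy.stats import kurtosis

def describe_variable(df, column, rank):
    """Return mean, stdv, skew, max and Pearson kurtosis strings for one column."""
    values = df[column]
    return (
        f'{rank} mean {values.mean():.2f}',
        f'{rank} stdv {values.std():.2f}',
        f'{rank} skew {values.skew():.2f}',
        f'{rank} max {values.max():.2f}',
        f'{rank} kurtosis {kurtosis(values, fisher=False):.2f}',
    )

# Hypothetical toy data, not taken from the report
toy = pd.DataFrame({'family': [1.0, 2.0, 3.0, 4.0]})
result = describe_variable(toy, 'family', 'top')
print(result)
```

With such a helper, a call like `describe_variable(sort_countries(top_pre_pandemic), 'freedom', 'top')` could stand in for the dedicated `freedom` function, assuming the column names used elsewhere in this notebook.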

**1.4 Freedom and happiness**

In [None]:
plot_varaibles(sort_countries(top_pre_pandemic).freedom,sort_countries(middle_pre_pandemic).freedom,sort_countries(bottom_pre_pandemic).freedom,pre_pandemic.freedom.mean(),'Freedom')

In [None]:
def freedom(df, rank,var):
    ft_mean = f'{rank} mean {df.freedom.mean():.2f}'
    ft_stdv = f'{rank} stdv {df.freedom.std():.2f}'
    ft_skew = f'{rank} skew {df.freedom.skew():.2f}'
    ft_max = f'{rank} max {df.freedom.max():.2f}'
    ft_kurt = f'{rank} kurtosis {kurtosis(df.freedom, fisher = False):.2f}'
    print(f'{var}')
    return ft_mean, ft_stdv, ft_skew, ft_max, ft_kurt

In [None]:
print(freedom(sort_countries(top_pre_pandemic),'top','freedom'))
print(freedom(sort_countries(middle_pre_pandemic),'middle','freedom'))
print(freedom(sort_countries(bottom_pre_pandemic),'bottom','freedom'))

Top, middle and bottom countries all have most of their values to the right of the graph and in the tails.
Top countries have a lower standard deviation.

In [None]:
outlier = outlier_country(sort_countries(bottom_pre_pandemic),sort_countries(bottom_pre_pandemic).freedom,0.41)
plot_outliers("freedom",outlier,sort_countries(bottom_pre_pandemic).freedom.mean(),pre_pandemic.freedom.mean(),'bottom')

**1.5 Perception of corruption and happiness**

In [None]:
plot_varaibles(sort_countries(top_pre_pandemic).trust_gov_corruption,sort_countries(middle_pre_pandemic).trust_gov_corruption,sort_countries(bottom_pre_pandemic).trust_gov_corruption,pre_pandemic.trust_gov_corruption.mean(),'Perception of corruption')

In [None]:
def trust_gov_corruption(df, rank,var):
    ft_mean = f'{rank} mean {df.trust_gov_corruption.mean():.2f}'
    ft_stdv = f'{rank} stdv {df.trust_gov_corruption.std():.2f}'
    ft_skew = f'{rank} skew {df.trust_gov_corruption.skew():.2f}'
    ft_max = f'{rank} max {df.trust_gov_corruption.max():.2f}'
    ft_kurt = f'{rank} kurtosis {kurtosis(df.trust_gov_corruption, fisher = False):.2f}'
    print(f'{var}')
    return ft_mean, ft_stdv, ft_skew, ft_max, ft_kurt

In [None]:
print(trust_gov_corruption(sort_countries(top_pre_pandemic),'top','corruption'))
print(trust_gov_corruption(sort_countries(middle_pre_pandemic),'middle','corruption'))
print(trust_gov_corruption(sort_countries(bottom_pre_pandemic),'bottom','corruption'))

Top countries have a higher standard deviation than bottom and middle countries, with values more in the tails and to the right of the graph.
Bottom and middle countries have similar distributions, with more values in the center and to the left of the graph.

In [None]:
pre_pandemic.where((pre_pandemic.trust_gov_corruption >0.44) & (pre_pandemic.happiness_rank > 31)).dropna() # Countries where perception of corruption has a higher score than the top countries max

**1.6 Generosity and happiness**

In [None]:
plot_varaibles(sort_countries(top_pre_pandemic).generosity,sort_countries(middle_pre_pandemic).generosity,sort_countries(bottom_pre_pandemic).generosity,pre_pandemic.generosity.mean(),'Generosity')

In [None]:
def generosity(df, rank,var):
    ft_mean = f'{rank} mean {df.generosity.mean():.2f}'
    ft_stdv = f'{rank} stdv {df.generosity.std():.2f}'
    ft_skew = f'{rank} skew {df.generosity.skew():.2f}'
    ft_max = f'{rank} max {df.generosity.max():.2f}'
    ft_kurt = f'{rank} kurtosis {kurtosis(df.generosity, fisher = False):.2f}'
    print(f'{var}')
    return ft_mean, ft_stdv, ft_skew, ft_max, ft_kurt

In [None]:
print(generosity(sort_countries(top_pre_pandemic),'top','generosity'))
print(generosity(sort_countries(middle_pre_pandemic),'middle','generosity'))
print(generosity(sort_countries(bottom_pre_pandemic),'bottom','generosity'))

Bottom and middle countries have similar distributions.

Some bottom and middle countries have a higher generosity score than the highest value among the top countries.
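
As a hedged sketch of how such countries can be found without hard-coding the threshold, the top group's maximum can be computed from the data and then used in a boolean filter. The toy frame and the `top_cutoff` value are assumptions made for illustration; only the column names mirror the ones used in this notebook.

```python
import pandas as pd

# Hypothetical toy data; values are invented
toy = pd.DataFrame({
    "country": ["A", "B", "C", "D", "E"],
    "happiness_rank": [1, 2, 40, 95, 130],
    "generosity": [0.30, 0.41, 0.45, 0.20, 0.50],
})

top_cutoff = 2  # stands in for the top-30 cutoff on this tiny sample

# Highest generosity score inside the top group
top_max = toy.loc[toy.happiness_rank <= top_cutoff, "generosity"].max()

# Countries outside the top group that exceed that maximum
above_top = toy[(toy.happiness_rank > top_cutoff) & (toy.generosity > top_max)]
print(above_top)
```

Boolean indexing keeps only the matching rows, so no `.where(...).dropna()` step is needed; that also avoids dropping rows that merely have a missing value in an unrelated column.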

In [None]:
pre_pandemic.where((pre_pandemic.generosity >0.41)& (pre_pandemic.happiness_rank > 31)).dropna() # Countries that have higher generosity than the top 1 country

**1.6 Dystopia + residuals and happiness**

In [None]:
plot_varaibles(sort_countries(top_pre_pandemic).dystopia_residual,sort_countries(middle_pre_pandemic).dystopia_residual,sort_countries(bottom_pre_pandemic).dystopia_residual,pre_pandemic.dystopia_residual.mean(),'Dystopia + residuals')

In [None]:
def dyst(df, rank, var):
    ft_mean = f'{rank} mean {df.dystopia_residual.mean():.2f}'
    ft_stdv = f'{rank} stdv {df.dystopia_residual.std():.2f}'
    ft_skew = f'{rank} skew {df.dystopia_residual.skew():.2f}'
    ft_max = f'{rank} max {df.dystopia_residual.max():.2f}'
    ft_kurt = f'{rank} kurtosis {kurtosis(df.dystopia_residual, fisher = False):.2f}'
    print(f'{var}')
    return ft_mean, ft_stdv, ft_skew, ft_max, ft_kurt

In [None]:
print(dyst(sort_countries(top_pre_pandemic),'top', 'Dystopia + residuals'))
print(dyst(sort_countries(middle_pre_pandemic),'middle', 'Dystopia + residuals'))
print(dyst(sort_countries(bottom_pre_pandemic),'bottom', 'Dystopia + residuals'))

Here the difference between top and bottom countries in absolute values is the largest, putting bottom countries at a bigger disadvantage.

In [None]:
middle_high_score = pre_pandemic.where((pre_pandemic.GDP_per_capita >= sort_countries(top_pre_pandemic).GDP_per_capita.mean())& (pre_pandemic.family >=sort_countries(top_pre_pandemic).family.mean()) & (pre_pandemic.life_expectancy >sort_countries(top_pre_pandemic).life_expectancy.mean()) & (pre_pandemic.happiness_rank > 30) ).dropna()

In [None]:
bottom_high_score = pre_pandemic.where((pre_pandemic.GDP_per_capita >= sort_countries(middle_pre_pandemic).GDP_per_capita.mean())& (pre_pandemic.family >=sort_countries(middle_pre_pandemic).family.mean()) & (pre_pandemic.life_expectancy >=sort_countries(middle_pre_pandemic).life_expectancy.mean()) &  (pre_pandemic.happiness_rank > 90)).dropna()

In [None]:
plt.figure(figsize = (6, 4))
plt.title(f"Middle outliers")
sns.barplot(x="happiness_rank", y="country", data=middle_high_score, palette='Blues_r')
plt.show()

In [None]:
plt.figure(figsize = (6, 4))
plt.title(f"Bottom outliers")
sns.barplot(x="happiness_rank", y="country", data=bottom_high_score , palette='Blues_r')
plt.show()

In [None]:
def plot_dyst_residuals(df1,rank):
    plt.figure(figsize = (12, 4))
    sns.barplot(x=df1.index, y=df1['dystopia_residual'])
    plt.title(f'Average dystopia + residuals effect {rank} 30 countries')
    plt.axhline(y =pre_pandemic.dystopia_residual.mean(), c = 'red', lw =3, label = 'pre-pandemic mean')
    plt.axhline(y =df1.dystopia_residual.mean(), c = 'green', lw =3, label = f'{rank} mean')
    #plt.axhline(y =pre_pandemic.GDP_per_capita.mean(), c = 'yellow', lw =3, label = 'pre-pandemic GDP mean')
    #plt.axhline(y =pre_pandemic.family.mean(), c = 'orange', lw =3, label = 'pre-pandemic Social support mean')
    #plt.axhline(y =pre_pandemic.life_expectancy.mean(), c = 'blue', lw =3, label = 'pre-pandemic Life expectancy mean')
    plt.xticks(rotation=90)
    plt.legend(loc = "upper right")
    plt.show()

In [None]:
plot_dyst_residuals(sort_countries(top_pre_pandemic),'top')

In [None]:
plot_dyst_residuals(sort_countries(bottom_pre_pandemic),'bottom')

**1.7 Pre pandemic summary**

Based on the data from the World Happiness Report, the happiest countries score high on GDP per capita, social support and life expectancy, with countries grouped closer to the mean and smaller differences between them. For the bottom (unhappiest) countries the opposite holds.

For the other variables, top countries also have the highest mean scores, and dystopia + residuals gives them an even bigger advantage.

Let's see whether anything changed in the pandemic period, as we expect it did.

**2. Pandemic**

In [None]:
plt.figure(figsize = (10, 4))
sns.heatmap(pandemic.drop(['year', 'happiness_rank'],axis=1).corr(), cmap = "PiYG_r", annot = True)
plt.show()


In the pandemic dataset, GDP per capita and life expectancy are still the two variables most strongly correlated with happiness score.

The freedom correlation increased from 0.55 to 0.6, closing the gap with family.

The family correlation dropped from 0.65 to 0.62.

The generosity correlation dropped dramatically, from 0.18 to 0.026.

The perception of corruption correlation rose from 0.3 to 0.42.
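
Rather than reading the two heatmaps side by side, the change in correlation can be tabulated directly. The sketch below uses synthetic data with a built-in drop in the generosity effect; the column name `happiness_score`, the helper `toy_period` and all values are assumptions for illustration, not the report's figures.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

def toy_period(n, gdp_weight, generosity_weight):
    """Build a hypothetical period with a known link to happiness_score."""
    gdp = rng.normal(size=n)
    generosity = rng.normal(size=n)
    noise = rng.normal(scale=0.5, size=n)
    score = gdp_weight * gdp + generosity_weight * generosity + noise
    return pd.DataFrame({
        "happiness_score": score,
        "GDP_per_capita": gdp,
        "generosity": generosity,
    })

before = toy_period(500, gdp_weight=1.0, generosity_weight=0.3)
during = toy_period(500, gdp_weight=1.0, generosity_weight=0.0)

# Correlation of every variable with the score, one column per period
comparison = pd.DataFrame({
    "before": before.corr()["happiness_score"],
    "during": during.corr()["happiness_score"],
}).drop("happiness_score")
comparison["change"] = comparison["during"] - comparison["before"]
print(comparison.round(2))
```

On the real data, `before` and `during` would be the pre-pandemic and pandemic frames with the `year` and `happiness_rank` columns dropped, as in the heatmap cells.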

In [None]:
top_pandemic = pandemic.where(pandemic.happiness_rank <31)
middle_pandemic = pandemic.where((pandemic.happiness_rank>60) & (pandemic.happiness_rank<91))  # mask built from the pandemic frame itself
bottom_pandemic = (pandemic.where(pandemic.happiness_rank>125)).dropna()

In [None]:
top_mid_bottom(sort_countries(top_pandemic),sort_countries(middle_pandemic),sort_countries(bottom_pandemic),'Pandemic data')


* Family distribution - the gap between the top group and the middle and bottom groups widened during the pandemic.
* Life expectancy - top 30 countries have a stronger bimodal distribution than in the pre-pandemic period.
* Freedom - the middle 30 and bottom 30 distributions changed, with the middle having a higher peak than the bottom and more values to the right of the graph.
* Generosity - compared to pre-pandemic, bottom and middle have even more values at the right end of the graph than the top 30 countries.
* Perception of corruption - kept relatively the same distribution.
* Dystopia + residuals - the gap between top, middle and bottom countries became even wider.

Let's see how the variables split for top, bottom and middle have changed.

First we will look at the country level:

In [None]:
def contribution_features(df):
    contribution_features = ['GDP_per_capita',
                         'family',
                         'life_expectancy',
                         'freedom',
                         'generosity', 'trust_gov_corruption',
                         'dystopia_residual']

    pandemic_contributions = df[contribution_features]
    return pandemic_contributions
    

In [None]:
def plot_contr_feat(df1,df2,rank,period):
    fig, ax = plt.subplots(figsize=(10, 4))
    sns.set()
    df1.set_index(df2['country']).plot(kind='bar',stacked=True,ax=ax)
    plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))
    plt.title(f'{rank} 30 {period} variables split')
    plt.show()
    

In [None]:
plot_contr_feat(contribution_features(sort_countries(top_pandemic)),sort_countries(top_pandemic),'Top', 'pandemic')

In [None]:
plot_contr_feat(contribution_features(sort_countries(top_pre_pandemic)),sort_countries(top_pre_pandemic),'Top', 'pre_pandemic')

In [None]:
plot_contr_feat(contribution_features(sort_countries(middle_pre_pandemic)),sort_countries(middle_pre_pandemic),'Middle', 'pre_pandemic')

In [None]:
plot_contr_feat(contribution_features(sort_countries(middle_pandemic)),sort_countries(middle_pandemic),'Middle', 'pandemic')

In [None]:
plot_contr_feat(contribution_features(sort_countries(bottom_pandemic)),sort_countries(bottom_pandemic),'Bottom', 'pandemic')

In [None]:
plot_contr_feat(contribution_features(sort_countries(bottom_pre_pandemic)),sort_countries(bottom_pre_pandemic),'Bottom', 'pre-pandemic')

Now let's see how the variable scores changed for top, middle and bottom countries in the pandemic period.

In [None]:
def prep_final(data, rank):
    df1 = data
    df1['rank'] = rank
    return df1

In [None]:
df1 = sort_countries(top_pre_pandemic)
df1['rank'] = 'top_pre_pandemic'

In [None]:
final_top_prep = prep_final(sort_countries(top_pre_pandemic), 'top_pre_pandemic')
final_middle_prep = prep_final(sort_countries(middle_pre_pandemic), 'middle_pre_pandemic')
final_bottom_prep = prep_final(sort_countries(bottom_pre_pandemic), 'bottom_pre_pandemic')
final_top_pand= prep_final(sort_countries(top_pandemic), 'top_pandemic')
final_middle_pand = prep_final(sort_countries(middle_pandemic), 'middle_pandemic')
final_bottom_pand = prep_final(sort_countries(bottom_pandemic), 'bottom_pandemic')

In [None]:
final_bottom_prep.GDP_per_capita.mean(), final_bottom_pand.GDP_per_capita.mean()


In [None]:
final = pd.concat([final_top_prep, final_middle_prep, final_bottom_prep, 
                  final_top_pand, final_middle_pand,final_bottom_pand], 
                     ignore_index=True)

In [None]:
def plot_final(var):
    
    sns.displot(data = final, x = var, hue='rank',kind="kde", height = 10)
    plt.show()

In [None]:
plot_final("GDP_per_capita")

In [None]:
plot_final("life_expectancy")

In [None]:
plot_final("family")

In [None]:
plot_final("freedom")

In [None]:
plot_final("trust_gov_corruption")

In [None]:
plot_final("generosity")

In [None]:
plot_final("dystopia_residual")

*Note to self: Why do we see negative values in the distributions?*

KDE in seaborn, as in pandas, uses Gaussian kernels. A Gaussian is placed over each data point and the densities are summed (with proper normalisation), so the curve always has tails extending beyond the data range. In effect, KDE says that although there is no data in this range, another random sample could have contained some, and it assigns a small mass to represent that possibility.
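
The mechanism can be reproduced in a few lines of NumPy. This is a minimal hand-rolled sketch, not the code seaborn or pandas actually run; the function name `gaussian_kde_manual`, the sample values and the bandwidth are all assumptions chosen for illustration.

```python
import numpy as np

def gaussian_kde_manual(data, grid, bandwidth):
    """Average one Gaussian bump per observation, evaluated on grid."""
    data = np.asarray(data, dtype=float)
    # Distance of every grid point to every observation, in bandwidth units
    z = (grid[:, None] - data[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))
    return bumps.mean(axis=1)

# Hypothetical non-negative scores
scores = np.array([0.0, 0.05, 0.1, 0.2, 0.4])
grid = np.linspace(-0.5, 1.0, 301)
density = gaussian_kde_manual(scores, grid, bandwidth=0.1)

# Probability mass the estimate places below zero, where no data exists
step = grid[1] - grid[0]
mass_below_zero = density[grid < 0].sum() * step
print(f"estimated mass below zero: {mass_below_zero:.3f}")
```

Even though every score is non-negative, the estimate puts some mass below zero because each bump is symmetric around its data point. If the tails are distracting in the plots, seaborn's `kdeplot` accepts a `cut` parameter (for example `cut=0`) that stops the drawn curve at the data limits.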


* GDP per capita - again, the most notable (negative) change is for bottom countries.

We can see that the pandemic has had the largest negative impact on bottom countries for social support, life expectancy and GDP per capita, all crucial for coping with a health crisis such as COVID-19.
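
The density plots can be complemented with a table of group means per period. The sketch below works on a hypothetical miniature of the combined frame; the values are invented, so the printed numbers say nothing about the real report, and only the `rank` labels and column names follow this notebook.

```python
import pandas as pd

# Hypothetical miniature of the combined frame; values are made up
toy_final = pd.DataFrame({
    "rank": ["bottom_pre_pandemic"] * 2 + ["bottom_pandemic"] * 2
            + ["top_pre_pandemic"] * 2 + ["top_pandemic"] * 2,
    "GDP_per_capita": [0.40, 0.50, 0.30, 0.40, 1.40, 1.50, 1.40, 1.60],
    "family": [0.80, 0.90, 0.60, 0.70, 1.40, 1.50, 1.40, 1.50],
})

# Split e.g. "bottom_pre_pandemic" into group="bottom", period="pre_pandemic"
parts = toy_final["rank"].str.split("_", n=1, expand=True)
toy_final["group"], toy_final["period"] = parts[0], parts[1]

# Mean of each variable per group and period, then pandemic minus pre-pandemic
means = toy_final.groupby(["group", "period"])[["GDP_per_capita", "family"]].mean()
change = means.xs("pandemic", level="period") - means.xs("pre_pandemic", level="period")
print(change.round(2))
```

Applied to the real `final` frame, the same steps would give one row per group and one column per variable, which makes the size of each change easier to compare than overlapping KDE curves.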

## Conclusion

According to the World Happiness Report, the happiest countries are in Western Europe; on average they are the wealthiest, have the longest life expectancy and experience the highest social support. The least happy countries are in Sub-Saharan Africa; on average they are the least wealthy, with the lowest life expectancy and social support. Pandemic or not, these are the most important factors, and they make the biggest difference between top and bottom countries.


## References

The World Happiness project can be found [here](https://worldhappiness.report/)