In [None]:
!pip install pycountry_convert

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import pycountry_convert as pc
import matplotlib.pyplot as plt
import seaborn as sns
import os
from scipy import stats
import plotly.express as px
import matplotlib.patches as mpatches

# 0 Basic Data wrangling & Visual inspection
Before we jump right into it, we rename the columns to more manageable names and add a column for the continent each country is on. Unfortunately, the package we used (pycountry_convert) does not recognize all (ever-changing) country names, so we had to hard-code some of them and assign the corresponding continent manually.
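
The lookup-with-fallback idea can be sketched without the library. In this minimal sketch, `lookup_continent` and its tiny `KNOWN` table are hypothetical stand-ins for the pycountry_convert calls; we assume that, like the real functions, the stand-in raises a `KeyError` for names it does not recognize.

```python
# Hypothetical stand-in for the library lookup; only the fallback pattern matters here.
KNOWN = {"Germany": "EU", "Japan": "AS", "Kenya": "AF"}

# Hand-maintained overrides for names the lookup does not recognize.
MANUAL = {"Kosovo": "EU", "North Cyprus": "AS", "Somaliland region": "AF"}

def lookup_continent(name):
    """Stand-in lookup: raises KeyError for unknown names (assumed behaviour)."""
    return KNOWN[name]

def continent_of(name):
    """Try the automatic lookup first, then fall back to the manual table."""
    try:
        return lookup_continent(name)
    except KeyError:
        return MANUAL[name]

names = ["Germany", "Kosovo", "Kenya", "Somaliland region"]
continent_codes = [continent_of(n) for n in names]
print(continent_codes)
```

A name that is missing from both tables still raises a `KeyError`, which is useful here: a newly added country name fails loudly instead of silently ending up without a continent.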

In [None]:
#Load datasets
df_2021 = pd.read_csv("/kaggle/input/world-happiness-report-2021/world-happiness-report-2021.csv")
df_bef = pd.read_csv("/kaggle/input/world-happiness-report-2021/world-happiness-report.csv")
#Rename columns
df_bef = df_bef.rename(columns={"Country name" : "country_name",  "Life Ladder" : "score", "Log GDP per capita" : "log_gdp", "Social support" : "social_support", "Healthy life expectancy at birth" : "life_expectancy", "Freedom to make life choices" : "freedom",  "Generosity" : "generosity", "Perceptions of corruption" : "corruption"})
#add continent names
# pycountry_convert does not recognize these names, so we assign their continents by hand
manual_continents = {"Kosovo":"EU", "North Cyprus":"AS", "Palestinian Territories":"AS", "Somaliland region":"AF", "Taiwan Province of China":"AS", "Congo (Brazzaville)":"AF", "Congo (Kinshasa)":"AF", "Hong Kong S.A.R. of China":"AS"}
is_manual = df_bef["country_name"].isin(list(manual_continents))

# .copy() avoids pandas' SettingWithCopyWarning when adding the new column
df_add = df_bef[is_manual].copy()
df_bs = df_bef[~is_manual].copy()
df_add["continent"] = df_add["country_name"].map(manual_continents)

# All remaining names can be resolved automatically
continents = [pc.country_name_to_country_alpha2(t, cn_name_format="default") for t in df_bs["country_name"]]
continent_name = [pc.country_alpha2_to_continent_code(t) for t in continents]
df_bs["continent"] = continent_name
df_bs = pd.concat([df_bs, df_add])
df = df_bs






## 0.1 The World Happiness Report 

### Data source:
The World Happiness Report sources its data from the Gallup World Poll.

> Gallup – the organization behind this enormous poll – interviews approximately 1,000 residents per country each year. More than 150 countries all around the globe participate. These people are randomly selected, as long as they are registered civilians of the country (non-institutionalized) and aged 15 and older. Each respondent in this happiness survey is asked the same questions in his or her own language to produce statistically comparable results.
>
> Source: https://www.gallup.com/178667/gallup-world-poll-work.aspx

### Dissecting the data: 
* Country name (country_name): The country's name
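* Life ladder (score): The "happiness score" of a country, i.e. the national average answer to the Cantril ladder question, in which respondents rate their current life from 0 (worst possible) to 10 (best possible). The report uses the following six factors to explain this score.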
* Log GDP per Capita (log_gdp): The logarithm of the real GDP per Capita supplied by the International Monetary Fund.
* Social support (social_support): The perceived amount of social support, obtained via the question “If you were in trouble, do you have relatives or friends you can count on to help you whenever you need them, or not?”
* Healthy life expectancy at birth (life_expectancy): Number of years of life expectancy for people born in the observed year (without underlying conditions).
* Freedom to make life choices (freedom): The perceived amount of freedom of life choices, obtained via the question “Are you satisfied or dissatisfied with your freedom to choose what you do with your life?”
* Generosity (generosity): Generosity is the residual of regressing the national average of GWP responses to the question “Have you donated money to a charity in the past month?” on GDP per capita.
* Perception of corruption (corruption): The perceived amount of corruption in the interviewee's country, obtained via the questions: “Is corruption widespread throughout the government or not?” and “Is corruption widespread within businesses or not?”.
* Positive / Negative Affect: The positive / negative affect represents the subjective happiness of a country's sample, averaged over the last three years. Around 1000 subjects are questioned face-to-face or via telephone each year for most countries (some countries are underrepresented for various reasons, e.g. because they are struck by war). As stated in the World Happiness Report (WHR) 2021, the positive and negative affect measures were not published because not enough data was available.
* Continent (continent): The continent the country is located on (geopolitically) 
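
The residual definition of *generosity* explains why that column can be negative. The following is a minimal sketch with made-up numbers (not data from the report). It assumes a simple one-variable least-squares fit on log GDP; the report's exact regression specification may differ.

```python
import numpy as np

# Made-up illustration values, not taken from the World Happiness Report.
log_gdp = np.array([7.5, 8.2, 9.0, 9.8, 10.5, 11.0])
donated_share = np.array([0.20, 0.18, 0.35, 0.30, 0.55, 0.45])

# Fit donated_share ~ intercept + slope * log_gdp by ordinary least squares.
slope, intercept = np.polyfit(log_gdp, donated_share, deg=1)
predicted = intercept + slope * log_gdp

# The residual is what remains of the donation share after accounting for GDP.
generosity = donated_share - predicted
print(generosity.round(3))
```

Because least-squares residuals sum to zero when an intercept is included, some countries necessarily end up with a negative value: they donate less than their GDP would suggest.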

After taking a look at the reports pre-2021 and 2021, we decided to omit 2021 for most analyses to make the data more comparable. We learned that in 2021 the formulas for several factors were changed and fewer countries participated in the study. This meant that we would have had to change the data back to the "old" formulas and still faced the problem of under


## 0.2 Visual inspection of the dataset
Initially, we will let pandas describe the dataset for us.

We can gather that we are looking at around 1950 data entries from 2005 - 2020. Not all countries have values for all years and variables. Further, indices such as *freedom*, *corruption* and the positive and negative affects lie between 0 and 1, while *generosity* also contains negative values.  
Also, *life_expectancy* is a continuous variable measured in years.  
For analyses within a variable (e.g. *life_expectancy* plotted against *year*) we will not normalize the data, since we are looking at one factor over time. For analyses across variables (e.g. *log_gdp* and *score*) we will normalize the data so that variables with different ranges become comparable. 
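
As a minimal sketch of what we mean by normalizing, the following toy example (hypothetical values) applies min-max scaling, the same formula used in the code cells further down, to two columns with very different ranges.

```python
import pandas as pd

# Toy values in different units: years versus a 0-10 ladder score.
toy = pd.DataFrame({
    "life_expectancy": [50.0, 60.0, 70.0],
    "score": [3.0, 5.0, 7.0],
})

# Min-max scaling maps each column onto the range [0, 1].
normalized = (toy - toy.min()) / (toy.max() - toy.min())
print(normalized)
```

After scaling, both columns run from 0 to 1, so their lines can share one axis. The price is that the original units are lost and that a single extreme value compresses all other values of that column.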



In [None]:
num_countries = len(df.country_name.unique())
min_Year = df.year.min()
max_Year = df.year.max()
num_na = df.isna().sum().sum()
num_rows_with_na = df[df.isnull().any(axis=1)].shape[0]
print(f"The dataset covers {num_countries} countries. \nThe earliest entry is from {min_Year} and the latest from {max_Year}. \nThere are {num_na} NaNs in the dataset and {num_rows_with_na} rows \nwith at least one NaN. \n\nThe following table describes the columns: \n")

df.describe()

## Research Questions
After visual inspection of the data, we decided to pursue answers to the following research questions:
1. How did the global happiness develop over the observed years?
2. Which of the six named indicators have significant (p < 0.05) impact on the overall happiness (rank) of a country?
3. Did the top and bottom 10 countries change over the observed years?
4. Grouping by continent, what trends are observable?

# 1 Global view
## 1.1 Number of countries per year
In the following plot we split the number of countries that participated in the report by continent. First of all, we can see that the number of countries was much lower in 2005 and 2020, namely 27 countries in 2005 and 95 countries in 2020, compared to the other years (average between 2006 and 2019: 130.5).

Gallup started the poll in 2005, which could explain the small number of participants: at that point it may not yet have been clear whether the poll could be carried out everywhere or whether the questions really captured the wanted information. As for 2020 (and 2021), we suspect that the coronavirus caused the low participation rate, simply because such polls can, at least in some countries, only be conducted face to face. 

Furthermore, the continents are unevenly represented in 2005: only 3.7% of the countries were from Africa compared to an average of ~24.49% in the other years. To some extent the same holds for Asia (2005: ~25.93%, average between 2006 and 2020: ~30.37%), although the difference is far more extreme for Africa. Europe, in contrast, is heavily overrepresented: in 2005 about 51.85% of the countries were from Europe compared to an average of ~27.06% in the other years. For the other continents the proportion did not change much. Europe may simply have been easier to reach than the other continents. 
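
The yearly shares quoted above can also be computed in one step. This is a minimal sketch on a toy table (hypothetical rows, not the real data) using `pd.crosstab`, shown as an alternative to the groupby-and-merge approach in the next cell.

```python
import pandas as pd

# Toy participation table (hypothetical rows, not the real data).
toy = pd.DataFrame({
    "year": [2005, 2005, 2005, 2005, 2006, 2006, 2006, 2006, 2006],
    "continent": ["EU", "EU", "AS", "AF", "EU", "AS", "AS", "AF", "AF"],
})

# normalize="index" divides each row (year) by its total, giving shares per year.
share = pd.crosstab(toy["year"], toy["continent"], normalize="index") * 100
print(share.round(2))
```

Each row of `share` sums to 100, so the values can be read directly as the percentage of participating countries per continent in that year.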

In [None]:
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(20,5))

dfg = df.groupby(["year", "continent"]).country_name.count().reset_index()
dfg.pivot(index="year", columns="continent").plot(kind='bar', stacked=True, ax=ax1)
ax1.set_title("Number of countries per year and continent")
ax1.set_xlabel("Year")
ax1.set_ylabel("Count")
ax1.legend([])

dfg = df.groupby(["year", "continent"]).count().country_name.reset_index()
participants_per_year = df.groupby("year").count().country_name
dfg = dfg.merge(participants_per_year, on="year")
dfg["perc"] = dfg.country_name_x/dfg.country_name_y*100
dfg[["year","continent", "perc"]].pivot(index="year", columns="continent").plot(kind='bar', stacked=True, ax=ax2)
ax2.set_title("Number of countries per year and continent (normalized per year and in percent)")
ax2.set_xlabel("Year")
ax2.set_ylabel("Proportion of continents in %")

plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0., labels=["Africa", "Asia", "Europe", "North America", "Oceania", "South America"])
plt.show()

for c in ["AF", "AS", "EU", "NA", "OC", "SA"]:
  print(f"Percentage of countries from {c}")
  print(f"      in 2005:               ~{dfg[(dfg.continent==c) & (dfg.year == 2005)].perc.values[0]:.2f}%")
  print(f"      between 2006 and 2020: ~{dfg[~dfg.year.isin([2005]) & (dfg.continent==c)].perc.mean():.2f}%")
  print()

## 1.2 Development of indicators world wide

The next plot gives a first idea of the indicators and how they developed over time. They are plotted together for all countries. First of all, we can clearly see that apart from corruption and negative affect all indicators drop sharply from 2005 to 2006. As we pointed out for the previous plot, the number of participants increased drastically from 2005 to 2006 and the relative share of countries from developing continents such as Africa and Asia increased. This change in composition might be the reason for the decrease. However, most indicators increase slightly between 2006 and 2007.  
The volatility is quite low between 2007 and 2019; there are no large changes. Apart from freedom, which increases during this period, the indicators stay roughly constant.   
From 2019 to 2020 we can see an increase in some indicators. Again, we would like to point out that the percentage of countries from Europe increased from 2019 (\~27.08%) to 2020 (~38.95%) and, consequently, the percentage of countries from less developed continents decreased. Again, the changed proportion of each continent might have caused this change.

Our interpretation is that the poll questions are very diverse and do not favor a single indicator but rather a combination of many. Furthermore, the countries appear to be fairly stable overall with respect to the questions and indicators. 
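
To illustrate how a change in the set of participating countries alone can move the global average, here is a minimal sketch with hypothetical scores in which no individual country changes at all.

```python
import pandas as pd

# Hypothetical scores: countries A and B keep their scores, C and D join in 2006.
toy = pd.DataFrame({
    "year":    [2005, 2005, 2006, 2006, 2006, 2006],
    "country": ["A", "B", "A", "B", "C", "D"],
    "score":   [7.0, 7.5, 7.0, 7.5, 4.0, 4.5],
})

# The yearly mean over all participating countries falls ...
mean_all = toy.groupby("year")["score"].mean()

# ... while the mean over countries present in every year stays constant.
n_years = toy["year"].nunique()
always_present = toy.groupby("country")["year"].nunique() == n_years
balanced = toy[toy["country"].isin(always_present[always_present].index)]
mean_balanced = balanced.groupby("year")["score"].mean()

print(mean_all)
print(mean_balanced)
```

Restricting the data to countries observed in every year is one possible way to separate real changes from composition effects, at the cost of discarding many observations.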


In [None]:
fig = plt.figure(figsize=(10,8))
ax = fig.add_subplot(111)
sns.set(style="darkgrid")

dfg = df.drop(columns=["country_name", "continent"])
dfg = (dfg - dfg.min())/(dfg.max() - dfg.min())
dfg.year = df.year
dfg = dfg.melt("year", var_name="cols", value_name="vals")

sns.lineplot(data=dfg, x = "year", y = "vals", hue="cols")
ax.set_title("Development of indicators worldwide from 2005-2020")
ax.set_xlabel("Year")
ax.set_ylabel("Normalized value")
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)

## 1.3 Geographic representation of score development
In the following we see a world map where each participating country is colored according to its score in the report. It is an interactive plot, which means that one can change the year that is displayed. As you can see, the scores are overall lower in Asia and Africa compared to the other continents. Furthermore, the set of participating African countries changes very often.  
The scores in South America and Asia increase in the first years of the report, but from about 2014 they decrease again.
Eastern Europe has a much lower score than the rest of Europe in all observed years.  
The scores in North America are consistently high.  

In [None]:
dfg = df
category_orders=[2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020]
fig = px.choropleth(dfg, locations="country_name", locationmode='country names', color="score", title="World Map", animation_frame="year", category_orders={"year":category_orders}, range_color=(2, 9), color_continuous_scale=["red", "green"])
fig.write_html(os.getcwd() + "/woma_lifeladder.html")
fig.show()

# 2 Significance of indicators


Next we take a look at the correlation between the different variables using sns.heatmap.
Remember, a correlation simply denotes whether values increase / decrease together. This is a statistical association and not proof of causality between the variables. Nonetheless, correlation is a great starting point to get an overview of our dataset and the relations within.   
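
For readers who want to see what the coefficient measures, the following minimal sketch computes Pearson's r by hand on made-up numbers and compares it with NumPy's built-in estimate.

```python
import numpy as np

# Made-up paired observations.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Pearson's r: the summed products of the centered values,
# divided by the square root of the product of the summed squares.
x_centered = x - x.mean()
y_centered = y - y.mean()
r_manual = (x_centered * y_centered).sum() / np.sqrt((x_centered**2).sum() * (y_centered**2).sum())

# NumPy's built-in estimate for comparison.
r_numpy = np.corrcoef(x, y)[0, 1]
print(r_manual, r_numpy)
```

A value close to 1 only says that the two series move together almost linearly; it says nothing about which one, if any, drives the other.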

In [None]:
fig = plt.figure(figsize=(12,8))
ax = fig.add_subplot(111)
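# Correlation is only defined for numeric columns, so drop the text columns first
sns.heatmap(df.drop(columns=["continent", "country_name"]).corr(), annot=True, cmap="coolwarm", vmin=-1, vmax=1);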

Before the analysis, let's take a step back and calculate the significance of the correlations using the [Pearson correlation from scipy](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.pearsonr.html).

*Note: We only display significant correlations $p < \alpha ; \space \alpha = 0.05$*  
Strong colors (a deep red or blue) represent stronger correlation. Naturally, the correlation of a variable with itself is 1 (the diagonal).  
We observe a strong positive correlation between *score* and *log_gdp* (0.79, $p < 0.001$), *social_support* (0.71, $p < 0.001$), *life_expectancy* (0.74, $p < 0.001$) and to a lesser degree *freedom* (0.59, $p < 0.001$) and *generosity* (0.19, $p < 0.001$). This means that countries with a stronger economy tend to have a much higher score. 
Surprisingly, despite the report stating 
> "for corruption a higher rank means a lower perceived frequency of corruption" (https://worldhappiness.report/ed/2019/changing-world-happiness/)

there is a negative correlation between *score* and *corruption* of -0.43 ($p < 0.001$).  
Next, the strongest correlation can be found between *log_gdp* and *life_expectancy*, suggesting that in countries with higher real GDP, the life expectancy at birth is higher than in low GDP countries.   
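
The heatmap only shows significant cells in the lower triangle. A minimal sketch of how such a mask can be built, using a hypothetical 3x3 p-value matrix instead of the real one:

```python
import numpy as np

# Hypothetical symmetric p-value matrix for three variables (diagonal set to 0).
p_values = np.array([
    [0.00, 0.01, 0.30],
    [0.01, 0.00, 0.04],
    [0.30, 0.04, 0.00],
])

significant = p_values < 0.05
# np.tril sets everything above the diagonal to False, so each pair appears only once.
keep = np.tril(significant)
# seaborn hides cells where the mask is True, hence the inversion.
mask = np.invert(keep)
print(mask)
```

In this toy example the pair with $p = 0.30$ is hidden, together with the whole upper triangle, while the diagonal and the two significant pairs remain visible.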

In [None]:
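def corr_sig(df):
    """Return the matrix of p-values of all pairwise Pearson correlations."""
    cols = df.columns.to_list()
    # The diagonal stays 0, so a variable's correlation with itself always counts as significant
    p_matrix = np.zeros(shape=(len(cols), len(cols)))
    for i, col in enumerate(cols):
        for j, col2 in enumerate(cols):
            if i == j:
                continue
            # pearsonr cannot handle NaNs, so drop incomplete pairs first
            df2 = df[[col, col2]].dropna()
            _, p = stats.pearsonr(df2[col], df2[col2])
            p_matrix[i, j] = p
    return p_matrix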

In [None]:
fig = plt.figure(figsize=(12,8))
ax = fig.add_subplot(111)

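# Use the same numeric columns for the p-values and the correlations so that the mask lines up
df_num = df.drop(columns=["continent", "country_name"])
p_values = corr_sig(df_num).round(5)
mask = np.invert(np.tril(p_values<0.05))    
sns.heatmap(df_num.corr(), annot=True, mask=mask, cmap="coolwarm", vmin=-1, vmax=1);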

# 3 Top & Bottom 10 countries
In the next plot we can see the change (in percent) between 2005 and 2020 in the score of the ten best and ten worst ranked countries of 2005. Interestingly, the scores of the best ranked countries decrease slightly (by less than 10%). Venezuela is an outlier with its large decrease of -36.21%. The scores of the worst ranked countries rise, fall or barely move: the changes range from 21.41% to -34.96%. Hence the development of the worst ranked countries is very diverse, whereas the development of the best ranked countries is overall slightly negative.
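
The percentage change, including the fallback to 2019 for countries without a 2020 entry, can be sketched compactly. This toy example uses hypothetical scores and a pivot table; it is an alternative formulation, not the loop used in the next cell.

```python
import pandas as pd

# Hypothetical scores; country "B" has no 2020 entry.
toy = pd.DataFrame({
    "country_name": ["A", "A", "A", "B", "B"],
    "year": [2005, 2019, 2020, 2005, 2019],
    "score": [8.0, 7.8, 7.6, 4.0, 4.4],
})

# One row per country, one column per year.
wide = toy.pivot(index="country_name", columns="year", values="score")

# Use the 2020 score where available, otherwise fall back to 2019.
latest = wide[2020].fillna(wide[2019])
change_pct = (latest / wide[2005] - 1) * 100
print(change_pct.round(2))
```

Countries that needed the fallback can be listed with `wide.index[wide[2020].isna()]`, which is handy for annotating the plot.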

In [None]:
# For the 10 best countries
df_best = df[df.year == 2005].sort_values(by="score").tail(10).reset_index()[["country_name"]]

old = []
new = []
change = []
val_2019 = []
for c in df_best.country_name.values:
    old.append(df[(df.country_name == c) & (df.year == 2005)].score.values)
    new.append(df[(df.country_name == c) & (df.year == 2020)].score.values)
    if len(new[-1]) == 0: 
        new[-1] = df[(df.country_name == c) & (df.year == 2019)].score.values
        val_2019.append(c)
    change.append(new[-1]/old[-1])
df_best["old"] = [k for p in old for k in p]
df_best["new"] = [k for p in new for k in p]
df_best["change"] = [(k-1)*100 for p in change for k in p]

df_best.sort_values(by="change", ascending=False, inplace=True)
df_best.reset_index(drop=True, inplace=True)

palette_best = {}
for i, country in enumerate(df_best.country_name):
    palette_best[country] = f"C{i}"

# For the 10 worst countries
df_worst = df[df.year == 2005].sort_values(by="score").head(10).reset_index()[["country_name"]]

old = []
new = []
change = []
val_2019 = []
for c in df_worst.country_name.values:
    old.append(df[(df.country_name == c) & (df.year == 2005)].score.values)
    new.append(df[(df.country_name == c) & (df.year == 2020)].score.values)
    if len(new[-1]) == 0: 
        new[-1] = df[(df.country_name == c) & (df.year == 2019)].score.values
        val_2019.append(c)
    change.append(new[-1]/old[-1])
df_worst["old"] = [k for p in old for k in p]
df_worst["new"] = [k for p in new for k in p]
df_worst["change"] = [(k-1)*100 for p in change for k in p]

df_worst.sort_values(by="change", ascending=False, inplace=True)
df_worst.reset_index(drop=True, inplace=True)

palette_worst = {}
for i, country in enumerate(df_worst.country_name):
    palette_worst[country] = f"C{i}"

In [None]:
fig = plt.figure(figsize=(24,8))
ax1 = fig.add_subplot(121)
ax2 = fig.add_subplot(122)

sns.barplot(x=df_best.change, y=df_best.country_name, palette=palette_best, ax=ax1)

ax1.set_xlim((-40, 40))
ax1.set_xlabel("Change in percent")
ax1.set_ylabel("Country")
ax1.set_title("Development of score between 2005 and 2020 of the 10 best ranked countries in 2005", fontdict={"fontsize":15})
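# val_2019 now holds the worst-ranked fallbacks, so recompute the best-ranked countries without a 2020 entry
val_2019_best = [c for c in df_best.country_name if df[(df.country_name == c) & (df.year == 2020)].empty]
if(len(val_2019_best) > 0): ax1.text(-40, 10.8, f"{val_2019_best} were compared with 2019 due to missing data.")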
for index, row in df_best.iterrows():
    ax1.text(row.change, index, round(row.change,2), color='black', ha="left")

sns.barplot(x=df_worst.change, y=df_worst.country_name, palette=palette_worst, ax=ax2)

ax2.set_xlim((-40, 40))
ax2.set_xlabel("Change in percent")
ax2.set_ylabel("Country")
ax2.set_title("Development of score between 2005 and 2020 of the 10 worst ranked countries in 2005", fontdict={"fontsize":15})
if(len(val_2019) > 0): ax2.text(-40, 10.8, f"{val_2019} were compared with 2019 due to missing data.")
for index, row in df_worst.iterrows():
    ax2.text(row.change, index, round(row.change,2), color='black', ha="left")

Now we plot the development of the score for each of the ten best ranked countries in 2005. Again, Venezuela is an outlier due to its drastic changes and negative development. To draw better conclusions about the best ranked countries, we therefore set Venezuela aside in the interpretation because it does not share their characteristics.
Overall we see a steady development. Importantly, the standard deviation is quite low with an average of 0.26 (including Venezuela), and six countries have a standard deviation <0.2. A low standard deviation might be a characteristic of the best ranked countries. Furthermore, France, Saudi Arabia and Spain have a lower score in 2009.

In [None]:
dfg = df[df.country_name.isin(df_best.country_name)][["country_name", "score", "year"]]
std_best = {}
for country in df_best.country_name:
    std_best[country] = np.std(dfg[dfg.country_name==country].score.values)

g = sns.FacetGrid(dfg, col="country_name", hue="country_name", palette=palette_best, height=5, col_wrap = 5, aspect=0.9)
g.map(sns.lineplot, "year", "score")
g.set(xlim=(2005, 2020), ylim=(3, 9), xticks=range(2005, 2021, 2), xlabel="Year", ylabel="Score")
g.fig.suptitle("Development of the score of the 10 best ranked countries in 2005 with std")
g.fig.subplots_adjust(top=0.9)
for ax, country in zip(g.axes.flat, dfg.country_name.unique()):
    ax.set_title(f"{country} (std = {std_best[country]:.2f})")

print(f"Mean standard deviation of the ten BEST ranked countries in 2005 is {np.mean(list(std_best.values()))}.")

Next, we see the same plot for the worst ranked countries in 2005. There are many ups and downs over the years. At 0.41, the average standard deviation is much higher than for the best ranked countries (0.26). No country has a standard deviation <0.2, and six countries have a standard deviation >0.4. Hence volatility is high.
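
One detail matters when reproducing these numbers: `np.std` uses the population formula (`ddof=0`) by default, while the pandas `.std()` method defaults to the sample formula (`ddof=1`). A minimal sketch with hypothetical scores shows the difference.

```python
import numpy as np
import pandas as pd

# Hypothetical score series for one stable and one volatile country.
toy = pd.DataFrame({
    "country_name": ["Stable"] * 4 + ["Volatile"] * 4,
    "score": [7.0, 7.1, 6.9, 7.0, 4.0, 5.0, 3.5, 4.5],
})

grouped = toy.groupby("country_name")["score"]
# Population standard deviation, matching np.std on the raw values.
std_population = grouped.std(ddof=0)
# pandas default: sample standard deviation (ddof=1), always slightly larger.
std_sample = grouped.std()

print(std_population.round(3))
print(std_sample.round(3))
```

With only around 15 yearly observations per country the two variants differ by a few percent, which is enough to move a country across a threshold such as 0.2 or 0.4.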

In [None]:
dfg = df[df.country_name.isin(df_worst.country_name)][["country_name", "score", "year"]]
std_worst = {}
for country in df_worst.country_name:
    std_worst[country] = np.std(dfg[dfg.country_name==country].score.values)

g = sns.FacetGrid(dfg, col="country_name", hue="country_name", palette=palette_worst, height=5, col_wrap = 5, aspect=0.9)
g.map(sns.lineplot, "year", "score")
g.set(xlim=(2005, 2020), ylim=(3, 9), xticks=range(2005, 2021, 2), xlabel="Year", ylabel="Score")
g.fig.suptitle("Development of the score of the 10 worst ranked countries in 2005 with std")
g.fig.subplots_adjust(top=0.9)
for ax, country in zip(g.axes.flat, dfg.country_name.unique()):
    ax.set_title(f"{country} (std = {std_worst[country]:.2f})")

print(f"Mean standard deviation of the ten WORST ranked countries in 2005 is {np.mean(list(std_worst.values()))}.")

# 4 Continent View

Next, we would like to investigate whether there are differences between the six observed continents (sorry Antarctica!) and how each continent developed concerning both the general happiness and all of the previously mentioned indicator variables. 

A few key points one needs to keep in mind when analysing and interpreting this data:
* As observed in 1.1, there were very few countries when the report started in 2005. It took some years to extend to up to 150 countries, however, 2020 is another outlier in sample size due to the coronavirus. 
* Oceania only consists of Australia and New Zealand (and a total of only 28 observations). We expect there to be more extreme volatility in general. 
* Some countries failed to report some information in some years. This may be due to crises (e.g. Syrian War, Corona Crisis) or governmental influence (e.g. Saudi Arabia).


## 4.1 General development

Plotting the mean score for each continent for all years, we find that on average, people living in Oceania are happiest (keep in mind that only two countries, Australia and New Zealand, are observed). 
Generally, after more countries were included in the analysis in 2006, the initially high average plummeted in all continents. We will therefore analyse from 2006 on. We briefly examine each continent:
* *Asia (AS)* averages a score between 5 and 5.5 for the observed time, displaying a horizontal trend in the years 2006 to 2018. There is a clear upward trend in the last two observed years, finally breaking the 5.5 score barrier. 
* *Europe (EU)* shows a general upward trend since the 2009 dip. We hypothesize that this dip was caused by the 2008 financial crisis, which would also explain the prior sharp increase in score: GDP rose in the flourishing pre-crash economy, and our correlation above suggests that GDP is one of the most important indicators for happiness.
* *Africa (AF)* averages the lowest score, between 4 and 4.5, showing a positive trend starting in 2016 and ongoing in 2020. 
* *South America (SA)* shows higher volatility (a score between 5.5 and 6.5) than other continents, with a declining trend in 2020. 
* *North America (NA)* exhibits very similar trends to its direct neighbor SA, although its score is currently slightly higher (by 0.5).
* *Oceania (OC)* shows the highest average happiness over all observed years.
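
The lines in the plot are yearly means per continent, since `sns.lineplot` averages repeated x values by default. The same numbers can be obtained as a table, sketched here on hypothetical scores:

```python
import pandas as pd

# Hypothetical scores for two continents and two years.
toy = pd.DataFrame({
    "year": [2019, 2019, 2019, 2020, 2020, 2020],
    "continent": ["EU", "EU", "OC", "EU", "EU", "OC"],
    "score": [6.0, 7.0, 7.2, 6.5, 7.5, 7.1],
})

# Mean score per year and continent; unstack turns the continents into columns.
continent_mean = toy.groupby(["year", "continent"])["score"].mean().unstack()
print(continent_mean)
```

A table like this makes it easy to read off exact values, for example when checking whether a continent really crossed a threshold such as 5.5.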

In [None]:
fig, ax = plt.subplots(figsize=(10,5))
ax = sns.lineplot(data = df, x = "year", y = "score", hue = "continent", style = "continent", markers = True,ax = ax, ci = False, linewidth = 2)
ax.set_title("Development of Happiness score by continent")
ax.set_xlabel("Year")
ax.set_ylabel("Score")
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.0, title="Continents")
plt.show()

In [None]:
# helper function to create subdatasets
def create_indicator_df(df,cont = "EU"):
    # select relevant data and drop irrelevant cols
    sub_df = df[df["continent"] == cont].drop(columns =["country_name","continent"])
    # normalize between 0 and 1 
    sub_df = (sub_df - sub_df.min())/(sub_df.max() - sub_df.min())
    sub_df.year = df.year
    sub_df = sub_df.melt("year", var_name="cols", value_name="vals")
    return sub_df

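# numeric_only=True skips the text columns, which newer pandas versions refuse to average
df_moy = df.groupby("year").mean(numeric_only=True).reset_index()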
print(df_moy)


## 4.2 Indicator Development Asia

In *Asia (AS)* we can observe the following when inspecting each indicator. 
* *corruption* is high (meaning very little perceived corruption), however, slowly trending downwards.
* *log_gdp*, *life_expectancy*, *positive_affects*, *social_support* and *rank* generally trend upwards. This relates to the recent economic boom and perceived increased development of many Asian countries. 
* *generosity* is at an all-time low.

Overall, the participating Asian countries' populations seem to be increasingly satisfied with their respective countries. The economic boom, increased *freedom* and social support are especially notable.  

In [None]:
dfg = create_indicator_df(df,"AS")
g = sns.FacetGrid(dfg, col="cols", hue = "cols", height=3, col_wrap=3,aspect =3)
g.map(sns.lineplot, "year", "vals", markers = True, ci =False, linewidth = 2)
g.set_axis_labels("Year","Normalized Score")
g.set_titles(col_template = "{col_name} indicator")
g.fig.subplots_adjust(top=0.9)
g.fig.suptitle("Development of indicators 2005-2020: Asia")

## 4.3 Indicator Development Europe

In *Europe (EU)* we can observe the following when inspecting each indicator. 
* After a very turbulent period from 2007 to 2009 (financial crisis), there is a clear and strong upward trend for *log_gdp*, *life_expectancy*, *positive_affects*, *social_support* and *rank*, indicating that, in general, Europe is getting happier.
* However, *corruption* shows a strong downward trend. Analysis shows that Germany, Estonia, Ireland, Poland and Austria have the largest standard deviations (0.14, 0.11, 0.10, 0.09), i.e. the biggest changes over the years. More precisely, of these five, only Poland went above the EU average corruption; all other countries reported significantly lower corruption scores lately.
* *positive_affects* clearly shows the 2008 crisis but also indicates increased positivity in 2020 despite the corona crisis.

Overall, the participating European countries' populations seem to be increasingly satisfied with their respective countries, although the increase in perceived *corruption* worries many.

In [None]:
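# Print the five European countries whose corruption values vary most (a bare expression mid-cell is not displayed)
print(df[df["continent"] == "EU"].groupby("country_name")["corruption"].std().nlargest(5))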
s = df[df["continent"] == "EU"].groupby(["year"])["corruption"].mean().reset_index()
df_corr_eu = df[df["country_name"].isin(["Germany", "Estonia", "Ireland", "Poland", "Austria"])]
fig, ax = plt.subplots(figsize=(10,5))
ax = sns.lineplot(data = s, x = "year", y = "corruption", color = "black")
ax.annotate(xy = (2005, s.iloc[0]["corruption"]-0.06), text = "EU Mean")
ax = sns.lineplot(data = df_corr_eu, x = "year", y = "corruption", hue = "country_name", style = "country_name", markers = True,ax = ax, ci = False, linewidth = 2)
ax.set_title("Comparison of corruption in the five countries with the largest STD")
ax.set_xlabel("Year")
ax.set_ylabel("Corruption")
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.0, title="Country")
plt.show()


In [None]:
dfg = create_indicator_df(df,"EU")
g = sns.FacetGrid(dfg, col="cols", hue = "cols", height=3, col_wrap=3,aspect =3)
g.map(sns.lineplot, "year", "vals", markers = True, ci =False, linewidth = 2)
g.set_axis_labels("Year","Normalized Score")
g.set_titles(col_template = "{col_name} indicator")
g.fig.subplots_adjust(top=0.9)
g.fig.suptitle("Development of indicators 2005-2020: Europe")

## 4.4 Indicator Development North America

In *North America (NA)* we can observe the following when inspecting each indicator. 
* Every indicator other than *generosity* is comparatively high, but *generosity* is very low (only Oceania ranks lower).
* *freedom*, *life_expectancy* and *log_gdp*, but also *negative_affect*, increased steadily over the last observed period. Compared to other continents, *negative_affect* is very high.

Overall, the participating North American countries' populations seem to be increasingly satisfied with their respective countries, especially economically, but the increase in perceived *corruption* and *negative_affect* worries many.

In [None]:
dfg = create_indicator_df(df,"NA")
g = sns.FacetGrid(dfg, col="cols", hue = "cols", height=3, col_wrap=3,aspect =3)
g.map(sns.lineplot, "year", "vals", markers = True, ci =False, linewidth = 2)
g.set_axis_labels("Year","Normalized Score")
g.set_titles(col_template = "{col_name} indicator")
g.fig.subplots_adjust(top=0.9)
g.fig.suptitle("Development of indicators 2005-2020: North America")

## 4.5 Indicator Development Africa

In *Africa (AF)* we can observe the following when inspecting each indicator. 
* *life_expectancy* saw the biggest increase of all continents. However, this sharp increase is only weakly visible in the overall score, supporting our correlation analysis that *life_expectancy* is a medium-strength indicator. 
* *freedom*, *generosity* and *log_gdp*, but also *negative_affect*, fluctuated heavily over the observed period, but in general show upward tendencies. *corruption* follows a very stable horizontal trend with little movement in any direction.
* *social_support* decreased steadily after 2010.

Overall, the participating African countries' populations seem to be increasingly satisfied with their respective countries, but the increase in *negative_affect* means many worry on a daily basis. Additionally, the decrease in social support dampens the positive trend many indicators exhibit.

In [None]:
dfg = create_indicator_df(df,"AF")
g = sns.FacetGrid(dfg, col="cols", hue = "cols", height=3, col_wrap=3,aspect =3)
g.map(sns.lineplot, "year", "vals", markers = True, ci =False, linewidth = 2)
g.set_axis_labels("Year","Normalized Score")
g.set_titles(col_template = "{col_name} indicator")
g.fig.subplots_adjust(top=0.9)
g.fig.suptitle("Development of indicators 2005-2020: Africa")

## 4.6 Indicator Development Oceania

As the data is so sparse (remember, there are only between one and three countries per year), we were going to omit the analysis here, but we wanted to point out some key facts anyway:
* Unsurprisingly, *life_expectancy* and *log_gdp* show an upward trend.
* All other indicators show downward trends, which is reflected in the overall score's slight downward trend.


In [None]:
dfg = create_indicator_df(df,"OC")
g = sns.FacetGrid(dfg, col="cols", hue = "cols", height=3, col_wrap=3,aspect =3)
g.map(sns.lineplot, "year", "vals", markers = True, ci =False, linewidth = 2)
g.set_axis_labels("Year","Normalized Score")
g.set_titles(col_template = "{col_name} indicator")
g.fig.subplots_adjust(top=0.9)
g.fig.suptitle("Development of indicators 2005-2020: Oceania")

## 4.7 Indicator Development South America

In *South America (SA)* we can observe the following when inspecting each indicator. 
* *life_expectancy* and *log_gdp* increased steadily over the years.
* *social_support* and *positive_affect* both saw downward trends; correspondingly, *negative_affects* saw an upward trend.
* Surprisingly, *freedom* and *corruption* exhibit upward trends until 2017 / 2018; since then a horizontal or slightly declining trend is observable. 

Overall, the participating South American countries' populations seem to be increasingly **less** satisfied with their respective countries, despite increases in life_expectancy and log_gdp. Additionally, the decrease in social support and increased negative affects dampen the positive trend some other indicators exhibit.


In [None]:
dfg = create_indicator_df(df,"SA")
g = sns.FacetGrid(dfg, col="cols", hue = "cols", height=3, col_wrap=3,aspect =3)
g.map(sns.lineplot, "year", "vals", markers = True, ci =False, linewidth = 2)
g.set_axis_labels("Year","Normalized Score")
g.set_titles(col_template = "{col_name} indicator")
g.fig.subplots_adjust(top=0.9)
g.fig.suptitle("Development of indicators 2005-2020: South America")

# Conclusions

To get a feeling for the data set and a good overview, we started with basic steps by looking at the participating countries in general and the indicators' development over the years. This helped us start our analysis and gave us first hints about our research questions. Next, we plotted the correlation matrices so that we could analyse which indicators have the most impact on the score. We found that GDP, life expectancy and social support in particular were the most important influences on the happiness score. 
The most interesting countries are usually those with the most extreme data. For us, these were the top and bottom 10 countries. We found that the scores of the top 10 countries decrease slightly over time, except for Venezuela, whose score dropped sharply. Unfortunately, we couldn't really find the reasons for Venezuela's huge drop in our data, but with a little googling we learned that in 2010 poverty and inflation began to rise, and in 2013 shortages of e.g. flour, milk and other necessities caused malnutrition. In 2016, Venezuela's president decreed an "economic emergency". After that, the borders were opened for imports and the economy started to get better again. All these increases and decreases can be seen in the second-to-last diagram in section 3. 
Lastly, we compared the continents and looked at their development in detail. Overall, Oceania seems to have the highest average happiness score over the years. The specific details can be found in section 4. 

We were very excited while working with the data set. Seeing the different impacts an indicator can have and analysing the development of these indicators across countries, continents or simply time was fascinating. Nevertheless, the data set carries much more information than we discovered, and it will always be intriguing to check the new World Happiness Report from time to time. 