# World Happiness Data 2015-2020

In [None]:
from IPython.display import Image
Image('../input/header/smile.jpg')

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler
import geopandas as gpd
import plotly.express as px
import plotly.graph_objects as go
pd.options.plotting.backend = "plotly"

## Load Data

In [None]:
# Create a dictionary with each data set as items.
years = [2015, 2016, 2017, 2018, 2019, 2020]
dataset = {year : pd.read_csv(f'../input/world-happiness-report/{year}.csv')
              for year in years}

In [None]:
# Column headers are inconsistent between files, so we give them the same names.
for item in dataset.values():
    item.rename(columns = {'Ladder score' : 'Score',
                           'GDP per capita' : 'Logged GDP per capita',
                           'Happiness.Rank' : 'Overall rank',
                           'Happiness Rank' : 'Overall rank',
                           'Happiness.Score' : 'Score',
                           'Happiness Score' : 'Score',
                           'Economy..GDP.per.Capita.' : 'Logged GDP per capita',
                           'Health..Life.Expectancy.' : 'Healthy life expectancy',
                           'Trust..Government.Corruption.' : 'Perceptions of corruption',
                           'Family' : 'Social support',
                           'Freedom' : 'Freedom to make life choices',
                           'Dystopia.Residual' : 'Dystopia Residual',
                           'Economy (GDP per Capita)' : 'Logged GDP per capita',
                           'Health (Life Expectancy)' : 'Healthy life expectancy',
                           'Trust (Government Corruption)' : 'Perceptions of corruption',
                           'Country name' : 'Country',
                           'Country or region' : 'Country'
                           },
                inplace=True)

In [None]:
# Divide the variables into "explained by" and regular variables.
variables =  ['Explained by: Log GDP per capita', 
             'Explained by: Social support',
             'Explained by: Healthy life expectancy',
             'Explained by: Freedom to make life choices',
             'Explained by: Generosity',
             'Explained by: Perceptions of corruption',
             ]
variables2 =  ['Logged GDP per capita',
             'Social support',
             'Healthy life expectancy',
             'Freedom to make life choices',
             'Generosity',
             'Perceptions of corruption',
             ]

In [None]:
# Make the country names the index for each data frame.
for year in years:
    dataset[year].set_index('Country', inplace=True)

## Overview

We will start with an overview of the data from 2020 and turn to the earlier years later.

We will consider the 'Explained by' variables separately from the raw measurements. Below we look at the raw measurement data, in descending order of happiness score. The cells are coloured so that the larger values in each column are darker.

Some trends are immediately obvious:

1) Factors closely linked with economic development (i.e. GDP per capita, social support, and healthy life expectancy) are very strongly linked with happiness score.

2) There are some noticeable exceptions to the above. Many countries seem to be happier than their economy would suggest. Costa Rica in particular has a very high happiness score given its GDP per capita.

In [None]:
dataset[2020][['Score'] + variables2].style.background_gradient(cmap='Blues')

Below we show the correlation matrix for the variables. It confirms that GDP per capita, social support and healthy life expectancy have the strongest correlations with happiness score. Freedom and perception of corruption are also clearly correlated with it. Generosity shows only a weak correlation with score.

In [None]:
dataset[2020][['Score'] + variables2].corr().style.background_gradient(cmap='RdBu', vmin=-1, vmax=1)
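To read the ranking off the matrix directly, we could sort its 'Score' column. The sketch below is only an illustration: `toy` is a made-up frame standing in for `dataset[2020][['Score'] + variables2]`, and its values are not taken from the report.

```python
import pandas as pd

# Hypothetical toy data standing in for dataset[2020][['Score'] + variables2].
toy = pd.DataFrame({
    'Score': [7.8, 7.1, 6.0, 5.1, 4.2, 3.5],
    'Logged GDP per capita': [10.9, 10.6, 9.8, 9.1, 8.2, 7.5],
    'Generosity': [0.10, -0.05, 0.02, 0.12, -0.10, 0.05],
})

# Correlation of every variable with Score, strongest first.
score_corr = toy.corr()['Score'].drop('Score').sort_values(ascending=False)
print(score_corr)
```

Applied to the real data, the same two steps would give a one-column summary of the matrix above.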

Here we plot happiness score against each variable, together with a local regression (LOWESS) trend line. Note that many of the trend lines are not linear. GDP per capita shows a slight increase in gradient for the richest nations, but it is a small change. Happiness, however, appears to grow roughly exponentially with social support. It is also interesting that the perception of corruption does not seem to have any significant effect on happiness until it falls below about 0.6.

In [None]:
for var in variables2:
    dataset[2020].plot.scatter(x=var,
                               y='Score',
                              hover_name=dataset[2020].index,
                              trendline="lowess").show()

In [None]:
shapefile = '../input/map-files/ne_110m_admin_0_countries.shp'
#Read shapefile using Geopandas
gdf = gpd.read_file(shapefile)[['ADMIN', 'ADM0_A3', 'geometry']]
#Rename columns.
gdf.columns = ['country', 'country_code', 'geometry']
#Drop the row corresponding to 'Antarctica' by name rather than by position.
gdf = gdf[gdf['country'] != 'Antarctica']

In [None]:
replacements = {'United States of America' : 'United States',
                'Czechia' : 'Czech Republic',
                'Taiwan' : 'Taiwan Province of China',
                'Republic of Serbia' : 'Serbia',
                'Palestine' : 'Palestinian Territories',
                # The Republic of the Congo is Congo (Brazzaville); the DRC is Congo (Kinshasa).
                'Republic of the Congo' : 'Congo (Brazzaville)',
                'Democratic Republic of the Congo' : 'Congo (Kinshasa)',
                'eSwatini' : 'Swaziland',
                'United Republic of Tanzania' : 'Tanzania',
                }

In [None]:
# Make names match
gdf = gdf.replace(replacements)

In [None]:
#Merge dataframes gdf and dataset[2020].
merged = gdf.merge(dataset[2020], left_on = 'country', right_on = 'Country', how='left')
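Because this is a left merge on country names, any name that differs between the shapefile and the survey silently ends up with missing values. One way to find such names is an outer merge with `indicator=True`. The sketch below uses two made-up frames (`map_names` and `survey`) in place of `gdf` and `dataset[2020]`. For simplicity 'Country' is an ordinary column here, whereas in the notebook it is the index.

```python
import pandas as pd

# Hypothetical stand-ins for gdf (map names) and dataset[2020] (survey names).
map_names = pd.DataFrame({'country': ['Finland', 'Czechia', 'Antarctica']})
survey = pd.DataFrame({'Country': ['Finland', 'Czech Republic'],
                       'Score': [7.8, 6.9]})

# An outer merge with indicator=True adds a '_merge' column that says
# whether each row came from the left frame, the right frame, or both.
check = map_names.merge(survey, left_on='country', right_on='Country',
                        how='outer', indicator=True)

# Names that appear only on the map, and names that appear only in the survey.
only_map = check.loc[check['_merge'] == 'left_only', 'country'].tolist()
only_survey = check.loc[check['_merge'] == 'right_only', 'Country'].tolist()
print(only_map, only_survey)
```

Names that show up in `only_survey` are candidates for the replacements dictionary. Names in `only_map` may simply be places the survey does not cover.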

And finally for this overview section, we plot happiness score on a map so that we can look for geographic trends. The figure clearly shows that Western Europe, North and South America, Australia and New Zealand are the happiest places in the world. Africa and South Asia are the least happy places in the world.

In [None]:
fig = px.choropleth(merged, locations="country_code",
                    color='Score',
                    hover_name="country", # column to add to hover information
                    color_continuous_scale=px.colors.sequential.RdBu,
                   )
fig.update_layout(
    autosize=False,
    width=950,
    height=600,)
fig.show()

## Trends

In this section we compare the data sets from different years to look for trends over time.

In [None]:
# Collect each year's scores into one frame: countries as rows, years as columns.
happiness_ts = pd.concat([dataset[year]['Score'] for year in years], axis=1)
happiness_ts.columns = years

In [None]:
df = dataset[2019].copy()
df['Delta'] = happiness_ts[2019] - happiness_ts[2015]
df['2015 Score'] = happiness_ts[2015]

In [None]:
#Merge dataframes gdf and df.
merged = gdf.merge(df, left_on = 'country', right_on = 'Country', how='left')

Here we plot 'Delta', the change in happiness score between 2015 and 2019. Venezuela is the country that has had the most notable change in happiness in this time, with a fall of 2.103. The biggest gains in happiness are in West Africa, which was among the least happy regions in the world in 2015. Eastern Europe has also seen significant gains in happiness. Southern Africa was already one of the least happy regions in 2015, and has gotten worse in the years to 2019. North and South America saw happiness fall in these years, but are still among the happiest places in the world. Europe, already one of the happiest places in the world, saw further gains in these years.

In [None]:
fig = px.choropleth(merged, locations="country_code",
                    color='Delta',
                    hover_name="country", # column to add to hover information
                    color_continuous_scale=px.colors.sequential.RdBu,
                   )
fig.update_layout(
    autosize=False,
    width=950,
    height=600,)
fig.show()
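The largest movers can also be listed directly rather than read off the map. A minimal sketch, assuming a frame shaped like `happiness_ts`; the frame `toy_ts`, the country labels 'A' to 'D' and the scores are all made up for illustration.

```python
import pandas as pd

# Hypothetical scores; the notebook's happiness_ts has one column per year.
toy_ts = pd.DataFrame({2015: [6.8, 3.3, 7.4, 4.6],
                       2019: [4.7, 4.9, 7.8, 4.4]},
                      index=['A', 'B', 'C', 'D'])

# Change in score; countries missing from either year are dropped.
delta = (toy_ts[2019] - toy_ts[2015]).dropna()

biggest_falls = delta.nsmallest(2)  # most negative changes first
biggest_gains = delta.nlargest(2)   # most positive changes first
print(biggest_falls)
print(biggest_gains)
```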

Box plots of happiness score in each year from 2015-2020 show that while the average happiness score does not show any particular trend, the difference between the happiest and least happy nations has been growing. 

In [None]:
px.box(happiness_ts,
       points='all',
       hover_data=[happiness_ts.index],
       labels=dict(variable="Year", value="Score"),
      )
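The widening gap can also be checked numerically rather than read off the box plots. A minimal sketch, assuming a frame shaped like `happiness_ts` (countries as rows, years as columns); `toy_ts` and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical scores for three countries in two years.
toy_ts = pd.DataFrame({2015: [7.5, 5.0, 3.0],
                       2020: [7.8, 5.1, 2.6]})

# One row per year: the centre of the distribution and two measures of spread.
spread = pd.DataFrame({
    'median': toy_ts.median(),
    'range': toy_ts.max() - toy_ts.min(),
    'iqr': toy_ts.quantile(0.75) - toy_ts.quantile(0.25),
})
print(spread)
```

A range that grows while the median stays flat would match the pattern described above. The interquartile range shows whether the change is limited to the extremes.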

In [None]:
ts = {var : pd.DataFrame(columns=years[:-1]) for var in variables2}
for var in variables2:
    for year in years[:-1]:
        ts[var][year] = dataset[year][var]

The box plots below show the trends from 2015-2019 for the other variables. There do not seem to be any strong trends, although healthy life expectancy has improved in this time, especially among the worst performing countries.

In [None]:
for var in variables2:
    fig = px.box(ts[var],
                 points='all',
                 hover_data=[ts[var].index],
                 labels=dict(variable="Year", value=var)
                )
    fig.show()

## Clustering
In this section, we look to see if there is more than one way to be a happy country. We do this by looking at the happiest quartile of nations for which we have data and using a k-means clustering algorithm on the "Explained by" variables. The elbow plot below suggests that three clusters is a reasonable choice.

In [None]:
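year = 2020
# This relies on the 2020 data being sorted by score in descending order,
# so that the first quarter of the rows is the happiest quartile.
# .copy() lets us add a 'cluster' column later without writing to a slice of dataset[year].
clusterset = dataset[year][:len(dataset[year])//4].copy()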

In [None]:
kdata = clusterset[variables]
X = kdata.values

In [None]:
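# Min-max scale the features. Note that the k-means fits below use the unscaled X,
# so X_scaled is not used in the rest of the notebook.
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)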

In [None]:
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters = i, init = 'k-means++', random_state = 42)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)
plt.plot(range(1, 11), wcss, '.')
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.grid()
plt.show()
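Elbow plots can be hard to read, so a second opinion on the number of clusters may help. The silhouette score is one common choice. It is not used elsewhere in this notebook and is shown here only as a sketch. `X_toy` is made-up data with three well-separated groups, standing in for `X`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Hypothetical data: three tight, well-separated blobs standing in for X.
rng = np.random.default_rng(0)
centres = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X_toy = np.vstack([c + rng.normal(scale=0.2, size=(20, 2)) for c in centres])

# The silhouette score needs at least two clusters, so start at k = 2.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, init='k-means++', n_init=10,
                    random_state=42).fit_predict(X_toy)
    scores[k] = silhouette_score(X_toy, labels)

# The k with the highest mean silhouette is the best-separated clustering.
best_k = max(scores, key=scores.get)
print(best_k, scores)
```

On real data the peak is usually much less sharp than on this toy example, so it is best read alongside the elbow plot rather than instead of it.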

In [None]:
nclusters = 3
kmeans = KMeans(n_clusters = nclusters, init = 'k-means++', random_state = 42)
y_kmeans = kmeans.fit_predict(X)

In [None]:
clusterset['cluster']= y_kmeans

In [None]:
#Merge dataframes gdf and clusterset.
merged = gdf.merge(clusterset, left_on = 'country', right_on = 'Country', how='left')

# Cluster names need to be strings so we can colour by cluster.
merged['cluster'] = merged['cluster'].astype(str)

Plotting the three clusters on a map, we see that the clusters are quite diverse. Cluster 2 consists of Northern Europe, Australia, New Zealand, Canada, and the UAE. Cluster 1 includes Southern Europe, the USA, Saudi Arabia, Israel, and a few countries from Central and South America. Cluster 0 is mostly Central and South American countries, as well as Uzbekistan.

In [None]:
fig = px.choropleth(merged, locations="country_code",
                    color="cluster", 
                    hover_name="country", # column to add to hover information
                    color_discrete_map={'0.0':'green',
                                        '1.0':'red',
                                        '2.0':'blue',
                                        'nan': 'gray'},
                   )

fig.show()

In [None]:
# Find median values for each variable for each cluster.
# numeric_only=True skips text columns such as 'Regional indicator'.
cluster_median = {i : clusterset[clusterset['cluster']==i].median(numeric_only=True) for i in range(nclusters)}

Box plots of happiness score show that cluster 2 is the happiest, with clusters 0 and 1 having similar median happiness, but cluster 1 having a much larger range. It is interesting that Costa Rica and Singapore are outliers of their respective clusters, Costa Rica being much happier than its cluster, and Singapore much less so. 

In [None]:
fig = px.box(clusterset,
                 y='Score',
                 facet_col='cluster',
                 points='all',
                 hover_data=[clusterset.index],
                ) 
fig.show()
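Outliers like these can also be flagged programmatically with the usual box plot rule: a point is an outlier if it lies more than 1.5 times the interquartile range beyond the quartiles of its own cluster. The sketch below is an illustration on made-up data (`toy`, with labels 'a' to 'j'). Quartile conventions differ slightly between libraries, so the flagged points may not match the plot exactly.

```python
import pandas as pd

# Hypothetical scores for two clusters, each with one unusual member.
toy = pd.DataFrame({'Score': [6.0, 6.1, 6.2, 6.3, 7.1, 7.2, 7.3, 7.4, 7.5, 6.4],
                    'cluster': [0, 0, 0, 0, 0, 2, 2, 2, 2, 2]},
                   index=list('abcdefghij'))

# Per-cluster quartiles, broadcast back to one value per row.
grouped = toy.groupby('cluster')['Score']
q1 = grouped.transform(lambda s: s.quantile(0.25))
q3 = grouped.transform(lambda s: s.quantile(0.75))
fence = 1.5 * (q3 - q1)

# Flag rows that fall outside the fences of their own cluster.
is_outlier = (toy['Score'] < q1 - fence) | (toy['Score'] > q3 + fence)
outliers = toy[is_outlier]
print(outliers)
```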

We can use a radar chart showing the median values of the "Explained by" variables to understand the differences between the clusters. This shows a few trends: 

1) Cluster 2 outperforms the other clusters on every measure, which explains why it has the highest happiness score. 

2) If we use GDP per capita, social support, and healthy life expectancy as proxies for economic development, then we can see that clusters 1 and 2 are much more developed than cluster 0. 

3) The difference between cluster 2 and cluster 1 lies mostly in perception of corruption, generosity, and freedom rather than in the economic variables.

4) Although cluster 0 is much less economically developed than cluster 1, it has more freedom. 

To summarise, clusters 1 and 2 are the most economically developed, clusters 0 and 2 are freer, and cluster 2 is less corrupt and more generous. 

In [None]:
fig = go.Figure()
for i in range(nclusters-1, -1, -1):
    fig.add_trace(go.Scatterpolar(r=cluster_median[i][variables], theta=variables, fill='toself', name='Cluster ' + str(i)))
fig.show()

The radar chart only shows median values. Below we show box plots for each variable and cluster. The conclusions from the radar chart still hold.

In [None]:
fig = px.box(clusterset,
             y=['Explained by: Log GDP per capita',
                 'Explained by: Social support',
                 'Explained by: Healthy life expectancy',],
             color='cluster',
             points='all',
             hover_data=[clusterset.index],
            ) 
fig.show()

In [None]:
fig = px.box(clusterset,
             y=['Explained by: Freedom to make life choices',
                 'Explained by: Generosity',
                 'Explained by: Perceptions of corruption'],
             color='cluster',
             points='all',
             hover_data=[clusterset.index],
            ) 
fig.show()

Here we see the radar chart for the two happiness outliers, Singapore and Costa Rica. Costa Rica was much happier than other nations in cluster 0, while Singapore was much less happy than other countries in cluster 2. It is easy to see why Singapore was in cluster 2, as it is very economically developed and has a low level of perceived corruption. Similarly, Costa Rica matches the profile of cluster 0, though with a high GDP per capita and healthy life expectancy for that cluster. Given that the happiness score for Costa Rica is 7.12 and for Singapore is 6.38, it seems clear that Costa Rica has something extra that is making it a happy country that is not captured by this data.

In [None]:
nations = ['Singapore', 'Costa Rica']

fig = go.Figure()
for nation in nations:
    fig.add_trace(go.Scatterpolar(r=clusterset.loc[nation][variables],
                                  theta=variables,
                                  fill='toself',
                                  name=nation + ' : ' + str(clusterset.loc[nation]['Score'])
                                 )
                 )
fig.show()

## Regional Breakdown

Here we have a radar chart showing how the median values for each "Explained by" variable differ by region.

In [None]:
fig = go.Figure()
for region in dataset[2020]['Regional indicator'].unique():
    data = dataset[2020].where(dataset[2020]['Regional indicator'] == region).dropna().median(numeric_only=True)
    fig.add_trace(go.Scatterpolar(r=data[variables],
                                      theta=variables,
                                      fill='toself',
                                      name=region + ' : ' + str(data['Score'])
                                     )
                     )
fig.show()

## Conclusions

The most important factor in the happiness of a nation is economic development, as can be seen from the correlation matrix we produced in the overview section. Generosity seems to have little to do with the happiness of a nation, whereas happiness appears to rise roughly exponentially with social support. The perception of corruption within a nation shows only a small correlation with happiness, and then only once it falls below about 0.6. 

Although happiness worldwide has remained fairly constant on average over the years 2015-2020, the difference between the happiest and least happy countries has grown. The biggest increases in happiness between 2015 and 2019 were seen in West Africa. The biggest drop in happiness over the same period was in Venezuela. 

We can divide the happiest quartile of countries into three clusters. The first cluster (consisting mostly of Northern Europe) is the happiest, richest, and perceives the lowest levels of corruption. The second (consisting mostly of Southern Europe and the USA) is almost as rich as the first, but perceives more corruption. The third cluster (consisting mostly of Central and South American countries) is not as rich as the first two, but freer than the second. This third cluster is almost as happy as the second, despite being much poorer economically. 

We can see that there are factors that contribute to happiness that are not found in this data set. This is clear from looking at two significant outliers: Singapore and Costa Rica. Singapore beats Costa Rica on every measure apart from freedom, where they are almost equal, but Costa Rica is much happier. 