# Exploring the world happiness data

[DSLC stages]: EDA


In this document, we will perform exploratory data analysis (EDA) on the World Happiness data. The general structure of this document is as follows: each section poses a question about the data and then answers it through multiple exploratory visualizations. We will use the PCS framework (predictability, computability, and stability) to evaluate interesting findings and translate some of these into explanatory conclusions.

Next, we will load and clean/preprocess the World Happiness data.

In [1]:
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import seaborn as sns
import matplotlib.pyplot as plt
from plotly.subplots import make_subplots
from functions.clean_happiness import clean_happiness

pd.set_option('display.max_columns', None)

In [2]:
# load the data
happiness_orig = pd.read_excel("../data/WHR2018Chapter2OnlineData.xls", sheet_name=0)


In [3]:
happiness_clean = clean_happiness(happiness_orig, impute_method="average")
happiness_clean

Unnamed: 0,year,happiness,log_gdp_per_capita,social_support,life_expectancy,freedom_choices,generosity,corruption,positive_affect,negative_affect,government_confidence,democratic_quality,delivery_quality,sd_ladder,sd_d_mean_ladder,gini_wb_estimate,gini_wb_estimate_average,gini_hh_income
0,2005,3.723590,7.168690,0.450662,49.209663,0.718114,0.181819,0.881686,0.517637,0.258195,0.612072,-1.929690,-1.655084,1.774662,0.476600,0.000,0.000,0.441906
1,2006,3.723590,7.168690,0.450662,49.209663,0.718114,0.181819,0.881686,0.517637,0.258195,0.612072,-1.929690,-1.655084,1.774662,0.476600,0.000,0.000,0.441906
2,2007,3.723590,7.168690,0.450662,49.209663,0.718114,0.181819,0.881686,0.517637,0.258195,0.612072,-1.929690,-1.655084,1.774662,0.476600,0.000,0.000,0.441906
3,2008,3.723590,7.168690,0.450662,49.209663,0.718114,0.181819,0.881686,0.517637,0.258195,0.612072,-1.929690,-1.655084,1.774662,0.476600,0.000,0.000,0.441906
4,2009,4.401778,7.333790,0.552308,49.624432,0.678896,0.203614,0.850035,0.583926,0.237092,0.611545,-2.044093,-1.635025,1.722688,0.391362,0.000,0.000,0.441906
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2127,2013,4.690188,7.565154,0.799274,48.949745,0.575884,-0.076716,0.830937,0.711885,0.182288,0.527755,-1.026085,-1.526321,1.964805,0.418918,0.432,0.432,0.555439
2128,2014,4.184451,7.562753,0.765839,50.051235,0.642034,-0.045885,0.820217,0.725214,0.239111,0.566209,-0.985267,-1.484067,2.079248,0.496899,0.432,0.432,0.601080
2129,2015,3.703191,7.556052,0.735800,50.925652,0.667193,-0.094585,0.810457,0.715079,0.178861,0.590012,-0.893078,-1.357514,2.198865,0.593776,0.432,0.432,0.655137
2130,2016,3.735400,7.538829,0.768425,51.800068,0.732971,-0.065283,0.723612,0.737636,0.208555,0.699344,-0.863044,-1.371214,2.776363,0.743257,0.432,0.432,0.596690


## High-level summary of the data

The first question we pose is quite general: *What do the variables in the data look like?* Before delving into specific trends, it is helpful to have a high-level overview of the variables of interest.

In [4]:
happiness_clean.columns

Index(['year', 'happiness', 'log_gdp_per_capita', 'social_support',
       'life_expectancy', 'freedom_choices', 'generosity', 'corruption',
       'positive_affect', 'negative_affect', 'government_confidence',
       'democratic_quality', 'delivery_quality', 'sd_ladder',
       'sd_d_mean_ladder', 'gini_wb_estimate', 'gini_wb_estimate_average',
       'gini_hh_income'],
      dtype='object')

### Exploring the response (happiness)

*Has world happiness been consistently increasing?*

The chart below shows the yearly total of happiness scores, summed across all countries, after imputation. The imputed happiness values are calculated using the "average" imputation method.


In [5]:
happiness_by_year = happiness_clean.groupby("year")["happiness"].sum()
px.line(happiness_by_year)

In [6]:
# compute the total happiness score across all countries in 2005
total_2005 = happiness_clean.query('year == 2005')["happiness"].sum()
# compute the total happiness score across all countries in 2009
total_2009 = happiness_clean.query('year == 2009')["happiness"].sum()
# compute the total happiness score across all countries in 2013
total_2013 = happiness_clean.query('year == 2013')["happiness"].sum()
# compute the total happiness score across all countries in 2017
total_2017 = happiness_clean.query('year == 2017')["happiness"].sum()

In [7]:
total_2005

874.2462823390961

In [8]:
total_2009

888.6175446510315

In [9]:
total_2013

878.4874721765518

In [10]:
total_2017

889.0016205310822

Although the line graph appears to show large fluctuations in total global happiness, this impression comes from the narrow range of the vertical axis. The totals for 2005, 2009, 2013, and 2017 all fall between about 874 and 889, so the actual variation is small.
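To put a number on this, the relative spread of the four totals printed above can be computed directly. This is a small sketch that uses only those four values, so it says nothing about the years in between.

```python
# Totals copied from the outputs of cells 7-10 above
totals = {
    2005: 874.2462823390961,
    2009: 888.6175446510315,
    2013: 878.4874721765518,
    2017: 889.0016205310822,
}

lowest = min(totals.values())
highest = max(totals.values())

# Relative spread: range of the totals as a fraction of the smallest total
relative_spread = (highest - lowest) / lowest
print(f"Relative spread: {relative_spread:.2%}")
```

The spread works out to less than 2% of the smallest total, which supports reading the trend as essentially flat.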

#### PCS Evaluation of the Stability of Cleaning and Preprocessing Decisions

Let us examine whether the key conclusions derived from the trend of happiness data over time in the previous chart remain stable with respect to the choice of imputation method.

The following figure presents the trend lines under different imputation methods (average imputation, forward-fill imputation, and no imputation).

In [11]:
# add the happiness values imputed with the "previous" (forward-fill) method
happiness_clean["happiness_previous"] = clean_happiness(happiness_orig, impute_method="previous")["happiness"]
# compute the total happiness by year for each imputation approach
unimputed_happiness_by_year = happiness_orig.groupby("year")["Life Ladder"].sum()
imputed_average_happiness_by_year = happiness_clean.groupby("year")["happiness"].sum()
imputed_previous_happiness_by_year = happiness_clean.groupby("year")["happiness_previous"].sum()

# combine the three series (aligned on their shared "year" index) and reshape to long format
imputed_happiness_by_year_df = pd.DataFrame({
    "None": unimputed_happiness_by_year,
    "Average": imputed_average_happiness_by_year,
    "Previous": imputed_previous_happiness_by_year,
    }
  ).reset_index().melt(id_vars="year", var_name="imputation_method")

px.line(imputed_happiness_by_year_df, x="year", y="value", color="imputation_method")

Because a large share of the data is missing in the earlier years (e.g., from 2005 to 2010), forward-fill imputation may distort the results: too few reference points are available around these years. In contrast, average imputation does not rely on temporal continuity, so it handles missing values consistently across all years.

Based on our domain knowledge of these missing values (and assuming that most of them are closer to the average-imputed values than to last-observation-carried-forward values or zero), we believe that the average imputation results are more likely to reflect reality.
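The difference between the two strategies is easier to see on a toy example. The internals of `clean_happiness` are not shown in this document, so the sketch below uses simplified stand-ins that are assumptions: "average" is taken to mean filling with the per-country mean, and "previous" is taken to mean forward-filling within each country. The `toy` data frame and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two countries, four years, some missing happiness values
toy = pd.DataFrame({
    "country": ["A"] * 4 + ["B"] * 4,
    "year": [2005, 2006, 2007, 2008] * 2,
    "happiness": [np.nan, 4.0, np.nan, 6.0, 5.0, np.nan, np.nan, 7.0],
})

# Assumed "average" imputation: replace missing values with the country's mean
toy["average"] = toy.groupby("country")["happiness"].transform(
    lambda s: s.fillna(s.mean())
)

# Assumed "previous" imputation: carry the last observed value forward within a country
toy["previous"] = toy.groupby("country")["happiness"].ffill()

print(toy)
```

In this sketch, country A has no observation before 2006, so forward-filling leaves its 2005 value missing, while the mean-based fill still produces a value. This mirrors the concern about sparse early years.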

### The relationship between happiness and other features

We first plot a correlation heatmap to provide an overview of the relationships between the variables.

In [12]:
variables = ['happiness', 'log_gdp_per_capita', 'social_support', 
             'life_expectancy', 'freedom_choices', 'generosity', 
             'corruption', 'positive_affect', 'negative_affect', 
             'government_confidence', 'democratic_quality', 'delivery_quality',
             'sd_ladder', 'sd_d_mean_ladder', 'gini_wb_estimate', 
             'gini_wb_estimate_average', 'gini_hh_income']

# Select the relevant columns
df_selected = happiness_clean[variables]

# Calculate the correlation matrix
correlation_matrix = df_selected.corr()

px.imshow(correlation_matrix, 
          color_continuous_scale="Viridis", 
          color_continuous_midpoint=0)

We can see a strong correlation between `log_gdp_per_capita` and the response variable `happiness`, as well as between `happiness` and several other variables (`social_support`, `life_expectancy`, `democratic_quality`, `delivery_quality`, `sd_d_mean_ladder`).

Moreover, `democratic_quality` and `delivery_quality` are highly correlated with each other. `gini_wb_estimate` and `gini_wb_estimate_average` also show a strong positive correlation, which is expected because they come from the same data source.

To avoid multicollinearity in subsequent predictions, we will exclude the `gini_wb_estimate_average` and `delivery_quality` features.
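One way to spot such redundant pairs without reading them off the heatmap is to list every pair of columns whose absolute correlation exceeds a threshold. The sketch below is illustrative only: `high_corr_pairs`, the 0.9 threshold, and the synthetic `demo` data are assumptions, not part of the original analysis.

```python
import numpy as np
import pandas as pd

def high_corr_pairs(df, threshold=0.9):
    """Return column pairs whose absolute Pearson correlation exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair appears once
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

# Synthetic demo data: "a_copy" is a noisy duplicate of "a", "b" is unrelated
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({
    "a": a,
    "a_copy": a + rng.normal(scale=0.05, size=200),
    "b": rng.normal(size=200),
})

pairs = high_corr_pairs(demo, threshold=0.9)
print(pairs)
```

Applied to `df_selected`, a helper like this should flag pairs such as the two Gini estimates, although the appropriate threshold is a judgment call.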

After gaining a preliminary understanding, let's now use scatter plots to visually demonstrate the relationship between each feature and happiness, using 2017 as an example.

In [17]:
# Filter the data for the year 2017 (copy so that we work on an independent DataFrame)
happiness_2017 = happiness_clean[happiness_clean['year'] == 2017].copy()

# Select the feature columns excluding 'happiness' and 'year'
columns_to_plot = ['log_gdp_per_capita', 'social_support', 'life_expectancy', 'freedom_choices', 
                   'generosity', 'corruption', 'positive_affect', 'negative_affect', 
                   'government_confidence', 'democratic_quality', 'sd_ladder', 'sd_d_mean_ladder', 
                   'gini_wb_estimate','gini_hh_income']

# Use melt to convert the data into long format, keeping 'year' and 'happiness' as identifiers
happiness_2017_melted = happiness_2017.melt(id_vars=["year", "happiness"], value_vars=columns_to_plot,
                                            var_name="variable", value_name="value")

# Create a scatter plot with one facet per feature
fig = px.scatter(happiness_2017_melted, 
                 x="value", 
                 y="happiness", 
                 facet_col="variable", 
                 facet_col_wrap=3, 
                 opacity=0.2, 
                 height=800, 
                 facet_row_spacing=0.08)

# Update the subplot titles to show only the variable name
fig.for_each_annotation(lambda a: a.update(text=a.text.split('=')[-1]))

# Give each facet its own x-axis range and tick labels
fig.update_xaxes(matches=None, showticklabels=True)

fig.show()



Let's reproduce these plots, but with log-transformed feature values on the horizontal axis. Non-positive feature values cannot be log-transformed, so they are dropped from the plots.


In [None]:
# Log-transform the 'value' column (the feature values); non-positive values become NaN and are not plotted
happiness_2017_melted['value_log'] = happiness_2017_melted['value'].apply(lambda x: np.log(x) if x > 0 else np.nan)

# Create a scatter plot
fig = px.scatter(happiness_2017_melted, 
                 x="value_log", 
                 y="happiness", 
                 facet_col="variable", 
                 facet_col_wrap=3, 
                 opacity=0.2, 
                 height=800, 
                 facet_row_spacing=0.08)

# Update the subplot titles
fig.for_each_annotation(lambda a: a.update(text=a.text.split('=')[-1])) 

# Update the x-axis settings
fig.update_xaxes(matches=None, showticklabels=True)

fig.show()

Unfortunately, even after log-transforming the feature values, some features still do not show a strong relationship with happiness.
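A log transformation only strengthens a linear relationship when the response actually depends on the feature in a roughly logarithmic way. The sketch below illustrates this on synthetic data (not the happiness data): the response is constructed to depend on `log(x)`, and the Pearson correlation is compared before and after transforming the feature. All names and values here are made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic feature and response: the response depends on log(x), not on x itself
x = pd.Series(rng.uniform(1, 1000, size=500), name="feature")
y = np.log(x) + rng.normal(scale=0.1, size=500)

# Same guard as in the cell above: non-positive values become NaN
x_log = np.log(x.where(x > 0))

# Pearson correlation with the response, before and after the transformation
r_raw = x.corr(y)
r_log = x_log.corr(y)
print(f"raw: {r_raw:.3f}, log: {r_log:.3f}")
```

For features whose relationship with happiness is already linear, or is weak to begin with, the same comparison would show little or no improvement, which is consistent with what the plots above suggest.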

Following the visual analysis of each feature above, we took the four features most strongly correlated with happiness in the correlation heatmap and selected three of them, one from each of three key areas (economics, social connections, and health), to plot as histograms:

- `Log GDP per capita`: Economic development is often closely linked to residents' quality of life and happiness. Therefore, this feature may have a significant impact on the happiness index.

- `Social support`: A strong social support system (e.g., family, friends, community) is generally associated with higher levels of happiness.

- `Healthy life expectancy at birth`: A higher healthy life expectancy indicates a society with robust health and medical security, which usually has a positive effect on happiness.

In [20]:
fig = make_subplots(rows=1, cols=3)
fig.add_trace(
    go.Histogram(x=happiness_clean["log_gdp_per_capita"], name="Log GDP per capita"),
    row=1, col=1)
fig.add_trace(
    go.Histogram(x=happiness_clean["social_support"], name="Social support"),
    row=1, col=2)
fig.add_trace(
    go.Histogram(x=happiness_clean["life_expectancy"], name="Healthy life expectancy at birth"),
    row=1, col=3)
# add the figure title and the x-axis titles
fig.update_layout(
    title_text="Histograms of important variables",
    xaxis1_title_text="Log GDP per capita",
    xaxis2_title_text="Social support",
    xaxis3_title_text="Healthy life expectancy at birth",
    showlegend=False,
)
fig.show()