In [1]:
import pandas as pd
import plotly.graph_objs as go
import plotly.express as px
import matplotlib.pyplot as plt
import numpy as np

20-06-2023

Information Visualization: group project Draft 

Group: B4

| Student name | student number | 
| --- | --- | 
| Evan Lont | 14729210 | 
| Joep Haanen | 14657368 |
| Lotte te Kulve | 14648911 | 
| Robin Kuipers | 14273810 |

# Correlation between Happiness and Economic Factors 
## Introduction

Our data analysis will focus on exploring the correlation between happiness and economic factors. Using the "World Happiness Report" dataset and relevant economic indicators such as GDP per capita, inflation rates, and consumer price index (CPI), we will investigate the relationship between subjective well-being and economic stability. By analyzing the data, we aim to determine whether countries with higher economic indicators tend to exhibit higher happiness scores. This study will contribute to our understanding of how economic factors influence individual and societal levels of happiness.

#### Perspectives
1. Consumer Price Index (CPI) and Happiness: We want to examine the impact of
CPI on happiness. We will argue that a higher CPI, reflecting increased prices of
goods and services, could potentially reduce individuals' satisfaction with their
standard of living, impacting overall happiness levels.

2. Inflation Rates and Happiness: We want to investigate the relationship between
inflation rates and happiness. We will argue that high inflation rates may lead to
increased uncertainty, economic instability, and decreased purchasing power,
potentially negatively affecting happiness levels in a country.

3. GDP per Capita and Happiness: We will explore the correlation between GDP per
capita and happiness scores. We will argue that countries with higher GDP per
capita may have better economic opportunities, access to resources, and quality of
life, which could positively impact happiness levels.

## Datasets and preprocessing
We decided to use the World Happiness Report dataset. For our second dataset, we wanted inflation trends covering at least the last ten years up to 2022. We found an OECD inflation dataset that matched our requirements. After analysing the two datasets, we concluded that we needed a more comprehensive happiness dataset, because the one we had only contained data for the year 2019. To make accurate analyses, the time periods of the two datasets have to match. The inflation dataset may also yield interesting visualisations, because it covers the inflation trends before, during and (somewhat) after the COVID-19 pandemic.

### Dataset 1: World happiness report
**Source:** https://worldhappiness.report/ed/2020/#appendices-and-data

**Number of records:** `20`

**Number of variables:** `10`

**Description:** This dataset comes from the World Happiness Report and contains a happiness score per country, together with the factors the report uses to explain that score: logged GDP per capita, social support, healthy life expectancy, freedom to make life choices, generosity and perceptions of corruption. Each "Explained by" column gives the part of a country's happiness score that is attributed to that factor.

By analyzing this dataset, we can compare happiness scores between countries and see how much each factor contributes to the score of a country.

| Variable | Datatype | Measurement scale |
| --- | --- | --- |
| Country name | Categorical | Nominal |
| Regional indicator | Categorical | Nominal |
| Happiness score | Continuous | Interval |
| upperwhisker | Continuous | Interval |
| lowerwhisker | Continuous | Interval |
| Logged GDP per capita | Continuous | Ratio |
| Healthy life expectancy | Continuous | Interval |
| Generosity | Continuous | Interval |
| Perceptions of corruption | Continuous | Interval |
| Explained by: Log GDP per capita | Continuous | Ratio |
| Explained by: Healthy life expectancy | Continuous | Ratio |
| Explained by: Freedom to make life choices | Continuous | Ratio |
| Explained by: Generosity | Continuous | Ratio |
| Explained by: Social support | Continuous | Ratio |
| Explained by: Perceptions of corruption | Continuous | Ratio |
| Dystopia + residual | Continuous | Interval |

### Dataset 2: Inflation (CPI)

**Source:** https://data.oecd.org/price/inflation-cpi.htm

**Number of records:** `490`

**Number of variables:** `8`

**Description:** The "Inflation (CPI)" dataset from the OECD contains information on consumer price index (CPI) and inflation rates across various countries. It provides a comprehensive view of the changes in price levels for goods and services over time, allowing for the analysis and comparison of inflation rates among different economies. The dataset includes indicators such as headline inflation, core inflation, and various sub-components of CPI. It serves as a valuable resource for understanding and monitoring inflation trends at a global level.

| Variable | Datatype | Measurement scale |
| --- | --- | --- |
| Location | Categorical | Nominal |
| Regional indicator | Categorical | Nominal |
| Subject | Categorical | Nominal |
| Measure | Categorical | Nominal |
| Frequency | Categorical | Nominal |
| Time | Continuous | Interval |
| Value | Continuous | Interval |
| Flag code | Categorical | Nominal |


## Data selection

For each variable we asked ourselves the following questions:

- What are the variables in the data?
- Do we need all the data points and variables?
- Are there data that are out of scope?
- Are there privacy or ethical issues in the data?
- Is it practical to process the variable that we want?

To prevent our dataset from becoming too large, we decided to analyse the data for the years 2020 and 2022, because the values in both datasets varied considerably between these years. Another reason for selecting only two years is that we want to find out how much the data can differ within such a short timeframe.

Based on the discussions we had, we made the following changes:

- We removed the following columns from the world happiness dataset:
    - Upperwhisker
    - Lowerwhisker
    - Standard error of ladder score
    - Dystopia + residual

- We rearranged columns so that we can easily see which country and year we’re looking at.
- We removed countries we didn’t need for our analysis and kept the following selection:

'Switzerland', 'Netherlands', 'New Zealand', 'Canada','Saudi Arabia', 
'Chile', 'Japan', 'Portugal', 'China', 'Vietnam','Nepal', 'South Africa', 'Ukraine', 'Morocco', 'Cameroon', 'Iran','Egypt', 'India'

- We changed country names to abbreviations.
    Both datasets contain information per country, but the inflation dataset uses abbreviations while the happiness dataset uses full country names. To compare data for specific countries, these values have to be aligned to either abbreviations or full country names. We chose abbreviations and applied them by running the following code in a Jupyter Notebook:
    

In [2]:
inflation = pd.read_csv('inflation.csv')
happiness_2020 = pd.read_csv('happiness_2020-def.csv')
happiness_2022 = pd.read_csv('happiness_2022-def.csv')
inflation.drop('Flag Codes', axis=1, inplace=True)
inflation.drop('FREQUENCY', axis=1, inplace=True)

In [3]:
# Specify the desired column order; the 2020 and 2022 files name some columns
# differently ('Country name' vs 'Country', 'Log GDP per capita' vs 'GDP per capita')
column_order = ['Country name', 'Country', 'Happiness score', 'Dystopia + residual', 'Explained by: Log GDP per capita', 'Explained by: GDP per capita', 'Explained by: Social support', 'Explained by: Healthy life expectancy', 'Explained by: Freedom to make life choices', 'Explained by: Generosity', 'Explained by: Perceptions of corruption']

# Reorder the columns, keeping only the names that exist in each file
happiness_2020 = happiness_2020[[c for c in column_order if c in happiness_2020.columns]]
happiness_2022 = happiness_2022[[c for c in column_order if c in happiness_2022.columns]]

In [None]:
# list all unique country names
unique_countries = pd.unique(happiness_2020['Country name'])

# list all unique abbreviations
unique_abbr = pd.unique(inflation['LOCATION'])

# map all unique country names in a dictionary with abbreviations as values
country_mapping = {
    "Switzerland": "CHE",
    "Netherlands": "NLD",
    "New Zealand": "NZL",
    "Canada": "CAN",
    "Saudi Arabia": "SAU",
    "Chile": "CHL",
    "Japan": "JPN",
    "Portugal": "PRT",
    "China": "CHN",
    "South Africa": "ZAF",
    "India": "IND"
}

# map the dictionary to the values of 'country name' in the happiness dataset
happiness_2020['Country name'] = happiness_2020['Country name'].map(country_mapping)
happiness_2020.head()

# export to csv
#happiness_2020.to_csv('happiness_2020.csv', index=False)
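One thing to watch with `Series.map` and a dictionary is that any name missing from the dictionary becomes `NaN`. A minimal sketch of how this could be checked, using a small made-up frame (the rows, the `toy_mapping` dictionary and the `code` column are illustrative assumptions, not the real data):

```python
import pandas as pd

# Toy stand-in for the happiness data (made up for illustration)
toy = pd.DataFrame({"Country name": ["Netherlands", "Japan", "Vietnam"]})
toy_mapping = {"Netherlands": "NLD", "Japan": "JPN"}

# Map names to codes in a separate column so the original names stay available
toy["code"] = toy["Country name"].map(toy_mapping)

# Names without an entry in the mapping come back as NaN
unmapped = toy.loc[toy["code"].isna(), "Country name"].tolist()
print(unmapped)
```

Any name that shows up in `unmapped` would need an entry in the mapping, or its row would have to be dropped deliberately.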

In [None]:
inflation.head()

## Reflection

To be made → we’re waiting for feedback

## Work distribution

| Who? | Role | Tasks |
| --- | --- | --- |
| Evan |  | Visualizations, setup Github  |
| Joep |  | Visualizations |
| Lotte |  | Data preprocessing, visualizations |
| Robin |  | Data preprocessing, documentation |

# Visualizations

In [None]:
inflation2020 = inflation[inflation['TIME'] == 2020]
inflation2022 = inflation[inflation['TIME'] == 2022]

In [None]:
happiness_2020.head(10)

In [None]:
happiness_2022.head(10)

Let's analyse the difference in inflation between 2020 and 2022:

- In 2020, inflation was considerably lower than in 2022.
- Has the happiness explained by generosity decreased in 2022 compared to 2020?
- Can this be explained by inflation?

## Happiness and Generosity in 2020 vs 2022

- Countries on the X-axis
- Bar for 2020 and bar for 2022 in one graph
- Happiness score and generosity on the Y-axis (separate graph for each variable)

#### Question: Is inflation higher in 2022 than in 2020 in every selected country?

#### Question: Is the happiness score lower in 2022 than in 2020 in every selected country?


In [None]:
# Define the colors (ChatGPT)
colors = ['rgb(102,194,165)', 'rgb(252,141,98)', 'rgb(141,160,203)']

# create the layout
layout = go.Layout(
    xaxis=go.layout.XAxis(
        type='category' # the x-axis is categorical
    ),
    yaxis = go.layout.YAxis(
        tickformat = ',.0%', # show as percentage
    ),
    height=400
)

year2020 = go.Bar(
    x=happiness_2020['Country name'],
    y=happiness_2020['Explained by: Generosity'], # by year 2020
    name='2020',
    marker=dict(color=colors[0]) #ChatGPT 
)
year2022 = go.Bar(
    x=happiness_2022['Country'],
    y=happiness_2022['Explained by: Generosity'],
    name='2022',
    marker=dict(color=colors[1]) #ChatGPT
)

data = [year2020, year2022]
fig = go.Figure(data=data, layout=layout)

# labels
fig.update_layout(
    title="World happiness explained by generosity 2020 vs 2022",
    xaxis_title="Country",
    yaxis_title="Percentage explained by generosity")
    
fig.show()

### Conclusion

The graph shows that the happiness explained by generosity is lower in 2022 than in 2020 for almost every country. Inflation is much higher in 2022 than in 2020. This suggests that when inflation is higher, people experience less happiness due to generosity, although the graph alone cannot show that inflation is the cause.
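The "lower for almost every country" observation can be checked by counting instead of by eye. A minimal sketch, assuming both years have been combined into one frame with one row per country; the column names `gen_2020` and `gen_2022`, the country codes and all values are made up for illustration:

```python
import pandas as pd

# Toy data: generosity contribution per country for both years (made up)
toy = pd.DataFrame({
    "country": ["AAA", "BBB", "CCC", "DDD"],
    "gen_2020": [0.30, 0.20, 0.25, 0.10],
    "gen_2022": [0.25, 0.15, 0.26, 0.05],
})

# Negative change means the 2022 value is lower than the 2020 value
toy["change"] = toy["gen_2022"] - toy["gen_2020"]
decreased = toy.loc[toy["change"] < 0, "country"].tolist()

print(f"{len(decreased)} of {len(toy)} countries decreased: {decreased}")
```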


In [None]:
# Define the colors (ChatGPT)
colors = ['rgb(102,194,165)', 'rgb(252,141,98)', 'rgb(141,160,203)']

# create the layout
layout = go.Layout(
    xaxis=go.layout.XAxis(
        type='category' # the x-axis is categorical
    ),
    yaxis = go.layout.YAxis(
        tickformat = ',.0%', # show as percentage
    ),
    height=400
)

year2020 = go.Bar(
    x=happiness_2020['Country name'],
    y=happiness_2020['Explained by: Log GDP per capita'], # by year 2020
    name='2020',
    marker=dict(color=colors[0]) #ChatGPT 
)
year2022 = go.Bar(
    x=happiness_2022['Country'],
    y=happiness_2022['Explained by: GDP per capita'],
    name='2022',
    marker=dict(color=colors[1]) #ChatGPT
)

data = [year2020, year2022]
fig = go.Figure(data=data, layout=layout)

# labels
fig.update_layout(
    title="World happiness explained by GDP per capita 2020 vs 2022",
    xaxis_title="Country",
    yaxis_title="Percentage explained by GDP per capita")
    
fig.show()

In [None]:
# Helper to format the axis labels manually (one decimal, decimal comma)
def custom_tickformat(value):
    return f"{value:.1f}".replace('.', ',')

# Define the colors (ChatGPT)
colors = ['rgb(102,194,165)', 'rgb(252,141,98)', 'rgb(141,160,203)']

# define the data
year2020 = go.Bar(
    x=inflation2020['LOCATION'],
    y=inflation2020['Value'], # by year 2020
    name='2020',
    marker=dict(color=colors[0]) #ChatGPT 
)
year2022 = go.Bar(
    x=inflation2022['LOCATION'],
    y=inflation2022['Value'],
    name='2022',
    marker=dict(color=colors[1]) #ChatGPT
)

# create the layout, using this chart's own 2020 values as tick positions
tick_values = list(inflation2020['Value'])
layout = go.Layout(
    xaxis=go.layout.XAxis(
        type='category' # the x-axis is categorical
    ),
    yaxis=go.layout.YAxis(
        tickvals=tick_values,  # use the y-values as axis labels
        ticktext=[custom_tickformat(value) for value in tick_values],  # format the axis labels
    ),
    height=400
)

# create the figure
data = [year2020, year2022]
fig = go.Figure(data=data, layout=layout)

# labels
fig.update_layout(
    title="Inflation per country",
    xaxis_title="Country",
    yaxis_title="Inflation percentage relative to 2015")
    
fig.show()

### Happiness vs inflation in different countries

In [None]:
df1 = pd.read_csv('happiness_2020-def.csv')
df2 = pd.read_csv('happiness_2022-def.csv')

countries = ["CHE", "NLD", "NZL", "CAN", "SAU", "CHL", "PRT", "CHN", "ZAF", "IND"]

# Keep the total CPI, indexed to 2015, for the selected countries
df_cpi = pd.read_csv('Data_inflatie.csv')
df_cpi = df_cpi[(df_cpi["SUBJECT"] == "TOT")
                & (df_cpi["MEASURE"] == "IDX2015")
                & (df_cpi["LOCATION"].isin(countries))]

# TIME may be read as int or as str depending on the file, so compare as strings
df3 = df_cpi[df_cpi["TIME"].astype(str) == "2020"].copy()
df4 = df_cpi[df_cpi["TIME"].astype(str) == "2022"].copy()

In [None]:
# Subtract the base index (2015 = 100) so the x-axis shows the change in percent
df3["Value"] = df3["Value"] - 100
df4["Value"] = df4["Value"] - 100

# Pair each inflation value with the happiness score of the same country
# (assumes both happiness files store the same country codes as LOCATION)
p2020 = df3.merge(df1, left_on="LOCATION", right_on="Country name")
p2022 = df4.merge(df2, left_on="LOCATION", right_on="Country")

# Plot the points
plt.scatter(p2020["Value"], p2020["Happiness score"], label="2020")
plt.scatter(p2022["Value"], p2022["Happiness score"], label="2022")

# Label the axes
plt.xlabel('Increase in inflation compared to 2015 (%)')
plt.ylabel('Happiness Score')

# Label each point with its country code
for points in (p2020, p2022):
    for x, y, label in zip(points["Value"], points["Happiness score"], points["LOCATION"]):
        plt.text(x, y, label, ha='center', va='bottom')

# Draw a line between the 2020 and 2022 point of each country for readability
both = p2020.merge(p2022, on="LOCATION", suffixes=("_2020", "_2022"))
plt.plot([both["Value_2020"], both["Value_2022"]],
         [both["Happiness score_2020"], both["Happiness score_2022"]], 'k-')
plt.title("Happiness and inflation in countries")

plt.legend()

# Show the chart
plt.show()

The graph above is a scatterplot of the happiness in a country against the increase in inflation in that country. The y-axis shows the "Happiness score" of each country, which indicates the happiness of its inhabitants on a scale of 0 to 10. The dots are labeled by country code. The blue dots represent a country in 2020, and the orange dots represent the same country in 2022. The x-axis shows the increase in inflation compared to the year 2015, where an increase of 10% means that prices in this country are 110% of what they were in 2015. The lines between the dots make it easy to see an increase or decrease in happiness. With this graph we aimed to either confirm or debunk one of our perspectives.
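The x-axis values follow from the price index by simple arithmetic. A minimal sketch, assuming the index is defined so that the base year equals 100 (as the `IDX2015` measure name suggests); the helper name `pct_increase_since_base` is made up for illustration:

```python
def pct_increase_since_base(index_value, base=100.0):
    """Percentage price increase relative to the base year (index = base)."""
    return (index_value - base) * 100.0 / base

# An index of 110 means prices are 110% of their base-year level
print(pct_increase_since_base(110.0))
```

With a base of 100 this reduces to subtracting 100 from the index, which is the shortcut used in the plotting cell.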

The remarkable result is that there is no clear pattern in the scatterplot. Every country has seen an increase in inflation, but some countries have increased their happiness score while others have seen a decrease. This means that we can neither confirm nor debunk our perspective. It does suggest that, for the selected countries, there is no clear correlation between the inflation of a country and the happiness of its inhabitants.
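"No clear correlation" can also be expressed as a number. A minimal sketch using the Pearson correlation from `Series.corr`, assuming one row per country with the change in both quantities between 2020 and 2022; the column names and all values are made up for illustration:

```python
import pandas as pd

# Toy data: change between 2020 and 2022 per country (made up)
toy = pd.DataFrame({
    "inflation_change": [2.0, 4.0, 6.0, 8.0],
    "happiness_change": [0.1, -0.2, 0.2, -0.1],
})

# Pearson correlation: near 0 means no linear relation, near -1 or 1 a strong one
r = toy["inflation_change"].corr(toy["happiness_change"])
print(round(r, 2))
```

With only ten countries, even a sizeable coefficient should be read with caution.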

### GDP per Capita and Happiness:
We will explore the correlation between GDP per capita and happiness scores. 

We will argue that countries with higher GDP per capita may have better economic opportunities, access to resources, and quality of life, which could positively impact happiness levels.

In [None]:
df = pd.read_csv('happiness_2020-def.csv')

# sort by country
df = df.sort_values('Country name')

# create stacked bar chart
fig = px.bar(df, x=['Explained by: Log GDP per capita', 'Happiness score'], y='Country name', orientation='h', 
barmode='stack', color_discrete_map={'Explained by: Log GDP per capita': '#1F77B4', 'Happiness score': '#AEC7E8'})

#edit chart
fig.update_layout(title='GDP per capita and Happiness in 2020')
fig.update_layout(legend=dict(title='Variables'))

fig.show()

In [None]:
df2 = pd.read_csv('happiness_2022-def.csv')

# sort by country
df2 = df2.sort_values('Country')

# create stacked bar chart
fig = px.bar(df2, x=['Explained by: GDP per capita', 'Happiness score'], y='Country', orientation='h', 
             barmode='stack', color_discrete_map={'Explained by: GDP per capita': '#1F77B4', 'Happiness score': '#AEC7E8'})


#edit chart
fig.update_layout(title='GDP per capita and Happiness in 2022')
fig.update_layout(legend=dict(title='Variables'))


fig.show()  

Looking at the charts of both 2020 and 2022, countries with a high happiness score mostly also have a high GDP per capita, and vice versa. Thus, these visualizations support the idea that a high GDP per capita has a positive effect on the overall happiness of the population. For example, India has both the lowest happiness score and the lowest GDP per capita, and Switzerland has both the highest happiness score and the highest GDP per capita. This is true for both graphs.

However, one can also make some arguments for an opposing statement. A higher GDP does not imply a higher happiness score for each country. For example, in 2020 Saudi Arabia and Canada have a fairly similar GDP per capita, at 1.33 and 1.30. However, Saudi Arabia has a happiness score of 6.4 while Canada has a happiness score of 7.2.

Thus, overall a higher GDP per capita tends to go together with a higher happiness score, likely because the citizens of these countries have enough wealth and financial stability to live a fulfilling life. However, other variables such as individual freedom are important too, which is why a rich country like Saudi Arabia may still end up with a lower happiness score.
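The Saudi Arabia and Canada example suggests looking at how far each country sits from the overall trend. A minimal sketch using a straight-line fit with `numpy.polyfit`; the labels and values are made up for illustration and are not taken from the happiness files:

```python
import numpy as np

# Toy data: GDP contribution (x) and happiness score (y) per country (made up)
labels = ["AAA", "BBB", "CCC", "DDD"]
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0, 2.0])

# Fit a straight line y = slope * x + intercept
slope, intercept = np.polyfit(x, y, 1)

# A negative residual means the country is less happy than the trend predicts
residuals = y - (slope * x + intercept)
lowest = labels[int(np.argmin(residuals))]
print(slope > 0, lowest)
```

Countries with a large negative residual are the ones where factors other than GDP are likely to matter most.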


In [None]:
df_2020 = pd.read_csv('happiness_2020-def.csv')
df_2022 = pd.read_csv('happiness_2022-def.csv')

df_2020 = df_2020.sort_values('Country name')
df_2022 = df_2022.sort_values('Country')



# Create the scatter plot for 2020
scatter_2020 = go.Scatter(
    x=df_2020['Explained by: Log GDP per capita'],
    y=df_2020['Happiness score'],
    mode='markers+text',  # show the country label next to each marker
    text=df_2020['Country name'],
    name='2020',
    marker=dict(color='#56B4E9', size=10, symbol='circle'),
    textposition='top center',  # Set the text position
    textfont=dict(color='black') 
)

# Create the scatter plot for 2022
scatter_2022 = go.Scatter(
    x=df_2022['Explained by: GDP per capita'],
    y=df_2022['Happiness score'],
    mode='markers+text',  # show the country label next to each marker
    text=df_2022['Country'],
    name='2022',
    marker=dict(color='#E69F00', size=10, symbol='circle'),
    textposition='top center',  # Set the text position
    textfont=dict(color='black')  # Set the text color
)


# Create chart with both plots
fig = go.Figure(data=[scatter_2020, scatter_2022])
fig.update_layout(
    title='GDP per capita and Happiness',
    xaxis=dict(title='GDP per capita'),
    yaxis=dict(title='Happiness score'),
    hovermode='closest'
)

fig.show()

In [None]:
df_inflation = pd.read_csv('inflation_def.csv')
df_gdp = pd.read_csv('happiness_2020-def.csv')
selected_countries = ['CHE', 'NLD', 'NZL', 'CAN', 'SAU', 'CHL', 'PRT', 'CHN', 'ZAF', 'IND']
df_filtered_gdp = df_gdp[df_gdp['Country name'].isin(selected_countries)]
df_filtered_inflation = df_inflation[df_inflation['LOCATION'].isin(selected_countries)]

# Pair GDP and inflation per country so that x and y refer to the same rows
df_merged = df_filtered_gdp.merge(df_filtered_inflation, left_on='Country name', right_on='LOCATION')

fig = go.Figure()

# One marker per country record: GDP on the x-axis, inflation on the y-axis
fig.add_trace(go.Scatter(
    x=df_merged['Explained by: Log GDP per capita'],
    y=df_merged['Value'],
    mode='markers',
    name='GDP vs inflation',
    marker=dict(
        color='blue',
        size=10
    ),
    text=df_merged['Country name']
))


# Update layout
fig.update_layout(
    title='Correlation between GDP and Inflation',
    xaxis=dict(title='GDP'),
    yaxis=dict(title='Inflation'),
    showlegend=True,
)


fig.show()

Leaving India aside, there is no clear correlation between a country's inflation and its GDP:

In [None]:
df_inflation

In [None]:
df_inflation = pd.read_csv('inflation_def.csv')
df_gdp = pd.read_csv('happiness_2020-def.csv')
selected_countries = ['CHE', 'NLD', 'NZL', 'CAN', 'SAU', 'CHL', 'PRT', 'CHN', 'ZAF', 'IND']
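df_filtered_gdp = df_gdp[df_gdp['Country name'].isin(selected_countries)]
df_filtered_inflation = df_inflation[df_inflation['LOCATION'].isin(selected_countries)]

# Pair happiness and inflation per country so that x and y refer to the same rows
df_merged = df_filtered_gdp.merge(df_filtered_inflation, left_on='Country name', right_on='LOCATION')

fig = go.Figure()

# One marker per country record: happiness on the x-axis, inflation on the y-axis
fig.add_trace(go.Scatter(
    x=df_merged["Happiness score"],
    y=df_merged['Value'],
    mode='markers',
    name='Happiness vs inflation',
    marker=dict(
        color='blue',
        size=10
    ),
    text=df_merged['Country name']
))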


# Update layout
fig.update_layout(
    title='Correlation between Happiness and Inflation in 2020',
    xaxis=dict(title='Happiness'),
    yaxis=dict(title='Inflation'),
    showlegend=True,
)

fig.show()

## Visualization drafts

### Unemployment rate, happiness score: 2020 vs 2022 (draft)
We found a third dataset from OECD stats on Labour Market Statistics which we still have to preprocess to use for our analysis. The dataset contains data for each of our selected countries about (un)employment rates by age, age group, gender and educational level. 

We want to filter the dataset for the years 2020 and 2022. We then want to merge it with the happiness dataset and create a data visualization to test the correlation between the unemployment rate and the happiness score for different age groups.

The second visualization will be based on the same filtered data and will visualize the possible correlation between the unemployment rate and the happiness score, grouped by educational level.
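The planned filter-and-merge step could look roughly like the sketch below. The third dataset has not been preprocessed yet, so the unemployment column names (`LOCATION`, `TIME`, `unemployment_rate`) and every value here are assumptions made up for illustration:

```python
import pandas as pd

# Hypothetical unemployment extract (column names and values are assumptions)
unemployment = pd.DataFrame({
    "LOCATION": ["NLD", "NLD", "CAN", "CAN"],
    "TIME": [2019, 2020, 2020, 2021],
    "unemployment_rate": [3.4, 3.8, 9.5, 7.5],
})

# Toy happiness frame keyed by country code (values made up)
happiness = pd.DataFrame({
    "Country name": ["NLD", "CAN"],
    "Happiness score": [7.4, 7.2],
})

# Keep one year, then attach the happiness score of the same country
unemployment_2020 = unemployment[unemployment["TIME"] == 2020]
merged = unemployment_2020.merge(
    happiness, left_on="LOCATION", right_on="Country name", how="inner"
)
print(merged[["LOCATION", "unemployment_rate", "Happiness score"]])
```

An inner merge keeps only countries present in both frames, which avoids plotting rows with missing values.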

In [None]:
import plotly.express as px
import dash
# dcc and html are part of the dash package since Dash 2.0;
# the separate dash_core_components / dash_html_components packages are deprecated
from dash import dcc, html

# Load the dataset
df = pd.read_csv('happiness_2020-def.csv')

# Initialize the Dash app
app = dash.Dash(__name__)

# Define the layout of the app
app.layout = html.Div([
    dcc.Dropdown(
        id='country-dropdown',
        options=[{'label': country, 'value': country} for country in df['Country name'].unique()],
        value=df['Country name'].unique()[0], 
    ),
    html.H2(id='chart-title'),
    dcc.Graph(id='pie-chart')
])

# Define the callback function to update the pie chart
@app.callback(
    [dash.dependencies.Output('pie-chart', 'figure'),
     dash.dependencies.Output('chart-title', 'children')],
    [dash.dependencies.Input('country-dropdown', 'value')]
)
def update_pie_chart(country):
    # Filter the dataframe based on the selected country
    filtered_df = df[df['Country name'] == country]
    
    # Prepare data for the pie chart
    labels = filtered_df.columns[3:]
    values = filtered_df.iloc[0, 3:]
    
    
    # Create the pie chart figure
    fig = px.pie(values=values, names=labels, hole=0.5)
    
    # Add the text '2020' in the center of the pie chart
    fig.update_layout(
        annotations=[dict(text='2020', x=0.5, y=0.5, font_size=20, showarrow=False)],
    )
    
    # Set the chart title
    title = f"Distribution of each happiness factor for {country}"
    
    return fig, title

# Run the app
if __name__ == '__main__':
    app.run_server(debug=True)