# World Happiness Analysis

### World Happiness Report

The World Happiness Report is published annually by the United Nations Sustainable Development Solutions Network and ranks countries by how happy their citizens perceive themselves to be. The rankings are based on a few important life factors that have been shown to directly impact overall perceived happiness, namely:
- GDP per capita
- Social support
- Healthy life expectancy
- Freedom
- Generosity
- Perceptions of corruption

Each factor is a number between 0 (the worst) and either 1 or 10 (the best), depending on the factor. The overall Happiness Score of each country is calculated from these factors, and the countries are then ranked in order of their Happiness Score.

In this notebook, I analyse the World Happiness Report dataset for the year 2019 and then compare it with the data from 2015 to see what has changed.

Datasets used:
- [Kaggle - World Happiness Report](https://www.kaggle.com/unsdsn/world-happiness/)

## Data Preparation and Cleaning
### Importing important libraries and Loading the datasets
Before we start, we need to import all the important libraries that are needed for the analysis.
We load the datasets into dataframes using Pandas. `happy_one_df` will contain the data for 2015 and `happy_two_df` will contain the data for 2019.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
happy_one_df = pd.read_csv('../input/world-happiness/2015.csv')
happy_two_df = pd.read_csv('../input/world-happiness/2019.csv')

### First look
Now, let us take a look at the datasets by displaying five rows from each. As we can see, each Country is listed with its Happiness Score and Rank, along with the values for each feature.

In [None]:
happy_one_df.head(5)

In [None]:
happy_two_df.head(5)

In [None]:
print(happy_one_df.shape, happy_two_df.shape)

In [None]:
print('There are {} countries in the 2015 dataset and {} countries in the 2019 dataset'.format(happy_one_df.shape[0], happy_two_df.shape[0]))

In [None]:
# Before we go ahead, let's check whether any values in the dataframes are null
total_missing_one = happy_one_df.isna().sum()
total_missing_two = happy_two_df.isna().sum()
print(total_missing_one, total_missing_two)
# Every column shows a count of zero, so there are no null values
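
With many columns, the printed counts can be hard to scan. The following is a minimal sketch of a more compact check that lists only the columns with missing values. The dataframe `toy` and its values are made up for illustration and are not part of the Happiness datasets.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing Generosity value (illustration only)
toy = pd.DataFrame({
    'Country': ['A', 'B', 'C'],
    'Generosity': [0.2, np.nan, 0.1],
})

# Count the missing values per column and keep only the columns that have any
missing_counts = toy.isna().sum()
print(missing_counts[missing_counts > 0])

# A single True/False answer for the whole dataframe
has_missing = bool(toy.isna().any().any())
print(has_missing)
```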

The datasets use different column names for the same features. We rename these columns so that they match, and drop the columns of the 2015 dataset that are not needed.

In [None]:
print(happy_one_df.columns)
print(happy_two_df.columns)
happy_one_df.drop(columns = ['Standard Error', 'Dystopia Residual'], inplace = True)
happy_one_df.rename(columns = {'Economy (GDP per Capita)':'GDP per capita', 'Health (Life Expectancy)':'Healthy life expectancy', 'Family':'Social support', 'Trust (Government Corruption)':'Perceptions of corruption'}, inplace = True)
happy_two_df.rename(columns = {'Country or region':'Country', 'Score':'Happiness Score', 'Overall rank':'Happiness Rank', 'Freedom to make life choices':'Freedom'}, inplace = True)

To see which countries appear in only one of the datasets, we take the intersection of the two datasets and then find its complement in each. Doing this shows that some of the countries are actually the same but have been named differently. For consistency, we will rename these countries.

In [None]:
common = happy_one_df.merge(happy_two_df, on=['Country'])
result1 = happy_one_df[~happy_one_df.Country.isin(common.Country)]
result2 = happy_two_df[~happy_two_df.Country.isin(common.Country)]
print(result1['Country'])
print(result2['Country'])
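
As an alternative sketch of the same idea, an outer merge with `indicator=True` labels every row by the dataset it came from, so the countries unique to each year can be read off in one step. The frames `df_2015` and `df_2019` below are small made-up stand-ins, not the real datasets.

```python
import pandas as pd

# Made-up stand-ins for the two yearly tables (illustration only)
df_2015 = pd.DataFrame({'Country': ['Finland', 'Macedonia', 'Norway']})
df_2019 = pd.DataFrame({'Country': ['Finland', 'North Macedonia', 'Norway', 'Gambia']})

# An outer merge keeps every country; the _merge column records where each row came from
both = df_2015.merge(df_2019, on='Country', how='outer', indicator=True)

only_2015 = both.loc[both['_merge'] == 'left_only', 'Country'].tolist()
only_2019 = both.loc[both['_merge'] == 'right_only', 'Country'].tolist()

print(only_2015)
print(only_2019)
```

Here 'Macedonia' and 'North Macedonia' show up on opposite sides, which is the kind of naming mismatch that the renaming step is meant to resolve.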

In [None]:
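# Assign the result back: replace(inplace=True) on a selected column may not update the dataframe in recent pandas versions
happy_two_df['Country'] = happy_two_df['Country'].replace({'Trinidad & Tobago': 'Trinidad and Tobago'})
# Note: Sudan/South Sudan and Somaliland/Somalia are distinct places; they are matched here only so that the two years line up
happy_one_df['Country'] = happy_one_df['Country'].replace({'North Cyprus': 'Northern Cyprus', 'Somaliland region': 'Somalia', 'Macedonia':'North Macedonia', 'Sudan':'South Sudan'})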

For region-wise Happiness Score comparison, we need a Region column in the 2019 dataset.

First let's look at the number of unique Regions. As we can see, there are 10 different Regions that the countries are classified into.

To add the Region column, we take the countries and regions present in the 2015 dataset and merge them with the 2019 dataset. However, some countries (Namibia and Gambia) are not in the 2015 dataset, so their Region has to be given explicitly.

In [None]:
print(happy_one_df.Region.unique(), happy_one_df.Region.nunique())

In [None]:
selected_columns = happy_one_df[["Country","Region"]]
new_df = selected_columns.copy()
new_df

In [None]:
common = happy_two_df.merge(new_df, on=['Country'])
happy_three_df = common.copy()
# Namibia and Gambia are not in the 2015 data, so their Region is set explicitly
# (DataFrame.append was removed in pandas 2.0, so the rows are selected with isin instead)
select_countries1 = happy_two_df.loc[happy_two_df['Country'].isin(['Namibia', 'Gambia'])].copy()
select_countries1['Region'] = 'Sub-Saharan Africa'
happy_three_df = pd.concat([happy_three_df, select_countries1])
happy_three_df.sort_values('Happiness Score', ascending = False, inplace = True)
# Renumber the rows from 0 without hard-coding the number of countries
happy_three_df.reset_index(drop=True, inplace=True)
happy_three_df
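
Another way to attach the Region, sketched here under the assumption that every country name appears at most once in the 2015 table, is to treat the 2015 data as a lookup and `map` it onto the 2019 countries. No rows are dropped this way, and countries without a match are easy to spot as missing values. The names `regions_2015`, `scores_2019` and `manual_regions`, and all values, are made up for illustration.

```python
import pandas as pd

# Made-up rows for illustration, not the real datasets
regions_2015 = pd.DataFrame({
    'Country': ['Finland', 'Benin'],
    'Region': ['Western Europe', 'Sub-Saharan Africa'],
})
scores_2019 = pd.DataFrame({
    'Country': ['Finland', 'Benin', 'Namibia'],
    'Happiness Score': [7.8, 4.9, 4.6],
})

# Look up each country's Region in the 2015 table
region_lookup = regions_2015.set_index('Country')['Region']
scores_2019['Region'] = scores_2019['Country'].map(region_lookup)

# Countries absent from 2015 come back as NaN and are filled in by hand
manual_regions = {'Namibia': 'Sub-Saharan Africa'}
missing = scores_2019['Region'].isna()
scores_2019.loc[missing, 'Region'] = scores_2019.loc[missing, 'Country'].map(manual_regions)

print(scores_2019)
```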

Now that the datasets have been prepared, let's take a look at the summary statistics of each column, such as the mean, minimum and maximum value, using `.describe()`.

As we can see below:
- The mean Happiness Score in 2015 is around 5.376 and in 2019 it is around 5.407. This shows that the mean global Happiness Score has increased since 2015.
- This could be due to the increase in the mean of some of the factors, such as GDP per capita and Healthy life expectancy. However, a larger increase would have been held back by factors whose averages have decreased, such as Freedom and Generosity.
- It is quite alarming to see that the maximum value for Generosity has dropped by a significant amount from 2015 to 2019. However, there is a significant increase in the maximum values of Social support and Healthy life expectancy.
- All the features have a minimum value of zero. This is either because the feature could not be calculated for that year or because that feature was rated very poorly in the respective country.

In [None]:
happy_one_df.describe()

In [None]:
happy_three_df.describe()

## Exploratory Analysis, Visualization and Answering Questions

Now let's take a deeper look into the Happiness Report data and attempt to answer some questions:
- First we will look at the data from 2019 and understand the correlation between the Happiness Score and the features used to calculate it.
- Then we will look at Region-wise scores and analyse the difference.
- After that, we will look at the happiest and least happy countries.
- Then we will compare the 2019 data with the 2015 data and see how much it has changed.

In [None]:
# Setting some initial parameters for the charts
sns.set_style('darkgrid')
matplotlib.rcParams['font.size'] = 14
matplotlib.rcParams['figure.figsize'] = (9, 5)
matplotlib.rcParams['figure.facecolor'] = '#00000000'

## How does each life factor affect the Happiness Score?

To analyse this, we find the correlation between the Happiness Score and each factor: the higher the correlation, the more strongly the factor is associated with the Score. We shall also see whether the factors are correlated with each other.

In [None]:
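# Find the correlation using .corr() and then remove unwanted columns and rows.
# numeric_only=True skips the text columns (Country, Region), which newer pandas versions would otherwise raise an error on
corr_df = happy_three_df.corr(numeric_only=True)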
corr_df.drop(columns= ['Happiness Rank', 'Happiness Score'], inplace = True)
corr_df.drop(['Happiness Rank'], inplace = True)
corr_df

In [None]:
sns.heatmap(corr_df, annot=True)
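
To make clear what each cell of the heatmap measures, here is a minimal sketch that computes the Pearson correlation by hand and compares it with the pandas result (Pearson is the default method of `.corr()`). The helper `pearson` and the values in `toy` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up values for five hypothetical countries (illustration only)
toy = pd.DataFrame({
    'GDP per capita': [0.3, 0.7, 1.0, 1.2, 1.4],
    'Happiness Score': [3.5, 4.8, 5.6, 6.4, 7.3],
})

def pearson(x, y):
    """Pearson correlation: co-variation of x and y, scaled by their individual spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_centered = x - x.mean()
    y_centered = y - y.mean()
    numerator = (x_centered * y_centered).sum()
    denominator = np.sqrt((x_centered ** 2).sum() * (y_centered ** 2).sum())
    return numerator / denominator

by_hand = pearson(toy['GDP per capita'], toy['Happiness Score'])
from_pandas = toy['GDP per capita'].corr(toy['Happiness Score'])
print(by_hand, from_pandas)
```

A value close to 1 means the two columns rise together almost along a straight line. It does not by itself show that one causes the other.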

From this, we can infer the following:

- The Happiness Score depends the most on GDP per capita, Social support and Healthy life expectancy.
- The Score is least dependent on Generosity and Perceptions of corruption.
- Generosity has a larger correlation with Perceptions of corruption and Freedom than the overall Score.
- The three main influencing factors of the Score, namely GDP per capita, Social support and Healthy life expectancy, are highly correlated with each other.
- Freedom has a fairly strong correlation with the Happiness Score, which is understandable: a higher level of perceived freedom would be expected to make a person happier.
- Freedom is also correlated with Perceptions of corruption. A person who feels that there is less corruption around them is likely to feel that they have more freedom to act as they wish.

Hence we can see that the Happiness Score is influenced to different degrees by each factor. We can also see that each factor also has a correlation with other factors.

Let us also take a look at how the value for each factor affects the Happiness Score for each country. This is done using a scatterplot with each factor as the x axis and the Happiness Score as the y axis. The countries are colored according to their Region.

In [None]:
happy_factor_columns = ['GDP per capita', 'Social support', 'Healthy life expectancy', 'Freedom', 'Generosity', 'Perceptions of corruption']
matplotlib.rcParams['font.size'] = 20
fig, axes = plt.subplots(3, 2, figsize=(50, 50))

# One subplot per factor; axes.flat walks the 3x2 grid row by row
for ax, column_name in zip(axes.flat, happy_factor_columns):
    ax.set_title('Happiness Score vs. ' + column_name)
    # x, y and hue are passed by keyword, as recent seaborn versions require
    sns.scatterplot(data=happy_three_df, x=column_name, y='Happiness Score',
                    hue='Region', s=500, ax=ax)
    ax.legend(fontsize=20, markerscale=4)

As we can see:
- In most of the graphs, as the value for the life factor increases, so does the Happiness Score.
- In all graphs, the highest-scoring regions tend to be Western Europe, Australia and New Zealand, and North America. From these graphs we can see that these 'happiest' regions tend to have a higher value for each life factor.
- Except for Generosity and Perceptions of corruption (the factors with the lowest correlation with the Score), the factors have a roughly linear, positive relationship with the Happiness Score.
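
One way to put a number on "roughly linear" is to fit a straight line and check how much of the variation it explains. The sketch below does this with `np.polyfit` on made-up values; the arrays `factor` and `score` are illustrative and not taken from the dataset.

```python
import numpy as np

# Made-up factor values and scores for five hypothetical countries
factor = np.array([0.3, 0.7, 1.0, 1.2, 1.4])
score = np.array([3.5, 4.8, 5.6, 6.4, 7.3])

# Least-squares fit of: score = slope * factor + intercept
slope, intercept = np.polyfit(factor, score, deg=1)

# R squared: the share of the variation in score that the fitted line explains
predicted = slope * factor + intercept
ss_res = ((score - predicted) ** 2).sum()
ss_tot = ((score - score.mean()) ** 2).sum()
r_squared = 1 - ss_res / ss_tot

print(slope, intercept, r_squared)
```

A positive slope with an R squared near 1 supports a linear, increasing relationship. Note that a non-zero intercept means the relationship is linear without being strictly proportional.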

## Which Region of the world is the happiest and which is the least happy?

In order to look at this, first let us look at the number of different regions and also how many countries there are in each region.

In [None]:
print('There are {} different regions and they are {}'.format(happy_three_df.Region.nunique(), happy_three_df.Region.unique()))

In [None]:
# By grouping the Countries by regions, we can see the number of countries in all ten regions
country_counts_df = happy_three_df.groupby('Region')[['Country']].count()
country_counts_df

Now we can look at the average Happiness Score in each region.

In [None]:
region_happy_df = happy_three_df.groupby('Region')[['Happiness Score']].mean().sort_values('Happiness Score', ascending=False)
region_happy_df

In [None]:
# Visualization using a barplot will make it easier to see the amount of difference between the average scores
region_happy_df.reset_index(inplace=True)
sns.set_color_codes("pastel")
matplotlib.rcParams['font.size'] = 10
matplotlib.rcParams['figure.figsize'] = (20, 6)
sns.barplot(x="Region", y="Happiness Score", data=region_happy_df, color="b")

In [None]:
print('From this we can see that based on average Happiness Scores, {} is the happiest Region and {} is the least happy Region in the world'
      .format(region_happy_df.iloc[0]['Region'], region_happy_df.iloc[-1]['Region']))

## Is the most frequent Region in the top ten countries the same as the happiest Region?
- Is the most frequent Region in the bottom ten countries the same as the least happy Region?
- If they are different, what causes this?

In [None]:
# Sort the countries according to the Happiness Score. The initial dataframe is already sorted,
# but we sort again to be safe and because we want to work on a copy.
country_happy_df = happy_three_df.sort_values('Happiness Score', ascending=False)
country_happy_df.head(5)

Let us look at the most frequent Region in the top ten countries and the most frequent Region in the lowest ten countries.

In [None]:
print('The most frequent region in the top ten countries is {}'.format(country_happy_df.head(10)['Region'].mode()[0]))
print('The most frequent region in the lowest ten countries is {}'.format(country_happy_df.tail(10)['Region'].mode()[0]))

We can see that the most frequent region among the 10 least happy countries is the same as the least happy region based on average Happiness Scores. However, this does not hold for the 10 happiest countries: their most frequent region is Western Europe rather than Australia and New Zealand. Why does this happen?

Let us plot the Happiness Scores on a barplot along with the standard deviations for each Region.

In [None]:
matplotlib.rcParams['figure.figsize'] = (20, 5)
matplotlib.rcParams['font.size'] = 10
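# x and y are passed by keyword; errorbar='sd' needs seaborn 0.12 or later (older versions use ci='sd')
sns.barplot(x='Region', y='Happiness Score', data=country_happy_df, errorbar='sd')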

Looking at this, we can see that:
- Western Europe has a high standard deviation for the Happiness Score, since the Scores of its 21 countries vary greatly around the average for the Region.
- So, even though Western Europe was the most frequent region among the happiest countries, the countries in that Region with lower Scores pull its average Happiness Score below that of Australia and New Zealand.
- The standard deviation line is short for the Australia and New Zealand region, since it includes only two countries with very similar Scores.
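
The effect described above can be reproduced on a small scale. The following sketch uses made-up scores for two hypothetical regions: a two-country region with similar scores and a larger region that contains both the top scorers and some much lower ones. The region names and values are illustrative only.

```python
import pandas as pd

# Made-up scores: a small uniform region and a larger, more varied one
toy = pd.DataFrame({
    'Region': ['Small region'] * 2 + ['Large region'] * 5,
    'Happiness Score': [7.2, 7.3, 7.8, 7.6, 7.5, 6.0, 5.3],
})

# Mean, standard deviation and number of countries per region in one table
summary = (toy.groupby('Region')['Happiness Score']
              .agg(['mean', 'std', 'count'])
              .sort_values('mean', ascending=False))
print(summary)

# Region that appears most often among the three highest scores
top_three_region = toy.nlargest(3, 'Happiness Score')['Region'].mode()[0]
print(top_three_region)
```

The larger region holds all three top scores, yet its lower-scoring members bring its mean below that of the small region, which is the same pattern as Western Europe versus Australia and New Zealand.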

## How is the happiest country different from the least happy country?
Let us find and analyse the difference between the happiest country and the least happy country.

To do this, we take the first and last rows of the sorted dataset. Then we compare the Happiness Score and the value for each factor.

In [None]:
print('The happiest country is {} and the least happy country is {}.'
      .format(country_happy_df.iloc[0]['Country'], country_happy_df.iloc[-1]['Country']))

In [None]:
# We take the first and last row and put it in a new dataframe for easier analysis
compare_df = country_happy_df.tail(1).copy()
compare_df = pd.concat([compare_df, country_happy_df.head(1).copy()])
compare_df.reset_index(drop=True, inplace=True)
compare_df

In [None]:
# We do the following for ease of comparison between the features for both Countries.
# Only the Happiness Score and the six factors are melted, so Happiness Rank and Region are left out
new_compare_df = compare_df.melt(id_vars=['Country'], value_vars=['Happiness Score'] + happy_factor_columns)
# The Happiness Scores are divided by 10 (bringing them to between 0 and 1) so that they are
# comparable with the other features in the following barplot
is_score = new_compare_df['variable'] == 'Happiness Score'
new_compare_df.loc[is_score, 'value'] = new_compare_df.loc[is_score, 'value'] / 10
new_compare_df

The following barplot shows the normalized Happiness Score (the Score divided by 10) and the values of each happiness factor, with the two countries placed side by side for comparison. The table following the graph shows the numerical difference between the two countries for each value.

In [None]:
sns.barplot(x='variable', y='value', hue='Country', data=new_compare_df);

In [None]:
new_group_df = new_compare_df.groupby('variable')[['value']].diff().dropna()
new_group_df
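
The difference table can also be built by pivoting the long table so that each country becomes a column, which keeps the variable names next to the differences. This is a sketch on made-up values; the countries 'X' and 'Y' and the frame `long_df` are hypothetical.

```python
import pandas as pd

# Made-up long-format table with two hypothetical countries
long_df = pd.DataFrame({
    'Country': ['X', 'Y', 'X', 'Y'],
    'variable': ['Freedom', 'Freedom', 'Generosity', 'Generosity'],
    'value': [0.6, 0.1, 0.15, 0.2],
})

# One row per variable, one column per country
wide = long_df.pivot(index='variable', columns='Country', values='value')

# Positive numbers mean that country X has the higher value
wide['difference'] = wide['X'] - wide['Y']
print(wide)
```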

From above, it can be seen that:
- The Happiness Score of Finland is greater than that of South Sudan by almost 5.
- Except for Generosity, all the happiness factors have higher values for Finland than for South Sudan. These higher values are what make the Happiness Score of Finland so much higher.
- Although Generosity in Finland is lower than in the least happy country, this did not affect the Happiness Score much because, as we saw earlier, the correlation between Generosity and the overall Happiness Score is low.

Hence, we can see that the happiest country differs from the least happy country in having higher values for most of the happiness factors.

## How has happiness changed since 2015?
We will now compare the happiness in 2019 to the data from 2015 to see if there is any change.

First, we will make overlapping histograms of the Happiness Scores for both years for better comparison.

In [None]:
# Comparing 2015 and 2019
matplotlib.rcParams['figure.figsize'] = (10, 5)
plt.title("Distribution of Happiness Scores in 2015 and 2019")
plt.xlabel('Happiness Score')
plt.ylabel('Number of Countries')
plt.hist(happy_one_df['Happiness Score'], alpha=0.4, bins=[2, 3, 4, 5, 6, 7, 8], label='2015')
plt.hist(happy_three_df['Happiness Score'], alpha=0.4, bins=[2, 3, 4, 5, 6, 7, 8], label='2019')
plt.legend();

From this distribution of Happiness, where blue depicts 2015 and orange depicts 2019, we can see that there has been a positive shift in the happiness scores. There are more scores in the higher score range (score > 5) and fewer scores in the lower score range (score <= 5).

To better understand this, let us look at the percentage of happiness scores in the lower and higher range in both years.

In [None]:
high_2019_df = (happy_three_df[happy_three_df['Happiness Score'] > 5])
high_2019_percent = high_2019_df.shape[0] / happy_three_df.shape[0] * 100

low_2019_df = (happy_three_df[happy_three_df['Happiness Score'] <= 5])
low_2019_percent = low_2019_df.shape[0] / happy_three_df.shape[0] * 100

high_2015_df = (happy_one_df[happy_one_df['Happiness Score'] > 5])
high_2015_percent = high_2015_df.shape[0] / happy_one_df.shape[0] * 100

low_2015_df = (happy_one_df[happy_one_df['Happiness Score'] <= 5])
low_2015_percent = low_2015_df.shape[0] / happy_one_df.shape[0] * 100

print('Percent of countries in the higher score range in 2019: {} \nPercent of countries in the lower score range in 2019: {} \nPercent of countries in the higher score range in 2015: {} \nPercent of countries in the lower score range in 2015: {}'
     .format(high_2019_percent, low_2019_percent, high_2015_percent, low_2015_percent))
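
The same percentages can be computed more compactly, because the mean of a True/False column is the fraction of True values. The helper `percent_above` and the scores below are made up for illustration.

```python
import pandas as pd

def percent_above(scores, threshold=5):
    """Percentage of scores strictly above the threshold."""
    return (scores > threshold).mean() * 100

# Made-up scores: two of the four are above 5
toy_scores = pd.Series([4.2, 5.0, 5.1, 6.3])

high = percent_above(toy_scores)
low = 100 - high
print('{:.1f}% above 5, {:.1f}% at or below 5'.format(high, low))
```

Since every score is either above the threshold or not, the lower-range percentage is simply 100 minus the higher-range percentage.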

Hence, we can see that the percentage of countries in the higher Happiness Score range has increased by more than 3 percentage points from 2015 to 2019. This supports the positive shift in Happiness Scores that we saw earlier from the histogram.

## Which country's happiness increased the most and which one's decreased the most?
Since 2015, the Happiness Scores for the countries have changed due to change in the happiness factors. They may have increased or decreased. Here, we will look at the country whose happiness has increased the most and the country whose happiness has decreased the most.

In [None]:
# Merge the two datasets on the column Country and then find the change in the Happiness Scores
selected_columns = happy_one_df[["Country","Happiness Score"]]
new_df = selected_columns.copy()
merge_df = happy_three_df.merge(new_df, on=['Country'])
merge_df['Score Change'] = merge_df['Happiness Score_x'] - merge_df['Happiness Score_y']
merge_df.head(5)

In [None]:
merge_df['Score Change'].describe()

Looking at the above description of the Score Change column, we can see that the average change in the Happiness Score is a small positive amount.

The max and min values show the largest increase and the largest decrease in happiness for any single country. Let us find these countries and see why this has happened.

In [None]:
max_increase = merge_df['Score Change'].idxmax()
max_i_country = merge_df.loc[max_increase, 'Country']

max_decrease = merge_df['Score Change'].idxmin()
max_d_country = merge_df.loc[max_decrease, 'Country']

print('The country whose happiness increased the most is {}, and the country whose happiness decreased the most is {}.'
     .format(max_i_country, max_d_country))
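
If more than the single largest change in each direction is of interest, `nlargest` and `nsmallest` give the top few rows directly. This is a sketch on made-up data; the countries 'A' to 'D' and their scores are hypothetical.

```python
import pandas as pd

# Made-up scores for four hypothetical countries
toy = pd.DataFrame({
    'Country': ['A', 'B', 'C', 'D'],
    'Score 2015': [5.0, 6.0, 4.0, 7.0],
    'Score 2019': [5.5, 5.0, 4.1, 7.0],
})
toy['Score Change'] = toy['Score 2019'] - toy['Score 2015']

# The two largest increases and the two largest decreases
biggest_gains = toy.nlargest(2, 'Score Change')[['Country', 'Score Change']]
biggest_drops = toy.nsmallest(2, 'Score Change')[['Country', 'Score Change']]

print(biggest_gains)
print(biggest_drops)
```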

Hence, Benin is the country with the largest increase in happiness and Venezuela is the country with the largest decrease in happiness. Let us look at their respective rows in the datasets and see why this has happened.

In [None]:
increase_df_2019 = happy_three_df.loc[happy_three_df['Country'] == max_i_country]
increase_df_2015 = happy_one_df.loc[happy_one_df['Country'] == max_i_country]
increase_df_2019 = pd.concat([increase_df_2019, increase_df_2015])
increase_df_2019

The first row is from the 2019 dataset and the second row is from the 2015 dataset.

As we can see for Benin, all the factors except Freedom and Generosity have increased by significant amounts. This led to a change of more than 1.5 in the Happiness Score and also improved the Happiness Rank of the country.

In [None]:
decrease_df_2019 = happy_three_df.loc[happy_three_df['Country'] == max_d_country]
decrease_df_2015 = happy_one_df.loc[happy_one_df['Country'] == max_d_country]
decrease_df_2019 = pd.concat([decrease_df_2019, decrease_df_2015])
decrease_df_2019

As we can see for Venezuela, all the factors except Social support, Healthy life expectancy and Generosity have decreased, and the factors that have increased have only done so by a small amount. This has led to an overall decrease in the Happiness Score of more than 2.1 and has also pushed the country far down the Happiness Rank.

The most striking change here is the huge decrease in Freedom.