In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Importing the Data

First, we will load the World Happiness Report data for the year 2019 into a dataframe.

In [None]:
df19 = pd.read_csv("/kaggle/input/world-happiness/2019.csv")

Now we will display the data frame and check for any missing values or inconsistencies.

In [None]:
df19

In [None]:
df19.info()

It can be observed that there are no missing or null values.
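
As a complement to reading the `info()` output, the missing-value check can also be made explicit. The snippet below is only a sketch on a small made-up frame (`toy` and its values are hypothetical stand-ins for `df19`); the same two calls should work on the real data frame.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for df19; the values are made up.
toy = pd.DataFrame({
    "Country or region": ["A", "B", "C"],
    "Score": [7.5, np.nan, 4.2],
    "GDP per capita": [1.3, 0.9, 0.4],
})

# Count missing values per column; all zeros would mean nothing is missing.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Count fully duplicated rows, another common inconsistency.
duplicate_rows = int(toy.duplicated().sum())
print(duplicate_rows)
```

On the actual data, `df19.isna().sum()` returning zeros for every column would confirm the observation above.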

# Visualizing the Data

## Geospatial Visualization of the Happiness Score

In order to plot a world map, we will first need to import geopandas and load its bundled "naturalearth_lowres" dataset, which provides a basemap of the world.

In [None]:
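import matplotlib.pyplot as plt
import geopandas as gpd
# Note: gpd.datasets was deprecated in geopandas 0.14 and removed in 1.0;
# on newer versions the Natural Earth file has to be loaded from another source.
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))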

In [None]:
world

Now, in order to join the two datasets, the names of the countries need to be consistent in both. Below, you can see the countries in the dataframe "world" which are apparently not in the dataframe "df19", and vice versa. In fact, for several countries only the spelling of the name differs, for example (name in "world" -> name in "df19"):
1. United States of America -> United States
2. Czechia -> Czech Republic
3. Dominican Rep. -> Dominican Republic
4. Dem. Rep. Congo -> Congo (Kinshasa)
5. eSwatini -> Swaziland
6. Côte d'Ivoire -> Ivory Coast

Fun fact: in celebration of the country's 50th year of independence, King Mswati III declared that he was changing the name of Swaziland to eSwatini. The king made his declaration to a crowded stadium in Manzini, noting that the name change was intended to shed vestiges of the country's colonial past. :)

In [None]:
world[~world.name.isin(df19["Country or region"])]

In [None]:
df19[~df19["Country or region"].isin(world.name)]


In [None]:
# Rename countries in "world" so that they match the spelling used in df19.
# The row labels below are specific to this version of naturalearth_lowres;
# check them against the output above before reusing them.
world.at[176, "name"] = "South Sudan"
world.at[4, "name"] = "United States"
world.at[17, "name"] = "Dominican Republic"
world.at[160, "name"] = "Northern Cyprus"
world.at[11, "name"] = "Congo (Kinshasa)"
world.at[153, "name"] = "Czech Republic"
world.at[171, "name"] = "North Macedonia"
world.at[79, "name"] = "Palestinian Territories"
world.at[175, "name"] = "Trinidad & Tobago"
world.at[170, "name"] = "Bosnia and Herzegovina"
world.at[60, "name"] = "Ivory Coast"
world.at[66, "name"] = "Central African Republic"
world.at[73, "name"] = "Swaziland"
world.at[67, "name"] = "Congo (Brazzaville)"
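
Renaming by row label works for one specific copy of the basemap, but the labels can shift between versions of the file. A possible alternative, sketched here on a hypothetical stand-in frame (`world_names` is not part of the notebook), is to rename by value with a mapping built from the name pairs listed earlier:

```python
import pandas as pd

# Hypothetical stand-in for the "name" column of the world GeoDataFrame.
world_names = pd.DataFrame(
    {"name": ["United States of America", "Czechia", "France"]}
)

# Basemap spelling -> spelling used in the happiness data.
name_map = {
    "United States of America": "United States",
    "Czechia": "Czech Republic",
    "Dominican Rep.": "Dominican Republic",
    "Dem. Rep. Congo": "Congo (Kinshasa)",
    "eSwatini": "Swaziland",
    "Côte d'Ivoire": "Ivory Coast",
}

# Series.replace with a dict swaps exact matches and leaves other names untouched.
world_names["name"] = world_names["name"].replace(name_map)
print(world_names["name"].tolist())
```

Names that are absent from the mapping (here "France") pass through unchanged, so the mapping only needs to list the mismatches.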

Now, below are the 6 countries in our happiness report data frame for which I couldn't find a match in our world dataframe.

In [None]:
df19[~df19["Country or region"].isin(world.name)]

We shall now merge the two data frames. 

In [None]:
for_plotting1 = world.merge(df19, left_on = 'name', right_on = "Country or region", how="left")
for_plotting2 = for_plotting1
for_plotting1
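
To see which rows of a left merge found no partner, pandas offers the `indicator=True` option of `merge`. The sketch below uses two tiny made-up frames (`left`, `right` and their values are hypothetical) rather than the real `world` and `df19`:

```python
import pandas as pd

# Hypothetical stand-ins for the world basemap and the happiness data.
left = pd.DataFrame({"name": ["Finland", "Norway", "Antarctica"]})
right = pd.DataFrame({
    "Country or region": ["Finland", "Norway"],
    "Score": [7.7, 7.5],  # made-up values
})

# indicator=True adds a "_merge" column saying where each row came from.
merged = left.merge(
    right,
    left_on="name",
    right_on="Country or region",
    how="left",
    indicator=True,
)

# Rows marked "left_only" exist in the basemap but have no happiness score.
unmatched = merged.loc[merged["_merge"] == "left_only", "name"].tolist()
print(unmatched)
```

With `how="left"` every basemap row is kept, and the unmatched ones carry `NaN` in the columns coming from the right-hand frame, which is why they can later be drawn in grey.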

Finally, we can visualize the happiness score of each country. All the countries for which the happiness score is missing are shown in grey. For the other countries, the legend shows the range in which each country's happiness score lies.

In [None]:
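# Drop only the rows without a score; scheme='quantiles' with k=8 splits the scores into 8 equally populated classes
ax = for_plotting1.dropna(subset=['Score']).plot(column='Score', cmap='viridis', figsize=(20,15), scheme='quantiles', k=8, legend=True)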
for_plotting2[for_plotting2.Score.isna()].plot(color='lightgrey', ax=ax)
ax.set_title('Happiness score of countries (2019)', fontdict= {'fontsize':15})
ax.set_axis_off()
ax.get_legend().set_bbox_to_anchor((.12,.12))
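
The ranges in the legend come from the quantile scheme with `k=8`, which aims to put roughly the same number of countries in each class. As a rough illustration of that idea (not the exact routine geopandas uses internally), `pd.qcut` can compute such classes on made-up scores:

```python
import numpy as np
import pandas as pd

# Made-up scores, evenly spread between 3.0 and 7.5.
scores = pd.Series(np.linspace(3.0, 7.5, 16))

# Split into 8 quantile classes and also return the class edges.
classes, edges = pd.qcut(scores, q=8, retbins=True)

# Each class should hold the same number of observations.
counts = classes.value_counts()
print(edges)
print(counts)
```

Because the classes are equally populated rather than equally wide, a class can cover a narrow or a wide range of scores depending on how the scores are distributed.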


* Here, at first glance we can see that most of North America, New Zealand, Finland, Denmark, Norway, Iceland and Australia are yellow. These are the happiest countries.
* Most of Asia, however, is either green or blue; hence, these are countries which are fairly unhappy.
* Compared to other continents, Africa has the largest number of countries which are relatively unhappy.

## Analysis of Contributing Factors for the Happiness Scores of the 20 Happiest and Unhappiest Countries

The columns following the happiness score estimate the extent to which each of six factors – "GDP per capita", "Social support", "Healthy life expectancy", "Freedom to make life choices", "Generosity", "Perceptions of corruption" – contributes to the happiness score for each country.

This would suggest that the happiness score is the sum of these 6 contributing factors.

However, there is a difference between the sum of the six contributing factors and the happiness scores of the countries, so part of each score is not explained by them. The World Happiness Report publishes this remainder as "Dystopia + residual"; in this notebook it is called "unknown contributing factors". It is possible that either there are several unknown contributing factors that each contribute relatively little to the happiness score, or that there are only a few factors that contribute more to the score.

To put things into perspective, a lower or a higher value of a contributing factor is read here as that factor being less or more important to the citizens' happiness, respectively. For instance, take a country X and a country Y, and suppose the value of GDP per capita is lower for country X than for country Y. In this reading, that does not mean that the GDP of country X is lower than that of country Y. It simply means that GDP plays more of a role in contributing to the happiness of the citizens of country Y.

In [None]:
# The six listed factors; whatever part of the score they do not explain is the unknown part
factor_cols = ["GDP per capita", "Social support", "Healthy life expectancy", "Freedom to make life choices", "Generosity", "Perceptions of corruption"]
unknown_factors = df19["Score"] - df19[factor_cols].sum(axis=1)
for_plotting3 = df19.copy()
for_plotting3["unknown contributing factors"] = unknown_factors
for_plotting3.set_index("Country or region", drop=True, inplace=True)
# Assumes df19 is sorted by overall rank, so the first 20 rows are the 20 happiest countries
ax = for_plotting3[:20][factor_cols + ["unknown contributing factors"]].plot(kind="bar", figsize=(12,8), cmap="viridis", stacked=True)
ax.set_title("Contribution of each factor to happiness score for top 20 happiest countries")
ax.set_xlabel("Countries")
ax.set_ylabel("Contribution")
# Place the legend of this chart outside the bars
ax.legend(loc="upper left", bbox_to_anchor=(1.0, 1.0))

From the above stacked bar graph for the 20 countries with the ***highest*** happiness scores, we can observe the following:

1. Among these countries, the contribution of GDP to the happiness score is highest for Luxembourg and lowest for Costa Rica.
2. The values for Social Support, Healthy Life Expectancy and Freedom To Make Life Choices as contributing factors are nearly equal for all the countries.
3. The contribution of Generosity is lower for the Czech Republic than for the other countries.
4. The contribution of Perceptions of corruption is lower for Iceland, Costa Rica, Israel, the United States and the Czech Republic than for the other countries.
5. Overall, the 3 factors that contribute the most to the happiness score of all countries are GDP, Healthy Life Expectancy and Social Support.

In [None]:
# The six listed factors; whatever part of the score they do not explain is the unknown part
factor_cols = ["GDP per capita", "Social support", "Healthy life expectancy", "Freedom to make life choices", "Generosity", "Perceptions of corruption"]
unknown_factors = df19["Score"] - df19[factor_cols].sum(axis=1)
for_plotting4 = df19.copy()
for_plotting4["unknown contributing factors"] = unknown_factors
for_plotting4.set_index("Country or region", drop=True, inplace=True)
# Assumes df19 is sorted by overall rank, so the last 20 rows are the 20 unhappiest countries
ax = for_plotting4[-20:][factor_cols + ["unknown contributing factors"]].plot(kind="bar", figsize=(12,8), cmap="viridis", stacked=True)
ax.set_title("Contribution of each factor to happiness score for the 20 unhappiest countries")
ax.set_xlabel("Countries")
ax.set_ylabel("Contribution")
# Place the legend of this chart outside the bars
ax.legend(loc="upper left", bbox_to_anchor=(1.0, 1.0))

From the above stacked bar graph for the 20 countries with the ***lowest*** happiness scores, we can observe the following:

1. Among these countries, the contribution of GDP to the happiness score is lowest for Liberia, Burundi and the Central African Republic, and highest for Botswana.
2. The value of Social Support as a contributing factor for the Central African Republic is significantly lower than for the other countries and is almost negligible.
3. Lesotho and the Central African Republic have the lowest values for the contribution of Healthy Life Expectancy to the happiness score, whereas all the other countries have nearly equal values for it.
4. The contribution of Freedom to Make Life Choices for Haiti, Syria and Afghanistan is negligible. In Comoros, Madagascar, Burundi and Yemen, its contribution is not negligible but is still lower than in the other countries.
5. For Botswana, Generosity as a contributing factor to the happiness score is lower than for all the other countries.
6. Noticeably, Perceptions of corruption as a contributing factor is highest for Rwanda.
7. It is also apparent that the value for unknown contributing factors is extremely low for Botswana and higher for the Central African Republic compared to the others.

## Heatmap of a Correlation Matrix of Contributing Factors, Overall Rank and Happiness Score

Correlation describes how two variables are related to each other. Two features (variables) can be positively correlated, which means that when the value of one variable increases, the value of the other tends to increase as well. Similarly, two features can be negatively correlated, which means that when the value of one variable increases, the value of the other tends to decrease.
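
A small numeric illustration may help here. The sketch below uses made-up arrays (not the happiness data) and NumPy's `corrcoef` to compute the Pearson correlation for one rising and one falling relationship:

```python
import numpy as np

# Made-up data: y_up rises with x, y_down falls as x rises.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_up = 2.0 * x + 1.0
y_down = 10.0 - 3.0 * x

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is the correlation.
r_up = np.corrcoef(x, y_up)[0, 1]
r_down = np.corrcoef(x, y_down)[0, 1]

print(round(r_up, 3), round(r_down, 3))
```

Perfectly linear relationships like these give values at the extremes of the scale, close to +1 and -1; real data, such as the factors in this notebook, will usually fall somewhere in between.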

A heatmap is a graphical representation of data in which data values are represented as colors. That is, it uses color in order to communicate a value to the reader. 

So, the heatmap below will represent the correlation between various features such as overall rank, happiness score, GDP per capita and all our other contributing factors in the form of colours. The heatmap has also been annotated with the correlation values for better understanding.

In [None]:
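import seaborn as sns
# Show tick marks on both axes (set before the figure is created so that it takes effect)
sns.set_style({'xtick.bottom': True, 'ytick.left': True})
# numeric_only=True skips the text column "Country or region" (needed on pandas >= 2.0)
corr_matrix = df19.corr(numeric_only=True)
# Hide the upper triangle and the diagonal, which only repeat the lower triangle;
# the builtin bool works across NumPy versions, unlike the np.bool alias
mask = np.zeros_like(corr_matrix, dtype=bool)
mask[np.triu_indices_from(mask)] = True
f, ax = plt.subplots(figsize=(11, 15))
heatmap = sns.heatmap(corr_matrix, mask=mask, square=True, linewidths=.5, cmap="viridis", cbar_kws={'shrink': .4, 'ticks': [-1, -.5, 0, 0.5, 1]}, vmin=-1, vmax=1, annot=True, annot_kws={"size": 12})
ax.set_yticklabels(corr_matrix.columns, rotation=0)
ax.set_xticklabels(corr_matrix.columns)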
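
The `mask` in the cell above hides the upper triangle of the correlation matrix, since a correlation matrix is symmetric and the upper half repeats the lower half. A minimal sketch of what the mask looks like, using a made-up 3x3 matrix:

```python
import numpy as np

# Made-up symmetric correlation matrix for three features.
corr = np.array([
    [1.0, 0.8, -0.2],
    [0.8, 1.0, 0.1],
    [-0.2, 0.1, 1.0],
])

# True marks the cells to hide: the upper triangle including the diagonal.
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True
print(mask)

# The cells that remain visible are the entries below the diagonal.
visible = corr[~mask]
print(visible)
```

Only the three entries below the diagonal remain visible, one for each distinct pair of features.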

Okay, so now we will be able to observe some obvious but also some extremely intriguing correlations. The following observations can be made from the correlation heatmap above:

1. Just to familiarize you with how this works, the most obvious strong negative correlation you can observe at first glance is between overall rank and the happiness score, which is -0.99. This means that the greater the value of the happiness score, the lower the value of the overall rank, and vice versa. In case you find this a bit confusing, remember that if the value of the overall rank is low, it simply means that the country is higher up in the ranking, thus it has a higher happiness score.

2. There is a strong positive correlation between GDP per capita X Score, Social Support X Score and Healthy Life Expectancy X Score. This means that the higher the contribution of these factors to the happiness score, the greater the happiness score. In other words, the Happiness Score is heavily influenced by these factors.

3. You may also have observed that the value of the correlation between Score and Generosity is 0.076, which would be considered insignificant. Hence, we can conclude that Generosity as a contributing factor does not have much of an influence on the Happiness Score.

4. Another interesting observation is the strong positive correlation between GDP Per Capita X Social Support, GDP Per Capita X Healthy Life Expectancy, and Social Support X Healthy Life Expectancy. In my opinion, this should mean that if the value of one of these features as a contributing factor is high, then it is very likely that the values of the other two features are also high.
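
The correlations with the score can also be read off without the heatmap by taking one column of the correlation matrix and sorting it. This is only a sketch on a made-up frame (`toy` and its values are hypothetical); on the real data the same pattern could be applied to the correlation matrix of `df19`.

```python
import pandas as pd

# Made-up values for four hypothetical countries.
toy = pd.DataFrame({
    "Score": [7.0, 6.0, 5.0, 4.0],
    "GDP per capita": [1.4, 1.2, 0.9, 0.5],
    "Generosity": [0.2, 0.3, 0.1, 0.25],
})

# Take the "Score" column of the correlation matrix, drop the trivial
# self-correlation, and sort from strongest positive to weakest.
corr_with_score = (
    toy.corr()["Score"]
    .drop("Score")
    .sort_values(ascending=False)
)
print(corr_with_score)
```

In this toy example GDP per capita ends up at the top with a correlation close to 1, while Generosity sits near 0, which mirrors the kind of contrast described in points 2 and 3.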

