This is a kernel where I analyze the data from the 2019 happiness report. I will try to find the main criteria that make a country happy.


In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
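# Load the last file listed by the loop above
# (this assumes the input folder only contains the 2019 report CSV)
data = pd.read_csv(os.path.join(dirname, filename))
# data = pd.read_csv('world-happiness-report-2019.csv')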

Looking at the first rows of the data to get an idea of what it looks like


In [None]:
data.head()

Getting the information about the data set

In [None]:
data.info()

The number of non-null values is not the same for all the columns.
This suggests that there are NaN values in the dataset.
Let's count the NaN values in each column.

In [None]:
data.isna().sum()

Using the forward fill method to replace the NaN values.
Since the countries are ranked in the dataset, a missing value is
most likely close to the value of the country ranked right above the one
with the missing value.

In [None]:
# DataFrame.ffill() replaces fillna(method='ffill'), which is deprecated in recent pandas versions
filled_data = data.ffill()

filled_data.isna().sum()
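
To make the forward fill step concrete, here is a minimal sketch on a tiny made-up table. The `toy` frame and its values are hypothetical and only stand in for the real dataset; the calls used are `isna()`, `sum()` and `ffill()`, the same ones applied to `data` above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, already sorted by rank like the happiness table.
toy = pd.DataFrame({
    "Ladder": [1, 2, 3, 4, 5],
    "Corruption": [3.0, np.nan, 10.0, np.nan, np.nan],
})

# Count the missing values per column before filling.
missing_before = toy.isna().sum()

# Forward fill: each NaN takes the last valid value above it in the same column.
toy_filled = toy.ffill()

print(missing_before)
print(toy_filled)
```

Note that a NaN in the very first row would stay NaN, because there is no value above it to copy, so it is worth checking `isna().sum()` again after filling.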

Now that there are no more NaN values, we can scatter plot
the six criteria used to rank the countries.

In [None]:
# Scatter plot of the six criteria used to assess happiness
criteria = filled_data.columns[5:11]

plt.figure(figsize=(16,10))
for i in range(criteria.shape[0]):
    plt.subplot(2,3,i+1)
    plt.scatter(x=filled_data['Ladder'],y=filled_data[criteria[i]])
    plt.xlabel('Ladder')
    plt.ylabel(criteria[i])

On the x axis we use the ranking of the country (Ladder) instead of
its name, which keeps the axis readable. Each country can still be identified by its ranking.

Another way to analyze the important factors is to take the ten top-ranked countries, ten from the middle and ten from the bottom.
Then we can use the same scatter plot to see which of those six factors matter most.

In [None]:
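mid = filled_data.shape[0] // 2
ten_best = filled_data.iloc[:10]              # first ten rows
ten_middle = filled_data.iloc[mid-5:mid+5]    # ten rows around the middle of the ranking
ten_worst = filled_data.iloc[-10:]            # last ten rows, including the last-ranked country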
plt.figure(figsize=(16,10))
for i in range(criteria.shape[0]):
    plt.subplot(2,3,i+1)
    plt.scatter(x=ten_best['Ladder'],y=ten_best[criteria[i]],c='green')
    plt.scatter(x=ten_middle['Ladder'],y=ten_middle[criteria[i]],c='blue')
    plt.scatter(x=ten_worst['Ladder'],y=ten_worst[criteria[i]],c='red')
    plt.xlabel('Ladder')
    plt.ylabel(criteria[i])
plt.suptitle("Green : Top ten, Blue : Middle ten, Red : Bottom ten");
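
Slice boundaries are easy to get wrong here: a slice such as `[0:11]` returns eleven rows, and a slice ending in `-1` leaves out the last row. The sketch below checks the three groups on a hypothetical ranked table; the 156 rows and the names `ranked`, `n` and `mid` are assumptions made for the illustration, not values taken from the dataset.

```python
import pandas as pd

# Hypothetical ranked table with 156 rows; only the rank column is needed here.
ranked = pd.DataFrame({"Ladder": range(1, 157)})

n = 10                      # size of each group
mid = len(ranked) // 2      # position of the middle of the table

top = ranked.iloc[:n]                              # first n rows
middle = ranked.iloc[mid - n // 2 : mid + n // 2]  # n rows centred on the middle
bottom = ranked.iloc[-n:]                          # last n rows, last row included

print(len(top), len(middle), len(bottom))
print(middle["Ladder"].tolist())
print(bottom["Ladder"].tolist())
```

Using `iloc` makes the selection positional, so it does not depend on the index labels of the DataFrame.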

Showing the data this way, with ten countries from the top, middle and bottom, highlights which criteria a country needs to focus on to move up the ranking.
If you compare the ten in the middle and the ten at the bottom, you can see that there is not a big difference in Freedom, Corruption and Generosity. However, there is a clear separation for the other three criteria. These basically translate to:
 - Are people healthy (Healthy life expectancy)?
 - Are they paid correctly (Log of GDP per capita)?
 - Do they have access to social services (Social support)?
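
The visual comparison between groups can also be backed by a number. The following is a rough sketch that compares the median of each criterion between two groups; the helper `group_medians` and the toy values are hypothetical and are not results from the happiness data.

```python
import pandas as pd

def group_medians(groups, columns):
    """Return one row per group with the median of each requested column."""
    medians = {name: frame[columns].median() for name, frame in groups.items()}
    return pd.DataFrame(medians).T

# Hypothetical toy values, NOT taken from the report.
middle = pd.DataFrame({"Freedom": [70, 80, 90], "Social support": [60, 70, 80]})
bottom = pd.DataFrame({"Freedom": [75, 85, 95], "Social support": [140, 145, 150]})

summary = group_medians({"middle": middle, "bottom": bottom},
                        ["Freedom", "Social support"])

# Absolute gap between the two groups for each criterion.
gap = (summary.loc["bottom"] - summary.loc["middle"]).abs()

print(summary)
print(gap)
```

With the real data, the same helper could be called as `group_medians({"middle": ten_middle, "bottom": ten_worst}, list(criteria))`; a small gap would support the "not a big difference" reading, a large one the "clear separation" reading.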

We can see how those three criteria are linked by looking at the correlation

In [None]:
reduced_data = filled_data[['Social support', 
                'Log of GDP\nper capita', 
                'Healthy life\nexpectancy']]
reduced_data.columns = ['Social', 'GDP', 'Health'] # Shorten the column names

corr = reduced_data.corr()
sns.heatmap(corr,vmin=0.5,vmax=1)

print(corr)

From the values of the correlation matrix, we can see that those three criteria are strongly correlated.
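
For readers who want to see what `corr()` computes, here is a small sketch of the Pearson correlation coefficient (the default method of `DataFrame.corr`) worked out by hand and compared with pandas. The `pearson` helper and the toy numbers are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
import numpy as np
import pandas as pd

def pearson(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()   # deviations from the mean of x
    dy = y - y.mean()   # deviations from the mean of y
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

# Hypothetical toy values, NOT taken from the report.
toy = pd.DataFrame({
    "GDP": [1.0, 2.0, 3.0, 4.0, 5.0],
    "Health": [2.0, 1.0, 4.0, 3.0, 5.0],
})

by_hand = pearson(toy["GDP"], toy["Health"])
from_pandas = toy.corr().loc["GDP", "Health"]

print(by_hand, from_pandas)
```

A value close to 1 means the two columns rise together, a value close to 0 means no linear relationship, and a value close to -1 means one falls as the other rises.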

So a member of a political party could look at these results and prepare an election program based on those main criteria.

Feel free to let me know what you think!