In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
        
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
data_old = pd.read_csv('/kaggle/input/world-happiness-report-2021/world-happiness-report.csv')
data_old.rename(columns = {'Life Ladder' : 'Ladder score', 
                           'Healthy life expectancy at birth': 'Healthy life expectancy',
                          'Log GDP per capita': 'Logged GDP per capita'}, inplace = True)
data_old.drop(columns = ['Positive affect', 'Negative affect'], inplace = True)
data_old.head()

In [None]:
data_new = pd.read_csv('/kaggle/input/world-happiness-report-2021/world-happiness-report-2021.csv')
data_new['year'] = 2021
data_new = data_new[['Country name', 'year','Ladder score',
       'Logged GDP per capita', 'Social support', 'Healthy life expectancy',
       'Freedom to make life choices', 'Generosity',
       'Perceptions of corruption']]
data_new.head()

In [None]:
#concatenate old and new records

data = pd.concat([data_new, data_old], axis = 0)
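
If the two frames do not share exactly the same column names, `pd.concat` fills the gaps with NaN instead of raising an error. A minimal sketch of a sanity check follows, using two small made-up frames; `toy_new` and `toy_old` are hypothetical stand-ins for `data_new` and `data_old`, and the values are invented.

```python
import pandas as pd

# Hypothetical stand-ins for data_new and data_old (values are made up)
toy_new = pd.DataFrame({'Country name': ['A'], 'year': [2021], 'Ladder score': [7.0]})
toy_old = pd.DataFrame({'Country name': ['A', 'B'], 'year': [2020, 2020], 'Ladder score': [6.8, 5.1]})

# Columns present in one frame but not the other; an empty set means they line up
mismatched = set(toy_new.columns) ^ set(toy_old.columns)
print(mismatched)

# ignore_index=True gives the combined frame a fresh 0..n-1 index
toy = pd.concat([toy_new, toy_old], axis=0, ignore_index=True)
print(toy.shape)
```

Without `ignore_index=True`, as in the cell above, the combined frame keeps the original row labels of both inputs, so index values can repeat.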

In [None]:
data = data[data['year'].isin(range(2011, 2022))]

# Cleaning

In [None]:
#presence of null values in dataset
data.isnull().sum()

In [None]:
# reassign instead of modifying in place, since data is a filtered subset
data = data.dropna()

In [None]:
#count duplicate rows, if any
data.duplicated().sum()

In [None]:
#drop duplicates, if any
data = data.drop_duplicates()

In [None]:
#distribution and detection of outliers

sns.boxplot(x = data['Ladder score'])

In [None]:
#distribution and detection of outliers

sns.boxplot(x = data['Logged GDP per capita'])

In [None]:
#distribution and detection of outliers

sns.boxplot(x = data['Social support'])

In [None]:
#distribution and detection of outliers

sns.boxplot(x = data['Healthy life expectancy'])

In [None]:
#distribution and detection of outliers

sns.boxplot(x = data['Freedom to make life choices'])

In [None]:
#distribution and detection of outliers

sns.boxplot(x = data['Generosity'])

In [None]:
#distribution and detection of outliers

sns.boxplot(x = data['Perceptions of corruption'])

In [None]:
# numeric_only skips the string column 'Country name'
Q1 = data.quantile(0.25, numeric_only = True)
Q3 = data.quantile(0.75, numeric_only = True)
IQR = Q3 - Q1
print(IQR)

In [None]:
#remove outliers by IQR technique
#compare only the numeric columns that the quartiles were computed on
num = data[Q1.index]
data = data[~((num < (Q1 - 1.5 * IQR)) | (num > (Q3 + 1.5 * IQR))).any(axis=1)]
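
To see what the IQR rule does, here is a minimal sketch on a made-up frame; `toy` and its values are hypothetical and chosen so that one row is an obvious outlier. A row is removed as soon as any one of its columns falls outside the interval [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR].

```python
import pandas as pd

# Made-up values; 100.0 is an obvious outlier in column 'a'
toy = pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0, 100.0],
                    'b': [10.0, 11.0, 12.0, 13.0, 14.0]})

q1 = toy.quantile(0.25)
q3 = toy.quantile(0.75)
iqr = q3 - q1

# True wherever a cell lies outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] for its column
outside = (toy < (q1 - 1.5 * iqr)) | (toy > (q3 + 1.5 * iqr))

# Drop every row that has at least one flagged cell
filtered = toy[~outside.any(axis=1)]
print(len(toy), '->', len(filtered))
```

Because the rule is applied across all columns at once, a country-year that is unusual in a single indicator is dropped from every later plot, which is worth keeping in mind when reading the rankings further down.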

In [None]:
data.describe()

In [None]:
sns.pairplot(data, hue = 'year')

# Are people happier?

Logged GDP per capita is more or less constant over the years, with a slight increase in 2020.

In [None]:
plt.figure(figsize = (20, 5))
sns.barplot(data = data, x = 'year', y = 'Logged GDP per capita').set_title('Logged GDP per capita from 2011 to 2021')

Social support is more or less constant over the years, with a slight increase in 2020.

In [None]:
plt.figure(figsize = (20, 5))
sns.barplot(data = data, x = 'year', y = 'Social support').set_title('Social support from 2011 to 2021')

Healthy life expectancy has increased over the years, with a slight dip in 2021. The dip in 2021 may be due in part to the spread of COVID-19.

In [None]:
plt.figure(figsize = (10, 5))
sns.barplot(data = data, x = 'year', y = 'Healthy life expectancy').set_title('Healthy life expectancy from 2011 to 2021')
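
The bar charts show these trends visually; a table of yearly means makes the size of each year-to-year change explicit. The sketch below uses made-up numbers (`toy` is hypothetical); in the notebook, the same `groupby` could be applied to `data` for any of the three indicators.

```python
import pandas as pd

# Made-up values for two hypothetical countries over three years
toy = pd.DataFrame({
    'year': [2019, 2019, 2020, 2020, 2021, 2021],
    'Healthy life expectancy': [64.0, 66.0, 65.0, 67.0, 64.5, 66.5],
})

# Mean per year, then the change relative to the previous year
yearly = toy.groupby('year')['Healthy life expectancy'].mean()
change = yearly.diff()
print(pd.DataFrame({'mean': yearly, 'change': change}))
```

The first year has no predecessor, so its `change` entry is NaN.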

# Where are the happiest people?

In [None]:
top10_overall = data.groupby(by = 'Country name')['Ladder score'].mean().to_frame().nlargest(10, 'Ladder score')
sns.heatmap(top10_overall, annot = True).set_title('Top 10 Happiest Countries (2011-2021)')

In [None]:
top10_2021 = data[data['year'] == 2021].groupby(by = ['Country name'])['Ladder score'].mean().to_frame().nlargest(10, 'Ladder score')
sns.heatmap(top10_2021, annot = True).set_title('Top 10 Happiest Countries in 2021')
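
The two rankings can differ because one averages over all available years and the other looks at a single year. A minimal sketch with made-up scores for three hypothetical countries illustrates this; note that after the cleaning steps a country may be missing some years, so its multi-year mean may be based on fewer observations than another country's.

```python
import pandas as pd

# Made-up ladder scores for hypothetical countries A, B and C
toy = pd.DataFrame({
    'Country name': ['A', 'A', 'B', 'B', 'C', 'C'],
    'year': [2020, 2021, 2020, 2021, 2020, 2021],
    'Ladder score': [7.0, 7.4, 7.5, 7.1, 5.0, 5.2],
})

# Ranking by mean score over all years
top_overall = toy.groupby('Country name')['Ladder score'].mean().nlargest(2)

# Ranking for 2021 only
top_2021 = toy[toy['year'] == 2021].set_index('Country name')['Ladder score'].nlargest(2)

print(top_overall)
print(top_2021)
```

In this toy data B leads the multi-year ranking while A leads in 2021.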

# What makes up happiness?

(a) Ladder score measures the degree of happiness that respondents perceive. Respondents were asked: “Please imagine a ladder, with steps numbered from 0 at the bottom to 10 at the top. The top of the ladder represents the best possible life for you and the bottom of the ladder represents the worst possible life for you. On which step of the ladder would you say you personally feel you stand at this time?” This measure is also referred to as the Cantril life ladder, or just the life ladder in our analysis.
In particular, the ladder score shows a moderately strong positive relationship with the following parameters:
* Logged GDP per capita: The statistics of GDP per capita (variable name gdp) in purchasing power parity (PPP) at constant 2017 international dollar prices are from the October 14, 2020 update of the World Development Indicators (WDI).
* Social support: Social support (or having someone to count on in times of trouble) is the national average of the binary responses (either 0 or 1) to the GWP question “If you were in trouble, do you have relatives or friends you can count on to help you whenever you need them, or not?”
* Healthy life expectancy: Healthy life expectancies at birth are based on the data extracted from the World Health Organization’s (WHO) Global Health Observatory data repository (last updated: 2020-09-28). The data at the source are available for the years 2000, 2005, 2010, 2015 and 2016. To match this report’s sample period (2005-2020), interpolation and extrapolation are used.
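
The text above says only that interpolation and extrapolation are used to fill the years between and beyond the WHO source years; it does not name the method. The sketch below assumes simple linear interpolation and linear extrapolation from the last segment, with made-up values for one hypothetical country, purely to illustrate the idea.

```python
import numpy as np

# Source years named in the text; the values are made up for illustration
known_years = np.array([2000, 2005, 2010, 2015, 2016])
known_hle = np.array([60.0, 61.0, 62.5, 64.0, 64.2])

# Linear interpolation for years that fall between known points
inner_years = np.arange(2005, 2017)
inner = np.interp(inner_years, known_years, known_hle)

# Linear extrapolation beyond 2016, continuing the slope of the last segment
# (an assumption; the source does not state how it extrapolates)
slope = (known_hle[-1] - known_hle[-2]) / (known_years[-1] - known_years[-2])
outer_years = np.arange(2017, 2021)
outer = known_hle[-1] + slope * (outer_years - known_years[-1])

print(dict(zip(inner_years.tolist(), inner.round(2).tolist())))
print(dict(zip(outer_years.tolist(), outer.round(2).tolist())))
```

One consequence of any such scheme is that values for recent years are model output and not direct measurements, so small year-to-year movements in this column should be read with care.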

(b) There is a fairly strong positive relationship between the ladder score and freedom to make life choices. Freedom to make life choices is the national average of responses to the GWP question “Are you satisfied or dissatisfied with your freedom to choose what you do with your life?”

(c) There is no clear relationship between happiness and generosity. Generosity is the residual of regressing the national average of responses to the GWP question “Have you donated money to a charity in the past month?” on GDP per capita. Contrary to the conventional belief that giving begets joy, there seems to be little or no correlation between ladder score and generosity. This suggests that self-reported happiness ratings may not be related to the act of giving.

(d) There is a negative relationship between happiness and corruption perception. Corruption perception is the national average of the survey responses to two questions in the GWP: “Is corruption widespread throughout the government or not” and “Is corruption widespread within businesses or not?” The overall perception is just the average of the two 0-or-1 responses. In case the perception of government corruption is missing, we use the perception of business corruption as the overall perception. The corruption perception at the national level is just the average response of the overall perception at the individual level. Based on the heatmap, it appears that the higher the perception of corruption, the lower the happiness rating. However, the negative relationship is not strong, which may mean that corruption perception is not a strong determinant of happiness.
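
The construction of the corruption measure described in (d) can be sketched in a few lines. The column names `gov` and `business` and all response values below are hypothetical; only the averaging rule and the fallback to the business answer come from the description above.

```python
import numpy as np
import pandas as pd

# Made-up individual responses: 1 = "widespread", 0 = "not", NaN = missing
resp = pd.DataFrame({
    'gov':      [1.0, 0.0, np.nan, 1.0],
    'business': [1.0, 1.0, 0.0,    0.0],
})

# Overall perception: average of the two answers per respondent
overall = resp[['gov', 'business']].mean(axis=1)

# Where the government answer is missing, use the business answer instead
overall = overall.where(resp['gov'].notna(), resp['business'])

# National-level measure: mean of the individual overall perceptions
national = overall.mean()
print(overall.tolist(), national)
```

The description does not cover the case where only the business answer is missing, so the toy data avoids it.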


In [None]:
plt.figure(figsize=(30,10))
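# correlate numeric columns only; 'Country name' is a string column
corr = sns.heatmap(data.select_dtypes(include = 'number').corr(), annot = True, cmap = 'Spectral')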
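
The heatmap shows every pairwise correlation; for the question of what makes up happiness, only the `Ladder score` column of that matrix matters. A minimal sketch of extracting and sorting it follows, on made-up data (`toy` and its values are hypothetical, and the coefficients it prints say nothing about the real dataset).

```python
import pandas as pd

# Made-up values for five hypothetical countries
toy = pd.DataFrame({
    'Country name': ['A', 'B', 'C', 'D', 'E'],
    'Ladder score': [4.0, 5.0, 6.0, 7.0, 8.0],
    'Logged GDP per capita': [8.0, 9.0, 9.5, 10.5, 11.0],
    'Perceptions of corruption': [0.9, 0.8, 0.8, 0.5, 0.3],
})

# Pearson correlation of each numeric column with the ladder score,
# sorted from most positive to most negative
corr_with_ladder = (
    toy.select_dtypes(include='number')
    .corr()['Ladder score']
    .drop('Ladder score')
    .sort_values(ascending=False)
)
print(corr_with_ladder)
```

Applied to `data`, the same expression would give a ranked list that can be compared directly with points (a) to (d). As with the heatmap, these are correlations and do not by themselves show that any factor causes happiness.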