# Objective
### How is the Happiness score influenced by other parameters in 2021?
In other words, this notebook measures how strong the correlation is:
<ol>
    <li>Between the happiness score and the other features.</li>
    <li>Between each pair of features.</li>
</ol>
Where there is a correlation, how strong is it, and how does one feature move with the other?
<br>
As a reminder, for the Pearson correlation coefficient: corr = 1 means that when one variable increases, the other increases in a perfectly linear way; corr = 0 means there is no linear relationship between the two; corr = -1 means that when one increases, the other decreases in a perfectly linear way.
<br>
<br>
The dataset used is the World Happiness Report dataset for 2021.
<br>
I used Python for the analysis, but the same analysis could be done in Excel since the dataset is small.
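
To make the three reference values of the correlation coefficient concrete, here is a minimal sketch on made-up arrays (the data and the helper name `pearson` are illustrative assumptions, not part of the happiness dataset):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

up = 2 * x + 1                                        # rises exactly linearly with x
down = -3 * x + 10                                    # falls exactly linearly with x
no_trend = np.array([1.0, -1.0, 0.0, -1.0, 1.0])      # varies, but with no linear trend in x

def pearson(a, b):
    """Pearson correlation coefficient between two 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])

r_up = pearson(x, up)          # perfectly positive linear relationship
r_down = pearson(x, down)      # perfectly negative linear relationship
r_none = pearson(x, no_trend)  # no linear relationship

print(round(r_up, 3), round(r_down, 3), round(r_none, 3))
```

Note that a truly constant second variable would not give 0: its correlation is undefined, because its standard deviation is zero.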

### Set up the environment

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
import plotly.express as px

### Inspect the data set
<i>Ladder score</i> is our <i>Happiness score</i>. <br>
<i>Ladder score in dystopia</i> is the score of Dystopia, a hypothetical country that has values equal to the world’s lowest national averages for each of the six factors.
<br> 
<br>
The dataset has 20 columns and 149 rows. Each row represents one unique country.
   

In [None]:
df = pd.read_csv('../input/world-happiness-report-2021/world-happiness-report-2021.csv')
df.head(2)

In [None]:
df.info()

In [None]:
df.shape #get the shape of the dataset (row, column)

In [None]:
df.size

### Verify the dataset
<ul>
    <li>Are there empty cells?</li>
    <li>Are there cells with 0 as the value?</li>
    <li>Are there duplicates?</li>
    <li>And more</li>
</ul>
No need to check for null values separately, since df.info() already shows there are none (149 non-null values for 149 rows).
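
If an explicit check is preferred over reading the summary, a minimal sketch on a hypothetical toy frame (the values below are invented for illustration) could look like this:

```python
import numpy as np
import pandas as pd

# Toy data: one duplicated country name and one missing score
toy = pd.DataFrame({
    'Country name': ['A', 'B', 'B', 'C'],
    'Ladder score': [7.1, 5.4, 5.4, np.nan],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()

# Number of rows whose country name already appeared earlier in the frame
n_duplicated_countries = int(toy.duplicated(subset='Country name').sum())

print(missing_per_column)
print(n_duplicated_countries)
```

On the real dataset, both counts would be expected to be zero if every country appears once and no cell is empty.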

In [None]:
df['Country name'].nunique()  
#df.duplicated(subset = 'Country name') #it works as well

In [None]:
(df == 0).sum(axis = 0) #count number of 0s in each column

##### Let's see if the <i>Explained by</i> columns are useful
In other words: check the correlation between each <i>Explained by</i> column and its twin (the same feature without the <i>Explained by</i> prefix).
<br>
The correlation is 1 everywhere, except for <i>Perceptions of corruption</i>, which has a correlation of -1. Either way, each pair is perfectly correlated in absolute value, so the <i>Explained by</i> columns add no information for a correlation study.
So we can drop the <i>Explained by</i> columns.
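
The twin comparison can also be done pair by pair instead of reading it off a heat map. The sketch below uses a hypothetical toy frame with invented values. It assumes each twin is named with the `Explained by: ` prefix, which does not hold for the GDP column in the real dataset (`Logged GDP per capita` vs `Explained by: Log GDP per capita`), so that pair would need to be mapped by hand.

```python
import pandas as pd

# Toy data: each "Explained by" column is an exact linear function of its twin
toy = pd.DataFrame({
    'Social support': [0.9, 0.8, 0.6, 0.5],
    'Explained by: Social support': [1.1, 0.9, 0.5, 0.3],
    'Perceptions of corruption': [0.2, 0.4, 0.7, 0.9],
    'Explained by: Perceptions of corruption': [0.45, 0.35, 0.2, 0.1],
})

prefix = 'Explained by: '

# Correlate every raw feature with the column carrying the same name plus the prefix
twin_corr = {
    col: toy[col].corr(toy[prefix + col])
    for col in toy.columns
    if not col.startswith(prefix)
}

print(twin_corr)
```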

In [None]:
#create a new data frame just to check correlation between the explained by and its twin
df_check = df[['Logged GDP per capita', 'Social support', 'Healthy life expectancy', 
              'Freedom to make life choices', 'Generosity', 'Perceptions of corruption',
             'Explained by: Log GDP per capita', 'Explained by: Social support',
             'Explained by: Healthy life expectancy', 'Explained by: Freedom to make life choices',
             'Explained by: Generosity', 'Explained by: Perceptions of corruption']]
df_check.head(2)

In [None]:
corr_ex_by = df_check.corr().iloc[0:6,6:] #without the .iloc() we would have correlation for each attributes 2 times
sns.heatmap(corr_ex_by, annot = True, cmap = 'flare').set_title('Heat map of correlation of "explained by"')

### Create dataset that will be used for studying the correlation
Drop all columns except the six features:
<ol>
    <li>Logged GDP per capita</li>
    <li>Social support</li>
    <li>Healthy life expectancy</li>
    <li>Freedom to make life choices</li>
    <li>Generosity</li>
    <li>Perceptions of corruption</li>
</ol>
Additionally, we keep <i>Country name</i>, <i>Regional indicator</i> and <i>Ladder score</i>.<br>
<i>Ladder score in dystopia</i> is interesting, but not for the question we want to answer in this study, so let's drop it from our new dataset.

In [None]:
df_study = df[['Country name', 'Regional indicator', 'Ladder score', 'Logged GDP per capita',
              'Social support', 'Healthy life expectancy', 'Freedom to make life choices',
              'Generosity', 'Perceptions of corruption']]
df_study.head(2)

In [None]:
df_study.info()

In [None]:
df_study.shape

### Analyzing the dataset
For better readability, I will put the visualization next to each analysis whenever one is needed.

In [None]:
df_study.describe() #statistical info about the dataset

Let's see which country has the MAX and MIN for each factor.

### Analyzing the ladder score

The country with the highest ladder score is Finland (7.842), while the one with the lowest is Afghanistan (2.523).
Western Europe is the most represented region at the top of Logged GDP per capita and Social support, which appear to be the two factors with the biggest impact on the ladder score. This leads to Western countries dominating the list of the happiest countries.
<br> 
<br>
However, the North America and ANZ (NA&ANZ) region is the happiest region, just ahead of the Western Europe (WE) region. This can be explained by the fact that only 4 countries are represented in the NA&ANZ region, while 21 countries are represented in the WE region.
NA&ANZ comprises Canada, the USA, New Zealand and Australia, which are among the countries with the highest GDP per capita, whereas the WE region also includes countries with lower GDP per capita, social support and life expectancy.

Get the country with the highest value and the country with the lowest value for each feature.
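
As a compact illustration of the idea, here is a sketch on a hypothetical toy frame using `idxmax` and `idxmin` (the countries and values are invented; the notebook's own implementation follows in the next cell):

```python
import pandas as pd

toy = pd.DataFrame({
    'Country name': ['A', 'B', 'C'],
    'Ladder score': [7.8, 5.0, 2.5],
    'Generosity': [-0.1, 0.3, 0.0],
})

features = ['Ladder score', 'Generosity']

# Index by country so that idxmax/idxmin return country names directly
indexed = toy.set_index('Country name')[features]

extremes = pd.DataFrame({
    'Country MAX': indexed.idxmax(),
    'Value MAX': indexed.max(),
    'Country MIN': indexed.idxmin(),
    'Value MIN': indexed.min(),
})

print(extremes)
```

With ties, `idxmax` and `idxmin` return the first matching row, which is the same tie-breaking as taking `.values[0]` from a boolean mask.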

In [None]:
def summary(df_study):
    country_max = {}
    region_max = {}
    value_max = {}
    country_min = {}
    region_min = {}
    value_min = {}
    
    for col in df_study:
        #build masks for the rows of df_study holding the max and the min of the current column
        is_max = df_study[col] == df_study[col].max()
        is_min = df_study[col] == df_study[col].min()
        #read everything from df_study (not the global df) so the function works on any frame passed in
        country_max[col] = df_study.loc[is_max, 'Country name'].values[0]
        region_max[col] = df_study.loc[is_max, 'Regional indicator'].values[0]
        value_max[col] = df_study.loc[is_max, col].values[0]
        country_min[col] = df_study.loc[is_min, 'Country name'].values[0]
        region_min[col] = df_study.loc[is_min, 'Regional indicator'].values[0]
        value_min[col] = df_study.loc[is_min, col].values[0]
    
    table = pd.DataFrame([country_max, region_max, value_max, country_min, region_min, value_min], 
                         index = ['Country name MAX', 'Region MAX', 'Value MAX',
                                  'Country name MIN', 'Region MIN', 'Value MIN'])
    return table

In [None]:
#drop the columns country name and regional indicator because they are useless for the output
#store the result under a new name so the function summary() is not overwritten
summary_table = summary(df_study).drop(['Country name', 'Regional indicator'], axis = 1)
summary_table.T #transpose the table

Visualize the ladder score per country with a bar graph.

In [None]:
plt.figure(figsize = (25,10))
scores = df_study['Ladder score']
countries = df_study['Country name']
plt.bar(countries, scores)
plt.title('Ladder score per country')
plt.xlabel('Countries')
plt.ylabel('Ladder score')
plt.xticks(rotation = 90) #to rotate the position of the x axis labels
plt.show()

The countries are grouped by region. Let's look at the distribution of countries across regions.
<br>
Some regions have more countries than others, which could bias the correlation analysis.
<br>
To avoid this bias, we will analyze the correlation worldwide and per region separately, then compare.

The correlation between the ladder score and the 6 factors will be compared between the regions and worldwide.
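
To see why the pooled and per-region views can disagree, here is a minimal sketch on a hypothetical toy frame (the two regions and all values are invented): within each region the two variables move in opposite directions, yet the pooled correlation is strongly positive because one region sits higher on both axes.

```python
import pandas as pd

toy = pd.DataFrame({
    'region': ['R1'] * 3 + ['R2'] * 3,
    'x': [1, 2, 3, 7, 8, 9],
    'y': [3, 2, 1, 9, 8, 7],
})

# Correlation over all rows, ignoring the region
pooled_r = toy['x'].corr(toy['y'])

# Correlation computed separately inside each region
per_region_r = {
    name: group['x'].corr(group['y'])
    for name, group in toy.groupby('region')
}

print(round(pooled_r, 3))
print(per_region_r)
```

This is an extreme, constructed case, but it shows why a worldwide coefficient should not be assumed to hold inside every region.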

In [None]:
#create a new row "worldwide" in the dataframe "df_study"
world_row = {'Country name': 'World',
        'Regional indicator': 'World',
        'Ladder score': df_study['Ladder score'].mean(),
        'Logged GDP per capita': df_study['Logged GDP per capita'].mean(),
        'Social support': df_study['Social support'].mean(),
        'Healthy life expectancy': df_study['Healthy life expectancy'].mean(),
        'Freedom to make life choices': df_study['Freedom to make life choices'].mean(),
        'Generosity': df_study['Generosity'].mean(),
        'Perceptions of corruption': df_study['Perceptions of corruption'].mean()
        }
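#add the new row to the dataframe df_study
#(DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
df_study = pd.concat([df_study, pd.DataFrame([world_row])], ignore_index = True)
df_study.tail(2) #to check that the new row was added correctly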

Group by region (including the new row <i>World</i>), count the countries per region and compute the average ladder score.
<br>
Thanks to this pivot table, we can see that <i>North America and ANZ</i> is the happiest region. <b>BUT</b> there are only 4 countries in this region.
<br>
This could lead to an unfair conclusion regarding the correlation of factors.
<br>
<br>
Next, the correlation of each factor will be analyzed:
<ul>
    <li>Worldwide</li>
    <li>Per region</li>
</ul>

In [None]:
#group by region and give values for countries and ladder scores
mean_ladder_score = df_study.groupby('Regional indicator').agg({'Country name': ['count'],
                                                                'Ladder score': ['mean']})
mean_ladder_score.columns = ['Country name count', 'Ladder score mean']
#rank ladder score in descending order
mean_ladder_score = mean_ladder_score.sort_values('Ladder score mean', ascending = False)
mean_ladder_score

Plot a pie chart of the distribution of each country per region

In [None]:
#plot a pie chart of the distribution of each country per region
fig,ax = plt.subplots()
x = mean_ladder_score['Country name count']
labels = mean_ladder_score.index
explode = [0.1] * len(x) #create spaces between segments of the pie chart, one offset per region
ax.pie(x, labels = labels, radius = 2, explode = explode)

#create a white circle at the center of the pie to create a donut chart
my_circle = plt.Circle( (0,0), 0.7, color = 'white')
p = plt.gcf()
p.gca().add_artist(my_circle)

plt.show()

Plot a bar graph of the average ladder score per region and in the world

In [None]:
plt.figure(figsize = (15, 8))
x = mean_ladder_score['Ladder score mean'].round(decimals = 3)
y = mean_ladder_score.index
plt.barh(y, x)
plt.title('Ladder score per region')
plt.ylabel('Region')
plt.xlabel('Ladder score')
plt.xticks(rotation = 0)

#show the mean ladder score next to each bar
for index, value in enumerate(x):
    plt.text(value, index, str(value))

plt.show()

### Correlations worldwide & per region

Logged GDP per capita, Social support and Healthy life expectancy are the factors with correlation values closest to 1, which means these factors are the most strongly associated with the Ladder score of each country.
<br>
Generosity, in contrast, has a correlation close to 0, which means it has almost no linear relationship with the Ladder score.

Let's first look at the correlations with all the countries together.

In [None]:
#first we need to drop the row "World" that we just added, so that it is not included in the correlation
#filtering on the name is safer than relying on the row index 149
df_study = df_study[df_study['Country name'] != 'World']

In [None]:
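#get the correlation matrix of the numeric columns ("ladder score" and the 6 factors)
#text columns are excluded first, since recent pandas versions raise an error on them
corr_world = df_study.select_dtypes('number').corr()
corr_world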

Thanks to the heat map, we can see that:
<ol>
    <li>Logged GDP per capita</li>
    <li>Social support</li>
    <li>Healthy life expectancy</li>
</ol>
have the highest correlations (close to 1), which means they have the greatest impact on the happiness score of a country.
<br>
<br>
The <i>Logged GDP per capita</i> seems to have the strongest relationship with the <i>Ladder score</i> and with each of the other factors.

In [None]:
corr_world_2 = df_study.select_dtypes('number').corr() #get the correlation between each numeric attribute
#get a heat map of all correlations
#in other word -> the correlation between each attribute
sns.heatmap(corr_world_2, annot = True, cmap = 'flare').set_title('Heat map of correlation of each attribute')

In [None]:
#see how correlation "ladder score" vs "logged GDP per capita" look like
#use the seaborn library for this visualization
sns.scatterplot(data = df_study, x = 'Logged GDP per capita', y = 'Ladder score', 
                hue = "Regional indicator", sizes = (30, 400)).legend(loc = 'right', bbox_to_anchor = (1.75, 0.5))
plt.title('Scatter plot between Ladder Score and the Logged GDP per capita')

We can see that the correlation might differ from region to region.
<br>
If a regression line were plotted for each region, its slope would probably differ as well.
<br>
To check whether this assumption holds, let's look at the correlation per region.
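
Slope and correlation answer different questions: the slope says how steep the fitted line is, while the correlation says how tightly the points follow it. A minimal sketch on a hypothetical toy frame (regions, column names and values are invented) fits one line per region with `np.polyfit`:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'region': ['R1', 'R1', 'R1', 'R2', 'R2', 'R2'],
    'gdp': [8.0, 9.0, 10.0, 8.0, 9.0, 10.0],
    'score': [4.0, 5.0, 6.0, 5.0, 5.25, 5.5],
})

slopes = {}
for region, group in toy.groupby('region'):
    # Degree-1 least-squares fit: score = slope * gdp + intercept
    slope, intercept = np.polyfit(group['gdp'], group['score'], deg=1)
    slopes[region] = float(slope)

print(slopes)
```

In this toy data both regions are perfectly correlated, yet the slope in `R1` is four times the slope in `R2`, so the two measures are worth reporting separately.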

In [None]:
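#drop the text column first, since recent pandas versions raise an error when correlating non-numeric columns
corr_region = df_study.drop(columns = 'Country name').groupby('Regional indicator').corr().unstack().iloc[:,1:7]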
corr_region

The scatter plot shows that the correlation between Logged GDP per capita and the Ladder score differs per region. In some regions the impact is greater than in others.
<br>
Surprisingly, the WE region has a correlation close to 0 between Healthy life expectancy and the Ladder score, which means that within this region the factor has an almost neutral impact on the ladder score.
<br> The perception of corruption has a relatively low impact on the average happiness worldwide. BUT, for the NA&ANZ and WE regions, the correlation is extremely close to -1, which means that the lower the perceived corruption in a country, the higher its happiness score.

### Analyzing the correlation more deeply

In the heat map of correlations, the highest correlation is 0.86, between <i>Logged GDP per Capita</i> and <i>Healthy life expectancy</i>.
<br>
Let's plot this first.

In [None]:
#Plot data and a linear regression model fit
figsize = (10, 5)
fig, ax = plt.subplots(figsize = figsize)
sns.regplot(x = 'Logged GDP per capita', y = 'Healthy life expectancy', data = df_study, ax = ax)
plt.title('Regression plot between Healthy life expectancy and the Logged GDP per capita')

In [None]:
sns.jointplot(data = df_study, x = 'Logged GDP per capita', y = 'Healthy life expectancy')

Now, we're going to use a distribution plot for both <i>healthy life expectancy</i> and <i>logged GDP per capita.</i>
<br>
<br>
On each distribution plot, draw a vertical line for the country with the highest ladder score and one for the country with the lowest.
<br>
According to the summary table, these are Finland (highest) and Afghanistan (lowest).

In [None]:
finland = df_study[(df_study['Country name'] == 'Finland')].iloc[0]
afghanistan = df_study[(df_study['Country name'] == 'Afghanistan')].iloc[0]

In [None]:
sns.displot(df_study['Logged GDP per capita'], kde = True)
plt.axvline(finland['Logged GDP per capita'], color = 'green', label = 'Finland')
plt.axvline(afghanistan['Logged GDP per capita'], color = 'red', label = 'Afghanistan')
plt.legend()
plt.title('Distribution of the Logged GDP per capita')
#to see the distribution per region, use the line below
#sns.displot(df_study, x = 'Logged GDP per capita', hue = 'Regional indicator', multiple = 'stack')

In [None]:
sns.displot(df_study['Healthy life expectancy'], kde = True)
plt.axvline(finland['Healthy life expectancy'], color = 'green', label = 'Finland')
plt.axvline(afghanistan['Healthy life expectancy'], color = 'red', label = 'Afghanistan')
plt.legend()
plt.title('Distribution of the healthy life expectancy')
#to see the distribution per region, use the line below
#sns.displot(df_study, x = 'Healthy life expectancy', hue = 'Regional indicator', multiple = 'stack')

The lowest correlation, in contrast, is -0.42, between <i>Perceptions of corruption</i> and the <i>Ladder score</i>.
Let's plot the distribution and the regression between these two.

In [None]:
#Plot data and a linear regression model fit
figsize = (10, 5)
fig, ax = plt.subplots(figsize = figsize)
sns.regplot(x = 'Ladder score', y = 'Perceptions of corruption', data = df_study, ax = ax)
plt.title('Regression plot between Ladder Score and the Perception of corruption')

In [None]:
sns.displot(df_study['Perceptions of corruption'], kde = True)
plt.axvline(finland['Perceptions of corruption'], color = 'green', label = 'Finland')
plt.axvline(afghanistan['Perceptions of corruption'], color = 'red', label = 'Afghanistan')
plt.legend()
plt.title('Distribution of the perception of corruption')
#to see the distribution per region, use the line below
#sns.displot(df_study, x = 'Perceptions of corruption', hue = 'Regional indicator', multiple = 'stack')

The correlation of <i>Perceptions of corruption</i> with the <i>Ladder score</i> is -0.42, which means that when one goes up, the other tends to go down.
<br>
Here, it means that the lower the <i>Perceptions of corruption</i>, the happier a country tends to be.
<br>
<br>
Even though the correlation is only -0.42, it still has a great impact for the happiest and least happy countries.
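
One common way to read the size of a correlation is to square it: for a simple linear fit with a single explanatory variable, r² is the share of the variance in one variable that is accounted for by the other. A minimal sketch of that arithmetic, using the -0.42 reported above (the variable names are illustrative):

```python
# Correlation between Perceptions of corruption and Ladder score, as reported above
r = -0.42

# Coefficient of determination for a one-variable linear fit
r_squared = r ** 2

# Same quantity expressed as a percentage, rounded to one decimal
share_percent = round(r_squared * 100, 1)

print(r_squared, share_percent)
```

Read this way, a correlation of -0.42 corresponds to a bit under a fifth of the variance, which is moderate rather than strong when taken over all countries.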

### Conclusion

Healthy life expectancy, Logged GDP per capita and Social support have the highest impact globally. Those features remain important for a country's happiness score.
<br>
<br>
While checking the correlation per region, we surprisingly find that for Western Europe, Healthy life expectancy does not have a strong impact compared to the impact of this feature globally.
<br>
<br>
The perception of corruption has a strong impact on the happiest and least happy countries, with a negative correlation.