<a href="https://colab.research.google.com/github/terrafirmatrekker/WHR2022EDA/blob/main/WorldHappinessReport22EDA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Analyzing The World Happiness Report of 2022**

2022 marks the tenth anniversary of the World Happiness
Report (WHR). The WHRs draw on a wide variety of data,
but the most important source has always been the
Gallup World Poll, because of its unique range and the
comparability of its annual global surveys.

Evaluations of quality of life from the Gallup World Poll
provide the basis for the annual happiness
rankings, which have always sparked widespread
interest in the secrets of life in the happiest
countries.

## Methodology

The Gallup World Poll asks respondents to evaluate their
current life as a whole using the mental image of a ladder, with **the best possible life for them as a 10 and the worst possible as a 0.**

Each respondent provides a numerical response on this scale, referred to as the Cantril ladder. Typically, around 1,000 responses are gathered annually for each country. Weights are used to construct population-representative national averages for each year in each country. The national happiness rankings are based on a three-year average, which increases the
sample size and yields more precise estimates.
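
The weighting and three-year pooling described above can be illustrated with a small sketch. The respondent records, the survey weights, and the column names (`year`, `ladder`, `weight`) are invented for illustration and are not taken from the Gallup data; the report's actual estimation procedure may differ in detail.

```python
import pandas as pd

# Hypothetical respondent-level records for one country (invented values).
responses = pd.DataFrame({
    "year":   [2019, 2019, 2020, 2020, 2021, 2021],
    "ladder": [7, 5, 6, 8, 4, 6],                # Cantril ladder answers, 0-10
    "weight": [1.0, 3.0, 2.0, 2.0, 1.0, 1.0],    # survey weights
})

def weighted_mean(group: pd.DataFrame) -> float:
    """Weighted average of the ladder scores in one group of respondents."""
    total = (group["ladder"] * group["weight"]).sum()
    return float(total / group["weight"].sum())

# Weighted national average for each survey year.
yearly = {int(year): weighted_mean(g) for year, g in responses.groupby("year")}

# Pooling all three years uses every response, i.e. a larger sample.
pooled = weighted_mean(responses)

print(yearly)
print(pooled)
```

Respondents with larger weights pull the average toward their answers, and pooling three years gives each country's estimate roughly three times as many responses as a single year would.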

## Data

This report examines the most recent [dataset](https://happiness-report.s3.amazonaws.com/2022/Appendix_2_Data_for_Figure_2.1.xls), which has been converted to CSV and stored as a Gist [here](https://gist.github.com/terrafirmatrekker/474f3acc5b44322bc54fcc49870dcfd1).

## Loading and Cleaning Data

In [4]:
# Importing Some of the Initial Libraries Needed
import pandas as pd
from urllib.request import urlretrieve

In [14]:
url = 'https://gist.githubusercontent.com/terrafirmatrekker/474f3acc5b44322bc54fcc49870dcfd1/raw/14dafbd7faf09e0b1d4a53646b830cf79b3c510b/WHR2022.csv'
urlretrieve(url, 'WHR2022.csv')
data = pd.read_csv('WHR2022.csv')
# Get the number of rows and columns
print(data.shape)
# Show the first 6 rows (the six top-ranked countries)
data.head(6)

(147, 24)


Unnamed: 0,RANK,Country,Happiness score,Whisker-high,Whisker-low,Dystopia (1.83) + residual,Explained by: GDP per capita,Explained by: Social support,Explained by: Healthy life expectancy,Explained by: Freedom to make life choices,...,Unnamed: 14,Unnamed: 15,Unnamed: 16,Unnamed: 17,Unnamed: 18,Unnamed: 19,Unnamed: 20,Unnamed: 21,Unnamed: 22,Unnamed: 23
0,1,Finland,7.821,7.886,7.756,2.518,1.892,1.258,0.775,0.736,...,,,,,,,,,,
1,2,Denmark,7.636,7.71,7.563,2.226,1.953,1.243,0.777,0.719,...,,,,,,,,,,
2,3,Iceland,7.557,7.651,7.464,2.32,1.936,1.32,0.803,0.718,...,,,,,,,,,,
3,4,Switzerland,7.512,7.586,7.437,2.153,2.026,1.226,0.822,0.677,...,,,,,,,,,,
4,5,Netherlands,7.415,7.471,7.359,2.137,1.945,1.206,0.787,0.651,...,,,,,,,,,,
5,6,Luxembourg*,7.404,7.501,7.307,2.042,2.209,1.155,0.79,0.7,...,,,,,,,,,,


As we can see, there are 24 columns, but some of them appear to be empty. Let's use the *pandas.DataFrame.info* method to get more information about the data types and non-null counts.


In [16]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 147 entries, 0 to 146
Data columns (total 24 columns):
 #   Column                                      Non-Null Count  Dtype  
---  ------                                      --------------  -----  
 0   RANK                                        147 non-null    int64  
 1   Country                                     147 non-null    object 
 2   Happiness score                             146 non-null    float64
 3   Whisker-high                                146 non-null    float64
 4   Whisker-low                                 146 non-null    float64
 5   Dystopia (1.83) + residual                  146 non-null    float64
 6   Explained by: GDP per capita                146 non-null    float64
 7   Explained by: Social support                146 non-null    float64
 8   Explained by: Healthy life expectancy       146 non-null    float64
 9   Explained by: Freedom to make life choices  146 non-null    float64
 10  Explained by: 
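
The non-null counts reported by `info` can also be checked programmatically. The following is a minimal sketch on a small stand-in frame (the column names and values are invented, not the WHR data) that lists the columns with no values at all and the rows that are missing a value elsewhere.

```python
import numpy as np
import pandas as pd

# Stand-in frame: one fully empty column and one row with a missing score.
toy = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "Happiness score": [7.1, 6.4, np.nan],
    "Unnamed: 3": [np.nan, np.nan, np.nan],
})

# Columns in which every value is missing.
empty_columns = [col for col in toy.columns if toy[col].isna().all()]

# Rows missing a value in any column that is not entirely empty.
partly_filled = toy.drop(columns=empty_columns)
incomplete_rows = partly_filled[partly_filled.isna().any(axis=1)]

print(empty_columns)
print(incomplete_rows)
```

Run against `data`, the same two checks would presumably show which columns are empty and which row accounts for the 146 non-null counts in a 147-row frame.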

So it seems that 12 columns contain no values at all, and we can remove them before our analysis. Let's use *df.dropna(axis=1, how='all')* to remove these all-NaN columns and store the result in a new dataframe.

In [27]:
clean_data = data.dropna(axis=1, how='all')
clean_data.head()

Unnamed: 0,RANK,Country,Happiness score,Whisker-high,Whisker-low,Dystopia (1.83) + residual,Explained by: GDP per capita,Explained by: Social support,Explained by: Healthy life expectancy,Explained by: Freedom to make life choices,Explained by: Generosity,Explained by: Perceptions of corruption
0,1,Finland,7.821,7.886,7.756,2.518,1.892,1.258,0.775,0.736,0.109,0.534
1,2,Denmark,7.636,7.71,7.563,2.226,1.953,1.243,0.777,0.719,0.188,0.532
2,3,Iceland,7.557,7.651,7.464,2.32,1.936,1.32,0.803,0.718,0.27,0.191
3,4,Switzerland,7.512,7.586,7.437,2.153,2.026,1.226,0.822,0.677,0.147,0.461
4,5,Netherlands,7.415,7.471,7.359,2.137,1.945,1.206,0.787,0.651,0.271,0.419
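
The choice of `how='all'` in the cell above matters, because `how='any'` would also discard columns that are only partly missing. A minimal sketch of the difference, using an invented stand-in frame rather than the WHR data:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "Happiness score": [7.1, 6.4, np.nan],    # one missing value
    "Unnamed: 3": [np.nan, np.nan, np.nan],   # entirely empty
})

# how='all' drops a column only when every value in it is NaN.
dropped_all = toy.dropna(axis=1, how="all")

# how='any' drops a column as soon as it contains a single NaN.
dropped_any = toy.dropna(axis=1, how="any")

print(list(dropped_all.columns))
print(list(dropped_any.columns))
```

With `how='all'` the score column survives despite its single gap, which is the behavior wanted here, since the score columns in `data` each have one missing entry.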


Our data has two more columns that we can drop: "Whisker-high" and "Whisker-low" describe the whiskers of a plot generated from the original dataset and are not needed for this analysis. Let's drop these columns with *DataFrame.drop*.

In [30]:
clean_data = clean_data.drop(labels=['Whisker-high', 'Whisker-low'], axis=1)
clean_data.head()

Unnamed: 0,RANK,Country,Happiness score,Dystopia (1.83) + residual,Explained by: GDP per capita,Explained by: Social support,Explained by: Healthy life expectancy,Explained by: Freedom to make life choices,Explained by: Generosity,Explained by: Perceptions of corruption
0,1,Finland,7.821,2.518,1.892,1.258,0.775,0.736,0.109,0.534
1,2,Denmark,7.636,2.226,1.953,1.243,0.777,0.719,0.188,0.532
2,3,Iceland,7.557,2.32,1.936,1.32,0.803,0.718,0.27,0.191
3,4,Switzerland,7.512,2.153,2.026,1.226,0.822,0.677,0.147,0.461
4,5,Netherlands,7.415,2.137,1.945,1.206,0.787,0.651,0.271,0.419
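
The two cleaning steps can be collected into one reusable function. This is only a sketch: `clean_whr` is a hypothetical helper name, and the stand-in frame below uses invented values rather than the WHR data.

```python
import numpy as np
import pandas as pd

def clean_whr(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: drop all-NaN columns, then the whisker columns."""
    out = df.dropna(axis=1, how="all")
    # errors='ignore' keeps the call from failing if a column is already gone.
    return out.drop(columns=["Whisker-high", "Whisker-low"], errors="ignore")

# Stand-in frame with invented values, not the WHR data.
toy = pd.DataFrame({
    "Country": ["A", "B"],
    "Happiness score": [7.1, 6.4],
    "Whisker-high": [7.2, 6.5],
    "Whisker-low": [7.0, 6.3],
    "Unnamed: 14": [np.nan, np.nan],
})

cleaned = clean_whr(toy)
print(list(cleaned.columns))
```

Both `dropna` and `drop` return new frames, so the input is left unchanged, and thanks to `errors='ignore'` the function can be applied to an already cleaned frame without raising an error.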
