# World Happiness Report 2021

## Background

- The first eight reports were produced by the founding trio of co-editors assembled in Thimphu in July 2011 pursuant to the Bhutanese Resolution passed by the General Assembly in June 2011 that invited national governments to “give more importance to happiness and well-being in determining how to achieve and measure social and economic development.” 

- annual publication by the Sustainable Development Solutions Network (SDSN) and The Center for Sustainable Development at Columbia University

- data collected in partnership with the Gallup World Poll team
- data sources used to develop the report include:
  - Gallup World Poll
  - World Risk Poll (Lloyd's Register Foundation)
  - Covid Data Hub (Imperial College London / YouGov)

## Methodology

- accessed the World Happiness Report (https://worldhappiness.report/) to review methodology and types of data included
- downloaded dataset from Kaggle
- The Happiness Index includes questions on perceptions of corruption, social support, healthy life expectancy, generosity, and freedom, which prompted my follow-up questions:
  - for each country, how has happiness changed from year to year?
  - how do public perceptions of corruption compare with expert opinions on global corruption (via the Transparency International survey)?
  - how do these elements compare between countries managing higher numbers of COVID deaths and COVID cases?

- if data is available, this analysis could be supplemented with information about levels of restriction in each country

## Datasets:
- World Happiness Report: https://www.kaggle.com/ajaypalsinghlo/world-happiness-report-2021
- Transparency International Corruption Index: https://www.transparency.org/en/cpi/2020/index/nzl
- Novel Coronavirus (COVID-19) Cases and Deaths (JHU CSSE): https://github.com/CSSEGISandData/COVID-19 

## Methods:

I extracted cumulative case and death data as of 31 December 2020 from Novel Coronavirus (COVID-19) Cases and Deaths (JHU CSSE) and the 2020 Corruption Perceptions Index (CPI) score from the Transparency International Corruption Index, and added them to a new file containing the World Happiness Report 2021 indicators.

I mapped CPI scores to the WHR Perceptions of corruption scale as `(100 - cpi_score) / 100`, so that a higher value means more corruption on both measures.

This is my rough mapping, not an evidence-based method to compare these scores.
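
A minimal sketch of this rescaling, reading the formula as `(100 - cpi_score) / 100`. It assumes the CPI runs from 0 (highly corrupt) to 100 (very clean) and that the WHR score runs from 0 to 1 with higher values meaning more perceived corruption. The function name `cpi_to_whr_scale` is hypothetical, and the inputs are illustrative rather than taken from the dataset.

```python
def cpi_to_whr_scale(cpi_score: float) -> float:
    """Invert and rescale a CPI score (0-100, higher = cleaner) to a 0-1
    scale where higher = more corrupt, matching the direction of the WHR
    'Perceptions of corruption' column."""
    if not 0 <= cpi_score <= 100:
        raise ValueError("CPI scores are expected to lie between 0 and 100")
    return (100 - cpi_score) / 100


# Illustrative inputs only, not real country scores.
for score in (0, 50, 100):
    print(score, cpi_to_whr_scale(score))
```

The parentheses matter: without them, `100 - cpi_score / 100` divides first and returns values close to 100 instead of values between 0 and 1.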

I mapped the country names across the different datasets. In the instances where country names differed, I mapped the following:

| Base                    | Alternative Spellings               |
|-------------------------|-------------------------------------|
| Czech Republic          | Czechia                             |
| United States           | US                                  |
| United States           | United States of America            |
| Taiwan                  | Taiwan*                             |
| Taiwan                  | Taiwan Province of China            |
| South Korea             | Korea, South                        |
| North Korea             | Korea, North                        |
| North Cyprus            | Turkish Republic of Northern Cyprus |
| North Cyprus            | Northern Cyprus                     |
| Ivory Coast             | Cote d'Ivoire                       |
| Hong Kong               | Hong Kong S.A.R. of China           |
| Palestinian Territories | West Bank and Gaza                  |
| Myanmar                 | Burma                               |
| Eswatini                | Swaziland                           |
| Congo (Brazzaville)     | Republic of Congo                   |
| Congo (Kinshasa)        | Democratic Republic of Congo        |
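
One way to apply such a mapping in pandas is a dictionary passed to `Series.replace`. This is only a sketch: `NAME_MAP` holds a subset of the table above, and the toy frame `df_example` with its `country` column is an assumption standing in for any of the source datasets.

```python
import pandas as pd

# Alternative spelling -> base name, copied from the table above (subset).
NAME_MAP = {
    "Czechia": "Czech Republic",
    "US": "United States",
    "United States of America": "United States",
    "Taiwan*": "Taiwan",
    "Korea, South": "South Korea",
    "Cote d'Ivoire": "Ivory Coast",
    "West Bank and Gaza": "Palestinian Territories",
    "Burma": "Myanmar",
}

# Toy frame standing in for one of the source datasets.
df_example = pd.DataFrame({"country": ["US", "Czechia", "Burma", "France"]})

# Series.replace swaps exact matches and leaves all other names untouched.
df_example["country"] = df_example["country"].replace(NAME_MAP)
print(df_example["country"].tolist())
```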

I removed columns irrelevant to this analysis, including:
- Ladder score                           
- Standard error of ladder score            
- upperwhisker                                
- lowerwhisker
- Explained by: Log GDP per capita
- Explained by: Social support
- Explained by: Healthy life expectancy
- Explained by: Freedom to make life choices
- Explained by: Generosity
- Explained by: Perceptions of corruption
- Dystopia + residual  
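
A sketch of how the columns listed above could be dropped in one call. The toy frame `df_example` is an assumption that carries only a few of the real column names; `errors="ignore"` lets the same list be reused even when some columns are absent.

```python
import pandas as pd

COLUMNS_TO_DROP = [
    "Ladder score",
    "Standard error of ladder score",
    "upperwhisker",
    "lowerwhisker",
    "Explained by: Log GDP per capita",
    "Explained by: Social support",
    "Explained by: Healthy life expectancy",
    "Explained by: Freedom to make life choices",
    "Explained by: Generosity",
    "Explained by: Perceptions of corruption",
    "Dystopia + residual",
]

# Toy frame with placeholder values, not real report data.
df_example = pd.DataFrame({
    "Country name": ["A"],
    "Ladder score": [5.0],
    "upperwhisker": [5.1],
    "Social support": [0.9],
})

# Listed names that are missing from the frame are skipped silently.
df_trimmed = df_example.drop(columns=COLUMNS_TO_DROP, errors="ignore")
print(df_trimmed.columns.tolist())
```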

#### Data Notes:
- North Cyprus is only recognized by Turkey, and is thus excluded from the analysis.
- Turkmenistan is not included in the JHU CSSE data because it claims no COVID-19 cases or deaths and therefore does not report. In the combined dataset, I record it as 0 deaths and 0 cases so that this '0' reporting can be compared with perception of corruption scores.
- North Korea was not included in the World Happiness Report
- Palestinian Territories were not included in the Transparency International Corruption Index 
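
The Turkmenistan adjustment could be sketched as appending an explicit zero row to the per-country COVID table. The frame name `df_covid` and the column name `cases_2020` are assumptions for illustration; a deaths column would be handled the same way.

```python
import pandas as pd

# Toy stand-in for the per-country COVID table (assumed column names).
df_covid = pd.DataFrame({
    "country": ["Afghanistan", "Albania"],
    "cases_2020": [51526, 58316],
})

# Explicit zero row for a country that does not report to the source.
turkmenistan = pd.DataFrame({"country": ["Turkmenistan"], "cases_2020": [0]})

# ignore_index=True rebuilds a clean 0..n-1 index after the append.
df_covid = pd.concat([df_covid, turkmenistan], ignore_index=True)
print(df_covid)
```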

## Analysis

## Story

Import the pandas, matplotlib, and numpy libraries

In [1]:
import pandas as pd

In [2]:
import matplotlib.pyplot as plt

In [3]:
import numpy as np

Import datasets

In [4]:
df_wh21 = pd.read_csv('world-happiness-report-2021.csv')
df_wh21.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 149 entries, 0 to 148
Data columns (total 20 columns):
 #   Column                                      Non-Null Count  Dtype  
---  ------                                      --------------  -----  
 0   Country name                                149 non-null    object 
 1   Regional indicator                          149 non-null    object 
 2   Ladder score                                149 non-null    float64
 3   Standard error of ladder score              149 non-null    float64
 4   upperwhisker                                149 non-null    float64
 5   lowerwhisker                                149 non-null    float64
 6   Logged GDP per capita                       149 non-null    float64
 7   Social support                              149 non-null    float64
 8   Healthy life expectancy                     149 non-null    float64
 9   Freedom to make life choices                149 non-null    float64
 10  Generosity    

In [5]:
df_wh21.head(10)
df_wh21[df_wh21['Country name'].str.contains("congo", case = False)]

Unnamed: 0,Country name,Regional indicator,Ladder score,Standard error of ladder score,upperwhisker,lowerwhisker,Logged GDP per capita,Social support,Healthy life expectancy,Freedom to make life choices,Generosity,Perceptions of corruption,Ladder score in Dystopia,Explained by: Log GDP per capita,Explained by: Social support,Explained by: Healthy life expectancy,Explained by: Freedom to make life choices,Explained by: Generosity,Explained by: Perceptions of corruption,Dystopia + residual
82,Congo (Brazzaville),Sub-Saharan Africa,5.342,0.097,5.533,5.151,8.117,0.636,58.221,0.695,-0.068,0.745,2.43,0.518,0.392,0.307,0.381,0.144,0.124,3.476


In [6]:
df_cov19_cases = pd.read_csv("time_series_covid19_confirmed_global.csv")
df_cov19_cases.head()


Unnamed: 0,Province/State,Country/Region,Lat,Long,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,...,6/14/21,6/15/21,6/16/21,6/17/21,6/18/21,6/19/21,6/20/21,6/21/21,6/22/21,6/23/21
0,,Afghanistan,33.93911,67.709953,0,0,0,0,0,0,...,91458,93272,93288,96531,98734,98734,98734,103902,105749,107957
1,,Albania,41.1533,20.1683,0,0,0,0,0,0,...,132461,132469,132476,132481,132484,132488,132490,132490,132496,132497
2,,Algeria,28.0339,1.6596,0,0,0,0,0,0,...,133742,134115,134458,134840,135219,135586,135821,136294,136679,137049
3,,Andorra,42.5063,1.5218,0,0,0,0,0,0,...,13826,13828,13836,13839,13842,13842,13842,13864,13864,13873
4,,Angola,-11.2027,17.8739,0,0,0,0,0,0,...,36790,36921,37094,37289,37467,37604,37678,37748,37874,38002


Let's extract the cumulative cases per country on 31 Dec 2020

In [7]:
df_cov_cases = df_cov19_cases[['Province/State','Country/Region','12/31/20']]
df_cov_cases = df_cov_cases.rename(columns={
    'Country/Region': 'country',
    'Province/State': 'province',
    '12/31/20': 'cases_2020'
})
df_cov_cases.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 278 entries, 0 to 277
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   province    87 non-null     object
 1   country     278 non-null    object
 2   cases_2020  278 non-null    int64 
dtypes: int64(1), object(2)
memory usage: 6.6+ KB


278 countries? Strange - let's check if any countries are mentioned multiple times

In [8]:
df_cov_cases['country'].value_counts().loc[lambda x: x>1]

China             34
Canada            16
United Kingdom    12
France            12
Australia          8
Netherlands        5
Denmark            3
New Zealand        2
Name: country, dtype: int64

In [9]:
df_cov_cases[df_cov_cases['country'] == 'New Zealand']

Unnamed: 0,province,country,cases_2020
198,Cook Islands,New Zealand,0
199,,New Zealand,2162


In [10]:
# The Cook Islands are self-governing, so let's drop that row
index_names = df_cov_cases[(df_cov_cases['province'] == 'Cook Islands')].index

df_cov_cases.drop(index_names, inplace= True)

df_cov_cases[df_cov_cases['country'] == 'New Zealand']


Unnamed: 0,province,country,cases_2020
199,,New Zealand,2162


In [11]:
df_cov_cases[df_cov_cases['country'] == 'Denmark']

index_names = df_cov_cases[
    (df_cov_cases['province'] == 'Faroe Islands')| 
    (df_cov_cases['province'] == 'Greenland')
    ].index

df_cov_cases.drop(index_names, inplace= True)
df_cov_cases[df_cov_cases['country'] == 'Denmark']

Unnamed: 0,province,country,cases_2020
104,,Denmark,163479


In [12]:
df_cov_cases[df_cov_cases['country'] == 'Netherlands']

index_names = df_cov_cases[
    (df_cov_cases['province'] == 'Aruba')| 
    (df_cov_cases['province'] == 'Bonaire, Sint Eustatius and Saba')|
    (df_cov_cases['province'] == 'Curacao')|
    (df_cov_cases['province'] == 'Sint Maarten')
    ].index

df_cov_cases.drop(index_names, inplace= True)
df_cov_cases[df_cov_cases['country'] == 'Netherlands']

Unnamed: 0,province,country,cases_2020
197,,Netherlands,796981


In [13]:
df_cov_cases[df_cov_cases['country'] == 'France']
index_names = df_cov_cases[
    (df_cov_cases['province'] != np.na) & (df_cov_cases['country'] == 'France')
    ].index

df_cov_cases.drop(index_names, inplace= True)

df_cov_cases[df_cov_cases['country'] == 'France']

AttributeError: module 'numpy' has no attribute 'na'
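
The cell above fails because NumPy has no `np.na`. The missing-value constant is `np.nan`, but comparing against it with `!=` is true for every row, so it cannot serve as a filter either. A minimal sketch of the intended filter using `Series.notna()`, on a toy frame (`df_example` is an assumption, and case counts are left out on purpose):

```python
import numpy as np
import pandas as pd

# Toy rows modelled on the JHU layout: territories carry a province name,
# the country's own row has a missing province.
df_example = pd.DataFrame({
    "province": ["French Guiana", "Martinique", np.nan, "Cook Islands", np.nan],
    "country": ["France", "France", "France", "New Zealand", "New Zealand"],
})

# notna() flags rows whose province is present; combine with the country test.
is_french_territory = df_example["province"].notna() & (df_example["country"] == "France")

# Keep everything except the flagged rows.
df_clean = df_example[~is_french_territory]
print(df_clean)
```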

In [14]:
df_cov_cases[df_cov_cases['country'] == 'France']

index_names = df_cov_cases[
    (df_cov_cases['province'] == 'French Guiana')| 
    (df_cov_cases['province'] == 'French Polynesia')|
    (df_cov_cases['province'] == 'Guadeloupe')|
    (df_cov_cases['province'] == 'Martinique')|
    (df_cov_cases['province'] == 'Mayotte')|
    (df_cov_cases['province'] == 'New Caledonia')|
    (df_cov_cases['province'] == 'Reunion')|
    (df_cov_cases['province'] == 'Saint Barthelemy')|
    (df_cov_cases['province'] == 'Saint Pierre and Miquelon')|
    (df_cov_cases['province'] == 'St Martin')|
    (df_cov_cases['province'] == 'Wallis and Futuna')
    ].index

df_cov_cases.drop(index_names, inplace= True)
df_cov_cases[df_cov_cases['country'] == 'France']


Unnamed: 0,province,country,cases_2020
130,,France,2616902


In [15]:
df_cov_cases[df_cov_cases['country'] == 'United Kingdom']


Unnamed: 0,province,country,cases_2020
257,Anguilla,United Kingdom,13
258,Bermuda,United Kingdom,604
259,British Virgin Islands,United Kingdom,86
260,Cayman Islands,United Kingdom,338
261,Channel Islands,United Kingdom,3058
262,Falkland Islands (Malvinas),United Kingdom,29
263,Gibraltar,United Kingdom,2040
264,Isle of Man,United Kingdom,377
265,Montserrat,United Kingdom,13
266,"Saint Helena, Ascension and Tristan da Cunha",United Kingdom,4


In [16]:
df_cov_cases['category'] = df_cov_cases['province'].fillna('country')
df_cov_cases

Unnamed: 0,province,country,cases_2020,category
0,,Afghanistan,51526,country
1,,Albania,58316,country
2,,Algeria,99610,country
3,,Andorra,8049,country
4,,Angola,17553,country
...,...,...,...,...
273,,Vietnam,1465,country
274,,West Bank and Gaza,138004,country
275,,Yemen,2099,country
276,,Zambia,20725,country


In [17]:
(df_cov_cases.category == 'country').sum()

191

In [18]:
index_names = df_cov_cases[
    df_cov_cases['category'] == 'country' & df_cov_cases['country'] == 'United Kingdom'
    ].index

df_cov_cases.drop(index_names, inplace= True)

df_cov_cases[df_cov_cases['country'] == 'United Kingdom']

TypeError: Cannot perform 'rand_' with a dtyped [object] array and scalar of type [bool]

In [19]:
index_names = df_cov_cases[
    (df_cov_cases['province'].notna & df_cov_cases['country'] == 'United Kingdom')
    ].index

df_cov_cases.drop(index_names, inplace= True)

df_cov_cases[df_cov_cases['country'] == 'United Kingdom']

AssertionError: 

In [20]:
index_names = df_cov_cases[
    df_cov_cases['province'] != np.na & df_cov_cases['country'] == 'United Kingdom'
    ].index

df_cov_cases.drop(index_names, inplace= True)

df_cov_cases[df_cov_cases['country'] == 'United Kingdom']

AttributeError: module 'numpy' has no attribute 'na'
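
The three attempts above fail for different reasons. In [18], `&` binds more tightly than `==`, so Python tries to evaluate `'country' & df_cov_cases['country']` first. In [19], `notna` is referenced without parentheses, so it is a method object rather than a boolean Series. In [20], `np.na` does not exist. A sketch of a working version follows, assuming the mainland row is the one with a missing province; the helper `drop_territories` and the toy frame are hypothetical.

```python
import numpy as np
import pandas as pd


def drop_territories(df: pd.DataFrame, countries: list[str]) -> pd.DataFrame:
    """Remove rows that have a province name for the given countries,
    keeping each country's own row (the one with a missing province)."""
    # Each comparison gets its own parentheses, and notna is called with ().
    is_territory = (df["province"].notna()) & (df["country"].isin(countries))
    return df[~is_territory].reset_index(drop=True)


# Toy rows modelled on the JHU layout; case counts are omitted on purpose.
df_example = pd.DataFrame({
    "province": ["Anguilla", "Bermuda", "Gibraltar", np.nan, "Victoria"],
    "country": ["United Kingdom"] * 4 + ["Australia"],
})

df_clean = drop_territories(df_example, ["United Kingdom"])
print(df_clean)
```

Passing several names, for example `["United Kingdom", "France", "Netherlands", "Denmark"]`, would cover the earlier per-country cells in one step. Australia and Canada should stay out of that list, because their totals exist only as province rows.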

In [21]:
#Create new record totalling all Australian and Canadian cases

In [22]:
df_cov_cases[df_cov_cases['country'] == 'Australia']

Unnamed: 0,province,country,cases_2020,category
8,Australian Capital Territory,Australia,118,Australian Capital Territory
9,New South Wales,Australia,4928,New South Wales
10,Northern Territory,Australia,75,Northern Territory
11,Queensland,Australia,1253,Queensland
12,South Australia,Australia,580,South Australia
13,Tasmania,Australia,234,Tasmania
14,Victoria,Australia,20376,Victoria
15,Western Australia,Australia,861,Western Australia


In [23]:
aus_cases = df_cov_cases[df_cov_cases['country'] == 'Australia'].cases_2020.sum()
aus_cases

28425

In [24]:
can_cases = df_cov_cases[df_cov_cases['country'] == 'Canada'].cases_2020.sum()
can_cases

584409
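
Rather than summing each country by hand, the province rows could be collapsed with a `groupby`. The sketch below reuses the Australian figures shown above plus one country-level row; `df_example` and `country_totals` are hypothetical names.

```python
import numpy as np
import pandas as pd

df_example = pd.DataFrame({
    "province": [
        np.nan, "Australian Capital Territory", "New South Wales",
        "Northern Territory", "Queensland", "South Australia",
        "Tasmania", "Victoria", "Western Australia",
    ],
    "country": ["Afghanistan"] + ["Australia"] * 8,
    "cases_2020": [51526, 118, 4928, 75, 1253, 580, 234, 20376, 861],
})

# One row per country, with province-level counts added together.
country_totals = df_example.groupby("country", as_index=False)["cases_2020"].sum()
print(country_totals)
```

This only gives the intended result once overseas territories have been removed; otherwise their counts are folded into the parent country's total.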

In [25]:
df_cov_cases['country'].value_counts().loc[lambda x: x>1]

China             34
Canada            16
United Kingdom    12
Australia          8
Name: country, dtype: int64

In [26]:
df_cov_cases[df_cov_cases['country'] == 'China']

Unnamed: 0,province,country,cases_2020,category
58,Anhui,China,993,Anhui
59,Beijing,China,987,Beijing
60,Chongqing,China,590,Chongqing
61,Fujian,China,513,Fujian
62,Gansu,China,182,Gansu
63,Guangdong,China,2046,Guangdong
64,Guangxi,China,264,Guangxi
65,Guizhou,China,147,Guizhou
66,Hainan,China,171,Hainan
67,Hebei,China,373,Hebei


In [27]:
# df_cov_cases = pd.concat([df_cov_cases, df_cov_cases[df_cov_cases['province'] == 'Hong Kong'].replace("China", "Hong Kong")])

df_cov_cases.replace((df_cov_cases['country'] == 'China') & (df_cov_cases['province'] == 'Hong Kong'), (df_cov_cases['country'] == 'Hong Kong' & df_cov_cases['province'] == 'Hong Kong')df_cov_cases

SyntaxError: invalid syntax (<ipython-input-27-307d6190009c>, line 3)
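
The cell above appears to be trying to relabel the Hong Kong row so that it counts as its own entry. `DataFrame.replace` does not take boolean masks in this way; a boolean mask with `.loc` is one way to do it. This is a sketch of that reading, not necessarily what the notebook goes on to do, and `df_example` is a toy frame built from the figures shown in this notebook.

```python
import pandas as pd

df_example = pd.DataFrame({
    "province": ["Anhui", "Beijing", "Hong Kong"],
    "country": ["China", "China", "China"],
    "cases_2020": [993, 987, 8846],
})

# Select the single row to relabel; each comparison needs its own parentheses.
is_hong_kong = (df_example["country"] == "China") & (df_example["province"] == "Hong Kong")

# Assign through .loc so the change is written to the frame itself.
df_example.loc[is_hong_kong, "country"] = "Hong Kong"
print(df_example)
```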

In [28]:
df_cov_cases[df_cov_cases['country'] == 'Hong Kong']

Unnamed: 0,province,country,cases_2020,category


In [29]:
df_cov_cases[df_cov_cases['country'] == 'China']


Unnamed: 0,province,country,cases_2020,category
58,Anhui,China,993,Anhui
59,Beijing,China,987,Beijing
60,Chongqing,China,590,Chongqing
61,Fujian,China,513,Fujian
62,Gansu,China,182,Gansu
63,Guangdong,China,2046,Guangdong
64,Guangxi,China,264,Guangxi
65,Guizhou,China,147,Guizhou
66,Hainan,China,171,Hainan
67,Hebei,China,373,Hebei


In [30]:
df_cov_cases.loc[(df_cov_cases["province"] != 'Hong Kong')]

df_cov_cases.loc[(df_cov_cases["province"] == 'Hong Kong')]

Unnamed: 0,province,country,cases_2020,category
70,Hong Kong,China,8846,Hong Kong


In [31]:
df_cov_cases[df_cov_cases['country'] == 'China']

Unnamed: 0,province,country,cases_2020,category
58,Anhui,China,993,Anhui
59,Beijing,China,987,Beijing
60,Chongqing,China,590,Chongqing
61,Fujian,China,513,Fujian
62,Gansu,China,182,Gansu
63,Guangdong,China,2046,Guangdong
64,Guangxi,China,264,Guangxi
65,Guizhou,China,147,Guizhou
66,Hainan,China,171,Hainan
67,Hebei,China,373,Hebei


In [32]:
df_cov_cases = df_cov_cases.loc[(df_cov_cases["province"] != 'Hong Kong')]

In [33]:
df_cov_cases[df_cov_cases['country'] == 'China']

Unnamed: 0,province,country,cases_2020,category
58,Anhui,China,993,Anhui
59,Beijing,China,987,Beijing
60,Chongqing,China,590,Chongqing
61,Fujian,China,513,Fujian
62,Gansu,China,182,Gansu
63,Guangdong,China,2046,Guangdong
64,Guangxi,China,264,Guangxi
65,Guizhou,China,147,Guizhou
66,Hainan,China,171,Hainan
67,Hebei,China,373,Hebei


In [34]:
df_cov_cases[df_cov_cases['country'] == 'Hong Kong']

Unnamed: 0,province,country,cases_2020,category


In [35]:
df_cov_cases = df_cov_cases.loc[(df_cov_cases["province"] != 'Hong Kong') & (df_cov_cases['country'] != 'China')]

In [36]:
df_cov_cases['country'].value_counts().loc[lambda x: x>1]

Canada            16
United Kingdom    12
Australia          8
Name: country, dtype: int64

In [37]:
df_combined = df_wh21.merge(df_nye20, left_on='Country name', right_on='Country/Region')
df_combined[df_combined['Country/Region'] == 'Australia']

NameError: name 'df_nye20' is not defined
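
`df_nye20` was never created in this notebook, hence the `NameError`. A sketch of the merge against a country-level COVID table follows, assuming its key column is `country` as in `df_cov_cases`. The toy frames are stand-ins, and `indicator=True` adds a `_merge` column that shows which WHR countries found no match.

```python
import pandas as pd

# Toy stand-ins for the WHR table and the cleaned COVID table.
df_wh = pd.DataFrame({
    "Country name": ["Afghanistan", "Albania", "Congo (Brazzaville)"],
})
df_cov = pd.DataFrame({
    "country": ["Afghanistan", "Albania"],
    "cases_2020": [51526, 58316],
})

# A left join keeps every WHR country, matched or not.
df_merged = df_wh.merge(
    df_cov, left_on="Country name", right_on="country",
    how="left", indicator=True,
)

# Countries present in the WHR table but missing from the COVID table.
unmatched = df_merged.loc[df_merged["_merge"] == "left_only", "Country name"].tolist()
print(unmatched)
```

Unmatched names are a useful prompt to extend the country-name mapping before rerunning the merge.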

In [38]:
df_cov19_deaths = pd.read_csv("time_series_covid19_deaths_global.csv")
df_cov19_deaths.head()

Unnamed: 0,Province/State,Country/Region,Lat,Long,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,...,6/14/21,6/15/21,6/16/21,6/17/21,6/18/21,6/19/21,6/20/21,6/21/21,6/22/21,6/23/21
0,,Afghanistan,33.93911,67.709953,0,0,0,0,0,0,...,3612,3683,3683,3842,3934,3934,3934,4215,4293,4366
1,,Albania,41.1533,20.1683,0,0,0,0,0,0,...,2453,2454,2454,2454,2454,2454,2454,2454,2455,2455
2,,Algeria,28.0339,1.6596,0,0,0,0,0,0,...,3579,3588,3598,3605,3615,3624,3631,3641,3650,3660
3,,Andorra,42.5063,1.5218,0,0,0,0,0,0,...,127,127,127,127,127,127,127,127,127,127
4,,Angola,-11.2027,17.8739,0,0,0,0,0,0,...,836,842,847,851,853,856,859,868,875,878
