

## Source

* Kaggle dataset published by the Sustainable Development Solutions Network under a CC0 license.
* https://www.kaggle.com/datasets/unsdsn/world-happiness

## Dataset 

- Five CSV files, one per year from 2015 to 2019.
- The number of columns and their names vary across the years; the standardized column names described below are used throughout.

### Data Dictionary:

- Rank - rank of the country based on its Happiness Score.
- Country - name of the country.
- Region - region the country belongs to, mapped as in the 2015 CSV file.
- GDP per Capita (Economy) - World Bank data on the country's economy in terms of purchasing power parity (PPP).
- Health (Life Expectancy) - WHO data on the country's life expectancy at birth.
- Family - survey-based level of social support from relatives or friends (called "Social support" in the 2018 and 2019 files).
- Freedom - freedom to make life choices.
- Generosity - based on donating money to a charity.
- Trust - perceptions of corruption in government.


In [1]:
import pandas as pd

In [2]:
df_2015 = pd.read_csv("2015.csv")
df_2016 = pd.read_csv("2016.csv")
df_2017 = pd.read_csv("2017.csv")
df_2018 = pd.read_csv("2018.csv")
df_2019 = pd.read_csv("2019.csv")

In [3]:
files = [df_2015, df_2016, df_2017, df_2018, df_2019]
strng = ["df_2015", "df_2016", "df_2017", "df_2018", "df_2019"]  # labels used when printing

In [4]:
for i, j in zip(strng, files):
    print(i)
    print("**********")
    print(j.shape)           # (rows, columns)
    print(j.columns.values)  # raw column names for that year
    print()

df_2015
**********
(158, 12)
['Country' 'Region' 'Happiness Rank' 'Happiness Score' 'Standard Error'
 'Economy (GDP per Capita)' 'Family' 'Health (Life Expectancy)' 'Freedom'
 'Trust (Government Corruption)' 'Generosity' 'Dystopia Residual']

df_2016
**********
(157, 13)
['Country' 'Region' 'Happiness Rank' 'Happiness Score'
 'Lower Confidence Interval' 'Upper Confidence Interval'
 'Economy (GDP per Capita)' 'Family' 'Health (Life Expectancy)' 'Freedom'
 'Trust (Government Corruption)' 'Generosity' 'Dystopia Residual']

df_2017
**********
(155, 12)
['Country' 'Happiness.Rank' 'Happiness.Score' 'Whisker.high' 'Whisker.low'
 'Economy..GDP.per.Capita.' 'Family' 'Health..Life.Expectancy.' 'Freedom'
 'Generosity' 'Trust..Government.Corruption.' 'Dystopia.Residual']

df_2018
**********
(156, 9)
['Overall rank' 'Country or region' 'Score' 'GDP per capita'
 'Social support' 'Healthy life expectancy' 'Freedom to make life choices'
 'Generosity' 'Perceptions of corruption']

df_2019
**********
(

### Mapping Countries to Regions 

- df_2015 and df_2016 have a "Region" column.
- df_2017 does not have a "Region" column.
- df_2018 and df_2019 have a "Country or region" column whose values are country names.
- Take each country's region from the df_2015 data and add a "Region" column to df_2019, df_2018, and df_2017. The "Region" column of df_2016 is dropped and rebuilt the same way.
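
The lookup described above can be tried on a small example before running it on the real files. This is a minimal sketch on toy data: the frames `reference` and `later_year` and the name `region_lookup` are made up for illustration and stand in for the 2015 table and a later year's table.

```python
import pandas as pd

# Toy stand-ins for the 2015 reference table and a later year's table.
reference = pd.DataFrame({
    "Country": ["Finland", "Canada", "Bhutan"],
    "Region": ["Western Europe", "North America", "Southern Asia"],
})
later_year = pd.DataFrame({"Country": ["Finland", "Bhutan", "Trinidad & Tobago"]})

# Build the country -> region lookup from the reference year.
region_lookup = reference.set_index("Country")["Region"].to_dict()

# map() returns NaN for any country that is missing from the lookup.
later_year["Region"] = later_year["Country"].map(region_lookup)

# Countries that still need a region assigned by hand.
unmapped = later_year.loc[later_year["Region"].isna(), "Country"].tolist()
print(unmapped)  # ['Trinidad & Tobago']
```

Listing the unmapped countries right after the `map` call makes it easy to see which names need manual handling in each year.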

### Standardizing the column names
- The data sets use different column names.
- The column names are standardized as follows:
  - "Rank"
  - "Country"
  - "Region"
  - "Happiness Score"
  - "GDP per Capita"
  - "Social Support"
  - "Health (Life Expectancy)"
  - "Freedom"
  - "Generosity"
  - "Trust"
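
Assigning a full list to `df.columns` depends on the raw column order, which differs between years (for example, 2016 lists Trust before Generosity while 2017 lists Generosity first). One alternative is to rename by name. The sketch below is illustrative only: `RENAME_2016`, `STANDARD_COLUMNS`, and `standardize` are hypothetical names, and the single data row is made up.

```python
import pandas as pd

# Hypothetical rename map for the 2016 layout: keys are raw column names as
# printed earlier, values are the standardized names listed above.
RENAME_2016 = {
    "Happiness Rank": "Rank",
    "Economy (GDP per Capita)": "GDP per Capita",
    "Family": "Social Support",
    "Trust (Government Corruption)": "Trust",
}

STANDARD_COLUMNS = [
    "Rank", "Country", "Region", "Happiness Score", "GDP per Capita",
    "Social Support", "Health (Life Expectancy)", "Freedom",
    "Generosity", "Trust",
]


def standardize(df, rename_map):
    """Rename columns by name, then keep only the standard columns in order."""
    renamed = df.rename(columns=rename_map)
    return renamed[STANDARD_COLUMNS]


# One toy row in the 2016 raw layout (values are made up).
raw = pd.DataFrame([{
    "Country": "A", "Region": "R", "Happiness Rank": 1, "Happiness Score": 7.5,
    "Lower Confidence Interval": 7.4, "Upper Confidence Interval": 7.6,
    "Economy (GDP per Capita)": 1.4, "Family": 1.1,
    "Health (Life Expectancy)": 0.8, "Freedom": 0.6,
    "Trust (Government Corruption)": 0.44, "Generosity": 0.36,
    "Dystopia Residual": 2.7,
}])

clean = standardize(raw, RENAME_2016)
print(clean.columns.tolist())
```

Because each raw name is matched explicitly, a column cannot silently end up under the wrong label when the source order changes.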

In [5]:
country_region = pd.read_csv("2015.csv")[["Country","Region"]]
country_region_dict =dict(zip(country_region["Country"], country_region["Region"]))

#### 2019

In [6]:
df_2019.columns = ["Rank", "Country","Happiness Score",
                   "GDP per Capita", "Social Support",'Health (Life Expectancy)',
                  "Freedom", "Generosity","Trust"]

In [7]:
df_2019["Region"] = df_2019["Country"].map(country_region_dict)

In [8]:
df_2019 =df_2019[["Rank", "Country","Region","Happiness Score",
          "GDP per Capita", "Social Support",'Health (Life Expectancy)',
                  "Freedom", "Generosity","Trust"]]

df_2019.head()

Unnamed: 0,Rank,Country,Region,Happiness Score,GDP per Capita,Social Support,Health (Life Expectancy),Freedom,Generosity,Trust
0,1,Finland,Western Europe,7.769,1.34,1.587,0.986,0.596,0.153,0.393
1,2,Denmark,Western Europe,7.6,1.383,1.573,0.996,0.592,0.252,0.41
2,3,Norway,Western Europe,7.554,1.488,1.582,1.028,0.603,0.271,0.341
3,4,Iceland,Western Europe,7.494,1.38,1.624,1.026,0.591,0.354,0.118
4,5,Netherlands,Western Europe,7.488,1.396,1.522,0.999,0.557,0.322,0.298


#### 2018

In [9]:
df_2018.columns =["Rank", "Country","Happiness Score",
                   "GDP per Capita", "Social Support",'Health (Life Expectancy)',
                  "Freedom", "Generosity","Trust"]

In [10]:
df_2018["Region"] = df_2018["Country"].map(country_region_dict)

df_2018 =df_2018[["Rank", "Country","Region","Happiness Score",
          "GDP per Capita", "Social Support",'Health (Life Expectancy)',
                  "Freedom", "Generosity","Trust"]]

df_2018.head()

Unnamed: 0,Rank,Country,Region,Happiness Score,GDP per Capita,Social Support,Health (Life Expectancy),Freedom,Generosity,Trust
0,1,Finland,Western Europe,7.632,1.305,1.592,0.874,0.681,0.202,0.393
1,2,Norway,Western Europe,7.594,1.456,1.582,0.861,0.686,0.286,0.34
2,3,Denmark,Western Europe,7.555,1.351,1.59,0.868,0.683,0.284,0.408
3,4,Iceland,Western Europe,7.495,1.343,1.644,0.914,0.677,0.353,0.138
4,5,Switzerland,Western Europe,7.487,1.42,1.549,0.927,0.66,0.256,0.357


#### 2017

In [11]:
df_2017 = df_2017.drop(["Whisker.high", "Whisker.low","Dystopia.Residual"], axis =1)

In [12]:
df_2017.columns = ["Country","Rank","Happiness Score",
                   "GDP per Capita", "Social Support",'Health (Life Expectancy)',
                  "Freedom", "Generosity","Trust"]

In [13]:
df_2017["Region"] = df_2017["Country"].map(country_region_dict)

df_2017 =df_2017[["Rank", "Country","Region","Happiness Score",
          "GDP per Capita", "Social Support",'Health (Life Expectancy)',
                  "Freedom", "Generosity","Trust"]]

In [14]:
df_2017.head()

Unnamed: 0,Rank,Country,Region,Happiness Score,GDP per Capita,Social Support,Health (Life Expectancy),Freedom,Generosity,Trust
0,1,Norway,Western Europe,7.537,1.616463,1.533524,0.796667,0.635423,0.362012,0.315964
1,2,Denmark,Western Europe,7.522,1.482383,1.551122,0.792566,0.626007,0.35528,0.40077
2,3,Iceland,Western Europe,7.504,1.480633,1.610574,0.833552,0.627163,0.47554,0.153527
3,4,Switzerland,Western Europe,7.494,1.56498,1.516912,0.858131,0.620071,0.290549,0.367007
4,5,Finland,Western Europe,7.469,1.443572,1.540247,0.809158,0.617951,0.245483,0.382612


#### 2016

In [15]:
df_2016.columns

Index(['Country', 'Region', 'Happiness Rank', 'Happiness Score',
       'Lower Confidence Interval', 'Upper Confidence Interval',
       'Economy (GDP per Capita)', 'Family', 'Health (Life Expectancy)',
       'Freedom', 'Trust (Government Corruption)', 'Generosity',
       'Dystopia Residual'],
      dtype='object')

In [16]:
df_2016 = df_2016.drop(['Lower Confidence Interval',"Region", 'Upper Confidence Interval','Dystopia Residual'], axis =1)

In [17]:
# After the drop, the 2016 columns are in the order
# ..., Freedom, Trust (Government Corruption), Generosity,
# so "Trust" must come before "Generosity" here.
df_2016.columns = ["Country","Rank","Happiness Score",
                   "GDP per Capita", "Social Support",'Health (Life Expectancy)',
                  "Freedom", "Trust","Generosity"]

In [18]:
df_2016["Region"] = df_2016["Country"].map(country_region_dict)

df_2016 =df_2016[["Rank", "Country","Region","Happiness Score",
          "GDP per Capita", "Social Support",'Health (Life Expectancy)',
                  "Freedom", "Generosity","Trust"]]
df_2016.head()

Unnamed: 0,Rank,Country,Region,Happiness Score,GDP per Capita,Social Support,Health (Life Expectancy),Freedom,Generosity,Trust
0,1,Denmark,Western Europe,7.526,1.44178,1.16374,0.79504,0.57941,0.36171,0.44453
1,2,Switzerland,Western Europe,7.509,1.52733,1.14524,0.86303,0.58557,0.28083,0.41203
2,3,Iceland,Western Europe,7.501,1.42666,1.18326,0.86733,0.56624,0.47678,0.14975
3,4,Norway,Western Europe,7.498,1.57744,1.1269,0.79579,0.59609,0.37895,0.35776
4,5,Finland,Western Europe,7.413,1.40598,1.13464,0.81091,0.57104,0.25492,0.41004


#### 2015

In [19]:
df_2015.columns

Index(['Country', 'Region', 'Happiness Rank', 'Happiness Score',
       'Standard Error', 'Economy (GDP per Capita)', 'Family',
       'Health (Life Expectancy)', 'Freedom', 'Trust (Government Corruption)',
       'Generosity', 'Dystopia Residual'],
      dtype='object')

In [20]:
df_2015 = df_2015.drop(['Standard Error','Dystopia Residual'], axis =1)


In [21]:
df_2015.columns = ["Country","Region","Rank","Happiness Score",
                   "GDP per Capita", "Social Support",'Health (Life Expectancy)',
                  "Freedom", "Trust","Generosity"]

In [22]:
df_2015 =df_2015[["Rank", "Country","Region","Happiness Score",
          "GDP per Capita", "Social Support",'Health (Life Expectancy)',
                  "Freedom", "Generosity","Trust"]]

In [23]:
df_2015.head()

Unnamed: 0,Rank,Country,Region,Happiness Score,GDP per Capita,Social Support,Health (Life Expectancy),Freedom,Generosity,Trust
0,1,Switzerland,Western Europe,7.587,1.39651,1.34951,0.94143,0.66557,0.29678,0.41978
1,2,Iceland,Western Europe,7.561,1.30232,1.40223,0.94784,0.62877,0.4363,0.14145
2,3,Denmark,Western Europe,7.527,1.32548,1.36058,0.87464,0.64938,0.34139,0.48357
3,4,Norway,Western Europe,7.522,1.459,1.33095,0.88521,0.66973,0.34699,0.36503
4,5,Canada,North America,7.427,1.32629,1.32261,0.90563,0.63297,0.45811,0.32957


In [24]:
set(df_2015.columns)==set(df_2016.columns)==set(df_2017.columns)==set(df_2018.columns)==set(df_2019.columns)

True
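
The comparison above uses sets, so it confirms that the frames share the same column names but says nothing about their order. A minimal sketch of an order-aware check follows; the `expected` list is shortened and the `frames` dictionary holds empty toy frames, both made up for illustration.

```python
import pandas as pd

expected = ["Rank", "Country", "Region"]  # shortened list for illustration

frames = {
    "a": pd.DataFrame(columns=["Rank", "Country", "Region"]),
    "b": pd.DataFrame(columns=["Country", "Rank", "Region"]),  # same names, different order
}

# Set comparison ignores order; list comparison does not.
same_names = all(set(df.columns) == set(expected) for df in frames.values())
same_order = all(list(df.columns) == expected for df in frames.values())
print(same_names, same_order)  # True False
```

Neither check can tell whether the values under a label are the right ones, so spot checks on a known country remain useful.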

## Checking for null values in the dataframes

In [25]:
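# Note: `files` was built in In [3]. The 2015-2017 entries still refer to the
# raw frames, because `drop` created new objects for df_2015, df_2016 and df_2017.
for i, j in zip(strng, files):
    print(i, "has", j.shape, "rows X cols")
    print("*********************")
    print(j.isnull().sum())
    print("\n")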

df_2015 has (158, 12) rows X cols
*********************
Country                          0
Region                           0
Happiness Rank                   0
Happiness Score                  0
Standard Error                   0
Economy (GDP per Capita)         0
Family                           0
Health (Life Expectancy)         0
Freedom                          0
Trust (Government Corruption)    0
Generosity                       0
Dystopia Residual                0
dtype: int64


df_2016 has (157, 13) rows X cols
*********************
Country                          0
Region                           0
Happiness Rank                   0
Happiness Score                  0
Lower Confidence Interval        0
Upper Confidence Interval        0
Economy (GDP per Capita)         0
Family                           0
Health (Life Expectancy)         0
Freedom                          0
Trust (Government Corruption)    0
Generosity                       0
Dystopia Residual                
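
The shapes printed above, (158, 12) and (157, 13), are those of the raw files, so this output does not describe the standardized frames. A minimal sketch of running the same check on the cleaned frames is shown below. The helper `null_report` and the toy `cleaned` dictionary are hypothetical; in the notebook the dictionary would hold the current `df_2015` ... `df_2019` objects.

```python
import pandas as pd


def null_report(frames):
    """Return null counts with one row per frame and one column per field."""
    return pd.DataFrame(
        {name: df.isnull().sum() for name, df in frames.items()}
    ).T


# Toy frames standing in for the cleaned yearly tables.
cleaned = {
    "df_2018": pd.DataFrame(
        {"Country": ["X", "Y"], "Region": ["R1", None], "Trust": [0.1, None]}
    ),
    "df_2019": pd.DataFrame(
        {"Country": ["X", "Y"], "Region": ["R1", "R2"], "Trust": [0.1, 0.2]}
    ),
}

report = null_report(cleaned)
print(report)
```

Building the dictionary at the point of use, rather than reusing a list created before the cleaning steps, ensures that the report reflects the frames that will be saved.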

In [26]:
print(df_2018[["Country","Region"]][df_2018["Region"].isnull()])

               Country Region
37   Trinidad & Tobago    NaN
48              Belize    NaN
57     Northern Cyprus    NaN
97             Somalia    NaN
118            Namibia    NaN
153        South Sudan    NaN


In [27]:
print(df_2019[["Country","Region"]][df_2019["Region"].isnull()])

               Country Region
38   Trinidad & Tobago    NaN
63     Northern Cyprus    NaN
83     North Macedonia    NaN
111            Somalia    NaN
112            Namibia    NaN
119             Gambia    NaN
155        South Sudan    NaN


In [28]:
"Trinidad & Tobago" in country_region_dict   

False
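
Some of these misses may be spelling differences rather than genuinely new countries. The sketch below normalizes names before the lookup. The `ALIASES` entries are assumptions about how 2015.csv spells these countries and should be verified against the file; `lookup` is a toy stand-in for `country_region_dict`.

```python
import pandas as pd

# Assumed spelling differences between the later files and 2015.csv.
# Verify each pair against the actual data before relying on it.
ALIASES = {
    "Trinidad & Tobago": "Trinidad and Tobago",
    "Northern Cyprus": "North Cyprus",
    "North Macedonia": "Macedonia",
}

# Toy lookup standing in for country_region_dict.
lookup = {
    "Trinidad and Tobago": "Latin America and Caribbean",
    "North Cyprus": "Western Europe",
}

countries = pd.Series(["Trinidad & Tobago", "Northern Cyprus", "Somalia"])

# replace() swaps known aliases; every other name passes through unchanged.
regions = countries.replace(ALIASES).map(lookup)
print(regions.tolist())
```

Names that have no alias and no entry in the lookup, such as "Somalia" in this toy example, still come back as NaN and need a manual assignment.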

### Master Dictionary 

(used to assign each country with a null region the region of its nearest neighbouring country)

In [29]:
set(zip(df_2015["Country"],df_2015["Region"]))

{('Afghanistan', 'Southern Asia'),
 ('Albania', 'Central and Eastern Europe'),
 ('Algeria', 'Middle East and Northern Africa'),
 ('Angola', 'Sub-Saharan Africa'),
 ('Argentina', 'Latin America and Caribbean'),
 ('Armenia', 'Central and Eastern Europe'),
 ('Australia', 'Australia and New Zealand'),
 ('Austria', 'Western Europe'),
 ('Azerbaijan', 'Central and Eastern Europe'),
 ('Bahrain', 'Middle East and Northern Africa'),
 ('Bangladesh', 'Southern Asia'),
 ('Belarus', 'Central and Eastern Europe'),
 ('Belgium', 'Western Europe'),
 ('Benin', 'Sub-Saharan Africa'),
 ('Bhutan', 'Southern Asia'),
 ('Bolivia', 'Latin America and Caribbean'),
 ('Bosnia and Herzegovina', 'Central and Eastern Europe'),
 ('Botswana', 'Sub-Saharan Africa'),
 ('Brazil', 'Latin America and Caribbean'),
 ('Bulgaria', 'Central and Eastern Europe'),
 ('Burkina Faso', 'Sub-Saharan Africa'),
 ('Burundi', 'Sub-Saharan Africa'),
 ('Cambodia', 'Southeastern Asia'),
 ('Cameroon', 'Sub-Saharan Africa'),
 ('Canada', 'North 

In [30]:
list(df_2018[df_2018["Region"].isnull()]["Country"])

['Trinidad & Tobago',
 'Belize',
 'Northern Cyprus',
 'Somalia',
 'Namibia',
 'South Sudan']

In [31]:
nan_region_dict_2018 ={'Trinidad & Tobago':'Latin America and Caribbean',
                       'Belize':'Latin America and Caribbean',
                       'Northern Cyprus':'Western Europe',
                       'Somalia':'Middle East and Northern Africa',
                       'Namibia':'Sub-Saharan Africa',
                        'South Sudan':'Sub-Saharan Africa'
                  }

Here the mapping is based on the region of the nearest neighbor in the master dictionary: Belize takes Mexico's region, Namibia takes South Africa's, Northern Cyprus takes Cyprus's, and so on.

In [32]:
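# Fill by country name so the result does not depend on the dictionary's
# key order matching the row order.
df_2018["Region"] = df_2018["Region"].fillna(df_2018["Country"].map(nan_region_dict_2018))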

In [33]:
df_2018.isnull().sum()

Rank                        0
Country                     0
Region                      0
Happiness Score             0
GDP per Capita              0
Social Support              0
Health (Life Expectancy)    0
Freedom                     0
Generosity                  0
Trust                       1
dtype: int64

In [34]:
df_2018.iloc[[37,48,57,97,118,153]]

Unnamed: 0,Rank,Country,Region,Happiness Score,GDP per Capita,Social Support,Health (Life Expectancy),Freedom,Generosity,Trust
37,38,Trinidad & Tobago,Latin America and Caribbean,6.192,1.223,1.492,0.564,0.575,0.171,0.019
48,49,Belize,Latin America and Caribbean,5.956,0.807,1.101,0.474,0.593,0.183,0.089
57,58,Northern Cyprus,Western Europe,5.835,1.229,1.211,0.909,0.495,0.179,0.154
97,98,Somalia,Middle East and Northern Africa,4.982,0.0,0.712,0.115,0.674,0.238,0.282
118,119,Namibia,Sub-Saharan Africa,4.441,0.874,1.281,0.365,0.519,0.051,0.064
153,154,South Sudan,Sub-Saharan Africa,3.254,0.337,0.608,0.177,0.112,0.224,0.106


In [35]:
list(df_2019["Country"][df_2019["Region"].isnull()])

['Trinidad & Tobago',
 'Northern Cyprus',
 'North Macedonia',
 'Somalia',
 'Namibia',
 'Gambia',
 'South Sudan']

In [36]:
nan_region_dict_2019 ={'Trinidad & Tobago':'Latin America and Caribbean',
                       'Northern Cyprus':'Western Europe',
                       'North Macedonia': 'Western Europe',
                       'Somalia':'Middle East and Northern Africa',
                       'Namibia':'Sub-Saharan Africa',
                       'Gambia':'Sub-Saharan Africa',                  
                       'South Sudan':'Sub-Saharan Africa',
                       }

In [37]:
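# Fill by country name so the result does not depend on the dictionary's
# key order matching the row order.
df_2019["Region"] = df_2019["Region"].fillna(df_2019["Country"].map(nan_region_dict_2019))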

In [38]:
df_2019.isnull().sum()

Rank                        0
Country                     0
Region                      0
Happiness Score             0
GDP per Capita              0
Social Support              0
Health (Life Expectancy)    0
Freedom                     0
Generosity                  0
Trust                       0
dtype: int64
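
The 2018 and 2019 dictionaries overlap, so one option is to merge them into a single override table and apply it to any year through a small helper. This is a sketch only: `REGION_OVERRIDES` and `fill_missing_regions` are hypothetical names, the region values are copied from the two dictionaries above, and the toy frame is made up.

```python
import pandas as pd

# Hand-assigned regions from the two dictionaries above, merged into one.
REGION_OVERRIDES = {
    "Trinidad & Tobago": "Latin America and Caribbean",
    "Belize": "Latin America and Caribbean",
    "Northern Cyprus": "Western Europe",
    "North Macedonia": "Western Europe",
    "Somalia": "Middle East and Northern Africa",
    "Namibia": "Sub-Saharan Africa",
    "Gambia": "Sub-Saharan Africa",
    "South Sudan": "Sub-Saharan Africa",
}


def fill_missing_regions(df, overrides=REGION_OVERRIDES):
    """Return a copy of df with null regions filled by country name."""
    out = df.copy()
    out["Region"] = out["Region"].fillna(out["Country"].map(overrides))
    return out


# Toy frame whose rows are deliberately not in dictionary order.
toy = pd.DataFrame({
    "Country": ["South Sudan", "Finland", "Belize"],
    "Region": [None, "Western Europe", None],
})
filled = fill_missing_regions(toy)
print(filled)
```

Matching on the country name keeps the assignment correct even if rows are sorted differently or a year is missing one of the listed countries.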

In [39]:
df_2019.iloc[[38,63,83,111,112,119,155]]

Unnamed: 0,Rank,Country,Region,Happiness Score,GDP per Capita,Social Support,Health (Life Expectancy),Freedom,Generosity,Trust
38,39,Trinidad & Tobago,Latin America and Caribbean,6.192,1.231,1.477,0.713,0.489,0.185,0.016
63,64,Northern Cyprus,Western Europe,5.718,1.263,1.252,1.042,0.417,0.191,0.162
83,84,North Macedonia,Western Europe,5.274,0.983,1.294,0.838,0.345,0.185,0.034
111,112,Somalia,Middle East and Northern Africa,4.668,0.0,0.698,0.268,0.559,0.243,0.27
112,113,Namibia,Sub-Saharan Africa,4.639,0.879,1.313,0.477,0.401,0.07,0.056
119,120,Gambia,Sub-Saharan Africa,4.516,0.308,0.939,0.428,0.382,0.269,0.167
155,156,South Sudan,Sub-Saharan Africa,2.853,0.306,0.575,0.295,0.01,0.202,0.091


In [40]:
df_2015.to_csv("dfclean_2015.csv",index=False)
df_2016.to_csv("dfclean_2016.csv",index=False)
df_2017.to_csv("dfclean_2017.csv",index=False)
df_2018.to_csv("dfclean_2018.csv",index=False)
df_2019.to_csv("dfclean_2019.csv",index=False)

In [41]:
df_2019[df_2019["Country"] =="Bhutan"]

Unnamed: 0,Rank,Country,Region,Happiness Score,GDP per Capita,Social Support,Health (Life Expectancy),Freedom,Generosity,Trust
94,95,Bhutan,Southern Asia,5.082,0.813,1.321,0.604,0.457,0.37,0.167


In [42]:
df_2018[df_2018["Country"] =="Bhutan"]

Unnamed: 0,Rank,Country,Region,Happiness Score,GDP per Capita,Social Support,Health (Life Expectancy),Freedom,Generosity,Trust
96,97,Bhutan,Southern Asia,5.082,0.796,1.335,0.527,0.541,0.364,0.171


In [43]:
df_2017[df_2017["Country"] =="Bhutan"]

Unnamed: 0,Rank,Country,Region,Happiness Score,GDP per Capita,Social Support,Health (Life Expectancy),Freedom,Generosity,Trust
96,97,Bhutan,Southern Asia,5.011,0.885416,1.340127,0.495879,0.501538,0.474055,0.17338


In [44]:
df_2016[df_2016["Country"] =="Bhutan"]

Unnamed: 0,Rank,Country,Region,Happiness Score,GDP per Capita,Social Support,Health (Life Expectancy),Freedom,Generosity,Trust
83,84,Bhutan,Southern Asia,5.196,0.8527,0.90836,0.49759,0.46074,0.48546,0.1616


In [45]:
df_2015[df_2015["Country"] =="Bhutan"]

Unnamed: 0,Rank,Country,Region,Happiness Score,GDP per Capita,Social Support,Health (Life Expectancy),Freedom,Generosity,Trust
78,79,Bhutan,Southern Asia,5.253,0.77042,1.10395,0.57407,0.53206,0.47998,0.15445
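
The per-year lookups above can also be done in one step by stacking the yearly frames with a `Year` column. The sketch below uses two small frames whose Bhutan and Finland values are taken from the outputs in this notebook; `frames_by_year` and `combined` are illustrative names, and only three columns are kept for brevity.

```python
import pandas as pd

# Toy frames with a few columns; values are copied from the outputs above.
frames_by_year = {
    2018: pd.DataFrame({
        "Country": ["Bhutan", "Finland"],
        "Rank": [97, 1],
        "Happiness Score": [5.082, 7.632],
    }),
    2019: pd.DataFrame({
        "Country": ["Bhutan", "Finland"],
        "Rank": [95, 1],
        "Happiness Score": [5.082, 7.769],
    }),
}

# Tag each frame with its year, then stack them into one long table.
combined = pd.concat(
    [df.assign(Year=year) for year, df in frames_by_year.items()],
    ignore_index=True,
)

# One filter now returns a country's rows for every year.
bhutan = combined[combined["Country"] == "Bhutan"].sort_values("Year")
print(bhutan)
```

A long table like this makes it easier to compare a country across years, for example to notice a column whose values jump between one year and the next.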
