# Data Description

## What are the observations (rows) and the attributes (columns)?
The data includes 157 observations, each of which is a country. Each observation has the following attributes: Country, Region, Happiness Rank, Happiness Score, Lower Confidence Interval, Upper Confidence Interval, Economy (GDP per capita), Family, Health (Life Expectancy), Freedom, Trust (Government Corruption), Generosity, and Dystopia Residual. These observations are taken from the years 2019 and 2018, as the data from 2017, 2016, and 2015 have different columns.

## Why was this dataset created?

The dataset was created to assess the well-being of citizens of different countries and to see how it correlates with each nation's progress. Though it started as a celebration of the International Day of Happiness, the report has gained traction over the years (2012-present) and has become a reference for world leaders in economics, public health, and policy. It helps assess the direction in which these countries are heading and their progress in overall well-being and policy.

## Who funded the creation of the dataset?

The world happiness data is published by the United Nations Sustainable Development Solutions Network, and the data is primarily provided by the Gallup World Poll.

## What processes might have influenced what data was observed and recorded and what was not?

The world happiness data was created to rank national happiness for all countries. However, not all countries are included in the dataset. This may be due to situations such as war that make surveys impossible to conduct. Other possible reasons are government regulations, citizens unwilling to answer surveys, or samples too small to calculate a score.

## What preprocessing was done, and how did the data come to be in the form that you are using?

The dataset was taken from the Gallup World Poll, whose happiness scores were inspired by the United Nations. The World Happiness Report released by the United Nations ranks 155 countries, and this influenced the production of this dataset. Happiness reports have gained more recognition from the public as more government officials and organizations use these observations to make decisions in economics, psychology, politics, and more.

The initial step in deciding what data to observe was most likely to define what happiness is. The happiness scores and rankings in this dataset were recorded using data from the Gallup World Poll and are derived from answers to a life evaluation question known as the Cantril ladder. People were asked this question in a poll they took willingly, though it is unclear whether they knew how their polling results would be used. They were asked to rate their lives on a scale from 0 to 10, where 10 is the best possible life for them.

Factors that may influence well-being were presumably identified in order to decide what data to observe and record. The data covers six main factors: economic production, social support, life expectancy, freedom, absence of corruption, and generosity. Gallup weights were applied to the data from the Gallup World Poll, and each country was then compared with an imaginary “benchmark” country (“Dystopia”) that has the lowest scores for the six major factors of happiness. Every real country's data was compared against Dystopia to give a consistent way of measuring the factors of happiness.

The data for 2015, 2016, 2017, 2018, and 2019 had different numbers of rows and columns. The first preprocessing step was to match the columns across years. Since 2019 was the most recent year, the other years were transformed to match the column names and order of the 2019 data. This preprocessing allows us to conduct more accurate regression analysis, as there will be more data and concatenating the data will be easier. The second preprocessing step was to match the list of countries with the 2019 data. While scanning through the data, we realized that each year had a different number of countries. Therefore, we created another dataset that excludes countries which were not in all datasets.

## Data Source

https://www.kaggle.com/unsdsn/world-happiness#2019.csv

**Potential Problems with Dataset**

The dataset looks at various countries and their happiness in different years.
Attributes (columns) differ slightly between the yearly files. For example, the order of columns may differ, certain column names are slightly changed, or some years are missing some attributes.
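
One way to make these differences concrete is to compare each year's column names against a reference year with set operations. The sketch below is only an illustration: `toy_2019` and `toy_2017` are small empty frames that borrow a few real column names, and `column_diff` is a hypothetical helper that is not part of the original analysis.

```python
import pandas as pd

# Toy stand-ins for two yearly files (no rows, only a few column names)
toy_2019 = pd.DataFrame(columns=['Overall rank', 'Country or region', 'Score'])
toy_2017 = pd.DataFrame(columns=['Country', 'Happiness.Rank', 'Happiness.Score', 'Whisker.high'])


def column_diff(reference, other):
    """Return (columns only in reference, columns only in other), each sorted."""
    ref_cols, other_cols = set(reference.columns), set(other.columns)
    return sorted(ref_cols - other_cols), sorted(other_cols - ref_cols)


only_2019, only_2017 = column_diff(toy_2019, toy_2017)
print('Only in 2019:', only_2019)
print('Only in 2017:', only_2017)
```

Because the comparison is on exact names, a renamed column shows up on both sides, which is a hint that a rename rather than a drop is needed.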

In [0]:
import numpy as np
import pandas as pd

In [0]:
# import all data
data_2019 = pd.read_csv('2019.csv')
data_2018 = pd.read_csv('2018.csv')
data_2017 = pd.read_csv('2017.csv')
data_2016 = pd.read_csv('2016.csv')
data_2015 = pd.read_csv('2015.csv')

In [0]:
# print out the columns for each dataset
for x in [data_2019, data_2018, data_2017, data_2016, data_2015]:
    print(x.columns)

Index(['Overall rank', 'Country or region', 'Score', 'GDP per capita',
       'Social support', 'Healthy life expectancy',
       'Freedom to make life choices', 'Generosity',
       'Perceptions of corruption'],
      dtype='object')
Index(['Overall rank', 'Country or region', 'Score', 'GDP per capita',
       'Social support', 'Healthy life expectancy',
       'Freedom to make life choices', 'Generosity',
       'Perceptions of corruption'],
      dtype='object')
Index(['Country', 'Happiness.Rank', 'Happiness.Score', 'Whisker.high',
       'Whisker.low', 'Economy..GDP.per.Capita.', 'Family',
       'Health..Life.Expectancy.', 'Freedom', 'Generosity',
       'Trust..Government.Corruption.', 'Dystopia.Residual'],
      dtype='object')
Index(['Country', 'Region', 'Happiness Rank', 'Happiness Score',
       'Lower Confidence Interval', 'Upper Confidence Interval',
       'Economy (GDP per Capita)', 'Family', 'Health (Life Expectancy)',
       'Freedom', 'Trust (Government Corruption)', '

In [0]:
# Removes columns that were not in the 2019 version of the data to match the number of columns
data_2017_updated = data_2017.drop(columns = ['Whisker.high', 'Whisker.low', 'Dystopia.Residual'])
data_2016_updated = data_2016.drop(columns = ['Lower Confidence Interval', 'Upper Confidence Interval' ,'Region', 'Dystopia Residual'])
data_2015_updated = data_2015.drop(columns = ['Standard Error', 'Region', 'Dystopia Residual'])

In [0]:
# renaming the columns to match 2019 data
data_2017_updated = data_2017_updated.rename(columns = {
    'Happiness.Rank': 'Overall rank', 'Country': 'Country or region',
    'Happiness.Score': 'Score', 'Economy..GDP.per.Capita.': 'GDP per capita',
    'Family': 'Social support', 'Health..Life.Expectancy.': 'Healthy life expectancy',
    'Freedom': 'Freedom to make life choices',
    'Trust..Government.Corruption.': 'Perceptions of corruption'})
data_2016_updated = data_2016_updated.rename(columns = {
    'Happiness Rank': 'Overall rank', 'Country': 'Country or region',
    'Happiness Score': 'Score', 'Economy (GDP per Capita)': 'GDP per capita',
    'Family': 'Social support', 'Health (Life Expectancy)': 'Healthy life expectancy',
    'Freedom': 'Freedom to make life choices',
    'Trust (Government Corruption)': 'Perceptions of corruption'})
data_2015_updated = data_2015_updated.rename(columns = {
    'Happiness Rank': 'Overall rank', 'Country': 'Country or region',
    'Happiness Score': 'Score', 'Economy (GDP per Capita)': 'GDP per capita',
    'Family': 'Social support', 'Health (Life Expectancy)': 'Healthy life expectancy',
    'Freedom': 'Freedom to make life choices',
    'Trust (Government Corruption)': 'Perceptions of corruption'})

In [0]:
# re-order columns to match 2019 data
data_2017_updated = data_2017_updated.reindex(columns = data_2019.columns)
data_2016_updated = data_2016_updated.reindex(columns = data_2019.columns)
data_2015_updated = data_2015_updated.reindex(columns = data_2019.columns)
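
A caveat with `reindex(columns=...)` is that it does not raise when a requested column is absent; it creates that column and fills it with NaN. A quick check for all-NaN columns can therefore catch a missed rename. The sketch below uses a made-up two-row frame (`old`) and a shortened reference column list (`ref_columns`), both of which are illustrative assumptions rather than the real data.

```python
import pandas as pd

# Shortened stand-in for data_2019.columns
ref_columns = ['Overall rank', 'Country or region', 'Score']

# Made-up older-year frame; the 'Happiness Score' rename is forgotten on purpose
old = pd.DataFrame({'Country': ['A', 'B'],
                    'Happiness Rank': [1, 2],
                    'Happiness Score': [7.5, 7.4]})
old = old.rename(columns={'Country': 'Country or region',
                          'Happiness Rank': 'Overall rank'})

# reindex puts the columns in the reference order, but a column it cannot
# find ('Score' here) is created and filled with NaN rather than raising
aligned = old.reindex(columns=ref_columns)

# Columns that are entirely NaN after reindexing point to a missed rename
all_nan_cols = [c for c in aligned.columns if aligned[c].isna().all()]
print(all_nan_cols)
```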

Now we check if all rankings have the same countries.

In [0]:
def check_missing(data1, data2):
    """
    Checks if countries in data1 are in data2

    Returns a sorted list of countries in data1 that are not in data2
    """
    col = 'Country or region'
    # Boolean mask: True where the country in data1 also appears in data2
    check_isin = data1[col].isin(data2[col])

    # Select by mask rather than by position, so this also works
    # when data1 does not have the default 0..n-1 index
    return sorted(data1.loc[~check_isin, col])
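
As a cross-check on `check_missing`, an outer merge with `indicator=True` reports both directions of the comparison in one pass. This is only a sketch of an alternative, not the method used in this notebook; `year_a` and `year_b` are made-up frames whose country membership is invented for illustration.

```python
import pandas as pd

col = 'Country or region'
# Made-up country lists standing in for two yearly files
year_a = pd.DataFrame({col: ['Angola', 'Belize', 'Chile']})
year_b = pd.DataFrame({col: ['Chile', 'Gambia']})

# indicator=True adds a '_merge' column with 'left_only', 'right_only' or 'both'
merged = year_a[[col]].merge(year_b[[col]], on=col, how='outer', indicator=True)

only_in_a = sorted(merged.loc[merged['_merge'] == 'left_only', col])
only_in_b = sorted(merged.loc[merged['_merge'] == 'right_only', col])
print('Only in year_a:', only_in_a)
print('Only in year_b:', only_in_b)
```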

In [0]:
def combine_find_unique(arr1, arr2, arr3, arr4):
    """
    Finds the unique elements across the four input lists
    """
    # np.concatenate accepts the lists directly, so no separate conversion is needed
    concat = np.concatenate((arr1, arr2, arr3, arr4))

    # np.unique returns the distinct values in sorted order
    return np.unique(concat)

We get the list of countries that are in the other datasets but not in the 2019 data.

In [0]:
print('2015 vs 2019:', check_missing(data_2015_updated, data_2019))
print('2016 vs 2019:', check_missing(data_2016_updated, data_2019))
print('2017 vs 2019:', check_missing(data_2017_updated, data_2019))
print('2018 vs 2019:', check_missing(data_2018, data_2019))

2015 vs 2019: ['Angola', 'Djibouti', 'Macedonia', 'North Cyprus', 'Oman', 'Somaliland region', 'Sudan', 'Suriname', 'Trinidad and Tobago']
2016 vs 2019: ['Angola', 'Belize', 'Macedonia', 'North Cyprus', 'Puerto Rico', 'Somaliland Region', 'Sudan', 'Suriname', 'Trinidad and Tobago']
2017 vs 2019: ['Angola', 'Belize', 'Hong Kong S.A.R., China', 'Macedonia', 'North Cyprus', 'Sudan', 'Taiwan Province of China', 'Trinidad and Tobago']
2018 vs 2019: ['Angola', 'Belize', 'Macedonia', 'Sudan']


In [0]:
combine_find_unique(check_missing(data_2015_updated, data_2019), check_missing(data_2016_updated, data_2019), \
                   check_missing(data_2017_updated, data_2019), check_missing(data_2018, data_2019))

array(['Angola', 'Belize', 'Djibouti', 'Hong Kong S.A.R., China',
       'Macedonia', 'North Cyprus', 'Oman', 'Puerto Rico',
       'Somaliland Region', 'Somaliland region', 'Sudan', 'Suriname',
       'Taiwan Province of China', 'Trinidad and Tobago'], dtype='<U24')

We get the list of countries that are in the 2019 data but not in the other datasets.

In [0]:
print('2019 vs 2015:', check_missing(data_2019, data_2015_updated))
print('2019 vs 2016:', check_missing(data_2019, data_2016_updated))
print('2019 vs 2017:', check_missing(data_2019, data_2017_updated))
print('2019 vs 2018:', check_missing(data_2019, data_2018))

2019 vs 2015: ['Gambia', 'Namibia', 'North Macedonia', 'Northern Cyprus', 'Somalia', 'South Sudan', 'Trinidad & Tobago']
2019 vs 2016: ['Central African Republic', 'Gambia', 'Lesotho', 'Mozambique', 'North Macedonia', 'Northern Cyprus', 'Swaziland', 'Trinidad & Tobago']
2019 vs 2017: ['Comoros', 'Gambia', 'Hong Kong', 'Laos', 'North Macedonia', 'Northern Cyprus', 'Swaziland', 'Taiwan', 'Trinidad & Tobago']
2019 vs 2018: ['Comoros', 'Gambia', 'North Macedonia', 'Swaziland']


In [0]:
combine_find_unique(check_missing(data_2019, data_2015_updated), check_missing(data_2019, data_2016_updated), \
                   check_missing(data_2019, data_2017_updated), check_missing(data_2019, data_2018))

array(['Central African Republic', 'Comoros', 'Gambia', 'Hong Kong',
       'Laos', 'Lesotho', 'Mozambique', 'Namibia', 'North Macedonia',
       'Northern Cyprus', 'Somalia', 'South Sudan', 'Swaziland', 'Taiwan',
       'Trinidad & Tobago'], dtype='<U24')

From the list of country names, we deduce that the following pairs refer to the same country:
 * Trinidad & Tobago and Trinidad and Tobago
 * North Macedonia and Macedonia
 * Northern Cyprus and North Cyprus
 * Hong Kong S.A.R., China and Hong Kong
 * Taiwan Province of China and Taiwan
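
The pairs listed above could also be applied in one step with a dictionary and `Series.replace`, instead of one `.loc` assignment per name and per year. The sketch below is a minimal illustration under that assumption; `name_map` and the three-row frame `toy` are hypothetical and not part of the original notebook.

```python
import pandas as pd

# Older spelling -> spelling used in the 2019 file (pairs taken from the list above)
name_map = {
    'Trinidad and Tobago': 'Trinidad & Tobago',
    'Macedonia': 'North Macedonia',
    'North Cyprus': 'Northern Cyprus',
    'Hong Kong S.A.R., China': 'Hong Kong',
    'Taiwan Province of China': 'Taiwan',
}

# Toy frame standing in for one of the older yearly files
toy = pd.DataFrame({'Country or region': ['Macedonia', 'Norway', 'North Cyprus']})

# Series.replace with a dict swaps exact matches and leaves other values untouched
toy['Country or region'] = toy['Country or region'].replace(name_map)
print(toy['Country or region'].tolist())
```

Names that do not appear in a given year are simply ignored, so the same mapping could be applied to every yearly frame.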

In [0]:
# Update to Trinidad & Tobago
data_2015_updated.loc[data_2015_updated['Country or region'] == 'Trinidad and Tobago', 'Country or region'] \
    = 'Trinidad & Tobago'
data_2016_updated.loc[data_2016_updated['Country or region'] == 'Trinidad and Tobago', 'Country or region'] \
    = 'Trinidad & Tobago'
data_2017_updated.loc[data_2017_updated['Country or region'] == 'Trinidad and Tobago', 'Country or region'] \
    = 'Trinidad & Tobago'

In [0]:
# Update to North Macedonia
data_2015_updated.loc[data_2015_updated['Country or region'] == 'Macedonia', 'Country or region'] \
    = 'North Macedonia'
data_2016_updated.loc[data_2016_updated['Country or region'] == 'Macedonia', 'Country or region'] \
    = 'North Macedonia'
data_2017_updated.loc[data_2017_updated['Country or region'] == 'Macedonia', 'Country or region'] \
    = 'North Macedonia'
data_2018.loc[data_2018['Country or region'] == 'Macedonia', 'Country or region'] \
    = 'North Macedonia'

In [0]:
# Update to Northern Cyprus
data_2015_updated.loc[data_2015_updated['Country or region'] == 'North Cyprus', 'Country or region'] \
    = 'Northern Cyprus'
data_2016_updated.loc[data_2016_updated['Country or region'] == 'North Cyprus', 'Country or region'] \
    = 'Northern Cyprus'
data_2017_updated.loc[data_2017_updated['Country or region'] == 'North Cyprus', 'Country or region'] \
    = 'Northern Cyprus'

In [0]:
# Update to Hong Kong
data_2017_updated.loc[data_2017_updated['Country or region'] == 'Hong Kong S.A.R., China', 'Country or region'] \
    = 'Hong Kong'

In [0]:
# Update to Taiwan
data_2017_updated.loc[data_2017_updated['Country or region'] == 'Taiwan Province of China', 'Country or region'] \
    = 'Taiwan'

After renaming the countries that appeared under different names, we list the countries that are still missing from each dataset.

In [0]:
print('2015 vs 2019:', check_missing(data_2015_updated, data_2019))
print('2016 vs 2019:', check_missing(data_2016_updated, data_2019))
print('2017 vs 2019:', check_missing(data_2017_updated, data_2019))
print('2018 vs 2019:', check_missing(data_2018, data_2019))

2015 vs 2019: ['Angola', 'Djibouti', 'Oman', 'Somaliland region', 'Sudan', 'Suriname']
2016 vs 2019: ['Angola', 'Belize', 'Puerto Rico', 'Somaliland Region', 'Sudan', 'Suriname']
2017 vs 2019: ['Angola', 'Belize', 'Sudan']
2018 vs 2019: ['Angola', 'Belize', 'Sudan']


In [0]:
print('2019 vs 2015:', check_missing(data_2019, data_2015_updated))
print('2019 vs 2016:', check_missing(data_2019, data_2016_updated))
print('2019 vs 2017:', check_missing(data_2019, data_2017_updated))
print('2019 vs 2018:', check_missing(data_2019, data_2018))

2019 vs 2015: ['Gambia', 'Namibia', 'Somalia', 'South Sudan']
2019 vs 2016: ['Central African Republic', 'Gambia', 'Lesotho', 'Mozambique', 'Swaziland']
2019 vs 2017: ['Comoros', 'Gambia', 'Laos', 'Swaziland']
2019 vs 2018: ['Comoros', 'Gambia', 'Swaziland']


In [0]:
combine_find_unique(check_missing(data_2019, data_2015_updated), check_missing(data_2019, data_2016_updated), \
                   check_missing(data_2019, data_2017_updated), check_missing(data_2019, data_2018))

array(['Central African Republic', 'Comoros', 'Gambia', 'Laos', 'Lesotho',
       'Mozambique', 'Namibia', 'Somalia', 'South Sudan', 'Swaziland'],
      dtype='<U24')

In [0]:
def country_list(data):
    """
    Return all elements from 'Country or region' column
    """
    # tolist() gives a plain Python list, so no explicit loop is needed
    return data['Country or region'].tolist()

country_2019 = country_list(data_2019)
country_2018 = country_list(data_2018)
country_2017 = country_list(data_2017_updated)
country_2016 = country_list(data_2016_updated)
country_2015 = country_list(data_2015_updated)

In [0]:
# Find the countries that are in 2019 data but not in others.
data_2019_id = []

for country in country_2019:
    if country not in country_2018:
        data_2019_id.append(data_2019[data_2019['Country or region'] == country].index[0])
    if country not in country_2017:
        data_2019_id.append(data_2019[data_2019['Country or region'] == country].index[0])
    if country not in country_2016:
        data_2019_id.append(data_2019[data_2019['Country or region'] == country].index[0])
    if country not in country_2015:
        data_2019_id.append(data_2019[data_2019['Country or region'] == country].index[0])

# Do list(set()) to remove duplicates from array
data_2019_id_unique = list(set(data_2019_id))
print(data_2019_id_unique)

[134, 154, 104, 141, 111, 112, 143, 119, 122, 155]


In [0]:
# Drop the rows collected above instead of retyping the index values
data_2019_edited = data_2019.drop(data_2019_id_unique)

Similarly, we drop rows from 2015, 2016, 2017, and 2018 data that are not in 2019 data. 

In [0]:
# 2015 data
index_2015 = []

for val in check_missing(data_2015_updated, data_2019_edited):
    row_num = data_2015_updated[data_2015_updated['Country or region'] == val].index[0]
    index_2015.append(row_num)

data_2015_edited = data_2015_updated.drop(index_2015)

In [0]:
# 2016 data
index_2016 = []

for val in check_missing(data_2016_updated, data_2019_edited):
    row_num = data_2016_updated[data_2016_updated['Country or region'] == val].index[0]
    index_2016.append(row_num)

data_2016_edited = data_2016_updated.drop(index_2016)

In [0]:
# 2017 data
index_2017 = []

for val in check_missing(data_2017_updated, data_2019_edited):
    row_num = data_2017_updated[data_2017_updated['Country or region'] == val].index[0]
    index_2017.append(row_num)

data_2017_edited = data_2017_updated.drop(index_2017)

In [0]:
# 2018 data
index_2018 = []

for val in check_missing(data_2018, data_2019_edited):
    row_num = data_2018[data_2018['Country or region'] == val].index[0]
    index_2018.append(row_num)

data_2018_edited = data_2018.drop(index_2018)

In [0]:
# rearrange the index of rows from 0 to 145
data_2015_edited.index = range(146)
data_2016_edited.index = range(146)
data_2017_edited.index = range(146)
data_2018_edited.index = range(146)
data_2019_edited.index = range(146)
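
The filtering steps above can also be sketched with a set intersection, which keeps only the countries present in every year and avoids hard-coding the final row count. This is an alternative under stated assumptions, not the notebook's own method: the `frames` dictionary and its two small frames are made up for illustration.

```python
from functools import reduce

import pandas as pd

col = 'Country or region'
# Made-up frames standing in for the yearly files
frames = {
    2018: pd.DataFrame({col: ['A', 'B', 'C'], 'Score': [1.0, 2.0, 3.0]}),
    2019: pd.DataFrame({col: ['B', 'C', 'D'], 'Score': [2.5, 3.5, 4.5]}),
}

# Countries present in every frame
common = reduce(set.intersection, (set(df[col]) for df in frames.values()))

# Keep only those rows and renumber the index from 0 without hard-coding the length
edited = {year: df[df[col].isin(common)].reset_index(drop=True)
          for year, df in frames.items()}

for year, df in edited.items():
    print(year, df[col].tolist())
```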

The following variables (data) have the same columns but different countries (rows):
* data_2015_updated
* data_2016_updated
* data_2017_updated
* data_2018
* data_2019

The following variables (data) have the same columns and the same countries (rows):
* data_2015_edited
* data_2016_edited
* data_2017_edited
* data_2018_edited
* data_2019_edited
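
Since the `_edited` frames share the same columns and countries, they could be stacked into one long table for regression. The sketch below shows one possible way to do that with `pd.concat`, tagging each row with its year; `toy_2018`, `toy_2019`, and the `Year` column are assumptions made for illustration and use invented values.

```python
import pandas as pd

col = 'Country or region'
# Made-up frames standing in for data_2018_edited and data_2019_edited
toy_2018 = pd.DataFrame({col: ['A', 'B'], 'Score': [5.0, 6.0]})
toy_2019 = pd.DataFrame({col: ['A', 'B'], 'Score': [5.5, 6.5]})

# Tag each frame with its year before stacking so rows stay distinguishable
combined = pd.concat(
    [toy_2018.assign(Year=2018), toy_2019.assign(Year=2019)],
    ignore_index=True,
)
print(combined)
```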