# Introduction

# Data Description

## What are the observations (rows) and the attributes (columns)?
The data contains 157 observations, each of which is a country. The attributes are: Country, Region, Happiness Rank, Happiness Score, Lower Confidence Interval, Upper Confidence Interval, Economy (GDP per capita), Family, Health (Life Expectancy), Freedom, Trust (Government Corruption), Generosity, and Dystopia Residual. The observations are taken from 2018 and 2019, because the data from 2015, 2016, and 2017 have different columns.

## Why was this dataset created?

The dataset was created to assess the well-being of citizens in different countries and to see how it correlates with each nation's progress. Although it started as a celebration of the International Day of Happiness, the report has gained traction over the years (2012-present) and has become a reference for world leaders in economics, public health, and policy. It helps assess the direction in which these countries are heading and their progress in overall well-being and policy.

## Who funded the creation of the dataset?

The World Happiness data is published by the United Nations Sustainable Development Solutions Network, and the data is primarily provided by the Gallup World Poll.

## What processes might have influenced what data was observed and recorded and what was not?

The World Happiness data was created to rank national happiness for all countries. However, not all countries are included in the dataset. This may be due to situations such as war that make surveys impossible to conduct. Other reasons may include government regulations, citizens unwilling to answer surveys, or samples too small to calculate a score.

## What preprocessing was done, and how did the data come to be in the form that you are using?

The dataset comes from the Gallup World Poll, whose happiness scores were inspired by the United Nations. The World Happiness Report released by the United Nations ranks 155 countries, and this influenced the production of this dataset. Happiness reports have gained more public recognition as government officials and other organizations use these observations to inform decisions in economics, psychology, politics, and other fields.

The initial step in determining what data to observe was most likely to define what happiness is. The happiness scores and rankings in this dataset were recorded using data from the Gallup World Poll and are derived from answers to a life evaluation question known as the Cantril ladder. People were asked this question in a poll they took willingly, though it is unclear whether they knew what their polling results would be used for. They were asked to rate their lives on a scale from 0 to 10, where 10 is the best possible life for them.

Factors that may influence well-being were presumably identified in order to decide what data to observe and record. The data mainly covers six factors: economic production, social support, life expectancy, freedom, absence of corruption, and generosity. Gallup weights were applied to the Gallup World Poll data, and each country was then compared with an imaginary “benchmark” country (“Dystopia”) that has the lowest scores for the six major factors of happiness. Comparing every real country against Dystopia gives a consistent way of measuring the factors of happiness.
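To make the role of the Dystopia benchmark concrete, here is a minimal sketch that assumes a country's score decomposes into the six factor contributions plus a Dystopia residual. The column names follow the 2019 schema used later in this report, the sample row is copied from the first row of the combined table, and `dystopia_residual` is a hypothetical helper rather than part of the original analysis.

```python
import pandas as pd

# Six factor columns, named as in the 2019 file (assumed schema).
FACTORS = [
    "GDP per capita",
    "Social support",
    "Healthy life expectancy",
    "Freedom to make life choices",
    "Generosity",
    "Perceptions of corruption",
]

def dystopia_residual(df: pd.DataFrame) -> pd.Series:
    """Score minus the summed factor contributions (hypothetical helper)."""
    return df["Score"] - df[FACTORS].sum(axis=1)

# One row copied from the combined table (Switzerland).
sample = pd.DataFrame([{
    "Country or region": "Switzerland",
    "Score": 7.587,
    "GDP per capita": 1.39651,
    "Social support": 1.34951,
    "Healthy life expectancy": 0.94143,
    "Freedom to make life choices": 0.66557,
    "Generosity": 0.29678,
    "Perceptions of corruption": 0.41978,
}])

residual = dystopia_residual(sample)
print(residual.round(3).iloc[0])
```

Under that assumption, whatever is left after subtracting the six factors is the part of the score attributed to the Dystopia baseline plus unexplained variation.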

The data for 2015, 2016, 2017, 2018, and 2019 had different numbers of rows and columns. The first preprocessing step was to match the columns across years. Since 2019 was the most recent year, the other years were transformed to match the column names and order of the 2019 data. This preprocessing allows us to conduct a more accurate regression analysis, because more data is available and concatenating the years is easier. The second preprocessing step was to match the list of countries with the 2019 data. While scanning through the data, we realized that each year covered a different number of countries. Therefore, we created another dataset that removes countries which were not in all datasets.
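The two preprocessing steps could be sketched roughly as follows. This is an illustration under stated assumptions, not the code that produced `concat_data.csv`: the old-to-new header mapping is a guess based on the attribute names listed earlier, the helper names are hypothetical, and the two small frames are toy data with placeholder countries.

```python
import pandas as pd

# Assumed mapping from an older file's headers to the 2019 names.
RENAME_TO_2019 = {
    "Country": "Country or region",
    "Happiness Rank": "Overall rank",
    "Happiness Score": "Score",
    "Economy (GDP per capita)": "GDP per capita",
    "Family": "Social support",
    "Health (Life Expectancy)": "Healthy life expectancy",
    "Freedom": "Freedom to make life choices",
    "Trust (Government Corruption)": "Perceptions of corruption",
}

# Column names and order of the 2019 data.
COLUMNS_2019 = [
    "Overall rank", "Country or region", "Score", "GDP per capita",
    "Social support", "Healthy life expectancy",
    "Freedom to make life choices", "Generosity",
    "Perceptions of corruption",
]

def harmonize(df: pd.DataFrame) -> pd.DataFrame:
    """Rename to 2019 headers and keep only the 2019 columns, in 2019 order."""
    return df.rename(columns=RENAME_TO_2019)[COLUMNS_2019]

def keep_common_countries(frames):
    """Drop countries that are missing from any of the yearly frames."""
    common = set.intersection(*(set(f["Country or region"]) for f in frames))
    return [f[f["Country or region"].isin(common)] for f in frames]

# Toy frames with placeholder countries A, B, C (values are made up).
old = pd.DataFrame(
    [["A", "X", 1, 7.0, 1.3, 1.2, 0.9, 0.6, 0.4, 0.3],
     ["B", "Y", 2, 6.0, 1.1, 1.0, 0.8, 0.5, 0.2, 0.2]],
    columns=["Country", "Region", "Happiness Rank", "Happiness Score",
             "Economy (GDP per capita)", "Family", "Health (Life Expectancy)",
             "Freedom", "Trust (Government Corruption)", "Generosity"],
)
new = pd.DataFrame(
    [[1, "A", 7.1, 1.3, 1.2, 0.9, 0.6, 0.3, 0.4],
     [2, "C", 6.5, 1.2, 1.1, 0.8, 0.5, 0.2, 0.3]],
    columns=COLUMNS_2019,
)

frames = keep_common_countries([harmonize(old), harmonize(new)])
combined = pd.concat(frames, ignore_index=True)
print(combined)
```

Selecting `COLUMNS_2019` explicitly raises a `KeyError` when a year lacks one of the expected columns, which surfaces schema problems early instead of silently producing missing values.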

## Data Source

https://www.kaggle.com/unsdsn/world-happiness#2019.csv

**Potential Problems with Dataset**

The dataset looks at the happiness of various countries across different years.
Attributes (columns) differ slightly between the yearly files. For example, the order of columns may differ, certain column names are slightly changed, and some years are missing some attributes.
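One way to make these differences visible is to compare each year's headers against a reference year. The sketch below assumes the yearly files have been loaded into a dictionary keyed by year. `column_report` is a hypothetical helper, and the two header-only frames are illustrative stand-ins for the real files.

```python
import pandas as pd

def column_report(frames: dict, reference_year: int) -> pd.DataFrame:
    """Summarize how each year's columns differ from the reference year."""
    ref = list(frames[reference_year].columns)
    rows = []
    for year, df in sorted(frames.items()):
        cols = list(df.columns)
        rows.append({
            "year": year,
            "missing": sorted(set(ref) - set(cols)),  # in reference, not here
            "extra": sorted(set(cols) - set(ref)),    # here, not in reference
            "same_order": cols == ref,
        })
    return pd.DataFrame(rows).set_index("year")

# Header-only toy frames; the column names are examples, not the full schema.
frames = {
    2015: pd.DataFrame(columns=["Country", "Happiness Score", "Family"]),
    2019: pd.DataFrame(columns=["Country or region", "Score", "Social support"]),
}

report = column_report(frames, reference_year=2019)
print(report)
```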

# Data Analysis

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
plt.style.use('seaborn')  # matplotlib >= 3.6 renames this style to 'seaborn-v0_8'
```

```python
all_years = pd.read_csv('data/concat_data.csv')
# Drop row 489 because its Perceptions of corruption value is NaN.
all_years = all_years.drop([489])
all_years.head()
```

| | Overall rank | Country or region | Score | GDP per capita | Social support | Healthy life expectancy | Freedom to make life choices | Generosity | Perceptions of corruption | above_10 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Switzerland | 7.587 | 1.39651 | 1.34951 | 0.94143 | 0.66557 | 0.29678 | 0.41978 | 1 |
| 1 | 2 | Iceland | 7.561 | 1.30232 | 1.40223 | 0.94784 | 0.62877 | 0.4363 | 0.14145 | 1 |
| 2 | 3 | Denmark | 7.527 | 1.32548 | 1.36058 | 0.87464 | 0.64938 | 0.34139 | 0.48357 | 1 |
| 3 | 4 | Norway | 7.522 | 1.459 | 1.33095 | 0.88521 | 0.66973 | 0.34699 | 0.36503 | 1 |
| 4 | 5 | Canada | 7.427 | 1.32629 | 1.32261 | 0.90563 | 0.63297 | 0.45811 | 0.32957 | 1 |
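The loading cell above removes row 489 by its index label. A hedged alternative, assuming the goal is simply to discard any row whose `Perceptions of corruption` value is missing, is to let pandas locate those rows so the code keeps working if the file's row order changes. The frame below is a toy stand-in for `all_years` with made-up values.

```python
import numpy as np
import pandas as pd

# Toy stand-in for all_years; values are illustrative only.
toy = pd.DataFrame({
    "Country or region": ["A", "B", "C"],
    "Perceptions of corruption": [0.41, np.nan, 0.14],
})

# Inspect which rows would be dropped before removing them.
missing = toy[toy["Perceptions of corruption"].isna()]
print(missing)

# Keep only rows where the corruption column is present.
cleaned = toy.dropna(subset=["Perceptions of corruption"])
print(cleaned)
```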


```python
all_years.describe()
```

| | Overall rank | Score | GDP per capita | Social support | Healthy life expectancy | Freedom to make life choices | Generosity | Perceptions of corruption | above_10 |
|---|---|---|---|---|---|---|---|---|---|
| count | 781.0 | 781.0 | 781.0 | 781.0 | 781.0 | 781.0 | 781.0 | 781.0 | 781.0 |
| mean | 78.773367 | 5.377232 | 0.914537 | 1.07878 | 0.612342 | 0.411254 | 0.218618 | 0.125436 | 0.06402 |
| std | 45.162398 | 1.127071 | 0.405403 | 0.329581 | 0.248459 | 0.152911 | 0.122394 | 0.105816 | 0.244946 |
| min | 1.0 | 2.693 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 40.0 | 4.509 | 0.605 | 0.87021 | 0.44006 | 0.31048 | 0.13 | 0.054 | 0.0 |
| 50% | 79.0 | 5.321 | 0.982 | 1.125 | 0.647239 | 0.431 | 0.202 | 0.091 | 0.0 |
| 75% | 118.0 | 6.182 | 1.233748 | 1.328 | 0.808 | 0.531 | 0.27906 | 0.15603 | 0.0 |
| max | 158.0 | 7.769 | 1.870766 | 1.644 | 1.141 | 0.724 | 0.838075 | 0.55191 | 1.0 |
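The imports above include `LinearRegression`, and the preprocessing section mentions regression analysis. The following is a minimal sketch of how such a fit might look, assuming `Score` is regressed on the six factor columns. `fit_score_model` is a hypothetical helper, and synthetic data stands in for `all_years`, so the coefficients printed here say nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

FACTORS = [
    "GDP per capita",
    "Social support",
    "Healthy life expectancy",
    "Freedom to make life choices",
    "Generosity",
    "Perceptions of corruption",
]

def fit_score_model(df: pd.DataFrame) -> LinearRegression:
    """Fit Score on the six factors by ordinary least squares."""
    model = LinearRegression()
    model.fit(df[FACTORS], df["Score"])
    return model

# Synthetic stand-in for all_years: random factors, Score = their sum + 2.
rng = np.random.default_rng(0)
toy = pd.DataFrame(
    rng.uniform(0, 1.5, size=(50, len(FACTORS))), columns=FACTORS
)
toy["Score"] = toy[FACTORS].sum(axis=1) + 2.0

model = fit_score_model(toy)
coefficients = pd.Series(model.coef_, index=FACTORS)
r_squared = model.score(toy[FACTORS], toy["Score"])
print(coefficients)
print(model.intercept_, r_squared)
```

With the real data, replacing `toy` by `all_years` would give one coefficient per factor. Those coefficients describe association within the sample rather than causal effects.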


# Evaluation of Significance

# Conclusion

# Source Code