# 2006 to 2019 Data Exploration and Cleaning

In [1]:
import warnings

import matplotlib.pyplot as plt
import pandas as pd
import plotly.express as px
import seaborn as sns
import statsmodels.formula.api as smf

# Silence library warnings to keep the notebook output readable
warnings.simplefilter('ignore')

## Goals of the exploration notebook
- Examine dataset
- Look at columns
- Clean and rename if needed
- Export as clean csv

## Questions to be explored in the analysis folder
- Question 1: What are the highest and lowest happiness ratings in the most recent year (2019)?
- Question 2: How have happiness ratings changed by region since 2006?
- Question 3: Is happiness correlated with GDP per capita and with income inequality?
- Question 4: Is happiness correlated with social support?
- Question 5: Is happiness correlated with health (life expectancy)?
- Question 6: Is happiness correlated with faith in the government?
- Question 7: Is happiness correlated with positive and negative affect?

## Preliminary analysis of the 2006-2019 data

#### Load the data into a data frame called `hap_comp_df`

In [2]:
hap_comp_df = pd.read_csv('../data/raw_data/06_to_19.csv')

#### Inspect the dataset

In [3]:
hap_comp_df.shape

(1821, 27)

In [4]:
hap_comp_df.columns.values

array(['iso_alpha', 'Country name', 'year', 'Life Ladder',
       'Log GDP per capita', 'Social support',
       'Healthy life expectancy at birth', 'Freedom to make life choices',
       'Generosity', 'Perceptions of corruption', 'Positive affect',
       'Negative affect', 'Confidence in national government',
       'Democratic Quality', 'Delivery Quality',
       'Standard deviation of ladder by country-year',
       'Standard deviation/Mean of ladder by country-year',
       'GINI index (World Bank estimate)',
       'GINI index (World Bank estimate), average 2000-2017, unbalanced panel',
       'gini of household income reported in Gallup, by wp5-year',
       'Most people can be trusted, Gallup',
       'Most people can be trusted, WVS round 1981-1984',
       'Most people can be trusted, WVS round 1989-1993',
       'Most people can be trusted, WVS round 1994-1998',
       'Most people can be trusted, WVS round 1999-2004',
       'Most people can be trusted, WVS round 2005-2009'

#### Initial Observations
- The columns record the following:
    - `Country name`: the country or location name
    - `year`: the year the data represents
    - `Life Ladder`: measure of happiness based on life satisfaction on a scale of 0 to 10 (the Cantril ladder question in the Gallup World Poll)
    - `Log GDP per capita`: logarithm of GDP per capita
    - `Social support`: average of self-reported binary answers to "If you were in trouble, do you have relatives or friends you can count on to help you whenever you need them, or not?"
    - `Healthy life expectancy at birth`: healthy life expectancy at birth, in years
    - `Confidence in national government`, `Perceptions of corruption`: measures of trust and faith in the government
    - `Democratic Quality`: governance measure covering voice and accountability as well as political stability and absence of violence (assuming the World Happiness Report definitions)
    - `Delivery Quality`: governance measure covering government effectiveness, regulatory quality, rule of law, and control of corruption (assuming the World Happiness Report definitions)
    - `Freedom to make life choices`: self-reported satisfaction with the freedom to choose what to do with one's life, used here as an indicator of individualism
    - `Generosity`: self-reported measure of charitable generosity
    - `Positive affect`: average of self-reported answers to 3 questions about experiencing happiness, smiling or laughing, and enjoyment
    - `Negative affect`: average of self-reported answers to 3 questions about experiencing worry, sadness, and anger
    - `Most people can be trusted` (Gallup and WVS rounds): levels of trust in others; the WVS rounds appear to come from a longitudinal dataset that goes back further than 2006
    - `GINI index (World Bank estimate)`: income inequality measure
    - `GINI index (World Bank estimate), average 2000-2017, unbalanced panel`: average of the GINI index (World Bank estimate) over the available years
    - `gini of household income reported in Gallup, by wp5-year`: Gini measure of household income inequality from Gallup data, standardized across currencies
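
Several of these columns, the WVS trust rounds in particular, are likely to be sparse. A minimal sketch of how the missing-value share per column could be checked follows. It uses a small invented `toy_df` rather than the real file, and `missing_share` is a hypothetical helper name, not part of this notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for hap_comp_df; the values are invented for illustration only.
toy_df = pd.DataFrame({
    "Country name": ["A", "A", "B", "B"],
    "year": [2006, 2019, 2006, 2019],
    "Life Ladder": [5.1, 5.4, 6.8, 7.0],
    "Most people can be trusted, WVS round 1981-1984": [np.nan, np.nan, 0.3, np.nan],
})


def missing_share(df: pd.DataFrame) -> pd.Series:
    """Return the fraction of missing values per column, largest first."""
    return df.isna().mean().sort_values(ascending=False)


missing = missing_share(toy_df)
print(missing)

# Year coverage is a quick sanity check against the 2006-2019 range in the title.
year_range = toy_df["year"].agg(["min", "max"])
print(year_range)
```

On the real data, the same call would be `missing_share(hap_comp_df)`. Columns with a very high missing share may be poor candidates for the correlation questions.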

#### Rename columns

In [5]:
# Rename columns to short snake_case names
name_dict = {
    'Country name': 'country',
    'Life Ladder': 'happiness_rating',
    'Log GDP per capita': 'GDP_per_capita',
    'Social support': 'social_support_rating',
    'Healthy life expectancy at birth': 'life_expectancy',
    'Positive affect': 'positive_affect',
    'Negative affect': 'negative_affect',
    'Perceptions of corruption': 'corruption_perceptions',
    'Democratic Quality': 'democratic_quality',
    'Delivery Quality': 'delivery_quality',
    'Confidence in national government': 'confidence_gvt',
    'gini of household income reported in Gallup, by wp5-year': 'income_inequality',
    'Most people can be trusted, WVS round 1981-1984': 'trust_81_to_84',
    'Most people can be trusted, WVS round 1989-1993': 'trust_89_to_93',
    'Most people can be trusted, WVS round 1994-1998': 'trust_94_to_98',
    'Most people can be trusted, WVS round 1999-2004': 'trust_99_to_04',
    'Most people can be trusted, WVS round 2005-2009': 'trust_05_to_09',
    'Most people can be trusted, WVS round 2010-2014': 'trust_10_to_14',
}
hap_comp_df = hap_comp_df.rename(columns=name_dict)
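
By default, `DataFrame.rename` silently skips mapping keys that do not match any column, so a typo in a long column name would go unnoticed. The sketch below shows one way to list unmatched keys before renaming. `unmatched_rename_keys`, `toy_df`, and `toy_map` are hypothetical names used only for illustration.

```python
import pandas as pd


def unmatched_rename_keys(df: pd.DataFrame, mapping: dict) -> list:
    """Return the mapping keys that are not column names in df, sorted."""
    return sorted(set(mapping) - set(df.columns))


# Toy stand-ins; the third key has a deliberate capitalization typo.
toy_df = pd.DataFrame(columns=["Country name", "Life Ladder", "year"])
toy_map = {
    "Country name": "country",
    "Life Ladder": "happiness_rating",
    "Life ladder": "typo_example",
}

unmatched = unmatched_rename_keys(toy_df, toy_map)
print(unmatched)

# The default rename ignores the unmatched key and renames the rest.
renamed = toy_df.rename(columns=toy_map)
print(list(renamed.columns))
```

An alternative is to pass `errors='raise'` to `rename`, which raises a `KeyError` when a key is not found. The explicit check has the advantage of reporting every unmatched key at once.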

In [6]:
# Export the renamed data as a clean csv; index=False avoids writing the row index as an extra unnamed column
hap_comp_df.to_csv('../data/cleaned_data/06_19_hap_cleaned.csv', index=False)
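
Whether the row index is written changes what later notebooks see when they reload the cleaned file. The sketch below is a small round trip on an invented `toy_df`, using in-memory buffers instead of the real path, to show the difference.

```python
import io

import pandas as pd

# Toy stand-in for the cleaned frame; the values are invented for illustration only.
toy_df = pd.DataFrame({"country": ["A", "B"], "happiness_rating": [5.1, 6.8]})

# Default behavior: the row index is written as a leading column with no header.
with_index = io.StringIO()
toy_df.to_csv(with_index)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)

# With index=False, only the data columns are written.
without_index = io.StringIO()
toy_df.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)
print(cols_without_index)
```

If a file was already exported with the index, reading it back with `pd.read_csv(path, index_col=0)` is one way to avoid the extra column.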