<h1> Data Processing for the World Happiness Report </h1>

<h2> Objective </h2>
<ul>
    <li><a href="#MDP">Missing Data Processing</a></li>
    <li><a href="#DFM">Data Formatting</a></li>
    <li><a href="#DS">Data Standardization</a></li>
    <li><a href="#DN">Data Normalization</a></li>
    <li><a href="#B">Binning</a></li>
    <li><a href="#IV">Indicator Variable</a></li>
    <li><a href="#Export">Export to a new CSV</a></li>
</ul>

## 1. Libraries

In [72]:
# Kernel: DSE
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

## 2. Load WHR dataset

In [73]:
filename = "/Users/chriz_yu/Documents/Datasets/WHR2023.csv"
df = pd.read_csv(filename)
df.head()

Unnamed: 0,Country name,Ladder score,Standard error of ladder score,upperwhisker,lowerwhisker,Logged GDP per capita,Social support,Healthy life expectancy,Freedom to make life choices,Generosity,Perceptions of corruption,Ladder score in Dystopia,Explained by: Log GDP per capita,Explained by: Social support,Explained by: Healthy life expectancy,Explained by: Freedom to make life choices,Explained by: Generosity,Explained by: Perceptions of corruption,Dystopia + residual
0,Finland,7.804,0.036,7.875,7.733,10.792,0.969,71.15,0.961,-0.019,0.182,1.778,1.888,1.585,0.535,0.772,0.126,0.535,2.363
1,Denmark,7.586,0.041,7.667,7.506,10.962,0.954,71.25,0.934,0.134,0.196,1.778,1.949,1.548,0.537,0.734,0.208,0.525,2.084
2,Iceland,7.53,0.049,7.625,7.434,10.896,0.983,72.05,0.936,0.211,0.668,1.778,1.926,1.62,0.559,0.738,0.25,0.187,2.25
3,Israel,7.473,0.032,7.535,7.411,10.639,0.943,72.697,0.809,-0.023,0.708,1.778,1.833,1.521,0.577,0.569,0.124,0.158,2.691
4,Netherlands,7.403,0.029,7.46,7.346,10.942,0.93,71.55,0.887,0.213,0.379,1.778,1.942,1.488,0.545,0.672,0.251,0.394,2.11


In [74]:
# check columns
df.columns

Index(['Country name', 'Ladder score', 'Standard error of ladder score',
       'upperwhisker', 'lowerwhisker', 'Logged GDP per capita',
       'Social support', 'Healthy life expectancy',
       'Freedom to make life choices', 'Generosity',
       'Perceptions of corruption', 'Ladder score in Dystopia',
       'Explained by: Log GDP per capita', 'Explained by: Social support',
       'Explained by: Healthy life expectancy',
       'Explained by: Freedom to make life choices',
       'Explained by: Generosity', 'Explained by: Perceptions of corruption',
       'Dystopia + residual'],
      dtype='object')

In [75]:
# check data types
df.dtypes

Country name                                   object
Ladder score                                  float64
Standard error of ladder score                float64
upperwhisker                                  float64
lowerwhisker                                  float64
Logged GDP per capita                         float64
Social support                                float64
Healthy life expectancy                       float64
Freedom to make life choices                  float64
Generosity                                    float64
Perceptions of corruption                     float64
Ladder score in Dystopia                      float64
Explained by: Log GDP per capita              float64
Explained by: Social support                  float64
Explained by: Healthy life expectancy         float64
Explained by: Freedom to make life choices    float64
Explained by: Generosity                      float64
Explained by: Perceptions of corruption       float64
Dystopia + residual                           float64
dtype: object

In [76]:
# check data distribution
df.describe()

Unnamed: 0,Ladder score,Standard error of ladder score,upperwhisker,lowerwhisker,Logged GDP per capita,Social support,Healthy life expectancy,Freedom to make life choices,Generosity,Perceptions of corruption,Ladder score in Dystopia,Explained by: Log GDP per capita,Explained by: Social support,Explained by: Healthy life expectancy,Explained by: Freedom to make life choices,Explained by: Generosity,Explained by: Perceptions of corruption,Dystopia + residual
count,137.0,137.0,137.0,137.0,137.0,137.0,136.0,137.0,137.0,137.0,137.0,137.0,137.0,136.0,137.0,137.0,137.0,136.0
mean,5.539796,0.064715,5.666526,5.412971,9.449796,0.799073,64.967632,0.787394,0.022431,0.725401,1.778,1.406985,1.156212,0.366176,0.54,0.148474,0.145898,1.777838
std,1.139929,0.023031,1.117421,1.163724,1.207302,0.129222,5.75039,0.112371,0.141707,0.176956,0.0,0.432963,0.326322,0.156691,0.149501,0.076053,0.126723,0.50439
min,1.859,0.029,1.923,1.795,5.527,0.341,51.53,0.382,-0.254,0.146,1.778,0.0,0.0,0.0,0.0,0.0,0.0,-0.11
25%,4.724,0.047,4.98,4.496,8.591,0.722,60.6485,0.724,-0.074,0.668,1.778,1.099,0.962,0.2485,0.455,0.097,0.06,1.55525
50%,5.684,0.06,5.797,5.529,9.567,0.827,65.8375,0.801,0.001,0.774,1.778,1.449,1.227,0.3895,0.557,0.137,0.111,1.8485
75%,6.334,0.077,6.441,6.243,10.54,0.896,69.4125,0.874,0.117,0.846,1.778,1.798,1.401,0.4875,0.656,0.199,0.187,2.07875
max,7.804,0.147,7.875,7.733,11.66,0.983,77.28,0.961,0.531,0.929,1.778,2.2,1.62,0.702,0.772,0.422,0.561,2.955


<h2 id="MDP">3. Missing Data Processing</h2>

In [77]:
missing_data = df.isnull()
missing_data.head()

Unnamed: 0,Country name,Ladder score,Standard error of ladder score,upperwhisker,lowerwhisker,Logged GDP per capita,Social support,Healthy life expectancy,Freedom to make life choices,Generosity,Perceptions of corruption,Ladder score in Dystopia,Explained by: Log GDP per capita,Explained by: Social support,Explained by: Healthy life expectancy,Explained by: Freedom to make life choices,Explained by: Generosity,Explained by: Perceptions of corruption,Dystopia + residual
0,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False


In [78]:
col_with_missing = []
for column in missing_data.columns:
    # a column needs attention if any of its entries is flagged as missing
    if missing_data[column].any():
        print(column)
        col_with_missing.append(column)
        print(missing_data[column].value_counts())
        print("")

Healthy life expectancy
False    136
True       1
Name: Healthy life expectancy, dtype: int64

Explained by: Healthy life expectancy
False    136
True       1
Name: Explained by: Healthy life expectancy, dtype: int64

Dystopia + residual
False    136
True       1
Name: Dystopia + residual, dtype: int64
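<p>The same per-column check can be written more compactly with <code>isnull().sum()</code>. The sketch below is only an illustration on a small made-up frame: <code>toy</code>, its values, and <code>col_with_missing_alt</code> are hypothetical names, not part of the WHR data or the notebook above.</p>

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the WHR data (values are illustrative only).
toy = pd.DataFrame({
    "Country name": ["A", "B", "C"],
    "Ladder score": [7.8, 7.5, 7.4],
    "Healthy life expectancy": [71.1, np.nan, 72.0],
})

# Count missing values per column, then keep only columns with at least one.
missing_counts = toy.isnull().sum()
missing_counts = missing_counts[missing_counts > 0]

# Hypothetical counterpart of col_with_missing from the loop above.
col_with_missing_alt = missing_counts.index.tolist()
print(missing_counts)
```

<p>This gives one count per affected column instead of a <code>value_counts()</code> table, which is usually enough to decide how to handle the gaps.</p>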



<h3>Three columns contain missing data</h3>
<p>Each of the three columns has exactly one missing value; replace it with the column mean.</p>

In [79]:
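for column in col_with_missing:
    # mean() skips NaN by default, so the average uses only the observed values
    avg = df[column].mean()
    # assign back rather than calling replace(..., inplace=True) on a column selection
    df[column] = df[column].fillna(avg)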

In [80]:
missing_data = df.isnull()
for column in col_with_missing:
    print(f'{column} miss data: {missing_data[column].any()}')

Healthy life expectancy miss data: False
Explained by: Healthy life expectancy miss data: False
Dystopia + residual miss data: False
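<p>As a rough sketch of what mean imputation does, the example below fills the gaps in a small made-up frame in a single call. The frame <code>toy</code> and its numbers are assumptions chosen so the result is easy to verify by hand; they are not WHR values.</p>

```python
import numpy as np
import pandas as pd

# Toy frame with one gap per column (illustrative values, not WHR data).
toy = pd.DataFrame({
    "Healthy life expectancy": [70.0, np.nan, 72.0],
    "Dystopia + residual": [2.0, 2.5, np.nan],
})

# Column means are computed from the observed values only (NaN is skipped).
column_means = toy.mean(numeric_only=True)

# Passing a Series to fillna fills each column with its own mean.
filled = toy.fillna(column_means)
print(filled)
```

<p>With a single missing value per column, as here, the imputed value barely changes the column's distribution; with many gaps, mean imputation would shrink the variance noticeably.</p>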


<h2 id="DFM">4. Data Formatting</h2>

In [81]:
# check data types
df.dtypes

Country name                                   object
Ladder score                                  float64
Standard error of ladder score                float64
upperwhisker                                  float64
lowerwhisker                                  float64
Logged GDP per capita                         float64
Social support                                float64
Healthy life expectancy                       float64
Freedom to make life choices                  float64
Generosity                                    float64
Perceptions of corruption                     float64
Ladder score in Dystopia                      float64
Explained by: Log GDP per capita              float64
Explained by: Social support                  float64
Explained by: Healthy life expectancy         float64
Explained by: Freedom to make life choices    float64
Explained by: Generosity                      float64
Explained by: Perceptions of corruption       float64
Dystopia + residual                           float64
dtype: object

<p>No reformatting is needed: "Country name" is stored as a string (object) and every other column is already float64.</p>
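<p>Reading the dtype listing by eye works for a small table; a programmatic check is one possible alternative. The sketch below assumes that every column except "Country name" should be numeric, and it runs on a two-row excerpt (<code>toy</code>, a hypothetical name) rather than the full dataset.</p>

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Two-row excerpt using values shown in the head() output above.
toy = pd.DataFrame({
    "Country name": ["Finland", "Denmark"],
    "Ladder score": [7.804, 7.586],
    "Social support": [0.969, 0.954],
})

# Collect any column, other than the country label, that is not numeric.
non_numeric = [
    col for col in toy.columns
    if col != "Country name" and not is_numeric_dtype(toy[col])
]
print(non_numeric)
```

<p>An empty list means no column needs a type conversion.</p>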

<h2 id="DS">5. Data Standardization</h2>

<p>No columns need to be standardized.</p>

<h2 id="DN">6. Data Normalization</h2>

<p>Data normalization is performed in the Modeling notebook (.ipynb).</p>
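<p>The Modeling notebook is not reproduced here, so the method it uses is not known from this document. Purely as an illustration, the sketch below shows min-max scaling, one common normalization choice, on made-up values; <code>toy</code> and <code>normalized</code> are hypothetical names.</p>

```python
import pandas as pd

# Made-up values chosen so the scaled result is easy to check by hand.
toy = pd.DataFrame({
    "Logged GDP per capita": [10.0, 8.0, 12.0],
    "Healthy life expectancy": [70.0, 60.0, 65.0],
})

# Min-max scaling: map each column linearly onto the range [0, 1].
normalized = (toy - toy.min()) / (toy.max() - toy.min())
print(normalized)
```

<p>A constant column such as "Ladder score in Dystopia" (standard deviation 0.0 in the <code>describe()</code> output) would produce NaN under this formula, because its range is zero, so it would have to be dropped or skipped first.</p>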

<h2 id="B">7. Binning</h2>

<p>No columns need binning.</p>

<h2 id="IV">8. Indicator Variable</h2>

<p>No columns need to be converted to dummy (indicator) variables.</p>

<h2 id="Export">9. Export to a new CSV</h2>

In [82]:
df.to_csv("/Users/chriz_yu/Documents/Datasets/WHR2023_P.csv")
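<p>By default, <code>to_csv</code> also writes the row index as an unnamed first column, which reappears as <code>Unnamed: 0</code> when the file is read back without <code>index_col</code>. The sketch below illustrates the difference on a two-row frame written to a temporary directory; the file name <code>toy_processed.csv</code> is hypothetical, and whether the downstream notebook expects the index column is not stated here.</p>

```python
import os
import tempfile

import pandas as pd

# Two-row excerpt using values shown in the head() output above.
toy = pd.DataFrame({
    "Country name": ["Finland", "Denmark"],
    "Ladder score": [7.804, 7.586],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "toy_processed.csv")  # hypothetical file name

    # Default behaviour: the row index is written as an extra, unnamed column.
    toy.to_csv(out_path)
    with_index = pd.read_csv(out_path)

    # index=False writes only the data columns.
    toy.to_csv(out_path, index=False)
    without_index = pd.read_csv(out_path)

print(with_index.columns.tolist())
print(without_index.columns.tolist())
```

<p>If the index carries no information, passing <code>index=False</code> on export (or <code>index_col=0</code> on import) keeps the extra column out of later analysis.</p>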