In [20]:
import pandas as pd
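from IPython.display import HTML, display
# Widen code cells. display() is required because this is not the cell's last expression.
display(HTML('''<style>.CodeMirror{min-width:100% !important;}</style>'''))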

df = pd.read_csv("Data/GeneratedData/Raw_Data.csv", index_col = "Unnamed: 0")

# Impact of Work-Related Factors on Happiness

By: Matt Ring

# <u> Problem Statement </u>

"How might factors such as union membership, wages, and working time relate to life satisfaction?"

# <u> Background </u>

## Reasoning
*  Much of life is spent at work

## Data
* 1716 Observations
* Sources
    * International Labour Organization Statistics (ILOSTAT)
    * Our World in Data
    * Varieties of Democracy
    * Maddison Project Database 2020

In [21]:
df.sample(10)

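| Unnamed: 0 | Country | Year | Life_Satisfaction | EDI | Suffrage | Diag_Account | AFI | Democracy | GDPpc | Avg_Hours_Worked | U_Coverage | U_Density | Ineq_Diff | Ineq_Frac |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1218 | Poland | 2005 | 5.587209 | 0.877 | 100.0 | 0.952 | 0.963 | 1.0 | 15580.936718 | NaN | 18.1 | 18.1 | 24.883406 | 8.972356 |
| 626 | Hong Kong | 2011 | 5.474011 | 0.365 | 100.0 | 0.891 | 0.722 | 0.0 | 44532.0 | NaN | 24.7 | 24.7 | 25.735734 | 12.383217 |
| 573 | Greece | 2016 | 5.302619 | 0.872 | 100.0 | 0.954 | NaN | 1.0 | 22574.0 | NaN | 18.6 | 18.6 | 13.778164 | 6.157404 |
| 799 | Kosovo | 2016 | 5.759412 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 850 | Lebanon | 2012 | 4.572567 | 0.52 | 100.0 | 0.838 | 0.645 | 1.0 | 16315.0 | NaN | NaN | NaN | 39.693419 | 17.541279 |
| 935 | Malaysia | 2013 | 5.7702 | 0.317 | 100.0 | 0.503 | 0.599 | 0.0 | 20760.0 | NaN | 9.4 | 9.4 | 30.650278 | 14.933027 |
| 1005 | Mongolia | 2008 | 4.49301 | 0.667 | 100.0 | 0.938 | 0.912 | 1.0 | 6982.14643 | 46.7338 | NaN | NaN | 43.051582 | 24.413432 |
| 1543 | Tunisia | 2018 | 4.741132 | 0.729 | 100.0 | 0.918 | 0.727 | 1.0 | 11353.886488 | NaN | NaN | NaN | NaN | NaN |
| 544 | Germany | 2011 | 6.621312 | 0.876 | 100.0 | 0.965 | 0.956 | 1.0 | 43189.0 | NaN | 18.4 | 18.4 | 20.910358 | 11.179453 |
| 739 | Japan | 2007 | 6.238198 | 0.836 | 100.0 | 0.927 | 0.748 | 1.0 | 35892.712773 | NaN | 18.1 | 18.1 | 27.128777 | 13.112454 |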


# <u> Methods Considered </u>
## 1. Data
* Feature Engineering
* Dummy Variables
* Interaction Variables
* Scaling and Normalization
    
## 2. Models
* Models - Interpretable vs. Accurate
* Assessment - Mean Squared Error (MSE)
* Cross Validation - K-Fold or Bootstrapping   
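
For reference, the assessment metric can be written out using the standard definition of MSE. The notation is an assumption of this note rather than something taken from the analysis: $y_i$ is the observed life satisfaction, $\hat{y}_i$ is the model's prediction, and $n$ is the number of held-out observations.

$$
\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
$$

Lower is better, and the value is in squared units of the life satisfaction score.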
    
## 3. Visualizations
* Correlation Matrix
* Feature Importance Bar Chart
* Cross Validation Learning Curves

# <u> Methods Used </u>

## 1. Data 

* Cleaning
* Feature Engineering - Inequality
* Feature Elimination - Multicollinearity
* Preprocessing
    * K-Fold - 10 times
    * StandardScaler
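
A minimal sketch of how the preprocessing steps above might fit together, assuming that "10 times" means 10 folds and using synthetic stand-in data rather than the project's data. The variable names (`X`, `y`, `model`, `fold_mse`) are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical); swap in the real feature matrix and target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X @ np.array([0.8, -0.3, 0.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Scaling lives inside the pipeline so it is re-fit on each training fold only,
# which keeps information from the validation fold out of the scaler.
model = make_pipeline(StandardScaler(), LinearRegression())
cv = KFold(n_splits=10, shuffle=True, random_state=0)

# scikit-learn reports negated MSE, so flip the sign to get ordinary MSE.
fold_mse = -cross_val_score(model, X, y, cv=cv, scoring="neg_mean_squared_error")
mean_mse = fold_mse.mean()
print(f"10-fold MSE: {mean_mse:.3f}")
```

Putting `StandardScaler` in the pipeline, rather than scaling the whole dataset up front, is a design choice that avoids leaking validation statistics into training.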
    
### Constructed Datasets
1. Full Data (~1700 observations)
2. Restricted Data (~1250 observations)
3. Imputed Data (~1700 observations)
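
The slides do not say how the three datasets were built. One plausible reading, sketched here on a tiny hypothetical frame, is that the restricted data keeps only complete rows and the imputed data fills gaps. Mean imputation and the choice of `features` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Tiny hypothetical frame mimicking the sparsity seen in the raw data.
toy = pd.DataFrame({
    "Life_Satisfaction": [5.6, 5.5, 5.3, 4.6],
    "GDPpc": [15580.9, 44532.0, np.nan, 16315.0],
    "U_Density": [18.1, 24.7, 18.6, np.nan],
})
features = ["GDPpc", "U_Density"]

# Full: keep every row, missing values included.
full = toy.copy()

# Restricted: keep only rows where every feature is observed.
restricted = toy.dropna(subset=features)

# Imputed: keep every row and fill gaps with column means (assumed strategy).
imputed = toy.copy()
imputed[features] = SimpleImputer(strategy="mean").fit_transform(toy[features])

print(len(full), len(restricted), len(imputed))
```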

<img src = "Figures/Feat_Corr_Reduced.png">
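
One common way to eliminate multicollinear features is to drop any column that is highly correlated with an earlier one. The sketch below illustrates that idea on hypothetical data. The helper `drop_correlated` and the 0.9 threshold are assumptions, not the procedure used to produce the figure above.

```python
import numpy as np
import pandas as pd

def drop_correlated(frame: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    """Drop every column whose |correlation| with an earlier column exceeds threshold."""
    corr = frame.corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return frame.drop(columns=to_drop)

# Hypothetical example: U_Coverage duplicates U_Density, so the later column is dropped.
rng = np.random.default_rng(0)
toy = pd.DataFrame({"GDPpc": rng.normal(size=50), "U_Density": rng.normal(size=50)})
toy["U_Coverage"] = toy["U_Density"]

reduced = drop_correlated(toy)
print(list(reduced.columns))
```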

## 2. Algorithms

1. Linear Regression (LR)
2. K-Nearest Neighbors (KNN)
    * Number of neighbors
3. Support Vector Regression (SVR)
    * RBF kernel, tolerance, C, and gamma
4. Random Forest
    * max_depth and n_estimators
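
A hedged sketch of how the hyperparameters listed above could be searched with scikit-learn's `GridSearchCV`. The grid values and the synthetic data are illustrative assumptions, since the values actually searched are not given in the slides.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data (hypothetical).
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 3))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=150)

# Illustrative grids only; these are not the values used in the project.
candidates = {
    "KNN": (KNeighborsRegressor(), {"model__n_neighbors": [3, 5, 7]}),
    "SVR": (SVR(kernel="rbf"),
            {"model__C": [1, 10], "model__gamma": ["scale", 0.1], "model__tol": [1e-3]}),
    "Random Forest": (RandomForestRegressor(random_state=0),
                      {"model__max_depth": [3, None], "model__n_estimators": [50]}),
}

results = {}
for name, (estimator, grid) in candidates.items():
    # Scale inside the pipeline so each CV fold is scaled independently.
    pipe = Pipeline([("scale", StandardScaler()), ("model", estimator)])
    search = GridSearchCV(pipe, grid, cv=10, scoring="neg_mean_squared_error")
    search.fit(X, y)
    # best_score_ is negated MSE; flip the sign for reporting.
    results[name] = (-search.best_score_, search.best_params_)

for name, (mse, params) in results.items():
    print(f"{name}: MSE={mse:.3f}, params={params}")
```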
        
## 3. Fixed Effects OLS

* Essentially `lm` from R with `Country` as 100+ dummy variables
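
To make the dummy-variable idea concrete, here is a minimal sketch on a hypothetical three-country panel. It uses plain least squares via NumPy rather than R's `lm`, and all names and values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical panel: three countries with different baselines and a shared slope of 0.5.
rng = np.random.default_rng(0)
countries = np.repeat(["A", "B", "C"], 30)
baseline = pd.Series(countries).map({"A": 4.0, "B": 5.5, "C": 7.0}).to_numpy()
gdp = rng.normal(size=90)
panel = pd.DataFrame({
    "Country": countries,
    "GDPpc": gdp,
    "Life_Satisfaction": baseline + 0.5 * gdp + rng.normal(scale=0.05, size=90),
})

# One dummy per country except the first, which the intercept absorbs
# (the same treatment coding that R's lm applies to a factor).
dummies = pd.get_dummies(panel["Country"], drop_first=True, dtype=float)
design = pd.concat([panel[["GDPpc"]], dummies], axis=1)
design.insert(0, "Intercept", 1.0)

# Ordinary least squares on the design matrix.
coef, *_ = np.linalg.lstsq(
    design.to_numpy(), panel["Life_Satisfaction"].to_numpy(), rcond=None
)
fit = pd.Series(coef, index=design.columns)
print(fit.round(2))
```

The country dummies soak up each country's baseline level, so the `GDPpc` coefficient reflects within-country variation only.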

# <u> Preliminary Results </u>
    
## 1. Best Model - KNN Using the Restricted Data
* Nearest Neighbors = 3
* MSE = 0.17
* Most Important Feature: GDP (importance of 0.83)

## 2. **Fixed Effects OLS**
* Full
    * Academic Freedom @ -2.77
* Restricted
    * Accountability @ -0.64
    * GDP per Capita @ 0.66
    * Inequality @ -0.32
    * Academic Freedom @ 1.30
* Imputed
    * Academic Freedom @ -2.76

<img src = "Figures/FeatImp_Restricted.png">
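
KNN has no built-in feature importances, and the slides do not say how the scores in the figure were computed. Permutation importance is one model-agnostic option, sketched here on synthetic data with hypothetical feature names. It should not be read as the method behind the figure.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical): one strong, one weak, one irrelevant feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
y = 2.0 * X[:, 0] + 0.2 * X[:, 1] + rng.normal(scale=0.1, size=400)
names = ["strong", "weak", "noise"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=3))
model.fit(X_train, y_train)

# Importance = average increase in held-out MSE when one column is shuffled.
result = permutation_importance(
    model, X_test, y_test,
    scoring="neg_mean_squared_error", n_repeats=10, random_state=0,
)
ranking = sorted(zip(names, result.importances_mean), key=lambda pair: pair[1], reverse=True)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```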

# <u> Preliminary Conclusions </u>

    
* KNN is the Best Model
* Important Feature(s)
    * GDP
    * Academic Freedom
* Imputing isn't helpful

# <u> Lessons Learned / Future Work </u>

## 1) Data Sparsity
Success: Showing whether any work statistic has a meaningful impact on happiness.
* Possible Fix: Mean Monthly Earnings Data

## 2) Models
* Poor performance among other models
* Cross Validation  

   
## 3) Interpretations
* Partial Dependency Plots
* Fixed Effects
* Feature Importances for Regression

# <u> Thanks for Watching! </u>
All feedback is appreciated