<img src="https://bit.ly/2VnXWr2" width="100" align="left">

# Project | Statistical Analysis: Does culture affect happiness perception?

## Introduction

TODO: write a brief introduction.

### Objectives

We want to see whether there is a meaningful correlation between how individualistic or collectivistic a country is and how happy its citizens report being.

### Imports

In [1]:
import plotly.graph_objs as go
from ipywidgets import interact
import cufflinks as cf
import plotly.offline as py
import numpy as np
import pandas as pd
import re
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import scipy as sp
import scipy.stats as stats
%matplotlib inline
cf.go_offline()

from pandas.testing import assert_frame_equal
from pandas_profiling import ProfileReport
#pd.set_option("display.max_rows", None)
#pd.set_option("display.max_columns", None)

## 1. EDA

### Hofstede's six-dimensions model (2015)

#### Context

**Geert Hofstede's cultural dimensions theory** proposes a method of analyzing cultures based on a handful of continuums.

`Power distance index (PDI):` The power distance index is defined as “the extent to which the less powerful members of organizations and institutions (like the family) accept and expect that power is distributed unequally.”

`Individualism vs. collectivism (IDV):` This index explores the “degree to which people in a society are integrated into groups.”

`Uncertainty avoidance index (UAI):` The uncertainty avoidance index is defined as “a society's tolerance for ambiguity,” in which people embrace or avert an event of something unexpected, unknown, or away from the status quo.

`Masculinity vs. femininity (MAS):` In this dimension, masculinity is defined as “a preference in society for achievement, heroism, assertiveness and material rewards for success.”

`Long-term orientation vs. short-term orientation (LTO):` This dimension associates the connection of the past with the current and future actions/challenges.

`Indulgence vs. restraint (IND):` This dimension is essentially a measure of happiness; whether or not simple joys are fulfilled.

More context on this model can be found here: https://scholarworks.gvsu.edu/cgi/viewcontent.cgi?referer=https://en.wikipedia.org/&httpsredir=1&article=1014&context=orpc

#### Read dataset and check head

In [2]:
cultural_dimensinality = pd.read_csv("data/culturaldimensions/6_dimensions_for_website.csv")
cultural_dimensinality.head()

Unnamed: 0,ctr,country,pdi,idv,mas,uai,ltowvs,ivr
0,AFE,Africa East,64.0,27.0,41.0,52.0,32.0,40.0
1,AFW,Africa West,77.0,20.0,46.0,54.0,9.0,78.0
2,ALB,Albania,,,,,61.460957,14.508929
3,ALG,Algeria,,,,,25.944584,32.366071
4,AND,Andorra,,,,,,65.0


##### Minor manipulation

In [3]:
# The raw column names can be confusing, so rename them to the model's original short forms and reorder them to match the list above
cultural_dimensinality.columns=["ctr", "country", "pdi", "idv", "mas", "uai", "lto", "ind"]
cultural_dimensinality = cultural_dimensinality[["ctr", "country", "pdi", "idv", "uai", "mas", "lto", "ind"]]
cultural_dimensinality.head()

Unnamed: 0,ctr,country,pdi,idv,uai,mas,lto,ind
0,AFE,Africa East,64.0,27.0,52.0,41.0,32.0,40.0
1,AFW,Africa West,77.0,20.0,54.0,46.0,9.0,78.0
2,ALB,Albania,,,,,61.460957,14.508929
3,ALG,Algeria,,,,,25.944584,32.366071
4,AND,Andorra,,,,,,65.0


#### Check tail

In [28]:
cultural_dimensinality[["ctr", "country", "pdi", "idv", "uai", "mas", "lto", "ind"]]

Unnamed: 0,ctr,country,pdi,idv,uai,mas,lto,ind
0,AFE,Africa East,64.0,27.0,52.0,41.0,32.000000,40.000000
1,AFW,Africa West,77.0,20.0,54.0,46.0,9.000000,78.000000
2,ALB,Albania,,,,,61.460957,14.508929
3,ALG,Algeria,,,,,25.944584,32.366071
4,AND,Andorra,,,,,,65.000000
...,...,...,...,...,...,...,...,...
106,URU,Uruguay,61.0,36.0,100.0,38.0,26.196474,53.348214
107,VEN,Venezuela,81.0,12.0,76.0,73.0,15.617128,100.000000
108,VIE,Vietnam,70.0,20.0,30.0,40.0,57.178841,35.491071
109,ZAM,Zambia,,,,,30.226700,42.187500


In [4]:
# The rows are sorted, so also print the tail
cultural_dimensinality.tail()

Unnamed: 0,ctr,country,pdi,idv,uai,mas,lto,ind
106,URU,Uruguay,61.0,36.0,100.0,38.0,26.196474,53.348214
107,VEN,Venezuela,81.0,12.0,76.0,73.0,15.617128,100.0
108,VIE,Vietnam,70.0,20.0,30.0,40.0,57.178841,35.491071
109,ZAM,Zambia,,,,,30.2267,42.1875
110,ZIM,Zimbabwe,,,,,15.365239,27.678571


#### Check shape

In [5]:
cultural_dimensinality.shape

(111, 8)

#### Check dtypes and columns

In [6]:
cultural_dimensinality.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 111 entries, 0 to 110
Data columns (total 8 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   ctr      111 non-null    object 
 1   country  111 non-null    object 
 2   pdi      78 non-null     float64
 3   idv      78 non-null     float64
 4   uai      78 non-null     float64
 5   mas      78 non-null     float64
 6   lto      96 non-null     float64
 7   ind      97 non-null     float64
dtypes: float64(6), object(2)
memory usage: 7.1+ KB


#### Check nulls

In [7]:
print("Are there any missing values? :",cultural_dimensinality.isnull().any().any())
print(cultural_dimensinality.isnull().sum())

Are there any missing values? : True
ctr         0
country     0
pdi        33
idv        33
uai        33
mas        33
lto        15
ind        14
dtype: int64


#### Check duplicates

In [8]:
print("Are there duplicated values? :",cultural_dimensinality.duplicated().any())
print(cultural_dimensinality.duplicated().sum())

Are there duplicated values? : False
0


#### See some descriptive statistics

In [9]:
cultural_dimensinality.describe()

Unnamed: 0,pdi,idv,uai,mas,lto,ind
count,78.0,78.0,78.0,78.0,96.0,97.0
mean,59.333333,45.166667,67.641026,49.269231,45.479272,45.425534
std,21.223405,23.971529,22.992926,19.007636,24.232016,22.174204
min,11.0,6.0,8.0,5.0,0.0,0.0
25%,42.5,23.5,51.25,40.0,25.566751,29.241071
50%,62.0,43.5,69.5,48.5,44.584383,43.080357
75%,72.5,67.75,86.0,61.75,63.602015,63.0
max,104.0,91.0,112.0,110.0,100.0,100.0


### World Happiness Report 2015

#### Context

**The World Happiness Report 2015** is a landmark survey of the state of global happiness which ranks 158 countries by their happiness levels based on six factors.


`Country:` Name of the country.

`Region:` Region the country belongs to.

`Happiness Rank:` Rank of the country based on the Happiness Score.

`Happiness Score:` A metric measured in 2015 by asking the sampled people the question: "How would you rate your happiness on a scale of 0 to 10 where 10 is the happiest."

`Economy:` Real GDP per capita.

`Family:` Social support.

`Health:` Healthy life expectancy.

`Freedom:` Freedom to make life choices.

`Trust:` Perceptions of corruption.

`Generosity:` Perceptions of generosity.

`Dystopia:` each country is compared against a hypothetical nation that represents the lowest national averages for each key variable and is, along with residual error, used as a regression benchmark
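As a quick sanity check on how these columns relate, the sketch below sums the six factor columns and `Dystopia Residual` for the first three countries and compares the result with `Happiness Score`. The values are copied from the `head()` output in this notebook. That the score decomposes additively in this way is an assumption checked here on three rows only, and the names `sample`, `component_cols`, `reconstructed` and `max_gap` are illustrative.

```python
import pandas as pd

# First three rows of the 2015 table, copied from the head() output in this notebook.
sample = pd.DataFrame({
    "Country": ["Switzerland", "Iceland", "Denmark"],
    "Happiness Score": [7.587, 7.561, 7.527],
    "Economy (GDP per Capita)": [1.39651, 1.30232, 1.32548],
    "Family": [1.34951, 1.40223, 1.36058],
    "Health (Life Expectancy)": [0.94143, 0.94784, 0.87464],
    "Freedom": [0.66557, 0.62877, 0.64938],
    "Trust (Government Corruption)": [0.41978, 0.14145, 0.48357],
    "Generosity": [0.29678, 0.43630, 0.34139],
    "Dystopia Residual": [2.51738, 2.70201, 2.49204],
})

# Every column except the country name and the score itself.
component_cols = [c for c in sample.columns if c not in ("Country", "Happiness Score")]

# Sum the components row by row and measure the largest gap to the reported score.
reconstructed = sample[component_cols].sum(axis=1)
max_gap = (reconstructed - sample["Happiness Score"]).abs().max()
print(f"Largest gap between summed components and score: {max_gap:.5f}")
```

If the gap is negligible, the factor columns can be read as contributions to the score rather than as raw measurements.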

#### Read dataset and check head

In [10]:
whr2015 = pd.read_csv("data/happiness/2015.csv")
whr2015.head()

Unnamed: 0,Country,Region,Happiness Rank,Happiness Score,Standard Error,Economy (GDP per Capita),Family,Health (Life Expectancy),Freedom,Trust (Government Corruption),Generosity,Dystopia Residual
0,Switzerland,Western Europe,1,7.587,0.03411,1.39651,1.34951,0.94143,0.66557,0.41978,0.29678,2.51738
1,Iceland,Western Europe,2,7.561,0.04884,1.30232,1.40223,0.94784,0.62877,0.14145,0.4363,2.70201
2,Denmark,Western Europe,3,7.527,0.03328,1.32548,1.36058,0.87464,0.64938,0.48357,0.34139,2.49204
3,Norway,Western Europe,4,7.522,0.0388,1.459,1.33095,0.88521,0.66973,0.36503,0.34699,2.46531
4,Canada,North America,5,7.427,0.03553,1.32629,1.32261,0.90563,0.63297,0.32957,0.45811,2.45176


#### Check tail

In [11]:
# The rows are sorted by rank, so also print the tail
whr2015.tail()

Unnamed: 0,Country,Region,Happiness Rank,Happiness Score,Standard Error,Economy (GDP per Capita),Family,Health (Life Expectancy),Freedom,Trust (Government Corruption),Generosity,Dystopia Residual
153,Rwanda,Sub-Saharan Africa,154,3.465,0.03464,0.22208,0.7737,0.42864,0.59201,0.55191,0.22628,0.67042
154,Benin,Sub-Saharan Africa,155,3.34,0.03656,0.28665,0.35386,0.3191,0.4845,0.0801,0.1826,1.63328
155,Syria,Middle East and Northern Africa,156,3.006,0.05015,0.6632,0.47489,0.72193,0.15684,0.18906,0.47179,0.32858
156,Burundi,Sub-Saharan Africa,157,2.905,0.08658,0.0153,0.41587,0.22396,0.1185,0.10062,0.19727,1.83302
157,Togo,Sub-Saharan Africa,158,2.839,0.06727,0.20868,0.13995,0.28443,0.36453,0.10731,0.16681,1.56726


#### Check shape

In [12]:
whr2015.shape

(158, 12)

#### Check dtypes and columns

In [13]:
whr2015.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 158 entries, 0 to 157
Data columns (total 12 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   Country                        158 non-null    object 
 1   Region                         158 non-null    object 
 2   Happiness Rank                 158 non-null    int64  
 3   Happiness Score                158 non-null    float64
 4   Standard Error                 158 non-null    float64
 5   Economy (GDP per Capita)       158 non-null    float64
 6   Family                         158 non-null    float64
 7   Health (Life Expectancy)       158 non-null    float64
 8   Freedom                        158 non-null    float64
 9   Trust (Government Corruption)  158 non-null    float64
 10  Generosity                     158 non-null    float64
 11  Dystopia Residual              158 non-null    float64
dtypes: float64(9), int64(1), object(2)
memory usage: 1

#### Check nulls

In [14]:
print("Are there any missing values? :",whr2015.isnull().any().any())
print(whr2015.isnull().sum())

Are there any missing values? : False
Country                          0
Region                           0
Happiness Rank                   0
Happiness Score                  0
Standard Error                   0
Economy (GDP per Capita)         0
Family                           0
Health (Life Expectancy)         0
Freedom                          0
Trust (Government Corruption)    0
Generosity                       0
Dystopia Residual                0
dtype: int64


#### Check duplicates

In [15]:
print("Are there duplicated values? :",whr2015.duplicated().any())
print(whr2015.duplicated().sum())

Are there duplicated values? : False
0


#### See some descriptive statistics

In [16]:
whr2015.describe()

Unnamed: 0,Happiness Rank,Happiness Score,Standard Error,Economy (GDP per Capita),Family,Health (Life Expectancy),Freedom,Trust (Government Corruption),Generosity,Dystopia Residual
count,158.0,158.0,158.0,158.0,158.0,158.0,158.0,158.0,158.0,158.0
mean,79.493671,5.375734,0.047885,0.846137,0.991046,0.630259,0.428615,0.143422,0.237296,2.098977
std,45.754363,1.14501,0.017146,0.403121,0.272369,0.247078,0.150693,0.120034,0.126685,0.55355
min,1.0,2.839,0.01848,0.0,0.0,0.0,0.0,0.0,0.0,0.32858
25%,40.25,4.526,0.037268,0.545808,0.856823,0.439185,0.32833,0.061675,0.150553,1.75941
50%,79.5,5.2325,0.04394,0.910245,1.02951,0.696705,0.435515,0.10722,0.21613,2.095415
75%,118.75,6.24375,0.0523,1.158448,1.214405,0.811013,0.549092,0.180255,0.309883,2.462415
max,158.0,7.587,0.13693,1.69042,1.40223,1.02525,0.66973,0.55191,0.79588,3.60214


## 2. Hypothesize relationship between variables in each dataset

### Hofstede's six-dimensions model (2015)

In [38]:
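# numeric_only=True skips the text columns (ctr, country); recent pandas versions raise an error without it
cultural_dimensinality.corr(numeric_only=True).style.background_gradient(cmap="coolwarm")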

Unnamed: 0,pdi,idv,uai,mas,lto,ind
pdi,1.0,-0.598411,0.228644,0.114673,3.2e-05,-0.28422
idv,-0.598411,1.0,-0.165157,0.082986,0.12343,0.136903
uai,0.228644,-0.165157,1.0,-0.060872,-0.012148,-0.07408
mas,0.114673,0.082986,-0.060872,1.0,0.031451,0.06704
lto,3.2e-05,0.12343,-0.012148,0.031451,1.0,-0.455667
ind,-0.28422,0.136903,-0.07408,0.06704,-0.455667,1.0


State the hypothesis.

### World Happiness Report 2015

In [35]:
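# numeric_only=True skips the text columns (Country, Region); recent pandas versions raise an error without it
whr2015.corr(numeric_only=True).style.background_gradient(cmap="coolwarm")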

Unnamed: 0,Happiness Rank,Happiness Score,Standard Error,Economy (GDP per Capita),Family,Health (Life Expectancy),Freedom,Trust (Government Corruption),Generosity,Dystopia Residual
Happiness Rank,1.0,-0.992105,0.158516,-0.785267,-0.733644,-0.735613,-0.556886,-0.372315,-0.160142,-0.521999
Happiness Score,-0.992105,1.0,-0.177254,0.780966,0.740605,0.7242,0.568211,0.395199,0.180319,0.530474
Standard Error,0.158516,-0.177254,1.0,-0.217651,-0.120728,-0.310287,-0.129773,-0.178325,-0.088439,0.083981
Economy (GDP per Capita),-0.785267,0.780966,-0.217651,1.0,0.645299,0.816478,0.3703,0.307885,-0.010465,0.040059
Family,-0.733644,0.740605,-0.120728,0.645299,1.0,0.531104,0.441518,0.205605,0.087513,0.148117
Health (Life Expectancy),-0.735613,0.7242,-0.310287,0.816478,0.531104,1.0,0.360477,0.248335,0.108335,0.018979
Freedom,-0.556886,0.568211,-0.129773,0.3703,0.441518,0.360477,1.0,0.493524,0.373916,0.062783
Trust (Government Corruption),-0.372315,0.395199,-0.178325,0.307885,0.205605,0.248335,0.493524,1.0,0.276123,-0.033105
Generosity,-0.160142,0.180319,-0.088439,-0.010465,0.087513,0.108335,0.373916,0.276123,1.0,-0.101301
Dystopia Residual,-0.521999,0.530474,0.083981,0.040059,0.148117,0.018979,0.062783,-0.033105,-0.101301,1.0


State the hypothesis.

## 3. Features should be investigated in depth combining datasets

We refine the objectives.
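One possible way to combine the two datasets is to merge them on the country name and inspect which rows fail to match, since the two sources may spell some countries differently. The sketch below uses placeholder frames: the country names and values are invented for illustration and are not taken from either dataset. Only the column names `country`, `idv`, `Country` and `Happiness Score` come from this notebook.

```python
import pandas as pd

# Placeholder frames: country names and values are invented for illustration only.
hofstede_toy = pd.DataFrame({
    "country": ["Country A", "Country B", "Country C"],
    "idv": [80.0, 25.0, None],
})
happiness_toy = pd.DataFrame({
    "Country": ["Country A", "Country B", "Country D"],
    "Happiness Score": [7.0, 5.0, 4.0],
})

# An outer merge with indicator=True keeps every row and records where it came from.
merged = hofstede_toy.merge(
    happiness_toy,
    left_on="country",
    right_on="Country",
    how="outer",
    indicator=True,
)

# Rows present in only one source are candidates for name harmonisation.
unmatched = merged[merged["_merge"] != "both"]

# Rows present in both sources form the combined dataset.
matched = merged[merged["_merge"] == "both"].drop(columns=["Country", "_merge"])
print(matched)
```

With the real frames, `cultural_dimensinality` and `whr2015` would take the place of the two placeholders, and `unmatched` would list the countries whose names need reviewing before the analysis.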

## 4. Data cleaning & manipulation. Apply the following techniques as appropriate:

### 1. Adjust skewed data distribution.
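A minimal sketch of one common approach, assuming a right-skewed, non-negative column: measure skewness with `scipy.stats.skew`, apply a `log1p` transform, and measure again. The values below are invented to be clearly skewed and are not from either dataset. Whether any column in this project needs such a transform has not been established here.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Invented, clearly right-skewed values used only to illustrate the transform.
values = pd.Series([1, 2, 2, 3, 3, 3, 4, 50], dtype=float)

# Skewness before the transform (positive means a long right tail).
skew_before = stats.skew(values)

# log1p computes log(1 + x), which compresses large values and tolerates zeros.
transformed = np.log1p(values)
skew_after = stats.skew(transformed)

print(f"skew before: {skew_before:.2f}, after log1p: {skew_after:.2f}")
```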

### 2. Remove columns with high proportion of missing values.
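The sketch below shows one way to do this: compute the share of missing values per column and keep only the columns at or below a threshold. The rows are the first five of the Hofstede table as printed earlier in this notebook. The 0.5 threshold is an arbitrary assumption. In the full table the dimension columns are missing in at most 33 of 111 rows (about 30%), so this threshold would keep all of them there.

```python
import numpy as np
import pandas as pd

# First five rows of the Hofstede table, copied from the head() output in this notebook.
sample = pd.DataFrame({
    "ctr": ["AFE", "AFW", "ALB", "ALG", "AND"],
    "country": ["Africa East", "Africa West", "Albania", "Algeria", "Andorra"],
    "pdi": [64.0, 77.0, np.nan, np.nan, np.nan],
    "idv": [27.0, 20.0, np.nan, np.nan, np.nan],
    "uai": [52.0, 54.0, np.nan, np.nan, np.nan],
    "mas": [41.0, 46.0, np.nan, np.nan, np.nan],
    "lto": [32.0, 9.0, 61.460957, 25.944584, np.nan],
    "ind": [40.0, 78.0, 14.508929, 32.366071, 65.0],
})

# Fraction of missing values in each column.
missing_share = sample.isnull().mean()

# Assumed cut-off: drop columns where more than half of the values are missing.
threshold = 0.5
kept = sample.loc[:, missing_share <= threshold]

print(missing_share)
print(list(kept.columns))
```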

### 3. Remove records with missing values.
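Since the objective centres on `idv`, one reasonable option is to drop only the rows where `idv` is missing rather than every row with any missing value. The sketch below uses the last five rows of the Hofstede table as printed earlier in this notebook, restricted to a few columns. The names `tail_sample` and `complete` are illustrative.

```python
import numpy as np
import pandas as pd

# Last five rows of the Hofstede table, copied from the tail() output in this notebook.
tail_sample = pd.DataFrame({
    "ctr": ["URU", "VEN", "VIE", "ZAM", "ZIM"],
    "country": ["Uruguay", "Venezuela", "Vietnam", "Zambia", "Zimbabwe"],
    "idv": [36.0, 12.0, 20.0, np.nan, np.nan],
    "ind": [53.348214, 100.0, 35.491071, 42.1875, 27.678571],
})

# Drop rows only when the column of interest is missing, then renumber the index.
complete = tail_sample.dropna(subset=["idv"]).reset_index(drop=True)

print(complete)
```

Restricting `dropna` with `subset` avoids discarding countries that lack other dimensions but still have an `idv` score.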

### 4. Feature reduction.

### 5. Convert categorical data to numerical.
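In the happiness dataset the only categorical feature besides the country name is `Region`. A minimal sketch of one-hot encoding it with `pd.get_dummies` follows, using five rows whose `Country` and `Region` values are copied from the outputs earlier in this notebook. One-hot encoding is an assumption here. Other encodings, such as integer codes, may suit the chosen model better.

```python
import pandas as pd

# Country and Region values copied from the head() and tail() outputs in this notebook.
regions = pd.DataFrame({
    "Country": ["Switzerland", "Canada", "Rwanda", "Syria", "Togo"],
    "Region": [
        "Western Europe",
        "North America",
        "Sub-Saharan Africa",
        "Middle East and Northern Africa",
        "Sub-Saharan Africa",
    ],
})

# Replace Region with one 0/1 indicator column per distinct region.
encoded = pd.get_dummies(regions, columns=["Region"], prefix="Region", dtype=int)

print(encoded.columns.tolist())
```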

## 5. Compute field relationship scores with the chosen statistical model.
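The statistical model has not been chosen yet in this notebook. As an assumption, the sketch below uses Pearson's correlation coefficient via `scipy.stats.pearsonr`, which returns both the coefficient and a p-value. Because no merged `idv` and `Happiness Score` data appears in the outputs above, it uses `pdi` and `idv` for the five countries with complete values shown in the `head()` and `tail()` outputs. Five rows are far too few for inference, so the result will differ from the full-table correlation reported in section 2.

```python
from scipy import stats

# pdi and idv for Africa East, Africa West, Uruguay, Venezuela and Vietnam,
# copied from the head() and tail() outputs in this notebook.
pdi = [64.0, 77.0, 61.0, 81.0, 70.0]
idv = [27.0, 20.0, 36.0, 12.0, 20.0]

# Pearson's r measures linear association; the p-value tests r == 0.
r, p_value = stats.pearsonr(pdi, idv)

print(f"r = {r:.3f}, p = {p_value:.3f}")
```

Once the datasets are merged, the `idv` and `Happiness Score` columns can be passed to the same call.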

## 6. Present your findings in statistical summary and/or data visualizations.