# Insights from the European Social Survey 9 (2018)

This project uses the data collected in 2018/2019 during the 9th European Social Survey, which you can find here: [ESS9](https://www.europeansocialsurvey.org/about/singlenew.html?a=/about/news/essnews0076.html). 

The European Social Survey (ESS) is an academically driven cross-national survey that has been conducted across Europe since its establishment in 2001. Every two years, face-to-face interviews are conducted with newly selected, cross-sectional samples.
The survey measures the attitudes, beliefs and behaviour patterns of diverse populations. 

The 2018 dataset is noticeably smaller than the previous one from 2016: it has over 8k fewer entries, which puts ESS9 at around 80% of the size of ESS8. The survey documentation shows that the set of participating countries differs. Iceland, Israel, Lithuania, Portugal, Russia, Spain, and Sweden are not included. This is interesting because the country feature for Russia turned out to be one of the important ones in the previous analysis of the happiness factor, which you can find [here](https://www.kaggle.com/pascalbliem/insights-from-the-european-social-survey-8). The survey also gains new countries: Bulgaria, Cyprus, and Serbia. That makes 7 out and 3 in. 
It is worth noting that this is edition 1.0, whereas for ESS8 we used edition 2.1. This may suggest that more data will be added to these results. 

The questions in the survey are divided into specific topic groups:
- Country
- Weights
- Media and social trust
- Politics
- Subjective well-being, social exclusion, religion, national and ethnic identity
- Timing of life*
- Gender, Year of birth and Household grid
- Socio-demographics
- Justice and Fairness*
- Human values
- Administrative variables

The groups with asterisks, _Timing of life_ and _Justice and Fairness_, are exclusively prepared for the current edition of the survey. In the ESS8 the exclusive groups were _Climate change_ and _Welfare attitudes_. 

### The structure of the project is as follows:
1. __data preparation__
2. __exploratory data analysis__
3. __statistical analysis__
4. __machine learning__

### The objective:
The project analyses the differences between the countries in the survey, estimates which factors influence the happiness of the respondents, and predicts the happiness factor for a test group.

## Data

First load the necessary libraries, set the style of the visualization and disable the warnings. 

In [1]:
import numpy as np
import pandas as pd
import scipy.stats
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
import warnings
warnings.filterwarnings('ignore')

Then load the data into the dataframe and show the basic information:

In [2]:
# forward slashes work on Windows, Linux and macOS alike
df = pd.read_csv('./ESS9e01.0/ESS9e01.0_F1.csv')
print('Info:')
df.info()

Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36015 entries, 0 to 36014
Columns: 491 entries, name to pweight
dtypes: float64(195), int64(286), object(10)
memory usage: 134.9+ MB


#### __The information summary is pretty straightforward:__
`RangeIndex` gives the number of rows, that is, the number of respondents across all surveyed countries. __We have 36015 respondents.__  
`Columns` gives the number of variables: the survey questions plus administrative fields such as weights and interview timing. __We have 491 variables.__  
`dtypes` gives the data types of the columns. __We have 195 float columns, 286 int columns, and 10 object columns.__  
`memory usage` is the size of the dataset in memory. __The dataset uses about 135 MB of memory.__  

In [3]:
df.sample(5)

Unnamed: 0,name,essround,edition,proddate,idno,cntry,nwspol,netusoft,netustm,ppltrst,...,inwshh,inwsmm,inwdde,inwmme,inwyye,inwehh,inwemm,inwtm,dweight,pweight
35949,ESS9e01,9,1.0,31.10.2019,45662,SI,30,2,6666,5,...,16,35,28,11,2018,17,13,38.0,1.0,0.133248
447,ESS9e01,9,1.0,31.10.2019,8791,AT,20,5,300,0,...,15,26,20,11,2018,17,5,94.0,0.550646,0.302091
24821,ESS9e01,9,1.0,31.10.2019,37386,IE,260,2,6666,9,...,15,0,14,1,2019,16,27,84.0,0.461313,0.172561
24659,ESS9e01,9,1.0,31.10.2019,33962,IE,0,5,240,5,...,17,16,25,2,2019,18,9,52.0,0.922627,0.172561
23811,ESS9e01,9,1.0,31.10.2019,15640,IE,180,5,180,1,...,19,37,9,1,2019,20,31,16.0,0.461313,0.172561


The sample of 5 rows shows how the data is structured in the data frame. Each column is named according to the 'Study Documentation', which also predefines the possible answers. The scale can be *ordinal, nominal* or *binary*; the type *discrete* or *continuous*; the format *numeric* or *character*, with a specific number of digits. Some of the answers are **invalid** because of a refusal, no answer, or because the question was not applicable. The invalid answers are encoded with specific numbers (e.g. 6666). 
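
To make the encoding of invalid answers concrete, here is a minimal sketch on made-up data. The series, the helper name `count_invalid` and the choice of 6666 as the first invalid code are assumptions for illustration only; the real codes for each variable are listed in the 'Study Documentation'.

```python
import pandas as pd

# Hypothetical answers to a "minutes per day" question.
# Values of 6666 and above are assumed to mark invalid answers.
minutes_toy = pd.Series([300, 6666, 240, 180, 6666, 120])

def count_invalid(answers: pd.Series, first_invalid_code: int) -> int:
    """Count the answers at or above the first invalid code."""
    return int((answers >= first_invalid_code).sum())

n_invalid = count_invalid(minutes_toy, 6666)
print(f"{n_invalid} of {len(minutes_toy)} answers are invalid")
```

Because the codes are ordinary numbers, they are not reported as missing by pandas until they are replaced explicitly.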

In [4]:
df.columns

Index(['name', 'essround', 'edition', 'proddate', 'idno', 'cntry', 'nwspol',
       'netusoft', 'netustm', 'ppltrst',
       ...
       'inwshh', 'inwsmm', 'inwdde', 'inwmme', 'inwyye', 'inwehh', 'inwemm',
       'inwtm', 'dweight', 'pweight'],
      dtype='object', length=491)

### Variables:

The variables_ess9.csv file contains the information about each variable in the dataset based on the 'Study Documentation':
- **Name** - unique column name
- **Label** - a shortened version of the survey's question 
- **Country_specific** - if *no*: all countries answer the same question, if *yes*: each country answers a different version of the question
- **Scale_type** - the answer scale type (ordinal, nominal, binary, continuous) 
- **Type** - the answer type (discrete or continuous)
- **Format** - the answer format (numeric or character) and number of digits/characters 
- **Valid** - the number of valid answers
- **Invalid** - the number of invalid answers
- **Question** - a full version of the survey's question with question code
- **Group** - a topic group of the question (groups presented in the introduction section) 

In [5]:
all_variables = pd.read_csv('./ESS9e01.0/variables_ess9.csv')
all_variables.head(5)

Unnamed: 0,Name,Label,Country_specific,Scale_type,Type,Format,Valid,Invalid,Question,Group
0,cntry,Country,no,nominal,discrete,character-2,36015,0,5 Country,Group Country
1,dweight,Design weight,no,continuous,continuous,numeric-4.2,36015,0,R17 Design weight,Group Weights
2,pweight,Population size weight (must be combined with ...,no,continuous,continuous,numeric-8.2,36015,0,R19 Population size weight (must be combined w...,Group Weights
3,nwspol,"News about politics and current affairs, watch...",no,continuous,continuous,numeric-4.0,35929,386,"A1 On a typical day, about how much time do yo...",Group Media and social trust
4,netusoft,"Internet use, how often",no,ordinal,discrete,numeric-1.0,35983,32,A2 People can use the internet on different de...,Group Media and social trust


A quick check that every variable listed in the file is present in the dataset (any missing name would be printed):

In [6]:
# print every documented variable that has no matching column in the data
for var in all_variables.Name:
    if var not in df.columns:
        print(var)
print('Done!')

Done!
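
The same check can also be written as a set difference, which returns the missing names as a list instead of printing them one by one. This is a sketch on toy stand-ins; `toy_df` and `toy_variables` are hypothetical and only mimic the shape of `df` and `all_variables`.

```python
import pandas as pd

# Toy stand-ins for the survey data and the variables table (made-up values).
toy_df = pd.DataFrame({"cntry": ["AT", "BE"], "agea": [34, 51], "gndr": [1, 2]})
toy_variables = pd.DataFrame({"Name": ["cntry", "agea", "happy"]})

# Names listed in the variables table that have no matching data column.
missing = sorted(set(toy_variables["Name"]) - set(toy_df.columns))
print(missing)
```

An empty list would correspond to the bare `Done!` output above.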


For further analysis let's filter some of the variables:  
- Ordinal - only questions that are neither country-specific nor administrative

In [7]:
# -> not country specific 
# -> ordinal 
# -> not part of administrative group 

ordinal = all_variables.query('Country_specific == "no" and Scale_type == "ordinal" and Group != "Group Administrative variables"')
ordinal.head()

Unnamed: 0,Name,Label,Country_specific,Scale_type,Type,Format,Valid,Invalid,Question,Group
4,netusoft,"Internet use, how often",no,ordinal,discrete,numeric-1.0,35983,32,A2 People can use the internet on different de...,Group Media and social trust
6,ppltrst,Most people can be trusted or you can't be too...,no,ordinal,discrete,numeric-2.0,35906,109,"A4 Using this card, generally speaking, would ...",Group Media and social trust
7,pplfair,"Most people try to take advantage of you, or t...",no,ordinal,discrete,numeric-2.0,35787,228,"A5 Using this card, do you think that most peo...",Group Media and social trust
8,pplhlp,Most of the time people helpful or mostly look...,no,ordinal,discrete,numeric-2.0,35875,140,A6 Would you say that most of the time people ...,Group Media and social trust
9,polintr,How interested in politics,no,ordinal,discrete,numeric-1.0,35941,74,B1 How interested would you say you are in pol...,Group Politics


We are left with **107 ordinal variables**.

In [8]:
ordinal.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 107 entries, 4 to 474
Data columns (total 10 columns):
Name                107 non-null object
Label               107 non-null object
Country_specific    107 non-null object
Scale_type          107 non-null object
Type                107 non-null object
Format              107 non-null object
Valid               107 non-null int64
Invalid             107 non-null int64
Question            107 non-null object
Group               107 non-null object
dtypes: int64(2), object(8)
memory usage: 9.2+ KB


- Continuous - only chosen variables: Age (*agea*), Years of education completed (*eduyrs*), Daily news consumption in minutes (*nwspol*), and Daily internet usage in minutes (*netustm*)

In [9]:
# -> Age, 
# -> Years of education completed, 
# -> Daily news consumption in minutes, 
# -> Daily internet usage in minutes

continuous = all_variables.query('Name in ["agea", "eduyrs", "nwspol", "netustm"]')
continuous.head()

Unnamed: 0,Name,Label,Country_specific,Scale_type,Type,Format,Valid,Invalid,Question,Group
3,nwspol,"News about politics and current affairs, watch...",no,continuous,continuous,numeric-4.0,35929,386,"A1 On a typical day, about how much time do yo...",Group Media and social trust
5,netustm,"Internet use, how much time on typical day, in...",no,continuous,continuous,numeric-4.0,25029,10986,"A3 On a typical day, about how much time do yo...",Group Media and social trust
214,agea,"Age of respondent, calculated",no,continuous,continuous,numeric-4.0,35848,167,"F31b Age of respondent, calculated","Group Gender, Year of birth and Household grid"
278,eduyrs,Years of full-time education completed,no,continuous,continuous,numeric-2.0,35510,505,F16 About how many years of education have you...,Group Socio-demographics


We have **4 continuous variables**.

In [10]:
continuous.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4 entries, 3 to 278
Data columns (total 10 columns):
Name                4 non-null object
Label               4 non-null object
Country_specific    4 non-null object
Scale_type          4 non-null object
Type                4 non-null object
Format              4 non-null object
Valid               4 non-null int64
Invalid             4 non-null int64
Question            4 non-null object
Group               4 non-null object
dtypes: int64(2), object(8)
memory usage: 352.0+ bytes


- Nominal - only chosen variables: Country (*cntry*) and Gender (*gndr*)

In [11]:
# -> Country
# -> Gender

nominal = all_variables.query('Name in ["cntry", "gndr"]')
nominal.head()

Unnamed: 0,Name,Label,Country_specific,Scale_type,Type,Format,Valid,Invalid,Question,Group
0,cntry,Country,no,nominal,discrete,character-2,36015,0,5 Country,Group Country
199,gndr,Gender,no,nominal,discrete,numeric-1.0,36015,0,"F21 CODE SEX, respondent","Group Gender, Year of birth and Household grid"


We have **2 nominal variables**.

In [12]:
nominal.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2 entries, 0 to 199
Data columns (total 10 columns):
Name                2 non-null object
Label               2 non-null object
Country_specific    2 non-null object
Scale_type          2 non-null object
Type                2 non-null object
Format              2 non-null object
Valid               2 non-null int64
Invalid             2 non-null int64
Question            2 non-null object
Group               2 non-null object
dtypes: int64(2), object(8)
memory usage: 176.0+ bytes


Concatenate all of the chosen variables back into one data frame.

In [13]:
variables = pd.concat([nominal,continuous,ordinal]).reset_index(drop=True)
variables.sample(5)

Unnamed: 0,Name,Label,Country_specific,Scale_type,Type,Format,Valid,Invalid,Question,Group
44,sclact,Take part in social activities compared to oth...,no,ordinal,discrete,numeric-1.0,35396,619,"C4 Compared to other people of your age, how o...","Group Subjective well-being, social exclusion,..."
58,advcyc,Approve if person gets divorced while children...,no,ordinal,discrete,numeric-1.0,33221,2794,"D30a/b-D34a/b Using this card, how much do you...",Group Timing of life
67,frprtpl,Political system in country ensures everyone f...,no,ordinal,discrete,numeric-1.0,34058,1957,G1 How much would you say that the political s...,Group Justice and Fairness
48,atchctr,How emotionally attached to [country],no,ordinal,discrete,numeric-2.0,35833,182,C9 How emotionally attached do you feel to [co...,"Group Subjective well-being, social exclusion,..."
68,gvintcz,Government in country takes into account the i...,no,ordinal,discrete,numeric-1.0,34737,1278,G2 How much would you say that the government ...,Group Justice and Fairness


In [14]:
variables.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 113 entries, 0 to 112
Data columns (total 10 columns):
Name                113 non-null object
Label               113 non-null object
Country_specific    113 non-null object
Scale_type          113 non-null object
Type                113 non-null object
Format              113 non-null object
Valid               113 non-null int64
Invalid             113 non-null int64
Question            113 non-null object
Group               113 non-null object
dtypes: int64(2), object(8)
memory usage: 8.9+ KB
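
A quick sanity check on the concatenation: the three groups (2 nominal, 4 continuous, 107 ordinal) should add up to 113 rows, and no variable name should appear twice. The sketch below shows the idea on toy frames; the `*_toy` names are hypothetical stand-ins for `nominal`, `continuous` and `ordinal`.

```python
import pandas as pd

# Toy stand-ins for the nominal, continuous and ordinal variable tables.
nominal_toy = pd.DataFrame({"Name": ["cntry", "gndr"]})
continuous_toy = pd.DataFrame({"Name": ["agea", "eduyrs"]})
ordinal_toy = pd.DataFrame({"Name": ["ppltrst", "pplfair", "pplhlp"]})

parts = [nominal_toy, continuous_toy, ordinal_toy]
combined = pd.concat(parts).reset_index(drop=True)

# The combined table should have one row per variable and no repeated names.
expected_rows = sum(len(part) for part in parts)
print(len(combined) == expected_rows, combined["Name"].is_unique)
```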


The total number of variables is now 113, down from 491. Many questions were country-specific and were therefore left out of the final variables data frame. 

### Invalid answers:
The invalid answers are encoded differently depending on the answer data format. For example, the codes 7, 77 and 7777 are used for the numeric-1.0, numeric-2.0 and numeric-4.0 formats, respectively. To clean the invalid answers it is useful to divide the variables into several groups.

In [15]:
humval = variables.query('Group == "Group Human values"').Name
num1 = variables.query('Format == "numeric-1.0" and Group != "Group Human values"').Name
num2 = variables.query('Format == "numeric-2.0" and Name != "eduyrs"').Name
edy = ['eduyrs']
num4 = variables.query('Format == "numeric-4.0"').Name

Filter the ESS9 dataset down to the selected variables. `ess` contains only the answers to the 113 chosen survey questions. A sample of 10 respondents' answers is shown:

In [16]:
# .copy() gives an independent frame, so later edits to ess do not depend on df
ess = df[variables.Name].copy()
ess.sample(10)

Unnamed: 0,cntry,gndr,nwspol,netustm,agea,eduyrs,netusoft,ppltrst,pplfair,pplhlp,...,iphlppl,ipsuces,ipstrgv,ipadvnt,ipbhprp,iprspot,iplylfr,impenv,imptrad,impfun
16946,FI,1,0,480,18,12,5,6,4,7,...,2,4,3,1,3,6,2,4,5,2
21937,HU,1,60,6666,62,12,1,6,10,8,...,3,3,6,5,4,4,1,4,3,6
18087,FR,2,90,180,74,12,5,5,9,9,...,2,5,2,6,1,6,1,1,2,6
5456,BG,1,120,360,45,12,5,5,5,77,...,3,3,3,3,3,3,3,3,3,3
10504,CZ,2,30,240,21,13,5,6,7,5,...,2,2,2,2,2,2,2,2,5,2
20198,GB,2,90,6666,63,12,1,4,0,3,...,1,3,1,6,1,1,1,2,1,3
16318,FI,2,30,90,40,18,5,6,7,6,...,1,5,4,2,3,2,1,2,2,1
24761,IE,2,60,120,32,17,4,4,7,7,...,2,2,2,1,4,5,3,3,2,3
13600,EE,1,30,6666,83,8,1,6,7,8,...,2,3,2,4,1,4,2,1,1,4
22577,HU,2,0,180,17,10,5,6,5,7,...,2,2,3,3,3,4,4,4,4,4


### Missing values:
The dataframe can have some missing values. Print all the missing values if there are some. 

In [17]:
print(f'Missing values: \n{ess.isna().sum()[ess.isna().sum()>0]}')

Missing values: 
Series([], dtype: int64)


The dataframe has no missing values, but it does contain invalid answers (e.g. 6666). Replacing invalid responses with __'NaN'__ values is useful for analysis, because those answers can then be ignored or imputed with calculated values. 

In [18]:
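for group, cutoff in zip([humval,num1,num2,edy,num4],[7,6,11,66,6666]):
    for var in group:
        # keep valid answers (below the cutoff) and set everything else to NaN;
        # assigning the result back avoids an in-place edit of a temporary object
        ess[var] = ess[var].where(ess[var] < cutoff)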
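
The cutoff idea is easier to see on a small example. The following is a sketch on made-up data: the column names, the values and the `cutoffs` dictionary are assumptions for illustration, with each cutoff being the smallest value treated as invalid for that column.

```python
import pandas as pd

# Made-up answers: `trust_toy` is on a 0-10 scale, where codes such as 77 or 88
# stand in for invalid answers; `minutes_toy` is a duration in minutes, where
# 6666 and above are assumed to be invalid.
toy = pd.DataFrame({
    "trust_toy": [5, 77, 10, 0, 88],
    "minutes_toy": [300, 6666, 240, 7777, 120],
})

# Smallest value treated as invalid for each column (assumed for this sketch).
cutoffs = {"trust_toy": 11, "minutes_toy": 6666}

cleaned = toy.copy()
for column, cutoff in cutoffs.items():
    # Keep values below the cutoff; everything else becomes NaN.
    cleaned[column] = cleaned[column].where(cleaned[column] < cutoff)

print(cleaned.isna().sum())
```

Note that a column that receives `NaN` values is converted to a float type, which is why many integer answers are displayed with a decimal point after cleaning.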

Print all the missing values after adding the 'NaN' values. There are many missing values now that the invalid answers are taken into account. 

In [19]:
print(f'Missing values after removing invalid answers: \
\n{ess.isna().sum()[ess.isna().sum()>0]}')

Missing values after removing invalid answers: 
nwspol        386
netustm     10986
eduyrs        505
netusoft       32
ppltrst       109
pplfair       228
pplhlp        140
polintr        74
psppsgva     1169
actrolga      935
psppipla      975
cptppola     1217
trstprl       868
trstlgl       702
trstplc       268
trstplt       723
trstprt       831
trstep       2575
trstun       3049
prtdgcl     20816
lrscale      5165
stflife       193
stfeco        927
stfgov       1304
stfdem       1332
stfedu       1374
stfhlth       265
gincdif       655
freehms      1206
hmsfmlsh     1787
            ...  
recimg       3456
recgndr      1728
sofrdst       789
sofrwrk       631
sofrpr        726
sofrprv      1285
ppldsrv       851
jstprev       874
pcmpinj      1382
ipcrtiv       869
imprich       693
ipeqopt       782
ipshabt       758
impsafe       628
impdiff       753
ipfrule       941
ipudrst       815
ipmodst       725
ipgdtim       738
impfree       704
iphlppl       682
ipsuces       83

### Reverse the scale:
Some of the variables use scales ordered from negative to positive, while others run from positive to negative. The details can be found in the 'Study Documentation'. The negative-to-positive order seems more natural, so to give all variables the same orientation, the positive-to-negative scales need to be reversed. The list of variables to reverse contains 42 entries.

In [20]:
reverse = ["ipcrtiv", "imprich", "ipeqopt", "ipshabt", "impsafe", "impdiff", "ipfrule", "ipudrst", "ipmodst",
           "ipgdtim", "impfree", "iphlppl", "ipsuces", "ipstrgv", "ipadvnt", "ipbhprp", "iprspot", "iplylfr",
           "impenv", "imptrad", "impfun", "prtdgcl", "gincdif", "freehms", "hmsfmlsh", "hmsacld", 
           "imsmetn", "imdfetn", "impcntr", "aesfdrk", "health", "hlthhmp", "rlgatnd", "pray", "hincfel",
          "sofrdst", "sofrwrk", "sofrpr", "sofrprv", "ppldsrv", "jstprev", "pcmpinj"]
len(reverse)

42

The process of reversing the scales:

In [21]:
for var in reverse:
    upp = int(ess[var].max())
    low = int(ess[var].min())
    # map low -> upp, low+1 -> upp-1, ..., upp -> low and assign the result back
    ess[var] = ess[var].replace(dict(zip(range(low,upp+1),range(upp,low-1,-1))))
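
For integer scales, the mapping built in the loop above is equivalent to mirroring each value around the middle of the scale, `low + upp - value`. A minimal sketch on made-up answers, not on the survey data:

```python
import numpy as np
import pandas as pd

# Made-up answers on a 1-6 scale, with one missing value.
answers = pd.Series([1.0, 2.0, 6.0, np.nan, 4.0])

low, upp = answers.min(), answers.max()
# Mirror each value around the middle of the scale: 1 <-> 6, 2 <-> 5, 3 <-> 4.
reversed_answers = low + upp - answers

print(reversed_answers.tolist())
```

Both variants take the ends of the scale from the observed minimum and maximum. If no respondent chose one end of a scale, the reversed values would be shifted, so it may be safer to take the bounds from the 'Study Documentation'.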

The dataframe with reversed scales: 

In [22]:
ess.sample(10)

Unnamed: 0,cntry,gndr,nwspol,netustm,agea,eduyrs,netusoft,ppltrst,pplfair,pplhlp,...,iphlppl,ipsuces,ipstrgv,ipadvnt,ipbhprp,iprspot,iplylfr,impenv,imptrad,impfun
6061,BG,2,60.0,360.0,41,17.0,5.0,4.0,4.0,4.0,...,3.0,4.0,4.0,4.0,,4.0,4.0,4.0,4.0,4.0
35602,SI,2,30.0,120.0,21,13.0,5.0,3.0,5.0,5.0,...,5.0,6.0,4.0,4.0,6.0,6.0,6.0,6.0,4.0,6.0
29752,NO,1,60.0,180.0,46,17.0,5.0,9.0,9.0,7.0,...,3.0,5.0,4.0,3.0,5.0,4.0,4.0,4.0,3.0,4.0
3431,BE,1,10.0,240.0,20,17.0,5.0,3.0,4.0,2.0,...,5.0,3.0,5.0,4.0,6.0,3.0,5.0,5.0,6.0,5.0
21935,HU,2,0.0,240.0,45,12.0,5.0,2.0,4.0,3.0,...,4.0,4.0,4.0,5.0,4.0,4.0,4.0,5.0,4.0,4.0
23682,IE,1,155.0,650.0,24,14.0,5.0,5.0,9.0,8.0,...,6.0,5.0,5.0,4.0,4.0,5.0,5.0,5.0,5.0,5.0
28990,NL,2,120.0,120.0,59,16.0,5.0,8.0,8.0,8.0,...,6.0,4.0,4.0,5.0,4.0,3.0,6.0,6.0,6.0,6.0
11804,DE,1,30.0,120.0,50,18.0,5.0,8.0,7.0,5.0,...,5.0,5.0,5.0,2.0,3.0,5.0,6.0,6.0,2.0,2.0
26200,IT,2,90.0,,80,5.0,1.0,5.0,4.0,6.0,...,6.0,4.0,6.0,1.0,6.0,1.0,6.0,6.0,5.0,3.0
27480,IT,2,626.0,60.0,22,9.0,5.0,0.0,5.0,5.0,...,5.0,5.0,6.0,1.0,3.0,3.0,5.0,6.0,6.0,3.0


In this part, the data was prepared for further analysis. The next part will include visualizations and exploratory data analysis. 