# Insights from the European Social Survey 9 (2018)

This project uses the data collected in 2018/2019 during the 9th European Social Survey, which you can find here: [ESS9](https://www.europeansocialsurvey.org/about/singlenew.html?a=/about/news/essnews0076.html). 

The European Social Survey (ESS) is an academically driven cross-national survey that has been conducted across Europe since its establishment in 2001. Every two years, face-to-face interviews are conducted with newly selected, cross-sectional samples.
The survey measures the attitudes, beliefs and behaviour patterns of diverse populations. 

The 2018 dataset is noticeably smaller than the previous one from 2016: it has over 8k fewer entries, which puts ESS9 at around 80% of the size of ESS8. The survey documentation shows that the set of participating countries has changed. Iceland, Israel, Lithuania, Portugal, Russia, Spain, and Sweden are not included. This is interesting because the country feature for Russia turned out to be one of the important ones in the previous analysis of the happiness factor, which you can find [here](https://www.kaggle.com/pascalbliem/insights-from-the-european-social-survey-8). On the other hand, the survey gains new countries: Bulgaria, Cyprus, and Serbia. That makes 7 out, 3 in. 
It is worth noting that this is edition 1.0 of the data, whereas edition 2.1 was used for ESS8. This may suggest that more data will be added to these results in later editions. 

The questions in the survey are divided into specific topic groups:
- Country
- Weights
- Media and social trust
- Politics
- Subjective well-being, social exclusion, religion, national and ethnic identity
- Timing of life*
- Gender, Year of birth and Household grid
- Socio-demographics
- Justice and Fairness*
- Human values
- Administrative variables

The groups with asterisks, _Timing of life_ and _Justice and Fairness_, are exclusively prepared for the current edition of the survey. In the ESS8 the exclusive groups were _Climate change_ and _Welfare attitudes_. 

### The structure of the project is as follows:
1. __data preparation__
2. __exploratory data analysis__
3. __statistical analysis__
4. __machine learning__

### The objective:
The goals are to analyse the differences between the countries covered by the survey, to estimate which factors influence the happiness of the respondents, and to predict the happiness factor for a test group.

## Data

First, load the necessary libraries, set the visualization style, and disable warnings. 

In [1]:
import numpy as np
import pandas as pd
import scipy.stats
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
import warnings
warnings.filterwarnings('ignore')

Then load the data into the dataframe and show the basic information:

In [2]:
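# forward slashes work on every operating system and match the path style used for variables_ess9.csv
df = pd.read_csv('./ESS9e01.0/ESS9e01.0_F1.csv')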
print('Info:')
df.info()

Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36015 entries, 0 to 36014
Columns: 491 entries, name to pweight
dtypes: float64(195), int64(286), object(10)
memory usage: 134.9+ MB


#### __The information summary is pretty straightforward:__
RangeIndex gives the number of respondents who took part in the survey across all surveyed countries. __We have 36015 respondents.__  
Columns gives the number of variables, most of which correspond to questions in the survey. __We have 491 variables.__  
dtypes lists the data types used to store the answers. __We have 195 float columns, 286 int columns, and 10 object columns.__  
Memory usage is the size of the dataset in memory. __The dataset uses about 135 MB of memory.__  
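
The same summary numbers can also be read programmatically instead of from the printed `info()` text. The following is a minimal sketch on a small toy frame; the frame `toy` and its values are hypothetical stand-ins, not ESS9 records.

```python
import pandas as pd

# Toy stand-in for the survey data (hypothetical values, not ESS9 records)
toy = pd.DataFrame({
    "cntry": ["AT", "BE", "BG"],       # text column
    "nwspol": [30, 90, 5],             # integer column
    "dweight": [0.99, 0.86, 1.04],     # float column
})

# number of rows (respondents) and columns (variables)
n_rows, n_cols = toy.shape

# how many columns are stored with each data type
dtype_counts = toy.dtypes.astype(str).value_counts().to_dict()

# memory footprint in megabytes, including the contents of text columns
memory_mb = toy.memory_usage(deep=True).sum() / 1024**2

print(n_rows, n_cols)
print(dtype_counts)
print(f"{memory_mb:.4f} MB")
```

Applied to `df`, the same three expressions should give the row count, the column count, the dtype breakdown, and the memory usage reported above.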

In [3]:
df.sample(5)

Unnamed: 0,name,essround,edition,proddate,idno,cntry,nwspol,netusoft,netustm,ppltrst,...,inwshh,inwsmm,inwdde,inwmme,inwyye,inwehh,inwemm,inwtm,dweight,pweight
13782,ESS9e01,9,1.0,31.10.2019,6306,EE,30,5,240,4,...,14,56,3,12,2018,16,20,42.0,0.999575,0.057978
8729,ESS9e01,9,1.0,31.10.2019,44525,CY,90,1,6666,1,...,17,7,12,10,2018,18,11,51.0,0.860057,0.092695
11720,ESS9e01,9,1.0,31.10.2019,10708,DE,5,5,300,3,...,10,56,23,2,2019,11,57,61.0,0.999466,3.037345
4696,ESS9e01,9,1.0,31.10.2019,9505,BG,60,3,6666,4,...,16,45,25,11,2018,18,29,104.0,1.043028,0.275053
825,ESS9e01,9,1.0,31.10.2019,16332,AT,60,5,120,9,...,10,12,16,10,2018,11,5,49.0,0.603386,0.302091


The sample of 5 rows shows how the data is structured in the data frame. Each column is named according to the 'Study Documentation', which also predefines the possible answers. The scale can be *ordinal, nominal* or *binary*; the type is *discrete* or *continuous*; the format is *numeric* or *character*, with a specific number of digits. Some of the answers are **invalid** because of a refusal, no answer, or because the question was not applicable. The invalid answers are encoded with specific numbers (e.g. 6666). 

In [4]:
df.columns

Index(['name', 'essround', 'edition', 'proddate', 'idno', 'cntry', 'nwspol',
       'netusoft', 'netustm', 'ppltrst',
       ...
       'inwshh', 'inwsmm', 'inwdde', 'inwmme', 'inwyye', 'inwehh', 'inwemm',
       'inwtm', 'dweight', 'pweight'],
      dtype='object', length=491)

### Variables:

The variables_ess9.csv file contains the information about each variable in the dataset based on the 'Study Documentation':
- **Name** - unique column name
- **Label** - a shortened version of the survey's question 
- **Country_specific** - if *no*: all countries answer the same question, if *yes*: each country answers a different version of the question
- **Scale_type** - the answer scale type (ordinal, nominal, binary, continuous) 
- **Type** - the answer type (discrete or continuous)
- **Format** - the answer format (numeric or character) and number of digits/characters 
- **Valid** - the number of valid answers
- **Invalid** - the number of invalid answers
- **Question** - a full version of the survey's question with question code
- **Group** - a topic group of the question (groups presented in the introduction section) 

In [5]:
all_variables = pd.read_csv('./ESS9e01.0/variables_ess9.csv')
all_variables.head(5)

Unnamed: 0,Name,Label,Country_specific,Scale_type,Type,Format,Valid,Invalid,Question,Group
0,cntry,Country,no,nominal,discrete,character-2,36015,0,5 Country,Group Country
1,dweight,Design weight,no,continuous,continuous,numeric-4.2,36015,0,R17 Design weight,Group Weights
2,pweight,Population size weight (must be combined with ...,no,continuous,continuous,numeric-8.2,36015,0,R19 Population size weight (must be combined w...,Group Weights
3,nwspol,"News about politics and current affairs, watch...",no,continuous,continuous,numeric-4.0,35929,386,"A1 On a typical day, about how much time do yo...",Group Media and social trust
4,netusoft,"Internet use, how often",no,ordinal,discrete,numeric-1.0,35983,32,A2 People can use the internet on different de...,Group Media and social trust


A quick check that all documented variables are present in the dataset:

In [6]:
# print every documented variable that has no matching column in the dataset
for var in all_variables.Name:
    if var not in df.columns:
        print(var)
print('Done!')

Done!
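
The same check can be expressed as a set difference, which returns the missing names as a list instead of printing them one by one. This is only a sketch: `documented` and `data_columns` are hypothetical stand-ins for `all_variables.Name` and `df.columns`.

```python
import pandas as pd

# Hypothetical stand-ins for all_variables.Name and df.columns
documented = pd.Series(["cntry", "gndr", "agea", "happy"], name="Name")
data_columns = pd.Index(["cntry", "gndr", "agea", "nwspol"])

# names that are documented but have no matching column in the data
missing = sorted(set(documented) - set(data_columns))

print(missing)
```

An empty list corresponds to the loop above printing only `Done!`.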


For further analysis, let's select a subset of the variables:  
- Ordinal - only questions that are neither country-specific nor administrative

In [7]:
# -> not country specific 
# -> ordinal 
# -> not part of administrative group 

ordinal = all_variables.query('Country_specific == "no" and Scale_type == "ordinal" and Group != "Group Administrative variables"')
ordinal.head()

Unnamed: 0,Name,Label,Country_specific,Scale_type,Type,Format,Valid,Invalid,Question,Group
4,netusoft,"Internet use, how often",no,ordinal,discrete,numeric-1.0,35983,32,A2 People can use the internet on different de...,Group Media and social trust
6,ppltrst,Most people can be trusted or you can't be too...,no,ordinal,discrete,numeric-2.0,35906,109,"A4 Using this card, generally speaking, would ...",Group Media and social trust
7,pplfair,"Most people try to take advantage of you, or t...",no,ordinal,discrete,numeric-2.0,35787,228,"A5 Using this card, do you think that most peo...",Group Media and social trust
8,pplhlp,Most of the time people helpful or mostly look...,no,ordinal,discrete,numeric-2.0,35875,140,A6 Would you say that most of the time people ...,Group Media and social trust
9,polintr,How interested in politics,no,ordinal,discrete,numeric-1.0,35941,74,B1 How interested would you say you are in pol...,Group Politics


We are left with **107 ordinal variables**.
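
For readers less familiar with `DataFrame.query`, the filter above is equivalent to combining three boolean masks. The sketch below shows this on a toy metadata table; the variable names `q1` to `q5` and their attributes are hypothetical.

```python
import pandas as pd

# Toy metadata table (hypothetical variable names and attributes)
toy_vars = pd.DataFrame({
    "Name": ["q1", "q2", "q3", "q4", "q5"],
    "Country_specific": ["no", "yes", "no", "no", "no"],
    "Scale_type": ["ordinal", "ordinal", "nominal", "ordinal", "ordinal"],
    "Group": ["Group Media and social trust", "Group Politics", "Group Country",
              "Group Administrative variables", "Group Politics"],
})

# the filter written as a query string
by_query = toy_vars.query(
    'Country_specific == "no" and Scale_type == "ordinal" '
    'and Group != "Group Administrative variables"'
)

# the same filter written with explicit boolean masks
mask = (
    (toy_vars["Country_specific"] == "no")
    & (toy_vars["Scale_type"] == "ordinal")
    & (toy_vars["Group"] != "Group Administrative variables")
)
by_mask = toy_vars[mask]

kept_names = by_query["Name"].tolist()
print(kept_names)
```

Both forms keep the original row index, which is why the filtered ESS9 frame is indexed from 4 to 474 rather than from 0.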

In [8]:
ordinal.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 107 entries, 4 to 474
Data columns (total 10 columns):
Name                107 non-null object
Label               107 non-null object
Country_specific    107 non-null object
Scale_type          107 non-null object
Type                107 non-null object
Format              107 non-null object
Valid               107 non-null int64
Invalid             107 non-null int64
Question            107 non-null object
Group               107 non-null object
dtypes: int64(2), object(8)
memory usage: 9.2+ KB


- Continuous - only chosen variables: Age (*agea*), Years of education completed (*eduyrs*), Daily news consumption in minutes (*nwspol*), and Daily internet usage in minutes (*netustm*)

In [9]:
# -> Age, 
# -> Years of education completed, 
# -> Daily news consumption in minutes, 
# -> Daily internet usage in minutes

continuous = all_variables.query('Name in ["agea","eduyrs","nwspol","netustm"]')
continuous.head()

Unnamed: 0,Name,Label,Country_specific,Scale_type,Type,Format,Valid,Invalid,Question,Group
3,nwspol,"News about politics and current affairs, watch...",no,continuous,continuous,numeric-4.0,35929,386,"A1 On a typical day, about how much time do yo...",Group Media and social trust
5,netustm,"Internet use, how much time on typical day, in...",no,continuous,continuous,numeric-4.0,25029,10986,"A3 On a typical day, about how much time do yo...",Group Media and social trust
214,agea,"Age of respondent, calculated",no,continuous,continuous,numeric-4.0,35848,167,"F31b Age of respondent, calculated","Group Gender, Year of birth and Household grid"
278,eduyrs,Years of full-time education completed,no,continuous,continuous,numeric-2.0,35510,505,F16 About how many years of education have you...,Group Socio-demographics


We have **4 continuous variables**.

In [10]:
continuous.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4 entries, 3 to 278
Data columns (total 10 columns):
Name                4 non-null object
Label               4 non-null object
Country_specific    4 non-null object
Scale_type          4 non-null object
Type                4 non-null object
Format              4 non-null object
Valid               4 non-null int64
Invalid             4 non-null int64
Question            4 non-null object
Group               4 non-null object
dtypes: int64(2), object(8)
memory usage: 352.0+ bytes


- Nominal - only chosen variables: Country (*cntry*) and Gender (*gndr*)

In [11]:
# -> Country
# -> Gender

nominal = all_variables.query('Name in ["cntry","gndr"]')
nominal.head()

Unnamed: 0,Name,Label,Country_specific,Scale_type,Type,Format,Valid,Invalid,Question,Group
0,cntry,Country,no,nominal,discrete,character-2,36015,0,5 Country,Group Country
199,gndr,Gender,no,nominal,discrete,numeric-1.0,36015,0,"F21 CODE SEX, respondent","Group Gender, Year of birth and Household grid"


We have **2 nominal variables**.

In [12]:
nominal.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2 entries, 0 to 199
Data columns (total 10 columns):
Name                2 non-null object
Label               2 non-null object
Country_specific    2 non-null object
Scale_type          2 non-null object
Type                2 non-null object
Format              2 non-null object
Valid               2 non-null int64
Invalid             2 non-null int64
Question            2 non-null object
Group               2 non-null object
dtypes: int64(2), object(8)
memory usage: 176.0+ bytes


Concatenate all of the chosen variables back into one data frame.

In [13]:
variables = pd.concat([nominal,continuous,ordinal]).reset_index(drop=True)
variables.sample(5)

Unnamed: 0,Name,Label,Country_specific,Scale_type,Type,Format,Valid,Invalid,Question,Group
60,hhmmb,Number of people living regularly as member of...,no,ordinal,continuous,numeric-2.0,35902,113,"F1 Including yourself, how many people\n- incl...","Group Gender, Year of birth and Household grid"
67,frprtpl,Political system in country ensures everyone f...,no,ordinal,discrete,numeric-1.0,34058,1957,G1 How much would you say that the political s...,Group Justice and Fairness
1,gndr,Gender,no,nominal,discrete,numeric-1.0,36015,0,"F21 CODE SEX, respondent","Group Gender, Year of birth and Household grid"
2,nwspol,"News about politics and current affairs, watch...",no,continuous,continuous,numeric-4.0,35929,386,"A1 On a typical day, about how much time do yo...",Group Media and social trust
16,trstlgl,Trust in the legal system,no,ordinal,discrete,numeric-2.0,35313,702,"B6-12 Using this card, please tell me on a sco...",Group Politics


In [14]:
variables.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 113 entries, 0 to 112
Data columns (total 10 columns):
Name                113 non-null object
Label               113 non-null object
Country_specific    113 non-null object
Scale_type          113 non-null object
Type                113 non-null object
Format              113 non-null object
Valid               113 non-null int64
Invalid             113 non-null int64
Question            113 non-null object
Group               113 non-null object
dtypes: int64(2), object(8)
memory usage: 8.9+ KB


The total number of variables is now 113, down from 491. Many questions were country-specific, administrative, or of a type not selected above, and were therefore excluded from the final variables data frame. 
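
The total is simply the sum of the three selections: 2 nominal + 4 continuous + 107 ordinal = 113. A small sanity check along these lines can confirm that the concatenation neither lost nor duplicated a variable. The frames below are hypothetical placeholders with the same lengths as `nominal`, `continuous`, and `ordinal`.

```python
import pandas as pd

# Placeholder frames with the same lengths as the three selections (names are hypothetical)
nominal_toy = pd.DataFrame({"Name": ["n0", "n1"]})
continuous_toy = pd.DataFrame({"Name": ["c0", "c1", "c2", "c3"]})
ordinal_toy = pd.DataFrame({"Name": [f"o{i}" for i in range(107)]})

combined = pd.concat([nominal_toy, continuous_toy, ordinal_toy]).reset_index(drop=True)

# the combined length should equal the sum of the parts
total = len(combined)

# a variable selected twice would show up as a duplicated name
has_duplicates = bool(combined["Name"].duplicated().any())

print(total, has_duplicates)
```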

### Invalid answers:
The invalid answers are encoded differently depending on the answer format. For example, the codes 7, 77, and 7777 are used for the numeric-1.0, numeric-2.0, and numeric-4.0 formats, respectively. To clean the invalid answers, it is useful to split the variables into several groups.

In [15]:
humval = variables.query('Group == "Group Human values"').Name
num1 = variables.query('Format == "numeric-1.0" and Group != "Group Human values"').Name
num2 = variables.query('Format == "numeric-2.0" and Name != "eduyrs"').Name
edy = ['eduyrs']
num4 = variables.query('Format == "numeric-4.0"').Name

Filter the ESS9 dataset down to the selected variables. `ess` contains only the answers to the 113 chosen survey questions. A sample of 10 respondents' answers is shown:

In [16]:
# take an explicit copy so that later modifications of ess are unambiguous
ess = df[variables.Name].copy()
ess.sample(10)

Unnamed: 0,cntry,gndr,nwspol,netustm,agea,eduyrs,netusoft,ppltrst,pplfair,pplhlp,...,iphlppl,ipsuces,ipstrgv,ipadvnt,ipbhprp,iprspot,iplylfr,impenv,imptrad,impfun
24407,IE,1,60,180,18,14,5,8,7,7,...,2,2,3,2,4,5,2,2,5,3
2199,AT,2,30,60,66,12,5,3,7,10,...,2,3,2,2,3,3,1,1,1,2
23914,IE,2,30,6666,57,13,1,7,3,8,...,1,2,1,1,1,2,1,1,1,1
9130,CZ,1,10,60,62,12,4,0,3,5,...,2,5,3,6,2,4,1,2,4,4
34703,SI,2,120,6666,82,11,2,6,8,5,...,1,2,1,3,1,1,1,1,2,1
156,AT,2,30,6666,65,18,3,8,9,5,...,1,3,3,5,1,3,1,1,1,1
8897,CZ,2,10,210,77,12,5,5,7,5,...,3,5,1,6,2,2,1,1,3,1
19598,GB,1,45,120,45,13,5,5,5,6,...,2,3,2,3,1,5,2,2,2,2
5306,BG,1,15,8888,18,11,5,5,7,7,...,5,2,2,2,4,3,2,1,4,2
12193,DE,1,30,180,41,20,4,8,8,8,...,3,5,4,5,5,3,2,4,2,6


### Missing values:
The dataframe may contain missing values. Print the number of missing values per column, if there are any. 

In [17]:
print(f'Missing values: \n{ess.isna().sum()[ess.isna().sum()>0]}')

Missing values: 
Series([], dtype: int64)


It turns out that the dataframe has no missing values. However, it does contain invalid answers (e.g. 6666). Replacing invalid responses with __'NaN'__ values is useful for the analysis, because those answers can then be ignored or imputed with calculated values. 

In [18]:
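# every answer code at or above the cutoff of its group is treated as invalid
for group, cutoff in zip([humval,num1,num2,edy,num4],[7,6,11,66,6666]):
    for var in group:
        # assign the result back; a chained in-place call may silently act on a temporary copy
        ess[var] = ess[var].where(ess[var] < cutoff)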
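
To make the cutoff logic concrete, here is a minimal sketch on a toy frame. The column names `one_digit` and `four_digit` and all values are hypothetical; only the idea of masking every code at or above a cutoff is taken from the loop above.

```python
import pandas as pd

# Toy answers (hypothetical): valid values plus sentinel codes for invalid answers
toy = pd.DataFrame({
    "one_digit": [1, 5, 7, 8],             # 7 and 8 stand for invalid answers
    "four_digit": [30, 240, 6666, 8888],   # 6666 and 8888 stand for invalid answers
})

# one cutoff per column: everything at or above it becomes NaN
cutoffs = {"one_digit": 6, "four_digit": 6666}

cleaned = toy.copy()
for col, cutoff in cutoffs.items():
    # Series.where keeps values where the condition holds and fills NaN elsewhere
    cleaned[col] = cleaned[col].where(cleaned[col] < cutoff)

n_missing = cleaned.isna().sum()
print(cleaned)
print(n_missing)
```

Because NaN cannot be stored in an integer column, the cleaned columns become floats, which is why the answers appear as values such as `5.0` after this step.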

Print the missing values again after replacing the invalid answers with 'NaN'. There are many missing values now that the invalid answers are taken into account. 

In [19]:
print(f'Missing values after removing invalid answers: \
\n{ess.isna().sum()[ess.isna().sum()>0]}')

Missing values after removing invalid answers: 
nwspol        386
netustm     10986
eduyrs        505
netusoft       32
ppltrst       109
pplfair       228
pplhlp        140
polintr        74
psppsgva     1169
actrolga      935
psppipla      975
cptppola     1217
trstprl       868
trstlgl       702
trstplc       268
trstplt       723
trstprt       831
trstep       2575
trstun       3049
prtdgcl     20816
lrscale      5165
stflife       193
stfeco        927
stfgov       1304
stfdem       1332
stfedu       1374
stfhlth       265
gincdif       655
freehms      1206
hmsfmlsh     1787
            ...  
recimg       3456
recgndr      1728
sofrdst       789
sofrwrk       631
sofrpr        726
sofrprv      1285
ppldsrv       851
jstprev       874
pcmpinj      1382
ipcrtiv       869
imprich       693
ipeqopt       782
ipshabt       758
impsafe       628
impdiff       753
ipfrule       941
ipudrst       815
ipmodst       725
ipgdtim       738
impfree       704
iphlppl       682
ipsuces       83

### Reverse the scale:
Some of the scales are ordered from negative to positive, while others run from positive to negative. The details can be found in the 'Study Documentation'. The negative-to-positive order seems more natural, so to keep the direction the same for all variables, some scales need to be reversed. The list of variables to reverse contains 42 entries.

In [20]:
reverse = ["ipcrtiv", "imprich", "ipeqopt", "ipshabt", "impsafe", "impdiff", "ipfrule", "ipudrst", "ipmodst",
           "ipgdtim", "impfree", "iphlppl", "ipsuces", "ipstrgv", "ipadvnt", "ipbhprp", "iprspot", "iplylfr",
           "impenv", "imptrad", "impfun", "prtdgcl", "gincdif", "freehms", "hmsfmlsh", "hmsacld", 
           "imsmetn", "imdfetn", "impcntr", "aesfdrk", "health", "hlthhmp", "rlgatnd", "pray", "hincfel",
          "sofrdst", "sofrwrk", "sofrpr", "sofrprv", "ppldsrv", "jstprev", "pcmpinj"]
len(reverse)

42

The process of reversing the scales:

In [21]:
for var in reverse:
    upp = int(ess[var].max())
    low = int(ess[var].min())
    # map low -> upp, low+1 -> upp-1, ..., upp -> low and assign the result back
    ess[var] = ess[var].replace(dict(zip(range(low,upp+1),range(upp,low-1,-1))))
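
The mapping built in the loop is the same as computing `low + upp - value` for every answer. The sketch below checks this on a toy series; the values are hypothetical, and the arithmetic form is shown only for illustration, not as the method used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy answers on a 1-6 scale with one missing value (hypothetical data)
answers = pd.Series([1, 2, 6, np.nan, 4])

low, upp = int(answers.min()), int(answers.max())

# dictionary-based reversal: 1 -> 6, 2 -> 5, ..., 6 -> 1
mapping = dict(zip(range(low, upp + 1), range(upp, low - 1, -1)))
reversed_by_map = answers.replace(mapping)

# arithmetic reversal gives the same result and leaves NaN untouched
reversed_by_formula = (low + upp) - answers

print(reversed_by_map.tolist())
print(reversed_by_formula.tolist())
```

Note that both forms rely on the observed minimum and maximum. If no respondent had picked one of the end points of a scale, the reversed values would be shifted, so it may be worth comparing the observed range with the range given in the 'Study Documentation'.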

The dataframe with reversed scales: 

In [22]:
ess.sample(10)

Unnamed: 0,cntry,gndr,nwspol,netustm,agea,eduyrs,netusoft,ppltrst,pplfair,pplhlp,...,iphlppl,ipsuces,ipstrgv,ipadvnt,ipbhprp,iprspot,iplylfr,impenv,imptrad,impfun
31950,PL,2,180.0,,78,10.0,1.0,8.0,5.0,2.0,...,5.0,4.0,6.0,1.0,5.0,3.0,6.0,5.0,5.0,1.0
3147,BE,2,15.0,60.0,43,16.0,5.0,7.0,8.0,5.0,...,6.0,6.0,6.0,3.0,5.0,4.0,5.0,6.0,4.0,5.0
4202,BE,1,45.0,,62,14.0,3.0,7.0,8.0,8.0,...,5.0,3.0,4.0,3.0,5.0,3.0,5.0,5.0,5.0,4.0
10517,CZ,1,0.0,,75,16.0,1.0,7.0,8.0,8.0,...,4.0,3.0,6.0,4.0,5.0,6.0,5.0,6.0,6.0,5.0
23290,IE,2,30.0,30.0,38,15.0,5.0,5.0,4.0,5.0,...,5.0,3.0,3.0,4.0,5.0,3.0,4.0,5.0,5.0,5.0
4312,BG,1,30.0,120.0,69,11.0,5.0,3.0,5.0,5.0,...,5.0,4.0,5.0,5.0,6.0,2.0,5.0,5.0,4.0,5.0
108,AT,1,20.0,105.0,75,12.0,4.0,7.0,8.0,5.0,...,3.0,3.0,,2.0,5.0,4.0,5.0,3.0,5.0,5.0
32746,RS,1,60.0,60.0,27,12.0,5.0,6.0,2.0,2.0,...,5.0,5.0,5.0,3.0,5.0,5.0,5.0,5.0,5.0,2.0
4388,BG,2,55.0,120.0,45,11.0,5.0,7.0,7.0,6.0,...,5.0,4.0,5.0,3.0,4.0,5.0,5.0,5.0,4.0,3.0
3213,BE,2,0.0,600.0,30,13.0,5.0,2.0,1.0,3.0,...,5.0,3.0,2.0,6.0,3.0,3.0,6.0,3.0,6.0,6.0


## Part II - Exploration and visualization

The base code for the interactive plots comes from Pascal Bliem and has been modified to fit the needs of this project. The visualizations use the **Bokeh** library. 

The plot consists of three sub-plots, from left to right:
1. Line graph - the second variable as a function of the first variable (var2 vs. var1). It shows the median together with the first and third quartiles, calculated after adding jitter to the values to obtain a more continuous scale. The median and interquartile range were chosen over the mean and standard deviation because the variable distributions can be skewed.
2. Bar chart - mean values and standard deviations of the first variable for each country in the survey
3. Bar chart - mean values and standard deviations of the second variable for each country in the survey

Sub-plots 2 and 3 are sorted by the mean value, with the highest on the left. **Hover** over the bars to display detailed summary values:
- the overall mean and standard deviation, 
- the male and the female mean and standard deviation 
- the mean and standard deviation for respondents aged 20, 40, and 60 
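
The summary values listed above come from grouped means and standard deviations. A minimal sketch of that calculation on toy data follows; the values are hypothetical, while the coding of `gndr` (1 for male, 2 for female) follows the plotting code in this notebook.

```python
import pandas as pd

# Toy answers (hypothetical values); gndr: 1 = male, 2 = female
toy = pd.DataFrame({
    "cntry": ["AT", "AT", "AT", "BE", "BE", "BE"],
    "gndr":  [1, 2, 2, 1, 1, 2],
    "happy": [8, 6, 7, 9, 8, 7],
})

# overall mean and standard deviation per country
overall = toy.groupby("cntry")["happy"].agg(["mean", "std"])

# the same statistics split by gender; the first index level is the gender code
by_gender = toy.groupby(["gndr", "cntry"])["happy"].agg(["mean", "std"])
female_mean = by_gender.loc[2, "mean"]   # Series indexed by country

# countries ordered as in the bar charts: highest mean first
ranked = overall.sort_values("mean", ascending=False)

print(ranked)
print(female_mean)
```

The age values in the hover text are obtained the same way, by grouping on `agea` and selecting the rows for the ages 20, 40, and 60.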

The dropdown menus allow selecting the variables to be displayed and filtering the first sub-plot by country. 

**You need to run the notebook to see the interactive plots.**
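
Independently of Bokeh, the numbers behind the line graph can be sketched with plain pandas: jitter the y-values, then take the first quartile, the median, and the third quartile of y for each value of x. The data below is hypothetical, and the seeded random generator is an assumption made only so that the example is reproducible.

```python
import numpy as np
import pandas as pd

# seeded generator so that the jitter is reproducible (assumption for this example)
rng = np.random.default_rng(0)

# Toy answers (hypothetical): x = health, y = happy
toy = pd.DataFrame({
    "health": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "happy":  [3, 4, 5, 5, 6, 7, 8, 8, 9],
})

# jitter the y-values by up to half a scale step in either direction
toy["happy_jittered"] = toy["happy"] + rng.uniform(-0.5, 0.5, len(toy))

# Q1, median, and Q3 of the jittered y-values for each x-value
summary = (
    toy.groupby("health")["happy_jittered"]
       .quantile([0.25, 0.5, 0.75])
       .unstack()
)
summary.columns = ["Q1", "median", "Q3"]

print(summary)
```

Each row of `summary` corresponds to one point on the line (the median) and the lower and upper edge of the shaded band (Q1 and Q3).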

In [23]:
def ess_plot(var1="health",var2="happy"):
    """
    This function creates an interactive graphic visualizing insights from the ESS9.
    
    It consists of three sub-plots:
       - A line graph that shows the median of one variable as a function of another variable (can be filtered by country).
       - A bar chart that shows the mean of the first variable for each country
       - A bar chart that shows the mean of the second variable for each country
       
    Hovering over the graphs will display values in detail.
    
    Parameters
    ----------
    var1, var2 : string
            Names of the two variables to compare. 

    Returns
    -------
    No returns
    """
    
    import math
    from bokeh.plotting import figure 
    from bokeh.io import output_notebook, show, push_notebook
    output_notebook() # the figure will be rendered inline
    from bokeh.models import Band
    from bokeh.models import Range1d
    from bokeh.models import FactorRange
    from bokeh.models import HoverTool
    from bokeh.models import ColumnDataSource
    from bokeh.layouts import row
    from ipywidgets import interact
    
    # width and height for subplots
    p_width ,p_height = 323, 383
    
    #  X and Y variable values
    x = var1
    y = var2
    
    # country filter variable and dictionary 
    cntry = "All countries"
    cntry_dict = dict(zip(
                          ["All countries","Austria","Belgium","Bulgaria","Switzerland","Cyprus","Czechia","Germany",
                           "Estonia", "Finland","France","United Kingdom","Hungary","Ireland","Italy",
                           "Netherlands","Norway","Poland","Serbia","Slovenia"],
                          ["All countries",'AT', 'BE', 'BG','CH','CY', 'CZ', 'DE', 'EE', 'FI', 'FR', 'GB', 'HU','IE',  
                            'IT',  'NL', 'NO', 'PL', 'RS','SI']))
    
    # boolean False for first setup, later True when plot is updated
    setup = False
    
    # notebook handle - initialized later and used for updating the plot
    global h
    h = None
    
    def calc_median_iqr():
        """Calculates medians and quartiles for plotting"""
        nonlocal x
        nonlocal y
        nonlocal cntry
        
        # get a copy of the variable columns
        if cntry == "All countries":
            ess_c = ess[[x,y]].copy()
        else:
            c = cntry_dict[cntry]
            ess_c = ess.query("cntry == @c")[[x,y]].copy()
        
        # get the y-range
        yrange = (ess_c[y].min(),ess_c[y].max())
        
        # jitter the y-values
        ess_c[y] = ess_c[y] + np.random.uniform(-0.5,0.5,len(ess_c[y]))
        
        # remove NaNs from x-values because bokeh apparently has a problem converting them to JSON
        xs = sorted([n for n in ess_c[x].unique() if not math.isnan(n)])
        
        # get x-range
        xrange = (min(xs),max(xs))
        
        # calculate the median, first, and third quartile of the y-values for each x-value
        medians = [ess_c.loc[ess_c[x]==i,y].median() for i in xs]
        Q3 = [ess_c.loc[ess_c[x]==i,y].quantile(0.75) for i in xs]
        Q1 = [ess_c.loc[ess_c[x]==i,y].quantile(0.25) for i in xs]
        
        return yrange, xrange, xs, medians, Q1, Q3
    
    #################################
    # # # # Plot 1: Line plot X vs. Y
    
    # calculate median, Q1, Q3 etc.
    yrange, xrange, xs, medians, Q1, Q3 = calc_median_iqr()
    
    # set up the data source
    source1 = ColumnDataSource(dict(x = xs, medians = medians, Q3 = Q3, Q1 = Q1,))
        
    # create figure 1 for X vs. Y plot
    p1 = figure(plot_width= p_width, plot_height=p_height,
               title=cntry,y_range=yrange,x_range=xrange,
               tools="hover", tooltips="@x -> @medians")
        
    # line plot that shows the median +/- inter-quartile range of one variable as a function of another variable
    p1.line("x", "medians", color="black",line_width = 3,source=source1,legend="Median\n+/- IQR")
        
    # plot the inter-quartile range (IQR) as a band
    band = Band(base='x', lower='Q1', upper='Q3', source=source1, 
                  level='underlay', fill_color="lightseagreen",fill_alpha=.6, line_width=2, line_color='black',)
    p1.add_layout(band)
    p1.title.align = 'center'
    p1.legend.location = "bottom_right"
    
    
    ##############################
    # # # # Plot 2: Bar plot for X 
    
    def calc_cntry_X():
        """Calculates variable X for all countries"""
        nonlocal x
        
        # get mean and standard deviation of X for each country, also grouped by gender and age(20,40,60)
        gr = ess.groupby("cntry")[x].agg(["mean","std"])
        mean = gr["mean"]
        std = gr["std"]
        gr = ess.groupby(["gndr","cntry"])[x].agg(["mean","std"])
        mean_f = gr.loc[2,"mean"]
        std_f = gr.loc[2,"std"]
        mean_m = gr.loc[1,"mean"]
        std_m = gr.loc[1,"std"]
        gr = ess.groupby(["agea","cntry"])[x].agg(["mean","std"])
        mean_y20 = gr.loc[20,"mean"]
        std_y20 = gr.loc[20,"std"]
        mean_y40 = gr.loc[40,"mean"]
        std_y40 = gr.loc[40,"std"]
        mean_y60 = gr.loc[60,"mean"]
        std_y60 = gr.loc[60,"std"]
        
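        # get list of countries without "All countries"
        # (the dict lists countries in alphabetical order of their codes, which matches the groupby order)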
        cntry_list = list(cntry_dict.keys())[1:]
        
        # sort so that the highest mean value is in front
        zipped = zip(mean, std, mean_f, std_f, mean_m, std_m, mean_y20, std_y20, mean_y40, std_y40, mean_y60, std_y60, cntry_list)
        zipped = sorted(zipped,reverse=True)
        mean, std, mean_f, std_f, mean_m, std_m, mean_y20, std_y20, mean_y40, std_y40, mean_y60, std_y60, cntry_list = zip(*zipped)
        
        return mean, std, mean_f, std_f, mean_m, std_m, mean_y20, std_y20, mean_y40, std_y40, mean_y60, std_y60, cntry_list
    
    # calc means and stds
    xmean, xstd, xmean_f, xstd_f, xmean_m, xstd_m, xmean_y20, xstd_y20, xmean_y40, xstd_y40, xmean_y60, xstd_y60, xcntry_list = calc_cntry_X()
    
    # set up the data source 2
    source2 = ColumnDataSource(dict(    cntry = xcntry_list,
                                        mean = xmean,
                                        std = xstd,
                                        std_h = [m+s for m,s in zip(xmean,xstd)],
                                        std_l = [m-s for m,s in zip(xmean,xstd)],
                                        mean_f = xmean_f,
                                        std_f = xstd_f,
                                        mean_m = xmean_m,
                                        std_m = xstd_m,
                                        mean_y20 = xmean_y20,
                                        std_y20 = xstd_y20,
                                        mean_y40 = xmean_y40,
                                        std_y40 = xstd_y40,
                                        mean_y60 = xmean_y60,
                                        std_y60 = xstd_y60,))
    
    # second plot: bar chart for all countries 
    p2 = figure(plot_width=p_width , plot_height=p_height,
               title=variables.query("Name == @x").Label.values[0],
               x_range = xcntry_list,
               y_range = (0,max(xmean)+max(xstd)+0.5),
               tools="")
    
    # bar plot that shows the variable for each country
    bars2 = p2.vbar(x='cntry', top='mean', width=0.8,source=source2,legend="Mean +/- Std.Dev.",
            fill_color='lightseagreen', fill_alpha=0.7, line_color="teal", 
            hover_fill_color="palevioletred", hover_alpha= 0.6, hover_line_color="mediumvioletred")
    
    # add lines to represent the spread given by the standard deviation
    p2.segment(x0='cntry', y0="std_l", x1='cntry', y1="std_h", line_width=2, color="black",source=source2)
    
    # set up the hover tool
    p2.add_tools(HoverTool(tooltips=[("Country", "@cntry"),
                                     ("All", "@mean{0.2f} +/- @std{0.2f}"),
                                     ("Female", "@mean_f{0.2f} +/- @std_f{0.2f}"),
                                     ("Male", "@mean_m{0.2f} +/- @std_m{0.2f}"),
                                     ("Aged 20", "@mean_y20{0.2f} +/- @std_y20{0.2f}"),
                                     ("Aged 40", "@mean_y40{0.2f} +/- @std_y40{0.2f}"),
                                     ("Aged 60", "@mean_y60{0.2f} +/- @std_y60{0.2f}"),],
                           renderers=[bars2]))
    
    # adjust legend location and rotate country labels
    p2.title.align = 'center'
    p2.legend.location = "top_right"
    p2.xaxis.major_label_orientation = math.pi/2
    
    ##############################
    # # # # Plot 3: Bar plot for Y
    
    def calc_cntry_Y():
        """Calculates variable Y for all countries"""
        nonlocal y
        
        # get mean and standard deviation of Y for each country, also grouped by gender and age(20,40,60)
        gr = ess.groupby("cntry")[y].agg(["mean","std"])
        mean = gr["mean"]
        std = gr["std"]
        gr = ess.groupby(["gndr","cntry"])[y].agg(["mean","std"])
        mean_f = gr.loc[2,"mean"]
        std_f = gr.loc[2,"std"]
        mean_m = gr.loc[1,"mean"]
        std_m = gr.loc[1,"std"]
        gr = ess.groupby(["agea","cntry"])[y].agg(["mean","std"])
        mean_y20 = gr.loc[20,"mean"]
        std_y20 = gr.loc[20,"std"]
        mean_y40 = gr.loc[40,"mean"]
        std_y40 = gr.loc[40,"std"]
        mean_y60 = gr.loc[60,"mean"]
        std_y60 = gr.loc[60,"std"]
        
        # get list of countries without "All countries"
        cntry_list = list(cntry_dict.keys())[1:]
        
        # sort so that the highest mean value is in front
        zipped = zip(mean, std, mean_f, std_f, mean_m, std_m, mean_y20, std_y20, mean_y40, std_y40, mean_y60, std_y60, cntry_list)
        zipped = sorted(zipped,reverse=True)
        mean, std, mean_f, std_f, mean_m, std_m, mean_y20, std_y20, mean_y40, std_y40, mean_y60, std_y60, cntry_list = zip(*zipped)
        
        return mean, std, mean_f, std_f, mean_m, std_m, mean_y20, std_y20, mean_y40, std_y40, mean_y60, std_y60, cntry_list
    
    # calc means and stds
    ymean, ystd, ymean_f, ystd_f, ymean_m, ystd_m, ymean_y20, ystd_y20, ymean_y40, ystd_y40, ymean_y60, ystd_y60, ycntry_list = calc_cntry_Y()
    
    # set up the data source 3
    source3 = ColumnDataSource(dict(    cntry = ycntry_list,
                                        mean = ymean,
                                        std = ystd,
                                        std_h = [m+s for m,s in zip(ymean,ystd)],
                                        std_l = [m-s for m,s in zip(ymean,ystd)],
                                        mean_f = ymean_f,
                                        std_f = ystd_f,
                                        mean_m = ymean_m,
                                        std_m = ystd_m,
                                        mean_y20 = ymean_y20,
                                        std_y20 = ystd_y20,
                                        mean_y40 = ymean_y40,
                                        std_y40 = ystd_y40,
                                        mean_y60 = ymean_y60,
                                        std_y60 = ystd_y60,))
    
    # third plot: bar chart for all countries 
    p3 = figure(plot_width=p_width , plot_height=p_height,
               title=variables.query("Name == @y").Label.values[0],
               x_range = ycntry_list,
               y_range = (0,max(ymean)+max(ystd)+0.5),
               tools="")
    
    # bar plot that shows the variable for each country
    bars3 = p3.vbar(x='cntry', top='mean', width=0.8,source=source3,legend="Mean +/- Std.Dev.",
            fill_color='lightseagreen', fill_alpha=0.7, line_color="teal", 
            hover_fill_color="palevioletred", hover_alpha= 0.6, hover_line_color="mediumvioletred")
    
    # add lines to represent the spread given by the standard deviation
    p3.segment(x0='cntry', y0="std_l", x1='cntry', y1="std_h", line_width=2, color="black",source=source3)
    
    # set up the hover tool
    p3.add_tools(HoverTool(tooltips=[("Country", "@cntry"),
                                     ("All", "@mean{0.2f} +/- @std{0.2f}"),
                                     ("Female", "@mean_f{0.2f} +/- @std_f{0.2f}"),
                                     ("Male", "@mean_m{0.2f} +/- @std_m{0.2f}"),
                                     ("Aged 20", "@mean_y20{0.2f} +/- @std_y20{0.2f}"),
                                     ("Aged 40", "@mean_y40{0.2f} +/- @std_y40{0.2f}"),
                                     ("Aged 60", "@mean_y60{0.2f} +/- @std_y60{0.2f}"),],
                           renderers=[bars3]))
    
    # adjust legend location and rotate country labels
    p3.title.align = 'center'
    p3.legend.location = "top_right"
    p3.xaxis.major_label_orientation = math.pi/2
    
    def plot_styling(plots):
        """Styles the plot in seaborn-like style"""
    
        # various commands for styling the plot, I'm trying to give it the "seaborn" look which I like a lot
        for p in plots:
            p.background_fill_color="whitesmoke"
            p.background_fill_alpha=0.8
            p.axis.axis_line_color ="black"
            p.axis.minor_tick_line_color ="black"
            p.axis.major_tick_line_color ="black"
            p.legend.background_fill_color = "whitesmoke"
            p.legend.background_fill_alpha = 0.6
            p.legend.border_line_color="dimgrey"
            p.grid.grid_line_color = "silver"
            p.grid.grid_line_alpha = 0.8
            p.axis.major_label_text_font_size = "10pt"
            p.toolbar_location = None
            p.min_border_right = 10
    
    # update functions for dropdown variable selectors
    def updateX(VariableX):    
        nonlocal x 
        nonlocal y 
        nonlocal setup
        new = variables.query("Label == @VariableX").Name.values[0]
        if new != y:
            x = new
        if setup:
            update_plot()
        
    def updateY(VariableY): 
        nonlocal x 
        nonlocal y 
        nonlocal setup
        new = variables.query("Label == @VariableY").Name.values[0]
        if new != x:
            y = new
        if setup:
            update_plot()
            
    def updateCntry(Country): 
        nonlocal cntry
        nonlocal setup
        cntry = Country
        if setup:
            update_plot()
            
    # the main updating function    
    def update_plot():
        """The main function that creates and updates the plot elements"""
        nonlocal x
        nonlocal y
        nonlocal cntry
        
        # # # # # Updates for Plot 1
        
        # calculate median, Q1, Q3 etc.
        yrange, xrange, xs, medians, Q1, Q3 = calc_median_iqr()
        
        # update the data source
        source1.data = dict(x = xs,
                            medians = medians,
                            Q3 = Q3,
                            Q1 = Q1,)
        
        # update axis names and ranges
        p1.xaxis.axis_label = variables.query("Name == @x").Label.values[0]
        p1.yaxis.axis_label = variables.query("Name == @y").Label.values[0]
        p1.x_range.start = min(xs)
        p1.x_range.end = max(xs)
        p1.y_range.start = min(yrange)
        p1.y_range.end = max(yrange)
        p1.title.text = cntry
        
        # # # # # Updates for Plot 2
        
        # calc updated means and stds
        xmean, xstd, xmean_f, xstd_f, xmean_m, xstd_m, xmean_y20, xstd_y20, xmean_y40, xstd_y40, xmean_y60, xstd_y60, xcntry_list = calc_cntry_X()
        
        # update the data source 2
        source2.data = dict(cntry = xcntry_list,
                            mean = xmean,
                            std = xstd,
                            std_h = [m+s for m,s in zip(xmean,xstd)],
                            std_l = [m-s for m,s in zip(xmean,xstd)],
                            mean_f = xmean_f,
                            std_f = xstd_f,
                            mean_m = xmean_m,
                            std_m = xstd_m,
                            mean_y20 = xmean_y20,
                            std_y20 = xstd_y20,
                            mean_y40 = xmean_y40,
                            std_y40 = xstd_y40,
                            mean_y60 = xmean_y60,
                            std_y60 = xstd_y60,)
        
                                        
        # update range and title
        p2.x_range.factors = xcntry_list
        p2.y_range.end = max(xmean)+max(xstd)+0.5
        p2.title.text = variables.query("Name == @x").Label.values[0]
        
        # # # # # Updates for Plot 3
        
        # calc updated means and stds
        ymean, ystd, ymean_f, ystd_f, ymean_m, ystd_m, ymean_y20, ystd_y20, ymean_y40, ystd_y40, ymean_y60, ystd_y60,  ycntry_list = calc_cntry_Y()
        
        # update the data source 3
        source3.data = dict(cntry = ycntry_list,
                            mean = ymean,
                            std = ystd,
                            std_h = [m+s for m,s in zip(ymean,ystd)],
                            std_l = [m-s for m,s in zip(ymean,ystd)],
                            mean_f = ymean_f,
                            std_f = ystd_f,
                            mean_m = ymean_m,
                            std_m = ystd_m,
                            mean_y20 = ymean_y20,
                            std_y20 = ystd_y20,
                            mean_y40 = ymean_y40,
                            std_y40 = ystd_y40,
                            mean_y60 = ymean_y60,
                            std_y60 = ystd_y60,)
    
        # update range and title
        p3.x_range.factors = ycntry_list
        p3.y_range.end = max(ymean)+max(ystd)+0.5
        p3.title.text = variables.query("Name == @y").Label.values[0]
        
        # style the plots
        plot_styling([p1,p2,p3])
        
        # if not first setup, update plot with push_notebook
        global h
        if setup:
            push_notebook(handle=h)
        
            
    # set up the interactive dropdown variable and country selector
    x_default = x
    y_default = y
    x_first = variables.query("Name == @x_default ").Label.values.tolist()
    y_first = variables.query("Name == @y_default ").Label.values.tolist()
    var_x = interact(updateX,VariableX=x_first+list(ordinal.Label.values))
    var_y = interact(updateY,VariableY=y_first+list(ordinal.Label.values))
    var_cntry = interact(updateCntry,Country=cntry_dict.keys())
    
    # build the plot
    update_plot()
    h = show(row(p1,p2,p3),notebook_handle=True)
    setup = True

### Plot: Subjective health vs happiness

The default variables to compare are `health` ('Subjective general health'), which ranges from 1 ('Very bad') to 5 ('Very good') after being reversed in the data preparation part of this project, and `happy` ('How happy are you'), which ranges from 0 ('Extremely unhappy') to 10 ('Extremely happy'). There is a known bug here: some variables start at 0 and some at 1, but the plotting function assumes that all of them start at 0. The correct starting value of each variable has to be looked up manually in the 'Study Documentation', because there is no guarantee that any respondent chose the answer at the very beginning of the scale.  
In the comparison plot, the first on the left, there is a visible dependence between respondents' health and happiness. Healthier people tend to be happier, and the interquartile range narrows at higher values.  
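
One possible way to handle the scale-start bug described above is a small lookup of documented bounds that the plotting code consults before falling back to the observed data. This is only a sketch: `SCALE_BOUNDS` and `axis_range` are hypothetical names that do not exist in the notebook, and only the two entries quoted in the text are filled in. Bounds for other variables would still have to be copied from the 'Study Documentation'.

```python
# Hypothetical lookup of documented scale bounds (not part of the notebook).
# Only `health` and `happy` are known from the text; other variables would
# have to be added by hand from the Study Documentation.
SCALE_BOUNDS = {
    "health": (1, 5),   # 'Very bad' .. 'Very good' (after reversal)
    "happy": (0, 10),   # 'Extremely unhappy' .. 'Extremely happy'
}


def axis_range(name, observed, default_start=0):
    """Return (start, end) for an axis, preferring documented bounds.

    For variables missing from SCALE_BOUNDS this falls back to the behaviour
    described in the text: start at 0 and end at the largest observed value.
    """
    if name in SCALE_BOUNDS:
        return SCALE_BOUNDS[name]
    return (default_start, max(observed))


print(axis_range("health", [2, 3, 5]))       # documented bounds win
print(axis_range("unknown_var", [2, 3, 7]))  # fallback to the old assumption
```

The advantage of an explicit table is that the axis no longer depends on whether any respondent happened to pick the lowest answer.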
  
  
The central and the right plots compare the corresponding variables between the European countries that participated in the survey. Hovering over the bar of a country displays a summary. For each country, the summary lists the overall mean value of the variable +/- standard deviation, the same figures split into female and male answers, and the mean and standard deviation for respondents aged 20, 40, and 60 for comparison. 
The healthiest and happiest country is Switzerland. Let's choose 2 countries for a closer analysis: Poland and Ireland.  
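
The numbers in the hover summary are grouped means and standard deviations. The sketch below reproduces that idea with the standard library only, on a handful of invented records, so it runs without the ESS data frame. The `rows` data and the `summarise` helper are assumptions for illustration. The notebook itself uses pandas `groupby(...).agg(["mean", "std"])`, whose default sample standard deviation matches `statistics.stdev`.

```python
from collections import defaultdict
from statistics import mean, stdev

# Invented toy records: (country, gender code, answer).
# Gender code 1 = male, 2 = female, as in the plotting code.
rows = [
    ("IE", 1, 8), ("IE", 2, 9), ("IE", 1, 7), ("IE", 2, 8),
    ("PL", 1, 7), ("PL", 2, 6), ("PL", 1, 8), ("PL", 2, 7),
]


def summarise(rows, key):
    """Group answers by `key(row)` and return {group: (mean, sample std)}."""
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row[2])
    return {k: (mean(v), stdev(v)) for k, v in groups.items()}


by_country = summarise(rows, key=lambda r: r[0])
by_country_gender = summarise(rows, key=lambda r: (r[0], r[1]))

# Format the values the same way as the hover tooltips do.
for country, (m, s) in by_country.items():
    print(f"{country}: {m:.2f} +/- {s:.2f}")
```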
  
  
In the `health` plot, Ireland has the second-highest mean value after Switzerland, with an overall mean of 4.12 +/- 0.86 (range 1-5); women score slightly better, with 4.14 +/- 0.87. As one might expect, the age categories show that subjective health decreases with age (4.39 for participants aged 20 - 4.04 for participants aged 60 = 0.35). Poland sits in the middle of the plot, in tenth position, with an overall mean of 3.83 +/- 0.94. In contrast to Ireland, male respondents have the better result here, with 3.93 +/- 0.93. The age categories again show a decrease with age, but in Poland the decrease is much larger (4.50 for participants aged 20 - 3.68 for participants aged 60 = 0.82). 
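
The age comparison above is a simple difference of two group means. A minimal sketch that recomputes it from the values quoted in the hover summaries (the dictionary layout and the `age_drop` helper are only illustrative):

```python
# Mean `health` for respondents aged 20 and 60, copied from the text above.
health_means = {
    "Ireland": {"aged_20": 4.39, "aged_60": 4.04},
    "Poland": {"aged_20": 4.50, "aged_60": 3.68},
}


def age_drop(means):
    """Difference between the age-20 and age-60 means, rounded to 2 decimals."""
    return round(means["aged_20"] - means["aged_60"], 2)


drops = {country: age_drop(m) for country, m in health_means.items()}
print(drops)  # {'Ireland': 0.35, 'Poland': 0.82}
```

Note that these are differences between means of single ages, without any test of whether the gap is statistically meaningful.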
  
  
In the `happy` plot, Ireland takes eighth place with an overall mean of 7.70 +/- 1.65 (range 0-10); female participants again feel a bit happier (7.77 +/- 1.68) than male participants (7.63 +/- 1.63). The happiest of the three age categories is age 40, with a mean of 7.58 +/- 1.76. Poland takes thirteenth place with a mean of 7.26 +/- 1.91. Despite their lower subjective health score, women seem to be a bit happier, with a mean of 7.29 +/- 1.90. In Poland the happiest age category is also age 40, with 7.78 +/- 1.58, which is, interestingly, much higher than the overall mean for the whole country. 

In [24]:
ess_plot()


### Plot: Ashamed if gay/lesbian in family vs Immigrants influence on country

The variables can be changed to represent different features. To explore the data further, set the first variable to `hmsfmlsh`, ranked 1 ('Disagree strongly') to 5 ('Agree strongly') after being reversed in the data preparation part, which represents the respondents' answers to the question *"Are you ashamed if a close family member is gay or lesbian?"*, and set the second variable to `imwbcnt`, ranked 0 ('Worse place to live') to 10 ('Better place to live'), which represents answers to the question *"Do immigrants make the country a worse or a better place to live?"*. 
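
The text mentions that `hmsfmlsh` was reversed during data preparation. That step is not shown in this section, so the following is only a guess at what such a reversal usually looks like: mirroring an answer within its scale so that the lowest code becomes the highest. `reverse_scale` is a hypothetical helper, not a function from the notebook.

```python
def reverse_scale(value, low, high):
    """Mirror an ordinal answer so that `low` maps to `high` and vice versa."""
    if not low <= value <= high:
        raise ValueError(f"{value} is outside the scale {low}..{high}")
    return low + high - value


# On a 1..5 scale the middle answer stays in place and the ends swap.
print([reverse_scale(v, 1, 5) for v in range(1, 6)])  # [5, 4, 3, 2, 1]
```

With the scale reversed this way, a higher value of `hmsfmlsh` means stronger agreement with the statement, which is how the plots in this section should be read.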

In [25]:
ess_plot("hmsfmlsh","imwbcnt")


In the comparison plot, on the left, there is a slight but visible dependence between respondents feeling ashamed if a family member is homosexual and their view that immigrants make the country a worse place to live. The interquartile range shows this trend even more strongly.  
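
The left plot summarises the second variable by its median and interquartile range at each level of the first variable. The body of `calc_median_iqr` is not shown in this section, so the sketch below is an assumption about the kind of computation involved, written with the standard library and invented data. `median_iqr_by_level` is a hypothetical name.

```python
from collections import defaultdict
from statistics import quantiles

# Invented (x, y) answer pairs: x on a 1..5 scale, y on a 0..10 scale.
pairs = [
    (1, 6), (1, 7), (1, 8), (1, 9), (1, 10),
    (5, 2), (5, 3), (5, 4), (5, 5), (5, 6),
]


def median_iqr_by_level(pairs):
    """Return {x: {"Q1", "median", "Q3"}} of y for every observed x level."""
    by_level = defaultdict(list)
    for x, y in pairs:
        by_level[x].append(y)
    summary = {}
    for x, ys in sorted(by_level.items()):
        # The "inclusive" method interpolates linearly between data points.
        q1, med, q3 = quantiles(ys, n=4, method="inclusive")
        summary[x] = {"Q1": q1, "median": med, "Q3": q3}
    return summary


for level, stats in median_iqr_by_level(pairs).items():
    print(level, stats)
```

A falling median from one level to the next, as in this toy data, is what a negative dependence looks like in the left plot.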
 
Again, the central and the right plots compare the corresponding variables between the European countries that participated in the survey. The respondents who would be most ashamed of a gay or lesbian relative live in Serbia, with an overall mean of 3.36 +/- 1.33, and the most tolerant ones live in Norway, with an overall mean of 1.40 +/- 0.71. Let's choose the same 2 countries for a closer analysis: Poland and Ireland.  

In the `ashamed of homosexuality` plot, Ireland takes eleventh place with a mean of 1.77 +/- 1.05 (range 1-5); male participants are slightly less tolerant, with 1.84 +/- 1.07. The age categories show that tolerance is highest (the lowest value of being ashamed) at age 20, with a mean of 1.39, similar to the overall mean in Norway. Poland is in sixth position, which makes it one of the less tolerant countries in Europe, with an overall mean of 2.77 +/- 1.29; male respondents are also less tolerant than female respondents, with 2.90 +/- 1.26. The age categories again show that younger respondents are less ashamed of a family member being homosexual. 

In the `immigrants` plot, Ireland takes first place as the most immigrant-friendly country, with an overall mean of 6.28 +/- 2.33 (range 0-10); male participants are even more welcoming (6.41 +/- 2.36) than female participants (6.16 +/- 2.30). The youngest age category scores highest, with a mean of 7.00 +/- 1.85. Poland again takes sixth place, with a mean of 5.63 +/- 2.21. Men and women seem to think similarly about immigrants, with means close to the overall mean. Interestingly, in Poland it is the oldest age category that scores highest, with 6.38 +/- 2.43. 

### Plot: Trust in European Parliament vs Trust in United Nations

To explore the data further, we set the first variable to `trstep` and the second variable to `trstun`, both ranked 0 ('No trust at all') to 10 ('Complete trust'). The questions are about trust in the European Parliament (EP) and in the United Nations (UN), respectively.

In [26]:
ess_plot("trstep","trstun")


The comparison plot, on the left, clearly shows the correlation between respondents' answers to both questions. Those who trust the European Parliament also tend to trust the United Nations. The interquartile range is broader at the lower values of the plot, which suggests that some respondents with no trust in the European Parliament still trust the United Nations, but not the other way around.
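
The visual impression of correlation could be backed by a number. The notebook does not compute one in this section. Spearman's rank correlation is my own choice here because both variables are ordinal. The sketch implements it as a Pearson correlation on average ranks, using the standard library and invented answers. All function names are hypothetical.

```python
from math import sqrt


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(a, b):
    """Pearson correlation of two equally long sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(va * vb)


def spearman(a, b):
    """Spearman's rho: Pearson correlation of the average ranks."""
    return pearson(average_ranks(a), average_ranks(b))


# Invented trust answers on the 0..10 scale (EP, UN) for six respondents.
trust_ep = [0, 2, 5, 5, 8, 10]
trust_un = [3, 4, 5, 7, 8, 9]
print(round(spearman(trust_ep, trust_un), 3))
```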

The central and the right plots compare the corresponding variables between the European countries that participated in the survey. Trust in both the European Parliament and the United Nations is highest in Norway, with an overall mean of 5.45 +/- 2.12 for the EP and 6.96 +/- 1.95 for the UN. Trust is lowest in Serbia for the EP (3.07 +/- 2.87) and in Bulgaria for the UN (3.32 +/- 2.74). Let's choose 2 countries for a closer analysis: Poland and Ireland.
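
The positions quoted for each country come from ordering the bars by mean, highest first, which the plotting code does with a zip, sort, unzip pattern. The sketch below applies the same pattern to the four EP trust values quoted in this section. Because it covers only a subset of countries, the positions it yields differ from those in the full plot.

```python
# Overall mean and standard deviation of trust in the EP, copied from the text.
means = [4.96, 3.07, 5.45, 4.64]
stds = [2.53, 2.87, 2.12, 2.41]
countries = ["Ireland", "Serbia", "Norway", "Poland"]

# Tuples sort by their first element, so the mean decides the order and the
# other columns travel along with it.
zipped = sorted(zip(means, stds, countries), reverse=True)
means_sorted, stds_sorted, countries_sorted = zip(*zipped)

for position, (country, m, s) in enumerate(
        zip(countries_sorted, means_sorted, stds_sorted), start=1):
    print(f"{position}. {country}: {m:.2f} +/- {s:.2f}")
```

One caveat of sorting whole tuples is that ties in the mean are broken by the next column, here the standard deviation, rather than by country name.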

In the `trust in EP` plot, Ireland takes fifth place with a mean of 4.96 +/- 2.53 (range 0-10); the results for male and female participants are both close to the overall mean. The age categories differ more: respondents aged 20 have the highest trust in the EP (5.44 +/- 2.41), and respondents aged 60 have the lowest (4.04 +/- 2.69). Poland is in ninth position, with an overall mean of 4.64 +/- 2.41; female respondents have slightly higher trust in the EP (4.68 +/- 2.45). The age pattern is reversed in comparison to Ireland: trust in the EP is highest in the oldest group (5.15 +/- 2.48) and lowest in the youngest group (4.43 +/- 2.01).

In the `trust in UN` plot, Ireland takes fifth place with an overall mean of 5.55 +/- 2.58 (range 0-10); the results for male and female participants are both close to the overall mean. As with trust in the EP, respondents aged 20 have the highest trust in the UN (5.83 +/- 2.66), and respondents aged 60 have the lowest (4.88 +/- 2.95). Poland follows in sixth position, with an overall mean of 5.50 +/- 2.26; here too, the results for male and female participants are close to the overall mean. Again, trust in the UN is highest in the oldest group (5.68 +/- 1.95), while the groups aged 20 and 40 share similar results of around 5.33.

### Plot: Fair chance to achieve education vs Knowledge influence decision to recruit

For further exploration of the data, the first variable is set to `evfredu`, where the question is 'To what extent do you think statement *everyone in country fair chance achieve level of education they seek* applies in your country?', ranked 0 ('Does not apply at all') to 10 ('Applies completely'). The second variable is set to `recskil`, where respondents answer a question about 'Influence decision to recruit in country: person's knowledge and skills', ranked 1 ('Not much or no influence') to 4 ('A great deal of influence').

In [27]:
ess_plot("evfredu","recskil")


The comparison plot, on the left, shows some correlation between respondents' answers to both questions. Those who believe that everyone has a fair chance to achieve education also more often believe that knowledge and skills influence recruitment decisions. The interquartile range is broad across the full spectrum of values.

The central and the right plots compare the corresponding variables between the European countries that participated in the survey. The belief that everyone has a fair chance to achieve education is highest in Estonia, with an overall mean of 7.48 +/- 2.27, and lowest in France, with an overall mean of 4.71 +/- 2.35. The difference between those two countries is considerable. Respondents in the Netherlands have the strongest belief among European countries that knowledge and skills have a big influence on the job recruitment process, with an overall mean of 3.51 +/- 0.67, and the belief in that statement is weakest in Serbia, with an overall mean of 2.41 +/- 1.03. Let's choose 2 countries for a closer analysis: Poland and Ireland.

In the plot presenting the `belief in fair chance to get an education`, Ireland takes eleventh place with a mean of 6.24 +/- 2.35 (range 0-10); male participants' belief is higher, with a mean of 6.40 +/- 2.40. The age categories differ: respondents aged 20 have the lowest belief (5.12 +/- 3.00), and respondents aged 60 have the highest (7.41 +/- 1.72). Poland is in eighth position, with an overall mean of 6.42 +/- 2.43; female respondents have a slightly higher belief (6.48 +/- 2.35). Among the age groups, the belief in a fair chance to get an education is highest in the middle group (6.96 +/- 1.82), the youngest group is only slightly lower (6.85 +/- 1.95), and the oldest group is lowest (5.25 +/- 2.63), which is the opposite of the result in Ireland.

In the plot presenting the `knowledge and skills influence decision to recruit`, Ireland takes a high third place with a mean of 3.28 +/- 0.80 (range 1-4); the results for male and female participants are both close to the overall mean. The age categories differ: respondents aged 20 rate the influence highest (3.44 +/- 0.70), and respondents aged 60 rate it lowest (3.11 +/- 0.80), which is still quite high. Poland follows in fourth position, with an overall mean of 3.26 +/- 0.71; again, the results for male and female participants are close to the overall mean. Respondents aged 20 rate the influence highest (3.38 +/- 0.80), and the remaining age groups are close to the overall mean.

## To be continued

This part covered the visualizations and exploratory data analysis. The next part will include hypothesis testing.  