## Predicting Life Satisfaction
### Exploratory Data Analysis  
  
#### Objectives:  
* [Analysis of missing observations](#Analysis-of-Missing-Observations)
* [Analysis of categorical/numeric features](#Categorical-and-Numeric-Variables)
* [Correlation analysis](#Correlation-Analysis)
* [One-way/two-way variable plots (histograms, scatterplots, etc.)](#Visualizations)

### From Kaggle:
  
#### File descriptions  
* train.csv - the training set with some preprocessing of values.  
* train_raw.csv - the training set with original responses (no preprocessing).  
  
#### Data fields
* v1-v270 - survey response fields
* cntry - survey respondent country
* satisfied - whether (1) or not (0) the survey respondent is 'very satisfied' with their life (training set only)

### Data Cleaning Steps
1) Recode every missing response (NA, "", ., .a, .b, .c, .d) as -1

2) Remove variables/columns that have more than 30% of responses missing

3) Remove observations/rows that have more than 30 responses missing across the remaining questions/variables

4) Encode country and language responses

5) Clean data types of variables/columns to corresponding str/float/int64
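
Steps 1 to 3 above can be sketched on a toy frame as follows. This is a minimal illustration, not the notebook's actual cleaning code: the helper name `clean_survey`, its parameters, and the toy data are assumptions made for the example, and steps 4 and 5 (encoding and dtype cleanup) are left out.

```python
import numpy as np
import pandas as pd

# Codes treated as "missing" in step 1 (taken from the list above)
MISSING_CODES = ["", ".", ".a", ".b", ".c", ".d"]


def clean_survey(df, col_threshold=0.30, row_threshold=30,
                 protected=("id", "satisfied")):
    """Apply cleaning steps 1-3 to a survey DataFrame (hypothetical helper)."""
    # Work on object columns so that -1 can sit next to string responses
    out = df.astype(object)

    # Step 1: flag NA and the dot-codes, then recode them to -1
    is_missing = out.isna() | out.isin(MISSING_CODES)
    out = out.mask(is_missing, -1)

    # Step 2: drop columns whose share of missing responses exceeds the threshold
    col_share = is_missing.mean()
    drop_cols = [c for c in col_share[col_share > col_threshold].index
                 if c not in protected]
    out = out.drop(columns=drop_cols)
    is_missing = is_missing.drop(columns=drop_cols)

    # Step 3: drop rows with too many missing responses in the remaining columns
    keep_rows = is_missing.sum(axis=1) <= row_threshold
    return out.loc[keep_rows]


# Toy data (made up for illustration only)
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "v1": ["2", ".a", "4", "1"],
    "v2": [".", ".b", np.nan, "3"],
    "cntry": ["AT", "AT", "SI", "SI"],
})

# v2 is 75% missing and is dropped; with row_threshold=0 the row missing v1 goes too
cleaned = clean_survey(toy, row_threshold=0)
print(cleaned)
```

Computing the missing mask once and reusing it for both the column and the row filter keeps the two thresholds consistent with each other.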



In [1]:
# Import packages
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sns

# Set options
pd.options.display.max_rows = 999
pd.options.display.max_columns = 999
pd.options.display.max_colwidth = None  # None disables truncation; -1 is deprecated since pandas 1.0

In [65]:
# Import data
train_raw = pd.read_csv("../01-data/train_raw.csv", low_memory = False)
train = pd.read_csv("../01-data/train.csv", low_memory = False)

# Fill missing responses with "." so that they can be counted and categorized as missing later on
train_no_blanks = train.fillna('.')
train_raw_no_blanks = train_raw.fillna('.')

# Custom data
codebook = pd.read_csv("../01-data/codebook_compact.csv", low_memory = False) # Original codebook plus dtypes taken from codebook_long
codebook_labels = ['Variable', "Label"]

There are 30,080 observations, each with 273 columns (id, v1-v270, cntry, and satisfied).

In [3]:
# Mainly analyzing the preprocessed data
display(train_no_blanks.head())
train_no_blanks.shape

Unnamed: 0,id,v1,v2,v3,v4,v5,v6,v7,v8,v9,v10,v11,v12,v13,v14,v15,v16,v17,v18,v19,v20,v21,v22,v23,v24,v25,v26,v27,v28,v29,v30,v31,v32,v33,v34,v35,v36,v37,v38,v39,v40,v41,v42,v43,v44,v45,v46,v47,v48,v49,v50,v51,v52,v53,v54,v55,v56,v57,v58,v59,v60,v61,v62,v63,v64,v65,v66,v67,v68,v69,v70,v71,v72,v73,v74,v75,v76,v77,v78,v79,v80,v81,v82,v83,v84,v85,v86,v87,v88,v89,v90,v91,v92,v93,v94,v95,v96,v97,v98,v99,v100,v101,v102,v103,v104,v105,v106,v107,v108,v109,v110,v111,v112,v113,v114,v115,v116,v117,v118,v119,v120,v121,v122,v123,v124,v125,v126,v127,v128,v129,v130,v131,v132,v133,v134,v135,v136,v137,v138,v139,v140,v141,v142,v143,v144,v145,v146,v147,v148,v149,v150,v151,v152,v153,v154,v155,v156,v157,v158,v159,v160,v161,v162,v163,v164,v165,v166,v167,v168,v169,v170,v171,v172,v173,v174,v175,v176,v177,v178,v179,v180,v181,v182,v183,v184,v185,v186,v187,v188,v189,v190,v191,v192,v193,v194,v195,v196,v197,v198,v199,v200,v201,v202,v203,v204,v205,v206,v207,v208,v209,v210,v211,v212,v213,v214,v215,v216,v217,v218,v219,v220,v221,v222,v223,v224,v225,v226,v227,v228,v229,v230,v231,v232,v233,v234,v235,v236,v237,v238,v239,v240,v241,v242,v243,v244,v245,v246,v247,v248,v249,v250,v251,v252,v253,v254,v255,v256,v257,v258,v259,v260,v261,v262,v263,v264,v265,v266,v267,v268,v269,v270,cntry,satisfied
0,9948,2,2,74,11010,.a,2,2,2,2,1,1,2,2,1,0,0,66,2,2,AT33,2,2,.a,1,66,0,0,0,0,1,0,0,0,0,3,0,0,0,0,0,0,0,2,0,0,1,0,0,0,0,0,0,2,0,0,3,322,3,322,3,322,.a,.a,10,3,3,3,.a,.a,1,.a,1,1,2,2,9,1,66,2,1,3,4,3,3,2,.,.,.,.,.a,.a,.a,.a,.a,.a,.a,.,5,2,1,2,3,4,2,2,0,0,1,0,3,3,3,3,2,3,6,3,2,2,2,0,0,.,1,1,12,2,4,4,9,25,107,2015,2015,4,5,3,4,1,5,3,3,2,2,3,5,2,6,5,3341,.a,1,.a,GER,0,3,2,.a,5,5,66,2,6,.a,1,1,23,20,4,4,2,2,1,1993,0,0,3,10,10,8,1,2,2,2,2,7,1,.a,1,1,.a,.,.,.,.,.a,.a,.a,.a,.a,.a,.a,.,.,.,.,.,.a,.a,.a,.a,.a,.a,.a,.,.a,1,0,2,4,2,1,3,.b,5,1,9,1,4,2015,4,0,5,10,0,2,0,3,0,1,.a,2,.a,0,0,0,0,1,8,40,40,.a,2,2,1,2,2,1941,.,.,.,.,.a,.a,.a,.a,.a,.a,.a,.,AT,0
1,25601,4,2,58,11010,.a,2,2,2,2,1,2,2,2,2,0,0,66,2,2,AT31,2,2,.a,1,66,0,0,0,0,1,0,0,0,0,4,0,0,0,0,0,0,0,2,0,0,1,0,0,0,0,0,0,.a,0,0,3,322,2,212,2,212,.a,.a,12,3,2,2,.a,.a,1,.a,1,1,2,2,5,1,66,2,1,1,1,5,3,1,.,.,.,.,.a,.a,.a,.a,.a,.a,.a,.,4,3,1,2,4,5,2,3,0,0,1,6,3,2,4,3,3,4,4,4,4,1,5,4,2,.,17,17,19,15,1,1,17,46,75,2015,2015,3,4,3,3,1,1,4,2,2,3,2,2,3,3,1,7132,.a,2,.a,GER,0,5,.b,.a,6,6,66,3,3,.a,1,1,25,.a,8,8,2,2,1,2014,0,0,4,4,7,3,6,.a,4,5,2,5,1,.a,1,1,.a,.,.,.,.,.a,.a,.a,.a,.a,.a,.a,.,.,.,.,.,.a,.a,.a,.a,.a,.a,.a,.,.a,0,0,3,4,2,1,5,3,8,5,8,17,1,2015,4,.b,7,7,.b,5,5,.b,1,3,2,1,1,1,0,0,0,1,3,39,39,.a,2,2,2,2,2,1957,.,.,.,.,.a,.a,.a,.a,.a,.a,.a,.,AT,0
2,8592,6,2,47,11010,11010,2,2,1,2,1,.a,1,1,2,0,0,66,2,3,AT33,2,.a,.a,1,66,0,0,0,0,0,0,0,0,0,4,0,0,0,0,0,0,0,2,0,0,1,0,0,0,0,0,0,2,0,0,2,212,2,212,2,212,3,322,9,2,2,2,3,.a,1,1,1,1,4,1,2,1,66,1,1,1,1,1,3,2,.,.,.,.,1,2,2,.a,.a,.a,.a,.,10,2,4,2,1,1,1,3,0,0,1,5,2,3,5,2,2,3,5,2,2,2,5,3,6,.,18,18,17,28,3,3,16,30,50,2015,2015,2,5,2,1,2,5,2,2,2,2,2,5,2,5,4,9112,.d,1,.a,GER,0,4,.d,.a,1,.a,66,3,1,.a,1,1,56,1,1,8,1,2,.a,.a,1,1,3,9,8,8,2,.a,3,2,2,3,1,.a,8,1,.a,.,.,.,.,1,2,2,.a,.a,.a,.a,.,.,.,.,.,1,2,2,.a,.a,.a,.a,.,1,0,0,2,5,2,1,6,6,8,6,8,18,3,2015,4,5,9,9,8,5,6,4,5,6,.a,2,.a,0,0,0,0,1,8,30,35,40,4,2,1,2,2,1968,.,.,.,.,1963,1993,1995,.a,.a,.a,.a,.,AT,1
3,29593,10,2,22,11010,.a,2,2,2,2,1,2,2,2,2,0,0,66,2,10,AT12,2,.a,.a,1,66,0,0,0,0,1,0,0,0,0,2,0,0,0,0,0,0,0,2,0,0,1,0,0,0,0,0,0,2,0,0,3,322,3,322,3,322,.a,.a,6,3,3,3,.a,.a,1,.a,1,1,3,2,0,1,66,1,2,2,1,1,1,1,.,.,.,.,.a,.a,.a,.a,.a,.a,.a,.,8,1,1,2,1,1,2,3,1,0,1,0,4,4,2,2,2,1,3,1,1,4,0,0,3,.,7,7,8,46,4,4,7,51,44,2015,2015,0,1,4,2,6,3,1,4,1,3,2,3,1,3,6,7231,.a,2,.a,GER,0,10,1,1,4,4,66,3,1,.a,1,1,45,.a,6,5,2,2,.a,.a,1,0,3,5,0,5,7,.a,0,0,2,6,1,.a,5,1,.a,.,.,.,.,.a,.a,.a,.a,.a,.a,.a,.,.,.,.,.,.a,.a,.a,.a,.a,.a,.a,.,.a,0,0,3,6,2,1,0,0,0,0,8,7,4,2015,4,0,0,0,0,0,0,0,0,7,.a,2,.a,0,0,0,0,2,0,40,50,.a,3,2,1,2,2,1993,.,.,.,.,.a,.a,.a,.a,.a,.a,.a,.,AT,0
4,4252,0,1,24,11010,.a,2,2,2,2,1,2,2,1,1,0,0,66,1,2,AT12,2,.a,.a,1,66,0,0,0,0,1,0,0,0,0,3,0,0,0,0,0,0,0,2,0,0,1,0,0,0,0,0,0,2,0,0,4,423,3,322,3,322,.a,.a,13,5,3,3,.a,.a,1,.a,1,3,4,2,8,1,66,1,1,1,1,1,1,1,.,.,.,.,.a,.a,.a,.a,.a,.a,.a,.,8,1,1,1,1,1,3,3,0,0,1,7,2,2,6,1,2,3,4,1,6,2,7,8,3,.,29,29,19,20,3,3,18,25,42,2015,2015,6,6,2,2,1,3,2,2,1,2,3,3,2,3,3,4311,.a,2,.a,GER,0,4,2,.a,6,6,66,3,1,.a,1,1,66,.a,6,.a,2,2,.a,.a,1,0,2,6,8,5,6,3,5,5,2,5,1,.a,5,1,.a,.,.,.,.,.a,.a,.a,.a,.a,.a,.a,.,.,.,.,.,.a,.a,.a,.a,.a,.a,.a,.,.a,0,0,3,6,1,2,9,6,8,6,8,29,3,2015,4,2,6,9,7,7,7,5,2,4,.a,2,.a,0,0,0,0,1,8,38,38,.a,1,2,1,2,2,1991,.,.,.,.,.a,.a,.a,.a,.a,.a,.a,.,AT,1


(30080, 273)

### Analysis of Missing Observations
<a id="Analysis-of-Missing-Observations"></a>

Count the frequency of each unique response/answer for every variable/question.

In [4]:
frequency_frames = []

# Skip the first column (id) and count how often each response occurs per question
for i in list(train_no_blanks)[1:]:
    grouped_data = train_no_blanks.groupby(i)["id"].count()
    num_unique_answers = grouped_data.size
    temp_dict = {"Question": [i] * num_unique_answers,
                 "Response": list(grouped_data.index),
                 "Frequency": list(grouped_data)}
    frequency_frames.append(pd.DataFrame(temp_dict))

# DataFrame.append was removed in pandas 2.0, so collect the pieces and concatenate once
frequency_train_data = pd.concat(frequency_frames)

frequency_train_data.reset_index(inplace=True, drop=True)
temp = frequency_train_data.groupby(["Question","Response"]).agg({"Frequency":"sum"})
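# Share of each response within its question; transform keeps the original index intact
temp["Frequency"] = 100 * temp["Frequency"] / temp.groupby(level=0)["Frequency"].transform("sum")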
temp.reset_index(inplace=True)
temp = temp.rename(columns = {"Frequency":"Relative Frequency (%)"})
frequency_train_data = frequency_train_data.merge(temp, left_on = ["Question","Response"], right_on = ["Question","Response"], how = "left")
frequency_train_data = frequency_train_data.merge(codebook[["Variable","Label","Unique","Type_codebook_long"]], left_on = "Question", right_on = "Variable", how = "left").drop("Variable", axis=1)
frequency_train_data

Unnamed: 0,Question,Response,Frequency,Relative Frequency (%),Label,Unique,Type_codebook_long
0,v1,.a,18,0.059840,Able to take active role in po...,11,double
1,v1,.b,546,1.815160,Able to take active role in po...,11,double
2,v1,.c,10,0.033245,Able to take active role in po...,11,double
3,v1,0,6405,21.293218,Able to take active role in po...,11,double
4,v1,1,2432,8.085106,Able to take active role in po...,11,double
...,...,...,...,...,...,...,...
6408,cntry,PT,992,3.297872,Country,21,string
6409,cntry,SE,1434,4.767287,Country,21,string
6410,cntry,SI,950,3.158245,Country,21,string
6411,satisfied,0,14454,48.051862,Target,2,float
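
The same long-format frequency table can also be built without an explicit loop by reshaping the wide frame first. The sketch below assumes a hypothetical helper `response_frequencies` and toy data; it leaves out the codebook merge and only reproduces the `Question`, `Response`, `Frequency`, and `Relative Frequency (%)` columns.

```python
import pandas as pd


def response_frequencies(df, id_col="id"):
    """Count responses per question and convert them to shares (hypothetical helper)."""
    # Wide to long: one row per (respondent, question) pair
    long = df.melt(id_vars=[id_col], var_name="Question", value_name="Response")

    # Count each (question, response) combination; sort=False avoids ordering mixed types
    freq = (long.groupby(["Question", "Response"], sort=False)
                .size()
                .reset_index(name="Frequency"))

    # Relative frequency within each question, in percent
    totals = freq.groupby("Question")["Frequency"].transform("sum")
    freq["Relative Frequency (%)"] = 100 * freq["Frequency"] / totals
    return freq


# Toy data (made up for illustration only)
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "v1": ["0", "0", ".a", "1"],
    "cntry": ["AT", "AT", "SI", "SI"],
})

freq = response_frequencies(toy)
print(freq)
```

Note that `groupby` drops NaN keys by default, so this approach relies on blanks having been filled (for example with ".") beforehand, as the notebook does.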


In [5]:
frequency_raw_frames = []

# Same counting as above, applied to the raw (unprocessed) responses
for i in list(train_raw_no_blanks)[1:]:
    grouped_data = train_raw_no_blanks.groupby(i)["id"].count()
    num_unique_answers = grouped_data.size
    temp_dict = {"Question": [i] * num_unique_answers,
                 "Response": list(grouped_data.index),
                 "Frequency": list(grouped_data)}
    frequency_raw_frames.append(pd.DataFrame(temp_dict))

# DataFrame.append was removed in pandas 2.0, so collect the pieces and concatenate once
frequency_train_raw_data = pd.concat(frequency_raw_frames)

frequency_train_raw_data.reset_index(inplace=True, drop=True)
temp = frequency_train_raw_data.groupby(["Question","Response"]).agg({"Frequency":"sum"})
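# Share of each response within its question; transform keeps the original index intact
temp["Frequency"] = 100 * temp["Frequency"] / temp.groupby(level=0)["Frequency"].transform("sum")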
temp.reset_index(inplace=True)
temp = temp.rename(columns = {"Frequency":"Relative Frequency (%)"})
frequency_train_raw_data = frequency_train_raw_data.merge(temp, left_on = ["Question","Response"], right_on = ["Question","Response"], how = "left")
frequency_train_raw_data = frequency_train_raw_data.merge(codebook[["Variable","Label","Unique","Type_codebook_long"]], left_on = "Question", right_on = "Variable", how = "left").drop("Variable", axis=1)
frequency_train_raw_data

Unnamed: 0,Question,Response,Frequency,Relative Frequency (%),Label,Unique,Type_codebook_long
0,v1,1,2432,8.085106,Able to take active role in po...,11,double
1,v1,2,3251,10.807846,Able to take active role in po...,11,double
2,v1,3,3335,11.087101,Able to take active role in po...,11,double
3,v1,4,2502,8.317819,Able to take active role in po...,11,double
4,v1,5,3349,11.133644,Able to take active role in po...,11,double
...,...,...,...,...,...,...,...
6402,cntry,PT,992,3.297872,Country,21,string
6403,cntry,SE,1434,4.767287,Country,21,string
6404,cntry,SI,950,3.158245,Country,21,string
6405,satisfied,0,14454,48.051862,Target,2,float


In [6]:
# If a response contains a "." (., .a, .b, etc.), flag it as a missing response
# (pd.np was removed in pandas 2.0, so call NumPy directly)
frequency_train_data['Response Missing'] = np.where(frequency_train_data.Response.str.find(".") > -1, 1, 0)
frequency_train_data

Unnamed: 0,Question,Response,Frequency,Relative Frequency (%),Label,Unique,Type_codebook_long,Response Missing
0,v1,.a,18,0.059840,Able to take active role in po...,11,double,1
1,v1,.b,546,1.815160,Able to take active role in po...,11,double,1
2,v1,.c,10,0.033245,Able to take active role in po...,11,double,1
3,v1,0,6405,21.293218,Able to take active role in po...,11,double,0
4,v1,1,2432,8.085106,Able to take active role in po...,11,double,0
...,...,...,...,...,...,...,...,...
6408,cntry,PT,992,3.297872,Country,21,string,0
6409,cntry,SE,1434,4.767287,Country,21,string,0
6410,cntry,SI,950,3.158245,Country,21,string,0
6411,satisfied,0,14454,48.051862,Target,2,float,0
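
The substring test above flags any string that contains a dot. The sketch below compares it with an exact-membership test on a small, made-up series; the value `"1.5"` is a hypothetical example of a legitimate response that contains a dot and is not claimed to occur in the survey data.

```python
import numpy as np
import pandas as pd

# Dot-codes that mark a missing response
MISSING_CODES = {".", ".a", ".b", ".c", ".d"}

# Mixed strings and integers, as in the Response column (values are illustrative)
responses = pd.Series([".a", ".", 0, 1, "AT", "1.5"], dtype=object)

# Substring test: non-string entries give NaN from .str.find, and NaN > -1 is False
by_substring = np.where(responses.str.find(".") > -1, 1, 0)

# Exact-membership test: only the listed codes count as missing
by_membership = responses.isin(MISSING_CODES).astype(int).to_numpy()

print(by_substring)
print(by_membership)
```

The two flags differ only for the `"1.5"` entry, so the simpler substring test is adequate as long as no valid response contains a dot.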


Calculate the percentage of responses missing for each question.

In [7]:
missing_value_df = frequency_train_data.groupby(["Question","Response","Response Missing"]).agg({"Frequency":"sum"})
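# Convert counts to percentages within each question; transform keeps the index levels unchanged
missing_value_df["Frequency"] = 100 * missing_value_df["Frequency"] / missing_value_df.groupby(level=0)["Frequency"].transform("sum")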
missing_value_df = missing_value_df.groupby(["Question","Response Missing"]).agg({"Frequency":"sum"})
missing_value_df = missing_value_df[missing_value_df.index.get_level_values("Response Missing")==1]
missing_value_df = missing_value_df[missing_value_df.index.get_level_values("Question")!="satisfied"]
missing_value_df.reset_index(inplace=True)
missing_value_df = missing_value_df.rename(columns = {"Frequency":"Percentage Missing"}).drop("Response Missing", axis=1)
missing_value_df = missing_value_df.merge(codebook[['Variable','Label','Unique','Type_codebook_long']], left_on = 'Question', right_on = 'Variable', how = 'left').drop("Variable", axis=1)
missing_value_df

Unnamed: 0,Question,Percentage Missing,Label,Unique,Type_codebook_long
0,v1,1.908245,Able to take active role in po...,11,double
1,v10,0.023271,Born in country,2,double
2,v100,0.28258,Number of people living regula...,14,double
3,v101,0.914229,Feeling about household's inco...,4,double
4,v102,2.094415,Main source of household income,7,double
5,v103,2.094415,Main source of household income,8,double
6,v104,21.087101,"Household's total net income, ...",10,double
7,v105,0.222739,Hampered in daily activities b...,3,double
8,v108,7.948803,Have a set 'basic' or contract...,2,double
9,v109,3.617021,Immigration bad or good for co...,11,double
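
As a cross-check, the percentage of missing responses can also be computed directly on the wide frame. The helper name `percent_missing` and the toy data are assumptions for this sketch. Unlike the table above, it also lists questions with 0% missing.

```python
import pandas as pd

# Dot-codes that mark a missing response (blanks are assumed to be filled with ".")
MISSING_CODES = [".", ".a", ".b", ".c", ".d"]


def percent_missing(df, exclude=("id", "satisfied")):
    """Percentage of missing responses per feature column (hypothetical helper)."""
    features = df.drop(columns=[c for c in exclude if c in df.columns])
    # The mean of a boolean mask is the share of True values per column
    return (features.isin(MISSING_CODES).mean() * 100).rename("Percentage Missing")


# Toy data (made up for illustration only)
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "v1": ["2", ".a", "4", "1"],
    "v2": [".", ".b", ".a", "3"],
    "satisfied": [0, 1, 0, 1],
})

pct = percent_missing(toy)
over_30 = pct[pct > 30].index.tolist()  # columns above a 30% threshold
print(pct)
print(over_30)
```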


Let's look at which variables/questions have the highest percentage of missing responses.

In [8]:
percent = 30
cols_missing = missing_value_df[missing_value_df["Percentage Missing"] > percent]
n_cols_missing = len(cols_missing)
print("There are " + str(n_cols_missing) + " features with over " + str(percent) + "% missing.")
display(cols_missing)

There are 71 features with over 30% missing.


Unnamed: 0,Question,Percentage Missing,Label,Unique,Type_codebook_long
10,v11,35.954122,Ever had children living in ho...,2,double
25,v123,92.06117,"Place of interview: East, West...",2,double
53,v151,64.361702,"Occupation partner, ISCO08",524,double
55,v153,88.889628,What year you first came to li...,88,double
58,v158,86.439495,Main activity last 7 days,9,double
60,v160,49.31516,Legal marital status,6,double
63,v164,94.075798,Partner's main activity last 7...,8,double
67,v168,73.397606,Number of people responsible f...,155,double
69,v170,42.054521,Mother's occupation when respo...,9,double
72,v173,54.498005,Ever had a paid job,2,double


Let's drop these columns

In [9]:
drop_missing = cols_missing.Question.to_list()
train_v2 = train_no_blanks.drop(columns = drop_missing)
train_v2

Unnamed: 0,id,v1,v2,v3,v4,v6,v7,v8,v9,v10,v12,v13,v14,v15,v16,v17,v18,v19,v20,v21,v24,v25,v26,v27,v28,v29,v30,v31,v32,v33,v34,v35,v36,v37,v38,v39,v40,v41,v42,v43,v44,v45,v46,v47,v48,v49,v50,v51,v52,v53,v54,v55,v56,v57,v58,v59,v60,v61,v64,v65,v66,v67,v70,v72,v73,v74,v75,v76,v77,v78,v79,v80,v81,v82,v83,v84,v85,v90,v98,v99,v100,v101,v102,v103,v104,v105,v106,v107,v108,v109,v110,v111,v112,v113,v114,v115,v116,v117,v118,v119,v120,v121,v122,v124,v125,v126,v127,v128,v129,v130,v131,v132,v133,v134,v135,v136,v137,v138,v139,v140,v141,v142,v143,v144,v145,v146,v147,v148,v149,v150,v152,v154,v155,v156,v157,v159,v161,v162,v163,v165,v166,v167,v169,v171,v172,v175,v176,v177,v178,v179,v180,v181,v183,v184,v185,v186,v187,v189,v196,v208,v217,v218,v219,v220,v221,v222,v223,v224,v225,v226,v227,v228,v229,v230,v231,v232,v233,v234,v235,v236,v237,v238,v239,v240,v242,v244,v245,v246,v247,v248,v249,v250,v251,v253,v254,v255,v256,v257,v258,v263,cntry,satisfied
0,9948,2,2,74,11010,2,2,2,2,1,2,2,1,0,0,66,2,2,AT33,2,1,66,0,0,0,0,1,0,0,0,0,3,0,0,0,0,0,0,0,2,0,0,1,0,0,0,0,0,0,2,0,0,3,322,3,322,3,322,10,3,3,3,1,1,1,2,2,9,1,66,2,1,3,4,3,3,2,.a,5,2,1,2,3,4,2,2,0,0,1,0,3,3,3,3,2,3,6,3,2,2,2,0,0,1,1,12,2,4,4,9,25,107,2015,2015,4,5,3,4,1,5,3,3,2,2,3,5,2,6,5,3341,1,GER,0,3,2,5,66,2,6,1,1,23,4,2,2,0,0,3,10,10,8,1,2,2,2,7,1,1,.a,.a,1,0,2,4,2,1,3,.b,5,1,9,1,4,2015,4,0,5,10,0,2,0,3,0,1,2,0,0,0,0,1,8,40,40,2,2,1,2,2,1941,.a,AT,0
1,25601,4,2,58,11010,2,2,2,2,1,2,2,2,0,0,66,2,2,AT31,2,1,66,0,0,0,0,1,0,0,0,0,4,0,0,0,0,0,0,0,2,0,0,1,0,0,0,0,0,0,.a,0,0,3,322,2,212,2,212,12,3,2,2,1,1,1,2,2,5,1,66,2,1,1,1,5,3,1,.a,4,3,1,2,4,5,2,3,0,0,1,6,3,2,4,3,3,4,4,4,4,1,5,4,2,17,17,19,15,1,1,17,46,75,2015,2015,3,4,3,3,1,1,4,2,2,3,2,2,3,3,1,7132,2,GER,0,5,.b,6,66,3,3,1,1,25,8,2,2,0,0,4,4,7,3,6,4,5,2,5,1,1,.a,.a,0,0,3,4,2,1,5,3,8,5,8,17,1,2015,4,.b,7,7,.b,5,5,.b,1,3,1,1,0,0,0,1,3,39,39,2,2,2,2,2,1957,.a,AT,0
2,8592,6,2,47,11010,2,2,1,2,1,1,1,2,0,0,66,2,3,AT33,2,1,66,0,0,0,0,0,0,0,0,0,4,0,0,0,0,0,0,0,2,0,0,1,0,0,0,0,0,0,2,0,0,2,212,2,212,2,212,9,2,2,2,1,1,1,4,1,2,1,66,1,1,1,1,1,3,2,1,10,2,4,2,1,1,1,3,0,0,1,5,2,3,5,2,2,3,5,2,2,2,5,3,6,18,18,17,28,3,3,16,30,50,2015,2015,2,5,2,1,2,5,2,2,2,2,2,5,2,5,4,9112,1,GER,0,4,.d,1,66,3,1,1,1,56,1,1,2,1,1,3,9,8,8,2,3,2,2,3,1,8,1,1,0,0,2,5,2,1,6,6,8,6,8,18,3,2015,4,5,9,9,8,5,6,4,5,6,2,0,0,0,0,1,8,30,35,4,2,1,2,2,1968,1963,AT,1
3,29593,10,2,22,11010,2,2,2,2,1,2,2,2,0,0,66,2,10,AT12,2,1,66,0,0,0,0,1,0,0,0,0,2,0,0,0,0,0,0,0,2,0,0,1,0,0,0,0,0,0,2,0,0,3,322,3,322,3,322,6,3,3,3,1,1,1,3,2,0,1,66,1,2,2,1,1,1,1,.a,8,1,1,2,1,1,2,3,1,0,1,0,4,4,2,2,2,1,3,1,1,4,0,0,3,7,7,8,46,4,4,7,51,44,2015,2015,0,1,4,2,6,3,1,4,1,3,2,3,1,3,6,7231,2,GER,0,10,1,4,66,3,1,1,1,45,6,2,2,1,0,3,5,0,5,7,0,0,2,6,1,5,.a,.a,0,0,3,6,2,1,0,0,0,0,8,7,4,2015,4,0,0,0,0,0,0,0,0,7,2,0,0,0,0,2,0,40,50,3,2,1,2,2,1993,.a,AT,0
4,4252,0,1,24,11010,2,2,2,2,1,2,1,1,0,0,66,1,2,AT12,2,1,66,0,0,0,0,1,0,0,0,0,3,0,0,0,0,0,0,0,2,0,0,1,0,0,0,0,0,0,2,0,0,4,423,3,322,3,322,13,5,3,3,1,1,3,4,2,8,1,66,1,1,1,1,1,1,1,.a,8,1,1,1,1,1,3,3,0,0,1,7,2,2,6,1,2,3,4,1,6,2,7,8,3,29,29,19,20,3,3,18,25,42,2015,2015,6,6,2,2,1,3,2,2,1,2,3,3,2,3,3,4311,2,GER,0,4,2,6,66,3,1,1,1,66,6,2,2,1,0,2,6,8,5,6,5,5,2,5,1,5,.a,.a,0,0,3,6,1,2,9,6,8,6,8,29,3,2015,4,2,6,9,7,7,7,5,2,4,2,0,0,0,0,1,8,38,38,1,2,1,2,2,1991,.a,AT,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
30075,34440,0,1,72,14120,2,2,2,2,1,2,1,2,0,0,66,2,4,SI017,2,1,66,0,0,0,0,0,0,0,0,0,4,0,0,0,0,0,0,0,2,0,0,1,0,0,0,0,0,0,2,0,0,3,321,2,213,2,213,11,3,2,2,2,4,3,2,1,8,1,66,2,2,1,1,3,2,1,2,6,2,2,1,3,4,8,3,0,0,1,5,2,2,2,2,2,2,3,2,2,2,6,3,3,29,29,16,48,11,11,16,23,25,2014,2014,10,5,2,2,2,3,2,2,2,2,2,3,2,2,2,1219,1,SLV,0,5,2,1,66,2,6,1,1,49,.a,1,2,0,0,2,4,2,5,7,0,0,3,7,1,3,1,1,1,1,3,5,2,2,8,5,7,5,7,29,11,2014,5,0,0,3,0,2,0,3,2,7,2,0,0,0,0,1,10,40,60,2,2,.a,.,2,1942,1945,SI,1
30076,13566,0,1,38,14120,2,2,2,2,1,2,1,2,0,0,66,2,0,SI021,2,1,66,0,0,0,0,1,0,0,0,0,4,0,0,0,0,0,0,0,2,0,0,1,0,0,0,0,0,0,2,0,0,3,323,3,321,3,321,11,4,3,3,1,1,1,4,1,5,1,66,1,1,1,2,.b,2,1,.a,7,2,1,2,1,1,4,3,0,0,1,2,4,4,2,1,1,2,3,2,3,4,3,.b,3,17,17,16,16,11,11,15,46,30,2014,2014,6,3,1,2,3,5,2,2,2,2,2,2,2,3,3,3322,2,SLV,0,5,2,6,66,3,1,1,1,47,7,2,2,1,0,3,7,5,3,7,0,0,3,7,2,5,.a,.a,0,0,3,7,2,1,0,1,5,1,5,17,11,2014,4,2,5,5,2,2,2,2,2,6,2,0,0,0,0,2,6,40,40,4,2,1,.,2,1976,.a,SI,1
30077,29824,5,2,49,14120,2,2,2,2,1,1,1,2,0,0,66,2,5,SI011,2,1,66,0,0,0,0,0,0,0,0,0,5,0,0,0,0,0,0,0,2,0,0,1,0,0,0,0,0,0,2,0,0,2,213,2,213,2,213,8,2,2,2,1,1,1,3,4,4,1,66,1,1,1,1,2,2,1,2,6,3,4,2,1,1,1,3,0,0,1,4,4,4,1,1,3,4,4,2,1,3,3,4,2,5,5,16,6,11,11,15,43,23,2014,2014,8,4,2,2,1,2,2,1,2,2,2,2,2,1,2,8160,2,SLV,0,6,2,1,66,1,1,1,1,10,7,1,2,1,0,3,3,4,4,3,5,3,3,4,2,6,1,1,0,0,3,5,2,1,4,4,4,3,5,5,11,2014,4,2,6,5,2,3,1,6,0,2,1,0,1,0,0,1,7,40,48,3,2,1,.,2,1965,1967,SI,0
30078,9573,0,1,16,14120,2,2,2,2,1,2,1,2,0,0,66,2,0,SI018,2,1,66,0,0,0,0,1,0,0,0,0,3,0,0,0,0,0,0,0,2,0,0,1,0,0,0,0,0,0,2,1,0,2,213,3,313,3,323,10,2,4,4,.a,1,1,3,.a,5,1,66,1,1,1,3,4,2,1,2,10,1,3,1,1,1,.b,3,0,0,.a,2,1,1,3,3,1,1,5,2,3,1,5,5,3,19,19,17,7,11,11,16,37,30,2014,2014,.a,2,2,3,1,3,1,3,1,3,2,1,2,3,2,.a,.a,SLV,0,5,2,6,66,3,2,1,1,.a,5,2,2,0,0,4,5,5,1,7,1,.b,3,5,2,0,3,3,0,0,2,7,2,1,2,0,3,0,5,19,11,2014,.a,0,2,2,0,0,0,3,0,3,2,0,0,0,0,3,.a,.a,.a,3,.a,.a,.,2,1998,1972,SI,1


In [19]:
train_raw_v2 = train_raw_no_blanks.drop(columns = drop_missing)
train_raw_v2

Unnamed: 0,id,v1,v2,v3,v4,v6,v7,v8,v9,v10,v12,v13,v14,v15,v16,v17,v18,v19,v20,v21,v24,v25,v26,v27,v28,v29,v30,v31,v32,v33,v34,v35,v36,v37,v38,v39,v40,v41,v42,v43,v44,v45,v46,v47,v48,v49,v50,v51,v52,v53,v54,v55,v56,v57,v58,v59,v60,v61,v64,v65,v66,v67,v70,v72,v73,v74,v75,v76,v77,v78,v79,v80,v81,v82,v83,v84,v85,v90,v98,v99,v100,v101,v102,v103,v104,v105,v106,v107,v108,v109,v110,v111,v112,v113,v114,v115,v116,v117,v118,v119,v120,v121,v122,v124,v125,v126,v127,v128,v129,v130,v131,v132,v133,v134,v135,v136,v137,v138,v139,v140,v141,v142,v143,v144,v145,v146,v147,v148,v149,v150,v152,v154,v155,v156,v157,v159,v161,v162,v163,v165,v166,v167,v169,v171,v172,v175,v176,v177,v178,v179,v180,v181,v183,v184,v185,v186,v187,v189,v196,v208,v217,v218,v219,v220,v221,v222,v223,v224,v225,v226,v227,v228,v229,v230,v231,v232,v233,v234,v235,v236,v237,v238,v239,v240,v242,v244,v245,v246,v247,v248,v249,v250,v251,v253,v254,v255,v256,v257,v258,v263,cntry,satisfied
0,9948,2,Safe,74,Austrian nfs,No,No,No,No,Yes,Does not,Some of the time,Yes,Not marked,Not marked,66,No,2,AT33,No,Yes,66,Not marked,Not marked,Not marked,Not marked,Marked,Not marked,Not marked,Not marked,Not marked,Town or small city,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,No,Not marked,Not marked,Marked,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,No,Not marked,Not marked,Upper secondary education completed (ISCED 3),"Vocational ISCED 3A, access ISCED 5B/ lower tier 5A",Upper secondary education completed (ISCED 3),"Vocational ISCED 3A, access ISCED 5B/ lower tier 5A",Upper secondary education completed (ISCED 3),"Vocational ISCED 3A, access ISCED 5B/ lower tier 5A",10,"ES-ISCED IIIb, lower tier upper secondary","ES-ISCED IIIb, lower tier upper secondary","ES-ISCED IIIb, lower tier upper secondary",Employee,Employee,Employee,Some of the time,10 to 24,9,Yes,66,Some of the time,None or almost none of the time,Most of the time,All or almost all of the time,Neither agree nor disagree,Neither agree nor disagree,Female,Not applicable,5,Good,1,Coping on present income,Pensions,Pensions,R - 2nd decile,Yes to some extent,Not marked,Not marked,Yes,Bad for the economy,Allow a few,Allow a few,Somewhat like me,Somewhat like me,Like me,Somewhat like me,Not like me at all,Somewhat like me,Like me,Allow some,2,Worse place to live,,1,1,12,2,4,4,9,25,107,2015,2015,4,Not like me,Somewhat like me,A little like me,Very much like me,Not like me,Somewhat like me,Somewhat like me,Like me,Like me,Somewhat like me,Not like me,Like me,Not like me at all,Not like me,Office supervisors,Yes,GER,0,3,No,Widowed/civil partner died,66,"Yes, previously",Retired,Yes,Face to face interview,Manufacture of other non-metallic mineral products,Sales occupations,Does not,No,Not marked,Not marked,Hardly interested,Most people try to be fair,People mostly try to be helpful,8,Every day,2,2,NUTS level 2,Never,Yes,1,Not applicable,Not 
applicable,Marked,Not marked,Less than most,Several times a month,No,None or almost none of the time,3,Don't know,5,1,9,1,4,2015,A private firm,No trust at all,5,Complete trust,No trust at all,2,No trust at all,3,No time at all,"Less than 0,5 hour",No,Not marked,Not marked,Not marked,Not marked,Yes,8,40,40,Some of the time,No,Unlimited,No,No,1941,Not applicable,AT,0
1,25601,4,Safe,58,Austrian nfs,No,No,No,No,Yes,Does not,Some of the time,No,Not marked,Not marked,66,No,2,AT31,No,Yes,66,Not marked,Not marked,Not marked,Not marked,Marked,Not marked,Not marked,Not marked,Not marked,Country village,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,No,Not marked,Not marked,Marked,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,Refusal,Not marked,Not marked,Upper secondary education completed (ISCED 3),"Vocational ISCED 3A, access ISCED 5B/ lower tier 5A",Lower secondary education completed (ISCED 2),"General/pre-vocational ISCED 2A/2B, access ISCED 3 vocational",Lower secondary education completed (ISCED 2),"General/pre-vocational ISCED 2A/2B, access ISCED 3 vocational",12,"ES-ISCED IIIb, lower tier upper secondary","ES-ISCED II, lower secondary","ES-ISCED II, lower secondary",Employee,Employee,Employee,Some of the time,10 to 24,5,Yes,66,Some of the time,None or almost none of the time,None or almost none of the time,None or almost none of the time,Disagree strongly,Neither agree nor disagree,Male,Not applicable,4,Fair,1,Coping on present income,Unemployment/redundancy benefit,Unemployment/redundancy benefit,R - 2nd decile,No,Not marked,Not marked,Yes,6,Allow a few,Allow some,A little like me,Somewhat like me,Somewhat like me,A little like me,A little like me,A little like me,A little like me,Allow many to come and live here,5,4,2,17,17,19,15,1,1,17,46,75,2015,2015,3,A little like me,Somewhat like me,Somewhat like me,Very much like me,Very much like me,A little like me,Like me,Like me,Somewhat like me,Like me,Like me,Somewhat like me,Somewhat like me,Very much like me,Spray painters and varnishers,No,GER,0,5,Refusal,None of these (NEVER married or in legally registered civil union),66,No,"Unemployed, looking for job",Yes,Face to face interview,"Manufacture of fabricated metal products, except machinery and equipment",Unskilled worker,Does not,No,Not marked,Not marked,Not at all 
interested,4,7,3,Less often,4,5,NUTS level 2,Only on special holy days,Yes,1,Not applicable,Not applicable,Not marked,Not marked,About the same,Several times a month,No,None or almost none of the time,5,3,8,5,8,17,1,2015,A private firm,Don't know,7,7,Don't know,5,5,Don't know,"Less than 0,5 hour","More than 1 hour, up to 1,5 hours",Yes,Marked,Not marked,Not marked,Not marked,Yes,3,39,39,Some of the time,No,Limited,No,No,1957,Not applicable,AT,0
2,8592,6,Safe,47,Austrian nfs,No,No,Yes,No,Yes,Respondent lives with children at household grid,None or almost none of the time,No,Not marked,Not marked,66,No,3,AT33,No,Yes,66,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,Country village,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,No,Not marked,Not marked,Marked,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,No,Not marked,Not marked,Lower secondary education completed (ISCED 2),"General/pre-vocational ISCED 2A/2B, access ISCED 3 vocational",Lower secondary education completed (ISCED 2),"General/pre-vocational ISCED 2A/2B, access ISCED 3 vocational",Lower secondary education completed (ISCED 2),"General/pre-vocational ISCED 2A/2B, access ISCED 3 vocational",9,"ES-ISCED II, lower secondary","ES-ISCED II, lower secondary","ES-ISCED II, lower secondary",Employee,Employee,Employee,All or almost all of the time,Under 10,2,Yes,66,None or almost none of the time,None or almost none of the time,None or almost none of the time,None or almost none of the time,Agree strongly,Neither agree nor disagree,Female,Male,Extremely happy,Good,4,Coping on present income,Wages or salaries,Wages or salaries,J - 1st decile,No,Not marked,Not marked,Yes,5,Allow some,Allow a few,Not like me,Like me,Like me,Somewhat like me,Not like me,Like me,Like me,Allow some,5,3,10 or more,18,18,17,28,3,3,16,30,50,2015,2015,2,Not like me,Like me,Very much like me,Like me,Not like me,Like me,Like me,Like me,Like me,Like me,Not like me,Like me,Not like me,A little like me,"Cleaners and helpers in offices, hotels and other establishments",Yes,GER,0,4,No answer,Legally married,66,No,Paid work,Yes,Face to face interview,Food and beverage service activities,Professional and technical occupations,Lives with husband/wife/partner at household grid,No,Marked,Marked,Hardly interested,9,8,8,More than once a week,3,2,NUTS level 2,Once a 
week,Yes,8,Husband/wife/partner,Husband/wife/partner,Not marked,Not marked,Less than most,Once a week,No,None or almost none of the time,6,6,8,6,8,18,3,2015,A private firm,5,9,9,8,5,6,4,"More than 2 hours, up to 2,5 hours","More than 2,5 hours, up to 3 hours",No,Not marked,Not marked,Not marked,Not marked,Yes,8,30,35,All or almost all of the time,No,Unlimited,No,No,1968,1963,AT,1
3,29593,Completely able,Safe,22,Austrian nfs,No,No,No,No,Yes,Does not,Some of the time,No,Not marked,Not marked,66,No,Completely confident,AT12,No,Yes,66,Not marked,Not marked,Not marked,Not marked,Marked,Not marked,Not marked,Not marked,Not marked,Suburbs or outskirts of big city,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,No,Not marked,Not marked,Marked,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,No,Not marked,Not marked,Upper secondary education completed (ISCED 3),"Vocational ISCED 3A, access ISCED 5B/ lower tier 5A",Upper secondary education completed (ISCED 3),"Vocational ISCED 3A, access ISCED 5B/ lower tier 5A",Upper secondary education completed (ISCED 3),"Vocational ISCED 3A, access ISCED 5B/ lower tier 5A",6,"ES-ISCED IIIb, lower tier upper secondary","ES-ISCED IIIb, lower tier upper secondary","ES-ISCED IIIb, lower tier upper secondary",Employee,Employee,Employee,Most of the time,10 to 24,Unification already gone too far,Yes,66,None or almost none of the time,Some of the time,Some of the time,None or almost none of the time,Agree strongly,Agree strongly,Male,Not applicable,8,Very good,1,Coping on present income,Wages or salaries,Wages or salaries,R - 2nd decile,No,Marked,Not marked,Yes,Bad for the economy,Allow none,Allow none,Like me,Like me,Like me,Very much like me,Somewhat like me,Very much like me,Very much like me,Allow none,Cultural life undermined,Worse place to live,3,7,7,8,46,4,4,7,51,44,2015,2015,I have/had no influence,Very much like me,A little like me,Like me,Not like me at all,Somewhat like me,Very much like me,A little like me,Very much like me,Somewhat like me,Like me,Somewhat like me,Very much like me,Somewhat like me,Not like me at all,Motor vehicle mechanics and repairers,No,GER,0,Right,Yes,Legally divorced/civil union dissolved,66,No,Paid work,Yes,Face to face interview,Wholesale and retail trade and repair of motor vehicles and motorcycles,Skilled worker,Does not,No,Marked,Not 
marked,Hardly interested,5,People mostly look out for themselves,5,Never,Not at all,Not at all,NUTS level 2,Less often,Yes,5,Not applicable,Not applicable,Not marked,Not marked,About the same,Several times a week,No,None or almost none of the time,Extremely dissatisfied,Extremely dissatisfied,Extremely bad,Extremely dissatisfied,8,7,4,2015,A private firm,No trust at all,No trust at all,No trust at all,No trust at all,No trust at all,No trust at all,No trust at all,No time at all,More than 3 hours,No,Not marked,Not marked,Not marked,Not marked,No,I have/had no influence,40,50,Most of the time,No,Unlimited,No,No,1993,Not applicable,AT,0
4,4252,Not at all able,Very safe,24,Austrian nfs,No,No,No,No,Yes,Does not,None or almost none of the time,Yes,Not marked,Not marked,66,Yes,2,AT12,No,Yes,66,Not marked,Not marked,Not marked,Not marked,Marked,Not marked,Not marked,Not marked,Not marked,Town or small city,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,No,Not marked,Not marked,Marked,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,No,Not marked,Not marked,Post-secondary non-tertiary education completed (ISCED 4),"Vocational ISCED 4A, access upper tier ISCED 5A/all 5",Upper secondary education completed (ISCED 3),"Vocational ISCED 3A, access ISCED 5B/ lower tier 5A",Upper secondary education completed (ISCED 3),"Vocational ISCED 3A, access ISCED 5B/ lower tier 5A",13,"ES-ISCED IV, advanced vocational, sub-degree","ES-ISCED IIIb, lower tier upper secondary","ES-ISCED IIIb, lower tier upper secondary",Employee,Employee,Not working,All or almost all of the time,10 to 24,8,Yes,66,None or almost none of the time,None or almost none of the time,None or almost none of the time,None or almost none of the time,Agree strongly,Agree strongly,Male,Not applicable,8,Very good,1,Living comfortably on present income,Wages or salaries,Wages or salaries,C - 3rd decile,No,Not marked,Not marked,Yes,7,Allow some,Allow some,Not like me at all,Very much like me,Like me,Somewhat like me,A little like me,Very much like me,Not like me at all,Allow some,7,8,3,29,29,19,20,3,3,18,25,42,2015,2015,6,Not like me at all,Like me,Like me,Very much like me,Somewhat like me,Like me,Like me,Very much like me,Like me,Somewhat like me,Somewhat like me,Like me,Somewhat like me,Somewhat like me,Accounting and bookkeeping clerks,No,GER,0,4,No,None of these (NEVER married or in legally registered civil union),66,No,Paid work,Yes,Face to face interview,Activities auxiliary to financial services and insurance activities,Skilled worker,Does not,No,Marked,Not marked,Quite interested,6,8,5,Less 
often,5,5,NUTS level 2,Only on special holy days,Yes,5,Not applicable,Not applicable,Not marked,Not marked,About the same,Several times a week,Yes,Some of the time,9,6,8,6,8,29,3,2015,A private firm,2,6,9,7,7,7,5,"0,5 hour to 1 hour","More than 1,5 hours, up to 2 hours",No,Not marked,Not marked,Not marked,Not marked,Yes,8,38,38,None or almost none of the time,No,Unlimited,No,No,1991,Not applicable,AT,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
30075,34440,Not at all able,Very safe,72,Slovene nfs,No,No,No,No,Yes,Does not,None or almost none of the time,No,Not marked,Not marked,66,No,4,SI017,No,Yes,66,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,Country village,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,No,Not marked,Not marked,Marked,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,No,Not marked,Not marked,Upper secondary education completed (ISCED 3),"Vocational ISCED 3C >= 2 years, no access ISCED 5",Lower secondary education completed (ISCED 2),"General ISCED 2A, access ISCED 3A general/all 3",Lower secondary education completed (ISCED 2),"General ISCED 2A, access ISCED 3A general/all 3",11,"ES-ISCED IIIb, lower tier upper secondary","ES-ISCED II, lower secondary","ES-ISCED II, lower secondary",Self-employed,Father dead/absent,Not working,Some of the time,Under 10,8,Yes,66,Some of the time,Some of the time,None or almost none of the time,None or almost none of the time,Neither agree nor disagree,Agree,Male,Female,6,Good,2,Living comfortably on present income,Pensions,Pensions,P - 8th decile,No,Not marked,Not marked,Yes,5,Allow some,Allow some,Like me,Like me,Like me,Like me,Somewhat like me,Like me,Like me,Allow some,6,3,3,29,29,16,48,11,11,16,23,25,2014,2014,I have/had complete control,Not like me,Like me,Like me,Like me,Somewhat like me,Like me,Like me,Like me,Like me,Like me,Somewhat like me,Like me,Like me,Like me,Business services and administration managers not elsewhere classified,Yes,SLV,0,5,No,Legally married,66,"Yes, previously",Retired,Yes,Face to face interview,Land transport and transport via pipelines,Not applicable,Lives with husband/wife/partner at household grid,No,Not marked,Not marked,Quite interested,4,2,5,Never,Not at all,Not at all,NUTS level 3,Never,Yes,3,Husband/wife/partner,Husband/wife/partner,Marked,Marked,About the same,Once a week,No,Some of the 
time,8,5,7,5,7,29,11,2014,Self employed,No trust at all,No trust at all,3,No trust at all,2,No trust at all,3,"0,5 hour to 1 hour",More than 3 hours,No,Not marked,Not marked,Not marked,Not marked,Yes,I have/had complete control,40,60,Some of the time,No,Not applicable,.,No,1942,1945,SI,1
30076,13566,Not at all able,Very safe,38,Slovene nfs,No,No,No,No,Yes,Does not,None or almost none of the time,No,Not marked,Not marked,66,No,Not at all confident,SI021,No,Yes,66,Not marked,Not marked,Not marked,Not marked,Marked,Not marked,Not marked,Not marked,Not marked,Country village,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,No,Not marked,Not marked,Marked,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,No,Not marked,Not marked,Upper secondary education completed (ISCED 3),"Vocational ISCED 3A, access upper tier ISCED 5A/all 5",Upper secondary education completed (ISCED 3),"Vocational ISCED 3C >= 2 years, no access ISCED 5",Upper secondary education completed (ISCED 3),"Vocational ISCED 3C >= 2 years, no access ISCED 5",11,"ES-ISCED IIIa, upper tier upper secondary","ES-ISCED IIIb, lower tier upper secondary","ES-ISCED IIIb, lower tier upper secondary",Employee,Employee,Employee,All or almost all of the time,Under 10,5,Yes,66,None or almost none of the time,None or almost none of the time,None or almost none of the time,Some of the time,Don't know,Agree,Male,Not applicable,7,Good,1,Coping on present income,Wages or salaries,Wages or salaries,M - 4th decile,No,Not marked,Not marked,Yes,2,Allow none,Allow none,Like me,Very much like me,Very much like me,Like me,Somewhat like me,Like me,Somewhat like me,Allow none,3,Don't know,3,17,17,16,16,11,11,15,46,30,2014,2014,6,Somewhat like me,Very much like me,Like me,Somewhat like me,Not like me,Like me,Like me,Like me,Like me,Like me,Like me,Like me,Somewhat like me,Somewhat like me,Commercial sales representatives,No,SLV,0,5,No,None of these (NEVER married or in legally registered civil union),66,No,Paid work,Yes,Face to face interview,"Retail trade, except of motor vehicles and motorcycles",Semi-skilled worker,Does not,No,Marked,Not marked,Hardly interested,7,5,3,Never,Not at all,Not at all,NUTS level 3,Never,No,5,Not applicable,Not applicable,Not marked,Not 
marked,About the same,Every day,No,None or almost none of the time,Extremely dissatisfied,1,5,1,5,17,11,2014,A private firm,2,5,5,2,2,2,2,"0,5 hour to 1 hour","More than 2,5 hours, up to 3 hours",No,Not marked,Not marked,Not marked,Not marked,No,6,40,40,All or almost all of the time,No,Unlimited,.,No,1976,Not applicable,SI,1
30077,29824,5,Safe,49,Slovene nfs,No,No,No,No,Yes,Respondent lives with children at household grid,None or almost none of the time,No,Not marked,Not marked,66,No,5,SI011,No,Yes,66,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,Farm or home in countryside,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,No,Not marked,Not marked,Marked,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,No,Not marked,Not marked,Lower secondary education completed (ISCED 2),"General ISCED 2A, access ISCED 3A general/all 3",Lower secondary education completed (ISCED 2),"General ISCED 2A, access ISCED 3A general/all 3",Lower secondary education completed (ISCED 2),"General ISCED 2A, access ISCED 3A general/all 3",8,"ES-ISCED II, lower secondary","ES-ISCED II, lower secondary","ES-ISCED II, lower secondary",Employee,Employee,Employee,Most of the time,100 to 499,4,Yes,66,None or almost none of the time,None or almost none of the time,None or almost none of the time,None or almost none of the time,Agree,Agree,Male,Female,6,Fair,4,Coping on present income,Wages or salaries,Wages or salaries,J - 1st decile,No,Not marked,Not marked,Yes,4,Allow none,Allow none,Very much like me,Very much like me,Somewhat like me,A little like me,A little like me,Like me,Very much like me,Allow a few,3,4,2,5,5,16,6,11,11,15,43,23,2014,2014,8,A little like me,Like me,Like me,Very much like me,Like me,Like me,Very much like me,Like me,Like me,Like me,Like me,Like me,Very much like me,Like me,Food and related products machine operators,No,SLV,0,6,No,Legally married,66,"Yes, currently",Paid work,Yes,Face to face interview,Manufacture of food products,Semi-skilled worker,Lives with husband/wife/partner at household grid,No,Marked,Not marked,Hardly interested,3,4,4,Once a week,5,3,NUTS level 3,At least once a month,No,6,Husband/wife/partner,Husband/wife/partner,Not marked,Not marked,About the same,Once a week,No,None or 
almost none of the time,4,4,4,3,5,5,11,2014,A private firm,2,6,5,2,3,1,6,No time at all,"0,5 hour to 1 hour",Yes,Not marked,Marked,Not marked,Not marked,Yes,7,40,48,Most of the time,No,Unlimited,.,No,1965,1967,SI,0
30078,9573,Not at all able,Very safe,16,Slovene nfs,No,No,No,No,Yes,Does not,None or almost none of the time,No,Not marked,Not marked,66,No,Not at all confident,SI018,No,Yes,66,Not marked,Not marked,Not marked,Not marked,Marked,Not marked,Not marked,Not marked,Not marked,Town or small city,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,No,Not marked,Not marked,Marked,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,No,Marked,Not marked,Lower secondary education completed (ISCED 2),"General ISCED 2A, access ISCED 3A general/all 3",Upper secondary education completed (ISCED 3),"General ISCED 3A, access upper tier ISCED 5A/all 5",Upper secondary education completed (ISCED 3),"Vocational ISCED 3A, access upper tier ISCED 5A/all 5",10,"ES-ISCED II, lower secondary","ES-ISCED IIIa, upper tier upper secondary","ES-ISCED IIIa, upper tier upper secondary",Not applicable,Employee,Employee,Most of the time,Not applicable,5,Yes,66,None or almost none of the time,None or almost none of the time,None or almost none of the time,Most of the time,Disagree,Agree,Male,Female,Extremely happy,Very good,3,Living comfortably on present income,Wages or salaries,Wages or salaries,Don't know,No,Not marked,Not marked,Not applicable,2,Allow many to come and live here,Allow many to come and live here,Somewhat like me,Somewhat like me,Very much like me,Very much like me,Not like me,Like me,Somewhat like me,Allow many to come and live here,5,5,3,19,19,17,7,11,11,16,37,30,2014,2014,Not applicable,Like me,Like me,Somewhat like me,Very much like me,Somewhat like me,Very much like me,Somewhat like me,Very much like me,Somewhat like me,Like me,Very much like me,Like me,Somewhat like me,Like me,Not applicable,Not applicable,SLV,0,5,No,None of these (NEVER married or in legally registered civil union),66,No,Education,Yes,Face to face interview,Not applicable,Service occupations,Does not,No,Not marked,Not marked,Not at all interested,5,5,1,Never,1,Don't 
know,NUTS level 3,Only on special holy days,No,Not at all religious,Parent/parent-in-law,Parent/parent-in-law,Not marked,Not marked,Less than most,Every day,No,None or almost none of the time,2,Extremely dissatisfied,3,Extremely dissatisfied,5,19,11,2014,Not applicable,No trust at all,2,2,No trust at all,No trust at all,No trust at all,3,No time at all,"More than 1 hour, up to 1,5 hours",No,Not marked,Not marked,Not marked,Not marked,Not eligible to vote,Not applicable,Not applicable,Not applicable,Most of the time,Not applicable,Not applicable,.,No,1998,1972,SI,1


To handle missing responses, we group all missing-response codes (., .a, .b, .c, .d) together and recode them as -1.

In [25]:
# Recode every missing-response code to the single sentinel value -1
missing_codes = [".", ".a", ".b", ".c", ".d"]
train_v2 = train_v2.replace(missing_codes, -1)
train_raw_v2 = train_raw_v2.replace(missing_codes, -1)
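As a quick sanity check of this recoding, the sketch below applies the same kind of replacement to a small made-up frame. The `toy` frame and its values are hypothetical and only stand in for `train_v2`; the columns are built with `dtype=object` on the assumption that the real survey columns mix numbers and text.

```python
import pandas as pd

# Hypothetical miniature of the survey data; all values are invented for illustration
toy = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "v1": ["2", ".a", "5"],
        "v2": [".", "Safe", ".d"],
    },
    dtype=object,  # assumption: mixed numeric/text columns, as in the raw responses
)

missing_codes = [".", ".a", ".b", ".c", ".d"]

# Map every missing code to the single sentinel -1
toy_clean = toy.replace(missing_codes, -1)

# Count how many cells were recoded in each column
n_recoded = (toy_clean == -1).sum()
print(n_recoded)
```

Counting the -1 cells per column right after the replacement is a cheap way to confirm that no missing code was overlooked.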

Columns with many missing responses are not the only concern: some observations (rows) may also have too many missing responses to be useful and can be removed. We check for these after the high-missing columns have been dropped, so each count covers only the remaining variables.

In [12]:
# Count the -1 (missing) entries in each row, skipping the first column
train_v2["Num -1's per person"] = (train_v2[list(train_v2)[1:]] == -1).sum(axis=1)
tolerance = 30  # maximum number of missing responses allowed per respondent
rows_many_missing_entries = train_v2[train_v2["Num -1's per person"] > tolerance]
n_rows_missing = len(rows_many_missing_entries)
print(f"There are {n_rows_missing} entries with over {tolerance} missing responses to the survey.")
# Keep both the index labels and the respondent ids of the rows to drop
rows_missing_index = list(rows_many_missing_entries.index.values)
rows_missing_ids = list(train_v2.loc[rows_missing_index]["id"])

There are 499 entries with over 30 missing responses to the survey.
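The row-level check can be tried on a small example first. The sketch below is a minimal illustration on a hypothetical frame (`toy`, with invented values and an illustrative tolerance of 1 rather than 30). Unlike the cell above, it keeps the per-row count in a separate Series instead of adding a helper column, so nothing has to be dropped again later.

```python
import pandas as pd

# Hypothetical frame: -1 marks a missing response, as in the recoded survey data
toy = pd.DataFrame({
    "id": [101, 102, 103],
    "v1": [-1, 3, -1],
    "v2": [-1, 4, 2],
    "v3": [-1, -1, 5],
})

tolerance = 1  # illustrative only; the notebook uses 30

# Count -1 entries per row over the response columns only (skip "id")
response_cols = [c for c in toy.columns if c != "id"]
n_missing = (toy[response_cols] == -1).sum(axis=1)

# Ids of the rows whose missing count exceeds the tolerance
flagged_ids = toy.loc[n_missing > tolerance, "id"].tolist()
print(flagged_ids)

# Drop the flagged rows by boolean mask
toy_kept = toy[n_missing <= tolerance]
```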


After dropping these rows, the new data sets to work with are:

In [22]:
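# Drop the flagged rows from both data sets and remove the helper count column.
# This assumes train_raw_v2 shares train_v2's index, so the same labels apply to both.
train_v3 = train_v2.drop(index = rows_missing_index).drop(columns = "Num -1's per person")
train_raw_v3 = train_raw_v2.drop(index = rows_missing_index)
display(train_v3)
display(train_raw_v3)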

Unnamed: 0,id,v1,v2,v3,v4,v6,v7,v8,v9,v10,v12,v13,v14,v15,v16,v17,v18,v19,v20,v21,v24,v25,v26,v27,v28,v29,v30,v31,v32,v33,v34,v35,v36,v37,v38,v39,v40,v41,v42,v43,v44,v45,v46,v47,v48,v49,v50,v51,v52,v53,v54,v55,v56,v57,v58,v59,v60,v61,v64,v65,v66,v67,v70,v72,v73,v74,v75,v76,v77,v78,v79,v80,v81,v82,v83,v84,v85,v90,v98,v99,v100,v101,v102,v103,v104,v105,v106,v107,v108,v109,v110,v111,v112,v113,v114,v115,v116,v117,v118,v119,v120,v121,v122,v124,v125,v126,v127,v128,v129,v130,v131,v132,v133,v134,v135,v136,v137,v138,v139,v140,v141,v142,v143,v144,v145,v146,v147,v148,v149,v150,v152,v154,v155,v156,v157,v159,v161,v162,v163,v165,v166,v167,v169,v171,v172,v175,v176,v177,v178,v179,v180,v181,v183,v184,v185,v186,v187,v189,v196,v208,v217,v218,v219,v220,v221,v222,v223,v224,v225,v226,v227,v228,v229,v230,v231,v232,v233,v234,v235,v236,v237,v238,v239,v240,v242,v244,v245,v246,v247,v248,v249,v250,v251,v253,v254,v255,v256,v257,v258,v263,cntry,satisfied
0,9948,2,2,74,11010,2,2,2,2,1,2,2,1,0,0,66,2,2,AT33,2,1,66,0,0,0,0,1,0,0,0,0,3,0,0,0,0,0,0,0,2,0,0,1,0,0,0,0,0,0,2,0,0,3,322,3,322,3,322,10,3,3,3,1,1,1,2,2,9,1,66,2,1,3,4,3,3,2,-1,5,2,1,2,3,4,2,2,0,0,1,0,3,3,3,3,2,3,6,3,2,2,2,0,0,1,1,12,2,4,4,9,25,107.0,2015,2015,4,5,3,4,1,5,3,3,2,2,3,5,2,6,5,3341,1,GER,0,3,2,5,66,2,6,1,1,23,4,2,2,0,0,3,10,10,8,1,2,2,2,7,1,1,-1,-1,1,0,2,4,2,1,3,-1,5,1,9,1,4,2015,4,0,5,10,0,2,0,3,0,1,2,0,0,0,0,1,8,40,40,2,2,1,2,2,1941,-1,AT,0
1,25601,4,2,58,11010,2,2,2,2,1,2,2,2,0,0,66,2,2,AT31,2,1,66,0,0,0,0,1,0,0,0,0,4,0,0,0,0,0,0,0,2,0,0,1,0,0,0,0,0,0,-1,0,0,3,322,2,212,2,212,12,3,2,2,1,1,1,2,2,5,1,66,2,1,1,1,5,3,1,-1,4,3,1,2,4,5,2,3,0,0,1,6,3,2,4,3,3,4,4,4,4,1,5,4,2,17,17,19,15,1,1,17,46,75.0,2015,2015,3,4,3,3,1,1,4,2,2,3,2,2,3,3,1,7132,2,GER,0,5,-1,6,66,3,3,1,1,25,8,2,2,0,0,4,4,7,3,6,4,5,2,5,1,1,-1,-1,0,0,3,4,2,1,5,3,8,5,8,17,1,2015,4,-1,7,7,-1,5,5,-1,1,3,1,1,0,0,0,1,3,39,39,2,2,2,2,2,1957,-1,AT,0
2,8592,6,2,47,11010,2,2,1,2,1,1,1,2,0,0,66,2,3,AT33,2,1,66,0,0,0,0,0,0,0,0,0,4,0,0,0,0,0,0,0,2,0,0,1,0,0,0,0,0,0,2,0,0,2,212,2,212,2,212,9,2,2,2,1,1,1,4,1,2,1,66,1,1,1,1,1,3,2,1,10,2,4,2,1,1,1,3,0,0,1,5,2,3,5,2,2,3,5,2,2,2,5,3,6,18,18,17,28,3,3,16,30,50.0,2015,2015,2,5,2,1,2,5,2,2,2,2,2,5,2,5,4,9112,1,GER,0,4,-1,1,66,3,1,1,1,56,1,1,2,1,1,3,9,8,8,2,3,2,2,3,1,8,1,1,0,0,2,5,2,1,6,6,8,6,8,18,3,2015,4,5,9,9,8,5,6,4,5,6,2,0,0,0,0,1,8,30,35,4,2,1,2,2,1968,1963,AT,1
3,29593,10,2,22,11010,2,2,2,2,1,2,2,2,0,0,66,2,10,AT12,2,1,66,0,0,0,0,1,0,0,0,0,2,0,0,0,0,0,0,0,2,0,0,1,0,0,0,0,0,0,2,0,0,3,322,3,322,3,322,6,3,3,3,1,1,1,3,2,0,1,66,1,2,2,1,1,1,1,-1,8,1,1,2,1,1,2,3,1,0,1,0,4,4,2,2,2,1,3,1,1,4,0,0,3,7,7,8,46,4,4,7,51,44.0,2015,2015,0,1,4,2,6,3,1,4,1,3,2,3,1,3,6,7231,2,GER,0,10,1,4,66,3,1,1,1,45,6,2,2,1,0,3,5,0,5,7,0,0,2,6,1,5,-1,-1,0,0,3,6,2,1,0,0,0,0,8,7,4,2015,4,0,0,0,0,0,0,0,0,7,2,0,0,0,0,2,0,40,50,3,2,1,2,2,1993,-1,AT,0
4,4252,0,1,24,11010,2,2,2,2,1,2,1,1,0,0,66,1,2,AT12,2,1,66,0,0,0,0,1,0,0,0,0,3,0,0,0,0,0,0,0,2,0,0,1,0,0,0,0,0,0,2,0,0,4,423,3,322,3,322,13,5,3,3,1,1,3,4,2,8,1,66,1,1,1,1,1,1,1,-1,8,1,1,1,1,1,3,3,0,0,1,7,2,2,6,1,2,3,4,1,6,2,7,8,3,29,29,19,20,3,3,18,25,42.0,2015,2015,6,6,2,2,1,3,2,2,1,2,3,3,2,3,3,4311,2,GER,0,4,2,6,66,3,1,1,1,66,6,2,2,1,0,2,6,8,5,6,5,5,2,5,1,5,-1,-1,0,0,3,6,1,2,9,6,8,6,8,29,3,2015,4,2,6,9,7,7,7,5,2,4,2,0,0,0,0,1,8,38,38,1,2,1,2,2,1991,-1,AT,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
30075,34440,0,1,72,14120,2,2,2,2,1,2,1,2,0,0,66,2,4,SI017,2,1,66,0,0,0,0,0,0,0,0,0,4,0,0,0,0,0,0,0,2,0,0,1,0,0,0,0,0,0,2,0,0,3,321,2,213,2,213,11,3,2,2,2,4,3,2,1,8,1,66,2,2,1,1,3,2,1,2,6,2,2,1,3,4,8,3,0,0,1,5,2,2,2,2,2,2,3,2,2,2,6,3,3,29,29,16,48,11,11,16,23,25.0,2014,2014,10,5,2,2,2,3,2,2,2,2,2,3,2,2,2,1219,1,SLV,0,5,2,1,66,2,6,1,1,49,-1,1,2,0,0,2,4,2,5,7,0,0,3,7,1,3,1,1,1,1,3,5,2,2,8,5,7,5,7,29,11,2014,5,0,0,3,0,2,0,3,2,7,2,0,0,0,0,1,10,40,60,2,2,-1,-1,2,1942,1945,SI,1
30076,13566,0,1,38,14120,2,2,2,2,1,2,1,2,0,0,66,2,0,SI021,2,1,66,0,0,0,0,1,0,0,0,0,4,0,0,0,0,0,0,0,2,0,0,1,0,0,0,0,0,0,2,0,0,3,323,3,321,3,321,11,4,3,3,1,1,1,4,1,5,1,66,1,1,1,2,-1,2,1,-1,7,2,1,2,1,1,4,3,0,0,1,2,4,4,2,1,1,2,3,2,3,4,3,-1,3,17,17,16,16,11,11,15,46,30.0,2014,2014,6,3,1,2,3,5,2,2,2,2,2,2,2,3,3,3322,2,SLV,0,5,2,6,66,3,1,1,1,47,7,2,2,1,0,3,7,5,3,7,0,0,3,7,2,5,-1,-1,0,0,3,7,2,1,0,1,5,1,5,17,11,2014,4,2,5,5,2,2,2,2,2,6,2,0,0,0,0,2,6,40,40,4,2,1,-1,2,1976,-1,SI,1
30077,29824,5,2,49,14120,2,2,2,2,1,1,1,2,0,0,66,2,5,SI011,2,1,66,0,0,0,0,0,0,0,0,0,5,0,0,0,0,0,0,0,2,0,0,1,0,0,0,0,0,0,2,0,0,2,213,2,213,2,213,8,2,2,2,1,1,1,3,4,4,1,66,1,1,1,1,2,2,1,2,6,3,4,2,1,1,1,3,0,0,1,4,4,4,1,1,3,4,4,2,1,3,3,4,2,5,5,16,6,11,11,15,43,23.0,2014,2014,8,4,2,2,1,2,2,1,2,2,2,2,2,1,2,8160,2,SLV,0,6,2,1,66,1,1,1,1,10,7,1,2,1,0,3,3,4,4,3,5,3,3,4,2,6,1,1,0,0,3,5,2,1,4,4,4,3,5,5,11,2014,4,2,6,5,2,3,1,6,0,2,1,0,1,0,0,1,7,40,48,3,2,1,-1,2,1965,1967,SI,0
30078,9573,0,1,16,14120,2,2,2,2,1,2,1,2,0,0,66,2,0,SI018,2,1,66,0,0,0,0,1,0,0,0,0,3,0,0,0,0,0,0,0,2,0,0,1,0,0,0,0,0,0,2,1,0,2,213,3,313,3,323,10,2,4,4,-1,1,1,3,-1,5,1,66,1,1,1,3,4,2,1,2,10,1,3,1,1,1,-1,3,0,0,-1,2,1,1,3,3,1,1,5,2,3,1,5,5,3,19,19,17,7,11,11,16,37,30.0,2014,2014,-1,2,2,3,1,3,1,3,1,3,2,1,2,3,2,-1,-1,SLV,0,5,2,6,66,3,2,1,1,-1,5,2,2,0,0,4,5,5,1,7,1,-1,3,5,2,0,3,3,0,0,2,7,2,1,2,0,3,0,5,19,11,2014,-1,0,2,2,0,0,0,3,0,3,2,0,0,0,0,3,-1,-1,-1,3,-1,-1,-1,2,1998,1972,SI,1


Unnamed: 0,id,v1,v2,v3,v4,v6,v7,v8,v9,v10,v12,v13,v14,v15,v16,v17,v18,v19,v20,v21,v24,v25,v26,v27,v28,v29,v30,v31,v32,v33,v34,v35,v36,v37,v38,v39,v40,v41,v42,v43,v44,v45,v46,v47,v48,v49,v50,v51,v52,v53,v54,v55,v56,v57,v58,v59,v60,v61,v64,v65,v66,v67,v70,v72,v73,v74,v75,v76,v77,v78,v79,v80,v81,v82,v83,v84,v85,v90,v98,v99,v100,v101,v102,v103,v104,v105,v106,v107,v108,v109,v110,v111,v112,v113,v114,v115,v116,v117,v118,v119,v120,v121,v122,v124,v125,v126,v127,v128,v129,v130,v131,v132,v133,v134,v135,v136,v137,v138,v139,v140,v141,v142,v143,v144,v145,v146,v147,v148,v149,v150,v152,v154,v155,v156,v157,v159,v161,v162,v163,v165,v166,v167,v169,v171,v172,v175,v176,v177,v178,v179,v180,v181,v183,v184,v185,v186,v187,v189,v196,v208,v217,v218,v219,v220,v221,v222,v223,v224,v225,v226,v227,v228,v229,v230,v231,v232,v233,v234,v235,v236,v237,v238,v239,v240,v242,v244,v245,v246,v247,v248,v249,v250,v251,v253,v254,v255,v256,v257,v258,v263,cntry,satisfied
0,9948,2,Safe,74,Austrian nfs,No,No,No,No,Yes,Does not,Some of the time,Yes,Not marked,Not marked,66,No,2,AT33,No,Yes,66,Not marked,Not marked,Not marked,Not marked,Marked,Not marked,Not marked,Not marked,Not marked,Town or small city,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,No,Not marked,Not marked,Marked,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,No,Not marked,Not marked,Upper secondary education completed (ISCED 3),"Vocational ISCED 3A, access ISCED 5B/ lower tier 5A",Upper secondary education completed (ISCED 3),"Vocational ISCED 3A, access ISCED 5B/ lower tier 5A",Upper secondary education completed (ISCED 3),"Vocational ISCED 3A, access ISCED 5B/ lower tier 5A",10,"ES-ISCED IIIb, lower tier upper secondary","ES-ISCED IIIb, lower tier upper secondary","ES-ISCED IIIb, lower tier upper secondary",Employee,Employee,Employee,Some of the time,10 to 24,9,Yes,66,Some of the time,None or almost none of the time,Most of the time,All or almost all of the time,Neither agree nor disagree,Neither agree nor disagree,Female,Not applicable,5,Good,1,Coping on present income,Pensions,Pensions,R - 2nd decile,Yes to some extent,Not marked,Not marked,Yes,Bad for the economy,Allow a few,Allow a few,Somewhat like me,Somewhat like me,Like me,Somewhat like me,Not like me at all,Somewhat like me,Like me,Allow some,2,Worse place to live,,1,1,12,2,4,4,9,25,107,2015,2015,4,Not like me,Somewhat like me,A little like me,Very much like me,Not like me,Somewhat like me,Somewhat like me,Like me,Like me,Somewhat like me,Not like me,Like me,Not like me at all,Not like me,Office supervisors,Yes,GER,0,3,No,Widowed/civil partner died,66,"Yes, previously",Retired,Yes,Face to face interview,Manufacture of other non-metallic mineral products,Sales occupations,Does not,No,Not marked,Not marked,Hardly interested,Most people try to be fair,People mostly try to be helpful,8,Every day,2,2,NUTS level 2,Never,Yes,1,Not applicable,Not 
applicable,Marked,Not marked,Less than most,Several times a month,No,None or almost none of the time,3,Don't know,5,1,9,1,4,2015,A private firm,No trust at all,5,Complete trust,No trust at all,2,No trust at all,3,No time at all,"Less than 0,5 hour",No,Not marked,Not marked,Not marked,Not marked,Yes,8,40,40,Some of the time,No,Unlimited,No,No,1941,Not applicable,AT,0
1,25601,4,Safe,58,Austrian nfs,No,No,No,No,Yes,Does not,Some of the time,No,Not marked,Not marked,66,No,2,AT31,No,Yes,66,Not marked,Not marked,Not marked,Not marked,Marked,Not marked,Not marked,Not marked,Not marked,Country village,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,No,Not marked,Not marked,Marked,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,Refusal,Not marked,Not marked,Upper secondary education completed (ISCED 3),"Vocational ISCED 3A, access ISCED 5B/ lower tier 5A",Lower secondary education completed (ISCED 2),"General/pre-vocational ISCED 2A/2B, access ISCED 3 vocational",Lower secondary education completed (ISCED 2),"General/pre-vocational ISCED 2A/2B, access ISCED 3 vocational",12,"ES-ISCED IIIb, lower tier upper secondary","ES-ISCED II, lower secondary","ES-ISCED II, lower secondary",Employee,Employee,Employee,Some of the time,10 to 24,5,Yes,66,Some of the time,None or almost none of the time,None or almost none of the time,None or almost none of the time,Disagree strongly,Neither agree nor disagree,Male,Not applicable,4,Fair,1,Coping on present income,Unemployment/redundancy benefit,Unemployment/redundancy benefit,R - 2nd decile,No,Not marked,Not marked,Yes,6,Allow a few,Allow some,A little like me,Somewhat like me,Somewhat like me,A little like me,A little like me,A little like me,A little like me,Allow many to come and live here,5,4,2,17,17,19,15,1,1,17,46,75,2015,2015,3,A little like me,Somewhat like me,Somewhat like me,Very much like me,Very much like me,A little like me,Like me,Like me,Somewhat like me,Like me,Like me,Somewhat like me,Somewhat like me,Very much like me,Spray painters and varnishers,No,GER,0,5,Refusal,None of these (NEVER married or in legally registered civil union),66,No,"Unemployed, looking for job",Yes,Face to face interview,"Manufacture of fabricated metal products, except machinery and equipment",Unskilled worker,Does not,No,Not marked,Not marked,Not at all 
interested,4,7,3,Less often,4,5,NUTS level 2,Only on special holy days,Yes,1,Not applicable,Not applicable,Not marked,Not marked,About the same,Several times a month,No,None or almost none of the time,5,3,8,5,8,17,1,2015,A private firm,Don't know,7,7,Don't know,5,5,Don't know,"Less than 0,5 hour","More than 1 hour, up to 1,5 hours",Yes,Marked,Not marked,Not marked,Not marked,Yes,3,39,39,Some of the time,No,Limited,No,No,1957,Not applicable,AT,0
2,8592,6,Safe,47,Austrian nfs,No,No,Yes,No,Yes,Respondent lives with children at household grid,None or almost none of the time,No,Not marked,Not marked,66,No,3,AT33,No,Yes,66,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,Country village,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,No,Not marked,Not marked,Marked,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,No,Not marked,Not marked,Lower secondary education completed (ISCED 2),"General/pre-vocational ISCED 2A/2B, access ISCED 3 vocational",Lower secondary education completed (ISCED 2),"General/pre-vocational ISCED 2A/2B, access ISCED 3 vocational",Lower secondary education completed (ISCED 2),"General/pre-vocational ISCED 2A/2B, access ISCED 3 vocational",9,"ES-ISCED II, lower secondary","ES-ISCED II, lower secondary","ES-ISCED II, lower secondary",Employee,Employee,Employee,All or almost all of the time,Under 10,2,Yes,66,None or almost none of the time,None or almost none of the time,None or almost none of the time,None or almost none of the time,Agree strongly,Neither agree nor disagree,Female,Male,Extremely happy,Good,4,Coping on present income,Wages or salaries,Wages or salaries,J - 1st decile,No,Not marked,Not marked,Yes,5,Allow some,Allow a few,Not like me,Like me,Like me,Somewhat like me,Not like me,Like me,Like me,Allow some,5,3,10 or more,18,18,17,28,3,3,16,30,50,2015,2015,2,Not like me,Like me,Very much like me,Like me,Not like me,Like me,Like me,Like me,Like me,Like me,Not like me,Like me,Not like me,A little like me,"Cleaners and helpers in offices, hotels and other establishments",Yes,GER,0,4,No answer,Legally married,66,No,Paid work,Yes,Face to face interview,Food and beverage service activities,Professional and technical occupations,Lives with husband/wife/partner at household grid,No,Marked,Marked,Hardly interested,9,8,8,More than once a week,3,2,NUTS level 2,Once a 
week,Yes,8,Husband/wife/partner,Husband/wife/partner,Not marked,Not marked,Less than most,Once a week,No,None or almost none of the time,6,6,8,6,8,18,3,2015,A private firm,5,9,9,8,5,6,4,"More than 2 hours, up to 2,5 hours","More than 2,5 hours, up to 3 hours",No,Not marked,Not marked,Not marked,Not marked,Yes,8,30,35,All or almost all of the time,No,Unlimited,No,No,1968,1963,AT,1
3,29593,Completely able,Safe,22,Austrian nfs,No,No,No,No,Yes,Does not,Some of the time,No,Not marked,Not marked,66,No,Completely confident,AT12,No,Yes,66,Not marked,Not marked,Not marked,Not marked,Marked,Not marked,Not marked,Not marked,Not marked,Suburbs or outskirts of big city,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,No,Not marked,Not marked,Marked,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,No,Not marked,Not marked,Upper secondary education completed (ISCED 3),"Vocational ISCED 3A, access ISCED 5B/ lower tier 5A",Upper secondary education completed (ISCED 3),"Vocational ISCED 3A, access ISCED 5B/ lower tier 5A",Upper secondary education completed (ISCED 3),"Vocational ISCED 3A, access ISCED 5B/ lower tier 5A",6,"ES-ISCED IIIb, lower tier upper secondary","ES-ISCED IIIb, lower tier upper secondary","ES-ISCED IIIb, lower tier upper secondary",Employee,Employee,Employee,Most of the time,10 to 24,Unification already gone too far,Yes,66,None or almost none of the time,Some of the time,Some of the time,None or almost none of the time,Agree strongly,Agree strongly,Male,Not applicable,8,Very good,1,Coping on present income,Wages or salaries,Wages or salaries,R - 2nd decile,No,Marked,Not marked,Yes,Bad for the economy,Allow none,Allow none,Like me,Like me,Like me,Very much like me,Somewhat like me,Very much like me,Very much like me,Allow none,Cultural life undermined,Worse place to live,3,7,7,8,46,4,4,7,51,44,2015,2015,I have/had no influence,Very much like me,A little like me,Like me,Not like me at all,Somewhat like me,Very much like me,A little like me,Very much like me,Somewhat like me,Like me,Somewhat like me,Very much like me,Somewhat like me,Not like me at all,Motor vehicle mechanics and repairers,No,GER,0,Right,Yes,Legally divorced/civil union dissolved,66,No,Paid work,Yes,Face to face interview,Wholesale and retail trade and repair of motor vehicles and motorcycles,Skilled worker,Does not,No,Marked,Not 
marked,Hardly interested,5,People mostly look out for themselves,5,Never,Not at all,Not at all,NUTS level 2,Less often,Yes,5,Not applicable,Not applicable,Not marked,Not marked,About the same,Several times a week,No,None or almost none of the time,Extremely dissatisfied,Extremely dissatisfied,Extremely bad,Extremely dissatisfied,8,7,4,2015,A private firm,No trust at all,No trust at all,No trust at all,No trust at all,No trust at all,No trust at all,No trust at all,No time at all,More than 3 hours,No,Not marked,Not marked,Not marked,Not marked,No,I have/had no influence,40,50,Most of the time,No,Unlimited,No,No,1993,Not applicable,AT,0
4,4252,Not at all able,Very safe,24,Austrian nfs,No,No,No,No,Yes,Does not,None or almost none of the time,Yes,Not marked,Not marked,66,Yes,2,AT12,No,Yes,66,Not marked,Not marked,Not marked,Not marked,Marked,Not marked,Not marked,Not marked,Not marked,Town or small city,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,No,Not marked,Not marked,Marked,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,No,Not marked,Not marked,Post-secondary non-tertiary education completed (ISCED 4),"Vocational ISCED 4A, access upper tier ISCED 5A/all 5",Upper secondary education completed (ISCED 3),"Vocational ISCED 3A, access ISCED 5B/ lower tier 5A",Upper secondary education completed (ISCED 3),"Vocational ISCED 3A, access ISCED 5B/ lower tier 5A",13,"ES-ISCED IV, advanced vocational, sub-degree","ES-ISCED IIIb, lower tier upper secondary","ES-ISCED IIIb, lower tier upper secondary",Employee,Employee,Not working,All or almost all of the time,10 to 24,8,Yes,66,None or almost none of the time,None or almost none of the time,None or almost none of the time,None or almost none of the time,Agree strongly,Agree strongly,Male,Not applicable,8,Very good,1,Living comfortably on present income,Wages or salaries,Wages or salaries,C - 3rd decile,No,Not marked,Not marked,Yes,7,Allow some,Allow some,Not like me at all,Very much like me,Like me,Somewhat like me,A little like me,Very much like me,Not like me at all,Allow some,7,8,3,29,29,19,20,3,3,18,25,42,2015,2015,6,Not like me at all,Like me,Like me,Very much like me,Somewhat like me,Like me,Like me,Very much like me,Like me,Somewhat like me,Somewhat like me,Like me,Somewhat like me,Somewhat like me,Accounting and bookkeeping clerks,No,GER,0,4,No,None of these (NEVER married or in legally registered civil union),66,No,Paid work,Yes,Face to face interview,Activities auxiliary to financial services and insurance activities,Skilled worker,Does not,No,Marked,Not marked,Quite interested,6,8,5,Less 
often,5,5,NUTS level 2,Only on special holy days,Yes,5,Not applicable,Not applicable,Not marked,Not marked,About the same,Several times a week,Yes,Some of the time,9,6,8,6,8,29,3,2015,A private firm,2,6,9,7,7,7,5,"0,5 hour to 1 hour","More than 1,5 hours, up to 2 hours",No,Not marked,Not marked,Not marked,Not marked,Yes,8,38,38,None or almost none of the time,No,Unlimited,No,No,1991,Not applicable,AT,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
30075,34440,Not at all able,Very safe,72,Slovene nfs,No,No,No,No,Yes,Does not,None or almost none of the time,No,Not marked,Not marked,66,No,4,SI017,No,Yes,66,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,Country village,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,No,Not marked,Not marked,Marked,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,No,Not marked,Not marked,Upper secondary education completed (ISCED 3),"Vocational ISCED 3C >= 2 years, no access ISCED 5",Lower secondary education completed (ISCED 2),"General ISCED 2A, access ISCED 3A general/all 3",Lower secondary education completed (ISCED 2),"General ISCED 2A, access ISCED 3A general/all 3",11,"ES-ISCED IIIb, lower tier upper secondary","ES-ISCED II, lower secondary","ES-ISCED II, lower secondary",Self-employed,Father dead/absent,Not working,Some of the time,Under 10,8,Yes,66,Some of the time,Some of the time,None or almost none of the time,None or almost none of the time,Neither agree nor disagree,Agree,Male,Female,6,Good,2,Living comfortably on present income,Pensions,Pensions,P - 8th decile,No,Not marked,Not marked,Yes,5,Allow some,Allow some,Like me,Like me,Like me,Like me,Somewhat like me,Like me,Like me,Allow some,6,3,3,29,29,16,48,11,11,16,23,25,2014,2014,I have/had complete control,Not like me,Like me,Like me,Like me,Somewhat like me,Like me,Like me,Like me,Like me,Like me,Somewhat like me,Like me,Like me,Like me,Business services and administration managers not elsewhere classified,Yes,SLV,0,5,No,Legally married,66,"Yes, previously",Retired,Yes,Face to face interview,Land transport and transport via pipelines,Not applicable,Lives with husband/wife/partner at household grid,No,Not marked,Not marked,Quite interested,4,2,5,Never,Not at all,Not at all,NUTS level 3,Never,Yes,3,Husband/wife/partner,Husband/wife/partner,Marked,Marked,About the same,Once a week,No,Some of the 
time,8,5,7,5,7,29,11,2014,Self employed,No trust at all,No trust at all,3,No trust at all,2,No trust at all,3,"0,5 hour to 1 hour",More than 3 hours,No,Not marked,Not marked,Not marked,Not marked,Yes,I have/had complete control,40,60,Some of the time,No,Not applicable,.,No,1942,1945,SI,1
30076,13566,Not at all able,Very safe,38,Slovene nfs,No,No,No,No,Yes,Does not,None or almost none of the time,No,Not marked,Not marked,66,No,Not at all confident,SI021,No,Yes,66,Not marked,Not marked,Not marked,Not marked,Marked,Not marked,Not marked,Not marked,Not marked,Country village,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,No,Not marked,Not marked,Marked,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,No,Not marked,Not marked,Upper secondary education completed (ISCED 3),"Vocational ISCED 3A, access upper tier ISCED 5A/all 5",Upper secondary education completed (ISCED 3),"Vocational ISCED 3C >= 2 years, no access ISCED 5",Upper secondary education completed (ISCED 3),"Vocational ISCED 3C >= 2 years, no access ISCED 5",11,"ES-ISCED IIIa, upper tier upper secondary","ES-ISCED IIIb, lower tier upper secondary","ES-ISCED IIIb, lower tier upper secondary",Employee,Employee,Employee,All or almost all of the time,Under 10,5,Yes,66,None or almost none of the time,None or almost none of the time,None or almost none of the time,Some of the time,Don't know,Agree,Male,Not applicable,7,Good,1,Coping on present income,Wages or salaries,Wages or salaries,M - 4th decile,No,Not marked,Not marked,Yes,2,Allow none,Allow none,Like me,Very much like me,Very much like me,Like me,Somewhat like me,Like me,Somewhat like me,Allow none,3,Don't know,3,17,17,16,16,11,11,15,46,30,2014,2014,6,Somewhat like me,Very much like me,Like me,Somewhat like me,Not like me,Like me,Like me,Like me,Like me,Like me,Like me,Like me,Somewhat like me,Somewhat like me,Commercial sales representatives,No,SLV,0,5,No,None of these (NEVER married or in legally registered civil union),66,No,Paid work,Yes,Face to face interview,"Retail trade, except of motor vehicles and motorcycles",Semi-skilled worker,Does not,No,Marked,Not marked,Hardly interested,7,5,3,Never,Not at all,Not at all,NUTS level 3,Never,No,5,Not applicable,Not applicable,Not marked,Not 
marked,About the same,Every day,No,None or almost none of the time,Extremely dissatisfied,1,5,1,5,17,11,2014,A private firm,2,5,5,2,2,2,2,"0,5 hour to 1 hour","More than 2,5 hours, up to 3 hours",No,Not marked,Not marked,Not marked,Not marked,No,6,40,40,All or almost all of the time,No,Unlimited,.,No,1976,Not applicable,SI,1
30077,29824,5,Safe,49,Slovene nfs,No,No,No,No,Yes,Respondent lives with children at household grid,None or almost none of the time,No,Not marked,Not marked,66,No,5,SI011,No,Yes,66,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,Farm or home in countryside,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,No,Not marked,Not marked,Marked,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,No,Not marked,Not marked,Lower secondary education completed (ISCED 2),"General ISCED 2A, access ISCED 3A general/all 3",Lower secondary education completed (ISCED 2),"General ISCED 2A, access ISCED 3A general/all 3",Lower secondary education completed (ISCED 2),"General ISCED 2A, access ISCED 3A general/all 3",8,"ES-ISCED II, lower secondary","ES-ISCED II, lower secondary","ES-ISCED II, lower secondary",Employee,Employee,Employee,Most of the time,100 to 499,4,Yes,66,None or almost none of the time,None or almost none of the time,None or almost none of the time,None or almost none of the time,Agree,Agree,Male,Female,6,Fair,4,Coping on present income,Wages or salaries,Wages or salaries,J - 1st decile,No,Not marked,Not marked,Yes,4,Allow none,Allow none,Very much like me,Very much like me,Somewhat like me,A little like me,A little like me,Like me,Very much like me,Allow a few,3,4,2,5,5,16,6,11,11,15,43,23,2014,2014,8,A little like me,Like me,Like me,Very much like me,Like me,Like me,Very much like me,Like me,Like me,Like me,Like me,Like me,Very much like me,Like me,Food and related products machine operators,No,SLV,0,6,No,Legally married,66,"Yes, currently",Paid work,Yes,Face to face interview,Manufacture of food products,Semi-skilled worker,Lives with husband/wife/partner at household grid,No,Marked,Not marked,Hardly interested,3,4,4,Once a week,5,3,NUTS level 3,At least once a month,No,6,Husband/wife/partner,Husband/wife/partner,Not marked,Not marked,About the same,Once a week,No,None or 
almost none of the time,4,4,4,3,5,5,11,2014,A private firm,2,6,5,2,3,1,6,No time at all,"0,5 hour to 1 hour",Yes,Not marked,Marked,Not marked,Not marked,Yes,7,40,48,Most of the time,No,Unlimited,.,No,1965,1967,SI,0
30078,9573,Not at all able,Very safe,16,Slovene nfs,No,No,No,No,Yes,Does not,None or almost none of the time,No,Not marked,Not marked,66,No,Not at all confident,SI018,No,Yes,66,Not marked,Not marked,Not marked,Not marked,Marked,Not marked,Not marked,Not marked,Not marked,Town or small city,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,No,Not marked,Not marked,Marked,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,No,Marked,Not marked,Lower secondary education completed (ISCED 2),"General ISCED 2A, access ISCED 3A general/all 3",Upper secondary education completed (ISCED 3),"General ISCED 3A, access upper tier ISCED 5A/all 5",Upper secondary education completed (ISCED 3),"Vocational ISCED 3A, access upper tier ISCED 5A/all 5",10,"ES-ISCED II, lower secondary","ES-ISCED IIIa, upper tier upper secondary","ES-ISCED IIIa, upper tier upper secondary",Not applicable,Employee,Employee,Most of the time,Not applicable,5,Yes,66,None or almost none of the time,None or almost none of the time,None or almost none of the time,Most of the time,Disagree,Agree,Male,Female,Extremely happy,Very good,3,Living comfortably on present income,Wages or salaries,Wages or salaries,Don't know,No,Not marked,Not marked,Not applicable,2,Allow many to come and live here,Allow many to come and live here,Somewhat like me,Somewhat like me,Very much like me,Very much like me,Not like me,Like me,Somewhat like me,Allow many to come and live here,5,5,3,19,19,17,7,11,11,16,37,30,2014,2014,Not applicable,Like me,Like me,Somewhat like me,Very much like me,Somewhat like me,Very much like me,Somewhat like me,Very much like me,Somewhat like me,Like me,Very much like me,Like me,Somewhat like me,Like me,Not applicable,Not applicable,SLV,0,5,No,None of these (NEVER married or in legally registered civil union),66,No,Education,Yes,Face to face interview,Not applicable,Service occupations,Does not,No,Not marked,Not marked,Not at all interested,5,5,1,Never,1,Don't 
know,NUTS level 3,Only on special holy days,No,Not at all religious,Parent/parent-in-law,Parent/parent-in-law,Not marked,Not marked,Less than most,Every day,No,None or almost none of the time,2,Extremely dissatisfied,3,Extremely dissatisfied,5,19,11,2014,Not applicable,No trust at all,2,2,No trust at all,No trust at all,No trust at all,3,No time at all,"More than 1 hour, up to 1,5 hours",No,Not marked,Not marked,Not marked,Not marked,Not eligible to vote,Not applicable,Not applicable,Not applicable,Most of the time,Not applicable,Not applicable,.,No,1998,1972,SI,1


### Categorical and Numeric Variables
<a id="Categorical-and-Numeric-Variables"></a>

In [23]:
# Build a long-format frequency table: one row per (question, response) pair
frames = []
for col in list(train_v3)[1:]:  # skip the first column
    grouped_data = train_v3.groupby(col)['id'].count()
    frames.append(pd.DataFrame({'Question': [col] * grouped_data.size,
                                'Response': list(grouped_data.index),
                                'Frequency': list(grouped_data)}))
# DataFrame.append was removed in pandas 2.0, so concatenate once instead
frequency_train_v3_data = pd.concat(frames, ignore_index=True)

# Share of each response within its question, in percent
question_totals = frequency_train_v3_data.groupby("Question")["Frequency"].transform("sum")
frequency_train_v3_data["Relative Frequency (%)"] = 100 * frequency_train_v3_data["Frequency"] / question_totals

# Attach the question label and type information from the codebook
frequency_train_v3_data = frequency_train_v3_data.merge(codebook[['Variable','Label','Unique','Type_codebook_long']], left_on = 'Question', right_on = 'Variable', how = 'left').drop("Variable", axis=1)
frequency_train_v3_data

Unnamed: 0,Question,Response,Frequency,Relative Frequency (%),Label,Unique,Type_codebook_long
0,v1,-1,500,1.690274,Able to take active role in po...,11,double
1,v1,0,6204,20.972922,Able to take active role in po...,11,double
2,v1,1,2393,8.089652,Able to take active role in po...,11,double
3,v1,10,899,3.039113,Able to take active role in po...,11,double
4,v1,2,3211,10.854941,Able to take active role in po...,11,double
...,...,...,...,...,...,...,...
4015,cntry,PT,978,3.306176,Country,21,string
4016,cntry,SE,1419,4.796998,Country,21,string
4017,cntry,SI,926,3.130388,Country,21,string
4018,satisfied,0,14144,47.814476,Target,2,float


In [27]:
# Same frequency table as above, built from the raw (unprocessed) responses
frames = []
for col in list(train_raw_v3)[1:]:  # skip the first column
    grouped_data = train_raw_v3.groupby(col)['id'].count()
    frames.append(pd.DataFrame({'Question': [col] * grouped_data.size,
                                'Response': list(grouped_data.index),
                                'Frequency': list(grouped_data)}))
# DataFrame.append was removed in pandas 2.0, so concatenate once instead
frequency_train_raw_v3_data = pd.concat(frames, ignore_index=True)

# Share of each response within its question, in percent
question_totals = frequency_train_raw_v3_data.groupby("Question")["Frequency"].transform("sum")
frequency_train_raw_v3_data["Relative Frequency (%)"] = 100 * frequency_train_raw_v3_data["Frequency"] / question_totals

# Attach the question label and type information from the codebook
frequency_train_raw_v3_data = frequency_train_raw_v3_data.merge(codebook[['Variable','Label','Unique','Type_codebook_long']], left_on = 'Question', right_on = 'Variable', how = 'left').drop("Variable", axis=1)
frequency_train_raw_v3_data

Unnamed: 0,Question,Response,Frequency,Relative Frequency (%),Label,Unique,Type_codebook_long
0,v1,1,2393,8.089652,Able to take active role in po...,11,double
1,v1,2,3211,10.854941,Able to take active role in po...,11,double
2,v1,3,3293,11.132146,Able to take active role in po...,11,double
3,v1,4,2481,8.387140,Able to take active role in po...,11,double
4,v1,5,3310,11.189615,Able to take active role in po...,11,double
...,...,...,...,...,...,...,...
4324,cntry,PT,978,3.306176,Country,21,string
4325,cntry,SE,1419,4.796998,Country,21,string
4326,cntry,SI,926,3.130388,Country,21,string
4327,satisfied,0,14144,47.814476,Target,2,float


Most variables were already encoded during preprocessing. According to codebook_long, the string-typed variables are v17, v20, v25, v78, v154, v155, v161, and cntry; all of them hold country, region, or language codes.
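The list of string-typed variables can also be read from the codebook instead of being typed by hand. This is a minimal sketch, assuming the `Type_codebook_long` column marks these variables as `string` (as the frequency tables in this notebook suggest). The `codebook_demo` frame and its rows are made up for illustration and stand in for the real codebook.

```python
import pandas as pd

# Toy stand-in for the codebook (hypothetical rows, not the real file contents)
codebook_demo = pd.DataFrame({
    "Variable": ["v1", "v17", "v20", "cntry", "satisfied"],
    "Type_codebook_long": ["double", "string", "string", "string", "float"],
})

# Keep the variables whose codebook type is 'string'
string_vars = codebook_demo.loc[
    codebook_demo["Type_codebook_long"] == "string", "Variable"
].tolist()
print(string_vars)
```

Deriving the list this way keeps it in sync with the codebook if the set of string variables changes.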

In [60]:
cat_cols = ["v17", "v20", "v25", "v78", "v154", "v155", "v161", "cntry"]
cat_df = train_v3[cat_cols]
# change variable type to str
train_v3[cat_cols] = train_v3[cat_cols].astype(str)

In [30]:
# concerning responses: 2, 3, 4, 6, 66, 77, 88, 99
frequency_train_v3_data[frequency_train_v3_data["Question"]=="v17"]

Unnamed: 0,Question,Response,Frequency,Relative Frequency (%),Label,Unique,Type_codebook_long
313,v17,2,25,0.084514,Country of birth,172,string
314,v17,3,8,0.027044,Country of birth,172,string
315,v17,4,7,0.023664,Country of birth,172,string
316,v17,66,26275,88.823907,Country of birth,172,string
317,v17,77,2,0.006761,Country of birth,172,string
318,v17,88,1,0.003381,Country of birth,172,string
319,v17,99,173,0.584835,Country of birth,172,string
320,v17,AF,13,0.043947,Country of birth,172,string
321,v17,AL,3,0.010142,Country of birth,172,string
322,v17,AM,4,0.013522,Country of birth,172,string


In [31]:
frequency_train_raw_v3_data[frequency_train_raw_v3_data["Question"]=="v17"]

Unnamed: 0,Question,Response,Frequency,Relative Frequency (%),Label,Unique,Type_codebook_long
332,v17,2,25,0.084514,Country of birth,172,string
333,v17,3,8,0.027044,Country of birth,172,string
334,v17,4,7,0.023664,Country of birth,172,string
335,v17,66,26275,88.823907,Country of birth,172,string
336,v17,77,2,0.006761,Country of birth,172,string
337,v17,88,1,0.003381,Country of birth,172,string
338,v17,99,173,0.584835,Country of birth,172,string
339,v17,AF,13,0.043947,Country of birth,172,string
340,v17,AL,3,0.010142,Country of birth,172,string
341,v17,AM,4,0.013522,Country of birth,172,string


For v17 ('Country of birth'), 89% of responses are '66' in both the raw and preprocessed data. Since this is not a preprocessing error, the column stays in. It may be dropped later if another variable turns out to be highly correlated with it.
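Columns dominated by a single response, like v17 here, can be found in one pass rather than by inspecting each question. This is a minimal sketch on toy data; `dominant_share` and the `demo` frame are hypothetical names introduced for illustration only.

```python
import pandas as pd

def dominant_share(df):
    """Return, per column, the most frequent value and its share of rows in percent."""
    rows = []
    for col in df.columns:
        counts = df[col].value_counts(dropna=False)  # sorted, most frequent first
        rows.append({"column": col,
                     "top_value": counts.index[0],
                     "share_pct": 100 * counts.iloc[0] / len(df)})
    return pd.DataFrame(rows).sort_values("share_pct", ascending=False, ignore_index=True)

# Toy data: one column dominated by '66', one evenly split
demo = pd.DataFrame({"v17": ["66"] * 9 + ["AF"], "cntry": ["AT", "BE"] * 5})
summary = dominant_share(demo)
print(summary)
```

Sorting by share puts the near-constant columns at the top, which are the first candidates to review.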

In [32]:
# concerning responses: 99999
frequency_train_v3_data[frequency_train_v3_data["Question"]=="v20"]

Unnamed: 0,Question,Response,Frequency,Relative Frequency (%),Label,Unique,Type_codebook_long
493,v20,99999,3,0.010142,Region,251,string
494,v20,AT11,47,0.158886,Region,251,string
495,v20,AT12,257,0.868801,Region,251,string
496,v20,AT13,262,0.885704,Region,251,string
497,v20,AT21,91,0.30763,Region,251,string
498,v20,AT22,204,0.689632,Region,251,string
499,v20,AT31,248,0.838376,Region,251,string
500,v20,AT32,96,0.324533,Region,251,string
501,v20,AT33,111,0.375241,Region,251,string
502,v20,AT34,68,0.229877,Region,251,string


In [33]:
frequency_train_raw_v3_data[frequency_train_raw_v3_data["Question"]=="v20"]

Unnamed: 0,Question,Response,Frequency,Relative Frequency (%),Label,Unique,Type_codebook_long
516,v20,99999,3,0.010142,Region,251,string
517,v20,AT11,47,0.158886,Region,251,string
518,v20,AT12,257,0.868801,Region,251,string
519,v20,AT13,262,0.885704,Region,251,string
520,v20,AT21,91,0.30763,Region,251,string
521,v20,AT22,204,0.689632,Region,251,string
522,v20,AT31,248,0.838376,Region,251,string
523,v20,AT32,96,0.324533,Region,251,string
524,v20,AT33,111,0.375241,Region,251,string
525,v20,AT34,68,0.229877,Region,251,string


For v20, 3 responses are '99999' in both the raw and preprocessed data. This is negligible, so nothing needs to be done for this column.

In [34]:
# concerning responses: 6, 65, 66, 88, 99
frequency_train_v3_data[frequency_train_v3_data["Question"]=="v25"]

Unnamed: 0,Question,Response,Frequency,Relative Frequency (%),Label,Unique,Type_codebook_long
745,v25,65,138,0.466516,Citizenship,130,string
746,v25,66,28051,94.827761,Citizenship,130,string
747,v25,88,17,0.057469,Citizenship,130,string
748,v25,99,117,0.395524,Citizenship,130,string
749,v25,AF,6,0.020283,Citizenship,130,string
750,v25,AL,2,0.006761,Citizenship,130,string
751,v25,AM,1,0.003381,Citizenship,130,string
752,v25,AO,3,0.010142,Citizenship,130,string
753,v25,AR,3,0.010142,Citizenship,130,string
754,v25,AT,14,0.047328,Citizenship,130,string


In [35]:
frequency_train_raw_v3_data[frequency_train_raw_v3_data["Question"]=="v25"]

Unnamed: 0,Question,Response,Frequency,Relative Frequency (%),Label,Unique,Type_codebook_long
772,v25,65,138,0.466516,Citizenship,130,string
773,v25,66,28051,94.827761,Citizenship,130,string
774,v25,88,17,0.057469,Citizenship,130,string
775,v25,99,117,0.395524,Citizenship,130,string
776,v25,AF,6,0.020283,Citizenship,130,string
777,v25,AL,2,0.006761,Citizenship,130,string
778,v25,AM,1,0.003381,Citizenship,130,string
779,v25,AO,3,0.010142,Citizenship,130,string
780,v25,AR,3,0.010142,Citizenship,130,string
781,v25,AT,14,0.047328,Citizenship,130,string


For v25 ('Citizenship'), 95% of responses are '66' in both the raw and preprocessed data. Since this is not a preprocessing error, the column stays in. It may be dropped later if another variable turns out to be highly correlated with it.

In [36]:
# concerning responses: 2, 3, 4, 6, 66, 77, 88, 99
frequency_train_v3_data[frequency_train_v3_data["Question"]=="v78"]

Unnamed: 0,Question,Response,Frequency,Relative Frequency (%),Label,Unique,Type_codebook_long
1144,v78,2,37,0.12508,"Country of birth, father",171,string
1145,v78,3,20,0.067611,"Country of birth, father",171,string
1146,v78,4,31,0.104797,"Country of birth, father",171,string
1147,v78,66,24508,82.850478,"Country of birth, father",171,string
1148,v78,77,2,0.006761,"Country of birth, father",171,string
1149,v78,88,41,0.138602,"Country of birth, father",171,string
1150,v78,99,189,0.638924,"Country of birth, father",171,string
1151,v78,AF,16,0.054089,"Country of birth, father",171,string
1152,v78,AG,1,0.003381,"Country of birth, father",171,string
1153,v78,AL,6,0.020283,"Country of birth, father",171,string


In [37]:
frequency_train_raw_v3_data[frequency_train_raw_v3_data["Question"]=="v78"]

Unnamed: 0,Question,Response,Frequency,Relative Frequency (%),Label,Unique,Type_codebook_long
1219,v78,2,37,0.12508,"Country of birth, father",171,string
1220,v78,3,20,0.067611,"Country of birth, father",171,string
1221,v78,4,31,0.104797,"Country of birth, father",171,string
1222,v78,66,24508,82.850478,"Country of birth, father",171,string
1223,v78,77,2,0.006761,"Country of birth, father",171,string
1224,v78,88,41,0.138602,"Country of birth, father",171,string
1225,v78,99,189,0.638924,"Country of birth, father",171,string
1226,v78,AF,16,0.054089,"Country of birth, father",171,string
1227,v78,AG,1,0.003381,"Country of birth, father",171,string
1228,v78,AL,6,0.020283,"Country of birth, father",171,string


For v78 ('Country of birth, father'), 83% of responses are '66' in both the raw and preprocessed data. Since this is not a preprocessing error, the column stays in. It may be dropped later if another variable turns out to be highly correlated with it.

In [38]:
# concerning responses: 777, 888, 999
frequency_train_v3_data[frequency_train_v3_data["Question"]=="v154"]

Unnamed: 0,Question,Response,Frequency,Relative Frequency (%),Label,Unique,Type_codebook_long
2646,v154,777,11,0.037186,Language most often spoken at ...,123,string
2647,v154,888,45,0.152125,Language most often spoken at ...,123,string
2648,v154,999,171,0.578074,Language most often spoken at ...,123,string
2649,v154,ABK,1,0.003381,Language most often spoken at ...,123,string
2650,v154,AKA,1,0.003381,Language most often spoken at ...,123,string
2651,v154,ALB,31,0.104797,Language most often spoken at ...,123,string
2652,v154,AMH,16,0.054089,Language most often spoken at ...,123,string
2653,v154,APA,15,0.050708,Language most often spoken at ...,123,string
2654,v154,ARA,448,1.514486,Language most often spoken at ...,123,string
2655,v154,ARM,9,0.030425,Language most often spoken at ...,123,string


In [39]:
frequency_train_raw_v3_data[frequency_train_raw_v3_data["Question"]=="v154"]

Unnamed: 0,Question,Response,Frequency,Relative Frequency (%),Label,Unique,Type_codebook_long
2838,v154,777,11,0.037186,Language most often spoken at ...,123,string
2839,v154,888,45,0.152125,Language most often spoken at ...,123,string
2840,v154,999,171,0.578074,Language most often spoken at ...,123,string
2841,v154,ABK,1,0.003381,Language most often spoken at ...,123,string
2842,v154,AKA,1,0.003381,Language most often spoken at ...,123,string
2843,v154,ALB,31,0.104797,Language most often spoken at ...,123,string
2844,v154,AMH,16,0.054089,Language most often spoken at ...,123,string
2845,v154,APA,15,0.050708,Language most often spoken at ...,123,string
2846,v154,ARA,448,1.514486,Language most often spoken at ...,123,string
2847,v154,ARM,9,0.030425,Language most often spoken at ...,123,string


For v154, the codes 777, 888, and 999 together account for less than 1% of responses. This is negligible, so nothing needs to be done for this column.
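The share of such numeric codes in a string column can be checked directly. This is a minimal sketch on toy data; `code_share` is a hypothetical helper, and treating 777, 888, and 999 as special codes is this notebook's reading of the output rather than something confirmed by the codebook here.

```python
import pandas as pd

def code_share(series, codes):
    """Percentage of entries in `series` whose string form is one of `codes`."""
    as_text = series.astype(str)
    return 100 * as_text.isin(codes).mean()

# Toy column: 97 language codes and 3 special codes
demo = pd.Series(["ENG"] * 97 + ["777", "888", "999"])
share = code_share(demo, {"777", "888", "999"})
print(round(share, 2))
```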

In [40]:
# concerning responses: 0, 777, 888, 999
frequency_train_v3_data[frequency_train_v3_data["Question"]=="v155"]

Unnamed: 0,Question,Response,Frequency,Relative Frequency (%),Label,Unique,Type_codebook_long
2752,v155,0,23485,79.392177,Language most often spoken at ...,129,string
2753,v155,777,12,0.040567,Language most often spoken at ...,129,string
2754,v155,888,26,0.087894,Language most often spoken at ...,129,string
2755,v155,999,1729,5.844968,Language most often spoken at ...,129,string
2756,v155,ALB,15,0.050708,Language most often spoken at ...,129,string
2757,v155,AMH,15,0.050708,Language most often spoken at ...,129,string
2758,v155,APA,5,0.016903,Language most often spoken at ...,129,string
2759,v155,ARA,119,0.402285,Language most often spoken at ...,129,string
2760,v155,ARM,1,0.003381,Language most often spoken at ...,129,string
2761,v155,BAQ,14,0.047328,Language most often spoken at ...,129,string


In [41]:
frequency_train_raw_v3_data[frequency_train_raw_v3_data["Question"]=="v155"]

Unnamed: 0,Question,Response,Frequency,Relative Frequency (%),Label,Unique,Type_codebook_long
2944,v155,0,23485,79.392177,Language most often spoken at ...,129,string
2945,v155,777,12,0.040567,Language most often spoken at ...,129,string
2946,v155,888,26,0.087894,Language most often spoken at ...,129,string
2947,v155,999,1729,5.844968,Language most often spoken at ...,129,string
2948,v155,ALB,15,0.050708,Language most often spoken at ...,129,string
2949,v155,AMH,15,0.050708,Language most often spoken at ...,129,string
2950,v155,APA,5,0.016903,Language most often spoken at ...,129,string
2951,v155,ARA,119,0.402285,Language most often spoken at ...,129,string
2952,v155,ARM,1,0.003381,Language most often spoken at ...,129,string
2953,v155,BAQ,14,0.047328,Language most often spoken at ...,129,string


For v155, the question is "Language most often spoken at home: second mentioned", and about 79% of responses are 0, which I am interpreting as Not Applicable. Since this is not a preprocessing error, the column stays in.

In [42]:
# concerning responses: 2, 3, 4, 66, 77, 88, 99
frequency_train_v3_data[frequency_train_v3_data["Question"]=="v161"]

Unnamed: 0,Question,Response,Frequency,Relative Frequency (%),Label,Unique,Type_codebook_long
2899,v161,2,38,0.128461,"Country of birth, mother",173,string
2900,v161,3,18,0.06085,"Country of birth, mother",173,string
2901,v161,4,23,0.077753,"Country of birth, mother",173,string
2902,v161,66,24739,83.631385,"Country of birth, mother",173,string
2903,v161,77,3,0.010142,"Country of birth, mother",173,string
2904,v161,88,32,0.108178,"Country of birth, mother",173,string
2905,v161,99,184,0.622021,"Country of birth, mother",173,string
2906,v161,AD,1,0.003381,"Country of birth, mother",173,string
2907,v161,AF,17,0.057469,"Country of birth, mother",173,string
2908,v161,AL,5,0.016903,"Country of birth, mother",173,string


In [43]:
frequency_train_raw_v3_data[frequency_train_raw_v3_data["Question"]=="v161"]

Unnamed: 0,Question,Response,Frequency,Relative Frequency (%),Label,Unique,Type_codebook_long
3098,v161,2,38,0.128461,"Country of birth, mother",173,string
3099,v161,3,18,0.06085,"Country of birth, mother",173,string
3100,v161,4,23,0.077753,"Country of birth, mother",173,string
3101,v161,66,24739,83.631385,"Country of birth, mother",173,string
3102,v161,77,3,0.010142,"Country of birth, mother",173,string
3103,v161,88,32,0.108178,"Country of birth, mother",173,string
3104,v161,99,184,0.622021,"Country of birth, mother",173,string
3105,v161,AD,1,0.003381,"Country of birth, mother",173,string
3106,v161,AF,17,0.057469,"Country of birth, mother",173,string
3107,v161,AL,5,0.016903,"Country of birth, mother",173,string


For v161 ('Country of birth, mother'), 84% of responses are '66' in both the raw and preprocessed data. Since this is not a preprocessing error, the column stays in. It may be dropped later if another variable turns out to be highly correlated with it.

In [44]:
# concerning responses: none
frequency_train_v3_data[frequency_train_v3_data["Question"]=="cntry"]

Unnamed: 0,Question,Response,Frequency,Relative Frequency (%),Label,Unique,Type_codebook_long
3997,cntry,AT,1384,4.678679,Country,21,string
3998,cntry,BE,9,0.030425,Country,21,string
3999,cntry,CH,1186,4.00933,Country,21,string
4000,cntry,CZ,1603,5.419019,Country,21,string
4001,cntry,DE,2377,8.035563,Country,21,string
4002,cntry,DK,1189,4.019472,Country,21,string
4003,cntry,EE,1592,5.381833,Country,21,string
4004,cntry,ES,1511,5.108009,Country,21,string
4005,cntry,FI,1639,5.540719,Country,21,string
4006,cntry,FR,1498,5.064061,Country,21,string


Let's convert all non-categorical variables to int64.

In [61]:
num_cols = train_v3.loc[:, ~train_v3.columns.isin(cat_cols)].columns.tolist()
num_df = train_v3[num_cols]

train_v3[num_cols] = train_v3[num_cols].astype("int64")
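A plain `astype("int64")` silently truncates fractional floats and fails on missing values, so it can help to validate before casting. This is a minimal sketch, assuming every numeric column should hold whole numbers only; `to_int64_checked` and the `demo` frame are hypothetical names used for illustration.

```python
import pandas as pd

def to_int64_checked(df, cols):
    """Cast `cols` to int64, raising if any value is missing or non-integral."""
    out = df.copy()
    for col in cols:
        numeric = pd.to_numeric(out[col], errors="raise")
        if numeric.isna().any() or (numeric % 1 != 0).any():
            raise ValueError(f"{col} has missing or non-integer values")
        out[col] = numeric.astype("int64")
    return out

# Toy data: numbers stored as text and as floats
demo = pd.DataFrame({"v1": ["3", "-1", "10"], "satisfied": [1.0, 0.0, 1.0]})
converted = to_int64_checked(demo, ["v1", "satisfied"])
print(converted.dtypes)
```

The function returns a copy, so the original frame is left untouched if a column fails the check.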

### Correlation Analysis
<a id="Correlation-Analysis"></a>

We will leave all variables/columns in the data set for now. 

Once we get to the logistic regression modelling stage, we may decide to remove highly correlated variables/columns.
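If highly correlated columns are removed later, one common approach is to scan the upper triangle of the absolute correlation matrix and flag the later column of each pair. This is a minimal sketch on toy data, not the method used in this notebook; `high_corr_columns` is a hypothetical helper and the 0.95 threshold is an arbitrary assumption.

```python
import numpy as np
import pandas as pd

def high_corr_columns(df, threshold=0.95):
    """List columns whose absolute correlation with an earlier column exceeds `threshold`."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair is considered once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    return [col for col in upper.columns if (upper[col] > threshold).any()]

# Toy data: 'b' is almost a multiple of 'a', 'c' is only weakly related
demo = pd.DataFrame({"a": [1, 2, 3, 4, 5],
                     "b": [2, 4, 6, 8, 10.5],
                     "c": [5, 1, 4, 2, 3]})
to_drop = high_corr_columns(demo)
print(to_drop)
```

Because only the later column of each pair is flagged, the result depends on column order; which member of a pair to keep is a modelling decision.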

In [63]:
def get_redundant_pairs(df):
    pairs_to_drop = set()
    cols = df.columns
    for i in range(0, df.shape[1]):
        for j in range(0,i+1):
            pairs_to_drop.add((cols[i], cols[j]))
    return pairs_to_drop

def get_top_abs_correlations(df, n=5):
    au_corr = df.corr().abs().unstack()
    labels_to_drop = get_redundant_pairs(df)
    au_corr = au_corr.drop(labels=labels_to_drop).sort_values(ascending=False)
    return au_corr[0:n]

corr_df = get_top_abs_correlations(num_df, 30).to_frame().reset_index().rename(columns={0: "abs_corr"})
corr_df = corr_df.merge(codebook[['Variable', "Label"]], left_on = 'level_0', right_on = "Variable", how = "left").drop("Variable", axis = 1).rename(columns = {"Label":"label_0"})
corr_df = corr_df.merge(codebook[['Variable', "Label"]], left_on = 'level_1', right_on = "Variable", how = "left").drop("Variable", axis = 1).rename(columns = {"Label":"label_1"})
corr_df

Unnamed: 0,level_0,level_1,abs_corr,label_0,label_1
0,v128,v129,0.998918,"End of interview, month","Start of interview, month"
1,v124,v125,0.996732,"End of interview, day of month","Start of interview, day of month"
2,v196,v208,0.994988,Second person in household: re...,Second person in household: re...
3,v125,v228,0.992443,"Start of interview, day of month","Day of month, supplementary qu..."
4,v124,v228,0.991285,"End of interview, day of month","Day of month, supplementary qu..."
5,v57,v65,0.991019,Highest level of education,"Highest level of education, ES..."
6,v102,v103,0.990104,Main source of household income,Main source of household income
7,v58,v66,0.989397,Father's highest level of educ...,Father's highest level of educ...
8,v59,v66,0.989349,Father's highest level of educ...,Father's highest level of educ...
9,v60,v67,0.987533,Mother's highest level of educ...,Mother's highest level of educ...


Isolating high correlations with the target variable

In [66]:
corr_matrix = num_df.corr()

print("Top positive correlations with target")
pos_corr = pd.DataFrame(corr_matrix["satisfied"].sort_values(ascending=False)[0:10])
pos_corr.merge(codebook[codebook_labels], left_index = True, right_on = "Variable", how = "left")

Top positive correlations with target


Unnamed: 0,satisfied,Variable,Label
272,1.0,satisfied,Target
97,0.550926,v98,How happy are you
223,0.329335,v224,How satisfied with present sta...
73,0.32611,v74,"Enjoyed life, how often past week"
252,0.319583,v253,"Were happy, how often past week"
222,0.271414,v223,How satisfied with the way dem...
177,0.249736,v178,Most people try to take advant...
179,0.248674,v180,Most people can be trusted or ...
232,0.235256,v233,Trust in the legal system
225,0.230834,v226,How satisfied with the nationa...


In [67]:
print("Top negative correlations with target")
neg_corr = pd.DataFrame(corr_matrix["satisfied"].sort_values(ascending=True)[0:10])
neg_corr.merge(codebook[codebook_labels], left_index = True, right_on = "Variable", how = "left")

Top negative correlations with target


Unnamed: 0,satisfied,Variable,Label
100,-0.321658,v101,Feeling about household's inco...
78,-0.295399,v79,"Felt depressed, how often past..."
98,-0.277019,v99,Subjective general health
81,-0.267112,v82,"Felt sad, how often past week"
80,-0.236721,v81,"Felt lonely, how often past week"
79,-0.204854,v80,"Felt everything did as effort,..."
12,-0.201093,v13,"Could not get going, how often..."
221,-0.173122,v222,"Sleep was restless, how often ..."
1,-0.163894,v2,Feeling of safety of walking a...
133,-0.144672,v134,"Start of interview, year"
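The positive and negative lists can also be combined into a single ranking by absolute correlation, with the target itself excluded. This is a minimal sketch on toy data; `rank_by_target_corr` and the column names in `demo` are hypothetical.

```python
import pandas as pd

def rank_by_target_corr(df, target, n=10):
    """Top-n features by absolute Pearson correlation with `target`, sign preserved."""
    corr = df.corr()[target].drop(labels=[target])  # remove the target's self-correlation
    order = corr.abs().sort_values(ascending=False).index
    return corr.loc[order].head(n)

# Toy data: one strongly positive, one strongly negative, one weak feature
demo = pd.DataFrame({"satisfied": [1, 0, 1, 0, 1, 0],
                     "happy": [9, 2, 8, 3, 7, 1],
                     "sad": [1, 4, 2, 5, 1, 4],
                     "noise": [3, 3, 4, 4, 3, 4]})
ranked = rank_by_target_corr(demo, "satisfied", n=2)
print(ranked)
```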


### Visualizations
<a id="Visualizations"></a>

In [71]:
sns.pairplot(train_v3, hue="satisfied")
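A pair plot draws one panel per pair of numeric columns, so with a few hundred columns the grid grows to tens of thousands of panels and may be slow to render and hard to read. This is a minimal sketch, assuming it is acceptable to plot only a handful of features and a random sample of rows; `pairplot_frame`, `max_rows`, and `seed` are hypothetical names, and the toy columns borrow variable names from the correlation tables above.

```python
import pandas as pd

def pairplot_frame(df, features, target, max_rows=2000, seed=0):
    """Return the `features` plus `target` columns, downsampled to at most `max_rows` rows."""
    subset = df[list(features) + [target]]
    if len(subset) > max_rows:
        subset = subset.sample(n=max_rows, random_state=seed)
    return subset

# Toy data standing in for the survey frame
demo = pd.DataFrame({"v98": range(10),
                     "v101": range(10, 0, -1),
                     "v2": [0] * 10,
                     "satisfied": [0, 1] * 5})
small = pairplot_frame(demo, ["v98", "v101"], "satisfied", max_rows=4)
print(small.shape)
```

The returned frame can then be passed to `sns.pairplot(small, hue="satisfied")` in place of the full data set.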