## Predicting Life Satisfaction
### Exploratory Data Analysis  
  
#### Objectives:  
* Analysis of missing observations 
* Analysis of categorical/numeric features 
* One-way and two-way variable plots (histograms, scatterplots, etc.) 
* Correlation analysis

### From Kaggle:
#### File descriptions  
* train.csv - the training set with some preprocessing of values.  
* test.csv - the test set with some preprocessing of values.  
* train_raw.csv - the training set with original responses (no preprocessing).  
* test_raw.csv - the test set with original responses (no preprocessing).  
* codebook_compact.txt - short descriptions of each data field.  
* codebook_long.txt - long descriptions of each field.  
* sample_submission.csv - a sample submission file in the correct format. Change values in the Predicted column to estimates of the probability the person is very satisfied and submit to receive a leaderboard score.  
  
#### Data fields
* id - survey respondent id
* v1-v270 - survey response fields
* cntry - survey respondent country
* satisfied - whether (1) or not (0) the survey respondent is 'very satisfied' with their life (training set only)

In [166]:
# Import packages
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sns

# Set options
pd.options.display.max_rows = 999
pd.options.display.max_columns = 999

In [167]:
# Import data
train_raw = pd.read_csv("../01-data/train_raw.csv", low_memory = False)
train = pd.read_csv("../01-data/train.csv", low_memory = False)
test_raw = pd.read_csv("../01-data/test_raw.csv", low_memory = False)
test = pd.read_csv("../01-data/test.csv", low_memory = False)

# Custom data: the compact codebook plus the dtypes listed in codebook_long
codebook = pd.read_csv("../01-data/codebook_compact.csv", low_memory = False)

Note: These initial analyses are run on the training sets only, to preserve the test set for testing.

### Comparing processed vs. unprocessed data

In [168]:
display(train_raw.head())
train_raw.shape

Unnamed: 0,id,v1,v2,v3,v4,v5,v6,v7,v8,v9,v10,v11,v12,v13,v14,v15,v16,v17,v18,v19,v20,v21,v22,v23,v24,v25,v26,v27,v28,v29,v30,v31,v32,v33,v34,v35,v36,v37,v38,v39,v40,v41,v42,v43,v44,v45,v46,v47,v48,v49,v50,v51,v52,v53,v54,v55,v56,v57,v58,v59,v60,v61,v62,v63,v64,v65,v66,v67,v68,v69,v70,v71,v72,v73,v74,v75,v76,v77,v78,v79,v80,v81,v82,v83,v84,v85,v86,v87,v88,v89,v90,v91,v92,v93,v94,v95,v96,v97,v98,v99,v100,v101,v102,v103,v104,v105,v106,v107,v108,v109,v110,v111,v112,v113,v114,v115,v116,v117,v118,v119,v120,v121,v122,v123,v124,v125,v126,v127,v128,v129,v130,v131,v132,v133,v134,v135,v136,v137,v138,v139,v140,v141,v142,v143,v144,v145,v146,v147,v148,v149,v150,v151,v152,v153,v154,v155,v156,v157,v158,v159,v160,v161,v162,v163,v164,v165,v166,v167,v168,v169,v170,v171,v172,v173,v174,v175,v176,v177,v178,v179,v180,v181,v182,v183,v184,v185,v186,v187,v188,v189,v190,v191,v192,v193,v194,v195,v196,v197,v198,v199,v200,v201,v202,v203,v204,v205,v206,v207,v208,v209,v210,v211,v212,v213,v214,v215,v216,v217,v218,v219,v220,v221,v222,v223,v224,v225,v226,v227,v228,v229,v230,v231,v232,v233,v234,v235,v236,v237,v238,v239,v240,v241,v242,v243,v244,v245,v246,v247,v248,v249,v250,v251,v252,v253,v254,v255,v256,v257,v258,v259,v260,v261,v262,v263,v264,v265,v266,v267,v268,v269,v270,cntry,satisfied
0,9948,2,Safe,74,Austrian nfs,No second ancestry,No,No,No,No,Yes,Yes,Does not,Some of the time,Yes,Not marked,Not marked,66,No,2,AT33,No,No,Not applicable,Yes,66,Not marked,Not marked,Not marked,Not marked,Marked,Not marked,Not marked,Not marked,Not marked,Town or small city,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,No,Not marked,Not marked,Marked,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,No,Not marked,Not marked,Upper secondary education completed (ISCED 3),"Vocational ISCED 3A, access ISCED 5B/ lower ti...",Upper secondary education completed (ISCED 3),"Vocational ISCED 3A, access ISCED 5B/ lower ti...",Upper secondary education completed (ISCED 3),"Vocational ISCED 3A, access ISCED 5B/ lower ti...",Not applicable,Not applicable,10,"ES-ISCED IIIb, lower tier upper secondary","ES-ISCED IIIb, lower tier upper secondary","ES-ISCED IIIb, lower tier upper secondary",Not applicable,Not applicable,Employee,Not applicable,Employee,Employee,Some of the time,10 to 24,9,Yes,66,Some of the time,None or almost none of the time,Most of the time,All or almost all of the time,Neither agree nor disagree,Neither agree nor disagree,Female,,,,,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,,5,Good,1,Coping on present income,Pensions,Pensions,R - 2nd decile,Yes to some extent,Not marked,Not marked,Yes,Bad for the economy,Allow a few,Allow a few,Somewhat like me,Somewhat like me,Like me,Somewhat like me,Not like me at all,Somewhat like me,Like me,Allow some,2,Worse place to live,,,1,1,12,2,4,4,9,25,107.0,2015,2015,4,Not like me,Somewhat like me,A little like me,Very much like me,Not like me,Somewhat like me,Somewhat like me,Like me,Like me,Somewhat like me,Not like me,Like me,Not like me at all,Not like me,Office supervisors,Not applicable,Yes,Not applicable,GER,0,3,No,Not applicable,Widowed/civil partner died,Widowed/civil partner died,66,"Yes, previously",Retired,Not 
applicable,Yes,Face to face interview,Manufacture of other non-metallic mineral prod...,20,Sales occupations,Sales occupations,Does not,No,Yes,1993,Not marked,Not marked,Hardly interested,Most people try to be fair,People mostly try to be helpful,8,Every day,Quite close,2,2,NUTS level 2,Never,Yes,Not applicable,1,Roman Catholic,Not applicable,,,,,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,,,,,,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,,Not applicable,Marked,Not marked,Less than most,Several times a month,No,None or almost none of the time,3,Don't know,5,1,9,1,4,2015,A private firm,No trust at all,5,Complete trust,No trust at all,2,No trust at all,3,No time at all,"Less than 0,5 hour",Not applicable,No,Not applicable,Not marked,Not marked,Not marked,Not marked,Yes,8,40,40,Not applicable,Some of the time,No,Unlimited,No,No,1941,,,,,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,,AT,0
1,25601,4,Safe,58,Austrian nfs,No second ancestry,No,No,No,No,Yes,No,Does not,Some of the time,No,Not marked,Not marked,66,No,2,AT31,No,No,Not applicable,Yes,66,Not marked,Not marked,Not marked,Not marked,Marked,Not marked,Not marked,Not marked,Not marked,Country village,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,No,Not marked,Not marked,Marked,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,Refusal,Not marked,Not marked,Upper secondary education completed (ISCED 3),"Vocational ISCED 3A, access ISCED 5B/ lower ti...",Lower secondary education completed (ISCED 2),"General/pre-vocational ISCED 2A/2B, access ISC...",Lower secondary education completed (ISCED 2),"General/pre-vocational ISCED 2A/2B, access ISC...",Not applicable,Not applicable,12,"ES-ISCED IIIb, lower tier upper secondary","ES-ISCED II, lower secondary","ES-ISCED II, lower secondary",Not applicable,Not applicable,Employee,Not applicable,Employee,Employee,Some of the time,10 to 24,5,Yes,66,Some of the time,None or almost none of the time,None or almost none of the time,None or almost none of the time,Disagree strongly,Neither agree nor disagree,Male,,,,,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,,4,Fair,1,Coping on present income,Unemployment/redundancy benefit,Unemployment/redundancy benefit,R - 2nd decile,No,Not marked,Not marked,Yes,6,Allow a few,Allow some,A little like me,Somewhat like me,Somewhat like me,A little like me,A little like me,A little like me,A little like me,Allow many to come and live here,5,4,2,,17,17,19,15,1,1,17,46,75.0,2015,2015,3,A little like me,Somewhat like me,Somewhat like me,Very much like me,Very much like me,A little like me,Like me,Like me,Somewhat like me,Like me,Like me,Somewhat like me,Somewhat like me,Very much like me,Spray painters and varnishers,Not applicable,No,Not applicable,GER,0,5,Refusal,Not applicable,None of these (NEVER married or in legally 
reg...,None of these (NEVER married or in legally reg...,66,No,"Unemployed, looking for job",Not applicable,Yes,Face to face interview,"Manufacture of fabricated metal products, exce...",Not applicable,Unskilled worker,Unskilled worker,Does not,No,Yes,2014,Not marked,Not marked,Not at all interested,4,7,3,Less often,Not applicable,4,5,NUTS level 2,Only on special holy days,Yes,Not applicable,1,Roman Catholic,Not applicable,,,,,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,,,,,,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,,Not applicable,Not marked,Not marked,About the same,Several times a month,No,None or almost none of the time,5,3,8,5,8,17,1,2015,A private firm,Don't know,7,7,Don't know,5,5,Don't know,"Less than 0,5 hour","More than 1 hour, up to 1,5 hours",No,Yes,Yes,Marked,Not marked,Not marked,Not marked,Yes,3,39,39,Not applicable,Some of the time,No,Limited,No,No,1957,,,,,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,,AT,0
2,8592,6,Safe,47,Austrian nfs,Austrian nfs,No,No,Yes,No,Yes,Not applicable,Respondent lives with children at household grid,None or almost none of the time,No,Not marked,Not marked,66,No,3,AT33,No,Not applicable,Not applicable,Yes,66,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,Country village,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,No,Not marked,Not marked,Marked,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,No,Not marked,Not marked,Lower secondary education completed (ISCED 2),"General/pre-vocational ISCED 2A/2B, access ISC...",Lower secondary education completed (ISCED 2),"General/pre-vocational ISCED 2A/2B, access ISC...",Lower secondary education completed (ISCED 2),"General/pre-vocational ISCED 2A/2B, access ISC...",Upper secondary education completed (ISCED 3),"Vocational ISCED 3A, access ISCED 5B/ lower ti...",9,"ES-ISCED II, lower secondary","ES-ISCED II, lower secondary","ES-ISCED II, lower secondary","ES-ISCED IIIb, lower tier upper secondary",Not applicable,Employee,Employee,Employee,Employee,All or almost all of the time,Under 10,2,Yes,66,None or almost none of the time,None or almost none of the time,None or almost none of the time,None or almost none of the time,Agree strongly,Neither agree nor disagree,Female,,,,,Male,Female,Female,Not applicable,Not applicable,Not applicable,Not applicable,,Extremely happy,Good,4,Coping on present income,Wages or salaries,Wages or salaries,J - 1st decile,No,Not marked,Not marked,Yes,5,Allow some,Allow a few,Not like me,Like me,Like me,Somewhat like me,Not like me,Like me,Like me,Allow some,5,3,10 or more,,18,18,17,28,3,3,16,30,50.0,2015,2015,2,Not like me,Like me,Very much like me,Like me,Not like me,Like me,Like me,Like me,Like me,Like me,Not like me,Like me,Not like me,A little like me,"Cleaners and helpers in offices, hotels and ot...",No answer,Yes,Not applicable,GER,0,4,No answer,Not 
applicable,Legally married,Not applicable,66,No,Paid work,Not applicable,Yes,Face to face interview,Food and beverage service activities,1,Professional and technical occupations,Unskilled worker,Lives with husband/wife/partner at household grid,No,Not applicable,Not applicable,Marked,Marked,Hardly interested,9,8,8,More than once a week,Not applicable,3,2,NUTS level 2,Once a week,Yes,Not applicable,8,Roman Catholic,Not applicable,,,,,Husband/wife/partner,Son/daughter/step/adopted,Son/daughter/step/adopted,Not applicable,Not applicable,Not applicable,Not applicable,,,,,,Husband/wife/partner,Son/daughter/step/adopted/foster,Son/daughter/step/adopted/foster,Not applicable,Not applicable,Not applicable,Not applicable,,Legally married,Not marked,Not marked,Less than most,Once a week,No,None or almost none of the time,6,6,8,6,8,18,3,2015,A private firm,5,9,9,8,5,6,4,"More than 2 hours, up to 2,5 hours","More than 2,5 hours, up to 3 hours",Not applicable,No,Not applicable,Not marked,Not marked,Not marked,Not marked,Yes,8,30,35,40,All or almost all of the time,No,Unlimited,No,No,1968,,,,,1963,1993,1995,Not applicable,Not applicable,Not applicable,Not applicable,,AT,1
3,29593,Completely able,Safe,22,Austrian nfs,No second ancestry,No,No,No,No,Yes,No,Does not,Some of the time,No,Not marked,Not marked,66,No,Completely confident,AT12,No,Not applicable,Not applicable,Yes,66,Not marked,Not marked,Not marked,Not marked,Marked,Not marked,Not marked,Not marked,Not marked,Suburbs or outskirts of big city,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,No,Not marked,Not marked,Marked,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,No,Not marked,Not marked,Upper secondary education completed (ISCED 3),"Vocational ISCED 3A, access ISCED 5B/ lower ti...",Upper secondary education completed (ISCED 3),"Vocational ISCED 3A, access ISCED 5B/ lower ti...",Upper secondary education completed (ISCED 3),"Vocational ISCED 3A, access ISCED 5B/ lower ti...",Not applicable,Not applicable,6,"ES-ISCED IIIb, lower tier upper secondary","ES-ISCED IIIb, lower tier upper secondary","ES-ISCED IIIb, lower tier upper secondary",Not applicable,Not applicable,Employee,Not applicable,Employee,Employee,Most of the time,10 to 24,Unification already gone too far,Yes,66,None or almost none of the time,Some of the time,Some of the time,None or almost none of the time,Agree strongly,Agree strongly,Male,,,,,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,,8,Very good,1,Coping on present income,Wages or salaries,Wages or salaries,R - 2nd decile,No,Marked,Not marked,Yes,Bad for the economy,Allow none,Allow none,Like me,Like me,Like me,Very much like me,Somewhat like me,Very much like me,Very much like me,Allow none,Cultural life undermined,Worse place to live,3,,7,7,8,46,4,4,7,51,44.0,2015,2015,I have/had no influence,Very much like me,A little like me,Like me,Not like me at all,Somewhat like me,Very much like me,A little like me,Very much like me,Somewhat like me,Like me,Somewhat like me,Very much like me,Somewhat like me,Not like me at all,Motor vehicle mechanics and 
repairers,Not applicable,No,Not applicable,GER,0,Right,Yes,Paid work,Legally divorced/civil union dissolved,Legally divorced/civil union dissolved,66,No,Paid work,Not applicable,Yes,Face to face interview,Wholesale and retail trade and repair of motor...,Not applicable,Skilled worker,Service occupations,Does not,No,Not applicable,Not applicable,Marked,Not marked,Hardly interested,5,People mostly look out for themselves,5,Never,Not applicable,Not at all,Not at all,NUTS level 2,Less often,Yes,Not applicable,5,Roman Catholic,Not applicable,,,,,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,,,,,,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,,Not applicable,Not marked,Not marked,About the same,Several times a week,No,None or almost none of the time,Extremely dissatisfied,Extremely dissatisfied,Extremely bad,Extremely dissatisfied,8,7,4,2015,A private firm,No trust at all,No trust at all,No trust at all,No trust at all,No trust at all,No trust at all,No trust at all,No time at all,More than 3 hours,Not applicable,No,Not applicable,Not marked,Not marked,Not marked,Not marked,No,I have/had no influence,40,50,Not applicable,Most of the time,No,Unlimited,No,No,1993,,,,,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,,AT,0
4,4252,Not at all able,Very safe,24,Austrian nfs,No second ancestry,No,No,No,No,Yes,No,Does not,None or almost none of the time,Yes,Not marked,Not marked,66,Yes,2,AT12,No,Not applicable,Not applicable,Yes,66,Not marked,Not marked,Not marked,Not marked,Marked,Not marked,Not marked,Not marked,Not marked,Town or small city,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,No,Not marked,Not marked,Marked,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,No,Not marked,Not marked,Post-secondary non-tertiary education complete...,"Vocational ISCED 4A, access upper tier ISCED 5...",Upper secondary education completed (ISCED 3),"Vocational ISCED 3A, access ISCED 5B/ lower ti...",Upper secondary education completed (ISCED 3),"Vocational ISCED 3A, access ISCED 5B/ lower ti...",Not applicable,Not applicable,13,"ES-ISCED IV, advanced vocational, sub-degree","ES-ISCED IIIb, lower tier upper secondary","ES-ISCED IIIb, lower tier upper secondary",Not applicable,Not applicable,Employee,Not applicable,Employee,Not working,All or almost all of the time,10 to 24,8,Yes,66,None or almost none of the time,None or almost none of the time,None or almost none of the time,None or almost none of the time,Agree strongly,Agree strongly,Male,,,,,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,,8,Very good,1,Living comfortably on present income,Wages or salaries,Wages or salaries,C - 3rd decile,No,Not marked,Not marked,Yes,7,Allow some,Allow some,Not like me at all,Very much like me,Like me,Somewhat like me,A little like me,Very much like me,Not like me at all,Allow some,7,8,3,,29,29,19,20,3,3,18,25,42.0,2015,2015,6,Not like me at all,Like me,Like me,Very much like me,Somewhat like me,Like me,Like me,Very much like me,Like me,Somewhat like me,Somewhat like me,Like me,Somewhat like me,Somewhat like me,Accounting and bookkeeping clerks,Not applicable,No,Not applicable,GER,0,4,No,Not applicable,None 
of these (NEVER married or in legally reg...,None of these (NEVER married or in legally reg...,66,No,Paid work,Not applicable,Yes,Face to face interview,Activities auxiliary to financial services and...,Not applicable,Skilled worker,Not applicable,Does not,No,Not applicable,Not applicable,Marked,Not marked,Quite interested,6,8,5,Less often,Not close,5,5,NUTS level 2,Only on special holy days,Yes,Not applicable,5,Roman Catholic,Not applicable,,,,,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,,,,,,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,,Not applicable,Not marked,Not marked,About the same,Several times a week,Yes,Some of the time,9,6,8,6,8,29,3,2015,A private firm,2,6,9,7,7,7,5,"0,5 hour to 1 hour","More than 1,5 hours, up to 2 hours",Not applicable,No,Not applicable,Not marked,Not marked,Not marked,Not marked,Yes,8,38,38,Not applicable,None or almost none of the time,No,Unlimited,No,No,1991,,,,,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,,AT,1


(30080, 273)

In [169]:
train_raw.describe()

Unnamed: 0,id,v129,v132,v134,satisfied
count,30080.0,30080.0,28845.0,30080.0,30080.0
mean,19702.757015,7.743085,57.399827,2014.436336,0.519481
std,11357.684769,3.596224,20.710074,0.495939,0.499629
min,1.0,1.0,0.0,2014.0,0.0
25%,9856.75,5.0,45.0,2014.0,0.0
50%,19759.5,9.0,54.0,2014.0,1.0
75%,29515.25,11.0,67.0,2015.0,1.0
max,39324.0,12.0,772.0,2015.0,1.0


In [170]:
display(train.head())
train.shape

Unnamed: 0,id,v1,v2,v3,v4,v5,v6,v7,v8,v9,v10,v11,v12,v13,v14,v15,v16,v17,v18,v19,v20,v21,v22,v23,v24,v25,v26,v27,v28,v29,v30,v31,v32,v33,v34,v35,v36,v37,v38,v39,v40,v41,v42,v43,v44,v45,v46,v47,v48,v49,v50,v51,v52,v53,v54,v55,v56,v57,v58,v59,v60,v61,v62,v63,v64,v65,v66,v67,v68,v69,v70,v71,v72,v73,v74,v75,v76,v77,v78,v79,v80,v81,v82,v83,v84,v85,v86,v87,v88,v89,v90,v91,v92,v93,v94,v95,v96,v97,v98,v99,v100,v101,v102,v103,v104,v105,v106,v107,v108,v109,v110,v111,v112,v113,v114,v115,v116,v117,v118,v119,v120,v121,v122,v123,v124,v125,v126,v127,v128,v129,v130,v131,v132,v133,v134,v135,v136,v137,v138,v139,v140,v141,v142,v143,v144,v145,v146,v147,v148,v149,v150,v151,v152,v153,v154,v155,v156,v157,v158,v159,v160,v161,v162,v163,v164,v165,v166,v167,v168,v169,v170,v171,v172,v173,v174,v175,v176,v177,v178,v179,v180,v181,v182,v183,v184,v185,v186,v187,v188,v189,v190,v191,v192,v193,v194,v195,v196,v197,v198,v199,v200,v201,v202,v203,v204,v205,v206,v207,v208,v209,v210,v211,v212,v213,v214,v215,v216,v217,v218,v219,v220,v221,v222,v223,v224,v225,v226,v227,v228,v229,v230,v231,v232,v233,v234,v235,v236,v237,v238,v239,v240,v241,v242,v243,v244,v245,v246,v247,v248,v249,v250,v251,v252,v253,v254,v255,v256,v257,v258,v259,v260,v261,v262,v263,v264,v265,v266,v267,v268,v269,v270,cntry,satisfied
0,9948,2,2,74,11010,.a,2,2,2,2,1,1,2,2,1,0,0,66,2,2,AT33,2,2,.a,1,66,0,0,0,0,1,0,0,0,0,3,0,0,0,0,0,0,0,2,0,0,1,0,0,0,0,0,0,2,0,0,3,322,3,322,3,322,.a,.a,10,3,3,3,.a,.a,1,.a,1,1,2,2,9,1,66,2,1,3,4,3,3,2,,,,,.a,.a,.a,.a,.a,.a,.a,,5,2,1,2,3,4,2,2,0,0,1,0,3,3,3,3,2,3,6,3,2,2,2,0,0,,1,1,12,2,4,4,9,25,107.0,2015,2015,4,5,3,4,1,5,3,3,2,2,3,5,2,6,5,3341,.a,1,.a,GER,0,3,2,.a,5,5,66,2,6,.a,1,1,23,20,4,4,2,2,1,1993,0,0,3,10,10,8,1,2,2,2,2,7,1,.a,1,1,.a,,,,,.a,.a,.a,.a,.a,.a,.a,,,,,,.a,.a,.a,.a,.a,.a,.a,,.a,1,0,2,4,2,1,3,.b,5,1,9,1,4,2015,4,0,5,10,0,2,0,3,0,1,.a,2,.a,0,0,0,0,1,8,40,40,.a,2,2,1,2,2,1941,,,,,.a,.a,.a,.a,.a,.a,.a,,AT,0
1,25601,4,2,58,11010,.a,2,2,2,2,1,2,2,2,2,0,0,66,2,2,AT31,2,2,.a,1,66,0,0,0,0,1,0,0,0,0,4,0,0,0,0,0,0,0,2,0,0,1,0,0,0,0,0,0,.a,0,0,3,322,2,212,2,212,.a,.a,12,3,2,2,.a,.a,1,.a,1,1,2,2,5,1,66,2,1,1,1,5,3,1,,,,,.a,.a,.a,.a,.a,.a,.a,,4,3,1,2,4,5,2,3,0,0,1,6,3,2,4,3,3,4,4,4,4,1,5,4,2,,17,17,19,15,1,1,17,46,75.0,2015,2015,3,4,3,3,1,1,4,2,2,3,2,2,3,3,1,7132,.a,2,.a,GER,0,5,.b,.a,6,6,66,3,3,.a,1,1,25,.a,8,8,2,2,1,2014,0,0,4,4,7,3,6,.a,4,5,2,5,1,.a,1,1,.a,,,,,.a,.a,.a,.a,.a,.a,.a,,,,,,.a,.a,.a,.a,.a,.a,.a,,.a,0,0,3,4,2,1,5,3,8,5,8,17,1,2015,4,.b,7,7,.b,5,5,.b,1,3,2,1,1,1,0,0,0,1,3,39,39,.a,2,2,2,2,2,1957,,,,,.a,.a,.a,.a,.a,.a,.a,,AT,0
2,8592,6,2,47,11010,11010,2,2,1,2,1,.a,1,1,2,0,0,66,2,3,AT33,2,.a,.a,1,66,0,0,0,0,0,0,0,0,0,4,0,0,0,0,0,0,0,2,0,0,1,0,0,0,0,0,0,2,0,0,2,212,2,212,2,212,3,322,9,2,2,2,3,.a,1,1,1,1,4,1,2,1,66,1,1,1,1,1,3,2,,,,,1,2,2,.a,.a,.a,.a,,10,2,4,2,1,1,1,3,0,0,1,5,2,3,5,2,2,3,5,2,2,2,5,3,6,,18,18,17,28,3,3,16,30,50.0,2015,2015,2,5,2,1,2,5,2,2,2,2,2,5,2,5,4,9112,.d,1,.a,GER,0,4,.d,.a,1,.a,66,3,1,.a,1,1,56,1,1,8,1,2,.a,.a,1,1,3,9,8,8,2,.a,3,2,2,3,1,.a,8,1,.a,,,,,1,2,2,.a,.a,.a,.a,,,,,,1,2,2,.a,.a,.a,.a,,1,0,0,2,5,2,1,6,6,8,6,8,18,3,2015,4,5,9,9,8,5,6,4,5,6,.a,2,.a,0,0,0,0,1,8,30,35,40,4,2,1,2,2,1968,,,,,1963,1993,1995,.a,.a,.a,.a,,AT,1
3,29593,10,2,22,11010,.a,2,2,2,2,1,2,2,2,2,0,0,66,2,10,AT12,2,.a,.a,1,66,0,0,0,0,1,0,0,0,0,2,0,0,0,0,0,0,0,2,0,0,1,0,0,0,0,0,0,2,0,0,3,322,3,322,3,322,.a,.a,6,3,3,3,.a,.a,1,.a,1,1,3,2,0,1,66,1,2,2,1,1,1,1,,,,,.a,.a,.a,.a,.a,.a,.a,,8,1,1,2,1,1,2,3,1,0,1,0,4,4,2,2,2,1,3,1,1,4,0,0,3,,7,7,8,46,4,4,7,51,44.0,2015,2015,0,1,4,2,6,3,1,4,1,3,2,3,1,3,6,7231,.a,2,.a,GER,0,10,1,1,4,4,66,3,1,.a,1,1,45,.a,6,5,2,2,.a,.a,1,0,3,5,0,5,7,.a,0,0,2,6,1,.a,5,1,.a,,,,,.a,.a,.a,.a,.a,.a,.a,,,,,,.a,.a,.a,.a,.a,.a,.a,,.a,0,0,3,6,2,1,0,0,0,0,8,7,4,2015,4,0,0,0,0,0,0,0,0,7,.a,2,.a,0,0,0,0,2,0,40,50,.a,3,2,1,2,2,1993,,,,,.a,.a,.a,.a,.a,.a,.a,,AT,0
4,4252,0,1,24,11010,.a,2,2,2,2,1,2,2,1,1,0,0,66,1,2,AT12,2,.a,.a,1,66,0,0,0,0,1,0,0,0,0,3,0,0,0,0,0,0,0,2,0,0,1,0,0,0,0,0,0,2,0,0,4,423,3,322,3,322,.a,.a,13,5,3,3,.a,.a,1,.a,1,3,4,2,8,1,66,1,1,1,1,1,1,1,,,,,.a,.a,.a,.a,.a,.a,.a,,8,1,1,1,1,1,3,3,0,0,1,7,2,2,6,1,2,3,4,1,6,2,7,8,3,,29,29,19,20,3,3,18,25,42.0,2015,2015,6,6,2,2,1,3,2,2,1,2,3,3,2,3,3,4311,.a,2,.a,GER,0,4,2,.a,6,6,66,3,1,.a,1,1,66,.a,6,.a,2,2,.a,.a,1,0,2,6,8,5,6,3,5,5,2,5,1,.a,5,1,.a,,,,,.a,.a,.a,.a,.a,.a,.a,,,,,,.a,.a,.a,.a,.a,.a,.a,,.a,0,0,3,6,1,2,9,6,8,6,8,29,3,2015,4,2,6,9,7,7,7,5,2,4,.a,2,.a,0,0,0,0,1,8,38,38,.a,1,2,1,2,2,1991,,,,,.a,.a,.a,.a,.a,.a,.a,,AT,1


(30080, 273)

In [171]:
train.describe()

Unnamed: 0,id,v15,v16,v26,v27,v28,v29,v30,v31,v32,v33,v34,v36,v37,v38,v39,v40,v41,v42,v44,v45,v46,v47,v48,v49,v50,v51,v52,v54,v55,v106,v107,v123,v129,v132,v134,v175,v176,v185,v217,v218,v244,v245,v246,v247,satisfied
count,30080.0,30080.0,30080.0,30080.0,30080.0,30080.0,30080.0,30080.0,30080.0,30080.0,30080.0,30080.0,30080.0,30080.0,30080.0,30080.0,30080.0,30080.0,30080.0,30080.0,30080.0,30080.0,30080.0,30080.0,30080.0,30080.0,30080.0,30080.0,30080.0,30080.0,30080.0,30080.0,2388.0,30080.0,28845.0,30080.0,30080.0,30080.0,30080.0,30080.0,30080.0,30080.0,30080.0,30080.0,30080.0,30080.0
mean,19702.757015,0.001895,0.000332,0.000465,0.001363,0.000831,0.003424,0.408477,0.014761,0.005652,0.000831,0.001629,0.034874,0.013863,0.013497,0.000898,0.007447,0.010273,0.010505,0.012234,0.000133,0.91639,0.020445,0.018384,0.01373,0.000233,0.016888,0.005751,0.099701,0.012467,0.142121,0.084309,1.675461,7.743085,57.399827,2014.436336,0.522939,0.357513,2.402926,0.271243,0.147141,0.042553,0.019049,0.019847,0.008045,0.519481
std,11357.684769,0.04349,0.01823,0.021569,0.036895,0.028818,0.058417,0.49156,0.120595,0.074966,0.028818,0.040329,0.183463,0.116924,0.115393,0.029947,0.085974,0.100834,0.101957,0.109931,0.011531,0.276807,0.141521,0.134339,0.11637,0.015253,0.128855,0.07562,0.299606,0.110958,0.34918,0.277855,0.468301,3.596224,20.710074,0.495939,0.499482,0.479276,0.806457,0.444609,0.354252,0.201851,0.1367,0.139477,0.089335,0.499629
min,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,2014.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,9856.75,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,5.0,45.0,2014.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,19759.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,9.0,54.0,2014.0,1.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
75%,29515.25,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,11.0,67.0,2015.0,1.0,1.0,3.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0
max,39324.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,12.0,772.0,2015.0,1.0,1.0,4.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


Many numeric summaries are missing because pandas reads those columns as object (string) dtype: train_raw holds text labels, and train mixes numeric codes with special codes such as .a.

In [172]:
# Proper type names
codebook = codebook.replace({'double': 'float', 'string': 'str'})

# Convert to dict
dtype_dict = dict(zip(codebook.Variable,codebook.Type_codebook_long))

We can't convert to float yet, since the .a, .b, .c values will throw an error. The definitions of these codes are not consistent across survey questions: .a corresponds to "Refusal" for some questions, while .b corresponds to "Refusal" for others.
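
As a rough illustration of the conversion problem, the sketch below uses a small made-up frame rather than the real data. `pd.to_numeric` raises a `ValueError` on a string such as ".a", so the dot codes are masked to NaN first. The helper name `to_numeric_with_dot_codes` and the `numeric_cols` argument are hypothetical; in the notebook the list of numeric columns could presumably be derived from `dtype_dict`. Note that masking discards the distinction between the individual codes, which matters because their meaning varies by question.

```python
import pandas as pd

# Toy stand-in for a few survey columns (values are made up for illustration).
toy = pd.DataFrame({
    "v1": ["2", "4", ".a", "10"],
    "v3": ["74", ".b", "47", "22"],
    "cntry": ["AT", "AT", "BE", "BE"],
})

# Special-missing codes that appear in train.csv.
DOT_CODES = [".", ".a", ".b", ".c", ".d"]

def to_numeric_with_dot_codes(df, numeric_cols):
    """Return a copy where dot codes in numeric_cols become NaN and the rest float."""
    out = df.copy()
    for col in numeric_cols:
        # Replace dot codes with NaN, then parse the remaining strings as numbers.
        cleaned = out[col].mask(out[col].isin(DOT_CODES))
        out[col] = pd.to_numeric(cleaned).astype(float)
    return out

converted = to_numeric_with_dot_codes(toy, ["v1", "v3"])
print(converted.dtypes)
```

String columns such as cntry are left untouched because they are not passed in `numeric_cols`.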

### Analysis of Missing Observations

In [184]:
percent_missing = train.isnull().sum() * 100 / len(train)
missing_value_df = pd.DataFrame({'column_name': train.columns,
                                 'percent_missing': percent_missing})

# Columns where % missing not equal 0 
missing_value_df[missing_value_df.percent_missing > 0]

Unnamed: 0,column_name,percent_missing
v58,v58,4.428191
v59,v59,4.428191
v60,v60,4.428191
v61,v61,4.428191
v62,v62,4.428191
v63,v63,4.428191
v66,v66,4.428191
v67,v67,4.428191
v68,v68,4.428191
v86,v86,38.075133


Note that these percentages do not count the .a, .b, .c and .d codes as missing. A second pass that treats them as missing too:

In [185]:
# Collapse every dot code into NaN
train_nodots = train.replace([".", ".a", ".b", ".c", ".d"], np.nan)
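
Because the replacement above collapses all dot codes into a single NaN, it can be useful to first see how much each code contributes per column. The following is a minimal sketch on made-up data; the frame `toy` and the name `code_share` are illustrative only.

```python
import pandas as pd

# Made-up values; the real columns would come from train.
toy = pd.DataFrame({
    "v5": [".a", ".a", "11010", ".b"],
    "v11": ["1", ".a", ".d", "2"],
})
DOT_CODES = [".", ".a", ".b", ".c", ".d"]

# Percentage of rows holding each dot code, one row per survey column.
code_share = pd.DataFrame(
    {code: toy.eq(code).mean() * 100 for code in DOT_CODES}
)
# Share of rows holding any dot code at all.
code_share["any_dot_code"] = toy.isin(DOT_CODES).mean() * 100
print(code_share)
```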

In [186]:
percent_missing_nodots = train_nodots.isnull().sum() * 100 / len(train_nodots)
missing_value_df_nodots = pd.DataFrame({'column_name': train_nodots.columns,
                                 'percent_missing': percent_missing_nodots})

# Columns where % missing not equal 0 
missing_value_df_nodots[missing_value_df_nodots.percent_missing > 0]

Unnamed: 0,column_name,percent_missing
v1,v1,1.908245
v2,v2,1.053856
v3,v3,0.239362
v4,v4,0.668218
v5,v5,70.671543
v6,v6,0.601729
v7,v7,0.236037
v8,v8,0.478723
v9,v9,1.389628
v10,v10,0.023271


In [187]:
percent = 30
cols_missing = missing_value_df_nodots[missing_value_df_nodots.percent_missing > percent]
n_cols_missing = len(cols_missing)

In [188]:
print(f"There are {n_cols_missing} features with over {percent}% missing.")

There are 71 features with over 30% missing.
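
The missing-value table is built twice above with the same steps, so the logic could be wrapped in a small helper. This is only a sketch under the assumption that a strict "greater than" threshold is wanted, as in the cells above; `missing_summary` is a hypothetical name and the data are made up.

```python
import numpy as np
import pandas as pd

def missing_summary(df, min_percent=0.0):
    """Percent of NaN per column, keeping columns strictly above min_percent."""
    pct = df.isna().mean() * 100
    summary = pd.DataFrame(
        {"column_name": pct.index, "percent_missing": pct.values}
    )
    return summary[summary["percent_missing"] > min_percent].reset_index(drop=True)

# Made-up frame: "a" is 75% missing, "b" is 25% missing, "c" is complete.
toy = pd.DataFrame({
    "a": [1, np.nan, np.nan, np.nan],
    "b": [1, 2, 3, np.nan],
    "c": [1, 2, 3, 4],
})
high = missing_summary(toy, min_percent=30)
print(high)
```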


In [189]:
# Attach short desc for more context
codebook_labels = ['Variable', "Label"]
cols_missing.merge(codebook[codebook_labels], left_on = 'column_name', right_on = "Variable", how = "left")

Unnamed: 0,column_name,percent_missing,Variable,Label
0,v5,70.671543,v5,"Second ancestry, European Stan..."
1,v11,35.954122,v11,Ever had children living in ho...
2,v22,52.446809,v22,Control paid work last 7 days
3,v23,77.140957,v23,"Partner, control paid work las..."
4,v62,44.56117,v62,Partner's highest level of edu...
5,v63,44.56117,v63,Partner's highest level of edu...
6,v68,44.56117,v68,Partner's highest level of edu...
7,v69,89.704122,v69,Number of employees respondent...
8,v71,63.710106,v71,Partner's employment relation
9,v86,99.883644,v86,Gender of tenth person in hous...


Many of these have to do with characteristics of the third or later person in the household; most households likely have only one or two members.

Let's drop these columns for the remainder of the analysis:

In [192]:
# train_nodots already has the dot codes replaced by NaN
drop_missing = cols_missing.column_name.tolist()
train_v2 = train_nodots.drop(drop_missing, axis = 1)

### Categorical/Numeric Variables

Most of the features in this data are categorical (not continuous), but many have been preprocessed and converted to double.

Let's deal with the NaN values (all dot codes were collapsed into NaN above). Since 0 is already a valid response code, missing values will be coded as -1:

In [193]:
train_v3 = train_v2.replace([np.nan], [-1])
train_v3.head(5)

Unnamed: 0,id,v1,v2,v3,v4,v6,v7,v8,v9,v10,v12,v13,v14,v15,v16,v17,v18,v19,v20,v21,v24,v25,v26,v27,v28,v29,v30,v31,v32,v33,v34,v35,v36,v37,v38,v39,v40,v41,v42,v43,v44,v45,v46,v47,v48,v49,v50,v51,v52,v53,v54,v55,v56,v57,v58,v59,v60,v61,v64,v65,v66,v67,v70,v72,v73,v74,v75,v76,v77,v78,v79,v80,v81,v82,v83,v84,v85,v90,v98,v99,v100,v101,v102,v103,v104,v105,v106,v107,v108,v109,v110,v111,v112,v113,v114,v115,v116,v117,v118,v119,v120,v121,v122,v124,v125,v126,v127,v128,v129,v130,v131,v132,v133,v134,v135,v136,v137,v138,v139,v140,v141,v142,v143,v144,v145,v146,v147,v148,v149,v150,v152,v154,v155,v156,v157,v159,v161,v162,v163,v165,v166,v167,v169,v171,v172,v175,v176,v177,v178,v179,v180,v181,v183,v184,v185,v186,v187,v189,v196,v208,v217,v218,v219,v220,v221,v222,v223,v224,v225,v226,v227,v228,v229,v230,v231,v232,v233,v234,v235,v236,v237,v238,v239,v240,v242,v244,v245,v246,v247,v248,v249,v250,v251,v253,v254,v255,v256,v257,v258,v263,cntry,satisfied
0,9948,2,2,74,11010,2,2,2,2,1,2,2,1,0,0,66,2,2,AT33,2,1,66,0,0,0,0,1,0,0,0,0,3,0,0,0,0,0,0,0,2,0,0,1,0,0,0,0,0,0,2,0,0,3,322,3,322,3,322,10,3,3,3,1,1,1,2,2,9,1,66,2,1,3,4,3,3,2,-1,5,2,1,2,3,4,2,2,0,0,1,0,3,3,3,3,2,3,6,3,2,2,2,0,0,1,1,12,2,4,4,9,25,107.0,2015,2015,4,5,3,4,1,5,3,3,2,2,3,5,2,6,5,3341,1,GER,0,3,2,5,66,2,6,1,1,23,4,2,2,0,0,3,10,10,8,1,2,2,2,7,1,1,-1,-1,1,0,2,4,2,1,3,-1,5,1,9,1,4,2015,4,0,5,10,0,2,0,3,0,1,2,0,0,0,0,1,8,40,40,2,2,1,2,2,1941,-1,AT,0
1,25601,4,2,58,11010,2,2,2,2,1,2,2,2,0,0,66,2,2,AT31,2,1,66,0,0,0,0,1,0,0,0,0,4,0,0,0,0,0,0,0,2,0,0,1,0,0,0,0,0,0,-1,0,0,3,322,2,212,2,212,12,3,2,2,1,1,1,2,2,5,1,66,2,1,1,1,5,3,1,-1,4,3,1,2,4,5,2,3,0,0,1,6,3,2,4,3,3,4,4,4,4,1,5,4,2,17,17,19,15,1,1,17,46,75.0,2015,2015,3,4,3,3,1,1,4,2,2,3,2,2,3,3,1,7132,2,GER,0,5,-1,6,66,3,3,1,1,25,8,2,2,0,0,4,4,7,3,6,4,5,2,5,1,1,-1,-1,0,0,3,4,2,1,5,3,8,5,8,17,1,2015,4,-1,7,7,-1,5,5,-1,1,3,1,1,0,0,0,1,3,39,39,2,2,2,2,2,1957,-1,AT,0
2,8592,6,2,47,11010,2,2,1,2,1,1,1,2,0,0,66,2,3,AT33,2,1,66,0,0,0,0,0,0,0,0,0,4,0,0,0,0,0,0,0,2,0,0,1,0,0,0,0,0,0,2,0,0,2,212,2,212,2,212,9,2,2,2,1,1,1,4,1,2,1,66,1,1,1,1,1,3,2,1,10,2,4,2,1,1,1,3,0,0,1,5,2,3,5,2,2,3,5,2,2,2,5,3,6,18,18,17,28,3,3,16,30,50.0,2015,2015,2,5,2,1,2,5,2,2,2,2,2,5,2,5,4,9112,1,GER,0,4,-1,1,66,3,1,1,1,56,1,1,2,1,1,3,9,8,8,2,3,2,2,3,1,8,1,1,0,0,2,5,2,1,6,6,8,6,8,18,3,2015,4,5,9,9,8,5,6,4,5,6,2,0,0,0,0,1,8,30,35,4,2,1,2,2,1968,1963,AT,1
3,29593,10,2,22,11010,2,2,2,2,1,2,2,2,0,0,66,2,10,AT12,2,1,66,0,0,0,0,1,0,0,0,0,2,0,0,0,0,0,0,0,2,0,0,1,0,0,0,0,0,0,2,0,0,3,322,3,322,3,322,6,3,3,3,1,1,1,3,2,0,1,66,1,2,2,1,1,1,1,-1,8,1,1,2,1,1,2,3,1,0,1,0,4,4,2,2,2,1,3,1,1,4,0,0,3,7,7,8,46,4,4,7,51,44.0,2015,2015,0,1,4,2,6,3,1,4,1,3,2,3,1,3,6,7231,2,GER,0,10,1,4,66,3,1,1,1,45,6,2,2,1,0,3,5,0,5,7,0,0,2,6,1,5,-1,-1,0,0,3,6,2,1,0,0,0,0,8,7,4,2015,4,0,0,0,0,0,0,0,0,7,2,0,0,0,0,2,0,40,50,3,2,1,2,2,1993,-1,AT,0
4,4252,0,1,24,11010,2,2,2,2,1,2,1,1,0,0,66,1,2,AT12,2,1,66,0,0,0,0,1,0,0,0,0,3,0,0,0,0,0,0,0,2,0,0,1,0,0,0,0,0,0,2,0,0,4,423,3,322,3,322,13,5,3,3,1,1,3,4,2,8,1,66,1,1,1,1,1,1,1,-1,8,1,1,1,1,1,3,3,0,0,1,7,2,2,6,1,2,3,4,1,6,2,7,8,3,29,29,19,20,3,3,18,25,42.0,2015,2015,6,6,2,2,1,3,2,2,1,2,3,3,2,3,3,4311,2,GER,0,4,2,6,66,3,1,1,1,66,6,2,2,1,0,2,6,8,5,6,5,5,2,5,1,5,-1,-1,0,0,3,6,1,2,9,6,8,6,8,29,3,2015,4,2,6,9,7,7,7,5,2,4,2,0,0,0,0,1,8,38,38,1,2,1,2,2,1991,-1,AT,1
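
Coding missing values as -1 is only safe if -1 is not already a legitimate value somewhere. A quick check along the following lines could confirm that before filling. This is a sketch on made-up data; `SENTINEL` and `clashes` are illustrative names, and `fillna` is used here in place of the `replace` call above.

```python
import numpy as np
import pandas as pd

# Made-up frame mixing a string-coded column and a float column.
toy = pd.DataFrame({
    "v1": ["2", np.nan, "0", "10"],
    "v132": [107.0, 75.0, np.nan, 44.0],
})

SENTINEL = -1

# Columns where the sentinel already occurs, either as a number or as the string "-1".
clashes = [
    col for col in toy.columns
    if toy[col].isin([SENTINEL, str(SENTINEL)]).any()
]
print("Columns already containing the sentinel:", clashes)

# Only fill when no column uses the sentinel as a real value.
if not clashes:
    filled = toy.fillna(SENTINEL)
```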


Most of the variables have already been encoded during preprocessing. According to codebook_long, the string variables are v17, v20, v25, v78, v154, v155, v161 and cntry; these are all country/region codes or language codes.

In [201]:
cat_cols = ["v17", "v20", "v25", "v78", "v154", "v155", "v161", "cntry"]
cat_df = train_v3[cat_cols]

Checking unique values and counts of these features:

In [223]:
# v17 is mostly "66" - is this a "missing" indicator? There is one "6", likely a miscoded "66"; get its index to drop later
cat_df.index[cat_df['v17'] == "6"].tolist()

[4483]

In [225]:
# v20 looks fine

In [226]:
# v25 mostly 66. One 6
cat_df.index[cat_df['v25'] == "6"].tolist()

[4483]

In [231]:
# v78 is mostly 66, some stray numbers 2,4,3,77, one 6
cat_df.index[cat_df['v78'] == "6"].tolist()
# v154 looks fine 
# v155 mostly 0 or 999 - "missing" indicators?

[4483]

In [234]:
# v161 is mostly 66, some stray numbers
# cntry looks fine
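
The observations in the comments above can be reproduced with `value_counts`. The sketch below runs on a made-up frame standing in for `cat_df`; flagging values that occur exactly once is just one possible heuristic for spotting miscoded entries, not a rule taken from the codebook.

```python
import pandas as pd

# Made-up stand-in for cat_df.
toy_cat = pd.DataFrame({
    "v17": ["66", "66", "6", "66"],
    "cntry": ["AT", "AT", "BE", "CZ"],
})

# One frequency table per column (NaN counted too, if present).
counts = {col: toy_cat[col].value_counts(dropna=False) for col in toy_cat.columns}

# Values that occur exactly once are candidates for a manual look.
singletons = {col: vc[vc == 1].index.tolist() for col, vc in counts.items()}

for col, vc in counts.items():
    print(f"{col}: {len(vc)} unique values, singletons: {singletons[col]}")
```

On real country columns singletons may be perfectly valid, so the flagged values still need to be checked against the codebook.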

In [239]:
# Index 4483 contains many missing/miscoded entries - drop
train_v4 = train_v3.drop(4483)