## Predicting Life Satisfaction
### Exploratory Data Analysis
#### Objectives:
- Analysis of missing observations
- Analysis of categorical/numeric features
- Correlation analysis
- One-way and two-way variable plots (histograms, scatterplots, etc.)


### From Kaggle:
  
#### File descriptions  
* train.csv - the training set with some preprocessing of values.  
* train_raw.csv - the training set with original responses (no preprocessing).  
  
#### Data fields
* v1-v270 - survey response fields
* cntry - survey respondent country
* satisfied - whether (1) or not (0) the survey respondent is 'very satisfied' with their life (training set only)

In [1]:
# Import packages
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sns

# Set options
pd.options.display.max_rows = None
pd.options.display.max_columns = None
pd.options.display.max_colwidth = None  # None means no truncation; -1 is deprecated since pandas 1.0

In [2]:
# Import data
train_raw = pd.read_csv("../01-data/train_raw.csv", low_memory = False)
train = pd.read_csv("../01-data/train.csv", low_memory = False)

# Custom data
codebook = pd.read_csv("../01-data/codebook_compact.csv", low_memory = False) # original codebook plus dtypes from codebook_long

In [3]:
# Replace blanks (NaN) with '.' so that groupby counts them as a response instead of dropping them
train_no_blanks = train.fillna('.')
train_raw_no_blanks = train_raw.fillna('.')

In [4]:
codebook

Unnamed: 0,Variable,Obs,Unique,Mean,Min,Max,Label,Type_codebook_long
0,v1,38612,11,3.629442,0,10,Able to take active role in po...,double
1,v2,38929,4,1.932749,1,4,Feeling of safety of walking a...,double
2,v3,39228,90,49.30302,14,114,"Age of respondent, calculated",double
3,v4,39075,184,15484.9,10000,444444,"First ancestry, European Stand...",double
4,v5,12117,174,22455.79,10000,444444,"Second ancestry, European Stan...",double
5,v6,39087,2,1.689104,1,2,Improve knowledge/skills: cour...,double
6,v7,39237,2,1.913347,1,2,Worn or displayed campaign bad...,double
7,v8,39145,2,1.805211,1,2,Boycotted certain products las...,double
8,v9,38810,2,1.935326,1,2,Belong to minority ethnic grou...,double
9,v10,39315,2,1.112959,1,2,Born in country,double


In [5]:
display(train_raw.head())
train_raw.shape

Unnamed: 0,id,v1,v2,v3,v4,v5,v6,v7,v8,v9,v10,v11,v12,v13,v14,v15,v16,v17,v18,v19,v20,v21,v22,v23,v24,v25,v26,v27,v28,v29,v30,v31,v32,v33,v34,v35,v36,v37,v38,v39,v40,v41,v42,v43,v44,v45,v46,v47,v48,v49,v50,v51,v52,v53,v54,v55,v56,v57,v58,v59,v60,v61,v62,v63,v64,v65,v66,v67,v68,v69,v70,v71,v72,v73,v74,v75,v76,v77,v78,v79,v80,v81,v82,v83,v84,v85,v86,v87,v88,v89,v90,v91,v92,v93,v94,v95,v96,v97,v98,v99,v100,v101,v102,v103,v104,v105,v106,v107,v108,v109,v110,v111,v112,v113,v114,v115,v116,v117,v118,v119,v120,v121,v122,v123,v124,v125,v126,v127,v128,v129,v130,v131,v132,v133,v134,v135,v136,v137,v138,v139,v140,v141,v142,v143,v144,v145,v146,v147,v148,v149,v150,v151,v152,v153,v154,v155,v156,v157,v158,v159,v160,v161,v162,v163,v164,v165,v166,v167,v168,v169,v170,v171,v172,v173,v174,v175,v176,v177,v178,v179,v180,v181,v182,v183,v184,v185,v186,v187,v188,v189,v190,v191,v192,v193,v194,v195,v196,v197,v198,v199,v200,v201,v202,v203,v204,v205,v206,v207,v208,v209,v210,v211,v212,v213,v214,v215,v216,v217,v218,v219,v220,v221,v222,v223,v224,v225,v226,v227,v228,v229,v230,v231,v232,v233,v234,v235,v236,v237,v238,v239,v240,v241,v242,v243,v244,v245,v246,v247,v248,v249,v250,v251,v252,v253,v254,v255,v256,v257,v258,v259,v260,v261,v262,v263,v264,v265,v266,v267,v268,v269,v270,cntry,satisfied
0,9948,2,Safe,74,Austrian nfs,No second ancestry,No,No,No,No,Yes,Yes,Does not,Some of the time,Yes,Not marked,Not marked,66,No,2,AT33,No,No,Not applicable,Yes,66,Not marked,Not marked,Not marked,Not marked,Marked,Not marked,Not marked,Not marked,Not marked,Town or small city,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,No,Not marked,Not marked,Marked,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,No,Not marked,Not marked,Upper secondary education completed (ISCED 3),"Vocational ISCED 3A, access ISCED 5B/ lower tier 5A",Upper secondary education completed (ISCED 3),"Vocational ISCED 3A, access ISCED 5B/ lower tier 5A",Upper secondary education completed (ISCED 3),"Vocational ISCED 3A, access ISCED 5B/ lower tier 5A",Not applicable,Not applicable,10,"ES-ISCED IIIb, lower tier upper secondary","ES-ISCED IIIb, lower tier upper secondary","ES-ISCED IIIb, lower tier upper secondary",Not applicable,Not applicable,Employee,Not applicable,Employee,Employee,Some of the time,10 to 24,9,Yes,66,Some of the time,None or almost none of the time,Most of the time,All or almost all of the time,Neither agree nor disagree,Neither agree nor disagree,Female,,,,,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,,5,Good,1,Coping on present income,Pensions,Pensions,R - 2nd decile,Yes to some extent,Not marked,Not marked,Yes,Bad for the economy,Allow a few,Allow a few,Somewhat like me,Somewhat like me,Like me,Somewhat like me,Not like me at all,Somewhat like me,Like me,Allow some,2,Worse place to live,,,1,1,12,2,4,4,9,25,107.0,2015,2015,4,Not like me,Somewhat like me,A little like me,Very much like me,Not like me,Somewhat like me,Somewhat like me,Like me,Like me,Somewhat like me,Not like me,Like me,Not like me at all,Not like me,Office supervisors,Not applicable,Yes,Not applicable,GER,0,3,No,Not applicable,Widowed/civil partner died,Widowed/civil partner died,66,"Yes, 
previously",Retired,Not applicable,Yes,Face to face interview,Manufacture of other non-metallic mineral products,20,Sales occupations,Sales occupations,Does not,No,Yes,1993,Not marked,Not marked,Hardly interested,Most people try to be fair,People mostly try to be helpful,8,Every day,Quite close,2,2,NUTS level 2,Never,Yes,Not applicable,1,Roman Catholic,Not applicable,,,,,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,,,,,,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,,Not applicable,Marked,Not marked,Less than most,Several times a month,No,None or almost none of the time,3,Don't know,5,1,9,1,4,2015,A private firm,No trust at all,5,Complete trust,No trust at all,2,No trust at all,3,No time at all,"Less than 0,5 hour",Not applicable,No,Not applicable,Not marked,Not marked,Not marked,Not marked,Yes,8,40,40,Not applicable,Some of the time,No,Unlimited,No,No,1941,,,,,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,,AT,0
1,25601,4,Safe,58,Austrian nfs,No second ancestry,No,No,No,No,Yes,No,Does not,Some of the time,No,Not marked,Not marked,66,No,2,AT31,No,No,Not applicable,Yes,66,Not marked,Not marked,Not marked,Not marked,Marked,Not marked,Not marked,Not marked,Not marked,Country village,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,No,Not marked,Not marked,Marked,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,Refusal,Not marked,Not marked,Upper secondary education completed (ISCED 3),"Vocational ISCED 3A, access ISCED 5B/ lower tier 5A",Lower secondary education completed (ISCED 2),"General/pre-vocational ISCED 2A/2B, access ISCED 3 vocational",Lower secondary education completed (ISCED 2),"General/pre-vocational ISCED 2A/2B, access ISCED 3 vocational",Not applicable,Not applicable,12,"ES-ISCED IIIb, lower tier upper secondary","ES-ISCED II, lower secondary","ES-ISCED II, lower secondary",Not applicable,Not applicable,Employee,Not applicable,Employee,Employee,Some of the time,10 to 24,5,Yes,66,Some of the time,None or almost none of the time,None or almost none of the time,None or almost none of the time,Disagree strongly,Neither agree nor disagree,Male,,,,,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,,4,Fair,1,Coping on present income,Unemployment/redundancy benefit,Unemployment/redundancy benefit,R - 2nd decile,No,Not marked,Not marked,Yes,6,Allow a few,Allow some,A little like me,Somewhat like me,Somewhat like me,A little like me,A little like me,A little like me,A little like me,Allow many to come and live here,5,4,2,,17,17,19,15,1,1,17,46,75.0,2015,2015,3,A little like me,Somewhat like me,Somewhat like me,Very much like me,Very much like me,A little like me,Like me,Like me,Somewhat like me,Like me,Like me,Somewhat like me,Somewhat like me,Very much like me,Spray painters and varnishers,Not applicable,No,Not applicable,GER,0,5,Refusal,Not applicable,None of these 
(NEVER married or in legally registered civil union),None of these (NEVER married or in legally registered civil union),66,No,"Unemployed, looking for job",Not applicable,Yes,Face to face interview,"Manufacture of fabricated metal products, except machinery and equipment",Not applicable,Unskilled worker,Unskilled worker,Does not,No,Yes,2014,Not marked,Not marked,Not at all interested,4,7,3,Less often,Not applicable,4,5,NUTS level 2,Only on special holy days,Yes,Not applicable,1,Roman Catholic,Not applicable,,,,,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,,,,,,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,,Not applicable,Not marked,Not marked,About the same,Several times a month,No,None or almost none of the time,5,3,8,5,8,17,1,2015,A private firm,Don't know,7,7,Don't know,5,5,Don't know,"Less than 0,5 hour","More than 1 hour, up to 1,5 hours",No,Yes,Yes,Marked,Not marked,Not marked,Not marked,Yes,3,39,39,Not applicable,Some of the time,No,Limited,No,No,1957,,,,,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,,AT,0
2,8592,6,Safe,47,Austrian nfs,Austrian nfs,No,No,Yes,No,Yes,Not applicable,Respondent lives with children at household grid,None or almost none of the time,No,Not marked,Not marked,66,No,3,AT33,No,Not applicable,Not applicable,Yes,66,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,Country village,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,No,Not marked,Not marked,Marked,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,No,Not marked,Not marked,Lower secondary education completed (ISCED 2),"General/pre-vocational ISCED 2A/2B, access ISCED 3 vocational",Lower secondary education completed (ISCED 2),"General/pre-vocational ISCED 2A/2B, access ISCED 3 vocational",Lower secondary education completed (ISCED 2),"General/pre-vocational ISCED 2A/2B, access ISCED 3 vocational",Upper secondary education completed (ISCED 3),"Vocational ISCED 3A, access ISCED 5B/ lower tier 5A",9,"ES-ISCED II, lower secondary","ES-ISCED II, lower secondary","ES-ISCED II, lower secondary","ES-ISCED IIIb, lower tier upper secondary",Not applicable,Employee,Employee,Employee,Employee,All or almost all of the time,Under 10,2,Yes,66,None or almost none of the time,None or almost none of the time,None or almost none of the time,None or almost none of the time,Agree strongly,Neither agree nor disagree,Female,,,,,Male,Female,Female,Not applicable,Not applicable,Not applicable,Not applicable,,Extremely happy,Good,4,Coping on present income,Wages or salaries,Wages or salaries,J - 1st decile,No,Not marked,Not marked,Yes,5,Allow some,Allow a few,Not like me,Like me,Like me,Somewhat like me,Not like me,Like me,Like me,Allow some,5,3,10 or more,,18,18,17,28,3,3,16,30,50.0,2015,2015,2,Not like me,Like me,Very much like me,Like me,Not like me,Like me,Like me,Like me,Like me,Like me,Not like me,Like me,Not like me,A little like me,"Cleaners and helpers in offices, hotels and other establishments",No 
answer,Yes,Not applicable,GER,0,4,No answer,Not applicable,Legally married,Not applicable,66,No,Paid work,Not applicable,Yes,Face to face interview,Food and beverage service activities,1,Professional and technical occupations,Unskilled worker,Lives with husband/wife/partner at household grid,No,Not applicable,Not applicable,Marked,Marked,Hardly interested,9,8,8,More than once a week,Not applicable,3,2,NUTS level 2,Once a week,Yes,Not applicable,8,Roman Catholic,Not applicable,,,,,Husband/wife/partner,Son/daughter/step/adopted,Son/daughter/step/adopted,Not applicable,Not applicable,Not applicable,Not applicable,,,,,,Husband/wife/partner,Son/daughter/step/adopted/foster,Son/daughter/step/adopted/foster,Not applicable,Not applicable,Not applicable,Not applicable,,Legally married,Not marked,Not marked,Less than most,Once a week,No,None or almost none of the time,6,6,8,6,8,18,3,2015,A private firm,5,9,9,8,5,6,4,"More than 2 hours, up to 2,5 hours","More than 2,5 hours, up to 3 hours",Not applicable,No,Not applicable,Not marked,Not marked,Not marked,Not marked,Yes,8,30,35,40,All or almost all of the time,No,Unlimited,No,No,1968,,,,,1963,1993,1995,Not applicable,Not applicable,Not applicable,Not applicable,,AT,1
3,29593,Completely able,Safe,22,Austrian nfs,No second ancestry,No,No,No,No,Yes,No,Does not,Some of the time,No,Not marked,Not marked,66,No,Completely confident,AT12,No,Not applicable,Not applicable,Yes,66,Not marked,Not marked,Not marked,Not marked,Marked,Not marked,Not marked,Not marked,Not marked,Suburbs or outskirts of big city,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,No,Not marked,Not marked,Marked,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,No,Not marked,Not marked,Upper secondary education completed (ISCED 3),"Vocational ISCED 3A, access ISCED 5B/ lower tier 5A",Upper secondary education completed (ISCED 3),"Vocational ISCED 3A, access ISCED 5B/ lower tier 5A",Upper secondary education completed (ISCED 3),"Vocational ISCED 3A, access ISCED 5B/ lower tier 5A",Not applicable,Not applicable,6,"ES-ISCED IIIb, lower tier upper secondary","ES-ISCED IIIb, lower tier upper secondary","ES-ISCED IIIb, lower tier upper secondary",Not applicable,Not applicable,Employee,Not applicable,Employee,Employee,Most of the time,10 to 24,Unification already gone too far,Yes,66,None or almost none of the time,Some of the time,Some of the time,None or almost none of the time,Agree strongly,Agree strongly,Male,,,,,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,,8,Very good,1,Coping on present income,Wages or salaries,Wages or salaries,R - 2nd decile,No,Marked,Not marked,Yes,Bad for the economy,Allow none,Allow none,Like me,Like me,Like me,Very much like me,Somewhat like me,Very much like me,Very much like me,Allow none,Cultural life undermined,Worse place to live,3,,7,7,8,46,4,4,7,51,44.0,2015,2015,I have/had no influence,Very much like me,A little like me,Like me,Not like me at all,Somewhat like me,Very much like me,A little like me,Very much like me,Somewhat like me,Like me,Somewhat like me,Very much like me,Somewhat like me,Not like me at all,Motor vehicle mechanics 
and repairers,Not applicable,No,Not applicable,GER,0,Right,Yes,Paid work,Legally divorced/civil union dissolved,Legally divorced/civil union dissolved,66,No,Paid work,Not applicable,Yes,Face to face interview,Wholesale and retail trade and repair of motor vehicles and motorcycles,Not applicable,Skilled worker,Service occupations,Does not,No,Not applicable,Not applicable,Marked,Not marked,Hardly interested,5,People mostly look out for themselves,5,Never,Not applicable,Not at all,Not at all,NUTS level 2,Less often,Yes,Not applicable,5,Roman Catholic,Not applicable,,,,,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,,,,,,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,,Not applicable,Not marked,Not marked,About the same,Several times a week,No,None or almost none of the time,Extremely dissatisfied,Extremely dissatisfied,Extremely bad,Extremely dissatisfied,8,7,4,2015,A private firm,No trust at all,No trust at all,No trust at all,No trust at all,No trust at all,No trust at all,No trust at all,No time at all,More than 3 hours,Not applicable,No,Not applicable,Not marked,Not marked,Not marked,Not marked,No,I have/had no influence,40,50,Not applicable,Most of the time,No,Unlimited,No,No,1993,,,,,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,,AT,0
4,4252,Not at all able,Very safe,24,Austrian nfs,No second ancestry,No,No,No,No,Yes,No,Does not,None or almost none of the time,Yes,Not marked,Not marked,66,Yes,2,AT12,No,Not applicable,Not applicable,Yes,66,Not marked,Not marked,Not marked,Not marked,Marked,Not marked,Not marked,Not marked,Not marked,Town or small city,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,No,Not marked,Not marked,Marked,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,No,Not marked,Not marked,Post-secondary non-tertiary education completed (ISCED 4),"Vocational ISCED 4A, access upper tier ISCED 5A/all 5",Upper secondary education completed (ISCED 3),"Vocational ISCED 3A, access ISCED 5B/ lower tier 5A",Upper secondary education completed (ISCED 3),"Vocational ISCED 3A, access ISCED 5B/ lower tier 5A",Not applicable,Not applicable,13,"ES-ISCED IV, advanced vocational, sub-degree","ES-ISCED IIIb, lower tier upper secondary","ES-ISCED IIIb, lower tier upper secondary",Not applicable,Not applicable,Employee,Not applicable,Employee,Not working,All or almost all of the time,10 to 24,8,Yes,66,None or almost none of the time,None or almost none of the time,None or almost none of the time,None or almost none of the time,Agree strongly,Agree strongly,Male,,,,,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,,8,Very good,1,Living comfortably on present income,Wages or salaries,Wages or salaries,C - 3rd decile,No,Not marked,Not marked,Yes,7,Allow some,Allow some,Not like me at all,Very much like me,Like me,Somewhat like me,A little like me,Very much like me,Not like me at all,Allow some,7,8,3,,29,29,19,20,3,3,18,25,42.0,2015,2015,6,Not like me at all,Like me,Like me,Very much like me,Somewhat like me,Like me,Like me,Very much like me,Like me,Somewhat like me,Somewhat like me,Like me,Somewhat like me,Somewhat like me,Accounting and bookkeeping clerks,Not applicable,No,Not applicable,GER,0,4,No,Not 
applicable,None of these (NEVER married or in legally registered civil union),None of these (NEVER married or in legally registered civil union),66,No,Paid work,Not applicable,Yes,Face to face interview,Activities auxiliary to financial services and insurance activities,Not applicable,Skilled worker,Not applicable,Does not,No,Not applicable,Not applicable,Marked,Not marked,Quite interested,6,8,5,Less often,Not close,5,5,NUTS level 2,Only on special holy days,Yes,Not applicable,5,Roman Catholic,Not applicable,,,,,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,,,,,,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,,Not applicable,Not marked,Not marked,About the same,Several times a week,Yes,Some of the time,9,6,8,6,8,29,3,2015,A private firm,2,6,9,7,7,7,5,"0,5 hour to 1 hour","More than 1,5 hours, up to 2 hours",Not applicable,No,Not applicable,Not marked,Not marked,Not marked,Not marked,Yes,8,38,38,Not applicable,None or almost none of the time,No,Unlimited,No,No,1991,,,,,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,,AT,1


(30080, 273)

In [6]:
display(train.head())
train.shape

Unnamed: 0,id,v1,v2,v3,v4,v5,v6,v7,v8,v9,v10,v11,v12,v13,v14,v15,v16,v17,v18,v19,v20,v21,v22,v23,v24,v25,v26,v27,v28,v29,v30,v31,v32,v33,v34,v35,v36,v37,v38,v39,v40,v41,v42,v43,v44,v45,v46,v47,v48,v49,v50,v51,v52,v53,v54,v55,v56,v57,v58,v59,v60,v61,v62,v63,v64,v65,v66,v67,v68,v69,v70,v71,v72,v73,v74,v75,v76,v77,v78,v79,v80,v81,v82,v83,v84,v85,v86,v87,v88,v89,v90,v91,v92,v93,v94,v95,v96,v97,v98,v99,v100,v101,v102,v103,v104,v105,v106,v107,v108,v109,v110,v111,v112,v113,v114,v115,v116,v117,v118,v119,v120,v121,v122,v123,v124,v125,v126,v127,v128,v129,v130,v131,v132,v133,v134,v135,v136,v137,v138,v139,v140,v141,v142,v143,v144,v145,v146,v147,v148,v149,v150,v151,v152,v153,v154,v155,v156,v157,v158,v159,v160,v161,v162,v163,v164,v165,v166,v167,v168,v169,v170,v171,v172,v173,v174,v175,v176,v177,v178,v179,v180,v181,v182,v183,v184,v185,v186,v187,v188,v189,v190,v191,v192,v193,v194,v195,v196,v197,v198,v199,v200,v201,v202,v203,v204,v205,v206,v207,v208,v209,v210,v211,v212,v213,v214,v215,v216,v217,v218,v219,v220,v221,v222,v223,v224,v225,v226,v227,v228,v229,v230,v231,v232,v233,v234,v235,v236,v237,v238,v239,v240,v241,v242,v243,v244,v245,v246,v247,v248,v249,v250,v251,v252,v253,v254,v255,v256,v257,v258,v259,v260,v261,v262,v263,v264,v265,v266,v267,v268,v269,v270,cntry,satisfied
0,9948,2,2,74,11010,.a,2,2,2,2,1,1,2,2,1,0,0,66,2,2,AT33,2,2,.a,1,66,0,0,0,0,1,0,0,0,0,3,0,0,0,0,0,0,0,2,0,0,1,0,0,0,0,0,0,2,0,0,3,322,3,322,3,322,.a,.a,10,3,3,3,.a,.a,1,.a,1,1,2,2,9,1,66,2,1,3,4,3,3,2,,,,,.a,.a,.a,.a,.a,.a,.a,,5,2,1,2,3,4,2,2,0,0,1,0,3,3,3,3,2,3,6,3,2,2,2,0,0,,1,1,12,2,4,4,9,25,107.0,2015,2015,4,5,3,4,1,5,3,3,2,2,3,5,2,6,5,3341,.a,1,.a,GER,0,3,2,.a,5,5,66,2,6,.a,1,1,23,20,4,4,2,2,1,1993,0,0,3,10,10,8,1,2,2,2,2,7,1,.a,1,1,.a,,,,,.a,.a,.a,.a,.a,.a,.a,,,,,,.a,.a,.a,.a,.a,.a,.a,,.a,1,0,2,4,2,1,3,.b,5,1,9,1,4,2015,4,0,5,10,0,2,0,3,0,1,.a,2,.a,0,0,0,0,1,8,40,40,.a,2,2,1,2,2,1941,,,,,.a,.a,.a,.a,.a,.a,.a,,AT,0
1,25601,4,2,58,11010,.a,2,2,2,2,1,2,2,2,2,0,0,66,2,2,AT31,2,2,.a,1,66,0,0,0,0,1,0,0,0,0,4,0,0,0,0,0,0,0,2,0,0,1,0,0,0,0,0,0,.a,0,0,3,322,2,212,2,212,.a,.a,12,3,2,2,.a,.a,1,.a,1,1,2,2,5,1,66,2,1,1,1,5,3,1,,,,,.a,.a,.a,.a,.a,.a,.a,,4,3,1,2,4,5,2,3,0,0,1,6,3,2,4,3,3,4,4,4,4,1,5,4,2,,17,17,19,15,1,1,17,46,75.0,2015,2015,3,4,3,3,1,1,4,2,2,3,2,2,3,3,1,7132,.a,2,.a,GER,0,5,.b,.a,6,6,66,3,3,.a,1,1,25,.a,8,8,2,2,1,2014,0,0,4,4,7,3,6,.a,4,5,2,5,1,.a,1,1,.a,,,,,.a,.a,.a,.a,.a,.a,.a,,,,,,.a,.a,.a,.a,.a,.a,.a,,.a,0,0,3,4,2,1,5,3,8,5,8,17,1,2015,4,.b,7,7,.b,5,5,.b,1,3,2,1,1,1,0,0,0,1,3,39,39,.a,2,2,2,2,2,1957,,,,,.a,.a,.a,.a,.a,.a,.a,,AT,0
2,8592,6,2,47,11010,11010,2,2,1,2,1,.a,1,1,2,0,0,66,2,3,AT33,2,.a,.a,1,66,0,0,0,0,0,0,0,0,0,4,0,0,0,0,0,0,0,2,0,0,1,0,0,0,0,0,0,2,0,0,2,212,2,212,2,212,3,322,9,2,2,2,3,.a,1,1,1,1,4,1,2,1,66,1,1,1,1,1,3,2,,,,,1,2,2,.a,.a,.a,.a,,10,2,4,2,1,1,1,3,0,0,1,5,2,3,5,2,2,3,5,2,2,2,5,3,6,,18,18,17,28,3,3,16,30,50.0,2015,2015,2,5,2,1,2,5,2,2,2,2,2,5,2,5,4,9112,.d,1,.a,GER,0,4,.d,.a,1,.a,66,3,1,.a,1,1,56,1,1,8,1,2,.a,.a,1,1,3,9,8,8,2,.a,3,2,2,3,1,.a,8,1,.a,,,,,1,2,2,.a,.a,.a,.a,,,,,,1,2,2,.a,.a,.a,.a,,1,0,0,2,5,2,1,6,6,8,6,8,18,3,2015,4,5,9,9,8,5,6,4,5,6,.a,2,.a,0,0,0,0,1,8,30,35,40,4,2,1,2,2,1968,,,,,1963,1993,1995,.a,.a,.a,.a,,AT,1
3,29593,10,2,22,11010,.a,2,2,2,2,1,2,2,2,2,0,0,66,2,10,AT12,2,.a,.a,1,66,0,0,0,0,1,0,0,0,0,2,0,0,0,0,0,0,0,2,0,0,1,0,0,0,0,0,0,2,0,0,3,322,3,322,3,322,.a,.a,6,3,3,3,.a,.a,1,.a,1,1,3,2,0,1,66,1,2,2,1,1,1,1,,,,,.a,.a,.a,.a,.a,.a,.a,,8,1,1,2,1,1,2,3,1,0,1,0,4,4,2,2,2,1,3,1,1,4,0,0,3,,7,7,8,46,4,4,7,51,44.0,2015,2015,0,1,4,2,6,3,1,4,1,3,2,3,1,3,6,7231,.a,2,.a,GER,0,10,1,1,4,4,66,3,1,.a,1,1,45,.a,6,5,2,2,.a,.a,1,0,3,5,0,5,7,.a,0,0,2,6,1,.a,5,1,.a,,,,,.a,.a,.a,.a,.a,.a,.a,,,,,,.a,.a,.a,.a,.a,.a,.a,,.a,0,0,3,6,2,1,0,0,0,0,8,7,4,2015,4,0,0,0,0,0,0,0,0,7,.a,2,.a,0,0,0,0,2,0,40,50,.a,3,2,1,2,2,1993,,,,,.a,.a,.a,.a,.a,.a,.a,,AT,0
4,4252,0,1,24,11010,.a,2,2,2,2,1,2,2,1,1,0,0,66,1,2,AT12,2,.a,.a,1,66,0,0,0,0,1,0,0,0,0,3,0,0,0,0,0,0,0,2,0,0,1,0,0,0,0,0,0,2,0,0,4,423,3,322,3,322,.a,.a,13,5,3,3,.a,.a,1,.a,1,3,4,2,8,1,66,1,1,1,1,1,1,1,,,,,.a,.a,.a,.a,.a,.a,.a,,8,1,1,1,1,1,3,3,0,0,1,7,2,2,6,1,2,3,4,1,6,2,7,8,3,,29,29,19,20,3,3,18,25,42.0,2015,2015,6,6,2,2,1,3,2,2,1,2,3,3,2,3,3,4311,.a,2,.a,GER,0,4,2,.a,6,6,66,3,1,.a,1,1,66,.a,6,.a,2,2,.a,.a,1,0,2,6,8,5,6,3,5,5,2,5,1,.a,5,1,.a,,,,,.a,.a,.a,.a,.a,.a,.a,,,,,,.a,.a,.a,.a,.a,.a,.a,,.a,0,0,3,6,1,2,9,6,8,6,8,29,3,2015,4,2,6,9,7,7,7,5,2,4,.a,2,.a,0,0,0,0,1,8,38,38,.a,1,2,1,2,2,1991,,,,,.a,.a,.a,.a,.a,.a,.a,,AT,1


(30080, 273)

### Raw vs. preprocessed data
Count how often each unique answer occurs for every survey question (variable).

In [7]:
frequency_frames = []

# Skip the first column ('id') and count respondents per unique answer in every other column
for i in list(train_raw_no_blanks)[1:]:
    grouped_data = train_raw_no_blanks.groupby(i)['id'].count()
    num_unique_answers = grouped_data.size
    temp_dict = {'Question': [i]*num_unique_answers,
                 'Response': list(grouped_data.index),
                 'Frequency': list(grouped_data)}
    frequency_frames.append(pd.DataFrame(temp_dict))

# DataFrame.append was removed in pandas 2.0; concatenate the per-question frames once instead
frequency_train_raw_no_blanks = pd.concat(frequency_frames, ignore_index=True)


In [8]:
frequency_train_raw_no_blanks

Unnamed: 0,Question,Response,Frequency
0,v1,1,2432
1,v1,2,3251
2,v1,3,3335
3,v1,4,2502
4,v1,5,3349
5,v1,6,2139
6,v1,7,2456
7,v1,8,1999
8,v1,9,735
9,v1,Completely able,903
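
The per-column loop above can also be written as a single reshape followed by a grouped count. The following is a minimal sketch on a made-up toy frame, not the notebook's data; the `toy` frame and its values are assumptions chosen only to show the shape of the result.

```python
import pandas as pd

# Toy stand-in for the survey data; column names mirror the notebook,
# but the values are invented for illustration.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "v1": ["2", "2", ".a", "10"],
    "v2": ["1", "1", "1", ".b"],
})

# Reshape to one row per (respondent, question), then count each answer.
long_format = toy.melt(id_vars="id", var_name="Question", value_name="Response")
frequency = (
    long_format.groupby(["Question", "Response"])
    .size()
    .reset_index(name="Frequency")
)
print(frequency)
```

The result has the same three columns (`Question`, `Response`, `Frequency`) as the loop-based table. On the full data, `melt` builds a long intermediate frame, so the loop may still be preferable if memory is tight.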


In [9]:
frequency_frames = []

# Same count as above, this time on the preprocessed training set
for i in list(train_no_blanks)[1:]:
    grouped_data = train_no_blanks.groupby(i)['id'].count()
    num_unique_answers = grouped_data.size
    temp_dict = {'Question': [i]*num_unique_answers,
                 'Response': list(grouped_data.index),
                 'Frequency': list(grouped_data)}
    frequency_frames.append(pd.DataFrame(temp_dict))

# DataFrame.append was removed in pandas 2.0; concatenate the per-question frames once instead
frequency_train_no_blanks = pd.concat(frequency_frames, ignore_index=True)


In [10]:
frequency_train_no_blanks

Unnamed: 0,Question,Response,Frequency
0,v1,.a,18
1,v1,.b,546
2,v1,.c,10
3,v1,0,6405
4,v1,1,2432
5,v1,10,903
6,v1,2,3251
7,v1,3,3335
8,v1,4,2502
9,v1,5,3349


In [11]:
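# Missing codes ('.', '.a', '.b', ...) contain a period, so flag any string response containing "."
# Note: a decimal stored as a string (e.g. '107.0') would be flagged as well
frequency_train_no_blanks['Response Missing'] = np.where(frequency_train_no_blanks.Response.str.find(".") > -1, 1, 0)
frequency_train_no_blanks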

Unnamed: 0,Question,Response,Frequency,Response Missing
0,v1,.a,18,1
1,v1,.b,546,1
2,v1,.c,10,1
3,v1,0,6405,0
4,v1,1,2432,0
5,v1,10,903,0
6,v1,2,3251,0
7,v1,3,3335,0
8,v1,4,2502,0
9,v1,5,3349,0


Attach a short description of each question from the codebook.

In [12]:
frequency_train_no_blanks.merge(codebook[['Variable','Label','Unique','Type_codebook_long']], left_on = 'Question', right_on = 'Variable', how = 'left')


Unnamed: 0,Question,Response,Frequency,Response Missing,Variable,Label,Unique,Type_codebook_long
0,v1,.a,18,1,v1,Able to take active role in po...,11,double
1,v1,.b,546,1,v1,Able to take active role in po...,11,double
2,v1,.c,10,1,v1,Able to take active role in po...,11,double
3,v1,0,6405,0,v1,Able to take active role in po...,11,double
4,v1,1,2432,0,v1,Able to take active role in po...,11,double
5,v1,10,903,0,v1,Able to take active role in po...,11,double
6,v1,2,3251,0,v1,Able to take active role in po...,11,double
7,v1,3,3335,0,v1,Able to take active role in po...,11,double
8,v1,4,2502,0,v1,Able to take active role in po...,11,double
9,v1,5,3349,0,v1,Able to take active role in po...,11,double
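
As a small illustration of this kind of codebook join, the sketch below merges a toy frequency table with a toy codebook. The frames `freq` and `codebook_toy` are assumptions made up for the example (the labels are copied, truncated as shown, from the codebook output above). The `validate="many_to_one"` argument is an optional safeguard that is not used in the notebook.

```python
import pandas as pd

# Toy frequency table (hypothetical values, long format as in the notebook).
freq = pd.DataFrame({
    "Question": ["v1", "v1", "v2"],
    "Response": [".a", "1", "2"],
    "Frequency": [18, 2432, 5],
})

# Toy codebook with one row per variable.
codebook_toy = pd.DataFrame({
    "Variable": ["v1", "v2"],
    "Label": ["Able to take active role in po...",
              "Feeling of safety of walking a..."],
})

# A left join keeps every frequency row; validate raises an error if a
# variable appears more than once in the codebook.
labelled = freq.merge(
    codebook_toy,
    left_on="Question",
    right_on="Variable",
    how="left",
    validate="many_to_one",
).drop(columns="Variable")
print(labelled)
```

Dropping `Variable` after the merge avoids carrying two identical key columns, which the notebook output above shows side by side.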


In [13]:
# Total frequency per (question, response, missing flag)
missing_value_df = frequency_train_no_blanks.groupby(["Question","Response","Response Missing"]).agg({"Frequency":"sum"})
# Convert counts to a percentage of all responses within each question
missing_value_df = 100 * missing_value_df / missing_value_df.groupby(level="Question").transform("sum")
# Sum the percentages by missing flag and keep only the missing share
missing_value_df = missing_value_df.groupby(["Question","Response Missing"]).agg({"Frequency":"sum"})
missing_value_df = missing_value_df[missing_value_df.index.get_level_values("Response Missing")==1]
# The target is not a feature, so exclude it
missing_value_df = missing_value_df[missing_value_df.index.get_level_values("Question")!="satisfied"]
missing_value_df


Unnamed: 0_level_0,Unnamed: 1_level_0,Frequency
Question,Response Missing,Unnamed: 2_level_1
v1,1,1.908245
v10,1,0.023271
v100,1,0.28258
v101,1,0.914229
v102,1,2.094415
v103,1,2.094415
v104,1,21.087101
v105,1,0.222739
v108,1,7.948803
v109,1,3.617021


In [14]:
percent = 30
cols_missing = missing_value_df[missing_value_df["Frequency"] > percent]
n_cols_missing = len(cols_missing)
print(f"There are {n_cols_missing} features with over {percent}% missing.")

There are 71 features with over 30% missing.


In [15]:
cols_missing

Unnamed: 0_level_0,Unnamed: 1_level_0,Frequency
Question,Response Missing,Unnamed: 2_level_1
v11,1,35.954122
v123,1,92.06117
v151,1,64.361702
v153,1,88.889628
v158,1,86.439495
v160,1,49.31516
v164,1,94.075798
v168,1,73.397606
v170,1,42.054521
v173,1,54.498005


In [16]:
drop_missing = cols_missing.index.get_level_values("Question").to_list()
train_v2 = train_no_blanks.drop(drop_missing, axis=1)
train_v2
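
The threshold-and-drop step can be sketched end to end on toy data as follows. The frame `toy`, the Series `percent_missing` and its values are assumptions for illustration only; the 30% threshold is the one used above.

```python
import pandas as pd

# Toy wide table and a toy per-feature missing percentage (invented values).
toy = pd.DataFrame({
    "id": [1, 2],
    "v1": ["2", "3"],
    "v2": [".a", ".a"],
    "v3": [".b", "1"],
})
percent_missing = pd.Series({"v1": 1.9, "v2": 92.1, "v3": 35.9})

threshold = 30  # percent, as in the notebook

# Select features whose missing share exceeds the threshold, then drop them.
to_drop = percent_missing[percent_missing > threshold].index.tolist()
reduced = toy.drop(columns=to_drop)

print(f"Dropped {len(to_drop)} features: {to_drop}")
print(reduced.columns.tolist())
```

Since `drop` raises a `KeyError` by default when a listed column does not exist, a mismatch between the missing-value table and the data frame would surface immediately rather than pass silently.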