## Predicting Life Satisfaction
### Exploratory Data Analysis
#### Objectives:
- Analysis of missing observations
- Analysis of categorical/numeric features
- Correlation analysis
- One-way and two-way variable plots (histograms, scatterplots, etc.)


### From Kaggle:
  
#### File descriptions  
* train.csv - the training set with some preprocessing of values.  
* train_raw.csv - the training set with original responses (no preprocessing).  
  
#### Data fields
* v1-v270 - survey response fields
* cntry - survey respondent country
* satisfied - whether (1) or not (0) the survey respondent is 'very satisfied' with their life (training set only)

In [2]:
# Import packages
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sns

# Set options
pd.options.display.max_rows = 999
pd.options.display.max_columns = 999
pd.options.display.max_colwidth = None  # None shows full cell contents; -1 is no longer accepted by recent pandas

In [3]:
# Import data
train_raw = pd.read_csv("../01-data/train_raw.csv", low_memory = False)
train = pd.read_csv("../01-data/train.csv", low_memory = False)

# Custom data
codebook = pd.read_csv("../01-data/codebook_compact.csv", low_memory = False) # Original codebook plus dtypes taken from codebook_long

In [4]:
train_no_blanks = train.fillna('.')
train_raw_no_blanks = train_raw.fillna('.')

In [5]:
codebook

Unnamed: 0,Variable,Obs,Unique,Mean,Min,Max,Label,Type_codebook_long
0,v1,38612,11,3.629442,0,10,Able to take active role in po...,double
1,v2,38929,4,1.932749,1,4,Feeling of safety of walking a...,double
2,v3,39228,90,49.30302,14,114,"Age of respondent, calculated",double
3,v4,39075,184,15484.9,10000,444444,"First ancestry, European Stand...",double
4,v5,12117,174,22455.79,10000,444444,"Second ancestry, European Stan...",double
5,v6,39087,2,1.689104,1,2,Improve knowledge/skills: cour...,double
6,v7,39237,2,1.913347,1,2,Worn or displayed campaign bad...,double
7,v8,39145,2,1.805211,1,2,Boycotted certain products las...,double
8,v9,38810,2,1.935326,1,2,Belong to minority ethnic grou...,double
9,v10,39315,2,1.112959,1,2,Born in country,double


In [6]:
display(train_raw.head())
train_raw.shape

Unnamed: 0,id,v1,v2,v3,v4,v5,v6,v7,v8,v9,v10,v11,v12,v13,v14,v15,v16,v17,v18,v19,v20,v21,v22,v23,v24,v25,v26,v27,v28,v29,v30,v31,v32,v33,v34,v35,v36,v37,v38,v39,v40,v41,v42,v43,v44,v45,v46,v47,v48,v49,v50,v51,v52,v53,v54,v55,v56,v57,v58,v59,v60,v61,v62,v63,v64,v65,v66,v67,v68,v69,v70,v71,v72,v73,v74,v75,v76,v77,v78,v79,v80,v81,v82,v83,v84,v85,v86,v87,v88,v89,v90,v91,v92,v93,v94,v95,v96,v97,v98,v99,v100,v101,v102,v103,v104,v105,v106,v107,v108,v109,v110,v111,v112,v113,v114,v115,v116,v117,v118,v119,v120,v121,v122,v123,v124,v125,v126,v127,v128,v129,v130,v131,v132,v133,v134,v135,v136,v137,v138,v139,v140,v141,v142,v143,v144,v145,v146,v147,v148,v149,v150,v151,v152,v153,v154,v155,v156,v157,v158,v159,v160,v161,v162,v163,v164,v165,v166,v167,v168,v169,v170,v171,v172,v173,v174,v175,v176,v177,v178,v179,v180,v181,v182,v183,v184,v185,v186,v187,v188,v189,v190,v191,v192,v193,v194,v195,v196,v197,v198,v199,v200,v201,v202,v203,v204,v205,v206,v207,v208,v209,v210,v211,v212,v213,v214,v215,v216,v217,v218,v219,v220,v221,v222,v223,v224,v225,v226,v227,v228,v229,v230,v231,v232,v233,v234,v235,v236,v237,v238,v239,v240,v241,v242,v243,v244,v245,v246,v247,v248,v249,v250,v251,v252,v253,v254,v255,v256,v257,v258,v259,v260,v261,v262,v263,v264,v265,v266,v267,v268,v269,v270,cntry,satisfied
0,9948,2,Safe,74,Austrian nfs,No second ancestry,No,No,No,No,Yes,Yes,Does not,Some of the time,Yes,Not marked,Not marked,66,No,2,AT33,No,No,Not applicable,Yes,66,Not marked,Not marked,Not marked,Not marked,Marked,Not marked,Not marked,Not marked,Not marked,Town or small city,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,No,Not marked,Not marked,Marked,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,No,Not marked,Not marked,Upper secondary education completed (ISCED 3),"Vocational ISCED 3A, access ISCED 5B/ lower tier 5A",Upper secondary education completed (ISCED 3),"Vocational ISCED 3A, access ISCED 5B/ lower tier 5A",Upper secondary education completed (ISCED 3),"Vocational ISCED 3A, access ISCED 5B/ lower tier 5A",Not applicable,Not applicable,10,"ES-ISCED IIIb, lower tier upper secondary","ES-ISCED IIIb, lower tier upper secondary","ES-ISCED IIIb, lower tier upper secondary",Not applicable,Not applicable,Employee,Not applicable,Employee,Employee,Some of the time,10 to 24,9,Yes,66,Some of the time,None or almost none of the time,Most of the time,All or almost all of the time,Neither agree nor disagree,Neither agree nor disagree,Female,,,,,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,,5,Good,1,Coping on present income,Pensions,Pensions,R - 2nd decile,Yes to some extent,Not marked,Not marked,Yes,Bad for the economy,Allow a few,Allow a few,Somewhat like me,Somewhat like me,Like me,Somewhat like me,Not like me at all,Somewhat like me,Like me,Allow some,2,Worse place to live,,,1,1,12,2,4,4,9,25,107.0,2015,2015,4,Not like me,Somewhat like me,A little like me,Very much like me,Not like me,Somewhat like me,Somewhat like me,Like me,Like me,Somewhat like me,Not like me,Like me,Not like me at all,Not like me,Office supervisors,Not applicable,Yes,Not applicable,GER,0,3,No,Not applicable,Widowed/civil partner died,Widowed/civil partner died,66,"Yes, 
previously",Retired,Not applicable,Yes,Face to face interview,Manufacture of other non-metallic mineral products,20,Sales occupations,Sales occupations,Does not,No,Yes,1993,Not marked,Not marked,Hardly interested,Most people try to be fair,People mostly try to be helpful,8,Every day,Quite close,2,2,NUTS level 2,Never,Yes,Not applicable,1,Roman Catholic,Not applicable,,,,,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,,,,,,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,,Not applicable,Marked,Not marked,Less than most,Several times a month,No,None or almost none of the time,3,Don't know,5,1,9,1,4,2015,A private firm,No trust at all,5,Complete trust,No trust at all,2,No trust at all,3,No time at all,"Less than 0,5 hour",Not applicable,No,Not applicable,Not marked,Not marked,Not marked,Not marked,Yes,8,40,40,Not applicable,Some of the time,No,Unlimited,No,No,1941,,,,,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,,AT,0
1,25601,4,Safe,58,Austrian nfs,No second ancestry,No,No,No,No,Yes,No,Does not,Some of the time,No,Not marked,Not marked,66,No,2,AT31,No,No,Not applicable,Yes,66,Not marked,Not marked,Not marked,Not marked,Marked,Not marked,Not marked,Not marked,Not marked,Country village,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,No,Not marked,Not marked,Marked,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,Refusal,Not marked,Not marked,Upper secondary education completed (ISCED 3),"Vocational ISCED 3A, access ISCED 5B/ lower tier 5A",Lower secondary education completed (ISCED 2),"General/pre-vocational ISCED 2A/2B, access ISCED 3 vocational",Lower secondary education completed (ISCED 2),"General/pre-vocational ISCED 2A/2B, access ISCED 3 vocational",Not applicable,Not applicable,12,"ES-ISCED IIIb, lower tier upper secondary","ES-ISCED II, lower secondary","ES-ISCED II, lower secondary",Not applicable,Not applicable,Employee,Not applicable,Employee,Employee,Some of the time,10 to 24,5,Yes,66,Some of the time,None or almost none of the time,None or almost none of the time,None or almost none of the time,Disagree strongly,Neither agree nor disagree,Male,,,,,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,,4,Fair,1,Coping on present income,Unemployment/redundancy benefit,Unemployment/redundancy benefit,R - 2nd decile,No,Not marked,Not marked,Yes,6,Allow a few,Allow some,A little like me,Somewhat like me,Somewhat like me,A little like me,A little like me,A little like me,A little like me,Allow many to come and live here,5,4,2,,17,17,19,15,1,1,17,46,75.0,2015,2015,3,A little like me,Somewhat like me,Somewhat like me,Very much like me,Very much like me,A little like me,Like me,Like me,Somewhat like me,Like me,Like me,Somewhat like me,Somewhat like me,Very much like me,Spray painters and varnishers,Not applicable,No,Not applicable,GER,0,5,Refusal,Not applicable,None of these 
(NEVER married or in legally registered civil union),None of these (NEVER married or in legally registered civil union),66,No,"Unemployed, looking for job",Not applicable,Yes,Face to face interview,"Manufacture of fabricated metal products, except machinery and equipment",Not applicable,Unskilled worker,Unskilled worker,Does not,No,Yes,2014,Not marked,Not marked,Not at all interested,4,7,3,Less often,Not applicable,4,5,NUTS level 2,Only on special holy days,Yes,Not applicable,1,Roman Catholic,Not applicable,,,,,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,,,,,,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,,Not applicable,Not marked,Not marked,About the same,Several times a month,No,None or almost none of the time,5,3,8,5,8,17,1,2015,A private firm,Don't know,7,7,Don't know,5,5,Don't know,"Less than 0,5 hour","More than 1 hour, up to 1,5 hours",No,Yes,Yes,Marked,Not marked,Not marked,Not marked,Yes,3,39,39,Not applicable,Some of the time,No,Limited,No,No,1957,,,,,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,,AT,0
2,8592,6,Safe,47,Austrian nfs,Austrian nfs,No,No,Yes,No,Yes,Not applicable,Respondent lives with children at household grid,None or almost none of the time,No,Not marked,Not marked,66,No,3,AT33,No,Not applicable,Not applicable,Yes,66,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,Country village,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,No,Not marked,Not marked,Marked,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,No,Not marked,Not marked,Lower secondary education completed (ISCED 2),"General/pre-vocational ISCED 2A/2B, access ISCED 3 vocational",Lower secondary education completed (ISCED 2),"General/pre-vocational ISCED 2A/2B, access ISCED 3 vocational",Lower secondary education completed (ISCED 2),"General/pre-vocational ISCED 2A/2B, access ISCED 3 vocational",Upper secondary education completed (ISCED 3),"Vocational ISCED 3A, access ISCED 5B/ lower tier 5A",9,"ES-ISCED II, lower secondary","ES-ISCED II, lower secondary","ES-ISCED II, lower secondary","ES-ISCED IIIb, lower tier upper secondary",Not applicable,Employee,Employee,Employee,Employee,All or almost all of the time,Under 10,2,Yes,66,None or almost none of the time,None or almost none of the time,None or almost none of the time,None or almost none of the time,Agree strongly,Neither agree nor disagree,Female,,,,,Male,Female,Female,Not applicable,Not applicable,Not applicable,Not applicable,,Extremely happy,Good,4,Coping on present income,Wages or salaries,Wages or salaries,J - 1st decile,No,Not marked,Not marked,Yes,5,Allow some,Allow a few,Not like me,Like me,Like me,Somewhat like me,Not like me,Like me,Like me,Allow some,5,3,10 or more,,18,18,17,28,3,3,16,30,50.0,2015,2015,2,Not like me,Like me,Very much like me,Like me,Not like me,Like me,Like me,Like me,Like me,Like me,Not like me,Like me,Not like me,A little like me,"Cleaners and helpers in offices, hotels and other establishments",No 
answer,Yes,Not applicable,GER,0,4,No answer,Not applicable,Legally married,Not applicable,66,No,Paid work,Not applicable,Yes,Face to face interview,Food and beverage service activities,1,Professional and technical occupations,Unskilled worker,Lives with husband/wife/partner at household grid,No,Not applicable,Not applicable,Marked,Marked,Hardly interested,9,8,8,More than once a week,Not applicable,3,2,NUTS level 2,Once a week,Yes,Not applicable,8,Roman Catholic,Not applicable,,,,,Husband/wife/partner,Son/daughter/step/adopted,Son/daughter/step/adopted,Not applicable,Not applicable,Not applicable,Not applicable,,,,,,Husband/wife/partner,Son/daughter/step/adopted/foster,Son/daughter/step/adopted/foster,Not applicable,Not applicable,Not applicable,Not applicable,,Legally married,Not marked,Not marked,Less than most,Once a week,No,None or almost none of the time,6,6,8,6,8,18,3,2015,A private firm,5,9,9,8,5,6,4,"More than 2 hours, up to 2,5 hours","More than 2,5 hours, up to 3 hours",Not applicable,No,Not applicable,Not marked,Not marked,Not marked,Not marked,Yes,8,30,35,40,All or almost all of the time,No,Unlimited,No,No,1968,,,,,1963,1993,1995,Not applicable,Not applicable,Not applicable,Not applicable,,AT,1
3,29593,Completely able,Safe,22,Austrian nfs,No second ancestry,No,No,No,No,Yes,No,Does not,Some of the time,No,Not marked,Not marked,66,No,Completely confident,AT12,No,Not applicable,Not applicable,Yes,66,Not marked,Not marked,Not marked,Not marked,Marked,Not marked,Not marked,Not marked,Not marked,Suburbs or outskirts of big city,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,No,Not marked,Not marked,Marked,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,No,Not marked,Not marked,Upper secondary education completed (ISCED 3),"Vocational ISCED 3A, access ISCED 5B/ lower tier 5A",Upper secondary education completed (ISCED 3),"Vocational ISCED 3A, access ISCED 5B/ lower tier 5A",Upper secondary education completed (ISCED 3),"Vocational ISCED 3A, access ISCED 5B/ lower tier 5A",Not applicable,Not applicable,6,"ES-ISCED IIIb, lower tier upper secondary","ES-ISCED IIIb, lower tier upper secondary","ES-ISCED IIIb, lower tier upper secondary",Not applicable,Not applicable,Employee,Not applicable,Employee,Employee,Most of the time,10 to 24,Unification already gone too far,Yes,66,None or almost none of the time,Some of the time,Some of the time,None or almost none of the time,Agree strongly,Agree strongly,Male,,,,,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,,8,Very good,1,Coping on present income,Wages or salaries,Wages or salaries,R - 2nd decile,No,Marked,Not marked,Yes,Bad for the economy,Allow none,Allow none,Like me,Like me,Like me,Very much like me,Somewhat like me,Very much like me,Very much like me,Allow none,Cultural life undermined,Worse place to live,3,,7,7,8,46,4,4,7,51,44.0,2015,2015,I have/had no influence,Very much like me,A little like me,Like me,Not like me at all,Somewhat like me,Very much like me,A little like me,Very much like me,Somewhat like me,Like me,Somewhat like me,Very much like me,Somewhat like me,Not like me at all,Motor vehicle mechanics 
and repairers,Not applicable,No,Not applicable,GER,0,Right,Yes,Paid work,Legally divorced/civil union dissolved,Legally divorced/civil union dissolved,66,No,Paid work,Not applicable,Yes,Face to face interview,Wholesale and retail trade and repair of motor vehicles and motorcycles,Not applicable,Skilled worker,Service occupations,Does not,No,Not applicable,Not applicable,Marked,Not marked,Hardly interested,5,People mostly look out for themselves,5,Never,Not applicable,Not at all,Not at all,NUTS level 2,Less often,Yes,Not applicable,5,Roman Catholic,Not applicable,,,,,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,,,,,,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,,Not applicable,Not marked,Not marked,About the same,Several times a week,No,None or almost none of the time,Extremely dissatisfied,Extremely dissatisfied,Extremely bad,Extremely dissatisfied,8,7,4,2015,A private firm,No trust at all,No trust at all,No trust at all,No trust at all,No trust at all,No trust at all,No trust at all,No time at all,More than 3 hours,Not applicable,No,Not applicable,Not marked,Not marked,Not marked,Not marked,No,I have/had no influence,40,50,Not applicable,Most of the time,No,Unlimited,No,No,1993,,,,,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,,AT,0
4,4252,Not at all able,Very safe,24,Austrian nfs,No second ancestry,No,No,No,No,Yes,No,Does not,None or almost none of the time,Yes,Not marked,Not marked,66,Yes,2,AT12,No,Not applicable,Not applicable,Yes,66,Not marked,Not marked,Not marked,Not marked,Marked,Not marked,Not marked,Not marked,Not marked,Town or small city,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,No,Not marked,Not marked,Marked,Not marked,Not marked,Not marked,Not marked,Not marked,Not marked,No,Not marked,Not marked,Post-secondary non-tertiary education completed (ISCED 4),"Vocational ISCED 4A, access upper tier ISCED 5A/all 5",Upper secondary education completed (ISCED 3),"Vocational ISCED 3A, access ISCED 5B/ lower tier 5A",Upper secondary education completed (ISCED 3),"Vocational ISCED 3A, access ISCED 5B/ lower tier 5A",Not applicable,Not applicable,13,"ES-ISCED IV, advanced vocational, sub-degree","ES-ISCED IIIb, lower tier upper secondary","ES-ISCED IIIb, lower tier upper secondary",Not applicable,Not applicable,Employee,Not applicable,Employee,Not working,All or almost all of the time,10 to 24,8,Yes,66,None or almost none of the time,None or almost none of the time,None or almost none of the time,None or almost none of the time,Agree strongly,Agree strongly,Male,,,,,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,,8,Very good,1,Living comfortably on present income,Wages or salaries,Wages or salaries,C - 3rd decile,No,Not marked,Not marked,Yes,7,Allow some,Allow some,Not like me at all,Very much like me,Like me,Somewhat like me,A little like me,Very much like me,Not like me at all,Allow some,7,8,3,,29,29,19,20,3,3,18,25,42.0,2015,2015,6,Not like me at all,Like me,Like me,Very much like me,Somewhat like me,Like me,Like me,Very much like me,Like me,Somewhat like me,Somewhat like me,Like me,Somewhat like me,Somewhat like me,Accounting and bookkeeping clerks,Not applicable,No,Not applicable,GER,0,4,No,Not 
applicable,None of these (NEVER married or in legally registered civil union),None of these (NEVER married or in legally registered civil union),66,No,Paid work,Not applicable,Yes,Face to face interview,Activities auxiliary to financial services and insurance activities,Not applicable,Skilled worker,Not applicable,Does not,No,Not applicable,Not applicable,Marked,Not marked,Quite interested,6,8,5,Less often,Not close,5,5,NUTS level 2,Only on special holy days,Yes,Not applicable,5,Roman Catholic,Not applicable,,,,,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,,,,,,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,,Not applicable,Not marked,Not marked,About the same,Several times a week,Yes,Some of the time,9,6,8,6,8,29,3,2015,A private firm,2,6,9,7,7,7,5,"0,5 hour to 1 hour","More than 1,5 hours, up to 2 hours",Not applicable,No,Not applicable,Not marked,Not marked,Not marked,Not marked,Yes,8,38,38,Not applicable,None or almost none of the time,No,Unlimited,No,No,1991,,,,,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,Not applicable,,AT,1


(30080, 273)

In [7]:
display(train.head())
train.shape

Unnamed: 0,id,v1,v2,v3,v4,v5,v6,v7,v8,v9,v10,v11,v12,v13,v14,v15,v16,v17,v18,v19,v20,v21,v22,v23,v24,v25,v26,v27,v28,v29,v30,v31,v32,v33,v34,v35,v36,v37,v38,v39,v40,v41,v42,v43,v44,v45,v46,v47,v48,v49,v50,v51,v52,v53,v54,v55,v56,v57,v58,v59,v60,v61,v62,v63,v64,v65,v66,v67,v68,v69,v70,v71,v72,v73,v74,v75,v76,v77,v78,v79,v80,v81,v82,v83,v84,v85,v86,v87,v88,v89,v90,v91,v92,v93,v94,v95,v96,v97,v98,v99,v100,v101,v102,v103,v104,v105,v106,v107,v108,v109,v110,v111,v112,v113,v114,v115,v116,v117,v118,v119,v120,v121,v122,v123,v124,v125,v126,v127,v128,v129,v130,v131,v132,v133,v134,v135,v136,v137,v138,v139,v140,v141,v142,v143,v144,v145,v146,v147,v148,v149,v150,v151,v152,v153,v154,v155,v156,v157,v158,v159,v160,v161,v162,v163,v164,v165,v166,v167,v168,v169,v170,v171,v172,v173,v174,v175,v176,v177,v178,v179,v180,v181,v182,v183,v184,v185,v186,v187,v188,v189,v190,v191,v192,v193,v194,v195,v196,v197,v198,v199,v200,v201,v202,v203,v204,v205,v206,v207,v208,v209,v210,v211,v212,v213,v214,v215,v216,v217,v218,v219,v220,v221,v222,v223,v224,v225,v226,v227,v228,v229,v230,v231,v232,v233,v234,v235,v236,v237,v238,v239,v240,v241,v242,v243,v244,v245,v246,v247,v248,v249,v250,v251,v252,v253,v254,v255,v256,v257,v258,v259,v260,v261,v262,v263,v264,v265,v266,v267,v268,v269,v270,cntry,satisfied
0,9948,2,2,74,11010,.a,2,2,2,2,1,1,2,2,1,0,0,66,2,2,AT33,2,2,.a,1,66,0,0,0,0,1,0,0,0,0,3,0,0,0,0,0,0,0,2,0,0,1,0,0,0,0,0,0,2,0,0,3,322,3,322,3,322,.a,.a,10,3,3,3,.a,.a,1,.a,1,1,2,2,9,1,66,2,1,3,4,3,3,2,,,,,.a,.a,.a,.a,.a,.a,.a,,5,2,1,2,3,4,2,2,0,0,1,0,3,3,3,3,2,3,6,3,2,2,2,0,0,,1,1,12,2,4,4,9,25,107.0,2015,2015,4,5,3,4,1,5,3,3,2,2,3,5,2,6,5,3341,.a,1,.a,GER,0,3,2,.a,5,5,66,2,6,.a,1,1,23,20,4,4,2,2,1,1993,0,0,3,10,10,8,1,2,2,2,2,7,1,.a,1,1,.a,,,,,.a,.a,.a,.a,.a,.a,.a,,,,,,.a,.a,.a,.a,.a,.a,.a,,.a,1,0,2,4,2,1,3,.b,5,1,9,1,4,2015,4,0,5,10,0,2,0,3,0,1,.a,2,.a,0,0,0,0,1,8,40,40,.a,2,2,1,2,2,1941,,,,,.a,.a,.a,.a,.a,.a,.a,,AT,0
1,25601,4,2,58,11010,.a,2,2,2,2,1,2,2,2,2,0,0,66,2,2,AT31,2,2,.a,1,66,0,0,0,0,1,0,0,0,0,4,0,0,0,0,0,0,0,2,0,0,1,0,0,0,0,0,0,.a,0,0,3,322,2,212,2,212,.a,.a,12,3,2,2,.a,.a,1,.a,1,1,2,2,5,1,66,2,1,1,1,5,3,1,,,,,.a,.a,.a,.a,.a,.a,.a,,4,3,1,2,4,5,2,3,0,0,1,6,3,2,4,3,3,4,4,4,4,1,5,4,2,,17,17,19,15,1,1,17,46,75.0,2015,2015,3,4,3,3,1,1,4,2,2,3,2,2,3,3,1,7132,.a,2,.a,GER,0,5,.b,.a,6,6,66,3,3,.a,1,1,25,.a,8,8,2,2,1,2014,0,0,4,4,7,3,6,.a,4,5,2,5,1,.a,1,1,.a,,,,,.a,.a,.a,.a,.a,.a,.a,,,,,,.a,.a,.a,.a,.a,.a,.a,,.a,0,0,3,4,2,1,5,3,8,5,8,17,1,2015,4,.b,7,7,.b,5,5,.b,1,3,2,1,1,1,0,0,0,1,3,39,39,.a,2,2,2,2,2,1957,,,,,.a,.a,.a,.a,.a,.a,.a,,AT,0
2,8592,6,2,47,11010,11010,2,2,1,2,1,.a,1,1,2,0,0,66,2,3,AT33,2,.a,.a,1,66,0,0,0,0,0,0,0,0,0,4,0,0,0,0,0,0,0,2,0,0,1,0,0,0,0,0,0,2,0,0,2,212,2,212,2,212,3,322,9,2,2,2,3,.a,1,1,1,1,4,1,2,1,66,1,1,1,1,1,3,2,,,,,1,2,2,.a,.a,.a,.a,,10,2,4,2,1,1,1,3,0,0,1,5,2,3,5,2,2,3,5,2,2,2,5,3,6,,18,18,17,28,3,3,16,30,50.0,2015,2015,2,5,2,1,2,5,2,2,2,2,2,5,2,5,4,9112,.d,1,.a,GER,0,4,.d,.a,1,.a,66,3,1,.a,1,1,56,1,1,8,1,2,.a,.a,1,1,3,9,8,8,2,.a,3,2,2,3,1,.a,8,1,.a,,,,,1,2,2,.a,.a,.a,.a,,,,,,1,2,2,.a,.a,.a,.a,,1,0,0,2,5,2,1,6,6,8,6,8,18,3,2015,4,5,9,9,8,5,6,4,5,6,.a,2,.a,0,0,0,0,1,8,30,35,40,4,2,1,2,2,1968,,,,,1963,1993,1995,.a,.a,.a,.a,,AT,1
3,29593,10,2,22,11010,.a,2,2,2,2,1,2,2,2,2,0,0,66,2,10,AT12,2,.a,.a,1,66,0,0,0,0,1,0,0,0,0,2,0,0,0,0,0,0,0,2,0,0,1,0,0,0,0,0,0,2,0,0,3,322,3,322,3,322,.a,.a,6,3,3,3,.a,.a,1,.a,1,1,3,2,0,1,66,1,2,2,1,1,1,1,,,,,.a,.a,.a,.a,.a,.a,.a,,8,1,1,2,1,1,2,3,1,0,1,0,4,4,2,2,2,1,3,1,1,4,0,0,3,,7,7,8,46,4,4,7,51,44.0,2015,2015,0,1,4,2,6,3,1,4,1,3,2,3,1,3,6,7231,.a,2,.a,GER,0,10,1,1,4,4,66,3,1,.a,1,1,45,.a,6,5,2,2,.a,.a,1,0,3,5,0,5,7,.a,0,0,2,6,1,.a,5,1,.a,,,,,.a,.a,.a,.a,.a,.a,.a,,,,,,.a,.a,.a,.a,.a,.a,.a,,.a,0,0,3,6,2,1,0,0,0,0,8,7,4,2015,4,0,0,0,0,0,0,0,0,7,.a,2,.a,0,0,0,0,2,0,40,50,.a,3,2,1,2,2,1993,,,,,.a,.a,.a,.a,.a,.a,.a,,AT,0
4,4252,0,1,24,11010,.a,2,2,2,2,1,2,2,1,1,0,0,66,1,2,AT12,2,.a,.a,1,66,0,0,0,0,1,0,0,0,0,3,0,0,0,0,0,0,0,2,0,0,1,0,0,0,0,0,0,2,0,0,4,423,3,322,3,322,.a,.a,13,5,3,3,.a,.a,1,.a,1,3,4,2,8,1,66,1,1,1,1,1,1,1,,,,,.a,.a,.a,.a,.a,.a,.a,,8,1,1,1,1,1,3,3,0,0,1,7,2,2,6,1,2,3,4,1,6,2,7,8,3,,29,29,19,20,3,3,18,25,42.0,2015,2015,6,6,2,2,1,3,2,2,1,2,3,3,2,3,3,4311,.a,2,.a,GER,0,4,2,.a,6,6,66,3,1,.a,1,1,66,.a,6,.a,2,2,.a,.a,1,0,2,6,8,5,6,3,5,5,2,5,1,.a,5,1,.a,,,,,.a,.a,.a,.a,.a,.a,.a,,,,,,.a,.a,.a,.a,.a,.a,.a,,.a,0,0,3,6,1,2,9,6,8,6,8,29,3,2015,4,2,6,9,7,7,7,5,2,4,.a,2,.a,0,0,0,0,1,8,38,38,.a,1,2,1,2,2,1991,,,,,.a,.a,.a,.a,.a,.a,.a,,AT,1


(30080, 273)

### Raw vs. preprocessed data
Count how often each unique response occurs for every survey question, first in the raw data and then in the preprocessed data.

In [8]:
# Collect one small frame per question, then concatenate once
# (DataFrame.append was removed in pandas 2.0)
frames = []

for i in list(train_raw_no_blanks)[1:]:  # skip the 'id' column
    grouped_data = train_raw_no_blanks.groupby(i)['id'].count()
    num_unique_answers = grouped_data.size
    temp_dict = {'Question': [i]*num_unique_answers,
                 'Response': list(grouped_data.index),
                 'Frequency': list(grouped_data)}
    frames.append(pd.DataFrame(temp_dict))

frequency_train_raw_no_blanks = pd.concat(frames, ignore_index=True)


In [35]:
frequency_train_raw_no_blanks[frequency_train_raw_no_blanks["Question"]=="v17"]

Unnamed: 0,Question,Response,Frequency
507,v17,2,25
508,v17,3,8
509,v17,4,9
510,v17,6,1
511,v17,66,26687
512,v17,77,4
513,v17,88,1
514,v17,99,181
515,v17,AF,14
516,v17,AL,3
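
The per-question loop above can also be written as a single reshape followed by a group count. The following is a minimal sketch on a toy frame, not the notebook's own method: the helper name `response_frequencies` and the toy values are assumptions made for illustration. With `sort=False` the rows come out in order of first appearance rather than sorted by response.

```python
import pandas as pd

# Toy stand-in for the survey data (hypothetical values, not the real file).
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "v1": ["0", "1", "0", ".a"],
    "v2": ["2", "2", ".", "1"],
})


def response_frequencies(df, id_col="id"):
    """Count each (question, response) pair in one pass (hypothetical helper)."""
    # Reshape wide -> long: one row per respondent and question.
    long_df = df.melt(id_vars=[id_col], var_name="Question", value_name="Response")
    # size() counts rows per pair; sort=False keeps order of first appearance.
    return (
        long_df.groupby(["Question", "Response"], sort=False)
        .size()
        .reset_index(name="Frequency")
    )


freq = response_frequencies(toy)
print(freq)
```

Because the counting happens in one grouped operation, this avoids building a separate small frame for each of the 270+ columns.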


In [10]:
# Same frequency table for the preprocessed data
# (DataFrame.append was removed in pandas 2.0, so concatenate once at the end)
frames = []

for i in list(train_no_blanks)[1:]:  # skip the 'id' column
    grouped_data = train_no_blanks.groupby(i)['id'].count()
    num_unique_answers = grouped_data.size
    temp_dict = {'Question': [i]*num_unique_answers,
                 'Response': list(grouped_data.index),
                 'Frequency': list(grouped_data)}
    frames.append(pd.DataFrame(temp_dict))

frequency_train_no_blanks = pd.concat(frames, ignore_index=True)


In [10]:
frequency_train_no_blanks

Unnamed: 0,Question,Response,Frequency
0,v1,.a,18
1,v1,.b,546
2,v1,.c,10
3,v1,0,6405
4,v1,1,2432
5,v1,10,903
6,v1,2,3251
7,v1,3,3335
8,v1,4,2502
9,v1,5,3349


In [11]:
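# Flag responses that contain "." (the missing codes .a/.b/.c/.d and the "." fill value);
# pd.np was removed from pandas, so call NumPy directly
frequency_train_no_blanks['Response Missing'] = np.where(frequency_train_no_blanks.Response.str.find(".") > -1, 1, 0)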
frequency_train_no_blanks

Unnamed: 0,Question,Response,Frequency,Response Missing
0,v1,.a,18,1
1,v1,.b,546,1
2,v1,.c,10,1
3,v1,0,6405,0
4,v1,1,2432,0
...,...,...,...,...
6408,cntry,PT,992,0
6409,cntry,SE,1434,0
6410,cntry,SI,950,0
6411,satisfied,0,14454,0
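
The flag above tests whether a response contains a dot anywhere. A stricter variant, sketched below on toy values, checks only whether the response starts with a dot. This assumes that every missing code has the form `.a`, `.b`, ... or a bare `.`, which matches the outputs shown here but is not confirmed by the codebook. The difference matters only if a decimal such as `107.0` is ever stored as text.

```python
import pandas as pd

# Toy responses (hypothetical): missing codes, plain answers, a decimal stored
# as text, and a non-string value.
responses = pd.Series([".a", ".b", ".", "0", "10", "107.0", 3])

# Cast to text first so non-string values are handled uniformly.
as_text = responses.astype(str)

# Substring test: true for any value containing a dot.
contains_dot = as_text.str.contains(".", regex=False)

# Prefix test: true only for values that start with a dot.
is_missing = as_text.str.startswith(".").astype(int)

print(pd.DataFrame({"Response": responses,
                    "contains_dot": contains_dot,
                    "is_missing": is_missing}))
```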


Attach the short description (label), unique-value count, and type of each question from the codebook.

In [12]:
frequency_train_no_blanks.merge(codebook[['Variable','Label','Unique','Type_codebook_long']], left_on = 'Question', right_on = 'Variable', how = 'left')


Unnamed: 0,Question,Response,Frequency,Response Missing,Variable,Label,Unique,Type_codebook_long
0,v1,.a,18,1,v1,Able to take active role in po...,11,double
1,v1,.b,546,1,v1,Able to take active role in po...,11,double
2,v1,.c,10,1,v1,Able to take active role in po...,11,double
3,v1,0,6405,0,v1,Able to take active role in po...,11,double
4,v1,1,2432,0,v1,Able to take active role in po...,11,double
...,...,...,...,...,...,...,...,...
6408,cntry,PT,992,0,cntry,Country,21,string
6409,cntry,SE,1434,0,cntry,Country,21,string
6410,cntry,SI,950,0,cntry,Country,21,string
6411,satisfied,0,14454,0,satisfied,Target,2,float
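
A left join like the one above keeps every frequency row as long as each `Variable` appears only once in the codebook. The sketch below, using toy frames with made-up values, shows one way to have pandas check that assumption through the `validate` argument of `merge`. The names `freq_toy` and `codebook_toy` are hypothetical.

```python
import pandas as pd

# Toy frequency table and codebook (hypothetical values).
freq_toy = pd.DataFrame({
    "Question": ["v1", "v1", "v2"],
    "Response": [".a", "0", "1"],
    "Frequency": [3, 5, 2],
})
codebook_toy = pd.DataFrame({
    "Variable": ["v1", "v2"],
    "Label": ["Label for v1", "Label for v2"],
})

# validate="many_to_one" raises MergeError if a Variable is duplicated in the
# codebook, which would otherwise silently multiply rows.
labelled = freq_toy.merge(
    codebook_toy,
    left_on="Question",
    right_on="Variable",
    how="left",
    validate="many_to_one",
)

print(labelled)
```

Assigning the result to a name, as done here, also makes the labelled table available to later cells.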


In [13]:
missing_value_df = frequency_train_no_blanks.groupby(["Question","Response","Response Missing"]).agg({"Frequency":"sum"})
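# Convert counts to percentages within each question
# (transform keeps the original index, unlike groupby.apply in recent pandas)
missing_value_df = 100 * missing_value_df / missing_value_df.groupby(level="Question").transform("sum")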
missing_value_df = missing_value_df.groupby(["Question","Response Missing"]).agg({"Frequency":"sum"})
missing_value_df = missing_value_df[missing_value_df.index.get_level_values("Response Missing")==1]
missing_value_df = missing_value_df[missing_value_df.index.get_level_values("Question")!="satisfied"]
missing_value_df


Question,Response Missing,Frequency
v1,1,1.908245
v10,1,0.023271
v100,1,0.28258
v101,1,0.914229
v102,1,2.094415
v103,1,2.094415
v104,1,21.087101
v105,1,0.222739
v108,1,7.948803
v109,1,3.617021


In [14]:
percent = 30
cols_missing = missing_value_df[missing_value_df["Frequency"] > percent]
n_cols_missing = len(cols_missing)
print(f"There are {n_cols_missing} features with over {percent}% missing.")

There are 71 features with over 30% missing.
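
The same percentages can be computed directly from the wide table, without going through the frequency table. The following is a minimal sketch on a toy frame, assuming that a value counts as missing when it starts with a dot. The toy values and the names `pct_missing` and `drop_candidates` are illustrative only.

```python
import pandas as pd

# Toy stand-in for train_no_blanks (hypothetical values).
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "v1": ["0", ".a", "1", "2"],
    "v2": [".a", ".b", ".", "3"],
    "satisfied": [0, 1, 0, 1],
})

# Exclude the identifier and the target from the missingness check.
features = toy.drop(columns=["id", "satisfied"])

# Share of values per column that start with "." (assumed missing marker).
pct_missing = (
    features.astype(str)
    .apply(lambda col: col.str.startswith("."))
    .mean()
    * 100
)

percent = 30
drop_candidates = pct_missing[pct_missing > percent].index.tolist()

print(pct_missing)
print(drop_candidates)
```

On the real data, the resulting list could serve as a cross-check against the threshold-based selection above.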


In [20]:
cols_missing.merge(codebook[['Variable','Label','Unique','Type_codebook_long']], left_on = 'Question', right_on = 'Variable', how = 'left')


Unnamed: 0,Frequency,Variable,Label,Unique,Type_codebook_long
0,35.954122,v11,Ever had children living in ho...,2,double
1,92.06117,v123,"Place of interview: East, West...",2,double
2,64.361702,v151,"Occupation partner, ISCO08",524,double
3,88.889628,v153,What year you first came to li...,88,double
4,86.439495,v158,Main activity last 7 days,9,double
5,49.31516,v160,Legal marital status,6,double
6,94.075798,v164,Partner's main activity last 7...,8,double
7,73.397606,v168,Number of people responsible f...,155,double
8,42.054521,v170,Mother's occupation when respo...,9,double
9,54.498005,v173,Ever had a paid job,2,double


In [21]:
drop_missing = cols_missing.index.get_level_values("Question").to_list()
drop_missing

['v11',
 'v123',
 'v151',
 'v153',
 'v158',
 'v160',
 'v164',
 'v168',
 'v170',
 'v173',
 'v174',
 'v182',
 'v188',
 'v190',
 'v191',
 'v192',
 'v193',
 'v194',
 'v195',
 'v197',
 'v198',
 'v199',
 'v200',
 'v201',
 'v202',
 'v203',
 'v204',
 'v205',
 'v206',
 'v207',
 'v209',
 'v210',
 'v211',
 'v212',
 'v213',
 'v214',
 'v215',
 'v216',
 'v22',
 'v23',
 'v241',
 'v243',
 'v252',
 'v259',
 'v260',
 'v261',
 'v262',
 'v264',
 'v265',
 'v266',
 'v267',
 'v268',
 'v269',
 'v270',
 'v5',
 'v62',
 'v63',
 'v68',
 'v69',
 'v71',
 'v86',
 'v87',
 'v88',
 'v89',
 'v91',
 'v92',
 'v93',
 'v94',
 'v95',
 'v96',
 'v97']

In [23]:
train_v2 = train_no_blanks.drop(columns = drop_missing)
train_v2

Unnamed: 0,id,v1,v2,v3,v4,v6,v7,v8,v9,v10,v12,v13,v14,v15,v16,v17,v18,v19,v20,v21,v24,v25,v26,v27,v28,v29,v30,v31,v32,v33,v34,v35,v36,v37,v38,v39,v40,v41,v42,v43,v44,v45,v46,v47,v48,v49,v50,v51,v52,v53,v54,v55,v56,v57,v58,v59,v60,v61,v64,v65,v66,v67,v70,v72,v73,v74,v75,v76,v77,v78,v79,v80,v81,v82,v83,v84,v85,v90,v98,v99,v100,v101,v102,v103,v104,v105,v106,v107,v108,v109,v110,v111,v112,v113,v114,v115,v116,v117,v118,v119,v120,v121,v122,v124,v125,v126,v127,v128,v129,v130,v131,v132,v133,v134,v135,v136,v137,v138,v139,v140,v141,v142,v143,v144,v145,v146,v147,v148,v149,v150,v152,v154,v155,v156,v157,v159,v161,v162,v163,v165,v166,v167,v169,v171,v172,v175,v176,v177,v178,v179,v180,v181,v183,v184,v185,v186,v187,v189,v196,v208,v217,v218,v219,v220,v221,v222,v223,v224,v225,v226,v227,v228,v229,v230,v231,v232,v233,v234,v235,v236,v237,v238,v239,v240,v242,v244,v245,v246,v247,v248,v249,v250,v251,v253,v254,v255,v256,v257,v258,v263,cntry,satisfied
0,9948,2,2,74,11010,2,2,2,2,1,2,2,1,0,0,66,2,2,AT33,2,1,66,0,0,0,0,1,0,0,0,0,3,0,0,0,0,0,0,0,2,0,0,1,0,0,0,0,0,0,2,0,0,3,322,3,322,3,322,10,3,3,3,1,1,1,2,2,9,1,66,2,1,3,4,3,3,2,.a,5,2,1,2,3,4,2,2,0,0,1,0,3,3,3,3,2,3,6,3,2,2,2,0,0,1,1,12,2,4,4,9,25,107,2015,2015,4,5,3,4,1,5,3,3,2,2,3,5,2,6,5,3341,1,GER,0,3,2,5,66,2,6,1,1,23,4,2,2,0,0,3,10,10,8,1,2,2,2,7,1,1,.a,.a,1,0,2,4,2,1,3,.b,5,1,9,1,4,2015,4,0,5,10,0,2,0,3,0,1,2,0,0,0,0,1,8,40,40,2,2,1,2,2,1941,.a,AT,0
1,25601,4,2,58,11010,2,2,2,2,1,2,2,2,0,0,66,2,2,AT31,2,1,66,0,0,0,0,1,0,0,0,0,4,0,0,0,0,0,0,0,2,0,0,1,0,0,0,0,0,0,.a,0,0,3,322,2,212,2,212,12,3,2,2,1,1,1,2,2,5,1,66,2,1,1,1,5,3,1,.a,4,3,1,2,4,5,2,3,0,0,1,6,3,2,4,3,3,4,4,4,4,1,5,4,2,17,17,19,15,1,1,17,46,75,2015,2015,3,4,3,3,1,1,4,2,2,3,2,2,3,3,1,7132,2,GER,0,5,.b,6,66,3,3,1,1,25,8,2,2,0,0,4,4,7,3,6,4,5,2,5,1,1,.a,.a,0,0,3,4,2,1,5,3,8,5,8,17,1,2015,4,.b,7,7,.b,5,5,.b,1,3,1,1,0,0,0,1,3,39,39,2,2,2,2,2,1957,.a,AT,0
2,8592,6,2,47,11010,2,2,1,2,1,1,1,2,0,0,66,2,3,AT33,2,1,66,0,0,0,0,0,0,0,0,0,4,0,0,0,0,0,0,0,2,0,0,1,0,0,0,0,0,0,2,0,0,2,212,2,212,2,212,9,2,2,2,1,1,1,4,1,2,1,66,1,1,1,1,1,3,2,1,10,2,4,2,1,1,1,3,0,0,1,5,2,3,5,2,2,3,5,2,2,2,5,3,6,18,18,17,28,3,3,16,30,50,2015,2015,2,5,2,1,2,5,2,2,2,2,2,5,2,5,4,9112,1,GER,0,4,.d,1,66,3,1,1,1,56,1,1,2,1,1,3,9,8,8,2,3,2,2,3,1,8,1,1,0,0,2,5,2,1,6,6,8,6,8,18,3,2015,4,5,9,9,8,5,6,4,5,6,2,0,0,0,0,1,8,30,35,4,2,1,2,2,1968,1963,AT,1
3,29593,10,2,22,11010,2,2,2,2,1,2,2,2,0,0,66,2,10,AT12,2,1,66,0,0,0,0,1,0,0,0,0,2,0,0,0,0,0,0,0,2,0,0,1,0,0,0,0,0,0,2,0,0,3,322,3,322,3,322,6,3,3,3,1,1,1,3,2,0,1,66,1,2,2,1,1,1,1,.a,8,1,1,2,1,1,2,3,1,0,1,0,4,4,2,2,2,1,3,1,1,4,0,0,3,7,7,8,46,4,4,7,51,44,2015,2015,0,1,4,2,6,3,1,4,1,3,2,3,1,3,6,7231,2,GER,0,10,1,4,66,3,1,1,1,45,6,2,2,1,0,3,5,0,5,7,0,0,2,6,1,5,.a,.a,0,0,3,6,2,1,0,0,0,0,8,7,4,2015,4,0,0,0,0,0,0,0,0,7,2,0,0,0,0,2,0,40,50,3,2,1,2,2,1993,.a,AT,0
4,4252,0,1,24,11010,2,2,2,2,1,2,1,1,0,0,66,1,2,AT12,2,1,66,0,0,0,0,1,0,0,0,0,3,0,0,0,0,0,0,0,2,0,0,1,0,0,0,0,0,0,2,0,0,4,423,3,322,3,322,13,5,3,3,1,1,3,4,2,8,1,66,1,1,1,1,1,1,1,.a,8,1,1,1,1,1,3,3,0,0,1,7,2,2,6,1,2,3,4,1,6,2,7,8,3,29,29,19,20,3,3,18,25,42,2015,2015,6,6,2,2,1,3,2,2,1,2,3,3,2,3,3,4311,2,GER,0,4,2,6,66,3,1,1,1,66,6,2,2,1,0,2,6,8,5,6,5,5,2,5,1,5,.a,.a,0,0,3,6,1,2,9,6,8,6,8,29,3,2015,4,2,6,9,7,7,7,5,2,4,2,0,0,0,0,1,8,38,38,1,2,1,2,2,1991,.a,AT,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
30075,34440,0,1,72,14120,2,2,2,2,1,2,1,2,0,0,66,2,4,SI017,2,1,66,0,0,0,0,0,0,0,0,0,4,0,0,0,0,0,0,0,2,0,0,1,0,0,0,0,0,0,2,0,0,3,321,2,213,2,213,11,3,2,2,2,4,3,2,1,8,1,66,2,2,1,1,3,2,1,2,6,2,2,1,3,4,8,3,0,0,1,5,2,2,2,2,2,2,3,2,2,2,6,3,3,29,29,16,48,11,11,16,23,25,2014,2014,10,5,2,2,2,3,2,2,2,2,2,3,2,2,2,1219,1,SLV,0,5,2,1,66,2,6,1,1,49,.a,1,2,0,0,2,4,2,5,7,0,0,3,7,1,3,1,1,1,1,3,5,2,2,8,5,7,5,7,29,11,2014,5,0,0,3,0,2,0,3,2,7,2,0,0,0,0,1,10,40,60,2,2,.a,.,2,1942,1945,SI,1
30076,13566,0,1,38,14120,2,2,2,2,1,2,1,2,0,0,66,2,0,SI021,2,1,66,0,0,0,0,1,0,0,0,0,4,0,0,0,0,0,0,0,2,0,0,1,0,0,0,0,0,0,2,0,0,3,323,3,321,3,321,11,4,3,3,1,1,1,4,1,5,1,66,1,1,1,2,.b,2,1,.a,7,2,1,2,1,1,4,3,0,0,1,2,4,4,2,1,1,2,3,2,3,4,3,.b,3,17,17,16,16,11,11,15,46,30,2014,2014,6,3,1,2,3,5,2,2,2,2,2,2,2,3,3,3322,2,SLV,0,5,2,6,66,3,1,1,1,47,7,2,2,1,0,3,7,5,3,7,0,0,3,7,2,5,.a,.a,0,0,3,7,2,1,0,1,5,1,5,17,11,2014,4,2,5,5,2,2,2,2,2,6,2,0,0,0,0,2,6,40,40,4,2,1,.,2,1976,.a,SI,1
30077,29824,5,2,49,14120,2,2,2,2,1,1,1,2,0,0,66,2,5,SI011,2,1,66,0,0,0,0,0,0,0,0,0,5,0,0,0,0,0,0,0,2,0,0,1,0,0,0,0,0,0,2,0,0,2,213,2,213,2,213,8,2,2,2,1,1,1,3,4,4,1,66,1,1,1,1,2,2,1,2,6,3,4,2,1,1,1,3,0,0,1,4,4,4,1,1,3,4,4,2,1,3,3,4,2,5,5,16,6,11,11,15,43,23,2014,2014,8,4,2,2,1,2,2,1,2,2,2,2,2,1,2,8160,2,SLV,0,6,2,1,66,1,1,1,1,10,7,1,2,1,0,3,3,4,4,3,5,3,3,4,2,6,1,1,0,0,3,5,2,1,4,4,4,3,5,5,11,2014,4,2,6,5,2,3,1,6,0,2,1,0,1,0,0,1,7,40,48,3,2,1,.,2,1965,1967,SI,0
30078,9573,0,1,16,14120,2,2,2,2,1,2,1,2,0,0,66,2,0,SI018,2,1,66,0,0,0,0,1,0,0,0,0,3,0,0,0,0,0,0,0,2,0,0,1,0,0,0,0,0,0,2,1,0,2,213,3,313,3,323,10,2,4,4,.a,1,1,3,.a,5,1,66,1,1,1,3,4,2,1,2,10,1,3,1,1,1,.b,3,0,0,.a,2,1,1,3,3,1,1,5,2,3,1,5,5,3,19,19,17,7,11,11,16,37,30,2014,2014,.a,2,2,3,1,3,1,3,1,3,2,1,2,3,2,.a,.a,SLV,0,5,2,6,66,3,2,1,1,.a,5,2,2,0,0,4,5,5,1,7,1,.b,3,5,2,0,3,3,0,0,2,7,2,1,2,0,3,0,5,19,11,2014,.a,0,2,2,0,0,0,3,0,3,2,0,0,0,0,3,.a,.a,.a,3,.a,.a,.,2,1998,1972,SI,1


### Categorical and Numeric Variables


In [24]:
# Recode every missing-value marker (".", ".a", ".b", ".c", ".d") to the single sentinel -1
train_v2 = train_v2.replace([".", ".a", ".b", ".c", ".d"], [-1, -1, -1, -1, -1])
train_v2.head()

Unnamed: 0,id,v1,v2,v3,v4,v6,v7,v8,v9,v10,v12,v13,v14,v15,v16,v17,v18,v19,v20,v21,v24,v25,v26,v27,v28,v29,v30,v31,v32,v33,v34,v35,v36,v37,v38,v39,v40,v41,v42,v43,v44,v45,v46,v47,v48,v49,v50,v51,v52,v53,v54,v55,v56,v57,v58,v59,v60,v61,v64,v65,v66,v67,v70,v72,v73,v74,v75,v76,v77,v78,v79,v80,v81,v82,v83,v84,v85,v90,v98,v99,v100,v101,v102,v103,v104,v105,v106,v107,v108,v109,v110,v111,v112,v113,v114,v115,v116,v117,v118,v119,v120,v121,v122,v124,v125,v126,v127,v128,v129,v130,v131,v132,v133,v134,v135,v136,v137,v138,v139,v140,v141,v142,v143,v144,v145,v146,v147,v148,v149,v150,v152,v154,v155,v156,v157,v159,v161,v162,v163,v165,v166,v167,v169,v171,v172,v175,v176,v177,v178,v179,v180,v181,v183,v184,v185,v186,v187,v189,v196,v208,v217,v218,v219,v220,v221,v222,v223,v224,v225,v226,v227,v228,v229,v230,v231,v232,v233,v234,v235,v236,v237,v238,v239,v240,v242,v244,v245,v246,v247,v248,v249,v250,v251,v253,v254,v255,v256,v257,v258,v263,cntry,satisfied
0,9948,2,2,74,11010,2,2,2,2,1,2,2,1,0,0,66,2,2,AT33,2,1,66,0,0,0,0,1,0,0,0,0,3,0,0,0,0,0,0,0,2,0,0,1,0,0,0,0,0,0,2,0,0,3,322,3,322,3,322,10,3,3,3,1,1,1,2,2,9,1,66,2,1,3,4,3,3,2,-1,5,2,1,2,3,4,2,2,0,0,1,0,3,3,3,3,2,3,6,3,2,2,2,0,0,1,1,12,2,4,4,9,25,107.0,2015,2015,4,5,3,4,1,5,3,3,2,2,3,5,2,6,5,3341,1,GER,0,3,2,5,66,2,6,1,1,23,4,2,2,0,0,3,10,10,8,1,2,2,2,7,1,1,-1,-1,1,0,2,4,2,1,3,-1,5,1,9,1,4,2015,4,0,5,10,0,2,0,3,0,1,2,0,0,0,0,1,8,40,40,2,2,1,2,2,1941,-1,AT,0
1,25601,4,2,58,11010,2,2,2,2,1,2,2,2,0,0,66,2,2,AT31,2,1,66,0,0,0,0,1,0,0,0,0,4,0,0,0,0,0,0,0,2,0,0,1,0,0,0,0,0,0,-1,0,0,3,322,2,212,2,212,12,3,2,2,1,1,1,2,2,5,1,66,2,1,1,1,5,3,1,-1,4,3,1,2,4,5,2,3,0,0,1,6,3,2,4,3,3,4,4,4,4,1,5,4,2,17,17,19,15,1,1,17,46,75.0,2015,2015,3,4,3,3,1,1,4,2,2,3,2,2,3,3,1,7132,2,GER,0,5,-1,6,66,3,3,1,1,25,8,2,2,0,0,4,4,7,3,6,4,5,2,5,1,1,-1,-1,0,0,3,4,2,1,5,3,8,5,8,17,1,2015,4,-1,7,7,-1,5,5,-1,1,3,1,1,0,0,0,1,3,39,39,2,2,2,2,2,1957,-1,AT,0
2,8592,6,2,47,11010,2,2,1,2,1,1,1,2,0,0,66,2,3,AT33,2,1,66,0,0,0,0,0,0,0,0,0,4,0,0,0,0,0,0,0,2,0,0,1,0,0,0,0,0,0,2,0,0,2,212,2,212,2,212,9,2,2,2,1,1,1,4,1,2,1,66,1,1,1,1,1,3,2,1,10,2,4,2,1,1,1,3,0,0,1,5,2,3,5,2,2,3,5,2,2,2,5,3,6,18,18,17,28,3,3,16,30,50.0,2015,2015,2,5,2,1,2,5,2,2,2,2,2,5,2,5,4,9112,1,GER,0,4,-1,1,66,3,1,1,1,56,1,1,2,1,1,3,9,8,8,2,3,2,2,3,1,8,1,1,0,0,2,5,2,1,6,6,8,6,8,18,3,2015,4,5,9,9,8,5,6,4,5,6,2,0,0,0,0,1,8,30,35,4,2,1,2,2,1968,1963,AT,1
3,29593,10,2,22,11010,2,2,2,2,1,2,2,2,0,0,66,2,10,AT12,2,1,66,0,0,0,0,1,0,0,0,0,2,0,0,0,0,0,0,0,2,0,0,1,0,0,0,0,0,0,2,0,0,3,322,3,322,3,322,6,3,3,3,1,1,1,3,2,0,1,66,1,2,2,1,1,1,1,-1,8,1,1,2,1,1,2,3,1,0,1,0,4,4,2,2,2,1,3,1,1,4,0,0,3,7,7,8,46,4,4,7,51,44.0,2015,2015,0,1,4,2,6,3,1,4,1,3,2,3,1,3,6,7231,2,GER,0,10,1,4,66,3,1,1,1,45,6,2,2,1,0,3,5,0,5,7,0,0,2,6,1,5,-1,-1,0,0,3,6,2,1,0,0,0,0,8,7,4,2015,4,0,0,0,0,0,0,0,0,7,2,0,0,0,0,2,0,40,50,3,2,1,2,2,1993,-1,AT,0
4,4252,0,1,24,11010,2,2,2,2,1,2,1,1,0,0,66,1,2,AT12,2,1,66,0,0,0,0,1,0,0,0,0,3,0,0,0,0,0,0,0,2,0,0,1,0,0,0,0,0,0,2,0,0,4,423,3,322,3,322,13,5,3,3,1,1,3,4,2,8,1,66,1,1,1,1,1,1,1,-1,8,1,1,1,1,1,3,3,0,0,1,7,2,2,6,1,2,3,4,1,6,2,7,8,3,29,29,19,20,3,3,18,25,42.0,2015,2015,6,6,2,2,1,3,2,2,1,2,3,3,2,3,3,4311,2,GER,0,4,2,6,66,3,1,1,1,66,6,2,2,1,0,2,6,8,5,6,5,5,2,5,1,5,-1,-1,0,0,3,6,1,2,9,6,8,6,8,29,3,2015,4,2,6,9,7,7,7,5,2,4,2,0,0,0,0,1,8,38,38,1,2,1,2,2,1991,-1,AT,1
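
Recoding all five markers to -1 discards the distinction between them, so it can help to tally them per column first. The sketch below uses a small made-up frame (`toy`, with illustrative values only) in place of `train_v2`; it is one possible check, not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for train_v2; column names and values are illustrative only.
toy = pd.DataFrame({
    "v1": ["2", ".a", "4", "."],
    "v2": ["1", "2", ".b", "3"],
    "cntry": ["AT", "AT", "SI", "SI"],
})

missing_codes = [".", ".a", ".b", ".c", ".d"]

# Count how many cells per column hold any of the missing-value markers.
missing_counts = toy.isin(missing_codes).sum()

# Recode every marker to the sentinel -1, mirroring the cell above.
recoded = toy.replace(missing_codes, -1)

print(missing_counts)
```

Columns with a large count here are the ones whose summary statistics will be most affected by the -1 sentinel later on.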


In [40]:
# Build a long table of response frequencies: one row per (question, response) pair
frequency_frames = []

for i in list(train_v2)[1:]:  # skip the first column, id
    grouped_data = train_v2.groupby(i)['id'].count()
    num_unique_answers = grouped_data.size
    temp_dict = {'Question': [i]*num_unique_answers,
                 'Response': list(grouped_data.index),
                 'Frequency': list(grouped_data)}
    frequency_frames.append(pd.DataFrame(temp_dict))

# DataFrame.append was removed in pandas 2.0, so concatenate the pieces once
frequency_train_v2 = pd.concat(frequency_frames, ignore_index=True)
# Attach the codebook label, unique count and type to each question
frequency_train_v2 = frequency_train_v2.merge(codebook[['Variable','Label','Unique','Type_codebook_long']], left_on = 'Question', right_on = 'Variable', how = 'left')
frequency_train_v2

Unnamed: 0,Question,Response,Frequency,Variable,Label,Unique,Type_codebook_long
0,v1,-1,574,v1,Able to take active role in po...,11,double
1,v1,0,6405,v1,Able to take active role in po...,11,double
2,v1,1,2432,v1,Able to take active role in po...,11,double
3,v1,10,903,v1,Able to take active role in po...,11,double
4,v1,2,3251,v1,Able to take active role in po...,11,double
...,...,...,...,...,...,...,...
4031,cntry,PT,992,cntry,Country,21,string
4032,cntry,SE,1434,cntry,Country,21,string
4033,cntry,SI,950,cntry,Country,21,string
4034,satisfied,0,14454,satisfied,Target,2,float
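
The same frequency table can also be built without an explicit loop by reshaping to long format and counting groups. This is a minimal sketch on a made-up frame (`toy`); responses are kept as strings here to sidestep mixed-type sorting, which is an assumption of the example rather than something the notebook does.

```python
import pandas as pd

# Toy stand-in for train_v2; values are illustrative only.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "v1": ["0", "0", "1", "-1"],
    "cntry": ["AT", "AT", "SI", "SI"],
})

# One row per (respondent, question) pair.
long_df = toy.melt(id_vars="id", var_name="Question", value_name="Response")

# Count how often each response occurs within each question.
freq = (
    long_df.groupby(["Question", "Response"])
           .size()
           .reset_index(name="Frequency")
)

print(freq)
```

The result has the same `Question`, `Response` and `Frequency` columns as the loop-based version, so the codebook merge can follow unchanged.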


In [26]:
cat_cols = ["v17", "v20", "v25", "v78", "v154", "v155", "v161", "cntry"]
cat_df = train_v2[cat_cols]
cat_df.head()

Unnamed: 0,v17,v20,v25,v78,v154,v155,v161,cntry
0,66,AT33,66,66,GER,0,66,AT
1,66,AT31,66,66,GER,0,66,AT
2,66,AT33,66,66,GER,0,66,AT
3,66,AT12,66,66,GER,0,66,AT
4,66,AT12,66,66,GER,0,66,AT


In [41]:
# concerning responses: 2, 3, 4, 6, 66, 77, 88, 99
frequency_train_v2[frequency_train_v2["Question"]=="v17"]

Unnamed: 0,Question,Response,Frequency,Variable,Label,Unique,Type_codebook_long
315,v17,2,25,v17,Country of birth,172,string
316,v17,3,8,v17,Country of birth,172,string
317,v17,4,9,v17,Country of birth,172,string
318,v17,6,1,v17,Country of birth,172,string
319,v17,66,26687,v17,Country of birth,172,string
320,v17,77,4,v17,Country of birth,172,string
321,v17,88,1,v17,Country of birth,172,string
322,v17,99,181,v17,Country of birth,172,string
323,v17,AF,14,v17,Country of birth,172,string
324,v17,AL,3,v17,Country of birth,172,string


In [42]:
# concerning responses: 99999
frequency_train_v2[frequency_train_v2["Question"]=="v20"]

Unnamed: 0,Question,Response,Frequency,Variable,Label,Unique,Type_codebook_long
497,v20,99999,3,v20,Region,251,string
498,v20,AT11,48,v20,Region,251,string
499,v20,AT12,257,v20,Region,251,string
500,v20,AT13,263,v20,Region,251,string
501,v20,AT21,92,v20,Region,251,string
502,v20,AT22,204,v20,Region,251,string
503,v20,AT31,248,v20,Region,251,string
504,v20,AT32,96,v20,Region,251,string
505,v20,AT33,111,v20,Region,251,string
506,v20,AT34,68,v20,Region,251,string


In [43]:
# concerning responses: 6, 65, 66, 88, 99
frequency_train_v2[frequency_train_v2["Question"]=="v25"]

Unnamed: 0,Question,Response,Frequency,Variable,Label,Unique,Type_codebook_long
749,v25,6,1,v25,Citizenship,130,string
750,v25,65,140,v25,Citizenship,130,string
751,v25,66,28515,v25,Citizenship,130,string
752,v25,88,17,v25,Citizenship,130,string
753,v25,99,122,v25,Citizenship,130,string
754,v25,AF,6,v25,Citizenship,130,string
755,v25,AL,2,v25,Citizenship,130,string
756,v25,AM,2,v25,Citizenship,130,string
757,v25,AO,3,v25,Citizenship,130,string
758,v25,AR,3,v25,Citizenship,130,string


In [44]:
# concerning responses: 2, 3, 4, 6, 66, 77, 88, 99
frequency_train_v2[frequency_train_v2["Question"]=="v78"]

Unnamed: 0,Question,Response,Frequency,Variable,Label,Unique,Type_codebook_long
1151,v78,2,37,v78,"Country of birth, father",171,string
1152,v78,3,20,v78,"Country of birth, father",171,string
1153,v78,4,35,v78,"Country of birth, father",171,string
1154,v78,6,1,v78,"Country of birth, father",171,string
1155,v78,66,24893,v78,"Country of birth, father",171,string
1156,v78,77,3,v78,"Country of birth, father",171,string
1157,v78,88,43,v78,"Country of birth, father",171,string
1158,v78,99,197,v78,"Country of birth, father",171,string
1159,v78,AF,17,v78,"Country of birth, father",171,string
1160,v78,AG,1,v78,"Country of birth, father",171,string


In [45]:
# concerning responses: 777, 888, 999
frequency_train_v2[frequency_train_v2["Question"]=="v154"]

Unnamed: 0,Question,Response,Frequency,Variable,Label,Unique,Type_codebook_long
2657,v154,777,11,v154,Language most often spoken at ...,123,string
2658,v154,888,55,v154,Language most often spoken at ...,123,string
2659,v154,999,177,v154,Language most often spoken at ...,123,string
2660,v154,ABK,1,v154,Language most often spoken at ...,123,string
2661,v154,AKA,1,v154,Language most often spoken at ...,123,string
2662,v154,ALB,32,v154,Language most often spoken at ...,123,string
2663,v154,AMH,16,v154,Language most often spoken at ...,123,string
2664,v154,APA,16,v154,Language most often spoken at ...,123,string
2665,v154,ARA,454,v154,Language most often spoken at ...,123,string
2666,v154,ARM,10,v154,Language most often spoken at ...,123,string


In [46]:
# concerning responses: 0, 777, 888, 999
frequency_train_v2[frequency_train_v2["Question"]=="v155"]

Unnamed: 0,Question,Response,Frequency,Variable,Label,Unique,Type_codebook_long
2766,v155,0,23837,v155,Language most often spoken at ...,129,string
2767,v155,777,12,v155,Language most often spoken at ...,129,string
2768,v155,888,26,v155,Language most often spoken at ...,129,string
2769,v155,999,1815,v155,Language most often spoken at ...,129,string
2770,v155,ALB,15,v155,Language most often spoken at ...,129,string
2771,v155,AMH,17,v155,Language most often spoken at ...,129,string
2772,v155,APA,5,v155,Language most often spoken at ...,129,string
2773,v155,ARA,121,v155,Language most often spoken at ...,129,string
2774,v155,ARM,1,v155,Language most often spoken at ...,129,string
2775,v155,BAQ,14,v155,Language most often spoken at ...,129,string


In [47]:
# concerning responses: 2, 3, 4, 66, 77, 88, 99
frequency_train_v2[frequency_train_v2["Question"]=="v161"]

Unnamed: 0,Question,Response,Frequency,Variable,Label,Unique,Type_codebook_long
2913,v161,2,38,v161,"Country of birth, mother",173,string
2914,v161,3,18,v161,"Country of birth, mother",173,string
2915,v161,4,25,v161,"Country of birth, mother",173,string
2916,v161,66,25127,v161,"Country of birth, mother",173,string
2917,v161,77,4,v161,"Country of birth, mother",173,string
2918,v161,88,32,v161,"Country of birth, mother",173,string
2919,v161,99,190,v161,"Country of birth, mother",173,string
2920,v161,AD,1,v161,"Country of birth, mother",173,string
2921,v161,AF,18,v161,"Country of birth, mother",173,string
2922,v161,AL,5,v161,"Country of birth, mother",173,string


In [48]:
# concerning responses: none
frequency_train_v2[frequency_train_v2["Question"]=="cntry"]

Unnamed: 0,Question,Response,Frequency,Variable,Label,Unique,Type_codebook_long
4013,cntry,AT,1387,cntry,Country,21,string
4014,cntry,BE,9,cntry,Country,21,string
4015,cntry,CH,1189,cntry,Country,21,string
4016,cntry,CZ,1683,cntry,Country,21,string
4017,cntry,DE,2388,cntry,Country,21,string
4018,cntry,DK,1193,cntry,Country,21,string
4019,cntry,EE,1606,cntry,Country,21,string
4020,cntry,ES,1520,cntry,Country,21,string
4021,cntry,FI,1653,cntry,Country,21,string
4022,cntry,FR,1502,cntry,Country,21,string
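
The concerning responses noted in the cells above are all purely numeric codes sitting in columns that otherwise hold letter codes. A sketch of how they could be listed programmatically follows; the helper name `numeric_like_responses` and the toy frame are hypothetical, and the -1 sentinel is deliberately not matched because of its minus sign.

```python
import pandas as pd

# Toy stand-in for the categorical columns; values are illustrative only.
toy = pd.DataFrame({
    "v17": ["66", "AF", "99", "AL", "6"],
    "cntry": ["AT", "BE", "CH", "CZ", "DE"],
})

def numeric_like_responses(df, columns):
    """Return, per column, the responses that consist only of digits."""
    flagged = {}
    for col in columns:
        values = df[col].astype(str)
        # fullmatch requires the whole string to be digits, so "AT33" is not flagged.
        mask = values.str.fullmatch(r"\d+")
        flagged[col] = sorted(values[mask].unique(), key=int)
    return flagged

flagged = numeric_like_responses(toy, ["v17", "cntry"])
print(flagged)
```

Whether a flagged code means "not applicable", "refused" or "don't know" still has to be looked up in the codebook; the helper only locates candidates.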


In [50]:
train_v2.loc[4483]

id           32251
v1           0    
v2           2    
v3           15   
v4           14080
v6           2    
v7           2    
v8           2    
v9           -1   
v10          2    
v12          2    
v13          1    
v14          2    
v15          0    
v16          0    
v17          6    
v18          2    
v19          0    
v20          DE9  
v21          2    
v24          2    
v25          6    
v26          0    
v27          0    
v28          0    
v29          0    
v30          1    
v31          0    
v32          0    
v33          0    
v34          0    
v35          4    
v36          0    
v37          0    
v38          0    
v39          0    
v40          0    
v41          0    
v42          0    
v43          2    
v44          0    
v45          0    
v46          1    
v47          0    
v48          0    
v49          0    
v50          0    
v51          0    
v52          0    
v53          2    
v54          1    
v55          0    
v56         

In [51]:
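# Drop row 4483, the single respondent coded 6 for both v17 and v25 (see the frequency tables above)
train_v4 = train_v2.drop(4483)
# Treat every column that is not in cat_cols as numeric
num_cols = train_v4.loc[:, ~train_v4.columns.isin(cat_cols)].columns.tolist()
train_v5 = train_v4.copy()
# Cast the numeric columns to integers
train_v5[num_cols] = train_v4[num_cols].astype("int64")
train_v5.head()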

Unnamed: 0,id,v1,v2,v3,v4,v6,v7,v8,v9,v10,v12,v13,v14,v15,v16,v17,v18,v19,v20,v21,v24,v25,v26,v27,v28,v29,v30,v31,v32,v33,v34,v35,v36,v37,v38,v39,v40,v41,v42,v43,v44,v45,v46,v47,v48,v49,v50,v51,v52,v53,v54,v55,v56,v57,v58,v59,v60,v61,v64,v65,v66,v67,v70,v72,v73,v74,v75,v76,v77,v78,v79,v80,v81,v82,v83,v84,v85,v90,v98,v99,v100,v101,v102,v103,v104,v105,v106,v107,v108,v109,v110,v111,v112,v113,v114,v115,v116,v117,v118,v119,v120,v121,v122,v124,v125,v126,v127,v128,v129,v130,v131,v132,v133,v134,v135,v136,v137,v138,v139,v140,v141,v142,v143,v144,v145,v146,v147,v148,v149,v150,v152,v154,v155,v156,v157,v159,v161,v162,v163,v165,v166,v167,v169,v171,v172,v175,v176,v177,v178,v179,v180,v181,v183,v184,v185,v186,v187,v189,v196,v208,v217,v218,v219,v220,v221,v222,v223,v224,v225,v226,v227,v228,v229,v230,v231,v232,v233,v234,v235,v236,v237,v238,v239,v240,v242,v244,v245,v246,v247,v248,v249,v250,v251,v253,v254,v255,v256,v257,v258,v263,cntry,satisfied
0,9948,2,2,74,11010,2,2,2,2,1,2,2,1,0,0,66,2,2,AT33,2,1,66,0,0,0,0,1,0,0,0,0,3,0,0,0,0,0,0,0,2,0,0,1,0,0,0,0,0,0,2,0,0,3,322,3,322,3,322,10,3,3,3,1,1,1,2,2,9,1,66,2,1,3,4,3,3,2,-1,5,2,1,2,3,4,2,2,0,0,1,0,3,3,3,3,2,3,6,3,2,2,2,0,0,1,1,12,2,4,4,9,25,107,2015,2015,4,5,3,4,1,5,3,3,2,2,3,5,2,6,5,3341,1,GER,0,3,2,5,66,2,6,1,1,23,4,2,2,0,0,3,10,10,8,1,2,2,2,7,1,1,-1,-1,1,0,2,4,2,1,3,-1,5,1,9,1,4,2015,4,0,5,10,0,2,0,3,0,1,2,0,0,0,0,1,8,40,40,2,2,1,2,2,1941,-1,AT,0
1,25601,4,2,58,11010,2,2,2,2,1,2,2,2,0,0,66,2,2,AT31,2,1,66,0,0,0,0,1,0,0,0,0,4,0,0,0,0,0,0,0,2,0,0,1,0,0,0,0,0,0,-1,0,0,3,322,2,212,2,212,12,3,2,2,1,1,1,2,2,5,1,66,2,1,1,1,5,3,1,-1,4,3,1,2,4,5,2,3,0,0,1,6,3,2,4,3,3,4,4,4,4,1,5,4,2,17,17,19,15,1,1,17,46,75,2015,2015,3,4,3,3,1,1,4,2,2,3,2,2,3,3,1,7132,2,GER,0,5,-1,6,66,3,3,1,1,25,8,2,2,0,0,4,4,7,3,6,4,5,2,5,1,1,-1,-1,0,0,3,4,2,1,5,3,8,5,8,17,1,2015,4,-1,7,7,-1,5,5,-1,1,3,1,1,0,0,0,1,3,39,39,2,2,2,2,2,1957,-1,AT,0
2,8592,6,2,47,11010,2,2,1,2,1,1,1,2,0,0,66,2,3,AT33,2,1,66,0,0,0,0,0,0,0,0,0,4,0,0,0,0,0,0,0,2,0,0,1,0,0,0,0,0,0,2,0,0,2,212,2,212,2,212,9,2,2,2,1,1,1,4,1,2,1,66,1,1,1,1,1,3,2,1,10,2,4,2,1,1,1,3,0,0,1,5,2,3,5,2,2,3,5,2,2,2,5,3,6,18,18,17,28,3,3,16,30,50,2015,2015,2,5,2,1,2,5,2,2,2,2,2,5,2,5,4,9112,1,GER,0,4,-1,1,66,3,1,1,1,56,1,1,2,1,1,3,9,8,8,2,3,2,2,3,1,8,1,1,0,0,2,5,2,1,6,6,8,6,8,18,3,2015,4,5,9,9,8,5,6,4,5,6,2,0,0,0,0,1,8,30,35,4,2,1,2,2,1968,1963,AT,1
3,29593,10,2,22,11010,2,2,2,2,1,2,2,2,0,0,66,2,10,AT12,2,1,66,0,0,0,0,1,0,0,0,0,2,0,0,0,0,0,0,0,2,0,0,1,0,0,0,0,0,0,2,0,0,3,322,3,322,3,322,6,3,3,3,1,1,1,3,2,0,1,66,1,2,2,1,1,1,1,-1,8,1,1,2,1,1,2,3,1,0,1,0,4,4,2,2,2,1,3,1,1,4,0,0,3,7,7,8,46,4,4,7,51,44,2015,2015,0,1,4,2,6,3,1,4,1,3,2,3,1,3,6,7231,2,GER,0,10,1,4,66,3,1,1,1,45,6,2,2,1,0,3,5,0,5,7,0,0,2,6,1,5,-1,-1,0,0,3,6,2,1,0,0,0,0,8,7,4,2015,4,0,0,0,0,0,0,0,0,7,2,0,0,0,0,2,0,40,50,3,2,1,2,2,1993,-1,AT,0
4,4252,0,1,24,11010,2,2,2,2,1,2,1,1,0,0,66,1,2,AT12,2,1,66,0,0,0,0,1,0,0,0,0,3,0,0,0,0,0,0,0,2,0,0,1,0,0,0,0,0,0,2,0,0,4,423,3,322,3,322,13,5,3,3,1,1,3,4,2,8,1,66,1,1,1,1,1,1,1,-1,8,1,1,1,1,1,3,3,0,0,1,7,2,2,6,1,2,3,4,1,6,2,7,8,3,29,29,19,20,3,3,18,25,42,2015,2015,6,6,2,2,1,3,2,2,1,2,3,3,2,3,3,4311,2,GER,0,4,2,6,66,3,1,1,1,66,6,2,2,1,0,2,6,8,5,6,5,5,2,5,1,5,-1,-1,0,0,3,6,1,2,9,6,8,6,8,29,3,2015,4,2,6,9,7,7,7,5,2,4,2,0,0,0,0,1,8,38,38,1,2,1,2,2,1991,-1,AT,1


In [52]:
train_v5.describe()

Unnamed: 0,id,v1,v2,v3,v4,v6,v7,v8,v9,v10,v12,v13,v14,v15,v16,v18,v19,v21,v24,v26,v27,v28,v29,v30,v31,v32,v33,v34,v35,v36,v37,v38,v39,v40,v41,v42,v43,v44,v45,v46,v47,v48,v49,v50,v51,v52,v53,v54,v55,v56,v57,v58,v59,v60,v61,v64,v65,v66,v67,v70,v72,v73,v74,v75,v76,v77,v79,v80,v81,v82,v83,v84,v85,v90,v98,v99,v100,v101,v102,v103,v104,v105,v106,v107,v108,v109,v110,v111,v112,v113,v114,v115,v116,v117,v118,v119,v120,v121,v122,v124,v125,v126,v127,v128,v129,v130,v131,v132,v133,v134,v135,v136,v137,v138,v139,v140,v141,v142,v143,v144,v145,v146,v147,v148,v149,v150,v152,v156,v157,v159,v162,v163,v165,v166,v167,v169,v171,v172,v175,v176,v177,v178,v179,v180,v181,v183,v184,v185,v186,v187,v189,v196,v208,v217,v218,v219,v220,v221,v222,v223,v224,v225,v226,v227,v228,v229,v230,v231,v232,v233,v234,v235,v236,v237,v238,v239,v240,v242,v244,v245,v246,v247,v248,v249,v250,v251,v253,v254,v255,v256,v257,v258,v263,satisfied
count,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0,30079.0
mean,19702.339838,3.550018,1.89481,49.319558,15493.313707,1.67582,1.90638,1.788723,1.893946,1.112105,1.645733,1.534293,1.436118,0.001895,0.000332,1.830945,3.802154,1.827188,1.050234,0.000465,0.001363,0.000831,0.003424,0.408458,0.014761,0.005652,0.000831,0.001629,2.842814,0.034875,0.013863,0.013498,0.000898,0.007447,0.010273,0.010506,1.890854,0.012234,0.000133,0.916387,0.020446,0.018385,0.013731,0.000233,0.016889,0.005752,1.816982,0.099671,0.012467,3.432627,394.340437,2.234981,261.60617,2.148276,242.506367,12.74926,4.014329,2.524918,2.411018,0.974201,1.432195,1.874597,2.897703,2.174407,4.320722,1.15968,1.428571,1.632135,1.3796,1.521493,1.992719,2.050068,1.529971,0.946142,7.333522,2.18734,2.619967,1.925263,1.871339,2.226171,3.960338,2.653047,0.142126,0.084311,0.880515,4.743309,2.341567,2.514944,2.889557,1.997939,2.064763,2.900429,3.99498,2.232355,2.659796,2.021111,5.369161,4.807407,2.781542,15.842747,15.844775,14.469962,28.092523,7.740184,7.74301,13.563416,23.923003,55.002294,2014.034609,2014.436351,3.86336,3.748196,2.524353,2.447056,1.945909,3.025865,2.750557,2.076332,1.807307,2.558729,3.043618,3.114598,2.167459,3.050168,2.255893,4473.52033,1.48612,4.401509,1.33555,3.120316,2.382227,3.142824,1.159281,1.130988,51.397753,4.741913,1.401011,1.919811,0.522956,0.357525,2.55966,5.753881,5.127431,5.220253,4.764121,3.287044,3.390139,2.402972,5.470195,1.404601,4.288806,0.981316,1.005885,0.271252,0.147146,2.630407,4.693806,1.739386,1.753948,5.045912,4.664317,5.383789,4.029988,5.556601,15.716147,7.680442,1999.292064,3.101001,3.712756,5.236544,6.205692,3.487849,4.356827,3.466605,4.572326,1.820639,4.100569,1.702783,0.042555,0.01905,0.019848,0.008045,1.373018,5.606004,31.202267,34.143156,2.912763,1.704345,0.884803,1.729612,1.949234,1960.290136,1523.335982,0.519465
std,11357.643104,2.972861,0.819934,18.890611,25512.619384,0.505174,0.314677,0.442007,0.421893,0.317706,0.478715,0.747458,0.609427,0.043491,0.018231,0.391722,2.961652,0.39663,0.226942,0.021569,0.036895,0.028818,0.058418,0.491557,0.120597,0.074967,0.028818,0.040329,1.232563,0.183466,0.116926,0.115395,0.029948,0.085976,0.100835,0.101959,0.38562,0.109933,0.011531,0.276811,0.141523,0.134341,0.116372,0.015254,0.128857,0.075622,0.436601,0.299566,0.11096,3.129777,340.590337,3.65381,380.868236,3.000735,316.297965,4.156422,3.355318,3.859082,3.211972,0.701276,0.965036,1.037449,0.924114,1.678267,2.987935,0.413872,0.690052,0.803869,0.714979,0.704206,1.28717,1.124544,0.502495,1.112329,1.993641,0.933299,1.404812,0.874535,1.360144,1.78542,3.566285,0.623615,0.349185,0.277859,0.585657,2.634058,1.042178,1.110807,1.504002,1.137263,1.203065,1.501676,1.55486,1.334643,1.494629,0.977872,2.739389,2.526417,1.494627,8.794138,8.792951,4.329882,18.282493,3.598918,3.59626,4.212382,18.947306,23.358002,28.467078,0.49594,3.779278,1.650559,1.375141,1.382634,1.145871,1.561896,1.463039,1.119479,1.029312,1.385044,1.533117,1.562749,1.304489,1.530757,1.204219,2763.09893,0.856713,2.84674,1.002661,2.279797,0.833713,2.568047,0.385327,0.378099,31.145292,3.311456,0.512397,0.292797,0.499481,0.479279,0.945286,2.257414,2.273524,2.373187,2.474307,2.624857,2.628054,0.806429,1.576612,0.519453,3.093336,1.33905,1.410541,0.444613,0.354257,1.034324,1.699047,0.465659,0.863054,2.749049,2.554438,2.655169,2.548857,2.526365,8.877371,3.666922,174.045424,1.708016,2.817051,2.767169,2.54129,2.468023,2.720294,2.455063,3.005826,1.524839,2.110225,0.485404,0.201854,0.136702,0.139479,0.089337,0.676536,3.882816,17.744368,18.992476,0.89496,0.829098,1.08759,0.633997,0.246491,97.886105,819.864496,0.499629
min,1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,0.0,0.0,-1.0,-1.0,-1.0,-1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-1.0,0.0,0.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,0.0,0.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,1.0,-1.0,-1.0,-1.0,-1.0,2014.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,0.0,0.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,1.0,-1.0,-1.0,-1.0,-1.0,-1.0,0.0,0.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,0.0,0.0,0.0,0.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,0.0
25%,9856.5,1.0,1.0,34.0,11070.0,1.0,2.0,2.0,2.0,1.0,1.0,1.0,1.0,0.0,0.0,2.0,1.0,2.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,2.0,223.0,1.0,113.0,1.0,113.0,11.0,2.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,6.0,2.0,2.0,1.0,1.0,1.0,1.0,2.0,0.0,0.0,1.0,3.0,2.0,2.0,2.0,1.0,1.0,2.0,3.0,1.0,2.0,1.0,4.0,3.0,2.0,8.0,8.0,12.0,12.0,5.0,5.0,11.0,6.0,43.0,2014.0,2014.0,0.0,3.0,2.0,2.0,1.0,2.0,2.0,1.0,1.0,2.0,2.0,2.0,1.0,2.0,2.0,2341.0,1.0,3.0,1.0,1.0,2.0,1.0,1.0,1.0,27.0,2.0,1.0,2.0,0.0,0.0,2.0,5.0,4.0,4.0,2.0,1.0,1.0,2.0,5.0,1.0,1.0,1.0,1.0,0.0,0.0,2.0,4.0,1.0,1.0,3.0,3.0,4.0,2.0,4.0,8.0,5.0,2014.0,2.0,1.0,3.0,5.0,1.0,2.0,2.0,3.0,1.0,2.0,1.0,0.0,0.0,0.0,0.0,1.0,2.0,21.0,25.0,2.0,2.0,1.0,2.0,2.0,1950.0,1934.0,0.0
50%,19759.0,3.0,2.0,50.0,12060.0,2.0,2.0,2.0,2.0,1.0,2.0,1.0,1.0,0.0,0.0,2.0,4.0,2.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,3.0,322.0,2.0,213.0,2.0,213.0,12.0,4.0,2.0,2.0,1.0,1.0,1.0,3.0,2.0,5.0,1.0,1.0,1.0,1.0,1.0,2.0,2.0,2.0,1.0,8.0,2.0,2.0,2.0,1.0,1.0,4.0,3.0,0.0,0.0,1.0,5.0,2.0,3.0,3.0,2.0,2.0,3.0,4.0,2.0,2.0,2.0,5.0,5.0,3.0,16.0,16.0,15.0,28.0,9.0,9.0,14.0,21.0,53.0,2014.0,2014.0,3.0,4.0,2.0,2.0,2.0,3.0,3.0,2.0,2.0,2.0,3.0,3.0,2.0,3.0,2.0,4224.0,2.0,5.0,2.0,1.0,3.0,1.0,1.0,1.0,49.0,6.0,1.0,2.0,1.0,0.0,3.0,6.0,5.0,5.0,6.0,3.0,3.0,2.0,6.0,1.0,5.0,1.0,1.0,0.0,0.0,3.0,5.0,2.0,2.0,5.0,5.0,6.0,4.0,6.0,16.0,9.0,2014.0,4.0,4.0,5.0,7.0,3.0,5.0,3.0,5.0,2.0,4.0,2.0,0.0,0.0,0.0,0.0,1.0,7.0,39.0,40.0,3.0,2.0,1.0,2.0,2.0,1965.0,1957.0,1.0
75%,29514.5,6.0,2.0,64.0,15040.0,2.0,2.0,2.0,2.0,1.0,2.0,2.0,2.0,0.0,0.0,2.0,6.0,2.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,5.0,520.0,3.0,322.0,3.0,322.0,15.0,5.0,4.0,4.0,1.0,2.0,3.0,4.0,3.0,6.0,1.0,2.0,2.0,2.0,2.0,3.0,3.0,2.0,2.0,9.0,3.0,4.0,2.0,3.0,4.0,7.0,3.0,0.0,0.0,1.0,7.0,3.0,3.0,4.0,3.0,3.0,4.0,5.0,3.0,4.0,3.0,7.0,6.0,4.0,24.0,24.0,18.0,45.0,11.0,11.0,17.0,40.0,66.0,2015.0,2015.0,7.0,5.0,3.0,3.0,2.0,4.0,4.0,3.0,2.0,3.0,4.0,4.0,3.0,4.0,3.0,7115.0,2.0,6.0,2.0,6.0,3.0,6.0,1.0,1.0,84.0,7.0,2.0,2.0,1.0,1.0,3.0,7.0,7.0,7.0,7.0,5.0,5.0,3.0,7.0,2.0,7.0,1.0,1.0,1.0,0.0,3.0,6.0,2.0,2.0,7.0,7.0,7.0,6.0,8.0,23.0,11.0,2015.0,4.0,6.0,7.0,8.0,5.0,6.0,5.0,7.0,2.0,6.0,2.0,0.0,0.0,0.0,0.0,2.0,9.0,40.0,45.0,4.0,2.0,1.0,2.0,2.0,1980.0,1972.0,1.0
max,39324.0,10.0,4.0,114.0,444444.0,2.0,2.0,2.0,2.0,2.0,2.0,4.0,2.0,1.0,1.0,2.0,10.0,2.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,5.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,55.0,5555.0,55.0,5555.0,55.0,5555.0,50.0,55.0,55.0,55.0,3.0,4.0,4.0,4.0,5.0,10.0,2.0,4.0,4.0,4.0,4.0,5.0,5.0,2.0,2.0,10.0,5.0,13.0,4.0,7.0,8.0,10.0,3.0,1.0,1.0,2.0,10.0,4.0,4.0,6.0,6.0,6.0,6.0,6.0,6.0,6.0,4.0,10.0,10.0,6.0,31.0,31.0,23.0,59.0,12.0,12.0,23.0,59.0,772.0,2015.0,2015.0,10.0,6.0,6.0,6.0,6.0,6.0,6.0,6.0,6.0,6.0,6.0,6.0,6.0,6.0,6.0,9629.0,2.0,10.0,2.0,6.0,3.0,9.0,2.0,2.0,99.0,9.0,2.0,2.0,1.0,1.0,4.0,10.0,10.0,10.0,7.0,10.0,10.0,4.0,7.0,2.0,10.0,5.0,6.0,1.0,1.0,5.0,7.0,2.0,4.0,10.0,10.0,10.0,10.0,10.0,31.0,12.0,2015.0,6.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,7.0,7.0,2.0,1.0,1.0,1.0,1.0,3.0,10.0,168.0,168.0,4.0,2.0,3.0,2.0,2.0,2000.0,2014.0,1.0
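
Because missing values were recoded to -1, the sentinel enters every statistic above: the minimum of most columns is -1, and v263 shows a mean of about 1523 even though its quartiles lie between 1934 and 1972. A minimal sketch of masking the sentinel before summarising is shown below; it assumes -1 never occurs as a legitimate response, and the toy frame is illustrative only.

```python
import pandas as pd

# Toy stand-in for two numeric columns of train_v5; values are illustrative only.
toy = pd.DataFrame({
    "v258": [1941, 1957, 1968],
    "v263": [-1, -1, 1963],
})

# Replace the -1 sentinel with NaN so describe() skips it.
masked = toy.mask(toy == -1)

summary_with_sentinel = toy.describe()
summary_masked = masked.describe()

print(summary_masked)
```

After masking, the `count` row also becomes informative, since it then reports the number of genuine answers per column instead of the number of rows.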


### Correlation Analysis

In [53]:
numeric_df = train_v5[num_cols]

In [57]:
def get_redundant_pairs(df):
    """Return the diagonal and lower-triangle column pairs of the correlation matrix."""
    pairs_to_drop = set()
    cols = df.columns
    for i in range(0, df.shape[1]):
        for j in range(0,i+1):
            pairs_to_drop.add((cols[i], cols[j]))
    return pairs_to_drop

def get_top_abs_correlations(df, n=5):
    """Return the n largest absolute pairwise correlations, each pair listed once."""
    au_corr = df.corr().abs().unstack()
    labels_to_drop = get_redundant_pairs(df)
    au_corr = au_corr.drop(labels=labels_to_drop).sort_values(ascending=False)
    return au_corr[0:n]

corr_df = get_top_abs_correlations(numeric_df, 30).to_frame().reset_index().rename(columns={0: "abs_corr"})
corr_df = corr_df.merge(codebook[['Variable', "Label"]], left_on = 'level_0', right_on = "Variable", how = "left").drop("Variable", axis = 1).rename(columns = {"Label":"label_0"})
corr_df = corr_df.merge(codebook[['Variable', "Label"]], left_on = 'level_1', right_on = "Variable", how = "left").drop("Variable", axis = 1).rename(columns = {"Label":"label_1"})
corr_df

Unnamed: 0,level_0,level_1,abs_corr,label_0,label_1
0,v128,v129,0.998224,"End of interview, month","Start of interview, month"
1,v124,v125,0.996107,"End of interview, day of month","Start of interview, day of month"
2,v196,v208,0.995009,Second person in household: re...,Second person in household: re...
3,v57,v65,0.991022,Highest level of education,"Highest level of education, ES..."
4,v102,v103,0.990102,Main source of household income,Main source of household income
5,v58,v66,0.989494,Father's highest level of educ...,Father's highest level of educ...
6,v59,v66,0.989159,Father's highest level of educ...,Father's highest level of educ...
7,v60,v67,0.987781,Mother's highest level of educ...,Mother's highest level of educ...
8,v61,v67,0.984702,Mother's highest level of educ...,Mother's highest level of educ...
9,v58,v59,0.982498,Father's highest level of educ...,Father's highest level of educ...


In [60]:
corr_matrix = numeric_df.corr()
corr_matrix["satisfied"].sort_values(ascending=False)[0:10]

satisfied    1.000000
v98          0.547952
v224         0.327189
v74          0.322694
v253         0.316676
v223         0.270685
v178         0.250165
v180         0.250016
v233         0.235280
v226         0.229284
Name: satisfied, dtype: float64

In [61]:
corr_matrix["satisfied"].sort_values(ascending=True)[0:10]

v101   -0.319063
v79    -0.291363
v99    -0.277223
v82    -0.263764
v81    -0.235305
v80    -0.203810
v13    -0.198201
v222   -0.172973
v2     -0.162184
v134   -0.145847
Name: satisfied, dtype: float64
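
When only the relationship with the target is of interest, the full correlation matrix is not needed. The sketch below uses `DataFrame.corrwith` on a made-up frame; the column names are borrowed from the output above but the values are illustrative only.

```python
import pandas as pd

# Toy stand-in for numeric_df; values are illustrative only.
toy = pd.DataFrame({
    "v98": [9, 8, 3, 2, 7, 1],
    "v101": [1, 2, 4, 4, 1, 3],
    "satisfied": [1, 1, 0, 0, 1, 0],
})

# Correlate every feature with the target, leaving the target itself out.
target_corr = (
    toy.drop(columns="satisfied")
       .corrwith(toy["satisfied"])
       .sort_values(ascending=False)
)

print(target_corr)
```

The head of the sorted series gives the strongest positive correlations and the tail the strongest negative ones, so both lists above come from a single pass. As in the cells above, any -1 sentinel values left in the data would still influence these coefficients.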