# Data Wrangling

Start by importing the packages needed to load and organize the data.

In [1]:
import pandas as pd
import numpy as np

Import the data and list out all the columns.

In [2]:
df = pd.read_csv("../data/NFWBS_PUF_2016_data.csv")

In [3]:
col = list(df.columns)

print(len(col))

for i in col:
    print(i)

217
PUF_ID
sample
fpl
SWB_1
SWB_2
SWB_3
FWBscore
FWB1_1
FWB1_2
FWB1_3
FWB1_4
FWB1_5
FWB1_6
FWB2_1
FWB2_2
FWB2_3
FWB2_4
FSscore
FS1_1
FS1_2
FS1_3
FS1_4
FS1_5
FS1_6
FS1_7
FS2_1
FS2_2
FS2_3
SUBKNOWL1
ACT1_1
ACT1_2
FINGOALS
PROPPLAN_1
PROPPLAN_2
PROPPLAN_3
PROPPLAN_4
MANAGE1_1
MANAGE1_2
MANAGE1_3
MANAGE1_4
SAVEHABIT
FRUGALITY
AUTOMATED_1
AUTOMATED_2
ASK1_1
ASK1_2
SUBNUMERACY2
SUBNUMERACY1
CHANGEABLE
GOALCONF
LMscore
FINKNOWL1
FINKNOWL2
FINKNOWL3
FK1correct
FK2correct
FK3correct
KHscore
KHKNOWL1
KHKNOWL2
KHKNOWL3
KHKNOWL4
KHKNOWL5
KHKNOWL6
KHKNOWL7
KHKNOWL8
KHKNOWL9
KH1correct
KH2correct
KH3correct
KH4correct
KH5correct
KH6correct
KH7correct
KH8correct
KH9correct
ENDSMEET
HOUSING
LIVINGARRANGEMENT
HOUSERANGES
IMPUTATION_FLAG
VALUERANGES
MORTGAGE
SAVINGSRANGES
PRODHAVE_1
PRODHAVE_2
PRODHAVE_3
PRODHAVE_4
PRODHAVE_5
PRODHAVE_6
PRODHAVE_7
PRODHAVE_8
PRODHAVE_9
PRODUSE_1
PRODUSE_2
PRODUSE_3
PRODUSE_4
PRODUSE_5
PRODUSE_6
CONSPROTECT1
CONSPROTECT2
CONSPROTECT3
EARNERS
VOLATILITY
SNAP
MATHARDSHIP_1
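Many of the column names share a prefix and differ only by a trailing item number. As a rough sketch of how such item families could be spotted programmatically, the snippet below groups names by prefix. The `columns` list is a short hypothetical subset of the names printed above, and `item_family` is a helper introduced only for illustration.

```python
import re
from collections import Counter

# Hypothetical subset of the column names printed above.
columns = ["PUF_ID", "SWB_1", "SWB_2", "SWB_3", "FWBscore",
           "KH1correct", "KH2correct", "MANAGE1_1", "MANAGE1_2"]

def item_family(name):
    """Strip a trailing '_<number>' so SWB_1 and SWB_2 share the key 'SWB'."""
    return re.sub(r"_\d+$", "", name)

family_sizes = Counter(item_family(c) for c in columns)

# Families with more than one member are candidates for a combined score.
multi_item = {family: n for family, n in family_sizes.items() if n > 1}
print(multi_item)
```

Names such as `KH1correct`, where the number sits in the middle, are not grouped by this pattern and would need their own rule.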

This dataset has 217 variables, and using all of them will likely lead to overfitting. Reducing the number of variables should lead to a better model.

Collapsing the per-question correctness variables into a single score is one way to reduce the number of variables. This will be done for the Knoll and Houts test questions (KH) and the financial knowledge questions (FK).

Summing the number of things that have happened to someone, or that someone has, is another way to reduce the number of variables. This will be done for the benefits (BENEFITS), life shocks experienced (SHOCKS), material hardships faced (MATHARDSHIP), and financial experiences taught growing up (FINTAUGHT).

A third way is to average the scores of similar question variables to get a single score for that category. These variables are scored from 1 to 5. This will be done for the materialism questions (MATERIALISM), the planning questions (PROPPLAN), and the managing finances questions (MANAGE).
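As an aside, item columns can also be selected by pattern instead of typing each name. This is a minimal sketch on a toy frame with made-up values, not the approach used in the cells that follow; only the column names mirror the survey.

```python
import pandas as pd

# Toy frame with hypothetical values; only the column names mirror the survey.
toy = pd.DataFrame({
    "KH1correct": [1, 0],
    "KH2correct": [1, 1],
    "KH3correct": [0, 0],
    "KHKNOWL1": [3, 2],   # raw answer, should not be counted
})

# Match KH<digit>correct only, so KHKNOWL* and KHscore are excluded.
kh_items = toy.filter(regex=r"^KH\d+correct$")

# With no missing values this equals sum(axis=1) / number of items.
share_correct = kh_items.mean(axis=1)
print(share_correct.tolist())
```

The anchored regular expression matters here: a plain prefix match on `KH` would also pick up the raw answer columns.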

In [4]:
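# Share of KH items answered correctly; this overwrites the KHscore column already present in the raw data.
df["KHscore"] = df[["KH1correct", "KH2correct", "KH3correct", "KH4correct", "KH5correct", 
                    "KH6correct", "KH7correct", "KH8correct", "KH9correct"]].sum(axis=1) / 9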

In [5]:
df["FKscore"] = df[["FK1correct", "FK2correct", "FK3correct"]].sum(axis=1) / 3

In [6]:
df["BENEFITS"] = df[["BENEFITS_1", "BENEFITS_2", "BENEFITS_3", 
                     "BENEFITS_4", "BENEFITS_5"]].replace(-1, 0).sum(axis=1)

In [7]:
df["SHOCKS"] = df[["SHOCKS_1", "SHOCKS_2", "SHOCKS_3", "SHOCKS_4", "SHOCKS_5", "SHOCKS_6", 
                   "SHOCKS_7", "SHOCKS_8", "SHOCKS_9", "SHOCKS_10", "SHOCKS_11"]].sum(axis=1)

In [8]:
df["MATHARDSHIP"] = df[["MATHARDSHIP_1", "MATHARDSHIP_2", 
                        "MATHARDSHIP_3", "MATHARDSHIP_4", 
                        "MATHARDSHIP_5", "MATHARDSHIP_6"]].replace([-1, 1], 0).replace(2, 1).sum(axis=1)

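The MATHARDSHIP recode maps -1 and 1 to 0 and 2 to 1 before summing. The trace below runs the same two `replace` calls on a single made-up respondent to show what ends up in the total. The values are hypothetical; in particular, the 3 is included only to show that any code other than -1, 1, or 2 passes through unchanged, so it may be worth checking the codebook for the full set of response values.

```python
import pandas as pd

# Hypothetical responses for one respondent; the codes are illustrative.
toy = pd.DataFrame({"MATHARDSHIP_1": [-1], "MATHARDSHIP_2": [1],
                    "MATHARDSHIP_3": [2], "MATHARDSHIP_4": [3]})

step1 = toy.replace([-1, 1], 0)   # -1 and 1 both become 0
step2 = step1.replace(2, 1)       # 2 becomes 1; other codes are untouched
total = step2.sum(axis=1)

recoded = step2.iloc[0].tolist()
print(recoded)
print(total.iloc[0])
```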
In [9]:
df["FINTAUGHT"] = df[["FINSOC2_1", "FINSOC2_2", "FINSOC2_3", "FINSOC2_4", 
                    "FINSOC2_5", "FINSOC2_6", "FINSOC2_7"]].replace(-1, 0).sum(axis=1)

In [10]:
df["MATERIALISM"] = df[["MATERIALISM_1", "MATERIALISM_2", "MATERIALISM_3"]].replace(-1, 0).sum(axis=1) / 3

In [11]:
df["PLAN"] = df[["PROPPLAN_1", "PROPPLAN_2", "PROPPLAN_3", "PROPPLAN_4"]].replace(-1, 0).sum(axis=1) / 4

In [12]:
df["MANAGE"] = df[["MANAGE1_1", "MANAGE1_2", "MANAGE1_3", "MANAGE1_4"]].replace(-1, 0).sum(axis=1) / 4
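Replacing -1 with 0 and dividing by a fixed item count pulls the average down for anyone who skipped a question. One possible alternative, sketched here on made-up answers and assuming that -1 marks a refused question, is to treat those entries as missing and average only the answered items.

```python
import numpy as np
import pandas as pd

# Hypothetical 1-to-5 answers; -1 stands for a refused question.
toy = pd.DataFrame({"MANAGE1_1": [5, 4], "MANAGE1_2": [5, -1],
                    "MANAGE1_3": [5, 4], "MANAGE1_4": [5, 4]})

# Approach used above: refused answers count as 0 in a fixed-size average.
zero_filled = toy.replace(-1, 0).sum(axis=1) / 4

# Alternative: treat refusals as missing and average the answered items only.
answered_only = toy.replace(-1, np.nan).mean(axis=1)

print(zero_filled.tolist())
print(answered_only.tolist())
```

For the second respondent the zero-filled average is 3.0 while the answered-only average is 4.0, which stays inside the range of the answers actually given. Which behavior is preferable depends on how non-response should be interpreted.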

## Selecting Variables
I selected the variables that I think are most applicable to the question.
Variables dropped (Reason):
* sample (Shows oversampling and is not relevant)
* FRUGALITY (The question asked to score frugality could have different implied meanings)
* SUBNUMERACY1/2 (Doesn't seem relevant to happiness)
* ASK1_1/2 (While doing research for monetary decisions is smart, it doesn't seem relevant to happiness)
* IMPUTATION_FLAG (This only affects a small portion of the data)
* CONSPROTECT1/2/3 (These don't seem relevant to happiness)
* SNAP (Only applies to a small portion of the data and is likely represented by other variables)
* COLLECT (Doesn't seem relevant to question)
* REJECTED_1/2 (Doesn't seem relevant to question)
* ABSORBSHOCK (Doesn't seem relevant to happiness)
* FRAUD2 (Seems like it could affect happiness, but the survey question is very open ended)
* COVERCOSTS (Categorical variable that isn't explainable by numerical values)
* BORROW_1/2 (Doesn't seem relevant to happiness)
* MANAGE2, PAIDHELP (Categorical variables that aren't explained well by numerical values)
* HSLOC (Doesn't seem relevant to question)
* ON1/2correct (Doesn't seem relevant to question)
* CONNECT (Couldn't find a good description of this variable)
* DISCOUNT (Represents financial knowledge, but is not a good representation of it)
* MEMLOSS (Open ended question that may not have been easily interpreted)
* SELFCONTROL_1/2/3 (Doesn't seem relevant to question)
* OUTLOOK_1/2 (Doesn't seem relevant to question)
* INTERCONNECTIONS_1-10 (Difficult to interpret meaning from the model)
* SOCSEC1/2/3 (Only applicable to a portion of the data and doesn't seem like 
* KIDS_NoChildren (Difficult to interpret as it applies to happiness)
* EMPLOY (The other employ variables better explain this variable)
* MILITARY (Very open ended question)
* generation (redundant information)
* PPREG4 (redundant information)
* PPT vars (redundant information)
* PEM (This could be seen as another way of asking about the person's future)
* CHANGEABLE (This could be seen as another way of asking about the person's future)
* LIVINGARRANGEMENT (This variable isn't described well by numbers)


This results in 65 remaining variables, 3 of which are the variables that I am trying to predict.
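With a long hand-typed selection list, a quick check for missing or repeated names can catch typos before indexing raises a `KeyError`. The sketch below uses a toy frame and a deliberately flawed list; `check_selection` is a hypothetical helper, not part of pandas.

```python
import pandas as pd

def check_selection(frame, selected):
    """Return (missing, duplicated) column names for a selection list."""
    missing = [c for c in selected if c not in frame.columns]
    duplicated = sorted({c for c in selected if selected.count(c) > 1})
    return missing, duplicated

# Toy frame and list; both are hypothetical stand-ins for df and the real list.
toy = pd.DataFrame({"SWB_1": [6], "SWB_2": [5], "FWBscore": [56]})
selected = ["SWB_1", "SWB_2", "SWB_2", "HOUSESAT"]

missing, duplicated = check_selection(toy, selected)
print(missing)
print(duplicated)
```

Comparing `len(selected)` with the expected count is a further cheap check once both lists come back empty.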

In [13]:
variables = df[["SWB_1", "SWB_2", "SWB_3", 
                "FWBscore", "FSscore", "FKscore", "KHscore", "SUBKNOWL1", 
                "ACT1_1", "ACT1_2", "FINGOALS", "GOALCONF", 
                "PLAN", # All represent planning skills
                "MANAGE", "ENDSMEET", # All represent ability to manage expenses
                "SAVEHABIT", "AUTOMATED_1", "AUTOMATED_2", # These are similar showing savings and putting money to future use. 
                "HOUSING", "HOUSERANGES", "VALUERANGES", "MORTGAGE", "HOUSESAT", # HOUSESAT related to housing variables
                "SAVINGSRANGES", 
                "PRODHAVE_9", # PRODHAVE vars ask if the person uses different investment vehicles
                "PRODUSE_6", # PRODUSE vars ask if the person uses non-backed or high interest ways to get quick money
                "EARNERS", "VOLATILITY", 
                "MATHARDSHIP", 
                "BENEFITS",  
                "SHOCKS", 
                "PAREDUC", "FINTAUGHT", 
                "MATERIALISM", 
                "HEALTH", "DISTRESS", "LIFEEXPECT", 
                "SCFHORIZON",
                "RETIRE", 
                "HHEDUC", "PPEDUC", 
                "KIDS_1", "KIDS_2", "KIDS_3", "KIDS_4", # 1: 0-7, 2: 7-12, 3: 13-17, 4: 18+
                "EMPLOY1_1", "EMPLOY1_2", "EMPLOY1_3", "EMPLOY1_4", 
                "EMPLOY1_5", "EMPLOY1_6", "EMPLOY1_7", "EMPLOY1_8", "EMPLOY1_9",
                "Military_Status", 
                "agecat", "PPETHM", "PPGENDER", "PPINCIMP", "PPHHSIZE", "PPMARIT", "PPMSACAT", "fpl", # fpl = relation to poverty status 
                "PPREG9", "PCTLT200FPL"]]

variables.describe()

| | SWB_1 | SWB_2 | SWB_3 | FWBscore | FSscore | FKscore | KHscore | SUBKNOWL1 | ACT1_1 | ACT1_2 | ... | agecat | PPETHM | PPGENDER | PPINCIMP | PPHHSIZE | PPMARIT | PPMSACAT | fpl | PPREG9 | PCTLT200FPL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 6394.0 | 6394.0 | 6394.0 | 6394.0 | 6394.0 | 6394.0 | 6394.0 | 6394.0 | 6394.0 | 6394.0 | ... | 6394.0 | 6394.0 | 6394.0 | 6394.0 | 6394.0 | 6394.0 | 6394.0 | 6394.0 | 6394.0 | 6394.0 |
| mean | 5.353769 | 5.362215 | 5.43228 | 56.034094 | 50.719112 | 0.835419 | 0.700414 | 4.674069 | 4.213481 | 3.607288 | ... | 4.450422 | 1.622771 | 1.475759 | 5.510635 | 2.52299 | 2.042071 | 0.866124 | 2.658899 | 5.145605 | -0.081952 |
| std | 1.500913 | 1.544942 | 1.613876 | 14.154676 | 12.656921 | 0.251738 | 0.208606 | 1.283933 | 0.904444 | 0.925751 | ... | 2.120741 | 1.077631 | 0.499451 | 2.671075 | 1.223571 | 1.393808 | 0.340545 | 0.656944 | 2.529397 | 1.328498 |
| min | -4.0 | -4.0 | -4.0 | -4.0 | -1.0 | 0.0 | 0.0 | -1.0 | -1.0 | -1.0 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | -5.0 |
| 25% | 5.0 | 5.0 | 5.0 | 48.0 | 42.0 | 0.666667 | 0.555556 | 4.0 | 4.0 | 3.0 | ... | 3.0 | 1.0 | 1.0 | 3.0 | 2.0 | 1.0 | 1.0 | 3.0 | 3.0 | 0.0 |
| 50% | 6.0 | 6.0 | 6.0 | 56.0 | 50.0 | 1.0 | 0.777778 | 5.0 | 4.0 | 4.0 | ... | 4.0 | 1.0 | 1.0 | 6.0 | 2.0 | 1.0 | 1.0 | 3.0 | 5.0 | 0.0 |
| 75% | 6.0 | 7.0 | 7.0 | 65.0 | 57.0 | 1.0 | 0.888889 | 5.0 | 5.0 | 4.0 | ... | 6.0 | 2.0 | 2.0 | 8.0 | 3.0 | 3.0 | 1.0 | 3.0 | 7.0 | 0.0 |
| max | 7.0 | 7.0 | 7.0 | 95.0 | 85.0 | 1.0 | 1.0 | 7.0 | 5.0 | 5.0 | ... | 8.0 | 4.0 | 2.0 | 9.0 | 5.0 | 5.0 | 1.0 | 3.0 | 9.0 | 1.0 |
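Several columns in the summary have negative minimums (for example -4, -1, and -5), which appear to be non-response codes and not real measurements. A quick way to see how many rows are affected per column is sketched below on a toy frame with made-up values; the assumption that every negative value is a code would need to be checked against the codebook.

```python
import pandas as pd

# Hypothetical values; negative numbers stand in for non-response codes.
toy = pd.DataFrame({"SWB_1": [6, -4, 5], "FSscore": [50, 42, -1],
                    "agecat": [3, 4, 6]})

# Number of negative entries per column, keeping only affected columns.
negative_counts = (toy < 0).sum()
negative_counts = negative_counts[negative_counts > 0]
print(negative_counts.to_dict())
```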


Export the resulting dataframe to a CSV file.

In [14]:
# variables.to_csv("../data/clean_data.csv")
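By default `to_csv` also writes the row index, which comes back as an extra unnamed column when the file is read again. The sketch below shows the difference using in-memory buffers and a toy frame with made-up values, so no file path is assumed.

```python
import io
import pandas as pd

# Toy frame with hypothetical values standing in for `variables`.
toy = pd.DataFrame({"SWB_1": [6, 5], "FWBscore": [56, 48]})

with_index = io.StringIO()
toy.to_csv(with_index)                  # default: index written as first column
without_index = io.StringIO()
toy.to_csv(without_index, index=False)  # index omitted

with_index.seek(0)
without_index.seek(0)
cols_with = list(pd.read_csv(with_index).columns)
cols_without = list(pd.read_csv(without_index).columns)
print(cols_with)
print(cols_without)
```

Passing `index=False` is only appropriate when the index carries no information, as with a default integer index.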