# Predicting Income With Census Data: Exploratory Data Analysis

Rafael G. Guerra | April 2022

Because the dataset has a large number of features, I will not visualize them in this analysis. Instead, I will focus on removing features that are unlikely to be relevant. I will do so by examining the correlation between each feature and the target variable 'INCOME', and then the pairwise correlations among the features themselves.

### Import libraries

In [10]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

### Import Data

In [41]:
train_data = pd.read_csv('train_clean.csv')
test_data = pd.read_csv('test_clean.csv')

### Remove variables with no correlation to income

We will only keep features whose Pearson correlation with income exceeds 0.05 in absolute value. The cut is substantial: the number of features drops from 372 to 55.

In [38]:
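# Keep every column whose Pearson correlation with INCOME exceeds 0.05 in absolute value.
# INCOME correlates perfectly with itself, so the target column is retained as well.
feature_list = []
for feature in train_data:
    income_corr = np.corrcoef(train_data['INCOME'], train_data[feature])[0, 1]
    if abs(income_corr) > 0.05:
        feature_list.append(feature)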

In [39]:
feature_list

['ACLSWKR_ Federal government',
 'ACLSWKR_ Not in universe',
 'ACLSWKR_ Private',
 'ACLSWKR_ Self-employed-incorporated',
 'ACLSWKR_ Self-employed-not incorporated',
 'AHGA_ Bachelors degree(BA AB BS)',
 'AHGA_ Children',
 'AHGA_ Doctorate degree(PhD EdD)',
 'AHGA_ High school graduate',
 'AHGA_ Masters degree(MA MS MEng MEd MSW MBA)',
 'AHGA_ Prof school degree (MD DDS DVM LLB JD)',
 'AHSCOL_ Not in universe',
 'AMARITL_ Married-civilian spouse present',
 'AMARITL_ Never married',
 'AMJIND_ Communications',
 'AMJIND_ Finance insurance and real estate',
 'AMJIND_ Manufacturing-durable goods',
 'AMJIND_ Not in universe or children',
 'AMJIND_ Other professional services',
 'AMJIND_ Public administration',
 'AMJIND_ Wholesale trade',
 'AMJOCC_ Executive admin and managerial',
 'AMJOCC_ Not in universe',
 'AMJOCC_ Other service',
 'AMJOCC_ Professional specialty',
 'AMJOCC_ Sales',
 'ARACE_ White',
 'AREORGN_ All other',
 'ASEX_ Female',
 'ASEX_ Male',
 'AUNMEM_ No',
 'AUNMEM_ Not in univ

In [42]:
train_data = train_data[feature_list]
test_data = test_data[feature_list]
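
The same filter can be written without an explicit loop. The sketch below is only an illustration under stated assumptions: the helper name `select_by_target_correlation` and the toy DataFrame are hypothetical and are not part of the original notebook. It relies on `DataFrame.corrwith`, which computes the correlation of each column with a given Series.

```python
import numpy as np
import pandas as pd

def select_by_target_correlation(df, target, threshold=0.05):
    """Return column names whose |Pearson r| with `target` exceeds `threshold`."""
    numeric = df.select_dtypes(include="number")
    corr = numeric.corrwith(numeric[target])
    # A constant column has an undefined (NaN) correlation; NaN > threshold
    # evaluates to False, so such columns are dropped.
    return corr[corr.abs() > threshold].index.tolist()

# Toy data (hypothetical, not the census file)
rng = np.random.default_rng(0)
x = rng.normal(size=200)
toy = pd.DataFrame({
    "INCOME": x,
    "related": x + rng.normal(scale=0.5, size=200),  # strongly tied to INCOME
    "constant": np.ones(200),                        # carries no information
})

selected = select_by_target_correlation(toy, "INCOME")
print(selected)
```

As in the loop version, the target column itself passes the filter because its correlation with itself is 1.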

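### Remove variables with high pairwise correlation

We want to avoid variance inflation (multicollinearity) by removing features that are highly correlated with one another. For each pair of features with an absolute correlation above 0.95, we drop the one that appears later in the column order.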

In [45]:
df_cor_matrix = train_data.corr().abs()

In [47]:
# Keep only the entries above the diagonal so each feature pair is examined once.
upper_tri = df_cor_matrix.where(np.triu(np.ones(df_cor_matrix.shape), k=1).astype(bool))

In [48]:
high_covariance = [column for column in upper_tri.columns if any(upper_tri[column] > 0.95)]

In [49]:
print(high_covariance)

['AMJIND_ Not in universe or children', 'AMJOCC_ Not in universe', 'ASEX_ Male', 'HHDREL_ Child under 18 never married', 'VETYN']
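
To see why the upper-triangle mask flags only one member of each correlated pair, here is a minimal sketch on a toy DataFrame. The column names `a`, `a_copy`, and `b` are made up for illustration and do not come from the census data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
a = rng.normal(size=500)
toy = pd.DataFrame({
    "a": a,
    "a_copy": a * 2.0 + 1.0,    # linear function of "a", so |r| = 1
    "b": rng.normal(size=500),  # independent noise
})

corr = toy.corr().abs()
# True strictly above the diagonal, False on and below it
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
upper = corr.where(mask)

# A column is flagged when it is highly correlated with any EARLIER column
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
print(to_drop)  # ['a_copy']
```

Only `a_copy` is flagged, so `a` survives. Which member of a pair is kept therefore depends on column order, not on which feature is more informative.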


In [51]:
train_data = train_data.drop(columns=high_covariance)

In [52]:
test_data = test_data.drop(columns=high_covariance)

In [53]:
train_data.to_csv('train_eda.csv', index=False)
test_data.to_csv('test_eda.csv', index=False)
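
Because the same feature list and drop list are applied to both sets, the two saved files should end up with identical columns. A quick check along the following lines could confirm that before modeling. This is a sketch only: `same_columns` is a hypothetical helper, and the toy frames stand in for the real data.

```python
import pandas as pd

def same_columns(train, test):
    """True when both frames have identical column names in identical order."""
    return list(train.columns) == list(test.columns)

# Toy frames (hypothetical values) standing in for train_data and test_data
train_toy = pd.DataFrame({"INCOME": [0, 1], "ASEX_ Female": [1, 0]})
test_toy = pd.DataFrame({"INCOME": [1, 0], "ASEX_ Female": [0, 1]})

aligned = same_columns(train_toy, test_toy)
misaligned = same_columns(train_toy, test_toy.drop(columns=["ASEX_ Female"]))
print(aligned, misaligned)
```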