# Dataset Description

http://archive.ics.uci.edu/ml/datasets/Adult


**Donor**: 

    Ronny Kohavi and Barry Becker 
    Data Mining and Visualization 
    Silicon Graphics. 
    e-mail: ronnyk '@' live.com for questions. 


**Data Set Information**:

    Extraction was done by Barry Becker from the 1994 Census database. A set of reasonably clean records was extracted using the following conditions: ((AAGE>16) && (AGI>100) && (AFNLWGT>1) && (HRSWK>0))

    Prediction task is to determine whether a person makes over 50K a year. 
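
The extraction conditions above read as a boolean row filter. A minimal sketch in pandas, assuming the census variables are available as columns named as in the condition. The three records are invented for illustration and are not taken from the census database.

```python
import pandas as pd

# Toy records (invented); column names follow the extraction condition.
census = pd.DataFrame({
    "AAGE": [15, 34, 52],
    "AGI": [0, 45000, 80],
    "AFNLWGT": [1200, 98000, 45000],
    "HRSWK": [0, 40, 20],
})

# ((AAGE>16) && (AGI>100) && (AFNLWGT>1) && (HRSWK>0))
mask = (
    (census["AAGE"] > 16)
    & (census["AGI"] > 100)
    & (census["AFNLWGT"] > 1)
    & (census["HRSWK"] > 0)
)
clean = census[mask]
print(clean)
```

Only the second toy record satisfies all four conditions, so it is the only one kept.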


----

**Listing of attributes**: 

**class**: >50K, <=50K. 

**workclass**: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked. 


**education**: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool. 

**marital-status**: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse. 

**occupation**: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces. 

**relationship**: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried. 

**race**: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black. 

**sex**: Female, Male. 

**native-country**: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.

**age**: continuous. 

**capital-gain**: continuous. 

**capital-loss**: continuous. 

**hours-per-week**: continuous. 

**education-num**: continuous. 

**fnlwgt**: continuous. 

# Import Libraries
These libraries are going to be used in all notebooks.

In [112]:
import os
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

# Loading Dataset

## Input dataset (further split into training and validation datasets)

In [113]:
path = 'adult.data'

df = pd.read_csv(path, index_col=False)

## Test dataset

This should remain untouched during training or validation/calibration.

In [114]:
path = 'adult.test'

df_t = pd.read_csv(path, index_col=False)

# Discard non-informative dimensions in both datasets

In [115]:
# Drop by keyword: pandas 2.0 no longer accepts the axis as a positional argument
df = df.drop(columns=['fnlwgt', 'education-num'])

df_t = df_t.drop(columns=['fnlwgt', 'education-num'])

# Discard information that may lead to gender or race bias

In [116]:
# Drop by keyword: pandas 2.0 no longer accepts the axis as a positional argument
df = df.drop(columns=['race', 'sex'])

df_t = df_t.drop(columns=['race', 'sex'])

# Hidden gender 

In [117]:
df['relationship'].unique()

array(['Not-in-family', 'Husband', 'Wife', 'Own-child', 'Unmarried',
       'Other-relative'], dtype=object)

In [118]:
# Merge both spouse roles into one gender-neutral value.
# Assigning the result back avoids an in-place replace on a column selection,
# which may not update the DataFrame in recent pandas versions.
df['relationship'] = df['relationship'].replace(['Husband', 'Wife'], 'Married')

df_t['relationship'] = df_t['relationship'].replace(['Husband', 'Wife'], 'Married')
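
Why this matters can be seen on a small example. The sketch below uses invented records that still carry a `sex` column and cross-tabulates it against `relationship` before and after the merge. The data and the names `toy`, `before`, and `after` are assumptions for illustration only.

```python
import pandas as pd

# Toy records (invented) that still carry the 'sex' column.
toy = pd.DataFrame({
    "relationship": ["Husband", "Wife", "Husband", "Own-child", "Wife", "Unmarried"],
    "sex": ["Male", "Female", "Male", "Female", "Female", "Male"],
})

# Before merging: in this toy data, 'Husband' and 'Wife' each map to one sex.
before = pd.crosstab(toy["relationship"], toy["sex"])

# After merging both values into 'Married', the category no longer reveals it.
toy["relationship"] = toy["relationship"].replace(["Husband", "Wife"], "Married")
after = pd.crosstab(toy["relationship"], toy["sex"])

print(before)
print(after)
```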

# Discard singular values
After cleaning up the records (see above), some values in a particular column may occur only in the training dataset (and not in the test dataset), or vice versa.

That poses a problem later, when we vectorize categorical values: each distinct value becomes its own column, so the two datasets would end up with different dimensions.
So, we will discard these values for simplicity's sake.

In [119]:
df[df['native-country'] == 'Holand-Netherlands']

Unnamed: 0,age,workclass,education,marital-status,occupation,relationship,capital-gain,capital-loss,hours-per-week,native-country,class
19609,32,Private,Some-college,Never-married,Machine-op-inspct,Other-relative,0,2205,40,Holand-Netherlands,<=50K


In [120]:
df_t[df_t['native-country'] == 'Holand-Netherlands']

Unnamed: 0,age,workclass,education,marital-status,occupation,relationship,capital-gain,capital-loss,hours-per-week,native-country,class


In [121]:
df.drop(19609, inplace=True)
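
The manual check above covers a single value of `native-country`. A possible generalization, sketched under the assumption that both datasets are pandas DataFrames with the same columns, compares the value sets of every categorical column. The helper name `values_in_one_only` and the toy frames are hypothetical.

```python
import pandas as pd

def values_in_one_only(train, test, columns):
    """Map each column to (values only in train, values only in test)."""
    report = {}
    for col in columns:
        train_values = set(train[col].unique())
        test_values = set(test[col].unique())
        if train_values != test_values:
            report[col] = (train_values - test_values, test_values - train_values)
    return report

# Toy frames (invented) standing in for df and df_t.
toy_train = pd.DataFrame({
    "native-country": ["United-States", "Cuba", "Holand-Netherlands"],
    "workclass": ["Private", "State-gov", "Private"],
})
toy_test = pd.DataFrame({
    "native-country": ["United-States", "Cuba", "Cuba"],
    "workclass": ["Private", "State-gov", "Private"],
})

report = values_in_one_only(toy_train, toy_test, ["native-country", "workclass"])
print(report)
```

Columns whose value sets agree are left out of the report, so an empty dictionary means there is nothing to discard.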

# Define input features, classification target, and positive/negative classes

In [122]:
target_column = 'class'

input_features = df.columns.values.tolist()
input_features.remove(target_column)

positive_class = '>50K'
negative_class = '<=50K'
class_labels = [negative_class, positive_class]
class_names = [r'Low Income', r'High Income']

# Filter out data points with missing values (purification) in both datasets

In [123]:
count_before_drop = len(df)

df = df.replace(['?'],[None]).dropna(how='any')
df_t = df_t.replace(['?'],[None]).dropna(how='any')

count_after_drop = len(df)
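
The cell above stores the row counts before and after the filter without reporting them. A minimal sketch on invented data shows the same replace-then-drop pattern and how the number of discarded rows could be reported. The names `toy`, `cleaned`, `rows_before`, and `rows_after` are illustrative.

```python
import numpy as np
import pandas as pd

# Toy records (invented); '?' marks a missing value, as in the dataset.
toy = pd.DataFrame({
    "workclass": ["Private", "?", "State-gov", "Private"],
    "occupation": ["Sales", "?", "?", "Tech-support"],
    "age": [25, 18, 40, 33],
})

rows_before = len(toy)

# Turn the '?' placeholder into NaN, then drop every row with any NaN.
cleaned = toy.replace("?", np.nan).dropna(how="any")

rows_after = len(cleaned)
print(f"dropped {rows_before - rows_after} of {rows_before} rows")
```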

In [124]:
df.head()

Unnamed: 0,age,workclass,education,marital-status,occupation,relationship,capital-gain,capital-loss,hours-per-week,native-country,class
0,39,State-gov,Bachelors,Never-married,Adm-clerical,Not-in-family,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,Bachelors,Married-civ-spouse,Exec-managerial,Married,0,0,13,United-States,<=50K
2,38,Private,HS-grad,Divorced,Handlers-cleaners,Not-in-family,0,0,40,United-States,<=50K
3,53,Private,11th,Married-civ-spouse,Handlers-cleaners,Married,0,0,40,United-States,<=50K
4,28,Private,Bachelors,Married-civ-spouse,Prof-specialty,Married,0,0,40,Cuba,<=50K


# Split categorical dimensions into multiple boolean dimensions (dummies)

Using https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.DictVectorizer.html

In [125]:
from sklearn.feature_extraction import DictVectorizer as DV

vectorizer = DV(sparse=False)

cat_dicts = df[input_features].T.to_dict().values()
vec_cat = vectorizer.fit_transform(cat_dicts)

One row of the table, transformed into a dictionary:

In [126]:
next(iter(cat_dicts))

{'age': 39,
 'workclass': 'State-gov',
 'education': 'Bachelors',
 'marital-status': 'Never-married',
 'occupation': 'Adm-clerical',
 'relationship': 'Not-in-family',
 'capital-gain': 2174,
 'capital-loss': 0,
 'hours-per-week': 40,
 'native-country': 'United-States'}

In [127]:
vec_cat[0]

array([3.900e+01, 2.174e+03, 0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00,
       0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00,
       1.000e+00, 0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00,
       0.000e+00, 4.000e+01, 0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00,
       1.000e+00, 0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00,
       0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00,
       0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00,
       0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00,
       0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00,
       0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00,
       0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00, 1.000e+00, 0.000e+00,
       0.000e+00, 1.000e+00, 0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00,
       0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00,
       0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00, 

In [128]:
vectorizer.inverse_transform(X=vec_cat)[-1]

{'age': 52.0,
 'capital-gain': 15024.0,
 'education=HS-grad': 1.0,
 'hours-per-week': 40.0,
 'marital-status=Married-civ-spouse': 1.0,
 'native-country=United-States': 1.0,
 'occupation=Exec-managerial': 1.0,
 'relationship=Married': 1.0,
 'workclass=Self-emp-inc': 1.0}

In [129]:
# get_feature_names() was removed in scikit-learn 1.2
input_features_cat = list(vectorizer.get_feature_names_out())

In [130]:
# Do the same for the test dataset
vectorizer_t = DV(sparse=False)
cat_test_dicts = df_t[input_features].T.to_dict().values()
vec_t_cat = vectorizer_t.fit_transform(cat_test_dicts)
# get_feature_names() was removed in scikit-learn 1.2
input_features_t = list(vectorizer_t.get_feature_names_out())
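
This notebook fits a second vectorizer on the test dataset and compares the resulting columns later. An alternative, which is not what the notebook does, is to reuse the vectorizer fitted on the training data and call only `transform` on the test records. The sketch below uses invented records. `DictVectorizer.transform` ignores feature values that were not seen during fitting, so the column layout always matches the training layout.

```python
from sklearn.feature_extraction import DictVectorizer

# Toy records (invented) with a value on each side that the other lacks.
train_records = [
    {"age": 39, "native-country": "United-States"},
    {"age": 32, "native-country": "Holand-Netherlands"},
]
test_records = [
    {"age": 28, "native-country": "Cuba"},  # value never seen in training
    {"age": 45, "native-country": "United-States"},
]

vec = DictVectorizer(sparse=False)
X_train = vec.fit_transform(train_records)

# Reuse the fitted vectorizer: same columns, unseen values are ignored.
X_test = vec.transform(test_records)

print(X_train.shape, X_test.shape)
print(X_test)
```

With this approach the unseen value `Cuba` produces an all-zero country encoding rather than a new column, which may or may not be acceptable depending on the model.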

# Split input dataset into training and validation (hold-out) datasets 

In [131]:
from sklearn.model_selection import train_test_split
x_input = vec_cat
y_input = df[target_column]
x_train, x_validation, y_train, y_validation = train_test_split(x_input, y_input, test_size=0.20)


In [132]:
x_train[0]

array([41.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  1.,  0.,  0.,  0.,  0., 40.,  1.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,
        0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,
        0.,  0.])

In [133]:
len(x_train), len(x_validation),  len(x_validation)/(len(x_train)+ len(x_validation))*100

(24128, 6033, 20.00265243194854)
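
The validation share above is slightly over 20% because the split has to contain a whole number of rows. The split itself is random, so the outputs shown here will differ between runs. If a reproducible split with the same class ratio in both parts is wanted, `train_test_split` accepts `random_state` and `stratify`. A sketch on invented data, where the seed value `0` is an arbitrary assumption:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (invented): 100 samples, 20% of them in the positive class.
X = np.arange(200).reshape(100, 2)
y = np.array(["<=50K"] * 80 + [">50K"] * 20)

X_tr, X_val, y_tr, y_val = train_test_split(
    X, y,
    test_size=0.20,
    random_state=0,  # fixed seed (arbitrary) makes the split repeatable
    stratify=y,      # keep the class ratio equal in both parts
)

print(len(X_tr), len(X_val), (y_val == ">50K").sum())
```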

## The test dataset is not split

In [134]:
y_test = df_t[target_column]
x_test = vec_t_cat

## Check whether the test dataset has the same dimensions as the training dataset

If there are attribute-value pairs in the training dataset that do not occur in the test dataset, the notebooks may fail.

In [135]:
set(input_features_cat) - set(input_features_t), set(input_features_t) - set(input_features_cat) 

(set(), set())

In [136]:
# Comparing the full lists also catches a length mismatch, which zip() would silently truncate
list(input_features_cat) == list(input_features_t)

True
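
Both checks above can be folded into one small helper that reports which names are missing on either side and whether the order agrees. This is a sketch. The function name `compare_feature_names` and the feature lists are hypothetical.

```python
def compare_feature_names(train_names, test_names):
    """Return (only_in_train, only_in_test, same_order) for two name lists."""
    train_names, test_names = list(train_names), list(test_names)
    only_in_train = sorted(set(train_names) - set(test_names))
    only_in_test = sorted(set(test_names) - set(train_names))
    # List equality checks names, order, and length in one step.
    same_order = train_names == test_names
    return only_in_train, only_in_test, same_order

# Hypothetical feature lists for illustration.
train_names = ["age", "native-country=Cuba", "native-country=Holand-Netherlands"]
test_names = ["age", "native-country=Cuba"]

result = compare_feature_names(train_names, test_names)
print(result)
```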