## Get data

In [83]:
import pandas as pd
import numpy as np

In [11]:
!curl -X GET http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.names

| This data was extracted from the census bureau database found at
| http://www.census.gov/ftp/pub/DES/www/welcome.html
| Donor: Ronny Kohavi and Barry Becker,
|        Data Mining and Visualization
|        Silicon Graphics.
|        e-mail: ronnyk@sgi.com for questions.
| Split into train-test using MLC++ GenCVFiles (2/3, 1/3 random).
| 48842 instances, mix of continuous and discrete    (train=32561, test=16281)
| 45222 if instances with unknown values are removed (train=30162, test=15060)
| Duplicate or conflicting instances : 6
| Class probabilities for adult.all file
| Probability for the label '>50K'  : 23.93% / 24.78% (without unknowns)
| Probability for the label '<=50K' : 76.07% / 75.22% (without unknowns)
|
| Extraction was done by Barry Becker from the 1994 Census database.  A set of
|   reasonably clean records was extracted using the following conditions:
|   ((AAGE>16) && (AGI>100) && (AFNLWGT>1)&& (HRSWK>0))
|
| Prediction task is to determine whether a

In [124]:
features = ['age', 'workclass', 'fnlwgt', 'education', 'education-num','marital-status','occupation','relationship','race','sex','capital-gain','capital-loss','hours-per-week','native-country','target'] 

In [125]:
df = pd.read_csv('adult.data', names=features, header=None,index_col=False)

In [126]:
df.head()

|   | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |


In [127]:
df.drop(['fnlwgt'],axis=1,inplace=True)

## Feature clean up
### Workclass

In [128]:
list(df)

['age',
 'workclass',
 'education',
 'education-num',
 'marital-status',
 'occupation',
 'relationship',
 'race',
 'sex',
 'capital-gain',
 'capital-loss',
 'hours-per-week',
 'native-country',
 'target']

In [129]:
df['workclass'].unique()

array([' State-gov', ' Self-emp-not-inc', ' Private', ' Federal-gov',
       ' Local-gov', ' ?', ' Self-emp-inc', ' Without-pay',
       ' Never-worked'], dtype=object)

In [130]:
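# ' ?' (with its leading space) marks a missing value; assign the result back
# instead of calling replace(..., inplace=True) on a column selection
df['workclass'] = df['workclass'].replace(' ?', 'unknown')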
df['workclass'].unique()

array([' State-gov', ' Self-emp-not-inc', ' Private', ' Federal-gov',
       ' Local-gov', 'unknown', ' Self-emp-inc', ' Without-pay',
       ' Never-worked'], dtype=object)
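The labels printed above still carry a leading space (`' State-gov'`), which is why the replacement has to match `' ?'` exactly. Below is a minimal sketch of stripping that whitespace before replacing, run on a hypothetical three-row frame rather than the real `adult.data`. Passing `skipinitialspace=True` to `pd.read_csv` is an alternative that handles this at load time.

```python
import pandas as pd

# Hypothetical toy frame mimicking the leading spaces seen in adult.data
toy = pd.DataFrame({'workclass': [' State-gov', ' ?', ' Private']})

# Strip surrounding whitespace, then map the '?' placeholder to 'unknown'
toy['workclass'] = toy['workclass'].str.strip().replace('?', 'unknown')

print(toy['workclass'].tolist())
```

If the whitespace is stripped on the real frame, any later comparison against a padded literal such as `' <=50K'` would need the same treatment.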

In [132]:
# we probably can just use education-num
df['education'].unique()

array([' Bachelors', ' HS-grad', ' 11th', ' Masters', ' 9th',
       ' Some-college', ' Assoc-acdm', ' Assoc-voc', ' 7th-8th',
       ' Doctorate', ' Prof-school', ' 5th-6th', ' 10th', ' 1st-4th',
       ' Preschool', ' 12th'], dtype=object)

In [133]:
df['education-num'].unique()

array([13,  9,  7, 14,  5, 10, 12, 11,  4, 16, 15,  3,  6,  2,  1,  8])
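The comment in the `education` cell guesses that `education-num` alone may be enough. One way to test that guess is to check that each label maps to exactly one code and vice versa. The sketch below uses a hypothetical toy frame whose label/code pairs are taken from the `df.head()` output; running the same two `groupby` calls on the real `df` would be the actual check.

```python
import pandas as pd

# Toy frame; the (label, code) pairs match rows shown in df.head()
toy = pd.DataFrame({
    'education': [' Bachelors', ' HS-grad', ' Bachelors', ' 11th'],
    'education-num': [13, 9, 13, 7],
})

# Distinct codes per label, and distinct labels per code
codes_per_label = toy.groupby('education')['education-num'].nunique()
labels_per_code = toy.groupby('education-num')['education'].nunique()

# True only if the mapping is one-to-one in both directions
is_one_to_one = bool((codes_per_label == 1).all() and (labels_per_code == 1).all())
print(is_one_to_one)
```

If this comes out `True` on the full data, dropping `education` before `pd.get_dummies` should lose no information and shrinks the feature matrix.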

In [134]:
list(df)

['age',
 'workclass',
 'education',
 'education-num',
 'marital-status',
 'occupation',
 'relationship',
 'race',
 'sex',
 'capital-gain',
 'capital-loss',
 'hours-per-week',
 'native-country',
 'target']

In [135]:
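target = df.target
# One-hot encode the categorical columns; cast to float32 so the mixed
# integer/indicator columns form a single numeric array that Keras accepts
X = pd.get_dummies(df.drop(['target'], axis=1)).values.astype('float32')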

In [136]:
X.shape

(32561, 107)

In [137]:
# Binary label: 1 for ' <=50K', 0 for ' >50K' (note the leading space in the raw values)
y = (target.values == ' <=50K').astype(int)
y.shape

(32561,)
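To make the encoding in the cells above concrete, here is a small sketch on a hypothetical three-row frame. It shows that `pd.get_dummies` expands only the string column and passes numeric columns through, and that the label is `1` for `' <=50K'`. The toy values are illustrative, not drawn from the data set.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one numeric and one categorical feature
toy = pd.DataFrame({
    'age': [39, 50, 38],
    'sex': [' Male', ' Male', ' Female'],
    'target': [' <=50K', ' >50K', ' <=50K'],
})

# 'age' stays as one column; 'sex' becomes two indicator columns
X_toy = pd.get_dummies(toy.drop(columns=['target'])).to_numpy(dtype='float32')

# 1 marks ' <=50K', 0 marks ' >50K'
y_toy = (toy['target'] == ' <=50K').to_numpy().astype(int)

print(X_toy.shape, y_toy)
```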

## Modeling

In [33]:
from sklearn.model_selection import train_test_split, KFold

In [140]:
import tensorflow as tf
tf.__version__

'2.1.0'

In [141]:
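# Fix the seed for a reproducible split and stratify to keep the class ratio in both parts
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.25, random_state=0, stratify=y)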

In [163]:
model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(2, activation='softmax')
])
# y holds integer class labels (0/1), not one-hot vectors, so use the sparse loss
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
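A two-unit softmax output can be trained in two ways: integer labels with `sparse_categorical_crossentropy`, or one-hot labels with `categorical_crossentropy`. Mixing the two conventions typically produces a shape error or a meaningless loss. As a sketch, this is how integer labels of shape `(n,)` could be turned into one-hot rows of shape `(n, 2)` with plain NumPy; the array values are made up for illustration.

```python
import numpy as np

# Integer class labels, shape (4,)
y_int = np.array([1, 0, 1, 1])

# Row i of the identity matrix is the one-hot vector for class i
y_onehot = np.eye(2, dtype='float32')[y_int]

print(y_onehot.shape)
print(y_onehot)
```

`tf.keras.utils.to_categorical` performs the same conversion if staying inside Keras is preferred.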

In [167]:
model.fit(X_train, y_train, batch_size=32, epochs=50)

Train on 24420 samples
Epoch 1/50
...
Epoch 29/50

KeyboardInterrupt: 


Each model should have a different number of layers, different activation functions, and different weight initializations.

- Now try cross-validation on the whole data set for each of the five models, with and without the selected features. Does the feature selection lead to underfitting or overfitting? How do you know?
- Compare and contrast which features you kept and which ones you dropped, based on the model. Note: different features may perform better with different models, so please explore keeping and dropping different features depending on the algorithm. How does your choice of model and features affect underfitting or overfitting?
- Try regularizing each of your models. Does the generalizability increase or decrease? In which cases does each happen, and why? Please try this with all of your features and then with the reduced set of features. Report your precision, recall, and F1 score on the train and test sets. Next, carry out cross-validation. Does regularization reduce underfitting or overfitting? Why or why not? How does the space of features affect your metrics and your optimal regularization parameters?
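For the precision, recall, and F1 reporting asked for above, a minimal NumPy sketch follows. The function name and the toy label vectors are illustrative assumptions; `sklearn.metrics.precision_recall_fscore_support` is a library alternative.

```python
import numpy as np

def precision_recall_f1(y_true, y_pred, positive=1):
    """Return (precision, recall, F1) for the class given by `positive`."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_pred == positive) & (y_true == positive))  # true positives
    fp = np.sum((y_pred == positive) & (y_true != positive))  # false positives
    fn = np.sum((y_pred != positive) & (y_true == positive))  # false negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return float(precision), float(recall), float(f1)

# Made-up labels and predictions, for illustration only
y_true_demo = [1, 1, 0, 0, 1]
y_pred_demo = [1, 0, 0, 1, 1]
p, r, f = precision_recall_f1(y_true_demo, y_pred_demo)
print(p, r, f)
```

Calling it once on the training predictions and once on the test predictions gives the two sets of numbers to compare; a large gap between them is one sign of overfitting.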

Hint: for regularization you can use an L1 or L2 norm penalty, or dropout.
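The hint could be applied in Keras roughly as follows: an L2 penalty attached to a layer's weights through `kernel_regularizer`, plus `Dropout` layers between the dense layers. This is a sketch under assumptions: the function name, the layer sizes, `l2_strength`, and `dropout_rate` are illustrative choices, not values prescribed by the assignment.

```python
import numpy as np
import tensorflow as tf

def build_regularized_model(n_features, l2_strength=1e-3, dropout_rate=0.3):
    """Small classifier with L2 weight penalties and dropout (illustrative sizes)."""
    l2 = tf.keras.regularizers.l2(l2_strength)
    model = tf.keras.models.Sequential([
        tf.keras.Input(shape=(n_features,)),
        tf.keras.layers.Dense(32, activation='relu', kernel_regularizer=l2),
        tf.keras.layers.Dropout(dropout_rate),  # active only during training
        tf.keras.layers.Dense(32, activation='relu', kernel_regularizer=l2),
        tf.keras.layers.Dropout(dropout_rate),
        tf.keras.layers.Dense(2, activation='softmax'),
    ])
    # Integer labels (0/1) pair with the sparse cross-entropy loss
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    return model

# 107 matches the number of columns reported by X.shape in this notebook
model_reg = build_regularized_model(n_features=107)
probs = model_reg.predict(np.zeros((4, 107), dtype='float32'), verbose=0)
print(probs.shape)
```

Swapping `tf.keras.regularizers.l2` for `tf.keras.regularizers.l1` gives the L1 variant; setting `dropout_rate=0.0` isolates the effect of the weight penalty.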

- Now, instead of trying different models, we will use grid search and cross-validation to tune the hyperparameters of our model. Our tunable parameters are:
  - The number of layers (please don't go deeper than 10 hidden layers).
  - The number of nodes per layer.
  - The type of regularization to use.
  - The type of weight initialization to use.
  - The type of activation function.
  - The metric to evaluate with. Although log loss is standard, try using other performance metrics. You may even try several and average them, or take their harmonic mean.
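The grid search with cross-validation described above can be sketched with scikit-learn's `GridSearchCV`. To keep the example short, it swaps in scikit-learn's `MLPClassifier` for the Keras model and runs on synthetic data, so it illustrates the search procedure only. `MLPClassifier` exposes depth and width (`hidden_layer_sizes`), activation, and L2 strength (`alpha`), but not dropout or weight initialization; those would need a Keras model wrapped in a custom loop. The grid values are arbitrary assumptions.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for (X, y); not the adult data
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(120, 5))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

# Each key is one tunable hyperparameter; every combination is evaluated
param_grid = {
    'hidden_layer_sizes': [(8,), (8, 8)],  # number of layers and nodes per layer
    'activation': ['relu', 'tanh'],
    'alpha': [1e-4, 1e-2],                 # L2 regularization strength
}

search = GridSearchCV(
    MLPClassifier(max_iter=300, random_state=0),
    param_grid,
    scoring='f1',  # F1 is the harmonic mean of precision and recall
    cv=KFold(n_splits=3, shuffle=True, random_state=0),
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(search.best_score_)
```

Changing `scoring` (for example to `'neg_log_loss'` or `'accuracy'`) shows how the chosen metric can shift which configuration wins.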
