In [1]:
%matplotlib inline

This US Census dataset contains detailed but anonymised information for approximately 300,000 people.

The archive contains 3 files:
- A large learning .csv file
- A test .csv file
- A metadata file describing the columns of the two files above (identical for both)

The goal of this exercise is to model and predict the information contained in the last (42nd) column, i.e., which people earn more or less than $50,000 per year, from the information contained in the other columns.
The exercise therefore consists of modelling a binary variable.

# Load Libraries

In [2]:
from pprint import pprint
import matplotlib.pyplot as Plot
import pandas as pd
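try:
    from sklearn.model_selection import cross_val_score  # scikit-learn >= 0.18
except ImportError:
    from sklearn.cross_validation import cross_val_score  # older releases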
from sklearn.feature_selection import SelectPercentile, SelectKBest, f_classif, chi2
from sklearn import metrics
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

#future version of sklearn (0.18.dev)
# from sklearn.neural_network import MLPClassifier 
from sklearn import svm
from IPython.display import display, HTML
pd.set_option('display.max_columns', 50)
import sys

print('Loaded...')

Loaded...


# Read Data

In [3]:
learndf = pd.read_csv("census_income_learn.csv", header = None, skipinitialspace = True, na_values= "Not in universe")
testdf = pd.read_csv("census_income_test.csv", header = None, skipinitialspace = True, na_values= "Not in universe")
print(len(learndf.columns))
display(learndf)

42


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41
0,73,,0,0,High school graduate,0,,Widowed,Not in universe or children,,White,All other,Female,,,Not in labor force,0,0,0,Nonfiler,,,Other Rel 18+ ever marr not in subfamily,Other relative of householder,1700.09,?,?,?,Not in universe under 1 year old,?,0,,United-States,United-States,United-States,Native- Born in the United States,0,,2,0,95,- 50000.
1,58,Self-employed-not incorporated,4,34,Some college but no degree,0,,Divorced,Construction,Precision production craft & repair,White,All other,Male,,,Children or Armed Forces,0,0,0,Head of household,South,Arkansas,Householder,Householder,1053.55,MSA to MSA,Same county,Same county,No,Yes,1,,United-States,United-States,United-States,Native- Born in the United States,0,,2,52,94,- 50000.
2,18,,0,0,10th grade,0,High school,Never married,Not in universe or children,,Asian or Pacific Islander,All other,Female,,,Not in labor force,0,0,0,Nonfiler,,,Child 18+ never marr Not in a subfamily,Child 18 or older,991.95,?,?,?,Not in universe under 1 year old,?,0,,Vietnam,Vietnam,Vietnam,Foreign born- Not a citizen of U S,0,,2,0,95,- 50000.
3,9,,0,0,Children,0,,Never married,Not in universe or children,,White,All other,Female,,,Children or Armed Forces,0,0,0,Nonfiler,,,Child <18 never marr not in subfamily,Child under 18 never married,1758.14,Nonmover,Nonmover,Nonmover,Yes,,0,Both parents present,United-States,United-States,United-States,Native- Born in the United States,0,,0,0,94,- 50000.
4,10,,0,0,Children,0,,Never married,Not in universe or children,,White,All other,Female,,,Children or Armed Forces,0,0,0,Nonfiler,,,Child <18 never marr not in subfamily,Child under 18 never married,1069.16,Nonmover,Nonmover,Nonmover,Yes,,0,Both parents present,United-States,United-States,United-States,Native- Born in the United States,0,,0,0,94,- 50000.
5,48,Private,40,10,Some college but no degree,1200,,Married-civilian spouse present,Entertainment,Professional specialty,Amer Indian Aleut or Eskimo,All other,Female,No,,Full-time schedules,0,0,0,Joint both under 65,,,Spouse of householder,Spouse of householder,162.61,?,?,?,Not in universe under 1 year old,?,1,,Philippines,United-States,United-States,Native- Born in the United States,2,,2,52,95,- 50000.
6,42,Private,34,3,Bachelors degree(BA AB BS),0,,Married-civilian spouse present,Finance insurance and real estate,Executive admin and managerial,White,All other,Male,,,Children or Armed Forces,5178,0,0,Joint both under 65,,,Householder,Householder,1535.86,Nonmover,Nonmover,Nonmover,Yes,,6,,United-States,United-States,United-States,Native- Born in the United States,0,,2,52,94,- 50000.
7,28,Private,4,40,High school graduate,0,,Never married,Construction,Handlers equip cleaners etc,White,All other,Female,,Job loser - on layoff,Unemployed full-time,0,0,0,Single,,,Secondary individual,Nonrelative of householder,898.83,?,?,?,Not in universe under 1 year old,?,4,,United-States,United-States,United-States,Native- Born in the United States,0,,2,30,95,- 50000.
8,47,Local government,43,26,Some college but no degree,876,,Married-civilian spouse present,Education,Adm support including clerical,White,All other,Female,No,,Full-time schedules,0,0,0,Joint both under 65,,,Spouse of householder,Spouse of householder,1661.53,?,?,?,Not in universe under 1 year old,?,5,,United-States,United-States,United-States,Native- Born in the United States,0,,2,52,95,- 50000.
9,34,Private,4,37,Some college but no degree,0,,Married-civilian spouse present,Construction,Machine operators assmblrs & inspctrs,White,All other,Male,,,Children or Armed Forces,0,0,0,Joint both under 65,,,Householder,Householder,1146.79,Nonmover,Nonmover,Nonmover,Yes,,6,,United-States,United-States,United-States,Native- Born in the United States,0,,2,52,94,- 50000.


# Analysis
Based on the learning file, make a quick statistics-based, univariate audit of the content of the different columns and present the results in visual / graphical format.

In [4]:
learndf.describe()

Unnamed: 0,0,2,3,5,16,17,18,24,30,36,38,39,40
count,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0
mean,34.494199,15.35232,11.306556,55.426908,434.71899,37.313788,197.529533,1740.380269,1.95618,0.175438,1.514833,23.174897,94.499672
std,22.310895,18.067129,14.454204,274.896454,4697.53128,271.896428,1984.163658,993.768156,2.365126,0.553694,0.851473,24.411488,0.500001
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,37.87,0.0,0.0,0.0,0.0,94.0
25%,15.0,0.0,0.0,0.0,0.0,0.0,0.0,1061.615,0.0,0.0,2.0,0.0,94.0
50%,33.0,0.0,0.0,0.0,0.0,0.0,0.0,1618.31,1.0,0.0,2.0,8.0,94.0
75%,50.0,33.0,26.0,0.0,0.0,0.0,0.0,2188.61,4.0,0.0,2.0,52.0,95.0
max,90.0,51.0,46.0,9999.0,99999.0,4608.0,99999.0,18656.3,6.0,2.0,2.0,52.0,95.0


In [5]:
for col in learndf.columns:
    print('{} : {} {}% NaN'.format(col, learndf[col].dtype, (learndf[col].isnull().sum() / len(learndf.index)) * 100))
    print(learndf[col].unique())

0 : int64 0.0% NaN
[73 58 18  9 10 48 42 28 47 34  8 32 51 46 26 13 39 16 35 12 27 56 55  2  1
 37  4 63 25 81 11 30  7 66 84 52  5 36 72 61 41 90 49  6  0 33 57 50 24 17
 53 40 54 22 29 85 38 76 21 31 74 19 15  3 43 68 71 45 62 23 69 75 44 59 60
 64 65 70 67 78 20 14 83 86 89 77 79 82 80 87 88]
1 : object 50.24232795216591% NaN
[nan 'Self-employed-not incorporated' 'Private' 'Local government'
 'Federal government' 'Self-employed-incorporated' 'State government'
 'Never worked' 'Without pay']
2 : int64 0.0% NaN
[ 0  4 40 34 43 37 24 39 12 35 45  3 19 29 32 48 33 23 44 36 31 30 41  5 11
  9 42  6 18 50  2  1 26 47 16 14 22 17  7  8 25 46 27 15 13 49 38 21 28 20
 51 10]
3 : int64 0.0% NaN
[ 0 34 10  3 40 26 37 31 12 36 41 22  2 35 25 23 42  8 19 29 27 16 33 13 18
  9 17 39 32 11 30 38 20  7 21 44 24 43 28  4  1  6 45 14  5 15 46]
4 : object 0.0% NaN
['High school graduate' 'Some college but no degree' '10th grade'
 'Children' 'Bachelors degree(BA AB BS)'
 'Masters degree(MA MS MEng MEd 

In [6]:
learndf.describe(include='all')

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41
count,199523.0,99278,199523.0,199523.0,199523,199523.0,12580,199523,199523,98839,199523,198649,199523,19064,6070,199523,199523.0,199523.0,199523.0,199523,15773,15773,199523,199523,199523.0,198007,198007,198007,199523,115469,199523.0,55291,199523,199523,199523,199523,199523.0,1984,199523.0,199523.0,199523.0,199523
unique,,8,,,17,,2,7,24,14,5,9,2,2,5,8,,,,6,5,50,38,8,,9,8,9,3,3,,4,43,43,43,5,,2,,,,2
top,,Private,,,High school graduate,,High school,Never married,Not in universe or children,Adm support including clerical,White,All other,Female,No,Other job loser,Children or Armed Forces,,,,Nonfiler,South,California,Householder,Householder,,?,?,?,Not in universe under 1 year old,?,,Both parents present,United-States,United-States,United-States,Native- Born in the United States,,No,,,,- 50000.
freq,,72028,,,48407,,6892,86485,100684,14837,167365,171907,103984,16034,2038,123769,,,,75094,4889,1714,53248,75475,,99696,99696,99696,101212,99696,,38983,159163,160479,176989,176992,,1593,,,,187141
mean,34.494199,,15.35232,11.306556,,55.426908,,,,,,,,,,,434.71899,37.313788,197.529533,,,,,,1740.380269,,,,,,1.95618,,,,,,0.175438,,1.514833,23.174897,94.499672,
std,22.310895,,18.067129,14.454204,,274.896454,,,,,,,,,,,4697.53128,271.896428,1984.163658,,,,,,993.768156,,,,,,2.365126,,,,,,0.553694,,0.851473,24.411488,0.500001,
min,0.0,,0.0,0.0,,0.0,,,,,,,,,,,0.0,0.0,0.0,,,,,,37.87,,,,,,0.0,,,,,,0.0,,0.0,0.0,94.0,
25%,15.0,,0.0,0.0,,0.0,,,,,,,,,,,0.0,0.0,0.0,,,,,,1061.615,,,,,,0.0,,,,,,0.0,,2.0,0.0,94.0,
50%,33.0,,0.0,0.0,,0.0,,,,,,,,,,,0.0,0.0,0.0,,,,,,1618.31,,,,,,1.0,,,,,,0.0,,2.0,8.0,94.0,
75%,50.0,,33.0,26.0,,0.0,,,,,,,,,,,0.0,0.0,0.0,,,,,,2188.61,,,,,,4.0,,,,,,0.0,,2.0,52.0,95.0,


# Procedural Comments

Having looked at the data, I am not convinced that the columns match the columns described in the metadata file. Other than assuming that column 0 is age and column 41 is the prediction target, I decided to treat the remaining columns as mislabeled.

My first step was to create a sparse matrix (one dummy column per category) from the data provided. I then apply variance thresholding (VarianceThreshold) to reduce the number of columns, followed by univariate feature selection to keep the top 10% of the features. Feature ranking with recursive feature elimination and cross-validated selection of the best number of features was too time-consuming, so that section has been commented out.

In [7]:
def isOver(row):
    return 0 if row[len(learndf.columns)-1] == '- 50000.' else 1

def validate(df):
    assert isinstance(df, pd.DataFrame)
    for col in df:
        if df[col].isnull().sum() > 0:
            print("Error NaN detected for {}!".format(col))
            return False
    print("No NaNs.")
    return True

def print_scores(model, X_test, y_true, y_pred):
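    # Float predictions are treated as probabilities and thresholded at 0.5
    # (assumed rule; the original GetPrediction helper is not defined in this notebook).
    if np.issubdtype(y_pred.dtype, np.floating):
        y_pred = (y_pred >= 0.5).astype(int)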
    acc_score_norm = metrics.accuracy_score(y_true, y_pred)
    acc_score_non_norm = metrics.accuracy_score(y_true, y_pred, normalize=False)
    print('Acc norm: {} Acc non-norm: {}'.format(acc_score_norm, acc_score_non_norm))
    ce_score_norm = metrics.log_loss(y_true, y_pred)
    ce_score_non_norm = metrics.log_loss(y_true, y_pred, normalize=False)
    print('CE norm: {} CE non-norm: {}'.format(ce_score_norm, ce_score_non_norm))
    matthews = metrics.matthews_corrcoef(y_true, y_pred)
    print('Matthews Cor. Coef: {}'.format(matthews))
    scores = get_roc_auc(model, X_test, y_true, y_pred)
    print('roc_auc: {} <- {}'.format(np.average(scores), scores))

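def get_roc_auc(model, X_test, y_true, y_pred):
    # Note: cross_val_score clones and refits the estimator on folds of the data passed in,
    # so these scores come from models trained on test-set folds, not from `model` as fitted.
    scores = cross_val_score(model, X_test, y=y_true, scoring='roc_auc', n_jobs=-1)
    return scores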

print('Loaded...')

Loaded...
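
The roc_auc figure printed by `print_scores` comes from `cross_val_score`, which refits copies of the model on folds of the test data. A held-out AUC for the model trained on the learning file could instead be computed from ranking scores such as `model.predict_proba(X_test)[:, 1]`. As a minimal, dependency-free sketch of what the metric measures (the helper `pairwise_auc` and the toy numbers are hypothetical, not part of this notebook):

```python
def pairwise_auc(y_true, scores):
    """Probability that a random positive is scored above a random negative.

    Ties count as half a win. This is O(n_pos * n_neg), so it is only meant
    to illustrate the definition, not to be run on the full census data.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example with probability-like scores: 3 of the 4 positive/negative pairs are ordered correctly.
print(pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))

# Toy example with hard 0/1 labels as "scores": ties remove most of the ranking information.
print(pairwise_auc([0, 1, 0, 1], [0, 0, 1, 1]))
```

On real data, `sklearn.metrics.roc_auc_score(y_true, scores)` computes the same quantity efficiently.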


# Prep Data

In [8]:
if (len(learndf.columns) == 42):
    y_train = pd.DataFrame()
    y_train['IsOver'] = learndf.apply(isOver, axis=1)
    y_test = pd.DataFrame()
    y_test['IsOver'] = testdf.apply(isOver, axis=1)
    del learndf[len(learndf.columns)-1]
    del testdf[len(testdf.columns)-1]

assert(len(learndf.columns) == len(testdf.columns))
y_train.describe()

Unnamed: 0,IsOver
count,199523.0
mean,0.062058
std,0.241261
min,0.0
25%,0.0
50%,0.0
75%,0.0
max,1.0
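
The mean of IsOver shows that only about 6.2% of the learning rows belong to the positive class, so accuracy alone can look flattering. A quick worked check, using only the figure printed above (the variable names are illustrative):

```python
# Share of positive rows (IsOver == 1) in the learning file, copied from the
# `mean` row of y_train.describe() above.
positive_rate = 0.062058

# Accuracy of a hypothetical "classifier" that always predicts the majority class 0.
baseline_accuracy = 1.0 - positive_rate
print('Majority-class baseline accuracy: {:.6f}'.format(baseline_accuracy))
```

A model therefore has to beat roughly 93.8% accuracy on the learning distribution to do better than a constant guess; the share in the test file may differ slightly.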


In [9]:
X_train = pd.DataFrame()
X_test = pd.DataFrame()

for col in learndf:
    try:
        if learndf[col].dtype.name == 'int64' or learndf[col].dtype.name == 'float64' :
            col_train = pd.DataFrame({col : learndf[col]})
            col_test = pd.DataFrame({col : testdf[col]})
        elif learndf[col].dtype.name == 'object':
            col_train = pd.get_dummies(learndf[col], prefix=str(col))
            col_test = pd.get_dummies(testdf[col], prefix=str(col))
        else:
            print('bad type')
            raise TypeError('Unsupported dtype for column {}: {}'.format(col, learndf[col].dtype))
        assert isinstance(col_train, pd.DataFrame)
        assert isinstance(col_test, pd.DataFrame)
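        # Align the test dummies to the training columns (same names, same order), because
        # later cells select features by position. Categories missing from the test set
        # become all-zero columns; categories unseen in training are dropped.
        if list(col_train.columns) != list(col_test.columns):
            col_test = col_test.reindex(columns=col_train.columns, fill_value=0)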
        X_train = pd.concat([X_train, col_train], axis=1)
        X_test = pd.concat([X_test, col_test], axis=1)
    except Exception:
        print("Exception on column {}: {}".format(col, sys.exc_info()[0]))
        raise

X_train.describe(include='all')

Unnamed: 0,0,1_Federal government,1_Local government,1_Never worked,1_Private,1_Self-employed-incorporated,1_Self-employed-not incorporated,1_State government,1_Without pay,2,3,4_10th grade,4_11th grade,4_12th grade no diploma,4_1st 2nd 3rd or 4th grade,4_5th or 6th grade,4_7th and 8th grade,4_9th grade,4_Associates degree-academic program,4_Associates degree-occup /vocational,4_Bachelors degree(BA AB BS),4_Children,4_Doctorate degree(PhD EdD),4_High school graduate,4_Less than 1st grade,...,34_Panama,34_Peru,34_Philippines,34_Poland,34_Portugal,34_Puerto-Rico,34_Scotland,34_South Korea,34_Taiwan,34_Thailand,34_Trinadad&Tobago,34_United-States,34_Vietnam,34_Yugoslavia,35_Foreign born- Not a citizen of U S,35_Foreign born- U S citizen by naturalization,35_Native- Born abroad of American Parent(s),35_Native- Born in Puerto Rico or U S Outlying,35_Native- Born in the United States,36,37_No,37_Yes,38,39,40
count,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,...,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0
mean,34.494199,0.01466,0.039013,0.0022,0.361001,0.016364,0.042326,0.021186,0.000827,15.35232,11.306556,0.037875,0.034462,0.010655,0.009017,0.016424,0.040131,0.031224,0.021867,0.026854,0.099562,0.237677,0.00633,0.242614,0.004105,...,0.00014,0.001343,0.004235,0.00191,0.000872,0.007017,0.000376,0.002361,0.001007,0.000566,0.000331,0.887061,0.00196,0.000331,0.067165,0.029345,0.008801,0.007613,0.887076,0.175438,0.007984,0.00196,1.514833,23.174897,94.499672
std,22.310895,0.120188,0.193626,0.046855,0.480292,0.126871,0.201332,0.144003,0.028745,18.067129,14.454204,0.190895,0.182414,0.102674,0.094526,0.1271,0.196266,0.173924,0.14625,0.161657,0.299416,0.425661,0.07931,0.428664,0.063937,...,0.011845,0.036625,0.06494,0.043657,0.029518,0.083472,0.019384,0.048529,0.031724,0.023791,0.018185,0.316519,0.044225,0.018185,0.250308,0.168772,0.0934,0.086921,0.316501,0.553694,0.088996,0.044225,0.851473,24.411488,0.500001
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,94.0
25%,15.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,2.0,0.0,94.0
50%,33.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,2.0,8.0,94.0
75%,50.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,33.0,26.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,2.0,52.0,95.0
max,90.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,51.0,46.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,2.0,52.0,95.0


# Feature Selection
- http://scikit-learn.org/stable/modules/feature_selection.html 
- http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.VarianceThreshold.html 
- http://scikit-learn.org/stable/auto_examples/feature_selection/plot_feature_selection.html#example-feature-selection-plot-feature-selection-py 
- http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.StratifiedKFold.html#sklearn.cross_validation.StratifiedKFold 
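
The threshold passed to VarianceThreshold in the next cell is 0.95 * (1 - 0.95). For a 0/1 dummy column the variance is p * (1 - p), where p is the share of ones, so the rule amounts to dropping dummies in which one value covers more than 95% of the rows. A small worked sketch, using two column means from the X_train summary above (the helper name is hypothetical):

```python
THRESHOLD = 0.95 * (1 - 0.95)  # 0.0475, the value passed to VarianceThreshold

def dummy_variance(share_of_ones):
    """Population variance of a 0/1 column in which `share_of_ones` of the rows are 1."""
    return share_of_ones * (1.0 - share_of_ones)

# Column means copied from X_train.describe() above.
for name, mean in [('1_Private', 0.361001), ('1_Never worked', 0.0022)]:
    var = dummy_variance(mean)
    verdict = 'kept' if var > THRESHOLD else 'dropped'
    print('{:<16} variance={:.6f} -> {}'.format(name, var, verdict))
```

For non-dummy columns such as column 0, the same threshold is applied to the raw variance, so the outcome depends on the scale of the column.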

In [10]:
from sklearn.feature_selection import VarianceThreshold

# Remove features whose variance is below 0.95 * (1 - 0.95) = 0.0475; for a 0/1 dummy
# column this means one of the two values occurs in more than 95% of the samples.
vt = VarianceThreshold(threshold=(.95 * (1 - .95)))
vt.fit(X_train)
supportIndices = vt.get_support(indices=True)
print(supportIndices)
vt_X_train = X_train.iloc[:, supportIndices]
vt_X_test = X_test.iloc[:, supportIndices]
vt_X_train.describe(include='all')

[  0   4   9  10  20  21  23  27  28  31  33  35  37  52  57  62  64  68
  69  71  73  78  80  81  90  91  92  99 100 101 107 108 109 112 114 115
 173 179 191 193 208 209 211 213 216 217 218 221 225 227 233 235 242 244
 245 246 247 248 250 251 253 281 295 338 381 384 388 389 392 393 394]


Unnamed: 0,0,1_Private,2,3,4_Bachelors degree(BA AB BS),4_Children,4_High school graduate,4_Some college but no degree,5,7_Divorced,7_Married-civilian spouse present,7_Never married,7_Widowed,8_Not in universe or children,8_Retail trade,9_Adm support including clerical,9_Executive admin and managerial,9_Other service,9_Precision production craft & repair,9_Professional specialty,9_Sales,10_Black,10_White,11_All other,12_Female,...,25_?,25_MSA to MSA,25_Nonmover,26_?,26_Nonmover,27_?,27_Nonmover,28_No,28_Not in universe under 1 year old,28_Yes,29_?,29_No,30,31_Both parents present,31_Mother only present,32_Mexico,32_United-States,33_United-States,34_United-States,35_Foreign born- Not a citizen of U S,35_Native- Born in the United States,36,38,39,40
count,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,...,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0
mean,34.494199,0.361001,15.35232,11.306556,0.099562,0.237677,0.242614,0.139433,55.426908,0.063702,0.422117,0.433459,0.05244,0.504624,0.085554,0.074362,0.062624,0.06064,0.052716,0.069867,0.059056,0.102319,0.838826,0.86159,0.521163,...,0.499672,0.053132,0.413677,0.499672,0.413677,0.499672,0.413677,0.079054,0.50727,0.413677,0.499672,0.050054,1.95618,0.195381,0.064013,0.05016,0.797718,0.804313,0.887061,0.067165,0.887076,0.175438,1.514833,23.174897,94.499672
std,22.310895,0.480292,18.067129,14.454204,0.299416,0.425661,0.428664,0.346398,274.896454,0.244222,0.493898,0.495554,0.222913,0.49998,0.279705,0.26236,0.242287,0.238669,0.223466,0.254923,0.23573,0.303068,0.367693,0.345331,0.499553,...,0.500001,0.224297,0.492493,0.500001,0.492493,0.500001,0.492493,0.269823,0.499948,0.492493,0.500001,0.218058,2.365126,0.396495,0.244776,0.218275,0.401703,0.396729,0.316519,0.250308,0.316501,0.553694,0.851473,24.411488,0.500001
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,94.0
25%,15.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,2.0,0.0,94.0
50%,33.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,2.0,8.0,94.0
75%,50.0,1.0,33.0,26.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,...,1.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,0.0,4.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,2.0,52.0,95.0
max,90.0,1.0,51.0,46.0,1.0,1.0,1.0,1.0,9999.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,6.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,2.0,52.0,95.0


In [None]:
# import matplotlib.pyplot as plt
# from sklearn.svm import SVC
# from sklearn.cross_validation import StratifiedKFold
# from sklearn.feature_selection import RFECV

# # Create the RFE object and compute a cross-validated score.
# svc = SVC(kernel="linear")
# # The "accuracy" scoring is proportional to the number of correct classifications
# rfecv = RFECV(estimator=svc, step=1, cv=StratifiedKFold(y_train.IsOver.values, 3), scoring='accuracy')
# rfecv.fit(X_train, y_train.IsOver.ravel())

# print("Optimal number of features : %d" % rfecv.n_features_)

# # Plot number of features VS. cross-validation scores
# plt.figure()
# plt.xlabel("Number of features selected")
# plt.ylabel("Cross validation score (nb of correct classifications)")
# plt.plot(range(1, len(rfecv.grid_scores_) + 1), rfecv.grid_scores_)
# plt.show()

In [11]:
selector = SelectPercentile(f_classif, percentile=10)
selector.fit(X_train, y_train.IsOver.ravel())

supportIndices = selector.get_support(indices=True)
print(len(supportIndices))
print(supportIndices)

spf_X_train = X_train.iloc[:, supportIndices]
spf_X_test = X_test.iloc[:, supportIndices]
spf_X_train.describe(include='all')

40
[  0   1   4   5   9  20  21  22  25  26  33  35  45  48  52  53  56  64
  71  73  81  90  91  99 100 101 107 108 109 112 114 179 191 211 213 250
 251 253 392 393]


Unnamed: 0,0,1_Federal government,1_Private,1_Self-employed-incorporated,2,4_Bachelors degree(BA AB BS),4_Children,4_Doctorate degree(PhD EdD),4_Masters degree(MA MS MEng MEd MSW MBA),4_Prof school degree (MD DDS DVM LLB JD),7_Married-civilian spouse present,7_Never married,8_Finance insurance and real estate,8_Manufacturing-durable goods,8_Not in universe or children,8_Other professional services,8_Public administration,9_Executive admin and managerial,9_Professional specialty,9_Sales,11_All other,12_Female,12_Male,15_Children or Armed Forces,15_Full-time schedules,15_Not in labor force,16,17,18,19_Joint both under 65,19_Nonfiler,22_Child <18 never marr not in subfamily,22_Householder,23_Child under 18 never married,23_Householder,30,31_Both parents present,31_Mother only present,38,39
count,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0
mean,34.494199,0.01466,0.361001,0.016364,15.35232,0.099562,0.237677,0.00633,0.032783,0.008986,0.422117,0.433459,0.030798,0.045183,0.504624,0.022464,0.023105,0.062624,0.069867,0.059056,0.86159,0.521163,0.478837,0.620324,0.204167,0.13436,434.71899,37.313788,197.529533,0.33772,0.376368,0.252232,0.266877,0.252733,0.378277,1.95618,0.195381,0.064013,1.514833,23.174897
std,22.310895,0.120188,0.480292,0.126871,18.067129,0.299416,0.425661,0.07931,0.178069,0.09437,0.493898,0.495554,0.172772,0.207705,0.49998,0.148186,0.150238,0.242287,0.254923,0.23573,0.345331,0.499553,0.499553,0.485307,0.403093,0.34104,4697.53128,271.896428,1984.163658,0.472934,0.484475,0.434295,0.442328,0.43458,0.484958,2.365126,0.396495,0.244776,0.851473,24.411488
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,15.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0
50%,33.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,2.0,8.0
75%,50.0,0.0,1.0,0.0,33.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,4.0,0.0,0.0,2.0,52.0
max,90.0,1.0,1.0,1.0,51.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,99999.0,4608.0,99999.0,1.0,1.0,1.0,1.0,1.0,1.0,6.0,1.0,1.0,2.0,52.0


In [12]:
selector = SelectPercentile(chi2, percentile=10)
selector.fit(X_train, y_train.IsOver.ravel())

supportIndices = selector.get_support(indices=True)
print(len(supportIndices))
print(supportIndices)

spc_X_train = X_train.iloc[:, supportIndices]
spc_X_test = X_test.iloc[:, supportIndices]
spc_X_train.describe(include='all')

40
[  0   1   4   5   9  20  21  22  25  26  28  33  35  45  48  52  53  56
  64  71  73  90  91 100 101 107 108 109 112 114 179 191 211 213 217 250
 251 253 392 393]


Unnamed: 0,0,1_Federal government,1_Private,1_Self-employed-incorporated,2,4_Bachelors degree(BA AB BS),4_Children,4_Doctorate degree(PhD EdD),4_Masters degree(MA MS MEng MEd MSW MBA),4_Prof school degree (MD DDS DVM LLB JD),5,7_Married-civilian spouse present,7_Never married,8_Finance insurance and real estate,8_Manufacturing-durable goods,8_Not in universe or children,8_Other professional services,8_Public administration,9_Executive admin and managerial,9_Professional specialty,9_Sales,12_Female,12_Male,15_Full-time schedules,15_Not in labor force,16,17,18,19_Joint both under 65,19_Nonfiler,22_Child <18 never marr not in subfamily,22_Householder,23_Child under 18 never married,23_Householder,24,30,31_Both parents present,31_Mother only present,38,39
count,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0
mean,34.494199,0.01466,0.361001,0.016364,15.35232,0.099562,0.237677,0.00633,0.032783,0.008986,55.426908,0.422117,0.433459,0.030798,0.045183,0.504624,0.022464,0.023105,0.062624,0.069867,0.059056,0.521163,0.478837,0.204167,0.13436,434.71899,37.313788,197.529533,0.33772,0.376368,0.252232,0.266877,0.252733,0.378277,1740.380269,1.95618,0.195381,0.064013,1.514833,23.174897
std,22.310895,0.120188,0.480292,0.126871,18.067129,0.299416,0.425661,0.07931,0.178069,0.09437,274.896454,0.493898,0.495554,0.172772,0.207705,0.49998,0.148186,0.150238,0.242287,0.254923,0.23573,0.499553,0.499553,0.403093,0.34104,4697.53128,271.896428,1984.163658,0.472934,0.484475,0.434295,0.442328,0.43458,0.484958,993.768156,2.365126,0.396495,0.244776,0.851473,24.411488
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,37.87,0.0,0.0,0.0,0.0,0.0
25%,15.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1061.615,0.0,0.0,0.0,2.0,0.0
50%,33.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1618.31,1.0,0.0,0.0,2.0,8.0
75%,50.0,0.0,1.0,0.0,33.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,2188.61,4.0,0.0,0.0,2.0,52.0
max,90.0,1.0,1.0,1.0,51.0,1.0,1.0,1.0,1.0,1.0,9999.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,99999.0,4608.0,99999.0,1.0,1.0,1.0,1.0,1.0,1.0,18656.3,6.0,1.0,1.0,2.0,52.0


In [13]:
selector = SelectKBest(f_classif)
selector.fit(X_train, y_train.IsOver.ravel())

supportIndices = selector.get_support(indices=True)
print(len(supportIndices))
print(supportIndices)

kbf_X_train = X_train.iloc[:, supportIndices]
kbf_X_test = X_test.iloc[:, supportIndices]
kbf_X_train.describe(include='all')

10
[ 52  64  71 107 112 114 191 213 250 393]


Unnamed: 0,8_Not in universe or children,9_Executive admin and managerial,9_Professional specialty,16,19_Joint both under 65,19_Nonfiler,22_Householder,23_Householder,30,39
count,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0
mean,0.504624,0.062624,0.069867,434.71899,0.33772,0.376368,0.266877,0.378277,1.95618,23.174897
std,0.49998,0.242287,0.254923,4697.53128,0.472934,0.484475,0.442328,0.484958,2.365126,24.411488
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,8.0
75%,1.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,4.0,52.0
max,1.0,1.0,1.0,99999.0,1.0,1.0,1.0,1.0,6.0,52.0


In [14]:
selector = SelectKBest(chi2)
selector.fit(X_train, y_train.IsOver.ravel())

supportIndices = selector.get_support(indices=True)
print(len(supportIndices))
print(supportIndices)

kbc_X_train = X_train.iloc[:, supportIndices]
kbc_X_test = X_test.iloc[:, supportIndices]
kbc_X_train.describe(include='all')

10
[  0   9  28  64 107 108 109 217 250 393]


Unnamed: 0,0,2,5,9_Executive admin and managerial,16,17,18,24,30,39
count,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0,199523.0
mean,34.494199,15.35232,55.426908,0.062624,434.71899,37.313788,197.529533,1740.380269,1.95618,23.174897
std,22.310895,18.067129,274.896454,0.242287,4697.53128,271.896428,1984.163658,993.768156,2.365126,24.411488
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,37.87,0.0,0.0
25%,15.0,0.0,0.0,0.0,0.0,0.0,0.0,1061.615,0.0,0.0
50%,33.0,0.0,0.0,0.0,0.0,0.0,0.0,1618.31,1.0,8.0
75%,50.0,33.0,0.0,0.0,0.0,0.0,0.0,2188.61,4.0,52.0
max,90.0,51.0,9999.0,1.0,99999.0,4608.0,99999.0,18656.3,6.0,52.0
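
SelectKBest(chi2) above keeps almost only raw numeric columns (0, 2, 5, 16, 17, 18, 24, 30, 39), whereas f_classif favours dummy columns. One plausible reason is that the chi-squared score grows with the magnitude of a feature. The sketch below re-implements the statistic in plain Python, assuming scikit-learn's `chi2` compares per-class feature sums with their expected share; `chi2_score` and the toy data are hypothetical:

```python
def chi2_score(feature, target):
    """Chi-squared score of one non-negative feature against a 0/1 target.

    Observed = sum of the feature within each class; expected = class share
    times the overall sum of the feature.
    """
    total = sum(feature)
    n = len(target)
    score = 0.0
    for cls in (0, 1):
        observed = sum(x for x, y in zip(feature, target) if y == cls)
        expected = total * sum(1 for y in target if y == cls) / n
        score += (observed - expected) ** 2 / expected
    return score

# Hypothetical toy data: a 0/1 dummy column and a binary target.
target = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
dummy = [0, 0, 0, 1, 0, 0, 1, 1, 0, 1]
scaled = [100 * x for x in dummy]  # same information, larger magnitude

print(chi2_score(dummy, target))
print(chi2_score(scaled, target))
```

Multiplying the feature by 100 multiplies its score by 100, so wide-ranging columns can outrank 0/1 dummies; rescaling the numeric columns before applying chi2 may give a fairer comparison.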


# Build Model

In [39]:
data = [(X_train, X_test), (vt_X_train, vt_X_test), (spf_X_train, spf_X_test), 
        (spc_X_train, spc_X_test), (kbf_X_train, kbf_X_test), (kbc_X_train, kbc_X_test)]

for count, (train, test) in enumerate(data):
    print('{} Logistic Regression'.format(count))
    logReg = LogisticRegression()
    logReg.fit(train, y_train.IsOver.ravel())
    y_pred = logReg.predict(test)
    print_scores(logReg, test, y_test.IsOver.values, y_pred)
    print('{} Random Forest'.format(count))
    rf = RandomForestClassifier(n_estimators=500, min_samples_leaf=5, n_jobs=-1)
    rf.fit(train, y_train.IsOver.ravel())
    y_pred = rf.predict(test)
    print_scores(rf, test, y_test.IsOver.values, y_pred)

# adamNN = MLPClassifier(hidden_layer_sizes=(100, ), activation='relu', algorithm='adam', alpha=0.0001, batch_size='auto', learning_rate='constant', learning_rate_init=0.001, power_t=0.5, max_iter=200, shuffle=True, random_state=None, tol=0.0001, verbose=False, warm_start=False, momentum=0.9, nesterovs_momentum=True, early_stopping=False, validation_fraction=0.1, beta_1=0.9, beta_2=0.999, epsilon=1e-08)
# adamNN.fit(X_train, y_train.IsOver.ravel())
# y_pred = adamNN.predict(X_test)
# print_scores(adamNN, X_test, y_test.IsOver.values, y_pred)

# bfgsNN = MLPClassifier(algorithm='l-bfgs', alpha=1e-5, hidden_layer_sizes=(5, 2), random_state=1)
# bfgsNN.fit(X_train, y_train.IsOver.ravel())
# y_pred = bfgsNN.predict(X_test)
# print_scores(bfgsNN, X_test, y_test.IsOver.values, y_pred)

0 Logistic Regression
Acc norm: 0.9509933642068122 Acc non-norm: 94873
CE norm: 1.6926341006631636 CE non-norm: 168860.56315035853
Matthews Cor. Coef: 0.4638988103168886
roc_auc: 0.9431312210256936 <- [ 0.94360981  0.94483573  0.94094812]
0 Random Forest
Acc norm: 0.9511737936288366 Acc non-norm: 94891
CE norm: 1.6863997644600106 CE non-norm: 168238.6133020596
Matthews Cor. Coef: 0.45297383596350244
roc_auc: 0.9473797399658214 <- [ 0.94905869  0.94755406  0.94552647]
1 Logistic Regression
Acc norm: 0.9490587598484392 Acc non-norm: 94680
CE norm: 1.759452126429363 CE non-norm: 175526.4630368461
Matthews Cor. Coef: 0.4263580295836432
roc_auc: 0.9359663420209299 <- [ 0.9381811   0.93649639  0.93322153]
1 Random Forest
Acc norm: 0.9534792806880376 Acc non-norm: 95121
CE norm: 1.6067725532402337 CE non-norm: 160294.8434563522
Matthews Cor. Coef: 0.4966035164082199
roc_auc: 0.945552086606139 <- [ 0.94734978  0.94532427  0.94398222]
2 Logistic Regression
Acc norm: 0.9512439606262906 Acc non-n

In [15]:
# models = []
# rocs = []

# for (train, test) in data:
#     model = LogisticRegression()
#     roc_auc = 0
#     roc_auc_list = [roc_auc] 
#     goodFeatures_train = pd.DataFrame()
#     goodFeatures_test = pd.DataFrame()

#     for col in train:
#         temp_train = pd.concat([goodFeatures_train, train[col]], axis=1)
#         temp_test = pd.concat([goodFeatures_test, test[col]], axis=1)
#         model.fit(temp_train, y_train.IsOver.ravel())
#         y_pred = model.predict(temp_test)
#         scores = get_roc_auc(model, temp_test, y_test.IsOver.values, y_pred)
#         new_roc_auc = np.average(scores)
#         print('{} roc_auc: {} <- {}'.format(col, np.average(scores), scores))
#         if new_roc_auc > roc_auc:
#             goodFeatures_train = temp_train
#             goodFeatures_test = temp_test
#             roc_auc = new_roc_auc
#             roc_auc_list.append(roc_auc)
#     model.fit(goodFeatures_train, y_train.IsOver.ravel())
#     y_pred = model.predict(goodFeatures_test)
#     print_scores(model, goodFeatures_test, y_test.IsOver.values, y_pred)
#     models.append(model)
#     rocs.append(roc_auc_list)

# Model Analysis

In [44]:
X_train.describe(include='all')

Acc norm: 0.9391351416370963 Acc non-norm: 93690
CE norm: 2.1022029831620705 CE non-norm: 209719.9740062145
Matthews Cor. Coef: 0.23851670022483315
roc_auc: 0.8875778974389466 <- [ 0.88924118  0.89113586  0.88235665]
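
Accuracy alone is hard to read here because the classes are heavily imbalanced: in the `describe` output of the learning data, 187141 of 199523 rows carry the label `- 50000.`. The sketch below is an illustration, not part of the original notebook, and the helper name `mcc_from_counts` is hypothetical. It shows that a predictor that always answers "under 50000" already reaches roughly 94% accuracy on the learning data while its Matthews correlation coefficient is 0, which is one reason to read the MCC lines alongside the accuracy lines.

```python
import math


def mcc_from_counts(tp, fp, fn, tn):
    """Matthews correlation coefficient from confusion-matrix counts.

    Returns 0.0 when the denominator vanishes (for example when only one
    class is ever predicted), which is the usual convention.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Class counts taken from the describe() output of the learning data.
n_rows = 199523
n_negative = 187141              # rows labelled "- 50000."
n_positive = n_rows - n_negative

# Majority-class baseline: every row is predicted negative.
baseline_accuracy = n_negative / float(n_rows)
baseline_mcc = mcc_from_counts(tp=0, fp=0, fn=n_positive, tn=n_negative)

print('baseline accuracy: {:.4f}'.format(baseline_accuracy))
print('baseline MCC: {:.4f}'.format(baseline_mcc))
```

The printed test-set accuracies of about 0.95 are therefore only slightly above what this trivial baseline would achieve on the learning data, whereas the MCC values separate the models more clearly.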


In [28]:
#ToDo: Graph roc_auc
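
One possible way to address this ToDo is sketched below. It is not the author's implementation: `roc_points`, `trapezoid_auc` and `plot_roc` are hypothetical helpers, written in plain Python so the curve construction is visible, and the toy labels and scores are made up for illustration. The sketch assumes binary 0/1 labels with both classes present.

```python
def roc_points(y_true, scores):
    """Return (fpr, tpr) lists tracing the ROC curve for binary 0/1 labels."""
    # Walk the samples from the highest score to the lowest.
    pairs = sorted(zip(scores, y_true), key=lambda p: -p[0])
    n_pos = float(sum(1 for _, label in pairs if label == 1))
    n_neg = float(len(pairs)) - n_pos
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    for i, (score, label) in enumerate(pairs):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Tied scores share one curve point, emitted after the last tie.
        if i + 1 == len(pairs) or pairs[i + 1][0] != score:
            fpr.append(fp / n_neg)
            tpr.append(tp / n_pos)
    return fpr, tpr


def trapezoid_auc(fpr, tpr):
    """Area under the curve by the trapezoidal rule."""
    return sum((fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2.0
               for i in range(1, len(fpr)))


def plot_roc(curves):
    """Plot one ROC curve per entry; curves maps a name to (y_true, scores)."""
    import matplotlib.pyplot as Plot  # imported lazily, only needed for plotting
    for name, (y_true, scores) in curves.items():
        fpr, tpr = roc_points(y_true, scores)
        label = '{} (AUC = {:.3f})'.format(name, trapezoid_auc(fpr, tpr))
        Plot.plot(fpr, tpr, label=label)
    Plot.plot([0, 1], [0, 1], linestyle='--', color='grey', label='chance')
    Plot.xlabel('False positive rate')
    Plot.ylabel('True positive rate')
    Plot.legend(loc='lower right')
    Plot.show()


# Toy data, for illustration only.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_fpr, toy_tpr = roc_points(toy_labels, toy_scores)
toy_auc = trapezoid_auc(toy_fpr, toy_tpr)
```

With the notebook's fitted models, the scores would presumably be the positive-class probabilities, for example `rf.predict_proba(test)[:, 1]`, passed as `plot_roc({'Random Forest': (y_test.IsOver.values, scores)})`. `sklearn.metrics.roc_curve` computes the same curve points and could replace `roc_points`.
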

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41
count,199523.0,99278,199523.0,199523.0,199523,199523.0,12580,199523,199523,98839,199523,198649,199523,19064,6070,199523,199523.0,199523.0,199523.0,199523,15773,15773,199523,199523,199523.0,198007,198007,198007,199523,115469,199523.0,55291,199523,199523,199523,199523,199523.0,1984,199523.0,199523.0,199523.0,199523
unique,,8,,,17,,2,7,24,14,5,9,2,2,5,8,,,,6,5,50,38,8,,9,8,9,3,3,,4,43,43,43,5,,2,,,,2
top,,Private,,,High school graduate,,High school,Never married,Not in universe or children,Adm support including clerical,White,All other,Female,No,Other job loser,Children or Armed Forces,,,,Nonfiler,South,California,Householder,Householder,,?,?,?,Not in universe under 1 year old,?,,Both parents present,United-States,United-States,United-States,Native- Born in the United States,,No,,,,- 50000.
freq,,72028,,,48407,,6892,86485,100684,14837,167365,171907,103984,16034,2038,123769,,,,75094,4889,1714,53248,75475,,99696,99696,99696,101212,99696,,38983,159163,160479,176989,176992,,1593,,,,187141
mean,34.494199,,15.35232,11.306556,,55.426908,,,,,,,,,,,434.71899,37.313788,197.529533,,,,,,1740.380269,,,,,,1.95618,,,,,,0.175438,,1.514833,23.174897,94.499672,
std,22.310895,,18.067129,14.454204,,274.896454,,,,,,,,,,,4697.53128,271.896428,1984.163658,,,,,,993.768156,,,,,,2.365126,,,,,,0.553694,,0.851473,24.411488,0.500001,
min,0.0,,0.0,0.0,,0.0,,,,,,,,,,,0.0,0.0,0.0,,,,,,37.87,,,,,,0.0,,,,,,0.0,,0.0,0.0,94.0,
25%,15.0,,0.0,0.0,,0.0,,,,,,,,,,,0.0,0.0,0.0,,,,,,1061.615,,,,,,0.0,,,,,,0.0,,2.0,0.0,94.0,
50%,33.0,,0.0,0.0,,0.0,,,,,,,,,,,0.0,0.0,0.0,,,,,,1618.31,,,,,,1.0,,,,,,0.0,,2.0,8.0,94.0,
75%,50.0,,33.0,26.0,,0.0,,,,,,,,,,,0.0,0.0,0.0,,,,,,2188.61,,,,,,4.0,,,,,,0.0,,2.0,52.0,95.0,


In [27]:
learndf.isnull().sum()

0          0
1     100245
2          0
3          0
4          0
5          0
6     186943
7          0
8          0
9     100684
10         0
11       874
12         0
13    180459
14    193453
15         0
16         0
17         0
18         0
19         0
20    183750
21    183750
22         0
23         0
24         0
25      1516
26      1516
27      1516
28         0
29     84054
30         0
31    144232
32         0
33         0
34         0
35         0
36         0
37    197539
38         0
39         0
40         0
41         0
dtype: int64
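
Raw null counts are easier to compare as a share of the 199523 rows. The sketch below copies the non-zero counts from the output above and lists the columns that are null in more than 90% of rows. The 0.9 threshold is an arbitrary choice for illustration, not something the notebook prescribes. Because the data was read with `na_values="Not in universe"`, a null here usually means the question did not apply to that person rather than that a value was lost.

```python
n_rows = 199523  # rows in learndf, per the describe() counts

# Non-zero null counts copied from learndf.isnull().sum() above.
null_counts = {
    1: 100245, 6: 186943, 9: 100684, 11: 874, 13: 180459, 14: 193453,
    20: 183750, 21: 183750, 25: 1516, 26: 1516, 27: 1516, 29: 84054,
    31: 144232, 37: 197539,
}

# Share of rows that are null, per column.
null_share = {col: count / float(n_rows) for col, count in null_counts.items()}

threshold = 0.9  # hypothetical cut-off for "almost always null"
mostly_null = sorted(col for col, share in null_share.items() if share > threshold)

for col in mostly_null:
    print('column {}: {:.1%} null'.format(col, null_share[col]))
```

On the DataFrame itself, `learndf.isnull().mean()` gives the same shares directly.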