# Evaluate Classification Model - Binary Classification using Adult dataset

Predict whether income exceeds $50K/yr based on census data. This dataset is also known as the "Adult" dataset.  

https://archive.ics.uci.edu/ml/datasets/Census+Income  
Attribute information:  

Target classes: >50K, <=50K.   

**age**: continuous.   
**workclass**: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.   
**fnlwgt**: continuous.   
**education**: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.   
**education-num**: continuous.   
**marital-status**: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.   
**occupation**: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.   
**relationship**: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.   
**race**: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.   
**sex**: Female, Male.   
**capital-gain**: continuous.   
**capital-loss**: continuous.   
**hours-per-week**: continuous.   
**native-country**: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.  

In [None]:
import numpy as np
import pandas as pd

In [22]:
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
col_names = ['age', 'workclass','fnlwgt','education','education_num','marital_status','occupation','relationship','race','sex','capital_gain','capital_loss','hourspw','native_country','income']
income = pd.read_csv(url, header=None, names=col_names)
#income = pd.read_csv(url, header=None, names=col_names, na_values=[' ?'])
print(income.shape)
income.head()

(32561, 15)


| | age | workclass | fnlwgt | education | education_num | marital_status | occupation | relationship | race | sex | capital_gain | capital_loss | hourspw | native_country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |


#### Data Exploration
Print out the unique values of each column in this dataframe.   
Check for missing values, incorrect entries.  
  
Check for redundant columns - If there is a feature that only contains one value, it does not provide any added value for any classifier. The best thing to do is to remove this feature.  
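
The redundancy check described above can be sketched on a small, made-up frame. The column names and values below are hypothetical and only illustrate the idea: a column whose `nunique()` is 1 is constant and can be dropped.

```python
import pandas as pd

# Toy frame (hypothetical data): 'country' holds the same value in every row.
toy = pd.DataFrame({
    "age": [25, 32, 47, 51],
    "country": ["US", "US", "US", "US"],
    "sex": ["Male", "Female", "Female", "Male"],
})

# nunique() counts the distinct non-null values in each column.
n_unique = toy.nunique()
constant_cols = n_unique[n_unique <= 1].index.tolist()
print(constant_cols)

# Drop the constant columns; the remaining columns can still carry signal.
toy_reduced = toy.drop(columns=constant_cols)
print(toy_reduced.columns.tolist())
```

In this notebook every column has at least two distinct values, so the same check would not remove anything from the Adult data.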

#### Data Preparation
Remove any rows with missing values (with .dropna()).   
Apply one-hot encoding (to transform categorical data into numerical data).   
Standardize the data (with StandardScaler().fit_transform(X)).  
Fix outliers.  
One way to deal with missing data is mean imputation: if we know that the values for a measurement fall in a certain range, we can fill in empty values with the average of that measurement.  
The class distribution in both the training and the test set should match the distribution in the actual dataset.   
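
As a minimal sketch of mean imputation, assuming a made-up numeric column with one missing entry (the values are hypothetical; this notebook itself drops incomplete rows instead of imputing them):

```python
import numpy as np
import pandas as pd

# Toy column (hypothetical data) with one missing measurement.
toy = pd.DataFrame({"hourspw": [40.0, np.nan, 50.0, 30.0]})

# mean() skips NaN by default, so the average uses the three known values.
col_mean = toy["hourspw"].mean()

# Fill the gap with that average, keeping the original column for comparison.
toy["hourspw_imputed"] = toy["hourspw"].fillna(col_mean)
print(toy)
```

Mean imputation only applies to numeric columns. In the Adult data the missing entries sit in categorical columns, which is one reason dropping rows is the simpler choice here.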

  

In [60]:
income.describe()

| | age | fnlwgt | education_num | capital_gain | capital_loss | hourspw |
|---|---|---|---|---|---|---|
| count | 30162.0 | 30162.0 | 30162.0 | 30162.0 | 30162.0 | 30162.0 |
| mean | 38.437902 | 189793.8 | 10.121312 | 1092.007858 | 88.372489 | 40.931238 |
| std | 13.134665 | 105653.0 | 2.549995 | 7406.346497 | 404.29837 | 11.979984 |
| min | 17.0 | 13769.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 25% | 28.0 | 117627.2 | 9.0 | 0.0 | 0.0 | 40.0 |
| 50% | 37.0 | 178425.0 | 10.0 | 0.0 | 0.0 | 40.0 |
| 75% | 47.0 | 237628.5 | 13.0 | 0.0 | 0.0 | 45.0 |
| max | 90.0 | 1484705.0 | 16.0 | 99999.0 | 4356.0 | 99.0 |


In [61]:
# Print the number of distinct values in each column
for col in income.columns.values:
    print(col,'#unique',income[col].nunique())

age #unique 72
workclass #unique 7
fnlwgt #unique 20263
education #unique 16
education_num #unique 16
marital_status #unique 7
occupation #unique 14
relationship #unique 6
race #unique 5
sex #unique 2
capital_gain #unique 118
capital_loss #unique 90
hourspw #unique 94
native_country #unique 41
income #unique 2


In [23]:
for col in income.columns.values:
    print(col, income[col].unique())
    print(col,'#unique',income[col].nunique())

age [39 50 38 53 28 37 49 52 31 42 30 23 32 40 34 25 43 54 35 59 56 19 20 45 22
 48 21 24 57 44 41 29 18 47 46 36 79 27 67 33 76 17 55 61 70 64 71 68 66 51
 58 26 60 90 75 65 77 62 63 80 72 74 69 73 81 78 88 82 83 84 85 86 87]
age #unique 73
workclass [' State-gov' ' Self-emp-not-inc' ' Private' ' Federal-gov' ' Local-gov'
 ' ?' ' Self-emp-inc' ' Without-pay' ' Never-worked']
workclass #unique 9
fnlwgt [ 77516  83311 215646 ...,  34066  84661 257302]
fnlwgt #unique 21648
education [' Bachelors' ' HS-grad' ' 11th' ' Masters' ' 9th' ' Some-college'
 ' Assoc-acdm' ' Assoc-voc' ' 7th-8th' ' Doctorate' ' Prof-school'
 ' 5th-6th' ' 10th' ' 1st-4th' ' Preschool' ' 12th']
education #unique 16
education_num [13  9  7 14  5 10 12 11  4 16 15  3  6  2  1  8]
education_num #unique 16
marital_status [' Never-married' ' Married-civ-spouse' ' Divorced'
 ' Married-spouse-absent' ' Separated' ' Married-AF-spouse' ' Widowed']
marital_status #unique 7
occupation [' Adm-clerical' ' Exec-managerial' ' Hand

In [24]:
# Replace the ' ?' placeholder with NaN so that dropna() can find it
income = income.replace(' ?', np.nan)
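
The placeholder can also be handled while loading, which is what the commented-out `read_csv` call in the loading cell hints at. The sketch below uses a tiny in-memory sample (hypothetical rows and only three columns) and additionally passes `skipinitialspace=True`, an assumption not used elsewhere in this notebook, so that the leading blank is stripped from every field.

```python
import io
import pandas as pd

# Three made-up rows in the same ", "-separated layout as adult.data.
raw = io.StringIO(
    "39, State-gov, <=50K\n"
    "54, ?, >50K\n"
    "28, Private, <=50K\n"
)

toy = pd.read_csv(
    raw,
    header=None,
    names=["age", "workclass", "income"],
    skipinitialspace=True,  # strip the blank that follows each comma
    na_values="?",          # treat the '?' placeholder as missing
)
print(toy)
print(toy.dropna().shape)
```

Note that with `skipinitialspace=True` the class labels become `'<=50K'` and `'>50K'` without the leading space, so any later code that spells out the labels would have to match.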

In [None]:
for col in income.columns.values:
    print(col, income[col].unique())
    print(col,'#unique',income[col].nunique())

In [25]:
# Remove rows with missing values
print(income.shape)
income = income.dropna(axis=0, how='any')
print(income.shape)

(32561, 15)
(30162, 15)


#### Encoding categorical data  

Most classifiers can only work with numerical data and will raise an error when categorical values in the form of strings are used as input. 
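
A minimal sketch of what one-hot encoding does, using a made-up two-column frame: `pd.get_dummies` leaves numeric columns untouched and replaces each string column with one indicator column per distinct value.

```python
import pandas as pd

# Toy frame (hypothetical data): one numeric and one categorical column.
toy = pd.DataFrame({"age": [39, 50, 38], "sex": ["Male", "Female", "Male"]})

# 'sex' becomes two indicator columns, 'age' is passed through unchanged.
toy_ohe = pd.get_dummies(toy)
print(toy_ohe.columns.tolist())
print(toy_ohe)
```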

In [26]:
# Use select_dtypes() to get the numeric columns, then derive the categorical columns
allcols = income.columns
numeric_cols = income.select_dtypes(include='number').columns
categorical_cols = list(set(allcols) - set(numeric_cols))
categorical_cols

['income',
 'sex',
 'education',
 'race',
 'occupation',
 'relationship',
 'native_country',
 'marital_status',
 'workclass']

In [27]:
list(set(numeric_cols))

['age', 'fnlwgt', 'capital_gain', 'capital_loss', 'hourspw', 'education_num']

In [40]:
print(income.shape)
income_ohe = income.drop('income', axis=1)
print(income_ohe.shape)
income_ohe = pd.get_dummies(income_ohe)
print(income_ohe.shape)

(30162, 15)
(30162, 14)
(30162, 104)


### Notice how one-hot encoding expanded the dataset from 14 to 104 columns
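
The column count of 104 can be checked by hand: each categorical feature contributes one indicator column per distinct value, and the numeric features pass through unchanged. The counts below are copied from the `nunique()` output of cell In [61], which was run after the rows with missing values had been dropped.

```python
# Distinct values per categorical feature (from the nunique() output, In [61]).
categorical_levels = {
    "workclass": 7, "education": 16, "marital_status": 7, "occupation": 14,
    "relationship": 6, "race": 5, "sex": 2, "native_country": 41,
}

# age, fnlwgt, education_num, capital_gain, capital_loss, hourspw
n_numeric = 6

n_dummy = sum(categorical_levels.values())
n_total = n_numeric + n_dummy
print(n_dummy, n_total)
```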

In [29]:
print(income_ohe.columns.values)

['age' 'fnlwgt' 'education_num' 'capital_gain' 'capital_loss' 'hourspw'
 'workclass_ Federal-gov' 'workclass_ Local-gov' 'workclass_ Private'
 'workclass_ Self-emp-inc' 'workclass_ Self-emp-not-inc'
 'workclass_ State-gov' 'workclass_ Without-pay' 'education_ 10th'
 'education_ 11th' 'education_ 12th' 'education_ 1st-4th'
 'education_ 5th-6th' 'education_ 7th-8th' 'education_ 9th'
 'education_ Assoc-acdm' 'education_ Assoc-voc' 'education_ Bachelors'
 'education_ Doctorate' 'education_ HS-grad' 'education_ Masters'
 'education_ Preschool' 'education_ Prof-school' 'education_ Some-college'
 'marital_status_ Divorced' 'marital_status_ Married-AF-spouse'
 'marital_status_ Married-civ-spouse'
 'marital_status_ Married-spouse-absent' 'marital_status_ Never-married'
 'marital_status_ Separated' 'marital_status_ Widowed'
 'occupation_ Adm-clerical' 'occupation_ Armed-Forces'
 'occupation_ Craft-repair' 'occupation_ Exec-managerial'
 'occupation_ Farming-fishing' 'occupation_ Handlers-cleaners

In [30]:
print(income.columns.values)

['age' 'workclass' 'fnlwgt' 'education' 'education_num' 'marital_status'
 'occupation' 'relationship' 'race' 'sex' 'capital_gain' 'capital_loss'
 'hourspw' 'native_country' 'income']


In [41]:
X = income_ohe
y = income['income']

In [42]:
print(X.shape)
print(y.shape)

(30162, 104)
(30162,)


In [None]:
%whos

## Split the data into training and test set

In [43]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,random_state=0)

In [44]:
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(22621, 104)
(7541, 104)
(22621,)
(7541,)


In [49]:
# examine the class distribution of the training and testing sets (using a Pandas Series method)
print(y_train.value_counts())
print(y_test.value_counts())

 <=50K    17015
 >50K      5606
Name: income, dtype: int64
 <=50K    5639
 >50K     1902
Name: income, dtype: int64
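
In the split above the share of `>50K` is roughly 24.8% in the training set (5606 of 22621) and 25.2% in the test set (1902 of 7541), so the two are close but not identical. If exactly matching proportions are wanted, `train_test_split` accepts a `stratify` argument. The sketch below uses made-up data, not the Adult dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 100 samples, 25 of them in the positive class.
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([1] * 25 + [0] * 75)

# stratify=y_toy keeps the class proportions the same in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=0
)
print(y_tr.mean(), y_te.mean())
```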

Candidate classifiers to try on this dataset:  
"Logistic Regression": LogisticRegression()  
"Nearest Neighbors": KNeighborsClassifier()  
"Linear SVM": SVC(kernel="linear")  
"Gradient Boosting Classifier": GradientBoostingClassifier()  
"Decision Tree": tree.DecisionTreeClassifier()  
"Random Forest": RandomForestClassifier(n_estimators = 18)  
"Neural Net": MLPClassifier(alpha = 1)  
"Naive Bayes": GaussianNB()  

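A name-to-estimator mapping like the one listed above can be looped over to fit and score each model the same way. The sketch below is only an illustration: it uses three of the classifiers and a small synthetic dataset from `make_classification` rather than the Adult data, and the `max_iter` and `random_state` settings are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Small synthetic binary problem (stand-in for X and y).
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=0)

classifiers = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Naive Bayes": GaussianNB(),
}

# Fit every model on the same training split and record test accuracy.
scores = {}
for name, clf in classifiers.items():
    clf.fit(X_tr, y_tr)
    scores[name] = clf.score(X_te, y_te)
    print(f"{name}: {scores[name]:.3f}")
```
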
In [45]:
# train a logistic regression model on the training set
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [50]:
# make class predictions for the testing set
y_pred_class = logreg.predict(X_test)

In [51]:
# calculate accuracy
from sklearn import metrics
print(metrics.accuracy_score(y_test, y_pred_class))

0.784776554834
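
An accuracy figure is easier to judge next to the accuracy of always predicting the most frequent class. The sketch below computes that baseline from the test-set class counts printed earlier and compares it with the logistic regression accuracy.

```python
# Test-set class counts, copied from the value_counts() output above.
n_le_50k = 5639
n_gt_50k = 1902

# Accuracy obtained by always predicting the most frequent class.
baseline_accuracy = max(n_le_50k, n_gt_50k) / (n_le_50k + n_gt_50k)

# Logistic regression accuracy printed above.
model_accuracy = 0.784776554834

print(round(baseline_accuracy, 4))
print(round(model_accuracy - baseline_accuracy, 4))
```

The baseline is about 0.748, so the logistic regression model improves on it by less than four percentage points.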


In [52]:
# print the first 25 true and predicted responses
print('True:', y_test.values[0:25])
print('Pred:', y_pred_class[0:25])

True: [' <=50K' ' <=50K' ' <=50K' ' <=50K' ' <=50K' ' <=50K' ' <=50K' ' <=50K'
 ' >50K' ' >50K' ' <=50K' ' <=50K' ' <=50K' ' <=50K' ' <=50K' ' >50K'
 ' <=50K' ' <=50K' ' <=50K' ' >50K' ' <=50K' ' <=50K' ' <=50K' ' >50K'
 ' <=50K']
Pred: [' <=50K' ' <=50K' ' <=50K' ' <=50K' ' <=50K' ' <=50K' ' <=50K' ' <=50K'
 ' <=50K' ' <=50K' ' <=50K' ' <=50K' ' <=50K' ' >50K' ' <=50K' ' >50K'
 ' <=50K' ' <=50K' ' <=50K' ' <=50K' ' <=50K' ' <=50K' ' <=50K' ' <=50K'
 ' <=50K']


In [53]:
# IMPORTANT: first argument is true values, second argument is predicted values
print(metrics.confusion_matrix(y_test, y_pred_class))

[[5431  208]
 [1415  487]]
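
scikit-learn puts the true labels on the rows and the predicted labels on the columns, with the labels in sorted order (`' <=50K'` first, `' >50K'` second). Treating `' >50K'` as the positive class, the matrix above can be unpacked as follows to reproduce the accuracy and to obtain precision and recall for that class.

```python
import numpy as np

# Confusion matrix printed above: rows = true labels, columns = predictions.
cm = np.array([[5431, 208],
               [1415, 487]])

# With ' >50K' as the positive class, ravel() yields TN, FP, FN, TP.
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)   # of the predicted >50K, how many are correct
recall = tp / (tp + fn)      # of the actual >50K, how many were found

print(round(accuracy, 6), round(precision, 6), round(recall, 6))
```

The low recall shows that the model misses most of the `>50K` cases even though its overall accuracy looks reasonable.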


In [54]:
from sklearn.metrics import confusion_matrix, recall_score, accuracy_score, precision_score

def Evaluate(predicted, actual, labels):
    output_labels = []
    output = []
    
    # Calculate and display confusion matrix
    # (scikit-learn convention: rows are true labels, columns are predicted labels)
    cm = confusion_matrix(actual, predicted, labels=labels)
    print('Confusion matrix\n- rows are true labels \n- columns are predicted labels')
    print(cm)
    
    # Calculate overall accuracy and per-class precision, recall, and F1 score
    accuracy = np.array([float(np.trace(cm)) / np.sum(cm)] * len(labels))
    precision = precision_score(actual, predicted, average=None, labels=labels)
    recall = recall_score(actual, predicted, average=None, labels=labels)
    f1 = 2 * precision * recall / (precision + recall)
    output.extend([accuracy.tolist(), precision.tolist(), recall.tolist(), f1.tolist()])
    output_labels.extend(['accuracy', 'precision', 'recall', 'F1'])
    
    # Calculate the macro versions of these metrics
    output.extend([[np.mean(precision)] * len(labels),
                   [np.mean(recall)] * len(labels),
                   [np.mean(f1)] * len(labels)])
    output_labels.extend(['macro precision', 'macro recall', 'macro F1'])
    
    # Find the one-vs.-all confusion matrix
    cm_row_sums = cm.sum(axis = 1)
    cm_col_sums = cm.sum(axis = 0)
    s = np.zeros((2, 2))
    for i in range(len(labels)):
        v = np.array([[cm[i, i],
                       cm_row_sums[i] - cm[i, i]],
                      [cm_col_sums[i] - cm[i, i],
                       np.sum(cm) + cm[i, i] - (cm_row_sums[i] + cm_col_sums[i])]])
        s += v
    s_row_sums = s.sum(axis = 1)
    
    # Add average accuracy and micro-averaged  precision/recall/F1
    avg_accuracy = [np.trace(s) / np.sum(s)] * len(labels)
    micro_prf = [float(s[0,0]) / s_row_sums[0]] * len(labels)
    output.extend([avg_accuracy, micro_prf])
    output_labels.extend(['average accuracy',
                          'micro-averaged precision/recall/F1'])
    
    # Compute metrics for the majority classifier
    mc_index = np.where(cm_row_sums == np.max(cm_row_sums))[0][0]
    cm_row_dist = cm_row_sums / float(np.sum(cm))
    mc_accuracy = 0 * cm_row_dist; mc_accuracy[mc_index] = cm_row_dist[mc_index]
    mc_recall = 0 * cm_row_dist; mc_recall[mc_index] = 1
    mc_precision = 0 * cm_row_dist
    mc_precision[mc_index] = cm_row_dist[mc_index]
    mc_F1 = 0 * cm_row_dist;
    mc_F1[mc_index] = 2 * mc_precision[mc_index] / (mc_precision[mc_index] + 1)
    output.extend([mc_accuracy.tolist(), mc_recall.tolist(),
                   mc_precision.tolist(), mc_F1.tolist()])
    output_labels.extend(['majority class accuracy', 'majority class recall',
                          'majority class precision', 'majority class F1'])
        
    # Random accuracy and kappa
    cm_col_dist = cm_col_sums / float(np.sum(cm))
    exp_accuracy = np.array([np.sum(cm_row_dist * cm_col_dist)] * len(labels))
    kappa = (accuracy - exp_accuracy) / (1 - exp_accuracy)
    output.extend([exp_accuracy.tolist(), kappa.tolist()])
    output_labels.extend(['expected accuracy', 'kappa'])
    

    # Random guess
    rg_accuracy = np.ones(len(labels)) / float(len(labels))
    rg_precision = cm_row_dist
    rg_recall = np.ones(len(labels)) / float(len(labels))
    rg_F1 = 2 * cm_row_dist / (len(labels) * cm_row_dist + 1)
    output.extend([rg_accuracy.tolist(), rg_precision.tolist(),
                   rg_recall.tolist(), rg_F1.tolist()])
    output_labels.extend(['random guess accuracy', 'random guess precision',
                          'random guess recall', 'random guess F1'])
    
    # Random weighted guess
    rwg_accuracy = np.ones(len(labels)) * sum(cm_row_dist**2)
    rwg_precision = cm_row_dist
    rwg_recall = cm_row_dist
    rwg_F1 = cm_row_dist
    output.extend([rwg_accuracy.tolist(), rwg_precision.tolist(),
                   rwg_recall.tolist(), rwg_F1.tolist()])
    output_labels.extend(['random weighted guess accuracy',
                          'random weighted guess precision',
                          'random weighted guess recall',
                          'random weighted guess F1'])

    output_df = pd.DataFrame(output, columns=labels)
    output_df.index = output_labels
                  
    return output_df
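
The kappa computed inside `Evaluate` follows the usual definition, (observed accuracy - expected accuracy) / (1 - expected accuracy). As a sanity check, the sketch below applies the same formula to a small made-up label vector and compares the result with scikit-learn's `cohen_kappa_score`. The labels are hypothetical.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score, confusion_matrix

# Toy labels (hypothetical): 10 samples, 7 predicted correctly.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_hat = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 0])

cm = confusion_matrix(y_true, y_hat)
n = cm.sum()

# Observed agreement is the plain accuracy.
observed = np.trace(cm) / n
# Expected agreement if predictions were independent of the true labels.
expected = np.sum(cm.sum(axis=1) * cm.sum(axis=0)) / n**2

kappa_manual = (observed - expected) / (1 - expected)
kappa_sklearn = cohen_kappa_score(y_true, y_hat)
print(kappa_manual, kappa_sklearn)
```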

In [55]:

evaluation_result = Evaluate(actual = y_test.values,
                                 predicted = y_pred_class,
                                 labels = [' <=50K', ' >50K'])

Confusion matrix
- rows are true labels 
- columns are predicted labels
[[5431  208]
 [1415  487]]


In [56]:
evaluation_result

| | <=50K | >50K |
|---|---|---|
| accuracy | 0.784777 | 0.784777 |
| precision | 0.79331 | 0.700719 |
| recall | 0.963114 | 0.256046 |
| F1 | 0.870004 | 0.375048 |
| macro precision | 0.747015 | 0.747015 |
| macro recall | 0.60958 | 0.60958 |
| macro F1 | 0.622526 | 0.622526 |
| average accuracy | 0.784777 | 0.784777 |
| micro-averaged precision/recall/F1 | 0.784777 | 0.784777 |
| majority class accuracy | 0.747779 | 0.0 |


If a model's accuracy is not satisfactory, we can rework the features in the dataset, add extra features from additional datasets, or change the parameters of the classifiers in order to improve it.
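
One way to change classifier parameters systematically is a cross-validated grid search. The sketch below is an illustration on synthetic data, not a result for the Adult dataset: it chains `StandardScaler` and `LogisticRegression` in a pipeline and searches over a hypothetical grid of values for the regularization strength `C`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Small synthetic binary problem (stand-in for X and y).
X_toy, y_toy = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, stratify=y_toy, random_state=0
)

# Scaling inside the pipeline is refit on each training fold,
# so no information from the validation fold leaks into it.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Hypothetical grid; make_pipeline names the step 'logisticregression'.
param_grid = {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_tr, y_tr)

print(search.best_params_)
print(search.score(X_te, y_te))
```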

In [None]:

# Required Python Packages
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

In [57]:
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier()

# Fit the model

rfc.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)

In [58]:
predictions = rfc.predict(X_test)
    
print(predictions.shape)
print(y_test.shape)

(7541,)
(7541,)


In [59]:
evaluation_result = Evaluate(actual = y_test.values,
                                 predicted = predictions,
                                 labels = [' <=50K', ' >50K'])
evaluation_result

Confusion matrix
- rows are true labels 
- columns are predicted labels
[[5207  432]
 [ 798 1104]]


| | <=50K | >50K |
|---|---|---|
| accuracy | 0.836892 | 0.836892 |
| precision | 0.867111 | 0.71875 |
| recall | 0.923391 | 0.580442 |
| F1 | 0.894366 | 0.642234 |
| macro precision | 0.79293 | 0.79293 |
| macro recall | 0.751916 | 0.751916 |
| macro F1 | 0.7683 | 0.7683 |
| average accuracy | 0.836892 | 0.836892 |
| micro-averaged precision/recall/F1 | 0.836892 | 0.836892 |
| majority class accuracy | 0.747779 | 0.0 |
