# Decision Tree Classification: Adult Data

We now turn to a more complex data set for decision tree classification: the Adult income prediction data. It consists of the following columns: age, workclass, fnlwgt, education, education-num, marital-status, occupation, relationship, race, sex, capital-gain, capital-loss, hours-per-week, native-country, and salary. The salary column is the target we will predict.

In [8]:
# Standard imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

In [24]:
# Set up Notebook
%matplotlib inline
warnings.filterwarnings('ignore')
sns.set_style('white')
%config Completer.use_jedi = False

In [26]:
# Notebook specific imports
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics

In [12]:
# define column names
col_names = ['Age', 'Workclass', 'FNLWGT', 'Education', 
             'EducationLevel', 'MaritalStatus', 'Occupation', 
             'Relationship', 'Race', 'Sex', 'CapitalGain', 'CapitalLoss', 
             'HoursPerWeek', 'NativeCountry', 'Salary']
# read data with the new column names
adult_data = pd.read_csv('../datasets/adult.data', index_col=False,
                        names=col_names)
# Display random samples
adult_data.sample(5)

Unnamed: 0,Age,Workclass,FNLWGT,Education,EducationLevel,MaritalStatus,Occupation,Relationship,Race,Sex,CapitalGain,CapitalLoss,HoursPerWeek,NativeCountry,Salary
29886,32,Self-emp-not-inc,379412,Assoc-voc,11,Married-civ-spouse,Craft-repair,Husband,White,Male,0,0,40,United-States,<=50K
19297,33,Private,258932,HS-grad,9,Divorced,Exec-managerial,Not-in-family,White,Female,0,0,40,United-States,<=50K
4059,39,Private,63910,Some-college,10,Never-married,Adm-clerical,Own-child,Asian-Pac-Islander,Female,0,0,40,United-States,<=50K
29464,52,Local-gov,236497,Bachelors,13,Married-civ-spouse,Tech-support,Husband,White,Male,0,0,40,United-States,<=50K
4527,19,Private,39026,HS-grad,9,Never-married,Other-service,Not-in-family,White,Male,0,0,30,United-States,<=50K


In [14]:
adult_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   Age             32561 non-null  int64 
 1   Workclass       32561 non-null  object
 2   FNLWGT          32561 non-null  int64 
 3   Education       32561 non-null  object
 4   EducationLevel  32561 non-null  int64 
 5   MaritalStatus   32561 non-null  object
 6   Occupation      32561 non-null  object
 7   Relationship    32561 non-null  object
 8   Race            32561 non-null  object
 9   Sex             32561 non-null  object
 10  CapitalGain     32561 non-null  int64 
 11  CapitalLoss     32561 non-null  int64 
 12  HoursPerWeek    32561 non-null  int64 
 13  NativeCountry   32561 non-null  object
 14  Salary          32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB


In [13]:
adult_data.describe()

Unnamed: 0,Age,FNLWGT,EducationLevel,CapitalGain,CapitalLoss,HoursPerWeek
count,32561.0,32561.0,32561.0,32561.0,32561.0,32561.0
mean,38.581647,189778.4,10.080679,1077.648844,87.30383,40.437456
std,13.640433,105550.0,2.57272,7385.292085,402.960219,12.347429
min,17.0,12285.0,1.0,0.0,0.0,1.0
25%,28.0,117827.0,9.0,0.0,0.0,40.0
50%,37.0,178356.0,10.0,0.0,0.0,40.0
75%,48.0,237051.0,12.0,0.0,0.0,45.0
max,90.0,1484705.0,16.0,99999.0,4356.0,99.0


In [15]:
adult_data.describe(include=[object])

Unnamed: 0,Workclass,Education,MaritalStatus,Occupation,Relationship,Race,Sex,NativeCountry,Salary
count,32561,32561,32561,32561,32561,32561,32561,32561,32561
unique,9,16,7,15,6,5,2,42,2
top,Private,HS-grad,Married-civ-spouse,Prof-specialty,Husband,White,Male,United-States,<=50K
freq,22696,10501,14976,4140,13193,27816,21790,29170,24720


These data are commonly used to test classification algorithms because they include a _Salary_ column that takes one of two values, `<=50K` or `>50K`, indicating each individual's salary range.

To apply a machine learning algorithm to these data, we need to generate a numerical label that maps to these two values. For this, we create a new column in our DataFrame called `Label` and map the original column to $1$ if the `Salary` feature is equal to `>50K` and $0$ otherwise.

In [16]:
# Create label column, one for >50K, zero otherwise.
adult_data['Label'] = adult_data['Salary'].map(lambda x : 1 
                                               if '>50K' in x else 0)
# Display label and original column for comparison
adult_data[['Salary', 'Label']].sample(12)

Unnamed: 0,Salary,Label
13388,<=50K,0
23817,<=50K,0
15270,<=50K,0
19718,<=50K,0
18329,<=50K,0
9705,<=50K,0
5922,<=50K,0
11880,<=50K,0
2811,<=50K,0
23875,<=50K,0
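
The `Label` cell above uses a substring test (`'>50K' in x`) rather than strict equality. The dummy column names later in this notebook (for example `Education_ 10th`) suggest that the raw strings carry a leading space, which a strict comparison would miss. As a minimal sketch on a small hypothetical Series (the values below are illustrative, not taken from the file), a vectorized alternative that strips whitespace first gives the same labels:

```python
import pandas as pd

# Hypothetical salary strings with the leading space seen in the raw file
salary = pd.Series([' <=50K', ' >50K', ' <=50K', ' >50K'])

# Substring test, as in the notebook cell above
label_map = salary.map(lambda x: 1 if '>50K' in x else 0)

# Vectorized alternative: strip whitespace, compare, cast the booleans to int
label_vec = (salary.str.strip() == '>50K').astype(int)

print(label_map.tolist())
print(label_vec.tolist())
```

Both approaches agree here; the vectorized form is usually easier to read on large columns.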


In [17]:
# Now we can drop the original column
adult_data = adult_data.drop('Salary', axis=1)

With our new `Label` feature, we can compute what is known as the _zero model_, in which we classify the data by always predicting the majority class. We would not use this model in practice, since it provides no predictive power or insight into the data, but it does set a useful baseline for how well an algorithm should perform. Any model that performs worse than, or only about as well as, the _zero model_ should be discarded; we want to do clearly better than this value.

In [18]:
labels = adult_data['Label']
print(f'{np.sum(labels==0):d} low salaries')
print(f'{np.sum(labels==1):d} high salaries')

24720 low salaries
7841 high salaries


In [19]:
zm = float(np.sum(labels==0)) / (np.sum(labels==0) + np.sum(labels==1))
print(f'Zero Model Performance = {100.0 * zm:4.2f}%')

Zero Model Performance = 75.92%


In this case, our zero model achieves a classification accuracy of about 76%, which indicates that our data set is unbalanced: there are roughly three lower-salary instances for every higher-salary instance.
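
The same baseline can also be obtained from scikit-learn's `DummyClassifier` with `strategy='most_frequent'`. The sketch below is not part of the original notebook; it rebuilds a label array with the class counts printed above and uses a placeholder feature column, since the dummy model ignores the features:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Labels with the same class counts as the Adult data (24720 low, 7841 high)
y = np.array([0] * 24720 + [1] * 7841)

# Placeholder features: DummyClassifier never looks at them
X = np.zeros((y.size, 1))

# Always predict the most frequent class seen during fit
zero_model = DummyClassifier(strategy='most_frequent')
zero_model.fit(X, y)

baseline = zero_model.score(X, y)
print(f'Zero Model Performance = {100.0 * baseline:4.2f}%')
```

Using an estimator for the baseline means it can be scored on the same train/test split as any real model.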

---
With our target label constructed, we now create the feature array that we will use to construct the decision tree classifier. To do this, we will first extract and convert the categorical features to binarized features, and then extract the numerical features. Finally, we will combine these two data sets together to make the final feature array.

In the following code cell, we create a Python list containing the categorical features. We use the `get_dummies` function to create binarized features for each of these categorical features.

In [21]:
# Categorical DataFrame
categorical = ['Education', 'Workclass', 'Race', 
               'Sex', 'Occupation', 'Relationship', 
               'NativeCountry']
cat_data = pd.get_dummies(adult_data[categorical])
cat_data.sample(5)

Unnamed: 0,Education_ 10th,Education_ 11th,Education_ 12th,Education_ 1st-4th,Education_ 5th-6th,Education_ 7th-8th,Education_ 9th,Education_ Assoc-acdm,Education_ Assoc-voc,Education_ Bachelors,...,NativeCountry_ Portugal,NativeCountry_ Puerto-Rico,NativeCountry_ Scotland,NativeCountry_ South,NativeCountry_ Taiwan,NativeCountry_ Thailand,NativeCountry_ Trinadad&Tobago,NativeCountry_ United-States,NativeCountry_ Vietnam,NativeCountry_ Yugoslavia
10821,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
11724,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
4877,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
24609,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
29265,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
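
To make the binarization concrete, here is a small sketch on a hypothetical two-column DataFrame (the rows are illustrative). Each distinct value becomes its own column named `<column>_<value>`, and each row has a single 1 per original column. Passing `dtype=int` is an assumption made here so that the output shows 0/1 as in the table above; recent pandas versions return booleans by default.

```python
import pandas as pd

# Hypothetical categorical data, with the leading space seen in the raw file
toy = pd.DataFrame({'Sex': [' Male', ' Female', ' Male'],
                    'Race': [' White', ' Black', ' White']})

# One indicator column per (column, value) pair
toy_dummies = pd.get_dummies(toy, dtype=int)

print(toy_dummies.columns.tolist())
print(toy_dummies)
```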


In [22]:
# Next, we construct our numerical feature DataFrame
numerical = ['Age', 'EducationLevel', 'HoursPerWeek', 
             'CapitalGain', 'CapitalLoss']

num_data = adult_data[numerical]
num_data.sample(5)

Unnamed: 0,Age,EducationLevel,HoursPerWeek,CapitalGain,CapitalLoss
10528,55,13,50,0,0
20382,58,9,40,0,0
24732,47,9,85,0,0
7506,32,9,40,3781,0
12374,61,4,40,0,1651


We now combine these two DataFrames into a new `features` DataFrame that we will use to perform decision tree classification.

In [23]:
# Features matrix
features = pd.concat([num_data, cat_data], axis=1)
features.sample(5)

Unnamed: 0,Age,EducationLevel,HoursPerWeek,CapitalGain,CapitalLoss,Education_ 10th,Education_ 11th,Education_ 12th,Education_ 1st-4th,Education_ 5th-6th,...,NativeCountry_ Portugal,NativeCountry_ Puerto-Rico,NativeCountry_ Scotland,NativeCountry_ South,NativeCountry_ Taiwan,NativeCountry_ Thailand,NativeCountry_ Trinadad&Tobago,NativeCountry_ United-States,NativeCountry_ Vietnam,NativeCountry_ Yugoslavia
474,67,11,24,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
30917,41,9,40,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
19864,31,9,40,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
10190,40,4,40,4064,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
22471,53,11,45,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0


With our feature and label data prepared, we are now ready to begin the machine learning process.

In [27]:
# Fraction of the data held out for testing
frac = 0.4
d_train, d_test, l_train, l_test = \
        train_test_split(features, labels,
                        test_size=frac,
                        random_state=23)
adult_model = DecisionTreeClassifier(random_state=23)
adult_model = adult_model.fit(d_train, l_train)
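
As a quick check of what `test_size=0.4` means for a data set of this size, the sketch below splits placeholder arrays with the same number of rows as the Adult data (32561). The arrays are stand-ins, not the real features; only the resulting split sizes are of interest.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_samples = 32561  # number of rows reported by adult_data.info()

# Placeholder features and labels with the right length
X = np.arange(n_samples).reshape(-1, 1)
y = np.zeros(n_samples, dtype=int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=23)

print(f'{X_train.shape[0]} training rows, {X_test.shape[0]} test rows')
```

The test set size is rounded up, so the 13025 test rows match the total support in the classification report further down.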

In [28]:
from sklearn import metrics

# Classify test data and display the score
predicted = adult_model.predict(d_test)
score = 100.0 * metrics.accuracy_score(l_test, predicted)
print(f'Decision Tree Classification [Adult Data] Score = {score:4.1f}%\n')

Decision Tree Classification [Adult Data] Score = 82.0%



In [29]:
# Display report
print('Classification Report:\n {0}\n'.format(
    metrics.classification_report(l_test, predicted)))

Classification Report:
               precision    recall  f1-score   support

           0       0.87      0.89      0.88      9811
           1       0.64      0.61      0.63      3214

    accuracy                           0.82     13025
   macro avg       0.76      0.75      0.75     13025
weighted avg       0.82      0.82      0.82     13025
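
For readers less familiar with the columns of this report, the following sketch shows how precision, recall, and F1 for one class follow from confusion-matrix counts. The counts are hypothetical and chosen only to keep the arithmetic easy; they are not the counts behind the report above.

```python
# Hypothetical counts for the positive class (label 1)
tp = 6  # predicted 1, truly 1
fp = 2  # predicted 1, truly 0
fn = 4  # predicted 0, truly 1
tn = 8  # predicted 0, truly 0

# Of everything predicted positive, how much was right?
precision = tp / (tp + fp)

# Of everything truly positive, how much did we find?
recall = tp / (tp + fn)

# Harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

# Fraction of all predictions that were correct
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(f'precision={precision:.2f} recall={recall:.2f} '
      f'f1={f1:.2f} accuracy={accuracy:.2f}')
```

In the report above, the lower precision and recall for class 1 show that the tree has more trouble with the minority (high-salary) class than the overall accuracy suggests.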




As we can see, the result is reasonable but not really satisfying, especially if we consider the performance of the zero model.
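
To put the comparison in numbers, the sketch below uses only figures reported above: the 75.92% zero model accuracy on the full data set, the 82.0% decision tree accuracy on the test split, and the class 0 support (9811 of 13025) from the classification report. Note that the zero model figure was computed on the full data, so the majority-class share of the test split is also computed for a like-for-like comparison.

```python
zero_model_acc = 75.92  # zero model accuracy on the full data set (%)
tree_acc = 82.0         # decision tree accuracy on the test split (%)

# Majority-class share of the test split, from the report's support column
test_baseline = 100.0 * 9811 / 13025

# Absolute improvement in percentage points
gain = tree_acc - zero_model_acc

# Share of the zero model's errors that the tree removes
error_reduction = gain / (100.0 - zero_model_acc)

print(f'Test-split baseline = {test_baseline:.2f}%')
print(f'Absolute gain = {gain:.2f} percentage points')
print(f'Relative error reduction = {100.0 * error_reduction:.1f}%')
```

The tree gains only about six percentage points over always predicting the majority class.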

So we need to tweak our decision tree model a bit to make 