data : adult.csv  
* target : income  
preprocess:  
* missing value : simple imputer with constant  
* one hot encoding : relationship, race, sex  
* binary encoding : workclass, marital status, occupation, native country    
* ordinal encoding : education (already encoded)  
* no treatment : numerical  
drop : fnlwgt  
random state : 10  
data splitting : 70:30  
model : decision tree (max depth 5, criterion entropy)

An individual’s annual income results from various factors. Intuitively, it is influenced by the individual’s education level, age, gender, occupation, and so on.
<br>
Fields:
<br>
The dataset contains 15 columns
<br>
Target field: Income
<br>
-- The income is divided into two classes: <=50K and >50K
<br>
Number of attributes: 14
<br>
-- These are the demographics and other features to describe a person

We will explore the possibility of predicting income level based on the individual’s personal information.

# Dataset

In [1]:
import pandas as pd
import numpy as np

In [2]:
df_dataset = pd.read_csv('adult.csv')
df_dataset.head()

Unnamed: 0,age,workclass,fnlwgt,education,education.num,marital.status,occupation,relationship,race,sex,capital.gain,capital.loss,hours.per.week,native.country,income
0,90,?,77053,HS-grad,9,Widowed,?,Not-in-family,White,Female,0,4356,40,United-States,<=50K
1,82,Private,132870,HS-grad,9,Widowed,Exec-managerial,Not-in-family,White,Female,0,4356,18,United-States,<=50K
2,66,?,186061,Some-college,10,Widowed,?,Unmarried,Black,Female,0,4356,40,United-States,<=50K
3,54,Private,140359,7th-8th,4,Divorced,Machine-op-inspct,Unmarried,White,Female,0,3900,40,United-States,<=50K
4,41,Private,264663,Some-college,10,Separated,Prof-specialty,Own-child,White,Female,0,3900,40,United-States,<=50K


# Missing Value

In [3]:
from sklearn.impute import SimpleImputer

In [4]:
df_dataset.replace('?', np.nan, inplace=True)

In [5]:
df_dataset.isna().sum()

age                  0
workclass         1836
fnlwgt               0
education            0
education.num        0
marital.status       0
occupation        1843
relationship         0
race                 0
sex                  0
capital.gain         0
capital.loss         0
hours.per.week       0
native.country     583
income               0
dtype: int64

In [6]:
imputer_cons = SimpleImputer(strategy='constant', fill_value='mv')
df_dataset[['workclass_mv', 'occupation_mv', 'native.country_mv']] = imputer_cons.fit_transform(df_dataset[['workclass', 'occupation', 'native.country']])


In [7]:
df_dataset.head()

Unnamed: 0,age,workclass,fnlwgt,education,education.num,marital.status,occupation,relationship,race,sex,capital.gain,capital.loss,hours.per.week,native.country,income,workclass_mv,occupation_mv,native.country_mv
0,90,,77053,HS-grad,9,Widowed,,Not-in-family,White,Female,0,4356,40,United-States,<=50K,mv,mv,United-States
1,82,Private,132870,HS-grad,9,Widowed,Exec-managerial,Not-in-family,White,Female,0,4356,18,United-States,<=50K,Private,Exec-managerial,United-States
2,66,,186061,Some-college,10,Widowed,,Unmarried,Black,Female,0,4356,40,United-States,<=50K,mv,mv,United-States
3,54,Private,140359,7th-8th,4,Divorced,Machine-op-inspct,Unmarried,White,Female,0,3900,40,United-States,<=50K,Private,Machine-op-inspct,United-States
4,41,Private,264663,Some-college,10,Separated,Prof-specialty,Own-child,White,Female,0,3900,40,United-States,<=50K,Private,Prof-specialty,United-States
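
As a small illustration of what the constant strategy does, the sketch below applies the same imputer settings to a made-up three-row frame (the toy values are assumptions, not rows from adult.csv). Each missing entry becomes the literal string 'mv', so missingness is kept as its own category rather than being guessed.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame (hypothetical values) with '?' marking missing entries.
toy = pd.DataFrame({
    'workclass': ['Private', '?', 'State-gov'],
    'occupation': ['Sales', '?', '?'],
})
toy = toy.replace('?', np.nan)

# Same settings as the notebook: fill every NaN with the constant 'mv'.
imputer = SimpleImputer(strategy='constant', fill_value='mv')
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

print(filled)
```

Keeping 'mv' as a separate category lets the encoders and the tree treat "unknown workclass" as a signal in its own right.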


# Cleaning Dataset

In [8]:
df_dataset['income'].value_counts()

<=50K    24720
>50K      7841
Name: income, dtype: int64

In [9]:
df_dataset['income'] = np.where(df_dataset['income'] == '>50K', 1, 0)

In [10]:
df = df_dataset.copy()
df = df.drop(columns=['fnlwgt', 'education', 'workclass', 'occupation', 'native.country'])

In [11]:
df.head()

Unnamed: 0,age,education.num,marital.status,relationship,race,sex,capital.gain,capital.loss,hours.per.week,income,workclass_mv,occupation_mv,native.country_mv
0,90,9,Widowed,Not-in-family,White,Female,0,4356,40,0,mv,mv,United-States
1,82,9,Widowed,Not-in-family,White,Female,0,4356,18,0,Private,Exec-managerial,United-States
2,66,10,Widowed,Unmarried,Black,Female,0,4356,40,0,mv,mv,United-States
3,54,4,Divorced,Unmarried,White,Female,0,3900,40,0,Private,Machine-op-inspct,United-States
4,41,10,Separated,Own-child,White,Female,0,3900,40,0,Private,Prof-specialty,United-States


# Preprocessing

In [12]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
import category_encoders as ce

In [13]:
transformer = ColumnTransformer([
    ('One Hot', OneHotEncoder(), ['relationship', 'race', 'sex']),
    ('Binary', ce.BinaryEncoder(), ['workclass_mv', 'marital.status', 'occupation_mv', 'native.country_mv']),
], remainder='passthrough')

In [14]:
transformer.fit_transform(df.drop(columns=['income']))


array([[0.000e+00, 1.000e+00, 0.000e+00, ..., 0.000e+00, 4.356e+03,
        4.000e+01],
       [0.000e+00, 1.000e+00, 0.000e+00, ..., 0.000e+00, 4.356e+03,
        1.800e+01],
       [0.000e+00, 0.000e+00, 0.000e+00, ..., 0.000e+00, 4.356e+03,
        4.000e+01],
       ...,
       [1.000e+00, 0.000e+00, 0.000e+00, ..., 0.000e+00, 0.000e+00,
        4.000e+01],
       [0.000e+00, 0.000e+00, 0.000e+00, ..., 0.000e+00, 0.000e+00,
        4.000e+01],
       [0.000e+00, 0.000e+00, 0.000e+00, ..., 0.000e+00, 0.000e+00,
        2.000e+01]])
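
To see why the higher-cardinality columns go through binary encoding rather than one-hot encoding, the hand-rolled sketch below numbers each category and writes that number in base 2. It only illustrates the idea: `binary_encode` is a hypothetical helper, the sample values are made up, and the exact column layout produced by `category_encoders` may differ.

```python
def binary_encode(values):
    """Map each distinct category to an integer (from 1), then to binary digits."""
    codes = {}
    for v in values:
        codes.setdefault(v, len(codes) + 1)  # first-seen order, codes start at 1
    width = len(codes).bit_length()          # bits needed for the largest code
    return [[int(bit) for bit in format(codes[v], f'0{width}b')] for v in values]

sample = ['Private', 'mv', 'Private', 'State-gov', 'Self-emp', 'Local-gov']
encoded = binary_encode(sample)
for value, bits in zip(sample, encoded):
    print(f'{value:10s} -> {bits}')

# Column counts for a feature with k distinct categories (k values are illustrative).
for k in (5, 9, 42):
    print(k, 'categories: one-hot =', k, 'columns, binary =', k.bit_length(), 'columns')
```

The column count grows with the logarithm of the number of categories, which keeps the feature matrix narrow at the cost of columns that are harder to interpret individually.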

# Data Splitting

In [15]:
from sklearn.model_selection import train_test_split

In [16]:
x = df.drop(columns=['income'])
y = df['income']

In [17]:
x_train, x_test, y_train, y_test = train_test_split(
    x,
    y,
    stratify=y,
    test_size=0.3,  # 70:30 split (scikit-learn's default is 0.25)
    random_state=10
)
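
The effect of `stratify=y` can be checked on a toy label vector. The sketch below uses made-up labels with roughly the same imbalance as the income column (the counts 760 and 240 are assumptions chosen for illustration) and compares the positive share in each part of the split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels: 760 negatives and 240 positives (24% positive).
y_toy = np.array([0] * 760 + [1] * 240)
x_toy = np.arange(len(y_toy)).reshape(-1, 1)  # placeholder feature

_, _, y_tr, y_te = train_test_split(
    x_toy, y_toy, stratify=y_toy, test_size=0.3, random_state=10
)

print('positive share, full :', y_toy.mean())
print('positive share, train:', y_tr.mean())
print('positive share, test :', y_te.mean())
```

With stratification the class ratio in the train and test parts stays close to the ratio in the full data, so the test accuracy is not distorted by an unlucky split.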

In [18]:
x_train_preprocessed = transformer.fit_transform(x_train)
x_test_preprocessed = transformer.transform(x_test)


In [19]:
x_train_preprocessed = pd.DataFrame(x_train_preprocessed)
x_test_preprocessed = pd.DataFrame(x_test_preprocessed)

In [20]:
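# get_feature_names_out() is the method name in recent scikit-learn and
# category_encoders releases; older releases expose get_feature_names() instead.
features = list(transformer.transformers_[0][1].get_feature_names_out()) \
    + list(transformer.transformers_[1][1].get_feature_names_out()) \
    + ['age', 'education.num', 'capital.gain', 'capital.loss', 'hours.per.week']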

In [21]:
x_train_preprocessed.columns = features
x_test_preprocessed.columns = features

# Model Fitting

In [22]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
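
The tree in this notebook is fitted with `criterion='entropy'`, which scores candidate splits by how much they reduce Shannon entropy. The sketch below is a minimal, hand-written version of that calculation, not scikit-learn's implementation; `entropy` and `information_gain` are hypothetical helpers and the split counts are made up.

```python
import math

def entropy(counts):
    """Shannon entropy (in bits) of a class-count distribution."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    total = sum(parent)
    weighted = sum(sum(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted

# Class counts reported by value_counts() earlier: <=50K and >50K.
parent = [24720, 7841]
print('entropy of the full label column:', round(entropy(parent), 4))

# Hypothetical split of a 100-row node (made-up counts, for illustration only).
node = [60, 40]
left, right = [50, 10], [10, 30]
print('information gain of the split:', round(information_gain(node, [left, right]), 4))
```

At each node the tree picks the split with the largest gain, and `max_depth=5` stops this process after five levels.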

In [23]:
tree = DecisionTreeClassifier(criterion='entropy', max_depth=5)  # criterion: {"gini", "entropy"}, default="gini"
tree.fit(x_train_preprocessed, y_train)
y_predict = tree.predict(x_test_preprocessed)

modelAccScore = accuracy_score(y_test, y_predict)
print('Model accuracy:', modelAccScore)

Model accuracy: 0.8447128672330843
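
Because the classes are imbalanced, the accuracy is easier to judge next to a trivial baseline. The sketch below computes the accuracy of always predicting the majority class from the class counts reported earlier. It uses the full-dataset counts, which should be close to the test-set ratio since the split was stratified.

```python
# Class counts reported by value_counts() earlier in the notebook.
n_low, n_high = 24720, 7841          # <=50K, >50K
model_accuracy = 0.8447128672330843  # printed by the cell above

# Always predicting the majority class (<=50K) is the trivial baseline.
baseline_accuracy = max(n_low, n_high) / (n_low + n_high)

print('majority-class baseline:', round(baseline_accuracy, 4))
print('gain over baseline     :', round(model_accuracy - baseline_accuracy, 4))
```

For an imbalanced target like this one, precision and recall on the >50K class would give a fuller picture than accuracy alone.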
