## Introduction to Decision Trees

### Explore the dataset
We'll look at individual income in the United States. The data comes from the 1994 census and includes each individual's marital status, age, type of work, and more. The target column, `high_income`, is what we want to predict: whether an individual makes at most 50k a year or more than 50k a year.

In [1]:
import pandas as pd
import numpy as np
import math

In [3]:
import pandas

# index_col=False stops pandas from treating the first column (age) as the row index
income = pandas.read_csv("income.csv", index_col=False)
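
One situation where `index_col=False` matters, described in the pandas documentation, is a file whose data rows end with a trailing delimiter. Whether `income.csv` has this quirk is not shown here, so the sketch below uses a small made-up CSV string rather than the real file.

```python
import io
import pandas as pd

# Hypothetical CSV text: each data row ends with an extra trailing comma,
# so the rows have one more field than the header.
raw = "a,b,c\n4,apple,bat,\n8,orange,cow,"

# Default behaviour: pandas uses the first field of each row as the index.
default = pd.read_csv(io.StringIO(raw))

# index_col=False: the first field stays a regular column and the
# row index is the usual 0, 1, 2, ...
fixed = pd.read_csv(io.StringIO(raw), index_col=False)

print(default)
print(fixed)
```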

In [4]:
print(income.head(5))

   age          workclass  fnlwgt   education  education_num  \
0   39          State-gov   77516   Bachelors             13   
1   50   Self-emp-not-inc   83311   Bachelors             13   
2   38            Private  215646     HS-grad              9   
3   53            Private  234721        11th              7   
4   28            Private  338409   Bachelors             13   

        marital_status          occupation    relationship    race      sex  \
0        Never-married        Adm-clerical   Not-in-family   White     Male   
1   Married-civ-spouse     Exec-managerial         Husband   White     Male   
2             Divorced   Handlers-cleaners   Not-in-family   White     Male   
3   Married-civ-spouse   Handlers-cleaners         Husband   Black     Male   
4   Married-civ-spouse      Prof-specialty            Wife   Black   Female   

   capital_gain  capital_loss  hours_per_week  native_country high_income  
0          2174             0              40   United-States   

In [5]:
income.shape

(32561, 15)

In [6]:
income.columns

Index(['age', 'workclass', 'fnlwgt', 'education', 'education_num',
       'marital_status', 'occupation', 'relationship', 'race', 'sex',
       'capital_gain', 'capital_loss', 'hours_per_week', 'native_country',
       'high_income'],
      dtype='object')

### Converting the categorical variables to numeric variables

In [8]:
categorical_cols = ['workclass', 'education', 'marital_status', 'occupation', 'relationship', 
                    'race', 'sex', 'native_country', 'high_income']
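
As an alternative to typing the list by hand, the text columns could be picked out by dtype. This is a minimal sketch on a made-up three-column frame (`toy` and its values are illustrative, not the census data); it assumes every non-numeric column should be treated as categorical.

```python
import pandas as pd

# Hypothetical miniature of the income table.
toy = pd.DataFrame({
    "age": [39, 50],
    "workclass": ["State-gov", "Private"],
    "sex": ["Male", "Female"],
})

# Keep every column that is not numeric.
text_cols = toy.select_dtypes(exclude="number").columns.tolist()

# Replace each text column with its integer category codes.
for col in text_cols:
    toy[col] = pd.Categorical(toy[col]).codes

print(text_cols)
print(toy)
```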


In [9]:
# Convert a single column from text categories to integer codes
col = pd.Categorical(income["workclass"])
income["workclass"] = col.codes
print(income["workclass"].head(5))

0    7
1    6
2    4
3    4
4    4
Name: workclass, dtype: int8


In [10]:
for cat in categorical_cols:
    income[cat] = pd.Categorical(income[cat]).codes

In [11]:
print(income['sex'].head())

0    1
1    1
2    1
3    1
4    0
Name: sex, dtype: int8
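
Once a column is replaced by its codes, the original labels are gone, so it can help to keep a lookup from code to label. The sketch below uses a short made-up series in place of `income["workclass"]`; `code_to_label` is a name introduced here for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the text version of a categorical column.
workclass = pd.Series(
    ["State-gov", "Self-emp-not-inc", "Private", "Private", "Local-gov"]
)

cat = pd.Categorical(workclass)

# Codes are positions in cat.categories, which is sorted for string labels.
codes = cat.codes

# Lookup table from integer code back to the original label.
code_to_label = dict(enumerate(cat.categories))

print(list(codes))
print(code_to_label)
```

Missing values (NaN) receive the code -1, so a check for negative codes is a quick way to spot them after conversion.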


### Creating splits in the data

In [13]:
# Code 4 is the "Private" workclass (rows 2-4 in the output above)
income_filter = income['workclass'] == 4
private_incomes = income[income_filter]
private_incomes.head()

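| | age | workclass | fnlwgt | education | education_num | marital_status | occupation | relationship | race | sex | capital_gain | capital_loss | hours_per_week | native_country | high_income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 38 | 4 | 215646 | 11 | 9 | 0 | 6 | 1 | 4 | 1 | 0 | 0 | 40 | 39 | 0 |
| 3 | 53 | 4 | 234721 | 1 | 7 | 2 | 6 | 0 | 2 | 1 | 0 | 0 | 40 | 39 | 0 |
| 4 | 28 | 4 | 338409 | 9 | 13 | 2 | 10 | 5 | 2 | 0 | 0 | 0 | 40 | 5 | 0 |
| 5 | 37 | 4 | 284582 | 12 | 14 | 2 | 4 | 5 | 4 | 0 | 0 | 0 | 40 | 39 | 0 |
| 6 | 49 | 4 | 160187 | 6 | 5 | 3 | 8 | 1 | 2 | 0 | 0 | 0 | 16 | 23 | 0 |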


In [15]:
private_incomes.shape

(22696, 15)

In [14]:
# Every row whose workclass is not Private (this also includes self-employed, etc.)
public_incomes = income[~income_filter]
public_incomes.head()

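| | age | workclass | fnlwgt | education | education_num | marital_status | occupation | relationship | race | sex | capital_gain | capital_loss | hours_per_week | native_country | high_income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | 7 | 77516 | 9 | 13 | 4 | 1 | 1 | 4 | 1 | 2174 | 0 | 40 | 39 | 0 |
| 1 | 50 | 6 | 83311 | 9 | 13 | 2 | 4 | 0 | 4 | 1 | 0 | 0 | 13 | 39 | 0 |
| 7 | 52 | 6 | 209642 | 11 | 9 | 2 | 4 | 0 | 4 | 1 | 0 | 0 | 45 | 39 | 1 |
| 11 | 30 | 7 | 141297 | 9 | 13 | 2 | 10 | 0 | 1 | 1 | 0 | 0 | 40 | 19 | 1 |
| 16 | 25 | 6 | 176756 | 11 | 9 | 4 | 5 | 3 | 4 | 1 | 0 | 0 | 35 | 39 | 0 |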


In [16]:
public_incomes.shape

(9865, 15)
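
The two shapes add up to the full 32561 rows (22696 + 9865), which is what a split on a single boolean mask should give. The sketch below repeats that check on a small made-up frame and, as one possible next step, compares the share of high earners on each side of the split. The frame `toy` and its values are invented for illustration.

```python
import pandas as pd

# Hypothetical miniature of the encoded income table.
toy = pd.DataFrame({
    "workclass":   [7, 6, 4, 4, 4, 6],
    "high_income": [0, 1, 0, 0, 1, 1],
})

is_private = toy["workclass"] == 4
private = toy[is_private]
other = toy[~is_private]  # ~ negates the boolean mask

# The two halves should cover every row exactly once.
assert len(private) + len(other) == len(toy)

# Fraction of rows with high_income == 1 on each side of the split.
private_rate = private["high_income"].mean()
other_rate = other["high_income"].mean()

print(private_rate, other_rate)
```

A split is more useful for prediction the more these two fractions differ from each other.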