# Adult Dataset
In this notebook we explore the Adult dataset provided by the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php)
and make some predictions on it, so you can test whether your estimated income is **over** 50K$ per year or not (**below** or **exactly** 50K$).

## Importing the libraries

In [190]:
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier

## Importing the data
Here we import the data from the dataset you can find [here](https://archive.ics.uci.edu/ml/datasets/Census+Income).

In [191]:
columns = ["Age", "Workclass", "fnlwgt", "Education", "Education-Num", "Martial Status",
           "Occupation", "Relationship", "Race", "Sex", "Capital Gain", "Capital Loss",
           "Hours per week", "Country", "Income"]

# ' ?' marks a missing value in the raw files; treat only that marker as NaN
train_data_in = pd.read_csv('Adult_Dataset/adult.data.txt', names=columns, engine='python', na_values=' ?', keep_default_na=False)
test_data_in = pd.read_csv('Adult_Dataset/adult.test.txt', names=columns, engine='python', na_values=' ?', keep_default_na=False)
# Drop every row that contains at least one missing value
train_data_in.dropna(inplace=True)
test_data_in.dropna(inplace=True)
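
To illustrate what `na_values=' ?'` followed by `dropna` does, here is a minimal sketch on two made-up rows. The shortened column list and the names `sample`, `demo_columns`, `demo` and `demo_clean` are assumptions for this example only and are not part of the notebook.

```python
import io

import pandas as pd

# Two made-up rows in the "comma + space" layout of adult.data;
# the second row has an unknown workclass, written as ' ?'.
sample = io.StringIO(
    "39, State-gov, 77516\n"
    "54, ?, 180211\n"
)
demo_columns = ["Age", "Workclass", "fnlwgt"]  # shortened, hypothetical column list

demo = pd.read_csv(sample, names=demo_columns, engine="python",
                   na_values=" ?", keep_default_na=False)

n_missing = int(demo["Workclass"].isna().sum())  # rows where ' ?' became NaN
demo_clean = demo.dropna()                       # keep only complete rows

print(n_missing, len(demo_clean))
```

Note that the marker includes the leading space, because the raw files separate fields with a comma followed by a space.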

Now we copy the training data into the train_x dataframe and delete the label column (Income) from it.

In [192]:
train_x = pd.DataFrame(train_data_in,columns=columns)
del train_x['Income']

Then we create a new dataframe for the labels for the training data.

In [193]:
train_y = pd.DataFrame(train_data_in["Income"],columns=['Income'])
train_y.tail()

       Income
32556   <=50K
32557    >50K
32558   <=50K
32559   <=50K
32560    >50K


Now we do the same for the testing data.

In [194]:
test_x = pd.DataFrame(test_data_in,columns=columns)
del test_x['Income']
test_y = pd.DataFrame(test_data_in['Income'],columns=['Income'])
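# Labels in the test file end with a period (e.g. ' >50K.'); drop it so they match the training labels
test_y = test_y['Income'].astype(str).str[:-1]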
test_y.head()

0     <=50K
1     <=50K
2      >50K
3      >50K
5     <=50K
Name: Income, dtype: object
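
The slice `str[:-1]` removes the last character of every label, whatever it is. As a slightly more defensive variant (a sketch, not what the notebook does), `str.rstrip('.')` removes only trailing periods. The labels in `raw_labels` below are made up for illustration.

```python
import pandas as pd

# Made-up labels: two in the test-file style (trailing period), one without it.
raw_labels = pd.Series([" <=50K.", " >50K.", " >50K"], name="Income")

sliced = raw_labels.str[:-1]           # always drops the last character
stripped = raw_labels.str.rstrip(".")  # drops trailing periods only

print(sliced.tolist())
print(stripped.tolist())
```

The two approaches agree whenever every label really ends with a period; they differ only for a label that lacks one, which the slice would truncate.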

## Encoding the features
Here we encode the features as numbers, because scikit-learn's classifiers expect numeric input (integers or floats) and cannot work with strings directly.

In [195]:
workclass = [' Private', ' Self-emp-not-inc', ' Self-emp-inc', ' Federal-gov', ' Local-gov', ' State-gov', ' Without-pay', ' Never-worked']
education = [' Bachelors', ' Some-college', ' 11th', ' HS-grad', ' Prof-school', ' Assoc-acdm', ' Assoc-voc', ' 9th', ' 7th-8th', ' 12th', ' Masters', ' 1st-4th', ' 10th', ' Doctorate', ' 5th-6th', ' Preschool']
marital_status = [' Married-civ-spouse', ' Divorced', ' Never-married', ' Separated', ' Widowed', ' Married-spouse-absent', ' Married-AF-spouse']
occupation = [' Tech-support', ' Craft-repair', ' Other-service', ' Sales', ' Exec-managerial', ' Prof-specialty', ' Handlers-cleaners', ' Machine-op-inspct', ' Adm-clerical', ' Farming-fishing', ' Transport-moving', ' Priv-house-serv', ' Protective-serv', ' Armed-Forces']
relationship = [' Wife', ' Own-child', ' Husband', ' Not-in-family', ' Other-relative', ' Unmarried']
race = [' White', ' Asian-Pac-Islander', ' Amer-Indian-Eskimo', ' Other', ' Black']
sex = [' Female',' Male']
native_country = [' United-States', ' Cambodia', ' England', ' Puerto-Rico', ' Canada', ' Germany', ' Outlying-US(Guam-USVI-etc)', ' India', ' Japan', ' Greece', ' South', ' China', ' Cuba', ' Iran', ' Honduras', ' Philippines', ' Italy', ' Poland', ' Jamaica', ' Vietnam', ' Mexico', ' Portugal', ' Ireland', ' France', ' Dominican-Republic', ' Laos', ' Ecuador', ' Taiwan', ' Haiti', ' Columbia', ' Hungary', ' Guatemala', ' Nicaragua', ' Scotland', ' Thailand', ' Yugoslavia', ' El-Salvador', ' Trinadad&Tobago', ' Peru', ' Hong', ' Holand-Netherlands']

def number_encode_features(dtf):
    in_age = []
    in_workclass = []
    in_fnlwgt = []
    in_education = []
    in_education_num = []
    in_marital_status = []
    in_occupation = []
    in_relationship = []
    in_race = []
    in_sex = []
    in_capital_gain = []
    in_capital_loss = []
    in_hours_per_week = []
    in_native_country = []
    for e in columns[:-1]:
        if e == 'Age':
            for i in dtf[e]:
                in_age.append(int(i))
        if e == 'Hours per week':
            for i in dtf[e]:
                in_hours_per_week.append(int(i))
        if e == 'fnlwgt':
            for i in dtf[e]:
                in_fnlwgt.append(int(i))
        if e == 'Education-Num':
            for i in dtf[e]:
                in_education_num.append(int(i))
        if e == 'Capital Gain':
            for i in dtf[e]:
                in_capital_gain.append(int(i))
        if e == 'Capital Loss':
            for i in dtf[e]:
                in_capital_loss.append(int(i))
        if e == 'Workclass':
            for i in dtf[e]:
                in_workclass.append(workclass.index(i))
        if e == 'Education':
            for i in dtf[e]:
                in_education.append(education.index(i))
        if e == 'Martial Status':
            for i in dtf[e]:
                in_marital_status.append(marital_status.index(i))
        if e == 'Occupation':
            for i in dtf[e]:
                in_occupation.append(occupation.index(i))
        if e == 'Relationship':
            for i in dtf[e]:
                in_relationship.append(relationship.index(i))
        if e == 'Race':
            for i in dtf[e]:
                in_race.append(race.index(i))
        if e == 'Sex':
            for i in dtf[e]:
                in_sex.append(sex.index(i))
        if e == 'Country':
            for i in dtf[e]:
                in_native_country.append(native_country.index(i))
    df = pd.DataFrame({columns[0]:in_age,
                       columns[1]:in_workclass,
                       columns[2]:in_fnlwgt,
                       columns[3]:in_education,
                       columns[4]:in_education_num,
                       columns[5]:in_marital_status,
                       columns[6]:in_occupation,
                       columns[7]:in_relationship,
                       columns[8]:in_race,
                       columns[9]:in_sex,
                       columns[10]:in_capital_gain,
                       columns[11]:in_capital_loss,
                       columns[12]:in_hours_per_week,
                       columns[13]:in_native_country})
    return df

train_x = number_encode_features(train_x)
test_x = number_encode_features(test_x)
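
The same position-in-list encoding can be written more compactly with a dictionary lookup and `Series.map`. The following is a minimal sketch under assumptions: `encode_with_lookup` and `categories` are hypothetical names, and only two shortened category lists are shown instead of the notebook's full lists.

```python
import pandas as pd

# Hypothetical, shortened category lists; the full lists above would be used in practice.
categories = {
    "Sex": [" Female", " Male"],
    "Race": [" White", " Asian-Pac-Islander", " Amer-Indian-Eskimo", " Other", " Black"],
}

def encode_with_lookup(dtf, categories):
    """Return a copy of dtf where each listed column holds list positions."""
    encoded = dtf.copy()
    for column, values in categories.items():
        lookup = {value: position for position, value in enumerate(values)}
        codes = encoded[column].map(lookup)
        if codes.isna().any():
            # Mirror list.index(), which raises for values it does not know.
            unknown = encoded.loc[codes.isna(), column].unique().tolist()
            raise ValueError(f"Unknown values in {column!r}: {unknown}")
        encoded[column] = codes.astype(int)
    return encoded

demo = pd.DataFrame({"Age": [39, 50],
                     "Sex": [" Male", " Female"],
                     "Race": [" White", " Black"]})
demo_encoded = encode_with_lookup(demo, categories)
print(demo_encoded)
```

Numeric columns such as `Age` pass through unchanged, and an unexpected category still fails loudly instead of being silently encoded.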

## Defining the classifier
Now it's time to define the classifier. For this dataset we use the DecisionTreeClassifier with the following settings:

In [196]:
clf = DecisionTreeClassifier(criterion='entropy',min_samples_split=40,min_samples_leaf=10)
clf.fit(train_x,train_y)

DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=10, min_samples_split=40,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

In [197]:
accuracy = clf.score(test_x,test_y)
print("Accuracy: " + str(accuracy))

Accuracy: 0.841633466135
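
For a classifier, `score` returns the fraction of samples whose predicted label equals the true label. This is why the test labels need exactly the same spelling as the training labels. A minimal sketch on a tiny made-up dataset (the names `X`, `y` and `toy_clf` are illustrative and not part of the notebook):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Tiny made-up dataset: one numeric feature, two string labels.
X = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([" <=50K", " <=50K", " <=50K", " >50K", " >50K", " >50K"])

toy_clf = DecisionTreeClassifier(criterion="entropy", random_state=0)
toy_clf.fit(X, y)

predicted = toy_clf.predict(X)
manual_accuracy = float(np.mean(predicted == y))  # share of exactly matching labels
score_accuracy = toy_clf.score(X, y)              # the same quantity via scikit-learn

print(manual_accuracy, score_accuracy)
```

Here the tree is scored on its own training data, so the value only demonstrates the calculation and says nothing about generalization.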


## Predictions on the data
Here we can predict our own estimated income. We write our own data to a .txt file called 'self_test.txt', using the same column layout as the dataset files. You can of course replace it with your own data.

In [198]:
prediction_attr = pd.read_csv('Adult_Dataset/self_test.txt',names=columns,engine='python')
prediction_attr = number_encode_features(prediction_attr)
prediction = clf.predict(prediction_attr)

if prediction[0] == " >50K":  # predict returns an array; look at the first (and only) row
    print("Your estimated income is above 50K$/yr!")
else:
    print("Your estimated income is below or exactly 50K$/yr!")

Your estimated income is above 50K$/yr!
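
`predict` returns one label per input row, so a file with several rows yields several predictions. The sketch below shows one way to turn each label into a message. It uses a toy classifier on made-up data, and `describe`, `new_rows` and `messages` are hypothetical names introduced for this example.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy classifier on a single made-up feature, standing in for clf.
X = np.array([[20], [25], [45], [50]])
y = np.array([" <=50K", " <=50K", " >50K", " >50K"])
toy_clf = DecisionTreeClassifier(random_state=0).fit(X, y)

def describe(label):
    """Map one predicted label to a readable message."""
    if label == " >50K":
        return "Your estimated income is above 50K$/yr!"
    return "Your estimated income is below or exactly 50K$/yr!"

new_rows = np.array([[22], [48]])  # two made-up people
messages = [describe(label) for label in toy_clf.predict(new_rows)]

for message in messages:
    print(message)
```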
