# 1994 Census Bureau database

### Goal: Given some characteristics of a person, return the probability that they earn more than \$50k per year

[Reference: Nham, Tracy. Classifying Income from 1994 Census Data](http://cseweb.ucsd.edu/classes/sp15/cse190-c/reports/sp15/024.pdf)

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('Adult_Census_Income.csv')

#### A) What do we have?

In [4]:
df.head()

| | age | workclass | fnlwgt | education | education.num | marital.status | occupation | relationship | race | sex | capital.gain | capital.loss | hours.per.week | native.country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 90 | ? | 77053 | HS-grad | 9 | Widowed | ? | Not-in-family | White | Female | 0 | 4356 | 40 | United-States | <=50K |
| 1 | 82 | Private | 132870 | HS-grad | 9 | Widowed | Exec-managerial | Not-in-family | White | Female | 0 | 4356 | 18 | United-States | <=50K |
| 2 | 66 | ? | 186061 | Some-college | 10 | Widowed | ? | Unmarried | Black | Female | 0 | 4356 | 40 | United-States | <=50K |
| 3 | 54 | Private | 140359 | 7th-8th | 4 | Divorced | Machine-op-inspct | Unmarried | White | Female | 0 | 3900 | 40 | United-States | <=50K |
| 4 | 41 | Private | 264663 | Some-college | 10 | Separated | Prof-specialty | Own-child | White | Female | 0 | 3900 | 40 | United-States | <=50K |


#### B) Selecting features

In [6]:
df.dtypes

age                int64
workclass         object
fnlwgt             int64
education         object
education.num      int64
marital.status    object
occupation        object
relationship      object
race              object
sex               object
capital.gain       int64
capital.loss       int64
hours.per.week     int64
native.country    object
income            object
dtype: object

- 'fnlwgt': the census sampling weight, not useful for this work.
- 'education.num': redundant, since the feature 'education' already carries the same information.
- 'income': our target.
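
The claim that 'education' makes 'education.num' redundant can be checked by testing whether the two columns map one-to-one. The sketch below is a minimal illustration on a toy frame built from the five rows shown in the `df.head()` output; the variable names are illustrative, and on the real data the same check would be run against `df`.

```python
import pandas as pd

# Toy rows copied from the df.head() output (illustration only, not the full dataset).
toy = pd.DataFrame({
    "education": ["HS-grad", "HS-grad", "Some-college", "7th-8th", "Some-college"],
    "education.num": [9, 9, 10, 4, 10],
})

# Count distinct codes per label and distinct labels per code.
codes_per_label = toy.groupby("education")["education.num"].nunique()
labels_per_code = toy.groupby("education.num")["education"].nunique()

# The mapping is one-to-one when every label has one code and every code has one label.
is_one_to_one = bool((codes_per_label == 1).all() and (labels_per_code == 1).all())
print(is_one_to_one)
```

If the check holds on the full dataset, dropping either column loses no information.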

In [25]:
target = 'income'
black_cols = ['fnlwgt','education.num','income']
features = df.columns.tolist()
for col in black_cols:
    features.pop(features.index(col))
print(features)

['age', 'workclass', 'education', 'marital.status', 'occupation', 'relationship', 'race', 'sex', 'capital.gain', 'capital.loss', 'hours.per.week', 'native.country']


#### C) Filtering and staging values 

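- Take a look at the unique values of each feature. *Numeric features are not listed below: joining their values as strings fails, so they are collected in `cols_not_uniques` instead.*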

In [18]:
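cols_not_uniques = []
for col in features:
    try:
        uniqs = df[col].unique()
        # str.join raises TypeError when the values are not strings (numeric columns)
        print('Feature: \'{}\', total uniques {}:\n{}\n\n'.format(col,len(uniqs),"\n".join(uniqs)))
    except TypeError:
        # Numeric column: set it aside and summarize it with describe() later
        cols_not_uniques.append(col)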
        

Feature: 'workclass', total uniques 9:
?
Private
State-gov
Federal-gov
Self-emp-not-inc
Self-emp-inc
Local-gov
Without-pay
Never-worked


Feature: 'education', total uniques 16:
HS-grad
Some-college
7th-8th
10th
Doctorate
Prof-school
Bachelors
Masters
11th
Assoc-acdm
Assoc-voc
1st-4th
5th-6th
12th
9th
Preschool


Feature: 'marital.status', total uniques 7:
Widowed
Divorced
Separated
Never-married
Married-civ-spouse
Married-spouse-absent
Married-AF-spouse


Feature: 'occupation', total uniques 15:
?
Exec-managerial
Machine-op-inspct
Prof-specialty
Other-service
Adm-clerical
Craft-repair
Transport-moving
Handlers-cleaners
Sales
Farming-fishing
Tech-support
Protective-serv
Armed-Forces
Priv-house-serv


Feature: 'relationship', total uniques 6:
Not-in-family
Unmarried
Own-child
Other-relative
Husband
Wife


Feature: 'race', total uniques 5:
White
Black
Asian-Pac-Islander
Other
Amer-Indian-Eskimo


Feature: 'sex', total uniques 2:
Female
Male


Feature: 'native.country', total uniques 42:


In [19]:
black_vals = ['?']
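
An alternative to keeping a list of excluded values is to let pandas treat `'?'` as missing at load time. This is not what the notebook does; it is a sketch that assumes an inline CSV (the hypothetical `csv_text`) in place of 'Adult_Census_Income.csv', with three rows taken from the `df.head()` output.

```python
import io
import pandas as pd

# Hypothetical inline CSV standing in for 'Adult_Census_Income.csv'.
csv_text = """age,workclass,occupation
90,?,?
82,Private,Exec-managerial
66,?,?
"""

# na_values='?' makes pandas parse the placeholder as NaN.
toy = pd.read_csv(io.StringIO(csv_text), na_values="?")

# Number of missing entries in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

With this approach, `unique()` would return NaN instead of `'?'`, so the later filtering step would need `dropna()` rather than a comparison against `black_vals`.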

In [20]:
cols_not_uniques

['age', 'capital.gain', 'capital.loss', 'hours.per.week']
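
The numeric columns found above through the failed string join can also be selected directly by dtype. The following is a minimal sketch on a toy frame (values taken from `df.head()`); `numeric_cols` and `categorical_cols` are illustrative names that do not appear in the notebook.

```python
import pandas as pd

# Toy frame with two numeric and two string columns.
toy = pd.DataFrame({
    "age": [90, 82, 66],
    "workclass": ["?", "Private", "?"],
    "hours.per.week": [40, 18, 40],
    "sex": ["Female", "Female", "Female"],
})

# select_dtypes keeps the original column order.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()
print(numeric_cols)
print(categorical_cols)
```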

In [21]:
df[cols_not_uniques].describe()

| | age | capital.gain | capital.loss | hours.per.week |
|---|---|---|---|---|
| count | 32561.0 | 32561.0 | 32561.0 | 32561.0 |
| mean | 38.581647 | 1077.648844 | 87.30383 | 40.437456 |
| std | 13.640433 | 7385.292085 | 402.960219 | 12.347429 |
| min | 17.0 | 0.0 | 0.0 | 1.0 |
| 25% | 28.0 | 0.0 | 0.0 | 40.0 |
| 50% | 37.0 | 0.0 | 0.0 | 40.0 |
| 75% | 48.0 | 0.0 | 0.0 | 45.0 |
| max | 90.0 | 99999.0 | 4356.0 | 99.0 |


- The model is meant to be usable by a person at any time. Asking them to know and report their capital gain or loss for the year may lead to inconsistent data entry, so both columns are dropped.

In [28]:
black_cols.extend(['capital.gain','capital.loss'])
features_num = ['age','hours.per.week']

In [27]:
features = df.columns.tolist()
for col in black_cols:
    features.pop(features.index(col))
print(features)

['age', 'workclass', 'education', 'marital.status', 'occupation', 'relationship', 'race', 'sex', 'hours.per.week', 'native.country']


In [29]:
# Build a new list instead of aliasing and mutating `features`
features_cat = [col for col in features if col not in features_num]

In [31]:
print("Categorical features: {}\n\nNumerical features: {}".format(features_cat,features_num))

Categorical features: ['workclass', 'education', 'marital.status', 'occupation', 'relationship', 'race', 'sex', 'native.country']

Numerical features: ['age', 'hours.per.week']


#### D) Creating the model

- Build a dictionary that maps each categorical feature to its list of values, excluding the values in `black_vals`.

In [33]:
col = 'race'
df[col].unique().tolist()


['White', 'Black', 'Asian-Pac-Islander', 'Other', 'Amer-Indian-Eskimo']

In [35]:
features_cat_dict = {}
for col in features_cat:
    tmp = df[col].unique().tolist()
    for val in black_vals:
        if val in tmp:
            tmp.pop(tmp.index(val))
    features_cat_dict[col] = tmp
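
One plausible use of such a dictionary, assumed here rather than stated in the notebook, is to validate a user's input before querying the data. The sketch below uses a small hand-written excerpt of `features_cat_dict` (values from the unique-value listing above) and a hypothetical helper `invalid_fields`.

```python
# Hand-written excerpt of features_cat_dict, using values printed earlier.
features_cat_dict = {
    "sex": ["Female", "Male"],
    "race": ["White", "Black", "Asian-Pac-Islander", "Other", "Amer-Indian-Eskimo"],
}

def invalid_fields(query, allowed):
    """Return the keys of `query` whose value is not an allowed category."""
    return [col for col, val in query.items()
            if col in allowed and val not in allowed[col]]

# 'Unknown' is not a listed race, so only that field is reported.
bad = invalid_fields({"sex": "Male", "race": "Unknown"}, features_cat_dict)
print(bad)
```

Columns that are absent from the dictionary (for example numeric ones) are ignored by this helper rather than rejected.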

In [36]:
features_cat_dict

{'workclass': ['Private',
  'State-gov',
  'Federal-gov',
  'Self-emp-not-inc',
  'Self-emp-inc',
  'Local-gov',
  'Without-pay',
  'Never-worked'],
 'education': ['HS-grad',
  'Some-college',
  '7th-8th',
  '10th',
  'Doctorate',
  'Prof-school',
  'Bachelors',
  'Masters',
  '11th',
  'Assoc-acdm',
  'Assoc-voc',
  '1st-4th',
  '5th-6th',
  '12th',
  '9th',
  'Preschool'],
 'marital.status': ['Widowed',
  'Divorced',
  'Separated',
  'Never-married',
  'Married-civ-spouse',
  'Married-spouse-absent',
  'Married-AF-spouse'],
 'occupation': ['Exec-managerial',
  'Machine-op-inspct',
  'Prof-specialty',
  'Other-service',
  'Adm-clerical',
  'Craft-repair',
  'Transport-moving',
  'Handlers-cleaners',
  'Sales',
  'Farming-fishing',
  'Tech-support',
  'Protective-serv',
  'Armed-Forces',
  'Priv-house-serv'],
 'relationship': ['Not-in-family',
  'Unmarried',
  'Own-child',
  'Other-relative',
  'Husband',
  'Wife'],
 'race': ['White',
  'Black',
  'Asian-Pac-Islander',
  'Other',
  'Am

- Randomly sample one row from the dataset, keeping only a few of its columns:

In [41]:
some_cats = ['education','age','marital.status','sex']

df_tmp = df[some_cats].sample()
df_tmp

| | education | age | marital.status | sex |
|---|---|---|---|---|
| 69 | Prof-school | 42 | Married-civ-spouse | Male |


In [48]:
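# Start with the condition on the first sampled column
cond = (df[some_cats[0]] == df_tmp[some_cats[0]].values.tolist()[0])
# AND in the remaining columns so that every sampled value must match
for col in some_cats[1:]:
    cond = (cond & (df[col] == df_tmp[col].values.tolist()[0]))
df_pop = df.loc[cond]
df_pop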

| | age | workclass | fnlwgt | education | education.num | marital.status | occupation | relationship | race | sex | capital.gain | capital.loss | hours.per.week | native.country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 69 | 42 | Self-emp-inc | 187702 | Prof-school | 15 | Married-civ-spouse | Prof-specialty | Husband | White | Male | 0 | 2415 | 60 | United-States | >50K |
| 378 | 42 | Self-emp-inc | 123838 | Prof-school | 15 | Married-civ-spouse | Prof-specialty | Husband | White | Male | 0 | 1977 | 50 | United-States | >50K |
| 537 | 42 | Self-emp-not-inc | 214242 | Prof-school | 15 | Married-civ-spouse | Prof-specialty | Husband | White | Male | 0 | 1902 | 50 | United-States | >50K |
| 562 | 42 | Private | 378384 | Prof-school | 15 | Married-civ-spouse | Prof-specialty | Husband | White | Male | 0 | 1902 | 60 | United-States | >50K |
| 845 | 42 | Self-emp-not-inc | 185129 | Prof-school | 15 | Married-civ-spouse | Prof-specialty | Husband | White | Male | 0 | 1887 | 40 | United-States | >50K |
| 1603 | 42 | Self-emp-not-inc | 201908 | Prof-school | 15 | Married-civ-spouse | Prof-specialty | Husband | White | Male | 99999 | 0 | 50 | United-States | >50K |
| 1956 | 42 | Private | 252518 | Prof-school | 15 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 15024 | 0 | 40 | United-States | >50K |
| 2031 | 42 | Self-emp-inc | 277488 | Prof-school | 15 | Married-civ-spouse | Prof-specialty | Husband | White | Male | 15024 | 0 | 65 | United-States | >50K |
| 2779 | 42 | Private | 87284 | Prof-school | 15 | Married-civ-spouse | Prof-specialty | Husband | White | Male | 7298 | 0 | 35 | United-States | >50K |
| 5005 | 42 | Self-emp-not-inc | 177307 | Prof-school | 15 | Married-civ-spouse | Farming-fishing | Husband | White | Male | 0 | 0 | 65 | United-States | >50K |
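
The column-by-column filter used above can also be written as a single vectorized comparison. This is a sketch under assumptions: the toy frame is illustrative (rows 0 and 1 mirror the sampled profile, rows 2 and 3 are invented so that some rows do not match), and the reference row is picked by label instead of by `sample()` so the result is reproducible.

```python
import pandas as pd

some_cats = ["education", "age", "marital.status", "sex"]

# Illustrative data: rows 2 and 3 are invented non-matching rows.
toy = pd.DataFrame({
    "education": ["Prof-school", "Prof-school", "HS-grad", "Prof-school"],
    "age": [42, 42, 42, 42],
    "marital.status": ["Married-civ-spouse", "Married-civ-spouse",
                       "Married-civ-spouse", "Divorced"],
    "sex": ["Male", "Male", "Male", "Male"],
    "income": [">50K", ">50K", "<=50K", "<=50K"],
})

# Reference profile: the selected columns of row 0, as a Series indexed by column name.
row = toy.loc[0, some_cats]

# Compare every selected column with the profile, then keep rows where all columns match.
mask = (toy[some_cats] == row).all(axis=1)
matches = toy.loc[mask]
print(matches)
```

The comparison aligns the Series index with the frame's columns, so the profile must contain exactly the columns being compared.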


In [60]:
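target_val = '>50K'
# 1 where a matching row has the target income, 0 otherwise
tg_vals = ((df_pop[target] == target_val)*1).values.tolist()
N = len(df)              # rows in the entire dataset
n = len(tg_vals)         # rows matching the sampled profile
p = 100*sum(tg_vals)/n   # percentage of matching rows with the target income
s = 100*n/N              # percentage of the dataset covered by the profile
print(df_tmp)
print("P(target) = {:.2f}%, with {:,}/{:,} ({:.4f}%) of the entire dataset.".format(p,n,N,s))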

      education  age      marital.status   sex
69  Prof-school   42  Married-civ-spouse  Male
P(target) = 93.75%, with 16/32,561 (0.0491%) of the entire dataset.
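
The steps above (filter the rows matching a profile, then take the share with the target income) can be wrapped in one reusable function. This is a minimal sketch, not the notebook's own code: `estimate_probability`, its parameters, and the toy frame are assumptions made for illustration.

```python
import pandas as pd

def estimate_probability(df, query, target="income", target_val=">50K"):
    """Return (p, n): share of matching rows with the target value, and the match count."""
    mask = pd.Series(True, index=df.index)
    for col, val in query.items():
        mask &= df[col] == val  # every queried column must match
    n = int(mask.sum())
    if n == 0:
        return None, 0  # no matching rows, so no estimate is possible
    p = float((df.loc[mask, target] == target_val).mean())
    return p, n

# Invented toy data for illustration only.
toy = pd.DataFrame({
    "education": ["Prof-school", "Prof-school", "Prof-school", "HS-grad"],
    "sex": ["Male", "Male", "Male", "Male"],
    "income": [">50K", ">50K", "<=50K", "<=50K"],
})

p, n = estimate_probability(toy, {"education": "Prof-school", "sex": "Male"})
print("P(target) = {:.2f}% from {} matching rows".format(100 * p, n))
```

Returning `n` alongside `p` matters because an estimate based on very few matching rows (16 of 32,561 in the run above) is much less reliable than one based on many.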


# Done!!